Tokenizer

  • The tokenizer expands contractions (e.g., "I'm" becomes "I am") to recover the essential grammatical features of the pronoun and the verb.
  • Space-delimited languages
  • Whitespace-delimited tokens are not always valid tokens
  • Chinese and Thai
    • Words are written in succession with no indication of word boundaries
  • Word Structure
  • Punctuation
  • Subword tokenization
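
The points above on contractions, whitespace, and punctuation can be illustrated with a minimal sketch. It is only one possible approach, not a reference implementation: the `CONTRACTIONS` table, the `TOKEN_RE` pattern, and the `tokenize` function are hypothetical names introduced here for illustration, and the table covers just three English contractions.

```python
import re

# Hypothetical, deliberately tiny contraction table (an assumption for
# illustration); a practical tokenizer would need far more entries.
CONTRACTIONS = {
    "i'm": ["I", "am"],
    "don't": ["do", "not"],
    "we're": ["we", "are"],
}

# Either a run of word characters with an optional internal apostrophe
# part (e.g. "don't"), or a single non-space, non-word symbol (punctuation).
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize(text):
    """Split on whitespace, separate punctuation, and expand known contractions."""
    tokens = []
    for chunk in text.split():                # step 1: whitespace-delimited chunks
        for piece in TOKEN_RE.findall(chunk): # step 2: peel punctuation off each chunk
            expanded = CONTRACTIONS.get(piece.lower())
            # step 3: replace a known contraction with its expanded words
            tokens.extend(expanded if expanded else [piece])
    return tokens


print(tokenize("I'm here, but we don't know."))
print(tokenize("我喜欢猫"))
```

For the English sentence, the chunk "here," is split into a word and a comma, and "I'm" and "don't" are expanded into two words each, so plain whitespace splitting alone would not have produced these tokens. For the Chinese input, the same function returns the whole string as a single token, because there is no whitespace to split on; languages such as Chinese and Thai need a dedicated word segmenter instead.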