Tokenizer
- The tokenizer expands contractions to recover the essential grammatical features of the pronoun and the verb
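A minimal sketch of contraction expansion. The mapping below is an illustrative assumption, not an exhaustive rule set; real tokenizers use much larger tables and must handle ambiguous forms (e.g., "she's" can be "she is" or "she has").

```python
import re

# Toy contraction table (assumed for this demo): each entry maps a
# contracted surface form to separate pronoun/verb tokens.
CONTRACTIONS = {
    "I'm": "I am",
    "we're": "we are",
    "don't": "do not",
}

def expand_contractions(text):
    # Replace each known contraction with its expanded form so the
    # grammatical features of the pronoun and verb become explicit.
    for short, full in CONTRACTIONS.items():
        text = re.sub(re.escape(short), full, text)
    return text

print(expand_contractions("I'm sure we're ready"))  # I am sure we are ready
```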
- Space-delimited languages
- Whitespace-delimited tokens may not be valid tokens
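Even in space-delimited languages, splitting on whitespace alone leaves punctuation glued to words. A simple regex sketch (not a full tokenizer) shows the difference:

```python
import re

text = "Hello, world! It costs $5.99."

# Naive whitespace split: punctuation stays attached to the words,
# so "Hello," and "world!" are not valid word tokens.
naive = text.split()
print(naive)  # ['Hello,', 'world!', 'It', 'costs', '$5.99.']

# Separating runs of word characters from punctuation recovers
# cleaner tokens (a rough sketch; real tokenizers handle numbers,
# abbreviations, and currency more carefully).
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['Hello', ',', 'world', '!', 'It', 'costs', '$', '5', '.', '99', '.']
```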
- Chinese and Thai
- Words are written in succession with no indication of word boundaries
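Because Chinese and Thai mark no word boundaries, `str.split()` returns the whole sentence as one token. A greedy longest-match (MaxMatch) segmenter over a toy dictionary illustrates one classic approach; the vocabulary entries here are assumptions for the demo:

```python
# Toy dictionary (assumed): "I", "like", "eat", "apple".
VOCAB = {"我", "喜欢", "吃", "苹果"}
MAX_LEN = max(len(w) for w in VOCAB)

def max_match(sentence):
    tokens, i = [], 0
    while i < len(sentence):
        # Try the longest dictionary word starting at position i.
        for j in range(min(len(sentence), i + MAX_LEN), i, -1):
            if sentence[i:j] in VOCAB:
                tokens.append(sentence[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as a single-character token.
            tokens.append(sentence[i])
            i += 1
    return tokens

print(max_match("我喜欢吃苹果"))  # ['我', '喜欢', '吃', '苹果']
```

Production segmenters use statistical or neural models rather than pure dictionary matching, but the boundary-ambiguity problem they solve is the same.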
- Word Structure
- Punctuation
- Subword tokenization
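A minimal sketch of byte-pair encoding (BPE), one common subword method: repeatedly merge the most frequent adjacent symbol pair in the training corpus. The toy corpus below is an assumption for illustration.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    # Represent each word as a tuple of symbols (initially characters).
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word in the corpus.
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(learn_bpe(["low", "low", "lower", "lowest"], 2))  # [('l', 'o'), ('lo', 'w')]
```

Subword methods like this keep the vocabulary small while still covering rare and unseen words, since any word can be built from learned pieces plus single characters.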