Language Identification
- Identifying the language of the document
- Documents can also be multilingual at the sentence or paragraph level, so a single label per document is not always enough
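- Unique Character Set: the script itself identifies the language (for example, Greek or Thai)
- Shared Character Set: several languages use the same script (for example, the Latin alphabet), so the character set only narrows the candidates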
- Byte Range Distribution used for Character Set Identification
- Sort the bytes in a file by frequency count and use the sorted list as a signature vector, which is compared against known signatures via an n-gram model
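
The byte-frequency signature in the last bullet can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names (byte_signature, out_of_place, closest_profile) and the top_k cutoff are hypothetical, the sketch uses single bytes (1-grams) only, and the rank-based "out-of-place" distance is an assumed comparison measure, since the notes do not say how signatures are compared.

```python
from __future__ import annotations

from collections import Counter


def byte_signature(data: bytes, top_k: int = 32) -> list[int]:
    """Return the top_k byte values of `data`, most frequent first."""
    counts = Counter(data)
    # Sort by descending count; break ties by byte value so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [byte for byte, _ in ranked[:top_k]]


def out_of_place(sig_a: list[int], sig_b: list[int]) -> int:
    """Sum of rank differences between two signatures (assumed distance measure).

    A byte of sig_a that is missing from sig_b costs a fixed penalty of len(sig_b).
    """
    rank_in_b = {byte: rank for rank, byte in enumerate(sig_b)}
    penalty = len(sig_b)
    return sum(
        abs(rank - rank_in_b[byte]) if byte in rank_in_b else penalty
        for rank, byte in enumerate(sig_a)
    )


def closest_profile(data: bytes, profiles: dict[str, list[int]], top_k: int = 32) -> str:
    """Return the name of the reference signature closest to `data`."""
    sig = byte_signature(data, top_k)
    return min(profiles, key=lambda name: out_of_place(sig, profiles[name]))
```

In use, one reference signature would be built per known character set (or language) from sample files, for example profiles = {"koi8-r": byte_signature(sample_koi8r_bytes), ...}, and an unknown file would be labelled with closest_profile(unknown_bytes, profiles). Longer byte n-grams could replace single bytes by counting slices of the input instead of individual byte values.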