Language Identification

  • Identifying the language of the document
  • A document may also be multilingual, switching language at the paragraph or even the sentence level
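  • Unique character set: a script used largely by one language (e.g., Greek or Hangul) identifies the language almost directly
  • Shared character set: languages written in the same script (e.g., English and French in Latin script) need statistical evidence to tell them apart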
  • The byte range distribution can be used for character set identification
  • Sort the bytes in a file by frequency count, then use the sorted list as a signature vector for comparison against known signatures via an n-gram model
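
The byte-signature idea in the last bullet can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function names (signature, out_of_place, identify), the cutoff of 50 entries, and the rank-difference ("out-of-place") distance are all assumptions chosen for the example; the notes do not say which comparison measure is intended. With n = 1 the signature is simply the bytes sorted by frequency; larger n gives byte n-grams.

```python
from __future__ import annotations

from collections import Counter


def signature(data: bytes, n: int = 1, top: int = 50) -> list[bytes]:
    """Return the `top` most frequent byte n-grams, most frequent first."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    # Sort by descending count, then by n-gram value so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(sig: list[bytes], ref: list[bytes]) -> int:
    """Rank-order distance between two signatures (assumed measure).

    Each n-gram contributes the difference between its rank in `sig` and
    its rank in `ref`; an n-gram missing from `ref` costs len(ref).
    """
    ref_rank = {gram: rank for rank, gram in enumerate(ref)}
    penalty = len(ref)
    return sum(
        abs(rank - ref_rank[gram]) if gram in ref_rank else penalty
        for rank, gram in enumerate(sig)
    )


def identify(data: bytes, profiles: dict[str, list[bytes]], n: int = 1) -> str:
    """Return the name of the profile whose signature is closest to `data`."""
    sig = signature(data, n)
    return min(profiles, key=lambda name: out_of_place(sig, profiles[name]))
```

In use, one signature would be built per known language or character set from sample text (for example, profiles = {"name": signature(sample_bytes)}), and identify(unknown_bytes, profiles) returns the closest label. Very short inputs give unstable rankings, so real profiles would need far more training text than a single sentence.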