One of the core features of this package is tokenizing Myanmar-language text. At the time of writing, it supports:

  • Syllable-level tokenization (Burmese, Karen, Shan, Mon)
  • Word-level tokenization (Burmese)

Syllable-level tokenization

This tokenization is based on regular expressions (regex) and supports Burmese, Karen, Shan, and Mon. Call the tokenize function with the input text.

It returns a list of tokens, one per syllable.
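To make the regex idea concrete, here is a minimal, self-contained sketch of rule-based syllable breaking for Burmese only. It is not the package's own pattern: the function name `syllable_tokenize` and the character classes are assumptions made for illustration, and Karen, Shan, and Mon would need additional character ranges. The rule is that a consonant starts a new syllable unless it is stacked under the previous one (preceded by the virama sign) or closes the current syllable (followed by asat or virama).

```python
import re

# Illustrative character classes (assumed, Burmese only).
VIRAMA = "\u1039"             # stacking sign: the next consonant is subjoined
ASAT = "\u103a"               # "killer" mark: the consonant closes a syllable
CONSONANT = "\u1000-\u1021"   # က .. အ
# Independent vowels, digits and punctuation each start their own token.
STANDALONE = "\u1023-\u1027\u1029\u102a\u103f\u1040-\u104f"

SYLLABLE_START = re.compile(
    rf"(?<!{VIRAMA})[{CONSONANT}](?![{ASAT}{VIRAMA}])|[{STANDALONE}]"
)


def syllable_tokenize(text):
    """Split text at every position where a new syllable starts."""
    starts = [m.start() for m in SYLLABLE_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any leading text that matched no rule
    bounds = starts + [len(text)]
    pieces = (text[a:b].strip() for a, b in zip(bounds, bounds[1:]))
    return [p for p in pieces if p]


print(syllable_tokenize("မြန်မာ"))  # ['မြန်', 'မာ']
```

Because the rules only look at neighbouring characters, this approach needs no dictionary or trained model, which is why it extends to several languages that share the Myanmar script.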

Word-level tokenization

Word-level tokenization, on the other hand, supports only Burmese. It is based on conditional random field (CRF) prediction. Call the tokenize function as usual and set the form parameter to word.

Depending on the input text, the output differs slightly from the syllable-level result. In the second example, which contains English words, notice that word-level tokenization combines Alan Turing into a single token.
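One common way to turn sequence-labelling predictions into words is to tag each syllable as the beginning of a word or a continuation of one, then merge accordingly. The sketch below shows only that merging step. The `B`/`I` label scheme and the helper name `merge_syllables` are assumptions for illustration; the package's actual CRF features and labels are not described here, and the labels are hard-coded rather than predicted by a model.

```python
def merge_syllables(syllables, labels):
    """Join syllables into words using per-syllable boundary labels.

    "B" marks a syllable that begins a new word; any other label
    (for example "I") attaches the syllable to the current word.
    """
    if len(syllables) != len(labels):
        raise ValueError("syllables and labels must have the same length")

    words = []
    for syllable, label in zip(syllables, labels):
        if label == "B" or not words:
            words.append(syllable)   # start a new word
        else:
            words[-1] += syllable    # extend the current word
    return words


# Hypothetical labels, as a trained tagger might produce them.
print(merge_syllables(["မြန်", "မာ", "စာ"], ["B", "I", "B"]))  # ['မြန်မာ', 'စာ']
```

A learned labeller can use the surrounding syllables as context when it decides where a word starts, which may explain why the word-level result groups units that the syllable-level rules keep apart.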
