One of the core features of this package is tokenizing Myanmar-language text. At the time of writing, it supports:
- Syllable-level tokenization (Burmese, Karen, Shan, Mon)
- Word-level tokenization (Burmese)
Syllable-level tokenization is based on regular expressions (regex) and supports Burmese, Karen, Shan, and Mon. Call it as follows:
It returns a list of tokens, one per syllable.
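To give a feel for how regex-based syllable splitting can work, here is a minimal sketch. It is not the package's own pattern: the function name `syllable_tokenize` and the break rule are assumptions made for illustration, and the rule covers only the basic Burmese consonant range (no independent vowels, digits, or punctuation). The idea is that a syllable starts at a consonant unless that consonant is stacked under the previous one (preceded by a virama) or closes the previous syllable (followed by an asat or virama).

```python
import re

# Unicode ranges used by this simplified sketch (assumption: Burmese only).
CONSONANT = "\u1000-\u1021"  # KA .. A
ASAT = "\u103a"              # visible "killer" mark that closes a syllable
VIRAMA = "\u1039"            # invisible stacker used for stacked consonants

# A consonant starts a new syllable unless it is stacked (preceded by VIRAMA)
# or is a final consonant (followed by ASAT or VIRAMA).
BREAK = re.compile(
    "(?<!" + VIRAMA + ")[" + CONSONANT + "](?![" + ASAT + VIRAMA + "])"
)


def syllable_tokenize(text):
    """Split text into syllables by inserting a space at each break point."""
    marked = BREAK.sub(lambda m: " " + m.group(0), text)
    return marked.split()


# "Myanmar" written in Burmese script: two syllables.
print(syllable_tokenize("\u1019\u103c\u1014\u103a\u1019\u102c"))
```

Because the sketch splits on whitespace at the end, non-Burmese words such as English names pass through as whole tokens rather than being broken into characters.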
Word-level tokenization, by contrast, supports only Burmese. It is based on conditional random field (CRF) prediction. Call the tokenize function as usual and set the form parameter to request word-level output.
Depending on the input text, the output differs slightly from the syllable-level result. In the second example, which contains English words, notice that word-level tokenization combines Alan Turing into a single word.
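One common way a CRF is used for word segmentation is to label each syllable and then merge syllables according to those labels. The sketch below shows only that merging step. The label scheme ("B" for the first syllable of a word, "I" for a continuation) and the function name `merge_syllables` are assumptions for illustration; the article does not say how the package encodes its predictions.

```python
def merge_syllables(syllables, labels):
    """Merge syllables into words using per-syllable boundary labels.

    Assumed labels: "B" starts a new word, "I" continues the current word.
    """
    if len(syllables) != len(labels):
        raise ValueError("syllables and labels must have the same length")

    words = []
    for syllable, label in zip(syllables, labels):
        if label == "B" or not words:
            # Start a new word (also guards against a leading "I").
            words.append(syllable)
        else:
            # Attach the syllable to the word currently being built.
            words[-1] += syllable
    return words


# Three syllables; the first two are labeled as belonging to one word.
syllables = ["\u1019\u103c\u1014\u103a", "\u1019\u102c", "\u1005\u102c"]
labels = ["B", "I", "B"]
print(merge_syllables(syllables, labels))
```

Under this scheme, a multi-part name can come back as one token when the model labels its later parts as continuations.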