Tokenize

In this section, you will learn to perform tokenization on Vietnamese text. Create a new Python file and add the following code inside it.

from pyvi import ViTokenizer

text = 'Xin chào! Rất vui được gặp bạn.'
result = ViTokenizer.tokenize(text)
print(result)

You should get the following output:

Xin chào ! Rất vui được gặp bạn .

Tokens in the returned string are separated by a single space. You can convert the string to a list by splitting it on that space:

result.split(' ')

The new output is as follows:

['Xin', 'chào', '!', 'Rất', 'vui', 'được', 'gặp', 'bạn', '.']
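
As a small aside, the sketch below contrasts split(' ') with the argument-free split(). It uses a hard-coded string copied from the output shown earlier rather than calling pyvi, and the variable names are illustrative only.

```python
# Tokenized string copied from the output shown above (not produced here by pyvi).
tokenized = 'Xin chào ! Rất vui được gặp bạn .'

# split(' ') breaks on every single space character.
tokens = tokenized.split(' ')

# split() with no argument breaks on any run of whitespace and
# ignores leading/trailing whitespace, so it never yields empty strings.
tokens_any_ws = tokenized.split()

print(tokens)
print(tokens == tokens_any_ws)
```

Both calls give the same list for this string. The argument-free form is the safer choice if the tokenized text might contain consecutive spaces or a trailing newline.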

spacy_tokenize

pyvi also provides an alternative function called spacy_tokenize for easier integration with the spaCy package. Call it as follows:

result = ViTokenizer.spacy_tokenize(text)

The output is a tuple with the following items:

  • a list of tokenized tokens
  • a list of booleans indicating whether each token is followed by a space

You should get the following output when you run the file:

(['Xin', 'chào', '!', 'Rất', 'vui', 'được', 'gặp', 'bạn', '.'], [True, False, True, True, True, True, True, False, False])

Use index 0 to get the list:

result[0]
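
The second item, result[1], records where spaces occurred in the original text. As a rough illustration of how the two lists fit together, the sketch below rebuilds the input string from them. The tuple is hard-coded from the output above instead of calling pyvi, and rebuild_text is a hypothetical helper, not part of pyvi or spaCy.

```python
def rebuild_text(tokens, spaces):
    """Join tokens, appending a space after each token whose flag is True."""
    return ''.join(
        tok + (' ' if has_space else '')
        for tok, has_space in zip(tokens, spaces)
    )

# Tuple copied from the spacy_tokenize output shown above.
result = (
    ['Xin', 'chào', '!', 'Rất', 'vui', 'được', 'gặp', 'bạn', '.'],
    [True, False, True, True, True, True, True, False, False],
)

tokens, spaces = result  # same as result[0] and result[1]
original = rebuild_text(tokens, spaces)
print(original)
```

This pair of lists has the same shape as the words and spaces arguments accepted by spaCy's Doc constructor, which is presumably why the function returns them together.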

Continue reading: https://towardsdatascience.com/introduction-to-pyvi-python-vietnamese-nlp-toolkit-ff5124983dc2

Source: towardsdatascience.com