In this section, you will learn to perform tokenization on Vietnamese text. Create a new Python file and add the following code inside it.

from pyvi import ViTokenizer

text = 'Xin chào! Rất vui được gặp bạn.'
result = ViTokenizer.tokenize(text)
print(result)

You should get the following output:

Xin chào ! Rất vui được gặp bạn .

Tokens in the returned string are separated by a single space, so you can convert it to a list by splitting on that space:

result.split(' ')

The new output is as follows:

['Xin', 'chào', '!', 'Rất', 'vui', 'được', 'gặp', 'bạn', '.']
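
One detail worth knowing about the split step: `split(' ')` and `split()` behave differently when a string contains consecutive spaces. The sketch below uses a made-up string with a doubled space (an assumption for illustration, not actual pyvi output) to show the difference in plain Python.

```python
# Illustrative string with a doubled space after '!' (not real pyvi output).
tokenized = 'Xin chào !  Rất vui'

# split(' ') cuts at every single space, so a doubled space yields an empty string.
by_single_space = tokenized.split(' ')

# split() with no argument treats any run of whitespace as one separator.
by_any_whitespace = tokenized.split()

print(by_single_space)
print(by_any_whitespace)
```

If the tokenized string might ever contain extra whitespace, `split()` is the safer choice; for the single-spaced output shown here, both calls give the same list.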


pyvi also provides an alternative function called spacy_tokenize for better integration with the spaCy package. Call it as follows:

result = ViTokenizer.spacy_tokenize(text)
print(result)

The output is a tuple with the following items:

  • a list of tokens
  • a list of booleans indicating whether each token is followed by a space

You should get the following output when you run the file:

(['Xin', 'chào', '!', 'Rất', 'vui', 'được', 'gặp', 'bạn', '.'], [True, False, True, True, True, True, True, False, False])
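
To see what the boolean list is for, here is a minimal sketch that rebuilds the original sentence from the two lists. The values are copied from the output above so the snippet runs without pyvi; `rebuild_text` is a hypothetical helper written for this illustration, not part of pyvi or spaCy.

```python
# Values copied from the spacy_tokenize output shown above.
tokens = ['Xin', 'chào', '!', 'Rất', 'vui', 'được', 'gặp', 'bạn', '.']
spaces = [True, False, True, True, True, True, True, False, False]


def rebuild_text(tokens, spaces):
    """Join tokens, appending a space after each token whose flag is True."""
    return ''.join(
        token + (' ' if has_space else '')
        for token, has_space in zip(tokens, spaces)
    )


original = rebuild_text(tokens, spaces)
print(original)
```

Because the flags record exactly where spaces occurred, the tokenization is lossless with respect to spacing. This is the same pair of inputs that spaCy's `Doc` constructor accepts through its `words` and `spaces` arguments.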

Use index 0 to get the list of tokens:

result[0]

