0. INSTALL AND IMPORT DEPENDENCIES
We first have to install the transformers library, which can be done with a simple pip install transformers.
Next, we import the libraries needed for the rest of the tutorial.
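The install and the imports might look like the following; the exact import list (and which model classes are used later) is an assumption based on the rest of the tutorial:

```python
# Install once per environment, e.g. in a notebook cell:
#   !pip install transformers
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from transformers import AutoTokenizer  # loads the pre-trained RoBERTa tokenizer later on
```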
1. INSTANTIATE THE TPU
The model was trained using Colab’s free TPUs. TPUs will allow us to train our model much faster and to use a larger batch size. To enable the TPU on Colab, click “Edit”->“Notebook Settings” and select “TPU” in the “Hardware Accelerator” field. To instantiate the TPU for use with TensorFlow, we need to run the following code
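A typical TF 2.x TPU initialization on Colab looks like this (the fallback branch is added here so the snippet also runs on machines without a TPU):

```python
import tensorflow as tf

try:
    # On Colab the resolver auto-detects the attached TPU
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
except ValueError:
    # No TPU found: fall back to the default (CPU/GPU) strategy
    strategy = tf.distribute.get_strategy()

print("Number of replicas:", strategy.num_replicas_in_sync)
```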
To make full use of the TPU’s potential, we have set a batch size that is a multiple of the number of TPU cores in our cluster. We then just need to instantiate our model under the strategy’s scope.
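A minimal sketch of this pattern, using the default strategy as a stand-in for the TPU strategy instantiated in the previous step (the per-replica batch size of 16 is an illustrative choice, not the article’s):

```python
import tensorflow as tf

strategy = tf.distribute.get_strategy()  # on Colab, use the TPUStrategy from the previous step

# 16 examples per replica; on an 8-core TPU this gives a global batch of 128
PER_REPLICA_BATCH = 16
BATCH_SIZE = PER_REPLICA_BATCH * strategy.num_replicas_in_sync

with strategy.scope():
    # Variables created here are placed/replicated according to the strategy
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
```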
2. DATA EXPLORATION
Let’s load the data. We will concatenate the headline and the description into a single input text that we will feed to our network later.
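The loading step might look like the following. The file path and column names match the Kaggle “News Category Dataset” (JSON lines with category, headline, and short_description fields), but are assumptions here; a tiny in-memory stand-in is used so the snippet runs without the file:

```python
import io
import pandas as pd

DATA_PATH = "News_Category_Dataset_v2.json"  # hypothetical path to the dataset

# Two-row stand-in with the same schema; swap `sample` for DATA_PATH on the real data
sample = io.StringIO(
    '{"category": "POLITICS", "headline": "Headline one", "short_description": "Desc one."}\n'
    '{"category": "SPORTS", "headline": "Headline two", "short_description": "Desc two."}\n'
)
df = pd.read_json(sample, lines=True)

# Concatenate headline and description into the single input text for the model
df["text"] = df["headline"] + " " + df["short_description"]
```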
The news headlines are classified into 41 categories, let’s visualize how they are distributed.
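One way to plot the distribution, sketched with a tiny synthetic frame standing in for the loaded DataFrame:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Stand-in for the loaded dataset
df = pd.DataFrame({"category": ["POLITICS"] * 3 + ["SPORTS"] * 2 + ["ARTS"]})

counts = df["category"].value_counts()
counts.plot(kind="barh", figsize=(8, 4), title="Articles per category")
plt.tight_layout()
plt.savefig("category_counts.png")
```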
We see that we have a lot of categories with few entries. Furthermore, some categories may refer to closely related or overlapping concepts. Since there is a significant number of categories to predict, let’s aggregate the categories that refer to similar concepts. This will make the classification task a little bit easier.
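The aggregation can be done with a simple mapping from old to merged category names. The specific merges below are illustrative guesses (these category names do occur in the dataset, but the article’s exact groupings are not shown here):

```python
import pandas as pd

df = pd.DataFrame({"category": ["ARTS", "CULTURE & ARTS", "WORLDPOST", "SPORTS"]})

# Illustrative merges of closely related categories
CATEGORY_MAP = {
    "ARTS": "ARTS & CULTURE",
    "CULTURE & ARTS": "ARTS & CULTURE",
    "THE WORLDPOST": "WORLD NEWS",
    "WORLDPOST": "WORLD NEWS",
    "PARENTS": "PARENTING",
    "STYLE": "STYLE & BEAUTY",
}
df["category"] = df["category"].replace(CATEGORY_MAP)
```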
We are thus left with 28 aggregated categories distributed as follows
3. DATA PREPROCESSING
We now have to preprocess our data into a form that a TensorFlow Keras model can use. As a first step, we need to turn the class labels into indices. We don’t need a one-hot encoding, since we will work with TensorFlow’s SparseCategoricalCrossentropy loss, which accepts integer targets directly.
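The label-to-index step can be sketched as follows (variable names are illustrative):

```python
import pandas as pd

# Stand-in for the preprocessed dataset
df = pd.DataFrame({"category": ["SPORTS", "POLITICS", "SPORTS"]})

# Map each class label to an integer index; SparseCategoricalCrossentropy
# takes these integer targets directly, so no one-hot encoding is needed
labels = sorted(df["category"].unique())
label2idx = {label: i for i, label in enumerate(labels)}
df["label"] = df["category"].map(label2idx)
```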
Next, we need to tokenize the text, i.e. transform our strings into lists of indices that can be fed to the model. The transformers library provides the AutoTokenizer class, which lets us load the pre-trained tokenizer used by RoBERTa.
RoBERTa uses a byte-level BPE tokenizer that performs subword tokenization, i.e. rare or unknown words are split into common subwords present in the vocabulary. We will see what this means in the examples below.
Here the flag padding=True pads each sentence to the maximum length present in the batch, while truncation=True truncates sentences to the maximum number of tokens the model can accept (512 for RoBERTa, as for BERT).
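Put together, the tokenization call might look like this (padding, truncation, and max_length are real arguments of the tokenizer’s call method; the sample texts are made up, and the snippet downloads the roberta-base tokenizer on first run):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

texts = ["A short headline.", "A much longer headline with quite a few more words in it."]
enc = tokenizer(
    texts,
    padding=True,        # pad to the longest text in this batch
    truncation=True,     # cut off anything beyond max_length
    max_length=512,      # RoBERTa's limit, as for BERT
    return_tensors="np",
)
# enc["input_ids"] and enc["attention_mask"] now have one padded row per text
```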
Let’s visualize how the text gets tokenized.
Input: Twitter Users...
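A quick way to inspect the subword splitting on any sentence of your own (the example sentence is made up; the exact subwords produced depend on the BPE vocabulary, so no output is claimed here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# tokenize() shows the subword pieces; convert_tokens_to_ids() gives their indices
tokens = tokenizer.tokenize("Twitter users spread the news")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)
```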
Continue reading: https://towardsdatascience.com/news-category-classification-fine-tuning-roberta-on-tpus-with-tensorflow-f057c37b093