How to train your own high-performing sentiment analysis model

Photo by Pietro Jeng on Unsplash

Sentiment analysis is a natural language processing technique used to identify the emotions expressed in a piece of text. Common use cases include monitoring customer feedback on social media and tracking brands and campaigns.

In this article, we examine how you can train your own sentiment analysis model on a custom dataset by leveraging a pre-trained HuggingFace model. We will also examine how to efficiently perform single and batch prediction with the fine-tuned model in both CPU and GPU environments. If you are looking for an out-of-the-box sentiment analysis model, check out my previous article on how to perform sentiment analysis in Python with just 3 lines of code.

pip install transformers
pip install fast_ml==3.68
pip install datasets
import numpy as np
import pandas as pd
from fast_ml.model_development import train_valid_test_split
from transformers import Trainer, TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch import nn
from torch.nn.functional import softmax
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder
import datasets

Use the GPU accelerator if one is available, and fall back to the CPU otherwise.

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'Device Available: {DEVICE}')
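
Selecting a device does not move anything onto it by itself; tensors and models have to be placed there explicitly. The following is a minimal sketch of that step, using a toy linear layer and a dummy tensor (both hypothetical stand-ins, not the sentiment model used in this article).

```python
import torch

# Same device selection as above
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-ins (assumptions for illustration only)
layer = torch.nn.Linear(3, 1).to(DEVICE)   # move the module's weights to the device
x = torch.ones(2, 3).to(DEVICE)            # move the input to the same device

# Inputs and weights must live on the same device for the forward pass to work
out = layer(x)
print(out.shape, out.device.type)
```

The same `.to(DEVICE)` pattern applies to any PyTorch module and to its input tensors.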

We will be using an ecommerce dataset that contains text reviews and ratings for women’s clothes.

df = pd.read_csv('/kaggle/input/womens-ecommerce-clothing-reviews/Womens Clothing E-Commerce Reviews.csv')
df.drop(columns = ['Unnamed: 0'], inplace = True)

We are only interested in the Review Text and Rating columns. The Review Text column serves as the input variable to the model, and the Rating column is our target variable. It has values ranging from 1 (least favourable) to 5 (most favourable).

For clarity, let’s append “Star” or “Stars” after each integer rating.

df_reviews = df.loc[:, ['Review Text', 'Rating']].dropna()
df_reviews['Rating'] = df_reviews['Rating'].apply(lambda x: f'{x} Stars' if x != 1 else f'{x} Star')
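
To see what these two lines do, here is a small sketch that applies the same logic to a toy DataFrame. The four rows below are made up for illustration and are not taken from the Kaggle dataset.

```python
import pandas as pd

# Toy data (hypothetical) standing in for the Kaggle CSV
toy = pd.DataFrame({
    'Review Text': ['Love it', None, 'Runs small', 'Fell apart'],
    'Rating': [5, 4, 3, 1],
})

# Keep the two columns of interest and drop rows with a missing review
toy_reviews = toy.loc[:, ['Review Text', 'Rating']].dropna()

# Singular "Star" for a rating of 1, plural "Stars" otherwise
toy_reviews['Rating'] = toy_reviews['Rating'].apply(
    lambda x: f'{x} Stars' if x != 1 else f'{x} Star'
)

labels = toy_reviews['Rating'].tolist()
print(labels)  # ['5 Stars', '3 Stars', '1 Star']
```

The row with the missing review is dropped, and the remaining ratings become text labels.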

This is what the data looks like now, where 1, 2, 3, 4 and 5 stars are our class labels.

Let’s encode the ratings using Sklearn’s LabelEncoder.

le = LabelEncoder()
df_reviews['Rating'] = le.fit_transform(df_reviews['Rating'])

Notice that the Rating column has been transformed from a text column to an integer column.
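
As a quick illustration of how LabelEncoder assigns these integers, the sketch below fits a separate encoder (named `le_demo` here to avoid clashing with `le`) on a few made-up labels. LabelEncoder sorts the distinct labels and numbers them from 0, and the fitted encoder can map integers back to the original labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels for illustration; the real column holds all five ratings
le_demo = LabelEncoder()
encoded = le_demo.fit_transform(['5 Stars', '1 Star', '3 Stars', '1 Star'])

# classes_ holds the sorted unique labels; each label's index is its integer code
mapping = {str(label): code for code, label in enumerate(le_demo.classes_)}
print(mapping)  # {'1 Star': 0, '3 Stars': 1, '5 Stars': 2}

# inverse_transform recovers the text labels from integer codes
decoded = le_demo.inverse_transform([0, 2]).tolist()
print(decoded)  # ['1 Star', '5 Stars']
```

Keeping the fitted encoder around is useful later, since model predictions come back as integers and need to be translated into readable labels.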

The numbers in the Rating column…
