By Saurabh Sharma, Machine Learning Engineer.


Real-world documents usually cover more than one topic. A topic is something that describes the meaning of a document concisely. For instance, let’s take one review from a garage service website: “The process of booking was simple, and the tire I bought was a good price. The only issue was that service took 35 minutes from arriving at the depot to leaving, which I felt was too long for a pre-arranged appointment.”

In this review, there are several underlying themes, such as “ease_of_booking”, “tyre_price”, and “service_duration”, that we can call “topics.” Extracting such topic clusters from a raw, unstructured set of documents has long been a challenging task for researchers. I propose a two-step approach using LDA and BERT to build a domain-specific document categorizer that assigns each document in a raw, unlabelled dataset to a set of topic clusters.

My approach involves two main subtasks:

  1. Unsupervised learning using LDA (Latent Dirichlet Allocation) to mine a set of topics from an unlabelled document dataset.
  2. Supervised learning using BERT to build a multi-topic document categorizer.

Let’s begin…


I scraped reviews from an online garage booking service website for this task and stored them in a CSV file.

import numpy as np
import pandas as pd  # data processing, CSV file I/O
import os
import re

## provide your own file path
train_df = pd.read_csv("../input/garage_service_reviews.csv")


Unsupervised learning using LDA

The following pre-processing steps were performed before training an LDA model on the garage booking service reviews.


This module is the initial and crucial phase of pre-processing, where the text data is cleaned by removing punctuation, stop words, digits, and non-ASCII characters. Finally, a list of lemmatized tokens for each review is returned as the output.

import spacy
nlp = spacy.load("en_core_web_sm")

The lines above load spaCy’s small English model.

def preprocess_text(text):
    """Return the lemmatized tokens of a review, with stop words,
    punctuation, digits, and non-ASCII tokens removed."""
    doc = nlp(text.lower())
    return [token.lemma_ for token in doc
            if not token.is_stop and not token.is_punct
            and not token.is_digit and token.is_ascii]
## making a call to the preprocess_text function
train_df['text'] = train_df['text'].apply(preprocess_text)
