By Saurabh Sharma, Machine Learning Engineer.
In a real-world scenario, documents that we encounter usually cover more than one topic. A topic is something that describes the meaning of a document concisely. For instance, let’s take one review from a garage service website — “The process of booking was simple, and the tire I bought was a good price. The only issue was that service took 35 minutes from arriving at the depot to leaving, which I felt was too long for a pre-arranged appointment.”
This review touches on several themes, such as “ease_of_booking”, “tyre_price”, and “service_duration”, that we can call “topics.” Extracting such topic clusters from a raw, unstructured set of documents has long been a challenging task for researchers. I propose a two-step approach using LDA and BERT to build a domain-specific document categorizer that assigns each document to a set of topic clusters, starting from a raw, unlabelled document dataset.
My approach involves two main subtasks:
- Unsupervised learning using LDA (Latent Dirichlet Allocation) to mine a set of topics from an unlabelled document dataset.
- Supervised learning using BERT to build a multi-topic document categorizer.
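One way to picture how the two subtasks connect: the topics mined in the first step become the label set for the second, and because a review can cover several topics at once, each document's target is often encoded as a 0/1 vector with one slot per topic. The sketch below assumes that encoding; the `TOPICS` list and the `to_multi_hot` helper are hypothetical names used only for illustration, and the actual pipeline may represent labels differently.

```python
# Hypothetical topic inventory, borrowed from the example review above.
# In practice these names would come out of the LDA step.
TOPICS = ["ease_of_booking", "tyre_price", "service_duration"]


def to_multi_hot(doc_topics, topics=TOPICS):
    """Encode a document's set of topics as a 0/1 vector aligned with `topics`."""
    unknown = set(doc_topics) - set(topics)
    if unknown:
        raise ValueError(f"unknown topics: {sorted(unknown)}")
    return [1 if topic in doc_topics else 0 for topic in topics]


# A review that mentions booking and waiting time, but not price.
labels = to_multi_hot({"ease_of_booking", "service_duration"})
print(labels)  # [1, 0, 1]
```

A vector like this is what distinguishes a multi-topic (multi-label) categorizer from a single-label one: more than one position can be 1 for the same document.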
I scraped reviews from an online garage booking service website for this task and stored them in a CSV file.
Created by the author.
Unsupervised learning using LDA
The following pre-processing steps were performed on the garage booking service reviews before training LDA on them.
This module is the initial and crucial phase of pre-processing: the text is cleaned by removing punctuation, stop words, and non-ASCII characters. It returns a list of lemmatized tokens for each review.
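As a rough illustration of such a cleaning step, here is a minimal sketch. It is not the original module: the function name `clean_review`, the model name `en_core_web_sm`, and the small fallback stop-word list are all assumptions. When spaCy or the model is unavailable, the sketch falls back to a regex tokenizer that skips lemmatization, so that it still runs.

```python
import re

# Tiny illustrative stop-word list, used only by the fallback path (an assumption).
FALLBACK_STOP_WORDS = {
    "the", "a", "an", "and", "of", "was", "were", "i", "to", "for",
    "that", "which", "from", "at",
}

try:
    import spacy
    # "en_core_web_sm" is an assumed model name; the article does not name one.
    nlp = spacy.load("en_core_web_sm")
except (ImportError, OSError):
    nlp = None  # spaCy or the model is missing; use the regex fallback below


def clean_review(text):
    """Return lowercase tokens without punctuation, stop words, or non-ASCII words."""
    if nlp is not None:
        return [
            tok.lemma_.lower()
            for tok in nlp(text)
            if tok.is_ascii and not (tok.is_punct or tok.is_space or tok.is_stop)
        ]
    # Fallback: no lemmatization, just word tokens minus non-ASCII and stop words.
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w.isascii() and w not in FALLBACK_STOP_WORDS]


print(clean_review("The booking was simple, and café prices were fine!"))
```

The exact tokens depend on which path runs (the spaCy path lemmatizes, the fallback does not), but in both cases punctuation, common stop words, and the non-ASCII word are dropped.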
The cleaning relies on a spaCy model, which is loaded before any review is processed.
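Once each review is reduced to a list of tokens, LDA can be fit on the corpus. To give a feel for what the algorithm does, the sketch below implements a small collapsed Gibbs sampler for LDA in plain Python. It is an illustration under stated assumptions, not the implementation used for the garage reviews: a real project would normally rely on a library, and the function name `lda_gibbs`, the hyperparameter defaults, and the toy documents are all made up for this example.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return (theta, topic_word, vocab)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word w in topic k
    topic_total = [0] * num_topics                              # all tokens in topic k

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word, vocab


# Toy token lists standing in for cleaned reviews (made up for illustration).
docs = [
    ["booking", "simple", "easy", "booking"],
    ["tyre", "price", "good", "price"],
    ["service", "long", "wait", "service"],
    ["booking", "easy", "price", "good"],
]
theta, topic_word, vocab = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(zip(counts, vocab), reverse=True)[:3]
    print(f"topic {k}:", [w for _, w in top])
```

Each row of `theta` is a document's mixture over topics, and the highest-count words per topic are what one would inspect to give a topic a readable name such as “tyre_price”. On a corpus this small the topics are not meaningful; the point is only to show the mechanics.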
Continue reading: https://www.kdnuggets.com/2021/08/multilabel-document-categorization.html