NLP Tutorial: Topic Modeling in Python with BerTopic

BerTopic is a topic modeling technique that uses transformers (BERT embeddings) and class-based TF-IDF to create dense clusters. It also allows you to easily interpret and visualize the topics generated. In this NLP tutorial, we will use Olympic Tokyo 2020 tweets, with the goal of creating a model that can automatically categorize the tweets by topic. The BerTopic algorithm contains 3 stages. First, embed the textual data (documents) with BERT or any other embedding technique. Second, reduce the dimensionality of the embeddings with UMAP and cluster the reduced embeddings with HDBSCAN. Third, use class-based TF-IDF to extract a topic representation from each cluster.
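
As a rough illustration of what running these stages looks like in practice, here is a minimal sketch that assumes the `bertopic` package is installed and that the default embedding, UMAP and HDBSCAN settings are acceptable. The helper name `fit_topics` and the `tweets` variable are hypothetical and do not come from the original tutorial.

```python
def fit_topics(docs):
    """Fit a BERTopic model on a list of strings; return the model and topic ids."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from bertopic import BERTopic

    # With default arguments the model embeds the documents, reduces the
    # embeddings with UMAP, clusters them with HDBSCAN, and builds topic
    # representations with class-based TF-IDF.
    topic_model = BERTopic(language="english")
    topics, probabilities = topic_model.fit_transform(docs)
    return topic_model, topics


# Hypothetical usage, assuming `tweets` is a list of a few thousand tweet strings:
# topic_model, topics = fit_topics(tweets)
# print(topic_model.get_topic_info().head())
```

The model needs a reasonably large collection of documents to form clusters, so a handful of example sentences is not enough to try this out.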

Davis David (@davisdavid)

Data Scientist | AI Practitioner | Software Developer. Giving talks, teaching, writing.


NLP Natural Language Processing

Main ideas

The NLP process is similar to that of other classification algorithms:

  • Compile documents. Get the data, which is usually raw text.
  • Featurize documents. Get the text in a format that ML algorithms understand.
  • Compare features for classification. Use ML techniques to build the model.

Unstructured text ⇒ Compile documents ⇒ Featurize them ⇒ Compare features

How does the algorithm work

In NLP the featurization is done through vectorization:

  • Corpus of D documents: a = “The House is Blue”, b = “The House is Red”.
  • Build an index of relevant, meaningful keywords. E.g. (house, blue, red)
  • Vectorize documents. E.g. a = “The House is Blue” ⇒ (1,1,0) and b = “The House is Red” ⇒ (1,0,1)
  • Compare the docs as follows:

Use cosine similarity to compare: similarity(a,b) = cos(θ) = (a · b) / (‖a‖ ‖b‖)
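
The vectorization and comparison steps above can be sketched in plain Python. This is a minimal illustration, not a production tokenizer: it assumes whitespace tokenization, lowercasing, and a binary (present or absent) encoding over the keyword index (house, blue, red). The function names are made up for this example.

```python
import math


def vectorize(doc, vocabulary):
    """Binary bag-of-words: 1 if the keyword occurs in the document, else 0."""
    tokens = set(doc.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]


def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vocabulary = ["house", "blue", "red"]
a = vectorize("The House is Blue", vocabulary)  # [1, 1, 0]
b = vectorize("The House is Red", vocabulary)   # [1, 0, 1]
similarity = cosine_similarity(a, b)            # approximately 0.5
```

The two documents share one of their two keywords each, so the angle between the vectors is 60 degrees and the similarity is about 0.5.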

Characterize the terms:
Term Frequency TF(t,d) = number of times term t occurs in doc d ⇒ Importance of the term t within doc d
Inverse Doc Frequency IDF(t) = log (D/DF(t)), where DF(t) is the number of docs containing t ⇒ Importance of the term within corpus D
TF-IDF(t,d) = TF(t,d) × IDF(t) ⇒ A statistical weight used to evaluate how important a word is to a document in a collection or corpus.
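
To make the weighting concrete, here is a small sketch that computes TF-IDF for the two-document corpus used earlier. It assumes a raw count for TF and IDF(t) = log(D / DF(t)), where DF(t) is the number of documents containing t. Other variants (smoothing, normalization) exist and give different numbers. The function names are illustrative only.

```python
import math


def term_frequency(term, doc_tokens):
    """Raw count of the term within one tokenized document."""
    return doc_tokens.count(term)


def inverse_document_frequency(term, corpus_tokens):
    """log(D / DF(t)); returns 0.0 for terms that appear in no document."""
    df = sum(1 for tokens in corpus_tokens if term in tokens)
    if df == 0:
        return 0.0
    return math.log(len(corpus_tokens) / df)


def tf_idf(term, doc_tokens, corpus_tokens):
    """Weight of a term in one document relative to the whole corpus."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus_tokens)


corpus = ["The House is Blue", "The House is Red"]
corpus_tokens = [doc.lower().split() for doc in corpus]

# Weights of the index keywords in document a ("The House is Blue").
weights_a = {
    term: tf_idf(term, corpus_tokens[0], corpus_tokens)
    for term in ["house", "blue", "red"]
}
```

Under these assumptions "house" gets weight 0 because it appears in every document, while "blue" gets log(2) because it appears in only one of the two. The score rewards terms that distinguish a document from the rest of the corpus.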