BerTopic is a topic modeling technique that uses transformers (BERT embeddings) and class-based TF-IDF to create dense clusters. It also allows you to easily interpret and visualize the topics generated. In this NLP tutorial, we will use Olympic Tokyo 2020 Tweets with a goal to create a model that can automatically categorize the tweets by their topics. The BerTopic algorithm contains 3 stages:Embed the textual data(documents) Embed the documents with BERT, or it can use any other embedding technique. The algorithm uses UMAP to reduce the dimensionality of embeddeddings and the HDBSCAN technique.
Data Scientist | AI Practitioner | Software Developer. Giving talks, teaching, writing.