In this post, we’ll look at a set of tweets and try to determine its major theme(s). But first, a little bit of context.
As written by yamini5 on Analytics Vidhya, “Topic modelling[sic] refers to the task of identifying topics that best describes a set of documents.” Simply put, topic modeling is the process of taking a pile of unstructured, unlabeled text and sorting it into the topics it represents. For example, we might have a collection of Emily Dickinson’s poems. When we try to classify them, we’re probably going to end up with the topics of life, death, and love. Topic modeling is how the computer does all this using some fancy math (that we’re not going to talk about, LOL).
One popular algorithm for topic modeling is LDA, or Latent Dirichlet Allocation. However, LDA has two major limitations: 1) it assumes each document is a mixture of topics, and 2) it performs poorly on short text (fewer than 50 words).
Enter our protagonist: GSDMM.
GSDMM, short for Gibbs Sampling Dirichlet Multinomial Mixture, is a model proposed by Jianhua Yin and Jianyong Wang that works better than vanilla LDA on short text. It assumes that each document is about exactly one topic, which makes it a good fit for short texts like tweets and movie reviews. Matyas Amrouche has a great article if you want the intuition behind the algorithm. Or, if you’d prefer, you can read the original paper here.
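If you'd like to see the one-topic-per-document idea in code, here is a minimal sketch of the collapsed Gibbs sampling step behind GSDMM, based on my reading of Yin and Wang's paper. Treat it as an illustration rather than a reference implementation: the function name `gsdmm`, its parameters and their default values, and the toy documents are all assumptions made up for this example.

```python
import math
import random
from collections import Counter

def gsdmm(docs, K=8, alpha=0.1, beta=0.1, n_iters=15, seed=0):
    """Cluster tokenized documents; returns one cluster label per document."""
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})  # vocabulary size
    m = [0] * K                                # number of documents in each cluster
    n = [0] * K                                # number of words in each cluster
    nw = [Counter() for _ in range(K)]         # count of each word in each cluster
    labels = []

    def move(doc, z, sign):
        """Add (sign=+1) or remove (sign=-1) a document from cluster z."""
        m[z] += sign
        n[z] += sign * len(doc)
        for w in doc:
            nw[z][w] += sign

    # Start with a random cluster for every document.
    for doc in docs:
        z = rng.randrange(K)
        labels.append(z)
        move(doc, z, +1)

    for _ in range(n_iters):
        for i, doc in enumerate(docs):
            move(doc, labels[i], -1)           # take the document out of its cluster
            counts = Counter(doc)
            log_p = []
            for z in range(K):
                # Popular clusters are more attractive. The shared denominator
                # (D - 1 + K * alpha) is the same for every z, so it is dropped.
                lp = math.log(m[z] + alpha)
                # Clusters that already contain the document's words score higher.
                for w, c in counts.items():
                    for j in range(c):
                        lp += math.log(nw[z][w] + beta + j)
                for j in range(len(doc)):
                    lp -= math.log(n[z] + V * beta + j)
                log_p.append(lp)
            # Sample a new cluster in proportion to the (unnormalized) probabilities.
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            z = rng.choices(range(K), weights=weights)[0]
            labels[i] = z
            move(doc, z, +1)                   # put the document into its new cluster
    return labels

# Toy, made-up documents (already tokenized) just to exercise the function.
toy_docs = [
    ["vote", "ballot", "count"],
    ["ballot", "count", "vote", "vote"],
    ["vaccine", "virus", "doctor"],
    ["virus", "vaccine", "vaccine"],
]
toy_labels = gsdmm(toy_docs, K=4)
print(toy_labels)
```

Notice that every document carries exactly one label at any time, which is the one-topic-per-document assumption in action. Clusters that lose all their documents tend to stay empty, so `K` behaves like an upper bound on the number of topics rather than an exact count.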
Today, we’ll take a bunch of tweets from former President Donald Trump and try to apply topic modeling. This hands-on tutorial aims to assign a topic to each one of former President Trump’s tweets.
Let’s dig in.
First, go to thetrumparchive.com and click the “Retweet filters” and “Hide Retweets” buttons to exclude retweets from our analysis.
Then, click on “Date filters” and set the date range from 2020-11-07 to 2021-01-07.
Next, we’ll export the tweets into a JSON file. Click on the “Export” button.
Click on “Start export” to continue with the process.
This is what you’ll see while it’s processing.
Once done, click on the lower right corner of the box that appears and drag it to the right to make it bigger and easier to see.
Click somewhere inside the text box and then press CTRL+A (press the CTRL key and the letter “A” on the keyboard at the same time) to select everything inside the text box.
Once everything is selected, everything inside the box will be highlighted.
Place your cursor over the blue highlighting and…
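Once the exported text has been saved somewhere, the same two filters (no retweets, dates from 2020-11-07 to 2021-01-07) can be double-checked in Python. The snippet below is only a sketch under some loud assumptions: the field names `text`, `date`, and `isRetweet`, and the `YYYY-MM-DD HH:MM:SS` date format, are my guesses at what the export looks like, and the function name `load_tweets` and the sample records are made up for illustration.

```python
import json
from datetime import datetime

def load_tweets(raw_json, start="2020-11-07", end="2021-01-07"):
    """Parse exported JSON text and keep the text of original tweets in the date range."""
    start_day = datetime.strptime(start, "%Y-%m-%d").date()
    end_day = datetime.strptime(end, "%Y-%m-%d").date()
    kept = []
    for tweet in json.loads(raw_json):
        # ASSUMPTION: retweets are flagged in an "isRetweet" field (True or "t").
        if tweet.get("isRetweet") in (True, "t"):
            continue
        # ASSUMPTION: "date" starts with YYYY-MM-DD, so the first 10 characters suffice.
        day = datetime.strptime(tweet["date"][:10], "%Y-%m-%d").date()
        if start_day <= day <= end_day:
            kept.append(tweet["text"])
    return kept

# Made-up placeholder records standing in for the real export.
sample_export = json.dumps([
    {"text": "placeholder original tweet", "date": "2020-11-08 10:00:00", "isRetweet": False},
    {"text": "placeholder retweet", "date": "2020-11-09 11:30:00", "isRetweet": True},
    {"text": "placeholder tweet outside the range", "date": "2021-02-01 09:15:00", "isRetweet": False},
])
tweets = load_tweets(sample_export)
print(tweets)
```

With a real export, you would read the file's contents into a string and pass that to `load_tweets` in place of `sample_export`, adjusting the field names to whatever the file actually contains.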
Continue reading: https://towardsdatascience.com/gsdmm-topic-modeling-for-social-media-posts-and-reviews-8726489dc52f?source=rss----7f60cf5620c9---4