Using Natural Language Processing (NLP) and K-Means to cluster unlabelled text in Python

This guide goes through how we can use Natural Language Processing (NLP) and K-means in Python to automatically cluster unlabelled product names to quickly understand what kinds of products are in a data set.

This article is Part 2 and will cover: K-means Clustering, Assessing Cluster Quality and Fine-tuning.

If you haven’t already, please read Part 1 which covers: Preprocessing and Vectorisation.

Now that we have our word matrices, we can get to the sexy part: clustering them.

K-means clustering allocates data points into discrete groups based on their similarity or proximity to each other. We specify the number of clusters K and the algorithm iteratively assigns each observation to a cluster until each cluster’s observations are as close as possible to its mean (or centroid).

Theoretically, similar food names should be clustered together because they have similar values for the same words (and n-grams).
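To see the idea in isolation, here is a minimal sketch of K-means grouping nearby points, using a hypothetical toy 2-D data set rather than our real word matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2-D points (stand-ins for vectorised food names)
pts = np.array([[0, 0], [0, 1], [1, 0],
                [10, 10], [10, 11], [11, 10]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
labels = km.labels_
# The first three points end up sharing one label and the last three
# the other, because each group sits close to its own centroid.
```

The same mechanism applies to our word matrices, just in many more dimensions: rows with similar word counts sit near each other and get pulled into the same cluster.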

Optimal number of clusters

How do we know what to specify as K? We can use the elbow method to test different values for K and compare the distances of each data point from their centroids (the sum of squared errors or SSE).

To understand the types of foods in our data set, we want to balance having food names be as similar as possible in a cluster (low SSE) and having meaningful clusters of more than 1 or 2 food names each.

Let’s do this with our bag of words matrix. We know there are 851 distinct words across our food names, so there can’t be more food types than that.

The more clusters we create, the lower the SSE should be, and the closer each cluster’s points will be to its centroid. If we extended the graph to K=1,500 (one cluster for each distinct food name), the SSE would be 0.
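The elbow test above can be sketched as follows. This is a minimal version assuming `X` is the bag of words matrix from Part 1; here a small random matrix stands in for it:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((100, 20))  # stand-in for the real 1,500 x 851 matrix

# Fit K-means for a range of K values and record the SSE (inertia)
sse = {}
for k in range(2, 20, 2):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_  # sum of squared distances to nearest centroid
```

Plotting K against `sse[k]` and looking for where the curve flattens gives the elbow point.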

The elbow point looks to be at around K=200: beyond this point, each additional cluster yields a smaller and smaller reduction in SSE.

Creating the clusters

Let’s start here and test K=200.
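This step can be sketched as below, again assuming `X` is the bag of words matrix and `df` is the food names DataFrame from Part 1 (small stand-ins are used here, with a correspondingly smaller K):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((50, 10))  # stand-in for the 1,500 x 851 matrix
df = pd.DataFrame({"food_name": [f"item_{i}" for i in range(50)]})

k = 5  # in the article this would be K=200
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
df["cluster"] = km.labels_  # each row gets a cluster id from 0 to k-1
```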

Now each row in our bag of words matrix has been assigned to a cluster between 0 and 199, and you can see that our ground spices in the first 5 rows are in the same cluster. Wahoo!

Creating the clusters was easy enough. Now we want to know if the clustering appropriately answers the question of what kinds of foods are in the data set. We already know that more clusters should mean a lower SSE. But how does that affect how meaningful the clusters are?

We need other measures to assess cluster quality.
