

K-Means Clustering Project (Pierian Data)

For this project we will attempt to use KMeans Clustering to cluster universities into two groups: private and public. It is very important to note that we actually have the labels for this data set, but we will NOT use them for the KMeans clustering algorithm, since KMeans is an unsupervised learning algorithm.

Under normal circumstances you use the KMeans algorithm precisely because you don’t have labels. In this case we will use the labels to get an idea of how well the algorithm performed, but you won’t usually be able to do this for KMeans, so the classification report and confusion matrix at the end of this project don’t truly make sense in a real-world setting!

The Data
We will use a data frame with 777 observations on the following 18 variables.
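As a minimal sketch of this workflow, assuming a small synthetic stand-in for the 777-row College data frame (the real file and its 18 variables aren’t loaded here, and the two features below are invented for illustration), the cluster-then-evaluate step might look like:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the College data: two hypothetical numeric features.
rng = np.random.default_rng(42)
private = rng.normal([2000, 70], [300, 5], size=(50, 2))
public = rng.normal([9000, 40], [1500, 5], size=(50, 2))
X = np.vstack([private, public])
y_true = np.array([1] * 50 + [0] * 50)  # held-out labels, NOT given to KMeans

# Fit KMeans with k=2 without using the labels.
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Only now do we peek at the labels to judge the clustering.
# Note: cluster ids are arbitrary, so the matrix may come out "flipped".
cm = confusion_matrix(y_true, km.labels_)
```

Because KMeans never sees `y_true`, the cluster numbered 0 may correspond to either class; in a real unlabeled setting this comparison would not be possible at all.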

Mastering Clustering Methods in Python

Clustering is the process of separating different parts of data based on common characteristics. Disparate industries including retail, finance and healthcare use clustering techniques for various analytical tasks. In retail, clustering can help identify distinct consumer populations, which can then allow a company to create targeted advertising based on consumer demographics that may be too complicated to inspect manually. In finance, clustering can detect different forms of illegal market activity like orderbook spoofing in which traders deceitfully place large orders to pressure other traders into buying or selling an asset. In healthcare, clustering methods have been used to figure out patient cost patterns, early onset neurological disorders and cancer gene expression.


K-means clustering: find my tribe!

What is K-means clustering?

Find hidden information

At its core, K-means clustering is an algorithm that tries to categorize sample data according to certain rules. If our sample data are vectors of real numbers, the simplest criterion is the distance between the data points.

Suppose I have 4 sample data points in the two dimensional space:

(0, 0), (0, 1), (0, 3), (0, 4)

[Figure: plot of the four data points (Image by Author)]

and suppose I know that I have two groups: how do I find the center of each group and the member data points of each group?

The first step is to pick a starting point for each group. In this example, it’s quite easy to simply eyeball it and, say, pick (0, 0) as the starting point for the first group and (0, 3) as the starting point for the second group.
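Using scikit-learn (an assumption; the article doesn’t name a library), we can run K-means on exactly these four points with those eyeballed starting points:

```python
import numpy as np
from sklearn.cluster import KMeans

# The four sample points and the two eyeballed starting centers.
X = np.array([[0, 0], [0, 1], [0, 3], [0, 4]], dtype=float)
start = np.array([[0, 0], [0, 3]], dtype=float)

km = KMeans(n_clusters=2, init=start, n_init=1).fit(X)
# (0, 0) and (0, 1) join the first group; (0, 3) and (0, 4) join the second.
# The centers converge to (0, 0.5) and (0, 3.5), the mean of each group.
```

Each iteration assigns every point to its nearest center and then moves each center to the mean of its assigned points; here the assignment stabilizes after a single pass.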


Image Segmentation with Clustering

The Fundamentals of K-Means and Fuzzy C-Means Clustering and their usage for Image Segmentation
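As a minimal sketch of the K-means half of this idea (Fuzzy C-Means is not in scikit-learn, and a tiny synthetic "image" stands in for a real photo), segmentation amounts to clustering pixels by colour and repainting each pixel with its cluster center:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 20x20 RGB "image": a reddish half and a bluish half, plus noise.
rng = np.random.default_rng(0)
img = np.zeros((20, 20, 3))
img[:, :10] = [0.9, 0.1, 0.1]
img[:, 10:] = [0.1, 0.1, 0.9]
img += rng.normal(0, 0.02, img.shape)

# Cluster the pixel colours, then paint each pixel with its cluster center.
pixels = img.reshape(-1, 3)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pixels)
segmented = km.cluster_centers_[km.labels_].reshape(img.shape)
```

The same reshape-cluster-repaint pattern works on a real image array loaded with any imaging library; only the number of clusters and the image source change.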


Clustering Product Names with Python — Part 2

Using Natural Language Processing (NLP) and K-Means to cluster unlabelled text in Python

This guide goes through how we can use Natural Language Processing (NLP) and K-means in Python to automatically cluster unlabelled product names to quickly understand what kinds of products are in a data set.

This article is Part 2 and will cover: K-means Clustering, Assessing Cluster Quality and Finetuning.

If you haven’t already, please read Part 1 which covers: Preprocessing and Vectorisation.

Now that we have our word matrices, let’s get clustering.

This is the sexy part: clustering our word matrices.

K-means clustering allocates data points into discrete groups based on their similarity or proximity to each other.
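A hedged sketch of that step, using hypothetical product names (the real data set and the TF-IDF settings come from Part 1 and aren’t reproduced here):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical product names standing in for the unlabelled data set.
names = [
    "chocolate chip cookies", "oatmeal raisin cookies", "sugar cookies",
    "mature cheddar cheese", "grated parmesan cheese", "cream cheese spread",
]
X = TfidfVectorizer().fit_transform(names)  # the word matrix from Part 1
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# km.labels_ assigns each product name to one of the two clusters.
```

Names sharing words ("cookies", "cheese") end up near each other in the TF-IDF space, so K-means groups them together.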


Clustering Product Names with Python — Part 1

The method consists of the following steps:

  • Preprocessing the text (the food names) into clean words so that we can turn it into numerical data.
  • Vectorisation, which is the process of turning words into numerical features to prepare for machine learning.
  • Applying K-means clustering, an unsupervised machine learning algorithm, to group food names with similar words together.
  • Assessing cluster quality through cluster labelling and visualisation.
  • Finetuning steps 1–4 to improve cluster quality.
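The first two steps above might be sketched like this (the cleaning rule and sample names are illustrative assumptions, not the article's exact pipeline):

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(name):
    # Step 1: lowercase and keep only letters and spaces.
    return re.sub(r"[^a-z ]+", " ", name.lower()).strip()

foods = ["Choc. Chip Cookies 500g", "CHEDDAR Cheese (Mature)"]
clean = [preprocess(f) for f in foods]

# Step 2: vectorise the cleaned names into a TF-IDF word matrix,
# one row per food name, one column per word.
X = TfidfVectorizer().fit_transform(clean)
```

The resulting matrix `X` is what gets fed to K-means in step 3.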

This article is Part 1 and will cover: Preprocessing and Vectorisation.

Be sure to also check out Part 2 which will cover: K-means Clustering, Assessing Cluster Quality and Finetuning.

Full disclosure: this data set actually comes with a column ‘Classification Name’ with 268 categories but for demonstration purposes, let’s pretend it’s not there 😉

This guide will use Pandas, NumPy, scikit-learn, FuzzyWuzzy, Matplotlib and Plotly.


Clustering types with various applications

Clustering types and their usage areas are explained with python implementation

Ibrahim Kovan

Unlabeled datasets can be grouped according to their similar properties using unsupervised learning techniques. However, each algorithm takes a different view of what counts as similarity. Unsupervised learning can provide detailed information about the structure of a dataset as well as assign labels to the data.


How to Create Stunning Web Apps for your Data Science Projects

By Murallie Thuwarakesh, Data Scientist at Stax, Inc.


Web development isn’t a data scientist’s core competency. Most data scientists don’t bother to learn different technologies to do it. It’s just not their cup of tea.

Yet, most data science projects also have a software development component. Developers sometimes understand the problem differently, and they use different technologies. This often causes friction and unproductively drains the precious time of both teams.

Also, visualization tools such as Tableau and Power BI focus more on data exploration. Yet, it’s only part of a complete data science project. If you need to integrate a machine learning model, they are far from perfect.


Are Topics Also Communities of Words?

In data science, we use unsupervised algorithms to help us find natural (data-driven) groupings of data. Probably the most widely applied clustering algorithm is K-Means. When those data are words, however, other algorithms like Latent Dirichlet Allocation (LDA) are more popular. LDA is more popular than K-Means because LDA will assign multiple topics to a single document, whereas K-Means optimizes for mutually exclusive groups (aka hard clustering).

The drawback with both approaches is that each requires the user to input a specific number of clusters/topics for the model to then attempt to “find” in the data. Having to input the number of topics a priori can be a challenge because we often don’t know what the optimal number of groupings should be.
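To make the soft-versus-hard distinction concrete, here is a small sketch on toy documents (invented for illustration) using the scikit-learn implementations: LDA returns a topic mixture per document, while K-means returns a single cluster id:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "stock market trading prices",
    "market prices and trading volume",
    "soccer match goals and players",
    "players scored goals in the match",
]
counts = CountVectorizer().fit_transform(docs)

# LDA: each document gets a *mixture* over the 2 topics (each row sums to 1).
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
doc_topics = lda.transform(counts)

# K-means: each document gets exactly one cluster id (hard clustering).
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(counts)
```

Both calls take the number of topics/clusters (2 here) up front, which is exactly the a priori choice the paragraph above describes.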


A deep dive into partitioning around medoids

Throughout the series, we have implemented many different algorithms; let’s compare them a bit regarding runtime and outcome. Because we implemented everything in base R without taking advantage of vectorization, the runtime will be significantly longer than with optimized algorithms built in C or Fortran.

Clustering outcome

Let’s start by visualizing the results. Of course the colors for the “same” cluster can differ between the different algorithms, because they do not know which cluster belongs to which species.
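For comparison with the base-R version discussed in the series, here is a minimal k-medoids sketch in Python. It uses a simplified swap step rather than full PAM build/swap phases, and assumes no cluster ever goes empty:

```python
import numpy as np

def pam(X, k, max_iter=100, seed=0):
    """Simplified partitioning around medoids on a small data set."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)  # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            # New medoid: the member minimizing total distance to its cluster.
            costs = D[np.ix_(members, members)].sum(axis=0)
            new_medoids[j] = members[np.argmin(costs)]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, labels
```

Unlike K-means, the cluster centers here are always actual data points (medoids), which is why PAM works with any distance matrix, not just Euclidean coordinates.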