## What is K-means clustering?

Finding hidden structure in data

At its core, K-means clustering is an algorithm that tries to categorize sample data according to certain rules. If our sample data are vectors of real numbers, the simplest criterion is the distance between the data points.
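For vectors of real numbers, the usual choice is Euclidean distance. A minimal sketch (the function name is my own):

```python
import math

# Euclidean distance between two points, given as tuples of coordinates.
def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(distance((0, 0), (0, 3)))  # 3.0
```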

Suppose I have four sample data points in two-dimensional space:

(0, 0), (0, 1), (0, 3), (0, 4)

and suppose I know that there are two groups. How do I find the center of each group and the data points that belong to it?

The first step is to pick a starting point for each group. In this example, it’s quite easy to simply eyeball the data and, say, pick (0, 0) as the starting point for the first group and (0, 3) as the starting point for the second group.

But let’s not do that, since that would make this example too easy.

Suppose I actually picked (0, 0) and (0, 1) as starting points: (0, 0) represents group 0, and (0, 1) represents group 1. The next step is to assign the four data points to the two groups. We do this by calculating the distance from each data point to each group’s starting point and assigning the data point to the group that’s closest.

Starting with data point (0, 0), its closest group point is obviously (0, 0), with distance 0, so data point (0, 0) is assigned to group 0. Moving on to the second data point, (0, 1), it’s closer to group 1, so we assign it to group 1. After going through every data point, we have:

```
group 0: data points = (0, 0)
group 1: data points = (0, 1), (0, 3), (0, 4)
```
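The assignment step above can be sketched as follows (the `distance` helper and variable names are my own):

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

points = [(0, 0), (0, 1), (0, 3), (0, 4)]
means = [(0, 0), (0, 1)]  # starting points for group 0 and group 1

# Assign each point to the group whose mean is closest.
groups = {0: [], 1: []}
for p in points:
    closest = min(range(len(means)), key=lambda i: distance(p, means[i]))
    groups[closest].append(p)

print(groups)  # {0: [(0, 0)], 1: [(0, 1), (0, 3), (0, 4)]}
```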

Now that we have assigned data points to each group, the mean of each group needs to be updated. The mean of group 0 remains unchanged at (0, 0), but the mean of group 1 is now:

```
mean of group 1 = [(0, 1) + (0, 3) + (0, 4)] / 3 = (0, 2.666...)
```

And we have our new groups:

```
group 0: mean = (0, 0)
group 1: mean = (0, 2.666...)
```
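The mean update is just a coordinate-wise average of the group’s points. A minimal sketch:

```python
points = [(0, 1), (0, 3), (0, 4)]  # current members of group 1

# The new mean is the coordinate-wise average: zip(*points) groups the
# x-coordinates together and the y-coordinates together.
mean = tuple(sum(coords) / len(points) for coords in zip(*points))
print(mean)  # (0.0, 2.666...)
```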

But now that we have updated the group means, the data points need to be reassigned to make sure each is still closest to its own group’s mean:

```
(0, 0): closest to (0, 0)
(0, 1): closest to (0, 0)!!
(0, 3): closest to (0, 2.666...)
(0, 4): closest to (0, 2.666...)
```

Notice now that point (0, 1) is no longer closest to the mean of group 1, so it needs to be reassigned to group 0:

```
group 0: data points = (0, 0), (0, 1)
group 1: data points = (0, 3), (0, 4)
```

Alternating between assigning data points to groups and updating group means, we eventually reach a stable state where no data point switches groups anymore, and those are our final K-means clusters:

```
group 0: mean = (0, 0.5), data points = (0, 0), (0, 1)
group 1: mean = (0, 3.5), data points = (0, 3), (0, 4)
```
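Putting the two steps together, the whole loop can be sketched like this (function and variable names are my own; this is a minimal version, not a production implementation):

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, means):
    while True:
        # Assignment step: each point goes to the group with the nearest mean.
        groups = [[] for _ in means]
        for p in points:
            closest = min(range(len(means)), key=lambda i: distance(p, means[i]))
            groups[closest].append(p)
        # Update step: each mean becomes the coordinate-wise average of its
        # group's points (an empty group keeps its old mean).
        new_means = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else m
            for g, m in zip(groups, means)
        ]
        # Stable state: the means stopped moving, so no point switches groups.
        if new_means == means:
            return means, groups
        means = new_means

means, groups = kmeans([(0, 0), (0, 1), (0, 3), (0, 4)], [(0, 0), (0, 1)])
print(means)   # [(0.0, 0.5), (0.0, 3.5)]
print(groups)  # [[(0, 0), (0, 1)], [(0, 3), (0, 4)]]
```

Running it on the four points above, starting from (0, 0) and (0, 1), reproduces the worked example: after a couple of iterations the means settle at (0, 0.5) and (0, 3.5).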