Throughout the series, we have implemented many different algorithms, let’s compare them a bit regarding runtime and outcome. Because we implemented everything in base R without taking advantage of vectorization, the runtime will be significantly longer than using optimized algorithms built in C or FORTRAN.
Let’s start by visualizing the results. Of course the colors for the “same” cluster
can differ between the different algorithms, because they do not know which
cluster belongs to which species.
Most algorithms do find more or less correct clusters, with the exception of k-medians. We also see that the PAM algorithm does actually not perform any swaps at all, highlighting the strength of its BUILD phase!
Also keep in mind if you do compare the number of iterations between PAM and Lloyd k-medoids that PAM only performs a single swap per iteration, whereas the Lloyd k-medoids performs a swap for each current medoid, the number of total swaps is therefore k * iterations.
If you want to learn more about with which objective metrics you can judge your clustering outcomes, check out the corresponding section in my article on k-means:
Finally, let’s compare the runtimes of the different algorithms, and let’s also
check how much faster the implementation in FORTRAN from R is:
As expected, PAM is the slowest algorithm, followed by Lloyd style k-medoids.
Since the other lines are very close on the scale lets look at ratios instead:
Our vanilla k-means algorithm is 4000x slower than the base k-means! This
demonstrates the drastic performance gain you can get if one implements an
algorithm more efficiently and in a lower-level language, like C++. But our
goal here was not efficiency, but understanding.
Congratulations if you made it this far. With PAM, you know now a very sophisticated clustering method that can be robustly applied to many data sets. Due to its high computational cost, it might not be completely suitable for very large data sets. If this is the case for you, check out algorithms designed for that, such as CLARA or CLARANS. This post also concludes my mini-series on k-means and related clustering algorithms. Stay tuned on what comes next!
Continue reading: https://towardsdatascience.com/a-deep-dive-into-partitioning-around-medoids-a77d9b888881?source=rss—-7f60cf5620c9—4