Understanding K-means and Hierarchical Clustering
Let’s start by understanding: what is clustering?
Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other clusters.
Technically speaking, it is an unsupervised machine learning method: we are given a set of feature vectors without labels, and we attempt to group them into natural clusters.
The aim here is to:
· Find the hidden structure in unlabeled data.
· Create stable clusters.
So what do we mean by stable clusters?
For successful grouping or segmentation, the clusters formed must be stable, meaning the created groups should show intra-cluster homogeneity and inter-cluster heterogeneity. So ultimately we are trying to check:
· How far apart are the clusters from each other?
· How tightly are the elements within a cluster bound together? (One metric that captures both ideas is the silhouette score, sketched below.)
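As an illustration, here is a minimal sketch of that check using the silhouette score, assuming Python with NumPy and scikit-learn (libraries the source does not prescribe); the points and labels are made up:
# Sketch: measuring cluster quality with the silhouette score.
# High values (close to 1) mean tight clusters that are well separated.
import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]])  # toy 2-D points
labels = np.array([0, 0, 1, 1])                                 # a hypothetical clustering
print(silhouette_score(X, labels))                              # close to 1 for this toy data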
Applications of Clustering:
· Marketing and retail companies use this method to segment customers based on buying history (e.g., platinum, gold, and silver customers).
· HR teams use it for behavioral analysis of employees.
· It is broadly used in the medical sciences, for example to build groups of similar genes.
(1) Hierarchical or Linkage-Based Clustering:
Hierarchical clustering is probably the simplest and most straightforward clustering paradigm. The (agglomerative) algorithm works as follows:
>> Start by assigning each point to its own cluster, so if you have N observations you now have N clusters, each initially containing just one item.
>> Find the most similar (closest) pair of clusters and merge them into a single cluster, so that you now have (N-1) clusters.
>> Continue this merging process until all the elements are clustered into a single large cluster.
To define this algorithm precisely we need to answer two questions:
1. How do we measure the distance between two clusters?
2. When do we stop merging?
There are generally three ways to measure the distance between clusters (a small code sketch of all three follows this list).
1. Single-linkage clustering: the distance between two clusters is the shortest distance from any member of one cluster to any member of the other cluster.
D(C1, C2) = min{ d(x, y) : x ∈ C1, y ∈ C2 }
2. Complete-linkage clustering: the distance between two clusters is the greatest distance from any member of one cluster to any member of the other cluster.
D(C1, C2) = max{ d(x, y) : x ∈ C1, y ∈ C2 }
3. Average-linkage clustering: the distance between two clusters is the average distance between a point in one cluster and a point in the other cluster.
D(C1, C2) = ( ∑ d(x, y) for x ∈ C1, y ∈ C2 ) / (|C1| * |C2|)
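Here is a minimal NumPy/SciPy sketch of these three definitions (the function name and the toy clusters are mine, not from the source):
# Sketch: the three linkage distances between two clusters C1 and C2.
import numpy as np
from scipy.spatial.distance import cdist

def linkage_distance(C1, C2, method="single"):
    d = cdist(C1, C2)         # pairwise distances d(x, y) for x in C1, y in C2
    if method == "single":    # shortest pairwise distance
        return d.min()
    if method == "complete":  # greatest pairwise distance
        return d.max()
    if method == "average":   # sum of all pairwise distances / (|C1| * |C2|)
        return d.mean()
    raise ValueError(method)

C1 = np.array([[0.0, 0.0], [1.0, 0.0]])
C2 = np.array([[4.0, 0.0], [5.0, 0.0]])
print(linkage_distance(C1, C2, "single"),    # 3.0
      linkage_distance(C1, C2, "complete"),  # 5.0
      linkage_distance(C1, C2, "average"))   # 4.0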
The outcome of this method can be described by a dendrogram, which is a tree structure over the clustered elements, like the one illustrated below.
By looking at the dendrogram one can often tell which linkage was used to build the clusters: single linkage tends to produce a straggly, chain-like dendrogram, whereas complete or average linkage tends to produce a more balanced, proper tree-like structure.
Structure of Hierarchical Clustering:
In this example we initially have 5 data points and we assign each one to its own cluster: ({a}, {b}, {c}, {d}, {e}).
Looking carefully, b & c and d & e are closest to each other, so we merge them, leaving 3 clusters: ({a}, {b, c}, {d, e}). The merging continues in the same way until finally we have a single large cluster {a, b, c, d, e}.
Finally, if you want to turn the dendrogram into actual clusters, you apply a stopping criterion, typically the number of clusters you want (or a distance threshold at which to cut the tree).
Hierarchical clustering is also called a greedy algorithm, because at every step it merges the locally closest pair of clusters (under the chosen linkage) and never undoes a merge.
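As a sketch of the whole procedure, here is a SciPy version (the toy points and the choice of average linkage are mine; drawing the dendrogram additionally needs matplotlib):
# Sketch: agglomerative clustering with SciPy, then cutting the dendrogram into clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[0.0, 0.0], [2.0, 2.0], [2.1, 2.0], [8.0, 8.0], [8.2, 8.1]])  # 5 toy points

Z = linkage(X, method="average")                 # build the merge tree ("single"/"complete" also work)
dendrogram(Z)                                    # draws the tree (requires matplotlib)
labels = fcluster(Z, t=3, criterion="maxclust")  # stopping criterion: at most 3 clusters
print(labels)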
(2) K-Means Clustering:
The K-means clustering method constructs a partition of a dataset D of N objects into a set of k clusters. Given the number of clusters k, we need to optimize the clustering criterion.
The K-means algorithm works as follows (a minimal code sketch follows these steps):
>> Randomly pick k data points and use them as the initial cluster centers, or seeds.
>> Assign each data point to its nearest center. This is called the assignment step.
>> For each cluster, re-compute the center, which is simply the mean of all points currently assigned to that cluster. This is called the optimization (update) step, after which the data points are re-assigned to the new centers.
>> Repeat the assignment and optimization steps until convergence.
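Here is a minimal NumPy sketch of these steps (the function name and initialization details are mine, and empty clusters are not handled; this is not a production implementation):
# Sketch of the K-means loop: random seeds, assignment step, optimization step, repeat.
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 1: random seeds
    for _ in range(max_iter):
        # step 2 (assignment): distance from every point to every center, pick the nearest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3 (optimization): each center becomes the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                # step 4: stop at convergence
            break
        centers = new_centers
    return labels, centers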
Now the question arises: when do we stop this algorithm?
While repeating the assignment and optimization steps, once we reach a situation where the cluster centers do not change between two consecutive iterations, the algorithm has converged and we can stop.
As we discussed, during each assignment step we need to assign data points to their nearest centers. So how do we measure the distance?
The distance measure used in K-means clustering is the Euclidean distance. The Euclidean distance between two points in Euclidean space is the length of the line segment between the two points, i.e., the square root of the sum of squared coordinate differences.
We also discussed that during each new assignment we need to calculate the centroids of the clusters. A centroid is found by taking the mean of the data points belonging to that cluster.
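Both computations are one-liners in NumPy (a small sketch; the points are made up):
# Sketch: Euclidean distance between two points, and the centroid of a cluster.
import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])
print(np.linalg.norm(p - q))        # sqrt((1-4)^2 + (2-6)^2) = 5.0

cluster_points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(cluster_points.mean(axis=0))  # centroid = per-coordinate mean -> [3. 4.]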
The complexity of one iteration of K-means clustering is O(K * N * D), where
K = number of clusters,
N = number of data points, and
D = time required to compute the distance between a pair of points (roughly proportional to the number of dimensions).
Now let’s understand the cost function of K-means clustering. K-means minimizes the within-cluster sum of squared distances:
J = ∑ (k = 1 … K) ∑ (xi ∈ Ck) ‖ xi − µk ‖²
Where,
xi = the i-th data point
µk = the center (centroid) of the k-th cluster
Ck = the set of points belonging to the k-th cluster
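A small sketch of computing this cost for a given assignment (the function name is mine; scikit-learn’s KMeans reports the same quantity as inertia_):
# Sketch: within-cluster sum of squared distances (the K-means cost function J).
import numpy as np

def kmeans_cost(X, labels, centers):
    cost = 0.0
    for k_idx, mu in enumerate(centers):
        members = X[labels == k_idx]         # the points belonging to cluster k
        cost += ((members - mu) ** 2).sum()  # sum of ||xi - mu_k||^2 over those points
    return cost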
Practical Considerations in K-means Clustering:
1. K-means clustering is not directly applicable to categorical data, since the mean of categories is not defined.
2. The number of clusters (k) has to be pre-determined.
3. The randomly selected initial centroids can have an impact on the final clusters, so the algorithm is usually run several times with different seeds (see the sketch after this list).
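A short scikit-learn sketch touching the last two points: k is fixed up front, and n_init runs several random initializations and keeps the best result (the toy data and parameter values are mine):
# Sketch: scikit-learn KMeans with a fixed k and multiple random restarts.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9]])  # toy numeric-only data

km = KMeans(n_clusters=2, n_init=10, random_state=42)  # k must be chosen in advance
labels = km.fit_predict(X)
print(labels, km.inertia_)  # inertia_ is the cost function discussed above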
I hope you have enjoyed this article. In the next article we will see a demonstration of both K-means and hierarchical clustering in Python.
Thank you.