Dimensionality reduction is not clustering
Too obvious? Most people don’t think so.
Dimensionality reduction and clustering are both unsupervised learning techniques in machine learning: both can help find hidden structure in data when no guidance is available. I will explain this with an example, but before that, let me state explicitly what this post is about. Many data science enthusiasts, especially beginners, do not appreciate that the "clusters" you seem to see in a dimensionality reduction plot are not the same as the clusters that clustering returns. The two techniques have different agendas behind them, and their results should be interpreted with that in mind. Now let's get to the meat of the post.
What is unsupervised learning?
Broadly speaking, there are two ways in which we have learnt things in school (there are more ways including reinforcement learning, but let’s stick to two in this post).
Imagine a teacher comes into class, shows you a picture of a cat and tells you it is a cat, then shows you a picture of a dog and tells you it is a dog. Next, they ask you to identify the differences between the two. This is supervised learning: you are told beforehand that the two pictures are of differently named animals, and you then learn what makes them different.
The other kind of learning is where the teacher shows you a large, mixed-up pile of cat and dog pictures and does not tell you which picture belongs to which group (cat or dog). Instead, they ask you to sort the pictures into two or more piles based on their similarity. This is unsupervised learning: you come up with your own criteria to segregate the pictures into piles. (Ideally the piles you create should be two, but because nobody has told you how many kinds of animals the pile contains, deciding that number is part of the exercise. You might therefore end up with more than two piles.)
Obviously, the first technique is much more robust (provided the labelling is done correctly in the first place). But unfortunately, most of the data we have in the world is unlabelled. The two techniques we are going to discuss in this post help us learn patterns from data in exactly this unsupervised fashion.
Side point to ponder: Is supervised learning always desired? Could there be times when you have the data labels and you would still prefer using unsupervised learning to understand your data?
What are these two unsupervised learning techniques? Why do we need more than one technique to learn? Are they interchangeable, or do they generate the same kind of understanding of the data? These are some of the questions we will try to answer in this post, using a toy dataset, with the analysis and visualization carried out in Python.
What is dimensionality reduction?
Let us continue our example from above. Imagine you have a pile of pictures of different mammals (and you do not know the label of any picture). You can look at all the pictures and come up with criteria to differentiate between them: color, number of hind legs, number of fore legs, presence of whiskers, presence of fur, whether the picture was taken during the day or at night, whether it is a cartoon representation of the animal or a real camera photograph, and so on. Some of these features will be highly correlated. For example, most mammals that have two eyes also have two external pinnae, or external ears (whales and the platypus, which lack external pinnae, being exceptions). Other features will be noise: whether the picture was taken during the day or at night, or whether it is a cartoon or a photograph, will not help in segregating mammals. Dimensionality reduction techniques like Principal Component Analysis (PCA) try to retain only the most important criteria, the ones that best distinguish all the mammal pictures. Because you are reducing the criteria, or dimensions, to a small number, these techniques are called dimensionality reduction techniques.
To get a more detailed understanding of PCA and dimensionality reduction, you can read my previous post on this topic. This is the link.
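To make this a little more concrete, here is a minimal sketch of my own (a toy illustration, not part of the wine analysis later in this post). It builds three made-up "picture features", two of them strongly correlated and one pure noise, and lets PCA collapse them into two components. All variable names here are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA
# Toy data: 100 "pictures" described by 3 made-up features.
# feature_1 and feature_2 are strongly correlated (like eyes and external ears),
# feature_3 is pure noise (like "photo taken during the day or at night").
rng = np.random.default_rng(0)
feature_1 = rng.normal(size=100)
feature_2 = feature_1 * 0.9 + rng.normal(scale=0.1, size=100)
feature_3 = rng.normal(size=100)
X = np.column_stack([feature_1, feature_2, feature_3])
# Collapse the three criteria into two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (100, 2): fewer dimensions, same samples
print(pca.explained_variance_ratio_)  # most of the variance sits in the first component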
What is clustering?
Now imagine you want to explicitly segregate the mammal pictures into two clusters (say the pile contains only dog and cat images). You would take all the features and calculate a similarity score. (There are different ways to measure similarity; we are not going to get into that in this post.) Based on the similarity score, you would assign each mammal picture to one of the two clusters. Here we have predetermined how many clusters we want, i.e. n = 2; in reality, choosing the number of clusters is neither a one-size-fits-all exercise nor trivial, but that is a discussion for another post. Hopefully, the features we use to calculate the similarity scores will end up segregating all dog and dog-like pictures into one group and all cat and cat-like pictures into another.
A very important point to note here is that the similarity scores are a function of the features used to calculate them. Noisy features can make the similarity scores unreliable. That is why, in real-world analyses, most pipelines first carry out dimensionality reduction on the features to obtain a smaller, more informative set, and then use those features to carry out clustering.
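To give a rough flavour of what "assigning by similarity" means, here is another small sketch of my own (an illustration under made-up numbers, not the method used later in this post): each sample is assigned to whichever of two hypothetical cluster centres it is closest to in Euclidean distance, which is essentially what a single K-means assignment step does.
import numpy as np
# Two made-up cluster centres in a 2-feature space (illustrative values only).
centres = np.array([[0.0, 0.0],
                    [5.0, 5.0]])
# A few made-up samples described by the same two features.
samples = np.array([[0.5, -0.2],
                    [4.8, 5.1],
                    [2.4, 2.2]])
# Euclidean distance from every sample to every centre.
distances = np.linalg.norm(samples[:, None, :] - centres[None, :, :], axis=2)
# Each sample goes to the centre it is most similar (closest) to.
assignments = distances.argmin(axis=1)
print(assignments)  # prints [0 1 0]; the third sample is slightly closer to the first centre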
Applying PCA and K-means clustering to the wine dataset
So far, we have discussed the concepts behind dimensionality reduction and clustering. But what is the entire hullabaloo about? Why do people so often mistake the results of dimensionality reduction for those of clustering, and sometimes consider the two interchangeable? This is a little too nuanced to grasp through words alone. Therefore, we are going to work through the subtle difference between the two techniques in code.
At this point, keep in mind that, unlike clustering, dimensionality reduction involves no similarity-score-based explicit grouping of the mammal pictures. So even though dimensionality reduction gives some idea of the variation among the pictures, its motivation is not to segregate them into clusters, but only to represent the pictures with the least amount of information that is still sufficient to explain the differences among them.
Still not clear? Do not worry, read ahead and code along. I suggest you come back and re-read the paragraph above once you have reached the end of this section. I assure you, it will make sense in retrospect.
The code below does the following:
- Import all the libraries necessary to run the code.
- We load an in-built dataset on wine classification from the sklearn library and save it to a variable called wine. There are three kinds of wines (class_0, class_1 and class_2; don't worry, we won't use these labels while learning). These wines are described by thirteen chemical features, such as phenol content, alcohol content and so on.
- Next, we carry out K-means clustering on the thirteen wine features. Note that we explicitly ask for three clusters, but you could change this to two, four or more. (For brevity, we set n_clusters=3 since we know there are three kinds of wines.)
- In the next line of code, we independently carry out Principal Component Analysis, a dimensionality reduction technique. We explicitly ask for the thirteen features to be collapsed into just two features that can still do a decent job of describing the wine samples. (Note how the exercise is not to segregate the wine samples but to describe them as well as the original thirteen features did. Read this again.)
- In the next line, let's plot the results. K-means clustering returns three clusters based on a similarity score calculated from the thirteen features (we have not used the wine labels so far). Let's plot them over the PCA-generated plot, with the x-axis being principal component 1 and the y-axis being principal component 2. In essence, we are plotting the information of thirteen features describing the wines using just two features in a lower dimension (each a linear combination of the original features). Let's color cluster 1 as red, cluster 2 as blue and cluster 3 as green. (The clusters might or might not line up with the original wine class labels. Let's see later.)
- Next, let us color the same PCA plot by the original wine labels (instead of the clusters returned by K-means clustering). We color class_0 wines as red, class_1 wines as blue and class_2 wines as green.
In the end, we obtain two versions of the PCA plot with different coloring schemes. The plot on the left is color-coded by the cluster to which each wine sample was assigned (cluster 1, 2 or 3). The plot on the right is color-coded by the original wine class label of each sample.
Let us interpret the two plots in the next section.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.datasets import load_wine
# Load the wine dataset
wine = load_wine()
# Perform k-means clustering on the wine dataset
kmeans = KMeans(n_clusters=3, random_state=0).fit(wine.data)
# Perform PCA on the wine dataset
pca = PCA(n_components=2).fit_transform(wine.data)
# Plot the results of k-means clustering and PCA
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
# Plot the k-means clusters
colors = np.array(['red', 'blue', 'green'])
ax1.scatter(pca[:, 0], pca[:, 1], c=colors[kmeans.labels_])
ax1.set_title('K-means clustering')
# Plot the PCA results
ax2.scatter(pca[:, 0], pca[:, 1], c=colors[wine.target])
ax2.set_title('PCA')
plt.show()
Dimensionality reduction is not clustering
We can already see the difference between the two plots (figure below).
If we look at the plot on the right, we see that wine class_1 (blue) and class_2 (green) are not well segregated based on the information contained in the thirteen features of wine chemistry. class_0 (red) is relatively well separated from the other two wine classes.
However, the plot on the left shows that K-means has returned three distinct clusters based on the chemistry of the wine samples, agnostic of what the real wine classes were. As you can appreciate, the clusters are distinct in terms of similarity score, but they re-assign the samples: the three wine classes do not map neatly onto the three clusters, and the original labels have lost their meaning. Let's interpret the left plot to see this. The green cluster (cluster 3 returned by K-means) is a mixture of class_1 and class_2 wines (the original labels can be read off the plot on the right: class_0 is red, class_1 is blue and class_2 is green). Similarly, the red cluster (cluster 1 returned by K-means) is a mixture of class_0, class_1 and class_2 wines. Finally, the blue cluster (cluster 2 returned by K-means) is largely class_0 (red in the right plot), with one wine sample from class_1 (blue in the right plot) included as well.
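If you would rather verify this mixing numerically than by eye, a quick cross-tabulation of the K-means cluster labels against the original wine classes does the job. This is a small addition of my own on top of the code above; it assumes the wine and kmeans objects from that code are still in memory, and the exact counts may differ slightly depending on your scikit-learn version and the random_state.
import numpy as np
# Count how many samples of each original class fall into each K-means cluster.
# Rows: clusters returned by K-means; columns: original wine classes 0, 1 and 2.
contingency = np.zeros((3, 3), dtype=int)
for cluster_label, true_label in zip(kmeans.labels_, wine.target):
    contingency[cluster_label, true_label] += 1
print(contingency)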
What does it tell us?
The dimensionality reduction result from PCA simply shows how the wine samples relate to each other based on the information encoded in the thirteen features describing their chemistry. It shows that class_0 is distinct from class_1 and class_2, whereas class_1 and class_2 are not well segregated in terms of the overall variance explained by the features. Clustering, on the other hand, tells us that if overall similarity scores are calculated from the thirteen features and then used to segregate the wine samples into three clusters, the original labels might not be a good representation of the samples after all. One possibility is that the class labels were assigned in the first place simply because the wines came from three different cultivators in the same region of Italy (for more information about the dataset, read its documentation on sklearn's official page), rather than because of their chemistry. So clustering reveals that, because there is so much variability among wine samples carrying the same label, a better way of grouping them could be based on the similarity scores obtained from the thirteen features. Recall that PCA never had any agenda to do this in the first place.
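If you want to quantify how much of the "overall variance explained by the features" the two principal components actually retain, you can inspect PCA's explained variance ratio. The short sketch below is an extra of mine: it refits PCA on the same data (the earlier code kept only the transformed coordinates from fit_transform, not the fitted PCA object), and the exact percentages are whatever your run produces.
from sklearn.decomposition import PCA
# Refit PCA so we can inspect the fitted object (assumes wine is loaded as above).
pca_model = PCA(n_components=2).fit(wine.data)
print(pca_model.explained_variance_ratio_)        # variance captured by each component
print(pca_model.explained_variance_ratio_.sum())  # total fraction of variance kept by the 2 components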
Let the plots simmer a bit longer. Re-read the last section with the plots in front of you. The longer you stare at them, the more likely the epiphany (if you haven't had one already).
Conclusion
This post was motivated by my observation that, across the literature, people use dimensionality reduction techniques as a replacement for clustering. That is not correct. The two techniques exist for different reasons and reveal different aspects of the data. Clustering returns clusters based on the similarity among samples, calculated from the features that describe them. Dimensionality reduction techniques simply preserve the relationships among samples in a lower dimension; there is no explicit agenda to carry out any clustering. To put it simply, if the data were telling a story, dimensionality reduction is the ability to tell that story in fewer words, making it less verbose without affecting its overall essence or message, as far as possible. Clustering, on the other hand, is the ability to identify which part of the story is the premise, which part is the build-up, which the conflict, which the escalation and which, finally, the resolution. Both tell different things about the data, and one should not be replaced by the other. In fact, in practice people often carry out dimensionality reduction first, followed by clustering.
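For completeness, here is a minimal sketch of that common "reduce first, then cluster" pipeline on the same wine data. This is my own illustration of the practice mentioned above, not something from the earlier analysis, and the resulting cluster labels depend on random_state and your scikit-learn version.
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.datasets import load_wine
# Dimensionality reduction first, then clustering on the reduced coordinates.
wine = load_wine()
reduced = PCA(n_components=2).fit_transform(wine.data)
labels = KMeans(n_clusters=3, random_state=0).fit_predict(reduced)
print(labels[:10])  # cluster assignments for the first ten wine samples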
What I have not covered in this post is what happens when clustering is carried out on data with redundancy. I will leave you with this: data with and without redundancy (correlated features) can yield very different results. Something I could cover in the next posts.
Bonus coding section
Here's the code used above, rewritten in an object-oriented programming style for more advanced coders:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.datasets import load_wine
class WineVisualizer:
    def __init__(self):
        self.wine = load_wine()
        self.kmeans = None
        self.pca = None

    def perform_clustering(self, n_clusters):
        self.kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(self.wine.data)

    def perform_pca(self, n_components):
        self.pca = PCA(n_components=n_components).fit_transform(self.wine.data)

    def plot_original_labels_and_clusters(self):
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
        # Plot the k-means clusters overlaid on the PCA coordinates
        colors = np.array(['red', 'blue', 'green'])
        ax1.scatter(self.pca[:, 0], self.pca[:, 1], c=colors[self.kmeans.labels_])
        ax1.set_title('K-means clusters overlaid on PCA plot')
        # Plot the original wine class labels overlaid on the PCA coordinates
        ax2.scatter(self.pca[:, 0], self.pca[:, 1], c=colors[self.wine.target])
        ax2.set_title('Original wine class labels overlaid on PCA plot')
        plt.show()
# Create an instance of WineVisualizer
visualizer = WineVisualizer()
# Perform k-means clustering with 3 clusters
visualizer.perform_clustering(n_clusters=3)
# Perform PCA with 2 components
visualizer.perform_pca(n_components=2)
# Plot the k-means clusters and the original labels side by side
visualizer.plot_original_labels_and_clusters()