Coding Principal Component Analysis (PCA) as a Python class | by Mohak Sharda
Finding hidden patterns in the data from a coder’s perspective
Principal Component Analysis (PCA) is one of the most powerful unsupervised learning techniques. It helps you find hidden structure in data with a large number of variables (also called dimensions).
Let’s take an example to understand this.
Imagine you have a box of hundreds of books and you want to select and gift some books to your friend such that each book you choose is unique in its own way. The idea here is that if there are two books that have the same genre, same kind of ending and same kind of character arc, you would want to just choose one of them to gift to your friend and not both.
Also, imagine you could catalogue hundreds of characteristics for these books. Some characteristics would always coexist in books. For example, one characteristic could be “the book is a murder mystery” and another could be “at least one person dies in the book”. As you can notice, whenever a book is catalogued as a murder mystery, the second characteristic will always be true. We would say these two characteristics are linearly correlated, or that the information they encode is redundant.
PCA helps us deal with such situations. It reduces the number of variables (or dimensions) while still describing the data as well as possible. That is why it is called a dimensionality reduction technique.
Before we proceed with coding PCA, there are two more terms that are important to keep in mind. PCA is a linear technique that helps preserve the global structure of the data in lower dimensions. What does that mean?
Let us first understand what global means. You could segregate books based on many characteristics, some more nuanced than others. Naturally, the more nuanced the characteristics get, the smaller the clumps you start clubbing books into, even if they fall under the same genre. For example, whether the character who dies in a murder mystery is always a girl below twenty years of age is a much more nuanced characteristic. Such nuanced characteristics give rise to many local structures, and PCA will not deal with these. What PCA does is reduce the dimensions so as to maximize the variance of the overall data. On a lighter note, on a scale of “don’t judge a book by its cover” to “judge a book by its cover”, PCA leans towards the latter.
Second, PCA looks at only linear relationships among variables. This can be confusing. People often equate correlation with a linear relationship. Yes, a perfectly linear relationship is correlated, but variables can also be associated non-linearly. For example, the number of words in a book is linearly correlated with the number of pages; PCA can detect such relationships and reduce the dimensions accordingly. However, some relationships are non-linear. For example, the popularity of a book could increase with the complexity of the plot, but beyond a certain point, if the plot becomes overly complex, the popularity could start decreasing. PCA does not consider such relationships.
Mathematically speaking, PCA uses the magic of linear algebra. That is what we are going to code.
We will make use of the NumPy library in Python. We will define a class PCA with a constructor __init__.
Notice in code block I below that we pass an argument, desired_principal_components, to our constructor, which we want the user to specify.
Then we define three variables for our class: desired_principal_components, extracted_eigenvectors and feature_mean. We initialise desired_principal_components with the user’s input (a compulsory argument) and the other two as None.
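A minimal sketch of code block I could look like this (assuming only NumPy; the attribute names follow the description above):

import numpy as np

class PCA:

    def __init__(self, desired_principal_components):
        #number of principal components the user wants to keep
        self.desired_principal_components = desired_principal_components
        #top eigenvectors, filled in later by fit()
        self.extracted_eigenvectors = None
        #per-feature mean used for centering, filled in later by fit()
        self.feature_mean = None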
Next, we define a function called fit(). It takes as an argument the feature table from the user, with each row as one data point and each column as a feature describing an attribute of that data point.
We will do the following:
a) Center each feature around its mean. Why are we doing this? To prevent features with larger means from dominating (and hence falsely appearing to contribute more to a given principal component). By centering each feature around its mean, we also make the calculation of the covariance in the next step easier and the features comparable.
b) Calculate a covariance matrix to describe how a given feature covaries with the other features. NumPy’s covariance function expects the features to be rows and each sample or data point to be a column, so we transpose the input matrix using NumPy’s .T attribute.
c) Next, we calculate eigenvectors and eigenvalues from the covariance matrix we computed in step b. The eigenvectors give us the orthogonal (uncorrelated) directions that maximize the variance (coming from the covariance matrix), and the eigenvalues give us the magnitude of that variance along each direction. Hold on to these; later we will project our original points onto the line/plane/hyperplane that passes through the points in the directions given by the eigenvectors.
d) We want to keep only the number of principal components the user asked for through the variable desired_principal_components, reporting the top principal components (PCs) arranged in decreasing order of their ability to maximize the variance among the data points based on the input feature set. Of course, these principal components will contain non-redundant information. But what does a principal component represent? I will clarify that in a moment. Read along.
e) Since we need to fetch the PCs in decreasing order of variance, we need to sort our eigenvalues. We will use the argsort() method from NumPy. Since we want the eigenvalues sorted in decreasing order, we use the slicing functionality [::-1], i.e. reversing the order that argsort() returns by default. This gives us the indices of the eigenvalues arranged from highest to lowest.
For example, suppose the vector of eigenvalues is [13, 5, 25, 4]. Here the index of the highest eigenvalue is 2 (i.e. 25), the next highest is 0 (i.e. 13), and so on. So the index vector that argsort()[::-1] returns would be [2, 0, 1, 3].
f) Next, we need to fetch the eigenvectors corresponding to the sorted eigenvalues based on these indices. Since NumPy returns the eigenvectors as columns, we first transpose the eigenvectors matrix so that each row is one eigenvector. Finally, we assign the desired number of eigenvectors (arranged in decreasing order of their corresponding eigenvalues) to our variable extracted_eigenvectors, which we had initialised to None in the constructor.
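Putting steps a) to f) together, the fit() method could be sketched as follows (a sketch of a method inside the same PCA class; the intermediate variable names are mine):

    def fit(self, feature_table):
        #a) center each feature around its mean; store the mean for reuse in transform()
        self.feature_mean = np.mean(feature_table, axis=0)
        centered_features = feature_table - self.feature_mean
        #b) covariance matrix; np.cov expects features as rows, hence the transpose
        covariance_matrix = np.cov(centered_features.T)
        #c) eigenvectors (directions) and eigenvalues (magnitude of variance)
        eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)
        #e) indices of the eigenvalues sorted from highest to lowest
        sorted_indices = np.argsort(eigenvalues)[::-1]
        #f) eigenvectors come back as columns, so transpose, reorder by eigenvalue
        #d) and keep only the desired number of principal components
        eigenvectors = eigenvectors.T
        self.extracted_eigenvectors = eigenvectors[sorted_indices][:self.desired_principal_components]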
What does a principal component mean, though? Well, the answer isn’t that straightforward, but let us understand it intuitively. Say we had only two dimensions in our book example: one dimension is the number of pages and the other is the number of words. Since both are linearly correlated, PCA would merge them into principal component 1 (with maximum variance). That new, reduced dimension is a linear combination of the two original dimensions and could represent a new feature, say book content, that does the best job of describing the books as one variable instead of the two original ones. Remember, we are naming PC1 book content here because that is what the number of pages and the number of words together intuitively represent. Sometimes a PC is a linear combination of many linearly correlated variables and might not have an intuitive single-phrase name. Therefore, one can think of PCs as abstract versions of the original variables. One can, though, look at the individual contributions of the different original variables that go into making up a given principal component. These are called factor loadings in PCA lingo.
Now, getting back to our code. So far we have used the covariance structure of the data to extract the eigenvectors and eigenvalues (i.e. the directions and magnitudes) from our data, so that we can pick the top PCs that maximize the variance in the data. All that is left to do is project our original data onto the line/plane/hyperplane given by the eigenvectors. These projections are what we will call the principal components.
g) For this, we will make another function called transform(). It takes the original data matrix as input. We center the data again (using the feature means stored during fit), since the eigenvectors were computed on centered data, so the projection should also use centered inputs. Projecting the data points into the principal component space is nothing but taking a dot product between the centered original data matrix and the matrix of eigenvectors we sorted and extracted in the fit function. Again, we first need to transpose the extracted_eigenvectors variable. And VOILA! We have the principal components!!!
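A sketch of transform(), again as a method of the same PCA class, could look like this:

    def transform(self, feature_table):
        #center the data with the same feature means computed in fit()
        centered_features = feature_table - self.feature_mean
        #project the centered data onto the extracted eigenvectors (note the transpose)
        return np.dot(centered_features, self.extracted_eigenvectors.T)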
Now let us test our class PCA() on a dataset.
We are going to use the built-in classification datasets in the sklearn library. Remember, we are not doing supervised learning. But once we visualize our data using PCA on a two-dimensional graph, we will overlay the class labels of whatever dataset we use, to see how well PCA was able to segregate the data without explicitly trying to segregate the classes.
Let us take the example of sklearn’s wine dataset, which contains chemical features measured for wines grown in the same region of Italy but derived from three different cultivars, i.e. three different kinds of wine. Let us see whether, given the features for each wine, PCA can segregate the three types of wines or not.
Full documentation can be found at https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-dataset
The code for plotting the principal components using the class we created above is shown below and is self-explanatory (with comments for each line of code).
Before we dive into the code, let’s see what the data comprises (you can verify these details directly from sklearn, as shown after the list below). There are 13 numeric features describing various components of the wines. These are:
- alcohol
- malic acid
- magnesium
- total phenols
- flavanoids
- nonflavanoid phenols
- color intensity
- proanthocyanins
- hue
- OD280/OD315 of diluted wines
- proline
- ash
- alkalinity of ash
There are three classes of wines:
a. class_0 (59 samples),
b. class_1 (71 samples) and,
c. class_2 (48 samples).
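As a quick sanity check (not part of the PCA class), you can pull these names and counts directly from sklearn:

from sklearn import datasets
import numpy as np

wine_data = datasets.load_wine()
#the 13 feature names listed above
print(wine_data.feature_names)
#the three class names: ['class_0', 'class_1', 'class_2']
print(wine_data.target_names)
#number of samples per class: [59 71 48]
print(np.bincount(wine_data.target))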
Back to the code now.
##importing necessary libraries
import numpy as np
import matplotlib.pyplot as plt
#the class PCA() that we made above is saved as pca.py
from pca import PCA
#importing datasets from sklearn that has the wine dataset
from sklearn import datasets
##loading the wine dataset() from sklearn
wine_data = datasets.load_wine()
##extracting the features table and labels
#pca will be carried on the following input matrix
input_features_matrix = wine_data.data
#for coloring our datapoints later based on these labels describing the three kinds of wines
target_labels = wine_data.target
##carrying out PCA based on our code
#this will initialise the variables we defined in the constructor of class PCA()
pca_object = PCA(2)
#let's extract the eigenvectors
pca_object.fit(input_features_matrix)
#let's project the data onto the eigenvectors to get PCs
principal_components = pca_object.transform(input_features_matrix)
#let's plot the first two principal components
PC1=principal_components[:,0]
PC2=principal_components[:,1]
plt.scatter(PC1,PC2,c=target_labels,edgecolor='none',alpha=0.8,cmap=plt.cm.get_cmap('viridis',3))
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar()
plt.show()
Running the script above produces a scatter plot of the first two principal components, with points colored by wine class.
Remember, we are just overlaying the class type on the plot at the end. What this tells us is that, based on the global, linear structure hidden among the 13 variables describing the three kinds of wines, class 0 wine is well segregated from class 1 and class 2 wines. However, there is large variability among the different samples of class 0 wine along the first principal component. It also tells us that the wines represented by class 1 and class 2 are very similar to each other and are not well segregated based on the 13 features using PCA. (Hmm… could it be that the class 1 and class 2 cultivars produce chemically very similar wines?)
What does this leave us with? There are many possibilities that could explain the observations and dictate the next steps in our analysis. I am listing four here:
- Maybe one needs more features that could segregate class 1 and class 2 using PCA.
- Maybe there are more nuanced local structures in the data, which PCA cannot capture, that would better segregate class 1 and class 2.
- Maybe there are non-linear relationships in the features of class 1 and class 2 wines that, again, PCA cannot handle.
- Maybe class 1 and class 2 wines are indeed the same wine and labelled differently because of different sellers trying to increase their business.
Could you think of other reasons? Maybe there is mislabeling? Or inaccurate measurements?
Remember, PCA is just one unsupervised technique. There are other unsupervised techniques one could put this dataset through, for example t-SNE, UMAP or PaCMAP. All of these can preserve the local and/or global structure of the data better than PCA and are worth trying to explore more nuanced structures (Oops! I use nuanced A LOT, I just realised).
Finally, since we have the class labels for each of our samples, one could use supervised classification techniques like logistic regression, support vector machines, tree-based models and so on. It is worth trying now, since PCA reveals that the features at hand are not complete garbage, and this data exploration suggests it would be worth spending more time on the more robust supervised learning methods.
What I have not covered in this post is how to find which of the 13 features contribute most to principal component 1, and the magnitude of each feature’s individual contribution to that PC. That is important and interesting to know, since it would help us understand what exactly makes these wines so different from each other. Could you think of a way to code these factor loadings, which I mentioned earlier, into the PCA class that we made?
Maybe I could do another post for it next time!
For now, I leave you with the code that we built so far. I encourage you to try other datasets provided by sklearn, besides the wine dataset. For example, you could try the breast cancer dataset with two classes (malignant versus benign). Its 30 features describe characteristics of cell nuclei in digitised images of breast mass samples. The link for the breast cancer dataset is: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer
Happy PCAing! *pours a glass of wine* (I choose class 0 wine)