PCA (Principal Component Analysis) is an unsupervised machine learning algorithm used to reduce the dimensionality of the given data. It has first been invented by Karl Pearson (1901) and independently developed by Harold Hotelling (1933).
Dimensionality reduction refers to the mapping of the original high-dimensional data onto a lower-dimensional space, thus reducing the risk of model overfitting and improving the generalization ability of the model on the test set. The essential idea for dimension reduction is that highly redundant features are likely to be compressible, in fact, some components of the data may express less variance and more noise than others. Thus, representing the data with a smaller subset, conserving only its principal components, it’s useful not only to preserve its main features but also to discard the noise.
Many algorithms that work fine in low dimensions become intractable when the input is high-dimensional.Curse of dimensionality
To show an example, let’s consider a dataset two-dimensional dataset, like (height, weight). Let’s plot the points of the dataset in a plane and carefully analyze the variations among them.
If we’re going to only see the data along one dimension, though, it might be better to make that dimension the principal component with most variation. We don’t lose much by dropping PC2 since it contributes the least to the variation in the data set, and at the same time, we can clearly see that PC1 expresses the most variance.
Furthermore, reducing the dimensionality of the data can also be useful for other tasks such as:
- Data visualization: projection of high-dimensional data onto 2D or 3D.
- Data compression: efficient storage and retrieval.
- Noise removal: positive effect on query accuracy.
Before rushing into the details of how this algorithm works it is better understanding two fundamental mathematical notions that constitute the core idea of PCA.
Eigenvalues and Eigenvectors
Given A a square matrix (p x p) and a vector v (p x 1), if we multiply Av we would get another vector u of dimension (p x 1). In some particular cases it can result that u = Av = λu, this means that the original vector v after the multiplication with A has been translated by a constant term λ but its direction remained the same as before. In this case, u is an eigenvector of the matrix A and λ is the corresponding eigenvalue.
We can also put the problem in a matrix form AU = UΛ, where U = [u1,..,up] and Λ = diag(λ1,.. λp). For symmetric matrixes, the eigenvectors can be orthogonal: UUT = UTU = I. Thus, the two following equations result to be valid: UTAU = Λ and A = UΛUT.
In the next sections, we are going to learn how PCA works. A brief summary of the 5 steps of the pipeline is reported in the list below:
- Data normalization.
- Computation of the covariance matrix.
- Eigenvalues decomposition.
- Selection of the optimal number of principal components.
- Data encoding and reconstruction.
The first step consists in normalizing the data. Having unscaled data whose features are measured with different units size can have a negative influence on the comparison between the variance of its features, thus leading to poor results when performing dimension reduction. Normalizing the data means subtracting the mean from each value of each feature and dividing them by their variance:
Keeping in mind that PCA’s main task is to identify the features with the maximum variance, the next step consists of computing the covariance matrix among the different dimensions of the data. The covariance is a measure of how much two random variables X and Y vary together, they will have a high covariance when presenting a high correlation, both positive or negative, and a low covariance when from varying one variable we have that it’s pretty hard to predict the value of another one. Moreover, two variables whose covariance is equal to 0 are called “independent” which means that a variation of X does not depend on a variation of Y. Below is reported the formula to calculate the covariance between two variables:
We can then use the covariance to calculate the covariance matrix, which is the square matrix given by the entries Cij=σ(xi,xj) where C is a d x d matrix and where d describes the number of dimensions of the data. Is it also interesting to note that the covariance matrix is symmetric and its diagonal entries are the variances of the features Cii=σ(xi,xi) and the other entries are their covariances. Given that X (n x d) si the matrix representing the input data, the covariance matrix can be directly expressed as:
But since the data has been normalized we have that each feature has mean equal to 0, thus:
Finally, the covariance matrix will look like this (supposing d = 2):
This step consists of calculating the eigen decomposition of the covariance matrix C. For this task there exist many matrix decomposition methods that can be used such as Singular Value Decomposition (SVD) which is already implemented in Tensorflow. This process results in a list of eigenvalues and eigenvectors of the data. The eigenvectors represent the components of the data (the direction) in the reduced dimensionality space, whereas the eigenvalues represent the magnitudes for these components.
Next, we let U be the eigenvectors corresponding to the top m eigenvalues. This process is equivalent to choose the m most representative components of the original data. If there are eigenvalues close to zero, they represent components that may be discarded such as noise or redundant features.
Selection of the optimal number of principal components
Choosing the optimal number m of the principal components to be extracted plays a fundamental role in the final performance of the PCA algorithm. One way to select the appropriate value for m consists of measuring the total variance accounted for by all the p principal components extracted by the eigenvalues decomposition. The percentage of the variance accounted for by the i-th eigenvector is:
Where λi corresponds to the i-th eigenvalue. Afterward, we set the desired threshold of minimum variance s we want to maintain (e.g. s = 95%) and satisfy the constraint by finding the appropriate value of m such that:
Data encoding and reconstruction
Now the PCA model is ready to extract the m principal components from the d dimensional data. The new data can now be projected onto the reduced space and is calculated as:
Where X (n x d) is the original data that we wish to project, U^T (d x m) is the transpose of the matrix containing the m chosen principal components and Y is the projected data (n x m).
Finally, it is also possible to reconstruct the original data from the projected data:
The selected eigenvectors of the covariance matrix C define a new axis system onto the which project the data:
In the above picture, u1 and u2 are the two eigenvectors representing the two most important components of the data onto the which the point x1,x2 from the original data will be projected. In particular, any point px in the X-axis system can be projected onto the U-axis system:
Where px (d x 1) represents the coordinate of the point in the X-axis system and pu (m x 1) represents the same point in the U-axis system.
Useful examples of PCA results on 2D data represented in 1D dimensions, 3D data in 2D dimensions and even more can be visualized at the following link: http://setosa.io/ev/principal-component-analysis/