Principal Component Analysis
Principal component analysis (PCA) is a statistical technique for analyzing all of a data set's dimensions and reducing them as far as possible while losing as little information as possible.
▫️PCA and factor analysis (FA) are both dimensionality reduction techniques, but they serve different purposes.
▫️PCA aims to maximize the variance captured by its components and is used for feature extraction, while FA aims to explain the observed variables through underlying latent factors.
Covariance matrix:
- Measures the interdependence (covariance) between pairs of features or variables; identifying and reducing this redundancy helps improve model performance.
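Below is a minimal sketch of how a covariance matrix can be computed with NumPy; the synthetic data and variable names are illustrative assumptions, not part of the original text.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # 100 samples, 3 features

# Center each feature, then average the outer products:
# cov = (X - mean).T @ (X - mean) / (n - 1)
X_centered = X - X.mean(axis=0)
cov_manual = X_centered.T @ X_centered / (X.shape[0] - 1)

# np.cov expects variables in rows by default, hence rowvar=False.
cov_numpy = np.cov(X, rowvar=False)
print(np.allclose(cov_manual, cov_numpy))  # True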
PCA fully accounts for the variance and covariance of the data using linear combinations of the original variables (the principal components). You can use PCA to analyze how the data is dispersed and to identify the directions along which that dispersion is greatest.
When to use PCA?
- Whenever we need the features we work with to be uncorrelated with each other.
- Whenever we need fewer features derived from a higher-dimensional feature set, as in the sketch below.
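The sketch below shows one plausible end-to-end use: reducing a standardized four-feature data set to two uncorrelated components with scikit-learn. The data set and parameter choices are illustrative assumptions.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data                           # 150 samples, 4 features

X_scaled = StandardScaler().fit_transform(X)   # standardize first
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)        # 150 samples, 2 components

print(pca.explained_variance_ratio_)           # share of variance per component
print(np.round(np.cov(X_reduced, rowvar=False), 6))  # off-diagonals ~ 0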
Variance - measures how widely the data is spread along each dimension of the feature space.
Standardizing data - rescaling each feature to a common scale (typically zero mean and unit variance) so that no feature biases the result merely because of its units, as in the sketch below.
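A minimal sketch of z-score standardization (zero mean, unit variance per feature); the small example matrix is an assumption for illustration, and the result matches scikit-learn's StandardScaler under default settings.

import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))   # ~[0, 0]
print(X_std.std(axis=0))    # [1, 1]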
- Eigenvectors of the covariance matrix point along the directions of maximum variance in the data set and are used to compute the principal components; the corresponding eigenvalue gives the magnitude of the variance along that eigenvector.
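A sketch of extracting principal components via eigendecomposition of the covariance matrix; the synthetic two-dimensional data is an assumption. np.linalg.eigh is used because the covariance matrix is symmetric.

import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=500)

cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # returned in ascending order

# Sort descending: the largest eigenvalue marks the direction of
# maximum variance (the first principal component).
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

X_projected = (X - X.mean(axis=0)) @ eigenvectors  # data in principal-component coordinates
print(eigenvalues)                                 # variance along each component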
Variance alone can only describe the spread of the data in directions parallel to the axes of the feature space; capturing spread in arbitrary directions requires the covariance matrix.
The eigenvector of the covariance matrix with the largest eigenvalue always points in the direction of the largest variance in the data, and the amount of that variance equals the corresponding eigenvalue. The eigenvector with the second-largest eigenvalue is always orthogonal to the first and points in the direction of the second-largest spread of the data.
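A self-contained check of these claims on synthetic data (an assumption for demonstration): projecting onto the sorted eigenvectors yields variances equal to the eigenvalues, and the eigenvectors are mutually orthogonal.

import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0], [[4.0, 2.0], [2.0, 2.0]], size=2000)

vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

proj = (X - X.mean(axis=0)) @ vecs
print(np.allclose(proj.var(axis=0, ddof=1), vals))   # True: projected variance = eigenvalue
print(np.allclose(vecs.T @ vecs, np.eye(2)))         # True: eigenvectors are orthonormal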
The covariance matrix of the observed data is directly related to a linear transformation of white (uncorrelated, unit-variance) data. This linear transformation is entirely defined by the eigenvectors and eigenvalues of the covariance matrix.
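One way to see this relationship, sketched below with assumed synthetic values: white data transformed by T = V · sqrt(L), where V holds the eigenvectors and L the eigenvalues of a target covariance matrix, acquires (approximately) that covariance.

import numpy as np

rng = np.random.default_rng(3)
target_cov = np.array([[3.0, 1.2],
                       [1.2, 2.0]])

vals, vecs = np.linalg.eigh(target_cov)
T = vecs @ np.diag(np.sqrt(vals))        # transformation defined by eigenvectors/eigenvalues

white = rng.normal(size=(100_000, 2))    # white data: uncorrelated, unit variance
colored = white @ T.T                    # apply the linear transformation

# The sample covariance of the transformed data approximates target_cov.
print(np.round(np.cov(colored, rowvar=False), 2))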
Principal component analysis is not only used for simple dimensionality reduction but can also
be used to identify key features and solve multicollinearity problems.
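A brief illustration of the multicollinearity point, using synthetic near-collinear features (an assumption for demonstration): after PCA, the resulting components are uncorrelated.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
x1 = rng.normal(size=1000)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=1000)     # nearly collinear with x1
X = np.column_stack([x1, x2])

print(np.round(np.corrcoef(X, rowvar=False), 3))  # strong off-diagonal correlation

Z = PCA().fit_transform(X)
print(np.round(np.corrcoef(Z, rowvar=False), 3))  # off-diagonals ~ 0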