A family of statistical analyses that allows you to study and explore a data set of quantitative variables.
Canonical Correlation Analysis (CCA)
Principal Component Analysis (PCA)
Linear discriminant analysis (LDA)
Redundancy Analysis (RDA)
One of the goals behind PCA is to graphically represent the essential information contained in a (quantitative) data table.
A useful way to discover (hidden) patterns in the data by compressing it.
PCA is not performed directly on the data, but on either the covariance or the correlation matrix of the data.
PCA is applied to data in a rectangular format:
\[ X_{n,p} = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,p} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n,1} & a_{n,2} & \cdots & a_{n,p} \end{bmatrix} \]
\(n\) objects in the rows (observations)
\(p\) quantitative variables in the columns (variables)
 | mpg | cyl | disp | hp | drat | wt | qsec |
---|---|---|---|---|---|---|---|
Mazda RX4 | 21.00 | 6.00 | 160.00 | 110.00 | 3.90 | 2.62 | 16.46 |
Mazda RX4 Wag | 21.00 | 6.00 | 160.00 | 110.00 | 3.90 | 2.88 | 17.02 |
Datsun 710 | 22.80 | 4.00 | 108.00 | 93.00 | 3.85 | 2.32 | 18.61 |
Hornet 4 Drive | 21.40 | 6.00 | 258.00 | 110.00 | 3.08 | 3.21 | 19.44 |
Hornet Sportabout | 18.70 | 8.00 | 360.00 | 175.00 | 3.15 | 3.44 | 17.02 |
Valiant | 18.10 | 6.00 | 225.00 | 105.00 | 2.76 | 3.46 | 20.22 |
Duster 360 | 14.30 | 8.00 | 360.00 | 245.00 | 3.21 | 3.57 | 15.84 |
Merc 240D | 24.40 | 4.00 | 146.70 | 62.00 | 3.69 | 3.19 | 20.00 |
Merc 230 | 22.80 | 4.00 | 140.80 | 95.00 | 3.92 | 3.15 | 22.90 |
Merc 280 | 19.20 | 6.00 | 167.60 | 123.00 | 3.92 | 3.44 | 18.30 |
One option to visualize this dataset is to look at all pairwise correlations.
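For instance, a scatterplot matrix and the pairwise correlations can be obtained with base R; a minimal sketch using the seven variables shown above:

pairs(mtcars[, 1:7])           # scatterplot of every pair of variables
round(cor(mtcars[, 1:7]), 2)   # the corresponding correlation matrix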
The main idea of a PCA is to find vectors (also called loadings or eigenvectors) that explain most of the variation in the data.
For this example, we will perform a PCA on two variables from the mtcars data set.
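The two variables used throughout this example are mpg and drat (as can be seen in the correlation matrix later on); a minimal sketch of how to extract them:

# Keep only the two variables of interest
mydata <- mtcars[, c("mpg", "drat")]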
How would you draw two axes that capture most of the variation?
There are a few things to keep in mind before performing a PCA.
For practical reasons (i.e. to make computations easier), PCA should always be performed on mean-centered data (variables with mean = 0).
This simply means that the mean of the variable is subtracted from each observation.
\[ \bar{X}_j = \frac{1}{n} \sum_{i = 1}^{n} X_{ij} = 0 \]
Additionally, variables should be scaled to unit variance when \(X_1, X_2, \ldots, X_p\) are not in the same units.
Example:
Should we scale?
Since the variables are not in the same units, we first need to scale the data to zero mean and unit variance.
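A minimal sketch of this step, assuming the mydata object created above (scale() both centers and standardizes by default):

# Subtract each column mean and divide by each column standard deviation
mydata_scaled <- scale(mydata)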
We can verify that the data has a mean of 0.
## [1] 7.112366e-17
## [1] -2.918672e-16
We can verify that the data has a variance of 1.
## [1] 1
## [1] 1
A PCA is performed on either the covariance or the correlation matrix of the data.
Which one to choose:
Covariance matrix if the variables are in the same units.
Correlation matrix if the variables are not in the same units (i.e. if you scaled the data to unit variance).
Since our variables are not in the same units, the first step is to calculate the correlation matrix of the variables of interest.
my_cor <- cor(mydata_scaled) # Correlation matrix of the scaled variables
my_cor # This is the matrix on which we will perform the PCA
## mpg drat
## mpg 1.0000000 0.6811719
## drat 0.6811719 1.0000000
The next step of a PCA is to perform a singular value decomposition (SVD) on the correlation/covariance matrix of the data.
SVD is a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items (Kirk Baker, 2013).
We will not go into the mathematics of it, but keep in mind that this procedure produces vectors that are orthogonal to each other.
In R, this decomposition can be performed with the eigen() function (for a symmetric matrix such as a correlation matrix, the eigendecomposition gives the same result as the SVD).
# Find the eigenvectors of the correlation matrix
my_eigen <- eigen(my_cor)
# Just renaming things so it is easier to understand
rownames(my_eigen$vectors) <- c("mpg", "drat")
colnames(my_eigen$vectors) <- c("PC1", "PC2")
my_eigen
## eigen() decomposition
## $values
## [1] 1.6811719 0.3188281
##
## $vectors
## PC1 PC2
## mpg 0.7071068 -0.7071068
## drat 0.7071068 0.7071068
One important characteristic of eigenvectors/loadings is that they are orthogonal (at 90 degrees to each other).
## PC1 PC2
## PC1 1 NA
## PC2 NA 1
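The NA values above most likely come from applying cor() to the loading matrix: its first column is constant (0.707, 0.707), so its standard deviation is zero and the correlation is undefined. A more direct check of orthogonality is the cross product of the loadings; a minimal sketch:

crossprod(my_eigen$vectors)   # t(V) %*% V, which returns the 2 x 2 identity matrix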
At this point, the PCA is almost done. What we need to do is to plot the new axes (PCs/eigenvectors) together with the scaled data.
When you perform a PCA, the original variables are combined into new variables: the principal components.
What we want to do is to create a simple x, y plot using the newly created axes (defined by the loadings). Scores can be seen as the positions of the scaled observations on these newly created axes (principal components).
This type of plot is called a biplot.
These values are called the scores. They are the \(x\) and \(y\) positions of the observations on the newly created axes (PCs).
 | PC1 | PC2 |
---|---|---|
Mazda RX4 | 0.51 | 0.29 |
Mazda RX4 Wag | 0.51 | 0.29 |
Datsun 710 | 0.65 | 0.02 |
Hornet 4 Drive | -0.53 | -0.84 |
Hornet Sportabout | -0.75 | -0.43 |
Valiant | -1.34 | -0.87 |
Duster 360 | -1.19 | 0.17 |
Merc 240D | 0.63 | -0.38 |
Merc 230 | 0.75 | 0.11 |
Merc 280 | 0.32 | 0.53 |
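The scores in the table above can be reproduced by projecting the scaled data onto the eigenvectors; a minimal sketch using the objects created earlier:

# Each score is the position of an observation on the new axes (PCs)
my_scores <- mydata_scaled %*% my_eigen$vectors
head(round(my_scores, 2), 10)   # matches the table above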
We know that, by definition, the generated PCs capture most of the variation in the data.
But how much is most of the variation?
Remember that the SVD generated eigenvectors (PCs) and eigenvalues.
## eigen() decomposition
## $values
## [1] 1.6811719 0.3188281
##
## $vectors
## PC1 PC2
## mpg 0.7071068 -0.7071068
## drat 0.7071068 0.7071068
An interesting property of the eigenvalues is that their sum is equal to the total variance of the scaled variables.
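A minimal sketch of this check (the two values of 2 below presumably correspond to these sums):

sum(my_eigen$values)                # sum of the eigenvalues
sum(apply(mydata_scaled, 2, var))   # total variance of the scaled variables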
## [1] 2
## [1] 2
Hence, it becomes easy to calculate the variance explained by each PC.
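A minimal sketch, dividing each eigenvalue by the sum of all eigenvalues:

my_eigen$values                          # variance captured by each PC
my_eigen$values / sum(my_eigen$values)   # proportion of the total variance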
## [1] 1.6811719 0.3188281
## [1] 0.840586 0.159414
We see that PC1 accounts for 84.06% of the total variance, whereas PC2 explains 15.94% of the total variance.
There are many packages to perform PCA in R.
They are probably all good, but I recommend using the rda() function from the vegan package.
The package is not included in base R, so you have to install it using the following code:
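A minimal sketch (the package only needs to be installed once, but must be loaded with library() in every session):

install.packages("vegan")   # only needs to be run once
library(vegan)               # load the package for the current session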
We will use the same data to see if we can obtain the same results with vegan.
## mpg drat
## Mazda RX4 21.0 3.90
## Mazda RX4 Wag 21.0 3.90
## Datsun 710 22.8 3.85
## Hornet 4 Drive 21.4 3.08
## Hornet Sportabout 18.7 3.15
## Valiant 18.1 2.76
A PCA is performed using the rda() function. This can be a bit confusing, since the same function is used to perform redundancy analysis; when no constraining variables are supplied, rda() performs a PCA.
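A minimal sketch of the call (the object name my_pca is an assumption here, chosen to match the screeplot code used later in this section):

# rda() with no constraining variables performs a PCA;
# scale = TRUE runs the analysis on the correlation matrix
my_pca <- rda(mydata, scale = TRUE)
my_pca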
## Call: rda(X = mydata, scale = TRUE)
##
## Inertia Rank
## Total 2
## Unconstrained 2 2
## Inertia is correlations
##
## Eigenvalues for unconstrained axes:
## PC1 PC2
## 1.6812 0.3188
Let's compare the preliminary results from vegan with those we manually calculated earlier.
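The comparison below was presumably produced by printing the two sets of results one after the other, something like:

my_pca            # eigenvalues reported by vegan
my_eigen$values   # eigenvalues obtained earlier with eigen()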
## Call: rda(X = mydata, scale = TRUE)
##
## Inertia Rank
## Total 2
## Unconstrained 2 2
## Inertia is correlations
##
## Eigenvalues for unconstrained axes:
## PC1 PC2
## 1.6812 0.3188
## [1] 1.6811719 0.3188281
The biplot of a PCA can easily be plotted using the biplot() function.
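A minimal sketch (vegan provides a biplot method for PCA results produced by rda()):

biplot(my_pca)   # arrows represent the variables, points the observations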
Let's do a PCA with more variables (i.e. more dimensions).
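A sketch of the calls that presumably produced the output below, this time using the first seven (quantitative) columns of mtcars:

head(mtcars[, 1:7])                          # the seven variables
my_pca <- rda(mtcars[, 1:7], scale = TRUE)   # PCA on the correlation matrix
my_pca                                       # short summary
summary(my_pca)                              # eigenvalues, species and site scores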
## mpg cyl disp hp drat wt qsec
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02
## Valiant 18.1 6 225 105 2.76 3.460 20.22
## Call: rda(X = mtcars[, 1:7], scale = TRUE)
##
## Inertia Rank
## Total 7
## Unconstrained 7 7
## Inertia is correlations
##
## Eigenvalues for unconstrained axes:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## 5.086 1.157 0.345 0.158 0.129 0.076 0.049
##
## Call:
## rda(X = mtcars[, 1:7], scale = TRUE)
##
## Partitioning of correlations:
## Inertia Proportion
## Total 7 1
## Unconstrained 7 1
##
## Eigenvalues, and their contribution to the correlations
##
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Eigenvalue 5.0861 1.1566 0.34485 0.15793 0.1295 0.07586
## Proportion Explained 0.7266 0.1652 0.04926 0.02256 0.0185 0.01084
## Cumulative Proportion 0.7266 0.8918 0.94107 0.96364 0.9821 0.99297
## PC7
## Eigenvalue 0.049198
## Proportion Explained 0.007028
## Cumulative Proportion 1.000000
##
## Scaling 2 for species and site scores
## * Species are scaled proportional to eigenvalues
## * Sites are unscaled: weighted dispersion equal on all dimensions
## * General scaling constant of scores: 3.838088
##
##
## Species scores
##
## PC1 PC2 PC3 PC4 PC5 PC6
## mpg -1.3504 0.1294 -0.2059 -0.442063 0.11108 0.03597
## cyl 1.3895 0.1224 -0.1602 -0.111800 -0.12444 -0.31186
## disp 1.3823 -0.1286 0.1006 -0.338800 -0.07772 0.06461
## hp 1.2686 0.5257 0.1727 0.003969 0.43404 -0.01841
## drat -1.0835 0.6998 0.6434 -0.067493 -0.11576 -0.09317
## wt 1.2802 -0.5029 0.3753 -0.061886 -0.08699 0.14628
## qsec -0.7849 -1.1690 0.2508 -0.035387 0.17113 -0.16276
##
##
## Site scores (weighted sums of species scores)
##
## PC1 PC2 PC3 PC4 PC5
## Mazda RX4 -0.243242 0.579894 -0.30711 0.49108 -1.11434
## Mazda RX4 Wag -0.235052 0.375523 -0.06404 0.40918 -1.00074
## Datsun 710 -0.625323 -0.033350 -0.08749 0.84325 0.13668
## Hornet 4 Drive -0.043821 -0.825952 -0.68534 -0.38259 0.19179
## Hornet Sportabout 0.490303 0.008421 -0.69878 -0.91626 -0.20639
## Valiant 0.040751 -1.290237 -0.83456 0.64467 0.37678
## Duster 360 0.756378 0.512097 -0.30871 0.08974 0.61740
## Merc 240D -0.592484 -0.776765 0.23619 -0.02117 -0.23256
## Merc 230 -0.675542 -1.331752 1.34052 0.13367 1.28885
## Merc 280 -0.154230 -0.055250 0.65363 0.55479 -0.58838
## Merc 280C -0.149548 -0.228874 0.83555 0.82801 -0.47222
## Merc 450SE 0.535921 -0.238442 -0.39350 0.17258 -0.06050
## Merc 450SL 0.467311 -0.212458 -0.57688 0.02675 0.18159
## Merc 450SLC 0.500967 -0.349059 -0.37426 0.45684 0.16381
## Cadillac Fleetwood 1.056127 -0.776612 0.69870 -0.31860 -0.40694
## Lincoln Continental 1.075495 -0.696090 0.89710 -0.26832 -0.31563
## Chrysler Imperial 0.953786 -0.354488 0.98670 -1.09866 0.07216
## Fiat 128 -0.996997 -0.103343 -0.18151 -1.15484 0.38481
## Honda Civic -1.175905 0.672184 0.77724 -0.84613 -0.88707
## Toyota Corolla -1.126816 -0.053228 -0.14169 -1.43287 0.63949
## Toyota Corona -0.589920 -0.524964 0.09976 0.97728 0.67769
## Dodge Challenger 0.560080 -0.266981 -1.31533 0.27287 -0.67591
## AMC Javelin 0.449899 -0.151555 -0.63193 0.29666 -0.79499
## Camaro Z28 0.719098 0.845740 0.64899 0.16942 -0.07955
## Pontiac Firebird 0.583046 -0.135449 -0.57381 -1.40788 -0.33059
## Fiat X1-9 -0.898925 0.060701 -0.19144 0.05260 -0.07467
## Porsche 914-2 -0.736374 0.846646 0.26597 -0.03670 -0.79493
## Lotus Europa -0.776611 0.689230 -1.28006 -0.43856 0.87092
## Ford Pantera L 0.563327 1.576823 0.88082 -0.39160 0.03812
## Ferrari Dino -0.008548 0.855360 -0.60774 1.04767 0.17839
## Maserati Bora 0.824243 1.337699 0.21219 0.38475 2.19332
## Volvo 142E -0.547393 0.044529 0.72081 0.86238 0.02361
## PC6
## Mazda RX4 -0.02832
## Mazda RX4 Wag -0.10902
## Datsun 710 0.63247
## Hornet 4 Drive 0.05870
## Hornet Sportabout -0.49123
## Valiant -0.03029
## Duster 360 -0.04395
## Merc 240D 1.06715
## Merc 230 -1.01059
## Merc 280 -0.39638
## Merc 280C -0.79105
## Merc 450SE -0.40004
## Merc 450SL -0.79893
## Merc 450SLC -1.05883
## Cadillac Fleetwood 0.90127
## Lincoln Continental 1.02310
## Chrysler Imperial 0.99648
## Fiat 128 0.08697
## Honda Civic -0.90773
## Toyota Corolla -0.62008
## Toyota Corona 0.11740
## Dodge Challenger -0.11973
## AMC Javelin -0.92732
## Camaro Z28 -0.18341
## Pontiac Firebird 0.09667
## Fiat X1-9 -0.02567
## Porsche 914-2 1.08370
## Lotus Europa 1.14803
## Ford Pantera L -0.76179
## Ferrari Dino 0.75862
## Maserati Bora -0.01456
## Volvo 142E 0.74836
As already stated, all principal components are orthogonal to each other.
 | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 |
---|---|---|---|---|---|---|---|
PC1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
PC2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
PC3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
PC4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
PC5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
PC6 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
PC7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
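One way this table could have been generated (an assumption, using standard vegan and base R functions) is to correlate the site scores on all seven axes:

site_scores <- scores(my_pca, display = "sites", choices = 1:7)
round(cor(site_scores), 2)   # off-diagonal zeros: the PCs are uncorrelated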
The Kaiser-Guttman criterion simply says that PCs with eigenvalues higher than the average of all eigenvalues should be kept.
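A minimal sketch of the criterion applied to the eigenvalues stored in the rda object:

eig <- my_pca$CA$eig    # eigenvalues of the unconstrained axes
eig[eig > mean(eig)]    # PCs retained by the Kaiser-Guttman criterion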
We can use the screeplot() function to have a graphical overview of all eigenvalues.
# plot the screeplot
screeplot(my_pca)
# add a line representing mean value of all eigenvalues
abline(h = mean(my_pca$CA$eig), col = "red")
Based on this graph, it seems that the first two PCs are important.
Look at the angle between qsec and wt (~90 degrees): in a biplot, the angle between two variable arrows reflects their correlation, so an angle close to 90 degrees means the two variables are nearly uncorrelated.
Using your data (or a subset of your data), perform a PCA. What does this tell you about your data?
Baker, Kirk. 2013. "Singular Value Decomposition Tutorial." https://www.ling.ohio-state.edu/~kbaker/pubs/Singular_Value_Decomposition_Tutorial.pdf
Peres-Neto, Pedro R., Donald A. Jackson, and Keith M. Somers. 2005. “How many principal components? stopping rules for determining the number of non-trivial axes revisited.” Comput. Stat. Data Anal. 49 (4): 974–97. https://doi.org/10.1016/j.csda.2004.06.015.