A family of statistical analyses that allows you to study and explore a data set of quantitative variables.
Canonical Correlation Analysis (CCA)
Principal Component Analysis (PCA)
Linear discriminant analysis (LDA)
Redundancy Analysis (RDA)
One of the goals behind PCA is to graphically represent the essential information contained in a (quantitative) data table.
A useful way to discover (hidden) patterns in the data by compressing it.
PCA is not performed directly on the data, but on either the covariance or the correlation matrix of the data.
PCA is applied to data in a rectangular format:
\[ X_{n,p} = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,p} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n,1} & a_{n,2} & \cdots & a_{n,p} \end{bmatrix} \]
\(n\) objects in the rows (observations)
\(p\) quantitative variables in the columns (variables)
 | mpg | cyl | disp | hp | drat | wt | qsec |
---|---|---|---|---|---|---|---|
Mazda RX4 | 21.00 | 6.00 | 160.00 | 110.00 | 3.90 | 2.62 | 16.46 |
Mazda RX4 Wag | 21.00 | 6.00 | 160.00 | 110.00 | 3.90 | 2.88 | 17.02 |
Datsun 710 | 22.80 | 4.00 | 108.00 | 93.00 | 3.85 | 2.32 | 18.61 |
Hornet 4 Drive | 21.40 | 6.00 | 258.00 | 110.00 | 3.08 | 3.21 | 19.44 |
Hornet Sportabout | 18.70 | 8.00 | 360.00 | 175.00 | 3.15 | 3.44 | 17.02 |
Valiant | 18.10 | 6.00 | 225.00 | 105.00 | 2.76 | 3.46 | 20.22 |
Duster 360 | 14.30 | 8.00 | 360.00 | 245.00 | 3.21 | 3.57 | 15.84 |
Merc 240D | 24.40 | 4.00 | 146.70 | 62.00 | 3.69 | 3.19 | 20.00 |
Merc 230 | 22.80 | 4.00 | 140.80 | 95.00 | 3.92 | 3.15 | 22.90 |
Merc 280 | 19.20 | 6.00 | 167.60 | 123.00 | 3.92 | 3.44 | 18.30 |
One option to visualize this dataset is to look at all pairwise correlations.
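For instance, a scatterplot matrix and the pairwise correlations can be obtained with base R; a minimal sketch using the seven variables shown above:

pairs(mtcars[, 1:7])           # scatterplot of every pair of variables
round(cor(mtcars[, 1:7]), 2)   # the corresponding correlation matrix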
The main idea of a PCA is to find vectors (also called loadings or eigenvectors) that explain most of the variation in the data.
For this example, we will perform a PCA on two variables from the mtcars data set.
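The two variables used throughout this example are mpg and drat (as can be seen in the correlation matrix later on); a minimal sketch of how to extract them:

# Keep only the two variables of interest
mydata <- mtcars[, c("mpg", "drat")]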
How would you draw two axes that capture most of the variation?
There are a few things to keep in mind before performing a PCA.
For practical reasons (i.e. to make computations easier), PCA should always be performed on mean-centered data (variables with mean = 0).
This simply means that the mean of the variable is subtracted from each observation.
\[ \bar{X}_j = \frac{1}{n} \sum_{i = 1}^{n} X_{ij} = 0 \]
Additionally, variables should be scaled to unit variance when \(X_1, X_2, \ldots, X_p\) are not in the same units.
Example:
Should we scale?
Since the variables are not in the same units, we first need to scale the data to zero mean and unit variance.
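A minimal sketch of this step, assuming the mydata object created above (scale() both centers and standardizes by default):

# Subtract each column mean and divide by each column standard deviation
mydata_scaled <- scale(mydata)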
We can verify that the data has a mean of 0.
## [1] 7.112366e-17
## [1] -2.918672e-16
We can verify that the data has a variance of 1.
## [1] 1
## [1] 1
A PCA is performed on either the covariance or the correlation matrix of the data.
Which one to choose:
Covariance matrix if the variables are in the same units.
Correlation matrix if the variables are not in the same units (i.e. if you scaled the data to unit variance).
Since our variables are not in the same units, the first step is to calculate the correlation matrix of the variables of interest.
my_cor <- cor(mydata_scaled) # Correlation matrix of the scaled variables
my_cor # This is the matrix on which we will perform the PCA
## mpg drat
## mpg 1.0000000 0.6811719
## drat 0.6811719 1.0000000
The next step of a PCA is to perform a singular value decomposition (SVD) on the correlation/covariance matrix of the data.
SVD is a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items (Kirk Baker, 2013).
We will not go into the mathematics of it, but keep in mind that this procedure produces vectors that are orthogonal to each other.
In R, this decomposition can be performed with the eigen() function (for a symmetric matrix such as a correlation matrix, the eigendecomposition gives the same result as the SVD).
# Find the eigenvectors of the correlation matrix
my_eigen <- eigen(my_cor)
# Just renaming things so it is easier to understand
rownames(my_eigen$vectors) <- c("mpg", "drat")
colnames(my_eigen$vectors) <- c("PC1", "PC2")
my_eigen
## eigen() decomposition
## $values
## [1] 1.6811719 0.3188281
##
## $vectors
## PC1 PC2
## mpg 0.7071068 -0.7071068
## drat 0.7071068 0.7071068
One important characteristic of eigenvectors/loadings is that they are orthogonal (at 90 degrees to each other).
## PC1 PC2
## PC1 1 NA
## PC2 NA 1
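The NA values above most likely come from applying cor() to the loading matrix: its first column is constant (0.707, 0.707), so its standard deviation is zero and the correlation is undefined. A more direct check of orthogonality is the cross product of the loadings; a minimal sketch:

crossprod(my_eigen$vectors)   # t(V) %*% V, which returns the 2 x 2 identity matrix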
At this point, the PCA is almost done. What we need to do is to plot the new axes (PCs/eigenvectors) together with the scaled data.
When you perform a PCA, the original variables are combined into new variables: the principal components.
What we want to do is to create a simple x, y plot using the newly created axes (defined by the loadings). Scores can be seen as the positions of the scaled observations on these newly created axes (principal components).
This type of plot is called a biplot.
These values are called the scores. They are the \(x\) and \(y\) positions of the observations on the newly created axes (PCs).
 | PC1 | PC2 |
---|---|---|
Mazda RX4 | 0.51 | 0.29 |
Mazda RX4 Wag | 0.51 | 0.29 |
Datsun 710 | 0.65 | 0.02 |
Hornet 4 Drive | -0.53 | -0.84 |
Hornet Sportabout | -0.75 | -0.43 |
Valiant | -1.34 | -0.87 |
Duster 360 | -1.19 | 0.17 |
Merc 240D | 0.63 | -0.38 |
Merc 230 | 0.75 | 0.11 |
Merc 280 | 0.32 | 0.53 |
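The scores in the table above can be reproduced by projecting the scaled data onto the eigenvectors; a minimal sketch using the objects created earlier:

# Each score is the position of an observation on the new axes (PCs)
my_scores <- mydata_scaled %*% my_eigen$vectors
head(round(my_scores, 2), 10)   # matches the table above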
We know that, by definition, the generated PCs capture most of the variation in the data.
But how much is most of the variation?
Remember that the SVD generated eigenvectors (PCs) and eigenvalues.
## eigen() decomposition
## $values
## [1] 1.6811719 0.3188281
##
## $vectors
## PC1 PC2
## mpg 0.7071068 -0.7071068
## drat 0.7071068 0.7071068
An interesting property of the eigenvalues is that their sum is equal to the total variance of the scaled variables.
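A minimal sketch of this check (the two values of 2 below presumably correspond to these sums):

sum(my_eigen$values)                # sum of the eigenvalues
sum(apply(mydata_scaled, 2, var))   # total variance of the scaled variables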
## [1] 2
## [1] 2
Hence, it becomes easy to calculate the variance explained by each PC.
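A minimal sketch, dividing each eigenvalue by the sum of all eigenvalues:

my_eigen$values                          # variance captured by each PC
my_eigen$values / sum(my_eigen$values)   # proportion of the total variance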
## [1] 1.6811719 0.3188281
## [1] 0.840586 0.159414
We see that PC1 accounts for 84.06% of the total variance, whereas PC2 explains 15.94% of the total variance.
There are many packages to perform PCA in R.
They are probably all good, but I recommend using the rda() function from the vegan package.
The package is not included in base R, so you have to install it using the following code:
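A minimal sketch (the package only needs to be installed once, but must be loaded with library() in every session):

install.packages("vegan")   # only needs to be run once
library(vegan)               # load the package for the current session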
We will use the same data to see if we can obtain the same results with vegan.
## mpg drat
## Mazda RX4 21.0 3.90
## Mazda RX4 Wag 21.0 3.90
## Datsun 710 22.8 3.85
## Hornet 4 Drive 21.4 3.08
## Hornet Sportabout 18.7 3.15
## Valiant 18.1 2.76
A PCA is performed using the rda() function. This can be a bit confusing, since the same function is used to perform redundancy analysis; when no constraining variables are supplied, rda() performs a PCA.
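A minimal sketch of the call (the object name my_pca is an assumption here, chosen to match the screeplot code used later in this section):

# rda() with no constraining variables performs a PCA;
# scale = TRUE runs the analysis on the correlation matrix
my_pca <- rda(mydata, scale = TRUE)
my_pca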
## Call: rda(X = mydata, scale = TRUE)
##
## Inertia Rank
## Total 2
## Unconstrained 2 2
## Inertia is correlations
##
## Eigenvalues for unconstrained axes:
## PC1 PC2
## 1.6812 0.3188
Let's compare the preliminary results from vegan with those we manually calculated earlier.
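The comparison below was presumably produced by printing the two sets of results one after the other, something like:

my_pca            # eigenvalues reported by vegan
my_eigen$values   # eigenvalues obtained earlier with eigen()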
## Call: rda(X = mydata, scale = TRUE)
##
## Inertia Rank
## Total 2
## Unconstrained 2 2
## Inertia is correlations
##
## Eigenvalues for unconstrained axes:
## PC1 PC2
## 1.6812 0.3188
## [1] 1.6811719 0.3188281
The biplot of a PCA can easily be plotted using the biplot() function.
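A minimal sketch (vegan provides a biplot method for PCA results produced by rda()):

biplot(my_pca)   # arrows represent the variables, points the observations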
Let's do a PCA with more variables (i.e. more dimensions).
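A sketch of the calls that presumably produced the output below, this time using the first seven (quantitative) columns of mtcars:

head(mtcars[, 1:7])                          # the seven variables
my_pca <- rda(mtcars[, 1:7], scale = TRUE)   # PCA on the correlation matrix
my_pca                                       # short summary
summary(my_pca)                              # eigenvalues, species and site scores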
## mpg cyl disp hp drat wt qsec
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02
## Valiant 18.1 6 225 105 2.76 3.460 20.22
## Call: rda(X = mtcars[, 1:7], scale = TRUE)
##
## Inertia Rank
## Total 7
## Unconstrained 7 7
## Inertia is correlations
##
## Eigenvalues for unconstrained axes:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## 5.086 1.157 0.345 0.158 0.129 0.076 0.049
##
## Call:
## rda(X = mtcars[, 1:7], scale = TRUE)
##
## Partitioning of correlations:
## Inertia Proportion
## Total 7 1
## Unconstrained 7 1
##
## Eigenvalues, and their contribution to the correlations
##
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Eigenvalue 5.0861 1.1566 0.34485 0.15793 0.1295 0.07586
## Proportion Explained 0.7266 0.1652 0.04926 0.02256 0.0185 0.01084
## Cumulative Proportion 0.7266 0.8918 0.94107 0.96364 0.9821 0.99297
## PC7
## Eigenvalue 0.049198
## Proportion Explained 0.007028
## Cumulative Proportion 1.000000
##
## Scaling 2 for species and site scores
## * Species are scaled proportional to eigenvalues
## * Sites are unscaled: weighted dispersion equal on all dimensions
## * General scaling constant of scores: 3.838088
##
##
## Species scores
##
## PC1 PC2 PC3 PC4 PC5 PC6
## mpg -1.3504 0.1294 -0.2059 -0.442063 0.11108 0.03597
## cyl 1.3895 0.1224 -0.1602 -0.111800 -0.12444 -0.31186
## disp 1.3823 -0.1286 0.1006 -0.338800 -0.07772 0.06461
## hp 1.2686 0.5257 0.1727 0.003969 0.43404 -0.01841
## drat -1.0835 0.6998 0.6434 -0.067493 -0.11576 -0.09317
## wt 1.2802 -0.5029 0.3753 -0.061886 -0.08699 0.14628
## qsec -0.7849 -1.1690 0.2508 -0.035387 0.17113 -0.16276
##
##
## Site scores (weighted sums of species scores)
##
## PC1 PC2 PC3 PC4 PC5
## Mazda RX4 -0.243242 0.579894 -0.30711 0.49108 -1.11434
## Mazda RX4 Wag -0.235052 0.375523 -0.06404 0.40918 -1.00074
## Datsun 710 -0.625323 -0.033350 -0.08749 0.84325 0.13668
## Hornet 4 Drive -0.043821 -0.825952 -0.68534 -0.38259 0.19179
## Hornet Sportabout 0.490303 0.008421 -0.69878 -0.91626 -0.20639
## Valiant 0.040751 -1.290237 -0.83456 0.64467 0.37678
## Duster 360 0.756378 0.512097 -0.30871 0.08974 0.61740
## Merc 240D -0.592484 -0.776765 0.23619 -0.02117 -0.23256
## Merc 230 -0.675542 -1.331752 1.34052 0.13367 1.28885
## Merc 280 -0.154230 -0.055250 0.65363 0.55479 -0.58838
## Merc 280C -0.149548 -0.228874 0.83555 0.82801 -0.47222
## Merc 450SE 0.535921 -0.238442 -0.39350 0.17258 -0.06050
## Merc 450SL 0.467311 -0.212458 -0.57688 0.02675 0.18159
## Merc 450SLC 0.500967 -0.349059 -0.37426 0.45684 0.16381
## Cadillac Fleetwood 1.056127 -0.776612 0.69870 -0.31860 -0.40694
## Lincoln Continental 1.075495 -0.696090 0.89710 -0.26832 -0.31563
## Chrysler Imperial 0.953786 -0.354488 0.98670 -1.09866 0.07216
## Fiat 128 -0.996997 -0.103343 -0.18151 -1.15484 0.38481
## Honda Civic -1.175905 0.672184 0.77724 -0.84613 -0.88707
## Toyota Corolla -1.126816 -0.053228 -0.14169 -1.43287 0.63949
## Toyota Corona -0.589920 -0.524964 0.09976 0.97728 0.67769
## Dodge Challenger 0.560080 -0.266981 -1.31533 0.27287 -0.67591
## AMC Javelin 0.449899 -0.151555 -0.63193 0.29666 -0.79499
## Camaro Z28 0.719098 0.845740 0.64899 0.16942 -0.07955
## Pontiac Firebird 0.583046 -0.135449 -0.57381 -1.40788 -0.33059
## Fiat X1-9 -0.898925 0.060701 -0.19144 0.05260 -0.07467
## Porsche 914-2 -0.736374 0.846646 0.26597 -0.03670 -0.79493
## Lotus Europa -0.776611 0.689230 -1.28006 -0.43856 0.87092
## Ford Pantera L 0.563327 1.576823 0.88082 -0.39160 0.03812
## Ferrari Dino -0.008548 0.855360 -0.60774 1.04767 0.17839
## Maserati Bora 0.824243 1.337699 0.21219 0.38475 2.19332
## Volvo 142E -0.547393 0.044529 0.72081 0.86238 0.02361
## PC6
## Mazda RX4 -0.02832
## Mazda RX4 Wag -0.10902
## Datsun 710 0.63247
## Hornet 4 Drive 0.05870
## Hornet Sportabout -0.49123
## Valiant -0.03029
## Duster 360 -0.04395
## Merc 240D 1.06715
## Merc 230 -1.01059
## Merc 280 -0.39638
## Merc 280C -0.79105
## Merc 450SE -0.40004
## Merc 450SL -0.79893
## Merc 450SLC -1.05883
## Cadillac Fleetwood 0.90127
## Lincoln Continental 1.02310
## Chrysler Imperial 0.99648
## Fiat 128 0.08697
## Honda Civic -0.90773
## Toyota Corolla -0.62008
## Toyota Corona 0.11740
## Dodge Challenger -0.11973
## AMC Javelin -0.92732
## Camaro Z28 -0.18341
## Pontiac Firebird 0.09667
## Fiat X1-9 -0.02567
## Porsche 914-2 1.08370
## Lotus Europa 1.14803
## Ford Pantera L -0.76179
## Ferrari Dino 0.75862
## Maserati Bora -0.01456
## Volvo 142E 0.74836
As already stated, all principal components are orthogonal to each other.
 | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 |
---|---|---|---|---|---|---|---|
PC1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
PC2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
PC3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
PC4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
PC5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
PC6 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
PC7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
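One way this table could have been generated (an assumption, using standard vegan and base R functions) is to correlate the site scores on all seven axes:

site_scores <- scores(my_pca, display = "sites", choices = 1:7)
round(cor(site_scores), 2)   # off-diagonal zeros: the PCs are uncorrelated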
The Kaiser-Guttman criterion simply says that PCs with eigenvalues higher than the average of all eigenvalues should be kept.
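A minimal sketch of the criterion applied to the eigenvalues stored in the rda object:

eig <- my_pca$CA$eig    # eigenvalues of the unconstrained axes
eig[eig > mean(eig)]    # PCs retained by the Kaiser-Guttman criterion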
We can use the screeplot() function to have a graphical overview of all eigenvalues.
# plot the screeplot
screeplot(my_pca)
# add a line representing mean value of all eigenvalues
abline(h = mean(my_pca$CA$eig), col = "red")
Based on this graph, it seems that the first two PCs are important.
Look at the angle between qsec and wt (~90 degrees): in a biplot, the angle between two variable arrows reflects their correlation, so an angle close to 90 degrees means the two variables are nearly uncorrelated.
Using your data (or a subset of your data), perform a PCA. What does this tell you about your data?
Baker, Kirk. 2013. "Singular Value Decomposition Tutorial." https://www.ling.ohio-state.edu/~kbaker/pubs/Singular_Value_Decomposition_Tutorial.pdf
Peres-Neto, Pedro R., Donald A. Jackson, and Keith M. Somers. 2005. “How many principal components? stopping rules for determining the number of non-trivial axes revisited.” Comput. Stat. Data Anal. 49 (4): 974–97. https://doi.org/10.1016/j.csda.2004.06.015.