Principal component analysis

Canonical analysis

  • Family of statistical analyses used to study and explore a data set of quantitative variables.

  • Aimed at reducing the dimensionality of a data set while retaining much of the variation present in the data.
    • Create new axes that explain most of the data variation.
    • Project data on these new axes.
  • Visualization techniques are limited to 1D, 2D or 3D data.
    • By reducing the dimensionality of the data, canonical analysis makes it possible to plot them in 2D.
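
The two steps listed above (create a new axis, then project the data onto it) can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the five 2D points are made up, the helper name first_axis is hypothetical, and power iteration on the covariance matrix is only one simple way to find the direction of largest variance.

```python
from math import sqrt

# Toy 2D data set (made-up values, for illustration only).
points = [(2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1), (6.0, 6.0)]

n = len(points)
mean = [sum(p[j] for p in points) / n for j in range(2)]
centered = [[p[j] - mean[j] for j in range(2)] for p in points]

# 2 x 2 sample covariance matrix of the centered data.
cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(2)]
       for i in range(2)]


def first_axis(matrix, iterations=200):
    """Power iteration: approximate the eigenvector with the largest eigenvalue."""
    v = [1.0, 0.0]  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(2)) for i in range(2)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Step 1: the new axis (direction of largest variance).
axis = first_axis(cov)

# Step 2: project each centered observation onto that axis (2D -> 1D).
scores = [sum(r[j] * axis[j] for j in range(2)) for r in centered]

total_var = cov[0][0] + cov[1][1]
axis_var = sum(s * s for s in scores) / (n - 1)
print(f"new axis: ({axis[0]:.3f}, {axis[1]:.3f})")
print(f"share of variance kept on this axis: {axis_var / total_var:.1%}")
```

Because the toy points lie close to a straight line, a single new axis retains almost all of the variation, which is the situation in which dimensionality reduction loses little information.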

Types of canonical analysis

  1. Canonical Correlation Analysis (CCA)

  2. Principal Component Analysis (PCA)

  3. Linear Discriminant Analysis (LDA)

  4. Redundancy Analysis (RDA)

Principal component analysis (PCA)

  • One of the goals behind PCA is to graphically represent the essential information contained in a (quantitative) data table.

  • Useful way to discover (hidden) patterns in the data by compressing them into a few dimensions.

  • Not performed directly on the data but on either the covariance or correlation matrix of the data.

Matrix structure

PCA is applied to data in a rectangular (matrix) format.

\[ X_{n,p} = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,p} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n,1} & a_{n,2} & \cdots & a_{n,p} \end{bmatrix} \]


\(n\) objects in the rows (observations)

\(p\) quantitative variables in the columns (variables)
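
With this notation, the covariance and correlation matrices on which PCA operates can be written entry by entry. The symbols \(\bar{a}_j\), \(s_{j,k}\) and \(r_{j,k}\) are introduced here for illustration, and the \(n-1\) denominator is the usual sample convention, assumed rather than taken from the text. For variables \(j\) and \(k\) (columns of \(X_{n,p}\)):

\[ \bar{a}_j = \frac{1}{n}\sum_{i=1}^{n} a_{i,j}, \qquad s_{j,k} = \frac{1}{n-1}\sum_{i=1}^{n} (a_{i,j}-\bar{a}_j)(a_{i,k}-\bar{a}_k), \qquad r_{j,k} = \frac{s_{j,k}}{\sqrt{s_{j,j}\,s_{k,k}}} \]

Both matrices are \(p \times p\) and symmetric. The correlation matrix is the covariance matrix of the standardized variables, so it is the natural choice when the variables are measured on very different scales.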

Example of high-dimensional data

mtcars dataset

                     mpg   cyl    disp      hp  drat    wt   qsec
Mazda RX4          21.00  6.00  160.00  110.00  3.90  2.62  16.46
Mazda RX4 Wag      21.00  6.00  160.00  110.00  3.90  2.88  17.02
Datsun 710         22.80  4.00  108.00   93.00  3.85  2.32  18.61
Hornet 4 Drive     21.40  6.00  258.00  110.00  3.08  3.21  19.44
Hornet Sportabout  18.70  8.00  360.00  175.00  3.15  3.44  17.02
Valiant            18.10  6.00  225.00  105.00  2.76  3.46  20.22
Duster 360         14.30  8.00  360.00  245.00  3.21  3.57  15.84
Merc 240D          24.40  4.00  146.70   62.00  3.69  3.19  20.00
Merc 230           22.80  4.00  140.80   95.00  3.92  3.15  22.90
Merc 280           19.20  6.00  167.60  123.00  3.92  3.44  18.30

Visualization

One option to visualize this dataset is to look at the correlation between every pair of variables.
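
As a rough sketch of what looking at all pairs involves, the snippet below computes the Pearson correlation for every pair of the seven variables, using only the ten rows of mtcars listed above. The pearson helper and the variable names are illustrative choices, and correlations computed from ten cars will differ from those of the full dataset.

```python
from itertools import combinations
from math import sqrt

# The ten mtcars rows and seven variables listed above.
names = ["mpg", "cyl", "disp", "hp", "drat", "wt", "qsec"]
rows = [
    [21.0, 6.0, 160.0, 110.0, 3.90, 2.62, 16.46],  # Mazda RX4
    [21.0, 6.0, 160.0, 110.0, 3.90, 2.88, 17.02],  # Mazda RX4 Wag
    [22.8, 4.0, 108.0, 93.0, 3.85, 2.32, 18.61],   # Datsun 710
    [21.4, 6.0, 258.0, 110.0, 3.08, 3.21, 19.44],  # Hornet 4 Drive
    [18.7, 8.0, 360.0, 175.0, 3.15, 3.44, 17.02],  # Hornet Sportabout
    [18.1, 6.0, 225.0, 105.0, 2.76, 3.46, 20.22],  # Valiant
    [14.3, 8.0, 360.0, 245.0, 3.21, 3.57, 15.84],  # Duster 360
    [24.4, 4.0, 146.7, 62.0, 3.69, 3.19, 20.00],   # Merc 240D
    [22.8, 4.0, 140.8, 95.0, 3.92, 3.15, 22.90],   # Merc 230
    [19.2, 6.0, 167.6, 123.0, 3.92, 3.44, 18.30],  # Merc 280
]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# One list of values per variable (the columns of the data matrix).
columns = {name: [row[j] for row in rows] for j, name in enumerate(names)}

# Correlation for every unordered pair of variables.
pairs = {(a, b): pearson(columns[a], columns[b])
         for a, b in combinations(names, 2)}

# Print the pairs from strongest to weakest absolute correlation.
for (a, b), r in sorted(pairs.items(), key=lambda kv: -abs(kv[1])):
    print(f"{a:>5} ~ {b:<5} r = {r:+.2f}")
```

With \(p\) variables there are \(p(p-1)/2\) pairs, 21 for the seven variables here, so this approach becomes hard to read as \(p\) grows.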