Redundancy analysis (RDA)

Redundancy analysis

Simple (unconstrained) ordination analyses (such as PCA) on a single data matrix \(X\) helps to reveal its major structure (Borcard, Gillet, and Legendre 2011).
There are not notions of explanatory or response variables.
On contrary, canonical ordination such as RDA explicitly explores the relationships between two matrices: a response matrix and an explanatory matrix.

Redundancy analysis

RDA is the multivariate (meaning multiresponse) technique analogue of regression.
The method uses a mix of linear regression and principal components analysis (PCA).
Conceptually, RDA is a multivariate (meaning multiresponse) multiple linear regression followed by a PCA of the table of fitted values.

Definitions

Lets define :

\(X\) a matrix of explanatory variables
\(Y\) a matrix of response variables

Definitions

RDA procedure works on both centered matrices. This simply means that that the average of the variable is subtracted from each observation.

\[ \bar{X}_j = \sum_{i = 1}^{n} X_{ij} = 0 \]

\[ \bar{Y}_j = \sum_{i = 1}^{n} Y_{ij} = 0 \]

RDA cookbook

These steps are from Borcard, Gillet, and Legendre (2011) which I highly recommend.

Regress each (centered) \(y\) variable on explanatory matrix \(X\) and compute the fitted (\(\hat{y}\)) and residuals (\(y_{res}\)) vectors.
Create a new matrix (\(\hat{Y}\)) containing all the fitted vectors (\(\hat{y}\)).
Compute a PCA on \(\hat{Y}\). This will produces a vector of canonical eigenvalues and a matrix \(U\) of canonical eigenvectors (principal components).

Graphical view

\(\hat{Y}\) is produced using multiple linear regression between \(X\) and each \(y_i\).

Graphical view

A PCA is performed on \(\hat{Y}\) which gives a set of principal component vectors \(U\).

PCA vs RDA

PCA and RDA are very similar:

PCA is performed on a matrix with explanatory variables.
RDA is performed on a matrix of predicted explanatory variables.

Two types of RDA

Depends on how site scores are calculated (two possibilities).

\(Y \times U\) to obtain ordination in the space of the original variables \(Y\).
\(\hat{Y} \times U\) to obtain ordination in the space of the variables \(X\).

Site scores calculated using \(Y \times U\) are simply called site scores where as scores calculates using \(\hat{Y} \times U\) are called site constraints since they are calculated using linear combinations of constraining variables \(X\).

Vegan R package

The vegan package makes it very easy to perform RDA in R using the RDA() function.

# Install the package
install.packages("vegan")

# Load the package
library(vegan)

Vegan R package

Basic usage

# X is a matrix or data frame of explanatory variables
my_rda <- rda(Y ~ x1 + x2 + x3, data = X)

A concrete example

http://goo.gl/hwxKAD

The data come from a Ph.D. thesis (Verneaux, 1973) who proposed to use fish species to characterize ecological zones along European rivers and streams. He showed that fish communities were good biological indicators of these water bodies. Data have been collected at 30 localities along Doubs river.

The first matrix (Y) contains coded abundances of 27 fish species.

The second matrix (X) contains 11 environmental variables related to the hydrology, geomorphology and chemistry of the river.

Reference: Verneaux, J. (1973) Cours d’eau de Franche-Comté (Massif du Jura). Recherches écologiques sur le réseau hydrographique du Doubs. Essai de biotypologie. Thèse d’état, Besançon. 1–257.

A concrete example

Matrix \(Y\) contains fish species abundance.

spe <- read.csv("http://goo.gl/hwxKAD", row.names = 1)

	TRU	VAI	LOC	VAN	CHE
1	3	0	0	0	0
2	5	4	3	0	0
3	5	5	5	0	0
4	4	5	5	0	1
5	2	3	2	5	2
6	3	4	5	1	2
7	5	4	5	1	1
8	0	0	0	0	0
9	0	1	3	0	5
10	1	4	4	2	2

A concrete example

Lets transform abundance data (\(Y\)) using the Hellinger-transform method (Borcard, Gillet, and Legendre 2011; Legendre and Gallagher 2001).

Particularly suited to species abundance data, this transformation gives low weights to variables with low counts and many zeros. The transformation itself comprises dividing each value in a data matrix by its row sum, and taking the square root of the quotient.

Source: http://mb3is.megx.net/gustame/reference/transformations

A concrete example

In R, the Hellinger-transform is performed using the decostand() function from vegan package.

spe_hellinger <- decostand(spe, method = "hellinger")

	TRU	VAI	LOC	VAN	CHE
1	1.00	0.00	0.00	0.00	0.00
2	0.65	0.58	0.50	0.00	0.00
3	0.56	0.56	0.56	0.00	0.00
4	0.44	0.49	0.49	0.00	0.22
5	0.24	0.30	0.24	0.38	0.24
6	0.38	0.44	0.49	0.22	0.31
7	0.56	0.50	0.56	0.25	0.25
8	0.00	0.00	0.00	0.00	0.00
9	0.00	0.27	0.46	0.00	0.60
10	0.27	0.53	0.53	0.38	0.38

A concrete example

Environemental data (\(X\))

env <- read.csv("http://goo.gl/HDAWtO", row.names = 1)

	das	alt	pen	deb	pH	dur	pho	nit	amm	oxy	dbo
1	0.30	934	48.00	0.84	7.90	45	0.01	0.20	0.00	12.20	2.70
2	2.20	932	3.00	1.00	8.00	40	0.02	0.20	0.10	10.30	1.90
3	10.20	914	3.70	1.80	8.30	52	0.05	0.22	0.05	10.50	3.50
4	18.50	854	3.20	2.53	8.00	72	0.10	0.21	0.00	11.00	1.30
5	21.50	849	2.30	2.64	8.10	84	0.38	0.52	0.20	8.00	6.20
6	32.40	846	3.20	2.86	7.90	60	0.20	0.15	0.00	10.20	5.30
7	36.80	841	6.60	4.00	8.10	88	0.07	0.15	0.00	11.10	2.20
8	49.10	792	2.50	1.30	8.10	94	0.20	0.41	0.12	7.00	8.10
9	70.50	752	1.20	4.80	8.00	90	0.30	0.82	0.12	7.20	5.20
10	99.00	617	9.90	10.00	7.70	82	0.06	0.75	0.01	10.00	4.30

A concrete example

Environmental data (X)

Code	Description of the variable
das	Distance from the source [km]
alt	Altitude [m a.s.l.]
pen	Slope [per thousand]
deb	Mean minimum discharge [m3s-1]
pH	pH of water
dur	Calcium concentration (hardness) [mgL-1]
pho	Phosphate concentration [mgL-1]
nit	Nitrate concentration [mgL-1]
amn	Ammonium concentration [mgL-1]
oxy	Dissolved oxygen [mgL-1]
dbo	Biological oxygen demand [mgL-1]

languageR::pairscor.fnc(env)  # First look at pairwise correlations

Can you spot potential problems?

Lets do it

# Create the model
my_rda <- rda(spe_hellinger ~ das + alt + pen + deb + pH + dur + 
    pho + nit + amm + oxy + dbo, data = env)

Quick R tip

To make the syntax easier, you can use the formulation

y ~ ., data = x

which means: y as a function of all variables in x.

# Predicting transformed species using all variables
# contained in env
my_rda <- rda(spe_hellinger ~ ., data = env)

Summary

summary(my_rda)  # Lets have a look to the summary of the RDA

First part -> contribution to explained variance

Second part -> contribution to total variance

Visualisation

For RDA, the visualization plot is called triplot since there are three different entities in the plot: sites, response variables and explanatory variables (Borcard, Gillet, and Legendre 2011).

Interpreting triplot

plot(my_rda)  # Show the triplot

Explained variance

RDA: explained variance

Attention: \(R^2\) as the relative contribution of each eigenvectors are unadjusted and are therefore biased. For a proper computation of unbiased, adjusted \(R^2\) one should use the RsquareAdj() function.

RsquareAdj(my_rda)  # Total variance explained by the RDA

## $r.squared
## [1] 0.7269421
## 
## $adj.r.squared
## [1] 0.5600733

\[ R^2_{adj} = 1 - \frac{n - 1}{n - m - 1}(1 - R^2) \]

Test of significance

Tests of RDA results

Most of the time, ecological data are not normally-distributed: parametric tests are not appropriate.

For canonical analysis such as RDA, model significance relies on permutation tests.

The principle of a permutation test is to generate a reference distribution of the chosen statistic under the null hypothesis \(H0\) by randomly permuting appropriate elements of the data a large number of times and recomputing the statistic each time. Then, one compares the true value of the statistic to this reference distribution. The \(p\) value is computed as the proportion of the permuted values equal to or larger than the true (unpermuted) value of the statistic for a one-tailed test in the upper tail, like the \(F\) test used in RDA (Borcard, Gillet, and Legendre 2011).

Permutation test

Repeat the process n numbers of times (default: n = 999)

\(H0\): Observed results can be produced by random chance.

The null hypothesis is rejected if this \(p\) value is equal to or smaller than the predefined significance level \(\alpha\) (ex.: \(\alpha = 0.05\)).

Permutation test

For an RDA, you have to test for three different things:

Global RDA significance
Axis significance
Terms (explanatory variables) significance

Global significance

We use the function anova.cca() to perform the permutation test. Do not get confused with the name, is has nothing to do with the classical ANOVA test.

anova.cca(rda, permutations = 999)

Argument permutations gives the minimal number of permutations requested to assess if the F value of a test is obviously significant or not (Borcard, Gillet, and Legendre 2011).

Permutation test

anova.cca(my_rda)

## Permutation test for rda under reduced model
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(formula = spe_hellinger ~ das + alt + pen + deb + pH + dur + pho + nit + amm + oxy + dbo, data = env)
##          Df Variance      F Pr(>F)    
## Model    11  0.36517 4.3564  0.001 ***
## Residual 18  0.13717                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since \(p\) = 0.001 we reject \(H0\) meaning that the RDA model is significant.

Axis significance

Axis significance can be done by adding by = "axis".

anova.cca(my_rda, by = "axis")  # Test which axis are significant

## Permutation test for rda under reduced model
## Forward tests for axes
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(formula = spe_hellinger ~ das + alt + pen + deb + pH + dur + pho + nit + amm + oxy + dbo, data = env)
##          Df Variance       F Pr(>F)    
## RDA1      1 0.221790 29.1044  0.001 ***
## RDA2      1 0.056112  7.3633  0.001 ***
## RDA3      1 0.032462  4.2599  0.090 .  
## RDA4      1 0.027880  3.6585  0.259    
## RDA5      1 0.009756  1.2802  0.998    
## RDA6      1 0.005466  0.7173  1.000    
## RDA7      1 0.005127  0.6728  0.999    
## RDA8      1 0.003182  0.4176  1.000    
## RDA9      1 0.001958  0.2569  1.000    
## RDA10     1 0.000880  0.1155  1.000    
## RDA11     1 0.000562  0.0737  1.000    
## Residual 18 0.137169                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Terms significance

Terms significance can be done by adding by = "terms".

anova.cca(my_rda, by = "terms")  # Test which terms are significant

## Permutation test for rda under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(formula = spe_hellinger ~ das + alt + pen + deb + pH + dur + pho + nit + amm + oxy + dbo, data = env)
##          Df Variance       F Pr(>F)    
## das       1 0.183727 24.1096  0.001 ***
## alt       1 0.031610  4.1481  0.009 ** 
## pen       1 0.023637  3.1018  0.007 ** 
## deb       1 0.039763  5.2179  0.002 ** 
## pH        1 0.007411  0.9725  0.389    
## dur       1 0.014727  1.9326  0.114    
## pho       1 0.029551  3.8778  0.004 ** 
## nit       1 0.010425  1.3680  0.229    
## amm       1 0.008019  1.0523  0.362    
## oxy       1 0.011292  1.4819  0.180    
## dbo       1 0.005013  0.6578  0.619    
## Residual 18 0.137169                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Linear dependencies

As with multiple linear regressions, there are chances to have collinearity among predictors (\(X\)).

This needs to be carefully verified.

Linear dependencies

Lets look at the environmental matrix (\(X\)) used for the RDA.

Linear dependencies

Multicollinearity can be verified using the vif.cca() function.

sqrt(vif.cca(my_rda))

##       das       alt       pen       deb        pH       dur       pho 
## 10.198473  6.292054  1.422669  6.004052  1.127849  1.954401  5.032821 
##       nit       amm       oxy       dbo 
##  4.157438  5.576650  3.600462  4.203407

As a rule of the thumb, if \(\sqrt{VIF} > 2\) multicollinearity is considered high.

Here, it seems that some of explanatory variables are redundant.

Should we reduce the number of explanatory variables?

Forward selection

Like we saw for the multiple regression analysis, we can use the forward.sel() function to select the best explanatory variables.

You have to install the packfor package.

install.packages("packfor", repos = "http://R-Forge.R-project.org")

Recipe

To prevent problem with inflation of type I error: compute the global model (i.e. using all explanatory variables) and calculate p value. If the model is significant, perform the forward selection.
To prevent adding too many explanatory variables: compute the global \(R^2_{adj}\) of the model. Use this value as the second stopping criteria.

Forward selection

library(packfor)

# Global model with all environmental variables
my_rda <- rda(spe_hellinger ~ ., data = env)

# global R2
global_r2 <- RsquareAdj(my_rda)$adj.r.squared

global_r2

## [1] 0.5600733

Forward selection

Perform the forward selection with \(R^2_{adj} = 0.56\).

# Forward selection
fs <- forward.sel(spe_hellinger, # Y matrix
  env, # X matrix
  adjR2thresh = global_r2, # Set the adj.R2 threshold
  alpha = 0.001, # Set alpha level
  nperm = 999  # Number of permutations
)

## Testing variable 1
## Testing variable 2
## Testing variable 3
## Testing variable 4
## Procedure stopped (alpha criteria): pvalue for variable 4 is 0.005000 (superior to 0.001000)

Forward selection

The forward selection has determined that fish species gradient was best modeled using 3 environmental variables.

fs

##   variables order        R2     R2Cum  AdjR2Cum         F  pval
## 1       das     1 0.3657395 0.3657395 0.3430874 16.145902 0.001
## 2       oxy    10 0.1235899 0.4893295 0.4515020  6.534406 0.001
## 3       dbo    11 0.0746889 0.5640184 0.5137128  4.454113 0.001

Based on the forward selection, it seems that das, oxy and dbo are good predictors.

Forward selection

Lets create another RDA using the selected variables.

my_rda2 <- rda(spe_hellinger ~ das + oxy + dbo, data = env)

Forward selection

RsquareAdj(my_rda)$adj.r.squared  # Using 11 explanatory variables

## [1] 0.5600733

RsquareAdj(my_rda2)$adj.r.squared  # Using 3 explanatory variables

## [1] 0.5137128

The RDA using only 3 explanatory variables explains almost as much as the RDA using 11 explanatory variables.

Forward selection

sqrt(vif.cca(my_rda))  # Multicollinearity before variables selection

##       das       alt       pen       deb        pH       dur       pho 
## 10.198473  6.292054  1.422669  6.004052  1.127849  1.954401  5.032821 
##       nit       amm       oxy       dbo 
##  4.157438  5.576650  3.600462  4.203407

sqrt(vif.cca(my_rda2))  # Multicollinearity after variables selection

##      das      oxy      dbo 
## 1.165765 1.991699 1.865008

Remember: \(\sqrt{VIF} > 2\) means high multicollinearity among predictor.

Axis significance

Axis significance can be done by adding by = "axis".

anova.cca(my_rda2, by = "axis")  # Test which axis are significant

## Permutation test for rda under reduced model
## Forward tests for axes
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(formula = spe_hellinger ~ das + oxy + dbo, data = env)
##          Df Variance       F Pr(>F)    
## RDA1      1 0.210795 25.0245  0.001 ***
## RDA2      1 0.050414  5.9849  0.001 ***
## RDA3      1 0.022121  2.6261  0.007 ** 
## Residual 26 0.219012                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Terms significance

Terms significance can be done by adding by = "terms".

anova.cca(my_rda2, by = "terms")  # Test which terms are significant

## Permutation test for rda under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(formula = spe_hellinger ~ das + oxy + dbo, data = env)
##          Df Variance       F Pr(>F)    
## das       1 0.183727 21.8111  0.001 ***
## oxy       1 0.062085  7.3704  0.001 ***
## dbo       1 0.037519  4.4541  0.004 ** 
## Residual 26 0.219012                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Unexplained variance

It is a good idea to look at the residual variation in the data. We can use the Kaiser-Guttman criterion (Kaiser 1960).

my_rda$CA$eig[my_rda$CA$eig > mean(my_rda$CA$eig)]

##         PC1         PC2         PC3         PC4         PC5         PC6 
## 0.044375541 0.023551441 0.018339110 0.012026190 0.010230639 0.007679747

Based on this result, we see there is still some variation in these PCA that has not been explained by our set of environmental data.
Need more data?

Exercise

Exercise #1

Explore our data for your using an RDA.

If you do not have data, use the following data and create an RDA to predict vascular plant abundance with chemical variables.

# bryo <- read.delim('http://goo.gl/UugH8X', row.names = 1)

vasc <- read.delim("http://goo.gl/UBcc5J", row.names = 1)

chem <- read.delim("http://goo.gl/55gDIi", row.names = 1)

Partial RDA

It is possible to run an RDA of a matrix Y, explained by a matrix variables X, in the presence of co-variable(s) W.

This analysis allows to display the patterns of the response data (Y) uniquely explained by a linear model of the explanatory variables (X) when the effect of other covariates (W) is held constant (Borcard, Gillet, and Legendre 2011).

An example

The analysis that follows determines whether water chemistry (X) significantly explains the fish species patterns (Y) when the effect of the topographical gradient (W) is held constant.

In other words, we want to model the effect of water chemistry variables on fish abundance once the effect of topographic variables is removed.

Partial RDA

Subsetting variables

Partial RDA in R

The construction of a partial RDA is pretty much the same as for a standard RDA except that we are using the Condition(...) function to specify which covariate (\(W\)) we want to remove the effect.

Model formulation

partial_rda <- rda(spe_hellinger ~ pH + dur + pho + nit + amm + 
    oxy + dbo + Condition(das + alt + pen + deb), data = env)

Summary

summary(partial_rda)

## 
## Call:
## rda(formula = spe_hellinger ~ pH + dur + pho + nit + amm + oxy +      dbo + Condition(das + alt + pen + deb), data = env) 
## 
## Partitioning of variance:
##               Inertia Proportion
## Total         0.50234     1.0000
## Conditioned   0.27874     0.5549
## Constrained   0.08644     0.1721
## Unconstrained 0.13717     0.2731
## 
## Eigenvalues, and their contribution to the variance 
## after removing the contribution of conditiniong variables
## 
## Importance of components:
##                          RDA1    RDA2    RDA3     RDA4     RDA5     RDA6
## Eigenvalue            0.04636 0.01732 0.01080 0.005924 0.003477 0.001806
## Proportion Explained  0.20734 0.07747 0.04829 0.026493 0.015549 0.008077
## Cumulative Proportion 0.20734 0.28481 0.33311 0.359600 0.375149 0.383225
##                            RDA7     PC1     PC2     PC3     PC4     PC5
## Eigenvalue            0.0007458 0.04438 0.02355 0.01834 0.01203 0.01023
## Proportion Explained  0.0033352 0.19845 0.10533 0.08202 0.05378 0.04575
## Cumulative Proportion 0.3865606 0.58501 0.69034 0.77236 0.82614 0.87189
##                           PC6      PC7      PC8      PC9     PC10     PC11
## Eigenvalue            0.00768 0.005359 0.004465 0.003185 0.002903 0.001800
## Proportion Explained  0.03435 0.023966 0.019970 0.014242 0.012982 0.008051
## Cumulative Proportion 0.90624 0.930203 0.950173 0.964415 0.977397 0.985449
##                           PC12      PC13      PC14      PC15      PC16
## Eigenvalue            0.001160 0.0008835 0.0004229 0.0003872 0.0002722
## Proportion Explained  0.005189 0.0039509 0.0018914 0.0017318 0.0012175
## Cumulative Proportion 0.990638 0.9945887 0.9964801 0.9982119 0.9994294
##                            PC17      PC18
## Eigenvalue            9.722e-05 3.036e-05
## Proportion Explained  4.348e-04 1.358e-04
## Cumulative Proportion 9.999e-01 1.000e+00
## 
## Accumulated constrained eigenvalues
## Importance of components:
##                          RDA1    RDA2   RDA3     RDA4     RDA5     RDA6
## Eigenvalue            0.04636 0.01732 0.0108 0.005924 0.003477 0.001806
## Proportion Explained  0.53638 0.20041 0.1249 0.068536 0.040223 0.020894
## Cumulative Proportion 0.53638 0.73679 0.8617 0.930255 0.970479 0.991372
##                            RDA7
## Eigenvalue            0.0007458
## Proportion Explained  0.0086278
## Cumulative Proportion 1.0000000
## 
## Scaling 2 for species and site scores
## * Species are scaled proportional to eigenvalues
## * Sites are unscaled: weighted dispersion equal on all dimensions
## * General scaling constant of scores:  1.953663 
## 
## 
## Species scores
## 
##          RDA1      RDA2      RDA3      RDA4       RDA5      RDA6
## CHA -0.006736  0.056075 -0.067508 -0.031867  0.0328150 -0.003809
## TRU  0.056411  0.215314  0.048162  0.042760  0.0245275  0.013154
## VAI  0.113326  0.120592  0.082918 -0.011714 -0.0447019 -0.010303
## LOC  0.178107  0.121882  0.076960 -0.052258 -0.0432833  0.013544
## OMB -0.032569  0.062566 -0.072818 -0.010169  0.0033421 -0.026404
## BLA  0.021748  0.047012 -0.068045 -0.023806  0.0620808  0.020182
## HOT  0.079670 -0.046183 -0.014754  0.043955 -0.0268121  0.033513
## TOX  0.147921 -0.059793  0.019036  0.059762 -0.0006972 -0.027146
## VAN  0.061168 -0.067139  0.069243 -0.058602  0.0234703 -0.018282
## CHE -0.107322 -0.044838  0.017223 -0.097562 -0.0522028 -0.017556
## BAR  0.176698 -0.006498 -0.038188  0.022431  0.0532397  0.010502
## SPI  0.137367 -0.061973  0.019658  0.043529 -0.0029242 -0.019346
## GOU  0.047367 -0.044509  0.102003 -0.053955  0.0363768  0.041214
## BRO  0.068898  0.006641  0.091837 -0.003211  0.0526976 -0.043015
## PER  0.136899 -0.041390  0.024280  0.001626 -0.0360536 -0.011459
## BOU  0.152817 -0.027459  0.002190  0.026119 -0.0020938 -0.012330
## PSO  0.118847 -0.045988 -0.025997  0.024803 -0.0251135  0.010376
## ROT  0.005053 -0.052845  0.066721 -0.011614  0.0502122 -0.010447
## CAR  0.139290 -0.018109 -0.011819  0.010822  0.0002110 -0.014721
## TAN  0.167530 -0.048528 -0.012838 -0.040782  0.0154087  0.026609
## BCO  0.119418  0.010340 -0.023971 -0.005092 -0.0178581  0.018706
## PCH  0.075090  0.011810 -0.039215 -0.013490 -0.0157576  0.002078
## GRE  0.030458 -0.052458  0.050348 -0.016292  0.0118461  0.044864
## GAR -0.082276 -0.120778  0.032477  0.018500 -0.0112214  0.013140
## BBO  0.103534 -0.010786 -0.049588  0.017441 -0.0227933  0.028896
## ABL -0.205858  0.015811  0.108859  0.100722 -0.0031471  0.018825
## ANG  0.134038 -0.021178 -0.006907  0.019326  0.0024788 -0.021222
## 
## 
## Site scores (weighted sums of species scores)
## 
##        RDA1     RDA2     RDA3      RDA4     RDA5     RDA6
## 1  -0.01065 -0.16652 -0.14794  0.264469  0.32820  0.12189
## 2   0.06446  0.78394 -0.09872  1.083078 -0.03281  0.71002
## 3   0.12808  0.73956  0.18636  0.853163  0.26171 -0.14300
## 4   0.23832  0.03298  0.49519 -0.068726 -0.04972  0.15323
## 5   0.05334 -1.33539  0.68190 -0.459798  1.22152 -0.42072
## 6   0.18560 -0.21669  0.49966 -0.695084 -0.05519  0.24449
## 7   0.05283  0.60479  0.20124 -0.225805 -0.64900 -0.17305
## 8  -0.40692 -0.70449 -1.51926  0.915941  1.03301 -0.47344
## 9  -0.29966 -0.74228 -0.51274 -1.021561 -1.47758  0.04004
## 10  0.03389  0.09186  0.78500 -1.599635 -1.30992 -0.22610
## 11 -0.32573  0.36537 -0.16671  0.007798 -1.25477 -0.33065
## 12 -0.26473  0.65465 -0.28707 -0.009253 -1.00531 -0.15052
## 13 -0.23858  0.82346 -0.73758  0.397032  0.31015  0.24887
## 14 -0.14521  0.64434 -0.45194 -0.308003  0.97783  0.10133
## 15 -0.07509  0.18484 -0.36303 -1.175132  0.96867  0.95440
## 16  0.45081  0.19081 -0.05680 -0.610718  1.08501 -0.94728
## 17  0.34019 -0.49636  0.17749  0.846064  0.08994 -0.96113
## 18  0.33976 -0.60291  0.26129  0.555080 -0.03882 -0.55441
## 19  0.38051 -0.88714  0.55807  0.614652 -0.14071  0.93982
## 20  0.34744 -0.88344  0.41751  0.596592 -0.01302  0.39399
## 21  0.46064 -0.27667  0.14040  0.326027  0.16471  0.04939
## 22  0.43425 -0.19827 -0.14184  0.040092  0.14864  0.06692
## 23 -1.13447  0.44175 -0.09791  0.534633 -1.15993 -0.98447
## 24 -0.69985  0.30306 -0.13651  0.504795 -0.98194  2.12108
## 25 -0.85845 -0.15106  1.17533 -0.544402  1.32867 -0.10600
## 26  0.20536  0.29058 -0.06739 -0.209412  0.14905  0.44500
## 27  0.35017  0.50452 -0.25649 -0.352568  0.33618 -0.32536
## 28  0.33524  0.64514 -0.37172 -0.416249  0.08103 -0.67630
## 29  0.03591 -0.33363  0.00736  0.145693 -0.25999  0.29330
## 30  0.02253 -0.30681 -0.17314  0.011237 -0.05560 -0.41134
## 
## 
## Site constraints (linear combinations of constraining variables)
## 
##        RDA1      RDA2     RDA3     RDA4     RDA5      RDA6
## 1  -0.02391 -0.036707 -0.03327  0.07930  0.16118  0.075277
## 2   0.33516  0.006732  0.58753  0.68365  0.18160  0.551285
## 3   0.07580  0.382694 -0.04659  0.55873  0.52004  0.115742
## 4   0.20289  0.239507  0.35641 -0.40527 -0.14871  0.005679
## 5  -0.31485 -0.563519 -0.25858  0.02004  0.17970 -0.097047
## 6   0.06779  0.585164  0.26344  0.03847 -0.39751  0.005655
## 7  -0.02030  0.465622 -0.07491 -0.58245 -0.24644 -0.534697
## 8  -0.40872 -0.553370 -0.70505 -0.04871 -0.06311 -0.276797
## 9  -0.04102 -0.760693 -0.17589 -0.42206  0.01551  0.073980
## 10  0.06557 -0.264006  0.16238 -0.16112 -0.66799 -0.049270
## 11  0.02127 -0.202447 -0.03652 -0.38875  0.27358 -0.297116
## 12 -0.07716  0.191338  0.11839  0.23000 -0.56646  0.108175
## 13 -0.18417  0.231607 -0.20020 -0.01579 -0.32286 -0.014005
## 14 -0.16408  0.573558 -0.39262 -0.14808  0.15340 -0.401418
## 15 -0.08290  0.239785 -0.49693 -0.14732  0.93083  0.762921
## 16  0.37084 -0.076852  0.09606 -0.28534 -0.07623  0.037788
## 17  0.10555 -0.408005 -0.03129  0.60953  0.08629 -0.606067
## 18  0.21528 -0.292934  0.26194 -0.07232  0.06156  0.180256
## 19  0.25110  0.118684  0.20929 -0.15871  0.19487  0.336569
## 20  0.39265 -0.244266  0.27743  0.58776  0.09885 -0.428379
## 21  0.43229 -0.108845  0.08768  0.13806 -0.32203  0.191321
## 22  0.21640  0.218611 -0.32197  0.13614 -0.26136  0.418909
## 23 -1.03978  0.716385 -0.04610  0.42770 -0.21453 -0.102241
## 24 -0.49025 -0.331472 -0.37860  0.20922 -0.24626  0.461350
## 25 -0.94699 -0.265809  1.22095 -0.24306  0.56368 -0.164705
## 26  0.17052  0.100774  0.19412 -0.88716 -0.11180  0.441012
## 27  0.34452 -0.045063 -0.19822  0.26767  0.12702 -0.086841
## 28  0.54104  0.331874 -0.14415 -0.15916  0.58080 -0.889494
## 29  0.01374 -0.233864  0.07644  0.18198 -0.58074 -0.132751
## 30 -0.02828 -0.014485 -0.37119 -0.04297  0.09712  0.314907
## 
## 
## Biplot scores for constraining variables
## 
##          RDA1     RDA2    RDA3     RDA4      RDA5     RDA6
## pH  -0.009684  0.35319 -0.5252  0.01125  0.713943  0.04296
## dur -0.270682 -0.17691 -0.2616 -0.31065 -0.111336 -0.27432
## pho -0.672975  0.01490  0.4494 -0.10956  0.225350 -0.00608
## nit -0.209209 -0.08541  0.3855 -0.01586  0.285438 -0.21167
## amm -0.670851 -0.05230  0.4774  0.06538  0.212485 -0.09336
## oxy  0.436621  0.33888 -0.1078  0.07314 -0.042087 -0.03581
## dbo -0.767570 -0.05931  0.1696  0.01937  0.008685 -0.01508

Interpreting the output

According to Borcard, Gillet, and Legendre (2011):

Conditioned: gives the amount of variance that has been explained by the covariables and removed (55.49%).
Constrained: gives the amount of variance uniquely explained by the explanatory variables (17.21%).
Unconstrained: gives the residual variance (27.31%).

Interpreting the output

Caution: the values are unadjusted and are therefore biased.

RsquareAdj(partial_rda)  # Always report adjusted R2

## $r.squared
## [1] 0.1720683
## 
## $adj.r.squared
## [1] 0.07641971

Test of significance

As for RDA without covariates we can use the anova.cca() function to test the significance of partial RDA.

anova.cca(partial_rda)

## Permutation test for rda under reduced model
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(formula = spe_hellinger ~ pH + dur + pho + nit + amm + oxy + dbo + Condition(das + alt + pen + deb), data = env)
##          Df Variance      F Pr(>F)  
## Model     7 0.086437 1.6204  0.021 *
## Residual 18 0.137169                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since \(p \leq 0.05\) we reject \(H0\) and can conclude that the result was not produced randomly.

Exercise

Question #1

Run anova.cca(partial_rda) couple of times. What are you observing?

Visualization

# As expected, topographic data not present
plot(partial_rda, scaling = 1)

What is going on?

# Adj-R2 using all env. variables
RsquareAdj(rda(spe_hellinger ~ ., data = env))$adj.r.squared

## [1] 0.5600733

# Adj-R2 after removing effect of topography
RsquareAdj(rda(spe_hellinger ~ pH + dur + pho + nit + amm + oxy + 
    dbo + Condition(das + alt + pen + deb), data = env))$adj.r.squared

## [1] 0.07641971

# Adj-R2 after removing effect of water chemistry
RsquareAdj(rda(spe_hellinger ~ das + alt + pen + deb + Condition(pH + 
    dur + pho + nit + amm + oxy + dbo), data = env))$adj.r.squared

## [1] 0.1007639

Why 0.08% + 0.1% is not equal to 0.56%?

Variance partitioning

We just saw that the \(R^2_{adj}\) of each partial RDA does not sum up to the \(R^2_{adj}\) of the global model.

56% - 8% - 10 = 38%

Where is the missing 38%?

There must have confounded variance explained by both partial RDA.

Variance partitioning

Variance partitioning in R

In R, variance partitioning among partial RDA is done using the varpart function from the vegan package.

# Y = response variables matrix (species community)

# X = explanatory variables matrix (water chemistry variables)

# W = covariate variables matrix (topography variables)
vp <- varpart(Y, X, W)

Varpart

env_topograhpy <- env[, 1:4]  # Extract topography variables
env_water_chemistry <- env[, 5:11]  # Extract water chem variables
vp <- varpart(spe_hellinger, env_water_chemistry, env_topograhpy)
vp

## 
## Partition of variance in RDA 
## 
## Call: varpart(Y = spe_hellinger, X = env_water_chemistry,
## env_topograhpy)
## 
## Explanatory tables:
## X1:  env_water_chemistry
## X2:  env_topograhpy 
## 
## No. of explanatory tables: 2 
## Total variation (SS): 14.568 
##             Variance: 0.50234 
## No. of observations: 30 
## 
## Partition table:
##                      Df R.squared Adj.R.squared Testable
## [a+b] = X1            7   0.58982       0.45931     TRUE
## [b+c] = X2            4   0.55487       0.48365     TRUE
## [a+b+c] = X1+X2      11   0.72694       0.56007     TRUE
## Individual fractions                                    
## [a] = X1|X2           7                 0.07642     TRUE
## [b]                   0                 0.38289    FALSE
## [c] = X2|X1           4                 0.10076     TRUE
## [d] = Residuals                         0.43993    FALSE
## ---
## Use function 'rda' to test significance of fractions of interest

Varpart

plot(vp)

RsquareAdj(partial_rda)

## $r.squared
## [1] 0.1720683
## 
## $adj.r.squared
## [1] 0.07641971

Test of significance

We have tested the global significance of the RDA. We also need to do it for each fraction: (1) [a + b], (2) [b + c], (3) [a + b + c], (4) [a] and (5) [c]

Test of significance

Same as a normal RDA using permutation test.

Testing fractions [a + b]

# Test fractions [a + b]
anova.cca(rda(spe_hellinger, env_water_chemistry))

## Permutation test for rda under reduced model
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(X = spe_hellinger, Y = env_water_chemistry)
##          Df Variance      F Pr(>F)    
## Model     7  0.29629 4.5193  0.001 ***
## Residual 22  0.20605                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Testing fractions [b + c]

# Test fractions [b + c]
anova.cca(rda(spe_hellinger, env_topograhpy))

## Permutation test for rda under reduced model
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(X = spe_hellinger, Y = env_topograhpy)
##          Df Variance     F Pr(>F)    
## Model     4  0.27874 7.791  0.001 ***
## Residual 25  0.22361                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Testing fractions [a + b + c]

# Test fractions [a + b + c]
anova.cca(rda(spe_hellinger, env))

## Permutation test for rda under reduced model
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(X = spe_hellinger, Y = env)
##          Df Variance      F Pr(>F)    
## Model    11  0.36517 4.3564  0.001 ***
## Residual 18  0.13717                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Testing fraction [a]

# Test fraction [a]
anova.cca(rda(spe_hellinger, env_water_chemistry, env_topograhpy))

## Permutation test for rda under reduced model
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(X = spe_hellinger, Y = env_water_chemistry, Z = env_topograhpy)
##          Df Variance      F Pr(>F)  
## Model     7 0.086437 1.6204  0.019 *
## Residual 18 0.137169                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Testing fraction [c]

# Test fraction [c]
anova.cca(rda(spe_hellinger, env_topograhpy, env_water_chemistry))

## Permutation test for rda under reduced model
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(X = spe_hellinger, Y = env_topograhpy, Z = env_water_chemistry)
##          Df Variance      F Pr(>F)   
## Model     4 0.068882 2.2598  0.003 **
## Residual 18 0.137169                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Varpart

With varpart it is possible to have up to 4 explanatory matrices.

Quick note

Variable selection is kind of powerful statistical tool to select best explanatory variables. However, it should not replace good scientific knowledge and good scientific hypothesis! Don’t go fishing on your explanatory variables.

Exercise

Exercise #1

Perform a RDA on fish species data using water chemistry variables as X and topography variables as W.

Perform variable selection on X using \(\alpha = 0.01\).
Perform variable selection on W using \(\alpha = 0.01\).
Keep selected variables in both X and W and perform variance partitioning.
Test each fraction significance.
What are the differences with the previous varpart we performed?

Comparing both models

Exercise

Exercise #2

Use env and create three data frames and perform a RDA using three sets of covariates and perform variance partitioning.

X <- env[, 1:3]
W <- env[, 4:6]
Z <- env[, 7:11]

## # A tibble: 6 x 11
##     das   alt   pen   deb    pH   dur   pho   nit   amm   oxy   dbo
##   <dbl> <int> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1   0.3   934  48    0.84   7.9    45  0.01  0.2   0     12.2   2.7
## 2   2.2   932   3    1      8      40  0.02  0.2   0.1   10.3   1.9
## 3  10.2   914   3.7  1.8    8.3    52  0.05  0.22  0.05  10.5   3.5
## 4  18.5   854   3.2  2.53   8      72  0.1   0.21  0     11     1.3
## 5  21.5   849   2.3  2.64   8.1    84  0.38  0.52  0.2    8     6.2
## 6  32.4   846   3.2  2.86   7.9    60  0.2   0.15  0     10.2   5.3

Same amount of explained variance, just divided differently.

Useful resources

http://evol.bio.lmu.de/_statgen/Multivariate/11SS/rda.pdf

http://mb3is.megx.net/gustame/constrained-analyses/rda

http://www.davidzeleny.net/anadat-r/doku.php/en:rda_cca

References

Borcard, Daniel, Francois Gillet, and Pierre Legendre. 2011. Numerical Ecology with R. New York, NY: Springer New York. https://doi.org/10.1007/978-1-4419-7976-6.

Kaiser, H. F. 1960. “The application of electronic computers to factor analysis.” Educ. Psychol. Meas. 20 (1): 141–51. https://doi.org/10.1177/001316446002000116.

Legendre, Pierre, and Eugene Gallagher. 2001. “Ecologically meaningful transformations for ordination of species data.” Oecologia 129 (2): 271–80. https://doi.org/10.1007/s004420100716.

	TRU	VAI	LOC	VAN	CHE
1	3	0	0	0	0
2	5	4	3	0	0
3	5	5	5	0	0
4	4	5	5	0	1
5	2	3	2	5	2
6	3	4	5	1	2
7	5	4	5	1	1
8	0	0	0	0	0
9	0	1	3	0	5
10	1	4	4	2	2

	TRU	VAI	LOC	VAN	CHE
1	3	0	0	0	0
2	5	4	3	0	0
3	5	5	5	0	0
4	4	5	5	0	1
5	2	3	2	5	2
6	3	4	5	1	2
7	5	4	5	1	1
8	0	0	0	0	0
9	0	1	3	0	5
10	1	4	4	2	2

	TRU	VAI	LOC	VAN	CHE
1	3	0	0	0	0
2	5	4	3	0	0
3	5	5	5	0	0
4	4	5	5	0	1
5	2	3	2	5	2
6	3	4	5	1	2
7	5	4	5	1	1
8	0	0	0	0	0
9	0	1	3	0	5
10	1	4	4	2	2