The t-test is used to determine whether there is a significant difference between the means of two groups. It assumes that the dependent variable follows a normal distribution.
H0: \(\mu_1 = \mu_2\)
H1: The means are different
A working example.
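A minimal sketch of how the data could be set up (the InsectSprays data ships with base R; the InsectSprays2 copy used below is an assumption):
# InsectSprays: insect counts after treatment with six sprays (A to F)
data("InsectSprays")
InsectSprays2 <- InsectSprays   # work on a copy so we can add columns later
InsectSprays2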
## count spray
## 1 10 A
## 2 7 A
## 3 20 A
## 4 14 A
## 5 14 A
## 6 12 A
## 7 10 A
## 8 23 A
## 9 17 A
## 10 20 A
## 11 14 A
## 12 13 A
## 13 11 B
## 14 17 B
## 15 21 B
## 16 11 B
## 17 16 B
## 18 14 B
## 19 17 B
## 20 17 B
## 21 19 B
## 22 21 B
## 23 7 B
## 24 13 B
## 25 0 C
## 26 1 C
## 27 7 C
## 28 2 C
## 29 3 C
## 30 1 C
## 31 2 C
## 32 1 C
## 33 3 C
## 34 0 C
## 35 1 C
## 36 4 C
## 37 3 D
## 38 5 D
## 39 12 D
## 40 6 D
## 41 4 D
## 42 3 D
## 43 5 D
## 44 5 D
## 45 5 D
## 46 5 D
## 47 2 D
## 48 4 D
## 49 3 E
## 50 5 E
## 51 3 E
## 52 5 E
## 53 3 E
## 54 6 E
## 55 1 E
## 56 1 E
## 57 3 E
## 58 2 E
## 59 6 E
## 60 4 E
## 61 11 F
## 62 9 F
## 63 15 F
## 64 22 F
## 65 15 F
## 66 16 F
## 67 13 F
## 68 10 F
## 69 26 F
## 70 26 F
## 71 24 F
## 72 13 F
Visually, our data does not look normally distributed.
We can use the shapiro.test() function to test whether our data is normally distributed (the null hypothesis being that the data come from a normal distribution).
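For example, for the counts of spray C:
shapiro.test(InsectSprays2$count[InsectSprays2$spray == "C"])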
##
## Shapiro-Wilk normality test
##
## data: InsectSprays2$count[InsectSprays2$spray == "C"]
## W = 0.85907, p-value = 0.04759
We can reject the null hypothesis that our data is normally distributed.
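The same test for spray F:
shapiro.test(InsectSprays2$count[InsectSprays2$spray == "F"])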
##
## Shapiro-Wilk normality test
##
## data: InsectSprays2$count[InsectSprays2$spray == "F"]
## W = 0.88475, p-value = 0.1009
Despite the non-normal look of this data, the test does not reject normality (p > 0.05).
Since our data is not normal, what can we do?
Here we work with data that are distributed asymmetrically, with a dominance of low values and a few large values. This type of distribution typically corresponds to a log-normal distribution, that is, the log-transformed values follow a Normal distribution.
Once log-transformed, the data are normally distributed.
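A sketch of the transformation (assuming a log10(x + 1) transform, which is consistent with the group means reported further below):
# Add a log-transformed count column (the +1 handles the zero counts)
InsectSprays2$count_norm <- log10(InsectSprays2$count + 1)

# Re-test normality on the transformed data
shapiro.test(InsectSprays2$count_norm[InsectSprays2$spray == "C"])
shapiro.test(InsectSprays2$count_norm[InsectSprays2$spray == "F"])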
##
## Shapiro-Wilk normality test
##
## data: InsectSprays2$count_norm[InsectSprays2$spray == "C"]
## W = 0.94771, p-value = 0.6038
##
## Shapiro-Wilk normality test
##
## data: InsectSprays2$count_norm[InsectSprays2$spray == "F"]
## W = 0.92197, p-value = 0.3026
The second assumption we need to verify is that the variances of the two groups are homogeneous. This can be tested using the var.test() function.
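A sketch, keeping only the two sprays we want to compare (the subset name sprays_CF is an assumption):
# Keep only sprays C and F and compare their variances
sprays_CF <- droplevels(subset(InsectSprays2, spray %in% c("C", "F")))
var.test(count_norm ~ spray, data = sprays_CF)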
##
## F test to compare two variances
##
## data: count_norm by spray
## F = 3.1384, num df = 11, denom df = 11, p-value = 0.07059
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.9034831 10.9019558
## sample estimates:
## ratio of variances
## 3.138428
Because p > 0.05, we conclude that the variances of the two groups are not significantly different at the 95% confidence level.
Now that we have verified the two assumptions, we can use the t.test() function to compare the means of the two groups. Note that here we are using var.equal = TRUE. If the variances are not equal, we should use var.equal = FALSE, which performs a Welch's t-test (unequal variances t-test).
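Using the same two-spray subset created above:
t.test(count_norm ~ spray, data = sprays_CF, var.equal = TRUE)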
##
## Two Sample t-test
##
## data: count_norm by spray
## t = -9.091, df = 22, p-value = 6.638e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.9936932 -0.6245354
## sample estimates:
## mean in group C mean in group F
## 0.4137119 1.2228262
Since the p-value < 0.05, we can conclude that these two treatments lead to different average insect counts, 19 times out of 20 (95% confidence).
We then report the results of the t-test as follows:
There was a significant difference in the number of insects between spray C (mean = 0.41) and spray F (mean = 1.22) [t(22) = -9.09, p < 0.001].
You can (and should) exponentiate the results you get from the log-transformation. Doing so will provide the geometric mean and not the arithmetic mean.
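A sketch of the back-transformation (using the log10 transform assumed earlier); the values match the report below:
10^0.4137119   # spray C: ~2.59
10^1.2228262   # spray F: ~16.7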
There was a significant difference in the number of insects between spray C (mean = 2.59) and spray F (mean = 16.7) [t(22) = -9.09, p < 0.001].
We usually do a bilateral (two-sided) test, which is appropriate for the vast majority of situations. A bilateral test does not presuppose that one mean should be significantly higher or lower than the other, but rather that it could be one or the other, without preference.
A unilateral (one-sided) test checks whether one mean is greater than the other (alternative = "greater") or less than the other (alternative = "less").
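A sketch of the two one-sided tests on the same subset:
t.test(count_norm ~ spray, data = sprays_CF, var.equal = TRUE, alternative = "greater")
t.test(count_norm ~ spray, data = sprays_CF, var.equal = TRUE, alternative = "less")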
##
## Two Sample t-test
##
## data: count_norm by spray
## t = -9.091, df = 22, p-value = 1
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -0.9619436 Inf
## sample estimates:
## mean in group C mean in group F
## 0.4137119 1.2228262
##
## Two Sample t-test
##
## data: count_norm by spray
## t = -9.091, df = 22, p-value = 3.319e-09
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -0.656285
## sample estimates:
## mean in group C mean in group F
## 0.4137119 1.2228262
If your data is not normal and cannot be transformed, you should use a non-parametric test (i.e., based on ranks) such as the Wilcoxon / Mann-Whitney test (wilcox.test()).
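A sketch on the untransformed counts of the two sprays compared above:
wilcox.test(count ~ spray, data = sprays_CF)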
The analysis of variance (ANOVA) extends this comparison to more than two groups.
H0: \(\mu_1 = \mu_2 = \mu_3 = ... = \mu_{n-1} = \mu_n\)
H1: At least one mean is different
This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov.
 | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | audi | a4 | 1.80 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
2 | audi | a4 | 1.80 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
3 | audi | a4 | 2.00 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
4 | audi | a4 | 2.00 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
5 | audi | a4 | 2.80 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
6 | audi | a4 | 2.80 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
7 | audi | a4 | 3.10 | 2008 | 6 | auto(av) | f | 18 | 27 | p | compact |
8 | audi | a4 quattro | 1.80 | 1999 | 4 | manual(m5) | 4 | 18 | 26 | p | compact |
9 | audi | a4 quattro | 1.80 | 1999 | 4 | auto(l5) | 4 | 16 | 25 | p | compact |
10 | audi | a4 quattro | 2.00 | 2008 | 4 | manual(m6) | 4 | 20 | 28 | p | compact |
A one-way ANOVA is performed using the aov() (analysis of variance) function.
# city miles per gallon AS A FUNCTION of the grouping variable cyl
my_aov <- aov(cty ~ factor(cyl), data = mpg)
We can have a look at the summary of the model using the summary() function.
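For the model fitted above:
summary(my_aov)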
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(cyl) 3 2770 923.3 146.4 <2e-16 ***
## Residuals 230 1450 6.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficients of an ANOVA can be extracted using the coef() function.
## (Intercept) factor(cyl)5 factor(cyl)6 factor(cyl)8
## 21.0123457 -0.5123457 -4.7971558 -8.4409171
If we compare with the observed mean for each group:
## # A tibble: 4 x 2
## `factor(cyl)` mean_mpg
## <fct> <dbl>
## 1 4 21.0
## 2 5 20.5
## 3 6 16.2
## 4 8 12.6
Based on the output, we have F = 146.4 with a p-value < 2e-16. We can reject the null hypothesis (H0: all means are equal) and accept the alternative hypothesis H1.
Here is how you should report the result of your ANOVA in your paper.
There was a significant effect of the number of cylinders on city miles per gallon at the \(p < 0.05\) level \([F(3, 230) = 146.41, p < 0.0001]\).
If your data is not normal and cannot be transformed, you should use a non-parametric test (i.e., based on ranks) such as the Kruskal-Wallis test (kruskal.test()).
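A sketch with the mpg data used above:
# Non-parametric alternative to the one-way ANOVA (mpg ships with ggplot2)
kruskal.test(cty ~ factor(cyl), data = mpg)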
From the summary() function, we rejected the null hypothesis. We know that not all means are equal, but we do not know which one(s) differ.
To determine which groups are different from each other we need to use a post hoc test such as the Tukey Honest Significant Differences.
In R, the Tukey test is performed using the TukeyHSD() function, which uses the result returned by the aov() function.
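For the one-way model fitted above:
TukeyHSD(my_aov)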
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cty ~ factor(cyl), data = mpg)
##
## $`factor(cyl)`
## diff lwr upr p adj
## 5-4 -0.5123457 -3.841065 2.8163739 0.9785502
## 6-4 -4.7971558 -5.824803 -3.7695083 0.0000000
## 8-4 -8.4409171 -9.501482 -7.3803523 0.0000000
## 6-5 -4.2848101 -7.615512 -0.9541083 0.0055618
## 8-5 -7.9285714 -11.269576 -4.5875667 0.0000000
## 8-6 -3.6437613 -4.710531 -2.5769912 0.0000000
It is possible to plot the object to have a graphical representation of the differences.
It is easy to fit a multiple-factor ANOVA using the + sign in the formula.
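For example (the object name my_aov2 is an assumption):
# Two explanatory factors: number of cylinders and drive train type
my_aov2 <- aov(cty ~ factor(cyl) + drv, data = mpg)
summary(my_aov2)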
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(cyl) 3 2769.9 923.3 192.31 < 2e-16 ***
## drv 2 355.8 177.9 37.06 1.16e-14 ***
## Residuals 228 1094.6 4.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If you want to use the TukeyHSD() function on such a model, you must specify which factor is of interest.
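For example, for the drive train factor (again assuming the model is stored in my_aov2):
TukeyHSD(my_aov2, which = "drv")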
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cty ~ factor(cyl) + drv, data = mpg)
##
## $drv
## diff lwr upr p adj
## f-4 2.26257238 1.547387 2.977757 0.0000000
## r-4 2.18380844 1.031329 3.336288 0.0000366
## r-f -0.07876394 -1.228054 1.070526 0.9856936
Take a few minutes to open the data for your project and try to perform a simple ANOVA.
A linear regression is simply an approach for modeling the linear relationship between a dependent variable (Y) and one or more explanatory/independent variable(s) (X).
Normality: The errors (residuals) of the model should be normally distributed.
Linearity: The mean of the response variable is a linear combination of the predictors.
Homoscedasticity: Sometimes termed as homogeneity of variance, this means that the variance in the error is the same regardless of the values of the predictor variables.
Independence of errors: This assumes that the errors of the response variables are uncorrelated with each other.
No multicollinearity: This means that predictors are not (or very little) correlated among each others.
Fitting a linear model (lm) is the process of minimizing the errors/residuals between observations and the regression line.
The line that minimizes these errors is called the least squares line.
Where to draw the least square line?
\[ Y = \alpha + \beta X + \epsilon \]
\(\alpha\) is the intercept of the line (i.e. when \(X\) = 0).
\(\beta\) is the slope of the line describing how much \(Y\) changes when \(X\) varies by one unit.
\(\epsilon\) is the error or variation not accounted by the model.
\[ \epsilon \sim N(0,\sigma^2) \]
The errors follow a Normal distribution with mean \(\mu = 0\) and variance \(\sigma^2\).
Objective: Find \(\alpha\) and \(\beta\) that will minimize \(\epsilon\).
\[ Y_i = \alpha + \beta X_i + \epsilon_i \]
Find \(\alpha\) and \(\beta\) that will minimize \(\sum\limits_{i=1}^n \epsilon_i\)
We cannot simply use \(\sum\limits_{i=1}^n \epsilon_i\). Why? Because positive and negative errors cancel each other out.
There are two other possibilities, either use \(\sum\limits_{i=1}^n |\epsilon_i|\) or \(\sum\limits_{i=1}^n \epsilon_i^2\)
We consider the second option because working with squares is mathematically easier than working with absolute values.
Find \(\alpha\) and \(\beta\) that will minimize \(\sum\limits_{i=1}^n \epsilon_i^2\)
This is why we call it the sum of squared errors (SSE).
\(SSE(\alpha, \beta)\) : \(\sum\limits_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum\limits_{i=1}^n (Y_i - \alpha - \beta X_i)^2\)
Objective: Find \(\alpha\) and \(\beta\) that minimize SSE.
Set the partial derivatives of \(SSE(\alpha, \beta)\) with respect to \(\alpha\) and \(\beta\) equal to 0:
\(SSE(\alpha, \beta) = \sum\limits_{i=1}^n (Y_i - \alpha - \beta X_i)^2\)
\[ \begin{align} \frac{\partial SSE(\alpha, \beta)}{\partial \alpha} &= \sum (-1)(2)(Y_i - \alpha - \beta X_i) = 0 \\ &= \sum (Y_i - \alpha - \beta X_i) = 0 \end{align} \]
\[ \begin{align} \frac{\partial SSE(\alpha, \beta)}{\partial \beta} &= \sum (-X_i)(2)(Y_i - \alpha - \beta X_i) = 0 \\ &= \sum X_i(Y_i - \alpha - \beta X_i) = 0 \end{align} \]
After a bit of algebra, we end up with two equations:
\[ \alpha = \bar{Y} - \beta \bar{X} \]
\[ \beta = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - \bar{X})^2} \]
Using the following data, find \(\alpha\) and \(\beta\).
x | y |
---|---|
1 | -0.2070657 |
2 | 2.2774292 |
3 | 4.0844412 |
4 | 1.6543023 |
5 | 5.4291247 |
6 | 6.5060559 |
7 | 6.4252600 |
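A possible solution, applying the two formulas above to the data in the table:
# The example data
x <- 1:7
y <- c(-0.2070657, 2.2774292, 4.0844412, 1.6543023,
       5.4291247, 6.5060559, 6.4252600)

# Least squares estimates
beta  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha <- mean(y) - beta * mean(x)
beta
alpha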
## [1] 1.060676
## [1] -0.5041952
Hence the equation is: \(y = -0.5042 + 1.0607x\)
(Multiple) linear regressions are performed using the lm() function. The model is specified as model <- lm(y ~ x).
## x y
## 1 1 -0.2070657
## 2 2 2.2774292
## 3 3 4.0844412
## 4 4 1.6543023
## 5 5 5.4291247
## 6 6 6.5060559
## 7 7 6.4252600
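Fitting the model on the example data (the data frame name df and model name mod are the ones used in the output below):
df <- data.frame(
  x = 1:7,
  y = c(-0.2070657, 2.2774292, 4.0844412, 1.6543023,
        5.4291247, 6.5060559, 6.4252600)
)
mod <- lm(y ~ x, data = df)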
Different functions can be used on lm models:
plot(mod): Shows diagnostic plots of the model.
residuals(mod): Returns the residuals of the model (\(\epsilon_i\)).
predict(mod): Returns the predicted values of the model (\(\hat{Y}_i\)).
summary(mod): Prints a summary of the model.
coef(mod): Extracts the coefficients of the model.
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## 1 2 3 4 5 6 7
## -0.7635 0.6603 1.4066 -2.0842 0.6299 0.6462 -0.4953
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5042 1.0955 -0.46 0.6647
## x 1.0607 0.2450 4.33 0.0075 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.296 on 5 degrees of freedom
## Multiple R-squared: 0.7895, Adjusted R-squared: 0.7473
## F-statistic: 18.75 on 1 and 5 DF, p-value: 0.0075
## (Intercept) x
## -0.5041952 1.0606755
It is easy to plot the regression line using the abline() function. With ggplot2 it is even easier using geom_smooth(). Use se = FALSE if you do not want the confidence interval.
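A sketch of both approaches, reusing df and mod from above:
# Base R: scatter plot plus the fitted regression line
plot(y ~ x, data = df)
abline(mod)

# ggplot2 equivalent
library(ggplot2)
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)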
## 1 2 3 4 5 6 7
## 0.5564803 1.6171558 2.6778313 3.7385068 4.7991823 5.8598578 6.9205333
## 1 2 3 4 5 6
## -0.7635460 0.6602735 1.4066099 -2.0842045 0.6299424 0.6461981
## 7
## -0.4952733
Remember that \(\epsilon \sim N(\mu,\sigma^2)\) and that \(\mu = 0\)
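We can check this on the fitted model:
mean(residuals(mod))   # numerically zero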
## [1] 7.92939e-17
Use the mpg data and make a model to predict cty using the variable hwy (do not forget to look at the data before making your model).
 | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | audi | a4 | 1.80 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
2 | audi | a4 | 1.80 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
3 | audi | a4 | 2.00 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
4 | audi | a4 | 2.00 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
5 | audi | a4 | 2.80 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
6 | audi | a4 | 2.80 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
7 | audi | a4 | 3.10 | 2008 | 6 | auto(av) | f | 18 | 27 | p | compact |
8 | audi | a4 quattro | 1.80 | 1999 | 4 | manual(m5) | 4 | 18 | 26 | p | compact |
9 | audi | a4 quattro | 1.80 | 1999 | 4 | auto(l5) | 4 | 16 | 25 | p | compact |
10 | audi | a4 quattro | 2.00 | 2008 | 4 | manual(m6) | 4 | 20 | 28 | p | compact |
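A possible solution (the object name lm1 is the one used in the diagnostics below):
library(ggplot2)          # the mpg data ships with ggplot2
lm1 <- lm(cty ~ hwy, data = mpg)
summary(lm1)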
##
## Call:
## lm(formula = cty ~ hwy, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9247 -0.7757 -0.0428 0.6965 4.6096
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.84420 0.33319 2.534 0.0119 *
## hwy 0.68322 0.01378 49.585 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.252 on 232 degrees of freedom
## Multiple R-squared: 0.9138, Adjusted R-squared: 0.9134
## F-statistic: 2459 on 1 and 232 DF, p-value: < 2.2e-16
Remember that \(\epsilon \sim N(\mu,\sigma^2)\) and that \(\mu = 0\)
H0: Data from a normally distributed population.
H1: Data are not from a normally distributed population.
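Testing the normality of the residuals of the model fitted above:
shapiro.test(residuals(lm1))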
##
## Shapiro-Wilk normality test
##
## data: residuals(lm1)
## W = 0.98562, p-value = 0.0184
The p-value is the probability of obtaining data at least this far from normality if the sample really did come from a normal distribution; the lower this value, the stronger the evidence against normality. You cannot say “the sample has a normal distribution” or “the sample comes from a population which has a normal distribution”, but only “we cannot reject the hypothesis that the sample comes from a population which has a normal distribution”.
H0: Constant error variance
H1: Error variance changes with the level of the response
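The output below corresponds to the non-constant variance score test from the car package (a sketch):
library(car)
ncvTest(lm1)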
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 21.96073, Df = 1, p = 2.7829e-06
Since p < 0.05, we reject the null hypothesis and conclude that the model does not have a constant error variance.
This can be done in R using the gls() function from the nlme library, in which we explicitly provide a structure for the error variance as a function of the fitted values.
Common variance structures include varPower() and varExp().
Useful book: Zuur et al. (2009)
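A minimal sketch, assuming we let the error variance grow as a power of the fitted values (the default form of varPower()); the object name gls1 is an assumption:
library(nlme)
mpg_df <- as.data.frame(mpg)   # plain data frame for nlme
gls1 <- gls(cty ~ hwy, data = mpg_df, weights = varPower())
summary(gls1)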
Most linear statistics (anova, lm, etc.) suppose normally distributed data:
\[ Y \sim N(\mu,\sigma^2) \]
What can we do if the data do not follow such a distribution?
Data transformation is one option, but most of the time there are better ways to deal with data that do not follow a Normal distribution. We will see how when we discuss GLMs.
\[ Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_n X_n + \epsilon \]
A multiple regression is simply an extension of a simple linear regression that includes two or more predictors.
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9682 -1.5795 -0.4353 1.1662 5.5272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.30736 14.62994 1.798 0.08424 .
## cyl -0.81856 0.81156 -1.009 0.32282
## disp 0.01320 0.01204 1.097 0.28307
## hp -0.01793 0.01551 -1.156 0.25846
## drat 1.32041 1.47948 0.892 0.38065
## wt -4.19083 1.25791 -3.332 0.00269 **
## qsec 0.40146 0.51658 0.777 0.44436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.557 on 25 degrees of freedom
## Multiple R-squared: 0.8548, Adjusted R-squared: 0.82
## F-statistic: 24.53 on 6 and 25 DF, p-value: 2.45e-09
The \(R^2\) will always increase when adding new predictors to the model.
Since \(R^2\) always increases when new predictors (\(X_i\)) are added, we need a better way to evaluate the goodness of fit of multiple linear regression models.
The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model.
The adjusted \(R^2\) takes into account the number of explanatory variables in the model.
\[ R^2_{adj} = R^2 - (1 - R^2) \frac{p}{n - p - 1} \]
Where \(p\) is the number of explanatory variables and \(n\) the number of observations in the data.
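As a quick check, we can apply the formula to the full model fitted above (R² = 0.8548, p = 6 predictors, n = 32 observations):
r2 <- 0.8548; p <- 6; n <- 32
r2 - (1 - r2) * p / (n - p - 1)   # ~0.82, matching the summary() output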
Always report \(R^2_{adj}\) when using multiple-linear regressions.
Trade-off between performance and complexity
When I work to make a new model I personally like to keep in mind that:
Any modeling project should be tempered by the morality of laziness (Barnes and Chu 2010).
Avoid the pitfall of: my \(R^2\) is better than yours.
Multicollinearity occurs when predictors are highly correlated among them.
Some predictors add redundant information to the model.
Variance inflation factor (VIF) is a measure used to quantify the degree of multicollinearity in a multiple regression analysis.
It tells how much larger the standard error is, compared with what it would be if that variable were uncorrelated with the other predictor variables in the model.
It is easy to understand how to calculate VIF: basically, we take one predictor of the regression and predict it using the other predictors (one VIF per \(X_i\)).
Step #1: Regressions (1 per explanatory variable)
\[ X_1 = \alpha + \beta_1 X_2 + \beta_2 X_3 + \beta_3 X_4 + ... \beta_n X_n + \epsilon \]
Step #2: Calculate VIF
\[ VIF = \frac{1}{1 - R^2} \]
Step #3: Analyze
If the variance inflation factor of a predictor variable were 5.27 (\(\sqrt{5.27}\) = 2.3) this means that the standard error for the coefficient of that predictor variable is 2.3 times as large as it would be if that predictor variable were uncorrelated with the other predictor variables.
As a rule of thumb, if \(\sqrt{VIF} > 2\), multicollinearity is considered high.
Let's calculate the VIF for the cyl explanatory variable.
Remember that:
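A possible sketch of the two steps (the object name r2_cyl is an assumption):
# Step 1: regress cyl on the other explanatory variables of the full model
r2_cyl <- summary(lm(cyl ~ disp + hp + drat + wt + qsec, data = mtcars))$r.squared
r2_cyl

# Step 2: apply the VIF formula
1 / (1 - r2_cyl)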
## [1] 0.8995881
## [1] 9.958978
Given that \(\sqrt{VIF}\) = 3.16 > 2, we conclude that multicollinearity is high.
There is an easy way to calculate VIF in R using the vif() function from the car library.
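A sketch on the full model (the object name mod_full is an assumption):
library(car)
mod_full <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars)
vif(mod_full)
sqrt(vif(mod_full)) > 2   # flag predictors with high multicollinearity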
## cyl disp hp drat wt qsec
## 9.958978 10.550573 5.357783 2.966519 7.181690 4.039701
## cyl disp hp drat wt qsec
## TRUE TRUE TRUE FALSE TRUE TRUE
Based on these results, we have high multicollinearity for many predictors. What can we do about it?
There are many stepwise techniques to select predictors (\(X_i\)) in a multiple linear regression.
In statistics, stepwise regression includes regression models in which the choice of predictive variables is carried out by an automatic procedure (Wikipedia).
Basically, you say: “Here are my data; find a good model for me.”
Such automatic procedures have well-known drawbacks; nevertheless, many people have been working to develop methods that minimize them.
We will see an approach based on the technique suggested by Blanchet, Legendre, and Borcard (2008) called forward selection.
The package providing forward.sel() is not installed by default.
Without going into too much detail, the method adds explanatory variables one at a time and uses two stopping criteria: the usual significance level (alpha) and the adjusted \(R^2\) of the global model (the model containing all explanatory variables). The selection stops as soon as one of these criteria is exceeded.
The first step consists in creating two objects: x, which contains our explanatory variables, and y, which contains the response variable.
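For the mtcars example used above:
# Response and explanatory variables
y <- mtcars$mpg
x <- mtcars[, c("cyl", "disp", "hp", "drat", "wt", "qsec")]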
The second step consists in running the forward selection without stopping criteria, in order to obtain the adjusted \(R^2\) of the global model (stored in model2_temp and used below).
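One possible call (an assumption; setting alpha = 1 disables the significance criterion so that all variables are ranked). The package that provides forward.sel() must be loaded first.
model2_temp <- forward.sel(y, x, alpha = 1, nperm = 999)
model2_temp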
## variables order R2 R2Cum AdjR2Cum F pval
## 1 wt 5 0.752832794 0.7528328 0.7445939 91.3753250 0.001
## 2 cyl 1 0.077394600 0.8302274 0.8185189 13.2202917 0.002
## 3 hp 3 0.012922590 0.8431500 0.8263446 2.3068695 0.146
## 4 disp 2 0.005484810 0.8486348 0.8262103 0.9783614 0.321
## 5 drat 4 0.002680387 0.8513152 0.8227219 0.4687099 0.520
## 6 qsec 6 0.003507231 0.8548224 0.8199798 0.6039554 0.444
# Maximum r2 expected
globalR2 <- model2_temp$AdjR2Cum[length(model2_temp$AdjR2Cum)]
model2_final <- forward.sel(y, x, alpha = 0.05, R2thresh = 1,
R2more = 0.001, adjR2thresh = globalR2,
nperm = 999)
## Testing variable 1
## Testing variable 2
## Testing variable 3
## Procedure stopped (alpha criteria): pvalue for variable 3 is 0.125000 (superior to 0.050000)
## variables order R2 R2Cum AdjR2Cum F pval
## 1 wt 5 0.7528328 0.7528328 0.7445939 91.37533 0.001
## 2 cyl 1 0.0773946 0.8302274 0.8185189 13.22029 0.003
##
## Call:
## lm(formula = mpg ~ wt + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2893 -1.5512 -0.4684 1.5743 6.1004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6863 1.7150 23.141 < 2e-16 ***
## wt -3.1910 0.7569 -4.216 0.000222 ***
## cyl -1.5078 0.4147 -3.636 0.001064 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared: 0.8302, Adjusted R-squared: 0.8185
## F-statistic: 70.91 on 2 and 29 DF, p-value: 6.809e-12
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9682 -1.5795 -0.4353 1.1662 5.5272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.30736 14.62994 1.798 0.08424 .
## cyl -0.81856 0.81156 -1.009 0.32282
## disp 0.01320 0.01204 1.097 0.28307
## hp -0.01793 0.01551 -1.156 0.25846
## drat 1.32041 1.47948 0.892 0.38065
## wt -4.19083 1.25791 -3.332 0.00269 **
## qsec 0.40146 0.51658 0.777 0.44436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.557 on 25 degrees of freedom
## Multiple R-squared: 0.8548, Adjusted R-squared: 0.82
## F-statistic: 24.53 on 6 and 25 DF, p-value: 2.45e-09
##
## Call:
## lm(formula = mpg ~ wt + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2893 -1.5512 -0.4684 1.5743 6.1004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6863 1.7150 23.141 < 2e-16 ***
## wt -3.1910 0.7569 -4.216 0.000222 ***
## cyl -1.5078 0.4147 -3.636 0.001064 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared: 0.8302, Adjusted R-squared: 0.8185
## F-statistic: 70.91 on 2 and 29 DF, p-value: 6.809e-12
For this exercise, let's suppose that the full model with six predictors is a good one (which it is not).
The model yields an \(R^2\) of 0.8548.
The question we might ask is: how is this value divided among the predictors?
We can evaluate the contribution of each predictor to the \(R^2\); this operation is called variance partitioning.
This can easily be done in R using the calc.relimp() function from the relaimpo package.
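A sketch, assuming the full six-predictor model is stored in mod_full as above:
library(relaimpo)
mod_full <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars)
calc.relimp(mod_full, type = "lmg", rela = FALSE)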
## Response variable: mpg
## Total response variance: 36.3241
## Analysis based on 32 observations
##
## 6 Regressors:
## cyl disp hp drat wt qsec
## Proportion of variance explained by model: 85.48%
## Metrics are not normalized (rela=FALSE).
##
## Relative importance metrics:
##
## lmg
## cyl 0.17323909
## disp 0.16405351
## hp 0.14365936
## drat 0.10377957
## wt 0.22521079
## qsec 0.04488009
##
## Average coefficients for different model sizes:
##
## 1X 2Xs 3Xs 4Xs 5Xs
## cyl -2.87579014 -2.19605501 -1.60714484 -1.231094993 -0.993048829
## disp -0.04121512 -0.02884724 -0.01449644 -0.002155583 0.006775406
## hp -0.06822828 -0.04242288 -0.03020759 -0.025166512 -0.021378913
## drat 7.67823260 3.42465007 1.90570167 1.449119405 1.326454854
## wt -5.34447157 -4.05010008 -3.78513541 -3.898088179 -4.045422683
## qsec 1.41212484 0.20520857 0.12363951 0.187916885 0.287527166
## 6Xs
## cyl -0.81856023
## disp 0.01320490
## hp -0.01792993
## drat 1.32040573
## wt -4.19083238
## qsec 0.40146117
Practice these concepts on your own data.
http://cba.ualr.edu/smartstat/topics/anova/example.pdf
http://www.analyticsforfun.com/2014/06/performing-anova-test-in-r-results-and.html
http://www.ats.ucla.edu/stat/r/faq/posthoc.htm
http://www.r-tutor.com/elementary-statistics/probability-distributions/f-distribution
Barnes, David J., and Dominique Chu. 2010. Introduction to Modeling for Biosciences. 1st ed. New York: Springer.
Blanchet, F. Guillaume, Pierre Legendre, and Daniel Borcard. 2008. “Forward Selection of Explanatory Variables.” Ecology 89 (9): 2623–32. https://doi.org/10.1890/07-0986.1.
Zuur, Alain F., Elena N. Ieno, Neil Walker, Anatoly A. Saveliev, and Graham M. Smith. 2009. Mixed Effects Models and Extensions in Ecology with R. New York: Springer. https://doi.org/10.1007/978-0-387-87458-6.