The t-test is used to determine whether there is a significant difference between the means of two groups. It assumes that the dependent variable follows a normal distribution.
H0: \(\mu_1 = \mu_2\)
H1: The means are different
A working example.
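A minimal sketch of how the data could be set up (the InsectSprays data ships with base R; the InsectSprays2 copy used below is an assumption):
# InsectSprays: insect counts after treatment with six sprays (A to F)
data("InsectSprays")
InsectSprays2 <- InsectSprays   # work on a copy so we can add columns later
InsectSprays2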
## count spray
## 1 10 A
## 2 7 A
## 3 20 A
## 4 14 A
## 5 14 A
## 6 12 A
## 7 10 A
## 8 23 A
## 9 17 A
## 10 20 A
## 11 14 A
## 12 13 A
## 13 11 B
## 14 17 B
## 15 21 B
## 16 11 B
## 17 16 B
## 18 14 B
## 19 17 B
## 20 17 B
## 21 19 B
## 22 21 B
## 23 7 B
## 24 13 B
## 25 0 C
## 26 1 C
## 27 7 C
## 28 2 C
## 29 3 C
## 30 1 C
## 31 2 C
## 32 1 C
## 33 3 C
## 34 0 C
## 35 1 C
## 36 4 C
## 37 3 D
## 38 5 D
## 39 12 D
## 40 6 D
## 41 4 D
## 42 3 D
## 43 5 D
## 44 5 D
## 45 5 D
## 46 5 D
## 47 2 D
## 48 4 D
## 49 3 E
## 50 5 E
## 51 3 E
## 52 5 E
## 53 3 E
## 54 6 E
## 55 1 E
## 56 1 E
## 57 3 E
## 58 2 E
## 59 6 E
## 60 4 E
## 61 11 F
## 62 9 F
## 63 15 F
## 64 22 F
## 65 15 F
## 66 16 F
## 67 13 F
## 68 10 F
## 69 26 F
## 70 26 F
## 71 24 F
## 72 13 F
Visually, our data does not look normally distributed.
We can use the shapiro.test() function to test whether our data is normally distributed (the null hypothesis being that the data come from a normal distribution).
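For example, for the counts of spray C:
shapiro.test(InsectSprays2$count[InsectSprays2$spray == "C"])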
##
## Shapiro-Wilk normality test
##
## data: InsectSprays2$count[InsectSprays2$spray == "C"]
## W = 0.85907, p-value = 0.04759
We can reject the null hypothesis that our data is normally distributed.
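The same test for spray F:
shapiro.test(InsectSprays2$count[InsectSprays2$spray == "F"])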
##
## Shapiro-Wilk normality test
##
## data: InsectSprays2$count[InsectSprays2$spray == "F"]
## W = 0.88475, p-value = 0.1009
Despite the non-normal look of this data, the test does not reject normality (p > 0.05).
Since our data is not normal, what can we do?
Here we work with data that are distributed asymmetrically, with a dominance of low values and a few large values. This type of distribution typically corresponds to a log-normal distribution, that is, the log-transformed values follow a Normal distribution.
Once log-transformed, the data are normally distributed.
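A sketch of the transformation (assuming a log10(x + 1) transform, which is consistent with the group means reported further below):
# Add a log-transformed count column (the +1 handles the zero counts)
InsectSprays2$count_norm <- log10(InsectSprays2$count + 1)

# Re-test normality on the transformed data
shapiro.test(InsectSprays2$count_norm[InsectSprays2$spray == "C"])
shapiro.test(InsectSprays2$count_norm[InsectSprays2$spray == "F"])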
##
## Shapiro-Wilk normality test
##
## data: InsectSprays2$count_norm[InsectSprays2$spray == "C"]
## W = 0.94771, p-value = 0.6038
##
## Shapiro-Wilk normality test
##
## data: InsectSprays2$count_norm[InsectSprays2$spray == "F"]
## W = 0.92197, p-value = 0.3026
The second assumption we need to verify is that the variances of the two groups are homogeneous. This can be tested using the var.test() function.
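A sketch, keeping only the two sprays we want to compare (the subset name sprays_CF is an assumption):
# Keep only sprays C and F and compare their variances
sprays_CF <- droplevels(subset(InsectSprays2, spray %in% c("C", "F")))
var.test(count_norm ~ spray, data = sprays_CF)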
##
## F test to compare two variances
##
## data: count_norm by spray
## F = 3.1384, num df = 11, denom df = 11, p-value = 0.07059
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.9034831 10.9019558
## sample estimates:
## ratio of variances
## 3.138428
Because p > 0.05, we conclude that the variances of the two groups are not significantly different at the 95% confidence level.
Now that we have verified the two assumptions, we can use the t.test() function to compare the means of the two groups. Note that here we are using var.equal = TRUE. If the variances are not equal, we should use var.equal = FALSE, which performs a Welch's t-test (unequal variances t-test).
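Using the same two-spray subset created above:
t.test(count_norm ~ spray, data = sprays_CF, var.equal = TRUE)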
##
## Two Sample t-test
##
## data: count_norm by spray
## t = -9.091, df = 22, p-value = 6.638e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.9936932 -0.6245354
## sample estimates:
## mean in group C mean in group F
## 0.4137119 1.2228262
Since the p-value < 0.05, we can conclude that these two treatments lead to different average insect counts, 19 times out of 20 (95% confidence).
We then report the results of the t-test as follows:
There was a significant difference in the number of insects between spray C (mean = 0.41) and spray F (mean = 1.22) [t(22) = -9.09, p < 0.001].
You can (and should) exponentiate the results you get from the log-transformation. Doing so will provide the geometric mean and not the arithmetic mean.
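A sketch of the back-transformation (using the log10 transform assumed earlier); the values match the report below:
10^0.4137119   # spray C: ~2.59
10^1.2228262   # spray F: ~16.7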
There was a significant difference in the number of insects between spray C (mean = 2.59) and spray F (mean = 16.7) [t(22) = -9.09, p < 0.001].
We usually do a bilateral (two-sided) test, which is appropriate for the vast majority of situations. A bilateral test does not presuppose that one mean should be significantly higher or lower than the other, but rather that it could be one or the other, without preference.
A unilateral (one-sided) test checks whether one mean is greater than the other (alternative = "greater") or less than the other (alternative = "less").
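A sketch of the two one-sided tests on the same subset:
t.test(count_norm ~ spray, data = sprays_CF, var.equal = TRUE, alternative = "greater")
t.test(count_norm ~ spray, data = sprays_CF, var.equal = TRUE, alternative = "less")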
##
## Two Sample t-test
##
## data: count_norm by spray
## t = -9.091, df = 22, p-value = 1
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -0.9619436 Inf
## sample estimates:
## mean in group C mean in group F
## 0.4137119 1.2228262
##
## Two Sample t-test
##
## data: count_norm by spray
## t = -9.091, df = 22, p-value = 3.319e-09
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -0.656285
## sample estimates:
## mean in group C mean in group F
## 0.4137119 1.2228262
If your data is not normal and cannot be transformed, you should use a non-parametric test (i.e., based on ranks) such as the Wilcoxon / Mann-Whitney test (wilcox.test()).
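A sketch on the untransformed counts of the two sprays compared above:
wilcox.test(count ~ spray, data = sprays_CF)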
The analysis of variance (ANOVA) extends this comparison to more than two groups.
H0: \(\mu_1 = \mu_2 = \mu_3 = ... = \mu_{n-1} = \mu_n\)
H1: At least one mean is different
This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov.
 | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | audi | a4 | 1.80 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
2 | audi | a4 | 1.80 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
3 | audi | a4 | 2.00 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
4 | audi | a4 | 2.00 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
5 | audi | a4 | 2.80 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
6 | audi | a4 | 2.80 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
7 | audi | a4 | 3.10 | 2008 | 6 | auto(av) | f | 18 | 27 | p | compact |
8 | audi | a4 quattro | 1.80 | 1999 | 4 | manual(m5) | 4 | 18 | 26 | p | compact |
9 | audi | a4 quattro | 1.80 | 1999 | 4 | auto(l5) | 4 | 16 | 25 | p | compact |
10 | audi | a4 quattro | 2.00 | 2008 | 4 | manual(m6) | 4 | 20 | 28 | p | compact |
A one-way ANOVA is performed using the aov() (analysis of variance) function.
# city miles per gallon AS A FUNCTION of the grouping variable cyl
my_aov <- aov(cty ~ factor(cyl), data = mpg)
We can have a look at the summary of the model using the summary() function.
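For the model fitted above:
summary(my_aov)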
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(cyl) 3 2770 923.3 146.4 <2e-16 ***
## Residuals 230 1450 6.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficients of an ANOVA can be extracted using the coef() function.
## (Intercept) factor(cyl)5 factor(cyl)6 factor(cyl)8
## 21.0123457 -0.5123457 -4.7971558 -8.4409171
If we compare with the observed mean for each group:
## # A tibble: 4 x 2
## `factor(cyl)` mean_mpg
## <fct> <dbl>
## 1 4 21.0
## 2 5 20.5
## 3 6 16.2
## 4 8 12.6
Based on the output, we have F = 146.4 with a p-value < 2e-16. We can reject the null hypothesis (H0: all means are equal) and accept the alternative hypothesis H1.
Here is how you should report the result of your ANOVA in your paper.
There was a significant effect of the number of cylinders on city miles per gallon at the \(p < 0.05\) level \([F(3, 230) = 146.41, p < 0.0001]\).
If your data is not normal and cannot be transformed, you should use a non-parametric test (i.e., based on ranks) such as the Kruskal-Wallis test (kruskal.test()).
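A sketch with the mpg data used above:
# Non-parametric alternative to the one-way ANOVA (mpg ships with ggplot2)
kruskal.test(cty ~ factor(cyl), data = mpg)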
From the summary() function, we rejected the null hypothesis. We know that not all means are equal, but we do not know which one(s) differ.
To determine which groups are different from each other we need to use a post hoc test such as the Tukey Honest Significant Differences.
In R, the Tukey test is performed using the TukeyHSD() function, which uses the result returned by the aov() function.
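For the one-way model fitted above:
TukeyHSD(my_aov)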
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cty ~ factor(cyl), data = mpg)
##
## $`factor(cyl)`
## diff lwr upr p adj
## 5-4 -0.5123457 -3.841065 2.8163739 0.9785502
## 6-4 -4.7971558 -5.824803 -3.7695083 0.0000000
## 8-4 -8.4409171 -9.501482 -7.3803523 0.0000000
## 6-5 -4.2848101 -7.615512 -0.9541083 0.0055618
## 8-5 -7.9285714 -11.269576 -4.5875667 0.0000000
## 8-6 -3.6437613 -4.710531 -2.5769912 0.0000000
It is possible to plot the object to have a graphical representation of the differences.
It is easy to fit a multiple-factor ANOVA using the + sign in the formula.
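For example (the object name my_aov2 is an assumption):
# Two explanatory factors: number of cylinders and drive train type
my_aov2 <- aov(cty ~ factor(cyl) + drv, data = mpg)
summary(my_aov2)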
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(cyl) 3 2769.9 923.3 192.31 < 2e-16 ***
## drv 2 355.8 177.9 37.06 1.16e-14 ***
## Residuals 228 1094.6 4.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If you want to use the TukeyHSD() function on such a model, you must specify which factor is of interest.
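For example, for the drive train factor (again assuming the model is stored in my_aov2):
TukeyHSD(my_aov2, which = "drv")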
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = cty ~ factor(cyl) + drv, data = mpg)
##
## $drv
## diff lwr upr p adj
## f-4 2.26257238 1.547387 2.977757 0.0000000
## r-4 2.18380844 1.031329 3.336288 0.0000366
## r-f -0.07876394 -1.228054 1.070526 0.9856936
Take a few minutes to open the data for your project and try to perform a simple ANOVA.
A linear regression is simply an approach for modeling the linear relationship between a dependent variable (Y) and one or more explanatory/independent variable(s) (X).
Normality: The errors (residuals) of the model should be normally distributed.
Linearity: The mean of the response variable is a linear combination of the predictors.
Homoscedasticity: Sometimes termed as homogeneity of variance, this means that the variance in the error is the same regardless of the values of the predictor variables.
Independence of errors: This assumes that the errors of the response variables are uncorrelated with each other.
No multicollinearity: This means that predictors are not (or very little) correlated among each others.
Fitting a linear model (lm) is the process of minimizing the errors/residuals between observations and the regression line.
The line that minimizes these errors is called the least squares line.
Where to draw the least square line?
\[ Y = \alpha + \beta X + \epsilon \]
\(\alpha\) is the intercept of the line (i.e. when \(X\) = 0).
\(\beta\) is the slope of the line describing how much \(Y\) changes when \(X\) varies by one unit.
\(\epsilon\) is the error or variation not accounted by the model.
\[ \epsilon \sim N(0,\sigma^2) \]
The errors follow a Normal distribution with mean \(\mu = 0\) and variance \(\sigma^2\).
Objective: Find \(\alpha\) and \(\beta\) that will minimize \(\epsilon\).
\[ Y_i = \alpha + \beta X_i + \epsilon_i \]
Find \(\alpha\) and \(\beta\) that will minimize \(\sum\limits_{i=1}^n \epsilon_i\)
We cannot simply use \(\sum\limits_{i=1}^n \epsilon_i\). Why? Because positive and negative errors cancel each other out.
There are two other possibilities, either use \(\sum\limits_{i=1}^n |\epsilon_i|\) or \(\sum\limits_{i=1}^n \epsilon_i^2\)
We consider the second option because working with squares is mathematically easier than working with absolute values.
Find \(\alpha\) and \(\beta\) that will minimize \(\sum\limits_{i=1}^n \epsilon_i^2\)
This is why we call it the sum of squared errors (SSE).
\(SSE(\alpha, \beta)\) : \(\sum\limits_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum\limits_{i=1}^n (Y_i - \alpha - \beta X_i)^2\)
Objective: Find \(\alpha\) and \(\beta\) that minimize SSE.
Set the partial derivatives of \(SSE(\alpha, \beta)\) with respect to \(\alpha\) and \(\beta\) equal to 0:
\(SSE(\alpha, \beta) = \sum\limits_{i=1}^n (Y_i - \alpha - \beta X_i)^2\)
\[ \begin{align} \frac{\partial SSE(\alpha, \beta)}{\partial \alpha} &= \sum (-1)(2)(Y_i - \alpha - \beta X_i) = 0 \\ &= \sum (Y_i - \alpha - \beta X_i) = 0 \end{align} \]
\[ \begin{align} \frac{\partial SSE(\alpha, \beta)}{\partial \beta} &= \sum (-X_i)(2)(Y_i - \alpha - \beta X_i) = 0 \\ &= \sum X_i(Y_i - \alpha - \beta X_i) = 0 \end{align} \]
After a bit of algebra, we end up with two equations:
\[ \alpha = \bar{Y} - \beta \bar{X} \]
\[ \beta = \frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{\sum(X_i - \bar{X})^2} \]
Using the following data, find \(\alpha\) and \(\beta\).
x | y |
---|---|
1 | -0.2070657 |
2 | 2.2774292 |
3 | 4.0844412 |
4 | 1.6543023 |
5 | 5.4291247 |
6 | 6.5060559 |
7 | 6.4252600 |
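A possible solution, applying the two formulas above to the data in the table:
# The example data
x <- 1:7
y <- c(-0.2070657, 2.2774292, 4.0844412, 1.6543023,
       5.4291247, 6.5060559, 6.4252600)

# Least squares estimates
beta  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha <- mean(y) - beta * mean(x)
beta
alpha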
## [1] 1.060676
## [1] -0.5041952
Hence the equation is: \(y = -0.5042 + 1.0607x\)
(Multiple) linear regressions are performed using the lm() function. The model is specified as model <- lm(y ~ x).
## x y
## 1 1 -0.2070657
## 2 2 2.2774292
## 3 3 4.0844412
## 4 4 1.6543023
## 5 5 5.4291247
## 6 6 6.5060559
## 7 7 6.4252600
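Fitting the model on the example data (the data frame name df and model name mod are the ones used in the output below):
df <- data.frame(
  x = 1:7,
  y = c(-0.2070657, 2.2774292, 4.0844412, 1.6543023,
        5.4291247, 6.5060559, 6.4252600)
)
mod <- lm(y ~ x, data = df)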
Different functions can be used on lm models:
plot(mod): Shows diagnostic plots of the model.
residuals(mod): Returns the residuals of the model (\(\epsilon_i\)).
predict(mod): Returns the predicted values of the model (\(\hat{Y}_i\)).
summary(mod): Prints a summary of the model.
coef(mod): Extracts the coefficients of the model.
##
## Call:
## lm(formula = y ~ x, data = df)
##
## Residuals:
## 1 2 3 4 5 6 7
## -0.7635 0.6603 1.4066 -2.0842 0.6299 0.6462 -0.4953
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5042 1.0955 -0.46 0.6647
## x 1.0607 0.2450 4.33 0.0075 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.296 on 5 degrees of freedom
## Multiple R-squared: 0.7895, Adjusted R-squared: 0.7473
## F-statistic: 18.75 on 1 and 5 DF, p-value: 0.0075
## (Intercept) x
## -0.5041952 1.0606755
It is easy to plot the regression line using the abline() function. With ggplot2 it is even easier using geom_smooth(). Use se = FALSE if you do not want the confidence interval.
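A sketch of both approaches, reusing df and mod from above:
# Base R: scatter plot plus the fitted regression line
plot(y ~ x, data = df)
abline(mod)

# ggplot2 equivalent
library(ggplot2)
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)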
## 1 2 3 4 5 6 7
## 0.5564803 1.6171558 2.6778313 3.7385068 4.7991823 5.8598578 6.9205333
## 1 2 3 4 5 6
## -0.7635460 0.6602735 1.4066099 -2.0842045 0.6299424 0.6461981
## 7
## -0.4952733
Remember that \(\epsilon \sim N(\mu,\sigma^2)\) and that \(\mu = 0\)
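We can check this on the fitted model:
mean(residuals(mod))   # numerically zero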
## [1] 7.92939e-17
Use the mpg data and make a model to predict cty using the variable hwy (do not forget to look at the data before making your model).
 | manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | audi | a4 | 1.80 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
2 | audi | a4 | 1.80 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
3 | audi | a4 | 2.00 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
4 | audi | a4 | 2.00 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
5 | audi | a4 | 2.80 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
6 | audi | a4 | 2.80 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
7 | audi | a4 | 3.10 | 2008 | 6 | auto(av) | f | 18 | 27 | p | compact |
8 | audi | a4 quattro | 1.80 | 1999 | 4 | manual(m5) | 4 | 18 | 26 | p | compact |
9 | audi | a4 quattro | 1.80 | 1999 | 4 | auto(l5) | 4 | 16 | 25 | p | compact |
10 | audi | a4 quattro | 2.00 | 2008 | 4 | manual(m6) | 4 | 20 | 28 | p | compact |
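A possible solution (the object name lm1 is the one used in the diagnostics below):
library(ggplot2)          # the mpg data ships with ggplot2
lm1 <- lm(cty ~ hwy, data = mpg)
summary(lm1)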
##
## Call:
## lm(formula = cty ~ hwy, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9247 -0.7757 -0.0428 0.6965 4.6096
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.84420 0.33319 2.534 0.0119 *
## hwy 0.68322 0.01378 49.585 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.252 on 232 degrees of freedom
## Multiple R-squared: 0.9138, Adjusted R-squared: 0.9134
## F-statistic: 2459 on 1 and 232 DF, p-value: < 2.2e-16
Remember that \(\epsilon \sim N(\mu,\sigma^2)\) and that \(\mu = 0\)
H0: Data from a normally distributed population.
H1: Data are not from a normally distributed population.
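Testing the normality of the residuals of the model fitted above:
shapiro.test(residuals(lm1))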
##
## Shapiro-Wilk normality test
##
## data: residuals(lm1)
## W = 0.98562, p-value = 0.0184
The p-value is the probability of obtaining data at least this far from normality if the sample really did come from a normal distribution; the lower this value, the stronger the evidence against normality. You cannot say “the sample has a normal distribution” or “the sample comes from a population which has a normal distribution”, but only “we cannot reject the hypothesis that the sample comes from a population which has a normal distribution”.
H0: Constant error variance
H1: Error variance changes with the level of the response
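The output below corresponds to the non-constant variance score test from the car package (a sketch):
library(car)
ncvTest(lm1)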
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 21.96073, Df = 1, p = 2.7829e-06
Since p < 0.05, we reject the null hypothesis and conclude that the model does not have a constant error variance.
This can be done in R using the gls() function from the nlme library, in which we explicitly provide a structure for the error variance as a function of the fitted values.
Common variance structures include varPower() and varExp().
Useful book: Zuur et al. (2009)
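A minimal sketch, assuming we let the error variance grow as a power of the fitted values (the default form of varPower()); the object name gls1 is an assumption:
library(nlme)
mpg_df <- as.data.frame(mpg)   # plain data frame for nlme
gls1 <- gls(cty ~ hwy, data = mpg_df, weights = varPower())
summary(gls1)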
Most linear statistics (anova, lm, etc.) suppose normally distributed data:
\[ Y \sim N(\mu,\sigma^2) \]
What can we do if the data do not follow such a distribution?
Data transformation is one option, but most of the time there are better ways to deal with data that do not follow a Normal distribution. We will see how when we discuss GLMs.
\[ Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_n X_n + \epsilon \]
A multiple regression is simply an extension of a simple linear regression that includes two or more predictors.
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9682 -1.5795 -0.4353 1.1662 5.5272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.30736 14.62994 1.798 0.08424 .
## cyl -0.81856 0.81156 -1.009 0.32282
## disp 0.01320 0.01204 1.097 0.28307
## hp -0.01793 0.01551 -1.156 0.25846
## drat 1.32041 1.47948 0.892 0.38065
## wt -4.19083 1.25791 -3.332 0.00269 **
## qsec 0.40146 0.51658 0.777 0.44436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.557 on 25 degrees of freedom
## Multiple R-squared: 0.8548, Adjusted R-squared: 0.82
## F-statistic: 24.53 on 6 and 25 DF, p-value: 2.45e-09
The \(R^2\) will always increase when adding new predictors to the model.
Since \(R^2\) always increases when new predictors (\(X_i\)) are added, we need a better way to evaluate the goodness of fit of multiple linear regression models.
The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model.
The adjusted \(R^2\) takes into account the number of explanatory variables in the model.
\[ R^2_{adj} = R^2 - (1 - R^2) \frac{p}{n - p - 1} \]
Where \(p\) is the number of explanatory variables and \(n\) the number of observations in the data.
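As a quick check, we can apply the formula to the full model fitted above (R² = 0.8548, p = 6 predictors, n = 32 observations):
r2 <- 0.8548; p <- 6; n <- 32
r2 - (1 - r2) * p / (n - p - 1)   # ~0.82, matching the summary() output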
Always report \(R^2_{adj}\) when using multiple-linear regressions.
Trade-off between performance and complexity
When I work to make a new model I personally like to keep in mind that:
Any modeling project should be tempered by the morality of laziness (Barnes and Chu 2010).
Avoid the pitfall of: my \(R^2\) is better than yours.
Multicollinearity occurs when predictors are highly correlated among them.
Some predictors add redundant information to the model.
Variance inflation factor (VIF) is a measure used to quantify the degree of multicollinearity in a multiple regression analysis.
It tells how much larger the standard error is, compared with what it would be if that variable were uncorrelated with the other predictor variables in the model.
It is easy to understand how to calculate VIF: basically, we take one predictor of the regression and predict it using the other predictors (one VIF per \(X_i\)).
Step #1: Regressions (1 per explanatory variable)
\[ X_1 = \alpha + \beta_1 X_2 + \beta_2 X_3 + \beta_3 X_4 + ... \beta_n X_n + \epsilon \]
Step #2: Calculate VIF
\[ VIF = \frac{1}{1 - R^2} \]
Step #3: Analyze
If the variance inflation factor of a predictor variable were 5.27 (\(\sqrt{5.27}\) = 2.3) this means that the standard error for the coefficient of that predictor variable is 2.3 times as large as it would be if that predictor variable were uncorrelated with the other predictor variables.
As a rule of thumb, if \(\sqrt{VIF} > 2\), multicollinearity is considered high.
Let's calculate the VIF for the cyl explanatory variable.
Remember that:
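A possible sketch of the two steps (the object name r2_cyl is an assumption):
# Step 1: regress cyl on the other explanatory variables of the full model
r2_cyl <- summary(lm(cyl ~ disp + hp + drat + wt + qsec, data = mtcars))$r.squared
r2_cyl

# Step 2: apply the VIF formula
1 / (1 - r2_cyl)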
## [1] 0.8995881
## [1] 9.958978
Given that \(\sqrt{VIF}\) = 3.16 > 2, we conclude that multicollinearity is high.
There is an easy way to calculate VIF in R using the vif() function from the car library.
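A sketch on the full model (the object name mod_full is an assumption):
library(car)
mod_full <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars)
vif(mod_full)
sqrt(vif(mod_full)) > 2   # flag predictors with high multicollinearity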
## cyl disp hp drat wt qsec
## 9.958978 10.550573 5.357783 2.966519 7.181690 4.039701
## cyl disp hp drat wt qsec
## TRUE TRUE TRUE FALSE TRUE TRUE
Based on these results, we have high multicollinearity for many predictors. What can we do about it?
There are many stepwise techniques to select predictors (\(X_i\)) in a multiple linear regression.
In statistics, stepwise regression includes regression models in which the choice of predictive variables is carried out by an automatic procedure (Wikipedia).
Basically, you say: “Here are my data; find a good model for me.”
Such automatic procedures have well-known drawbacks; nevertheless, many people have been working to develop methods that minimize them.
We will see an approach based on the technique suggested by Blanchet, Legendre, and Borcard (2008) called forward selection.
The package providing forward.sel() is not installed by default.
Without going into too much detail, the method adds explanatory variables one at a time and uses two stopping criteria: the usual significance level (alpha) and the adjusted \(R^2\) of the global model (the model containing all explanatory variables). The selection stops as soon as one of these criteria is exceeded.
The first step consists in creating two objects: x, which contains our explanatory variables, and y, which contains the response variable.
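For the mtcars example used above:
# Response and explanatory variables
y <- mtcars$mpg
x <- mtcars[, c("cyl", "disp", "hp", "drat", "wt", "qsec")]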
The second step consists in running the forward selection without stopping criteria, in order to obtain the adjusted \(R^2\) of the global model (stored in model2_temp and used below).
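One possible call (an assumption; setting alpha = 1 disables the significance criterion so that all variables are ranked). The package that provides forward.sel() must be loaded first.
model2_temp <- forward.sel(y, x, alpha = 1, nperm = 999)
model2_temp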
## variables order R2 R2Cum AdjR2Cum F pval
## 1 wt 5 0.752832794 0.7528328 0.7445939 91.3753250 0.001
## 2 cyl 1 0.077394600 0.8302274 0.8185189 13.2202917 0.002
## 3 hp 3 0.012922590 0.8431500 0.8263446 2.3068695 0.146
## 4 disp 2 0.005484810 0.8486348 0.8262103 0.9783614 0.321
## 5 drat 4 0.002680387 0.8513152 0.8227219 0.4687099 0.520
## 6 qsec 6 0.003507231 0.8548224 0.8199798 0.6039554 0.444
# Maximum r2 expected
globalR2 <- model2_temp$AdjR2Cum[length(model2_temp$AdjR2Cum)]
model2_final <- forward.sel(y, x, alpha = 0.05, R2thresh = 1,
R2more = 0.001, adjR2thresh = globalR2,
nperm = 999)
## Testing variable 1
## Testing variable 2
## Testing variable 3
## Procedure stopped (alpha criteria): pvalue for variable 3 is 0.125000 (superior to 0.050000)
## variables order R2 R2Cum AdjR2Cum F pval
## 1 wt 5 0.7528328 0.7528328 0.7445939 91.37533 0.001
## 2 cyl 1 0.0773946 0.8302274 0.8185189 13.22029 0.003
##
## Call:
## lm(formula = mpg ~ wt + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2893 -1.5512 -0.4684 1.5743 6.1004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6863 1.7150 23.141 < 2e-16 ***
## wt -3.1910 0.7569 -4.216 0.000222 ***
## cyl -1.5078 0.4147 -3.636 0.001064 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared: 0.8302, Adjusted R-squared: 0.8185
## F-statistic: 70.91 on 2 and 29 DF, p-value: 6.809e-12
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9682 -1.5795 -0.4353 1.1662 5.5272
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.30736 14.62994 1.798 0.08424 .
## cyl -0.81856 0.81156 -1.009 0.32282
## disp 0.01320 0.01204 1.097 0.28307
## hp -0.01793 0.01551 -1.156 0.25846
## drat 1.32041 1.47948 0.892 0.38065
## wt -4.19083 1.25791 -3.332 0.00269 **
## qsec 0.40146 0.51658 0.777 0.44436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.557 on 25 degrees of freedom
## Multiple R-squared: 0.8548, Adjusted R-squared: 0.82
## F-statistic: 24.53 on 6 and 25 DF, p-value: 2.45e-09
##
## Call:
## lm(formula = mpg ~ wt + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2893 -1.5512 -0.4684 1.5743 6.1004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.6863 1.7150 23.141 < 2e-16 ***
## wt -3.1910 0.7569 -4.216 0.000222 ***
## cyl -1.5078 0.4147 -3.636 0.001064 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared: 0.8302, Adjusted R-squared: 0.8185
## F-statistic: 70.91 on 2 and 29 DF, p-value: 6.809e-12
For this exercise, let's suppose that the full model with six predictors is a good one (which it is not).
The model yields an \(R^2\) of 0.8548.
The question we might ask is: how is this value divided among the predictors?
We can evaluate the contribution of each predictor to the \(R^2\); this operation is called variance partitioning.
This can easily be done in R using the calc.relimp() function from the relaimpo package.
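A sketch, assuming the full six-predictor model is stored in mod_full as above:
library(relaimpo)
mod_full <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars)
calc.relimp(mod_full, type = "lmg", rela = FALSE)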
## Response variable: mpg
## Total response variance: 36.3241
## Analysis based on 32 observations
##
## 6 Regressors:
## cyl disp hp drat wt qsec
## Proportion of variance explained by model: 85.48%
## Metrics are not normalized (rela=FALSE).
##
## Relative importance metrics:
##
## lmg
## cyl 0.17323909
## disp 0.16405351
## hp 0.14365936
## drat 0.10377957
## wt 0.22521079
## qsec 0.04488009
##
## Average coefficients for different model sizes:
##
## 1X 2Xs 3Xs 4Xs 5Xs
## cyl -2.87579014 -2.19605501 -1.60714484 -1.231094993 -0.993048829
## disp -0.04121512 -0.02884724 -0.01449644 -0.002155583 0.006775406
## hp -0.06822828 -0.04242288 -0.03020759 -0.025166512 -0.021378913
## drat 7.67823260 3.42465007 1.90570167 1.449119405 1.326454854
## wt -5.34447157 -4.05010008 -3.78513541 -3.898088179 -4.045422683
## qsec 1.41212484 0.20520857 0.12363951 0.187916885 0.287527166
## 6Xs
## cyl -0.81856023
## disp 0.01320490
## hp -0.01792993
## drat 1.32040573
## wt -4.19083238
## qsec 0.40146117
Practice these concepts on your own data.
http://cba.ualr.edu/smartstat/topics/anova/example.pdf
http://www.analyticsforfun.com/2014/06/performing-anova-test-in-r-results-and.html
http://www.ats.ucla.edu/stat/r/faq/posthoc.htm
http://www.r-tutor.com/elementary-statistics/probability-distributions/f-distribution
Barnes, David J., and Dominique Chu. 2010. Introduction to Modeling for Biosciences. 1st ed. New York: Springer.
Blanchet, F. Guillaume, Pierre Legendre, and Daniel Borcard. 2008. “Forward Selection of Explanatory Variables.” Ecology 89 (9): 2623–32. https://doi.org/10.1890/07-0986.1.
Zuur, Alain F., Elena N. Ieno, Neil Walker, Anatoly A. Saveliev, and Graham M. Smith. 2009. Mixed Effects Models and Extensions in Ecology with R. New York: Springer. https://doi.org/10.1007/978-0-387-87458-6.