Until now, we have worked with statistical analyses (ANOVA, lm) that assume a normal distribution of the response variable.
Data: Daily maximum temperature in Roskilde between 2015-01-01 and 2015-12-18.
cet | max_temperature_c | mean_temperature_c | min_temperature_c | dew_point_c | mean_dew_point_c | min_dewpoint_c | |
---|---|---|---|---|---|---|---|
1 | 16436.00 | 6.00 | 4.00 | 3.00 | 5.00 | 4.00 | 2.00 |
2 | 16437.00 | 9.00 | 7.00 | 4.00 | 7.00 | 3.00 | 0.00 |
3 | 16438.00 | 6.00 | 3.00 | 2.00 | 3.00 | 2.00 | 0.00 |
4 | 16439.00 | 4.00 | 2.00 | 1.00 | 2.00 | -1.00 | -3.00 |
5 | 16440.00 | 4.00 | 2.00 | 0.00 | 3.00 | 2.00 | 0.00 |
6 | 16441.00 | 3.00 | 1.00 | -1.00 | 3.00 | 1.00 | 0.00 |
7 | 16442.00 | 6.00 | 3.00 | 1.00 | 4.00 | 2.00 | 0.00 |
8 | 16443.00 | 7.00 | 4.00 | 3.00 | 6.00 | 4.00 | 2.00 |
9 | 16444.00 | 7.00 | 5.00 | 3.00 | 5.00 | 2.00 | 1.00 |
10 | 16445.00 | 10.00 | 7.00 | 3.00 | 9.00 | 3.00 | -2.00 |
In 2015, the probability of having a temperature of 15 degrees was ~ 4%.
The probability density function of a normal curve (or distribution) of observations \(y_i\) is defined as:
\[ f(y_i; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(y_i - \mu)^2}{2\sigma^2}} \]
With \(\mu\) and \(\sigma^2\) being the mean and the variance of the population.
Using the previous equation, we can easily calculate the probability density of a value \(y\) given a mean \(\mu\) and a standard deviation \(\sigma\).
Let's calculate \(P(Y = y)\).
Note that with the normal distribution: \(-\infty < y < \infty\)
Precisely:
\(P(Y = -100)\)
## [1] 9.678735e-83
\(P(Y = 25)\)
## [1] 0.1074266
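The density above can also be coded by hand and checked against R's built-in dnorm() function (here evaluated at \(y = 0\) for a standard normal, as a sanity check):

```r
# Hand-coded Normal probability density, following the equation above
pdf_normal <- function(y, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(y - mu)^2 / (2 * sigma^2))
}

pdf_normal(0, mu = 0, sigma = 1)  # hand-coded version
## [1] 0.3989423
dnorm(0, mean = 0, sd = 1)        # built-in equivalent
## [1] 0.3989423
```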
Within the variable qsec from the mtcars data frame, calculate the theoretical probability of having a car doing 1/4 miles in 18 seconds \(P(Y = 18)\).
## [1] 0.2224557
Tip: in R, you can use the dnorm() function.
## [1] 0.2224557
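One possible solution, estimating \(\mu\) and \(\sigma\) from qsec:

```r
# Estimate mu and sigma from the qsec variable, then apply dnorm()
mu <- mean(mtcars$qsec)   # 17.84875
sigma <- sd(mtcars$qsec)  # 1.786943
dnorm(18, mean = mu, sd = sigma)
## [1] 0.2224557
```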
What to do if the response variable does not follow a Normal distribution?
This is especially common in ecology and environmental sciences.
How can we model such data?
GLMs are extensions of linear models in which a non-Gaussian distribution is used for the response variable (Zuur et al. 2009).
Before using GLMs on non-Gaussian response variables, we need to understand a few more types of distributions:
The Poisson distribution is a type of distribution suited for count data, which are common in ecology.
\[ f(y; \mu) = \frac{\mu^y \times e^{-\mu}}{y!} \]
with \(y \geq 0\) (why?)
Contrary to the Normal distribution, the PDF of the Poisson distribution has only one parameter, \(\mu\).
\(\mu\): the mean number of successes in the given time interval or region of space.
You have two minutes to write the Poisson PDF equation in R.
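One possible answer, translating the equation directly (this is the pdf_poisson() function used below; it gives the same results as R's dpois()):

```r
# Poisson PDF: probability of observing y events given mean rate u
pdf_poisson <- function(y, u) {
  u^y * exp(-u) / factorial(y)
}

pdf_poisson(y = 0, u = 2)  # same as dpois(0, 2)
## [1] 0.1353353
```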
The mean number of successes (\(\mu\)) is the number of successes obtained during a period of time.
For example, your Ph.D. advisor writes you, on average, 2 emails per day (\(\mu = 2\)). What is the probability of finally having a day without any emails from him/her (\(P(Y_0; 2)\))?
pdf_poisson(y = 0, u = 2)
## [1] 0.1353353
What is the probability of having him/her writing you less than 4 emails?
\(P(Y_0; 2) + P(Y_1; 2) + P(Y_2; 2) + P(Y_3; 2)\)
pdf_poisson(y = 0, u = 2) + pdf_poisson(y = 1, u = 2) +
pdf_poisson(y = 2, u = 2) + pdf_poisson(y = 3, u = 2)
## [1] 0.8571235
Tip: in R, you can use the dpois() function.
## [1] 0.8571235
Important: the sum of all probabilities is equal to 1.
\[ \sum\limits_{y=0}^{\infty} P(Y_y; \mu) = 1 \]
What is the probability of having him/her writing you 4 or more emails?
\[\begin{equation} \label{eq2} \begin{split} P(Y \geq 4) & = P(Y = 4) + P(Y = 5) + \dots \\ & = 1 - P(Y = 0) - P(Y = 1) - P(Y = 2) - P(Y = 3) \\ & = 1 - \frac{2^0 \times e^{-2}}{0!} - \frac{2^1 \times e^{-2}}{1!} - \frac{2^2 \times e^{-2}}{2!} - \frac{2^3 \times e^{-2}}{3!} \\ & = 1 - 0.135 - 0.271 - 0.271 - 0.180 \\ & = 0.143 \end{split} \end{equation}\]
1 - dpois(0, 2) - dpois(1, 2) - dpois(2, 2) - dpois(3, 2)
## [1] 0.1428765
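The same result can be obtained directly from the Poisson cumulative distribution function:

```r
# P(Y >= 4) = 1 - P(Y <= 3)
1 - ppois(3, lambda = 2)
## [1] 0.1428765
ppois(3, lambda = 2, lower.tail = FALSE)  # equivalent upper-tail form
## [1] 0.1428765
```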
You are now about to finish your PhD and your advisor is now sending you more emails each day (\(\mu = 3.2\)).
What is the probability of having 2.1 emails on one particular day?
Some Poisson shapes actually look like a Gaussian distribution, but they are not.
The negative binomial distribution is also a type of distribution concerning count data.
\[ f(y; k, \mu) = \frac{\Gamma(y + k)}{\Gamma(k) \times y!} \times \left(\frac{k}{\mu + k}\right)^k \times \left(1 - \frac{k}{\mu + k}\right)^y \]
The first difference with the Poisson distribution is that the negative binomial has two parameters: \(k\) and \(\mu\).
In R, it can be calculated using the dnbinom() function.
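As a sketch, the PDF can also be hand-coded and checked against dnbinom(), which takes the dispersion parameter as size and the mean as mu:

```r
# Hand-coded negative binomial PDF (mean mu, dispersion k)
pdf_nbinom <- function(y, k, mu) {
  gamma(y + k) / (gamma(k) * factorial(y)) *
    (k / (mu + k))^k * (1 - k / (mu + k))^y
}

pdf_nbinom(0, k = 1, mu = 2)   # hand-coded version
## [1] 0.3333333
dnbinom(0, size = 1, mu = 2)   # built-in equivalent
## [1] 0.3333333
```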
Parameter \(k\) is often called the dispersion parameter (Zuur et al. 2009). Dispersion occurs when the variance is larger than the mean.
\[ \text{E}(Y) = \mu \]
\[ \text{var}(Y) = \mu + \frac{\mu^2}{k} \]
The smaller \(k\), the larger the dispersion.
If \(k\) is large, then \(\frac{\mu^2}{k}\) approaches 0 and the negative binomial converges to the Poisson distribution.
How to choose between a Poisson and a Negative binomial distribution?
In a Poisson distribution, \(E(Y) = \text{var}(Y)\).
Overdispersion occurs when the variance is greater than the mean, \(E(Y) < \text{var}(Y)\), or when the number of zeros is greater than expected.
In theory, you can use either distribution to model your count data; the estimated mean (\(\mu\)) will be similar. However, if you do not choose the appropriate distribution, you risk biased standard errors, which increases the likelihood of incorrectly detecting a significant treatment effect in the Poisson model.
We have seen that both Poisson and Negative binomial distributions deal with count data values \(\geq 0\).
The Gamma distribution also deals with positive values, but for continuous data.
The probability density function of a Gamma distribution is defined as:
\[ f(y; \mu, \nu) = \frac{1}{\Gamma(\nu)} \times \left(\frac{\nu}{\mu}\right)^\nu \times y^{\nu - 1} \times e^{-\frac{y \times \nu}{\mu}} \]
with \(y > 0\)
with \(\mu\) the mean and \(\nu\) the dispersion parameter controlling the shape of the curve.
If \(\nu\) is large, the distribution converges toward a Gaussian distribution.
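With this mean/dispersion parameterisation, the density corresponds to R's dgamma() with shape = nu and rate = nu / mu (so that the mean is \(\mu\)):

```r
# Hand-coded Gamma PDF with mean mu and dispersion (shape) nu
pdf_gamma <- function(y, mu, nu) {
  1 / gamma(nu) * (nu / mu)^nu * y^(nu - 1) * exp(-y * nu / mu)
}

pdf_gamma(1, mu = 2, nu = 3)        # hand-coded version
## [1] 0.3765322
dgamma(1, shape = 3, rate = 3 / 2)  # built-in equivalent
## [1] 0.3765322
```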
When should we use it? When the response variable is continuous, right-skewed and always positive.
In such a situation you could also choose to log-transform your response variable and fit a traditional linear Normal model.
As we will see later, using a GLM with a Gamma distribution is generally a better solution.
In its simplest form (a single trial), a binomial variable can take only two values: 0 or 1.
This can also be viewed as the presence/absence of an animal in an ecological study.
The probability distribution of the binomial distribution is defined as:
\[ f(y;\pi, N) = \left(N \atop y \right) \times \pi^y \times (1 - \pi)^{N-y} \]
where \(0 \leq \pi \leq 1\) represents the probability of success and \(N \geq 1\) the number of trials.
Note that \(\left(N \atop y \right) = \frac{N!}{y!(N - y)!}\)
The easiest way to understand the binomial distribution is to think of an experiment where you toss a coin in the air. There are only two possible outcomes: either you get a “tail” or a “head”.
First, create the R function to calculate the PDF of a binomial distribution.
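A direct translation of the equation (this is the pdf_binomial() function used below; it is equivalent to R's dbinom()):

```r
# Binomial PDF: probability of y successes in N trials with success probability pi
pdf_binomial <- function(y, pi, N) {
  choose(N, y) * pi^y * (1 - pi)^(N - y)
}
```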
For this example, we represent heads as 1 (success) and tails as 0 (failure). If you are not cheating, the probability of getting 1 (heads) is 0.5 (50%) and the probability of getting 0 is (1 - 0.5).
Let's verify this by calculating \(P(y = 1)\):
# Prob of having 1 (head) with a success rate of
# 50% after 1 trial
pdf_binomial(1, pi = 0.5, N = 1)
## [1] 0.5
That makes sense: we have a 50% chance of getting heads after throwing a coin once.
The probability of getting only heads decreases as the number of trials increases.
What is the probability of having 8 heads when throwing a coin 10 times?
pdf_binomial(y = 8, pi = 0.5, N = 10)
## [1] 0.04394531
What is the probability of getting the value 2 three times when throwing a six-sided die four times?
pdf_binomial(y = 3, pi = 1/6, N = 4)
## [1] 0.0154321
What is the probability of getting at least one double six when throwing a pair of dice 24 times?
sum(pdf_binomial(y = 1:24, pi = 1/36, N = 24))
## [1] 0.4914039
\[ P(Y > 0) = 1 - P(Y = 0) \]
pdf_binomial(y = 0, pi = 1/36, N = 24)
## [1] 0.5085961
1 - pdf_binomial(y = 0, pi = 1/36, N = 24)
## [1] 0.4914039
The choice of the best distribution to model your response variable (\(y\)) should be an a priori choice based on knowledge of the type of data (Zuur et al. 2009).
Distribution of \(y\) | Type of data | R function |
---|---|---|
Normal | Continuous (\(-\infty < y < \infty\)) | dnorm |
Poisson | Count (\(y \geq 0\)) | dpois |
Negative binomial | Overdispersed count (\(y \geq 0\)) | dnbinom |
Gamma | Continuous (\(y \geq 0\)) | dgamma |
Binomial | Presence/absence (\(y = 0\), \(y = 1\)) | dbinom |
GLMs in R are performed using the glm() function. The syntax of a glm model is very similar to that of a linear model using the lm() function.
The only thing that differs is the family argument, which is used to specify the distribution of the response variable.
For this first example, we will model a response variable (slope) which follows a Gaussian distribution.
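The my_data object used below is not shown; as a self-contained sketch, we can simulate comparable data (all simulated values here are made up) and verify that lm() and glm(..., family = "gaussian") produce identical fits:

```r
set.seed(123)
# Simulated data loosely resembling slope ~ elev (hypothetical values)
elev <- runif(255, 0, 600)
slope <- 0.0135 + 0.000156 * elev + rnorm(255, sd = 0.0128)
sim_data <- data.frame(slope, elev)

lm_model  <- lm(slope ~ elev, data = sim_data)
glm_model <- glm(slope ~ elev, family = "gaussian", data = sim_data)

all.equal(coef(lm_model), coef(glm_model))
## [1] TRUE
```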
##
## Call:
## lm(formula = slope ~ elev, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.031941 -0.008261 -0.001467 0.005572 0.057441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.349e-02 1.961e-03 6.879 4.71e-11 ***
## elev 1.563e-04 5.546e-06 28.176 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01282 on 253 degrees of freedom
## Multiple R-squared: 0.7583, Adjusted R-squared: 0.7574
## F-statistic: 793.9 on 1 and 253 DF, p-value: < 2.2e-16
##
## Call:
## glm(formula = slope ~ elev, family = "gaussian", data = my_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.031941 -0.008261 -0.001467 0.005572 0.057441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.349e-02 1.961e-03 6.879 4.71e-11 ***
## elev 1.563e-04 5.546e-06 28.176 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.0001643528)
##
## Null deviance: 0.172056 on 254 degrees of freedom
## Residual deviance: 0.041581 on 253 degrees of freedom
## AIC: -1494.3
##
## Number of Fisher Scoring iterations: 2
Two linear models fitted with lm() and glm(..., family = "gaussian") are identical.
## (Intercept) elev
## 0.013486467 0.000156273
## (Intercept) elev
## 0.013486467 0.000156273
A Poisson regression is fitted using the glm() function with family = "poisson".
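The output below is based on the mpg data (assumed here to come from the ggplot2 package, since mpg appears in the model call):

```r
library(ggplot2)  # provides the mpg data frame
glm_poisson <- glm(cyl ~ displ, family = "poisson", data = mpg)
```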
Let's look at the predicted values of the model. Fitted (predicted) values of a glm model can be extracted using the predict() function as follows:
## 1 2 3 4 5 6
## 4.165979 4.165979 4.326270 4.326270 5.031522 5.031522
##
## Call:
## glm(formula = cyl ~ displ, family = "poisson", data = mpg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.98488 -0.15890 -0.07696 0.28678 0.64538
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.08716 0.08111 13.403 <2e-16 ***
## displ 0.18877 0.02014 9.374 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 103.685 on 233 degrees of freedom
## Residual deviance: 17.451 on 232 degrees of freedom
## AIC: 864.5
##
## Number of Fisher Scoring iterations: 4
The coefficient for displ is 0.1887712. This means that for a 1-unit increase in displ, the expected (mean) number of cyl increases by a factor of \(e^{0.18877}\) = 1.2077631.
## (Intercept) displ
## 2.965849 1.207765
displ | cyl | predicted | |
---|---|---|---|
1 | 2.00 | 4 | 4.33 |
2 | 3.00 | 6 | 5.23 |
3 | 4.00 | 6 | 6.31 |
4 | 4.00 | 8 | 6.31 |
5 | 5.00 | 8 | 7.62 |
6 | 6.00 | 8 | 9.21 |
In other words, the number of cyl increases by 20.78% per 1-unit increase in displ.
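The 20.78% figure follows directly from the coefficient:

```r
# Percent change in the expected number of cyl per 1-unit increase in displ
(exp(0.18877) - 1) * 100
## [1] 20.77631
```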
Remember that the negative binomial distribution is very similar to the Poisson distribution. We will now fit a negative binomial model using the same data as for the Poisson model.
In R, to perform a negative binomial regression we need to use the glm.nb() function from the MASS package.
library(MASS)
glm_nb <- glm.nb(cyl ~ displ, data = mpg)
summary(glm_nb) # Look at the dispersion parameter
##
## Call:
## glm.nb(formula = cyl ~ displ, data = mpg, init.theta = 1354965.912,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.98488 -0.15890 -0.07696 0.28678 0.64538
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.08716 0.08111 13.403 <2e-16 ***
## displ 0.18877 0.02014 9.374 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1354966) family taken to be 1)
##
## Null deviance: 103.685 on 233 degrees of freedom
## Residual deviance: 17.451 on 232 degrees of freedom
## AIC: 866.5
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1354966
## Std. Err.: 24103082
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -860.504
As with any glm, predicted values are obtained using the predict() function.
## 1 2 3 4 5 6
## 4.165979 4.165979 4.326269 4.326269 5.031522 5.031522
Looking at the summary of the negative binomial regression, we see a very large dispersion parameter (\(k \approx 1.35 \times 10^{6}\)). This is a first sign that we should stick with the Poisson model.
Additionally, the coefficients for both models are equal.
## (Intercept) displ
## 1.0871633 0.1887712
## (Intercept) displ
## 1.0871631 0.1887712
A useful way to select models in general is to use the Bayesian Information Criterion (BIC).
\[ \text{BIC} = -2 \times ln(\hat{L}) + k \times ln(n) \]
Where \(n\) is the number of data points (observations), \(k\) the number of parameters to be estimated and \(\hat{L}\) the maximized value of the likelihood function of the model (Burnham and Anderson 2002; Zuur et al. 2009).
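This formula can be checked against R's BIC() on any fitted model, e.g. a simple linear model on mtcars:

```r
# Manual BIC: -2 * log-likelihood + (number of estimated parameters) * log(n)
m <- lm(mpg ~ wt, data = mtcars)
ll <- logLik(m)
manual_bic <- -2 * as.numeric(ll) + attr(ll, "df") * log(nobs(m))

all.equal(manual_bic, BIC(m))
## [1] TRUE
```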
The BIC value is used to compare models. It is based on the principle of parsimony, helping to identify the model that accounts for the most variation with the fewest variables.
Such a model selection method requires the calculation of the BIC differences (\(\Delta_i\)) for all candidate models as follows:
\[ \Delta_i = \text{BIC}_i - \text{BIC}_{\text{min}} \]
\(\Delta_i\) | Evidence against higher BIC |
---|---|
0-2 | Weak |
2-6 | Positive |
6-10 | Strong |
10+ | Very strong |
In R, BIC is calculated using the BIC() function.
## df BIC
## glm_poisson 2 871.4141
## glm_nb 3 876.8703
If we look at the \(\Delta_i\) values:
## [1] 0.000000 5.456258
Based on these values, we can conclude that the Poisson model is the best option.
Use the following data to perform a GLM regression between distance and count. Which distribution should you use?
For this example, we are going to use the presence/absence data.
A binomial regression is fitted using the glm() function with family = "binomial".
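The my_data object used below is not shown; as a self-contained sketch, comparable presence/absence data can be simulated (the true coefficients here are made-up values chosen to echo the output below):

```r
set.seed(42)
# Simulated presence/absence driven by a logistic curve of area
area <- runif(255, 300, 1500)
presence <- rbinom(255, size = 1, prob = plogis(-8.5 + 0.0095 * area))
sim_data <- data.frame(area, presence)

glm_binomial <- glm(presence ~ area, family = "binomial", data = sim_data)
coef(glm_binomial)  # estimates should be close to -8.5 and 0.0095
```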
As with any glm, predicted values are obtained using the predict() function.
## 1 2 3 4 5 6
## 0.006113853 0.007174781 0.996638584 0.877413756 0.312340572 0.960833362
##
## Call:
## glm(formula = presence ~ area, family = "binomial", data = my_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.50961 -0.26750 0.09324 0.44214 1.94054
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.559089 1.185036 -7.223 5.10e-13 ***
## area 0.009475 0.001255 7.549 4.39e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 337.78 on 254 degrees of freedom
## Residual deviance: 147.86 on 253 degrees of freedom
## AIC: 151.86
##
## Number of Fisher Scoring iterations: 6
## # A tibble: 3 x 3
## area presence predicted
## <dbl> <dbl> <dbl>
## 1 530 0 0.0283
## 2 531 0 0.0285
## 3 532 0 0.0288
## (Intercept) area
## 0.0001917938 1.0095205100
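A predicted probability can also be recovered by hand from the coefficients above, applying the inverse logit; for example, for a (hypothetical) area of 900:

```r
# Linear predictor on the logit scale, then inverse-logit to a probability
eta <- -8.559089 + 0.009475 * 900
plogis(eta)  # probability of presence, close to 0.49
```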
We will use the following dataset to understand the Gamma distribution.
Let's have a look at the relation between x and y.
Looking at the histogram of y, we could decide to log-transform the data.
Note: \(y_i > 0\), which is an essential condition for applying a Gamma GLM.
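The data_gamma object used below is not shown; as a self-contained sketch, comparable data can be simulated (the true coefficients 0.4 and 1.15 are arbitrary values chosen to echo the output below):

```r
set.seed(1)
# Simulated Gamma data with a log link: mean = exp(0.4 + 1.15 * x)
x <- runif(100, -1, 1)
mu <- exp(0.4 + 1.15 * x)
nu <- 10  # dispersion (shape) parameter
y <- rgamma(100, shape = nu, rate = nu / mu)

glm_gamma <- glm(y ~ x, family = Gamma(link = "log"))
coef(glm_gamma)  # estimates should be close to 0.4 and 1.15
```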
##
## Call:
## lm(formula = log(y) ~ x, data = data_gamma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98780 -0.18389 0.02307 0.23713 0.80138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.38107 0.03385 11.26 <2e-16 ***
## x 1.13836 0.06019 18.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3382 on 98 degrees of freedom
## Multiple R-squared: 0.785, Adjusted R-squared: 0.7828
## F-statistic: 357.7 on 1 and 98 DF, p-value: < 2.2e-16
##
## Call:
## glm(formula = y ~ x, family = Gamma(link = "log"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9007 -0.2307 -0.0210 0.1797 0.8326
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.43559 0.03256 13.38 <2e-16 ***
## x 1.16522 0.05789 20.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.1057873)
##
## Null deviance: 50.968 on 99 degrees of freedom
## Residual deviance: 10.768 on 98 degrees of freedom
## AIC: 139.12
##
## Number of Fisher Scoring iterations: 4
## x predicted
## 1 -1 0.4820882
## 2 0 1.5458748
## 3 1 4.9570363
We see that for a 1-unit increase in x, the expected (mean) value of y increases by a factor of \(e^{1.16522}\) = 3.2066283.
For example, at \(x = 0\) the predicted mean is \(e^{0.43559}\):
## [1] 1.545878
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3810662 0.03385086 11.25721 2.320083e-19
## x 1.1383629 0.06018776 18.91353 1.779982e-34
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4355899 0.03255854 13.37867 7.711234e-24
## x 1.1652181 0.05788998 20.12815 1.375438e-36
We can see that the standard errors on estimated coefficients are lower for the Gamma GLM.
Calculating an \(R^2\) value for a GLM is not straightforward. However, if you absolutely want a value, you can use the pR2() function from the pscl package.
## llh llhNull G2 McFadden r2ML
## -66.5590371 -147.6077167 162.0973592 0.5490816 0.8022939
## r2CU
## 0.8465043
Be conservative: choose the lowest value among McFadden, r2ML and r2CU.
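As a check, McFadden's \(R^2\) can be recomputed by hand from the log-likelihoods reported above (llh and llhNull):

```r
# McFadden's R2 = 1 - llh / llhNull
1 - (-66.5590371 / -147.6077167)
## [1] 0.5490816
```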
Use your own data and try to fit and evaluate a GLM (Poisson, negative binomial, Gamma, binomial).
http://www.ats.ucla.edu/stat/r/dae/poissonreg.htm
https://sites.google.com/site/rforfishandwildlifegrads/home/mumin_usage_examples
https://en.wikipedia.org/wiki/Bayesian_information_criterion
http://users.ecs.soton.ac.uk/jn2/teaching/logisticRegression.pdf
http://nlp.stanford.edu/manning/courses/ling289/logistic.pdf
http://bayes.bgsu.edu/m6480/homework/HW2%20-%20Monte%20Carlo/R%20intro%20II.pdf
http://rstudio-pubs-static.s3.amazonaws.com/5691_192685385fc445c9b3fb1619960a20e2.html
Burnham, Kenneth P, and David Raymond Anderson. 2002. Model selection and multimodel inference: a practical information-theoretic approach. 2nd ed. New York: Springer.
Zuur, Alain F, Elena N Ieno, Neil Walker, Anatoly A. Saveliev, and Graham M. Smith. 2009. Mixed effects models and extensions in ecology with R. New York, NY: Springer New York. https://doi.org/10.1007/978-0-387-87458-6.