Until now, we have worked with statistical analyses (ANOVA, lm) that assume a normal distribution of the response variable.
Data: Daily maximum temperature in Roskilde between 2015-01-01 and 2015-12-18.
cet | max_temperature_c | mean_temperature_c | min_temperature_c | dew_point_c | mean_dew_point_c | min_dewpoint_c | |
---|---|---|---|---|---|---|---|
1 | 16436.00 | 6.00 | 4.00 | 3.00 | 5.00 | 4.00 | 2.00 |
2 | 16437.00 | 9.00 | 7.00 | 4.00 | 7.00 | 3.00 | 0.00 |
3 | 16438.00 | 6.00 | 3.00 | 2.00 | 3.00 | 2.00 | 0.00 |
4 | 16439.00 | 4.00 | 2.00 | 1.00 | 2.00 | -1.00 | -3.00 |
5 | 16440.00 | 4.00 | 2.00 | 0.00 | 3.00 | 2.00 | 0.00 |
6 | 16441.00 | 3.00 | 1.00 | -1.00 | 3.00 | 1.00 | 0.00 |
7 | 16442.00 | 6.00 | 3.00 | 1.00 | 4.00 | 2.00 | 0.00 |
8 | 16443.00 | 7.00 | 4.00 | 3.00 | 6.00 | 4.00 | 2.00 |
9 | 16444.00 | 7.00 | 5.00 | 3.00 | 5.00 | 2.00 | 1.00 |
10 | 16445.00 | 10.00 | 7.00 | 3.00 | 9.00 | 3.00 | -2.00 |
In 2015, the probability of having a temperature of 15 degrees was ~ 4%.
The probability density function of a normal curve (or distribution) of observations \(y_i\) is defined as:
\[ f(y_i; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{(y_i - \mu)^2}{2\sigma^2}} \]
With \(\mu\) and \(\sigma^2\) being the mean and the variance of the population.
Using the previous equation, we can easily calculate the probability density of a value \(y\) given a mean \(\mu\) and a standard deviation \(\sigma\).
Let's calculate \(P(Y = y)\).
Note that with the normal distribution: \(-\infty < y < \infty\)
Precisely:
\(P(Y = -100)\)
## [1] 9.678735e-83
\(P(Y = 25)\)
## [1] 0.1074266
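The density above can also be coded by hand and checked against R's built-in dnorm() function (here evaluated at \(y = 0\) for a standard normal, as a sanity check):

```r
# Hand-coded Normal probability density, following the equation above
pdf_normal <- function(y, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(y - mu)^2 / (2 * sigma^2))
}

pdf_normal(0, mu = 0, sigma = 1)  # hand-coded version
## [1] 0.3989423
dnorm(0, mean = 0, sd = 1)        # built-in equivalent
## [1] 0.3989423
```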
Within the variable qsec from the mtcars data frame, calculate the theoretical probability of having a car doing 1/4 miles in 18 seconds \(P(Y = 18)\).
## [1] 0.2224557
Tip: in R, you can use the dnorm() function.
## [1] 0.2224557
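One possible solution, estimating \(\mu\) and \(\sigma\) from qsec:

```r
# Estimate mu and sigma from the qsec variable, then apply dnorm()
mu <- mean(mtcars$qsec)   # 17.84875
sigma <- sd(mtcars$qsec)  # 1.786943
dnorm(18, mean = mu, sd = sigma)
## [1] 0.2224557
```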
What to do if the response variable does not follow a Normal distribution?
This is especially common in ecology and environmental sciences.
How can we model such data?
GLMs are extensions of linear models in which a non-Gaussian distribution is used for the response variable (Zuur et al. 2009).
Before using GLMs on non-Gaussian response variables, we need to understand a few more types of distributions:
The Poisson distribution is a type of distribution suited for count data, which are common in ecology.
\[ f(y; \mu) = \frac{\mu^y \times e^{-\mu}}{y!} \]
with \(y \geq 0\) (why?)
Contrary to the Normal distribution, the PDF of the Poisson distribution has only one parameter, \(\mu\).
\(\mu\): the mean number of successes in the given time interval or region of space.
You have two minutes to write the Poisson PDF equation in R.
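One possible answer, translating the equation directly (this is the pdf_poisson() function used below; it gives the same results as R's dpois()):

```r
# Poisson PDF: probability of observing y events given mean rate u
pdf_poisson <- function(y, u) {
  u^y * exp(-u) / factorial(y)
}

pdf_poisson(y = 0, u = 2)  # same as dpois(0, 2)
## [1] 0.1353353
```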
The mean number of successes (\(\mu\)) is the number of successes obtained during a period of time.
For example, your Ph.D. advisor writes you, on average, 2 emails per day (\(\mu = 2\)). What is the probability of finally having a day without any emails from him/her (\(P(Y_0; 2)\))?
pdf_poisson(y = 0, u = 2)
## [1] 0.1353353
What is the probability of having him/her writing you less than 4 emails?
\(P(Y_0; 2) + P(Y_1; 2) + P(Y_2; 2) + P(Y_3; 2)\)
pdf_poisson(y = 0, u = 2) + pdf_poisson(y = 1, u = 2) +
pdf_poisson(y = 2, u = 2) + pdf_poisson(y = 3, u = 2)
## [1] 0.8571235
Tip: in R, you can use the dpois() function.
## [1] 0.8571235
Important: the sum of all probabilities is equal to 1.
\[ \sum\limits_{y=0}^{\infty} P(Y_y; \mu) = 1 \]
What is the probability of having him/her writing you 4 or more emails?
\[\begin{equation} \label{eq2} \begin{split} P(Y \geq 4) & = P(Y = 4) + P(Y = 5) + \dots \\ & = 1 - P(Y = 0) - P(Y = 1) - P(Y = 2) - P(Y = 3) \\ & = 1 - \frac{2^0 \times e^{-2}}{0!} - \frac{2^1 \times e^{-2}}{1!} - \frac{2^2 \times e^{-2}}{2!} - \frac{2^3 \times e^{-2}}{3!} \\ & = 1 - 0.135 - 0.271 - 0.271 - 0.180 \\ & = 0.143 \end{split} \end{equation}\]
1 - dpois(0, 2) - dpois(1, 2) - dpois(2, 2) - dpois(3, 2)
## [1] 0.1428765
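The same result can be obtained directly from the Poisson cumulative distribution function:

```r
# P(Y >= 4) = 1 - P(Y <= 3)
1 - ppois(3, lambda = 2)
## [1] 0.1428765
ppois(3, lambda = 2, lower.tail = FALSE)  # equivalent upper-tail form
## [1] 0.1428765
```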
You are now about to finish your PhD and your advisor is now sending you more emails each day (\(\mu = 3.2\)).
What is the probability of having 2.1 emails on one particular day?
Some Poisson shapes actually look like a Gaussian distribution, but they are not.
The negative binomial distribution is also a type of distribution concerning count data.
\[ f(y; k, \mu) = \frac{\Gamma(y + k)}{\Gamma(k) \times y!} \times \left(\frac{k}{\mu + k}\right)^k \times \left(1 - \frac{k}{\mu + k}\right)^y \]
The first difference with the Poisson distribution is that the negative binomial has two parameters: \(k\) and \(\mu\).
In R, it can be calculated using the dnbinom() function.
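As a sketch, the PDF can also be hand-coded and checked against dnbinom(), which takes the dispersion parameter as size and the mean as mu:

```r
# Hand-coded negative binomial PDF (mean mu, dispersion k)
pdf_nbinom <- function(y, k, mu) {
  gamma(y + k) / (gamma(k) * factorial(y)) *
    (k / (mu + k))^k * (1 - k / (mu + k))^y
}

pdf_nbinom(0, k = 1, mu = 2)   # hand-coded version
## [1] 0.3333333
dnbinom(0, size = 1, mu = 2)   # built-in equivalent
## [1] 0.3333333
```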
Parameter \(k\) is often called the dispersion parameter (Zuur et al. 2009). Dispersion occurs when the variance is larger than the mean.
\[ \text{E}(Y) = \mu \]
\[ \text{var}(Y) = \mu + \frac{\mu^2}{k} \]
The smaller \(k\), the larger the dispersion.
If \(k\) is large, then \(\frac{\mu^2}{k}\) approaches 0 and the negative binomial converges to the Poisson distribution.
How to choose between a Poisson and a Negative binomial distribution?
In a Poisson distribution, \(E(Y) = \text{var}(Y)\).
Overdispersion occurs when the variance is greater than the mean, \(E(Y) < \text{var}(Y)\), or when the number of zeros is greater than expected.
In theory, you can use either distribution to model your count data; the estimated mean (\(\mu\)) will be similar. However, if you do not choose the appropriate distribution, you risk biased standard errors, which increases the likelihood of incorrectly detecting a significant treatment effect in the Poisson model.
We have seen that both Poisson and Negative binomial distributions deal with count data values \(\geq 0\).
The Gamma distribution also deals with positive values, but for continuous data.
The probability density function of a Gamma distribution is defined as:
\[ f(y; \mu, \nu) = \frac{1}{\Gamma(\nu)} \times \left(\frac{\nu}{\mu}\right)^\nu \times y^{\nu - 1} \times e^{-\frac{y \times \nu}{\mu}} \]
with \(y > 0\)
with \(\mu\) the mean and \(\nu\) the dispersion parameter controlling the shape of the curve.
If \(\nu\) is large, the distribution converges toward a Gaussian distribution.
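With this mean/dispersion parameterisation, the density corresponds to R's dgamma() with shape = nu and rate = nu / mu (so that the mean is \(\mu\)):

```r
# Hand-coded Gamma PDF with mean mu and dispersion (shape) nu
pdf_gamma <- function(y, mu, nu) {
  1 / gamma(nu) * (nu / mu)^nu * y^(nu - 1) * exp(-y * nu / mu)
}

pdf_gamma(1, mu = 2, nu = 3)        # hand-coded version
## [1] 0.3765322
dgamma(1, shape = 3, rate = 3 / 2)  # built-in equivalent
## [1] 0.3765322
```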
When should we use it? When the response variable is continuous, right-skewed and always positive.
In such a situation you could also choose to log-transform your response variable and fit a traditional linear Normal model.
As we will see later, using a GLM with a Gamma distribution is generally a better solution.
In its simplest form (a single trial), a binomial variable can take only two values: 0 or 1.
This can also be viewed as the presence/absence of an animal in an ecological study.
The probability distribution of the binomial distribution is defined as:
\[ f(y;\pi, N) = \left(N \atop y \right) \times \pi^y \times (1 - \pi)^{N-y} \]
where \(0 \leq \pi \leq 1\) represents the probability of success and \(N \geq 1\) the number of trials.
Note that \(\left(N \atop y \right) = \frac{N!}{y!(N - y)!}\)
The easiest way to understand the binomial distribution is to think of an experiment where you toss a coin in the air. There are only two possible outcomes: either you get a “tail” or a “head”.
First, create the R function to calculate the PDF of a binomial distribution.
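A direct translation of the equation (this is the pdf_binomial() function used below; it is equivalent to R's dbinom()):

```r
# Binomial PDF: probability of y successes in N trials with success probability pi
pdf_binomial <- function(y, pi, N) {
  choose(N, y) * pi^y * (1 - pi)^(N - y)
}
```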
For this example, we represent heads as 1 (success) and tails as 0 (failure). If you are not cheating, the probability of getting 1 (heads) is 0.5 (50%) and the probability of getting 0 is (1 - 0.5).
Let's verify this by calculating \(P(y = 1)\):
# Prob of having 1 (head) with a success rate of
# 50% after 1 trial
pdf_binomial(1, pi = 0.5, N = 1)
## [1] 0.5
That makes sense: we have a 50% chance of getting heads after throwing a coin once.
The probability of getting only heads decreases as the number of trials increases.
What is the probability of having 8 heads when throwing a coin 10 times?
pdf_binomial(y = 8, pi = 0.5, N = 10)
## [1] 0.04394531
What is the probability of getting the value 2 three times when throwing a six-sided die four times?
pdf_binomial(y = 3, pi = 1/6, N = 4)
## [1] 0.0154321
What is the probability of getting at least one double six when throwing a pair of dice 24 times?
sum(pdf_binomial(y = 1:24, pi = 1/36, N = 24))
## [1] 0.4914039
\[ P(Y > 0) = 1 - P(Y = 0) \]
pdf_binomial(y = 0, pi = 1/36, N = 24)
## [1] 0.5085961
1 - pdf_binomial(y = 0, pi = 1/36, N = 24)
## [1] 0.4914039
The choice of the best distribution to model your response variable (\(y\)) should be an a priori choice based on knowledge of the type of data (Zuur et al. 2009).
Distribution of \(y\) | Type of data | R function |
---|---|---|
Normal | Continuous (\(-\infty < y < \infty\)) | dnorm |
Poisson | Count (\(y \geq 0\)) | dpois |
Negative binomial | Overdispersed count (\(y \geq 0\)) | dnbinom |
Gamma | Continuous (\(y \geq 0\)) | dgamma |
Binomial | Presence/absence (\(y = 0\), \(y = 1\)) | dbinom |
GLMs in R are performed using the glm() function. The syntax of a glm model is very similar to that of a linear model using the lm() function.
The only thing that differs is the family argument, which is used to specify the distribution of the response variable.
For this first example, we will model a response variable (slope) which follows a Gaussian distribution.
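The my_data object used below is not shown; as a self-contained sketch, we can simulate comparable data (all simulated values here are made up) and verify that lm() and glm(..., family = "gaussian") produce identical fits:

```r
set.seed(123)
# Simulated data loosely resembling slope ~ elev (hypothetical values)
elev <- runif(255, 0, 600)
slope <- 0.0135 + 0.000156 * elev + rnorm(255, sd = 0.0128)
sim_data <- data.frame(slope, elev)

lm_model  <- lm(slope ~ elev, data = sim_data)
glm_model <- glm(slope ~ elev, family = "gaussian", data = sim_data)

all.equal(coef(lm_model), coef(glm_model))
## [1] TRUE
```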
##
## Call:
## lm(formula = slope ~ elev, data = my_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.031941 -0.008261 -0.001467 0.005572 0.057441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.349e-02 1.961e-03 6.879 4.71e-11 ***
## elev 1.563e-04 5.546e-06 28.176 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01282 on 253 degrees of freedom
## Multiple R-squared: 0.7583, Adjusted R-squared: 0.7574
## F-statistic: 793.9 on 1 and 253 DF, p-value: < 2.2e-16
##
## Call:
## glm(formula = slope ~ elev, family = "gaussian", data = my_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.031941 -0.008261 -0.001467 0.005572 0.057441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.349e-02 1.961e-03 6.879 4.71e-11 ***
## elev 1.563e-04 5.546e-06 28.176 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.0001643528)
##
## Null deviance: 0.172056 on 254 degrees of freedom
## Residual deviance: 0.041581 on 253 degrees of freedom
## AIC: -1494.3
##
## Number of Fisher Scoring iterations: 2
Two linear models fitted with lm() and glm(..., family = "gaussian") are identical.
## (Intercept) elev
## 0.013486467 0.000156273
## (Intercept) elev
## 0.013486467 0.000156273
A Poisson regression is fitted using the glm() function with family = "poisson".
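The output below is based on the mpg data (assumed here to come from the ggplot2 package, since mpg appears in the model call):

```r
library(ggplot2)  # provides the mpg data frame
glm_poisson <- glm(cyl ~ displ, family = "poisson", data = mpg)
```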
Let's look at the predicted values of the model. Fitted (predicted) values of a glm model can be extracted using the predict() function as follows:
## 1 2 3 4 5 6
## 4.165979 4.165979 4.326270 4.326270 5.031522 5.031522
##
## Call:
## glm(formula = cyl ~ displ, family = "poisson", data = mpg)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.98488 -0.15890 -0.07696 0.28678 0.64538
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.08716 0.08111 13.403 <2e-16 ***
## displ 0.18877 0.02014 9.374 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 103.685 on 233 degrees of freedom
## Residual deviance: 17.451 on 232 degrees of freedom
## AIC: 864.5
##
## Number of Fisher Scoring iterations: 4
The coefficient for displ is 0.1887712. This means that for a 1-unit increase in displ, the expected (mean) number of cyl increases by a factor of \(e^{0.18877}\) = 1.2077631.
## (Intercept) displ
## 2.965849 1.207765
displ | cyl | predicted | |
---|---|---|---|
1 | 2.00 | 4 | 4.33 |
2 | 3.00 | 6 | 5.23 |
3 | 4.00 | 6 | 6.31 |
4 | 4.00 | 8 | 6.31 |
5 | 5.00 | 8 | 7.62 |
6 | 6.00 | 8 | 9.21 |
In other words, the number of cyl increases by 20.78% per 1-unit increase in displ.
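The 20.78% figure follows directly from the coefficient:

```r
# Percent change in the expected number of cyl per 1-unit increase in displ
(exp(0.18877) - 1) * 100
## [1] 20.77631
```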
Remember that the negative binomial distribution is very similar to the Poisson distribution. We will now fit a negative binomial model using the same data as for the Poisson model.
In R, to perform a negative binomial regression we need to use the glm.nb() function from the MASS package.
library(MASS)
glm_nb <- glm.nb(cyl ~ displ, data = mpg)
summary(glm_nb) # Look at the dispersion parameter
##
## Call:
## glm.nb(formula = cyl ~ displ, data = mpg, init.theta = 1354965.912,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.98488 -0.15890 -0.07696 0.28678 0.64538
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.08716 0.08111 13.403 <2e-16 ***
## displ 0.18877 0.02014 9.374 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1354966) family taken to be 1)
##
## Null deviance: 103.685 on 233 degrees of freedom
## Residual deviance: 17.451 on 232 degrees of freedom
## AIC: 866.5
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1354966
## Std. Err.: 24103082
## Warning while fitting theta: iteration limit reached
##
## 2 x log-likelihood: -860.504
As with any glm, predicted values are obtained using the predict() function.
## 1 2 3 4 5 6
## 4.165979 4.165979 4.326269 4.326269 5.031522 5.031522
Looking at the summary of the negative binomial regression, we see a very large dispersion parameter (\(k \approx 1.35 \times 10^{6}\)). This is a first sign that we should stick with the Poisson model.
Additionally, the coefficients for both models are equal.
## (Intercept) displ
## 1.0871633 0.1887712
## (Intercept) displ
## 1.0871631 0.1887712
A useful way to select models in general is to use the Bayesian Information Criterion (BIC).
\[ \text{BIC} = -2 \times ln(\hat{L}) + k \times ln(n) \]
Where \(n\) is the number of data points (observations), \(k\) the number of parameters to be estimated and \(\hat{L}\) the maximized value of the likelihood function of the model (Burnham and Anderson 2002; Zuur et al. 2009).
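This formula can be checked against R's BIC() on any fitted model, e.g. a simple linear model on mtcars:

```r
# Manual BIC: -2 * log-likelihood + (number of estimated parameters) * log(n)
m <- lm(mpg ~ wt, data = mtcars)
ll <- logLik(m)
manual_bic <- -2 * as.numeric(ll) + attr(ll, "df") * log(nobs(m))

all.equal(manual_bic, BIC(m))
## [1] TRUE
```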
The BIC value is used to compare models. It is based on the principle of parsimony, helping to identify the model that accounts for the most variation with the fewest variables.
Such a model selection method requires the calculation of the BIC differences (\(\Delta_i\)) for all candidate models as follows:
\[ \Delta_i = \text{BIC}_i - \text{BIC}_{\text{min}} \]
\(\Delta_i\) | Evidence against higher BIC |
---|---|
0-2 | Weak |
2-6 | Positive |
6-10 | Strong |
10+ | Very strong |
In R, BIC is calculated using the BIC() function.
## df BIC
## glm_poisson 2 871.4141
## glm_nb 3 876.8703
If we look at the \(\Delta_i\) values:
## [1] 0.000000 5.456258
Based on these values, we can conclude that the Poisson model is the best option.
Use the following data to perform a GLM regression between distance and count. Which distribution should you use?
For this example, we are going to use the presence/absence data.
A binomial regression is fitted using the glm() function with family = "binomial".
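The my_data object used below is not shown; as a self-contained sketch, comparable presence/absence data can be simulated (the true coefficients here are made-up values chosen to echo the output below):

```r
set.seed(42)
# Simulated presence/absence driven by a logistic curve of area
area <- runif(255, 300, 1500)
presence <- rbinom(255, size = 1, prob = plogis(-8.5 + 0.0095 * area))
sim_data <- data.frame(area, presence)

glm_binomial <- glm(presence ~ area, family = "binomial", data = sim_data)
coef(glm_binomial)  # estimates should be close to -8.5 and 0.0095
```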
As with any glm, predicted values are obtained using the predict() function.
## 1 2 3 4 5 6
## 0.006113853 0.007174781 0.996638584 0.877413756 0.312340572 0.960833362
##
## Call:
## glm(formula = presence ~ area, family = "binomial", data = my_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.50961 -0.26750 0.09324 0.44214 1.94054
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.559089 1.185036 -7.223 5.10e-13 ***
## area 0.009475 0.001255 7.549 4.39e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 337.78 on 254 degrees of freedom
## Residual deviance: 147.86 on 253 degrees of freedom
## AIC: 151.86
##
## Number of Fisher Scoring iterations: 6
## # A tibble: 3 x 3
## area presence predicted
## <dbl> <dbl> <dbl>
## 1 530 0 0.0283
## 2 531 0 0.0285
## 3 532 0 0.0288
## (Intercept) area
## 0.0001917938 1.0095205100
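A predicted probability can also be recovered by hand from the coefficients above, applying the inverse logit; for example, for a (hypothetical) area of 900:

```r
# Linear predictor on the logit scale, then inverse-logit to a probability
eta <- -8.559089 + 0.009475 * 900
plogis(eta)  # probability of presence, close to 0.49
```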
We will use the following dataset to understand the Gamma distribution.
Let's have a look at the relation between x and y.
Looking at the histogram of y, we could decide to log-transform the data.
Note: \(y_i > 0\), which is an essential condition for applying a Gamma GLM.
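The data_gamma object used below is not shown; as a self-contained sketch, comparable data can be simulated (the true coefficients 0.4 and 1.15 are arbitrary values chosen to echo the output below):

```r
set.seed(1)
# Simulated Gamma data with a log link: mean = exp(0.4 + 1.15 * x)
x <- runif(100, -1, 1)
mu <- exp(0.4 + 1.15 * x)
nu <- 10  # dispersion (shape) parameter
y <- rgamma(100, shape = nu, rate = nu / mu)

glm_gamma <- glm(y ~ x, family = Gamma(link = "log"))
coef(glm_gamma)  # estimates should be close to 0.4 and 1.15
```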
##
## Call:
## lm(formula = log(y) ~ x, data = data_gamma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98780 -0.18389 0.02307 0.23713 0.80138
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.38107 0.03385 11.26 <2e-16 ***
## x 1.13836 0.06019 18.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3382 on 98 degrees of freedom
## Multiple R-squared: 0.785, Adjusted R-squared: 0.7828
## F-statistic: 357.7 on 1 and 98 DF, p-value: < 2.2e-16
##
## Call:
## glm(formula = y ~ x, family = Gamma(link = "log"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9007 -0.2307 -0.0210 0.1797 0.8326
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.43559 0.03256 13.38 <2e-16 ***
## x 1.16522 0.05789 20.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.1057873)
##
## Null deviance: 50.968 on 99 degrees of freedom
## Residual deviance: 10.768 on 98 degrees of freedom
## AIC: 139.12
##
## Number of Fisher Scoring iterations: 4
## x predicted
## 1 -1 0.4820882
## 2 0 1.5458748
## 3 1 4.9570363
We see that for a 1-unit increase in x, the expected (mean) value of y increases by a factor of \(e^{1.16522}\) = 3.2066283.
For example, at \(x = 0\) the predicted mean is \(e^{0.43559}\):
## [1] 1.545878
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3810662 0.03385086 11.25721 2.320083e-19
## x 1.1383629 0.06018776 18.91353 1.779982e-34
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4355899 0.03255854 13.37867 7.711234e-24
## x 1.1652181 0.05788998 20.12815 1.375438e-36
We can see that the standard errors on estimated coefficients are lower for the Gamma GLM.
Calculating an \(R^2\) value for a GLM is not straightforward. However, if you absolutely want a value, you can use the pR2() function from the pscl package.
## llh llhNull G2 McFadden r2ML
## -66.5590371 -147.6077167 162.0973592 0.5490816 0.8022939
## r2CU
## 0.8465043
Be conservative: choose the lowest value among McFadden, r2ML and r2CU.
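As a check, McFadden's \(R^2\) can be recomputed by hand from the log-likelihoods reported above (llh and llhNull):

```r
# McFadden's R2 = 1 - llh / llhNull
1 - (-66.5590371 / -147.6077167)
## [1] 0.5490816
```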
Use your own data and try to fit and evaluate a GLM (Poisson, negative binomial, Gamma, binomial).
http://www.ats.ucla.edu/stat/r/dae/poissonreg.htm
https://sites.google.com/site/rforfishandwildlifegrads/home/mumin_usage_examples
https://en.wikipedia.org/wiki/Bayesian_information_criterion
http://users.ecs.soton.ac.uk/jn2/teaching/logisticRegression.pdf
http://nlp.stanford.edu/manning/courses/ling289/logistic.pdf
http://bayes.bgsu.edu/m6480/homework/HW2%20-%20Monte%20Carlo/R%20intro%20II.pdf
http://rstudio-pubs-static.s3.amazonaws.com/5691_192685385fc445c9b3fb1619960a20e2.html
Burnham, Kenneth P, and David Raymond Anderson. 2002. Model selection and multimodel inference: a practical information-theoretic approach. 2nd ed. New York: Springer.
Zuur, Alain F, Elena N Ieno, Neil Walker, Anatoly A. Saveliev, and Graham M. Smith. 2009. Mixed effects models and extensions in ecology with R. New York, NY: Springer New York. https://doi.org/10.1007/978-0-387-87458-6.