The t-test is a statistical test used to determine whether there is a significant difference between the means of two groups. It assumes that the dependent variable is normally distributed within each group.
H0: \(\mu_1 = \mu_2\)
H1: \(\mu_1 \neq \mu_2\) (the means are different)
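As a minimal sketch of how these hypotheses can be tested in R, the block below applies `t.test()` to sprays C and F of the built-in `InsectSprays` dataset, the same two groups used in the example that follows. The object names `sprays_cf` and `res` are illustrative. Note that `t.test()` defaults to Welch's unequal-variance version; whether that is the variant intended here is an assumption.

```r
# Keep only sprays C and F and drop the unused factor levels
sprays_cf <- droplevels(subset(datasets::InsectSprays, spray %in% c("C", "F")))

# Two-sample t-test of count by spray group.
# By default var.equal = FALSE, i.e. Welch's t-test (assumed here, not stated in the text).
res <- t.test(count ~ spray, data = sprays_cf)

res$estimate  # the two group means
res$p.value   # small p-value: evidence against H0 (equal means)
```

A p-value below the chosen significance level (commonly 0.05) leads to rejecting H0, provided the normality assumption is reasonable, which the example goes on to check.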
A working example, using R's built-in InsectSprays dataset:
## count spray
## 1 10 A
## 2 7 A
## 3 20 A
## 4 14 A
## 5 14 A
## 6 12 A
## 7 10 A
## 8 23 A
## 9 17 A
## 10 20 A
## 11 14 A
## 12 13 A
## 13 11 B
## 14 17 B
## 15 21 B
## 16 11 B
## 17 16 B
## 18 14 B
## 19 17 B
## 20 17 B
## 21 19 B
## 22 21 B
## 23 7 B
## 24 13 B
## 25 0 C
## 26 1 C
## 27 7 C
## 28 2 C
## 29 3 C
## 30 1 C
## 31 2 C
## 32 1 C
## 33 3 C
## 34 0 C
## 35 1 C
## 36 4 C
## 37 3 D
## 38 5 D
## 39 12 D
## 40 6 D
## 41 4 D
## 42 3 D
## 43 5 D
## 44 5 D
## 45 5 D
## 46 5 D
## 47 2 D
## 48 4 D
## 49 3 E
## 50 5 E
## 51 3 E
## 52 5 E
## 53 3 E
## 54 6 E
## 55 1 E
## 56 1 E
## 57 3 E
## 58 2 E
## 59 6 E
## 60 4 E
## 61 11 F
## 62 9 F
## 63 15 F
## 64 22 F
## 65 15 F
## 66 16 F
## 67 13 F
## 68 10 F
## 69 26 F
## 70 26 F
## 71 24 F
## 72 13 F
# Load ggplot2 for plotting (dplyr is called through its namespace below)
library(ggplot2)
# Subset the data to get only two groups
InsectSprays2 <- dplyr::filter(InsectSprays, spray %in% c("C", "F"))
# Visualize the data
ggplot(InsectSprays2, aes(x = spray, y = count)) +
  geom_boxplot()
Visually, the data do not appear to be normally distributed.
We can use the shapiro.test() function to test whether the data are normally distributed. The null hypothesis of this test is that the data follow a normal distribution.
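The test is run separately for each group. The calls below are reconstructed from the `data:` lines of the outputs that follow. The subset is rebuilt here with base R `subset()` so that the block runs on its own, and the names `sw_c` and `sw_f` are illustrative.

```r
# Same two-spray subset as in the example (base R instead of dplyr)
InsectSprays2 <- subset(datasets::InsectSprays, spray %in% c("C", "F"))

# Shapiro-Wilk normality test for each spray group
sw_c <- shapiro.test(InsectSprays2$count[InsectSprays2$spray == "C"])
sw_f <- shapiro.test(InsectSprays2$count[InsectSprays2$spray == "F"])

sw_c
sw_f
```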
##
## Shapiro-Wilk normality test
##
## data: InsectSprays2$count[InsectSprays2$spray == "C"]
## W = 0.85907, p-value = 0.04759
For spray C, the p-value is below 0.05, so we reject the null hypothesis that the data are normally distributed.
##
## Shapiro-Wilk normality test
##
## data: InsectSprays2$count[InsectSprays2$spray == "F"]
## W = 0.88475, p-value = 0.1009
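For spray F, the p-value is above 0.05, so we cannot reject the null hypothesis of normality, despite the non-normal look of the data. Failing to reject the null does not prove that the data are normal, particularly with only 12 observations per group.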
Since our data is not normal, what can we do?
Here, we are working with data that are asymmetrically distributed, with many low values and a few high ones. This type of distribution typically corresponds to a log-normal distribution, that is, the log-transformed values follow a normal distribution.
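A minimal sketch of such a transformation is shown below. Spray C contains zero counts and `log(0)` is `-Inf`, so this sketch uses `log1p()`, which computes log(1 + x). The +1 offset and the column name `log_count` are assumptions made for illustration, not part of the original analysis.

```r
# Same two-spray subset as in the example (base R so the block runs on its own)
InsectSprays2 <- subset(datasets::InsectSprays, spray %in% c("C", "F"))

# Log-transform the counts; log1p(x) = log(1 + x) keeps zero counts finite.
# The +1 offset is an assumption of this sketch.
InsectSprays2$log_count <- log1p(InsectSprays2$count)

# Re-check normality on the transformed values, one group at a time
sw_log_c <- shapiro.test(InsectSprays2$log_count[InsectSprays2$spray == "C"])
sw_log_f <- shapiro.test(InsectSprays2$log_count[InsectSprays2$spray == "F"])

sw_log_c
sw_log_f
```

If the transformed values pass the normality check, the t-test can be run on `log_count` instead of `count`. The conclusion then concerns the means on the log scale.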