Our approach is to translate hypotheses into statistical models. We'll do a first exercise today. Let's start with a research question:
Do leaf chemical compounds affect the growth of caterpillars?
See for instance Barbehenn & Constabel 2011 for a review on the topic.
Theoretical context: tannin compounds can defend leaves against herbivorous insects.
Hypothesis: tannin toxicity in the leaf reduces caterpillar growth.
Prediction: caterpillars that consume leaves with greater amounts of tannin will grow less.
They performed an experiment on caterpillar growth, varying the tannin quantity in the diet.
boxplot(cat$growth, col = "darkgreen")
unique(cat$tannin)
[1] 0 1 2 3 4 5 6 7 8
Now, to test the hypothesis, we will build a linear model and estimate its parameters.
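The model can be fit with lm(); the formula below matches the Call line shown in the summary output further down, and the object name mod_cat is the one used throughout this section.

```r
# Fit a simple linear regression: caterpillar growth as a function
# of the tannin concentration in the diet (cat is the data set above)
mod_cat <- lm(growth ~ tannin, data = cat)
```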
We will check the model estimates using the function summary(). Because R is an OOP (Object-Oriented Programming) language, the method used by summary() differs depending on the class of the object (see Advanced R to learn more about this topic). Let's see what happens when we apply the function to our object mod_cat.
summary(mod_cat)
Call:
lm(formula = growth ~ tannin, data = cat)
Residuals:
Min 1Q Median 3Q Max
-2.4556 -0.8889 -0.2389 0.9778 2.8944
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.7556 1.0408 11.295 9.54e-06 ***
tannin -1.2167 0.2186 -5.565 0.000846 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.693 on 7 degrees of freedom
Multiple R-squared: 0.8157, Adjusted R-squared: 0.7893
F-statistic: 30.97 on 1 and 7 DF, p-value: 0.0008461
ggplot(data = cat, mapping = aes(x = tannin, y = growth)) +
geom_point() +
geom_smooth(method = lm) +
theme_classic()
summary.aov(mod_cat)
Df Sum Sq Mean Sq F value Pr(>F)
tannin 1 88.82 88.82 30.97 0.000846 ***
Residuals 7 20.07 2.87
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
predict(mod_cat)
1 2 3 4 5 6 7
11.755556 10.538889 9.322222 8.105556 6.888889 5.672222 4.455556
8 9
3.238889 2.022222
cat$fitted <- predict(mod_cat)
ggplot(data = cat) +
geom_point(aes(x = growth, y = fitted)) +
geom_abline(aes(slope = 1, intercept = 0)) +
theme_classic()
Using diagnostic plots we can check the linear model assumptions. In this case, the plots flag three possible outliers.
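A common way to produce these diagnostics is base R's plot() method for lm objects, which draws the residuals-vs-fitted, normal Q-Q, scale-location, and residuals-vs-leverage panels and labels the most extreme points (which is where candidate outliers show up):

```r
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a grid
plot(mod_cat)
par(mfrow = c(1, 1))  # restore the default layout
```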
In this section we will use the package fitdistrplus to find which distribution best fits the data. The examples here are taken from Delignette-Muller & Dutang 2015.
We will use the data stored in the object groundbeef, which ships with the package:
'data.frame': 254 obs. of 1 variable:
$ serving: num 30 10 20 24 20 24 40 20 50 30 ...
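Assuming fitdistrplus is installed, the data set and the structure shown above can be obtained with:

```r
library(fitdistrplus)
data("groundbeef")  # serving sizes of ground beef, bundled with the package
str(groundbeef)
```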
plotdist(groundbeef$serving, histo = TRUE, demp = TRUE)
First, let’s describe the distribution using descriptive statistics such as the median, mean, standard deviation, skewness (asymmetry), and kurtosis (tailedness).
descdist(groundbeef$serving, boot = 1000)
summary statistics
------
min: 10 max: 200
median: 79
mean: 73.64567
estimated sd: 35.88487
estimated skewness: 0.7352745
estimated kurtosis: 3.551384
Checking whether the Weibull distribution fits the data.
Fitting of the distribution ' weibull ' by maximum likelihood
Parameters :
estimate Std. Error
shape 2.185885 0.1045755
scale 83.347679 2.5268626
Loglikelihood: -1255.225 AIC: 2514.449 BIC: 2521.524
Correlation matrix:
shape scale
shape 1.000000 0.321821
scale 0.321821 1.000000
Comparing with other continuous distributions for skewed data such as the gamma and lognormal distributions.
fg <- fitdist(groundbeef$serving, "gamma")
fln <- fitdist(groundbeef$serving, "lnorm")
par(mfrow = c(2, 2))
plot_legend <- c("Weibull", "lognormal", "gamma")
denscomp(list(fw, fln, fg), legendtext = plot_legend)
qqcomp(list(fw, fln, fg), legendtext = plot_legend)
cdfcomp(list(fw, fln, fg), legendtext = plot_legend)
ppcomp(list(fw, fln, fg), legendtext = plot_legend)
Comparing the distributions with goodness-of-fit statistics.
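The tables below are the output of gofstat(), which computes these statistics and criteria for a list of fitted objects (the default labels "1-mle-weibull", "2-mle-lnorm", and "3-mle-gamma" follow the order of the list):

```r
# Goodness-of-fit statistics and information criteria for the three fits
gofstat(list(fw, fln, fg))
```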
Goodness-of-fit statistics
1-mle-weibull 2-mle-lnorm 3-mle-gamma
Kolmogorov-Smirnov statistic 0.1396646 0.1493090 0.1281246
Cramer-von Mises statistic 0.6840994 0.8277358 0.6934112
Anderson-Darling statistic 3.5736460 4.5436542 3.5660192
Goodness-of-fit criteria
1-mle-weibull 2-mle-lnorm 3-mle-gamma
Akaike's Information Criterion 2514.449 2526.639 2511.250
Bayesian Information Criterion 2521.524 2533.713 2518.325
Be aware that statistics such as the Cramér-von Mises, Kolmogorov-Smirnov, and Anderson-Darling statistics are difficult to use when comparing distributions with different numbers of parameters. Information criteria such as AIC and BIC are more appropriate here because they penalize the number of parameters, guarding against overfitting.