Implementing a linear model in R

Sara Mortara and Andrea Sánchez-Tapia
2022-07-19

Our approach is to translate hypotheses into statistical models. We’ll do a first exercise today. Let’s start with a research question.

The context

Do leaf chemical compounds affect the growth of caterpillars?

Lymantria dispar larvae eating an oak leaf, image from: https://www.canr.msu.edu/news/gypsy-moth-caterpillars-are-out-and-about

See for instance Barbehenn & Constabel 2011 for a review on the topic.

Research question, hypothesis, and prediction

Q: Do leaf chemical compounds affect the growth of caterpillars?

theoretical context: Tannin compounds can defend the leaves against herbivorous insects

hypothesis: Toxicity of tannin in the leaf reduces the growth of caterpillars

prediction: Caterpillars that consume leaves with greater amount of tannin will grow less

Graphical representation of the prediction
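A minimal sketch of the predicted relationship (the numbers here are hypothetical, chosen only to illustrate the expected negative trend):

```r
# hypothetical illustration of the prediction:
# caterpillar growth declines as tannin concentration increases
tannin <- 0:8                       # tannin dose (arbitrary units)
growth <- 12 - 1.2 * tannin         # hypothetical linear decline
plot(tannin, growth, type = "l", lwd = 2,
     xlab = "Tannin concentration", ylab = "Caterpillar growth",
     main = "Prediction: more tannin, less growth")
```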

The experiment

An experiment on caterpillar growth was performed, varying the amount of tannin in the caterpillars’ diet

boxplot(cat$growth, col = "darkgreen")

unique(cat$tannin)
[1] 0 1 2 3 4 5 6 7 8

Fitting model to the data

Now, to test the hypothesis, we will build a linear model and estimate its parameters.

# loading packages
library(ggplot2)

# reading data
cat <- read.csv("data/raw/crawley_regression.csv")

# creating the lm
mod_cat <- lm(growth ~ tannin, data = cat)

We will check the model estimates using the function summary(). Because R is an OOP (Object-Oriented Programming) language, the method used by summary() differs depending on the class of the object (see Advanced R to learn more about this topic). Let’s see what happens when we apply the function to our object mod_cat.
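This dispatch mechanism can be inspected directly. A short sketch, using R’s built-in cars data for illustration:

```r
# summary() is an S3 generic: it chooses a method based on the object's class
mod <- lm(dist ~ speed, data = cars)  # built-in 'cars' data, for illustration
class(mod)                    # "lm"
# so summary(mod) dispatches to the lm-specific method:
getS3method("summary", "lm")  # returns the function summary.lm
summary(1:10)                 # a plain numeric vector uses summary.default() instead
```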

summary(mod_cat)

Call:
lm(formula = growth ~ tannin, data = cat)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4556 -0.8889 -0.2389  0.9778  2.8944 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  11.7556     1.0408  11.295 9.54e-06 ***
tannin       -1.2167     0.2186  -5.565 0.000846 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.693 on 7 degrees of freedom
Multiple R-squared:  0.8157,    Adjusted R-squared:  0.7893 
F-statistic: 30.97 on 1 and 7 DF,  p-value: 0.0008461
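The estimates can also be extracted directly with coef(), and their confidence intervals with confint() (assuming the mod_cat object fitted above). The slope says growth drops by about 1.22 units per unit of tannin:

```r
coef(mod_cat)      # intercept and slope, as in the summary table
# (Intercept)      tannin 
#   11.755556   -1.216667
confint(mod_cat)   # 95% confidence intervals for the estimates
```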

Fitting model to data

## model fit to the data
plot(growth ~ tannin, data = cat, bty = 'l', pch = 19)
abline(mod_cat, col = "red", lwd = 2)

Fitting model to data (ggplot2)

ggplot(data = cat, mapping = aes(x = tannin, y = growth)) +
  geom_point() +
  geom_smooth(method = lm) +
  theme_classic()

ANOVA table

summary.aov(mod_cat)
            Df Sum Sq Mean Sq F value   Pr(>F)    
tannin       1  88.82   88.82   30.97 0.000846 ***
Residuals    7  20.07    2.87                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Getting fitted values

predict(mod_cat)
        1         2         3         4         5         6         7 
11.755556 10.538889  9.322222  8.105556  6.888889  5.672222  4.455556 
        8         9 
 3.238889  2.022222 
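Each fitted value is just intercept + slope × tannin. We can reproduce predict() by hand (assuming the mod_cat and cat objects from above):

```r
b <- coef(mod_cat)                           # b[1] = intercept, b[2] = slope
fitted_manual <- unname(b[1] + b[2] * cat$tannin)
all.equal(fitted_manual, unname(predict(mod_cat)))  # TRUE
```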

Fitted vs observed values

cat$fitted <- predict(mod_cat)

ggplot(data = cat) +
  geom_point(aes(x = growth, y = fitted)) +
  geom_abline(aes(slope = 1,  intercept = 0)) +
  theme_classic()

Model diagnostics

Using the following plots we can check the linear model assumptions.

  1. Residuals vs fitted
  2. Normal Q-Q
  3. Scale-location
  4. Residuals vs. leverage

par(mfrow = c(2, 2))
plot(mod_cat)
par(mfrow = c(1, 1))

Extra: fitting distributions to data

In this section we will use the package fitdistrplus to find which distribution best fits the data. The examples here are taken from Delignette-Muller & Dutang 2015.

We will use the data shipped with the package, stored in the object groundbeef.

library(fitdistrplus)
data("groundbeef")
?groundbeef
str(groundbeef)
'data.frame':   254 obs. of  1 variable:
 $ serving: num  30 10 20 24 20 24 40 20 50 30 ...

Empirical distribution function and the histogram

plotdist(groundbeef$serving, histo = TRUE, demp = TRUE)

Confronting different distributions

First, let’s describe the distribution using descriptive statistics such as the median, mean, standard deviation, skewness (asymmetry) and kurtosis (“tailedness”).

descdist(groundbeef$serving, boot = 1000)

summary statistics
------
min:  10   max:  200 
median:  79 
mean:  73.64567 
estimated sd:  35.88487 
estimated skewness:  0.7352745 
estimated kurtosis:  3.551384 

Checking whether the Weibull distribution fits the data.

fw <- fitdist(groundbeef$serving, "weibull")
summary(fw)
Fitting of the distribution ' weibull ' by maximum likelihood 
Parameters : 
       estimate Std. Error
shape  2.185885  0.1045755
scale 83.347679  2.5268626
Loglikelihood:  -1255.225   AIC:  2514.449   BIC:  2521.524 
Correlation matrix:
         shape    scale
shape 1.000000 0.321821
scale 0.321821 1.000000

Comparing with other continuous distributions for skewed data, such as the gamma and lognormal distributions.

fg <- fitdist(groundbeef$serving, "gamma")
fln <- fitdist(groundbeef$serving, "lnorm")
par(mfrow = c(2, 2))
plot_legend <- c("Weibull", "lognormal", "gamma")
denscomp(list(fw, fln, fg), legendtext = plot_legend)
qqcomp(list(fw, fln, fg), legendtext = plot_legend)
cdfcomp(list(fw, fln, fg), legendtext = plot_legend)
ppcomp(list(fw, fln, fg), legendtext = plot_legend)

Comparing the distributions with goodness-of-fit statistics

gofstat(list(fw, fln, fg))
Goodness-of-fit statistics
                             1-mle-weibull 2-mle-lnorm 3-mle-gamma
Kolmogorov-Smirnov statistic     0.1396646   0.1493090   0.1281246
Cramer-von Mises statistic       0.6840994   0.8277358   0.6934112
Anderson-Darling statistic       3.5736460   4.5436542   3.5660192

Goodness-of-fit criteria
                               1-mle-weibull 2-mle-lnorm 3-mle-gamma
Akaike's Information Criterion      2514.449    2526.639    2511.250
Bayesian Information Criterion      2521.524    2533.713    2518.325

Be aware that statistics such as the Cramér-von Mises, Kolmogorov-Smirnov, and Anderson-Darling statistics are difficult to use when comparing distributions with different numbers of parameters. Information criteria are more useful in that case, because they penalize extra parameters and so take overfitting into account.
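For instance, the information criteria reported above can be pulled out of the fitdist objects directly (assuming the fw, fln, and fg fits from above):

```r
# AIC of each candidate distribution; lower is better
sapply(list(weibull = fw, lognormal = fln, gamma = fg), function(f) f$aic)
# the gamma fit has the lowest AIC here, in line with the table above
```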