Model selection

Sara Mortara and Andrea Sánchez-Tapia
2022-07-20

Why model selection?

But before continuing it’s always good to remember the concept of conecting theory with data. For that, we have a good question as a guide, derive hypotheses, and represent hypotheses with models! There’s no coding models in R without thinking, and remember of what Burnnam and Anderson say

we recommend more emphasis on thinking!

Guiding questions

  1. Are the objectives relevant and achievable?

  2. Was sample or experimental design carefully defined?

  3. Were candidate models defined a priori?

  4. What justifies the models?

the basis of model selection: criteria

Akaike uses the Kullback-Leibler distance (from information theory)

\[I(f, g)\]

Which represents the distance between the truth \(f\) and the model \(g\) or information lost when model \(g\) is used to get closer to the truth \(f\). The goal is to find the model that loses the least amount of information = minimize \(I(f,g)\).

Akaike information criterion (AIC)

\[AIC = -2log(\mathcal{L}(\hat{\theta})) + 2K\]

Remember that maximum likelihood means that given the data & given the model, which model parameter values make the data more plausible? And the AIC penalizes models with more parameters.

Inference based on AIC

An example using the cuckoo data

library(bbmle)
library(ggplot2)
cuckoo <- read.csv("data/raw/valletta_cuckoo.csv")

Hypotheses

  1. Bigger nestling make higher begging calls

  2. Bigger nestling make higher begging calls and the two species exhibit similar response (i.e. different intercept)

  3. Begging rates of different species are being affected differently by nestling mass (i.e. different slope and intercept)

Translating hypotheses into models

h1 <- glm(Beg ~ Mass, data = cuckoo,
            family = poisson(link = log))

h2 <- glm(Beg ~ Mass + Species, data = cuckoo,
            family = poisson(link = log))

h3 <- glm(Beg ~ Mass * Species, data = cuckoo,
            family = poisson(link = log))

h0 <- glm(Beg ~ 0, data = cuckoo,
            family = poisson(link = log))

Using AIC to confront simultaneosly multiple hypotheses

bbmle::AICtab(h0, h1, h2, h3, base = TRUE, weights = TRUE)
   AIC    dAIC   df weight
h3  784.8    0.0 4  0.9752
h1  792.9    8.0 2  0.0174
h2  794.6    9.8 3  0.0074
h0 8016.6 7231.8 0  <0.001

Calculating predictors and plotting the predicted values by the model

# Calculating the predicted values
newdata <- expand.grid(Mass = seq(min(cuckoo$Mass), max(cuckoo$Mass), length.out = 200),
                       Species = unique(cuckoo$Species))
newdata$Beg <- predict(h3, newdata, type = 'response')

## explore ?predict.glm

p <- ggplot(mapping = aes(x = Mass, y = Beg, colour = Species)) +
  geom_point(data = cuckoo) +  geom_line(data = newdata) +
  theme_classic()

p