Statistical modeling (part 2)

class: center, middle, inverse, title-slide

.title[
# Statistical modeling (part 2)
]
.author[
### Sara Mortara & Andrea Sánchez-Tapia
]
.institute[
### re.green | ¡liibre!
]
.date[
### 2022-07-20
]

---

## today

1. glm

2. glmm

3. model selection tutorial

---
class: middle, center, inverse

# 1. glm

---
## road map

---

## linear models

+ linear relationship between x and y

+ variances are equal across all predicted values of the response (homoscedastic)

+ errors are normally distributed

+ errors are independent

---

## generalized linear models

+ a linear mean (of your making)

+ a link function (like an ‘internal’ transformation)

+ an error structure

---
## link function

links your mean function to the scale of the observed data

- response variable `$Y$` and explanatory variable(s) `$X$`

- linear function: `$\beta_0 + \beta_1 X$`

- `$E(Y) = g^{-1}\left(\beta_0 + \beta_1 X\right)$`

- the function `$g(\cdot)$` is known as the link function

- `$g^{-1}(\cdot)$` denotes the inverse of `$g(\cdot)$`
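
A minimal sketch of the two scales, assuming a log link as the example. `make.link()` from base R returns both `$g$` (`linkfun`) and `$g^{-1}$` (`linkinv`); the coefficient values are made up for illustration.

```r
# Illustrative (made-up) coefficients and predictor values
beta0 <- 0.5
beta1 <- 0.2
x <- c(0, 5, 10)

# Linear predictor: lives on the link scale
eta <- beta0 + beta1 * x

# The log link and its inverse, as stored by R
lnk <- make.link("log")
mu <- lnk$linkinv(eta)        # g^{-1}(eta) = exp(eta): expected value of Y
eta_back <- lnk$linkfun(mu)   # g(mu) = log(mu): back on the link scale

all.equal(eta, eta_back)
```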

---
## back to linear regression

`lm` as a special case of `glm`

```r
df <- read.csv("data/raw/crawley_regression.csv")

lm(growth ~ tannin, data = df)
```

```
## 
## Call:
## lm(formula = growth ~ tannin, data = df)
## 
## Coefficients:
## (Intercept)       tannin  
##      11.756       -1.217
```

---
## back to linear regression

`family`: error structure __and__ the link function

```r
glm(growth ~ tannin, data = df, family = gaussian(link = identity))
```

```
## 
## Call:  glm(formula = growth ~ tannin, family = gaussian(link = identity), 
##     data = df)
## 
## Coefficients:
## (Intercept)       tannin  
##      11.756       -1.217  
## 
## Degrees of Freedom: 8 Total (i.e. Null);  7 Residual
## Null Deviance:	    108.9 
## Residual Deviance: 20.07 	AIC: 38.76
```
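
A small check of the equivalence on simulated data (a hypothetical stand-in, not the tannin data set): with a Gaussian family and identity link, `glm` should return the same coefficients as `lm`, and its residual deviance is the residual sum of squares.

```r
set.seed(42)
# Simulated data (hypothetical, not the tannin data set)
sim <- data.frame(x = 0:8)
sim$y <- 12 - 1.2 * sim$x + rnorm(nrow(sim), sd = 1.5)

fit_lm  <- lm(y ~ x, data = sim)
fit_glm <- glm(y ~ x, data = sim, family = gaussian(link = "identity"))

# Same estimates from both fits
all.equal(coef(fit_lm), coef(fit_glm))

# For a Gaussian glm the residual deviance is the residual sum of squares
all.equal(deviance(fit_glm), sum(residuals(fit_lm)^2))
```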

---
## link function in gaussian family

The default link function for the normal (Gaussian) distribution is the identity, i.e. for mean  `$\mu$` we have:

`$$\mu = \beta_0 + \beta_1X$$`

---
## Poisson regression

- count data (non-negative integers)
- first, back to the glm we already know:
- __Gaussian__ error structure and __identity__ link

.pull-left[
`$$Y = \beta_0 + \beta_1X + \epsilon$$`
`$$\epsilon \sim N(0, \sigma^2)$$`

]
.pull-right[
`$$Y \sim N(\mu, \sigma^2)$$`

`$$\mu = \beta_0 + \beta_1X$$`

]

---
## Poisson regression

.pull-left[
+ Family: Gaussian

+ Link: identity

`$$Y \sim N(\mu, \sigma^2)$$`

`$$\mu = \beta_0 + \beta_1X$$`

]

.pull-right[
+ Family: Poisson

+ Link: log

`$$Y \sim Pois(\lambda)$$`

`$$\log \lambda = \beta_0 + \beta_1X$$`

`$$\lambda = e^{\beta_0 + \beta_1X}$$`

]

- we will still fit straight lines
- linear for the __log__ transformed observations
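
A sketch of both scales with simulated counts (hypothetical data; the true coefficients 0.5 and 0.3 are arbitrary choices): the fit is a straight line for `$\log\lambda$` and an exponential curve for `$\lambda$`.

```r
set.seed(1)
# Simulated counts (hypothetical): log(lambda) = 0.5 + 0.3 * x
sim <- data.frame(x = runif(200, 0, 10))
sim$y <- rpois(nrow(sim), lambda = exp(0.5 + 0.3 * sim$x))

fit <- glm(y ~ x, data = sim, family = poisson(link = "log"))
coef(fit)  # estimates on the log scale, expected to be near 0.5 and 0.3

# Straight line on the link (log) scale ...
eta <- predict(fit, type = "link")
# ... exponential curve on the scale of the counts
lambda_hat <- predict(fit, type = "response")

all.equal(exp(eta), lambda_hat)
```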

---
## Poisson distribution

+ discrete variable, defined on the non-negative integers `$0, 1, 2, \dots$`

+ mean = `$\lambda$`

+ variance = `$\lambda$`

![](08_slides_files/figure-html/unnamed-chunk-4-1.png)![](08_slides_files/figure-html/unnamed-chunk-4-2.png)![](08_slides_files/figure-html/unnamed-chunk-4-3.png)

- as the mean increases, the variance increases also --> heteroscedasticity
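
The mean-variance relationship can be checked by simulation. A quick sketch, with arbitrary values of `$\lambda$`:

```r
set.seed(123)
lambdas <- c(1, 5, 20)

# Draw many Poisson samples for each lambda and compare mean and variance
sims <- sapply(lambdas, function(l) {
  y <- rpois(1e5, lambda = l)
  c(lambda = l, mean = mean(y), variance = var(y))
})
round(t(sims), 2)
```

Both the sample mean and the sample variance should land close to `$\lambda$` in every row.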

---
## Poisson regression: cuckoo example

How does the effect of nestling mass on begging rate differ between species?

.pull-left[

[Kilner et al 1999](https://www.nature.com/articles/17746)

]

.pull-right[

+ __Mass__: nestling mass of chick in grams
+ __Beg__: begging calls per 6 secs
+ __Species__: Warbler or Cuckoo

]

---
## read the data

```r
cuckoo <- read.csv("data/raw/valletta_cuckoo.csv")
summary(cuckoo)
```

```
##       Mass             Beg           Species         
##  Min.   : 4.988   Min.   :  0.00   Length:58         
##  1st Qu.:10.682   1st Qu.:  1.25   Class :character  
##  Median :23.012   Median : 13.00   Mode  :character  
##  Mean   :23.650   Mean   : 23.71                     
##  3rd Qu.:31.812   3rd Qu.: 36.50                     
##  Max.   :62.956   Max.   :114.00
```

---
## understanding the data

---
## let's fit a lm

```r
# Fitting a model with an interaction term
cuckoo_lm <- lm(Beg ~ Mass * Species, data = cuckoo)
```

---
## inspecting the lm

![](08_slides_files/figure-html/unnamed-chunk-9-1.png)
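
What typically goes wrong when a straight line is fitted to counts can be sketched with simulated data (hypothetical, not the cuckoo data): the residual spread grows with the fitted values, and the line can dip below zero.

```r
set.seed(7)
# Simulated counts (hypothetical), with a mean that grows with x
sim <- data.frame(x = runif(300, 0, 10))
sim$y <- rpois(nrow(sim), lambda = exp(0.3 * sim$x))

fit_lm <- lm(y ~ x, data = sim)

# Compare residual spread for low and high fitted values
high <- fitted(fit_lm) > median(fitted(fit_lm))
sd_low  <- sd(residuals(fit_lm)[!high])
sd_high <- sd(residuals(fit_lm)[high])
c(low = sd_low, high = sd_high)

# Check whether the straight line predicts impossible negative counts
min(fitted(fit_lm))
```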

---
## fitting a Poisson glm

```r
cuckoo_glm <- glm(Beg ~ Mass * Species, data = cuckoo,
           family = poisson(link = log))
```

---
## understanding the model

+ model with interaction term:
`$$\log\lambda = \beta_0 + \beta_1M_i + \beta_2S_i + \beta_3M_iS_i$$`

`$M_i$` = nestling mass

`$S_i$` = dummy variable coding the categorical explanatory variable species

+ cuckoo:  `$S_i$` = 0
+ warbler: `$S_i$` = 1

---
## understanding the model

`$$\log\lambda = \beta_0 + \beta_1M_i + \beta_2S_i + \beta_3M_iS_i$$`

cuckoo (`$S_i = 0$`): 
`$$\log\lambda = \beta_0 + \beta_1M_i$$`

warbler (`$S_i = 1$`): 
`$$\log\lambda = \beta_0 + \beta_1M_i + \beta_2 + \beta_3M_i$$`
`$$\log\lambda = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)M_i$$`

---
## cuckoo glm

.tiny[

```r
summary(cuckoo_glm)
```

```
## 
## Call:
## glm(formula = Beg ~ Mass * Species, family = poisson(link = log), 
##     data = cuckoo)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -7.4570  -3.0504  -0.0006   1.9389   5.2139  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          1.589861   0.104531  15.209  < 2e-16 ***
## Mass                 0.054736   0.002298  23.820  < 2e-16 ***
## SpeciesWarbler      -0.535546   0.161304  -3.320 0.000900 ***
## Mass:SpeciesWarbler  0.015822   0.004662   3.394 0.000689 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1730.04  on 57  degrees of freedom
## Residual deviance:  562.08  on 54  degrees of freedom
## AIC: 784.81
## 
## Number of Fisher Scoring iterations: 5
```
]
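
The species-specific lines can be rebuilt by hand from the estimates printed above. A sketch (the 20 g nestling is an arbitrary example value):

```r
# Estimates copied from the summary(cuckoo_glm) output above
b <- c(intercept = 1.589861, mass = 0.054736,
       warbler = -0.535546, mass_warbler = 0.015822)

# Cuckoo is the reference level (S = 0); warbler adds b2 and b3 (S = 1)
lines_log <- rbind(
  cuckoo  = c(intercept = b[["intercept"]],
              slope     = b[["mass"]]),
  warbler = c(intercept = b[["intercept"]] + b[["warbler"]],
              slope     = b[["mass"]] + b[["mass_warbler"]])
)
lines_log

# Expected begging calls for a 20 g nestling of each species
exp(lines_log[, "intercept"] + lines_log[, "slope"] * 20)

# exp(slope): multiplicative change in expected calls per extra gram
exp(lines_log[, "slope"])
```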

---
## inspecting the glm

![](08_slides_files/figure-html/unnamed-chunk-12-1.png)

---
## creating the Poisson regression line

```r
newdata <- expand.grid(Mass = seq(min(cuckoo$Mass), max(cuckoo$Mass), length.out = 200),
                       Species = unique(cuckoo$Species))
newdata$Beg <- predict(cuckoo_glm, newdata, type = 'response')
```
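
One possible extension, sketched on simulated data (a hypothetical stand-in for the cuckoo data, with made-up coefficients): predict on the link scale with `se.fit = TRUE`, then back-transform to get an approximate 95% band around the curve.

```r
set.seed(2)
# Simulated counts (hypothetical stand-in for the cuckoo data)
sim <- data.frame(Mass = runif(60, 5, 60))
sim$Beg <- rpois(nrow(sim), lambda = exp(1.5 + 0.05 * sim$Mass))
fit <- glm(Beg ~ Mass, data = sim, family = poisson(link = "log"))

grid <- data.frame(Mass = seq(min(sim$Mass), max(sim$Mass), length.out = 50))

# Predict on the link scale, where a normal approximation is more reasonable
pr <- predict(fit, newdata = grid, type = "link", se.fit = TRUE)

# Back-transform the fit and an approximate 95% band with the inverse link
inv <- family(fit)$linkinv
grid$Beg   <- inv(pr$fit)
grid$lower <- inv(pr$fit - 1.96 * pr$se.fit)
grid$upper <- inv(pr$fit + 1.96 * pr$se.fit)
head(grid)
```

Building the band on the link scale keeps the lower limit above zero after back-transformation.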

---
## creating the Poisson regression line

.pull-left[

```r
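library(ggplot2)
# `my_theme` is assumed to be a ggplot2 theme object defined elsewhere in the slides
p <- ggplot(mapping = aes(x = Mass, y = Beg, colour = Species)) + 
  geom_point(data = cuckoo) + 
  geom_line(data = newdata) +
  my_theme
p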
```

]

.pull-right[
![](08_slides_files/figure-html/unnamed-chunk-15-1.png)

]

---
class: middle, center
# bestiary of error distributions

---

.tiny[

Response variable | Error distribution | Canonical link function | 
------------- | -----| -------|
Continuous positive and negative values | Gaussian/Normal | Identity |
Counts | Poisson | Log | 
Counts with over-dispersion | Negative Binomial, Quasi-Poisson | Log | 
Proportions (no. successes/total trials) | Binomial | Logit |
Binary (male/female, alive/dead) | Binomial (Bernoulli) | Logit | 
Proportions or binary with overdispersion | Quasi-Binomial | Logit | 
Time to event (germination, death) | Gamma | Inverse |

]
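
A rough way to decide between the Poisson rows of the table is the ratio of residual deviance to residual degrees of freedom; values far above 1 suggest over-dispersion. The sketch below applies it to the numbers reported by `summary(cuckoo_glm)` earlier (the ratio is about 10.4), then shows a quasi-Poisson refit on simulated data (hypothetical negative binomial counts, not the cuckoo data).

```r
# Values copied from the summary(cuckoo_glm) output shown earlier
resid_dev <- 562.08
resid_df  <- 54

# Rough dispersion estimate: values far above 1 suggest over-dispersion
resid_dev / resid_df

set.seed(3)
sim <- data.frame(x = runif(200, 0, 5))
# Negative binomial counts: variance = mu + mu^2 / size, larger than the mean
sim$y <- rnbinom(nrow(sim), mu = exp(1 + 0.4 * sim$x), size = 2)

fit_pois  <- glm(y ~ x, data = sim, family = poisson)
fit_quasi <- glm(y ~ x, data = sim, family = quasipoisson)

# Same coefficients, but quasi-Poisson standard errors are inflated
# by the square root of the estimated dispersion
summary(fit_quasi)$dispersion
cbind(poisson = summary(fit_pois)$coefficients[, "Std. Error"],
      quasi   = summary(fit_quasi)$coefficients[, "Std. Error"])
```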

---
## parameter estimation

- maximum likelihood
- quasi-likelihood
- Bayesian approaches
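
To make maximum likelihood concrete, here is a sketch on simulated counts (hypothetical data): the Poisson negative log-likelihood is minimised directly with `optim()` and the result is compared with `glm()`, which maximises the same likelihood by iteratively reweighted least squares.

```r
set.seed(10)
# Simulated counts (hypothetical)
x <- runif(150, 0, 2)
y <- rpois(150, lambda = exp(0.5 + 0.8 * x))

# Negative log-likelihood of a Poisson regression with log link
nll <- function(beta) {
  lambda <- exp(beta[1] + beta[2] * x)
  -sum(dpois(y, lambda = lambda, log = TRUE))
}

# Minimise it numerically, starting from an intercept-only guess
ml <- optim(par = c(log(mean(y)), 0), fn = nll, method = "BFGS")

# glm() should arrive at (nearly) the same estimates
fit <- glm(y ~ x, family = poisson(link = "log"))
rbind(optim = ml$par, glm = unname(coef(fit)))
```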

---
class: middle, center, inverse

# 2. glmm

---
## glmm

+ modeling variance

+ non-independent data (violates the independence of residuals assumption): 
 + blocks (spatial, temporal, genetic)
 + individual level effects (repeated measures)

+ zero-inflated models

---
## glmm: bacterial growth

which of four growth media is best for rearing large populations of anthrax?

---
## reading the data and creating a lm

```r
bac <- read.csv("data/raw/valletta_bac.csv")

bac$media <- as.factor(bac$media)
bac$cabinet <- as.factor(bac$cabinet)

bac_lm <- lm(growth ~ media, data = bac)
```

---
## creating a lm

.tiny[

```r
summary(bac_lm)
```

```
## 
## Call:
## lm(formula = growth ~ media, data = bac)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3.96  -2.00  -0.68   0.86   6.64 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.580      1.380   4.043 0.000943 ***
## media2         1.780      1.952   0.912 0.375315    
## media3        -1.960      1.952  -1.004 0.330226    
## media4        -3.840      1.952  -1.967 0.066720 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.086 on 16 degrees of freedom
## Multiple R-squared:  0.3676,	Adjusted R-squared:  0.249 
## F-statistic:   3.1 on 3 and 16 DF,  p-value: 0.05638
```

]

---
## taking cabinet effect into account

```r
bac_lm2 <- lm(growth ~ media + cabinet, data = bac)
```

---
## taking cabinet effect into account

.tiny[

```r
summary(bac_lm2)
```

```
## 
## Call:
## lm(formula = growth ~ media + cabinet, data = bac)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5650 -0.5363  0.1600  0.5375  1.8150 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.8800     0.7125   4.042 0.001633 ** 
## media2        1.7800     0.7125   2.498 0.028007 *  
## media3       -1.9600     0.7125  -2.751 0.017575 *  
## media4       -3.8400     0.7125  -5.389 0.000163 ***
## cabinet2      1.9250     0.7966   2.416 0.032525 *  
## cabinet3      7.5250     0.7966   9.446  6.6e-07 ***
## cabinet4      0.9750     0.7966   1.224 0.244463    
## cabinet5      3.0750     0.7966   3.860 0.002268 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.127 on 12 degrees of freedom
## Multiple R-squared:  0.9368,	Adjusted R-squared:  0.8999 
## F-statistic: 25.41 on 7 and 12 DF,  p-value: 2.766e-06
```

]

---
## creating a glmm

.pull-left[
fixed effect

__media__:

+ we chose the media to be tested

+ each medium has a specific identity

+ we want to estimate the differences in bacterial growth between different media
]

.pull-right[
random effect

__cabinet__: 
+ we don’t care about the identity of each cabinet

+ each cabinet is sampled from a population of possible cabinets

+ we just want to predict and absorb the variance in bacterial growth rate explained by cabinet

]

---
## glmm using lme4 package

+ the intercept of our linear model will vary according to cabinet

`$$Y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3}+  \gamma C_i + \epsilon_i$$`

`$x_{i∗}$` = dummy variables corresponding to levels of media

```r
library(lme4)
bac_lmer <- lmer(growth ~ media + (1 | cabinet), data = bac)
```
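
A self-contained sketch of the same model structure on simulated data (a hypothetical balanced design; the effect sizes and standard deviations are made up), showing where the fixed effects, the variance components, and the predicted cabinet intercepts end up:

```r
library(lme4)

set.seed(5)
# Simulated balanced design (hypothetical): 4 media in each of 5 cabinets
sim <- expand.grid(media = factor(1:4), cabinet = factor(1:5))
media_effect   <- c(0, 2, -2, -4)             # fixed effects relative to media 1
cabinet_effect <- rnorm(5, mean = 0, sd = 3)  # random intercepts, one per cabinet
sim$growth <- 5 + media_effect[sim$media] + cabinet_effect[sim$cabinet] +
  rnorm(nrow(sim), sd = 1)

fit <- lmer(growth ~ media + (1 | cabinet), data = sim)

fixef(fit)          # beta_0 ... beta_3: intercept and media contrasts
VarCorr(fit)        # estimated SDs of the cabinet intercepts and the residuals
ranef(fit)$cabinet  # predicted deviation of each cabinet from the overall intercept
```

The cabinet term costs one variance parameter here, instead of one coefficient per cabinet as in `lm(growth ~ media + cabinet)`.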

---
## glmm using lme4 package

.tiny[

```r
summary(bac_lmer)
```

```
## Linear mixed model fit by REML ['lmerMod']
## Formula: growth ~ media + (1 | cabinet)
##    Data: bac
## 
## REML criterion at convergence: 68.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.2306 -0.5407  0.1088  0.4320  1.7696 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  cabinet  (Intercept) 8.255    2.873   
##  Residual             1.269    1.127   
## Number of obs: 20, groups:  cabinet, 5
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   5.5800     1.3801   4.043
## media2        1.7800     0.7125   2.498
## media3       -1.9600     0.7125  -2.751
## media4       -3.8400     0.7125  -5.389
## 
## Correlation of Fixed Effects:
##        (Intr) media2 media3
## media2 -0.258              
## media3 -0.258  0.500       
## media4 -0.258  0.500  0.500
```
]
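
The variance components in the output above can be summarised as an intraclass correlation. A short sketch using the reported values:

```r
# Variance components copied from the summary(bac_lmer) output above
var_cabinet  <- 8.255
var_residual <- 1.269

# Share of the (media-adjusted) variance in growth attributable to cabinet
icc <- var_cabinet / (var_cabinet + var_residual)
round(icc, 3)
```

Roughly 87% of the variance left after accounting for media sits between cabinets, which is consistent with the large drop in residual standard error once cabinet enters the model.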

---
## visualizing the glmm

[merTools package](https://cran.r-project.org/web/packages/merTools/vignettes/merToolsIntro.html)

```r
library(merTools)
feEx <- FEsim(bac_lmer, 1000)
```

---
## visualizing the glmm

.pull-left[

```r
pfe <- plotFEsim(feEx) +
  theme_bw() + labs(title = "Coefficient Plot",
                    x = "Median Effect Estimate", y = "Evaluation Rating")
```
]

.pull-right[

![](08_slides_files/figure-html/unnamed-chunk-25-1.png)

]

---
## todo

- model selection tutorial

- run code from the slides

- `git add`, `commit`, and `push` of the day

---
class: center, middle

# ¡Thanks!

<center>
<svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;fill:#A70000;height:1em;" xmlns="http://www.w3.org/2000/svg">  <path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"></path></svg> [saramortara@gmail.com](mailto:saramortara@gmail.com) | [andreasancheztapia@gmail.com](mailto:andreasancheztapia@gmail.com)