class: center, middle, inverse, title-slide .title[ # Statistical modeling (part 2) ] .author[ ### Sara Mortara & Andrea Sánchez-Tapia ] .institute[ ### re.green | ¡liibre! ] .date[ ### 2022-07-20 ] --- <style type="text/css"> .tiny .remark-code { /*Change made here*/ font-size: 50% !important; } </style> ## today 1. glm 2. glmm 3. model selection tutorial --- class: middle, center, inverse # 1. glm --- ## road map <img src="figs/stats_diagram_bolker.png" width="699" style="display: block; margin: auto;" /> --- ## linear models + linear relationship between x and y + variances are equal across all predicted values of the response (homoscedatic) + errors are normally distributed + errors are independent --- ## generalized linear models + a linear mean (of your making) + a link function (like an ‘internal’ transformation) + an error structure --- ## link function links your mean function to the scale of the observed data - response variable `\(Y\)` and explanatory variable(s) `\(X\)` - linear function: `\(\beta_0 + \beta_1 X\)` - `\(E(Y) = g^{-1}\left(\beta_0 + \beta_1 X\right)\)` - the function `\(g(\cdot)\)` is known as the link function - `\(g^{-1}(\cdot)\)` denotes the inverse of `\(g(\cdot)\)` --- ## back to linear regression `lm` as a special case of `glm` ```r df <- read.csv("data/raw/crawley_regression.csv") lm(growth ~ tannin, data = df) ``` ``` ## ## Call: ## lm(formula = growth ~ tannin, data = df) ## ## Coefficients: ## (Intercept) tannin ## 11.756 -1.217 ``` --- ## back to linear regression `family`: error structure __and__ the link function ```r glm(growth ~ tannin, data = df, family = gaussian(link = identity)) ``` ``` ## ## Call: glm(formula = growth ~ tannin, family = gaussian(link = identity), ## data = df) ## ## Coefficients: ## (Intercept) tannin ## 11.756 -1.217 ## ## Degrees of Freedom: 8 Total (i.e. Null); 7 Residual ## Null Deviance: 108.9 ## Residual Deviance: 20.07 AIC: 38.76 ``` --- ## link function in gaussian family The default link function for the normal (Gaussian) distribution is the identity, i.e. for mean `\(\mu\)` we have: `$$\mu = \beta_0 + \beta_1X$$` --- ## Poisson regression - count data (positive) - still getting back to glm - __Gaussian__ error structure and __identity__ link .pull-left[ `$$Y = \beta_0 + \beta_1X + \epsilon$$` `$$\epsilon \sim N(0, \sigma^2)$$` ] - .pull-right[ `$$Y \sim N(\mu, \sigma^2)$$` `$$\mu = \beta_0 + \beta_1$$` ] --- ## Poisson regression .pull-left[ + Family: Gaussian + Link: identity `$$Y \sim N(\mu, \sigma^2)$$` `$$\mu = \beta_0 + \beta_1X$$` ] .pull-right[ + Family: Poisson + Link: log `$$Y \sim Pois(\lambda)$$` `$$log \lambda = \beta_0 + \beta_1X$$` `$$\lambda = \exp^{\beta_0 + \beta_1X}$$` ] - we will still fit straight lines - linear for the __log__ transformed observations --- ## Poisson distribution + discrete variable, defined on the range [0, 1, `\(\infty\)`] + mean = `\(\lambda\)` + variance = `\(\lambda\)` <center> ![](08_slides_files/figure-html/unnamed-chunk-4-1.png)<!-- -->![](08_slides_files/figure-html/unnamed-chunk-4-2.png)<!-- -->![](08_slides_files/figure-html/unnamed-chunk-4-3.png)<!-- --> - as the mean increases, the variance increases also --> heteroscedascity --- ## Poisson regression: cuckoo example How does nestling mass affect begging rates between the different species? .pull-left[ <img src="figs/03-cuckoo.png" width="80%" /> [Kilner et al 1999](https://www.nature.com/articles/17746) ] .pull-right[ + __Mass__: nestling mass of chick in grams + __Beg__: begging calls per 6 secs + __Species__: Warbler or Cuckoo ] --- ## read the data ```r cuckoo <- read.csv("data/raw/valletta_cuckoo.csv") summary(cuckoo) ``` ``` ## Mass Beg Species ## Min. : 4.988 Min. : 0.00 Length:58 ## 1st Qu.:10.682 1st Qu.: 1.25 Class :character ## Median :23.012 Median : 13.00 Mode :character ## Mean :23.650 Mean : 23.71 ## 3rd Qu.:31.812 3rd Qu.: 36.50 ## Max. :62.956 Max. :114.00 ``` --- ## understanding the data <img src="08_slides_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> --- ## let's fit a lm ```r # Fitting a model with an interaction term cuckoo_lm <- lm(Beg ~ Mass * Species, data = cuckoo) ``` --- ## inspecting the lm <center> ![](08_slides_files/figure-html/unnamed-chunk-9-1.png)<!-- --> --- ## fitting a Poisson glm ```r cuckoo_glm <- glm(Beg ~ Mass * Species, data = cuckoo, family = poisson(link = log)) ``` --- ## understanding the model + model with interaction term: `$$log\lambda = \beta_0 + \beta_1M_1 + \beta_2S_i + \beta_3M_iS_i$$` `\(M_i\)` = nestling mass `\(S_i\)` = categorical explanatory variable 1 = warbler + cuckoo: `\(S_i\)` = 0 + warbler: `\(S_i\)` = 1 --- ## understanding the model `$$log\lambda = \beta_0 + \beta_1M_1 + \beta_2S_i + \beta_3M_iS_i$$` cuckoo: `$$log\lambda = \beta_0 + \beta_1M_1$$` warbler: `$$log\lambda = \beta_0 + \beta_1M_1 + \beta_2S_i + \beta_3M_iS_i$$` `$$log\lambda = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)M_i$$` --- ## cuckoo glm .tiny[ ```r summary(cuckoo_glm) ``` ``` ## ## Call: ## glm(formula = Beg ~ Mass * Species, family = poisson(link = log), ## data = cuckoo) ## ## Deviance Residuals: ## Min 1Q Median 3Q Max ## -7.4570 -3.0504 -0.0006 1.9389 5.2139 ## ## Coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) 1.589861 0.104531 15.209 < 2e-16 *** ## Mass 0.054736 0.002298 23.820 < 2e-16 *** ## SpeciesWarbler -0.535546 0.161304 -3.320 0.000900 *** ## Mass:SpeciesWarbler 0.015822 0.004662 3.394 0.000689 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 1730.04 on 57 degrees of freedom ## Residual deviance: 562.08 on 54 degrees of freedom ## AIC: 784.81 ## ## Number of Fisher Scoring iterations: 5 ``` ] --- ## inspecting the glm <center> ![](08_slides_files/figure-html/unnamed-chunk-12-1.png)<!-- --> --- ## creating the Poisson regression line ```r newdata <- expand.grid(Mass = seq(min(cuckoo$Mass), max(cuckoo$Mass), length.out = 200), Species = unique(cuckoo$Species)) newdata$Beg <- predict(cuckoo_glm, newdata, type = 'response') ``` --- ## creating the Poisson regression line .pull-left[ ```r p <- ggplot(mapping = aes(x = Mass, y = Beg, colour = Species)) + geom_point(data = cuckoo) + geom_line(data = newdata) + my_theme ``` ] .pull-right[ ![](08_slides_files/figure-html/unnamed-chunk-15-1.png)<!-- --> ] --- class: middle, center # bestiary of error distributions --- .tiny[ Response variable | Error distribution | Canonical link function | ------------- | -----| -------| Continuous positive and negative values | Gaussian/Normal | Identity | Counts | Poisson | Log | Counts with over-dispersion | Negative Binomial, Quasi-Poisson | Log Log | Proportions (no. successes/total trials) | Binomial | Logit | Binary (male/female, alive/dead) | Binomial (Bernoulli) | Logit | Proportions or binary with overdispersion | Quasi-Binomial | Logit | Time to event (germination, death) | Gamma | Inverse | ] --- ## parameter estimation - maximum likelihood - quasi-likelihood - Bayesian approaches --- class: middle, center, inverse # 2. glmm --- ## glmm + modeling variance + non-independent data (violates the independence of residuals assumption): + blocks (spatial, temporal, genetic) + individual level effects (repeated measures) + zero-inflated modes --- ## glmm: bacterial growth which of four growth media is best for rearing large populations of anthrax? <img src="figs/experimental_design.png" width="70%" style="display: block; margin: auto;" /> --- ## reading the data and creating a lm ```r bac <- read.csv("data/raw/valletta_bac.csv") bac$media <- as.factor(bac$media) bac$cabinet <- as.factor(bac$cabinet) bac_lm <- lm(growth ~ media, data = bac) ``` --- ## creating a lm .tiny[ ```r summary(bac_lm) ``` ``` ## ## Call: ## lm(formula = growth ~ media, data = bac) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.96 -2.00 -0.68 0.86 6.64 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 5.580 1.380 4.043 0.000943 *** ## media2 1.780 1.952 0.912 0.375315 ## media3 -1.960 1.952 -1.004 0.330226 ## media4 -3.840 1.952 -1.967 0.066720 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.086 on 16 degrees of freedom ## Multiple R-squared: 0.3676, Adjusted R-squared: 0.249 ## F-statistic: 3.1 on 3 and 16 DF, p-value: 0.05638 ``` ] --- ## taking cabinet effect into account ```r bac_lm2 <- lm(growth ~ media + cabinet, data = bac) ``` --- ## taking cabinet effect into account .tiny[ ```r summary(bac_lm2) ``` ``` ## ## Call: ## lm(formula = growth ~ media + cabinet, data = bac) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.5650 -0.5363 0.1600 0.5375 1.8150 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 2.8800 0.7125 4.042 0.001633 ** ## media2 1.7800 0.7125 2.498 0.028007 * ## media3 -1.9600 0.7125 -2.751 0.017575 * ## media4 -3.8400 0.7125 -5.389 0.000163 *** ## cabinet2 1.9250 0.7966 2.416 0.032525 * ## cabinet3 7.5250 0.7966 9.446 6.6e-07 *** ## cabinet4 0.9750 0.7966 1.224 0.244463 ## cabinet5 3.0750 0.7966 3.860 0.002268 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.127 on 12 degrees of freedom ## Multiple R-squared: 0.9368, Adjusted R-squared: 0.8999 ## F-statistic: 25.41 on 7 and 12 DF, p-value: 2.766e-06 ``` ] --- ## creating a glmm .pull-left[ fixed effect __media__: + we chose the media to be tested + each media has a specific identity + we want to estimate the differences in bacterial growth between different media ] .pull-right[ random effect __cabinet__: + we don’t care about the identity of each cabinet + each cabinet is sampled from a population of possible cabinets + we just want to predict and absorb the variance in bacterial growth rate explained by cabinet ] --- ## glmm using lme4 package + the intercept of our linear model will vary according to cabinet `$$Y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i3}+ \gamma C_i + \epsilon_i$$` `\(x_{i∗}\)` = dummy variables corresponding to levels of media ```r bac_lmer <- lmer(growth ~ media + (1 | cabinet), data = bac) ``` --- ## glmm using lme4 package .tiny[ ```r summary(bac_lmer) ``` ``` ## Linear mixed model fit by REML ['lmerMod'] ## Formula: growth ~ media + (1 | cabinet) ## Data: bac ## ## REML criterion at convergence: 68.8 ## ## Scaled residuals: ## Min 1Q Median 3Q Max ## -1.2306 -0.5407 0.1088 0.4320 1.7696 ## ## Random effects: ## Groups Name Variance Std.Dev. ## cabinet (Intercept) 8.255 2.873 ## Residual 1.269 1.127 ## Number of obs: 20, groups: cabinet, 5 ## ## Fixed effects: ## Estimate Std. Error t value ## (Intercept) 5.5800 1.3801 4.043 ## media2 1.7800 0.7125 2.498 ## media3 -1.9600 0.7125 -2.751 ## media4 -3.8400 0.7125 -5.389 ## ## Correlation of Fixed Effects: ## (Intr) media2 media3 ## media2 -0.258 ## media3 -0.258 0.500 ## media4 -0.258 0.500 0.500 ``` ] --- ## visualizing the glmm [merTools package](https://cran.r-project.org/web/packages/merTools/vignettes/merToolsIntro.html) ```r feEx <- FEsim(bac_lmer, 1000) ``` --- ## visualizing the glmm .pull-left[ ```r pfe <- plotFEsim(feEx) + theme_bw() + labs(title = "Coefficient Plot", x = "Median Effect Estimate", y = "Evaluation Rating") ``` ] .pull.right[ ![](08_slides_files/figure-html/unnamed-chunk-25-1.png)<!-- --> ] --- ## todo <svg viewBox="0 0 640 512" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M255.03 261.65c6.25 6.25 16.38 6.25 22.63 0l11.31-11.31c6.25-6.25 6.25-16.38 0-22.63L253.25 192l35.71-35.72c6.25-6.25 6.25-16.38 0-22.63l-11.31-11.31c-6.25-6.25-16.38-6.25-22.63 0l-58.34 58.34c-6.25 6.25-6.25 16.38 0 22.63l58.35 58.34zm96.01-11.3l11.31 11.31c6.25 6.25 16.38 6.25 22.63 0l58.34-58.34c6.25-6.25 6.25-16.38 0-22.63l-58.34-58.34c-6.25-6.25-16.38-6.25-22.63 0l-11.31 11.31c-6.25 6.25-6.25 16.38 0 22.63L386.75 192l-35.71 35.72c-6.25 6.25-6.25 16.38 0 22.63zM624 416H381.54c-.74 19.81-14.71 32-32.74 32H288c-18.69 0-33.02-17.47-32.77-32H16c-8.8 0-16 7.2-16 16v16c0 35.2 28.8 64 64 64h512c35.2 0 64-28.8 64-64v-16c0-8.8-7.2-16-16-16zM576 48c0-26.4-21.6-48-48-48H112C85.6 0 64 21.6 64 48v336h512V48zm-64 272H128V64h384v256z"></path></svg> - model selection tutorial - run code from the slides - `git add`, `commit`, and `push` of the day --- class: center, middle # ¡Thanks! <center> <svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;fill:#A70000;height:1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"></path></svg> [saramortara@gmail.com](mailto:saramortara@gmail.com) | [andreasancheztapia@gmail.com](mailto:andreasancheztapia@gmail.com) <svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;fill:#A70000;height:1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg> [@MortaraSara](https://twitter.com/MortaraSara) | [@SanchezTapiaA](https://twitter.com/SanchezTapiaA) <svg viewBox="0 0 496 512" style="position:relative;display:inline-block;top:.1em;fill:#A70000;height:1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg><svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;fill:#A70000;height:1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M105.2 24.9c-3.1-8.9-15.7-8.9-18.9 0L29.8 199.7h132c-.1 0-56.6-174.8-56.6-174.8zM.9 287.7c-2.6 8 .3 16.9 7.1 22l247.9 184-226.2-294zm160.8-88l94.3 294 94.3-294zm349.4 88l-28.8-88-226.3 294 247.9-184c6.9-5.1 9.7-14 7.2-22zM425.7 24.9c-3.1-8.9-15.7-8.9-18.9 0l-56.6 174.8h132z"></path></svg> [saramortara](http://github.com/saramortara) | [andreasancheztapia](http://github.com/andreasancheztapia)