Statistical modeling

Sara Mortara & Andrea Sánchez-Tapia

re.green | ¡liibre!

2022-07-19

1 / 45

about

  1. concepts in statistical modeling

  2. probability distributions

  3. the linear model

2 / 45

1. concepts in statistical modeling

3 / 45

connect theory with data using statistical models

4 / 45

best references

5 / 45

model & data

  • data are not sacrosanct
  • search for a minimal and suitable model

7 / 45

the data

  • continuous or discrete variable?
  • how many replicates?
  • what are the predictor variables?
  • what is the pattern?

concepts

  • maximum likelihood
  • principle of parsimony
    • Occam's razor
8 / 45

maximum likelihood

given the data and the model:

which parameter values make the data most plausible?
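
As a minimal sketch of this idea, the code below simulates data from a normal distribution and evaluates the log-likelihood over a grid of candidate means, with the standard deviation treated as known. The data, the grid, and the helper `loglik` are illustrative assumptions, not part of the course material.

```r
set.seed(42)
# hypothetical data: 50 observations from a normal distribution
obs <- rnorm(50, mean = 10, sd = 2)

# log-likelihood of a candidate mean, with sd assumed known
loglik <- function(mu, x, sd = 2) {
  sum(dnorm(x, mean = mu, sd = sd, log = TRUE))
}

# evaluate a grid of candidate values for the mean
candidates <- seq(8, 12, by = 0.01)
ll <- sapply(candidates, loglik, x = obs)

# the candidate that makes the data most plausible
mle <- candidates[which.max(ll)]
mle
mean(obs)  # for this model the maximum sits at the sample mean
```

The grid search is used only to make the idea visible; in practice an optimizer or a closed-form solution would usually replace it.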

9 / 45

principle of parsimony

all else being equal, the simplest solution is the best

William of Occam

10 / 45

principle of parsimony

  • models with as few parameters as possible
  • linear models are preferable to non-linear ones
  • fewer assumptions
  • minimally adequate models
  • simpler explanations

11 / 45

best model is just a model

  • all models are wrong
  • some models are better than others
  • we are never sure of the correct model
  • the simpler the model, the better -- but not simplistic

12 / 45

2. statistical distributions

13 / 45

statistical distributions

| distribution | type | E(X) | σ²(X) | usage | example |
|---|---|---|---|---|---|
| normal | continuous | μ | σ² | symmetric curve for continuous data | size distribution |
| binomial | discrete | np | np(1 - p) | number of successes in n attempts | presence or absence of species |
| Poisson | discrete | λ | λ | independent rare events, where λ is the rate at which the event occurs in space or time | distribution of rare species in space |
| log-normal | continuous | exp(μ + σ²/2) | (exp(σ²) - 1) exp(2μ + σ²) | asymmetric curve | species abundance distribution |

For the log-normal, μ and σ² are the mean and variance of log(X).
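
The expected values and variances of the discrete distributions can be checked by simulation. This is a rough sketch; the parameter values (`n`, `p`, `lambda`) and the number of draws are arbitrary choices made for illustration.

```r
set.seed(1)
n_draws <- 1e5

# binomial: n = 10 attempts, success probability p = 0.3
n <- 10
p <- 0.3
x_binom <- rbinom(n_draws, size = n, prob = p)
c(mean = mean(x_binom), expected = n * p)
c(var = var(x_binom), expected = n * p * (1 - p))

# Poisson: rate lambda = 2, so mean and variance should both be near 2
lambda <- 2
x_pois <- rpois(n_draws, lambda = lambda)
c(mean = mean(x_pois), var = var(x_pois), expected = lambda)
```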
14 / 45

continuous distributions

df_n <- data.frame(val = rnorm(1000, mean = 0, sd = 1))
df_ln <- data.frame(val = exp(rnorm(1000)))
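
The two samples above could be compared with histograms. This sketch uses base R graphics, which may differ from the plotting approach in the original slides; the seed is added here only for reproducibility.

```r
set.seed(123)
df_n <- data.frame(val = rnorm(1000, mean = 0, sd = 1))
df_ln <- data.frame(val = exp(rnorm(1000)))

# side-by-side histograms: symmetric normal vs right-skewed log-normal
op <- par(mfrow = c(1, 2))
hist(df_n$val, main = "normal", xlab = "val")
hist(df_ln$val, main = "log-normal", xlab = "val")
par(op)

# taking logs of the log-normal sample gives back a symmetric distribution
summary(log(df_ln$val))
```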
15 / 45

3. the linear model

mathematical model + uncertainty: Y=a+bx+ϵ
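
To make the two parts concrete, the sketch below builds the deterministic line first and then adds normally distributed noise. The values of `a`, `b`, and `sigma` are hypothetical and chosen only for illustration.

```r
set.seed(2022)
# hypothetical parameter values
a <- 2       # intercept
b <- 0.5     # slope
sigma <- 1   # standard deviation of the error

x <- seq(1, 20, length.out = 30)
deterministic <- a + b * x                          # mathematical model
epsilon <- rnorm(length(x), mean = 0, sd = sigma)   # uncertainty
Y <- deterministic + epsilon                        # statistical model

plot(x, Y)
abline(a = a, b = b)  # the underlying deterministic line
```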

17 / 45

statistical model

18 / 45

relation between variables: prediction

19 / 45

relation between variables: extrapolation
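
The difference between prediction and extrapolation can be sketched with `predict()`. The data here are simulated and the new x values (5 and 30) are arbitrary: one lies inside the observed range, the other far outside it.

```r
set.seed(7)
# hypothetical data observed for x between 1 and 10
x <- runif(40, min = 1, max = 10)
y <- 3 + 2 * x + rnorm(40, sd = 1.5)
fit <- lm(y ~ x)

# x = 5 is inside the observed range (prediction),
# x = 30 is far outside it (extrapolation)
new_x <- data.frame(x = c(5, 30))
pred <- predict(fit, newdata = new_x, interval = "prediction")
pred

# the interval is wider for the extrapolated point
width <- unname(pred[, "upr"] - pred[, "lwr"])
width
```

The wider interval only reflects sampling uncertainty; it does not tell us whether the relationship stays linear beyond the observed range.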

20 / 45

the linear model

y=a+bx

y=α+βX+ϵ

ϵ∼N(0,σ²)

21 / 45

is the response variable normal?

22 / 45

what is the relationship between the predictor and the response variable?

24 / 45

assumptions

  • relationship between x and y is linear

  • normality of residuals

  • homoscedasticity -- homogeneity of residual variance

  • independence of residuals (error terms)
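
One possible way to inspect the first three assumptions is through the standard diagnostic plots of a fitted model. The data below are simulated for illustration. Independence usually has to be judged from the sampling design rather than from these plots.

```r
set.seed(10)
# hypothetical data that meet the assumptions
x <- runif(50, min = 0, max = 10)
y <- 1 + 2 * x + rnorm(50, sd = 1)
fit <- lm(y ~ x)

# residuals vs fitted (linearity, homoscedasticity), normal Q-Q (normality),
# scale-location, and residuals vs leverage
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# a formal test of residual normality, to be read alongside the Q-Q plot
shapiro.test(residuals(fit))
```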

25 / 45

parameter estimation

  • least squares method
  • maximum likelihood

26 / 45

least squares method
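
For simple linear regression the least squares estimates have a closed form: the slope is the sum of cross-products of x and y divided by the sum of squares of x, and the intercept follows from the means. The sketch below uses simulated data; `a_hat`, `b_hat`, and `rss` are illustrative names.

```r
set.seed(3)
# hypothetical data
x <- 1:9
y <- 1.5 + 4 * x + rnorm(9, sd = 5)

# least squares estimates in closed form
b_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a_hat <- mean(y) - b_hat * mean(x)

# the quantity being minimized: the residual sum of squares
rss <- function(a, b) sum((y - (a + b * x))^2)
rss(a_hat, b_hat)
rss(a_hat, b_hat + 0.5)  # a different slope gives a larger value

# lm() returns the same estimates
c(a_hat, b_hat)
coef(lm(y ~ x))
```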

27 / 45

linear model in R

mod <- lm(y1 ~ x1)
summary(mod)
##
## Call:
## lm(formula = y1 ~ x1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.1424 -4.0088  0.9982  2.8714  6.1706
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    1.632      4.520   0.361   0.7287
## x1             4.076      1.384   2.945   0.0216 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.361 on 7 degrees of freedom
## Multiple R-squared: 0.5534, Adjusted R-squared: 0.4895
## F-statistic: 8.672 on 1 and 7 DF, p-value: 0.02156
31 / 45

uncertainty in the estimate

y=1.63+4.08x+ϵ

estimation of coefficients

coef(mod)
## (Intercept)          x1
##    1.632390    4.075949

confidence interval

confint(mod)
##                  2.5 %    97.5 %
## (Intercept) -9.0566136 12.321394
## x1           0.8031237  7.348775
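
The interval for the slope can be reproduced by hand from the summary output, as the estimate plus or minus a t quantile times the standard error. This sketch copies the numbers from the output above, so the result should match `confint(mod)` only up to rounding of the standard error.

```r
# values copied from the summary(mod) and coef(mod) output
estimate <- 4.075949   # slope estimate
std_error <- 1.384     # standard error of the slope (rounded in the output)
df_resid <- 7          # residual degrees of freedom

# two-sided 95% interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error
ci
```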
32 / 45

linear model residuals

33 / 45

linear model variance partitioning

sum of squares from the linear model

$$SS_{total} = SS_{between} + SS_{error}$$

34 / 45

total sum of squares

$$SS_{total} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

$$SS_{total} = 450.35$$

36 / 45

residual sum of squares

$$SS_{error} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

$$SS_{error} = 201.15$$

38 / 45

model sum of squares

$$SS_{total} = SS_{between} + SS_{error}$$

$$SS_{between} = SS_{total} - SS_{error}$$

$$SS_{between} = 450.35 - 201.15$$

$$SS_{between} = 249.2$$
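
The same partition can be computed directly in R. The slides do not show the original `x1` and `y1`, so this sketch uses simulated data and its sums of squares will differ from the values above; only the identity between them is the point.

```r
set.seed(5)
# hypothetical data (not the x1 and y1 used in the slides)
x <- 1:9
y <- 2 + 4 * x + rnorm(9, sd = 5)
fit <- lm(y ~ x)

ss_total <- sum((y - mean(y))^2)              # variation around the mean
ss_error <- sum((y - fitted(fit))^2)          # variation around the fitted line
ss_between <- sum((fitted(fit) - mean(y))^2)  # variation explained by the model

c(total = ss_total, between = ss_between, error = ss_error)
all.equal(ss_total, ss_between + ss_error)
```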

39 / 45

variance partitioning

$$SS_{total} = 450.35$$

$$SS_{between} = 249.2$$

$$SS_{error} = 201.15$$

40 / 45

variance partitioning

anova table

anova(mod)
## Analysis of Variance Table
##
## Response: y1
##           Df Sum Sq Mean Sq F value  Pr(>F)
## x1         1 249.20 249.200  8.6723 0.02156 *
## Residuals  7 201.15  28.735
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
41 / 45

coefficient of determination R²

$$R^2 = \frac{SS_{between}}{SS_{total}}$$

$$R^2 = \frac{249.2}{450.35}$$

$$R^2 = 0.5533$$
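
Starting from the sums of squares and degrees of freedom reported in the slides, the remaining model-level statistics can be recomputed by hand. This is a sketch with illustrative variable names; small differences from the `summary(mod)` output come from rounding of the inputs.

```r
# sums of squares and degrees of freedom taken from the slides
ss_between <- 249.20
ss_error <- 201.15
ss_total <- ss_between + ss_error
df_model <- 1
df_error <- 7

r2 <- ss_between / ss_total
# adjusted R-squared penalizes the number of parameters
r2_adj <- 1 - (ss_error / df_error) / (ss_total / (df_model + df_error))

# F statistic: ratio of mean squares, with its p-value
f_value <- (ss_between / df_model) / (ss_error / df_error)
p_value <- pf(f_value, df_model, df_error, lower.tail = FALSE)

round(c(r2 = r2, r2_adj = r2_adj, F = f_value, p = p_value), 4)
```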

42 / 45

coefficient of determination R²

summary(mod)
##
## Call:
## lm(formula = y1 ~ x1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.1424 -4.0088  0.9982  2.8714  6.1706
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    1.632      4.520   0.361   0.7287
## x1             4.076      1.384   2.945   0.0216 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.361 on 7 degrees of freedom
## Multiple R-squared: 0.5534, Adjusted R-squared: 0.4895
## F-statistic: 8.672 on 1 and 7 DF, p-value: 0.02156
43 / 45

todo

  • lm tutorial
  • git add, commit, and push of the day
44 / 45
