Introduction to multivariate analysis in ecology

class: center, middle, inverse, title-slide

.title[
# Introduction to multivariate analysis in ecology
]
.subtitle[
## Serrapilheira/ICTP-SAIFR Training Program in Quantitative Biology and Ecology
]
.author[
### Andrea Sánchez-Tapia & Sara Mortara
]
.date[
### 4 August 2022
]

---

https://cran.r-project.org/web/views/Multivariate.html

## Multivariate analysis

+ The response variable is a matrix with objects described by variables

+ The descriptors (variables) can be species or any other kind of variable, e.g. environmental measurements

+ Pairwise comparisons between objects, association between variables: distance/dissimilarity measures

---

## Dissimilarity metrics

+ Between objects (sites): based on presence/absence of species (__shared presences or absences__) or on abundances. E.g. Jaccard (presence/absence), Bray-Curtis (abundance)

|  | Site 2: presence | Site 2: absence |
|--:|--:|--:|
| Site 1: presence | a | b |
| Site 1: absence | c | d |

- `vegan::vegdist()`

+ Raw Euclidean distance works poorly with abundance data: it is dominated by the most abundant species and counts shared absences as similarity. Standardize!

+ Between variables (species or other descriptors): correlation, covariance, Euclidean distance
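
As a minimal sketch of how the cells of the presence/absence table feed into a dissimilarity index, the example below computes Jaccard and Bray-Curtis dissimilarities by hand in base R. The abundances and the names `site_1`, `site_2`, `n_a` to `n_d` are made up for illustration; in practice `vegan::vegdist()` would normally be used.

```r
# Hypothetical abundances of five species at two sites (illustrative data)
site_1 <- c(sp1 = 5, sp2 = 0, sp3 = 3, sp4 = 0, sp5 = 1)
site_2 <- c(sp1 = 2, sp2 = 4, sp3 = 0, sp4 = 0, sp5 = 1)

# Cells of the presence/absence table
n_a <- sum(site_1 > 0 & site_2 > 0)    # shared presences
n_b <- sum(site_1 > 0 & site_2 == 0)   # present only at site 1
n_c <- sum(site_1 == 0 & site_2 > 0)   # present only at site 2
n_d <- sum(site_1 == 0 & site_2 == 0)  # shared absences (not used by Jaccard)

# Jaccard dissimilarity: unshared species over all species present somewhere
jaccard <- (n_b + n_c) / (n_a + n_b + n_c)                    # 0.5

# Base R gives the same value with the "binary" method of dist()
jaccard_base <- as.numeric(dist(rbind(site_1, site_2), method = "binary"))

# Bray-Curtis dissimilarity uses the abundances themselves
bray_curtis <- sum(abs(site_1 - site_2)) / sum(site_1 + site_2)  # 0.625
```

Neither index uses the shared absences `n_d`, which is one reason both are preferred over raw Euclidean distance for species data.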

---
## Metric dissimilarity indices

+ `\(D_{A-B} = D_{B-A}\)`

+ If `\(A = B\)` then `\(D_{A-B} = 0\)`

+ Triangle inequality:

`\(D_{A-B} + D_{B-C} ≥ D_{A-C}\)`
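
The three properties can be checked numerically. The sketch below uses three made-up objects `A`, `B` and `C` and a hand-written `bray_curtis()` helper (both are assumptions for illustration) to show that Euclidean distance satisfies the triangle inequality while Bray-Curtis, which is symmetric and zero for identical objects, can violate it.

```r
# Hand-written helpers (illustrative, not taken from a package)
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)
euclidean   <- function(x, y) sqrt(sum((x - y)^2))

# Three hypothetical objects described by two species
A <- c(1, 0)
B <- c(1, 1)
C <- c(0, 1)

# Symmetry and identity hold for Bray-Curtis
is_symmetric <- isTRUE(all.equal(bray_curtis(A, B), bray_curtis(B, A)))
is_zero_self <- bray_curtis(A, A) == 0

# Triangle inequality: D(A,B) + D(B,C) >= D(A,C)
d_ab <- bray_curtis(A, B)  # 1/3
d_bc <- bray_curtis(B, C)  # 1/3
d_ac <- bray_curtis(A, C)  # 1
bray_triangle_ok <- d_ab + d_bc >= d_ac        # FALSE: 2/3 < 1

euclid_triangle_ok <-
  euclidean(A, B) + euclidean(B, C) >= euclidean(A, C)  # TRUE: 2 >= sqrt(2)
```

An index that fails only the triangle inequality, as Bray-Curtis does here, is usually called semimetric.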

---
## Clustering

+ Clustering methods: find groups (_clusters_) in data, with high similarity within groups and high dissimilarity between groups

+ What does it mean to be similar? __Dissimilarity measure__

+ How do we create the groups? __Clustering method__

+ __K-means__: pre-specified number of clusters, __partitioning algorithm__ (non-hierarchical) that works on the whole data set at once

+ __Hierarchical clustering__: no predefined number of groups, tree-like visualization (cluster dendrogram)

---
## K-means clustering

.pull-left[

+ Within-cluster variation is as small as possible: "mean pairwise squared Euclidean distances per cluster"

+ Calculate centroids (_mean_ of the observations) and reassign groups iteratively

__James et al. 2013__
]

.pull-right[

]
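
The iterative procedure above is what `stats::kmeans()` implements. The sketch below is only an illustration: it uses the built-in `iris` measurements as a stand-in data set, and `centers = 3`, `nstart = 20` and the seed are arbitrary choices.

```r
set.seed(42)  # k-means starts from random centroids, so fix the seed

# Standardize the four numeric variables before computing Euclidean distances
x <- scale(iris[, 1:4])

# K = 3 clusters; nstart = 20 repeats the random start and keeps the best run
km <- kmeans(x, centers = 3, nstart = 20)

table(km$cluster)  # cluster sizes
km$centers         # centroids: mean of the observations in each cluster
km$tot.withinss    # total within-cluster sum of squares being minimized
```

Running the same call with different values of `centers` and comparing `tot.withinss` is a common way to explore how many clusters the data support.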

---
## Hierarchical clustering

+ Agglomerative: starts by joining the closest pair and then the next closest pair

+ How to join clusters? Different __linkage__ functions: "complete", "average", "single", "centroid", "ward"

+ Which distance measure? Euclidean, Bray-Curtis

+ The __cut level__ determines the number of groups

+ Only interpret the y axis! The horizontal arrangement is arbitrary

---

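```r
# install.packages("palmerpenguins")
library(palmerpenguins)
data(penguins)
dim(penguins)
# hclust() needs a dissimilarity object built from complete, numeric data
num_cols <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
peng <- na.omit(penguins[, num_cols])
peng_hc <- hclust(d = dist(scale(peng)), method = "average")
plot(peng_hc, labels = FALSE)
```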
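
To illustrate how the cut level changes the number of groups, the sketch below cuts a dendrogram with `cutree()`. It uses the built-in `iris` measurements as a stand-in so that it runs without extra packages; the average linkage and the cut heights are arbitrary choices.

```r
# Euclidean distances on standardized variables
x  <- scale(iris[, 1:4])
hc <- hclust(dist(x), method = "average")

# Ask for a fixed number of groups ...
groups_k3 <- cutree(hc, k = 3)
table(groups_k3)

# ... or cut at a given height on the y axis of the dendrogram
n_groups_low  <- length(unique(cutree(hc, h = 1)))  # low cut: many small groups
n_groups_high <- length(unique(cutree(hc, h = 3)))  # high cut: few large groups
c(low = n_groups_low, high = n_groups_high)
```

`plot(hc, labels = FALSE)` draws the corresponding dendrogram, and `abline(h = 3)` would mark the higher cut on it.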

---
## Ordination methods

+ __Ordination methods__: organize the data along axes that represent most of the variance

+ __Unconstrained ordination:__ extract the gradients ("gradient analysis") from the main data matrix

+ __Constrained ordination:__ a second matrix (e.g. environmental variables) is used to constrain the ordination axes

---

## Principal Components Analysis

+ __Dimension reduction__ technique: low-dimensional representations of a data set

+ Retains as much of the original variation as possible, enough for interpretation

+ Widely used

---
## Before PCA

+ All numeric (recoded if needed)

+ No missing data

+ Numeric data should be standardized (centered and scaled)
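
The three checks can be sketched in base R as follows. The small data frame `raw` and its column names are invented for illustration, and dropping incomplete rows with `na.omit()` is only one of several ways to handle missing data.

```r
# Hypothetical data with a missing value and a categorical column
raw <- data.frame(
  mass = c(3750, 3800, NA, 3450, 5000),
  bill = c(39.1, 39.5, 40.3, 36.7, 46.1),
  sex  = factor(c("male", "female", "female", "female", "male"))
)

# 1. No missing data: drop incomplete rows
complete <- na.omit(raw)

# 2. All numeric: recode the two-level factor as 0/1
complete$sex <- as.numeric(complete$sex == "male")

# 3. Standardize: center each column to mean 0 and scale to sd 1
x <- scale(complete)

round(colMeans(x), 10)  # all zero after centering
apply(x, 2, sd)         # all one after scaling
```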

---
##  Dimension reduction

+ PCA examines the __covariance__ among features and combines multiple features into __a smaller set of uncorrelated variables__: the principal components

+ The weights (loadings) of each PC show how much each original variable contributes to that component

+ Decreasing importance: first principal component explains the largest variance in the original dataset
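
These properties can be verified on a small example. The sketch below runs `prcomp()` on the built-in `iris` measurements (a stand-in data set, not the one used in the course) and checks that the components are uncorrelated and ordered by decreasing variance.

```r
# PCA on standardized variables
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Loadings: weight of each original variable in each principal component
pca$rotation

# The scores on different components are uncorrelated:
# their correlation matrix is the identity matrix
pc_cor <- cor(pca$x)
round(pc_cor, 10)

# Variance of each component, in decreasing order
pc_var <- pca$sdev^2
pc_var
```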

---

## Performing PCA in R

+ In base R: `prcomp()`

+ In ecology applications: `vegan::rda()`

+ In machine learning/statistical learning packages: PCA and ordination in general are forms of ___unsupervised learning___

---
## How many principal components?

+ Decreasing eigenvalues
    + With standardized variables, `sum(eigenvalues)` equals the number of variables
    + Eigenvalues range from more than 1 to almost zero: a PC with eigenvalue > 1 explains more variance than a single standardized variable

+ Proportion of variance explained

+ Scree plot
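
The criteria above can be computed directly from a `prcomp()` result. This is a sketch on the built-in `iris` measurements, used here only as a stand-in data set; the names `eig`, `prop_var` and `keep` are illustrative.

```r
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Eigenvalues are the variances of the principal components
eig <- pca$sdev^2
sum(eig)  # 4: the number of standardized variables

# Proportion and cumulative proportion of variance explained
prop_var <- eig / sum(eig)
cumsum(prop_var)

# Components with eigenvalue > 1
keep <- which(eig > 1)
keep
```

`screeplot(pca, type = "lines")` draws the scree plot of these eigenvalues, and `summary(pca)` reports the same proportions.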

---
## Other ordination techniques important in Ecology

+ __Correspondence Analysis CA__: based on chi-square distances (co-occurrences)

+ __Principal Coordinates Analysis PCoA__: uses any distance matrix

+ __Non-Metric Multidimensional Scaling NMDS__: based on a dissimilarity or distance matrix. Attempts to represent the pairwise dissimilarities between objects in a low-dimensional space, preserving their rank order

+ __Canonical Correspondence Analysis CCA__: constrained version of a CA, using a second matrix (e.g. environment)

+ __Redundancy Analysis RDA__: constrained version of a PCA, using a second matrix (e.g. environment)
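
As a small illustration of how PCoA relates to PCA, the sketch below runs a PCoA with base R's `cmdscale()` on Euclidean distances and compares it with `prcomp()`. With Euclidean distances the two ordinations coincide up to the sign of each axis; with another dissimilarity (e.g. Bray-Curtis) only the PCoA would apply. The `iris` data are a stand-in chosen so the example runs without extra packages.

```r
x <- scale(iris[, 1:4])

# PCoA (classical multidimensional scaling) on a Euclidean distance matrix
pcoa <- cmdscale(dist(x), k = 2, eig = TRUE)

# PCA on the same standardized data
pca <- prcomp(x)

# Axis signs are arbitrary, so compare absolute scores on the first two axes
max_diff <- max(abs(abs(pcoa$points) - abs(pca$x[, 1:2])))
max_diff  # numerically zero
```

In ecology the constrained methods and NMDS are usually run with the `vegan` package, which is not used here to keep the example self-contained.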
