class: center, middle, inverse, title-slide

.title[
# Introduction to multivariate analysis in ecology
]
.subtitle[
## Serrapilheira/ICTP-SAIFR Training Program in Quantitative Biology and Ecology
]
.author[
### Andrea Sánchez-Tapia & Sara Mortara
]
.date[
### 4 August 2022
]

---

https://cran.r-project.org/web/views/Multivariate.html

## Multivariate analysis

+ The response variable is a matrix with objects described by variables

--

+ The predictors/variables can be species or variables of all kinds

--

+ Pairwise comparisons between objects, association between variables: distance/dissimilarity measures

---

## Dissimilarity metrics

+ Between objects (sites): presence/absence of species, __shared presences or absences__, e.g. Bray-Curtis, Jaccard

|          | presence | absence |
|---------:|---------:|--------:|
| presence |        a |       b |
| absence  |        c |       d |

- `vegan::vegdist()`

--

+ Raw Euclidean distance does not work well with abundance data: differences in a few very abundant species dominate the distance. Standardize!

--

+ Between species and variables: correlation, covariance, Euclidean distance

---

## Metric dissimilarity indices

+ Symmetry: `\(D_{A-B} = D_{B-A}\)`

--

+ If `\(A = B\)` then `\(D_{A-B} = 0\)`

--

+ Triangle inequality: `\(D_{A-B} + D_{B-C} \geq D_{A-C}\)`

---

## Clustering

+ Clustering methods: find groups (_clusters_) in data, with high similarity within groups and high dissimilarity between groups

--

+ What does it mean to be similar? __Dissimilarity measure__

--

+ How do we create the groups?
__Clustering method__

--

+ __K-means__: pre-specified number of clusters, __partitioning algorithm__ that starts from the whole group and splits it into k clusters

--

+ __Hierarchical clustering__: no predefined number of groups, tree-like visualization (cluster dendrogram)

---

## K-means clustering

.pull-left[

+ Within-cluster variation is as small as possible: "mean pairwise squared Euclidean distances per cluster"

+ Calculate centroids (_mean_ of the observations) and reassign groups iteratively

__James et al 2013__

]

.pull-right[

<img src="figs/kmeans.png" width="600" style="display: block; margin: auto;" />

]

---

## Hierarchical clustering

+ Agglomerative: starts by joining the closest pair and then the next closest pair

--

+ How to join clusters? Different __linkage__ functions: "complete", "average", "single", "centroid", "ward"

--

+ Which distance measure? Euclidean, Bray-Curtis

--

+ The __cut level__ determines the number of groups

--

+ Only interpret the y axis! The horizontal arrangement is arbitrary

---

```r
# install.packages("palmerpenguins")
library(palmerpenguins)
data(penguins)
dim(penguins)

# hclust() needs a dissimilarity object, not a data frame:
# keep the numeric measurements, drop rows with NA, and standardize
measurements <- scale(na.omit(penguins[, 3:6]))
hc <- hclust(d = dist(measurements))
plot(hc, labels = FALSE)
```

---

## Ordination methods

+ __Ordination methods__: organize the data along axes that represent most of the variance

--

+ __Unconstrained ordination:__ extract the gradients ("gradient analysis") from the main data matrix

--

+ __Constrained ordination:__ a second matrix is used to further adjust (constrain) the ordination

---

## Principal Components Analysis

+ __Dimension reduction__ technique: low-dimensional representations of a data set

--

+ Retaining as much of the original variation as possible.
Enough for interpretation

--

+ Widely used

---

## Before PCA

+ All numeric (recoded if needed)

--

+ No missing data

--

+ Numeric data should be standardized (centered and scaled)

---

## Dimension reduction

+ PCA examines the __covariance__ among features and combines multiple features into __a smaller set of uncorrelated variables__: the principal components

--

+ The weights (loadings) of each PC show the contribution of each original variable to that component

--

+ Decreasing importance: the first principal component explains the largest share of the variance in the original dataset

---

## Performing PCA in R

+ In base R: `prcomp()`

+ In ecology applications: `vegan::rda()`

+ In machine learning/statistical learning packages. PCA and ordination in general are forms of ___unsupervised learning___

---

## How many principal components?

+ Decreasing eigenvalues
  + With standardized variables, the sum of the eigenvalues (`sum(eigen)`) is equal to the number of variables
  + Eigenvalues vary from more than 1 to almost zero: components with eigenvalues > 1 explain more variance than a single original variable

--

+ Proportion of variance explained

--

+ Scree plot

---

## Other ordination techniques important in Ecology

+ __Correspondence Analysis CA__, based on chi-square distances (co-occurrences)

+ __Principal Coordinates Analysis PCoA__ uses any distance matrix

+ __Non-Metric Multidimensional Scaling NMDS__, based on a dissimilarity or distance matrix. It attempts to represent the pairwise dissimilarity between objects in a low-dimensional space

+ __Canonical Correspondence Analysis CCA__: constrained version of a CA, using a second matrix (e.g. environment)

+ __Redundancy Analysis RDA__: constrained version of a PCA, using a second matrix (e.g. environment)

---

##
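The PCA workflow described in the earlier slides (standardize, extract components, inspect eigenvalues, proportion of variance and scree plot) can be sketched in base R as follows. This is only a minimal illustration: the built-in `USArrests` data set is an assumption standing in for an ecological data matrix, and the object names (`pca`, `eigenvalues`, `prop_var`) are hypothetical.

```r
# Assumption: USArrests (built into R) stands in for a sites x variables matrix
data(USArrests)

# Center and scale so that every variable has mean 0 and variance 1
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components,
# returned in decreasing order
eigenvalues <- pca$sdev^2

# With standardized variables the eigenvalues sum to the number of variables
sum(eigenvalues)
ncol(USArrests)

# Proportion of variance explained by each component
prop_var <- eigenvalues / sum(eigenvalues)
round(prop_var, 3)

# Scree plot of the eigenvalues
screeplot(pca, type = "lines", main = "Scree plot")
```

Comparing `eigenvalues` against 1 and looking for the elbow in the scree plot are two common, informal ways to decide how many components to keep.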