Biodiversity databases

Serrapilheira/ICTP-SAIFR Training Program in Quantitative Biology and Ecology

Andrea Sánchez-Tapia & Sara Mortara

27 July 2022

1 / 27

today

  1. Biodiversity databases: basic concepts

  2. DarwinCore standard

  3. Biodiversity data cleaning

  4. A workflow in R

2 / 27

1. Biodiversity databases: basic concepts

3 / 27

Biodiversity data

Museums, herbaria, collections

4 / 27

Who creates these data?

  • Researchers: taxonomists, ecologists, field biologists
  • Undergrad and grad school courses
  • Amateur collectors, citizen science, using apps or not

Who uses them?

  • Researchers
  • Curators
  • The overall community (e.g., bird watchers)

  • The result: a massive amount of data

  • New research areas arising from the analysis and synthesis of these data: biogeography, macroecology

  • However: heterogeneous quality and inequality between countries and institutions

5 / 27

6 / 27

Biodiversity databases

  • GBIF (Global Biodiversity Information Facility)

  • Country-wise biodiversity information systems:

    • 🇧🇷 speciesLink, Sistema de Informações sobre Biodiversidade Brasileira SiBBr
    • 🇨🇴 Sistema de Información sobre Biodiversidad SibColombia
    • 🇲🇽 Sistema Nacional de Información sobre Biodiversidad SNIB
    • 🇦🇷 Sistema de Información de Biodiversidad SIB
  • Countries that participate via ministries, agencies

  • Other systems:

7 / 27

By taxonomic group

8 / 27

Transforming science through open data, software & reproducibility

  • R packages to make data available, data cleaning, APIs

  • Package peer-review

  • Packages related to biodiversity data retrieval, manipulation, and standardization: finch, rgbif, taxize, taxview ...

9 / 27

Looking at GBIF data

Let's download occurrence data for Myrsine coriacea, a South American tree species in the family Primulaceae.

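library(rgbif)
library(dplyr)
species <- "Myrsine coriacea"
# Search GBIF occurrence records by scientific name (requires internet access;
# occ_search() returns up to 500 records by default)
occs <- occ_search(scientificName = species)
# The result is a list; the occurrence table itself is in occs$data
names(occs)
#glimpse(occs)
#names(occs$data)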

Column names returned by GBIF follow the DarwinCore standard (https://dwc.tdwg.org).

10 / 27

Possible fields

  • Species (taxonomic information)

  • Locality, coordinates, geographic information

  • Collector notes and species attributes, IUCN status

  • Collectors, determiners, and authors' names

11 / 27

2. DarwinCore standard

12 / 27

DarwinCore

  • goal: facilitate the sharing of information about biological diversity by providing identifiers, labels, and definitions

  • maintained by a working group, adopted globally by GBIF and most collections, and constantly evolving

  • reference guide: Taxon, Event, Identification, Location
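
To make the classes in the reference guide concrete, here is a minimal sketch in base R of a single occurrence record whose columns are DarwinCore terms. The column names are real DwC terms, but the record values, the collector and the determiner are made up for illustration. Note that recordedBy belongs to the Occurrence class, which is not among the four listed above.

```r
# A single made-up occurrence record whose columns are Darwin Core terms.
# The values are illustrative only, not real data.
occ <- data.frame(
  scientificName   = "Myrsine coriacea",  # Taxon
  family           = "Primulaceae",       # Taxon
  eventDate        = "2022-07-27",        # Event
  recordedBy       = "A. Collector",      # Occurrence (hypothetical collector)
  identifiedBy     = "A. Determiner",     # Identification (hypothetical determiner)
  country          = "Brazil",            # Location
  decimalLatitude  = -22.95,              # Location
  decimalLongitude = -43.21,              # Location
  stringsAsFactors = FALSE
)

# Map each column name to the DwC class it belongs to
dwc_class <- c(
  scientificName = "Taxon", family = "Taxon",
  eventDate = "Event", recordedBy = "Occurrence",
  identifiedBy = "Identification", country = "Location",
  decimalLatitude = "Location", decimalLongitude = "Location"
)

# List the terms used in this record, grouped by class
terms_by_class <- split(names(dwc_class), dwc_class)
terms_by_class
```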

13 / 27

14 / 27

Multi-file DwC scheme

  • a ZIP file with:

  • common columns for each table, ID
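
The role of the shared ID column can be sketched with two small tables in base R. This is a toy under stated assumptions: the table names, IDs, species names and localities are all hypothetical, and a real archive also carries metadata files (for example meta.xml) that describe how the tables relate.

```r
# Core table: one row per taxon (hypothetical names and IDs)
taxon <- data.frame(
  id             = c("t1", "t2"),
  scientificName = c("Aus bus", "Aus cus"),
  stringsAsFactors = FALSE
)

# Extension table: zero or more rows per taxon, linked by the same id
distribution <- data.frame(
  id       = c("t1", "t1", "t2"),
  locality = c("Region A", "Region B", "Region A"),
  stringsAsFactors = FALSE
)

# Join the extension to the core through the shared id column;
# all.x = TRUE keeps taxa that have no rows in the extension
taxon_distribution <- merge(taxon, distribution, by = "id", all.x = TRUE)
taxon_distribution
```

Keeping the extension in its own table avoids repeating the taxon information for every locality until the join is actually needed.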

15 / 27

  • The IPT (Integrated Publishing Toolkit) is used to publish your data, either as a single table or as an extended, multi-table dataset: species lists, fieldwork sampling events, etc. Each dataset is given a permanent identifier and a citation
16 / 27

Brazil Flora G (2020): Brazilian Flora 2020 project - Projeto Flora do Brasil 2020. v393.274. Instituto de Pesquisas Jardim Botanico do Rio de Janeiro. Dataset/Checklist. doi:10.15468/1mtkaw URL: http://ipt.jbrj.gov.br/jbrj/

  • 49,343 species

  • 136,314 taxa

  • Information about: distribution, habitat, endemism, references, synonyms

  • IPT in R: package finch (Chamberlain 2020)

17 / 27

3. Data cleaning

18 / 27

Primary data are neither clean nor up to date

  • Names and classifications change over time (synonyms)

  • Errors in taxonomic identification, georeferencing, and typing/spelling

  • Missing coordinates or localities

  • Imprecise or wrongly assigned coordinates

19 / 27

Spatial error

Yesson et al 2007

20 / 27

Zero latitude and longitude

Yesson et al 2007
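
Records sitting at exactly 0° latitude and 0° longitude often come from missing coordinates stored as zeros rather than from a real locality. A minimal sketch of how such records could be flagged in base R, assuming a table with the DwC columns decimalLatitude and decimalLongitude (the values below are made up):

```r
# Toy occurrence table with made-up coordinates (DwC column names)
occ <- data.frame(
  decimalLatitude  = c(-22.95, 0, 0, NA),
  decimalLongitude = c(-43.21, 0, -50.3, -47.9)
)

# Flag records where both coordinates are exactly zero.
# %in% returns FALSE for NA, so missing values are not flagged here.
occ$flag_zero_coord <- occ$decimalLatitude %in% 0 & occ$decimalLongitude %in% 0

occ
```

A record with only one zero coordinate (on the equator or on the prime meridian) can be legitimate, so the sketch flags only the case where both are zero.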

21 / 27

Taxonomic and nomenclatural errors

  • Original vs. accepted name
  • Different taxonomic concepts / backbones, assumptions, and measurement units

  • New errors arising during processing and merging (joins)

  • Duplicate data vs. duplicates (specimens from the same collection held in different herbaria)
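
The points above can be sketched in base R: keep the original name, add the accepted name in a new column, and flag repeated records instead of dropping them. Everything here is an assumption for illustration: the species names are placeholders, the synonym lookup is hypothetical (in practice it would come from a taxonomic backbone), and treating equal catalogNumber plus scientificName as a duplicate is just one possible rule.

```r
# Toy records; all names and catalog numbers are hypothetical placeholders
occ <- data.frame(
  catalogNumber  = c("H1", "H2", "H2", "H3"),
  scientificName = c("Aus bus", "Aus oldus", "Aus oldus", "Aus cus"),
  stringsAsFactors = FALSE
)

# Hypothetical synonym lookup: original name -> accepted name
synonyms <- c("Aus oldus" = "Aus bus")

# Keep the original name and add the accepted name in a new column
is_synonym <- occ$scientificName %in% names(synonyms)
occ$acceptedNameUsage <- ifelse(
  is_synonym,
  unname(synonyms[occ$scientificName]),
  occ$scientificName
)
occ$flag_synonym <- is_synonym

# Flag repeated rows instead of removing them
occ$flag_duplicate <- duplicated(occ[, c("catalogNumber", "scientificName")])

occ
```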

22 / 27

Data cleaning workflow

Check for

  1. format
  2. completeness
  3. sense (dimensions, ranges)
  4. outliers (geographic, temporal, environmental)
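
The four checks can be sketched as one flag column per check in base R. This is a minimal illustration, not a complete cleaning pipeline: the records are made up, only a temporal outlier rule is shown, and the year 1900 cutoff is an arbitrary assumption.

```r
# Toy occurrence table with made-up values (DwC column names),
# read as text, as it often arrives from a raw download
occ <- data.frame(
  decimalLatitude  = c("-22.9", "-23.1", "abc", NA, "-95"),
  decimalLongitude = c("-43.2", "-43.5", "-43.0", "-44.1", "-43.3"),
  eventDate        = c("2019-03-02", "1850-01-01", "2020-05-10",
                       "2021-11-30", "2018-08-15"),
  stringsAsFactors = FALSE
)

# 1. Format: a value is present but does not parse as a number
lat <- suppressWarnings(as.numeric(occ$decimalLatitude))
lon <- suppressWarnings(as.numeric(occ$decimalLongitude))
occ$flag_format <- (!is.na(occ$decimalLatitude) & is.na(lat)) |
  (!is.na(occ$decimalLongitude) & is.na(lon))

# 2. Completeness: a coordinate is missing
occ$flag_missing <- is.na(occ$decimalLatitude) | is.na(occ$decimalLongitude)

# 3. Sense: coordinates outside the valid ranges
occ$flag_range <- !is.na(lat) & !is.na(lon) & (abs(lat) > 90 | abs(lon) > 180)

# 4. Outliers (temporal): records older than an assumed cutoff year
year <- as.integer(substr(occ$eventDate, 1, 4))
occ$flag_old <- !is.na(year) & year < 1900  # 1900 is an arbitrary threshold

# The original columns are untouched; only flag columns were added
occ
```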
23 / 27

Good practices:

  • Always flag 🚩 additions and corrections; never modify the original data

  • Document all steps -> reproducibility

  • Cite packages and data sources

24 / 27

Know your shortfalls

informed ignorance is a powerful research tool

25 / 27

References

Chapman, A. D. (2005). Principles and methods of data cleaning. GBIF.

Hortal, J., de Bello, F., Diniz-Filho, J. A. F., Lewinsohn, T. M., Lobo, J. M., & Ladle, R. J. (2015). Seven shortfalls that beset large-scale knowledge of biodiversity. Annual Review of Ecology, Evolution, and Systematics, 46, 523-549.

Yesson, C., Brewer, P. W., Sutton, T., Caithness, N., Pahwa, J. S., Burgess, M., ... & Culham, A. (2007). How global is the Global Biodiversity Information Facility? PLoS ONE, 2(11), e1124.

Wieczorek, J., Bloom, D., Guralnick, R., Blum, S., Döring, M., Giovanni, R., ... & Vieglais, D. (2012). Darwin Core: an evolving community-developed biodiversity data standard. PLoS ONE, 7(1), e29715.

26 / 27
