Biodiversity databases: basic concepts
Darwin Core standard
Biodiversity data cleaning
A workflow in R
A massive amount of data
New research areas from the analysis and synthesis of these data: Biogeography, macroecology
However: heterogeneous quality, inequality between countries and institutions
GBIF (Global Biodiversity Information Facility)
Country-wise biodiversity information systems:
Countries that participate via ministries, agencies
Other systems:
Transforming science through open data, software & reproducibility
R packages to make data available, data cleaning, APIs
Package peer-review
Packages related to biodiversity data retrieval, manipulation, and standardization: finch, rgbif, taxize, taxview, ...
Let's download occurrence data for Myrsine coriacea, a South American tree species in the family Primulaceae.
library(rgbif)
library(dplyr)

species <- "Myrsine coriacea"
# occ_search() returns a list; the occurrence table is in occs$data
occs <- occ_search(scientificName = species)
names(occs)
# glimpse(occs)
# names(occs$data)
Column names returned by GBIF follow the Darwin Core standard (https://dwc.tdwg.org).
Species (taxonomic information)
Locality, coordinates (geographic information)
Collector notes and species attributes, IUCN status
Collectors, determiners, and authors' names
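The groups above correspond to Darwin Core terms, so columns can be picked by name. A minimal sketch, assuming a small hand-made data frame `occ_toy` standing in for `occs$data`: the values are invented for illustration, the column names are Darwin Core terms, and `dwc_groups` and `select_group()` are hypothetical helpers, not part of rgbif.

```r
# Toy stand-in for occs$data; values are invented, column names are Darwin Core terms
occ_toy <- data.frame(
  scientificName   = c("Myrsine coriacea", "Myrsine coriacea"),
  family           = c("Primulaceae", "Primulaceae"),
  country          = c("Brazil", "Brazil"),
  decimalLatitude  = c(-22.9, -25.4),
  decimalLongitude = c(-43.2, -49.3),
  recordedBy       = c("Collector A", "Collector B"),
  identifiedBy     = c("Determiner A", NA),
  stringsAsFactors = FALSE
)

# Darwin Core terms grouped by the kind of information they carry
dwc_groups <- list(
  taxonomic  = c("scientificName", "family"),
  geographic = c("country", "decimalLatitude", "decimalLongitude"),
  people     = c("recordedBy", "identifiedBy")
)

# Hypothetical helper: keep only the terms that are actually present in the table
select_group <- function(data, terms) {
  data[, intersect(terms, names(data)), drop = FALSE]
}

geo <- select_group(occ_toy, dwc_groups$geographic)
names(geo)
```

Using `intersect()` means a term missing from a given download is skipped instead of raising an error, which helps because not every dataset fills every Darwin Core field.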
Goal: facilitate the sharing of information about biological diversity by providing identifiers, labels, and definitions
Maintained by a working group, adopted globally by GBIF and most collections, and constantly evolving
Reference guide: Taxon, Event, Identification, Location
a ZIP file with:
common columns for each table, ID
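The shared ID column is what ties the tables together. A minimal sketch, assuming the archive holds one core table and one extension table; the tables `taxon` and `distribution` and all their values are hypothetical and only illustrate the join.

```r
# Hypothetical core table: one row per taxon
taxon <- data.frame(
  id             = c(1, 2),
  scientificName = c("Species a", "Species b"),
  stringsAsFactors = FALSE
)

# Hypothetical extension table: several rows may point to the same core record
distribution <- data.frame(
  id         = c(1, 1, 2),
  locationID = c("BR-RJ", "BR-SP", "BR-PR"),
  stringsAsFactors = FALSE
)

# The shared id column links each extension row back to its core record;
# all.x = TRUE keeps taxa that have no distribution rows
taxon_dist <- merge(taxon, distribution, by = "id", all.x = TRUE)
taxon_dist
```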
Brazil Flora G (2020): Brazilian Flora 2020 project - Projeto Flora do Brasil 2020. v393.274. Instituto de Pesquisas Jardim Botanico do Rio de Janeiro. Dataset/Checklist. doi:10.15468/1mtkaw URL: http://ipt.jbrj.gov.br/jbrj/
49,343 species
136,314 taxa
Information about: distribution, habitat, endemism, references, synonyms
IPT in R: package finch (Chamberlain 2020)
Names and classifications change over time (synonyms):
International Commission on Zoological Nomenclature (ICZN), International Code of Botanical Nomenclature, International Association for Plant Taxonomy
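Synonyms can be resolved against a lookup table of accepted names. A minimal sketch in base R, assuming an illustrative two-row table `synonyms`; a real table would come from a taxonomic backbone, and the column names `acceptedName` and `nameStatus` are hypothetical.

```r
# Illustrative lookup table; a real one would come from a taxonomic backbone
synonyms <- data.frame(
  name         = c("Rapanea ferruginea", "Myrsine coriacea"),
  acceptedName = c("Myrsine coriacea", "Myrsine coriacea"),
  stringsAsFactors = FALSE
)

records <- data.frame(
  scientificName = c("Rapanea ferruginea", "Myrsine coriacea", "Myrsine sp."),
  stringsAsFactors = FALSE
)

# match() returns NA for names missing from the lookup, so unresolved names stay visible
idx <- match(records$scientificName, synonyms$name)
records$acceptedName <- synonyms$acceptedName[idx]
records$nameStatus <- ifelse(
  is.na(idx), "unresolved",
  ifelse(records$scientificName == records$acceptedName, "accepted", "synonym")
)
records
```

The original `scientificName` column is kept, so the name as recorded is never lost.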
Place names (toponyms): we need gazetteers
Errors in taxonomic identification, georeferencing, and spelling (typographic/orthographic)
Missing coordinates or localities
Imprecise or wrongly assigned coordinates
Different taxonomic concepts/backbones, assumptions, and units of measurement
New errors introduced during processing and merging
Duplicate records vs. herbarium duplicates (duplicatae)
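Several of the problems listed above can be detected with simple logical checks. A minimal sketch in base R, assuming a toy table `occ` whose values are invented to trigger each check; the flag names are hypothetical.

```r
# Toy records; values are invented to trigger each check
occ <- data.frame(
  scientificName   = rep("Myrsine coriacea", 5),
  decimalLatitude  = c(-22.9, NA, 0, -95, -22.9),
  decimalLongitude = c(-43.2, -43.2, 0, -43.2, -43.2),
  stringsAsFactors = FALSE
)

lat <- occ$decimalLatitude
lon <- occ$decimalLongitude
has_coords <- !is.na(lat) & !is.na(lon)

flags <- data.frame(
  # TRUE when either coordinate is absent
  missing_coords = !has_coords,
  # TRUE when a coordinate falls outside the valid range
  out_of_range = has_coords & (abs(lat) > 90 | abs(lon) > 180),
  # TRUE for (0, 0), which is often a placeholder for unknown coordinates
  zero_coords = has_coords & lat == 0 & lon == 0,
  # TRUE for every repeat of an earlier row (the first occurrence is not flagged)
  duplicated_record = duplicated(occ)
)

# Number of records caught by each check
colSums(flags)
```

The checks only report problems. Deciding what to do with a flagged record is a separate step.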
Check for
Always flag 🚩 additions and corrections; never modify the original data
Document all steps -> reproducibility
Cite packages and data sources
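These principles can be sketched in a few lines of base R. The example assumes a toy table `raw` with an invented misspelling; the column names `scientificName_checked` and `flag_name_corrected` are hypothetical.

```r
# Toy records with one misspelled name (invented for illustration)
raw <- data.frame(
  scientificName = c("Myrsine coriacea", "Myrsine coreacea"),
  stringsAsFactors = FALSE
)

cleaned <- raw
# The correction goes in a new column; the original column is left untouched
cleaned$scientificName_checked <- sub(
  "coreacea", "coriacea", raw$scientificName, fixed = TRUE
)
# The flag marks which rows were changed, so every correction can be reviewed later
cleaned$flag_name_corrected <- cleaned$scientificName_checked != raw$scientificName

# citation() returns the reference to cite for a package
ref <- citation("stats")
```

Saving the output of `sessionInfo()` together with the results records the R and package versions used, which helps make the workflow reproducible.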
informed ignorance is a powerful research tool
Chapman, A. D. (2005). Principles and methods of data cleaning. GBIF.
Hortal, J., de Bello, F., Diniz-Filho, J. A. F., Lewinsohn, T. M., Lobo, J. M., & Ladle, R. J. (2015). Seven shortfalls that beset large-scale knowledge of biodiversity. Annual Review of Ecology, Evolution, and Systematics, 46, 523-549.
Yesson, C., Brewer, P. W., Sutton, T., Caithness, N., Pahwa, J. S., Burgess, M., ... & Culham, A. (2007). How global is the Global Biodiversity Information Facility? PLoS ONE, 2(11), e1124.
Wieczorek, J., Bloom, D., Guralnick, R., Blum, S., Döring, M., Giovanni, R., ... & Vieglais, D. (2012). Darwin Core: an evolving community-developed biodiversity data standard. PLoS ONE, 7(1), e29715.