class: center, middle, inverse, title-slide

.title[
# Biodiversity databases
]
.subtitle[
## Serrapilheira/ICTP-SAIFR Training Program in Quantitative Biology and Ecology
]
.author[
### Andrea Sánchez-Tapia & Sara Mortara
]
.date[
### 27 July 2022
]

---

<style type="text/css">
.tiny .remark-code { /*Change made here*/
  font-size: 50% !important;
}
</style>

## Today

1. Biodiversity databases: basic concepts

--

2. DarwinCore standard

--

3. Biodiversity data cleaning

--

4. A workflow in R

---
class: inverse, middle, center

# 1. Biodiversity databases: basic concepts

---

## Biodiversity data

### Museums, herbaria, collections

.pull-left[
<img src="./figs/mz_ictio.jpg" width="500" />
]

.pull-right[
<img src="./figs/herbarioRB.jpg" width="400" />
]

---

.pull-left[

## Who creates these data?

- Researchers: taxonomists, ecologists, field biologists
- Undergrad and grad school courses
- Amateur collectors and citizen scientists, with or without apps

]

--

.pull-right[

## Who uses them?

- Researchers
- Curators
- The wider community (e.g., bird watchers)

]

--

+ A __massive__ amount of data

--

+ New research areas from the analysis and synthesis of these data: biogeography, macroecology

--

+ __However__: heterogeneous quality, inequality between countries and institutions

---

<img src="figs/bi.png" width="1457" />

---

## Biodiversity databases

- [GBIF](https://www.gbif.org/) (Global Biodiversity Information Facility)

--

- Country-level biodiversity information systems:
  + 🇧🇷 [speciesLink](http://splink.cria.org.br/), Sistema de Informações sobre Biodiversidade Brasileira [SiBBr](https://www.sibbr.gov.br/)
  + 🇨🇴 Sistema de Información sobre Biodiversidad [SibColombia](https://biodiversidad.co/)
  + 🇲🇽 Sistema Nacional de Información sobre Biodiversidad [SNIB](https://snib.mx/)
  + 🇦🇷 Sistema de Información de Biodiversidad [SIB](https://sib.gob.ar/#!/)

--

- Countries that participate via ministries, agencies

--

- Other systems:
  - NatureServe / BISON in the USA
  - [iNaturalist](https://www.inaturalist.org/)

---
background-image: url("https://docs.ropensci.org/rfishbase/logo.svg")
background-size: 150px
background-position: 90% 10%

## By taxonomic group

+ [World Flora Online](http://www.worldfloraonline.org/)

--

+ [Plants of the world](http://plantsoftheworldonline.org/)

--

+ [International Plant Name Index (IPNI)](https://www.ipni.org/)

--

+ [Fishbase](https://www.fishbase.in/)

--

+ [Mammal Diversity](https://www.mammaldiversity.org/)

---

## <img src="./figs/Ropensci.png" width="239" />

> Transforming science through open data, software & reproducibility

--

- R packages for data access, data cleaning, and APIs

--

- __Package peer-review__

--

- Packages related to biodiversity data retrieval, manipulation, and standardization: `finch`, `rgbif`, `taxize`, `taxview` ...

---

## Looking at GBIF data

Let's download occurrence data for *Myrsine coriacea*, a South American tree species in the family Primulaceae.

```r
library(rgbif)
library(dplyr)

species <- "Myrsine coriacea"
# Query the GBIF occurrence API for this species
occs <- occ_search(scientificName = species)
# The result is a list; the occurrence records are in occs$data
names(occs)
#glimpse(occs)
#names(occs$data)
```

Column names returned from GBIF follow the DarwinCore standard (https://dwc.tdwg.org).
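A minimal sketch of how one might check which DarwinCore terms are present in a table of records. To keep it runnable offline, it uses a hypothetical toy data frame (`toy`, with made-up values) standing in for `occs$data`; the list of terms in `dwc_terms` is only an illustrative selection, not a required set.

```r
# Toy stand-in for occs$data; all values are made up for illustration
toy <- data.frame(
  scientificName   = c("Myrsine coriacea", "Myrsine coriacea"),
  decimalLatitude  = c(-22.45, NA),
  decimalLongitude = c(-42.98, NA),
  basisOfRecord    = c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION"),
  stringsAsFactors = FALSE
)

# DarwinCore terms we would like to find among the column names
dwc_terms <- c("scientificName", "decimalLatitude", "decimalLongitude",
               "eventDate", "recordedBy")

# Terms that are available as columns, and terms that are absent
present <- intersect(dwc_terms, names(toy))
missing <- setdiff(dwc_terms, names(toy))

present
missing
```

With real data, replacing `toy` with `occs$data` should give the same kind of overview before any cleaning starts.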
---

## Possible fields

+ Species (taxonomic information)

--

+ Locality, coordinates, geographic information

--

+ Collector notes and species attributes, IUCN status

--

+ Collectors, determiners, and authors' names

---
class: inverse, middle, center

# 2. DarwinCore standard

---

## [DarwinCore](https://dwc.tdwg.org/)

.pull-left[

- goal: __facilitate the sharing of information about biological diversity__ by providing identifiers, labels, and definitions

- maintained by a working group, adopted globally by GBIF and most collections, constantly evolving

- [_reference guide_](https://dwc.tdwg.org/terms/): `Taxon`, `Event`, `Identification`, `Location`

]

.pull-right[
<img src="./figs/darwin.jpeg" width="350" style="display: block; margin: auto;" />
]

<!-- ö more practical stuff-->

---

<img src="./figs/dwc_scheme.png" width="900" style="display: block; margin: auto;" />

---

## Multi-file [DwC scheme](https://dwc.tdwg.org/)

+ a ZIP file with:
  + different data tables (__relational databases__)
  + description file (__XML__)
  + metadata files (__EML__ [Ecological Metadata Language](https://eml.ecoinformatics.org/))

--

+ __common columns__ in each table (`ID`) to link them

---

<img src="./figs/IPT_GBIF.png" width="1556" />

+ The IPT (Integrated Publishing Toolkit) is used to publish your data, __either in one single table__ or __extended__: species lists, fieldwork sampling, etc. Datasets are given a __permanent identifier and citation__

---

## <img src="./figs/IPT.png" width="599" />

> Brazil Flora G (2020): Brazilian Flora 2020 project - Projeto Flora do Brasil 2020. v393.274. Instituto de Pesquisas Jardim Botanico do Rio de Janeiro. Dataset/Checklist. doi:10.15468/1mtkaw URL: http://ipt.jbrj.gov.br/jbrj/

- 49,343 species
- 136,314 taxa
- Information about: distribution, habitat, endemism, references, synonyms
- IPT in `R`: package __finch__ ([Chamberlain 2020](https://CRAN.R-project.org/package=finch))

---
class: inverse, middle, center

# 3. Data cleaning

---

## Primary data are not clean or up to date

- Names and classifications change over time (synonyms):
  - International Commission on Zoological Nomenclature ([ICZN](https://www.iczn.org/)), International Code of Botanical Nomenclature, [International Association for Plant Taxonomy](https://www.iapt-taxon.org/icbn/main.htm)
- Place names (__toponyms__): we need _gazetteers_
- Errors: taxonomic identification, georeferencing, typographic/orthographic
- Missing coordinates or localities
- Imprecise or wrongly attributed coordinates

---

## Spatial error

<img src="figs/GBIF_g002.png" width="800" style="display: block; margin: auto;" />

[Yesson et al. 2007](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2043490/)

---

## Zero latitude and longitude

<img src="figs/GBIF_g003.png" width="800" style="display: block; margin: auto;" />

[Yesson et al. 2007](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2043490/)

---

## Taxonomic, nomenclatural errors

.pull-left[

+ Original vs. accepted name

- Different taxonomic concepts / backbones, premises, measurement units.

- New errors arising when processing and joining datasets

- Duplicate data vs. _duplicatae_

]

.pull-right[
<img src="figs/paubrasil.png" width="450" style="display: block; margin: auto 0 auto auto;" />
]

---

## Data cleaning workflow

Check for

1. format
2. completeness
3. sense (dimensions, ranges)
4. outliers (geographic, temporal, environmental)

---

## Good practices:

+ Always flag 🚩 additions and corrections; __never modify__ original data

+ Document all steps -> reproducibility

+ Cite packages and data sources

---

## Know your shortfalls

> informed ignorance is a powerful research tool

<img src="./figs/Hortaletal2015.png" width="1409" style="display: block; margin: auto;" />

---

## References

Chapman, A. D. (2005). Principles and methods of data cleaning. GBIF.

Hortal, J., de Bello, F., Diniz-Filho, J. A. F., Lewinsohn, T. M., Lobo, J. M., & Ladle, R. J. (2015).
Seven shortfalls that beset large-scale knowledge of biodiversity. Annual Review of Ecology, Evolution, and Systematics, 46, 523-549.

Yesson, C., Brewer, P. W., Sutton, T., Caithness, N., Pahwa, J. S., Burgess, M., ... & Culham, A. (2007). How global is the Global Biodiversity Information Facility? PLoS ONE, 2(11), e1124.

Wieczorek, J., Bloom, D., Guralnick, R., Blum, S., Döring, M., Giovanni, R., ... & Vieglais, D. (2012). Darwin Core: an evolving community-developed biodiversity data standard. PLoS ONE, 7(1), e29715.

---
class: center, middle

# Thanks!

<center>

<svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;fill:#A70000;height:1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"></path></svg> [andreasancheztapia@gmail.com](mailto:andreasancheztapia@gmail.com) | [saramortara@gmail.com](mailto:saramortara@gmail.com)

<svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;fill:#A70000;height:1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292
20.791-32.161 39.308-52.628 54.253z"></path></svg> [@SanchezTapiaA](https://twitter.com/SanchezTapiaA) | [@MortaraSara](https://twitter.com/MortaraSara) <svg viewBox="0 0 496 512" style="position:relative;display:inline-block;top:.1em;fill:#A70000;height:1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg><svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;fill:#A70000;height:1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M105.2 24.9c-3.1-8.9-15.7-8.9-18.9 0L29.8 199.7h132c-.1 0-56.6-174.8-56.6-174.8zM.9 287.7c-2.6 8 .3 16.9 7.1 22l247.9 184-226.2-294zm160.8-88l94.3 294 94.3-294zm349.4 88l-28.8-88-226.3 294 247.9-184c6.9-5.1 9.7-14 
7.2-22zM425.7 24.9c-3.1-8.9-15.7-8.9-18.9 0l-56.6 174.8h132z"></path></svg> [andreasancheztapia](http://github.com/andreasancheztapia) | [saramortara](http://github.com/saramortara)