Basic workflow for biodiversity data download and cleaning using R

Sara Mortara, Andrea Sánchez-Tapia
7/27/2022

Occurrence downloads

Loading packages

For this tutorial, you will need to install the R packages rgbif, Taxonstand, CoordinateCleaner and maps. The code below also uses dplyr, sf and tmap. If you don’t have them installed, use the following commands:

install.packages("rgbif")
install.packages("Taxonstand")
install.packages("CoordinateCleaner")
install.packages("maps")
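If some of these packages are already on your machine, there is no need to reinstall them. The following is a minimal sketch that lists only the missing ones; the helper name missing_packages is our own and not part of any package.

```r
# Packages needed for this tutorial
pkgs <- c("rgbif", "Taxonstand", "CoordinateCleaner", "maps")

# Hypothetical helper: return the packages that are not yet installed
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

to_install <- missing_packages(pkgs)
to_install

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```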

Then, we’ll start loading the packages.

Getting the data

First, let’s download the records of Myrsine coriacea (Sw.) R.Br., a Primulaceae species from South America.

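library(rgbif)
library(Taxonstand)        # TPL(), used in the taxonomy check
library(CoordinateCleaner) # clean_coordinates(), used in the coordinate check
library(maps)              # map(), used to draw country borders
library(dplyr)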
species <- "Myrsine coriacea"
occs <- occ_search(scientificName = species,
                   limit = 100000, 
                   basisOfRecord = "PRESERVED_SPECIMEN")
names(occs)
[1] "meta"      "hierarchy" "data"      "media"     "facets"   

The occurrences are saved in occs$data. Let’s create a new object from this table.

myrsine.data <- occs$data

In the raw data, we have 5068 records.

Column names returned from GBIF follow the Darwin Core standard (https://dwc.tdwg.org).

colnames(myrsine.data)
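With this many columns, it often helps to work with a small subset. The sketch below uses a toy data frame with invented values in place of occs$data, and keeps only the requested columns that actually exist, so that a field missing from a given download does not raise an error.

```r
# Toy stand-in for occs$data (all values are invented for illustration)
toy <- data.frame(
  gbifID = c("1", "2"),
  scientificName = c("Myrsine coriacea (Sw.) R.Br.",
                     "Rapanea ferruginea (Ruiz & Pav.) Mez"),
  decimalLatitude = c(-22.9, 4.7),
  decimalLongitude = c(-43.2, -74.0),
  country = c("Brazil", "Colombia"),
  recordedBy = c("A", "B"),
  stringsAsFactors = FALSE
)

# Columns we would like to keep; eventDate is absent from the toy data on purpose
wanted <- c("gbifID", "scientificName", "decimalLatitude",
            "decimalLongitude", "country", "eventDate")

# intersect() keeps the order of `wanted` and drops names that do not exist
keep <- intersect(wanted, colnames(toy))
toy.small <- toy[, keep]
colnames(toy.small)
```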

Exporting raw data

To document every step of the workflow, saving the raw data is essential. We will create a directory for the data and then export the table as a csv file (a text file with comma-separated values).

dir.create("data/raw/", recursive = TRUE)
write.csv(myrsine.data, 
          "data/raw/myrsine_data.csv", 
          row.names = FALSE)
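A quick way to confirm that an export worked is to read the file back and compare it with the original. This is a small sketch with a toy data frame written to a temporary directory; the file name and values are invented. Setting showWarnings = FALSE keeps dir.create() quiet when the script is rerun, and reading the ID column as character prevents it from being converted to a number.

```r
# Toy data with invented values, including one missing coordinate
toy <- data.frame(gbifID = c("101", "102"),
                  decimalLatitude = c(-22.9, NA),
                  stringsAsFactors = FALSE)

# Write to a temporary folder so the example leaves no files behind
out.dir <- file.path(tempdir(), "data", "raw")
dir.create(out.dir, recursive = TRUE, showWarnings = FALSE)
out.file <- file.path(out.dir, "toy_data.csv")
write.csv(toy, out.file, row.names = FALSE)

# Read the file back, keeping the ID as text
back <- read.csv(out.file, colClasses = c(gbifID = "character"))

# The round trip should preserve the number of rows and the IDs
nrow(back) == nrow(toy)
identical(back$gbifID, toy$gbifID)
```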

Checking species taxonomy

Let’s check the unique entries for the species name we just searched.

sort(unique(myrsine.data$scientificName))
 [1] "Ardisia coriacea Sw."                                           
 [2] "Ardisia coriacea var. berteriana A.DC."                         
 [3] "Caballeria ferruginea Ruiz & Pav."                              
 [4] "Myrsine berteroi A.DC."                                         
 [5] "Myrsine coriacea (Sw.) R.Br."                                   
 [6] "Myrsine coriacea Sieber ex A.DC."                               
 [7] "Myrsine coriacea subsp. coriacea"                               
 [8] "Myrsine coriacea subsp. nigrescens (Lundell) Ricketson & Pipoly"
 [9] "Myrsine coriacea subsp. reticulata (Steyerm.) Pipoly"           
[10] "Myrsine ferruginea (Ruiz & Pav.) A.DC., 1834"                   
[11] "Myrsine ferruginea (Ruiz & Pav.) Spreng."                       
[12] "Myrsine flocculosa Mart."                                       
[13] "Myrsine guatemalensis Gand."                                    
[14] "Myrsine jelskii Zahlbr."                                        
[15] "Myrsine laeta A.DC."                                            
[16] "Myrsine microcalyx Lundell"                                     
[17] "Myrsine myricoides Schltdl."                                    
[18] "Myrsine nigrescens Lundell"                                     
[19] "Myrsine paulensis A.DC."                                        
[20] "Myrsine rufescens A.DC."                                        
[21] "Myrsine salicifolia A.DC."                                      
[22] "Myrsine vestita Lundell"                                        
[23] "Myrsine viridis Rusby"                                          
[24] "Rapanea ambigua Mez"                                            
[25] "Rapanea coriacea (Sw.) Mez"                                     
[26] "Rapanea ferruginea (Ruiz & Pav.) Mez"                           
[27] "Rapanea jelskii (Zahlbr.) Mez"                                  
[28] "Rapanea lancifolia Mez"                                         
[29] "Rapanea mandonii Mez"                                           
[30] "Rapanea myricoides (Schltdl.) Lundell"                          
[31] "Rapanea nigrescens (Lundell) Lundell"                           
[32] "Rapanea paulensis Mez"                                          
[33] "Rapanea reticulata Steyerm."                                    
[34] "Rapanea rufa Lundell"                                           
[35] "Samara coriacea Sw."                                            
[36] "Samara myricoides Willd. ex Schult."                            
[37] "Samara saligna Willd. ex Schult."                               

In this particular case, we have a species with a long list of synonyms. The GBIF data already include a column with the current taxonomic status of each name:

table(myrsine.data$taxonomicStatus)

ACCEPTED  SYNONYM 
    2723     2345 

We can also check which of the names are accepted or not:

table(myrsine.data$scientificName, myrsine.data$taxonomicStatus)
                                                                 
                                                                  ACCEPTED
  Ardisia coriacea Sw.                                                   0
  Ardisia coriacea var. berteriana A.DC.                                 0
  Caballeria ferruginea Ruiz & Pav.                                      0
  Myrsine berteroi A.DC.                                                 0
  Myrsine coriacea (Sw.) R.Br.                                        1146
  Myrsine coriacea Sieber ex A.DC.                                       0
  Myrsine coriacea subsp. coriacea                                    1496
  Myrsine coriacea subsp. nigrescens (Lundell) Ricketson & Pipoly       76
  Myrsine coriacea subsp. reticulata (Steyerm.) Pipoly                   5
  Myrsine ferruginea (Ruiz & Pav.) A.DC., 1834                           0
  Myrsine ferruginea (Ruiz & Pav.) Spreng.                               0
  Myrsine flocculosa Mart.                                               0
  Myrsine guatemalensis Gand.                                            0
  Myrsine jelskii Zahlbr.                                                0
  Myrsine laeta A.DC.                                                    0
  Myrsine microcalyx Lundell                                             0
  Myrsine myricoides Schltdl.                                            0
  Myrsine nigrescens Lundell                                             0
  Myrsine paulensis A.DC.                                                0
  Myrsine rufescens A.DC.                                                0
  Myrsine salicifolia A.DC.                                              0
  Myrsine vestita Lundell                                                0
  Myrsine viridis Rusby                                                  0
  Rapanea ambigua Mez                                                    0
  Rapanea coriacea (Sw.) Mez                                             0
  Rapanea ferruginea (Ruiz & Pav.) Mez                                   0
  Rapanea jelskii (Zahlbr.) Mez                                          0
  Rapanea lancifolia Mez                                                 0
  Rapanea mandonii Mez                                                   0
  Rapanea myricoides (Schltdl.) Lundell                                  0
  Rapanea nigrescens (Lundell) Lundell                                   0
  Rapanea paulensis Mez                                                  0
  Rapanea reticulata Steyerm.                                            0
  Rapanea rufa Lundell                                                   0
  Samara coriacea Sw.                                                    0
  Samara myricoides Willd. ex Schult.                                    0
  Samara saligna Willd. ex Schult.                                       0
                                                                 
                                                                  SYNONYM
  Ardisia coriacea Sw.                                                 54
  Ardisia coriacea var. berteriana A.DC.                                2
  Caballeria ferruginea Ruiz & Pav.                                     2
  Myrsine berteroi A.DC.                                                4
  Myrsine coriacea (Sw.) R.Br.                                          0
  Myrsine coriacea Sieber ex A.DC.                                      4
  Myrsine coriacea subsp. coriacea                                      0
  Myrsine coriacea subsp. nigrescens (Lundell) Ricketson & Pipoly       0
  Myrsine coriacea subsp. reticulata (Steyerm.) Pipoly                  0
  Myrsine ferruginea (Ruiz & Pav.) A.DC., 1834                         27
  Myrsine ferruginea (Ruiz & Pav.) Spreng.                             66
  Myrsine flocculosa Mart.                                              5
  Myrsine guatemalensis Gand.                                           2
  Myrsine jelskii Zahlbr.                                               5
  Myrsine laeta A.DC.                                                   2
  Myrsine microcalyx Lundell                                            2
  Myrsine myricoides Schltdl.                                          42
  Myrsine nigrescens Lundell                                            1
  Myrsine paulensis A.DC.                                               7
  Myrsine rufescens A.DC.                                               4
  Myrsine salicifolia A.DC.                                             4
  Myrsine vestita Lundell                                               3
  Myrsine viridis Rusby                                                12
  Rapanea ambigua Mez                                                   3
  Rapanea coriacea (Sw.) Mez                                           60
  Rapanea ferruginea (Ruiz & Pav.) Mez                               1130
  Rapanea jelskii (Zahlbr.) Mez                                        22
  Rapanea lancifolia Mez                                               64
  Rapanea mandonii Mez                                                  4
  Rapanea myricoides (Schltdl.) Lundell                               786
  Rapanea nigrescens (Lundell) Lundell                                  1
  Rapanea paulensis Mez                                                 9
  Rapanea reticulata Steyerm.                                           2
  Rapanea rufa Lundell                                                  2
  Samara coriacea Sw.                                                  10
  Samara myricoides Willd. ex Schult.                                   2
  Samara saligna Willd. ex Schult.                                      2
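
When the cross-table is this long, it can be easier to look only at the synonyms, sorted by how often each one occurs. Here is a minimal sketch with a toy data frame; the record counts are invented and only the column names match the GBIF data.

```r
# Toy stand-in for myrsine.data (record counts are invented)
toy <- data.frame(
  scientificName = c("Myrsine coriacea (Sw.) R.Br.",
                     "Myrsine coriacea (Sw.) R.Br.",
                     "Rapanea ferruginea (Ruiz & Pav.) Mez",
                     "Rapanea ferruginea (Ruiz & Pav.) Mez",
                     "Rapanea ferruginea (Ruiz & Pav.) Mez",
                     "Samara coriacea Sw."),
  taxonomicStatus = c("ACCEPTED", "ACCEPTED",
                      "SYNONYM", "SYNONYM", "SYNONYM", "SYNONYM"),
  stringsAsFactors = FALSE
)

# Keep only the records identified under a synonym
syn <- toy[toy$taxonomicStatus == "SYNONYM", ]

# Count records per synonym, most frequent first
syn.counts <- sort(table(syn$scientificName), decreasing = TRUE)
syn.counts
```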

Let’s use the function TPL() from the Taxonstand package to check whether the taxonomic updates in the GBIF data are correct. This function takes a vector of species names and checks both spelling and nomenclature. The nomenclature check follows The Plant List.

We will first generate a vector of unique species names, check it, and then merge the result back into the data. This way each name is checked only once, which makes the workflow faster when working with several species.

species.names <- unique(myrsine.data$scientificName) 
# species.names is a character vector, so dim() returns NULL; use length() to count the names
dim(species.names)
NULL
tax.check <- TPL(species.names)

Let’s check the output:

tax.check
                                         Taxon   Genus Hybrid.marker
1                 Myrsine coriacea (Sw.) R.Br. Myrsine              
2             Myrsine coriacea subsp. coriacea Myrsine              
3 Myrsine ferruginea (Ruiz & Pav.) A.DC., 1834 Myrsine              
4     Myrsine ferruginea (Ruiz & Pav.) Spreng. Myrsine              
5             Myrsine coriacea Sieber ex A.DC. Myrsine              
6                   Rapanea coriacea (Sw.) Mez Rapanea              
7        Rapanea myricoides (Schltdl.) Lundell Rapanea              
     Species Abbrev Infraspecific.rank Infraspecific
1   coriacea   <NA>               <NA>              
2   coriacea   <NA>             subsp.      coriacea
3 ferruginea   <NA>               <NA>              
4 ferruginea   <NA>               <NA>              
5   coriacea   <NA>               <NA>              
6   coriacea   <NA>               <NA>              
7 myricoides   <NA>               <NA>              
                  Authority           ID Plant.Name.Index TPL.version
1               (Sw.) R.Br.  kew-2503287             TRUE         1.1
2                           tro-50096857             TRUE         1.1
3 (Ruiz & Pav.) A.DC., 1834 tro-22001652             TRUE         1.1
4     (Ruiz & Pav.) Spreng. tro-22001652             TRUE         1.1
5           Sieber ex A.DC.  kew-2503287             TRUE         1.1
6                 (Sw.) Mez  kew-2424988             TRUE         1.1
7        (Schltdl.) Lundell  kew-2418119             TRUE         1.1
  Taxonomic.status      Family New.Genus New.Hybrid.marker
1         Accepted Primulaceae   Myrsine                  
2          Synonym Primulaceae   Myrsine                  
3          Synonym Primulaceae   Myrsine                  
4          Synonym Primulaceae   Myrsine                  
5         Accepted Primulaceae   Myrsine                  
6          Synonym Primulaceae   Myrsine                  
7          Synonym Primulaceae   Myrsine                  
  New.Species New.Infraspecific.rank New.Infraspecific
1    coriacea                                         
2    coriacea                                         
3    coriacea                                         
4    coriacea                                         
5    coriacea                                         
6    coriacea                                         
7    coriacea                                         
                   New.Authority      New.ID New.Taxonomic.status
1 (Sw.) R.Br. ex Roem. & Schult. kew-2503287             Accepted
2 (Sw.) R.Br. ex Roem. & Schult. kew-2503287             Accepted
3 (Sw.) R.Br. ex Roem. & Schult. kew-2503287             Accepted
4 (Sw.) R.Br. ex Roem. & Schult. kew-2503287             Accepted
5 (Sw.) R.Br. ex Roem. & Schult. kew-2503287             Accepted
6 (Sw.) R.Br. ex Roem. & Schult. kew-2503287             Accepted
7 (Sw.) R.Br. ex Roem. & Schult. kew-2503287             Accepted
   Typo WFormat Higher.level       Date
1 FALSE   FALSE        FALSE 2022-07-26
2 FALSE   FALSE        FALSE 2022-07-26
3 FALSE   FALSE        FALSE 2022-07-26
4 FALSE   FALSE        FALSE 2022-07-26
5 FALSE   FALSE        FALSE 2022-07-26
6 FALSE   FALSE        FALSE 2022-07-26
7  TRUE   FALSE        FALSE 2022-07-26

Note that the function returns the input names together with several new variables, including columns such as New.Genus and New.Species with the accepted name. We should adopt these names when the column New.Taxonomic.status is “Accepted”.

We will paste the new genus and species together and then add them to the original data.
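
The rule of adopting a new name only when its status is “Accepted” can be written as a simple condition. The sketch below uses a toy data frame that mimics a few of the TPL() output columns; the rows are invented, and the column name.final is our own.

```r
# Toy stand-in for part of the TPL() output (rows are invented)
toy.check <- data.frame(
  Taxon = c("Rapanea coriacea (Sw.) Mez",
            "Myrsine coriacea (Sw.) R.Br.",
            "Genus unresolved"),
  New.Genus = c("Myrsine", "Myrsine", "Genus"),
  New.Species = c("coriacea", "coriacea", "unresolved"),
  New.Taxonomic.status = c("Accepted", "Accepted", "Unresolved"),
  stringsAsFactors = FALSE
)

# TRUE where the updated name can be adopted
accepted <- toy.check$New.Taxonomic.status == "Accepted"

# Build the final name only for accepted rows; others stay NA for manual review
toy.check$name.final <- ifelse(accepted,
                               paste(toy.check$New.Genus, toy.check$New.Species),
                               NA_character_)
toy.check[, c("Taxon", "New.Taxonomic.status", "name.final")]
```

Leaving the other rows as NA, rather than filling them automatically, makes the names that still need a decision easy to find.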

# creating new object w/ original and new names after TPL
new.tax <- data.frame(scientificName = species.names, 
                      genus.new.TPL = tax.check$New.Genus, 
                      species.new.TPL = tax.check$New.Species,
                      status.TPL = tax.check$Taxonomic.status,
                      scientificName.new.TPL = paste(tax.check$New.Genus,
                                                     tax.check$New.Species)) 
# now we are merging raw data and checked data
myrsine.new.tax <- merge(myrsine.data, new.tax, by = "scientificName")
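
It is worth checking that a merge like this one neither drops nor duplicates records. The following sketch, with invented toy tables, shows two simple checks. Note that merge() sorts the result by the key column, so the row order changes.

```r
# Toy occurrence table and toy lookup table (all values invented)
toy.data <- data.frame(gbifID = c("1", "2", "3", "4"),
                       scientificName = c("B name", "A name", "B name", "A name"),
                       stringsAsFactors = FALSE)
toy.lookup <- data.frame(scientificName = c("A name", "B name"),
                         scientificName.new = c("Accepted name", "Accepted name"),
                         stringsAsFactors = FALSE)

# Each original name must appear only once in the lookup table,
# otherwise the merge would duplicate records
stopifnot(anyDuplicated(toy.lookup$scientificName) == 0)

toy.merged <- merge(toy.data, toy.lookup, by = "scientificName")

# No record should be lost or duplicated
stopifnot(nrow(toy.merged) == nrow(toy.data))

# merge() sorted the rows by scientificName; restore the original order by ID
toy.merged <- toy.merged[order(toy.merged$gbifID), ]
toy.merged
```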

Exporting data after taxonomy check

To guarantee the documentation of all steps, we will export the data after the taxonomy check.

dir.create("data/processed/", recursive = TRUE)
write.csv(myrsine.new.tax, 
          "data/processed/data_taxonomy_check.csv", 
          row.names = FALSE)

Checking species’ coordinates

First, let’s visually inspect the coordinates in the raw data.

plot(decimalLatitude ~ decimalLongitude, data = myrsine.data, asp = 1)
map(add = TRUE)

Now we will use the function clean_coordinates() from the CoordinateCleaner package to clean the species records. This function checks for common errors in coordinates such as institutional coordinates, sea coordinates, outliers, zeros, centroids, etc. It does not accept missing values (NA), so we will first select only the records that have a numerical value for both latitude and longitude.

Note: at this point, having a unique ID code for each observation is essential. The raw data already provide an ID in the column gbifID.
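
To give an idea of what such tests look like, here are two of the simplest checks written in base R: coordinates outside the valid range, and records placed at exactly zero latitude and longitude. This is only an illustration with invented values, not how clean_coordinates() is implemented.

```r
# Toy records with invented coordinates
toy <- data.frame(
  gbifID = c("1", "2", "3", "4"),
  decimalLatitude  = c(-22.9, 0, 95.0, 4.7),
  decimalLongitude = c(-43.2, 0, -60.0, -74.0),
  stringsAsFactors = FALSE
)

# Latitude must lie within [-90, 90] and longitude within [-180, 180]
out.of.range <- abs(toy$decimalLatitude) > 90 | abs(toy$decimalLongitude) > 180

# A record at exactly (0, 0) is usually a data entry error
zero.zero <- toy$decimalLatitude == 0 & toy$decimalLongitude == 0

# TRUE means the record passed both checks
toy$plausible <- !(out.of.range | zero.zero)
toy
```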

myrsine.coord <- myrsine.data[!is.na(myrsine.data$decimalLatitude) 
                   & !is.na(myrsine.data$decimalLongitude),]
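
The same selection can be written with complete.cases(), which also makes it easy to count how many records are dropped. A small sketch with invented values:

```r
# Toy records; two of them lack one coordinate
toy <- data.frame(gbifID = c("1", "2", "3"),
                  decimalLatitude = c(-22.9, NA, 4.7),
                  decimalLongitude = c(-43.2, -50.1, NA),
                  stringsAsFactors = FALSE)

# TRUE for rows where both latitude and longitude are present
has.coords <- complete.cases(toy[, c("decimalLatitude", "decimalLongitude")])

toy.coord <- toy[has.coords, ]

# Number of records dropped for lack of coordinates
sum(!has.coords)
```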

Now that we don’t have NA in latitude or longitude, we can perform the coordinate cleaning. This new dataset has 2580 occurrences.

# output w/ only potentially correct coordinates
geo.clean <- clean_coordinates(x = myrsine.coord, 
                               lon = "decimalLongitude",
                               lat = "decimalLatitude",
                               species = "species", 
                               value = "clean")
OGR data source with driver: ESRI Shapefile 
Source: "/private/var/folders/14/9ljg2mcj1rdd1ht69s94t2240000gn/T/RtmpQtZjhq", layer: "ne_50m_land"
with 1420 features
It has 3 fields
Integer64 fields read as strings:  scalerank 
table(myrsine.coord$country)

                         Argentina                             Belize 
                                32                                  7 
  Bolivia (Plurinational State of)   Bonaire, Sint Eustatius and Saba 
                                 8                                  1 
                            Brazil                           Colombia 
                               345                                697 
                        Costa Rica                               Cuba 
                               280                                  5 
                Dominican Republic                            Ecuador 
                                25                                 23 
                       El Salvador                   French Polynesia 
                                55                                  1 
                        Guadeloupe                          Guatemala 
                                 3                                 16 
                            Guyana                              Haiti 
                                 2                                  4 
                          Honduras                            Jamaica 
                                13                                 40 
                            Mexico                          Nicaragua 
                               852                                 15 
                            Panama                           Paraguay 
                                31                                  6 
                              Peru                        Puerto Rico 
                                 8                                 21 
                           Uruguay Venezuela (Bolivarian Republic of) 
                                 7                                 83 
table(geo.clean$country)

                         Argentina                             Belize 
                                32                                  7 
  Bolivia (Plurinational State of)                             Brazil 
                                 8                                318 
                          Colombia                         Costa Rica 
                               695                                227 
                              Cuba                 Dominican Republic 
                                 5                                 25 
                           Ecuador                        El Salvador 
                                23                                 55 
                         Guatemala                             Guyana 
                                16                                  2 
                             Haiti                           Honduras 
                                 3                                 12 
                           Jamaica                             Mexico 
                                40                                846 
                         Nicaragua                             Panama 
                                15                                 28 
                          Paraguay                               Peru 
                                 5                                  8 
                       Puerto Rico                            Uruguay 
                                20                                  7 
Venezuela (Bolivarian Republic of) 
                                79 
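
Comparing the two tables by eye is tedious. In the output above, for example, Brazil drops from 345 to 318 records and Guadeloupe disappears. The sketch below shows one way to compute the number of records removed per country; it uses short invented vectors in place of the country columns.

```r
# Toy country columns before and after cleaning (invented values)
before <- c("Brazil", "Brazil", "Brazil", "Mexico", "Mexico", "Guadeloupe")
after  <- c("Brazil", "Brazil", "Mexico", "Mexico")

# Use the same set of levels for both tables so that countries
# with no remaining records are counted as zero instead of vanishing
countries <- sort(unique(before))
n.before <- table(factor(before, levels = countries))
n.after  <- table(factor(after, levels = countries))

# Records removed per country
removed <- setNames(as.integer(n.before) - as.integer(n.after), countries)
removed[removed > 0]
```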

Let’s plot the output of the clean data.

par(mfrow = c(1, 2))
plot(decimalLatitude ~ decimalLongitude, data = myrsine.data, asp = 1)
map(add = TRUE)
plot(decimalLatitude ~ decimalLongitude, data = geo.clean, asp = 1)
map(add = TRUE)
par(mfrow = c(1, 1))

When value = "clean" is set, the function returns only the potentially correct coordinates. For checking and reproducibility, we want to save the full output with the flags generated by the routine. Let’s try a different output.

myrsine.new.geo <- clean_coordinates(x = myrsine.coord, 
                                  lon = "decimalLongitude",
                                  lat = "decimalLatitude",
                                  species = "species", 
                                  value = "spatialvalid")
OGR data source with driver: ESRI Shapefile 
Source: "/private/var/folders/14/9ljg2mcj1rdd1ht69s94t2240000gn/T/RtmpQtZjhq", layer: "ne_50m_land"
with 1420 features
It has 3 fields
Integer64 fields read as strings:  scalerank 
table(myrsine.new.geo$.summary)

FALSE  TRUE 
  104  2476 
tail(names(myrsine.new.geo))
[1] ".cen"     ".sea"     ".otl"     ".gbf"     ".inst"    ".summary"
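
Each flag column is logical, with TRUE for records that passed the test and FALSE for flagged ones. To see which tests are responsible for the flagged records, we can count the FALSE values per column. The sketch below uses a toy data frame with three of the flag columns listed above; the values are invented, and the way .summary is rebuilt here is our reading of the output, not the package code.

```r
# Toy stand-in for the flag columns (values are invented)
toy.flags <- data.frame(
  gbifID = c("1", "2", "3", "4"),
  .cen = c(TRUE, TRUE, FALSE, TRUE),
  .sea = c(TRUE, FALSE, TRUE, TRUE),
  .otl = c(TRUE, TRUE, TRUE, TRUE),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

flag.cols <- c(".cen", ".sea", ".otl")
flags <- as.matrix(toy.flags[, flag.cols])

# Number of records flagged by each test
n.flagged <- colSums(!flags)
n.flagged

# A record is valid overall only if it passed every test
toy.flags$.summary <- rowSums(!flags) == 0
table(toy.flags$.summary)
```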

Then, we merge the raw data with the flagged data, so that the records without coordinates are kept as well.

# merging w/ original data
dim(myrsine.data)
[1] 5068  187
dim(myrsine.new.geo)
[1] 2580  197
myrsine.new.geo2 <- merge(myrsine.data, myrsine.new.geo, 
                       all.x = TRUE) 
dim(myrsine.new.geo2)
[1] 5068  197
full_join(myrsine.data, myrsine.new.geo)
# A tibble: 5,068 × 197
   key        scienti…¹ decim…² decim…³ issues datas…⁴ publi…⁵ insta…⁶
   <chr>      <chr>       <dbl>   <dbl> <chr>  <chr>   <chr>   <chr>  
 1 3764221527 Myrsine …  -16.3    -67.9 ""     50c950… 28eb1a… 997448…
 2 3759705682 Myrsine …  -27.7    -48.5 "cdro… 50c950… 28eb1a… 997448…
 3 3760043459 Myrsine …  -32.5    -54.1 "cdro… 50c950… 28eb1a… 997448…
 4 3760235429 Myrsine …    4.70   -74.0 "cdro… 50c950… 28eb1a… 997448…
 5 3760266204 Myrsine …  -29.4    -50.9 ""     50c950… 28eb1a… 997448…
 6 3764366114 Myrsine …  -27.6    -48.5 "cdro… 50c950… 28eb1a… 997448…
 7 3764702306 Myrsine …  -34.4    -54.4 "cdro… 50c950… 28eb1a… 997448…
 8 3773308761 Myrsine …    4.86   -74.3 ""     50c950… 28eb1a… 997448…
 9 3773663862 Myrsine …    4.67   -74.0 "cdro… 50c950… 28eb1a… 997448…
10 3044918628 Myrsine …  -32.9    -54.5 "cdro… 50c950… 28eb1a… 997448…
# … with 5,058 more rows, 189 more variables:
#   publishingCountry <chr>, protocol <chr>, lastCrawled <chr>,
#   lastParsed <chr>, crawlId <int>, hostingOrganizationKey <chr>,
#   basisOfRecord <chr>, occurrenceStatus <chr>, taxonKey <int>,
#   kingdomKey <int>, phylumKey <int>, classKey <int>,
#   orderKey <int>, familyKey <int>, genusKey <int>,
#   speciesKey <int>, acceptedTaxonKey <int>, …
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
plot(decimalLatitude ~ decimalLongitude, data = myrsine.new.geo2, asp = 1, 
     col = if_else(myrsine.new.geo2$.summary, "green", "red"))
map(add = TRUE)
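
After a merge with all.x = TRUE, the records that had no coordinates carry NA in .summary, because they never went through the coordinate check. The sketch below, with invented toy tables, turns this into an explicit three-level status. It merges on the ID column only, which is an assumption of this example; the code above merges on all shared columns.

```r
# Toy raw data: record "2" has no latitude (values invented)
toy.raw <- data.frame(gbifID = c("1", "2", "3"),
                      decimalLatitude = c(-22.9, NA, 4.7),
                      stringsAsFactors = FALSE)

# Toy result of the coordinate check for the records that had coordinates
toy.checked <- data.frame(gbifID = c("1", "3"),
                          .summary = c(TRUE, FALSE),
                          check.names = FALSE,
                          stringsAsFactors = FALSE)

# Keep every raw record; unmatched ones get NA in .summary
toy.all <- merge(toy.raw, toy.checked, by = "gbifID", all.x = TRUE)

# Make the three possible outcomes explicit
toy.all$status <- ifelse(is.na(toy.all$.summary), "no coordinates",
                         ifelse(toy.all$.summary, "clean", "flagged"))
table(toy.all$status)
```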

Exporting the data after coordinate check

write.csv(myrsine.new.geo2, 
          "data/processed/myrsine_coordinate_check.csv", 
          row.names = FALSE)

Saving the dataset as a shapefile

We can also save the dataset as a shapefile. To do so, we will first transform the data frame into an sf object.

library(tmap)
library(sf)
myrsine.final <- left_join(myrsine.coord, myrsine.new.geo2)
nrow(myrsine.final)
[1] 2580
myrsine_sf <- st_as_sf(myrsine.final, coords = c("decimalLongitude", "decimalLatitude"))
st_crs(myrsine_sf)
Coordinate Reference System: NA
myrsine_sf <- st_set_crs(myrsine_sf, 4326)
st_crs(myrsine_sf)
Coordinate Reference System:
  User input: EPSG:4326 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["unknown"],
        AREA["World"],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]
#dir.create("data/shapefiles", recursive = TRUE)
#st_write(myrsine_sf, dsn = "data/shapefiles/myrsine.shp")
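
Two variations may be useful here. First, the coordinate reference system can be set when the sf object is created, through the crs argument of st_as_sf(). Second, the shapefile format limits field names to 10 characters, so long Darwin Core names get abbreviated on export; the GeoPackage format does not have this limit. The sketch below uses a toy data frame with invented values and writes to a temporary file; choosing GeoPackage is our suggestion, not part of the original workflow.

```r
library(sf)

# Toy records with a long Darwin Core column name (values invented)
toy <- data.frame(gbifID = c("1", "2"),
                  decimalLongitude = c(-43.2, -74.0),
                  decimalLatitude = c(-22.9, 4.7),
                  coordinateUncertaintyInMeters = c(100, 5000),
                  stringsAsFactors = FALSE)

# Create the sf object and set WGS 84 (EPSG:4326) in one step
toy_sf <- st_as_sf(toy,
                   coords = c("decimalLongitude", "decimalLatitude"),
                   crs = 4326)
st_crs(toy_sf)$epsg

# Write to a GeoPackage in a temporary location and read it back
out.file <- tempfile(fileext = ".gpkg")
st_write(toy_sf, out.file, quiet = TRUE)
back <- st_read(out.file, quiet = TRUE)

# The long column name is preserved
names(back)
```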

Plot with tmap

tmap will use the sf object. We can add the shapefile to the map we created yesterday:

data(World)

SAm_map <- World %>% 
  filter(continent %in% c("South America", "North America")) %>% 
  tm_shape() +
  tm_borders()
 

SAm_map +
tm_shape(myrsine_sf) + 
  tm_bubbles(size = 0.2, 
             col = ".summary")

Interactive mode in tmap

The option tmap_mode("view") creates interactive maps. For this, the data frame must first be converted into an sf object, as we did above.

tmap_mode("view")
World %>% 
  filter(continent %in% c("South America", "North America")) %>% 
  tm_shape() +
  tm_borders() + 
  tm_shape(myrsine_sf) + 
  tm_bubbles(size = 0.2)