For this tutorial, you will need to install the R packages rgbif, Taxonstand, CoordinateCleaner, and maps. The code below also loads dplyr, sf, and tmap. If you don’t have the packages installed, use the following commands:
install.packages("rgbif")
install.packages("Taxonstand")
install.packages("CoordinateCleaner")
install.packages("maps")
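If some of the packages are already installed, reinstalling them is unnecessary. A minimal sketch of how to find out which ones are missing; the helper name missing_packages() is hypothetical and not part of any package:

```r
# Packages named in this tutorial
pkgs <- c("rgbif", "Taxonstand", "CoordinateCleaner", "maps")

# Hypothetical helper: returns the packages that cannot be loaded yet
missing_packages <- function(pkgs) {
  pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
}

to.install <- missing_packages(pkgs)
to.install

# Uncomment to install whatever is missing:
# if (length(to.install) > 0) install.packages(to.install)
```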
Then, we’ll start by loading the packages.
First, let’s download the occurrence data for Myrsine coriacea (Sw.) R.Br., a Primulaceae species from South America.
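library(rgbif)
library(Taxonstand)
library(CoordinateCleaner)
library(maps)
library(dplyr)
species <- "Myrsine coriacea"
# download preserved-specimen records for the species from GBIF
occs <- occ_search(scientificName = species,
                   limit = 100000,
                   basisOfRecord = "PRESERVED_SPECIMEN")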
names(occs)
[1] "meta" "hierarchy" "data" "media" "facets"
The occurrences are saved in occs$data. Let’s create a new object from this table:
myrsine.data <- occs$data
In the raw data, we have 5068 records.
Column names returned by GBIF follow the Darwin Core standard (https://dwc.tdwg.org).
colnames(myrsine.data)
To document every step, it is essential to save the raw data. We will create a directory for the data and then export it as a CSV file (comma-separated text).
dir.create("data/raw/", recursive = TRUE)
write.csv(myrsine.data,
"data/raw/myrsine_data.csv",
row.names = FALSE)
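After exporting, it can be worth reading the file back to confirm that nothing was lost. The sketch below uses a toy table and a temporary folder rather than the real data, so the file names and values are illustrative assumptions. It also passes showWarnings = FALSE to dir.create(), which avoids a warning when the folder already exists.

```r
# Toy table standing in for myrsine.data (illustrative values only)
toy <- data.frame(gbifID = c("1", "2", "3"),
                  decimalLatitude = c(-22.9, NA, 4.7),
                  stringsAsFactors = FALSE)

# Write to a temporary folder so the sketch leaves no files in the project
out.dir <- file.path(tempdir(), "data", "raw")
dir.create(out.dir, recursive = TRUE, showWarnings = FALSE)
out.file <- file.path(out.dir, "toy_data.csv")
write.csv(toy, out.file, row.names = FALSE)

# Read it back, keeping the IDs as text, and compare the dimensions
back <- read.csv(out.file, colClasses = c(gbifID = "character"))
identical(dim(back), dim(toy))
```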
Let’s check the unique entries for the species name we just searched.
sort(unique(myrsine.data$scientificName))
[1] "Ardisia coriacea Sw."
[2] "Ardisia coriacea var. berteriana A.DC."
[3] "Caballeria ferruginea Ruiz & Pav."
[4] "Myrsine berteroi A.DC."
[5] "Myrsine coriacea (Sw.) R.Br."
[6] "Myrsine coriacea Sieber ex A.DC."
[7] "Myrsine coriacea subsp. coriacea"
[8] "Myrsine coriacea subsp. nigrescens (Lundell) Ricketson & Pipoly"
[9] "Myrsine coriacea subsp. reticulata (Steyerm.) Pipoly"
[10] "Myrsine ferruginea (Ruiz & Pav.) A.DC., 1834"
[11] "Myrsine ferruginea (Ruiz & Pav.) Spreng."
[12] "Myrsine flocculosa Mart."
[13] "Myrsine guatemalensis Gand."
[14] "Myrsine jelskii Zahlbr."
[15] "Myrsine laeta A.DC."
[16] "Myrsine microcalyx Lundell"
[17] "Myrsine myricoides Schltdl."
[18] "Myrsine nigrescens Lundell"
[19] "Myrsine paulensis A.DC."
[20] "Myrsine rufescens A.DC."
[21] "Myrsine salicifolia A.DC."
[22] "Myrsine vestita Lundell"
[23] "Myrsine viridis Rusby"
[24] "Rapanea ambigua Mez"
[25] "Rapanea coriacea (Sw.) Mez"
[26] "Rapanea ferruginea (Ruiz & Pav.) Mez"
[27] "Rapanea jelskii (Zahlbr.) Mez"
[28] "Rapanea lancifolia Mez"
[29] "Rapanea mandonii Mez"
[30] "Rapanea myricoides (Schltdl.) Lundell"
[31] "Rapanea nigrescens (Lundell) Lundell"
[32] "Rapanea paulensis Mez"
[33] "Rapanea reticulata Steyerm."
[34] "Rapanea rufa Lundell"
[35] "Samara coriacea Sw."
[36] "Samara myricoides Willd. ex Schult."
[37] "Samara saligna Willd. ex Schult."
In this particular case, we have a species with a long history of synonyms. The GBIF data already include a column showing the currently accepted taxonomic status:
table(myrsine.data$taxonomicStatus)
ACCEPTED SYNONYM
2723 2345
We can also check which of the names are accepted or not:
table(myrsine.data$scientificName, myrsine.data$taxonomicStatus)
                                                                 ACCEPTED SYNONYM
Ardisia coriacea Sw.                                                    0      54
Ardisia coriacea var. berteriana A.DC.                                  0       2
Caballeria ferruginea Ruiz & Pav.                                       0       2
Myrsine berteroi A.DC.                                                  0       4
Myrsine coriacea (Sw.) R.Br.                                         1146       0
Myrsine coriacea Sieber ex A.DC.                                        0       4
Myrsine coriacea subsp. coriacea                                     1496       0
Myrsine coriacea subsp. nigrescens (Lundell) Ricketson & Pipoly        76       0
Myrsine coriacea subsp. reticulata (Steyerm.) Pipoly                    5       0
Myrsine ferruginea (Ruiz & Pav.) A.DC., 1834                            0      27
Myrsine ferruginea (Ruiz & Pav.) Spreng.                                0      66
Myrsine flocculosa Mart.                                                0       5
Myrsine guatemalensis Gand.                                             0       2
Myrsine jelskii Zahlbr.                                                 0       5
Myrsine laeta A.DC.                                                     0       2
Myrsine microcalyx Lundell                                              0       2
Myrsine myricoides Schltdl.                                             0      42
Myrsine nigrescens Lundell                                              0       1
Myrsine paulensis A.DC.                                                 0       7
Myrsine rufescens A.DC.                                                 0       4
Myrsine salicifolia A.DC.                                               0       4
Myrsine vestita Lundell                                                 0       3
Myrsine viridis Rusby                                                   0      12
Rapanea ambigua Mez                                                     0       3
Rapanea coriacea (Sw.) Mez                                              0      60
Rapanea ferruginea (Ruiz & Pav.) Mez                                    0    1130
Rapanea jelskii (Zahlbr.) Mez                                           0      22
Rapanea lancifolia Mez                                                  0      64
Rapanea mandonii Mez                                                    0       4
Rapanea myricoides (Schltdl.) Lundell                                   0     786
Rapanea nigrescens (Lundell) Lundell                                    0       1
Rapanea paulensis Mez                                                   0       9
Rapanea reticulata Steyerm.                                             0       2
Rapanea rufa Lundell                                                    0       2
Samara coriacea Sw.                                                     0      10
Samara myricoides Willd. ex Schult.                                     0       2
Samara saligna Willd. ex Schult.                                        0       2
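With this many names, it can help to list only the synonyms, ordered by how many records each one contributes. A minimal sketch on a toy table; the column names follow the GBIF data above, but the rows are made up for illustration:

```r
# Toy records standing in for the GBIF table (illustrative values only)
toy <- data.frame(
  scientificName = c("Myrsine coriacea (Sw.) R.Br.",
                     "Rapanea ferruginea (Ruiz & Pav.) Mez",
                     "Rapanea ferruginea (Ruiz & Pav.) Mez",
                     "Ardisia coriacea Sw."),
  taxonomicStatus = c("ACCEPTED", "SYNONYM", "SYNONYM", "SYNONYM"),
  stringsAsFactors = FALSE
)

# Keep only the synonym records, then count the records per name
syn <- toy[toy$taxonomicStatus == "SYNONYM", ]
syn.counts <- sort(table(syn$scientificName), decreasing = TRUE)
syn.counts
```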
Let’s use the function TPL() from the package Taxonstand to check whether the taxonomic updates in the GBIF data are correct. This function takes a vector of species names and performs both orthographic and nomenclatural checks. The nomenclatural check follows The Plant List.
We will first generate a vector of unique species names and later combine the results with the data. This is preferable because each name only needs to be checked once, which makes the workflow faster when working with several species.
species.names <- unique(myrsine.data$scientificName)
tax.check <- TPL(species.names)
Let’s check the output:
tax.check
Taxon Genus Hybrid.marker
1 Myrsine coriacea (Sw.) R.Br. Myrsine
2 Myrsine coriacea subsp. coriacea Myrsine
3 Myrsine ferruginea (Ruiz & Pav.) A.DC., 1834 Myrsine
4 Myrsine ferruginea (Ruiz & Pav.) Spreng. Myrsine
5 Myrsine coriacea Sieber ex A.DC. Myrsine
6 Rapanea coriacea (Sw.) Mez Rapanea
7 Rapanea myricoides (Schltdl.) Lundell Rapanea
Species Abbrev Infraspecific.rank Infraspecific
1 coriacea <NA> <NA>
2 coriacea <NA> subsp. coriacea
3 ferruginea <NA> <NA>
4 ferruginea <NA> <NA>
5 coriacea <NA> <NA>
6 coriacea <NA> <NA>
7 myricoides <NA> <NA>
Authority ID Plant.Name.Index TPL.version
1 (Sw.) R.Br. kew-2503287 TRUE 1.1
2 tro-50096857 TRUE 1.1
3 (Ruiz & Pav.) A.DC., 1834 tro-22001652 TRUE 1.1
4 (Ruiz & Pav.) Spreng. tro-22001652 TRUE 1.1
5 Sieber ex A.DC. kew-2503287 TRUE 1.1
6 (Sw.) Mez kew-2424988 TRUE 1.1
7 (Schltdl.) Lundell kew-2418119 TRUE 1.1
Taxonomic.status Family New.Genus New.Hybrid.marker
1 Accepted Primulaceae Myrsine
2 Synonym Primulaceae Myrsine
3 Synonym Primulaceae Myrsine
4 Synonym Primulaceae Myrsine
5 Accepted Primulaceae Myrsine
6 Synonym Primulaceae Myrsine
7 Synonym Primulaceae Myrsine
New.Species New.Infraspecific.rank New.Infraspecific
1 coriacea
2 coriacea
3 coriacea
4 coriacea
5 coriacea
6 coriacea
7 coriacea
New.Authority New.ID New.Taxonomic.status
1 (Sw.) R.Br. ex Roem. & Schult. kew-2503287 Accepted
2 (Sw.) R.Br. ex Roem. & Schult. kew-2503287 Accepted
3 (Sw.) R.Br. ex Roem. & Schult. kew-2503287 Accepted
4 (Sw.) R.Br. ex Roem. & Schult. kew-2503287 Accepted
5 (Sw.) R.Br. ex Roem. & Schult. kew-2503287 Accepted
6 (Sw.) R.Br. ex Roem. & Schult. kew-2503287 Accepted
7 (Sw.) R.Br. ex Roem. & Schult. kew-2503287 Accepted
Typo WFormat Higher.level Date
1 FALSE FALSE FALSE 2022-07-26
2 FALSE FALSE FALSE 2022-07-26
3 FALSE FALSE FALSE 2022-07-26
4 FALSE FALSE FALSE 2022-07-26
5 FALSE FALSE FALSE 2022-07-26
6 FALSE FALSE FALSE 2022-07-26
7 TRUE FALSE FALSE 2022-07-26
Note that the function adds several new variables to the input data and creates columns such as New.Genus and New.Species with the accepted name. We should adopt these names if the column New.Taxonomic.status is filled with “Accepted”.
We will merge the new genus and species and then add them to the original data.
# creating new object w/ original and new names after TPL
new.tax <- data.frame(scientificName = species.names,
                      genus.new.TPL = tax.check$New.Genus,
                      species.new.TPL = tax.check$New.Species,
                      status.TPL = tax.check$Taxonomic.status,
                      scientificName.new.TPL = paste(tax.check$New.Genus,
                                                     tax.check$New.Species))
# now we are merging raw data and checked data
myrsine.new.tax <- merge(myrsine.data, new.tax, by = "scientificName")
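By default, merge() keeps only the rows whose key appears in both tables, so it is worth confirming that no occurrence was dropped. The sketch below illustrates this check on toy tables (the values are made up); using all.x = TRUE is a suggestion here, not part of the workflow above.

```r
# Toy occurrence table (illustrative values only)
occ <- data.frame(gbifID = c("1", "2", "3"),
                  scientificName = c("Rapanea coriacea (Sw.) Mez",
                                     "Myrsine coriacea (Sw.) R.Br.",
                                     "Rapanea coriacea (Sw.) Mez"),
                  stringsAsFactors = FALSE)

# Toy lookup table with one row per unique name, as in new.tax
lookup <- data.frame(scientificName = unique(occ$scientificName),
                     scientificName.new.TPL = "Myrsine coriacea",
                     stringsAsFactors = FALSE)

# all.x = TRUE keeps an occurrence even if its name is missing from the lookup
occ.tax <- merge(occ, lookup, by = "scientificName", all.x = TRUE)

# Sanity checks: no rows lost or duplicated, and every row received a new name
nrow(occ.tax) == nrow(occ)
sum(is.na(occ.tax$scientificName.new.TPL))
```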
To guarantee the documentation of all steps, we will export the data after the taxonomy check.
dir.create("data/processed/", recursive = TRUE)
write.csv(myrsine.new.tax,
"data/processed/data_taxonomy_check.csv",
row.names = FALSE)
First, let’s visually inspect the coordinates in the raw data.
plot(decimalLatitude ~ decimalLongitude, data = myrsine.data, asp = 1)
# overlay country outlines from the maps package
maps::map("world", add = TRUE)
Now we will use the function clean_coordinates() from the CoordinateCleaner package to clean the species records. This function checks for common errors in coordinates, such as institutional coordinates, sea coordinates, outliers, zeros, and centroids. It does not accept missing values (NA), so we will first select only the records that have a numerical value for both latitude and longitude.
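Note: at this point, having a unique ID for each observation is essential. The raw data already provide an ID in the column gbifID.
# keep only records with both coordinates present
myrsine.coord <- myrsine.data[!is.na(myrsine.data$decimalLatitude) &
                              !is.na(myrsine.data$decimalLongitude), ]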
Now that we don’t have NA in latitude or longitude, we can perform the coordinate cleaning. This new dataset has 2580 occurrences.
# output w/ only potential correct coordinates
geo.clean <- clean_coordinates(x = myrsine.coord,
                               lon = "decimalLongitude",
                               lat = "decimalLatitude",
                               species = "species",
                               value = "clean")
OGR data source with driver: ESRI Shapefile
Source: "/private/var/folders/14/9ljg2mcj1rdd1ht69s94t2240000gn/T/RtmpQtZjhq", layer: "ne_50m_land"
with 1420 features
It has 3 fields
Integer64 fields read as strings: scalerank
table(myrsine.coord$country)
Argentina Belize
32 7
Bolivia (Plurinational State of) Bonaire, Sint Eustatius and Saba
8 1
Brazil Colombia
345 697
Costa Rica Cuba
280 5
Dominican Republic Ecuador
25 23
El Salvador French Polynesia
55 1
Guadeloupe Guatemala
3 16
Guyana Haiti
2 4
Honduras Jamaica
13 40
Mexico Nicaragua
852 15
Panama Paraguay
31 6
Peru Puerto Rico
8 21
Uruguay Venezuela (Bolivarian Republic of)
7 83
table(geo.clean$country)
Argentina Belize
32 7
Bolivia (Plurinational State of) Brazil
8 318
Colombia Costa Rica
695 227
Cuba Dominican Republic
5 25
Ecuador El Salvador
23 55
Guatemala Guyana
16 2
Haiti Honduras
3 12
Jamaica Mexico
40 846
Nicaragua Panama
15 28
Paraguay Peru
5 8
Puerto Rico Uruguay
20 7
Venezuela (Bolivarian Republic of)
79
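Comparing the two tables by eye is error-prone, especially because some countries (Bonaire, Sint Eustatius and Saba; French Polynesia; Guadeloupe) appear only in the first one. A minimal sketch of how the per-country difference could be computed, using toy vectors in place of the country columns:

```r
# Toy country columns before and after cleaning (illustrative values only)
before <- c("Brazil", "Brazil", "Brazil", "Cuba", "Haiti", "Haiti")
after  <- c("Brazil", "Brazil", "Cuba")

# Use the same factor levels so that countries left with zero records are kept
lev <- sort(unique(before))
n.before <- table(factor(before, levels = lev))
n.after  <- table(factor(after, levels = lev))

# Number of records removed per country, showing only the affected countries
removed <- n.before - n.after
removed[removed > 0]
```

With the real data, the two vectors would be myrsine.coord$country and geo.clean$country.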
Let’s plot the output of the clean data.
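One way to do this is to draw the raw and the cleaned coordinates side by side. The sketch below is an assumption about how such a plot could be made, not the original plotting code, and it uses toy coordinates in which a single point at 0, 0 is dropped:

```r
# Toy coordinates (illustrative values only); the last point sits at 0, 0
raw <- data.frame(decimalLongitude = c(-43.2, -74.0, -99.1, 0),
                  decimalLatitude  = c(-22.9, 4.7, 19.4, 0))
clean <- raw[!(raw$decimalLongitude == 0 & raw$decimalLatitude == 0), ]

# Two panels side by side; restore the graphics settings afterwards
old.par <- par(mfrow = c(1, 2))
plot(decimalLatitude ~ decimalLongitude, data = raw, asp = 1, main = "Raw")
plot(decimalLatitude ~ decimalLongitude, data = clean, asp = 1, main = "Clean")
par(old.par)
```

With the real data, raw and clean would be myrsine.coord and geo.clean, and country outlines can be added to each panel with the map() function from the maps package, as in the first plot.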
Setting value = "clean" returns only the potentially correct coordinates. For checking and reproducibility, we want to save the full output with the flags generated by the routine. Let’s try a different output.
myrsine.new.geo <- clean_coordinates(x = myrsine.coord,
                                     lon = "decimalLongitude",
                                     lat = "decimalLatitude",
                                     species = "species",
                                     value = "spatialvalid")
OGR data source with driver: ESRI Shapefile
Source: "/private/var/folders/14/9ljg2mcj1rdd1ht69s94t2240000gn/T/RtmpQtZjhq", layer: "ne_50m_land"
with 1420 features
It has 3 fields
Integer64 fields read as strings: scalerank
table(myrsine.new.geo$.summary)
FALSE TRUE
104 2476
The test results are stored in flag columns whose names start with a dot. The last six column names of the output are:
[1] ".cen" ".sea" ".otl" ".gbf" ".inst" ".summary"
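To see which tests are responsible for the flagged records, the FALSE values in each flag column can be counted. A minimal sketch on a toy table; the column names mimic those shown above, the values are made up, and TRUE is taken to mean that the record passed the test:

```r
# Toy flag columns mimicking the layout shown above (TRUE = passed the test)
flags <- data.frame(.cen = c(TRUE, TRUE, FALSE, TRUE),
                    .sea = c(TRUE, FALSE, TRUE, TRUE),
                    .otl = c(TRUE, TRUE, TRUE, TRUE))

# A record is valid overall only if it passed every test
flags$.summary <- rowSums(!as.matrix(flags)) == 0

# Number of records failing each test
n.failed <- colSums(!as.matrix(flags))
n.failed
```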
Then, we merge the raw data with the cleaned data.
# merging w/ original data
dim(myrsine.data)
[1] 5068 187
dim(myrsine.new.geo)
[1] 2580 197
myrsine.new.geo2 <- full_join(myrsine.data, myrsine.new.geo)
dim(myrsine.new.geo2)
[1] 5068 197
myrsine.new.geo2
# A tibble: 5,068 × 197
key scienti…¹ decim…² decim…³ issues datas…⁴ publi…⁵ insta…⁶
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 3764221527 Myrsine … -16.3 -67.9 "" 50c950… 28eb1a… 997448…
2 3759705682 Myrsine … -27.7 -48.5 "cdro… 50c950… 28eb1a… 997448…
3 3760043459 Myrsine … -32.5 -54.1 "cdro… 50c950… 28eb1a… 997448…
4 3760235429 Myrsine … 4.70 -74.0 "cdro… 50c950… 28eb1a… 997448…
5 3760266204 Myrsine … -29.4 -50.9 "" 50c950… 28eb1a… 997448…
6 3764366114 Myrsine … -27.6 -48.5 "cdro… 50c950… 28eb1a… 997448…
7 3764702306 Myrsine … -34.4 -54.4 "cdro… 50c950… 28eb1a… 997448…
8 3773308761 Myrsine … 4.86 -74.3 "" 50c950… 28eb1a… 997448…
9 3773663862 Myrsine … 4.67 -74.0 "cdro… 50c950… 28eb1a… 997448…
10 3044918628 Myrsine … -32.9 -54.5 "cdro… 50c950… 28eb1a… 997448…
# … with 5,058 more rows, 189 more variables:
# publishingCountry <chr>, protocol <chr>, lastCrawled <chr>,
# lastParsed <chr>, crawlId <int>, hostingOrganizationKey <chr>,
# basisOfRecord <chr>, occurrenceStatus <chr>, taxonKey <int>,
# kingdomKey <int>, phylumKey <int>, classKey <int>,
# orderKey <int>, familyKey <int>, genusKey <int>,
# speciesKey <int>, acceptedTaxonKey <int>, …
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
write.csv(myrsine.new.geo2,
"data/processed/myrsine_coordinate_check.csv",
row.names = FALSE)
We can also save the dataset as a shapefile. To do so, we will transform the data frame into an sf object.
library(tmap)
library(sf)
myrsine.final <- left_join(myrsine.coord, myrsine.new.geo2)
nrow(myrsine.final)
[1] 2580
myrsine_sf <- st_as_sf(myrsine.final, coords = c("decimalLongitude", "decimalLatitude"))
st_crs(myrsine_sf)
Coordinate Reference System: NA
myrsine_sf <- st_set_crs(myrsine_sf, 4326)
st_crs(myrsine_sf)
Coordinate Reference System:
User input: EPSG:4326
wkt:
GEOGCRS["WGS 84",
DATUM["World Geodetic System 1984",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic latitude (Lat)",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["geodetic longitude (Lon)",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
USAGE[
SCOPE["unknown"],
AREA["World"],
BBOX[-90,-180,90,180]],
ID["EPSG",4326]]
#dir.create("data/shapefiles", recursive = T)
#st_write(myrsine_sf, dsn = "data/shapefiles/myrsine.shp")
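The conversion and the CRS assignment can also be done in a single call, because st_as_sf() accepts a crs argument. A minimal sketch on toy points (made-up values), assuming the coordinates are in WGS 84 as in the steps above:

```r
library(sf)

# Toy points (illustrative values only)
pts <- data.frame(gbifID = c("1", "2"),
                  decimalLongitude = c(-43.2, -74.0),
                  decimalLatitude  = c(-22.9, 4.7))

# coords is given as c(x, y), i.e. longitude first; crs = 4326 sets WGS 84
pts_sf <- st_as_sf(pts,
                   coords = c("decimalLongitude", "decimalLatitude"),
                   crs = 4326)
st_crs(pts_sf)$epsg
```

Note that st_as_sf() stops with an error by default if any coordinate is missing, which is another reason to work from the table without NA coordinates.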
tmap
tmap will use the sf object. We can add the shapefile to the map we created yesterday:
data(World)
SAm_map <- World %>%
filter(continent %in% c("South America", "North America")) %>%
tm_shape() +
tm_borders()
SAm_map +
tm_shape(myrsine_sf) +
tm_bubbles(size = 0.2,
col = ".summary")
The option tmap_mode("view") creates interactive maps. For this, the data frame must be an sf object, which we already created above.
tmap_mode("view")
World %>%
filter(continent %in% c("South America", "North America")) %>%
tm_shape() +
tm_borders() +
tm_shape(myrsine_sf) +
tm_bubbles(size = 0.2)