+ - 0:00:00
Notes for current slide
Notes for next slide

Data manipulation

Sara Mortara & Andrea Sánchez-Tapia

re.green | ¡liibre!

2022-07-12

1 / 48

today

  • data

  • relational data bases

  • manipulating data in R

2 / 48

data

3 / 48

data and data structure

  • knowledge comes from synthesizing multiples perspectives
4 / 48

data and data structure

  • knowledge comes from synthesizing multiples perspectives

  • every data have a context

4 / 48

data and data structure

  • knowledge comes from synthesizing multiples perspectives

  • every data have a context

  • data cleanning

    • insistence on tidiness, cleanliness, and order
    • idea of data scientists as wizards or janitors
4 / 48

data and data structure

  • knowledge comes from synthesizing multiples perspectives

  • every data have a context

  • data cleanning

    • insistence on tidiness, cleanliness, and order
    • idea of data scientists as wizards or janitors
  • ideas that data should always be clean and controlled have tainted historical roots
4 / 48

data and data structure

  • knowledge comes from synthesizing multiples perspectives

  • every data have a context

  • data cleanning

    • insistence on tidiness, cleanliness, and order
    • idea of data scientists as wizards or janitors
  • ideas that data should always be clean and controlled have tainted historical roots
4 / 48

organizing data

5 / 48

organizing data

  • manipulating data in rows and columns
  • data tidying: structuring datasets to facilitate analysis

6 / 48

organizing data: tidy

  • each column a variable

  • each row an observation

7 / 48

relational data bases

8 / 48

relational data bases

  • different data are organized in different tables
9 / 48

relational data bases

  • different data are organized in different tables

  • tables are integrated

9 / 48

relational data bases

  • different data are organized in different tables

  • tables are integrated

  • common identifier for each table

9 / 48

relational data bases

  • different data are organized in different tables

  • tables are integrated

  • common identifier for each table

  • in general organized in SQL (Structured Query Language)

9 / 48

dealing with relational data

  • tipically data analysis in different tables
10 / 48

dealing with relational data

  • tipically data analysis in different tables

  • relations between tables are pairwised

10 / 48

dealing with relational data

  • tipically data analysis in different tables

  • relations between tables are pairwised

  • relations of three or more tables are always properties of pairwise relationships

10 / 48

dealing with relational data

  • tipically data analysis in different tables

  • relations between tables are pairwised

  • relations of three or more tables are always properties of pairwise relationships

  • verbs to work with pairs of tables

10 / 48

types of verbs

  • mutating
11 / 48

types of verbs

  • mutating

  • filtering

11 / 48

types of verbs

  • mutating

  • filtering

  • set operations

11 / 48

types of verbs

  • mutating

  • filtering

  • set operations

  • data manipulation can be done using packages base and dplyr

11 / 48

dplyr package

based on relational data bases logic

  • easier than SQL because it is focused on data analysis
12 / 48

dplyr package

based on relational data bases logic

  • easier than SQL because it is focused on data analysis

  • grammar for data manipulation

12 / 48

dplyr package

based on relational data bases logic

  • easier than SQL because it is focused on data analysis

  • grammar for data manipulation

  • mutate()

12 / 48

dplyr package

based on relational data bases logic

  • easier than SQL because it is focused on data analysis

  • grammar for data manipulation

  • mutate()

  • filter()

12 / 48

dplyr package

based on relational data bases logic

  • easier than SQL because it is focused on data analysis

  • grammar for data manipulation

  • mutate()

  • filter()

12 / 48

data structure

13 / 48

keys

14 / 48

keys

  • a key indicates an unique observation
15 / 48

keys

  • a key indicates an unique observation
  • primary key --> identifies an observation in its own table
15 / 48

keys

  • a key indicates an unique observation
  • primary key --> identifies an observation in its own table

  • foreign key --> identifies an observation in another table

15 / 48

keys

  • a key indicates an unique observation
  • primary key --> identifies an observation in its own table

  • foreign key --> identifies an observation in another table

  • a variable can be a primary key in one table and a foreign key in another

15 / 48

keys

  • a key indicates an unique observation
  • primary key --> identifies an observation in its own table

  • foreign key --> identifies an observation in another table

  • a variable can be a primary key in one table and a foreign key in another

  • every primary key must contain unique information!

15 / 48

keys

  • a key indicates an unique observation
  • primary key --> identifies an observation in its own table

  • foreign key --> identifies an observation in another table

  • a variable can be a primary key in one table and a foreign key in another

  • every primary key must contain unique information!

  • every table must have a primary key

15 / 48

relation

is given by primary key and its respective foreign key

  • 1-to-many
  • 1-to-1
  • we use the keys to combine tables
  • we use merge() (base) or join() (dplyr)

16 / 48

using relationships to join data

17 / 48

equivalences between dplyr and base

18 / 48

manipulating data in R

19 / 48

CESTES data base

Jeliazkov et al 2020 Sci Data

20 / 48

open data and code

21 / 48

open data and code

22 / 48

open data and code

23 / 48

todo

  • create and run 03_data_manipulation.R

  • generate outputs

  • git add, commit, and push

24 / 48

a workflow in R

loading needed packages

library("reshape2")
library("tidyverse")
25 / 48

exploring the data

species vs. sites

files_path <- list.files("data/raw/cestes/",
pattern = ".csv",
full.names = TRUE)
files_path
## [1] "data/raw/cestes//comm.csv" "data/raw/cestes//coord.csv"
## [3] "data/raw/cestes//envir.csv" "data/raw/cestes//splist.csv"
## [5] "data/raw/cestes//traits.csv"
files_path[3]
## [1] "data/raw/cestes//envir.csv"
26 / 48

reading the data in R

creating data.frame objects

comm <- read.csv(files_path[1])
coord <- read.csv(files_path[2])
envir <- read.csv(files_path[3])
splist <- read.csv(files_path[4])
traits <- read.csv(files_path[5])
27 / 48

inspecting the community data

comm[1:6, 1:10]
## Sites sp1 sp2 sp3 sp4 sp5 sp6 sp7 sp8 sp9
## 1 1 0 0 0 0 0 0 0 1 0
## 2 2 0 1 0 0 0 0 0 0 0
## 3 3 0 1 0 0 0 0 0 0 2
## 4 4 0 0 0 0 0 0 0 0 0
## 5 5 0 0 0 0 0 0 0 0 0
## 6 6 0 0 0 0 0 0 0 0 0
28 / 48

inspecting traits data

traits[1:9, 1:7]
## Sp Anemo Auto Entomo Annual Biennial Perennial
## 1 sp1 0 0 1 0 0 1
## 2 sp2 0 0 1 0 0 1
## 3 sp3 0 0 1 1 1 1
## 4 sp4 0 0 1 0 0 1
## 5 sp5 0 0 1 0 0 1
## 6 sp6 0 0 1 0 0 1
## 7 sp7 0 0 1 0 0 1
## 8 sp8 0 0 1 0 0 1
## 9 sp9 1 0 0 0 0 1
29 / 48

inspecting environmental data

head(envir)
## Sites Clay Silt Sand K2O Mg Na100g K Elev
## 1 1 0.73 0.24 0.03 1.3 9.2 4.2 1.2 6
## 2 2 0.75 0.24 0.02 0.8 10.7 10.4 1.4 2
## 3 3 0.74 0.24 0.02 1.7 8.6 10.8 1.9 2
## 4 4 0.23 0.26 0.49 0.3 2.0 1.2 0.3 6
## 5 5 0.73 0.24 0.03 1.3 9.2 4.2 1.2 6
## 6 6 0.72 0.22 0.03 1.7 6.0 10.7 1.3 4
30 / 48

inspecting coordenate data

head(coord)
## Sites X Y
## 1 1 365.80 181.20
## 2 2 349.60 185.00
## 3 3 333.40 185.40
## 4 4 373.86 161.43
## 5 5 380.25 179.50
## 6 6 354.40 168.60
31 / 48

inspecting the species list data

head(splist)
## TaxonName Taxon TaxCode
## 1 Arisarum vulgare Arvu sp1
## 2 Alisma plantago Alpl sp2
## 3 Damasonium alisma Daal sp3
## 4 Asphodelus aetivus Asae sp4
## 5 Narcissus tazetta Nata sp5
## 6 Narcissus elegans Nael sp6
# how many species?
nrow(splist)
## [1] 56
32 / 48

adding coordinates in the site data.frame

# info on coord
names(coord)
## [1] "Sites" "X" "Y"
dim(coord)
## [1] 97 3
# info on envir
names(envir)
## [1] "Sites" "Clay" "Silt" "Sand" "K2O" "Mg" "Na100g" "K"
## [9] "Elev"
dim(envir)
## [1] 97 9
33 / 48

using merge and a common column

the common column is primary and foreign key

envir_coord <- merge(x = envir,
y = coord,
by = "Sites")
dim(envir_coord)
## [1] 97 11
34 / 48

checking the merge

names(envir_coord)
## [1] "Sites" "Clay" "Silt" "Sand" "K2O" "Mg" "Na100g" "K"
## [9] "Elev" "X" "Y"
35 / 48

transforming the species matrix vs. area in data table (tidy data)

comm[1:6, 1:10]
## Sites sp1 sp2 sp3 sp4 sp5 sp6 sp7 sp8 sp9
## 1 1 0 0 0 0 0 0 0 1 0
## 2 2 0 1 0 0 0 0 0 0 0
## 3 3 0 1 0 0 0 0 0 0 2
## 4 4 0 0 0 0 0 0 0 0 0
## 5 5 0 0 0 0 0 0 0 0 0
## 6 6 0 0 0 0 0 0 0 0 0
36 / 48

using the package reshape2

# converting the matrix into tidy data
comm_df <- reshape2::melt(comm[, -1])
## No id variables; using all as measure variables
# checking if it worked
head(comm_df)
## variable value
## 1 sp1 0
## 2 sp1 0
## 3 sp1 0
## 4 sp1 0
## 5 sp1 0
## 6 sp1 0
37 / 48

using the package reshape2

# how many the times the species repeats
table(comm_df$variable)
##
## sp1 sp2 sp3 sp4 sp5 sp6 sp7 sp8 sp9 sp10 sp11 sp12 sp13 sp14 sp15 sp16
## 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97
## sp17 sp18 sp19 sp20 sp21 sp22 sp23 sp24 sp25 sp26 sp27 sp28 sp29 sp30 sp31 sp32
## 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97
## sp33 sp34 sp35 sp36 sp37 sp38 sp39 sp40 sp41 sp42 sp43 sp44 sp45 sp46 sp47 sp48
## 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97
## sp49 sp50 sp51 sp52 sp53 sp54 sp55 sp56
## 97 97 97 97 97 97 97 97
38 / 48

creating the variable "Sites"

# how many species?
n_sp <- nrow(splist)
n_sp
## [1] 56
# vector containing all sites
Sites <- envir$Sites
length(Sites)
## [1] 97
comm_df$Sites <- rep(Sites, each = n_sp)
39 / 48

checking if the first column was created

head(comm_df)
## variable value Sites
## 1 sp1 0 1
## 2 sp1 0 1
## 3 sp1 0 1
## 4 sp1 0 1
## 5 sp1 0 1
## 6 sp1 0 1
40 / 48

changing column names

names(comm_df)
## [1] "variable" "value" "Sites"
names(comm_df)[1:2] <- c("TaxCode", "Abundance")
head(comm_df)
## TaxCode Abundance Sites
## 1 sp1 0 1
## 2 sp1 0 1
## 3 sp1 0 1
## 4 sp1 0 1
## 5 sp1 0 1
## 6 sp1 0 1
41 / 48

using R base

let's add a column with the species name to our comm_df object

comm_sp <- merge(x = comm_df,
y = splist[, c(1, 3)],
by = "TaxCode")
42 / 48

using packages dplyr e tidyr

let's add a column with the species name to our comm_df object

comm_sp <- inner_join(x = comm_df,
y = splist[, c(1, 3)])
## Joining, by = "TaxCode"
43 / 48

checking our new column

head(comm_sp)
## TaxCode Abundance Sites TaxonName
## 1 sp1 0 1 Arisarum vulgare
## 2 sp1 0 1 Arisarum vulgare
## 3 sp1 0 1 Arisarum vulgare
## 4 sp1 0 1 Arisarum vulgare
## 5 sp1 0 1 Arisarum vulgare
## 6 sp1 0 1 Arisarum vulgare
44 / 48

using the package tidyr

comm_tidy <- tidyr::pivot_longer(comm, cols = 2:ncol(comm), names_to = "TaxCode", values_to = "Abundance")
head(comm_tidy)
## # A tibble: 6 × 3
## Sites TaxCode Abundance
## <int> <chr> <int>
## 1 1 sp1 0
## 2 1 sp2 0
## 3 1 sp3 0
## 4 1 sp4 0
## 5 1 sp5 0
## 6 1 sp6 0
dim(comm_tidy)
## [1] 5432 3
45 / 48

joining all variables into a single table

binding comm_sp with envir.coord

comm_envir <- inner_join(x = comm_sp,
y = envir_coord,
by = "Sites")
dim(comm_sp)
## [1] 5432 4
dim(envir_coord)
## [1] 97 11
dim(comm_envir)
## [1] 5432 14
46 / 48

writing final table

write.csv("comm_envir", "data/processed/03_Pavoine_full_table.csv", row.names = FALSE)
47 / 48

today

  • data

  • relational data bases

  • manipulating data in R

2 / 48
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow