class: center, middle, inverse, title-slide .title[ # Data manipulation ] .author[ ### Sara Mortara & Andrea Sánchez-Tapia ] .institute[ ### re.green | ¡liibre! ] .date[ ### 2022-07-12 ] --- <style type="text/css"> .tiny .remark-code { /*Change made here*/ font-size: 50% !important; } </style> ## today - data - relational data bases - manipulating data in R --- ## data <img src="figs/Data-Wisdom.jpg" width="1361" /> --- # data and data structure - knowledge comes from synthesizing multiples perspectives -- - every data have a __context__ -- - data cleanning - insistence on tidiness, cleanliness, and order - idea of data scientists as wizards or janitors -- - ideas that data should always be clean and controlled have tainted historical roots -- --- ## organizing data <img src="figs/tidy_data.png" width="600" style="display: block; margin: auto;" /> --- ## organizing data - manipulating data in rows and columns - data tidying: structuring datasets to facilitate analysis <img src="figs/tidy01.png" width="419" style="display: block; margin: auto;" /> --- ## organizing data: tidy - each column a variable - each row an observation <img src="figs/tidy02.png" width="396" style="display: block; margin: auto;" /> --- class: inverse, center, middle ## relational data bases --- ## relational data bases + different data are organized in different tables -- + tables are integrated -- + common identifier for each table -- + in general organized in __SQL__ (__Structured Query Language__) --- ## dealing with relational data + tipically data analysis in different tables -- + relations between tables are pairwised -- + relations of three or more tables are always properties of pairwise relationships -- + __verbs__ to work with pairs of tables --- ## types of verbs + mutating -- + filtering -- + set operations -- + data manipulation can be done using packages __base__ and __dplyr__ --- background-image: url("figs/dplyr.png") background-position: 98% 2% background-size: 100px ## __dplyr__ package based on relational data bases logic + easier than SQL because it is focused on data analysis -- + __grammar__ for data manipulation -- + `mutate()` -- + `filter()` -- --- ## data structure <img src="figs/dplyr2.png" width="697" style="display: block; margin: auto;" /> --- ## keys <img src="figs/what-is-a-relational-database.jpg" width="800" style="display: block; margin: auto;" /> --- ## keys + a key indicates an unique observation -- + __primary key__ --> identifies an observation in its own table -- + __foreign key__ --> identifies an observation in another table -- + a variable can be a primary key in one table and a foreign key in another -- + every primary key must contain unique information! -- + every table must have a primary key --- ## relation is given by __primary key__ and its respective __foreign key__ .pull-left[ + 1-to-many + 1-to-1 + we use the keys to combine tables + we use `merge()` (__base__) or `join()` (__dplyr__) ] .pull-right[ <img src="figs/what-is-a-relational-database.jpg" width="400" style="display: block; margin: auto;" /> ] --- ## using relationships to join data <img src="figs/join.png" width="400" style="display: block; margin: auto;" /> --- ## equivalences between __dplyr__ and __base__ <img src="figs/join_merge.png" width="1068" style="display: block; margin: auto;" /> --- class: inverse, center, middle ## manipulating data in R --- ## CESTES data base <img src="figs/cestesdatabase.png" width="700" style="display: block; margin: auto;" /> [Jeliazkov et al 2020 Sci Data](https://doi.org/10.1038/s41597-019-0344-7) --- ## open data and code <img src="figs/cabecalho_cestes.png" width="887" style="display: block; margin: auto;" /> --- ## open data and code <img src="figs/setwdeca.png" width="888" style="display: block; margin: auto;" /> --- ## open data and code .pull-left[ <img src="figs/JennyBryan.png" width="724" style="display: block; margin: auto;" /> ] .pull-right[ <img src="figs/jenny.jpg" width="300" style="display: block; margin: auto;" /> ] --- ## todo <svg viewBox="0 0 640 512" style="height:1em;fill:currentColor;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M255.03 261.65c6.25 6.25 16.38 6.25 22.63 0l11.31-11.31c6.25-6.25 6.25-16.38 0-22.63L253.25 192l35.71-35.72c6.25-6.25 6.25-16.38 0-22.63l-11.31-11.31c-6.25-6.25-16.38-6.25-22.63 0l-58.34 58.34c-6.25 6.25-6.25 16.38 0 22.63l58.35 58.34zm96.01-11.3l11.31 11.31c6.25 6.25 16.38 6.25 22.63 0l58.34-58.34c6.25-6.25 6.25-16.38 0-22.63l-58.34-58.34c-6.25-6.25-16.38-6.25-22.63 0l-11.31 11.31c-6.25 6.25-6.25 16.38 0 22.63L386.75 192l-35.71 35.72c-6.25 6.25-6.25 16.38 0 22.63zM624 416H381.54c-.74 19.81-14.71 32-32.74 32H288c-18.69 0-33.02-17.47-32.77-32H16c-8.8 0-16 7.2-16 16v16c0 35.2 28.8 64 64 64h512c35.2 0 64-28.8 64-64v-16c0-8.8-7.2-16-16-16zM576 48c0-26.4-21.6-48-48-48H112C85.6 0 64 21.6 64 48v336h512V48zm-64 272H128V64h384v256z"></path></svg> - create and run `03_data_manipulation.R` - generate outputs - `git add`, `commit`, and `push` --- ## a workflow in R loading needed packages ```r library("reshape2") library("tidyverse") ``` --- ## exploring the data ## species vs. sites .tiny[ ```r files_path <- list.files("data/raw/cestes/", pattern = ".csv", full.names = TRUE) files_path ``` ``` ## [1] "data/raw/cestes//comm.csv" "data/raw/cestes//coord.csv" ## [3] "data/raw/cestes//envir.csv" "data/raw/cestes//splist.csv" ## [5] "data/raw/cestes//traits.csv" ``` ```r files_path[3] ``` ``` ## [1] "data/raw/cestes//envir.csv" ``` ] --- ## reading the data in R ### creating `data.frame` objects ```r comm <- read.csv(files_path[1]) coord <- read.csv(files_path[2]) envir <- read.csv(files_path[3]) splist <- read.csv(files_path[4]) traits <- read.csv(files_path[5]) ``` --- ## inspecting the community data ```r comm[1:6, 1:10] ``` ``` ## Sites sp1 sp2 sp3 sp4 sp5 sp6 sp7 sp8 sp9 ## 1 1 0 0 0 0 0 0 0 1 0 ## 2 2 0 1 0 0 0 0 0 0 0 ## 3 3 0 1 0 0 0 0 0 0 2 ## 4 4 0 0 0 0 0 0 0 0 0 ## 5 5 0 0 0 0 0 0 0 0 0 ## 6 6 0 0 0 0 0 0 0 0 0 ``` --- ## inspecting traits data ```r traits[1:9, 1:7] ``` ``` ## Sp Anemo Auto Entomo Annual Biennial Perennial ## 1 sp1 0 0 1 0 0 1 ## 2 sp2 0 0 1 0 0 1 ## 3 sp3 0 0 1 1 1 1 ## 4 sp4 0 0 1 0 0 1 ## 5 sp5 0 0 1 0 0 1 ## 6 sp6 0 0 1 0 0 1 ## 7 sp7 0 0 1 0 0 1 ## 8 sp8 0 0 1 0 0 1 ## 9 sp9 1 0 0 0 0 1 ``` --- ## inspecting environmental data ```r head(envir) ``` ``` ## Sites Clay Silt Sand K2O Mg Na100g K Elev ## 1 1 0.73 0.24 0.03 1.3 9.2 4.2 1.2 6 ## 2 2 0.75 0.24 0.02 0.8 10.7 10.4 1.4 2 ## 3 3 0.74 0.24 0.02 1.7 8.6 10.8 1.9 2 ## 4 4 0.23 0.26 0.49 0.3 2.0 1.2 0.3 6 ## 5 5 0.73 0.24 0.03 1.3 9.2 4.2 1.2 6 ## 6 6 0.72 0.22 0.03 1.7 6.0 10.7 1.3 4 ``` --- ## inspecting coordenate data ```r head(coord) ``` ``` ## Sites X Y ## 1 1 365.80 181.20 ## 2 2 349.60 185.00 ## 3 3 333.40 185.40 ## 4 4 373.86 161.43 ## 5 5 380.25 179.50 ## 6 6 354.40 168.60 ``` --- ## inspecting the species list data ```r head(splist) ``` ``` ## TaxonName Taxon TaxCode ## 1 Arisarum vulgare Arvu sp1 ## 2 Alisma plantago Alpl sp2 ## 3 Damasonium alisma Daal sp3 ## 4 Asphodelus aetivus Asae sp4 ## 5 Narcissus tazetta Nata sp5 ## 6 Narcissus elegans Nael sp6 ``` ```r # how many species? nrow(splist) ``` ``` ## [1] 56 ``` --- ## adding coordinates in the site `data.frame` .tiny[ ```r # info on coord names(coord) ``` ``` ## [1] "Sites" "X" "Y" ``` ```r dim(coord) ``` ``` ## [1] 97 3 ``` ```r # info on envir names(envir) ``` ``` ## [1] "Sites" "Clay" "Silt" "Sand" "K2O" "Mg" "Na100g" "K" ## [9] "Elev" ``` ```r dim(envir) ``` ``` ## [1] 97 9 ``` ] --- ## using `merge` and a common column the common column is __primary__ and __foreign__ key ```r envir_coord <- merge(x = envir, y = coord, by = "Sites") dim(envir_coord) ``` ``` ## [1] 97 11 ``` --- ## checking the `merge` ```r names(envir_coord) ``` ``` ## [1] "Sites" "Clay" "Silt" "Sand" "K2O" "Mg" "Na100g" "K" ## [9] "Elev" "X" "Y" ``` --- ## transforming the species matrix vs. area in data table (tidy data) ```r comm[1:6, 1:10] ``` ``` ## Sites sp1 sp2 sp3 sp4 sp5 sp6 sp7 sp8 sp9 ## 1 1 0 0 0 0 0 0 0 1 0 ## 2 2 0 1 0 0 0 0 0 0 0 ## 3 3 0 1 0 0 0 0 0 0 2 ## 4 4 0 0 0 0 0 0 0 0 0 ## 5 5 0 0 0 0 0 0 0 0 0 ## 6 6 0 0 0 0 0 0 0 0 0 ``` --- ## using the package **reshape2** ```r # converting the matrix into tidy data comm_df <- reshape2::melt(comm[, -1]) ``` ``` ## No id variables; using all as measure variables ``` ```r # checking if it worked head(comm_df) ``` ``` ## variable value ## 1 sp1 0 ## 2 sp1 0 ## 3 sp1 0 ## 4 sp1 0 ## 5 sp1 0 ## 6 sp1 0 ``` --- ## using the package **reshape2** .tiny[ ```r # how many the times the species repeats table(comm_df$variable) ``` ``` ## ## sp1 sp2 sp3 sp4 sp5 sp6 sp7 sp8 sp9 sp10 sp11 sp12 sp13 sp14 sp15 sp16 ## 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 ## sp17 sp18 sp19 sp20 sp21 sp22 sp23 sp24 sp25 sp26 sp27 sp28 sp29 sp30 sp31 sp32 ## 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 ## sp33 sp34 sp35 sp36 sp37 sp38 sp39 sp40 sp41 sp42 sp43 sp44 sp45 sp46 sp47 sp48 ## 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 97 ## sp49 sp50 sp51 sp52 sp53 sp54 sp55 sp56 ## 97 97 97 97 97 97 97 97 ``` ] --- ## creating the variable "Sites" ```r # how many species? n_sp <- nrow(splist) n_sp ``` ``` ## [1] 56 ``` ```r # vector containing all sites Sites <- envir$Sites length(Sites) ``` ``` ## [1] 97 ``` ```r comm_df$Sites <- rep(Sites, each = n_sp) ``` --- ## checking if the first column was created ```r head(comm_df) ``` ``` ## variable value Sites ## 1 sp1 0 1 ## 2 sp1 0 1 ## 3 sp1 0 1 ## 4 sp1 0 1 ## 5 sp1 0 1 ## 6 sp1 0 1 ``` --- ## changing column names ```r names(comm_df) ``` ``` ## [1] "variable" "value" "Sites" ``` ```r names(comm_df)[1:2] <- c("TaxCode", "Abundance") head(comm_df) ``` ``` ## TaxCode Abundance Sites ## 1 sp1 0 1 ## 2 sp1 0 1 ## 3 sp1 0 1 ## 4 sp1 0 1 ## 5 sp1 0 1 ## 6 sp1 0 1 ``` --- ## using R **base** let's add a column with the species name to our `comm_df` object ```r comm_sp <- merge(x = comm_df, y = splist[, c(1, 3)], by = "TaxCode") ``` --- ## using packages **dplyr** e **tidyr** let's add a column with the species name to our `comm_df` object ```r comm_sp <- inner_join(x = comm_df, y = splist[, c(1, 3)]) ``` ``` ## Joining, by = "TaxCode" ``` --- ## checking our new column ```r head(comm_sp) ``` ``` ## TaxCode Abundance Sites TaxonName ## 1 sp1 0 1 Arisarum vulgare ## 2 sp1 0 1 Arisarum vulgare ## 3 sp1 0 1 Arisarum vulgare ## 4 sp1 0 1 Arisarum vulgare ## 5 sp1 0 1 Arisarum vulgare ## 6 sp1 0 1 Arisarum vulgare ``` --- ## using the package **tidyr** ```r comm_tidy <- tidyr::pivot_longer(comm, cols = 2:ncol(comm), names_to = "TaxCode", values_to = "Abundance") head(comm_tidy) ``` ``` ## # A tibble: 6 × 3 ## Sites TaxCode Abundance ## <int> <chr> <int> ## 1 1 sp1 0 ## 2 1 sp2 0 ## 3 1 sp3 0 ## 4 1 sp4 0 ## 5 1 sp5 0 ## 6 1 sp6 0 ``` ```r dim(comm_tidy) ``` ``` ## [1] 5432 3 ``` --- ## joining all variables into a single table ### binding `comm_sp` with `envir.coord` .tiny[ ```r comm_envir <- inner_join(x = comm_sp, y = envir_coord, by = "Sites") dim(comm_sp) ``` ``` ## [1] 5432 4 ``` ```r dim(envir_coord) ``` ``` ## [1] 97 11 ``` ```r dim(comm_envir) ``` ``` ## [1] 5432 14 ``` ] --- ## writing final table ```r write.csv("comm_envir", "data/processed/03_Pavoine_full_table.csv", row.names = FALSE) ``` --- class: center, middle # ¡Thanks! <center> <svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;fill:#A70000;height:1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"></path></svg> [saramortara@gmail.com](mailto:saramortara@gmail.com) | [andreasancheztapia@gmail.com](mailto:andreasancheztapia@gmail.com) <svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;fill:#A70000;height:1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg> [@MortaraSara](https://twitter.com/MortaraSara) | [@SanchezTapiaA](https://twitter.com/SanchezTapiaA) <svg viewBox="0 0 496 512" style="position:relative;display:inline-block;top:.1em;fill:#A70000;height:1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg><svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;fill:#A70000;height:1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M105.2 24.9c-3.1-8.9-15.7-8.9-18.9 0L29.8 199.7h132c-.1 0-56.6-174.8-56.6-174.8zM.9 287.7c-2.6 8 .3 16.9 7.1 22l247.9 184-226.2-294zm160.8-88l94.3 294 94.3-294zm349.4 88l-28.8-88-226.3 294 247.9-184c6.9-5.1 9.7-14 7.2-22zM425.7 24.9c-3.1-8.9-15.7-8.9-18.9 0l-56.6 174.8h132z"></path></svg> [saramortara](http://github.com/saramortara) | [andreasancheztapia](http://github.com/andreasancheztapia)