class: center, middle, inverse, title-slide

.title[
# Exploratory data analysis
]
.author[
### Sara Mortara & Andrea Sánchez-Tapia
]
.institute[
### re.green | ¡liibre!
]
.date[
### 2022-07-13
]

---

<style type="text/css">
.tiny .remark-code { /*Change made here*/
  font-size: 50% !important;
}
</style>

## today

- exploratory data analysis

- descriptive statistics

- exploratory graphics

- variable relationships

---

## Exploratory Data Analysis - John Tukey

.pull-left[
<img src="figs/tukey.jpg" width="300" style="display: block; margin: auto;" />
]

.pull-right[
<img src="figs/John_Tukey.jpg" width="293" style="display: block; margin: auto;" />
]

---

## get to know your data!

<img src="figs/pee.jpg" width="600" style="display: block; margin: auto;" />

---

## goals of EDA

1. control data quality

--

2. suggest hypotheses for observed patterns

--

3. support the choice of statistical procedures for hypothesis testing

--

4. assess whether the data meet the assumptions of the chosen statistical procedures

--

5. indicate new studies and hypotheses

---

## alert! EDA does not mean

<img src="figs/tortura.jpg" width="500" style="display: block; margin: auto;" />

it is assumed that the researcher has formulated *a priori* __hypotheses__ supported by __theory__

---

## tips

- there is no recipe!
--

- can take 20-50% of analysis time

--

- can be started during data collection

--

- visual techniques are widely used

--

---

## the importance of graphics and the Anscombe quartet

- created by statistician Francis Anscombe

--

- 4 datasets with the same descriptive statistics but very different graphically

<img src="figs/Francis_Anscombe.jpeg" width="267" style="display: block; margin: auto;" />

---

## Anscombe data

.tiny[

```r
# the dataset already exists inside R
data("anscombe")
```

```r
# mean
apply(anscombe, 2, mean)
```

```
##       x1       x2       x3       x4       y1       y2       y3       y4 
## 9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
```

```r
# variance
apply(anscombe, 2, var)
```

```
##        x1        x2        x3        x4        y1        y2        y3        y4 
## 11.000000 11.000000 11.000000 11.000000  4.127269  4.127629  4.122620  4.123249
```

]

---

## let's take a look into the data

```
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
```

---

## correlation between x and y

.tiny[

```r
# correlation
cor(anscombe$x1, anscombe$y1)
```

```
## [1] 0.8164205
```

```r
cor(anscombe$x2, anscombe$y2)
```

```
## [1] 0.8162365
```

```r
cor(anscombe$x3, anscombe$y3)
```

```
## [1] 0.8162867
```

```r
cor(anscombe$x4, anscombe$y4)
```

```
## [1] 0.8165214
```

]

---

## coefficients of the linear model

.tiny[

```r
# coefficients of the linear model (here x is regressed on y)
coef(lm(anscombe$x1 ~ anscombe$y1))
```

```
## (Intercept) anscombe$y1 
##  -0.9975311   1.3328426
```

```r
coef(lm(anscombe$x2 ~ anscombe$y2))
```

```
## (Intercept) anscombe$y2 
##  -0.9948419   1.3324841
```

```r
coef(lm(anscombe$x3 ~ anscombe$y3))
```

```
## (Intercept) anscombe$y3 
##   -1.000315    1.333375
```

```r
coef(lm(anscombe$x4 ~ anscombe$y4))
```

```
## (Intercept) anscombe$y4 
##   -1.003640    1.333657
```

]

---

## now let's actually look into the Anscombe data

<img src="05_slides_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

## guiding questions

1. Where are the data centered? How are the data distributed? Are the data symmetrical, asymmetrical, bimodal?

--

2. Are there outliers?

--

3. Do the variables follow a normal distribution?

--

4. Are there relationships between the variables? Are the relationships between variables linear?

--

5. Do variables need to be transformed?

--

6. Was the sampling effort the same for each observation or variable?

---

class: inverse, middle, center

# descriptive statistics

---

## questions to ask the data

1. are there missing values (i.e., __NA__s)? Are they really missing?

2. are there many __zeroes__?

3. where are the data centered? how are they spread? are they symmetrical, skewed, bimodal?

4. are there extreme values (outliers)?

5. what is the distribution of the variable?

---

## descriptive statistics

| Parameter | Description | R function |
|------|-------------|--------|
| average | arithmetic mean | `mean()` |
| median | central value | `median()` |
| mode | most frequent value | `sort(table(), decreasing = TRUE)[1]` |
| standard deviation | variation around the mean | `sd()` |
| quantiles | cut points dividing a probability distribution | `quantile()` |

---

class: inverse, middle, center

# exploratory graphics

---

## reading data in R

```r
# reading data generated in the last class
all_data <- read.csv("data/processed/03_Pavoine_full_table.csv")
# reading environmental data
envir <- read.csv("data/raw/cestes/envir.csv")
# environmental data without site
envir.vars <- envir[, -1]
```

---

## visualizing data in a boxplot

```r
boxplot(all_data$Abundance)
```

<img src="05_slides_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

## going back to the data

```r
summary(all_data$Abundance)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1788  0.0000  6.0000
```

```r
# how many zeroes
sum(all_data$Abundance == 0)
```

```
## [1] 4824
```

```r
# what proportion?
sum(all_data$Abundance == 0)/nrow(all_data)
```

```
## [1] 0.8880707
```

---

## understanding the boxplot

<img src="05_slides_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

## visualizing data in a histogram

```r
hist(all_data$Abundance)
```

<img src="05_slides_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

---

## types of histogram

.tiny[

```r
par(mfrow = c(1,2))
hist(all_data$Abundance)
hist(all_data$Abundance, probability = TRUE)
```

<img src="05_slides_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

```r
par(mfrow = c(1,1))
```

]

---

## classes of histogram

.tiny[

```r
par(mfrow = c(1,3))
hist(all_data$Abundance, breaks = seq(0, max(all_data$Abundance), length.out = 3))
hist(all_data$Abundance, breaks = seq(0, max(all_data$Abundance), length.out = 5))
hist(all_data$Abundance)
```

<img src="05_slides_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

```r
par(mfrow = c(1,1))
```

]

---

## empirical probability density curves

they represent the function that describes how likely each value is to be found

```r
hist(all_data$Abundance, probability = TRUE)
```

<img src="05_slides_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

## empirical probability density curves

```r
plot(density(all_data$Abundance))
```

<img src="05_slides_files/figure-html/dens-1.png" style="display: block; margin: auto;" />

---

## does the distribution fit the data?

discrete and asymmetric distribution --> Poisson?

```r
# maximum of abundance
ab.max <- max(all_data$Abundance)
# lambda (the Poisson mean), estimated by the sample mean
ab.med <- mean(all_data$Abundance)
```

---

## does the __Poisson__ distribution fit the data?
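A quick numeric check can complement the plot. The sketch below compares the share of zeroes reported earlier (4824 zeroes, a proportion of 0.888) with the share that a Poisson distribution with the observed mean (0.1788) would predict. The object names `n.zero`, `n.obs`, `lambda`, `p0.obs` and `p0.pois` are made up for this illustration, the numbers are copied from the printed output instead of being read from the data file, and the 5432 rows are inferred from 4824 / 0.8880707 rather than shown on the slides.

```r
# values copied from the summaries printed on the previous slides
n.zero <- 4824    # number of zero abundances
n.obs  <- 5432    # assumption: number of rows implied by 4824 / 0.8880707
lambda <- 0.1788  # rounded mean abundance, used as the Poisson mean

# observed share of zeroes
p0.obs <- n.zero / n.obs

# share of zeroes expected under Poisson(lambda), which equals exp(-lambda)
p0.pois <- dpois(0, lambda)

# side-by-side comparison
round(c(observed = p0.obs, poisson = p0.pois), 3)
```

If the observed share is clearly larger than the Poisson one (here roughly 0.89 against 0.84), the data hold more zeroes than a plain Poisson would produce, which is worth keeping in mind when reading the plot.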
```r
hist(all_data$Abundance, probability = TRUE)
# `cor` is not defined on these slides (in base R, cor() is a function),
# so a literal colour is used; x positions are given explicitly as 0:ab.max
points(0:ab.max, dpois(0:ab.max, ab.med), col = "red")
lines(0:ab.max, dpois(0:ab.max, ab.med), col = "red")
```

<img src="05_slides_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />

---

## statistical distributions: Gaussian or normal

<img src="05_slides_files/figure-html/norm-1.png" style="display: block; margin: auto;" />

---

## why is sampling important?

<img src="05_slides_files/figure-html/norm-sampling-1.png" style="display: block; margin: auto;" />

---

class: inverse, middle, center

# relationships between variables

---

## scatter plot

```r
plot(Clay ~ Silt, data = envir.vars, pch = 19)
```

<img src="05_slides_files/figure-html/dispersao-1.png" style="display: block; margin: auto;" />

---

## correlation between variables

.tiny[

```r
cor(envir.vars)
```

```
##              Clay        Silt        Sand        K2O          Mg      Na100g
## Clay    1.0000000 -0.62694838 -0.71786978  0.4422121  0.18895961  0.28623195
## Silt   -0.6269484  1.00000000 -0.07660720 -0.2388823 -0.02370373  0.02738666
## Sand   -0.7178698 -0.07660720  1.00000000 -0.3364384 -0.21930954 -0.37588031
## K2O     0.4422121 -0.23888226 -0.33643842  1.0000000  0.33549979  0.25314016
## Mg      0.1889596 -0.02370373 -0.21930954  0.3354998  1.00000000  0.41377118
## Na100g  0.2862320  0.02738666 -0.37588031  0.2531402  0.41377118  1.00000000
## K       0.5436153 -0.32123692 -0.40584268  0.5681411  0.41177702  0.57075510
## Elev   -0.1485992  0.08087163  0.09379561 -0.1767765 -0.22314328 -0.33392061
##                 K        Elev
## Clay    0.5436153 -0.14859923
## Silt   -0.3212369  0.08087163
## Sand   -0.4058427  0.09379561
## K2O     0.5681411 -0.17677652
## Mg      0.4117770 -0.22314328
## Na100g  0.5707551 -0.33392061
## K       1.0000000 -0.33251202
## Elev   -0.3325120  1.00000000
```

]

---

## correlation between variables

<img src="figs/Correlation_examples2.svg" width="800" style="display: block; margin: auto;" />

[By DenisBoigelot](https://commons.wikimedia.org/w/index.php?curid=15165296)

---

## correlation between variables

.tiny[

```r
pairs(envir.vars)
```
<img src="05_slides_files/figure-html/pairs-1.png" style="display: block; margin: auto;" />

]

---

## even better visualization

<img src="05_slides_files/figure-html/cor-1.png" style="display: block; margin: auto;" />

---

class: center, middle

# and what are the paths for the data analysis?

your __[ H Y P O T H E S I S ]__

---

## after the __[ H Y P O T H E S I S ]__, what are the paths?

1. understand the data well

--

2. is the response variable normal? --> __lm__ and other parametric analyses

--

3. does the response variable have another distribution? --> non-parametric analysis, __glm__

--

4. hierarchical predictor variables? --> __(g)lmm__

--

5. pseudo-replication in space or time --> __(g)lmm__

--

---

## todo

- create and run script `04_eda.R`

- `git add`, `commit`, and `push` the day's work

---

class: center, middle

# ¡Thanks!
<center>

[saramortara@gmail.com](mailto:saramortara@gmail.com) | [andreasancheztapia@gmail.com](mailto:andreasancheztapia@gmail.com)

[@MortaraSara](https://twitter.com/MortaraSara) | [@SanchezTapiaA](https://twitter.com/SanchezTapiaA)

[saramortara](http://github.com/saramortara) | [andreasancheztapia](http://github.com/andreasancheztapia)

</center>
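
---

## appendix: drawing the Anscombe quartet

The Anscombe slides show the four datasets as a figure without the code behind it. The sketch below is one possible way to draw such a figure with base graphics; it is not necessarily how the figure in these slides was produced. The axis limits, the panel titles and the helper objects `op`, `slopes` and `fit` are choices made for this illustration, and the model regresses y on x, the usual direction for this dataset.

```r
# Anscombe's quartet ships with base R (package datasets)
data("anscombe")

# draw the four x/y pairs in a 2 x 2 grid, all with the same axis limits
op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
slopes <- numeric(4)
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)                 # y regressed on x for this pair
  slopes[i] <- coef(fit)[["x"]]    # keep the slope for comparison
  plot(x, y, pch = 19, xlim = c(3, 20), ylim = c(2, 14),
       main = paste("dataset", i))
  abline(fit, col = "red")         # fitted line for this panel
}
par(op)                            # restore the previous graphical settings

# the four slopes, rounded
round(slopes, 2)
```

The slopes come out close to 0.5 in all four panels, while the point clouds look nothing alike, which is the point of the quartet.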