Exploratory data analysis

Sara Mortara, Andrea Sánchez-Tapia
2022-07-13

One possible routine of exploratory data analysis

Morphological patterns of Iris species

Theiris datasets was collected by Edgar Anderson and became known by the work Ronald E. Fisher. Let’s load the data in R.

After loading the dataset, use the command ?iris to learn more about the dataset. Let’s start with basic data inspections.

head(iris)
summary(iris)

The functions aggregate and tapply

How many observations per species?

table(iris$Species)

What is the average of the variables by species? Let’s use the aggregate and tapply functions. The two functions are similar, what changes are the arguments and the output format of each one of them.

# sepal length mean per species
tapply(X = iris$Sepal.Length, INDEX = list(iris$Species), FUN = mean)
# the same operation, using another function, other arguments and other output
aggregate(x = iris$Sepal.Length, by = list(iris$Species), FUN = mean)
# still the same operation, the same function but a different notation
aggregate(Sepal.Length ~ Species, data = iris, mean)

NAs and zeroes

Using a logical test to count how many NAs exist in the data.

apply(iris, 2, function(x) sum(is.na(x)))

Descriptive statistics

\(% fonte: http://ncss-tech.github.io/stats_for_soil_survey/chapters/4_exploratory_analysis/4_exploratory_analysis.html#4_descriptive_statistics\)

The main descriptive statistics that we will use are:

Parameter Description R function
average arithmetic mean mean()
median core value median()
mode most frequent value sort(table(), decreasing = TRUE)[1]
standard deviation variation around the mean sd()
quantiles cut points dividing a probability distribution quantile()

Measures of central tendency

Mean

\(\overline{x}= \frac{\sum^{n}_{i=1}x_i}{n}\)

vars <- iris[, -5]
apply(vars, 2, mean)

Median: middle value separating the greater and lesser halves of a data set

apply(vars, 2, median)

Mode: Most frequent value in a data set

freq_sl <- sort(table(iris$Sepal.Length), decreasing = TRUE)
freq_sl[1]

Measures of dispersion

Variance: deviation from the mean

\(s^{2}=\sum^{n}_{i=1}\frac{(x_i−\overline{x})^2}{n−1}\)

apply(vars, 2, var)

Standard deviation: square root of variance

\(s=\sqrt{\sum^{n}_{i=1}\frac{(x_i−\overline{x})^2}{n−1}}\)

sd01 <- apply(vars, 2, sd)
# another way:
sd02 <- apply(vars, 2, function(x) sqrt(var(x)))
sd01
sd02
sd01 == sd02

Coefficient of variation: relative measure of standard deviation

\(CV=\frac{s}{\overline{x}}\times100\)

There is no function in base R to calculate the coefficient of variation. This is not a problem. Let’s create an R function for that.

cv <- function(x){
  sd(x) / mean(x) * 100
}
apply(vars, 2, cv)

Quantiles or percentiles

It is the value that cuts the nth percentage of data values when sorted in ascending order. By default, the quantile() function returns the minimum, the 25th percentile, the median, the 50th percentile, the 75th percentile, and the maximum (which is very similar to the return of the summary() function of a numeric vector). The five numbers divide the data into four quantiles, which are therefore called quartiles. The quartiles are a very useful metric to describe the data because they have a simple interpretation and are not affected by the distribution of the data. It is possible to modify the desired percentiles with the probs argument.

# quartiles (four quantiles)
apply(vars, 2, quantile)
# 5%, 50% e 95%
apply(vars, 2, quantile, probs = c(0.05, 0.5, 0.95))

Range

The range is the difference between the highest and lowest value of a given variable.

# range function return the min and max
apply(vars, 2, range)
# applying the diff function to the range result, we have the desired value
# a good practice is to never overwrite an existing object in R, so
# never name an object with an already existing name
my_range <- function(x){
  diff(range(x))
}
apply(vars, 2, my_range)

Interquartile range (IQR)

The IQR is the difference between the top quartile (75%) and the bottom quartile (25%).

apply(vars, 2, IQR)

Correlation

A correlation matrix is a table with the correlation values between each of the pairwise variables. Variables can be positively correlated (positive values) or negatively (negative values). What are highly correlated variables? A good “rule of thumb” is that any \(\geq{0.7}\) correlation is considered a high correlation.

cor(vars)

Graphical methods

Looking at the distribution of data is essential (Anscombe already said). Frequency distributions are useful because they help us visualize the center and distribution of the data. We are very used to the normal distribution, but biological data can assume different continuous or discrete probability distributions.

A brief description of the graphical methods:

Graph Type Description
bars each bar represents the frequency of observations of a group
histogram each bar represents the frequency of observations in a given set of values
density estimation of statistical distribution based on data
quantile-quantile comparison of data in relation to a normal distribution
box plot visual representation of mean, quartiles, symmetry and outliers
dispersion graphic display of continuous variables on the x and y axes

Bar chart

A bar graph shows the frequency of observations in a given class.

barplot(table(iris$Species))

In this case, all species have the same number of observations.

Histogram

The histogram is the bar graph equivalent for continuous variables. Each bar represents a range of values. The number of intervals can be specified by the user and affects the perception of data distribution.

Let’s see an example of a standard histogram for Iris species data.

par(mfrow = c(2, 2))
hist(iris$Sepal.Length)
hist(iris$Sepal.Width)
hist(iris$Petal.Length)
hist(iris$Petal.Length)
par(mfrow = c(1, 1))

Now for the sepal length of Iris species, let’s see the effect of the number of intervals in the histogram with the breaks argument.

par(mfrow = c(1, 2))
hist(iris$Sepal.Width)
hist(iris$Sepal.Width, breaks = 4)
par(mfrow = c(1, 1))

Density curve

The density curve shows the probability of observing a certain value. In comparison to the histogram, on the y axis, instead of having the frequency, we have the probability density.

par(mfrow = c(1, 2))
hist(iris$Sepal.Width)
hist(iris$Sepal.Width, freq = FALSE)
par(mfrow = c(1, 1))

In R we can see the density curve by plotting the density() function.

par(mfrow = c(1, 2))
# ploting the density curve
plot(density(iris$Sepal.Width))
#  ploting the density curve over the density histogram
hist(iris$Sepal.Width, freq = FALSE)
lines(density(iris$Sepal.Width), col = "blue")
par(mfrow = c(1, 1))

Box-plot or box-whisker plot

Box-plots are graphs that represent the summary of five Tuckey numbers showing the quartiles (25%, 50% and 75%), minimum, maximum and outliers values. The box-plot shows the shape of the data distribution, the shape of the distribution and the ability to compare with other variables on the same scale.

Understanding the box plot:

The boxplot box contains the first quartile (25%, lower quartile) and the third quartile (75%, upper quartile). The median (50%, second quartile) is the black line inside the box. The extremes of the graph (“whiskers”) show data at a distance of \(1.5\times\)IIQ above and below the third quartile and the first quartile. Any data beyond the segment is considered an outlier.

What is an outlier?

Outliers are not a bug (necessarily, but they can be). Outliers represent extreme values compared to the rest of the data. It is always important to evaluate outliers to ensure they are correct measurements.

Let’s now make the box-plots of the variables contained in the iris object. Let’s start with the general variables.

boxplot(iris$Sepal.Length)
boxplot(iris$Sepal.Width)
boxplot(iris$Petal.Length)
boxplot(iris$Petal.Width)

Now let’s look at the values per species.

boxplot(Sepal.Length ~ Species, data = iris)
boxplot(Sepal.Width ~ Species, data = iris)
boxplot(Petal.Length ~ Species, data = iris)
boxplot(Petal.Width ~ Species, data = iris)

Do you identify outliers in the dataset? How can we check outliers? Let’s use the boxplot() function itself to identify the outliers.

boxplot(iris$Sepal.Width)
my_boxplot <- boxplot(iris$Sepal.Width, plot = FALSE)
my_boxplot
# the object is a list and the outliers are stored in the $out element of the list
outliers <- my_boxplot$out
# what is the position of the outliers
which(iris$Sepal.Width %in% outliers)
# let's use the position to index the object
iris[which(iris$Sepal.Width %in% outliers), c("Sepal.Width", "Species")]

In the previous case, we considered outliers in relation to the distribution of the variable for all species together. It is reasonable to assume that each species has a distinct morphometric pattern so that we could identify species-specific outliers.

boxplot(Sepal.Width ~ Species, data = iris)
my_boxplot2 <- boxplot(Sepal.Width ~ Species, data = iris, plot = FALSE)
my_boxplot2
# the object is a list and the outliers are stored in the $out element of the list
outliers2 <- my_boxplot2$out
# in this case, we only want the outliers of the setosa species
# let's use the position to index the object
iris[iris$Sepal.Width %in% outliers2 &
       iris$Species == "setosa",
     c("Sepal.Width", "Species")]

Checking if the data follow a normal distribution

For many statistical analyses, the data are expected to assume a normal distribution. This is not always the standard in biological data. The type of data distribution will depend on the nature of the data. If your data does not follow a normal distribution, this could either be because the nature of the data is not normal (Gaussian) or you do not have a sufficient sample size. If the data is not normal, one approach may be to transform the data or find an analysis appropriate to the type of distribution of the data (eg if you are doing a linear regression, you can make a generalized linear model suitable for your distribution).

Let’s look at the morphometric data of the Iris species and compare it to a normal distribution. In R, this can be done visually with the qqnorm() and qqline() functions.

par(mfrow = c(1, 3))
qqnorm(iris$Sepal.Length[iris$Species == "setosa"],
       main = "setosa")
qqline(iris$Sepal.Length[iris$Species == "setosa"])
qqnorm(iris$Sepal.Length[iris$Species == "versicolor"],
       main = "versicolor")
qqline(iris$Sepal.Length[iris$Species == "versicolor"])
qqnorm(iris$Sepal.Length[iris$Species == "virginica"],
       main = "virginica")
qqline(iris$Sepal.Length[iris$Species == "virginica"])
par(mfrow = c(1, 1))

Relationship between variables

A function in R that helps us explore the relationship between many variables is pairs(). The result is a matrix with variables in rows and columns. The graph we see is the scatterplot for each pair of variables. The diagonal of the matrix contains the variable names. Note that the graph is mirrored so that the relationship between sepal size and length appears both in row 1 and column 2 and in row 2 and column 1.

pairs(vars)

EXTRA!

The R package GGally provides a very interesting output to evaluate the relationships between variables as it already shows the correlation value between variables and the probability density curve of each of the variables.

# EXCEPTIONALLY let's load the package now, as this is a bonus exercise.
library("GGally")
ggpairs(vars)