Theiris
datasets was collected by Edgar Anderson and became known by the work Ronald E. Fisher. Let’s load the data in R.
After loading the dataset, use the command ?iris
to learn more about the dataset. Let’s start with basic data inspections.
aggregate
and tapply
How many observations per species?
table(iris$Species)
What is the average of the variables by species? Let’s use the aggregate
and tapply
functions. The two functions are similar, what changes are the arguments and the output format of each one of them.
# sepal length mean per species
tapply(X = iris$Sepal.Length, INDEX = list(iris$Species), FUN = mean)
# the same operation, using another function, other arguments and other output
aggregate(x = iris$Sepal.Length, by = list(iris$Species), FUN = mean)
# still the same operation, the same function but a different notation
aggregate(Sepal.Length ~ Species, data = iris, mean)
Using a logical test to count how many NAs exist in the data.
\(% fonte: http://ncss-tech.github.io/stats_for_soil_survey/chapters/4_exploratory_analysis/4_exploratory_analysis.html#4_descriptive_statistics\)
The main descriptive statistics that we will use are:
Parameter | Description | R function |
---|---|---|
average | arithmetic mean | mean() |
median | core value | median() |
mode | most frequent value | sort(table(), decreasing = TRUE)[1] |
standard deviation | variation around the mean | sd() |
quantiles | cut points dividing a probability distribution | quantile() |
\(\overline{x}= \frac{\sum^{n}_{i=1}x_i}{n}\)
vars <- iris[, -5]
apply(vars, 2, mean)
apply(vars, 2, median)
\(s^{2}=\sum^{n}_{i=1}\frac{(x_i−\overline{x})^2}{n−1}\)
apply(vars, 2, var)
\(s=\sqrt{\sum^{n}_{i=1}\frac{(x_i−\overline{x})^2}{n−1}}\)
\(CV=\frac{s}{\overline{x}}\times100\)
There is no function in base R to calculate the coefficient of variation. This is not a problem. Let’s create an R function for that.
It is the value that cuts the nth percentage of data values when sorted in ascending order. By default, the quantile()
function returns the minimum, the 25th percentile, the median, the 50th percentile, the 75th percentile, and the maximum (which is very similar to the return of the summary()
function of a numeric vector). The five numbers divide the data into four quantiles, which are therefore called quartiles. The quartiles are a very useful metric to describe the data because they have a simple interpretation and are not affected by the distribution of the data. It is possible to modify the desired percentiles with the probs
argument.
The range is the difference between the highest and lowest value of a given variable.
# range function return the min and max
apply(vars, 2, range)
# applying the diff function to the range result, we have the desired value
# a good practice is to never overwrite an existing object in R, so
# never name an object with an already existing name
my_range <- function(x){
diff(range(x))
}
apply(vars, 2, my_range)
The IQR is the difference between the top quartile (75%) and the bottom quartile (25%).
apply(vars, 2, IQR)
A correlation matrix is a table with the correlation values between each of the pairwise variables. Variables can be positively correlated (positive values) or negatively (negative values). What are highly correlated variables? A good “rule of thumb” is that any \(\geq{0.7}\) correlation is considered a high correlation.
cor(vars)
Looking at the distribution of data is essential (Anscombe already said). Frequency distributions are useful because they help us visualize the center and distribution of the data. We are very used to the normal distribution, but biological data can assume different continuous or discrete probability distributions.
A brief description of the graphical methods:
Graph Type | Description |
---|---|
bars | each bar represents the frequency of observations of a group |
histogram | each bar represents the frequency of observations in a given set of values |
density | estimation of statistical distribution based on data |
quantile-quantile | comparison of data in relation to a normal distribution |
box plot | visual representation of mean, quartiles, symmetry and outliers |
dispersion | graphic display of continuous variables on the x and y axes |
A bar graph shows the frequency of observations in a given class.
In this case, all species have the same number of observations.
The histogram is the bar graph equivalent for continuous variables. Each bar represents a range of values. The number of intervals can be specified by the user and affects the perception of data distribution.
Let’s see an example of a standard histogram for Iris species data.
Now for the sepal length of Iris species, let’s see the effect of the number of intervals in the histogram with the breaks
argument.
The density curve shows the probability of observing a certain value. In comparison to the histogram, on the y axis, instead of having the frequency, we have the probability density.
In R we can see the density curve by plotting the density()
function.
Box-plots are graphs that represent the summary of five Tuckey numbers showing the quartiles (25%, 50% and 75%), minimum, maximum and outliers values. The box-plot shows the shape of the data distribution, the shape of the distribution and the ability to compare with other variables on the same scale.
Understanding the box plot:
The boxplot box contains the first quartile (25%, lower quartile) and the third quartile (75%, upper quartile). The median (50%, second quartile) is the black line inside the box. The extremes of the graph (“whiskers”) show data at a distance of \(1.5\times\)IIQ above and below the third quartile and the first quartile. Any data beyond the segment is considered an outlier.
What is an outlier?
Outliers are not a bug (necessarily, but they can be). Outliers represent extreme values compared to the rest of the data. It is always important to evaluate outliers to ensure they are correct measurements.
Let’s now make the box-plots of the variables contained in the iris
object. Let’s start with the general variables.
boxplot(iris$Sepal.Length)
boxplot(iris$Sepal.Width)
boxplot(iris$Petal.Length)
boxplot(iris$Petal.Width)
Now let’s look at the values per species.
boxplot(Sepal.Length ~ Species, data = iris)
boxplot(Sepal.Width ~ Species, data = iris)
boxplot(Petal.Length ~ Species, data = iris)
boxplot(Petal.Width ~ Species, data = iris)
Do you identify outliers in the dataset? How can we check outliers? Let’s use the boxplot()
function itself to identify the outliers.
boxplot(iris$Sepal.Width)
my_boxplot <- boxplot(iris$Sepal.Width, plot = FALSE)
my_boxplot
# the object is a list and the outliers are stored in the $out element of the list
outliers <- my_boxplot$out
# what is the position of the outliers
which(iris$Sepal.Width %in% outliers)
# let's use the position to index the object
iris[which(iris$Sepal.Width %in% outliers), c("Sepal.Width", "Species")]
In the previous case, we considered outliers in relation to the distribution of the variable for all species together. It is reasonable to assume that each species has a distinct morphometric pattern so that we could identify species-specific outliers.
boxplot(Sepal.Width ~ Species, data = iris)
my_boxplot2 <- boxplot(Sepal.Width ~ Species, data = iris, plot = FALSE)
my_boxplot2
# the object is a list and the outliers are stored in the $out element of the list
outliers2 <- my_boxplot2$out
# in this case, we only want the outliers of the setosa species
# let's use the position to index the object
iris[iris$Sepal.Width %in% outliers2 &
iris$Species == "setosa",
c("Sepal.Width", "Species")]
For many statistical analyses, the data are expected to assume a normal distribution. This is not always the standard in biological data. The type of data distribution will depend on the nature of the data. If your data does not follow a normal distribution, this could either be because the nature of the data is not normal (Gaussian) or you do not have a sufficient sample size. If the data is not normal, one approach may be to transform the data or find an analysis appropriate to the type of distribution of the data (eg if you are doing a linear regression, you can make a generalized linear model suitable for your distribution).
Let’s look at the morphometric data of the Iris species and compare it to a normal distribution. In R, this can be done visually with the qqnorm()
and qqline()
functions.
par(mfrow = c(1, 3))
qqnorm(iris$Sepal.Length[iris$Species == "setosa"],
main = "setosa")
qqline(iris$Sepal.Length[iris$Species == "setosa"])
qqnorm(iris$Sepal.Length[iris$Species == "versicolor"],
main = "versicolor")
qqline(iris$Sepal.Length[iris$Species == "versicolor"])
qqnorm(iris$Sepal.Length[iris$Species == "virginica"],
main = "virginica")
qqline(iris$Sepal.Length[iris$Species == "virginica"])
A function in R that helps us explore the relationship between many variables is pairs()
. The result is a matrix with variables in rows and columns. The graph we see is the scatterplot for each pair of variables. The diagonal of the matrix contains the variable names. Note that the graph is mirrored so that the relationship between sepal size and length appears both in row 1 and column 2 and in row 2 and column 1.
pairs(vars)
The R package GGally
provides a very interesting output to evaluate the relationships between variables as it already shows the correlation value between variables and the probability density curve of each of the variables.