Quantitative Image Analysis Data Visualization
Plotting and Analysis of Quantitative Image Analysis Data
Last time we were working on some real data obtained from a quantitative image analysis experiment. This time we will pick up where we left off and look at a few more metrics and explore some more plotting options.
Goals
- read in the data from online repository
- visual inspection of outliers
- statistical outlier detection
- additional graphing
##Import the Tidy Data
The original data set in this case was untidy and required a bit of work before we could do anything with it. To save some time and to prevent redundancy, I have provided a cleaned up version of the data for direct download. This data can be directly loaded in from my github repository using the RCurl package.
library(dplyr)
library(RCurl)
library(ggplot2)
#read in the file from github
urlBase <-'https://raw.githubusercontent.com/DrContrary/R4biology/master/Data%20Files/'
mkCon <- function(nm) {
textConnection(getURL(paste(urlBase,nm,sep='/')))
}
sytox_data_full <- read.table(mkCon('sytox_data_full.csv'), sep=',',header=T,comment.char='', row.names = 1)
sytox_data_full$glucose = as.character(sytox_data_full$glucose)
Generate the Statistics Table Again
Now that we have the data, lets run our statistics again so that we have them for graphing. Remember the full dataset contains the full range of glucose treatments as well as treatment. Our summary statistics will include the mean percentage of dead cells as well as the standard deviation. For ease of reading, we also use the arrange function. In this case the function sorts our data by treatment.
#generate the table of statisitcs
mean_stats_full = sytox_data_full %>%
group_by(glucose, mye.treatment) %>%
summarise(percent_dead_mean = mean(percent_dead), sd_percent = sd(percent_dead)) %>%
arrange(mye.treatment)
mean_stats_full
## # A tibble: 18 x 4
## # Groups: glucose [9]
## glucose mye.treatment percent_dead_mean sd_percent
## <chr> <fct> <dbl> <dbl>
## 1 0 +_Mye 1.78 0.458
## 2 1 +_Mye 3.59 1.11
## 3 1.5 +_Mye 2.93 1.19
## 4 2 +_Mye 1.10 0.482
## 5 2.5 +_Mye 2.55 1.57
## 6 3 +_Mye 4.36 6.52
## 7 3.5 +_Mye 1.07 0.939
## 8 4 +_Mye 1.43 0.664
## 9 4.5 +_Mye 1.05 0.573
## 10 0 No_Mye 2.19 0.568
## 11 1 No_Mye 3.87 1.60
## 12 1.5 No_Mye 3.35 0.824
## 13 2 No_Mye 4.26 1.63
## 14 2.5 No_Mye 3.11 1.18
## 15 3 No_Mye 2.61 1.25
## 16 3.5 No_Mye 8.39 4.84
## 17 4 No_Mye 36.6 9.67
## 18 4.5 No_Mye 44.4 28.9
Generate Boxplot in ggplot to Visually Detect Outliers
The type of boxplot that we are talking about here is the box and whisker plot. If you recall from statistics each point on the box shows a specific data summary point. While the whiskers can be defined a number of ways, the default output in ggplot is Tukey’s style we will focus only on this definition. This method define the points as such:
- lower hinge: 25th percentile (first quartile)
- upper hinge: 75th percentile (third quartile)
- middle hinge: 50th percentile (mean)
- upper whisker: from upper hinge to largest value or 1.5 X IQR (inter-quartile range)
- lower whisker: from lower hinge to smallest value or 1.5 X IQR (inter-quartile range)
Outliers are those points outside the end of the whiskers.
For more information you can reference the open source paper here (1978) https://www.jstor.org/stable/2683468?origin=crossref&seq=1#page_scan_tab_contents
Using geom_boxplot
#box plot to show data and outliers
ggplot(sytox_data_full,
aes(x= glucose, y = percent_dead, fill = mye.treatment)) +
geom_boxplot()
For more information on the different arguments you can use with the boxplot look here: https://ggplot2.tidyverse.org/reference/geom_boxplot.html
Grubbs Test for Outliers
For illustration purposes only, the Grubbs Test will be used to determine if some of the outliers shown on the boxplot can be removed. I say illustration only because while it may be statistically sound to remove outliers from a dataset, I am of the belief that they should not be removed from biological data. However, in some fields such as chemistry the removal of outliers is common. Additionally, in the context of machine learning, the removal of outliers is often required for proper training.
Why the Grubbs Test?
There are a number of outlier tests and each has a certain set of assumptions and should be used under certain situations. The Grubbs Test allows for the removal of only one data point and is well suited for data with few observations. For easy implementation of this test we will use the outliers package.
#install.packages("outliers")
library(outliers)
The function is defined as : grubbs.test(x, type = 10, opposite = FALSE, two.sided = FALSE)
otest = sytox_data_full %>%
filter(glucose ==3) %>%
filter(mye.treatment == "+_Mye")
grubbs.test(otest$percent_dead)
##
## Grubbs test for one outlier
##
## data: otest$percent_dead
## G = 1.7802000, U = 0.0096713, p-value = 0.001012
## alternative hypothesis: highest value 15.9722222222222 is an outlier
Based on the results, we can conclude that at 3.0g/L glucose with treatment there is one outlier present. Just looking at the boxplot, it is evident without testing that this point is well beyond the end of the whisker.
otest
## timepoint mye.treatment glucose DAPI Sytox total_cells percent_dead
## 1 24hr +_Mye 3 1190 9 1199 0.7506255
## 2 24hr +_Mye 3 1491 23 1514 1.5191546
## 3 24hr +_Mye 3 1620 41 1661 2.4683925
## 4 24hr +_Mye 3 1690 19 1709 1.1117613
## 5 24hr +_Mye 3 121 23 144 15.9722222
If we look at the raw data, it becomes clear that an error likely occurred in the total cell counts since the DAPI numbers are significantly lower than the other observations. As a result the total percentage of dead cells is much higher for this sample.
Boxplots With and Without Outliers Shown
You may have started to notice a pattern by now, that there are many ways to accomplish the same task in R. This is true as well for plotting multiple plots together. In a separate tutorial it would be useful to go through the multiple packages and methods for doing so, but for now we will simply use the cowplot package. It has a number of features that really shine when it comes to creating multi-panel graphs like those you see in publications.
#install.packages("cowplot")
library(cowplot)
#box plot to show data and outliers
nooutlier =ggplot(sytox_data_full,
aes(x= glucose, y = percent_dead, fill = mye.treatment)) +
geom_boxplot(outlier.shape = NA)+
ylab("Percentage of Dead Cells") +
xlab("Glucose Concentration (g/L)") +
labs(title = "Outliers Hidden")+
scale_fill_discrete(name="Experimental Condition",
breaks=c("+_Mye", "No_Mye"),
labels=c("Treated", "Untreated"))+
theme(legend.position="bottom")
woutlier =ggplot(sytox_data_full,
aes(x= glucose, y = percent_dead, fill = mye.treatment)) +
geom_boxplot()+
ylab("Percentage of Dead Cells") +
xlab("Glucose Concentration (g/L)") +
labs(title = "Outliers Shown")+
scale_fill_discrete(name="Experimental Condition",
breaks=c("+_Mye", "No_Mye"),
labels=c("Treated", "Untreated"))+
theme(legend.position="bottom")
plot_grid(woutlier, nooutlier, labels = "AUTO")
Showing these side by side demonstrates the subtle differences. The most important aspect to note is that the boxes in B are not different than those in A. The removal of the outlying points only occurs on the graph and does not effect the underlying data. As such the IQRs and whiskers remain unchanged. If your goal is to remove the outliers entirely, you must remove them from the dataset to propagate the changes in the boxes.
High Impact Journal Formatting
While the boxplot may be a great way to present data, it isn’t as common to see in many of your top tier medical and biological journals. Humans are creatures of habit, we have always used the bargraph so that is what we turn to when given a choice. However if you keep up with your reading, you might notice that there has been more of a push to publish what I will call “complete data”. This means all of the data points are presented along with the mean and standard deviation. One method I have noticed recently is overlying individual points on top of your standard bar graph.
Luckily because of the laying method for generating plots in ggplot, we can easily achieve this type of graph using the data we already have.
library(ggplot2)
Because we are using two individual data sets (mean_stats_full and sytox_data_full) and plotting them on the same graph, we must set one as the default. Since our bars and error bars are from the same stats dataframe, we will use this as default and indicate this in the ggplot() argument.
To overlay the points on top, we must pull the data from the full sytox_data_full dataframe. To do so we specify the data for geom_point and provide the respective x and y values in the aes() arguments.
#high impact journal formatting
ggplot(mean_stats_full,
aes(x= glucose, y = percent_dead_mean, fill = mye.treatment)) +
geom_bar(stat = "identity", position = "dodge") +
geom_point(data = sytox_data_full,
aes(x= glucose, y = percent_dead, fill = mye.treatment),
position = position_dodge(width = 0.9),
shape = 21, #shape with outline
size = 3)+ #size of point
geom_errorbar(aes(ymin=percent_dead_mean-sd_percent, ymax=percent_dead_mean+sd_percent),
size=.3, # Thinner lines
width=.2,
position = position_dodge(width = 0.9)) +
ylab("Percentage of Dead Cells") +
xlab("Glucose Concentration (g/L)") +
labs(title = "Barplot of Data using ggplot2") +
scale_fill_discrete(name="Experimental \n Condition",
breaks=c("+_Mye", "No_Mye"),
labels=c("Treated", "Untreated"))+
theme_bw()
NOTE: use “\n” to create a line break in a title
Order matters in the outcome of the final plot. If you plot the errorbar layer before the geom_point layer your points will cover the black lines of the errorbars.