Visualize Data
ggplot()
¶
ggplot2
is a very popular package for making visualizations. It is
built on the “grammar of graphics”. Any plot can be expressed from the
same set of components: a data set, a coordinate system, and a set of
“geoms” or the visual representation of data points such as points,
bars, lines, or boxes. This is the template we build on:
ggplot(data = <DATA>, aes(<MAPPINGS>)) +
<geom_function>() +
geom_bar()
¶
We just used the count()
function to calculate how many samples are in
each group. The function for creating bar graphs geom_bar()
also
makes use of stat = "count"
to plot the total number of observations
per variable. Let’s use ggplot2 to create a visual representation of how
many samples there are per tissue, sex, and hardiness.
# Visualizing data with ggplot2
ggplot(samples, aes(x = SMTS)) +
geom_bar(stat = "count")
In the last section, we will discuss how to modify the themes()
to
adjust the axes, legends, and more. For now, let’s flip the x and y
coordinates so that we can read the sample names. We do this by adding a
layer and the function coord_flip()
ggplot(samples, aes(x = SMTS)) +
geom_bar(stat = "count") +
coord_flip()
Now, there are two ways we can visualize another variable in addition to tissue. We can add color or we can add facets.
Let’s first color the data by age bracket. Color is an aesthetic, so it
must go inside the aes()
. If you include aes(color = AGE)
inside
ggplot()
, the color will be applied to every layer in your plot. If
you add aes(color = AGE)
inside geom_bar()
, it will only be applied
to that layer (which is important later when you layer multiple geoms).
head(samples)
ggplot(samples, aes(x = SMTS, color = AGE)) +
geom_bar(stat = "count") +
coord_flip()
Note that the bars are outlined in a color according to Hardy scale. If
instead, you would the bars “filled” with color, use the aesthetic
aes(fill = AGE)
ggplot(samples, aes(x = SMTS, fill = AGE)) +
geom_bar(stat = "count") +
coord_flip()
Now, let’s use facet_wrap(~SEX)
to break the data into two groups
based on the variable sex.
ggplot(samples, aes(x = SMTS, fill = AGE)) +
geom_bar(stat = "count") +
coord_flip() +
facet_wrap(~SEX)
With this graph, we have an excellent overview of the total numbers of RNA-Seq samples in the GTEx project, and we can see where we are missing data (for good biological reasons). However, this plot does not show us Hardy Scale. It’s hard to layer 4 variables, so let’s remove Tissue as a variable by focusing just on one Tissue.
Exercise¶
Create a plot showing the total number of samples per Sex, Age Bracket, and Hardy Scale for just the Heart samples. Paste the code you used in the chat.
There are many options. Here are a few.
```r ggplot(samples, aes(x = DTHHRDY, fill = AGE)) + geom_bar(stat = "count") + facet_wrap(~SEX)
ggplot(samples, aes(x = AGE, fill = as.factor(DTHHRDY))) + geom_bar(stat = "count") + facet_wrap(~SEX) ```
One thing these plots show us is that we don’t have enough samples to test the effects of all our experimental variables (age, sex, tissue, and hardy scale) and their interactions on gene expression. We can, however, focus on one or two variables or groups at a time.
Earlier, we imported the file “data/GTEx_Heart_20-29_vs_50-59.tsv” and saved it as “results”. This file contains the results of a differential gene expression analysis comparing heart tissue from 20-29 to heart tissue from 50-59 year olds. This is a one-way design investigating only the effect of age (but not sex or hardy scale) on gene expression in the heart. Let’s visualize these results.
geom_point()
¶
Volcano Plots
are a type of scatter plots that show the log fold change (logFC) on the
x axis and the inverse log -log10()
of a p-value that has been
corrected for multiple hypothesis testing (adj.P.Val). Let’s create a
Volcano Plot using the gplot()
and geom_point()
. Note: this may
take a minute because there are 15,000 points that must be plotted
ggplot(results, aes(x = logFC, y = -log10(adj.P.Val))) +
geom_point()
The inverse log of p < 0.05 is 1.30103. We can add a horizontal line to
our plot using geom_hline()
so that we can visually see how many genes
or points are significant and how many are not.
ggplot(results, aes(x = logFC, y = -log10(adj.P.Val))) +
geom_point() +
geom_hline(yintercept = -log10(0.05))
We can add color to a plot by specifying the color and/or fill. If we wanted to color all the points blue, we could do this by specifying aes(color = "blue")
. Alternatively, we could color points along a gradient by specifying aes(color = adj.P.Val)
.
I find it useful to color the points by whether or not the genes are significant. We can do this by adding an if else statement: aes(color = ifelse( adj.P.Val < 0.05, "p < 0.05", "NS"))
. This essentially create two factors called "p < 0.05" and "NS" and gives each factor a color different color on the plot.
ggplot(results, aes(x = logFC, y = -log10(adj.P.Val))) +
geom_point(aes(color = ifelse( adj.P.Val < 0.05, "p < 0.05", "NS"))) +
geom_hline(yintercept = -log10(0.05))
To clean up the plot, we can move the legend to the bottom with theme(legend.position = "bottom")
.
Then, we can add titles, subtitles and captions and rename axes and legends with labs()
.
ggplot(results, aes(x = logFC, y = -log10(adj.P.Val))) +
geom_point(aes(color = ifelse( adj.P.Val < 0.05, "p < 0.05", "NS"))) +
geom_hline(yintercept = -log10(0.05)) +
theme(legend.position = "bottom") +
labs(color = "20-29 vs 50-59 year olds",
subtitle = "Heart Tissue Gene Expression")
We have only scratched the surface of plotting, but these basics will get you started. Try to craete a plot on your own.
Exercise¶
Create a volcano plot for the results comparing the muscle* tissue of 20-29 year olds to that of 50-59 year olds? Are more or less differentially expressed gene between between this age group in the heart or muscle?
df <- read.table("./data/GTEx_Muscle_20-29_vs_50-59.tsv")
ggplot(df, aes(x = logFC, y = -log10(adj.P.Val))) +
geom_point() +
geom_hline(yintercept = -log10(0.05))
geom_boxplot()
¶
In addition to containing information about the donor tissue, the samples file contains has a column with a RIN score, which tells us about the quality of the data. If we wanted to look for interactions between RIN score (SMRIN) and sequencing facility (SMCENTER), we can use a box plot.
ggplot(samples, aes(x = SMCENTER, y = SMRIN)) +
geom_boxplot() +
geom_jitter(aes(color = SMRIN))
Now you know a handful of R functions for importing, summarizing, and visualizing data. In the next section, we will tidy and transform our data so that we can make even better summaries and figures. In the last section, you will learn ggplot function for making fancier figures.
Key functions¶
Function | Description |
---|---|
ggplot2 |
An open-source data visualization package for the statistical programming language R |
ggplot() |
The function used to construct the initial plot object, and is almost always followed by + to add component to the plot |
aes() |
Aesthetic mappings that describe how variables in the data are mapped to visual properties (aesthetics) of geoms |
geom_point() |
A function used to create scatter plots |
geom_bar() |
A function used to create bar plots |
coord_flip() |
Flips the x and y axis |
geom_hline() |
Add a horizontal line to plots |