library(edgeR)
library(tidyverse)6 RNA-seq - Part 1
In this chapter we will run through the basic steps for analysing a simple RNA-seq experiment using the limma-voom workflow. This includes:
- filtering out lowly expressed genes
- normalisation
- creating a multidimensional scaling (MDS) plot
- creating a design matrix
- fitting gene-wise linear models (with empirical Bayes moderation to more accurately estimate gene-wise variability)
- performing statistical testing for differential expression
Compared to the original article linked above, this workflow has been modified for simplicity but retains all the crucial steps and concepts.
The aim of this chapter is to give you experience with a real-world RNA-seq analysis, and making extensive use of an external library. We will not cover the statistics in any depth. Instead, the goal is to understand how to construct data structures required for specific packages and how to use the functions in those packages to perform the analysis.
Much of the materials here are explained in greater detail in the limma user’s guide. You can view this by typing help("limma") and following the links.
- Constructing a
DGEListobject to use with theedgeRpackage - Performing filtering of lowly expressed genes
- Normalising RNA-seq data to account for library size and RNA composition
- Performing exploratory data analysis using MDS plots
- Pulling data out of
edgeRpackage objects and re-organising it into a tidy format for use withggplot2.
6.1 Data files
We will work with a counts matrix similar to that generated by Subread featureCounts, and an Entrez gene annotation file.
Save these files in the data/ directory within your R project parent directory.
6.2 R Packages
Let’s load edgeR (Estimating Digital Gene Expression in R), which also loads the limma package. We also load the tidyverse packages, which will be used to extract and plot data from the edgeR and limma objects.
If you get an error, the package can be downloaded by the following commands.
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("edgeR")6.3 Loading in read count data and annotation
The data we are looking at comes from three cell populations (basal, luminal progenitor (LP) and mature luminal (ML)) sorted from the mammary glands of female virgin mice, each profiled in triplicate. These samples have been sequenced using Illumina RNAseq platforms and we will be using the edgeR package to analyse the data.
Let’s read in the raw counts and gene annotation data:
counts_tbl <- read_tsv('data/rnaseq_counts.tsv')
gene_anno <- read_tsv('data/Ses3_geneAnnot.tsv')6.4 Creating the DGEList object
To use the edgeR package we first need to construct a DGEList object. This object contains 3 key pieces of data:
counts: the main data of this object, a matrix of count values with samples along the columns and features/genes along the rows. All data must be numeric!samples: a data frame containing annotation for the samples. The rows in this table describe the corresponding column of the counts data.samplesis created automatically by theDGEList()function.genes: a data frame containing annotation for the genes in the counts matrix. The rows in this table describe the corresponding row in the counts matrix. Importantly, the row order forcountsandgenesmust be identical for your analysis to run as expected!
Data objects in general are a way to store related pieces of information together. Often they provide features to maintain the relationships between the pieces of information. For example, in a DGEList object, the counts are stored as a matrix while samples and genes are stored as data frames. When you subset the DGEList object, the rows and columns of the counts matrix are subset along with the corresponding rows of the samples and genes data frames. This prevents errors that can occur if the user tries to manually subset all three pieces of information separately.
A common way to create a DGEList object is to read in the count data as a matrix or data frame, and then create the DGEList object using the DGEList() function.
What you can do with data objects is determined by package that the object comes from. You will need to read the package documentation to find out how to access data from the object. Here edgeR informs us that the DGEList has data that can be accessed as if it was a list.
DGEList converts numeric data into matrix format for the counts element.
dge <- DGEList(counts = counts_tbl %>% select(-1),
genes = gene_anno)The samples table is automatically created based on information in counts
dge$samples group lib.size norm.factors
10_6_5_11 1 32863052 1
9_6_5_11 1 35335491 1
purep53 1 57160817 1
JMS8-2 1 51368625 1
JMS8-3 1 75795034 1
JMS8-4 1 60517657 1
JMS8-5 1 55086324 1
JMS9-P7c 1 21311068 1
JMS9-P8c 1 19958838 1
We have the library size (column sums for counts matrix) pre-computed. Note that the experimental group has not yet been defined. To do this we first create a text vector encoded as a factor.
6.4.1 A brief detour to factors
According to the original experimental description, the counts in each column are either from basal, luminal progenitor (LP) or mature luminal (ML) cell populations. The exact experimental groups are as follows
groups <- c("LP", "ML", "Basal", "Basal", "ML", "LP", "Basal", "ML", "LP")
groups <- parse_factor(groups, levels = c("Basal", "LP", "ML"))Note that when we declared the groups variable, we used factors. Factors are a special data type that is used to encode categorical variables. They have multiple useful properties in comparison to regular character vectors: - They allow you to specify the order of the levels. - They are stored as integers but displayed as labels. - They encode all the valid levels of the factor, even if they are not present in the data.
Specifying the order of the levels is useful because it allows you to re-arrange labels when used in plots. In this data we would have the “Basal” group first, followed by “LP” and then “ML”. Using parse_factor() also allows you to check that the values are all valid levels, for example if one of the samples was labelled “Bassal” instead of “Basal”, it would throw an error. You can read the R for Data Science chapter on factors for more information, as well as the forcats package for many useful functions for working with factors.
6.5 Finalising the DGEList object
To specify the experimental groups, we can update the group column in the samples table as follows:
dge$samples$group <- groupsLastly we can set the rownames (a text vector) for the counts and genes tables, which is useful to avoid losing information downstream.
rownames(dge$counts) <- counts_tbl %>% pull(ENTREZID)
rownames(dge$genes) <- gene_anno %>% pull(ENTREZID)
head(dge$counts) 10_6_5_11 9_6_5_11 purep53 JMS8-2 JMS8-3 JMS8-4 JMS8-5 JMS9-P7c
497097 1 2 342 526 3 3 535 2
100503874 0 0 5 6 0 0 5 0
100038431 0 0 0 0 0 0 1 0
19888 0 1 0 0 17 2 0 1
20671 1 1 76 40 33 14 98 18
27395 431 771 1368 1268 1564 769 818 468
JMS9-P8c
497097 0
100503874 0
100038431 0
19888 0
20671 8
27395 342
head(dge$genes) ENTREZID SYMBOL TXCHROM
497097 497097 Xkr4 chr1
100503874 100503874 Gm19938 <NA>
100038431 100038431 Gm10568 <NA>
19888 19888 Rp1 chr1
20671 20671 Sox17 chr1
27395 27395 Mrpl15 chr1
Now we have a data object that can be used for downstream analysis.
6.6 Filtering
The first step of our analysis is to filter out lowly expressed genes. There are two main problems with low abundant genes:
- Technical variation is more problematic for low abundance genes. This variation is thought to be due to two factors; insufficient mixing and low sampling fraction1.
- Insufficient mixing of solutions during library preparation can result in uneven distribution of reads.
- RNA sequencing can be thought of as sampling. Measurement errors will occur simply due to the random nature of the sampling process. This problem affects lowly abundant RNA species more because the relative error for small count values is larger than it would be for more highly abundant RNA species.
- Genes that are very lowly expressed do not produce sufficient information to be useful for biological interpretation. For example, it is very hard to believe the biological significance of genes that have counts ranging from 0 to 3 across samples even if come up as statistically significant.
Removing these highly variable, lowly expressed genes increases your ‘power’ to detect differentially expressed genes2, where ‘power’ is your ability to detect true positives. In testing for differential expression, a statistical test is conducted for each gene. When a high number of statistical tests are performed, a portion of them will be significant purely due to random chance. A common procedure to control for the number of false positive is to perform ‘multiple testing correction’ on the p-values. This adjusts the p-value in a way that reduces the number of false positives but comes at the cost of reduced power to detect true positives. If we filter out uninteresting, lowly expressed genes, we need to perform fewer statistical tests and reduce the impact that multiple testing adjustment has on detection power.
The edgeR provides the filterByExpr() function to automate gene filtering. By default, it aims to keep genes with a count of 10 or more, in at least as many samples as the smallest experimental group. In our experiment, there are 3 phenotype groups each with 3 samples. Therefore we retain only genes that have 10 or more counts in 3 or more samples. The actual filtering is done on counts per million, prevent bias against samples with small library sizes. This complex procedure is the reason why the package provides a function to perform the filtering for you.
The output of this function is a vector of logicals, indicating which genes (rows) should be kept and which filtered.
Its good practice to create a new DGEList for each step of the filtering and normalization process, to allow easy before-and-after comparisons. We create dge_filt here:
keep <- filterByExpr(dge)
table(keep)keep
FALSE TRUE
10555 16624
proportions(table(keep))keep
FALSE TRUE
0.3883513 0.6116487
dge_filt <- dge[keep, , keep.lib.sizes = FALSE]Compare the effect of filtering out lowly-expressed genes on the data dimensions
dim(dge)[1] 27179 9
dim(dge_filt)[1] 16624 9
We can see that we now have 16624 genes. We started with 27179 genes - meaning that ~40% of genes have been filtered out.
6.7 Library-size normalisation
After filtering, our next step is to normalise the data. Normalisation refers to the process of adjusting the data to reduce or eliminate systematic bias. This allows the data to be meaningfully compared across samples or experimental groups.
There are two main factors that need to be normalised for in RNA-seq:
- Sequencing depth/library size - technically, sequencing a sample to half the depth will give, on average, half the number of reads mapping to each gene3.
- RNA composition - if a large number of genes are unique to, or highly expressed in, only one experimental condition, the sequencing capacity available for the remaining genes in that sample is decreased. For example, if there are only five genes being studied in two experimental groups, if one gene is particularly high in group A, then with limited sequencing depth, that gene will reduce the counts of the remaining four genes. The effect of this is that the remaining four genes appear under-expressed in group A compared to group B when the true amount of gene product is actually equal for these 4 genes3.
Sequencing depth is accounted for by calculating the counts per million (cpm). This metric is calculated by:
- taking the library size (sum of all counts for a sample),
- dividing this by 1,000,000 to get the ‘per million’ scaling factor,
- then dividing all read counts for each gene in that sample by the ‘per million’ scaling factor
RNA composition can be accounted for by using more sophisticated normalisation methodologies. We will use ‘trimmed mean of M-values’ (TMM), which estimates relative RNA levels from RNA-seq data3. Under the assumption that most genes are not differentially expressed, TMM calculates a library size scaling factor for each library (sample). This is done using the following steps:
- calculate the gene expression log fold changes and absolute expression values for pair-wise samples (selecting one sample from the experiment as a reference)
- remove the genes with the highest and lowest fold changes and absolute expression values
- take a weighted mean of the remaining genes (where the weight is the inverse of the approximate asymptotic variances). This gives the normalisation factor for each library (sample)
Subsequent steps in this analysis will use log-cpm values, calculated using the normalisation factors, which scales each library size.
We can calculate the normalisation factors, specifying that we want to use the "TMM" method. We now create dge_norm from dge_filt.
dge_norm <- calcNormFactors(dge_filt, method = "TMM")This function calculates the normalisation factors for each library (sample) and puts this information in the samples data frame. Note that it takes dge (our DGEList object as input) and returns a DGEList object as well.
Let’s take a look at our normalisation factors:
dge_norm$samples group lib.size norm.factors
10_6_5_11 LP 32857304 0.8943956
9_6_5_11 ML 35328624 1.0250186
purep53 Basal 57147943 1.0459005
JMS8-2 Basal 51356800 1.0458455
JMS8-3 ML 75782871 1.0162707
JMS8-4 LP 60506774 0.9217132
JMS8-5 Basal 55073018 0.9961959
JMS9-P7c ML 21305254 1.0861026
JMS9-P8c LP 19955335 0.9839203
These normalisation factors are all close to 1 for all samples, suggesting minimal difference in RNA composition between samples.
6.8 Extracting tidy tables
For ease of tidy data manipulation and plotting, at this point let’s extract tidy tables from the tables nested within the DGEList!
samples_tbl <- dge_norm$samples %>% as_tibble(rownames = 'id')
counts_raw_tbl <- dge$counts %>% as_tibble(rownames='gene')
counts_filt_tbl <- dge_filt$counts %>% as_tibble(rownames='gene') 6.9 Visualising the effect of normalisation
To visualise the effect of TMM normalisation, we can plot the log-counts as a boxplot, and observe the effect of applying the normalisation. To create a boxplot of the log-counts, we can use log(dge$counts + 0.5) to create the log-count matrix. The addition of 0.5 is to avoid taking the log of zero. Then in order to use ggplot2 for plotting, we must convert the matrix to a data frame. We can use the as_tibble(rownames = "gene") function to convert the matrix to a data frame where the rownames are converted to a column called “gene”. We can then use the pivot_longer() function to convert the data frame from wide format to long format, where each row represents a single observation. This is necessary for ggplot2 to plot the data correctly.
counts_raw_tbl %>%
pivot_longer(cols = -gene,
names_to = "sample", values_to = "counts"
) %>%
mutate(log_counts = log10(counts + 0.5)) %>%
ggplot(aes(x = sample, y = log_counts)) +
geom_boxplot() +
ggtitle('Log-transformed raw counts, before filtering')
counts_filt_tbl %>%
pivot_longer(cols = -gene,
names_to = "sample", values_to = "counts"
) %>%
mutate(log_counts = log10(counts + 0.5)) %>%
ggplot(aes(x = sample, y = log_counts)) +
geom_boxplot() +
ggtitle('Log-transformed raw counts, after filtering')
We can compare this to the cpm values which when calculated using the cpm() function, automatically applies normalisation factors if they are present.
logcpm_tbl <- cpm(dge_norm$counts, log=T) %>% as_tibble(rownames='gene')logcpm_tbl %>%
pivot_longer(cols = -gene,
names_to = "sample",
values_to = "log_cpm") %>%
ggplot(aes(x = sample, y = log_cpm)) +
geom_boxplot() +
ggtitle('Expression normalized by lib size [CPM] and complexity [TMM]')
We see that by performing normalisation on our data, the gene expression values of each sample now have similar medians and quantiles. This indicates that the relative expression values of each sample can be more meaningfully compared.
6.10 MDS plots
Before we perform statistical tests, it’s useful to perform some exploratory visual analysis to get an overall idea of how our data is behaving.
MDS is a way to visualise distances between sets of data points (samples in our case). It is a dimensionality reduction technique, similar to principal components analysis (PCA). We treat gene expression in samples as if they were coordinates in a high-dimensional coordinate system, then we can find “distances” between samples as we do between points in space. Then the goal of the algorithm is to find a representation in lower dimensional space such that points that the distance of two objects from each other in high dimensional space is preserved in lower dimensions.
The plotMDS() from limma creates an MDS plot from a DGEList object.
plotMDS(dge_norm)
Each point on the plot represents one sample and is ‘labelled’ using the sample name. The distances between each sample in the resulting plot can be interpreted as the typical log2-fold-change between the samples, for the most differentially expressed genes.
We can change the labelling to use the name of the group the sample belongs to instead:
plotMDS(dge_norm, labels = groups)
This shows us that the phenotype groups tend to cluster together, meaning that the gene expression profiles are similar for samples within a phenotype group. The ‘Basal’ type samples quite close together while the ‘LP’ (luminal progenitor) and ‘ML’ (mature luminal) type samples are further apart, signifying that their expression profiles are more variable.
6.11 MDS plot using ggplot2
For more customisability and better consistency with the style of other plots, it’d be nice to be able to draw the MDS plot using ggplot2. In order to do this we would need the MDS coordinates calculated by plotMDS(). Luckily the documentation of plotMDS() shows that it returns multiple computed values.
mds_result <- plotMDS(dge_norm, plot = FALSE)
mds_resultAn object of class MDS
$eigen.values
[1] 6.844419e+01 1.154413e+01 3.790044e+00 2.994025e+00 2.021565e+00
[6] 1.613173e+00 1.202805e+00 1.094072e+00 -7.018692e-16
$eigen.vectors
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -0.2461859 0.54036454 0.12401524 0.677678569 0.20623170 0.07294476
[2,] -0.2851986 -0.33460855 0.67471878 0.007007206 -0.46099085 0.15319914
[3,] 0.4690174 -0.04145263 0.05041410 0.056945694 -0.03695319 -0.03494949
[4,] 0.4711165 -0.04907588 0.05443024 0.161230059 -0.03439028 -0.55065413
[5,] -0.2626081 -0.18747095 0.18504793 -0.355202862 0.66319443 -0.33464804
[6,] -0.1568565 0.41423458 -0.09091514 -0.468642654 0.04620796 0.26492304
[7,] 0.4670465 0.01138476 -0.03115306 -0.161357159 0.07822632 0.56941932
[8,] -0.2405667 -0.57831836 -0.56474831 0.302172046 0.07568690 0.20422089
[9,] -0.2157647 0.22494250 -0.40180977 -0.219830899 -0.53721300 -0.34445549
[,7] [,8] [,9]
[1,] -0.11742920 -0.004177629 -0.3333333
[2,] 0.04463732 -0.048147665 -0.3333333
[3,] 0.03859114 0.810760474 -0.3333333
[4,] 0.33482254 -0.468041790 -0.3333333
[5,] -0.26519650 0.047152480 -0.3333333
[6,] 0.62647876 -0.002850499 -0.3333333
[7,] -0.44130343 -0.344258233 -0.3333333
[8,] 0.19689394 -0.010674134 -0.3333333
[9,] -0.41749458 0.020236996 -0.3333333
$var.explained
[1] 0.73830888 0.12452679 0.04088328 0.03229660 0.02180666 0.01740133 0.01297468
[8] 0.01180178 0.00000000
$dim.plot
[1] 1 2
$distance.matrix.squared
10_6_5_11 9_6_5_11 purep53 JMS8-2 JMS8-3 JMS8-4
10_6_5_11 -18.127013 -5.738791 16.10192 16.03204 -5.792252 -8.3908112
9_6_5_11 -5.738791 -18.115187 17.75982 17.85109 -11.197198 -2.5513014
purep53 16.101924 17.759816 -31.64208 -29.63783 16.733543 10.6453322
JMS8-2 16.032036 17.851094 -29.63783 -32.34830 16.749432 11.0446228
JMS8-3 -5.792252 -11.197198 16.73354 16.74943 -13.580395 -4.1528811
JMS8-4 -8.390811 -2.551301 10.64533 11.04462 -4.152881 -9.8867331
JMS8-5 15.954448 18.363197 -29.18038 -28.91308 16.698190 9.6067888
JMS9-P7c -1.642914 -10.966083 15.03939 15.00405 -9.571938 0.3388923
JMS9-P8c -8.396628 -5.405548 14.18030 14.21798 -5.886502 -6.6539093
JMS8-5 JMS9-P7c JMS9-P8c
10_6_5_11 15.954448 -1.6429143 -8.396628
9_6_5_11 18.363197 -10.9660831 -5.405548
purep53 -29.180384 15.0393893 14.180296
JMS8-2 -28.913082 15.0040512 14.217981
JMS8-3 16.698190 -9.5719377 -5.886502
JMS8-4 9.606789 0.3388923 -6.653909
JMS8-5 -31.824715 15.4926837 13.802874
JMS9-P7c 15.492684 -18.8595492 -4.834532
JMS9-P8c 13.802874 -4.8345322 -11.024033
$top
[1] 500
$gene.selection
[1] "pairwise"
$axislabel
[1] "Leading logFC dim"
$x
[1] -2.036721 -2.359477 3.880228 3.897594 -2.172583 -1.297689 3.863923
[8] -1.990232 -1.785043
$y
[1] 1.83597803 -1.13688798 -0.14084219 -0.16674343 -0.63696359 1.40743060
[7] 0.03868159 -1.96493242 0.76427939
We see that there’s are x and y values in the list returned by the plotMDS() function, we can put these into a table and see if they match up to the positions plotted by the function.
mds_tbl <- tibble(
id = samples_tbl %>% pull(id),
dim1 = mds_result$x,
dim2 = mds_result$y
) %>% print()# A tibble: 9 × 3
id dim1 dim2
<chr> <dbl> <dbl>
1 10_6_5_11 -2.04 1.84
2 9_6_5_11 -2.36 -1.14
3 purep53 3.88 -0.141
4 JMS8-2 3.90 -0.167
5 JMS8-3 -2.17 -0.637
6 JMS8-4 -1.30 1.41
7 JMS8-5 3.86 0.0387
8 JMS9-P7c -1.99 -1.96
9 JMS9-P8c -1.79 0.764
ggplot(mds_tbl, aes(x = dim1, y = dim2)) +
geom_point()
The positions of the points and the range of the scales seem to match, so we want to add in more metadata to help use plot. We can try to recreate the MDS plot using just ggplot2.
First we extract the transcriptional variance explained by each dimension into a tibble, and add a column defining the dimensions.
var_expln <- tibble(propn = mds_result$var.explained * 100 ) %>%
rownames_to_column(var = 'dimension') Dimensions 1 and 2 are plotted by default. We can now extract and round the proportion of variance explained in rows 1 and 2 of the var_expln tibble
dim1_expl <- var_expln %>% filter(dimension==1) %>% pull(propn) %>% round()
dim2_expl <- var_expln %>% filter(dimension==2) %>% pull(propn) %>% round()Finally we join the full sample metadata with the MDS coordinates for plotting, using left_join()
mds_tbl_label <- left_join(samples_tbl, mds_tbl, by = 'id') %>%
print()# A tibble: 9 × 6
id group lib.size norm.factors dim1 dim2
<chr> <fct> <dbl> <dbl> <dbl> <dbl>
1 10_6_5_11 LP 32857304 0.894 -2.04 1.84
2 9_6_5_11 ML 35328624 1.03 -2.36 -1.14
3 purep53 Basal 57147943 1.05 3.88 -0.141
4 JMS8-2 Basal 51356800 1.05 3.90 -0.167
5 JMS8-3 ML 75782871 1.02 -2.17 -0.637
6 JMS8-4 LP 60506774 0.922 -1.30 1.41
7 JMS8-5 Basal 55073018 0.996 3.86 0.0387
8 JMS9-P7c ML 21305254 1.09 -1.99 -1.96
9 JMS9-P8c LP 19955335 0.984 -1.79 0.764
To recreate the plotMDS plot:
mds_tbl_label %>%
ggplot(aes(x=dim1, y=dim2)) +
geom_text(aes(label = group)) +
labs(x = paste0('Dim 1 variance explained = ', dim1_expl,'%'),
y = paste0('Dim 2 variance explained = ', dim2_expl,'%'))
Now that we have our data in a nice tidy format we can use with ggplot, it’s easy to create variations of the plot. For example we can draw points instead of group labels, and use colour to identify the groups.
mds_tbl_label %>%
ggplot(aes(x=dim1, y=dim2, colour = group)) +
geom_point() +
labs(x = paste0('Dim 1 variance explained = ', dim1_expl,'%'),
y = paste0('Dim 2 variance explained = ', dim2_expl,'%'))
Alternatively we can also use the labels to identify the individual samples while colouring them by their group.
mds_tbl_label %>%
ggplot( aes(x = dim1, y = dim2, col = group)) +
geom_point() +
geom_text(aes(label = id)) +
labs(x = paste0('Dim 1 variance explained = ', dim1_expl,'%'),
y = paste0('Dim 2 variance explained = ', dim2_expl,'%'))
We see that some labels are very hard to read due to the overlapping, so we can use the ggrepel package to fix this. The ggrepel package creates labels that repel each other in order to avoid overlap. It’s a good idea to use it in conjunction with geom_point() in order to keep track of exact coordinate of the data point.
library(ggrepel)
mds_tbl_label %>%
ggplot( aes(x = dim1, y = dim2, col = group)) +
geom_point() +
geom_text_repel(aes(label = id)) +
labs(x = paste0('Dim 1 variance explained = ', dim1_expl,'%'),
y = paste0('Dim 2 variance explained = ', dim2_expl,'%'))
6.12 Summary
Today we started on the early steps of the RNA-seq analysis workflow.
- We learned how to create a
DGEListobject, demonstrating how to create data objects required by specific packages. - We learned how to use edgeR’s
filterByExpr()function to filter out lowly expressed genes. - We learned how to normalise the data using TMM normalisation.
- We learned how to create MDS plots using both
plotMDS()andggplot2. - We learned how to use
ggrepelto create non-overlapping labels inggplot2plots.
This provides a good foundation for organising data to satisfy the requirements of specific packages, and how to pull data out of the objects created by those packages. The data we pulled out can then be organised into a tidy format to leverage the power of all the tidyverse functions that we have learned so far.
In the next chapter we will continue the RNA-seq analysis workflow and complete our differential expression analysis.