The aim of this chapter is to complete the analysis of the RNA-seq data using the limma and edgeR packages. With the results of the analysis, we want to be able to produce publication quality figures and write out the results to a file for sharing or publication.
Learning Objectives
fit a linear model to the data
construct design and contrast matrices for the linear model
model the mean-variance relationship of the data using the voom() function
perform statistical testing to identify differentially expressed genes
visualise the results using a range of common plots
write out the results to a file
# load required packageslibrary(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 27179 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): SYMBOL, TXCHROM
dbl (1): ENTREZID
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# filter lowly expressed geneskeep <-filterByExpr(dge)dge <- dge[keep, ]# normalise the data for library size and RNA compositiondge <-calcNormFactors(dge, method ="TMM")
7.1 Linear modelling
To identify differentially expressed genes from our normalised gene expression data, we will use the limma package to fit linear models to the genes. A linear model is a broad class of statistical models that fit the value of an response (or dependent) variable as a linear function of one or more explanatory (or independent) variables (also called covariates).
The general form of a linear model looks like this:
This equation is saying that a response variable of interest \(Y\) is equal to a constant (\(\beta_{0}\)) plus the sum of the covariates (\(X_{i}\)) each multiplied by a constant coefficient (\(\beta_{i}\)) and some error term (\(\epsilon\)). The error term is the difference between the observed value and the predicted value of the model and is assumed to have a normal distribution with mean 0 and some variance.
Our experiment is quite simple, since there is only a single covariate, the cell type. The true benefit of using linear models is in its ability to accommodate more complex designs including multiple covariates.
To fit the linear models in the limma-voom framework we need two objects in addition to our data:
A design matrix, representing the covariates.
A contrast matrix, representing the specific comparison we wish to make.
7.1.1 Design matrix
The first step to fitting a linear model is to specify a design matrix. The design matrix specifies the values of the covariates for each sample, and is represented as a matrix due to the mathematical convenience.
To generate a design matrix. We use the function model.matrix(), with the expression ~0 + group. This returns a matrix representing the design where there is no intercept term and group is the only covariate. If we omit the 0 then there would be an intercept in the model, and if we included more covariates then more columns would be generated.
design <-model.matrix(~0+ group, data = dge$samples)design
There are 9 rows, one for each sample. Along the columns are the names of the groups. The values in the cells denote membership of the particular sample for a particular group, as our groups in this case are mutually exclusive, each row contains only a single 1 to denote membership in a single group.
7.1.2 Contrasts
Contrast matrices are companions to design matrices. They are used to specify the comparison of interest. In our case, we have three experimental groups: Basal, LP and ML. So if we are to perform differential expression analysis, we are most likely interested in the differences between only two of the groups at a time. Contrast matrices let use specify which comparison we’d like to make, and are also represented as a matrix just like the design.
A contrast matrix can be made using the makeContrasts() function. Within this function, we specify the name of each specific contrast and the formula for that contrast. For example, the BasalvsLP contrasts compares the difference between the Basal and LP groups. Note that the name of the phenotype groups must be written exactly as they are in the column names of our design matrix (see above).
In addition to the individual contrasts, the function must know about the design of the model. This is passed through the levels argument, which either accepts a matrix with the column names corresponding to levels of your experimental groups, or the levels themselves as a character vector. Here we first simplify the column names of our design matrix to make it easier to read. Then we create the contrast matrix using the makeContrasts() function.
There are two things to note about a design matrix. First, the sum of the numbers in each column is equal to 0. Second, the way that you set up the equation in the matrix will determine the interpretation of the log-fold-change calculated later, as well as the meaning of up-regulated and down-regulated genes. For example, the contrast Basal - LP will give a positive log-fold-change for genes that are up-regulated in the Basal group compared to the LP group. If we had set up the contrast as LP - Basal, then the opposite would be true, and we would get a negative log-fold-change for genes that are up-regulated in the Basal group compared to the LP group.
7.1.3 Variance modelling with voom
We are now ready to fit our linear models. Limma fits linear models to the data with the assumption that the underlying data is normally distributed. Count data is generally not normally distributed, but log transforming count data gives it a roughly normal distribution sufficient for linear models to work well. To do this, limma transforms the raw count data to log-cpm using library sizes and the normalisation factors we calculated previously.
In addition to the normalisation steps, the limma-voom pipeline uses the voom() function to generate weights for the individual genes based on a modelled mean-variance relationship. This modelling allows us to get more information out of small sample sizes as the weights prevent our model from being more heavily influenced by more variable data points. The weights will then allow the linear model to be less influenced by genes with high variability, which is important for small sample sizes. The voom function also generates a mean-variance plot, which is useful for visualising the data.
The voom() function takes as arguments, our DGEList object and our design matrix. It also optionally outputs a plot of the mean-variance relationship of our data, called the ‘voom plot’.
v <-voom(dge, design, plot =TRUE)
The output of voom() (our variable v) is an EList object which contains the following elements:
genes - a data frame of gene annotation data.
targets - data frame of sample data.
E - numeric matrix of normalised log-cpm values.
weights - numeric matrix of precision weights.
design - the design matrix.
The ‘voom plot’ shows the mean-variance relationship of the data. The x-axis is the average log2 expression of each gene, and the y-axis is the log2 standard deviation of each gene. The idea is that the variance of a gene can be estimated as a function of its main expression, and that this estimate is more accurate than the direct calculation of the variance.
7.1.4 Fitting the linear model
We are now ready to fit our linear model with lmFit(), which calculates coefficients we defined in our design matrix design. The resulting object, vfit is a MArrayLM object. It contains a information about our genes (the same data frame as genes from our EList object v above), the design matrix and a number of statistical outputs. Of most interest to us is the coefficients, stored in an element called coefficients. The first rows of this matrix is shown below. Each gene is row and is labelled using the EntrezID. Each column gives coefficients for each of our phenotype groups. These coefficients are weighted averages of the log-cpm of each gene in each group.
We can then use contrasts.fit() to calculate coefficients for each contrast (or ‘comparison’) we specified in our contr_matrix. The output is also an object of the class MArrayLM (also known as an MArrayLM object). When we inspect the coefficients element now, we can see that each column is a contrast that we specified in our contrast matrix.
With these values we essentially want to know whether or not the difference in values is significantly different from 0. If the difference is not significantly different from 0, then we have failed to establish a difference in the expression of a particularly gene between two groups. If the difference is significantly greater than 0, then we have up-regulation in the first group of the contrast, and if the difference is significantly less than 0, then we have down-regulation in the first group of the contrast.
7.2 Statistical testing
To actually test if the values are significantly different from 0, we use the eBayes() function to compute moderated t-statistics, moderated F-statistics and log-odds of differential expression for each gene, given a fitted linear model. ‘Moderated’ refers to empirical Bayes moderation, which borrows information across genes to obtain more accurate measures of variability for each gene. This also increases our power to detect differentially expressed genes.
efit <-eBayes(vfit)
We can now look at the number of differentially expressed genes using the decideTests() function. The output of this function is a matrix where each column is a contrast (comparison of interest) and each row is a gene. The numbers 1, -1 and 0 mean up-regulated, down-regulated or not significantly differentially expressed, respectively.
Note that decideTests() also accounts for multiple testing. The default method is Benjamini and Hochberg1 but several others are also available.
To obtain the total number of differentially expressed genes for each comparison, we can add the function summary():
summary(dt)
BasalvsLP BasalvsML LPvsML
Down 4501 4850 2701
NotSig 7306 6996 11821
Up 4817 4778 2102
The function topTable() can be used to obtain more information on the differentially expressed genes for each contrast. topTable() takes as arguments the MArrayLM object output by eBayes() (efit), the contrast name of interest, and the number of top differentially expressed genes to output. Note that the contrast name must be given in quotes and must be exactly as written in the contrast matrix contr_matrix.
It outputs a data frame with the following information:
Gene details - gene information, from the gene element of the MArrayLM object (efit).
logFC - the log2 fold change of the contrast.
AveExpr - the average log2 expression of that gene.
t - moderated t-statistic.
P.Value - p value.
adj.P.Val - adjusted p value.
B - log-odds that the gene is differentially expressed.
top_bvl <-topTable(efit, coef ="BasalvsLP", n =Inf) %>%as_tibble() %>%mutate(contr ='BasalvsLP')head(top_bvl)
In this section we will make plots of both the summary statistics (topTable output), and the gene expression levels per sample, using log(counts per million) together with sample metadata from dge$samples.
7.3.1 Data extract
To start with we extract all the data required to make any of the the common RNAseq results plots. We will
extract the summary statistics for all three contrasts into separate tables
extract the sample_ids and experimental group labels, and then
extract the log(CPM) values for each sample
7.3.1.1 Extract summary tables
Volcano plots and MA plots require differential expression summary statistics. We have the summary table for BasalvsLP already extracted to top_bvl, now for BasalvsML and LPvsML:
top_bvm <-topTable(efit, coef ="BasalvsML", n =Inf) %>%as_tibble() %>%mutate(contr ='BasalvsML')top_lvm <-topTable(efit, coef ="LPvsML", n =Inf) %>%as_tibble() %>%mutate(contr ='LPvsML')
7.3.1.2 Extract log(counts per million)
Box plots and heatmaps require the gene expression data per sample, expressed as log(counts per million), or ‘log(CPM)’.
We will need the experimental group label for each sample to make boxplots and to annotate our heatmap. We can extract this from the dge$samples table as follows.
Volcano plots are very popular for visually summarizing results from RNA-seq analysis. The volcano plot is a scatter plot of the negative log10 p-value (y) against the log2 fold-change (x). This plot is useful for visualising the significance of the differentially expressed genes, as the more significant results are at the top left and right of the plot.
To make a volcano plot for a single experimental contrast (BasalvsLP), we must first categorize the summary statistics for each gene with ‘Up’ ‘Down’ or ‘Not DE’, according to the adj.P.Val and logFC. The case_when() function is a good choice for binning continuous data (adj.P.Val) into categories.
volcano_bvl <- top_bvl_cat %>%ggplot(aes(x = logFC, y =-log10(P.Value), col = status)) +geom_point(size=0.5) +scale_colour_manual(values =c("blue", "#595959", "red")) +ggtitle('Basal vs LP')volcano_bvl
Lastly we can label the gene names for the top-most DEGs by extracting the top 20 genes using dplyr::slice() and feeding them into geom_label_repel() from the ggrepel library. Here instead of rewriting code, we just keep adding geoms to the volcano_plot ggplot object created above!
Can you now make the labelled volcano plots for the BasalvsML and LPvsML contrasts?
7.3.3 Boxplots
It’s important to create boxplots for a few differentially-expressed genes. This is a way of confirming that the results in our summary statistics tables and the underlying data agree, and that that our contrasts are set up correctly!
In this case we’ll make boxplots from the log(CPM) values of individual samples, for the first 6 differentially-expressed genes in the BasalvsLP topTable top_bvl_cat. We will extract the first 6 rows of the tibble using slice_head(), then join the logCPM values per sample
Error in `left_join()`:
! Can't join `x$ENTREZID` with `y$ENTREZID` due to incompatible types.
ℹ `x$ENTREZID` is a <double>.
ℹ `y$ENTREZID` is a <character>.
Great! An error! …it seems that the ENTREZID column is not the same type between the top_bvl_cat and the sample_logcpm tibbles. We can fix this by changing the sample_logcpm$ENTREZID into numeric data:
To confirm your statistical contrasts are working as expected, note the gene names that are down-regulated in the Basal vs LP conditions in these boxplots. Are they labelled as expected in the volcano plot?
7.3.4 Heatmap
Next, we can create a heatmap of the differentially expressed genes. The heatmap is a useful way to visualise the expression of the differentially expressed genes in individual samples, especially when more than two conditions are being tested. We can use the tidyheatmapspackage to create the heatmap.
First we need to install the package if you haven’t already
install.packages("tidyheatmaps")
The data we will use for the heatmap is going to be the log2 cpm expression values of the top differentially expressed genes, within the sample_logcpm data, and the first 50 rows from each of the three top tables.
Note
Because genes may be differentially expressed in more than one contrast, the total number of unique genes may be less than 150.
For more interpretable labels, we will use the gene symbols as the row names of the expression matrix.
To create the heatmap we will use the tidyheatmaps() function with the logCPM values. We will add in arguments to scale the data colours along the rows with a z-score, adding in sample annotation, and using a dendrogram (tree-based clustering) to order the data by rows and columns. We will also add in the option to show the row names and column names, and set the font size.
You can experiment with removing each of these options to see how they affect the heatmap, as well as investigate additional options in the tidyheatmaps() documentation.
Now that we have produced a series of plots and the top genes table, we can save these files to a directory. These figures may form the basis for publication figures, and the results table can partially presented as a result, or the full table can be included as a supplementary file.
With that we have come to the conclusion of the RNA-seq analysis and this course. You have now learned how to use various tidyverse packages to operate on data and create publication quality figures. You have also learned how to use the limma and edgeR packages to perform differential expression analysis on RNA-seq data. You should now be able to apply these skills to perform analysis on your own data.
7.5 Extension
7.5.1 MA Plot
The MA plot is a plot of log-fold-change (M-values) against log-expression averages (A-values), this is a common plot in RNA sequencing analysis to visualise the result of differential expression tests. It can be created using the plotMA() from the limma package. Creating this plot requires 3 pieces of information:
object = efit: The the fitted object containing the log-fold-change and log-expression averages
coef = 1: The column number of the contrast to plot since there are 3 different contrasts fitted within the object.
status = dt[, 1]: A vector of numerics denoting whether a gene is up-regulated or down-regulated.
plotMA(efit, coef =1, status = dt[, "BasalvsLP"])#If you get an error with the above code, try:limma::plotMA(efit, coef =1, status = dt[, "BasalvsLP"])
As before, we can try to re-create this plot using ggplot2 to give us more flexibility in plotting. The information we’ll need is in the top_bvl_cat tibble that we extracted for the volcano plot.
top_bvl_cat %>%ggplot(aes(x = AveExpr, y = logFC, col = status)) +geom_point()
While the coordinates of the points are the same, there is still quite a bit of work to do to make this plot look like the one produced by plotMA(). First we have the fix up the colours of the points. We also have to make it such that the non-significant points are plotted below the other points. To do this we would need to make it so that the non-significant points are plotted first. We can do this by reordering the levels of the status factor so that the non-significant points are plotted first. Labels can also be added using ggrepel for the top 10 DE genes in each direction.
Can you now make the labelled MA plots for the BasalvsML and LPvsML contrasts?
Now we can save the MA plot to our output:
ggsave("output/ma_plot_BasalvsLP.png", ma_bvl)
Saving 10 x 5 in image
7.6 References
1.
Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B: Statistical Methodology [Internet]. 1995 Jan 1;57(1):289–300. Available from: http://dx.doi.org/10.1111/j.2517-6161.1995.tb02031.x
---filters: - naquizformat: html: toc: true toc-location: left toc-title: "In this chapter:"execute: error: true---# RNA-seq - Part 2 {#sec-chapter07}The aim of this chapter is to complete the analysis of the RNA-seq data using the `limma` and `edgeR` packages.With the results of the analysis, we want to be able to produce publication quality figures and write out the results to a file for sharing or publication.::: {.callout-tip title="Learning Objectives"}- fit a linear model to the data- construct design and contrast matrices for the linear model- model the mean-variance relationship of the data using the `voom()` function- perform statistical testing to identify differentially expressed genes- visualise the results using a range of common plots- write out the results to a file:::```{r}# load required packageslibrary(tidyverse)library(edgeR)library(limma)group <-parse_factor(c("LP", "ML", "Basal", "Basal", "ML", "LP", "Basal", "ML", "LP"),levels =c("Basal", "LP", "ML"))samplenames <-c("10_6_5_11", "9_6_5_11", "purep53", "JMS8-2", "JMS8-3","JMS8-4", "JMS8-5", "JMS9-P7c", "JMS9-P8c")# create DGEList objectdge <-readDGE(dir(path ="data/counts", pattern ="GSM*"),path ="data/counts",columns =c(1, 3),group = group,labels = samplenames)# add gene annotation informationdge$genes <-read_tsv("data/Ses3_geneAnnot.tsv")# filter lowly expressed geneskeep <-filterByExpr(dge)dge <- dge[keep, ]# normalise the data for library size and RNA compositiondge <-calcNormFactors(dge, method ="TMM")```## Linear modellingTo identify differentially expressed genes from our normalised gene expression data, we will use the `limma` package to fit linear models to the genes.A linear model is a broad class of statistical models that fit the value of an response (or dependent) variable as a linear function of one or more explanatory (or independent) variables (also called covariates).The general form of a linear model looks like this:$Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2}... + \epsilon$This equation is saying that a response variable of interest $Y$ is equal to a constant ($\beta_{0}$) plus the sum of the covariates ($X_{i}$) each multiplied by a constant coefficient ($\beta_{i}$) and some error term ($\epsilon$).The error term is the difference between the observed value and the predicted value of the model and is assumed to have a normal distribution with mean 0 and some variance.Our experiment is quite simple, since there is only a single covariate, the cell type.The true benefit of using linear models is in its ability to accommodate more complex designs including multiple covariates.To fit the linear models in the limma-voom framework we need two objects in addition to our data:- A design matrix, representing the covariates.- A contrast matrix, representing the specific comparison we wish to make.### Design matrixThe first step to fitting a linear model is to specify a design matrix.The design matrix specifies the values of the covariates for each sample, and is represented as a matrix due to the mathematical convenience.To generate a design matrix.We use the function `model.matrix()`, with the expression `~0 + group`.This returns a matrix representing the design where there is no intercept term and group is the only covariate.If we omit the `0` then there would be an intercept in the model, and if we included more covariates then more columns would be generated.```{r}design <-model.matrix(~0+ group, data = dge$samples)design```There are 9 rows, one for each sample.Along the columns are the names of the groups.The values in the cells denote membership of the particular sample for a particular group, as our groups in this case are mutually exclusive, each row contains only a single 1 to denote membership in a single group.### ContrastsContrast matrices are companions to design matrices.They are used to specify the comparison of interest.In our case, we have three experimental groups: `Basal`, `LP` and `ML`.So if we are to perform differential expression analysis, we are most likely interested in the differences between only two of the groups at a time.Contrast matrices let use specify which comparison we'd like to make, and are also represented as a matrix just like the design.A contrast matrix can be made using the `makeContrasts()` function.Within this function, we specify the name of each specific contrast and the formula for that contrast.For example, the `BasalvsLP` contrasts compares the difference between the `Basal` and `LP` groups.Note that the name of the phenotype groups must be written exactly as they are in the column names of our design matrix (see above).In addition to the individual contrasts, the function must know about the design of the model.This is passed through the `levels` argument, which either accepts a matrix with the column names corresponding to levels of your experimental groups, or the levels themselves as a character vector.Here we first simplify the column names of our design matrix to make it easier to read.Then we create the contrast matrix using the `makeContrasts()` function.```{r}colnames(design) <-c("Basal", "LP", "ML")contr_matrix <-makeContrasts(BasalvsLP ="Basal - LP",BasalvsML ="Basal - ML",LPvsML ="LP - ML",levels = design # alternatively 'levels = colnames(design)')contr_matrix```There are two things to note about a design matrix.First, the sum of the numbers in each column is equal to 0.Second, the way that you set up the equation in the matrix will determine the interpretation of the log-fold-change calculated later, as well as the meaning of up-regulated and down-regulated genes.For example, the contrast `Basal - LP` will give a positive log-fold-change for genes that are up-regulated in the `Basal` group compared to the `LP` group.If we had set up the contrast as `LP - Basal`, then the opposite would be true, and we would get a negative log-fold-change for genes that are up-regulated in the `Basal` group compared to the `LP` group.### Variance modelling with voomWe are now ready to fit our linear models.Limma fits linear models to the data with the assumption that the underlying data is normally distributed.Count data is generally not normally distributed, but log transforming count data gives it a roughly normal distribution sufficient for linear models to work well.To do this, limma transforms the raw count data to log-cpm using library sizes and the normalisation factors we calculated previously.In addition to the normalisation steps, the limma-voom pipeline uses the `voom()` function to generate weights for the individual genes based on a modelled mean-variance relationship.This modelling allows us to get more information out of small sample sizes as the weights prevent our model from being more heavily influenced by more variable data points.The weights will then allow the linear model to be less influenced by genes with high variability, which is important for small sample sizes.The voom function also generates a mean-variance plot, which is useful for visualising the data.The `voom()` function takes as arguments, our `DGEList` object and our design matrix.It also optionally outputs a plot of the mean-variance relationship of our data, called the 'voom plot'.```{r}v <-voom(dge, design, plot =TRUE)```The output of `voom()` (our variable `v`) is an `EList` object which contains the following elements:- `genes` - a data frame of gene annotation data.- `targets` - data frame of sample data.- `E` - numeric matrix of normalised log-cpm values.- `weights` - numeric matrix of precision weights.- `design` - the design matrix.The 'voom plot' shows the mean-variance relationship of the data.The x-axis is the average log2 expression of each gene, and the y-axis is the log2 standard deviation of each gene.The idea is that the variance of a gene can be estimated as a function of its main expression, and that this estimate is more accurate than the direct calculation of the variance.### Fitting the linear modelWe are now ready to fit our linear model with `lmFit()`, which calculates coefficients we defined in our design matrix `design`.The resulting object, `vfit` is a `MArrayLM` object.It contains a information about our genes (the same data frame as `genes` from our `EList` object `v` above), the design matrix and a number of statistical outputs.Of most interest to us is the coefficients, stored in an element called `coefficients`.The first rows of this matrix is shown below.Each gene is row and is labelled using the EntrezID.Each column gives coefficients for each of our phenotype groups.These coefficients are weighted averages of the log-cpm of each gene in each group.```{r}vfit <-lmFit(v)head(vfit$coefficients)```We can then use `contrasts.fit()` to calculate coefficients for each contrast (or 'comparison') we specified in our `contr_matrix`.The output is also an object of the class `MArrayLM` (also known as an `MArrayLM` object).When we inspect the `coefficients` element now, we can see that each column is a contrast that we specified in our contrast matrix.```{r}vfit <-contrasts.fit(vfit, contrasts = contr_matrix)head(vfit$coefficients)```With these values we essentially want to know whether or not the difference in values is significantly different from 0.If the difference is not significantly different from 0, then we have failed to establish a difference in the expression of a particularly gene between two groups.If the difference is significantly greater than 0, then we have up-regulation in the first group of the contrast, and if the difference is significantly less than 0, then we have down-regulation in the first group of the contrast.## Statistical testingTo actually test if the values are significantly different from 0, we use the `eBayes()` function to compute moderated t-statistics, moderated F-statistics and log-odds of differential expression for each gene, given a fitted linear model.'Moderated' refers to empirical Bayes moderation, which borrows information across genes to obtain more accurate measures of variability for each gene.This also increases our power to detect differentially expressed genes.```{r}efit <-eBayes(vfit)```We can now look at the number of differentially expressed genes using the `decideTests()` function.The output of this function is a matrix where each column is a contrast (comparison of interest) and each row is a gene.The numbers 1, -1 and 0 mean up-regulated, down-regulated or not significantly differentially expressed, respectively.Note that `decideTests()` also accounts for multiple testing.The default method is Benjamini and Hochberg [@benjamini1995] but several others are also available.```{r}dt <-decideTests(efit)dt```To obtain the total number of differentially expressed genes for each comparison, we can add the function `summary()`:```{r}summary(dt)```The function `topTable()` can be used to obtain more information on the differentially expressed genes for each contrast.`topTable()` takes as arguments the `MArrayLM` object output by `eBayes()` (`efit`), the contrast name of interest, and the number of top differentially expressed genes to output.Note that the contrast name must be given in quotes and must be exactly as written in the contrast matrix `contr_matrix`.It outputs a data frame with the following information:- **Gene details** - gene information, from the `gene` element of the `MArrayLM` object (`efit`).- `logFC` - the log2 fold change of the contrast.- `AveExpr` - the average log2 expression of that gene.- `t` - moderated t-statistic.- `P.Value` - p value.- `adj.P.Val` - adjusted p value.- `B` - log-odds that the gene is differentially expressed.```{r}top_bvl <-topTable(efit, coef ="BasalvsLP", n =Inf) %>%as_tibble() %>%mutate(contr ='BasalvsLP')head(top_bvl)```## Plotting ResultsIn this section we will make plots of both the summary statistics (topTable output), and the gene expression levels per sample, using log(counts per million) together with sample metadata from dge\$samples.### Data extractTo start with we extract all the data required to make any of the the common RNAseq results plots.We will1. extract the summary statistics for all three contrasts into separate tables2. extract the sample_ids and experimental group labels, and then3. extract the log(CPM) values for each sample#### Extract summary tablesVolcano plots and MA plots require differential expression summary statistics.We have the summary table for BasalvsLP already extracted to `top_bvl`, now for BasalvsML and LPvsML:```{r}top_bvm <-topTable(efit, coef ="BasalvsML", n =Inf) %>%as_tibble() %>%mutate(contr ='BasalvsML')top_lvm <-topTable(efit, coef ="LPvsML", n =Inf) %>%as_tibble() %>%mutate(contr ='LPvsML')```#### Extract log(counts per million)Box plots and heatmaps require the gene expression data per sample, expressed as log(counts per million), or 'log(CPM)'.To extract the logCPM to a tibble:```{r}sample_logcpm <-cpm(dge, log =TRUE) %>%as_tibble(rownames='ENTREZID')```#### Extract sample groupsWe will need the experimental group label for each sample to make boxplots and to annotate our heatmap.We can extract this from the `dge$samples` table as follows.```{r}sample_group <- dge$samples %>%as_tibble(rownames='sample_id')```### Volcano PlotVolcano plots are very popular for visually summarizing results from RNA-seq analysis.The volcano plot is a scatter plot of the negative log10 p-value (y) against the log2 fold-change (x).This plot is useful for visualising the significance of the differentially expressed genes, as the more significant results are at the top left and right of the plot.To make a volcano plot for a single experimental contrast (BasalvsLP), we must first categorize the summary statistics for each gene with 'Up' 'Down' or 'Not DE', according to the adj.P.Val and logFC.The `case_when()` function is a good choice for binning continuous data (adj.P.Val) into categories.```{r}top_bvl_cat <- top_bvl %>%mutate(status =case_when(adj.P.Val <0.05& logFC >0~'Up', adj.P.Val <0.05& logFC <0~'Down',.default='Non DE'))```Now to render the plot...```{r}volcano_bvl <- top_bvl_cat %>%ggplot(aes(x = logFC, y =-log10(P.Value), col = status)) +geom_point(size=0.5) +scale_colour_manual(values =c("blue", "#595959", "red")) +ggtitle('Basal vs LP')volcano_bvl```Lastly we can label the gene names for the top-most DEGs by extracting the top 20 genes using `dplyr::slice()` and feeding them into `geom_label_repel()` from the ggrepel library.Here instead of rewriting code, we just keep adding geoms to the `volcano_plot` ggplot object created above!```{r}library(ggrepel)``````{r}top_deg_bvl <- top_bvl_cat %>%group_by(status) %>%arrange(P.Value) %>%slice_head(n =10) %>%filter(status!='Non DE')volcano_bvl +geom_label_repel(data = top_deg_bvl,aes(label = SYMBOL, col = status),max.overlaps =20,show.legend =FALSE ) ```::: {.callout-tip title="Next steps"}Can you now make the labelled volcano plots for the BasalvsML and LPvsML contrasts?:::### BoxplotsIt's important to create boxplots for a few differentially-expressed genes.This is a way of confirming that the results in our summary statistics tables and the underlying data agree, and that that our contrasts are set up correctly!In this case we'll make boxplots from the log(CPM) values of individual samples, for the first 6 differentially-expressed genes in the BasalvsLP topTable `top_bvl_cat`.We will extract the first 6 rows of the tibble using `slice_head()`, then join the logCPM values per sample<!-- for this to render, yaml header updated to error: true -->```{r}top_bvl_cat %>%slice_head(n=6) %>%left_join(sample_logcpm, by='ENTREZID')```Great! An error!...it seems that the ENTREZID column is not the same type between the `top_bvl_cat` and the `sample_logcpm` tibbles.We can fix this by changing the `sample_logcpm$ENTREZID` into numeric data:```{r}sample_logcpm <- sample_logcpm %>%mutate(ENTREZID =as.numeric(ENTREZID))```Now we can rerun the `left_join()` and reshape the data to long form as required for ggplot2.```{r}top_gene_long <- top_bvl_cat %>%slice_head(n=6) %>%left_join(sample_logcpm, by='ENTREZID') %>%select(SYMBOL, '10_6_5_11':'JMS9-P8c') %>%pivot_longer(cols =-SYMBOL, names_to ='sample_id', values_to ='logCPM')```Next, we join the sample groupings to this reshaped data:```{r}top_gene_long_samples <-left_join(top_gene_long, sample_group, by='sample_id')```...and plot the results```{r}boxplots <- top_gene_long_samples %>%ggplot(aes(x=group,y=logCPM, colour = group)) +geom_boxplot() +geom_jitter() +facet_wrap(~SYMBOL, scales='free')boxplots```::: {.callout-tip title="Cross-check!"}To confirm your statistical contrasts are working as expected, note the gene names that are down-regulated in the Basal vs LP conditions in these boxplots.Are they labelled as expected in the volcano plot?:::### HeatmapNext, we can create a heatmap of the differentially expressed genes.The heatmap is a useful way to visualise the expression of the differentially expressed genes in individual samples, especially when more than two conditions are being tested.We can use the `tidyheatmaps`[package](https://jbengler.github.io/tidyheatmaps/) to create the heatmap.First we need to install the package if you haven't already```{r}#| eval: falseinstall.packages("tidyheatmaps")```The data we will use for the heatmap is going to be the log2 cpm expression values of the top differentially expressed genes, within the `sample_logcpm` data, and the first 50 rows from each of the three top tables.::: {.callout-tip title="Note"}Because genes may be differentially expressed in more than one contrast, the total number of unique genes may be less than 150.:::For more interpretable labels, we will use the gene symbols as the row names of the expression matrix.```{r}top_genes_all <-bind_rows( top_bvl %>%slice_head(n =50), top_bvm %>%slice_head(n =50), top_lvm %>%slice_head(n =50)) %>%select(ENTREZID, SYMBOL) %>%distinct() ```We now join the logCPM values for the top_genes:```{r}top_genes_lcpm <- top_genes_all %>%left_join(sample_logcpm)```...and reshape the data to long format before joining in the sample group labels.This is very similar to how the data was prepared for our boxplots.```{r}top_genes_lcpm_long <- top_genes_lcpm %>%pivot_longer(names_to ="sample_id", values_to ="logcpm",cols =-c(ENTREZID,SYMBOL))top_genes_lcpm_gp <- top_genes_lcpm_long %>%left_join(sample_group)```To create the heatmap we will use the `tidyheatmaps()` function with the logCPM values.We will add in arguments to scale the data colours along the rows with a z-score, adding in sample annotation, and using a dendrogram (tree-based clustering) to order the data by rows and columns.We will also add in the option to show the row names and column names, and set the font size.You can experiment with removing each of these options to see how they affect the heatmap, as well as investigate additional options in the `tidyheatmaps()` documentation.```{r, fig.height = 9}library(tidyheatmaps)heatmap <- tidyheatmap( df = top_genes_lcpm_gp, rows = SYMBOL, columns = sample_id, values = logcpm, color_legend_n = 100, annotation_col = group, scale = 'row', border_color = NA, cluster_rows = TRUE, cluster_cols = TRUE, show_rownames = TRUE, show_colnames = TRUE, fontsize_row = 6, fontsize_col = 10)heatmap```## Saving the resultsNow that we have produced a series of plots and the top genes table, we can save these files to a directory.These figures may form the basis for publication figures, and the results table can partially presented as a result, or the full table can be included as a supplementary file.```{r, warning=FALSE}dir.create("output")write_csv(top_bvl_cat, "output/top_genes_BasalvsLP.csv")ggsave("output/volcano_plot_BasalvsLP.png", volcano_bvl)ggsave("output/heatmap.png", heatmap, width=9, height=9)ggsave("output/boxplots.png", boxplots)```With that we have come to the conclusion of the RNA-seq analysis and this course.You have now learned how to use various `tidyverse` packages to operate on data and create publication quality figures.You have also learned how to use the `limma` and `edgeR` packages to perform differential expression analysis on RNA-seq data.You should now be able to apply these skills to perform analysis on your own data.## Extension### MA PlotThe MA plot is a plot of log-fold-change (M-values) against log-expression averages (A-values), this is a common plot in RNA sequencing analysis to visualise the result of differential expression tests.It can be created using the `plotMA()` from the `limma` package.Creating this plot requires 3 pieces of information:- `object = efit`: The the fitted object containing the log-fold-change and log-expression averages- `coef = 1`: The column number of the contrast to plot since there are 3 different contrasts fitted within the object.- `status = dt[, 1]`: A vector of numerics denoting whether a gene is up-regulated or down-regulated.```{r}plotMA(efit, coef =1, status = dt[, "BasalvsLP"])#If you get an error with the above code, try:limma::plotMA(efit, coef =1, status = dt[, "BasalvsLP"])```As before, we can try to re-create this plot using `ggplot2` to give us more flexibility in plotting.The information we'll need is in the `top_bvl_cat` tibble that we extracted for the volcano plot.```{r}top_bvl_cat %>%ggplot(aes(x = AveExpr, y = logFC, col = status)) +geom_point()```While the coordinates of the points are the same, there is still quite a bit of work to do to make this plot look like the one produced by `plotMA()`.First we have the fix up the colours of the points.We also have to make it such that the non-significant points are plotted below the other points.To do this we would need to make it so that the non-significant points are plotted first.We can do this by reordering the levels of the `status` factor so that the non-significant points are plotted first.Labels can also be added using `ggrepel` for the top 10 DE genes in each direction.```{r}top_bvl_cat_fct <- top_bvl_cat %>%mutate(status =fct_relevel(status,'Non DE')) scale_cols <-c("Not DE"="#595959","Down"="blue","Up"="red")ma_bvl <- top_bvl_cat_fct %>%ggplot(aes(x = AveExpr, y = logFC, col = status)) +geom_point(data = top_bvl_cat_fct %>%filter(status =="Non DE"), size =0.1) +geom_point(data = top_bvl_cat_fct %>%filter(status !="Non DE"), size =1) +geom_label_repel(data = top_deg_bvl,aes(label = SYMBOL, col =NULL),max.overlaps =20,show.legend =FALSE ) +scale_colour_manual(values = scale_cols) +ggtitle('Basal vs LP')ma_bvl```::: {.callout-tip title="Next steps"}Can you now make the labelled MA plots for the BasalvsML and LPvsML contrasts?:::Now we can save the MA plot to our output:```{r}ggsave("output/ma_plot_BasalvsLP.png", ma_bvl)```## References