We are going to re-analyse a dataset containing transcriptomic samples from healthy patients and patients infected with mycobacterium tuberculosis, which causes tuberculosis disease. This data has previously been analysed by Singhania et al, 2018. Tuberculosis is a heterogeneous disease with high global burden. Patients infected with mycobacterium tuberculosis can be classified as active or latent infections, depending on whether clinical symptoms are present or not. Diagnosing tuberculosis infection can be difficult, especially in asymptomatic patients with latent tuberculosis infection (LTBI).
Therefore, it would be beneficial to identify some biomarkers which can act as sensitive and specific indicators of the disease state. The data is open-access and available through the National Center for Biotechnology Information’s Gene Expression Omnibus project. Data were collected from three cohorts (the London, South Africa, and Leicester cohorts).
The goal of this practical will be to perform an analysis of these data, using the principles learned in the lecture. Specifically, we will practice data pre-processing, normalisation, batch effect correction, data exploration with principal component analysis and hierarchical clustering, and differential gene expression analysis.
Using
install.packages(), install the following packages from the CRAN : BiocManager, readxl, dplyr, flashClust, UpSetR, tidyR, ggplot2, stringr, dendextend, ggdendro, ggrepel, VennDiagram, grid.
Using
BiocManager::install(), install the following packages fromBioConductor: GEOquery, edgeR, limma, pvca, dearseq, sva.
The following chunk will check if you already have any of these packages and then install the missing ones.
cran_packages <- c(
"BiocManager",
"readxl",
"dplyr",
"flashClust",
"UpSetR",
"tidyr",
"ggplot2",
"stringr",
"dendextend",
"ggdendro",
"ggrepel",
"VennDiagram",
"grid"
)
# Install missing CRAN packages
install.packages(setdiff(cran_packages, installed.packages()[, "Package"]))
# ---- BioConductor Packages ----
bioc_packages <-
c("GEOquery", "edgeR", "limma", "pvca", "dearseq", "sva")
# Load BiocManager and install missing Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(setdiff(bioc_packages, installed.packages()[, "Package"]))
Take the following link to the NCBI’s gene expression omnibus for the series A modular transcriptional signature identifies phenotypic heterogeneity of human tuberculosis infection. Scroll down and click on the links to the following SubSeries :
GSE107991,GSE107992,GSE107993. These SubSeries correspond to three different patient cohorts, called the London, South Africa, and Leicester cohorts, respectively. On each of these pages, download the corresponding raw count data (i.e. ‘GSE107991_Raw_counts_Berry_London.xlsx’, ‘GSE107992_Raw_counts_Berry_SouthAfrica.xlsx’, and ‘GSE107993_Raw_counts_Leicester_non_progressor_longitudnal_only.xlsx’), saving them in a single directory.
In this first step, I will give you the code to load the data and merge it together to make it easier to work with. Simply change the path directing to where the data were saved.
First, change the line below to specify the path where you saved your data.
data_path = "your/path/to/data/"
Load the raw counts data with the
read_excel()function.
# Read the raw counts for the London data
london_raw <-
read_excel(paste0(data_path, "GSE107991_Raw_counts_Berry_London.xlsx"))
# Read the raw counts for the South Africa data
southafrica_raw <-
read_excel(paste0(data_path, "GSE107992_Raw_counts_Berry_SouthAfrica.xlsx"))
# Read the raw counts for the Leicester data
leicester_raw <-
read_excel(
paste0(
data_path,
"GSE107993_Raw_counts_Leicester_non_progressor_longitudnal_only.xlsx"
)
)
Use the
GEOquery()function to download the metadata corresponding to these files.
# Load the metadata for the London, South Africa, and Leicester datasets
london_metadata <-
GEOquery::getGEO("GSE107991", GSEMatrix = FALSE) #London
southafrica_metadata <-
GEOquery::getGEO("GSE107992", GSEMatrix = FALSE) # SouthAfrica
leicester_metadata <-
GEOquery::getGEO("GSE107993", GSEMatrix = FALSE) #Leicester
Extract relevant information from the metadata.
# This function extracts the relevant columns from the metadata files and changes some of the names of the values
get_info <- function(i, meta) {
name <- meta@gsms[[i]]@header$source_name_ch1
name <- gsub("Active_TB", "ActiveTB", name)
name <- gsub("Test_set", "TestSet", name)
name <- gsub("Validation_set", "ValidationSet", name)
name <- gsub("Non_progressor", "NonProgressor", name)
res <- unlist(strsplit(name, split = "_"))
res <- c(res, meta@gsms[[i]]@header$title)
res
}
infos_london <-
sapply(1:length(london_metadata@gsms),
FUN = get_info,
meta = london_metadata)
infos_df_london <-
cbind.data.frame(GSMID = names(london_metadata@gsms),
t(infos_london))
rownames(infos_df_london) <- names(london_metadata@gsms)
colnames(infos_df_london)[-1] <-
c("Cohort", "Location", "Set", "Status", "SampleName")
infos_southafrica <-
sapply(1:length(southafrica_metadata@gsms),
FUN = get_info,
meta = southafrica_metadata)
infos_df_southafrica <-
cbind.data.frame(GSMID = names(southafrica_metadata@gsms),
t(infos_southafrica))
rownames(infos_df_southafrica) <- names(southafrica_metadata@gsms)
colnames(infos_df_southafrica)[-1] <-
c("Cohort", "Location", "Set", "Status", "SampleName")
infos_leicester <-
sapply(1:length(leicester_metadata@gsms),
FUN = get_info,
meta = leicester_metadata)
infos_df_leicester <-
cbind.data.frame(GSMID = names(leicester_metadata@gsms),
t(infos_leicester))
rownames(infos_df_leicester) <- names(leicester_metadata@gsms)
colnames(infos_df_leicester)[-1] <-
c("Cohort", "Location", "Status", "Set",
"SampleName")
infos_df_leicester <- infos_df_leicester[, c(1, 2, 3, 5, 4, 6)]
Change the format to wide, so each column is a gene, and merge the data.
# Bind the metadata for London, SA, and Leicester together
infos_all <-
rbind.data.frame(infos_df_london, infos_df_southafrica, infos_df_leicester)
# Bind the counts for London, SA, and Leicester together
counts_merged <-
merge(
x = merge(x = london_raw, y = southafrica_raw, by = c(1:3)),
y = leicester_raw,
by = c(1:3)
) %>% select(-Gene_name, -Gene_biotype)
# pivot to wide format
counts_all <- t(counts_merged)[-1,] %>% as.data.frame()
# Columns are genes
colnames(counts_all) = counts_merged$Genes
# Rows are samples
rownames(counts_all) = 1:dim(counts_all)[1]
# Now we merge the counts with the metadata by the sample name
merged_all = merge(
y = bind_cols("SampleName" = colnames(counts_merged)[-1], counts_all),
x = infos_all,
by = "SampleName"
) %>% distinct()
# here we extract the counts and store them as numeric values in the dataframe
counts_all = merged_all %>% select(-c(1:6)) %>% sapply(as.numeric) %>% as.data.frame()
# Here we extract the metadata
metadata_all = merged_all %>% select(c(1:6))
# Visualise the start of the merged dataframe
merged_all[, c(1:8)] %>% head(5)
## SampleName GSMID Cohort Location Set Status
## 1 Berry_London_Sample1 GSM2886035 Berry London TestSet ActiveTB
## 2 Berry_London_Sample10 GSM2886044 Berry London TestSet ActiveTB
## 3 Berry_London_Sample11 GSM2886045 Berry London TestSet LTBI
## 4 Berry_London_Sample12 GSM2886046 Berry London TestSet LTBI
## 5 Berry_London_Sample13 GSM2886047 Berry London TestSet LTBI
## ENSG00000000003 ENSG00000000005
## 1 5 0
## 2 7 0
## 3 2 0
## 4 14 0
## 5 26 1
Counts matrices usually contain some completely uninformative variables (i.e. with 0 counts in every sample). Remove these from the data and report how many columns were removed.
Many investigators only include protein-coding genes in their analysis. Information on whether a gene is protein coding or not may be found in the raw counts under the Gene_biotype columns. For example,
Using this information, filter the counts to only include genes who are protein-coding. Report the amount of genes removed at this stage.
Normalisation is a crucial step in RNA-seq analysis to ensure that samples are comparable. An important point here is that our data comes from 3 different studies (London, SA, and Leicester). Let’s investigate the impact of this study variable on the data.
The library size of a sample is the total number of counts across all genes. Compute the library sizes and visualise them on a bar chart using
ggplot2. Colour the samples by the study from which it came. What do you notice?
A simple way to address the library size problem is simply to normalise the raw integer counts by dividing by the library size. The variant of this approach which is often used is a transformation called counts-per-million. This is calculated by first dividing each count by its sample’s library size, then multiplying by a million.
Using the
cpm()(counts-per-million) function from theedgeRwith thelog = TRUEargument, compute the log2-CPM of these data.
Principal component analysis is a useful tool for representing the data in a lower-dimensional space while preserving signals which contribute to between-sample variability. Perform a PCA using the
prcomp()function. Plot the samples on the first two principal components, coloured by location. Has this normalisation step removed the study-effect?
Another typical normalisation approach is to use the trimmed mean of moments (TMM) method. Calculate the normalisation factors on the raw counts data using the
calcNormFactors()function inedgeRwith the argumentmethod = "TMM"applied to aDGEList()object. Then, convert this object into log2CPM using thecpm()function as before. Using your previous code, plot the points on the first two principal components, coloured by location. Has this normalisation step solved the problem?
Even after normalisation, we appear to still have a study-effect. We could view the study as a batch variable, since we do not expect it to contribute much to the heterogeneity between samples. To correct for batch effects in RNA-seq data, we can use the combat-seq method. We do this using the
ComBat_seq()function from thesvapackage.
In this method, we must specify two things. Firstly, the batch variable (
Location), and secondly, a variable whose effects we want to remain intact after correction (in this case, theStatusvariable, which provides information on the disease states of the patients). ApplyComBat_seq()to the raw, filtered counts to correct for thebatch = Location, while keeping preserved effects fromgroup = metadata_all$Status. NB : running combat-seq may take a couple of minutes.
Now that we have attempted to correct for the batch effect in the raw counts, use your previous code to normalise the data using Log2-CPM-TMM. Now visualise the data on the first two principal components and conlcude if there still remains a study effect in the data.
Now that we have preprocessed the data and (hopefully) removed any batch effects, we can start to explore it. The goal is to use tools like PCA and hierarchical clustering to explore any obvious signals in the data related to the patient status.
Perform a PCA to explore whether the variability in the gene expression data can be explained by the disease status of patients. Like before, perform a PCA on the batch-corrected-Log2-CPM-TMM-normalised data and plot the data on the first two PCs. However, this time, colour the points by their disease status. What does this suggest?
In addition, we can explore signal in the data using hierachical clustering algorithms. Here, the goal is to determine which samples are similar and which are different. If these similarities and/or differences can be explained by the disease state, it suggests a signal might be present.
Using the
hclust()function from theflashClustpackage, perform an agglomerative hierarchical clustering algorithm on the gene expression with Euclidean distance and complete linkage. Compute the resulting dendrogram with theas.dendrogram()function. Find a way to colour the sample labels by theStatusvariable.
Do you notice any patterns? Is there any obvious way to determine the number of clusters in this case?
Try changing the
methodargument in thehclust()call tomethod = "ward". Use your previous code to plot the dendrogram as before. A criterion to select the number of clusters which often accompanies the ward method is to select the number of clusters such that the next jump in the tree is the greatest. How many clusters would you choose with this criterion? What is your conclusion now?
We are interested in identifying any genes whose expression is different between different disease states. For the purposes of this exercise, we will group together control and latent tuberculosis patients together, and compare them to the active infection group. We will use multiple methods to do this, and compare their results at the end.
We will be following the edgeR pipeline, found here.
edgeR starts with raw integer counts. We can use the counts from ComBat-Seq (prior to normalisation). Starting from a DGEList object, follow the edgeR pipeline. This starts with estimating TMM normalisation factors with
calcNormFactors(). Then, we estimate common dispersion factors usingestimateCommonDisp(). The negative binomial model can be fit withglmQLFit(). P-values can be extracted withglmQLFTest(), and the top genes can be extracted using thetopTags()function.
Extract a vector of gene names which have adjusted p-value < 0.05, as well as |log2-Fold-Change > 1|.
Repeat the previous experiment, but instead of estimating a common dispersion factor, estimate one for each gene using the
estimateDisp()function. Do the results change significantly?.
Now, we will perform the same task using voom-limma. A tutorial can be found here.
Following the guide, apply the Voom-limma pipeline with the functions
voom(),lmFit(),eBayes()anddecideTests(). The output should be a vector giving the names of the differentially expressed genes defined by adjusted p-value < 0.05 and |log2-Fold-Change > 1|.
Finally we will apply the dearseq pipeline. A guide can be found here.
Use the
dear_seq()function to perform differential expression analysis. Since log2-fold-change values are not directly available from the dearseq package, you may use the ones calculated byedgeR.
A common representation of the results of DGEA is in the form of a so-called volcano plot. The x-axis of a volcano plot is the log2FC value and the y-axis is the adjusted p-values on the log10 scale, with the axis inverted such that smaller p-values are greater on the y-axis. Typically, the ad-hoc limits which define differential expression (e.g. adjusted p-value < 0.05 and log2FC > 1 in our case) are represented by lines on the plot. Significantly up-regulated genes are coloured as red and significantly down-regulated genes as blue. Sometimes, the few most significant genes are given labels showing their name.
Choose one of your analysis pipelines, and make a volcano plot to represent the results. Annotate the top 10 most significantly differentially expressed genes (by adjusted p-value) on the plot.
Now, let’s compare the results from the 3 pipelines (voom-limma, edgeR and dearseq).
First, use the
VennDiagrampackage to visualise the agreement between the methods.
An alternative way to visualise agreement is with a so-called UpSet plot. This representation works well when the number of classes to compare is greater than 3, such that a Venn diagram becomes hard to view. Using the
UpSetRpackage, make an UpSet diagram. Do our 3 pipelines have concordant results?