Task Overview

We are going to re-analyse a dataset containing transcriptomic samples from healthy patients and patients infected with mycobacterium tuberculosis, which causes tuberculosis disease. This data has previously been analysed by Singhania et al, 2018. Tuberculosis is a heterogeneous disease with high global burden. Patients infected with mycobacterium tuberculosis can be classified as active or latent infections, depending on whether clinical symptoms are present or not. Diagnosing tuberculosis infection can be difficult, especially in asymptomatic patients with latent tuberculosis infection (LTBI).

Therefore, it would be beneficial to identify some biomarkers which can act as sensitive and specific indicators of the disease state. The data is open-access and available through the National Center for Biotechnology Information’s Gene Expression Omnibus project. Data were collected from three cohorts (the London, South Africa, and Leicester cohorts).

The goal of this practical will be to perform an analysis of these data, using the principles learned in the lecture. Specifically, we will practice data pre-processing, normalisation, batch effect correction, data exploration with principal component analysis and hierarchical clustering, and differential gene expression analysis.

Data Management

0a) Install and Load Required Packages

Using install.packages(), install the following packages from the CRAN : BiocManager, readxl, dplyr, flashClust, UpSetR, tidyR, ggplot2, stringr, dendextend, ggdendro, ggrepel, VennDiagram, grid.

Using BiocManager::install(), install the following packages from BioConductor : GEOquery, edgeR, limma, pvca, dearseq, sva.

The following chunk will check if you already have any of these packages and then install the missing ones.

cran_packages <- c(
  "BiocManager",
  "readxl",
  "dplyr",
  "flashClust",
  "UpSetR",
  "tidyr",
  "ggplot2",
  "stringr",
  "dendextend",
  "ggdendro",
  "ggrepel",
  "VennDiagram",
  "grid"
)

# Install missing CRAN packages
install.packages(setdiff(cran_packages, installed.packages()[, "Package"]))

# ---- BioConductor Packages ----
bioc_packages <-
  c("GEOquery", "edgeR", "limma", "pvca", "dearseq", "sva")

# Load BiocManager and install missing Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install(setdiff(bioc_packages, installed.packages()[, "Package"]))

Take the following link to the NCBI’s gene expression omnibus for the series A modular transcriptional signature identifies phenotypic heterogeneity of human tuberculosis infection. Scroll down and click on the links to the following SubSeries : GSE107991, GSE107992, GSE107993. These SubSeries correspond to three different patient cohorts, called the London, South Africa, and Leicester cohorts, respectively. On each of these pages, download the corresponding raw count data (i.e. ‘GSE107991_Raw_counts_Berry_London.xlsx’, ‘GSE107992_Raw_counts_Berry_SouthAfrica.xlsx’, and ‘GSE107993_Raw_counts_Leicester_non_progressor_longitudnal_only.xlsx’), saving them in a single directory.

In this first step, I will give you the code to load the data and merge it together to make it easier to work with. Simply change the path directing to where the data were saved.

0b) Load data

First, change the line below to specify the path where you saved your data.

data_path = "your/path/to/data/"

Load the raw counts data with the read_excel() function.

# Read the raw counts for the London data
london_raw <-
  read_excel(paste0(data_path, "GSE107991_Raw_counts_Berry_London.xlsx"))

# Read the raw counts for the South Africa data
southafrica_raw <-
  read_excel(paste0(data_path, "GSE107992_Raw_counts_Berry_SouthAfrica.xlsx"))

# Read the raw counts for the Leicester data
leicester_raw <-
  read_excel(
    paste0(
      data_path,
      "GSE107993_Raw_counts_Leicester_non_progressor_longitudnal_only.xlsx"
    )
  )

0c) Download metadata

Use the GEOquery() function to download the metadata corresponding to these files.

# Load the metadata for the London, South Africa, and Leicester datasets
london_metadata <-
  GEOquery::getGEO("GSE107991", GSEMatrix = FALSE)  #London
southafrica_metadata <-
  GEOquery::getGEO("GSE107992", GSEMatrix = FALSE)  # SouthAfrica
leicester_metadata <-
  GEOquery::getGEO("GSE107993", GSEMatrix = FALSE)  #Leicester

0d) Process the metadata

Extract relevant information from the metadata.

# This function extracts the relevant columns from the metadata files and changes some of the names of the values
get_info <- function(i, meta) {
  name <- meta@gsms[[i]]@header$source_name_ch1
  name <- gsub("Active_TB", "ActiveTB", name)
  name <- gsub("Test_set", "TestSet", name)
  name <- gsub("Validation_set", "ValidationSet", name)
  name <- gsub("Non_progressor", "NonProgressor", name)
  res <- unlist(strsplit(name, split = "_"))
  res <- c(res, meta@gsms[[i]]@header$title)
  res
}

infos_london <-
  sapply(1:length(london_metadata@gsms),
         FUN = get_info,
         meta = london_metadata)
infos_df_london <-
  cbind.data.frame(GSMID = names(london_metadata@gsms),
                   t(infos_london))
rownames(infos_df_london) <- names(london_metadata@gsms)
colnames(infos_df_london)[-1] <-
  c("Cohort", "Location", "Set", "Status", "SampleName")

infos_southafrica <-
  sapply(1:length(southafrica_metadata@gsms),
         FUN = get_info,
         meta = southafrica_metadata)
infos_df_southafrica <-
  cbind.data.frame(GSMID = names(southafrica_metadata@gsms),
                   t(infos_southafrica))
rownames(infos_df_southafrica) <- names(southafrica_metadata@gsms)
colnames(infos_df_southafrica)[-1] <-
  c("Cohort", "Location", "Set", "Status", "SampleName")

infos_leicester <-
  sapply(1:length(leicester_metadata@gsms),
         FUN = get_info,
         meta = leicester_metadata)
infos_df_leicester <-
  cbind.data.frame(GSMID = names(leicester_metadata@gsms),
                   t(infos_leicester))
rownames(infos_df_leicester) <- names(leicester_metadata@gsms)
colnames(infos_df_leicester)[-1] <-
  c("Cohort", "Location", "Status", "Set",
    "SampleName")
infos_df_leicester <- infos_df_leicester[, c(1, 2, 3, 5, 4, 6)]

0e) Merge data

Change the format to wide, so each column is a gene, and merge the data.

# Bind the metadata for London, SA, and Leicester together
infos_all <-
  rbind.data.frame(infos_df_london, infos_df_southafrica, infos_df_leicester)

# Bind the counts for London, SA, and Leicester together
counts_merged <-
  merge(
    x = merge(x = london_raw, y = southafrica_raw, by = c(1:3)),
    y = leicester_raw,
    by = c(1:3)
  ) %>% select(-Gene_name, -Gene_biotype)

# pivot to wide format
counts_all <- t(counts_merged)[-1,] %>% as.data.frame()

# Columns are genes
colnames(counts_all) = counts_merged$Genes

# Rows are samples
rownames(counts_all) = 1:dim(counts_all)[1]

# Now we merge the counts with the metadata by the sample name
merged_all = merge(
  y = bind_cols("SampleName" = colnames(counts_merged)[-1], counts_all),
  x = infos_all,
  by = "SampleName"
) %>% distinct()

# here we extract the counts and store them as numeric values in the dataframe
counts_all = merged_all %>% select(-c(1:6)) %>% sapply(as.numeric) %>% as.data.frame()

# Here we extract the metadata
metadata_all = merged_all %>% select(c(1:6))

# Visualise the start of the merged dataframe
merged_all[, c(1:8)] %>% head(5)

##              SampleName      GSMID Cohort Location     Set   Status
## 1  Berry_London_Sample1 GSM2886035  Berry   London TestSet ActiveTB
## 2 Berry_London_Sample10 GSM2886044  Berry   London TestSet ActiveTB
## 3 Berry_London_Sample11 GSM2886045  Berry   London TestSet     LTBI
## 4 Berry_London_Sample12 GSM2886046  Berry   London TestSet     LTBI
## 5 Berry_London_Sample13 GSM2886047  Berry   London TestSet     LTBI
##   ENSG00000000003 ENSG00000000005
## 1               5               0
## 2               7               0
## 3               2               0
## 4              14               0
## 5              26               1

1) Data pre-processing

1a) Remove unexpressed genes

Counts matrices usually contain some completely uninformative variables (i.e. with 0 counts in every sample). Remove these from the data and report how many columns were removed.

1b) Remove non-protein coding genes

Many investigators only include protein-coding genes in their analysis. Information on whether a gene is protein coding or not may be found in the raw counts under the Gene_biotype columns. For example,

Using this information, filter the counts to only include genes who are protein-coding. Report the amount of genes removed at this stage.

2) Normalisation

Normalisation is a crucial step in RNA-seq analysis to ensure that samples are comparable. An important point here is that our data comes from 3 different studies (London, SA, and Leicester). Let’s investigate the impact of this study variable on the data.

2a) Investigate study effects

The library size of a sample is the total number of counts across all genes. Compute the library sizes and visualise them on a bar chart using ggplot2. Colour the samples by the study from which it came. What do you notice?

2b) Log2-CPM normalisation

A simple way to address the library size problem is simply to normalise the raw integer counts by dividing by the library size. The variant of this approach which is often used is a transformation called counts-per-million. This is calculated by first dividing each count by its sample’s library size, then multiplying by a million.

Using the cpm() (counts-per-million) function from the edgeR with the log = TRUE argument, compute the log2-CPM of these data.

Principal component analysis is a useful tool for representing the data in a lower-dimensional space while preserving signals which contribute to between-sample variability. Perform a PCA using the prcomp() function. Plot the samples on the first two principal components, coloured by location. Has this normalisation step removed the study-effect?

2c) Log2-CPM-TMM Normalisation

Another typical normalisation approach is to use the trimmed mean of moments (TMM) method. Calculate the normalisation factors on the raw counts data using the calcNormFactors() function in edgeR with the argument method = "TMM"applied to a DGEList() object. Then, convert this object into log2CPM using the cpm() function as before. Using your previous code, plot the points on the first two principal components, coloured by location. Has this normalisation step solved the problem?

2d) Batch effect correction

Even after normalisation, we appear to still have a study-effect. We could view the study as a batch variable, since we do not expect it to contribute much to the heterogeneity between samples. To correct for batch effects in RNA-seq data, we can use the combat-seq method. We do this using the ComBat_seq() function from the sva package.

In this method, we must specify two things. Firstly, the batch variable (Location), and secondly, a variable whose effects we want to remain intact after correction (in this case, the Status variable, which provides information on the disease states of the patients). Apply ComBat_seq() to the raw, filtered counts to correct for the batch = Location, while keeping preserved effects from group = metadata_all$Status. NB : running combat-seq may take a couple of minutes.

Now that we have attempted to correct for the batch effect in the raw counts, use your previous code to normalise the data using Log2-CPM-TMM. Now visualise the data on the first two principal components and conlcude if there still remains a study effect in the data.

3) Data Exploration

Now that we have preprocessed the data and (hopefully) removed any batch effects, we can start to explore it. The goal is to use tools like PCA and hierarchical clustering to explore any obvious signals in the data related to the patient status.

3a) Principal Component Analysis

Perform a PCA to explore whether the variability in the gene expression data can be explained by the disease status of patients. Like before, perform a PCA on the batch-corrected-Log2-CPM-TMM-normalised data and plot the data on the first two PCs. However, this time, colour the points by their disease status. What does this suggest?

3b) Hierarchical clustering

In addition, we can explore signal in the data using hierachical clustering algorithms. Here, the goal is to determine which samples are similar and which are different. If these similarities and/or differences can be explained by the disease state, it suggests a signal might be present.

Using the hclust() function from the flashClust package, perform an agglomerative hierarchical clustering algorithm on the gene expression with Euclidean distance and complete linkage. Compute the resulting dendrogram with the as.dendrogram() function. Find a way to colour the sample labels by the Status variable.

Do you notice any patterns? Is there any obvious way to determine the number of clusters in this case?

Try changing the method argument in the hclust() call to method = "ward". Use your previous code to plot the dendrogram as before. A criterion to select the number of clusters which often accompanies the ward method is to select the number of clusters such that the next jump in the tree is the greatest. How many clusters would you choose with this criterion? What is your conclusion now?

4) Differential Gene Expression Analysis

We are interested in identifying any genes whose expression is different between different disease states. For the purposes of this exercise, we will group together control and latent tuberculosis patients together, and compare them to the active infection group. We will use multiple methods to do this, and compare their results at the end.

4a) EdgeR

We will be following the edgeR pipeline, found here.

edgeR starts with raw integer counts. We can use the counts from ComBat-Seq (prior to normalisation). Starting from a DGEList object, follow the edgeR pipeline. This starts with estimating TMM normalisation factors with calcNormFactors(). Then, we estimate common dispersion factors using estimateCommonDisp(). The negative binomial model can be fit with glmQLFit(). P-values can be extracted with glmQLFTest(), and the top genes can be extracted using the topTags() function.

Extract a vector of gene names which have adjusted p-value < 0.05, as well as |log2-Fold-Change > 1|.

Repeat the previous experiment, but instead of estimating a common dispersion factor, estimate one for each gene using the estimateDisp() function. Do the results change significantly?.

4b) Voom-Limma

Now, we will perform the same task using voom-limma. A tutorial can be found here.

Following the guide, apply the Voom-limma pipeline with the functions voom(), lmFit(), eBayes() and decideTests(). The output should be a vector giving the names of the differentially expressed genes defined by adjusted p-value < 0.05 and |log2-Fold-Change > 1|.

4c) dearseq

Finally we will apply the dearseq pipeline. A guide can be found here.

Use the dear_seq() function to perform differential expression analysis. Since log2-fold-change values are not directly available from the dearseq package, you may use the ones calculated by edgeR.

4d) Volcano plot

A common representation of the results of DGEA is in the form of a so-called volcano plot. The x-axis of a volcano plot is the log2FC value and the y-axis is the adjusted p-values on the log10 scale, with the axis inverted such that smaller p-values are greater on the y-axis. Typically, the ad-hoc limits which define differential expression (e.g. adjusted p-value < 0.05 and log2FC > 1 in our case) are represented by lines on the plot. Significantly up-regulated genes are coloured as red and significantly down-regulated genes as blue. Sometimes, the few most significant genes are given labels showing their name.

Choose one of your analysis pipelines, and make a volcano plot to represent the results. Annotate the top 10 most significantly differentially expressed genes (by adjusted p-value) on the plot.

4e) Results comparison

Now, let’s compare the results from the 3 pipelines (voom-limma, edgeR and dearseq).

First, use the VennDiagram package to visualise the agreement between the methods.

An alternative way to visualise agreement is with a so-called UpSet plot. This representation works well when the number of classes to compare is greater than 3, such that a Venn diagram becomes hard to view. Using the UpSetR package, make an UpSet diagram. Do our 3 pipelines have concordant results?

M2 PHDS Omics Data : Generation, Management, and Analysis - Questions

Arthur Hughes

2025-09-25

Task Overview

Data Management

0a) Install and Load Required Packages

0b) Load data

0c) Download metadata

0d) Process the metadata

0e) Merge data

1) Data pre-processing

1a) Remove unexpressed genes

1b) Remove non-protein coding genes

2) Normalisation

2a) Investigate study effects

2b) Log2-CPM normalisation

2c) Log2-CPM-TMM Normalisation

2d) Batch effect correction

3) Data Exploration

3a) Principal Component Analysis

3b) Hierarchical clustering

4) Differential Gene Expression Analysis

4a) EdgeR

4b) Voom-Limma

4c) dearseq

4d) Volcano plot

4e) Results comparison