Singhania et al. (2018) analysed a dataset of transcriptomic responses to tuberculosis infection. In this practical, we will perform statistical tests on the data and examine different types of correction to control the inflated family-wise error rate and false discovery rate.
Today, we will only be using the London dataset. You may run Section 0 to load and pre-process the data, remove uninformative genes, and perform data normalisation with log2-CPM-TMM.
Using install.packages(), install the following packages from CRAN: BiocManager, readxl, dplyr, flashClust, UpSetR, tidyr, ggplot2, stringr, dendextend, ggdendro, ggrepel, VennDiagram, grid.
Using BiocManager::install(), install the following packages from Bioconductor: GEOquery, edgeR, limma, pvca, dearseq, sva.
The following chunk will check if you already have any of these packages and then install the missing ones.
cran_packages <- c(
"BiocManager",
"readxl",
"dplyr",
"flashClust",
"UpSetR",
"tidyr",
"ggplot2",
"stringr",
"dendextend",
"ggdendro",
"ggrepel",
"VennDiagram",
"grid"
)
# Install missing CRAN packages
install.packages(setdiff(cran_packages, installed.packages()[, "Package"]))
# ---- BioConductor Packages ----
bioc_packages <-
c("GEOquery", "edgeR", "limma", "pvca", "dearseq", "sva")
# Load BiocManager and install missing Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(setdiff(bioc_packages, installed.packages()[, "Package"]))
Follow the link to NCBI's Gene Expression Omnibus for the series A modular transcriptional signature identifies phenotypic heterogeneity of human tuberculosis infection. Scroll down and click on the link to the SubSeries GSE107991, corresponding to the London cohort. Download the corresponding raw count data (i.e. 'GSE107991_Raw_counts_Berry_London.xlsx') and save it in a directory.
First, change the line below to specify the path where you saved your data.
data_path = "your/path/to/data/"
Load the raw counts data with the read_excel() function.
# Read the raw counts for the London data
london_raw <-
read_excel(paste0(data_path, "GSE107991_Raw_counts_Berry_London.xlsx"))
Use the getGEO() function from the GEOquery package to download the metadata corresponding to these files.
# Load the metadata for the London dataset
london_metadata <-
GEOquery::getGEO("GSE107991", GSEMatrix = FALSE) #London
Extract relevant information from the metadata.
# This function extracts the relevant columns from the metadata files and changes some of the names of the values
get_info <- function(i, meta) {
name <- meta@gsms[[i]]@header$source_name_ch1
name <- gsub("Active_TB", "ActiveTB", name)
name <- gsub("Test_set", "TestSet", name)
name <- gsub("Validation_set", "ValidationSet", name)
name <- gsub("Non_progressor", "NonProgressor", name)
res <- unlist(strsplit(name, split = "_"))
res <- c(res, meta@gsms[[i]]@header$title)
res
}
infos_london <-
sapply(1:length(london_metadata@gsms),
FUN = get_info,
meta = london_metadata)
infos_df_london <-
cbind.data.frame(GSMID = names(london_metadata@gsms),
t(infos_london))
rownames(infos_df_london) <- names(london_metadata@gsms)
colnames(infos_df_london)[-1] <-
c("Cohort", "Location", "Set", "Status", "SampleName")
Change the format to wide, so each column is a gene, and merge the data.
# Store the London metadata as a data frame
infos_all <- infos_df_london %>% as.data.frame()
# Drop the annotation columns from the London counts
counts_merged <- london_raw %>% select(-Gene_name, -Gene_biotype)
# pivot to wide format
counts_all <- t(counts_merged)[-1,] %>% as.data.frame()
# Columns are genes
colnames(counts_all) = counts_merged$Genes
# Rows are samples
rownames(counts_all) = 1:dim(counts_all)[1]
# Now we merge the counts with the metadata by the sample name
merged_all = merge(
y = bind_cols("SampleName" = colnames(counts_merged)[-1], counts_all),
x = infos_all,
by = "SampleName"
) %>% distinct()
# here we extract the counts and store them as numeric values in the dataframe
counts_all = merged_all %>% select(-c(1:6)) %>%
sapply(as.numeric) %>%
as.data.frame()
# Here we extract the metadata
metadata_all = merged_all %>% select(c(1:6))
# Visualise the start of the merged dataframe
merged_all[, c(1:8)] %>% head(5)
## SampleName GSMID Cohort Location Set Status
## 1 Berry_London_Sample1 GSM2886035 Berry London TestSet ActiveTB
## 2 Berry_London_Sample10 GSM2886044 Berry London TestSet ActiveTB
## 3 Berry_London_Sample11 GSM2886045 Berry London TestSet LTBI
## 4 Berry_London_Sample12 GSM2886046 Berry London TestSet LTBI
## 5 Berry_London_Sample13 GSM2886047 Berry London TestSet LTBI
## ENSG00000000003 ENSG00000000005
## 1 5 0
## 2 7 0
## 3 2 0
## 4 14 0
## 5 26 1
Counts matrices usually contain some completely uninformative variables (i.e. with 0 counts in every sample). Remove these from the data and report how many columns were removed.
# Extract the total counts for each gene
count_sums <- colSums(counts_all)
# Unexpressed genes are those without a single count across samples
unexpressed_genes = colnames(counts_all)[count_sums == 0]
# Remove these from the data
counts_all_filtered = counts_all %>%
select(-all_of(unexpressed_genes))
# how many genes were removed?
message(paste0(
unexpressed_genes %>% length(),
" unexpressed genes removed from the data."
))
## 19065 unexpressed genes removed from the data.
Many investigators only include protein-coding genes in their analysis. Information on whether a gene is protein-coding may be found in the raw counts under the Gene_biotype column.
london_raw$Gene_biotype %>% head()
## [1] "protein_coding" "protein_coding" "protein_coding" "protein_coding"
## [5] "protein_coding" "protein_coding"
Using this information, filter the counts to include only protein-coding genes. Report the number of genes removed at this stage.
# Use the london data to extract the names of the protein-coding genes
non_protein_coding_genes = london_raw %>%
filter(Gene_biotype != "protein_coding") %>%
pull(Genes)
# De-select non-protein-coding genes from the counts matrix
counts_all_filtered_proteincoding = counts_all_filtered %>%
select(!any_of(non_protein_coding_genes))
# how many genes were removed?
message(paste0(
intersect(non_protein_coding_genes, colnames(counts_all_filtered)) %>% length(),
" non-protein-coding genes removed from the data."
))
## 21222 non-protein-coding genes removed from the data.
Now that we are finished with the data loading and merging, we may remove all of the unnecessary objects from the global environment using the rm() function. This is good practice when working with high-dimensional data, as memory can quickly be occupied by large, unnecessary objects in the global environment.
# Here we list the objects to keep
keep <-
c("metadata_all",
"counts_all_filtered_proteincoding",
"data_path")
# Remove everything else from the global environment
rm(list = setdiff(ls(), keep))
# Garbage collection to clear unused memory
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 6128015 327.3 11049057 590.1 11049057 590.1
## Vcells 11474492 87.6 40691150 310.5 39265428 299.6
1a) Perform a differential gene expression analysis using voom-limma. A tutorial can be found here.
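As a sketch of the voom-limma pipeline (not the full analysis of the London data), the steps might look as follows on a small simulated count matrix; the gene counts, group labels, and sample sizes here are invented for illustration, and the edgeR/limma calls are the standard ones from the limma workflow.

```r
library(edgeR)   # DGEList, calcNormFactors
library(limma)   # voom, lmFit, eBayes, topTable

set.seed(1)
# Simulated toy data: 200 genes x 10 samples, two groups of 5
counts <- matrix(rnbinom(200 * 10, mu = 100, size = 10), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:10)))
group <- factor(rep(c("LTBI", "ActiveTB"), each = 5))

# Build the DGEList and compute TMM normalisation factors
dge <- calcNormFactors(DGEList(counts = counts, group = group))

# Design matrix for the two-group comparison
design <- model.matrix(~ group)

# voom transforms counts to log2-CPM with precision weights; limma then
# fits a linear model per gene and moderates the variances with eBayes
v   <- voom(dge, design)
fit <- eBayes(lmFit(v, design))

# Per-gene results: log-fold-changes, raw and adjusted p-values
tt <- topTable(fit, coef = 2, number = Inf)
head(tt)
```

For the exercise, the same calls apply with the filtered London counts (transposed so genes are rows) and the Status column of the metadata as the group.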
1b) Extract the raw p-values from this test from the eBayes() object and store them in a dataframe. Sort the p-values in ascending order and store the ranks in the same dataframe.
1c) In ggplot2, plot a histogram of the raw p-values. What do you notice about the shape of the distribution? Why do you think this is?
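To see what shape to expect, here is a hedged sketch on simulated p-values: a uniform "null" component plus a component concentrated near zero standing in for truly differentially expressed genes (the proportions are invented).

```r
library(ggplot2)

set.seed(2)
# Simulated p-values: 8000 uniform nulls plus 2000 small "signal" p-values
pvals <- data.frame(p = c(runif(8000), rbeta(2000, 0.5, 20)))

# Under the null alone the histogram would be roughly flat; true effects
# add the peak near zero typically seen in differential expression results
gg <- ggplot(pvals, aes(x = p)) +
  geom_histogram(breaks = seq(0, 1, by = 0.05),
                 fill = "grey70", colour = "black") +
  labs(x = "raw p-value", y = "count")
gg
```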
2a) Using the p.adjust() function, adjust the raw p-values according to the Bonferroni, Holm, Benjamini-Hochberg, and Benjamini-Yekutieli corrections. Store them in different columns of the same dataframe. Based on a significance level of 0.05, compare the number of significant genes that you find according to each of the methods. Which are the strictest methods?
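A minimal illustration of p.adjust() on invented p-values (a few strong signals among mostly uniform nulls), showing how the method names map to the corrections above:

```r
set.seed(3)
# Toy p-values: four strong signals plus 996 uniform nulls
p_raw <- c(1e-8, 1e-6, 1e-4, 0.003, runif(996))

# One column per correction method
adj <- data.frame(
  raw        = p_raw,
  bonferroni = p.adjust(p_raw, method = "bonferroni"),
  holm       = p.adjust(p_raw, method = "holm"),
  BH         = p.adjust(p_raw, method = "BH"),
  BY         = p.adjust(p_raw, method = "BY")
)

# Number of discoveries at the 5% level under each correction
colSums(adj <= 0.05)
```

The FWER-controlling methods (Bonferroni, Holm) are never more liberal than BH, and BY sits between BH and the FWER methods, which is the pattern to look for in the real data.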
3a) Plot the log10-transformed p-values against their ranks for all the correction methods on the same plot. Colour the points by their correction method. What can you say about the strictness of each method?
When we perform multiplicity corrections, we can either adjust the p-values themselves, or we can adjust the (per-p-value) significance levels. For example, in the Bonferroni correction, we can either multiply each p-value by the number of tests performed and compare it to the unadjusted significance level (e.g. 5%), or we can equivalently divide the significance level by the number of tests and compare the unadjusted p-values to it. If the p-values are ordered from smallest to largest as \(p_{(1)}, \dots, p_{(N)}\), the Bonferroni thresholds are therefore given by
\[ p_{(i)} \le \frac{\alpha}{N}, \quad i = 1, \dots, N \]
The thresholds for the Holm correction are given as
\[ p_{(i)} \le \frac{\alpha}{N-i+1}, \quad i = 1, \dots, N \]
The thresholds for the B-H procedure are given as
\[ p_{(i)} \le \frac{i}{N} \alpha, \quad i = 1, \dots, N \]
The thresholds for the B-Y procedure are given as
\[ p_{(i)} \le \frac{i}{N} \frac{\alpha}{c(N)}, \quad i = 1, \dots, N, \quad c(N) = \sum_{j=1}^{N} \frac{1}{j} \]
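The equivalence between adjusting p-values and adjusting thresholds can be checked numerically. The sketch below uses simulated p-values (100 signal-like values among 900 uniform nulls; the mixture is invented) and applies the threshold formulas above, remembering that Holm is a step-down procedure while BH and BY are step-up:

```r
set.seed(4)
N  <- 1000
# Simulated p-values: 100 signals skewed toward zero plus 900 nulls
p  <- c(rbeta(100, 1, 50), runif(900))
ps <- sort(p)
i  <- seq_len(N)
alpha <- 0.05

# Per-rank thresholds from the formulas above
thr_holm <- alpha / (N - i + 1)
thr_bh   <- i * alpha / N
thr_by   <- i * alpha / (N * sum(1 / i))

# Bonferroni: a single fixed threshold alpha / N
n_bonf <- sum(ps <= alpha / N)
# Holm (step-down): reject until the first threshold failure
n_holm <- match(FALSE, ps <= thr_holm, nomatch = N + 1) - 1
# BH and BY (step-up): reject everything up to the largest passing rank
n_bh <- max(c(0, which(ps <= thr_bh)))
n_by <- max(c(0, which(ps <= thr_by)))

# The counts match comparing p.adjust()ed p-values to alpha
stopifnot(n_bonf == sum(p.adjust(p, "bonferroni") <= alpha),
          n_holm == sum(p.adjust(p, "holm") <= alpha),
          n_bh   == sum(p.adjust(p, "BH") <= alpha),
          n_by   == sum(p.adjust(p, "BY") <= alpha))
```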
Plot the log10-transformed unadjusted p-values against their ranks. Draw lines in different colours showing the thresholds for each of the correction methods, as detailed above. Then, colour the points according to the strictest threshold passed (for reference, see slide 48 of the Test Multiplicity course).
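A sketch of the threshold lines on simulated p-values (the exercise uses the real ones); colouring the points by the strictest threshold passed is left for the exercise itself.

```r
library(ggplot2)

set.seed(5)
N  <- 1000
# Simulated sorted p-values standing in for the real ones
p  <- sort(c(rbeta(100, 1, 50), runif(900)))
i  <- seq_len(N)
alpha <- 0.05

# One long data frame of threshold curves, one per correction method
thresholds <- data.frame(
  rank   = rep(i, 4),
  thr    = c(rep(alpha / N, N),              # Bonferroni
             alpha / (N - i + 1),            # Holm
             i * alpha / N,                  # BH
             i * alpha / (N * sum(1 / i))),  # BY
  method = rep(c("Bonferroni", "Holm", "BH", "BY"), each = N)
)

gg <- ggplot() +
  geom_point(data = data.frame(rank = i, p = p),
             aes(x = rank, y = log10(p)), size = 0.5) +
  geom_line(data = thresholds,
            aes(x = rank, y = log10(thr), colour = method)) +
  labs(x = "rank", y = "log10 p-value", colour = "correction")
gg
```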