Part 2: Functional analysis of EMG genomic and transcriptomic data using R¶
Gut microbiota disturbance during antibiotic therapy: a multi-omic approach¶
Introduction¶
Antibiotic usage strongly affects microbial intestinal metabolism and thereby impacts human health. Understanding this process and the underlying mechanisms remains a major research goal. Accordingly, we conducted the first comparative omic investigation of gut microbial communities in faecal samples taken at multiple time points from an individual subjected to Beta-lactam therapy. Perez-Cobas AE, et al. 2013 Gut 62(11), 1591-1601
We will explore the data generated by the study above. The data is available at: https://www.ebi.ac.uk/metagenomics/projects/ERP001506
1 2 3 4 5 6 7 8 | suppressWarnings(suppressMessages(require(stringr))) # String format suppressWarnings(suppressMessages(require(ade4))) # Multivariate analysis suppressWarnings(suppressMessages(require(ggplot2))) # Fancy plotting suppressWarnings(suppressMessages(require(grid))) # Has the viewport function suppressWarnings(suppressMessages(require(dplyr))) # Data manipulation suppressWarnings(suppressMessages(require(tidyr))) # Data manipulation suppressWarnings(suppressMessages(require(FactoMineR))) # Multivariate analysis suppressWarnings(suppressMessages(require(factoextra))) # Visualize result of multivariate analysis |
Gather the necessary analysis results file from EMG portal¶
R provides a very versatile data structure that allows us to store various data types, namely the list.
1 | analysis_files <- c("BP_GO_abundances","BP_GO-slim_abundances","CC_GO_abundances","CC_GO-slim_abundances","GO_abundances","GO-slim_abundances","IPR_abundances","MF_GO_abundances","MF_GO-slim_abundances","phylum_taxonomy_abundances","taxonomy_abundances") |
Read in all the contingency tables in the list making use of the lapply function
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | baseanalysis = "https://www.ebi.ac.uk/metagenomics/projects/ERP001506/download/2.0/export?contentType=text&exportValue=" metadata <- "https://www.ebi.ac.uk/metagenomics/projects/ERP001506/overview/doExport"; filenames <- paste(baseanalysis, analysis_files, sep="") results <- lapply(filenames,function(i){ print(i) read.delim(i, header=TRUE) }) metadata <- tbl_df(read.csv(file=metadata)) metadata <- metadata %>% filter(Release.version>1) %>% mutate(Sample.Name= gsub("Human gut metagenome, |Human gut metatranscriptome \\(mRNA\\), | after antibiotics","",Sample.Name)) %>% mutate(Sample.Name= paste("days",gsub(" days", "", Sample.Name), sep="")) %>% mutate(genomictype= ifelse(Experiment.type=="metatranscriptomic", "mRNA", "DNA")) %>% rename(days=Sample.Name, id=Sample.ID, datatype=Experiment.type,genomictype=genomictype) %>% select(days,id,datatype,genomictype) %>% mutate(id=ifelse(datatype=="amplicon", paste(id, "_A", sep=""), ifelse(datatype=="metatranscriptomic", paste(id,"_T", sep=""), paste(id, "_G", sep="")))) %>% mutate(datatype=str_to_title(datatype)) |
Question
How can we access elements in the list?
1 | attributes(results) |
1 2 3 | colheader <- gsub(analysis_files, pattern="-", replacement ="_") names(results) <- colheader attributes(results) |
1 | str(results, list.len=4) |
Each entry in the list can now be accessed by name. Give each row an explicit name before manipulating the table contents with dplyr functions. The main advantage of tbl_df (return a dataframe from a dataframe) is that it only displays a few rows and all the columns fit on the current screen, with the rest of the data displayed as text. We will be making extensive use of the dplyr package, which allows operations to be chained (“%>%”); it is analogous to the linux pipe command (|), see here for more details: https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
1. Taxonomy: who is there and in what proportion?¶
We can optimise the display of the first few columns or so…
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | taxonomy <- tbl_df(results$phylum_taxonomy_abundances) taxonomy <- taxonomy %>% mutate(phylum=as.character(phylum)) tax_g <- taxonomy %>% select(matches("_G|um")) %>% mutate(DataType="Metagenomics") tax_t <- taxonomy %>% select(matches("_T|um")) %>% mutate(DataType="Metatranscriptomics") tax_a <- taxonomy %>% select(matches("_A|um")) %>% mutate(DataType="Amplicon") renameSample <- function(metadata=metadata, localdf=localdf){ strippedcol <- colnames(localdf) for (i in 1:nrow(metadata)){ strippedcol[which(strippedcol==as.character(metadata[i,2]))] <- as.character(metadata[i,1]) } return(strippedcol) } colnames(tax_g) <- renameSample(metadata=metadata, localdf=tax_g) colnames(tax_t) <- renameSample(metadata=metadata, localdf=tax_t) colnames(tax_a) <- renameSample(metadata=metadata, localdf=tax_a) |
1 2 3 4 5 6 | tax_metag <- tax_g %>% gather(condition, count, 3:ncol(tax_g)-1) %>% group_by( condition) %>% mutate(prop=count/sum(count)) tax_metat <- tax_t %>% gather(condition, count, 3:ncol(tax_t)-1) %>% group_by( condition) %>% mutate(prop=count/sum(count)) |
1 | genomic_transcriptomics <- rbind(tax_metag, tax_metat) |
1 2 3 4 5 6 7 8 | genomic_transcriptomics %>% arrange(desc(prop)) %>% ggplot( aes(x=condition, y=prop,fill=phylum)) + geom_bar(stat='identity', position='fill', aes(fill = phylum)) + facet_grid(~DataType) + ggtitle('Species abundance through treatment per datatype') + ylab('proportion abundance') + theme_light() + theme(axis.text.x=element_text(angle=45, hjust=1)) + coord_flip() |
Question
What conclusion would you draw from this distribution?
The heatmap is of absolute Species counts per datatypes
1 2 3 4 | genomic_transcriptomics %>% ggplot( aes(x=condition,y=phylum))+ facet_grid(~DataType) + geom_tile(aes(fill=count)) + scale_fill_gradient(low="white", high="darkblue") + xlab("") + ylab("") + theme(axis.text.x=element_text(angle=45, hjust=1)) |

2. Function: who does what, in what relative proportion?¶
Recall the various functional annotations in the created list.
1 | attributes(results) |
As an initial exploration, we will focus on a broad GO annotation (GO-slim)
** (A) GO Biological processes**: relative occurrence during treatment
1 2 3 4 5 6 7 8 9 10 11 | bp_slim <- tbl_df(results$BP_GO_slim_abundances) bp_slim <- bp_slim %>% mutate(description=as.character(description)) bp_slim_g <- bp_slim %>% select(matches("_G|description")) %>% mutate(DataType="Metagenomics") bp_slim_t <- bp_slim %>% select(matches("_T|description")) %>% mutate(DataType="Metatranscriptomics") bp_slim_a <- bp_slim %>% select(matches("_A|description")) %>% mutate(DataType="Amplicon") |
Give meaningful names to conditions, as for the taxonomic case above.
1 2 3 4 5 6 7 8 9 10 11 12 | colnames(bp_slim_g) <- renameSample(metadata=metadata, localdf=bp_slim_g) colnames(bp_slim_t) <- renameSample(metadata=metadata, localdf=bp_slim_t) colnames(bp_slim_a) <- renameSample(metadata=metadata, localdf=bp_slim_a) bp_slim_metagenomics <- bp_slim_g %>% gather(condition, count, 3:ncol(bp_slim_g)-1) %>% group_by( condition) %>% mutate(prop=count/sum(count)) bp_slim_metatransciptomics <- bp_slim_t %>% gather(condition, count, 3:ncol(bp_slim_t)-1) %>% group_by( condition) %>% mutate(prop=count/sum(count)) |
Merge both metatranscriptomic and metagenomic relative occurences of biological process terms per condition to visualise the relative abundance of biological process terms that are dynamic during treatment.
1 | bp_genomic_transcriptomics <- rbind(bp_slim_metagenomics, bp_slim_metatransciptomics) |
1 2 3 4 5 6 7 8 | bp_genomic_transcriptomics %>% arrange(desc(prop)) %>% ggplot( aes(x=condition, y=prop,fill=description)) + geom_bar(stat='identity', position='fill', aes(fill = description)) + facet_wrap(~DataType, ncol=1) + ggtitle('Biological processess relative abundance through \n treatment per datatype') + ylab('proportion abundance') + theme_light() + theme(axis.text.x=element_text(angle=45, hjust=1)) + coord_flip() + theme(legend.position="none") |
Question
What can you conclude from this plot?
Heat map of absolute biological process counts per datatype
1 2 3 4 | bp_genomic_transcriptomics %>% ggplot( aes(x=condition,y=description))+ facet_grid(~DataType) + geom_tile(aes(fill=count)) + scale_fill_gradient(low="white", high="darkblue") + xlab("") + ylab("") + theme(axis.text.x=element_text(angle=45, hjust=1)) |
Heat map of biological process proportions per datatype
1 2 3 4 | bp_genomic_transcriptomics %>% ggplot( aes(x=condition,y=description))+ facet_grid(~DataType) + geom_tile(aes(fill=prop)) + scale_fill_gradient(low="white", high="darkblue") + xlab("") + ylab("") + theme(axis.text.x=element_text(angle=45, hjust=1)) |

(B) GO Cellular Compartment: relative occurence during treatment
1 2 3 4 5 6 7 8 9 10 11 | cc_slim <- tbl_df(results$CC_GO_slim_abundances) cc_slim <- cc_slim %>% mutate(description=as.character(description)) cc_slim_g <- cc_slim %>% select(matches("_G|description")) %>% mutate(DataType="Metagenomics") cc_slim_t <- cc_slim %>% select(matches("_T|description")) %>% mutate(DataType="Metatranscriptomics") cc_slim_a <- cc_slim %>% select(matches("_A|description")) %>% mutate(DataType="Amplicon") |
1 2 3 | colnames(cc_slim_g) <- renameSample(metadata=metadata, localdf=cc_slim_g) colnames(cc_slim_t) <- renameSample(metadata=metadata, localdf=cc_slim_t) colnames(cc_slim_a) <- renameSample(metadata=metadata, localdf=cc_slim_a) |
1 2 3 4 5 6 7 8 9 | cc_slim_metagenomics <- cc_slim_g %>% gather(condition, count, 3:ncol(cc_slim_g)-1) %>% group_by( condition) %>% mutate(prop=count/sum(count)) cc_slim_metatransciptomics <- cc_slim_t %>% gather(condition, count, 3:ncol(cc_slim_t)-1) %>% group_by( condition) %>% mutate(prop=count/sum(count)) |
1 | cc_genomic_transcriptomics <- rbind(cc_slim_metagenomics, cc_slim_metatransciptomics) |
Bar plot of relative abundances of cellular compartment
1 2 3 4 5 6 7 8 9 | cc_genomic_transcriptomics %>% arrange(desc(prop)) %>% ggplot( aes(x=condition, y=prop,fill=description)) + geom_bar(stat='identity', position='fill', aes(fill = description)) + facet_wrap(~DataType, ncol=1) + ggtitle('Cellular compartment relative abundance through \n treatment per datatype') + ylab('proportion abundance') + theme_light() + theme(axis.text.x=element_text(angle=45, hjust=1)) + coord_flip() + theme(legend.position="none") |
Question
What can you conclude from this plot?
Heat map of absolute cellular compartment term counts per datatype
1 2 3 4 | cc_genomic_transcriptomics %>% ggplot( aes(x=condition,y=description))+ facet_grid(~DataType) + geom_tile(aes(fill=count)) + scale_fill_gradient(low="white", high="darkblue") + xlab("") + ylab("") + theme(axis.text.x=element_text(angle=45, hjust=1)) |
Heat map of Cellular compartment term proportions per datatype
1 2 3 4 | cc_genomic_transcriptomics %>% ggplot( aes(x=condition,y=description))+ facet_grid(~DataType) + geom_tile(aes(fill=prop)) + scale_fill_gradient(low="white", high="darkblue") + xlab("") + ylab("") + theme(axis.text.x=element_text(angle=45, hjust=1)) |
(C) GO Molecular function: relative occurence during treatment
Extract the data from the data structure “results” created earlier.
1 2 3 4 5 6 7 8 9 10 11 | mf_slim <- tbl_df(results$MF_GO_slim_abundances) mf_slim <- mf_slim %>% mutate(description=as.character(description)) mf_slim_g <- mf_slim %>% select(matches("_G|description")) %>% mutate(DataType="Metagenomics") mf_slim_t <- mf_slim %>% select(matches("_T|description")) %>% mutate(DataType="Metatranscriptomics") mf_slim_a <- mf_slim %>% select(matches("_A|description")) %>% mutate(DataType="Amplicon") |
Give meaningful names to conditions as for the taxonomic case above
1 2 3 4 5 6 7 8 9 10 11 12 13 | colnames(mf_slim_g) <- renameSample(metadata=metadata, localdf=mf_slim_g) colnames(mf_slim_t) <- renameSample(metadata=metadata, localdf=mf_slim_t) colnames(mf_slim_a) <- renameSample(metadata=metadata, localdf=mf_slim_a) mf_slim_metagenomics <- mf_slim_g %>% gather(condition, count, 3:ncol(mf_slim_g)-1) %>% group_by( condition) %>% mutate(prop=count/sum(count)) mf_slim_metatransciptomics <- mf_slim_t %>% gather(condition, count, 3:ncol(mf_slim_t)-1) %>% group_by( condition) %>% mutate(prop=count/sum(count)) |
Merge both metatranscriptomic and metagenomic relative occurrences of biological processes per condition to visualise the relative abundances of biological process terms that are dynamic during treatment.
1 | mf_genomic_transcriptomics <- rbind(mf_slim_metagenomics, mf_slim_metatransciptomics) |
Barplot of relative abundance Molecular function
1 2 3 4 5 6 7 8 9 10 11 | mf_genomic_transcriptomics %>% arrange(desc(prop)) %>% ggplot( aes(x=condition, y=prop,fill=description)) + geom_bar(stat='identity', position='fill', aes(fill = description)) + facet_wrap(~DataType, ncol=1) + ggtitle('Molecular function relative abundance through \n treatment per datatype') + ylab('proportion abundance') + theme_light() + theme(axis.text.x=element_text(angle=45, hjust=1)) + coord_flip() + theme(legend.position="none")# + guides(fill=guide_legend(ncol=2)) |
1 2 3 4 | mf_genomic_transcriptomics %>% ggplot( aes(x=condition,y=description)) + facet_grid(~DataType) + geom_tile(aes(fill=count)) + scale_fill_gradient(low="white", high="darkblue") + xlab("") + ylab("") + theme(axis.text.x=element_text(angle=45, hjust=1)) |
Question
What can you conclude from the plots?