Mining EBI Metagenomics (EMG) output with R¶
Key Learning Outcomes¶
After completing this module the trainee should be able to:
-
Load requred R packages and input data for analysis from EMG
-
Know how analysis the data with R statistical packages
Resources You’ll be Using¶
Tools Used¶
RStudio: https://www.rstudio.com/
Data¶
America Gut Project: EMG Project ERP012803
Author: B T.F. Alako - Cochrane team: European Nucleotide Archive.
Useful Links¶
EBI Metagenomics resource (EMG) : EBI MG Portal
Part 1: Exploratory analysis of EMG portal contingency tables¶
Introduction¶
The American Gut project is the largest crowdsourced citizen science project to date. Fecal, oral, skin, and other body site samples collected from thousands of participants represent the largest human microbiome cohort in existence. Detailed health and lifestyle and diet data associated with each sample is enabling us to deeply examine associations between the human microbiome and factors such as diet (from vegan to near carnivore and everything in between), season, amount of sleep, and disease states such as IBD, diabetes, or autism spectrum disorder-as well as many other factors not listed here. The American Gut project also encompasses the British Gut and Australian Gut projects, widening the cohort beyond North America. As the project continues to grow, we will be able to identify significant associations that would not be possible with smaller, geographically and health/disease status-limited cohorts. EMG Project ERP012803
We will explore the data generated by the study above. The data is available at: EMG Project ERP012803
Loading of R packages¶
First we need to load the required R packages for analysis.
1 2 3 4 5 6 7 8 | suppressWarnings(suppressMessages(require(stringr))) # String format suppressWarnings(suppressMessages(require(ade4))) # Multivariate analysis suppressWarnings(suppressMessages(require(ggplot2))) # Fancy plotting suppressWarnings(suppressMessages(require(grid))) # Has the viewport function suppressWarnings(suppressMessages(require(dplyr))) # Data manipulation suppressWarnings(suppressMessages(require(tidyr))) # Data manipulation suppressWarnings(suppressMessages(require(FactoMineR))) # Multivariate analysis suppressWarnings(suppressMessages(require(factoextra))) # Visualize result of multivariate analysis |
Load EMG input data for analysis¶
We can then load the analysis summary and metadata files for ERP0123803.
1 2 3 4 | ERP012803 <- "https://www.ebi.ac.uk/metagenomics/projects/ERP012803/download/2.0/export?contentType=text&exportValue=taxonomy_abundances" ERP012803.meta <- "https://www.ebi.ac.uk/metagenomics/projects/ERP012803/overview/doExport"; gutproject <- tbl_df(read.delim(file=ERP012803)) meta <- read.csv(file=ERP012803.meta) |
Preprocess the data¶
- Filter to only retain full taxonomic lineages with a species name
1 2 3 4 5 6 7 | gutproject <- gutproject %>% rename(taxonomy=X.SampleID) %>% #mutate(taxonomy=gsub(".*?s__","",taxonomy)) %>% mutate(taxonomy=gsub(".*?g__","",taxonomy)) %>% mutate(taxonomy=gsub(";s__","_", taxonomy)) %>% filter(taxonomy !="" | taxonomy=='_' ) %>% filter(!grepl('_$', taxonomy)) |
1 2 3 4 | gutproject <- gutproject %>% group_by(taxonomy) %>% summarize_each(funs(sum)) %>% ungroup() |
Note
You can ignore warning: Setting row names on a tibble is deprecated.
- Transform the wide format data into the long format for the purpose of appending descriptive information
1 2 3 | gutproject.long <- gutproject %>% gather(sampleb, freq, ERR1072624:ERR1160857) %>% filter(taxonomy !="Root") |
1 | meta <- meta %>% select(Sample.Description, Run.ID) |
1 2 3 4 5 6 7 8 9 10 11 12 | meta.strip <- tbl_df(meta) %>% mutate(Sample.Description=gsub(" ","_",Sample.Description), Sample.Description=gsub("American_Gut_Project_","", Sample.Description), Sample.Description=gsub("American_","", Sample.Description) ) meta.strip <- meta.strip %>% group_by(Sample.Description) %>% mutate(count=n()) %>% ungroup() %>% select(Sample.Description, count) %>% unique() meta.strip %>% ggplot(aes(x=Sample.Description, y=count)) + geom_bar(stat='identity') + geom_text(aes(label=count)) + theme_bw() + theme(axis.text.x=element_text(angle=45, hjust=1)) + ggtitle("American gut project") |
Question
What is the data mostly composed of?
- Append metadata to the gut data
1 2 | gutproject.df <- merge(gutproject.long, meta, by.x='sampleb', by.y='Run.ID') gutproject.df <- tbl_df(gutproject.df) |
1 2 3 4 5 | gutproject.df <- gutproject.df %>% mutate(Sample.Description=gsub(" ","_",Sample.Description), Sample.Description=gsub("American_Gut_Project_","", Sample.Description), Sample.Description=gsub("American_","", Sample.Description)) %>% ungroup() |
1 2 3 4 5 6 7 8 | gutproject.df <- gutproject.df %>% group_by(Sample.Description,taxonomy) %>% summarise(cum=sum(freq)) %>% ungroup() gutproject.df <- gutproject.df %>% select(taxonomy,Sample.Description,cum) #filter(Sample.Description !="Stool_sample") %>% #filter(Sample.Description !="Stool_sample" & Sample.Description !="Mouth_sample" ) #filter( Sample.Description !="Vaginal_mucus_sample" ) |
1 2 3 | gutproject.df <- gutproject.df %>% spread(Sample.Description, cum, fill=0) %>% as.data.frame() |
1 2 | gutproject.mat <- gutproject.df %>% select(-c(1)) %>% as.matrix() rownames(gutproject.mat) <- gsub("\\[|\\]","", perl=TRUE, gutproject.df$taxonomy) |
Question
Is there any association between the species and the data source?
- Perform a chi-square test of independence, this evaluates whether there is a significant dependence between species and biological samples
1 2 | chisq <- chisq.test(gutproject.mat) chisq |
1 2 3 | Pearson's Chi-squared test data: gutproject.mat X-squared = 106730000, df = 8442, p-value < 2.2e-16 |
Perform Correspondence Analysis (CA)¶
Correspondence Analysis (CA) is a generalized form of Principal Component Analysis for categorical data. We use CA to reduce the dimension of the data without loosing the most important information. CA is used to graphically visualize rows points and column points in a low dimensional space
1 | gutproject.ca <- CA(gutproject.mat, graph=FALSE) |
- Explore the content of the CA output
1 | summary(gutproject.ca, nb.dec = 2, nbelements = 4, ncp= 3) |
The result of the function summary() contains the chi-square statistics and 3 tables:
- Table 1- Eigenvalues, displays the variances and the percentage of vaiances retained by each dimension.
- Table 2, Contains the coordinates, the contribution and the cos2 (quality of representation [0-1] of the first 4 active rows variable on the dimension 1, 2 and 3
- Table 3: displays the coordinates, the contribution and the cos2 of the first 4 active column variables on the dimension 1, 2 and 3
Question
Can you interprete the CA result?
Hint
correlation coef., chi-square statistic
Correlation coefficient¶
Eigen values is the amount of information retained by each axis.
1 2 3 4 | eig <- get_eigenvalue(gutproject.ca) trace <- sum(eig$eigenvalue) cor.coef <- sqrt(trace) cor.coef |
1 | [1] 1.378614 |
As a rule of thumb a correlation above 0.2 can be considered important (Bendixen 1995, 567; Healey 2013, 289-290) The correlation coef. is 1.38 in the American gut project dataset, indicating a strong association between row (species) and column(biological samples variables.). A more rigorous approach for examining the association is to use Chi-square statistics. The association is highly significant (p-value=0)
Explore the percentages of inertia explained by the CA dimensions, the scree plot
1 | fviz_screeplot(gutproject.ca, barfill='white', addlabels=TRUE) + theme_bw() |
Hierarchical Clustering of principal components¶
Compute hierarchical clustering of species of CA results
1 | gutproject.hcpc <- HCPC(gutproject.ca, graph = TRUE , nb.clust=4, order=TRUE ) |
Symmetric biplot, whereby both rows (blue) and columns (red) are represented in the same space using the principal coordinates. Only the distance between row points or the distance between column points can be reliably interpreted. With a symmetric plot, only general statements can be drawn about the pattern. The inter-distance between rows and column can not be interpreted.
1 | fviz_ca_biplot(gutproject.ca, repel=TRUE, select.row = list(contrib = 95), select.col = list(contrib = 10), geom="text") + theme_minimal() |
A 3D map of the hierarchical clustering of the principal component.
1 2 3 | plot(gutproject.hcpc, choice ="3D.map", draw.tree = FALSE, ind.names = FALSE, centers.plot = TRUE, title='Hierarchical clustering of the Principal Components') |
Compute hierarchical clustering of biological samples of CA results
1 | res.hcpc <- HCPC(gutproject.ca, cluster.CA = "columns", graph = TRUE , nb.clust=4, order=TRUE ) |
Transpose the row and column in the gut data and perform a new correspondance analysis
1 | gutproject.ca <- CA(t(gutproject.mat), ncp=4) |
1 2 3 | fviz_ca_biplot(gutproject.ca, repel=TRUE, select.row = list(contrib = 12), select.col = list(contrib = 95), geom="text") + theme_minimal() |
3D hierarchical clustering of the principal components.
1 2 | plot(res.hcpc, choice ="3D.map", draw.tree = FALSE, ind.names = FALSE, centers.plot = TRUE, title='Hierarchical clustering of the Principal Components') |