Mining EBI Metagenomics (EMG) output with R

Key Learning Outcomes


After completing this module the trainee should be able to:

  • Load requred R packages and input data for analysis from EMG

  • Know how analysis the data with R statistical packages

Resources You’ll be Using


Tools Used

RStudio: https://www.rstudio.com/

Data

America Gut Project: EMG Project ERP012803

Author: B T.F. Alako - Cochrane team: European Nucleotide Archive.

EBI Metagenomics resource (EMG) : EBI MG Portal

Part 1: Exploratory analysis of EMG portal contingency tables


Introduction

The American Gut project is the largest crowdsourced citizen science project to date. Fecal, oral, skin, and other body site samples collected from thousands of participants represent the largest human microbiome cohort in existence. Detailed health and lifestyle and diet data associated with each sample is enabling us to deeply examine associations between the human microbiome and factors such as diet (from vegan to near carnivore and everything in between), season, amount of sleep, and disease states such as IBD, diabetes, or autism spectrum disorder-as well as many other factors not listed here. The American Gut project also encompasses the British Gut and Australian Gut projects, widening the cohort beyond North America. As the project continues to grow, we will be able to identify significant associations that would not be possible with smaller, geographically and health/disease status-limited cohorts. EMG Project ERP012803

We will explore the data generated by the study above. The data is available at: EMG Project ERP012803

Loading of R packages

First we need to load the required R packages for analysis.

1
2
3
4
5
6
7
8
suppressWarnings(suppressMessages(require(stringr)))      # String format
suppressWarnings(suppressMessages(require(ade4)))         # Multivariate analysis
suppressWarnings(suppressMessages(require(ggplot2)))      # Fancy plotting
suppressWarnings(suppressMessages(require(grid)))         # Has the viewport function
suppressWarnings(suppressMessages(require(dplyr)))        # Data manipulation
suppressWarnings(suppressMessages(require(tidyr)))        # Data manipulation
suppressWarnings(suppressMessages(require(FactoMineR)))   # Multivariate analysis
suppressWarnings(suppressMessages(require(factoextra)))   # Visualize result of multivariate analysis

Load EMG input data for analysis

We can then load the analysis summary and metadata files for ERP0123803.

1
2
3
4
ERP012803 <- "https://www.ebi.ac.uk/metagenomics/projects/ERP012803/download/2.0/export?contentType=text&exportValue=taxonomy_abundances"
ERP012803.meta <- "https://www.ebi.ac.uk/metagenomics/projects/ERP012803/overview/doExport";
gutproject <- tbl_df(read.delim(file=ERP012803))
meta <-  read.csv(file=ERP012803.meta)

Preprocess the data

  • Filter to only retain full taxonomic lineages with a species name

1
2
3
4
5
6
7
gutproject <- gutproject %>% 
              rename(taxonomy=X.SampleID) %>%
                #mutate(taxonomy=gsub(".*?s__","",taxonomy)) %>% 
                mutate(taxonomy=gsub(".*?g__","",taxonomy)) %>%
                mutate(taxonomy=gsub(";s__","_", taxonomy)) %>%
                  filter(taxonomy !="" | taxonomy=='_' ) %>%
                    filter(!grepl('_$', taxonomy))
- Merge counts of species in the same experimental sample

1
2
3
4
gutproject <- gutproject %>% 
                group_by(taxonomy) %>%
                  summarize_each(funs(sum)) %>%
                    ungroup()
- Use the species name as the rowname, this replaces the default numerical name

Note

You can ignore warning: Setting row names on a tibble is deprecated.

  • Transform the wide format data into the long format for the purpose of appending descriptive information

1
2
3
gutproject.long  <- gutproject %>% 
                      gather(sampleb, freq, ERR1072624:ERR1160857) %>%
                        filter(taxonomy !="Root")
- Retain only the sample description and the Run id from the metadata

1
meta <- meta %>% select(Sample.Description, Run.ID)
- What is the composition of the data.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
meta.strip <-  tbl_df(meta) %>% mutate(Sample.Description=gsub(" ","_",Sample.Description),
                         Sample.Description=gsub("American_Gut_Project_","", Sample.Description), 
                         Sample.Description=gsub("American_","", Sample.Description)
                         )
meta.strip <- meta.strip %>% group_by(Sample.Description) %>%
              mutate(count=n()) %>% ungroup() %>%
              select(Sample.Description, count)  %>% unique()

meta.strip %>% ggplot(aes(x=Sample.Description, y=count)) +  geom_bar(stat='identity') + 
                geom_text(aes(label=count))  + theme_bw() +
                      theme(axis.text.x=element_text(angle=45, hjust=1)) + 
                      ggtitle("American gut project")

Question

What is the data mostly composed of?

  • Append metadata to the gut data

1
2
gutproject.df <- merge(gutproject.long, meta, by.x='sampleb', by.y='Run.ID')
gutproject.df <- tbl_df(gutproject.df)
- Clean up the description of sample

1
2
3
4
5
gutproject.df <- gutproject.df %>% 
                  mutate(Sample.Description=gsub(" ","_",Sample.Description), 
                   Sample.Description=gsub("American_Gut_Project_","", Sample.Description),
                   Sample.Description=gsub("American_","", Sample.Description)) %>%
                    ungroup() 
- Merge taxonomy occurences count per data source

1
2
3
4
5
6
7
8
gutproject.df <- gutproject.df %>%
                  group_by(Sample.Description,taxonomy) %>% 
                    summarise(cum=sum(freq)) %>% ungroup()
gutproject.df <- gutproject.df %>%
                  select(taxonomy,Sample.Description,cum)
                  #filter(Sample.Description !="Stool_sample") %>%
                  #filter(Sample.Description !="Stool_sample" & Sample.Description !="Mouth_sample" )
                  #filter( Sample.Description     !="Vaginal_mucus_sample" )
- Transform the long format to wide format for the purpose of Mutivariate analysis

1
2
3
gutproject.df <- gutproject.df %>%
                    spread(Sample.Description, cum, fill=0) %>%
                      as.data.frame()

1
2
gutproject.mat <- gutproject.df %>% select(-c(1)) %>% as.matrix()
rownames(gutproject.mat) <- gsub("\\[|\\]","", perl=TRUE, gutproject.df$taxonomy)
- Clean up the matrix

Question

Is there any association between the species and the data source?

  • Perform a chi-square test of independence, this evaluates whether there is a significant dependence between species and biological samples

1
2
chisq <- chisq.test(gutproject.mat)
chisq
The results should look like this.
1
2
3
Pearson's Chi-squared test
data:  gutproject.mat
X-squared = 106730000, df = 8442, p-value < 2.2e-16
The species and biological samples are statistically significantly associated (p-value=0)

Perform Correspondence Analysis (CA)

Correspondence Analysis (CA) is a generalized form of Principal Component Analysis for categorical data. We use CA to reduce the dimension of the data without loosing the most important information. CA is used to graphically visualize rows points and column points in a low dimensional space

1
gutproject.ca <- CA(gutproject.mat, graph=FALSE)
  • Explore the content of the CA output
1
summary(gutproject.ca, nb.dec = 2, nbelements = 4, ncp= 3)

The result of the function summary() contains the chi-square statistics and 3 tables:

  • Table 1- Eigenvalues, displays the variances and the percentage of vaiances retained by each dimension.
  • Table 2, Contains the coordinates, the contribution and the cos2 (quality of representation [0-1] of the first 4 active rows variable on the dimension 1, 2 and 3
  • Table 3: displays the coordinates, the contribution and the cos2 of the first 4 active column variables on the dimension 1, 2 and 3

Question

Can you interprete the CA result?

Hint

correlation coef., chi-square statistic

Correlation coefficient

Eigen values is the amount of information retained by each axis.

1
2
3
4
eig <- get_eigenvalue(gutproject.ca)
trace <- sum(eig$eigenvalue)
cor.coef <- sqrt(trace)
cor.coef
1
[1] 1.378614

As a rule of thumb a correlation above 0.2 can be considered important (Bendixen 1995, 567; Healey 2013, 289-290) The correlation coef. is 1.38 in the American gut project dataset, indicating a strong association between row (species) and column(biological samples variables.). A more rigorous approach for examining the association is to use Chi-square statistics. The association is highly significant (p-value=0)

Explore the percentages of inertia explained by the CA dimensions, the scree plot

1
fviz_screeplot(gutproject.ca, barfill='white', addlabels=TRUE) + theme_bw()
The point at which the scree plot shows a bend mignt indicate an optimal dimensionality.

Hierarchical Clustering of principal components

Compute hierarchical clustering of species of CA results

1
gutproject.hcpc <- HCPC(gutproject.ca, graph = TRUE , nb.clust=4, order=TRUE )
- Visualize the CA results

Symmetric biplot, whereby both rows (blue) and columns (red) are represented in the same space using the principal coordinates. Only the distance between row points or the distance between column points can be reliably interpreted. With a symmetric plot, only general statements can be drawn about the pattern. The inter-distance between rows and column can not be interpreted.

1
fviz_ca_biplot(gutproject.ca, repel=TRUE, select.row = list(contrib = 95), select.col = list(contrib = 10), geom="text") + theme_minimal()
- Remove labels and add cluster centers

A 3D map of the hierarchical clustering of the principal component.

1
2
3
plot(gutproject.hcpc, choice ="3D.map", draw.tree = FALSE,
     ind.names = FALSE, centers.plot = TRUE, 
     title='Hierarchical clustering of the Principal Components')
* Hierarchical Clustering of principal component.

Compute hierarchical clustering of biological samples of CA results

1
res.hcpc <- HCPC(gutproject.ca, cluster.CA = "columns", graph = TRUE , nb.clust=4, order=TRUE )
Biological sample cluster in the dataset using fviz_cluster and theme_bw()

Transpose the row and column in the gut data and perform a new correspondance analysis

1
gutproject.ca <- CA(t(gutproject.mat), ncp=4)
Symmetric biplot

1
2
3
fviz_ca_biplot(gutproject.ca, repel=TRUE, select.row = list(contrib = 12), 
               select.col = list(contrib = 95), geom="text") +
  theme_minimal()
- Remove labels and add cluster centers

3D hierarchical clustering of the principal components.

1
2
plot(res.hcpc, choice ="3D.map", draw.tree = FALSE,
     ind.names = FALSE, centers.plot = TRUE, title='Hierarchical clustering of the Principal Components')