Summary

The term “microbiome” refers to the microbial community in a given environment. In particular, it has recently risen to prominence in medicine to refer to the human oral, skin, urogenital, and digestive tract commensal bacterial communities. Modern technologies allow for the characterization of microbiome composition by high-throughput sequencing of 16S ribosomal DNA that effectively serves as identifying “bar codes”. Research studies focus on comparing differences in microbiomes between environments or changes within environments over time. To ensure study rigor and reproducibility it is important that data processing be clearly described and standardized. This package aims to unify the processing, analysis, and visualization tools necessary for modern microbiome studies under a transparent and straightforward implementation that facilitates standardization in analysis and reporting.

Previous open source microbiome packages available for R (R Core Team 2018) (e.g. vegan and microbiome) have been developed, but lack the functionality afforded to more modern tools including phyloseq (McMurdie and Holmes 2013) and metacoder (Foster, Sharpton, and Grünwald 2017). Both of these packages, however, provide different degrees of functionality as it relates to data wrangling, statistical methods, and visualization. Phyloseq, for instance, relies on base R functions such as subset to extract or manipulate data, while metacoder uses a more modern approach like the tidyverse. Additionally, metacoder is built on top of the taxa package and uses a “taxmap object”, which allows for direct manipulation of hierarchical taxonomic data and associated application-specific data (Zachary, Scott, and Niklaus 2018). Phyloseq, on the other hand, provides an excellent means for importing data into R as a “phyloseq object”, which can be used with various proven methods for analysis. In order to bridge the gap, we have developed MicrobiomeR, to provide new tools and a comprehensive workflow based on concepts found in the phyloseq package and newer technologies being developed in the metacoder package.

Workflow

The paradigm for studying the microbiome follows the steps shown in the figure below. MicrobiomeR covers the last four steps of this workflow. Microbiome Workflow

MicrobiomeR provides users with raw output files from the NIH’s Nephele pipeline. This data was generated by the Illumina MiSeq workflow followed by the Qiime (JG et al. 2010) 16S pipeline for pair ended FASTQ files. To begin a MicrobiomeR workflow, raw data files are imported into R as a phyloseq object, using the create_phyloseq() function. While discouraged, the data can be manipulated directly using any of phyloseq’s methods. Next, the data is converted to a taxmap object using the create_taxmap() function.

By conceptually modeling MicrobiomeR’s data preprocessing functions after phyloseq, we are able to enhance the useful microbiome analysis capabilities of metacoder and taxa by providing simple “phyloseq-style” preprocessing functions, which are critical for reducing noise within the data. Many of the **_filter()* functions are used for this purpose. Additionally, any of the metacoder or taxa functions can be incorporated into the preprocessing steps as MicrobiomeR primarily operates on taxmap objects.

After preprocessing the data, statistical analysis can be conducted using MicrobiomeR’s formatting functions (as_*format()), which incorporate metacoder’s calc*() and compare_groups() functions. Furthermore, MicrobiomeR offers a unique permanova() function, which quantifies multivariate community-level differences between groups. Following statistical analysis, alpha and beta diversity can be visualized using the alpha_diversity_plot() and ordination_plot() functions, which are helpful for understanding the intra-sample differences (evenness and richness) and the inter-group differences. Other publication-ready visualizations can also be created using heat_tree_plots(), correlation_plots(), stacked_barplot(), and top_coefficients_barplot().

Utilities

One of the key advantages in using MicrobiomeR is its “phyloseq-style” filtering functions. These functions are heavily dependent upon formatting and validation checkpoints to ensure that the proper data is being manipulated. In Table 1, we describe the “MicrobiomeR formats” used throughout the package and how they are related to phyloseq, metacoder, and each other.

Table 1:

Level Name Description Data Tables Other Notes
0 PhyloseqFormat A taxmap object that has just been converted from a phyloseq object with the create_phyloseq() function. otu_table tax_data sample_data phy_tree The observation tables represent the phyloseq::otu_table(), phyloseq::tax_table(), phyloseq::sample_data(), and phyloseq::phy_tree(), data in the original phyloseq object.
1 Raw Format A taxmap object that has been processed with the as_raw_format()function. phyloseq tables otu_abundance otu_annotations The new observation tables are just name conversions of the otu_table and tax_data table from the “phyloseq_format”.
2 BasicFormat A taxmap object that has been processed with the as_basic_format() function. raw tables taxa_abundance otu_proportions taxa_proportions This format is defined by observation data that has been processed with the metacoder::calc_*_() functions.
3 AnalyzedFormat A taxmap object that has been processed with the as_analyzed_format() function.. basic tables statistical_data stats_tax_data This format is defined by observation data that has been processed with the metacoder::compare_groups() function.

Validation is performed internally by most of the preprocessing, analysis, and formatting functions, but it can also be done directly by using the *is_*_format()* functions. Furthermore, it is encouraged to explore the taxmap objects in order to view observation and taxonomy data. Other MicrobiomeR utilities include project management tools for creating and organizing output directories for plots, a set of color palette functions based on grDevices::colorRampPalette(), and a taxonomic data parsing function called parse_taxonomy_silva_128() used to import data annotated by the SILVA (C et al. 2013) database as a phyloseq object.

Acknowledgments

The data in this R package was generated by the University of Mississippi Medical Center’s Genomics Core and further analyzed using the Nephele platform from the National Institute of Allergy and Infectious Diseases (NIAID) Office of Cyber Infrastructure and Computational Biology (OCICB) in Bethesda, MD.

MicrobiomeR is currently being used to analyze microbiome data in a study conducted by Xiao Zhang, and enabled by Dr. Eric Vallender.

References

C, Quast, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, and Glöckner FO. 2013. “The Silva Ribosomal Rna Gene Database Project: Improved Data Processing and Web-Based Tools.” Nucleic Acids Research, no. 41: D590–D596.

Foster, Zachary, Thomas Sharpton, and Niklaus Grünwald. 2017. “Metacoder: An R Package for Visualization and Manipulation of Community Taxonomic Diversity Data.” PLOS Computational Biology 13 (2). Public Library of Science: 1–15. https://doi.org/10.1371/journal.pcbi.1005404.

JG, Caporaso, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, et al. 2010. “QIIME Allows Analysis of High-Throughput Community Sequencing Data.” Nature Methods 7: 335–36. qiime.org.

McMurdie, Paul J., and Susan Holmes. 2013. “Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.” PLoS ONE 8 (4): e61217. http://dx.plos.org/10.1371/journal.pone.0061217.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Zachary, Foster, Chamberlain Scott, and Grunwald Niklaus. 2018. Taxa: An R Package Implementing Data Standards and Methods for Taxonomic Data. F1000Research. Vol. 7. https://doi.org/10.12688/f1000research.14013.1.