Statistical Analysis of Microbiome Data with R
Yinglin Xia and Jun Sun would like to
dedicate this book to their parents and
parents-in-law: Qijia Xia, Xincui Wang,
Zong-Xiang Sun, and Xiao-Yun Fu; and their
sons: Yuxuan Xia and Jason Xia for their
constant love and support.
Ding-Geng Chen would like to dedicate this
book to his parents and parents-in-law who
value higher education and hard work, and
to his wife, Ke, his son, John D. Chen, and his
daughter, Jenny K. Chen, for their love and
support.
Preface
Microbiome research and microbiome data analysis are among the fast-growing
areas in biomedical and public health research. This growth is both evidenced and
catalyzed by publications across relevant fields of study and methodology
development, as well as by large-scale projects such as the Human Microbiome
Project (HMP), the Integrative Human Microbiome Project, and the Metagenomics of
the Human Intestinal Tract (MetaHIT) study. By the end of 2017, HMP investigators
had published over 650 scientific papers, cited over 70,000 times
(https://commonfund.nih.gov/hmp).
We are now advancing our understanding of how the microbiome affects human
health and disease, as more microbiome projects are funded and more research and
statistical methodology papers are published. The masses of microbiome data
generated by 16S rRNA sequencing and shotgun metagenomic sequencing, and
processed through bioinformatics pipelines (packages), have driven the recent
growth spurt in microbiome studies. Data analysis and methodology are integral
parts of microbiome research. Because microbiome data are very complicated, there
is a critical need to develop statistical methodologies for microbiome research,
ranging from application to methodology to statistical theory.
Human learning habitually starts with the known and then proceeds to the
unknown. Statistical analysis of microbiome data has followed a similar process.
In the beginning, researchers and statisticians used classic statistical methods
and models or borrowed them from other relevant fields, such as ecology and
microarray studies. Later, they developed their own statistical methods and
models that target one or more unique features of microbiome data. Currently,
statistical methods and analysis tools for microbiome data are available from
classic statistics, relevant research fields, and new developments, including
tools for visualizing and characterizing the structure of microbiome data sets.
Statistical tools for performing microbiome data analysis are now available in
different languages and environments across different platforms, through either
web-based or programming-based approaches. The R system and environment play a
critical role in developing statistical tools for analyzing microbiome data.
The birth of this book is an excellent example of how a multidisciplinary team
can work together to meet the needs of the field. In April 2016, Dr. Jun Sun was
working on a microbiome book on behalf of the American Physiological Society,
published by Springer. She invited Dr. Yinglin Xia to contribute a book chapter
on microbiome data analysis. Dr. Sun and Dr. Xia have collaborated for a long
time in biomedical sciences, including microbiome studies. While working on the
book chapter, they thought it would be a good idea to expand the brief chapter
into a comprehensive book on analyzing microbiome data. They were very happy that
Dr. Ding-Geng Chen was willing to join the team and provide his expertise in
statistics and microarray studies. In May 2016, a book proposal on Statistical
Analysis of Microbiome Data with R was submitted. It was well received by the
peer reviewers and fully supported by the editors of the ICSA Book Series in
Statistics.
In this book, we aim to provide step-by-step procedures for analyzing
microbiome data with the R programming language. We provide some bioinformatic
and statistical foundations of data analysis because microbiome data are
complicated and their analysis is still very challenging. To strike a balance, we
briefly introduce concepts, background, and statistical method developments
before illustrating their application to real data.
The book is organized as follows. In the first three chapters, we provide an
overview of and introduction to bioinformatics, the features of microbiome data,
and the statistical analysis of microbiome data. In Chap. 4, we cover basic
skills in R programming, RStudio, and ggplot2, along with the most often used R
packages and techniques for microbiome data management and programming. In
Chap. 5, we introduce classic and newly developed methods for hypothesis testing
and power analysis of microbiome data. Chapter 6 introduces community alpha and
beta diversity measures and their calculation. Chapter 7 presents the most often
used visualization techniques for exploratory analysis of microbiome data,
including graphic summaries of data, clustering, and ordination. Chapters 8 and 9
focus on univariate and multivariate community analysis, respectively; many
classic and newly developed methods are introduced as applied in microbiome
studies. We devote Chap. 10 to compositional analysis of microbiome data. In that
chapter, we introduce the basic concepts, fundamental principles, brief history,
procedures, and challenges of compositional data analysis. We also summarize
several considerations for treating a microbiome dataset as compositional and
illustrate compositional analysis of microbiome data using real data. Chapters 11
and 12 focus on count-based approaches to modeling over-dispersed and
zero-inflated microbiome data, respectively. Here, we cover a wide range of
statistical methods and models for count data, including negative binomial,
zero-inflated, and zero-hurdle models, as well as the zero-inflated Beta
regression model with random effects in the longitudinal setting. We also discuss
conceptual adjustments in model application and topics in model comparison.
We hope the contents of these chapters and the way they are organized provide a
framework for the statistical analysis of microbiome data. We expect this book to
be used by (1) statisticians who are working on microbiome studies, either for
their own research or for collaborative research, including experimental design,
grant applications, and data analysis; (2) researchers from microbiome and
biomedical fields, such as principal investigators, clinicians, research fellows,
and graduate students, who are designing studies and collecting data; and
(3) researchers from other relevant or similar fields (e.g., bioinformatics,
ecology, microarray, economics) who make common use of the same statistical
methods and R packages. The data and R code used in this book are available on
request from the first author, Yinglin Xia, at [email protected].
Chicago, IL, USA Yinglin Xia
Chicago, IL, USA Jun Sun
North Carolina, NC, USA Ding-Geng Chen
Pretoria, South Africa
June, 2018
Acknowledgements
There are a number of people we wish to thank. We greatly appreciate those who
helped us with various resources and at different stages of development and
writing. Thanks go to the editorial team at Springer for their enthusiasm in
supporting this project and for their feedback along the way to publication.
More broadly, we greatly appreciate the developers of statistical methods,
models, and R packages, and of the R system and environment in general. Some of
them shared their papers with us. Without their great work, the book would not be
available in its current breadth and scope. We especially wish to thank Jason for
his support and proofreading of some chapters. Finally, we all wish to express
our deepest appreciation to our respective families for their love, patience, and
support during the writing of this book.
Contents
1 Bioinformatic Analysis of Microbiome Data
1.1 Introduction to Microbiome Study
1.1.1 What Is the Human Microbiome?
1.1.2 Microbiome Research and DNA Sequencing
1.2 Introduction to Phylogenetics
1.3 16S rRNA Sequencing Approach
1.3.1 The Advantages of 16S rRNA Sequencing
1.3.2 Bioinformatic Analysis of 16S rRNA Sequencing Data
1.4 Shotgun Metagenomic Sequencing Approach
1.4.1 Definition of Metagenomics
1.4.2 Advantages of Shotgun Metagenomic Sequencing
1.4.3 Bioinformatic Analysis of Shotgun Metagenomic Data
1.5 Bioinformatics Data Analysis Tools
1.5.1 QIIME
1.5.2 mothur
1.5.3 Analyzing 16S rRNA Sequence Data Using QIIME and Mothur
1.6 Summary
References
2 What Are Microbiome Data?
2.1 Microbiome Data
2.2 Microbiome Data Structure
2.2.1 Microbiome Data Are Structured as a Phylogenetic Tree
2.2.2 Feature-by-Sample Contingency Table
2.2.3 OTU Table
2.2.4 Taxa Count Table
2.2.5 Taxa Percent Table
2.3 Features of Microbiome Data
2.3.1 Microbiome Data Are Compositional
2.3.2 Microbiome Data Are High Dimensional and Underdetermined
2.3.3 Microbiome Data Are Over-Dispersed
2.3.4 Microbiome Data Are Often Sparse with Many Zeros
2.4 An Example of Over-Dispersed and Zero-Inflated Microbiome Data
2.5 Challenges of Modeling Microbiome Data
2.6 Summary
References
3 Introductory Overview of Statistical Analysis of Microbiome Data
3.1 Research Themes and Statistical Hypotheses in Human Microbiome Studies
3.2 Classic Statistical Methods and Models in Microbiome Studies
3.2.1 Classic Statistical Tests
3.2.2 Multivariate Statistical Tools
3.2.3 Over-Dispersed and Zero-Inflated Models
3.3 Newly Developed Multivariate Statistical Methods
3.3.1 Dirichlet-Multinomial Model
3.3.2 UniFrac Distance Metric Family
3.3.3 Multivariate Bayesian Models
3.3.4 Phylogenetic LASSO and Microbiome
3.4 Compositional Analysis of Microbiome Data
3.5 Longitudinal Data Analysis and Causal Inference in Microbiome Studies
3.5.1 Standard Longitudinal Models
3.5.2 Newly Developed Over-Dispersed and Zero-Inflated Longitudinal Models
3.5.3 Regression-Based Time Series Models
3.5.4 Detecting Causality: Causal Inference and Mediation Analysis of Microbiome Data
3.5.5 Meta-analysis of Microbiome Data
3.6 Introduction of Statistical Packages
3.7 Limitations of Existing Statistical Methods and Future Development
References
4 Introduction to R, RStudio and ggplot2
4.1 Introduction to R and RStudio
4.1.1 Installing R, RStudio, and R Packages
4.1.2 Set Working Directory in R
4.1.3 Data Analysis Through RStudio
4.1.4 Data Import and Export
4.1.5 Basic Data Manipulation
4.1.6 Simple Summary Statistics
4.1.7 Other Useful R Functions
4.2 Introduction to the dplyr Package
4.3 Introduction to ggplot2
4.3.1 ggplot2 and the Grammar of Graphics
4.3.2 Simplify Specifications in Creating a Plot Using ggplot()
4.3.3 Creating a Plot Using ggplot()
4.4 Summary
References
5 Power and Sample Size Calculations for Microbiome Data
5.1 Hypothesis Testing and Power Analysis
5.1.1 Hypothesis Testing
5.1.2 Power Analysis and Sample Size Calculation
5.2 Power Analysis for Testing Differences in Diversity Using T-Test
5.2.1 Power Formula for Continuous Outcome
5.2.2 Diversity Data for ALS Study
5.2.3 Calculating Power or Sample Size Using R Function power.t.test()
5.3 Power Analysis for Comparing Diversity Across More than Two Groups Using ANOVA
5.3.1 Hypothesis and Theory of Power for One-Way ANOVA
5.3.2 Calculating Power or Sample Size Using R Function pwr.anova.test()
5.4 Power Analysis for Comparing a Taxon of Interest Across Groups
5.4.1 Hypothesis and Basic Power and Sample Size Formulas for Comparing Proportions
5.4.2 Power Analysis Using R Function power.prop.test()
5.4.3 Power Analysis Using χ² Test and Fisher Exact Test
5.5 Comparing the Frequency of All Taxa Across Groups Using Dirichlet-Multinomial Model
5.5.1 Multivariate Hypothesis Testing and Dirichlet-Multinomial Model
5.5.2 Power and Sample Size Calculations Under Dirichlet-Multinomial Model
5.5.3 Power and Size Calculations Using HMP Package
5.5.4 Effect Size Calculation Using HMP Package
5.6 Summary
References
6 Community Diversity Measures and Calculations
6.1 Vdr−/− Mice Data Set
6.2 Introduction to Community Diversities
6.2.1 Alpha Diversity
6.2.2 Beta Diversity
6.2.3 Gamma Diversity
6.3 Alpha Diversity Measures and Calculations
6.3.1 Chao 1 Richness Index and Number of Taxa
6.3.2 Shannon-Wiener Diversity Index
6.3.3 Simpson Diversity Index
6.3.4 Pielou’s Evenness Index
6.3.5 Make a Dataframe of Diversity Indices
6.4 Beta Diversity Measures and Calculations
6.4.1 Binary Similarity Coefficients: Jaccard and Sørensen Indices
6.4.2 Distance (Dissimilarity) Coefficients: Bray-Curtis Index
6.5 Summary
References
7 Exploratory Analysis of Microbiome Data and Beyond
7.1 Datasets from Mice and Human
7.1.1 Vdr−/− Mice Data Set
7.1.2 Cigarette Smokers Data Set
7.2 Exploratory Analysis with Graphic Summary
7.2.1 Plot Richness
7.2.2 Plot Abundance Bar
7.2.3 Plot Heatmap
7.2.4 Plot Network
7.2.5 Plot Phylogenetic Tree
7.3 Clustering
7.3.1 Introduction to Clustering, Distance and Ordination
7.3.2 Clustering
7.4 Ordination
7.4.1 Principal Component Analysis (PCA)
7.4.2 Principal Coordinate Analysis (PCoA)
7.4.3 Non-metric Multidimensional Scaling (NMDS)
7.4.4 Correspondence Analysis (CA)
7.4.5 Redundancy Analysis (RDA)
7.4.6 Constrained Correspondence Analysis (CCA)
7.4.7 Constrained Analysis of Principal Coordinates (CAP)
7.5 Summary and Discussion
References
8 Univariate Community Analysis
8.1 Comparisons of Diversities Between Two Groups
8.1.1 Two-Sample Welch’s t-Test
8.1.2 Wilcoxon Rank Sum Test
8.2 Comparisons of a Taxon of Interest Between Two Groups
8.2.1 Comparison of Relative Abundance Using Wilcoxon Rank Sum Test
8.2.2 Comparison of Present or Absent Taxon Using Chi-Square Test
8.3 Comparisons Among More than Two Groups Using ANOVA
8.3.1 One-Way ANOVA
8.3.2 Pairwise and Tukey Multiple Comparisons
8.4 Comparisons Among More than Two Groups Using Kruskal-Wallis Test
8.4.1 Kruskal-Wallis Test
8.4.2 Compare Diversities Among Groups
8.4.3 Find Significant Taxa Among Groups
8.4.4 Multiple Testing and E-value, FWER and FDR
8.5 Summary
References
9 Multivariate Community Analysis
9.1 Hypothesis Testing Among Groups Using Permutational Multivariate Analysis of Variance (PERMANOVA)
9.1.1 Introduction of PERMANOVA
9.1.2 Implementing PERMANOVA Using Vegan Package
9.1.3 Implementing Pairwise Permutational MANOVA Using RVAideMemoire Package
9.1.4 Test Group Homogeneities Using the Function betadisper()
9.2 Hypothesis Tests Among Group-Differences Using Mantel Test (MANTEL)
9.2.1 Introduction of Mantel and Partial Mantel Tests for Dissimilarity Matrices
9.2.2 Illustrating Mantel Test Using Vegan Package
9.3 Hypothesis Tests Among-Group Differences Using ANOSIM
9.3.1 Introduction of Analysis of Similarity (ANOSIM)
9.3.2 Illustrating Analysis of Similarity (ANOSIM) Using Vegan Package
9.4 Hypothesis Tests of Multi-response Permutation Procedures (MRPP)
9.4.1 Introduction of MRPP
9.4.2 Illustrating MRPP Using Vegan Package
9.5 Compare Microbiome Communities Using the GUniFrac Package
9.5.1 Introduction to UniFrac, Weighted UniFrac and Generalized UniFrac Distance Metrics
9.5.2 Breast Milk Data Set
9.5.3 Comparing Microbiome Communities Using the GUniFrac Package
9.6 Summary and Discussion
References
10 Compositional Analysis of Microbiome Data
10.1 Introduction to Compositional Analysis
10.1.1 What Are Compositional Data?
10.1.2 Aitchison Simplex
10.1.3 Problems with Standard Statistical Methods
10.1.4 Statistical Analysis of Compositional Data
10.2 Why Microbiome Dataset Can Be Treated as Compositional?
10.3 Exploratory Compositional Data Analysis
10.3.1 Compositional Biplot
10.3.2 Compositional Scree Plot
10.3.3 Compositional Cluster Dendrogram
10.3.4 Compositional Barplot
10.4 Comparison Between the Groups Using ALDEx2 Package
10.4.1 Vdr Data Set of Fecal and Cecal Sites
10.4.2 Compositional Data Analysis Using ALDEx2
10.4.3 Difference Plot, Effect Size and Effect Plot
10.5 Proportionality: Correlation Analysis for Relative Data
10.5.1 Correlation Analysis Is Not Appropriate for Compositional Data
10.5.2 Introduction to Proportionality
10.5.3 Illustrating Proportionality Analysis
10.6 Summary and Discussion
References
11 Modeling Over-Dispersed Microbiome Data
11.1 Count-Based Differential Abundance Analysis of Microbiome Data
11.1.1 Biological and Technical Variations
11.1.2 Poisson Model
11.1.3 Negative Binomial Model
11.2 NB Model in edgeR
11.2.1 Development of NB in the Setting of Genomic Count Data
11.2.2 Dispersion Estimators of NB in edgeR
11.2.3 Hypothesis Testing in edgeR
11.3 The edgeR Package
11.3.1 Introduction
11.3.2 Step-by-Step Implementing edgeR
11.4 NB Model in DESeq and DESeq2
11.4.1 NB Model in DESeq
11.4.2 NB Model in DESeq2
11.5 The DESeq and DESeq2 Packages
11.5.1 Introduction
11.5.2 Step-by-Step Implementing DESeq2
11.6 Summary and Discussion
References
12 Modeling Zero-Inflated Microbiome Data
12.1 Introduction
12.2 Zero-Inflated Models: ZIP and ZINB
12.2.1 ZIP Model
12.2.2 ZINB Model
12.2.3 Modeling Using ZIP and ZINB
12.3 Zero-Hurdle Models: ZHP and ZHNB
12.3.1 ZHP Model
12.3.2 ZHNB Model
12.3.3 Modeling ZHP and ZHNB
12.3.4 Comparing Zero-Inflated and Zero-Hurdle Models
12.3.5 Interpreting Main Effects of Modeling Results
12.3.6 Multiple Testing Issue and Adjusting P-Values
12.4 Zero-Inflated Beta Regression Model with Random Effects
12.4.1 Introduction
12.4.2 ZIBR Model
12.4.3 Hypothesis Testing of ZIBR
12.4.4 Modeling Using ZIBR
12.5 Summary and Discussion
References
Index
About the Authors
Dr. Yinglin Xia is a Research Associate Professor at the Department of Medicine,
the University of Illinois at Chicago, USA. He was a Research Assistant Professor
in the Department of Biostatistics and Computational Biology at the University of
Rochester, Rochester, NY. Dr. Xia has worked on a variety of research projects
and clinical trials in microbiome, gastroenterology, oncology, immunology,
psychiatry, sleep, neuroscience, HIV, mental health, public health, and social
and behavioral sciences, as well as nursing caregiver studies. He has published
more than 100 papers in peer-reviewed journals on statistical methodology,
clinical trials, medical statistics, biomedical sciences, and social and
behavioral sciences. He serves on the editorial boards of nine scientific
journals. Dr. Xia is well versed in design and analysis in the areas of
longitudinal data, mediation and moderation analyses, multilevel clustered data,
zero-inflated count data, mixed-effects models, GEE, structural equation models,
meta-analysis, and ROC curves. He has successfully applied his statistical
knowledge and his modeling and programming skills to study design and data
analysis in biomedical research and clinical trials. He has been involved as a
co-investigator or statistician in numerous NIH, CDC, and other grants. Three
grants he designed on microbiome studies were funded by NIH and other funding
agencies. His recent papers on microbiome data analysis have been well received
by peers.