
Big Data Analytics in Genomics Free Ebook Download

The document discusses the significance of big data analytics in genomics, particularly in light of advancements in sequencing technologies and the accumulation of vast genomic data. It highlights the need for novel analytical methods to extract insights from this data, as traditional approaches may not suffice. The book is structured into three parts: Statistical Analytics, Computational Analytics, and Cancer Analytics, each containing chapters that cover various methodologies and applications in genomic research.

Big Data Analytics in Genomics


Ka-Chun Wong

Big Data Analytics


in Genomics

Ka-Chun Wong
Department of Computer Science
City University of Hong Kong
Kowloon Tong, Hong Kong

ISBN 978-3-319-41278-8 ISBN 978-3-319-41279-5 (eBook)


DOI 10.1007/978-3-319-41279-5

Library of Congress Control Number: 2016950204

© Springer International Publishing Switzerland (outside the USA) 2016

Chapter 12 completed within the capacity of an US governmental employment. US copyright protection
does not apply.

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG Switzerland
Preface

At the beginning of the 21st century, next-generation sequencing (NGS) and
third-generation sequencing (TGS) technologies have enabled high-throughput
sequencing data generation for genomics; international projects (e.g., the
Encyclopedia of DNA Elements (ENCODE) Consortium, the 1000 Genomes Project,
The Cancer Genome Atlas (TCGA), Genotype-Tissue Expression (GTEx) program,
and the Functional Annotation Of Mammalian genome (FANTOM) project) have
been successfully launched, leading to massive genomic data accumulation at an
unprecedentedly fast pace.
To reveal novel genomic insights from those big data within a reasonable
time frame, traditional data analysis methods may not be sufficient and scalable.
Therefore, big data analytics have to be developed for genomics.
As an attempt to summarize the current efforts in big data analytics for genomics,
an open book chapter call was made at the end of 2015, resulting in 40 book chapter
submissions, which have gone through a rigorous single-blind review process. After
the initial screening and hundreds of reviewer invitations, the authors of each
eligible book chapter submission have received at least 2 anonymous expert reviews
(at most, 6 reviews) for improvements, resulting in the current 13 book chapters.
Those book chapters are organized into three parts (“Statistical Analytics,”
“Computational Analytics,” and “Cancer Analytics”) in the spirit that statistics form
the basis for computation which leads to cancer genome analytics. In each part,
the book chapters have been arranged sequentially from general introduction to
advanced topics, specific applications, and specific cancers, for the benefit of readers.
In the first part on statistical analytics, four book chapters (Chaps. 1–4) have
been contributed. In Chap. 1, Yang et al. have compiled a statistical introduction for
the integrative analysis of genomic data. After that, we go deep into the statistical
methodology of expression quantitative trait loci (eQTL) mapping in Chap. 2
written by Cheng et al. Given the genomic variants mapped, Ribeiro et al. have
contributed a book chapter on how to integrate and organize those genomic variants
into genotype-phenotype networks using causal inference and structure learning in
Chap. 3. At the end of the first part, Li and Tong have given a refreshing statistical
perspective on genomic applications of the Neyman-Pearson classification paradigm
in Chap. 4.
In the second part on computational analytics, four book chapters
(Chaps. 5–8) have been contributed. In Chap. 5, Gupta et al. have reviewed
and improved the existing computational pipelines for re-annotating eukaryotic
genomes. In Chap. 6, Rucci et al. have compiled a comprehensive survey on the
computational acceleration of Smith-Waterman protein sequence database search
which is still central to genome research. Based on those sequence database
search techniques, protein function prediction methods have been developed
and shown to be promising. Therefore, the recent algorithmic developments,
remaining challenges, and prospects for future research in protein function
prediction are discussed in great detail by Shehu et al. in Chap. 7. At the end
of the part, Nagarajan and Prabhu provide a review of the computational pipelines
for epigenetics in Chap. 8.
In the third part on cancer analytics, five chapters (Chaps. 9–13) have been
contributed. At the beginning, Prabahar and Swaminathan have written a
reader-friendly perspective on machine learning techniques in cancer analytics in Chap. 9.
To provide solid support for the perspective, Tong and Li summarize the existing
resources, tools, and algorithms for therapeutic biomarker discovery for cancer
analytics in Chap. 10. The NGS analysis of somatic mutations in cancer genomes
is then discussed by Prieto et al. in Chap. 11. To consolidate the cancer analytics
part further, two computational pipelines for cancer analytics are described in the
last two chapters, providing concrete examples for interested readers. In Chap. 12,
Leung et al. have proposed and described a novel pipeline for statistical analysis
of exonic variants in cancer genomes. In Chap. 13, Yotsukura et al. have proposed
and described a unique pipeline for understanding genotype-phenotype correlation
in breast cancer genomes.

Kowloon Tong, Hong Kong Ka-Chun Wong


April 2016
Contents

Part I Statistical Analytics

Introduction to Statistical Methods for Integrative Data Analysis in Genome-Wide Association Studies
Can Yang, Xiang Wan, Jin Liu, and Michael Ng
Robust Methods for Expression Quantitative Trait Loci Mapping
Wei Cheng, Xiang Zhang, and Wei Wang
Causal Inference and Structure Learning of Genotype–Phenotype Networks Using Genetic Variation
Adèle H. Ribeiro, Júlia M. P. Soler, Elias Chaibub Neto, and André Fujita
Genomic Applications of the Neyman–Pearson Classification Paradigm
Jingyi Jessica Li and Xin Tong

Part II Computational Analytics

Improving Re-annotation of Annotated Eukaryotic Genomes
Shishir K. Gupta, Elena Bencurova, Mugdha Srivastava, Pirasteh Pahlavan, Johannes Balkenhol, and Thomas Dandekar
State-of-the-Art in Smith–Waterman Protein Database Search on HPC Platforms
Enzo Rucci, Carlos García, Guillermo Botella, Armando De Giusti, Marcelo Naiouf, and Manuel Prieto-Matías
A Survey of Computational Methods for Protein Function Prediction
Amarda Shehu, Daniel Barbará, and Kevin Molloy
Genome-Wide Mapping of Nucleosome Position and Histone Code Polymorphisms in Yeast
Muniyandi Nagarajan and Vandana R. Prabhu

Part III Cancer Analytics

Perspectives of Machine Learning Techniques in Big Data Mining of Cancer
Archana Prabahar and Subashini Swaminathan
Mining Massive Genomic Data for Therapeutic Biomarker Discovery in Cancer: Resources, Tools, and Algorithms
Pan Tong and Hua Li
NGS Analysis of Somatic Mutations in Cancer Genomes
T. Prieto, J.M. Alves, and D. Posada
OncoMiner: A Pipeline for Bioinformatics Analysis of Exonic Sequence Variants in Cancer
Ming-Ying Leung, Joseph A. Knapka, Amy E. Wagler, Georgialina Rodriguez, and Robert A. Kirken
A Bioinformatics Approach for Understanding Genotype–Phenotype Correlation in Breast Cancer
Sohiya Yotsukura, Masayuki Karasuyama, Ichigaku Takigawa, and Hiroshi Mamitsuka

Part I
Statistical Analytics
Introduction to Statistical Methods
for Integrative Data Analysis in Genome-Wide
Association Studies

Can Yang, Xiang Wan, Jin Liu, and Michael Ng

Abstract Scientists in the life science field have long been seeking genetic
variants associated with complex phenotypes to advance our understanding of
complex genetic disorders. In the past decade, genome-wide association studies
(GWASs) have been used to identify many thousands of genetic variants, each
associated with at least one complex phenotype. Despite these successes, there
is one major challenge towards fully characterizing the biological mechanism of
complex diseases. It has been long hypothesized that many complex diseases
are driven by the combined effect of many genetic variants, formally known as
“polygenicity,” each of which may only have a small effect. To identify these genetic
variants, large sample sizes are required but meeting such a requirement is usually
beyond the capacity of a single GWAS. As the era of big data is coming, many
genomic consortia are generating an enormous amount of data to characterize the
functional roles of genetic variants and these data are widely available to the public.
Integrating rich genomic data to deepen our understanding of genetic architecture
calls for statistically rigorous methods in big-genomic-data analysis. In this book
chapter, we present a brief introduction to recent progress in the development
of statistical methodology for integrating genomic data. Our introduction begins
with the discovery of polygenic genetic architecture, and aims at providing a
unified statistical framework of integrative analysis. In particular, we highlight the
importance of integrative analysis of multiple GWAS and functional information.
We believe that statistically rigorous integrative analysis can offer more biologically
interpretable inference and drive new scientific insights.

C. Yang • M. Ng
Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong
X. Wan
Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong
J. Liu
Center of Quantitative Medicine, Duke-NUS Graduate Medical School, Singapore, Singapore

© Springer International Publishing Switzerland 2016
K.-C. Wong (ed.), Big Data Analytics in Genomics,
DOI 10.1007/978-3-319-41279-5_1

Keywords Statistics • SNP • Population genetics • Methodology • Genomic data

1 Introduction

Genome-wide association studies (GWAS) aim at studying the role of genetic
variations in complex human phenotypes (including quantitative traits and qualitative
diseases) by genotyping a dense set of single-nucleotide polymorphisms (SNPs)
across the whole genome. Compared with the candidate-gene approaches, which
only consider some regions chosen based on the researcher’s experience, GWAS are
intended to provide an unbiased examination of the genetic risk variants [46].
In 2005, the identification of the complement factor H for age-related macular
degeneration in a small sample set (96 cases vs. 50 controls) was the first successful
example of searching for risk genes under the GWAS paradigm [31]. It was a
milestone moment in the genetics community, and this result convinced researchers
that the GWAS paradigm would be powerful even with such a small sample size. Since
then, an increasing number of GWAS have been conducted each year and significant
risk variants have been routinely reported. As of December 2015, more than 15,000
risk genetic variants have been associated with at least one complex phenotype at
the genome-wide significance level ($p$-value $< 5 \times 10^{-8}$) [61].
Despite the accumulating discoveries from GWAS, researchers found out that
the significantly associated variants only explained a small proportion of the
genetic contribution to the phenotypes in 2009 [42]. This is the so-called missing
heritability. For example, it is widely agreed that 70–80 % of variations in human
height can be attributed to genetics based on pedigree study while the significant
hits from GWAS can only explain less than 5–10 % of the height variance [1, 42]. In
2010, the seminal work of Yang et al. [66] showed that 45 % of variance in human
height can be explained by 294,831 common SNPs using a linear mixed model
(LMM)-based approach. This result implies that there exist a large number of SNPs
jointly contributing a substantial heritability on human height but their individual
effects are too small to pass the genome-wide significance level due to the limited
sample size. They further provided evidence that the remaining heritability on
human height (the gap between 45 % estimated from GWAS and 70–80 % estimated
from pedigree studies) might be due to the incomplete linkage disequilibrium (LD)
between causal variants and SNPs genotyped in GWAS. Researchers have applied
this LMM approach to many other complex phenotypes, e.g., metabolic syndrome
traits [56] and psychiatric disorders [11, 34]. These results suggest that complex
phenotypes are often highly polygenic, i.e., they are affected by many genetic
variants with small effects rather than just a few variants with large effects [57].
The polygenicity of complex phenotypes has many important implications on the
development of statistical methodology for genetic data analysis. First, the methods
relying on “extremely sparse and large effects” may not work well because the sum
of many small effects, which is non-negligible, has not been taken into account.
Second, it is often challenging to pinpoint those variants with small effects only
based on information from GWAS. Fortunately, an enormous amount of data that
characterizes the human genome from different perspectives is being generated, and
these data are much richer than ever. This motivates us to search for relevant information beyond GWAS
(indirect evidence) and combine it with GWAS signals (direct evidence) to make
more convincing inference [15]. However, it is not an easy task to integrate indirect
evidence with direct evidence. A major challenge in integrative analysis is that the
direct evidence and indirect evidence are often obtained from different data sources
(e.g., different sample cohorts, different experimental designs). A naive combination
may potentially lead to high false positive findings and misleading interpretation.
Yet, effective methods that combine indirect evidence with direct evidence are still
lacking [23]. In this book chapter, we offer an introduction to the statistical methods
for integrative analysis of genomic data, and highlight their importance in the big
genomic data era.
To provide a bird’s-eye view of integrative analysis of genomic data, we start
with the introduction of heritability estimation because heritability serves as a
fundamental concept which quantifies the genetic contribution to a phenotype [58].
A good understanding of heritability estimation offers valuable insights into the
polygenic architecture of complex phenotypes. From a statistical point of view, it
is the polygenicity that motivates integrative analysis of genomic data such that
more genetic variants with small effects can be identified robustly. Our discussion
of the statistical methods for integrative analysis will be divided into two sections:
integrative analysis of multiple GWAS and integrative analysis of GWAS with
genomic functional information. Then we demonstrate how to integrate multiple
GWAS and functional information simultaneously in the case study section. At the
end, we summarize this chapter with some discussions about the future directions
of this area.

2 Heritability Estimation

The theoretical foundation of heritability estimation can be traced back to R. A.
Fisher’s development [20], in which the phenotypic similarity between relatives
is related to the degrees of genetic resemblance. In quantitative genetics, the
phenotypic value ($P$) is modeled as the sum of genetic effects ($G$) and environmental
effects ($E$),

$$P = \mu + G + E, \tag{1}$$

where $\mu$ is the population mean of the phenotype. To keep our introduction simple,
$G$ and $E$ are assumed to be independent, i.e., $\mathrm{Cov}(G, E) = 0$. The genetic effect can
be further decomposed into the additive effect (also known as the breeding value),
the dominance effect and the interaction effect, $G = A + D + I$. Accordingly, the
phenotype variance can be decomposed as

$$\sigma_P^2 = \sigma_G^2 + \sigma_E^2 = (\sigma_A^2 + \sigma_D^2 + \sigma_I^2) + \sigma_E^2, \tag{2}$$

where $\sigma_G^2$ is the variance due to genetic variations, $\sigma_A^2$, $\sigma_D^2$, $\sigma_I^2$, and $\sigma_E^2$ correspond to
the variance of additive effects, dominance effects, interaction effects (also known
as epistasis), and environmental effects, respectively. Based on these variance
components, two types of heritability are defined. The broad-sense heritability ($H^2$)
is defined as the proportion of the phenotypic variance that can be attributed to the
genetic factors,

$$H^2 = \frac{\sigma_G^2}{\sigma_P^2} = \frac{\sigma_A^2 + \sigma_D^2 + \sigma_I^2}{\sigma_A^2 + \sigma_D^2 + \sigma_I^2 + \sigma_E^2}. \tag{3}$$

The narrow-sense heritability ($h^2$), however, focuses only on the contribution of the
additive effects:

$$h^2 = \frac{\sigma_A^2}{\sigma_A^2 + \sigma_E^2}. \tag{4}$$
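
As a quick numerical illustration of (3) and (4), the following minimal sketch computes both heritabilities from a set of variance components. The class name `VarianceComponents` and the example values are hypothetical and chosen only to make the arithmetic easy to follow; the narrow-sense formula follows (4) as written, with only the additive and environmental variances in the denominator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VarianceComponents:
    """Variance components of a phenotype (illustrative container, not from the text)."""

    additive: float     # sigma_A^2
    dominance: float    # sigma_D^2
    interaction: float  # sigma_I^2
    environment: float  # sigma_E^2

    def broad_sense(self) -> float:
        """H^2 as in (3): total genetic variance over total phenotypic variance."""
        genetic = self.additive + self.dominance + self.interaction
        return genetic / (genetic + self.environment)

    def narrow_sense(self) -> float:
        """h^2 as in (4): additive variance over additive plus environmental variance."""
        return self.additive / (self.additive + self.environment)


# Hypothetical values, not estimates for any real trait.
vc = VarianceComponents(additive=0.4, dominance=0.1, interaction=0.1, environment=0.4)
print(f"H^2 = {vc.broad_sense():.2f}, h^2 = {vc.narrow_sense():.2f}")
```

With these hypothetical values the sketch reports $H^2 = 0.60$ and $h^2 = 0.50$; when the dominance and interaction components are zero, the two quantities coincide.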

Due to the law of inheritance, individuals can only transmit one allele of each
gene to their offspring, so most relatives (except full siblings and monozygotic twins)
share only one allele or no allele that is identical by descent (IBD). Therefore,
the dominance effects and interaction effects will not contribute to their genetic
resemblance as these effects are due to the sharing of two IBD alleles. Accumulating
evidence suggests that non-additive genetic effects on complex phenotypes may be
negligible [28, 64, 69]. For example, Yang et al. [64] reported that the additive
effects of about 17 million imputed variants explained 56 % of the variance of human
height, leaving very little room for the non-additive effects to contribute. Zhu
et al. [69] found that the dominance effects on 79 quantitative traits explained little
phenotypic variance. Therefore, we will ignore non-additive effects and concentrate
our discussion on narrow-sense heritability in this book chapter.

2.1 The Basic Idea of Heritability Estimation from Pedigree Data

In this section, we will introduce the key idea of heritability estimation from
pedigree data, which provides the basis of our discussion on integrative analysis.
Interested readers are referred to [18, 27, 40, 59] for the comprehensive discussion
of this issue. Assuming a number of conditions (e.g., random mating, no inbreeding,
Hardy–Weinberg equilibrium, and linkage equilibrium), a simple formula for the
genetic covariance between two relatives can be derived based on the additive
variance component:

$$\mathrm{Cov}(G_1, G_2) = K_{1,2}\,\sigma_A^2, \tag{5}$$

where $K_{1,2}$ is the expected proportion of their genomes sharing one chromosome
IBD. Let us take a parent–offspring pair as an example. Because the parent transmits
one copy of each gene to his/her offspring, i.e., $K_{1,2} = \frac{1}{2}$, thus their genetic
covariance is $\frac{1}{2}\sigma_A^2$. Let $P_1$ and $P_2$ be the phenotypic values (e.g., height) of the parent
and the offspring. Based on (1), we have $\mathrm{Cov}(P_1, P_2) = \mathrm{Cov}(G_1, G_2) + \mathrm{Cov}(E_1, E_2)$.
Assuming the independence of the environmental factor, $\mathrm{Cov}(E_1, E_2) = 0$, we
further have

$$\mathrm{Cov}(P_1, P_2) = \frac{1}{2}\sigma_A^2. \tag{6}$$

Noticing that $\mathrm{Var}(P_1) = \mathrm{Var}(P_2) = \sigma_P^2 = \sigma_A^2 + \sigma_E^2$, the phenotypic correlation can
be related to the narrow-sense heritability $h^2$:

$$\mathrm{Corr}(P_1, P_2) = \frac{\mathrm{Cov}(P_1, P_2)}{\sqrt{\mathrm{Var}(P_1)\mathrm{Var}(P_2)}} = \frac{1}{2}\,\frac{\sigma_A^2}{\sigma_A^2 + \sigma_E^2} = \frac{1}{2}h^2. \tag{7}$$

Suppose we have collected the phenotypic values of $n$ parent–offspring pairs.
A simple way to estimate $h^2$ based on this data set is to use the linear regression:

$$P_{i2} = P_{i1}\beta + \beta_0 + \epsilon_i, \tag{8}$$

where $i = 1, \ldots, n$ is the index of samples, $\beta$ is the regression coefficient, and $\epsilon_i$ is
the residual of the $i$th sample. The ordinary least squares estimate of $\beta$ is

$$\hat{\beta} = \frac{\sum_i (P_{i1} - \bar{P}_1)(P_{i2} - \bar{P}_2)}{\sum_i (P_{i1} - \bar{P}_1)^2}, \qquad \hat{\beta}_0 = \bar{P}_2 - \hat{\beta}\bar{P}_1, \tag{9}$$

where $\bar{P}_1 = \frac{1}{n}\sum_i P_{i1}$ and $\bar{P}_2 = \frac{1}{n}\sum_i P_{i2}$ are the sample means of parent phenotypic
values and offspring phenotypic values. Because $\mathrm{Var}(P_1) = \mathrm{Var}(P_2)$, $\hat{\beta}$ is the sample version of the
correlation given in (7), and heritability estimated from parent–offspring pairs is given
by twice the regression slope, i.e., $\hat{h}^2 = 2\hat{\beta}$.
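
The parent–offspring regression can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the data are simulated with unit phenotypic variance and purely additive genetic effects, and the offspring's additive value is generated as half of the observed parent's value plus an independent term with variance $0.75\,\sigma_A^2$ (the unobserved parent and Mendelian sampling), which reproduces the covariance in (6).

```python
import numpy as np


def simulate_parent_offspring(n, h2, rng):
    """Simulate n parent-offspring phenotype pairs with unit phenotypic variance.

    Assumes additive and environmental effects only, so sigma_A^2 = h2
    and sigma_E^2 = 1 - h2 (an illustrative setup, not from the text).
    """
    sigma_a2, sigma_e2 = h2, 1.0 - h2
    a_parent = rng.normal(0.0, np.sqrt(sigma_a2), n)
    # Half of the parent's additive value is transmitted; the remainder keeps
    # Var(A_offspring) = sigma_A^2 and Cov(A_parent, A_offspring) = 0.5 * sigma_A^2.
    a_offspring = 0.5 * a_parent + rng.normal(0.0, np.sqrt(0.75 * sigma_a2), n)
    p_parent = a_parent + rng.normal(0.0, np.sqrt(sigma_e2), n)
    p_offspring = a_offspring + rng.normal(0.0, np.sqrt(sigma_e2), n)
    return p_parent, p_offspring


def h2_from_parent_offspring(p_parent, p_offspring):
    """Estimate h^2 as twice the OLS slope of offspring on parent, as in (8)."""
    x = np.asarray(p_parent, dtype=float)
    y = np.asarray(p_offspring, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    slope = np.dot(xc, yc) / np.dot(xc, xc)
    return 2.0 * slope


rng = np.random.default_rng(0)
parents, offspring = simulate_parent_offspring(n=100_000, h2=0.6, rng=rng)
h2_hat = h2_from_parent_offspring(parents, offspring)
print(f"estimated h^2: {h2_hat:.3f} (simulated with h^2 = 0.6)")
```

With a large simulated sample the estimate should land close to the value used in the simulation; with only a few hundred pairs the sampling error of the slope, and hence of $2\hat{\beta}$, becomes noticeable.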
Another example of heritability estimation is based on the phenotypic values of
two parents ($P_1$ and $P_2$) and one offspring ($P_3$). Let $P_M = \frac{P_1 + P_2}{2}$ be the phenotypic
value of the mid-parent. Similarly, we have the genetic covariance $\mathrm{Cov}(P_M, P_3) =
\frac{1}{2}\mathrm{Cov}(P_1, P_3) + \frac{1}{2}\mathrm{Cov}(P_2, P_3) = \frac{1}{2}\sigma_A^2$, and correlation between the mid-parent and
the offspring can be related to heritability $h^2$ as

$$\mathrm{Corr}(P_M, P_3) = \frac{\mathrm{Cov}(P_M, P_3)}{\sqrt{\mathrm{Var}(P_M)\mathrm{Var}(P_3)}} = \frac{\frac{1}{2}\sigma_A^2}{\sqrt{\frac{1}{2}(\sigma_A^2 + \sigma_E^2)^2}} = \sqrt{\frac{1}{2}}\,h^2. \tag{10}$$

Suppose we have $n$ trio samples $\{P_{i1}, P_{i2}, P_{i3}\}$, where $(P_{i1}, P_{i2}, P_{i3})$ corresponds to
the phenotypic values of two parents and the offspring from the $i$th sample. Again,
a convenient way to estimate $h^2$ is to still use linear regression:

$$P_{i3} = \frac{P_{i1} + P_{i2}}{2}\beta + \beta_0 + \epsilon_i. \tag{11}$$

Heritability estimated from the phenotypic values of mid-parents and offspring can
be read from the coefficient fitted in (11) as $\hat{h}^2 = \hat{\beta} = \mathrm{Var}(P_M)^{-1}\mathrm{Cov}(P_M, P_3)$.
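
The mid-parent estimator can be sketched the same way. In this hedged illustration the trios are drawn directly from a trivariate normal distribution whose covariance combines the expected genome sharing of two unrelated parents and their child with the additive and environmental variances; the function names, the zero mean, and the unit phenotypic variance are assumptions made for the example.

```python
import numpy as np


def simulate_trios(n, h2, rng):
    """Draw n (parent, parent, offspring) phenotype triples with unit variance."""
    sigma_a2, sigma_e2 = h2, 1.0 - h2
    # Expected genome sharing: parents unrelated, each shares 1/2 with the child.
    sharing = np.array([[1.0, 0.0, 0.5],
                        [0.0, 1.0, 0.5],
                        [0.5, 0.5, 1.0]])
    cov = sharing * sigma_a2 + np.eye(3) * sigma_e2
    return rng.multivariate_normal(np.zeros(3), cov, size=n)


def h2_from_midparent(trios):
    """Estimate h^2 as Cov(P_M, P_3) / Var(P_M), the slope fitted in (11)."""
    trios = np.asarray(trios, dtype=float)
    mid_parent = trios[:, :2].mean(axis=1)
    offspring = trios[:, 2]
    c = np.cov(mid_parent, offspring)  # 2 x 2 sample covariance matrix
    return c[0, 1] / c[0, 0]


rng = np.random.default_rng(1)
trios = simulate_trios(n=100_000, h2=0.6, rng=rng)
h2_hat = h2_from_midparent(trios)
print(f"estimated h^2 from mid-parent regression: {h2_hat:.3f}")
```

Unlike the single-parent case, no factor of two is needed here, because averaging the two parents halves $\mathrm{Var}(P_M)$ while leaving the covariance with the offspring at $\frac{1}{2}\sigma_A^2$.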
It is worth pointing out that the above methods for heritability estimation only
make use of covariance information. In statistics, they are referred to as the methods
of moments because covariance is the second moment. In fact, we can impose
normality assumptions and reformulate heritability estimation using maximum
likelihood estimator. Considering the parent–offspring case, we can view all the
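samples independently drawn from the following distribution:

$$\begin{pmatrix} P_{i1} \\ P_{i2} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu \\ \mu \end{pmatrix}, \begin{pmatrix} 1 & \frac{1}{2} \\ \frac{1}{2} & 1 \end{pmatrix}\sigma_A^2 + \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\sigma_E^2 \right), \tag{12}$$

where $P_{i1}$ and $P_{i2}$ are the phenotypic values of the parent and offspring from the $i$th
family. Similarly, we can view a trio sample $P_{i1}, P_{i2}, P_{i3}$ independently drawn from
the following distribution:

$$\begin{pmatrix} P_{i1} \\ P_{i2} \\ P_{i3} \end{pmatrix} \sim \mathcal{N}\left[ \begin{pmatrix} \mu \\ \mu \\ \mu \end{pmatrix}, \begin{pmatrix} 1 & 0 & \frac{1}{2} \\ 0 & 1 & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} & 1 \end{pmatrix}\sigma_A^2 + \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}\sigma_E^2 \right]. \tag{13}$$
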

The restricted maximum likelihood (REML) approach can be used to efficiently
compute the estimates of model parameters $\{\mu, \sigma_A^2, \sigma_E^2\}$ in (12) and (13). Then the
heritability estimation can be obtained as

$$\hat{h}^2 = \frac{\hat{\sigma}_A^2}{\hat{\sigma}_A^2 + \hat{\sigma}_E^2}. \tag{14}$$

The matrices $\begin{pmatrix} 1 & \frac{1}{2} \\ \frac{1}{2} & 1 \end{pmatrix}$ and $\begin{pmatrix} 1 & 0 & \frac{1}{2} \\ 0 & 1 & \frac{1}{2} \\ \frac{1}{2} & \frac{1}{2} & 1 \end{pmatrix}$ in (12) and (13) can be considered as expected
genetic similarity (i.e., expected genome sharing) in parent–offspring samples and
two-parent–offspring samples. As a result, heritability estimation based on pedigree
data relates the phenotypic similarity of relatives to their expected genome sharing.
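
To make the likelihood formulation concrete, the sketch below fits the parent–offspring model (12) by plain maximum likelihood rather than REML, which is a deliberate simplification. For this two-by-two covariance structure the unconstrained maximizer has a closed form: the common mean is the pooled mean, the pooled variance estimates $\sigma_A^2 + \sigma_E^2$, and the parent–offspring covariance estimates $\frac{1}{2}\sigma_A^2$. The function name and the simulated values are hypothetical, and non-negativity of the variance components is not enforced, so small samples can yield estimates outside the admissible range.

```python
import numpy as np


def ml_parent_offspring(p_parent, p_offspring):
    """Closed-form ML estimates of (mu, sigma_A^2, sigma_E^2, h^2) under model (12).

    Plain ML without non-negativity constraints; REML estimates would differ slightly.
    """
    x = np.asarray(p_parent, dtype=float)
    y = np.asarray(p_offspring, dtype=float)
    mu = 0.5 * (x.mean() + y.mean())                     # common mean
    xc, yc = x - mu, y - mu
    var_p = 0.5 * (np.mean(xc ** 2) + np.mean(yc ** 2))  # sigma_A^2 + sigma_E^2
    cov_po = np.mean(xc * yc)                            # 0.5 * sigma_A^2
    sigma_a2 = 2.0 * cov_po
    sigma_e2 = var_p - sigma_a2
    h2 = sigma_a2 / (sigma_a2 + sigma_e2)                # plug-in estimate as in (14)
    return mu, sigma_a2, sigma_e2, h2


# Hypothetical parameter values used only to generate example data.
rng = np.random.default_rng(2)
true_mu, true_a2, true_e2 = 170.0, 30.0, 20.0
sharing = np.array([[1.0, 0.5], [0.5, 1.0]])
cov = sharing * true_a2 + np.eye(2) * true_e2
pairs = rng.multivariate_normal([true_mu, true_mu], cov, size=100_000)

mu_hat, a2_hat, e2_hat, h2_hat = ml_parent_offspring(pairs[:, 0], pairs[:, 1])
print(f"mu = {mu_hat:.2f}, sigma_A^2 = {a2_hat:.2f}, "
      f"sigma_E^2 = {e2_hat:.2f}, h^2 = {h2_hat:.3f}")
```

In this setting the likelihood-based estimate uses the same second-moment information as the regression estimator, which is why the two approaches agree closely for parent–offspring data; the likelihood view becomes more useful for larger pedigrees, where the expected genome sharing matrix has more structure.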

2.2 Heritability Estimation Based on GWAS

As we discussed above, the heritability estimation based on pedigree data relies
on the expected genome sharing between relatives. Nowadays, genome-wide dense
SNP data provides an unprecedented opportunity to accurately characterize genome
sharing. However, this advantage brings new challenges. First, three billion base
pairs of human genome sequences are identical at more than 99.9 % of the sites
due to the inheritance from the common ancestors. SNP-based data only records
genotypes at some specific genome positions with single-nucleotide mutations, and
thus SNP-based measures of genetic similarity are much lower than the 99.9 %
similarity based on the whole genome DNA sequence. Second, SNP-based measures
depend on the subset of SNPs genotyped in GWAS and their allele frequencies.
Third, SNP-based measure can be affected by the quality control procedures used in
GWAS.
Our discussion assumes that the SNPs used in heritability estimation are fixed.
There are many different ways to characterize genome similarity based on these
fixed SNPs, as discussed in [51]. Here, we choose the GCTA approach [66, 67] as it
is the most widely used one. Suppose we have collected the genotypes of n subjects
in matrix $G = [g_{im}] \in \mathbb{R}^{n \times M}$ and their phenotype in vector $y \in \mathbb{R}^{n \times 1}$, where $M$ is the
number of SNP markers and $g_{im} \in \{0, 1, 2\}$ is the numerical coding of the genotypes
at the $m$th SNP of the $i$th individual. Yang et al. [66, 67] proposed to standardize the
genotype matrix $G$ as follows:

$$w_{im} = \frac{g_{im} - 2f_m}{\sqrt{2f_m(1 - f_m)M}}, \tag{15}$$

where $f_m$ is the frequency of the reference allele, so that $2f_m$ is the expected value of $g_{im}$. An underlying assumption in this
standardization is that lower frequency variants tend to have larger effects. Speed
et al. [52] examined this assumption and concluded that it would be robust in both
simulation studies and real data analysis. After standardization, an LMM is used to
model the relationship between the phenotypic value and the genotypes:

$$\begin{aligned} y &= X\beta + Wu + e, \\ u &\sim \mathcal{N}(0, \sigma_u^2 I), \\ e &\sim \mathcal{N}(0, \sigma_e^2 I), \end{aligned} \tag{16}$$

where $X \in \mathbb{R}^{n \times c}$ is the fixed-effect design matrix collecting the intercept of the
regression model and all covariates, such as age, sex, and a few principal
components (PC) of the genotype data (PCs are used for adjustment of the population
structure [45]); $\beta$ is the vector of fixed effects; $u$ collects all the individual SNP
effects which are considered as random, and $e$ collects the random errors due to the
environmental factors. Since both $u$ and $e$ are Gaussian, they can be integrated out
analytically, which yields the marginal distribution of $y$:

$$y \sim \mathcal{N}(X\beta, WW^T\sigma_u^2 + \sigma_e^2 I), \tag{17}$$
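
The standardization in (15) and the marginal model in (17) can be tied together in a small, hedged sketch. It is not the GCTA implementation: it uses a plain maximum-likelihood grid search over $h^2 = \sigma_u^2 / (\sigma_u^2 + \sigma_e^2)$ instead of REML, estimates allele frequencies from the sample, centers each SNP by twice its allele frequency (the mean of a 0/1/2 allele count), and runs on simulated genotypes. All function names and simulation settings are hypothetical.

```python
import numpy as np


def standardize_genotypes(G):
    """Column-standardize 0/1/2 genotypes so that W @ W.T is a relationship matrix."""
    G = np.asarray(G, dtype=float)
    M = G.shape[1]
    f = G.mean(axis=0) / 2.0  # sample frequency of the counted allele (assumption)
    if np.any((f <= 0.0) | (f >= 1.0)):
        raise ValueError("remove monomorphic SNPs before standardizing")
    return (G - 2.0 * f) / np.sqrt(2.0 * f * (1.0 - f) * M)


def profile_loglik(h2, y, X, K):
    """Log-likelihood of (17) at a given h2, with beta and the total variance profiled out.

    The covariance is written as sigma_p2 * (h2 * K + (1 - h2) * I) with K = W W^T.
    """
    n = y.shape[0]
    H = h2 * K + (1.0 - h2) * np.eye(n)
    Hinv_X = np.linalg.solve(H, X)
    Hinv_y = np.linalg.solve(H, y)
    beta = np.linalg.solve(X.T @ Hinv_X, X.T @ Hinv_y)  # GLS estimate of fixed effects
    resid = y - X @ beta
    sigma_p2 = resid @ np.linalg.solve(H, resid) / n      # ML estimate of total variance
    _, logdet_H = np.linalg.slogdet(H)
    return -0.5 * (n * np.log(2.0 * np.pi * sigma_p2) + logdet_H + n)


def estimate_h2(y, X, W, grid=None):
    """Return the grid value of h2 that maximizes the profile log-likelihood."""
    if grid is None:
        grid = np.linspace(0.0, 0.99, 100)
    K = W @ W.T
    logliks = [profile_loglik(h, y, X, K) for h in grid]
    return float(grid[int(np.argmax(logliks))])


# Hypothetical simulated data: 300 unrelated individuals, 200 independent SNPs.
rng = np.random.default_rng(3)
n, M, true_h2 = 300, 200, 0.5
freqs = rng.uniform(0.1, 0.9, size=M)
G = rng.binomial(2, freqs, size=(n, M))
W = standardize_genotypes(G)
X = np.ones((n, 1))                                  # intercept only
u = rng.normal(0.0, np.sqrt(true_h2), size=M)        # random SNP effects, as in (16)
e = rng.normal(0.0, np.sqrt(1.0 - true_h2), size=n)  # environmental errors
y = X @ np.array([1.0]) + W @ u + e

h2_hat = estimate_h2(y, X, W)
print(f"grid-search ML estimate of h^2: {h2_hat:.2f} (simulated with {true_h2})")
```

The grid search keeps the example short and transparent; for realistic sample sizes one would typically decompose $WW^T$ once and reuse it across candidate values, or rely on dedicated REML software, since each evaluation here solves a dense $n \times n$ system.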
