
http://genomebiology.com/2008/9/9/R137 Genome Biology 2008, Volume 9, Issue 9, Article R137 Zhang et al. R137.2
Genome Biology 2008, 9:R137
unknown to the user. Second, ChIP-Seq data exhibit regional
biases along the genome due to sequencing and mapping
biases, chromatin structure and genome copy number varia-
tions [10]. These biases could be modeled if matching control
samples are sequenced deeply enough. However, among the
four recently published ChIP-Seq studies [5-8], one did not
have a control sample [5] and only one of the three with con-
trol samples systematically used them to guide peak finding
[8]. That method requires peaks to contain significantly
enriched tags in the ChIP sample relative to the control,
although a small ChIP peak region often contains too few con-
trol tags to robustly estimate the background biases.
Here, we present Model-based Analysis of ChIP-Seq data,
MACS, which addresses these issues and gives robust and
high resolution ChIP-Seq peak predictions. We conducted
ChIP-Seq of FoxA1 (hepatocyte nuclear factor 3α) in MCF7
cells for comparison with FoxA1 ChIP-chip [1] and identifica-
tion of features unique to each platform. When applied to
three human ChIP-Seq datasets to identify binding sites of
FoxA1 in MCF7 cells, NRSF (neuron-restrictive silencer fac-
tor) in Jurkat T cells [8], and CTCF (CCCTC-binding factor) in
CD4
+
T cells [5] (summarized in Table S1 in Additional data
file 1), MACS gives results superior to those produced by
other published ChIP-Seq peak finding algorithms [8,11,12].
Results
Modeling the shift size of ChIP-Seq tags
ChIP-Seq tags represent the ends of fragments in a ChIP-
DNA library and are often shifted towards the 3' direction to
better represent the precise protein-DNA interaction site. The
size of the shift is, however, often unknown to the experi-
menter. Since ChIP-DNA fragments are equally likely to be
sequenced from both ends, the tag density around a true
binding site should show a bimodal enrichment pattern, with
Watson strand tags enriched upstream of binding and Crick
strand tags enriched downstream. MACS takes advantage of
this bimodal pattern to empirically model the shifting size to
better locate the precise binding sites.
Given a sonication size (bandwidth) and a high-confidence
fold-enrichment (mfold), MACS slides 2bandwidth windows
across the genome to find regions with tags more than mfold
enriched relative to a random tag genome distribution. MACS
randomly samples 1,000 of these high-quality peaks, sepa-
rates their Watson and Crick tags, and aligns them by the
midpoint between their Watson and Crick tag centers (Figure
1a) if the Watson tag center is to the left of the Crick tag
center. The distance between the modes of the Watson and
Crick peaks in the alignment is defined as 'd', and MACS shifts
all the tags by d/2 toward the 3' ends to the most likely pro-
tein-DNA interaction sites.
When applied to FoxA1 ChIP-Seq, which was sequenced with
3.9 million uniquely mapped tags, MACS estimates the d to be
only 126 bp (Figure 1a; suggesting a tag shift size of 63 bp),
despite a sonication size (bandwidth) of around 500 bp and
Solexa size-selection of around 200 bp. Since the FKHR motif
sequence dictates the precise FoxA1 binding location, the true
distribution of d could be estimated by aligning the tags by the
FKHR motif (122 bp; Figure 1b), which gives a similar result
to the MACS model. When applied to NRSF and CTCF ChIP-
Seq, MACS also estimates a reasonable d solely from the tag
distribution: for NRSF ChIP-Seq the MACS model estimated
d as 96 bp compared to the motif estimate of 70 bp; applied to
CTCF ChIP-Seq data the MACS model estimated a d of 76 bp
compared to the motif estimate of 62 bp.
Peak detection
For experiments with a control, MACS linearly scales the total
control tag count to be the same as the total ChIP tag count.
Sometimes the same tag can be sequenced repeatedly, more
times than expected from a random genome-wide tag distri-
bution. Such tags might arise from biases during ChIP-DNA
amplification and sequencing library preparation, and are
likely to add noise to the final peak calls. Therefore, MACS
removes duplicate tags in excess of what is warranted by the
sequencing depth (binomial distribution p-value <10
-5
). For
example, for the 3.9 million FoxA1 ChIP-Seq tags, MACS
allows each genomic position to contain no more than one tag
and removes all the redundancies.
With the current genome coverage of most ChIP-Seq experi-
ments, tag distribution along the genome could be modeled
by a Poisson distribution [7]. The advantage of this model is
that one parameter, λ
BG
, can capture both the mean and the
variance of the distribution. After MACS shifts every tag by d/
2, it slides 2d windows across the genome to find candidate
peaks with a significant tag enrichment (Poisson distribution
p-value based on λ
BG
, default 10
-5
). Overlapping enriched
peaks are merged, and each tag position is extended d bases
from its center. The location with the highest fragment
pileup, hereafter referred to as the summit, is predicted as the
precise binding location.
In the control samples, we often observe tag distributions
with local fluctuations and biases. For example, at the FoxA1
candidate peak locations, tag counts are well correlated
between ChIP and control samples (Figure 1c,d). Many possi-
ble sources for these biases include local chromatin structure,
DNA amplification and sequencing bias, and genome copy
number variation. Therefore, instead of using a uniform λ
BG
estimated from the whole genome, MACS uses a dynamic
parameter, λ
local
, defined for each candidate peak as:
λ
local
= max(λ
BG
, [λ
1k
,] λ
5k
, λ
10k
)
where λ
1k
, λ
5k
and λ
10k
are λ estimated from the 1 kb, 5 kb or
10 kb window centered at the peak location in the control
sample, or the ChIP-Seq sample when a control sample is not
available (in which case λ
1k
is not used). λ
local
captures the