Model-based Analysis of ChIP-Seq (MACS)


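### Model-based Analysis of ChIP-Seq (MACS): a model-based method for ChIP-Seq data analysis

#### Background

With the development of high-throughput sequencing, chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq) has become an important tool for locating transcription factors, histone modifications, and other DNA-binding proteins across the genome. To identify the binding sites of these proteins more accurately, researchers have developed a range of bioinformatics tools, of which Model-based Analysis of ChIP-Seq (MACS) is a widely used software package.

#### Main features of MACS

MACS analyzes ChIP-Seq data produced by short-read sequencers such as Illumina's Genome Analyzer. Its core features are:

1. **Empirical modeling of the tag shift size**: MACS infers from the experimental data itself how far ChIP-Seq tags should be shifted (a quantity that reflects the ChIP-DNA fragment size), which improves the spatial resolution of predicted binding sites.
2. **Dynamic Poisson distribution for local biases**: MACS uses a dynamic Poisson model to capture local biases in the genome, which reduces false positives and makes predictions more robust.
3. **Background correction**: By modeling the control (input) sample, MACS accounts for non-specific binding and background noise, which improves specificity.
4. **Adjustable parameters**: Users can tune parameters such as the p-value threshold and the window size to suit different experimental designs and sample types.
5. **Computational efficiency**: MACS is designed to analyze large datasets in a short time.

#### Technical details and workflow

A MACS analysis involves the following steps:

1. **Quality control and preprocessing**: Before peak calling, low-quality reads are removed and the remaining reads are mapped to the reference genome; MACS operates on the mapped tags.
2. **Shift size estimation**: MACS estimates the tag shift size from the distribution of tags on the two strands; later steps depend on this estimate.
3. **Model building**: Using the shift size, MACS builds the model used to predict binding sites. The model accounts for differences between the ChIP and control samples and for local biases along the genome.
4. **Peak detection**: By comparing tag distributions between the ChIP and control samples, MACS identifies candidate binding sites, the "peak" regions.
5. **Output and visualization**: MACS reports peak locations, enrichment levels, and related information, with options for graphical display.

#### Applications

MACS has been applied in several areas of research, for example:

- **Identifying transcription factor binding sites**: ChIP-Seq combined with MACS lets researchers locate the genomic binding sites of a given transcription factor, which supports the study of gene regulation.
- **Mapping histone modifications**: Analyzing histone modification ChIP-Seq data with MACS helps chart modification patterns in different cell states and study epigenetic regulation.
- **Studying other DNA-binding proteins**: Beyond transcription factors, MACS can be used to analyze the binding sites of other DNA-associated proteins, such as RNA polymerase II and histone variants.

#### Conclusion

MACS is an efficient and flexible tool for ChIP-Seq data analysis with broad applications in the life sciences. It helps researchers identify the binding sites of DNA-binding proteins more accurately and contributes to the understanding of cellular regulatory networks and their biological functions. As high-throughput sequencing continues to develop, MACS is expected to remain important in future research.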


Genome Biology 2008, 9:R137

Open Access

2008 Zhang et al. Volume 9, Issue 9, Article R137

Method

Model-based Analysis of ChIP-Seq (MACS)

Yong Zhang, Tao Liu, Clifford A Meyer, Jérôme Eeckhoute†, David S Johnson‡, Bradley E Bernstein§¶, Chad Nusbaum, Richard M Myers, Myles Brown†, Wei Li and X Shirley Liu

Addresses: Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute and Harvard School of Public Health, 44 Binney Street, Boston, MA 02115, USA. †Division of Molecular and Cellular Oncology, Department of Medical Oncology, Dana-Farber Cancer Institute and Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, 44 Binney Street, Boston, MA 02115, USA. ‡Gene Security Network, Inc., 2686 Middlefield Road, Redwood City, CA 94063, USA. Molecular Pathology Unit and Center for Cancer Research, Massachusetts General Hospital and Department of Pathology, Harvard Medical School, 13th Street, Charlestown, MA 02129, USA. Broad Institute of Harvard and MIT, 7 Cambridge Center, Cambridge, MA, 02142, USA. Department of Genetics, Stanford University Medical Center, Stanford, CA 94305, USA. Division of Biostatistics, Dan L Duncan Cancer Center, Department of Molecular and Cellular Biology, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA.

¤ These authors contributed equally to this work.

Correspondence: Wei Li. Email: wl1@bcm.edu. X Shirley Liu. Email: [email protected]

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

ChIP-Seq analysis: MACS performs model-based analysis of ChIP-Seq data generated by short read sequencers.

Abstract

We present Model-based Analysis of ChIP-Seq data, MACS, which analyzes data generated by short read sequencers such as Solexa's Genome Analyzer. MACS empirically models the shift size of ChIP-Seq tags, and uses it to improve the spatial resolution of predicted binding sites. MACS also uses a dynamic Poisson distribution to effectively capture local biases in the genome, allowing for more robust predictions. MACS compares favorably to existing ChIP-Seq peak-finding algorithms, and is freely available.

Background

The determination of the 'cistrome', the genome-wide set of in vivo cis-elements bound by trans-factors [1], is necessary to determine the genes that are directly regulated by those trans-factors. Chromatin immunoprecipitation (ChIP) [2] coupled with genome tiling microarrays (ChIP-chip) [3,4] and sequencing (ChIP-Seq) [5-8] have become popular techniques to identify cistromes. Although early ChIP-Seq efforts were limited by sequencing throughput and cost [2,9], tremendous progress has been achieved in the past year in the development of next generation massively parallel sequencing. Tens of millions of short tags (25-50 bases) can now be simultaneously sequenced at less than 1% the cost of traditional Sanger sequencing methods. Technologies such as Illumina's Solexa or Applied Biosystems' SOLiD™ have made ChIP-Seq a practical and potentially superior alternative to ChIP-chip [5,8].

While providing several advantages over ChIP-chip, such as less starting material, lower cost, and higher peak resolution, ChIP-Seq also poses challenges (or opportunities) in the analysis of data. First, ChIP-Seq tags represent only the ends of the ChIP fragments, instead of precise protein-DNA binding sites. Although tag strand information and the approximate distance to the precise binding site could help improve peak resolution, a good tag to site distance estimate is often unknown to the user. Second, ChIP-Seq data exhibit regional biases along the genome due to sequencing and mapping biases, chromatin structure and genome copy number variations [10]. These biases could be modeled if matching control samples are sequenced deeply enough. However, among the four recently published ChIP-Seq studies [5-8], one did not have a control sample [5] and only one of the three with control samples systematically used them to guide peak finding [8]. That method requires peaks to contain significantly enriched tags in the ChIP sample relative to the control, although a small ChIP peak region often contains too few control tags to robustly estimate the background biases.

Published: 17 September 2008

Genome Biology 2008, 9:R137 (doi:10.1186/gb-2008-9-9-r137)

Received: 4 August 2008

Revised: 3 September 2008

Accepted: 17 September 2008

The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2008/9/9/R137

Here, we present Model-based Analysis of ChIP-Seq data, MACS, which addresses these issues and gives robust and high resolution ChIP-Seq peak predictions. We conducted ChIP-Seq of FoxA1 (hepatocyte nuclear factor 3α) in MCF7 cells for comparison with FoxA1 ChIP-chip [1] and identification of features unique to each platform. When applied to three human ChIP-Seq datasets to identify binding sites of FoxA1 in MCF7 cells, NRSF (neuron-restrictive silencer factor) in Jurkat T cells [8], and CTCF (CCCTC-binding factor) in CD4+ T cells [5] (summarized in Table S1 in Additional data file 1), MACS gives results superior to those produced by other published ChIP-Seq peak finding algorithms [8,11,12].

Results

Modeling the shift size of ChIP-Seq tags

ChIP-Seq tags represent the ends of fragments in a ChIP-DNA library and are often shifted towards the 3' direction to better represent the precise protein-DNA interaction site. The size of the shift is, however, often unknown to the experimenter. Since ChIP-DNA fragments are equally likely to be sequenced from both ends, the tag density around a true binding site should show a bimodal enrichment pattern, with Watson strand tags enriched upstream of binding and Crick strand tags enriched downstream. MACS takes advantage of this bimodal pattern to empirically model the shifting size to better locate the precise binding sites.

Given a sonication size (bandwidth) and a high-confidence fold-enrichment (mfold), MACS slides windows of width 2 × bandwidth across the genome to find regions with tags more than mfold enriched relative to a random tag genome distribution. MACS randomly samples 1,000 of these high-quality peaks, separates their Watson and Crick tags, and aligns them by the midpoint between their Watson and Crick tag centers (Figure 1a) if the Watson tag center is to the left of the Crick tag center. The distance between the modes of the Watson and Crick peaks in the alignment is defined as 'd', and MACS shifts all the tags by d/2 toward the 3' ends to the most likely protein-DNA interaction sites.
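The shift-size model can be illustrated with a minimal Python sketch. It assumes that the Watson and Crick tag positions from the sampled peaks have already been pooled and expressed relative to each peak's alignment midpoint. The function names, the 10 bp histogram bins, and the binned mode estimate are illustrative choices made here, not the MACS implementation.

```python
from collections import Counter


def mode_position(positions, bin_size=10):
    """Return the center of the most populated histogram bin (a crude mode)."""
    counts = Counter(p // bin_size for p in positions)
    # Break ties deterministically by preferring the leftmost bin.
    best_bin = max(counts, key=lambda b: (counts[b], -b))
    return best_bin * bin_size + bin_size // 2


def estimate_d(watson, crick, bin_size=10):
    """Distance between the Crick mode and the Watson mode."""
    return mode_position(crick, bin_size) - mode_position(watson, bin_size)


def shift_tags(watson, crick, d):
    """Shift every tag by d/2 toward its 3' end.

    Watson tags move right (+d/2); Crick tags move left (-d/2).
    """
    half = d // 2
    return [p + half for p in watson] + [p - half for p in crick]


# Toy data: Watson tags pile up upstream, Crick tags downstream.
watson = [-65, -63, -61, -20, -100]
crick = [61, 63, 65, 20, 100]
d = estimate_d(watson, crick)
shifted = shift_tags(watson, crick, d)
```

With real data the two strand profiles would usually be smoothed before taking the modes; the binning above is only the simplest stand-in for that step.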

When applied to FoxA1 ChIP-Seq, which was sequenced with 3.9 million uniquely mapped tags, MACS estimates the d to be only 126 bp (Figure 1a; suggesting a tag shift size of 63 bp), despite a sonication size (bandwidth) of around 500 bp and Solexa size-selection of around 200 bp. Since the FKHR motif sequence dictates the precise FoxA1 binding location, the true distribution of d could be estimated by aligning the tags by the FKHR motif (122 bp; Figure 1b), which gives a similar result to the MACS model. When applied to NRSF and CTCF ChIP-Seq, MACS also estimates a reasonable d solely from the tag distribution: for NRSF ChIP-Seq the MACS model estimated d as 96 bp compared to the motif estimate of 70 bp; applied to CTCF ChIP-Seq data the MACS model estimated a d of 76 bp compared to the motif estimate of 62 bp.

Peak detection

For experiments with a control, MACS linearly scales the total control tag count to be the same as the total ChIP tag count. Sometimes the same tag can be sequenced repeatedly, more times than expected from a random genome-wide tag distribution. Such tags might arise from biases during ChIP-DNA amplification and sequencing library preparation, and are likely to add noise to the final peak calls. Therefore, MACS removes duplicate tags in excess of what is warranted by the sequencing depth (binomial distribution p-value < $10^{-5}$). For example, for the 3.9 million FoxA1 ChIP-Seq tags, MACS allows each genomic position to contain no more than one tag and removes all the redundancies.
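As a rough illustration of these two preprocessing steps, the sketch below scales control counts to the ChIP total and finds the largest number of tags tolerated at a single position under a binomial model. The effective genome size of 2.7e9 bp and the per-position probability of 1/genome_size are assumptions made here for illustration, as are the function names; the paper does not spell out these details.

```python
import math


def control_scale_factor(chip_total, control_total):
    """Factor that makes the scaled control total equal the ChIP total."""
    return chip_total / control_total


def max_duplicate_tags(n_tags, genome_size, pvalue=1e-5):
    """Largest tag count per position not flagged as excess.

    Models the count at one position as Binomial(n_tags, 1/genome_size) and
    returns the largest k whose successor has P(X >= k + 1) < pvalue.
    """
    p = 1.0 / genome_size
    pmf = math.exp(n_tags * math.log1p(-p))  # P(X = 0)
    cdf = 0.0
    k = 0
    while True:
        cdf += pmf              # cdf is now P(X <= k)
        if 1.0 - cdf < pvalue:  # P(X >= k + 1) is too unlikely
            return max(k, 1)    # always keep at least one tag
        # Binomial recurrence: P(X = k + 1) from P(X = k).
        pmf *= (n_tags - k) / (k + 1) * p / (1.0 - p)
        k += 1


# Assumed effective human genome size; 3.9 million tags as in the FoxA1 example.
allowed = max_duplicate_tags(3_900_000, 2.7e9)
```

Under these assumptions the FoxA1 dataset yields a limit of one tag per position, which agrees with the example in the text.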

With the current genome coverage of most ChIP-Seq experiments, tag distribution along the genome could be modeled by a Poisson distribution [7]. The advantage of this model is that one parameter, $\lambda_{BG}$, can capture both the mean and the variance of the distribution. After MACS shifts every tag by d/2, it slides windows of width 2d across the genome to find candidate peaks with a significant tag enrichment (Poisson distribution p-value based on $\lambda_{BG}$, default $10^{-5}$). Overlapping enriched peaks are merged, and each tag position is extended d bases from its center. The location with the highest fragment pileup, hereafter referred to as the summit, is predicted as the precise binding location.
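The enrichment test and the summit call can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the Poisson tail is taken as P(X >= k), and each shifted tag is treated as the center of a fragment of length d. None of these details is confirmed by the text as the exact MACS behavior.

```python
import math
from collections import Counter


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), lam > 0, summed directly from the tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))  # P(X = k)
    total, i = 0.0, k
    # Keep adding terms until they are past the mode and negligibly small.
    while term > 0.0 and (i <= lam or term > total * 1e-17):
        total += term
        i += 1
        term *= lam / i
    return min(total, 1.0)


def is_candidate(tag_count, lam, pvalue=1e-5):
    """True if a window's tag count is significantly enriched over lam."""
    return poisson_upper_tail(tag_count, lam) < pvalue


def find_summit(shifted_tags, d):
    """Return (position, depth) where the fragment pileup first peaks.

    Each shifted tag is extended to a fragment of length d centered on it
    (an assumption about how the extension is applied).
    """
    events = Counter()
    for pos in shifted_tags:
        start = pos - d // 2
        events[start] += 1       # fragment begins
        events[start + d] -= 1   # fragment ends (half-open interval)
    best_pos, best_depth, depth = None, 0, 0
    for pos in sorted(events):
        depth += events[pos]
        if depth > best_depth:
            best_pos, best_depth = pos, depth
    return best_pos, best_depth


summit = find_summit([100, 110, 120], d=40)
```

Summing the tail directly, instead of computing one minus the cumulative probability, keeps very small p-values from being lost to floating-point cancellation.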

In the control samples, we often observe tag distributions with local fluctuations and biases. For example, at the FoxA1 candidate peak locations, tag counts are well correlated between ChIP and control samples (Figure 1c,d). Possible sources of these biases include local chromatin structure, DNA amplification and sequencing bias, and genome copy number variation. Therefore, instead of using a uniform $\lambda_{BG}$ estimated from the whole genome, MACS uses a dynamic parameter, $\lambda_{local}$, defined for each candidate peak as:

$$\lambda_{local} = \max(\lambda_{BG}, [\lambda_{1k},]\ \lambda_{5k}, \lambda_{10k})$$

where $\lambda_{1k}$, $\lambda_{5k}$ and $\lambda_{10k}$ are λ estimated from the 1 kb, 5 kb or 10 kb window centered at the peak location in the control sample, or the ChIP-Seq sample when a control sample is not available (in which case $\lambda_{1k}$ is not used). $\lambda_{local}$ captures the

[The preview ends here; the remaining 8 pages are not shown.]
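The dynamic λ defined above, the maximum of a genome-wide background rate and the rates in 1 kb, 5 kb and 10 kb windows around the peak, can be sketched as below. How each window count is converted into an expected count for the peak region (here, scaling by peak_width / window) is an assumption for illustration, as are the function and parameter names.

```python
from bisect import bisect_left, bisect_right


def window_lambda(sorted_tags, center, window, peak_width):
    """Expected tags in a peak-sized region, from a window around center."""
    lo = bisect_left(sorted_tags, center - window // 2)
    hi = bisect_right(sorted_tags, center + window // 2)
    return (hi - lo) * peak_width / window


def lambda_local(sorted_tags, center, peak_width, genome_size, has_control=True):
    """max(lambda_BG, [lambda_1k,] lambda_5k, lambda_10k) for one candidate peak.

    sorted_tags holds sorted control tag positions, or ChIP tag positions when
    no control is available; in that case the 1 kb window is skipped.
    """
    lam_bg = len(sorted_tags) * peak_width / genome_size
    windows = [1_000, 5_000, 10_000] if has_control else [5_000, 10_000]
    local = [window_lambda(sorted_tags, center, w, peak_width) for w in windows]
    return max([lam_bg] + local)


# Toy example: 100 tags clustered within 1 kb on a 1 Mb "genome".
tags = [10_000 + i for i in range(0, 1000, 10)]
with_control = lambda_local(tags, 10_500, 200, 1_000_000, has_control=True)
without_control = lambda_local(tags, 10_500, 200, 1_000_000, has_control=False)
```

In the toy example the 1 kb window dominates when a control is present (λ = 20), while without a control the 5 kb window gives the largest estimate (λ = 4), showing how a locally dense background raises the bar for calling a peak.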
