Skip to content

y741269430/ATAC-seq

Repository files navigation

ATAC-seq workflow

本文主要参考:

基本原理:

  • 1.当发生DNA复制、基因转录时,DNA的致密高级结构变为松散状态,这部分打开的染色质被称之为开放染色质(open chromatin)。开放染色质可以和一些调控蛋白如转录因子和辅因子结合,染色质的这一特性就叫做染色质可接近性(chromatin accessibility)。染色质可接近性反映了染色质转录活跃程度。
  • 2.真核生物染色质基本结构单位是核小体,由 147bp 的DNA缠绕在组蛋白八聚体上。核小体与核小体之间由 20-90bp 的DNA linkers所连接。因此,Tn5酶转座过程,可发生在核小体之间 20-90bp 的任何地方,从连接区中产生 < 90bp 的短片段,或核小体被夹在Tn5切割的DNA中间产生更大的片段。
  • 3.文库核酸片段包含原始DNA插入片段和来自adapter两端的 135bp 测序的序列。这就形成了从 200bp - 1000bp 左右的文库片段。主要的片段堆积在160bp-200bp之间的形成高耸的peak。
  • 4.如果加入的Tn5酶不足以从细胞中获得染色质,或者是Tn5与DNA的比例太低,使转座反应不充分,无法获得理想范围的DNA片段,此时片段大小估计为800bp左右,该现象被称为under-tagmentation。
  • 5.解决办法:裂解均匀;延长在冰上的裂解时间;增加转座酶反应时间;在裂解前加入额外的PBS冲洗步骤,防止其他物质抑制裂解。

生信分析流程:

  • 1.当我们数据下机之后,得到的fastq文件。使用FastQC软件对raw data进行质量评估。后续clean data同样需要评估。
  • 2.使用Trimmomatic软件对原始数据进行质控(这一步主要是去除3’端的接头污染、去除低质量序列(保留MPAQ >= 30))。得到clean data。
  • 3.将clean data使用bowtie2软件与基因组进行比对,得到的sam文件使用samtools转换成bam。
  • 4.得到的bam文件,获取其唯一比对以及去重复reads的结果bam文件。
  • 5.使用Deeptools绘制TSS, Peak center 或GeneBody富集热图(依组学而定),展示数据在这些区域及前后3kb上的富集情况。
  • 6.使用MACS2MACS3进行peak calling。
  • 7.使用IDR软件进行样品间高可信度的峰筛选.
  • 8.将bam文件转换成bigwig文件,使用IGV进行可视化。
  • 9.使用r包ChIPseeker对peak进行注释。
  • 10.使用homerMEME进行motif预测。
  • 11.使用MAnorm(无生物学重复)或DiffBind(有生物学重复)进行差异peak分析.
  • 12.假如测序深度高的情况下,可以考虑做足迹分析,即评估一个peak上是否有转录因子保护的足迹,如果有,这个peak会出现凹陷状。找出这类型的peak,然后再做motif预测,就可以找到具体是哪些转录因子在起调控作用了。比单一只做motif预测更稳健。(即:motif分析提供的是潜在的,统计相关的结合可能性;而足迹分析提供了更直接的,物理性的结合证据。)

结果指标:

  • 1.文库的Q20(90%以上),Q30(80%以上)
  • 2.peak注释的基因组占比(promoter占比大于20%,特殊的样本可大于10%)
  • 3.FRiP值(大于0.2)
  • 4.比对率(模式生物需大于85%)
  • 5.片段大小分布在200,400,600,出现明显峰型
  • 6.deeptools热图展示(peak主要富集在TSS区域)
  • 7.call peak数无固定范围

0.创建conda环境用于ATACseq分析(也可以用mamba)

我这里创建了多个环境,防止软件之间的冲突

# 1
conda create -n atac
conda activate atac
mamba install samtools
mamba install bowtie2
mamba install sambamba
mamba install bedtools
mamba install macs3
mamba install trim-galore
mamba install trimmomatic
mamba install bioconda::ucsc-bedclip
mamba install bioconda::ucsc-bedgraphtobigwig

# 2
conda create -n mqc python=3.6
conda activate mqc
pip install multiqc

# 3
conda create -n idr
conda activate idr
conda install -c bioconda idr

# 4

0.利用bowtie2构建小鼠基因组(mm39)索引(构建一次以后都不用做了)

conda activate atac
cd downloads/genome/mm_v112
mkdir bowtie2_idx
nohup bowtie2-build Mus_musculus.GRCm39.dna_sm.chromosome.chr.fa ./bowtie2_idx/mm39 &

1.开始——激活conda环境

conda activate atac

创建文件夹

mkdir -p trim bam macs3 logs bed rm_blacklist

生成一个filenames的文件,用来记录输出的文件名称(样本名称),例如:

ls raw/*1.fq.gz |cut -d "_" -f 1 |cut -d "/" -f 2 > filenames

2.利用Trim adaptors去除接头

vim pre_trim.sh
#!/bin/bash
## trim_galore ##

cat filenames | while read i; 
do
# paired end
nohup trim_galore -q 25 --phred33 --length 20 -e 0.1 --stringency 1 -j 40 --paired ./raw/${i}*_1.fq.gz ./raw/${i}*_2.fq.gz -o ./trim &
  
# single end
# nohup trim_galore -q 25 --phred33 --length 20 -e 0. 1 --stringency 1 -j 40 ./raw/${i}*_1.fq.gz -o ./trim &
done

2.1 利用trimmomatic去除接头(Illumina)

vim pre_trim.sh
#!/bin/bash
## trimmomatic ##

cat filenames | while read i; 
do
nohup trimmomatic PE -phred33 -threads 4 \
./RawData/${i}*_R1.fastq.gz \
./RawData/${i}*_R2.fastq.gz \
trim/${i}_forward_paired.fq.gz \
trim/${i}_forward_unpaired.fq.gz \
trim/${i}_reverse_paired.fq.gz \
trim/${i}_reverse_unpaired.fq.gz \
ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 &

done

3.比对到mm39

vim atac1_bw2.sh
#!/bin/bash

mm39="/home/jjyang/downloads/genome/mm_v112/bowtie2_idx/mm39"

cat filenames | while read i; 
do
nohup bowtie2 -p 4 --very-sensitive -X 2000 -k 10 \
-x ${mm39} \
-1 trim/${i}_forward_paired.fq.gz \
-2 trim/${i}_reverse_paired.fq.gz \
-S ./bam/${i}.sam 2> ./logs/${i}_map.txt & 
done

-X 设定最长的插入片段长度. Default: 500
--very-sensitive 提高识别的敏感度但是会降低比对速度
-k 默认设置下, bowtie2搜索出了一个read不同的比对结果, 并报告其中最好的比对结果(如果好几个最好的比对结果得分一致, 则随机挑选出其中一个). 而在该模式下, bowtie2最多搜索出一个read k个比对结果, 并将这些结果按得分降序报告出来。

方法二:

#!/bin/bash

mm39="/home/jjyang/downloads/genome/mm_v112/bowtie2_idx/mm39"

cat filenames | while read i; 
do
nohup bowtie2 -p 4 --very-sensitive -X 1000 \
-x ${mm39} \
-1 trim/${i}_forward_paired.fq.gz \
-2 trim/${i}_reverse_paired.fq.gz \
-S ./bam/${i}.sam 2> ./logs/${i}_map.txt & 
done

hisat2/bowtie2 整合输出比对率

输出比对率文件

python MappingRateOutput.py logs/ alignment_res

4.生成raw bam (optional)

vim atac2_sam2bamop.sh
#!/bin/bash
## sam to bam (samtools) ##
## sorted by position (samtools) ##

cat filenames | while read i; 
do
nohup samtools view -@ 4 -h ./bam/${i}.sam | samtools sort -@ 4 -O bam -o ./bam/${i}-sorted-pos.bam &&
samtools index -@ 4 ./bam/${i}-sorted-pos.bam & 
done

# mtReads=$(samtools idxstats ./bam/${i}-sorted-pos.bam | grep 'chrM' | cut -f 3)
# totalReads=$(samtools idxstats ./bam/${i}-sorted-pos.bam | awk '{SUM += $3} END {print SUM}')
# echo '==> mtDNA Content:' $(bc <<< "scale=2;100*$mtReads/$totalReads")'%'
# samtools flagstat -@ 10 ./bam/${i}-sorted-pos.bam > ./logs/${i}-sam.stat &

5.sam to bam 同时去除 ChrM

vim atac2_sam2lastbam.sh
#!/bin/bash
## sam to bam (samtools) ##
## remove ChrM & sorted by position (samtools) ##
## remove duplication (sambamba) ##

cat filenames | while read i; 
do
nohup samtools view -@ 4 -h ./bam/${i}.sam | grep -v chrM | samtools sort -@ 4 -O bam -o ./bam/${i}-rmChrM-sorted-pos.bam && 
sambamba markdup -r -t 40 --overflow-list-size 600000 ./bam/${i}-rmChrM-sorted-pos.bam ./bam/${i}.last.bam & 
done

# samtools flagstat -@ 10 ./bam/${i}-rmChrM-sorted-pos.bam > ./logs/${i}-rmChrM-sorted-pos.stat &
# samtools flagstat -@ 10 ./bam/${i}.last.bam > ./logs/${i}.last.stat &

6.macs2(现已弃用)

vim atac3_macs2.sh
#!/bin/bash
## peak calling (macs2) ##

cat filenames | while read i; 
do
nohup macs2 callpeak -t ./bam/${i}.last.bam -g 2723414844 -n ./macs2/${i} --shift -100 --extsize 200 --nomodel -B --SPMR --keep-dup all -q 0.1 --call-summits &
done

7.使用macs3进行call peak

vim atac4_macs3.sh
#!/bin/bash
## peak calling (macs3) ##

cat filenames | while read i; 
do
nohup macs3 callpeak -f BAMPE -t ./bam/${i}.last.bam -g mm -n ./macs3/${i} -B -q 0.1 &  
done

对于ATAC-seq来说,-f BAMPE 参数仅分析“正确配对”的比对,即下图中的A。
假如将bam转成bed,再去做call peak,这里通常是数据质量比较差的时候所作出的选择。理由是它不仅保留了双端“正确配对”的比对,也保留了单端的比对,即下图中的A和B。
(Note that the BEDTools program bamtobed cannot be used here, since its output is in a nonstandard BED format that MACS2 cannot analyze.)
Harvardinformatics ATAC-seq

narrowPeak和bed文件格式

参考

  • narrowPeak.html

  • 01_Introduction_to_peak_files.md

  • A narrowPeak (.narrowPeak) file is used by the ENCODE project to provide called peaks of signal enrichment based on pooled, normalized (interpreted) data. The narrowPeak file is a BED6+4 format, which means the first 6 columns of a standard BED file with 4 additional fields:

  • BED files require at least 3 fields indicating the genomic location of the feature, including the chromosome and the start and end coordinates. However, there are 9 additional fields that are optional, as shown in the image below.

  • Each row in the narrowPeak file represents a called peak. Below is an the example of a narrowPeak file, displaying the coordinate and statistical information for a handful of called peaks.

8.Remove blacklist

参考

How were the 'blacklists compiled? These blacklists were empirically derived from large compendia of data using a combination of automated heuristics and manual curation. Blacklists were generated for various species and genome versions including human, mouse, worm and fly. The lists can be downloaded here. For human, they used 80 open chromatin tracks (DNase and FAIRE datasets) and 12 ChIP-seq input/control tracks spanning ~60 cell lines in total. These blacklists are applicable to functional genomic data based on short-read sequencing (20-100bp reads). These are not directly applicable to RNA-seq or any other transcriptome data types.

More information about the blacklist region is described in this paper. This is a more recent resource and the authors compiled blacklists that can be downloaded here. This is the source for the bed file used in this workshop.

The black lists were downloaded from https://www.encodeproject.org/annotations/ENCSR636HFF/

vim blackls_rm.sh
#!/bin/bash
## remove blacklist (bedtools) ##

# 检查是否提供了足够的参数
if [ "$#" -ne 2 ]; then
    echo "Usage: $0 <input_directory> <output_directory>"
    exit 1
fi

# 获取输入和输出目录路径
input_dir=$1
output_dir=$2

Blacklist="/home/jjyang/downloads/genome/mm39_GRCm39/mm10_blacklist.bed"

cat filenames | while read i; 
do
input_file="$input_dir/${i}_peaks.narrowPeak"
output_file="$output_dir/${i}_rmBL.narrowPeak"

nohup bedtools intersect \
-v \
-a "$input_file" \
-b ${Blacklist} | awk '{if($0~"chr") print}' \
> "$output_file" &
done
blackls_rm.sh macs3 rm_blacklist

使用以下脚本将样本名更改为下游分析的名称(可选)

修改文件名

9. bam 转bw(用于igv可视化)+ 绘制TSS 富集热图

绘制deeptools的热图

vim atac5_bam2bw.sh
#!/bin/bash
## bamCoverage ##

cat filenames | while read i;
#cat name.info | while IFS=$'\t' read -r old_name new_name; # 如果前面修改了文件名,这里就要用name.info来读取第二列,否则依然以filenames读取第一列
do
nohup bamCoverage --bam ./bam/${i}.last.bam -o bw/${i}.bw --binSize 10 --effectiveGenomeSize 2723414844 --normalizeUsing RPGC --outFileFormat bigwig -p 4 &
done
bash vim atac5_bam2bw.sh

10. homer 的motif预测

homer

11. IDR分析

12. 差异peak分析(diffbind)

13. 富集分析

macs3 peak文件转 bw (可选)

peak2bw

fastqc质控

nohup fastqc -q -t 30 raw/*.fq.gz -o fqc/ &
nohup fastqc -q -t 30 trim/*.fq.gz -o trim_fqc/ &

nohup multiqc fqc/*.zip -o mqc/ &
nohup multiqc trim_fqc/*.zip -o trim_mqc/ &

louvain 聚类

conda activate atac
python ./louvain.py net_drg_select.csv net_drg_select.gexf 1 &

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages