[Pipeline] 在Anaconda下用Snakemake構建ChIP-seq流程(4)

(5)Peak annotation

使用Homer對得到的Peak進行注釋

正常的注釋命令:

annotatePeaks.pl <peak/BED file> <genome> > <output file>

通過對Peak/Bed 文件的注釋,我們能得到這樣的一個表格(使用Homer官網的例子,其實我更喜歡Y叔的ChIPseeker,可視化更好看一些):

annotatePeaks.pl ERpeaks.txt hg18 > outputfile.txt

An example from Homer annotation,from homer.ucsd.edu/homer/ng

如果你需要一個其他的參考基因組,比如hg38,需要預先進行下載:

perl /anaconda3/envs/chip/share/homer-4.9.1-6/.//configureHomer.pl -install hg38

在我們的pipeline里可以寫成:

rule homer_anno:
input:
"peak_file/{sample}_peaks.narrowPeak"
output:
"peak_file/{sample}_peakanno.xls"
shell:
r
annotatePeaks.pl {input} hg38 > {output}

別忘了添加註釋文件到rule all 下。

rule all:
input:
expand(fastq_file/{sample}.fastq, sample=SAMPLES),
expand(trim_file/{sample}.fastq, sample=SAMPLES),
expand(bam_file/{sample}.bam, sample=SAMPLES),
expand(peak_file/{sample}_peaks.narrowPeak, sample=SAMPLES),
expand(peak_file/{sample}__peakanno.xls, sample=SAMPLES)

所以完整的pipeline 是這樣的(註:未經測試,有問題可以問我):

import glob
SAMPLES = []
for filename in glob.glob(rsra_file/*.sra):
SAMPLES.append(filename.replace(.sra,).replace(sra_file/,))

rule all:
input:
expand(fastq_file/{sample}.fastq, sample=SAMPLES),
expand(trim_file/{sample}.fastq, sample=SAMPLES),
expand(bam_file/{sample}.bam, sample=SAMPLES),
expand(peak_file/{sample}_peaks.narrowPeak, sample=SAMPLES),
expand(peak_file/{sample}__peakanno.xls, sample=SAMPLES)
rule sra_fastq:
input:
sra_file/{sample}.sra
output:
fastq_file/{sample}.fastq
shell:
r
fastq-dump -O fastq_file {input}

rule perform_qc:
input:
fastq_file/{sample}.fastq
params:
out_dir = qc
output:
qc/{sample}_fastqc.html,
shell:
r
fastqc -o qc_file -t 4 -f fastq {input}

rule seq_trim:
input:
fastq_file/{sample}.fastq
output:
trimed_file/{sample}.fastq
shell:
r
seqtk trimfq -l 30 {input} > {output}

rule bowtie2_align:
input:
fastq = "trim_file/{sample}.fastq"
output:
"bam_file/{sample}.bam"
shell:
r
bowtie2 -p 10 --local -x /ref/GRCh38/hg38 -U {input.fastq} | samtools sort -O bam -o {output}

rule call_peak:
input:
"bam_file/{sample}.bam"
output:
"peak_file/{sample}_peaks.narrowPeak"
shell:
r
macs2 callpeak -t {input} -q 0.05 -f BAM -g mm -n {wildcards.sample} --outdir bam_file/

rule homer_anno:
input:
"peak_file/{sample}_peaks.narrowPeak"
output:
"peak_file/{sample}_peakanno.xls"
shell:
r
annotatePeaks.pl {input} hg38 > {output}

如果需要輸出其他格式的文件,比如 .bw,可以使用相應的工具進行轉換,這裡給出bam轉bw文件的代碼(需要deeptools 和samtools):

samtools index {input}
bamCoverage -b {input.bam} -o {output}

差不多就是這樣,以後應該不會給出這樣詳盡的pipeline了,但是也可能會用其他的流程管理工具或者別人寫好的工具講一下

推薦閱讀:

TAG:Anaconda | Python | 生物信息學 |