[Pipeline] 在Anaconda下用Snakemake構建ChIP-seq流程（3）

02-09

（3）序列比對：

選擇bowtie2作為序列比對的工具，在進行比對之前需要先下載參考基因組的索引文件，例子裡面用的是hg38。

參考基因組大部分有現成的bowtie2_indexes上下載，地址為：

ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes/

下載之後解壓即可。

例子裡面的hg38並不包括在這裡面，可以自行下載參考基因組，並用bowtie2 build產生索引文件：

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz bowtie2-build hg38.chromFa.tar.gz hg38

產生的索引文件不是一個單獨的文件，而是六個：

hg38.1.bt2 hg38.2.bt2 hg38.3.bt2 hg38.4.bt2 hg38.rev.1.bt2 hg38.rev.2.bt2

正常的命令行執行bowtie2，並用samtools進行排序（-p設置線程數）：

bowtie2 -p 10 --local -x /ref/GRCh38/hg38 -U [fastq file] | samtools sort -O bam -o [bam file]

放在Snakefile里，將結果輸出到bam_file中：

rule bowtie2_align: input: fastq = "trim_file/{sample}.fastq" output: "bam_file/{sample}.bam" shell: r bowtie2 -p 10 --local -x /ref/GRCh38/hg38 -U {input.fastq} | samtools sort -O bam -o {output}

（4）Peak calling ：

使用macs2進行 call peak，找到測序結果中的峰：

對chip-seq，我常用的命令為：

macs2 callpeak -c [control bam file] -t [bam file] -f BAM -g mm -n [output name] -B -q 0.05

公共數據里有很多沒有提供input的數據，幸好macs2也不是必須要控制組才能進行call peak，這裡給出的是沒有control的rule：

rule call_peak: input: "bam_file/{sample}.bam" output: "peak_file/{sample}_peaks.narrowPeak" shell: r macs2 callpeak -t {input} -q 0.05 -f BAM -g mm -n {wildcards.sample} --outdir bam_file/

{wildcards.sample}能幫我們輸出通配符匹配的文件名。

別忘了在rule all 把新的output都添加進去：

rule all: input: expand(fastq_file/{sample}.fastq, sample=SAMPLES), expand(trim_file/{sample}.fastq, sample=SAMPLES), expand(bam_file/{sample}.bam, sample=SAMPLES), expand(peak_file/{sample}_peaks.narrowPeak, sample=SAMPLES)