[轉載]SNPs鑒定及注釋基本流程

07-18

SNP原文地址：SNPs鑒定及注釋基本流程作者：liyan910117

1、輸入文件：Tophat2等軟體產生的比對結果文件（BAM格式），和參考基因組序列文件genome.fa

2、首先要對genome.fa文件建立索引，如：samtools faidx genome.fasta生成的索引文件以.fai後綴結尾。

3、對BAM文件進行排序，如：samtools sort abc.bam abc.sort；具體見3_BAMsort.sh

4、使用samtools的mpileup模塊生成一個bcf文件，然後再使用bcftools命令對bcf文件進行處理，此處主要使用view命令來進行SNP和Indelcalling，具體見4_call_snp.sh 1）mpileup 參數如下：-g ：計算基因型的似然度，輸出於BCF文件中-u ：指定輸出文件為BCF格式-S ：產生每個樣本中鏈偏見的P-value-D ：產出每個樣本的read的深度-r ：指定特定的區域進行SNP，INDEL-f ：指定建好索引的參考序列文件-q ：指定可用比對的最低映射質量，默認為0-Q ：指定可用reads的最低鹼基質量，默認為13其它參數參考官網指導手冊2）bcftools 參數如下：view:這裡主要使用它來處理bcf文件-c ：指定用貝葉斯推理來鑒定變異，並且會自動調用-e參數（指定使用最大似然法）-v ：僅僅輸出變異的位點信息-N ：忽略參考基因上不是A/T/C/G 的位點信息-g：計算突變位點的每個樣本的基因型-b ：指定產出的文件格式是BCF的，默認為VCF格式。其它參數參考官網指導手冊3）示範程序如下：4_call_snp.sh#!/bin/bash -xcd/home/cuckoo/data/liyan/RNAseq_train/SNPssamtoolsmpileup -guSDf genome.fa s0924fb_sorted.bam sCal27_sorted.bam |/home/cuckoo/software/RNAseq/samtools-0.1.19/bcftools/bcftoolsview -cvNg - >sample_raw.vcf5、過濾第四步產生的VCF文件：減少SNPs的假陽性率1）過濾掉突變質量低於30的，reads深度小於5的SNPS位點：輸入文件：上一步的sample_raw.vcf文件輸出文件：過濾後的sample_filter.vcf文件

2）過濾掉reads深度大於100的SNPs位點：/home/cuckoo/software/RNAseq/samtools-0.1.19/bcftools/bcftools viewsample_raw.vcf |/home/cuckoo/software/RNAseq/samtools-0.1.19/bcftools/vcfutils.plvarFilter -D100 > sample_filter.vcf3）第8列的DP4進行過濾： vcf文件中的 DP4一列信息非常重要，提供了4個數據：1：比對結果和正鏈一致的reads數、2：比對結果和負鏈一致的reads數、3：比對結果在正鏈的variant上的reads數、4：比對結果在負鏈的variant上的reads數。可以設定（value3+ value4）大於某一閾值，才算是variant。比如： perl -ne"print $_ if /DP4=(d+),(d+),(d+),(d+)/ &&($3+$4)>=10 && ($3+$4)/($1+$2+$3+$4)>=0.8"smaple_raw.vcf > sample_filter.vcf註：以上的過濾方法根據不同的實驗設計和需求而異，但最終都會生成一個過濾後的文件：snp_filter.vcf，用於下一步的統計分析。6、（可忽略這一步）對過濾後的VCF文件進行解析，統計突變的信息：染色體，位點，參考鹼基，變異鹼基，突變質量具體見6_parse_vcf.pl文件：輸入文件：sample_fiter.vcf 輸出文件：snp.txt6、注釋過濾後的VCF文件：工具-ANNOVAR1、資料庫準備：下載相應的注釋文件到指定目錄如：/data2/disk1/humandbAnnovar/，一般為annovar安裝目錄下的humandb目錄。下載資料庫文件地址：http://www.openbioinformatics.org/annovar/download/hg19_ALL.sites.2010_11.txt.gz或是通過命令：./annotate_variation.pl -downdb -buildverhg19 -webfrom annovar refGene humandb/2、轉換數據格式：第五步過濾後vcf文件（sample_filter.vcf）換為ANNOVAR指定的格式命令：convert2annovar.pl -formatvcf4 sample_filter.vcf >smaple_annovar.input3、注釋：使用table_annovar.pl進行注釋，因為它可以一次性完成位點、基因、區域的注釋工作。1）table_annovar.pl使用方法：/home/cuckoo/software/annovar/table_annovar.plsample_annover.input /data2/disk1/humandbAnnovar/ -buildver hg19-out sample -remove -protocolrefGene,cytoBand,1000g2014oct_all,snp138 -operation g,r,f,f-nastring NA2）table_annovar.pl參數說明：輸入文件：sample_annover.input資料庫位置：/data2/disk1/humandbAnnovar/，一般為annovar安裝目錄下的humandb目錄。-buildver：指定物種基因組的類型，hg19-out：指定輸出文件的前綴，sample；默認為txt格式。-remove：指定刪除所有的臨時文件。-protocol：指定參考文件的資料庫來源，用逗號隔開;refGene,cytoBand,1000g2014oct_all,snp138四個必須包括。-operation：指定與-protocol所對應的參考文件的類型，用逗號隔開。-nastring：指定結果文件中的缺失值表示為NA4、其它注釋方法：annotate_variation.pl有三種注釋模式:gene-based（默認）,region-based,filter-based 使用方法：1）gene-based：annotate_variation.pl -buildver hg19 ex1.avinputhumandb/2）region-based：annotate_variation.pl -regionanno -buildver hg19 -dbtype cytoBandex1.avinput humandb/annotate_variation.pl -regionanno -buildver hg19 -dbtype gff3-gff3dbfile tfbs.gff3 ex1.avinput humandb/3）filter-based：annotate_variation.pl -filter -dbtype 1000g2014oct_all -maf 0.01ex1.avinput humandb/annotate_variation.pl -filter -buildver hg19 -dbtype snp138ex1.avinput humandb/annotate_variation.pl -filter -dbtype ljb26_all -otherinfoex1.avinput humandb/5、注意事項：1）指導手冊：table_annovar.pl和annotate_variation.pl的指導手冊和參數說明可以通過輸入相應的文件名獲得。2）每個文件的具體參數及使用方法請參考官網教程及指導手冊。7、統計第六步中結果文件的突變類型:轉換和顛換，並且用R作條形圖展示。具體見7_snp_count.pl文件: 輸入文件：snp.txt 輸出文件：snp_type.txt

8、R語言畫條形圖展示突變類型，腳本8_snp_count.r