計算 ka/ks , or dn/ds
ka = dn, ks = ds
在之前的一篇文章中,提到了如何用biomart下載兩個物種之間的orthologs數據(從biomart上下載兩個物種的orthologs數據集)。得到的結果中,有兩個orthologs基因之間ka,ks的結果。
那麼如果現在手上兩條orthologs 基因序列,如何計算它們之間的ka, ks呢?
大致思路是,先用氨基酸序列來作比對,然後轉化為核苷酸的序列比對,接下來計算ka,ks。
涉及到的軟體有:
Clusterw
http://www.clustal.org/download/clustalx_help.html
pal2nal.pl
Suyama M, Torrents D, Bork P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments[J]. Nucleic acids research, 2006, 34(suppl_2): W609-W612.
KaKs_Calculator
Zhang Z, Li J, Zhao X Q, et al. KaKs_Calculator: calculating Ka and Ks through model selection and model averaging[J]. Genomics, proteomics & bioinformatics, 2006, 4(4): 259-263.
PAML
Phylogenetic Analysis by Maximum Likelihood (PAML)
這裡利用Dmel-Dsim中的一對同源基因:
Dmel: FBgn0041711
Dsim: FBgn0191968
介紹一下如何使用上述軟體計算ka,ks。
1,從flybase上下載相應蛋白質序列
利用flyID,從 Sequence Downloader ,輸入FlyBase ID,用Type 選擇下載蛋白質(Translations)or核苷酸(CDS)序列:
下載四條序列後,分別命名為:FBgn0041711.pep, FBgn0041711.fasta 和 FBgn0191968.pep, FBgn0191968.fasta .
製作protein sequence align的輸入文件,命名為 two.pep:
用clustalw2 比對這兩條氨基酸序列:
$ clustalw2 two.pep
產生了two.aln 和 two.dnd 兩個文件,其中two.aln是我們需要的protein比對結果。
2,用 蛋白質比對 guide 核苷酸序列 比對
類似的,製作一個 two.nuc文件:
$ pal2nal.pl two.aln two.nuc -nogap >two.codon
pal2nal.pl支持多種輸出格式,如果下游要用PAML繼續作分析,可以指定輸出文件符合paml的輸入格式:
$ pal2nal.pl two.aln two.nuc -output paml -nogap > two.codon
3, 利用PAML中的codeml,計算ka,ks
需要3個輸入文件:two.tree, two.codon, two.cnt (control file)
two.codon 就是上一步比對得到的結果文件,格式如下:
因為我們只有兩條序列,two.tree 文件長這樣:
(FBgn0041711, FBgn0191968);
two.cnt 文件長這樣:
seqfile = two.codon
treefile = two.treeoutfile = two.codeml
noisy = 0 * 0,1,2,3,9: how much rubbish on the screen verbose = 0 * 1: detailed output, 0: concise output runmode = -2 * 0: user tree; 1: semi-automatic; 2: automatic * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise cleandata = 1 * "I added on 07/07/2004" Mikita Suyama seqtype = 1 * 1:codons; 2:AAs; 3:codons-->AAs CodonFreq = 2 * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table model = 2 * models for codons:* 0:one, 1:b, 2:2 or more dN/dS ratios for branches
NSsites = 0 * dN/dS among sites. 0:no variation, 1:neutral, 2:positive icode = 0 * 0:standard genetic code; 1:mammalian mt; 2-10:see below Mgene = 0 * 0:rates, 1:separate; 2:pi, 3:kappa, 4:all fix_kappa = 0 * 1: kappa fixed, 0: kappa to be estimated kappa = 2 * initial or fixed kappa fix_omega = 0 * 1: omega or omega_1 fixed, 0: estimate omega = 1 * initial or fixed omega, for codons or codon-transltd AAs fix_alpha = 1 * 0: estimate gamma shape parameter; 1: fix it at alpha alpha = .0 * initial or fixed alpha, 0:infinity (constant rate)Malpha = 0 * different alphas for genes
ncatG = 4 * # of categories in the dG or AdG models of rates clock = 0 * 0: no clock, unrooted tree, 1: clock, rooted tree getSE = 0 * 0: dont want them, 1: want S.E.s of estimates RateAncestor = 0 * (1/0): rates (alpha>0) or ancestral states (alpha=0) method = 0 * 0: simultaneous; 1: one branch at a time
$ codeml two.cnt
output lots of files, but only two.codeml is what you need.
最後一行給出了ka,ks計算結果
4, 利用 kaks_calculator 計算ka ks
kaks_calculator 的輸入文件長這樣:
FBgn0041711,FBgn0191968
ATGATTCGCATCCTAGTTGGAGTGATTTGTGCGGTCCTTTGGTCGCCGGTTAGCCAGGCTCTAACCCCTGGCTTGCAGGTGGCCAAGCAGTGGAAACTCCTGCGCTATAACTTCGAACCTCAGGCTCCGGTCTCCGATCCTAATTTCTATAATCCTCAAAATGTTCTCATCACCGGCCTG
ATGATTCGCATCCTAGTTGGAGTGATTTGTGTGGTCCTTTGGTCGCCGGTTAGCCAGGCTCTAACCCCTGGCTTGCAGGTGGCCAAGCAGTGGAAACTCCTGCGCTATAACTTCGAACCTCAGGCTCCGGTCTCCGATCCTAATTTCTACAATCCCCAAAATGTTCTCATCACGGGCCTG
$ pal2nal.pl two.aln two.nuc -nogap >two.codon$ perl codon2kaks.pl two.codon >two.axt$ KaKs_Calculator -i two.axt -o two.axt.kaks$ KaKs_Calculator -i two.axt -o two.axt.kaks -m LWL -m YN -m MYN
-m 參數可以指定各種計算ka ks的方法。
-o 參數,指定輸出文件名字。
計算結果放入excel表格裡面非常簡潔明了:
PS:
pal2nal.pl的默認輸出格式為
CLUSTAL W multiple sequence alignment
FBgn0041711 ATGATTCGCATCCTAGTTGGAGTGATTTGTGCGGTCCTTTGGTCGCCGGTTAGCCAGGCTFBgn0191968 ATGATTCGCATCCTAGTTGGAGTGATTTGTGTGGTCCTTTGGTCGCCGGTTAGCCAGGCTFBgn0041711 CTAACCCCTGGCTTGCAGGTGGCCAAGCAGTGGAAACTCCTGCGCTATAACTTCGAACCTFBgn0191968 CTAACCCCTGGCTTGCAGGTGGCCAAGCAGTGGAAACTCCTGCGCTATAACTTCGAACCT
FBgn0041711 CAGGCTCCGGTCTCCGATCCTAATTTCTATAATCCTCAAAATGTTCTCATCACCGGCCTGFBgn0191968 CAGGCTCCGGTCTCCGATCCTAATTTCTACAATCCCCAAAATGTTCTCATCACGGGCCTG
可以用以下代碼將其轉變為kaks_calculator的輸入文件
codon2kaks.pl
# 1gene,2seqs.
# pal2nal output => kaks_calculator input<>;<>;for(1..3){ $a=<>;$a=~s/s+$//; last unless(defined $a || $a eq );$b=<>;$b=~s/s+$//;
<>; ($n1,$seq1)=(split/s+/,$a); ($n2,$seq2)=(split/s+/,$b); $line1.=$seq1;$line2.=$seq2;}print "$n1,$n2
$line1
$line2
";print "
";
推薦閱讀:
※我是解螺旋的礦工,我熱愛生命科學
※生物信息學100個基礎問題 —— 第8題 讀懂FastQC報告 Part III
※Linux入門1:遠程登錄伺服器
※Analyzing RNA-seq data with DESeq2翻譯(3)