計算 ka/ks , or dn/ds

ka = dn, ks = ds

在之前的一篇文章中,提到了如何用biomart下載兩個物種之間的orthologs數據(從biomart上下載兩個物種的orthologs數據集)。得到的結果中,有兩個orthologs基因之間ka,ks的結果。

那麼如果現在手上兩條orthologs 基因序列,如何計算它們之間的ka, ks呢?

大致思路是,先用氨基酸序列來作比對,然後轉化為核苷酸的序列比對,接下來計算ka,ks。

涉及到的軟體有:

Clusterw

clustal.org/download/cl

pal2nal.pl

Suyama M, Torrents D, Bork P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments[J]. Nucleic acids research, 2006, 34(suppl_2): W609-W612.

KaKs_Calculator

Zhang Z, Li J, Zhao X Q, et al. KaKs_Calculator: calculating Ka and Ks through model selection and model averaging[J]. Genomics, proteomics & bioinformatics, 2006, 4(4): 259-263.

PAML

Phylogenetic Analysis by Maximum Likelihood (PAML)

這裡利用Dmel-Dsim中的一對同源基因:

Dmel: FBgn0041711

Dsim: FBgn0191968

介紹一下如何使用上述軟體計算ka,ks。

1,從flybase上下載相應蛋白質序列

利用flyID,從 Sequence Downloader ,輸入FlyBase ID,用Type 選擇下載蛋白質(Translations)or核苷酸(CDS)序列:

download peptide sequences

download nucleotide sequences

下載四條序列後,分別命名為:FBgn0041711.pep, FBgn0041711.fasta 和 FBgn0191968.pep, FBgn0191968.fasta .

製作protein sequence align的輸入文件,命名為 two.pep

clustalw2 比對這兩條氨基酸序列:

$ clustalw2 two.pep

產生了two.alntwo.dnd 兩個文件,其中two.aln是我們需要的protein比對結果。

2,用 蛋白質比對 guide 核苷酸序列 比對

類似的,製作一個 two.nuc文件:

$ pal2nal.pl two.aln two.nuc -nogap >two.codon

pal2nal.pl支持多種輸出格式,如果下游要用PAML繼續作分析,可以指定輸出文件符合paml的輸入格式:

$ pal2nal.pl two.aln two.nuc -output paml -nogap > two.codon

3, 利用PAML中的codeml,計算ka,ks

需要3個輸入文件:two.tree, two.codon, two.cnt (control file)

two.codon 就是上一步比對得到的結果文件,格式如下:

因為我們只有兩條序列,two.tree 文件長這樣:

(FBgn0041711, FBgn0191968);

two.cnt 文件長這樣:

seqfile = two.codon

treefile = two.tree

outfile = two.codeml

noisy = 0 * 0,1,2,3,9: how much rubbish on the screen

verbose = 0 * 1: detailed output, 0: concise output

runmode = -2 * 0: user tree; 1: semi-automatic; 2: automatic

* 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise

cleandata = 1 * "I added on 07/07/2004" Mikita Suyama

seqtype = 1 * 1:codons; 2:AAs; 3:codons-->AAs

CodonFreq = 2 * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table

model = 2

* models for codons:

* 0:one, 1:b, 2:2 or more dN/dS ratios for branches

NSsites = 0 * dN/dS among sites. 0:no variation, 1:neutral, 2:positive

icode = 0 * 0:standard genetic code; 1:mammalian mt; 2-10:see below

Mgene = 0 * 0:rates, 1:separate; 2:pi, 3:kappa, 4:all

fix_kappa = 0 * 1: kappa fixed, 0: kappa to be estimated

kappa = 2 * initial or fixed kappa

fix_omega = 0 * 1: omega or omega_1 fixed, 0: estimate

omega = 1 * initial or fixed omega, for codons or codon-transltd AAs

fix_alpha = 1 * 0: estimate gamma shape parameter; 1: fix it at alpha

alpha = .0 * initial or fixed alpha, 0:infinity (constant rate)

Malpha = 0 * different alphas for genes

ncatG = 4 * # of categories in the dG or AdG models of rates

clock = 0 * 0: no clock, unrooted tree, 1: clock, rooted tree

getSE = 0 * 0: dont want them, 1: want S.E.s of estimates

RateAncestor = 0 * (1/0): rates (alpha>0) or ancestral states (alpha=0)

method = 0 * 0: simultaneous; 1: one branch at a time

$ codeml two.cnt

output lots of files, but only two.codeml is what you need.

最後一行給出了ka,ks計算結果

codeml的輸出結果

4, 利用 kaks_calculator 計算ka ks

kaks_calculator 的輸入文件長這樣:

FBgn0041711,FBgn0191968

ATGATTCGCATCCTAGTTGGAGTGATTTGTGCGGTCCTTTGGTCGCCGGTTAGCCAGGCTCTAACCCCTGGCTTGCAGGTGGCCAAGCAGTGGAAACTCCTGCGCTATAACTTCGAACCTCAGGCTCCGGTCTCCGATCCTAATTTCTATAATCCTCAAAATGTTCTCATCACCGGCCTG

ATGATTCGCATCCTAGTTGGAGTGATTTGTGTGGTCCTTTGGTCGCCGGTTAGCCAGGCTCTAACCCCTGGCTTGCAGGTGGCCAAGCAGTGGAAACTCCTGCGCTATAACTTCGAACCTCAGGCTCCGGTCTCCGATCCTAATTTCTACAATCCCCAAAATGTTCTCATCACGGGCCTG

$ pal2nal.pl two.aln two.nuc -nogap >two.codon$ perl codon2kaks.pl two.codon >two.axt$ KaKs_Calculator -i two.axt -o two.axt.kaks$ KaKs_Calculator -i two.axt -o two.axt.kaks -m LWL -m YN -m MYN

-m 參數可以指定各種計算ka ks的方法。

-o 參數,指定輸出文件名字。

計算結果放入excel表格裡面非常簡潔明了:

PS:

pal2nal.pl的默認輸出格式為

CLUSTAL W multiple sequence alignment

FBgn0041711 ATGATTCGCATCCTAGTTGGAGTGATTTGTGCGGTCCTTTGGTCGCCGGTTAGCCAGGCT

FBgn0191968 ATGATTCGCATCCTAGTTGGAGTGATTTGTGTGGTCCTTTGGTCGCCGGTTAGCCAGGCT

FBgn0041711 CTAACCCCTGGCTTGCAGGTGGCCAAGCAGTGGAAACTCCTGCGCTATAACTTCGAACCT

FBgn0191968 CTAACCCCTGGCTTGCAGGTGGCCAAGCAGTGGAAACTCCTGCGCTATAACTTCGAACCT

FBgn0041711 CAGGCTCCGGTCTCCGATCCTAATTTCTATAATCCTCAAAATGTTCTCATCACCGGCCTG

FBgn0191968 CAGGCTCCGGTCTCCGATCCTAATTTCTACAATCCCCAAAATGTTCTCATCACGGGCCTG

可以用以下代碼將其轉變為kaks_calculator的輸入文件

codon2kaks.pl

# 1gene,2seqs.

# pal2nal output => kaks_calculator input

<>;<>;

for(1..3){

$a=<>;$a=~s/s+$//;

last unless(defined $a || $a eq );

$b=<>;$b=~s/s+$//;

<>;

($n1,$seq1)=(split/s+/,$a);

($n2,$seq2)=(split/s+/,$b);

$line1.=$seq1;$line2.=$seq2;

}

print "$n1,$n2
$line1
$line2
";

print "
";

推薦閱讀:

我是解螺旋的礦工,我熱愛生命科學
生物信息學100個基礎問題 —— 第8題 讀懂FastQC報告 Part III
Linux入門1:遠程登錄伺服器
Analyzing RNA-seq data with DESeq2翻譯(3)

TAG:生物信息學 | 進化 |