Downloading an ortholog dataset for two species from BioMart

05-16

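What are orthologs

See the introduction on the Ensembl website: Protein trees and orthologies

Homologous genes fall into two broad classes: orthologs (Ensembl genome browser 91) and paralogs (Ensembl genome browser 91).

Simply put, take two genes from two species and trace them back to the node where they coalesce in the gene tree. If that node is a speciation node, the two genes are orthologs; if it is a gene duplication node, they are paralogs.

The page linked above describes the pipeline Ensembl uses to infer orthologs and paralogs between species, and it also defines each field in the output. The BioMart example below explains these in more detail.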

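Where to download orthologs between two species (orthologs across more than two species are a separate topic)

First, open the BioMart web page: Ensembl Metazoa

Click BioMart:

First select "Ensembl Metazoa Genes 38"

Then choose the dataset for the species of interest, here Drosophila melanogaster:

Note the "Filters" and "Attributes" entries in the left sidebar of the screenshot above.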

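Filters restrict which records are returned.

Attributes control which fields are written to the output.

For example, to get the orthologs between Drosophila melanogaster and Drosophila simulans, tick the corresponding option under Filters.

Next choose the Attributes to output. Because we want orthologs, pay particular attention to the "Homologues" section:

Scroll down to Drosophila simulans and tick the fields you want in the output:

Once the filters and attributes are set, the left sidebar shows the selected dataset, the filter conditions, and the attributes to be output.

Click Results, and the page shows a preview of the first 10 rows of the output file.

If the preview looks right, click GO to download the file to your computer.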
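
Once the file is on disk it can be loaded for the processing steps that follow. The snippet below is a minimal sketch, assuming the export was saved as tab-separated text with a header row; the helper name `read_biomart_tsv`, the "Gene stable ID" column name, and the transcript IDs in the sample are illustrative assumptions, not values taken from the real download.

```python
import csv
import io


def read_biomart_tsv(handle):
    """Return the rows of a tab-separated export as a list of dicts keyed by header."""
    return list(csv.DictReader(handle, delimiter="\t"))


# In-memory stand-in for the downloaded file (hypothetical transcript IDs).
sample = io.StringIO(
    "Gene stable ID\tTranscript stable ID\n"
    "FBgn0052280\tTX_A\n"
    "FBgn0052280\tTX_B\n"
)
rows = read_biomart_tsv(sample)
print(len(rows), rows[0]["Gene stable ID"])
```

For a real file, pass an open file object instead of the `io.StringIO` sample, for example `open(path, newline="")` with whatever path the download was saved to.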

How to process the downloaded data

The output shows five columns that are of particular interest:

Drosophila simulans homology type
%id. target Drosophila simulans gene identical to query gene
%id. query gene identical to target Drosophila simulans gene
Drosophila simulans orthology confidence [0 low, 1 high]
dN or dS with Drosophila simulans

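Based on this information, the data can be processed in the following steps.

1. First, remove redundancy.

In the screenshot of the output above, gene FBgn0052280 has three rows.

This is because one gene can have several transcripts.

The three rows for FBgn0052280 are identical in every column except the second one, "Transcript stable ID". Since we are only looking for orthologs between Dmel and Dsim and do not care about the one-gene-to-many-transcripts relationship, it is enough to keep one of these rows.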
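
The de-duplication described above can be sketched as follows. This is one possible approach, not the only one: it treats two rows as duplicates when they agree in every column except "Transcript stable ID" and keeps the first. The "Gene stable ID" and "Ortholog" column names, the transcript IDs, and the second gene are hypothetical placeholders.

```python
def drop_transcript_duplicates(rows, transcript_col="Transcript stable ID"):
    """Keep the first row of each group that differs only in the transcript column."""
    seen = set()
    unique = []
    for row in rows:
        # Hashable key built from every column except the transcript ID.
        key = tuple(sorted((k, v) for k, v in row.items() if k != transcript_col))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows: three transcripts of one gene plus one other gene.
rows = [
    {"Gene stable ID": "FBgn0052280", "Transcript stable ID": "TX_A", "Ortholog": "SIM_1"},
    {"Gene stable ID": "FBgn0052280", "Transcript stable ID": "TX_B", "Ortholog": "SIM_1"},
    {"Gene stable ID": "FBgn0052280", "Transcript stable ID": "TX_C", "Ortholog": "SIM_1"},
    {"Gene stable ID": "GENE_B", "Transcript stable ID": "TX_D", "Ortholog": "SIM_2"},
]
deduped = drop_transcript_duplicates(rows)
print(len(deduped))
```

Rows that differ in any other column, such as a gene with two distinct orthologs, are kept as separate records.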

2. Filter by orthology confidence

See the official documentation, Protein trees and orthologies:

We compute a duplication confidence score for each duplication node as the Jaccard index of the sets of species under the two sub-trees. The score is sometimes refered as "species intersection score".
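
To make the Jaccard index in the quoted definition concrete, here is a small sketch. It only illustrates the formula (size of the intersection divided by size of the union); the species sets are made-up examples, and this is not Ensembl's actual implementation.

```python
def jaccard_index(species_a, species_b):
    """Jaccard index of two species sets: |A & B| / |A | B|."""
    a, b = set(species_a), set(species_b)
    union = a | b
    if not union:
        # Two empty sets: define the score as 0.0 to avoid dividing by zero.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical species found under the two sub-trees of a duplication node.
left_subtree = {"dmel", "dsim", "dyak"}
right_subtree = {"dmel", "dsim"}
score = jaccard_index(left_subtree, right_subtree)
print(round(score, 3))
```

Here two species are shared out of three in total, so the score is 2/3.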

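Confidence is positively correlated with how reliable an ortholog call is. In the output, the vast majority of rows have an orthology confidence of 1, so you can filter directly with "orthology confidence == 1".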
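
A minimal sketch of this filter, assuming the rows were read as dicts of strings and that the column header matches the one listed earlier. Rows with an empty confidence value (for example genes without an ortholog) are dropped as well, which is a design choice of this sketch.

```python
CONF_COL = "Drosophila simulans orthology confidence [0 low, 1 high]"


def keep_high_confidence(rows, conf_col=CONF_COL):
    """Keep only rows whose orthology confidence is exactly 1."""
    kept = []
    for row in rows:
        value = (row.get(conf_col) or "").strip()
        if value == "1":
            kept.append(row)
    return kept


# Hypothetical rows; values are strings, as they would be when read from a TSV file.
rows = [
    {"Gene stable ID": "GENE_A", CONF_COL: "1"},
    {"Gene stable ID": "GENE_B", CONF_COL: "0"},
    {"Gene stable ID": "GENE_C", CONF_COL: ""},
]
high_conf = keep_high_confidence(rows)
print([row["Gene stable ID"] for row in high_conf])
```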

3. Filter by homology type

See the official documentation, Protein trees and orthologies:

Using the Gene Trees, we can infer the following pairwise relationships.

Orthologues
Any gene pairwise relation where the ancestor node is a speciation event. We predict several descriptions of orthologues:
1-to-1 orthologues (ortholog_one2one)
1-to-many orthologues (ortholog_one2many)
many-to-many orthologues (ortholog_many2many)
between-species paralogies —only as exceptions, see below — (reusing the three orthology types above)
Paralogues
Any gene pairwise relation where the ancestor node is a duplication event. We predict several descriptions of paralogues:
same-species paralogies (within_species_paralog)
fragments of the same ‐predicted‐ gene (gene_split)

Depending on the goal of the study, if you only need one-to-one orthologs, filter this column on "ortholog_one2one".
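
As a sketch, the homology-type filter can be written as below. The column header is the one listed earlier and the type strings come from the quoted documentation; the gene IDs are hypothetical.

```python
TYPE_COL = "Drosophila simulans homology type"


def keep_homology_type(rows, wanted="ortholog_one2one", type_col=TYPE_COL):
    """Keep only rows whose homology type equals the wanted label."""
    return [row for row in rows if row.get(type_col) == wanted]


# Hypothetical rows covering the three orthology types.
rows = [
    {"Gene stable ID": "GENE_A", TYPE_COL: "ortholog_one2one"},
    {"Gene stable ID": "GENE_B", TYPE_COL: "ortholog_one2many"},
    {"Gene stable ID": "GENE_C", TYPE_COL: "ortholog_many2many"},
]
one_to_one = keep_homology_type(rows)
print([row["Gene stable ID"] for row in one_to_one])
```

Passing a different `wanted` value, such as "ortholog_one2many", selects another class with the same function.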

4. Gene identity

Two columns contain the phrase "gene identical":

%id. target Drosophila simulans gene identical to query gene
%id. query gene identical to target Drosophila simulans gene

The official explanation, from FAQs for EG | Ensembl Genomes:

Q2) What do Target % ID and Query % ID mean in the Comparative Genomics views of the Ensembl browser?
A2) Query % ID and Target % ID are reported on Comparative Genomics views of the Ensembl Genomes browser such as the Orthologues page (see an example here). If you are searching for one gene in arabidopsis and looking for its homologue in another species such as maize, the query % ID refers to the percentage of the query sequence (arabidopsis) that matches to the homologue (the maize protein). Target % ID refers to the percentage of the target sequence (maize) that matches to the query sequence(arabidopsis).
It can be helpful to think of a query sequence (arabidopsis protein) of length 100 amino acids, and a target sequence (maize protein) of length 50 amino acids. Assume that the sequence of 50 amino acids are identical between the two proteins. In this case, Query % ID will be 50%, and Target % ID will be 100%.

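The two values differ in the denominator: query %ID divides by the length of the query sequence, and target %ID divides by the length of the target sequence.

For example, if the query sequence is 100 amino acids long, the ortholog found is 50 amino acids long, and all 50 of those amino acids are identical, then

query %ID = 50/100 = 50%

target %ID = 50/50 = 100%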
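
The same arithmetic as a small sketch. The function name and its parameters are illustrative; it simply divides the number of identical positions by each sequence length, following the FAQ's description.

```python
def percent_identity(identical_positions, query_length, target_length):
    """Return (query %ID, target %ID) for a pair of aligned proteins."""
    query_pct = 100.0 * identical_positions / query_length
    target_pct = 100.0 * identical_positions / target_length
    return query_pct, target_pct


# The FAQ example: 100 aa query, 50 aa target, 50 identical positions.
query_pct, target_pct = percent_identity(50, 100, 50)
print(query_pct, target_pct)  # 50.0 100.0
```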

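5. dN or dS with Drosophila simulans

Very large dN/dS values are often unreliable, so you can filter on the dS and dN values themselves, for example by requiring dN or dS to lie between 0 and 0.25. You can also group genes by these values, for example splitting them by dS into four groups: 0, 0~0.01, 0.01~0.1, and >0.1.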
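
A minimal sketch of the range filter and the dS grouping. The text does not say which side of each boundary is inclusive, so this sketch assumes the groups are exactly 0, (0, 0.01], (0.01, 0.1], and above 0.1, and that the range filter includes both endpoints.

```python
def within_range(value, low=0.0, high=0.25):
    """True if a dN or dS value lies in [low, high] (inclusive bounds assumed)."""
    return low <= value <= high


def ds_bin(ds):
    """Assign a dS value to one of the four groups (boundary handling is an assumption)."""
    if ds == 0:
        return "0"
    if ds <= 0.01:
        return "0~0.01"
    if ds <= 0.1:
        return "0.01~0.1"
    return ">0.1"


# Hypothetical dS values.
ds_values = [0.0, 0.005, 0.05, 0.3]
bins = [ds_bin(ds) for ds in ds_values]
kept = [ds for ds in ds_values if within_range(ds)]
print(bins)
print(kept)
```

Values read from the downloaded file are strings, so they need converting with `float()` first, and empty cells need to be skipped.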