A Roundup of Bioinformatics Databases
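Bioinformatics makes heavy use of hidden Markov models (HMMs), which are built on Markov chains. In mathematics, a Markov chain is a discrete-time stochastic process with the Markov property: given the current knowledge or information, only the current state is used to predict the future, and the past (the history of states before the present) is irrelevant to predicting the future (the states after the present).
At each step of a Markov chain, the system may move from one state to another, or stay in its current state, according to a probability distribution. A change of state is called a transition, and the probabilities associated with the different state changes are called transition probabilities. A random walk is an example of a Markov chain: the state at each step is a point on a graph, and each step may move to any adjacent point, with every adjacent point equally likely regardless of the path taken so far.
HMMs are usually studied through three basic problems: evaluation, decoding, and learning.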
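The random walk described above can be sketched in a few lines of Python. This is only an illustration: the ring-shaped graph, the `NEIGHBOURS` table, and the function name `random_walk` are assumptions made for the example, not something taken from the article.

```python
import random

# Hypothetical graph: a ring of 5 points, each adjacent to its two neighbours.
NEIGHBOURS = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}


def random_walk(start, steps, rng):
    """Return the visited states; every step picks an adjacent point uniformly."""
    path = [start]
    for _ in range(steps):
        current = path[-1]
        # Markov property: the next state depends only on the current state,
        # never on the earlier part of the path.
        path.append(rng.choice(NEIGHBOURS[current]))
    return path


if __name__ == "__main__":
    print(random_walk(0, 10, random.Random(42)))
```

Passing a seeded `random.Random` keeps the walk reproducible, which is convenient when comparing runs.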
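1 Evaluation problem
Given an observation sequence O=O1O2O3…Ot and model parameters λ=(A,B,π) (A: state-transition probabilities, B: observation probabilities, π: initial state distribution), how can we efficiently compute the probability of the observation sequence, and thereby evaluate the HMM? For example, given several HMMs with different parameters and an observation sequence O=O1O2O3…Ot, we may want to know which HMM is most likely to have generated that sequence. Typically the forward algorithm is used to compute the probability that each HMM produces O, and the model with the highest probability is selected.
A classic example of the evaluation problem is speech recognition. In HMM-based speech recognition, each word has its own HMM and each observation sequence consists of the speech signal for one word. A word is recognized by evaluating the models and choosing the HMM most likely to have produced the sounds represented by the observation sequence.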
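As a minimal sketch of the evaluation problem, the forward algorithm can be written in plain Python and used to pick the more likely of two models. The two toy DNA models below (their names, states "H" and "L", and every probability) are invented for illustration only.

```python
def forward(obs, states, pi, A, B):
    """Return P(O | lambda) for the observation sequence obs."""
    # alpha[s] = P(O_1..O_t, state at time t = s | lambda)
    alpha = {s: pi[s] * B[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {
            s: sum(alpha[r] * A[r][s] for r in states) * B[s][o]
            for s in states
        }
    return sum(alpha.values())


def most_likely_model(obs, states, models):
    """Return the name of the model with the highest P(O | lambda)."""
    return max(models, key=lambda name: forward(obs, states, **models[name]))


# Hypothetical two-state models; all numbers are illustrative assumptions.
STATES = ("H", "L")
TRANSITIONS = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
MODELS = {
    "mixed": dict(
        pi={"H": 0.5, "L": 0.5},
        A=TRANSITIONS,
        B={"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
           "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}},
    ),
    "gc_rich": dict(
        pi={"H": 0.5, "L": 0.5},
        A=TRANSITIONS,
        B={"H": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
           "L": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}},
    ),
}

if __name__ == "__main__":
    print(most_likely_model("GGCGC", STATES, MODELS))
```

For long sequences the raw probabilities underflow, so practical implementations rescale alpha at each step or work in log space; that is omitted here to keep the sketch short.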
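2 Decoding problem
Given an observation sequence O=O1O2O3…Ot and model parameters λ=(A,B,π), how can we find the hidden-state sequence that is optimal in some sense? Here the interest lies in the hidden states of the model, which cannot be observed directly but often carry the more valuable information. The Viterbi algorithm is typically used to find them.
A practical example is Chinese word segmentation, that is, deciding how a sentence should be split into words. For example, should the phrase 「發展中國家」 ("developing countries") be segmented as 「發展-中-國家」 or as 「發展-中國-家」? This can be solved with an HMM: the segmentation is treated as the hidden states and the sentence as the given observations, and the HMM is used to find the most probable segmentation.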
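The decoding step can be sketched with a small Viterbi implementation. To stay close to bioinformatics, the example decodes a short DNA string with a toy two-state model rather than a segmentation model; the states ("H" and "L"), the sequence, and all probabilities are assumptions chosen for illustration.

```python
def viterbi(obs, states, pi, A, B):
    """Return (best_path, probability) of the most likely hidden-state sequence."""
    # delta[s] = probability of the best path that ends in state s at the current step.
    delta = {s: pi[s] * B[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        prev, delta, step = delta, {}, {}
        for s in states:
            # Best predecessor of state s at this step.
            best_prev = max(states, key=lambda r: prev[r] * A[r][s])
            step[s] = best_prev
            delta[s] = prev[best_prev] * A[best_prev][s] * B[s][o]
        backpointers.append(step)
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical toy model; all numbers are illustrative assumptions.
STATES = ("H", "L")
PI = {"H": 0.5, "L": 0.5}
A = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
B = {"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
     "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

if __name__ == "__main__":
    path, prob = viterbi("GGCACTGAA", STATES, PI, A, B)
    print("".join(path), prob)
```

With these toy parameters the decoded path for "GGCACTGAA" is HHHLLLLLL: the G/C-heavy start is assigned to state H and the rest to state L.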
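3 Learning problem
Here the HMM parameters λ=(A,B,π) are unknown, and the question is how to adjust them so that the probability of the observation sequence O=O1O2O3…Ot is as large as possible. This is usually solved with the Baum-Welch algorithm (an expectation-maximization procedure) or with a Viterbi-based training variant.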
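One re-estimation step of Baum-Welch can be sketched as follows. This is a simplified, unscaled version for a single short sequence, intended only to show the shape of the update; the starting parameters and the names `forward_backward` and `baum_welch_step` are assumptions made for the example.

```python
def forward_backward(obs, states, pi, A, B):
    """Return forward table, backward table, and P(O | lambda) (no scaling)."""
    T = len(obs)
    alpha = [{s: pi[s] * B[s][obs[0]] for s in states}]
    for t in range(1, T):
        alpha.append({s: sum(alpha[t - 1][r] * A[r][s] for r in states) * B[s][obs[t]]
                      for s in states})
    beta = [{s: 1.0 for s in states} for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = {s: sum(A[s][r] * B[r][obs[t + 1]] * beta[t + 1][r] for r in states)
                   for s in states}
    return alpha, beta, sum(alpha[-1].values())


def baum_welch_step(obs, states, symbols, pi, A, B):
    """One EM update; returns (new_pi, new_A, new_B, likelihood under the old parameters)."""
    alpha, beta, lik = forward_backward(obs, states, pi, A, B)
    T = len(obs)
    # gamma[t][s]: probability of being in state s at time t, given O.
    gamma = [{s: alpha[t][s] * beta[t][s] / lik for s in states} for t in range(T)]
    # xi[t][(r, s)]: probability of the transition r -> s between t and t+1, given O.
    xi = [{(r, s): alpha[t][r] * A[r][s] * B[s][obs[t + 1]] * beta[t + 1][s] / lik
           for r in states for s in states} for t in range(T - 1)]
    new_pi = dict(gamma[0])
    new_A = {r: {s: sum(x[(r, s)] for x in xi) / sum(g[r] for g in gamma[:-1])
                 for s in states} for r in states}
    new_B = {s: {k: sum(g[s] for g, o in zip(gamma, obs) if o == k) / sum(g[s] for g in gamma)
                 for k in symbols} for s in states}
    return new_pi, new_A, new_B, lik


# Hypothetical starting parameters; all numbers are illustrative assumptions.
STATES, SYMBOLS = ("H", "L"), "ACGT"
PI = {"H": 0.5, "L": 0.5}
A = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
B = {"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
     "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

if __name__ == "__main__":
    pi, a, b = PI, A, B
    for _ in range(5):
        pi, a, b, lik = baum_welch_step("GGCACTGAA", STATES, SYMBOLS, pi, a, b)
        print(lik)
```

Each iteration should leave the likelihood of the training sequence unchanged or higher. A practical implementation would add scaling or log-space arithmetic and would pool statistics over many sequences.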
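Bioinformatics databases are commonly divided into four broad categories, including genome databases, nucleic acid and protein primary-structure (sequence) databases, and three-dimensional structure databases for biological macromolecules. Current research hotspots centre on querying genomes, miRNA, lncRNA, circRNA, and similar molecules; on interactions between proteins or protein modifications (methylation, acetylation, etc.) and DNA promoters, miRNA, lncRNA, and circRNA; and on the mutual binding and regulation among lncRNA, miRNA, mRNA, and circRNA. There are currently more than a hundred such databases, with no single systematic, purpose-built resource. Below is our own compilation, organized under four headings (database query categories, database functions and uses, worked examples, and database highlights), which describes and demonstrates how to query and use these databases. We hope it helps with your experimental projects.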
Gene query databases:
Look up information on your gene and its associated sequences
①NCBI:https://www.ncbi.nlm.nih.gov/
②UCSC:http://genome.ucsc.edu/
③Ensembl:http://www.ensembl.org/index.html
④EBI:http://www.ebi.ac.uk/
⑤NIG:http://www.nig.ac.jp/
miRNA query databases:
①miRBase: http://www.mirbase.org
②microRNA.org:http://www.microrna.org/
③deepBase:http://deepbase.sysu.edu.cn/
④starBase:http://starbase.sysu.edu.cn/
⑤targetScan:http://www.targetscan.org/vert_70/
⑥TarBase: http://www.tarbase.com/
⑦miRanda:http://www.microrna.org/microrna/home.do
⑧RNAhybrid:https://bibiserv.cebitec.uni-bielefeld.de/
⑨CoGeMiR:http://cogemir.tigem.it/
⑩miRNApath:http://lgmb.fmrp.usp.br/mirnapath/tools.php
lncRNA query databases:
①Ensembl:http://www.ensembl.org/index.html
②LncRNAdb: http://www.lncrnadb.org/
③LNCipedia:https://lncipedia.org/
④CHIPbase: http://rna.sysu.edu.cn/chipbase/
⑤starBase:http://starbase.sysu.edu.cn/
circRNA query databases:
①circBase:http://www.circbase.org/
②CIRCpedia:http://www.picb.ac.cn/rnomics/circpedia/
③deepBase:http://rna.sysu.edu.cn/deepBase/
④starBase:http://starbase.sysu.edu.cn/index.php
Functions of commonly used databases:
Gene database functions:
1. NCBI:
The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information.
Database functions:
Submit: NCBI collects submissions of data for the world's largest public repository of biological and scientific information.
Download: The majority of NCBI data are available for downloading, either directly from the NCBI FTP site or by using software tools to download custom datasets.
Learn: NCBI creates a variety of educational products including courses, workshops, webinars, training materials and documentation. NCBI educational events are free and open to everyone. All NCBI educational materials are available for anyone to re-use and distribute.
Develop: NCBI provides a variety of resources that allow developers to access and manipulate NCBI data in their applications.
Analyze: NCBI provides a wide variety of data analysis tools that allow users to manipulate, align, visualize and evaluate biological data.
2. UCSC Genome Browser:
The UCSC Genome Browser is developed and maintained by the Genome Bioinformatics Group, a cross-departmental team within the UCSC Genomics Institute. The website has grown to include a broad collection of vertebrate and model organism assemblies and annotations, along with a large suite of tools for viewing, analyzing and downloading data.
Database functions:
Genome Browser: interactively visualize genomic data
BLAT: rapidly align sequences to the genome
Table Browser: download data from the Genome Browser database
Variant Annotation Integrator: get functional effect predictions for variant calls
Data Integrator: combine data sources from the Genome Browser database
Gene Sorter: find genes that are similar by expression and other metrics
Genome Browser in a Box (GBiB): run the Genome Browser on your laptop or server
In-Silico PCR: rapidly align PCR primer pairs to the genome
LiftOver: convert genome coordinates between assemblies
VisiGene: interactively view in situ images of mouse and frog
miRNA database functions:
1. miRBase:
The microRNA database
Database functions:
• The miRBase database is a searchable database of published miRNA sequences and annotation.
• The miRBase Registry provides miRNA gene hunters with unique names for novel miRNA genes prior to publication of results.
2. microRNA.org:
Targets and Expression. Predicted microRNA targets & target downregulation scores. Experimentally observed expression patterns.
Database functions:
1. mirSVR predicted target site scoring method: Comprehensive modeling of microRNA targets predicts functional non-conserved and non-canonical sites
2. microRNA target predictions: The microRNA.org resource: targets and expression.
3. miRanda application: Human MicroRNA targets.
4. miRanda algorithm: MicroRNA targets in Drosophila.
lncRNA database functions:
1. Ensembl genome browser:
Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotates genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.
Database functions:
Variant Effect Predictor
Gene expression in Ensembl
Retrieving sequences
Compare genes across species
SNPs and other variants for my gene
Use my own data in Ensembl
2. lncRNAdb:
Long Noncoding RNA Database v2.0: The Reference Database For Functional Long Noncoding RNAs
Database functions:
nucleotide sequences
genomic context
gene expression data derived from the Illumina Body Atlas
structural information
subcellular localization
conservation
function with referenced literature
3. LNCipedia.org v. 4.1:
A comprehensive compendium of human long non-coding RNAs
circRNA database functions:
1. circBase:
Circular RNA (circRNA) is a recent addition to the growing list of types of noncoding RNA. Here you can explore public circRNA datasets and download the custom Python scripts needed to discover circRNAs in your own RNA-seq data.
Database functions:
• Sequence-based search
• Search the database by identifier, gene description, genomic position, or their lists.
• Retrieve dataset slices by defining a set of conditions (table browser).
• Export tables in a variety of formats.
• Export FASTA files containing genomic sequence.
2. CIRCpedia:
CIRCpedia is an integrative database, aiming to annotate alternative back-splicing and alternative splicing in circRNAs across different cell lines. Through employing an upgraded circRNA characterization pipeline (CIRCexplorer2), thousands of alternative back-splicing and alternative splicing events in circRNAs were identified. All these identified alternative back-splicing and alternative splicing events in circRNAs, together with novel exons, are formatted and classified so that they can be easily searched, browsed and downloaded from CIRCpedia.
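Worked examples:
Gene query: UCSC database, using H19 as an example
1. Open the home page.
2. Click Genome Browser and select the species.
3. Enter the gene name in the search box and click "GO".
4. The information for the gene is then displayed.
Database highlights: the UCSC database provides information on the gene, as well as data such as its sequence conservation across species.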
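miRNA query: using miRBase, with hsa-mir-9 as an example
1. Enter the URL and open the home page.
2. Enter the miRNA name in the "search by miRNA name or keyword" box.
3. Click "GO" to run the query.
4. Click the entry for the species you need to view the information for that miRNA.
5. Click "Get sequence" to obtain the sequence.
Database highlights: miRBase is a very powerful miRNA query database. Besides looking up miRNA information, it can also be used for miRNA-mRNA binding prediction analysis; please explore it further for details.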
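lncRNA query: using lncRNA H19 as an example
Ensembl genome browser database:
1. Open the home page.
2. Select the species and enter the lncRNA to query in the search box.
3. Click through to the result to obtain the information for lncRNA H19.
Database highlights: Ensembl lets you look up the different splice variants of an lncRNA together with their details. For an lncRNA with several splice variants, this makes it possible to retrieve the exact sequence of the variant under study.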
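The same lookup can also be scripted. The sketch below builds a request URL for the public Ensembl REST service, assuming its lookup-by-symbol endpoint (`/lookup/symbol/:species/:symbol`) and the species name `homo_sapiens`. The helper names `build_lookup_url` and `fetch_gene` are hypothetical, and the structure of the returned JSON should be checked against the Ensembl REST documentation before relying on it.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def build_lookup_url(species, symbol, expand=True):
    """Build a lookup-by-symbol URL (endpoint path assumed from the Ensembl REST docs)."""
    params = {"content-type": "application/json"}
    if expand:
        # expand=1 asks the service to include the gene's transcripts.
        params["expand"] = 1
    return f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}?{urlencode(params)}"


def fetch_gene(species, symbol):
    """Fetch and parse the JSON record; requires network access."""
    with urlopen(build_lookup_url(species, symbol), timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only the URL is printed here; call fetch_gene("homo_sapiens", "H19") to query the service.
    print(build_lookup_url("homo_sapiens", "H19"))
```

Keeping URL construction separate from the network call makes the request easy to inspect and test offline.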
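circRNA query:
Using a circRNA database, with CDR1 (cerebellar degeneration-related protein 1) as the example, look up the circular RNA information.
Database highlights: besides looking up the circular RNAs that correspond to a gene's transcripts, circBase also lets you search directly by circular RNA ID or name to obtain detailed information on the circular RNA.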