PCR duplicates in NGS - II

接上篇(PCR duplicates in NGS - I)

這一次的資料給出了詳細的計算公式,並且有R做模擬驗證的code。

在我實際處理NGS data時候,從測序數據中自己寫腳本remove PCR duplicates,常常是根據read的坐標。如果是single end read,就看5 端read map 到reference上的位置,map到一模一樣位置的reads集合會隨機保留一條read信息,丟掉其他的。如果是paired end reads,看一對reads map到reference後,兩個5』端坐標是不是完全一致,完全一致的那些read pairs,隨機保留一對,其他的扔掉。

下面介紹的資料是用類似的思路,直接從fragment在基因組的坐標分布來計算duplication rate。

原文:Expected number of read duplicates

先直接來看公式,

設G為genome size,N為read 總數目,

For single end reads,

PCR duplicate rate = N - (sum(dpois(1:n, N/G/2) * G) * 2)

  • dpois(1:n, N/G/2) 是基因組上某一個位置,被1,2,... ,n條reads covered到的概率。
  • dpois(1:n, N/G/2) * G 是基因組上,有多少位置被1條read cover到,多少個位置被2條reads cover到...
  • sum(dpois(1:n, N/G/2) * G) 是基因組上總共有多少個位置被至少一條read讀到。

For paired end reads,

如果打碎基因組DNA後,得到的fragment最小長度為a,最大為b,fragment長度服從[a,b]見的均勻分布,那麼考慮到不同size的fragments,我們需要「延長」基因組:

G2<- G + G * (b - a)

PCR duplicate rate = N - (sum(dpois(1:n, N / G2 / 2) * G2) * 2

如果fragment長度服從類似泊松分布怎麼辦呢?下面通過模擬來估算duplicate rate。

library(data.table)G<- 10000 #genome sizeN<- 20000 #total read counta<- 1 # Min fragment lengthb<- 10 # Max fragment lengthL<- 15 # Mean fragment length, if assume uniform fragment size distributionn<- 10 # A number sufficiently large (ideally infinite)set.seed(1)# generate 5 coordinate of one mapped readpos_fp1<- sample(1:G, size= N, replace= TRUE)set.seed(2)# pos_fp2<- pos_fp1 + rpois(n= N, lambda= L)# generate 5 coordinate of the other read of the same fragmentpos_fp2<- pos_fp1 + sample(a:b, size= N, replace= TRUE)pos_fp2<- ifelse(pos_fp2 > G, G, pos_fp2)set.seed(3)# generate + or - strand for each readstrand<- sample(c(-, +), size= N, replace= TRUE)# make "reads", from small coord to largereads<- data.table(pos_fp1, pos_fp2, strand)[order(pos_fp1, pos_fp2),] ## Observed number of reads for SE:## ================================# for each 5 coordinate, count the number of reads as depth,# make "depth_se", from small depth to largedepth_se<- reads[, list(depth= .N), by= list(pos_fp1, strand)][order(depth)]dups<- table(depth_se$depth)dups<- data.table(depth= as.numeric(names(dups)), n_pos=as.vector(dups))dups$nreads<- dups$depth * dups$n_posdups$ndups<- (dups$depth - 1) * dups$n_possum(dups$ndups)7373## Analytical way SE## =================N - (sum(dpois(1:n, N/G/2) * G) * 2)7357.589## Observed number of reads for PE:## ================================# for each pair of 5 coordinates, count the number of read pairs as depth,depth_pe<- reads[, list(depth= .N), by= list(pos_fp1, pos_fp2, strand)][order(depth)]N - nrow(depth_pe)972## Analytical way PE## =================G2<- G + G * (b - a)N - (sum(dpois(1:n, N / G2 / 2) * G2) * 2)967.4836

推薦閱讀:

手把手教你生信分析平台搭建(一)
生物信息學100個基礎問題:問題及答案目錄
生物信息實用R語言筆記1-軟體安裝與設置
使用MutMap快速定位突變基因:原理篇
【好書分享】生信技能學習指南

TAG:生物信息學 | 測序 |