PCR duplicates in NGS - II

05-12

接上篇（PCR duplicates in NGS - I）

這一次的資料給出了詳細的計算公式，並且有R做模擬驗證的code。

在我實際處理NGS data時候，從測序數據中自己寫腳本remove PCR duplicates，常常是根據read的坐標。如果是single end read，就看5 端read map 到reference上的位置，map到一模一樣位置的reads集合會隨機保留一條read信息，丟掉其他的。如果是paired end reads，看一對reads map到reference後，兩個5』端坐標是不是完全一致，完全一致的那些read pairs，隨機保留一對，其他的扔掉。

下面介紹的資料是用類似的思路，直接從fragment在基因組的坐標分布來計算duplication rate。

原文：Expected number of read duplicates

先直接來看公式，

設G為genome size，N為read 總數目，

For single end reads，

PCR duplicate rate = N - (sum(dpois(1:n, N/G/2) * G) * 2)

dpois(1:n, N/G/2) 是基因組上某一個位置，被1，2，... ，n條reads covered到的概率。
dpois(1:n, N/G/2) * G 是基因組上，有多少位置被1條read cover到，多少個位置被2條reads cover到...
sum(dpois(1:n, N/G/2) * G) 是基因組上總共有多少個位置被至少一條read讀到。

For paired end reads，

如果打碎基因組DNA後，得到的fragment最小長度為a，最大為b，fragment長度服從[a,b]見的均勻分布，那麼考慮到不同size的fragments，我們需要「延長」基因組：

G2<- G + G * (b - a)
PCR duplicate rate = N - (sum(dpois(1:n, N / G2 / 2) * G2) * 2

如果fragment長度服從類似泊松分布怎麼辦呢？下面通過模擬來估算duplicate rate。

library(data.table)G<- 10000 #genome sizeN<- 20000 #total read counta<- 1 # Min fragment lengthb<- 10 # Max fragment lengthL<- 15 # Mean fragment length, if assume uniform fragment size distributionn<- 10 # A number sufficiently large (ideally infinite)set.seed(1)# generate 5 coordinate of one mapped readpos_fp1<- sample(1:G, size= N, replace= TRUE)set.seed(2)# pos_fp2<- pos_fp1 + rpois(n= N, lambda= L)# generate 5 coordinate of the other read of the same fragmentpos_fp2<- pos_fp1 + sample(a:b, size= N, replace= TRUE)pos_fp2<- ifelse(pos_fp2 > G, G, pos_fp2)set.seed(3)# generate + or - strand for each readstrand<- sample(c(-, +), size= N, replace= TRUE)# make "reads", from small coord to largereads<- data.table(pos_fp1, pos_fp2, strand)[order(pos_fp1, pos_fp2),] ## Observed number of reads for SE:## ================================# for each 5 coordinate, count the number of reads as depth,# make "depth_se", from small depth to largedepth_se<- reads[, list(depth= .N), by= list(pos_fp1, strand)][order(depth)]dups<- table(depth_se$depth)dups<- data.table(depth= as.numeric(names(dups)), n_pos=as.vector(dups))dups$nreads<- dups$depth * dups$n_posdups$ndups<- (dups$depth - 1) * dups$n_possum(dups$ndups)7373## Analytical way SE## =================N - (sum(dpois(1:n, N/G/2) * G) * 2)7357.589## Observed number of reads for PE:## ================================# for each pair of 5 coordinates, count the number of read pairs as depth,depth_pe<- reads[, list(depth= .N), by= list(pos_fp1, pos_fp2, strand)][order(depth)]N - nrow(depth_pe)972## Analytical way PE## =================G2<- G + G * (b - a)N - (sum(dpois(1:n, N / G2 / 2) * G2) * 2)967.4836