Differential Gene Expression, Part 2: What to Do for Between-Sample Comparisons

05-14

Reposted from: hbc/knowledgebase

Introduction

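When I read the post What the FPKM? A review of RNA-Seq expression units, it specifically noted that RPKM, FPKM and TPM are not suited to between-sample analyses for detecting DE genes (differentially expressed genes). So how should transcripts from different treatment groups be compared to find differentially expressed genes?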

Later I came across a post on GitHub that explains in detail how the DESeq2 package performs DE gene analysis. I am reposting it here for reference.

Count normalization methods

marypiper edited this page on 22 Apr 2017

https://github.com/hbc/knowledgebase/wiki/Count-normalization-methods?

Normalization

The first step in the workflow is count normalization, which is necessary to make accurate comparisons of gene expression between samples. The raw counts, or number of reads aligning to each gene, need to be normalized to account for differences in library depth and composition between samples when performing differential expression analyses.

While normalization is necessary for differential expression analyses, it is also necessary whenever exploring or comparing counts between or within samples.

Different types of normalization methods exist, and a few of the most common methods include:

normalization for library size: necessary for comparison of expression of the same gene between samples
normalization for gene length: necessary for comparison of expression of different genes within the same sample
normalization for RNA composition: recommended for comparison of expression between samples (particularly important when performing differential expression analyses)

"A few highly and differentially expressed genes may have strong influence on the total read count, causing the ratio of total read counts not to be a good estimate for the ratio of expected counts (for all genes)"[1]
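The effect described in the quote can be sketched with a toy example. The counts below are hypothetical: four genes are unchanged between samples and one gene (g5) is strongly up in sampleB. Scaling by total reads alone makes the unchanged genes look lower in sampleB.

```r
# Hypothetical counts: g1..g4 are unchanged, g5 is strongly up in sampleB.
counts <- cbind(
  sampleA = c(g1 = 100, g2 = 200, g3 = 300, g4 = 400, g5 = 1000),
  sampleB = c(g1 = 100, g2 = 200, g3 = 300, g4 = 400, g5 = 9000)
)

# Library-size scaling only (CPM): divide each column by its total reads.
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

# Apparent fold difference for g1, even though its raw count is identical.
g1_ratio <- cpm["g1", "sampleA"] / cpm["g1", "sampleB"]
g1_ratio
```

Here g1 has the same raw count in both samples, yet its CPM differs between them because g5 takes up most of sampleB's reads.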

Common normalization measures

Several common normalization measures exist to account for these differences:

CPM (counts per million): counts scaled by total number of reads. This measure accounts for sequencing depth only.
TPM (transcripts per kilobase million): counts per length of transcript (kb) per million reads mapped. This measure accounts for both sequencing depth and gene length.
RPKM/FPKM (reads/fragments per kilobase of exon per million reads/fragments mapped): similar to TPM in that this measure accounts for both sequencing depth and gene length; however, it is not recommended.
Tool-specific metrics for normalization:

DESeq2 uses a median of ratios method, which accounts for sequencing depth and RNA composition [1].
edgeR uses a trimmed mean of M values (TMM) method that accounts for sequencing depth and RNA composition [2].
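As a rough illustration of how CPM, RPKM and TPM differ, here is a minimal sketch for a single sample. The gene names, counts and lengths are made up for the example, and the formulas are the textbook definitions rather than any particular tool's implementation.

```r
# Hypothetical raw counts and gene lengths (in kb) for one sample.
counts    <- c(geneA = 10, geneB = 20, geneC = 30)
length_kb <- c(geneA = 2,  geneB = 4,  geneC = 1)

# CPM: scale counts by the library size, expressed per million reads.
cpm <- counts / sum(counts) * 1e6

# RPKM: normalize for depth first, then divide by gene length in kb.
rpkm <- cpm / length_kb

# TPM: normalize for length first, then rescale so the values sum to one million.
rate <- counts / length_kb
tpm  <- rate / sum(rate) * 1e6

round(rbind(cpm, rpkm, tpm), 1)
```

The only difference between RPKM and TPM is the order of the two normalization steps, which is why TPM values always add up to one million while RPKM values do not.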

RPKM/FPKM (not recommended)

While TPM and RPKM/FPKM normalization methods both account for sequencing depth and gene length, RPKM/FPKM measures are not recommended. The reason is that the normalized count values output by the RPKM/FPKM method are not comparable between samples.

Using RPKM/FPKM normalization, the total number of RPKM/FPKM normalized counts for each sample will be different. Therefore, you cannot compare the normalized counts for each gene equally between samples.

RPKM-normalized counts table

SampleA has a greater proportion of counts associated with MOV10 (5.5/1,000,000) than does sampleB (5.5/1,500,000), even though the RPKM count values are the same. Therefore, we cannot directly compare the counts for MOV10 (or any other gene) between sampleA and sampleB, because the total number of normalized counts is different between samples.
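The MOV10 comparison can be checked with a couple of lines of arithmetic. The values are the ones quoted in the paragraph above; the variable names are only for illustration.

```r
# RPKM value for MOV10 and per-sample RPKM totals, as quoted in the text.
mov10_rpkm <- c(sampleA = 5.5, sampleB = 5.5)
total_rpkm <- c(sampleA = 1e6, sampleB = 1.5e6)

# Share of each sample's total normalized signal attributed to MOV10.
proportion <- mov10_rpkm / total_rpkm
proportion

# How much larger the share is in sampleA than in sampleB.
share_ratio <- proportion["sampleA"] / proportion["sampleB"]
share_ratio
```

The same RPKM value therefore corresponds to a share that is 1.5 times larger in sampleA than in sampleB.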

TPM (recommended)

In contrast to RPKM/FPKM, TPM-normalized counts normalize for both sequencing depth and gene length, but have the same total TPM-normalized counts per sample. Therefore, the normalized count values are comparable both between and within samples.
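To see the difference in per-sample totals, here is a small sketch with two hypothetical samples (all counts and gene lengths are invented for the example). It computes RPKM and TPM from the same inputs and compares the column sums.

```r
# Hypothetical raw counts (genes in rows, samples in columns) and gene lengths in kb.
counts <- cbind(
  sampleA = c(g1 = 10, g2 = 20, g3 = 30),
  sampleB = c(g1 = 12, g2 = 25, g3 = 90)
)
length_kb <- c(g1 = 2, g2 = 4, g3 = 1)

# RPKM: divide each column by its library size in millions, then each row by gene length.
rpkm <- sweep(counts, 2, colSums(counts) / 1e6, "/") / length_kb

# TPM: divide each row by gene length, then rescale each column to sum to one million.
rate <- counts / length_kb
tpm  <- sweep(rate, 2, colSums(rate), "/") * 1e6

colSums(rpkm)  # totals differ between samples
colSums(tpm)   # totals are one million in every sample
```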

NOTE: This video by StatQuest shows in more detail why TPM should be used in place of RPKM/FPKM if needing to normalize for sequencing depth and gene length.

DESeq2-normalized counts - Median of ratios method

Since tools for differential expression analysis are comparing the counts between sample groups for the same gene, gene length does not need to be accounted for by the tool. However, sequencing depth and RNA composition do need to be taken into account.

To normalize for sequencing depth and RNA composition, DESeq2 uses the median of ratios method, which performs the following steps when you run the tool:

Step 1: creates a pseudo-reference sample (row-wise geometric mean)

For each gene, a pseudo-reference sample is created that is equal to the geometric mean across all samples.
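A minimal sketch of Step 1 in base R, assuming a small matrix of made-up counts with genes in rows and samples in columns:

```r
# Hypothetical raw counts (genes in rows, samples in columns).
counts <- cbind(
  sampleA = c(g1 = 1489, g2 = 22, g3 = 793),
  sampleB = c(g1 = 906,  g2 = 13, g3 = 410)
)

# Row-wise geometric mean, computed on the log scale.
pseudo_ref <- exp(rowMeans(log(counts)))
pseudo_ref
```

Note that a gene with a zero count in any sample gets a geometric mean of zero under this formula, so such genes cannot contribute a usable ratio in the following steps.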

Step 2: calculates ratio of each sample to the reference

For every gene in a sample, the ratios (sample/ref) are calculated (as shown below). This is performed for each sample in the dataset. Since the majority of genes are not differentially expressed, the majority of genes in each sample should have similar ratios within the sample.
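Step 2 can be sketched in the same way. The counts are again hypothetical, and the pseudo-reference is recomputed so the snippet runs on its own.

```r
# Hypothetical raw counts (genes in rows, samples in columns).
counts <- cbind(
  sampleA = c(g1 = 1489, g2 = 22, g3 = 793),
  sampleB = c(g1 = 906,  g2 = 13, g3 = 410)
)

# Pseudo-reference from Step 1: row-wise geometric mean.
pseudo_ref <- exp(rowMeans(log(counts)))

# Ratio of every count to its gene's pseudo-reference (division is applied row by row).
ratios <- counts / pseudo_ref
round(ratios, 2)
```

In this toy matrix the ratios within each column come out close to one another, which is the pattern expected when most genes are not differentially expressed.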

Step 3: takes each sample's median value as that sample's normalization factor

The median value of all ratios for a single sample is taken as the normalization factor (size factor) for that sample, as calculated below. Notice that the differentially expressed genes should not affect the median value:

normalization_factor_sampleA <- median(c(0.59, 1.28, 1.3, 1.35, 1.39))

normalization_factor_sampleB <- median(c(0.72, 0.74, 0.77, 0.78, 1.35))

The figure below illustrates the median value for the distribution of all gene ratios for a single sample (frequency is on the y-axis).

The median of ratios method makes the assumption that not ALL genes are differentially expressed; therefore, the normalization factors should account for sequencing depth and RNA composition of the sample (large outlier genes will not represent the median ratio values). This method is robust to imbalance in up-/down-regulation and large numbers of differentially expressed genes.
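Steps 1 to 3 can be put together in a few lines of base R. This is a sketch under stated assumptions rather than DESeq2's own code: the helper name median_of_ratios is hypothetical, and the counts are invented so that sampleB is sequenced twice as deeply as sampleA while one gene (g5) is strongly up-regulated.

```r
# Hypothetical helper: median-of-ratios size factors for a count matrix
# (genes in rows, samples in columns).
median_of_ratios <- function(counts) {
  log_ref <- rowMeans(log(counts))   # log of the row-wise geometric mean
  keep <- is.finite(log_ref)         # drop genes with a zero in any sample
  apply(counts, 2, function(col) {
    exp(median((log(col) - log_ref)[keep]))
  })
}

# g1..g4 differ only by sequencing depth (2x); g5 is a large outlier in sampleB.
counts <- cbind(
  sampleA = c(g1 = 100, g2 = 200, g3 = 300, g4 = 400, g5 = 50),
  sampleB = c(g1 = 200, g2 = 400, g3 = 600, g4 = 800, g5 = 5000)
)

size_factors <- median_of_ratios(counts)
size_factors
```

The two size factors differ by a factor of two, matching the depth difference, and the outlier gene g5 does not shift the median. In a real analysis the DESeq2 function estimateSizeFactors() computes these values for you.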

Step 4: divides each raw count value in a sample by that sample's normalization factor to generate normalized count values

For example, if median ratio for SampleA was 1.3 and the median ratio for SampleB was 0.77, you could calculate normalized counts as follows:

SampleA median ratio = 1.3

SampleB median ratio = 0.77

Raw Counts
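A minimal sketch of Step 4, using the two median ratios quoted above. The gene names and raw counts are hypothetical and serve only to show the division.

```r
# Median ratios (size factors) quoted in the text.
size_factors <- c(sampleA = 1.3, sampleB = 0.77)

# Hypothetical raw counts; gene names and values are illustrative only.
raw_counts <- cbind(
  sampleA = c(geneX = 1489, geneY = 22),
  sampleB = c(geneX = 906,  geneY = 13)
)

# Divide each sample's column by that sample's size factor.
normalized_counts <- sweep(raw_counts, 2, size_factors, "/")
round(normalized_counts, 2)
```

The normalized values are generally not whole numbers. They are meant for comparing and visualizing expression across samples, while DESeq2 itself takes the raw counts as input and applies the size factors internally.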

[1] Differential expression analysis for sequence count data

[2] A scaling normalization method for differential expression analysis of RNA-seq data

References:

1. Differential expression analysis for sequence count data

2. Index of /doc/pdf

3. hbc/knowledgebase

4. between samples and within samples

5. In RNA-Seq, 2 != 2: Between-sample normalization