_Summary and Peer-review_

### A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis ([Dillies et al. 2013](http://bib.oxfordjournals.org/content/14/6/671))

_Trang Tran_

7 normalization methods were evaluated

\begin{align} \DeclareMathOperator*{\quantile}{quantile} \DeclareMathOperator*{\median}{median} \end{align}
\begin{align} \text{Total count (TC) } a_{ij} &= \frac{c_{ij}}{\sum\limits_{g=1}^{N} c_{gj}} \end{align}
\begin{align} \text{Upper Quartile (UQ) } a_{ij} &= \frac{c_{ij}}{\quantile\limits_{g:\, c_{gj} > 0}^{0.75} c_{gj}} \end{align}
\begin{align} \text{Median (M) } a_{ij} &= \frac{c_{ij}}{\median\limits_{g:\, c_{gj} > 0} c_{gj}} \end{align}
\begin{align} \text{DESeq } a_{ij} &= \frac{c_{ij}}{\median\limits_{g = 1}^{N} \frac{c_{gj}}{\left( \prod\limits_{k = 1}^{m} c_{gk} \right)^{1/m}}} \end{align}
\begin{align} \text{TMM } a_{ij} &= \end{align}
\begin{align} \text{Quantile: Not explained in the paper} \end{align}
\begin{align} \text{RPKM } a_{ij} &= \frac{c_{ij}}{\ell_i \cdot M \cdot 10^{-9}} \end{align}
\begin{align} c_{ij} &: \text{read count of gene } i \text{ in sample } j \\ \ell_i &: \text{length of gene } i \text{ in base pairs} \\ N &: \text{number of genes} \\ m &: \text{number of samples} \\ M &: \text{number of mapped reads in sample } j \end{align}
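The scaling methods above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in the paper: the function names are hypothetical, `counts` is assumed to be a genes × samples array, the DESeq size factor follows the geometric-mean definition of Anders & Huber (genes with a zero count in any sample are skipped), and the number of mapped reads M is approximated by the column sum of the count table.

```python
import numpy as np

def total_count(counts):
    """TC: divide each count by its sample's library size (column sum)."""
    return counts / counts.sum(axis=0, keepdims=True)

def upper_quartile(counts):
    """UQ: divide by the 75th percentile of each sample's non-zero counts."""
    nonzero = np.where(counts > 0, counts, np.nan)
    return counts / np.nanpercentile(nonzero, 75, axis=0, keepdims=True)

def median_norm(counts):
    """Median: divide by the median of each sample's non-zero counts."""
    nonzero = np.where(counts > 0, counts, np.nan)
    return counts / np.nanmedian(nonzero, axis=0, keepdims=True)

def deseq_size_factors(counts):
    """Per-sample median of count / per-gene geometric mean across samples."""
    positive = (counts > 0).all(axis=1)          # skip genes with any zero
    log_counts = np.log(counts[positive])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

def deseq(counts):
    """DESeq: divide each sample by its size factor."""
    return counts / deseq_size_factors(counts)

def rpkm(counts, gene_lengths):
    """RPKM: reads per kilobase of gene per million mapped reads."""
    mapped = counts.sum(axis=0, keepdims=True)   # assumption: M = column sum
    return counts / (gene_lengths[:, None] * mapped * 1e-9)
```

TMM and quantile normalization are left out because their formulas are not given on this slide.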

On 2 types of data

Real data

Simulation data

Real data sets

Table 1: SR = single-end read, PE = paired-end read, D = directional, ND = non-directional

| Organism | Type | Number of genes | Replicates per condition | Minimum library size | Maximum library size | Correlation between replicates | Correlation between conditions | % Reads associated with the most expressed gene | Library type | Sequencing machine |
|---|---|---|---|---|---|---|---|---|---|---|
| H. sapiens | RNA | 26 437 | {3, 3} | 2.0 × 10⁷ | 2.8 × 10⁷ | (0.98, 0.99) | (0.93, 0.96) | ≈1% | SR 54, ND | GAIIx |
| A. fumigatus | RNA | 9248 | {2, 2} | 8.6 × 10⁶ | 2.9 × 10⁷ | (0.92, 0.94) | (0.88, 0.94) | ≈1% | SR 50, D | HiSeq2000 |
| E. histolytica | RNA | 5277 | {3, 3} | 2.1 × 10⁷ | 3.3 × 10⁷ | (0.85, 0.92) | (0.81, 0.98) | 6.4–16.2% | PE 100, ND | HiSeq2000 |
| M. musculus | miRNA | 669 | {3, 2, 2} | 2.0 × 10⁶ | 5.9 × 10⁶ | (0.95, 0.99) | (0.09, 0.75) | 17.4–51.1% | SR 36, D | GAIIx |

Normalization methods on real data are compared in terms of

A. Distribution of normalized expression

B. Intra-group variance

C. Differential expression analysis

z. House-keeping gene variance

A. Distribution of normalized expression

Each box corresponds to one sample.

| Organism | Type | Number of genes |
|---|---|---|
| H. sapiens | RNA | 26 437 |
| A. fumigatus | RNA | 9248 |
| E. histolytica | RNA | 5277 |
| M. musculus | miRNA | 669 |

B. Intra-group variance
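The slide does not say how intra-group variance is quantified. One common choice, assumed here, is the per-gene coefficient of variation among replicates of the same condition. The sketch below is illustrative only; `intra_group_cv` and its arguments are hypothetical names.

```python
import numpy as np

def intra_group_cv(normalized, groups):
    """Per-gene coefficient of variation within each condition.

    normalized: (genes, samples) array of normalized expression.
    groups: one condition label per sample (column).
    Returns a dict mapping each label to an array of per-gene CVs,
    with NaN for genes whose mean within the condition is zero.
    """
    groups = np.asarray(groups)
    result = {}
    for label in np.unique(groups):
        block = normalized[:, groups == label]
        mean = block.mean(axis=1)
        sd = block.std(axis=1, ddof=1)           # sample standard deviation
        with np.errstate(divide="ignore", invalid="ignore"):
            result[label.item()] = np.where(mean > 0, sd / mean, np.nan)
    return result
```

A normalization method that works well should lower these values relative to the raw counts, since replicates of one condition are expected to agree.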

C. Effects on Differential Analysis

Log-transformed ratio of expression levels between 2 conditions (y-axis) vs. their average (x-axis). Values well above or below zero indicate differentially expressed genes.

Genes identified as differentially expressed by all methods are colored grey.
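The two axes of such a plot, and the set of genes called by every method, can be computed as in the sketch below. It is only a guess at the construction: the function names, the use of base-2 logarithms, and the pseudocount of 1 (to avoid taking the log of zero) are assumptions, not details from the paper.

```python
import numpy as np

def ma_values(cond_a, cond_b, pseudocount=1.0):
    """Return (log ratio, average log expression) per gene.

    cond_a, cond_b: (genes, replicates) arrays of normalized expression.
    """
    log_a = np.log2(cond_a.mean(axis=1) + pseudocount)
    log_b = np.log2(cond_b.mean(axis=1) + pseudocount)
    return log_a - log_b, 0.5 * (log_a + log_b)

def consensus_calls(calls_by_method):
    """Genes called differentially expressed by every method.

    calls_by_method: dict mapping a method name to a boolean vector.
    """
    return np.logical_and.reduce(list(calls_by_method.values()))
```

Genes for which `consensus_calls` returns `True` would be the ones drawn in grey.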

Both DESeq and TSPM require a table of count data as input.

The outcome of a DEG analysis also depends on the compatibility of the normalization method used for input and the DE method.

z. Housekeeping genes variance

30 housekeeping genes are selected from the list identified by Eisenberg & Levanon (2003).

However, the authors rightly note the caveats of using housekeeping genes as an internal reference in RNA-seq normalization.

Housekeeping genes are inappropriate for use as internal reference in RNA-seq

  • They are often affected by various factors that are not controlled
  • They are usually highly expressed thus not representing genes of low intensities
  • HKG are usually a very small subset, so fluctuations in their intensities are highly affected by random or systematic errors
  • This use requires the a priori knowledge of the housekeeping genes

Performance on simulation data was evaluated by

False positive rate

Power (Recall)
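With simulated data the true status of each gene is known, so both metrics follow directly from the calls. A minimal sketch, with hypothetical names:

```python
import numpy as np

def fpr_and_power(called_de, truly_de):
    """False positive rate and power of a set of differential-expression calls.

    called_de: boolean vector, True where a gene was called DE.
    truly_de: boolean vector, True where a gene was simulated as DE.
    """
    called_de = np.asarray(called_de, dtype=bool)
    truly_de = np.asarray(truly_de, dtype=bool)
    false_pos = np.sum(called_de & ~truly_de)
    true_pos = np.sum(called_de & truly_de)
    n_null = np.sum(~truly_de)                   # genes with no true change
    n_de = np.sum(truly_de)                      # genes with a true change
    fpr = false_pos / n_null if n_null else float("nan")
    power = true_pos / n_de if n_de else float("nan")
    return fpr, power
```

The false positive rate is the fraction of unchanged genes that were called, and power is the fraction of truly changed genes that were recovered.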

Simulation settings

  1. Equal library size, no dominant gene
  2. Non-equal library size, no dominant gene
  3. Equal library size, some dominant genes
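The three settings can be mimicked with a toy generator such as the one below. This is not the authors' simulation model: the Poisson sampling, the gamma-distributed baseline profile, the 100-fold inflation, and all parameter names are assumptions chosen only to make the settings concrete.

```python
import numpy as np

def simulate_counts(n_genes=1000, n_samples=4, library_sizes=None,
                    n_dominant=0, dominant_fold=100.0, seed=0):
    """Toy Poisson count table of shape (n_genes, n_samples).

    library_sizes: expected total count per sample (default 1e6 each).
    n_dominant: number of genes inflated by dominant_fold in the second
        half of the samples, mimicking a few genes that dominate a library.
    """
    rng = np.random.default_rng(seed)
    base = rng.gamma(shape=1.0, scale=1.0, size=n_genes)   # hypothetical profile
    profiles = np.tile(base[:, None], (1, n_samples))
    if n_dominant:
        profiles[:n_dominant, n_samples // 2:] *= dominant_fold
    profiles /= profiles.sum(axis=0, keepdims=True)        # proportions per sample
    if library_sizes is None:
        library_sizes = np.full(n_samples, 1_000_000)
    expected = profiles * np.asarray(library_sizes, dtype=float)[None, :]
    return rng.poisson(expected)

setting_1 = simulate_counts()                                        # equal sizes
setting_2 = simulate_counts(library_sizes=[5e5, 1e6, 2e6, 4e6])      # unequal sizes
setting_3 = simulate_counts(n_dominant=5)                            # dominant genes
```

In the third setting the inflated genes absorb a large share of the fixed library, which pushes down the counts of all other genes in those samples even though their relative expression is unchanged.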

Consensus and discordance

Other note: GC-bias

Log-transformed counts vs GC-content.

Each line is one sample, with color-coded conditions (A,B,C).

It's not clear how this plot was generated. A guess would be that each data point is a transcript, the x-value is the GC composition of that transcript, and the y-value is its log-transformed accumulated read count. Each line of the plot might then have been fitted over the raw data, and the choice of regression method might have biased the curves toward an a priori model.

Assuming the fitting of those lines is not biased toward any pre-defined model, there seems to be a significant correlation between GC composition and read counts, in contrast to the authors' claim that no such bias was observed in the data. Although this relation might be neither linear nor monotonic, it seems to follow the same pattern within a species, with sample-specific parameters.
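One way to look at such a trend without committing to a regression model is to average the log counts within GC-content bins, one curve per sample. The sketch below is an assumption about how this could be checked, not a reconstruction of the authors' figure; the function name, the bin count, and the `log2(count + 1)` transform are illustrative choices.

```python
import numpy as np

def binned_gc_trend(gc_content, counts, n_bins=10):
    """Mean log2(count + 1) per GC-content bin for one sample.

    gc_content: GC fraction of each transcript, in [0, 1].
    counts: read count of each transcript in the sample.
    Returns (bin centers, bin means); empty bins are NaN.
    """
    gc_content = np.asarray(gc_content, dtype=float)
    log_counts = np.log2(np.asarray(counts, dtype=float) + 1.0)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin indices 0..n_bins-1.
    bin_index = np.digitize(gc_content, edges[1:-1])
    centers = 0.5 * (edges[:-1] + edges[1:])
    means = np.full(n_bins, np.nan)
    for b in range(n_bins):
        in_bin = bin_index == b
        if in_bin.any():
            means[b] = log_counts[in_bin].mean()
    return centers, means
```

Because binning imposes no functional form, a non-linear or non-monotonic relation would still show up, and curves from different samples can be overlaid to compare their shapes.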