Chapter 2 Considerations of Experimental Design

Factors that should be considered in experimental design

Understanding how RNA is extracted and how RNA-Seq libraries are prepared is very helpful when designing an RNA-Seq experiment. There are also some special considerations that can seriously affect the quality of a differential expression analysis and must be planned for.

These important considerations include:

  1. Number and type of replicates
  2. Avoiding confounding
  3. Handling batch effects

We will go through each consideration in detail and discuss best practices and optimal designs.

Replicates

Experimental replicates can be either technical replicates or biological replicates.

Image source:  Klaus B., EMBO J (2015) 34: 2727-2730

  • Technical replicates: use the same biological sample to repeat the technical or experimental steps, in order to accurately measure technical variation and remove it during analysis
  • Biological replicates: use different biological samples of the same condition to measure the biological variation between samples

In the era of microarrays, technical replicates were considered necessary. With current RNA-Seq technology, however, technical variation is much smaller than biological variation, so technical replicates are unnecessary.

In contrast, biological replicates are absolutely essential for differential expression analysis. For mice or rats, it may be easy to decide what constitutes a different biological sample, but it is much harder to decide this for cell lines. This article gives some very good suggestions for cell line replicates.

For differential expression analysis, the more biological replicates we have, the better our estimate of biological variation and the more precise our estimate of the mean expression level. This makes our modeling of the data more accurate and lets us identify more differentially expressed genes.
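As a rough illustration of why more replicates give a more precise mean, here is a minimal sketch. It assumes independent replicates that share a common standard deviation, in which case the standard error of the mean is the standard deviation divided by the square root of the number of replicates. The function name `standard_error` and the example value are hypothetical; real RNA-Seq tools estimate dispersion per gene with more elaborate count models.

```python
import math


def standard_error(sd, n_replicates):
    """Standard error of the mean for n independent replicates with a common sd."""
    if n_replicates < 1:
        raise ValueError("at least one replicate is required")
    return sd / math.sqrt(n_replicates)


# Illustrative value only: biological standard deviation of 1.0 (arbitrary units).
for n in (2, 3, 4, 8, 16):
    print(n, round(standard_error(1.0, n), 3))
```

Under these assumptions, halving the uncertainty of the mean takes four times as many replicates.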

Image source:  Liu, Y., et al., Bioinformatics (2014) 30(3): 301–304

As the figure above shows, biological replicates matter more than sequencing depth, which is the total number of reads sequenced per sample. The figure shows how sequencing depth and the number of biological replicates affect the number of differentially expressed genes identified. Note that increasing the number of replicates tends to return more differentially expressed genes than increasing the sequencing depth. Therefore, more replicates are generally better than higher sequencing depth. Higher depth is only needed to detect lowly expressed differentially expressed genes and to perform isoform-level differential expression analysis.

Sample pooling: If possible, avoid pooling individuals or experiments. If pooling is absolutely necessary, each pooled set of samples counts as a single replicate. To ensure a similar amount of variation between replicates, pool the same number of individuals for each pooled set.
For example, if you need at least 3 individuals to get enough material for a control replicate and at least 5 individuals to get enough material for a treatment replicate, pool 5 individuals for the control and 5 individuals for the treatment condition. Also make sure that the individuals pooled in the two conditions are similar in sex, age, etc.
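The pooling rule above can be sketched in a few lines. This is only an illustration, and the function names `pool_size` and `individuals_required` are hypothetical: the pool size is taken as the largest per-condition minimum, and because each pool counts as a single replicate, the number of individuals needed grows with the number of pooled replicates.

```python
def pool_size(min_individuals_needed):
    """Return one common pool size: the largest per-condition minimum.

    min_individuals_needed maps condition name -> minimum number of
    individuals needed to obtain enough material for one pooled replicate.
    """
    return max(min_individuals_needed.values())


def individuals_required(min_individuals_needed, n_replicates):
    """Individuals needed per condition when each pool counts as ONE replicate."""
    size = pool_size(min_individuals_needed)
    return {condition: size * n_replicates for condition in min_individuals_needed}


needs = {"control": 3, "treatment": 5}
print(pool_size(needs))                # 5 individuals per pool in both conditions
print(individuals_required(needs, 3))  # 15 individuals per condition for 3 pooled replicates
```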

For bulk RNA-Seq, more replicates almost always take precedence over higher sequencing depth. However, the guidelines depend on the experiment performed and the analysis required. Below are some general guidelines on replicates and sequencing depth to help with experimental design:

  • General gene-level differential expression:
    1. ENCODE guidelines recommend 30 million single-end reads per sample
    2. 15 million reads per sample is usually enough if you have enough replicates (>3)
    3. If possible, spend money on more biological replicates
    4. Generally recommended read length: >= 50 bp
  • Gene-level differential expression with detection of lowly expressed genes:
    1. Again, more replicates are better than greater sequencing depth
    2. Sequence deeper, with at least 30-60 million reads per sample depending on the expression level (start at 30 million if you have enough replicates)
    3. Generally recommended read length: >= 50 bp
  • Isoform-level differential expression:
    1. For known isoforms, paired-end sequencing at a depth of at least 30 million reads per sample is recommended
    2. Novel isoforms require greater depth (>60 million reads per sample)
    3. Prioritize biological replicates over paired-end or deeper sequencing
    4. Generally recommended read length: >= 50 bp, but longer is better because reads are more likely to span exon junctions
    5. Perform careful quality control of the RNA. Use high-quality library preparation methods and restrict the analysis to samples with a high RIN
  • Other types of RNA analysis (intron retention, small RNA-Seq, etc.):
    1. Recommendations differ depending on the analysis
    2. Almost always, more biological replicates are better!

Note: The metric used to estimate sequencing depth for genomes is "coverage": how many times the sequenced nucleotides "cover" the genome. This metric is not exact for genomes (whole-genome sequencing), but it is good enough and widely used. It does not work for transcriptomes, however, because even if you know what percentage of the genome is transcriptionally active, gene expression is highly variable.
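For reference, genome coverage as described in the note is usually approximated as the number of sequenced bases divided by the genome size. A minimal sketch follows; the function name `genome_coverage` is hypothetical and the genome size of 3 billion bp is a rounded, human-sized value used only as an example.

```python
def genome_coverage(n_reads, read_length_bp, genome_size_bp):
    """Approximate average coverage: total sequenced bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return n_reads * read_length_bp / genome_size_bp


# Assumed example: 30 million single-end 50 bp reads on a ~3 Gbp genome.
print(genome_coverage(30_000_000, 50, 3_000_000_000))
```

The same arithmetic says little about a transcriptome, because reads are distributed across genes in proportion to expression rather than evenly across a fixed length.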

Confounding

A confounded RNA-Seq experiment is one in which you cannot distinguish the separate effects of two different sources of variation in the data.

For example, we know that sex has a large effect on gene expression. If all of our control mice are female and all of our treatment mice are male, then the treatment effect is confounded by sex: we cannot tell the effect of treatment apart from the effect of sex.

To avoid confounding:

  • If possible, make sure the animals in each condition are all of the same sex, age and batch
  • If that is not possible, make sure the animals are split evenly between conditions (for example, the same number of males and females in each condition)
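One way to check a planned design for this problem is to cross-tabulate the condition against each covariate in the sample metadata. The sketch below is an illustration only: the metadata layout (a list of dicts with `condition` and `sex` keys) and the function names are assumptions, and the check only flags the extreme case where every condition contains a single covariate level.

```python
from collections import Counter, defaultdict


def cross_tabulate(samples, condition_key, covariate_key):
    """Count samples per (condition, covariate level)."""
    table = defaultdict(Counter)
    for sample in samples:
        table[sample[condition_key]][sample[covariate_key]] += 1
    return {condition: dict(counts) for condition, counts in table.items()}


def is_fully_confounded(table):
    """True if each condition has one covariate level and the levels differ overall."""
    levels_per_condition = [set(counts) for counts in table.values()]
    all_levels = set().union(*levels_per_condition)
    return len(all_levels) > 1 and all(len(levels) == 1 for levels in levels_per_condition)


def is_balanced(table):
    """True if every condition has identical counts for every covariate level."""
    rows = list(table.values())
    return all(row == rows[0] for row in rows)


# Hypothetical design from the example above: female controls, male treated mice.
bad_design = (
    [{"condition": "control", "sex": "F"} for _ in range(3)]
    + [{"condition": "treatment", "sex": "M"} for _ in range(3)]
)
table = cross_tabulate(bad_design, "condition", "sex")
print(table, is_fully_confounded(table), is_balanced(table))
```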

Batch effects

Batch effects are an important issue in RNA-Seq analysis. A figure from Hicks SC, et al., bioRxiv (2015) illustrates the phenomenon well. The left panel depicts the experimental design, which makes good use of batches by including samples from both sample groups in each batch. The far-right panel shows an example PCA plot in which the samples separate by batch. It shows that the effect of batch on gene expression can be larger than the effect of the experimental variable, so batches must be taken into account both when designing the experiment and in the statistical model. We will discuss this issue in more detail below.

Image source:  Hicks SC, et al., bioRxiv (2015)

The problems that improperly handled batches cause in such study designs are discussed in detail in this article.

How do you know whether you have batches?

  • Were all RNA extractions performed on the same day?
  • Were all library preparations performed on the same day?
  • Did the same person perform the RNA extraction and library preparation for all samples?
  • Did you use the same reagents for all samples?
  • Did you perform the RNA extraction and library preparation in the same location?

If the answer to any of these questions is "no", then you have batches.
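If the processing details are recorded per sample, the checklist above can be run against the metadata. The following sketch assumes a hypothetical metadata layout (one dict per sample) and hypothetical column names; any processing column with more than one distinct value corresponds to a "no" answer and therefore defines a batch variable.

```python
def find_batch_variables(samples, processing_keys):
    """Return the processing columns that take more than one value across samples."""
    varying = {}
    for key in processing_keys:
        values = sorted({sample[key] for sample in samples})
        if len(values) > 1:
            varying[key] = values
    return varying


# Hypothetical metadata: column names and values are illustrative only.
metadata = [
    {"sample": "ctrl_1", "extraction_date": "2021-01-11", "operator": "A", "reagent_lot": "L1"},
    {"sample": "ctrl_2", "extraction_date": "2021-01-12", "operator": "A", "reagent_lot": "L1"},
    {"sample": "trt_1", "extraction_date": "2021-01-11", "operator": "A", "reagent_lot": "L1"},
    {"sample": "trt_2", "extraction_date": "2021-01-12", "operator": "A", "reagent_lot": "L1"},
]
print(find_batch_variables(metadata, ["extraction_date", "operator", "reagent_lot"]))
```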

Best practices for handling batches

  • Design the experiment to avoid batches as far as possible
  • If batches cannot be avoided:
    • Do not confound your experiment by batch, that is, do not process all the samples of one condition in a single batch.

    • Split the replicates of the different sample groups across batches. If you want to find differentially expressed genes between conditions or draw conclusions at the population level, the more replicates the better (definitely more than 2).

  • Include the batch information in your experimental metadata. With this information, the variation due to batch can be accounted for during the analysis, so that it does not distort the final results.
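The batch recommendations above can be sketched as a small assignment routine. This is an illustration under assumptions, not a prescribed procedure: the metadata layout, the `batch` column name and the function name `assign_batches` are hypothetical. Replicates of each condition are shuffled and then dealt out across batches in turn, so every batch contains samples from every condition, and the batch is written back into the metadata.

```python
import random
from collections import defaultdict


def assign_batches(samples, n_batches, condition_key="condition", seed=0):
    """Spread the replicates of each condition evenly across batches.

    Returns new metadata rows with a 1-based 'batch' column added.
    """
    rng = random.Random(seed)
    by_condition = defaultdict(list)
    for sample in samples:
        by_condition[sample[condition_key]].append(sample)

    assigned = []
    for condition in sorted(by_condition):
        group = by_condition[condition][:]
        rng.shuffle(group)  # randomize which replicate lands in which batch
        for index, sample in enumerate(group):
            assigned.append({**sample, "batch": index % n_batches + 1})
    return assigned


# Hypothetical design: 4 control and 4 treatment replicates, processed in 2 batches.
design = [
    {"sample": f"{condition}_{i}", "condition": condition}
    for condition in ("control", "treatment")
    for i in range(1, 5)
]
for row in assign_batches(design, n_batches=2):
    print(row)
```

With the batch recorded for every sample, it can later be included as a term in the statistical model.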

Origin blog.csdn.net/u010608296/article/details/112859330