BMS8110 Review (5): Lecture 5 - RNA-seq Data Analysis

RNA-seq Applications

Differential expression
Gene fusion
Alternative splicing
Novel transcribed regions
Allele-specific expression
RNA editing
Transcriptome for non-model organisms

Note: model organisms are species that biologists select for scientific study in order to reveal biological phenomena that follow general principles.

RNA-seq vs. microarray

RNA-seq can be used to characterize novel transcripts and splicing variants as well as to profile the expression levels of known transcripts (hybridization-based techniques, in contrast, are limited to detecting transcripts that correspond to known genomic sequences)
RNA-seq has higher resolution than whole-genome tiling array analysis
- In principle, RNA-seq can achieve single-base resolution, whereas the resolution of a tiling array depends on the density of probes
RNA-seq can apply the same experimental protocol to various purposes, whereas specialized arrays need to be designed in these cases
- Detecting single nucleotide polymorphisms (needs SNP array otherwise)
- Mapping exon junctions (needs junction array otherwise)
- Detecting gene fusions (needs gene fusion array otherwise)
Next-generation sequencing (NGS) technologies have often replaced microarrays as the tool of choice for genome analysis

RNA-seq and microarray agree fairly well only for genes with medium levels of expression

Challenges for RNA-seq: library construction

Unlike small RNAs (microRNAs (miRNAs), piwi-interacting RNAs (piRNAs), short interfering RNAs (siRNAs) and many others), which can be directly sequenced after adaptor ligation, larger RNA molecules must be fragmented into smaller pieces (200-500 bp) to be compatible with most deep-sequencing technologies.
Common fragmentation methods include RNA fragmentation (RNA hydrolysis or nebulization) and cDNA fragmentation (DNase I treatment or sonication)
Each of these methods creates a different bias in the outcome
PCR artefacts
- Many short reads that are identical to each other can be obtained from cDNA libraries that have been amplified. These could be a genuine reflection of abundant RNA species, or they could be PCR artefacts.
- Use replicates
Whether or not to prepare strand-specific libraries
- Strand-specific libraries are valuable for transcriptome annotation, especially for regions with overlapping transcription from opposite directions
- Strand-specific libraries are currently laborious to produce because they require many steps or direct RNA-RNA ligation, which is inefficient.

Why Quality Control

Sequence output:
- Reads + quality
Natural questions
- Is the quality of my sequenced data OK?
- If something is wrong can I fix it?
Problem: Huge files... How do they look?
Files are flat text files and big: tens of GB (hard even to browse)
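
To make "reads + quality" concrete, here is a minimal sketch of decoding one FASTQ record. It assumes the common Phred+33 encoding (quality score = ASCII code minus 33); the record and the function names are made up for illustration, and FastQC itself reports far more than this.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Hypothetical four-line FASTQ record: header, sequence, separator, quality.
record = ["@read1", "GATTACA", "+", "IIII5+!"]
scores = phred_scores(record[3])
mean_quality = sum(scores) / len(scores)
```

In this made-up read the quality drops toward the 3' end, which is the kind of pattern a per-base quality plot makes visible.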

FastQC

Genome assembly: Genome assembly is the process of converting short reads into a detailed set of sequences corresponding to the chromosome(s) of an organism.

Genome assembly: relevance

Genome assembly is needed when a genome is first sequenced. We can relate reads to chromosomes.
For the human genome, the assembly is "frozen" as a snapshot every few years. The current assembly is GRCh38.
For most human genome work we do not need to do "de novo" (from scratch) assembly. Instead we map reads to a reference genome, one that is already assembled.

Ready-To-Use reference sequences and annotations

The iGenomes are a collection of reference sequences and annotation files for commonly analyzed organisms.
The files have been downloaded from Ensembl, NCBI, or UCSC, and chromosome names have been changed to be simple and consistent with their download source.
Each iGenome is available as a compressed file that contains sequences and annotation files for a single genomic build of an organism.

Sequence alignment: Sequence alignment, also called mapping, is the process of matching reads to a pre-existing reference by sequence homology.

Bowtie: Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million reads per hour.

Sequence Alignment/Map format (SAM) and BAM

SAM is a common format having sequence reads and their alignment to a reference genome
BAM is the binary form of a SAM file
Aligned BAM files are available at repositories (Sequence Read Archive at NCBI, European Nucleotide Archive (ENA) at EMBL-EBI)
SAMTools is a software package commonly used to analyze SAM/BAM files.
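
As a rough illustration of what a SAM record holds, the sketch below splits one alignment line into the 11 mandatory tab-separated fields and decodes two bits of the FLAG field (0x4 = read unmapped, 0x10 = read on the reverse strand). The example line is invented, and in practice a dedicated tool such as SAMTools would be used rather than hand-written parsing.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Split one SAM alignment line into its 11 mandatory fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    # Decode two commonly used bits of the bitwise FLAG.
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["is_reverse"] = bool(record["FLAG"] & 0x10)
    return record


# Hypothetical alignment line, not taken from a real file.
example = "read1\t16\tchr1\t1000\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII"
aln = parse_sam_line(example)
```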

Integrative Genomics Viewer (IGV)

Visualize ChIP-Seq data: UCSC genome browser; IGV browser

Challenges of mapping RNA-Seq reads

Mapping to just the transcriptome misses unknown transcribed regions
Additionally, portions of intronic regions could also be included in reads, making it harder to map the reads to just the transcriptome
Using the entire reference genome makes it more difficult to deal with alternative splicing junctions
Unlike DNA-Seq, when mapping RNA-Seq reads back to reference genome, we need to pay attention to exon-exon junction reads.
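
Spliced aligners typically report an exon-exon junction read with an `N` (skipped region) operation in the SAM CIGAR string. The following sketch, with a hypothetical function name and invented coordinates, recovers the intron span implied by each `N`; it illustrates the idea only and is not how TopHat or STAR work internally.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def splice_junctions(pos, cigar):
    """Return (intron_start, intron_end), 1-based inclusive, for each N operation."""
    junctions = []
    ref_pos = pos  # 1-based leftmost reference position of the alignment
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((ref_pos, ref_pos + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref_pos += length
    return junctions


# A 100 bp read split across two exons separated by a 2000 bp intron.
spliced = splice_junctions(1000, "50M2000N50M")
# A read that lies entirely within one exon has no N operation.
unspliced = splice_junctions(1000, "100M")
```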

Mapping with Tophat

Mapping with STAR

STAR is reportedly 50 times faster at aligning than TopHat2, with better alignment precision and sensitivity
STAR: ultrafast universal RNA-seq aligner
Basic STAR workflow consists of 2 steps:
- Generating genome index files
- Mapping reads to the genome
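
The two steps above can be sketched as two STAR invocations. The helper functions below only build the argument lists (they do not run anything); all file paths are hypothetical placeholders, and the options shown are a small subset that should be checked against the STAR manual for the installed version. The `--sjdbOverhang` value is conventionally the read length minus 1.

```python
def star_index_command(genome_dir, fasta, gtf, threads=4, overhang=100):
    """Step 1: generate genome index files."""
    return ["STAR", "--runThreadN", str(threads),
            "--runMode", "genomeGenerate",
            "--genomeDir", genome_dir,
            "--genomeFastaFiles", fasta,
            "--sjdbGTFfile", gtf,
            "--sjdbOverhang", str(overhang)]


def star_map_command(genome_dir, fastq_files, out_prefix, threads=4):
    """Step 2: map reads to the indexed genome."""
    return ["STAR", "--runThreadN", str(threads),
            "--genomeDir", genome_dir,
            "--readFilesIn", *fastq_files,
            "--outSAMtype", "BAM", "SortedByCoordinate",
            "--outFileNamePrefix", out_prefix]


# Hypothetical paths, for illustration only.
index_cmd = star_index_command("star_index", "genome.fa", "genes.gtf")
map_cmd = star_map_command("star_index", ["s1_R1.fastq", "s1_R2.fastq"], "s1_")
```

Either list could be passed to `subprocess.run(cmd, check=True)` on a machine where STAR is installed.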

Expression quantification

FPKM / RPKM
- Cufflinks & Cuffdiff
Count data
- Summarize mapped reads at the CDS, gene or exon level
The number of reads mapped to a gene is roughly proportional to
- the length of the gene
- the total number of reads in the library
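
Because raw read counts scale with both gene length and library size, RPKM (reads per kilobase of transcript per million mapped reads) divides by both. A minimal sketch with invented numbers:

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# 500 reads on a 2 kb gene in a library of 10 million mapped reads.
short_gene = rpkm(500, 2_000, 10_000_000)
# A gene twice as long with twice the reads has the same RPKM.
long_gene = rpkm(1_000, 4_000, 10_000_000)
```

FPKM follows the same formula but counts fragments (read pairs) instead of individual reads.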

HTSeq-count: counting reads in 'features'
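
The idea of counting reads in features can be illustrated with a toy version of gene-level counting: a read is assigned to a gene only if it overlaps exons of exactly one gene, and is otherwise tallied as having no feature or as ambiguous. This is a simplified sketch in the spirit of HTSeq-count's "union" mode, not its actual implementation; the genes, coordinates and function name are hypothetical.

```python
from collections import Counter


def count_reads(reads, genes):
    """Count reads per gene; reads and exons are (start, end), inclusive."""
    counts = Counter()
    for read_start, read_end in reads:
        hits = {
            gene
            for gene, exons in genes.items()
            for exon_start, exon_end in exons
            if read_start <= exon_end and exon_start <= read_end
        }
        if len(hits) == 1:
            counts[hits.pop()] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts


# Two hypothetical genes whose exons overlap around positions 380-400.
genes = {"geneA": [(100, 200), (300, 400)], "geneB": [(380, 500)]}
reads = [(120, 170), (310, 360), (390, 440), (600, 650)]
counts = count_reads(reads, genes)
```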

Differential Gene Expression Analysis

Count-based methods(R packages):

DESeq: based on the negative binomial distribution
edgeR: uses an overdispersed Poisson model
baySeq: uses an empirical Bayes approach
TSPM: uses a two-stage Poisson model

Statistical distributions: Gaussian, Poisson, negative binomial
RNA-seq data fits a Negative Binomial (NB) distribution
But really, that's just saying that RNA-seq looks like "counts" data with more variation than just statistical fluctuations - it also has biological variation in it.
How do we know? Because, when you measure variance (per gene, between replicates), it's not equal to the mean, and it's not even a good linear fit.
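
The mean-variance check described above can be tried on a single gene. Under a Poisson model the variance equals the mean; a common negative binomial parameterization is Var = mu + alpha * mu^2, where alpha is the dispersion. The sketch below uses a simple method-of-moments estimate of alpha on invented replicate counts. This is an illustrative assumption, not the estimator used by DESeq or edgeR, which share information across genes.

```python
from statistics import mean, variance


def dispersion_estimate(counts):
    """Method-of-moments estimate of alpha in Var = mu + alpha * mu**2."""
    mu = mean(counts)
    var = variance(counts)  # sample variance (n - 1 denominator)
    if mu == 0:
        return 0.0
    return max(0.0, (var - mu) / mu ** 2)


# Hypothetical counts for one gene across four biological replicates.
replicates = [90, 110, 150, 50]
mu = mean(replicates)
var = variance(replicates)
alpha = dispersion_estimate(replicates)
```

Here the variance is far larger than the mean, so a Poisson model would understate the variability between replicates.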

RPKM/ FPKM -based methods

Cufflinks & Cuffdiff
Other differential analysis methods for microarray data
- t-test, limma, etc.

Quality Control of Experiments

How well do the replicates correlate with each other?
Does a PCA plot show that my samples group by genotype?
What fraction of transcripts are expressed > 1 RPKM?
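
Two of these checks are easy to sketch: the correlation between replicates (computed here on log2-transformed values with a pseudocount of 1, which is an assumption of this example) and the fraction of transcripts above 1 RPKM. Function names and numbers are illustrative only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def log_transform(values, pseudocount=1.0):
    """log2(value + pseudocount), so that zeros stay defined."""
    return [math.log2(v + pseudocount) for v in values]


def fraction_expressed(rpkm_values, threshold=1.0):
    """Fraction of transcripts with RPKM strictly above the threshold."""
    return sum(v > threshold for v in rpkm_values) / len(rpkm_values)


# Hypothetical RPKM values for the same five transcripts in two replicates.
rep1 = [0.0, 0.5, 2.0, 10.0, 150.0]
rep2 = [0.1, 0.4, 2.5, 9.0, 140.0]
r = pearson(log_transform(rep1), log_transform(rep2))
expressed = fraction_expressed(rep1)
```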

Hypergeometric test for overrepresentation - the basis for Gene Ontology analysis
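
As a worked example of the overrepresentation test: given N genes in total, K of which carry a GO term, and a list of n differentially expressed genes containing k annotated ones, the p-value is the hypergeometric probability of seeing k or more by chance. The sketch below uses only the standard library; the variable names and numbers are assumptions for illustration.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when drawing n genes from N, of which K are annotated."""
    total = comb(N, n)
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy numbers: 10 genes, 4 annotated, 3 selected, 2 of them annotated.
p = hypergeom_pvalue(N=10, K=4, n=3, k=2)
```

With real data, many GO terms are tested at once, so the resulting p-values would normally be corrected for multiple testing.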

Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles
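
The core of gene set enrichment analysis is a running-sum statistic over a ranked gene list. The sketch below implements only the unweighted (Kolmogorov-Smirnov-like) form: walking down the list, the sum steps up at genes in the set and down otherwise, and the enrichment score is the largest deviation from zero. The published method additionally weights hits by their correlation with the phenotype and assesses significance by permutation, both of which are omitted here. Gene names are placeholders.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for a ranked gene list."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Genes ranked from most up-regulated to most down-regulated (hypothetical).
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
es_top = enrichment_score(ranked, {"g1", "g2"})      # set at the top of the list
es_bottom = enrichment_score(ranked, {"g5", "g6"})   # set at the bottom
```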

wxw060709
