BMS8110 Review (5): Lecture 5 - RNA-seq Data Analysis

RNA-seq Applications

  • Differential expression
  • Gene fusion
  • Alternative splicing
  • Novel transcribed regions
  • Allele-specific expression
  • RNA editing
  • Transcriptome for non-model organisms

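Model organisms: biologists study selected species in order to reveal biological phenomena that follow general rules; these selected species are called model organisms.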

RNA-seq vs. microarray

  • RNA-seq can be used to characterize novel transcripts and splicing variants as well as to profile the expression levels of known transcripts (hybridization-based techniques, by contrast, are limited to detecting transcripts that correspond to known genomic sequences)
  • RNA-seq has higher resolution than whole-genome tiling array analysis
    • In principle, RNA-seq can achieve single-base resolution, whereas the resolution of a tiling array depends on the density of probes
  • RNA-seq can apply the same experimental protocol to various purposes, whereas specialized arrays need to be designed in these cases
    • Detecting single nucleotide polymorphisms (needs SNP array otherwise)
    • Mapping exon junctions (needs junction array otherwise)
    • Detecting gene fusions (needs gene fusion array otherwise)
  • Next-generation sequencing (NGS) technologies have often replaced microarrays as the tool of choice for genome analysis

RNA-seq and microarray agree fairly well only for genes with medium levels of expression

Challenges for RNA-seq: library construction

  • Unlike small RNAs (microRNAs (miRNAs), piwi-interacting RNAs (piRNAs), short interfering RNAs (siRNAs) and many others), which can be directly sequenced after adaptor ligation, larger RNA molecules must be fragmented into smaller pieces (200-500 bp) to be compatible with most deep-sequencing technologies.
  • Common fragmentation methods include RNA fragmentation (RNA hydrolysis or nebulization) and cDNA fragmentation (DNase I treatment or sonication)
  • Each of these methods creates a different bias in the outcome
  • PCR artefacts
    • Many short reads that are identical to each other can be obtained from cDNA libraries that have been amplified. These could be a genuine reflection of abundant RNA species, or they could be PCR artefacts.
    • Use replicates
  • Whether or not to prepare strand-specific libraries
    • Strand-specific libraries are valuable for transcriptome annotation, especially for regions with overlapping transcription from opposite directions
    • Strand-specific libraries are currently laborious to produce because they require many steps or direct RNA-RNA ligation, which is inefficient.

Why Quality Control

  • Sequence output:
    • Reads + quality
  • Natural questions
    • Is the quality of my sequenced data OK?
    • If something is wrong can I fix it?
  • Problem: Huge files... How do they look?
  • Files are flat text files and large, often tens of GB (hard even to browse)
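
To make "reads + quality" concrete, here is a minimal sketch of reading such a file and decoding the quality string. It assumes the common four-line FASTQ layout and Phred+33 encoding, neither of which is stated in the lecture; the function names and the toy record are made up for illustration.

```python
from io import StringIO
from statistics import mean


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'


def phred_scores(qual, offset=33):
    """Decode a quality string into integer Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in qual]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Toy record, invented for illustration only.
example = StringIO("@read1\nACGT\n+\nII5!\n")
for name, seq, qual in read_fastq(example):
    scores = phred_scores(qual)
    print(name, seq, scores, mean(scores))
```

Because the reader is a generator, it processes one record at a time, which matters when the file is tens of GB and cannot be loaded into memory.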

FastQC

Genome assembly: Genome assembly is the process of converting short reads into a detailed set of sequences corresponding to the chromosome(s) of an organism.

Genome assembly: relevance

  • Genome assembly is needed when a genome is first sequenced, so that reads can be related to chromosomes.
  • For the human genome, the assembly is "frozen" as a snapshot every few years. The current assembly is GRCh38.
  • For most human genome work we do not need to do "de novo" (from scratch) assembly. Instead we map reads to a reference genome, one that is already assembled.

Ready-To-Use reference sequences and annotations

  • The iGenomes are a collection of reference sequences and annotation files for commonly analyzed organisms.
  • The files have been downloaded from Ensembl, NCBI, or UCSC, and chromosome names have been changed to be simple and consistent with their download source.
  • Each iGenome is available as a compressed file that contains sequences and annotation files for a single genomic build of an organism.

Sequence alignment: also called mapping, this is the process of matching reads to a pre-existing reference by sequence similarity.

Bowtie: Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million reads per hour.

Sequence Alignment/Map format (SAM) and BAM

  • SAM is a common text format that stores sequence reads together with their alignment to a reference genome
  • BAM is the binary form of a SAM file
  • Aligned BAM files are available at repositories (Sequence Read Archive at NCBI, European Nucleotide Archive (ENA) at EMBL-EBI)
  • SAMTools is a software package commonly used to analyze SAM/BAM files.
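
As a rough illustration of what a SAM alignment line contains, the sketch below splits one line into the eleven mandatory tab-separated fields defined by the SAM specification and checks two FLAG bits (0x4 for unmapped, 0x10 for reverse strand). The record type, helper names, and the example line are invented for illustration; real pipelines would normally use SAMtools or a library rather than hand parsing.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities


def parse_sam_line(line):
    """Parse the 11 mandatory fields of one alignment line (optional tags are ignored)."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def is_unmapped(rec):
    return bool(rec.flag & 0x4)


def is_reverse_strand(rec):
    return bool(rec.flag & 0x10)


# Made-up alignment line for illustration.
rec = parse_sam_line("read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII")
print(rec.rname, rec.pos, rec.cigar, is_unmapped(rec), is_reverse_strand(rec))
```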

Integrative Genomics Viewer (IGV)

Visualize ChIP-Seq data: UCSC genome browser; IGV browser

Challenges of mapping RNA-Seq reads

  • Mapping to just the transcriptome misses unknown transcribed regions
  • Additionally, portions of intronic regions may also be captured in reads, making it harder to map the reads to just the transcriptome
  • Using the entire reference genome makes it more difficult to deal with alternative splicing junctions
  • Unlike DNA-Seq, when mapping RNA-Seq reads back to the reference genome, we need to pay attention to exon-exon junction reads.
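
To show why exon-exon junction reads need special handling, here is a small sketch of how a spliced alignment is usually represented in SAM: the CIGAR operator `N` marks a stretch of reference (the intron) that the read skips. The function below is an illustrative helper, not part of any aligner; it converts a start position and CIGAR string into the aligned exon blocks and the skipped intervals.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(pos, cigar):
    """Return (blocks, junctions) as lists of 1-based inclusive (start, end) intervals.

    M, =, X and D consume the reference and extend the current block;
    N consumes the reference but closes the block (a splice junction);
    I, S, H and P do not consume the reference.
    """
    blocks, junctions = [], []
    ref = pos
    block_start = None
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in ("M", "=", "X", "D"):
            if block_start is None:
                block_start = ref
            ref += length
        elif op == "N":
            if block_start is not None:
                blocks.append((block_start, ref - 1))
                block_start = None
            junctions.append((ref, ref + length - 1))
            ref += length
    if block_start is not None:
        blocks.append((block_start, ref - 1))
    return blocks, junctions


# A 100-base read split across two exons separated by a 1000-base intron.
print(aligned_blocks(100, "50M1000N50M"))
```

An aligner that only looks for contiguous matches against the genome would fail on such a read, since neither 50-base half extends across the intron.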

Mapping with TopHat

Mapping with STAR

  • STAR is reportedly 50 times faster at aligning than TopHat2, with better alignment precision and sensitivity
  • STAR: ultrafast universal RNA-seq aligner
  • Basic STAR workflow consists of 2 steps:
    • Generating genome index files
    • Mapping reads to the genome
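
The two steps can be sketched as a pair of command lines, built here as Python argument lists so they can be inspected before being passed to `subprocess.run`. The STAR options shown are ones documented in the STAR manual, but every path, the read length, and the thread count are placeholders assumed for illustration, and a real run will likely need further options.

```python
def star_index_command(genome_dir, fasta, gtf, read_length=100, threads=4):
    """Step 1: build the genome index (all paths are hypothetical placeholders)."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        # The STAR manual suggests read length minus 1 for this option.
        "--sjdbOverhang", str(read_length - 1),
        "--runThreadN", str(threads),
    ]


def star_align_command(genome_dir, fastq_files, out_prefix, threads=4):
    """Step 2: map reads to the indexed genome and write a sorted BAM file."""
    return [
        "STAR",
        "--genomeDir", genome_dir,
        "--readFilesIn", *fastq_files,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
        "--runThreadN", str(threads),
    ]


print(" ".join(star_index_command("star_index", "genome.fa", "genes.gtf")))
print(" ".join(star_align_command("star_index", ["r1.fastq", "r2.fastq"], "sample1_")))
```

The index only needs to be generated once per genome and annotation; the mapping step is then repeated for each sample.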

Expression quantification

  • FPKM / RPKM: FPKM = \frac{\text{Count of mapped fragments}}{\text{Total mapped fragments (millions)} \times \text{Exon length of transcript (kb)}}
    • Cufflinks & Cuffdiff
  • Count data
    • Summarize mapped reads at the CDS, gene, or exon level
  • The number of reads mapped to a gene is roughly proportional to
    • the length of the gene
    • the total number of reads in the library

HTSeq-count: counting reads in 'features'
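
The idea of counting reads in features can be sketched in a few lines. This is a simplified stand-in written for illustration, not HTSeq's actual implementation: a read is assigned to a gene if it overlaps exons of exactly one gene, and is otherwise tallied as `__no_feature` or `__ambiguous` (the names HTSeq-count uses for these categories). Strand, mapping quality, and paired-end handling are ignored.

```python
from collections import Counter


def count_reads(reads, features):
    """Count reads per gene.

    reads:    list of (chrom, start, end), 1-based inclusive
    features: dict mapping gene name -> list of (chrom, start, end) exons
    """
    counts = Counter({gene: 0 for gene in features})
    for chrom, start, end in reads:
        hits = {
            gene
            for gene, exons in features.items()
            for (c, s, e) in exons
            if c == chrom and start <= e and end >= s  # intervals overlap
        }
        if not hits:
            counts["__no_feature"] += 1
        elif len(hits) > 1:
            counts["__ambiguous"] += 1
        else:
            counts[hits.pop()] += 1
    return counts


# Toy annotation and reads, invented for illustration.
features = {"geneA": [("chr1", 100, 200)], "geneB": [("chr1", 150, 300)]}
reads = [("chr1", 100, 120), ("chr1", 160, 180), ("chr1", 250, 260), ("chr1", 400, 450)]
print(dict(count_reads(reads, features)))
```

The nested scan over all exons is fine for a toy example; real tools index the annotation by genomic position so each read lookup is fast.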

Differential Gene Expression Analysis

  • Count-based methods (R packages):
    1. DESeq: based on the negative binomial distribution
    2. edgeR: uses an overdispersed Poisson model
    3. baySeq: uses an empirical Bayes approach
    4. TSPM: uses a two-stage Poisson model
  • Statistical distributions: Gaussian, Poisson, negative binomial
  • RNA-seq data fits a Negative Binomial (NB) distribution
  • But really, that's just saying that RNA-seq looks like "counts" data with more variation than just statistical fluctuations - it also has biological variation in it.
  • How do we know? Because, when you measure variance (per gene, between replicates), it's not equal to the mean, and it's not even a good linear fit.
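
The variance argument can be checked with a small simulation. The sketch below draws counts for one hypothetical gene across many replicates, once from a Poisson distribution and once from a negative binomial (built as a gamma-Poisson mixture), then compares variance to mean. The mean of 20, the dispersion of 0.5, and the sample size are arbitrary values chosen for illustration. Under this parameterisation the NB variance is mu + dispersion * mu^2, whereas the Poisson variance equals mu.

```python
import math
import random
from statistics import mean, pvariance


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the modest rates used here."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sample_negative_binomial(mu, dispersion, rng):
    """Gamma-Poisson mixture: the rate itself varies between replicates."""
    shape = 1.0 / dispersion
    scale = mu * dispersion  # gamma mean = shape * scale = mu
    return sample_poisson(rng.gammavariate(shape, scale), rng)


rng = random.Random(0)
mu, dispersion, n = 20, 0.5, 2000
poisson_counts = [sample_poisson(mu, rng) for _ in range(n)]
nb_counts = [sample_negative_binomial(mu, dispersion, rng) for _ in range(n)]

print("Poisson variance/mean:", pvariance(poisson_counts) / mean(poisson_counts))
print("NB variance/mean:     ", pvariance(nb_counts) / mean(nb_counts))
```

The extra gamma step plays the role of biological variation: each replicate has its own underlying expression rate, and sequencing then adds Poisson sampling noise on top.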

RPKM/FPKM-based methods

  • Cufflinks & Cuffdiff
  • Other differential analysis methods for microarray data
    • t-test, limma, etc.

Quality Control of Experiments

  • How well do the replicates correlate with each other?
  • Does a PCA plot show that my samples group by genotype?
  • What fraction of transcripts are expressed > 1 RPKM?
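
Two of these checks are easy to sketch: the correlation between replicates and the fraction of transcripts above an expression threshold. The log2(x + 1) transform before correlating is a common convention assumed here, not something the lecture specifies, and the RPKM values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def log2_plus1(values):
    """Log-transform so a few highly expressed genes do not dominate the correlation."""
    return [math.log2(v + 1) for v in values]


def fraction_expressed(rpkm, threshold=1.0):
    """Fraction of transcripts with RPKM strictly above the threshold."""
    return sum(v > threshold for v in rpkm) / len(rpkm)


# Made-up RPKM values for two replicates of the same condition.
rep1 = [0.0, 0.5, 2.0, 10.0, 150.0]
rep2 = [0.1, 0.4, 2.5, 9.0, 140.0]
print(pearson(log2_plus1(rep1), log2_plus1(rep2)))
print(fraction_expressed(rep1))
```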

Hypergeometric test for overrepresentation - the basis for Gene Ontology analysis
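
A minimal sketch of the overrepresentation test follows. Given N background genes, K of which carry a given GO term, and a list of n differentially expressed genes of which k carry the term, the one-sided p-value is the probability of seeing k or more by chance. The function name and the example numbers are made up for illustration.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when drawing n genes without replacement from N, K of them annotated."""
    total = comb(N, n)
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical numbers: 20000 background genes, 200 annotated to the term,
# 500 differentially expressed genes, 15 of which carry the annotation.
print(hypergeom_pvalue(20000, 200, 500, 15))
```

In practice the test is repeated for every GO term, so the resulting p-values need a multiple-testing correction before interpretation.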

Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles

Reposted from blog.csdn.net/wxw060709/article/details/103362870