BMS8110 Review (4): Lecture 4 - Gene Expression Microarrays

Lecture 4 - Gene Expression Microarrays

Outline:

  • Introduction
  • Experimental Design
  • Microarray Data Preprocessing
  • Quality Control of Experiments 
  • Differential Gene Expression Analysis
  • Case Study

Experimental Design

  • Define your biological question
    • e.g. how is transcription affected by estrogen stimulation of breast cancer cells?
  • Define classes, i.e. the biological conditions/systems you intend to sample
    • e.g. cerebellum from Alzheimer's patients, HEK cell line treated with miR-21 antagomir, primary fibroblasts after 2 days of culture
  • Factors can be combined to define classes
    • e.g. treated vs untreated, 12 vs 24 hours
  • Plan how many replicates you need
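
As a minimal sketch of how factors combine into classes, the snippet below enumerates every combination of factor levels and assigns the same number of replicates to each class. The factor names, level labels, and replicate count are illustrative assumptions, not values from the lecture.

```python
from itertools import product

# Hypothetical factors and levels, modelled on the example above.
factors = {
    "treatment": ["untreated", "treated"],
    "time": ["12h", "24h"],
}


def define_classes(factors):
    """Return one class label per combination of factor levels."""
    names = list(factors)
    return ["_".join(levels) for levels in product(*(factors[n] for n in names))]


def balanced_design(classes, n_replicates):
    """Assign the same number of biological replicates to every class."""
    return {c: [f"{c}_rep{i}" for i in range(1, n_replicates + 1)] for c in classes}


classes = define_classes(factors)
design = balanced_design(classes, n_replicates=3)  # replicate count is an assumption
for cls, samples in design.items():
    print(cls, samples)
```

Two factors with two levels each give four classes, so three replicates per class already means twelve arrays.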

Replicates

  • You need replicates to estimate the variability of your measurements
  • Without variability estimates you cannot assess whether an observation is statistically significant
  • Biological Replicate: Make a new RNA sample from an independent source (e.g. new animal, or parallel cell line culture)
  • Technical Replicate: Use the same RNA sample for a new hybridization
  • With Affymetrix microarrays, you will need only biological replicates

Experimental Design: Important Points

  • Clearly define your biological questions
  • Replicate experiments
    • Biological variability must be factored in through replication
    • repeat the experiment using different biological samples
  • Use clear and balanced designs
    • use the same number of replicates in every class
  • Minimize experimental variability
    • Experimental variability arises from different platforms, different protocols, different experimenters, different days, etc.
    • Minimize all these factors
  • Control the assumptions of your design
    • Many studies on human patients assume two-class designs; however, the patients may exhibit heterogeneous phenotypes (e.g. different cancer stages) and hence different transcriptomes
    • The explorative analysis might reveal a different picture than you expected

.CEL Files

  • .CEL files are produced by the Affymetrix proprietary software after essential image processing.
  • You can download .CEL files from databases (e.g. GEO, the Gene Expression Omnibus) and analyze them using rma (available within an R/Bioconductor analysis package)
  • Otherwise, pre-processed data are available in SOFT or MINiML textual formats
  • If Affymetrix data have not been processed using rma, and .CEL files are available, it is better to download the .CEL files and process them yourself
  • .CEL files are also valuable for computing A/P (absent/present) counts

RMA (Robust Multi-array Average) is a multi-chip analysis method.

Explorative Analysis Concepts

  • Sample signal distribution
    • You need to check whether signals, on a per-sample basis, are affected by global biases
    • Such biases can be corrected, if they are not too extreme
    • Extreme or persisting biases are a sign of sample preparation or hybridization problems
  • Similarity relations within and between classes
    • You need to check if relations between samples satisfy the assumptions behind your experimental design
    • Samples from the same class are expected to be more similar than samples from different classes

Signal Distribution

  • Use boxplots on log2-scale signals
  • If you have used rma, you will notice that boxplots are extremely similar
  • That's because rma incorporates a normalization step
    • Normalization is a pre-processing technique that corrects for differences in signal distribution among samples.
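
The numbers behind a boxplot can be sketched as follows: a five-number summary of the log2 signals for each sample. The intensities are made up, with the second sample being the first one scaled by 4 to mimic a global bias; plotting is left out so the example needs only the standard library.

```python
import math
import statistics


def boxplot_stats(signals):
    """Five-number summary of the log2-transformed signals of one sample."""
    logged = sorted(math.log2(s) for s in signals)
    q1, median, q3 = statistics.quantiles(logged, n=4, method="inclusive")
    return {"min": logged[0], "q1": q1, "median": median, "q3": q3, "max": logged[-1]}


# Made-up raw intensities; sample_B is sample_A multiplied by 4 (a global bias).
samples = {
    "sample_A": [64, 128, 256, 512, 1024],
    "sample_B": [256, 512, 1024, 2048, 4096],
}

for name, signals in samples.items():
    print(name, boxplot_stats(signals))
```

On the log2 scale a multiplicative bias becomes a constant shift of the whole box, which is the kind of difference normalization is meant to remove.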

A/P Detection Rates

  • P (or A) rates are independent of rma and its normalization
  • The rate should be very similar across samples
  • If there are strong differences, biases are likely impossible to remove even using normalization
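
A rough sketch of the check described above: compute the fraction of Present calls per sample and flag samples that deviate from the median rate. The detection calls are made up, and the 0.10 tolerance is an arbitrary illustrative threshold, not a recommended value.

```python
import statistics


def present_rate(calls):
    """Fraction of probesets called Present ('P') in one sample."""
    return sum(1 for c in calls if c == "P") / len(calls)


def flag_outlier_samples(calls_by_sample, tolerance=0.10):
    """Return samples whose Present rate is further than `tolerance` from the median rate."""
    rates = {name: present_rate(calls) for name, calls in calls_by_sample.items()}
    median_rate = statistics.median(rates.values())
    return [name for name, rate in rates.items() if abs(rate - median_rate) > tolerance]


# Made-up detection calls (P = present, A = absent), ten probesets per sample.
calls_by_sample = {
    "s1": list("PPPAPAPPAP"),
    "s2": list("PPAPPAPPAP"),
    "s3": list("APAAPAAPAA"),
}

flagged = flag_outlier_samples(calls_by_sample)
print("Samples with a deviating Present rate:", flagged)
```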

Normalizing Signal Distribution

  • If you are not using Affymetrix + rma, you will have to take care of normalization yourself
  • We don't have enough time here to review the different procedures
  • Quantile normalization is often regarded as a standard
  • If you are using Illumina microarrays, follow the instructions in the R/Bioconductor package lumi
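
For intuition, quantile normalization can be sketched in a few lines: every sample is forced onto the same reference distribution, built from the mean of the k-th smallest value across samples. This is a simplified illustration with made-up values; ties are broken by position, which full implementations handle more carefully.

```python
def quantile_normalize(columns):
    """Quantile-normalize samples (columns) that list the same genes in the same order."""
    n_genes = len(next(iter(columns.values())))
    sorted_cols = [sorted(values) for values in columns.values()]
    # Reference distribution: mean of the k-th smallest value across all samples.
    reference = [
        sum(col[k] for col in sorted_cols) / len(sorted_cols) for k in range(n_genes)
    ]
    normalized = {}
    for name, values in columns.items():
        # Gene indices ordered from the lowest to the highest signal in this sample.
        order = sorted(range(n_genes), key=lambda i: values[i])
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = reference[rank]
        normalized[name] = out
    return normalized


# Made-up log2 signals for four genes in three samples.
columns = {
    "s1": [5.0, 2.0, 3.0, 4.0],
    "s2": [4.0, 1.0, 4.0, 2.0],
    "s3": [3.0, 4.0, 6.0, 8.0],
}

normalized = quantile_normalize(columns)
for name, values in normalized.items():
    print(name, [round(v, 2) for v in values])
```

After normalization every sample contains exactly the same set of values, only assigned to different genes, so boxplots of the samples become identical.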

Quality check methods

  • Hierarchical clustering
    • Groups samples into hierarchical groups, using a similarity measure (comparison of gene expression levels of samples)
    • Hierarchical groups are visualized using a dendrogram
    1. Define a similarity metric
    2. Compute similarity among samples (Pearson correlation on log2 rma signals)
    3. Run the hierarchical clustering algorithm
    4. Plot the dendrogram
    5. Check whether the sample groups correctly reproduce the classes defined by the experimental design
  • PCA
    • Projects samples into a new space, by capturing common patterns of gene expression
    • Usually samples are visualized in a 2D space (biplot)
    • Projection: in microarray explorative analysis, samples are treated as objects; genes/transcripts are treated as dimensions
    • Here we focus on the PCA of samples as objects, although PCA of genes as objects can be done as well (useful to identify functional groups)
    • The new dimensions are interpreted as meta-genes, i.e. the distilled expression patterns across samples of all genes in the data-set
    • If the factors in the experimental design have a strong effect on gene expression, PCA is expected to reveal groups of samples corresponding to classes
    • Different PCs often "explain" different factors of the experimental design
    • It is important to standardize the data before running PCA; otherwise, the first principal component is unlikely to account for the variability due to the experimental design
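
The five clustering steps can be sketched in plain Python as follows. The distance is 1 minus the Pearson correlation, and average linkage is used as the clustering rule; the linkage choice, the sample names, and the log2 values are assumptions made for illustration. Instead of plotting a dendrogram, the sketch prints the sequence of merges.

```python
import math


def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_linkage(profiles):
    """Agglomerative clustering with distance 1 - r; returns the merges in order."""
    clusters = {name: [name] for name in profiles}  # cluster label -> member samples

    def dist(a, b):
        return 1 - pearson(profiles[a], profiles[b])

    merges = []
    while len(clusters) > 1:
        labels = list(clusters)
        best = None
        for i in range(len(labels)):
            for j in range(i + 1, len(labels)):
                a, b = labels[i], labels[j]
                pairs = [(p, q) for p in clusters[a] for q in clusters[b]]
                d = sum(dist(p, q) for p, q in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[f"({a},{b})"] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, round(d, 3)))
    return merges


# Made-up log2 signals for five genes in four samples.
profiles = {
    "ctrl_1": [8.0, 6.0, 7.0, 9.0, 5.0],
    "ctrl_2": [8.1, 6.2, 6.9, 9.1, 5.1],
    "trt_1": [6.0, 8.0, 7.1, 5.0, 9.0],
    "trt_2": [6.2, 7.9, 7.0, 5.2, 9.1],
}

merges = average_linkage(profiles)
for a, b, d in merges:
    print(f"merge {a} + {b} at distance {d}")
```

If the design holds, replicates of the same class merge first and the classes join only at the last, largest distance.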

PCA: PC selection

  • It is common to look at the first two principal components, and figure out if they make sense biologically
  • Here we take a more nuanced approach
    • We can use the comparison between eigenvalues from real data PCA and randomized data PCA to evaluate which components are "significant"
    • There is really no space to delve into the mechanics of this; just use the R functions provided later.
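
To give a feel for the real-versus-randomized comparison, here is a standard-library sketch for the first component only. It is not the lecture's R code: the permutation scheme (shuffling each gene's values across samples), the simulated data, and the use of power iteration are all assumptions made for illustration.

```python
import math
import random


def standardize_genes(matrix):
    """Z-score each gene (row) across samples; genes with zero variance are dropped."""
    out = []
    for row in matrix:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (len(row) - 1))
        if sd > 0:
            out.append([(v - mean) / sd for v in row])
    return out


def pc1_fraction(matrix, rng, iterations=200):
    """Fraction of total variance explained by PC1, with samples as objects."""
    z = standardize_genes(matrix)
    n = len(z[0])
    # Sample-by-sample Gram matrix; its nonzero eigenvalues are proportional
    # to the eigenvalues of the PCA on the standardized genes.
    gram = [[sum(row[i] * row[j] for row in z) for j in range(n)] for i in range(n)]
    v = [rng.random() for _ in range(n)]
    eigenvalue = 0.0
    for _ in range(iterations):  # power iteration for the largest eigenvalue
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
        eigenvalue = norm
    return eigenvalue / sum(gram[i][i] for i in range(n))


def permuted_pc1_fractions(matrix, rng, n_permutations=50):
    """PC1 fractions after shuffling every gene's values across samples."""
    fractions = []
    for _ in range(n_permutations):
        shuffled = [rng.sample(row, len(row)) for row in matrix]
        fractions.append(pc1_fraction(shuffled, rng))
    return fractions


# Simulated log2 data: 40 genes, 3 control + 3 treated samples;
# the first 30 genes respond to the treatment.
rng = random.Random(0)
matrix = []
for g in range(40):
    effect = 2.0 if g < 30 else 0.0
    control = [8.0 + rng.gauss(0, 0.3) for _ in range(3)]
    treated = [8.0 + effect + rng.gauss(0, 0.3) for _ in range(3)]
    matrix.append(control + treated)

real = pc1_fraction(matrix, rng)
null = permuted_pc1_fractions(matrix, rng)
print(f"PC1 fraction, real data: {real:.2f}; largest in permuted data: {max(null):.2f}")
```

A component whose eigenvalue clearly exceeds what the shuffled data produce is a candidate "significant" component; the same comparison can be repeated for later components.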

PCA: beyond biplots

  • A big limitation in visualizing PCA results is that the human observer is restricted to 2D spaces
  • In relatively simple microarray experiments, with a few experimental factors, only 1 or 2 PCs are really significant, hence a single biplot often suffices
  • In case you have more than 2 significant PCs:
    • Explore combinations of 2 PCs at a time
    • Use a 3D plot; since 3D plots end up being 2D on screens and paper, tune the projection parameters so that the most significant component is least affected by the projection
    • Other, more advanced applications

Differential Statistics

  • Once we have the expression matrix
    • Genes x Samples, signal related to transcript level
  • We need to identify which genes have an interesting behavior in relation to the experimental design
    • The statistical technique used depends on the biological question asked

Two-class Design: Strength of change vs Test Statistics

  • Log2 (ratio of class means)
    • is a measure of strength of the change
    • works on the means of classes, therefore it is unable to formally account for uncertainty in class mean estimation
  • t-test
    • is a test statistic, based on the statistical theory of inference, and specifically on hypothesis testing
    • formally accounts for uncertainty in class mean estimation
    • results in a p-value, the probability of observing a change at least this extreme if the null hypothesis (i.e. no change in expression) were true ---> a low p-value corresponds to a significant change
    • it is not a measure of the strength of change, although a strong change, if consistent across samples, results in a very low p-value
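
The contrast between the two quantities can be illustrated with two made-up genes that have the same log2 ratio of class means but very different replicate consistency. The sketch uses Welch's variant of the t statistic on log2 signals, which is an assumption (the notes do not say which variant is meant), and it stops at the statistic because the standard library has no t distribution for the p-value.

```python
import math
import statistics


def log2_ratio(class_a, class_b):
    """log2 of the ratio of class means (B over A), on raw-scale signals."""
    return math.log2(statistics.mean(class_b) / statistics.mean(class_a))


def welch_t(class_a, class_b):
    """Welch t statistic computed on log2 signals."""
    a = [math.log2(v) for v in class_a]
    b = [math.log2(v) for v in class_b]
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se


# Made-up raw signals, three replicates per class.
genes = {
    "consistent": ([100.0, 110.0, 90.0], [200.0, 210.0, 190.0]),
    "noisy": ([20.0, 100.0, 180.0], [40.0, 400.0, 160.0]),
}

for name, (class_a, class_b) in genes.items():
    print(name, "log2 ratio:", log2_ratio(class_a, class_b),
          "t:", round(welch_t(class_a, class_b), 2))
```

Both genes double their mean signal, yet only the gene with consistent replicates produces a large t statistic.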

Two-class Design: moderated t-test

  • A problem with the t-test is the insufficient number of replicates in the usual microarray study
  • To overcome this problem, several modified t-statistics, usually called moderated t, have been introduced
  • They either define a minimal variation coefficient, or otherwise exploit the availability of many gene expression levels to better estimate single gene variance
  • We will utilize the moderated t defined in the Bioconductor package limma
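
The idea behind variance moderation can be sketched as below. The gene-wise variance is shrunk toward a prior variance, using the weighted average (d0·s0² + d·s²)/(d0 + d) that underlies limma's empirical Bayes approach. In limma the prior variance s0² and prior degrees of freedom d0 are estimated from all genes; here they are passed in by hand as hypothetical values, so this is an illustration of the principle, not a reimplementation of limma.

```python
import math
import statistics


def moderated_t(class_a, class_b, prior_var, prior_df):
    """Two-class t statistic with the gene variance shrunk toward a prior variance.

    With prior_df = 0 this reduces to the ordinary pooled-variance t statistic.
    """
    na, nb = len(class_a), len(class_b)
    df = na + nb - 2
    pooled_var = (
        (na - 1) * statistics.variance(class_a)
        + (nb - 1) * statistics.variance(class_b)
    ) / df
    # Weighted average of the prior variance and the gene's own variance.
    shrunk_var = (prior_df * prior_var + df * pooled_var) / (prior_df + df)
    diff = statistics.mean(class_b) - statistics.mean(class_a)
    return diff / math.sqrt(shrunk_var * (1 / na + 1 / nb))


# Made-up log2 signals: a small change with a variance that is tiny by chance.
class_a = [7.00, 7.01, 6.99]
class_b = [7.20, 7.21, 7.19]

ordinary = moderated_t(class_a, class_b, prior_var=0.0, prior_df=0)
moderated = moderated_t(class_a, class_b, prior_var=0.04, prior_df=4)  # hypothetical prior
print("ordinary t:", round(ordinary, 2), "moderated t:", round(moderated, 2))
```

With only three replicates per class, a variance that happens to be very small inflates the ordinary t statistic; borrowing information from the prior pulls it back to a more plausible value.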

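es.affy <- read.affybatch(filename = list.files("D:/Dropbox/BMS8110/GSE11352/GSE11352_RAW/."), phenoData = es.pData)

Oddly, this line runs fine when pasted into the console, but raises an error when selected and run from the script editor. One possible cause (not verified) is the working directory: list.files() returns bare file names unless full.names = TRUE is set, so the call only works when the working directory is the data folder.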

Reposted from blog.csdn.net/wxw060709/article/details/103343823