BMS8110 Review (4): Lecture 4 - Gene Expression Microarrays

Lecture 4 - Gene Expression Microarrays

Outline:

Introduction
Experimental Design
Microarray Data Preprocessing
Quality Control of Experiments
Differential Gene Expression Analysis
Case Study

Experimental Design

Define your biological question
- e.g. how is transcription affected by estrogen stimulation of breast cancer cells?
Define classes, i.e. the biological conditions/systems you intend to sample
- e.g. cerebellum from Alzheimer's patients, HEK cell line treated with miR-21 antagomir, primary fibroblasts after 2 days of culture
Factors can be combined to define classes
- e.g. treated vs untreated, 12 vs 24 hours
Plan how many replicates you need
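To make the idea of combining factors into classes concrete, here is a minimal Python sketch (the course itself works in R/Bioconductor). The factor names, levels, and the replicate count of 3 are assumptions made up for the illustration, not values from the lecture.

```python
from itertools import product

# Hypothetical factors and levels, following the "treated vs untreated,
# 12 vs 24 hours" example above.
factors = {
    "treatment": ["untreated", "treated"],
    "time": ["12h", "24h"],
}
replicates_per_class = 3  # assumed; a balanced design uses the same number in every class

# Every combination of factor levels defines one class.
classes = ["_".join(levels) for levels in product(*factors.values())]

# One sample-sheet entry per biological replicate: (sample name, class).
sample_sheet = [
    (f"{cls}_rep{rep}", cls)
    for cls in classes
    for rep in range(1, replicates_per_class + 1)
]

print(classes)
print(len(sample_sheet), "samples in total")
```

Two factors with two levels each give four classes, so the number of arrays grows quickly with every factor added to the design.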

Replicates

You need replicates to estimate the variability of your measurements
Without variability estimates you cannot make statistically significant observations
Biological Replicate: Make a new RNA sample from an independent source (e.g. new animal, or parallel cell line culture)
Technical Replicate: Use the same RNA sample for a new hybridization
With Affymetrix microarrays, you will need only biological replicates

Experimental Design: Important Points

Clearly define your biological questions
Replicate experiments
- Biological variability must be factored in through replication
- repeat the experiment using different biological samples
Use clear and balanced designs
- use the same number of replicates in every class
Minimize experimental variability
- Experimental variability arises from different platforms, different protocols, different experimenters, different days, etc...
- Minimize all these factors
Control the assumptions of your design
- Many studies on human patients assume two-class designs; however, the patients may exhibit heterogeneous phenotypes (e.g. different cancer stages) and hence different transcriptomes
- The explorative analysis might reveal a different picture from the one you expected

.CEL Files

.CEL files are produced by the Affymetrix proprietary software after essential image processing.
You can download .CEL files from databases (e.g. GEO, the Gene Expression Omnibus) and analyze them using rma (available within an R/Bioconductor analysis package)
Otherwise, pre-processed data are available in SOFT or MINiML text formats
If Affymetrix data have not been processed using rma, and .CEL files are available, it is better to download the .CEL files and process them yourself
.CEL files are also useful for computing A/P (absent/present) counts

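RMA (Robust Multi-array Average) is a multi-chip analysis method: it combines background correction, quantile normalization, and probe-set summarization across all arrays in the experiment.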

Explorative Analysis Concepts

Sample signal distribution
- You need to check if signals, on a by-sample basis, are affected by global biases
- Such biases can be corrected, if they are not too extreme
- Extreme or persistent biases are a sign of sample preparation or hybridization problems
Similarity relations within and between classes
- You need to check if relations between samples satisfy the assumptions behind your experimental design
- Samples from the same class are expected to be more similar than samples from different classes

Signal Distribution

Use boxplots on log2-scale signals
If you have used rma, you will notice that boxplots are extremely similar
That's because rma incorporates a normalization step
- Normalization is a pre-processing technique that corrects for differences in signal distribution among samples.
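A boxplot is essentially a picture of each sample's quartiles, so the check can be sketched numerically. The snippet below is a plain-Python illustration with made-up intensities; sample_C is constructed to be 4 times brighter than sample_A to mimic a global bias.

```python
import math
import statistics

# Made-up raw intensities for three samples (not real array data).
raw = {
    "sample_A": [120, 340, 560, 900, 2300, 4100, 8000],
    "sample_B": [130, 360, 600, 950, 2500, 4300, 8600],
    "sample_C": [480, 1360, 2240, 3600, 9200, 16400, 32000],  # 4x sample_A
}

def boxplot_summary(values):
    """Return (Q1, median, Q3) of the log2-transformed values."""
    logged = [math.log2(v) for v in values]
    q1, median, q3 = statistics.quantiles(logged, n=4)
    return q1, median, q3

summaries = {name: boxplot_summary(values) for name, values in raw.items()}
for name, (q1, median, q3) in summaries.items():
    print(f"{name}: Q1={q1:.2f} median={median:.2f} Q3={q3:.2f}")

# A global multiplicative bias becomes a constant shift on the log2 scale.
shift = summaries["sample_C"][1] - summaries["sample_A"][1]
print(f"median shift of sample_C vs sample_A: {shift:.2f} log2 units")
```

On the log2 scale the 4-fold bias appears as a shift of 2 units of the whole box, which is the kind of difference normalization is meant to remove.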

A/P Detection Rates

P (or A) rates are independent of rma and its normalization
The index should be pretty similar across samples
If there are strong differences, biases are likely impossible to remove even using normalization
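As a rough illustration of the P-rate check, the sketch below computes the fraction of probe sets called present in each sample and flags samples far from the rest. The calls are invented, and the 0.2 cutoff is an arbitrary assumption, not a recommended threshold.

```python
# Hypothetical detection calls per probe set: "P" = present, "A" = absent.
calls = {
    "sample_A": ["P", "P", "A", "P", "A", "P", "A", "P", "P", "A"],
    "sample_B": ["P", "P", "A", "P", "P", "P", "A", "P", "A", "A"],
    "sample_C": ["A", "P", "A", "A", "A", "P", "A", "A", "A", "A"],
}

def present_rate(sample_calls):
    """Fraction of probe sets called present in one sample."""
    return sample_calls.count("P") / len(sample_calls)

rates = {name: present_rate(sample_calls) for name, sample_calls in calls.items()}

# Compare each sample with the median rate; the cutoff is arbitrary.
median_rate = sorted(rates.values())[len(rates) // 2]
outliers = [name for name, rate in rates.items() if abs(rate - median_rate) > 0.2]

print(rates)
print("possible problem samples:", outliers)
```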

Normalizing Signal Distribution

If you are not using Affymetrix + rma, you will have to take care of normalization yourself
We don't have enough time here to review the different procedures
Quantile normalization is often regarded as a standard
If you are using Illumina microarrays, follow the instructions in the R/Bioconductor package lumi
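Since the procedures are not reviewed here, the following is only a minimal plain-Python sketch of the idea behind quantile normalization, on a toy matrix. It breaks ties by row order, which is a simplification; real analyses should rely on an established package.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes x samples matrix given as a list of rows.

    Every sample (column) is forced onto the same reference distribution:
    the mean of the sorted columns at each rank.
    """
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]

    # Reference distribution: mean across samples at each rank.
    sorted_columns = [sorted(column) for column in columns]
    reference = [
        sum(column[rank] for column in sorted_columns) / n_samples
        for rank in range(n_genes)
    ]

    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for j, column in enumerate(columns):
        # Gene indices ordered from lowest to highest signal in this sample.
        order = sorted(range(n_genes), key=lambda gene: column[gene])
        for rank, gene in enumerate(order):
            normalized[gene][j] = reference[rank]
    return normalized

# Toy log2 signals: 4 genes x 3 samples.
signals = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.5, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(signals)
for row in normalized:
    print([round(value, 2) for value in row])
```

After this step each sample contains exactly the same set of values, only assigned to different genes, which is why boxplots of normalized samples look nearly identical.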

Quality check methods

Hierarchical clustering
- Groups samples into hierarchical groups, using a similarity measure (comparison of gene expression levels of samples)
- Hierarchical groups are visualized using a dendrogram

Define a similarity metric
Compute similarity among samples (e.g. Pearson correlation on log2 rma signals)
Run hierarchical clustering algorithm
Plot dendrogram
Analyze if sample groups correctly reproduce classes, according to the experimental design
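The steps above can be sketched in plain Python. This is an illustrative toy, assuming distance = 1 - Pearson correlation and average linkage (the notes do not say which linkage is used); the sample names and signals are invented, and the merge list stands in for the dendrogram plot.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def cluster_samples(samples):
    """Average-linkage clustering; returns merges as (cluster, cluster, distance)."""
    names = list(samples)
    dist = {
        frozenset((a, b)): 1 - pearson(samples[a], samples[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        return sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        candidates = [
            (linkage(c1, c2), c1, c2)
            for i, c1 in enumerate(clusters)
            for c2 in clusters[i + 1:]
        ]
        d, c1, c2 = min(candidates, key=lambda item: item[0])
        merges.append((c1, c2, d))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Invented log2 signals for 6 genes in 4 samples.
samples = {
    "control_1": [8.0, 7.5, 6.0, 9.0, 5.0, 7.0],
    "control_2": [8.1, 7.4, 6.2, 9.1, 5.1, 6.9],
    "treated_1": [6.0, 9.5, 6.1, 7.0, 8.0, 7.1],
    "treated_2": [6.1, 9.4, 5.9, 7.2, 8.1, 7.0],
}
merges = cluster_samples(samples)
for c1, c2, d in merges:
    print(f"merge {c1} + {c2} at distance {d:.3f}")
```

If the design holds, replicates of the same class merge first at small distances and the classes join only at the last, much larger, distance.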

PCA
- Projects samples into a new space, by capturing common patterns of gene expression
- Usually samples are visualized in a 2D space (biplot)
- Projection: in microarray explorative analysis, samples are treated as objects; genes/transcripts are treated as dimensions
Here we focus on the PCA of samples as objects, although PCA of genes as objects can be done as well (useful to identify functional groups)
The new dimensions are interpreted as
- meta-genes
- i.e. the distilled expression patterns across samples of all genes in the data-set
If the factors in the experimental design have a strong effect on gene expression, PCA is expected to reveal groups of samples corresponding to classes
Different PCs often "explain" different factors of the experimental design
It is important to standardize the data before running PCA
Otherwise, the first principal component is unlikely to capture the variability due to the experimental design, because a few high-variance genes can dominate it
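To illustrate why standardization matters, here is a small plain-Python sketch that computes only the first principal component by power iteration (a simplification chosen to avoid external libraries; it is not how R computes PCA). The 4 x 4 matrix is invented: three genes follow the control/treated design, and one high-variance gene follows an unrelated pattern.

```python
import math

def center(matrix):
    """Subtract each gene's mean across samples (no scaling)."""
    return [[v - sum(row) / len(row) for v in row] for row in matrix]

def standardize(matrix):
    """Scale each gene (row) to mean 0 and standard deviation 1 across samples."""
    out = []
    for row in matrix:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (len(row) - 1))
        out.append([(v - mean) / sd for v in row])
    return out

def first_pc_scores(matrix, iterations=200):
    """PC1 scores of the samples, via power iteration on the samples x samples Gram matrix."""
    n = len(matrix[0])
    gram = [[sum(row[i] * row[j] for row in matrix) for j in range(n)] for i in range(n)]
    u = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    for _ in range(iterations):
        u = [sum(gram[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in u))
        u = [v / norm for v in u]
    eigenvalue = sum(u[i] * gram[i][j] * u[j] for i in range(n) for j in range(n))
    return [math.sqrt(eigenvalue) * v for v in u]

# Columns: control_1, control_2, treated_1, treated_2 (invented log2 signals).
expression = [
    [5.0, 5.2, 7.0, 7.1],    # follows the design (up in treated)
    [6.1, 6.0, 8.0, 8.2],    # follows the design (up in treated)
    [9.0, 9.1, 7.0, 6.9],    # follows the design (down in treated)
    [2.0, 14.0, 3.0, 13.0],  # high variance, unrelated to the design
]

scores_raw = first_pc_scores(center(expression))
scores_std = first_pc_scores(standardize(expression))
print("PC1 without standardization:", [round(s, 2) for s in scores_raw])
print("PC1 with standardization:   ", [round(s, 2) for s in scores_std])
```

Without standardization PC1 splits the samples according to the single high-variance gene; with standardization it separates controls from treated samples. The sign of a principal component is arbitrary, so only the grouping matters.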

PCA: PC selection

It is common to look at the first two principal components and figure out if they make sense biologically
Here we take a more nuanced approach
- We can compare the eigenvalues from PCA of the real data with those from PCA of randomized data to evaluate which components are "significant"
- There is really no space to delve into the mechanics of this; just use the R functions provided later.
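For the curious, the idea can be sketched in a few lines of plain Python. This is not the course's R function: it is a simplified illustration that looks only at the first component, uses simulated data (60 of 100 genes changed between two classes of 3 samples, an assumption), and randomizes by shuffling each gene's values across samples.

```python
import math
import random

def standardize(row):
    """Scale one gene to mean 0 and standard deviation 1 across samples."""
    mean = sum(row) / len(row)
    sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (len(row) - 1))
    return [(v - mean) / sd for v in row]

def top_eigenvalue_share(matrix, iterations=300):
    """Approximate share of total variance carried by PC1 (power iteration)."""
    n = len(matrix[0])
    gram = [[sum(r[i] * r[j] for r in matrix) for j in range(n)] for i in range(n)]
    u = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    for _ in range(iterations):
        u = [sum(gram[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in u))
        u = [v / norm for v in u]
    top = sum(u[i] * gram[i][j] * u[j] for i in range(n) for j in range(n))
    return top / sum(gram[i][i] for i in range(n))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_genes, n_changed = 100, 60

# Simulated data: samples 0-2 are one class, samples 3-5 the other.
data = []
for gene in range(n_genes):
    effect = 2.0 if gene < n_changed else 0.0
    row = [rng.gauss(0, 0.3) + (effect if sample >= 3 else 0.0) for sample in range(6)]
    data.append(standardize(row))

real_share = top_eigenvalue_share(data)

# Null distribution: shuffle each gene's values across samples independently.
null_shares = []
for _ in range(20):
    shuffled = [rng.sample(row, len(row)) for row in data]
    null_shares.append(top_eigenvalue_share(shuffled))

print(f"PC1 share in real data: {real_share:.2f}")
print(f"largest PC1 share in randomized data: {max(null_shares):.2f}")
```

A component counts as "significant" in this sense when its eigenvalue in the real data clearly exceeds what the randomized data produce; the same comparison can be repeated for the second and later components.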

PCA: beyond biplots

A big limitation in visualizing PCA results is that the human observer is restricted to 2D spaces
In relatively simple microarray experiments, with a few experimental factors, only 1 or 2 PCs are really significant, hence a single biplot often suffices
In case you have more than 2 significant PCs:
- Explore combinations of 2 PCs at a time
- Use a 3D plot; since 3D plots end up being 2D on screens and paper, tune the projection parameters so that the most significant component is least affected by the projection
- Use other, more advanced applications

Differential Statistics

Once we have the expression matrix
- Genes x Samples, signal related to transcript level
We need to identify which genes have an interesting behavior in relation to the experimental design
- The statistical technique used depends on the biological question asked

Two-class Design: Strength of change vs Test Statistics

Log2 (ratio of class means)
- is a measure of strength of the change
- works on the means of classes, therefore it is unable to formally account for uncertainty in class mean estimation
t-test
- is a test statistic, based on the statistical theory of inference, and specifically on hypothesis testing
- formally accounts for uncertainty in class mean estimation
- results in a p-value, the probability of observing a difference at least as extreme as the measured one if the null hypothesis (i.e. no change in expression) were true ---> a low p-value corresponds to a significant change
- it is not a measure of the strength of change, although a strong change, if consistent across samples, results in a very low p-value
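The difference between the two quantities shows up clearly on a toy example. The sketch below uses invented signals for two genes with identical class means; the log2 ratio is computed on linear-scale signals and Welch's t statistic on log2 signals (the choice of Welch's variant is an assumption, the lecture does not specify one). No p-value is computed, to keep the snippet free of external libraries.

```python
import math
import statistics

def log2_ratio(control, treated):
    """log2 of the ratio of class means (linear-scale signals)."""
    return math.log2(statistics.mean(treated) / statistics.mean(control))

def welch_t(control, treated):
    """Welch's t statistic, computed on log2-transformed signals."""
    a = [math.log2(v) for v in control]
    b = [math.log2(v) for v in treated]
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se

# Two invented genes: same class means (100 vs 200), very different spread.
stable_gene = ([100, 104, 96], [200, 208, 192])
noisy_gene = ([40, 100, 160], [80, 200, 320])

ratio_stable, t_stable = log2_ratio(*stable_gene), welch_t(*stable_gene)
ratio_noisy, t_noisy = log2_ratio(*noisy_gene), welch_t(*noisy_gene)

print(f"stable gene: log2 ratio = {ratio_stable:.2f}, t = {t_stable:.1f}")
print(f"noisy gene:  log2 ratio = {ratio_noisy:.2f}, t = {t_noisy:.1f}")
```

Both genes double their mean signal, so the log2 ratio cannot tell them apart; the t statistic is much larger for the gene whose replicates agree.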

Two-class Design: moderated t-test

A problem with the t-test is the insufficient number of replicates in the usual microarray study
To overcome this problem, several modified t-statistics, usually called moderated t, have been introduced
They either define a minimal variation coefficient, or otherwise exploit the availability of many gene expression levels to better estimate single-gene variance
We will utilize the moderated t defined in the Bioconductor package limma
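As a rough sketch of the second idea, the snippet below shrinks each gene's variance toward a prior variance before computing t, using the form s2_moderated = (d0 * s0^2 + d * s^2) / (d0 + d). limma estimates the prior variance s0^2 and prior degrees of freedom d0 from all genes by empirical Bayes; here both are hand-picked assumptions, and the two genes are invented, so this is an illustration of the principle rather than limma's implementation.

```python
import math
import statistics

def pooled_variance(a, b):
    """Pooled within-class variance and its degrees of freedom."""
    df = len(a) + len(b) - 2
    ss = statistics.variance(a) * (len(a) - 1) + statistics.variance(b) * (len(b) - 1)
    return ss / df, df

def t_statistics(a, b, prior_var, prior_df):
    """Return (ordinary t, moderated t) for one gene on log2 signals."""
    s2, df = pooled_variance(a, b)
    diff = statistics.mean(b) - statistics.mean(a)
    scale = math.sqrt(1 / len(a) + 1 / len(b))
    # Shrink the gene-wise variance toward the prior variance.
    s2_moderated = (prior_df * prior_var + df * s2) / (prior_df + df)
    ordinary = diff / (math.sqrt(s2) * scale)
    moderated = diff / (math.sqrt(s2_moderated) * scale)
    return ordinary, moderated

# Hand-picked prior, for illustration only.
PRIOR_VAR, PRIOR_DF = 0.04, 4

# A tiny change with, by chance, almost no observed variance.
lucky_gene = ([7.00, 7.01], [7.10, 7.11])
# A large change with ordinary variability.
solid_gene = ([7.0, 7.3], [9.0, 9.4])

lucky_t, lucky_mod = t_statistics(*lucky_gene, PRIOR_VAR, PRIOR_DF)
solid_t, solid_mod = t_statistics(*solid_gene, PRIOR_VAR, PRIOR_DF)

print(f"lucky gene: ordinary t = {lucky_t:.1f}, moderated t = {lucky_mod:.1f}")
print(f"solid gene: ordinary t = {solid_t:.1f}, moderated t = {solid_mod:.1f}")
```

With only two replicates per class, the ordinary t ranks the "lucky" gene above the gene with the large change; after moderation the ranking is reversed, which is the behavior the moderated t is designed to give.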

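es.affy <- read.affybatch(filename = list.files("D:/Dropbox/BMS8110/GSE11352/GSE11352_RAW/."), phenoData = es.pData)
Open question: why does this line run fine when pasted into the console, yet fail with an error when it is selected and run from the script (file) window? Strange.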

wxw060709
