Lecture 4 - Gene Expression Microarrays
Outline:
- Introduction
- Experimental Design
- Microarray Data Preprocessing
- Quality Control of Experiments
- Differential Gene Expression Analysis
- Case Study
Experimental Design
- Define your biological question
- e.g. how is transcription affected by estrogen stimulation of breast cancer cells?
- Define classes, i.e. the biological conditions/systems you intend to sample
- e.g. cerebellum from Alzheimer's patients, HEK cell line treated with miR-21 antagomir, primary fibroblasts after 2 days of culture
- Factors can be combined to define classes
- e.g. treated vs untreated, 12 vs 24 hours
- Plan how many replicates you need
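The way factors combine into classes can be sketched in a few lines. This is a minimal Python illustration rather than part of the lecture's R/Bioconductor workflow; the factor names, levels, and the replicate count are hypothetical.

```python
from itertools import product

# Hypothetical factors and levels, following the "treated vs untreated, 12 vs 24 hours" example.
factors = {
    "treatment": ["untreated", "treated"],
    "time": ["12h", "24h"],
}
replicates_per_class = 3  # assumed value; the same number is used in every class

# Each combination of factor levels defines one class.
classes = ["_".join(levels) for levels in product(*factors.values())]

# One entry per sample: (sample name, class).
samples = [
    (f"{cls}_rep{rep}", cls)
    for cls in classes
    for rep in range(1, replicates_per_class + 1)
]

print(classes)
print(len(samples), "samples in total")
```

Two factors with two levels each give four classes, so the number of arrays grows quickly with every added factor.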
Replicates
- You need replicates to estimate the variability of your measurements
- Without variability estimates you cannot assess the statistical significance of your observations
- Biological Replicate: Make a new RNA sample from an independent source (e.g. new animal, or parallel cell line culture)
- Technical Replicate: Use the same RNA sample for a new hybridization
- With Affymetrix microarrays, you will need only biological replicates
Experimental Design: Important Points
- Clearly define your biological questions
- Replicate experiments
- Biological variability must be factored in through replication
- repeat the experiment using different biological samples
- Use clear and balanced designs
- use the same number of replicates in every class
- Minimize experimental variability
- Experimental variability arises from different platforms, different protocols, different experimenters, different days, etc.
- Minimize all these factors
- Control the assumptions of your design
- Many studies on human patients assume two-class designs; however, the patients may exhibit heterogeneous phenotypes (e.g. different cancer stages) and hence different transcriptomes
- The explorative analysis might reveal a different picture than you expected
.CEL Files
- .CEL files are produced by the Affymetrix proprietary software after essential image processing.
- You can download .CEL files from databases (e.g. GEO, the Gene Expression Omnibus) and analyze them using rma (available in R/Bioconductor packages such as affy)
- Otherwise, pre-processed data are available in SOFT or MINiML textual formats
- If Affymetrix data have not been processed using rma and .CEL files are available, it is better to download the .CEL files and process them yourself
- .CEL files are also valuable for computing A/P (absent/present) detection counts
RMA (Robust Multi-array Average) is a multi-chip analysis method.
Explorative Analysis Concepts
- Sample signal distribution
- You need to check if signals, on a by-sample basis, are affected by global biases
- Such biases can be corrected, if they are not too extreme
- Extreme or persisting biases are a sign of sample preparation or hybridization problems
- Similarity relations within and between classes
- You need to check if relations between samples satisfy the assumptions behind your experimental design
- Samples from the same class are expected to be more similar than samples from different classes
Signal Distribution
- Use boxplots on log2-scale signals
- If you have used rma, you will notice that boxplots are extremely similar
- That's because rma incorporates a normalization step
- Normalization is a pre-processing technique that corrects for differences in signal distribution among samples.
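The numbers behind a per-sample boxplot can be computed directly. The sketch below is in Python rather than the lecture's R, and the intensities are hypothetical; it only shows what a global bias looks like in the quartiles of log2-scale signals.

```python
import math
import statistics

# Hypothetical raw intensities for three samples (one value per probeset).
raw = {
    "sample_A": [120.0, 340.0, 56.0, 2100.0, 800.0, 95.0],
    "sample_B": [130.0, 360.0, 60.0, 2300.0, 790.0, 100.0],
    "sample_C": [480.0, 1400.0, 230.0, 8300.0, 3100.0, 390.0],  # globally brighter
}

def boxplot_stats(values):
    """Return (Q1, median, Q3) of the log2-transformed values."""
    logged = [math.log2(v) for v in values]
    q1, median, q3 = statistics.quantiles(logged, n=4, method="inclusive")
    return q1, median, q3

for name, values in raw.items():
    q1, median, q3 = boxplot_stats(values)
    print(f"{name}: Q1={q1:.2f} median={median:.2f} Q3={q3:.2f}")
```

Here sample_C has a similar spread to the others but its whole box sits higher, which is the kind of global bias that normalization is meant to correct.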
A/P Detection Rates
- P (or A) rates are independent of rma and its normalization
- The rate should be similar across samples
- If there are strong differences, biases are likely impossible to remove even using normalization
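As a rough illustration of the check described above, the present rate is simply the fraction of probesets called "P" in each sample. The calls below are hypothetical and the code is a Python sketch; in R the calls would typically come from a detection-call function such as mas5calls in the affy package.

```python
# Hypothetical detection calls per probeset ("P" = present, "A" = absent).
calls = {
    "sample_A": ["P", "P", "A", "P", "A", "A", "P", "P"],
    "sample_B": ["P", "A", "A", "P", "P", "A", "P", "P"],
    "sample_C": ["A", "A", "A", "P", "A", "A", "A", "P"],
}

def present_rate(sample_calls):
    """Fraction of probesets called present in one sample."""
    return sample_calls.count("P") / len(sample_calls)

rates = {name: present_rate(sample_calls) for name, sample_calls in calls.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%} present")
```

In this made-up example sample_C has a much lower present rate than the other two, which would flag it for closer inspection.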
Normalizing Signal Distribution
- If you are not using Affymetrix + rma, you will have to take care of normalization yourself
- We don't have enough time here to review the different procedures
- Quantile normalization is often regarded as a standard
- If you are using Illumina microarrays, follow the instructions in the R/Bioconductor package lumi
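The idea behind quantile normalization can be shown on a toy matrix. This is a generic Python sketch under simple assumptions (no missing values, ties broken by position), not the exact implementation used by rma or lumi, and the signals are hypothetical.

```python
def quantile_normalize(samples):
    """Quantile-normalize equal-length samples (one list of signals per sample)."""
    n = len(samples[0])
    # Sort each sample, then average across samples rank by rank.
    sorted_samples = [sorted(sample) for sample in samples]
    rank_means = [
        sum(sample[rank] for sample in sorted_samples) / len(samples)
        for rank in range(n)
    ]
    normalized = []
    for sample in samples:
        # Positions of the values ordered from smallest to largest.
        order = sorted(range(n), key=lambda i: sample[i])
        out = [0.0] * n
        for rank, position in enumerate(order):
            out[position] = rank_means[rank]  # replace each value by the mean for its rank
        normalized.append(out)
    return normalized

# Hypothetical log2 signals: 3 samples x 4 probesets.
samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(samples)
for sample in normalized:
    print([round(v, 2) for v in sample])
```

After this step every sample contains exactly the same set of values, only in a different order, which is why boxplots of normalized data look nearly identical.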
Quality check methods
- Hierarchical clustering
- Groups samples into hierarchical groups, using a similarity measure (comparison of gene expression levels of samples)
- Hierarchical groups are visualized using a dendrogram
- Define a similarity metric
- Compute similarity among samples (Pearson correlation on log2 rma signals)
- Run hierarchical clustering algorithm
- Plot dendrogram
- Analyze if sample groups correctly reproduce classes, according to the experimental design
- PCA
- Projects samples into a new space, by capturing common patterns of gene expression
- Usually samples are visualized in a 2D space (biplot)
- Projection: in microarray explorative analysis, samples are treated as objects and genes/transcripts as dimensions
- Here we focus on the PCA of samples as objects, although PCA of genes as objects can be done as well (useful to identify functional groups)
- The new dimensions are interpreted as
- meta-genes
- i.e. the distilled expression patterns across samples of all genes in the data-set
- If the factors in the experimental design have a strong effect on gene expression, PCA is expected to reveal groups of samples corresponding to classes
- Different PCs often "explain" different factors of the experimental design
- It is important to standardize the data before running PCA
- Otherwise, the first principal component is unlikely to account for variability related to the experimental design
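The hierarchical clustering steps listed above (define a similarity metric, compute similarities, cluster, check groups against classes) can be sketched as follows. This is a Python illustration, not the lecture's R code; the choice of average linkage, the distance 1 - Pearson correlation, and the sample profiles are assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (members_a, members_b, distance)."""
    names = list(profiles)
    dist = {
        frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average distance over all pairs of members of the two clusters.
                total = sum(dist[frozenset((a, b))] for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical log2 signals for 6 genes in 4 samples (2 classes, 2 replicates each).
profiles = {
    "ctrl_1": [7.1, 8.3, 5.2, 9.8, 6.4, 7.7],
    "ctrl_2": [7.0, 8.5, 5.0, 9.9, 6.5, 7.5],
    "treat_1": [9.2, 6.1, 5.3, 9.7, 8.6, 5.4],
    "treat_2": [9.0, 6.3, 5.1, 9.9, 8.8, 5.2],
}
merges = average_linkage(profiles)
for members_a, members_b, d in merges:
    print(members_a, "+", members_b, f"at distance {d:.3f}")
```

The merge history is the information a dendrogram draws: replicates of the same class should join at small distances before the classes themselves are joined.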
PCA: PC selection
- It is common to look at the first two principal components and figure out if they make sense biologically
- Here we take a more nuanced approach
- We can compare the eigenvalues from PCA of the real data with those from PCA of randomized data to evaluate which components are "significant"
- There is no space here to delve into the mechanics of this; just use the R functions provided later.
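For readers curious about the idea, the comparison between real and randomized eigenvalues can be sketched as below. This is a Python illustration under stated assumptions and not the R functions referred to above: the data are simulated, randomization is done by shuffling each gene across samples, and the 95% cutoff is an assumed rule.

```python
import math
import random

def standardize(genes):
    """Scale each gene (row) to zero mean and unit variance across samples."""
    out = []
    for row in genes:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (len(row) - 1))
        out.append([(v - mean) / sd for v in row])
    return out

def gram_matrix(genes):
    """Samples x samples cross-product matrix; it shares its non-zero eigenvalues with the PCA."""
    n = len(genes[0])
    return [[sum(g[i] * g[j] for g in genes) for j in range(n)] for i in range(n)]

def eigenvalues(matrix, sweeps=30):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-12:
                    continue
                diff = a[q][q] - a[p][p]
                if diff == 0:
                    theta = math.copysign(math.pi / 4, a[p][q])
                else:
                    theta = 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def explained_fractions(genes):
    """Fraction of total variance carried by each principal component."""
    values = eigenvalues(gram_matrix(standardize(genes)))
    total = sum(values)
    return [v / total for v in values]

random.seed(0)
# Simulated data: 60 genes x 6 samples (3 control, 3 treated); the first 30 genes respond.
genes = []
for g in range(60):
    shift = 2.0 if g < 30 else 0.0
    genes.append([random.gauss(8.0, 0.5) + (shift if s >= 3 else 0.0) for s in range(6)])

real = explained_fractions(genes)

# Randomized data: shuffling each gene across samples breaks patterns shared between genes.
n_random = 100
null = [
    explained_fractions([random.sample(row, len(row)) for row in genes])
    for _ in range(n_random)
]

# Assumed rule: a PC is "significant" if it beats 95% of its randomized counterparts.
cutoffs = [sorted(fr[k] for fr in null)[int(0.95 * n_random) - 1] for k in range(len(real))]
for k in range(3):
    print(f"PC{k + 1}: real={real[k]:.2f} cutoff={cutoffs[k]:.2f} significant={real[k] > cutoffs[k]}")
```

Each gene is standardized before the decomposition, in line with the advice to standardize data before running PCA.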
PCA: beyond biplots
- A major limitation in visualizing PCA results is that the human observer is restricted to 2D spaces
- In relatively simple microarray experiments, with a few experimental factors, only 1 or 2 PCs are really significant, hence a single biplot often suffices
- In case you have more than 2 significant PCs:
- Explore combinations of 2 PCs at a time
- Use a 3D plot; since 3D plots end up being 2D on screens and paper, tune the projection parameters so that the most significant component is least affected by the projection
- Other, more advanced applications
Differential Statistics
- Once we have the expression matrix
- Genes x Samples, signal related to transcript level
- We need to identify which genes behave in an interesting way in relation to the experimental design
- The statistical technique used depends on the biological question asked
Two-class Design: Strength of change vs Test Statistics
- Log2 (ratio of class means)
- is a measure of strength of the change
- works on the means of classes, therefore it is unable to formally account for uncertainty in class mean estimation
- t-test
- is a test statistic, based on the statistical theory of inference, and specifically on hypothesis testing
- formally accounts for uncertainty in class mean estimation
- results in a p-value: the probability, assuming the null hypothesis (i.e. no change in expression) is true, of observing a test statistic at least as extreme as the one obtained; a low p-value corresponds to a significant change
- it is not a measure of the strength of change, although a strong change, if consistent across samples, results in a very low p-value
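The difference between the two quantities is easiest to see on two genes with the same change but different variability. The sketch below is in Python with hypothetical values; it works on log2-scale signals, where the log2 ratio becomes a difference of class means, and it uses Welch's form of the t statistic as one possible choice. No p-value is computed, since that would require the t distribution.

```python
import math
import statistics

def log2_ratio(control, treated):
    """Difference of class means on log2-scale signals."""
    return statistics.mean(treated) - statistics.mean(control)

def welch_t(control, treated):
    """Welch's two-sample t statistic: difference of means over its standard error."""
    se = math.sqrt(
        statistics.variance(control) / len(control)
        + statistics.variance(treated) / len(treated)
    )
    return (statistics.mean(treated) - statistics.mean(control)) / se

# Hypothetical log2 signals: same change in both genes, different variability.
gene_consistent = {"control": [7.0, 7.1, 6.9], "treated": [8.0, 8.1, 7.9]}
gene_noisy = {"control": [6.0, 7.0, 8.0], "treated": [7.0, 8.0, 9.0]}

for name, gene in [("consistent", gene_consistent), ("noisy", gene_noisy)]:
    ratio = log2_ratio(gene["control"], gene["treated"])
    t = welch_t(gene["control"], gene["treated"])
    print(f"{name}: log2 ratio={ratio:.2f} t={t:.2f}")
```

Both genes have a log2 ratio of about 1 (a two-fold change), but only the consistent gene gets a large t statistic, because the t statistic divides the change by its uncertainty.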
Two-class Design: moderated t-test
- A problem with the t-test is the small number of replicates in the usual microarray study, which makes the variance estimate of each single gene unreliable
- To overcome this problem, several modified t-statistics, usually called moderated t, have been introduced
- They either define a minimal variation coefficient, or otherwise exploit the availability of many gene expression levels to better estimate single gene variance
- We will utilize the moderated t defined in the Bioconductor package limma
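The shrinkage idea behind a moderated t can be sketched as follows. This is a simplified Python illustration, not limma itself: limma estimates the prior variance and prior degrees of freedom from all genes by empirical Bayes, whereas here `prior_var` and `prior_df` are hypothetical values supplied by hand, and the gene data are made up.

```python
import math
import statistics

def pooled_variance(control, treated):
    """Pooled within-class variance and its degrees of freedom."""
    df = len(control) + len(treated) - 2
    ss = (len(control) - 1) * statistics.variance(control)
    ss += (len(treated) - 1) * statistics.variance(treated)
    return ss / df, df

def moderated_t(control, treated, prior_var, prior_df):
    """t statistic with the gene variance shrunk toward a prior variance."""
    s2, df = pooled_variance(control, treated)
    shrunk = (prior_df * prior_var + df * s2) / (prior_df + df)
    se = math.sqrt(shrunk * (1 / len(control) + 1 / len(treated)))
    return (statistics.mean(treated) - statistics.mean(control)) / se

# Hypothetical gene whose few replicates happen to agree almost perfectly.
control = [7.00, 7.01, 6.99]
treated = [7.30, 7.31, 7.29]

ordinary = moderated_t(control, treated, prior_var=0.0, prior_df=0.0)  # no shrinkage
moderated = moderated_t(control, treated, prior_var=0.04, prior_df=4.0)  # assumed prior
print(f"ordinary t={ordinary:.1f} moderated t={moderated:.1f}")
```

With only three replicates per class, a tiny variance obtained by chance inflates the ordinary t; pulling the variance toward a value typical of other genes gives a more cautious statistic.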
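# Student note: this line works when pasted into the R console but fails when selected and run from the file window. A possible cause is a different working directory, since list.files() returns bare file names unless full.names = TRUE.
es.affy <- read.affybatch(filenames = list.files("D:/Dropbox/BMS8110/GSE11352/GSE11352_RAW", full.names = TRUE), phenoData = es.pData)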