
• Define the basic ("elemental") concept of microarrays.

• Describe the utility of microarray analysis.

• Describe the different sources of variability in
microarray analysis.
• Describe the linear technique used to normalize
microarray data.
– Describe the role of statistics in the normalization
techniques described today.
– Describe the different transformation techniques
mentioned today.
– Distinguish between the different transformation
techniques described today.
• Describe the different pairwise comparison techniques
used to test for independence among genes.
• Layman's terms:
 A DNA microarray (also commonly known as a gene
chip, DNA chip, or biochip) is a collection of
microscopic DNA spots attached to a solid surface.

 Scientists use DNA microarrays to measure the
expression levels of large numbers of genes
simultaneously or to genotype multiple regions of
a genome.
 http://www.sciencedaily.com/articles/d/dna_microarray.htm
• Which genes are related?

• Which genes cause a certain disease?

• What subcategories of disease X are there?

• How certain can we be about this?

• Don’t expect it to fix bad data!


• Microarray data are inherently highly variable.
– YOU are measuring mRNA levels

• Some of this variability is relevant, since it
corresponds to the differential expression of
genes.

• Unfortunately, a large portion comes from undesirable
biases introduced during the many technical
steps of the experimental procedure.
• Biological variability

• RNA extraction

• Probe labeling
– Ex: dye differences

• Printing
– Ex: print-order, plate-order, clone variation

• Hybridization
– Ex: temperature, time, mixing technique
• Human
– Ex: variation between lab researchers
• Scanning
– Ex: laser & detector, chemistry of the fluorescent label

• Image analysis
– Ex: identification, quantification, background methods
• Raw Exploration
• Normalization
– Logarithmic Transformation (adjustment of variances)
– M vs. A plot (rotation of the logarithmic transformation)
• This method adjusts the median of the differences (M) to 0.
– Background transformation (RMA background approach,
used for linear scenarios) to minimize the noise in the
observed signal
– Averaging normalization techniques
• After normalization of all of the spots in the microarray
chip, we average them to obtain a more stable master
slide.
– Establish the cutoff points
• Naïve approach (establish cutoff points by log ratios)
• Justifiable approach (establish cutoff points by the T-statistic)
• Statistical analysis
– For each gene i we have the hypothesis test:
– Null (neutral) hypothesis H0,i: Mi = 0
– Alternative hypothesis H1,i: Mi ≠ 0

• Post-hoc pairwise comparisons


– Minimize false positives
• At first, your data would probably be like this:
Observed data (R, G):
R = signal in the red channel
G = signal in the green channel

These raw intensities are large numbers that are
unwieldy to work with, so we need a more convenient
scale for them…
• Not to be confused with normalization in the statistical sense of
transforming the data distribution into a normal (Gaussian)
distribution.

• Normalization of microarray data aims to correct for systematic
measurement errors and bias in the observed data.

• The process of normalization can be classified into linear and non-linear
normalization.
– Linear: applied to selected genes or globally; quite suitable
for consistent data.
– Non-linear: more precise for data at extreme values, but requires a
reference gene set.

• The purpose of both methods is to bring each image in the microarray
data to the same average brightness using statistical modeling, as in the
sketch below.
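Below is a minimal sketch of this linear (global) idea, assuming each slide is stored as a NumPy vector of raw spot intensities; the function name and the gamma-distributed toy slides are illustrative, not part of any microarray package.

```python
import numpy as np

def global_median_normalize(arrays):
    """Scale each array so that all slides share the same median brightness.

    `arrays` is a list of 1-D NumPy vectors of raw spot intensities,
    one vector per slide (illustrative data layout, not a fixed API).
    """
    medians = np.array([np.median(a) for a in arrays])
    target = medians.mean()                      # common "average brightness"
    return [a * (target / m) for a, m in zip(arrays, medians)]

# Toy usage: two slides with different overall brightness
rng = np.random.default_rng(0)
slide1 = rng.gamma(shape=2.0, scale=500.0, size=1000)
slide2 = rng.gamma(shape=2.0, scale=800.0, size=1000)   # brighter slide
norm1, norm2 = global_median_normalize([slide1, slide2])
print(np.median(norm1), np.median(norm2))               # now comparable
```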
• Expectation: Most genes are non-differentially expressed
– i.e. most of the data points should be around M=0.

• Idea: Do various exploratory plots to see if this assumption is met.


– For example, M vs. A plots, spatial plots, density & box plots, print-
order plots, etc.

• Result: We commonly observe something like this:

– Measured value = real value + systematic errors + noise

• Correction: If so, normalize the data to get rid of the errors &
noise:

– Corrected value ≈ real value (systematic errors and noise removed)
Logarithmic Transformation

[Scatter plot of log2R vs. log2G; the diagonal line log2R = log2G marks
equal expression in both channels]

Why log2? A twofold change in expression corresponds to a one-unit
change on the log2 scale, and up- and down-regulation become symmetric
around 0.
• M vs. A is basically a
rotation of the log2R vs.
log2G scatter plot.
• Now the quantity of
interest, i.e. the fold
change, is contained in
one variable, namely M!

• Transformed data (M, A):

 M = log2(R) - log2(G) = log2(R/G) (log-ratio)
 A = ½·[log2(R) + log2(G)] = ½·log2(R·G) (log-intensity)

[Three panels: R vs. G, log2(R) vs. log2(G), and M vs. A, where
R = red channel signal and G = green channel signal]

A small sketch of this transformation is given below.
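Below is a minimal sketch of the M/A rotation defined above, together with the median-centering step ("adjust the median of the differences to 0") from the normalization outline; the variable names and toy intensities are illustrative.

```python
import numpy as np

def ma_transform(R, G):
    """Rotate (log2 R, log2 G) into (M, A) as defined above."""
    M = np.log2(R) - np.log2(G)           # log-ratio: fold change in one variable
    A = 0.5 * (np.log2(R) + np.log2(G))   # log-intensity: average brightness
    return M, A

def median_center(M):
    """Shift M so its median is 0 (most genes assumed non-differential)."""
    return M - np.median(M)

# Toy usage with positive two-channel intensities
rng = np.random.default_rng(1)
G = rng.gamma(2.0, 500.0, size=2000)
R = G * rng.lognormal(mean=0.1, sigma=0.3, size=2000)   # small global dye bias
M, A = ma_transform(R, G)
M = median_center(M)
print(round(float(np.median(M)), 3))   # ~0 after centering
```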
• It stands for Robust Multichip Average (Irizarry, 2003)

• More robust than the Lowess (aka Loess) technique.

• Mostly used in Affymetrix microarray data.

• It is biologically sound to assume that fluorescence
intensities from a microarray experiment are composed
of both signal and noise, and that the noise is
"omnipresent" throughout the entire signal distribution.

• A convolution model of a signal distribution and a noise
distribution is a good choice in such a situation.
• A convolution model is a mathematical
operation on two functions f and g, producing a
third function that is typically viewed as a modified
version of one of the original functions.

[Diagram: observed data = fluorescent signal + background noise]
• The RMA correction, E(Si | Xi = xi), is used as the
background intensity correction for gene i; it is applied to
all genes in the microarray to minimize the noise in the
observed signal (a sketch is given below).
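Below is a hedged sketch of a convolution-model background correction in the spirit of RMA, assuming the observed intensity X is the sum of an exponential signal S and Gaussian noise Y; the closed form for E(S | X = x) used here is the one commonly quoted for this model, and the parameter values are illustrative rather than estimated from the data as real RMA would do.

```python
import numpy as np
from scipy.stats import norm

def rma_style_background_correct(x, mu, sigma, alpha):
    """Return E(S | X = x) under the convolution model X = S + Y,
    with signal S ~ Exponential(alpha) and noise Y ~ Normal(mu, sigma^2).

    Commonly quoted closed form for this model; real RMA estimates
    (mu, sigma, alpha) from the observed intensities, which is omitted here.
    """
    a = x - mu - sigma**2 * alpha
    b = sigma
    num = norm.pdf(a / b) - norm.pdf((x - a) / b)
    den = norm.cdf(a / b) + norm.cdf((x - a) / b) - 1.0
    return a + b * num / den

# Toy usage: illustrative parameter values, not estimates from real data
x = np.array([100.0, 300.0, 1000.0, 5000.0])
corrected = rma_style_background_correct(x, mu=90.0, sigma=25.0, alpha=0.003)
print(np.round(corrected, 1))   # background-corrected signals, all positive
```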
• Useful when there are multiple segments (spots) of the same
gene.
• Combines all segments of the same gene into a single
averaged, transformed value.
• A t-test can then be applied to test whether the mean
differs between two conditions.
• ANOVA can be applied to test whether the mean differs
across two or more conditions (see the sketch after the
diagram below).
[Diagram: each replicate slide is normalized, then the slides are
combined into a single average slide]
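Below is a minimal sketch of collapsing replicate spots of a gene into one averaged value and then applying the t-test / ANOVA checks mentioned above; the gene labels, condition layout, and toy numbers are illustrative.

```python
import numpy as np
from scipy.stats import ttest_ind, f_oneway

def average_replicate_spots(values, gene_ids):
    """Collapse multiple spots (segments) of the same gene into one mean value."""
    grouped = {}
    for v, g in zip(values, gene_ids):
        grouped.setdefault(g, []).append(v)
    return {g: float(np.mean(v)) for g, v in grouped.items()}

# Toy example: normalized M values for repeated spots of two genes
spots = average_replicate_spots(
    values=[0.9, 1.1, 1.0, -0.1, 0.1, 0.0],
    gene_ids=["geneA", "geneA", "geneA", "geneB", "geneB", "geneB"],
)
print(spots)   # one averaged value per gene

# Two conditions -> t-test; three or more -> one-way ANOVA
cond1 = [1.0, 1.2, 0.9, 1.1]
cond2 = [0.1, -0.2, 0.0, 0.2]
cond3 = [0.5, 0.6, 0.4, 0.5]
print(ttest_ind(cond1, cond2))        # are the two means different?
print(f_oneway(cond1, cond2, cond3))  # are the means equal across all three?
```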
Naïve approach
• Establish cutoff points
by log ratios.
– This has to be done after the
M vs. A transformation &
background correction.
– The top and bottom 0.5 of
the absolute M values are
trimmed off.

Justifiable approach
• Establish cutoff points using the T-
statistic via Significance Analysis
of Microarrays (SAM)*

– For replicated data, i.e.
multiple measurements of the
same thing, we trust this
approach more when the deviation
(std. dev.) is small.

– T = mean(x) / SE(x), where x denotes
the replicated M values for the gene
and SE(x) their standard error.

– The M axis is the only one
transformed by T.

– If the deviation is large, we do
not trust it as much (stick
with the naïve approach).

* SAM is available as an R package / Excel add-in.

A sketch contrasting the two approaches is given below.
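Below is a minimal sketch contrasting the naïve cutoff on |M| with the justifiable T = mean(x) / SE(x) statistic on replicated M values; the cutoff of 1 and the toy replicates are illustrative, and the full SAM procedure (fudge factor, permutation-based significance) is omitted.

```python
import numpy as np

def t_statistic(m_replicates):
    """T = mean(x) / SE(x) for one gene's replicated M values."""
    x = np.asarray(m_replicates, dtype=float)
    se = x.std(ddof=1) / np.sqrt(len(x))
    return x.mean() / se

# Replicated M values (log-ratios) for two illustrative genes
noisy_gene  = [2.5, 0.1, 2.8, -0.6]   # large average M, but huge spread
stable_gene = [0.8, 0.9, 0.7, 0.8]    # modest average M, very small spread

for name, reps in [("noisy_gene", noisy_gene), ("stable_gene", stable_gene)]:
    mean_m = np.mean(reps)
    print(name,
          "| naive cutoff |mean M| > 1:", bool(abs(mean_m) > 1.0),  # naive call
          "| T:", round(t_statistic(reps), 2))                      # justifiable call
```

The noisy gene passes the naïve cutoff yet has a small T, while the consistent gene fails the naïve cutoff but has a very large T, which is exactly why replication plus the T-statistic is the wiser choice.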
• For each gene i we have the hypothesis
test:
 Which genes or groups are (most) differentially
expressed?
H0,i: Mi = 0
H1,i: Mi ≠ 0
 α = 5%
CI = 95%
• Thousands of tests, i.e. each gene is tested
against H0: T = 0.
– False-positive problems are a serious threat.
– We need to adjust the p-values.

• Different adjustment procedures

– Pairwise comparison post-hoc tests:
 Bonferroni (best in linear situations)
 Tukey
 Sidak
 Duncan
 Holm
• Multiple tests form a "family" of tests.
• Together they produce a list of "significant genes".
• Then the family-wise error (FWE) = 0.05.
• Bonferroni correction: set k = p/m
• Where: k = new per-test significance threshold; p = original α; m = number of
post-hoc comparisons performed (see the sketch below).
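Below is a minimal sketch of the Bonferroni idea: compare each raw p-value against k = α/m (equivalently, multiply each p-value by m); the p-values are illustrative, and any of the other corrections listed above could be substituted.

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Flag tests that stay significant after Bonferroni correction.

    Equivalent views: compare each p to alpha/m, or compare m*p to alpha.
    """
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    threshold = alpha / m                  # "k = p/m" per-test cutoff
    adjusted = np.minimum(p * m, 1.0)      # Bonferroni-adjusted p-values
    return p < threshold, adjusted

# Toy usage: raw p-values from per-gene tests
raw_p = [0.0001, 0.003, 0.02, 0.04, 0.3]
significant, adjusted = bonferroni(raw_p, alpha=0.05)
print(significant)   # only the genes that survive the stricter cutoff
print(adjusted)
```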
 To sort and rank data.

 To reduce a data set of 1000s of genes to 10s or
100s (via averaging normalization
techniques).

 As a guide in selecting which genes to
validate more precisely and which not to.
• Filter out bad spots.
• Adjust low intensities.
• Normalize background noise and raw data.
• Calculate average ratios and statistical
significance values per gene.
• Perform pairwise post hoc comparisons to
minimize false positives.
• There are many different statistical significance
metrics.
– T-test (P values), SAM (T values), Wilcoxon RST,
ANOVA (F-statistics), many more…
– Just many variations on a theme!
• Choose one (or more!) wisely.
• BUT: don’t let it make decisions for you!

• There will always be false positives (there is no
post-hoc test that can eliminate them all!).

• The most accurate tool for validating the results is
the researcher's judgment, with the help of a
biostatistician's keen point of view, of course!…
• You need replication and statistics to find real
differences between genes.

• In most cases the naïve approach (cutoff points by
log ratios) is not enough.

• Cutoff points by t-statistics are a much wiser choice.

• Look out for false positives.

• Multiple testing = you must adjust the p-values.


 dChip
 Affymetrix
 R
 Bioconductor
 BRB-ArrayTools (NCI Biometric Research Branch)
 MATLAB Bioinformatics Toolbox
 GeneSpring
 Partek
• For further reading regarding the non-linear normalization of
microarrays please visit:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC126873/pdf/gb-2002-3-9-research0048.pdf
1. Good image analysis is essential. Some software packages are
obsolete and not very good.
2. Normalization is needed. We understand more now
than a few years ago.
3. Use at least the t-statistics to identify differentially
expressed genes. Do not rely exclusively on log-ratios.
4. Multiple testing must be considered for false positives;
adjust your p-values.
5. Talk to a biostatistician before doing the experiments!
They too have a family to feed, thanks to your work!
• Bengtsson, H. Analysis of Microarray Data. hb@maths.lth.se

• Brown, S. (2009). Microarray Data Analysis. Retrieved September 8, 2011,
from http://www.docstoc.com/docs/5822653/Microarray-Data-Analysis

• Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP.
(2003). Summaries of Affymetrix GeneChip probe level data.
Nucleic Acids Res. 31:e15.

• Wit, E. The Use of Statistics in Microarray Studies.
http://www.stats.gla.ac.uk/~microarray

• Wikipedia. MA plot. Retrieved September 8, 2011, from
http://en.wikipedia.org/wiki/MA_plot
