
• Define the basic ("elemental") concept of microarrays.

• Describe the utility of microarray analysis.

• Describe the different sources of variability in
microarray analysis.
• Describe the linear technique used to normalize
microarray data.
– Describe the role of statistics in the normalization
techniques described today.
– Describe the different transformation techniques
mentioned today.
– Distinguish between the different transformation
techniques described today.
• Describe the different pairwise comparison techniques
used to test for independence among genes.
• Layman's terms:
 A DNA microarray (also commonly known as a gene
chip, DNA chip, or biochip) is a collection of
microscopic DNA spots attached to a solid surface.

 Scientists use DNA microarrays to measure the
expression levels of large numbers of genes
simultaneously or to genotype multiple regions of
a genome.
 http://www.sciencedaily.com/articles/d/dna_microarray.htm
• Which genes are related?

• Which genes cause a certain disease?

• What subcategories of disease X are there?

• How certain can we be about this?

• Don’t expect it to fix bad data!


• Microarray data are inherently highly variable.
– YOU are measuring mRNA levels

• Some of this variability is relevant, since it
corresponds to the differential expression of
genes.

• Unfortunately, a large portion comes from undesirable
biases introduced during the many technical
steps of the experimental procedure.
• Biological variability

• RNA extraction

• Probe labeling
– Ex: dye differences

• Printing
– Ex: print-order, plate-order, clone variation

• Hybridization
– Ex: temperature, time, mixing technique
• Human
– Ex: variation between lab researchers
• Scanning
– Ex: laser & detector, chemistry of the fluorescent label

• Image analysis
– Ex: identification, quantification, background methods
• Raw Exploration
• Normalization
– Logarithmic Transformation (adjustment of variances)
– M vs. A plot (rotation of the logarithmic transformation)
• This method adjusts the median of the differences (M) to 0.
– Background transformation (RMA background approach,
used for linear scenarios) to minimize the noise in the
observed signal
– Averaging normalization techniques
• After normalization of all of the spots in the microarray
chip, we average them to obtain a more stable master
slide.
– Establish the cutoff points
• Naïve approach (establish cutoff points by log ratios)
• Justifiable approach (establish cutoff points by the T-statistic)
• Statistical analysis
– For each gene i we have the hypothesis test:
– Null (neutral) hypothesis H0,i: Mi = 0
– Alternative hypothesis H1,i: Mi ≠ 0

• Post-hoc pairwise comparisons


– Minimize false positives
• At first, your data would probably be like this:
Observed data (R, G):
R = signal in the red channel
G = signal in the green channel

These raw intensities are large numbers that are
unwieldy to work with, so we need a more convenient
scale for them…
• Not to be confused with normalization in the statistical sense of
transforming the data distribution into a normal (Gaussian)
distribution.

• Normalization of microarray data aims to correct for systematic
measurement errors and bias in the observed data.

• The process of normalization can be classified into linear and non-linear
normalization.
– Linear: applied to selected genes or globally; quite suitable
for consistent data.
– Non-linear: more precise for data at extreme values, but requires a
reference gene set.

• The purpose of both methods is to bring each image in the microarray
data to the same average brightness using statistical modeling, as in the
sketch below.
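Below is a minimal sketch of this linear (global) idea, assuming each slide is stored as a NumPy vector of raw spot intensities; the function name and the gamma-distributed toy slides are illustrative, not part of any microarray package.

```python
import numpy as np

def global_median_normalize(arrays):
    """Scale each array so that all slides share the same median brightness.

    `arrays` is a list of 1-D NumPy vectors of raw spot intensities,
    one vector per slide (illustrative data layout, not a fixed API).
    """
    medians = np.array([np.median(a) for a in arrays])
    target = medians.mean()                      # common "average brightness"
    return [a * (target / m) for a, m in zip(arrays, medians)]

# Toy usage: two slides with different overall brightness
rng = np.random.default_rng(0)
slide1 = rng.gamma(shape=2.0, scale=500.0, size=1000)
slide2 = rng.gamma(shape=2.0, scale=800.0, size=1000)   # brighter slide
norm1, norm2 = global_median_normalize([slide1, slide2])
print(np.median(norm1), np.median(norm2))               # now comparable
```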
• Expectation: Most genes are non-differentially expressed
– i.e. most of the data points should be around M=0.

• Idea: Do various exploratory plots to see if this assumption is met.


– For example, M vs. A plots, spatial plots, density & box plots, print-
order plots, etc.

• Result: We commonly observe something like this:

– Measured value = real value + systematic errors + noise

• Correction: If so, normalize the data to get rid of the errors &
noise:

– Corrected value ≈ real value (systematic errors and noise removed)
Logarithmic Transformation

[Scatter plot of log2R vs. log2G; the diagonal line log2R = log2G marks
equal expression in both channels]

Why log2? A twofold change in expression corresponds to a one-unit
change on the log2 scale, and up- and down-regulation become symmetric
around 0.
• M vs. A is basically a
rotation of the log2R vs.
log2G scatter plot.
• Now the quantity of
interest, i.e. the fold
change, is contained in
one variable, namely M!

• Transformed data (M, A):

 M = log2(R) - log2(G) = log2(R/G) (log-ratio)
 A = ½·[log2(R) + log2(G)] = ½·log2(R·G) (log-intensity)

[Three panels: R vs. G, log2(R) vs. log2(G), and M vs. A, where
R = red channel signal and G = green channel signal]

A small sketch of this transformation is given below.
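Below is a minimal sketch of the M/A rotation defined above, together with the median-centering step ("adjust the median of the differences to 0") from the normalization outline; the variable names and toy intensities are illustrative.

```python
import numpy as np

def ma_transform(R, G):
    """Rotate (log2 R, log2 G) into (M, A) as defined above."""
    M = np.log2(R) - np.log2(G)           # log-ratio: fold change in one variable
    A = 0.5 * (np.log2(R) + np.log2(G))   # log-intensity: average brightness
    return M, A

def median_center(M):
    """Shift M so its median is 0 (most genes assumed non-differential)."""
    return M - np.median(M)

# Toy usage with positive two-channel intensities
rng = np.random.default_rng(1)
G = rng.gamma(2.0, 500.0, size=2000)
R = G * rng.lognormal(mean=0.1, sigma=0.3, size=2000)   # small global dye bias
M, A = ma_transform(R, G)
M = median_center(M)
print(round(float(np.median(M)), 3))   # ~0 after centering
```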
• It stands for Robust Multichip Average (Irizarry, 2003)

• More robust than the Lowess (aka Loess) technique.

• Mostly used in Affymetrix microarray data.

• It is biologically sound to assume that fluorescence
intensities from a microarray experiment are composed
of both signal and noise, and that the noise is
"omnipresent" throughout the entire signal distribution.

• A convolution model of a signal distribution and a noise
distribution is a good choice in such a situation.
• A convolution model is a mathematical
operation on two functions f and g, producing a
third function that is typically viewed as a modified
version of one of the original functions.

[Diagram: observed data = fluorescent signal + background noise]
• The RMA correction, E(Si | Xi = xi), is used as the
background intensity correction for gene i; it is applied to
all genes in the microarray to minimize the noise in the
observed signal (a sketch is given below).
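Below is a hedged sketch of a convolution-model background correction in the spirit of RMA, assuming the observed intensity X is the sum of an exponential signal S and Gaussian noise Y; the closed form for E(S | X = x) used here is the one commonly quoted for this model, and the parameter values are illustrative rather than estimated from the data as real RMA would do.

```python
import numpy as np
from scipy.stats import norm

def rma_style_background_correct(x, mu, sigma, alpha):
    """Return E(S | X = x) under the convolution model X = S + Y,
    with signal S ~ Exponential(alpha) and noise Y ~ Normal(mu, sigma^2).

    Commonly quoted closed form for this model; real RMA estimates
    (mu, sigma, alpha) from the observed intensities, which is omitted here.
    """
    a = x - mu - sigma**2 * alpha
    b = sigma
    num = norm.pdf(a / b) - norm.pdf((x - a) / b)
    den = norm.cdf(a / b) + norm.cdf((x - a) / b) - 1.0
    return a + b * num / den

# Toy usage: illustrative parameter values, not estimates from real data
x = np.array([100.0, 300.0, 1000.0, 5000.0])
corrected = rma_style_background_correct(x, mu=90.0, sigma=25.0, alpha=0.003)
print(np.round(corrected, 1))   # background-corrected signals, all positive
```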
• Useful when there are multiple segments (spots) of the same
gene.
• Combines all segments of the same gene into a single
averaged, transformed value.
• A t-test can then be applied to test whether the mean
differs between two conditions.
• ANOVA can be applied to test whether the mean differs
across two or more conditions (see the sketch after the
diagram below).
[Diagram: each replicate slide is normalized, then the slides are
combined into a single average slide]
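Below is a minimal sketch of collapsing replicate spots of a gene into one averaged value and then applying the t-test / ANOVA checks mentioned above; the gene labels, condition layout, and toy numbers are illustrative.

```python
import numpy as np
from scipy.stats import ttest_ind, f_oneway

def average_replicate_spots(values, gene_ids):
    """Collapse multiple spots (segments) of the same gene into one mean value."""
    grouped = {}
    for v, g in zip(values, gene_ids):
        grouped.setdefault(g, []).append(v)
    return {g: float(np.mean(v)) for g, v in grouped.items()}

# Toy example: normalized M values for repeated spots of two genes
spots = average_replicate_spots(
    values=[0.9, 1.1, 1.0, -0.1, 0.1, 0.0],
    gene_ids=["geneA", "geneA", "geneA", "geneB", "geneB", "geneB"],
)
print(spots)   # one averaged value per gene

# Two conditions -> t-test; three or more -> one-way ANOVA
cond1 = [1.0, 1.2, 0.9, 1.1]
cond2 = [0.1, -0.2, 0.0, 0.2]
cond3 = [0.5, 0.6, 0.4, 0.5]
print(ttest_ind(cond1, cond2))        # are the two means different?
print(f_oneway(cond1, cond2, cond3))  # are the means equal across all three?
```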
Naïve approach
• Establish cutoff points
by log ratios.
– This has to be done after the
M vs. A transformation &
background correction.
– The top and bottom 0.5 of
the absolute M values are
trimmed off.

Justifiable approach
• Establish cutoff points using the T-
statistic via Significance Analysis
of Microarrays (SAM)*

– For replicated data, i.e.
multiple measurements of the
same thing, we trust this
approach more when the deviation
(std. dev.) is small.

– T = mean(x) / SE(x), where x denotes
the replicated M values for the gene
and SE(x) their standard error.

– The M axis is the only one
transformed by T.

– If the deviation is large, we do
not trust it as much (stick
with the naïve approach).

* SAM is available as an R package / Excel add-in.

A sketch contrasting the two approaches is given below.
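Below is a minimal sketch contrasting the naïve cutoff on |M| with the justifiable T = mean(x) / SE(x) statistic on replicated M values; the cutoff of 1 and the toy replicates are illustrative, and the full SAM procedure (fudge factor, permutation-based significance) is omitted.

```python
import numpy as np

def t_statistic(m_replicates):
    """T = mean(x) / SE(x) for one gene's replicated M values."""
    x = np.asarray(m_replicates, dtype=float)
    se = x.std(ddof=1) / np.sqrt(len(x))
    return x.mean() / se

# Replicated M values (log-ratios) for two illustrative genes
noisy_gene  = [2.5, 0.1, 2.8, -0.6]   # large average M, but huge spread
stable_gene = [0.8, 0.9, 0.7, 0.8]    # modest average M, very small spread

for name, reps in [("noisy_gene", noisy_gene), ("stable_gene", stable_gene)]:
    mean_m = np.mean(reps)
    print(name,
          "| naive cutoff |mean M| > 1:", bool(abs(mean_m) > 1.0),  # naive call
          "| T:", round(t_statistic(reps), 2))                      # justifiable call
```

The noisy gene passes the naïve cutoff yet has a small T, while the consistent gene fails the naïve cutoff but has a very large T, which is exactly why replication plus the T-statistic is the wiser choice.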
• For each gene i we have the hypothesis
test:
 Which genes or groups are (most) differentially
expressed?
H0,i: Mi = 0
H1,i: Mi ≠ 0
 α = 5%
CI = 95%
• Thousands of tests, i.e. each gene is tested
against H0: T = 0.
– False-positive problems are a serious threat.
– We need to adjust the p-values.

• Different adjustment procedures

– Pairwise comparison post-hoc tests:
 Bonferroni (best in linear situations)
 Tukey
 Sidak
 Duncan
 Holm
• Multiple tests form a "family" of tests.
• Together they produce a list of "significant genes".
• Then the family-wise error (FWE) = 0.05.
• Bonferroni correction: set k = p/m
• Where: k = new per-test significance threshold; p = original α; m = number of
post-hoc comparisons performed (see the sketch below).
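Below is a minimal sketch of the Bonferroni idea: compare each raw p-value against k = α/m (equivalently, multiply each p-value by m); the p-values are illustrative, and any of the other corrections listed above could be substituted.

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Flag tests that stay significant after Bonferroni correction.

    Equivalent views: compare each p to alpha/m, or compare m*p to alpha.
    """
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    threshold = alpha / m                  # "k = p/m" per-test cutoff
    adjusted = np.minimum(p * m, 1.0)      # Bonferroni-adjusted p-values
    return p < threshold, adjusted

# Toy usage: raw p-values from per-gene tests
raw_p = [0.0001, 0.003, 0.02, 0.04, 0.3]
significant, adjusted = bonferroni(raw_p, alpha=0.05)
print(significant)   # only the genes that survive the stricter cutoff
print(adjusted)
```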
 To sort and rank data.

 To reduce a data set of 1000s of genes to 10s or
100s (via averaging normalization
techniques).

 As a guide in selecting which genes to
validate more precisely and which not to.
• Filter out bad spots.
• Adjust low intensities.
• Normalize background noise and raw data.
• Calculate average ratios and statistical
significance values per gene.
• Perform pairwise post hoc comparisons to
minimize false positives.
• There are many different statistical significance
metrics.
– T-test (P values), SAM (T values), Wilcoxon RST,
ANOVA (F-statistics), many more…
– Just many variations on a theme!
• Choose one (or more!) wisely.
• BUT: don’t let it make decisions for you!

• There will always be false positives (there is no
post-hoc test that can eliminate them all!).

• The most accurate tool for validating the results is
the researcher's judgment, with the help of a
biostatistician's keen point of view, of course!…
• You need replication and statistics to find real
differences between genes.

• In most cases the naïve approach (cutoff points by
log ratios) is not enough.

• Cutoff points by t-statistics are a much wiser choice.

• Look out for false positives.

• Multiple testing = you must adjust the p-values.


 dChip
 Affymetrix
 R
 Bioconductor
 BRB-ArrayTools (NCI Biometric Research Branch)
 MATLAB Bioinformatics Toolbox
 GeneSpring
 Partek
• For further reading regarding the non-linear normalization of
microarrays please visit:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC126873/pdf/gb-2002-3-9-research0048.pdf
1. Good image analysis is essential. Some software packages are
obsolete and not very good.
2. Normalization is needed. We understand more now
than a few years ago.
3. Use at least the t-statistics to identify differentially
expressed genes. Do not rely exclusively on log-ratios.
4. Multiple testing must be considered for false positives;
adjust your p-values.
5. Talk to a biostatistician before doing the experiments!
They too have a family to feed, thanks to your work!
• Bengtsson, H. Analysis of Microarray Data. hb@maths.lth.se

• Brown, S. (2009). Microarray Data Analysis. Retrieved September 8, 2011,
from http://www.docstoc.com/docs/5822653/Microarray-Data-Analysis

• Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP.
(2003). Summaries of Affymetrix GeneChip probe level data.
Nucleic Acids Res. 31:e15.

• Wit, E. The Use of Statistics in Microarray Studies.
http://www.stats.gla.ac.uk/~microarray

• Wikipedia. MA plot. Retrieved September 8, 2011, from
http://en.wikipedia.org/wiki/MA_plot
