
Use of the Half-Normal Probability Plot to Identify Significant Effects for Microarray Data

C. F. Jeff Wu
University of Michigan
(joint work with G. Dyson)

Outline
Current Methods
Proposed Methodology
Analysis Plan
Example
Conclusions

What are microarrays?
Two major types:
Oligonucleotide gene chips
Spotted glass arrays

Oligonucleotide chips: perfect match (PM) and mismatch (MM) probes are spotted onto a gene chip
~20 probes make up a probe set (or gene)
The MM probe for each gene has the middle base set to the complement of its PM probe
Labeled RNA corresponding to the PM probes is hybridized to the chip

Glass arrays involve the competitive hybridization of two RNA pools to cDNA spotted onto a glass slide
Typically thousands of genes on a slide

Multiplicity Problem
When we make more than one comparison in a hypothesis-testing situation, the usual interpretation of individual p-values breaks down
Control of the family-wise error rate is necessary to preserve the nominal type I error rate
Various approaches correct for the multiplicity-inflated chance of a type I error, including Tukey, Bonferroni, and Holm
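For reference, here is a minimal sketch (not from the slides) of the Bonferroni and Holm adjustments applied to a vector of raw p-values; the function names and the example p-values are hypothetical.

```python
import numpy as np

def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvals)
    return np.minimum(np.asarray(pvals, dtype=float) * m, 1.0)

def holm(pvals):
    """Holm step-down adjustment: multiply the i-th smallest p-value by (m - i)
    and enforce monotonicity."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * p[idx])
        adj[idx] = min(running_max, 1.0)
    return adj

# Hypothetical raw p-values, purely for illustration
raw = [0.001, 0.012, 0.030, 0.200]
print(bonferroni(raw))   # [0.004 0.048 0.12  0.8  ]
print(holm(raw))         # [0.004 0.036 0.06  0.2  ]
```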

Microarray Analysis Techniques
Westfall-Young step-down (WY)
Significance Analysis of Microarrays (SAM)
Empirical Bayes (EB)
Bayesian (MCMC)
Mixture modeling
Dimension reduction techniques
Machine learning

Westfall-Young (WY)
Compute ranks r_1, ..., r_k of the original test statistics such that |t_{r_1}| >= |t_{r_2}| >= ... >= |t_{r_k}|
Construct balanced permutations of the samples; for each permutation b, compute the same test statistics t_1^(b), ..., t_k^(b)
Compute u_k^(b) = |t_{r_k}^(b)| and u_j^(b) = max(u_{j+1}^(b), |t_{r_j}^(b)|) for j = k-1, ..., 1
Repeat for B permutations and calculate the adjusted p-value as p~_{r_j} = #{b : u_j^(b) >= |t_{r_j}|} / B, enforcing monotonicity in j
Less conservative than Bonferroni
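A rough sketch of the step-down maxT bookkeeping behind the adjusted p-value above. It assumes a two-group design and uses ordinary (unbalanced) label permutations with a plain t-statistic rather than the balanced permutations on the slide; the function names are mine.

```python
import numpy as np

def t_stats(x, labels):
    """Two-sample t statistics per gene; x is a genes-by-samples array,
    labels is a 0/1 vector of condition assignments."""
    labels = np.asarray(labels)
    a, b = x[:, labels == 0], x[:, labels == 1]
    num = b.mean(axis=1) - a.mean(axis=1)
    den = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] +
                  b.var(axis=1, ddof=1) / b.shape[1])
    return num / den

def westfall_young_maxT(x, labels, B=1000, seed=None):
    """Step-down maxT adjusted p-values via label permutation."""
    rng = np.random.default_rng(seed)
    t_obs = np.abs(t_stats(x, labels))
    order = np.argsort(-t_obs)              # ranks r_1, ..., r_k: largest |t| first
    k = len(t_obs)
    exceed = np.zeros(k)
    for _ in range(B):
        perm = rng.permutation(np.asarray(labels))   # simple permutation; the slide uses balanced ones
        t_b = np.abs(t_stats(x, perm))[order]
        u = np.maximum.accumulate(t_b[::-1])[::-1]   # u_j = max(|t_{r_j}|, ..., |t_{r_k}|)
        exceed += (u >= t_obs[order])
    p_sorted = np.maximum.accumulate(exceed / B)     # enforce monotonicity down the ranks
    p_adj = np.empty(k)
    p_adj[order] = p_sorted
    return p_adj
```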

Significance Analysis of Microarrays (SAM)
Use a t-like statistic: d(i) = (mean difference for gene i) / (s(i) + s_0), where s_0 is a small fudge constant added to the gene-specific standard error
Use the balanced permutation method from the previous slide to estimate the null distribution, assuming all effects are null
Call genes that fall outside the threshold bars significant
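The slide does not reproduce the statistic itself; the sketch below uses the published SAM form (mean difference over a pooled standard error plus the fudge constant s_0), with s_0 supplied rather than optimized and the delta-threshold machinery omitted. The function name and argument layout are illustrative.

```python
import numpy as np

def sam_d(x, labels, s0):
    """SAM-style relative difference per gene: mean difference over the
    gene-specific pooled standard error plus the fudge constant s0."""
    labels = np.asarray(labels)
    a, b = x[:, labels == 0], x[:, labels == 1]
    na, nb = a.shape[1], b.shape[1]
    diff = b.mean(axis=1) - a.mean(axis=1)
    pooled = ((na - 1) * a.var(axis=1, ddof=1) +
              (nb - 1) * b.var(axis=1, ddof=1)) / (na + nb - 2)
    s = np.sqrt(pooled * (1.0 / na + 1.0 / nb))
    return diff / (s + s0)
```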

Half-Normal Analysis

Microarray Specific Problem

Analysis Plan
Robust measures of location and scale
Summary statistic
Two half-normal plots (for upward-regulated and downward-regulated genes)
Segment determination: find J and J^NC, dividing the effects into insignificant, borderline, and significant segments
Repeat the procedure, using J^NC as the base

Robust Measures of Location and Scale
Perform transformation and suitable normalization
Compute the median and median absolute deviation (MAD) for each gene
Reasonable estimates; less affected by outliers than the mean and SD
Interested in robustness rather than efficiency
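A small sketch of the per-gene robust summaries, assuming a genes-by-samples matrix of transformed, normalized values; the 1.4826 consistency factor is a common convention, not something stated on the slide, and the function name is mine.

```python
import numpy as np

def robust_location_scale(x):
    """Per-gene median and median absolute deviation (MAD).
    x: genes-by-samples array of transformed, normalized expression values."""
    med = np.median(x, axis=1)
    mad = np.median(np.abs(x - med[:, None]), axis=1)
    return med, 1.4826 * mad   # 1.4826 scales MAD to be consistent with the normal SD (optional convention)
```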

Summary Statistic
Compute a quasi two-sample t-statistic ss_i for each gene using the robust values from above
c is chosen to minimize a variability criterion over the middle 100(1 - 2α)% of the ss_i
Tusher et al. (2001) chose c to minimize the coefficient of variation
Efron et al. (2001) used the 90th percentile of the gene standard error estimates for c
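The exact form of the quasi t-statistic and of the quantity that c minimizes are not legible above, so the sketch below assumes a plausible form: difference of group medians over the sum of group MADs plus c, with c picked by a grid search over a caller-supplied dispersion criterion applied to the middle 100(1 - 2*alpha)% of the statistics. All names and the specific functional form are assumptions.

```python
import numpy as np

def quasi_t(med_a, mad_a, med_b, mad_b, c):
    """Assumed form of the robust quasi two-sample statistic: difference of
    group medians over the sum of group MADs plus the constant c."""
    return (med_b - med_a) / (mad_a + mad_b + c)

def choose_c(med_a, mad_a, med_b, mad_b, candidates, criterion, alpha=0.25):
    """Grid search for c: evaluate a caller-supplied dispersion criterion on the
    middle 100*(1 - 2*alpha)% of the sorted statistics and keep the minimizer."""
    best_c, best_val = None, np.inf
    for c in candidates:
        ss = np.sort(quasi_t(med_a, mad_a, med_b, mad_b, c))
        n = len(ss)
        lo, hi = int(np.floor(alpha * n)), int(np.ceil((1 - alpha) * n))
        val = criterion(ss[lo:hi])
        if val < best_val:
            best_c, best_val = c, val
    return best_c
```

A coefficient-of-variation style criterion in the spirit of Tusher et al. (2001), or one based on the Efron et al. (2001) percentile rule, could be plugged in for criterion.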

Two Half-Normal Plots
Construct two half-normal plots: one for the p positive and one for the r negative ss_i
Run the procedure separately on each set
Denote the ordered p positive effects by abss_(1) <= abss_(2) <= ... <= abss_(p)
Plot abss_(i) against half-normal distribution quantiles, i.e. the points (Φ^{-1}(0.5 + 0.5(i - 0.5)/p), abss_(i))
Goal: obtain a set of noise effects, which yields a baseline against which to test the rest of the effects
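A direct sketch of the plotting positions just defined: the ordered absolute statistics paired with the half-normal quantiles Φ^{-1}(0.5 + 0.5(i - 0.5)/p). The function name is illustrative.

```python
import numpy as np
from scipy.stats import norm

def half_normal_points(ss):
    """Return (half-normal quantile, ordered |ss|) pairs for one plot,
    given the positive (or the negative) summary statistics."""
    abss = np.sort(np.abs(np.asarray(ss, dtype=float)))
    p = len(abss)
    i = np.arange(1, p + 1)
    q = norm.ppf(0.5 + 0.5 * (i - 0.5) / p)   # Phi^{-1}(0.5 + 0.5(i - 0.5)/p)
    return q, abss
```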

Segment Determination: J
Given k, initialize the null set as the points abss_(1), ..., abss_(k)
Regress the null set on the first k half-normal quantiles (Q_1, ..., Q_k)
Produce predicted values ŷ_h at the remaining quantile values (Q_h, h > k)
Compute a predicted statistic for each remaining effect from abss_(h) and ŷ_h
Find the largest h whose predicted statistic is less than the t critical value; this gives the initial J^k

Segment Determination: J (cont)
The initial null set of k genes becomes k + m (= J^k) null genes
Re-do the segment determination procedure, using the k + m genes as the base null set
Continue until no new genes are added
Do this for each k less than p - 1 and store the end point J^k
Set J to the most frequent J^k
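A sketch of the iteration described on the last two slides. It assumes (the slide formulas are not legible) that the predicted statistic is the prediction residual abss_(h) - ŷ_h standardized by its prediction standard error from the least-squares fit, and that J^k is the largest h whose statistic stays below the t critical value; the pass is repeated on the enlarged null set until nothing new is added. Function names are mine.

```python
import numpy as np
from scipy import stats

def one_pass(q, abss, k, alpha=0.05):
    """Fit abss_(1..k) on quantiles Q_1..Q_k, then return the largest h > k whose
    standardized prediction residual stays below the t critical value (else k)."""
    qk, yk = q[:k], abss[:k]
    X = np.column_stack([np.ones(k), qk])
    beta, *_ = np.linalg.lstsq(X, yk, rcond=None)
    resid = yk - X @ beta
    s2 = resid @ resid / (k - 2)
    tcrit = stats.t.ppf(1 - alpha, df=k - 2)
    qbar, sxx = qk.mean(), ((qk - qk.mean()) ** 2).sum()
    j_hat = k
    for h in range(k, len(abss)):                 # 0-based h corresponds to ordered effect h + 1
        y_pred = beta[0] + beta[1] * q[h]
        se_pred = np.sqrt(s2 * (1 + 1.0 / k + (q[h] - qbar) ** 2 / sxx))
        if abs((abss[h] - y_pred) / se_pred) < tcrit:
            j_hat = h + 1                         # largest such h, 1-based
    return j_hat

def segment_determination(q, abss, k, alpha=0.05):
    """Repeat one_pass with the enlarged null set until no new genes are added."""
    j = k
    while True:
        j_new = one_pass(q, abss, j, alpha)
        if j_new == j:
            return j
        j = j_new
```

Per the slides, running this for every starting k < p - 1 and setting J to the most frequent end point completes the segment determination.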

Sample
Let k = 200, total effects = 500
The first 200 ordered positive effects are regressed on the first 200 half-normal quantiles
Test ordered effects 201 to 500 using the absolute value of the predicted statistics
For example, effect 239 is the largest h whose statistic is less than the t critical value, so J^200 would initially be 239
Redo the above with k = 239 effects, so we test effects 240 to 500
Say statistic 242 is the largest h less than the t critical value based on the new regression line, so the new J^200 would be 242
Redo the above again with k = 242 and test effects 243 to 500
No statistics are less than the t critical value, so no new effects are added and J^200 is 242

Example
J = 3116

Find J^NC
Test all effects after J using the same statistics
To adjust for multiple testing, define NC as the number of consecutive significant effects necessary to call all subsequent effects significant
Use the Bonferroni adjustment (which does not require independence)
Instead of doing thousands of comparisons, we only need to do NC of them to determine significance
Define J^NC as the change point beyond which all effects are declared significant
Now we have identified the change points in the graph for segment detection
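A sketch of the scan for J^NC as described above: starting just past J, look for the first run of NC consecutive individually significant effects and treat everything from that run onward as significant. The Bonferroni-based formula for NC is not legible on the slide, so NC is an input here, and the reading of J^NC as the change point before that run is an assumption; the function name is mine.

```python
def find_j_nc(significant, j_hat, nc):
    """Scan the ordered effects after j_hat for the first run of nc consecutive
    individually significant effects.  significant[h] is True if ordered effect
    h + 1 (1-based) is significant.  Returns the 1-based index of the last effect
    before that run (the assumed J^NC), or None if no such run occurs."""
    run_start, run_len = None, 0
    for h in range(j_hat, len(significant)):
        if significant[h]:
            run_len += 1
            if run_len == 1:
                run_start = h
            if run_len == nc:
                return run_start          # effects run_start + 1 onward are called significant
        else:
            run_len = 0
    return None
```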

Example: Downward-Regulated Speed Mouse Data

Example: Downward-Regulated Speed Mouse Data (cont)
(Plot showing J and J^NC)

Error Rate Estimation: FDR
The False Discovery Rate (FDR) is the expected proportion of falsely rejected hypotheses
Permute the condition labels, maintaining balance
Example: 8 replicates in conditions A and B; each permuted A and B group will have 4 replicates from A and 4 from B
Compute the robust statistics, keeping the same c as in the actual data
Determine the average number of effects that fall above the positive or below the negative boundary of the significant sets
Divide that number by the total number of effects called significant
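A sketch of the permutation FDR estimate described above, assuming a two-group design with an even number of replicates per condition. Balanced relabelings (half of each permuted group drawn from each condition, as in the 8-replicate example) are enumerated, the statistic is recomputed with the same c via a caller-supplied function, and the average count of effects beyond the significance boundaries is divided by the number of effects actually called significant. All function and parameter names are illustrative.

```python
import numpy as np
from itertools import combinations

def balanced_labelings(n_per_group):
    """Yield balanced relabelings for two conditions with n_per_group replicates
    each (assumed even): every permuted group takes half its samples from A and
    half from B, as in the 8-replicate example above."""
    half = n_per_group // 2
    a_idx = range(n_per_group)                       # original condition A samples
    b_idx = range(n_per_group, 2 * n_per_group)      # original condition B samples
    for a_pick in combinations(a_idx, half):
        for b_pick in combinations(b_idx, half):
            labels = np.zeros(2 * n_per_group, dtype=int)
            labels[list(a_pick) + list(b_pick)] = 1
            yield labels

def estimate_fdr(x, statistic, lower, upper, n_called, n_per_group):
    """statistic(x, labels) recomputes the robust summary statistics with the same
    c as the real data; lower/upper are the negative and positive significance
    boundaries; n_called is the number of effects declared significant."""
    counts = []
    for labels in balanced_labelings(n_per_group):
        ss = statistic(x, labels)
        counts.append(np.sum((ss > upper) | (ss < lower)))
    return np.mean(counts) / n_called
```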

Speed Data: Analysis and Comparison
WY found 8 genes significant, with Type I error = 0.05

Lemon Data: Analysis and Comparison
WY found 253 genes significant, with Type I error = 0.05

Conclusions
Proposed a new method for determining differential expression in genes
Dealt with the multiplicity problem by using only a small subset of genes
Can extend to other large data sets
Allows scientists to play a role in sequential decision making
Incorporates a priori knowledge of the experiment through the selection of c
