
2004/3/17

Microarray data
    Paired data
        Parametric hypothesis testing: z-test, t-test
        Non-parametric hypothesis testing: sign test, Wilcoxon signed-rank test
    Unpaired data
        Parametric hypothesis testing: two-sample t-test
        Non-parametric hypothesis testing: Wilcoxon rank-sum test (Mann-Whitney U test)
    Complex data (more than two groups)
        One-Way Analysis of Variance (ANOVA)

Assumptions and test for normality
    Histogram, QQ plot
    Jarque-Bera test, Lilliefors test, Kolmogorov-Smirnov test

A hypothesis test is a procedure for determining if an assertion
about a characteristic of a population is reasonable.
For example, suppose that someone says that the average price of
a gallon of regular unleaded gas in Massachusetts is $1.15. How
would you decide whether this statement is true?
You could try to find out what every gas station in the state was
charging and how many gallons they were selling at that price.
That approach might be definitive, but it could end up costing
more than the information is worth.
A simpler approach is to find out the price of gas at a small
number of randomly chosen stations around the state and
compare the average price to $1.15.
Of course, the average price you get will probably not be exactly
$1.15 due to variability in price from one station to the next.
Suppose your average price was $1.18. Is this three cent difference
a result of chance variability, or is the original assertion incorrect?
A hypothesis test can provide an answer.
Samples are taken from 20 breast cancer patients, before and
after a 16-week course of doxorubicin chemotherapy, and
analyzed using microarrays. There are 9216 genes.
Paired data: there are two measurements from each patient, one
before treatment and one after treatment.
Because these two measurements are related to one another, we are
interested in the difference between them (the log ratio) to
determine whether a gene has been up-regulated or
down-regulated in breast cancer following the treatment.
Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR,
Ross DT, Johnsen H, Akslen LA, Fluge O, Pergamenschikov A, Williams C, Zhu SX,
Lonning PE, Borresen-Dale AL, Brown PO, Botstein D, (2000), Molecular portraits of
human breast tumours. Nature 406:747-752.
Stanford Microarray Database:
http://genome-www.stanford.edu/breast_cancer/molecularportraits/

Bone marrow samples are taken from 27 patients suffering from
acute lymphoblastic leukemia (ALL) and 11 patients suffering from
acute myeloid leukemia (AML) and analyzed using Affymetrix arrays.
There are 7070 genes.
Unpaired data: there are two groups of patients (ALL, AML).
We wish to identify the genes that are up- or down-regulated in
ALL relative to AML (i.e., to see if a gene is differentially
expressed between the two groups).

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P.,
Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A. et al. (1999) Molecular
classification of cancer: class discovery and class prediction by gene expression
monitoring. Science 286, 531--537.
Cancer Genomics Program at Whitehead Institute for Genome Research
http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi
There are four types of small round blue cell tumors (SRBCT) of childhood:
neuroblastoma (NB), non-Hodgkin lymphoma (NHL),
rhabdomyosarcoma (RMS) and Ewing tumours (EWS). Sixty-three
samples from these tumours, 12, 8, 20 and 23 in each of
the groups, respectively, have been hybridised to microarrays.
We want to identify genes that are differentially expressed in one
or more of these four groups.

More on SRBCT:
http://www.thedoctorsdoctor.com/diseases/small_round_blue_cell_tumor.htm

Khan J, Wei J, Ringner M, Saal L, Ladanyi M, Westermann F, Berthold F, Schwab M,
Antonescu C, Peterson C and Meltzer P. Classification and diagnostic prediction of
cancers using gene expression profiling and artificial neural networks. Nature
Medicine 2001, 7:673-679.
Stanford Microarray Database

The null hypothesis:

H0: mu = 1.15. (the average price of a gallon of gas is $1.15)

The alternative hypothesis:

H1: mu > 1.15. (gas prices were actually higher)
H1: mu < 1.15.
H1: mu != 1.15.

The significance level (alpha) is related to the degree of certainty you require in
order to reject the null hypothesis in favor of the alternative.
Decide in advance to reject the null hypothesis if the probability of observing your sampled result is less than the
significance level.
For a typical significance level of 5%, the notation is alpha = 0.05. For this significance level, the probability of
incorrectly rejecting the null hypothesis when it is actually true is 5%.
If you need more protection from this error, then choose a lower value of alpha.

The p-value is the probability of observing a sample result at least as extreme as the given one under the
assumption that the null hypothesis is true.

If the p-value is less than alpha, then you reject the null hypothesis.
For example, if alpha = 0.05 and the p-value is 0.03, then you reject the null hypothesis.

Confidence intervals: a range of values that has a chosen probability of
containing the true hypothesized quantity.

Suppose, in our example, 1.15 is inside a 95% confidence interval for the mean, mu. That is equivalent to being unable
to reject the null hypothesis at a significance level of 0.05.
Conversely, if the 100(1 - alpha)% confidence interval does not contain 1.15, then you reject the null hypothesis at the
alpha level of significance.
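
As a minimal sketch of the gas-price example, the following Python code computes a two-sided p-value and a 95% confidence interval and shows that the two decisions agree. The ten station prices and the known population standard deviation (sigma = 0.04, which is what makes a z-test applicable) are hypothetical values chosen for illustration; they do not come from the original example.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided one-sample z-test with a known population standard deviation."""
    n = len(sample)
    xbar = mean(sample)
    se = sigma / sqrt(n)                          # standard error of the mean
    z = (xbar - mu0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (xbar - z_crit * se, xbar + z_crit * se) # 100(1 - alpha)% interval
    return xbar, z, p_value, ci

# Hypothetical prices (dollars per gallon) at 10 randomly chosen stations.
prices = [1.19, 1.16, 1.21, 1.15, 1.20, 1.17, 1.22, 1.14, 1.18, 1.18]
xbar, z, p_value, ci = z_test(prices, mu0=1.15, sigma=0.04)
reject = p_value < 0.05

print(f"mean={xbar:.3f} z={z:.2f} p={p_value:.4f}")
print(f"95% CI=({ci[0]:.3f}, {ci[1]:.3f}) reject H0: {reject}")
```

With these made-up numbers the interval excludes 1.15 and the p-value is below 0.05, so both criteria lead to the same decision.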

1. Determine the null and alternative hypotheses, using
   mathematical expressions if applicable.
2. Select a significance level (alpha).
3. Take a random sample from the population of interest.
4. Calculate a test statistic from the sample that provides
   information about the null hypothesis.
5. Decision:
   If the value of the statistic is consistent with the null hypothesis,
   then do not reject H0.
   If the value of the statistic is not consistent with the null
   hypothesis, then reject H0 and accept the alternative hypothesis.

The microarray is being used as a tool to study many individual
genes in parallel.
In all of the microarray datasets, we are interested in identifying
differentially expressed genes.
The methods are designed for looking at one gene at a time to
determine whether or not it is differentially expressed.
Each method would then be applied to every gene on the
microarray in order to identify those genes that are differentially
expressed.

For each gene in the data, the measurements would be normalized, logged and
combined into a log ratio for each patient.
A simple approach is to choose a threshold, for example 2-fold differential
expression, and select those genes whose average differential expression is
greater than that threshold. This is not a good approach for two
reasons:
1. The average fold ratio does not take into account the extent to
   which the measurements of differential gene expression vary
   between the individuals being studied.
2. The average fold ratio does not take into account the number of
   patients in the study, which statisticians refer to as the sample
   size.
For these reasons, statisticians determine whether or not a gene
is differentially expressed via methodologies known as
hypothesis tests.
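
To illustrate why the average fold ratio alone is not enough, here is a small sketch comparing two genes with the same mean log ratio but very different patient-to-patient variability. The log2 ratios are hypothetical, and the one-sample t-statistic is used only as a simple example of a statistic that accounts for both variability and sample size.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(log_ratios):
    """t-statistic for testing whether the mean log ratio is zero."""
    n = len(log_ratios)
    # The standard error shrinks with sqrt(n), so sample size matters too.
    return mean(log_ratios) / (stdev(log_ratios) / sqrt(n))

# Hypothetical log2 ratios in 6 patients; both genes average 1.0 (2-fold).
gene_a = [0.9, 1.1, 1.0, 0.95, 1.05, 1.0]    # consistent across patients
gene_b = [3.5, -1.5, 2.8, -0.9, 3.1, -1.0]   # highly variable across patients

t_a = one_sample_t(gene_a)
t_b = one_sample_t(gene_b)
print(f"gene A: mean={mean(gene_a):.2f} t={t_a:.1f}")
print(f"gene B: mean={mean(gene_b):.2f} t={t_b:.1f}")
```

Both genes would pass a 2-fold threshold, yet only gene A shows a change that is large relative to its variability.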

The null hypothesis is that there is no biological effect.
For a gene in the Breast Cancer Dataset, it would be that this gene is
not differentially expressed following doxorubicin chemotherapy.
For a gene in the Leukemia Dataset, it would be that this gene is not
differentially expressed between ALL and AML patients.
If the null hypothesis were true, then the variability in the data
would not represent the biological effect under study, but would instead
result from differences between individuals or measurement error.
The smaller the p-value, the less likely it is that the observed
data have occurred by chance, and the more significant the
result.
p=0.01 would mean there is a 1% chance of observing at least
this level of differential gene expression by random chance.
We then select differentially expressed genes not on the basis of
their fold ratio, but on the basis of their p-value.
Example
H0: the gene is not differentially expressed.
The test is significant = reject H0.
False positive = (reject H0 | H0 true)
= concluding that a gene is differentially expressed when in fact
it is not.

Question: What if I do a t-test on a pair of samples and fail to reject the null hypothesis? Does this mean that there is no significant difference?
Answer: Maybe yes, maybe no.

For a two-sample t-test, power is the probability of rejecting the hypothesis that
the means are equal when they are in fact not equal. Power is one minus the
probability of Type-II error.

The power of the test depends upon the sample size, the magnitudes of the
variances, the alpha level, and the actual difference between the two
population means.
Usually you would only consider the power of a test when you failed to reject
the null hypothesis.

High power is desirable (0.7 to 1.0). High power means that there is a high
probability of rejecting the null hypothesis when the null hypothesis is false.
This is a critical measure of precision in hypothesis testing and needs to be
considered with care.
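
The dependence of power on sample size, variance, alpha and the true difference can be sketched numerically. The function below is an approximation, not the exact t-test power: it assumes a two-sided two-sample z-test with a known common standard deviation and equal group sizes. The function name and the example values of delta and sigma are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_power(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test (equal group sizes)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # True difference in means, measured in standard errors.
    shift = delta / (sigma * sqrt(2 / n_per_group))
    # Probability that the statistic lands in either rejection region.
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)

# Power grows with the number of patients per group.
powers = [two_sample_power(delta=1.0, sigma=1.0, n_per_group=n)
          for n in (5, 10, 20, 40)]
for n, pw in zip((5, 10, 20, 40), powers):
    print(f"n per group = {n:2d}  power = {pw:.3f}")
```

When delta is zero the function returns alpha, which is the false positive rate rather than power in the usual sense.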

Two measurements are independent if knowing the
value of one measurement does not give information
about the value of the other.
For any gene, the measurements of expression in two
different patients are independent.
Replicate measurements from the same patient (for example,
replicate features on an array) are not independent.

No.  Test                                                                       Function (MATLAB Statistics Toolbox)
1    Jarque-Bera test for goodness-of-fit to a normal distribution              jbtest
2    Lilliefors test for goodness of fit to a normal distribution               lillietest
3    Kolmogorov-Smirnov test of the distribution of one sample                  kstest
4    Kolmogorov-Smirnov test to compare the distribution of two samples         kstest2
5    Hypothesis testing for the mean of one sample with known variance          ztest
6    Hypothesis testing for a single sample mean (paired)                       ttest
7    Hypothesis testing for the difference in means of two samples (unpaired)   ttest2
8    Sign test for paired samples (paired)                                      signtest
9    Wilcoxon signed rank test of equality of medians (paired)                  signrank
10   Wilcoxon rank sum test that two populations are identical (unpaired)       ranksum
     (Mann-Whitney test)
11   One-Way Analysis of Variance (ANOVA)                                       anova1
The t-test comes in two forms:
Paired or one-sample t-test (related samples)
Unpaired or two-sample t-test (independent samples)

Example: paired t-test
The gene acetyl-Coenzyme A acetyltransferase 2 (ACAT2) is on the
microarray used for the breast cancer data.
We can use a paired t-test to determine whether or not the gene is
differentially expressed following doxorubicin chemotherapy.
The samples from before and after chemotherapy have been hybridized
on separate arrays, with a reference sample in the other channel.

Steps
1. Normalize the data.
2. Because this is a reference sample experiment, calculate the log ratio of
   the experimental sample relative to the reference sample for before and
   after treatment in each patient.
3. Calculate a single log ratio for each patient that represents the difference in
   gene expression due to treatment by subtracting the log ratio for the gene
   before treatment from the log ratio of the gene after treatment.
4. Perform the t-test: t=3.22, compared to t(19).
   The p-value for a two-tailed one-sample t-test is 0.0045,
   which is significant at the 1% significance level.
5. Conclude: this gene has been significantly down-regulated following chemotherapy at the 1% level.
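
The paired t-test steps above can be sketched in plain Python. The before and after log ratios for five patients are hypothetical, and the p-value is obtained by numerically integrating the Student t density, which is one simple way to avoid a statistics package and is not necessarily how the original analysis was run. The last line checks that t=3.22 with 19 degrees of freedom gives a two-tailed p-value close to the reported 0.0045.

```python
from math import exp, lgamma, pi, sqrt
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the density over [0, |t|]."""
    a = abs(t)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * (h / 3) * total      # P(-|t| < T < |t|)
    return 1 - central

def paired_t(before, after):
    """Paired t-statistic on the per-patient differences (after - before)."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

# Hypothetical log ratios (sample vs. reference) for 5 patients.
before = [0.10, 0.35, -0.20, 0.05, 0.40]
after = [-0.30, 0.05, -0.45, -0.50, 0.10]
t_stat, df = paired_t(before, after)
p_paired = t_two_sided_p(t_stat, df)
print(f"t={t_stat:.2f} df={df} p={p_paired:.4f}")

# Reported ACAT2 result: t=3.22 compared to t(19).
print(f"p for t=3.22, df=19: {t_two_sided_p(3.22, 19):.4f}")
```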

Example: two-sample t-test
The gene metallothionein IB is on the Affymetrix array used for
the leukemia data.
Aims:
To identify whether or not this gene is differentially expressed
between the AML and ALL patients.
To identify genes which are up- or down-regulated in AML relative
to ALL.

Steps
The data is log transformed.
t=-3.4177, p=0.0016
Conclude that the expression of metallothionein IB is significantly
higher in AML than in ALL at the 1% level.

Assumptions
The distribution of the data being tested is normal.
    For a paired t-test, it is the distribution of the subtracted data that must be
    normal.
    For an unpaired t-test, the distribution of both data sets must be normal.
    Plots: Histogram, Density Plot, QQ plot.
    Test for normality: Jarque-Bera test, Lilliefors test, Kolmogorov-Smirnov test.
Homogeneous: the variances of the two populations are equal.
    Test for equality of the two variances: variance ratio F-test.

Note:
If the two populations are symmetric, and if the variances are
equal, then the t-test may be used.
If the two populations are symmetric, and the variances are not
equal, then use the two-sample unequal variance t-test, also known as Welch's
t-test.
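
The difference between the equal-variance t-test and Welch's t-test can be sketched as follows. The expression values are hypothetical, and the degrees of freedom for Welch's test use the usual Welch-Satterthwaite approximation; the variance ratio is shown only as the statistic that a variance ratio F-test would start from.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(x, y):
    """Two-sample t-statistic assuming equal population variances."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

def welch_t(x, y):
    """Welch's t-statistic with Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x) / n1, variance(y) / n2
    t = (mean(x) - mean(y)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical logged expression values for two unpaired groups.
group_all = [7.1, 6.8, 7.4, 6.9, 7.2, 7.0, 6.7, 7.3]
group_aml = [7.9, 8.6, 7.2, 9.1]

t_pooled, df_pooled = pooled_t(group_all, group_aml)
t_welch, df_welch = welch_t(group_all, group_aml)
var_ratio = max(variance(group_all), variance(group_aml)) / min(
    variance(group_all), variance(group_aml))

print(f"pooled: t={t_pooled:.3f} df={df_pooled}")
print(f"Welch:  t={t_welch:.3f} df={df_welch:.1f}")
print(f"variance ratio (larger / smaller) = {var_ratio:.2f}")
```

When the variances differ, Welch's version uses fewer effective degrees of freedom, which makes the test more conservative.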

Example of normally distributed data: t-tests are an appropriate analysis of these data.

Histogram of the difference between the log
ratios of the expression of ACAT2 in 20 breast
cancer patients before and after a 16-week
course of chemotherapy.
(The data are approximately normal; the mean of the
distribution appears to be less than zero, suggesting that
this gene might be down-regulated.)

Histogram of the log of the gene expression of
metallothionein IB in AML patients.
Histogram of the log of the gene expression of
metallothionein IB in ALL patients.
(Both distributions are approximately normal. The mean of
the histogram for the ALL patients appears to be lower than
the mean for the AML patients, suggesting that this gene
might be differentially regulated in these two diseases.)

Example of data that are not normally distributed: t-tests are not an appropriate
analysis of these data.

Histogram of the difference of expression of diubiquitin in 20 breast cancer
patients.
The data have not been logged. The distribution is not normal; there are two outliers,
with values of approximately -28 and -11.
(A t-test gives a non-significant result because the standard error of the mean is so high:
p=0.03, which is not significant at the 1% level.)
The data have been logged. The distribution is normal; the outliers have been pulled in.
(In both cases the mean difference is less than zero. A t-test is significant because the
standard error is much lower: p=0.001, which is significant at the 1% level.)

Histogram of the log of the gene expression of RYK in 27 ALL patients.
Histogram of the log of the gene expression of RYK in 11 AML patients.
(Neither distribution is normal. The ALL data is bimodal: 10 patients have little or
no expression and 17 patients have gene expression.)

Normal probability plot for graphical normality testing

normplot(X) displays a normal probability plot of the data in X. For matrix X, normplot displays
a line for each column of X.
qqplot(X) displays a quantile-quantile plot of the sample quantiles of X versus theoretical
quantiles from a normal distribution. If the distribution of X is normal, the plot will be close to linear.
qqplot(X,Y) displays a quantile-quantile plot of two samples. If the samples do come from the
same distribution, the plot will be linear.
If the quantiles of the theoretical and data distributions agree,
the plotted points fall on or near the line y = x.
If the theoretical and data distributions differ only in their
location or scale, the points on the plot fall on or near the line
y = ax + b. The slope a and intercept b are visual estimates of the
scale and location parameters of the theoretical distribution.
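
As a rough sketch of what a normal quantile-quantile plot computes, the code below pairs sorted sample values with theoretical normal quantiles and fits the line y = ax + b by least squares. The data are hypothetical, the plotting positions (i - 0.5)/n are one common convention assumed here, and the helper names are illustrative; this is not the implementation of qqplot.

```python
from statistics import NormalDist

def normal_qq_points(data):
    """Return (theoretical quantile, sample quantile) pairs for a normal QQ plot."""
    n = len(data)
    ordered = sorted(data)
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

def fit_line(points):
    """Least-squares slope a and intercept b of y = a*x + b."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    a = sxy / sxx
    return a, my - a * mx

# Hypothetical log-ratio differences for 10 patients.
log_ratios = [-0.62, -0.15, -0.48, -0.30, -0.05, -0.41, -0.22, -0.55, -0.35, -0.27]
points = normal_qq_points(log_ratios)
slope, intercept = fit_line(points)
print(f"scale estimate a={slope:.3f}, location estimate b={intercept:.3f}")
```

If the points lie close to the fitted line, the slope and intercept give rough estimates of the standard deviation and mean of the data.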
Jarque-Bera test for goodness-of-fit to a normal distribution (jbtest)
The Jarque-Bera test evaluates the hypothesis that X has a normal distribution
with unspecified mean and variance, against the alternative that X does not have
a normal distribution.
The test is based on the sample skewness and kurtosis of X. For a true normal
distribution, the sample skewness should be near 0 and the sample kurtosis
should be near 3.
The Jarque-Bera test determines whether the sample skewness and kurtosis
are unusually different from their expected values, as measured by a chi-square
statistic.
The Jarque-Bera test is an asymptotic test, and should not be used with small
samples. You may want to use lillietest in place of jbtest for small samples.
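
The Jarque-Bera statistic can be sketched directly from the sample skewness and kurtosis. The formula used below, JB = n/6 * (S^2 + (K - 3)^2 / 4) with a chi-square reference distribution on 2 degrees of freedom, is the standard textbook form; the data sets are hypothetical and deliberately small, so the asymptotic p-values are for illustration only. This is not the jbtest implementation.

```python
from math import exp

def jarque_bera(x):
    """Return (skewness, kurtosis, JB statistic, asymptotic p-value)."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    # Survival function of a chi-square with 2 degrees of freedom is exp(-x/2).
    return skew, kurt, jb, exp(-jb / 2)

# Hypothetical data: a symmetric sample and a sample with two large outliers.
symmetric = [-2, -1, -1, 0, 0, 0, 1, 1, 2]
with_outliers = [-28, -11, -1, 0, 0, 1, 1, 0, -1, 1]

skew_s, kurt_s, jb_s, p_s = jarque_bera(symmetric)
skew_o, kurt_o, jb_o, p_o = jarque_bera(with_outliers)
print(f"symmetric:     skew={skew_s:.2f} kurt={kurt_s:.2f} JB={jb_s:.2f} p={p_s:.3f}")
print(f"with outliers: skew={skew_o:.2f} kurt={kurt_o:.2f} JB={jb_o:.2f} p={p_o:.4f}")
```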

Lilliefors test for goodness of fit to a normal distribution (lillietest)
The Lilliefors test evaluates the hypothesis that X has a normal distribution with
unspecified mean and variance, against the alternative that X does not have a
normal distribution.
This test compares the empirical distribution of X with a normal distribution
having the same mean and variance as X.
It is similar to the Kolmogorov-Smirnov test, but it adjusts for the fact that the
parameters of the normal distribution are estimated from X rather than specified
in advance.

Kolmogorov-Smirnov test of the distribution of one sample (kstest)
H = kstest(X) performs a Kolmogorov-Smirnov test to compare the values in
the data vector X with a standard normal distribution.
For each potential value x, the Kolmogorov-Smirnov test compares the
proportion of values less than x with the expected proportion predicted by the
standard normal distribution. The kstest function uses the maximum difference
over all x values as its test statistic.

Kolmogorov-Smirnov test to compare the distribution of two samples (kstest2)
H = kstest2(X1,X2) performs a two-sample Kolmogorov-Smirnov test to
compare the distributions of values in the two data vectors X1 and X2 of length
n1 and n2, respectively. The null hypothesis for this test is that X1 and X2 have
the same continuous distribution.

Non-parametric statistics do not assume that the data is normally distributed.
There are two good reasons to use non-parametric statistics.
1. Microarray data is noisy:
   There are many sources of variability in a microarray experiment and
   outliers are frequent.
   The distribution of intensities of many genes may not be normal.
   Non-parametric methods are robust to outliers and noisy data.
2. Microarray data analysis is high throughput:
   When analysing the many thousands of genes on a microarray, we
   would need to check the normality of every gene in order to ensure
   that a t-test is appropriate.
   Those genes with outliers or which were not normally distributed
   would then need a different analysis.
   It makes more sense to apply a test that is distribution free and thus
   can be applied to all genes in a single pass.

Null hypothesis: the population median from which both samples were
drawn is the same.

Sign test
Given n pairs of data, the sign test tests the hypothesis that the
median of the differences in the pairs is zero.
The test statistic is the number of positive differences.
If the null hypothesis is true, then the numbers of positive and
negative differences should be approximately the same.
In fact, the number of positive differences will have a Binomial
distribution with parameters n and p = 0.5.

Wilcoxon signed-rank test
1. Sort the absolute values of the differences from smallest to largest.
2. Assign ranks to the absolute values.
3. Find the sum of the ranks of the positive differences.
The sum of the ranks for the "positive" (up-regulated) values is
calculated and compared against a precomputed table to obtain a p-value.
If the null hypothesis is true, the sum of the ranks of the positive
differences should be about the same as the sum of the ranks of
the negative differences.
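
Both paired tests can be sketched in a few lines. The differences below are hypothetical; the sign test p-value is computed exactly from the Binomial(n, 0.5) distribution, while for the signed-rank test only the rank sums are computed (turning them into a p-value needs a table or an approximation, which is omitted here). Dropping zero differences and averaging tied ranks are conventions assumed for this sketch.

```python
from math import comb

def sign_test(diffs):
    """Exact two-sided sign test; zero differences are dropped."""
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    n_pos = sum(d > 0 for d in nonzero)
    k = min(n_pos, n - n_pos)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return n_pos, min(1.0, 2 * tail)

def average_rank(value, pooled):
    """Rank of value within pooled, averaging ranks over ties."""
    less = sum(v < value for v in pooled)
    equal = sum(v == value for v in pooled)
    return less + (equal + 1) / 2

def signed_rank_sums(diffs):
    """Sums of the ranks of |d| for the positive and the negative differences."""
    nonzero = [d for d in diffs if d != 0]
    magnitudes = [abs(d) for d in nonzero]
    w_pos = sum(average_rank(abs(d), magnitudes) for d in nonzero if d > 0)
    w_neg = sum(average_rank(abs(d), magnitudes) for d in nonzero if d < 0)
    return w_pos, w_neg

# Hypothetical paired differences (after - before) for 10 patients.
diffs = [-0.8, -0.3, 0.1, -1.2, -0.5, 0.2, -0.9, -0.4, -0.6, -0.7]
n_pos, p_sign = sign_test(diffs)
w_pos, w_neg = signed_rank_sums(diffs)
print(f"sign test: {n_pos} positive of {len(diffs)}, p={p_sign:.4f}")
print(f"signed-rank sums: W+={w_pos}, W-={w_neg}")
```

The signed-rank sums are far apart here because the two positive differences are also the smallest in magnitude, information that the sign test ignores.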
Wilcoxon rank-sum test (Mann-Whitney U test)
The data from the two groups are combined and given ranks (1 for the smallest, 2
for the second smallest, ...).
The ranks for the larger group are summed and that number is compared against a
precomputed table to obtain a p-value.
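
The ranking step of the rank-sum test can be sketched as follows, using hypothetical expression values. The code returns the rank sum of each group and the corresponding Mann-Whitney U statistics; the lookup of a p-value from a table is not shown, and tied values receive averaged ranks as an assumed convention.

```python
def average_rank(value, pooled):
    """Rank of value within pooled, averaging ranks over ties."""
    less = sum(v < value for v in pooled)
    equal = sum(v == value for v in pooled)
    return less + (equal + 1) / 2

def rank_sums(group_a, group_b):
    """Rank sums and Mann-Whitney U statistics for two unpaired groups."""
    pooled = list(group_a) + list(group_b)   # combine the two groups
    r_a = sum(average_rank(v, pooled) for v in group_a)
    r_b = sum(average_rank(v, pooled) for v in group_b)
    n_a, n_b = len(group_a), len(group_b)
    u_a = r_a - n_a * (n_a + 1) / 2
    u_b = r_b - n_b * (n_b + 1) / 2
    return r_a, r_b, u_a, u_b

# Hypothetical logged expression values for two groups of patients.
group_a = [1.2, 3.4, 2.2, 5.1]
group_b = [4.0, 6.3, 5.8, 7.1, 4.9]
r_a, r_b, u_a, u_b = rank_sums(group_a, group_b)
print(f"rank sums: R_a={r_a}, R_b={r_b}")
print(f"U statistics: U_a={u_a}, U_b={u_b}")
```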

Example: Wilcoxon signed-rank test
Unlogged: p-value=0.00032
Logged: p-value=0.00048
The test gives a significant result. The Wilcoxon test is
robust to outliers and so gives a significant result
even on the unlogged data.

Example: Mann-Whitney test
The gene receptor-like tyrosine kinase (RYK) appears on the
Affymetrix arrays used for the Leukemia dataset.
A number of values have negative scores and have been replaced
with zeros in the logged data set.
p=0.039, which is not significant at the 1% significance level. (The two-sample
t-test gives p=0.0032, which is significant.)
The answers are different because neither the ALL data nor the AML data
are normally distributed.
The AML sample is very small, so it is difficult to be conclusive, but it
also does not appear to be normally distributed.
The t-test is not an appropriate analysis, and we should not believe the
significant result from the t-test.
We could conclude from the non-parametric analysis that this gene is not
significantly differentially expressed between these two disease types.
The Mann-Whitney test is a less powerful test and is more likely to lead to
a false negative result.

Under the null hypothesis, there is no difference in gene expression between
the two groups. If that were the case, then any of the measurements in the data
could have been observed in any of the individuals.
ex: any of the AML patients could have had any of the 38 measurements associated
with both the AML and ALL patients.

The bootstrap works by constructing a large number of random data sets by
resampling from the original data, in which each individual is randomly allocated
one of the measurements from the data, which could be from either of the
groups.

The bootstrap data sets look like the real data, in that they have similar
values, but are biologically nonsense because the values have been
randomized.
Aim: the aim of the test is to compare some property of the real data
with a distribution of the same property in random data sets.

Bootstrap sample/data
with replacement: different individuals in the bootstrap data could have the same value
from the real data.
without replacement: each of the real values is only used once in the bootstrap data.

Bootstrap analyses are more appropriate for microarray analysis than either the
t-test or classical non-parametric tests. They
don't require that the data are normally distributed.
are robust to noise and experimental artifacts.

[36]

! 56

"

Two-sample t-statistic from the real data is 3.1596.


Bootstrap
Create bootstrap data sets, each of which also consists of 27
ALL patients and 11 AML patients.
For each patient, we choose a measurement at random from the
38 observed values and assign that value to the patient.
For each bootstrap data set, construct a two-sample t-statistic.
Repeated this procedure 1000000 times to generate a bootstrap
distribution of the t-statistics.
Of the 1000000 values, 9750 had an absolute value greater than
3.1596.
The bootstrap p-value for RYK is under 0.001, which is
significant at the 1% level.
Recommend performing the bootstrap at 10 times the number of
genes on the array being analyzed.

1. We generate an empirical distribution using the t-statistics


calculated from the randomized bootstrap data.
2. The t-statistic from the real data is compared with the distribution of
t-statistics from the bootstrap data.
3. We calculate an empirical p-value by computing the proportion of
bootstrap statistics that have a more extreme value than the tstatistic from the real data.
if the real t-statistic is in the belly of the distribution, then it is
indistinguishable from t-statistics generated from randomized data.
if the statistic from the real data is towards the edge of the bootstrap
distribution, then it is unlikely that the experimental result can have
arisen by chance, and we would conclude that the gene is significantly
differentially expressed.
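
The bootstrap procedure can be sketched as below. The two groups of values are hypothetical, resampling is done with replacement from the pooled measurements, and the pooled-variance t-statistic is used as the property being compared; the function names, the seed, the count of 2000 resamples and the handling of degenerate resamples are all illustrative choices, not the original analysis.

```python
import random
from math import sqrt
from statistics import mean, variance

def two_sample_t(x, y):
    """Pooled-variance two-sample t-statistic."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    if sp2 == 0:
        return 0.0   # degenerate resample; treated as not extreme (a simplification)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))

def bootstrap_p_value(group_a, group_b, n_boot=2000, seed=0):
    """Empirical two-sided p-value from bootstrap data sets drawn under H0."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    t_obs = two_sample_t(group_a, group_b)
    extreme = 0
    for _ in range(n_boot):
        # With replacement: each individual is given a random pooled measurement.
        boot_a = rng.choices(pooled, k=len(group_a))
        boot_b = rng.choices(pooled, k=len(group_b))
        if abs(two_sample_t(boot_a, boot_b)) >= abs(t_obs):
            extreme += 1
    return t_obs, extreme / n_boot

# Hypothetical logged expression values for two clearly separated groups.
group_a = [5.1, 4.8, 5.5, 5.0, 4.9, 5.3]
group_b = [7.9, 8.2, 7.6, 8.0, 8.4, 7.8]
t_obs, p_boot = bootstrap_p_value(group_a, group_b)
print(f"observed t={t_obs:.2f}, bootstrap p-value={p_boot:.4f}")

# Arithmetic from the RYK example: 9750 extreme values out of 1000000.
print(9750 / 1_000_000)
```

The smallest non-zero p-value this procedure can report is 1/n_boot, which is one reason a large number of resamples is needed when many genes are tested.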
t-test                          Non-parametric          Bootstrap analysis
Easy                            Easy                    Robust
Powerful                        Robust                  Powerful
Widely implemented              Widely implemented      Requires use of specialist
                                                        packages or programming
Not appropriate for data with   Less powerful
outliers

Note:
Because of the loss of power, classical non-parametric statistics have not
become popular for use with microarray data, and instead bootstrap methods
tend to be preferred.

When we analyze a microarray experiment, we want to apply these tests to many
genes in parallel.
There is a serious consequence of performing statistical tests on many genes in
parallel, which is known as multiplicity of p-values.
Imagine a microarray with 10000 genes on which every sample hybridized to the
arrays is the same reference sample. We then know that no genes are differentially
expressed: all measured differences in expression are experimental error.

By the very definition of a p-value, each gene would have a 1% chance of having a p-value of
less than 0.01, and thus be significant at the 1% level.
Because there are 10000 genes on this imaginary microarray, we would expect to find 100
significant genes at this level.
Similarly, we would expect to find 10 genes with a p-value less than 0.001, and 1 gene with
a p-value less than 0.0001.

In the Breast Cancer Dataset with 9216 genes, even if the chemotherapy had no effect
whatsoever, we would expect to find 92 differentially expressed genes with p-values less
than 0.01, simply because of the large number of genes being analyzed.
How do we know that the genes that appear to be differentially expressed are truly
differentially expressed and are not just artifacts introduced because we are analyzing
a large number of genes?
Is this gene truly differentially expressed, or could it be a false positive result?
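
The expected numbers of false positives quoted above follow from multiplying the number of genes by the significance threshold. The short sketch below reproduces that arithmetic; the extra column, the chance of at least one false positive, assumes that the genes are tested independently, which is an added simplification and not a claim from the text.

```python
def expected_false_positives(n_genes, alpha):
    """Expected number of genes with p < alpha when no gene is truly changed."""
    return n_genes * alpha

def prob_at_least_one(n_genes, alpha):
    """Chance of one or more false positives, assuming independent tests."""
    return 1 - (1 - alpha) ** n_genes

cases = [(10000, 0.01), (10000, 0.001), (10000, 0.0001), (9216, 0.01)]
for n_genes, alpha in cases:
    expected = expected_false_positives(n_genes, alpha)
    p_any = prob_at_least_one(n_genes, alpha)
    print(f"{n_genes} genes, alpha={alpha}: expect {expected:.2f}, "
          f"P(at least one) = {p_any:.3f}")
```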
The permutation test is a test in which the null hypothesis allows the
inference to be reduced to a randomization problem.
The process of randomization makes it possible to ascribe a probability
distribution to the differences in outcome that are possible under H0.
The outcome data are analyzed many times (once for each acceptable
assignment that could have been possible under H0) and then compared with
the observed result, without dependence on additional distributional or
model-based assumptions.
Perform a permutation test (general):

1. Analyze the problem and choose the null hypothesis.
2. Choose the test statistic T.
3. Calculate the value of the test statistic for the observed data: tobs.
4. Apply the randomization principle and look at all possible permutations; this gives the
   distribution of the test statistic T under H0.
5. Calculate the p-value.
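
The general steps can be sketched for a small two-group example. The data are hypothetical, the test statistic T is taken to be the difference in group means, and the p-value is computed as the proportion of all possible group assignments whose statistic is at least as extreme as tobs; these are common choices assumed for illustration, since the source does not spell out the statistic or the formula.

```python
from itertools import combinations
from statistics import mean

def permutation_test(group_a, group_b):
    """Exact two-sided permutation test on the difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    t_obs = mean(group_a) - mean(group_b)          # step 3: observed statistic
    count = total = 0
    # Step 4: enumerate every way of assigning n_a of the values to group A.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in range(len(pooled)) if i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so the observed assignment always counts itself.
        if abs(mean(a) - mean(b)) >= abs(t_obs) - 1e-12:
            count += 1
    return t_obs, count / total                    # step 5: p-value

# Hypothetical expression values for two small groups.
group_a = [1.0, 2.0, 3.0]
group_b = [6.0, 7.0, 8.0, 9.0]
t_obs, p_perm = permutation_test(group_a, group_b)
print(f"tobs={t_obs:.2f}, permutation p-value={p_perm:.4f}")
```

With 3 and 4 values there are only 35 possible assignments, so full enumeration is cheap; for realistic group sizes a random subset of permutations is usually sampled instead.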

Ref: Mansmann, U. (2002), Practical microarray analysis: resampling and the
Bootstrap. Heidelberg.

The permutation test allows determining the
statistical significance of the score for every gene.

It often happens in research practice that you need to compare more
than two groups (e.g., drug 1, drug 2, and placebo), or compare groups
created by more than one independent variable while controlling for the
separate influence of each of them (e.g., Gender, type of Drug, and size
of Dose). In these cases, you need to analyze the data using Analysis
of Variance, which can be considered to be a generalization of the t-test.
In fact, for two-group comparisons, ANOVA will give results identical to
a t-test.
When the design is more complex, ANOVA offers numerous
advantages that t-tests cannot provide (even if you run a series of
t-tests comparing various cells of the design).
Analysis of Variance (ANOVA) allows us to extend the t-test to more than two
populations or measurements (treatments). That is, we can test the
following:
Are all the means from more than two populations equal?
Are all the means from more than two treatments on one population equal?
(This is equivalent to asking whether the treatments have any overall effect.)
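
A one-way ANOVA F statistic can be sketched from the between-group and within-group sums of squares. The three groups below (drug 1, drug 2, placebo) contain hypothetical values, and the final lines check numerically that, for two groups, F equals the square of the equal-variance two-sample t-statistic. This is a plain illustration and not the anova1 implementation.

```python
from math import sqrt
from statistics import mean, variance

def one_way_anova_f(groups):
    """F statistic and degrees of freedom for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = mean([v for g in groups for v in g])
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

def pooled_t(x, y):
    """Two-sample t-statistic assuming equal population variances."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))

# Hypothetical measurements for three treatment groups.
drug1 = [4.1, 3.9, 4.3, 4.0]
drug2 = [5.0, 5.2, 4.8, 5.1]
placebo = [4.0, 4.2, 3.8, 4.1]

f3, df_b, df_w = one_way_anova_f([drug1, drug2, placebo])
print(f"three groups: F({df_b}, {df_w}) = {f3:.2f}")

f2, _, _ = one_way_anova_f([drug1, drug2])
t2 = pooled_t(drug1, drug2)
print(f"two groups: F = {f2:.4f}, t^2 = {t2 ** 2:.4f}")
```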

To identify the genes that are differentially expressed in one or
more of these four groups.
ARP1 (actin-related protein 1).

Efron, B. and Tibshirani, R. (1993). An introduction to the bootstrap. Chapman and Hall.
Jarque, C. M. and Bera, A. K. (1980). Efficient tests for normality, homoscedasticity, and serial
independence of regression residuals. Economics Letters 6, 255-9.
Kerr, M. K., Martin, M., and Churchill, G. A. (2000). Analysis of variance for gene expression microarray
data, Journal of Computational Biology, 7: 819-837.
Lilliefors, H. W. (1967). On the Kolmogorov-Smirnov test for normality with mean and variance unknown,
The American Statistical Association Journal.
Martinez, W. L. (2002). Computational statistics handbook with MATLAB, Boca Raton: Chapman &
Hall/CRC.
Runyon, R. P. (1977). Nonparametric statistics: a contemporary approach, Reading, Mass.:
Addison-Wesley Pub. Co.
Statistics Toolbox User's Guide, The MathWorks Inc.
http://www.mathworks.com/access/helpdesk/help/toolbox/stats/stats.shtml
Stekel, D. (2003). Microarray bioinformatics, New York: Cambridge University Press.
Tsai, C. A., Chen, Y. J. and Chen, J. (2003). Testing for differentially expressed genes with microarray
data, Nucleic Acids Research 31, No 9, e52.
Turner, J. R. and Thayer, J. F. (2001). Introduction to analysis of variance: design, analysis, &
interpretation, Thousand Oaks, Calif.: Sage Publications.

E-mail: hmwu@stat.sinica.edu.tw
Website: http://www.sinica.edu.tw/~hmwu/