Notes English

Modern Statistical Data Analysis:
February 27, 2015
CONTENTS
CONTENTS
Contents
I Hypothesis Testing:
1 Data and Data Preprocessing:
1.1
Visualization of Data: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2
Data Transformations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.3
Completion of Missing Values: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2 Classical Hypothesis Testing:

2.1
2.2
Some Notation and Reminders: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.1.1
Classical Problems in Parametric Statistics: . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Non-Parametric Tests: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
10
2.2.1
10
Permutation Tests: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3 Tests for Independence:
12
3.1
Classical tests for independence of scalar RVs: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
13
3.2
Testing for independence using permutations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14
3.2.1
Testing for independence using a Kernel function:
. . . . . . . . . . . . . . . . . . . . . . . .
14
3.2.2
Testing for independence using the Distance Correlation (dCor) method: . . . . . . . . . . . .
15
3.2.3
Another distance based method for continuous univariate RVs: . . . . . . . . . . . . . . . . .
17
3.2.4
An extension of Hoeding's method:
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18
3.2.5
An information theoretic test for independence: . . . . . . . . . . . . . . . . . . . . . . . . . .
19
Comparison of dierent tests for independence: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19
3.3.1
20
3.3
Equitable tests for independence: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4 Multiple Hypothesis Testing (MHT):

4.1
4.2
22
Family Wise Error Rate (FWER): . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
22
4.1.1
The Bonferroni Correction: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23
False Discovery Rate (FDR): . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23
4.2.1
The BH procedure under more general dependence:
. . . . . . . . . . . . . . . . . . . . . . .
25
4.2.2
What is the true meaning of FDR control: . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
26
4.2.3
Adaptive Procedures (Modied BH procedures): . . . . . . . . . . . . . . . . . . . . . . . . .
26
4.2.4
What to do if the Pvalues are not independent: . . . . . . . . . . . . . . . . . . . . . . . . . .
28
4.2.5
Estimation versus control of the FDR: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
28
CONTENTS
CONTENTS
4.3
An alternative method for MHT using Qvalues: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
29
4.4
Other variations of FDR: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
30
4.5
Empirical Bayes View: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
30
References
31
1 DATA AND DATA PREPROCESSING:
Part I
Hypothesis Testing:
Data and Data Preprocessing:
During this course when we discuss "data" we will usually be referring to a collection of vectors x1 , ..., xn Rp
i.i.d
which we usually assume that these are realizations of of an I.I.D sequence Xi F with F being some known or
unknown distribution.
1.1 Visualization of Data:

One of the rst thing one wants to do when dealing with real data before beginning to perform any analysis or
employing any statistical methods is to look at the data any try to get some initial idea about its nature, there are
several ways to do this:
Examination of scatter-plots of the various variables in relation to one another in order could help detect
trends and relations as well as detect outliers. For example consider these two images:
Figure 1.1: Passing a linear trend-line through a scatterplot for the collection of (x,y) values (blue dots). We
would like to model the relation between X and Y and
possibly predict given an X value what corresponding
Y value would match it. The linear regression line (in
black) is one way to do this.
Figure 1.2: in this scatter-plot we see an outlier. This

outlier is clearly visible in this representation of the data
while we would not be able to detect is using a histogram
of the X or Y values separately.
Examination of density-plots might be more informative when the number of observations and their location
makes a regular scatter-plot uninformative.
1.2 Data Transformations:
1 DATA AND DATA PREPROCESSING:
Figure 1.3: On the left side we see a scatter-plot of x,y values, it can be seen that the sheer amount of observations
and the way they are scattered does not allow us to notice any trend as all areas look equally dense (equally covered
with black dots). On the other hand, the right side image shows us a density plot where dierent colors represent
the point density in each area. In this particular example the density in the red area is roughly 1000 times higher
than in the purple area - something that could not be seen in the left plot.
Examination of box-plots gives a compact and convenient representation of the averages and standard deviations of data in various subgroups of the data. This is mostly useful when there is a large number of variables
that need to be compared or examined together.
Figure 1.4: It can be seen that a box-plot can represent the distribution of the observations around the mean across
several groups. This allows comparison of both means and standard deviations of the various groups and allow
detection of dierences between the groups.
1.2 Data Transformations:

In many cases there are non-linear relationships between variables in a dataset and these relationships are harder to
detect and understand. Using a transformation (for example logarithmic or polynomial) of one or several variables
it is often possible to get a clearer representation of the relationships in the data or compensate for dierences in
order of magnitude in the data.
1.3 Completion of Missing Values:
2 CLASSICAL HYPOTHESIS TESTING:
1.3 Completion of Missing Values:

In many cases some of the values of some variables in the dataset are missing and there are methods to complete
these values using various estimation methods. This is helpful in cases where one wants to analyze data in a method
which does not tolerate missing values but one doesn't want to simply omit observations.
Classical Hypothesis Testing:
2.1 Some Notation and Reminders:

Remark 1. From this point on we will abbreviate both a scalar random variable and random vector by writing
RV. The dimension will be left to be inferred from the context.
Denition 2. Given a RV X with distribution F (usually described by a cumulative distribution function) a
distribution parameter is some (often unknown) value that characterizes the distribution (such as the expectation, variance, etc).
Denition 3. An independent sample or realization x1 , ..., xn Rd from the distribution F is a realization

i.i.d
from a sequence of random variables X1 , ..., Xn F (here Xi can be either scalar or vector valued).
Denition 4. Given a distribution F a statistic of the distribution is a RV T (X1 , ..., Xn ) which is a function of
i.i.d
a sequence of RVs X1 , ..., Xn F . The value of the statistic T (x1 , ..., xn ) is calculated based on realizations.
Remark 5. In many cases the statistics we will discuss will be estimators for the parameters of some distribution.
For example the sample average is an estimator for the parameter which is the distribution expectation.
Denition 6. Given a family of distributions F (x; ) (a statistical model) parameterized by , an estimator of

is any function (x1 , ..., xn ) that tries to approximate based on a sample from the distribution F .

h
i
Denition 7. The bias of an estimator (X1 , ..., Xn ) for the parameter is dened by Bias :=E . An
estimator is said to be unbiased if its bias equals zero.
that quanties the

Denition 8. A loss function for an estimator of the parameter is any function L ,
2

=
"nearness" of the estimator to the real parameter value. One example is the squared-error loss L ,
.
2
Denition 9. A risk function for an estimator of the parameter is the expected value ofsome lossfunction,

2

h i

= E L ,
, one common example is the mean-squared-error risk MSE := E
R ,
.
2
2.1.1 Classical Problems in Parametric Statistics:

In classical statistic the assumption usually is that given observations x1 , ..., xn there is a family of distributions
F (x; ) parameterized by (both x and can be scalars or vectors) such that x1 , ..., xn is an i.i.d sample from F .
In this context there are several classical question that arise:
1. Point Estimation: Given the observations x1 , ..., xn one tries to nd an estimator (x1 , ..., xn ) which
approximates the unknown parameter . There are a few desirable qualities in such an estimator:
(a) Consistency: we would like the estimator to converge in probability to the real value of .
h
i
(b) Lack of Bias: we would ideally like the estimator to have little or no bias in the sense that E 0.
(c) Low Risk: given some risk function R we would like an estimator with low risk.
2. Condence Interval Estimation: a condence interval [a, b] for the parameter with condence level is
any interval such that P [ [a, b]] = 1 . We often want to estimate such intervals based on observations
and the values of the interval edges (usually take to be symmetric) are functions of the observations.
3. Hypothesis Testing: hypothesis testing deals with decision problems of the type = 0 or , 0 . There
is a null hypothesis regarding the true value of the parameter, denoted by 0 and the objective is to decide
with a given level of certainty whether to accept or reject this null hypothesis.
Several key elements of all of these problems are the number of parameters that need to be estimated, the amount
of data available and the number of hypothesis we wish to test.
Denition 10. Suppose x1 , ..., xn is a realization of X1 , ..., Xn F (x; ) . Given a null hypothesis H0 regarding
i.i.d
the value of and the complementary alternative hypothesis H1 there are several stages to a statistical test of
H0 :
1. Denition and calculation of a test statistic T (X1 , ..., Xn ) based on the observations.
2. Denition and calculation of a rejection area R () that contains the parameter with probability .
3. Rejection of H0 i T (x1 , ..., xn ) R () (equally acceptance of H1 i T Rc )
There are two kinds of errors that arise in this context:
1. Type 1 Error (false positive): = P (T R | H0 = true)
2. Type 2 Error (false negative or miss): = P (T Rc | H1 = true)
Denition 11. Given a statistical test for with test statistic T (x1 , ..., xn ) the Pvalue of the test is dened as the
probability that we observe a value at least as extreme as T (x1 , ..., xn ) given the underlying distribution of the
observations under H0 :
Pvalue = P (T (X1 , ..., Xn ) T (x1 , ..., xn ) | T (X1 , ..., Xn ) is distributed accrording to H0 )
It is important to remember that just like T (X1 , ..., Xn ) the Pvalue is a random variable which is a function of
X1 , ..., Xn . Furthermore if H0 is true then the distribution of the Pvalue would be Uniform [0, 1].
Figure 2.1: Here we see the empirical density (left) and empirical cumulative distribution (right) of 500 Pvalues
calculated by a simulation in which the data was sampled according to H0 . It can be seen that the simulation result
indeed shows that the approximated density of the Pvalues in such a case is roughly Uniform[0, 1].
Remark 12. A few remarks:

It is equivalent to reject H0 when T (x1 , ..., xn ) R () and to reject H0 when Pval < .
The Pvalue is a good measure of our condence in the "correctness" of H0 when testing a single hypothesis.
In the parametric case there is often an analytic way to calculate the Pvalue based on the distribution of the
test statistic under H0 (for example this is possible in the classic t-test).
It is a common misconception that the following relation exist

Pval = P (H0 = True | Observed data) = P (T (X1 , ..., Xn ) is distributed accrording to H0 | T (X1 , ..., X1 ) = T (x1 , ..., xn ))
This is however not true and in order to calculate P (H0 = True | Observed data) one would need to rely on
Bayesian statistics and assume some prior distribution on H0 .
In order to calculate the type-2 error one would need an assumption of the distribution of under H1 .
Denition 13. Given a statistical test with test statistic T and rejection area R () the power of the test is
Power = 1 = P (T R () | H0 = false)
This value of course depends on the choice of . It is generally desired that a test would have a low and high
power simultaneously but in reality there is a trade o between the two and that is not always possible.
Figure 2.2: Here we see the density of the test statistic under H0 (in blue) and under H1 (in red). Given some
rejection threshold (denoted by the black line) we would reject H0 if the test statistic is to the right of the line.
The condence level would then be the area to the right of the line found under the density given H0 , here that
area is marked in red. The power of the test is the area to the right of the line found under the density given H1 ,
here denoted in green. It can be seen that moving the rejection threshold impact both and the power.
Example 14. One of the simplest examples of a parametric test is the independent two-sample test for equality
n
of means. Assume that we are given observations {(xi , yi )}i=1 where yi {0, 1} is a categorical variable and we

assume that the xi values are realization from the distribution Xi |yi N yi , 2 . We would like to test the
hypothesis H0 : 0 = 1 , that is that the means in both groups are equal. If we assume that is known then we
can conduct a simple Z-test by calculating the z-score of the mean dierence:
nj =
j =
1
nj
n
X
1{yi =j}
i=1
n
X
xi 1{yi =j}
i=1
1
0
Z=r

2 n10 + n11 2
Under the assumption of H0 being true Z N (0, 1) and thus we will reject H0 i Z > Z 2 or Z < Z 2 for

Z 2 = 1 2 . This is a two-sided test in which we have equally divided the rejection area between the two tails
of the distribution. We can also calculate the Pval of the test given by Pval = 2 min { (Z) , 1 (Z)}. Similarly
if the variance of the distribution is unknown we would use a t-test with n 2 degree of freedom by calculating the
following statistic:
tn2
1
0
= r
;
2 0
2 1
+
n0
n1
2 j =
1 X
2
(xi
j ) 1{yi =j}
nj 1 i=1
2.2 Non-Parametric Tests:
The quantity in the denominator is simply an estimator for 2 . In this case the distribution of tn2 is no longer
normal but it is a known distribution named the T-distribution with n 2 degree of freedom. This distribution can
be used to calculate the rejection area and the Pval in the same way as with the Z-test.
One problem with the t-test is that it assumes an underlying normal distribution of the observations, as assumption
which can not be omitted and thus limits this test to specic cases. Additionally the t-test is not guaranteed to
control the type-I error , especially for skewed distributions.

Denition 15. Given X1 , ..., Xn F a statistic T (X1 , ..., Xn ) will be called Distribution Free if the distribution
i.i.d
of T does not depend on F . Furthermore, a hypothesis test for H0 will be called a distribution free test if the
test statistic is distribution free, that is it does not depend on the distribution of the observations under H0 .
Remark 16. While a distribution free test statistic must be independent of the distribution of the data under H0 it
is possible that the distribution of the statistic does depend on the alternative hypothesis.
Example 17. We will now give an example for a distribution free aparametric test for the two independent sample
n
decision problem. Given the observations {(xi , yi )}i=1 when again yi {0, 1} we would like to determine whether
Xi |yi = j has the same distribution for both j = 0 and j = 1. The Mann-Whitney-Wilcoxon Rank-Sum Test
allows us to do this and works as following:
1. The rank of each observation is calculated as ri =
2. For j = 1, 2 the value Rj =
3. Given n1 =
Pn
i=1
Pn
i=1 ri 1{yi =j}
Pn
j=1
1{xj xi } .
is calculated.
1{yi =1} The test statistic u = R1
n1 (n1 +1)
2
is calculated.
4. The distribution of U is calculated and a rejection area is chosen accordingly given condence .
Despite it appears U is not symmetric it actually is since R1 + R2 =
n(n+1)
2
is constant. The main advantage
of this test is the fact it is distribution free and thus can be used regardless of the underlying distribution of
the observations. The main question that remains is how to calculate the distribution of U in order to obtain the
rejection area. In practice for relatively small samples (up to 20 observations) there are tables that directly calculate
said distribution while for large sample there is a normal approximation to the distribution of U . An alternative
more modern method is to use numerical methods as we will now discuss.
2.2.1 Permutation Tests:

n
Given the data {(xi , yi )}i=1 where yi are again binary labels we want to test the hypothesis H0 : X, Y are independent
0
(equivalently if x1 , ..., xn1 F and x1 , ..., xn2 G the hypothesis would be H0 : F = G).
Assume that
Tobs := T ((x1 , y1 ) , ..., (xn , yn )) is some statistic (essentially any function of the data). A permutation test based
on the statistic T would be carried out as follows:
10
1. Draw N random permutations s1 , ..., sN Sn .

n
2. For each permutation calculate the test statistic on the permuted data Ti := T ({(xi , yi )}i=1 ).
3. Compute the empirical Pvalue P =
1
N
PN
i=1
1{Ti >Tobs } .
4. Reject H0 if P < for given condence .
Claim 18. For any joint-distribution (Xi , Yi ) F and any test statistic T ({(xi , yi )}ni=1 ) the distribution of the
i.i.d
empirical Pval under H0 is approximately Uniform[0, 1].
Proof. Sketch of proof: Under H0 the RVs T1 , ..., Tn , Tobs are i.i.d and thus # Ti > Tobs is uniformly distributed

on {0, ..., N } which immediately gives us that P Uniform 0, N1 , N2 , ..., 1
1
N,1
and thus when N we get
that the distribution of P converges to Uniform [0, 1].
Remark 19. All of this is correct assuming that T is continuous and there are no ties. However if T was discrete
and ties were possible the distribution of P is still approximately uniform.
Fact 20. The above procedure guarantees that P (reject | H0 = true) .

The advantage of permutation tests over normal approximations is that the permutation test better captures the tails
of the distribution than the normal approximation. Even though for very large samples the normal approximation
converges to the true distribution in practice this still might not be satisfactory, especially when one wants to
conduct a test with very small values.
Advantages of Permutation tests: Accuracy, no underlying assumption about the data, exibility (can be used
with complicated null models)
Disadvantages of permutation tests: Computationally intensive and do not provide any insight regarding the
analytic characteristic of the distribution (for example the relation between the power of the test and the sample
size is not easily understood).
2.2.1.1 Using permutations to calculate test power:

Suppose we have an assumption regarding the underlying distribution of the data under both H0 and H1 and we
have a test statistics T meant for testing H0 . We would often like to know what is the power of the test and how
does the power change with the sample size. In the parametric approach we can often analyze the power by direct
analytic computation or via an asymptotic approximation. In the non-parametric computational approach we can
compute the power using permutations using the following algorithm:
Input: condence level , an assumed distribution F of the data under H1 , sample size n.
Parameters: N - number of permutations, K - number of simulations.
for k = 1, ...., K do:
Sample by simulation {(xi , yi )}ni=1 F

i.i.d
11
3 TESTS FOR INDEPENDENCE:

perform a permutation test with N permutations with condence based on the sampled data.
Denote by Rk {0, 1} the result of the test (Rk = 1 if the k'th simulation rejected H0 ).
The estimator for the power is then dened by 1 =
1
K
Pk
k=1
Rk
Computational cost: O (N K) which could be quite costly.
Tests for Independence:
Suppose we are given two RVs X, Y and we would like to test whether they are independent. Ideally we would like
a test that would be able to detect any kind of probabilistic dependence between the two variables (not necessarily
linear or monotonic for example).
P
Denition 21. Given Xn F (x; ) an estimator (X1 , ..., Xn ) for is said to be consistent if n
.
i.i.d
Denition 22. A sequence of tests Tn = (gn , Rn ) with sample size n, test statistic gn (X1 , ..., Xn ) and rejection
area Rn for the rejection of H0 is said to be consistent if the two following properties hold:
n
1. P (reject | H0 =true) = P (gn Rn | H0 = true) 0.

n
2. P (reject | H1 =true) = P (gn Rn | H1 = true) 1 for any H1 6= H0 .
The general method for constructing consistent test for independence is as follows:
1. Find a distance measure D (X, Y ) 0 such that D (X, Y ) = 0 i X, Y are independent.
n for D.
2. Find a consistent sequence of estimators D
n
o
n > n .
3. Find a series of threshold n and dene rejection areas Rn = D
Remark 23. The nal step is often tricky as it depends on the rate of convergence of D n .
Example 24. Testing for independence using a kernel method:
n
i.i.d
Suppose we again have the data {(xi yi )}i=1 F where yi is binary and we would like the test H0 : X, Y are independent.
We dene a kernel Kh (xi , xj ) and compare the kernel values for pairs of the same and dierent groups by calculating
the following statistic:
T =
n
X
i,j=1

Kh (xi , xj )
1{yi =yj =0}

1{yi =yj =1}
1{yi 6=yk }
+
2
2
2
n0
n1
2n0 n1
1
The kernel function could for example be a Gaussian kernel Kh (xi , xj ) = e h2 kxi xj k . Here h is a width parameter
which determines how fast the kernel goes to zero and in turn impacts the rejection area (it is possible to pick an
optimal h value using a more sophisticated technique). When the X values do not depend on Y we would obviously
have that E [T ] = 0 and thus it would suce to test the hypothesis that E [T ] = 0 by comparing the value of |T | to
an appropriate critical value. This can be done either by a normal approximation of the distribution of T or in a
direct method using permutations.
12
3.1 Classical tests for independence of scalar RVs:
3.1 Classical tests for independence of scalar RVs:

There are several measures for dependence of scalar random variables, the most known of which is the Pearson
correlation-coecient which is given by:
Pn
(xi x) (yi y)
2 Pn
2
i=1 (xi x)
i=1 (yi y)
R (x, y) = qP
n
i=1
This is an estimator of the correlation coecient given by
E [(X E (X)) (Y E (Y ))]

Cov (X, Y )
(X, Y ) = r h
i=p
Var (X) Var (Y )
2
2
E (X E (X)) (Y E (Y ))
The correlation coecient represents the strength of the linear dependence between X and Y but is not informative
regarding other types of dependence. If one assume a normal distribution of X and Y then there is a closed form
to the distribution of R and more generally the distribution could always be approximated using permutations.
Fact 25. When (X, Y ) = 0 we would say that X,Y are uncorrelated. It is known that independent RVs are always
uncorrelated but the opposite is not true.
Figure 3.1: Here we see the values of the Pearson correlation-coecient for various types of dependence and various
levels of noise. It can be seen in the rst and second lines that the value of the coecient depends on the amount
of noise but does not depends on the slope as long as the slope is not zero. The third line shows that for non-linear
dependence the coecient can very well be zero despite the fact the variables have a functional relationship and
are not independent.
Another measure of dependence is the Kendell-Tau Rank correlation coecient dened by:
P
(x, y) =
i<j
sgn ((xi xi ) (yi yj ))

1
2 n (n
1)
Using permutations it is again possible to perform a hypothesis test for the signicance of this coecient. The main
disadvantages of is that it is only dened for scalar RVs and it is only capable of detect monotonic dependencies.
On the other hand it has the advantage of being a distribution free statistic.
13
3.2 Testing for independence using permutations:

Fact 26. Given two RVs X Rp , Y Rq with distributions PX , PY and joint distribution PXY we know that
are independent i PX (~x) PY (~y) = PXY (~x, ~y) for all (~x, ~y) Rp Rq . If X,Y are continuous with densities
fX , fY and joint density fX,Y then an equivalent condition is that fX (~x)fY (~y ) = fX,Y (~x, ~y ) for all (~x, ~y ) Rp Rq .
X, Y
Assume we are given two RVs X Rp , Y Rq with distributions PX , PY and joint distribution PXY , we would like
a statistical test which will determine whether the probability distributions PX PY and PXY are identical. When
the distributions are both univariate the Kolmogorov-Smirno two-sample test provides an analytic solution to
this problem but it does not generalize to the multivariate case. We would thus want to approach the problem
by dening a statistic whose distribution is dierent when X, Y are dependent or independent and then use a
permutation test in order to test for signicance of this statistic.
Remark 27. The problem of testing for independence is similar (but not identical) to testing for equality of distribution in a two sample setting. When testing for equality of distribution we are given a set of data sampled from
distribution P and a dierent set of data sampled from a distribution Q and we would like to determine whether
P Q based on these two independent samples. In comparison, when testing for independence we would like
to determine whether PXY PX PY but in this case we evaluate these distributions based on a single sample.
There are several types of methods for testing for independence using permutations:
Kernel methods: these methods rely on denition of a kernel K(x, y) which measures similarity between x
and y . The value of the Kernel are then computed and compared for pairs from the same group and from the
two dierent groups.
Geometric methods: these methods rely on dening a distance between distribution measure D (X, Y ).
Information based methods: it can be shown that two RVs X, Y are independent i their mutual information I (X, Y ) equals zero. This provides a way for testing for independence by computing or estimating
the mutual information.
3.2.1 Testing for independence using a Kernel function:

n
In example 24 we described a two-sample method for testing independence given a sample {(xi , yi )}i=1 where yi
was a binary value based on the statistic
T =
n
X
i,j=1

Kh (xi , xj )
1{yi =yj =0}

1{yi =yj =1}
1{yi 6=yk }
+
2
2
2
n0
n1
2n0 n1
(3.1)
it can be seen that for yi = yj we are given a value with a positive sign and for yi 6= yj we got a value with a
negative sign, all these values are summed up and a permutation test is used to check whether the computed value is
n
i.i.d
signicantly dierent than zero. In the more general case where we are given a single sample {(xi , yi )}i=1 PXY
we need to adapt this method to suit our needs (see [5]). One way to do this is treat (xi , yi ) as sampled from
n
PXY and (xi , yj ) i 6= j as samples from PX PY . In order to do this we dene A := {(xi , yi )}i=1 and B :=
14
{(xi , yi ) | 1 i 6= j n} and notice that under H0 , A is an i.i.d sample from PXY and B is an i.i.d sample out of
PX PY . We thus want to determine whether these two samples are identically distributed since this is equivalent to
independence of X, Y . In order to do this we will compute the following values:
two points from same group
z
}|
{
+Kh ((xi , yi ) , (xj , yj )) i 6= j
one from same and one from dierent
z
}|
{
Kh ((xi , yi ) , (xl , yk ))
i and l 6= k
one from same and one from dierent
}|
{
z
Kh ((xl , yk ) , (xi , yi ))
l 6= k
two points from dierent groups
z
}|
{
+Kh ((xi , yj ) , (xl , yk ))
i 6= j, l 6= k
Where we note that (xi , yj ) Rpq are the "points" in our sample. We then calculate the test statistic given in
formula 3.1 and perform a standard permutation test in order to test for signicance.
Remark 28. One small problem with this method is that there are usually a lot more point in group B than in A.
3.2.2 Testing for independence using the Distance Correlation (dCor) method:
h
Denition 29. The characteristic function of a RV X is dened by X (t) = E eit
>
Remark 30. If X has density fX then X is the Fourier transform of fX .

Denition 31. given two vectors x, y Rp we dene a relation x y i xi yi for all i.
Denition 32. Given a realization x1 , ..., xn Rp from the random sequence X1 , ..., Xn F we dene the
i.i.d
empirical cumulative distribution of F as follows:

n

1X
n ~
FX
t =
1{xi ~t} ~t
n i=1
(3.2)
Denition 33. Given any sequence of values aij indexed by i {1, ..., n} and j {1, ..., m} we denote:
m
ai :=
1 X
1X
1 XX
aij , aj :=
aij , a =
aij
m j=1
n i=1
n m i=1 j=1
If the value aij we to be laid out in an n m table these are the row/column/total averages respectively.
We mentioned previously that the Pearson correlation-coecient measure dependence but only of the linear nature.
More general dependence can be measure by looking at the correlations between the distances of the observations
from one another rather than looking at the original observations. The general intuition is that if X and Y are
dependent then small distances in X values should correspond to small distances in Y values. The dCov method
relies on this intuition in order to test for dependence and is performed as following:
n
1. Sample {(xi , yi )}i=1 where (xi , yi ) Rp Rq and denote x := (x1 , ..., xn ), y := (y1 , ..., yn ).
15
2. Compute all the pairwise distances aij = kxi xj k2 and bij = kyi yj k2 .
3. Normalize the distances by reducing the the row, column and total average and dene:
Aij = aij ai ai a
Bij = bij bi bi b
4. Dene the dCov statistic and dCor statistic as follows::
n
n
1 XX
Aij Bij
n2 i=1 j=1
dCov (x, y) := V 2 (x, y) =
dCor (x, y) = R (x, y) =

2
V 2 (x,y)
V
2 (x)
V 2 (x) V 2 (y) > 0
V 2 (y)
V 2 (x) V 2 (y) = 0
0
Where V 2 (x) = V 2 (x, x) and V 2 (y) = V 2 (y, y).
5. Use permutation to compute the distribution of R2 and perform a hypothesis test for independence.
Remark 34. Note that the computational complexity here is O n2 for the computation of all distance pairs.
Why this method works: It can be shown that V 2 (x, y) is an estimator for the parameter V 2 (X, Y ) dened by:
V 2 (X, Y ) :=
1
cp cq
|X,Y (x, y) X (x) Y (y)|

1+p
kxk2
Rp+q
where cd :=
2 (1+d)
( 12 (1+d))
1+q
kyk2
dxdy
is a normalized constant. It then follows dCor (x, y) is an estimator for
dCor (X, Y ) = R (x, y) =

2
V 2 (X,Y )
V 2 (X)
V 2 (Y )
V 2 (X) V 2 (Y ) > 0
V 2 (X) V 2 (Y ) = 0
It can be shown that V 2 (X, Y ) 0 and X, Y are independent i V 2 (X, Y ) = 0.
Another important property: It can be shown that the sample version V is equal to the population version
V if one uses the empirical characteristic functions given by:
n
k=1
k=1
k=1
1 X ihx,xk i n
1 X ihy,yk i
1 X i(hx,xk i+hy,yk i)
nX (x) =
e
, Y (y) =
e
, X,Y (x, y) =
e
n
n
n
That is, if we sample x := (x1 , ..., xn )and y := (y1 , ..., yn ) and calculate these empirical characteristic function and

~ n ,Yn whose distribution is derived from these characteristic function then we will get that V 2 X
n , Yn
dene X
equals V 2 (x, y). Based on this relationship one can prove some interesting properties of the statistic R2 (x, y) such
as:
16
1. 0 R2 (x, y) 1 and R2 (x, y) = 0 i x1 = y1 = ... = xn = yn .

2. The statistic V 2 (x, y) and R2 (x, y) converge almost surely to V 2 (x, y) and R2 (X, Y ) as n .
3. There exists a sequence of thresholds n such that the sequence of tests that rejects H0 : X, Y are independent
when R2 ((x1 , ..., xn ) , (y1 , ..., yn )) > n is a consistent sequence of tests.
References: [11, 12]
3.2.3 Another distance based method for continuous univariate RVs:

Fact 35. If X, Y are two dependent univariate RVs then there exists (x, y) R2 such that FX,Y (x, y) 6= FX (x) FY (y).
Furthermore if X, Y are continuous then there exists (x, y) such that fX,Y (x, y) 6= fX (y) fY (y) and from continuity
there is a ball B := B ((x, y) , ) such that fX,Y (u, v) 6= fX (u) fY (v) for all (u, v) B .
Motivation: based on the aforementioned fact it can be shown that if X, Y are continuous and dependent univariate
RVs then there exists p := (xp , yp ) R2 such that if the plane is divided into four quadrants around p denoted
p
p
p
p
p
Qp1,1 , Qp1,2 , Qp2,1 , Qp2,2 then given Oj,k
=
fXY (u, v) dudv it holds that O11
O22
6= O1,2
O2,1
. This is also
Qp
j,k
equivalent to the indicator RVs 1{X>xp } and 1{Y >yp } being dependent.
Conclusion: Dependence of two univariate RVs can be identied by inspecting division of the plane to quartiles.
The main question is how to choose the center of the division which will reveal such dependence. The idea suggest
by Hoeding [6] is to simply let the data itself dene the centers.
Figure 3.2: A division of the plane into four quadrants where one of the observations is selected as the center of the
division. If we counts the number of observations in each quadrant then as the sample size grows this approximates
the chance of a random sample belonging to that quadrant.
n
The procedure is carried out as follows given a sample {(xi , yi )}i=1 :
Perform n division of the plane to quartiles where each time pi = (xi , yi ) is taken to be the center point.
i
For each division compute a 2 2 table of the values opj,k
, j, k {1, 2}, these values are up to normalization
pi
estimators for the values Oj,k
previously dened.
17
Compute the following test statistic:

Hn =
n
2
1 X pi pi
i
i
o o op1,2
op2,1
n4 i=1 1,1 2,2
Perform a permutation test to check for signicance of the test statistic.
Remark 36. This test statistic is asymptotically equivalent to the following function of the empirical CDF:
Hn :=

Fn X,Y (x, y) Fn X (x) Fn Y (y) dFn X,Y (x, y)
(3.3)
R2
Where the empirical CDF functions are dened as in 32. This fact can be used to prove consistency of the sequence
of test dened by Hn (with appropriate rejection areas). Since X, Y are independent i FXY FX FY it can be
seen why Hn and thus Hn may be good measures of independence and we would expect Hn to converge to zero in
case of independence. Note that the integration in the denition is taken according to the empirical measure dFX,Y
and since this is in-fact a discrete distribution this actually means we sum up the values of the function under the
integral sign at the atoms of the distribution.
3.2.4 An extension of Hoeding's method:

Kaufman et al [7] suggested a variation on Hoeding's method where the main dierence is that instead of using
divisions of the plane into quartiles based on a single center one should look at k dierent points p1 , ..., pk and
2
divide the plane into (k + 1) areas (some of which are innite) by plotting the vertical and horizontal lines that
pass through the selected points. The test statistic is then dened as follows:
(k+1)2
1i1 ...ik1 n
j=1
Sn,k =
i ,...,ik1
The values oj1
i ,...,ik1
oj1
i ,...,ik1
ej1
2
i ,...,ik1
ej1
are the observed values which are simply the number of observations found in the area of
i ,...,ik1
the division dened by the points pi1 , ..., pik1 and the values ej1
are the expected probability values under
H0 (independence) which are obtained by multiplying the marginal CDF values FX , FY for the area of the division
dened by the points pi1 , ..., pik1 .
Remark 37. The value
P(k+1)2
j=1
i ,...,ik1
oj1
i ,...,ik1
ej1
i ,...,ik1
ej1
2
is actually the 2 value computed for a (k + 1) (k + 1)
contingency table with the aforementioned observed and expected values.

The computational complexity of this method is O nmax(2,min(k1,4)) .
Advantage: Both this method and Hoeding methods can be shown to be distribution free.
Disadvantage: Both methods are only applicable for two univariate RVs.
It remains an open question to nd a distribution free test for multivariate RVs
Remark 38. the dCor test statistic is not distribution free.

18
3.3 Comparison of dierent tests for independence:
3.2.5 An information theoretic test for independence:

Denition 39. Given two continuous RVs X Rp , Y Rq with densities fX , fY the mutual information of X
and Y is:
I (X, Y ) :=
fX,Y (~u, ~v ) log

Rp+q
fX,Y (~u, ~v )
fX (~u) fY (~v )

d~ud~v
Remark 40. The mutual information can similarly be dened for discrete or mixed RVs.
Fact 41. Given two RVs X Rp , Y Rq , I (X, Y ) 0 and I (X, Y ) = 0 i the two are independent.
Given this fact the problem of determining dependence is reduced to estimation of the mutual information. The
problem is that this estimation is usually dicult and tests that are based on this method usually have low power
(see ). One such test is the MIC test [8].
Remark 42. There is some evidence that better estimators for the mutual information can yield ecient tests, see
[7]

For any type of hypothesis test there is always the question whether there is a test (a test statistic) which is better
than all others in some sense. Ideally we would want a "universal" test which is best under any/most alternatives
for example in a minimax sense.
Denition 43. Suppose we are given a parameterized family of distributions f (x; ) depending on the unknown
scalar parameter and a partition of the parameter space = 0 1 , we denote by Hi the hypothesis

i . Given a test statistic T X~n := T (X1 , ..., Xn ) and a corresponding binary test function
~n =
X

1 T X~n Rn ()
0 otherwise
for testing a one-sided hypothesis with condence . We say that T is uniformly most powerful (UMP) if for

0
0
0
any other test statistic T X~n := T (X1 , ..., Xn ) and corresponding binary test function X~n for which
0

0
sup E X~n
= = sup E X~n
It holds that:
0

0
E X~n
= 1 1 = E X~n
1
Fact 44. It can be shown that there are no UMP tests for two-sided hypothesis testing and there are no UMP test
for vector valued parameterized families. Furthermore if the null is of the form H0 : = 0 and the alternative is
one sided H1 : 0 then the Neyman-Pearson lemma guarantees that the likelihood ratio test is a UMP test.
19
Remark 45. In particular the likelihood ratio test for testing for independence (which is not UMP since this is not
a simple one sided parameter hypothesis) rejects H0 if
n
X

log
i=1
fX,Y (xi , yi )
fX (xi ) fY (yi )

>C
In the context of testing for independence it is almost certain that no UMP test exists, one way of reaching this
conclusion is the fact that for dierent types of alternatives (dierent types of dependence) there are better suited
tests. Thus, instead of looking for a test which is optimal under any alternative we would like to be able to compare
between dierent tests under a specic set of alternative hypothesis in which we are interested, for example for
specic alternatives we believe are likely to occur in real data. Given such a set of alternative hypothesis we can
perform a comparison of test power using simulations as follows:
1. Generate data from various kinds of dependency models using simulations.
2. Perform the various hypothesis test one wishes to compare on the simulated data with given condence .
3. Estimate the power of each test using permutations.
4. Evaluate the power of each test as a function of the dependency model, the sample size and .
Figure 3.3: A list of tests for Independence and their performance under various dependence models.
3.3.1 Equitable tests for independence:

When testing for independence of two RVs X, Y there are two possible questions one can ask:
1. Is there dependence between the RVs
2. How strong is the dependence.
In this context the following denition arises:
20
Denition 46. A dependency measure D (X, Y ) will be called equitable if given RVs X1 , X2 , Y , a noise factor

Z N 0, 2 and two smooth functions g1 , g2 such that Xi = gi (Y ) + Z it holds that D (X1 , Y ) = D (X2 , Y ).
Remark 47. This basically means that the measure D (X, Y ) evaluate the strength of the dependence equally
regardless of the type of functional dependence.
Figure 3.4: dCor values computed for dierent types of dependence and noise levels. This illustration shows that
while the dCor measure is capable of detecting all types of dependence (the value is positive in all cases where
dependence exists) it is is not an equitable measure. For example, in the last row the noise level for the circle and
semi-circle is almost identical but the dCor values are quite dierent (0.2 vs 0.5).
It has also been shown [8] that the mutual information based MIC statistic is not equitable. In fact it remains an
open question to nd a dependency measure which is equitable even under restrictions on the functional dependence.
21
4 MULTIPLE HYPOTHESIS TESTING (MHT):
Multiple Hypothesis Testing (MHT):
In many cases when we are faced with performing some statistical test we will be required to conduct more than a
single hypothesis test. At the same time, we will want to control not only the probability for false rejection in each
single hypothesis test but the probability for false rejection across all tests performed.
Remark 48. There is no assumption on any relation between the hypothesis tests, each hypothesis can have its own
set of data, test statistic, rejection area and Pval.
Example 49. Suppose we want to test the connection between m = 20, 000 genes and a disease with condence
= 0.05 in every hypothesis. Without any further restrictions if we conduct each test separately with = 0.05 we
will get that even if none of the genes are connected to the disease the expected number of rejections is m = 1000.
That is, we will have 1000 false rejections. This problem is known as the multiple-hypothesis problem.
Denition 50. Given a MHT with m hypotheses we will denote the following:
#
True Null
True Alternative
Total
Not Signicant
mR
Signicant
Total
m0
m m0
For example U is the number of hypotheses where H0 should have been rejected and that was indeed the result and
V is the number of hypotheses where H0 should have been accepted but it was rejected (false rejections).
Remark 51. For simplicity we will be assuming that the test statistics we are discussing are all continuous and in
particular the distribution of the Pvals under H0 is Uniform [0, 1]. Most of the result we will discuss will remain
approximately true even if the test statistics were discrete.
Remark 52. Recall that when conducting a hypothesis test based on a statistic T (X1 , ..., Xn ) the Pval of the test
is itself a statistic (a RV) which is a function of X1 , ..., Xn . We will now denote this RV as P := P (X1 , ..., Xn ) and
denote its realized value by p := P (x1 , ..., xn ).
4.1 Family Wise Error Rate (FWER):

When testing a single hypothesis we wanted to bound the type-1 error such that P (reject | H0 = true) . Following
this logic the Family-Wise-error Rate (FWER) of a multiple hypothesis test is dened as:
FWER = P (There is at least one false rejection) = P (V > 0)
Where V is as previously denoted the number of false-rejections. Similarly to a single hypothesis test where we
wanted to upper bound the type-1 error we would want to upper bound the FWER and insure our testing method
guarantees FWER . As we have already seen in order to do this it is not sucient to simply set the condence
level of each test to be . There are however corrections that do allow controlling the FWER simply by selecting
appropriate rejection thresholds (or condence level) for all singular tests.
22
4.2 False Discovery Rate (FDR):
4.1.1 The Bonferroni Correction:

A naive solution for FWER control when conducting m hypothesis tests with Pvalues Pi is to simply set the the
condence level of each test to be
m.
Using a union bound it can be seen this achieves the required result:
!
m
m n
n
X
[
o
o
=m =
FWER = P (All rejections are correct) = P
P Pi <
Pi <
m
m
m
i=1
i=1
Where P
Pi <
since under H0 the assumption is that Pi Uniform [0, 1]. The main problem with this
approach is that it is extremely conservative and it would be extremely dicult to reject any hypothesis even in
cases where the alternative is true and even with powerful tests. That is, there is a signicant loss of overall power.
As is evident, control of the FWER is simply too restricting in order to allow for any ecient testing method, wee
thus turn to an alternative approach of controlling the error when conducting multiple hypothesis tests:

The False Discover Rate is a less conservative measure to the quality of a MHT. The idea is that we will allow false
rejections but we will want to ensure the proportion of false rejections out of all rejections is low.
Denition 53. Given a MHT with m hypothesis we denote R+ = max {R, 1} and Q =
V
R+ .
Q is thus the
proportion of false rejections out of the total number of rejections. We then dene the FDR of the MHT to be
FDR = E [Q], that is the expected false rejection proportion.
Remark 54. The expectation here is taken with regard to the joint distribution of the Pvals of all the hypothesis
tests conducted. This is some distribution dened on [0, 1]
such that the marginal distribution of every Pval is
Uniform [0, 1] under H0 .

We would obviously want our testing procedure (note that in this context the procedure is something that we
would in theory want to repeat for dierent data sets but with the same underlying hypotheses) to ensure a low
FDR value. The following procedure suggested by Benjamini and Hochberg [1] allows given a condence value
to determine which hypothesis should be rejected while ensuring FDR :
The Benjamini-Hochberg (BH) Procedure:

1. Conduct all the singular hypothesis tests and compute the realized Pvals p1 , ..., pm .
2. Order the Pvals such that p(1) p(2) ... p(m) .

3. Compute i = max i {1, ..., m} | p(i) mi (if exists).
4. Reject all the hypotheses that match the Pvals p(1) , ..., p(i) (if i doesn't exist reject none).
n
o
i
The rejected set of hypothesis is thus RejectedBH (p1 , ..., pm ; ) := i {1, ..., m} | pi m
.
23
Figure 4.1:
Claim 55. Assume we conduct m hypothesis tests based on continuous independent test statistics T1 , ..., Tm using
the BH procedure with parameter . Then FDR =
m0
m
Remark 56. If the statistics are discrete it is still guaranteed that FDR
m0
m .
Proof. We note that the result of the BH procedure given parameter is a function only of the vector (p1 , ..., pn )
of Pvals obtained in the tests performed. Dene the following events informally:
(i)
Ck = {The i'th hypothesis was rejected = k hypotheses were rejected}

The idea is that if we look at all the Pvalues pj for i 6= j and we add a constant values to all of them which will
lead to the i'th hypothesis to being reject the result would be that exactly k hypothesis will be rejected in total.
(i)
More formally (p1 , ..., pm ) Ck i for all qi [0, 1] exactly one of the following holds:
1. i
/ RejectBH (p1 , ..., pi1 , qi , pi+1 , ..., pm ; ) - The Pval qi does not lead to rejection of the i'th hypothesis.
2. i RejectBH (p1 , ..., pi1 , qi , pi+1 , ..., pm ; ) and also:
V := |RejectBH (p1 , ..., pi1 , qi , pi+1 , ..., pm ; )| = k

That is for any value of qi if the i'th hypothesis was rejected then k hypotheses were rejected.
(i)
Two important properties of the events Ck (which we will not prove) are:

(i)
1. The event Ck does not depend on the value of the RV Pi and thus does is independent of the event Pi
24
k
m
n
o

Pm
(i)
(i)
2. For any i the collection Ck | k {1, ..., m} is a partition of the sample space and thus k=1 P Ck = 1.

We note that the event Pi
k
m

(i)
Ck is identical by denition to the event Pi
k
m
{R = k} and thus:

k
k
k
(i)
P ({R = k}) P Pi
|R = k = P
Pi
{R = k} = P
Pi
Ck
m
m
m
Where R is again the number of rejections. Thus by independence and linearity of expectation we got:
FDR = E
X

X

m0
m
m X
V
V
1
k
=
P
({R
=
k})
E
|
R
=
k
=
P
({R
=
k})
P
P
|
R
=
k
i
R+
k
k
m
i=1
k=1
Same Events X
m0
m X
1
z}|{
P
=
k
i=1
k=1

k=1
Pi
k
m
Independence X

m0
m X

1
k
z}|{
(i)
(i)
=
Ck
P
Pi
P Ck
k
m
i=1
k=1
=1
Pi |H0 U [0,1] m m0
XX
z}|{
=
k=1 i=1

1 k
(i)
P Ck
k m
}|
#{
Disjoint Union X
m
m
0 X (i)
z}|{
P Ck
=
=
m0
m i=1
m
z
"
k=1
As requested.
We have shown the B[3]H procedure controls the FDR given independent test statistics, the question remains what
happens when independence can not be assumed.
Denition 57. Recall that given x, y Rm we denote x y i xi yi i. Furthermore we will say that a subset
D Rm is ascending if for all x D and all y Rm x y = y D.
Denition 58. A RV X := (X1 , ..., Xn ) will be said to be Positively-Regression-Dependent on Subset

(PRDS) if for any ascending D Rm and for all 1 i n the function i,D (x) := P (X D | Xi = x) is
non-decreasing in x R. This means that the probability that X D does not decrease when Xi increases.
Claim 59. Assume we conduct m hypothesis tests based on continuous test statistics T1 , ..., Tm that have the PRDS
property. Then the BH procedure with parameter guarantees FDR
m0
m .
Proof. A generalization of the proof for independent statistics, see [3].

Remark 60. Even though in the case of PRDS dependency (and more generally in other models of positive dependence) the BH procedure controls the FDR which is the expectation of
V
R+
the variance of
V
R+
is generally tends to
grow as the level of positive dependence grows.
4.2.1 The BH procedure under more general dependence:

It turns out that without any assumption regarding the nature of dependence of the test statistics it is possible for
the BH procedure to fail and the FDR would exceed the
m0
m
bound. However, when such a deviation does occur
the degree to which the FDR exceeds the bound is not large, as is emphasized by the following claim:
25
Claim 61. Assume we conduct m hypothesis tests based on continuous test statistics T1 , ..., Tm . Then the BH
procedure with parameter guarantees the following bound:
FDR
m
m0 log (m)
m0 X 1
m i=1 i
m
Proof. See [3].

Remark 62. Simulation studies have shown that in most cases which are not pathological examples the method
usually achieves the
m0
m
bound and when it does not the deviation from the bound is relatively small and does
not reach the level of multiplication by the logarithmic factor.
Conclusion: The BH procedure is not particularly sensitive to existence of dependence of the test statistics.
4.2.2 What is the true meaning of FDR control:
We are reminded that the FDR is the expected value of the false rejection ratio and thus control of the FDR does

not guarantee a low false rejection ratio, specically Var RV+ can be large even when E RV+ is kept small.
Example 63. Suppose we conduct a MHT for testing m = 10000 hypotheses using the BH procedure with
parameter = 0.1 and 1000 rejections were obtained. We would like to deduce that approximately 100 of these
rejections are false and thus approximately 900 of our ndings are true. However, an alternative explanation is
that there is a strong positive dependence between the various test statistics and thus in 10% of cases in which we
implement the procedure there would be 1000 rejections while in the other 90% there will be none (since we chose
= 0.1). In both scenarios it would the FDR is less than 0.1 but the actual circumstances are very dierent.
The problem is that the BH procedure is still quite conservative, especially in cases where m0 m we would get
a situation in which FDR
m0
m
. In actuality we would want a procedure that has enough power to detect
eects that are not signicant in the BH procedure and still guarantees FDR . Notice that if we knew what m0
before using the procedure (which is not possible) we could have used the BH procedure with parameter
m
m0
and
obtain the required result of FDR .
4.2.3 Adaptive Procedures (Modied BH procedures):

Since m0 is not known we would like the estimate it and use the estimator in an attempt to obtain a procedure
which will ensure FDR . This approach bring rise to adaptive procedures which we will now describe (see
[2]). In the adaptive procedure the goal is to set the rejection threshold in an adaptive way which will be suitable
for various m0 values such that eventually the obtained FDR value will be independent of m0 . Theses procedures
generally follow the following scheme:
1. Compute an estimator m
0 for m0 .
26
2. Conduct the BH procedure again with the parameter
m
m
0 .
That is nd:
i = max i {1, ..., m} | p(i)
m
0
(4.1)
n
o
i
3. The rejected hypotheses are then RejectedABH (p1 , ..., pm ; ) := i {1, ..., m} | pi m
.
Iterative Method: Use the procedure iteratively until the estimator m

0 converges (till the dierence in the
estimator between iterations is less than some small error value ).
Remark 64. The simplest way to compute an estimator for m0 is by simply to use the standard BH procedure with
parameter and then take m
0 = m R as an estimator. There are however many other variations, see [2, 4, 9].
Ideally if this method worked perfectly it would guarantee FDR
m0
E[m
0 ] .
In actuality this is not the case and
there are several technical diculties in proving control of the FDR using an adaptive procedure:
First, for any constant c the BH procedure with parameter ca guarantees FDR
RV there is no assurance that FDR
m0
m E [c]
and specically c =
m
m
0
m0
m c.
However, if c is a
doesn't yield to the desired result.
m
Second, if m
0 is an unbiased estimator of m0 then m
0 is biased upwards compared to since by Jensen's
h i
1
1
1
inequality we would know that m
0 = E m
0 > E[m
0 ] and thus

m
1
m0
=
E
=E
m0 >
m
0
m
0
E [m
0]
Which means even if we achieve FDR E
1
m
0
m0 we are not guaranteed FDR .
A possible solution: this analysis immediately shows we would prefer positively biased estimators of m0 (E [m
0] >
m0 ). This type of estimators are more conservative in the sense that they estimate the number of hypothesis for
which H0 is false (m m0 ) to be smaller than it actually is. It turns out that given such estimators it is possible
to prove control of the FDR by under assumption of independence as the following theorem claims:
(1)
Theorem 65. Suppose that m

0 = m
0 (P1 , ..., Pm ) is a monotonic (in Pi values) estimator of m0 . Denote by m
0
the same estimator calculated for m 1 hypotheses and the same Pi values except one value which is Treu Null (that
is except for one Pj for which H0 is"known
# to be true and thus Pj Uniform [0, 1]). Then assuming P1 , ..., Pm are
independent it holds that FDR E
m0
(1)
m
0
Proof. See [13].

Remark 66. This theorem almost fullls the demand FDR
m0
E[m
0]
at the price of increasing m

0 by at most 1.
Furthermore, there are other (biased) estimators that by using some additional conservation corrections guarantee
h i
1
1
that E m
0 m0 and for these estimators control of the FDR to level can be shown (under certain conditions).
Conclusion: At the price of being slightly conservative at estimation of m0 it is possible to construct an adaptive
procedure that under independence of the Pvalues controls the FDR as required. Specically when the dierence
27
between m0 and m is large using such a procedure will provide a great improvement in power compared to the
standard BH procedure.
4.2.4 What to do if the Pvalues are not independent:

When the Pvalues are not independent the variance of the estimators of m0 signicantly grows and the estimators
h i
1
1
become unstable such that the corrections required in order to ensure that E m
0 m0 become greater and
greater. Thus there is no adaptive procedure that controls the FDR under general dependence or even under
specic types of dependence unless one is willing to lose a lot of power by using very conservative estimators of
m0 . However, doing that is pointless since the original purpose of the adaptive procedure was to improve the power
compared to the standard BH procedure.
Conclusion: if it is known (or shown) that there is only a weak dependence between the Pvalues then the use
of an adaptive procedure can be very attractive. On the other hand, when the dependence is strong it is not
recommended to use these procedures since the FDR can grow beyond what is expected.
It remains an open challenge to nd an adaptive procedure that both ensures high power and control the FDR at
level even when there is dependence of the Pvalues (or to prove no such procedure exists).
4.2.5 Estimation versus control of the FDR:

The purpose of the BH procedure and its variations was the control the size of the FDR. An alternative approach
to the problem is to rst use a certain procedure for MHT and then attempt to estimate the FDR of the procedure
in order to evaluate it (see [9]). Denote by 0 =
m0
m
the proportion of True Null hypothesis.
Denition 67. Suppose a MHT with a rejection threshold (reject the i'th hypothsis i Pi ), we denote:
V () := # {False Rejection with threshold }
R () := # {Rejections with threshold } = # {pi }
Storey's Procedure (2002):

1. Select a parameter (0, 1).
2. Compute all the Pvalues p1 , ...., pm .
3. Estimate 0 by
0 () =
#{pi >}
(1)m .
4. Given a threshold (0, 1) estimate P (Pi ) by
R()
z
}|
{
1
P (Pi ) =
max # {pi }, 1
() :=
5. Estimate the FDR by FDR
0 ()
i )
P(P
28
4.3 An alternative method for MHT using Qvalues:
Explanation:
1. The reason
0 () is a sensible (and unbiased) estimator of 0 is that
"
E [# {Pi > }] = E
m
X
#
1{Pi >} =
i=1
m
X
#{pi >}
(1)
is an unbiased estimator of m0 :
m
X

E 1{Pi >}
i=1
z}|{
P (Pi > ) = m0 P (Pi > | Pi Uniform [0, 1]) = m0 (1 )
i=1
The marked equality is the result of the fact that Pi > i the i'th hypothesis was not rejected with condence
. Thus P (Pi > ) = 0 unless Pi is distributed under H0 in which case it is distributed Uniform [0, 1] and has
probability (1 ) to larger than . Since there are m0 such i values for which P (Pi > ) = (1 ) and for
the remaining i values P (Pi > ) = 0 the result is obtained.
() is a sensible estimator of FDR = E

2. The reason FDR
(a) First,
V
R+
is two fold:
V ()
m
is the proportion of false rejections with threshold and thus

0 () is an estimator
i
h
V ()
m0
0 is an unbiased estimator of 0 = m .
()] := E m since
h
i
(Pi ) by denition clearly is an estimator of 1 E [R ()] = E R() .
(b) Second, P
m
m
1
m E [V
Thus the quotient
0 ()
i )
P(P
is an estimator of
E[V ()]
E[R()] ,
meaning we estimated the quotient of the expectation
instead of the expectation of the quotient which is the FDR.
() has some desirable properties such as in the

Under certain conditions it can be shown that the estimator FDR
case where m grows bu the proportion 0 remains constant (in which case but R and V grow in tandem).
Despite we described this as an estimating procedure for the FDR it can also be used for purpose of control as part
of the adaptive BH procedure if we use the estimator m
0 = m
0 (), furthermore we can optimize over to achieve
even better results.
4.3 An alternative method for MHT using Qvalues:

Storey ([10]) proposed an alternative method for MHT using what he termed q-values. The goal of the procedure is
to produce a measure of condence in each hypothesis which is analogous to the p-value but is suitable for multiple
hypothesis.
Denition 68. Given a MHT with m hypotheses the q-values q1 , ..., qm are the minimal values such that the
BH procedure with parameter will reject the i'th hypothesis respectively.
Remark 69. The q-values represent the signicance of each hypothesis in the context of being tested as part of a
MHT. Using q-values is obviously equivalent to using the BH procedure but has the advantage of giving a more
accessible representation to MHT by transferring the majority of the diculty the the computation of the q-values.
After said computation is done it is simply used to determine whether to reject the i'th hypothesis like one would use
29
4.4 Other variations of FDR:
the p-value for a single hypothesis test. Meaning, if we want to ensure FDR it suces to reject all hypothesis
for which qi . The following algorithm computes the q-values:
1. Compute all the Pvals p1 , ..., pm and order them p(1) ... p(m) .
2. Compute q(i) = min

p(i) m
i ,1 .

3. Shrink and order: for i = m 1 down to 1 set q(i) = min q(i) , q(i+1) .
4. To get qi from q(i) perform the opposite permutation to pi 7 p(i) .
Step 3 is performed in order to regain the order of the ordered values q(i) since it is impossible that p(i) < p(j)
and at the same time q(i) > q(j) . One can show this algorithm indeed computes the values in accordance with the
denition of the q-values.
Remark 70. In general when conducting a MHT we assume that we only have access to the p-values and that we
have no preference for certain hypothesis over others. Assuming this is true we will always use the same rejection
threshold for all hypotheses since there is no logical reason to reject one hypothesis with a certain p-value if we did
not reject another hypothesis that has a lower p-values. Under dierent assumptions it is possible to dene other
procedures that do not use an identical rejection threshold for all hypotheses.
4.4 Other variations of FDR:

- positive FDR
- local FDR (lfdr)
4.5 Empirical Bayes View:
30
REFERENCES
REFERENCES
References
[1] Yoav Benjamini and Yosef Hochberg,
Controlling the false discovery rate: a practical and powerful approach to
multiple testing, Journal of the Royal Statistical Society. Series B (Methodological) (1995), 289300.
Adaptive linear step-up procedures that control the
[2] Yoav Benjamini, Abba M Krieger, and Daniel Yekutieli,
false discovery rate, Biometrika 93 (2006), no. 3, 491507.

[3] Yoav Benjamini and Daniel Yekutieli,
The control of the false discovery rate in multiple testing under depen-
dency, Annals of statistics (2001), 11651188.

[4] Bradley Efron, Robert Tibshirani, John D Storey, and Virginia Tusher, Empirical bayes analysis of a microarray
experiment, Journal of the American statistical association 96 (2001), no. 456, 11511160.
[5] Arthur Gretton, Karsten M Borgwardt, Malte Rasch, Bernhard Schlkopf, and Alexander J Smola,
A kernel
method for the two-sample-problem, Advances in neural information processing systems 19 (2007), 513.
[6] Wassily Hoeding,
A non-parametric test of independence, The Annals of Mathematical Statistics (1948),
546557.
[7] Shachar Kaufman, Ruth Heller, Yair Heller, and Malka Gorne, Consistent distribution-free tests of association
between univariate random variables, arXiv preprint arXiv:1308.1559 (2013).

[8] David N Reshef, Yakir A Reshef, Hilary K Finucane, Sharon R Grossman, Gilean McVean, Peter J Turnbaugh,
Eric S Lander, Michael Mitzenmacher, and Pardis C Sabeti,
Detecting novel associations in large data sets,
science 334 (2011), no. 6062, 15181524.

[9] John D Storey,
A direct approach to false discovery rates, Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 64 (2002), no. 3, 479498.

[10]
, The positive false discovery rate:
A bayesian interpretation and the q-value, Annals of statistics (2003),
20132035.
[11] Gbor J Szkely, Maria L Rizzo, Nail K Bakirov, et al.,
Measuring and testing dependence by correlation of
distances, The Annals of Statistics 35 (2007), no. 6, 27692794.

[12] Gbor J Szkely, Maria L Rizzo, et al., Brownian
distance covariance, The annals of applied statistics 3 (2009),
no. 4, 12361265.
[13] Amit Zeisel, Or Zuk, Eytan Domany, et al.,
Fdr control with adaptive procedures and fdr monotonicity, The
Annals of Applied Statistics 5 (2011), no. 2A, 943968.
31

Notes English

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Notes English

Uploaded by

Copyright:

Available Formats

Modern Statistical Data Analysis:

February 27, 2015

1 Data and Data Preprocessing:

Completion of Missing Values: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2 Classical Hypothesis Testing:

Some Notation and Reminders: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Classical Problems in Parametric Statistics: . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3 Tests for Independence:

Classical tests for independence of scalar RVs: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Testing for independence using permutations: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Testing for independence using a Kernel function:

Testing for independence using the Distance Correlation (dCor) method: . . . . . . . . . . . .

Another distance based method for continuous univariate RVs: . . . . . . . . . . . . . . . . .

An extension of Hoeding's method:

An information theoretic test for independence: . . . . . . . . . . . . . . . . . . . . . . . . . .

Comparison of dierent tests for independence: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Equitable tests for independence: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4 Multiple Hypothesis Testing (MHT):

Family Wise Error Rate (FWER): . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

The Bonferroni Correction: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

False Discovery Rate (FDR): . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

The BH procedure under more general dependence:

What is the true meaning of FDR control: . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Adaptive Procedures (Modied BH procedures): . . . . . . . . . . . . . . . . . . . . . . . . .

What to do if the Pvalues are not independent: . . . . . . . . . . . . . . . . . . . . . . . . . .

Estimation versus control of the FDR: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

An alternative method for MHT using Qvalues: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Other variations of FDR: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Empirical Bayes View: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1 DATA AND DATA PREPROCESSING:

Data and Data Preprocessing:

1.1 Visualization of Data:

Figure 1.2: in this scatter-plot we see an outlier. This

1.2 Data Transformations:

1 DATA AND DATA PREPROCESSING:

1.2 Data Transformations:

1.3 Completion of Missing Values:

2 CLASSICAL HYPOTHESIS TESTING:

1.3 Completion of Missing Values:

Classical Hypothesis Testing:

2.1 Some Notation and Reminders:

Denition 3. An independent sample or realization x1 , ..., xn Rd from the distribution F is a realization

Denition 6. Given a family of distributions F (x; ) (a statistical model) parameterized by , an estimator of

that quanties the

2.1.1 Classical Problems in Parametric Statistics:

2.1 Some Notation and Reminders:

2 CLASSICAL HYPOTHESIS TESTING:

2.1 Some Notation and Reminders:

2 CLASSICAL HYPOTHESIS TESTING:

Remark 12. A few remarks:

It is a common misconception that the following relation exist

2.1 Some Notation and Reminders:

2 CLASSICAL HYPOTHESIS TESTING:

2.2 Non-Parametric Tests:

2 CLASSICAL HYPOTHESIS TESTING:

2.2 Non-Parametric Tests:

i=1 ri 1{yi =j}

1{yi =1} The test statistic u = R1

is constant. The main advantage

2.2.1 Permutation Tests:

2.2 Non-Parametric Tests:

2 CLASSICAL HYPOTHESIS TESTING:

1. Draw N random permutations s1 , ..., sN Sn .

4. Reject H0 if P < for given condence .

An extension of Hoeding's method:

Comparison of dierent tests for independence: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Adaptive Procedures (Modied BH procedures): . . . . . . . . . . . . . . . . . . . . . . . . .

Denition 3. An independent sample or realization x1 , ..., xn Rd from the distribution F is a realization

Denition 6. Given a family of distributions F (x; ) (a statistical model) parameterized by , an estimator of

that quanties the

4. Reject H0 if P < for given condence .

Sample by simulation {(xi , yi )}ni=1 F

This is an estimator of the correlation coecient given by

Denition 29. The characteristic function of a RV X is dened by X (t) = E eit

It can be shown that V 2 (X, Y ) 0 and X, Y are independent i V 2 (X, Y ) = 0.

1. 0 R2 (x, y) 1 and R2 (x, y) = 0 i x1 = y1 = ... = xn = yn .

Perform a permutation test to check for signicance of the test statistic.

3.2.4 An extension of Hoeding's method:

3.3 Comparison of dierent tests for independence:

3.3 Comparison of dierent tests for independence:

3.3 Comparison of dierent tests for independence:

3.3 Comparison of dierent tests for independence: