Contents

Part I: Hypothesis Testing
1.1 Visualization of Data
1.2 Data Transformations
2.1.1 Non-Parametric Tests
2.2.1 Permutation Tests
References
Part I
Hypothesis Testing:
During this course, when we discuss "data" we will usually be referring to a collection of vectors x1, ..., xn ∈ R^p, which we usually assume are realizations of an i.i.d. sequence Xi ~ F, with F being some known or unknown distribution.
Examination of scatter-plots of the various variables in relation to one another can help detect trends and relations, as well as outliers. For example, consider these two images:
Figure 1.1: Passing a linear trend-line through a scatter-plot of (x, y) values (blue dots). We would like to model the relation between X and Y, and possibly predict, given an X value, what corresponding Y value would match it. The linear regression line (in black) is one way to do this.
Examination of density-plots might be more informative when the number of observations and their location makes a regular scatter-plot uninformative.
Figure 1.3: On the left we see a scatter-plot of (x, y) values; the sheer number of observations and the way they are scattered does not allow us to notice any trend, as all areas look equally dense (equally covered with black dots). The right-hand image shows a density plot where different colors represent the point density in each area. In this particular example the density in the red area is roughly 1000 times higher than in the purple area, something that could not be seen in the left plot.
Examination of box-plots gives a compact and convenient representation of the center and spread of the data in various subgroups. This is mostly useful when a large number of variables needs to be compared or examined together.
Figure 1.4: A box-plot represents the distribution of the observations around the median across several groups. This allows comparison of both the location and the spread of the various groups, and detection of differences between them.
A sample x1, ..., xn is a collection of realizations from a sequence of random variables X1, ..., Xn ~ F (here Xi can be either scalar or vector valued).
Definition 4. Given a distribution F, a statistic of the distribution is a RV T(X1, ..., Xn) which is a function of an i.i.d. sequence of RVs X1, ..., Xn ~ F. The value of the statistic, T(x1, ..., xn), is calculated based on the realizations.
Remark 5. In many cases the statistics we will discuss will be estimators for the parameters of some distribution. For example, the sample average is an estimator for the parameter which is the distribution's expectation.
Definition 9. A risk function for an estimator θ̂ of the parameter θ is the expected value of some loss function,

R(θ̂, θ) = E[L(θ̂, θ)]

One common example is the mean-squared-error risk, MSE(θ̂) := E[(θ̂ − θ)²].
Suppose the data is modeled by a family of distributions F(x; θ) parameterized by θ (both x and θ can be scalars or vectors), such that x1, ..., xn is an i.i.d. sample from F. In this context several classical questions arise:
1. Point Estimation: Given the observations x1, ..., xn one tries to find an estimator θ̂(x1, ..., xn) which approximates the unknown parameter θ. There are a few desirable qualities in such an estimator:
(a) Consistency: we would like the estimator to converge in probability to the real value of θ.
(b) Lack of Bias: we would ideally like the estimator to have little or no bias, in the sense that E[θ̂ − θ] ≈ 0.
(c) Low Risk: given some risk function R we would like an estimator with low risk.
2. Confidence Interval Estimation: a confidence interval [a, b] for the parameter θ with confidence level α is any interval such that P[θ ∈ [a, b]] = 1 − α. We often want to estimate such intervals based on observations, and the values of the interval edges (usually taken to be symmetric) are functions of the observations.
3. Hypothesis Testing: hypothesis testing deals with decision problems of the type θ = θ0 or θ ≠ θ0. There is a null hypothesis regarding the true value of the parameter, denoted by θ0, and the objective is to decide with a given level of certainty whether to accept or reject this null hypothesis.
Several key elements of all of these problems are the number of parameters that need to be estimated, the amount of data available, and the number of hypotheses we wish to test.
Definition 10. Suppose x1, ..., xn is a realization of X1, ..., Xn ~ F(x; θ) (i.i.d.). Given a null hypothesis H0 regarding the value of θ and the complementary alternative hypothesis H1, there are several stages to a statistical test of H0:
1. Definition and calculation of a test statistic T(X1, ..., Xn) based on the observations.
2. Definition of a rejection region R(α) into which, under H0, the test statistic falls with probability α.
3. Rejection of H0 iff T(x1, ..., xn) ∈ R(α) (equivalently, acceptance of H0 iff T ∈ R(α)^c).
There are two kinds of errors that arise in this context:
1. Type 1 Error (false positive): α = P(T ∈ R | H0 is true)
2. Type 2 Error (false negative, or miss): β = P(T ∈ R^c | H1 is true)
Definition 11. Given a statistical test for θ with test statistic T(x1, ..., xn), the Pvalue of the test is defined as the probability of observing a value at least as extreme as T(x1, ..., xn) under the distribution of the observations under H0:

Pvalue = P( T(X1, ..., Xn) ≥ T(x1, ..., xn) | T(X1, ..., Xn) is distributed according to H0 )

It is important to remember that, just like T(X1, ..., Xn), the Pvalue is a random variable which is a function of X1, ..., Xn. Furthermore, if H0 is true then the Pvalue is distributed Uniform[0, 1].
Figure 2.1: The empirical density (left) and empirical cumulative distribution (right) of 500 Pvalues calculated in a simulation in which the data was sampled according to H0. The simulation indeed shows that the approximate distribution of the Pvalues in such a case is roughly Uniform[0, 1].
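A simulation of this kind takes only a few lines. The following is a minimal sketch (the choice of test here, a one-sample z-test with known σ, is our own illustration): 500 datasets are drawn under H0 and the resulting Pvalues should look roughly Uniform[0, 1].

```python
import math
import random

def z_test_pvalue(sample, mu0=0.0, sigma=1.0):
    """Two-sided z-test P-value for H0: E[X] = mu0, with known sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    phi = 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal CDF at z
    return 2 * min(phi, 1 - phi)

random.seed(0)
# 500 datasets generated under H0 (the mean really is 0) -> 500 P-values.
pvals = [z_test_pvalue([random.gauss(0, 1) for _ in range(30)])
         for _ in range(500)]

# If the P-values are ~ Uniform[0, 1], their mean is near 0.5 and roughly
# 10% of them fall below 0.1.
print(sum(pvals) / len(pvals))
print(sum(p < 0.1 for p in pvals) / len(pvals))
```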
In order to calculate the type-2 error one would need an assumption on the distribution of the test statistic under H1.
Definition 13. Given a statistical test with test statistic T and rejection region R(α), the power of the test is

Power = 1 − β = P(T ∈ R(α) | H0 is false)

This value of course depends on the choice of α. It is generally desirable that a test have both a low α and high power simultaneously, but in reality there is a trade-off between the two, and that is not always possible.
Figure 2.2: The density of the test statistic under H0 (in blue) and under H1 (in red). Given some rejection threshold (denoted by the black line), we reject H0 if the test statistic is to the right of the line. The level α is then the area to the right of the line under the density given H0 (marked in red), while the power of the test is the area to the right of the line under the density given H1 (marked in green). It can be seen that moving the rejection threshold impacts both α and the power.
Example 14. One of the simplest examples of a parametric test is the independent two-sample test for equality of means. Assume that we are given observations {(xi, yi)}_{i=1}^n, where yi ∈ {0, 1} is a categorical variable, and we assume that the xi values are realizations from the distribution Xi | yi ~ N(μ_{yi}, σ²). We would like to test the hypothesis H0: μ0 = μ1, that is, that the means in both groups are equal. If we assume that σ is known, then we can conduct a simple Z-test by calculating the z-score of the mean difference:
nj = Σ_{i=1}^n 1{yi = j},    μ̂j = (1/nj) Σ_{i=1}^n xi · 1{yi = j}

Z = (μ̂1 − μ̂0) / sqrt( σ² (1/n0 + 1/n1) )
Under the assumption that H0 is true, Z ~ N(0, 1), and thus we will reject H0 iff Z > z_{1−α/2} or Z < −z_{1−α/2}, where Φ(z_{1−α/2}) = 1 − α/2. This is a two-sided test in which we have divided the rejection area equally between the two tails of the distribution. We can also calculate the Pval of the test, given by Pval = 2·min{Φ(Z), 1 − Φ(Z)}. Similarly, if the variance of the distribution is unknown, we would use a t-test with n − 2 degrees of freedom by calculating the following statistic:
t_{n−2} = (μ̂1 − μ̂0) / sqrt( σ̂0²/n0 + σ̂1²/n1 ),    where σ̂j² = (1/(nj − 1)) Σ_{i=1}^n (xi − μ̂j)² · 1{yi = j}
The quantity in the denominator is simply an estimator for σ². In this case the distribution of t_{n−2} is no longer normal, but it is a known distribution called the T-distribution with n − 2 degrees of freedom. This distribution can be used to calculate the rejection region and the Pval in the same way as with the Z-test.
One problem with the t-test is that it assumes an underlying normal distribution of the observations, an assumption which cannot be omitted and which thus limits this test to specific cases. Additionally, the t-test is not guaranteed to control the type-1 error α, especially for skewed distributions.
Definition 15. A statistic T(X1, ..., Xn) is called distribution free if the distribution of T does not depend on F. Furthermore, a hypothesis test for H0 will be called a distribution-free test if the test statistic is distribution free, that is, its distribution does not depend on the distribution of the observations under H0.
Remark 16. While a distribution-free test statistic must be independent of the distribution of the data under H0, it is possible that the distribution of the statistic does depend on the alternative hypothesis.
Example 17. We will now give an example of a distribution-free nonparametric test for the two-independent-sample decision problem. Given the observations {(xi, yi)}_{i=1}^n, where again yi ∈ {0, 1}, we would like to determine whether Xi | yi = j has the same distribution for both j = 0 and j = 1. The Mann-Whitney-Wilcoxon rank-sum test allows us to do this, and works as follows:
1. The rank of each observation is calculated as ri = Σ_{j=1}^n 1{xj ≤ xi}.
2. For each group j, the rank sum Rj = Σ_{i : yi = j} ri is calculated.
3. Given n1 = Σ_{i=1}^n 1{yi = 1}, the statistic U = R1 − n1(n1 + 1)/2 is calculated.
4. The distribution of U is calculated, and a rejection region is chosen accordingly given confidence level α.
Although at first glance U may not appear symmetric, it actually is, since R1 + R2 = n(n + 1)/2. The main advantage of this test is the fact that it is distribution free, and it can thus be used regardless of the underlying distribution of the observations. The main question that remains is how to calculate the distribution of U in order to obtain the rejection region. In practice, for relatively small samples (up to 20 observations) there are tables that directly give said distribution, while for large samples there is a normal approximation to the distribution of U. An alternative, more modern method is to use numerical methods, as we will now discuss.
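The rank-sum steps above can be sketched in a few lines (ignoring ties; the data is made up for illustration):

```python
def mann_whitney_u(x, y):
    """U = R1 - n1(n1+1)/2, where y holds 0/1 group labels."""
    ranks = [sum(xj <= xi for xj in x) for xi in x]  # r_i = #{j : x_j <= x_i}
    r1 = sum(r for r, yi in zip(ranks, y) if yi == 1)  # rank sum of group 1
    n1 = sum(y)
    return r1 - n1 * (n1 + 1) / 2

x = [3.1, 1.2, 5.0, 2.2, 4.4, 0.7]
y = [1, 0, 1, 0, 1, 0]
# Every group-1 value exceeds every group-0 value, so U attains its
# maximum n1 * n0 = 9.
print(mann_whitney_u(x, y))
```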
Given the data {(xi, yi)}_{i=1}^n, where the yi are again binary labels, we want to test the hypothesis H0: X, Y are independent (equivalently, if the first group is x1, ..., x_{n1} ~ F and the second is x'1, ..., x'_{n2} ~ G, the hypothesis would be H0: F = G). Assume that Tobs := T((x1, y1), ..., (xn, yn)) is some statistic (essentially any function of the data). A permutation test based on the statistic T would be carried out as follows:
1. Sample N random permutations π1, ..., πN of the indices {1, ..., n}.
2. For each permutation calculate the test statistic on the permuted data: Ti := T({(xj, y_{πi(j)})}_{j=1}^n).
3. Compute the empirical Pvalue P = (1/N) Σ_{i=1}^N 1{Ti > Tobs}.
Claim 18. For any joint distribution (Xi, Yi) ~ F and any test statistic T({(xi, yi)}_{i=1}^n), under H0 the distribution of the empirical Pvalue P is Uniform{0, 1/N, 2/N, ..., 1}.
Proof. Sketch of proof: under H0 the RVs T1, ..., TN, Tobs are i.i.d., and thus #{i : Ti > Tobs} is uniformly distributed on {0, ..., N}, which immediately gives us that P ~ Uniform{0, 1/N, 2/N, ..., 1}.
Remark 19. All of this is correct assuming that T is continuous and there are no ties. However, if T is discrete and ties are possible, the distribution of P is still approximately uniform.
Advantages of permutation tests: accuracy, no underlying assumptions about the data, flexibility (they can be used with complicated null models).
Disadvantages of permutation tests: they are computationally intensive, and they do not provide any insight regarding the analytic characteristics of the distribution (for example, the relation between the power of the test and the sample size is not easily understood).
The power of a permutation test under a given alternative can itself be estimated by simulation:
Input: confidence level α, an assumed distribution F of the data under H1, sample size n.
Parameters: N - number of permutations, K - number of simulations.
for k = 1, ..., K do:
    sample n observations from F, run a permutation test with N permutations, and set Rk = 1 if H0 was rejected at level α (Rk = 0 otherwise).
The estimated power is then (1/K) Σ_{k=1}^K Rk.
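The power-estimation loop can be sketched as follows; the assumed H1 here (two normal groups with a mean shift delta) and the mean-difference statistic are our own illustrative choices:

```python
import random

def permutation_reject(x, y, alpha, n_perm, rng):
    """One permutation test: True iff H0 is rejected at level alpha."""
    def diff(labels):
        g1 = [xi for xi, l in zip(x, labels) if l == 1]
        g0 = [xi for xi, l in zip(x, labels) if l == 0]
        return sum(g1) / len(g1) - sum(g0) / len(g0)
    t_obs = diff(y)
    labels = list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        count += diff(labels) > t_obs
    return count / n_perm < alpha

def estimate_power(delta, n=15, alpha=0.05, K=100, n_perm=100, seed=1):
    """Fraction of K simulated datasets (mean shift delta) where H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(K):
        x = ([rng.gauss(0, 1) for _ in range(n)] +
             [rng.gauss(delta, 1) for _ in range(n)])
        y = [0] * n + [1] * n
        rejections += permutation_reject(x, y, alpha, n_perm, rng)
    return rejections / K

print(estimate_power(1.5))   # large shift: power close to 1
print(estimate_power(0.0))   # no shift: rejection rate near alpha
```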
Suppose we are given two RVs X, Y and we would like to test whether they are independent. Ideally we would like
a test that would be able to detect any kind of probabilistic dependence between the two variables (not necessarily
linear or monotonic for example).
Definition 21. Given Xn ~ F(x; θ) (i.i.d.), an estimator θ̂(X1, ..., Xn) for θ is said to be consistent if θ̂ → θ in probability as n → ∞.
Definition 22. A sequence of tests Tn = (gn, Rn), with sample size n, test statistic gn(X1, ..., Xn) and rejection region Rn for the rejection of H0, is said to be consistent if the two following properties hold: under H0 the type-1 error is asymptotically at most α, and under any fixed alternative in H1 the power converges to 1 as n → ∞.
The general method for constructing a consistent test for independence is as follows:
1. Find a distance measure D(X, Y) ≥ 0 such that D(X, Y) = 0 iff X, Y are independent.
2. Find a consistent sequence of estimators D̂n for D.
3. Find a sequence of thresholds εn and define the rejection regions Rn = { D̂n > εn }.
Remark 23. The final step is often tricky, as it depends on the rate of convergence of D̂n.
Example 24. Testing for independence using a kernel method:
Suppose we again have data {(xi, yi)}_{i=1}^n ~ F (i.i.d.) where yi is binary, and we would like to test H0: X, Y are independent. We define a kernel Kh(xi, xj) and compare the kernel values for pairs from the same and from different groups by calculating the following statistic:

T = Σ_{i,j=1}^n s(yi, yj) · Kh(xi, xj),    where s(yi, yj) = +1 if yi = yj and −1 otherwise

The kernel function could, for example, be a Gaussian kernel Kh(xi, xj) = exp(−‖xi − xj‖²/h²). Here h is a width parameter which determines how fast the kernel goes to zero, and in turn impacts the rejection region (it is possible to pick an optimal h value using a more sophisticated technique). When the X values do not depend on Y we would expect E[T] = 0, and thus it would suffice to test the hypothesis that E[T] = 0 by comparing the value of |T| to an appropriate critical value. This can be done either by a normal approximation of the distribution of T or directly using permutations.
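As a sketch of the statistic of Example 24, with the sign convention described there (positive for same-group pairs, negative otherwise), a Gaussian kernel, an arbitrary width h = 1, scalar x values, and made-up data; significance is checked by permutations:

```python
import math
import random

def kernel_stat(x, y, h=1.0):
    """Signed Gaussian-kernel statistic: + for same-group pairs, - otherwise."""
    t = 0.0
    for i in range(len(x)):
        for j in range(len(x)):
            sign = 1.0 if y[i] == y[j] else -1.0
            t += sign * math.exp(-((x[i] - x[j]) ** 2) / h ** 2)
    return t

def perm_pvalue(x, y, n_perm=500, seed=0):
    """Permutation P-value for |T| being significantly different from zero."""
    rng = random.Random(seed)
    t_obs = abs(kernel_stat(x, y))
    labels = list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        count += abs(kernel_stat(x, labels)) >= t_obs
    return count / n_perm

# Two groups whose x values are clearly separated -> strong dependence on y.
x = [0.0, 0.2, -0.1, 0.1, -0.2, 0.15, 0.05, -0.05,
     3.0, 3.2, 2.9, 3.1, 2.8, 3.15, 3.05, 2.95]
y = [0] * 8 + [1] * 8
print(perm_pvalue(x, y))
```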
A first, classical measure of dependence is the (empirical) Pearson correlation coefficient:

R(x, y) = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / sqrt( Σ_{i=1}^n (xi − x̄)² · Σ_{i=1}^n (yi − ȳ)² )
Fact 25. When ρ(X, Y) = 0 we say that X, Y are uncorrelated. It is known that independent RVs are always uncorrelated, but the converse does not hold.
Figure 3.1: Values of the Pearson correlation coefficient for various types of dependence and various levels of noise. In the first and second lines, the value of the coefficient depends on the amount of noise but does not depend on the slope, as long as the slope is not zero. The third line shows that for non-linear dependence the coefficient can very well be zero despite the fact that the variables have a functional relationship and are not independent.
Another measure of dependence is the Kendall tau rank correlation coefficient, defined by:

τ(x, y) = (2 / (n(n − 1))) · Σ_{i<j} sign( (xi − xj)(yi − yj) )
Using permutations it is again possible to perform a hypothesis test for the significance of this coefficient. The main disadvantages of τ are that it is only defined for scalar RVs and that it is only capable of detecting monotonic dependencies. On the other hand, it has the advantage of being a distribution-free statistic.
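A direct sketch of this definition (no tie correction):

```python
def kendall_tau(x, y):
    """tau = 2/(n(n-1)) * sum over i<j of sign((x_i-x_j)(y_i-y_j))."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)   # sign of the concordance
    return 2 * s / (n * (n - 1))

print(kendall_tau([1, 2, 3, 4], [10, 20, 30, 40]))   # all pairs concordant
print(kendall_tau([1, 2, 3, 4], [40, 30, 20, 10]))   # all pairs discordant
```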
Recall that two RVs X ∈ R^p, Y ∈ R^q are independent iff PX(x)·PY(y) = PX,Y(x, y) for all (x, y) ∈ R^p × R^q. If X, Y are continuous with densities fX, fY and joint density fX,Y, then an equivalent condition is that fX(x)·fY(y) = fX,Y(x, y) for all (x, y) ∈ R^p × R^q.
Assume we are given two RVs X ∈ R^p, Y ∈ R^q with marginal distributions PX, PY and joint distribution PX,Y. We would like a statistical test which will determine whether the probability distributions PX × PY and PX,Y are identical. When the distributions are both univariate, the Kolmogorov-Smirnov two-sample test provides an analytic solution to this problem, but it does not generalize to the multivariate case. We would thus want to approach the problem by defining a statistic whose distribution is different when X, Y are dependent or independent, and then use a permutation test in order to test for significance of this statistic.
Remark 27. The problem of testing for independence is similar (but not identical) to testing for equality of distribution in a two-sample setting. When testing for equality of distribution we are given a set of data sampled from a distribution P and a different set of data sampled from a distribution Q, and we would like to determine whether P ≡ Q based on these two independent samples. In comparison, when testing for independence we would like to determine whether PX,Y ≡ PX × PY, but in this case we evaluate these distributions based on a single sample.
There are several types of methods for testing for independence using permutations:
Kernel methods: these methods rely on the definition of a kernel K(x, y) which measures similarity between x and y. The values of the kernel are then computed and compared for pairs from the same group and from the two different groups.
Geometric methods: these methods rely on defining a distance-between-distributions measure D(X, Y).
Information-based methods: it can be shown that two RVs X, Y are independent iff their mutual information I(X, Y) equals zero. This provides a way of testing for independence by computing or estimating the mutual information.
In Example 24 we described a two-sample method for testing independence given a sample {(xi, yi)}_{i=1}^n where yi was a binary value, based on the statistic

T = Σ_{i,j=1}^n s(yi, yj) · Kh(xi, xj)    (3.1)

where for yi = yj the pair contributes with a positive sign (s = +1) and for yi ≠ yj with a negative sign (s = −1); all these values are summed up, and a permutation test is used to check whether the computed value is significantly different from zero. In the more general case, where we are given a single sample {(xi, yi)}_{i=1}^n ~ PX,Y (i.i.d.), we need to adapt this method to suit our needs (see [5]). One way to do this is to treat the pairs (xi, yi) as sampled from PX,Y and the pairs (xi, yj), i ≠ j, as samples from PX × PY. In order to do this we define A := {(xi, yi)}_{i=1}^n and B := {(xi, yj) | 1 ≤ i ≠ j ≤ n}, and notice that under H0, A is an i.i.d. sample from PX,Y and B is an i.i.d. sample from PX × PY. We thus want to determine whether these two samples are identically distributed, since this is equivalent to independence of X, Y. In order to do this we will compute the following values:
+Kh((xi, yi), (xj, yj)), i ≠ j          (two points from the same group A)
−Kh((xi, yi), (xl, yk)), l ≠ k          (one point from A and one from B)
−Kh((xl, yk), (xi, yi)), l ≠ k          (one point from B and one from A)
+Kh((xi, yj), (xl, yk)), i ≠ j, l ≠ k   (two points from the group B)

where we note that the pairs (xi, yj) ∈ R^{p+q} are the "points" in our sample. We then calculate the test statistic given in formula 3.1 and perform a standard permutation test in order to test for significance.
Remark 28. One small problem with this method is that there are usually many more points in group B than in A.
3.2.2 Testing for independence using the Distance Correlation (dCor) method:
Definition 33. Given any sequence of values aij indexed by i ∈ {1, ..., n} and j ∈ {1, ..., m}, we denote:

āi· := (1/m) Σ_{j=1}^m aij,    ā·j := (1/n) Σ_{i=1}^n aij,    ā := (1/(n·m)) Σ_{i=1}^n Σ_{j=1}^m aij

If the values aij were to be laid out in an n × m table, these would be the row, column and total averages respectively.
We mentioned previously that the Pearson correlation coefficient measures dependence, but only of a linear nature. More general dependence can be measured by looking at the correlations between the distances of the observations from one another, rather than at the original observations. The general intuition is that if X and Y are dependent, then small distances between X values should correspond to small distances between the matching Y values. The dCov method relies on this intuition in order to test for dependence, and is performed as follows:
1. Sample {(xi, yi)}_{i=1}^n where (xi, yi) ∈ R^p × R^q, and denote x := (x1, ..., xn), y := (y1, ..., yn).
2. Compute all the pairwise distances aij = ‖xi − xj‖₂ and bij = ‖yi − yj‖₂.
3. Center the distances by subtracting the row, column and total averages:

Aij = aij − āi· − ā·j + ā
Bij = bij − b̄i· − b̄·j + b̄
4. Define the dCov statistic and dCor statistic as follows:

V²(x, y) = (1/n²) Σ_{i=1}^n Σ_{j=1}^n Aij·Bij

R²(x, y) = V²(x, y) / sqrt(V²(x)·V²(y)) if V²(x)·V²(y) > 0, and 0 if V²(x)·V²(y) = 0,

where V²(x) = V²(x, x) and V²(y) = V²(y, y).
5. Use permutations to compute the distribution of R² and perform a hypothesis test for independence.
Remark 34. Note that the computational complexity here is O(n²), for the computation of all pairwise distances.
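Steps 2-4 can be sketched with NumPy broadcasting (the test data here is our own illustration):

```python
import numpy as np

def dcor2(x, y):
    """Sample dCor^2 = V^2(x,y) / sqrt(V^2(x) V^2(y)), per steps 2-4 above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)  # a_ij = |x_i - x_j|
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=2)
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()  # double centering
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    v_xy = (A * B).mean()    # V^2(x, y) = (1/n^2) sum A_ij B_ij
    v_xx = (A * A).mean()
    v_yy = (B * B).mean()
    denom = np.sqrt(v_xx * v_yy)
    return v_xy / denom if denom > 0 else 0.0

t = np.linspace(0.0, 1.0, 50)
print(dcor2(t, t))        # identical samples: exactly 1
print(dcor2(t, t ** 2))   # nonlinear but strong dependence: close to 1
```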
Why this method works: It can be shown that V²(x, y) is an estimator for the population parameter V²(X, Y), defined in terms of the characteristic functions φX, φY, φX,Y by:

V²(X, Y) := (1/(cp·cq)) ∫_{R^{p+q}} |φX,Y(x, y) − φX(x)·φY(y)|² / ( ‖x‖₂^{1+p} · ‖y‖₂^{1+q} ) dx dy

where cd := π^{(1+d)/2} / Γ((1+d)/2). The corresponding population dCor is:

R²(X, Y) = V²(X, Y) / sqrt(V²(X)·V²(Y)) if V²(X)·V²(Y) > 0, and 0 if V²(X)·V²(Y) = 0.
Another important property: It can be shown that the sample version V²(x, y) is equal to the population version V² if one uses the empirical characteristic functions, given by:

φ̂X(x) = (1/n) Σ_{k=1}^n e^{i⟨x, xk⟩},    φ̂Y(y) = (1/n) Σ_{k=1}^n e^{i⟨y, yk⟩},    φ̂X,Y(x, y) = (1/n) Σ_{k=1}^n e^{i(⟨x, xk⟩ + ⟨y, yk⟩)}

That is, if we sample x := (x1, ..., xn) and y := (y1, ..., yn), calculate these empirical characteristic functions, and define X̃n, Ỹn whose distribution is derived from these characteristic functions, then we will get that V²(X̃n, Ỹn) equals V²(x, y). Based on this relationship one can prove some interesting properties of the statistic R²(x, y), such as:
Furthermore, if X, Y are continuous and dependent, then there exists (x, y) such that fX,Y(x, y) ≠ fX(x)·fY(y), and by continuity there is a ball B := B((x, y), ε) such that fX,Y(u, v) ≠ fX(u)·fY(v) for all (u, v) ∈ B.
Motivation: based on the aforementioned fact, it can be shown that if X, Y are continuous and dependent univariate RVs, then there exists p := (xp, yp) ∈ R² such that, if the plane is divided into four quadrants around p, denoted Q^p_{1,1}, Q^p_{1,2}, Q^p_{2,1}, Q^p_{2,2}, then setting

O^p_{j,k} := ∫_{Q^p_{j,k}} fX,Y(u, v) du dv

it holds that O^p_{1,1}·O^p_{2,2} ≠ O^p_{1,2}·O^p_{2,1}. This is also equivalent to the indicator RVs 1{X > xp} and 1{Y > yp} being dependent.
Conclusion: dependence of two univariate RVs can be identified by inspecting divisions of the plane into quadrants. The main question is how to choose the center of the division which will reveal such dependence. The idea suggested by Hoeffding [6] is to simply let the data itself define the centers.
Figure 3.2: A division of the plane into four quadrants, where one of the observations is selected as the center of the division. If we count the number of observations in each quadrant, then as the sample size grows this approximates the probability of a random sample belonging to that quadrant.
Hoeffding's test is performed as follows:
1. Perform n divisions of the plane into quadrants, where each time pi = (xi, yi) is taken to be the center point.
2. For each division compute a 2 × 2 table of the values o^{pi}_{j,k}, j, k ∈ {1, 2}; these values are, up to normalization, estimators for the values O^{pi}_{j,k} previously defined.
3. Compute the test statistic:

Hn = (1/n⁴) Σ_{i=1}^n ( o^{pi}_{1,1}·o^{pi}_{2,2} − o^{pi}_{1,2}·o^{pi}_{2,1} )²
Remark 36. This test statistic is asymptotically equivalent to the following functional of the empirical CDF:

H̃n := ∫_{R²} ( F̂X,Y(x, y) − F̂X(x)·F̂Y(y) )² dF̂X,Y(x, y)    (3.3)

where the empirical CDF functions are defined as in Definition 32. This fact can be used to prove consistency of the sequence of tests defined by Hn (with appropriate rejection regions). Since X, Y are independent iff FX,Y ≡ FX·FY, it can be seen why H̃n, and thus Hn, may be good measures of independence, and we would expect Hn to converge to zero in case of independence. Note that the integration in the definition is taken according to the empirical measure dF̂X,Y; since this is in fact a discrete distribution, this actually means we sum up the values of the function under the integral sign at the atoms of the distribution.
A generalization selects several of the observations, p_{i1}, ..., p_{i_{k−1}}, and divides the plane into (k + 1)² areas (some of which are infinite) by plotting the vertical and horizontal lines that pass through the selected points. The test statistic is then defined as follows:

Sn,k = Σ_{1 ≤ i1 < ... < i_{k−1} ≤ n} Σ_{j=1}^{(k+1)²} ( oj^{i1,...,i_{k−1}} − ej^{i1,...,i_{k−1}} )² / ej^{i1,...,i_{k−1}}

where the oj^{i1,...,i_{k−1}} are the observed values, which are simply the numbers of observations found in each area of the division defined by the points p_{i1}, ..., p_{i_{k−1}}, and the ej^{i1,...,i_{k−1}} are the expected values under H0 (independence), which are obtained by multiplying the marginal empirical CDF values F̂X, F̂Y for the corresponding area of the division.
Advantage: Both this method and Hoeffding's method can be shown to be distribution free.
Disadvantage: Both methods are only applicable to two univariate RVs.
It remains an open question to find a distribution-free test for multivariate RVs.
The mutual information of two RVs X ∈ R^p, Y ∈ R^q with joint density fX,Y and marginals fX, fY is defined by:

I(X, Y) := ∫ fX,Y(u, v) · log( fX,Y(u, v) / (fX(u)·fY(v)) ) du dv

Remark 40. The mutual information can similarly be defined for discrete or mixed RVs.
Fact 41. Given two RVs X ∈ R^p, Y ∈ R^q, I(X, Y) ≥ 0, and I(X, Y) = 0 iff the two are independent.
Given this fact, the problem of determining dependence is reduced to estimation of the mutual information. The problem is that this estimation is usually difficult, and tests based on this method usually have low power (see ). One such test is the MIC test [8].
Remark 42. There is some evidence that better estimators for the mutual information can yield efficient tests; see [7].
Definition 43. Suppose we are given a parameterized family of distributions f(x; θ) depending on the unknown scalar parameter θ, and a partition of the parameter space Θ = Θ0 ∪ Θ1; we denote by Hi the hypothesis θ ∈ Θi. Given a test statistic T(X1, ..., Xn), the corresponding binary test function for testing a one-sided hypothesis with confidence α is

φ(X1, ..., Xn) = 1 if T(X1, ..., Xn) ∈ Rn(α), and 0 otherwise.

We say that T is uniformly most powerful (UMP) if, for any other test statistic T'(X1, ..., Xn) and corresponding binary test function φ'(X1, ..., Xn) for which

sup_{θ ∈ Θ0} E_θ[φ] = α = sup_{θ ∈ Θ0} E_θ[φ']

it holds that:

E_θ[φ] ≥ E_θ[φ'] for all θ ∈ Θ1.
Fact 44. It can be shown that there are no UMP tests for two-sided hypothesis testing, and there are no UMP tests for vector-valued parameterized families. Furthermore, if the null is of the form H0: θ = θ0 and the alternative is one-sided, H1: θ > θ0, then the Neyman-Pearson lemma guarantees that the likelihood ratio test is a UMP test.
Remark 45. In particular, the likelihood ratio test for testing for independence (which is not UMP, since this is not a simple one-sided parameter hypothesis) rejects H0 if

Σ_{i=1}^n log( fX,Y(xi, yi) / (fX(xi)·fY(yi)) ) > C
In the context of testing for independence it is almost certain that no UMP test exists; one way of reaching this conclusion is the fact that for different types of alternatives (different types of dependence) there are better-suited tests. Thus, instead of looking for a test which is optimal under any alternative, we would like to be able to compare different tests under a specific set of alternative hypotheses in which we are interested, for example alternatives we believe are likely to occur in real data. Given such a set of alternative hypotheses we can perform a comparison of test power using simulations as follows:
1. Generate data from various kinds of dependency models using simulations.
2. Perform the various hypothesis tests one wishes to compare on the simulated data, with given confidence α.
3. Estimate the power of each test using permutations.
4. Evaluate the power of each test as a function of the dependency model, the sample size and α.
Figure 3.3: A list of tests for Independence and their performance under various dependence models.
Definition 46. A dependency measure D(X, Y) will be called equitable if, given RVs X1, X2, Y, a noise factor Z ~ N(0, σ²) and two smooth functions g1, g2 such that Xi = gi(Y) + Z, it holds that D(X1, Y) = D(X2, Y).
Remark 47. This basically means that the measure D(X, Y) evaluates the strength of the dependence equally, regardless of the type of functional dependence.
Figure 3.4: dCor values computed for different types of dependence and noise levels. This illustration shows that while the dCor measure is capable of detecting all types of dependence (the value is positive in all cases where dependence exists), it is not an equitable measure. For example, in the last row the noise level for the circle and semi-circle is almost identical, but the dCor values are quite different (0.2 vs 0.5).
It has also been shown [8] that the mutual-information-based MIC statistic is not equitable. In fact, it remains an open question to find a dependency measure which is equitable, even under restrictions on the functional dependence.
In many cases, when we are faced with performing some statistical test, we will be required to conduct more than a single hypothesis test. At the same time, we will want to control not only the probability of false rejection in each single hypothesis test, but the probability of false rejection across all tests performed.
Remark 48. There is no assumption on any relation between the hypothesis tests; each hypothesis can have its own set of data, test statistic, rejection region and Pval.
Example 49. Suppose we want to test the connection between m = 20,000 genes and a disease, with confidence α = 0.05 in every hypothesis. Without any further restrictions, if we conduct each test separately with α = 0.05, then even if none of the genes are connected to the disease, the expected number of rejections is m·α = 1,000. That is, we will have around 1,000 false rejections. This problem is known as the multiple-hypothesis problem.
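The arithmetic of this example is easy to check by simulation: under H0 each Pval is Uniform[0, 1], so the number of false rejections is Binomial(m, α) with mean m·α.

```python
import random

m, alpha = 20_000, 0.05
print(m * alpha)   # expected number of false rejections: 1000.0

random.seed(0)
# Under H0 every P-value is Uniform[0, 1]; count how many fall below alpha.
false_rejections = sum(random.random() < alpha for _ in range(m))
print(false_rejections)   # concentrates around 1000
```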
Definition 50. Given a MHT with m hypotheses, we will denote the counts of the possible outcomes as follows:

#                  | Not Significant | Significant | Total
True Null          | U               | V           | m0
True Alternative   | T               | S           | m − m0
Total              | m − R           | R           | m

For example, U is the number of hypotheses where H0 is true and was indeed not rejected, and V is the number of hypotheses where H0 is true but was nevertheless rejected (false rejections).
Remark 51. For simplicity we will assume that the test statistics we are discussing are all continuous, and in particular that the distribution of the Pvals under H0 is Uniform[0, 1]. Most of the results we will discuss remain approximately true even if the test statistics are discrete.
Remark 52. Recall that when conducting a hypothesis test based on a statistic T(X1, ..., Xn), the Pval of the test is itself a statistic (a RV) which is a function of X1, ..., Xn. We will now denote this RV by P := P(X1, ..., Xn), and denote its realized value by p := P(x1, ..., xn).
The classical approach (the Bonferroni correction) is to conduct each individual test at level α/m. Using a union bound it can be seen that this achieves the required result:

FWER = P(some rejection is false) = P( ∪_{i=1}^m {Pi < α/m} ) ≤ Σ_{i=1}^m P(Pi < α/m) = m · (α/m) = α

where P(Pi < α/m) = α/m since under H0 we have Pi ~ Uniform[0, 1]. The main problem with this approach is that it is extremely conservative, and it would be extremely difficult to reject any hypothesis, even in cases where the alternative is true and even with powerful tests. That is, there is a significant loss of overall power.
As is evident, control of the FWER is simply too restrictive to allow for an efficient testing method; we thus turn to an alternative approach for controlling the error when conducting multiple hypothesis tests:
Definition 53. Given a MHT with m hypotheses, we denote R⁺ = max{R, 1} and Q = V/R⁺. Q is thus the proportion of false rejections out of the total number of rejections. We then define the FDR of the MHT to be FDR = E[Q], that is, the expected false rejection proportion.
Remark 54. The expectation here is taken with regard to the joint distribution of the Pvals of all the hypothesis tests conducted. This is some distribution defined on [0, 1]^m.
3. Compute $i^* = \max\left\{ i \in \{1, \dots, m\} \;\middle|\; p_{(i)} \le \frac{i}{m}\alpha \right\}$ (if it exists).
4. Reject all the hypotheses that match the Pvals $p_{(1)}, \dots, p_{(i^*)}$ (if $i^*$ doesn't exist, reject none).

The rejected set of hypotheses is thus $\mathrm{Rejected_{BH}}(p_1, \dots, p_m; \alpha) := \left\{ i \in \{1, \dots, m\} \;\middle|\; p_i \le \frac{i^*}{m}\alpha \right\}$.
Figure 4.1:
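The step-up procedure above can be sketched as follows (a minimal sketch; the function name and interface are our own):

```python
# Illustrative sketch of the Benjamini-Hochberg (BH) step-up procedure.

def bh_reject(pvals, alpha=0.05):
    """Return the indices rejected by the BH procedure at level alpha."""
    m = len(pvals)
    # Sort the Pvals while remembering their original indices.
    order = sorted(range(m), key=lambda i: pvals[i])
    # i* = max { i : p_(i) <= (i/m) * alpha }, ranks counted from 1.
    i_star = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * alpha / m:
            i_star = rank
    # Reject the hypotheses matching the i* smallest Pvals (none if i* = 0).
    return sorted(order[:i_star])
```

Note that the scan keeps the *largest* qualifying rank, so a Pval above its own threshold can still be rejected if a larger Pval qualifies; this is exactly the step-up character of the procedure.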
Claim 55. Assume we conduct m hypothesis tests based on continuous independent test statistics $T_1, \dots, T_m$ using the BH procedure with parameter $\alpha$. Then $\mathrm{FDR} = \frac{m_0}{m}\alpha$.
Remark 56. If the statistics are discrete it is still guaranteed that $\mathrm{FDR} \le \frac{m_0}{m}\alpha$.
Proof. We note that the result of the BH procedure given parameter $\alpha$ is a function only of the vector $(p_1, \dots, p_m)$ of Pvals obtained in the tests performed. Define the following events informally: $C_k^{(i)}$ is the event that if the i'th hypothesis is rejected, then exactly k hypotheses are rejected in total.
More formally, $(p_1, \dots, p_m) \in C_k^{(i)}$ iff for all $q_i \in [0,1]$ exactly one of the following holds:
1. $i \notin \mathrm{Reject_{BH}}(p_1, \dots, p_{i-1}, q_i, p_{i+1}, \dots, p_m; \alpha)$ (the Pval $q_i$ does not lead to rejection of the i'th hypothesis).
2. $i \in \mathrm{Reject_{BH}}(p_1, \dots, p_{i-1}, q_i, p_{i+1}, \dots, p_m; \alpha)$ and also $\#\mathrm{Reject_{BH}}(p_1, \dots, p_{i-1}, q_i, p_{i+1}, \dots, p_m; \alpha) = k$.
Two important properties of the events $C_k^{(i)}$ (which we will not prove) are:
1. The event $C_k^{(i)}$ does not depend on the value of the RV $P_i$ and thus is independent of the event $\left\{P_i \le \frac{k}{m}\alpha\right\}$.
2. For any i the collection $\left\{ C_k^{(i)} \;\middle|\; k \in \{1, \dots, m\} \right\}$ is a partition of the sample space, and thus $\sum_{k=1}^{m} P\left(C_k^{(i)}\right) = 1$.
We note that the event $\left\{P_i \le \frac{k\alpha}{m}\right\} \cap C_k^{(i)}$ is identical by definition to the event $\left\{P_i \le \frac{k\alpha}{m}\right\} \cap \{R = k\}$, and thus:

$$P(\{R = k\}) \cdot P\left(P_i \le \tfrac{k\alpha}{m} \,\middle|\, R = k\right) = P\left(\left\{P_i \le \tfrac{k\alpha}{m}\right\} \cap \{R = k\}\right) = P\left(\left\{P_i \le \tfrac{k\alpha}{m}\right\} \cap C_k^{(i)}\right)$$

Where R is again the number of rejections. Thus by independence and linearity of expectation we get (taking WLOG the first $m_0$ hypotheses to be the true nulls):

$$\mathrm{FDR} = E\left[\frac{V}{R^+}\right] = \sum_{k=1}^{m} P(\{R = k\})\, E\left[\frac{V}{R^+} \,\middle|\, R = k\right] = \sum_{k=1}^{m} \sum_{i=1}^{m_0} \frac{1}{k}\, P(\{R = k\})\, P\left(P_i \le \tfrac{k\alpha}{m} \,\middle|\, R = k\right)$$

$$\overset{\text{Same Events}}{=} \sum_{k=1}^{m} \sum_{i=1}^{m_0} \frac{1}{k}\, P\left(\left\{P_i \le \tfrac{k\alpha}{m}\right\} \cap C_k^{(i)}\right) \overset{\text{Independence}}{=} \sum_{k=1}^{m} \sum_{i=1}^{m_0} \frac{1}{k}\, P\left(P_i \le \tfrac{k\alpha}{m}\right) P\left(C_k^{(i)}\right)$$

$$\overset{P_i|H_0 \sim U[0,1]}{=} \sum_{k=1}^{m} \sum_{i=1}^{m_0} \frac{1}{k} \cdot \frac{k\alpha}{m}\, P\left(C_k^{(i)}\right) = \frac{\alpha}{m} \sum_{i=1}^{m_0} \underbrace{\sum_{k=1}^{m} P\left(C_k^{(i)}\right)}_{= 1 \text{ (Disjoint Union)}} = \frac{m_0}{m}\,\alpha$$

As requested.
We have shown that the BH procedure controls the FDR given independent test statistics; the question remains what happens when independence cannot be assumed.
Definition 57. Recall that given $x, y \in \mathbb{R}^m$ we denote $x \le y$ iff $x_i \le y_i$ for all i. Furthermore, we say that a subset $D \subseteq \mathbb{R}^m$ is ascending if for all $x \in D$ and all $y \in \mathbb{R}^m$, $x \le y \implies y \in D$.
Claim 59. Assume we conduct m hypothesis tests based on continuous test statistics $T_1, \dots, T_m$ that have the PRDS property (positive regression dependence on the subset of true null hypotheses). Then the BH procedure with parameter $\alpha$ guarantees $\mathrm{FDR} \le \frac{m_0}{m}\alpha$.
Under more general dependence the behavior of $\frac{V}{R^+}$ is harder to analyze, but in practice the FDR generally tends to remain close to $\frac{m_0}{m}\alpha$, and the degree to which the FDR exceeds the bound is not large, as is emphasized by the following claim:
Claim 61. Assume we conduct m hypothesis tests based on continuous test statistics $T_1, \dots, T_m$. Then the BH procedure with parameter $\alpha$ guarantees the following bound:

$$\mathrm{FDR} \le \frac{m_0}{m}\,\alpha \sum_{i=1}^{m} \frac{1}{i} \approx \frac{m_0}{m}\,\alpha \log(m)$$

In practice the FDR usually does not exceed the $\frac{m_0}{m}\alpha$ bound, and when it does the deviation from the bound is relatively small.
Conclusion: The BH procedure is not particularly sensitive to the existence of dependence between the test statistics.
4.2.2 What is the true meaning of FDR control:
We are reminded that the FDR is the expected value of the false rejection ratio, and thus control of the FDR does not guarantee a low false rejection ratio; specifically, $\mathrm{Var}\left(\frac{V}{R^+}\right)$ can be large even when $E\left[\frac{V}{R^+}\right]$ is kept small.
Example 63. Suppose we conduct a MHT for testing m = 10000 hypotheses using the BH procedure with parameter $\alpha = 0.1$, and 1000 rejections were obtained. We would like to deduce that approximately 100 of these rejections are false and thus approximately 900 of our findings are true. However, an alternative explanation is that there is a strong positive dependence between the various test statistics, so that in 10% of the cases in which we implement the procedure there would be 1000 (entirely false) rejections while in the other 90% there would be none (since we chose $\alpha = 0.1$). In both scenarios the FDR is at most 0.1, but the actual circumstances are very different.
The problem is that the BH procedure is still quite conservative: especially in cases where $m_0 \ll m$ we get a situation in which $\mathrm{FDR} \le \frac{m_0}{m}\alpha \ll \alpha$. An adaptive procedure could reject effects that are not significant under the BH procedure and still guarantee $\mathrm{FDR} \le \alpha$. Notice that if we knew $m_0$ before using the procedure (which is not possible) we could have used the BH procedure with parameter $\alpha \cdot \frac{m}{m_0}$ and obtained $\mathrm{FDR} = \frac{m_0}{m} \cdot \alpha \cdot \frac{m}{m_0} = \alpha$ exactly. The adaptive BH procedure mimics this idea:
1. Compute the Pvals $p_1, \dots, p_m$.
2. Compute an estimator $\hat{m}_0$ of $m_0$. (4.1)
3. The rejected hypotheses are then $\mathrm{Rejected_{ABH}}(p_1, \dots, p_m; \alpha) := \mathrm{Rejected_{BH}}\left(p_1, \dots, p_m; \alpha \cdot \frac{m}{\hat{m}_0}\right)$, that is, the BH procedure applied with the inflated parameter $\alpha \cdot \frac{m}{\hat{m}_0}$.
Remark 64. The simplest way to compute an estimator for $m_0$ is simply to use the standard BH procedure with parameter $\alpha$ and then take $\hat{m}_0 = m - R$ as an estimator. There are, however, many other variations; see [2, 4, 9].
Ideally, if this method worked perfectly, it would guarantee $\mathrm{FDR} \le \frac{m_0}{E[\hat{m}_0]}\alpha \approx \alpha$. However, there are several technical difficulties in proving control of the FDR using an adaptive procedure:
First, for any constant c the BH procedure with parameter $c\alpha$ guarantees $\mathrm{FDR} \le \frac{m_0}{m} c\, \alpha$. However, if c is a RV (and specifically $c = \frac{m}{\hat{m}_0}$ is a RV) there is no assurance that $\mathrm{FDR} \le \frac{m_0}{m} E[c]\, \alpha$.
Second, if $\hat{m}_0$ is an unbiased estimator of $m_0$, then $\frac{1}{\hat{m}_0}$ is biased upwards relative to $\frac{1}{m_0}$, since by Jensen's inequality we know that $E\left[\frac{1}{\hat{m}_0}\right] > \frac{1}{E[\hat{m}_0]}$, and thus

$$E\left[\frac{m_0}{\hat{m}_0}\right] = m_0 \cdot E\left[\frac{1}{\hat{m}_0}\right] > \frac{m_0}{E[\hat{m}_0]} = 1$$

Which means that even if we achieve $\mathrm{FDR} \le m_0\, E\left[\frac{1}{\hat{m}_0}\right] \alpha$, this bound exceeds $\alpha$ and control at level $\alpha$ is not guaranteed.
A possible solution: this analysis immediately shows we would prefer positively biased estimators of $m_0$ ($E[\hat{m}_0] > m_0$). Such estimators are more conservative in the sense that they estimate the number of hypotheses for which H0 is false ($m - m_0$) to be smaller than it actually is. It turns out that given such estimators it is possible to prove control of the FDR under the assumption of independence, as the following theorem claims:
Theorem 65. Let $\hat{m}_0$ be an estimator of $m_0$, and denote by $\hat{m}_0^{(1)}$ the same estimator calculated for $m - 1$ hypotheses and the same $P_i$ values, except one value which is True Null (that is, except for one $P_j$ for which H0 is known to be true and thus $P_j \sim \mathrm{Uniform}[0,1]$). Then, assuming $P_1, \dots, P_m$ are independent, it holds that $\mathrm{FDR} \le E\left[\frac{m_0}{\hat{m}_0^{(1)}}\right]\alpha$, which is approximately $\frac{m_0}{E[\hat{m}_0]}\alpha$.
Furthermore, there are other (biased) estimators that, by using some additional conservative corrections, guarantee that $E\left[\frac{1}{\hat{m}_0}\right] \le \frac{1}{m_0}$, and for these estimators control of the FDR at level $\alpha$ can be shown (under certain conditions).
Conclusion: At the price of being slightly conservative in the estimation of $m_0$, it is possible to construct an adaptive procedure that, under independence of the Pvalues, controls the FDR as required. Specifically, when the difference between $m_0$ and m is large, using such a procedure provides a great improvement in power compared to the standard BH procedure.
However, doing that is pointless, since the original purpose of the adaptive procedure was to improve the power compared to the standard BH procedure.
Conclusion: if it is known (or shown) that there is only weak dependence between the Pvalues, then the use of an adaptive procedure can be very attractive. On the other hand, when the dependence is strong it is not recommended to use these procedures, since the FDR can grow beyond what is expected.
It remains an open challenge to find an adaptive procedure that both ensures high power and controls the FDR at level $\alpha$ even when there is dependence between the Pvalues (or to prove no such procedure exists).
In what follows we denote by $\pi_0 = \frac{m_0}{m}$ the proportion of true null hypotheses.
Definition 67. Suppose a MHT with a rejection threshold t (reject the i'th hypothesis iff $P_i \le t$); we denote:

$$V(t) := \#\{\text{False Rejections with threshold } t\}$$
$$R(t) := \#\{\text{Rejections with threshold } t\} = \#\{p_i \le t\}$$

Given a tuning parameter $\lambda \in [0, 1)$ we define the estimator

$$\hat{\pi}_0(\lambda) := \frac{\#\{p_i > \lambda\}}{(1 - \lambda)\, m}$$

of $\pi_0$, and estimate the FDR at threshold t by

$$\widehat{\mathrm{FDR}}(t) := \frac{\hat{\pi}_0(\lambda) \cdot m \cdot t}{\max\left(\#\{p_i \le t\},\, 1\right)}$$

where the numerator estimates $E[V(t)]$ (each true-null Pval satisfies $P(P_i \le t) = t$) and the denominator is the observed $R^+(t)$.
Explanation:
1. The reason $\hat{\pi}_0(\lambda)$ is a sensible (and unbiased) estimator of $\pi_0$ is that $\frac{\#\{p_i > \lambda\}}{1 - \lambda}$ is an unbiased estimator of $m_0$:

$$E\left[\#\{P_i > \lambda\}\right] = E\left[\sum_{i=1}^{m} 1_{\{P_i > \lambda\}}\right] = \sum_{i=1}^{m} P(P_i > \lambda) \overset{(*)}{=} m_0 \cdot P\left(P_i > \lambda \,\middle|\, P_i \sim \mathrm{Uniform}[0,1]\right) = m_0 (1 - \lambda)$$

The marked equality $(*)$ is the result of the fact that $P_i > \lambda$ iff the i'th hypothesis was not rejected with threshold $\lambda$. Thus $P(P_i > \lambda) = 0$ unless $P_i$ is distributed under H0, in which case it is distributed $\mathrm{Uniform}[0,1]$ and has probability $(1 - \lambda)$ of being larger than $\lambda$. Since there are $m_0$ values of i for which $P(P_i > \lambda) = 1 - \lambda$, and for the remaining values $P(P_i > \lambda) = 0$, the result is obtained.
2. The sense in which $\widehat{\mathrm{FDR}}(t)$ estimates $\mathrm{FDR} = E\left[\frac{V(t)}{R^+(t)}\right]$ is two-fold: first, since $E[V(t)] = m_0 t = \pi_0 m t$, the numerator $\hat{\pi}_0(\lambda) \cdot m \cdot t$ is an estimator of $E[V(t)]$, so $\widehat{\mathrm{FDR}}(t)$ is an estimator of the ratio $\frac{E[V(t)]}{E[R(t)]}$; second, this ratio of expectations approximates the expectation of the ratio $E\left[\frac{V(t)}{R^+(t)}\right]$ when $R(t)$ is concentrated around its mean.
Definition 68. Given a MHT with m hypotheses, the q-values $q_1, \dots, q_m$ are the minimal values of the parameter $\alpha$ such that the BH procedure with parameter $\alpha$ will reject the i'th hypothesis, respectively.
Remark 69. The q-values represent the significance of each hypothesis in the context of being tested as part of a MHT. Using q-values is obviously equivalent to using the BH procedure, but it has the advantage of giving a more accessible representation of the MHT by transferring the majority of the difficulty to the computation of the q-values. After said computation is done, the q-value is simply used to determine whether to reject the i'th hypothesis, like one would use the p-value for a single hypothesis test. Meaning, if we want to ensure $\mathrm{FDR} \le \alpha$ it suffices to reject all hypotheses for which $q_i \le \alpha$. The following algorithm computes the q-values:
1. Compute all the Pvals $p_1, \dots, p_m$ and order them $p_{(1)} \le \dots \le p_{(m)}$.
2. Compute $q_{(i)} = \min\left\{ p_{(i)} \frac{m}{i},\, 1 \right\}$.
3. Shrink and order: for $i = m - 1$ down to 1 set $q_{(i)} = \min\left\{ q_{(i)}, q_{(i+1)} \right\}$.
4. To get $q_i$ from $q_{(i)}$, perform the inverse of the permutation $p_i \mapsto p_{(i)}$.

Step 3 is performed in order to restore the monotonicity of the ordered values $q_{(i)}$, since it is impossible that $p_{(i)} < p_{(j)}$ and at the same time $q_{(i)} > q_{(j)}$. One can show this algorithm indeed computes the values in accordance with the definition of the q-values.
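The four steps above can be sketched as follows (naming is ours):

```python
# Sketch of the q-value algorithm: q_i is the smallest BH level at which
# hypothesis i is rejected.

def q_values(pvals):
    """Compute BH q-values for a list of Pvals."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])        # step 1: sort the Pvals
    q_sorted = [min(pvals[idx] * m / rank, 1.0)             # step 2: q_(i) = min(p_(i) m / i, 1)
                for rank, idx in enumerate(order, start=1)]
    for i in range(m - 2, -1, -1):                          # step 3: shrink to restore monotonicity
        q_sorted[i] = min(q_sorted[i], q_sorted[i + 1])
    q = [0.0] * m                                           # step 4: undo the sorting permutation
    for rank, idx in enumerate(order):
        q[idx] = q_sorted[rank]
    return q
```

For the Pvals (0.01, 0.02, 0.8, 0.04) this yields q-values (0.04, 0.04, 0.8, 0.0533...), and rejecting all hypotheses with $q_i \le \alpha$ reproduces the BH rejection set at level $\alpha$.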
Remark 70. In general, when conducting a MHT, we assume that we only have access to the p-values and that we have no preference for certain hypotheses over others. Assuming this is true, we will always use the same rejection threshold for all hypotheses, since there is no logical reason to reject one hypothesis with a certain p-value if we did not reject another hypothesis that has a lower p-value. Under different assumptions it is possible to define other procedures that do not use an identical rejection threshold for all hypotheses.
References
[1] Yoav Benjamini and Yosef Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society, Series B (Methodological) 57 (1995), no. 1, 289-300.
[2] Yoav Benjamini, Abba M. Krieger, and Daniel Yekutieli, Adaptive linear step-up procedures that control the false discovery rate, Biometrika 93 (2006), no. 3, 491-507.
[3] Yoav Benjamini and Daniel Yekutieli, The control of the false discovery rate in multiple testing under dependency, The Annals of Statistics 29 (2001), no. 4, 1165-1188.
[4] Bradley Efron, Robert Tibshirani, John D. Storey, and Virginia Tusher, Empirical Bayes analysis of a microarray experiment, Journal of the American Statistical Association 96 (2001), no. 456, 1151-1160.
[5] Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alexander J. Smola, A kernel method for the two-sample-problem, Advances in Neural Information Processing Systems 19 (2007), 513-520.
[6] Wassily Hoeffding, A non-parametric test of independence, The Annals of Mathematical Statistics 19 (1948), no. 4, 546-557.
[7] Shachar Kaufman, Ruth Heller, Yair Heller, and Malka Gorfine, Consistent distribution-free tests of association between univariate random variables (2013).
[8] John D. Storey, A direct approach to false discovery rates, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64 (2002), no. 3, 479-498.
[9] John D. Storey, The positive false discovery rate: a Bayesian interpretation and the q-value, The Annals of Statistics 31 (2003), no. 6, 2013-2035.
[11] Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov, Measuring and testing dependence by correlation of distances, The Annals of Statistics 35 (2007), no. 6, 2769-2794.
[12] Gábor J. Székely and Maria L. Rizzo, Brownian distance covariance, The Annals of Applied Statistics 3 (2009), no. 4, 1236-1265.
[13] Amit Zeisel, Or Zuk, and Eytan Domany, FDR control with adaptive procedures and FDR monotonicity, The Annals of Applied Statistics 5 (2011), no. 2A, 943-968.