BIOINFORMATIK II
PROBABILITY & STATISTICS
Summer semester 2006
The University of Zürich and ETH Zürich
Problems in statistics:
Given: a probability model for some (chance) experiment:
X ∼ Pθ, θ ∈ Θ.
Here, Pθ is a probability distribution (given by a probability function
pθ(x) or a distribution function Fθ(x)) for any θ ∈ Θ. The Pθ are all known,
but the actual value θ of the parameter is unknown.
(Ex: X ∼ Bin(100, p), but the probability p is unknown.)
Two main areas in statistics are:
ESTIMATION: estimate the unknown value of θ given observations
of X (a single observation is usually not enough).
TESTING: test a hypothesis about the unknown value of θ. Base
acceptance/rejection upon observations of X (a single observation is
usually not enough).
Statistical estimation:
Given: a probability model: X ∼ Pθ, θ ∈ Θ,
where the Pθ are known, but the actual value θ of the parameter is unknown.
(Ex: X ∼ Bin(100, p), but the probability p is unknown.)
To be able to estimate the value of θ we repeat the experiment n
times independently, which gives x1, x2, ..., xn. These are n
observations of X.
Next step: use the observations x1, x2, ..., xn to compute an
estimate of θ. (Observe the values, and then take a "good guess".)
The collection x1 , x2 , . . . , xn is called an (observed) sample of random
variables X1 , X2 , . . . , Xn ; the latter are independent and have the same
distribution as X.
The collection X1 , X2 , . . . , Xn is called a (random) sample.
Estimator, estimate
Def: An estimator of θ is a function θ̂(X1, X2, ..., Xn) of the random
variables X1, X2, ..., Xn.
For theory and principles of estimation. It is RANDOM!
******
Def: An estimate of θ is the quantity θ̂(x1, x2, ..., xn) calculated from
the observed values x1, x2, ..., xn of X1, X2, ..., Xn.
For practice. The value computed after the experiment.
It is not random.
Consistency: for every ε > 0, P(|θ̂(X1, X2, ..., Xn) − θ| > ε) → 0 as n → ∞:
the more observations, the closer to the truth.
The mean square error: keep MSEθ[θ̂] as low as possible:
MSEθ[θ̂] := Eθ[(θ̂(X1, X2, ..., Xn) − θ)²]
for every θ ∈ Θ.
"Not too much variation, not too far away from the truth."
(Note: MSEθ[θ̂] = Varθ[θ̂] if θ̂ is unbiased, i.e. if Eθ[θ̂] = θ.)
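As a quick sketch of the MSE definition above (not part of the original slides), one can approximate MSEθ[θ̂] by simulation for the natural estimator θ̂ = Σ xi / (100n) under X ∼ Bin(100, θ). All concrete numbers here (θ = 0.3, the sample sizes, the repetition count) are made up for illustration.

```python
import random

random.seed(1)

def draw_bin(n_trials, p):
    # One Bin(n_trials, p) draw, as a sum of Bernoulli trials.
    return sum(random.random() < p for _ in range(n_trials))

def mse_of_estimator(theta, n, reps=500):
    # Monte Carlo approximation of E[(theta_hat - theta)^2]
    # for theta_hat = sum(x_i) / (100 n), x_i ~ Bin(100, theta).
    errors = []
    for _ in range(reps):
        xs = [draw_bin(100, theta) for _ in range(n)]
        theta_hat = sum(xs) / (100 * n)
        errors.append((theta_hat - theta) ** 2)
    return sum(errors) / reps

# More observations -> smaller MSE (consistency in action).
print(mse_of_estimator(0.3, n=5))
print(mse_of_estimator(0.3, n=50))
```

Increasing n shrinks the MSE roughly by the same factor, in line with Var(θ̂) = θ(1 − θ)/(100n) for this unbiased estimator.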
The probability function of X ∼ Bin(100, θ) is
pθ(x) = C(100, x) θ^x (1 − θ)^{100−x} for x = 0, 1, ..., 100;
and hence
L(x1, ..., xn; θ) = [C(100, x1) θ^{x1} (1 − θ)^{100−x1}] · ... · [C(100, xn) θ^{xn} (1 − θ)^{100−xn}]
= [C(100, x1) · ... · C(100, xn)] θ^{Σ xi} (1 − θ)^{100n − Σ xi},
where the sums run over i = 1, ..., n.
Maximizing L (equivalently, its logarithm) over θ gives the ML estimate
θ̂(x1, x2, ..., xn) = (1/(100n)) Σ_{i=1}^n xi.
(Compare this with the two estimators on Slide 5: θ̂1(X1, X2, ..., X20) is
the ML estimator!)
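The ML estimate above is just the total number of successes divided by the total number of trials. A minimal illustration, with made-up observed counts from Bin(100, θ):

```python
# Hypothetical data: n = 4 observed counts x_i from X ~ Bin(100, theta).
observations = [27, 31, 24, 30]
n = len(observations)

# ML estimate: theta_hat = (x_1 + ... + x_n) / (100 n)
theta_hat = sum(observations) / (100 * n)
print(theta_hat)  # 112 / 400 = 0.28
```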
Unbiased estimators
Let θ̂(X1, X2, ..., Xn) be some estimator of the unknown value of θ.
If
Eθ[θ̂(X1, X2, ..., Xn)] = θ for every θ,
then θ̂ is called unbiased.
Type I error.
Ex: Suppose that the null hypothesis H0 will be rejected if
T (X1 , X2 , . . . , Xn ) > C for some constant C.
And otherwise, if T(X1, X2, ..., Xn) ≤ C, H0 is accepted.
(C is the critical value in this example).
NOTE: Even if H0 is true it is (in general) possible that
T (X1 , X2 , . . . , Xn ) > C, since the data are random!
If this occurs, a Type I error is being made: rejection of a true null
hypothesis.
Significance level
(Type I error = rejection of a true null hypothesis).
The probability of this type of incorrect decision, the significance
level α, should be kept (reasonably) low:
α = P(T(X1, X2, ..., Xn) > C | H0 true) = P(Type I error).
Typically, one takes α equal to some low probability (often 0.05 or 0.01),
and then determines the corresponding value of C.
C depends upon α!
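The step "choose α, then determine C" can be sketched directly, using the Bin(20, 0.25) example from these slides; the choice α = 0.05 and the tie to that example are mine, for illustration:

```python
from math import comb

def bin_pmf(k, n, p):
    # P(X = k) for X ~ Bin(n, p)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def tail_prob(c, n, p):
    # P(X > c) for X ~ Bin(n, p)
    return sum(bin_pmf(k, n, p) for k in range(c + 1, n + 1))

# Find the smallest C with P(X > C | H0 true) <= alpha.
n, p0, alpha = 20, 0.25, 0.05
C = next(c for c in range(n + 1) if tail_prob(c, n, p0) <= alpha)
print(C, tail_prob(C, n, p0))  # C = 8, attained level ~0.041
```

Because the test statistic is discrete, the attained level (about 0.041) sits strictly below the nominal α = 0.05.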
Statistical Significance
Our test (at significance level α) is:
Reject H0 ⟺ T(X1, X2, ..., Xn) > C.
Type II error
Another type of incorrect decision can also be made:
Type II error: acceptance of a false null hypothesis.
(Type II is usually a less serious error than type I).
Suppose that the significance level α is fixed (α = 0.05 or some other
value), and the critical value C has been determined such that
α = P(T(X1, X2, ..., Xn) > C | H0 true)
holds.
Then, the probability of a Type II error is
β = P(T(X1, X2, ..., Xn) ≤ C | H0 false).
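To make β concrete (my own continuation of the Bin(20, p) example, with the critical value C = 8 assumed from a 5%-level test of H0: p = 0.25): if the truth is p = 0.35, the Type II error probability is P(X ≤ C) under that alternative.

```python
from math import comb

def bin_cdf(c, n, p):
    # P(X <= c) for X ~ Bin(n, p)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

# Illustrative setup: reject H0: p = 0.25 when X > C = 8 (alpha ~ 0.04).
# Type II error: accept H0 although p = 0.35 is the truth.
beta = bin_cdf(8, 20, 0.35)
print(beta)  # ~0.76: with only 20 trials this test usually misses HA
```

A β this large shows why a single small sample is usually not enough: the two hypotheses are too close for 20 observations to separate reliably.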
p-values:
The p-value is a probability, and can only be computed after the data
have been observed.
Suppose that the test is: we reject H0 if and only if
T (X1 , X2 , . . . , Xn ) > C, for some critical value C.
The significance level is
α = P(T(X1, X2, ..., Xn) > C | H0 true).
Now suppose we observe x1 , x2 , . . . , xn .
Compute the observed test statistic t := T (x1 , x2 , . . . , xn ).
The p-value is defined as
P(T(X1, X2, ..., Xn) ≥ t | H0 true).
(Interpretation: the probability of seeing something at least as extreme
as just observed; it measures how unlikely the observed value is.)
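This definition translates directly into code. A small sketch, again using the Bin(20, 0.25) null from these slides with a made-up observation t = 9:

```python
from math import comb

def p_value(t, n, p0):
    # p-value = P(T >= t | H0 true) for T ~ Bin(n, p0)
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(t, n + 1))

# Hypothetical observation: t = 9 successes in 20 trials, H0: p = 0.25.
print(p_value(9, 20, 0.25))  # ~0.041, so significant at the 5% level
```

Note the p-value is computed from the observed t, with no critical value C needed; comparing it to α reproduces the reject/accept decision.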
Simple hypotheses
Suppose that we have a test problem where the hypotheses are simple,
i.e. they completely specify the probability function.
Ex: X ∼ Bin(N, p) with N = 20. H0: p = 0.25, HA: p = 0.35, which is
equivalent to
H0: P(X = k) = C(N, k) 0.25^k (1 − 0.25)^{N−k}
and
HA: P(X = k) = C(N, k) 0.35^k (1 − 0.35)^{N−k}.
LR(x1, ..., xn) = [p1^{Σ xi} (1 − p1)^{n − Σ xi}] / [p0^{Σ xi} (1 − p0)^{n − Σ xi}]
= (p1/p0)^s ((1 − p1)/(1 − p0))^{n−s} = (7/5)^s (13/15)^{n−s},
where s = Σ_{i=1}^n xi is the total number of matches. But this LR is just an
(increasing) function of s! So instead of rejecting H0 if LR is too big,
we can also use the test that rejects H0 if s is too big. What does
"too big" mean?
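The claim that LR is an increasing function of s can be checked numerically. A small sketch with the slide's values p0 = 0.25, p1 = 0.35 (the choice n = 20 is mine, for illustration):

```python
def likelihood_ratio(s, n, p0=0.25, p1=0.35):
    # LR = (p1/p0)^s * ((1-p1)/(1-p0))^(n-s) = (7/5)^s * (13/15)^(n-s)
    return (p1 / p0) ** s * ((1 - p1) / (1 - p0)) ** (n - s)

# LR is strictly increasing in s: each extra match multiplies LR by
# (p1/p0) * ((1-p0)/(1-p1)) = 1.4 * 15/13 > 1.
ratios = [likelihood_ratio(s, 20) for s in range(21)]
assert all(a < b for a, b in zip(ratios, ratios[1:]))
print(likelihood_ratio(9, 20))  # > 1: s = 9 matches favour HA over H0
```

This monotonicity is exactly why rejecting for large LR and rejecting for large s define the same test.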