
Numerical Methods and Stochastics

Part 2: Stochastics
Prof. Dr. Herold Dehling
Lehrstuhl Mathematik XII, Fakultät für Mathematik, Ruhr-Universität Bochum

Preface
These are the lecture notes for the second part of the course Numerical Methods and Stochastics, held in the Summer Term 2011 at the Ruhr-University Bochum.

Bochum, May 23, 2011

Prof. Dr. Herold Dehling

Literature.
(1) Nathabandu T. Kottegoda and Renzo Rosso: Applied Statistics for Civil and Environmental Engineers, 2nd ed., Blackwell Publishing, 2008.
(2) Lothar Sachs: Statistische Methodenlehre. Springer Verlag, Berlin.
(3) Peter Brockwell and Richard Davis: Introduction to Time Series and Forecasting. Springer Verlag.


Contents
Preface

Chapter 1. Basic Concepts of Probability Theory
  1. Outcomes, Events, Probability
  2. Laplace Experiments and Elementary Combinatorics
  3. Independence, Conditional Probabilities
  4. System Failure

Chapter 2. Discrete Distributions
  1. Descriptive Statistics for Discrete Data
  2. Stochastic Models
  3. Inferential Statistics
  4. Goodness of Fit Tests

Chapter 3. Continuous Distributions
  1. Descriptive Statistics for Continuous Data
  2. Stochastic Models
  3. Important Continuous Distributions
  4. Transformation of Densities
  5. Inferential Statistics
  6. Goodness of Fit Tests

Chapter 4. Multivariate Distributions
  1. Descriptive Statistics for Multivariate Data
  2. Stochastic Models
  3. Inferential Statistics

Chapter 5. Linear Regression
  1. Simple Linear Regression
  2. Multiple Linear Regression

Chapter 6. Analysis of Variance
  1. Two-Sample Problem
  2. Single Factor ANOVA

Chapter 7. Principal Component Analysis
  1. Linear Algebra Tools
  2. Principal Components of Multivariate Data

Chapter 8. Extreme Value Theory
  1. Extreme Value Distributions
  2. Inferential Statistics for Extreme Value Distributions

Chapter 9. Introduction to Time Series Analysis
  1. Trend, Seasonal Effects and Stationary Time Series
  2. Statistical Analysis of Stationary Processes
  3. Spectral Analysis of Time Series

CHAPTER 1

Basic Concepts of Probability Theory


In this chapter we review the basic concepts of Probability Theory that are relevant for this course. Most should be known from undergraduate courses.

1. Outcomes, Events, Probability

Probability Theory provides models for experiments whose outcomes are not completely deterministic but are subject to randomness. We will not attempt to give an exact definition of randomness. For our purposes it suffices to think of experiments that can be repeated under identical circumstances, where the outcomes still differ from one repetition to the other. There are many examples: games of chance such as rolling dice or playing roulette, outcomes of physical measurements that are subject to measurement errors, the number of passengers at a bus stop between 7:00 a.m. and 8:00 a.m. on consecutive days, the daily maximum wind speed on campus, the amount of rainfall on a given day.

The basic ingredients of a mathematical model of a random experiment are the outcome space Ω, the event space F and the probability distribution P, usually denoted by the triple (Ω, F, P). The outcome space is simply the set of all possible outcomes of the experiment. For the threefold rolling of a die, the outcome space is
Ω = {(ω1, ω2, ω3) : ωi ∈ {1, ..., 6}}.
For the measurement error in a single experiment we can take Ω = R, and for the maximum wind speed tomorrow Ω = [0, ∞). Often we are not so much interested in the exact outcome of an experiment. E.g., regarding tomorrow's maximum wind speed, we might only want to know whether the speed will exceed 100 km/h. Technically, an event is a subset of the outcome space, and by F we denote the set of all events that are of interest to us. In discrete experiments this will often be the set of all subsets of the outcome space; in experiments with a continuum of possible outcomes, it will often be the set of all Lebesgue measurable subsets of the outcome space. Since events are subsets of Ω, we can apply the usual operations of set theory, e.g.
unions, complements and intersections. Each of these operations has a specific meaning when it comes to events; for the most common operations we have listed the interpretations in Figure 1.

The final ingredient of a probability triple is the probability distribution P, which assigns to each event A ∈ F its probability P(A). Technically, the probability distribution is a map P : F → [0, 1] satisfying the following three axioms, formulated by Kolmogorov in 1933:
(1) P(Ω) = 1


Figure 1. Event space interpretation of common operations of set theory:
  Ω          The sure event
  ∅          The impossible event
  A^c        A did not occur
  A ∩ B      Both A and B have occurred
  A ∪ B      A or B has occurred
  B \ A      B has occurred, but A did not occur
  A ⊆ B      If A occurs, B will also occur
  A ∩ B = ∅  A and B are mutually exclusive

(2) P(A1 ∪ ... ∪ An) = Σ_{i=1}^n P(Ai), for all pairwise disjoint events A1, ..., An ∈ F.
(3) P(A1 ∪ A2 ∪ ...) = Σ_{k=1}^∞ P(Ak), for all pairwise disjoint events A1, A2, ....

The first of these three conditions is a normalization condition. The second is called finite additivity and the third countable additivity. One can show that countable additivity in fact implies finite additivity; nevertheless it is common to state both axioms separately. A probability distribution is a map P : F → [0, 1] that satisfies these three axioms. We should mention here that the word probability is used both for the probability distribution, i.e. the map P : F → [0, 1], as well as for the probability P(A) of a given event A.

In the midst of the mathematical formalism, it is very important to have an intuitive interpretation of probability. Mathematical statisticians prefer the frequentist interpretation, where the probability of the event A is viewed as the mathematical idealization of the relative frequency with which the event occurs in an infinitely


long sequence of repetitions of the same experiment under identical conditions. If we repeat the experiment of rolling three dice 10000 times and find that in 1679 of these experiments the sum exceeded 15, the relative frequency is 0.1679. The law of large numbers guarantees that in an infinite sequence of independent repetitions of the experiment under identical conditions, the relative frequency converges to P(A). It should be mentioned that in real-life applications it is much less clear what independent repetitions of an experiment under identical circumstances means. Think of daily maximum wind speed: there is certainly dependence from day to day, and moreover the circumstances differ with the season. These are very important issues to keep in mind.

Though relative frequency provides a good intuitive picture of the meaning of probability, we should realize that the relative frequency in a finite series of experiments is at best an approximation to the probability of the event. If in 100 coin tosses we find 43 Heads, the relative frequency of Heads is 0.43. However, this is most likely not the probability of Heads. If we repeat the same 100 coin tosses tomorrow, we will obtain a different relative frequency, whereas the probability is still the same. If the coin is symmetric, we might assume on grounds of symmetry that the probability of Heads is 1/2. For an asymmetric coin, we will see that based on the above outcome, 0.43 is a good estimate for the probability of Heads. If we want to find the true probability, we have to continue tossing the coin ad infinitum, quite an impossible task in real life.

When calculating probabilities of specific events, the following rules are extremely useful.

Theorem 1.1. Given a probability space (Ω, F, P) and events A, B, C, we get
P(A^c) = 1 − P(A)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C)
P(B \ A) = P(B) − P(A ∩ B)
A ⊆ B ⟹ P(B \ A) = P(B) − P(A)

One should always keep these rules in mind and be able to use them in a given situation. There are occasions when it is difficult to calculate the probability of an event A, while the probability of A^c can easily be calculated. In other situations, a given event can be expressed as a union or intersection of simpler events whose probabilities can be calculated easily.

Example 1.2. (i) What is the probability to get at least one 6 in six tosses of a fair die? In this case, it makes sense to consider the complementary event, i.e. the event A that we do not toss a 6. We have P(A) = (5/6)^6, and thus the probability of the event that we toss at least one 6 is 1 − (5/6)^6.
(ii) What is the probability that in repeated tosses of a fair die the first 6 occurs in the 10th toss? In this case, it is helpful to express the event as A9 \ A10, where A9 and


A10 are the events that during the first 9, respectively 10, tosses no 6 occurred. We have A10 ⊆ A9, P(A9) = (5/6)^9, P(A10) = (5/6)^10, and hence
P(A9 \ A10) = (5/6)^9 − (5/6)^10 = (5/6)^9 · (1/6).

2. Laplace Experiments and Elementary Combinatorics

The simplest random experiments have finitely many outcomes, all with the same probability. In this case, the probability of any event can be calculated by the formula
P(A) = |A| / |Ω|.
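Such Laplace probabilities can be evaluated by brute-force enumeration; a minimal sketch (not part of the original notes), using the event from the frequency discussion above that the sum of three dice exceeds 15:

```python
from itertools import product
from fractions import Fraction

# Laplace probability by direct enumeration: P(A) = |A| / |Omega|
# for the event A = "the sum of three dice exceeds 15"
omega = list(product(range(1, 7), repeat=3))
favorable = [w for w in omega if sum(w) > 15]
prob = Fraction(len(favorable), len(omega))
print(len(omega), len(favorable), prob)  # 216 10 5/108
```

Enumeration is only feasible for small outcome spaces; for larger ones the combinatorial formulas of this section are needed.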

The corresponding mathematical model is called a Laplace space. We say that the probability of an event equals the ratio of the number of favorable outcomes to the number of all outcomes. Practical calculations in Laplace spaces require a basic knowledge of combinatorial formulas for calculating cardinalities of sets. We will explain these formulas in connection with so-called urn models, where k balls are drawn from an urn with n balls carrying the numbers 1, ..., n. The basic question asks for the number of possible outcomes in this experiment. The answer depends heavily on two factors, namely the way we draw and which outcomes we regard as different. We can draw with or without replacement, and we can take the order in which the balls were drawn into account or not.

Drawing with replacement, taking the order into account: In this case the outcome can be described by a k-tuple (ω1, ..., ωk), where ωi represents the ball that was drawn in the i-th drawing. The sample space becomes
Ω_I = {(ω1, ..., ωk) : 1 ≤ ωi ≤ n}.
The cardinality of Ω_I is n^k.

Drawing without replacement, taking the order into account: Again, the outcome can be described by a k-tuple (ω1, ..., ωk), where ωi represents the ball that was drawn in the i-th drawing. The only difference is that we have to take into account that the same ball cannot be drawn twice, as we draw without replacement. The sample space is
Ω_II = {(ω1, ..., ωk) : 1 ≤ ωi ≤ n, ωi ≠ ωj for i ≠ j}.
The cardinality of Ω_II is (n)_k := n(n − 1) ··· (n − k + 1); there are n possibilities to draw the first ball, then (n − 1) possibilities for the second ball, (n − 2) for the third, and finally (n − k + 1) for the k-th ball. If we draw all balls, i.e. when k = n, we have n! := n(n − 1) ··· 2 · 1 possibilities to do so. In this way, we obtain as a by-product of our calculations the number of all permutations of the numbers 1, ..., n, which is given by n!.
Drawing without replacement, not taking the order into account: In this case, we neglect the order in which the balls have been drawn. Thus we cannot represent the outcomes as k-tuples, as these automatically imply a specific order. Instead we represent an outcome by the set of those k numbers that have been drawn. The sample space is
Ω_III := {A ⊆ {1, ..., n} : |A| = k}.


The cardinality of Ω_III equals the cardinality of Ω_II divided by k!: to each subset A ⊆ {1, ..., n} there correspond k! tuples (ω1, ..., ωk) of pairwise different numbers from 1, ..., n. Thus we get
|Ω_III| = (n)_k / k! = n! / (k! (n − k)!) =: C(n, k),
the binomial coefficient "n choose k".
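All three counting formulas are available in Python's standard library; a short illustrative sketch (the values n = 6, k = 3 are our own choice):

```python
import math

n, k = 6, 3
with_replacement_ordered = n ** k                # |Omega_I|   = n^k
without_replacement_ordered = math.perm(n, k)    # |Omega_II|  = (n)_k = n(n-1)...(n-k+1)
without_replacement_unordered = math.comb(n, k)  # |Omega_III| = C(n, k) = (n)_k / k!

print(with_replacement_ordered,
      without_replacement_ordered,
      without_replacement_unordered)  # 216 120 20
```

Note that 120 = 20 · 3!, reflecting the k! orderings of each k-element subset.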

Hypergeometric Distribution. We now give a first application of the formulae derived above. We draw without replacement n balls from an urn with N balls, of which R are red and the remaining N − R are white. What is the probability of the event Er that we get r red balls among the n balls that are drawn? We apply the model Ω_III. The total number of outcomes is C(N, n). We have C(R, r) possibilities to draw the r red balls and C(N − R, n − r) possibilities to draw the n − r white balls. Thus we obtain
P(Er) = C(R, r) · C(N − R, n − r) / C(N, n);
these probabilities are called hypergeometric probabilities.

3. Independence, Conditional Probabilities

In many situations, we are facing several events in a random experiment whose individual probabilities are known to us, and we are interested in the probability of a further event that is defined in terms of these events. In this section, we present techniques for the calculation of such probabilities.

Definition 1.3. The events A and B are called independent if
P(A ∩ B) = P(A) · P(B).
Several events A1, ..., An are called independent if for all choices of indices 1 ≤ i1 < ... < ik ≤ n we have
P(Ai1 ∩ ... ∩ Aik) = P(Ai1) · ... · P(Aik).

In intuitive terms, independence of events means that the events do not influence each other. If A and B are independent events, the probability that B will occur is not influenced by the information that A has occurred. This is e.g. the case in repeated tosses of a fair die and the events A and B that a 6 occurs in the first and second toss, respectively.

Binomial Distribution. We consider an experiment with two possible outcomes, usually called success and failure, with associated probabilities p and q = 1 − p. This experiment is carried out n times, independently. What is the probability of the event Ek to get a total of k successes? We consider the model
Ω = {(ω1, ..., ωn) : ωi ∈ {0, 1}},

where ωi = 1 means that we had a success in the i-th experiment and ωi = 0 that we had a failure. We first define probabilities for the outcomes, i.e. for each element of Ω. We begin with an example, taking n = 5 and the outcome (0, 1, 1, 1, 0): this represents the outcome that the first experiment resulted in a failure, the next three


experiments in successes and the final one again in a failure. By independence we get
P({(0, 1, 1, 1, 0)}) = q · p · p · p · q = p^3 q^2.
Analogously, we get the general formula
P({(ω1, ..., ωn)}) = p^k q^(n−k),
where k = Σ_{i=1}^n ωi denotes the number of successes in the sequence of experiments. We now return to the original question: what is the probability to have exactly k successes? There are exactly C(n, k) outcomes that correspond to this event; this is the number of n-tuples of 0s and 1s of which exactly k are 1s. Thus we get
P(Ek) = C(n, k) p^k q^(n−k);
this is called the binomial distribution.

Definition 1.4 (Conditional Probability). Given events A, B with P(A) > 0, we define the conditional probability of B given A by
P(B|A) := P(A ∩ B) / P(A).

For an intuitive understanding of conditional probability, we use the frequentist interpretation of probability. We carry out a large number of experiments independently and under identical circumstances, focusing on the events A and B. Then P(B|A) corresponds to the relative frequency with which the event B occurred among the subsequence of experiments where A occurred. If we consider the two events A and B in a random experiment and we know that A has already occurred, then P(B|A) (and not P(B)) is the relevant probability that B will occur. Neglecting this simple fact is the source of a lot of confusion in applications of probability theory.

Example 1.5. A machine consists of three parts, each of which is defective with probability p. We denote the events that the first, second and third part are defective by A1, A2, A3, respectively, and we assume that these events are independent. The machine fails to perform when all three parts fail simultaneously. Thus, the probability of a machine failure is p^3. Now suppose that by inspection of the first part we have learned that it is defective. The resulting conditional probability of a machine failure given this information is
P(A1 ∩ A2 ∩ A3 | A1) = p^3 / p = p^2.
It is important to realize that this is the relevant probability, e.g. for any risk assessment, given the information about the failure of the first part.

For practical computations in connection with conditional probabilities there are three important formulas, which we present now.

Multiplication rule: Let A1, ..., An be events satisfying P(A1 ∩ ... ∩ An−1) ≠ 0. Then
P(A1 ∩ ... ∩ An) = P(A1) · P(A2|A1) · P(A3|A1 ∩ A2) · ... · P(An|A1 ∩ ... ∩ An−1).
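The conditional probability in Example 1.5 can also be checked via its frequentist interpretation; a simulation sketch (the value p = 0.1 and all names are our own, for illustration):

```python
import random

random.seed(1)
p = 0.1  # illustrative defect probability per part

# analytic values from Example 1.5
print(p ** 3)  # unconditional machine-failure probability
print(p ** 2)  # failure probability given that the first part is defective

# frequentist check: relative frequency of full failure among those
# repetitions in which the first part turned out to be defective
n_a1 = n_fail = 0
for _ in range(200_000):
    defective = [random.random() < p for _ in range(3)]
    if defective[0]:
        n_a1 += 1
        if all(defective):
            n_fail += 1
print(n_fail / n_a1)  # close to p**2 = 0.01
```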


Law of total probability: Let B1, ..., Bk be disjoint events that form a partition of the outcome space, i.e. B1 ∪ ... ∪ Bk = Ω, satisfying P(Bi) ≠ 0 for all i. Then we get for an arbitrary event A
P(A) = Σ_{i=1}^k P(A|Bi) · P(Bi).

Bayes' formula: Let B1, ..., Bk be disjoint events that form a partition of the outcome space, i.e. B1 ∪ ... ∪ Bk = Ω, satisfying P(Bi) ≠ 0 for all i. Then we get for an arbitrary event A with P(A) ≠ 0
P(Bj|A) = P(A|Bj) · P(Bj) / Σ_{i=1}^k P(A|Bi) · P(Bi).
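Both formulas combine into a few lines of code; a sketch with a made-up two-machine example (all numbers and names are ours, not from the notes):

```python
def posterior(priors, likelihoods):
    """Bayes' formula: posterior P(B_j | A) from the priors P(B_j)
    and the likelihoods P(A | B_j).

    The denominator is P(A), computed by the law of total probability."""
    p_a = sum(l * b for l, b in zip(likelihoods, priors))
    return [l * b / p_a for l, b in zip(likelihoods, priors)]

# hypothetical example: machine B1 produces 60% of all parts with a 1%
# defect rate, machine B2 produces 40% with a 3% defect rate;
# A = "a randomly picked part is defective".
post = posterior([0.6, 0.4], [0.01, 0.03])
print(post)  # [1/3, 2/3] up to rounding
```

Although B2 produces fewer parts, its higher defect rate makes it the more likely origin of a defective part.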

4. System Failure

An important application of probability theory in the engineering sciences consists of computations of failure of performance in complex structures. A machine consists of a number of parts that can all fail individually and whose failure influences the functioning of the machine. There are many ways in which this may happen. In some cases, failure of a single component results in the failure of the entire machine. In other cases, a machine fails only when all components fail together. In most realistic applications, we will have intermediate possibilities. The causal structure leading to system failure is usually represented as a tree. We first present the basic elements of such a tree, the AND and OR gates.

Figure 2. AND gate in a failure tree

AND gate. Take a system consisting of exactly three components. We assume that the system fails when all three components fail simultaneously. By A, B, C we denote the events that the individual components fail; system failure then corresponds to the event A ∩ B ∩ C. We now show how this probability can be calculated in various situations.
First case: A, B, C are independent events; then
P(A ∩ B ∩ C) = P(A) · P(B) · P(C).
Second case: One of the individual events, e.g. A, implies the other two, i.e. A ⊆ B and A ⊆ C. In this case
P(A ∩ B ∩ C) = P(A).


General case: We always have the upper bound
P(A ∩ B ∩ C) ≤ min(P(A), P(B), P(C)).
In most practical applications, we also have the lower bound
P(A ∩ B ∩ C) ≥ P(A) · P(B) · P(C).
This is not a mathematical theorem that holds for arbitrary events A, B, C; it is e.g. false if the events are disjoint, in which case the left-hand side is zero. However, in most practical applications there is some kind of positive dependence between the failures of separate components. Failure of one component leads to stress on the other components and thus increases the probability of further failures. In such situations, the above lower bound holds.

Figure 3. OR gate in a failure tree

OR gate. Again, for simplicity, we consider a system consisting of three parts. In this case, we assume that the system fails if at least one of the parts fails. Using the same notation as above, failure now corresponds to the event A ∪ B ∪ C. We have the universal upper bound
P(A ∪ B ∪ C) ≤ P(A) + P(B) + P(C).
In the special case that the individual events are disjoint, we even have equality here. On the other hand, we always have A, B, C ⊆ A ∪ B ∪ C and thus
P(A ∪ B ∪ C) ≥ max(P(A), P(B), P(C)).
In total, we have the universal bounds
max(P(A), P(B), P(C)) ≤ P(A ∪ B ∪ C) ≤ P(A) + P(B) + P(C).
In some cases, sharper results are possible. For example, we have the identity
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C),
which is a special case of the inclusion-exclusion formula, valid for the union of an arbitrary number of events. This formula is helpful if we are able to calculate the probabilities on the right-hand side. Another formula, valid for independent events, is
P(A ∪ B ∪ C) = 1 − P(A^c ∩ B^c ∩ C^c) = 1 − (1 − P(A))(1 − P(B))(1 − P(C)).
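The AND/OR bounds above take only a few lines to tabulate; a sketch with made-up component failure probabilities:

```python
# made-up failure probabilities of the three components
pA, pB, pC = 0.10, 0.20, 0.15

# AND gate: system fails if all components fail, event A∩B∩C
and_upper = min(pA, pB, pC)          # universal upper bound
and_indep = pA * pB * pC             # exact value under independence

# OR gate: system fails if at least one component fails, event A∪B∪C
or_lower = max(pA, pB, pC)           # universal lower bound
or_upper = pA + pB + pC              # universal upper bound
or_indep = 1 - (1 - pA) * (1 - pB) * (1 - pC)  # exact under independence

print(and_indep, or_indep)
```

As expected, each exact value under independence lies inside its universal bounds.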

CHAPTER 2

Discrete Distributions
Before we can decide on a certain statistical procedure, we have to analyze the type of data that we are confronted with. It makes a great difference whether the data are discrete, i.e. they can take at most countably many values (in most cases, actually only finitely many), or whether they are continuous. Statistical procedures that are highly appropriate in one case can lead to complete nonsense in another. The present chapter is devoted to discrete data. Possible examples are
(1) Number of cars crossing a bridge during a given hour.
(2) Number of passengers in the U35 train leaving Bochum Central Station on Sunday morning at 8:05 a.m. for the Ruhr-University.
(3) Number of car accidents on the A40 between Duisburg and Dortmund on a given day.
(4) Number of defective parts in a shipment of 1000 parts.

1. Descriptive Statistics for Discrete Data

The task of descriptive statistics is to summarize data in such a way that the essential information provided by the data set becomes clearer. Data sets are typically quite large, and the relevant information is often hidden behind the vast amount of numbers. We distinguish two different kinds of descriptive statistics, graphical and numerical summaries. It will turn out that some summary statistics are suitable for both discrete and continuous data, while others make sense only in the discrete case.

Numerical summaries. Let x1, ..., xn be a data set of real numbers. The numerical summaries introduced here make sense for both continuous and discrete data.

Arithmetic mean: The arithmetic mean is simply defined as
x̄ = (1/n) Σ_{i=1}^n xi.

x̄ is a measure of location; it tells us where the center of mass of the data set is. The arithmetic mean is a time-honored summary statistic. It suffers, however, from a lack of robustness, i.e. it is very sensitive to outliers in the data.

Median: The median of a data set is an observation with the property that at most 50% of the data lie to its left and at most 50% to its right. For a formal definition, we introduce the order statistics, obtained by ordering the original observations in increasing order, x(1) ≤ x(2) ≤ ... ≤ x(n).



If we have identical values in the original data set, they appear in the ordered data set with the same frequency. We now define the median as follows:
med(x1, ..., xn) = x((n+1)/2) if n is odd, and (x(n/2) + x(n/2+1))/2 if n is even.

Trimmed mean: The trimmed mean is simply the arithmetic mean of all data, except for a certain percentage in the extremes. Given a number α ∈ [0, 1/2], we define the α-trimmed mean as
x̄_α := (1/(n − 2[nα])) Σ_{i=[nα]+1}^{n−[nα]} x(i).

Quartiles: The three quartiles Q1, Q2, Q3 are the values that mark off the smallest 25% of the data, the median, and the largest 25% of the data. Roughly, they are defined as
Q1 = x((n+1)/4), Q3 = x(3(n+1)/4).
There are exact formulas for the case when n + 1 is not divisible by 4, but we do not need them for our purposes.

Sample variance: The sample variance is defined as
s² := (1/(n − 1)) Σ_{i=1}^n (xi − x̄)².
The sample variance measures the mean squared deviation of the observations from the mean and is thus a measure of the spread of the data. The mysterious n − 1 in the denominator appears so that s² becomes an unbiased estimator of the variance.

Interquartile range: The interquartile range is defined by
IQR = Q3 − Q1.
The IQR gives the length of the region in which the central 50% of the data are located, and provides a robust estimate of the spread of the data.

Of the above summaries, the mean, the median, the trimmed means and the quartiles are measures of location of the data; the sample variance and the interquartile range are measures of spread.

Graphical summaries.
Frequency table: In a frequency table we simply count how often each value occurred in the data set.
Histogram: The histogram of discrete data is a graphical display of the frequency table.
Empirical distribution function: The empirical distribution function is the function Fn : R → [0, 1], defined as
Fn(x) = (1/n) #{1 ≤ i ≤ n : xi ≤ x}.
That is, Fn(x) is the relative frequency of observations in our data set that are smaller than or equal to x. The empirical distribution function is an estimate of the distribution function of the random mechanism that generated the data.
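The numerical summaries above are available in Python's standard `statistics` module; a sketch on a small made-up data set containing one outlier (exact quartile conventions vary between implementations):

```python
import statistics

# made-up sample; 12.7 is a deliberate outlier
data = [4.1, 3.8, 5.0, 4.4, 12.7, 3.9, 4.6, 4.2]

mean = statistics.mean(data)
median = statistics.median(data)             # robust against the outlier
s2 = statistics.variance(data)               # sample variance, n-1 denominator
q1, q2, q3 = statistics.quantiles(data, n=4) # quartiles (interpolated)
iqr = q3 - q1                                # robust measure of spread

print(mean, median)  # the outlier pulls the mean well above the median
```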



Boxplot: A boxplot is a graphical display of the main characteristics of a data set, showing the median, the quartiles, the interquartile range and an indication of outliers.

2. Stochastic Models

Descriptive statistics is concerned exclusively with the data set at hand and does not provide answers to questions such as: which mechanisms could have generated the data, and what can we expect when we collect data under the same circumstances again? Answers to such questions can only be provided by the techniques of inferential statistics. In inferential statistics, we regard the data as realizations of random variables whose distribution is unknown and should be estimated. Most summary statistics that we encountered in the section on descriptive statistics become estimators in the context of inferential statistics. Before we can proceed with inferential statistics, we have to develop some tools from probability theory, especially the notion of a random variable and its distribution.

Definition 2.1. Let (Ω, F, P) be a probability space. A random variable (rv) is a map X : Ω → R that is measurable in the sense that {ω : X(ω) ≤ a} ∈ F for all a ∈ R.

Roughly speaking, a random variable is a rule that assigns to each outcome ω of our random experiment a value X(ω). The technical condition of measurability assures that the sets {ω : X(ω) ≤ a} are events whose probability is well defined.

Example 2.2. (i) We toss a fair die twice and note the results as ω1, ω2. Thus our sample space becomes
Ω = {(ω1, ω2) : ωi ∈ {1, ..., 6}}.
The event space is simply the space of all subsets of Ω; this is common for discrete experiments. Moreover, we assume that all outcomes are equiprobable, i.e. we use Laplace probabilities. We define the random variable X : Ω → R,
X(ω1, ω2) = ω1 + ω2,
which gives us the sum of the two face values.
(ii) Further random variables associated with the same experiment are
X(ω1, ω2) = max(ω1, ω2), X(ω1, ω2) = min(ω1, ω2), X(ω1, ω2) = |ω1 − ω2|.

Probability theory and statistics are formulated in the language of random variables and their distributions, while the original probability distribution on the sample space is less present. Random variables offer a list of advantages:
- Random variables provide a summary of the outcome of the experiment and remove all details that are of no interest to us.
- As random variables are real-valued functions, one can carry out arithmetic operations on random variables.



- In many cases, very different experiments lead to random variables with identical properties. This aspect helps to find order in the midst of the chaos of possible random experiments.

When dealing with random variables, we are only concerned with their distributions. Roughly speaking, the distribution tells us with which probabilities the various values of the random variable occur. At this point we have to separate two cases, namely discrete random variables and continuous random variables. Both require different techniques when it comes to characterizing their distributions.

Definition 2.3. A random variable is called discrete if its range of values is finite or at most countably infinite.

Example 2.4. (i) The number of successes in n independent Bernoulli trials is a discrete random variable whose range is given by {0, ..., n}.
(ii) We perform a series of independent Bernoulli experiments until the first success occurs and denote by X the number of failures preceding this first success. X is a discrete random variable with range {0, 1, 2, ...}; this set is countably infinite.
(iii) Random variables that can take any value in an interval [0, 1] are not discrete. Real-life examples are, e.g., random numbers, roundoff errors, most physical measurements, rainfall data.

Definition 2.5. (i) Let (Ω, F, P) be a probability space and X : Ω → R a random variable. The map
A ↦ P(X ∈ A) := P({ω : X(ω) ∈ A}), A ⊆ R an interval,
is called the distribution of X.
(ii) If X is a discrete random variable with range {a1, a2, ...}, we define its probability function p : {a1, a2, ...} → [0, 1] by
p(ai) := P(X = ai) = P({ω : X(ω) = ai}).

The distribution of a random variable assigns to each interval A ⊆ R the probability that X takes values in this interval. For a discrete random variable, the distribution is completely characterized by the probability function, since for any interval A ⊆ R we have
P(X ∈ A) = Σ_{ai ∈ A} p(ai).

The probability function of a discrete random variable has two characteristic properties:
(1) p(ai) ≥ 0,
(2) Σ_i p(ai) = 1.

Example 2.6. (i) Let X denote the score in a single toss of a fair die. The range of X is {1, ..., 6}; the probability function is given by
p(i) = P(X = i) = 1/6, 1 ≤ i ≤ 6.



Figure 1. Probability functions of the binomial distribution with parameters n = 20 and p = 0.75 (left) and the hypergeometric distribution with parameters N = 32, R = 24 and n = 20 (right)

This is a special case of the so-called uniform distribution on the set {1, ..., N}, which is characterized by the probability function
p(k) = 1/N, 1 ≤ k ≤ N.
(ii) Let X be the total score in two tosses of a fair die. In this case, the range is {2, ..., 12}, and the probability function is given by
p(2) = p(12) = 1/36, p(3) = p(11) = 2/36, p(4) = p(10) = 3/36,
p(5) = p(9) = 4/36, p(6) = p(8) = 5/36, p(7) = 6/36.
(iii) Let X denote the number of successes in n independent Bernoulli experiments with success probability p. The range of X is {0, ..., n}, and the probability function is given by
p(k) = C(n, k) p^k (1 − p)^(n−k).
This distribution is called the binomial distribution with parameters n and p.
(iv) We draw n balls without replacement from an urn with N balls, of which R are red and N − R white. Let X denote the number of red balls in the sample. In this case we get
P(X = r) = C(R, r) · C(N − R, n − r) / C(N, n).
This distribution is called the hypergeometric distribution with parameters (N, R, n).
(v) We perform a series of independent Bernoulli experiments until we obtain the first success and denote by X the number of failures preceding this first success. X is a discrete random variable with range {0, 1, 2, ...} and probability function
P(X = k) = (1 − p)^k p, k = 0, 1, 2, ....
This distribution is called the geometric distribution with parameter p.
(vi) More generally, we may consider the number of failures preceding the r-th success



in a series of independent Bernoulli trials. Denoting this random variable by X, we get for any k ≥ 0
P(X = k) = C(k + r − 1, r − 1) p^r (1 − p)^k = C(k + r − 1, k) p^r (1 − p)^k.

This formula can be derived as follows: in order for the event {X = k} to occur, we must have exactly k failures among the first k + r − 1 experiments and a success at the (k + r)-th experiment. Because of independence, the respective probabilities have to be multiplied. This distribution is called a negative binomial distribution with parameters (r, p).
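The four probability functions above can be written out directly; the following is a small sketch (illustrative code, not part of the original notes) using Python's standard library:

```python
# Illustrative sketch of the probability functions defined above,
# using only Python's standard library.
from math import comb

def binomial_pmf(k, n, p):
    # P(X = k) = C(n, k) p^k (1 - p)^(n - k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def hypergeometric_pmf(r, N, R, n):
    # P(X = r) = C(R, r) C(N - R, n - r) / C(N, n)
    return comb(R, r) * comb(N - R, n - r) / comb(N, n)

def geometric_pmf(k, p):
    # P(X = k) = (1 - p)^k p: k failures before the first success
    return (1 - p) ** k * p

def negative_binomial_pmf(k, r, p):
    # P(X = k) = C(k + r - 1, k) p^r (1 - p)^k
    return comb(k + r - 1, k) * p ** r * (1 - p) ** k
```

Summing each function over its range gives 1, i.e. property (2) of a probability function; the parameters of Figure 1 (n = 20, p = 0.75 and N = 32, R = 24) can be plugged in directly.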

Figure 2. Probability function of the geometric distribution with parameter p = 0.25 (left) and the negative binomial distribution with parameters r = 4 and p = 0.4 (right)

The next probability distribution, the so-called Poisson distribution, will be introduced as an approximation to the binomial distribution via the following limit theorem.

Theorem 2.7. Let (X_n)_{n≥1} be a sequence of binomial random variables with parameters n and p_n satisfying
lim_{n→∞} n p_n = λ ∈ (0, ∞).
Then for any k ≥ 0,
lim_{n→∞} P(X_n = k) = e^{−λ} λ^k / k!.

Definition 2.8. The probability distribution with probability function
p_λ(k) = e^{−λ} λ^k / k!, k ≥ 0,
is called Poisson distribution with parameter λ.

The Poisson distribution is an approximation to the binomial distribution with small success probability p, a large number of experiments and a moderate size np. Note that λ = n · p is the expected number of successes in the n trials.

Figure 3. Probability function of a Poisson distribution with parameter λ = 2 (left) and λ = 5 (right)

Example 2.9. The probability that a given person will fall ill with leukemia during one year is 1/10000. We want to calculate the probability that in a town with 20000 residents exactly k residents will get leukemia during the next year. In this case, we may safely apply the Poisson approximation with parameter λ = n · p = 20000 · (1/10000) = 2. If we denote by X the number of residents that will get leukemia during the next year, we obtain
P(X = k) ≈ e^{−2} 2^k / k!.
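The quality of the approximation in Example 2.9 is easy to check numerically; a minimal sketch (not from the notes) comparing the exact binomial probabilities with their Poisson approximation:

```python
# Illustrative check of Example 2.9: binomial(20000, 1/10000)
# versus its Poisson approximation with lambda = n*p = 2.
from math import comb, exp, factorial

n, p = 20000, 1.0 / 10000
lam = n * p  # expected number of cases, here 2.0

def binom_pmf(k):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k):
    return exp(-lam) * lam ** k / factorial(k)

# the two probability functions agree to several decimal places
max_diff = max(abs(binom_pmf(k) - poisson_pmf(k)) for k in range(10))
```

With n this large and p this small, the maximal difference over the relevant values of k is far below the accuracy of any statistical table.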

Expected value and variance. We now want to introduce two very important numerical characteristics that describe the distribution of a random variable. The first of these, the so-called expected value, gives the center of the distribution of X.

Definition 2.10. Let X be a discrete random variable with range {a1, a2, . . .} and probability function p(a_i). We then define the expected value of X by
E(X) = Σ_i a_i p(a_i).

The expected value is a weighted mean of all values in the range of X with the respective probabilities as weights. A good physical picture is provided by interpreting the expected value as a center of mass when mass p(a_i) is put into the point a_i.

Definition 2.11. The variance of the random variable X is defined as
Var(X) = E(X − E(X))².
√Var(X) is called the standard deviation of X.

The variance is the mean squared deviation of a random variable from its expected value and can be viewed as a measure of spread of the distribution. The variance is a quadratic measure of spread; the standard deviation is a linear measure. The table below provides the most important discrete distributions with their means and variances.
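Definitions 2.10 and 2.11 can be illustrated numerically; a small sketch (illustrative, not from the notes) for the two-dice score of Example 2.6 (ii):

```python
# Expected value, variance and standard deviation of the two-dice
# score, computed directly from Definitions 2.10 and 2.11.
from math import sqrt

# p(2) = 1/36, p(3) = 2/36, ..., p(7) = 6/36, ..., p(12) = 1/36
pmf = {s: (6 - abs(s - 7)) / 36 for s in range(2, 13)}

E = sum(a * p for a, p in pmf.items())               # E(X) = sum_i a_i p(a_i)
Var = sum((a - E) ** 2 * p for a, p in pmf.items())  # Var(X) = E(X - E(X))^2
sd = sqrt(Var)                                       # standard deviation
```

The result E(X) = 7 and Var(X) = 35/6 matches the table entries for sums of uniform scores.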


Distribution   Range            Probability function                           E(X)       Var(X)
Uniform        {1, . . . , N}   p(k) = 1/N                                     (N + 1)/2  (N² − 1)/12
Bernoulli      {0, 1}           p^k q^{1−k}                                    p          pq
Binomial       {0, . . . , n}   C(n, k) p^k q^{n−k}                            np         npq
Hypergeom.     {0, . . . , n}   C(R, k) C(N − R, n − k) / C(N, n)              n R/N      n (R/N)(1 − R/N)(N − n)/(N − 1)
Poisson        {0, 1, . . .}    e^{−λ} λ^k / k!                                λ          λ
Geometric      {0, 1, . . .}    q^k p                                          q/p        q/p²
Neg.-bin.      {0, 1, . . .}    C(r + k − 1, k) p^r q^k                        r q/p      r q/p²

Table 1. Probability function, expected value and variance of some important probability distributions (here q = 1 − p)

3. Inferential Statistics

In the context of inferential statistics, the data x1, . . . , xn are regarded as realizations of independent identically distributed random variables X1, . . . , Xn with probability function pk = P(Xi = k), where the probabilities pk are unknown. It is the goal of inferential statistics to estimate the unknown probabilities pk and to test hypotheses concerning these probabilities. Generally, we distinguish two types of models, namely parametric models and nonparametric models. In a parametric model, we assume that the probability function is known up to an unknown parameter θ ∈ Θ. The space Θ of possible parameter values is called the parameter space. In the case of integer-valued random variables, we are thus given a family p_θ of probability functions and an associated family of probability distributions P_θ such that p_θ(k) = P_θ(X = k).

Example 2.12. (i) We perform n independent Bernoulli experiments with unknown success probability θ ∈ [0, 1]; in this case our parameter space is Θ = [0, 1]. By X we denote the number of successes. We know that X has a binomial distribution with parameters n and θ, i.e.
P_θ(X = k) = C(n, k) θ^k (1 − θ)^{n−k}.
(ii) We consider the number of claims with claim size larger than a million euros submitted to a large automobile insurance company. Since this concerns very rare events, we may apply the Poisson approximation. In this case, the parameter λ is unknown. The parameter space is Θ = (0, ∞), and
P_λ(X = k) = e^{−λ} λ^k / k!.

In both examples given above, the task of the statistician is to estimate the unknown parameter, to test hypotheses concerning the value of the parameter, and to test the hypothesis that the above model makes sense. Before we study details, we will first investigate nonparametric procedures.

Nonparametric procedures. Here we will introduce procedures that are universally applicable, as they do not require any kind of parametric model.

Empirical distribution function: We have already defined earlier the empirical distribution function Fn : R → [0, 1] of the data x1, . . . , xn ∈ R by
Fn(x) = (1/n) #{1 ≤ i ≤ n : xi ≤ x}.
Fn(x) is an estimator of the underlying distribution function F(x) = P(X ≤ x). This estimator makes sense for discrete as well as for continuous data, though it is in fact used most often in the context of continuous data.

Relative frequency: Let X be a discrete random variable with integer values and unknown probability function pk = P(X = k). Given observations x1, . . . , xn, we can estimate pk by the relative frequency with which the value k occurred in the sample, i.e.
p̂k = (1/n) #{1 ≤ i ≤ n : xi = k}.
It is important to realize that this relative frequency does not equal the unknown probability: p̂k is only an estimate of pk. When we repeat the experiment under identical circumstances, we will most likely get a different estimate, while the probability function remains the same.

Moment estimators: We define the k-th moment and the k-th central moment of a random variable X by
mk = E(X^k),
ck = E(X − E(X))^k.
We have already met some of these moments: the first moment is the expected value, the second central moment is the variance. Given realisations x1, . . . , xn of n independent random variables with the same distribution, we can estimate the moments of the underlying distribution by the corresponding sample moments. These are defined by
m̂k := (1/n) Σ_{i=1}^n xi^k.
Estimating the central moments requires some more care; a first guess is to take
ĉk = (1/n) Σ_{i=1}^n (xi − x̄)^k.
However, we have already seen in the case of the second moments that a correction factor is needed in order to assure unbiasedness of the estimators. In this way we


obtained the following estimator for the variance,
s²_x := (1/(n − 1)) Σ_{i=1}^n (xi − x̄)².
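The nonparametric estimators of this subsection fit in a few lines of code; the following sketch (illustrative data, not from the notes) computes the empirical distribution function, the relative frequencies and the sample moments:

```python
# Nonparametric estimators: empirical distribution function,
# relative frequencies and sample moments for a small data set.
from collections import Counter

data = [2, 3, 3, 5, 2, 3, 4, 2, 3, 5]
n = len(data)

def F_hat(x):
    # empirical distribution function F_n(x) = #{i : x_i <= x} / n
    return sum(1 for xi in data if xi <= x) / n

p_hat = {k: c / n for k, c in Counter(data).items()}  # relative frequencies

def m_hat(k):
    # k-th sample moment
    return sum(xi ** k for xi in data) / n

xbar = m_hat(1)                                        # first sample moment
s2 = sum((xi - xbar) ** 2 for xi in data) / (n - 1)    # unbiased sample variance
```

Note the 1/(n − 1) factor in s², in line with the correction discussed above.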

There are similar formulas for general central moments. As with the empirical distribution function and with the relative frequencies, we note also for the moment estimators that they provide only an estimate that has to be distinguished from the corresponding theoretical moments.

Maximum likelihood estimators. The best known technique for the estimation of parameters is the maximum likelihood method. One can in fact show that the ML estimator is in some sense asymptotically optimal when the sample size is large. Given a sample x1, . . . , xn that may be regarded as n independent realizations of a random variable X with integer values and probability function p_θ(k) := P_θ(X = k), we define the likelihood function
L(θ) = L_{x1,...,xn}(θ) := p_θ(x1) · · · p_θ(xn).
Often it is more convenient to work with the log-likelihood function l(θ) = log(L(θ)). The maximum likelihood estimator, also abbreviated as ML estimator or MLE, of the parameter θ is the value of θ where the likelihood function takes its maximum. In a concise mathematical notation, this is
θ̂_ML = argmax_θ L(θ).
Since the logarithm is a monotonically increasing function, we may as well compute the ML estimator by finding the maximum of the log-likelihood function. In most examples, the latter is easier to calculate.

Example 2.13. (i) Given n independent realizations of a binomially distributed random variable with unknown parameter θ ∈ [0, 1], the ML estimator of θ is given by
θ̂ = (1/n) Σ_{i=1}^n xi.
(ii) Similarly, if we assume that the data were generated by a Poisson distribution with unknown parameter λ ∈ (0, ∞), the ML estimator is
λ̂ = (1/n) Σ_{i=1}^n xi.
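Example 2.13 (ii) can be verified numerically; a minimal sketch (illustrative data) checking that the Poisson log-likelihood is indeed maximized at the sample mean:

```python
# Numerical check for Example 2.13 (ii): the Poisson log-likelihood
# l(lambda) = sum_i (x_i log(lambda) - lambda - log(x_i!)) is maximized
# at lambda = xbar.
from math import log, lgamma

data = [1, 0, 2, 4, 1, 3, 0, 2]
lam_hat = sum(data) / len(data)   # ML estimator = sample mean

def loglik(lam):
    # lgamma(x + 1) = log(x!)
    return sum(x * log(lam) - lam - lgamma(x + 1) for x in data)

# nearby parameter values give a strictly smaller log-likelihood
assert loglik(lam_hat) > loglik(lam_hat - 0.1)
assert loglik(lam_hat) > loglik(lam_hat + 0.1)
```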

Method of moments. While the ML estimator is asymptotically optimal, the calculations required to determine the MLE are often difficult and require numerical procedures. In this section, we will introduce an estimator that is much easier to calculate, but unfortunately does not share the optimality properties. The method of moments may be applied if we are able to express an unknown parameter as a function of the theoretical moments, i.e.
θ = g(m1, . . . , mk).


We then obtain the method of moments estimator of the parameter by replacing the theoretical moments in the argument of g by their estimates m̂1, . . . , m̂k, i.e.
θ̂ = g(m̂1, . . . , m̂k).

Example 2.14. (i) Let X be a binomially distributed random variable with parameters m and θ, where m is known and θ unknown. In this case, we have m1 = E(X) = mθ, and thus we can express θ in the following way as a function of the first moment:
θ = m1/m.
Hence the method of moments estimate of θ becomes
θ̂ = m̂1/m = (1/(mn)) Σ_{i=1}^n xi.
(ii) Let X be a Poisson-distributed random variable with unknown parameter λ; in this case we have m1 = λ and thus λ = m1. Hence, the method of moments estimate is
λ̂ = (1/n) Σ_{i=1}^n xi.

In both cases, the method of moments estimator coincides with the ML estimator. This is not always the case, as we will see in the next chapter when we study continuous random variables.

4. Goodness of fit tests

In parametric statistics, we always make a specific model assumption. Of course this assumption may be false, and as a result our statistical inference can become completely meaningless. Thus it is important to have tests for the validity of model assumptions. In this section we will present the χ²-test, which is suitable for discrete data.

Let x1, . . . , xn be independent realizations of a random variable with range {a1, . . . , ak}. We want to test the hypothesis that P(X = ai) = pi, where p1, . . . , pk are given numbers; this is called a simple hypothesis. In order to test this hypothesis, we determine the frequencies n1, . . . , nk of the values a1, . . . , ak in the sample. Under the hypothesis, we expect that the value ai occurs n · pi times in the sample. Thus it makes sense to compare these expected frequencies with the observed frequencies. We collect both in one table as follows:

a_i     a1      a2      . . .   ak
n_i     n1      n2      . . .   nk
n p_i   n p1    n p2    . . .   n pk

In the year 1900, Karl Pearson proposed the χ²-test for goodness of fit, which uses the test statistic
T = Σ_{i=1}^k (ni − n pi)² / (n pi)


and rejects the hypothesis if T ≥ χ²_{k−1,α}. Here χ²_{k−1,α} denotes the α-percentile of the χ²-distribution with k − 1 degrees of freedom, which can be found in all statistical tables or in computer packages.

So far, we have considered the χ²-test for the case of a simple hypothesis, i.e. when the probabilities p1, . . . , pk are exactly known. More often, we are given the hypothesis that the unknown probabilities belong to a given family pi(θ) of distributions, where θ is an unknown parameter. The precise formulation of the hypothesis is then that for some θ ∈ Θ
P(X = ai) = pi(θ), 1 ≤ i ≤ k.
In order to test this hypothesis, we first estimate the unknown parameter by the maximum likelihood estimator θ̂. Next we compare the observed frequencies ni with the expected frequencies n pi(θ̂) via the test statistic
T = Σ_{i=1}^k (ni − n pi(θ̂))² / (n pi(θ̂)).
We reject the null hypothesis when
T ≥ χ²_{k−p−1,α},
where p is the dimension of the parameter space Θ.

The theoretical basis for the χ²-test is the fact that under the hypothesis, the test statistic T is asymptotically χ²-distributed as the sample size becomes larger. For practical applications, we have to make sure that the sample size is large enough for the χ²-test to be valid. The most common rule requires that all expected frequencies should exceed 1 and that at most 20% of the expected frequencies should be smaller than 5. If these requirements are not met, one has to combine some of the outcomes in order to be able to apply the χ²-test.
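A worked sketch of the χ²-test for a simple hypothesis (illustrative numbers, not from the notes): testing whether a die is fair from 120 observed throws; the critical value χ²_{5,0.95} ≈ 11.07 is taken from a standard table.

```python
# Pearson's chi-squared test for the simple hypothesis p_i = 1/6:
# compare observed frequencies with the expected frequencies n*p_i.
observed = [18, 22, 21, 19, 24, 16]   # frequencies n_1, ..., n_6
n = sum(observed)                     # 120 throws in total
expected = [n / 6] * 6                # n * p_i = 20 under fairness

T = sum((ni - ei) ** 2 / ei for ni, ei in zip(observed, expected))

critical = 11.07                      # chi-square table, k - 1 = 5 degrees of freedom
reject = T >= critical                # here T is small: the hypothesis is not rejected
```

All expected frequencies equal 20, so the rule of thumb above (expected frequencies not too small) is comfortably satisfied.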

CHAPTER 3

Continuous Distributions
In this chapter, we will investigate continuous data, i.e. data that can potentially take a continuum of values. Real-life examples are, e.g.,
(1) Maximum wind speed in a given year on the campus of the Ruhr-University
(2) Measurement error in a physical experiment
(3) Maximum water level of the Ruhr river in Hattingen in a given year
(4) Salary of a randomly chosen resident of Bochum
Following the pattern of the previous section, we will study descriptive statistics, stochastic models, and statistical inference for continuous distributions.

1. Descriptive Statistics for Continuous Data

Most techniques that we introduced in the previous chapter still make sense for continuous data; the notable exception is the frequency table.

Numerical summaries. We have already defined the quantities arithmetic mean, median, trimmed mean, quartiles, sample variance and interquartile range. Two more quantities that are usually only applied in connection with continuous data are skewness and kurtosis, which will be defined now.

Skewness: Skewness is a numerical measure for the degree to which data are non-symmetric about their mean. It is defined as
(1/n) Σ_{i=1}^n (xi − x̄)³.
Note that the skewness is 0 in case the data are perfectly symmetrically distributed with respect to the mean; in this case positive and negative values of (xi − x̄)³ cancel.

Kurtosis: Kurtosis is defined analogously via the fourth powers (xi − x̄)⁴ and measures how strongly the data are concentrated around their mean.

Graphical summaries. We have already defined the quantities empirical distribution function and boxplot. New are the histogram and the Q-Q-plot.

Histogram: Assume that the range of our data set is the interval (a, b]. In order to define the histogram, we need a partition of the range of our data set, given by points a = a0 < a1 < . . . < ak−1 < ak = b. We then define the value of the histogram hn(x) by
hn(x) = #{1 ≤ i ≤ n : aj−1 < xi ≤ aj} / (n (aj − aj−1)), if x ∈ (aj−1, aj].
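The histogram definition above can be sketched directly (illustrative data, not from the notes); note that by construction the histogram integrates to 1:

```python
# The histogram h_n(x): relative frequency of the cell containing x,
# divided by the cell width.
def histogram(data, breaks):
    # breaks: partition a = a_0 < a_1 < ... < a_k = b of the range
    n = len(data)
    heights = []
    for left, right in zip(breaks, breaks[1:]):
        count = sum(1 for x in data if left < x <= right)
        heights.append(count / (n * (right - left)))
    return heights

data = [0.1, 0.4, 0.45, 0.7, 1.2, 1.5, 1.9, 2.5]
breaks = [0.0, 0.5, 1.0, 2.0, 3.0]
heights = histogram(data, breaks)
```

The partition here is deliberately unequal; the division by the cell width is what makes the histogram comparable to a density.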

The histogram depends crucially on the choice of the partition. Statistical packages use automatic procedures to make this choice. The histogram is an estimator of the density of the underlying distribution.

Q-Q-plot: The Q-Q-plot is a plot of the empirical distribution function on a graph with non-linear axes chosen in such a way that the distribution function of any normal distribution would become a straight line. The best fitting normal distribution is usually also displayed in the Q-Q-plot. A quick graphical test for normality of the data is thus the deviation of the Q-Q-plot from this straight line.

2. Stochastic Models

In the context of inferential statistics, the data x1, . . . , xn are regarded as realizations of n independent identically distributed random variables. The data provide information about the underlying distribution of these random variables. In this section, we will study models for the distribution of random variables whose range is a continuum of values.

Density and Distribution Function for Continuous Random Variables.

Definition 3.1. (i) The distribution function F(x) of a real-valued random variable X is defined by
F(x) := P(X ≤ x).
(ii) We say that the random variable X has a continuous distribution if there exists a function f : R → R such that for all intervals [a, b] ⊆ R we have
P(X ∈ [a, b]) = ∫_a^b f(x) dx.
f is called the probability density, or for short density, of the random variable X.

There is a simple relation between the probability density and the distribution function; namely, we have
F(x) = ∫_{−∞}^x f(t) dt,
f(x) = F′(x);
i.e. the distribution function is the integral of the density and the density is the derivative of the distribution function.

Example 3.2. The distribution with the density
f(x) = (α − 1) x^{−α} for x ≥ 1, and f(x) = 0 otherwise,
is called a Pareto distribution with parameter α; here α ∈ (1, ∞). The corresponding distribution function is given by
F(x) = ∫_1^x f(t) dt = 1 − 1/x^{α−1}
for x ≥ 1, and F(x) = 0 for x ≤ 1.
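The relation F′ = f from Definition 3.1 can be checked numerically for Example 3.2; a small sketch (illustrative parameter value) integrating the Pareto density with the midpoint rule:

```python
# Numerical check for Example 3.2: integrating the Pareto density
# f(x) = (alpha - 1) x^(-alpha) over [1, x] recovers F(x) = 1 - 1/x^(alpha-1).
alpha = 3.0

def f(x):
    return (alpha - 1.0) * x ** (-alpha) if x >= 1.0 else 0.0

def F_numeric(x, steps=100000):
    # midpoint rule on [1, x]
    h = (x - 1.0) / steps
    return h * sum(f(1.0 + (i + 0.5) * h) for i in range(steps))

x = 4.0
approx = F_numeric(x)
exact = 1.0 - 1.0 / x ** (alpha - 1.0)   # = 15/16 for alpha = 3
```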


Mean, variance and higher order moments. We will introduce some important numerical characteristics of the distribution of a continuous random variable. The expected value is defined as
E(X) := ∫ x f(x) dx,
where f(x) is the density of the random variable X. Like in the discrete case, the expected value of a continuous random variable is a weighted mean of the possible values of the random variable, with weights given by the density. The expected value describes the location of the distribution.

In many cases, we are interested in the expected value of a function u(X) of a given random variable X. In this case, we can use the following transformation formula:
E(u(X)) = ∫ u(x) f(x) dx.
In this way, we may calculate the expected value of X² as follows:
E(X²) = ∫ x² f(x) dx.

The expected value is linear in the sense that given two random variables X1, X2 and real numbers a1, a2, b, we have
E(a1 X1 + a2 X2 + b) = a1 E(X1) + a2 E(X2) + b.

Like in the discrete case, the variance of the random variable X is defined by
Var(X) := E(X − E(X))².
The interpretation is again the same: the variance measures the mean squared deviation of the random variable from its mean; it is a measure of the spread of the data. In calculations, the following formula is often useful:
Var(X) = E(X²) − (E(X))².

Moreover, we define the k-th moment mk and the k-th central moment ck of a random variable X by
mk = E(X^k),
ck = E(X − E(X))^k.
Of these moments, only the third and the fourth moment have practical relevance. We define the skewness S and the kurtosis K of a distribution via these moments:
S = c3 / (c2)^{3/2},
K = c4 / (c2)².
S is a measure for the deviation from symmetry of a distribution; for symmetric distributions we have S = 0. K measures how much the distribution is concentrated around its mean. For a normal distribution we obtain K = 3. Thus one often considers the standardized quantity
E = K − 3,


called excess. The excess is the basis for a quick test for normality of a distribution.

3. Important Continuous Distributions

Uniform Distribution. The uniform distribution on the interval [a, b] ⊆ R has the density
(1) f(x) := (1/(b − a)) 1_{[a,b]}(x).
It is obvious that f is indeed a density, i.e. that f is nonnegative and ∫ f(x) dx = 1. As a symbol for the uniform distribution, we use U(a, b), and we write X ∼ U(a, b) to denote that X has a uniform distribution on [a, b]. The continuous uniform distribution is an analogue of the discrete uniform distribution. For any interval I ⊆ [a, b], we have
P(X ∈ I) = |I| / (b − a),
where |I| denotes the length of I, i.e. the probability of a realization in the interval I is proportional to the length of I. The uniform distribution is a good model for choosing a number from the interval [a, b] at random. Round-off errors in numerical calculations may be regarded as uniformly distributed on the interval [−1/2, 1/2].

Figure 1. Density (left) and distribution function (right) of a uniform distribution.

We get the following formulas for the expected value and the variance of a uniform distribution:
E(X) = (a + b)/2,
Var(X) = (b − a)²/12.

Normal distribution. The normal distribution with parameters μ and σ², where μ ∈ R, σ² > 0, has the density
(2) f(x) := (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)}.
The symbol for the normal distribution is N(μ, σ²). The normal distribution is the most popular distribution in statistics, for a variety of reasons. First of all, many quantities in nature are really normally distributed. This is a consequence of the famous Central Limit Theorem, which states that the sum of a large number of individually small contributions is approximately normally distributed. Many random phenomena in nature can be regarded as the result of a large number of small individual contributions. A good example are measurement errors in physical experiments.

Figure 2. Density (left) and distribution function (right) of a normal distribution.

The normal distribution was discovered by Abraham de Moivre (1667-1754) as an approximation to the binomial distribution for large n. Note that, in contrast with the Poisson approximation, here p is fixed. Carl Friedrich Gauss (1777-1855) gave the normal distribution its central role in statistics; among other things, Gauss argued that the central limit theorem would provide a justification for the use of the normal distribution in statistics. Thus, the normal distribution is often called the Gaussian distribution. The corresponding density is called the Gauss curve or the bell-shaped curve. On the last 10 DM bill before the introduction of the euro, the normal density was printed next to a portrait of Gauss.

The expected value and the variance of the normal distribution are given by the following formulas:
E(X) = μ,
Var(X) = σ².

Calculations with the normal distribution are complicated by the fact that the integral of the normal density cannot be calculated analytically. Thus probabilities of the type P(a ≤ X ≤ b) cannot be calculated by hand, but only via numerical techniques or by the use of statistical tables. Tables are available for the standard normal distribution N(0, 1). Its distribution function is usually denoted by the symbol Φ(x), i.e.
Φ(x) = ∫_{−∞}^x (1/√(2π)) e^{−t²/2} dt.
If X has a normal distribution with parameters μ and σ², we can consider the standardized random variable
Z := (X − μ)/σ.


Z has a standard normal distribution, since
P(a ≤ Z ≤ b) = P(μ + aσ ≤ X ≤ μ + bσ) = ∫_{μ+aσ}^{μ+bσ} (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} dx = ∫_a^b (1/√(2π)) e^{−x²/2} dx.
Thus, in order to compute probabilities referring to the random variable X, we can transform these into probabilities referring to Z and then use a table of standard normal probabilities.

Example 3.3. Suppose X ∼ N(5, 16); then
P(1 ≤ X ≤ 5) = P(−1 ≤ Z ≤ 0) = Φ(0) − Φ(−1) ≈ 0.34,
where Z := (X − 5)/4.
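Outside of tables, Φ can be evaluated via the error function, since Φ(x) = (1 + erf(x/√2))/2; a small sketch (not from the notes) reproducing Example 3.3:

```python
# Example 3.3 by computer: standardize X ~ N(5, 16) and evaluate the
# standard normal distribution function via the error function.
from math import erf, sqrt

def Phi(x):
    # Phi(x) = (1 + erf(x / sqrt(2))) / 2
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

mu, sigma = 5.0, 4.0                                   # X ~ N(5, 16)
prob = Phi((5 - mu) / sigma) - Phi((1 - mu) / sigma)   # = Phi(0) - Phi(-1)
```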

Exponential Distribution. The exponential distribution with parameter λ > 0 has the density
(3) f(x) := λ e^{−λx} 1_{[0,∞)}(x).
We use the symbol Exp(λ). The exponential distribution is the continuous analogue of the geometric distribution and is popular as a model for life times, e.g. of industrial parts. If T is an Exp(λ)-distributed random variable, we get for any t > 0
P(T ≥ t) = ∫_t^∞ λ e^{−λx} dx = e^{−λt}.
Thus we have
P(T ≥ s + t | T ≥ t) = P(T ≥ s + t, T ≥ t) / P(T ≥ t) = P(T ≥ s + t) / P(T ≥ t) = e^{−λ(s+t)} / e^{−λt} = e^{−λs} = P(T ≥ s)
for all s, t > 0. This remarkable property is known as the lack of memory of the exponential distribution. In connection with life times, this implies that the distribution of the remaining life time of an individual or of a machine part that has already survived for at least t years is the same as for a newborn or a new machine part. This assumption is in most cases unrealistic. However, the waiting time until a radioactive decay follows this distribution.

Expectation and variance of an exponential distribution are given by
E(X) = 1/λ,
Var(X) = 1/λ².
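The lack-of-memory property is also easy to observe in a simulation; a sketch (illustrative parameter values, not from the notes):

```python
# Simulation of the lack-of-memory property: for T ~ Exp(lambda), the
# empirical estimate of P(T >= s + t | T >= t) should be close to
# P(T >= s) = exp(-lambda * s).
import random
from math import exp

random.seed(1)
lam, s, t = 1.0, 0.5, 1.0
sample = [random.expovariate(lam) for _ in range(200000)]

survivors = [x for x in sample if x >= t]              # condition on T >= t
cond_prob = sum(1 for x in survivors if x >= s + t) / len(survivors)
target = exp(-lam * s)                                 # P(T >= s)
```

The conditional relative frequency agrees with e^{−λs} up to simulation error of order 1/√n.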

Figure 3. Density of an exponential distribution (left) and a gamma distribution (right)

Gamma distribution. The gamma distribution with parameters r > 0 and λ > 0 is given by the density
(4) f(x) := (λ^r / Γ(r)) x^{r−1} e^{−λx} 1_{[0,∞)}(x).
Here, Γ denotes the gamma function, defined as
Γ(t) := ∫_0^∞ x^{t−1} e^{−x} dx, t > 0.
We use the symbol Gamma(r, λ). Note that the exponential distribution is a special case of a gamma distribution, where r = 1. Expected value and variance of a gamma distributed random variable are given by
E(X) = r/λ,
Var(X) = r/λ².

χ²-distribution. We have already encountered the χ²-distribution in connection with the χ²-test for goodness of fit. It turns out that the χ²-distribution is a special case of the gamma distribution: the Gamma(f/2, 1/2)-distribution is also called χ²-distribution with f degrees of freedom, and denoted by the symbol χ²_f. The χ²-distribution occurs often in the context of normally distributed random variables. E.g., if U1, . . . , Un are independent standard normally distributed random variables, then U1² + . . . + Un² is χ²_n-distributed.

t- and F-distribution. The t- and the F-distribution play a prominent role in statistics when the underlying data are normally distributed. We will introduce these distributions at a later moment in connection with their applications. For both distributions, the densities can be explicitly calculated. However, there are no analytical formulas for the distribution functions, and thus one has to use statistical tables or numerical procedures.

Distribution   Density function                                          E(X)         Var(X)
U[a, b]        (1/(b − a)) 1_{(a,b)}(x)                                  (b + a)/2    (b − a)²/12
N(μ, σ²)       (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)}                             μ            σ²
Exp(λ)         λ e^{−λx} 1_{(0,∞)}(x)                                    1/λ          1/λ²
Gamma(r, λ)    (λ^r/Γ(r)) x^{r−1} e^{−λx} 1_{(0,∞)}(x)                   r/λ          r/λ²
χ²_n           (1/(2^{n/2} Γ(n/2))) x^{n/2−1} e^{−x/2} 1_{(0,∞)}(x)      n            2n
Beta(r, s)     (1/B(r, s)) x^{r−1} (1 − x)^{s−1} 1_{(0,1)}(x)            r/(r + s)    rs/((r + s + 1)(r + s)²)

Table 1. Density functions, expected values, and variances of important continuous distributions

4. Transformation of Densities

We often face the situation that we wish to determine the distribution of the random variable X = u(Y), where u is some known function and Y is a random variable with known density fY(y). If u is a monotone function that is continuously differentiable, we may apply a transformation formula that we will now develop in the case that u is monotonically increasing. We first compute the distribution function of X,
FX(x) = P(X ≤ x) = P(u(Y) ≤ x) = P(Y ≤ u^{−1}(x)) = FY(u^{−1}(x)).
Here u^{−1} is the inverse of u. For monotonically decreasing functions u we have
FX(x) = P(X ≤ x) = P(u(Y) ≤ x) = P(Y ≥ u^{−1}(x)) = 1 − FY(u^{−1}(x)).
We can then determine the density of X by taking derivatives on both sides. In both cases, we obtain
fX(x) = |d/dx u^{−1}(x)| fY(u^{−1}(x)).

Example 3.4. (i) A non-negative random variable X is called log-normally distributed if ln(X) has a normal distribution. In order to determine the density of a log-normal distribution, we define the random variable Y = ln(X). By definition, Y has the density function
fY(y) = (1/√(2πσ²)) exp(−(y − μ)²/(2σ²)).
Since X = e^Y, i.e. Y = ln(X), we may apply the above transformation formula and get
fX(x) = |ln′(x)| fY(ln(x)) = (1/(x√(2πσ²))) exp(−(ln(x) − μ)²/(2σ²)).
(ii) Let Y be a random variable with density fY(y) and define X = a·Y + b,

where a > 0, b ∈ R are constants. Thus u(y) = a·y + b and u^{−1}(x) = (x − b)/a, and hence the density function of X is given by
fX(x) = (1/a) fY((x − b)/a).
E.g., if Z has a standard normal distribution, then X = σZ + μ has the density
fX(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)),

i.e. X has a normal distribution with parameters (μ, σ²).
(iii) In simulation studies, we often want to generate random variables with a given distribution function F. Most computers provide us with random numbers that are uniformly distributed on the unit interval [0, 1]. Denoting such a random number by Y, one can show that
X = F^{−1}(Y)
has the distribution function F. This procedure is sometimes called the quantile transform.

5. Inferential Statistics

Nonparametric procedures. The task of inferential statistics is to draw conclusions about the distribution of the random variables that have generated the data. In this endeavor we distinguish between nonparametric and parametric procedures, like in the case of discrete data. Important estimators in nonparametric statistics are the empirical distribution function as estimate of the underlying distribution function and the histogram as estimate of the density. In the same way, we may estimate the quantiles of the underlying distribution by the quantiles of the empirical distribution function. Further nonparametric estimators are the sample moments, which are estimators of the corresponding theoretical moments. Nonparametric procedures have the distinct advantage that they do not depend on the validity of model assumptions. On the other hand, in parametric models one can usually find procedures that are superior to the nonparametric counterparts - of course at the risk that these might be useless in case the model assumptions are violated. There are situations where a parametric model is the only sensible option. This holds e.g. for extreme value statistics, where we have to make inference about a range where no observations have ever been made.

Maximum Likelihood Estimator. The ML technique is applicable whenever the model specifies that the underlying density belongs to a family of densities f_θ(x), where θ = (θ1, . . . , θp) ∈ Θ ⊆ R^p is a finite dimensional parameter. Given the data x1, . . . , xn, which we view as realizations of independent random variables with density f_θ(x), where θ is unknown, we define the likelihood function
(5) L(θ) = f_θ(x1) · · · f_θ(xn).
The ML estimator of θ is the value of θ that maximizes the likelihood function. In practice, one determines the ML estimator by setting the partial derivatives of the likelihood function equal to zero. As a result we obtain p equations with p unknowns; these equations are called the likelihood equations. Except for very simple situations,


there are usually no analytic solutions to the likelihood equations. Hence one has to use numerical procedures in order to determine the ML estimator. In practice, it is often easier to determine the maximum of the log-likelihood function, defined by
l(θ) = ln L(θ) = Σ_{i=1}^n ln f_θ(xi).
Since ln(x) is a strictly increasing function, the log-likelihood function assumes its maximum in the same value as the likelihood function itself.

Example 3.5. (i) We are given the data x1, . . . , xn and we assume that these are realizations of n independent, identically distributed random variables with a normal distribution whose parameters (μ, σ²) are unknown. In this case, we have the likelihood function
L(μ, σ) = (1/√(2πσ²)) e^{−(x1−μ)²/(2σ²)} · · · (1/√(2πσ²)) e^{−(xn−μ)²/(2σ²)} = (2π)^{−n/2} σ^{−n} exp(−(1/(2σ²)) Σ_{i=1}^n (xi − μ)²)
and the log-likelihood function
l(μ, σ) = −(n/2) log(2π) − n log(σ) − (1/(2σ²)) Σ_{i=1}^n (xi − μ)².
Taking partial derivatives with respect to μ and σ, we obtain the likelihood equations
0 = (1/σ²) Σ_{i=1}^n (xi − μ),
0 = −n/σ + (1/σ³) Σ_{i=1}^n (xi − μ)².
The solution of this system of equations is given by
μ̂ = (1/n) Σ_{i=1}^n xi = x̄,
σ̂² = (1/n) Σ_{i=1}^n (xi − μ̂)².
Thus, we have determined the ML estimators of the parameters of a normal distribution. Instead of the ML estimator for the variance, one usually takes the sample variance s²_x := (1/(n − 1)) Σ_{i=1}^n (xi − x̄)²; in contrast with the ML estimator, the sample variance is an unbiased estimator.
(ii) Again, given data x1, . . . , xn, we make the assumption that these are realizations of n independent identically distributed random variables with an exponential distribution with unknown parameter λ. Some easy calculations show that the ML estimator for λ is given by
λ̂ = n / Σ_{i=1}^n xi = 1/x̄.
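Example 3.5 (ii) can be checked numerically; a small sketch (illustrative data, not from the notes) confirming that the exponential log-likelihood is maximized at 1/x̄:

```python
# Numerical check for Example 3.5 (ii): the exponential log-likelihood
# l(lambda) = n*log(lambda) - lambda*sum(x_i) is maximized at lambda = 1/xbar.
from math import log

data = [0.3, 1.2, 0.7, 2.5, 0.4, 1.1]
n, total = len(data), sum(data)
lam_hat = n / total          # ML estimator 1/xbar

def loglik(lam):
    return n * log(lam) - lam * total

# nearby parameter values give a strictly smaller log-likelihood
assert loglik(lam_hat) > loglik(0.9 * lam_hat)
assert loglik(lam_hat) > loglik(1.1 * lam_hat)
```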

(iii) The gamma distribution is a first example where one cannot determine the ML


estimator analytically. The likelihood equations are nonlinear and cannot be solved explicitly; in this case one has to use numerical procedures. Unbiased estimators. The estimate of an unknown parameter will generally not be equal to the true value. In fact, we cannot expect to determine the precise value of the unknown parameter on the basis of nitely many observations. If we repeat the same experiment that led to the original data under identical circumstances, we will get a dierent value of our estimate. In this way, the estimator is itself a random variable whose distribution can be determined (but this distribution will generally depend on the unknown value of the parameter). The estimator is called unbiased for the parameter , if E() = . If we use an unbiased estimator, in the long run we hit the true value on average correctly. Thus unbiased estimators have a desirable property. Considering the above estimators, we note that the arithmetic mean and the sample variance are unbiased estimators for the expected value and the variance of 1 the underlying distribution. The ML estimator n n (xi x)2 , however, is not an i=1 unbiased estimator for the variance. Method of Moments. We have already introduced the method of moments in the previous chapter in the context of discrete data. In the case of continuous data, the method of moments may be used as well. Example 3.6. (i) For the normal distribution note that = E(X) and 2 = Var(X). Thus the method of moments estimators of and 2 are given by M M = x; M M 2 1 = m2 m1 = n
n

i=1

(xi x)2 .

In this case the method of moments estimator and the ML estimator coincide.
(ii) If we want to estimate the parameters of a gamma distribution, we use the fact that $E(X) = \frac{r}{\lambda}$, $\mathrm{Var}(X) = \frac{r}{\lambda^2}$ and hence
$$\lambda = \frac{E(X)}{\mathrm{Var}(X)} \approx \frac{m_1}{m_2 - m_1^2}, \qquad r = \lambda\,E(X) \approx \frac{m_1^2}{m_2 - m_1^2}.$$
Thus we obtain the following estimators for the parameters:
$$\hat{\lambda} = \frac{\bar{x}}{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat{r} = \frac{\bar{x}^2}{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2}.$$

Confidence intervals. The estimates that we studied so far are also called point estimates, as they give a single point as estimate for $\theta$. Of course, we know very well that the estimate is not equal to the true value! Confidence regions provide a technique to quantify this uncertainty about the true value of the unknown parameter. Technically speaking, a confidence region is a map that associates to the
data $x_1, \ldots, x_n$ a region $C(x_1, \ldots, x_n) \subset \Theta$. We call this map a $(1-\alpha)$ confidence region if for all parameter values $\theta \in \Theta$
$$P_\theta\big(\theta \in C(X_1, \ldots, X_n)\big) \geq 1 - \alpha.$$
I.e., we demand that the confidence region covers the true parameter with probability at least $(1-\alpha)$. A common confidence level is 95%; in this case the true value is covered by the confidence region in 19 out of 20 cases. Usually, a confidence region is an interval - in this case one talks about confidence intervals.

Example 3.7. A $(1-\alpha)$ confidence interval for the parameter $\mu$ of a normal distribution is given by
$$\Big[\bar{x} - t_{n-1;\alpha/2}\sqrt{\frac{s^2}{n}},\; \bar{x} + t_{n-1;\alpha/2}\sqrt{\frac{s^2}{n}}\Big],$$
where $t_{n-1;\alpha/2}$ is the $(1-\alpha/2)$-quantile of the $t_{n-1}$ distribution.

6. Goodness of fit tests

Kolmogorov-Smirnov Test. The Kolmogorov-Smirnov test provides a technique to test the hypothesis that the underlying distribution equals a fixed given distribution $F$. The test is based on a comparison between the empirical distribution function $F_n(x)$ and the hypothetical distribution function $F(x)$. We calculate the test statistic
$$D = \sup_x |F_n(x) - F(x)|$$
and reject the hypothesis if $D \geq D_{n,\alpha}$. Here $D_{n,\alpha}$ is the $(1-\alpha)$-quantile of the so-called Kolmogorov-Smirnov distribution. The latter is the distribution of $D$ under the hypothesis that the data are generated by random variables with distribution function $F$. Kolmogorov and Smirnov managed to calculate this distribution asymptotically; the result is incorporated into statistical tables and into computer programs.

$\chi^2$ Goodness of Fit Test. One can apply the $\chi^2$ goodness of fit test also to continuous data if one groups the data first. The remaining procedure is the same as in the case of discrete data.
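As a small numerical illustration, the statistic $D$ can be computed directly: the empirical distribution function $F_n$ jumps at the ordered data points, so the supremum is attained at one of them. The sketch below uses a hypothetical sample and the hypothesis $F(x) = x$ (the uniform distribution on $[0,1]$); in practice one would rather call a library routine such as scipy.stats.kstest.

```python
def ks_statistic(xs, F):
    """Kolmogorov-Smirnov statistic D = sup_x |F_n(x) - F(x)|.
    F_n jumps from (i-1)/n to i/n at the i-th order statistic, so the
    supremum over x is attained at one of the data points."""
    xs = sorted(xs)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        d = max(d, abs(i / n - F(x)), abs((i - 1) / n - F(x)))
    return d

# Hypothetical sample, tested against F(x) = x (uniform on [0, 1]):
sample = [0.10, 0.35, 0.42, 0.58, 0.90]
D = ks_statistic(sample, lambda x: x)
```

For this sample the largest discrepancy occurs at $x = 0.58$, where $F_n$ jumps from 0.6 to 0.8 while $F(0.58) = 0.58$, giving $D = 0.22$.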

CHAPTER 4

Multivariate Distributions
In many situations, the information about the distribution of a single random variable is not sufficient for the calculation of certain probabilities of interest. Suppose X, Y are two random variables whose individual distributions are known to us. However, this information does not suffice to calculate the probability that $X + Y \geq 1000$. In practice, such a situation might arise when X and Y denote the discharge of water from two factories into a sewage system, and where we want to know the probability that the total discharge exceeds a given amount. In order to calculate such probabilities, we need to know the joint distribution.

1. Descriptive Statistics for Multivariate Data

Following the outline of the previous sections, we start this chapter with a short overview of techniques of descriptive statistics, in this case for multivariate data. We will focus on the bivariate case, and we assume that we have data $(x_1, y_1), \ldots, (x_n, y_n)$, which may be regarded as realizations of $n$ independent random vectors $(X_1, Y_1), \ldots, (X_n, Y_n)$. Note that it is the pairs that are assumed to be independent of each other, not the two coordinates $X_i$ and $Y_i$ - quite to the contrary, these will usually not be independent. In a concrete application, it is important to reflect critically whether this assumption is reasonable or not. If we have measurements of the annual maximal water level at two different rivers, the data may safely be assumed to be independent from year to year. However, if the water levels are recorded by the hour, we will have dependence from one hour to the next. The latter case requires completely different techniques, which will be introduced later in the chapter on time series analysis.
The first step in any analysis of multivariate data should always be a separate analysis of the two coordinates $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$, respectively. This can be done using the techniques introduced in the two previous chapters, depending on whether the data are discrete or continuous. In this section, we will concentrate on techniques that go beyond the separate analysis of the coordinates. We will mostly focus on techniques for the analysis of the dependence of the two coordinates.
Scatterplot: The scatterplot is simply a graphical display of all the data points $(x_1, y_1), \ldots, (x_n, y_n)$ in a cartesian coordinate system. Every single observation $(x_i, y_i)$ is represented by some symbol, e.g. a dot or a star.
Covariance and Correlation Coefficient: The sample covariance $s_{x,y}$ and the sample correlation coefficient $r_{x,y}$ of the data $(x_1, y_1), \ldots, (x_n, y_n)$ are given by
the formulas
$$s_{x,y} := \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}), \qquad r_{x,y} := \frac{s_{x,y}}{\sqrt{s_x^2\, s_y^2}},$$
respectively. Here $s_x^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$ and $s_y^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2$ are the sample variances of the two coordinates. The sample covariance and the sample correlation coefficient are measures of the degree of dependence of the two coordinates. In order to get a better understanding of this, we calculate the least squares regression line, which is defined as the straight line $y = a + b\,x$ that minimizes the sum of squares of vertical distances of the data points from the line,
(6) $$\sum_{i=1}^n (y_i - a - b\,x_i)^2.$$
Some calculations show that this minimization problem is solved by
$$\hat{b} = r_{x,y}\,\frac{s_y}{s_x}, \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x}.$$
Moreover, the following formula holds for the minimum of (6):
$$\min_{a,b}\sum_{i=1}^n (y_i - a - b\,x_i)^2 = \sum_{i=1}^n (y_i - \hat{a} - \hat{b}\,x_i)^2 = (n-1)\,(1 - r_{x,y}^2)\,s_y^2.$$
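These formulas can be checked numerically. The sketch below (with a small set of hypothetical data points) computes $r_{x,y}$, $\hat{a}$ and $\hat{b}$ from the definitions and verifies the minimum formula with the $1/(n-1)$ convention for $s_y^2$.

```python
import math

def least_squares_line(xs, ys):
    """Sample correlation r_{x,y} and least-squares line y = a + b*x,
    using b = r * s_y / s_x and a = ybar - b * xbar, as derived above."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    sy2 = sum((y - ybar) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
    r = sxy / math.sqrt(sx2 * sy2)
    b = r * math.sqrt(sy2 / sx2)
    a = ybar - b * xbar
    return a, b, r, sy2

# Hypothetical data points (deliberately not on an exact line):
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 4.0, 7.0]
a, b, r, sy2 = least_squares_line(xs, ys)
# Minimal sum of squared residuals, compared with (n-1)(1 - r^2) s_y^2:
ssr = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
```

For these numbers the fitted line is $y = 0.9 + 1.9\,x$, and the residual sum of squares agrees with the identity above.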

From this identity, we may derive directly that $-1 \leq r_{x,y} \leq 1$ - this is also a consequence of the Cauchy-Schwarz inequality. Moreover, we see how the sample correlation coefficient $r_{x,y}$ measures the closeness of the data points to the regression line. In the most extreme cases, when $r_{x,y} = 1$ or $r_{x,y} = -1$, all the points lie on the regression line - the sign of the sample correlation coefficient equals the sign of the slope of the regression line. The other extreme occurs when $r_{x,y} = 0$. In this case the regression line is horizontal and there is no linear dependence between the two coordinates.
Frequency table: Consider discrete data, where the x- and the y-variable assume the values $a_1, \ldots, a_l$ and $b_1, \ldots, b_m$, respectively. In this case, we may display information about the frequencies $n_{jk}$ with which the possible values $(a_j, b_k)$ have occurred among the data $(x_1, y_1), \ldots, (x_n, y_n)$ in a two-dimensional frequency table:

       y:  b_1    b_2    ...  b_m   | row sum
  x: a_1   n_11   n_12   ...  n_1m  | n_1.
     a_2   n_21   n_22   ...  n_2m  | n_2.
     ...   ...    ...    ...  ...   | ...
     a_l   n_l1   n_l2   ...  n_lm  | n_l.
  ----------------------------------+--------
  col sum  n_.1   n_.2   ...  n_.m  | n

In the margins, we have displayed the row and column sums, i.e. $n_{j\cdot} = \sum_{k=1}^m n_{jk}$ and $n_{\cdot k} = \sum_{j=1}^l n_{jk}$. The margins give the frequency tables of the two coordinates separately. In the case of continuous distributions, we can produce a similar frequency table for grouped data.
Histogram: In analogy with the one-dimensional case, we can also display information about two-dimensional data in a histogram. To begin with, we group the data by partitioning the ranges of the x- and y-coordinates separately, i.e.
$$a_0 < a_1 < \ldots < a_l, \qquad b_0 < b_1 < \ldots < b_m.$$

The partitions must have the property that xi [a0 , al ] and yi [b0 , bm ] for any i. By njk we denote the number of data points (xi , yi ) that lie in the rectangle (aj1 , aj ] (bk1 , bk ], i.e. njk = #{1 i n : aj1 < xi aj , bk1 < xi bk }. The histogram is a step function that is constant on each of the basic rectangles (aj1 , aj ] (bk1 , bk ], where it takes the value njk . n (aj aj1 )(bk bk1 ) Note that when we have l groups for the x coordinate and m groups for the ycoordinate, we get in total l m groups. Thus, for a histogram to be meaningful, we need a large number of observations. 2. Stochastic Models Joint Distribution, Marginal Distribution. We consider two random variables X, Y , which we may also regard as a random vector, i.e. as a function (X, Y ) : R2 . that associates to each outcome the pair (X(), Y ()). Definition 4.1. The joint distribution of X and Y is the map that associates to each rectangle R R2 the probability P ((X, Y ) R), R P ((X, Y ) R) := P ({ : (X(), Y ()) R}. The distributions of X and Y are called marginal distributions. Joint Probability Function of Discrete Random Variables. When the random variables X, Y are discrete, their joint distribution may be represented by the joint probability function p(x, y) = pX,Y (x, y) := P (X = x, Y = y), where x X(), y Y (). The joint probability function completely characterizes the joint distribution, as for any subset A R2 P ((X, Y ) A) = p(x, y).
(x,y)A

Using the joint probability function, we may also calculate the marginal probability functions of X and Y separately. We get
$$p_X(x) = P(X = x) = \sum_y P(X = x, Y = y) = \sum_y p_{X,Y}(x,y),$$
where we have used the fact that the events $\{X = x, Y = y\}$, $y \in \mathbb{R}$, are a disjoint partition of the event $\{X = x\}$. Analogously, we get $p_Y(y) = \sum_x p_{X,Y}(x,y)$.

Example 4.2. (i) Let X and Y be the outcomes of the first and the second toss with an unbiased die. The two tosses are supposed to be independent. In this case, we have the joint probability function
$$p_{X,Y}(j,k) = P(X = j, Y = k) = \frac{1}{36}, \qquad 1 \leq j, k \leq 6.$$

(ii) Again we study two tosses with an unbiased die. We denote by X the score of the first toss and by Y the sum of the scores. The joint probability function $p(j,k) = P(X = j, Y = k)$ equals $\frac{1}{36}$ if $k - j \in \{1, \ldots, 6\}$ and 0 otherwise, as given in the table below (all entries in units of $\frac{1}{36}$):

  k:          2  3  4  5  6  7  8  9 10 11 12 | P(X=j)
  j = 1       1  1  1  1  1  1  0  0  0  0  0 | 1/6
  j = 2       0  1  1  1  1  1  1  0  0  0  0 | 1/6
  j = 3       0  0  1  1  1  1  1  1  0  0  0 | 1/6
  j = 4       0  0  0  1  1  1  1  1  1  0  0 | 1/6
  j = 5       0  0  0  0  1  1  1  1  1  1  0 | 1/6
  j = 6       0  0  0  0  0  1  1  1  1  1  1 | 1/6
  --------------------------------------------+-------
  36 P(Y=k):  1  2  3  4  5  6  5  4  3  2  1 | 1
In the margins of the table, we have given the row and column sums; they give the marginal probability functions of the two random variables.
(iii) We study a random experiment with three possible outcomes A, B, C and respective probabilities p, q, r. The experiment is carried out independently n times. We denote by $N_1$ ($N_2$, $N_3$) the number of experiments with outcome A (B, C). In this case, we get
$$P(N_1 = k, N_2 = l, N_3 = m) = \frac{n!}{k!\,l!\,m!}\, p^k q^l r^m.$$
More generally, suppose that we have experiments with k possible outcomes and respective probabilities $p_1, \ldots, p_k$. In this case, we get
$$P(N_1 = n_1, \ldots, N_k = n_k) = \frac{n!}{n_1! \cdots n_k!}\, p_1^{n_1} \cdots p_k^{n_k}.$$
This probability distribution is called the multinomial distribution. As a concrete example, we may calculate the probability that in 18 tosses with a fair die each score is obtained exactly 3 times. We get $P(N_1 = 3, \ldots, N_6 = 3) = \frac{18!}{(3!)^6}\, 6^{-18}$.
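The multinomial probability function is straightforward to evaluate directly; a minimal sketch in plain Python, using exact integer arithmetic for the multinomial coefficient:

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """P(N_1 = n_1, ..., N_k = n_k) = n!/(n_1! ... n_k!) p_1^{n_1} ... p_k^{n_k}."""
    n = sum(counts)
    coeff = factorial(n)
    for c in counts:
        coeff //= factorial(c)   # stays an integer at every step
    p = float(coeff)
    for c, q in zip(counts, probs):
        p *= q ** c
    return p

# 18 tosses of a fair die, each score exactly 3 times:
p = multinomial_pmf([3] * 6, [1 / 6] * 6)   # 18!/(3!)^6 * 6^{-18}
```

For the dice example this gives $P(N_1 = 3, \ldots, N_6 = 3) \approx 0.00135$, so the "perfectly balanced" outcome is quite rare.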

Joint Density Function. The most relevant examples of joint distributions are given by a joint density function, which we will introduce in this section.

Definition 4.3. The random variables X, Y have joint density function $f(x,y)$, if for all rectangles $A \subset \mathbb{R}^2$
(7) $$P((X,Y) \in A) = \iint_A f(x,y)\,dx\,dy.$$

Remark 4.4. (i) One can show that (7) holds for a much larger class of sets, namely for all sets that are measurable.
(ii) A joint density function has two characteristic properties:
$$f(x,y) \geq 0, \qquad \iint_{\mathbb{R}^2} f(x,y)\,dx\,dy = 1.$$

These two properties are completely analogous to the properties of a univariate density function.

Example 4.5. (i) The following function is a joint density function:
$$f(x,y) = e^{-(x+y)}\,1_{[0,\infty)\times[0,\infty)}(x,y).$$
In fact, we will see later that $f(x,y)$ is the joint density of two independent exponentially distributed random variables X, Y.
(ii) The random variables $X_1, X_2$ have a bivariate normal distribution with parameters $\mu_1, \mu_2 \in \mathbb{R}$, $\sigma_1^2, \sigma_2^2 > 0$, $\rho \in (-1,1)$, if their joint density function is given by
$$f(x_1,x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\,\exp\Big\{-\frac{1}{2(1-\rho^2)}\Big[\frac{(x_1-\mu_1)^2}{\sigma_1^2} - 2\rho\,\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2}\Big]\Big\}.$$
We will see below that $\iint_{\mathbb{R}^2} f(x_1,x_2)\,dx_1\,dx_2 = 1$, i.e. that $f(x_1,x_2)$ is indeed a joint probability density. It is much easier to represent the bivariate normal density in vector-matrix notation; we have
$$f(x_1,x_2) = \frac{1}{2\pi\sqrt{\det(\Sigma)}}\,\exp\Big(-\frac{1}{2}(x-\mu)^t\,\Sigma^{-1}\,(x-\mu)\Big),$$
where $x = (x_1,x_2)^t$, $\mu = (\mu_1,\mu_2)^t$ and
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}.$$
The matrix $\Sigma$ is called the covariance matrix of the random vector $(X_1,X_2)$. The bivariate normal distribution is a very popular model for dependent observations that individually have a normal distribution. Examples are any two measurements taken on the same individual, such as height and arm length. Major practical advantages of bivariate normal distributions are the facts that they are characterized by a small number of parameters and that linear combinations are themselves normally distributed. In this way, one only has to calculate means and covariances when

dealing with the joint normal density.
(iii) If $f(x)$, $g(y)$ are univariate densities, we get a bivariate density $h(x,y)$ by defining $h(x,y) = f(x)\,g(y)$. We will see below that a joint density has this structure if and only if the corresponding random variables are independent.

Marginal Density.

Theorem 4.6 (Calculation of the Marginal Density from the Joint Density). Let X, Y be random variables with joint density $f(x,y)$. Then X has the (marginal) density
$$f_X(x) = \int_{-\infty}^{\infty} f(x,y)\,dy.$$

Proof. By definition of the joint density, we get for an arbitrary interval $[a,b] \subset \mathbb{R}$
$$P(a \leq X \leq b) = P(a \leq X \leq b,\, -\infty < Y < \infty) = \iint_{[a,b]\times(-\infty,\infty)} f(x,y)\,dx\,dy = \int_a^b \Big(\int_{-\infty}^{\infty} f(x,y)\,dy\Big)\,dx.$$
Hence $\int_{-\infty}^{\infty} f(x,y)\,dy$ is the density function of X. $\square$
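Returning to Example 4.5(ii): the componentwise form of the bivariate normal density and the vector-matrix form can be checked against each other numerically. The sketch below writes out the inverse of the $2\times 2$ matrix $\Sigma$ by hand; the parameter values are arbitrary.

```python
import math

def bvn_density_componentwise(x1, x2, mu1, mu2, s1, s2, rho):
    """Bivariate normal density with the exponent written out explicitly."""
    z = ((x1 - mu1) ** 2 / s1 ** 2
         - 2 * rho * (x1 - mu1) * (x2 - mu2) / (s1 * s2)
         + (x2 - mu2) ** 2 / s2 ** 2)
    norm = 2 * math.pi * s1 * s2 * math.sqrt(1 - rho ** 2)
    return math.exp(-z / (2 * (1 - rho ** 2))) / norm

def bvn_density_matrix(x1, x2, mu1, mu2, s1, s2, rho):
    """The same density in vector-matrix form, 2x2 inverse written out."""
    det = s1 ** 2 * s2 ** 2 * (1 - rho ** 2)             # det(Sigma)
    # inverse of [[s1^2, rho*s1*s2], [rho*s1*s2, s2^2]]:
    a, b, d = s2 ** 2 / det, -rho * s1 * s2 / det, s1 ** 2 / det
    dx, dy = x1 - mu1, x2 - mu2
    quad = a * dx * dx + 2 * b * dx * dy + d * dy * dy   # (x-mu)^t Sigma^{-1} (x-mu)
    return math.exp(-quad / 2) / (2 * math.pi * math.sqrt(det))

# Evaluate both forms at an arbitrary point:
p1 = bvn_density_componentwise(0.5, -1.0, 0.0, 0.0, 1.0, 2.0, 0.3)
p2 = bvn_density_matrix(0.5, -1.0, 0.0, 0.0, 1.0, 2.0, 0.3)
```

Both evaluations agree, which is exactly the algebraic identity behind the vector-matrix representation.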

Example 4.7. (i) If X, Y have the joint density $f(x,y) = e^{-(x+y)}\,1_{[0,\infty)\times[0,\infty)}(x,y)$, we find that X has the marginal density
$$f_X(x) = \int_{-\infty}^{\infty} e^{-(x+y)}\,1_{[0,\infty)\times[0,\infty)}(x,y)\,dy = \int_0^{\infty} e^{-y}\,dy\; e^{-x}\,1_{[0,\infty)}(x) = e^{-x}\,1_{[0,\infty)}(x);$$
i.e. X has an exponential distribution with parameter 1.
(ii) Let $X_1, X_2$ have a bivariate normal distribution with parameters $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho$. Then, the marginal density of $X_1$ is given by
$$f_{X_1}(x_1) = \int_{-\infty}^{\infty} \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\,\exp\Big\{-\frac{1}{2(1-\rho^2)}\Big[\frac{(x_1-\mu_1)^2}{\sigma_1^2} - 2\rho\,\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2}\Big]\Big\}\,dx_2.$$
In order to calculate the above integral, we first consider the exponent, which we can rewrite as follows:
$$-\frac{1}{2(1-\rho^2)}\Big\{\frac{(x_1-\mu_1)^2}{\sigma_1^2} - 2\rho\,\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2}\Big\}$$
$$= -\frac{1}{2(1-\rho^2)\sigma_2^2}\Big\{\frac{\sigma_2^2}{\sigma_1^2}(x_1-\mu_1)^2 - 2\rho\,\frac{\sigma_2}{\sigma_1}(x_1-\mu_1)(x_2-\mu_2) + (x_2-\mu_2)^2\Big\}$$
$$= -\frac{1}{2(1-\rho^2)\sigma_2^2}\Big\{\Big((x_2-\mu_2) - \rho\,\frac{\sigma_2}{\sigma_1}(x_1-\mu_1)\Big)^2 + (1-\rho^2)\frac{\sigma_2^2}{\sigma_1^2}(x_1-\mu_1)^2\Big\}$$
$$= -\frac{1}{2(1-\rho^2)\sigma_2^2}\Big((x_2-\mu_2) - \rho\,\frac{\sigma_2}{\sigma_1}(x_1-\mu_1)\Big)^2 - \frac{(x_1-\mu_1)^2}{2\sigma_1^2}.$$

Thus we obtain
$$f_{X_1}(x_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\,e^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma_2\sqrt{1-\rho^2}}\,e^{-\frac{1}{2(1-\rho^2)\sigma_2^2}\big((x_2-\mu_2) - \rho\frac{\sigma_2}{\sigma_1}(x_1-\mu_1)\big)^2}\,dx_2 = \frac{1}{\sqrt{2\pi}\,\sigma_1}\,e^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}}.$$
In the last step we have used the fact that we are integrating a normal density with mean $\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x_1-\mu_1)$ and variance $(1-\rho^2)\sigma_2^2$ - hence the integral equals 1. We have thus proved that $X_1$ has a normal distribution with parameters $\mu_1, \sigma_1^2$.

Transformation Formula for Joint Densities.

Theorem 4.8 (Transformation Formula for n-Variate Densities). Let $X = (X_1, \ldots, X_n)$ be a random vector with joint density $f_X(x)$ and let $u : \mathbb{R}^n \to \mathbb{R}^n$ be an invertible differentiable map with differentiable inverse $u^{-1} : \mathbb{R}^n \to \mathbb{R}^n$. We define new random variables $Y_1, \ldots, Y_n$ by $Y = (Y_1, \ldots, Y_n) = u(X_1, \ldots, X_n)$. Then Y has the joint density
$$f_Y(y_1, \ldots, y_n) = f_X\big(u^{-1}(y_1, \ldots, y_n)\big)\,\big|\det\big(J_{u^{-1}}(y_1, \ldots, y_n)\big)\big|.$$
In the special case when u is linear, i.e. when $u(x) = A\,x + b$ where A is an invertible $n \times n$ matrix and $b \in \mathbb{R}^n$, we obtain
$$f_{AX+b}(y) = \frac{1}{|\det(A)|}\,f_X\big(A^{-1}(y - b)\big).$$

Example 4.9. Let $X_1, X_2$ have a bivariate normal distribution with parameters $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho$. We define new random variables $Y_1, Y_2$ by $Y = A\,X + b$, where $X = (X_1,X_2)^t$, $Y = (Y_1,Y_2)^t$ and where $A \in \mathbb{R}^{2\times 2}$ is an invertible matrix, and $b \in \mathbb{R}^2$. Using the above transformation formula, we obtain that $(Y_1, Y_2)$ has the joint density
$$f_{Y_1,Y_2}(y_1,y_2) = \frac{1}{|\det(A)|}\,f_{X_1,X_2}\big(A^{-1}(y - b)\big) = \frac{1}{2\pi\sqrt{\det(A\Sigma A^t)}}\,e^{-\frac{1}{2}(y - (b + A\mu))^t (A\Sigma A^t)^{-1} (y - (b + A\mu))}.$$
We may conclude that $(Y_1, Y_2)$ has again a bivariate normal distribution, with vector of expected values $b + A\mu$ and covariance matrix $A\Sigma A^t$. Finally, we consider the random variable $Y_1$ separately; note that $Y_1 = a_{11}X_1 + a_{12}X_2 + b_1$. We have shown above that the marginal distribution of a jointly normal vector is a normal distribution, whose parameters can be obtained from the parameters of the joint distribution. In this case we get that $Y_1$ has a normal distribution with expected value $E(Y_1) = \mu_{Y_1} = b_1 + a_{11}\mu_1 + a_{12}\mu_2$. The variance of $Y_1$ is the upper

left entry of the covariance matrix of $(Y_1, Y_2)$, which is given by $A\,\Sigma\,A^t$. Hence we find
$$\mathrm{Var}(Y_1) = \sigma_{Y_1}^2 = a_{11}^2\sigma_1^2 + a_{12}^2\sigma_2^2 + 2a_{11}a_{12}\,\rho\,\sigma_1\sigma_2,$$
thus proving the formula stated earlier.

Expected Value, Covariance and Correlation Coefficient. Let X, Y be two random variables, and $u : \mathbb{R}^2 \to \mathbb{R}$ some function. In order to calculate the expected value $E(u(X,Y))$, we have two possibilities.
(1) We can calculate the distribution of $Z := u(X,Y)$ and then compute $E(u(X,Y)) = E(Z)$ using the definition of the expected value. I.e., we get $\sum_z z\,p(z)$ and $\int z\,f(z)\,dz$, respectively, where $p(z)$ and $f(z)$ denote the probability function, respectively the probability density, of Z.
(2) We can apply one of the following transformation formulas:
$$E(u(X,Y)) = \sum_{x \in X(\Omega),\, y \in Y(\Omega)} u(x,y)\,p_{X,Y}(x,y), \qquad E(u(X,Y)) = \iint u(x,y)\,f_{X,Y}(x,y)\,dx\,dy.$$

In general, the computation of $E(u(X,Y))$ is greatly simplified by the use of these formulas.

Example 4.10. If $X_1$ and $X_2$ have a bivariate normal distribution, we get
$$E(X_1 X_2) = \iint x_1 x_2\,f_{X_1,X_2}(x_1,x_2)\,dx_1\,dx_2 = \mu_1\mu_2 + \rho\,\sigma_1\sigma_2.$$
The last identity is not as obvious as we present it here - there are still a few lines of computation necessary to verify it.

Definition 4.11. Let X, Y be two random variables. We define their covariance and their correlation coefficient as follows:
$$\mathrm{Cov}(X,Y) := E\big((X - E(X))(Y - E(Y))\big), \qquad \rho_{X,Y} := \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}.$$
The matrix
$$\Sigma := \begin{pmatrix} \mathrm{Var}(X) & \mathrm{Cov}(X,Y) \\ \mathrm{Cov}(X,Y) & \mathrm{Var}(Y) \end{pmatrix}$$
is called the covariance matrix of X, Y. The random variables are called uncorrelated when $\mathrm{Cov}(X,Y) = 0$.
The covariance and the correlation coefficient are important numerical characteristics of the joint distribution that measure the degree of linear dependence of the random variables X and Y. This is not obvious from the definitions, but will become clearer later on.

Theorem 4.12. (i) Let X, Y be random variables; then we have
$$\mathrm{Cov}(X,Y) = E(X\,Y) - E(X)\,E(Y), \qquad |\rho_{X,Y}| \leq 1, \qquad \mathrm{Var}(X) = \mathrm{Cov}(X,X).$$
(ii) Independent random variables are uncorrelated.
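The identity $\mathrm{Var}(Y_1) = (A\Sigma A^t)_{11} = a_{11}^2\sigma_1^2 + a_{12}^2\sigma_2^2 + 2a_{11}a_{12}\rho\sigma_1\sigma_2$ from Example 4.9 can be verified numerically; a sketch with arbitrary hypothetical numbers, writing out the $2\times 2$ matrix products by hand:

```python
def transformed_variance(A, s1, s2, rho):
    """Upper-left entry of A Sigma A^t for the bivariate normal covariance
    matrix Sigma, compared with the explicit formula for Var(Y1)."""
    (a11, a12), (a21, a22) = A
    cov = rho * s1 * s2
    # first row of A Sigma:
    r1 = (a11 * s1 ** 2 + a12 * cov, a11 * cov + a12 * s2 ** 2)
    var_y1_matrix = r1[0] * a11 + r1[1] * a12          # (A Sigma A^t)[0,0]
    var_y1_formula = (a11 ** 2 * s1 ** 2 + a12 ** 2 * s2 ** 2
                      + 2 * a11 * a12 * rho * s1 * s2)
    return var_y1_matrix, var_y1_formula

m, f = transformed_variance([(2.0, -1.0), (0.5, 3.0)], 1.0, 2.0, 0.4)
```

For these numbers both expressions equal 4.8, as they must for any choice of A.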

Example 4.13. Let $X_1, X_2$ have a bivariate normal distribution with parameters $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$. We have seen above that $E(X_1 X_2) = \mu_1\mu_2 + \rho\,\sigma_1\sigma_2$ and thus we get $\mathrm{Cov}(X_1,X_2) = \rho\,\sigma_1\sigma_2$. Hence $\rho$ is the correlation coefficient of $X_1, X_2$. Moreover, we see that the matrix $\Sigma$, which arises in the definition of the bivariate normal density, is exactly the covariance matrix.
The correlation coefficient of two random variables has an important interpretation in connection with linear prediction. Given two random variables, X and Y, we would like to predict Y by some linear function of X. Among all linear predictors, we will choose the one that minimizes $E(Y - (a + bX))^2$. Note that $Y - (a + bX)$ is the error that we make when we predict Y by $a + b\,X$. If we interpret $(Y - (a + bX))^2$ as the loss that we incur as a result of the prediction error, we are thus trying to minimize the expected loss.

Theorem 4.14. The function $f(a,b) := E(Y - (a + bX))^2$ has a unique minimum in
$$\hat{b} = \rho_{X,Y}\,\sqrt{\frac{\mathrm{Var}(Y)}{\mathrm{Var}(X)}}, \qquad \hat{a} = E(Y) - \hat{b}\,E(X).$$
Moreover,
$$E\big(Y - (\hat{a} + \hat{b}\,X)\big)^2 = (1 - \rho_{X,Y}^2)\,\mathrm{Var}(Y).$$

Theorem 4.15. Let $X_1, \ldots, X_n$ be random variables. Then we get
$$\mathrm{Var}\Big(\sum_{i=1}^n X_i\Big) = \sum_{i=1}^n \mathrm{Var}(X_i) + 2 \sum_{1 \leq i < j \leq n} \mathrm{Cov}(X_i, X_j).$$
In case the random variables are pairwise uncorrelated, we obtain
$$\mathrm{Var}\Big(\sum_{i=1}^n X_i\Big) = \sum_{i=1}^n \mathrm{Var}(X_i).$$

Proof. Given the random variables X, Y, we get
$$\mathrm{Var}(X + Y) = E\big(X + Y - E(X+Y)\big)^2 = E\big((X - E(X))^2 + 2(X - E(X))(Y - E(Y)) + (Y - E(Y))^2\big)$$
$$= E(X - E(X))^2 + 2E\big((X - E(X))(Y - E(Y))\big) + E(Y - E(Y))^2 = \mathrm{Var}(X) + 2\,\mathrm{Cov}(X,Y) + \mathrm{Var}(Y).$$
This is the formula for the sum of two random variables; similarly one can prove the general case. $\square$
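Theorem 4.15 can be illustrated with a small discrete joint distribution: computing $\mathrm{Var}(X+Y)$ directly and via $\mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X,Y)$ gives the same value. The joint probability function below is a hypothetical example, chosen so that X and Y are dependent.

```python
def moments(joint):
    """joint: dict {(x, y): p}.  Returns Var(X), Var(Y), Cov(X, Y), Var(X+Y)
    computed directly from the joint probability function."""
    ex = sum(p * x for (x, y), p in joint.items())
    ey = sum(p * y for (x, y), p in joint.items())
    vx = sum(p * (x - ex) ** 2 for (x, y), p in joint.items())
    vy = sum(p * (y - ey) ** 2 for (x, y), p in joint.items())
    cov = sum(p * (x - ex) * (y - ey) for (x, y), p in joint.items())
    es = ex + ey
    vs = sum(p * (x + y - es) ** 2 for (x, y), p in joint.items())
    return vx, vy, cov, vs

# A small dependent joint distribution (hypothetical numbers):
joint = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.10, (1, 1): 0.40}
vx, vy, cov, vs = moments(joint)
```

Here the covariance is positive (large X tends to go with large Y), so $\mathrm{Var}(X+Y)$ exceeds $\mathrm{Var}(X) + \mathrm{Var}(Y)$.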

Independent Random Variables.

Definition 4.16. The random variables X and Y are called (stochastically) independent, if
$$P(X \in A,\, Y \in B) = P(X \in A)\,P(Y \in B)$$
holds for all intervals $A, B \subset \mathbb{R}$.

The random variables X, Y are independent if and only if the events $\{X \in A\}$, $\{Y \in B\}$ are independent for arbitrary intervals $A, B \subset \mathbb{R}$. The following theorem shows that independence of random variables may be detected from the form of the joint probability function in the discrete case, and the joint density function in the continuous case.

Theorem 4.17. Let X, Y be random variables with joint probability density $f_{X,Y}(x,y)$ or joint probability function $p_{X,Y}(x,y)$, respectively. Then, X and Y are independent if and only if
(8) $$f_{X,Y}(x,y) = f_X(x)\,f_Y(y),$$
respectively $p_{X,Y}(x,y) = p_X(x)\,p_Y(y)$ in the discrete case.

Proof. We restrict ourselves to the continuous case, i.e. when the random variables have a density. Moreover, we only show one direction, namely that (8) implies that the random variables are independent. Given the intervals $A, B \subset \mathbb{R}$, we get from (8)
$$P(X \in A,\, Y \in B) = P\big((X,Y) \in A \times B\big) = \iint_{A \times B} f_{X,Y}(x,y)\,dx\,dy = \int_A \int_B f_X(x)\,f_Y(y)\,dy\,dx = \int_A f_X(x)\,dx \int_B f_Y(y)\,dy$$
$= P(X \in A)\,P(Y \in B)$.
This shows that X and Y are independent random variables. $\square$

Example 4.18. Bivariate normally distributed random variables $X_1, X_2$ are independent if and only if $\rho = 0$. In that case, the joint density is given by
$$f(x_1,x_2) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\,e^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_2}\,e^{-\frac{(x_2-\mu_2)^2}{2\sigma_2^2}}.$$
Hence, both random variables $X_1, X_2$ are normally distributed with parameters $(\mu_1, \sigma_1^2)$ and $(\mu_2, \sigma_2^2)$, respectively. This holds even in the case when $\rho \neq 0$, as we have seen above.

Theorem 4.19. If X, Y are independent random variables, we have $E(XY) = E(X)\,E(Y)$. Moreover, independent random variables are uncorrelated.

Proof. We present the proof for the case of random variables that have a joint density. By the transformation formula for expected values, we get
$$E(X\,Y) = \iint x\,y\,f(x,y)\,dx\,dy = \iint x\,y\,f_X(x)\,f_Y(y)\,dx\,dy = \int x\,f_X(x)\,dx \int y\,f_Y(y)\,dy = E(X)\,E(Y),$$
which proves the first assertion. The second statement follows directly from the above formula for the covariance, i.e. from $\mathrm{Cov}(X,Y) = E(X\,Y) - E(X)\,E(Y)$. $\square$

In the rest of this section, we will study two closely related topics that frequently arise in connection with random variables X, Y:
- Given a function $g(x,y)$ and an interval $[a,b] \subset \mathbb{R}$, compute the probability $P(a \leq g(X,Y) \leq b)$.
- Given a function $g(x,y)$, compute the distribution function and the density of $g(X,Y)$.
There is a variety of techniques that can be used to solve these problems. Before going into detail, we will first give an overview of the techniques.
(1) Direct calculation of the probability $P(a \leq g(X,Y) \leq b)$ using the definition, i.e.
$$P(a \leq g(X,Y) \leq b) = \iint_G f(x,y)\,dx\,dy,$$
where $G = \{(x,y) : a \leq g(x,y) \leq b\}$. As a special case, we may then compute $P(g(X,Y) \leq a)$ for all $a \in \mathbb{R}$ and thus obtain the distribution function of $g(X,Y)$. Subsequently, we may calculate the density by taking the derivative with respect to a.
(2) Application of the convolution formula in order to compute the density of a sum of two independent random variables X, Y. The convolution formula states that the density of $X + Y$ is given by
$$f_{X+Y}(z) = \int_{-\infty}^{\infty} f_X(x)\,f_Y(z-x)\,dx,$$
where $f_X$ and $f_Y$ are the density functions of X and Y, respectively.
(3) When we want to calculate the distribution of a sum of a large number of independent random variables, we may apply normal approximation. Roughly speaking, the central limit theorem tells us that the sum of a large number of individually small random variables is approximately normally distributed.
(4) If X, Y have a bivariate normal distribution with parameters $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho$ and if $g(x,y) = \alpha x + \beta y + b$ is a linear function, then $g(X,Y)$ has a normal distribution with parameters $(\mu, \sigma^2)$, where
$$\mu = \alpha\mu_1 + \beta\mu_2 + b, \qquad \sigma^2 = \alpha^2\sigma_1^2 + \beta^2\sigma_2^2 + 2\alpha\beta\,\rho\,\sigma_1\sigma_2.$$
(5) For the calculation of the joint distribution of the pair $g_1(X,Y), g_2(X,Y)$, where $u(x,y) = (g_1(x,y), g_2(x,y))$ is an invertible differentiable map, we may use the transformation formula presented above.
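Technique (2) can be illustrated numerically: for two independent $\mathrm{Exp}(\lambda)$ variables the convolution integral should reproduce the Gamma density with parameters $r = 2$ and $\lambda$ (cf. Example 4.22(iv) below). A sketch using a simple midpoint rule:

```python
import math

def convolution_density(f, g, z, steps=20000):
    """Numerical convolution integral int_0^z f(x) g(z - x) dx (midpoint
    rule), for densities supported on [0, infinity)."""
    h = z / steps
    return sum(f((i + 0.5) * h) * g(z - (i + 0.5) * h) for i in range(steps)) * h

lam = 1.5
f_exp = lambda x: lam * math.exp(-lam * x)   # Exp(lam) density

z = 2.0
numeric = convolution_density(f_exp, f_exp, z)
exact = lam ** 2 * z * math.exp(-lam * z)    # Gamma(r=2, lam) density at z
```

For two exponentials with the same rate the integrand $f(x)g(z-x) = \lambda^2 e^{-\lambda z}$ is actually constant in $x$, so the midpoint rule is exact up to rounding; with unequal rates the same routine still gives a close numerical approximation.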

Example 4.20. (i) Let X, Y be random variables with the joint density function $f(x,y) = e^{-(x+y)}\,1_{[0,\infty)\times[0,\infty)}(x,y)$. We want to calculate the probability that $X + Y \leq 2$. Note that this corresponds to the probability that the vector $(X,Y)$ takes values in the set $G := \{(x,y) : x + y \leq 2\}$. Thus we get
$$P(X + Y \leq 2) = \iint_G e^{-x-y}\,1_{[0,\infty)\times[0,\infty)}(x,y)\,dx\,dy = \int_0^2 \Big(\int_0^{2-x} e^{-x-y}\,dy\Big)\,dx$$
$$= \int_0^2 e^{-x}\big(1 - e^{-(2-x)}\big)\,dx = \int_0^2 \big(e^{-x} - e^{-2}\big)\,dx = 1 - e^{-2} - 2e^{-2} = 1 - 3e^{-2}.$$
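The value $1 - 3e^{-2} \approx 0.594$ can be confirmed numerically by keeping the inner integral in closed form and applying a midpoint rule to the outer integral:

```python
import math

def prob_sum_le(t, steps=4000):
    """P(X + Y <= t) for independent Exp(1) variables, by integrating the
    joint density e^{-(x+y)} over the triangle {x, y >= 0, x + y <= t}.
    The inner integral is closed-form: int_0^{t-x} e^{-y} dy = 1 - e^{-(t-x)}."""
    h = t / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += math.exp(-x) * (1 - math.exp(-(t - x))) * h
    return total

p = prob_sum_le(2.0)
exact = 1 - 3 * math.exp(-2)   # = 1 - e^{-2} - 2e^{-2}, as computed above
```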

(ii) (After Plate, Example 7.12) The section modulus of a beam with rectangular cross-section is given by $w = b\,h^2/6$, where b is the width and h the height of the beam. Due to variations in the production process, both the width and the height of the delivered beams fluctuate. We assume that b, h are realizations of independent random variables B, H with density functions $f_B(b)$ and $f_H(h)$, respectively. We now want to compute the distribution of the section modulus W. We have $B\,H^2/6 \leq w$ if and only if $(B,H)$ lies in the part of the first quadrant bounded by the h-axis and the curve $b = \frac{6w}{h^2}$. Hence, substituting $c = b\,h^2/6$ in the inner integral,
$$P(B\,H^2/6 \leq w) = \int_0^{\infty} \int_0^{6w/h^2} f_B(b)\,f_H(h)\,db\,dh = \int_0^w \int_0^{\infty} f_B\Big(\frac{6c}{h^2}\Big)\,\frac{6}{h^2}\,f_H(h)\,dh\,dc.$$
Thus we have determined the distribution function of the section modulus. We find the density by differentiating with respect to w; by the fundamental theorem of calculus, the derivative is
$$f_W(w) = \int_0^{\infty} f_B\Big(\frac{6w}{h^2}\Big)\,\frac{6}{h^2}\,f_H(h)\,dh.$$

Sums of Independent Random Variables. There are formulae for the calculation of the probability function and the probability density of a sum of independent random variables.

Theorem 4.21. (i) Let X, Y be independent random variables with densities $f(x)$ and $g(y)$, respectively. Then, their sum $X + Y$ has the density
$$h(x) = \int_{-\infty}^{\infty} f(y)\,g(x - y)\,dy.$$

(ii) Let X, Y be independent $\mathbb{N}$-valued random variables with probability functions $p(i)$ and $q(j)$, respectively. Then, their sum $X + Y$ has the probability function
$$r(k) = \sum_{i=0}^{k} p(i)\,q(k - i).$$
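As a check of the discrete convolution formula, convolving two Poisson probability functions reproduces the probability function of a Poisson distribution with the summed parameter (cf. Example 4.22(ii) below):

```python
import math

def poisson_pmf(lam, k):
    """Poisson probability function P(X = k) = e^{-lam} lam^k / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def convolve_pmf(p, q, k):
    """r(k) = sum_{i=0}^{k} p(i) q(k - i)."""
    return sum(p(i) * q(k - i) for i in range(k + 1))

lam, mu, k = 2.0, 3.0, 4
r_k = convolve_pmf(lambda i: poisson_pmf(lam, i),
                   lambda j: poisson_pmf(mu, j), k)
direct = poisson_pmf(lam + mu, k)
```

Both values agree, illustrating that the sum of independent Poisson($\lambda$) and Poisson($\mu$) variables is Poisson($\lambda+\mu$).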

Example 4.22. (i) Let X, Y be independent binomially distributed random variables with parameters $(n,p)$ and $(m,p)$, respectively. Then $X + Y$ is binomially distributed with parameters $(n+m, p)$.
(ii) Let X, Y be independent Poisson-distributed random variables with parameters $\lambda$ and $\mu$, respectively. Then $X + Y$ has a Poisson distribution with parameter $\lambda + \mu$.
(iii) Let X, Y be independent normally distributed random variables with parameters $(\mu_1, \sigma_1^2)$ and $(\mu_2, \sigma_2^2)$, respectively. Then $X + Y$ is normally distributed with parameters $(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.
(iv) Let X, Y be independent exponentially distributed random variables with parameter $\lambda$. Then $X + Y$ is Gamma-distributed with parameters $r = 2$ and $\lambda$.
(v) Let X, Y be independent Gamma-distributed random variables with parameters $(r, \lambda)$ and $(s, \lambda)$, respectively. Then $X + Y$ is Gamma-distributed with parameters $(r + s, \lambda)$.
(vi) Let X, Y be independent $\chi^2$-distributed random variables with degrees of freedom f and g, respectively. Then $X + Y$ has a $\chi^2_{f+g}$-distribution.

Asymptotic Distribution of Sums of Independent Random Variables. Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed random variables (i.e. results of n experiments carried out independently and under identical circumstances) with expected value $\mu$ and variance $\sigma^2$. Then
$$E\Big(\frac{1}{n}\sum_{i=1}^n X_i\Big) = \mu, \qquad \mathrm{Var}\Big(\frac{1}{n}\sum_{i=1}^n X_i\Big) = \frac{\sigma^2}{n}.$$
I.e., the arithmetic mean of n independent and identically distributed random variables has the same expected value as each of the random variables and a variance that is smaller by a factor of n.

Theorem 4.23 (Law of Large Numbers). Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed random variables with $E(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$. Then
$$P\Big(\Big|\frac{1}{n}\sum_{i=1}^n X_i - \mu\Big| \geq \epsilon\Big) \leq \frac{\sigma^2}{n\,\epsilon^2}.$$
Thus we get for any $\epsilon > 0$
$$\lim_{n \to \infty} P\Big(\Big|\frac{1}{n}\sum_{i=1}^n X_i - \mu\Big| \geq \epsilon\Big) = 0.$$

Proof. The proof follows directly from Chebychev's inequality together with the above identities. $\square$

Theorem 4.24 (Central Limit Theorem). Let $X_1, X_2, \ldots$ be independent and identically distributed random variables with expected value $\mu = E(X_1)$ and variance

$\sigma^2 = \mathrm{Var}(X_1)$. We define the sums $S_n := \sum_{k=1}^n X_k$. Then we get
$$P\Big(a \leq \frac{S_n - n\mu}{\sqrt{n\,\sigma^2}} \leq b\Big) \to \int_a^b \frac{1}{\sqrt{2\pi}}\,e^{-\frac{x^2}{2}}\,dx = \Phi(b) - \Phi(a),$$
as $n \to \infty$.

Usually, we are not so much interested in the distribution of $\frac{S_n - n\mu}{\sqrt{n\sigma^2}}$, but in probabilities of the type $P(a \leq S_n \leq b)$. These can be approximated by a simple linear transformation and a subsequent application of the central limit theorem:
$$P(a \leq S_n \leq b) = P\Big(\frac{a - n\mu}{\sqrt{n\sigma^2}} \leq \frac{S_n - n\mu}{\sqrt{n\sigma^2}} \leq \frac{b - n\mu}{\sqrt{n\sigma^2}}\Big) \approx \Phi\Big(\frac{b - n\mu}{\sqrt{n\sigma^2}}\Big) - \Phi\Big(\frac{a - n\mu}{\sqrt{n\sigma^2}}\Big).$$

Example 4.25. We toss 100 fair dice independently. What is the probability that we obtain a total score between 330 and 370? We denote by $X_k$ the score of the k-th toss and by $S_n = \sum_{k=1}^n X_k$ the total score. Thus we want to compute $P(330 \leq S_n \leq 370)$. Note that $\mu = E(X_1) = 3.5$, $\sigma^2 = \mathrm{Var}(X_1) = \frac{35}{12}$ and hence
$$P(330 \leq S_n \leq 370) = P\Big(\frac{330 - 350}{\sqrt{100 \cdot \frac{35}{12}}} \leq \frac{S_n - n\mu}{\sqrt{n\sigma^2}} \leq \frac{370 - 350}{\sqrt{100 \cdot \frac{35}{12}}}\Big).$$
Since $\frac{370 - 350}{\sqrt{100 \cdot 35/12}} = 1.17$ and $\frac{330 - 350}{\sqrt{100 \cdot 35/12}} = -1.17$, we can find in tables of the standard normal probability that
$$P(330 \leq S_n \leq 370) \approx \Phi(1.17) - \Phi(-1.17) = 0.879 - 0.121 = 0.758.$$
For integer-valued random variables there is an improvement of the normal approximation via the so-called continuity correction. We get
$$P(k \leq S_n \leq l) = P\Big(k - \tfrac{1}{2} \leq S_n \leq l + \tfrac{1}{2}\Big) = P\Big(\frac{k - \frac{1}{2} - n\mu}{\sqrt{n\sigma^2}} \leq \frac{S_n - n\mu}{\sqrt{n\sigma^2}} \leq \frac{l + \frac{1}{2} - n\mu}{\sqrt{n\sigma^2}}\Big) \approx \Phi\Big(\frac{l + \frac{1}{2} - n\mu}{\sqrt{n\sigma^2}}\Big) - \Phi\Big(\frac{k - \frac{1}{2} - n\mu}{\sqrt{n\sigma^2}}\Big).$$
For the example above this yields the improved approximation
$$P(330 \leq S_n \leq 370) \approx \Phi(1.20) - \Phi(-1.20) = 0.885 - 0.115 = 0.770.$$
For small values of n, continuity correction provides a considerable improvement of the accuracy of a normal approximation.
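The dice example can also be checked by simulation; a sketch using Python's standard random module. The estimate should be close to the approximations 0.758 and 0.770 computed above.

```python
import random

rng = random.Random(0)   # fixed seed for reproducibility

def total_score(n_dice):
    """Total score of n_dice independent tosses of a fair die."""
    return sum(rng.randint(1, 6) for _ in range(n_dice))

trials = 20000
hits = sum(1 for _ in range(trials) if 330 <= total_score(100) <= 370)
estimate = hits / trials
```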

Distribution of Maxima and Minima of Random Variables. Let $X_1, \ldots, X_n$ be independent identically distributed random variables with distribution function $F(x) = P(X_i \leq x)$. We want to calculate the distribution of the maximum of these random variables, i.e. of $V = \max(X_1, \ldots, X_n)$. The distribution function of V is given by
$$F_V(x) = P(V \leq x) = P(X_1 \leq x, \ldots, X_n \leq x) = P(X_1 \leq x) \cdots P(X_n \leq x) = (F(x))^n.$$
V has a density $f_V(x)$, provided the individual random variables $X_i$ have a density $f(x)$. In that case, we obtain
$$f_V(x) = \frac{d}{dx} F_V(x) = \frac{d}{dx} (F(x))^n = n\,F^{n-1}(x)\,F'(x) = n\,f(x)\,F^{n-1}(x).$$

Theorem 4.26. Let $X_1, \ldots, X_n$ be independent identically distributed random variables with distribution function $F(x)$ and density function $f(x)$. Then, $V = \max(X_1, \ldots, X_n)$ has the distribution function $F_V(x) = (F(x))^n$ and the probability density $f_V(x) = n\,f(x)\,F(x)^{n-1}$.

In the same way, we may calculate the distribution of the minimum of the random variables, i.e. of $U = \min(X_1, \ldots, X_n)$. We obtain
$$P(U > x) = P(X_1 > x, \ldots, X_n > x) = P(X_1 > x) \cdots P(X_n > x) = (1 - F(x))^n$$
and thus
$$P(U \leq x) = 1 - P(U > x) = 1 - (1 - F(x))^n.$$
Again, U has a density function, provided this holds for $X_1$. We obtain
$$f_U(x) = \frac{d}{dx}\big(1 - (1 - F(x))^n\big) = n\,(1 - F(x))^{n-1}\,F'(x) = n\,f(x)\,(1 - F(x))^{n-1}.$$

Theorem 4.27. Let $X_1, \ldots, X_n$ be independent and identically distributed random variables with distribution function $F(x)$ and density $f(x)$. Then, $U = \min(X_1, \ldots, X_n)$ has the distribution function $F_U(x) = 1 - (1 - F(x))^n$ and the density $f_U(x) = n\,f(x)\,(1 - F(x))^{n-1}$.

Example 4.28. Let $X_1, \ldots, X_n$ be independent random variables, uniformly distributed on the interval $[0,1]$, and define $U = \min(X_1, \ldots, X_n)$. The distribution function of $X_i$ is given by $F(x) = x$, and the density is given by $f(x) = 1_{[0,1]}(x)$. Thus, for $0 \leq x \leq 1$, U has the distribution function and density given by the following formulas:
$$F_U(x) = 1 - (1 - x)^n, \qquad f_U(x) = n\,(1 - x)^{n-1}.$$
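The distribution of the minimum in Example 4.28 is easy to check by simulation; for $n = 5$ and $x = 0.2$ the formula gives $F_U(0.2) = 1 - 0.8^5 = 0.67232$.

```python
import random

rng = random.Random(1)   # fixed seed for reproducibility

n, x0 = 5, 0.2
exact = 1 - (1 - x0) ** n          # F_U(x0) = 1 - (1 - x0)^n

trials = 20000
hits = sum(1 for _ in range(trials)
           if min(rng.random() for _ in range(n)) <= x0)
estimate = hits / trials
```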


χ²-, t- and F-Distribution. In the statistical analysis of normally distributed observations we encounter the χ²-, t- and F-distributions, which we will introduce now. Recall that the density and the distribution function of the standard normal distribution are given by

φ(x) = (1/√(2π)) e^{−x²/2},
Φ(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−t²/2} dt.

As we have remarked earlier, there is no way to calculate the distribution function analytically. We have to use statistical tables or numerical procedures to evaluate the integral.

χ²-Distribution. If Z_1, ..., Z_n are independent standard normal random variables, Z_1² + ... + Z_n² has a χ²-distribution with n degrees of freedom; in short: χ²_n-distribution. If X_1, ..., X_n are independent N(μ, σ²)-distributed random variables,

(1/σ²) Σ_{i=1}^n (X_i − μ)²

has a χ²_n-distribution.

t-Distribution. Let Z and X be independent random variables and assume that Z has a standard normal distribution and X a χ²_f-distribution. Then the distribution of

T := Z / √(X/f)

is called a Student t-distribution with f degrees of freedom; in short: t_f-distribution.

F-Distribution. Let X and Y be independent χ²_f- and χ²_g-distributed random variables, respectively. Then the distribution of

F := (X/f) / (Y/g)

is called an F-distribution with (f, g) degrees of freedom; in short: F_{f,g}-distribution.

3. Inferential Statistics

Estimation of the Correlation Coefficient. The covariance and the correlation coefficient of the pair (X, Y) can be estimated from the data (x_1, y_1), ..., (x_n, y_n) by the empirical covariance and the empirical correlation coefficient introduced in the beginning of this chapter. In the case of normally distributed data, there are formulae for confidence bounds for ρ_{X,Y}.
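The defining constructions above are easy to check by simulation; a minimal sketch in Python (NumPy assumed), using the fact that E(χ²_n) = n:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 4, 200_000

# chi^2_n as the sum of n squared standard normal variables
Z = rng.standard_normal((reps, n))
chi2 = (Z ** 2).sum(axis=1)
print(chi2.mean())  # close to n = 4, since E[chi^2_n] = n

# t_f built from an independent N(0,1) and a chi^2_f variable
f = 6
Znum = rng.standard_normal(reps)
X = (rng.standard_normal((reps, f)) ** 2).sum(axis=1)
T = Znum / np.sqrt(X / f)
print(T.mean())  # close to 0 by symmetry
```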

CHAPTER 5

Linear Regression
In this chapter we will study models for random experiments where the outcome depends not only on chance, but also on the values of certain explanatory variables. A simple example is the mechanical properties of concrete, which depend on the composition of the concrete. We will first treat the case of a single explanatory variable, mainly as a warm-up for the more general case of several explanatory variables.

1. Simple Linear Regression

Simple linear regression is usually treated in undergraduate statistics courses. We treat it here as a motivation for the general case of multiple linear regression that will be studied in the next section. We consider an experiment whose outcome y depends on the value of another variable x, which can be chosen by the experimenter. One usually calls x the explanatory or independent variable and y the dependent variable. In the first instance, we propose a linear relationship between x and y, i.e.

y = α + βx,

for some constants α, β ∈ R. In addition we want to include the influence of randomness, which might enter through measurement errors, or because other variables that have not been taken into the model vary from one experiment to the next. We assume that the random component is additive and normally distributed with mean zero and variance σ². In this way the outcome of the experiment becomes a random variable Y that can be written as

Y = α + βx + ε,

where ε is an N(0, σ²)-distributed random variable. Obviously, a single experiment will not be very helpful when it comes to making statistical inferences concerning the unknown parameters of the model. Instead, we perform n independent experiments at different values x_1, ..., x_n of the independent variable. Denoting the outcome of the i-th experiment by Y_i, we obtain the model

(9) Y_i = α + βx_i + ε_i,  1 ≤ i ≤ n.

This model is called simple linear regression. Note that the model makes several assumptions that should be critically evaluated before the model is applied.

First of all, there is the linear relationship between x and y, which is always an oversimplification and which will at best hold approximately in some small range of x-values. Secondly, there is the assumption of an additive influence of randomness that has the same variance in all experiments. This implies that the variance of the outcomes of the experiment does not vary with the value of x. Often this is unrealistic, because the variance increases with the expected value of the outcome of the experiment. Thirdly, we make the simplifying assumption that the ε_i have a normal distribution.

Observe that the simple linear regression model (9) specifies that E(Y_i) = α + βx_i and that Var(Y_i) = σ². Moreover, Y_i has an N(α + βx_i, σ²)-distribution with density

f_{α,β,σ}(y_i) = (1/√(2πσ²)) exp(−(y_i − α − βx_i)²/(2σ²)).

Maximum Likelihood Estimation of Regression Parameters. We will now determine the ML estimators of the parameters α, β, σ² of the regression model. The joint density of Y_1, ..., Y_n is given by

f_{α,β,σ}(y_1, ..., y_n) = (1/(2πσ²)^{n/2}) exp(−(1/(2σ²)) Σ_{i=1}^n (y_i − α − βx_i)²),

and hence we have the log-likelihood function

(10) l(α, β, σ) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (y_i − α − βx_i)².

We now take partial derivatives with respect to α, β, σ and obtain the following system of likelihood equations:

(1/σ²) Σ_{i=1}^n (y_i − α − βx_i) = 0,
(1/σ²) Σ_{i=1}^n (y_i − α − βx_i) x_i = 0,
−n/σ + (1/σ³) Σ_{i=1}^n (y_i − α − βx_i)² = 0.

Looking at these equations, we see that α and β can be determined from the first two equations. With some calculations, we get

(11) α̂ = ȳ − β̂ x̄,
(12) β̂ = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)².

The line y = α̂ + β̂x is called the (estimated) regression line. The slope of the regression line can also be expressed as

β̂ = r_{x,y} (s_y / s_x),

where s_x² and s_y² are the sample variances of the x- and y-sample, respectively, and where

r_{x,y} = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / √( Σ_{i=1}^n (x_i − x̄)² · Σ_{i=1}^n (y_i − ȳ)² )

is the sample correlation coefficient.
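Formulas (11) and (12) translate directly into code; a minimal sketch in Python with hypothetical data (NumPy assumed):

```python
import numpy as np

# Hypothetical data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Equations (11) and (12): slope and intercept from centered sums.
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# The slope can equivalently be written as r_{x,y} * s_y / s_x.
r = np.corrcoef(x, y)[0, 1]
beta_alt = r * y.std(ddof=1) / x.std(ddof=1)
print(alpha_hat, beta_hat)  # beta_hat equals beta_alt up to rounding
```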


Finally, the ML estimator for the variance σ² can be obtained from the third likelihood equation; we get

(13) σ̂²_ML = (1/n) Σ_{i=1}^n (y_i − α̂ − β̂x_i)².

Before we discuss properties of these estimators, we want to explore the relationship to the method of least squares.

Method of Least Squares. Looking at the log-likelihood function (10), one can first keep σ² fixed and determine the maximum with respect to α and β. Stripping off all terms that do not depend on α, β, this leads to the minimization of

(14) Σ_{i=1}^n (y_i − α − βx_i)².

As this term does not involve σ², minimization of the above sum of squares gives the ML estimators for α and β. This technique, i.e. determining those values of α, β that minimize (14), is called the least squares method. This method was first used by Carl Friedrich Gauss (1777-1855) around the year 1800 in connection with the determination of the orbit of the minor planet Ceres. Geometrically, the least squares method estimates the regression line by minimizing the sum of squares of the vertical distances between the line and the data points (x_i, Y_i). The connection between the least squares method and the maximum likelihood method was also discovered by Gauss. The power of the least squares method lies in the fact that it eventually leads to a system of linear equations, which can be solved by methods of linear algebra.

Properties of the Regression Estimators. First we remark that the ML estimator of the variance is not unbiased. As in the case of the estimation of the variance of a normal distribution from i.i.d. data, one has to change the denominator n. In the case of simple linear regression, the appropriate denominator is n − 2; i.e. we obtain the estimator

s²_{y|x} := (1/(n−2)) Σ_{i=1}^n (y_i − α̂ − β̂x_i)².

Theorem 5.1. The least squares estimators α̂ and β̂ are normally distributed and unbiased, with variances

Var(α̂) = σ² Σ_{i=1}^n x_i² / (n Σ_{i=1}^n (x_i − x̄)²),
Var(β̂) = σ² / Σ_{i=1}^n (x_i − x̄)².

s²_{y|x} is an unbiased estimator for σ². Moreover, (n−2)s²_{y|x}/σ² has a χ²_{n−2}-distribution. In addition, s²_{y|x} is stochastically independent of (α̂, β̂).

Confidence Intervals for the Regression Parameters. The above-mentioned properties of the estimators for the regression parameters permit us to construct confidence regions.


Confidence Interval for α: From Theorem 5.1 we can infer that the statistic

T := (α̂ − α) / ( s²_{y|x} Σ_{i=1}^n x_i² / (n Σ_{i=1}^n (x_i − x̄)²) )^{1/2}

has a t_{n−2}-distribution. Thus we have with probability 95%

−t_{n−2;0.025} ≤ T ≤ t_{n−2;0.025}.

Some elementary algebra leads from here to the 95% confidence interval for α,

[ α̂ − t_{n−2;0.025} ( s²_{y|x} Σ_{i=1}^n x_i² / (n Σ_{i=1}^n (x_i − x̄)²) )^{1/2} ,  α̂ + t_{n−2;0.025} ( s²_{y|x} Σ_{i=1}^n x_i² / (n Σ_{i=1}^n (x_i − x̄)²) )^{1/2} ].

Confidence Interval for β: With similar reasoning we obtain the following 95% confidence interval for the slope β of the regression line:

[ β̂ − t_{n−2;0.025} ( s²_{y|x} / Σ_{i=1}^n (x_i − x̄)² )^{1/2} ,  β̂ + t_{n−2;0.025} ( s²_{y|x} / Σ_{i=1}^n (x_i − x̄)² )^{1/2} ].

Confidence Interval for σ²: Since X = (n−2) s²_{y|x}/σ² is χ²_{n−2}-distributed, we get that with probability 95%

χ²_{n−2;0.975} ≤ (n−2) s²_{y|x}/σ² ≤ χ²_{n−2;0.025}.

Again, using some elementary algebra, we obtain from here the following 95% confidence interval for σ²:

[ (n−2)s²_{y|x}/χ²_{n−2;0.025} , (n−2)s²_{y|x}/χ²_{n−2;0.975} ].
Tests for Regression Parameters. We can now test various hypotheses concerning the regression parameters α, β, σ². Most relevant is the hypothesis that β = 0, i.e. that the regression line is horizontal:

H : β = 0  against  A : β ≠ 0.

Note that the hypothesis β = 0 implies that the independent variable x has no influence on the outcome of the experiment. As test statistic we take a properly normalized version of the estimated slope, namely

T = β̂ ( Σ_{i=1}^n (x_i − x̄)² / s²_{Y|X} )^{1/2}.

Under the above hypothesis, T has a t_{n−2}-distribution. We reject the hypothesis for very small and very large values of T; for a test with level 5% we reject when

T ≤ −t_{n−2;0.025} or T ≥ t_{n−2;0.025}.

There is also a one-sided version of the above test, i.e. when testing

H : β = 0  against  A : β > 0.


This test problem is appropriate when we know that the slope is in any case nonnegative. In this case we take the same test statistic as above, but we reject only for large values of T, i.e. when T ≥ t_{n−2;0.05}.

More generally, we may test the hypothesis H : β = β_0 against any of the alternatives A : β ≠ β_0 or A : β > β_0. For this test problem, the appropriate test statistic is

T = (β̂ − β_0) ( Σ_{i=1}^n (x_i − x̄)² / s²_{Y|X} )^{1/2}.
The critical values are the same as in the case β_0 = 0.

Analysis of Variance. We will add a further consideration to the least squares method that will bridge a gap to later chapters. We first look at the outcomes of the experiments y_1, ..., y_n themselves, i.e. without their connection to the independent variables x_1, ..., x_n. We define the Total Sum of Squares as

SQT = Σ_{i=1}^n (y_i − ȳ)².

SQT is the sum of squared deviations of the observations from the arithmetic mean of the y_i's. We then decompose y_i − ȳ as follows:

y_i − ȳ = (α̂ + β̂x_i − ȳ) + (y_i − α̂ − β̂x_i).

From here we obtain the following decomposition of the total sum of squares:

Σ_{i=1}^n (y_i − ȳ)² = Σ_{i=1}^n (y_i − α̂ − β̂x_i)² + Σ_{i=1}^n (α̂ + β̂x_i − ȳ)² + 2 Σ_{i=1}^n (α̂ + β̂x_i − ȳ)(y_i − α̂ − β̂x_i)

(15)  = Σ_{i=1}^n (y_i − α̂ − β̂x_i)² + Σ_{i=1}^n (α̂ + β̂x_i − ȳ)².

The proof that Σ_{i=1}^n (α̂ + β̂x_i − ȳ)(y_i − α̂ − β̂x_i) = 0 is a short exercise in algebra. The deeper geometric reason behind it is the fact that the two vectors

(y_1 − α̂ − β̂x_1, ..., y_n − α̂ − β̂x_n)^t  and  (α̂ + β̂x_1 − ȳ, ..., α̂ + β̂x_n − ȳ)^t

are orthogonal. The two terms on the right-hand side of (15) are called the Error Sum of Squares and the Regression Sum of Squares, respectively, so that (15) reads SQT = SQ_Error + SQ_Regr. The decomposition (15) is called the Analysis of Variance (ANOVA) decomposition. The two sums of squares in the ANOVA decomposition tell us which portion of the variation in the outcomes of the experiments may be attributed to the variation in the values of the independent variable, and which portion to the deviations of the outcomes from their expected values. The regression sum of squares is in some sense explained by our model, while the error sum of squares is not. The ratio of the two


sums of squares is a measure of the degree to which the model explains the variation in the outcomes of the experiments. We define the coefficient of determination

R² := Σ_{i=1}^n (α̂ + β̂x_i − ȳ)² / Σ_{i=1}^n (y_i − ȳ)².

As α̂ = ȳ − β̂x̄ and β̂ = r_{x,y} s_y/s_x, we can express the numerator of R² as follows:

Σ_{i=1}^n (α̂ + β̂x_i − ȳ)² = Σ_{i=1}^n (β̂(x_i − x̄))² = β̂² Σ_{i=1}^n (x_i − x̄)² = r²_{x,y} Σ_{i=1}^n (y_i − ȳ)².

Thus we get that, in the case of simple linear regression, R² = r²_{x,y}. In later sections, we will be able to define R² also in other models.

The ANOVA decomposition (15) is also the basis for a test of the hypothesis H : β = 0 versus the alternative A : β ≠ 0. We define the F-test statistic

F = SQ_Regr / (SQ_Err/(n−2)) = Σ_{i=1}^n (α̂ + β̂x_i − ȳ)² / ( Σ_{i=1}^n (y_i − α̂ − β̂x_i)² / (n−2) ).
Under the hypothesis β = 0, F has an F_{1,n−2}-distribution. We reject the hypothesis when the test statistic exceeds the upper α-quantile of the F_{1,n−2}-distribution, i.e. when F ≥ F_{1,n−2;α}. This test is in fact equivalent to the t-test presented above for the same test problem; one can easily show that F = T². However, unlike for the t-test, there is no one-sided version of the F-test.

2. Multiple Linear Regression

Model and Examples. The multiple linear regression model, often just called the linear model, is an extension of the simple linear regression model. In multiple linear regression, we consider the effects of p explanatory variables x_1, ..., x_p on the outcome of the experiment. For a single experiment, the model is

Y = β_1 x_1 + ... + β_p x_p + ε,

where β_1, ..., β_p ∈ R are the regression parameters and ε is an N(0, σ²)-distributed random variable. At first sight, it seems puzzling that the constant term is absent here. We will see in the examples below that it may be introduced artificially by choosing the first variable x_1 ≡ 1. As in the case of simple linear regression, we perform our experiment n times independently. We denote by x_{ij} the value of the j-th explanatory variable in the i-th experiment and by Y_i the outcome of the i-th experiment:

(16) Y_i = Σ_{j=1}^p x_{ij} β_j + ε_i,  1 ≤ i ≤ n.

As in the case of simple linear regression, the random variables ε_1, ..., ε_n are assumed to be independent and N(0, σ²)-distributed. We introduce some further notation. The matrix

X = (x_{ij})_{1≤i≤n, 1≤j≤p}


is called the design matrix of the linear model. We define the parameter vector β = (β_1, ..., β_p)^t and the data vector Y = (Y_1, ..., Y_n)^t. In this way we may express the linear model as

Y = Xβ + ε,

where ε = (ε_1, ..., ε_n)^t.

Example 5.2. (i) The simple linear regression model that we studied earlier in this chapter is a special case of multiple linear regression. We define x_{i1} = 1, x_{i2} = x_i, β_1 = α and β_2 = β. With these definitions, we obtain

Σ_{j=1}^p x_{ij} β_j = x_{i1}β_1 + x_{i2}β_2 = α + βx_i,

and thus the linear model (16) becomes the simple linear regression model (9). The design matrix of the simple linear regression model is

X = ( 1 x_1 )
    ( 1 x_2 )
    ( ...   )
    ( 1 x_n ).

(ii) In a polynomial regression model we have a single explanatory variable x. We assume that the outcome of the experiment is a polynomial function of x plus a random component, i.e.

Y = β_1 + β_2 x + β_3 x² + ... + β_{p+1} x^p + ε,

where β_1, ..., β_{p+1} are unknown regression parameters and ε is an N(0, σ²)-distributed random variable. If we perform the experiment n times independently at different values of x, we obtain the model

Y_i = β_1 + β_2 x_i + β_3 x_i² + ... + β_{p+1} x_i^p + ε_i.

Here x_i denotes the value of the explanatory variable in the i-th experiment. It turns out that the polynomial regression model is a linear model, whose design matrix is given by

X = ( 1 x_1 x_1² ... x_1^p )
    ( 1 x_2 x_2² ... x_2^p )
    ( ...                  )
    ( 1 x_n x_n² ... x_n^p )  ∈ R^{n×(p+1)}.

This might come as a surprise, since the relationship between the explanatory variable and the expected value of the outcome of the experiment is nonlinear. The term "linear" in "linear model" refers to the fact that the model is linear in the unknown parameters β_1, ..., β_{p+1}.

(iii) We consider the k-sample problem, where k independent samples of sizes n_1, ..., n_k are drawn from normal populations with means μ_1, ..., μ_k and identical variance σ². We denote the random variables in the i-th sample by X_{i1}, ..., X_{in_i};


thus we get

X_{11}, X_{12}, ..., X_{1n_1} ~ N(μ_1, σ²)
X_{21}, X_{22}, ..., X_{2n_2} ~ N(μ_2, σ²)
...
X_{k1}, X_{k2}, ..., X_{kn_k} ~ N(μ_k, σ²).

We may write the j-th observation in the i-th sample as follows:

X_{ij} = μ_i + (X_{ij} − μ_i) =: μ_i + ε_{ij},

where the ε_{ij} are independent N(0, σ²)-distributed random variables. This can be viewed as a linear model where the vector of observations is

Y = (X_{11}, ..., X_{1n_1}, X_{21}, ..., X_{2n_2}, ..., X_{k1}, ..., X_{kn_k})^t.

The explanatory variables are {0, 1}-valued, and the design matrix is

X = ( 1 0 ... 0 )
    ( ...       )
    ( 1 0 ... 0 )
    ( 0 1 ... 0 )
    ( ...       )
    ( 0 1 ... 0 )
    ( ...       )
    ( 0 0 ... 1 )
    ( ...       )
    ( 0 0 ... 1 ).

The entries of the design matrix simply indicate to which sample the observation in the given row belongs. Such explanatory variables are also called dummy variables.
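The dummy design matrix of the k-sample problem can be built mechanically from the sample sizes; a minimal sketch with hypothetical group sizes (NumPy assumed):

```python
import numpy as np

# Hypothetical group sizes for a k = 3 sample problem.
sizes = [2, 3, 2]

# Build the dummy design matrix row by row: row r has a single 1 in the
# column of the sample that observation r belongs to.
X = np.zeros((sum(sizes), len(sizes)))
row = 0
for j, nj in enumerate(sizes):
    X[row:row + nj, j] = 1.0
    row += nj
print(X)
```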

Least Squares Method. As in the case of the simple linear regression model, the maximum likelihood technique for the estimation of the parameters β_1, ..., β_p in the multiple linear regression model leads to the least squares method. We have to determine the values of β_1, ..., β_p that minimize

Σ_{i=1}^n (Y_i − Σ_{j=1}^p x_{ij} β_j)².

We solve this minimization problem by setting the partial derivatives with respect to β_1, ..., β_p equal to zero. In this way we obtain, for l = 1, ..., p, the equations

0 = (∂/∂β_l) Σ_{i=1}^n (Y_i − Σ_{j=1}^p x_{ij} β_j)² = (−2) Σ_{i=1}^n (Y_i − Σ_{j=1}^p x_{ij} β_j) x_{il}.

This is a system of p linear equations in the p unknowns β_1, ..., β_p, which can be rewritten as

Σ_{i=1}^n x_{il} Y_i = Σ_{i=1}^n Σ_{j=1}^p x_{il} x_{ij} β_j,  l = 1, ..., p.


A short exercise in matrix algebra shows that this system can be rewritten in vector-matrix notation as

X^t Y = (X^t X) β,

with the notation introduced above. If the matrix X^t X is invertible, we get the following explicit formula for the least squares estimator:

β̂ = (X^t X)^{−1} X^t Y.

The maximum likelihood estimator for the variance is (1/n) Σ_{i=1}^n (Y_i − Σ_{j=1}^p x_{ij} β̂_j)². Again, this estimator is biased; we get an unbiased estimator if we replace the term n in the denominator by n − p; thus

s²_{y|x_1,...,x_p} := (1/(n−p)) Σ_{i=1}^n (Y_i − Σ_{j=1}^p x_{ij} β̂_j)².

Theorem 5.3. The least squares estimator β̂ for the parameters β_1, ..., β_p and the variance estimator s²_{y|x_1,...,x_p} in the linear model (16) have the following properties:
(1) β̂_j has a normal distribution with mean β_j and variance σ² [(X^t X)^{−1}]_{jj}, where [(X^t X)^{−1}]_{jj} denotes the (j, j)-th entry of the matrix (X^t X)^{−1}.
(2) β̂ has a multivariate normal distribution with mean β and covariance matrix σ² (X^t X)^{−1}.
(3) (n−p) s²_{y|x_1,...,x_p}/σ² has a χ²_{n−p}-distribution.
(4) β̂ and s²_{y|x_1,...,x_p} are stochastically independent.
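The explicit formula β̂ = (X^t X)^{−1} X^t Y and the variance estimator translate directly into code; a minimal sketch with simulated data (NumPy assumed; solving the linear system is numerically preferable to forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical design matrix with an intercept column and two regressors.
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.standard_normal(n), rng.standard_normal(n)])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Least squares estimator via the normal equations X^t Y = (X^t X) beta.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Unbiased variance estimator with denominator n - p.
resid = Y - X @ beta_hat
s2 = resid @ resid / (n - p)
print(beta_hat, s2)
```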

In particular, β̂ and s²_{y|x_1,...,x_p} are unbiased estimators.

Confidence Intervals for Regression Parameters. The first and the third statement of the above theorem allow us to determine confidence intervals for the regression parameters β_1, ..., β_p and for the variance σ².

Confidence interval for β_j: Using the same arguments as in the case of simple linear regression, we obtain the following 95% confidence interval for β_j:

[ β̂_j − t_{n−p;0.025} √( s²_{y|x_1,...,x_p} [(X^t X)^{−1}]_{jj} ) ,  β̂_j + t_{n−p;0.025} √( s²_{y|x_1,...,x_p} [(X^t X)^{−1}]_{jj} ) ].

Confidence interval for σ²: Similarly, we obtain the following 95% confidence interval for σ²:

[ (n−p)s²_{y|x_1,...,x_p}/χ²_{n−p;0.025} , (n−p)s²_{y|x_1,...,x_p}/χ²_{n−p;0.975} ].

ANOVA Decomposition, Multiple Correlation Coefficient. For the multiple linear regression model we again have an ANOVA decomposition, which is based on the simple identity

y_i − ȳ = (y_i − Σ_{j=1}^p x_{ij} β̂_j) + (Σ_{j=1}^p x_{ij} β̂_j − ȳ).

From here we obtain

Σ_{i=1}^n (y_i − ȳ)² = Σ_{i=1}^n (y_i − Σ_{j=1}^p x_{ij} β̂_j)² + Σ_{i=1}^n (Σ_{j=1}^p x_{ij} β̂_j − ȳ)²,

because the sum of the mixed products 2 Σ_{i=1}^n (y_i − Σ_{j=1}^p x_{ij} β̂_j)(Σ_{j=1}^p x_{ij} β̂_j − ȳ) vanishes; this can be verified using some algebraic calculation, where the deeper geometric reason is again the orthogonality of the two vectors involved. We define the multiple correlation coefficient R² by

R² := Σ_{i=1}^n (Σ_{j=1}^p x_{ij} β̂_j − ȳ)² / Σ_{i=1}^n (y_i − ȳ)².

The interpretation of R² is the same as in the case of simple linear regression; R² tells us which portion of the total sum of squares is explained by the multiple linear regression model.

Tests in Multiple Linear Regression Models. The most important testing problem in multiple linear regression asks whether some of the regression coefficients are equal to zero. This would mean that the corresponding explanatory variables have no influence on the outcome of the experiment. For a single regression coefficient β_j, the hypothesis H : β_j = 0 versus A : β_j ≠ 0 may be tested by a t-test. We take the test statistic

T = β̂_j / √( s²_{y|x_1,...,x_p} [(X^t X)^{−1}]_{jj} ),

and note that under H this statistic has a t_{n−p}-distribution. Thus the t-test with level α rejects the hypothesis when

T ≤ −t_{n−p;α/2} or T ≥ t_{n−p;α/2}.

When one wants to test whether several of the regression coefficients are equal to zero, one cannot use repeated t-tests, for a variety of reasons. First of all, one faces the risk of an increased level of the test. Roughly speaking, every single test has probability α of a false rejection of the hypothesis, and these probabilities add up when making multiple tests. If we perform 20 tests that each have level 5%, on average one of these tests will reject, even if the hypotheses are correct. An added complication is the fact that the outcome of the t-test for the above testing problem depends on the model that is being considered. Suppose the test for β_p = 0 did not reject. In that case, we may consider the smaller model that contains only the first p − 1 explanatory variables.
If we now test the additional hypothesis that β_{p−1} = 0, we will often get a result that differs from the one obtained if we had tested this hypothesis without deleting the p-th explanatory variable. How to handle these problems wisely is considered in the theory of selection of explanatory variables; the details would lead too far for this course.


If we want to test simultaneously whether a group of regression parameters equals 0, we face the hypothesis

(17) H : β_{q+1} = ... = β_p = 0, where 0 ≤ q < p,

against the alternative that at least one of the β_j ≠ 0, q+1 ≤ j ≤ p. If this hypothesis is true, the last p − q explanatory variables do not contribute to the outcome of the experiment. In fact, we may consider any other subset of variables, too; we have chosen the last ones for ease of notation. In order to test the hypothesis (17), we consider the multiple linear regression model that we obtain by setting β_{q+1} = ... = β_p = 0 in the original model, i.e.

Y_i = Σ_{j=1}^q x_{ij} β_j + ε_i.

Again we may estimate the parameters β_1, ..., β_q with the help of the least squares technique. In order to distinguish the resulting estimates from those obtained under the full model, we denote the new estimates by β̂^{(0)} = (β̂_1^{(0)}, ..., β̂_q^{(0)}). It is very important to realize that these estimates for the regression coefficients are different from the estimates obtained in the full model! Again, we may consider an ANOVA decomposition, which in this case reads

Σ_{i=1}^n (y_i − ȳ)² = Σ_{i=1}^n (y_i − Σ_{j=1}^q x_{ij} β̂_j^{(0)})² + Σ_{i=1}^n (Σ_{j=1}^q x_{ij} β̂_j^{(0)} − ȳ)².

The first term on the right-hand side gives the part of the variation in the data that cannot be explained by the reduced regression model. The F-test for the hypothesis (17) compares this unexplained variation to the variation that remained unexplained under the full model, via the following test statistic:

F := [ ( Σ_{i=1}^n (Y_i − Σ_{j=1}^q x_{ij} β̂_j^{(0)})² − Σ_{i=1}^n (Y_i − Σ_{j=1}^p x_{ij} β̂_j)² ) / (p − q) ] / [ Σ_{i=1}^n (Y_i − Σ_{j=1}^p x_{ij} β̂_j)² / (n − p) ].

Note that the numerator is always nonnegative: we increase the unexplained variation by deleting some of the variables. Under the hypothesis (17), the test statistic F has an F_{p−q,n−p}-distribution. We thus reject the hypothesis if F ≥ F_{p−q,n−p;α}.
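The F-test for the hypothesis (17) can be sketched as follows (NumPy and SciPy assumed; the data are simulated so that the last explanatory variable is in fact irrelevant):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical data: the last of p = 3 explanatory variables is irrelevant.
n, p, q = 40, 3, 2
X = np.column_stack([np.ones(n), rng.standard_normal(n), rng.standard_normal(n)])
Y = X[:, :q] @ np.array([1.0, 0.5]) + rng.standard_normal(n)

def rss(D, Y):
    """Residual sum of squares of the least squares fit of Y on D."""
    b = np.linalg.lstsq(D, Y, rcond=None)[0]
    r = Y - D @ b
    return r @ r

rss_full = rss(X, Y)        # full model with p columns
rss_red = rss(X[:, :q], Y)  # reduced model with q columns

F = ((rss_red - rss_full) / (p - q)) / (rss_full / (n - p))
p_value = stats.f.sf(F, p - q, n - p)
print(F, p_value)
```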

CHAPTER 6

Analysis of Variance
In this chapter, we will analyze a further application of the ANOVA decomposition. We consider an experiment whose outcome depends on the value of a single factor, which may take k different values. In this case, the explanatory variable is discrete, and thus it makes little sense to consider a linear regression model with the values of the factor as explanatory variable. We denote the outcomes of the experiment at the i-th level of the factor by X_{i1}, ..., X_{in_i}. In addition, we assume that all random variables are independent, that their distributions are normal with identical variance σ², and that the expected value at the i-th level is μ_i. In this way we get

X_{11}, X_{12}, ..., X_{1n_1} ~ N(μ_1, σ²)
X_{21}, X_{22}, ..., X_{2n_2} ~ N(μ_2, σ²)
...
X_{k1}, X_{k2}, ..., X_{kn_k} ~ N(μ_k, σ²).

Note that this is exactly the k-sample problem treated in the previous chapter. In principle, the estimation and testing problems for the k-sample problem may be seen as special cases of such problems for a linear model. However, the special structure of the model allows an easier direct treatment and also some specific interpretations that justify a separate treatment.

1. Two-sample problem

We will begin with the special case of the two-sample problem. In this case, the outcome of the experiment depends on a factor that can take only two values. Such data arise frequently in case-control studies, e.g. when machine parts have either undergone a treatment or not. In the two-sample case, we use special notation. The m observations in the first sample are denoted by x_1, ..., x_m; the n observations in the second sample are denoted by y_1, ..., y_n. We assume that all the random variables are independent and normally distributed, i.e. that

X_1, ..., X_m ~ N(μ_1, σ_1²),
Y_1, ..., Y_n ~ N(μ_2, σ_2²).

Descriptive Statistics: In the first instance, we may consider the two samples separately. We calculate their sample means and sample variances and denote these by X̄, Ȳ, s²_X and s²_Y, respectively. Thus we have

X̄ = (1/m) Σ_{i=1}^m X_i,  Ȳ = (1/n) Σ_{i=1}^n Y_i,

and

s²_X = (1/(m−1)) Σ_{i=1}^m (X_i − X̄)²,
s²_Y = (1/(n−1)) Σ_{i=1}^n (Y_i − Ȳ)².

A good graphical summary of the two samples consists of the two boxplots drawn next to each other; this gives an excellent visual impression of the samples. In addition, one should calculate confidence intervals for the expected values μ_1 and μ_2 in each of the samples separately, and add them to the boxplots. The 95% confidence intervals for μ_1 and μ_2 are given by

[ X̄ − t_{m−1;0.025} √(s²_X/m) , X̄ + t_{m−1;0.025} √(s²_X/m) ],
[ Ȳ − t_{n−1;0.025} √(s²_Y/n) , Ȳ + t_{n−1;0.025} √(s²_Y/n) ].

F-Test for Variances: Most standard statistical procedures assume that the variances in the two populations are equal, i.e. that the hypothesis

H : σ_1² = σ_2²

holds. Thus one should always test this hypothesis at the beginning of any statistical analysis of the data. The common F-test uses the ratio of the two sample variances as test statistic:

F = s²_Y / s²_X.

If the above hypothesis is true, F has an F_{n−1,m−1}-distribution. Whether we reject for large and small values of F, or only for large (or only for small) values, depends on the choice of alternative. Most relevant is the alternative A : σ_1² ≠ σ_2². In this case, both very large as well as very small values of F hint that the hypothesis might be wrong. Thus we reject the hypothesis when

F ≥ F_{n−1,m−1;α/2} or F ≤ F_{n−1,m−1;1−α/2}.

In the case of the one-sided alternative A : σ_2² > σ_1², we reject the hypothesis when F ≥ F_{n−1,m−1;α}.

Two-Sample t-Test: We now assume that the variances in the two samples are equal. Thus we may use a common symbol for the variances, σ² := σ_1² = σ_2². In this situation, both s²_X as well as s²_Y are estimates of the population variance σ². In fact, we can improve these estimates by taking a convex combination of the two. Some calculations show that the optimal estimator is given by

s²_P := ((m−1)/(m+n−2)) s²_X + ((n−1)/(m+n−2)) s²_Y = (1/(m+n−2)) ( Σ_{i=1}^m (X_i − X̄)² + Σ_{i=1}^n (Y_i − Ȳ)² ).

s²_P is usually referred to as the pooled sample variance. The common test statistic for testing the hypothesis

H : μ_1 = μ_2


is the normalized difference of the sample means,

T = (Ȳ − X̄) / √( s²_P (1/m + 1/n) ).

Note that the term in the denominator is the estimated standard deviation of the numerator. If the above hypothesis is true, the test statistic T has a t_{m+n−2}-distribution. The rejection region again depends on the choice of alternative. In the case of the one-sided alternative A : μ_2 < μ_1 we reject when T ≤ −t_{m+n−2;α}. In the case of the two-sided alternative A : μ_2 ≠ μ_1, we reject when

T ≤ −t_{m+n−2;α/2} or T ≥ t_{m+n−2;α/2}.

2. Single Factor ANOVA

We now consider the general k-sample case, also known as the One Factor ANOVA Model:

X_{11}, X_{12}, ..., X_{1n_1} ~ N(μ_1, σ²)
X_{21}, X_{22}, ..., X_{2n_2} ~ N(μ_2, σ²)
...
X_{k1}, X_{k2}, ..., X_{kn_k} ~ N(μ_k, σ²).

Note that we have immediately assumed that all variances are equal.

Descriptive Statistics: As in the two-sample case, it makes sense to consider the samples individually and to calculate sample means and sample variances separately. The following notation is commonly used for the sample means of the individual samples and for the overall sample mean:

X̄_{i·} := (1/n_i) Σ_{j=1}^{n_i} X_{ij},
X̄_{··} := (1/(n_1 + ... + n_k)) Σ_{i=1}^k Σ_{j=1}^{n_i} X_{ij}.

Observe that the overall sample mean is not the arithmetic mean of the individual sample means, unless all sample sizes are equal. The sample variance in the i-th sample is denoted by the symbol s²_i, i.e.

s²_i := (1/(n_i−1)) Σ_{j=1}^{n_i} (X_{ij} − X̄_{i·})².

A good graphical display consists of the boxplots for each of the samples, drawn in a single graph next to each other, and similarly for the confidence intervals for the population means.


ANOVA Decomposition: We consider the following decomposition of the total sum of squares:

(18) Σ_{i=1}^k Σ_{j=1}^{n_i} (X_{ij} − X̄_{··})² = Σ_{i=1}^k n_i (X̄_{i·} − X̄_{··})² + Σ_{i=1}^k Σ_{j=1}^{n_i} (X_{ij} − X̄_{i·})².

As with the previous ANOVA decompositions, the validity of the above identity is not obvious. For a proof, the following identity may be used:

Σ_{j=1}^n (x_j − a)² = n(x̄ − a)² + Σ_{j=1}^n (x_j − x̄)²,

valid for any real numbers x_1, ..., x_n with arithmetic mean x̄ and any a ∈ R. Applying this identity with x_j = X_{ij}, 1 ≤ j ≤ n_i, and a = X̄_{··} to the inner sum of (18), and observing that in this case x̄ = X̄_{i·}, we get

Σ_{j=1}^{n_i} (X_{ij} − X̄_{··})² = Σ_{j=1}^{n_i} (X_{ij} − X̄_{i·})² + n_i (X̄_{i·} − X̄_{··})².

Taking the sum over i on both sides, we obtain the ANOVA decomposition (18).

The ANOVA decomposition (18) has an important interpretation. On the left-hand side, we have the total variation in the entire data set x_{11}, ..., x_{kn_k}. This variation is decomposed into two contributions, namely the variation between the k groups and the variation within the groups; these correspond to the two terms on the right-hand side of (18). The three types of sums of squares are called the Total Sum of Squares, the Between Groups Sum of Squares and the Within Groups Sum of Squares, respectively.

F-Test for μ_1 = μ_2 = ... = μ_k: The most important test problem for the single factor model is to test for equality of all expected values, i.e.

(19) H : μ_1 = ... = μ_k,

against the alternative that at least one of these equalities is violated. If this hypothesis is true, there is in fact no difference between the populations; the levels of the explanatory variable have no influence on the expected value of the outcome of the experiment. The idea behind the F-test for the hypothesis (19) is to compare the sum of squares between groups with the sum of squares within groups. If their ratio is large, this hints that there is a difference due to the different levels of the factor. It is customary to use some norming constants and to take the following test statistic:

F := ( Σ_{i=1}^k n_i (X̄_{i·} − X̄_{··})² / (k−1) ) / ( Σ_{i=1}^k Σ_{j=1}^{n_i} (X_{ij} − X̄_{i·})² / (n−k) ),

where n := n_1 + ... + n_k is the total sample size. If the hypothesis is true, F has an F_{k−1,n−k}-distribution. Large values of F hint that the hypothesis might not be true, and thus we reject the hypothesis for large values of F. The 5% level test rejects when F ≥ F_{k−1,n−k;0.05}.
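The F statistic of the one-factor ANOVA can be computed directly from the two sums of squares; a minimal sketch with hypothetical samples (NumPy and SciPy assumed; scipy.stats.f_oneway serves as a cross-check):

```python
import numpy as np
from scipy import stats

# Hypothetical samples at k = 3 factor levels.
groups = [np.array([5.1, 4.9, 5.4]),
          np.array([5.8, 6.1, 5.9, 6.3]),
          np.array([4.6, 4.4, 4.9])]

all_obs = np.concatenate(groups)
n, k = len(all_obs), len(groups)
grand = all_obs.mean()

# Between-groups and within-groups sums of squares from (18).
between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
within = sum(((g - g.mean()) ** 2).sum() for g in groups)

F = (between / (k - 1)) / (within / (n - k))
print(F, stats.f_oneway(*groups).statistic)  # the two values agree
```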


The F-test in the one-factor ANOVA model can be seen as a special case of the general F-test introduced in the previous chapter. However, the special character of the model makes the direct treatment presented in this chapter easier.

Further Tests for the One Factor ANOVA Model. We want to mention two further tests that are quite often used in connection with the one factor ANOVA model, without going into deep detail.

Bartlett's test for equality of variances: Since the F-test is only valid under the hypothesis that all the variances are equal, it is important to be able to test this hypothesis. Bartlett devised a test for this, which is implemented in all standard statistical packages.

Multiple comparisons: When the hypothesis (19) has been rejected, one is usually interested in more detailed questions. Are there some levels of the explanatory variable that yield the same expected value for the outcome of the experiment? Of course, one could perform many pairwise comparisons between any two levels of the explanatory variable, using a t-test. This would, however, lead to the usual problem of multiple tests, namely the increase in the probability of a false rejection. Several statisticians have recommended alternative tests; the best known are Tukey's test and the Newman-Keuls test.

CHAPTER 7

Principal Component Analysis


Multivariate data are often high dimensional, e.g. because many measurements have been taken on each individual. For a variety of reasons, it is desirable to obtain lower dimensional approximations to the data set. In this chapter we want to introduce an important technique for such a dimension reduction.

1. Linear Algebra Tools

Principal Component Analysis relies heavily on tools from Linear Algebra, which we want to introduce here first. Specifically, we will need results from the theory of diagonalization of symmetric matrices and the geometry of Euclidean vector spaces.

Diagonalization of Symmetric Matrices. We recall the definition of eigenvalues and eigenvectors of a square matrix A ∈ R^{n×n}. A real number λ ∈ R is called an eigenvalue of A if we can find a vector x ≠ 0 such that A x = λ x. Such a vector x is then called an eigenvector of A with eigenvalue λ. Recall that a matrix A is called symmetric if A^t = A, where A^t is the transpose of A. An equivalent condition for symmetry is a_{ij} = a_{ji} for all 1 ≤ i, j ≤ n. The vectors u_1, ..., u_k ∈ R^n are called orthonormal if

u_i^t u_j = 1 for i = j, and u_i^t u_j = 0 for i ≠ j.

Orthonormal vectors are always linearly independent. An orthonormal system that spans R^n is called an orthonormal basis (ONB).

Theorem 7.1 (Principal Axes Theorem). Let A ∈ R^{n×n} be a symmetric matrix.
(i) There exists an orthonormal basis (u_1, ..., u_n) of R^n, consisting of eigenvectors of the matrix A, i.e. A u_i = λ_i u_i, 1 ≤ i ≤ n, where λ_1, ..., λ_n are the eigenvalues of A.
(ii) There exists an orthogonal matrix U ∈ R^{n×n} such that

U^t A U = diag(λ_1, λ_2, ..., λ_n).

The columns of U are the orthonormal eigenvectors of the matrix A.


Note that the two parts of the Principal Axes Theorem are equivalent statements. Given an orthonormal basis {u_1, ..., u_n} of eigenvectors of A, we can consider the matrix U = [u_1, ..., u_n] with columns u_1, ..., u_n. This matrix is orthogonal and satisfies A U = U D, where D is the diagonal matrix whose diagonal elements are the eigenvalues λ_1, ..., λ_n of A corresponding to u_1, ..., u_n. Multiplying from the left by U^t and using that U^t U = I, we obtain U^t A U = D. In the other direction, we can infer from U^t A U = D, where U is an orthogonal matrix, that A U = U D. Denoting the columns of U by u_1, ..., u_n, we get A u_i = λ_i u_i, and hence u_i is an eigenvector of A with corresponding eigenvalue λ_i. The vectors u_1, ..., u_n form an orthonormal basis, since they are the columns of an orthogonal matrix.

In connection with Principal Component Analysis, we will use a characterization of the eigenvectors of a symmetric nonnegative definite matrix A via an extremal property of the quadratic form x^t A x, x ∈ R^n. Recall that A is called nonnegative definite if x^t A x ≥ 0 for all x ∈ R^n. Eigenvalues of nonnegative definite symmetric matrices are always nonnegative. We put the eigenvalues of A into decreasing order, i.e. λ_1 ≥ λ_2 ≥ ... ≥ λ_n ≥ 0, and denote by u_1, ..., u_n the corresponding orthonormal basis of eigenvectors. The eigenvector u_1 corresponding to the largest eigenvalue is then characterized by the property

u_1^t A u_1 = max_{x ∈ R^n, ||x|| = 1} x^t A x.

More generally, we get for any 1 ≤ k ≤ n,

(20) u_1^t A u_1 + ... + u_k^t A u_k = max { x_1^t A x_1 + ... + x_k^t A x_k : x_1, ..., x_k ∈ R^n orthonormal }.
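Numerically, the diagonalization asserted by Theorem 7.1 is a standard library routine. A minimal sketch with NumPy (our choice of tool, not the notes'):

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])   # a symmetric matrix

# eigh is specialized to symmetric matrices; eigenvalues come back in ascending order
eigvals, U = np.linalg.eigh(A)

# U is orthogonal and U^t A U is the diagonal matrix of eigenvalues
assert np.allclose(U.T @ U, np.eye(3))
assert np.allclose(U.T @ A @ U, np.diag(eigvals))
print(eigvals)
```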

Observe that the left-hand side of (20) equals λ_1 + ... + λ_k.

Orthogonal Projection. Given a vector x ∈ R^n and a subspace W ⊆ R^n, we want to find the vector P_W(x) ∈ W that is closest to x, i.e. that satisfies

||x − P_W(x)|| = min_{w ∈ W} ||x − w||.

The right-hand side is also called the distance of x to W and denoted by d(x, W). It turns out that P_W(x) may be completely characterized by the following two properties:

P_W(x) ∈ W and x − P_W(x) ⊥ W.

The map x ↦ P_W(x) is called the orthogonal projection onto the subspace W. If (w_1, ..., w_k) is an orthonormal basis for W, we may calculate the orthogonal projection directly as follows:

P_W(x) = Σ_{i=1}^k (w_i^t x) w_i.

In this case, we may compute the distance between x and W by the formula

||x − P_W(x)||² = ||x||² − Σ_{i=1}^k (w_i^t x)².
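The projection formula and the distance formula are easy to check numerically; a small sketch (NumPy, with an arbitrarily chosen x and subspace):

```python
import numpy as np

def project(x, B):
    """Orthogonal projection of x onto the span of the orthonormal columns of B:
    P_W(x) = sum_i (w_i^t x) w_i."""
    return B @ (B.T @ x)

x = np.array([1.0, 2.0, 3.0])
B = np.array([[1.0, 0.0],          # orthonormal basis of a 2-dimensional
              [0.0, 1.0],          # subspace W of R^3
              [0.0, 0.0]])

p = project(x, B)
# characterizing properties: P_W(x) lies in W and x - P_W(x) is orthogonal to W
assert np.allclose(B.T @ (x - p), 0.0)
# distance formula: ||x - P_W(x)||^2 = ||x||^2 - sum_i (w_i^t x)^2
assert np.isclose(((x - p) ** 2).sum(), (x ** 2).sum() - ((B.T @ x) ** 2).sum())
print(p)   # → [1. 2. 0.]
```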


2. Principal Components of Multivariate Data

Our data set consists of n vectors X_1, ..., X_n ∈ R^p, which we view as realisations of independent, identically distributed random vectors. We may think of p measurements being taken on n individuals that have been chosen at random from some population. In engineering applications, we might think of p measurements taken on machine parts coming from the same producer. We denote the j-th coordinate of the vector X_i by x_{ij}, i.e. X_i = (x_{i1}, ..., x_{ip})^t, 1 ≤ i ≤ n.

It is the goal of principal component analysis to reduce the dimension of the data vectors by projecting them onto a suitable low-dimensional subspace W. The subspace W should be chosen in such a way that we lose as little information about the original data as possible. Before we come to the principal component analysis, we center our data, i.e. we subtract the mean x̄_j = (1/n) Σ_{i=1}^n x_{ij} from the j-th coordinate of each of the vectors. Thus we get the reduced observations

x̃_{ij} = x_{ij} − x̄_j = x_{ij} − (1/n) Σ_{i=1}^n x_{ij}.

In vector notation, we define the sample mean X̄ = (1/n) Σ_{i=1}^n X_i and the reduced vector of observations X̃_i = X_i − X̄. Equivalently, we might think of moving the center of our coordinate system to the sample mean. In the rest of this chapter we assume that the original data have already been transformed in this way, i.e. we write X_i for X̃_i. We fix a dimension k, where 1 ≤ k < p. In principal component analysis, we want to find the k-dimensional subspace W ⊆ R^p that is closest to the data set in the sense that the sum of squared distances,

Σ_{i=1}^n d(X_i, W)²,

is minimized. Here d(x, W) = min_{w ∈ W} ||x − w||, as defined above. By the Gram-Schmidt orthogonalization procedure, we may find for any k-dimensional subspace W ⊆ R^p an orthonormal basis w_1, ..., w_k. Using the explicit formula for the distance d(x, W) given above, we may write

d(X_i, W)² = ||X_i||² − (w_1^t X_i)² − ... − (w_k^t X_i)².

Hence we get the following formula for the sum of squared distances of the data vectors X_1, ..., X_n from the subspace W:

Σ_{i=1}^n d(X_i, W)² = Σ_{i=1}^n ||X_i||² − Σ_{i=1}^n [ (w_1^t X_i)² + ... + (w_k^t X_i)² ]
= Σ_{i=1}^n ||X_i||² − [ Σ_{i=1}^n (w_1^t X_i)² + ... + Σ_{i=1}^n (w_k^t X_i)² ].

Observing that (w^t X)² = (w^t X)(X^t w) = w^t (X X^t) w, we may rewrite each of the summands on the right-hand side as follows:

Σ_{i=1}^n (w_j^t X_i)² = Σ_{i=1}^n w_j^t (X_i X_i^t) w_j = w_j^t ( Σ_{i=1}^n X_i X_i^t ) w_j = w_j^t S_X w_j,

where we have defined S_X = Σ_{i=1}^n X_i X_i^t. Thus we get in total the following formula for the sum of squared distances of the data vectors X_i from the subspace W spanned by the orthonormal basis w_1, ..., w_k:

(21) Σ_{i=1}^n d(X_i, W)² = Σ_{i=1}^n ||X_i||² − ( w_1^t S_X w_1 + ... + w_k^t S_X w_k ) = Σ_{i=1}^n ||X_i||² − Σ_{j=1}^k w_j^t S_X w_j.

Observe that the matrix S_X = Σ_{i=1}^n X_i X_i^t is symmetric and nonnegative definite. Since we have transformed the data set into new coordinates such that the sample means are equal to zero, S_X may in fact be rewritten as

S_X = Σ_{i=1}^n (X_i − X̄)(X_i − X̄)^t.

Observe that (1/(n−1)) S_X is the sample covariance matrix of the data set. As the first term on the right-hand side of (21) does not depend on the choice of the subspace W, the original problem of finding the subspace that minimizes Σ_{i=1}^n d(X_i, W)² translates into finding k orthonormal vectors w_1, ..., w_k that maximize

Σ_{j=1}^k w_j^t S_X w_j.

This is exactly the maximum problem whose solution we have presented above in the section on eigenvalues and eigenvectors. We determine the eigenvalues of the matrix S_X and list them in decreasing order, i.e. λ_1 ≥ λ_2 ≥ ... ≥ λ_k ≥ λ_{k+1} ≥ ... ≥ λ_p. The maximum problem is then solved by any orthonormal set w_1, ..., w_k of eigenvectors associated to the k largest eigenvalues λ_1, ..., λ_k. With this choice of w_1, ..., w_k, we get w_j^t S_X w_j = λ_j and thus

Σ_{i=1}^n d²(X_i, W) = Σ_{i=1}^n ||X_i||² − λ_1 − ... − λ_k.

In the special case k = p, we get W = R^p and thus d²(X_i, W) = 0. Hence we obtain the following important identity:

Σ_{i=1}^n ||X_i||² = λ_1 + ... + λ_p.


Note again that X_i has been transformed by subtracting off the means. Hence the left-hand side is the total variation in the data set, which by the above identity is equal to the sum of the eigenvalues of the matrix S_X. Moreover, we obtain

1 − ( Σ_{i=1}^n d²(X_i, W) ) / ( Σ_{i=1}^n ||X_i||² ) = (λ_1 + ... + λ_k) / (λ_1 + ... + λ_p).

The ratio (λ_1 + ... + λ_k)/(λ_1 + ... + λ_p) is thus a measure of how well the data vectors are approximated by W. If we want to fix the dimension k, this quotient often serves as a criterion; e.g., one may require that it should be at least 80%.

Definition 7.2. The eigenvectors of the matrix S_X are called the principal components of the data set X_1, ..., X_n.

Note that for 1 ≤ j ≤ p the following identities hold:

Σ_{i=1}^n (w_j^t X_i)² = w_j^t S_X w_j = λ_j.

Thus λ_j gives the variation of the data in the direction of the j-th principal component w_j, and the decomposition Σ_{i=1}^n ||X_i||² = λ_1 + ... + λ_p is a decomposition of the total variation into the directions of the principal components. The first principal component is the direction in which the data have maximal variation; the second principal component gives the orthogonal direction with maximal variation, and so on.

We now return to the approximation by projection onto a low-dimensional subspace. If we fix dim(W) = k, we have the following formula for the projection:

P_W(x) = Σ_{j=1}^k (w_j^t x) w_j.

If k is sufficiently large, P_W(X_i) will be a good approximation to X_i; the total approximation error taken over all data vectors is, as we have seen above, Σ_{j=k+1}^p λ_j. We thus obtain

X_i ≈ Σ_{j=1}^k (w_j^t X_i) w_j.

This equation can also be written in vector-matrix notation. We define the matrix L := (w_1 ... w_k) ∈ R^{p×k} and the vectors

f_i = (w_1^t X_i, ..., w_k^t X_i)^t ∈ R^k,

and then obtain X_i ≈ P_W(X_i) = L f_i. The vector f_i is called the factor vector, its components are called factors. The matrix L is called the factor loading matrix, and the entries of L are called the factor loadings.
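Collecting the steps (centering, forming S_X, eigendecomposition in decreasing order, projection onto the first k principal components), here is a complete sketch in Python with NumPy on synthetic data - every numerical choice below is ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
# n = 200 observations in R^p with p = 3; most variation lies along one direction
X = rng.normal(size=(200, 1)) @ np.array([[3.0, 1.0, 0.5]]) \
    + 0.2 * rng.normal(size=(200, 3))

Xc = X - X.mean(axis=0)            # center the data first
S = Xc.T @ Xc                      # S_X = sum_i X_i X_i^t (centered vectors)

eigvals, U = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]  # decreasing order lambda_1 >= ... >= lambda_p
eigvals, U = eigvals[order], U[:, order]

k = 1
L = U[:, :k]                       # factor loading matrix
F = Xc @ L                         # factor vectors f_i (as rows)
approx = F @ L.T                   # rank-k approximation P_W(X_i)

# total variation equals lambda_1 + ... + lambda_p
assert np.isclose((Xc ** 2).sum(), eigvals.sum())
# the residual sum of squares equals lambda_{k+1} + ... + lambda_p
assert np.isclose(((Xc - approx) ** 2).sum(), eigvals[k:].sum())

explained = eigvals[:k].sum() / eigvals.sum()
print(round(explained, 3))
```

With this construction the explained-variation ratio is close to 1 already for k = 1, so the 80% criterion mentioned above would accept a one-dimensional approximation.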

CHAPTER 8

Extreme Value Theory


In many applications in engineering, extreme stresses on a construction play a crucial role, much more so than average stresses. For the correct assessment of the height of a dam one has to know the distribution of the maximal water level of the river. The same holds for the stress that is imposed on a bridge by wind forces and by the number of cars passing over the bridge. Such extrema follow very special probability distributions, which we will introduce in this chapter. A special and particularly difficult feature of extreme value statistics is the fact that we are typically interested in probabilities in a range of values that have never been observed before. When the government of The Netherlands determined the necessary height of sea dikes after the disastrous flood of 1953, it was decided that the dikes should be so high that a flooding would occur with a probability of 1/10000 per year. At the same time, historic records of maximal tides existed for only a few hundred years. This feature makes extreme value statistics very special and should also be a reason for very careful application with lots of safety margins.

1. Extreme Value Distributions

Let X_1, ..., X_n be independent and identically distributed random variables with distribution function F(x) = P(X_1 ≤ x). Then M_n := max(X_1, ..., X_n) has the distribution function

P(M_n ≤ x) = P(X_1 ≤ x, ..., X_n ≤ x) = P(X_1 ≤ x) ⋯ P(X_n ≤ x) = (F(x))^n.

Using this formula, we can calculate the maximal stress over a period of n time points, provided we know the stress at a single time point, and the stresses are independent. The above formula forms the basis for theoretical considerations about the class of possible extreme value distributions. There are theorems which state that, up to rescaling and centering, there is a small number of possible extreme value distributions. These theorems are analogues of the central limit theorem for sums of independent random variables, but now for maxima.
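The identity P(M_n ≤ x) = (F(x))^n is easy to confirm by simulation; here is a sketch with Uniform(0,1) variables (our illustrative choice), for which F(x) = x:

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps, x = 5, 200_000, 0.7

# maxima of n independent Uniform(0,1) variables; theory: P(M_n <= x) = x**n
maxima = rng.random((reps, n)).max(axis=1)
empirical = (maxima <= x).mean()
theoretical = x ** n

print(empirical, theoretical)
assert abs(empirical - theoretical) < 0.01
```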
The central limit theorem in its most general form says that the only limit distribution of sums of a large number of independent and individually small random variables is the normal distribution. In the same spirit, the extreme value limit theorem states that the limit distribution of maxima of a large number of independent random variables belongs to a class of three possible extreme value distributions. The techniques that we will introduce in this chapter can be used for the modelling of extreme values. Our data will be extrema of certain phenomena, e.g. the maximal water level of a river at a given point during a given year. We have several data which we assume to be independent. The latter assumption only holds if the time lag between data collections is large enough. In the example of the water level, annual maxima may safely be assumed to be independent, while for daily levels this seems doubtful.

In the context of extreme values, we are not only interested in the distribution function F(x) := P(X ≤ x) but even more in the survival function F̄(x) := P(X > x), i.e. F̄(x) = 1 − F(x). Note that for continuous distributions we have F̄(x) = P(X ≥ x). The crossing of a large x-value often corresponds to a catastrophic event - the water level exceeds the height of a dike, the capacity of a dam is exceeded, the maximal load of a bridge is exceeded. Thus, we would like to know the probability of such an event, respectively we are interested in the x-value that is not exceeded with a preset probability q. A related characteristic is the length of the return time interval. Repeating experiments with a success probability q over and over again, we will have on average one success every 1/q experiments. I.e., the mean time between two successes is 1/q. If we set the height of a dike at x, the mean return time between two floods is

T_x := 1/F̄(x).

In the case of the heights of sea dikes in The Netherlands, the political decision was made that T_x should be at least 10000 years, leaving the statisticians with the task to calculate the corresponding height x. We will first introduce some of the most important extreme value distributions.

Log-normal Distribution. We have the two- and the three-parameter log-normal distribution, denoted by the symbols LN2 and LN3, respectively. The random variable X has a two-parameter log-normal distribution with parameters (μ, σ²) if Y := ln(X) has a N(μ, σ²)-distribution. Using the transformation formula for densities, we can calculate the density of the LN2(μ, σ²)-distribution. We get

(22) f(x) = 1/(√(2π) σ x) · exp( −(ln(x) − μ)² / (2σ²) ), x > 0.

The first two moments of a log-normal distribution are given by

(23) μ_x = E(X) = exp(μ + σ²/2),
(24) σ_x² = Var(X) = μ_x² (e^{σ²} − 1).
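The moment formulas (23) and (24) can be checked by simulating directly from the definition X = exp(Y); a short sketch (the parameter values are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.5, 0.4

# X = exp(Y) with Y ~ N(mu, sigma^2) has the LN2(mu, sigma^2)-distribution
X = np.exp(rng.normal(mu, sigma, size=1_000_000))

mean_formula = np.exp(mu + sigma ** 2 / 2)                  # (23)
var_formula = mean_formula ** 2 * (np.exp(sigma ** 2) - 1)  # (24)

print(X.mean(), mean_formula)
assert abs(X.mean() - mean_formula) < 0.01
assert abs(X.var() - var_formula) < 0.02
```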

We obtain the three-parameter log-normal distribution by shifting the two-parameter log-normal distribution by a fixed value x_0. The random variable X has a three-parameter log-normal distribution with parameters (x_0, μ, σ²) if Y := ln(X − x_0) has a N(μ, σ²)-distribution.

[Figure 1. Density (left) and Distribution Function (right) of a Log-normal Distribution with Parameters μ = 1, σ = 2.]

The density of the LN3(x_0, μ, σ²)-distribution is given by

(25) f(x) = 1/(√(2π) σ (x − x_0)) · exp( −(ln(x − x_0) − μ)² / (2σ²) ), x > x_0.

The first two moments of the three-parameter log-normal distribution are given by

(26) μ_x = E(X) = x_0 + exp(μ + σ²/2),
(27) σ_x² = Var(X) = e^{2μ+σ²} (e^{σ²} − 1).
The quantiles of the log-normal distribution can be calculated easily from the quantiles of the normal distribution. If z_p denotes the upper p-th quantile of the standard normal distribution, the corresponding quantile of the LN3(x_0, μ, σ²)-distribution is given by

x_p = x_0 + e^{σ z_p + μ}.

If the parameters of the log-normal distribution are known, one may calculate the quantiles with the help of this formula.

Pearson III Distribution. As for the log-normal distribution, there is also a two- and a three-parameter version of the Pearson distribution. The two-parameter Pearson(r, λ)-distribution is simply the Gamma-distribution with these parameters. Its density is given by

(28) f(x) = (λ^r / Γ(r)) x^{r−1} e^{−λx}, x ≥ 0,

where λ > 0, r > 0. In the special case when r = 1 we obtain the exponential distribution with density f(x) = λ e^{−λx}. The first two moments of a Pearson II distribution are given by

(29) μ_x = r/λ,
(30) σ_x² = r/λ².

[Figure 2. Density (left) and Distribution Function (right) of a Pearson II-Distribution with Parameters r = 2, λ = 1.]

We obtain the Pearson III distribution from the Pearson II distribution by a shift by x_0. The density of the PIII(x_0, r, λ)-distribution is given by

(31) f(x) = (λ^r / Γ(r)) (x − x_0)^{r−1} e^{−λ(x−x_0)}, x ≥ x_0.

The first two moments of the PIII(x_0, r, λ)-distribution are given by

(32) μ_x = E(X) = x_0 + r/λ,
(33) σ_x² = Var(X) = r/λ².

[Figure 3. Density (left) and Distribution Function (right) of a Pearson III-Distribution with Parameters x_0 = 3, r = 2, λ = 1.]

Weibull Distribution. Also for the Weibull distribution there exist a two- and a three-parameter version. The two-parameter Weibull distribution is usually given by its distribution function. For the parameter values (λ, s), the distribution function is

(34) F(x) = 1 − exp(−λ x^s), x ≥ 0.

Again, we get the three-parameter version by a shift. The corresponding distribution function becomes

F(x) = 1 − exp(−λ (x − x_0)^s), x ≥ x_0.

[Figure 4. Density (left) and Distribution Function (right) of a Weibull Distribution with Parameters s = 2, λ = 10.]

Gumbel Distribution. The Gumbel distribution, sometimes called the extreme value distribution, is the most important member of the class of distributions that can occur as limits of suitably transformed maxima of independent random variables. We first introduce the standard Gumbel distribution, given by the distribution function

F(y) = e^{−e^{−y}}.

We have the following formulas for the expected value and the variance:

E(Y) = γ ≈ 0.5772,
Var(Y) = π²/6;

here γ denotes the Euler constant. The random variable X has a Gumbel distribution with parameters (x_0, α) if Y := α(X − x_0) has a standard Gumbel distribution. A Gumbel(x_0, α)-distributed random variable thus has the distribution function

F(x) = P(X ≤ x) = P(α(X − x_0) ≤ α(x − x_0)) = e^{−e^{−α(x−x_0)}}

and the density function

f(x) = α e^{−α(x−x_0)} e^{−e^{−α(x−x_0)}}.

Expected value and variance of the Gumbel(x_0, α)-distribution may be calculated from the corresponding characteristics of the standard Gumbel distribution. We get

X = x_0 + Y/α

and thus

E(X) = x_0 + γ/α, Var(X) = π²/(6α²).


[Figure 5. Density of a Gumbel distribution with Parameters x_0 = 0, α = 1 (left) and α = 2 (right).]

2. Inferential Statistics for Extreme Value Distributions

We just list some standard techniques; effectively these have been introduced and applied in earlier chapters in connection with other classes of distributions:
- Empirical Distribution Function
- Histogram
- Maximum Likelihood Estimator
- Method of Moments
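As an illustration of how a fitted extreme value model is used, consider the return time T_x = 1/F̄(x) for a Gumbel(x_0, α)-distribution. Solving F̄(x) = 1/T for x gives x = x_0 − ln(−ln(1 − 1/T))/α; this closed form is our own derivation, and the parameter values below are invented:

```python
import math

def gumbel_cdf(x, x0, alpha):
    # F(x) = exp(-exp(-alpha (x - x0)))
    return math.exp(-math.exp(-alpha * (x - x0)))

def return_level(T, x0, alpha):
    """Height x whose mean return time 1 / (1 - F(x)) equals T."""
    return x0 - math.log(-math.log(1.0 - 1.0 / T)) / alpha

x0, alpha = 2.0, 1.5          # hypothetical fitted parameters for annual maxima
x = return_level(10_000, x0, alpha)

print(x)
assert abs(1.0 / (1.0 - gumbel_cdf(x, x0, alpha)) - 10_000) < 0.01
```

In the dike example this is exactly the computation the statisticians were left with: the political target T = 10000 years determines the design height x.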

CHAPTER 9

Introduction to Time Series Analysis


This chapter is devoted to the analysis of time series, i.e. data x_1, ..., x_n collected at subsequent time points t = 1, ..., n. We regard the data as realizations of random variables X_1, ..., X_n. In contrast with the assumptions in the previous chapters, these random variables may not be assumed to be independent. Quite to the contrary, the random variables are heavily dependent, and it is in fact the dependence between values observed at different points in time that is of great interest to us. E.g., the dependence makes it possible to predict future values of the time series from past data.

1. Trend, Seasonal Effects and Stationary Time Series

The stochastic model for a time series is a stochastic process (X_t)_{t≥1}; this is simply a sequence of random variables defined on the same probability space Ω. In principle, this sequence is infinitely long - in practice this is just a mathematical idealization. We will always observe just a finite segment of the time series, X_1, ..., X_n. The following definition is for a variety of reasons crucial, both for the theoretical development as well as for the practical statistical analysis of time series.

Definition 9.1. The time series (X_t)_{t≥1} is called stationary if the joint distribution of any segment of variables of length k, i.e. of (X_{m+1}, ..., X_{m+k}), does not depend on m.

Stationarity of a time series allows us to make statistical inference concerning the joint distribution of segments of random variables of any given length k. The vectors (X_1, ..., X_{k+1}), (X_2, ..., X_{k+2}), ... all have the same distribution, and thus the corresponding vectors of subsequent observations (x_1, ..., x_{k+1}), (x_2, ..., x_{k+2}), ... may be viewed as realizations of identically distributed random vectors. However, we have to face the challenge that these vectors are not independent. When k = 2, the pairs (x_1, x_2), (x_2, x_3), (x_3, x_4), ... are realizations of random variables with the same distribution as (X_1, X_2). This opens the chance to estimate any parameters of the joint distribution of (X_1, X_2), and therefore, by stationarity, of any other pair (X_i, X_{i+1}). Such parameters are, e.g., the covariance and the correlation coefficient of any two subsequent observations.

Most time series in real life are, however, not stationary. We first have to remove sources of nonstationarity from the data before we can apply methods developed for the analysis of stationary time series. There are mainly two effects that disturb stationarity - the trend and seasonal effects. Before we define what we mean by

these effects, we give them names: m_t for the value of the trend at time t and s_t for the seasonal effect. The standard model of time series analysis assumes that the time series may be expressed as a sum of trend, seasonal effect and a stationary time series (Y_t)_{t≥1} with E(Y_t) = 0:

(35) X_t = m_t + s_t + Y_t.

Of the three terms on the right-hand side, we have already defined what a stationary process is. The seasonal component is also easily explained. A seasonal effect with period d must satisfy the identity s_{t+d} = s_t for all t ≥ 1. E.g., if we have monthly data with a yearly seasonal effect, then d = 12. Without an additional requirement, the seasonal component cannot be identified from the decomposition (35). E.g., if we replace s_t by s_t + a and at the same time m_t by m_t − a, we get the same final result, but different components in (35). Thus we make the additional restriction that

(36) Σ_{t=1}^d s_t = 0.

The mean effect of the seasonal component is thus set equal to zero. A formal definition of the trend is more difficult. The trend is a slowly changing term that describes the long-term development of the mean of the process. Roughly speaking, the trend function contains the slowly moving component of the time series, while the stationary process Y_t contains the rapidly fluctuating components. In practice, it is often difficult to distinguish a long-term trend from segments where the process purely as a result of chance exhibits trend-like behavior.

Estimation and Removal of Trend and Seasonal Component. At the beginning of every time series analysis, we have to remove the trend and the seasonal component, in order to be able to treat the remaining process with the statistical techniques available for stationary processes. There are various procedures available. The choice depends partly on the question whether the original time series contains a trend or a seasonal component, or both.

Trend Estimation by Least Squares Regression: The first technique of trend estimation starts from a parametric model for the trend function, e.g. a second degree polynomial of the form m_t = α + β t + γ t², where α, β and γ are unknown parameters. We may then estimate these parameters using the least squares technique, i.e. by minimizing

Σ_{t=1}^N (X_t − α − β t − γ t²)².

We have seen in the chapter on regression analysis how to solve this minimization problem. The second degree polynomial is just an example. We can fit any model of the type

m_t = Σ_{j=1}^p β_j φ_j(t),

where φ_1, ..., φ_p are known basis functions and where β_1, ..., β_p are unknown parameters that can be estimated by the least squares technique.

Trend Estimation by Smoothing. The procedure explained above, i.e. modeling the trend by a parametric family of functions, is only feasible if we may assume that the functional form of the trend remains constant over the entire period during which we observe the time series. In many applications, this assumption will be unrealistic. Alternative techniques estimate the trend by local smoothing. The simplest smoothing technique is the moving average. Given a bandwidth 2k + 1, we estimate the trend function at time t locally by the arithmetic mean of the 2k + 1 observations X_{t−k}, ..., X_{t+k}, i.e.

m̂_t = (1/(2k+1)) (X_{t−k} + ... + X_{t+k}).

Simple moving average smoothing works when we have a time series without a seasonal component, i.e. when X_t = m_t + Y_t. In this case we get

m̂_t = (1/(2k+1)) (X_{t−k} + ... + X_{t+k}) = (1/(2k+1)) (Y_{t−k} + ... + Y_{t+k}) + (1/(2k+1)) (m_{t−k} + ... + m_{t+k}).

The first term on the right-hand side is a mean of random variables that have expected value 0. Thus their arithmetic mean will be close to 0, since positive and negative values cancel on average - one can show that this remains true although the variables are not independent. Of course, the effect that the mean of random variables that individually have expected value 0 is small will only be visible if the window is wide enough. The second term on the right-hand side is a mean of subsequent trend values. As the trend is assumed to be slowly varying, we have

(1/(2k+1)) (m_{t−k} + ... + m_{t+k}) ≈ m_t.

Generally, taking a moving average of a time series has a smoothing effect; the short term fluctuations disappear and the long term trend becomes more clearly visible. Rather than taking the simple arithmetic mean of subsequent observations, one can also take a weighted mean of subsequent observations, i.e.

m̂_t = Σ_{i=−k}^k a_i X_{t+i},

where the a_i are weights that satisfy Σ_{i=−k}^k a_i = 1. Generally, the weights will be smaller towards the ends of the time window, thus decreasing the influence of observations further away from t. There are situations when it is desirable to take one-sided moving averages only, because in this way we avoid the effect that, in order to estimate the trend at time t, we need to look into the future of the process. This is especially an issue in forecasting problems where, given the data X_1, ..., X_n, we want to predict the next value X_{n+1}. A popular one-sided moving average is defined by

m̂_t = Σ_{j=0}^∞ α (1 − α)^j X_{t−j},

where α is a smoothing parameter that still has to be chosen. The above procedure is called exponential smoothing. In the end, i.e. after having estimated the trend function m_t, we study the process Ŷ_t = X_t − m̂_t, where the trend has been removed from the data. The further analysis of the process (Ŷ_t) is carried out by techniques of time series analysis of stationary data that will be the topic of the next two sections.

At the end of this paragraph, we want to make a definition that puts the procedures that we have introduced above into a more general perspective. In all cases, we have assigned to the original time series (X_t)_{t≥1} a new time series

X̃_t := Σ_{i=−∞}^∞ a_i X_{t−i}, t ≥ 1.

The operation that assigns to the process (X_t)_{t≥1} the new process (X̃_t)_{t≥1} is called a linear filter.

Differencing Operator: A different procedure for trend removal, which may lead directly to a stationary process, applies the differencing operator ∇, defined by

∇X_t = X_t − X_{t−1}

for t ≥ 2, and ∇X_1 = X_1. Observe that the ∇-operator assigns to a given time series (X_t)_{t≥1} the new series (∇X_t)_{t≥1}. Differencing is in fact a special kind of filter with weights a_0 = 1 and a_1 = −1. We may apply the ∇-operator several times; e.g. two applications yield

∇²X_t = ∇X_t − ∇X_{t−1} = X_t − 2X_{t−1} + X_{t−2}.

If the original data contain a linear trend, this is removed by a single application of the differencing operator. If the trend function is a polynomial of degree p, we obtain after p iterations of the differencing operator a stationary process.

Removal of a Seasonal Component. We now consider the case that the time series has only a seasonal component, but no trend; i.e. we have the model

X_t = μ + s_t + Y_t,

where s_{t+d} = s_t for any t ≥ 1, and Σ_{i=1}^d s_i = 0. Moreover, μ ∈ R is the long-term mean of the time series and (Y_t)_{t≥1} is a stationary process with expected value zero. We first estimate μ by the arithmetic mean of all observations,

μ̂ = (1/n) Σ_{i=1}^n X_i,

and remove the mean from the data. Thus we obtain the new process Z_t = X_t − μ̂. Since μ̂ ≈ μ, we get

Z_t ≈ s_t + Y_t.
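The detrending procedures above - moving average, exponential smoothing and differencing - can be sketched in a few lines of Python with NumPy (our choice of tool; the trend functions below are invented test cases):

```python
import numpy as np

def moving_average(x, k):
    """Centered moving average (X_{t-k} + ... + X_{t+k}) / (2k+1),
    returned only for t where the full window fits."""
    return np.convolve(x, np.ones(2 * k + 1) / (2 * k + 1), mode="valid")

def exponential_smoothing(x, alpha):
    """One-sided smoothing m_t = alpha * sum_j (1-alpha)^j X_{t-j},
    computed via the recursion m_t = alpha X_t + (1-alpha) m_{t-1}."""
    m = np.empty(len(x))
    m[0] = x[0]
    for t in range(1, len(x)):
        m[t] = alpha * x[t] + (1 - alpha) * m[t - 1]
    return m

t = np.arange(50, dtype=float)
x = 2.0 + 3.0 * t                       # pure linear trend

# a symmetric moving average reproduces a linear trend exactly
assert np.allclose(moving_average(x, 3), x[3:-3])
# a single application of the differencing operator removes a linear trend
assert np.allclose(np.diff(x), 3.0)
# two applications turn a degree-2 polynomial trend into a constant
assert np.allclose(np.diff(1.0 + t + 0.5 * t ** 2, n=2), 1.0)

print(exponential_smoothing(x, 0.3)[:3])
```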


Since s_i = s_{i+d} = s_{i+2d} = ..., a natural idea is to estimate s_i, 1 ≤ i ≤ d, by the arithmetic mean of Z_i, Z_{i+d}, Z_{i+2d}, .... We assume that the number of observations is an integer multiple of the period, i.e. n = p · d. Then the estimator for s_i becomes

ŝ_i = (1/p) Σ_{j=0}^{p−1} Z_{i+jd}.

In order to ensure that the identifiability condition (36) holds, we finally take

s̃_i = ŝ_i − (1/d) Σ_{t=1}^d ŝ_t.
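The seasonal estimates can be sketched as follows (NumPy; period, cycle count and the true seasonal pattern below are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
d, p = 12, 10                                # period and number of full cycles
n = p * d
s = np.sin(2 * np.pi * np.arange(d) / d)     # true seasonal component, sums to 0
mu = 5.0
x = mu + np.tile(s, p) + 0.1 * rng.normal(size=n)

mu_hat = x.mean()
z = x - mu_hat
# average Z_i, Z_{i+d}, Z_{i+2d}, ... for each position i within the period
s_hat = z.reshape(p, d).mean(axis=0)
# enforce the identifiability condition (36): the estimates must sum to 0
s_hat = s_hat - s_hat.mean()

assert abs(s_hat.sum()) < 1e-9
assert np.max(np.abs(s_hat - s)) < 0.15
print(np.round(s_hat, 2))
```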

Removal of Trend and Seasonal Component. If the data contain both a trend and a seasonal component, we have to take special precautions while estimating them. One usually begins with a trend estimation via moving averages, with a window length adapted to the period of the seasonal component. A moving average smoothing of the process (35) yields

(1/(2k+1)) Σ_{i=−k}^k X_{t+i} = (1/(2k+1)) Σ_{i=−k}^k m_{t+i} + (1/(2k+1)) Σ_{i=−k}^k s_{t+i} + (1/(2k+1)) Σ_{i=−k}^k Y_{t+i}.

If k is chosen so that 2k + 1 = d, we get Σ_{i=−k}^k s_{t+i} = 0 by the periodicity of s and (36). Thus we get for this choice of k that

m̂_t := (1/(2k+1)) Σ_{i=−k}^k X_{t+i} = (1/(2k+1)) Σ_{i=−k}^k m_{t+i} + (1/(2k+1)) Σ_{i=−k}^k Y_{t+i}

is close to m_t, by the arguments given above in connection with pure trend estimation. We have thus found a filter that estimates the trend while eliminating the seasonal component at the same time. The same argument works if the window length is any integer multiple of d, i.e. when 2k + 1 = m · d. If the period of the seasonal component is even, we cannot find a k such that 2k + 1 = d. In this case, we have to apply a variant of the simple moving average. Choosing k so that 2k = d, we define

m̂_t = (1/d) ( (1/2) X_{t−k} + Σ_{i=−k+1}^{k−1} X_{t+i} + (1/2) X_{t+k} ).

Having estimated the trend in any of the ways introduced above, we remove the trend by considering the new time series Z_t = X_t − m̂_t. Since m̂_t ≈ m_t, we get from (35) that Z_t ≈ s_t + Y_t. Thus the process (Z_t)_{t≥1} contains (approximately) no trend, but only a seasonal component. The seasonal component may then be estimated using the techniques explained above for trend-free time series.
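The even-period variant can be checked on synthetic data; the following sketch (period d = 4 and an invented seasonal pattern) confirms that the filter with half weights at the two window endpoints reproduces a linear trend exactly and annihilates the seasonal component:

```python
import numpy as np

def ma_even_period(x, d):
    """Centered moving average for even period d = 2k, with half weights
    on the two endpoints of the window of length d + 1."""
    w = np.ones(d + 1)
    w[0] = w[-1] = 0.5
    w /= d
    return np.convolve(x, w, mode="valid")

t = np.arange(60, dtype=float)
d = 4
s = np.array([1.0, -0.5, -1.0, 0.5])       # seasonal component, sums to 0
x = 0.2 * t + np.tile(s, 15)               # linear trend plus seasonality, no noise

m_hat = ma_even_period(x, d)
# the filter recovers the trend 0.2 t at the window centers t = 2, ..., 57
assert np.allclose(m_hat, 0.2 * np.arange(2, 58))
print(m_hat[:4])
```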


Differencing Operator for Seasonal Components. An alternative technique for the removal of a seasonal component is differencing at lag d; we define the corresponding operator ∇_d by

∇_d X_t = X_t − X_{t−d}.

Since s_t − s_{t−d} = 0, a single application of the ∇_d operator removes the seasonal component. Next, the resulting time series may be analyzed using the techniques introduced above for time series that have no seasonal component.

2. Statistical Analysis of Stationary Processes

In the rest of this chapter, we will always assume that we have a stationary time series (X_t)_{t≥1}. In practice, we arrive here by removing the trend and the seasonal component from the original data. In this section, we follow the style of earlier chapters, by first presenting techniques of descriptive statistics, then introducing stochastic models for stationary time series and finally studying statistical inference for parameter estimation.

Graphical Techniques of Descriptive Statistics. In the first instance, one should always look at the so-called time series plot; this is the plot of the observations X_1, …, X_n as a function of time. A second plot of interest shows the points (X_1, X_{h+1}), …, (X_{n−h}, X_n) for different values of h; typically some small values of h are interesting. These plots reveal the dependence of two observations taken at lag h.

Numerical Characterizations of Time Series. The very first summary statistic is the arithmetic mean of the data set,

X̄ := (1/n) Σ_{i=1}^{n} X_i.

X̄ is an unbiased estimator of the expected value of X_1, since by stationarity

E(X̄) = (1/n) Σ_{i=1}^{n} E(X_i) = E(X_1).

A numerical summary of the degree of dependence of data at a fixed lag h is given by the sample covariance and the sample correlation coefficient of the points (X_t, X_{t+h}), 1 ≤ t ≤ n−h. These sample statistics are estimates of the covariance and the correlation coefficient of X_t and X_{t+h}; note that by stationarity of the process, both do not depend on t, but only on h.

Definition 9.2. Let (X_t)_{t≥0} be a stationary time series. We define the autocovariance function γ(h) and the autocorrelation function ρ(h), h ∈ Z, by

γ(h) = Cov(X_t, X_{t+h}),
ρ(h) = γ(h)/γ(0).

γ(h) and ρ(h) are called lag-h autocovariance and autocorrelation, respectively.


Remark 9.3. Note that Var(X_t) = Var(X_{t+h}) by stationarity of the time series. Moreover, γ(0) = Var(X_t), and hence we obtain the following identity for the correlation coefficient of X_t and X_{t+h}:

ρ_{X_t, X_{t+h}} = Cov(X_t, X_{t+h}) / √(Var(X_t) Var(X_{t+h})) = γ(h)/γ(0).

I.e., ρ(h) equals the correlation coefficient of X_t and X_{t+h}. Next we define sample analogues of the autocovariance and autocorrelation function.

Definition 9.4. The sample autocovariance function γ̂(h) and the sample autocorrelation function r(h) are defined by

γ̂(h) := (1/(n−h)) Σ_{i=1}^{n−h} (X_i − X̄_n)(X_{i+h} − X̄_n),
r(h) := γ̂(h)/γ̂(0).

Observe that up to a small boundary effect at the ends of the sample, this is equal to the correlation coefficient of the points (X_t, X_{t+h}), 1 ≤ t ≤ n−h.

Remark 9.5. Since we usually study the sample autocovariance only at lags h that are small compared to the sample size n, we sometimes replace the n−h in the denominator by n. If we do so, we obtain a simpler formula for the autocorrelation function:

r(h) = Σ_{i=1}^{n−h} (X_i − X̄_n)(X_{i+h} − X̄_n) / Σ_{i=1}^{n} (X_i − X̄)².
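Definition 9.4 and Remark 9.5 translate directly into code. The following sketch (assuming NumPy; the function name is my own) uses the n-denominator variant of Remark 9.5 and checks it on simulated white noise, where r(h) for h ≠ 0 should be close to zero.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocovariance gamma_hat(h) and autocorrelation r(h),
    with n (rather than n - h) in the denominator as in Remark 9.5."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    gamma = np.array([np.dot(xc[:n - h], xc[h:]) / n
                      for h in range(max_lag + 1)])
    return gamma, gamma / gamma[0]

rng = np.random.default_rng(0)
z = rng.normal(size=5000)       # white noise: r(h) should be near 0 for h > 0
gamma_hat, r = sample_acf(z, 5)
# For white noise, sqrt(n) * r(h) is approximately N(0, 1), so values
# within about +/- 1.96 / sqrt(n) are unremarkable.
```

By construction r(0) = 1, and for the white noise sample above the remaining lags stay within the usual confidence band around zero.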

Stochastic Models for Stationary Time Series. There are various ways to define models for stationary time series. In this section, we will introduce some models that are of great importance in all of time series analysis.

Definition 9.6. (i) The process (Z_t)_{t∈Z} is called a white noise process if the random variables Z_t, t ∈ Z, are independent and N(0, σ²)-distributed. We use the symbol (Z_t)_{t≥1} ∼ WN(0, σ²) to indicate that (Z_t)_{t≥1} is a white noise process.

(ii) The process (X_t)_{t≥1} is called a moving average process of order q, short MA(q)-process, if we can find constants θ_1, …, θ_q ∈ R and a white noise process (Z_t)_{t≥1} such that

(37) X_t = Z_t + θ_1 Z_{t−1} + … + θ_q Z_{t−q}.

(iii) The process (X_t)_{t≥1} is called an autoregressive process of order p, short AR(p)-process, if we can find constants φ_1, …, φ_p ∈ R and a white noise process (Z_t)_{t∈Z} such that

(38) X_t = φ_1 X_{t−1} + … + φ_p X_{t−p} + Z_t.

(iv) The process (X_t)_{t≥1} is called an ARMA(p, q)-process of order (p, q) if we can find constants φ_1, …, φ_p ∈ R, θ_1, …, θ_q ∈ R and a white noise process (Z_t)_{t∈Z} such that

(39) X_t − φ_1 X_{t−1} − … − φ_p X_{t−p} = Z_t + θ_1 Z_{t−1} + … + θ_q Z_{t−q}.
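The model classes of Definition 9.6 can be simulated directly from equations (37) and (38). A sketch (function names and the burn-in length are my own choices, not from the notes):

```python
import numpy as np

def simulate_ma(theta, n, sigma=1.0, rng=None):
    """MA(q): X_t = Z_t + theta_1 Z_{t-1} + ... + theta_q Z_{t-q}, eq. (37)."""
    rng = rng or np.random.default_rng(0)
    coef = np.concatenate(([1.0], theta))       # theta_0 = 1
    z = rng.normal(scale=sigma, size=n + len(theta))
    return np.convolve(z, coef, mode="valid")   # length n

def simulate_ar(phi, n, sigma=1.0, burn=500, rng=None):
    """AR(p): X_t = phi_1 X_{t-1} + ... + phi_p X_{t-p} + Z_t, eq. (38).

    A burn-in stretch is discarded so that, for stationary coefficients,
    the returned segment is approximately a draw from the stationary law.
    """
    rng = rng or np.random.default_rng(1)
    p = len(phi)
    phi = np.asarray(phi, dtype=float)
    x = np.zeros(n + burn + p)
    z = rng.normal(scale=sigma, size=len(x))
    for t in range(p, len(x)):
        # most recent value first, matching the order of phi
        x[t] = np.dot(phi, x[t - p:t][::-1]) + z[t]
    return x[p + burn:]

# AR(1) with phi = 0.5: stationary variance sigma^2 / (1 - phi^2) = 4/3
# and lag-1 autocorrelation equal to phi = 0.5.
x = simulate_ar([0.5], 20000)
```

Such simulations are a convenient sanity check: the sample moments of a long simulated path should match the theoretical quantities derived later in this section.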


It is important to have some intuition for the time series models introduced above. In the case of the white noise process, this is quite simple. We have independent, normally distributed observations, with no dependence between observations at different points in time; this is what we studied in the earlier lectures. Of course, this is a very unrealistic assumption for a real-life time series, and in the few cases where it might hold, we may forget about time series analysis and return to the methods introduced earlier for the analysis of independent data. While white noise processes do not have a lot of relevance in their own right, they are extremely important building blocks for more complex time series such as AR, MA and ARMA processes. In the context of AR, MA and ARMA models, the Z_t are called innovations. In the case of an MA(q)-process, the present value of the time series, X_t = Z_t + θ_1 Z_{t−1} + … + θ_q Z_{t−q}, is a linear combination of the present innovation as well as the past q innovations. One can imagine that the innovations model some underlying process of elementary events in the background that drives the time series which we observe. Events of the past influence the present observation. E.g., when q = 2, we have X_t = Z_t + θ_1 Z_{t−1} + θ_2 Z_{t−2}, i.e. the present value of the time series is a linear combination of today's innovation as well as the past two innovations. Autoregressive processes have a strong analogy with multiple linear regression models. Recall that in multiple linear regression we model the outcome of the i-th experiment as
Y_i = Σ_{j=1}^{p} x_{ij} β_j + ε_i,

where x_{ij}, 1 ≤ j ≤ p, are the values of the explanatory variables and β_1, …, β_p are unknown parameters. Formally, the AR(p)-equation (38) is identical to this multiple linear regression equation; the explanatory variables are the past observations X_{t−1}, …, X_{t−p}. There is, however, a subtle difference which also makes the statistical analysis of AR models more complicated than the analysis of multiple regression models. In the AR model, the explanatory variables are not independent variables, to be chosen freely by the experimenter, but are themselves random variables. The regression analogy provides helpful intuition for AR processes. The present value X_t is modeled as a linear combination of p past values and an innovation component Z_t. Without the innovation term, the AR(p)-equation (38) becomes a deterministic difference equation

x_t = φ_1 x_{t−1} + φ_2 x_{t−2} + … + φ_p x_{t−p},

which is a discrete analogue of a differential equation. The AR process may thus be regarded as a difference equation with random perturbations. In the case of AR processes, we face the non-trivial question whether the equation X_t = φ_1 X_{t−1} + … + φ_p X_{t−p} + Z_t has a stationary solution, i.e. whether there exists a stationary process (X_t)_{t≥1} satisfying this equation for all t ≥ p + 1. This is not always the case, as the


following simple example shows. We consider the AR(1)-process X_t = φ X_{t−1} + Z_t. By an iterated application of this equation, we get

X_t = φ X_{t−1} + Z_t = φ (φ X_{t−2} + Z_{t−1}) + Z_t = φ² X_{t−2} + φ Z_{t−1} + Z_t
    = … = φ^t X_0 + Σ_{j=0}^{t−1} φ^j Z_{t−j}.

The terms on the right-hand side converge as t → ∞ if and only if |φ| < 1. In this case, we obtain the representation

X_t = Σ_{j=0}^{∞} φ^j Z_{t−j},

i.e. the AR(1)-process has an MA(∞) representation. For higher order AR processes, one cannot give such a direct analysis. It turns out that the AR(p)-process has a stationary solution if and only if the complex polynomial

φ(z) := 1 − φ_1 z − … − φ_p z^p

does not have roots in the unit disc {z ∈ C : |z| ≤ 1}. Though AR-processes are more difficult than MA-processes as far as their definition is concerned, they are much easier to handle when it comes to their statistical analysis. Because of their analogy with multiple linear regression models, there are simple procedures for parameter estimation. Moreover, AR-processes are very easy to predict. If we have observed the past values x_{t−1}, …, x_{t−p}, the prediction for the (currently unknown) value X_t is given by

X̂_t := φ_1 x_{t−1} + … + φ_p x_{t−p}.

One can actually show that this is the best possible predictor of X_t, in the sense that it minimizes the mean square prediction error E(X_t − X̂_t)². Concerning the intuition of ARMA-processes, there is little that can be added to what has already been said about moving average and autoregressive processes. An ARMA process can best be viewed as an autoregressive process with moving average innovations. By allowing moving average innovations rather than white noise innovations, one can often considerably reduce the number of parameters needed in order to get a good fit of the data. E.g., an ARMA(2,1)-model, which has three parameters, might provide the same fit as an AR(5)-model, which has five parameters, not counting the unknown variance σ² of Z_t in both cases.

Parameter Estimation, Model Fitting. Estimating the parameters in a time series model is a difficult task. Most procedures do not allow an exact analytic solution, but require the use of numerical algorithms. To begin with, one has to distinguish two problems. Firstly, there is the estimation of the parameters φ_1, …, φ_p and/or θ_1, …, θ_q given the order p, q ∈ N. Secondly, one has to choose the right order p, q ∈ N. We will just cover some aspects in this section which can still be done by hand, at least partly. A first procedure is based on the method of moments. We choose that


model whose theoretical autocorrelations are equal to the sample autocorrelations of the data. In order to do this, we first need to calculate the autocovariance function of the different models introduced above.

Autocovariance Function of a White Noise Process: This is a simple task, as the observations at different points in time are assumed to be independent and thus uncorrelated. Hence we get

γ(h) = Cov(Z_t, Z_{t+h}) = σ² if h = 0, and γ(h) = 0 if h ≠ 0.

The autocorrelation function of a white noise process is thus given by ρ(0) = 1 and ρ(h) = 0 for h ≠ 0.

Test for White Noise Process: The above formula for the autocorrelation function provides the basis for a quick test whether a given process is a white noise process. We calculate and plot the sample autocorrelation function and check whether r(h) differs significantly from 0 for h ≠ 0. All statistical programs provide test boundaries in the plots. If the test is to be carried out by hand, one can apply the fact that, for a white noise process, √n r(h) is asymptotically N(0, 1)-distributed. This test of the hypothesis that a given process is a white noise process has wide applications in model checking. All the models that we introduced above have a white noise process as a building block. E.g., for the AR(p)-process we have X_t = φ_1 X_{t−1} + … + φ_p X_{t−p} + Z_t. We may rewrite this as Z_t = X_t − φ_1 X_{t−1} − … − φ_p X_{t−p}, which can be used, in combination with the above white noise test, as the basis for a test whether an AR(p)-process is an appropriate model. Specifically, we estimate the parameters of the AR-process by φ̂_1, …, φ̂_p and study the estimated innovations

Ẑ_t := X_t − φ̂_1 X_{t−1} − … − φ̂_p X_{t−p},  t = p + 1, …, n.

Subsequently, we test the hypothesis that (Ẑ_t) is a white noise process.

Autocovariance Function of an MA-Process: We consider the MA(q)-process X_t = Z_t + θ_1 Z_{t−1} + … + θ_q Z_{t−q}. In order to calculate the covariance of X_t and X_{t+h}, we write the two equations

X_t = Z_t + θ_1 Z_{t−1} + … + θ_q Z_{t−q},
X_{t+h} = Z_{t+h} + θ_1 Z_{t+h−1} + … + θ_q Z_{t+h−q}.

By bilinearity of the covariance operator, we can derive the covariance from our above formula for the autocovariance function of the white noise process. All terms of the type Cov(Z_i, Z_j) vanish when i ≠ j. Thus we get the formula

Cov(X_t, X_{t+h}) = σ² (θ_h + θ_1 θ_{h+1} + … + θ_{q−h} θ_q) if |h| ≤ q (with θ_0 := 1), and Cov(X_t, X_{t+h}) = 0 otherwise.

The characteristic property of an MA(q)-process is the fact that γ(h) = 0 for |h| > q. This fact may be used for a quick graphical test whether an observed time series can be modeled by an MA-process. At the same time, the empirical autocorrelation function gives a basis for estimating the order of the MA-process.
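The autocovariance formula above is easy to evaluate mechanically. The following sketch (my own helper, with θ_0 = 1 as in the formula) reproduces the MA(1) values γ(0) = σ²(1 + θ²), γ(1) = σ²θ, and γ(h) = 0 for |h| > 1.

```python
import numpy as np

def ma_autocovariance(theta, sigma2, h):
    """gamma(h) = sigma^2 (theta_h + theta_1 theta_{h+1} + ... + theta_{q-h} theta_q)
    for |h| <= q, with theta_0 := 1, and gamma(h) = 0 for |h| > q."""
    coef = np.concatenate(([1.0], theta))   # (theta_0, theta_1, ..., theta_q)
    h = abs(h)
    if h >= len(coef):
        return 0.0
    return sigma2 * float(np.dot(coef[:len(coef) - h], coef[h:]))

# MA(1) with theta = 0.8 and sigma^2 = 1:
gammas = [ma_autocovariance([0.8], 1.0, h) for h in range(3)]
```

The cut-off at lag q visible in `gammas` is exactly the graphical signature of an MA-process mentioned above.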


Yule-Walker Estimator: The Yule-Walker estimator for the unknown parameters of an ARMA(p, q)-process is a method of moments estimator. We choose the parameters of the ARMA model in such a way that the first (p + q) theoretical autocovariances are equal to the sample autocovariances. A practical implementation requires formulas for the autocovariance function of an ARMA-process, which one can find in any book on time series analysis. Since one cannot do the resulting calculations by hand, we will not present the formulas here.

Least Squares Estimators for AR-Processes: The parameters of an AR-process may be estimated with the help of the analogy with the multiple linear regression model, X_t = φ_1 X_{t−1} + … + φ_p X_{t−p} + Z_t. We get the least squares estimator by minimizing the sum of squares

Σ_{t=p+1}^{n} (X_t − φ_1 X_{t−1} − … − φ_p X_{t−p})².
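Minimizing this sum of squares is an ordinary least squares problem, so it can be handed to a standard linear solver. A sketch (assuming NumPy; the simulated example and the names are mine):

```python
import numpy as np

def fit_ar_least_squares(x, p):
    """Least squares estimates of (phi_1, ..., phi_p): regress X_t on
    (X_{t-1}, ..., X_{t-p}) for t = p+1, ..., n, as in the sum above."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Row for time t holds the lagged values (X_{t-1}, ..., X_{t-p}).
    design = np.column_stack([x[p - j:n - j] for j in range(1, p + 1)])
    target = x[p:]
    phi_hat, *_ = np.linalg.lstsq(design, target, rcond=None)
    return phi_hat

# Recover the parameter of a simulated AR(1) with phi = 0.6:
rng = np.random.default_rng(42)
x = np.zeros(5000)
for t in range(1, 5000):
    x[t] = 0.6 * x[t - 1] + rng.normal()
phi_hat = fit_ar_least_squares(x, 1)
```

For a long simulated path the estimate lands close to the true coefficient, illustrating why the regression analogy makes AR models so convenient to fit.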

In the case of an AR-process, the least squares estimator is almost identical to the Yule-Walker estimator.

Maximum Likelihood Estimator: For general ARMA(p, q)-processes, the parameters can be estimated by the maximum likelihood technique. The ML estimators are in some sense optimal estimators. Unfortunately, the form of the likelihood function is so complicated that the ML estimates cannot be calculated by hand, and thus one has to use numerical algorithms. These algorithms are built into any statistical software package.

3. Spectral Analysis of Time Series

In this section we will give a brief introduction to spectral analysis of a time series. The basic idea of spectral analysis is the decomposition of a time series into a linear combination of periodic sine and cosine signals. The calculations are easier if we use the complex-valued functions f(t) = e^{iλt} instead, noting that

e^{iλt} = cos λt + i sin λt,

where i = √(−1) denotes the imaginary unit. More generally, we can define for a complex number z = x + iy, x, y ∈ R, the complex exponential by e^z = e^x (cos y + i sin y). Note that all calculations with the complex exponential follow the same rules as for real exponentials. This refers, e.g., to the formula e^{z+w} = e^z e^w, valid for any complex numbers w, z ∈ C. We start with an elementary example of a time series that illustrates the main ideas.

Example 9.7. In this example we study a time series that can be expressed as a finite linear combination of periodic signals. Let λ_1, …, λ_l ∈ (−π, π] be given frequencies and let A_1, …, A_l be independent complex-valued random variables with E(A_j) = 0 and σ_j² := Var(A_j). For complex-valued random variables Z, W, we define

Cov(Z, W) = E[(Z − E(Z)) \overline{(W − E(W))}],


and consequently Var(Z) = E|Z − E(Z)|². We now define the time series

X_t = Σ_{j=1}^{l} A_j e^{iλ_j t}.

The expected value of X_t is zero. The autocovariance function of the process (X_t)_{t≥1} can be calculated as follows:

γ(h) = E(X_{t+h} \overline{X_t}) = Σ_{j=1}^{l} Σ_{k=1}^{l} E(A_j \overline{A_k}) e^{iλ_j (t+h)} e^{−iλ_k t} = Σ_{j=1}^{l} σ_j² e^{ihλ_j}.

Here we have used the fact that the amplitudes A_1, …, A_l are independent random variables, so that all the mixed covariances vanish. We see that the autocovariance function in the above example may be expressed as a linear combination of e^{ihλ_j}, 1 ≤ j ≤ l, where the λ_j are the frequencies of the harmonic functions that occur in the representation of X_t. The coefficients in the linear combination are the variances of the random amplitudes. Although the above example is a very simple case of a time series, it turns out that a similar representation as a linear combination of simple harmonic functions is always possible. In general, however, we will not have a finite linear combination, but an integral representation, both for the autocovariance function as well as for the process itself. The representation of a general time series as an integral of harmonic functions requires the theory of stochastic integrals and thus goes well beyond the scope of this lecture. In the following definition, we introduce the representation of the autocovariance function.

Definition 9.8 (Spectral Density Function). The function f : [−π, π] → [0, ∞) is called the spectral density function of the stationary time series (X_t)_{t≥1} if the autocovariance function γ(h) allows the following representation:

(40) γ(h) = ∫_{−π}^{π} e^{ihλ} f(λ) dλ,  h ∈ Z.

Remark 9.9. With the above definition, the spectral density function is defined via an implicit equation. In the special case when the autocovariance function is summable, there is an explicit formula for the spectral density:

f(λ) = (1/(2π)) Σ_{k=−∞}^{∞} e^{−ikλ} γ(k).

In order to verify this formula, we observe the following identity:

(1/(2π)) ∫_{−π}^{π} e^{ikλ} e^{−ilλ} dλ = 1 if k = l, and 0 if k ≠ l.

Thus

∫_{−π}^{π} e^{ihλ} f(λ) dλ = (1/(2π)) Σ_{k=−∞}^{∞} γ(k) ∫_{−π}^{π} e^{ihλ} e^{−ikλ} dλ = γ(h).
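For a summable autocovariance, the explicit formula of Remark 9.9 can be evaluated and the inversion (40) checked numerically. A sketch; the MA(1)-type autocovariance (γ(0), γ(1)) = (1.64, 0.8) used here is just an illustrative choice.

```python
import numpy as np

def spectral_density_from_acf(gamma, lam):
    """f(lambda) = (1/2pi) sum_h e^{-ih lambda} gamma(h), where gamma lists
    gamma(0), gamma(1), ..., gamma(q) and gamma(-h) = gamma(h), so the
    two-sided sum collapses to a cosine series."""
    gamma = np.asarray(gamma, dtype=float)
    h = np.arange(1, len(gamma))
    cos_part = 2.0 * np.sum(gamma[1:] * np.cos(np.outer(lam, h)), axis=1)
    return (gamma[0] + cos_part) / (2.0 * np.pi)

lam = np.linspace(-np.pi, np.pi, 2001)
f = spectral_density_from_acf([1.64, 0.8], lam)

# Numerical check of (40): integrating e^{ih lambda} f(lambda) over
# (-pi, pi] should return gamma(h). A Riemann sum over a full period of a
# smooth periodic integrand is extremely accurate here.
dlam = lam[1] - lam[0]
gamma0 = np.sum(f[:-1]) * dlam
gamma1 = np.sum((np.cos(lam) * f)[:-1]) * dlam
```

The recovered values agree with γ(0) and γ(1) up to quadrature error, and the computed density is nonnegative, as a spectral density must be.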

Example 9.10. Let (Z_n)_{n≥1} ∼ WN(0, σ²) be a Gaussian white noise process. In this case, γ(0) = σ² and γ(h) = 0 for h ≠ 0. The spectral density function is given by

f(λ) = σ²/(2π).


This claim is easily verified by showing that f satisfies (40). If we want to calculate spectral densities of further processes, especially of ARMA processes, we need a transformation formula for spectral densities under the application of filters. Let (X_t)_{t∈Z} be a time series and let (ψ_j)_{j∈Z} be a sequence of real or complex numbers. Applying the filter (ψ_j)_{j∈Z} to (X_t)_{t∈Z}, we obtain the new time series

Y_t := Σ_{j∈Z} ψ_j X_{t−j}.

Example 9.11. (i) In connection with the estimation of a trend, we considered the moving average of 2k + 1 subsequent data points,

Y_t := (1/(2k+1)) Σ_{j=−k}^{k} X_{t−j}.

In this case, we have applied the filter

ψ_j = 1/(2k+1) if −k ≤ j ≤ k, and ψ_j = 0 if |j| > k.

(ii) The difference operator ∇, introduced earlier for the removal of a trend, is a special filter, Y_t = X_t − X_{t−1}, given by the coefficients ψ_0 = 1, ψ_1 = −1 and ψ_j = 0 otherwise.

The following theorem shows how to derive the spectral density of a transformed time series from that of the original time series.

Theorem 9.12. Let (X_t)_{t∈Z} be a time series with spectral density f_X(λ), λ ∈ (−π, π], and let (ψ_j)_{j∈Z} be a filter with Σ_{j=−∞}^{∞} |ψ_j| < ∞. Then the transformed time series

Y_t := Σ_{j=−∞}^{∞} ψ_j X_{t−j}

has the spectral density

f_Y(λ) = |ψ(e^{−iλ})|² f_X(λ),

where ψ(z) := Σ_{j=−∞}^{∞} ψ_j z^j is the transfer function of the filter.

Proof. Consider the autocovariance function of the transformed time series (Y_t); we get

Cov(Y_{t+h}, Y_t) = Σ_{j,k∈Z} ψ_j \overline{ψ_k} Cov(X_{t+h−j}, X_{t−k})
= Σ_{j,k∈Z} ψ_j \overline{ψ_k} ∫_{−π}^{π} e^{i(h+k−j)λ} f(λ) dλ
= ∫_{−π}^{π} e^{ihλ} Σ_{j,k∈Z} ψ_j \overline{ψ_k} e^{ikλ} e^{−ijλ} f(λ) dλ
= ∫_{−π}^{π} e^{ihλ} | Σ_{j=−∞}^{∞} ψ_j e^{−ijλ} |² f(λ) dλ.

Observing that ψ(e^{−iλ}) = Σ_{j=−∞}^{∞} ψ_j e^{−ijλ}, the proof of the theorem is finished.
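The transfer function of a finite filter is straightforward to evaluate, which makes the theorem easy to apply in practice. A sketch (my own helper), checked against the difference filter of Example 9.11(ii), whose gain is |1 − e^{−iλ}|² = 2 − 2 cos λ:

```python
import numpy as np

def transfer_gain(psi, lam):
    """|psi(e^{-i lambda})|^2 for a finite causal filter (psi_0, psi_1, ...)."""
    lam = np.asarray(lam, dtype=float)
    values = sum(p * np.exp(-1j * j * lam) for j, p in enumerate(psi))
    return np.abs(values) ** 2

lam = np.linspace(-np.pi, np.pi, 201)
gain = transfer_gain([1.0, -1.0], lam)   # difference filter, psi = (1, -1)
```

The gain vanishes at λ = 0, which is another way of seeing that differencing annihilates constant (zero-frequency) components while amplifying high frequencies.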

We may apply the above theorem in order to compute the spectral density of ARMA processes. First of all, we consider the MA process X_t = Z_t + θ_1 Z_{t−1} + … + θ_q Z_{t−q}, and observe that X_t is the result of the application of a filter to the white noise process (Z_t)_{t∈Z}. The transfer function of the filter is θ(z) = 1 + θ_1 z + … + θ_q z^q, and hence we get the following formula for the spectral density of an MA process:

f(λ) = (σ²/(2π)) |θ(e^{−iλ})|².

For the calculation of the spectral density of an ARMA process, we recall the ARMA equation

X_t − φ_1 X_{t−1} − … − φ_p X_{t−p} = Z_t + θ_1 Z_{t−1} + … + θ_q Z_{t−q}.

The spectral densities of the two sides of the above equation must be equal. Denoting the spectral density of (X_t)_{t≥1} by f_X(λ), the spectral density of the left-hand side is given by |φ(e^{−iλ})|² f_X(λ), while the right-hand side has spectral density (σ²/(2π)) |θ(e^{−iλ})|². Thus we obtain the following formula for the spectral density of an ARMA-process:

f_X(λ) = (σ²/(2π)) |θ(e^{−iλ})|² / |φ(e^{−iλ})|²,

where the complex polynomials θ(z), φ(z) are defined as follows:

θ(z) = 1 + θ_1 z + … + θ_q z^q,
φ(z) = 1 − φ_1 z − … − φ_p z^p.

Spectral Density Estimation. If we have observed the segment X_1, …, X_n of a time series (X_t)_{t≥1}, we may estimate the spectral density function. The basic idea of spectral density estimation is quite simple, and uses the explicit representation of the spectral density as a function of the autocovariances, given by

f(λ) = (1/(2π)) Σ_{h=−∞}^{∞} e^{−ihλ} γ(h).

Replacing the theoretical autocovariances γ(h) by their estimates γ̂(h), we obtain a preliminary estimator of the spectral density. The emphasis on preliminary is important, as we will later see that the resulting estimator needs some modifications. In this context, we use the following estimator of the autocovariance function:

γ̂(h) = (1/n) Σ_{j=1}^{n} (X_j − X̄)(X_{j+h} − X̄).


The terms containing X_t-values that are undefined have to be interpreted as zero. This is a useful convention that helps to reduce worries about summation boundaries. In this way we obtain the following estimate for the spectral density:

(1/(2π)) Σ_{h=−∞}^{∞} e^{−ihλ} γ̂(h) = (1/(2πn)) | Σ_{j=1}^{n} e^{−ijλ} (X_j − X̄) |².

Definition 9.13 (Periodogram). Let X_1, …, X_n be data from a time series (X_t)_{t≥1}. We define the periodogram

I_n(λ) := (1/n) | Σ_{j=1}^{n} e^{−ijλ} (X_j − X̄) |².

The sum Σ_{j=1}^{n} e^{−ijλ} (X_j − X̄) is called the Discrete Fourier Transform of the data.

One can show that the periodogram is an unbiased estimator of the spectral density. However, its variance does not converge to zero, and thus the periodogram is not a consistent estimator. In order to remedy this problem, one takes smoothed versions of the periodogram as spectral density estimators. This procedure is implemented in all advanced statistical packages.
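The periodogram is a one-line computation from Definition 9.13. A sketch (assuming NumPy); the white-noise example illustrates the consistency problem just described, since individual ordinates keep scattering widely around their mean however large n is.

```python
import numpy as np

def periodogram(x, lam):
    """I_n(lambda) = (1/n) |sum_{j=1}^n e^{-ij lambda} (X_j - X_bar)|^2."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    j = np.arange(1, len(x) + 1)
    dft = np.exp(-1j * np.outer(np.asarray(lam), j)) @ xc
    return np.abs(dft) ** 2 / len(x)

# White noise with sigma^2 = 1: under the convention of Definition 9.13 the
# ordinates fluctuate around 2*pi*f(lambda) = 1, but they do not settle
# down as n grows; smoothing across frequencies restores consistency.
rng = np.random.default_rng(7)
lam = 2 * np.pi * np.arange(1, 200) / 2000   # a subset of Fourier frequencies
I = periodogram(rng.normal(size=2000), lam)
```

Averaging the ordinates over neighbouring frequencies is precisely the smoothing step mentioned above, and it is what statistical packages do internally.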
