
Lebesgue's remarkable theory of measure and integration with probability

Paul Loya

Contents
Preface   i
Prologue: Lebesgue's 1901 paper that changed the integral . . . forever   1

Part 1. Finite additivity   7

Chapter 1. Measure & probability: finite additivity   9
1.1. Introduction: Measure and integration   9
1.2. Probability, events, and sample spaces   16
1.3. Semirings, rings and σ-algebras   26
1.4. The Borel sets and the principle of appropriate sets   36
1.5. Additive set functions in classical probability   43
1.6. Lebesgue and Lebesgue-Stieltjes additive set functions   58

Chapter 2. Finitely additive integration   69
2.1. Integration on semirings   69
2.2. Random variables and (mathematical) expectations   76
2.3. Properties of additive set functions on semirings   85
2.4. Bernoulli's Theorem (The WLLNs) and expectations   95
2.5. De Moivre, Laplace and Stirling star in The normal curve   107

Part 2. Countable additivity   125

Chapter 3. Measure and probability: countable additivity   127
3.1. Introduction: What is a measurable set?   127
3.2. Countable additivity, subadditivity, and the principles   131
3.3. Infinite product spaces and Kolmogorov's measure axiom   141
3.4. Outer measures, measures, and Carathéodory's idea   151
3.5. The extension theorem and regularity properties of measures   161

Chapter 4. Reactions to the extension & regularity theorems   173
4.1. Gambler's ruin, Borel-Cantelli, and independence   173
4.2. Borel's strong law of large numbers   182
4.3. Littlewood's first principle(s), Borel measures and completions   190
4.4. Geometry, Vitali's nonmeasurable set, and paradoxes   202
4.5. The Cantor set   219

Part 3. Integration   237

Chapter 5. Basics of integration theory   239
5.1. Introduction: Interchanging limits and integrals   239
5.2. Measurable functions and Littlewood's second principle   245
5.3. Sequences of functions and Littlewood's third principle   257
5.4. Lebesgue's definition of the integral and the MCT   267
5.5. Integral properties and the principle of appropriate functions   277
5.6. The DCT, Osgood's principle and complex-valued functions   289

Chapter 6. Some applications of integration   309
6.1. Practice with the DCT and its corollaries   309
6.2. Lebesgue, Riemann and Stieltjes integration   327
6.3. Approximations and the Stone-Weierstrass theorem   339
6.4. Probability distributions, mass functions and pdfs   348
6.5. Independence and identically distributed random variables   358
6.6. Laws of large numbers and normal numbers   368

Part 4. More measure and integration   383

Chapter 7. Fubini's theorem and change of variables   385
7.1. Introduction: Iterated integration   385
7.2. Product measures, volumes by slices, and volumes of balls   390
7.3. Fubini-Tonelli, convolutions, and applications   404
7.4. Change of variables in multiple integrals   417
7.5. Some applications of change of variables   431
7.6. Polar coordinates and integration over spheres   442

Chapter 8. Banach, Hilbert and Fourier   455
8.1. Introduction: The who, what and why of Lᵖ   455
8.2. Minkowski, Young, and Hölder-Rogers inequalities   466
8.3. Banach spaces, Lᵖ Riesz-Fisher, and approximations   474
8.4. Hilbert spaces and orthogonality   485
8.5. Fourier series, orthonormal bases, and L² Riesz-Fisher   495
8.6. Fourier transforms   507
8.7. The central limit theorem   513

Chapter 9. The antiderivative and arc length problems   527
9.1. Introduction: Antiderivatives, arc length and differentiability   527
9.2. Bounded variation and complex measures   539
9.3. Absolute continuity, Fréchet-Riesz, and Radon-Lebesgue-Nikodym   552
9.4. The fundamental theorems of calculus and lengths of curves   563
9.5. Lebesgue's last theorem   575

Bibliography   585
Index   599

Preface
To teach effectively a teacher must develop a feeling for his subject; he cannot make his students sense its vitality if he does not sense it himself. He cannot share his enthusiasm when he has no enthusiasm to share. How he makes his point may be as important as the point he makes; he must personally feel it to be important. George Pólya (1887–1985). Mathematical Discovery (New York, 1981).

Some books present mathematics as a work of art with perfectly worded definitions, theorems, and proofs. Historically, however, definitions have been reworked, theories have been changed, and flaws have been found. One of the greatest paradigm shifts occurred in 1901 when Henri Lebesgue's (1875–1941) theory of integration was revealed. In order to make mathematics more personal and interesting to a general mathematics audience, I believe that it's important to spend time explaining the history of the Lebesgue integral, some of the inadequacies of previous theories this integral fixed, and some of the people involved in its story. Out of these precepts comes this book, which is based on lectures I have given to first year graduate students at Binghamton University on and off since 2003. I have tried hard to make this book into a living book in the sense of Charlotte Maria Shaw Mason (1842–1923), a definition of which is:1
Living books are usually written by one person who has a passion for the subject and writes in conversational or narrative style. The books pull you into the subject and involve your emotions, so it's easy to remember the events and facts. Living books make the subject come alive. They can be contrasted to dry writing, like what is found in most encyclopedias or textbooks, which basically lists informational facts in summary form.

1 http://simplycharlottemason.com/basics/faq/livingbook/

The theme of the book is that Lebesgue integration is living, useful, exciting, beautiful, and understandable! The approach found in this book is as follows.
I. Historical: This book is filled with historical information concerning the mathematics found herein. For example, we start off in our prologue with (a translation of) Lebesgue's 1901 seminal paper Sur une généralisation de l'intégrale définie [231]. It was in this paper that he first revealed to the world his integral, which was further explained in his doctoral thesis Intégrale, longueur, aire, presented at the Sorbonne (University of Paris) in France in 1902. Lebesgue's 1901 paper allows the students to see the idea behind his integral along with problems of the Riemann integral and how his integral fixes these problems. From this paper, the students see the Who, What, Where, When and Why of the Lebesgue integral, and our book frequently returns to Lebesgue's seminal paper as the raison d'être of this book.
II. Abstract measure and integration: We start from the beginning of Chapter 1 working with abstract measure theory, using probability as the catalyst for going outside the realms of Euclidean space. In fact, throughout this book, measure and probability are intertwined and probability is used to inspire measure-theoretic concepts. We do not assume any previous knowledge of probability (not even discrete probability), so this book could serve as an introduction to this fascinating and important area.
III. Proceed at a deliberately slow pace: From the very beginning of Chapter 1, we try to give motivation and intuition behind the sometimes abstruse ideas found in abstract measure and integration theory. For example, Chapter 1 is devoted to finite additivity, and only after mastering finite additivity and seeing all that it can do, we move on to Chapter 2, dealing with countable additivity. The idea is to build solid intuition about measure, integration and probability, and the goal is that the students have an intimate working knowledge of the material, which I think is hard to develop when the material is presented at a fast pace. To help with developing intuition, at the beginning of each chapter there is an expository section explaining some of the ideas, questions and history of certain topics of the chapter.
IV. Visual: I'm a big fan of visual learning. Abstract measure and integration theory is inherently a hard subject and we believe it requires more than just written expressions to get across its ideas; it needs visual imagery to develop solid intuition. Thus, throughout this book we try to use pictures to illustrate many of the concepts and ideas. In fact, there are over 300 figures in this book that try to help the reader understand the mathematics. For example, we describe geometrically Giuseppe Vitali's (1875–1932) famous nonmeasurable set. This enables students to see how Vitali's set is defined and why it has some of its strange properties.
There are many exercises in this book (sometimes I wonder if there are far too many), but hopefully they allow an instructor a wide range of exercises to choose from. Apart from some well-known (standard) exercise problems or well-known results put into problem form, many of the exercises I have made up myself. I have also gathered many exercise problems from sources like the American Mathematical Monthly or others, all of which I have cited in the problems.
This book is written specifically with advanced undergraduates and first year (American) graduate students in mind; the only prerequisite is an undergraduate course in real analysis at the level of Introduction to Real Analysis by Bartle and Sherbert [22]. The students in my measure and integration courses were usually first year graduate students in statistics, who need a lot of analysis, or in pure mathematics, specializing in areas not requiring a lot of analysis; in fact, this course was the only course such students ever took or needed to take in analysis at the graduate level. The desire to interest such a diverse body of students is exactly the reason I wanted to write a book with the above four principles.
Always with my students' needs in mind, I wanted to write a living book in the sense of Charlotte Mason: a book that not only teaches Lebesgue's theory but also shows its usefulness, beauty, and excitement, and finally, a book that is personable, where the students come with me on a journey through the living saga of Lebesgue's theory.

Prologue: Lebesgue's 1901 paper that changed the integral . . . forever


Sur une généralisation de l'intégrale définie
On a generalization of the definite integral2

Note by Mr. H. Lebesgue. Presented by M. Picard.


In the case of continuous functions, the notions of the integral and antiderivatives are identical. Riemann defined the integral of certain discontinuous functions, but all derivatives are not integrable in the sense of Riemann. Research into the problem of antiderivatives is thus not solved by integration, and one can desire a definition of the integral including as a particular case that of Riemann and allowing one to solve the problem of antiderivatives.(1)
To define the integral of an increasing continuous function y(x) (a ≤ x ≤ b) we divide the interval (a, b) into subintervals and sum the quantities obtained by multiplying the length of each subinterval by one of the values of y when x is in the subinterval. If x is in the interval (ai, ai+1), y varies between certain limits mi, mi+1, and conversely if y is between mi and mi+1, x is between ai and ai+1. So that instead of giving the division of the variation of x, that is to say, to give the numbers ai, we could have given to ourselves the division of the variation of y, that is to say, the numbers mi. From here there are two manners of generalizing the concept of the integral. We know that the first (to be given the numbers ai) leads to the definition given by Riemann and the definitions of the integral by upper and lower sums given by Mr. Darboux. Let us see the second.
Let the function y range between m and M. Consider the situation
m = m0 < m1 < m2 < ⋯ < m_{p−1} < M = mp;
y = m when x belongs to the set E0; m_{i−1} < y ≤ mi when x belongs to the set Ei.3 We will define the measures λ0, λi of these sets. Let us consider one or the other of the two sums
m0λ0 + Σ miλi ;   m0λ0 + Σ m_{i−1}λi;

2 This is a translation of Lebesgue's paper where he first reveals his integration theory. This paper appeared in Comptes Rendus de l'Académie des Sciences (1901), pp. 1025–1028, and is translated by Paul Loya and Emanuele Delucchi.
3 Translator's footnote: That is, Lebesgue defines E0 = y⁻¹(m) = {x ∈ [a, b] ; y(x) = m} and Ei = y⁻¹(m_{i−1}, mi] = {x ∈ [a, b] ; m_{i−1} < y(x) ≤ mi}.


if, when the maximum difference between two consecutive mi tends to zero, these sums tend to the same limit independent of the chosen mi, this limit will be, by definition, the integral of y, which will be called integrable.
Let us consider a set of points of (a, b); one can enclose in an infinite number of ways these points in an enumerably infinite number of intervals; the infimum of the sum of the lengths of the intervals is the measure of the set.4 A set E is said to be measurable if5 its measure together with that of the set of points not forming E gives the measure of (a, b).(2) Here are two properties of these sets: Given an infinite number of measurable sets Ei, the set of points which belong to at least one of them is measurable; if the Ei are such that no two have a common point, the measure of the set thus obtained is the sum of the measures of the Ei. The set of points in common with all the Ei is measurable.6
It is natural to consider first of all functions whose sets which appear in the definition of the integral are measurable. One finds that: if a function bounded in absolute value is such that for any A and B, the set of values of x for which A < y ≤ B is measurable, then it is integrable by the process indicated. Such a function will be called summable. The integral of a summable function lies between the lower integral and the upper integral.7 It follows that if a function integrable in the sense of Riemann is summable, the integral is the same with the two definitions. Now, any integrable function in the sense of Riemann is summable, because the set of all its points of discontinuity has measure zero, and one can show that if, by omitting the set of values of x of measure zero, what remains is a set at each point of which the function is continuous, then this function is summable. This property makes it immediately possible to form nonintegrable functions in the sense of Riemann that are nevertheless summable. Let f(x) and φ(x) be two continuous functions, φ(x) not always zero; a function which does not differ from f(x) except at the points of a set of measure zero that is everywhere dense, and which at these points is equal to f(x) + φ(x), is summable without being integrable in the sense of Riemann. Example: The function equal to 0 if x is irrational, equal to 1 if x is rational. The above process of construction shows that the set of all summable functions has cardinality greater than the continuum. Here are two properties of functions in this set.
(1) If f and φ are summable, f + φ is and the integral of f + φ is the sum of the integrals of f and of φ.
(2) If a sequence of summable functions has a limit, it is a summable function.

4 Translator's footnote: Denoting by m(E) the measure of a set E ⊆ (a, b), Lebesgue is defining m(E) to be the infimum of the set of all sums of the form Σ ℓ(Ii) such that E ⊆ ∪i Ii, where Ii = (ai, bi] and ℓ(Ii) = bi − ai. It's true that Lebesgue doesn't specify the types of intervals, but it doesn't matter what types of intervals you choose to cover E with (I chose left-half open ones because of my upbringing).
5 Translator's footnote: Lebesgue defines E as measurable if m(E) + m((a, b) ∩ Eᶜ) = b − a.
6 Translator's footnote: Lebesgue is saying that if the Ei are measurable, then ∪i Ei is measurable; if the Ei are pairwise disjoint, then m(∪i Ei) = Σi m(Ei); and finally, that ∩i Ei is measurable. The complement of a measurable set is, almost by definition, measurable; moreover, it's not difficult to see that the empty set is measurable. Thus, the collection of measurable sets contains the empty set and is closed under complements and countable unions; later when we define σ-algebras, think about Lebesgue.
7 Translator's footnote: Lower and upper integrals in the sense of Darboux.


The collection of summable functions obviously contains y = k and y = x; therefore, according to (1), it contains all the polynomials and, according to (2), it contains all its limits, therefore it contains all the continuous functions, that is to say, the functions of first class (see Baire, Annali di Matematica, 1899), it contains all those of second class, etc. In particular, any derivative bounded in absolute value, being of first class, is summable, and one can show that its integral, considered as a function of its upper limit, is an antiderivative. Here is a geometrical application: if |f′|, |φ′|, |ψ′| are bounded, the curve x = f(t), y = φ(t), z = ψ(t), has a length given by the integral of √(f′² + φ′² + ψ′²). If φ = ψ = 0, one obtains the total variation of the function f of bounded variation. If f′, φ′, ψ′ do not exist, one can obtain an almost identical theorem by replacing the derivatives by the Dini derivatives.

Footnotes: (1) These two conditions imposed a priori on any generalization of the integral are obviously compatible, because any integrable derivative, in the sense of Riemann, has as an integral one of its antiderivatives. (2) If one adds to this collection suitably selected sets of measure zero, one obtains the measurable sets in the sense of Mr. Borel (Leçons sur la théorie des fonctions).

Some remarks on Lebesgue's paper

In Section 1.1 of Chapter 1 we shall take a closer look at Lebesgue's theory of integration as he explained it in his paper. Right now we shall discuss some aspects he brings up in his paper involving certain defects in the Riemann theory of the integral and how his theory fixes these defects.
The antiderivative problem. One of the fundamental theorems of calculus (FTC) says that for a bounded8 function f : [a, b] → ℝ, we have

(0.1)   ∫ₐᵇ f(x) dx = F(b) − F(a),

where F is an antiderivative of f, which means F′(x) = f(x) for all x ∈ [a, b]. It may be hard to accept at first, because it's not stated in a first course in calculus, but the FTC may fail if the integral in (0.1) is the Riemann integral! In fact, there are bounded functions f that are not Riemann integrable, but have antiderivatives; thus, for such functions the left-hand side of (0.1) does not make sense. In Section 9.1 we shall define such a function, due to Vito Volterra (1860–1940), that he published in 1881. With this background, we can understand Lebesgue's inaugural words of his paper:
In the case of continuous functions, the notions of the integral and antiderivatives are identical. Riemann defined the integral of certain discontinuous functions, but all derivatives are not integrable in the sense of Riemann. Research into the problem of antiderivatives is thus not solved by integration, and one can desire a definition of the integral including as a particular case that of Riemann and allowing one to solve the problem of antiderivatives.
8 The Riemann integral is only defined for bounded functions, which is why we make this assumption. We could deal with unbounded functions, but then we'd have to discuss improper integrals, which we don't want to get into.


In Lebesgue's theory of the integral, we shall see that the fundamental theorem of calculus always holds for any bounded function with an antiderivative. In this sense, Lebesgue's theory of the integral solves the problem of antiderivatives.
The limit problem. Suppose that for each n = 1, 2, 3, . . . we are given a function fn : [a, b] → ℝ, all bounded by some fixed constant.9 Also suppose that for each x ∈ [a, b], lim_{n→∞} fn(x) exists; since this limit depends on x, the value of the limit defines a function f : [a, b] → ℝ such that for each x ∈ [a, b],
f(x) = lim_{n→∞} fn(x).

The function f is bounded since we assumed all the fn's were bounded by some fixed constant. A question that you've probably seen before in elementary real analysis is the following: Given that the fn's are Riemann integrable, is it always true that

(0.2)   ∫ₐᵇ f(x) dx = lim_{n→∞} ∫ₐᵇ fn(x) dx?

We shall call this question the limit problem, which, by using the definition of f(x), we can rephrase as follows: Is it always true that

∫ₐᵇ lim_{n→∞} fn(x) dx = lim_{n→∞} ∫ₐᵇ fn(x) dx,

which is to say, can we switch limits with integrals? In the Riemann integration world, the answer to this question is "No" for the following reason: Even though each fn is Riemann integrable, it's not necessarily the case that the limit function f is Riemann integrable. Thus, even though the numbers ∫ₐᵇ fn(x) dx on the right-hand side of (0.2) may be perfectly well-defined, the symbol ∫ₐᵇ f(x) dx on the left-hand side of (0.2) may not be defined! For an example of such a case, we go back to the example Lebesgue brought up in the second-to-last paragraph of his paper where he wrote "Example: The function equal to 0 if x is irrational, equal to 1 if x is rational." Denoting this function by f : ℝ → ℝ, we have
f(x) = 1 if x is rational, and f(x) = 0 if x is irrational.

This function is called Dirichlet's function after Lejeune Dirichlet (1805–1859), who introduced it in 1829; a rough picture of Dirichlet's function shows the value 1 over the rationals and 0 over the irrationals.

9 That is, there is a constant C such that |fn(x)| ≤ C for all x ∈ [a, b] and for all n.


It's easy to show that f : ℝ → ℝ is not Riemann integrable on any interval [a, b] with a < b (see Exercise 1). Now, in 1898, René-Louis Baire (1874–1932) introduced the following sequence of functions fn : ℝ → ℝ, n = 1, 2, 3, . . ., defined by
fn(x) = 1 if x = p/q is rational in lowest terms with q ≤ n, and fn(x) = 0 otherwise.
Focusing on x ∈ [0, 1], notice that f3(x) = 1 when x = 0, 1/3, 1/2, 2/3, 1, the rationals with denominators not greater than 3 when written in lowest terms; otherwise f3(x) = 0. More generally, fn is equal to the zero function except at finitely many points, namely at 0/1, 1/2, 1/3, 2/3, . . . , (n − 1)/n, and 1/1. In particular, fn is Riemann integrable and for any a < b,
∫ₐᵇ fn(x) dx = 0;
here we recall that the Riemann integral is immune to changes in functions at finitely many points, so as the fn's differ from the zero function at only finitely many points, ∫ₐᵇ fn(x) dx = ∫ₐᵇ 0 dx = 0. Also notice that
lim_{n→∞} fn = the Dirichlet function,
which, as we mentioned earlier, is not Riemann integrable. Hence, for this simple example, the limit equality (0.2) is nonsense because the left-hand side of the equality is not defined. In Lebesgue's theory of integration, we shall see that the limit function f will always be Lebesgue integrable (which Lebesgue mentions in point (2) at the end of the second-to-last paragraph of his paper) and moreover, the equality (0.2) always holds when the sequence fn is bounded. In this sense, Lebesgue's theory of the integral gives a positive answer to the limit problem.
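To see the limit problem in miniature, here is a small Python sketch (our own illustration, not part of Lebesgue's paper) of Baire's functions fn and their pointwise limit; as a modeling convenience, rationality is encoded by passing exact Fraction objects, while a float stands in for an irrational point.

```python
from fractions import Fraction
from math import sqrt

def baire_f(n, x):
    """Baire's f_n: 1 at a rational p/q (in lowest terms) with q <= n, and 0 elsewhere."""
    return 1 if isinstance(x, Fraction) and x.denominator <= n else 0

def dirichlet(x):
    """Dirichlet's function under the same modeling convention."""
    return 1 if isinstance(x, Fraction) else 0

# Sample points in [0, 1]: a few exact rationals plus one irrational.
points = [Fraction(0), Fraction(1, 3), Fraction(1, 2), Fraction(2, 3), Fraction(1), sqrt(2) / 2]

for n in (1, 3, 10, 100):
    print(n, [baire_f(n, x) for x in points])      # each f_n is nonzero at only finitely many points
print("limit:", [dirichlet(x) for x in points])    # the pointwise limit is Dirichlet's function
```

Each fn has Riemann integral 0, yet the pointwise limit is not Riemann integrable, which is exactly why (0.2) can fail to make sense.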


Finally, let's discuss
The arc length problem. In the last paragraph of Lebesgue's paper he mentions the following geometric application:
Here is a geometrical application: if |f′|, |φ′|, |ψ′| are bounded, the curve x = f(t), y = φ(t), z = ψ(t), has a length given by the integral of √(f′² + φ′² + ψ′²).
To elaborate more on this, suppose we are given a curve C in 3-space defined by parametric equations
C : x = f(t), y = φ(t), z = ψ(t),   a ≤ t ≤ b.
To define L, the length of C, we approximate the curve by a piecewise linear curve and find the length of the approximating curve. Geometrically it's clear that the length of the piecewise linear curves is less than or equal to the true length of the curve; for this reason we define the length of the curve L by
(0.3)   L := the supremum of the lengths of the piecewise linear approximations,
provided that this value is finite. In elementary calculus we learned another formula for the length of the curve:
(0.4)   L = ∫ₐᵇ √( (f′(t))² + (φ′(t))² + (ψ′(t))² ) dt,

assuming that the derivatives are bounded. A natural question is: Are the two notions of length, defined by (0.3) and (0.4), equivalent? The answer is "No" if the Riemann integral is used in (0.4)! More precisely, there are curves which have length in the sense of (0.3) but such that √(f′² + φ′² + ψ′²) is not Riemann integrable; thus, (0.4) is nonsense if the integral is understood in the Riemann sense. In Lebesgue's theory of the integral, we shall see that the two notions of arc length are equivalent. Thus, Lebesgue's theory of the integral solves the arc length problem.
There are many other defects in Riemann's integral that Lebesgue's integral fixes, and we'll review and discuss new defects as we progress through the book (for example, see the discussion on multi-dimensional integrals in Chapter 7).
Summary. If we insist on using the Riemann integral, we have to worry about important formulas that are true some of the time; however, using the Lebesgue integral, these defective formulas become, for all intents and purposes, correct all of the time. Thus, we can say that Lebesgue's integral simplifies life!
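For a smooth curve the two definitions of length do agree, and that is easy to check numerically. The sketch below (our own illustration; the helix is an arbitrary choice) compares the polygonal lengths of (0.3) with a Riemann-sum approximation of the integral in (0.4).

```python
from math import sin, cos, sqrt, pi

# A smooth curve: the helix x = cos t, y = sin t, z = t for 0 <= t <= 2*pi.
f,  phi,  psi  = cos, sin, (lambda t: t)
df, dphi, dpsi = (lambda t: -sin(t)), cos, (lambda t: 1.0)
a, b = 0.0, 2 * pi

def polygonal_length(n):
    """Length of an inscribed polygonal curve with n segments, as in (0.3)."""
    ts = [a + (b - a) * i / n for i in range(n + 1)]
    return sum(sqrt((f(s) - f(t))**2 + (phi(s) - phi(t))**2 + (psi(s) - psi(t))**2)
               for t, s in zip(ts, ts[1:]))

def integral_length(n):
    """Midpoint Riemann-sum approximation of the integral in (0.4)."""
    h = (b - a) / n
    return sum(sqrt(df(a + (i + 0.5) * h)**2 + dphi(a + (i + 0.5) * h)**2
                    + dpsi(a + (i + 0.5) * h)**2) * h for i in range(n))

print(polygonal_length(10_000), integral_length(10_000), 2 * pi * sqrt(2))
# Both approximations approach 2*pi*sqrt(2), the exact length of one turn of this helix.
```

The point of the arc length problem is that this happy agreement can break down for curves whose derivatives exist but are badly behaved, unless the integral in (0.4) is taken in Lebesgue's sense.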
Exercises 0.1.
1. Using your favorite definition of the Riemann integral you learned in an elementary course on real analysis (for instance, via Riemann sums or Darboux sums), prove that Dirichlet's function is not Riemann integrable on any interval [a, b] where a < b.

Part 1

Finite additivity

CHAPTER 1

Measure & probability: finite additivity


In this work I try to give definitions that are as general and precise as possible to some numbers that we consider in Analysis: the definite integral, the length of a curve, the area of a surface. Opening words to Henri Lebesgue's (1875–1941) thesis [232].

1.1. Introduction: Measure and integration
This section is meant to be a motivational speech where we outline the basic ideas behind the theory of measure,1 or assigning size to sets, and how to use this notion of measure to define integrals.
1.1.1. Lebesgue sums. Recall that the Riemann integral of a function with domain an interval [a, b] is defined via Riemann sums, which approximate the (signed) area of a function by partitioning the domain. The basic idea of Henri Lebesgue (1875–1941) is that he forms Lebesgue sums by partitioning the range of the function. Let us recall Lebesgue's words in his inaugural 1901 paper Sur une généralisation de l'intégrale définie (On a generalization of the definite integral):
To define the integral of an increasing continuous function y(x) (a ≤ x ≤ b) one divides the interval (a, b) into subintervals and forms the sum of the quantities obtained by multiplying the length of each subinterval by one of the values of y when x is in the subinterval. If x is in the interval (ai, ai+1), y varies between certain limits mi, mi+1, and conversely if y is between mi and mi+1, x is between ai and ai+1. Of course, instead of giving the division of the variation of x, that is to say, to give the numbers ai, one could have given the division of the variation of y, that is to say, the numbers mi. From here there are two manners of generalizing the concept of the integral. One sees that the first (to be given the numbers ai) leads to the definition given by Riemann and the definitions of the integral by upper and lower sums given by Mr. Darboux. Let us see the second.
Let the function y range between m and M. Given
m = m0 < m1 < m2 < ⋯ < m_{p−1} < M = mp,
y = m when x belongs to the set E0; m_{i−1} < y ≤ mi when x belongs to the set Ei.2 We will define the measures λ0, λi of these sets.

1 Numero pondere et mensura Deus omnia condidit (God created everything by number, weight and measure). Sir Isaac Newton (1643–1727).
2 Translator's footnote: That is, Lebesgue defines E0 = y⁻¹(m) = {x ∈ [a, b] ; y(x) = m} and Ei = y⁻¹(m_{i−1}, mi] = {x ∈ [a, b] ; m_{i−1} < y(x) ≤ mi}.



Let us consider either of the two sums
m0λ0 + Σ miλi ;   m0λ0 + Σ m_{i−1}λi;
if, when the maximum difference between two consecutive mi tends to zero, these sums tend to the same limit independent of the mi chosen, this limit will be, by definition, the integral of y, which will be known as integrable.

To visualize what Lebesgue is saying concerning the second method of generalizing the concept of the integral, consider a bounded nonnegative function f, such as the one sketched in Figure 1.1.

Figure 1.1. The set Ei = {x ; m_{i−1} < f(x) ≤ mi} for this example is a union of two intervals; the left interval in Ei is a left-half open (or right-half closed) interval and the right interval in Ei is a right-half open (or left-half closed) interval.

Let's say the range of f lies between m and M. Our goal is to determine the area below the graph. Just as Lebesgue says, let us take a partition along the y-axis:
m = m0 < m1 < m2 < ⋯ < m_{p−1} < M = mp.
Let E0 = {x ; f(x) = m0} and for each i = 1, 2, . . . , p, let
(1.1)   Ei := f⁻¹(m_{i−1}, mi] = {x ; m_{i−1} < f(x) ≤ mi}.

Figure 1.1 shows an Ei, which in this case is just a union of two intervals, and in the following figure we take p = 6:
Figure 1.2. The sets E1, . . . , E6. In Figure 1.1 we wrote Ei in terms of right and left-half open intervals. In this figure, for simplicity we omit the details of the endpoints. Omitted from the picture is E0 = {x ; f(x) = m0}, which is the set {a, b} for this example.

Following Lebesgue, put λi = m(Ei), where m(Ei) = the measure or length of the set Ei.



For the function in Figure 1.1, it is clear that Ei has a length because E0 = {a, b} (just two points, so m(E0) should be zero) and for i > 0, Ei is just a union of intervals (and lengths for intervals have obvious meanings), but for general functions, the sets Ei can be very complicated, so it is not clear that a length can always be assigned to Ei. In any case, a set that has a well-defined notion of measure or length is called measurable. (By the way, there are some sets that don't have a well-defined notion of length; see Section 1.1.3 for a discussion of this fact.) Now a careful study of Figure 1.3 shows that
m0λ0 + Σ m_{i−1}λi = m0·m(E0) + m0·m(E1) + m1·m(E2) + m2·m(E3) + ⋯
is the lower area of the rectangles shown in the left-hand picture in Figure 1.3, while
m0λ0 + Σ miλi = m0·m(E0) + m1·m(E1) + m2·m(E2) + m3·m(E3) + ⋯
is the upper area of the rectangles shown in the right-hand picture in Figure 1.3.

Figure 1.3. Approximating the area under the graph of f from the inside and from the outside.

We now go back to Lebesgue's 1901 paper where he says:
if, when the maximum difference between two consecutive mi tends to zero, these sums tend to the same limit independent of the mi chosen, this limit will be, by definition, the integral of y, which will be known as integrable.

We can make this precise as follows. Let P denote the partition {m0, m1, . . . , mp}, let ∥P∥ be the maximum difference between two consecutive mi in the partition, and let
LP = m0·m(E0) + Σ m_{i−1}·m(Ei)   and   UP = m0·m(E0) + Σ mi·m(Ei),
which we call the lower and upper sums defined by the partition P. For a real number I we write lim_{∥P∥→0} LP = I if given any ε > 0, there is a δ > 0 such that for any partition P with ∥P∥ < δ, we have |I − LP| < ε. There is a similar definition of what lim_{∥P∥→0} UP = I means. Then Lebesgue is saying that if there exists a real number I such that
(1.2)   lim_{∥P∥→0} LP = I   and   lim_{∥P∥→0} UP = I,



then we say that f is integrable, and the common limit I is by definition the integral of f, which we shall denote by
(1.3)   ∫ f := I = lim_{∥P∥→0} LP = lim_{∥P∥→0} UP.

Assuming that each set Ei is measurable, this definition actually works!3 A function for which each set Ei is measurable is called a measurable function. In particular, in Section 6.2 we'll see that a Riemann integrable function is measurable, and the limit (1.3) is equal to the Riemann integral of the function. However, many more functions have integrals that can be defined via (1.3). Such a function is called Lebesgue integrable and the corresponding integral is called the Lebesgue integral of the function. As we shall see in the sequel, this integral has some powerful features, as hinted in the prologue. Moreover, just as the notion of open sets in Euclidean space generalizes to abstract topological spaces, which is indispensable for modern mathematics, the Lebesgue integral for Euclidean space has a valuable generalization to what are called abstract measure spaces. For this reason, we shall develop Lebesgue's theory through abstract measure theory, a time consuming but worthy task. We shall see the usefulness of abstract measure theory in action as we study probability in this book. Before we go on, here is a nice description of the difference between Lebesgue and Riemann integration from Lebesgue himself [239, pp. 181–82]:
One could say that, according to Riemann's procedure, one tried to add the indivisibles by taking them in the order in which they were furnished by the variation in x, like an unsystematic merchant who counts coins and bills at random in the order in which they come to hand, while we operate like a methodical merchant who says: I have m(E1) pennies which are worth 1 · m(E1), I have m(E2) nickels worth 5 · m(E2), I have m(E3) dimes worth 10 · m(E3), etc. Altogether I have
S = 1 · m(E1) + 5 · m(E2) + 10 · m(E3) + ⋯.
The two procedures will certainly lead the merchant to the same result because no matter how much money he has there is only a finite number of coins or bills to count. But for us who must add an infinite number of indivisibles the difference between the two methods is of capital importance.
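As a concrete, purely illustrative check of the definitions of LP and UP, the sketch below (our own example; the choice f(x) = x² on [0, 1] is arbitrary) partitions the range of f and measures each Ei directly. Since this f is increasing, each Ei is an interval, so m(Ei) is just its length, and both sums squeeze down to 1/3, the familiar value of the Riemann integral.

```python
from math import sqrt

# f(x) = x^2 on [0, 1]; its range is [m, M] = [0, 1].
# For a range partition m_0 < m_1 < ... < m_p, the set
# E_i = {x : m_{i-1} < f(x) <= m_i} is the interval (sqrt(m_{i-1}), sqrt(m_i)],
# so its measure (length) is sqrt(m_i) - sqrt(m_{i-1}).

def lebesgue_sums(p):
    m = [i / p for i in range(p + 1)]                                  # uniform partition of the range
    lengths = [sqrt(m[i]) - sqrt(m[i - 1]) for i in range(1, p + 1)]   # m(E_i)
    # m(E_0) = 0 here: E_0 = {x : f(x) = m_0 = 0} = {0} is a single point.
    lower = sum(m[i - 1] * lengths[i - 1] for i in range(1, p + 1))    # L_P
    upper = sum(m[i] * lengths[i - 1] for i in range(1, p + 1))        # U_P
    return lower, upper

for p in (4, 16, 64, 256):
    print(p, lebesgue_sums(p))
# As the mesh of the range partition tends to 0, both sums tend to 1/3.
```

Exercise 1.1 below asks you to carry out the same computation analytically for f(x) = xᵏ by recognizing LP and UP as Riemann sums.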

1.1.2. Measurable sets, σ-algebras and integrals. Now what properties should measurable sets have? Certainly the empty set ∅, since it has nothing in it, should be measurable with measure (or size) zero. Also, any bounded interval (a, b), (a, b], [a, b), [a, b] should be measurable (with measure b − a). You might recall that any open subset of ℝ can be written as a countable union of open intervals.4 Since open sets are so fundamental to mathematics, we would surely want open sets to be measurable.

3 If you are interested, see Chapter 5 for the details; the integral of a function taking on both positive and negative values is defined by breaking up the function into a difference of two nonnegative functions. The integral of the function is the difference of the integrals of the two nonnegative functions, provided these integrals exist.
4 You can even take the intervals to be pairwise disjoint; see Problem 5 in Exercises 1.4.



Thus, we also would like measurable sets to be closed under countable unions (that is, a countable union of measurable sets should be measurable). Since closed sets are also fundamental, and closed sets are just complements of open sets, we would like measurable sets to also be closed under taking complements. To summarize: The empty set should be measurable, measurable sets should be closed under countable unions, and measurable sets should be closed under complements. Finally, intervals should be measurable. Omitting the last property, which deals specifically with the real line, and generalizing these considerations, we are led to the following definition. A collection of subsets S of a set X is called a σ-algebra (sigma-algebra) of subsets of X if5
(1) ∅ ∈ S;
(2) An ∈ S, n = 1, 2, . . ., implies ∪_{n=1}^∞ An ∈ S (that is, S is closed under countable unions);
(3) A ∈ S implies Aᶜ = X \ A ∈ S (that is, S is closed under complements).
For example, the Borel sets B, to be discussed more thoroughly in Section 1.4, is roughly speaking the σ-algebra of subsets of ℝ obtained by taking countable unions and complements of intervals, and doing these operations either finitely or infinitely many times.6 We say that B is the σ-algebra generated by the intervals or that B is the smallest σ-algebra containing the intervals. Thus, every Borel set should have a measure. Another important σ-algebra is the Lebesgue measurable sets M. In a sense that can be made precise (see Theorem 3.14), the collection M makes up the largest σ-algebra containing the intervals such that the notion of measure has nice properties, where we now describe "nice."
Now what nice properties should our measure m have? As stated already, the measure of any interval should be the length of the interval, e.g. for a left-half open interval (a, b], we have m(a, b] = b − a. Also, the empty set should have zero measure: m(∅) = 0. Now observe that
(0, 1] = (1/2, 1] ∪ (1/3, 1/2] ∪ (1/4, 1/3] ∪ ⋯
is a countable union of pairwise disjoint intervals. Moreover, the measure of the whole is the sum of the measures of the parts:
(1.4)   m(0, 1] = m(1/2, 1] + m(1/3, 1/2] + m(1/4, 1/3] + ⋯,
since m(0, 1] = 1 and the right-hand side is also 1 because it's a telescoping sum:
Σ_{n=1}^∞ (1/n − 1/(n+1)) = (1 − 1/2) + (1/2 − 1/3) + (1/3 − 1/4) + ⋯ = 1.
The property (1.4) is called countable additivity. Of course, this example was concocted, but no matter what geometric example you can think of, this countable additivity of measure (length, area, volume, . . .) always holds.
5Since the complement of a union of sets is the intersection of the complements of the sets, we can replace (2) with the condition that S be closed under countable intersections. 6Actually, to make this precise, we need the notion of transnite induction. You can read about transnite induction and the Borel sets on page 101 of [348].



For example, consider the following square of side length 1 with smaller squares drawn within it (see Figure 1.4).

Figure 1.4. Each shaded square is measurable, so (the σ-algebra property) the union of the shaded squares is also measurable. The area of the union is 1/3 (the shaded region makes up 1/3 of the big square). On the other hand, the sum of the areas of the shaded squares is (1/2)² + (1/4)² + (1/8)² + ⋯, which also equals 1/3.

Thus, it seems that countable additivity is an inherent property of measure. We generalize these considerations as follows. A measure on a σ-algebra S (of subsets of some set X) is a map7
μ : S → [0, ∞]
such that μ(∅) = 0 and such that μ is countably additive in the sense that for any set A ∈ S written as a union of pairwise disjoint sets, A = A1 ∪ A2 ∪ A3 ∪ ⋯ with An ∈ S for all n, we have
μ(A) = μ(A1) + μ(A2) + μ(A3) + ⋯;
this countable additivity is a generalization of (1.4). The space X (really, the triple (X, S, μ)) is called a measure space. Maurice Fréchet (1878–1973), in his 1915 paper [140], building on the work of Johann Radon's (1887–1956) 1913 paper [320] concerning ℝⁿ, seems to be the first to define measures on abstract spaces, spaces that are not Euclidean. Figure 1.5 shows some examples of measures.

Figure 1.5. Length, angle, area, volume, cardinality and probability (e.g. involving coins or playing cards) are measures. In each case a number is assigned that measures how much of each quantity is there.
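To make the definition tangible, here is a small sketch (our own illustration, in the spirit of the probability entries of Figure 1.5) that treats the power set of a finite sample space as a σ-algebra and checks additivity of the uniform "fair die" probability measure; on a finite space, countable additivity reduces to the finite additivity verified here.

```python
from itertools import chain, combinations
from fractions import Fraction

X = frozenset(range(1, 7))                      # sample space of one fair die

def power_set(s):
    """All subsets of s: the sigma-algebra we use on a finite sample space."""
    s = list(s)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def mu(A):
    """Uniform probability measure: mu(A) = |A| / |X|."""
    return Fraction(len(A), len(X))

S = power_set(X)
assert mu(frozenset()) == 0                          # mu(empty set) = 0
for A in S:                                          # additivity on disjoint events
    for B in S:
        if not (A & B):
            assert mu(A | B) == mu(A) + mu(B)
print("mu({even}) =", mu(frozenset({2, 4, 6})))      # prints 1/2, as expected
```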

With the definition of a measure space, we can present the definition of the integral in complete generality! Let S be a σ-algebra of subsets of a set X and let μ be a measure on S. The sets in S are called measurable sets. We call a bounded nonnegative function f : X → [0, ∞) measurable if each set Ei of the form (1.1) is measurable, that is, Ei ∈ S; in particular, the sums in (1.2) are defined for each partition. We then define the integral of the function f by the formula (1.3). Note how general this definition is! (The domain of the function f is a set X, which need not be Euclidean space.) As mentioned earlier, the ability to define the integral of functions on abstract spaces is one of the most far reaching properties of Lebesgue's theory.

7 Here, we fix an object that's not a real number and denote it by ∞; we adjoin this object to the set of nonnegative real numbers and call the result [0, ∞]. We'll talk more about infinities in Section 1.5. We allow μ to take the value ∞ because we should allow sets to have infinite measure, such as the real line ℝ, which has infinite length.



We remark that in Chapter 5, when we formally study the Lebesgue integral, we will modify the definition (1.3) slightly so it works for unbounded functions.
1.1.3. The measure problem. How do we assign measure, or length, to an arbitrary subset of ℝ? We can do this rather quickly (within a few pages in fact!), but for the sake of pedagogy we're going to take it slow, as follows. Now we certainly know how to measure lengths of intervals, e.g. m(a, b] = b − a. Because of the various assortments of intervals available, for concreteness we shall choose one kind to work with. We define I¹ as the collection of all bounded left-half open intervals (a, b]. Thus, the first step is the observation that we have a completely natural measure
m : I¹ → [0, ∞),   defined by m(a, b] = b − a where a ≤ b.
The question is how to assign lengths to more general subsets of ℝ. The second step is to define m on more complicated sets such as finite unions of intervals. We denote by E¹, the so-called elementary figures of ℝ, the collection of all finite unions of elements of I¹. Thus, we shall try to extend the function m : I¹ → [0, ∞) to a function
m : E¹ → [0, ∞).
This step is relatively easy. The third, and most complicated, step is to extend m so that it's defined on a σ-algebra containing E¹ such as the Borel sets. The trick to do this is to define m on all subsets of ℝ; that is, we assign a length to all subsets of ℝ. Unfortunately, this notion of length is not additive! For example, there exist disjoint subsets A, B ⊆ ℝ such that the length of A ∪ B is strictly less than the sum of the lengths of A and B (see Section 4.4.2). Strange indeed! However, we shall prove there is a σ-algebra M of subsets of ℝ, called the Lebesgue measurable sets, which we mentioned earlier and which will turn out to contain the Borel sets, such that
m : M → [0, ∞]
is a measure. Here is a summary of our measure theory program.
Our Measure Theory Program:
(1) The collection I¹ has the structure of what's called a semiring. Hence, the first thing we need to do is understand the properties of semirings. This is done in Section 1.3, where we study other structures such as rings and σ-algebras. (The collection E¹ turns out to be a ring.)
(2) In Section 1.4 we study the Borel sets B.
(3) In Section 1.6 we study Lebesgue measure, and a slight generalization called the Lebesgue-Stieltjes measure, on I¹.
(4) We then extend m to a function on E¹. This is done in Section 2.3.
(5) Next, we define the length of any subset of ℝ. The idea on how to do this goes back more than 2000 years to Archimedes of Syracuse (287 BC–212 BC). In Proposition 1 of Archimedes' book On the measurement of the circle [175], he found8 the area of a disk of radius r to be πr². He did this by approximating the area of a disk by the areas of circumscribed and inscribed regular polygons, whose areas are easy to find; see Figure 1.6. The area of a region found by circumscribing it by simple geometric shapes (specifically, rectangles) is called the outer measure of the region. We can find the outer measure of an arbitrary subset of ℝⁿ, for any n, by similar means. Outer measures will be studied in Section 3.4.

Figure 1.6. Circumscribing and inscribing a circle with regular polygons.

(6) Finally, in Section 3.5 we show that this whole process of extending m from I¹ to E¹, and then finally to M and B, works.
Lastly, we shall meet many interesting friends and sights along our extension journey, such as Pascal, Fermat, and some probability.

8 Proposition 1 actually says that the area of a disk is equal to that of a right-angled triangle where the sides including the right angle are respectively equal to the radius and circumference of the disk.
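The squeezing idea of item (5) is easy to try numerically. The sketch below (our own illustration; the standard regular n-gon area formulas are used) reproduces Archimedes-style inner and outer bounds for the area of the unit disk.

```python
from math import sin, tan, pi

def inscribed_area(n):
    """Area of a regular n-gon inscribed in the unit circle."""
    return n * sin(2 * pi / n) / 2

def circumscribed_area(n):
    """Area of a regular n-gon circumscribed about the unit circle."""
    return n * tan(pi / n)

for n in (6, 12, 96, 1536):
    print(n, inscribed_area(n), circumscribed_area(n))
# The inner areas increase and the outer areas decrease toward pi = 3.14159...,
# the area of the unit disk; n = 96 is the polygon Archimedes actually used.
```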
Exercises 1.1.
1. Let k ∈ ℕ and let f(x) = xᵏ on an interval [0, b]. Compute the Lebesgue integral ∫₀ᵇ f using the definition in (1.3) by recognizing the lower and upper Lebesgue sums LP and UP as Riemann sums.

1.2. Probability, events, and sample spaces
After reading the last section, you might be wondering why we emphasized the abstract notion of a σ-algebra and integration on abstract spaces. Why can't we just focus on concrete sets like Euclidean space? In this section we answer this question and a lot more through probability. In fact, we show that abstract spaces and notions such as countable unions and intersections are not just abstract nonsense but are required in order to answer very concrete questions arising from simple probability examples.9
1.2.1. The beginnings of probability theory. Probability started to flourish in the year 1654, although there have been isolated writings on probability before that year. For example, Girolamo Cardano (1501–1576) wrote Liber de Ludo Aleae on games of chance around 1565, which can be considered the first textbook on the probability calculus10 (posthumously published in 1663), and Galileo Galilei (1564–1642) wrote Sopra le Scoperte dei Dadi on dice games around 1620. In the year 1654, the French writer Antoine Gombaud, Chevalier de Méré (1607–1684), asked Blaise Pascal (1623–1662) a couple of questions related to gambling. One problem can be called the dice problem and the other can be called the problem of points. The dice problem has to do with throwing two dice until you get a double six; more specifically,

9 If abstract spaces were of no use except for the sake of being abstract, then we should only talk about Lebesgue measure and integration on Euclidean space! This reminds me of a quote by George Pólya (1887–1985): "A mathematician who can only generalise is like a monkey who can only climb up a tree, and a mathematician who can only specialise is like a monkey who can only climb down a tree. In fact neither the up monkey nor the down monkey is a viable creature. A real monkey must find food and escape his enemies and so must be able to incessantly climb up and down. A real mathematician must be able to generalise and specialise." Quoted in [262].
How many times must you throw the dice in order to have a better than 50–50 chance of getting two sixes?

The problem of points (also called the division problem) deals with how to divide the stakes for an unfinished game; more specifically,
How should one divide the prize money for a fair game (that is, each player has an equal probability of winning a match) that was started but ended before a player won the money?


Pascal solved these problems in correspondence with Pierre de Fermat (1601–1665); you can read their letters in [358]. These problems had been around for many years before Pascal's time; for example, Cardano discussed the dice problem for the case of one die in his 1565 book Liber de Ludo Aleae. To my knowledge, the first published version of the problem of points was by the founder of accounting, Fra Luca Pacioli (1445–1517), in Summa de Arithmetica [303] in 1494 (see Problem 8 in Exercises 1.5 for Pacioli's problem). However, it is reasonable to say that probability theory as a mathematical discipline was developed through the discussions between Pascal and Fermat. We study the dice problem and the problem of points in Section 1.5.
The words "chance," "fair," and "probability" in the above descriptions of the dice problem and the problem of points are in quotes because they need to be defined; in other words, do we really know exactly what these questions are asking? In some sense these words should represent numbers from which you can make certain conclusions (e.g. that a game is fair). Naturally, whenever we speak of numbers we think of functions. In the context of probability, these functions are called probability measures, which we'll talk about in due time, and which assign numbers to the outcomes (or events) of a random phenomenon such as a game, for example. In this section we study events and in Sections 1.5 and 4.1 we study probability measures in great depth.
1.2.2. Sample spaces and events. Probability theory is the study of the mathematical models of certain random phenomena such as, for instance, what numbers land up when you throw two dice or what side of a coin is right-side up when you flip a coin. Whenever you conduct an experiment involving a random phenomenon, the most fundamental fact you need to know is all the possible outcomes of the random phenomenon. A set containing all the possible outcomes of the experiment is called a sample space for the given experiment.

10 "Calculus" conjures up images of limits, differentiation, and integration of functions. However, calculus has a much broader meaning (from Webster's 1913 dictionary): "A method of computation; any process of reasoning by the use of symbols; any branch of mathematics that may involve calculation." In this broader sense, there are many calculi in mathematics such as the probability calculus, variational calculus (calculus of variations); some of my research involves what are called pseudodifferential calculi.



Example 1.1. If you toss a coin once,11 a sample space is
X = {H, T},
where H represents that the coin lands with heads up and T with tails up. We could also use X = {0, 1} where (say) 1 represents heads and 0 tails.

Example 1.2. A sample space when you throw two dice is
Y = { (1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6),
      (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6),
      (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6),
      (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
      (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
      (6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6) },
or more succinctly, Y = {(1, 1), (1, 2), (1, 3), . . . , (6, 6)}, the set of all pairs (m, n), where m, n ∈ {1, 2, . . . , 6}. The numbers m and n represent the numbers that die 1 and die 2, respectively, show.

Example 1.3. What if you want to study the phenomenon of throwing the dice twice in a row; that is, throwing them, picking them up and then throwing them again? If Y = {(1, 1), . . . , (6, 6)} is all the possible outcomes for a single throw of the dice, a sample space for two throws is X = {(x1, x2) ; x1, x2 ∈ Y} = Y × Y, where the first entry x1 ∈ Y represents the roll of the dice on the first throw and the second entry x2 ∈ Y what happens on the second throw. Similarly, a sample space for throwing the dice n times in a row is Yⁿ = Y × Y × ⋯ × Y (n factors of Y). (Note that Y × Y can also represent the sample space for throwing four dice at once.)

Example 1.4. Now if we want to answer Antoine Gombaud Chevalier de Méré's dice problem, we are not told how many times one should throw the dice. In this situation one can use an idealized phenomenon of throwing the dice infinitely many times. In this case, a sample space is the set of all sequences of elements of Y = {(1, 1), (1, 2), (1, 3), . . . , (6, 6)}, which we can denote in various ways such as
X = {(x1, x2, x3, . . .) ; xi ∈ Y for all i} = Y × Y × Y × Y × ⋯ = ∏_{i=1}^∞ Y = Y^∞,
which we call the infinite product of Y with itself. Here, x1 represents the outcome on the first roll, x2 the outcome on the second roll, and so on.
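The finite sample spaces in Examples 1.1–1.3 are small enough to list explicitly; the sketch below (our own illustration, with names of our choosing) builds them and reports their sizes.

```python
from itertools import product

X_coin = ['H', 'T']                              # Example 1.1: one coin toss
Y = list(product(range(1, 7), repeat=2))         # Example 1.2: one throw of two dice
X_two_throws = list(product(Y, repeat=2))        # Example 1.3: the dice thrown twice

print(len(X_coin), len(Y), len(X_two_throws))    # prints 2, 36, 1296
# The idealized sample space of Example 1.4 (infinitely many throws) cannot be
# listed this way: it is the infinite product Y x Y x ..., an uncountable set.
```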

An event is a collection of possible outcomes of an experiment; in other words, an event E represents the outcome that one of the elements of E occurs as a result of the experiment.
11 The hand of my then 2-year-old daughter Melodie Loya.



Symbol        Set theory jargon    Probability theory jargon
X             (universal) set      sample space
X             (universal) set      certain event
∅             empty set            impossible event
A             subset of X          event that an outcome in A occurs
Aᶜ = X \ A    complement           event that no outcome in A occurs
A ∪ B         union                event that an outcome in A or B occurs
A ∩ B         intersection         event that an outcome in A and B occurs
A \ B         difference           event that an outcome in A and not in B occurs

Table 1. A set theory/probability theory dictionary.

Example 1.5. Let X = Y × Y, where Y = {(1, 1), (1, 2), (1, 3), . . . , (6, 6)}, which represents the sample space for the phenomenon of throwing two dice twice. Here are some examples of events.
(1) The trivial events are the two extremes: anything or nothing occurring. The event that anything can happen on the first and second throws is X = Y × Y, which is called the certain event, and the event that nothing happens on both throws is ∅, which is called the impossible event.
(2) The event that we throw a double six on at least one of the two throws is given by the subset
A = {(x1, x2) ∈ Y² ; xi = (6, 6) for some i = 1, 2}.
(3) The event that we throw a pair of odd numbers on at least one of the throws is
B = {(x1, x2) ∈ Y² ; x1 ∈ O or x2 ∈ O}, where O = {(m, n) ; m, n ∈ {1, 3, 5}}.
(4) We now consider events formed by performing set operations with A and/or B. First, the event that we do not throw a double six on either throw is
Aᶜ = Y² \ A = {(x1, x2) ∈ Y² ; xi ≠ (6, 6) for i = 1, 2}.
(5) The event that we throw either a double six or a pair of odd numbers on at least one throw is
C = A ∪ B.
(6) The event that we throw a double six on one of the throws and a pair of odd numbers on the other throw is
D = A ∩ B.
(7) Finally, the event that we throw at least one double six and we don't throw any odd pair is
E = A \ B.
See Table 1 for a dictionary of set theory/probability theory jargon. This simple example shows that the usual set operations (unions, differences, etc.) are an essential part of probability. As a side remark, in the next section we'll study the notion of a ring of sets, which is a collection of sets that is closed under unions, intersections, and differences. The above dictionary shows that the concept of a ring is a very natural object of study in probability.
Example 1.6. Now let X = Y × Y × ⋯, where Y = {(1, 1), (1, 2), (1, 3), . . . , (6, 6)}, which represents the phenomenon of throwing two dice infinitely many times.



Given a natural number n, what is the event that we throw a double six on the nth throw? It is
An = {(x1, x2, . . .) ; xn = (6, 6)} = Y × ⋯ × Y × {(6, 6)} × Y × ⋯,
where the set {(6, 6)} occurs in the nth position. Another question is: What is the event that we throw a double six at some point? It is
A = {(x1, x2, . . .) ; xi = (6, 6) for some i ∈ ℕ}.
Notice that
A = ∪_{n=1}^∞ An   (Event that we throw a double six at some point).
Hence, A is a countable union of sets.

Thus, the notion of countable union is an essential part of probability in the sense that countable unions result from very simple probability questions. In (4) of Example 1.5 we saw that forming complements is also an essential part of probability. (E.g. the event that we never throw a double six is the complement of the event that we do throw a double six at some point.) Conclusion: We see that the idea of studying σ-algebras is not a figment of the imagination but is required to study probability! This conclusion will be even more evident after we look at the concept of . . .
1.2.3. Infinitely often. Consider again the experiment of throwing two dice infinitely many times. What is the event that we throw a double six not just once, twice or even a finite number of times during any given infinite sequence of throws, but an infinite number of times? To answer this question, let's consider the following abstract problem: Let
(1.5)   A1, A2, A3, A4, . . . , An, . . . , Ak, . . .
be events in a sample space X; we are interested in those x ∈ X belonging to infinitely many An's. We denote the collection of such x with the special notation
{An ; i.o.} := {x ∈ X ; x belongs to infinitely many An's},
the set of x ∈ X that belong to the An's "i.o." or "infinitely often." Contemplating the sequence (1.5) we see that if x ∈ X happens to belong to infinitely many of the sets A1, A2, . . . then given any n ∈ ℕ, however large, we can find a k ≥ n such that x ∈ Ak; otherwise x would be confined to A1, A2, . . . , A_{n−1}, contrary to the fact that x belonged to infinitely many of A1, A2, . . .. To reiterate, we showed that
For all n = 1, 2, . . ., there exists a k ≥ n such that x ∈ Ak.
Transforming this into set theory language, the "for all" is really intersection and the "there exists" is really union; that is, x ∈ ∩_{n=1}^∞ ∪_{k≥n} Ak. Reversing this argument shows that if x ∈ ∩_{n=1}^∞ ∪_{k≥n} Ak, then x belongs to infinitely many of A1, A2, . . .. Thus, we have shown Part (1) of the following proposition.

Proposition 1.1. Let A1, A2, . . . be subsets of a set X.
(1) We have {An ; i.o.} = lim sup An, where
lim sup An := ∩_{n=1}^∞ ∪_{k=n}^∞ Ak.



(2) Now define
{An ; a.a.} := {x ∈ X ; x belongs to An for all but finitely many n's},
the set of x ∈ X that almost always belong to an An. Then
{An ; a.a.} = lim inf An,   where   lim inf An := ∪_{n=1}^∞ ∩_{k=n}^∞ Ak.

To prove the second equality, observe that if x belongs to all but finitely many of the sets in the list (1.5), we can choose a natural number n large enough so that, in the list (1.5), these finitely many sets lie to the left of An. Then x belongs to An, An+1, An+2, . . .; that is,

For some n = 1, 2, . . ., for all k ≥ n we have x ∈ Ak.

In set theory language, we have x ∈ ⋃_{n=1}^∞ ⋂_{k≥n} Ak. Reversing this argument shows that if x ∈ ⋃_{n=1}^∞ ⋂_{k≥n} Ak, then x belongs to all but finitely many of A1, A2, . . ..
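As a small computational illustration (not from the text), here is a Python sketch of Proposition 1.1 for an eventually periodic sequence of finite sets, where the tail unions and intersections stabilize after finitely many steps, so a finite truncation already gives the limiting answer. The sets and the horizon below are hypothetical choices for the example.

```python
def lim_sup(sets, horizon):
    """Intersection over n < horizon of the tail unions U_{k>=n} A_k."""
    return set.intersection(*[set().union(*sets[n:]) for n in range(horizon)])

def lim_inf(sets, horizon):
    """Union over n < horizon of the tail intersections I_{k>=n} A_k."""
    return set().union(*[set.intersection(*sets[n:]) for n in range(horizon)])

A = [{1, 2}, {2, 3}] * 10      # A_n alternates between {1, 2} and {2, 3}
print(lim_sup(A, horizon=10))  # {1, 2, 3}: each point lies in infinitely many A_n
print(lim_inf(A, horizon=10))  # {2}: only 2 lies in all but finitely many A_n
```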

Example 1.7. Consider again the sample space X = Y × Y × ··· representing the phenomenon of throwing two dice infinitely many times. Given a natural number n, let An be the event that we throw a double six on the nth throw:

An = Y × ··· × Y × {(6, 6)} × Y × ···,

where the set {(6, 6)} occurs in the nth position. Then the event that we throw infinitely many double sixes is just {An ; i.o.}. Hence, according to our proposition,

The event that we throw infinitely many double sixes = ⋂_{n=1}^∞ ⋃_{k=n}^∞ Ak.

On the other hand,

The event that all but a finite number of throws were double sixes = ⋃_{n=1}^∞ ⋂_{k=n}^∞ Ak,

which is the event that we throw only a finite number of non-double sixes.

Notice that the right-hand sides of the above events involve countable intersections and unions; again, the notion of σ-algebra pops out at us! To summarize: After looking at all the above examples, if you didn't know about σ-algebras already, you would be forced to invent them!

1.2.4. Bernoulli sequences, Monkeys, and Shakespeare. A Bernoulli trial is a random phenomenon that has exactly two outcomes, success and failure (or yes and no, pass and fail, etc.). For example, declaring heads to be a success and tails to be failure, flipping a coin is a Bernoulli trial. Say that we have two dice and we're interested in obtaining a double six. Then throwing two dice becomes a Bernoulli trial if we regard a double six as success and not obtaining a double six as failure. A sample space of a Bernoulli trial can be Y = {0, 1}, where 1 = success and 0 = failure.


We are mostly interested in an infinite sequence of trials such as, for example, flipping a coin infinitely many times.12 Such a sequence of trials can be realized as an infinite sequence of 0's and 1's:

H T H T T T H · · ·   corresponds to   (1, 0, 1, 0, 0, 0, 1, . . .).

Sequences of Bernoulli trials are called Bernoulli (or Bernoullian) sequences after Jacob (Jacques) Bernoulli (16541705) who studied them in his famous treatise on probability Ars conjectandi, published posthumously in 1713 by Jacobs nephew Nicolaus Bernoulli (16871759). Here is an interesting example of a Bernoulli sequence. We are given a typewriter that has separate keys for lower case and upper case letters and all the other dierent symbols (punctuation marks, numbers, etc. . . including a space). Lets take a sonnet of William Shakespeare (15641616), say

Shall I compare thee to a summer's day?


Here it is:
Shall I compare thee to a summers day? Thou art more lovely and more temperate: Rough winds do shake the darling buds of May, And summers lease hath all too short a date: Sometime too hot the eye of heaven shines, And often is his gold complexion dimmd; And every fair from fair sometime declines, By chance, or natures changing course, untrimmd; But thy eternal summer shall not fade, Nor lose possession of that fair thou owest; Nor shall Death brag thou wanderst in his shade, When in eternal lines to time thou growest; So long as men can breathe, or eyes can see, So long lives this, and this gives life to thee.

My word processor tells me there are a total of 632 symbols here (including spaces). Now lets do an experiment. We put a monkey in front of the typewriter,

have him hit the keyboard 632 times, remove the paper, put in a new paper, have him hit the keyboard 632 more times, remove the paper, etc. . . , repeating this process infinitely many times. We are interested in whether or not the monkey will ever type Shakespeare's sonnet 18. Here, a success is that he types it and a failure is that he doesn't type it. Thus, the sample space for this experiment is the Bernoulli sequence space

X = Y^∞ = {(x1, x2, x3, . . .) ; xi ∈ Y := {0, 1}},

where on the ith page, xi = 1 if the monkey successfully types sonnet 18 and xi = 0 if the monkey fails; e.g. an element of X is something like
( 0 , 0 , 0 , 0 , 1 , 0 , . . . ),
with the first four 0's and the final 0 labelled "fails" and the 1 labelled "success".

12Technically speaking, in an innite (or even nite) sequence of trials we require the probability of success to remain the same at each trial. However, we havent technically dened probability yet, so we wont worry about this technicality.


One might ask: given n ∈ N, what's the event that the monkey types Shakespeare's sonnet 18 on the nth try? The answer is

An = Y × ··· × Y × {1} × Y × Y × ···,

where {1} occurs in the nth factor. Here are some more questions one might ask, whose answers can be written in terms of the An's:

(1) Question: Given n ∈ N, what's the event that the monkey types Shakespeare's sonnet 18 at least once in the first n pages? Answer: ⋃_{k=1}^n Ak.
(2) Question: Given n ∈ N, what's the event that the monkey does not type Shakespeare's sonnet 18 in any of the first n pages? Answer:

( ⋃_{k=1}^n Ak )^c = {0} × {0} × ··· × {0} × Y × Y × Y × ···,

where the {0}'s occur in the first n factors.
(3) Question: What's the event that the monkey eventually types Shakespeare's sonnet 18? Answer: It is ⋃_{n=1}^∞ An.
(4) Question: What's the event that the monkey types Shakespeare's sonnet 18 infinitely many times? Answer: It is

{An ; i.o.} = ⋂_{k=1}^∞ ⋃_{n=k}^∞ An.

(5) Question: Finally, what's the event that the monkey types Shakespeare's sonnet 18 on all but finitely many trials? Answer: It is

{An ; a.a.} = ⋃_{n=1}^∞ ⋂_{k=n}^∞ Ak.
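Anticipating the probabilistic claims in the next paragraph, here is a rough back-of-the-envelope sketch (not from the text; the rigorous statements are proved in Sections 2.4 and 4.1). It assumes, purely for illustration, a keyboard with 90 equally likely keys, each page being 632 independent uniform keystrokes.

```python
# How unlikely a single "success" page is, under the hypothetical assumption of
# 90 equally likely keys and a 632-keystroke page.
import math

keys, length = 90, 632
log10_p = -length * math.log10(keys)        # log10 of p = (1/90)**632
print(f"p is about 10^{log10_p:.0f}")       # roughly 10^-1235
print(f"expected pages until the first success: about 10^{-log10_p:.0f}")
```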

Once we study probability measures, we can compute the probability of each of these events occurring. In particular, it is interesting to note that the probability is one that Shakespeare's sonnet 18 is typed infinitely many times (see Section 4.1)! On the other hand, it is basically impossible that the monkey will type Shakespeare's sonnet 18 in any reasonable finite amount of time (see Section 2.4)!

1.2.5. The measure problem for probability. Back in Section 1.1.3 we studied the measure problem for length. Before jumping into the abstract material in the next section we briefly study the corresponding problem for probability. In Section 1.5 we will give a thorough discussion on probability, so here we shall proceed intuitively and not worry about being precise. Below, when we use the term probability, it means what you think it means: It's a number between 0 and 1 measuring the likelihood that an event will occur. Consider the sample space for an experiment consisting of an infinite sequence of coin tosses:

Y^∞ = Y × Y × Y × Y × ···,

where Y = {0, 1} and 1 represents heads and 0 tails. The event "we throw a head on the first toss" (not caring what happens on the other tosses) is

E = {1} × Y × Y × Y × ···.


What is the probability of the event E occurring? Assuming a head or tail is equally likely to be thrown (= fair coin), the answer is 1/2. The event that we throw a head on the first toss and a tail on the third toss is

F = {1} × Y × {0} × Y × ···.

What is the probability of the event F occurring? After some thought, the answer should be 1/2 · 1/2 = 1/4. Now let k ∈ N and let Y1, Y2, . . . , Yk be nonempty subsets of Y and consider the event

(1.6)   Y1 × Y2 × Y3 × ··· × Yk × Y × Y × Y × Y × ···,

the event that an outcome in Y1 occurs on the first toss, an outcome in Y2 occurs on the second toss, . . ., and an outcome in Yk occurs on the kth toss. What is the probability of this event occurring? After some thought, the answer should be

(1.7)   (1/2)^ℓ,

where ℓ is the number of sets amongst Y1, . . . , Yk that equal {0} or {1}. A set of the form (1.6) is called a cylinder set. Why "cylinder set"? If we look at R^3 = R × R × R and consider sets A1, A2 ⊂ R, then A1 × A2 × R is a cylinder in R^3 extending above and below the set A1 × A2 in the plane as seen in the picture below. If we put Y = R in (1.6) and let the Yi's be subsets of R, then (1.6) would be an infinite-dimensional cylinder.

[Picture: the cylinder A1 × A2 × R in R^3 sitting above and below the rectangle A1 × A2 in the plane.]

If we denote the collection of cylinder sets by C, then we have a map

µ : C → [0, 1]

defined by assigning a nonempty cylinder set the number (1.7). Note that the empty set is also a cylinder set (just take Y1 = ∅); we then put µ(∅) := 0. By the way, the collection C has the properties of a semiring, to be discussed in the next section. Now that we know how to assign probabilities to cylinder sets via (1.7), the question is: Can we define the probability of an event that is not a cylinder set? For instance, what is the probability of tossing infinitely many heads? Given n ∈ N, the event of throwing a head on the nth toss is

An = Y × ··· × Y × {1} × Y × Y × ···,

where {1} occurs in the nth factor. Thus, the event of throwing infinitely many heads is

{An ; i.o.} = ⋂_{n=1}^∞ ⋃_{k=n}^∞ Ak.

This event is really quite complicated so it's not entirely obvious what the probability is! (You'll be able to prove that the probability is one after reading Section 4.1.) We are now in a position similar to what we talked about in Section 1.1.3 for length: We have to extend the function µ from C, where it's perfectly defined, to a σ-algebra containing C. The first step is to define µ on more complicated sets such as finite unions of cylinder sets. The collection of finite unions of cylinder sets has the structure of a ring. The next step is to extend µ further so that it's defined on a σ-algebra containing C. This extension process is very similar to


the extension process for length! The beauty of abstract measure theory is that it unites seemingly unrelated topics such as length on the one hand and probability on the other. Our next order of business is to understand semirings, rings, and σ-algebras, which we have already mentioned several times.
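Before the exercises, here is a tiny sketch (not from the text) of the cylinder-set assignment in formula (1.7): a cylinder over Y = {0, 1} is described by its first k factors, and its probability is (1/2) raised to the number of factors that pin down a single outcome.

```python
# Probability of a cylinder set, per formula (1.7); the factors are the sets
# Y_1, ..., Y_k (subsets of {0, 1}), and an empty factor gives the empty set.
from fractions import Fraction

def cylinder_prob(factors):
    if any(len(F) == 0 for F in factors):
        return Fraction(0)                       # the empty cylinder gets probability 0
    l = sum(1 for F in factors if len(F) == 1)   # factors equal to {0} or {1}
    return Fraction(1, 2) ** l

print(cylinder_prob([{1}]))               # 1/2 : head on the first toss
print(cylinder_prob([{1}, {0, 1}, {0}]))  # 1/4 : head on the 1st toss, tail on the 3rd
```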
Exercises 1.2.

1. (Gambler's ruin) Let Y = {0, 1} and let X = Y × Y × ··· be the sample space of tossing a coin infinitely many times where 1 represents heads and 0 tails. Suppose a gambler starts with an initial capital of $i where i ∈ N. (He is playing against a person with an infinite amount of money.) On each flip, the gambler bets $1 that a head is thrown; if it's a head he wins $1 and if it's a tails he loses $1. He plays until he goes broke. For each k = 1, 2, . . ., define Rk : X → R as follows: If x = (x1, x2, . . .) ∈ X,

(1.8)   Rk(x) := 1 if xk = 1,   Rk(x) := −1 if xk = 0,

which represents what is gained or lost on the kth toss.
(i) For each n ∈ N, define the net winning function Wn : X → R as follows: Given a sequence x = (x1, x2, . . .) of coin tosses,

(1.9)   Wn(x) := R1(x) + R2(x) + ··· + Rn(x),

where Rk is defined in (1.8). For n ∈ N, put

An = {i + Wn = 0} ∩ ⋂_{k=1}^{n−1} {i + Wk > 0},   n > 1,

and A1 = {i + W1 = 0}, where {i + Wn = 0} = {x ∈ X ; i + Wn(x) = 0} and {i + Wk > 0} = {x ∈ X ; i + Wk(x) > 0}. Explain why An is the event that the gambler goes broke on exactly the nth play.
(ii) Explain why ⋃_{n=1}^∞ An is the event that the gambler eventually goes broke.
(iii) The gambler aspires to gain $a where a > i is an integer. As soon as he reaches $a he quits (if he hasn't gone broke before reaching $a). Involving the Wn's, what is the event that he actually does reach $a? (If you're interested in the foolishness of gambling, please see Section 4.1.1 for an analysis of the folly of gambling.)

2. (Winning streaks) Let X = Y × Y × ···, where Y = {0, 1}, be the sample space for the experiment of tossing a coin infinitely many times. Let n ∈ N. Write down the event that a head is tossed (at some point) exactly n times in a row. What is the event that no run of heads more than n occurs in a sequence of coin tosses? Considering the gambler from the previous problem who bets on coin flips, what is the event that the gambler has a streak of exactly n wins in a row?

3. (Random walks) Suppose you put a particle at the origin. Each second thereafter, the particle moves one unit (either to the right or to the left). The path the particle follows is called a random walk.
(i) Let 0 denote a move one unit to the left and 1 a move one unit to the right. Let Y = {0, 1} and explain why X = Y × Y × Y × ··· represents a sample space for the random phenomenon of a particle undergoing a random walk.
(ii) A more traditional sample space is (see e.g., Feller's classic [130, Ch. 3]) the set X2 ⊂ {0, 1, 2, . . .} × Z, where

X2 = {(0, x0), (1, x1), (2, x2), . . . ; x0 = 0, xi+1 − xi ∈ {−1, 1} for i = 0, 1, 2, . . .}.

Explain why X2 can also be considered a sample space for random walks. Find a bijection between X and X2.
(iii) For n ∈ N, let Wn : X → R be defined as in (1.9) above and put W0 := 0. Let a ∈ Z and n ≥ 0. In terms of Wn, what is the event the particle is at the point a at time n? What is the event the particle visits the point a for the first time


at time n, where if a = 0, we mean the particle visits the origin for the first time after starting the particle's journey?
(iv) For a ∈ Z, what is the event the particle visits the point a infinitely many times?

1.3. Semirings, rings and σ-algebras

The set I^1 of left-half open intervals in R has a simple structure of a semiring, while the Borel sets have a more robust structure of a σ-algebra. A fundamental problem in measure theory is that of extending an additive set function on a basic class of subsets (e.g. m on I^1) to a σ-algebra containing the basic class (e.g. B, the Borel sets). For this reason, the purpose of this section is to understand these two classes, and classes that lie in between.

1.3.1. Semirings. Let us start by noting some basic properties of

I^1 = {(a, b] ; a, b ∈ R},   where (a, b] = ∅ if b ≤ a.

First, ∅ ∈ I^1 as just noted. Also, if I, J ∈ I^1, then I ∩ J ∈ I^1, for

(a1, b1] ∩ (a2, b2] = (a, b],   where a = max{a1, a2} and b = min{b1, b2},

as can be verified. Finally, observe that if I, J ∈ I^1, then I \ J is a union of pairwise disjoint sets in I^1, for

(a1, b1] \ (a2, b2] = (a1, b] ∪ (a, b1],   where b = min{b1, a2} and a = max{a1, b2},

which can be verified by drawing pictures like this one (of course, one can prove this rigorously too):

[Picture: the intervals (a1, b1] and (a2, b2] drawn overlapping on the real line, with a1 < a2 < b1 < b2.]

Figure 1.7. In this situation, we see that (a1, b1] ∩ (a2, b2] = (a2, b1] and (a1, b1] \ (a2, b2] = (a1, a2].

To summarize: I^1 contains ∅, is closed under intersections, and the difference of two sets in I^1 can be written as a union of pairwise disjoint sets in I^1. Generalizing these properties, we arrive at the following definition, due to John von Neumann [408, p. 85]: A collection of subsets I of a set X is called a semiring if
(1) ∅ ∈ I;
(2) If A, B ∈ I, then A ∩ B ∈ I;
(3) If A, B ∈ I, then there are finitely many pairwise disjoint sets A1, . . . , AN in I (for some N ∈ N) such that A \ B = ⋃_{n=1}^N An.
We can replace the last statement with the following equivalent one: If A, B ∈ I and B ⊂ A, then B is part of a partition of A in the sense that there are pairwise disjoint sets A1, . . . , AN in I such that A1 = B and A = ⋃_{n=1}^N An; see Problem 1. Finally, we can generalize Property (3) above to the following (also see Problem 1): If A, I1, . . . , In ∈ I, then there are finitely many pairwise disjoint sets J1, . . . , JN in I such that
(1.10)   A \ ⋃_{k=1}^n Ik = ⋃_{k=1}^N Jk.


Note that Property (3) is exactly this statement with n = 1. One can prove (1.10) using an induction argument on n. We shall use (1.10) in Lemma 1.3 below.
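As a quick sanity check (not from the text), the two displayed I^1 computations above, the intersection formula and the two-piece difference formula, can be coded directly. An interval (a, b] is stored as the pair (a, b), with (a, b] = ∅ whenever b ≤ a.

```python
# Intersection and difference of left-half open intervals, per the formulas above.
def intersect(I, J):
    (a1, b1), (a2, b2) = I, J
    return (max(a1, a2), min(b1, b2))

def difference(I, J):
    (a1, b1), (a2, b2) = I, J
    pieces = [(a1, min(b1, a2)), (max(a1, b2), b1)]
    return [(a, b) for a, b in pieces if b > a]    # drop empty pieces

print(intersect((0, 5), (3, 8)))   # (3, 5)          i.e. (3, 5]
print(difference((0, 5), (3, 8)))  # [(0, 3)]        i.e. (0, 3]
print(difference((0, 8), (3, 5)))  # [(0, 3), (5, 8)] two disjoint pieces
```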
Example 1.8. Here are two simple examples of semirings. Let X = {a, b, c} be a set consisting of three elements and let I = {∅, {a}, {b, c}, X}. It is easy to check that I is a semiring. Another example of a semiring is I = {∅, {a}, {b}, {c}, X}.

Example 1.9. Here's a nonexample: The set of all open intervals is not a semiring. Here, Condition (3) of a semiring is not satisfied because e.g. (0, 2) \ (0, 1) = [1, 2) cannot be written as a finite union of pairwise disjoint open intervals. Similarly, the set of all closed intervals is not a semiring.

Example 1.10. However, the set of all bounded intervals (open, closed, and half-open ones) is a semiring. (Exercise!) This semiring is a little too large for our taste, so we like to focus on just the left-half open intervals I^1.

Define I^n as the collection of all left-half open boxes13 in R^n:

I^n = {(a1, b1] × ··· × (an, bn] ; ai, bi ∈ R}.

Here's a picture of such a box when n = 2 (in which case the 2-dimensional box is, of course, called a rectangle):
[Picture: the rectangle (a1, b1] × (a2, b2] in the plane, with its left and bottom edges drawn dotted.]

Figure 1.8. Here we drew dotted lines to emphasize lines not part of the rectangle; in the future we shall be careless and usually draw the boxes with solid lines. (E.g. see Figure 1.9 in a few pages.)

We've shown that I^1 is a semiring; is it true that I^n is a semiring for any n? The answer is yes, which follows from the following.

Products of semirings

Proposition 1.2. If I1, . . . , IN are semirings, then so is the product

I1 × ··· × IN := {A1 × ··· × AN ; A1 ∈ I1, . . . , AN ∈ IN}.

In particular, I^n = I^1 × ··· × I^1 (n factors of I^1) is a semiring.

Proof: For notational simplicity, we prove this result for only two semirings. Thus, we show that if I and J are semirings, then I × J is also a semiring. Since ∅ = ∅ × ∅, we have ∅ ∈ I × J. So, we are left to verify Conditions (2) and (3) of a semiring.
13Geometrically, an element of I n is not a left-half open box as seen in the picture; its not just open on the left, its open at the left and at the bottom! Elements of I n should be called products of left-half open intervals, but the name left-half open boxes has stuck with me.


Let A, B ∈ I × J. Then,

A = C × D   and   B = E × F,

where C, E ∈ I and D, F ∈ J. By definition of set intersection, it's straightforward to show that

A ∩ B = (C ∩ E) × (D ∩ F).

Since I and J are semirings, C ∩ E ∈ I and D ∩ F ∈ J. Thus, A ∩ B ∈ I × J. Suppose now that B ⊂ A; we need to show that B is part of a partition of A. Since B ⊂ A, we have E ⊂ C and F ⊂ D. As I and J are semirings, we can write

C = ⋃_n Cn,   D = ⋃_m Dm,

where the unions are finite, the Cn ∈ I are disjoint for different n's, the Dm ∈ J are disjoint for different m's, and where C1 = E, D1 = F. By properties of the Cartesian product, we have

A = C × D = ⋃_{n,m} Cn × Dm.

Since the Cn's are pairwise disjoint and the Dm's are pairwise disjoint, it follows that the sets Cn × Dm are disjoint for different (n, m). Moreover, C1 × D1 = E × F = B. Hence, A is a union of pairwise disjoint sets in I × J which contains B as one of the sets. Our proof is complete.

Let A1 = (a1, b1] and A2 = (a2, b2] be in the semiring I^1 as in the picture:

[Picture: the overlapping intervals (a1, b1] and (a2, b2] on the real line, with a1 < a2 < b1 < b2.]

Observe that although A1 and A2 are not disjoint in this picture, we can write the union A1 ∪ A2 as a union of disjoint sets in I^1: A1 ∪ A2 = B1 ∪ B2 ∪ B3, where B1 = (a1, a2], B2 = (a2, b1], B3 = (b1, b2]. In other words, the union A1 ∪ A2 of elements of I^1 can be replaced by pairwise disjoint elements that have the same union. The following lemma says that this property holds (even for countable unions) for any semiring and is one of the fundamental properties of semirings.

Fundamental lemma of semirings

Lemma 1.3. If {An} are countably many sets in a semiring I, then for each n, there are finitely many semiring elements Bn1, Bn2, . . . ⊂ An such that the Bnm's are pairwise disjoint (that is, Bn1m1 and Bn2m2 are disjoint when (n1, m1) ≠ (n2, m2)) and

⋃_n An = ⋃_{n,m} Bnm.

Proof: Given such a countable collection {An}, the trick is to replace this collection with a collection of pairwise disjoint sets having the same union. Here's the (standard) way to do so: Define a sequence of sets {Bn} by

B1 = A1,   B2 = A2 \ A1,   B3 = A3 \ (A1 ∪ A2),


and in general,

Bn = An \ (A1 ∪ A2 ∪ ··· ∪ An−1).

Here's a picture of the first few steps (where the Ak's are rectangles):

[Picture: three overlapping rectangles A1, A2, A3 together with the disjoint pieces B1 = A1, B2 = A2 \ A1, B3 = A3 \ (A1 ∪ A2).]

Note that Bn need not be in the semiring since semirings need not be closed under unions and set differences. We claim that
(1) Bn ⊂ An for each n.
(2) The Bn's are pairwise disjoint.
(3) A = ⋃_n Bn, where A = ⋃_n An.
Indeed, Bn is a subset of An by definition of Bn. To prove (2) note that if m < n, then Bn by definition does not contain any points of Am while we already know from (1) that Bm ⊂ Am; hence Bn ∩ Bm = ∅. To prove (3), note that since Bn ⊂ An for all n, we have ⋃_n Bn ⊂ A. To prove that A ⊂ ⋃_n Bn, let x ∈ A = ⋃_n An. Then x ∈ Ak for some k. By the well-ordering principle of N, we may assume that k is the smallest natural number such that x ∈ Ak. Then x ∉ A1, . . . , Ak−1 and hence x ∈ Bk by definition of Bk. Thus, x ∈ ⋃_n Bn, which proves (3). Now, by (1.10) we can write Bn as a union Bn = ⋃_m Bnm where this union is finite and where the Bnm ∈ I are pairwise disjoint. Unioning over n we get A = ⋃_n Bn = ⋃_{n,m} Bnm, exactly as we wanted.
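Here is a small sketch (not from the text) of the disjointification step B1 = A1, Bn = An \ (A1 ∪ ··· ∪ An−1): the resulting sets are pairwise disjoint and have the same union as the original ones.

```python
# Disjointify a list of sets: B_n = A_n minus everything seen so far.
def disjointify(sets):
    out, seen = [], set()
    for A in sets:
        out.append(set(A) - seen)
        seen |= set(A)
    return out

A = [{1, 2, 3}, {2, 3, 4}, {4, 5}]
B = disjointify(A)
print(B)                                     # [{1, 2, 3}, {4}, {5}]
print(set().union(*A) == set().union(*B))    # True: the unions agree
```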

1.3.2. Rings and σ-algebras. Next in the hierarchy of classes of subsets are rings. A nonempty collection of subsets R of a set X is called a ring of subsets of X if for all A, B ∈ R,
(1) A ∪ B ∈ R (that is, R is closed under unions);
(2) A \ B ∈ R (that is, R is closed under differences).
Since ∅ = A \ A, it follows that ∅ ∈ R. Moreover, we claim that a ring is closed under intersections. Indeed, given two sets A, B, we have the formula A ∩ B = A \ (A \ B), as can be verified. Therefore, a ring is closed under intersections. Hence, a ring is a nonempty collection of subsets that is closed under unions, intersections, and set differences. It follows that a ring is a semiring, but not vice versa in general. We remark that by induction, one can show that rings are closed under finite intersections and unions, that is,

If A1, . . . , AN ∈ R, then ⋃_{n=1}^N An ∈ R and ⋂_{n=1}^N An ∈ R.

Finally, if you might be wondering how our rings relate to the rings youve seen in abstract algebra, see Problem 12.
Example 1.11. Simple examples of rings are {∅, X} and P(X), where P(X) is the power set of X. One of the most important examples is the ring E^n of elementary figures (or sets) in R^n. Such a set is by definition a finite union of left-half open boxes; see Figure 1.9. Theorem 1.5 below shows that E^n is a ring. In fact, when you think of a ring, you should think of finite unions of sets.


Figure 1.9. Elementary figures are just unions of left-half open boxes.

We now prove that any collection of sets is contained in a smallest ring.

Theorem 1.4. Let A be any collection of subsets of a given set. Then there exists a unique smallest ring containing A, where "smallest" means, by definition, that the ring contains A and the ring is contained in any ring that contains A. This smallest ring is called the ring generated by A and is denoted by R(A).

Proof: We leave uniqueness to you. To prove existence we use a standard trick widespread in mathematics: We intersect all sets having the property we want! (For example, you use this trick in abstract algebra to find an ideal generated by a set.) In our case, given the collection A, we intersect all rings containing A; thus, let us define

R(A) := ⋂_{A ⊂ R} R,

where the intersection is over all rings R with A ⊂ R. Since the power set P(X) is a ring that contains A, the right-hand intersection is non-empty. By construction, R(A) is contained in every ring containing A. It remains to prove that R(A) is a ring. To this end, let A, B ∈ R(A), which means that A, B ∈ R for all rings R that contain A. Since a ring is closed under unions and differences, it follows that A ∪ B ∈ R and A \ B ∈ R for all rings R that contain A. This implies that A ∪ B ∈ R(A) and A \ B ∈ R(A), which shows that R(A) is a ring.
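The intersection trick above is non-constructive, but for a finite collection of subsets of a finite set the generated ring can also be obtained concretely by closing the collection under unions and differences until nothing new appears; this produces the same smallest ring. A minimal sketch (not from the text):

```python
# Close a finite collection of subsets under unions and differences.
def generated_ring(collection):
    ring = {frozenset()} | {frozenset(s) for s in collection}
    changed = True
    while changed:
        changed = False
        for a in list(ring):
            for b in list(ring):
                for c in (a | b, a - b):
                    if c not in ring:
                        ring.add(c)
                        changed = True
    return ring

print(sorted(map(sorted, generated_ring([{'a'}, {'b'}]))))
# [[], ['a'], ['a', 'b'], ['b']] : the ring generated by {a} and {b}
```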

Recall that E^n denotes the collection of elementary figures in R^n, which is the collection of all finite unions of left-half open boxes. The following result says that the elementary figures is the ring generated by I^n.

Theorem 1.5. If I is a semiring, then A ∈ R(I), the ring generated by I, if and only if A is a finite union of sets in I, if and only if A is a finite union of pairwise disjoint sets in I.
Proof: By the fundamental lemma of semirings (Lemma 1.3), the last two statements are equivalent so we just have to prove the first "if and only if". Let U be the collection of all finite unions of sets in I; we need to prove that R(I) = U. Since a ring is closed under taking finite unions and I ⊂ R(I), it follows that R(I) contains all finite unions of sets in I; that is, U ⊂ R(I). Since I ⊂ U, if we show that U is a ring, then we must have U = R(I) since R(I) is the smallest ring containing I. The set U is closed under unions by definition. To see that U is closed under differences, let A, B ∈ U. Then A = ⋃ An and


B = ⋃ Bm are finite unions of sets in I, and so by properties of sets,

A \ B = ⋃_n An \ ⋃_m Bm = ⋃_n ( An \ ⋃_m Bm ).

Since I is a semiring and An, Bm ∈ I, by (1.10) we can write

An \ ⋃_m Bm = ⋃_k Cnk,

a finite union of (pairwise disjoint) sets Cnk ∈ I. Hence,

A \ B = ⋃_n ( An \ ⋃_m Bm ) = ⋃_n ⋃_k Cnk = ⋃_{n,k} Cnk

is a finite union of sets in I, and so, U is closed under differences.

We now discuss the important σ-algebras: A σ-algebra (also called a σ-field) of subsets of a set X is a collection S of subsets with the following properties:14
(1) ∅ ∈ S;
(2) If An ∈ S, n = 1, 2, . . ., then ⋃_{n=1}^∞ An ∈ S;
(3) If A ∈ S, then A^c = X \ A ∈ S.
We claim that σ-algebras are also rings; to prove this we just have to prove closure under set differences. However, if A, B belong to a σ-algebra S, then by the De Morgan laws, we have A \ B = A ∩ B^c = (A^c ∪ B)^c. The right-hand side is in S since S is closed under unions and complements, therefore A \ B ∈ S. Recall that a ring is closed under finite intersections; for a σ-algebra, countable intersections are allowed.

Lemma 1.6. A σ-algebra is closed under countable intersections.
Proof: Given An, n = 1, 2, . . . in a σ-algebra S, we need to show that A := ⋂_{n=1}^∞ An ∈ S. To see this, observe that by the De Morgan laws,

A^c = ( ⋂_{n=1}^∞ An )^c = ⋃_{n=1}^∞ An^c.

Since An ∈ S for each n and S is closed under complements and countable unions, we see that A^c ∈ S; it follows that A = (A^c)^c also belongs to S.

Almost the exact same proof used in Theorem 1.4 establishes the following more general result. The last statement of the theorem shall be left as an exercise.
14Note that the countably infinite union ⋃_{n=1}^∞ An in (2) also covers finite unions because a finite union A1 ∪ ··· ∪ Ak can be written as a countably infinite union ⋃_{n=1}^∞ An where we put An = ∅ for n = k + 1, k + 2, . . ..



[Diagram: nested collections, with σ-algebras inside rings inside semirings.]

Figure 1.10. Every σ-algebra is a ring, and every ring is a semiring.

Theorem 1.7. Given any collection of subsets A of a set, there exists a unique smallest σ-algebra containing A, called the σ-algebra generated by A and denoted by S(A). Moreover, we have

S(A) = S(R(A));

that is, the σ-algebra generated by A equals the σ-algebra generated by the ring generated by A.

Example 1.12. Trivial examples of σ-algebras are {∅, X} and P(X) for any set X. For our purposes, the most important examples of σ-algebras are the Borel sets in R^n, which is the σ-algebra generated by I^n and is discussed in Section 1.4, and the σ-algebra generated by the cylinder sets of a sequence of random experiments, which is the topic for the next section.

Semirings are (usually) very simple objects, elements of the ring generated by the semiring are slightly more complicated, being unions of elements of the semiring, while elements of σ-algebras can take on imaginative shapes since they can involve countable unions, intersections and complements of elements of the semiring. For example, let X be a nonempty set and let I denote the collection of all singletons of X together with the empty set; so, A ∈ I means A = ∅ or A = {x} where x ∈ X. It's easy to check that I is a semiring (see Problem 7). Now consider the picture:

[Picture: from left to right, a single point (in I), finitely many points (in R(I)), and a countable, densely packed set of points (in S(I)).]

The left represents a singleton set {x} ∈ I, consisting of just a single point of X. In the middle is an element of the ring R(I), a finite union of singletons, which is just a finite subset of X, and finally, on the right is a countable number of points in X (densely packed together to make them look continuous), which is an element of the σ-algebra S(I). In Figure 1.10 we give a summary of the relationships between the various collections of sets that we have so far introduced.

Since Theorem 1.5 gives a precise description of elements of R(I), the ring generated by a semiring I, you may be wondering if there is a similar descriptive theorem for S(I). Certainly if a set can be obtained from I by taking at most countably many combinations of unions, intersections, and/or complements, then the set is in S(I) because S(I) is closed under such operations. The converse statement (any set in S(I) can be obtained from I by taking at most countably many combinations of unions, intersections, and/or complements) is false, but can be made true using transfinite induction, a subject not a prerequisite for


reading this book! However, it's useful to think of S(I) as exactly those sets obtained in this way just to have a mental image of S(I). Before moving to our next subject, sequence space, we prove the following proposition (which is false for semirings; see Problem 6). It says that functions can always push-forward rings and σ-algebras from one set to another.

Proposition 1.8. Let A be either a ring or σ-algebra of subsets of a set X. Then given any set Y and function f : X → Y, the collection

Af := {A ⊂ Y ; f^{-1}(A) ∈ A}

is a class of subsets of Y of the same type as A.

Proof: Assume that A is a σ-algebra; the ring case is similar. We shall prove that Af is a σ-algebra. Since f^{-1}(∅) = ∅ ∈ A, we have ∅ ∈ Af. Let A1, A2, . . . be sets in Af, so that f^{-1}(An) ∈ A for each n. Since by set theory,

f^{-1}( ⋃_{n=1}^∞ An ) = ⋃_{n=1}^∞ f^{-1}(An),

and A is closed under countable unions, it follows that f^{-1}( ⋃_{n=1}^∞ An ) ∈ A, which implies ⋃_{n=1}^∞ An ∈ Af by definition of Af. If A ∈ Af, then f^{-1}(A) ∈ A by definition of Af, and by basic set theory, we have

f^{-1}(Y \ A) = X \ f^{-1}(A).

Since A is closed under complements, X \ f^{-1}(A) ∈ A, so f^{-1}(Y \ A) ∈ A, which implies Y \ A ∈ Af. Thus, Af is a σ-algebra.

1.3.3. Sequence space. When we discussed the measure problem for probability in Section 1.2.5 we stated that the cylinder sets form a semiring. We now prove this in a slightly more general situation than just for Bernoulli sequences. Let X1, X2, . . . be sample spaces and let X be the set of all infinite sequences:

X = {(x1, x2, x3, . . .) ; xi ∈ Xi for all i},

which can be denoted by

X1 × X2 × X3 × X4 × ···   or   ∏_{i=1}^∞ Xi.

X is called a sequence space, which models infinitely many experiments performed in sequence (where X1 is the sample space of the first experiment, X2 the second, etc.). If X1 = X2 = X3 = ··· equal the same sample space, say Y, then X = Y^∞, the infinite product of Y with itself, and Y^∞ represents a model for an experiment repeated an infinite number of times as we discussed in Section 1.2. For example, if Y = {(j, k) ; j, k = 1, . . . , 6}, then Y^∞ is a sample space for throwing two dice an infinite number of times, if Y = {0, 1}, then Y^∞ is the space of Bernoulli sequences, and if Y = (0, 1], then Y^∞ is a sample space for picking an infinite sequence of points in (0, 1] at random. For each i ∈ N, let Ii be a semiring of subsets of Xi and suppose that Xi ∈ Ii for each i. A cylinder set generated by I1, I2, . . . is a subset C ⊂ X of the form

C = A1 × A2 × ··· × An × Xn+1 × Xn+2 × Xn+3 × ···

for some n ∈ N and events A1 ∈ I1, A2 ∈ I2, . . . , An ∈ In. Thus, C represents the event that A1 occurs on the first trial, A2 occurs on the second trial, . . ., and


An occurs on the nth trial, and anything can happen on all the trials after the nth one. We can also represent a cylinder set by

C = A × Xn+1 × Xn+2 × Xn+3 × ···,   for some n,

where A ∈ I1 × ··· × In, with I1 × ··· × In defined in Proposition 1.2. We denote the collection of cylinder sets generated by I1, I2, . . . by C.

Cylinder sets

Proposition 1.9. C forms a semiring and the ring generated by C consists of all subsets of ∏_{i=1}^∞ Xi of the form

A × Xn+1 × Xn+2 × ···

for some n ∈ N where A ∈ R(I1 × ··· × In).

Proof: Because this proof would be rather long, we shall only prove the semiring statement and leave the ring statement as Problem 2. It's easy to see that C contains the empty set. To prove the other two conditions of a semiring are satisfied, the idea is to apply Proposition 1.2 for the product of finitely many semirings. (The intersection condition is easy, and you should be able to prove it without using Proposition 1.2.) We reduce to the finite case as follows. Let A, B ∈ C and write, for some n, m,

A = A′ × Xn+1 × Xn+2 × Xn+3 × ···   and   B = B′ × Xm+1 × Xm+2 × Xm+3 × ···,

where A′ ∈ I1 × ··· × In and B′ ∈ I1 × ··· × Im. Note that we can assume m = n. Indeed, if e.g. we had m < n, then we can define B′′ = B′ × Xm+1 × Xm+2 × ··· × Xn, and then

B = B′ × Xm+1 × Xm+2 × Xm+3 × ··· = B′′ × Xn+1 × Xn+2 × ···,

so we can work with B′′ instead of B′. Thus, we may assume that

A = A′ × Xn+1 × Xn+2 × ···   and   B = B′ × Xn+1 × Xn+2 × ···,

where A′, B′ ∈ I1 × ··· × In. Now, since I1, . . . , In are semirings, so is the product I1 × ··· × In by Proposition 1.2, a fact we'll use below. Observe that

A ∩ B = (A′ ∩ B′) × Xn+1 × Xn+2 × ···.

By Property (2) of a semiring, we know that A′ ∩ B′ ∈ I1 × ··· × In, so A ∩ B ∈ C. Also observe that

A \ B = (A′ \ B′) × Xn+1 × Xn+2 × ···.

By Property (3) of a semiring we know that A′ \ B′ is the union of some pairwise disjoint sets A′1, . . . , A′N ∈ I1 × ··· × In. It follows that

A \ B = ⋃_{k=1}^N Ak,

where Ak = A′k × Xn+1 × Xn+2 × ···, and where A1, . . . , AN ∈ C are pairwise disjoint. This completes our proof.
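As a tiny illustration (not from the text) of the "pad with whole factors" trick used in the proof, here is a sketch over Y = {0, 1}: a cylinder set is stored as its finite prefix, a tuple of subsets of Y, with all later coordinates implicitly equal to Y.

```python
# Intersecting two cylinder sets by padding their prefixes to a common length.
Y = frozenset({0, 1})

def pad(prefix, n):
    return tuple(prefix) + (Y,) * (n - len(prefix))   # append whole factors Y

def intersect(A, B):
    n = max(len(A), len(B))
    A, B = pad(A, n), pad(B, n)
    return tuple(a & b for a, b in zip(A, B))         # coordinatewise intersection

A = (frozenset({1}),)                 # "heads on the 1st toss"
B = (Y, Y, frozenset({0}))            # "tails on the 3rd toss"
print(tuple(set(f) for f in intersect(A, B)))   # ({1}, {0, 1}, {0})
```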


We remark that the abstract notion of a semiring captures, at the same time, the essential properties of intervals or boxes (dealing with Euclidean space) on the one hand, and cylinder sets (dealing with probability) on the other, two seemingly distinct concepts! This shows that the power to abstract is, well, powerful.
Exercises 1.3. 1. (a) Prove that Condition (3) of a semiring is equivalent to the following statement: If A, B I and B A, then there are pairwise disjoint sets A1 , . . . , AN in I such that A1 = B and A = N n=1 An . (b) Generalize (a) as follows. Let A1 , . . . , An , A be members of a semiring I where A1 , . . . , An are pairwise disjoint and are all subsets of A. Prove that there are sets An+1 , . . . , AN I such that A1 , . . . , An , An+1 , . . . , AN are pairwise disjoint, and A= N k=1 Ak . Suggestion: If all else fails, remember your old friend induction. (c) If A, I1 , . . . , In I , prove that there are nitely many pairwise disjoint sets N J1 , . . . , JN in I such that A \ n k=1 Ik = k=1 Jk . 2. Prove Theorem 1.7 and complete the proof of Proposition 1.9. 3. Let A be either a semiring, ring or -algebra of subsets of a set X and let Y X . Prove that the restriction of A to Y , {A Y ; A A }, is a class of the same type as A . (Just choose one class to work with the proofs have the same avor.) 4. Let A R be a nonempty set and let I be the collection of all left-half open intervals with both end points in A. Prove that I is a semiring. 5. Let A be a nonempty collection of subsets of some set and let I denote the collection of nite intersections A1 A2 where for each n, either An A or Ac n A such that An A for at least one n. Prove that I is a semiring. 6. (i) Give an example of two semirings whose intersection is not a semiring. (ii) Give an example showing that Proposition 1.8 is false for semirings. (We remark that the fact that the intersection of rings remains a ring and the intersection of -algebras remains a -algebra were the main ingredients in the proofs of Theorems 1.4 and 1.7, respectively. Thus, in general, given a set A there may not be a smallest semiring containing A .) 7. Here are more examples of various classes. Let X be a nonempty set. (a) Given A X , let A = {A} be a single element collection of subsets of X . What are R (A ) and S (A )? Answer the same question when A = {A, B } where A, B X . (b) Assume X is innite and let R be the collection of subsets A X such that either A is nite or Ac is nite. Show that R is a ring but not a -algebra. (c) Assume X is uncountable and let S be the collection of subsets A X such that either A is countable or Ac is countable. Show that S is a -algebra. Is it ever true that S = P (X )? To answer this question, assume that X has an uncountable subset whose complement is uncountable.15) (d) Let I be the collection consisting of the empty set and all singleton sets of X (sets of the form {x} with x X ). (i) Prove that I is a semiring and if X has more than one element, show that I is not a ring. (ii) Prove that R (I ) consists of all nite subsets of X . (iii) Prove that S (I ) is the -algebra described in (c). 8. Given a collection of subsets A of a set, prove that the -algebra generated by A equals the union of the -algebras generated by countable subsets of A ; that is, if D is the collection of countable subsets of A , prove that S (A ) = Suggestion: Prove that
15 BD BD

S (B ).

S (B ) is a -algebra containing A .

Assuming certain facts about cardinality (see e.g. [10, p. 78]) one proof goes as follows: There is a bijection f : X {0, 1} X (in fact, between X Y and X for any set Y whose cardinality is not greater than X ). The set f (X {0}) X is uncountable and so is its complement.


9. Let F be any nonempty collection of functions on a nonempty set such that if f, g F , then max{f, g } and min{f, g } are also in F . Show that the system of left-half open function intervals IF , consisting of sets of the form (f, g ] := {(x, t) X R ; f (x) < t g (x)} where f, g F with f g , is a semiring. 10. Show there is no -algebra containing a countably innite number of sets. 11. In this problem we show that preimages behave very nicely (cf. Proposition 1.8). Let A be either a ring or -algebra of subsets of a set Y . (a) Given a function f : X Y , prove that f 1 (A ) = {f 1 (A) ; A A } is a class of subsets of X of the same type as A . (b) It can turn out that f 1 (A ) can have a more robust structure than A . For instance, nd a function f and a ring R such that f 1 (R ) is a -algebra. (c) Find a function f and a semiring A such that f 1 (A ) is not a semiring. (d) Unfortunately, images dont behave nicely. Find a function f and sets A and B such that f (A B ) = f (A) f (B ). Find a function f and a ring R such that f (R ) is not a ring. 12. (Cf. Problem 5 in Exercises 2.1.) Let R be a ring of sets. For sets A, B R , dene addition and multiplication of the two sets by, respectively, A + B := (A \ B ) (B \ A), A B := A B.

(The right-hand side of A + B is called the symmetric dierence of A and B and is usually denoted by AB .) With these operations, prove that R is a commutative ring in the algebraic sense of the word (in particular, you need additive and multiplicative identities). Also prove that R has the following properties: A A = A, A+A =0 for all A R .

Any ring with these properties is called a Boolean ring.

1.4. The Borel sets and the principle of appropriate sets

In this section we study one of the most important σ-algebras, the Borel sets in R^n, named after Émile Borel (1871–1956). We also introduce the principle of appropriate sets, which is used throughout measure/integration theory. Finally, we show how to define Borel sets in an arbitrary topological space.

1.4.1. The Borel subsets of R^n. In Borel's 1898 book [49] he discussed properties of measure and measurable sets in the interval [0, 1] (taken from [173, p. 103]):
When a set is formed of all the points comprised in a denumerable infinity of intervals which do not overlap and have total length s, we shall say that the set has measure s. When two sets do not have common points and their measures are s and s′, the set obtained by uniting them, that is to say their sum, has measure s + s′. More generally, if one has a denumerable infinity of sets which pairwise have no common point and having measures s1, s2, . . ., sn, . . ., their sum . . . has measure s1 + s2 + ··· + sn + ···. All that is a consequence of the definition of measure. Now here are some new definitions: If a set E has measure s and contains all the points of a set E′ of which the measure is

Émile Borel (1871–1956).


s′, the set E − E′, formed of all the points of E which do not belong to E′, will be said to have measure s − s′ . . . The sets for which measure can be defined by virtue of the preceding definitions will be termed measurable sets by us . . .

At the beginning of the quote, Borel is discussing sets that are countable unions of intervals, then in the new definition later on he talks about differences of sets. Thus, the sets he is working with are in the σ-algebra generated by the intervals, which in honor of Borel is nowadays called the Borel sets: The Borel subsets B^n of R^n is, by definition, the σ-algebra generated by I^n, where recall that I^n is the left-half open boxes. For n = 1, we denote B^1 by B. Since E^n is the ring generated by I^n, by Theorem 1.7 of the last section we can also define B^n as the σ-algebra generated by E^n. Here's a picture to consider:

[Picture: from left to right, a single left-half open box (in I^n), a finite union of boxes (in R(I^n)), and an open blob (in S(I^n)).]

On the left is an element of I^n, a single left-half open box. In the middle is an element of the ring R(I^n), a finite union of left-half open boxes, and finally, on the right is a blob; let's say it's an open subset of R^n. Since we'll prove in Section 1.4.3 that open subsets are Borel sets, the blob is in B^n. See Section 1.4.3 for some neat pictures of Borel sets. We used I^n to define the Borel sets, but there is nothing special about left-half open boxes. For example, (dealing with n = 1 for simplicity) we claim that B is also the σ-algebra generated by the bounded open intervals. To prove this, let I_O be the collection of all bounded open intervals in R; we must show that S(I_O) = B, which means S(I_O) ⊂ B and B ⊂ S(I_O). Take the first set inequality: S(I_O) ⊂ B. To prove this we use the so-called principle of appropriate sets, explained as follows. Let C be a collection of sets (e.g. C = I_O) and let A be a σ-algebra (e.g. A = B); when can we say that S(C) ⊂ A? (In our problem, we need to know that S(I_O) ⊂ B.) The easiest way is via the16

Principle of appropriate sets: If C is a collection of sets and all these appropriate sets are contained in a σ-algebra A, then S(C) is also contained in A.

The proof of this principle is trivial: If A is a σ-algebra and C ⊂ A, then A is a σ-algebra containing C and hence S(C) ⊂ A since S(C) is the smallest σ-algebra containing C. By the principle of appropriate sets, to prove that S(I_O) ⊂ B, we just have to show that I_O ⊂ B. Thus, given (a, b) ∈ I_O, we need to prove that (a, b) ∈ B.
16This principle seems to be popular in Russian books such as [335, 351]. I thank Anton Schick for showing me [351].


To see this, observe that

(a, b) = ⋃_{k=1}^∞ (a, b − 1/k].

Since σ-algebras are closed under countable unions, it follows that (a, b) ∈ B. Thus, I_O ⊂ B, so by the Principle of Appropriate Sets we know that S(I_O) ⊂ B. The proof that B := S(I^1) ⊂ S(I_O) has an analogous flavor: We just have to prove I^1 ⊂ S(I_O). To do so, let (a, b] ∈ I^1 and observe that

(1.11)   (a, b] = ⋂_{k=1}^∞ (a, b + 1/k).

Since σ-algebras are closed under countable intersections, it follows that (a, b] ∈ S(I_O). Thus, I^1 ⊂ S(I_O), so by the Principle of Appropriate Sets we know that B = S(I^1) ⊂ S(I_O). Thus, we have shown that S(I_O) = B. We generalize this result in Proposition 1.10 below. First some notation. We shall denote a box in R^n by the notation

(a, b] = (a1, b1] × ··· × (an, bn],

where a and b are the n-tuples of numbers a = (a1, . . . , an), b = (b1, . . . , bn). Other types of boxes (open, closed, unbounded, etc.) are denoted similarly; e.g.

[a, b] = [a1, b1] × ··· × [an, bn]   and   (a, ∞) = (a1, ∞) × (a2, ∞) × ··· × (an, ∞).

The proof of the following result uses the principle of appropriate sets as we did in the proof that B^1 equals the σ-algebra generated by the bounded open intervals, and we leave its proof to you as practice on using this very useful principle.

Proposition 1.10. The Borel sets B^n is the σ-algebra of subsets of R^n generated by any one of the following collections of subsets of the form (here, a and b represent n-tuples of real numbers):
(1) (a, b];   (2) (a, b);   (3) [a, b);   (4) [a, b];
(5) (−∞, a];   (6) (−∞, a);   (7) [a, ∞);   (8) (a, ∞).

We have defined the Borel sets in terms of intervals, but we can instead define them in terms of the topology of R^n; that is, the Borel sets is the σ-algebra generated by the open sets. Although a bit overkill, we prove this result via the so-called dyadic cube theorem, which is interesting in its own right.

1.4.2. Dyadic cube theorem. A dyadic interval is an interval of real numbers of the form

( k/2^j , (k + 1)/2^j ],

where k ∈ Z and j ∈ N. It's convenient to use the notation (1/2^j)(k, k + 1] for such an interval; this notation is used for dyadic cubes below. The length of such an interval is 1/2^j. The dyadic intervals of a fixed length partition the real line R, as can be seen here:


[Picture: the dyadic intervals of length 1/2^j, with endpoints . . . , −3/2^j, −2/2^j, −1/2^j, 0, 1/2^j, 2/2^j, 3/2^j, . . ., tiling the real line.]

Figure 1.11. The dyadic intervals (k, k + 1]/2^j where k ∈ Z.

Here is a picture of the first few types of dyadic intervals:
[Picture: the dyadic intervals of lengths 1/2, 1/4, and 1/8 near the origin, between −1/2 and 1/2.]

Figure 1.12. The dyadic intervals (k, k + 1]/2^j where j = 1, 2, 3.

A dyadic cube in R^n of length 1/2^j is a product of n dyadic intervals of length 1/2^j. Thus, a dyadic cube is a cube of the form
(1/2^j)(k, k + 1],

where k = (k1, . . . , kn) is an n-tuple of integers and k + 1 = (k1 + 1, . . . , kn + 1). We shall call the number 1/2^j the length of the above dyadic cube, although we should probably say "length of a side" of the cube to be more precise. The set of all dyadic cubes forms a countable set since the set of all dyadic cubes is in one-to-one correspondence with Z^n × N, where the correspondence is

(k1, . . . , kn, j) ↦ (1/2^j)(k, k + 1].

We now come to the . . .

Dyadic cube theorem

Theorem 1.11. Each open set in R^n is a pairwise disjoint union of (countably many) dyadic cubes.
Proof : Note that we dont need to add countably many since we already noted above that the set of all dyadic cubes is countable. Step 1: We collect some properties of dyadic cubes which are obvious by studying Figure 1.12; we ask you to prove these properties in Problem 6. (1) A point in Rn is contained in a unique dyadic cube of a given length. (2) If C and C are dyadic cubes of dierent lengths, then C and C intersect if and only if C C or C C . Step 2: Before attacking the proof, consider the interval (1, 0) in R. If you were to decompose (1, 0) as a pairwise disjoint union of dyadic intervals you would probably do it as follows:

1 (k, k + 1]. 2j

40

1. MEASURE & PROBABILITY: FINITE ADDITIVITY

( 1

]( 1 2

]( ] ) 1 1 0 4 8

Each dyadic interval I in this picture is a subset of (1, 0) and it has the following property: If I is another dyadic interval and I I , then I is not contained in (1, 0). (For example, consider the interval (1/2, 1/4]. The only dyadic interval containing (1/2, 1/4] is (1/2, 0], which is not contained in (1, 0).) We exploit this property in our proof for open sets. Step 3: Now to our proof, let U = be an open set in Rn . Let V be the union of all dyadic cubes C U having the property that if C C where C is dyadic cube of length strictly greater than the length of C , then C U . By Property (2) listed in Step 1, one can check that V is a union of pairwise disjoint dyadic cubes. We show that V = U . By construction, V U . To see that U V , let x U . Since U is open, there are dyadic cubes with suciently small lengths contained in U that contain x. Among all such cubes, let C be the one with the largest length (which can be at most 1/2). By denition, C is one of the cubes that make up V , so x C V . Thus, U = V and our proof is complete.

1.4.3. Borel sets and topology. We can now characterize the Borel sets in terms of the topology of Rn . Theorem 1.12. The Borel subsets of Rn is the -algebra generated by the open sets of Rn .
Proof : By the Principle of Appropriate Sets! Let S be the -algebra generated by the open sets; we need to prove that S = B n . First of all, given a nonempty open subset U Rn , by the Dyadic cube theorem we can write U as a countable union of left-half open boxes and hence any open set is in B n . Therefore, S B n since S is the smallest -algebra containing the open sets. On the other hand, Equation (1.11) (although stated for intervals, it also holds for boxes) shows that every left-half open box is a countable intersection of open sets and hence is in S . It follows that B n S .

Note that to prove this result we didnt need the full power of the dyadic cube theorem. We just needed the fact that any open set is the countable union (of not necessarily pairwise disjoint) left-half open boxes, a statement much easier to prove than the dyadic cube theorem; see Problems 2 and 3. As a consequence of Theorem 1.12, if a subset of Rn can be obtained from open or closed sets by taking countable unions, intersections and/or complements, then the set is a Borel set; such sets include any set you can physically picture! Figures 1.13 and 1.14 show some famous fractal Borel subsets17 of R and R2 . Well take a close look at the Cantor set in Section 4.5. Figure 1.14 shows some Borel subsets of R3 .18 Motivated by Theorem 1.12, for an arbitrary topological space X , we dene the Borel sets
17Pictures taken from the wikipedia commons. For information on fractals see e.g. [264]. 18The pictures of the nautilus shell and Mona Lisa are taken from the wikipedia commons.

The picture of the Alexander horned sphere is from [187, p. 176]. The Alexander horned sphere, named after James Alexander II (18881971), is homeomorphic to a 3-dimensional ball.

1.4. THE BOREL SETS AND THE PRINCIPLE OF APPROPRIATE SETS

41

Figure 1.13. On the left is a construction of the Cantor set, which is


whats left over after repeatedly erasing the open middle thirds starting from the unit interval [0, 1]. The Cantor set is a closed subset of R, hence is a Borel subset of R see Section 4.5. In the middle is a Julia set and on the right is the Mandelbrot set, both of which are closed and hence are Borel subsets of R2 .

Figure 1.14. The nautilus shell (a cut away of it on left), the human
head, and the Alexander horned sphere, each we assume is a closed subset of R3 , are Borel sets.

B (X ) as follows: B (X ) is the -algebra of subsets of X generated by the open sets. Proposition 1.13. If f : X Y is a continuous map between topological spaces, then the preimage of any Borel set in Y under f is a Borel set in X .
Proof : By the principle of appropriate sets! By Proposition 1.8, is a -algebra of subsets of Y . Let O be the collection of open subsets of Y . Since f is continuous, it follows that O S . Thus, B (Y ) S since B (Y ) is the smallest -algebra containing O . Now the statement B (Y ) S means that for every A B (Y ), we have f 1 (A) B (X ), which is what we wanted to show. S = {A Y ; f 1 (A) B (X )}

Since Borel sets are dened via topology, the following should not be a surprise. Proposition 1.14. Borel sets are preserved under homeomorphisms.
Proof : Let f : X Y be a homeomorphism of topological spaces and let A X be a Borel set of X ; we must prove that f (A) is a Borel set of Y . To prove this, let g = f 1 . Then g : Y X is continuous, so by Proposition 1.13, the set g 1 (A), which equals f (A), is a Borel set in Y . This completes our proof.

42

1. MEASURE & PROBABILITY: FINITE ADDITIVITY

This proposition is not true for the so-called Lebesgue measurable sets that well discuss later, see Problem 8 in Exercises 4.4.
Exercises 1.4. 1. Let Ak be the -algebra generated by sets of the form (k) given in Proposition 1.10, where k = 1, 2, . . . , 8; thus, A1 = B n by denition of B n , A2 is the -algebra generated by the bounded open boxes, and so forth. Prove the following sequence of inclusions: A1 A2 , A2 A3 , , A7 A8 , A8 A1 . Using this fact, conclude that A1 = A2 = = A8 , which proves Proposition 1.10. You may assume that n = 1 throughout your proof, which will make a couple of the inclusions notationally simpler to prove. 2. In this problem we prove that any open set can be written as the countable union of (not necessarily pairwise disjoint) left-half open boxes. Here are some steps. (i) First prove that the set of all nonempty left-half open boxes with rational vertices is countable. That is, prove that the set of all boxes (a1 , b1 ] (an , bn ] where a1 , . . . , an , b1 , . . . , bn Q with ai < bi for each i, is countable. (ii) Let U Rn be open and nonempty and denote by A the set of all nonempty left-half open boxes I with rational vertices such that I U . Show that U= I.
I A

3. Using a proof similar to the one in the previous problem, prove that any open set is a countable union of (not necessarily pairwise disjoint) open boxes. Prove the same with open boxes replaced with closed boxes. 4. Prove that the Borel subsets of Rn is the -algebra generated by the collection of all (i) dyadic cubes; (ii) left-half open boxes with rational end points; (iii) left-half open boxes with dyadic end points; (iv) closed sets; (v) compact sets. 5. Prove that any open set in R is a countable union of pairwise disjoint open intervals. Is this last statement true in Rn for n > 1 if we replace open intervals by open boxes? Suggestion: For each x in an open set U R, prove that there is a largest open interval containing x, say Ix . Show that these intervals are pairwise disjoint, countable, and the U is the union of all such intervals. 6. Prove the properties of dyadic cubes stated in Theorem 1.11. We remark that the technique in Step 3 used to prove the dyadic cube theorem is useful in establishing analogous statements concerning open sets. For instance, imitating the proof of the dyadic cube theorem, show that each open set in Rn is a countable union of closed cubes with pairwise disjoint interiors. 7. Another common way to prove the dyadic cube theorem is as follows. Let U1 U be the union of all dyadic cubes of length 1/21 that are contained in U . Let U2 U be the union of all dyadic cubes of length 1/22 contained in the set U \ U1 . Proceeding by induction, assuming that Uj has been dened, let Uj +1 U be the union of all dyadic cubes of length 1/2j +1 contained in the set U \ (U1 U2 Uj ). Let V be the union of all the Uj s. Prove that U = V . 8. Given x Rn , r R with r = 0, and A Rn , we denote the translation of A by x by A + x or x + A, and the multiple of A by r by rA; that is, we dene x + A = A + x := {a + x ; a A}, B n + x := {A + x ; A B n } and rA := {ra ; a A}.

Prove that B n is translation and scalar invariant. That is, prove that r B n := {rA ; A B n } both equal B n .
i=1

9. (Cylinder sets in R ) In this problem we study -algebras in R . Let R = which consists of all innite sequences (x1 , x2 , . . .) with xi R for all i.

R,

1.5. ADDITIVE SET FUNCTIONS IN CLASSICAL PROBABILITY

43

(i) Let C1 denote the collection of cylinder sets of R of the form A1 A2 An R for some n N and Ai I 1 for each i. Let C2 denote the cylinder sets formed by requiring Ai B for each i. Prove that S (C1 ) = S (C2 ). We denote this common -algebra by B . Suggestion: To prove that S (C2 ) S (C1 ), you need to prove that any element of C2 belongs to S (C1 ). To prove this, rst prove that for any i N and A B , we have Ri1 A R S (C1 ). (ii) Prove that each of the following sets belongs to B : (a) {x R ; sup{x1 , x2 , . . .} 1}. (b) {x R ; xn 0 for each n and n=1 xn 1}. (c) {x R ; limn xn exists}. Suggestion: For (c), recall that limn xn exists {xn } is Cauchy. 10. (Cf. [31]) For Rn , the -algebra generated by the open sets and by the compact sets are the same, namely the Borel sets (see Problem 4). For a general topological space, they need not be the same. A topological space is said to be -compact (also called -bounded) if it is equal to a countable union of compact subsets. (a) Prove that for any -compact Hausdor space, the Borel sets coincides with the -algebra generated by the compact sets. Suggestion: Recall that for a Hausdor space, compact sets are closed. (b) Suppose that the topology of X consists of just the sets and X . What are the Borel sets? What is the -algebra generated by the compact sets?

1.5. Additive set functions in classical probability Statements such as the probability of getting a head when you ip a coin is 1/2 or the probability that a die shows 1 when thrown is 1/6 are facts we are all familiar with. This section is devoted to understanding the basics of probability theory and additive set functions. However, we begin by discussing . . . 1.5.1. Arithmetic on R. In measure theory, show up often, since for instance, R and other unbounded intervals have innite length. Let (called innity and also denoted by +) and (called minus innity) be distinct objects that are not real numbers.19 We dene the extended real numbers R as the set R {}. We order R by taking the usual ordering on R and dening < and < a < for any real a. We use the standard notations for intervals; for example, we shall see that measures take values in the interval which equals [0, ) {}, the set of nonnegative real numbers and . The arithmetic operations on R are dened in the obvious manner that agrees with the usual rules we are used to for real numbers (except for one exception dealing with multiplication of zero and innity described below). Instead of listing all the rules, we give a few examples. First, on the subset R R of real numbers, the usual rules of arithmetic apply. When innities are involved, we have that is, anything (except ) plus equals , and a = a := , + a = a + := , for < a , for 0 < a , [0, ] = {x R ; 0 x },

19 Here's one natural way to define infinity. We call a sequence {a_n} of real numbers right unbounded if given any M > 0 we have a_n > M for all n sufficiently large. Now define ∞ as the collection of all right unbounded sequences. With this definition, given a sequence {a_n}, we write a_n → ∞ if {a_n} ∈ ∞. (Recall that ∞ is a set, so {a_n} ∈ ∞ makes sense!) Similarly, we can define −∞ as the collection of all left unbounded sequences.


and so on for the other rules of arithmetic for R̄. The only rules that are not defined are certain indeterminate forms as learned in elementary calculus, such as ∞ − ∞, −∞ + ∞, and division of infinities. However, there is one caveat. We do define zero times ±∞:

0 · (±∞) = (±∞) · 0 := 0.

For division, we have

a/(±∞) := 0,  for all a ∈ R.
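For illustration, here is a minimal sketch (not from the text) of how one might encode these conventions with Python floats; the helper names ext_mul and ext_div_by_inf are ours, and the point is only that the measure-theoretic convention 0 · (±∞) := 0 differs from ordinary floating-point arithmetic.

    import math

    INF = math.inf

    def ext_mul(a, b):
        """Multiplication on the extended reals with the convention 0 * (+-inf) := 0."""
        if a == 0 or b == 0:
            return 0.0          # unlike IEEE arithmetic, where 0 * inf is nan
        return a * b

    def ext_div_by_inf(a):
        """a / (+-inf) := 0 for every real a."""
        return 0.0

    print(ext_mul(0, INF))      # 0.0, whereas 0 * math.inf would be nan
    print(2 + INF)              # inf: anything (except -inf) plus inf equals inf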

This definition might seem a bit strange as it goes against our calculus upbringing. However, in measure theory it's incredibly useful, as you will see in the sequel. Using the order properties of R̄, we can define convergence of sequences of extended real numbers in the same way as for real numbers. Precisely, a sequence {a_n} in R̄ is said to converge to a real number L ∈ R, written a_n → L (or lim a_n = L), if given any ε > 0 there is an N ∈ N such that for all n ≥ N, L − ε < a_n < L + ε. We write a_n → ∞ if for any M ∈ R there is an N ∈ N such that for all n ≥ N we have M < a_n; an analogous definition holds for a_n → −∞. In particular, we can define convergence of infinite series of extended real numbers as the convergence of their partial sums; we'll work with such series in Chapter 3.

1.5.2. Probabilities and additive set functions. When we are dealing with a random phenomenon we can never say exactly what will occur; we can only say the probability (or chance or likelihood) that a certain outcome will occur. Mathematically speaking, probability is the assignment of real numbers in the interval [0, 1] to represent these likelihoods, numbers close to zero representing low likelihoods and numbers closer to one higher likelihoods. For example, consider the sample space X = {(1, 1), . . . , (6, 6)} of throwing two dice, which is a set of 36 possible outcomes. We would all agree that the probability you throw a double six is 1/36 (assuming we are not cheating). We would also all agree that the probability you throw a (1, 1) or a (6, 6) is 2/36. More generally, given an event A ⊆ X consisting of n outcomes, we would all agree that the probability of at least one outcome in A occurring is n/36. With this example in mind, given any finite sample space X and an event A ⊆ X, the classic definition of the probability of an outcome in A occurring is

(1.12)    probability of A := #A/#X = (number of elements of A)/(total number of possible outcomes).

If probabilities are assigned in this way, the apparatus (coin, dice, . . .) used in the experiment is said to be fair and each outcome is said to occur equally likely or with equal probability. One way to think of the formula (1.12) is as probability = volume, where the volume of an event A is the proportion of all possible outcomes that lie in the event A. Since time immemorial, (1.12) has been used as the definition of mathematical probability. For example, Girolamo Cardano (1501–1576) used the formula (1.12) to compute various probabilities in Liber de Ludo Aleae, the first text on the calculus of probabilities; for instance, in Chapter 14 of his book he (correctly) tabulates #A for various events where the sample space is X = {(1, 1), . . . , (6, 6)},


the sample space for throwing two dice. In Abraham de Moivre's (1667–1754) classic The Doctrine of Chances, first published in 1718, he says [100, p. 1]:

1. The Probability of an Event is greater or less, according to the number of Chances by which it may happen, compared with the whole number of Chances by which it may either happen or fail.
2. Wherefore, if we constitute a Fraction whereof the Numerator be the number of Chances whereby an Event may happen, and the Denominator the number of all the Chances whereby it may either happen or fail, that Fraction will be a proper designation of the Probability of happening. Thus if an Event has 3 Chances to happen, and 2 to fail, the Fraction 3/5 will fitly represent the Probability of its happening, and may be taken to be the measure of it.

Back to our dice example, we denote the probability of an outcome in A occurring by μ(A) and we define it by

μ(A) := #A/36 = (number of elements of A)/(total number of possibilities),

where #A = number of elements of A. Thus, μ assigns to every subset A ⊆ X a number μ(A) ∈ [0, 1]; in other words, μ is a function

μ : P(X) → [0, 1],

where P(X) is the power set of X. This function has several easily proved properties, such as μ(∅) = 0 (which is obvious since #∅ = 0) and

μ(X) = #X/36 = 36/36 = 1,

and also if A, B ⊆ X are disjoint, then #(A ∪ B) = #A + #B, so

μ(A ∪ B) = #(A ∪ B)/36 = (#A + #B)/36 = μ(A) + μ(B).

An induction argument shows that given any finite number of sets A_1, A_2, . . . , A_N ∈ P(X) that are pairwise disjoint, μ(⋃_{n=1}^N A_n) = Σ_{n=1}^N μ(A_n). This discussion shows that the following definition is worthy of study: Given a semiring I, a function μ : I → [0, ∞] is called a set function and is said to be additive, or finitely additive, if
(1) μ(∅) = 0.
(2) If A = ⋃_{n=1}^N A_n ∈ I where A_1, . . . , A_N ∈ I are pairwise disjoint, then

μ(A) = Σ_{n=1}^N μ(A_n).

Since rings or σ-algebras are also semirings, this definition works in particular when I is a ring or σ-algebra. The set function μ is called a (finitely additive) probability set function²⁰ if μ has range in [0, 1] (so that μ : I → [0, 1]) and, in addition to (1) and (2), if X denotes the universal set, we have
(3) X ∈ I and μ(X) = 1.
20 Usually called a finitely additive probability measure, but we shall use the word measure strictly for countably additive set functions.
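To make the definition concrete, here is a small illustrative check (ours, not from the text) that the two-dice set function μ(A) = #A/36 satisfies finite additivity and is a probability set function; the function name mu and the chosen events are ours.

    from fractions import Fraction
    from itertools import product

    X = set(product(range(1, 7), repeat=2))          # sample space of 36 outcomes

    def mu(A):
        return Fraction(len(A), len(X))              # classical definition (1.12)

    A = {(6, 6)}                                     # double six
    B = {(1, 1)}                                     # double one
    C = {(i, j) for (i, j) in X if i + j == 7}       # sum equals seven
    assert A.isdisjoint(B) and A.isdisjoint(C) and B.isdisjoint(C)
    assert mu(A | B | C) == mu(A) + mu(B) + mu(C)    # finite additivity
    assert mu(set()) == 0 and mu(X) == 1             # a probability set function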


This abstract description of probability goes back at least to . . .

1.5.3. Hilbert's sixth problem and Kolmogorov's axioms. In the 1900 International Congress of Mathematicians held in Paris, David Hilbert (1862–1943) gave his famous list of 23 open problems in mathematics, now called Hilbert problems, which have greatly influenced the direction of mathematical research since that time. Here's Hilbert's sixth problem [182]:

6. Mathematical treatment of the axioms of physics. The investigations on the foundations of geometry suggest the problem: To treat in the same manner, by means of axioms, those physical sciences in which mathematics plays an important part; in the first rank are the theory of probabilities and mechanics.

In 1933, Andrey Kolmogorov (1903–1987) published Grundbegriffe der Wahrscheinlichkeitsrechnung [215] (for an English translation, see http://www.kolmogorov.com/), in which he solves the probability component of Hilbert's problem. Let X be a set and let R be a collection of subsets of X referred to as a collection of observable or plausible events. Here are Kolmogorov's axioms [215, p. 2]:²¹
I. R is a ring of subsets of a set X.
II. R contains X.
III. To each set A in R is assigned a nonnegative real number μ(A). This number μ(A) is called the probability of the event A.
IV. μ(X) = 1.
V. If A and B have no element in common, then μ(A ∪ B) = μ(A) + μ(B).
The triple (X, R, μ) is called a field of probability. Of course, these axioms just mean that

μ : R → [0, 1]

is a probability set function, which we defined earlier. We remark that since X ∈ R by assumption, it follows that R is closed under complements. Indeed, if A ∈ R, then as rings are closed under differences, we conclude that A^c = X \ A ∈ R. Moreover, if μ : R → [0, 1] is a probability set function, then X = A ∪ A^c implies that 1 = μ(A) + μ(A^c) since μ(X) = 1. Thus, we have the following very useful complement formula:

(1.13)    A ∈ R  ⟹  A^c ∈ R and μ(A^c) = 1 − μ(A).

Most of the probability set functions you have seen in lower-level probability courses are of the following type (the proof is left to you):

Proposition 1.15. Given any finite sample space X describing a fair experiment and using the classic definition (1.12) to define a set function

μ : P(X) → [0, 1],

we obtain a finitely additive probability set function.

21 Kolmogorov assumes the ring R to be finite; the case of infinitely many observable events is discussed in Section 3.3.3. Also, he uses the term field for what we call a ring; nowadays the word field, also called an algebra, refers to a ring closed under complements.


Now consider the following proposition describing a not-necessarily fair experiment whose sample space is countable.

Probability mass functions

Proposition 1.16. Let X be a nonempty countable (finite or infinite) set. Then a function μ : P(X) → [0, ∞] is a finitely additive set function if and only if there is a function m : X → [0, ∞] such that for all A ⊆ X, we have

μ(A) = Σ_{x∈A} m(x),

where the sum is only over those x's such that x ∈ A. If the sum Σ_{x∈A} m(x) diverges, it equals, by definition, ∞. The set function μ is a finitely additive probability set function if and only if Σ_{x∈X} m(x) = 1, in which case the function m is called a probability mass function.

We remark that Proposition 1.16 holds verbatim if we replace finitely additive with countably additive everywhere in the proposition (see Problem 2 in Exercises 3.2); we shall learn more about countably additive measures in the next chapter. To prove the "only if", let μ : P(X) → [0, ∞] be a finitely additive set function and then define

m(x) := μ({x})  for all x ∈ X.

We often drop the parentheses, so we can write this as m(x) = μ{x}. Now given A ⊆ X we can write A = ⋃_{x∈A} {x}, so by additivity,

μ(A) = Σ_{x∈A} μ{x} = Σ_{x∈A} m(x).

This proves the "only if"; we shall leave the "if" statement to you (see Problem 2). If X is finite and we assign the same masses to each point of X, that is,

m(x) = 1/#X  for all x ∈ X,

then the μ defined as in Proposition 1.16 is exactly the classical probability set function (1.12) (why?). However, there are situations where m is not constant, one case of which we'll see as we now return to the problem of points.

1.5.4. The problem of points. Here's an example of the problem of points related to America's pastime:

Problem of points: Baseball teams A and B (of equal ability) are playing in the world series. The first one to win four games wins the series and wins D dollars. Let's say that after four rounds, team A has won three games and team B has won one game, but because of salary disputes (what else?) the players went on strike and for the first time in history the world series was canceled. It was decided to sell the Commissioner's Trophy²² (given to the winner of the series) and the money given to the teams. How should the money be fairly divided?

22 The picture of the Commissioner's Trophy was taken from the Wikipedia commons.


Event   Round 5   Round 6   Round 7
  1        A         A         A
  2        A         A         B
  3        A         B         A
  4        A         B         B
  5        B         A         A
  6        B         A         B
  7        B         B         A
  8        B         B         B

Table 2. All possible outcomes of three hypothetical future matches between teams A and B. Here, A represents that team A wins and B that team B wins. Each outcome occurs with probability 1/8.

Pascal and Fermat's ingenious solution to this problem is called the method of combinations. Their observation was that we can consider the teams playing a "best of seven" series where the teams play seven complete rounds and at the end of the seven rounds, we count who has won the most rounds. In other words, a team wins the "first one to win four" series if and only if it wins the "best of seven" series. Thus, we might as well divide up the prize money assuming the teams were playing the best of seven series; in the words of Pascal [358, p. 557],

Therefore, since these two conditions are equal and indifferent, the division should be alike for each.

Now, thinking in terms of the best of seven, we have already played four rounds, so we just have three more to go. Table 2 shows the sample space of all the possible outcomes in the hypothetical rounds 5–7. Thus, our sample space is X = {(A, A, A), (A, A, B), . . . , (B, B, B)}, a set consisting of eight elements, and we define μ : P(X) → [0, 1] using the classical definition (1.12); recall that we assumed the teams were of equal ability, which is why each hypothetical outcome is equally likely. Let A be the event that Team A wins the series and let B be the event that Team B wins. Then recalling that team A already has three wins, they would win the series if any one of the top seven outcomes in Table 2 occurred in the hypothetical rounds 5–7; the only way team B would win is if the last outcome in Table 2 occurred. Thus,

μ(A) = the probability team A wins = #A/#X = 7/8,

and by the complement formula (1.13), μ(B) = 1 − 7/8 = 1/8. Of course, we could obtain μ(B) using the classical definition:

μ(B) = the probability team B wins = #B/#X = 1/8.

For this reason, it would seem like the fairest division of the prize money would be

Team A gets 7D/8 dollars and Team B gets D/8 dollars.
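As a quick numerical illustration (ours, not from the text), one can enumerate the eight equally likely outcomes of Table 2 and count in how many of them team A gets its one remaining win; the variable names below are ours.

    from itertools import product
    from fractions import Fraction

    outcomes = list(product("AB", repeat=3))          # rounds 5, 6, 7
    a_wins = [w for w in outcomes if "A" in w]        # A needs only one more win
    print(Fraction(len(a_wins), len(outcomes)))       # 7/8, team A's share
    print(1 - Fraction(len(a_wins), len(outcomes)))   # 1/8, team B's share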


Now that we have seen the method of combinations, you should be able to read Pascal's letter of Monday, August 24, 1654 to Fermat [358, p. 555]:

This is the method of procedure when there are two players: If two players, playing in several throws, find themselves in such a state that the first lacks two points and the second three of gaining the stake, you say it is necessary to see in how many points the game will be absolutely decided. It is convenient to suppose that this will be in four points, from which you conclude that it is necessary to see how many ways the four points may be distributed between the two players and to see how many combinations there are to make the first win and how many to make the second win, and to divide the stake according to that proportion. I could scarcely understand this reasoning if I had not known it myself before; but you also have written it in your discussion. Then to see how many ways four points may be distributed between two players, it is necessary to imagine that they play with dice with two faces (since there are but two players), as heads and tails, and that they throw four of these dice (because they play in four throws). Now it is necessary to see how many ways these dice may fall. That is easy to calculate. There can be sixteen, which is the second power of four; that is to say, the square. Now imagine that one of the faces is marked a, favorable to the first player. And suppose the other is marked b, favorable to the second. Then these four dice can fall according to one of these sixteen arrangements:

a a a a   a a a b   a a b a   a a b b   a b a a   a b a b   a b b a   a b b b
   1         1         1         1         1         1         1         2
b a a a   b a a b   b a b a   b a b b   b b a a   b b a b   b b b a   b b b b
   1         1         1         2         1         2         2         2

and, because the first player lacks two points, all the arrangements that have two a's make him win. There are therefore 11 of these for him. And because the second lacks three points, all the arrangements that have three b's make him win. There are 5 of these. Therefore it is necessary that they divide the wager as 11 is to 5.

Please notice Pascal's words: "I could scarcely understand this reasoning if I had not known it myself before." This shows that mathematics is not easy, even to one of the greatest mathematicians who ever lived! In fact, the method of combinations has confused many great mathematicians, such as Gilles Roberval (1602–1675), one of the leading mathematicians in Paris at the time of Pascal and Fermat. He objected to the method of combinations because the true game described in Pascal's letter above would have ended as soon as the winner was declared, and not all four extra games really needed to be played. Roberval's comments to Pascal were [358, p. 556]:
That it is wrong to base the method of division on the supposition that they are playing in four throws seeing that when one lacks two points and the other three, there is no necessity that they play four throws since it may happen that they play but two or three, or in truth perhaps four.


Roberval might have been happy if Pascal approached the problem as follows. Recall that the first player in Pascal's letter lacks two points and the second three points, so the real ending scenarios could only have been the following ten:

a a   a b a   a b b a   a b b b   b a a   b a b a   b a b b   b b a a   b b a b   b b b
 1      1       1         2        1        1         2         1         2        2

where the bottom row shows who wins. Let us take our sample space to be these ten scenarios:

X = {aa, aba, abba, abbb, . . . , bbb}.

For this sample space it is clear that the outcomes are not equally likely; e.g. aa and aba are not equally likely to occur. Indeed, since the players are equally likely to win a round, in two rounds the probability that aa occurs should be 1/4. In three rounds the probability that aba occurs should be 1/8. Similarly, in four rounds the probability that abba occurs should be 1/16. Continuing in this manner, we see that we should assign (that is, define) probabilities as follows:

μ{aa} = 1/4,   μ{aba} = μ{baa} = μ{bbb} = 1/8,
μ{abba} = μ{abbb} = μ{baba} = μ{babb} = μ{bbaa} = μ{bbab} = 1/16.

Since

1/4 + 1/8 + 1/8 + 1/8 + 1/16 + 1/16 + 1/16 + 1/16 + 1/16 + 1/16 = 1,

it follows from Proposition 1.16 that μ induces a finitely additive probability set function μ : P(X) → [0, 1]. Now the event that a wins is {aa, aba, abba, baa, baba, bbaa}, hence

the probability that a wins is μ{aa, aba, abba, baa, baba, bbaa} = 1/4 + 1/8 + 1/16 + 1/8 + 1/16 + 1/16 = 11/16;

thus, by the complement formula (1.13), the probability that b wins is 1 − 11/16 = 5/16. These are exactly the same division rules (11/16 to player a and 5/16 to player b) as given by the method of combinations! We can do a similar argument with the world series example. For the world series example, the real ending scenarios are only the following four:

Event   Round 5   Round 6   Round 7   Probability
  1        A                             1/2
  2        B         A                   1/4
  3        B         B         A         1/8
  4        B         B         B         1/8

In terms of our mathematical framework, we put X = {A, BA, BBA, BBB} and define

μ{A} = 1/2,   μ{BA} = 1/4,   μ{BBA} = 1/8,   μ{BBB} = 1/8;


then via Proposition 1.16, we get a probability set function μ : P(X) → [0, 1]. Now, A wins the series if any of the first three outcomes in the table occur and B wins if the last one occurs, so

the probability team A wins is μ{A, BA, BBA} = 1/2 + 1/4 + 1/8 = 7/8,

and

the probability team B wins is μ{BBB} = 1/8.

These are exactly the same numbers we got using the method of combinations! Now, why did we get the same answers using the method of combinations and the method that might have made Roberval happy? Is it a coincidence? No, of course, and we leave you to figure out why! By the way, there is yet another way to solve the problem of points through the concept of expected winnings, a notion due once again to our friend Pascal, who explained it in his July 29, 1654 letter to Fermat [358, p. 547]. You can read about expected winnings in one of the first published books (1657) on probability, Libellus de Ratiociniis in Ludo Aleae [194] by Christiaan Huygens (1629–1695), who learned expectations from Pascal.

1.5.5. The dice problem. Recall that the dice problem is the following: How many times must you throw two dice in order to have a better than 50–50 chance of getting two sixes?

Let n ∈ N, let Y = {(i, j) ; i, j = 1, 2, . . . , 6} (the sample space of throwing two dice once), and let

X = Y × Y × ⋯ × Y  (n times)  = Y^n;

then X is a sample space for throwing two dice n times. We know (via the counting rules we review at the end of this section) that since Y has 36 elements, #X = 36^n. We define the probability set function μ : P(X) → [0, 1] by assuming equally likely outcomes. In order to answer the dice problem, let A ⊆ X be the event that we get two sixes in at least one of the n throws; explicitly,

A = {(x_1, . . . , x_n) ; there is an i with x_i = (6, 6)}.

To find μ(A), by the complement formula (1.13), we just have to find μ(A^c), which turns out to be quite easy. Notice that A^c is the event that we never throw two sixes; explicitly,

A^c = {(x_1, . . . , x_n) ; for all i, x_i ≠ (6, 6)} = B × B × ⋯ × B  (n times),

where B = Y \ {(6, 6)}. Since #B = 35, it follows that #A^c = 35^n and hence

μ(A^c) = #A^c/#X = (35/36)^n.

Thus, by the complement formula,

(1.14)    Throwing the dice n times, the probability of throwing two sixes at least once = 1 − (35/36)^n.


We can now solve the dice problem. We want to know what n must be in order that if we throw the dice n times we have a better than a 1/2 probability of throwing two sixes; that is, we want

1 − (35/36)^n ≥ 1/2,

which holds

⟺   n ≥ log 2 / log(36/35) = 24.605 . . . .
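A short numerical check (ours, not from the text) of this computation: the smallest n with 1 − (35/36)^n > 1/2 is indeed n = 25.

    import math

    n = 1
    while 1 - (35 / 36) ** n <= 1 / 2:
        n += 1
    print(n)                                   # 25
    print(math.log(2) / math.log(36 / 35))     # 24.605...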

This was exactly Pascal's solution to Méré's question: We need at least 25 throws to have a better than 50–50 chance of getting two sixes. Just for fun, let us find (1.14) using another method. Let A_k ⊆ X be the event that we throw two sixes on the k-th throw but not on any throw before the k-th. Thus,

A_k = B × B × ⋯ × B  (k − 1 times)  × {(6, 6)} × Y × Y × ⋯ × Y  (n − k times).

Then

μ(A_k) = #A_k/36^n = 35^{k−1} · 1 · 36^{n−k} / 36^n = (1/36)(35/36)^{k−1}.

One can check that A = A_1 ∪ ⋯ ∪ A_n and that A_1, . . . , A_n are pairwise disjoint, so by additivity we have

μ(A) = Σ_{k=1}^n μ(A_k) = (1/36) Σ_{k=1}^n (35/36)^{k−1} = (1/36) · (1 − (35/36)^n)/(1 − 35/36) = 1 − (35/36)^n,
which agrees with (1.14). We remark that Méré believed the answer was 24 and not 25 throws to have a better than 50–50 chance of getting two sixes. The reason was the following gambler's rule common in those days. Let T be the total number of outcomes of an experiment (each outcome equally likely) and let t be the number of trials of the experiment needed to have a better than 50–50 chance of getting a specific outcome. For example, in the experiment of throwing a single die, T = 6 (six sides on a die) and for the specific outcome of throwing a six, it was well known even from the times of Cardano in the 1500s that t = 4. (Indeed, throwing the die n times, the probability of throwing a six at least once is 1 − (5/6)^n; when n = 3, this number is < 1/2 while for n = 4, the number is > 1/2.) The gambler's rule was that the ratio t/T is the same for all experiments, as long as T is sufficiently large. Thus, as we saw in the case of a single die, we have T = 6 and t = 4, so t/T = 4/6 = 2/3. Hence, we should have t/T = 2/3 = 0.66666 . . . for any experiment as long as T is large enough. In the case of two dice, T = 36 (six sides on each die), so we conclude that t/36 = 2/3, or t = 36 · 2/3 = 24, which was Méré's (wrong) solution. In fact, it was later shown by Abraham de Moivre (1667–1754) in his book The Doctrine of Chances [100, p. 36] that t/T ≈ log 2 = 0.6931 . . . for T large. If Méré had known this, he would have gotten the correct answer: t ≈ 36 · 0.6931 = 24.95 . . ., that is, t = 25. See Problem 3 for a proof of de Moivre's theorem.

Appendix/Review on basics of counting. In order to use the classical definition of probability one needs to be able to count the number of elements of


a set. Thus, before going any further, we quickly review some elementary ideas on counting. Let X_1, . . . , X_m be finite sets and consider the product

X = X_1 × X_2 × ⋯ × X_m = {(x_1, x_2, . . . , x_m) ; x_i ∈ X_i for each i}.

If #X_i = n_i, then we have

#X = n_1 n_2 ⋯ n_m.

To see this, observe that for an m-tuple (x_1, x_2, . . . , x_m) ∈ X, there are n_1 choices for the first component x_1. For each of the n_1 choices for x_1, there are n_2 choices for x_2. Thus, there are n_1 n_2 ways of filling the first two components x_1, x_2. Continuing in this manner with x_3, . . . , x_m, we get our formula for #X. In particular, let Y be a set having n elements and put

X = Y × Y × ⋯ × Y = Y^m,

where there are m factors of Y; then #X = n^m. In probability jargon, the set Y^m is called the set of all m-samples taken from Y with replacement. This is because if Y represents the set of objects in an urn,²³ then you can think of an m-tuple (x_1, x_2, . . . , x_m) ∈ Y^m as a sequence of m objects that you take from the urn, one at a time, making sure to replace each object in the urn before taking the next one. Now consider the following subset P ⊆ Y^m:

P = {(x_1, . . . , x_m) ; for all i, j with i ≠ j, x_i ≠ x_j}.

How many elements does P have? Observe that there are n choices for the first component x_1 of an element (x_1, x_2, . . . , x_m) ∈ P. For each of the n choices for x_1, there are n − 1 choices for x_2, since x_2 is not allowed to equal x_1. Thus, there are n(n − 1) ways of filling the first two components x_1, x_2. Continuing in this manner with x_3, . . . , x_m, the number of elements of P equals

#P = n(n − 1)(n − 2) ⋯ (n − m + 1) = n!/(n − m)!.

In probability jargon, the set P is called the set of all m-permutations of Y or the set of all m-samples taken from Y without replacement. This is because if Y represents the set of objects in an urn, then you can think of an m-tuple (x_1, x_2, . . . , x_m) ∈ P as the sequence of m objects that you take from the urn, one at a time, without replacing the object you took before taking the next one. Finally, let 1 ≤ m ≤ n and consider the set

C = {A ; A ⊆ Y, #A = m};

thus, C consists of all subsets of Y having m elements. The number of elements of C is called the number of combinations of n things taken m at a time. An element of C is a subset A = {x_1, x_2, . . . , x_m} ⊆ Y consisting of m distinct elements of Y (so that x_i ≠ x_j for i ≠ j). Notice that there are m! different ways to form an m-tuple from the elements of the set {x_1, x_2, . . . , x_m}, since there are m choices for the first component of the m-tuple, m − 1 choices for the second component of the m-tuple, etc. Each such m-tuple will give an element of the set P defined

23 1828 Webster dictionary: An urn is a vessel of various forms, usually a vase furnished with a foot or pedestal, employed for different purposes, as for holding liquids, for ornamental uses, for preserving the ashes of the dead after cremation, and anciently for holding lots to be drawn. Image taken from Kantner's 1887 Book of Objects, page 12.


above, and moreover, each element of the set P can be obtained in this way. Thus, to each element of C there correspond m! elements of P and hence, after some thought, we conclude that

#C · m! = #P.

In view of the formula for #P that we found above, we see that

#C = n!/(m!(n − m)!) =: C(n, m),

which is just the familiar binomial coefficient ("n choose m"). If Y represents the set of objects in an urn, then you can think of an element {x_1, x_2, . . . , x_m} ∈ C as m objects that you scoop from the urn all at once. You can also think of {x_1, x_2, . . . , x_m} ∈ C as the set of m objects that you take from the urn, one at a time, without replacing the object you took before taking the next one. Since the elements of a set are not ordered, you don't care about the order in which the objects were taken from the urn, you only care about the objects you got.
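As a small brute-force check (ours, not from the text) of the counting formulas #P = n!/(n − m)! and #C = n!/(m!(n − m)!), one can enumerate all m-permutations and m-subsets of a small set; the names n, m, P, C below are ours.

    from itertools import permutations, combinations
    from math import comb, factorial, perm

    n, m = 5, 3
    Y = range(n)
    P = list(permutations(Y, m))       # m-samples without replacement, order matters
    C = list(combinations(Y, m))       # m-element subsets, order irrelevant

    assert len(P) == factorial(n) // factorial(n - m) == perm(n, m)
    assert len(C) == comb(n, m)
    assert len(C) * factorial(m) == len(P)   # each subset yields m! orderings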
Exercises 1.5.
1. Let I be a semiring of subsets of a set X and let μ : I → [0, ∞] be a finitely additive set function. Prove that if A, B ∈ I with A ⊆ B, then μ(A) ≤ μ(B).
2. Prove the "if" part of Proposition 1.16 using characteristic functions as follows.
(i) Recall that the characteristic function of a set A is defined by χ_A(x) = 1 if x ∈ A and χ_A(x) = 0 if x ∉ A. With X a countable set and μ defined as in the statement of Proposition 1.16 for some mass function m, show that given any set A ⊆ X, we have

μ(A) = Σ_{x∈X} m(x) χ_A(x).

(ii) Show that if A = ⋃_{n=1}^N A_n where the A_n's are pairwise disjoint, then

χ_A = Σ_{n=1}^N χ_{A_n}.

(iii) Using (i) and (ii) prove Proposition 1.16.
3. (De Moivre's theorem) Let T be the total number of outcomes in an experiment (each outcome equally likely) and let t be the number of trials of the experiment needed to have a better than 50–50 chance of getting a specific outcome. Show that

t = ⌈log 2 / log(T/(T − 1))⌉,

where for any r ∈ R, ⌈r⌉ is the smallest integer ≥ r. Prove that lim_{T→∞} t/T = ln 2.
4. (The birthday problem) Here is a question similar to the dice problem: How many students must be in your class in order to have a better than 50–50 chance of finding two people with the same birthday? Assume that every year has 365 days. Also assume that the people are randomly chosen, which means that all days of the year are equally likely to be the birthday of a person.
(i) Let n ∈ N, n ≥ 2, and give an explicit description of a sample space X representing the birthdays of n randomly chosen people, and define the probability set function μ : P(X) → [0, 1].
(ii) Explicitly define the subset A ∈ P(X) representing the event that n randomly chosen people all have different birthdays.
(iii) Determine μ(A), the probability that n randomly chosen people have different birthdays. From this, find a formula for the probability P_n that at least two people in n randomly chosen people have the same birthday.
(iv) Show that P_n < 1/2 for n < 23 and P_n > 1/2 for n ≥ 23, and conclude that we need a classroom with at least 23 students.


5. (Another birthday problem) We now do the last problem with "classroom" replaced by any group of people in history! We shall assume the Gregorian calendar (the calendar established October 15, 1582 and currently in use in most countries) has always been used. In this calendar, a leap birthday (on Feb. 29) occurs 97 times every 400 years. Thus, in 400 years, there are a total of 400 · 365 + 97 = 146097 days. A regular (non-leap) birthday occurs 400 times in these 146097 days while a leap birthday occurs only 97 times. Thus, a = 400/146097 (respectively, b = 97/146097) is the probability a randomly chosen person's birthday is a regular day (respectively, leap day). Let n ∈ N, n ≥ 2, let Y = {1, 2, . . . , 366} (366 being the leap day), and let X = Y^n, which represents the sample space of the possible birthdays for n people.
(i) Assign a probability to a singleton {(x_1, . . . , x_n)} consisting of n birthdays and define a probability set function μ : P(X) → [0, 1].
(ii) Let A ⊆ X be the event that n randomly chosen people all have different birthdays, let A_0 ⊆ A be the subset of A where none of the n people (who have different birthdays) have birthday 366 and, for i = 1, 2, . . . , n, let A_i ⊆ A be the subset of A where the i-th person has birthday 366. Find μ(A_i) for i = 0, 1, . . . , n.
(iii) Prove that μ(A) = 365 · 364 ⋯ (365 − n + 2) · a^n · (366 − n + 97n/400). Then find the probability that at least two people in n randomly chosen people have the same birthday. Optional: Using a computer, check that we need at least 23 people to have a better than 50–50 chance of finding two people with the same birthday.
6. (Yet another birthday problem) In a group of n people, let k of them be women. What is the probability that in a group of n people, at least one woman shares a birthday with another person? Assume 365-day years and that all days of the year are equally likely to be the birthday of a person.
(i) Give an explicit description of a sample space X and the subset A ∈ P(X) representing the event that in a group of n people, at least one woman shares a birthday with another person.
(ii) Determine the probability in question.
7. The first of two players (of equal ability) to win three matches wins $D dollars. Player A wins the first match, but was injured so cannot continue. How should we fairly divide the prize money?
8. The problem of points has been around for a long time. Perhaps the first published version is by Fra Luca Pacioli in Summa de Arithmetica [303] in 1494 (translation found in [295]): "A team plays ball such that a total of 60 points is required to win the game, and each inning counts 10 points. The stakes are 10 ducats. By some incident they cannot finish the game and one side has 50 points and the other 20. One wants to know what share of the prize money belongs to each side. In this case I have found that opinions differ from one to another, but all seem to me insufficient in their arguments, but I shall state the truth and give the correct way." Solve this problem assuming each team is equally likely to win an inning.
9. This problem is quoted from [386, p. 10]: Suppose each player to have staked a sum of money denoted by A; let the number of points in the game be n + 1, and suppose the first player to have gained n points and the second player none. If the players agree to separate without playing any more, prove that the first player is entitled to 2A − A/2^n, assuming each player is equally likely to gain a point.
10.
(Casanovas Lottery) In 1756 a meeting was held to discuss strategies on how to raise funds to help nance the French military school Ecole Militaire. Giacomo Casanova (17251798) was invited to the meeting and proposed a lottery following a similar lottery introduced in the 1600s in the city of Genoa located in northern Italy (you can read about Casanovas lottery in [372]). Heres how the lottery is played. Tickets with the numbers 1, 2, . . . , 90 were sold to the people. The person can choose a single number, or two numbers, . . ., up to ve numbers. At a public drawing, ve


Bet              Probability           Odds of winning     Payoff
Extrait simple   C(89, 4)/C(90, 5)     1 in 18             15
Ambe simple      C(88, 3)/C(90, 5)     1 in 400.5          270
Terne            C(87, 2)/C(90, 5)     1 in 11,748         5,500
Quaterne         C(86, 1)/C(90, 5)     1 in 511,038        75,000
Quine            C(85, 0)/C(90, 5)     1 in 43,949,268     1,000,000

Table 3. Odds of winning Casanova's Lottery (La Loterie de France).

numbers were drawn from the ninety. If the person chose a single number (called an extrait simple) and it matched any one of the five numbers, he won 15 times the cost of the ticket; if he chose two numbers (ambe simple) and they matched any two of the five numbers, he won 270 times the cost of the ticket; similarly, if he won on choosing three (terne), four (quaterne), or five (quine) numbers, he won respectively 5,500, 75,000, and 1,000,000 times the cost of the ticket (see Table 3). In this problem we study Genoese type lotteries. Let n ∈ N and label n tokens with the numbers 1, . . . , n. Let m (with m ≤ n) be the number of tokens drawn, one after the other, from a rotating cage. (E.g. n = 90 and m = 5 in Casanova's lottery.) Assume that each token is equally likely to be drawn.
(i) Explain how

X_1 = {(x_1, . . . , x_m) ; x_i ∈ {1, . . . , n}, x_i ≠ x_j for i ≠ j}   and   X_2 = {x ; x ⊆ {1, . . . , n}, #x = m}

represent two different sample spaces for the phenomenon of randomly choosing m tokens from n tokens.
(ii) Let ℓ ≤ m and let a_1, a_2, . . . , a_ℓ ∈ {1, 2, . . . , n} be distinct numbers on a ticket a sucker has chosen. What is the event that these numbers match any ℓ of the numbers on m randomly drawn tokens from the cage? Describe the event as subsets A_1 ⊆ X_1 and A_2 ⊆ X_2.
(iii) Find #X_1, #X_2, #A_1, and #A_2 and show that

#A_1/#X_1 = #A_2/#X_2,  both of which equal  C(n − ℓ, m − ℓ)/C(n, m).

Conclude that the probability that the numbers a_1, a_2, . . . , a_ℓ match any ℓ of the numbers on m randomly drawn tokens from the cage equals C(n − ℓ, m − ℓ)/C(n, m). Verify that Table 3 gives the odds for Casanova's lottery.
The odds shown in Table 3 are certainly against the people (e.g. in quine you can only win one million times what you paid, although the odds are that you have to spend 43 million times the cost of a ticket to win quine), so we can see why in 1819 Pierre-Simon Laplace (1749–1827) said in an address to a governmental council [372, p. 4]:


The poor, excited by the desire for a better life and seduced by hopes whose unlikelihood it is beyond their capacity to appreciate, take to this game as if it were a necessity. They are attracted to the combinations that permit the greatest benefit, the same that we see are the least favorable.

The lottery was discontinued in 1836.
(iv) In Casanova's lottery, there are two other types of bets. One is called Extrait déterminé, where a person specifies a single number and the place where the number occurs (e.g. 12 as the second token drawn amongst the five). The payoff is 70 times the wager. The other is the Ambe déterminé, where a person specifies two numbers and the places where the numbers occur (e.g. 12 as the second token and 33 as the fourth token drawn amongst the five). The payoff is 5,100 times the wager. What sample space, X_1 or X_2, would you use to compute probabilities for winning Extrait déterminé and Ambe déterminé? Compute the corresponding probabilities for general n and m ≤ n. (For Casanova's lottery, n = 90 and m = 5, the answers are, respectively, 1 in 90 and 1 in 8,010.)

11. (The Canadian 6/49 Lottery) On a Canadian 6/49 lottery ticket,²⁴ you choose six numbers from 1, 2, . . . , 49. For the drawing, 49 balls are labeled 1, 2, . . . , 49 and then six balls are drawn at random from the 49 balls. After this, a seventh ball, called the bonus ball, is drawn at random from the remaining 43 balls. You win under the following conditions:
(1) Three, four, or six of your numbers match respectively three, four, or six of the numbers of the first six balls drawn. (The bonus ball is irrelevant here.)
(2) Two or five of your numbers match respectively two or five of the numbers of the first six balls drawn and one of your other numbers matches the bonus ball.
(3) Five of your numbers match five of the numbers of the first six balls drawn and your sixth number does not match the bonus ball.
See Table 4 for the odds of winning Lotto 6/49.²⁵

Matches       Probability                              Approximate odds of winning
2 + bonus     C(6, 2) C(43, 4)/C(49, 6) · 4/43         1 in 81
3             C(6, 3) C(43, 3)/C(49, 6)                1 in 57
4             C(6, 4) C(43, 2)/C(49, 6)                1 in 1032
5, no bonus   C(6, 5) C(42, 1)/C(49, 6)                1 in 55,491
5 + bonus     C(6, 5) C(43, 1)/C(49, 6) · 1/43         1 in 2,330,636
6             C(6, 6) C(43, 0)/C(49, 6)                1 in 13,983,816

Table 4. Odds of winning the Canadian Lotto 6/49.

In this problem we study 6/49 type lotteries. Let n ∈ N and label n tokens with the numbers 1, . . . , n. Let m (with m ≤ n) be the number of tokens drawn. (E.g. n = 49 and m = 6 in Lotto 6/49.)

24 The New York Lotto, as well as other lotteries, is very similar to the Canadian 6/49.
25 To put these odds in perspective, the chance of a person in the USA being killed (in his lifetime) by lightning is about 1 out of every 35,000 (see http://www.lightningsafety.noaa.gov/resources/LtgSafety-Facts.pdf). If you buy one lotto ticket an hour, how many years will it take for you to have a better than 50–50 chance of winning the jackpot (six matches)? Answer: You will be dead for many many years before it happens!


Assume that each token is equally likely to be drawn. (We'll consider a bonus ball in a moment.) Let a_1, a_2, . . . , a_m ∈ {1, 2, . . . , n} be distinct numbers on a ticket a sucker has chosen, and let ℓ ∈ N with ℓ ≤ m.
(i) Let X_1 = {x ; x ⊆ {1, . . . , n}, #x = m}. What is the event that exactly ℓ numbers amongst a_1, . . . , a_m match numbers on m randomly drawn tokens? Describe the event as a subset A_1 ⊆ X_1. Prove that

Probability that A_1 occurs = C(m, ℓ) C(n − m, m − ℓ)/C(n, m).

Using the formula, verify the second, third, and sixth rows in Table 4. Suggestion: Group the numbers {1, 2, . . . , n} into two groups, {a_1, . . . , a_m} and {1, 2, . . . , n} \ {a_1, . . . , a_m}. To match ℓ numbers, you need exactly ℓ numbers from the first group and exactly m − ℓ numbers from the second group.
(ii) Henceforth assume that after the m balls are drawn, another ball, the bonus ball, is drawn at random amongst the remaining n − m balls. Write down a sample space, call it X_2, to represent this situation. What is the event that exactly ℓ numbers amongst a_1, . . . , a_m match numbers on the first m randomly drawn tokens and one of your remaining m − ℓ numbers matches the bonus ball? Describe the event as a subset A_2 ⊆ X_2. Prove that

Probability that A_2 occurs = C(m, ℓ) C(n − m, m − ℓ)/C(n, m) · (m − ℓ)/(n − m).

Using the formula, verify the first and fifth rows in Table 4.
(iii) What is the event that exactly ℓ numbers amongst a_1, . . . , a_m match numbers on the first m randomly drawn tokens and none of your remaining m − ℓ numbers matches the bonus ball? Describe the event as a subset A_3 ⊆ X_2. Prove that

Probability that A_3 occurs = C(m, ℓ) C(n − m − 1, m − ℓ)/C(n, m).

Using the formula, verify the fourth row in Table 4.

1.6. Lebesgue and Lebesgue–Stieltjes additive set functions

Last section we introduced additive set functions and gave examples of them occurring in probability. In this section we study the additive set functions occurring in the geometry of Euclidean space.

1.6.1. Lebesgue measure on I^n. Recall that

m : I^1 → [0, ∞)

is defined by m(∅) := 0 and, for each nonempty (a, b] ∈ I^1,

m(a, b] := b − a.

The function m is called Lebesgue measure on R^1. We can also define Lebesgue measure on R^n. Recall that we denote a left-half open box (a_1, b_1] × ⋯ × (a_n, b_n] in R^n by the notation (a, b], where a and b are the n-tuples of numbers a = (a_1, . . . , a_n), b = (b_1, . . . , b_n) with a_k ≤ b_k for each k. Also recall that I^n is the set of such boxes. We define Lebesgue measure on boxes in the obvious way:

m : I^n → [0, ∞)

is defined by

m(a, b] := (b_1 − a_1)(b_2 − a_2) ⋯ (b_n − a_n),


which is the product of the lengths of the sides of the box. From the picture

[Figure 1.15. A rectangle B decomposed as a union of non-overlapping rectangles B_1, . . . , B_6. If B is the big rectangle, it's obvious that m(B) = Σ_{k=1}^N m(B_k).]
it is obvious that if a box is partitioned into smaller boxes, then the measure of the box is the sum of the measures of the smaller boxes. This is obviously true, but its proof is not at all trivial; we shall prove it in Proposition 1.18 below. In other words, it's obvious that m is finitely additive, where, recalling the definition from the last section, a function μ : I → [0, ∞] on a semiring I is called a set function and is said to be additive or finitely additive if
(1) μ(∅) = 0.
(2) If A = ⋃_{n=1}^N A_n ∈ I with A_1, . . . , A_N ∈ I pairwise disjoint, then

(1.15)    μ(A) = Σ_{n=1}^N μ(A_n).
The principle, which we'll see again and again (e.g. in the law of large numbers in Section 2.4), to proving that Lebesgue measure is additive is to transform this measure problem into an integration problem involving functions; in this case, Riemann integration at the calculus level involving characteristic functions. Here, given a nonempty set X and a subset A ⊆ X, the characteristic (or indicator) function of A, χ_A : X → R, is defined by

χ_A(x) := 1 if x ∈ A,   0 if x ∉ A.

For example, observe that for any interval (a, b] ∈ I^1, the characteristic function χ_(a,b] : R → R is Riemann integrable and²⁶

(1.16)    ∫ χ_(a,b](t) dt = m(a, b].

Here's a picture (see Figure 1.16). Thus, characteristic functions relate integration with measures; this is a key observation that will help us later. Another key observation is the following lemma.

26 For simplicity, we omit the limits of integration; the integral ∫ χ_(a,b](t) dt is taken over any interval containing (a, b].



[Figure 1.16. ∫ χ_(a,b](t) dt = ∫_a^b dt = b − a = m(a, b].]

Product-sum formulas for characteristic functions

Lemma 1.17. Given any set X and a subset E = E_1 ∪ E_2 ∪ E_3 ∪ ⋯, where E_1, E_2, E_3, . . . ⊆ X are countably many pairwise disjoint sets, we have

χ_E = χ_{E_1} + χ_{E_2} + χ_{E_3} + ⋯ .


Given another set Y and subsets A ⊆ X and B ⊆ Y, we have

χ_{A×B}(x, y) = χ_A(x) χ_B(y).

Proof: To prove the first equality, observe that x ∈ E if and only if x ∈ E_n for a unique n. It follows that χ_E(x) = 1 if and only if Σ_n χ_{E_n}(x) = 1. To prove the second equality, note that

χ_{A×B}(x, y) = 1 ⟺ (x, y) ∈ A × B ⟺ x ∈ A, y ∈ B ⟺ χ_A(x) = 1, χ_B(y) = 1 ⟺ χ_A(x) χ_B(y) = 1.

Proposition 1.18. Lebesgue measure m : I^n → [0, ∞) is additive.


Proof: For notational simplicity, we prove this result for n = 2; the general case is only notationally more cumbersome. Let I × J ∈ I^2 = I^1 × I^1, and suppose that

I × J = ⋃_{k=1}^N I_k × J_k

is a union of pairwise disjoint left-half open rectangles, where I_k, J_k ∈ I^1. We need to prove that

m(I × J) = Σ_{k=1}^N m(I_k × J_k);  that is,  m(I) m(J) = Σ_{k=1}^N m(I_k) m(J_k).

To prove this, note that by the sum formula for characteristic functions, we have

χ_{I×J}(x, y) = Σ_{k=1}^N χ_{I_k × J_k}(x, y),

and then by the product formula, we conclude that

(1.17)    χ_I(x) χ_J(y) = Σ_{k=1}^N χ_{I_k}(x) χ_{J_k}(y).

Let us fix x ∈ R in the equality (1.17), and regard both sides of the equality as functions only of the variable y. Then integrating both sides of (1.17) with respect to y and using (1.16), we obtain

∫ χ_I(x) χ_J(y) dy = Σ_{k=1}^N χ_{I_k}(x) ∫ χ_{J_k}(y) dy,   that is,   m(J) χ_I(x) = Σ_{k=1}^N m(J_k) χ_{I_k}(x).

We now integrate both sides of m(J) χ_I(x) = Σ_{k=1}^N m(J_k) χ_{I_k}(x) with respect to x, again using (1.16), obtaining

m(I) m(J) = Σ_{k=1}^N m(I_k) m(J_k).

This proves our result.
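As a numerical illustration (ours, not from the text) of the idea behind this proof, one can approximate the integral of the characteristic function of a box by a Riemann sum on a grid and recover its measure; the box (0, 0.5] × (0, 0.75] and the grid size N below are our choices.

    N = 500                                    # grid points per axis on (0, 1] x (0, 1]
    h = 1.0 / N

    def chi(x, y):
        """Characteristic function of the box (0, 0.5] x (0, 0.75]."""
        return 1.0 if (0 < x <= 0.5 and 0 < y <= 0.75) else 0.0

    riemann_sum = sum(chi((i + 1) * h, (j + 1) * h) * h * h
                      for i in range(N) for j in range(N))
    print(riemann_sum)                         # approximately 0.375 = m((0, 0.5] x (0, 0.75])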


1.6.2. Lebesgue–Stieltjes additive set functions. We end this section with another example of a set function on I^1 that is of importance in many fields such as functional analysis and probability theory. Before introducing this set function, we review some definitions. A function f : R → R is said to be nondecreasing if for any x ≤ y, we have f(x) ≤ f(y). Although we are mostly interested in nondecreasing functions, we remark that the function f is called nonincreasing if for any x ≤ y, we have f(x) ≥ f(y). A monotone function is a function that is either nondecreasing or nonincreasing. The following lemma contains some of the main properties of nondecreasing functions.

Lemma 1.19. Let f be a nondecreasing function on R. Then the left and right-hand limits, f(x−) and f(x+), exist at every point. Moreover, the following relations hold:

f(x−) ≤ f(x) ≤ f(x+),  and if x < y, then f(x+) ≤ f(y−).

[Figure 1.17. This picture of a nondecreasing function f suggests that f(x−) = sup{f(y) ; y < x} and f(x+) = inf{f(y) ; x < y}.]

Proof: See Figure 1.17 for an illustration of this lemma. Fix x ∈ R. Since f is nondecreasing, for all y < x we have f(y) ≤ f(x), so the set A := {f(y) ; y < x} is bounded above by f(x). Hence, the supremum of A exists; call it a. We shall prove that a = lim_{y→x−} f(y). To this end, let ε > 0 be given. Then the number a − ε cannot be an upper bound for A and hence there is a z < x such that a − ε < f(z), or, rearranging the inequality, a − f(z) < ε. Now given any y with z < y < x, by monotonicity we have f(z) ≤ f(y), therefore

z < y < x ⟹ f(y) ≥ f(z) ⟹ a − f(y) ≤ a − f(z) ⟹ a − f(y) < ε.

On the other hand, since a is an upper bound of A, it follows that for any y < x we have f(y) ≤ a, which implies that for y < x, |a − f(y)| = a − f(y). To summarize: For all y ∈ R with z < y < x, we have |a − f(y)| < ε. This means, by definition, that

lim_{y→x−} f(y) = a = sup{f(y) ; y < x}.

Thus, f(x−) = a. Moreover, since f(x) is an upper bound for A, it follows that a ≤ f(x). Hence, f(x−) ≤ f(x). By considering the set {f(y) ; x < y}, one can similarly prove that f(x+) = inf{f(y) ; x < y} ≥ f(x).


Let x < y. Then we can choose w with x < w < y, so by definition of infimum and supremum,

f(x+) = inf{f(y) ; x < y} ≤ f(w) ≤ sup{f(z) ; z < y} = f(y−).

This completes our proof.

There is an analogous statement for nonincreasing functions, although we won't need it. Given a nondecreasing function f, we define the set function

μ_f : I^1 → [0, ∞)

by

μ_f(a, b] := f(b) − f(a).

This set function is called the Lebesgue–Stieltjes set function of f. Here, Stieltjes refers to Thomas Jan Stieltjes (1856–1894) who, shortly before his death, in a famous 1894 paper on continued fractions [368], introduced what are now called Riemann–Stieltjes integrals (see Section 6.2), where Lebesgue–Stieltjes set functions come from. One way to interpret μ_f, as Stieltjes originally did, is to consider a rod lying along the interval [0, ∞) and let f(x) = mass of the rod on the interval [0, x]:

[Figure: a rod along [0, ∞) whose mass on the interval [0, x] is f(x).]

Then μ_f(a, b] = f(b) − f(a) is exactly the mass of the rod between a and b, so μ_f measures not necessarily uniform mass distributions. Another interpretation of μ_f is that it measures how much f distorts lengths. Indeed, μ_f(a, b] is just the length of the interval (f(a), f(b)], the image of the interval (a, b] under f. Here are some examples:

[Figure: three pictures of μ_f(a, b] = f(b) − f(a), for f(x) = x, for f(x) = H(x) (the Heaviside function), and for f(x) = e^x.]

In particular, for f(x) = x, we get the usual Lebesgue measure. The middle picture shows the Heaviside function H(x), which equals 0 for x < 0 and 1 for x ≥ 0, and for the particular interval in the picture we have μ_H(a, b] = 0. In the third picture, f distorts lengths exponentially. Problem 3 looks at various examples of Lebesgue–Stieltjes set functions including the Heaviside function. Our next task is to prove that general Lebesgue–Stieltjes set functions are additive, and to do so, we need the following lemma.

Lemma 1.20. Let (a, b] = ⋃_{n=1}^N (a_n, b_n] be a union of pairwise disjoint nonempty left-half open intervals. Then we can relabel the sets (a_1, b_1], (a_2, b_2], . . . , (a_N, b_N] so that

b_1 = b,  a_N = a,  and  a_n = b_{n+1},  n = 1, 2, . . . , N − 1.


a2

a1

] (

b1 a3

] (

]
b3

b2

a3

a2

] (

b2 a1

] (

]
b1

b3

Figure 1.18. The left gure shows a left-half open interval written as
a pairwise disjoint union (a2 , b2 ] (a1 , b1 ] (a3 , b3 ]. By relabeling the subscripts, we can write this same union as (a3 , b3 ] (a2 , b2 ] (a1 , b1 ] where b1 is the right end point, a3 is the left end point, and an = bn+1 for n = 1, 2.

Figure 1.18 shows an illustration of this lemma when N = 3. Since the statement of this lemma seems so intuitively obvious, we shall leave the details to you. (Warning: Although obvious, an honest, completely rigorous proof is tedious and, written out in detail, should take you about a page!) In the following proposition we prove that Lebesgue–Stieltjes set functions characterize all finite-valued additive set functions on I^1.

Universality of Lebesgue–Stieltjes set functions on I^1

Proposition 1.21. A set function μ : I^1 → [0, ∞) is finitely additive if and only if μ = μ_f for some nondecreasing function f : R → R.

Proof: Necessity is proved in Problem 4. To prove sufficiency, let f : R → R be nondecreasing. Given a union (a, b] = ⋃_{n=1}^N I_n of pairwise disjoint elements of I^1, we need to show that μ_f(a, b] = Σ_{n=1}^N μ_f(I_n). Since μ_f(∅) = 0 we may assume that the I_n's are nonempty. Then according to the previous lemma we can relabel the I_n's so that
(a, b] = ⋃_{n=1}^N (a_n, b_n],

where b_1 = b, a_N = a, and a_n = b_{n+1} for n = 1, 2, . . . , N − 1. Now observe that the following sum telescopes:

Σ_{n=1}^N μ_f(a_n, b_n] = Σ_{n=1}^N (f(b_n) − f(a_n))
  = (f(b_1) − f(a_1)) + (f(b_2) − f(a_2)) + ⋯ + (f(b_N) − f(a_N))
  = f(b_1) − f(a_N),

which is f(b) − f(a). This is just μ_f(a, b], exactly as we set out to prove.

We remark that Lebesgue–Stieltjes set functions can be defined for nondecreasing functions defined on any interval by just extending the function to be constant outside of the interval so that it remains a nondecreasing map on R. For instance, if f : [a, b] → R is a nondecreasing function on a closed interval, we define f(x) = f(a) for x < a and f(x) = f(b) for x > b. Then the extended map f : R → R is nondecreasing, so it defines a Lebesgue–Stieltjes set function.
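As an illustrative check (ours, not from the text) of the telescoping argument in Proposition 1.21, one can compute the Lebesgue–Stieltjes sums of a partition of (a, b] for a concrete nondecreasing function such as f(x) = e^x; the helper name mu_f and the chosen partition points are ours.

    import math

    def mu_f(left, right, f=math.exp):
        return f(right) - f(left)              # mu_f(a, b] := f(b) - f(a)

    a, b = 0.0, 2.0
    cuts = [0.0, 0.3, 1.1, 1.7, 2.0]           # a = c_0 < c_1 < ... < c_N = b
    pieces = list(zip(cuts[:-1], cuts[1:]))    # (c_0, c_1], (c_1, c_2], ...
    total = sum(mu_f(l, r) for (l, r) in pieces)
    assert math.isclose(total, mu_f(a, b))     # the sum telescopes to f(b) - f(a)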
Exercises 1.6.
1. Let μ : I → [0, ∞] be a map on a semiring I satisfying μ(A) = Σ_{n=1}^N μ(A_n) for any set A ∈ I written as A = ⋃_{n=1}^N A_n where A_1, . . . , A_N ∈ I are pairwise disjoint. If μ(A) < ∞ for some A, show that μ(∅) = 0. Thus, the requirement that μ(∅) = 0 in the definition of an additive set function is redundant if μ is not identically ∞.


2. The notion of σ-finite will occur quite often in future chapters. An additive set function μ on a semiring I of subsets of a set X is said to be σ-finite if X = ⋃_{n=1}^∞ X_n where {X_n} is a sequence of pairwise disjoint sets in I with μ(X_n) < ∞ for each n. Most measures of practical interest are σ-finite.
(a) Show that Lebesgue measure on I^n and any Lebesgue–Stieltjes set function on I^1 are σ-finite.
(b) Prove that μ is σ-finite if X = ⋃_{n=1}^∞ X_n where {X_n} is a sequence of not necessarily pairwise disjoint sets in I with μ(X_n) < ∞ for each n. Suggestion: Use the fundamental lemma of semirings (Lemma 1.3).
3. In this problem we look at examples of Lebesgue–Stieltjes set functions.
(a) Let I be the semiring of left-half open intervals in (0, 1] and define ν : I → [0, ∞] by ν(a, b] = b − a if a ≠ 0 and ν(a, b] = ∞ if a = 0. Show that ν is finitely additive.
(b) Given a function g : R → [0, ∞) that is Riemann integrable on any finite interval, we define m_g : I^1 → [0, ∞) by taking the Riemann integral of g:

m_g(a, b] := ∫_a^b g(x) dx.

In particular, if g = 1, this is just the usual Lebesgue measure m. Let f : R → R be a nondecreasing continuously differentiable function. Show that μ_f = m_{f′}, where μ_f is the Lebesgue–Stieltjes measure corresponding to f. Remark: In the subject of Distribution theory it's common to identify the measure m_g with the function g; that is, consider the measure m_g and the function g defining the measure as the same. Thus, it is OK to write m_g = g, properly understood. Hence, the equality μ_f = m_{f′} can be written μ_f = f′ if you wish.
(c) Given α ∈ R, define H_α : R → R by H_α(x) := 0 if x < α, 1 if x ≥ α.

This function is called the Heaviside function, in honor of Oliver Heaviside (1850–1925) who applied it to simulate current in an electric circuit. Related to the Heaviside function is the Dirac delta "function" δ_α, named after the great mathematical physicist Paul Dirac (1902–1984), defined by the formal properties

∫_I δ_α(x) dx := 1 if α ∈ I,   0 if α ∉ I,

where I ∈ I^1. Of course, there is no function with these properties, hence the reason for the quotes on "function" (however, see Problem 2c in Exercises 3.2). In view of the remark in Part (b), prove that μ_{H_α} = δ_α.

Show that f is nondecreasing and = f . (b) Let g : R R be nondecreasing and suppose that f = g . Prove that f and g dier by a constant. So, the function corresponding to a LebesgueStieltjes set function is unique up to a constant.
27

If youre interested in the corresponding statement for I n , see [41, p. 176].



5. In this exercise, we study the translation invariance of measures on I 1 . Related properties for Rn are studied in Section 4.4. Given x R and A R, the translation of A by x is denoted by A + x or x + A: x + A = A + x = {a + x ; a A} = {y R ; y x A}.

(a) Prove that I 1 is translation invariant in the sense that if I I 1 , then x + I I 1 for any x R. (b) A set function : I 1 [0, ] is translation invariant if (x + I ) = (I ) for all x R and I I 1 . A function f : R R is ane if f (x) = ax + b for some a, b R. Prove that if is the LebesgueStieltjes set function dened by an ane function, then is translation invariant. In Problem 8 well prove the converse. 6. (Cauchys functional equation I) (Cf. [417]) In this and the next problem we study Cauchys functional equation, studied by Augustin-Louis Cauchy (17891857) in 1821. (This problem doesnt involves measures, but its useful for Problem 8.) A function f : R R is said to be additive if it satises Cauchys functional equation: for every x, y R , we have f (x + y ) = f (x) + f (y ). Suppose that f : R R is additive. (i) Prove that f (0) = 0 and for all x R, f (x) = f (x). (ii) Prove that f (rx) = r f (x) for all r Q and x R. In particular, setting x = 1, we see that f (r ) = f (1) r for all r Q. (We can do even better and say that f (x) = f (1) x for all x R if we add one assumption explained next.) (iii) Suppose that in addition to being additive, f is bounded on some interval (a, a) with a > 0; thus, there is a constant C > 0 such that for all |x| < a, we have |f (x)| C . Prove that f (x) = f (1) x for all x R. Suggestion: Show that for all n N, we have |f (x)| C/n if |x| < a/n. Next, x x R, let n N and choose r Q such that |x r | < a/n. Verify the identity f (x) f (1) x = f (x r ) f (1) (x r ) and try to estimate the absolute value of the right-hand side. 7. (Cauchys functional equation II: Hamels theorem) (Cf. [180]) Let f : R R be additive but not linear; that is, f not of the form f (x) = f (1) x for all x R. How bad can f be? After all, we know from Part (ii) of the previous problem that f (r ) = f (1) r for all r Q. In fact, f can be extremely bad! Hamels theorem, named after Georg Hamel (18771954), states that the graph of f , Gf := {(x, f (x)) ; x R}, is dense in R2 . In other words, for each p R2 and > 0 there is a point z Gf such that |p z | < . To prove this, you may proceed as follows. (i) Prove that if the graph of f (x)/f (1) is dense in R2 , then so is the graph of f . Conclude that we may henceforth assume that f (r ) = r for all r Q. (ii) Since f is not linear there is a point x0 R such that f (x0 ) = x0 ; thus, f (x0 ) = x0 + for some = 0. Let p R2 and > 0. Choose rational numbers r, s such that |p (r, s)| < /2. Next, choose a rational number a = 0 such that . 8 Finally, choose a rational number b such that x0 b < . If x = r + a(x0 b), 8|a| show that f (x) s = r + a s + a(x0 b). a = a + r s < sr < 8| |

(iii) Let z = (x, f(x)), where x = r + a(x₀ − b). Show that |p − z| < ε.
(iv) Use Hamel's theorem to prove Part (iii) of the previous problem; that is, prove that if f : R → R is additive and bounded on some interval (−a, a) with a > 0, then f must be linear.


8. Let f : R → R be a nondecreasing function. In this problem we prove that the Lebesgue–Stieltjes set function μ_f defined by f is translation invariant if and only if f is affine. Since every finite-valued additive set function on I¹ is a Lebesgue–Stieltjes set function (by Problem 4), after completing this problem you will have proven the following theorem.
Theorem. Lebesgue–Stieltjes set functions defined by affine functions are the only translation invariant set functions on I¹.
By Part (b) of Problem 5 we just have to prove necessity.
(i) Assume that μ_f is translation invariant. Let g(x) = f(x) − f(0). Show that g is additive, that is, g(x + y) = g(x) + g(y) for all x, y ∈ R.
(ii) Assume Part (iii) of Problem 6 and prove that g(x) = g(1) x for all x ∈ R. From this, deduce that f is affine.

Remarks
1.1: There are many expositions of Lebesgue's theory that you can find; see for example Ulam's nice article [392].

1.2: See [296] for a translation of Girolamo Cardano's (1501–1576) book Liber de Ludo Aleae, and see [295] for some history on the Pascal–Fermat–Méré triangle. For more on the dice problem, see [315], and for the problem of points, see [114]. For the general history of probability, see the classic (free) book [386], and for relations to measure theory see e.g. the articles [56], [105], [160], and [42].

1.5: Pierre-Simon Laplace (1749–1827) greatly influenced the mathematical theory of probability through his groundbreaking treatise Théorie analytique des probabilités [222], published in 1812. In Essai philosophique sur les probabilités, the introduction to the second edition of the Théorie analytique, he calls the principle

    probability of A := #A / #X = (number of elements of A) / (total number of possible outcomes)

the first principle of the calculus of probabilities (see [223, p. 5, 6]):

    The theory of chance consists in reducing all the events of the same kind to a certain number of cases equally possible, that is to say, to such as we may be equally undecided about in regard to their existence, and in determining the number of cases favorable to the event whose probability is sought. The ratio of this number to that of all the cases possible is the measure of this probability, which is thus simply a fraction whose numerator is the number of favorable cases and whose denominator is the number of all the cases possible.

In Section 1.5 we discussed (and defined) the notion of fairness and we used this notion in several examples. However, we remark that fairness in theory may not actually be fairness in practice. Perhaps the most famous illustration of this is Weldon's dice data. In 1894, Raphael Weldon (1860–1906) wrote to Francis Galton (1822–1911) concerning a dice experiment consisting of 23,806 tosses of twelve dice; you can read Weldon's letter in [306]. If the dice were fair, the probability of throwing a 5 or 6 on any given toss of one die is 1/3. However, the probability obtained experimentally from Weldon's 23,806 tosses turns out to be approximately 0.3377, a little larger than 1/3; see [130, p. 138]. One explanation (see [110, p. 273]) for the discrepancy between theory and practice could be that the hollowed-out pips on each face used to denote the numbers on a die make the die slightly unbalanced; e.g. the 6 face should be lighter than the 1 face. Since the 5 and 6 faces are the lightest faces, one might conjecture that they will land upwards most often. This is indeed the case, at least from Weldon's data. If you are interested in birthday-type problems, check out [35, 273, 285, 293].

CHAPTER 2

Finitely additive integration


The basic theme of this chapter (and a recurring theme in this book) is that we can use integration of functions to help us better understand the measure of sets.

2.1. Integration on semirings

In our proof that Lebesgue measure is additive on Iⁿ (Proposition 1.18) we saw how useful integration theory can be to derive properties of set functions. In this section, we develop a simple integration theory on semirings, and in Section 2.3 we apply this theory to study additive set functions on semirings. In Chapter 5 we develop a more sophisticated theory of integration on σ-algebras.

2.1.1. Integrals on semirings. Let μ : I → [0, ∞] be an additive set function on a semiring I of subsets of a set X. Our goal is to define integrals via μ. However, our integration theory shall be very primitive in that we only integrate simple functions, which are described as follows. Recall that for any subset A ⊆ X, χ_A : X → R is the characteristic function of A and is defined by

    χ_A(x) := 1 if x ∈ A,    χ_A(x) := 0 if x ∉ A.

A function f : X → R is called an I-simple function (also called an I-step function, or a simple random variable in probability) if f is of the form

(2.1)    f = ∑_{n=1}^{N} a_n χ_{A_n},

where A₁, . . . , A_N ∈ I are pairwise disjoint and a₁, . . . , a_N ∈ R. Here's an illustration of such a simple function:
Figure 2.1. Here, f = ∑_{n=1}^{5} a_n χ_{A_n}. The sum ∑_{n=1}^{5} a_n μ(A_n) represents the sum of the areas of the rectangles under f.


Given an I-simple function f = ∑_{n=1}^{N} a_n χ_{A_n}, we define the integral of f, denoted by ∫ f, as the extended real number

(2.2)    ∫ f := ∑_{n=1}^{N} a_n μ(A_n),

provided that the right-hand side is a well-defined extended real number (thus, +∞ and −∞ are allowable integrals). By well-defined, we mean that the right-hand side cannot contain a term equal to +∞ and another term equal to −∞ (because ∞ − ∞ is not defined). Recall that the convention is: if a_n = 0 and μ(A_n) = ∞, then a_n μ(A_n) := 0. Note that if all a_n are nonnegative, then the right-hand side of (2.2) is always well-defined (it may equal +∞, but this is OK) and it geometrically represents the area under the graph of f, as seen in Figure 2.1.

Going back to the definition of simple functions, we remark that the presentation of f as the sum (2.1) is not unique; for example, if I = I¹, then the simple function f(x) = χ_{(0,3]}(x) can be written in many different ways:

    χ_{(0,3]} = χ_{(0,1]} + χ_{(1,3]} = χ_{(0,1]} + χ_{(1,2]} + χ_{(2,3]} = ··· .

The basic reason for non-uniqueness is the fact that we can write unions of elements of I¹ in many different ways, as seen in Figure 2.2.

Figure 2.2. A₁ ∪ A₂ can be written in many different ways; e.g. A₁ ∪ A₂ = B₁ ∪ B₂ ∪ B₃ = C₁ ∪ ··· ∪ C₅.

Since a simple function can be written in many different ways, it's not obvious that the formula (2.2) gives the same value for all presentations of f. To prove this is indeed the case, suppose that f = ∑_{m=1}^{M} b_m χ_{B_m} is another presentation of f, where B₁, . . . , B_M ∈ I are pairwise disjoint and b₁, . . . , b_M ∈ R. Then

(2.3)    ∑_{n=1}^{N} a_n χ_{A_n}(x) = ∑_{m=1}^{M} b_m χ_{B_m}(x)    for all x ∈ X.

We assume that all the a_n's and b_m's are nonzero, otherwise we can just drop them from the sums. Now observe that

(2.4)    a_n = b_m    if A_n ∩ B_m ≠ ∅,

because at a point x ∈ A_n ∩ B_m, the left-hand side of (2.3) equals a_n and the right-hand side equals b_m. Next, we claim that

    A_n = ⋃_m (A_n ∩ B_m)    and    B_m = ⋃_n (A_n ∩ B_m).

For example, to prove the left-hand equality, let x ∈ A_n. Then the left-hand side of (2.3) equals a_n and hence is not zero, therefore the right-hand side is also not zero; in particular, x ∈ B_m for some m. This shows that A_n ⊆ ⋃_m (A_n ∩ B_m); the opposite inclusion ⋃_m (A_n ∩ B_m) ⊆ A_n is automatic, so A_n = ⋃_m (A_n ∩ B_m). The right-hand equality is proved similarly. Finally, observe that

    ∑_n a_n μ(A_n) = ∑_n ∑_m a_n μ(A_n ∩ B_m)    (by the additivity of μ)
                   = ∑_m ∑_n b_m μ(A_n ∩ B_m)    (by (2.4))
                   = ∑_m b_m μ(B_m)              (by the additivity of μ).

Thus, the integral of f is well-defined, independent of the presentation of f. We remark that since the notation ∫ f doesn't explicitly mention μ, in some cases it may not be clear what measure we are integrating with respect to; in such cases, to emphasize the measure, we use the notation ∫ f dμ for ∫ f.

For the semiring Iⁿ with Lebesgue measure m, we can denote the integral (2.2) by several notations:

    ∫ f dm,  ∫ f dx,  ∫ f(x) dx,  or  ∫ f(t) dt,  etc.,  for ∫ f.
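For readers who like to check things on a computer, here is a minimal Python sketch (ours, not part of the text; storing a simple function as a list of (value, interval) pairs is just an illustrative choice) that evaluates the sum (2.2) against Lebesgue measure on I¹ and confirms that two presentations of the same simple function give the same integral.

    # A minimal sketch: an I^1-simple function is stored as a list of
    # (a_n, (left, right]) pairs with pairwise disjoint half-open intervals.
    def lebesgue(interval):
        # Lebesgue measure of a left-half open interval (a, b].
        a, b = interval
        return b - a

    def integral(simple_fn, measure=lebesgue):
        # The sum (2.2): sum of a_n * mu(A_n).
        return sum(a * measure(A) for a, A in simple_fn)

    # chi_(0,3] written in two different ways, as in Figure 2.2:
    f1 = [(1.0, (0.0, 3.0))]
    f2 = [(1.0, (0.0, 1.0)), (1.0, (1.0, 2.0)), (1.0, (2.0, 3.0))]
    assert integral(f1) == integral(f2) == 3.0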

2.1.2. Average value of a function. This is another interpretation of the integral, besides the obvious geometric interpretation as an area under a graph. Consider briefly the following. Suppose we have a total of n students in a classroom, and suppose there are n₁ students with height h₁, n₂ students with height h₂, . . ., and n_N students with height h_N.

We can describe this situation with a height function. Let X denote the set of students and let H_k ⊆ X be the subset of students with height h_k. Then f : X → R defined by

    f = ∑_{k=1}^{N} h_k χ_{H_k}

is the function that takes a student x and outputs the height f(x) of the student (if x ∈ H_k, we have f(x) = h_k). What is the average height? This is easy:

    Ave. height = [∑_{k=1}^{N} h_k · (# of students with height h_k)] / (total # of students),

where the numerator (which equals ∑_{k=1}^{N} h_k n_k) is the sum total of the heights of all the students, and the denominator (which is n) is the total number of students. If μ : P(X) → [0, ∞) is the counting measure, which takes a subset of X and gives the number of elements of the set, we have

    Ave. height = (1/μ(X)) ∑_{k=1}^{N} h_k μ(H_k).


In view of (2.2), the sum on the right-hand side is exactly the integral of the height function f! (Of course, this example involved heights, but it works for computing averages of most anything you can think of!) More generally, in the setup described in the definition (2.2), we define

    Ave. value of f := (1/μ(X)) ∫ f,

where we assume that X ∈ I and 0 < μ(X) < ∞. In case μ is a finitely additive probability set function, so that μ(X) = 1, we have Ave. value of f = ∫ f.
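Here is a small Python sketch of the average-value formula (our own illustration, with made-up numbers), using the counting measure of the classroom example: the integral is ∑ h_k μ(H_k), and dividing by μ(X) gives the average height.

    # Sketch with made-up numbers: average height under the counting measure.
    heights = {160: 5, 170: 8, 180: 4}                    # height h_k -> number of students n_k

    mu_X = sum(heights.values())                          # mu(X), the total number of students
    integral_f = sum(h * n for h, n in heights.items())   # sum of h_k * mu(H_k)
    average_height = integral_f / mu_X                    # (1/mu(X)) * integral of f
    print(average_height)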

In the context of probability, an I-simple function is called a simple random variable, and the integral ∫ f dμ is called the expected value, or mean (average) value, of f, and is usually denoted by E(f). This number represents the value of f that we expect to observe, at least on average, if we repeat the experiment a large number of times. See Section 2.2 for more details on expectations.

2.1.3. Properties of the integral. We now prove that the integral has all the properties that we expect an integral to have. But first, a definition: A collection A of functions is called an algebra of functions if it is closed under taking linear combinations and products; that is,
(1) f, g ∈ A  ⟹  af + bg ∈ A for any a, b ∈ R;
(2) f, g ∈ A  ⟹  fg ∈ A.

Proposition 2.1. The set of I -simple functions forms an algebra.


Proof: We need to show that any linear combination and product of simple functions is again a simple function. That the product of simple functions is simple will be left to you (just use that χ_A χ_B = χ_{A∩B} for any sets A and B), and we shall only prove the linear combination statement. Let f and g be I-simple functions and let a, b ∈ R; we need to show that af + bg is an I-simple function. Actually, since it's easy to see that af and bg are I-simple functions, we just have to show that f + g is an I-simple function. Furthermore, we may assume that g has just one term (exercise: deduce the general case by induction on the number of terms in a presentation of g). Thus, let
    f = ∑_{n=1}^{N} a_n χ_{A_n},    g = c χ_B

be I-simple functions; we need to show that

    f + g = ∑_n c_n χ_{C_n},

where the sum is finite, c_n ∈ R, and the C_n ∈ I are pairwise disjoint. Before diving into the (complicated) proof that f + g is an I-simple function, consider the case where I = I¹, the left-half open intervals, and f = a χ_A consists of one term. Supposing that A and B are as in the following picture, we decompose A ∪ B as C₁ ∪ C₂ ∪ C₃:


    C₁ = A \ B,    C₂ = A ∩ B,    C₃ = B \ A.

Then χ_A = χ_{C₁} + χ_{C₂} and χ_B = χ_{C₂} + χ_{C₃}, so

    f + g = a χ_A + b χ_B = a(χ_{C₁} + χ_{C₂}) + b(χ_{C₂} + χ_{C₃}) = a χ_{C₁} + (a + b) χ_{C₂} + b χ_{C₃}.

Since C₁, C₂, C₃ are pairwise disjoint, this proves that f + g is an I-simple function; it's good to keep the ideas of this proof in mind when reading the more complicated proof below for the general case.

Back to the general case: from the semiring difference property (Property (1.10) of a semiring), there are finitely many pairwise disjoint sets {B_k} in I such that

    B \ ⋃_{n=1}^{N} A_n = ⋃_k B_k.

Therefore, using the difference-intersection formula S = (S \ T) ∪ (T ∩ S) for any sets S, T ⊆ X, we have

(2.5)    B = ⋃_k B_k ∪ ⋃_n (A_n ∩ B).

Again using the difference property of semirings, for each n there are finitely many pairwise disjoint sets {A_{nm}} in I such that

    A_n \ B = ⋃_m A_{nm}.

Therefore, by the difference-intersection formula, we have

(2.6)    A_n = ⋃_m A_{nm} ∪ (A_n ∩ B).

By (2.5) and (2.6) and the sum formula for characteristic functions in Lemma 1.17, we have

    χ_B = ∑_k χ_{B_k} + ∑_n χ_{A_n ∩ B}    and    χ_{A_n} = ∑_m χ_{A_{nm}} + χ_{A_n ∩ B}.

Therefore, the formulas f = ∑_n a_n χ_{A_n} and g = c χ_B take the form

    f = ∑_{n,m} a_n χ_{A_{nm}} + ∑_n a_n χ_{A_n ∩ B},    g = ∑_k c χ_{B_k} + ∑_n c χ_{A_n ∩ B},

and so,

(2.7)    f + g = ∑_{n,m} a_n χ_{A_{nm}} + ∑_n (a_n + c) χ_{A_n ∩ B} + ∑_k c χ_{B_k}.

By construction, the sets {A_{nm}, A_n ∩ B, B_k} are pairwise disjoint, so after all this work we see that f + g is an I-simple function.
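If you want to see the idea of the proof in action, the following Python sketch (ours, purely illustrative, for the special case I = I¹, with a simple function stored as a list of (value, (a, b]) pairs) adds two I¹-simple functions by refining at all interval endpoints — a computational analogue of the decomposition into the pairwise disjoint sets A_{nm}, A_n ∩ B, B_k used above.

    # Illustrative sketch for I = I^1: a simple function is a list of (value, (a, b]).
    def value_at(simple_fn, x):
        # Evaluate the simple function at the point x.
        return sum(a for a, (lo, hi) in simple_fn if lo < x <= hi)

    def add(f, g):
        # Present f + g over pairwise disjoint intervals by refining at all endpoints.
        endpoints = sorted({e for _, interval in f + g for e in interval})
        pieces = []
        for lo, hi in zip(endpoints, endpoints[1:]):
            v = value_at(f, (lo + hi) / 2) + value_at(g, (lo + hi) / 2)
            if v != 0:
                pieces.append((v, (lo, hi)))
        return pieces

    f = [(2.0, (0.0, 2.0))]     # 2 * chi_(0,2]
    g = [(3.0, (1.0, 3.0))]     # 3 * chi_(1,3]
    print(add(f, g))            # [(2.0, (0.0, 1.0)), (5.0, (1.0, 2.0)), (3.0, (2.0, 3.0))]

The printed presentation over the disjoint intervals (0,1], (1,2], (2,3] is exactly the kind of decomposition the proof produces by hand.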

Properties of the integral

Theorem 2.2. The integral has the following properties:
(1) The integral is nonnegative: ∫ f ≥ 0 for any nonnegative I-simple function f.


(2) If f and g are I-simple functions and a, b ∈ R, and if ∫ f exists, ∫ g exists, and the sum a ∫ f + b ∫ g makes sense, then the integral ∫(af + bg) exists and is linear:

    ∫(af + bg) = a ∫ f + b ∫ g.

(3) The integral is monotone: For any I-simple functions f and g with f ≤ g such that ∫ f and ∫ g exist, we have ∫ f ≤ ∫ g.

Proof: The proof of (2) is the longest, so we shall prove (1) and (3) first, then (2) last.
Step 1: To prove (1) is quick: If f is nonnegative, then in the presentation (2.1), each a_n must be nonnegative, which implies that ∫ f ≥ 0.
Step 2: Assuming we've already proved (2) (which we'll do in Step 3), we shall prove (3). Let f ≤ g be I-simple functions; we shall prove that ∫ f ≤ ∫ g, provided these integrals exist. If ∫ f = −∞, then ∫ f ≤ ∫ g is automatic, so assume that ∫ f ≠ −∞. Observe that g = f + (g − f), and by Proposition 2.1, g − f is an I-simple function. Moreover, since f ≤ g, the function g − f is nonnegative, so by (1) we know that ∫(g − f) ≥ 0. Since ∫ f ≠ −∞, the sum ∫ f + ∫(g − f) is well-defined, hence by (2),

    g = f + (g − f)  ⟹  ∫ g = ∫ f + ∫(g − f)  ⟹  ∫ f ≤ ∫ g,

where in the second implication we used that ∫(g − f) ≥ 0.
Step 3: To prove (2), given I-simple functions f and g and real numbers a, b, we need to show that ∫(af + bg) = a ∫ f + b ∫ g. Actually, since it's easy to show that ∫ af = a ∫ f and ∫ bg = b ∫ g, we just have to show that ∫(f + g) = ∫ f + ∫ g. Moreover, as in the proof of Proposition 2.1, we may assume that g has just one term. Thus, let
    f = ∑_{n=1}^{N} a_n χ_{A_n},    g = c χ_B

be I-simple functions. By definition of the integral, we have

    ∫ f + ∫ g = ∑_n a_n μ(A_n) + c μ(B),

where we assume that the sum on the right is well defined. The object of the game now is to express the right-hand side so it becomes ∫(f + g). To this end, note that by Equation (2.5) in the proof of Proposition 2.1 and additivity of μ, we have (see the proof of the previous proposition for the various notations used here)

    μ(B) = ∑_n μ(A_n ∩ B) + ∑_k μ(B_k),

and by Equation (2.6) in the proof of Proposition 2.1, for each n we have

    μ(A_n) = ∑_m μ(A_{nm}) + μ(A_n ∩ B).


Therefore,

    ∫ f + ∫ g = ∑_n a_n μ(A_n) + c μ(B)
              = ∑_n a_n [∑_m μ(A_{nm}) + μ(A_n ∩ B)] + c [∑_n μ(A_n ∩ B) + ∑_k μ(B_k)]
              = ∑_{n,m} a_n μ(A_{nm}) + ∑_n a_n μ(A_n ∩ B) + ∑_n c μ(A_n ∩ B) + ∑_k c μ(B_k)
              = ∑_{n,m} a_n μ(A_{nm}) + ∑_n (a_n + c) μ(A_n ∩ B) + ∑_k c μ(B_k).

This expression, in view of Formula (2.7) in Proposition 2.1, is by definition the integral ∫(f + g), just as we wanted to show.

We leave the proof of the following to you.¹

Corollary 2.3. An I-simple function is any function of the form

    f = ∑_{n=1}^{N} a_n χ_{A_n},    a_n ∈ R,  A_n ∈ I,

where A₁, . . . , A_N ∈ I are not necessarily disjoint, in which case

    ∫ f = ∑_{n=1}^{N} a_n μ(A_n),

provided that the right-hand side is defined.


Exercises 2.1.
1. Prove the following useful identities for characteristic functions:

    χ_{A∩B} = χ_A χ_B,    χ_{A^c} = 1 − χ_A,    χ_{A∪B} = χ_A + χ_B − χ_A χ_B.

2. In this problem, we connect integration with summation. Let P(N) be the power set of the natural numbers. Consider the counting function # : P(N) → [0, ∞] defined by #(A) := number of elements in A (in particular, #(A) = ∞ if A is infinite).
(a) Show that # is finitely additive on P(N).
(b) Given a simple function f : N → [0, ∞), show that

    ∫ f d# = ∑_{n=1}^{∞} f(n).

3. This exercise deals with Lebesgue–Stieltjes additive set functions.
(a) Let g be a nondecreasing function on R and let μ_g : I¹ → [0, ∞) be its corresponding Lebesgue–Stieltjes set function defined by μ_g(a, b] = g(b) − g(a). Given any I¹-simple function f = ∑_{k=1}^{N} a_k χ_{A_k} where A_k = (x_{k−1}, x_k], show that

    ∫ f dμ_g = ∑_{k=1}^{N} a_k {g(x_k) − g(x_{k−1})}.

1 In view of this corollary, you may wonder why at the beginning of this section we didnt N dene an I -simple function to be a function f : X R of the form f = n=1 an An , where N A1 , . . . , AN I are not necessarily pairwise disjoint, and then dene f = n=1 an (An ). The reason is that in this case, its a lot of work to prove that the integral is well-dened (independent of the presentation of f ). In contrast, assuming at the beginning that A1 , . . . , AN are pairwise disjoint, proving the integral was well-dened was relatively painless. We could then focus on theorems and then get the formula f = N n=1 an (An ) as a corollary of our theorems.


Readers familiar with the RiemannStieltjes integral will recognize the right-hand side as a RiemannStieltjes sum (Section 6.2 of Chapter 5). (b) Let g be a continuously dierentiable nondecreasing function on R. Prove that for any I 1 -simple function f , we have f dg = f g dx,

where the right-hand side denotes the Riemann integral of f g . 4. Let : I [0, ] be an additive set function on a semiring I and let g be a nonnegative I -simple function. Dene mg : I [0, ] by mg (A) := A g d, for all A I .

Note that A g is a nonnegative I -simple function (this follows from the fact that simple functions form an algebra), so the integral is dened. (i) Prove that mg : I [0, ] is additive. (ii) Prove that for any I -simple function f , we have (provided the integrals exist) f dmg = f g d.

5. Following [416], we give Bourbakis2 proof of Problem 12 in Exercises 1.3. Let R be a ring of subsets of a set X , and let ZX 2 be the ring of Z2 -valued functions on X . (Recall that Z2 = {0, 1} with addition and multiplication modulo 2.) (i) Show that as elements of ZX 2 , we have AB = A + B (modulo 2), where AB = (A \ B ) (B \ A) is the symmetric dierence of A and B . (ii) Show that R , with its operations of multiplication and additive given by intersection and symmetric dierences, respectively, is isomorphic to a subring of ZX 2 . (iii) Show that R is isomorphic to ZX 2 if and only if R is the power set of X .

2.2. Random variables and (mathematical) expectations

The theory of expectations can be traced back to a letter from Pascal to Fermat on Wednesday, July 29, 1654 on the problem of points. In this section we study expectations (really integrals) from the probabilistic viewpoint.

² By the way, Bourbaki was the brainchild of a group of French mathematicians started by Henri Cartan (1904–2008) and André Weil (1906–1998), and is not a real person. Bourbaki was just a pen name used by the group as the author of their math books.
³ Being technical, a random variable must be measurable, which means the event that measurements (= the function values) lie in any given, say, open, interval must always be observable (see Section 1.5.3 for "observable event"); we shall return to this discussion in Section 6.4.

2.2.1. Expectation as an expected average. In any experiment we perform, we always try to (1) observe the outcomes and then (2) take a measurement, or assign a numerical value (that is, record data), to what's observed. (For example, if we roll a die, we can observe and then record the number of dots on the top face.) In colloquial jargon, we would call (1) and (2) a measurement, and in usual mathematical jargon, we would call (1) and (2) a numerical function on the sample space (because to each element of the sample space, we assign a number). However, in probability jargon, we use the term random variable; thus, a³ random variable assigns numerical values to the outcomes of an experiment. In this section we consider simple random variables. Let (X, R, μ) be a field of probability, meaning that X is a sample space, R is a ring of observable events containing X, and μ : R → [0, 1] is an additive set function with μ(X) = 1. Let


f : X → R be a simple random variable, which is probability jargon for an R-simple function. Thus, we can write

    f = ∑_{k=1}^{N} a_k χ_{A_k},

where A₁, . . . , A_N ∈ R are pairwise disjoint. Note that for any outcome x ∈ X, f(x) is one of the values a₁, . . . , a_N. These values can vary widely and, depending on how big N is, can be quite extensive, so we shall look for a single value that represents the function reasonably well. A natural choice for such a value is the average value of f. From the previous section we know that the integral ∫ f represents the average value of f. However, since we are dealing with probability now, we can interpret "average value of f" in a slightly different way. Suppose that we repeat the experiment a large number of times, say n times where n is large, and we note the values of f on each experiment. Recall that on each experiment, f takes the value a_k with probability p_k = μ(A_k), where k = 1, 2, . . . , N. Thus, intuitively speaking, after doing the experiment n times, one would expect that f would take the value a_k approximately n p_k times. Hence, one would expect that

    the average value of f over n experiments = (sum of every value of f obtained over n experiments) / (number of experiments)
                                              ≈ [a₁(np₁) + a₂(np₂) + ··· + a_N(np_N)] / n.

Thus,

    the average value of f over n experiments ≈ a₁p₁ + a₂p₂ + ··· + a_N p_N = ∑_{k=1}^{N} a_k μ(A_k).

In other words, by definition of the integral, for n large,

    the expected average value of f over n experiments ≈ ∫ f.

Of course, as n gets larger and larger, the more precise this should be! This discussion shows us that the number ∫ f summarizes the average value of f. Thus, these thought experiments compel us to define, for any simple random variable f : X → R, the expectation, expected value, or mean value, of f by

    E(f) := ∫ f.

By our discussion above, we can interpret E(f) as the expected average value of f over a large number of experiments. When we study the law of averages (the weak law of large numbers) in Section 2.4 (see especially Subsection 2.4.3), we show that this interpretation is correct.

2.2.2. Expectation as an expected gain. As we described above, expected value represents an expected average value over many experiments. However, the idea of expected value was originally used by Pascal in a different sense, namely in the sense of an appropriate amount a gambler should be entitled to if he is not able to continue the game he is playing. In the days of Pascal, the currency in France was the louis d'or, seen on the side.⁴ This gold coin was struck in 1640 by Louis XIII and it was also called the pistole, after a Spanish gold coin used in France since the 1500s. Here is Pascal's letter to Fermat on Wednesday, July 29, 1654 [358, p. 547]:
This is the way I go about it to know the value of each of the shares when two gamblers play, for example, in three throws, and when each has put 32 pistoles at stake: Let us suppose that the first of them has two (points) and the other one. They now play one throw of which the chances are such that if the first wins, he will win the entire wager that is at stake, that is to say 64 pistoles. If the other wins, they will be two to two and in consequence, if they wish to separate, it follows that each will take back his wager, that is to say 32 pistoles. Consider then, Monsieur, that if the first wins, 64 will belong to him. If he loses, 32 will belong to him. Then if they do not wish to play this point, and separate without doing it, the first should say I am sure of 32 pistoles, for even a loss gives them to me. As for the 32 others, perhaps I will have them and perhaps you will have them, the risk is equal. Therefore let us divide the 32 pistoles in half, and give me the 32 of which I am certain besides. He will then have 48 pistoles and the other will have 16.

Let's see mathematically what Pascal is saying. In the second sentence "They now play one throw . . .", Pascal argues that we are really in the situation of one throw. Thus, consider the sample space

    X = {0, 1},

where 0 and 1 represent the first gambler losing, respectively winning, the throw, and let μ : P(X) → [0, 1] be the probability set function with μ{0} = μ{1} = 1/2; here we assume the gamblers are of equal ability. Let f : X → R be the random variable representing the first gambler's gain. Now according to Pascal's words,

    Consider then, Monsieur, that if the first wins, 64 will belong to him. If he loses, 32 will belong to him.

Hence, f(0) = 32 and f(1) = 64; that is, f = 32 χ_{{0}} + 64 χ_{{1}} in our usual notation involving characteristic functions. Then the first gambler says
I am sure of 32 pistoles, for even a loss gives them to me. As for the 32 others, perhaps I will have them and perhaps you will have them, the risk is equal. Therefore let us divide the 32 pistoles in half, and give me the 32 of which I am certain besides.

In other words, the first gambler claims that his rightful gain is

    32 + 32/2 = 48 pistoles.

⁴ Picture from the Wikipedia commons.


Notice that

    32/2 + 64/2 = 48

as well! Thus, the gambler's expected gain is exactly the expected value as we defined it! We can generalize Pascal's gambling example as follows. Suppose that the first gambler gains a pistoles if he loses and b pistoles if he wins; in this case f(0) = a and f(1) = b, or f = a χ_{{0}} + b χ_{{1}}. Then according to Pascal, the first gambler is sure of getting a pistoles, and of what's left over, namely b − a pistoles, the risk is equal that he will win them or lose them, so Pascal would argue that the gambler's rightful gain is

    a + (b − a)/2 = (a + b)/2.

On the other hand, we can get the same number through integration:

    ∫ f = a μ{0} + b μ{1} = a · 1/2 + b · 1/2 = (a + b)/2,

the same! Actually, this generalized gambling example is basically Proposition I of Christiaan Huygens' (1629–1695) book Libellus de Ratiociniis in Ludo Aleae [194] (see Problem 1), which is the first book to systematically study expectations.

Do you remember Gilles Personne de Roberval (1602–1675), who objected to Pascal's method of combinations we studied back in Section 1.5.4? (Speaking of Section 1.5, we invite you to solve the problem of points we studied back in that section using expectations, that is, using integrals.) He might object to Pascal's pistole argument because in reality the gamblers can play more than just one round. In fact, the true sample space is

    X = {1, 01, 00},

representing that the first gambler wins the first toss (1), loses the first toss but wins the second (01), or loses both tosses (00). In this case, the probabilities are

    μ{1} = 1/2,    μ{01} = 1/4,    μ{00} = 1/4,

and the random variable f, representing the first gambler's pistole winnings, is

    f(1) = 64,    f(01) = 64,    f(00) = 0.

Hence, Roberval would probably accept that the expected gain is

    E(f) = ∫ f = 64 · 1/2 + 64 · 1/4 + 0 · 1/4 = 48,

which is exactly the same number as before! Now let us consider the general case of a probability field (X, R, μ) and a simple random variable

    f = ∑_{k=1}^{N} a_k χ_{A_k},

where A₁, . . . , A_N ∈ R are pairwise disjoint. Suppose that f represents the gain of a gambler; that is, a₁ is the gain if the event A₁ occurs, a₂ is the gain if the event A₂ occurs, and so forth. If we put

    p_k = μ(A_k),    k = 1, 2, . . . , N,


then

    E(f) = ∫ f = a₁p₁ + a₂p₂ + ··· + a_N p_N = ∑_{k=1}^{N} (gain when A_k occurs) · (probability A_k occurs).

By our interpretation of expected value as an expected average, we know that E(f) is the expected average gain of the gambler if he plays the game a large number of times. From this viewpoint, it's reasonable to say that E(f) represents the appropriate amount the gambler should be entitled to if somehow he wouldn't be able to continue the game.

2.2.3. Examples. We now compute some expectations.
Example 2.1. In our first example, we shall see the difference between the everyday use of expectation and the mathematical use of expectation. Let's try to win the jackpot of the Canadian Lotto 6/49, as explained in Problem 11 in Exercises 1.5. We either win or lose, so X = {0, 1} where 0 = lose and 1 = win. We win the jackpot with probability p = 1/13,983,816. Let's say the jackpot is $10,000,000, and let f equal 10,000,000 if we win the jackpot and 0 otherwise. Then the mathematical expectation of the amount we win is

    E(f) = ∫ f = 10,000,000 · (1/13,983,816) + 0 · (1 − 1/13,983,816) = 0.715 . . . .

Thus, our mathematical expectation is about 72 cents. However, we (speaking for myself) really expect to win $0 of the jackpot!

Example 2.2. Suppose that we flip a fair coin n times; what is the expected number of heads that we'll throw? Let X = Yⁿ where Y = {0, 1}. Observe that if A_i = Y × ··· × Y × {1} × Y × ··· × Y (there are n − 1 factors of Y here), where the {1} is in the i-th factor, then f_i = χ_{A_i} equals 1 if we toss a head on the i-th toss and 0 if we toss a tail on the i-th toss. Note that μ(A_i) = 1/2 for each i. The function

    S_n := f₁ + f₂ + ··· + f_n

is the random variable giving the number of heads in n tosses. Therefore, the expected number of heads in n tosses of a coin is

    ∫ S_n = ∑_{i=1}^{n} ∫ f_i = ∑_{i=1}^{n} μ(A_i) = ∑_{i=1}^{n} 1/2 = n/2,

which is exactly as intuition tells us! Observe that

    S_n / n = (f₁ + f₂ + ··· + f_n) / n

is the random variable giving the number of heads per toss in n tosses; its expectation is ∫(S_n/n) = (1/n) ∫ S_n = (1/n)(n/2) = 1/2, just as intuition tells us!
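A quick simulation of Example 2.2 (ours, purely illustrative): averaging the number of heads in n tosses over many repetitions gives a number close to n/2.

    # Illustrative simulation of Example 2.2: expected number of heads in n tosses.
    import random

    n, trials = 10, 100_000
    total_heads = sum(sum(random.randint(0, 1) for _ in range(n)) for _ in range(trials))
    print(total_heads / trials, n / 2)    # empirical mean vs. the exact value n/2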

We shall return to the following example in Section 6.6 when we study the Law of Large Numbers.
Example 2.3. (Genoese type lotteries; cf. the Casanova's lottery problem in Problem 10 in Exercises 1.5) The ideas behind modern-day lotteries come from a lottery held in Genoa, a historic city in northern Italy, which dates from the early 1600s (see [28, 371, 372] for more on the Genoese lottery). The basics of the Genoese lottery were as follows. Five of 90 tokens labeled with the numbers 1, 2, . . . , 90 were drawn sequentially from a rotating cage, the "wheel of fortune," in a public place by a blindfolded boy in a blue suit (a common uniform in orphanages). Beforehand, players would choose one, two, three, four, or five particular numbers, and they would win if the numbers they chose matched any of the five numbers drawn.

Figure 2.3. There are two wheels of fortune in this lottery in Guildhall, London, 1751. Photo taken from [11, p. 68].

Let n ∈ N and label n tokens with the numbers 1, . . . , n. Let m (with m ≤ n) be the number of tokens drawn, one after the other, from a rotating cage (we assume each token is drawn with equal probability); e.g. n = 90 and m = 5 in the Genoese lottery. Observe that

    X = {(x₁, . . . , x_m) ; x_i ∈ {1, . . . , n}, x_i ≠ x_j for i ≠ j}

represents a sample space for the drawing of m tokens from a lot of n tokens, where x_i represents the i-th token drawn. X has n(n−1)(n−2)···(n−m+1) elements (since for a typical element (x₁, . . . , x_m) ∈ X there are n choices for x₁, n−1 choices for x₂, etc.). Thus, the probability measure is

    μ : P(X) → [0, 1],    μ(A) = #A / [n(n−1)···(n−m+1)].

If you're interested, see Problem 10 in Exercises 1.5 for the probabilities of the various ways to win the Genoese lottery. For each i = 1, 2, . . . , m, consider the random variable given by the value of the i-th token drawn:

    f_i : X → R    defined by    f_i(x) := x_i,    where x = (x₁, . . . , x_m).

Then f : X → R, where

    f := f₁ + f₂ + ··· + f_m,

represents the sum of the values of the randomly drawn tokens. What is the expectation of f? Since the expectation (integral) is linear, we just have to compute the expectation of each f_i. Observe that

    f_i = χ_{A_{i1}} + 2 χ_{A_{i2}} + 3 χ_{A_{i3}} + ··· + n χ_{A_{in}},

where A_{ik} = {(x₁, . . . , x_m) ∈ X ; x_i = k}, the event that the number k appears on the i-th draw. Thus,

    E(f_i) = ∫ f_i = μ(A_{i1}) + 2 μ(A_{i2}) + 3 μ(A_{i3}) + ··· + n μ(A_{in}).

For each k, the set A_{ik} has (n−1)(n−2)···(n−m+1) elements (do you see why?), therefore

    μ(A_{ik}) = (n−1)(n−2)···(n−m+1) / [n(n−1)(n−2)···(n−m+1)] = 1/n.


Another way to see this is to observe that intuitively the probability that the number k appears on the i-th draw should be 1/n, since there are n total numbers, each one equally likely, that could appear on the i-th draw. Thus,

    E(f_i) = (1 + 2 + 3 + ··· + n)/n = (n + 1)/2,

where we used the formula 1 + ··· + n = n(n + 1)/2. Since E(f) = E(f₁) + ··· + E(f_m), we conclude that

    E(f) = m(n + 1)/2.

For instance, in the Genoese lottery, we have n = 90 and m = 5, so

    The expected sum of the numbers drawn in the Genoese lottery = 227.5.

Exercises 2.2.
1. (Huygens' propositions) The following propositions are found in Christiaan Huygens' (1629–1695) book Libellus de Ratiociniis in Ludo Aleae [194]:
Proposition I: If I expect a or b, and have an equal chance of gaining either of them, my Expectation is worth (a + b)/2.
Proposition II: If I expect a, b, or c, and each of them be equally likely to fall to my Share, my Expectation is worth (a + b + c)/3.
Proposition III: If the number of Chances I have to gain a, be p, and the number of Chances I have to gain b, be q. Supposing the Chances equal; my Expectation will then be worth (ap + bq)/(p + q).
Prove each of these propositions using the mathematical definition of expectations.
2. (Cardano's game) In Girolamo Cardano's (1501–1576) book Liber de Ludo Aleae, he writes [296, p. 240]: "Thus, in the case of six dice, one of which has only an ace on one face, and another a deuce, and so on up to six, the total number is 21, which divided by 6, the number of faces, gives 3 1/2 for one throw." In other words, consider six dice, the first one having a single dot on one side and blanks on the other five sides, the second one having two dots on one side and blanks on the other five sides, and so forth. He says that if you roll all six dice, the expected number of dots rolled is 3 1/2. Can you prove this?
3. (Roulette) An American roulette wheel has the numbers 00, 0, 1, 2, 3, . . . , 36 on its perimeter. The 00 and 0 are in green, and the other numbers have the colors red and black; here's a picture where the reds appear whitish:⁵

A ball is spun on the wheel and it lands on a number. (i) (Singles) Suppose that you bet on a single number 00, 0, 1, . . . , 36. If the ball lands on your number, you are paid 35 to 1, namely you win 35 times the amount you bet, otherwise you lose the amount you bet. Suppose you bet $1 on a number; if the ball lands on your number you get $35, otherwise you lose your $1. What is the expected amount you will win?
⁵ Picture from the Wikipedia commons. Author is Ron Shelley.


(ii) (Doubles) Suppose that you bet on two numbers. If the ball lands on either number, you are paid 17 to 1. Suppose you bet $1 on doubles. What is the expected amount you will win?
(iii) (Triples) Suppose that you bet on three numbers. If the ball lands on one of your numbers, you are paid 11 to 1. Suppose you bet $1 on triples. What is the expected amount you will win?
(iv) (Reds) Suppose that you bet on reds (or on blacks, or on evens, or odds, or on high numbers (19–36) or low numbers (1–18)). If the ball lands on reds (or on blacks, or on evens, or odds, or on high numbers or on low numbers), you are paid 1 to 1. Note that 00 and 0 are considered odd if you bet on evens, and even if you bet on odds! Suppose you bet $1 on red. What is the expected amount you will win? (You get the same expected winnings if you bet on blacks or evens or odds or on highs or lows.)
4. (Pascal's wager) In this problem we look at Pascal's wager, the primordial example of the modern subject of decision theory. Blaise Pascal (1623–1662) argued that as long as there is a positive probability that God exists, a person should believe in Him. Here are Pascal's thoughts as quoted in article 233 of Pascal's Pensées:⁶

    Let us then examine this point, and say, God is, or He is not. But to which side shall we incline? Reason can decide nothing here. There is an infinite chaos which separated us. A game is being played at the extremity of this infinite distance where heads or tails will turn up. What will you wager? According to reason, you can do neither the one thing nor the other; according to reason, you can defend neither of the propositions. Do not, then, reprove for error those who have made a choice; for you know nothing about it. No, but I blame them for having made, not this choice, but a choice; for again both he who chooses heads and he who chooses tails are equally at fault, they are both in the wrong. The true course is not to wager at all. Yes; but you must wager. It is not optional. You are embarked. Which will you choose then? Let us see. Since you must choose, let us see which interests you least. You have two things to lose, the true and the good; and two things to stake, your reason and your will, your knowledge and your happiness; and your nature has two things to shun, error and misery. Your reason is no more shocked in choosing one rather than the other, since you must of necessity choose. This is one point settled. But your happiness? Let us weigh the gain and the loss in wagering that God is. Let us estimate these two chances. If you gain, you gain all; if you lose, you lose nothing. Wager, then, without hesitation that He is. That is very fine. Yes, I must wager; but I may perhaps wager too much. Let us see. Since there is an equal risk of gain and of loss, if you had only to gain two lives, instead of one, you might still wager. But if there were three lives to gain, you would have to play (since you are under the necessity of playing), and you would be imprudent, when you are forced to play, not to chance your life to gain three at a game where there is an equal risk of loss and gain. But there is an eternity of life and happiness. And this being so, if there were an infinity of chances, of which one only would be for you, you would still be right in wagering one to win two, and you would act stupidly, being obliged to play, by refusing to stake one life against three at a game in which out of an infinity of chances there is one for you, if there were an infinity of an infinitely happy life to gain.
But there is here an infinity of an infinitely happy life to gain, a chance of gain against a finite number of chances of
⁶ See e.g. http://www.gutenberg.org/ebooks/18269 for the entire text of Pensées.


loss, and what you stake is finite. It is all divided; where-ever the infinite is and there is not an infinity of chances of loss against that of gain, there is no time to hesitate, you must give all. And thus, when one is forced to play, he must renounce reason to preserve his life, rather than risk it for infinite gain, as likely to happen as the loss of nothingness. For it is no use to say it is uncertain if we will gain, and it is certain that we risk, and that the infinite distance between the certainty of what is staked and the uncertainty of what will be gained, equals the finite good which is certainly staked against the uncertain infinite. It is not so, as every player stakes a certainty to gain an uncertainty, and yet he stakes a finite certainty to gain a finite uncertainty, without transgressing against reason. There is not an infinite distance between the certainty staked and the uncertainty of the gain; that is untrue. In truth, there is an infinity between the certainty of gain and the certainty of loss. But the uncertainty of the gain is proportioned to the certainty of the stake according to the proportion of the chances of gain and loss. Hence it comes that, if there are as many risks on one side as on the other, the course is to play even; and then the certainty of the stake is equal to the uncertainty of the gain, so far is it from fact that there is an infinite distance between them. And so our proposition is of infinite force, when there is the finite to stake in a game where there are equal risks of gain and of loss, and the infinite to gain. This is demonstrable; and if men are capable of any truths, this is one.

Here's a simplified version of Pascal's argument; see [84] for a more thorough analysis. We work under the following assumptions:
(a) God exists with probability p and doesn't exist with probability 1 − p.
(b) (If He exists,) God rewards those who believe in Him with joy in an eternal afterlife measured by a number J. God rewards those who don't believe in Him with joy in an eternal afterlife measured by −A, where A is a positive number representing eternal anguish.
(c) Let B be a number representing the amount of joy experienced in life, living as if you believed God exists.
(d) Let D be a number representing the amount of joy experienced in life, living as if you didn't believe God exists.
Let Y denote the random variable representing the total amount of joy you will experience, both in this life and the afterlife, if yes, you believe God exists, and let N denote the random variable representing the total amount of joy you will experience, both in this life and the afterlife, if no, you do not believe God exists. Find E(Y) and E(N). Pascal argues that it's reasonable to base our belief in God on which number, E(Y) or E(N), is larger. Show that

    E(N) > E(Y)  ⟺  D > p(J + A) + B.

Thus, if p = 0, then your total joy is based strictly on earthly joys and one might as well forget belief in God. However, if p > 0 and J and A are sufficiently large (in fact, Pascal considers J to be infinite), then believing in God is the reasonable option.
5. (cf. [12]) (The birthday problem) What is the expected number of people in a room of n ≥ 2 people who share the same birthday with at least one other person in the room? We assume that a year has exactly 365 days (forget leap years).
(i) Write down a sample space X and the probability set function μ.
(ii) Let f : X → R be the random variable representing the number of people who share the same birthday with at least one other person in the room. Explain why f = ∑_{k=1}^{n} f_k, where f_k = 1 if the k-th person shares the same birthday with at least one other person and f_k = 0 otherwise. Here, f_k is the characteristic function of a set; write down the set explicitly.


(iii) Find E(f).
(iv) What is the smallest number of people in a room needed so that at least two people are expected to share the same birthday? (You'll need a calculator.)
6. (The hat check problem) n people enter a restaurant and their hats are checked in. After dinner the hats are randomly re-distributed back to their owners. What is the expected number of customers that receive their own hats?
(i) Write down a sample space X and the probability set function μ.
(ii) Let f : X → R be the random variable representing the number of people who receive their own hats. Show that f = ∑_{k=1}^{n} f_k, where f_k = 1 if the k-th customer receives his own hat and f_k = 0 otherwise. f_k is the characteristic function of a set; write down the set explicitly.
(iii) Find E(f).

2.3. Properties of additive set functions on semirings

In this section we use the properties of the integrals of functions to derive properties of additive set functions. In particular, we show that Lebesgue measure extends from Iⁿ to define an additive set function on Eⁿ, the ring of elementary figures in Rⁿ, and we study probability models on sequence space. At the end of this section, we give a thorough analysis of the monkey–Shakespeare experiment.

2.3.1. Application of integration I: Properties. For our first application of integration, we derive some properties of finitely additive set functions on semirings. Given an additive set function μ : I → [0, ∞] on a semiring I and a set A ∈ I, by definition of the integral of the I-simple function f = χ_A, we have

(2.8)    μ(A) = ∫ χ_A.

Although this is a definition, it's geometrically obvious for the case I = I¹: the graph of χ_{(a,b]} is a rectangle of height 1 over (a, b], so

    m(a, b] = b − a = ∫_a^b 1 = ∫ χ_{(a,b]}.

In Theorem 2.4 below we give slick proofs of some properties of the set function μ by exploiting the formula (2.8) and the properties of the integral in Theorem 2.2. (Of course, one can prove Theorem 2.4 without integration . . . please feel free to do so!) We remark that since rings and σ-algebras are also semirings, all the properties and definitions that we state for semirings also hold for set functions on rings and σ-algebras.

Properties of additive set functions

Theorem 2.4. Let μ : I → [0, ∞] be additive on a semiring I. Then,
(1) μ is monotone in the sense that if A, B ∈ I, A ⊆ B, then μ(A) ≤ μ(B).
(2) μ is countably superadditive in the sense that if A ∈ I and ⋃_{n=1}^{∞} A_n ⊆ A, where A₁, A₂, . . . ∈ I are pairwise disjoint, then

    ∑_{n=1}^{∞} μ(A_n) ≤ μ(A).


(3) μ is finitely subadditive in the sense that if A ∈ I and A ⊆ ⋃_{n=1}^{N} A_n, where A₁, . . . , A_N ∈ I, then

    μ(A) ≤ ∑_{n=1}^{N} μ(A_n).

(4) μ is subtractive in the sense that if A, B ∈ I with A ⊆ B, B \ A ∈ I, and μ(A) < ∞, then μ(B \ A) = μ(B) − μ(A).
Proof: Figure 2.4 illustrates this theorem.

Figure 2.4. Left: Properties (1) and (4); Middle: (2); Right: (3).

To prove (1), observe that if A ⊆ B are
sets in I, then χ_A ≤ χ_B. Therefore, by monotonicity of the integral,

    μ(A) = ∫ χ_A ≤ ∫ χ_B = μ(B).

Let A ∈ I and assume that ⋃_{n=1}^{∞} A_n ⊆ A, where A₁, A₂, . . . ∈ I are pairwise disjoint. Observe that for any N ∈ N we have ⋃_{n=1}^{N} A_n ⊆ A, so

    ∑_{n=1}^{N} χ_{A_n} ≤ χ_A,

by the sum formula for characteristic functions in Lemma 1.17. Therefore, by linearity and monotonicity of the integral, we see that

    ∑_{n=1}^{N} μ(A_n) = ∫ ∑_{n=1}^{N} χ_{A_n} ≤ ∫ χ_A;

that is,

    ∑_{n=1}^{N} μ(A_n) ≤ μ(A).
Letting N → ∞ proves the superadditivity property.
Let A ∈ I and assume that A ⊆ ⋃_{n=1}^{N} A_n, where A₁, . . . , A_N ∈ I. Then,

    χ_A ≤ ∑_{n=1}^{N} χ_{A_n},

which we leave you to verify. Hence by monotonicity and linearity of the integral, we have

    μ(A) = ∫ χ_A ≤ ∑_{n=1}^{N} ∫ χ_{A_n} = ∑_{n=1}^{N} μ(A_n).

Finally, to prove (4), note that χ_{B\A} = χ_B − χ_A. Integrating both sides, we get μ(B \ A) = μ(B) − μ(A), where we used that μ(A) ≠ ∞ so that μ(B) − μ(A) is well-defined.


We remark that in general we cannot replace N by ∞ in Property (3); see Section 3.2 for counterexamples. We also remark that one could prove Theorem 2.4 using only the properties of additive set functions and semirings, and without using any integration theory; however, the proof isn't as elegant.

2.3.2. Application of integration II: Products. Our second application of integration deals with products of additive set functions. Let μ₁, . . . , μ_N be additive set functions on semirings I₁, . . . , I_N. From Proposition 1.2 we know that the product I₁ × ··· × I_N is a semiring. We define μ : I₁ × ··· × I_N → [0, ∞] by

    μ(A₁ × ··· × A_N) := μ₁(A₁) ··· μ_N(A_N),

for all boxes A₁ × ··· × A_N ∈ I₁ × ··· × I_N. Here's a picture when N = 2:
    A₁ × A₂ ⊆ X₁ × X₂,    μ(A₁ × A₂) = μ₁(A₁) μ₂(A₂).

In the product μ₁(A₁) ··· μ_N(A_N), we use the conventions that 0 · ∞ := 0 and ∞ · 0 := 0 in case there is a 0 and an ∞ in the product. The set function μ is the product of μ₁, . . . , μ_N. The main example to keep in mind is Lebesgue measure m on Iⁿ = I¹ × ··· × I¹, which is just the n-fold product of Lebesgue measure on I¹. We use our integration theory to give a simple proof that μ is additive.

Theorem 2.5. The set function μ : I₁ × ··· × I_N → [0, ∞] is additive; in words, the product of additive set functions is additive.
Proof: This proof is almost word-for-word the same as the proof of Proposition 1.18! For notational simplicity, we prove this result for only two additive set functions, say μ : I → [0, ∞] and ν : J → [0, ∞], where I and J are semirings on sets X and Y, respectively. Let A × B ∈ I × J, and suppose that

    A × B = ⋃_{k=1}^{N} A_k × B_k,

where the A_k × B_k ∈ I × J are pairwise disjoint. By the product and sum formulas for characteristic functions in Lemma 1.17, we see that

    χ_A(x) χ_B(y) = ∑_{k=1}^{N} χ_{A_k}(x) χ_{B_k}(y).

Let us fix x ∈ X, and put a = χ_A(x) and a_k = χ_{A_k}(x) (thus, a and each a_k is either 0 or 1). Then the above equality is just

    a χ_B(y) = ∑_{k=1}^{N} a_k χ_{B_k}(y).

Both sides are J-simple functions, so integrating both sides of this equality, we obtain

    a ν(B) = ∑_{k=1}^{N} a_k ν(B_k),


or after substituting a = χ_A(x) and a_k = χ_{A_k}(x), we get

    ν(B) χ_A(x) = ∑_{k=1}^{N} ν(B_k) χ_{A_k}(x).

If ν(B) and each ν(B_k) is finite, then both sides of this equality are I-simple functions, so we can integrate both sides of the equality, obtaining

    μ(A) ν(B) = ∑_{k=1}^{N} μ(A_k) ν(B_k).

On the other hand, if ν(B) or any ν(B_k) is infinite, then it is straightforward to check that this equality still holds. Thus, μ(A) ν(B) = ∑_{k=1}^{N} μ(A_k) ν(B_k); that is, the product of μ and ν is additive, which proves our result.

2.3.3. Probability set functions on sequence space. We begin by reviewing sequence space from Section 1.3.3. Given a countable number of sample spaces X₁, X₂, . . ., sequence space is the set of all infinite sequences:

    X = {(x₁, x₂, x₃, . . .) ; x_i ∈ X_i for all i} = X₁ × X₂ × X₃ × X₄ × ··· = ∏_{i=1}^{∞} X_i.

X represents the sample space for a countable number of experiments performed in sequence, where X₁ is the sample space of the first experiment, X₂ the sample space for the second experiment, and so on.
Example 2.4. If

    X₁ = X₂ = ··· = Y = {(j, k) ; j, k = 1, . . . , 6},

then X = Y × Y × Y × ··· is a sample space for an infinite sequence of rolls of two dice; e.g.

    (6, 3), (4, 5), (1, 5), . . . .

If X_i = (0, 1] for each i, then X = (0, 1] × (0, 1] × ··· is a sample space for picking an infinite sequence of points from the interval (0, 1].

Assume that we are given probability set functions

    μ₁ : I₁ → [0, 1],  μ₂ : I₂ → [0, 1],  μ₃ : I₃ → [0, 1],  . . . ,

where I_i is a semiring on X_i (thus, X_i ∈ I_i and μ_i(X_i) = 1). For instance, for the dice case in Example 2.4, we can put μ_i = μ₀ : P(Y) → [0, 1] for all i, where

    μ₀(A) = #A / 36    for all A ⊆ Y.

Thus, we are working with fair dice. In the case X_i = (0, 1] in Example 2.4 we can put μ_i = Lebesgue measure on left-half open intervals in (0, 1]. A natural question is: Is there an obvious probability set function on X defined using μ₁, μ₂, μ₃, . . .? Well, it is not so obvious how to assign a probability to an arbitrary subset of X, but for some subsets it's clear; these subsets are the cylinder sets, which we introduced back in Section 1.3.3. Recall that a cylinder set is a subset A ⊆ X that, for some n ∈ N, can be written as

(2.9)    A = A₁ × A₂ × ··· × A_n × X_{n+1} × X_{n+2} × X_{n+3} × ···


for some events A_i ∈ I_i. The event A represents the event that A₁ occurs on the first trial, A₂ occurs on the second trial, . . ., A_n occurs on the n-th trial, not caring what happens after the n-th trial. What is the probability of the event A occurring? To answer this, consider the case when n = 2; then we are asking: What is the probability of the event A₁ occurring on the first trial and A₂ on the second trial (not caring about what happens afterwards)? Just thinking intuitively, without going too much into the details, if we think of μ₁(A₁) as the fraction of times the event A₁ occurs and μ₂(A₂) as the fraction of times the event A₂ occurs, then it would make sense that the product⁷ μ₁(A₁) μ₂(A₂) is the fraction of times the event A₁ followed by A₂ will occur. More generally, with A as above, it would make sense that the

    Probability of A = μ₁(A₁) μ₂(A₂) ··· μ_n(A_n).

Since 0 ≤ μ_i(A_i) ≤ 1 for each i, the product on the right is also in [0, 1]. This discussion motivates the following. Let C denote the collection of all cylinder sets in ∏_{i=1}^{∞} X_i and define μ : C → [0, 1] by

(2.10)    μ(A) := μ₁(A₁) μ₂(A₂) ··· μ_n(A_n)    for A as in (2.9).

The set function μ is called the infinite product of μ₁, μ₂, . . .. This definition is very similar to the definition in Theorem 2.5, where we considered the product of finitely many additive set functions. Note that putting A₁ = X₁, A₂ = X₂, . . . , A_n = X_n in (2.9), we see that X ∈ C, and by definition of μ, we have

(2.11)    μ(X) = μ₁(X₁) μ₂(X₂) ··· μ_n(X_n) = 1 · 1 ··· 1 = 1.
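As a concrete illustration of (2.10) (our own sketch, using the fair two-dice space of Example 2.4), here is how one might compute the probability of the cylinder event "a double is rolled on each of the first three rolls, and we don't care what happens afterwards."

    # Sketch: probability (2.10) of a cylinder event in the fair two-dice sequence space.
    from fractions import Fraction

    def mu0(event):
        # Probability of an event in one roll of two fair dice: #A / 36.
        return Fraction(len(event), 36)

    doubles = {(j, j) for j in range(1, 7)}      # A_1 = A_2 = A_3 = "a double is rolled"
    print(mu0(doubles) ** 3)                     # mu_0(A_1) mu_0(A_2) mu_0(A_3) = 1/216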

The infinite product measure

Proposition 2.6. The infinite product set function μ : C → [0, 1] is a finitely additive probability set function.

Proof: Recall that in Proposition 1.9 we proved that C forms a semiring by reducing the proof to the product of finitely many semirings (Proposition 1.2). In a very similar way, we can prove that μ : C → [0, 1] is finitely additive by reducing it to Theorem 2.5, where we proved that the product of finitely many additive set functions is additive. We shall leave the details to you if you're interested.

⁷ Here, we are actually imposing the condition that the trials A₁, A₂ be independent, which roughly speaking means that knowledge of the occurrence of the first event A₁ does not affect the probability of the second event A₂ occurring. We shall return to independence in Section 4.1.


Figure 2.5. Given an additive set function μ : I → [0, ∞], does μ have an extension to R(I)? Since I ⊆ R(I), it's a natural question to ask if we can extend μ from the semiring I to the generally much larger collection R(I). Theorem 2.7 says yes.

2.3.4. Application of integration III: Extensions. In Theorem 2.7 below we apply integration theory to show that any additive set function on a semiring can be extended to the generated ring; see Figure 2.5. The primary examples are extending Lebesgue measure from Iⁿ, the left-half open boxes, to the elementary figures Eⁿ in Rⁿ, and the infinite product measure from the cylinder sets to the ring generated by the cylinder sets. If μ : I → [0, ∞] is an additive set function on a semiring I, and if R is a ring containing I, then an additive set function ν : R → [0, ∞] is called an extension of μ if ν(A) = μ(A) for all A ∈ I.

The semiring extension theorem

Theorem 2.7. If μ : I → [0, ∞] is an additive set function on a semiring I, then there is a unique additive set function on the ring R(I) generated by I that extends μ. We denote this (unique) set function by μ again and call it the extension of μ to R(I).
Proof: We already know that

(2.12)    μ(A) = ∫ χ_A    for all A ∈ I.

The idea is simply to define the extension by this same formula for any A ∈ R(I)! Of course, we have to prove that (2.12) (i) is defined for each A ∈ R(I), (ii) is additive, and (iii) is the unique additive set function on R(I) that extends μ.
Step 1: Given A ∈ R(I), let's show that the function χ_A is an I-simple function; this implies that ∫ χ_A is defined. In fact, we know by Theorem 1.5 that A = ⋃_n A_n, a finite union where A₁, A₂, . . . ∈ I are pairwise disjoint. Thus, by the sum formula for characteristic functions in Lemma 1.17,

    χ_A = ∑_n χ_{A_n}.

Therefore, χ_A is an I-simple function. In particular, we can define

    μ(A) := ∫ χ_A    for all A ∈ R(I).

Note that this formula is consistent with the formula (2.12) when A ∈ I.
Step 2: Let A ∈ R(I) and assume that A = ⋃_n A_n, a finite union of pairwise disjoint sets A₁, A₂, . . . ∈ R(I). Then by the sum formula for characteristic functions, χ_A = ∑_n χ_{A_n}. Therefore, by linearity of the integral,

(2.13)    μ(A) = ∫ χ_A = ∫ ∑_n χ_{A_n} = ∑_n ∫ χ_{A_n} = ∑_n μ(A_n).


Thus, μ is indeed finitely additive on R(I).
Step 3: Let ν : R(I) → [0, ∞] be finitely additive and assume that ν = μ on I; we shall prove that ν = μ on R(I). Indeed, by Theorem 1.5 we can write A = ⋃_n A_n, a finite union where A₁, A₂, . . . ∈ I are pairwise disjoint. Then by finite additivity, we have

    ν(A) = ∑_n ν(A_n) = ∑_n μ(A_n),

where we used that ν = μ on I. On the other hand, the sum ∑_n μ(A_n) is exactly μ(A) by (2.13). Thus, ν = μ and our proof is complete.

As an easy corollary, we obtain

Extensions of familiar set functions

Corollary 2.8.
(1) For each n, Lebesgue measure m on Iⁿ extends uniquely to an additive set function on the ring of elementary figures Eⁿ.
(2) The Lebesgue–Stieltjes measure μ_f on I¹ of any right-continuous nondecreasing function f : R → R extends uniquely to an additive set function on the ring of elementary figures E¹.
(3) The infinite product of probability set functions, μ : C → [0, 1], on cylinder sets of a sequence space extends uniquely to a finitely additive probability set function μ : R(C) → [0, 1].

Technically speaking, the semiring extension theorem only guarantees that the infinite product measure μ : C → [0, 1] has a unique extension μ : R(C) → [0, ∞]. However, since μ(X) = 1 (we showed this in (2.11)), by monotonicity we have μ(A) ≤ μ(X) = 1 for all A ∈ R(C). Thus, μ : R(C) → [0, 1].

2.3.5. On Monkeys and Shakespeare. Back in Section 1.2.4 we described the monkey–Shakespeare experiment, which we briefly review.⁸ Choose your favorite Shakespeare passage (or any other passage for that matter) and let N be the number of symbols the passage consists of. For example, N = 632 for the sonnet "Shall I compare thee to a summer's day?" Put a monkey in front of a typewriter and let him hit the keyboard N times, remove the paper, put in a new paper, have him hit the keyboard N more times, remove the paper, etc., repeating this process infinitely many times. If we consider a success (= 1) when the monkey types the passage and a failure (= 0) when he doesn't type it, then the sample space for this experiment is the space of Bernoulli sequences Y × Y × Y × ···, where Y = {0, 1}; e.g.
8Cartoon from http://www.sangrea.net/free-cartoons/

92

2. FINITELY ADDITIVE INTEGRATION

fails

fails

fails

fails

success

fails

0 , 0 , 0 , 0 ,1 ,0 ,...

Assume that on any given trial, the probability of a success is a constant p (0, 1) so that the probability of a failure is 1 p. Thus, (Of course, 0 () = 0 and 0 (Y ) = 0 {0, 1} = 1.) For example, assuming that the keyboard can make 100 dierent symbols (for a nice round number) and there are a total of N symbols in your favorite Shakespeare passage, assuming that each symbol is equally likely to be typed, we have p= 1 100
N

0 : P (Y ) [0, 1] is dened by 0 {1} = p , 0 {0} = 1 p.

1 . 102N

Thus, p = 1/100632 = 1/101264 for Shakespeares sonnet 18. Let : C [0, 1] be the innite product of 0 with itself, where C is the semiring of cylinder subsets of Y . Then by the semiring extension theorem, we know that extends to a probability set function : R (C ) [0, 1]. Here is a Question: For each n N, what is the probability that the monkey will type your passage within the rst n pages? Let An Y be the event that the monkey types your passage within the rst n pages. Then Ac n is the event that the monkey does not type your passage in the rst n pages, and hence Ac n = {0} {0} {0} Y Y Y .
(n times)

Thus, Ac n C R (C ) and so, as R (C ) is a ring and X C , we have An = X \ Ac R (C ). In particular, the probability (An ) is dened. Now by denition n of , we have n n (Ac n ) = (0 {0}) = (1 p) . Therefore, by subtractivity, or in words, the
c (An ) = (X \ Ac n ) = (X ) (An ) = 1 (1 p) , n n

For example, consider the situation p = 1/101264 as in Shakespeares sonnet 18. The probability that the monkey will type Sonnet 18 within the rst 1 googol pages (where a googol is by denition 10100 ) is 1 1 1 101264
10100

Probability the passage will be typed within the rst n pages = 1 (1 p) .

Some estimates show that9 this number is approximately 101164 , or 0.000000000000 000000000000 1 . . . .
1164 zeros 9Exercise: Try to nd out how I got this!

2.3. PROPERTIES OF ADDITIVE SET FUNCTIONS ON SEMIRINGS

93

To summarize: It is essentially impossible that the monkey will type Sonnet 18 within the rst 1 googol pages. Lets ask another Question: How many pages must the monkey type in order to have at least a 1% chance of typing Shakespeares sonnet 18? We want to nd n so that 1 n . 1 (1 p) 100 Rearranging and taking logarithms we obtain 99 100 100 n n (1 p) = (1 p) = n log (1 p)1 log 100 99 99 = Thus, the answer is that we need at least n log 100 99 . log(1 p)1

log 100 99 pages. log(1 p)1 We can get an accurate estimate of the right-hand side as follows. We rst use calculus to show that if 0 x 1/2, then10 Taking logarithms, we obtain (1 x)1 e2x . =

log(1 x)1 2x

Hence, noting that log(100/99)/2 = 0.005025 . . . 0.005, it follows that we need more than log 100 5 103 99 pages. 2p p For example, if p = 1/101264 , just to have a 1% chance of typing the entire sonnet 18 of Shakespeare, the monkey must type more than 5 103 = 5 101261 pages, 101264 which is quite a lot of pages. For perspective, the number of atoms in the observable universe is11 approximately 1080 , so the number of pages is around the order of magnitude of 101180 times the number of atoms in the known universe. Heres another Question: How many years must the monkey type in order to have at least a 1% chance of typing Shakespeares sonnet 18? To answer this question we have to make some assumptions about how fast the monkey can type. Lets be overly generous and assume that he can type 10 pages per minute (this is quite fast: 6320 symbols per minute if he types Shakespeares sonnet 18 consisting of 632 symbols!). Then in one year he can type (lets forget leap years for simplicity) 10 pages 60 minutes 24 hour 365 days pages = 5.256 106 . minute 1hour day year year
10Exercise: Try to prove this! 11See http://en.wikipedia.org/wiki/Observable universe

1 1 . 1 log(1 x) 2x

94

2. FINITELY ADDITIVE INTEGRATION

Thus, to have just a 1% chance of producing the desired text (which he has a probability p of producing on a single page) it will take the monkey approximately 109 1 5 103 years (the monkey equation), p 5.256 106 p where, again, we assumed that he can type 10 pages of that text in a minute. So, for example, if p = 1/101264 , it will take approximately 101255 years to have a measly 1% chance of producing sonnet 18. Some say the age of the universe is estimated to be 13.7 109 years12, so it will take the monkey on the order of magnitude of 101245 universe ages to have just a 1% chance of typing Shakespeares sonnet 18! So, basically we can say that within the current estimates of the age of the universe, the monkey doesnt have a chance of typing sonnet 18! Even if every particle in the universe was a monkey, say we had 1080 monkeys to help type, it would still take 101255 /1080 = 101175 years! We will return to the monkeyShakespeare problem in Section 4.1.
Exercises 2.3. 1. Let : R [0, ] be an additive function on a ring R . Given sets A and B in R , prove that (A B ) + (A B ) = (A) + (B ). If you draw a Venn diagram of A B , do you see why this formula is obvious? Warning: Keep in mind that may take the value , and in this case you should never subtract two quantities because you could have a nonsense statement like . Suggestion: First prove that AB + AB = A + B . 2. Let : R [0, ] be an additive function on a ring R . Given sets A, B, C in R , prove that (A B C ) + (A B )+(B C ) + (A C ) = (A) + (B ) + (C ) + (A B C ).

3. (Inclusionexclusion principle) Generalize the previous exercises as follows. Let : R [0, ) be an additive function on a ring R . Given sets A1 , . . . , AN in R , prove that
N N

n=1

An =
n=1

(An )

1i<j N

(Ai Aj )
N

+
1i<j<kN

(Ai Aj Ak ) + (1)N 1

An .
n=1

Taking all the negative terms to the left, show that the resulting formula holds even if : R [0, ] (innity may be in the range of ). 4. Consider the famous passage: To be or not to be, that is the question. from Shakespeares The Tragicall Historie of Hamlet, Prince of Denmarke (or Hamlet for short). For each n N, what is the probability that the monkey will type the passage within the rst n pages? Estimate the number of pages the monkey must type in order to have at least a 1% chance of typing the passage. Choose a page/minute rate that the monkey can type and estimate the number of years it will take the monkey to have at least a 1% chance of typing the passage.
12See http://en.wikipedia.org/wiki/Age of the universe

2.4. BERNOULLIS THEOREM (THE WLLNS) AND EXPECTATIONS

95

5. (The hat check problem) n people enter a restaurant and their hats are checked in. After dinner the hats are randomly re-distributed back to their owners. In this problem we determine the probability that nobody receives their correct hat. (i) Write down a sample space and the corresponding probability measure . (ii) Let Ak be the event that the kth person gets his correct jacket. Using the inclusion-exclusion formula, nd ( n k=1 Ak ), the probability that at least one person gets his own hat. (1)k (iii) Conclude that the answer to our problem is n and hence, k=0 k!
n

lim

probability that n men dont get their correct hats =

1 . e

What a strange place to nd the number e! (iv) Solve the following related problem, called the Probleme du Treize, which rst appeared in the probability book Essay d Analyse sur les Jeux de Hazard [281] written by Pierre R emond de Montmort (16781719) and rst published in 1708 (see [319] for a version in English): A person shues thirteen cards, each of the same suit, and deals them one at a time while saying ace, two, three, . . ., king. If at some point a card is dealt that matches the name he says, he loses; otherwise if he goes through all the cards never dealing the card he says, he wins. What is the probability that he wins? 6. (The cracker jack box problem) There are ve dierent surprises you can get in any given Cracker Jack box: a top, secret decoder, race car, super ball, and a mini frisbee, each equally likely to be found. You buy six Cracker Jack boxes. In this problem we show that the probability of obtaining each of the rst three surprises (top, decoder, 6 3 6 2 6 and car) in these six boxes equals 1 3 4 +3 5 5 . 5 (i) Write down a sample space and the corresponding probability measure . (ii) Solve the problem. Suggestion: The inclusion-exclusion principle might help. 7. (The coupon collectors problem) Let n N. A certain bag of chips contains one of n dierent coupons (say labeled 1, . . . , n), each coupon equally likely to be found. You buy m bags of chips. In this problem we show that the probability that in these m bags you do not have the complete set of n coupons is
n

k=1

(1)k+1

n k

k n

(i) Write down a sample space and the corresponding probability measure . (ii) Solve the problem. Suggestion: Let Ck be the event that coupon k was not found in the m boxes. Note the event that you havent found all n coupons is C1 Cn . The inclusion-exclusion principle might help.

2.4. Bernoullis Theorem (The WLLNs) and expectations In Chapter 9 of Girolamo Cardanos (15011576) book Liber de Ludo Aleae, written around 1565, he speaks of throwing a die and states [296]
. . . in six casts each point should turn up once; but since some will be repeated, it follows that others will not turn up.

In this passage we can see that Cardano understands the concept that in throwing a die, each side of a die should occur about once in every six throws; this is a very rudimentary form Bernoullis theorem, also known as the weak law of large numbers (WLLN) or more commonly as the law of averages. In this section we make this concept more precise through Bernoullis law of large numbers. We also show in a precise way that the expected value of a simple random variable is indeed the average value of the random variable over a large number of experiments.

96

2. FINITELY ADDITIVE INTEGRATION

2.4.1. Bernoullis Theorem and the zero principle. The (Weak) Law of Large Numbers, a name coined by Sim eon-Denis Poisson (17811840) in 1835 [313, p. 478], was rst proved by Jacob Bernoulli (16541705) in the mid 1680s and put in his 1713 treatise Ars conjectandi, which was published eight years after his death. Here is Jacob Bernoullis description of his theorem, taken from [288, p. 1453] (I bolded certain statements near the Jacob Bernoullibottom). (16541705). Similarly, if anyone has observed the weather over a period of
years and has noted how often it was fair or how often rainy, or has repeatedly watched two players and seen how often one or the other was the winner, then on the basis of those observations alone he can determine in what ratio the same result will or will not occur in the future, assuming the same conditions as in the past. This empirical process of determining the number of cases by observation is neither new nor unusual; in chapter 12 and following of Lart de penser the author, a clever and talented man, describes a procedure that is similar, and in our daily lives we can all see the same principle at work. It is also obvious to everyone that it is not sucient to take any single observation as a basis for prediction about some [future] event, but that a large number of observations are required. There have even been instances where a person with no education and without any previous instruction has by some natural instinct discovered quite remarkably that the larger the number of pertinent observations available, the smaller the risk of falling into error. But though we all recognize this to be the case from the very nature of the matter, the scientic proof of this principle is not at all simple, and it is therefore incumbent on me to present it here. To be sure I would feel that I were doing too little if I were to limit myself to proving this one point with which everyone is familiar. Instead there is something more that must be taken into consideration something that has perhaps not yet occurred to anyone. What is still to be investigated is whether by increasing the number of observations we thereby also keep increasing the probability that the recorded proportion of favorable to unfavorable instances will approach the true ratio, so that this probability will nally exceed any desired degree of certainty, or whether the problem has, at it were, an asymptote.

This was not an easy problem, for later in his writings he states
It is this problem that I decided to publish here, after having meditated on it for twenty years.

So, please dont feel discouraged if it takes a week to solve a homework problem . . . at least it isnt twenty years! For a more precise statement of Bernoullis theorem, I certainly couldnt do better than in James Victor Uspenskys (18831947) classic probability book [394, p. 96]:
This chapter will be devoted to one of the most important and beautiful theorems in the theory of probability, discovered by Jacob Bernoulli and published with a proof remarkably rigorous (save for some irrelevant limitations assumed in the proof) in his admirable posthumous

2.4. BERNOULLIS THEOREM (THE WLLNS) AND EXPECTATIONS

97

book Ars conjectandi (1713). This book is the rst attempt at scientic exposition of the theory of probability as a separate branch of mathematical science. If, in n trials, an event E occurs m times, the number m is called the frequency of E in n trials, and the ratio m/n receives the name of relative frequency. Bernoullis theorem reveals an important probability relation between the relative frequency of E and its probability p. Bernoullis theorem. With the probability approaching 1 or certainty as near as we please, we may expect that the relative frequency of an event E in a series of independent trials with constant probability p will dier from that probability by less than any given number > 0, provided the number of trials is taken suciently large.

Let us now make Uspenskys statement of Bernoullis theorem explicit in terms of sample spaces. Let Y = {0, 1}, where 1 (success) is the event E occurring and 0 (failure) is the event E not occurring, and let X := Y be the sample space of an innite sequence of experiments. For concreteness we shall think of 1 (the event E occurring) as heads and 0 (the event E not occurring) as tails; e.g.
H T H T T T H (1, 0, 1, 0, 0, 0, 1, . . .)

Let p be the probability of a head on any given ip; that is, the probability of the event E occurring. On Y the probability set function is Let 0 : P (Y ) [0, 1] , where 0 {1} = p , 0 {0} = 1 p.

: R (C ) [0, 1] be the innite product of 0 with itself, where C is the semiring of cylinder subsets of X and R (C ) is the ring generated by C . Given an innite sequence of coin tosses, x = (x1 , x2 , x3 , . . .) X where each xi is either 1 (a head) or 0 (a tail), and given n N, note that x1 + x2 + x3 + + xn is the total number of heads in n trials, which Uspensky calls the frequency of heads in n trials. Thus, x1 + x2 + x3 + + xn n is the proportion of heads in n trials, which Uspensky calls the relative frequency of heads in n trials. Since the probability of tossing a head on any given trial is p, intuition suggests that if n is large, then we should have x1 + x2 + x3 + + xn p, n where the larger n is the better this approximation should be. To make this approximation idea precise, let > 0 be given. Then x1 + x2 + + xn xX; p < n is the event that the relative frequency of heads in n trials is within of p. In Exercise 1 you will prove that this event belongs to R (C ), the ring generated by the cylinder subsets of Y . In particular, x1 + x2 + + xn p < xX; n

98

2. FINITELY ADDITIVE INTEGRATION

is dened and it represents the probability that in n tosses of a coin, the relative frequency of successes will be within of p. The precise statement of Bernoullis theorem is that the larger n is, the closer this probability should be to 1: Bernoullis theorem Theorem 2.9. For each > 0, x1 + x2 + + xn lim x X ; p < n n

= 1.

This result is also called the weak law of large numbers. The weak part of this title distinguishes this result from the strong law of large numbers that well prove in Section 6.6 and is stronger than the weak law because the strong law implies the weak law but not vice versa. Note that since xX; x1 + + xn p < n =X\ xX; x1 + + xn p , n

Bernoullis Theorem is equivalent to the statement that


n

lim x X ;

x1 + x2 + + xn p n

= 0.

We remark that weve just employed what we call the Zero principle: Instead of proving A = B , transform it into C = 0. Although a simple idea, this principle does come in handy. 2.4.2. Proof of the weak law of large numbers. My favorite way to prove this theorem is to transform it into a problem involving integrals of functions on X instead of measures of points of X and then follow Pafnuty Chebyshevs (18211894) 1867 proof of the law of large numbers [82] (see [358] for a translation). For each i, consider the random variable that observes a head on the ith toss: 1 if xi = 1 fi : X R dened by fi (x) := xi = Pafnuty Chebyshev 0 if xi = 0 (18211894). where x = (x1 , x2 , x3 , . . .). The function fi is really just a C -simple function, for, let Ai X be the event that on the ith toss we ip a head: (2.14) Ai = Y Y Y Y {1} Y C , f i = Ai , the characteristic function of Ai . We let Sn = f 1 + f 2 + + f n , which is the simple random variable that observes the total number of heads in n tosses. Note that Sn (x) x1 + x2 + + xn p = xX; p . xX; n n where the {1} occurs in the ith slot. Then

2.4. BERNOULLIS THEOREM (THE WLLNS) AND EXPECTATIONS

99

We usually simplify set notation and use the Probabilists set notation: {x X ; Property(x)} = {Property} (i.e., drop x). Then with this notation in mind, we write xX; x1 + x2 + + xn p n = Sn p . n

Thus, Bernoullis theorem can be written as . . . Bernoullis theorem: function version Theorem 2.10. For each > 0, Sn p lim n n

= 0.

This is Bernoullis theorem transformed into a statement involving functions on X . Observe that Sn p = |Sn np| n = (Sn np)2 n2 2 . n Hence, we are left to prove that
n

lim (Sn np)2 n2 2 = 0.

This limit turns out to be an easy consequence of Chebyshevs inequality, who stated and then used (a similar) inequality in his proof of the law of large numbers in 1867. We remark that an earlier version (1853) of the inequality is due to Ir en eeJules Bienaym e (17961878), so the inequality is sometimes called the Bienaym e Chebyshev inequality. Yet another name for the inequality is Markovs inequality, named after Andrey Markov (18561922), who was a student in Chebyshevs classes. We shall see several reincarnations of Chebyshevs inequality in the sequel. Chebyshevs inequality, Version I Lemma 2.11. If : R [0, ] is a nitely additive set function on a ring of subsets of a set X with X R , and f is a nonnegative R -simple function, then for any constant a > 0, 1 f d. {f a} a
Proof : Heres a picture showing why Chebyshevs inequality is obvious:

f a { f a}

Its obvious that Area of Rect. Area under f ; that is, a {f a} f.

First of all, from Exercise 1 we know that A := {x X ; f (x) a} R so that (A) is dened. Now, by denition of A, a f on the set A, thus aA f.

100

2. FINITELY ADDITIVE INTEGRATION

By monotonicity of the integral, we obtain aA d and by denition of the integral, aA d = a (A), and hence a (A) f d , which is equivalent to Chebyshevs inequality. f d,

Since Sn np = f1 + + fn npX is a sum of simple functions, and simple functions form an algebra (Proposition 2.1), it follows that (Sn np)2 is a simple function. Thus, by Chebyshevs inequality, we have (Sn np)2 n2 2 1 n2 2 (Sn np)2 .

Although we dont have to, we shall evaluate the right-hand integral using the functions (2.15) Ri := fi p , i = 1, 2, 3, . . . , which are related to the Rademacher functions introduced in 1922 by Hans Rademacher (18921969);13 see the exercises for various properties of these functions. Observe that Sn np = f1 + f2 + + fn np = (f1 p) + (f2 p) + + (fn p) = R1 + R2 + + Rn , so 1 (R1 + R2 + + Rn )2 . n2 2 To evaluate the right-hand side, observe that (Sn np)2 n2 2
n

(R1 + R2 + + Rn )2 = (R1 + + Rn )(R1 + + Rn ) = Hence, 1 n2 2 (R1 + R2 + + Rn )2 = 1 n2 2


n

Ri Rj .
i,j =1

Ri Rj .
i,j =1

Since Ri = fi p and fi = Ai = (Ai ) = p (see the denition of Ai in (2.14)) and 1 = X = (X ) = 1, we see that Ri Rj = (fi p)(fj p) = = = = (fi fj pfi pfj + p2 ) fi fj p fi p f j + p2 1

f i f j p2 p2 + p2 f i f j p2 .

13If p = 1/2, then 2R , i = 1, 2, . . ., are the original Rademacher functions. i

2.4. BERNOULLIS THEOREM (THE WLLNS) AND EXPECTATIONS

101

By denition of fi , we have so f i f j = Ai Aj = Ai Aj fi fj = (Ai Aj ) = p p2 if i = j if i = j,

where you can check that (Ai Aj ) = p2 for i = j using the expression for Ai in (2.14). Thus, p p2 if i = j Ri Rj = 0 if i = j. Finally, we conclude that (2.16) (Sn np)2 n2 2 = This shows that
n

1 2 n 2 1 n2 2 1 n2 2

Ri Rj
i,j =1 n i=1

(p p2 ) p p2 . n2

n (p p2 ) =

lim (Sn np)2 n2 2 = 0,

which completes the proof of Bernoullis Theorem. We now extend Bernoullis theorem to general probability spaces. 2.4.3. Expectations revisited. Let 0 : I [0, 1] be a probability set function on a semiring I of subsets of a sample space Y . Given an I -simple random variable f : Y R, recall that the expectation of f , E (f ) := f , was interpreted as the expected average value of f over a large number of experiments. We can now make this precise! To do so, let X := Y , the sample space for repeating the experiment modeled by Y an innite number of times and let be the innite product of 0 with itself, where C is the cylinder subsets of X generated by the semiring I on each factor Y of X . Given an innite sequence of outcomes x = (x1 , x2 , x3 , . . .) X , note that f (xk ) is the value of f on the k th outcome of the innite sequence of experiments, so f (x1 ) + f (x2 ) + f (x3 ) + + f (xn ) n is, for a given x X , exactly the average value of f observed per experiment during the rst n experiments. Our intuitive notion of expectation suggests that f (x1 ) + f (x2 ) + f (x3 ) + + f (xn ) E (f ), n where the larger n is the better this approximation should be. This is exactly correct as Theorem 2.12 below shows! To set-up this theorem, for each i dene so fi represents the observation of the random variable f on the ith experiment. fi : X R by fi (x1 , x2 , . . .) := f (xi ), : R (C ) [0, 1]

102

2. FINITELY ADDITIVE INTEGRATION

The expectation theorem Theorem 2.12. For each > 0, f1 + f2 + f3 + + fn lim E (f ) < n n

= 1.

The proof of this theorem is very similar to the proof of Bernoullis theorem, so similar in fact, that we leave it as an excellent exercise to test if you understood the proof of Bernoullis theorem; see Problem 3. 2.4.4. Experimental verication of Bernoullis theorem. On pages 109113 of Uspenskys classic probability book [394], he lists eight examples showing the experimental verication of Bernoullis Theorem. My favorite example is Buons needle problem,14 studied by Georges Buon (17071788) who considered the following needle experiment in 1777, quoted from page 112 of [394]: Georges Buon One of the most striking experimental tests of Bernoullis theo(17071788). rem was made in connection with a problem considered for the
rst time by Buon. A board is ruled with a series of equidistant parallel lines, and a very ne needle, which is shorter than the distance between lines, is thrown at random on the board. Denoting by the length of the needle and by h the distance between lines, the probability that the needle will intersect one of the lines (the other possibility is that the needle will be completely contained within the strip between two lines) is found to be 2 . h The remarkable thing about this expression is that it contains the number = 3.14159 . . . expressing the ratio of the circumference of a circle to its diameter. p=

See Problem 5 for a proof that the probability the needle will cross a line is indeed 2/h as stated. Heres a picture of the situation:

Therefore, by Bernoullis Theorem if we throw a needle a large number of times, then the ratio between the number of times the needle crosses a line and the total number of throws should be close to 2/h . One such experiment was conducted by Rudolf Wolf (18161893) between 1849 and 1853. In his experiment, = 36 and h = 45 (in millimeters), so the theoretical probability that a needle crosses a line is 2 72 = = 0.5093 . . . . h 45 He threw the needle 5000 times and it crossed a line 2532 times giving a ratio 2532 = 0.5064, 5000
14Note: Buon is not the same as Buoon, which is a clownish-type person.

2.4. BERNOULLIS THEOREM (THE WLLNS) AND EXPECTATIONS

103

not far o from the true probability! By the way, this approximation gives an probabilistic method to determine ! Indeed, 2 h Hence, Wolfs experiment show that P = = = 2 . PL

2 72 = = 3.15955766 . . . , 0.5064h 0.5064 45 not a bad approximation. In fact, Wolfs original motivation to do his experiment was to nd via Bernoullis theorem; here is what he said (quoted from [324]):
In the well-known work One Million Facts (Lalanne 1843) I found the following note that attracted my highest attention: On a plane surface draw a sequence of parallel, equally spaced straight lines; take an absolutely cylindrical needle of length a, less than the constant interval d that separates the parallels, and drop it randomly a great number of times on the surface covered by the lines. If one counts the total number q of times the needle has been dropped and notes the number p of times the needle crosses with any one of the parallels, the quantity 2aq : pd will express the ratio of circumference and diameter all the more precisely the more trials that have been made.

By the way, the Buoon needle experiment is perhaps the rst Monte Carlo method, which is a general term describing any method that uses random experiments to approximate solutions to mathematical problems. Another experimental verication deals with the Genoese lottery presented back in Example 2.3. In that experiment, we perform a lottery by randomly drawing ve tokens from ninety, and then we sum the ve numbers drawn. Thus, Y = {(y1 , . . . , y5 ) ; yi {1, . . . , 90} , yi = yj for i = j } and we dene f :Y R , where f (y1 , y2 , y3 , y4 , y5 ) := y1 + y2 + y3 + y4 + y5 ,

which represents the sum of the values of the randomly drawn tokens. We computed E (f ) = 227.5. Now perform an innite sequence of lotteries where in each lottery we randomly draw ve tokens from ninety, and in each lottery we sum the ve numbers drawn. From the Expectation Theorem, if we observe a large number n of lotteries, for a given sequence of outcomes (x1 , x2 , . . .), we would expect that the arithmetic mean f (x1 ) + f (x2 ) + + f (xn ) n will not dier much from the expected value 227.5. In the book Wahrscheinlichkeitstheorie, Fehlerausgleichung, Kollektivmalehre [90], Emanuel Czuber (18511925) carefully gathered data of 2, 854 Genoese-type lotteries that operated in Prague between 1754 and 1886. If you look in [394, p. 187] youll nd a very large table listing his results. From this table you can compute that the arithmetic mean of the sum of the ve tokens draw in the 2, 854 lotteries is 227.67. Not far from the theoretical value of 227.5!
Exercises 2.4.

104

2. FINITELY ADDITIVE INTEGRATION

1. Let R be a ring of subsets of a set X such that X R and let f : X R be an R -simple function. Prove that for any R, {f inequality } R where inequality can be , >, or <. 2. (Poissons theorem) Let p1 , p2 , p3 , . . . be real numbers with 0 < pn < 1 for all n. Consider an innite sequence of, say, coin tosses, and assume that the probability of obtaining a head on the nth toss is pn . In Bernoullis Theorem, pn = p for all n, but now we are allowing the probability to change depending on the toss. By mimicking the proof of Bernoullis theorem, prove Sim eon-Denis Poissons (17811840) theorem: For each > 0, x1 + x2 + + xn p1 + p2 + + pn lim x X ; < = 1, n n n where is the innite product of 1 , 2 , 3 , . . . with i assigning the probability pi to heads on the ith toss. 3. (The expectation theorem) In this problem we prove Theorem 2.12. We shall use the notation as explained in Subsection 2.4.3. (i) We rst rewrite the statement of the expectation theorem. For each i dene fi : X R by fi (x1 , x2 , x3 , . . .) := f (xi ) for all (x1 , x2 , x3 , . . .) X. Show that fi : X R is a C -simple random variable. Thus, you must show that fi is a linear combination of characteristic functions of cylinder sets. (ii) Let Sn := f1 + + fn and p = E (f ). Given > 0, prove that (2.17)
n

lim

Sn p < n

=1

lim

Sn p n

= 0.

(iii) Now mimic the proof of Bernoullis theorem to prove the right-hand side of (2.17), which proves the Expectation Theorem. (iv) Let Y = (0, 1] and let 0 : I [0, 1] be Lebesgue measure on I = left-half open intervals in Y . Dene a function f : Y R as follows. If x (0, 1], write x = 0.x1 x2 x3 . . . in decimal (= base 10) notation where if x has two decimal expansions, we take the one that does not terminate. We dene f (x) = x1 = tenth place digit of x; thus, f (0.123 . . .) = 1 and f (0.9) = f (0.8999 . . .) = 8, etc. (a) Show that f : Y R is an I -simple random variable. (b) If we sample numbers in (0, 1] at random and average their tenth digits, what should these averages approach as the number of samples increases? Use the Expectation Theorem to make your answer rigorous. 4. Here is a dierent proof of Bernoullis theorem. Below we use the notation p, , etc. as in the proof of Bernoullis theorem in the main text and we put Sn = f1 + f2 + + fn where the fi s are as before. n k nk p q , where q = 1 p. (i) Assume that if 0 k n, then {Sn = k} = k (You can prove this if you wish; we shall prove it in Theorem 2.13.) Given > 0, prove that n Sn n k nk p q , p+ = k n k=m+1 where m = n(p + ), the largest integer n(p + ). n (ii) Given > 0, expanding peq + qep via the binomial theorem, prove that Sn p+ n en peq + qep
2

(iii) Prove that for any x R, ex x + ex .

2.4. BERNOULLIS THEOREM (THE WLLNS) AND EXPECTATIONS

105

(iv) Prove that peq + qep


n

2 Sn p + e / 4 . n n n (v) Conclude that S p + 0 as n . Assuming S p 0 as n n n , whose proof is analogous, prove Bernoullis Theorem. 5. (Buons needle problem) A oor is ruled with horizontal parallel lines at distances h apart from each other. A needle of length < h, so thin that it can be considered to be a line segment, is thrown on the oor so that it is equally likely to land on any part of the oor. In this problem we consider the question: What is the probability that the needle will intersect one of the lines? To answer this question proceed as follows. (i) For a needle thrown on the oor, let p denote the lowest lying end point of the needle. If the needle lands parallel to the horizonal lines, let p denote the left end point of the needle. Let y , where 0 y < h, be the distance between p and the horizontal line immediately above or level with it. Let be the angle of the needle from the horizonal passing through p; here are a couple examples:

pe

2 2

+ qe

2 2

n2

Taking = /2, show that

y
p

y < sin .

Show that the needle crosses a parallel line if and only if Because of the above considerations, we can think of the rectangle as the sample space for the needle experiment and we can think of the event that the needle crosses a line as the subset A X given by (ii) If X were a nite set, then we would interpret the statement that when the needle is thrown on the oor it is equally likely to land on any part of the oor to mean that the Probability the needle crosses a line = #A/#X . Unfortunately, A and X are innite sets, so the right-hand side is not dened. However, if we interpret # as area we can get a perfectly well-dened right-hand side. Thus, we shall dene the Area A . Probability the needle crosses a line := Area X The area of X is h. Determine the area of A using Figure 2.6 and assuming facts concerning the Riemann integral and its area interpretation. Finally, prove that the probability the needle crosses a line is 2/h.
y

X = {(, y ) ; 0 < , 0 y < h},

A = {(, y ) X ; y < sin }.

y = sin

Figure 2.6. The region A is the portion under the curve y = sin .
The region X is the rectangle containing A (hence X has area h).

106

2. FINITELY ADDITIVE INTEGRATION

6. (Another Buon needle problem) A oor is ruled with horizontal parallel lines at distances h apart from each other. A needle of length > h, so thin that it can be considered to be a line segment, is thrown on the oor so that it is equally likely to land on any part of the oor. Show that the probability the needle will intersect a line is 2 2 (1 cos 0 ) + , h where 0 = arcsin(h/). 7. (Equivalence of Y and [0, 1]) In this problem we relate the innite product measure for Bernoulli sequences to Lebesgue measure on the real line; this is due to Hugo Steinhaus (18871972) [367]. Fix a natural number b 2. Given a number x [0, 1], we can write it in base b, that is, with respect to its b-adic expansion: x1 x2 x3 (2.18) x= + 2 + 3 + , b b b where the xi s are in the set of digits Y = {0, 1, . . . , b 1}. FACT (which you may assume): A number x [0, 1] has a unique b-expansion except for rational numbers 0 < x < 1 that can be written with a denominator a power of b; such numbers have two expansions, one terminating (meaning xi = 0 for all i suciently large in (2.18)) and the other non-terminating. (The non-terminating expansions end with an innite string of (b 1)s.) Dene x1 x2 x3 (2.19) F : Y [0, 1] by F (x) := + 2 + 3 + . b b b (i) Let T Y be the subset of all terminating elements of Y ; that is, (x1 , x2 , . . .) T if xi = 0 for all i suciently large. Prove that T is countable. In particular, by our FACT stated above, it follows that F : Y \ T (0, 1] is a bijection. Thus, Y is uncountable, although a direct proof of this fact isnt hard. (ii) Let a1 , . . . , aN {0, 1, . . . , b 1} and let (2.20) where C Y is the collection of cylinder sets. Prove that F (A \ T ) = 1 (k , k + 1] = bN k k+1 , , N bN b A = {a1 } {aN } Y Y C ,

where k = bN 1 a1 + bN 2 a2 + + aN . Let 0 : P (Y ) [0, 1] assign fair probabilities: 0 (A) = #A/b and let : R (C ) [0, 1] denote the innite product of 0 with itself. Observe that where m denotes Lebesgue measure. (Can you prove this?) Thus, events in Y of the form (2.20) correspond to intervals with b-adic endpoints and the measure of the event equals the Lebesgue measure of the corresponding interval. (iii) Prove that if A R (C ), then F (A \ T ) E 1 = R (I 1 ), in which case (A) = m(F (A \ T )). (iv) Let R1 , R2 , . . . be the Rademacher functions dened in (2.15) with p = 1/2. With b = 2, show that F (x ) =
i=1

(A) = m(F (A \ T )),

Ai 1 = + 2i 2

i=1

Ri (x) 2i

for all x Y

where Ai is given in (2.14). 8. (Weak form of Borels simply normal number theorem cf. Problem 5 in Exercises 4.2.) In this problem we prove a version of Borels simply normal number theorem published in 1909 [51]. Consider Lebesgue measure m : I [0, 1] where I := left-half open intervals in (0, 1]. Let b N with b 2, let Y = {0, 1, . . . , b 1}, and x a digit d Y . You may assume the results in Problem 7.

2.5. DE MOIVRE, LAPLACE AND STIRLING STAR IN THE NORMAL CURVE

107

(i) Dene f :Y R by f (x) = 1 0 if x = d, otherwise,

and for each i, dene fi : (0, 1] R by fi (x) := f (xi ) where xi is the ith digit of x in the b-adic expansion (2.18) in Problem 7 in case x has two expansions we take the non-terminating one. Thus, fi observes if the ith digit is d. Let G : (0, 1] Y \ T be the inverse of the bijective map F : Y \ T (0, 1]. Let Ai = Y Y {d} Y Y with {d} in the ith slot. Observe that fi = Ai G (why?). Prove that fi = F (Ai \T ) . In particular, by (iii) of the previous problem, fi is an I 1 simple function. (ii) Intuitively speaking, since there are a total of b digits, for a randomly picked number x (0, 1], the digit d should appear in (2.18) with frequency 1/b, that is, we should have f1 (x) + f2 (x) + + fn (x) 1 n b Sn 1 < n b for n large.

This is made precise as follows: Given > 0, we claim that (2.21)


n

lim m

= 1,

where Sn := f1 + + fn ,

n < belongs to R (I 1 ) and its measure ap1 in the sense that the set S n b proaches 1 as n . To prove (2.21), prove that

Sn 1 < n b

1 Tn < , n b

where is the innite product measure described in (ii) of Problem 7 and Tn := n i=1 Ai . Now use the Expectation theorem to nish o the proof of (2.21). 9. Let Y , where Y = {0, 1}, be the sample space for an innite sequence of coin tosses with probability p (0, 1) for a head (and 1 p for a tail) on each toss. Let n N and a1 , . . . , an R and put Tn (x) := a1 A1 + + an An , where Ak = Y Y {1} Y with {1} in the kth slot. Express e2Tn : Y R as a simple function and prove the formula
n

e2Tn =
k=1

(1 + 2p eak cosh(ak ))

where cosh t = (ex + ex )/2 is the hyperbolic cosine.

2.5. De Moivre, Laplace and Stirling star in The normal curve The normal curve, the protagonist of this section, is ubiquitous in mathematics and nature. Its use in probability was rst discovered by Abraham de Moivre (16671754) and has since then been one of the cornerstones of probability theory; in fact, without it one may argue that the eld of statistics would not exist. Its said to show up in areas as diverse as daily maximum and minimum air temperature, real estate prices, IQ scores, body temperature, heights of adult people, adult body weight, shoe size, stock market analysis, heart rates, kinetic theory, population dynamics and on and on, and in this section we shall discuss one of the most famous mathematical reasons for its ubiquity, the de MoivreLaplace theorem.

108

2. FINITELY ADDITIVE INTEGRATION

Figure 2.7. In 1914, Albert Blakeslee (18741954) arranged 175 military cadets in a histogram according to their heights for this photo. This photo from [44] is an example of a living histogram and it has been used in many genetics textbooks.

2.5.1. What me, normal? Consider the following experiment: Ask a nonmathematically inclined friend to think about the heights of all the students in the university. Without a doubt, he would imagine the heights of most students clustered around some average value and as the heights move further from this average there would be less and less students at those heights. In essence your friend is assuming a bell curve for heights. See Figure 2.7 for a human bell curve. More generally, the bell curve is likely to show up in situations where most data points tend to cluster around some average value and there are fewer data points at the extremes. The technical name for the bell curve is the normal density function, which is the function (x) = 1 2 2 e
(x)2 22

where is referred to as the mean (or average) and the standard variation; heres a picture when = 5 and for various :
=5 = 1/2 (tall graph) = 1 (middle graph) = 4 (short graph)

Thus, is the famous bell curve and is its center and measures how much spreads from the mean; the smaller is the more concentrated is near while the larger is the more spread is from . See Section 6.5 for more on standard variations. The normal density was rst discovered by Abraham de Moivre (1667 1754) as early as 1721 [411], although its discovery is sometimes attributed to Carl Gauss (17771855) (and hence the normal density is sometimes called the Gaussian density), who wrote a paper on error analysis involving the normal density in 1809, almost 90 years after de Moivre; see [265, 266, 307]. One of the many interesting aspects surrounding the normal density is its explicit dependence on , the ratio of the circumference of a circle to its diameter. Now what does (dealing with circles) have to do with probability? Beats me, but it does! This mysterious relationship was noted by the

Abraham de Moivre (16671754).

2.5. DE MOIVRE, LAPLACE AND STIRLING STAR IN THE NORMAL CURVE

109

1963 Nobel laureate Eugene Wigner (19021995) in a famous paper The Unreasonable Eectiveness of Mathematics in the Natural Sciences [415], who wrote the following concerning the appearance of in the normal density:
There is a story about two friends, who were classmates in high school, talking about their jobs. One of them became a statistician and was working on population trends. He showed a reprint to his former classmate. The reprint started, as usual, with the Gaussian distribution and the statistician explained to his former classmate the meaning of the symbols for the actual population, for the average population, and so on. His classmate was a bit incredulous and was not quite sure whether the statistician was pulling his leg. How can you know that? was his query. And what is this symbol here? Oh, said the statistician, this is pi. What is that? The ratio of the circumference of the circle to its diameter. Well, now you are pushing your joke too far, said the classmate, surely the population has nothing to do with the circumference of the circle. Naturally, we are inclined to smile about the simplicity of the classmates approach. Nevertheless, when I heard this story, I had to admit to an eerie feeling because, surely, the reaction of the classmate betrayed only plain common sense.

Of course, there are many functions that can give a bell-like curve, so why the normal density in particular? In fact, one can give mathematical arguments demonstrating why! One argument is the de MoivreLaplace theorem to be discussed later and another is more of a heuristic argument involving error analysis that well present now and is essentially contained in Robert Adrains (17751843) 1808 paper [2]. It seems like Adrain, one of the few great American mathematicians in the early 19th century, was the rst to publish an error analysis derivation of the normal density in Robert Adrain his work on the method of least squares. Carl Gauss (17771855) produced (17751843). a similar result one year later in 1809 [265, 266]. We remark that you will often nd Adrains derivation called the HerschelMaxwell derivation, after Sir John Herschel (17921871) and James Clerk Maxwell (18311879), who gave similar derivations in 1850 and 1860, respectively [197, Ch. 7]. Heres (our interpretation of) Adrains argument. Say that you want to measure the position of an object, like a star, and place the star at the origin in the plane. We can also think of this as a dart game: The bulls-eye is the star and a measurement of the stars location is like throwing a dart at the dart board, where the dart hits is where we measure the star. We shall speak in this dart language henceforth. We shall also make assumptions as we proceed, but the basic idea is that the darts should cluster near the bulls-eye with fewer and fewer hits far away from the bulls-eye; this, of course, is exactly the situation we want to produce a bell curve. Consider the probability that the dart hits a small region of the dart board dA = dx dy shown here:
dy | | dA

dx

110

2. FINITELY ADDITIVE INTEGRATION

Assume that for some function : R [0, ), given a point (x, y ) R2 , the probability that the x-coordinate of the dart lies in an (innitesimally small) interval of length dx around x is (x) dx and the probability that the y -coordinate of the dart lies in an (innitesimally small) interval of length dy around y is (y ) dy ; this gives the probability that the dart lies in dA as15 (x) dx (y ) dy = (x) (y ) dxdy. In other words, we can consider (x) (y ) as the probability per unit area that the dart hits the point (x, y ); thus, (x) is the probability per unit length that the x-coordinate of the thrown dart is x (with a similar interpretation for (y )). Thus, for example considering (x), given any interval I R the probability the x-coordinate of the dart lands in I is (2.22)
I

(x) dx.

Its reasonable to assume that the probability depends only on the distance from the origin and not on how the axes are oriented; so for example, the probability we hit an area immediately around the point (1, 0) is the same as the probability that we hit areas immediately around (0, 1), (1, 0) and (0, 1). Thus, if we introduce polar coordinates (r, ), where x = r cos and y = r sin , then (x) (y ) is a function depending only on r. Taking the partial derivative of (x) (y ) with respect to , we conclude that16 (x) x y (y ) + (x) (y ) = 0.
x

Recalling that x = r cos and y = r sin we see that

(x) y (y ) + (x) (y ) x = 0. This implies that (2.23) (x) (y ) = . x(x) y(y )

= y and

= x. Hence,

The left-hand side of (2.23) is a function of x only while the right-hand side is a function of y only; thus17 (y ) (x) = = C, x(x) y(y ) for some constant C R. Hence, (x) satises the ordinary dierential equation (x) = C x(x), whose solution is (x) = Ae 2 x for another constant A. Assuming that the probability the dart hits the board a very far distance from the origin should be close to zero, we conclude that C < 0 (for if C = 0, then (x) = A, a constant and if
15Here we are assuming what is called independence of the x and y -coordinates. 16You can also take the partial derivative with respect to x, then with respect to y , noting
C 2

that the partial of r with respect to x is x/r and with respect to y is y/r . . . try it! 17Can you prove that if f (x) = g (y ) for all x, y R, then f (x) = C and g (y ) = C for some constant C ?

2.5. DE MOIVRE, LAPLACE AND STIRLING STAR IN THE NORMAL CURVE


C 2

111

C > 0, then (x) = Ae 2 x would grow exponentially as x gets larger). Hence, we can write C = 1/ 2 for some > 0. Thus, (x) = Ae 22 . In Problems 4 and 6 you will prove that 2 (2.24) or ex dx = 2 0
x2

ex dx =
R

which is called probability integral, (also the Gaussian integral or Laplaces integral) and which is where enters the picture. The proofs of (2.24) in the exercises use nothing of measure theory, just basic Riemann integration, and in Section 6.1 well return to the proofs using Lebesgue integration. Now, the probability that the x-coordinate of the thrown dart is some real number is 1. Thus, recalling the formula (2.22), it follows that

1=
R

(x) dx = A

e 22 dx.

x2

Replacing x with x 2, it follows that 2 1 = A 2 ex dx = A 2. Thus, A = 1/( 2 ) and ta-da, we get the normal density with mean = 0:
x2 1 e 22 . 2 We can interpret this discovery as Random errors distribute themselves normally; this is one way to state the so-called normal law of errors. We remark that Pierre-Simon Laplace (17491827) was probably the rst to try and rigorously evaluate the probability integral; this was done in the historic 1774 paper M emoire sur la probabilit e des causes par les ev` enmens [224]. (Memoir on the probability of the causes of events.)18

(x) =

be the innite product measure on the ring R (C ) generated by the cylinder subsets C of X . The de MoivreLaplace theorem begins with the seemingly innocent
18Although Laplace might have been the rst to evaluate the probability integral, as we said

2.5.2. De Moivre, Stirling, Wallis, and the binomial distribution. For the rest of this section we work in the set-up we had for the weak law of large numbers. Thus, let Y = {0, 1} be the sample space of an experiment with 1 a success, occurring with probability p, and 0 a failure, occurring with probability q := 1 p, and let X := Y be the sample space of an innite sequence of trials of the experiment. Let : R (C ) [0, 1]

earlier, Abraham de Moivre was the rst to discover the normal density in probability; see the articles [5, 97, 307] for the exciting detective-type story! Also, through my own personal researches I think that the rst to indirectly state the value of the probability integral was James Stirling (16921770), who in his famous 1730 book Methodus Dierentialis [390, p. 127] explicitly said that (1/2) = , where is the gamma function. See Theorem 6.3 for the relation of (1/2) to the probability integral.

112

2. FINITELY ADDITIVE INTEGRATION

The binomial distribution Theorem 2.13. In a sequence of n Bernoulli trials, with a probability p of success on each trial and probability q = 1 p of failure, the probability of obtaining exactly k successes is given by b(k ; n, p) := n k nk p q , k 0 k n,

2.5. DE MOIVRE, LAPLACE AND STIRLING STAR IN THE NORMAL CURVE

113

and b(k ; n, p) = 0 otherwise.


Proof : That b(k; n, p) = 0 otherwise is clear, so assume that 0 k n. Let I {1, 2, . . . , n} consist of exactly k elements and let AI X be the set AI = {a1 } {a2 } {an } Y Y Y , where ai = 1 if i I and ai = 0 if i / I . Then as there are exact k of the ai s equal to 1 (and n k of them equal to 0) we have (AI ) = pk (1 p)nk . If A is the event that we get k successes in the rst n trials (without regard to which trials they occur), then A = I AI where the union is over all subsets I {1, 2, . . . , n} consisting of k elements. Since the AI s are pairwise disjoint, (AI ) = pk (1 p)nk for each I , and the number of subsets of {1, 2, . . . , n} consisting of exactly k elements is n , it follows that k (A) =
I

(AI ) = pk (1 p)nk + pk (1 p)nk + + pk (1 p)nk (n k) = n k p (1 p)nk . k


terms

The function b(n, p) : Z R, dened by k b(k ; n, p), is called the binomial mass function.
Example 2.5. Suppose that we ip a fair coin n times. What is the probability that we obtain exactly k heads where 0 k n? By our theorem, the answer is b(k; n, 0.5) = n k 1 2
n

1 2

nk

n 1 . k 2n

In particular, if n = 2k, the probability of half the throws resulting in heads is b(k; 2k, 0.5) = (2k)! 1 2k 1 = . (k!)2 22k k 22k

100! 1 . If anyone can (50!)2 2100 guess what this number equals to (say) three decimal places, youre the next Thomas Fuller19 (17101790). (The answer is 0.079589, accurate up to 6 decimal places.) For instance, if n = 100 and k = 50, the answer is
19 Thomas Fuller, the Virginia calculator, was African and in 1724 at the age of 14, he was shipped to America and sold as a slave. He never learned to read or write but he was a mathematician of the nest caliber. Heres a part of Fullers obituary, from the Columbian Centinel (Boston), Vol. 14, Dec. 29, 1790: He could multiply seven into itself, that product by seven, and the products, so produced, by seven, for seven times. He could give the number of months, days, weeks, hours, minutes, and seconds in any period of time that any person chose to mention, allowing in his calculation for all leap years that happened in the time; he would give the number of poles, yards, feet, inches, and barley-corns in any distance, say the diameter of the earths orbit; and in every calculation he would produce the true answer in less time than ninety-nine men out of a hundred would produce with their pens.

114

2. FINITELY ADDITIVE INTEGRATION

John Wallis (16161703).

As this example shows, although we have a nice formula regarding probabilities for successes, computationally its basically useless! Thus, for any non-trivial application its important to be able to approximate binomial coefcients. De Moivre was the rst to give a useful approximation for b(k ; 2k, 0.5) using Stirlings formula; well get to this later. For the moment, we want to mention a simpler formula that does the job called Wallis formula, named after John Wallis (16161703) who proved it in 1656, and is given by 2n 2 2 4 4 6 6 8 8 10 10 2n = = . 2 2n 1 2n + 1 1 3 3 5 5 7 7 9 9 11 n=1

Here, the innite product on the right-hand side of /2 is, by denition, 2n 2n := lim n 2 n 1 2 n +1 n=1 = lim
n n

k=1

2k 2k 2k 1 2k + 1 .

The proof of Wallis formula is very elementary, just using basic Riemann integration techniques, so we will leave its proof as a must-do exercise (see Problem 3); if you are willing to wait, well also prove Wallis formula in Section 6.1 using Lebesgues theory. We can write Wallis formula as (2.25) 1 = lim n n
n k=1

2 2 4 4 6 6 2n 2n 1 3 3 5 5 7 2n 1 2n + 1

1 2 4 (2n) 2k = lim . 2k 1 n n 1 3 (2n 1) 4 3


2

Indeed, observe that Wallis rst formula can be written as = lim n 2 so that = lim
n

2 1
n k=1

2n 2n 1

1 2n + 1 1

,
n

2 2n + 1

1 2k = lim 2k 1 n n

1 + 1/2n k=1

2k . 2k 1

Now using that 1 + 1/2n 1 as n implies Wallis second formula (2.25). Wallis formula (2.25) answers the question: What is the ratio of the even numbers 2, 4, . . . , 2n and the odd numbers 1, 3, . . . , 2n 1? The answer is: Approximately n. Now what does (dealing with circles) have to do with ratios of even and odd numbers? Beats me, but it does! Now back to our problem, recall that wed like to estimate bk := b(k ; 2k, 0.5) = 2k 1 (2k )! 1 = . 2 k (k !)2 22k k 2

Using the denition of the factorial and some algebra, we leave you to check that bk = 1 3 5 (2k 1) . 2 4 (2k )

Finally, recalling Wallis formula (2.25), we see that 1 bk = lim , which implies that lim = 1. k k (1/ k ) k bk

2.5. DE MOIVRE, LAPLACE AND STIRLING STAR IN THE NORMAL CURVE

115

1 bk . k Here, given any two sequences {ck }, {dk }, if we have ck = 1, lim k dk we write ck dk and we say that the sequences are asymptotically equal or asymptotically equivalent; basically this means that ck and dk are roughly the same size as k . Thus, bk is asymptotically equal to 1/ k . In particular, taking k = 50 we obtain 1 = 0.079788 . . . . b50 50 The true answer b50 = 0.079589 . . ., pretty close! 2.5.3. de MoivreLaplace theorem. In the last subsection we found a formula for b(k ; 2k, 0.5). Now what if we need to approximate b(k ; n, 0.5) when n = 2k or what if we have a unfair coin so we need to approximate b(k ; n, p) where p = 1/2? In such cases, Wallis formula doesnt help. However, there is an amazing relationship between b(k ; n, p) and the normal density, the de MoivreLaplace theorem, which we shall explain. Figure 2.8 shows a graph of k b(k ; n, 0.5) for various n while Figure 2.9 shows a graph of k b(k ; n, 0.75) for various n.

Because of this limit, we write

Figure 2.8. p = 0.5, n = 10 (left), n = 100 (middle), n = 200 (right).

Figure 2.9. p = 0.75, n = 10 (left), n = 100 (middle), n = 200 (right). Closely scrutinizing the graphs in both cases, its clear that the graph of k b(k ; n, p) (where p = 0.5 or 0.75) looks approximately normal with mean at np. Therefore, we conjecture that for general p and large n, we have (2.26) n k nk p q is approximately normal with mean np, k

116

2. FINITELY ADDITIVE INTEGRATION

and with some standard deviation depending on n. To make this approximately normal statement precise, we recast the binomial in terms of random variables. Recall our set up is the sample space X = Y for an innite sequence of Bernoulli trials Y = {0, 1} with 1 occurring with probability p and 0 with probability q := 1 p, and denotes the innite product measure on the ring R (C ) generated by the cylinder subsets C of X . Given i N, let fi : X R be the simple random variable observing whether or not we are successful on the ith toss: fi (x) := xi = 1 0 if xi = 1 if xi = 0

for all x = (x1 , x2 , x3 , . . .) X . Then given n N,

is the simple random variable giving the number of successes on the rst n trials. In particular, Sn = k is exactly the probability of k successes in the rst n trials, so we can recast the conjecture (2.26) as Sn = k is approximately normal with mean np. The following theorem is a precise sense in which this conjecture holds: De MoivreLaplace theorem Theorem 2.14. Uniformly for x R, we have x (tnp)2 1 lim Sn x e 2npq dt n 2 npq

Sn := f1 + f2 + + fn

= 0.

In other words, given any > 0 there is an N such that for all n N , 1 Sn x 2 npq
x

(tnp)2 2npq

dt < holds for all x R.

For this reason, its often said that de MoivreLaplace: For large n, the sum Sn is approximately normal with mean np and standard variation npq . There are other ways to rewrite the de MoivreLaplace theorem (see Problem 7); heres a rather popular one: Uniformly for x R, we have (2.27)
n

lim

Sn x n

1 2 pq/n

e 2pq/n dt

(tp)2

= 0.

Since Sn /n is the simple random variable giving the average number of successes on the rst n trials, (2.27) can be interpreted as: de MoivreLaplace: For large n, the average Sn /n is approximately normal with mean p and standard variation pq/n.

The form of the de MoivreLaplace theorem as it is usually proved is the following version: Uniformly for x R, we have (2.28)
n

lim

Sn np x pqn

1 = 2

e 2 dt .

t2

2.5. DE MOIVRE, LAPLACE AND STIRLING STAR IN THE NORMAL CURVE

117

In other words, a certain rescaling of the sum Sn is approximately normal with mean 0 and standard variation 1. In Problem 7 you will prove that the formula in Theorem 2.14, the formula (2.27) and the formula (2.28) are equivalent. The original method to verify the conjecture (2.26) and prove the de Moivre n! Laplace theorem 2.14 was to estimate the binomial coecient n k = k!(nk)! , which means to estimate the factorial. Such an estimate was rst discovered by Abraham de Moivre (16671754) around 1721, although the formula itself is called Stirlings formula, after James Stirling20 (16921770) who published it in his most famous work Methodus Dierentialis [390] in 1730. This formula is the following asymptotic formula for the factorial function: n n n! 2n (2.29) = 2n nn en . e Recall that means that the ratio of n! and 2n nn en approaches unity as n : n! lim = 1. n! 2n nn en means that n 2n nn en Now what do and e have to do with multiplying all the integers from 1 to n? Beats me, but they do! You will prove Stirlings formula in Problem 5. Stirlings formula can now be used to prove the de MoivreLaplace theorem, and this is done in many elementary probability textbooks such as [88, p. 251]. Unfortunately, the proof is basically an exercise in dexterity and patience using Stirlings formula and isnt so enlightening. We will prove the de MoivreLaplace theorem using the modern technique of characteristic functions in Section 8.7. 2.5.4. Some remarks on de MoivreLaplace. We remark that de Moivre essentially proved the de MoivreLaplace theorem, although he did not write the 2 formula as an integral involving ex /2 , instead opting to write the integral as an innite series. The version with the explicit normal density is basically found in book 2, Chapter 3, of Laplaces 1812 book Th eorie analytique des probabilit es [222] starting on page 280 of the Oeuvres compltes de Laplace (1878). However, the de MoivreLaplace theorem actually rst appeared in Laplaces work M emoire sur les approximations des formules qui sont fonctions de tr` es grands nombres, et sur leur application aux probabilit es, published in 1810; this paper forever established the normal densitys place in probability theory as an approximation tool. Remarking on Laplace, Isaac Todhunter (18201884) in his 1865 book [386, p. 465] wrote on the whole the Theory of Probability is more indebted to him than to any other mathematician. Lets take a closer, abstract, look at the de MoivreLaplace theorem. Consider each of the random variables f1 , f2 , f3 , . . . as representing certain unrelated observations concerning an experiment and Sn as the cumulative eect of the rst n observations. As n , the cumulative eect Sn describes the overall eect of the random observations. We ask: Is it possible to describe this overall eect in an orderly way? Put in this way, the answer seems of such overwhelming diculty as to be immune to the most powerful weapons Pierre-Simon found in the arsenal of mathematical analysis. [1, Preface]. However, the de Laplace MoivreLaplace theorem answers this question with a resounding yes. So (17491827).
20No known portraits of Stirling seem to exist.

118

2. FINITELY ADDITIVE INTEGRATION

struck was de Moivre by the signicance of this revelation, he wrote [100, p. 251-52]
And thus in all Cases it will be found, that altho Chance produces Irregularities, still the Odds will be innitely great, that in process of Time, those Irregularities will bear no proportion to the recurrency of that Order which naturally results from Original Design.

and a page later he added,


Again, as it is thus demonstrable that there are, in the constitution of things, certain Laws according to which Events happen, it is no less evident from Observation, that those Laws serve to wise, useful and benecent purposes; to preserve the stedfast Order of the Universe, to propagate the several Species of Beings, and furnish to the sentient Kind such degrees of happiness as are suited to their State. But such Laws, as well as the original Design and Purpose of their Establishment, must all be from without : the Inertia of matter, and the nature of all created Beings, rendering it impossible that any thing should modify its own essence, or give to itself, or to any thing else, an original determination or propensity. And hence, if we blind not ourselves with metaphysical dust, we shall be led, by a short and obvious way, to the acknowledgement of the great MAKER and GOVERNOUR of all; Himself all-wise, all powerful and good.

We remark that in Section 8.7 we will prove the central limit theorem, an extension of the de MoivreLaplace theorem, which describes under general hypotheses the behavior of the cumulative eect of random independent factors. We end this section with an application of the de MoivreLaplace theorem.
Example 2.6. (The return of M er e, cf. [315]) Recall from Subsection 1.5.5 that M er e believed that in repeatedly throwing two dice, in order to have a better than 50-50 chance of getting a double six, one only needed to throw the dice 24 times; the true answer, as we found, was 25. Lets say that M er e was stubborn and insisted on 24 as the correct number and he conducted the following experiment: He rolls two dice 24 times; he wins if he gets a double six at least once during those 24 throws and loses otherwise. We know that the probability of success is p = 1 (35/36)24 = 0.4914 . . .. M er e is interested in the following question: What is the probability of winning more often than losing in n trials of the experiment? Let X = Y , where Y = {0, 1}, the sample space for repeating the dice game an innite number of times, where the probability of winning is p = 0.4914 for any particular game. Fixing a large n N, Sn is the total number of wins in n games, so we are interested in the probability that Sn is more than half of n; that is, we want to know n < Sn . 2 In order to use the de MoivreLaplace theorem, we rewrite this as follows (recall that q = 1 p): n < Sn 2 = 1 Sn 1 1 2 npq n 2
n/2

(tnp)2 2npq

dt.

2.5. DE MOIVRE, LAPLACE AND STIRLING STAR IN THE NORMAL CURVE

119

Using this formula,21 one can nd the approximate probability that M er e wins more than half his games. For example, if n = 500, the probability in question is only .35. Exercises 2.5. 1. Prove that

(n + 1)p k b(k; n, p) =1+ . b(k 1; n, p) kq Let m = (n + 1)p, the largest integer not greater than (n + 1)p (the oor of (n + 1)p). Prove that b(k; n, p) in increasing with k for 0 k m and decreasing with k for m k n; the maximum term b(m; n, p) is called the central term. When (n + 1)p is an integer, prove that b(m; n, p) = b(m 1; n, p). 2. In this problem we evaluate some integrals that will be used in future problems. (a) As far as I can tell, all standard proofs of Wallis formula use one of four types of integrals: Prove that for any real number > 1/2, we have
0

(1 + x2 ) dx =
0

(1 x2 ) 2 dx =

cos22 d =
0 /2

sin22 d.

(b) We shall focus on the sine integral. Put S = 0 S1 , S0 = , (2.30) S+1 = +1 2 (c) Prove that for any n N, S2n = 1 3 5 (2n 1) , 2 4 6 (2n) 2

sin x dx. Prove that S1 = 1.

S2n+1 =

3. (Elementary proof of Wallis formula) In this problem we give Eulers proof (made a little more rigorous) of Wallis formula, found in Chapter 9, De evolutione integralium per producta innita, of Eulers calculus textbook Institutionum calculi integralis volumen primum (Foundations of Integral Calculus, volume 1) [124] (cf. [341, p. 153]). We also show how Wallis formula implies the probability integral. 1 (i) In Eulers book, he uses the integrals 0 t 2 dx to derive Wallis formula. Taking t = sin x transforms Eulers integrals into S := 0 sin x dx. (ii) Show that S2n+2 S2n+1 S2n . Now replace the S s here with their expressions in Part (c) of Problem 2, and show that 2n + 1 2Wn 1, 2n + 2 where 2 2 4 4 6 6 (2n)(2n) , Wn = 1 3 3 5 5 (2n 1)(2n + 1) Conclude that lim Wn = /2, which is exactly Wallis product. 4. (Elementary proof of Wallis = probability integral) In this problem we show how Wallis formula implies the probability integral. (This is basically an exercise in Bourbakis book [57, p. 127].) (i) Show that for all x R, 2 1 1 x 2 e x . 1 + x2
21 Technically speaking we should be employing the continuity correction (if you know what this is), but we wont worry about this. Also, we have intentionally left out a discussion of the precise error in approximating {Sn x} with the integral of the normal density; all we know is the error vanishes uniformly in x as n . The somewhat arbitrary golden rule of 30 says that as long as n 30, the approximation should be good enough. See [103] for a discussion of such matters including when the golden rule of 30 fails. 1t /2

2 4 (2n) . 1 3 5 (2n + 1)

120

2. FINITELY ADDITIVE INTEGRATION

(ii) Conclude that


1 0

(1 x2 )n dx

enx dx

(1 + x2 )n dx,

and from these inequalities, Problem 2, and Wallis formula, derive the probability integral. 5. (Elementary proof of Wallis = Stirlings formula) Many standard proofs of 1 Stirlings formula rst prove de Moivres result that n! B nn+ 2 en , then nd B by Wallis formula (or the probability integral). Heres one such proof. (i) For any x [0, 1), prove that 0 log(1 + x) x x2 2 x3 . 3

Suggestion: Remember any facts about alternating series? 1 log n. Prove that (ii) Dene an = log(n!) + n n + 2 an an+1 = n+ 1 2 log 1 + 1 n 1,

. then using (i), prove that for n 2, |an an+1 | constant n2 1 ( a a ) exists and use this to prove that limn an (iii) Show that limn n k k+1 k=1 n n+ 1 2 e for some constant B . exists. Use this fact to prove that n! B n (iv) Show that Wallis formula can be written as 22n (n!)2 lim = , n 2 2n(2n)! then use this to show that B = 2 . 6. (Stieltjes method cf. [418, p. 272]) In this problem we give Thomas Stieltjes (1856 1894) computation of the probability integral in the two-page (but ingenious) 1890 2 2 paper Note Sur lint egrale 0 eu du [369]. Dene Tn = 0 xn ex dx; what we want is T0 . 1 (i) Prove that Tn = n Tn2 for n 2 and then prove that for n 0, 2 T2n = n! 1 3 5 (2n 1) 2 4 (2n) and T2n+1 = n! . 2

(ii) Stieltjes brilliant idea is the following identity: For all n N, we have
2 Tn < Tn1 Tn+1 .

Prove this using Stieltjes (ingenious) trick: Consider the polynomial p(t) = at2 + bt + c, where a = Tn1 , b = 2Tn , c = Tn+1 , and show that p(t) > 0 for all t R (In particular, p(t) does not have any real roots.) Subhint: Show that p (t ) =
0

xn1 (x + t)2 ex dx.

(iii) Using (i) and (ii) show that 2n + 1 2 2n + 1 2 T2 T2n < T2n1 T2n+1 . n+1 < 2 2 (iv) Using (i) and (iii), show that 2n + 1 3 2 1 2 2 2n 1 2 T0 < 1 < 2(2n + 1) . 2n 4 2 2n (v) Finally, use Wallis formula on (iv) to determine the probability integral. 7. Prove that the statement in Theorem 2.14, in the formula (2.27), and in the formula (2.28), are equivalent.

2.5. DE MOIVRE, LAPLACE AND STIRLING STAR IN THE NORMAL CURVE

121

8. Prove the following result: For any a, b [, ] with a < b, we have

b x2 1 Sn np b = e 2 dx. lim a < n npq 2 a 9. (De MoivreLaplace = Bernoulli) In this problem we show that the de Moivre Laplace theorem implies Bernoullis theorem. To do so, let > 0 and show that for any xed, but arbitrary, r > 0,

b x2 1 Sn np b = lim a e 2 dx. n npq 2 a Prove the same result holds when any on the left is replaced with <; e.g. prove

Sn np <r npq Sn np <r npq

Sn p < n Sn p < . n

for all n suciently large. This implies that

Use the de MoivreLaplace theorem on the left-hand side to prove Bernoullis theorem. 10. (Tossing a fair coin [250]) If you toss a fair coin many times, you might think that in the long run, the number of heads and tails even out. This is denitely false as you will prove in this problem. To see this, prove that if Sn is the total number of heads thrown in n tosses, then Dn := 2Sn n is the dierence in heads and tails thrown in n tosses. Next, with p = q = 1/2, answer the following questions using the de Moivre Laplace theorem (or its equivalent formulations youll need a good calculator for (i) and (iii)): (i) In 500 throws, what is the probability that the absolute value of the dierence between the number of heads and tails is no more than 20? (ii) Given any r > 0, prove that
n

lim |Dn | r = 0.

Interpret this probabilistically. (iii) How many tosses are needed to ensure that |Dn | 2 holds with probability 99%?

Remarks
2.2 : Another way to see that expectation can be interpreted as expected gain is from the viewpoint of weighted averages. Recall that given N numbers x1 , . . . , xN and N weights w1 , . . . , wN , nonnegative numbers that sum to one, we dene the weighted average (or weighted mean) as the number The xi s with larger weights contribute more to the sum than the xi s with smaller weights. Such weighted averages appear on class syllabi, at least from those classes that give grades. A typical class might assign grades as follows: Homework is 20%, Midterm is 30%, and the Final is 50% of your grade. Thus, if you scored an 80 on Homework, an 89 on the Midterm, and a 95 on the Final, your semester score is 80 .2 + 89 .3 + 95 .5 = 90.2. Your nal exam score helped to boost your semester score above a ninety even though your other scores were below ninety. Since the nal was weighed more heavily than the other grades, this professor believed that the nal exam best measured the understanding of the course. Weighted averages applied to grades combine dierent grades throughout the semester to give a rightful grade. More generally, weighted averages give a rightful x 1 w1 + + x N wN .

122

2. FINITELY ADDITIVE INTEGRATION

common value to the xk s taking into account that some xk s are judged more important than others. Any case, back to expected gain, let (X, R , ) be a probability eld and consider a simple random variable
N

f=
k=1

ak Ak ,

a k R,

where A1 , . . . , AN R are pairwise disjoint. Suppose that f represents the gain of a gambler; that is, a1 is the gain if the event A1 occurs, a2 is the gain if the event A2 occurs, and so forth. If we put pk = (Ak ), k = 1, 2, . . . , N , then E (f ) = f = a1 p 1 + a2 p 2 + + aN p N
N

=
k=1

(gain when Ak occurs) (probability Ak occurs).

Thus, E (f ) is a weighted average, where the gain ak is weighted according to its probability pk of occurring. Knowing that weighted averages correspond to a rightful value, in this sense we can interpret E (f ) as the gamblers rightful gain. 2.4 : For a history of the law of large numbers, see [350]. 2.5 : The book The life and times of the central limit theorem [1] is a great book for history on the de MoivreLaplace theorem and its generalization, the central limit theorem. If youre interested in the normal law of errors, see e.g. [96, 365]. Many books go over the HerschelMaxwell (really Adrain) derivation of the normal law; a few of the books that do so are [197, Ch. 7], [163, p. 209] and [357, p. 66]. As mentioned in 2.5, it was Abraham de Moivre who rst derived Stirlings formula. He did this in a supplement to his 1730 paper Miscellanea Analytica called Approximatio ad Summan Terminorum Binomii (a+b)n in Seriem expansi, dated Nov. 12, 1733 (reproduced in English in The Doctrine of Chances [100]). In this paper he derived Stirlings formula with the constant 2 replaced by a non-explicit constant: n n , n! B n e where 1 1 1 1 B e1 12 + 360 1260 + 1680 = 2.507399 . . . ; for comparison, 2 = 2.506628 . . .. Unfortunately, de Moivre wasnt able to determine B explicitly, which is where Stirling enters the picture [100, p. 243-44]22 : It is now a dozen years or more since I found what follow . . . When I rst began that inquiry, I contented myself to determine at large the Value of B , which was done by the addition of some Terms of the above-written Series; but as I perceived that it converged but slowly, and seeing at the same time that what I had done answered my purpose tolerably well, I desisted from proceeding farther till my worthy and learned Friend Mr. James Stirling, who had applied himself after me to that inquiry, found that the Quantity B did denote the Squareroot of the Circumference of a Circle whose Radius is Unity, so that if that Circumference be called c, the Ratio of the middle Term to the Sum of all the Terms will be expressed by 2 . nc But altho it be not necessary to know what relation the number B may have to the Circumference of the Circle, provided its value be attained, either by pursuing the Logarithmic Series before mentioned, or any other way; yet I own with pleasure that this discovery,
22

Page 143 of the ebook found at http://www.ibiblio.org/chance/.

2.5. DE MOIVRE, LAPLACE AND STIRLING STAR IN THE NORMAL CURVE

123

besides that it has saved trouble, has spread a singular Elegancy on the Solution. Actually, in the preface to Stirlings book [390, p. 18], he admits that de Moivre found the formula rst: The problem of nding the middle coecient in a very large power of the binomial had been solved by de Moivre some years before I considered it. We mentioned that it was Laplace who gave the rst proof of the probability integral. Quoting from Stiglers translation [370, p. 367], heres the passage from the 1774 paper M emoire sur la probabilit e des causes par les ev` enmens [224] where Laplace derives the probability integral: Let [(p + q )3 /2pq ]zz = ln , and we will have23 2qp (p + q )3 2dz exp zz = 2pq (p + q )2 d . ln

The number can here have any value between 0 and 1, and, supposing the integral begins at = 1, we need its value at = 0. This may be determined using the following theorem (see M. Eulers Calcul int egral). Supposing the integral goes from = 0 to = 1, we have n d (1 2i ) 1 n+i d , = 2 i i ( n + 1) 2 (1 )

whatever be n and i. Supposing n = 0 and i is innitely small, we will have (1 2i )/(2i) = ln , because the numerator and the denominator of this quantity become zero when i = 0, and if we dierentiate them both, regarding i alone as variable, we will have (1 2i )/(2i) = ln , therefore 1 2i = 2i ln . Under these conditions we will thus have n d (1 2i ) n+i d = (1 2i ) 2i d ln 2i d 1 ; = i2 ln

Thus,

d = 2i ln supposing the integral is from = 0 to = 1. In our case, however, the integral is from = 1 to = 0, and we will have d = . 2i ln 2 dz exp (p + q )3 zz 2pq pq 2 = . (p + q )3/2

Therefore

= 2, this last equation can be written, in modern notation, 2 2 1 2 ez dz = 2 = . ez dz = 2 2 0 0 Of course, Laplaces argument is not quite rigorous. Later on, in his 1781 paper M emoirs sur les probabilit es published in M emoirs de lAcademie royale des Sciences de Paris, Laplace gives a completely rigorous derivation of the probability integral using double integrals; see Section 7.3 if youre interested. Note that if we set
23

( p+ q ) pq

The limits on the left-hand integral are from 0 to .

124

2. FINITELY ADDITIVE INTEGRATION

Figure 2.10. Three Galton boards.


The normal density can also be seen as emerging from the binomial coecients via the Galton board, named after Francis Galton (18221911). Imagine a vertically placed board with many regularly spaced pin nailed into the board. We then drop a large number of tiny balls from the top which hit the pins and bounce left and right eventually landing in bins at the bottom of the board; see Figure 2.10.24 Heres Galton himself describing how the outline of the balls in the bins at the bottom form the normal density (see also [306]): [147, p. 64] The shot passes through the funnel and issuing from its narrow end, scampers deviously down through the pins in a curious and interesting way ; each of them darting a step to the right or left, as the case may be, every time it strikes a pin. The pins are disposed in a quincunx fashion, so that every descending shot strikes against a pin in each successive row. The cascade issuing from the funnel broadens as it descends, and, at length, every shot nds itself caught in a compartment immediately after freeing itself from the last row of pins. The outline of the columns of shot that accumulate in the successive compartments approximates to the Curve of Frequency (Fig. 3, p. 38), and is closely of the same shape however often the experiment is repeated. The outline of the columns would become more nearly identical with the Normal Curve of Frequency, if the rows of pins were much more numerous, the shot smaller, and the compartments narrower; also if a larger quantity of shot was used. To see mathematically why the normal density should be obtained, imagine each row as an experiment, whether a ball bounces to the left (say a 0) or the right (say a 1), each with probability 1/2. Thus, if there are n rows, the path of the ball can be described by an n-tuple of 0s and 1s, which is just a sequence of Bernoulli trials!

24 Figures 7, 8 and 9 from [147, p. 63]. A demonstration of the Galton board can be found at http://www.jcu.edu/math/isep/Quincunx/Quincunx.html

Part 2

Countable additivity

CHAPTER 3

Measure and probability: countable additivity


We intend to attach to each bounded set a positive number or zero called its measure and satisfying the following conditions: 1) There are sets whose measure is not zero. 2) Two congruent sets have the same measure. 3) The measure of the union of a nite number or an innite countable number of sets that are pairwise disjoint is the sum of the measures of the sets. We solve this problem of measure for those sets that we call measurable. In the introduction to Chapter I (The Measure of a Set) of Henri Lebesgues (18751941) 1902 thesis [232].

3.1. Introduction: What is a measurable set? It turns out that there are many answers to this very simple question! 3.1.1. Answer #1 by Lebesgue. Building on the researches of, for instance, Emile Borel (18711956) and Camille Jordan (18381922), Lebesgues answer to the question What is a measurable set? is given in his rst paper on integration theory [231]:
Let us consider a set of points of (a, b); one can enclose in an innite number of ways these points in an innite number of intervals; the inmum of the sum of the lengths of the intervals is the measure of the set. A set E is said to be measurable if its measure together with that of the set of points not forming E gives the measure of (a, b).

This answer is somewhat terse, so to grasp what Lebesgue is saying, lets give the exact same denition of measurability, but explained in more detail from page 182 of his book [239]. Let A be a subset of an interval (a, b). Heres how he denes the measure m(A) of A:
Enclose A in a nite or denumerably innite number of intervals, and let l1 , l2 , . . . be the length of these intervals. We obviously wish to have m(A) l1 + l2 + .

If we look for the greatest lower bound of the second member for all possible systems of intervals that cover A, this bound will be an upper bound of m(A). For this reason we represent it by m (A), and we have m(A) m (A). (3)

If (a, b) \ A is the set of points of the interval (a, b) that do not belong to A, we have similarly m((a, b) \ A) m ((a, b) \ A).
127

128

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY

Now we certainly wish to have m(A) + m((a, b) \ A) = m(a, b) = b a; and hence we must have m(A) b a m ((a, b) \ A). (4)

The inequalities (3) and (4) give us upper and lower bounds for m(A). One can easily see that these two inequalities are never contradictory. When the lower and upper bounds for A are equal, m(A) is dened, and we say that A is measurable.

Lets take a closer look at what Lebesgue is saying. Let A R, not worrying for the moment on whether A (a, b); for example, A could be unbounded. Consider his denition of m (A). Lets cover A by countably many intervals {In }, so that

In .
n=1

Because of the various assortments of intervals available, for concreteness we assume all the intervals In are in I 1 . Given any such I I 1 , recall that m(I ) is the usual length of I . Now, since A n=1 In , the sum n=1 m(In ) should be greater than the true measure of A. Intuitively speaking, the worse n=1 In approximates A, the bigger the sum n=1 m(In ), and the better n=1 In approximates A, the smaller the sum n=1 m(In ). Heres a one-dimensional illustration:

(] (

] ( ]( ]

]( ]

Figure 3.1. The top shows a set A R (which could be quite com-

plicated). The bottom shows left-half open intervals covering A. The sum of the lengths of the intervals is larger than the true measure of A. It therefore makes sense to dene the (outer) measure of A to be the smallest, more precisely, the inmum, of the set of all such sums of lengths of intervals that cover A.

and heres a two-dimensional illustration:

Figure 3.2. The left-hand picture shows a cover of A = a disk in R2 by ve rectangles that gives a bad (too large of an) approximation to A. The right-hand picture shows a cover of A by eight rectangles that gives a better (closer) approximation to A; in this case, the sum of the areas of the rectangles is closer to the true measure of A. The inmum of all such sums of areas of rectangular approximations should be the true measure of A.

3.1. INTRODUCTION: WHAT IS A MEASURABLE SET?

129

We dene m (A) as the smallest possible sum dene smallest in terms of inmums:

n=1

m(In ). More precisely, we

(3.1)

m (A) := inf
n=1

m(In ) ; I1 , I2 , I3 , . . . I 1 cover A .

The inmum is taken over all covers {In } of A, where In I 1 for all n N. This procedure denes the outer measure m (A) [0, ] for any set A R (outer since A n=1 In and the union may extend to the outside of A). Thus, we have a map m : P (R) [0, ], where P (R) is the power set of R, the set of all subsets of R. The function m is called Lebesgue outer measure and it assigns a length to every subset of R. It might seem like m is exactly what we need to solve Lebesgues measure problem:
1) There are sets whose measure is not zero. 2) Two congruent sets have the same measure. 3) The measure of the union of a nite number or an innite countable number of sets that are pairwise disjoint is the sum of the measures of the sets.

Certainly m satises 1) (e.g. its easy to check that m (R) = ) and in Section 4.4 well prove it satises 2), where congruent means one set can be translated and/or reected to get the other set. However, it fails to have Property 3 in Lebesgues quote; in fact, (see Section 4.4) one can always break up an interval (a, b), where a < b, as a union (a, b) = A (a, b) \ A for some subset A (a, b) such that Thus, the sum of the measures of the parts (A and (a, b) \ A) is greater than the measure of the whole (the interval (a, b))! This fact follows from the work of Giuseppe Vitali (18751932), who in his famous 1905 paper [401] proved the existence of a non-measurable set, and said
This suces us to conclude: The problem of measure of the set of points of a straight line is impossible.

b a < m (A) + m ((a, b) \ A).

Intuitively speaking, non-measurable sets have blurry or cloudly edges, so the denition (3.1) assigns a larger measure to them than they should have. Now although m is not a measure on P (R), there is a proper subset of P (R) on which m does satisfy 3); these sets are what Lebesgue calls measurable. Now assume that A is a subset of some bounded interval (a, b). Then when Lebesgue says When the lower and upper bounds for A are equal, m(A) is dened, and we say that A is measurable , he is saying that the subset A is measurable if see Equations (3) and (4) in Lebesgues quote. In other words, he denes A (a, b) to be measurable if (3.2) b a = m (A) + m ((a, b) \ A). m (A) = b a m ((a, b) \ A);

Intuitively speaking, a measurable set has a distinct edge: its unambiguous to what points are in A and not in A (that is, in (a, b) \ A), so the measure of points in A and not in A is exactly the measure of (a, b). We then dene m(A) = measure of A := m (A).

130

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY

3.1.2. Answer #2 by Carath eodory. Unfortunately, the denition (3.2) for measurability only works for sets that belong to some open interval. Of course, it would also be nice to dene measurability for unbounded sets too. Notice that (3.2) can be written as (3.3) since (a, b) A = A as A (a, b), and we are using the fact that m (a, b) = b a, which you will prove in Problem 1. For an arbitrary subset A of R, there might not exist an interval (a, b) that contains A. However, notice that even though there might not exist such an interval containing A, both sides of the equality (3.3) are still dened for any interval (a, b) (because m is dened for any subset of R). Thus, why not simply declare a subset A R to be measurable if (3.3) holds for any interval (a, b)? This is exactly Constantin Carath eodorys (18731950) brilliant idea which he published in 1914 [73]. In fact, it turns out that for theoretical purposes its convenient to replace (a, b) by any subset of R. Constantin Thus, heres Carath eodorys denition of measurability: A subset A R is Carath eodory measurable if for any subset E R, we have (18731950). m (E ) = m (E A) + m (E \ A). If A does lie in an interval, this denition is equivalent to Lebesgues denition, but the advantage of Carath eodys denition is that it works for unbounded sets too. 3.1.3. Answer #3 by Littlewood. John Littlewoods (18851977) description of measurable sets provides lots of intuition about measurability. On page 26 of his book Lectures on the theory of functions [251] he tries to emphasize to his readers that the theory of Lebesgue integration is not so dicult as it may seem. In particular, concerning measurable sets he says
Every [nite Lebesgue] (measurable) set is nearly a nite sum

m (a, b) = m ((a, b) A) + m ((a, b) \ A),

John Littlewood of intervals. (18851977). We can in fact take Littlewoods statement here and make it into a denition. Let A R and assume that m (A) < . Then we can dene A to be measurable if A is nearly a nite union of intervals, that is, an elementary gure: A is measurable if A is nearly an elementary gure. We make this precise as follows: Given any > 0 there is an element I E 1 (= the elementary gures in R1 = nite unions of elements of I 1 ) such that m (A \ I ) < and m (I \ A) < . Thus, A is nearly equal to I in the sense that the points in A but not in I and the points in I but not in A have small outer measure. Now what if A does not necessarily have nite outer measure? Then we can modify Littlewoods denition of measurability as follows: Given any subset A R, A is measurable if A is nearly an open set. We make this precise as follows. To say that A is nearly an open set we mean given > 0 there is an open set U R such that AU and m (U \ A) < .

3.2. COUNTABLE ADDITIVITY, SUBADDITIVITY, AND THE PRINCIPLES

131

Thus, the points in U but not in A has small outer measure. Littlewoods approach to measurability is intuitively appealing because we all should have an intuitive feel for unions of intervals and open sets; measurable sets are not much dierent. We end this present section with Littlewoods words of wisdom [251, p. 2627]:
Most of the results of this present section are fairly intuitive applications of these ideas, and the student armed with them should be equal to most occasions when real variable theory is called for. If one of the principles would be the obvious means to settle a problem if it were quite true, it is natural to ask if the nearly is near enough, and for a problem that is actually soluble it generally is. Exercises 3.1. 1. We prove that m (a, b) = b a for all a < b. The arguments will be repeated often in the sequel! We show that m (a, b) b a and b a m (a, b). (i) Prove that m (a, b) b a. (ii) Assume that (a, b) n=1 (an , bn ]; then it follows by denition of inmum that b a m (a, b) if we can show that b a n=1 m(an , bn ]. Let > 0 with < (b a)/2 and using compactness, show there is an N such that [a + , b ] N N n=1 (an , bn + 2n ), and conclude that (a + , b ] n=1 (an , bn + 2n ]. 1 (iii) Using nite subadditivity of m on I and that > 0 can be arbitrarily small, prove that b a n=1 m(an , bn ].

3.2. Countable additivity, subadditivity, and the principles In this section we study measures with a focus on measures dened on semirings and rings. In particular, we show that Lebesgue measure is indeed a measure. 3.2.1. Countably additive set functions on semirings. An additive set function : I [0, ] on a semiring I (in particular, a ring or -algebra) is said to be countably additive if for any countable collection of pairwise disjoint sets A1 , A2 , A3 , . . . in I such that n=1 An I , we have

n=1

An

=
n=1

(An ).

A countably additive set function is called a measure and elements of I are said to be measurable. If X I and (X ) = 1, then is called a probability measure. Here are a couple remarks. First, a nitely additive set function on a nite collection of sets I is automatically countably additive (can you prove this?). In particular, all probability set functions on nite sample spaces are measures. So, countable additivity is only a new idea when I is innite. Second, if I is a -algebra, we dont have to assume n=1 An I (because -algebras are closed under countable unions). So, the natural domain for a measure is really a -algebra. Because of its importance, let us repeat ourselves by dening measures on -algebras: A set function : S [0, ], where S is a -algebra, is called a measure if (1) () = 0; (2) is countably additive in the sense that

A=
n=1

An

(A) =
n=1

(An )

for any sequence of pairwise disjoint sets {An } S .

132

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY

The triple (X, S , ) (or (X, ) if S is understood, or (S , ) is X is understood, or just X if both S and are understood) is called a measure space and sets in S are called measurable (or -measurable to be more precise). If is a probability measure, we call (X, S , ) a probability space. We shall focus on measures dened on -algebras starting in Section 3.4, but for this and the next section we mostly work with measures dened on semirings and rings. In particular, we shall prove that the additive set functions of geometry (Lebesgue measure Theorem 3.1 below) and those we have looked at in our probability examples (eg. innite product measures Section 3.3) are countably additive. However, lest you think that all nitely additive set functions are countably additive, consider the following example.
Example 3.1. Let f : R R be the nondecreasing function dened by f (x ) = 0 1 if x 0, if x > 0.

Recall that the corresponding LebesgueStieltjes set function is dened by f (a, b] = f (b) f (a). We know (by Proposition 1.21) that f : I 1 [0, ) is nitely additive. Heres a picture of f :
1 = f (a) = f (b) = f (1)

0 = f (0) . . . ( (] (] (] ( ] ( 0 a b

] 1

Observation: Notice that f (a, b] = 0 for any a, b > 0, while f (0, 1] = 1. Using this fact its easy to show that f is not countably additive. Indeed, consider the decomposition (as shown in the picture above) (0, 1] =
n=1

1 1 , or , n+1 n
1 ,1 n+1 n

A=

n=1

An ,

where A = (0, 1] and An =

; then A1 , A2 , . . . are pairwise disjoint. By


n=1

our observation above, for any n N, we have f (An ) = 0, so However, f (A) = f (1) f (0) = 1, so, as 1 = 0, f (A) =

f (An ) = 0.

f (An ).

n=1

Now how would we prove that an additive set function : I [0, ] is countably additive (if indeed its so)? That is, given A I with A = n=1 An with An I for each n and pairwise disjoint, we would need to show that

Thus f : I 1 [0, ) is not countably additive.

(A) =
n=1

(An ).

We can break this equality up into two separate inequalities:


(3.4)
n=1

(An ) (A)

and (A)

(An ).
n=1

The rst inequality holds because any nitely additive set function is countably superadditive (see Theorem 2.4 back in Section 2.3). The second inequality holds

3.2. COUNTABLE ADDITIVITY, SUBADDITIVITY, AND THE PRINCIPLES

133

if is replaced by a nite N ; this is just nite subadditivity which holds for any nitely additive set function (again from Theorem 2.4 back in Section 2.3). However, for innite sums it my fail. For instance, in the above example we have 1 1 A= n=1 An , where A = (0, 1] and An = n+1 , n , but as 1 0, we have

f (A)

f (An ).
n=1

Conclusion: the second inequality in (3.4) is the deciding factor in determining countable additivity. This explains why the following denition is important. An additive set function : I [0, ] on a semiring I is said to be countably subadditive if A n=1 An where A, A1 , A2 , . . . I implies1

(A)

(An ).
n=1

In probability theory, this inequality is called Booles inequality after George Boole (18151864). We can reword our conclusion as: If a nitely additive set function is countably subadditive, then its countably additive. We shall use this fact to prove that Lebesgue measure is indeed a measure, but rst . . . 3.2.2. The principles and Lebesgue(-Stieltjes) measures. An alternative name for analysis might be the science of inequalities because many proofs come down to showing one thing is less than or equal to another! A notable artice in this science is the following fact: Given (extended) real numbers a, b, ab a b + for all > 0.

We call this the principle, and we leave you to prove it. Now what if b is an innite series, say b = k=1 ak ? In this case, by the principle we have

(3.5) Since 1 = in (3.5) by

ak
k=1

ak + for all > 0.


k=1

1 k=1 2k (this is a geometric series), we have = k k=1 2k we obtain the so-called /2 principle,

replacing which states

k=1 2k ;

ak
k=1

ak +
k=1

2k

for all > 0.

Well use this idea shortly, but rst lets review notation. Recall from Section 1.4, right before Proposition 1.10, that we denote a box (p1 , q1 ] (pn , qn ] in Rn by (p, q ] where p = (p1 , . . . , pn ) and q = (q1 , . . . , qn ) are elements of Rn . Of course, this box could be the empty set if any pk qk . Given r, s R, we let (p + r, q + s] be the box determined by the n-tuples (p1 + r, . . . , pn + r) and (q1 + s, . . . , qn + s). There are analogous notations for closed, open, etc., types of boxes.
1Note that there is no assumption on the pairwise disjointness of the A s. n

134

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY

Lebesgue measure is a measure Theorem 3.1. For any n, Lebesgue measure m : I n [0, ) is a measure, that is, m is countably additive.
Proof : As we discussed earlier, we just have to prove countable subadditivity. To n n this end, let (a, b] k=1 (ak , bk ] where (a, b] I and, for each k , (ak , bk ] I . We shall prove that m(a, b] k=1 m(ak , bk ]. Step 1: By the /2k principle, xing > 0, we just have to prove (3.6) m(a, b]
k=1

m(ak , bk ] +

. 2k

We now turn (3.6) into a statement without the /2k s. Indeed, for k N take a real number k > 0 such that m(ak , bk + k ] diers from m(ak , bk ] by at most /2k . (Here, we use the fact that for any p, q Rn , m(p, q + r ] is a continuous function of r [0, ) can you prove this?). Thus, m(ak , bk + k ] m(ak , bk ] + k . 2 Hence, (3.6) follows from (3.7) m(a, b]
k=1

m(ak , bk + k ].

Summary: We are left to prove (3.7) to establish subadditivity of m. Step 2: We now use compactness to reduce the countable subadditivity in (3.7) to nite subadditivity (and we know m is nitely subadditive). To do so, let r > 0 and observe that [a + r, b] (a, b]

k=1

( a k , bk ]

( a k , bk + k ) .

k=1

Since the closed box [a + r, b] is compact, it is covered by a nite union of the open sets on the far right, say the rst N of them,
N

[a + r, b]

( a k , bk + k ) .
k=1

Since (a + r, b] [a + r, b] and (ak , bk + k ) (ak , bk + k ], we conclude that


N N

(a + r, b]

k=1

(ak , bk + k ] , which implies m(a + r, b]

m(ak , bk + k ].
k=1

The right-hand sum is the same sum with N replaced by and hence, m(a + r, b] m(ak , bk + k ].
k=1

Finally, letting r 0 and using the fact that m(a + r, b] is a continuous function of r , we obtain (3.7) as desired.

We now consider LebesgueStieltjes set functions. In Proposition 1.21 we proved that any such set function is nitely additive. However, we saw in Example 3.1 that f may not be countably additive, which was the case for the function f (x) = 0 if x 0, 1 if x > 0.

3.2. COUNTABLE ADDITIVITY, SUBADDITIVITY, AND THE PRINCIPLES

135

Note that in this example, f is not right continuous at 0. This is exactly the defect in f that prevents f from being countably additive as the following theorem shows. LebesgueStieltjes measures Theorem 3.2. A set function : I 1 [0, ) is a measure if and only if = f , where f : R R is a right-continuous nondecreasing function.
Proof : The proof of suciency, that f is a measure if f is right-continuous, is similar to the proof of Theorem 3.1 basically everywhere you see an m in the proof of Theorem 3.1, replace it with f we leave you the details. Assume now that : I 1 [0, ) is a measure; in particular, is additive so from Proposition 1.21 we know that = f for some nondecreasing function f : R R. Fixing x R, to prove that f is right-continuous at x we just have to prove that f (x) = lim f (xn ) for any strictly decreasing sequence x1 > x2 > x3 > with limn xn = x (why?). Consider the union

(x, x1 ] =
n=1

(xn+1 , xn ]

. . .( ] ( x x4 x3

] ( x2

] x1

and note that the sets (xn+1 , xn ] are disjoint for dierent n. Since f is assumed to be a measure, f (x1 ) f (x) = f (x, x1 ] =
n=1 N 1 n=1

f (xn+1 , xn ] f (xn ) f (xn+1 ) .

= lim The last sum telescopes:


N 1 n=1

f (xn ) f (xn+1 ) = f (x1 ) f (xN ), and we get


N

f (x1 ) f (x) = f (x1 ) lim f (xN ). Canceling o f (x1 ) shows that f (x) = lim f (xN ).
N

3.2.3. Equivalence of countable additivity and subadditivity. Before we introduced the 2 k principle we saw that countable subadditivity implies countable additivity. We now prove the converse. Thus, countable subadditivity and countable additivity are equivalent. Before the proof, we need the following result on double summations that well use again and again, often without mentioning it. Double summation lemma Lemma 3.3. For each pair (m, n) N N, let amn [0, ] and let f : N N N be a bijective function; therefore f (1), f (2), f (3), . . . is a list of all elements of N N. Then

amn =
m=1 n=1 n=1 m=1

amn =
n=1 m,n amn

af (n) .

Either of these sums is denoted by

(since all sums are the same).

Proof : We remark that the double sum m=1 n=1 amn means for each m N, to sum the inner summation a , which gives a nonnegative extended real mn n=1

136

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY

number for each m N. We then sum all these numbers from m = 1 to . (The double sum n=1 m=1 amn has a similar meaning.) An alternative way to look n at double sums are as follows. If smn := m i=1 j =1 aij , then

amn = lim

m=1 n=1

m n

lim smn

and

amn = lim

n=1 m=1

n m

lim smn ,

where the iterated limits, eg. limm limn smn , means for each m N, take the inner limit rst, limn smn , then to take the outer limit m next. Let L1 = m=1 n=1 amn , L2 = n=1 m=1 amn , L3 = n=1 af (n) , and put we shall prove that L := sup {smn ; m, n N} ;

L1 = L2 = L3 = L. We rst show that L1 , L2 , L3 L, then we prove the opposite inequality. For all m, n, we have, by denition of L, smn L, so using the fact that limits preserve inequalities, we have lim smn L. Taking m , we get
n m n

lim

lim smn L.

Thus, L1 L. Similarly, L2 L. Given n N, we can choose m so that f (1), . . . , f (n) N N belong to those pairs (i, j ) where 1 i, j m. Then,
n m m i=1

af (i)

i=1 j =1

aij L,

by denition of L. Taking n we get L3 L. Thus, L1 , L2 , L3 L. On the other hand, by denition of L1 and L2 , for any m, n we have Thus, taking the supremum over all m, n, we see that L L1 and L L2 . Also, given m, n N, using that f (1), f (2), f (3), . . . is a list of all elements of N N, we can choose N N so that f (1), f (2), . . . , f (N ) contain all the pairs (i, j ) with 1 i m and 1 j n. Hence,
N

smn L1

and

smn L2 .

smn

i=1

af (i) L3 .

Taking the supremum over all m, n, we get L L3 .

We shall use this double summation lemma as follows. Let : I [0, ] be a measure and let A I and assume that2 A = m,n Amn where Amn I for each m, n N N and which are pairwise disjoint, meaning that when (m, n) = (k, ), Amn Ak = . We claim that (3.8) (A) =
m n

(Amn ) =
m n

(Amn ).
n=1

Indeed, pick any bijection f : N N N. Then we have A = by countable additivity,

Af (n) . Thus,

(A) =
n=1

(Af (n) ).

2This includes nite unions because we could take A mn = except for nitely many (m, n).

3.2. COUNTABLE ADDITIVITY, SUBADDITIVITY, AND THE PRINCIPLES

137

On the other hand, by our lemma with amn = (Amn ) we have


(Af (n) ) =
n=1 m=1 n=1

(Amn ) =
n=1 m=1

(Amn ).

This proves (3.8); we sometimes denote either sum in (3.8) by m,n (Amn ). We now prove that countable additivity and countable subadditivity are equivalent. Equivalence of countable additivity and subadditivity Theorem 3.4. If : I [0, ] is additive on a semiring, then is countably additive that is, is a measure if and only if is countably subadditive.
Proof : We already know that countable subadditivity implies countable additivity, so suppose that is a countably additive; we shall prove that is countably subadditive. To do so, let A I with A n=1 An where An I for each n; we need to show that (A) n=1 (An ). To prove this, we rst intersect both sides of A n=1 An with A to obtain A = n=1 (A An ). By the Fundamental Lemma of Semirings, there exist pairwise disjoint sets {Bnm } I such that for each n, Bnm (A An ) are nite in number, and A=
n

is countably subadditive;

(A A n ) =

Bnm .
n,m

By countable additivity and by our discussion on (3.8), (3.9) (A) =


n m

(Bnm ).

Moreover, since for each n, Bnm (A An ) An and the Bnm s are pairwise disjoint, by superadditivity we have (Bnm ) (An ).
n

Combining this with (3.9) we get (A)

(An ) and our proof is complete.

We now end this section by giving an alternative characterizations of countable additivity in terms of the notion of continuity. 3.2.4. Continuity of measures. A sequence of sets A1 , A2 , . . ., is nondecreasing if A1 A2 A3 A4 , in which case, the limit set is by denition (with an accompanying picture)

lim An :=
n=1

An

A1 A2 A3 lim An

The sequence is nonincreasing if A1 A2 A3 A4 , and in this case,

lim An :=
n=1

An

lim An

A2 A1

An additive set function : R [0, ] on a ring R is said to be

138

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY


A A3 \ A2 A2 \ A1 A1 A1 \ A2 A2 \ A3 A3 \ A4 A

Figure 3.3. Left: A1 A2 A3 are nondecreasing concentric

disks whose union is the large disk A. Right: A1 A2 A3 are nonincreasing concentric disks whose intersection is the small disk A.

(1) continuous from below if for any nondecreasing sequence of sets {An } in R with limit set A = lim An R , we have (lim An ) = lim (An ). (2) is continuous from above if if for any nonincreasing sequence of sets {An } in R with limit set A = lim An R and (A1 ) = , we have (lim An ) = lim (An ). We call a set function continuous if it is continuous from below and continuous from above. Recall that a function f : R R is continuous at a point a if and only if given any sequence {an } with a = lim an , we have f (a) = lim f (an ) , that is, f (lim an ) = lim f (an ). This is why we use the term continuous in the above denitions for a set function. Equivalence of measure and continuity (1) (2) (3) (4) Theorem 3.5. If : R [0, ] is additive on a ring R , then is a measure if and only if is continuous from below. If is a measure, then is continuous from above. If is real-valued; that is, (A) < for each A R , then is a measure if and only if is continuous from above. If is real-valued, then is a measure if and only if is continuous from above at ; that is, given any nonincreasing sequence of sets {An } in R ,

if
n=1

An = , then lim (An ) = 0.

Proof : We only prove (2) (which implies the only if parts of (3) and (4) ) and the only if part of (1), leaving the rest for your enjoyment. To prove the only if part of (1), assume that is a measure. To prove continuity from below, let A1 A2 A3 be a nondecreasing sequence of sets in R with limit set A. Observe that (see the left-hand picture in Figure 3.3) A = A 1 (A 2 \ A 1 ) (A 3 \ A 2 ) (A 4 \ A 3 ) .

3.2. COUNTABLE ADDITIVITY, SUBADDITIVITY, AND THE PRINCIPLES

139

The sets A1 , A2 \ A1 , A3 \ A2 , . . . are pairwise disjoint, so by countable additivity and the denition of innite series as the limit of partial sums, (A) = (A1 ) + (3.10) = lim
k=1

(Ak+1 \ Ak )
n1 k=1

(A1 ) +

(Ak+1 \ Ak ) .

1 Note that An = A1 n k=1 (Ak+1 \ Ak ), a union of pairwise disjoint sets; thus, the term in brackets in (3.10) is simply (An ). Thus, (A) = limn (An ). To prove (2), assume that is a measure and let A1 A2 A3 be a nonincreasing sequence of sets in R with limit set A such that (A1 ) = . In particular, since An A1 for each n, by monotonicity we have (An ) < for all n. Now observe that (see the right-hand picture in Figure 3.3)

The sets on the right are pairwise disjoint, so by countable additivity and subtractivity of additive set functions, we have (A1 ) = (A) +
k=1

A 1 = A (A 1 \ A 2 ) (A 2 \ A 3 ) (A 3 \ A 4 ) .

(Ak \ Ak+1 ) = (A) +

k=1

((Ak ) (Ak+1 ))
n1 k=1

= (A) + lim

((Ak ) (Ak+1 )),

where we used that all (Ak )s are nite (being subsets of (A1 ), which is nite). 1 Now the right-hand sum telescopes n k=1 ((Ak ) (Ak+1 )) = (A1 ) (An ). Thus, (A1 ) = (A) + lim (A1 ) (An ) . Canceling (A1 ) gives (A) = lim (An ) as we wanted to show. Example 3.2. You may wonder about the hypothesis (A1 ) = in the denition of continuous from above. This hypothesis is needed otherwise the result is false. Heres a trivial counterexample. Consider : P (R) [0, ] dened by () := 0 and (A) := for A = . One can check that is a measure. Observe that =
n

An ,

where An = 0,

n=1

1 , n

Exercises 3.2. 1. Let X be an innite set. Dene : P (X ) [0, ] by (A) = 0 if A is nite and (A) = if A is innite. Prove that is nitely, but not countably additive. 2. (Various examples of measures) (a) For g : R [0, ) that is Riemann integrable on any nite interval, prove that b mg : I 1 [0, ), dened by mg (a, b] = a g (x) dx, is a measure on I 1 . (b) Prove that Proposition 1.16 in Chapter 1 holds verbatim where we replace nitely additive with countably additive everywhere in that proposition. (c) (Dirac and discrete measures) (i) Given a semiring I of subsets of a set X and given X , dene : I [0, ) by (A) := 1 0 if A if / A.

so lim An = , and A1 A2 A3 . Since (An ) = for every n, lim (An ) = . As (lim An ) = () = 0, we have (lim An ) = lim (An ).

140

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY

Prove that is a measure; it is called the Dirac measure supported on . (ii) Let R be a ring of subsets of X . Assume that R contains the total set X and all countable subsets of X . We say that a measure : R [0, ] is discrete if there exists a countable set C X such that (X \ C ) = 0. Prove that a measure is discrete if and only if there are countably many points 1 , 2 , . . . X and extended real numbers a1 , a2 , . . . [0, ] such that =
n

an n

in the sense that (A) = n an n (A) for all A R . (d) Let X be an uncountable set and let S be the -algebra of subsets A X such that A or Ac is countable. (See Problem 7c in Exercises 1.3.) Dene on S by (A) = 0 or 1 according to A or Ac countable. Is a measure? 3. (Rationals are only nitely additive) Suppose we wanted to study lengths of intervals of rational numbers. Instead of working with the semiring I 1 , we would work with the semiring I := {(a, b] Q ; a, b Q , a b}. Let : I [0, ) be dened by (I ) := b a for I = (a, b] Q I . We shall prove is not a measure! (i) Prove that is nitely additive. (ii) Prove that is not countably additive. Suggestion: Dene a countable cover I1 , I2 , I3 , . . . I of (0, 1] Q such that n=1 (In ) < 1. 4. (The semiring extension theorem for measures) Let : I [0, ] be a (nitely) additive set function on a semiring I ; then by the Semiring Extension Theorem 2.7 we know that its ring extension : R (I ) [0, ] is (nitely) additive. Prove the following result: is countably additive on I if and only if is countably additive on R (I ). The if portion is automatic (why?), its the only if that requires proof. 5. (Atoms) Let (X, S , ) be a measure space. We call a measurable set A an atom if (A) > 0 and for all measurable B A we have (B ) = 0 or (B ) = (A). Suppose there is a set E S with (E ) < and there are real numbers 0 a < b < (E ) such that no measurable subset of E has measure in (a, b). We shall prove that has an atom in E . (i) Dene real numbers b b1 b2 and nested measurable sets E B1 B2 such that (1) there is a C > 0 with 0 < (Bn ) bn C/2n for all n; (2) no measurable subset of Bn has measure in (a, bn ). Suggestion: To start, let b1 = b, B1 = E and C = 2((E ) b). If there is a measurable B Bn with 0 < (B ) bn ((Bn ) bn )/2, take Bn+1 = B and bn+1 = bn ; if no such B exists, put Bn+1 = Bn and determine bn+1 . (ii) Let D = n=1 Bn and let d = lim bn . Show that (D ) = d and no subset of D has measure in (a, d); then, show that no subset of D has measure in (0, d a). (iii) Finally, conclude that has an atom thats a subset of E . 6. (Nonatomic measures) Let (X, S , ) be a measure space. Then is nonatomic if it has no atoms and its -nite if we can write X = nN Xn where (Xn ) < for each n. We prove Waclaw Sierpi nskis (18821969) result [356]: If is nonatomic and -nite, A S , and if 0 < b < (A), then there is a set B A such that (B ) = b; it follows that given any A S , the range of on measurable subsets of A equals the entire interval [0, (A)]. (i) Using -niteness reduce to the case (A) < , which we now assume. (ii) Assuming Problem 5 show that the range of on measurable subsets of A is dense in the interval [0, (A)]. (iii) Let 0 < b < (A) and show there is a B S with (B ) = b. Suggestion: Inductively dene a nondecreasing sequence B1 B2 of measurable sets such that b 1/2n (Bn ) < b for all n; then consider B = n Bn .

3.3. INFINITE PRODUCT SPACES AND KOLMOGOROVS MEASURE AXIOM

141

3.3. Innite product spaces and Kolmogorovs measure axiom The main problem in this section is the following. Let 1 : I1 [0, 1] , 2 : I2 [0, 1] , 3 : I3 [0, 1] , . . . , be (countably many) probability measures where Ii is a semiring on a sample space Xi . Let C i=1 Xi denote the collection of all cylinder sets, sets of the form A = A1 A2 An Xn+1 Xn+2 Xn+3 , where Ai Ii for each i, and dene the innite product of 1 , 2 , . . ., : C [0, 1], on a cylinder set A as written above, by (A) := 1 (A1 ) 2 (A2 ) n (An ). By Proposition 2.6 we know is nitely additive. Since the i s were, by assumption, measures, its natural to conjecture that is in fact countably additive. Proving this is the main goal of this section. 3.3.1. Innite product probabilities I: Finite sample spaces. For pedagogical reasons, we rst assume the Xi s are nite sets; the proof in this case is simpler than the general case, which well handle in Section 3.3.2. For example, if p (0, 1) and Xi = {0, 1} for each i with {1} occurring with probability p and {0} with probability 1 p, then we shall prove in particular that the innite product measure on Bernoulli sequences is indeed a measure. More generally, assume the Xi s are nite nonempty sample spaces. Then given probability set functions i : P (Xi ) [0, 1] , i = 1, 2, . . . ,

we shall prove that the corresponding innite product measure : C [0, 1] is really a measure, that is, countably additive. In fact, we shall prove even more: In Theorem 3.7 below we shall prove that an arbitrary additive set function on C is automatically countably additive! The proof of this fact uses the following compactness property of cylinder sets (assuming the Xi s are nite!):3 Lemma 3.6. If A, A1 , A2 , A3 , . . . are cylinder sets with A N then there is an N such that A n=1 An .
n=1

An ,

Assuming this lemma for the moment, lets prove the following theorem: Measures on the cylinder sets Theorem 3.7. If the Xi s are nite nonempty sets, then any additive set function on C is countably additive; explicitly, if : C [0, ] is nitely additive, then in fact its countably additive. In particular, the innite product of countably many probability set functions is a probability measure on C .
Proof : We just have to prove countable subadditivity. Let A C and assume that A n=1 An where A1 , A2 , . . . C . By our lemma there is an N such that
3In fact, its possible to put a topology on X so its compact and the cylinder sets are i=1 i both open and closed; then Lemma 3.6 follows from the fact that cylinder sets are compact.

142

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY

A implies

N n=1

An . By nite subadditivity, we have (A) (A)


n=1

N n=1

(An ), which

(An ).

This completes our proof. How easy was that! Proof of Lemma 3.6 : Step 1: We prove our lemma by contraposition. Assume that A is not a subset of any nite number of A1 , A2 , . . .; we shall prove that A n=1 An . Since A A1 A2 An for any n, for each n N there is a point (3.11)
n n n (a n 1 , a2 , a3 , a4 , . . .) A \ (A1 A2 An ). 1 1 1 (a 1 1 , a2 , a3 , a4 , . . . ) 2 2 2 (a 2 1 , a2 , a3 , a4 , . . . )

We now list these sequences:

(3.12)

3 3 3 (a 3 1 , a2 , a3 , a4 , . . . ) 4 4 4 (a 4 1 , a2 , a3 , a4 , . . . )

. . .

. . .

. . .

. . .

. . .

Consider the rst column, which represents a sequence of points in the nite set X1 . Since X1 is a nite set, at least one point in X1 , call such a point a1 , must be repeated innitely many times in the rst column. Thus, there is an innite set I1 N such that an 1 = a1 for all n I1 . Now consider the second column of (3.12) where we only consider the elements an 2 where n I1 . Since I1 is innite, there are innitely many an 2 s in the second column where n I1 . Since X2 is a nite set, at least one point in X2 , call such a point a2 , is repeated innitely many times amongst the an 2 s where n I1 . Thus, there is an innite set I2 I1 such that an 2 = a2 for all n I2 . Note that since I2 I1 , for all n I2 we still have an 1 = a1 . In conclusion, we have an innite set I2 I1 such that Continuing by induction, we nd innite subsets I1 , I2 , I3 , . . . N and points a1 X1 , a2 X2 , . . ., such that for each m N, (3.13) We now put Step 2: We claim that a A and a / A1 , A2 , . . ., which completes our proof. To see that a A, we use the denition of cylinder set to write A = B XN +1 XN +2 XN +3 for some N and for some set B X1 XN . To prove that a A, we just have to prove that (a1 , . . . , aN ) B . To prove this, let m = N in (3.13) and x any n IN . By (3.11) we know that
n n which implies (an 1 , a2 , . . . , aN ) B . By (3.13), we get (a1 , a2 . . . , aN ) B . To see that a / A1 , A2 , . . ., x an i. To see that a / Ai , we follow the same pattern as to prove a A. We start with the denition of cylinder set to write n n n n (a n 1 , a2 , . . . , aN , aN +1 , aN +2 , . . .) A, n n (a n 1 , a2 , . . . , am ) = (a1 , a2 , . . . , am ) for all n Im ; n (a n 1 , a2 ) = (a1 , a2 ) for all n I2 .

a := (a1 , a2 , a3 , . . .)

Xi .

Ai = Bi XM +1 XM +2 XM +3

3.3. INFINITE PRODUCT SPACES AND KOLMOGOROVS MEASURE AXIOM

143

for some M and for some set Bi X1 XM . To prove that a / Ai , we just have to prove that (a1 , . . . , aM ) / Bi . Let m = M in (3.13) and x any n IM with n i. By (3.11) and the fact that n i, we know that
n n which implies (an / Bi . By (3.13), we see that (a1 , a2 . . . , aM ) / Bi . 1 , a2 , . . . , aM ) n n n n (a n / Ai , 1 , a2 , . . . , aM , aM +1 , aM +2 , . . .)

3.3.2. Innite product probabilities II: General sample spaces. We now drop the assumption that the sample spaces Xi are nite. For example, in the case Xi = (0, 1] for each i with Lebesgue measure, the innite product i=1 Xi = (0, 1] represents the sample space of picking an innite sequence of real numbers at random from the interval (0, 1]. Theorem 3.8 says that Lebesgue measure on (0, 1] is a measure. More generally, assume we are given probability measures where Ii is a semiring on a sample space Xi (no longer assumed to be nite). Let C i=1 Xi denote the collection of cylinder sets generated by the semirings I1 , I2 , . . . and consider the innite product of 1 , 2 , . . ., The following theorem says that is a measure. Probability measures on the cylinder sets Theorem 3.8. The innite product of countably many probability measures is a probability measure on C .
Proof : Our proof begins with . . . Step 1: Introduction to sections. By a section we mean the following. Let A i=1 Xi . Given (x1 , x2 , . . . , xn ) X1 Xn , we dene A(x1 , x2 , . . . , xn ) := y
i=n+1

1 : I1 [0, 1] , 2 : I2 [0, 1] , 3 : I3 [0, 1] , . . . ,

: C [0, 1].

Xi ; (x1 , x2 , . . . , xn , y ) A .

This set is called the section of A at (x1 , x2 , . . . , xn ). Figure 3.4 shows a couple examples of sections in the nite product case X1 X2 and X1 X2 X3 .
X2 A(x1 ) x1 X1 X1 X3 A(x1 , x2 ) X2 (x1 , x2 )

Figure 3.4. Here, we draw the Xi s as if they were the real line. On the left, A is a (lled in) oval and on the right, A is a solid ball. In both cases, the sections are line segments. In the rst case, A(x1 ) = {x2 X2 ; (x1 , x2 ) A} and in the second case, A(x1 , x2 ) = {x3 X3 ; (x1 , x2 , x3 ) A}.
One reason sections are important is the following Claim: If (a1 , a2 , . . .) i=1 Xi and A R (C ), then

(a1 , a2 , . . .) A if and only if A(a1 , a2 , . . . , ak ) = for all k.

144

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY

In other words, (a1 , a2 , . . .) A if and only if for every k N, the (a1 , . . . , ak )section of A is not empty. The proof of this claim is not dicult (but does require some thought) so we shall leave it to the interested reader (see Problem 2). We now relate . . . Step 2: Integration and measures of sections. We rst introduce some notation. For each k N we let and we let C (k) = cylinder subsets of Xk Xk+1 Xk+2 .

(k) = innite product measure of k , k+1 , k+2 , . . . . The set function (k) : C (k) [0, 1] is nitely additive and it extends uniquely to a nitely additive set function (k) : R (C (k) ) [0, 1]. Observe that C (1) = C and (1) = while for k > 1, we think of (k) : R (C (k) ) [0, 1] as, roughly speaking, the restriction of : R (C ) [0, 1] from i=1 Xi to i=k Xi . Let A R (C ). Then we claim that (1) For any x1 X1 , we have A(x1 ) R (C (2) ). (2) The function f : X1 R dened by f (x1 ) := (2) (A(x1 )) for all x1 X1 is an I1 -simple function. (3) We have (A) = f d1 . Heres a picture showing why (3) is obvious in the simple case of X1 X2 :
X2 (2) (A(x1 )) = length of the line segment A(x1 ). x1 X1

Figure 3.5. Integrating (summing) the lengths of the line segments


over all the points x1 X1 gives the area of A; that is, area of A = (A). f d1 =

Since an element of R (C ) is a union of pairwise disjoint elements of C , we just have to check (1)(3) for an element of C . Thus, assume that A C . Then we can write A = A1 A2 AN for some Ai Ii . It follows that given x1 X1 , A (x 1 ) = A2 AN
i=N +1

Xi

i=N +1

Xi

if x1 / A1 , if x1 A1 . if x1 / A1 , if x1 A1 ;

Hence, A(x1 ) C (2) for all x1 X1 . Moreover, we have (2) (A(x1 )) = it follows that 0 2 (A2 ) N (AN )

f (x1 ) = a A1 (x1 ), where a = 2 (A2 ) N (AN ). Thus, f = a A1 is an I1 -simple function and by denition of the integral, f d1 = a 1 (A1 ) = 1 (A1 )2 (A2 ) N (AN ) = (A), as required. This completes the proof of (1)(3). Identical arguments prove the following results, which well need below: If A R (C (k) ), then

3.3. INFINITE PRODUCT SPACES AND KOLMOGOROVS MEASURE AXIOM

145

(1) For any xk Xk , we have A(xk ) R (C (k+1) ). (2) The function f : Xk R dened by f (xk ) := (k+1) (A(xk )) for all xk Xk is an Ik -simple function. (3) We have (3.14) (k) (A) = f dk .

With all this preliminary material, we are now ready for . . . Step 3: Idea of proof. Now how do sections help to prove that the innite product measure is a measure? By the Semiring Extension Theorem we know that extends uniquely to be additive on R (C ). We shall prove that this extension is countably additive, then by restricting back to C we see that is countably additive on C as well. We shall use (the contrapositive in the displayed statement of) Part (4) of Theorem 3.5 to prove that : R (C ) [0, 1] is a measure: Let A1 , A2 , . . . R (C ) with A1 A2 A3 and with lim (An ) = 0; we need to show that n=1 An = . To do so, we shall nd a X such that point (a1 , a2 , . . .) i i=1 (3.15) Goal: An (a1 , a2 , . . . , ak ) = for all n, k N. Since this holds for all k N it follows that (a1 , a2 , . . .) An for all n and since this holds for all n we get (a1 , a2 , . . .) n=1 An = n=1 An . This proves that and shows that is a measure. Now that we know what we are after, our next step is . . . Step 4: Proof of Theorem. Instead of proving the statement (3.15) directly, we turn this statement about sets into a statement about measures of sets. To this end, note that (A1 ) (A2 ) so, as lim (An ) = 0, there is an > 0 such that (3.16) (An ) for all n N.

We can consider this inequality as the k = 0 statement of the following claim: There is a point (a1 , a2 , . . .) i=1 Xi such that Claim: (k+1) (An (a1 , a2 , . . . , ak )) k for all n, k N. 2 Our claim certainly implies our goal: Since in particular An (a1 , a2 , . . . , ak ) has positive measure, it must be nonempty. We shall prove our claim by induction on k. Regarding (3.16) as the k = 0 case, assuming the k 1 case: (3.17) (k) (An (a1 , a2 , . . . , ak1 )) k1 for all n N, 2 we shall prove there is a point ak Xk giving our claim. We now turn our claim, which is a statement about the measure of a set, to a statement involving a function. Indeed, given n N, dene fn : Xk [0, 1] by fn (xk ) := (k+1) (An (a1 , a2 , . . . , ak1 , xk )) Then to prove our claim we need to show there is a point ak Xk such that fn (ak ) /2k for all n N. In other words, if we dene for each n N, Bn = fn k Xk , 2 then we need to show that the intersection n=1 Bn is not empty. Observe that since fn is a simple function on Xk we know that Bn R (Ik ) (Problem 1 in Exercises 2.4). Also, since A1 A2 A3 , all the sections of the Ai s are for all xk Xk .

146

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY

also nonincreasing, so f1 f2 f3 . It follows that B1 B2 B3 . Moreover, observe that combining (3.14) and (3.17), we have fn dk Thus, 2k1 fn dk = Bn fn dk + Bn 1 dk + , 2k
c fn dk Bn

2k1

for all n N.

dk 2k

= k (Bn ) +

c where in the second line we used that fn 1 and on the set Bn , fn < /2k . k Thus, k (Bn ) /2 for all n N; in particular, lim k (Bn ) = 0. Now we

are given that k : Ik [0, 1] is a measure, so by Problem 4 in Exercises 3.2 it follows that k : R (Ik ) [0, 1] is also a measure. Hence by Part (4) of Theorem 3.5 we know that n=1 Bn = . This nally completes our proof!

3.3.3. Kolmogorovs countable additivity model. Recall Andrey Kolmogorovs (19031987) axioms for probability [215, p. 2]:
I. R is a ring of subsets of a set X . II. R contains X . III. To each set A in R is assigned a nonnegative real number (A). This number (A) is called the probability of the event A. IV. (X ) = 1. V. If A and B have no element in common, then (A B ) = (A) + (B ).

Here, Kolgomorov is actually assuming there are only nitely many events. When the number of events can be innite, on page 14 of [215], Kolgomorov adds another axiom and says
In all future investigations, we shall assume that besides Axioms IV, still another holds true: VI. For a decreasing sequence of events A1 A2 A3 of R , for which n=1 An = , the following equation holds lim (An ) = 0.

By Part (4) of Theorem 3.5, we know that Axiom VI is equivalent to saying that : R [0, 1] is a measure. Thus, Kolgomorov is essentially axiomatizing probability so that probability becomes a part of measure theory! On the next page, Kolmogorov makes the following interesting statement [215, p. 15] (I bolded a sentence near the bottom).
Since the new axiom is essential for innite elds of probability only, it is almost impossible to elucidate its empirical meaning, as has been done, for example, in the case of Axioms I V . . .. For, in describing any observable random process we can obtain only nite elds of probability. Innite elds of probability occur only as idealized models of real random processes. We limit ourselves, arbitrarily, to only those models which satisfy axiom VI. This limitation has been found expedient in researches of the most diverse sort.

3.3. INFINITE PRODUCT SPACES AND KOLMOGOROVS MEASURE AXIOM

147

So, it seems like Kolgomorov arbitrarily studies countable additive probability models because it has been found expedient in researches. That this axiom is expedient in researches is very true: Countable additivity gives an incredibly useful theory of the integral (which, for instance, xes all the deciencies of the Riemann integral discussed in the prelude such as the interchange of limits and integrals). Nonetheless, axiomatizing probability so that it becomes a part of measure theory has not come without controversy. This is because probability is supposed to model the likelihood of real-life phenomena and although everyone can accept nite additivity as part of real- Bruno de Finetti life probability, how can we say denitively that all real-life probabilistic (19061985). phenomena behave countably additively? One of the great opponents of the countable additivity axiom of probability was the famous probabilist Bruno de Finetti (19061985) who said [99, p. 229]
From the viewpoint of the pure mathematician who is not concerned with the question of how a given denition relates to the exigencies of the application, or to anything outside the mathematics the choice is merely one of mathematical convenience and elegance. Now there is no doubt at all that the availability of limiting operations under the minimum number of restrictions is the mathematicians ideal. Amongst their other exploits, the great mathematicians of the nineteenth century made wise use of such operations in nding exact results involving sums of divergent series: rst-year students often inadvertently assume the legitimacy of such operations and fail the examination when they imitate these exploits. At the beginning of this century it was discovered that there was a large area in which the legitimacy of these limiting operations could be assumed without fear of contradictions, or of failing examinations: it is not surprising therefore that the tide of euphoria is now at its height.

That the tide of euphoria is now at its height is true; take for instance one of the worlds experts in probability theory, Richard Dudley (1938), who said The denition of probability as a (countably additive, nonnegative) measure of mass 1 on a -algebra in a general space is adopted by the overwhelming majority of researches in probability [110, p. 273]. We shall follow the crowd and mostly limit ourselves to measures instead of nitely additive set functions because it allows us to develop a powerful theory of the integral where limits and integrals can be interchanged without fear of contradictions, or of failing examinations and because the availability of limiting operations under the minimum number of restrictions is the mathematicians ideal. As Kolgomorov mentioned, such a theory of the integral has been found expedient in researches of the most diverse sort. However, with this said, Id like to say that studying nite additivity is important both philosophically and mathematically; philosophically because in real-life we really only encounter nitely additive phenomena (countably additivity is really just an idealization) and mathematically because there are completely natural set functions which are nitely, but not countably, additive such as asymptotic density that youll study in Problems 6 and 7.
Exercises 3.3. 1. In Problem 7 of Exercises 2.4 we related the sequence space {0, 1} modeling an innite sequence of fair coin tosses and the interval [0, 1] with Lebesgue measure. Looking at this problem, one might think that Theorem 3.7 holds for the interval [0, 1]; that is,

148

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY

if I consists of all left-half open intervals in [0, 1] and : I [0, ) is nitely additive, then its automatically countably additive. This is false: Find a nitely, but not countably, additive set function on I . Therefore, although {0, 1} and [0, 1] are in some respects similar, they are measure theoretic very dierent. 2. In this problem we study sections of sets. (i) Let I1 , I2 , . . . be semirings on nonempty sets X1 , X2 , . . ., let A R (C ), and let (a 1 , a 2 , . . . ) i=1 Xi . Prove that (a1 , a2 , . . .) A if and only if A(a1 , a2 , . . . , ak ) = for all k. (ii) Find a counterexample to the if part of (i) if we drop the assumption A R (C ); that is, nd sets X1 , X2 , . . ., a subset A i=1 Xi and a point (a1 , a2 , . . .) / A. i=1 Xi such that A(a1 , a2 , . . . , ak ) = for all k , yet (a1 , a2 , . . .) 3. (General product measures) In this problem we extend Theorem 3.8 to arbitrary products. Let I be an index set (not necessarily countable) and for each i I , let i : Ii [0, 1] be a probability measure, where Ii is a semiring on a sample space Xi . We denote by iI Xi the set of all functions x : I iI Xi such that x(i) Xi for each i I . For example, in the case I = N we identity the function x with the innite tuple (x1 , x2 , . . .) where xi := x(i). Let C iI Xi denote the collection of all cylinder sets, where A C means that for some nite set F I , we can write A=
iF

Xi

Xi ;
i /F

by which we mean x A if and only if x(i) Xi for i F and there are no conditions on x(i) for i / F . We dene the innite product of the i s, : C [0, 1], on a cylinder set A as written above, by (A) := iF i (Ai ). (i) Prove that is nitely additive. In particular, extends uniquely to a nitely additive probability set function on R (C ). (ii) Prove that is a measure by reducing to an application of Theorem 3.8 in the case I is countable. Suggestion: Let A1 , A2 , . . . C be a pairwise disjoint cylinder sets. Show that there is a countable set C I such that for each k, Ak = Bk i / C Xi where Bk is a cylinder set of the countable product iC Xi . Show that B1 , B2 , . . . are pairwise disjoint elements of the cylinder subsets of iC Xi . 4. (Coherent probability and nitely additive probability) In this problem we discuss Bruno de Finettis notion of coherent probability. First, the notion of bettors gain. Let 0 p 1, let a R, and let A X where X is the sample space of some experiment. You walk into a casino and you pay $(ap) betting that the event A occurs in the experiment; if a < 0, then the casino actually pays you $(|a|p). (For example, if a = 10 and p = 1/2, you pay $10/2 = $5 for a 50-50 chance of winning $10.) If the event A occurs, you get $a and if A doesnt occur, you get nothing. We can summarize your net gain with the function g : X R , dened by g = a A p ; note that if x A, then g (x) = a pa and if x / A, then g (x) = pa, which are exactly your net gain when the event A does or doesnt occur. If we are given N weights p1 , . . . , pN [0, 1], N amounts a1 , . . . , aN R, and N events A1 , . . . , AN X , then the function, called the bettors gain,
N

(3.18)

G : X R , dened by G =

n=1

an An pn ,

represents your net gain if you bet $(an pn ) on the event An , winning $an if the event An occurs. Note that if G(x) > 0 for all x X , then you win regardless what the outcome of the experiment is; if this is the case, the game is called unfair,4 otherwise
4

Of course, unfair to the casino but you might consider it fair to you!

3.3. INFINITE PRODUCT SPACES AND KOLMOGOROVS MEASURE AXIOM

149

the game is fair. Were now ready to explain de Finettis theory of probability, which basically says that probabilities should only give rise to fair games. Let A be a collection of subsets of X and let : A [0, 1]. The function is called a coherent probability if for any nite number of events A1 , . . . , AN A and real numbers a1 , . . . , an R, the bettors gain function (3.18) with pn = (An ), n = 1, . . . , N , is always fair; that is, G(x) > 0 for all x X . Note that A is not assumed to be a ring and is not assumed to be nitely additive. This is quite dierent from Kolgomorovs axioms for a probability! However, if A is a ring containing the whole space X , then de Finettis theory and Kolgomorovs (nitely additive) theory are the same, which shows that de Finettis theory is more general than Kolgomorovs. (de Finettis theorem) Let : A [0, 1] be a set function where A is a ring containing X . Then is a coherent probability if and only if is a nitely additive probability set function. Prove this as follows (taken from [24]): (i) Assume that is a coherent probability. By considering the bettors gain function X (X ), prove that (X ) = 1. Given disjoint sets A, B A , by considering the bettors gain function (A (A)) + (B (B )) (AB (A B )), prove that (A B ) (A) + (B ). Similarly, prove the opposite inequality. Conclude that is a nitely additive probability set function. (ii) Assume now that is a nitely additive probability set function. By way of contradiction, assume that is not a coherent probability, meaning there are sets A1 , . . . , AN A and real numbers a1 , . . . , an R such tha the bettors gain function (3.18) with pn = (An ), n = 1, . . . , N , is strictly positive at all points of X . Show that there are pairwise disjoint nonempty sets B1 , . . ., BM A with X = B1 BM and constants b1 , . . . , bM R such that
M

n=1

bn Bn (Bn ) > 0

at all points of X . Prove that for each m, we have bm > M n=1 bn (Bn ). 5. (A countably additive paradox) In this problem we show how countable additivity could lead to paradoxical results in probability. Suppose that : R [0, 1] is a countably additive probability set function on a ring R . One might think that only gives rise to fair games where we are allowed to bet on countably many events. By fair we mean there does not exist countably many events A1 , A2 , . . . R and amounts a1 , a2 , . . . R such that the bettors gain
n=1

an An (An )

converges to a positive real number at all points of the sample space. However, it turns out that we can have unfair games! Let X = (0, 1], R the ring generated by those elements of I 1 that are subsets of (0, 1], and consider Lebesgue measure m : R [0, 1], which is a countably additive probability set function by Problem 4 in Exercises 3.2. Heres Beams construction [25]: (1)n Step 1: Since the alternating harmonic series converges conditionally, n=1 n from the Riemann rearrangement theorem of elementary real analysis, we can rearrange the terms of the alternating harmonic series so that the series converges to any given value (or diverges). Lets x a positive real number c > 0 and rearrange the natural (1)in numbers i1 , i2 , i3 , . . . so that = c. in n=1

150

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY

and an = (1)in +1 . Prove that Step 2: Let An = 0, i1 n=1 an An m(An ) n converges to a positive real number c at all points x (0, 1]. This shows that its possible to construct a game where the bettor can win an arbitrary predetermined amount of money no matter what the outcome of the game is! 6. (Asymptotic density) What does is the event that a randomly chosen natural number is even? One way to interpret this is that the sample space consists of subsets of N and the event is the subset 2N (consisting of even natural numbers), where for any a N and A N, we dene aA := {a x ; x A}. How would we assign a probability to the event 2N? In general, given a subset A N, what is the probability that a randomly chosen natural number lies in A? Here is the common way to assign such a probability. Given a subset A N, dene (3.19) D(A) := lim
N

#N ( A ) , N

where #N (A) := {x A ; x N }, provided this limit exists. This limit is called the (asymptotic) density of A. Let D denote the collection of all A N such that the limit (3.19) exists. Then we have a map D : D [0, 1] One may think that D is a probability measure, but it is not countably additive! Here are some properties of D. (a) Prove that D(N) = 1 and D(A) = 0 for any nite set A. For any a N, prove that D(aN) = 1/a. One can interpret this as saying that one out of every a-th natural number is divisible by a. (b) For any a N and r = 0, 1, 2, . . . , a 1, put A(a, r ) := aN + r = {an + r ; n N}. Prove that D(A(a, r )) = 1/a. 2k 2k+1 (c) Let A = 1}. Show that the limit (3.19) does not exist. In k=0 {10 , . . . , 10 particular, D = P (N). (d) Show that D is closed under nite pairwise disjoint unions, under complements, but is not closed under countable unions. Prove that D : D [0, 1] is not countably additive, where countable additivity means that if A1 , A2 , . . . D are pairwise disjoint and A = k=1 Ak D , then D (A) = k=1 D (Ak ). (e) Since D is closed under nite pairwise disjoint unions, you might think D is a ring. However, D is not even a semiring: Find sets A, B D (A, B = N) such that AB / D . Suggestion: If youre having trouble, consider the following interesting example [65, p. 571]: Fix any A0 / D . Put A = 2N and B = (2A0 ) (2Ac / A0 }. 0 + 1) = {2n ; n A0 } {2n + 1 ; n Show that A, B D (both have density 1/2) but A B = 2A0 and show that 2A0 / D. 7. (More on asymptotic density) As we saw in the previous problem, the set D on which asymptotic density is dened is not so well behaved. In this problem we study a subset of D that is a semiring. (i) Let a, b N. Prove that given any c Z, the equation ax by = c has a solution (x, y ) Z Z if and only if d divides c where d is the greatest common divisor of a and b. Moreover, in the case that ax by = c has a solution (x, y ) Z Z, it has innitely many solutions and all solutions are given as follows: If (x0 , y0 ) is any one solution of the equation with c = 1, then for general c Z, all solutions are of the form c b c a x = x0 + t , y = y0 + t , for all t Z. d d d d (ii) Let I be the collection of all subsets of N of the form A(a, r ) where a N, r = 0, 1, . . . , a 1, and A(a, r ) = aN + r = {an + r ; n N}. In the previous

3.4. OUTER MEASURES, MEASURES, AND CARATHEODORYS IDEA

151

problem you showed that D(A(a, r )) = 1/a. Using (i) prove that I is closed under intersections. (iii) Prove that A(a, r )c = 0ka1 , k=r A(a, k), where the union is over all integers k between 0 and a 1 except k = r . (iv) Prove that the dierence of any two elements of I is a union of pairwise disjoint elements of I . Conclude that I is a semiring. In particular, asymptotic density is a nitely additive probability set function on the semiring I . (This probability set function is not countably additive; for a proof using the extension theorem, see Problem 15 in Exercises 3.5.) 8. (A nitely additive paradox) (cf. [110, p. 94]) In this problem we show how nite additivity could lead to paradoxical results in probability. Given any set A N for which the asymptotic density exists, suppose that the probability that a natural number chosen at random lies in A is D(A). Two people, Jack and Jill, pick natural numbers at random and the one who picks the larger number wins. You call out either Jack or Jills name at random, and the person who you call on tells you his number; at this point you dont know what number the other person chose. However, show that the person you didnt call on wins the game with probability one.

3.4. Outer measures, measures, and Carath eodorys idea Given an additive set function : I [0, ] where I is a semiring of subsets of some set X , its natural to ask the following Question: Can we extend to be a measure on S (I )? That is, can we extend to a measure The answer is yes if and only if the original set function : I [0, ] is a measure.5 The idea to construct the extension is to rst dene what is called an outer measure from , : P (X ) [0, ], which is dened on all subsets of X . This map is generally not a measure. In this section we study outer measures and dene a -algebra M P (X ) such that : M [0, ] is a measure. In Section 3.5 we show that S (I ) M , so by restriction, : S (I ) [0, ] is a measure and we show that if : I [0, ] is a measure, then extends (that is, (I ) = (I ) for all I I ). 3.4.1. Outer measures. We begin this section by studying outer measures from the abstract point of view. In Carath eodorys 1918 book [74], [113, p. 48] he calls a map6 where P (X ) is the power set of a set X , an outer measure on X if (1) () = 0; (2) is countably subadditive in the sense that

: S (I ) [0, ]?

: P (X ) [0, ],

An
n=1

(A)

(An ).
n=1

Constantin Carath eodory (18731950).

(There is no condition on the disjointness of {An }.)


5The only if statement is obvious: If : S (I ) [0, ] is a measure extending the given on I , then by restricting to I S (I ), it follows that : I [0, ] is also a measure. 6Actually, Carath eodory worked with X = Rn and not a general set X .

152

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY

Note that is monotone in the sense that if A B , then (A) (B ). Indeed, A B and so by (1) and (2), we have (A) (B ) + () + () + = (B ). Thus, (A) (B ). Heres an example showing that an outer measure may not be a measure. By the way, the purpose of some examples is not they are useful in any practical sense but that they help us to understand statements: what they imply and what they dont imply.7
Example 3.3. Consider, for example, the set X = {a, b, c} consisting of three distinct elements and let : P (X ) [0, ] be dened by () = 0, (X ) = 2, and (A) = 1 otherwise. Note that if A = {a}, B = {b}, then A and B are disjoint and by denition of , we have (A B ) = 1, (A) = 1 and (B ) = 1. Hence, (A B ) < (A) + (B ). Therefore, is not additive. However, is countably subadditive. To see this, let A n An where A, A1 , A2 , . . . X . We must verify that (3.20) (A)
n=1

(An ).

We consider three cases. Case 1: A = . Then (A) = 0, so (3.20) is trivially true. Case 2: A = X . Then (A) = 2. If An = X for some n, then (An ) = 2, so (3.20) holds. If An = X for all n, then as X A1 A2 , there must be at least two dierent sets Ai and Aj amongst A1 , A2 , . . . that are not empty. Hence, (Ai ) + (Aj ) = 2, and (3.20) holds in this case too. Case 3: A = , X . Then (A) = 1. Since A n=1 An and A = , at least one set An cannot be empty; for this set, we have (An ) = 1 or 2, so (3.20) holds.

The most common way to construct an outer measure is the Lebesgue outer measure construction we briey discussed at the beginning of this chapter. Recall that the basic idea to assign a measure to an arbitrary subset of Euclidean space is to circumscribe the set with, in general innitely many, boxes. Thus, given a set A Rn , the idea is to cover A by countably many left-half open boxes: A k=1 Ik , where Ik I n for each k . For example, heres a picture with A is a disk, where we show three covers of A with rectangles:

Its obvious that the sum of the areas of the rectangles in the far right picture (assuming they dont overlap too much) give the best measurement of the true size of A. This shows why we should use covers of A by countably many boxes: the more boxes, the better the approximation. Now intuitively speaking, the smallest possible sum of areas of (countably many) rectangles covering A should equal to
7If you have to prove a theorem, do not rush. First of all, understand fully what the theorem

says, try to see clearly what it means. Then check the theorem; it could be false. Examine the consequences, verify as many particular instances as are needed to convince yourself of the truth. When you have satised yourself that the theorem is true, you can start proving it. George P olya (18871985) [314].

3.4. OUTER MEASURES, MEASURES, AND CARATHEODORYS IDEA

153

exact measure of A. Thus, it makes sense to dene8

(3.21)

m (A) := inf
k=1

m(Ik ) ; I1 , I2 , I3 , . . . I n cover A .

This denes an outer measure (because it measures A from the outside) m (A) [0, ] for each subset A Rn . Hence, we have dened a set function which is called Lebesgue outer measure. The fact that Lebesgue outer measure is an outer measure is a consequence of Theorem 3.9 below, where we show how to generate outer measures from arbitrary set functions. We remark that it is important in the denition (3.21) for m (A) to take countable covers of A rather than nite covers. For nite covers, the denition (3.21) is called outer content (see Problem 12), used by Camille Jordan (18381922) as the foundation of Riemann integration theory. Taking countable covers gives rise to Lebesgue integration theory. Before presenting the next theorem, we recall what inmum, or greatest lower bound, in (3.21) means. Let

m : P (Rn ) [0, ],

S :=
k=1

m(Ik ) ; I1 , I2 , I3 , . . . I n cover A .

Then by denition of inmum, (3.21) means (Inf 1) m (A) is a lower bound for S ; that is, m (A) k=1 m(Ik ) for any cover n {Ik } of A by elements of I . (Inf 2) m (A) is the greatest lower bound for S ; that is, if m (A) < , then cannot be a lower bound for S , which means there must be an element of S less than . Explicitly, there are sets {Ik } in I n such that

Ik
k=1

with
k=1

m(Ik ) < .

With this review of inmums fresh in memory, we can prove the following theorem.9 Construction of outer measures Theorem 3.9. Let A be any collection of subsets of a set X such that A and let : A [0, ] be a map such that () = 0. The collection A and the map are not assumed to have any other properties. For any A X , copying the denition (3.21) we dene

(3.22)

(A) := inf
n=1

(In ) ; I1 , I2 , I3 , . . . A cover A .

This denes a map which is an outer measure. : P (X ) [0, ],

of C , and we dene inf {+} = + so that (A) = + if all the sums in (3.22) equal +.

N 8The sums k=1 m(Ik ) include nite sums k=1 m(Ik ), N N, by taking Ik = for k > N . 9We dene inf = + so that (A) = + in (3.22) if there is no cover of A by elements

154

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY

Proof : We leave you to prove that () = 0 and lets move on to proving is countably subadditive. To this end, let A n=1 An ; we shall prove (3.23) If
n=1 (An ) n=1 (An ) is

(A)

n=1

(An ).

= , then (3.23) trivially holds, so we may assume that the sum nite. In particular, for each n N, we have (An ) < . We now use the 2 n principle: Given > 0, instead of proving (3.23) on the nose, we instead prove that (3.24) (A)
n=1

(An ) +

. 2n

To prove this, observe that since for each n N, we have (An ) < (An ) + 2 n, by (Inf 2) above with = (An ) + 2 n , we can cover An by sets {Inm }mN in A such that (3.25)
m=1

(Inm ) < (An ) +

. 2n

Since A is covered by {An }nN , A is also covered by {Inm }n,mN . Now order the countably many sets {Inm }n,mN in any way you wish, say {Ia1 , Ia2 , . . .}. Then by denition of inmum in (3.22), since {Ia1 , Ia2 , . . .} covers A, we have (3.26) (A)
n=1

(Ian ) =

(Inm ),

n=1 m=1

where we used Lemma 3.3. Replacing (3.25) into the far right-hand sum in (3.26) we get (3.24) as we wanted.

To get practice using the denitions (3.21) or (3.22), let us show that Lebesgue outer measure gives volumes consistent with our usual notion of volume.
Example 3.4. Let a = (a1 , . . . , an ), b = (b1 , . . . , bn ) Rn with ak < bk for each k and consider the open box (a, b). Is it true that m (a, b) = (b1 a1 ) (bn an )? Since (a, b) (a, b], by denition of m (a, b), we have m (a, b) m(a, b] = (b1 a1 ) (bn an ).
n To prove the opposite inequality, let (a, b) for each k. For k=1 Ik where Ik I every > 0 note that (a, b ] (a, b) k=1 Ik , where we take > 0 small enough so that (a, b ] is not empty. We know that m is a measure on I n , so it is countably subadditive, and hence

(b1 a1 ) (bn an ) = m(a, b ]

k=1

m(Ik ).

Since > 0 can be arbitrarily small, this inequality implies that ( b1 a 1 ) ( bn a n )


k=1

m(Ik ).

Thus, by denition of inmum, (b1 a1 ) (bn an ) m (a, b), which implies that m (a, b) = (b1 a1 ) (bn an ). Similar proofs show that the outer measure of any interval is its usual measure; for instance, m [0, 1] = 1 and the outer measure of a single point is zero.

3.4. OUTER MEASURES, MEASURES, AND CARATHEODORYS IDEA

155

Example 3.5. If A Rn is countable, A = {a1 , a2 , . . .} = k {ak } for some ak Rn . As m {ak } = 0 for each k, by countable subadditivity it follows that A has outer measure zero: m (A)
k=1

m {ak } =

0 = 0.

k=1

This example proves the following: Theorem 3.10. Any countable subset of Rn has measure zero. In particular, any subset of the rational numbers in R has measure zero and any subset Rn with positive outer measure must be uncountable. Thus, as m [0, 1] = 1, the interval [0, 1] is uncountable. In particular, since the rational numbers are countable, we have a measure-theoretic proof that irrational (= nonrational) numbers in [0, 1] exist and they form an uncountable subset of [0, 1]. Of course, this is one of the most dicult proofs that irrational numbers exist! It might surprise you that there are uncountable sets with measure zero. One example was dened by Cantor and will be studied in Section 4.5 to come. 3.4.2. Measurable sets and complete measures. From Theorem 3.4 we know that measures are outer measures, but outer measures may not be measures (as weve seen by examples). However, the celebrated Carath eodorys Theorem, see Theorem 3.11 below, shows how to construct measures from outer measures. The basic idea to do this for subsets of R was given in the introduction to this chapter, Section 3.1, where we showed that it is natural to consider a subset A R as being measurable if for any subset E R, we have m (E ) = m (E A) + m (E \ A). Of course, this idea can be applied to any outer measure! Thus, we shall declare a subset A X to be measurable, or -measurable to emphasize , if for any subset E X , we have (3.27) (E ) = (E A) + (E \ A). We can think of this as saying that A cleanly cuts any set in the sense that if any set E is sliced into parts in A and outside of A, then is additive on this decomposition as in this picture:
A EA E\A E (E ) = (E A) + (E \ A)

Sets that dont always cut cleanly are not measurable. Another, physics-type, interpretation of measurability is in terms of mass: If we think of as mass, then A is measurable means that A always conserves mass in the sense that given any E X , conservation of mass holds for the parts of E in A and not in A. The set of all measurable sets is denoted by M : M = A P (X ) ; for all E X , (E ) = (E A) + (E \ A) . One can easily check that and X are measurable; for instance, X is measurable because for any E X , we have E X = E and E \ X = , so (3.27) is just

156

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY

the tautology (E ) = (E ). Thus, M is not empty. We can rewrite (3.27) as follows. Since E \ A = E Ac where Ac is the complement of A, (3.27) is (E ) = (E A) + (E Ac ). We always have here, since outer measures are subadditive; measurability is thus determined by the inequality , that is, A M for all E X , (E ) (E A) + (E Ac ).

Example 3.6. Let X = {a, b, c} consist of three distinct elements and let be dened by () = 0, (X ) = 2, and (A) = 1 otherwise. In Example 3.3 we proved that is an outer measure. Let us determine M . We already know that and X are measurable, so let A X with A = , X . Since A = X , Ac = , therefore both A and Ac are nonempty. Let d, e {a, b, c} with d A and e Ac . If E = {d, e}, then E A = {d} and E Ac = {e}. Thus, by denition of , we have (E ) = 1, (E A) = 1 and (E Ac ) = 1. Therefore, (E ) = (E A) + (E Ac ), which implies that A is not measurable. It follows that M = {, X }.

For this example, note that M is trivially a -algebra and denes a measure on M . The Carath eodory Theorem 3.11 below states that for a general outer measure , M forms a -algebra on which denes a complete measure. Here, a measure : S [0, ] on a -algebra S is said to be complete if A B and B S with (B ) = 0 = A S. If this holds, then (A) = 0 also, since is monotone. In words, is complete any subset of a measurable set of measure zero is measurable (that is, must belong to the -algebra).
Example 3.7. Its easy to nd trivial examples of complete measures. For instance, given a -algebra S we can dene : S [0, ] by () = 0 and (A) = otherwise. Its easy to check that is a complete measure. Nontrivial examples of complete measures are Lebesgue measure and more generally, any outer measure restricted to its measurable sets, as well see in Carath eodorys theorem below.

3.4.3. Carath eodorys theorem: Outer measures to measures. Here is the celebrated theorem due to Carath eodory. Carath eodorys theorem Theorem 3.11. Let : P (X ) [0, ] be an outer measure and let M be the collection of measurable sets. Then M is a -algebra and the restriction of to M , : M [0, ], denes a measure. Moreover, in particular, : M [0, ] is a complete measure. A X and (A) = 0 = A M ;

Proof : We leave you to prove the statement involving the completeness property of . We break up the measure portion into two steps. Step 1: We show M is closed under unions and complements (such a system of sets is called an algebra of sets). To begin, we show that if A M ,

3.4. OUTER MEASURES, MEASURES, AND CARATHEODORYS IDEA

157

then Ac = X \ A M . Indeed, if E X , then since A M , (E ) = (E A) + (E Ac ) = (E Ac ) + (E (Ac )c ),

since A = (Ac )c . Thus, Ac M . We next show that M is closed under unions. Let A, B M . Then for any set E X , we need to show that (E ) = (E (A B )) + (E (A B )c ). To see this, we apply the denition of the measurability of B to obtain (3.28) (E (A B )) = (E (A B ) B ) + (E (A B ) B c ) = (E B ) + (E A B c ),

where we used that (A B ) B = B and B B c = . Now using the fact that E (A B )c = E Ac B c , we obtain (E (A B )) + (E (A B )c ) = (E B ) + (E A B c ) + (E Ac B c ). Since A is measurable, the sum of the last two terms is (E B c A) + (E B c Ac ) = (E B c ), so (E (A B )) + (E (A B )c ) = (E B ) + (E B c ) = (E ), since B is measurable. This shows that A B is measurable. Thus, M is an algebra of sets. In particular, since A B = (Ac B c )c , M is closed under intersections, and since A \ B = A B c , it is also closed under dierences. Step 2: Next, we show that M is a -algebra and is a measure on M . We already know that M and M is closed under complements, so we just have to prove that M is closed under countable unions. To this end, let A = n=1 An where the An s are measurable sets. We need to show that A is measurable. Replacing An with An \ (A1 An1 ), which is also measurable since M is closed under unions and dierences, we may assume that A1 , A2 , A3 , . . . are pairwise disjoint. Now given E X , we need to show that (E ) (E A) + (E Ac ). Our technique to prove this is to use the measurability of A1 , then A2 , then A3 , etc. to try and express (E ) in terms A = n=1 An . To start, observe that since A1 M , we have (E ) = (E A1 ) + (E Ac 1 ). Since A2 M we can write the second term as
c c c (E Ac 1 ) = (E A1 A2 ) + (E A1 A2 )

= (E A2 ) + (E (A1 A2 )c ),

where we used that Ac 1 A2 = A2 \ A1 = A2 (recalling A1 and A2 are disjoint) c c and De Morgans law Ac 1 A2 = (A1 A2 ) . Thus, (E ) = (E A1 ) + (E A2 ) + (E (A1 A2 )c ).
N

We now see the pattern: By using induction (left to you!) for any N N we have (E ) =
n=1

(E An ) + (E (A1 AN )c ).

158

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY

Since A1 AN A, we have Ac (A1 AN )c . Thus, since is monotone, we have (E Ac ) (E (A1 AN )c ), and so


N

(E )

n=1

(E An ) + (E Ac ).

This formula holds for any N , so taking N and using that limits preserve inequalities, we have (3.29) Now E A = (3.30)
n=1

(E )

n=1

(E An ) + (E Ac ).

E An , so recalling that is countably subadditive, we have (E A)


n=1

(E An ).

Combining this with (3.29), we see that which shows that A M . Moreover, putting E = A in (3.29) (noting that A An = An for each n and A Ac = ) and in (3.30) we see that (A)

(E ) (E A) + (E Ac ),

(An )
n=1

and

n=1

(A)

(An ).

n=1

This implies that (A) = This completes our proof.

(An ), which shows that is a measure on M .

In particular, given any set function : A [0, ] on a collection of sets A of subsets of a set X with A and () = 0, since is an outer measure, Carath eodorys theorem says that M , the -measurable sets, is a -algebra and : M [0, ] is a measure. Applying this result to Lebesgue measure we see that the collection of sets that are measurable with respect to Lebesgue outer measure m on P (Rn ) is a -algebra, which we denote by M n (= Mm ): M n := Sets that are measurable with respect to m . The collection M n is called the Lebesgue measurable sets. The completeness part of Carath eodorys theorem says and the measure part of Carath eodorys theorem says m : M n [0, ] A Rn and m (A) = 0 = A M n, : P (X ) [0, ]

is a measure, where we dropped the from m . We shall study more properties of M n in Sections 4.3 and 4.4.
Exercises 3.4. 1. In this exercise we compute various outer measures and their corresponding measurable sets. Let X be any nonempty set. For each case, show that is an outer measure and determine the measurable sets.

3.4. OUTER MEASURES, MEASURES, AND CARATHEODORYS IDEA

159

2.

3.

4. 5.

6.

(a) For any set A X , dene (A) as the number of points of A if A is nite and (A) = if A is innite. (b) For any set A X , dene (A) = 0 if A = and (A) = 1 otherwise. (c) Dene () = 0, (X ) = 2, and (A) = 1 otherwise. Consider the cases when X has one, two, and at least three elements. (d) Now assume that X is uncountable and for any set A X , dene (A) = 0 if A is countable and (A) = 1 if A is uncountable. Let : P (X ) [0, ] be an outer measure (a) Prove that a set A X is -measurable if and only if (B C ) = (B ) + (C ) for all sets B A and C Ac . (b) If is nitely additive, prove that is countably additive, that is, a measure. In this exercise we compute outer measures and their corresponding measurable sets. In Problems (a)(d) below, you are given a set function : A [0, ]; for each problem, (i) determine (A) for all A and (ii) determine all -measurable sets. (a) Let X be a nonempty set and let A consist of , X , and all singletons (sets consisting of one element). Assume that X has at least two elements and dene : A [0, ] by () = 0, (X ) = , and (A) = 1 for all singleton sets A. (Consider two cases, when X is nite and when X is innite.) (b) Let A be as in (a) where we assume that X is uncountable. Dene (X ) = 1 and (A) = 0 for all singleton sets A. (c) Let f : R R be the characteristic function of (0, ) and let = f and A = I 1 where f : I 1 [0, ) is the LebesgueStieltjes additive set function corresponding to f . Hint: = 0; that is, (A) = 0 for all A R. Note that (I ) = (I ) for all I = (a, b] with a 0 < b. (d) Let f : R R be the characteristic function of [0, ) and let = f and A = I 1 where f : I 1 [0, ) is the LebesgueStieltjes additive set function corresponding to f . Hint: (A) = 1 if 0 A and (A) = 0 if 0 / A. Remark: In (d), we have in particular (I ) = (I ) for all I I 1 , which is very dierent from what happens in (c). The dierence between (c) and (d) is that in Problem (d), is a measure while in (c), is only additive but is not a measure (see Theorem 3.2). The extension theorem found in the next section is the underlying reason for this dierence. Let : A [0, ] be as in Theorem 3.9. Prove that (A) (A) for all A A . Find an example of a set function and a set A A such that (A) = (A). In this problem we see that funny things can happen involving innities. (a) Let : S [0, ] be a complete measure on a -algebra S and let A, B S with (A) = (B ) < . Prove that given any set C such that A C B , we have C S . If (A) = (B ) = , can we still conclude that C S ? Prove it or provide a counterexample. (b) Let : S [0, ] be a measure and let A S have nite measure. Let {An } be a sequence of pairwise disjoint subsets An A with An S for each n and assume that (A) = n=1 (An ). Prove that (A \ n=1 An ) = 0. What if we drop the assumption that (A) < ; is the result still true? In this problem we look at properties of Lebesgue outer measure. (a) Let a = (a1 , . . . , an ), b = (b1 , . . . , bn ) Rn with ak bk for each k. Using the denition (3.21) of Lebesgue outer measure, prove that m (a, b] = m [a, b] = m [a, b) = (b1 a1 ) (bn an ), the usual notion of volume. Of course, this formula holds for all types of bounded boxes given as products of the various sorts of intervals in R. (b) Using the denition of Lebesgue outer measure, prove that given an integer 1 k n, show that {0}k Rnk has measure zero as a subset of Rn .

160

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY

7. Let A denote the set of open intervals in R and let : A [0, ] assign to each such interval its standard length. Show that (A) = m (A) for all A R, where m is Lebesgue outer measure dened in (3.21), which uses only left-half open intervals. (Remark: Of course this problem applies equally well if you replace A by your favorite type(s) of bounded intervals in R: open, closed, right-half open, etc. This problem also applies to Rn , where A is a collection of your favorite bounded boxes.) 8. Let Icn denote the set of all left-half open cubes, where a cube is a box whose sides have the same length. Let c : Icn [0, ] assign to each such box its standard n volume. Show that c (A) = m (A) for all subsets A R . 9. Let f : [a, b] R be a continuous, hence uniformly continuous, function. Let A = {(x, f (x)) ; x [a, b]} R2 be the graph of f . In this problem we prove that m (A) = 0. (i) Let > 0 be arbitrary and choose > 0 so that |f (x) f (y )| < for |x y | < . Using this fact show that we can write [a, b] = N k=1 Ik where the Ik s are intervals with pairwise disjoint interiors such that for some points ak Ik , we have
N

k=1

Ik [f (ak ) , f (ak ) + ].

(ii) Prove that m (A) 2(b a). Conclude that m (A) = 0. 10. We generalize the previous problem to graphs in Rn . (a) Let K Rn1 be a compact set and let f : K R be a continuous, hence uniformly continuous, function. Let A = {(x, f (x)) ; x K } Rn be the graph of f . Prove that m (A) = 0. 2 (b) Show that the sphere Sn1 := {x Rn ; x2 1 + + xn = 1} has measure zero. n 11. Let A R be nonempty (otherwise arbitrary). (a) If a R, prove that (a, ) x m (A (a, x]n ) is continuous, nondecreasing. (b) Prove that limx m (A (a, x]n ) = m (A (a, )n ). (c) Prove that lima m (A (a, )n ) = m (A). (d) Prove that if 0 b < m (A), then there is a B A with m (B ) = b. (As a corollary, Lebesgue measure is nonatomic; cf. Problems 5 and 6 in Exercises 3.2.) 12. (Outer content) For any subset A Rn , dene
N

c(A) := inf
k=1

m(Ik ) ; N N, I1 , I2 , . . . , IN I n cover A .

We dene inf := . The number c(A) is called the outer (Jordan) content of A and was introduced by Camille Jordan (18381922) in 1892. (i) Show that c : P (Rn ) [0, ] is nitely subadditive. (ii) Show that c{a} = 0 for any point a Rn . (iii) Let A be a dense subset of [0, 1]n (e.g. A is the set of all points in [0, 1]n with rational coordinates). Show that c(A) = 1. Suggestion: If A N k=1 (ak , bk ], then taking closures of both sides we obtain10 [0, 1]n N k=1 [ak , bk ]. Can you use this fact to show that c(A) = 1? (iv) Show that c : P (Rn ) [0, ] is not countably subadditive. (v) Finally, show that c : P (Rn ) [0, ] is not nitely additive. Suggestion: Let A = Qn [0, 1]n and B = [0, 1]n \ A. Find c(A B ), c(A) and c(B ). (vi) Show that m (A) c(A) for all A Rn . In particular, a set with zero content has zero measure. The converse is false; however, prove the following: (vii) If A Rn , then c(A) = 0 if and only if the closure A compact and m (A) = 0. 13. (Nonmeasurable sets; cf. [241].) Let : P (X ) [0, ] be an outer measure. Observe that if A is measurable, then there is a measurable set B such that A B and (B \ A) = 0; just take B = A. For a nonmeasurable set, this property is false.
From topology, if B C1 CN , a nite union, then B C 1 C N , where the bar above the set represents closure of the set. (This fact is not true for countable unions.)
10

3.5. THE EXTENSION THEOREM AND REGULARITY PROPERTIES OF MEASURES 161

(a) Let A be a nonmeasurable set; that is, let A X with A / M . Show there is an > 0 such that for any B M with A B , we have (B \ A) . Suggestion: If not, then for each n, there is a measurable set Bn A with (Bn \ A) < 1/n. Let B = n=1 Bn and use B to show that A is measurable. (b) Given a nonmeasurable set A, show that there exists an > 0 such that for any measurable sets B, C M with B A and C Ac , we have (B C ) .

3.5. The extension theorem and regularity properties of measures The big theorem in this section is the extension theorem which states in particular that a measure on a semiring can always be extended to be a measure on the -algebra generated by the semiring. For instance, we can extend Lebesgue measure from I n to the Borel sets and we can extend any probability measure on the cylinder sets of a sequence space to the -algebra generated by the cylinder sets. We end this section by answering some important uniqueness questions involving Carath eodorys denition of measurability. 3.5.1. The extension theorem. The following lemma implies that given an additive set function on a semiring : I [0, ] we always have S (I ) M and we can extend to a measure on S (I ) using if and only if is a measure. The extension theorem, Theorem 3.13 which follows this lemma, discusses the uniqueness of the extension.

Lemma 3.12. Let : I [0, ] be a nitely additive set function on a semiring I of subsets of a set X , let be the generated outer measure dened in Equation (3.22) of Theorem 3.9, and let M denote the -measurable sets. Then (1) S (I ) M , where S (I ) is the -algebra generated by I . In particular, restricting to S (I ), : S (I ) [0, ] is a measure. (2) extends (i.e. (I ) = (I ) for all I I ) if and only if : I [0, ] is a measure.
Proof : Throughout this proof we denote S (I ) by S . Proof of (1): Since M is a -algebra and S is the smallest -algebra containing I , to prove that S M we just need to show that I M , that is, given A I , we need to prove that for all E X , If (E ) = +, then this inequality is satised, so let E X and assume that (E ) < +. We now use the -principle! Thus, x > 0; our result then follows if we can prove that (3.31) (E A) + (E \ A) (E ) + . (E ) := inf
n=1

: P (X ) [0, ]

(E A) + (E \ A) (E ).

To prove this, recall that (3.32) (In ) ; I1 , I2 , I3 , . . . I cover E .

162

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY

Now (E ) < (E ) + , so by denition of inmum, there are sets I1 , I2 , . . . I such that (3.33) E
n=1

In

with

n=1

(In ) < (E ) + .

Note that E A n=1 (In A) and In A I and that E \A n=1 (In \A). By the denition of a semiring, for each n (since In , A I ) we have In \ A = m Inm for some nitely many pairwise disjoint elements Inm I . Thus, E \ A nm Inm . Hence, by denition of (E A) and (E \ A), we have

(3.34)

(E A)

n=1

(In A)

and

(E \ A)

(Inm ).
n,m

Now observe that I n = (I n A ) (I n \ A ) = (I n A ) is a union of pairwise disjoint sets, so, as is additive, (In ) = (In A) + (Inm ).
m

Inm
m

Therefore, adding the inequalities in (3.34), we obtain (E A) + (E \ A)


n=1

(In ),

which according to (3.33), is (E ) + . This proves (3.31). Proof of (2): Assume that extends ; we show that : I [0, ] is a measure. Indeed, by (1), : S (I ) [0, ] is a measure, so by restricting to I S (I ) it follows that : I [0, ] is also a measure. Since = on I , we conclude that is a measure. Conversely, given : I [0, ] is a measure and given I I , we need to show that (I ) = (I ), or Since I covers itself, by denition of lower bound in the denition of (I ), we have (I ) (I ). On the other hand, since is a measure it is countably subadditive, so for any I1 , I2 , . . . I such that I n=1 In , we have (I ) n=1 (In ). Hence, (I ) is a lower bound for the set on the right-hand side of (3.32) for E = I . Since (I ) is the greatest such lower bound, it follows that (I ) (I ). Therefore, (I ) = (I ). (I ) (I ) and (I ) (I ).

We now come to the main result of this section, but rst a denition: An additive set function : I [0, ] on a semiring I is said to be -nite if we can write X = X1 X2 X3 X4 for some pairwise disjoint sets X1 , X2 , . . . I with (Xn ) < for all n.

Example 3.8. Lebesgue measure on I n is -nite because Rn is covered by, for instance, unit volume cubes with integer vertices. Example 3.9. If : I [0, ) is any real-valued additive set function (never takes on the value ) and X I , then is -nite. Indeed, just take X1 = X and X2 = X3 = = . Then X = X1 X2 with (Xn ) < for each n. In particular, any probability set function is -nite.

3.5. THE EXTENSION THEOREM AND REGULARITY PROPERTIES OF MEASURES 163

M T S (I ) I

is a measure here T is a -algebra containing I dened here

Figure 3.6. The disks represent the nondecreasing sequence of sets


I S (I ) T M . Here, T M is a -algebra containing I and therefore S (I ) T (since S (I ) is the smallest -algebra containing I ). The extension theorem says that restricting to T is a measure extending and this extension is unique if is -nite.

We now state the extension theorem, a schematic of which is in Figure 3.6. The extension theorem Theorem 3.13. Let : I [0, ] be a measure on a semiring I and let T M be a -algebra containing I (e.g. T = S (I )). Then (1) (Existence:) The restriction of to T , is a measure extending ; that is, (I ) = (I ) for all I I . (2) (Uniqueness:) If is -nite, then is the only extension of to T . In the -nite case we always drop the superscript and write : T [0, ] for the extension .
Proof : Let T M be a -algebra. Then : T [0, ] is a measure because is a measure on M , so by restriction it denes a measure on T M . This measure extends thanks to Part (2) of the previous lemma. We now prove uniqueness assuming -niteness; see Problem 1 for examples showing that uniqueness may fail if the -nite assumption is dropped. Let : T [0, ] be a measure extending ; we shall prove that = . Let A T . To prove that (A) = (A) we need to show that (3.35) (A) (A) and (A) (A).

: T [0, ],

: T [0, ]

To prove the left inequality, observe that if A n=1 In where In I for each n, then by countable subadditivity and the fact that = on I , we have (A )
n=1

(I n ) =

n=1

(In ).

Therefore, (A) is a lower bound for all sums appearing on the right; since (A) is the greatest lower bound of such sums it follows that (A) (A). To prove the second inequality in (3.35), write X = n=1 Xn where {Xn } is a sequence of pairwise disjoint sets in I with (Xn ) < for each n. Then A= n=1 (A Xn ), which is a countable union of pairwise disjoint elements of

164

3. MEASURE AND PROBABILITY: COUNTABLE ADDITIVITY

T , so we have (A) =

n=1

(A Xn )

and

(A ) =

n=1

(A Xn ) .

(3.36)

Thus, to prove that (A) (A), given n N it suces to prove that (A Xn ) (A Xn ). Xn = (A Xn ) Bn , and (Xn ) = (A Xn ) + (Bn ).

To prove this, write Xn as a union of disjoint sets, where Bn = Xn \ A. By additivity,

Now Xn I , and hence (Xn ) = (Xn ) (= (Xn )), therefore (A Xn ) + (Bn ) = (A Xn ) + (Bn ).

(Xn ) = (A Xn ) + (Bn )

All terms here are nite (because (Xn ) < ), so we can subtract without fear of subtracting innities, and get We already proved the left-hand inequality in (3.35); that argument works for any element of T , so in particular we have (Bn ) (Bn ). It follows that (A Xn ) (A Xn ) 0, which proves (3.36) and completes our proof. (A Xn ) (A Xn ) = (Bn ) (Bn ).

Warning: Ive seen the extension theorem called the Carath eodory extension theorem, the Carath eodoryFr echet extension theorem, the Carath eodoryHopf extension theorem, the Hopf extension theorem, the HahnKolmogorov extension theorem, and many others that I cant remember! However, the theorem is originally due to Maurice Fr echet (18781973) who proved it in 1924 [141]. Maurice Fr echet 3.5.2. Some applications of the extension theorem. The extension (18781973). theorem is a general theorem dealing with extensions of measures on abstract semirings. Now, what is an abstract theorem good for? Indeed,
The apex and culmination of modern mathematics is a theorem so perfectly general that no particular application of it is feasible. George P olya (18871985).

In this case, the general extension theorem does have applications and is usually applied to the -algebra T = S (I ). For instance, we know from Theorems 3.1, 3.2 and 3.7, all in Section 3.2, that Lebesgue measure m on I n , real-valued measures on I 1 correspond to LebesgueStieltjes measures of right-continuous functions, and the innite product probability measure on the cylinder sets of a sequence space, are all countably additive. Hence, the extension theorem gives Lebesgue, LebesgueStieltjes, and innite product measures Theorem 3.14. (1) There exists a unique measure on the Borel sets B n extending Lebesgue measure m on I n . This extension is usually called Borel measure and sometimes Lebesgue measure, and is denoted by m. (2) A set function : B [0, ], which is nite on intervals, is a measure if and only if = f , the LebesgueStieltjes measure, where f : R R is a right-continuous nondecreasing function.

3.5. THE EXTENSION THEOREM AND REGULARITY PROPERTIES OF MEASURES 165

(3) Given probability measures on countably many sample spaces Xi , i = 1, 2, . . ., let : C [0, 1] be the induced product measure. Then there exists a unique measure on S (C ) that extends . This extension is called the (innite) product measure on S (C ) and is denoted by . A more elaborated statement of Part (2) is: Given probability measures {i } on countably many semirings on sample spaces {Xi }, there exists a unique measure (the innite product measure) : S (C ) [0, 1], where C is the set of cylinder subsets of i=1 Xi , that gives the natural measure on cylinder sets. By natural we mean that on a cylinder set A1 A2 An Xn+1 Xn+2 , we have (A1 A2 An Xn+1 Xn+2 ) = 1 (A1 ) 2 (A2 ) n (An ). Theorem 3.14 follows from the extension theorem and from the fact that m and are -nite (Examples 3.8 and 3.9).11 Part (3) of Theorem 3.14 is a special case of the DaniellKolgomorov theorem studied in Problem 13. Here are some examples applying the extension theorem.
Example 3.10. By Theorem 3.14 we know there's a unique measure on, say, the Borel sets, namely Lebesgue measure m, that agrees with the usual notion of volume on left-half open boxes. Since all the variety of boxes are Borel sets, we can therefore determine the Lebesgue measure of any sort of box using the properties of measures. Of course, from Example 3.4 in Section 3.4 and Problem 6 back in Exercises 3.4, we already know that the Lebesgue measure of any sort of box is its usual volume. However, we can easily verify this now using properties of measures. For example, given an open box (a, b) with a = (a_1, ..., a_n), b = (b_1, ..., b_n) ∈ R^n and a_i < b_i for each i, we can write

(a, b) = ⋃_{k=1}^∞ (a, b − 1/k],

where (a, b − 1/k] := (a_1, b_1 − 1/k] × ⋯ × (a_n, b_n − 1/k] and where the union is nondecreasing. Since measures are continuous (see Theorem 3.5) we conclude that

m(a, b) = lim_{k→∞} m(a, b − 1/k] = lim_{k→∞} (b_1 − a_1 − 1/k) ⋯ (b_n − a_n − 1/k) = (b_1 − a_1) ⋯ (b_n − a_n),

just what we expected.

Example 3.11. (A uniqueness result:) Let μ : B^n → [0, ∞] be a measure on the Borel sets. Suppose that on left-half open boxes, μ dilates their standard volume by a fixed constant; that is, there is a constant λ > 0 such that μ(I) = λ m(I) for all I ∈ I^n. Then we claim that μ = λ m on all Borel sets! Indeed, (1/λ)μ : B^n → [0, ∞] is a measure and it agrees with m on I^n. Therefore by the uniqueness part of Theorem 3.14, we must have (1/λ)μ = m on all of B^n.
¹¹ Technically speaking, the extension theorem deals with measures with ranges in [0, ∞] rather than [0, 1], so you should think about why Part (3) of Theorem 3.14 holds.


3.5.3. Regular outer measures and the uniqueness of measurable sets. Given an outer measure μ*, you may ask the following:

Question: What right does M_{μ*} have to be called the μ*-measurable sets?

We shall give two precise reformulations of this question with their answers. Here's our first formulation:

Question 1: Can we find a strictly larger class of sets on which μ* defines a measure? In other words, does there exist a σ-algebra S such that M_{μ*} ⊆ S with M_{μ*} ≠ S and μ* : S → [0, ∞] is a measure?

The answer is no for regular outer measures: An outer measure μ* is said to be regular if given any A ⊆ X there is a B ∈ M_{μ*} such that A ⊆ B and μ*(A) = μ*(B). Roughly speaking, if we think of an arbitrary set A as being possibly quite ugly and we think of the set B ∈ M_{μ*} as being nice, then regularity basically says that we can determine the outer measures of ugly sets by only considering nice elements of M_{μ*}; here's a picture of an ugly set A on the left and a nicer (not so jagged) set B ∈ M_{μ*} on the right containing A:

[Figure: a jagged set A contained in a smoother set B ∈ M_{μ*}, with A ⊆ B and μ*(A) = μ*(B).]

Every outer measure we encounter in practice (e.g. Lebesgue measure) is regular (see Theorem 3.16 below). Some properties of regular outer measures are explored in Problem 12. The following theorem answers Question 1.

Proposition 3.15. For a regular outer measure μ*, there is no σ-algebra strictly larger than the μ*-measurable sets on which μ* defines a measure.

This proposition justifies the term the μ*-measurable sets. We won't use this result in the sequel, so we leave the proof of this proposition to Problem 5. The following example shows that the regularity assumption in the proposition cannot be dropped.
Example 3.12. Consider a previous example: X = {a, b, c} consisting of three distinct elements and μ*(∅) = 0, μ*(X) = 2, and μ*(E) = 1 otherwise. We showed that M_{μ*} = {∅, X}. If A = {a}, then the only measurable set containing A is X, and μ*(A) = 1 ≠ 2 = μ*(X), so μ* is not regular. Consider the σ-algebra S = {∅, A, A^c, X}. Note that M_{μ*} ⊆ S with M_{μ*} ≠ S, and μ*(X) = 2 = 1 + 1 = μ*(A) + μ*(A^c). Using this fact one can check that μ* is a measure on S. Thus, S is a strictly larger σ-algebra than the μ*-measurable sets on which μ* defines a measure.
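Because the space has only three points, this example can be checked mechanically. Here is a short Python sketch of mine (not from the text) that enumerates all subsets of X = {a, b, c}, tests Carathéodory's condition for each of them, and confirms both that only ∅ and X are measurable and that μ* is nevertheless additive on the larger σ-algebra S = {∅, {a}, {a}^c, X}.

```python
from itertools import combinations

X = frozenset({"a", "b", "c"})

def subsets(s):
    items = list(s)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]

def mu_star(E):
    # The outer measure of Example 3.12: 0 on the empty set, 2 on all of X, 1 otherwise.
    if len(E) == 0:
        return 0
    if E == X:
        return 2
    return 1

def measurable(A):
    # Caratheodory's condition: A splits every test set E additively.
    return all(mu_star(E & A) + mu_star(E - A) == mu_star(E) for E in subsets(X))

M = [E for E in subsets(X) if measurable(E)]
print([sorted(E) for E in M])                       # [[], ['a', 'b', 'c']]: only the empty set and X

# The sigma-algebra S = {empty set, {a}, {b,c}, X} is strictly larger than M,
# yet mu_star is still additive on it:
A = frozenset({"a"})
print(mu_star(A) + mu_star(X - A) == mu_star(X))    # True: 1 + 1 == 2
```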

We remark that Proposition 3.15 does not say that M_{μ*} is the largest σ-algebra on which μ* defines a measure. Here, to say that M_{μ*} is the largest means that if S is a σ-algebra and μ* : S → [0, ∞] is a measure, then S ⊆ M_{μ*}. In other words, "M_{μ*} is largest" means that if μ* is a measure on a σ-algebra S, then M_{μ*} must contain S. Thus, we ask:

Question 2: Is M_{μ*} the largest σ-algebra on which μ* is a measure?

In Problem 14, you will show that the answer is in general no, even if μ* is regular! However, if the outer measure is generated from an additive set function on a


semiring, it is true that the measurable sets form the largest σ-algebra containing the semiring on which the outer measure is a measure; this is the content of . . .

The regularity theorem

Theorem 3.16. Let μ : I → [0, ∞] be an additive set function on a semiring I of subsets of a set X and let μ* : P(X) → [0, ∞] be the outer measure defined in Equation (3.22) of Theorem 3.9. Then
(1) μ* is a regular outer measure. In fact, the following stronger regularity property holds: Given any A ⊆ X, there is a B ∈ S(I) ⊆ M_{μ*} such that A ⊆ B and μ*(A) = μ*(B).
(2) M_{μ*}, the μ*-measurable sets, is the largest σ-algebra containing I on which μ* defines a measure.

Thus, Carathéodory's definition of measurable sets produces the largest σ-algebra that contains I for which μ* defines a measure. It's in this sense that Carathéodory's definition is the optimal definition one can possibly hope for. The proof of the regularity theorem involves an ingredient in the proof of Lemma 3.12, so to avoid repetition of what we have already done, we shall leave the proof of the Regularity Theorem for Problems 6 and 8.
Exercises 3.5. 1. In the following examples of measures : I [0, ], determine the outer measure , nd M , and show that does not have a unique extension to S (I ). (a) Let X be a set consisting of more than one element and let B X be a proper nonempty subset of X . Let I = {, B } and dene : I [0, ] by () = 0 and (B ) = . (b) Let I = I 1 and dene : I [0, ] by (I ) = if I = and () = 0. 2. Let f : R R be nondecreasing and right continuous, so that f : I 1 [0, ) is a measure and hence f : Mf [0, ] is a measure, where Mf is the set of f measurable sets. Here, we drop the superscript from f . Recalling the properties of measures found in e.g. Theorem 3.5, consider the following. (a) For a R, show that f {a} = f (a) f (a), where f (a) is the left-hand limit of f at a. In particular, f is continuous at a if and only if f {a} = 0. (b) For a R, show that f (a, ) = f () f (a), where f () := limx f (x). (c) Using (a) and (b), for any a, b R, derive formulas for f [a, b], f [a, b), f (a, b), f [a, ), f (, a), f (, a], and f (R). In particular, show that for f = x, which gives the standard Lebesgue measure, the measure of these sets are what they intuitively should be. (d) Assume now that f is strictly increasing, that is, f (x) < f (y ) if x < y . Prove that f (A) > 0 for any subset A R that has an interior point. 3. Let : B [0, ] be a Borel measure, which means is a measure on the Borel sets B of R. Assume that (K ) < for all compact sets K R. (a) Using Problem 4 in Exercises 1.6, prove that there exists a nondecreasing rightcontinuous function f : R R such that equals the restriction of the Lebesgue Stieltjes outer measure f to the Borel sets B . Thus, Borel measures that are nite on compacta are just LebesgueStieltjes measures. (b) Suppose that is a nite Borel measure, which means that (R) < . Dene f :RR by f (x) := (, x] for all x R.



Show (i) f is nondecreasing, (ii) f is right-continuous, and (iii) the measure equals the restriction of f to B . 4. (Carath eodorys separation theorem) Given sets A, B Rn , the distance between them is dened by where |x y | denotes the standard Euclidean distance between x and y . We shall prove Carath eodorys 1914 result: If A, B Rn and d(A, B ) > 0, then Thus, Lebesgue outer measure is additive on sets that are separated by a positive distance. (Thus, if A, B Rn are disjoint, in order for m (A B ) = m (A) + m (B ), the sets A and B have to satisfy d(A, B ) = 0.) (i) Let E Rn and dene f : Rn R by f (x) := d({x}, E ), the distance between the point x and the set E . Prove that f is a continuous function. (ii) Let A, B Rn with d(A, B ) =: r > 0 and let C = {x Rn ; d({x}, B ) r/2}. Prove that C is a closed set such that A C and C B = . (iii) Prove that m (A B ) = m (A) + m (B ). 5. Prove Proposition 3.15. Precisely, let S be a -algebra of subsets of X such that M S and denes a measure on S . To prove that S M , let A S . Show that A M , which means given E X , we have Suggestion: Apply the regularity assumption to E ; that is, there is a set B M such that E B and (E ) = (B ). 6. In this problem we prove the regularity part of Theorem 3.16. Let : I [0, ] be an additive set function on a semiring I of subsets of a set X and let A X . We shall prove that there is a B S (I ) such that A B and (A) = (B ). This proves that is regular since S (I ) M by Lemma 3.12. (i) If (A) = , prove that such a B exists. (ii) Assume now that (A) < . Given > 0, show that there is a set B S (I ) such that A B and (3.37) Suggestion: Review the argument to get (3.33) in Lemma 3.12. (iii) In particular, for each k = 1, 2, . . ., putting = 1/k in (3.37) we can nd a Bk S (I ) such that A Bk , and 1 (Bk ) (A) + . k Let B = k=1 Bk . Show that B S (I ), A B , and (A) = (B ). 7. (Improved regularity theorem) In this problem we improve the regularity part of Theorem 3.16. Let : I [0, ] be a -nite additive set function on a semiring I of subsets of a set X and let A X . (i) Prove there is a B S (I ) such that (ii) Find an example showing we cannot drop the -nite condition. Suggestion: For (i), rst prove the result in the case (A) < . In general, show that A = n=1 An where A1 A2 A3 and (An ) < for each n. For each n nd a Bn having the property for An and from the Bn s form the desired B . 8. We now complete the proof of Theorem 3.16. Let : I [0, ] be an additive set function on a semiring I of subsets of a set X . Prove that if S is a -algebra with I S , and if : S [0, ] is a measure, then S M . Suggestion: Given A S , you must show that A M , which means for all E X , (E A) + (E Ac ) (E ). AB and (A E ) = (B E ) for all -measurable sets E . (A) (B ) (A) + . (E A) + (E Ac ) (E ). m (A B ) = m (A) + m (B ). d(A, B ) := inf {|x y | ; x A & y B },


9.

10.

11.

12.

Given E , use regularity: There is a B S (I ) such that E B and (E ) = (B ). (Uniqueness of Lebesgue measure I) Let : B [0, ] be a measure on the Borel sets of R which is nite on intervals and suppose that is translation invariant on I 1 (which means (x + I ) = (I ) for all x R and I I 1 , where x + I := {x + y ; y I }). Using any facts from Exercises 1.6, prove that = m, where = (0, 1] and m denotes Lebesgue measure. (See Problem 7 in Exercises 4.4 for a higher dimensional generalization.) In particular, if is translation invariant and (I ) = m(I ) for some nonempty I I 1 , then in fact = m. Remark: Translation invariance is a strong condition; under weaker conditions we still get = m. In fact, it follows from the next problem that if = m on elementary gures of a xed Lebesgue measure, then = m on all Borel sets. (Uniqueness of Lebesgue measure II) Let : B [0, ] be a measure on the Borel sets of R which is nite on intervals and let (0, ). (i) If = m on all elements of I 1 with Lebesgue measure ; need it be true that = m on all Borel sets? Prove it or give a counterexample. (ii) If = m on all elements of E 1 := R (I 1 ) with Lebesgue measure , prove that = m on all Borel sets. Suggestion: If A E 1 with m(A) = /2, prove that (A) = /2. Use induction to prove that for all k N, if A E 1 and m(A) = /2k , then (A) = /2k . Now prove that = on all elements of I 1 . Remark: Thus, Lebesgue measure is completely determined by its measure on elementary gures of a xed measure. In the following problem we prove a version of this result for general nonatomic -nite measures. (Uniqueness of nonatomic measures) Please review Problems 5 and 6 in Exercises 3.2 and use any results in those problems. Let (X, S , ) be a nonatomic -nite measure space and let 0 < < (X ). (Here, (X ) = is allowed.) We shall prove: If : S [0, ] is a measure and = on elements of S with -measure , then = on all elements of S . (i) Let t < min{, (X ) }. If A, B S with (A) = (B ) = t, prove that (A) = (B ). Suggestion: Show there is a C S disjoint to A and B with (C ) = t. (ii) Let n N with /n < min{, (X ) }. If A S and (A) = /n, prove that (A) = /n. (iii) Show that for all positive rational numbers r , if A S with (A) = r, then (A) = r as well. (iv) Finally, prove the result. Let be a regular outer measure on P (X ) and assume that (X ) < . (a) Prove that a set A X is measurable if and only if (X ) = (A) + (Ac ). Suggestion: To prove the if part, let B be measurable such that A B and (A) = (B ). Prove that (B \ A) = 0 and conclude that A is measurable. The inner measure of a set A X is by denition (A) := (X ) (Ac ). Show that A is measurable if and only if (A) = (A). Show that the if statement is false in Part (a) when is not regular; that is, give an example of a non-regular outer measure and a set A X such that (X ) = (A) + (Ac ), but A is not measurable (that is, A / M ). Consider the set function : C [0, ], where = and C = M . If is regular, prove that the outer measure generated by is exactly . (To prove this you do not need the assumption that (X ) < .) Roughly speaking, the outer measure generated by a regular outer measure is same outer measure we started out with. If is not regular, this is not true: Give an example of a non-regular outer measure such that = .

(b) (c)

(d)

(e)


13. (Kolmogorovs fundamental theorem) In this problem we prove a baby version of whats now called the (Daniell)Kolmogorovs extension theorem, which Kolmogorov named the fundamental theorem in his 1933 book [215], another version of which was published by Percy Daniell (18891946) in 1919 [92]. (i) Given probability measures i : Ii [0, 1], i N, where Ii is a semiring on a sample space Xi , let : S (C ) [0, 1] be the innite product measure on the -algebra generated by the cylinder subsets of i=1 Xi . For n N and A I1 In , dene n : I1 In [0, 1] by Xi ; (x1 , . . . , xn ) A}. where A Xn+1 Xn+2 = {(x1 , x2 , . . .) Prove that n is a measure and for any A I1 In , prove the following consistency condition: (ii) We now prove the converse: Suppose that for each n N we are given a probability measure n : I1 In [0, 1] satisfying the above consistency condition. Prove there exists a unique probability measure : S (C ) [0, 1] satisfying Equation (DK) above for all n N and A I1 In . 14. (Nonmeasurable sets) Let : P (X ) [0, ] be an outer measure (regular or not) and suppose that M = P (X ). Then there is nonmeasurable set A X ; that is, there is a set A X such that A / M . Show that if (X ) = , then S := {, A, Ac , X } is a -algebra on which denes a measure, but S M . 15. (More on asymptotic density) Let I be the collection of all subsets of N of the form A(a, r ) = aN + r = {an + r ; n N} where a N, r = 0, 1, . . . , a 1. Dene by D(A(a, r )) = 1/a; from Problem 7 in Exercises 3.2 we know that D is nitely additive. Prove that D is not countably additive. Suggestion: If it was, then it would have an extension to a measure D : S (I ) [0, 1]. Show that singletons belongs to S (I ) and D{n} = 0 for all n N. 16. (Liouville numbers) A real number is called a Liouville number, named after Joseph Liouville (18091882), if is irrational12 and it has the property that for each k N there is a rational number p/q = such that p 1 < k. q q D : I [0, 1] n (A) = n+1 (A Xn+1 ). n (A) := A Xn+1 Xn+2 (DK),
i=1

Liouville numbers are important in number theory (for instance, Liouville numbers are transcendental, which means they are not roots of any polynomial with integer coecients). Prove that the set of all Liouville numbers has measure zero. Suggestion: Heres one way to go about a proof. Let L be the set of Liouville numbers, let r1 , r2 , . . . be a list of all rationals, where we write rn = pn /qn in lowest terms, and notice that L = Qc

Ikn ,

k=1 n=1

k k where Ikn = (rn 1/qn , rn + 1/qn ).

Now given N, show that m(L [, ]) = 0. Finally, show that m(L) = 0. 17. (Measures via discrete approximations) Let A = n=1 In where I1 , I2 , . . . are pairwise disjoint intervals in [0, 1] and suppose that [0, 1] \ A is also a countable union of pairwise disjoint intervals. Let Pn = {1/n, 2/n, . . . , (n 1)/n, 1}. In this problem we prove that #(A Pn ) m(A) = lim . n n
We dont have to make the assumption that is irrational since one can prove that must be irrational in order to satisfy the inequality property.
12


Note that #(A Pn )/n = [number of elements of (A Pn )]/n can be thought of as some type of density of points in A of the form 1/n, 2/n, . . .. This limit shows that these densities approach the measure of A. (i) If I [0, 1] is an interval, prove that (ii) Let > 0 and let [0, 1] \ A = n=1 Jn where the Jn s are pairwise disjoint intervals (so that [0, 1] = n=1 In n=1 Jn . Prove that there is an N such that N m(A \ IN ) and m([0, 1] \ A \ JN ) where IN = N n=1 In and JN = n=1 Jn . (iii) Prove that (2) #(IN Pn ) #(A Pn ) n #(JN Pn ), (1) nm(KN ) N #(KN Pn ) nm(KN ) + N nm(I ) 1 #(I Pn ) nm(I ) + 1.

where in the rst line, K = I or J . (iv) From these inequalities in (iii), prove the desired result. (v) The assumption that [0, 1] \ A can be written as a union of intervals is important. Find pairwise disjoint intervals I1 , I2 , . . . such that m(A) = lim #(A Pn )/n.
n

Suggestion: If youre having trouble nding such intervals, consider intervals such as found in Problem 3 of Exercises 3.2.

Remarks
3.1: A popular way to define a Lebesgue measurable set is using both outer and inner measures. If A ⊆ (a, b), Lebesgue defined the inner measure of A as

m_*(A) = b − a − m*((a, b) \ A).

One then defines a set A ⊆ (a, b) to be measurable if m_*(A) = m*(A). The downside with this definition is that it requires A to be a subset of a set of finite measure (in this case, (a, b)), while Carathéodory's definition does not require finiteness. In fact, here's what Carathéodory had to say about his definition [113, p. 72]: "The new definition has great advantages: 1. It can be used for the linear measure. 2. It holds for the Lebesgue case, even if m*A = ∞. 3. The proofs of the principal theorems of the theory are incomparably simpler and shorter than before. 4. The main advantage, however, is that the new definition is independent of the concept of inner measure."

3.2, 3.3: The book [40] is devoted to the theory of finitely additive set functions, which are called charges. In this book, Salomon Bochner (1899–1982) is quoted as having remarked that finitely additive measures are "more interesting, more difficult to handle, and perhaps more important" than countably additive ones.

3.4: Hermann Hankel (1839–1873), whom we will discuss at the end of Section 4.5, was the first person to grasp the idea of outer content, the precursor to outer measure, where finite covers (instead of countably infinite covers) are used to measure the size of sets. In 1870 Hankel proved [164] that for a bounded function f on a closed interval, (1) Riemann integrability of f is equivalent to (2) for every ε > 0, the set of points S_ε where f has jumps > ε has zero content. Hankel then went on to equate measure-theoretic smallness (zero content) with topological smallness (nowhere dense) by proving that (2) is equivalent to the set S_ε being nowhere dense for every ε > 0. This statement is false; a counterexample is the characteristic function of a nowhere dense set of positive measure, such as a thick Cantor set studied in Section 4.5. Although Hankel confused measure-theoretic smallness and topological smallness (many others did so as well), it can be said that Hankel initiated the measure-theoretic approach to integration [156, p. 167].


3.5: We know from the regularity theorem that if μ : I → [0, ∞] is a finitely, but not countably, additive set function on a semiring I, then μ* : S(I) → [0, ∞] does not extend μ. A natural question is: Is there a finitely additive set function on S(I) that does extend μ? The answer is yes: there always exists an extension [40, p. 78]. Unfortunately, the extension is not generally unique and the proofs that such extensions exist are nonconstructive. Here's a quote from [43]:

One serious loss is the constant interplay of finitely additive measures with the axiom of choice; this renders the whole subject unreal. Here is an example. Take the basic space as the positive integers: Ω = {1, 2, 3, 4, ...}, take B to be the set of integers that have first digit 1:

B = {1, 10, 11, ..., 19, 100, 101, ...},

and suppose we begin an approach to picking an integer at random by assigning the number theoretic natural density. Thus even numbers are assigned probability 1/2, square-free numbers are assigned probability 6/π², the primes are assigned probability 0, and so on. A standard result says that there are Banach limits: finitely additive probabilities P which are invariant, extend density, and are defined for all subsets of integers. The existence of such P is very roughly equivalent to the axiom of choice. The question now is, what is P(B)? There can be no answer; P(B) can be assigned any value in [1/9, 5/9] and then extended. Thus, the existence of Banach limits is no real help. It gives the illusion of a concrete useful construction with little content.

CHAPTER 4

Reactions to the extension & regularity theorems


Have we got a treat for you in this chapter! Now that we know Lebesgue measure and infinite product probability measures extend from their respective semirings to appropriate σ-algebras, we can study a lot of things we couldn't before. This chapter is devoted to such studies.

4.1. Gambler's ruin, Borel–Cantelli, and independence

This section is devoted to answering probability questions involving Bernoulli sequences. Let Y = {0, 1} and let μ_0 : P(Y) → [0, 1] be a probability measure; let us say for some 0 < p < 1,

μ_0{1} = p,   μ_0{0} = 1 − p,   μ_0{0, 1} = 1,   and   μ_0(∅) = 0.

By the extension theorem (see Theorem 3.14) there is a unique probability measure

μ : S(C) → [0, 1]

such that on a cylinder set A_1 × A_2 × ⋯ × A_n × Y × Y × ⋯ of Y^∞, we have

μ(A_1 × A_2 × ⋯ × A_n × Y × Y × ⋯) = μ_0(A_1) μ_0(A_2) ⋯ μ_0(A_n).

In the following subsections we give various applications of this set-up to gambling and to monkeys and Shakespeare.

4.1.1. Gambler's ruin and the foolishness of gambling. Consider a gambler¹ with an initial capital of $i who walks into a casino. He sits down at a table and is determined that he will play the game over and over again until he either wins everything (all the money of the house) or loses everything. His probability of winning a game is p and of losing is q = 1 − p. If he wins he gets $1 and if he loses he gives the house $1. Let t be the total amount of money involved: the gambler's initial $i plus the casino's money. Our question is

What is the probability of the gambler's ruin?

We can model this situation using the sample space Y^∞, where Y = {0, 1} with 1 representing a win and 0 a loss. For each n = 1, 2, 3, ..., define W_n : Y^∞ → R as the net amount the gambler has won after n rounds of play. Thus, if x = (x_1, x_2, ...) ∈ Y^∞, then

W_n(x) = #{1's amongst x_1, ..., x_n} − #{0's amongst x_1, ..., x_n}.

¹ The painting is called Cardsharps, by Gerard van Honthorst (1590–1656), Museum Wiesbaden. From the Wikimedia Commons.


Notice that i + W_n represents the total amount of money the gambler has after n plays. It follows that

⋂_{k=1}^{n−1} {0 < i + W_k < t}

is the event that the gambler neither goes broke nor wins everything during the first n − 1 plays, and that {i + W_n = 0} is the event that he has no money on the nth play. Thus,

A_{i,1} := {i + W_1 = 0}   and for n > 1,   A_{i,n} := {i + W_n = 0} ∩ ⋂_{k=1}^{n−1} {0 < i + W_k < t}

is the event that the gambler, with an initial capital of $i, goes broke on exactly the nth play. We leave it for you to check that each A_{i,n} ∈ R(C). Hence,

⋃_{n=1}^∞ A_{i,n}

belongs to the σ-algebra S(C) and is the event that the gambler goes broke on some play; that is, the event that the gambler is eventually ruined. Since the sets A_{i,n} are pairwise disjoint in n, it follows that

P_i := μ(⋃_{n=1}^∞ A_{i,n}) = Σ_{n=1}^∞ μ(A_{i,n})

is the probability that the gambler is ruined, where recall that i represents his initial capital. We shall put P_0 = 1 because if he starts with no capital, he's already ruined, so his chance of ruin is 1, while we put P_t = 0 because if he starts with all the money he will not play, so he has no chance of being ruined.

We now derive an equation for P_i, for any i with 0 < i < t, in terms of P_{i+1} and P_{i−1}. The intuitive idea is that either the gambler wins the first round (in which case he now has a capital of $(i + 1)) or he loses the first round (in which case he now has a capital of $(i − 1)). Since these are mutually exclusive events, the probability of ruin for the gambler should be the probability that he wins the first round and then is ruined plus the probability that he loses the first round and then is ruined. Since he has a probability p of winning a round, the probability that he wins the first round and then is ruined should be p · P_{i+1}, where P_{i+1} denotes the probability of ruin starting with a capital of $(i + 1). Similarly, the probability of him losing the first round and then being ruined should be q · P_{i−1}, where q = 1 − p. Hence, the following equation should hold:

(4.1)   P_i = p P_{i+1} + q P_{i−1},   P_0 = 1,   P_t = 0.

In Problem 10 you will provide a precise proof of this intuitive statement. Now, Equation (4.1) is an example of a difference equation, and there is an extensive literature on how to solve such equations; see e.g. [154] or [117]. Thus, we just have to solve (4.1) and we're done! Of course, Pascal didn't have a developed theory of difference equations to turn to, so he had to solve (4.1) from scratch. According to


Edwards [115], he probably went about it this way:

P_i = p P_{i+1} + q P_{i−1}  ⟹  (p + q) P_i = p P_{i+1} + q P_{i−1}   (since p + q = 1)
  ⟹  p(P_{i+1} − P_i) = q(P_i − P_{i−1})
  ⟹  P_{i+1} − P_i = ρ (P_i − P_{i−1}),

where ρ = q/p. Setting i = 1 in the last equation above, we obtain P_2 − P_1 = ρ(P_1 − P_0) = ρ(P_1 − 1). Next, replacing i with 2, we obtain P_3 − P_2 = ρ(P_2 − P_1) = ρ²(P_1 − 1). Continuing, we see the pattern

P_2 − P_1 = ρ (P_1 − 1)
P_3 − P_2 = ρ² (P_1 − 1)
P_4 − P_3 = ρ³ (P_1 − 1)
  ⋮
P_i − P_{i−1} = ρ^{i−1} (P_1 − 1).

Summing the left and right-hand columns, noticing that we get a telescoping sum for the left-hand column, we obtain

P_i − P_1 = (ρ + ρ² + ρ³ + ⋯ + ρ^{i−1})(P_1 − 1).

If ρ = 1, we see that

P_i − P_1 = (i − 1)(P_1 − 1),

and for ρ ≠ 1, we have ρ + ρ² + ⋯ + ρ^{i−1} = (ρ − ρ^i)/(1 − ρ) (using the sum of a geometric progression), so in case ρ ≠ 1,

P_i − P_1 = ((ρ − ρ^i)/(1 − ρ)) (P_1 − 1).

Setting i = t and using that P_t = 0, we can find P_1 from these equations (which is a good review of basic algebra), and we finally arrive at the answer:

Gambler's ruin theorem

Theorem 4.1. The probability of ruin for a gambler starting with an initial capital of $i is

P_i = ((q/p)^i − (q/p)^t) / (1 − (q/p)^t)   if p ≠ q,
P_i = 1 − i/t   if p = q,

where t is the total money involved (gambler + his foe).
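For readers who like to check formulas numerically, here is a short Python sketch of my own (not from the text) that evaluates the ruin probability of Theorem 4.1 and compares it with a brute-force Monte Carlo simulation of the game; the function names and parameter choices are of course hypothetical.

```python
import random

def ruin_probability(i, t, p):
    """Probability of ruin from Theorem 4.1: initial capital i, total money t,
    probability p of winning each round."""
    if p == 0.5:
        return 1 - i / t
    rho = (1 - p) / p                      # rho = q/p
    return (rho**i - rho**t) / (1 - rho**t)

def simulate_ruin(i, t, p, trials=200_000, seed=0):
    """Fraction of simulated sessions that end with the gambler broke."""
    rng = random.Random(seed)
    ruined = 0
    for _ in range(trials):
        capital = i
        while 0 < capital < t:
            capital += 1 if rng.random() < p else -1
        ruined += (capital == 0)
    return ruined / trials

print(ruin_probability(3, 10, 0.47))       # exact value from the theorem
print(simulate_ruin(3, 10, 0.47))          # Monte Carlo estimate, should agree to about two decimals
```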
Example 4.1. (Roulette) The American roulette wheel consists of 38 slots, two of which are green and numbered 0, 00; eighteen are red and eighteen are black, and these are numbered 1 through 36 (in a mixed-up order). A ball is placed in the wheel and the wheel is spun; the object is to predict where the ball will land when the wheel stops. There are many bets you can make; e.g. that the ball will end up on a certain number, or on a certain combination of numbers, or whether the number will be red or black, even or odd. Let's say that you save up your measly graduate student monthly stipend, $1000 (which was about my stipend!), and you go to the local casino to play a game of roulette. Your favorite color is red, so you always bet that the ball will land on a red slot. In particular, your probability of winning is p = 18/38 = 9/19. Let's say that this casino is small and only has $9000. What is the probability of your ruin? In this case, q/p = (1 − p)/p = 10/9. Thus, the probability of your ruin is

P_1000 = ((10/9)^1000 − (10/9)^10000) / (1 − (10/9)^10000).

Using Maple™ (say), we find that

P_1000 = 0.9999999999999......999999999984744,

where there are a total of 410 digits of 9 after the decimal point! In other words, essentially with 100% certainty, you will lose absolutely EVERYTHING!

Example 4.2. With the same situation as above, let's say that you played a fair game (of course, there is no such thing as a fair game at a casino). Now what is the probability of your ruin? In this case, for any initial capital $i that you have, and capital $j your foe has, your chance of ruin is

P_i = 1 − i/(i + j) = j/(i + j).

For example, if i = 1000 and j = 9000, we get P_1000 = 9000/10000 = .90; in other words, with 90% certainty, you will lose EVERYTHING!
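The numbers in Example 4.1 are too extreme for double-precision floating point ((10/9)^10000 overflows a float), but they are easy to handle with exact rational arithmetic. The following Python sketch of mine evaluates P_1000 from Theorem 4.1 exactly and confirms that the chance of escaping ruin, while positive, is below 10^(-400).

```python
from fractions import Fraction

# Double precision overflows on (10/9)**10000, so use exact rational arithmetic instead.
rho = Fraction(10, 9)                      # rho = q/p for p = 9/19 (always betting on red)
i, t = 1000, 10_000                        # your $1000 versus the casino's $9000
P = (rho**i - rho**t) / (1 - rho**t)       # Theorem 4.1, case p != q

gap = 1 - P                                # probability of breaking the bank
print(gap > 0)                             # True: escaping ruin is not literally impossible...
print(gap < Fraction(1, 10**400))          # ...but its probability is below 10**-400
```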

Hopefully these examples have served the same purpose as Richard Proctor's (1837–1888) classic book Chance and Luck, in which he says [318]:
If a few shall be taught, by what I have explained here, to see that in the long run even fair wagering and gambling must lead to loss, while gambling and wagering scarcely ever are fair, in the sense of being on even terms, this book will have served a useful purpose.

4.1.2. The Borel–Cantelli lemmas. Named after Émile Borel (1871–1956) and Francesco Paolo Cantelli (1875–1966), the Borel–Cantelli lemmas give simple conditions under which the probability of an event occurring infinitely often is either 0 or 1. Recall (see Proposition 1.1 in Section 1.2) that given subsets A_1, A_2, ... of a set X,

{A_n ; i.o.} := the event that an outcome occurs in infinitely many different A_n's
             = {x ∈ X ; x belongs to infinitely many A_n's}
             = ⋂_{n=1}^∞ ⋃_{k=n}^∞ A_k.


Here's the first Borel–Cantelli lemma, which was Borel's Problem III in his famous 1909 paper [51].

The first Borel–Cantelli lemma

Theorem 4.2. Let S be a σ-algebra, let μ : S → [0, ∞] be a measure on S, let A_1, A_2, ... ∈ S, and put A = {A_n ; i.o.}. Then

Σ_{n=1}^∞ μ(A_n) < ∞   ⟹   μ(A) = 0.

Thus, with probability 1, at most finitely many of the events A_1, A_2, ... occur.


Proof: By definition, A = ⋂_{n=1}^∞ ⋃_{k=n}^∞ A_k = ⋂_{n=1}^∞ B_n, where B_n = ⋃_{k=n}^∞ A_k. In particular, since A = B_1 ∩ B_2 ∩ B_3 ∩ ⋯, we have A ⊆ B_n for any n, so

(4.2)   0 ≤ μ(A) ≤ μ(B_n)   for any n.

Now by definition of B_n and using countable subadditivity, we know that

μ(B_n) ≤ Σ_{k=n}^∞ μ(A_k).

By assumption, Σ_{k=1}^∞ μ(A_k) < ∞, so it follows that μ(B_n) can be made arbitrarily small by taking n large. In view of (4.2), we must have μ(A) = 0.

Intuitively, the assumption Σ_{n=1}^∞ μ(A_n) < ∞ means that the nth term of the series, μ(A_n), is small enough as n → ∞ to ensure the event that infinitely many of the A_n's simultaneously occur has probability zero.
Example 4.3. (Run lengths) Consider the sample space Y^∞, where Y = {0, 1}, for an infinite sequence of fair coin tosses (that is, p = 1/2), and let μ denote the infinite product measure on Y^∞. For each n ∈ N, define the run length function ℓ_n : Y^∞ → [0, ∞] as follows: Given a sequence of coin tosses x = (x_1, x_2, x_3, ...) ∈ Y^∞, put

ℓ_n(x) := the number of consecutive tosses of heads starting from the nth toss,

that is, ℓ_n = the run of heads from the nth toss. Given a sequence k_1, k_2, k_3, ... of natural numbers, what's the probability that you toss a coin in such a way that for infinitely many n's the run of heads starting at the nth toss is at least k_n? I don't know of a general answer to this question, but using the first Borel–Cantelli lemma we can give an answer when the probability is zero. Define

A_n = {x ∈ Y^∞ ; ℓ_n(x) ≥ k_n},

which is the event that the run of heads from the nth toss is at least k_n. By definition of ℓ_n(x), we see that

A_n = {x ∈ Y^∞ ; x_n = 1, x_{n+1} = 1, ..., x_{n+k_n−1} = 1},

which is a cylinder set, and moreover, μ(A_n) = 1/2^{k_n}. Therefore, by the first Borel–Cantelli lemma,

Σ_{n=1}^∞ 1/2^{k_n} < ∞   ⟹   μ{A_n ; i.o.} = 0.

For example, since Σ_{n=1}^∞ 1/2^n < ∞, with probability zero you can toss a coin such that for infinitely many n's the run of heads starting at the nth toss is at least n.
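As an informal check on the estimate μ(A_n) = 2^(-k_n), here is a small Python simulation of mine (not from the text) that tosses a long string of fair coins and records how often a run of at least k heads starts at a given position; the empirical frequency should hover near 2^(-k). The sample size and the value of k are arbitrary choices for illustration.

```python
import random

def run_length(tosses, n):
    """Length of the run of heads (1s) starting at position n (0-indexed)."""
    length = 0
    while n + length < len(tosses) and tosses[n + length] == 1:
        length += 1
    return length

rng = random.Random(1)
N, k = 200_000, 5
tosses = [rng.randint(0, 1) for _ in range(N)]

hits = sum(run_length(tosses, n) >= k for n in range(N - k))
print(hits / (N - k), 2**-k)   # empirical frequency versus the exact probability 1/2**k
```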


The second Borel–Cantelli lemma deals with independent events. Intuitively speaking, two events are independent if the occurrence of one event doesn't influence the occurrence of the other event. For example, the event that a randomly chosen student in your class is male (call this event A) and the event that it rains today (call it B) are independent. If P(A) is the probability that a randomly chosen student in your class is male and P(B) is the probability that it rains today, then what is the probability of A ∩ B? Intuitively speaking, it should be

(4.3)   P(A ∩ B) = P(A) · P(B),

because if we think of P(A) as the fraction of times a male student is chosen and P(B) as the fraction of times it rains, then it would make sense that the product P(A) · P(B) is the fraction of times a male student is chosen and it rains. However, the event that a randomly chosen student in your class is male and the event that a randomly chosen student is wearing a dress are not independent, since gender influences clothing style. Mathematically speaking, we define independence via (4.3) and we generalize it as follows: Given a probability measure μ : S → [0, 1], we call events A_1, A_2, A_3, ... ∈ S (a finite or countably infinite collection of events) pairwise independent if for any i ≠ j, the events A_i and A_j satisfy the equality in (4.3) (with μ replacing P). We say that A_1, A_2, A_3, ... are independent if the following stronger condition is satisfied: For any finite subcollection A_i, A_j, ..., A_k of A_1, A_2, A_3, ..., we have

μ(A_i ∩ A_j ∩ ⋯ ∩ A_k) = μ(A_i) · μ(A_j) ⋯ μ(A_k).

Note that independence implies pairwise independence. In Problem 2 you will show that pairwise independence may not imply independence. We now come to the second Borel–Cantelli lemma, which is Cantelli's contribution, proved in 1917 [67]. The proof is so nice that we shall leave it as an exercise (see Problem 8)! The usual statement of the second Borel–Cantelli lemma requires the events A_1, A_2, ... ∈ S to be independent. The relaxed condition of just being pairwise independent was discovered by Paul Erdős (1913–1996) and Alfréd Rényi (1921–1970) in 1959 [119].

The second Borel–Cantelli lemma

Theorem 4.3. Let μ : S → [0, 1] be a probability measure on a σ-algebra S, let A_1, A_2, ... ∈ S be pairwise independent, and put A = {A_n ; i.o.}. Then

Σ_{n=1}^∞ μ(A_n) = ∞   ⟹   μ(A) = 1.

Intuitively, the assumption Σ_{n=1}^∞ μ(A_n) = ∞ means that the terms of the series, μ(A_n), are big enough as n → ∞ to ensure the event that infinitely many of the A_n's simultaneously occur has probability one. Notice that combining the first and second Borel–Cantelli lemmas, we see that if A_1, A_2, ... ∈ S are pairwise independent and A = {A_n ; i.o.}, then either μ(A) = 0 or μ(A) = 1. Indeed, either Σ_{n=1}^∞ μ(A_n) converges or diverges. If the sum converges, then μ(A) = 0 by the first Borel–Cantelli lemma, and if it diverges, then μ(A) = 1 by the second Borel–Cantelli lemma. That μ(A) = 0 or 1 is an example of a zero-one law, of which there are many in probability theory; see Problem 6 for Borel's zero-one law.
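This dichotomy is easy to watch numerically. The Python sketch below is my own illustration (the particular choices p_n = 1/n and p_n = 1/n² are just for contrast): with independent events of probability p_n, the simulated number of events that occur keeps growing when Σ p_n diverges, but stalls when Σ p_n converges.

```python
import random

def count_occurring(prob, n_events, rng):
    """Simulate independent events A_1,...,A_N with P(A_n) = prob(n); count how many occur."""
    return sum(rng.random() < prob(n) for n in range(1, n_events + 1))

rng = random.Random(0)
for N in (10**2, 10**4, 10**6):
    divergent = count_occurring(lambda n: 1 / n, N, rng)        # sum of 1/n diverges
    convergent = count_occurring(lambda n: 1 / n**2, N, rng)    # sum of 1/n^2 converges
    print(N, divergent, convergent)   # first count grows like log N, second one stalls
```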
Example 4.4. (Monkeys and Shakespeare) We continue our Monkey–Shakespeare drama. Put a monkey in front of a typewriter and see if he can type Shakespeare's Sonnet 18 (or any other passage), and give him an infinite number of opportunities to do so. Consider the sample space Y^∞, where Y = {0, 1}, and where 1 represents a successful typing of the passage and 0 not typing the passage. Assume that on each try he has probability p of typing the passage and let μ denote the infinite product measure on Y^∞. What is the probability the monkey will type the passage an infinite number of times? As we saw in Section 2.3.5, with certain assumptions on the keyboard and the monkey's typing speed, the probability is essentially zero that the monkey will type Sonnet 18 in any reasonable amount of time. However, it turns out that with probability one the monkey will type Sonnet 18 an infinite number of times! To see this, for each n ∈ N, let A_n be the event that the monkey types Sonnet 18 on the nth page:

A_n = Y × ⋯ × Y × {1} × Y × Y × ⋯,

where the {1} is in the nth slot. Observe that if A_{i_1}, A_{i_2}, ..., A_{i_k} where i_1 < i_2 < ⋯ < i_k, then

A_{i_1} ∩ A_{i_2} ∩ ⋯ ∩ A_{i_k} = {x ∈ Y^∞ ; x_{i_1} = ⋯ = x_{i_k} = 1}.

It follows by the definition of the infinite product measure that

μ(A_{i_1} ∩ A_{i_2} ∩ ⋯ ∩ A_{i_k}) = p^k = p · p ⋯ p = μ(A_{i_1}) · μ(A_{i_2}) ⋯ μ(A_{i_k}).

Therefore, A_1, A_2, ... are independent. Moreover,

Σ_{n=1}^∞ μ(A_n) = Σ_{n=1}^∞ p = ∞,

so by the second Borel–Cantelli lemma, it follows that with probability one, the monkey types Sonnet 18 an infinite number of times!

This example really shows the immense gulf between the finite and the infinite: In any reasonable finite amount of time (e.g. the hypothetical age of the universe) the monkey will almost certainly not type Sonnet 18, but given an infinite amount of time, he will type Sonnet 18 an infinite number of times!
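The infinite-monkey conclusion can be watched in miniature. The Python sketch below is my own illustration (with a three-letter "sonnet" so that successes are actually visible): it types random lowercase letters and counts how many times the target appears. The count keeps growing roughly linearly in the number of keystrokes, in line with the second Borel–Cantelli lemma.

```python
import random
import string

rng = random.Random(42)
target = "ape"                      # a very short stand-in for Sonnet 18
p = (1 / 26) ** len(target)         # chance of typing it starting at any given keystroke

def count_occurrences(n_keystrokes):
    text = "".join(rng.choice(string.ascii_lowercase) for _ in range(n_keystrokes))
    return sum(text[i:i + len(target)] == target
               for i in range(n_keystrokes - len(target) + 1))

for n in (10**4, 10**5, 10**6):
    print(n, count_occurrences(n), "expected about", round(n * p))
```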
Exercises 4.1. 1. Consider the sample space Y where Y = {0, 1}. Write down the event A throwing a head at some point as a subset of Y . Show that A S (C ) and A / R (C ). Finally, answer the question: What is the probability that you throw a head at some point, assuming a constant probability p > 0 of throwing a head on a single toss? 2. We throw a six-sided die innitely many times. Let Aij be the event that on the ith and j th throws we obtain the same number. Show that {Aij } are pairwise independent (that is Aij and Ak are independent if Aij = Ak ) but not independent. 3. If A1 , A2 , A3 , . . . are independent events, prove that n=1 (An ). n=1 An = 4. If a sequence of sets A1 , A2 , A3 , . . . are independent, prove that any sequence of the form B1 , B2 , B3 , . . . where Bi is either Ai or Ac i , is also independent. (In particular, c c the sequence of complements Ac 1 , A2 , A3 , . . . is independent.) Suggestion: First prove that the sequence obtained by replacing exactly one Ai by Ac i is independent. 5. A sequence of events A1 , A2 , . . . of a probability space (X, S , ) is called a coin tossing sequence if the events are independent and there is a p (0, 1) such that (Ai ) = p for each i. Show that there doesnt exist a coin tossing sequence if X is countable and



{x} S for all points x X . Suggestion: Let q = max{p, 1 p} and let x X . Show that {x} q n for each n. Prove the n = 1 case by noting that either x A1 c c or x Ac 1 . Prove the n = 2 by noting that x A1 A2 , or A1 A2 , or A1 A2 , or c Ac A . Continue. 1 2 In the following two problems we give applications of the second Borel Cantelli lemma (which is proved in Problem 8). 6. (Borels zero-one law) Let p1 , p2 , p3 , . . . (0, 1), let Y = {0, 1} and for each n, let n : P (Y ) [0, 1] be the probability measure assigning pn to {1} and 1 pn to {0}. Consider the sample space Y with measure , the innite product of the n s. Prove the following result, which is in Borels 1909 paper [51]: If A is the event that an innite number of successes occurs in an innite sequence of experiments, then (1)
n

pn < = (A) = 0

and

(2)
n

pn = = (A) = 1.

7. (Modied second BorelCantelli lemma.) (a) Let : S [0, 1] be a probability measure on a -algebra S , let A1 , A2 , . . . S , and suppose that some subsequence An1 , An2 , . . . is pairwise independent and k=1 (Ank ) = . Prove that {An ; i.o.} = 1. (b) (Patterns) Let Y be a nite set, let k N and let s1 , s2 , . . . , sk be a list of k (not necessarily distinct) elements of Y . Using (a), prove that with probability one, the pattern s1 , s2 , . . . , sk occurs innitely often in a randomly chosen sequence in Y . Explicitly, let An Y be the event that the pattern s1 , s2 , . . . , sk occurs at position n; that is, An consists of all elements x Y such that xn = s1 , . . . , xn+k1 = sk . Prove that {An ; i.o.} = 1. 8. (Proof of the second BorelCantelli lemma) Let : S [0, 1] be a probability measure on a -algebra S , let A1 , A2 , . . . S be pairwise independent, and assume that n=1 (An ) = . Well show that {An ; i.o. } = 1. This proof uses a clever inequality due to Kai Lai Chung (1917 ) and Paul Erd os (19131996) [87]. gestion: Dene E = f and consider the integral (f E )2 , which is 0. (ii) Heres the clever inequality: Prove that for any B1 , B2 , . . . , Bm S , we have
m

(i) Prove that for any simple function f : X R, we have

f 2 . Sug-

(Bk )
k=1

Bk
k=1 i,j =1

(Bi Bj ).

(iii) Show that {An ; i.o.} = limn limm (iv) Prove that
m i,j =n

Suggestion: Put B = m k=1 Bk , = (B ), and dene 1 : S [0, 1] by 1 (A) = (A B )/ . Show that 1 : S [0, 1] is a probability measure, then apply (i) to the simple function f = m k=1 Bk on the measure space (X, S , 1 ). (Ai Aj ) =
m k =n

(Ak )

m k =n 2

Ak .

m k =n

(Ak ).

(v) Now prove {An ; i.o.} = 1. 9. (Another proof of the second BorelCantelli lemma) Let : S [0, 1] be a probability measure on a -algebra S , let A1 , A2 , . . . S be pairwise independent, and assume that n=1 (An ) = . We will show that {An ; i.o. } = 1. n (i) Put fn = k=1 Ak . Show that limn E (fn ) = and that {An ; i.o. } = {lim fn = }. Thus, we just have to show that {lim fn = } = 1. (ii) Dene the standard deviation of fn by n := E [(fn E (fn ))2 ], the square 2 root of the expectation of (fn E (fn ))2 . (Here, n is called the variance of fn .) Given > 0, prove that |fn E (fn )| < n 1 1 . 2



1 (iii) Prove that given any > 0, E (fn ) n < fn 1 2. 2 2 2 (iv) Show that n = E (fn ) E (fn ) , then show that n E (fn ). (v) Recalling that limn E (fn ) = from (i), given > 0, choose N N such that for all n N , we have E (fn ) 42 . Now using (iii) and (iv), prove that 1 1 for all n N , E (fn ) < fn 1 2 . 2 (vi) Finally, prove that {lim fn = } = 1, which proves the second BorelCantelli lemma. Suggestion: If youre having trouble showing this, heres one way, not the most elegant of the many ways, to go about it. First show that

{lim fn = } =

B
N

2 1 where B = n=1 {2 < fn }. Show that (B ) 1 2 and B1 B2 B3 , then use continuity from above for measures. 10. In this problem we prove the formula Pi = p Pi+1 + qPi1 found in (4.1). The following functions play important roles in this proof: Given i N, dene Ri : Y R by

Ri (x) =

Observe that R1 + + Rn : Y R is a random variable giving net amount the gambler has won after n rounds of play. (i) For any i, n N, let Ei,n = {y Rn ; i + y1 + + yn = 0}
n1 k=1

1 1

if xi = 1 . if xi = 0.

{y Rn ; 0 < i + y1 + + yn < t},

where n > 1 and if n = 1, we put Ei,1 = {y R ; i + y1 = 0}. Show that {(R1 , . . . , Rn ) Ei,n } = {(R2 , . . . , Rn+1 ) Ei,n } . In fact, prove that {(R1 , . . . , Rn ) Ei,n } equals {(Rj1 , . . . , Rjn ) Ei,n } for any choice of natural numbers j1 < j2 < < jn . (ii) Show that Ai,n = {(R1 , . . . , Rn ) Ei,n }, where Ai,n is the event that the gambler goes broke on exactly the nth play, and that for n > 1, Ai,n = {R1 = 1} {(R2 , . . . , Rn ) Ei+1,n1 } {R1 = 1} {(R2 , . . . , Rn ) Ei1,n1 } . Using this equality, prove that (Ai,n ) = p (Ai+1,n1 ) + q (Ai1,n1 ). (iii) Finally, prove that Pi = p Pi+1 + qPi1 . 11. Assume that the casino has an unlimited amount of money. Prove that no matter how large the gamblers initial capital is, his probability of ruin is 1. 12. (Dirichlets approximation theorem) This problem doesnt use any measure theory, but is given here because it helps to better appreciate the next problem! Let R with 2 and let A denote the set of all real numbers R such that there are innitely many rational numbers p/q with (4.4) p 1 < . q q

The way to think about (4.4) is that the bigger the , the better can be approximated by rational numbers and hence the more rational the number is. Dirichlets approximation theorem, named after Lejeune Dirichlet (18051859) who made the pigeonhole principle famous, is the statement that Qc A2 . (Please review/look up the pigeonhole principle before proceeding.)



(i) For any real number x denote by {x} = x x the fractional part of x, where x is the greatest integer x. Note that {x} [0, 1). Let n N and partition [0, 1) into n subintervals [0, 1/n), [1/n, 2/n), . . ., [(n 1)/n, 1). Let R be an irrational number. Prove, by considering the n + 1 numbers, 0, { }, {2 }, . . ., {n } and using the Pigeonhole principle, that there are two dierent integers a, b {0, 1, . . . , n} such that |{a } {b }| < 1/n. (ii) Conclude that there are integers p, q with 1 q n such that |q p| < 1/n and use this to prove Dirichlets approximation theorem: Qc A2 . 13. We shall study the measure theoretic properties of the A s from Problem 12. (a) Prove that for each 2, A is a Borel set. Suggestion: To get you started, note that A is dened in terms of innitely often. (b) Prove that m(A2 ) = and m(A ) = 0 for > 2. Suggestion: For the second equality, apply the rst BorelCantelli lemma to A [, ] for any N. (c) A real number R is Diophantine of exponent if there is a constant C > 0 such that for all rational numbers p/q , we have p C . q q

If A is the collection of Diophantine numbers of exponent > 2, prove that m(Ac ) = 0. (We say almost all real numbers are Diophantine of exponent > 2.) Suggestion: First prove that if is not Diophantine of exponent > 2, then A . s 14. (The zeta function) Fix s (1, ) and put (s) := , the Riemann zeta n=1 n 1 s function. Dene : P (N) [0, 1] by specifying {n} = (s) n for each n N. (i) For a N, put aN := {a n ; n N} and prove that (aN) = as . (ii) Prove that the sets of the form p N, where p is prime, are independent. (iii) Using Problems 3 and 4, give a probabilistic proof of Leonhard Eulers (1707 1783) sum-product formula: (s ) =
p

(1 ps )1

where the product is over all prime numbers.

15. Prove there does not exist a probability measure : P (N) [0, 1] such that (aN) = 1/a for all a N where aN := {a n ; n N}.

4.2. Borel's strong law of large numbers

This section is devoted to Émile Borel's (1871–1956) strong law of large numbers, the first version of which was published in the 1909 paper Les probabilités dénombrables et leurs applications arithmétiques [51].

4.2.1. The strong law of large numbers (Borel's theorem). Before we state the strong law of large numbers, let us recall the weak law. Let Y^∞, where Y = {0, 1}, be the sample space for an infinite sequence of (say) coin tosses, and let p be the probability of a head on any given flip. Then given an infinite sequence of coin tosses, x = (x_1, x_2, x_3, ...) ∈ Y^∞, the ratio

(x_1 + x_2 + x_3 + ⋯ + x_n)/n

is the proportion of heads in n tosses, which for a typical sequence of coin tosses should intuitively be close to p for n large. Bernoulli's theorem, or the weak law of large numbers, is one interpretation of this intuitive idea: For each ε > 0,

lim_{n→∞} μ{ x ∈ Y^∞ ; |(x_1 + x_2 + ⋯ + x_n)/n − p| < ε } = 1.



Another, slightly different, interpretation is that if you consider the event

A := { x ∈ Y^∞ ; lim_{n→∞} (x_1 + x_2 + ⋯ + x_n)/n = p },

then this event should occur with probability one; that is, in a sequence of coin tosses, with probability one the proportion of heads in n tosses approaches p as n → ∞. This is the Strong Law.

Borel's strong law of large numbers

Theorem 4.4. The set A belongs to S(C) and μ(A) = 1; in other words, the event

{ lim_{n→∞} (x_1 + x_2 + ⋯ + x_n)/n = p }

belongs to S(C) and it occurs with probability one.

As we did with the Weak Law, in order to prove the Strong Law we first transform its statement into a statement involving functions. For each i ∈ N, let f_i := χ_{A_i} : Y^∞ → R, the characteristic function of the set A_i ⊆ Y^∞, where A_i is the event that on the ith toss we flip a head, and let

S_n = f_1 + ⋯ + f_n,

which represents the simple random variable giving the total number of heads in n tosses. Then

A = { lim_{n→∞} S_n/n = p }.

Thus, the Strong Law of Large Numbers is really a statement about the points where the limit of a certain sequence of functions equals p. Thus, before proceeding, we had better learn some general results on limits of sequences of functions.

Lemma 4.5. Let g, g_1, g_2, g_3, ... be real-valued functions on a probability space (X, S, μ). Suppose that for each ε > 0 and n ∈ N, {|g_n − g| ≥ ε} ∈ S. Then
(1) L := {lim g_n = g} ∈ S, or equivalently, L^c = {lim g_n ≠ g} ∈ S.
(2) μ(L) = 1 if and only if for each ε > 0, μ{|g_n − g| ≥ ε ; i.o.} = 0.
(3) If for each ε > 0,

Σ_{n=1}^∞ μ{|g_n − g| ≥ ε} < ∞,

then μ(L) = 1.
(4) If μ(L) = 1, then for each ε > 0, lim_{n→∞} μ{|g_n − g| ≥ ε} = 0.

Proof: By definition of limit, we have

x ∈ L  ⟺  ∀ ε > 0, ∃ N ∈ N, ∀ n ≥ N, |g_n(x) − g(x)| < ε.



It follows from the Archimedean property of real numbers² that this condition is equivalent to the following:

x ∈ L  ⟺  ∀ m ∈ N, ∃ N ∈ N, ∀ n ≥ N, |g_n(x) − g(x)| < 1/m.

In the language of sets, ∀ is intersection and ∃ is union, so this statement is simply that

L = ⋂_{m=1}^∞ ⋃_{N=1}^∞ ⋂_{n=N}^∞ A^c_{m,n},   where A_{m,n} := {|g_n − g| ≥ 1/m}.

By assumption, each A_{m,n} ∈ S, so it follows that L ∈ S. This proves (1).

To prove (2), observe that

L^c = ⋃_{m=1}^∞ ⋂_{N=1}^∞ ⋃_{n=N}^∞ A_{m,n} = ⋃_{m=1}^∞ {x ∈ X ; x ∈ A_{m,n} for infinitely many n's},

where we used {A_n ; i.o.} = ⋂_{N=1}^∞ ⋃_{n=N}^∞ A_n for any sequence of sets {A_n}. From this equality it follows that μ(L^c) = 0 (which is equivalent to μ(L) = 1) if and only if for each m ∈ N,

μ{ |g_n − g| ≥ 1/m ; for infinitely many n's } = 0.

By the Archimedean property of the real numbers, this last statement holds if and only if for each ε > 0, μ{|g_n − g| ≥ ε ; for infinitely many n's} = 0; that is, if and only if for each ε > 0, μ{|g_n − g| ≥ ε ; i.o.} = 0.

Part (3) follows from Part (2) and the first Borel–Cantelli lemma.

Finally, to prove Part (4), assume that μ(L) = 1 and let ε > 0; we must show that lim_{n→∞} μ{|g_n − g| ≥ ε} = 0. Note that

{|g_n − g| ≥ ε ; i.o.} = ⋂_{n=1}^∞ ⋃_{k=n}^∞ {|g_k − g| ≥ ε} = ⋂_{n=1}^∞ B_n,

where B_n = ⋃_{k=n}^∞ {|g_k − g| ≥ ε}. Observe that B_1 ⊇ B_2 ⊇ B_3 ⊇ ⋯, so by the continuity from above property of measures, we have

μ{|g_n − g| ≥ ε ; i.o.} = lim_{n→∞} μ(B_n).

By Part (2), the left side of this equality is 0, so lim_{n→∞} μ(B_n) = 0. Since {|g_n − g| ≥ ε} ⊆ B_n, it follows that lim_{n→∞} μ{|g_n − g| ≥ ε} = 0.

Proof of the strong law: First of all, by Problem 1 in Exercises 2.4, we know that for each ε > 0,

{ |S_n/n − p| ≥ ε } ∈ S(C).

(In fact, this set belongs to R(C).) Therefore, by Part (1) of the lemma, A := {lim_{n→∞} S_n/n = p} ∈ S(C). To prove that μ(A) = 1, by Part (3) of our lemma, for fixed ε > 0 we just have to show that

Σ_{n=1}^∞ μ{ |S_n/n − p| ≥ ε } < ∞.

We shall prove this using Chebyshev's inequality together with plain hard work! To begin, observe that

{ |S_n/n − p| ≥ ε } = { |S_n − np| ≥ nε } = { (S_n − np)^4 ≥ n^4 ε^4 }.

As we did in the proof of the Weak Law, let us introduce

R_i := f_i − p,   i = 1, 2, 3, ...,

the Rademacher functions, and observe that S_n − np = R_1 + R_2 + ⋯ + R_n. Thus,

{ |S_n/n − p| ≥ ε } = { (Σ_{k=1}^n R_k)^4 ≥ n^4 ε^4 }.

By Chebyshev's inequality (Lemma 2.11), we have

μ{ (Σ_{k=1}^n R_k)^4 ≥ n^4 ε^4 } ≤ (1/(n^4 ε^4)) ∫ (Σ_{k=1}^n R_k)^4.

Observe that if we multiply out (R_1 + ⋯ + R_n)^4 we obtain

(Σ_{k=1}^n R_k)^4 = Σ_{i,j,k,ℓ} R_i R_j R_k R_ℓ,

where the sum contains terms of the following form:
(1) R_m^4 (when all of R_i, R_j, R_k, R_ℓ are the same).
(2) R_r^2 R_s^2, r ≠ s (when two distinct pairs of R_i, R_j, R_k, R_ℓ are the same).
(3) R_i R_j R_k R_ℓ in which at least one factor is not repeated.
Note that there are n terms of the form (1) (this should be clear) and of the form (2), there are 3n(n − 1) terms.³ We now consider integrals of each of these types of functions. Note that |R_i| ≤ 1 for any i, so

∫ R_m^4 ≤ ∫ 1 = μ(Y^∞) = 1   and   ∫ R_p^2 R_q^2 ≤ ∫ 1 = μ(Y^∞) = 1.

We now compute the integrals of the third type of functions. Assume, for example, that i is distinct from j, k, ℓ. Observe that when we multiply out

R_j R_k R_ℓ = (χ_{A_j} − p)(χ_{A_k} − p)(χ_{A_ℓ} − p),

we get a linear combination of characteristic functions χ_B where the set B equals an intersection of one, two, or three sets amongst A_j, A_k, A_ℓ, or B equals Y^∞ (in this case, χ_B ≡ 1, which occurs when we multiply the three p's together). With B as just described, it follows that R_i R_j R_k R_ℓ is a linear combination of terms of the form

(χ_{A_i} − p) χ_B = χ_{A_i ∩ B} − p χ_B,

so the integral ∫ R_i R_j R_k R_ℓ is a linear combination of terms of the form

∫ (χ_{A_i ∩ B} − p χ_B) = μ(A_i ∩ B) − p μ(B).

Recalling the form of the set B, we leave it as a short exercise for you to check that μ(A_i ∩ B) = μ(A_i) μ(B) = p μ(B). Hence, ∫ R_i R_j R_k R_ℓ = 0. Now,

∫ (Σ_{k=1}^n R_k)^4 = Σ_{m=1}^n ∫ R_m^4 + Σ_{p≠q} ∫ R_p^2 R_q^2 + ⋯,

where ⋯ consists of the type (3) terms, and therefore by our computations above, we have

(4.5)   ∫ (Σ_{k=1}^n R_k)^4 ≤ Σ_{m=1}^n 1 + 3 Σ_{p≠q} 1 + 0 = n + 3n(n − 1) ≤ 3n².

We conclude that

μ{ (Σ_{k=1}^n R_k)^4 ≥ n^4 ε^4 } ≤ (1/(n^4 ε^4)) ∫ (Σ_{k=1}^n R_k)^4 ≤ 3n²/(n^4 ε^4) = 3/(n² ε^4).

Hence,

Σ_{n=1}^∞ μ{ |S_n/n − p| ≥ ε } ≤ Σ_{n=1}^∞ 3/(n² ε^4) < ∞,

and the proof of the strong law of large numbers is complete.

² One version says that given any ε > 0 there is a natural number m such that 1/m < ε.
³ There are exactly n(n − 1) pairs (r, s) where r ≠ s, and there are exactly three ways to produce a term R_r^2 R_s^2, r ≠ s, from R_i R_j R_k R_ℓ (namely when (i) i = j and k = ℓ, (ii) i = k and j = ℓ, and (iii) i = ℓ and j = k). This is how we got 3n(n − 1) in (2).
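To see the theorem in action, here is a small Python simulation of my own (not part of the text): it generates a few long sequences of biased coin flips with p = 0.3 and prints the running proportion of heads, which visibly settles down to p, just as the strong law predicts for almost every sequence. The checkpoints and seed values are arbitrary.

```python
import random

def head_proportions(p, checkpoints, rng):
    """Proportion of heads S_n/n at the given checkpoints for one simulated sequence."""
    heads, results, n_max = 0, [], max(checkpoints)
    for n in range(1, n_max + 1):
        heads += rng.random() < p
        if n in checkpoints:
            results.append(heads / n)
    return results

p = 0.3
checkpoints = {10, 100, 1_000, 10_000, 100_000}
for seed in range(3):                        # three independent simulated coin-toss sequences
    rng = random.Random(seed)
    print(head_proportions(p, checkpoints, rng))   # each list drifts toward p = 0.3
```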

We remark that because of Property (4) in Lemma 4.5, the strong law of large numbers automatically implies the Weak Law, and in this sense the strong law is stronger than the weak law. Indeed, since μ(A) = 1 where A = {lim_{n→∞} S_n/n = p}, by Property (4) in Lemma 4.5, for each ε > 0,

lim_{n→∞} μ{ |S_n/n − p| ≥ ε } = 0.

This is exactly the statement of the weak law. However, the weak law doesn't automatically imply the strong law, in the sense that the converse of Property (4) in Lemma 4.5 is in general false; see Problem 3.

We also remark that there is a corresponding strong version of the expectation theorem, Theorem 2.12. Let μ_0 : I → [0, 1] be a probability measure on a semiring of a sample space Y, let X := Y^∞, the sample space for repeating the experiment modeled by Y an infinite number of times, let μ : S(C) → [0, 1] be the infinite product of μ_0 with itself, and finally, let

f : Y → R

be a simple random variable. For each i, define

(4.6)   f_i : X → R   by   f_i(x_1, x_2, ...) := f(x_i),

which represents the random variable f observed on the ith iterate of the experiment. The following theorem is proved in exactly the same way as the strong law, with only slight modifications, so we leave its proof for your enjoyment.

The strong expectation theorem

Theorem 4.6. The event

{ lim_{n→∞} (f_1 + f_2 + ⋯ + f_n)/n = E(f) }

belongs to S(C) and it occurs with probability one.



Example 4.5. Let Y = (0, 1] and let μ_0 : I → [0, 1] be Lebesgue measure on I = the left-half open intervals in (0, 1]. Define f : (0, 1] → R as the tenth-place digit function. Thus, if x ∈ (0, 1] and we write x = 0.x_1 x_2 x_3 ... in base-ten notation (taking the nonterminating expansion if x has two expansions), then f(x) := x_1. From Problem 3 in Exercises 2.4 we know that f : Y → R is an I-simple random variable. Moreover, it's easy to check that E(f) = 1/10. Hence, if f_i is defined as in (4.6) and C denotes the cylinder sets of (0, 1]^∞ generated by I, by the Strong Expectation Theorem we know that the event

{ lim_{n→∞} (f_1 + f_2 + ⋯ + f_n)/n = 1/10 }

belongs to S(C) and it occurs with probability one. In other words, if we sample numbers in (0, 1] at random and average their tenth digits, with probability 1 these averages approach 1/10 as the number of samples increases.

4.2.2. A couple remarks. Our first remark deals with the limitations of Theorem 4.6. Consider Example 4.5 but now let f : (0, 1] → R be the function f(x) = x; in other words, f represents the actual number (not its tenth digit) picked from the interval (0, 1]. The function f is not an I-simple function, so its expected value is not yet defined! However, if it were defined it should be 1/2, and the expectation theorem in this case should therefore read: If we sample numbers in (0, 1] at random and average them, with probability 1 these averages approach 1/2 as the number of samples increases. However, to prove this rigorously we need to learn expected values (integrals) of functions more general than simple functions. We shall study integration in the next chapter and prove a very general SLLN in Section 6.6.

For our second remark, we note that the SLLN is really a feature of countable additivity in the sense that it may fail to hold for finitely additive probabilities. Here's a simple example.

Example 4.6. As usual, let X = Y^∞ where Y = {0, 1} and let μ be the infinite product measure assigning the probability p for obtaining a head on any given flip. Let us suppose that we live in a world where coins eventually flip to an infinite run of tails. The sample space in such a world is the subset T ⊆ X where

T := {x ∈ X ; there is an N with x_i = 0 for all i ≥ N}.

Define I := {A ∩ T ; A ∈ C} and define ν : I → [0, 1] by ν(A ∩ T) := μ(A). In Problem 1 we ask you to check that I is a semiring and ν is finitely additive, but not countably additive. Nonetheless, being finitely additive, ν extends uniquely to a finitely additive set function ν : R(I) → [0, 1]. For each i ∈ N, define f_i : T → [0, 1] as before: f_i(x) = 1 if x_i = 1 and f_i(x) = 0 if x_i = 0. Is it true that the SLLN holds for the f_i's? To answer this question, let x ∈ T. Then there is an N such that x_i = 0 for all i ≥ N. Therefore f_i(x) = 0 for all i ≥ N, so

(f_1(x) + ⋯ + f_n(x))/n = (f_1(x) + ⋯ + f_N(x))/n   for all n ≥ N.

Taking n → ∞ we see that lim_{n→∞} (f_1(x) + ⋯ + f_n(x))/n = 0. Thus, assuming p > 0,

{ lim_{n→∞} (f_1 + ⋯ + f_n)/n = p } = ∅.

Hence,

ν{ lim_{n→∞} (f_1 + ⋯ + f_n)/n = p } = 0.

The SLLN fails!

Although the SLLN fails, in Problem 1 you will prove that the WLLN holds!

There are at least three ways to react when confronted with such an example. One way is to view finitely additive probabilities as pathological. Another way is to dismiss countably additive probabilities (because simple examples should only validate theories, not give counterexamples to them). A third way is to dismiss the underlying model. The third reaction is the best, because a world in which coins eventually flip to all tails is certainly bogus! See Problem 2 for a probability paradox involving the above example.
Exercises 4.2.

1. In this problem and the next one we study the set function ν̄ : R(I) → [0, 1] defined in Section 4.2.2. In particular, (ii) proves that ν is well-defined.
   (i) Prove that T is countable.
   (ii) Prove that if A ∩ T = B ∩ T where A, B ∈ C, then A = B.
   (iii) Prove that ν is finitely additive.
   (iv) Given x ∈ T and ε > 0, show there is an I ∈ I with x ∈ I and ν(I) < ε.
   (v) Since T is countable, we can write T = {t1, t2, . . .}. Prove there are sets I1, I2, . . . ∈ I such that ti ∈ Ii and ν(Ii) < 1/2^{i+1}. Use this to show that ν is not countably subadditive and hence not countably additive.
   (vi) Prove that the WLLN holds.

2. (A finitely additive probability paradox; cf. Problem 8 in Exercises 3.3) Jack and Jill, on top of a hill, each flip a coin infinitely many times. Suppose they live in a world where coins eventually flip to an infinite run of tails. They record the number of flips it takes to throw the last head (until an infinite run of tails occurs) and the one with the smallest number wins. You call out either Jack or Jill's name, and the person who you call on tells you his or her number; at this point you don't know what number the other person got. Then they reveal their numbers. Who wins? In this problem we describe a model of this situation, then answer the question.
   (i) Let F ⊆ P(T) denote the collection of all finite subsets of T, and consider the collection A := {A ∪ F ; A ∈ R(I), F ∈ F} and the set function

       P : A → [0, 1]   defined by   P(A ∪ F) := ν̄(A)

   for all A ∈ R(I) and F ∈ F. Prove that A is a ring and P is a finitely additive probability set function on A.
   (ii) You call out a person's name at random, say Jill. Suppose that Jill told you she threw the last head on flip n. Let A be the event that Jack wins or they tie. Show that A ∈ A (in fact, A ∈ F) and P(A) = 0. What's your conclusion?

3. We show that the converse to Property (4) in Lemma 4.5 doesn't hold. Let X = [0, 1] with Lebesgue measure. Given n ∈ N, we can write n = 2^k + i where k ∈ {0, 1, 2, . . .} and 0 ≤ i < 2^k, and we let fn be the characteristic function of the interval (i/2^k, (i+1)/2^k]. (To get an idea of what these functions look like, it may be helpful to draw pictures of f1, f2, f3, . . . , f7.)
   (i) Show that for each ε > 0, lim_{n→∞} m{|fn| ≥ ε} = 0.
   (ii) Show that lim_{n→∞} fn(x) does not exist at any x ∈ [0, 1], so {lim fn = 0} = ∅; in particular, m{lim fn = 0} = 0. Thus, the converse to Property (4) in Lemma 4.5 can fail badly.

4. (Borel's simply normal number theorem for binary) Let b ∈ N with b ≥ 2. Given a number x ∈ [0, 1], we can write it in base b, that is, its b-adic expansion:

   (4.7)    x = x1/b + x2/b^2 + x3/b^3 + ··· ,

   where the xi's are in the set of digits Y := {0, 1, . . . , b − 1}. A number x ∈ [0, 1] may have two b-adic expansions in (4.7), one that is terminating and the other nonterminating (which occurs if and only if x ≠ 1 and x·b^n ∈ N for some n ∈ N); in order to have a unique expansion we agree to write such numbers in their nonterminating b-adic expansions. Borel's simply normal number theorem deals with the frequency of digits occurring in the expansion (4.7). Please reread Problems 7 and 8 in Exercises 2.4 at this time; you may assume and use those results.
   (i) Let F : Y^∞ → [0, 1] be the map defined by

       F(x1, x2, x3, . . .) := x1/b + x2/b^2 + x3/b^3 + ···   for all (x1, x2, x3, . . .) ∈ Y^∞.

   Let μ0 : P(Y) → [0, 1] assign fair probabilities, μ0(A) = #A/b, and finally, let μ : S(C) → [0, 1] denote the infinite product of μ0. It turns out that μ is basically Lebesgue measure in the following sense: Prove
       (a) A ∈ B if and only if F^{-1}(A) ∈ S(C), in which case μ(F^{-1}(A)) = m(A).
       (b) B ∈ S(C) if and only if F(B) ∈ B, in which case μ(B) = m(F(B)).
   (ii) For the rest of this problem, assume b = 2; we shall deal with the general case in the next problem. The number x ∈ [0, 1] is said to be simply normal in base 2 if it asymptotically has the same number of 0s and 1s in its binary expansion in the sense that

       lim_{n→∞} (x1 + x2 + ··· + xn)/n = 1/2.

   Using the Strong Law of Large Numbers, prove that all numbers in [0, 1], except for a set of measure zero, are simply normal; more precisely, prove that if

       A := { x ∈ [0, 1] ; lim_{n→∞} (x1 + x2 + ··· + xn)/n = 1/2 },

   then A is a Borel set and m(A) = 1. We remark that although every number in [0, 1] is simply normal in base 2 except for a set of measure zero, it's not easy to determine whether any given number is normal. For instance, it's not known whether (the decimal parts of) e, π, log 2, or √2 are simply normal in base 2! (Of course, one can concoct simply normal numbers such as 0.101010101010101010 . . ..)

5. (Borel's simply normal number theorem) We generalize the previous problem and prove Borel's celebrated simply normal number theorem published in 1909 [51]. Let b ∈ N with b ≥ 2. Fix a digit d ∈ Y = {0, 1, . . . , b − 1} and define f : Y → R by

       f(x) = 1 if x = d,   and   f(x) = 0 otherwise.

   For each i, define fi : [0, 1] → R by fi(x) = f(xi), where xi is the ith digit of x in the expansion (4.7). Thus, fi observes if the ith digit is d. Intuitively speaking, since there are a total of b digits, for a randomly picked number x ∈ [0, 1] it seems reasonable that in its b-adic expansion (4.7) the digit d should appear with frequency 1/b, that is, it should be that

   (4.8)    lim_{n→∞} (f1(x) + f2(x) + ··· + fn(x))/n = 1/b.

   (i) Assume the results in Problem 7 in Exercises 2.4 and (i) of Problem 4. Prove that the set of x ∈ [0, 1] satisfying (4.8) is Borel and has Lebesgue measure 1.
   (ii) A number x ∈ [0, 1] is said to be simply normal in base b if given any digit d ∈ {0, 1, . . . , b − 1}, the limit (4.8) holds. In (i) you proved that all x ∈ [0, 1], except for a set of measure zero, are simply normal in any fixed base b. A number x ∈ [0, 1] is said to be simply normal if it's simply normal in every base b ≥ 2. Prove that all x ∈ [0, 1], except for a set of measure zero, are simply normal. Note: It is not known how to describe, in a simple way, even one simply normal number!
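As an empirical companion to Problems 4 and 5 (again our own illustration; it proves nothing, and the choice of √2 is just an example of a number whose simple normality is unknown), the following Python sketch computes the first few thousand base-b digits of √2 − 1 exactly with integer arithmetic and tabulates the digit frequencies, which come out close to 1/b.

    import math
    from collections import Counter

    def digit_frequencies(b, num_digits):
        # floor((sqrt(2) - 1) * b^num_digits), computed exactly via math.isqrt.
        big = math.isqrt(2 * b ** (2 * num_digits)) - b ** num_digits
        digits = []
        for _ in range(num_digits):
            big, d = divmod(big, b)  # peel off base-b digits (order doesn't matter for counts)
            digits.append(d)
        counts = Counter(digits)
        return {d: counts[d] / num_digits for d in range(b)}

    print(digit_frequencies(2, 5000))   # frequencies of 0 and 1, each near 1/2
    print(digit_frequencies(10, 5000))  # frequencies of 0, ..., 9, each near 1/10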


4.3. Littlewood's first principle(s), Borel measures and completions

Recall that a set A ⊆ Rn is Lebesgue measurable means that

    m*(E) = m*(E ∩ A) + m*(E ∩ A^c)   for every set E ⊆ Rn.

We interpret this geometrically as saying that A has distinct edges, so inside of E, the measure of points in A and not in A exactly add up to the measure of E. The purpose of this section is to understand another geometric interpretation of a measurable set in terms of elementary figures and the topology of Rn according to Littlewood's first principles, named after John Littlewood (1885–1977). Littlewood's second and third principles are in Section 5.3. We start this section by stating some results in the general case, which are good enough for most applications, then we'll specialize to Rn.

4.3.1. Measurability and nonmeasurability. If μ : I → [0, ∞] is an additive set function on a semiring I of subsets of a set X, the regularity theorem 3.16 says that given any A ⊆ X, there is a B ∈ S(I) ⊆ M* such that

    A ⊆ B and μ*(A) = μ*(B).

In other words, we can cover an arbitrary set A ⊆ X by an element of S(I) with the same outer measure as A. Note that the equality μ*(A) = μ*(B) does not immediately imply that μ*(B \ A) = 0. Indeed, if this were true, then because μ* is a complete measure, B \ A would be measurable and hence A = B \ (B \ A), being the difference of two measurable sets, would also be measurable. In other words, if A ⊆ X and there is a set B ∈ S(I) such that

    A ⊆ B and μ*(A) = μ*(B) and μ*(B \ A) = 0,

then A is measurable. If μ is σ-finite, the converse holds by the following theorem.

Regularity and measurable sets

Theorem 4.7. Let μ : I → [0, ∞] be an additive set function on a semiring I of subsets of a set X. If there is a B ∈ S(I) such that

    A ⊆ B with μ*(A) = μ*(B) and μ*(B \ A) = 0,

then A is μ*-measurable, with the converse holding if μ is σ-finite.

Proof: We just have to prove the converse. So assume σ-finiteness, which means X = ∪_{n=1}^∞ Xn where {Xn} ⊆ I is a sequence of pairwise disjoint sets with finite measure, and hence with finite outer measure (because Xn is a cover of itself, so μ*(Xn) ≤ μ(Xn) by definition of μ*). Then given A ∈ M*, write A = ∪_{n=1}^∞ An where An = A ∩ Xn ∈ M*, which are disjoint for different n's with finite outer measure. By the Regularity Theorem we know there is a set Bn ∈ S(I) with An ⊆ Bn and μ*(An) = μ*(Bn). Since μ*(Bn) = μ*(An) < ∞, subtractivity implies that μ*(Bn \ An) = 0. If B = ∪_{n=1}^∞ Bn, then B ∈ S(I) and

    B = ∪_{n=1}^∞ Bn = ∪_{n=1}^∞ [ An ∪ (Bn \ An) ] = A ∪ N,

where we used that A = ∪_{n=1}^∞ An and we put N = ∪_{n=1}^∞ (Bn \ An). Since μ*(Bn \ An) = 0 for each n, by subadditivity we have μ*(N) = 0. Since B = A ∪ N it follows that μ*(A) = μ*(B) and μ*(B \ A) = 0, which completes our proof.

Returning to our discussion before this theorem, we see that if A ⊆ X is not measurable, then for any B ∈ S(I) with A ⊆ B and μ*(A) = μ*(B), we must have μ*(B \ A) > 0. Here's a picture:

    [Figure: A ⊆ B, where B is the region on and inside of an oval; (1) μ*(A) = μ*(B) and (2) μ*(B \ A) > 0.]

The set B is supposed to be the region on and inside of the oval. Let's consider the statements (1) μ*(A) = μ*(B) and (2) μ*(B \ A) > 0. (1) can be interpreted as saying that there is no volume between A and B, while (2) says that there is volume between A and B! In view of this dichotomy, we visualize A as having blurry or cloudy edges because, in a sense, the substance in B and not in A is empty (this is (1)) and, on the other hand, it takes up space (this is (2))! See Problem 4 for an example of a nonmeasurable set and see Section 4.4 where we present the most famous nonmeasurable set of them all, Vitali's set.

The following corollary is our first example of a Littlewood principle; it says that μ*-measurable sets are just elements of S(I) up to sets of measure zero.

The FUN theorem for general measures

Corollary 4.8. If μ : I → [0, ∞] is a σ-finite additive set function on a semiring I, then

    A ∈ M*  ⟺  A = F ∪ N,

where F ∈ S(I) and N has measure zero; in fact, N is a subset of an element of S(I) of measure zero. Thus, μ*-measurable sets are, up to sets of measure zero, elements of S(I).

Proof: The direction ⟸ is automatic (why?) so we just prove ⟹. Let A ∈ M*. Then A^c ∈ M*, so by Theorem 4.7, recalling that μ is σ-finite so we can apply the converse, we know there is a B ∈ S(I) such that A^c ⊆ B and μ*(B \ A^c) = 0. Let F = B^c ∈ S(I). Then taking complements of A^c ⊆ B we see that F ⊆ A, and since B \ A^c = B ∩ A = A ∩ F^c = A \ F, we have μ*(A \ F) = 0. Thus, A = F ∪ N, where N = A \ F has μ*-measure zero. By the Regularity Theorem, N is a subset of an element of S(I) of the same measure as N, namely zero. This proves our result.

Here's a schematic of the situation:

    [Figure: a blob A = F ∪ N, with F the interior of the blob and N its boundary.]

In this schematic, A ∈ M* is a blob, F ∈ S(I) is the interior of the blob, which makes up most of A, and N is represented by the boundary of A and is supposed to be a small, measure-zero part of A. Now compare the statement

    (1) A ∈ M*  ⟺  A = F ∪ N, where F ∈ S(I) and N has measure zero,

with the Carathéodory definition of M*:

    (2) A ∈ M*  ⟺  μ*(E) = μ*(E ∩ A) + μ*(E ∩ A^c) for all E ⊆ X.

The formulation (1) for measurability is, in my opinion, a conceptually easier way to understand measurability than (2). Here's an immediate corollary (of the corollary) for additive set functions on the left-half open boxes I^n in Rn.

The FUN theorem for Rn

Corollary 4.9. If μ : I^n → [0, ∞) is additive, then

    A ∈ M*  ⟺  A = F ∪ N,

where F is a Borel set and N has μ*-measure zero; in fact, N is a subset of a Borel set of measure zero. Thus, μ*-measurable sets are, up to sets of measure zero, Borel sets.

We stated this theorem for general additive set functions μ : I^n → [0, ∞), but the main examples to keep in mind are Lebesgue measure on Rn and Lebesgue–Stieltjes measures on I^1. In particular, Lebesgue measurable sets are just Borel sets up to sets of Lebesgue measure zero. The precise type of Borel set F is contained in Part (b) of Problem 3.

4.3.2. Littlewood's first principle(s) for Rn. Littlewood's first principle [251, p. 26] for subsets of R states that

    Every [finite Lebesgue] (measurable) set is nearly a finite sum of intervals.

In Rn we interpret this principle as follows: A set with finite Lebesgue outer measure is Lebesgue measurable if and only if it is nearly an elementary figure. We can make the word "nearly" precise, meaning that for any ε > 0, the set, call it A, differs from an elementary figure by a set of measure less than ε in the sense that there exists an elementary figure I ∈ E^n with

    m*(A \ I) < ε and m*(I \ A) < ε.

See the left-hand picture here:

    [Figure 4.1. On the left, A (a disk) is nearly equal to an elementary figure I in the sense that the differences A \ I and I \ A have small measure. On the right, A (a blob with a jagged edge) is covered by an open set U (represented by an ellipse) such that the difference U \ A has small measure.]

Thus, we interpret the term "nearly" in Littlewood's first principle as "up to ε-sets" (that is, sets of measure less than ε). In Theorem 4.10 below we extend this principle to encompass more general measures on Rn (not just Lebesgue measure)


and taking advantage of the topological structure of Rn, we can give an alternative formulation of Littlewood's principles in terms of open and closed sets. In the following theorem we use the notion of a Gδ set (pronounced "gee-delta"), which is a countable intersection of open sets, and an Fσ set (pronounced "ef-sigma"), which is a countable union of closed sets. Note that Gδ and Fσ sets are Borel sets, and a set is a Gδ set if and only if its complement is an Fσ set. These sets show up often; see Problem 1. As with Corollary 4.9, we state the following theorem for general measures μ : I^n → [0, ∞), but the main examples to keep in mind are μ = m, Lebesgue measure, and Lebesgue–Stieltjes measures on I^1. In the general case, Parts (2), (3) of the following theorem say (see the right-hand picture in Figure 4.1) that μ*-measurable sets are nearly open sets (or closed sets). Parts (4) and (5) of the following theorem can be interpreted as saying that Gδ and Fσ Borel sets essentially make up all elements of M*; that is, μ*-measurable sets are essentially Gδ sets (or Fσ sets). We remark that the following Littlewood principles for Rn hold in greater generality; see Problems 2 and 8.

Littlewood's first principle(s) for Rn

Theorem 4.10. Let μ : I^n → [0, ∞) be additive and let A ⊆ Rn.
(1) If μ*(A) < ∞, then A is μ*-measurable if and only if given ε > 0 there is an I ∈ E^n with

    μ*(A \ I) < ε and μ*(I \ A) < ε.

Without the assumption μ*(A) < ∞, the set A is μ*-measurable if and only if any one of the Properties (2)–(5) holds.
(2) Given ε > 0 there is an open set U ⊆ Rn such that A ⊆ U and μ*(U \ A) < ε.
(3) Given ε > 0 there is a closed set C ⊆ Rn such that C ⊆ A and μ*(A \ C) < ε.
(4) There is a Gδ set G ⊆ Rn such that A ⊆ G with μ*(G \ A) = 0 and μ*(A) = μ*(G).
(5) There is an Fσ set F ⊆ Rn such that F ⊆ A with μ*(A \ F) = 0 and μ*(F) = μ*(A).

Proof: We shall prove (1), (2), and (4), leaving the equivalence of measurability to (3) and (5) for Problem 3.

Step 1: We begin by establishing a useful fact. Let A ⊆ Rn be arbitrary and let ε > 0. We shall prove there exists an open set U ⊆ Rn such that

    A ⊆ U and μ*(U) ≤ μ*(A) + ε.

If μ*(A) = ∞, then we can take U = Rn and we're done, so assume that μ*(A) < ∞. We now do another ε/2^k proof! By definition of infimum in the equality

    μ*(A) = inf { Σ_{k=1}^∞ μ(Ik) ; A ⊆ ∪_{k=1}^∞ Ik, Ik ∈ I^n },

there are sets Ik ∈ I^n such that

(4.9)    A ⊆ ∪_{k=1}^∞ Ik with Σ_{k=1}^∞ μ(Ik) ≤ μ*(A) + ε/2.

Now the idea to get the desired open set U is to replace the Ik's, which are of the form Ik = (ak, bk], by slightly larger open boxes. To this end, observe that

    Ik = ∩_{j=1}^∞ (ak, bk + 1/j],

so by the continuity of measures,

    lim_{j→∞} μ( (ak, bk + 1/j] ) = μ(Ik).

Hence, for each k we can choose δk > 0 so that

    μ( (ak, bk + δk] ) < μ(Ik) + ε/2^{k+1}.

By monotonicity, we have μ*( (ak, bk + δk) ) ≤ μ( (ak, bk + δk] ), and by definition of μ*, we have μ*(Ik) ≤ μ(Ik). Thus,

    μ*( (ak, bk + δk) ) < μ(Ik) + ε/2^{k+1}.

Let Jk = (ak, bk + δk) and set U = ∪_{k=1}^∞ Jk. Then U is open, A ⊆ U, and

    μ*(U) ≤ Σ_{k=1}^∞ μ*(Jk) ≤ Σ_{k=1}^∞ ( μ(Ik) + ε/2^{k+1} ) ≤ ( μ*(A) + ε/2 ) + ε/2 = μ*(A) + ε.

Step 2: Assuming μ is σ-finite, we prove

    A is μ*-measurable ⟹ (2) ⟹ (4) ⟹ A is μ*-measurable,

which shows the equivalence of measurability to (2) and (4). To prove "A is μ*-measurable ⟹ (2)", let ε > 0 and assume A is measurable. Writing Rn = ∪_{k=1}^∞ Xk where {Xk} is a sequence of pairwise disjoint boxes with finite measure, we have A = ∪_{k=1}^∞ Ak where Ak = A ∩ Xk with μ*(Ak) < ∞. It follows by Step 1 that there is an open set Bk with Ak ⊆ Bk and μ*(Bk) − μ*(Ak) < ε/2^k. This implies that μ*(Bk \ Ak) < ε/2^k by subtractivity. If U = ∪_{k=1}^∞ Bk, then U is open and U \ A ⊆ ∪_{k=1}^∞ (Bk \ Ak), so by countable subadditivity,

    μ*(U \ A) ≤ Σ_{k=1}^∞ μ*(Bk \ Ak) < Σ_{k=1}^∞ ε/2^k = ε.

To see that (2) ⟹ (4), assume that for any ε > 0 there is an open set U such that A ⊆ U and μ*(U \ A) < ε. Then, in particular, for each k = 1, 2, . . ., there is an open set Bk such that A ⊆ Bk and

    μ*(Bk \ A) < 1/k.

Thus, if B = ∩_{k=1}^∞ Bk, then B is a Gδ, A ⊆ B, and since B ⊆ Bk for each k, we have

    μ*(B \ A) ≤ μ*(Bk \ A) < 1/k for all k ∈ N.

Since k is arbitrary, it follows that μ*(B \ A) = 0. Finally, to show that (4) ⟹ A is μ*-measurable, assume there is a Gδ set G such that

    A ⊆ G with μ*(G \ A) = 0.

Then G \ A has measure zero so is μ*-measurable, since μ* is a complete measure. Therefore, A = G \ (G \ A) is measurable since both G and G \ A are.

Step 3: We now prove the "only if" part of (1). Let A be μ*-measurable with finite measure and let ε > 0. Let A ⊆ ∪_{k=1}^∞ Ik as in (4.9), so that

    Σ_{k=1}^∞ μ(Ik) ≤ μ*(A) + ε/2, and hence Σ_{k=1}^∞ μ(Ik) < μ*(A) + ε.

Since the sum Σ_{k=1}^∞ μ(Ik) is finite, there exists an N such that Σ_{k=N+1}^∞ μ(Ik) < ε. Let I = ∪_{k=1}^N Ik. Then I ∈ E^n and we shall prove that this set has the required properties. First, observe that since A is covered by I1, I2, . . ., it follows that A \ I is covered by I_{N+1}, I_{N+2}, . . .. Thus, by definition of μ*(A \ I) we have

    μ*(A \ I) ≤ Σ_{k=N+1}^∞ μ(Ik) < ε.

Second, to prove that μ*(I \ A) < ε, observe that I \ A ⊆ ( ∪_{k=1}^∞ Ik ) \ A, so by monotonicity and subtractivity,

    μ*(I \ A) ≤ μ*( ( ∪_{k=1}^∞ Ik ) \ A ) = μ*( ∪_{k=1}^∞ Ik ) − μ*(A).

By subadditivity,

    μ*( ∪_{k=1}^∞ Ik ) ≤ Σ_{k=1}^∞ μ(Ik) < μ*(A) + ε,

and from this we get μ*(I \ A) < ε. This completes the "only if" part of (1).

Step 4: Lastly, we prove the "if" part of (1). So, assume that A, which has finite μ-outer measure, has the properties in (1); we shall prove that A is μ*-measurable. Let ε > 0. Then by (2) we just have to show there is an open set U such that A ⊆ U and μ*(U \ A) < ε. To this end, let U be given by Step 1 with ε in Step 1 replaced by ε/3. This implies, in particular, that

    A ⊆ U and μ*(U) < μ*(A) + ε/3.

By assumption there is an I ∈ E^n such that

    μ*(A \ I) < ε/3 and μ*(I \ A) < ε/3.

We shall now work with μ*(U \ A) = μ*(U ∩ A^c) until we get it less than ε; of course, we shall see three ε/3's on the way, hence the ε/3 above! Now,

    μ*(U ∩ A^c) = μ*(U ∩ A^c ∩ I) + μ*(U ∩ A^c ∩ I^c)    (I is μ*-measurable)
                ≤ μ*(A^c ∩ I) + μ*(U ∩ I^c)               (monotonicity)
                < ε/3 + μ*(U ∩ I^c)                        (since μ*(I \ A) < ε/3)
                = ε/3 + μ*(U) − μ*(U ∩ I)                  (I is μ*-measurable)
                < 2ε/3 + μ*(A) − μ*(U ∩ I)                 (since μ*(U) < μ*(A) + ε/3)
                ≤ 2ε/3 + μ*(A) − μ*(A ∩ I)                 (since μ*(A ∩ I) ≤ μ*(U ∩ I))
                = 2ε/3 + μ*(A ∩ I^c)                       (I is μ*-measurable)
                < ε                                        (since μ*(A \ I) < ε/3).

This completes the proof of Littlewood's first principles.

To repeat our discussion before this theorem, Littlewood's principle tells us that μ*-measurable sets are not badly behaved at all: They are nearly elementary figures or open/closed sets, and they are essentially Borel sets.

4.3.3. Regular Borel measures. Notice that as a consequence of Step 1 in the proof of Theorem 4.10, for any set A ⊆ Rn, we have

    μ*(A) = inf { μ*(U) ; A ⊆ U, U open}.

Thus, we can determine the outer measure of any subset of Rn using the open sets. If A is in addition μ*-measurable, we can also use the compact sets to determine μ*(A).

Theorem 4.11. If μ : I^n → [0, ∞) is additive, then any compact subset of Rn has finite μ*-measure and the following regularity properties hold:
(1) For every set A ⊆ Rn, we have

    μ*(A) = inf { μ*(U) ; A ⊆ U, U open}.

(2) For every μ*-measurable set A ⊆ Rn, we have

    μ*(A) = sup{ μ*(K) ; K ⊆ A, K compact}.

Proof: As explained above, Property (1) follows from Step 1 in the proof of Theorem 4.10. To prove (2), let A be μ*-measurable and put S := { μ*(K) ; K ⊆ A, K compact}; we need to show that μ*(A) = sup S. If K ⊆ A, then by monotonicity, μ*(K) ≤ μ*(A), so μ*(A) is an upper bound for S. To show that μ*(A) is the least upper bound of S, let α < μ*(A); we shall prove that α is not an upper bound of S. Choose ε > 0 such that α + ε < μ*(A). From Property (3) of Theorem 4.10 we know there is a closed set C such that C ⊆ A and μ*(A \ C) < ε. Since A = C ∪ (A \ C), we have

    μ*(A) = μ*(C) + μ*(A \ C) ≤ μ*(C) + ε.

Therefore, μ*(A) ≤ μ*(C) + ε, so μ*(C) ≥ μ*(A) − ε; recalling that α + ε < μ*(A), that is, α < μ*(A) − ε, we conclude that

    α < μ*(C).

For each j ∈ N, let Kj = C ∩ [−j, j]^n. Then {Kj} is a nondecreasing sequence of compact sets such that C = ∪_{j=1}^∞ Kj, so by continuity of measures, we have

    μ*(C) = lim_{j→∞} μ*(Kj).

By definition of limit and the fact that α < μ*(C), it follows that there is some K = Kj such that α < μ*(K). Since K = Kj ⊆ C ⊆ A, this shows that α is not an upper bound of S.
In general, given a topological space X such that every compact set is a Borel set, a measure on the σ-algebra of Borel subsets of X is said to be a regular Borel measure if every compact subset of X has finite μ-measure and, for any Borel set A ⊆ X, Properties (1) and (2) of Theorem 4.11 hold with μ* replaced by μ. Explicitly, a measure μ : B(X) → [0, ∞] on the Borel sets of X is a regular Borel measure if for all compact sets K ⊆ X, we have K ∈ B(X) and μ(K) < ∞, and also
(1) For every Borel set A ⊆ X, we have μ(A) = inf {μ(U) ; A ⊆ U, U open}.
(2) For every Borel set A ⊆ X, we have μ(A) = sup{μ(K) ; K ⊆ A, K compact}.
Thus, Theorem 4.11 implies that given an additive set function μ : I^n → [0, ∞), the restriction of μ*, the outer measure induced by μ, to the Borel sets is a regular Borel measure. In particular, when restricted to the Borel sets, Lebesgue measure and Lebesgue–Stieltjes measures are examples of regular Borel measures. In the case X = Rn the condition that μ(K) < ∞ on compact sets K is enough to guarantee conditions (1) and (2); see Problem 7 for the proof.

Finite on compacta ⟹ regular

Theorem 4.12. A measure on B^n that's finite on compact sets is regular.

4.3.4. Translations, dilations, and the cube principle. It should be obvious that Lebesgue measure is translation invariant in the sense that the measure of a set doesn't change if the set is moved:

    [Figure: a rectangle translated by a vector x.]

Here we translated the rectangle by a vector x. Another obvious property is that measure should scale with the dimension. For example, if a line segment is doubled in length, then the measure of the new segment is two times the original length. If the sides of a rectangle are each doubled, then the measure of the new rectangle is 2^2 = 4 times the original measure as seen here:

    [Figure: a rectangle and the rectangle with both sides doubled.]


More generally, if the sides of a box in Rn are each doubled, then the measure of the new box is 2^n times the original measure. In Proposition 4.13 we prove that this dilation property of outer measure holds for all subsets of Rn. To make these statements concerning translations and dilations rigorous, we make some definitions. Given x ∈ Rn and A ⊆ Rn, the translation of A by x is denoted by A + x or x + A and is defined by

    A + x = x + A := {a + x ; a ∈ A} = {y ∈ Rn ; y − x ∈ A}.

Given r > 0, the dilation of A by r is denoted by rA:

    rA := {ra ; a ∈ A} = {y ∈ Rn ; r^{-1} y ∈ A}.

Let x = (x1, . . . , xn) and let I = (a1, b1] × ··· × (an, bn] ∈ I^n. Then observe that

    I + x = (a1 + x1, b1 + x1] × ··· × (an + xn, bn + xn] and rI = (ra1, rb1] × ··· × (ran, rbn].

Thus, I^n is invariant under translations and dilations, and

    m(I + x) = m(I) and m(rI) = r^n m(I).

In Proposition 4.13 we show that on any subset of Rn, not just boxes, Lebesgue outer measure is invariant under translations and scales correctly under dilations. However, before doing this, we shall discuss a little . . .

Philosophy: If a certain property of Lebesgue measure holds for cubes, then it holds for the Lebesgue outer measure of any set.

To see why this philosophy should be true, recall from the dyadic cube theorem that any open set can be written as a union of pairwise disjoint (dyadic) cubes. Thus, if a certain property holds for the volume of cubes it should hold for open sets. Now Theorem 4.11 says that given any set A ⊆ Rn,

    m*(A) = inf {m(U) ; A ⊆ U, U open}.

Since m*(A) can be expressed in terms of this infimum involving open sets only, if a certain property holds for the Lebesgue measure of open sets it should pass through the infimum to hold for arbitrary sets. This leads us to

The cube principle: If a certain property of Lebesgue measure holds for cubes (elements of I^n whose sides have the same length), then it holds for the Lebesgue outer measure of any set.

Of course, there is a corresponding "box principle", but cubes are sometimes easier to work with; also, this principle does not hold for every property, but it does hold in many cases (each case should be checked).

Translation and dilation properties

Proposition 4.13. If A ⊆ Rn is arbitrary, then for any x ∈ Rn and r > 0,

    m*(A) = m*(A + x) and m*(rA) = r^n m*(A).

Moreover, A is Lebesgue measurable if and only if the translation A + x (or the dilation rA) is Lebesgue measurable.


Proof: We prove this proposition in two steps.

Step 1: It can be easily checked that the identities m*(A) = m*(A + x) and m*(rA) = r^n m*(A) hold for cubes, so by the cube principle they must hold for all A ⊆ Rn . . . done. Well, just to convince ourselves that we're not cheating here, we shall work through the proof that m*(A) = m*(A + x). Consider the case when A = U ⊆ Rn is open. Then by the dyadic cube theorem there are pairwise disjoint cubes I1, I2, . . . ∈ I^n such that U = ∪_{k=1}^∞ Ik. Then

    U + x = ∪_{k=1}^∞ (Ik + x),

which is easily checked and is still a union of pairwise disjoint sets, so by countable additivity and the fact that m(Ik + x) = m(Ik) for each k, we have

    m(U + x) = Σ_{k=1}^∞ m(Ik + x) = Σ_{k=1}^∞ m(Ik) = m(U).

Now given any subset A ⊆ Rn, by Theorem 4.11 we have

    m*(A + x) = inf {m(U) ; A + x ⊆ U, U open}
             = inf {m(U) ; A ⊆ U − x, U open}
             = inf {m(V + x) ; A ⊆ V, V open}    (put V = U − x)
             = inf {m(V) ; A ⊆ V, V open} = m*(A).

Step 2: We now consider Lebesgue measurability. Suppose that A ∈ M^n. Then from the FUN theorem (Corollary 4.9) we know that

    A = F ∪ N,

where F ∈ B^n and N has measure zero, and hence,

    A + x = (F + x) ∪ (N + x).

Since translations are homeomorphisms, by Proposition 1.14 they preserve Borel sets and hence F + x is a Borel set. Also, by Step 1 we have m*(N + x) = m*(N) = 0, so N + x has measure zero. Thus, by the FUN theorem, A + x is Lebesgue measurable. Hence, we have shown that if A is measurable, then A + x is measurable. Conversely, if A + x is Lebesgue measurable, then the argument above shows that (A + x) + y is Lebesgue measurable for any y ∈ Rn. Taking y = −x shows that (A + x) + (−x) = A is Lebesgue measurable. The proof that A is Lebesgue measurable if and only if rA is Lebesgue measurable is left to the reader.
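As a quick numerical sanity check on the dilation property (our own illustration in Python; Monte Carlo sampling stands in for Lebesgue outer measure, and the names are ours), we can estimate the area of a disk and of the disk dilated by r = 2, and watch the ratio come out near r^2 = 4.

    import random

    def monte_carlo_disk_area(radius, num_points=200_000, seed=0):
        # Estimate the area of the disk of the given radius centered at the origin
        # by sampling uniformly from the bounding square [-radius, radius]^2.
        rng = random.Random(seed)
        hits = 0
        for _ in range(num_points):
            x = rng.uniform(-radius, radius)
            y = rng.uniform(-radius, radius)
            if x * x + y * y <= radius * radius:
                hits += 1
        return (2 * radius) ** 2 * hits / num_points

    a1 = monte_carlo_disk_area(1.0, seed=0)
    a2 = monte_carlo_disk_area(2.0, seed=1)
    print(a1, a2, a2 / a1)  # ratio should be close to 2^2 = 4

This verifies nothing rigorously, of course; it merely mirrors m*(rA) = r^n m*(A) for the concrete set A = unit disk and n = 2.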

The cube principle is quite handy; for example, we see this principle again in Theorem 4.15 in Section 4.4.

4.3.5. Completions of general measures. We now discuss the important subject of completions. To begin, recall that if μ : I → [0, ∞] is a σ-finite measure on a semiring I, then from Carathéodory's theorem (Theorem 3.11) we know that

    μ* : M* → [0, ∞]

is a complete measure, and from the FUN theorem (Corollary 4.8) we know that A ∈ M* if and only if A = F ∪ N where F ∈ S(I) and N is a subset of an element of S(I) of measure zero. Moreover, since N has measure zero, it follows that μ*(A) = μ*(F).


These properties serve as a guide to make an arbitrary measure complete. Indeed, let us consider an arbitrary measure μ : S → [0, ∞] on a σ-algebra S. We denote by S̄, called the completion of S with respect to μ, the collection of all sets of the form F ∪ N, where F ∈ S and N is a subset of an element of S of μ-measure zero. In Problem 11 (a must-do exercise!) you will prove that S̄ is a σ-algebra. We define the completion of μ,

    μ̄ : S̄ → [0, ∞],

as follows: If A = F ∪ N ∈ S̄, then

    μ̄(A) := μ(F).

In Problem 11 you will show that μ̄ is a complete measure on S̄, and you'll prove that the completion of Borel measure (Rn, B^n, m) is Lebesgue measure (Rn, M^n, m). We summarize these results in the following theorem.

Completions of measures

Theorem 4.14. If μ : S → [0, ∞] is a measure on a σ-algebra S, then S̄ is a σ-algebra and μ̄ : S̄ → [0, ∞] is a complete measure on S̄. Moreover, the completion of Borel measure (Rn, B^n, m) is Lebesgue measure (Rn, M^n, m).
Exercises 4.3. 1. In this problem we look at various examples of G and F sets. (a) Show that a countable union of F sets is an F set. (b) Show that a countable intersection of G sets is an G set. (c) Show that every countable subset of Rn is an F . (d) For a < b, show that the intervals (a, b), [a, b], and (a, b] are both F and G sets. (e) In R1 , show that the rational numbers form an F and the irrational numbers form a G . (f) Show that every open and closed set in Rn is both a G and F . (g) Let f : R R. (i) Show that the set of points where f is continuous (call it Cf ) is a G . Suggestion: Show that Cf = n=1 Gn with Gn := c R ; there is a > 0 such that x, x (c , c + ) = |f (x) f (x )| < 1/n ,

and show that Gn is open. (ii) Show that Df = {c X ; f is not continuous at c}, the set of discontinuity points of f , is an F set. 2. (Littlewoods rst principle(s) for general additive set functions) In this problem we generalize Littlewoods rst principles for Rn to the general case. Let : I [0, ] be an additive set function on a semiring I of subsets of a set X . Prove the following: (1) Let A X with (A) < . Then A is -measurable if and only if given > 0 there is an I R (I ) with (A \ I ) < and (I \ A) < .


(2) Let A X with (A) < . Then A is -measurable if and only if there is a B S (I ) such that If we drop the assumption (A) < , give a counterexample to the only if statement. (3) Assuming now that is -nite and let A X (without assuming (A) < ), prove that A is -measurable if and only if we have the equalities: 3. Let : I n [0, ) be an additive set function on I n . (a) Prove the equivalence of measurability, Property (3), and Property (5) in Littlewoods theorem 4.10. (b) Prove that A M A = F N , where F is an F and N is a subset of a G set of -measure zero. (c) Prove that a set A Rn is -measurable if and only if we have the equalities: 4. (Examples of nonmeasurable sets) Let X = R2 and I = {I R ; I I 1 }. (i) Prove that I is a semiring of subsets of X and let : I [0, ) be dened by (I R) := m(I ), where m(I ) is the usual Lebesgue measure of I . Prove that is a -nite measure. (ii) Prove that A M if and only if A = B R where B M 1 , that is, B is a Lebesgue measurable subset of R. In particular, given any subsets B, C R with C = R, the set B C X is not -measurable. For example, [0, 1] [0, 1] is not -measurable. 5. (Steinhaus theorem) In this problem we prove a fascinating result due to Hugo Steinhaus (18871972) proved in 1920. His theorem states that if A Rn is Lebesgue measurable and m(A) > 0, then the dierence set A A := {x y ; x, y A} contains a nonempty open ball containing the origin. To prove this, proceed as follows. (i) Let A M n with m(A) > 0. Prove that there is a compact set K and an open set U with K A U such that 0 < m(U ) < 2m(K ). Suggestion: Use Theorem 4.11 to nd a U and K satisfying these properties. (ii) For any r > 0, let Br Rn denote the open ball centered at the origin. Prove that there is a > 0 such that for all x B , we have x + K := {x + y ; y K } U . (iii) Prove that for all x B , we have (x + K ) K = . Conclude that for all x B , we have (x + A) A = . (iv) Finally, prove that B A A. (v) Basically redoing your proof, show that if : B n [0, ) is a translation invariant regular Borel measure and A Rn is a Borel set with (A) > 0, then A A contains a nonempty open ball containing the origin. 6. (Cauchys functional equation III) Please review Problem 6 in Exercises 1.6. Using Steinhaus theorem, prove that if f : R R is additive and bounded on a measurable set of positive measure, then f (x) = f (1) x for all x R. Suggestion: Show there is an open interval containing the origin on which f is bounded. 7. Prove that any measure : B n [0, ] that is nite on the compact sets is automatically a regular Borel measure. Suggestion: Dene 0 on I n by 0 (I ) = (I ) for each I I n and consider 0. 8. (Littlewoods rst principle(s) for regular Borel measures) Prove that properties (2) (6) of Littlewoods rst principle(s) for Rn can be generalized, verbatim, to a -nite regular Borel measure on a topological space X . For example, (2) in this inf { (U ) ; A U , U open} = (A) = sup{ (K ) ; K A, K compact}. inf {(B ) ; A B, B S (I )} = (A) = sup{(C ) ; C A, C S (I )}. AB with (B \ A) = 0 and (A) = (B ).


general case is the following: A subset A X is -measurable if and only if given > 0 there is an open set U X such that Similarly, prove the analogous statements for properties (3) (5). 9. Let A Rn be Lebesgue measurable. Prove that if 0 < a < m(A), there is a compact set K A with m(K ) = a. 10. (Borel and nonatomic measures) Please review Problems 5 and 6 in Exercises 3.2 for the relevant denitions; in particular, the previous problem shows that Lebesgue measure is nonatomic. More generally, if a regular Borel measure (on a topological space) has the property that singleton sets have measure zero, prove that the Borel measure is nonatomic. 11. (Completion of a measure) Let be a measure on a -algebra S . We denote by S , called the completion of S with respect to , the class of all sets of the form F N , where F S and N is a subset of an element of S of measure zero. (i) Prove that S is a -algebra. (ii) Dene : S [0, ] by (A) := (F ), where A = F N with F S and N is a subset of an element of S of measure zero. Show that is well-dened; in other words, if A = F N is another presentation of A, prove that (F ) = (F ). Show that is a complete measure on S . We call the completion of . Prove that if B A C with B, C S and (C \ B ) = 0, then A S and (B ) = (A) = (C ). Let : I [0, ] be a measure on a semiring I , assume that S (I ) S and that = |S where : P (X ) [0, ] is the outer measure generated by . If is -nite prove that S = M and = . In particular, the completion of Borel measure (Rn , B n , m) is Lebesgue measure (Rn , M n , m). We show that the -nite assumption is needed in (v). Let X be a uncountable set, I = S the -algebra of all subsets of X that are countable or have countable complements, and let = = the counting measure on S ; thus, for each A S , (A) = (A) = the number of points in A. Show that S = S and M = P (X ). AU and (U \ A) < .

(iii) (iv) (v)

(vi)

4.4. Geometry, Vitali's nonmeasurable set, and paradoxes

In this section we show that Lebesgue measure has all the geometric properties that our intuitive notions of length, area, and volume would lead us to expect. We also construct the famous Vitali set, a set that is not Lebesgue measurable, and we shall study

    A paradox, a paradox,
    A most ingenious paradox!^4

    ^4 Taken from The Pirates of Penzance by Gilbert and Sullivan.

4.4.1. The geometry of Lebesgue measure. Recall from Proposition 1.14 that Borel sets of topological spaces are preserved under homeomorphisms. This is false for Lebesgue measurability, as you'll prove in Problem 8 in Exercises 4.5. However, Lebesgue measurability is preserved under all affine transformations of Euclidean space, which are linear transformations followed by translations; in other words, an affine transformation is a map on Rn of the form

    Rn ∋ x ↦ b + Tx ∈ Rn,

for some fixed b ∈ Rn and linear transformation T : Rn → Rn. (If you need to review linear transformations, see Section ?? of the Appendix.) When T is an orthogonal transformation (composition of rotations and reflections), the affine transformation is called a rigid motion^5 of Euclidean space, and in this case not only is measurability preserved, measure is also preserved. It's, of course, obvious that measure does not depend on rigid motions: A box has the same volume when it is sitting flat on a table and when it is tipped on its side as in Figure 4.2.

    ^5 Some authors require that det T > 0, which is to say, T is a rotation.

    [Figure 4.2. Measure should not depend on whether we look at an object straight on or with our head turned.]

The invariance of measure under rigid motions follows from Proposition 4.13 of the last section and Theorem 4.15 below. The following theorem proves the following fact we all learned in linear algebra when we were first introduced to the determinant:

    |det T| = the factor by which a linear transformation T changes volume.

In particular, since |det O| = 1 for any orthogonal matrix O, it follows that volume is invariant under orthogonal transformations, as it should be.

Linear transformations and Lebesgue measure

Theorem 4.15. For any linear transformation T : Rn → Rn and for any set A ⊆ Rn, we have

(4.10)    m*(T(A)) = |det T| m*(A).

Moreover, if A is Lebesgue measurable, then T(A) is Lebesgue measurable; the converse holds if T is invertible.

Proof: The proof of this theorem is a little long, so it might be a good idea to skim it at a first reading of this section. The idea to prove (4.10) consists of two parts. In the first part, which is the easy part, we prove that for any invertible linear transformation T : Rn → Rn and subset A ⊆ Rn, we have

(4.11)    m*(T(A)) = D(T) m*(A), where D(T) := m*(T((0, 1]^n)).

In the second part, which is more difficult, we show that D(T) = |det T|. We break up our proof into several steps. In Step 1–Step 3 we only work with invertible linear transformations, and in Step 4 and Step 5 we consider noninvertible transformations.

Step 1: We prove (4.11). The cube principle applies to this situation (you should check this) so we just have to prove (4.11) for cubes. Let I = (a, b] ∈ I^n be a cube, where a = (a1, . . . , an) and b = (b1, . . . , bn). Since I is a cube, which we may assume is nonempty, we have bk = ak + c for all k with c > 0. Hence,

    I = (a, b] = a + (0, c]^n = a + c(0, 1]^n.


Therefore, by linearity, T (I ) = T (a) + c T ((0, 1]n ). By the translation and dilation properties of measure, we conclude that m (T (I )) = m (c T (0, 1]n ) = cn D(T ) = D(T ) m(I ), since m(I ) = cn . This proves (4.11). To nish proving our theorem, we just have to prove that D(T ) = | det T |. To prove this, the idea is to show that D(T ) has similar properties as | det T | with respect to products and inverses. Step 2: We claim that (4.12) D (I ) = 1 , D (T S ) = D (T ) D (S ) , D ( T 1 ) = D ( T ) 1 where I is the identity transformation and T and S are invertible linear transformations on Rn . Since I ((0, 1]n ) = (0, 1]n , it follows that D(I ) = 1 and by (4.11), we have D(T S ) = m (T S ((0, 1]n )) = D(T ) m (S ((0, 1]n )) = D(T ) D(S ). In particular, I = T T 1 which implies that D(T 1 ) = D(T )1 as required. Step 3: We now prove that D(T ) = | det T |. By Theorem ?? in the Appendix we know that any invertible matrix can be written as a product of elementary matrices, so T can be written in the form T = E1 E2 EN , where E1 , . . . , EN are elementary matrices. Therefore, by the multiplicative property of D(T ) found in the second equality in (4.12) we have D (T ) = D (E 1 )D (E 2 ) D (E N ). Since det T = (det E1 )(det E2 ) (det EN ), to prove that D(T ) = | det T | all we have to do is prove that D(E ) = | det E | for any elementary matrix. Now there are two types of elementary matrices, type I matrices of the form Ei (a) and type II of the form Eij (a), where for a R, a = 0, Ei (a) is the elementary matrix given by the operation multiply the ith row by a and Ei,j (a), where a R and i = j , is the elementary matrix given by the operation add a times the j th row to ith row. Consider a type I matrix Ei (a). Then Ei (a)((0, 1]n ) = (0, 1] Ii (0, 1], where all the factors equal (0, 1] except the ith one, which is Ii = (0, a] if a > 0 or Ii = [a, 0) if a < 0. From this formula, we see that (4.13) D(Ei (a)) = m (0, 1] Ii (0, 1] = |a|. = 1 = D ( I ) = D ( T T 1 ) = D ( T ) D ( T 1 ) ,

On the other hand, Ei (a) is obtained from the identity matrix by replacing the 1 in the ith diagonal spot by a. Thus, det Ei (a) = a, so D(E ) = | det E | for E of type I. Now consider a type II matrix Ei,j (a) where a R. Given such a matrix, we leave you to verify the identity (Hint: First prove that Eij (a)1 = Eij (a).) Therefore, by the multiplicative property of D, the second formula in (4.12), we see that D(Eij (a)1 ) = D(Ei (1)) D(Eij (a)) D(Ei (1)). Eij (a)1 = Ei (1)Eij (a)Ei (1).


By the third formula in (4.12), we have D(Eij (a)1 ) = D(Eij (a))1 , and by (4.13) we have D(Ei (1)) = D(Ej (1)) = 1. Therefore, D(Eij (a))1 = D(Eij (a)). Hence, D(Eij (a))2 = 1 and so D(Eij (a)) = 1. Since det Eij (a) = 1 as well, as you can easily verify, this shows that D(E ) = | det E | for E of type II. This completes the proof of (4.10) for invertible linear transformations T . Step 4: We now prove (4.10) assuming that T is not invertible. For T noninvertible, | det T | = 0, so the equality (4.10) will hold if we can prove that m (T (A)) = 0 for any A Rn . To prove this, note that by Theorem ?? in the Appendix, we can write T in the form T = SR, where S is an invertible matrix (a product of elementary matrices) and R is a matrix with at least one zero row; lets say the kth row where 1 k n. By (4.10) for invertible transformations we know that for any subset A Rn , m (T (A)) = m (S (R(A)) = | det S | m (R(A)). R(A) Rk1 {0} Rnk and since m (Rk1 {0} Rnk ) = 0 (can you prove this?), it follows that m (R(A)) = 0. Thus, m (T (A)) = 0. Step 5: We now prove the last statement of our theorem: If A is Lebesgue measurable, then T (A) is Lebesgue measurable with the converse holding if T is invertible. Let A M n and assume rst that T is not invertible. Then by Step 4 we know that T (A) has measure zero and hence is measurable. Assume now that T is invertible. Then from the FUN theorem (Corollary 4.9) we know that A = F N , where F B n and N has measure zero. Observe that Since T denes a homeomorphism on Rn (being invertible), by Proposition 1.14, T (F ) is a Borel set. Also, by (4.10) we have m (T (N )) = | det T | m (N ) = 0, so T (N ) has measure zero. Thus, by the FUN theorem, T (A) is Lebesgue measurable. Hence, we have shown that if A is measurable, then T (A) is measurable. Conversely, if T (A) is Lebesgue measurable, then (applying the previous statement to T 1 ) it follows that T 1 (T (A)) = A is Lebesgue measurable. T (A ) = T (F ) T (N ). Now by the fact that the kth row of R is zero, we have

In undergraduate vector calculus courses, determinants are usually related to volumes of parallelopipeds in R3. We can also do the same in Rn. Let V = {v1, . . . , vn} be a basis of Rn, so that the vectors v1, . . . , vn are linearly independent. We define the parallelopiped P(V) spanned by V as the set of points

    P(V) := { x1 v1 + x2 v2 + ··· + xn vn ; 0 ≤ xi ≤ 1, i = 1, . . . , n }.

When n = 2, P(V) is shown in Figure 4.3 and it's usually called a parallelogram instead of a parallelopiped. In Problem 2, you will prove that if A = [v1 ··· vn] is the matrix with columns v1, . . . , vn, then m(P(V)) = |det A|.


[Figure 4.3. The absolute value of the determinant of the matrix with columns v1, v2 is equal to the area of the parallelogram spanned by v1, v2 (with vertices 0, v1, v2, and v1 + v2).]
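To connect m(P(V)) = |det A| with something computable, here is a small Python sketch (ours, purely illustrative; Problem 2 asks for the actual proof). It Monte Carlo-estimates the area of the parallelogram spanned by v1, v2 and compares the estimate with |det A|.

    import random

    def parallelogram_area_check(v1, v2, num_points=200_000, seed=0):
        # Monte Carlo estimate of the area of P(V) = {x1*v1 + x2*v2 : 0 <= x1, x2 <= 1},
        # compared with |det A| for the matrix A with columns v1, v2.
        rng = random.Random(seed)
        det = v1[0] * v2[1] - v1[1] * v2[0]
        xs = [0, v1[0], v2[0], v1[0] + v2[0]]
        ys = [0, v1[1], v2[1], v1[1] + v2[1]]
        lox, hix, loy, hiy = min(xs), max(xs), min(ys), max(ys)
        box_area = (hix - lox) * (hiy - loy)
        hits = 0
        for _ in range(num_points):
            px, py = rng.uniform(lox, hix), rng.uniform(loy, hiy)
            # Solve (px, py) = x1*v1 + x2*v2 by Cramer's rule and test 0 <= x1, x2 <= 1.
            x1 = (px * v2[1] - py * v2[0]) / det
            x2 = (v1[0] * py - v1[1] * px) / det
            if 0 <= x1 <= 1 and 0 <= x2 <= 1:
                hits += 1
        return box_area * hits / num_points, abs(det)

    print(parallelogram_area_check((3.0, 1.0), (1.0, 2.0)))  # both values near 5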

4.4.2. Vitali's remarkable sets. The first person to exhibit a nonmeasurable set was Giuseppe Vitali (1875–1932), who did so in his 1905 paper "Sul problema della misura dei gruppi di punti di una retta" [401] (On the problem of measure of the set of points of a line). In 1908, Edward Van Vleck (1863–1943) [397], without knowledge of Vitali's work, also constructed such a set. Vitali's set was a subset of (0, 1/2), but in fact, given any set A ⊆ Rn of positive outer measure, Vitali's proof produces a nonmeasurable subset V of A; we present this proof in Theorem 4.16 below. (Note that since any set with zero outer measure is measurable, there never exists a nonmeasurable subset of a set of measure zero; this is why we assume positive outer measure.) Before going to Vitali's theorem, we describe intuitively how to visualize a Vitali set V. Assume that A is measurable with finite measure. Then in Problem 8 you will prove that a Vitali subset V ⊆ A has the following interesting properties:

(4.14)    (1) m*(V) > 0 and (2) m*(A) = m*(A \ V).

Of course, (1) is obvious because if m*(V) = 0, then V is measurable, which we are claiming is false. It's (2) that is interesting, because it can be interpreted as saying that V has no volume (because it says that even though you subtract off V from A, the resulting set still has the measure of A). To summarize, (1) says V has volume while (2) says V has no volume! Because of this, one can visualize V as a foggy set, which in a sense takes up space but in another sense is void of substance; see Figure 4.4. (Comparison with a cloud is a poor analogy, so don't make too much of it!) More generally, from Section 4.3.1 we know that any nonmeasurable set is a set with a blurry or cloudy boundary.

[Figure 4.4. Here's a real cloud, courtesy of Fir0002/Flagstaffotos.]

Vitali's theorem

Theorem 4.16. Any subset of Euclidean space with positive Lebesgue outer measure has a subset that is not Lebesgue (and hence, not Borel) measurable.


Proof : We rst reduce to the case when A is bounded, then we do Vitalis proof. Step 1: Let A be any subset of Rn with nonzero Lebesgue outer measure. n Then A Rn = k=1 [k, k ] , so intersecting with A we get A=
k=1

(A [k, k]n ).

Since m is countably subadditive, we conclude that 0 < m (A) m (A [k, k]n ).

k=1

Thus, m (A [k, k]n ) is nonzero for some k. By proving the theorem for the set A [k, k]n , we assume from now on that A [a, a]n for some real a > 0. Step 2: We now partition Rn in a special way. Given any two n-tuples x, y Rn , we write x y if x y is rational, that is, an element of Qn . It is easy to check that this relation is an equivalence relation; that is, for all x, y, z Rn , Then from the elementary theory of equivalence relations, partitions Rn into equivalence classes, that is, nonempty pairwise disjoint subsets whose union is Rn such that two points x, y Rn belong to the same set in the partition if and only if x y . Figure 4.5 shows a picture. For example, in the case n = 1, here
Rn

xx

x y = y x

x y & y z = x z.

Partition of Rn

A is partitioned too

Figure 4.5. On the left is an abstract picture of Rn as a rectangle


and in the middle is a schematic picture of Rn partitioned into equivalence classes pictured as horizontal strips. The picture shows only nitely many equivalence classes, although in reality there are uncountably many and the equivalence classes are quite complicated (impossible to draw!) unlike this very simplied drawing! are some examples of equivalence classes: 1 1 2, 2 + , 2 1, 2 + 1, 2 + , . . . 2 3 1 1 e, e + , e 1, e + 1, e + , . . . 2 3 1 1 , + , 1, + 1, + , . . . 2 3 Note that each equivalence class is countable. Indeed, if we x an element v in an equivalence class, then given any other element x of the same equivalence class, we have x v = r Qn . Thus, x = v + r , so all other elements of the same class are obtained from the xed element v by adding a suitable rational n-tuple. Since the set of all rational n-tuples is countable, it follows that each equivalence class is countable. Consequently, there are uncountably many equivalence classes;


indeed, otherwise Rn would be a countable union of countable sets and hence countable, which, of course, we know is false. Step 3: We now construct Vitalis nonmeasurable set. Indeed, the partition of Rn also partitions A into pairwise disjoint sets as shown on the right-hand picture in Figure 4.5/left-hand picture in Figure 4.6. Now choose a point from each partition set of A and let V be the set of all such points. Heres a picture of V , where in the right-hand picture we reiterate what we discussed at the end of Step 2:

v+r

A is partitioned

V = set of points

All points in the equiv. class of v are of the form v + r for some r Qn

Figure 4.6. On the left, A is partitioned by the equivalence classes of


Rn and in the middle we form V by choosing a point from each partition set of A. (Each equivalence class is countable and there are uncountably many equivalence classes.) Given v V and r Qn , the point v + r is in the same equivalence class as v and all points in the equivalence class of v are of the form v + r for some r Qn . Notice that we can write A in terms of the Vitali set V as follows: Given v V , if we let Av := {r Qn ; v + r A}, then we can write Since V A [a, a]n , we leave you to show that given v V , we have Av Qn [2a, 2a]n (just show that if x [a, a]n and x + y [a, a]n , then y [2a, 2a]n ). In particular, if we put Q = Qn [2a, 2a]n , then since for any v V we have Av Q, it follows that A W where W := v + r ; v V and r Q =
r Q

A = {v + r ; v V and r Av }.

V +r .

Heres a picture of whats going on:

Given v V , the shaded area is the set {v + r ; r Av }

The shaded rectangle is the set {v + r ; r Qn [2a, 2a]n }

The union of the shaded rectangles is W .

left is a subset of the shaded rectangle in the middle. The set W is the union of all shaded rectangles on the right. Since V [a, a]n and Q [2a, 2a]n we have W [3a, 3a]n . Thus, A W [3a, 3a]n .

Figure 4.7. Since Av Q = Qn [2a, 2a]n , the shaded area on the

(4.15)

Step 4: So far we have just been trying to understand Vitalis interesting set. We now show that its not measurable. To do so, lets assume that V


is measurable and derive a contradiction. Since V is assumed measurable, by translation invariance of Lebesgue measure, any translate of V is measurable with measure equal to the measure of V . Hence, W is measurable. Noting that Q is countable (its a subset of Qn , which is countable) and the (V + r )s are disjoint for dierent r s, by countable additivity of Lebesgue measure, we have m(W ) =
r Q

m(V ).

This is an innite series of the constant number m(V ). Thus, either m(W ) = (if m(V ) > 0) or m(W ) = 0 (if m(V ) = 0). However, according to (4.15), we have m (A) m(W ) m([3a, 3a]n ) = (6a)n . Recalling that m (A) > 0, it follows that m(W ) is some positive nite number and hence cant equal 0 or . This contradiction completes our proof.

If A is measurable with positive finite measure, then recall from (4.14) that m*(V) > 0 and m*(A) = m*(A \ V). In particular, although A = V ∪ (A \ V), we have

    m*(A) < m*(V) + m*(A \ V);

that is, the sum of the volumes of the parts is greater than the volume of the whole! This seems paradoxical because it violates conservation of mass. However, conservation of mass is technically only valid for objects that have well-defined masses, so to solve this paradox we just have to accept that nonmeasurable sets don't have well-defined masses and hence the conservation of mass does not apply to nonmeasurable sets. Here's a related result, which you'll prove in Problem 9:

Paradoxical decompositions

Corollary 4.17. If A ⊆ Rn is measurable with positive finite measure, then given any nonmeasurable set B ⊆ A, we have

    m*(A) < m*(B) + m*(A \ B).

From this corollary, one can imagine taking a pea and dissecting it into not just two but many, many pieces so that the sum of the volumes of the parts is larger than the sun. This, in fact, can be done and you'll prove it in Problem 10. This is the secret to the Banach–Tarski paradox we'll look at in Section 4.4.4. Here's another interesting corollary.

Corollary 4.18. There is no translation invariant measure on P(Rn) that extends Lebesgue measure m : I^n → [0, ∞). In fact, there is no translation invariant measure on P(Rn) that assigns nonzero finite values to bounded nonempty sets.
Proof: Assume there is a translation invariant measure μ : P(Rn) → [0, ∞] that assigns nonzero measures to bounded nonempty sets. Define the set V as in Step 3 of Theorem 4.16. Then, leaving the details to you, if you repeat Step 4 of Theorem 4.16 with μ instead of m, you'll show that μ(W) = ∞ or μ(W) = 0, both of which are impossible.


4.4.3. Vitali's secret. In a footnote at the end of Lebesgue's 1907 paper [235, p. 212], Lebesgue remarked that Vitali had constructed a nonmeasurable set:
I would add that the existence, in the idealistic sense, of nonmeasurable sets has been shown by Mr. Vitali.

Now why did Lebesgue say "in the idealistic sense"? To find out, let's review how we defined V. We were given a partition of A; recall that there were uncountably many partition sets, where each partition set was countable. We then chose a point from each partition set. Here's a picture to contemplate:

[Figure 4.8. V was obtained by choosing a point from each partition set of A. Take a partition set of A and look at its points, say we denote them by a, b, c, d, e, . . .. Which of the points in this partition set is in V? Answer: Who knows! We know (because of how V is defined) that V contains one of these points, but we don't know which one!]

In other words, we didn't give a rule for how to choose each point from each partition set, we simply said to choose one and let V be the set of points chosen. How do we know we can simultaneously choose a point from each partition set (remember there are uncountably many such partition sets in A) and gather them all together in a set V? Well, the true answer is that we have to take it by faith that we can do so; in mathematical terms, we have to take it as an axiom that we can do so! This axiom, as you should already know, is the axiom of choice, introduced by Ernst Zermelo (1871–1953) in 1904 [396, p. 139–141]. This axiom states that given any collection C of nonempty sets, we can form a new set, called a choice set, by choosing^6 an element from each set in C. Here's a picture:

    ^6 More precisely, there is a function, called a choice function, with domain C such that f(A) ∈ A for all A ∈ C.

    [Figure 4.9. On the left we have a collection of nonempty sets, C = {A, B, C, . . .}, and on the right is a choice set, obtained by taking an element from each of the nonempty sets.]

Although the axiom of choice probably seems perfectly logical, even self-evident, note that the choice set is inherently nonconstructive: the axiom of choice only says that a choice set exists; it doesn't tell you how its elements are obtained or even what its elements are! Now in the days of Lebesgue, there were two camps: empiricists, who only accepted objects that could be explicitly defined by some rule (Lebesgue was an empiricist), and idealists, who also accepted objects obtained by nonconstructive methods even though it's impossible to explicitly state the rule defining them (Zermelo was an idealist). This explains why Lebesgue said "in the idealistic sense". Finally, we remark that sometimes we don't need the axiom of choice, such as when C consists of only finitely many sets, or when there is a constructive way to choose the elements. For example, suppose that C is a collection of subsets of N. Then in Figure 4.9 we can take the element chosen from A to be the least element of A, the element chosen from B to be the least element of B, etc. In this way, we can explicitly construct a choice set. The axiom of choice is only needed when one needs a choice set without knowing how to explicitly choose the elements. This is exactly the situation in Vitali's proof: There are uncountably many partition sets of A and there is no way to give a rule to pick a point in any given partition set; thus we are forced to rely on the axiom of choice to produce V for us. Since Vitali uses the axiom of choice to construct his nonmeasurable set, which he denoted by G0, he stated [401, p. 235]:^7
Something could be objectionable about considering the set G0 . This can be fully justied if it is accepted that the continuum can be well ordered. For those who do not want to accept our result, it follows that: the possibility of the problem of the measure of sets of points of a straight line and the well ordering of the continuum cannot coexist.

Now even though we dened a nonmeasurable set using the axiom of choice, is it possible to dene one without using the axiom of choice? van Vleck thought so in his construction [397, p. 241]:
Thus it seems to me possible, and perhaps not dicult, to remove the arbitrary element of choice in my example by conning ones attention to a proper subset of the continuum, though as yet I have not succeeded in proving that this is possible.

In fact, it is impossible to explicitly produce a nonmeasurable set, where well explain what we mean by explicit below. Heres the explanation why, which requires us to review a bit of set theory. First of all, standard set theory is based on the ZermeloFraenkel axioms, named after Ernst Zermelo (18711953) and Abraham Fraenkel (18911965) and the resulting axiomatic system is known as ZF set theory, which is sucient for most of elementary mathematics. Because of the special character of the axiom of choice, this axiom is not part of ZF. However, since the axiom of choice seems selfevident (at least to me if I dont analyze it too much!) one might believe that it can be proved from ZF. However, from the work of Kurt G odel (19061978) and Paul Cohen (19342007), its known that the axiom of choice is logically independent of ZF, which means one cannot prove or disprove the axiom of choice using only the axioms of ZF. In other words, in the ZF world, we are free to accept or decline the axiom of choice. If we accept it into ZF, we are using ZFC set theory, which is what mainstream mathematicians use. Now the question to whether or not one can explicitly produce a nonmeasurable set, which we shall take to mean using only the ZF axioms, Robert Solovay (1938) answers our question in his 1970 paper [361]; heres what he says:8
7Note that Vitali mentions the well ordering of the continuum. The reason is that the

axiom of choice is equivalent to the well-ordering principle (that any set can be well-ordered), a fact proved by Zermelo in 1904, a year before Vitalis paper appeared. 8 More precisely, Solovay proved that the statement every subset of R is Lebesgue measurable is consistent under the axioms of ZF plus axiom I = there exists an inaccessible cardinal. See [361] for the precise statement of Solovays result and see [410, Ch. 13] for more on the r ole of the axiom of choice in producing nonmeasurable sets.

212

4. REACTIONS TO THE EXTENSION & REGULARITY THEOREMS

We show that the existence of a non-Lebesgue measurable set cannot be proved in Zermelo-Frankel set theory (ZF) if use of the axiom of choice is disallowed.

In this sense, one cannot explicitly produce a non-measurable set. 4.4.4. A paradox? A paradox, A most ingenious paradox! Now a nonmeasurable set is not entirely paradoxical: Think of a set with a very, very blurry boundary and its not unthinkable that it cant be measured. However, the axiom of choice can actually produce entirely paradoxical results as well see. The 1828 Websters dictionary says that a paradox is a tenet or proposition contrary to received opinion, or seemingly absurd, yet true in fact. Heres a paradox, really a theorem because its Felix Hausdor proved, due to Felix Hausdor (18681942) who published it in 1914 [91, (18681942). 170, 171]:
(Hausdor paradox) There is a countable set H of the sphere S2 such that S2 \ H can be divided into pairwise disjoint sets A, B and C such that A, B, C and C D are pairwise congruent.

By congruent we mean that any one of the sets A, B, C, C D be be obtained from any other one by a suitable rotation. Since H is countable lets consider it as negligible and forget it. Then heres a na ve picture of the situation:
A C B C A B

Since A, B and C are pairwise congruent we think of the sphere as divided as in the left-hand picture, so we can think of A as a third of the sphere. On the other hand, since A and B C are congruent we think of the sphere as decomposed as in the right-hand picture, so we can think A as half the sphere. Here lies the paradox, for according to the Hausdor paradox, we would then have 1/3 = 1/2! Of course, A, B and C are much more complicated than these simple pictures reveal; they are formed using the axiom of choice in a similar, but more complicated, manner as Vitalis set was formed. Because of this paradoxical decomposition of the sphere produced by the axiom of choice, Borel said [52, p. 210]:
Hence we arrive at the conclusion that the use of the axiom of choice and a standard application of the calculus of probabilities to the sets A, B , C , which this axiom allows to be dened, leads to a contradiction: therefore the axiom of choice must be rejected.

Pretty strong words against the axiom of choice. Now if you think the Hausdor paradox was paradoxical, consider the BanachTarski paradox (really a theorem and uses the axiom of choice), which will blow your mind.
A paradox? A paradox, A most ingenious paradox! Weve quips and quibbles heard in ocks, But none to beat this paradox!9

One version of the BanachTarski paradox states that it is possible to cut up a solid ball into nitely many pieces,10 then re-assemble the pieces using only rigid
9

10

Taken from The Pirates of Penzance by Gilbert and Sullivan. Raphael M. Robinson (19111995) proved that ve pieces (and no less) suce [331].

4.4. GEOMETRY, VITALIS NONMEASURABLE SET, AND PARADOXES

213

motions, and end up with two solid balls again . . . the punchline is that each of the two solid balls has the same size as the original one:

Figure 4.10. Magic! Producing two balls identical to the original. This theorem was proved by Stefan Banach (18921945) and Alfred Tarski (19021983) in 1924 [21]. To make this re-assembling language precise, given two subsets A, B Rn , we shall call them congruent by dissection if they can be decomposed as nite unions of pairwise disjoint N sets, A = N k=1 Ak and B = k=1 Bk , such that for each k , the set Ak is congruent to Bk , which means that there is a rigid motion Tk : Rn Rn such that T (Ak ) = Bk . For instance, any triangle is congruent by dissection to a rectangle as seen in Figure 4.11. Stefan Banach (18921945).

Figure 4.11. Any triangle is congruent by dissection to a rectangle. The magic trick of producing two balls from one is just the statement that any (solid) ball is congruent by dissection with two disjoint balls, each Alfred Tarski of which is identical in size to the original ball. In fact, its possible to do (19021983). even better:
(BanachTarski paradox) Any two bounded objects of Rn with n 3 having nonempty interiors are congruent by dissection!

For example, one can take a very small solid ball, say the size of a pea, and cut it into nitely many pieces, then re-assemble the pieces using only rigid motions to produce a solid ball the size of the sun:

Solid ball the size of a pea

Solid ball the size of the sun

Note that the pieces produced when cutting up the pea cannot all be Lebesgue measurable because Lebesgue measure would preserve the measure of the pea; as with the Hausdor paradox, the BanachTarski paradox uses the axiom of choice. Youll prove some baby versions of the BanachTarski paradox in Problem 11.
Example 4.7. This example is far from anything compared with the BanachTarski paradox (because it uses countable instead of nite dissections), but it gives nonetheless a taste of the BanachTarski paradox. We call two sets A and B congruent by innite dissection if they can be decomposed as innite unions of pairwise disjoint sets, A = k=1 Ak and B = k=1 Bk , such that for each k , the set Ak is congruent

214

4. REACTIONS TO THE EXTENSION & REGULARITY THEOREMS

to Bk . Consider the set W from Step 3 in the proof of Vitalis theorem, which is a union of translates of the set V : W =
k=1

V + rk ,

where {r1 , r2 , r3 , . . .} is an enumeration of the countable set Qn [2a, 2a]n . We know that V + rk and V + r are disjoint for k = . We claim that W can be written as a disjoint union of two subsets, each of which is congruent by innite dissection to W ; that is, we claim that we can write where A and B are disjoint subsets of W , both of which are congruent by innite dissection to W . Indeed, dene A=
k

W = A B,

Ak

and

B=
k

Bk

where Ak = V + r2k and Bk = V + r2k1 . Then A and B are disjoint, W = A B , and we claim that both A and B are congruent by innite dissection to W ! To prove this for A, just observe that so it follows that A = k Ak is congruent by innite dissection to k (V + rk ) = W . The proof for B is similar, just translate by rk r2k1 . (See Problems 11, 12, and 13 for related results.) translating Ak by rk r2k = V + rk ,

Now comes the obvious question: If the axiom of choice produces such out-ofthis-world paradoxical results, why do we use it? Cant we just get along with ZF and forget ZFC? Well, it turns out that to have a useful theory of mathematics we have to have some axiom which allows us to choose elements from an innite number of sets. Here are a few results from Real Analysis that we should all be familiar with: (1) A set of real numbers is closed if and only if it contains all its limit points. (2) A function f : R R is continuous at a point a R if and only if its sequentially continuous at a. (3) R is not the countable union of countable sets. All these results hold in ZFC and we would all agree they are useful for mathematics. For instance, what if (3) were false? Then the Lebesgue measure of R would be zero and there would be no measure theory! In fact, without the axiom of choice each of these three results could be false [198, Ch. 10], [178, Ch. 4]! In the end, I think it would be better to live in a mathematical world with the axiom of choice than without it,11 even though we have to live with strange paradoxes; indeed, instead of blaming the axiom of choice for these paradoxes, we could instead shift the blame on very complicated nonmeasurable sets for which the usual notion of volume does not apply! In the end, I agree with Solovay [361, p. 3]:
11Here are theorems you might be familiar with that are equivalent to the axiom of choice: 1) The Cartesian product of nonempty sets is nonempty; 2) Every surjective function has a right inverse; 3) Every vector space has a basis; 4) Tychonos theorem (the product of compact topological spaces is compact). Here are a small sampling of theorems weve proved in this book that use the axiom of choice: The fundamental lemma of semirings (Lemma 1.3); that an intersection of a nonempty nonincreasing sequence of nonempty cylinder sets is nonempty (Lemma 3.6); the Construction of outer measures from set functions (Theorem 3.9). It might be interesting to look back at other courses to see where the axiom of choice is used e.g. its used in the elementary fact that every innite set has a countably innite subset!

4.4. GEOMETRY, VITALIS NONMEASURABLE SET, AND PARADOXES

215

Of course, the axiom of choice is true, and so there are non-measurable sets. Exercises 4.4. 1. Let T be a noninvertible linear transformation on Rn . In the proof of Theorem 4.15, we showed that m (T (A)) = 0 for any A Rn using elementary matrices. Heres another way to prove this property of T . Note that it suces to prove that m (T (Rn )) = 0. (a) Since T is singular, show that some unit vector v Rn is orthogonal to the column space of T . If O is an orthogonal matrix with v as its rst row, show that OT is a matrix with rst row zero, and thus OT (Rn ) {0} Rn1 . Using this fact, prove that m (OT (Rn )) = 0, and from this, deduce that m (T (Rn )) = 0. (b) If T is not invertible, is the statement A Rn is Lebesgue measurable if and only if T (A) is Lebesgue measurable true? 2. Prove that if V = {v1 , . . . , vn } is a basis of Rn and A = [v1 vn ] is the matrix with columns v1 , . . . , vn , then m(P (V )) = | det A| where P (V ) is the parallelopiped spanned by the basis vectors in V . 3. In Step 3 of the proof of Theorem 4.15, we used that every matrix is a product of elementary matrices to derive the formula D(T ) = | det T |. In this problem we give another method (amongst many others) to prove that D(T ) = | det T |, which uses more linear algebra (and hence is a good review of some linear algebra facts). Assume Step 1 and Step 2 of Theorem 4.15. Assume that T is invertible. (i) Prove that T = M N where M is an orthogonal matrix and N is a symmetric matrix with | det T | = det N . Suggestion: Since T t T is positive denite symmetric, where T t is the transpose of T , Rn has an orthonormal basis of eigenvectors {vk } with corresponding positive eigenvalues { k }. Dene the positive denite symmetric matrix N by the condition N vk = k vk for each k and set M := T N 1 . Now prove that M is orthogonal. (ii) Since N is a symmetric matrix, prove (or recall that) N = O B O1 where O is an orthogonal matrix and B is diagonal. Conclude that D (T ) = D (M ) D (B ). (iii) Prove that D(B ) = | det B | and D(M ) = 1. Finally, prove that D(T ) = | det T |. Suggestion: Prove that M , since its orthogonal, maps the unit ball in Rn onto itself. From this, show that D(M ) = 1. 4. (Luzins condition (N)) A function f : Rn R is said to fulll Luzins condition (N) [338, p. 244] if f maps null sets (sets of measure zero) to null sets; that is, if for every set A Rn of measure zero, its image f (A) also has measure zero. This condition is named after Nikolai Luzin (also spelled Lusin) (18831950). Prove the following. Theorem. A continuous function f : Rn R maps Lebesgue measurable sets into Lebesgue measurable sets if and only if it satises condition (N). Suggestion: To prove the if direction, use Part (5) of Theorem 4.10. Prove that any closed set in Rn can be written as a countable union of compact sets. To prove the only if direction, suppose that A Rn has measure zero but f (A) has positive outer measure and use Vitalis theorem 4.16 on the set f (A). 5. This concepts developed in this problem will be helpful for future problems. (a) Given a point x Rn , we dene its box norm by x
b

:= max{|x1 |, |x2 |, . . . , |xn |}.

This norm is equivalent to the standard norm of vectors in Rn ; that is, an open set with respect to this norm is an open set with respect to the standard norm on Rn and vise versa. Also, note that for any > 0, [, ]n = {x Rn ; x b }. Thus, the box [, ]n = the ball of radius in the box norm. This norm is very convenient for measure theory. Given an n n matrix A, dene |A| :=

216

4. REACTIONS TO THE EXTENSION & REGULARITY THEOREMS

max{

n j =1

|aij | ; i = 1, . . . , n}. Show that for any x Rn , we have Ax


n n b

(b) A function f : R R is said to be locally Lipschitz if given any point a Rn , there are constants La and ra such that f (x ) f (y )
b n

|A| x b .

La x y b ,

for all x, y in the closed box {z R ; z a b ra }, which is the closed box centered at a of radius ra . The constant La is called a Lipschitz constant.12 (i) Show that any locally Lipschitz function is continuous and (ii) show that any dierentiable function is locally Lipschitz. Here, a function f : Rn Rn is said to be dierentiable at a point p Rn if there is an n n matrix-valued function : Rn Rnn that is continuous at p such that f (x) f (p) = (x)(x p) for all x Rn . (The derivative of f at p is then by denition (p).) f is said to be dierentiable if its dierentiable at each point p Rn . 6. Prove that a locally Lipschitz function f : Rn Rn satises Luzins condition (N) (and hence takes Lebesgue measurable sets to Lebesgue measurable sets). For example, given any Lebesgue measurable set A R, the square of A, {a2 ; a A}, is also Lebesgue measurable being the image of f (x) = x2 , which is dierentiable and hence locally Lipschitz by the previous problem. You may proceed as follows (i) Let I be a closed cube in Rn , that is, a closed box in Rn where each side has the same length. Show that I = a + [, ]n , for some a = (a1 , . . . , an ) and > 0 and show that if is suciently small, then and that m (f (I )) Ln a m(I ). (ii) Prove that if A Rn has measure zero, then f (A) also has measure zero. Suggestion: If Ak = {a A ; La k, ra 1/k}, show that A = k=1 Ak . To show that each f (Ak ) has measure zero, use Littlewoods principle that any measurable set can be approximated by an open set and use the dyadic cube theorem. 7. We prove that Lebesgue measure is the unique translation invariant measure on B n that assigns the correct volume to the unit cube (0, 1]n . That is, we shall prove Theorem. If : B n [0, ] is a measure such that (I ) < for all I I n and (A + x) = (A) for all A B n and x Rn , then = m, where = ((0, 1]n ). In particular, if ((0, 1]n ) = 1, then = m. You may proceed as follows: (i) For each k, m N, put Ik,m = (k/2m )(0, 1]n = (0, k/2m ] (0, k/2m ] (0, k/2m ]. Ik,m =

f (I ) f (a) + La [, ]n ,

Show that Ik,m can be written as a union of pairwise disjoint translates: 1 + I1,m , 2m

where the union is over all = (1 , . . . , n ) Nn with 1 j k for j = 1, . . . , n, and where ( 1)/2m := (1 1)/2m , . . . , (n 1)/2m ). (ii) Prove that for each k, m N, we have (Ik,m ) = (k/2m )n where = ((0, 1]n ). (iii) Prove that for any real number r > 0, we have r (0, 1]n = r n .
12 Lipschitz is named after Rudolf Lipschitz (18321903), and Lipschitz conditions are ubiquitous in the study of dierential equations where such conditions are utilized to prove that certain equations have solutions.

4.4. GEOMETRY, VITALIS NONMEASURABLE SET, AND PARADOXES

217

(iv) Prove that for all left-half open cubes I , we have (I ) = m(I ) and from this conclude that for all open sets U Rn , we have (U ) = m(U ). (v) Now prove the theorem. (vi) What if we omit the statement (I ) < for all I I n ; is our theorem still true? Prove it or give a counterexample. 8. (Amazing properties of Vitalis set) We shall have fun with Vitalis set. Below, V denotes a Vitali set of a bounded set A Rn of positive outer measure. (a) Heres another way to prove that V is not measurable. Show that V V Qn = {0}. Conclude, by Steinhaus theorem in Problem 5 of Exercises 4.3, that V cannot be measurable. (b) Show that any subset of V that has positive outer measure is not measurable. Suggestion: Apply a similar proof used to show V was not measurable. (c) Assume that A is measurable. Prove that m (A \ V ) = m (A). This in particular implies that We generalize this inequality in Problem 9 below. Suggestion: If m (A \ V ) < m (A), prove there is a measurable set E such that A \ V E A and m (E ) = m (A \ V ). Try to use Part (b). Prove or give a counterexample: If A is nonmeasurable, then we always have m (A) < m (V ) + m (A \ V ). Take (by regularity) any measurable set B Rn with V B and m (V ) = m (B ). Prove that for any measurable set E B with positive measure, E V is nonmeasurable. Is this still true if we drop the assumption that m (V ) = m (B ) (keeping the assumption V B )? As in (e), let B Rn be measurable with V B and m (V ) = m (B ). Prove that B = B1 B2 where B1 and B2 are disjoint and m (B ) = m (B1 ) = m (B2 ). Assume that A is measurable and let G = (A \ V ) Ac . Prove that G is nonmeasurable, m (Gc ) > 0, and m (E G) = m (E ) for all measurable sets E Rn . This equality is interpreted as saying that G lls the entire space Rn uniformly and has no gaps. This is surprising because m (Gc ) > 0, so G certainly does not ll Rn ! This phenomenon cannot happen for measurable G as we prove next. (h) Prove that if G Rn is measurable and m (E G) = m (E ) for all measurable sets E , then m (Gc ) = 0. 9. (Non-additivity of Lebesgue outer measure) Let A Rn be measurable with positive, nite measure, and let B A. Prove that B is measurable if and only if m (A) = m (B ) + m (A \ B ). This implies Corollary 4.17 (why?). (Actually, this result follows from Problem 12 in Exercises 3.5 if you did that problem.) Suggestion: Find a measurable set C such that B C A and m (B ) = m (C ). Prove that m (C \ B ) = 0. 10. (Baby BanachTarski) Let A Rn have nonempty interior. (i) Prove that given any > 0, there is an N N and pairwise disjoint sets A1 , A2 , . . . , AN such that A = A1 A2 AN and m (A1 ) + m (A2 ) + + m (AN ) > . For example, you can take a pea and dissect it into nitely many pieces such that the sum of the volumes of the pieces is greater than the volume of the sun! Suggestion: By a suitable translation and since A has nonempty interior, we may assume A contains a neighborhood of the origin, and hence a cube [3a, 3a]n for some a > 0. Does this remind you of something from Vitalis theorem? m (A) < m (V ) + m (A \ V ).

(d) (e)

(f) (g)

218

4. REACTIONS TO THE EXTENSION & REGULARITY THEOREMS

(ii) Prove that there are countably many pairwise disjoint sets B1 , B2 , . . . such that A = B1 B2 B3 and m (B1 ) + m (B2 ) + m (B3 ) + = . 11. (More Baby BanachTarski) Here are some BanachTarski type results one can get from the set W from Step 4 in the proof of Vitalis theorem. (i) Given N N, show that W can be written as a disjoint union of N subsets, each of which is congruent by innite dissection to W . (ii) Assume that the bounded set A used to construct W has the property that A + Qn = Rn (e.g. boxes or balls with nonempty interiors have this property). Prove that W and Rn are congruent by innite dissection. Suggestion: Let s1 , s2 , s3 , . . . be an enumeration of Qn and show that Rn = k=1 (V + sk ). (iii) Let A, B Rn be bounded sets where A has a nonempty interior. Prove there is a subset A0 of A and a bounded set B0 containing B that are congruent by innite dissection. For example, let A be a pea and B the sun. Then there is a subset of the pea and a bounded set containing the sun that are congruent by innite dissection! Suggestion: By translating we may assume that A contains a cube [3a, 3a]n A, where a > 0, and let A0 = W be the set constructed by applying Vitalis proof to [a, a]n . Try to nd B0 . 12. (Even more Baby BanachTarski) In this problems we prove that the unit circle S1 can be written as a disjoint union of two sets, each of which is congruent by innite dissection with S1 . In fact, the proof is simply copying Vitalis proof! (i) Given any two elements x, y S1 , dene Check that this relation is an equivalence relation on S1 . (ii) Choose a point from each equivalence class and let V be the set of all such points. Let 1 , 2 , . . . be a list of all rational numbers in [0, 1) and let Vn = e2in V = {e2in v ; v V }. Show that S1 = n=1 Vn , the Vn s are pairwise disjoint, then complete the proof of the statement. 13. (Sierpi nskiMazurkiewicz paradox) One of the beginning results that lead up to the BanachTarski paradox was the following interesting result published by Stefan Mazurkiewicz (18881945) and Waclaw Sierpi nski (18821969) in 1914: There is a nonempty subset X of R2 such that X = A B where A and B are disjoint and each is congruent to X . Prove this theorem as follows. (i) Show that there is a real number R such that ei is transcendental; that is, ei is not the root of any polynomial with integer coecients. Use the elementary fact that the set of all algebraic numbers is countable. (ii) Identify R2 with C, and dene X as the set of all points in R2 of the form a0 + a1 ei + a2 e2i + + an eni , for some n, a0 , a1 , a2 , . . . , an {0, 1, 2, 3, . . .}. Let A X be those points such that a0 = 0 and let B X be those points such that a0 = 0. Prove that A B = , X = A B , and both A and B are congruent to X . (iii) Here is a related paradox for S1 : Well show that S1 (the unit circle) and S1 minus a point are congruent by dissection. To prove this, identify R2 with C as before, and identify S1 with {ei ; R}. Given p S1 , let A = {ein p ; n N} and let B = (S1 \ {p}) \ A. Then S1 \ {p} = A B . Write S1 as A B where A and B are disjoint and A is congruent to A. (iv) Given a countable subset C S1 , show that S1 and S1 \ C are congruent by dissection. Suggestion: Show that there is an angle R such that the sets in C, ei C, e2i C, . . . are pairwise disjoint. Let A = C and let B = (S1 \ n=1 e C ) \ A, then continue as in (iii). x y if x = ye2i for some Q [0, 1).

4.5. THE CANTOR SET

219

4.5. The Cantor set In this section we describe a compact (and hence, a Borel) uncountable set of real numbers with measure zero. This set is called the Cantor set after Georg Cantor (18451918) who constructed the set in 1883 [68, 69]. Perhaps the rst person to construct Cantor-type sets was Henry Smith (18261883) in 1875 [360]; in fact, his set is a half-scaled version of Cantors set see the Remarks section to this chapter. Other early constructions of Cantor-type sets are due to Paul du Bois-Reymond (18311889) in 1882 Henry Smith [108], [109, p. 188], Vito Volterra (18601940) in 1881 [405], and others. (18261883). Its not only fascinating to know what the Cantor set is, but also why it came about, so we start by briey reviewing its history (see [95, 131] for more details). 4.5.1. The Cantor middle-third set. Cantor introduced his set in the fth paper of the six paper series Uber unendliche, lineare Punktmannigfaltigkeiten (On innite, linear point sets) [68, 69, 72], which established Georg Cantor the fundamentals of Cantors new transnite set theory. One of his ulti- (18451918). mate goals was to prove the continuum hypothesis CH13 and he stated that he needed to give a denition as precise and as general as possible when a set can be called continuous (a continuum) and after saying this he expressed hope in proving CH [69, p. 574]:
Therefore the question about the cardinality of Rn reduces to the analogous question about the open interval (0, 1) and I hope to be able to answer it with a rigorous proof that this cardinality is no other that the one of our second number class. It will then follow that every innite set of points has either the cardinality of the rst number class or the cardinality of the second number class.

Cantor then goes on to describe the precise and as general as possible properties of a continuum. The rst property he mentions is that of being perfect. Here, a set A Rn is said to be perfect if A equals its set of limit points. Recall that a point p is said to be a limit point of A if given any open set U containing p, there is a point a A dierent from p such that a U . The set of limit points of A is denoted by A , so A is perfect means that A = A . Examples of perfect sets are continuums such as R or nite unions closed intervals. However, although not obvious at rst glance, Cantor pointed out that there are perfect sets that are not continuums, revealing for the very rst time his now famous set in the following footnote at the end of his paper:
As an example of a perfect set which is not everywhere dense in any even so small interval, I name the set of all real numbers which are contained in the formula c1 c2 c z= + 2 + + + 3 3 3
One form of the continuum hypothesis (CH) states that an innite subset of R either has the cardinality of N or R. If 0 denotes the cardinality of N, from set theory [191] we know there is a next larger cardinality which we denote by 1 . Cantor called 0 the rst number class and 1 the second number class. Another form of CH is that R has the cardinality of the second number class.
13

220

4. REACTIONS TO THE EXTENSION & REGULARITY THEOREMS

where the coecients c can assume the values 0 or 2 at leisure and the series can consist of a nite or innite quantity of members.

In other words, Cantor denes his set as the set of real numbers (in [0, 1]) whose tertiary (or base 3) expansions can be written using only the digits 0 and 2. Cantor did not explain where his set came from, he didnt prove anything about his set and he never mentions his set in the main body of the paper! I am amazed how the Cantor set can generate so much mathematical fruit in the ensuing years from its humble beginnings as a simple footnote! Well get back to his set after we nish Cantors story; in particular, well explain Cantors comment that his set is not everywhere dense in even so small interval. Because of Cantors example of a perfect set that is not continuous, being perfect is not enough to characterize a continuous set of points. Thus, Cantor adds another condition that he calls connectedness (which is dierent from how we use the term today). Armed with a precise topological (as we would now describe it) characterization of continuums as perfect-connected sets he hoped to prove CH in his subsequent work. Unfortunately, Cantor never realized his hope and so famous was CH that it was the rst problem in Hilberts list of 23 open problems given at the 1900 International Congress of Mathematicians in Paris. As you probably know, Cantor was doomed to fail because by the later work of Kurt G odel and Paul Cohen it was discovered that CH is undecidable it cannot be proved or disproved (i.e. its independent) in the standard axioms of modern mathematics (ZFC set theory). Now that we know why the Cantor set came about, to precisely characterize continuums, lets ll in the details Cantor left out! Instead of dening the Cantor set as Cantor originally did, we shall dene it geometrically the way Henry Smith dened his sets; later on we shall relate the geometric denition with Cantors original denition. The construction of the Cantor set is illustrated in Figure 4.12. We start with the closed interval [0, 1]. From this interval, we remove the open
0 C0 0 C00 0
1 9 2 9
C000 C002

1 C2
1 3 2 3

[0, 1] C1 C2 C3 C4 C5

1 C20 C22
7 9 8 9

C02
1 3 2 3
C020 C022

C200 C202

C220 C222

Cn , which from our eyes, looks very tiny. However, it turns out that C has uncountably many points, as well see later.

Figure 4.12. The Cantor set C is the limit set lim Cn :=

n=1

middle third (1/3, 2/3) forming the two disjoint sets C0 and C2 , whose union we denote by C1 : 1 2 C1 = C0 C2 = 0, ,1 . 3 3 Note that C0 and C2 each have length 1/3. We now remove each of the open middle thirds from C0 and C2 and denote the remaining set by C2 . Thus, from C0

4.5. THE CANTOR SET

221

we remove (1/32 , 2/32 ) forming the two disjoint sets C00 and C02 , and from C2 we remove (7/32 , 8/32) forming the two disjoint sets C20 and C22 : 1 2 7 2 1 8 C2 = C00 C02 C20 C22 = 0, 2 2 , , 2 2,1 . 3 3 3 3 3 3 Note that C00 , C02 , C20 , C22 each have length 1/32 . We now continue this removing open middle thirds process indenitely and whats left over after discarding all the open middle thirds we shall call the Cantor set. If youre want more details on how the Cantor set is dened, here they are. We shall proceed by induction and follow the convention, as we have already been doing, that whenever we divide a set into thirds, we tack on a 0 to denote the rst set and a 2 to denote the third set, such as seen here:
I= [ a | c | d ] b I0 = [ a ] c I2 = [ d ] b

Figure 4.13. An interval I is divided into thirds, I0 is the rst set and I2 is the third set. (Here, c = a + (b a)/3 and d = a + 2(b a)/3.) Now suppose by way of induction that C1 Cn have already been dened, such that the nth set is a union of 2n sets: where the C1 ...n s are pairwise disjoint closed intervals of length 1/3n and the union is over all n-tuples (1 , . . . , n ) of 0s and 2s. For each interval C1 ...n , we remove its middle third, forming two disjoint closed intervals C1 ...n 0 and C1 ...n 2 . Since the length of C1 ...n is 1/3n , the lengths of C1 ...n 0 and C1 ...n 2 are 1/3n+1 . We now put Cn+1 := C1 ...n n+1 Cn , Cn = C1 ...n ,

where the union is over all (n + 1)-tuples (1 , . . . , n , n+1 ) of 0s and 2s. This completes our induction step. The Cantor set is the limit set lim Cn , that is, the intersection

C :=

n=1

Cn .

If B = k=1 Ik where the sets Ik are all the open middle thirds removed to form C , then we also have C = [0, 1] \ B. Some properties of the Cantor set are immediate from its denition. For example, since each Cn is a nite union of closed intervals, each Cn is closed and since a countable intersection of closed sets is closed, the Cantor set is closed. Since the Cantor set is closed and bounded (its contained in [0, 1]) its compact. Now what are some points in the Cantor set? Judging from Figure 4.12 it doesnt look like theres much in the Cantor set. However, its certainly not empty since the Cantor set contains all the end points of each interval C1 ...n , recalling that only their open middle thirds were thrown away. So, C contains the points 2 7 8 1 2 7 8 1 2 1 0, 1, , , 2 , 2 , 2 , 2 , 3 , 3 , 3 , 3 , .... 3 3 3 3 3 3 3 3 3 3

222

4. REACTIONS TO THE EXTENSION & REGULARITY THEOREMS

Well describe more points in the Cantor set once we relate our geometric denition with Cantors original denition. We now summarize the main properties of C in a theorem. Let A R. We say that A is totally disconnected if for every two points x, y A, x < y , there is a real number z R with x < z < y such that A (, z ) (z, ). We say that A is nowhere dense if the closure A contains no open intervals. In other words, any open interval in R must contain a point not in A. Thus, nowhere dense is equivalent to the complement of A being dense.14 Nowhere dense is the precise meaning of Cantors comment not everywhere dense in even so small interval. Properties of Cantors set Theorem 4.19. The Cantor set is perfect, uncountable, compact, totally disconnected, nowhere dense, and has measure zero.
Proof : To show that C is perfect, let x be any point in the Cantor set. We need to show that x is a limit point of C . To this end, let I R be any open interval containing the point x. For each n N, let In denote the closed interval (one of the C1 n s) of Cn that contains x. By construction of the Cantor set, we know that the length of In is 1/3n , therefore we can choose n large enough so that In is completely contained in the open interval I . Let y = x be one of the end points of In . Then y C , y I , and y = x. Thus, x is a limit point of C We already know that the Cantor set is compact. To prove that C is uncountable we proceed by contradiction.15 Suppose that C is countable, so we can write the Cantor set as a list C = {c1 , c2 , . . .}. Since C0 and C2 are disjoint, c1 can be contained in only one of them. Let C1 be the set not containing c1 . Since C1 0 and C1 2 are disjoint, c2 can be contained in at most one of them. Choose one of the two that does not contain c2 and call it C1 2 C1 . Continuing by induction, we construct a sequence of closed intervals In = C1 ...n such that In does not contain cn and In+1 In for each n. Since I1 I2 I3 is a nested sequence of closed intervals, by the nested intervals theorem (Theorem ??), the intersection n=1 In is not empty; let c a point in the intersection. Since In = C1 ...n Cn , we see that c n=1 Cn = C . To summarize, we have found a point c C such that c In for each n. However, by construction, for any n, cn is not in the closed interval In , so c, being in all the intervals In , cannot be any of the numbers c1 , c2 , . . .. This contradicts the assumption that {c1 , c2 , . . .} was a list of all the elements of C . That the Cantor set is totally disconnected and nowhere dense, we leave for Problem 1. Finally, to prove that C has measure zero, recall that Cn is the union of 2n disjoint intervals of length 1/3n , we have 1 2 n . = 3n 3 Now for each n, C Cn and so, m(C ) m(Cn ) = (2/3)n . Since (2/3)n 0 as n , we must have m(C ) = 0. m(Cn ) = 2n

4.5.2. Cantors original denition. We now relate Smiths geometric denition of the Cantor set with Cantors original denition. To do so, we rst recall geometrically how to dene the base 10 expansion of a real number (which you
14Recall that D Rn is dense means D = Rn ; i.e., any open set in Rn intersects D .
15

Actually, any nonempty perfect subset of Rn is uncountable, a fact you may try to prove.

4.5. THE CANTOR SET

223

should have seen in an elementary analysis course). Let x [0, 1]; for example, consider x = 3 = 0.14159 . . ., the decimal part of , which is represented by the small dot on the left side:
0 1

We can expand x as a decimal, a1 a3 a2 + 3 + , + 10 102 10 and we shall explain how to nd the coecients a1 , a2 , . . . in the decimal expansion. The rst step is to divide the interval [0, 1] into 10 equal intervals of length 1/10: x = 0.a1 a2 a3 =
0 10 1 10 2 10 3 10 4 10 5 10 6 10 7 10 8 10 9 10 10 10
1 Next, nd which fraction is to the immediate left of x, call that fraction a 10 where 16 a1 {0, 1, . . . , 9}; then the rst digit of x in base 10 is a1 . For example, the rst digit of 3 is 1. We can also label each of the 10 intervals from 0 to 9 (labeling left to right) and determine which interval contains x. Now divide the interval of length 1/10 containing x into 10 equal parts, each part of which has length 1/102 as seen here for x = 3:

1 2 For clarity, we magnify the interval 10 , 10 as shown in the following picture, where 1 1 2 0 1 the fraction 102 , 102 , 102 , . . . is the distance from the point a 10 (= 10 for x = 3):
0 102 1 102 2 102 3 102 4 102 5 102 6 102 7 102 8 102 9 102 10 102

a2 Find the fraction that is to the immediate left of x, say the fraction 10 2 (where a2 {0, 1, . . . , 9}); then the second digit of x is a2 . Alternatively, we can label the intervals from 0 to 9 and nd which interval contains x. For example, the dot representing x = 3 lies just to the right of 4/102 so we have a2 = 4. If we keep on repeating the division into 10 procedure we eventually get all the ai s: a1 a3 a2 x = 0.a1 a2 a3 = + 2 + 3 + 10 10 10 Figure 4.14 shows step-by-step how to get the base 10 expansion of 3.

a = 3 = .14159265 . . .

a = 0. 1 . . . a = 0.14 . . . a = 0.141 . . . a = 0.1415 . . . a = 0.14159 . . .

Figure 4.14. To review, in the rst picture we divide [0, 1] into 10


intervals, each of length 1/10. We label the intervals 0 to 9 and we see that x = 3 is in interval 1, so a1 = 1. In the second picture we magnify the interval [1/10, 2/10] and we divide this interval into 10 equal intervals of length 1/102 . Labeling the intervals from 0 to 9, we see that a lies in interval 4, so a2 = 4. We continue this process and we get the base 10 expansion of 3.
16If x happens to lie on exactly one of the fractions d/10 where d {1, . . . , 9}, then we could put a1 = d or a1 = d 1; in this case x can be written in two dierent ways in base 10.

224

4. REACTIONS TO THE EXTENSION & REGULARITY THEOREMS

Now recall that given any b N with b > 1, called a base, and given any x [0, 1] we can write x as a b-adic or b-ary expansion x= a3 a2 a1 + 2 + 3 + b b b

where ai {0, 1, . . . , b 1} for each i, are called the digits of x. We can nd the digits of x using the same successive division trick outlined above for base 10, except for general bases we divide into b subintervals at each stage. For example, if we focus on b = 3, then given any x [0, 1] we can write x in a tertiary expansion x= a3 a2 a1 + 2 + 3 + 3 3 3

where ai {0, 1, 2} for each i. Moreover, we can determine these digits by successive divisions by 3. Thus, we divide [0, 1] into thirds, label the intervals 0, 1 and 2, then a1 is the interval in which x belongs. We divide the a1 interval into thirds, label the newly formed intervals 0, 1 and 2, then a2 is the interval in which x belongs and so on. Bing! A light bulb should have went on in your head! In the geometric construction of the Cantor set we are doing exactly this successive division by thirds except we are omitting all numbers in the intervals labeled with 1. Thus, we see at least intuitively that x C if and only if x= 1 3 4 2 + 2 + 3 + 4 + , 3 3 3 3 where j {0, 2}.

You will prove this in Problem 2. In fact, the j s here correspond exactly to the j s in the construction of the sets C1 ...n in the Cantor set. (This is, of course, the reason why we denoted the C1 ...n s the way we did!) Using this description of the Cantor set, one can easily nd points in the Cantor set such as 1 2 2 2 2 2 1 0 + 2 + 3 + 4 + 5 + = 2 1+ + 2 + 3 3 3 3 3 3 3 3 = 2 1 1 = , 2 3 1 1/3 3

which we already knew was in the Cantor set, but we can nd other points in C besides end points of deleted intervals, such as 1 0 2 0 2 1 2 0 + 3 + 4 + 5 + = 2 1 + 2 + 4 + + 3 32 3 3 3 3 3 3 as well as 2 1 2 0 2 2 1 0 1 + 2 + 4 + + 3 + 4 + 5 + = + 3 32 3 3 3 3 3 3 = 2 3 1 = . 3 1 1/32 4 = 2 1 1 = , 32 1 1/32 4

In fact, there are many other points in the Cantor set besides the endpoints such as 3 11 1 , 13 , 12 , and many more; see Problems 3 and 4. Heres a quote (to name a few) 10 from Ralph P. Boas, Jr. (19121992) [46, p. 97]:
When I was a freshman, a graduate student showed me the Cantor set, and remarked that although there were supposed to be points in the set other than the endpoints, he had never been able to nd any. I regret to say that it was several years before I found any for myself.

4.5. THE CANTOR SET

225

4.5.3. The Cantor function. We now dene the Cantor function (also called Cantors singular function) This function has the interesting property that it increases from 0 to 1 essentially without changing! We construct exactly how Cantor did in 1883 [70]. Step 1: The rst step is to dene on C ; see Figure 4.15. Given a point x C
1
3 4 1 2 1 4

: [0, 1] [0, 1].

0
1 2 1 9 9 3 2 7 8 3 9 9

Figure 4.15. Constructing the Cantor function; Step 1. we can write it in a tertiary expansion 2 a3 2 a4 2 a2 2 a1 + 2 + 3 + 4 + , x= 3 3 3 3 where aj {0, 1} for each j ; we dene (x) via the binary expansion a1 3 4 a2 (x) := + 2 + 3 + 4 + . 2 2 2 2 In other words, we omit the factor of 2 from the numerators in x and change from base 3 to base 2: For example, (0) = 0, and, since 2 2 2 2 1 = + 2 + 3 + 4 + , 3 3 3 3 we have 1 1 1 (1) = + 2 + 3 + = 1. 2 2 2 An interesting property of is that (x) = (y ) if x and y are end points of the same deleted interval removed during the construction of the Cantor set; see Problem 5. For instance, 1/3 and 2/3 are endpoints of the deleted open interval (1/3, 2/3) removed in the rst stage of the Cantor set construction, and, since 1 2 2 2 = 2 + 3 + 3 + , 3 3 3 3 we have 1 1 1 1 1 = 2 + 3 + 4 + = , 3 2 2 2 2 2 0 0 2 = 3 +3 and also, since 3 2 + 33 + , we have 2 1 0 1 0 = + 2 + 3 + = . 3 2 2 2 2 0 . 3 2 a 1 2 a 2 2 a 3 2 a 4 = 0 .2 a 1 a 2 a 3 a 4 .

226

4. REACTIONS TO THE EXTENSION & REGULARITY THEOREMS

Thus, (1/3) = (2/3). We claim that : C [0, 1] is onto. To see this, let y [0, 1] and write y in binary: a3 a4 a2 a1 + 2 + 3 + 4 + , where aj {0, 1}. y= 2 2 2 2 Then by denition of the Cantor function, we have (x) = y where x = 2 a1 2 a3 2 a4 2 a2 + 2 + 3 + 4 + . 3 3 2 3

Thus, maps the visually very tiny Cantor set onto the whole interval [0, 1]! By the way, this last statement gives another proof that C is uncountable, for if it were countable, then (C ) would be countable which it is not. Step 2: We now extend the domain of from C to all of [0, 1], the basic gist is shown here: 1
7/8 3/4 5/8 1/2 3/8 1/4 1/8

1 2 1 9 9 3

2 7 8 3 9 9

Figure 4.16. Constructing the Cantor function; Step 2. To do so, we need to dene on the open middle thirds removed from [0, 1] to form the Cantor set. Let {Ik } be all the open middle thirds removed to form the Cantor set. Writing Ik = (ak , bk ), the end points ak and bk belong to the Cantor set and we already know that (ak ) = (bk ). We now dene (x) to equal this common value for x Ik ; that is, for x in an open middle third, we dene (x) to equal its values at the end points (which from Step 1, are already known and are equal). This denes our function : [0, 1] [0, 1].

In Problem 5 you will prove that : [0, 1] [0, 1] is continuous and nondecreasing. We now look at some of s interesting properties. First, we know that on any discarded open middle third, is constant (even on the closure of each open middle third). In general, a function f on a topological space X is said to be locally constant if for each c X there is an open set U containing c such that f is constant on U . Thus, is locally constant on [0, 1] \ C , which is just the union of all the deleted open middle thirds. Hence, although is locally constant on [0, 1] \ C , which has length 1, somehow goes from 0 to 1 doing all its increasing (without jumping because is continuous) on the visually very tiny Cantor set, which has length zero! There is one more property that we want to share. Let denote the LebesgueStieltjes measure of the Cantor function . Let B = k=1 Ik where the sets Ik are all the open middle thirds removed to form C . Since is

4.5. THE CANTOR SET

227

constant on each Ik , we have (Ik ) = 0, which implies that

(B ) =
k=1

(Ik ) = 0.

Hence, 1 = (1) (0) = [0, 1] = (B C ) = (C ), so the Cantor set has -measure one! We summarize our ndings in the following theorem. Cantors function Theorem 4.20. The Cantor function : [0, 1] [0, 1] has the following properties: (1) is a nondecreasing continuous function mapping [0, 1] onto [0, 1]; (2) is dierentiable except on a set of measure zero (namely C ) with = 0; (3) (C ) = [0, 1]; (4) [0, 1] = B C with (B ) = 0 and (C ) = 1. One question you might be wondering is why in the world would Cantor think of such a function? The reason was to supply a counterexample to a statement of Axel Harnack (18511888) [113, p. 17]. We all know that if f : [a, b] R is dierentiable at all points in [a, b] and f (x) = 0 at all points, then f is constant. A natural question is if the conclusion f is constant is still true if at all points is replaced by some weaker condition. In 1882, Harnack [168], [173, p. 60], proved that if a continuous function satises f (x) = 0 in general, then f must be constant. Here, in general means that given > 0, there is a set A of content zero such that if x is not in A, then f (x + h) f (x) < for all h suciently small. h See Problem 12 in Exercises 3.4 for the denition of (outer) content. Thus, Harnack is saying that if the dierence quotient of a function can be made arbitrarily small outside a negligible set (a set of zero content), then the function must be constant. Intuitively this seems plausible, but unfortunately its false! In particular, its easy to check that Cantors function is continuous and satises (x) = 0 in general yet isnt constant! Cantors example is typical with Cantor-type sets: They serve as testing grounds for the validity of theories. 4.5.4. Cantor-like sets with positive measure. The Cantor set, as weve already seen, has measure zero. We can also dene sets that have the same properties as the Cantor set except they have positive measure. Here is one example . . . since we have treated the Cantor set example so thoroughly we shall treat the following example more cavalierly. Going back to the construction of the Cantor set, we see that the Cantor set was obtained by removing an open interval of length 1/3 at the rst step, then two intervals of length 1/32 at the second step, then 22 intervals of length 1/33 at the third step, then 23 intervals of length 1/34 at the fourth step, etc. Instead of removing intervals whose lengths are powers of 1/3 we can use other numbers to. Let k N with k 3; we shall remove intervals whose lengths are powers of 1/k . We start with the closed interval [0, 1]. From this interval, we remove the

228

4. REACTIONS TO THE EXTENSION & REGULARITY THEOREMS

open middle interval of length 1/k , which forms the two disjoint closed intervals. From each of these two intervals, we remove the open middle interval of length 1/k 2 ending up with 22 intervals remaining, from which we remove their open middle intervals of length 1/k 3 ending up with 23 intervals, and so on; see Figure 4.17. At the end of the nth stage, we have 2n closed intervals and from these intervals we remove each of their open middle intervals of length 1/k n+1 . Continuing this process indenitely, whats left over after discarding all the open middle intervals is the thick Cantor k set, which we denote by C (k ). The main dierence between
0
1 k 1 k2 1 k3 1 k3 1 k3 1 k2 1 k3

[0, 1] C1 (k ) C2 (k ) C3 (k )

Figure 4.17. The rst few stages in constructing C (k). the cases k > 3 and the case k = 3 (the original Cantor set) is that C (k ) has positive measure for k > 3. Indeed, observe that if we look back at the construction of C (k ), the total lengths of the open intervals we removed from [0, 1] is 22 23 1 2 1 + 2 + 3 + 4 + = k k k k 2 n=1 Hence, 1 k3 = . k2 k2 1 For instance, m(C (4)) = 2 . Moreover, as k , we see that m(C (k )) 1. Hence, we can get Cantor-like subsets of [0, 1] whose measure is as close to 1 as desired. In Problem 11 you will prove the following theorem and you will construct the corresponding Cantor function. m(C (k )) = 1 Thick Cantor sets C (k ) for k 3 Theorem 4.21. The Cantor set C (k ) is perfect, uncountable, compact, totally disconnected, nowhere dense, and has positive measure for k > 3. As with the standard Cantor set, thick Cantor sets are also used as testing grounds for theories. For example, in 1870 Hermann Hankel (18391873) proved [164, pp. 89-92] that a bounded function is Riemann integrable on an interval [a, b] if and only if its points of continuity form a dense set in [a, b]. We remark that a function f : [a, b] R whose points of continuity form a dense set in [a, b] is said to be pointwise discontinuous, so Hankel claimed that Riemann integrablity and pointwise discontinuity are equivalent. In 1875 Henry Smith proved Hankel false. Indeed, let A [0, 1] be any closed nowhere dense set of positive measure (like one of the thick Cantor sets) and let f = A : [0, 1] R = Heres a graph of f : 1 if x A 0 if x / A.

2 k

1 2/k 1 = . 2 1 2/k k2

4.5. THE CANTOR SET

229

Figure 4.18. The set A is the dust particles on the horizontal axis. Its easy to see that f is continuous at each point of the open set [0, 1] \ A (where f = 0) and f is discontinuous at each point of A. In particular, since A is nowhere dense, the points of continuity form a dense set in [0, 1]. However, in Problem 12 you will prove that f is not Riemann integrable and hence we have a counterexample to Hankels theorem.
Exercises 4.5. 1. Prove that the Cantor set is totally disconnected and nowhere dense. 2. In this exercise, we prove that the Cantor set is exactly those numbers in [0, 1] that have a ternary decimal expansion containing only the digits 0 and 2. (a) Prove, for instance by induction on n, that 1 2 n 1 2 n 1 C1 ...n = x [0, 1] ; + 2 + + n x + 2 + + n + n . 3 3 3 3 3 3 3 1 At the same time, show that [0, 1] \ Cn = n B1 ...k with the k=0 Bk where Bk = union over all k-tuples (1 , . . . , k ) of 0s and 2s, where k 1 1 k 2 1 + + k + k+1 < x < + + k + k+1 . B1 ...k = x [0, 1] ; 3 3 3 3 3 3 When k = 0, we interpret the k-tuple as empty and put B0 as the interval (1/3, 2/3). By denition of the Cantor set, note that [0, 1] \ C = k=0 Bk . We remark that one can also use 0s and 1s to index the sets C1 ...n instead of 0s and 2s; for instance, we can write Bk = Ba1 ...ak with the union over all k-tuples of 0s and 1s, where 2a1 2ak 1 2a1 2ak 2 (4.16) Ba1 ...ak = x [0, 1] ; + + k + k+1 < x < + + k + k+1 . 3 3 3 3 3 3 When k = 0, we regard Ba1 ...ak as the interval (1/3, 2/3). (b) Given x C , prove that we can write 1 2 n (4.17) x= + 2 + + n + , j {0, 2}. 3 3 3 (c) Finally, prove that any number of the form (4.17) is an element of C . 3. (Cf. [275]) In this problem we give some tricks for producing points in the Cantor set. (a) Prove that the Cantor set is invariant under reection about 1/2; that is, given x [0, 1], prove that x C if and only if 1 x C . (b) Prove that the Cantor set is invariant under division by 3; that is, given x [0, 1], prove that x C if and only if x/3 C . (c) Starting from the number 1/4 C , by reecting about 1/2 and dividing by 3, verify that the following numbers also belong to the Cantor set: 1 3 1 11 1 11 25 35 , , , , , , , . 4 4 12 12 36 36 36 36 4. In this problem we search for more points in the Cantor set. (a) Prove that if k N, then 3k2 C. 1 (b) Prove that if k N, then 3k1 C . Suggestion: After expanding 3k1 using +1 +1 j i geometric series, the formula 1/3 = 2 / 3 might come in handy. i=j +1 (c) For k N and 0 k 1, show that
3 3k +1

C and 2

3 3k 1

C.

230

4. REACTIONS TO THE EXTENSION & REGULARITY THEOREMS

5. (Properties of the Cantor function) We analyze the Cantor function more closely. (a) Prove that if x, y C are endpoints of an open deleted middle third interval, then (x) = (y ). Suggestion: Use (4.16) in Problem 2. (b) Prove that : C [0, 1] is nondecreasing. Suggestion: Let x, y C with x < y 2aj 2bj and write x = j =1 3j and y = j =1 3j where aj , bj {0, 1}. Let k N be the smallest natural number such that ak = k . Thus, a1 = b1 , . . . , ak1 = bk1 but ak = bk . Show that ak < bk . Using this fact prove that (x) (y ). (c) Show that : [0, 1] [0, 1] is nondecreasing. (d) Show that : [0, 1] [0, 1] is continuous. (e) Let n N. Show that the range of on [0, 1] \ Cn equals {/2n ; = 1, 2, . . . , 2n 2n 1 1}. As a consequence, show that [0, 1] \ Cn = =0 D for some open intervals D where (say) the right end points of the intervals D form an increasing sequence such that = /2n on the -th interval D . (We remark that some authors dene the Cantor function on [0, 1] \ C via this property. That is, given x [0, 1] \ C , we have x [0, 1] \ Cn for some n, then one denes (x) = /2n if x is in the -th interval D . Dening in this way, you have to check that (x) is dened independent of the n chosen. You then have to extend from [0, 1] \ C to the Cantor set C . See Problem 11 for an explanation of this method to dene .) 6. (cf. [80]) (Characterization of the Cantor function) (i) Prove that the Cantor function : [0, 1] [0, 1] satises for all x [0, 1]: (1) is nondecreasing; (2) (x/3) = (x)/2; (3) (1 x) = 1 (x). (ii) Prove that if f : [0, 1] [0, 1] is continuous and satises (1), (2) and (3), then f is the Cantor function. Suggestion: Prove by induction on k that f 2a1 2ak 2 + + k + k+1 3 3 3 = a1 ak 1 + + k + k+1 2 2 2

k 2 for all a1 , . . . , ak {0, 1}. The formula 1 = 31 k + i=1 3i , which holds for all k N, might come in handy at one point in your proof. 7. (non-tertiary Cantor-like sets) (a) Let T be the set of points in [0, 1] that can be written in a base 10 expansion containing only 3s and 8s. Prove that T is (i) perfect, (ii) uncountable, (iii) compact, (iv) totally disconnected, (v) nowhere dense, and (vi) has measure zero. (b) After working with T , let D {0, 1, . . . , 9} have cardinality at most 9 and let A be the set of points in [0, 1] which can be written in a base 10 expansion containing only digits in D. The set A also has properties (i)(vi); however, just show that A is compact (hence, Lebesgue measurable) and m(A) = 0. j (c) Let B1 = {x R ; x = j =1 aj /5 ; aj {2, 1, 1, 2}} and let B2 = {x R ; x = j j =1 aj /5 ; aj {2, 1, 0, 1, 2}} Find m(B1 ) and m(B2 ). 8. (A Lebesgue non-Borel set cf. [75, 161]) In this problem we show there exists a Lebesgue measurable set that is not Borel, that Borel measure is not complete, and that Lebesgue measurability is not preserved under homeomorphisms. (i) Let : [0, 1] [0, 1] be the Cantor function, and let (x) = (x + (x))/2. Prove that : [0, 1] [0, 1] is strictly increasing, that is, if x < y , then (x) < (y ). In particular, is a continuous bijection of [0, 1] onto [0, 1]. (ii) Show that (C ) is Lebesgue measurable with measure 1/2. Suggestion: Since [0, 1] = (C ) ([0, 1] \C ), it might be easier to show that ([0, 1] \C ) is measurable with measure 1/2. (iii) Show that there exists a Lebesgue measurable set M C such that (M ) is not Lebesgue measurable. In particular, Lebesgue measurability is not preserved under homeomorphisms. Suggestion: Use Vitalis theorem 4.16 on the set (C ).

4.5. THE CANTOR SET

231

(iv) Why is M not a Borel set? Since M C and C is a Borel set with measure zero, it follows that Borel measure, that is, Lebesgue measure on the Borel sets, is not complete. 9. (Sums of Cantor sets, cf. [263], [323], [395]) (a) Prove Hugo Steinhaus (18871972) theorem for the Cantor set, which states that C + C := {x + y ; x, y C} = [0, 2]. (See Problem 5 in Exercises 4.4 for another Steinhaus theorem.) Thus, even though the Cantor set seems tiny, when you add all its points together you ll up the interval [0, 2]. Suggestion: Given a [0, 2], consider the tertiary expansion of a/2. (b) Show there is a set A R of measure zero such that A + A = R. 10. If you liked the previous problem, here are some related ones. (a) Using Steinhaus Cantor set theorem, show that C C = [1, 1]. (b) If n N, put Sn = C + C + + C (n copies of C ) = {x1 + + xn ; xi C}. Prove that Sn = [0, n]. (c) Show that for n N, we have Sn Sn = [n, n]. 11. (Thick Cantor sets and functions) Fix k N with k > 2. In this problem we give a detailed construction of the Cantor set C (k) and its Cantor function k : [0, 1] R. Given bounded intervals I and J we write I < J if the right-end point of I is the left-end point of J . It would be helpful to draw many pictures during this exercise! (i) Stage 1: Remove the open middle interval from [0, 1] of length 1/k. Denote the remaining closed left-hand interval by C11 , the remaining closed right-hand interval by C12 and the open removed interval by B11 . Note that C11 < B11 < C12 . Let C1 = C11 C12 and B1 = B11 and observe that C1 is a union of 21 closed intervals, B1 is a union of 21 1 open intervals, C1 B1 = [0, 1], and 1 m(C1 ) = 1 k . Dene f1 : B1 R by f1 (x) = 1/2 for all x B1 . Induction step: Suppose n N and assume that disjoint sets Cn and Bn n have been dened, where Cn = 2 j =1 Cn,j with the Cn,j s pairwise disjoint closed
intervals of equal length, and B_n = ∪_{j=1}^{2^n − 1} B_{n,j} with the B_{n,j}'s pairwise disjoint open intervals (of possibly different lengths). Assume that

m(C_n) = 1 − (1/k + 2/k² + ··· + 2^{n−1}/k^n)

(so each C_{n,j} has measure exactly m(C_n)/2^n), that C_n ∪ B_n = [0, 1], and that

C_{n,1} < B_{n,1} < C_{n,2} < B_{n,2} < ··· < C_{n,2^n − 1} < B_{n,2^n − 1} < C_{n,2^n}.

Finally, assume that f_n : B_n → R has been defined with f_n(x) = j/2^n for x ∈ B_{n,j}, j = 1, 2, ..., 2^n − 1. It would be helpful to draw a picture of f_n, which looks like a staircase. Prove that m(C_{n,j}) > 1/k^{n+1} for all j. Then remove from each C_{n,j} the open middle interval of length 1/k^{n+1} (which leaves closed intervals of positive length), define sets C_{n+1} and B_{n+1} having the same properties as C_n and B_n with n replaced by n + 1, and define f_{n+1} : B_{n+1} → R in a manner similar to f_n.
(ii) Prove that C_{n+1} ⊆ C_n and B_n ⊆ B_{n+1} for all n and, defining C(k) = ∩_{n=1}^∞ C_n and B = ∪_{n=1}^∞ B_n, prove that C(k) and B are disjoint, C(k) ∪ B = [0, 1], and m(C(k)) = (k − 3)/(k − 2). Prove Theorem 4.21.
(iii) Prove that for each n ∈ N, f_{n+1}(x) = f_n(x) for all x ∈ B_n. (Recall that B_n ⊆ B_{n+1}.) Define f : B → R as follows: If x ∈ B = ∪_n B_n, then x ∈ B_n for some n and we put f(x) := f_n(x). Prove that f(x) is well defined, independent of the n chosen, and prove that f : B → R is locally constant and nondecreasing.
(iv) Define the Cantor function Φ_k : [0, 1] → R as follows: Φ_k(0) = 0 and, given any x ∈ (0, 1], Φ_k(x) := sup{ f(y) ; y ∈ B and y < x }. Prove that Φ_k is nondecreasing, continuous, and Φ_k = f on B.
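For a quick numerical sanity check of the measure formula m(C(k)) = (k − 3)/(k − 2) in part (ii), here is a short Python sketch (our own illustration, not part of the exercise; the function name and the sample values of k are arbitrary). It uses the fact, built into the construction above, that stage j removes 2^{j−1} middle intervals, each of length 1/k^j.

def thick_cantor_measure(k, stages):
    """m(C_n) after `stages` stages: 1 minus the total length removed so far.
    Stage j removes 2**(j-1) middle intervals, each of length 1/k**j."""
    removed = sum(2 ** (j - 1) / k ** j for j in range(1, stages + 1))
    return 1 - removed

for k in (4, 5, 10, 100):
    print(k, round(thick_cantor_measure(k, 60), 12), (k - 3) / (k - 2))

After a few dozen stages the stage measures agree with (k − 3)/(k − 2) to machine precision.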


12. (Counterexample to Hankel) Using Riemann sums (or upper and lower Darboux sums if you wish), prove that if A [0, 1] is a closed nowhere dense set with positive measure, then A : [0, 1] R is not Riemann integrable.

Remarks
4.1 : The Gamblers ruin problem was rst posed and solved by Blaise Pascal (1623 1662) (see [115] for some history on Gamblers ruin, especially Pascals r ole in the problem). 4.2 : Its interesting to note that Borels celebrated 1909 paper [51] was not rigorous, at least by todays standards. Here are Borels words leading up to his description of the law of large numbers [51, p. 258]: We propose to study the probabilities for a decimal fraction belonging to a given set assuming 1. The decimal digits are independent; 2. Each of them takes each of the values 0, 1, 2, 3, . . . , q 1 with 1 probability q . There is no need to insist on the somewhat arbitrary character of these two hypotheses: the rst, in particular, is necessarily inexact, if we consider as one is always forced to do in practice, a decimal number dened by a law, indeed whatever the nature of this law. It may nevertheless be interesting to study the consequences of this assumption, precisely with the goal to realize the extent to which things like this happen as if this hypothesis holds. Borels paper was not rigorous, but it was made so by Hugo Steinhaus (18871972 in 1922 [367]; in this paper he calls Borels normal number theorem Borels Paradox: Mr. E. Borel has been the rst to show the interest of the study of enumerable probability,17 and he gave some applications to arithmetics that he discovered on this path. Among those applications, the following theorem, known as Borels paradox drew the attention of analysts: The probability that the frequency of the digit 0 in the dyadic expansion of a random number be 1/2, equals one, where we call frequency of the digit 0 the limit value of the quotient by n of the number of times that this digit appears in the rst n digits of the expansion. We can nd, in dierent authors, the following statement of the same theorem: Almost all numbers have the property that the frequency of the digit 0 in their dyadic expansion equals 1/2, where almost all means that the Lebesgue measure of the set of the that do not enjoy this property is zero. To prove this statement, it is sucient to change the wording of Mr. Borels original proof, without changing the core idea. The goal of this note is to establish a system of postulates for enumerable probability that will allow once and for all to switch from one interpretation to the other in this kind of research. Its argued that Van Vleck produced the rst zero-one law [290]. See [321] for another example of nite additivity = SLLN. 4.3 : I learned Littlewoods rst principle from Roydens text [336, p. 72].
Enumerable probability is countably additive probability, in contrast to nitely additive probabilities.
17


4.4 : The story of the axiom of choice is fascinating. Let us start in 1900, at the International Congress of Mathematicians, where David Hilbert (18621943) gave a list of 23 open problems in mathematics, the rst problem of which was Cantors problem of the cardinal number of the continuum. As part of this problem, he asked to well order the real numbers [182]: The question now arises whether the totality of all numbers may not be arranged in another manner so that every partial assemblage may have a rst element, i.e., whether the continuum cannot be considered as a well ordered assemblage a question which Cantor thinks must be answered in the armative. In 1904, Zermelo wrote a letter to Hilbert [396, pp. 139141] proving the well ordering theorem, the fact that an arbitrary set can be well ordered.18 To prove this theorem, Zermelo stated and used the axiom of choice and called it a logical principle because it is applied without hesitation everywhere in mathematical deductions [396, p. 141]. Now although applied without hesitation everywhere Zermelo received much criticism for his proof; here, for example, are some harsh words by Borel against the axiom of choice [53, pp. 12511252]: One cannot, in fact, hold as valid the following reasoning, to which Mr. Zermelo refers: it is possible, in a particular set M , to choose any specic element m ; since this choice can be made for any of the sets M , it can be made for the set of those sets. Such reasoning seems to me not to be better founded than the following: To wellorder a set M , it is enough to choose an arbitrary element in it, to which we assign the rank 1, then another, to which we will attribute the rank 2, and so on transnitely, in other words until we exhaust all elements of M by the sequence of transnite numbers. Now, no mathematician will regard such reasoning as valid. It seems to me that the objections that one can raise against it are valid objections against every reasoning involving an arbitrary choice repeated an unenumerable innity of times; such reasonings are out of the scope of mathematics. Its remarkable that for many years, both Borel and Lebesgue held similar views concerning the axiom of choice, although a great deal of their own work in measure theory relied on it. Before 1908, set theory did not have an axiomatic foundation, rather it was na ve set theory or loosely speaking, logical principles applied to sets. The controversy surrounding Zermelos Axiom was the impetus that led him to axiomatize set theory in 1908 [396, pp. 183215]; by doing so with his axiom of choice as one of the cornerstones, he could secure his Axiom and his proof of the well ordering theorem on a sound foundation.19 Unfortunately, Zermelos axiomatization didnt win people over to his axiom of choice; indeed, it opened up new attacks on his axiomatic system! However, the tide in favor of the Axiom began to turn in 1916 when the Polish mathematican Waclaw Sierpi nski (18821969) started publishing papers on the subject of set theory and analysis and their dependance on Zermelos Axiom, which culminated in a 55 page article in 1918 on this subject [353]. In fact, the Warsaw school of mathematics, which Sierpi nski was a part of, played a large role in the dissemination of the Axioms place in mathematics. As the years passed by, this role expanded to diverse mathematical elds. Moreover, due to the work of G odel and Cohen which shows that the axiom of choice is logically independent of ZF,
That is, any set X has a relation < such that for any x, y, z X , x < x never holds, x < y and y < z implies x < z , and nally, each nonempty subset of X has a least element. 19 Its erroneously stated in many books that Zermelo formulated his axioms as a response to the many paradoxes of na ve set theory (such as Russells Paradox concerning sets that contain/dont contain themselves); see [282, Ch. 3].
18


any fears of Zermelos Axiom producing a mathematical contradiction were eliminated. Thus, nowadays Zermelos Axiom is accepted without qualms. Now we really cant do justice to the fascinating story of the Axiom and it would ll an entire book to discuss in depth its origins, development and inuence; thankfully such books are available, such as [282, 283]. For more information about the BanachTarski paradox and related paradoxes, see e.g. [45, 91, 282, 376, 410, 412]. In particular, the title A paradox, a paradox, a most ingenious paradox of Subsection 4.4.4 was inspired by the paper [45]. The measure problem: In Corollary 4.18 we saw that there does not exist a translation invariant measure (= countably additive set function) on P (Rn ) that gives the usual volume to boxes. There are many questions one can ask such as what happens if we drop the translation invariant requirement or the countably additive requirement; more precisely, (1) Is there a measure on P (Rn ) that gives the usual volume to boxes? (2) Is there a translation invariant nitely additive set function on P (Rn ) that gives the usual volume to boxes? First, it might surprise you, but it turns out that in ZFC we cannot prove there is a measure on P (Rn ) that assigns the usual volume to boxes! This follows from the work of Stanislaw Ulam (19091984) in 1930. The proof uses lots of set theory but you can get the basic gist from the expository articles [349] and [392], the latter being written by Ulam himself.20 We remark that if we change Question 1 to Is there a nitely additive set function on P (Rn ) that gives the usual volume to boxes? The answer is yes; in fact, in 1948 Alfred Horn (1918-2001) and Alfred Tarski (1901-1983) used the axiom of choice to prove the following remarkable result [190]: Any nitely additive set function on a semiring of subsets of any given set can be extended to a nitely additive set function on the power set of the set. In particular, since Lebesgue measure is nitely additive on I n it has an extension as a nitely additive set function to P (Rn ). Thus, we can always extend nitely additive set functions with no problem whatsoever. The answer to the Question 2 is yes, there does exist a translation invariant nitely additive set function on P (Rn ), n N, that gives the usual volume to boxes. In fact, a theorem due to Jan Mycielski (1932 ) [287] implies that if the types of rigid motions youre interested in forms an amenable group, then there exists a nitely additive set function invariant under those rigid motions [410, Ch. 10]. In particular, any abelian
Heres a quick synopsis from [349] in case youre interested. Working in ZFC, lets assume that we can show there is a measure on P (Rn ) that gives the usual volume to boxes. Then by Ulams 1930 result [391] it follows that there exists a weakly inaccessible cardinal. (For a precise statement of Ulams theorem, see [199, p. 297].) On the other hand, one can show [169, p. 321] that the existence of a weakly inaccessible cardinal proves that ZFC is consistent, which means that one cannot prove a contradictory statement from the axioms of ZFC. Now comes Kurt G odel (19061978), who proved two incompleteness theorems in 1931 (you can read a translation in [396]). The second incompleteness theorem basically says that any axiomatic mathematical system cannot prove its own consistency, unless its in fact inconsistent. In particular, we cannot prove that ZFC is consistent, so we cannot prove that there is a measure on P (Rn ) that gives the usual volume on boxes (assuming of course that ZFC is consistent, which is our underlying assumption). Addendum: We cannot conclude that there is not a measure on P (Rn ) that gives the usual volume on boxes! This is because of G odels rst incompleteness theorem, which basically says that any axiomatic mathematical system contains statements that are either (provably) true, (provably) false, or undecidable (cannot be proved either way). Thus, the statement there is a real number x such that x2 = 2 is a true statement, the statement there is a real number x such that x2 + 1 = 0 is a false statement, while the statement There exists a measure on P (Rn ) that gives the usual volume to boxes may either be false or undecidable, for as weve seen it cant be true unless ZFC is inconsistent.
20


group is an amenable group, and therefore since translations form an abelian group, it follows that there does exist a translation invariant nitely additive set function on P (Rn ) that gives the usual volume to boxes. Mycielskis theorem uses the axiom of choice. For n = 1, 2, one can do even better: Using the axiom of choice, Stefan Banach (18921945) in 1923 [19] proved there exists a nitely additive set function on P (Rn ) giving the usual measure to elements of I n and is also invariant under all rigid motions; the BanachTarski paradox shows that this statement is false for n 3. 4.5 : The article [94] contains a vivid exposition on the Cantor set and the Cantor function. Henry Smith (18261883) was probably the rst person to dene a Cantor-like middle third set; heres the portion of the paper where he does so [360, p. 94]: 15. (iv.) Let m be any given integral number greater than 2. Divide the interval from 0 to 1 into m equal parts; and exempt the last segment from any subsequent division. Divide each of the remaining m 1 segments into m equal parts; and exempt the last segment of each from any subsequent division. If this operation be continued ad innitum, we shall obtain an innite number of points of division P upon the line from 0 to 1. These points are in loose order: Here, loose order means that P is nowhere dense. To make Smiths construction concrete, consider m = 3. We take [0, 1] and divide it into 3 equal parts and then omit the last interval. We then take the 2 remaining parts and divide them into 3 equal parts each, then omitting the last interval in each part; heres a comparison of the constructions of the Cantor set (left) and the Smith set (right):

[Figure: the first stages of the Cantor set construction (left) and of the Smith set construction with m = 3 (right).]

This picture shows that at each stage of the Smith set construction (with m = 3) we basically have half of the sets in the corresponding Cantor set construction; in fact, you can verify that Smiths set consists of those numbers in [0, 1] which have a tertiary expansion with only digits 0 and 1; if you scale these points by 2 we get those numbers in [0, 1] with tertiary expansions having digits 0 and 2, which is exactly Cantors set. Smith can indeed claim to have priority over Cantor! We leave it as an exercise for you to check that at the nth stage of the Smith construction, the measure of the non-omitted segments is (1 1/m)n . This approaches 0 as n , so the measure of P is 0. Later in Smiths paper, he describes another set where instead of dividing each stages segments into m equal parts, to get to the nth stage, he divides each of the previous stage segments into mn equal parts: 16. (v.) Let us now, as in the last example, divide the interval from 0 to 1 into m equal parts, exempting the last segment from any further division ; let us divide each of the remaining m 1 segments by m2 , exempting the last segment of each segment; let us again divide each of the remaining (m 1)(m2 1) segments by m3 , exempting the last segment of each segment; and so on continually. After k 1 operations we shall have N = 1 + (m 1) + (m 1)(m2 1) + ... + (m 1)(m2 l)...(mk2 1) exempted segments, of which the sum will be 1 1 1 m 1 1 m2 1 1 m k 1 .


This sum, when k is increased without limit, approximates to the nite 1 1 1 limit 1 E m ; where E m is the Eulerian product 1 m k , 1 and is certainly dierent from zero. The points of division of Q exist in loose order over the whole interval. 1 Here Smith constructs a nowhere dense set Q with measure21 equal to k=1 1 mk , which is positive.22 Recall our discussion from the end of Section 4.5 that characteristic functions of nowhere dense sets with positive measure give counterexamples to Hankels theorem on the equivalence of Riemann integrability and pointwise discontinuity. After presenting the above example of a nowhere dense set of positive measure, Smith says The result obtained in the last example deserves attention, because it is opposed to a theory of discontinuous functions, which has received the sanction of an eminent geometer, Dr. Hermann Hankel. Six years later, Vito Volterra dened a nowhere dense set with positive measure (without knowing about Smiths paper) and said the same thing concerning Hankels claim [405]: Having said that, let us consider a function that has value 0 in all points of the set and in all of its limit points, and has value 1 in all other points. This function is pointwise discontinuous because in any interval one can nd another not containing points of the set: in any inner point of this interval, the function is continuous. Furthermore, it is evident that the function is discontinuous in all points of the set and all their limit points - and [the height of] all jumps equal[s] 1. Since, then, the points of the set cannot be enclosed by intervals which total sum is as little as wanted, we conclude that this function is not suitable for integration.

21 Here's a terse explanation of how Smith got the measure of the exempted segments as a product. First, observe that the measure M of the exempted segments after k − 1 operations is

M = 1/m + (m − 1)/m^{1+2} + (m − 1)(m² − 1)/m^{1+2+3} + ··· + (m − 1)(m² − 1)···(m^{k−2} − 1)/m^{1+2+···+(k−1)}.

(Here we recall that to get to the nth stage, we divide each segment in the previous stage into m^n equal parts; thus, each segment at the nth stage has length (1/m^1)(1/m^2)···(1/m^n) = 1/m^{1+2+···+n}.) Second, rewrite M as

M = 1/m + (1 − 1/m)(1/m²) + (1 − 1/m)(1 − 1/m²)(1/m³) + ··· + (1 − 1/m)(1 − 1/m²)···(1 − 1/m^{k−2})(1/m^{k−1}).

Third, after the 1/m in this last sum, factor out a (1 − 1/m); inside the resulting bracket, after the 1/m², factor out a (1 − 1/m²); and so on. You'll find that the sum telescopes to

M = 1 − (1 − 1/m)(1 − 1/m²)···(1 − 1/m^{k−1}),

just as Smith says.
22 It's an undergraduate exercise to prove that ∏_{k=1}^∞ (1 − a_k), with 0 ≤ a_k < 1 for each k, converges (to a positive real number) if and only if the series Σ_{k=1}^∞ a_k converges.
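If you would like to check the telescoping identity in footnote 21 numerically, here is a small Python sketch (ours, with hypothetical function names); both quantities are evaluated straight from their definitions and agree to rounding error.

def exempted_sum(m, k):
    """M = 1/m + (1-1/m)/m^2 + ... + (1-1/m)...(1-1/m^(k-2))/m^(k-1)."""
    total, partial_product = 0.0, 1.0
    for j in range(1, k):
        total += partial_product / m ** j
        partial_product *= 1 - 1 / m ** j
    return total

def smith_product_form(m, k):
    """1 - (1-1/m)(1-1/m^2)...(1-1/m^(k-1))."""
    prod = 1.0
    for j in range(1, k):
        prod *= 1 - 1 / m ** j
    return 1 - prod

for m, k in ((3, 5), (3, 30), (10, 30)):
    print(m, k, exempted_sum(m, k), smith_product_form(m, k))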

Part 3

Integration

CHAPTER 5

Basics of integration theory


At the beginning of this century it was discovered that there was a large area in which the legitimacy of these limiting operations could be assumed without fear of contradictions, or of failing examinations: it is not surprising therefore that the tide of euphoria is now at its height. Bruno de Finetti (19061985) [99, p. 229]

5.1. Introduction: Interchanging limits and integrals One reason the tide of euphoria is now at its height concerning the Lebesgue integral is that one can (in many cases) freely interchange limits with Lebesgue integration without fear of failing examinations. With Riemann integration, many exams have been failed because of . . . 5.1.1. Non-Riemann integrable limits. In 1898, Ren e-Louis Baire (1874 1932) introduced the sequence Dn : [0, 1] [0, 1], n = 1, 2, 3, . . ., dened by Dn (x) = 1 If x = p/q is rational in lowest terms with q n, 0 otherwise.

Notice that D_n is zero except at finitely many points, namely at 0/1, 1/1, 1/2, 1/3, 2/3, ..., (n − 1)/n. In particular, D_n is Riemann integrable. Also notice that for each x ∈ [0, 1], the limit lim_{n→∞} D_n(x) exists, and the limit function is the function D : [0, 1] → [0, 1] defined by D(x) = 1 if x is rational and 0 otherwise. This function is the restriction to [0, 1] of the famous Dirichlet function introduced by Lejeune Dirichlet (1805–1859) in 1829:

Figure 5.1. An approximate drawing of the characteristic function of the rationals, aka the Dirichlet function χ_Q : R → R.
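To make the "finitely many points" remark concrete, here is a small Python sketch (our own illustration, using the standard fractions module); it lists the points of [0, 1] where D_n equals 1, that is, the reduced fractions p/q with q ≤ n. Since each list is finite, each D_n is Riemann integrable with integral 0.

from fractions import Fraction
from math import gcd

def ones_of_Dn(n):
    """The finitely many points of [0, 1] where the Baire function D_n equals 1."""
    points = set()
    for q in range(1, n + 1):
        for p in range(0, q + 1):
            if gcd(p, q) == 1:
                points.add(Fraction(p, q))
    return sorted(points)

for n in (1, 2, 3):
    print(n, [str(r) for r in ones_of_Dn(n)])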

Using the definition of the Riemann integral, one can check that D is not Riemann integrable. Thus, in the context of the Riemann integral,¹

D = lim_{n→∞} D_n   does not imply   ∫ D = lim_{n→∞} ∫ D_n,

since the integral ∫ D is not even defined! One of the most useful properties of the Lebesgue integral is that this failure of the Riemann integral never happens!

1 We follow the convention of (usually) not putting lower/upper limits on definite integrals. Thus, ∫ f means to integrate a function f over its domain. In this example, ∫ means ∫_0^1.


More precisely, given any sequence f_n of uniformly bounded Lebesgue integrable functions, if f = lim_{n→∞} f_n, then we always have

∫ f = lim_{n→∞} ∫ f_n,

in the sense that the integrals here are dened in the Lebesgue sense and this equality holds. To summarize: For Riemanns integral we have to worry about the integrability of limit functions, while for Lebesgues theory we dont have such worries. In this sense we can say that Lebesgues integral simplies life! A diehard fan of Riemanns theory of integration might say: The functions Dn above are pathological and never occur in real life. The only functions Ive seen are nice functions (say piecewise dierentiable) and in which case we shouldnt have any problems with non-Riemann integrable limit functions. It turns out this diehard fan is terribly wrong: Even limits of dierentiable functions may not be Riemann-integrable. Some of the nicest functions in analysis are functions that are not just once dierentiable, but have innitely many derivatives. Such functions are called smooth. Even limits of smooth functions may not be Riemann integrable! In fact, we have the following theorem. Non-Riemann integrable limits Theorem 5.1. There exists a nonincreasing sequence of smooth (innitely dierentiable) nonnegative functions fn : [0, 1] [0, 1] converging pointwise to a non-Riemann integrable function. If we dont care about smoothness, an easier proof shows the existence of a nonincreasing sequence of continuous functions fn : [0, 1] R such that the limit function f : [0, 1] R is not Riemann integrable; see Problems 3 and 5. In order to prove Theorem 5.1 we take a short intermission to discuss . . . 5.1.2. Convolutions and bump functions. It is obvious that given any nonempty interval (, ) R, there exists a smooth function : R [0, 1] such that = 0 outside of (, ), > 0 on (, ), and = 1 on a subinterval of (, ), such a function is called a bump function. We can simply draw the graph of such a function with a pencil, which will look something like that shown here:
[Figure: graph of a typical bump function, positive on an interval and equal to 1 on a middle subinterval.]

Although it is obvious that bump functions exist, it still requires proof! We shall in fact prove a higher-dimensional result that we'll need later in this book. We shall construct such a function through the notion of convolution. Given functions f, g : R → R we define their convolution as the function f ∗ g : R → R, defined by

f ∗ g (x) := ∫_R f(x − y) g(y) dy = ∫_R g(x − y) f(y) dy,

at each x ∈ R where this integral makes sense. Since we haven't defined the Lebesgue integral yet, we have to interpret this integral as an (improper) Riemann integral, and f and g are assumed to be Riemann integrable. Note that the equality of the two integrals follows by making the change of variables y ↦ x − y. An important property of the convolution is its ability to blur (or smear, smudge, smooth out, ...) functions. In fact, a physical interpretation of f ∗ g in general is as a blurring of the functions f and g. Consider the following example.
Example 5.1. Let φ(x) = (1/2) χ_{[−1,1]}(x), where χ_{[−1,1]} is the characteristic function of [−1, 1]. For each ε > 0, let

φ_ε(x) := ε^{−1} φ(x/ε) = (1/(2ε)) χ_{[−ε,ε]}(x),

where we used that x/ε ∈ [−1, 1] if and only if x ∈ [−ε, ε]. Given any Riemann integrable function f, observe that

φ_ε ∗ f (x) = ∫_R f(x − y) φ_ε(y) dy = (1/(2ε)) ∫_{−ε}^{ε} f(x − y) dy = (1/(2ε)) ∫_{x−ε}^{x+ε} f(y) dy,

where at the last step we made a change of variables. Thus,

φ_ε ∗ f (x) = average value of f on the interval [x − ε, x + ε].

In other words, convolution with φ_ε replaces f(x) by its average value on an ε-neighborhood of x; in this sense φ_ε ∗ f is a blurring of f. Note that if f is continuous at x, then as ε → 0 its average values around x approach f(x), and hence φ_ε ∗ f (x) → f(x) if f is continuous at x. (That is, the convolution gets sharper, less blurry, as ε → 0.) In Section 8.3.5 we shall see this to be true in much greater generality.
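Here is a tiny numerical illustration of this averaging formula (our own sketch; the symbols φ_ε and the test function follow the example above, while the sampling grid is an arbitrary choice). Near the jump of the characteristic function the average takes intermediate values, which is exactly the blurring effect.

def blurred(f, x, eps, samples=10001):
    """(phi_eps * f)(x) = average of f over [x - eps, x + eps], computed numerically."""
    total = 0.0
    for i in range(samples):
        y = x - eps + 2 * eps * i / (samples - 1)
        total += f(y)
    return total / samples

f = lambda t: 1.0 if abs(t) <= 1 else 0.0      # characteristic function of [-1, 1]

for x in (0.0, 0.95, 1.0, 1.05, 1.5):
    print(x, round(blurred(f, x, 0.1), 3))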

We apply this blurring idea to prove the existence of bump functions. Although there are quicker proofs (see Problem 2), convolutions are so important that we should see it in action. Before we start we rst nd a smooth (innitely dierentiable) function to use in place of the function from Example 5.1.

Lemma 5.2. There is a smooth function ρ : R → [0, ∞) such that ρ(x) > 0 for |x| < 1, ρ(x) = 0 for |x| ≥ 1, and ∫ ρ = 1.

Indeed, first define g : R → [0, ∞) by g(x) := 0 for |x| ≥ 1 and g(x) := e^{−1/(1−x²)} for |x| < 1. In Problem 1 we ask you to prove that g is smooth. Let c := ∫ g, which is a positive constant, and put ρ(x) := g(x)/c so that ∫ ρ = 1. This ρ satisfies Lemma 5.2. Now for each ε > 0 put

(5.1) ρ_ε(x) := ε^{−1} ρ(x/ε).

Then ρ_ε : R → [0, ∞) satisfies ρ_ε(x) > 0 for |x| < ε, ρ_ε(x) = 0 for |x| ≥ ε, and

∫ ρ_ε(x) dx = ε^{−1} ∫ ρ(x/ε) dx = ∫ ρ(x) dx = 1,

where we changed variables x ↦ εx. We now use this family to blur, or smooth out, the characteristic function of a box to prove the existence of bump functions, as seen here:

Figure 5.2. Convolution smooths out a characteristic function.

Bump function theorem

Theorem 5.3. Let (a, b) be a nonempty open box in R^n and let ε > 0 be such that the box (a + ε, b − ε) is not empty. Then there exists a smooth function ψ : R^n → R such that 0 ≤ ψ ≤ 1, ψ > 0 on the box (a, b), ψ = 1 on the box [a + ε, b − ε], and ψ = 0 outside of the box (a, b). (See Figure 5.2.)
Proof : We break up the proof into two parts.
Step 1: We first reduce to the case n = 1. Suppose that for all real numbers α < β, there is a function ψ_{α,β} : R → R having the properties illustrated in Figure 5.2. If (a, b) = (a_1, b_1) × ··· × (a_n, b_n) is an open box in R^n, then defining

ψ(x) = ψ_{a_1,b_1}(x_1) ··· ψ_{a_n,b_n}(x_n),   where x = (x_1, ..., x_n),

it follows that ψ satisfies the conditions of the proposition. We can thus focus on the n = 1 case. Moreover, by translation we just have to prove our proposition for an interval (−a, a), symmetric about the origin.
Step 2: We now finish the proof. Let ε > 0 be such that (−a + ε, a − ε) is not empty. Let ρ_{ε/2} : R → [0, ∞) be the function in (5.1) with ε replaced by ε/2, as seen here:

Figure 5.3. A graph of ρ_{ε/2}.

Following the idea from Figure 5.2, let f := the characteristic function of the interval [−a + ε/2, a − ε/2]. We now smooth out f by convolving it with ρ_{ε/2}. Working out the convolution integral, we have

ψ(x) := f ∗ ρ_{ε/2}(x) = ∫_{x−a+ε/2}^{x+a−ε/2} ρ_{ε/2}(y) dy,

so ψ(x) equals the integral of ρ_{ε/2} over the (a − ε/2)-ball about x. Using this explicit formula for ψ it's easy to check that ψ satisfies all the conditions required: ψ is smooth, 0 ≤ ψ ≤ 1, ψ > 0 on (−a, a), ψ = 1 on [−a + ε, a − ε], and ψ = 0 outside of (−a, a). For example, since ρ_{ε/2} is smooth, by the fundamental theorem of calculus (for Riemann integrals) it follows that ψ(x) is smooth. We also have

0 ≤ ψ(x) ≤ ∫_R ρ_{ε/2}(y) dy = 1.

We shall leave the other properties of ψ for you to verify.
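The following short Python sketch (ours, not part of the proof) carries out Step 2 numerically for a = 1 and ε = 0.2: it approximates the normalizing constant of the kernel of Lemma 5.2, rescales the kernel to width ε/2, and evaluates ψ(x) by a plain Riemann sum. Up to discretization error the values are 1 on [−0.8, 0.8], strictly between 0 and 1 on the transition zones, and 0 outside (−1, 1).

import math

def rho_raw(x):
    """Unnormalized kernel of Lemma 5.2: exp(-1/(1-x^2)) for |x| < 1, else 0."""
    return math.exp(-1.0 / (1.0 - x * x)) if abs(x) < 1 else 0.0

# Normalizing constant c = integral of rho_raw (simple Riemann sum).
N = 200001
c = sum(rho_raw(-1 + 2 * i / (N - 1)) for i in range(N)) * (2 / (N - 1))

def rho_scaled(x, h):
    """rho_h(x) = h^{-1} rho(x/h), supported in [-h, h], integral 1."""
    return rho_raw(x / h) / (c * h)

def psi(x, a=1.0, eps=0.2, samples=20001):
    """psi(x) = integral of rho_{eps/2} over [x - a + eps/2, x + a - eps/2]."""
    h = eps / 2
    lo, hi = x - a + h, x + a - h
    dx = (hi - lo) / (samples - 1)
    return sum(rho_scaled(lo + i * dx, h) for i in range(samples)) * dx

for x in (0.0, 0.8, 0.9, 0.95, 1.0, 1.1):
    print(x, round(psi(x), 4))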

5.1.3. Proof of Theorem 5.1. The basic idea behind the proof of Theorem 5.1 is as follows. Let A [0, 1] be a closed nowhere dense set with positive measure, such as a thick Cantor set found in Section 4.5. From Problem 12 in Exercises 4.5 we know (and its not dicult to prove) that A : [0, 1] R is not Riemann integrable. The idea to prove Theorem 5.1 is to nd a sequence of smooth nonnegative functions fn : [0, 1] [0, 1] whose limit is A .


Before constructing such functions, we remark that it's easy to find closed nowhere dense sets with positive measure without having to go through the elaborate construction of thick Cantor sets. Indeed, let {r_1, r_2, ...} be a list of all rational numbers in (0, 1) and for each k ∈ N, let I_k ⊆ (0, 1) be an open interval containing r_k with² m(I_k) ≤ 1/2^{k+1}, and put

(5.2) U = ∪_{k=1}^∞ I_k.

Because the rationals are dense, it follows that U is dense in [0, 1], and

m(U) ≤ Σ_{k=1}^∞ m(I_k) ≤ Σ_{k=1}^∞ 1/2^{k+1} = 1/2.

Let A = [0, 1] \ U. Then A is closed (because U is open), nowhere dense (because U is dense), and m(A) ≥ 1/2 (because m(U) ≤ 1/2). For each interval I_k = (a_k, b_k), choose ε_k > 0 such that (a_k + ε_k, b_k − ε_k) is not empty. Let ψ_k be any function as described in Theorem 5.3 (and shown in Figure 5.2) having the property that 0 ≤ ψ_k ≤ 1, ψ_k > 0 on I_k, ψ_k = 1 on [a_k + ε_k, b_k − ε_k], and ψ_k = 0 outside of I_k. Define η_k : R → R by η_k = 1 − ψ_k; see Figure 5.4 for a picture of η_k. The only properties we need of η_k are that η_k

Figure 5.4. The function η_k(x).

is smooth, and that 0 ≤ η_k(x), with η_k(x) < 1 if x ∈ I_k = (a_k, b_k) and η_k(x) = 1 if x ∉ I_k = (a_k, b_k).

Now define f_n : R → R by

f_n(x) := ( η_1(x) η_2(x) η_3(x) ··· η_n(x) )^n.

We shall prove that {f_n} has all the required properties. First of all, f_n is smooth because it's a product of smooth functions. Since η_k ≤ 1 for all k, the range of f_n is in [0, 1], and since for any real number a ∈ [0, 1] we have a^{n+1} ≤ a^n ≤ 1, it follows that

f_{n+1}(x) = ( η_1(x) η_2(x) ··· η_n(x) η_{n+1}(x) )^{n+1} ≤ ( η_1(x) η_2(x) ··· η_n(x) )^{n+1} ≤ ( η_1(x) η_2(x) ··· η_n(x) )^n = f_n(x).

2 For example, take I_k = (0, 1) ∩ (r_k − 1/2^{k+2}, r_k + 1/2^{k+2}).


Hence, {f_n} is nonincreasing. We now prove that for x ∈ [0, 1],

lim_{n→∞} f_n(x) = χ_A(x), that is, the limit is 1 if x ∈ A and 0 if x ∉ A.

To see this, let x ∈ A. Then x ∉ I_k for all k, so η_k(x) = 1 for all k. Therefore, f_n(x) = 1 for all n and hence lim_{n→∞} f_n(x) = 1. Now let x ∈ [0, 1] \ A, which means that x ∈ U, or x ∈ I_k for some k. Then, as 0 ≤ η_i ≤ 1 for all i, it follows that for all n ≥ k,

0 ≤ f_n(x) = ( η_1(x) η_2(x) ··· η_k(x) ··· η_n(x) )^n ≤ ( 1 ··· 1 · η_k(x) · 1 ··· 1 )^n = η_k(x)^n.

Since x ∈ I_k, we know that 0 ≤ η_k(x) < 1. Therefore, as lim_{n→∞} a^n = 0 for any real number a with 0 ≤ a < 1, we have lim_{n→∞} η_k(x)^n = 0. This implies that lim_{n→∞} f_n(x) = 0, which completes the proof.
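To get a feel for how small the dense open set U of (5.2) actually is, here is a short Python sketch (our own; the enumeration of the rationals is an arbitrary choice, and the interval lengths 1/2^{k+2} follow footnote 2). It merges the overlapping intervals I_k and adds up the total length, which stays below 1/2, so the closed nowhere dense set A = [0, 1] \ U has measure at least 1/2.

from fractions import Fraction

def first_rationals(N):
    """The first N rationals in (0,1), enumerated by increasing denominator."""
    seen, out, q = set(), [], 2
    while len(out) < N:
        for p in range(1, q):
            r = Fraction(p, q)
            if r not in seen:
                seen.add(r)
                out.append(r)
        q += 1
    return out[:N]

N = 2000
intervals = []
for k, r in enumerate(first_rationals(N), start=1):
    half = 0.5 ** (k + 2)                    # I_k = (r_k - 1/2^(k+2), r_k + 1/2^(k+2))
    intervals.append((max(0.0, float(r) - half), min(1.0, float(r) + half)))

intervals.sort()
total = 0.0
cur_lo, cur_hi = intervals[0]
for lo, hi in intervals[1:]:
    if lo > cur_hi:                          # disjoint from the current merged block
        total += cur_hi - cur_lo
        cur_lo, cur_hi = lo, hi
    else:
        cur_hi = max(cur_hi, hi)
total += cur_hi - cur_lo
print("total length of I_1, ..., I_%d merged: %.6f" % (N, total))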
Exercises 5.1.
1. (Smooth bumps I) Define f : R → R by f(x) := e^{−1/x} for x > 0 and f(x) := 0 for x ≤ 0. Here's a graph of f:

[Figure: graph of f, identically 0 for x ≤ 0 and increasing toward 1 for x > 0.]

Prove that the nth derivative f^{(n)}, n = 1, 2, ..., can be written as

f^{(n)}(x) = (p_{n−1}(x)/x^{2n}) f(x),

where p_{n−1}(x) is a polynomial in x of degree n − 1. (Suggestion: Use l'Hospital's rule to analyze the derivatives of f at 0.) Prove that f is smooth. Note that if g : R → R is the function introduced in Lemma 5.2, then g(x) = f(1 − x²). Since f is smooth it follows that g is smooth.
2. (Smooth bumps II) Let ε > 0 and define σ_ε : R → R by the formula

σ_ε(x) = f(x) / ( f(x) + f(ε − x) ),

where f is defined in the previous problem. (i) Prove that σ_ε : R → R is nondecreasing, σ_ε(x) = 0 for x ≤ 0, σ_ε(x) > 0 for x > 0, and σ_ε(x) = 1 for x ≥ ε. (ii) Let (a, b) be a nonempty open interval in R so that (a + ε, b − ε) is not empty. Define ψ_{a,b} : R → R by ψ_{a,b}(t) = σ_ε(t − a) σ_ε(b − t). Prove that ψ_{a,b} : R → [0, 1] is smooth, ψ_{a,b} > 0 on (a, b), ψ_{a,b} = 1 on [a + ε, b − ε], and ψ_{a,b} = 0 outside of (a, b). (See Figure 5.2.)
3. In this problem and the next we prove continuous versions of Theorem 5.1. (i) Let (a, b) ⊆ R be a nonempty open interval and let ε > 0 be such that the closed interval [a + ε, b − ε] is not empty. Explicitly define, for example using a piecewise linear function, a continuous function ψ : R → R such that 0 ≤ ψ ≤ 1, ψ > 0 on (a, b), ψ = 1 on [a + ε, b − ε], and ψ = 0 outside of (a, b). (ii) Now follow the proof of Theorem 5.1 to find a nonincreasing sequence of continuous functions f_n : [0, 1] → R such that the limit function f : [0, 1] → R is not Riemann integrable. In the following problems we give a different method to find such a sequence {f_n}.


4. This problem is used in Problem 5. Given any nonempty closed set C ⊆ R define

d(x, C) := inf{ |x − z| ; z ∈ C }.

The number d(x, C) is the distance from x to C. (a) Show that we can replace inf with min; in other words, there exists a point z_0 ∈ C such that d(x, C) = |x − z_0|. (b) Define f : R → R by f(x) := d(x, C). Prove that f is continuous. In fact, prove that for any points x, y ∈ R, we have |d(x, C) − d(y, C)| ≤ |x − y|.
5. Let A ⊆ [0, 1] be any Cantor-type set in [0, 1] with positive measure that we constructed in Section 4.5, and write A = ∩_{k=1}^∞ A_k, where A_1 ⊇ A_2 ⊇ ··· and A_k is the union of the 2^k pairwise disjoint closed intervals left over at the kth stage of the construction of A. For each n ∈ N, define f_n : [0, 1] → R by

f_n(x) := 1 − n · min{ 1/n, d(x, A_n) },

or, more explicitly, f_n(x) = 1 − n d(x, A_n) if d(x, A_n) ≤ 1/n, and f_n(x) = 0 if d(x, A_n) > 1/n. Notice that f_n(x) = 1 if x ∈ A_n and f_n(x) = 0 if d(x, A_n) > 1/n. (i) Prove that f_n is continuous. (ii) Prove that {f_n} is a nonincreasing sequence of functions. (iii) Prove that for x ∈ [0, 1], lim_{n→∞} f_n(x) = χ_A(x).
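As a quick illustration of the functions in Problem 5, here is a Python sketch (ours); A1 below is just a made-up stand-in for one stage A_n of a Cantor-type construction, since any finite union of closed intervals suffices for evaluating d(x, A_n).

def dist_to_union(x, intervals):
    """d(x, C) for C a finite union of closed intervals [lo, hi]."""
    best = float("inf")
    for lo, hi in intervals:
        if x < lo:
            best = min(best, lo - x)
        elif x > hi:
            best = min(best, x - hi)
        else:
            return 0.0
    return best

def f_n(x, stage, n):
    """f_n(x) = 1 - n * min(1/n, d(x, A_n)) from Problem 5."""
    return 1.0 - n * min(1.0 / n, dist_to_union(x, stage))

A1 = [(0.0, 0.4), (0.6, 1.0)]          # hypothetical first stage A_1
for x in (0.2, 0.45, 0.5, 0.7):
    print(x, [round(f_n(x, A1, n), 3) for n in (1, 5, 50)])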

5.2. Measurable functions and Littlewoods second principle In this section we study the concept of measurability. We shall see that measurable functions are basically very robust (or strong or durable) continuous-like functions. We make continuous-like precise in Luzins theorem (Theorem 5.9), which is also known as Littlewoods second principle. We also study the concept of almost everywhere. 5.2.1. Measurable functions. A measurable space is a pair (X, S ) where X is a set and S is a -algebra of subsets of X . The elements of S are called measurable sets. Recall that a measure space is a triple (X, S , ) where is a measure on S ; if we leave out the measure we have a measurable space. Our goal now is to dene what a measurable function is, but before doing so we briey recall Lebesgues integral from Section 1.1 when we went over Lebesgues seminal paper Sur une g en eralisation de lint egrale d enie [231]. Given a bounded function f : [, ] [0, ), the idea to nd the area under f is to partition the range of the function rather than the domain. Heres a picture:
[Figure: the graph of f with its range partitioned by levels m_0 < m_1 < m_2 < m_3; the sets E_1, E_2, E_3 are marked in the domain, and the right-hand panel shows the approximating rectangles.]


In this specific example we partition the range into the point {m_0} and the three subintervals (m_0, m_1], (m_1, m_2] and (m_2, m_3], and the shaded rectangles approximate the area under f from below (similarly, we can approximate the area under f from above). The area under the shaded rectangles shown on the right is

(5.3) Area under f ≈ m_0 · m(E_0) + m_0 · m(E_1) + m_1 · m(E_2) + m_2 · m(E_3),

where E_0 = {f = m_0} (which equals a single point for this specific graph) and

E_1 = {m_0 < f ≤ m_1} = f^{−1}(m_0, m_1],

with similar expressions for E_2, E_3. Here, recall that we follow the probabilists' set notation: {x ∈ X ; Property(x)} = {Property} (i.e., drop the x). For example, E_0 is really {x ; f(x) = m_0}. If we put m_{−1} = −1 (or any other negative number) we could also write E_0 = {m_{−1} < f ≤ m_0} = f^{−1}(m_{−1}, m_0], so that all the E_i's have the same form. Now the sum (5.3) only makes sense if each set E_i is Lebesgue measurable (so that m(E_i) is defined without having to worry about pathologies). Thus, it makes sense to consider only those functions for which

f^{−1}(a, b] is Lebesgue measurable for all a, b ∈ R.

Now this requirement makes sense for any measurable space, where we replace "Lebesgue measurable" with whatever measurable sets we are working with! We shall return to Lebesgue's definition of the integral in Section 5.4. In general, we deal with extended real-valued functions, in which case we allow a, b ∈ R̄. This discussion suggests the following definition: Given a measurable space (X, S) and a function f : X → R̄, if

f^{−1}(a, b] ∈ S for all a, b ∈ R̄,

we say that f is measurable.³ It turns out, see Problem 1, that we can always assume a ∈ R and b = ∞. We are thus led to the following definition:

f : X → R̄ is measurable if f^{−1}(a, ∞] = {f > a} ∈ S for each a ∈ R.

We emphasize that the definition of measurability is not artificial but is required by Lebesgue's definition of the integral! If X is the sample space of some experiment, a measurable function is called a random variable; thus,

In probability, random variable = measurable function.

We note that intervals of the sort (a, ∞] are not special, and sometimes it is convenient to use other types of intervals.

Proposition 5.4. For a function f : X → R̄, the following are equivalent:
(1) f is measurable.
(2) f^{−1}[−∞, a] = {f ≤ a} ∈ S for each a ∈ R.
(3) f^{−1}[a, ∞] = {f ≥ a} ∈ S for each a ∈ R.
(4) f^{−1}[−∞, a) = {f < a} ∈ S for each a ∈ R.

3As a reminder, for any A R, f 1 (A) := {x X ; f (x) A}, so for instance f 1 (a, ] = {x X ; f (x) (a, ]} = {x X ; f (x) > a}, or leaving out the variable x, f 1 (a, ] = {f > a}.


Proof : Since preimages preserve complements, we have

( f^{−1}(a, ∞] )^c = f^{−1}( (a, ∞]^c ) = f^{−1}[−∞, a].

Since σ-algebras are closed under complements, we have (1) ⟺ (2). Similarly, the sets in (3) and (4) are complements, so we have (3) ⟺ (4). Thus, we just have to prove (1) ⟺ (3). Assuming (1) and writing

[a, ∞] = ∩_{n=1}^∞ (a − 1/n, ∞]   ⟹   f^{−1}[a, ∞] = ∩_{n=1}^∞ f^{−1}(a − 1/n, ∞],

we have f^{−1}[a, ∞] ∈ S since each f^{−1}(a − 1/n, ∞] ∈ S and S is closed under countable intersections. Thus, (1) ⟹ (3). Similarly,

(a, ∞] = ∪_{n=1}^∞ [a + 1/n, ∞]   ⟹   f^{−1}(a, ∞] = ∪_{n=1}^∞ f^{−1}[a + 1/n, ∞]

shows that (3) ⟹ (1).

As a consequence of this proposition, we can prove that measurable functions are closed under scalar multiplication. Indeed, let f : X → R̄ be measurable and let λ ∈ R; we'll show that λf is also measurable. In fact, assume that λ < 0 (the λ = 0 case is easy and the λ > 0 case is similar) and observe that for any a ∈ R,

{λf > a} = { f < a/λ }.

By Proposition 5.4 it follows that {λf > a} is measurable. We'll analyze more algebraic properties of measurable functions in Section 5.3. We now give examples of measurable functions. We first show that all nice functions are measurable.
Example 5.2. Let X = R^n with Lebesgue measure. Then any continuous function f : R^n → R is measurable because for any a ∈ R, by continuity (the preimage of any open set is open),

f^{−1}(a, ∞] = f^{−1}(a, ∞)

(where we used that f does not take the value ∞) is an open subset of R^n. Since open sets are measurable, it follows that f is measurable.

Thus, for Lebesgue measure, continuity implies measurability. However, the converse is far from true because there are many more functions that are measurable than continuous. For instance, Dirichlet's function D : R → R, with D(x) = 1 if x ∈ Q and D(x) = 0 if x ∉ Q, is Lebesgue measurable. Note that D is nowhere continuous. That D is measurable follows from Example 5.3 below and the fact that D = χ_Q and Q is measurable.

Example 5.3. For a general measurable space X and set A ⊆ X, we claim that the characteristic function χ_A : X → R is measurable if and only if the set A is measurable. Indeed, looking here at a graph of χ_A with three different a's,

[Figure: the graph of χ_A together with three horizontal levels, one with a < 0, one with 0 ≤ a < 1, and one with a ≥ 1.]

we see that

{χ_A > a} = X if a < 0,   {χ_A > a} = A if 0 ≤ a < 1,   {χ_A > a} = ∅ if a ≥ 1.

Hence, {χ_A > a} ∈ S for all a ∈ R if and only if A ∈ S, which proves the claim. In particular, there exist non-Lebesgue measurable functions. In fact, given any nonmeasurable set A ⊆ R^n, the characteristic function χ_A : R^n → R is not measurable. Of course, since A is non-constructive, so is χ_A. You will probably never find a nonmeasurable function in practice.

The following example shows the importance of studying extended real-valued functions, instead of just real-valued functions.

Example 5.4. Let X = Y^∞, where Y = {0, 1}, be the sample space for a Monkey Shakespeare experiment (or any other experiment involving a sequence of Bernoulli trials). Let f : X → [0, ∞] be the number of times the Monkey types sonnet 18:

f(x_1, x_2, x_3, ...) = the number of 1's in (x_1, x_2, x_3, ...) = x_1 + x_2 + x_3 + ···.

Notice that f = ∞ when the Monkey types sonnet 18 an infinite number of times (in fact, as we saw in Example 4.4 of Section 4.1, f = ∞ on a set of measure 1). To show that f is measurable, given a ∈ R, by Proposition 5.4 we need to show that {f ≤ a} ∈ S. To prove this, observe that

f(x) ≤ a   ⟺   x_1 + x_2 + ··· + x_n ≤ a for all n ∈ N.

Thus,

{f ≤ a} = ∩_{n=1}^∞ {x ; x_1 + x_2 + ··· + x_n ≤ a}.

The set {x ; x_1 + x_2 + ··· + x_n ≤ a} only depends on the first n entries of an infinite sequence (x_1, x_2, x_3, ...), so this set is of the form A_n × Y × Y × Y × ···, where A_n ⊆ Y^n is the subset of Y^n consisting of those points with no more than a total of a entries equal to 1. In particular, {x ; x_1 + x_2 + ··· + x_n ≤ a} ∈ R(C) and hence it belongs to S(C). Therefore, {f ≤ a} also belongs to S(C), so f is measurable.

Back in Section 2.1 we defined simple functions. For a quick review in the current context of our σ-algebra S, recall that a simple function (or S-simple function, to emphasize the σ-algebra S) is any function of the form

s = Σ_{n=1}^N a_n χ_{A_n},

where a_1, ..., a_N ∈ R and A_1, ..., A_N ∈ S are pairwise disjoint. By Corollary 2.3 we know that we don't have to take the A_n's to be pairwise disjoint, but for proofs it's often advantageous to do so.

Theorem 5.5. Simple functions are measurable.
Proof : Let s = Σ_{n=1}^N a_n χ_{A_n} be a simple function where a_1, ..., a_N ∈ R and A_1, ..., A_N ∈ S are pairwise disjoint. If we put A_{N+1} = X \ (A_1 ∪ ··· ∪ A_N) and a_{N+1} = 0, then X = A_1 ∪ A_2 ∪ ··· ∪ A_N ∪ A_{N+1}, a union of pairwise disjoint sets, and s = a_n on A_n for each n = 1, 2, ..., N + 1. As in the picture,

[Figure: the graph of a simple function taking the values a_1, a_2, a_3, a_4 on the sets A_1, A_2, A_3, A_4, with a horizontal level a drawn; {s > a} is the union of the A_n with a_n > a.]

it follows that

{s > a} = ∪_{a_n > a} A_n,

where the union is over all n such that a_n > a. Hence, {s > a} is just a union of elements of S, so s is measurable.
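Here is a small Python sketch (ours) of the identity {s > a} = ∪_{a_n > a} A_n on a toy finite measurable space; the sets and values below are arbitrary choices that happen to cover X, playing the role of A_1, ..., A_{N+1} in the proof.

X = set(range(10))
A = [{0, 1, 2}, {3, 4}, {5, 6, 7, 8, 9}]     # pairwise disjoint pieces covering X
vals = [-1.0, 0.5, 2.0]                      # s = sum_n vals[n] * chi_{A[n]}

def s(x):
    for value, part in zip(vals, A):
        if x in part:
            return value
    return 0.0

def superlevel(a):
    """{x in X : s(x) > a}, computed pointwise and as the union of the A_n with a_n > a."""
    pointwise = {x for x in X if s(x) > a}
    union = set().union(*[part for value, part in zip(vals, A) if value > a])
    return pointwise, union

for a in (-2.0, 0.0, 1.0, 3.0):
    print(a, superlevel(a))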

5.2.2. Measurability, continuity and topology. We have dened measurability in terms of the preimage of intervals; in the following theorem we express measurability in terms of the preimage of open sets. Measurability criteria Theorem 5.6. For a function f : X R, the following are equivalent: (1) f is measurable. (2) f 1 ({}) S and f 1 (U ) S for all open subsets U R. (3) f 1 ({}) S and f 1 (B ) S for all Borel sets B R.
Proof : To prove that (1) ⟹ (2), observe that

{∞} = ∩_{n=1}^∞ (n, ∞]   ⟹   f^{−1}({∞}) = ∩_{n=1}^∞ f^{−1}(n, ∞].

Assuming f is measurable, we have f^{−1}(n, ∞] ∈ S for each n and, since S is a σ-algebra, it follows that f^{−1}({∞}) ∈ S. If U ⊆ R is open, then by the dyadic cube theorem we can write U = ∪_{n=1}^∞ I_n where I_n ∈ I¹ for each n. Hence,

f^{−1}(U) = ∪_{n=1}^∞ f^{−1}(I_n).

By measurability, f^{−1}(I_n) ∈ S for each n, so f^{−1}(U) ∈ S. To prove that (2) ⟹ (3), we just have to prove that f^{−1}(B) ∈ S for all Borel sets B ⊆ R. To prove this, recall from Proposition 1.8 that S_f = {A ⊆ R ; f^{−1}(A) ∈ S} is a σ-algebra. Assuming (2) we know that all open sets belong to S_f. Since S_f is a σ-algebra of subsets of R and B is the smallest σ-algebra containing the open sets, it follows that B ⊆ S_f. Finally, we prove that (3) ⟹ (1). Let a ∈ R and note that

(a, ∞] = (a, ∞) ∪ {∞}   ⟹   f^{−1}(a, ∞] = f^{−1}(a, ∞) ∪ f^{−1}({∞}).

Assuming (3), we have f^{−1}({∞}) ∈ S and, since (a, ∞) ⊆ R is open and hence Borel, we also have f^{−1}(a, ∞) ∈ S. Thus, f^{−1}(a, ∞] ∈ S, so f is measurable.

We remark that the choice of using + over in the f 1 ({}) S parts of (2) and (3) were arbitrary and we could have used instead of . Consider Part (2) of Theorem 5.6, but only for real-valued functions:


One cannot avoid noticing the striking resemblance to the denition of continuity. Recall that for a topological space (T, T ), where T is the topology on a set T ,
Continuity: A function f : T R is continuous if and only if f 1 (U ) T for each open set U R.

Measurability: A function f : X R is measurable if and only if f 1 (U ) S for each open set U R.

Because of this similarity, one can think about measurability as a type of generalization of continuity. However, speaking philosophically, there are two big dierences between measurable functions and continuous functions as we can see by considering X = Rn with Lebesgue measure and its usual topology: (i) There are more measurable functions than continuous ones.
(ii) Measurable functions are closed under more operations than continuous functions.

The basic reason for these facts is that there are a lot more measurable sets than there are open sets. E.g., not only are open sets measurable but so are points, G (countable intersections of open) sets, F (countable union of closed) sets, etc. To illustrate Point (ii), in Section 5.3 we shall see that measurable functions are closed under all limiting operations. For example, a limit of measurable functions is always measurable, which is false for continuous functions. Indeed, we saw in Section 5.1 that the characteristic function of a Cantor set can be expressed as a limit of continuous (in fact, dierentiable) functions. To summarize this discussion,
Measurable functions are similar to continuous functions, but generally speaking there are more of them and they are more robust.

With this said, in Section 5.2.3 well see that measurable functions are nearly continuous functions, just like measurable sets are nearly open sets. Since we are on the subject of topology, recall that the collection of Borel subsets of a topological space is the -algebra generated by the open sets. For a measurable space (X, S ) where X is a topological space with S its Borel subsets, a measurable function f : X R is called Borel measurable to emphasize that the -algebra S is the one generated by the topology and not just any -algebra on X . Thus, For a topological space X , f : X R is Borel measurable f 1 (a, ] B (X ) for all a R. In the particular case X = Rn with its usual topology, f : Rn R is Borel measurable f 1 (a, ] B n for all a R.

Proposition 5.7. Any continuous real-valued function on a topological space is Borel measurable. The proof of this proposition follows word-for-word the Rn case in Example 5.2. A nice thing about Borel measurability is that it behaves well under composition. (The following proposition is generally false if f is assumed to be Lebesgue measurable; see Problem 9.)


Proposition 5.8. If f : R R is Borel measurable and g : X R is measurable, where X is an arbitrary measurable space, then the composition, is measurable. f g :X R

Proof : Given a R, we need to show that (f g )1 (a, ] = g 1 (f 1 (a, ]) S . The function f : R R is, by assumption, Borel measurable, so f 1 (a, ] B 1 . The function g : X R is measurable, so by Part (3) of Theorem 5.6, g 1 (f 1 (a, ]) S . Thus, f g is measurable. Example 5.5. If g : X R is measurable, and f : R R is the characteristic function of the rationals, which is Borel measurable, then Proposition 5.8 shows that the rather complicated function (f g )(x) = 1 0 if g (x) Q, if g (x) Q,

is measurable. Other, more normal looking, functions of g that are measurable include eg(x) , cos g (x), and g (x)2 + g (x) + 1.

5.2.3. Littlewoods second principle and Luzins theorem. We now continue our discussion of Littlewoods principles [251, p. 26] from Section 4.3 where we stated the rst; here are all of them taken from his book:
There are three principles, roughly expressible in the following terms: Every [nite Lebesgue] (measurable) set is nearly a nite sum of intervals; every function (of class L ) is nearly continuous; every convergent sequence of [measurable] functions is nearly uniformly convergent.

The first principle was illustrated in Theorem 4.10 and the third principle is contained in Egorov's theorem (Theorem 5.15), which we'll get to in the next section. One common, although historically inaccurate,⁴ interpretation of the second principle comes from Luzin's theorem, named after Nikolai Luzin (1883–1950) who proved it in 1912 [259], and this theorem makes precise Littlewood's comment that any Lebesgue measurable function is nearly continuous. We remark that Vitali in 1905 was the first to state and prove Luzin's theorem in the paper [403], although Vitali remarked it was known to Borel [50] and Lebesgue [234] in 1903.

Luzin's theorem

Theorem 5.9. Let X ⊆ R^n be Lebesgue measurable and let f : X → R be a Lebesgue measurable function. Then given ε > 0, there is a closed set C ⊆ R^n such that C ⊆ X, m(X \ C) < ε, and f is continuous on C.
4 Littlewoods original illustration of his second principle was not Luzins theorem! See Theorem 6 on [251, p. 27] for the precise illustration of Littlewoods second principle, which has to do with approximations in Lp (in Littlewoods notation, L ) by continuous functions.


Proof : Before presenting the proof, lets make sure we understand what it says. Take for example, Dirichlets function D : R R, which is the characteristic function of the rationals:

[Figure: an approximate graph of D.]

D is discontinuous everywhere! Luzin's theorem says that given ε > 0 there is a closed subset C ⊆ R such that m(R \ C) < ε and f|_C is continuous on C. In fact, it's easy to find such a closed set C consisting only of irrational numbers. Indeed, just let {r_1, r_2, ...} be a list of all rational numbers in R and let I_n = (r_n − ε/2^{n+2}, r_n + ε/2^{n+2}). Then U := ∪_{n=1}^∞ I_n is open, so C := R \ U is closed, and

m(R \ C) = m(U) ≤ Σ_{n=1}^∞ m(I_n) = Σ_{n=1}^∞ ε/2^{n+1} = ε/2.

Thus, m(R \ C) < ε and, since C is a subset of the irrational numbers, f|_C = 0:

[Figure: the graph of f|_C ≡ 0 on C ⊆ {irrationals}.]

The zero function is continuous, so this example verifies Luzin's theorem for Dirichlet's function. Now to the proof of Luzin's theorem.
Step 1: We first prove the theorem only requiring that C be measurable (this proof is yet another example of the ε/2^k-principle). Let {V_k} be a countable basis of open sets in R; this means that every open set in R is a union of countably many V_k's. (For example, take the V_k's as open intervals with rational end points.) We want to find a measurable set C such that f|_C : C → R is continuous and m(X \ C) < ε. The continuity of f|_C : C → R is equivalent to (since {V_k} is a basis and by definition of the subspace topology on C)

C ∩ f^{−1}(V_k) = C ∩ U_k for all k,

for some open set U_k ⊆ R^n. If f^{−1}(V_k) happens to be open, then obviously we can take U_k = f^{−1}(V_k), but f is not assumed to be continuous, so f^{−1}(V_k) is not generally open. However, using Littlewood's first principle we can approximate it with an open set! Indeed, given k ∈ N, since f^{−1}(V_k) is measurable, by Littlewood's first principle there is an open set U_k such that

f^{−1}(V_k) ⊆ U_k and m(U_k \ f^{−1}(V_k)) < ε/2^k.

Thus, we can write U_k as a union of disjoint sets

U_k = f^{−1}(V_k) ∪ W_k, where m(W_k) < ε/2^k.

Now put

C := X \ ∪_{k=1}^∞ W_k.

Then C is measurable and

m(X \ C) ≤ Σ_{k=1}^∞ m(W_k) < Σ_{k=1}^∞ ε/2^k = ε.

Moreover, for any k ∈ N,

C ∩ U_k = C ∩ ( f^{−1}(V_k) ∪ W_k ) = C ∩ f^{−1}(V_k),

where we used that C ∩ W_k = ∅. It follows that f|_C : C → R is continuous.


Step 2: We now require that C be closed. Given > 0 by Step 1 we can choose a measurable set B X such that m(X \ B ) < /2 and f is continuous on B . By a Littlewoods rst principle we can choose a closed set C Rn such that C B and m(B \ C ) < /2. Since we have X \ C = (X \ B ) (B \ C ),

m(X \ C ) m(X \ B ) + m(B \ C ) < . Also, since C B and f is continuous on B , the function f is automatically continuous on the smaller set C . This completes the proof of our theorem.

In Problem 7 we shall see that Luzin's theorem holds not just for R^n but for topological spaces as well. Our last topic in this section is the idea of almost everywhere, a subject we'll see quite often in the sequel.

5.2.4. The concept of almost everywhere. Let (X, S, μ) be a measure space. We say that a property holds almost everywhere (written a.e.) if the set of points where the property fails to hold is a measurable set with measure zero. We also say μ-a.e., because "almost everywhere" depends on the measure, although we often leave out the measure for simplicity. We also remark that "a.e." can be expressed as "almost everywhere" or "almost every". For example, we say that a sequence of functions {f_n} on X converges a.e. to a function f on X, written

f_n → f a.e.,   lim f_n = f a.e.,   or   lim_{n→∞} f_n(x) = f(x) for a.e. x

(where a.e. = almost everywhere in the first two and a.e. = almost every in the last one) if f(x) = lim_{n→∞} f_n(x) except on a set of measure zero. Explicitly,

f_n → f a.e.   ⟺   A := { x ; f(x) ≠ lim_{n→∞} f_n(x) } ∈ S and μ(A) = 0.

Here's a picture where f is the zero function:

[Figure: graphs of f_1, f_2, f_3, ..., which converge a.e. to f ≡ 0; the exceptional set A consists of a single point.]

In this picture, for x ∈ A (which consists of a single point), the limit lim_{n→∞} f_n(x) exists but does not equal f(x). In the general case, the limit lim_{n→∞} f_n(x) need not exist at any point in A. For another example, given two functions f and g on X, we say that f = g a.e. if the set of points where f ≠ g is measurable with measure zero:

f = g a.e.   ⟺   A := { x ; f(x) ≠ g(x) } ∈ S and μ(A) = 0.

Example 5.6. Consider the Baire sequence f_n : [0, 1] → R, n = 1, 2, 3, ..., defined by f_n(x) = 1 if x = p/q is rational in lowest terms with q ≤ n, and f_n(x) = 0 otherwise. Let f(x) = 0 for all x ∈ [0, 1]. Then f_n → f Lebesgue a.e. because the set of points where this is false is the set of rational numbers in [0, 1], which has Lebesgue measure

zero. Observe that if D is the Dirichlet function on [0, 1] (the characteristic function of the rationals in [0, 1]), then D = 0 Lebesgue a.e. because of the same reason. Example 5.7. (Lebesgue = Borel a.e.) Heres a classic fact: If f : R R is Lebesgue measurable, then there is a Borel measurable function g : R R such that f = g Lebesgue a.e. You will prove this in Problem 6 (see also Problem 6 in Exercises 5.3).

On a measure space (X, S , ), if g is measurable and f = g a.e., then one might think that f must also be measurable. However, as youll see in the following proof, to always make this conclusion we need to assume the measure is complete. (The reason is that for an incomplete measure space, sets of measure zero can have nonmeasurable subsets; this cannot happen for complete measures.) Proposition 5.10. Assume that is complete and let f, g : X R. If g is measurable and f = g a.e., then f is also measurable.
Proof : Assume that g is measurable and f = g a.e., so A = {x ; f (x) = g (x)} is measurable with measure zero. Let a R; we need to show that {f > a} is measurable. First, the set {f > a} A is measurable since its a subset of A which is measurable with measure zero. Second, since f = g on Ac , we have {f > a} Ac = {g > a} Ac , which is also measurable since g is measurable and Ac is measurable. It follows that {f > a}, which is the union of {f > a} A and {f > a} Ac , is measurable. Exercises 5.2. 1. Here are some equivalents denition of measurability. Let f : X R be a function on a measurable space (X, S ). (a) Prove that f 1 (a, b] S for all a, b R if and only if f 1 (a, ] S for all a R. (b) If {an } is any countable dense subset of R, prove that f is measurable if and only if f 1 ({}) and all sets of the form f 1 (am , an ], where m, n N, are measurable. (For example, f is measurable if and only if f 1 ({}) and all sets of the form f 1 (k/2n , (k + 1)/2n ], where k Z and n N, are measurable.) (c) If f 0, prove that f is measurable if and only if for all k Z and n N with 0 k 22n 1, the sets f 1 (k/2n , (k + 1)/2n ] and f 1 (2n , ], are measurable. 2. Here are some problems dealing with nonmeasurability. (a) Find a non-Lebesgue measurable function f : R R such that |f | is measurable. (b) Find a non-Lebesgue measurable function f : R R such that f 2 is measurable. (c) Find two non-Lebesgue measurable functions f, g : R R such that both f + g and f g are measurable. (d) Find a Lebesgue measurable function f : R R and a Lebesgue measurable set set A R such that f 1 (A) is not Lebesgue measurable. Suggestion: Try to use the function in Problem 8 of Exercises 5.6. 3. Here are some problems dealing with measurable functions. (a) Prove that any monotone function f : R R is Lebesgue measurable. (b) A function f : R R is said to be lower-semicontinuous at a point c R if for any > 0 there is a > 0 such that |x c| < = f (c) < f (x).

Intuitively, this means that for x near c, f (x) is either near f (c) or greater than f (c). The function f is lower-semicontinuous if its lower-semicontinuous at all points of R. Prove that any lower-semicontinuous function is Lebesgue measurable.


(To get a feeling for lower-semicontinuity, note that the functions (0,) , (,0) , and (,0)(0,) are lower-semicontinuous at 0.) (c) A function f : R R is said to be upper-semicontinuous at a point c R if for any > 0 there is a > 0 such that |x c| < = f (x) < f (c) + .

4.

5.

6.

7.

Intuitively, this means that for x near c, f (x) is either near f (c) or less than f (c). The function f is upper-semicontinuous if its upper-semicontinuous at all points of R. Prove that any upper-semicontinuous function is Lebesgue measurable. Since we work with extended real-valued functions, we can use an extended Borel -algebra to dene measurability. In this problem we dene and study this -algebra. (a) The extended Borel -algebra, B, is the -algebra of subsets of R generated by I 1 and subsets of {}. Prove that A B if and only if for some B B , we have A = B or A = B {} or A = B {}, or A = B {, }. (b) Prove that B is the -algebra generated by the open sets in R, which by denition consist of open sets in R together with sets of the form U I J where U is open in R and I and J are intervals of the form [, a) and (b, ] where a, b R. (c) Prove that B is the -algebra of subsets of R generated by any of one of the following collections of subsets (here, a and b represent real numbers): (1) (a, b], {}; (2) (a, b), {}; (3) [a, b), {}; (4) [a, b], {}; (5) [, a]; (6) [, a); (7) [a, ]; (8) (a, ]. To avoid boredom, just prove a couple. (d) Prove that f : X R is measurable if and only if f 1 (A) S for each A B , if and only if f 1 (A) S for each set A of the form given in any one of eight collections of sets stated in Part (iii). In this problem we prove that measurable functions are closed under the usual arithmetic operations. Let f, g : X R be measurable. (i) Prove that f 2 , and more generally, |f |p where p > 0, and, assuming that f never vanishes, that 1/f are each measurable. (ii) Assume that f (x) + g (x) is dened for all x X ; that is, assume that f (x) and g (x) are not opposite innities for any x X . For a R and x X , prove that f (x) + g (x) < a if and only if there exists rational numbers r, s such that f (x) < r , g (x) < s, and r + s < a. Use this fact to prove that f + g is measurable. (iii) We now prove that the product f g is measurable. Let A = {x ; (f (x), g (x)) R R}. Prove that A is measurable. Show that f g : A R and f g : Ac R are measurable. Conclude that f g : X R is measurable. Suggestion: To prove measurability of f g on A, observe that f g = 1 (f + g )2 (f g )2 . 4 We can improve Luzins theorem as follows. First prove the (i) (Tietze extension theorem for R); named after Heinrich Tietze (18801964) who proved a metric space version in 1915 [385]. Let A R be a nonempty closed set and let f0 : A R be a continuous function. Prove that there is a continuous function f1 : R R such that f1 |A = f0 , and if f0 is bounded in absolute value by a constant M , then we may take f1 to the have the same bound. (ii) In (ii)(iv), f : X R denotes a Lebesgue measurable function, where X R is Lebesgue measurable. Prove there is a closed set C R such that C X , m(X \ C ) < , and a continuous function g : R R such that f = g on C and such that if f is bounded in absolute value by a constant M , then so is g . In (iii) and (iv) we give applications of this result. (iii) Prove there is a sequence of continuous functions {fn } where fn : R R such that on X , fn f Lebesgue a.e. (iv) (Lebesgue = Borel a.e.) Assume now that X is a Borel set (while f : X R is Lebesgue measurable). Prove there is a Borel measurable function g : X R such that f = g Lebesgue a.e. Here are some generalizations of Luzins Theorem.

256

5. BASICS OF INTEGRATION THEORY

(i) Let be a -nite regular Borel measure on a topological space X , let f : X R be measurable, and let > 0. Using Problem 8 in Exercises 4.3 on Littlewoods rst principle(s) for regular Borel measures, prove that there exists a closed set C X such that m(X \ C ) < and f is continuous on C . (ii) We now assume youre familiar with topology at the level of [286]. With the hypotheses as in (i) except now X is a normal topological space, prove that if C denotes the closed set in (i), there is a continuous function g : X R such that f = g on C and moreover, if f is bounded in absolute value by a constant, then we may take g to bounded by the same constant. You will need the Tietze extension theorem for normal spaces, a theorem actually proved by Pavel Urysohn (18981924) and published posthumously in 1925 [393]. 8. (Tonellis integral; [106]) Here we present Leonida Tonellis (18851946) integral published in 1924 [389]. Let f : [a, b] R be a bounded function, say |f | M for some constant M . We say that f is quasi-continuous (q.c.) if there is a sequence of closed sets C1 , C2 , C3 , . . . [a, b] with lim m(Cn ) = b a and a sequence of continuous functions f1 , f2 , f3 , . . . where each fn : [a, b] R satises f = fn on Cn and |fn | M . (i) Let f : [a, b] R be bounded. Prove that f is q.c. if and only if f is measurable. To prove the if statement, you may assume Problem 6. (ii) Let f : [a, b] R be q.c. and let {fn } be a sequence of continuous functions in the denition of q.c. for f . Let R(fn ) denote the Riemann integral of fn (which exists since f is continuous) and prove that the limit lim R(fn ) exists and its value is independent of the choice of sequence {fn } in the denition of q.c. for f . Tonelli denes the integral of f as
b n n

f := lim R(fn ).
a n

It turns out that Tonellis integral is exactly the same as Lebesgues integral. 9. (cf. [161]) We show that the composition of two Lebesgue measurable function may not be Lebesgue measurable. Indeed, let and M be the homeomorphism and Lebesgue measurable set, respectively, of Problem 8 in Exercises 4.4. Let g = M . Show that g 1 is not Lebesgue measurable. Note that both 1 and g are Lebesgue measurable. 10. (Cauchys functional equation IV: The Banach-Sierpi nski theorem) Use any information in Problem 6 in Exercises 1.6 and Problem 6 Exercises 4.3. Prove the Banach-Sierpi nski theorem, proved in 1920 by Stefan Banach (18921945) [17] and Waclaw Sierpi nski (18821969) [354], which states that if f : R R is additive and Lebesgue measurable, then f (x) = f (1) x for all x R. As a corollary, we get: Every discontinuous additive function is not Lebesgue measurable. Suggestions: One can prove that for some n N, the set {x R ; |f (x)| n} has positive measure, or one can use Luzins theorem to nd a set of positive measure on which f is bounded. 11. (Cauchys functional equation V: Hamel basis) Using the axiom of choice (really its equivalent form, Zorns lemma) there is a set B R (which is uncountable) such that every nonzero x R can be written uniquely as (5.4) for some nonzero r1 , . . . , rn Q and distinct elements b1 , . . . , bn B . The set B is called a Hamel basis, after Georg Hamel (18771954) who wrote a paper on Cauchys equation and rst described such a basis in 1905 [162]. Show that a function f : R R is additive if and only if for each x R, we have f ( x ) = r 1 f ( b1 ) + + r n f ( bn ) where x is written as in (5.4). This allows one to easily nd nonmeasurable functions. Indeed, let us dene f (b) = 1 for each b B so that f (x) = r1 + + rn where x is written as in (5.4); using the previous problem show that f is not measurable. x = r 1 b1 + r 2 b2 + + r n bn ,

5.3. SEQUENCES OF FUNCTIONS AND LITTLEWOODS THIRD PRINCIPLE

257

5.3. Sequences of functions and Littlewoods third principle In this section we continue our study of measurability. We show that measurable functions are very robust in the sense that they are closed under any kind of arithmetic or limiting process involving at most countably many operations: addition, multiplication, division etc.; this isnt surprising since measurable sets are closed under countable operations. We also discuss Littlewoods third principle on limits of measurable functions. 5.3.1. Limits of sequences. Before discussing limits of sequences of functions we need to start with limits of sequences of extended real numbers. Its a fact of life that limits of sequences of extended real numbers in general do not exist; for example, a sequence {an } can oscillate like this:
an a1 a3 a5 a7 a9 a11 a13 a15 a17 n a10 a12 a14 a16 a18

a2

a4

a6

a8

Figure 5.5. The sequence a1 , a2 , a3 , a4 , . . . bounces up and down. However, for the sequence shown in Figure 5.5, assuming that the sequence continues the way it looks like it does, it is clear that although lim an does not exist, the sequence does have an upper limiting value, given by the limit of the odd-indexed an s and a lower limiting value, given by the limit of the even-indexed an s. Now how do we nd the upper (also called supremum) and lower (also called inmum) limits of {an }? It turns out there is a very simple way to do so, as we now explain. Given an arbitrary sequence {an } of extended real numbers, put5 s1 = sup ak = sup{a1 , a2 , a3 , . . .},
k 1 k 2

s2 = sup ak = sup{a2 , a3 , a4 , . . .}, s3 = sup ak = sup{a3 , a4 , a5 , . . .},


k 3

and in general, sn = sup ak = sup{an , an+1 , an+2 , . . .}.


k n

Note that s1 s2 s3 sn sn+1 is an nonincreasing sequence since each successive sn is obtained by taking the supremum of a smaller set of elements (and when sets get smaller, their supremums cannot increase). Since {sn } is an nonincreasing sequence of extended real numbers, the limit lim sn exists in R; in fact, lim sn = inf sn = inf {s1 , s2 , s3 , . . .},
n 5Here, sup is in the sense of extended real numbers, so the (extended) real number s n could equal if the set {an , an+1 , an+2 , . . .} is bounded above only by .

258

5. BASICS OF INTEGRATION THEORY

as can be easily be checked. We dene the lim sup of the sequence {an } as lim sup an := inf sn
n

= lim sn = lim
n

sup{an , an+1 , an+2 , . . .} .

Note that the terminology lim sup of {an } ts well because lim sup an is exactly the limit of a sequence of supremums.
Example 5.8. For the sequence an shown in Figure 5.5, you can check that s1 = a1 , s2 = a3 , s3 = a3 , s4 = a5 , s5 = a5 , . . . . Thus, lim sup an is exactly the limit of the odd-indexed an s.

We now dene the lower/inmum limit of an arbitrary sequence {an }. Put6 1 = inf ak = inf {a1 , a2 , a3 , . . .},
k 1

2 = inf ak = inf {a2 , a3 , a4 , . . .},


k 2

3 = inf ak = inf {a3 , a4 , a5 , . . .},


k 3

and so forth. Note that 1 2 3 n n+1 is an nondecreasing sequence since each successive n is obtained by taking the inmum of a smaller set of elements. Since {n } is an nondecreasing sequence, the limit lim n exists, and equals supn n . We dene the lim inf of the sequence {an } as lim inf an := sup n
n

= lim n = lim
n

inf {an , an+1 , an+2 , . . .} .

Note that the terminology lim inf of {an } ts well because lim inf an is the limit of a sequence of inmums.
Example 5.9. For the sequence an shown in Figure 5.5, you can check that 1 = a2 , 2 = a2 , 3 = a4 , 4 = a4 , 5 = a6 , . . . , Thus, lim inf an is exactly the limit of the even-indexed an s.

The following lemma contains some useful properties of limsups and liminfs, the proof of which we leave to the Appendix.

6As with supremum, inf is in the sense of extended real numbers, so the (extended) real number n could equal if the set {an , an+1 , an+2 , . . .} is bounded below only by .

5.3. SEQUENCES OF FUNCTIONS AND LITTLEWOODS THIRD PRINCIPLE

259

Lemma 5.11. For a nonempty set A R and a sequence of extended real numbers {an }, the following properties hold: (1) sup A = inf(A) and inf A = sup(A), where A = {a ; a A}. (2) lim sup an = lim inf(an ) and lim inf an = lim sup(an ). (3) lim an exists in R if and only if lim sup an = lim inf an , in which case, lim an = lim sup an = lim inf an . (4) If {bn } is another sequence of extended real numbers and an bn for all n suciently large, then lim inf an lim inf bn and lim sup an lim sup bn .

5.3.2. Operations on measurable functions. Let {fn } be a sequence of extended-real valued functions on a measure space (X, S , ). We dene the functions sup fn , inf fn , lim sup fn , and lim inf fn , by applying the operations pointwise to the sequence of extended real numbers {fn (x)} at each point x X . For example, sup fn : X R is the function dened by sup fn (x) := sup{f1 (x), f2 (x), f3 (x), . . .} and is the function dened by lim sup fn : X R at each x X. at each x X,

lim sup fn (x) := lim sup(fn (x)) We dene the limit function lim fn by

lim fn (x) := lim (fn (x))


n

at those points x X where the right-hand limit exists. We now show that limiting operations dont change measurability. Limits preserve measurability Theorem 5.12. If {fn } is a sequence of measurable functions, then the functions sup fn , inf fn , lim sup fn , and lim inf fn are all measurable. If the limit lim fn (x) exists at each x X , then the limit
n

function lim fn is measurable. For instance, if the sequence {fn } is monotone, that is, either nondecreasing or nonincreasing, then lim fn is everywhere dened and it is measurable.

Proof : To prove that sup fn is measurable, by Proposition 5.4 we just have to show that for each a R, {sup fn a} S . However, this is easy because by denition of supremum, for any a R, sup{f1 (x), f2 (x), f3 (x), . . .} a fn (x) a for all n,

260

5. BASICS OF INTEGRATION THEORY

therefore {sup fn a} =

n=1

{fn a}.

Since each fn is measurable, it follows that {sup fn a} S . Using an analogous argument one can show that inf fn is measurable. To prove that lim sup fn is measurable, note that by denition of lim sup, lim sup fn := inf {s1 , s2 , s3 , . . .}, where sn = supkn fk . We already proved that the supremum of a sequence of measurable functions is measurable, so each sn is measurable and since the inmum of a sequence of measurable function is also measurable, it follows that lim sup fn = inf {s1 , s2 , . . .} is measurable. An analogous argument can be used to show that lim inf fn is measurable. If the limit function lim fn is well-dened, then by Part (3) of Lemma 5.11 we know that lim fn = lim sup fn (= lim inf fn ). Thus, lim fn is measurable. Example 5.10. Let X = Y , where Y = {0, 1}, a sample space for the Monkey Shakespeare experiment (or any other sequence of Bernoulli trials), and let f : X [0, ] be the random variable given by the number of times the Monkey types sonnet 18. Then f (x ) = That is, f=
n=1 n

xn

n=1

An = lim

Ak ,
k=1

where An = Y Y Y {1} Y Y where {1} is in the nth slot. Therefore, f is a limit of simple functions, so f is measurable.

Given f : X R, we dene its nonnegative part f+ : X [0, ] and its nonpositive part f : X [0, ] by f+ := max{f, 0} = sup{f, 0} and f := min{f, 0} = inf {f, 0}. Here is an example of f for a parabolic graph:
R f R f+ R f

We have the following (easy-to-check) important equalities f = f+ f and |f | = f+ + f .

Assuming f is measurable, by the sup part of Theorem 5.12, we know that f+ (which equals sup{f, 0} and also f (which equals ( inf {f, 0} = sup{f, 0}) are measurable. In particular, the equality f = f+ f shows that any measurable function can be expressed as the dierence of nonnegative measurable functions. In the following theorem we prove that a function is measurable if and only if it is a limit of simple functions using Lebesgues idea of partitioning the y -axis in his paper Sur une g en eralisation de lint egrale d enie [231]. See Problem 2 for the situation actually treated in his paper.

5.3. SEQUENCES OF FUNCTIONS AND LITTLEWOODS THIRD PRINCIPLE

261

Limits of simple functions Theorem 5.13. A function is measurable if and only if it is the limit of simple functions. Moreover, if the function is nonnegative, the simple functions can be taken to be a nondecreasing sequence of nonnegative simple functions.
Proof : Consider rst the nonnegative case. Let f : X [0, ] be measurable. Following Lebesgue, we shall partition the y -axis with ner and ner partitions and let the partition width go to zero. To give a concrete example of such a partition, for each n N, consider the simple function 0 if 0 f (x) 21 n 1 1 2 if < f ( x ) n n 2 2 2n 2 2 2n if 2n < f (x) 23 n s n (x ) = . . . . . . 2n 2n 2 1 22n 1 if < f (x ) 2 = 2n n 2n 2n n2 n 2 if 2 < f (x) . See Figure 5.6 for an example of a function f and pictures of the corresponding
R 1
1 2

R f 1
3 4 1 2 1 4

R f f

X R 1
1 2

X R 1 R s2

s1

3 4 1 2 1 4

s3

Figure 5.6. Here, f looks like a V and is bounded above by 1. The top gures show partitions of the range of f into halves, quarters, then eigths and the bottom gures show the corresponding simple functions. It is clear that s1 s2 s3 .
s1 , s2 , and s3 . At least if we look at Figure 5.6, it is not hard to believe that in general, the sequence {sn } is always nondecreasing: and lim sn (x) = f (x) at every point x X . Because this is so believable looking
n

0 s1 s2 s3 s4

at Figure 5.6, we leave you the pleasure of verifying these facts! Now let f : X R be any measurable function; we need to show that f is the limit of simple functions. To prove this, write f = f+ f as the dierence of its nonnegative and nonpositive parts. Since f are nonnegative measurable functions, we know that f+ and f can be written as limits of simple functions, say {sn } and {tn }, respectively. It follows that is also a limit of simple functions. f = f+ f = lim(sn tn )

262

5. BASICS OF INTEGRATION THEORY

Using Theorem 5.13 on limits of simple functions, it is easy to prove that measurable functions are closed under all the usual arithmetic operations. Of course, the proofs arent particularly dicult to prove directly (except perhaps the sum f + g see Problem 5 in Exercises 5.2). Theorem 5.14. If f and g are measurable, then f + g , f g , 1/f , and |f |p where p > 0, are also measurable, whenever each expression is dened.
Proof : We need to add the last statement for f + g and 1/f . For 1/f we need f to never vanish and for f + g we dont want f (x)+ g (x) to give a nonsense statement such as or + at any point x X . The proofs that f + g , f g , 1/f , and |f |p are measurable are all the same: we just show that each combination can be written as a limit of simple functions. By Theorem 5.13 we can write f = lim sn and g = lim tn for simple functions sn , tn , n = 1, 2, 3, . . .. Therefore, f + g = lim(sn + tn ) and f g = lim(sn tn ).

Since the sum and product of simple functions are simple, it follows that f + g and f g are limits of simple functions, so are measurable. To see that 1/f and |f |p are measurable, write the simple function sn as a nite sum ank Ank , sn =
k

where An1 , An2 , . . . S are nite in number and pairwise disjoint, and an1 , an2 , . . . R, which we may assume are all nonzero. If we dene un =
k 1 a nk Ank

and

vn =
k

|ank |p Ank ,

which are simple functions, then a short exercise shows that f 1 = lim un and |f |p = lim vn ,

where in the rst equality we assume that f is nonvanishing. This shows that f 1 and |f |p are measurable.

In particular, since products and reciprocals of measurable functions are measurable, whenever the reciprocal is well-dened, it follows that quotients of measurable functions are measurable, whenever the denominator is nonvanishing. 5.3.3. Littlewoods third principle. We now come to Egorovs theorem, named after named after Dimitri Egorov (18691931) who proved it in 1911 [116]. This theorem makes precise the third of Littlewoods principles, which is
Every convergent sequence of [real-valued] measurable functions is nearly uniformly convergent.

More precisely, in the words of Lebesgue who in 1903 stated (without proof!) Dimitri Egorov this principle as [234, p. 1229]7: (18691931). Every convergent series of measurable functions is uniformly
convergent when certain sets of measure are neglected, where can be as small as desired.
7

translation taken from [274, p. 112]

5.3. SEQUENCES OF FUNCTIONS AND LITTLEWOODS THIRD PRINCIPLE

263

Lebesgue here is introducing the idea which is nowadays called convergence almost uniformly. A sequence {fn } of measurable functions is said to converge almost uniformly (or a.u. for short) to a measurable function f , denoted by if for each > 0, there exists a measurable set A such that (Ac ) < and fn f uniformly on A. As a quick review, recall that fn f uniformly on A means that given any > 0, Note that fn (x) and f (x) are necessarily real-valued (cannot take on ) on A. Therefore, Lebesgue is saying that
Every convergent sequence of real-valued measurable functions is almost uniformly convergent.

fn f a.u.,

|fn (x) f (x)| < ,

for all x A and n suciently large.

This is nowadays called Egorovs theorem. In the following example we look at the dierence between uniform and almost uniform convergence.
Example 5.11. Consider the following two sequences {fn } of functions on [0, 1]:
f fn f + f f3 f2 ]f1 1 f4 f3 fn f1 f2 ] 1

Figure 5.7. Left: The fn s are lines rotating toward the line f . Right:
The fn s are humps (of the same height) that move to the left and approach zero otherwise. The left-hand picture illustrates uniform convergence. Given > 0 we see that for all n suciently large, we have |fn (x) f (x)| < for all x [0, 1], or equivalently (using the denition of absolute value), Geometrically, uniform convergence simply means that for all n suciently large, the graph of fn is trapped between the graphs of f and f + . Now consider the right-hand picture. With 0 denoting the zero function we see that fn 0 pointwise, meaning that for each x [0, 1], fn (x) 0. However, fn 0 uniformly. For example, for any < height of the humps, its not the case that the graph of fn is trapped between the horizontal lines and , because the hump of fn will always stick out above the horizontal line at height . However, fn does converge to 0 a.u. with respect to Lebesgue measure. Indeed, let > 0 and let A = [/2, 1]. Then its clear that m([0, 1] \ A) < and fn 0 uniformly on A as seen here:
fn A
c

f (x) < fn (x) < f (x) +

for all x [0, 1].

] 1

Figure 5.8. Given any > 0, for all n suciently large, we see that |fn (x)| < for all x A.

264

5. BASICS OF INTEGRATION THEORY

Now that we understand a.u. convergence, we state Egorovs theorem. Egorovs theorem Theorem 5.15. On a nite measure space, a.e. convergence implies a.u. convergence for real-valued measurable functions. That is, any sequence of real-valued measurable functions that converges a.e. to a real-valued measurable function converges a.u. to that function.
Proof : Let f, f1 , f2 , f3 , . . . be real-valued measurable functions on a measure space X with (X ) < , and assume that f = lim fn a.e., which means there is a measurable set E X with (X \ E ) = 0 and f (x) = lim fn (x) for all x E . We need to show that fn f a.u. (this proof is yet another 2 k principle, or in this case, well use an 2 k principle). Given > 0, we need to nd a measurable set A such that (Ac ) < and for any > 0, |fn (x) f (x)| < , for all x A and n suciently large.
n

The idea to nd A is to nd, for each k N, a set Ak where is replaced by 1/k, then intersect all the Ak s. Step 1: Given > 0 and k N we shall prove that there is a measurable set B X and an N N such that (5.5) (B c ) < and for x B , |f (x) fn (x)| < 1 for all n > N. k

Indeed, motivated by the second condition in (5.5) lets dene for each m N, Bm := =
nm

x X ; |f (x) fn (x)| <

1 for all n m k 1 x X ; |f (x) fn (x)| < . k

Notice that each Bm is measurable and B1 B2 B3 . Also, since fn f on E , it follows that if x E , then |f (x) fn (x)| < 1/k for all n suciently large. Hence, there is an m such that x Bm and so, E
m=1

Bm .

Thus, as (X ) = (E ) (since (X \ E ) = 0), by continuity of measures, (X ) lim (Bm ).


m

On the other hand, since (Bm ) (X ) for all m (because Bm X ) it follows that (X ) = lim (Bm ).
m

Thus, we can choose N such that (X ) (BN ) < . Then with B = BN its easy to check that (5.5) holds. This concludes Step 1. Step 2: We now nish the proof. Let > 0. Then by Step 1, for each k N we can nd a measurable set Ak X and a corresponding natural number Nk N such that (Ac k) < 2k+1 and |f (x) fn (x)| < 1 for x Ak and n > Nk . k

5.3. SEQUENCES OF FUNCTIONS AND LITTLEWOODS THIRD PRINCIPLE

265

Now put A =

k=1

Ak . Then Ac = (Ac )
k=1

k=1

Ac k , so
k=1

(Ac k)

= < , 2k+1 2

and we claim that fn f uniformly on Ac . Indeed, let > 0 and choose N such that 1/ < . Then 1 x A = x A = |f (x) fn (x)| < for all n > N = |f (x) fn (x)| < for all n > N . Thus, fn f a.u.

We remark that one cannot drop the niteness assumption; see Problem 7.
Exercises 5.3. 1. Show that each of the following functions is measurable on Y , where Y = {0, 1} and where measurable means S (C )-measurable where C is the collection of cylinder sets. (a) For each n N, dene n : Y R by n (x) := the run of heads starting from the nth toss; thus, n (x) = 0 if xn = 0, n (x) = i if xn = xn+1 = = xn+i1 = 1 and xn+i = 0, and n (x) = else. (b) Dene : Y R by (x) is longest run of heads in the sequence x. (Here we dene (x) := if for any M R there exists a run of heads in x of length > M .) (c) A gambler with an initial capital of $i walks into a casino, which has an innite amount of money, and he sits down at a table and starts gambling and he doesnt stop until he goes broke. Lets say that 1 represents he wins a game and 0 he loses a game; if he wins he gets $1 and if he loses he gives the house $1. Dene B : Y R by B (x) = the number of games he plays until he goes broke (B (x) = if he never go broke in the sequence x of games). 2. (Lebesgues 1901 idea) Let f : [, ] [0, ) be a bounded measurable function, with range lying in a bounded interval [m, M ]. Given a partition P = {m0 , m1 , . . . , mp } of [m, M ], put m1 = 1 and let P and uP be the simple functions
p N

P = m 0 E0 +
i=1

mi1 Ei , uP = m0 E0 +

m i Ei ,
i=1

Ei = f 1 (mi1 , mi ].

3.

4.

5. 6.

(i) Prove that Q P f uP uQ for partitions P and Q of [m, M ] with Q P. (ii) Let P denote the set of nondecreasing sequences of partitions of [m, M ] whose lengths are approaching zero. Prove that {Pn } is a nondecreasing sequence of simple function converging uniformly to f and prove that {uPn } is a nonincreasing sequence of simple function converging uniformly to f . Let A1 , A2 , . . . be in a -algebra and put lim sup An := n=1 k=n Ak and lim inf An := A . Let f and f be the characteristic functions of lim sup An and lim inf An , k n=1 k =n respectively, and for each n, let fn be the characteristic function of An . Prove that f = lim sup fn and f = lim inf fn . (Subsequence denition of lim sup and lim inf ) Let {an } be a sequence in R and let A R be the collection of all subsequential limits of {an }; thus, a A if there is a subsequence of {an } that converges to a. Prove (i) lim sup an = sup A and (ii) lim inf an = inf A. In fact, prove that lim sup an is the maximum element of A and lim inf an is the minimum element of A. If {fn } is a sequence of measurable extended real-valued functions, prove that the set {lim fn exists} = {x X ; limn fn (x) exists in R} is measurable. (Lebesgue = Borel a.e.) Let X Rn be a Borel set and let f : X R be Lebesgue measurable. Using Theorem 5.13, prove there is a Borel measurable function g : X R

266

5. BASICS OF INTEGRATION THEORY

such that f = g Lebesgue a.e. First prove that any M n -simple function equals, a.e., a B n -simple function. 7. Here are some a.u. convergence problems. (a) Let X = R with Lebesgue measure and let fn be the characteristic function of [n, ). Show that fn 0 everywhere (that is, pointwise), but fn 0 a.u. (This shows that the niteness assumption on X in Egorovs theorem cannot be dropped.) (b) Let X = [0, 1) with Lebesgue measure and let fn be the characteristic function of the interval [1 1/n, 1). Show that fn 0 a.u., but fn 0 uniformly. In the following series of problems, we study various convergence properties of measurable functions. We shall work in a xed measure space (X, S , ).

8. Let fn , n = 1, 2, . . ., and f be real-valued measurable functions. Prove that fn f a.e. if and only if for each > 0,
n=1 mn

{x ; |fm (x) f (x)| }

= 0.

9. (Cf. [258]) In this problem we give a characterization of almost uniform convergence. Let fn , n N, and f be real-valued measurable functions. Prove that fn f a.u. if and only if for each > 0,
n

lim
mn

{x ; |fm (x) f (x)| }

= 0.

10. Problems 8 and 9 are quite useful. (a) (a.u. = a.e.) Using Problems 8 and 9, prove that if a sequence {fn } of real-valued measurable functions converges a.u. to a real-valued measurable function f , then the sequence {fn } also converges to f a.e. Note that Egorovs theorem gives a converse to this statement when X has nite measure. (b) Using Problems 8 and 9 give another proof of Egorovs theorem. 11. In this problem we prove Luzins theorem using Egorovs theorem. Let f be a realvalued Lebesgue measurable function on a measurable set X Rn of nite measure. Given any > 0, we shall prove that there exists a closed set C Rn such that C X , m(X \ C ) < , and f is continuous on C . Proceed as follows. (i) First prove the theorem for simple functions. Suggestion: Let f be a simple N N function and write f = k=1 Ak , the ak s are real k=1 ak Ak where X = numbers, and the Ak s are pairwise disjoint measurable sets. Given > 0, there is a closed set Ck Rn with m(Ak \ Ck ) < /N (why?). Let C = N k=1 Ck . (ii) We now prove Luzins theorem for nonnegative f . For nonnegative f we know that f = lim fk where each fk , k N, is a simple function. By (i), given > 0 there is a closed set Ck such that m(X \ Ck ) < /2k and fk is continuous on Ck . Let K1 = k=1 Ck . Show that m(X \ K1 ) < . Use Egorovs theorem to show that there exists a set K2 K1 with m(K1 \ K2 ) < and fk f uniformly on K2 . Conclude that f is continuous on K2 . (iii) Now nd a closed set C K2 such that m(K2 \ C ) < . Show that m(X \ C ) < 3 and the restriction of f to C is a continuous function. (iv) Finally, prove Luzins theorem dropping the assumption that f is nonnegative. 12. A sequence {fn } of real-valued measurable functions is convergent in measure8 if there is an extended-real valued measurable function f such that for each > 0,
n 8

lim {x ; |fn (x) f (x)| } = 0.

If (X, ) is a probability space, convergent in measure is called convergent in probability.

5.4. LEBESGUES DEFINITION OF THE INTEGRAL AND THE MCT

267

(Does this remind you of the weak law of large numbers?) (i) Prove that if {fn } converges in measure to a measurable function f , then f is a.e. real-valued, which means {x ; f (x) = } is measurable with measure zero. (ii) If {fn } converges to two functions f and g in measure, prove that f = g a.e. Suggestion: To see that f = g a.e., prove and then use the set-theoretic triangle inequality: For any real-valued measurable functions f, g, h, we have x ; |h(x) g (x)| . {x ; |f (x) g (x)| } x ; |f (x) h(x)| 2 2 13. Here are some relationships between convergence a.e., a.u., and in measure. (a) (a.u. = in measure) Prove that if fn f a.u., then fn f in measure. (b) (a.e. = in measure) From Egorovs theorem prove that if X has nite measure, then any sequence {fn } of real-valued measurable functions that converges a.e. to a real-valued measurable function f also converges to f in measure. (c) (In measure = a.u. nor a.e.) Let X = [0, 1] with Lebesgue measure. Given n N, write n = 2k + i where k = 0, 1, 2, . . . and 0 i < 2k , and let fn be the characteristic i i+1 function of the interval . Draw pictures of f1 , f2 , f3 , . . . , f7 . Show that , 2k 2k fn 0 in measure, but lim fn (x) does not exist for any x [0, 1]. Conclude that {fn } does not converge to f a.u. nor a.e. 14. A sequence {fn } of real-valued, measurable functions is said to be Cauchy in measure if for any > 0, {x ; |fn (x) fm (x)| } 0, as n, m .
n

Prove that if fn f in measure, then {fn } is Cauchy in measure. 15. In this problem we prove that if a sequence {fn } of real-valued measurable functions is Cauchy in measure, then there is a subsequence {fnk } and a real-valued measurable function f such that fnk f a.u. Proceed as follows. (a) Show that there is an increasing sequence n1 < n2 < such that 1 1 {x ; |fn (x) fm (x)| k } < k , for all n, m nk . 2 2
k (b) Let Am = . Show that {fnk } is a Cauchy k=m x ; |fnk (x) fnk+1 (x)| 1/2 c sequence of bounded functions on Am . Deduce that there is a real-valued measurc c able function f on A := m=1 Am such that fnk f uniformly on each Am . c (c) Dene f to be zero on A . Show that fn f a.u. 16. Assuming Problem 15, Part (a) of 13 and the set-theoretic triangle inequality from Problem 12, prove the following theorem. (Completeness for convergence in measure) If {fn } is a sequence of real-valued measurable functions that is Cauchy in measure, then there exists a real-valued measurable function f such that fn f in measure.

5.4. Lebesgues denition of the integral and the MCT In this section we (nally) dene the integral of a nonnegative measurable function! We also establish the monotone convergence theorem, one of the most useful theorems in all of integration theory because the MCT basically says without fear of contradictions, or of failing examinations [99, p. 229] we can always interchange limits and integrals for nondecreasing sequences of nonnegative functions. 5.4.1. Lebesgues original denition. Its helpful to once again review Lebesgues original denition of the integral in Sur une g en eralisation de lint egrale d enie [231]. Given a bounded a function f : [, ] [0, ), we approximate the area under f from below and above as seen here by partitioning the range of the

268

5. BASICS OF INTEGRATION THEORY

function, which Lebesgue supposes has lower bound m and upper bound M :
m3 m2 m1 m0 E1 E2 f (x) m3 m2 m1 m0 E1 E2 f (x) m3 m2 m1 m0 E1 E2 f (x)

E3

E3

E3

Figure 5.9. Approximating the area under f from below and above. Let P be a partition of [m, M ]:

let E0 = {f = m0 } and Ei = f 1 (mi1 , mi ], i = 1, 2, . . . , p, and let LP = m0 m(E0 ) +


i

m = m0 < m1 < m2 < < mp1 < M = mp , mi1 m(Ei ) and UP = m0 m(E0 ) +

mi m(Ei ),
i

which we shall call, respectively, the lower and upper sums of f dened by the partition P . In Figure 5.9 we put p = 3 and the shaded rectangles approximate the area under f from below and above. Since f is assumed measurable, each of the sets E0 , . . . , Ep is measurable, so their measures m(Ei ) have all the properties length should have. The lower and upper sums of f are analogous to the lower and upper Darboux sums studied in Riemann integration (the dierence being that in Riemann integration the domain of f is partitioned rather than the range). Lebesgue says that f is integrable if there exists a real number I such that (5.6)
P 0

lim LP = I = lim UP ,
P 0

and the limit I is by denition the integral of f , which we shall denote by f := I = lim LP = lim UP .
P 0 P 0

Here, P is the maximum of the lengths mi mi1 where i = 1, 2, . . . , p and we say that lim P 0 LP = I if given any > 0, there is a > 0 such that for any partition P with P < , we have There is a similar denition of what lim P 0 UP = I means. Now it turns out, see Problem 4, that the limits in (5.6) always exist and are equal! Thus, an arbitrary nonnegative bounded measurable function is integrable. Since one can never explicitly construct a nonmeasurable function, basically all nonnegative bounded functions are Lebesgue integrable! Here we see the major dierence between the Riemann and Lebesgue theories: Its easy to construct functions (e.g. the Dirichlet function) that are not Riemann integrable. In Problem 4 you will prove that if we dene (5.7) then (5.6) holds with this I . Furthermore, it turns out that in (5.7) we can take P to be a partition of [0, ]; this frees us from the bounds of the particular function f and allows us to easily generalize Lebesgues denition of the integral to extended real-valued functions. I := sup LP ; P is a partition of [m, M ] , I LP <

5.4. LEBESGUES DEFINITION OF THE INTEGRAL AND THE MCT

269

5.4.2. The denition of the integral. Let (X, S , ) be a measure space and let f : X [0, ] be a measurable function. Given a partition P of [0, ], let Ei = {mi1 < f mi } = f 1 (mi1 , mi ], i = 1, 2, . . . , p, and let
p

0 = m0 < m1 < m2 < < mp1 < = mp , mi1 (Ei )


i=1

LP =

which we shall call the lower sum of f with respect to the partition P . By denition of measurability, each Ei is measurable so LP is dened. Because of (5.7) we dene the integral of f by (5.8) f := sup LP ; P is a partition of [0, ] .

= m1 (E2 ) + m2 (E3 ) + + mp1 (Ep ),

A couple of remarks: (1) This denition is only for nonnegative measurable functions.
(2) We allow f to be innite (if the set to the right of sup has no nite upper bound).

The integral of a general measurable function is taken up in Subsection 5.5.2. Figure 5.10 shows a picture of a function f , a partition of [0, ], and a lower sum LP .
m3 m2 m1 m0 E1 E2 f (x) m3 m2 m1 m0 E1 E2 f (x)

E3

E3

Figure 5.10. The integral

f is the supremum of all the approximating areas LP over all partitions of [0, ].

In the case that (X, S , ) is a probability space, the integral f is called the expectation, expected value, or mean value, of the random variable (= measurable function) f , and is usually denoted by E (f ): E (f ) := f,

and just as we learned back in Section 2.2, we interpret E (f ) as the expected average value of f over a large number of experiments. If A X is any measurable set, then the notation f :=
A A

f means

A f,

where A is the characteristic function of A; heres a picture: Using the denition (5.8), we have (5.9)
A

f = sup LP ; LP is a lower sum of A f .

270

5. BASICS OF INTEGRATION THEORY

f (x) A f = A f 0 on A o of A

Figure 5.11.

f is the area under f and above A.

Before proving theorems involving the integral, we rst review integrals of simple functions, which we introduced back in Section 2.1. Recall that the integral of a nonnegative simple function
N

f=
n=1

a n An ,

an [0, ), An S ,
N

where the An s can be taken pairwise disjoint, is given by (5.10) f=


n=1

an (An ).

At this point we may have a problem: The notation f denotes the number dened by the right-hand side of (5.10) and it also denotes the number dened by right-hand side of (5.8)! However, we leave it as an exercise for you to check these numbers are the same (see Problem 2).
Example 5.12. According to (5.10), if f = Q , the characteristic function of the rational numbers, a.k.a. the Dirichlet function, then the Lebesgue integral of f is dened and f = m(Q) = 0. As is well-known, the Riemann integral of f does not exist.

5.4.3. The MCT. The monotone convergence theorem was proved in 1906 by Beppo Levi9 (18751961) in the paper Sopra lintegrazione delle serie [243] and is one of the most useful theorems in all of integration theory. We shall have many opportunities to use it. It says that for any nondecreasing sequence of nonnegative functions we always have of the limit = limit of the Beppo Levi (18751961). .

More precisely, if we have a nondecreasing sequence f1 f2 fn converging to a limit function f , then f = lim fn ; geometrically this simply says that the area under the limit function f is the limit of the areas under each fn . Figure 5.12 shows a picture of this geometrically obvious fact. To prove this geometrically obvious fact, consider rst some other obvious facts. Lemma 5.16. If 0 f g are measurable functions, then f g (monotonicity).

9 Beppo Levi was born on May 14, 1875, a month and half earlier than Henri Lebesgue who was born on June 28, 1875.

5.4. LEBESGUES DEFINITION OF THE INTEGRAL AND THE MCT

271

f fn f3 f2 f1

Figure 5.12. The MCT says that the areas under the fn s approach
the area under f .

We shall leave the proof of this lemma to the interested reader. (All you have to do is check that given any partition P of [0, ], the lower function of f with respect to P is the lower function of g with respect to the same partition P .) Monotone convergence theorem Theorem 5.17. If {fn } is a nondecreasing sequence of nonnegative measurable functions, then lim fn = lim fn .

Proof : We rst make a few remarks. First of all, by Theorem 5.12, we know that f := lim fn is measurable. Second, f1 f2 implies, by Lemma 5.16, that f1 f2 is a monotone sequence, so lim fn exists as an extended real number. Finally, we remark that the equality lim fn = lim fn is meant in the sense of extended real numbers, so = is a possibility. Now to our proof, we have to prove the inequalities:
n

lim

fn

and

lim

fn

f.

Step 1: The rst inequality is easy: Since fn f for all n, by the previous lemma we have fn f for all n; taking n gives the rst inequality. To complete the proof of the MCT we need to prove that f = sup LP ; P is a partition of [0, ] lim
n

fn .

Fix a partition P of [0, ], 0 = m0 < m1 < m2 < < mp1 < = mp , and consider a lower sum
p

LP = we need to show that (5.11)

i=1

mi1 (Ei ) ,

Ei = {mi1 < f mi };

LP lim x Ei = mi1 < f (x) mi

fn .

Step 2: To nish the proof, observe that = mi1 < lim fn (x) mi = mi1 < fn (x) mi = x
n

for n large

Ein ,

n=1

where Ein = {mi1 < fn mi }. Thus, Ei n=1 Ei Ein . Since f1 f2 f3 , it follows that Ei Ei1 Ei Ei2 Ei Ei3 (do you see why?),

272

5. BASICS OF INTEGRATION THEORY

so by monotonicity and continuity from below, (Ei ) Thus,


p p n=1

Ei Ein

= lim (Ei Ein ).


n

LP =

i=1

mi1 (Ei )

i=1

mi1 lim (Ei Ein )


n p

= lim

i=1

mi1 (Ei Ein ) ;

where we pulled the limit out of the sum because the sum is a nite. Now the summation in the brackets is p i=1 mi1 (Ein ), which is a lower sum for fn . Hence the term in the brackets is fn . This implies (5.11) and were done.

One of the main uses of the MCT is to check whether or not the integral of the limit function is nite. Indeed, the MCT implies the following statement: If {fn } is a non-decreasing sequence of nonnegative measurable functions with limit function f such that limn fn is nite, then the limit function f also has a nite integral. (This follows from the equality f = limn fn .) This statement is essentially Levis original statement of his theorem [243, p. 776]; see Problem 7 for Levis beautiful original proof. The article [342] contains a history of Levis work, of which the MCT was only one of many important results he proved. In Levis early papers on Lebesgue integration, he questioned some of Lebesgues work and in response, Lebesgue wrote to Emile Borel [342, p. 60]:
My dear Borel, .... My theorems, invoked by Fatou, are now criticized by Beppo Levi in the Rendiconti dei Lincei. Beppo Levi has not been able to ll in a few simple intermediate arguments and got stuck at a serious mistake of formulation which Montel once pointed out to me and which is easy to x. Of course, I began by writing a note where I treated him like rotten sh. But then, after a letter from Segre,10 and because putting down those interested in my work is not the way to build a worldwide reputation, I was less harsh....

Levi was a courageous man to point out errors in the work of Lebesgue, an up and rising superstar at that time. We also remark that the monotone convergence theorem is not true for Riemann integrable functions on R. The standard example is Baires example explained at the very beginning of this chapter or the following very similar example.
Example 5.13. Consider [0, 1] with Lebesgue measure, let {rn } be any enumeration of the rational numbers in [0, 1], and let fn be characteristic function of {r1 , r2 , . . . , rn }: fn (x) = {r1 ,...,rn } = 1 0 if x = r1 , . . . , rn , otherwise.

Then fn is a nondecreasing sequence as seen here for a few ns:


f1 r3 r1 r2 r3 r1 r2 f2 r3 r1 r2 f3 f

10Corrado Segre (18631924), one of Beppo Levis teachers.

5.4. LEBESGUES DEFINITION OF THE INTEGRAL AND THE MCT

273

with limit the Dirichlet function restricted to [0, 1]: f (x) = [0,1]Q (x) = 1 0 if x [0, 1] Q, if x [0, 1] Q.

Since the sets {r1 , . . . , rn }, n = 1, 2, . . ., and [0, 1] Q are countable they are Lebesgue measurable (with zero measure), f and each fn are measurable functions, so the monotone convergence theorem can be applied and says that (5.12) f = lim
n

fn .

Of course, f = m([0, 1] Q) = 0 and fn = m{r1 , . . . , rn } = 0 since the Lebesgue measure of any countable set is zero. Thus, the equality (5.12) is just the statement 1 that 0 = 0. Note that f is not Riemann integrable, so the left-hand integral 0 f is not dened in the Riemann world.

5.4.4. Fatous lemma. The MCT says that given an arbitrary nondecreasing sequence fn of nonnegative measurable functions with limit function f , we can always interchange limits and integrals: (5.13) lim fn = lim fn .

The obvious question is if one can drop the assumption nondecreasing: If {fn } is a sequence of nonnegative measurable functions, does (5.13) hold (provided that the limits on both sides are dened)? The answer is in general Pierre Fatou no; in order to have an equality we have to impose certain conditions (18781929). such as in the Dominated Convergence Theorem to be proved in Section 5.6. However, Fatous lemma, named after Pierre Fatou (18781929) who proved the result in his doctoral thesis in 1906 [129, p. 376], says that we always have the inequality in (5.13): (5.14) lim fn lim fn ,

provided the limits exist. Heres an example showing the strict inequality <.
Example 5.14. Let X = R with Lebesgue measure and for each n N, let fn (x) = [n,n+1] (x) = 1 0 if n x n + 1 otherwise;

see Figure 5.13 for a picture. Then given x R, for all n > x we have fn (x) = 0.
f1 f2 f3

Figure 5.13. f1 , f2 , f3 , . . . represent a pulse moving to the right.


Hence, for each x R, lim fn (x) = 0. Thus, lim fn = 0, so
n

lim fn = On the other hand, fn = [n,n+1] = m [n, n + 1] = 1

0 = 0.

lim

fn = lim 1 = 1.

274

5. BASICS OF INTEGRATION THEORY

Thus, for this example, we have the strict inequality 0 < 1 in (5.14). (Other examples of sequences {fn } that give strict inequalities include fn = n(0,1/n) and fn = (1/n)[0,n] , as you can readily check.)

Heres Fatous lemma, which in particular implies11 (5.14). Fatous lemma Theorem 5.18. If {fn } is a sequence of nonnegative measurable functions, then lim inf fn lim inf fn .
Proof : First, by Theorem 5.12, f := lim inf fn is measurable. Now by denition of lim inf and Theorem 5.12, we know that f = lim gn where gn := inf {fn , fn+1 , . . .} is measurable. Moreover, {gn } is a nondecreasing sequence of measurable functions so by the monotone convergence theorem, f = lim gn = f = lim gn = f = lim inf gn ,

since lim = lim inf whenever the limit exists. By denition of inmum, gn fn , so by monotonicity, gn fn . Therefore, as lim infs preserve inequalities, f = lim inf gn lim inf fn .

In Problem 8 you will prove that the monotone convergence theorem is actually equivalent to Fatous lemma and in Problem 7 in Exercises 5.6 you will nd Fatous original proof. Finally, we note that immediately after the proof of the lemma that now bears his name, Fatou wrote [129, p. 376] I owe this remark to Mr. Lebesgue. We end this section with some remarks on notation. First, in some cases it may not be clear what measure we are integrating with respect to, and in such cases to emphasize the measure we use the notation f d for f.

Second, for both the Lebesgue measure space (Rn , M n , m) and the Borel measure space (Rn , B n , m), or restrictions thereof to Lebesgue (or Borel) measurable subsets of Rn , it is customary to use any one of the notations familiar from calculus: f dm or f dx or f (x) dx or f (x) dx1 dxn , and so forth,

where dx and dx1 dxn represent dm, and where we could replace x by any other letter denoting the coordinate variables on Rn . For the integral of a function f on an interval [a, b], we use
b b b

f or
a a

f dx or
a

f (x) dx , . . .
a b

for
[a,b]

f,
b

and we adopt the standard convention that b = a . The notation a f (x) dx is commonly used for Riemann integration, however, in Section 6.2 we will show
11 Limit inmums always exist as extended real numbers so we dont have to make any assumptions concerning existence of limits in Fatous Lemma. In case the limits exist (as extended real numbers) we know that lim inf = lim (from Part (3) of Lemma 5.11) and we get (5.14).

5.4. LEBESGUES DEFINITION OF THE INTEGRAL AND THE MCT

275

that any Riemann integrable function is Lebesgue integrable and the integral of the function dened by Riemann and Lebesgue give the same value. So, the notation b a f (x) dx is consistent for Riemann integrable functions.
Exercises 5.4. 1. (a) Let f : X [0, ) be measurable. Prove that (5.15)
1 n

f = lim
n

k=1

k (Ank ), 2n

where Ank = f (k/2 , (k + 1)/2 ]. 1 (b) Using the formula (5.15), nd 0 f where (i) f (x) = x, (ii) f (x) = 1/x and (iii) f (x) = 1/ x (in (ii) and (iii) put f (0) := 0). 2. Let f = N n=1 an An be a simple function where an 0 and An S for each n, where we assume that the An s are pairwise disjoint. With f dened by (5.8), prove that f= N n=1 an (An ). 3. (A standard denition) Given a measurable function f : X [0, ], prove that f = sup s ; s is a simple function with 0 s f .

In many modern-day textbooks, the integral of f is dened using this formula. 4. (Lebesgues original 1901 denition) Let f : [, ] [0, ) be a nonnegative bounded measurable function, with range lying in an interval [m, M ]. Dene I := sup {LP ; P is a partition of [m, M ]} . In this problem we prove that Lebesgue, f is integrable. Proceed as follows. (i) Prove that for any partition P ,
P 0

lim LP = I =

P 0

lim UP ; that is, in the words of

(ii) Prove that if P Q (so every partition point of P is some partition point of Q), then 0 LQ LP P ( ). (iii) Given > 0 by denition of supremum we can choose a partition P0 such that . Using (ii), prove that for any partition P we have I LP0 < 2 I LP < + P ( ). 2 Use this fact to prove that lim P 0 LP = I . (iv) Using (i), prove that lim P 0 UP = I . 5. (Youngs denition) William Henry Young (18631942) discovered an alternative formulation of Lebesgues theory of integration, which he published in the paper On the general theory of integration [419] in 1905. Fix a bounded measurable function f : X [0, ) on a measure space X . Given a collection A of countably many pairwise disjoint measurable subsets A1 , A2 , . . . of X , we call a sum of the form (5.16) LA = mn (An ) ,
n

0 UP LP P ( ).

mn := inf {f (x) ; x An }

a Y-lower sum of f . (It might we wise to try and understand what such a sum represents geometrically.) If each number mn is replaced by Mn := sup{f (x) ; x An }, we call the resulting sum UA an Y-upper sum of f . We dene the lower and upper Young integrals of f by

f := sup L ; L is a Y-lower sum of f

f := inf U ; U is an Y-upper sum of f ,

276

5. BASICS OF INTEGRATION THEORY

respectively. We say that f is Young integrable if f =

f , in which case we denote

(ii) Prove that f is Young integrable and f (= Lebesgues integral of f ) = (Y ) f . 6. (Youngs def. cont) Let f : X [0, ] be a nonnegative measurable function on X (not necessarily bounded). A Young lower sum of f a sum of the form (5.16) in the previous problem. With f denoting the Lebesgue integral of f , prove that f = sup L ; L is a Young lower sum of f . This equality is sometimes used to dene the integral f ; this is found for example in Stanislaw Saks (18971942) famous book [338, p. 19]. 7. ((Basically) Levis original argument [243].) Levis original proof uses three lemmas. Throughout we work on a nite measure space. The only results from this section you are allowed to quote are the denition of the integral and Lemma 5.16. (i) Lemma 1: Let g be a nonnegative measurable function. For each k, dene gk = min{g, k}. Prove that g = lim
k

the common value by (Y ) f , called the Young integral of f . In the case X = [, ], Youngs integral should remind you of Darboux integration, but instead of partitioning the domain [, ] into intervals we partition it into measurable sets. In this problem we prove that the Young integral of f exists and equals the Lebesgue integral of f . (i) Prove that if A = {A1 , A2 , . . .} and B = {B1 , B2 , . . .} are collections of pairwise disjoint measurable subsets of X and A B := {Ai Bj } is the collection obtained by intersecting all the sets in A and B , then LA LA B UA B LB . Conclude that f f .

gk .

Suggestion: One could prove this in two parts, when g = and g < . In the case that g = , prove that limk gk = too by way of contraposition (thus, assuming that gk is bounded above for all k by some nite constant, say M , using the denition of the integral, show that g M as well). (ii) Lemma 2: Let {fn } be a nondecreasing sequence of nonnegative functions, all of which are bounded by the same nite constant. Let f = lim fn and prove that f = lim fn . Suggestion: Let > 0 and let An = {x | f (x) fn (x) + 1 } where 1 = (X )/((X ) + M ). Show that limn (An ) = (X ) and hence c limn (Ac n ) = 0. Take N so that n N implies (An ) < /((X ) + M ), where fn M for all n (so f M too). Using the denition of the integral, prove that f An f + f. Ac n

Show that for n N , the rst integral on the right is fn + 1 (X ) and the fn f + fn , second integral on the right is M (Ac n ). Conclude that and hence that f = lim fn . (iii) Lemma 3: Let ank be a double sequence of nonnegative extended real-valued numbers such that ank is nondecreasing in n (for xed k) and nondecreasing in k (for xed n). OPTIONAL: Prove that
n k

lim lim ank = lim lim ank .


k n

The proof is similar to the proof of Lemma 3.3 in Section 3.2, which is why Lemma 3 is optional; skip the proof of Lemma 3 if you see its relation to Lemma 3.3. (iv) We now complete Levis proof. Let {fn } be a nondecreasing sequence of nonnegative functions and let f = lim fn . For any k N, let fnk = min{fn , k} and use Part (iii) on the sequence ank = fnk . 8. Here are some problems dealing with Fatous lemma.

5.5. INTEGRAL PROPERTIES AND THE PRINCIPLE OF APPROPRIATE FUNCTIONS 277

(a) Find a sequence of nonnegative functions on [0, 1] (with Lebesgue measure) such that Fatous lemma gives a strict inequality. (b) Prove that Fatous lemma implies the monotone convergence theorem. (c) Let {fn } be a sequence of nonnegative measurable functions on a measure space converging to a function f with fn f for each n. Prove that f = lim fn . (In particular, you need to show lim fn exists.) Show by counterexample that if we replace fn f by fn f for each n the conclusion is false.

5.5. Integral properties and the principle of appropriate functions Weve dened the integral for nonnegative measurable functions and in this section we dene it for extended real-valued functions. We also introduce the function version of the principle of appropriate sets. We begin by discussing some properties of the integral for nonnegative functions. 5.5.1. Properties of the integral. We start with Chebyshevs inequality, a useful inequality weve already seen back in Lemma 2.11 for simple functions. Chebyshevs inequality, Version II Theorem 5.19. For any nonnegative measurable function f and for any a > 0, we have 1 { f > a} f. a (The same inequality holds for {f a}.)
Proof : Let A = {f > a}. Since aA < f A and f A f we have aA f . Hence by monotonicity of the integral (Lemma 5.16), a(A) = a This concludes our proof.
f Thinking of f as the area of a region with height prole f , the following proposition says that the integral has all the properties that we believe area should have: f (1) Area is additive. (2) If the base of the region has length zero, the region has area zero. Integral = Area. (3) If two regions have the same height prole, they have the same area. (4) If the area of a region is zero, the region must have zero height. (5) If a region has positive height and area zero, then its base has length zero. (6) If a region has nite area, it has nite height prole.

f.

Some properties of area under curves Proposition 5.20. Let A be a measurable set, f and g be nonnegative measurable functions, and let a and b be nonnegative real numbers. (1) (af + b g ) = a (2) If (A) = 0, then
A

f +b

g,

f = 0.

278

5. BASICS OF INTEGRATION THEORY

(3) (4) (5) (6)

If f = g a.e., then f = g . If f = 0, then f = 0 a.e. If f > 0 on A and A f = 0, then (A) = 0. If f < , then f is a.e. real-valued, that is f (x) R for a.e. x. (That is, the set {f = } has measure zero.)

Proof : To prove (1), let fn and gn be nondecreasing sequences of nonnegative simple functions approaching f and g , respectively; e.g. such simple functions are provided by Theorem 5.13. Applying the Monotone Convergence Theorem to the three nondecreasing sequences: {fn }, {gn }, and {afn + bgn }, converging to f , g , and af + bg , respectively, we obtain a f +b g = lim a = lim a = lim = fn + lim b fn + b gn gn (MCT) (property of limits)

(afn + b gn )

(integral is linear on simple functions) (MCT).

(af + b g )

(That the integral is linear on simple functions is Theorem 2.2.) We shall leave (2) and (3) for your enjoyment. Suppose that f = 0 and let A = {x ; f (x) > 0}. Then (4) is the statement that (A) = 0. To see this, for each n = 1, 2, . . ., let An = {x ; f (x) > 1/n}. Then A = n=1 An . By Chebyshevs inequality, (An ) n f = 0.

Thus, (An ) = 0 for each n and so (A) = 0 as well, which proves (4). Assume now that f > 0 on a measurable set A and A f = 0; we shall prove that (A) = 0. Indeed, A f = A f = 0 so by (4) we have A f = 0 a.e. Since f > 0 on A it follows that A = 0 a.e. which implies that (A) = 0. Finally, to prove (6), suppose that f < and let A = {x ; f (x) = }. Then for any n N, we have {f = } {f > n}, so by monotonicity of measures and Chebyshevs inequality, we have 1 {f = } {f > n} f. n Taking n we get our result.

The following theorem says that we can always interchange integrals and innite series of nonnegative measurable functions. The series MCT Theorem 5.21. If {fn } is a sequence of nonnegative measurable functions, then

fn =
n=1 n=1

fn .

Moreover, if the sum n=1 fn is nite, then the series n=1 fn is nite a.e.; that is, the series n=1 fn converges to a real number a.e.

5.5. INTEGRAL PROPERTIES AND THE PRINCIPLE OF APPROPRIATE FUNCTIONS 279

Proof : Let f :=

n=1

fn . Then by denition, f = lim gk where gk =


k

k n=1

fn .

Since the fn s are nonnegative, we have g1 g2 g3 , so by the Monotone Convergence Theorem, we have
k k

f = lim

gk = lim

fn
n=1

= lim =:

fn
n=1

fn ,

n=1

where in the third equality we used linearity of the integral from the previous proposition. Now assume that fn < , which is equivalent to saying that n=1 f< where f =

fn .

n=1

Then by Property (5) of the previous proposition, we see that f is nite a.e., which means that n=1 fn converges to a real number a.e. See Problem 5 for applications of the series MCT to proving the rst Borel Cantelli lemma and to the SLLN. Example 5.15. Let X = Y , where Y = {0, 1}, the sample space for (say) the Monkey Shakespeare experiment such that the Monkey can type sonnot 18 with probability p > 0 on any given page. Let f : X [0, ] be the random variable given by the number of times the Monkey types sonnet 18. What is the expected value of f ; in plain English, how many times would you expect the monkey to type sonnet 18? To answer this question, observe that f (x ) = That is, f=
n=1 n=1

xn

An ,

where An = Y Y Y {1} Y Y where {1} is in the nth slot. It follows that E (f ) =

An =

(An ) =

n=1

n=1

n=1

p = .

Thus, we would expect the monkey to type sonnet 18 an innite number of times; however, as the analysis in Section 2.3.5 shows, we wouldnt expect the monkey to type sonnet 18 even once in any reasonable nite amount of time.

The following theorem shows that the integral is countably additively on countable disjoint unions, just like measures are. Countable additivity of the integral Theorem 5.22. If f is nonnegative and measurable and A = where the sets An are disjoint measurable sets, then
n=1

An

f=
A n=1 An

f.

280

5. BASICS OF INTEGRATION THEORY

Proof : We just applying the series monotone convergence theorem to the series A f = where we used that A =
n=1 n=1

An f,

An , as is readily checked.

5.5.2. General denition of the integral. We now dene the integral for functions that can take negative as well as positive values. Recall that the nonnegative and nonpositive parts f+ and f of a measurable function f : X R are dened by f+ (x) = max{f (x), 0}, f (x) = min{f (x), 0}; here is an example of a function f with its corresponding f+ and f :
R f X R f+ X R f X

Thus, f+ represents the part of f above the X -axis and f represents the part of f below the X -axis, and also observe that f = f+ f , and |f | = f+ + f . We dene the integral of f geometrically as the area of f above the X -axis (that is, f+ ) minus the area of f below the X -axis (that is, f ): (5.17) f := f+ f ,

provided that the right-hand side is not of the form . Heres an illustration of this denition:
R + f := f+ f X f+ R f+ X R f f X

f = area of f above X axis area of f below X axis

Figure 5.14. The integral f represents the net signed area between the graph of f and the X -axis. We say that f is integrable if f+ < and f < , f+ < and

in which case f R. The identity |f | = f+ + f implies that f < if and only if |f | < . Thus, f is integrable |f | < .

5.5. INTEGRAL PROPERTIES AND THE PRINCIPLE OF APPROPRIATE FUNCTIONS 281

We sometimes say that f is -integrable to emphasize the measure ; e.g. if X = Rn and is Lebesgue measure, wed say that f is Lebesgue integrable. More generally, given any measurable set A X , we dene f :=
A

A f,

where again, we assume that the right-hand side makes sense. The function f is said to be integrable on A if
A

f+ < and

f <

that is, if
A

|f | < .

Since the integral on general extended real-valued functions is dened as the dierence of integral of nonnegative functions, the properties of the integral in Proposition 5.20 for nonnegative functions translate directly to properties of the integral for extended real-valued functions. For example, (2) of Proposition 5.20 implies that if f : X R, g : X R are measurable and are either both nonnegative or both integrable, then12 f =g a.e. = f= g.

We can paraphrase this as saying Integrals only see a.e.; they are blind to sets of measure zero. Property (6) of Proposition 5.20 implies that if f : X R is integrable, then f is a.e. real-valued, that is, f (x) R a.e. 5.5.3. Linearity of the integral. Given integrable functions f, g : X R, we want to prove that (5.18) (f + g ) = f+ g.

It turns out that the left-hand side may not be dened as it stands. Indeed, since f and g are integrable, we know that the functions f and g are nite a.e.; in particular, on a set of measure zero they may take the values . Thus, the sum f (x) + g (x) is generally only dened a.e. (when its not of the form or + ). There is a general convention for dealing with situations like this, which well now explain. Let f be a measurable function dened a.e., that is, f is dened on a measurable set A with (Ac ) = 0 such that f 1 (a, ] is a measurable subset of X for each a R. Dene f on A f := 0 on Ac . Since A is measurable, one can check that f : X R is measurable. We say that f is integrable if f is integrable, in which case we dene (5.19) f := f.

12Actually, if f is only assumed integrable and f = g a.e. then g must also be integrable, and

also

f =

g.

282

5. BASICS OF INTEGRATION THEORY

Of course, since integrals only see a.e. we could have dened f to equal any measurable function on Ac without changing the value of f . We only chose 0 on Ac to make it simple. We shall apply the convention (5.19) many times in the sequel rarely mentioning it. In particular, this convention is how we understand the left-hand side of (5.18). Now to prove linearity we begin with the following. Lemma 5.23. Let f : X R be measurable and suppose that f = g h a.e. where g and h are nonnegative integrable functions. Then f is integrable, and f = g h.
Proof : Note that since g and h are each integrable, they are a.e. real-valued, so the dierence g h is dened a.e. Also note that |f | g + h a.e., which implies (by monotonicity of the integral on nonnegative functions) that the integral of |f | is nite, so f is integrable. Since g, h, f+ , and f are each integrable, they are a.e. real-valued and so in particular, rearranging the equalities it follows that f+ + h = g + f a.e. By linearity of the integral for nonnegative functions, we have f+ + h= g+ f . h. f = f+ f and f = g h (a.e.)

Each integral is nite so rearranging we get f+ Thus, f= g h as desired. f = g

Linearity of the integral Theorem 5.24. Given any integrable functions f and g and real numbers a and b, the integral is linear: (af + b g ) = a
Proof : To prove linearity it suces to show that af = a f for any a R and (f + g ) = f+ g.

f +b

g.

Consider the rst equality and the case a < 0; the case a 0 is easy. Write a = where > 0, so that By Lemma 5.23 and linearity of the integral on nonnegative functions we see that af = f f+ = = To prove that (f + g ) = f+ g , write f f+ f+ f =a f. af = (f+ f ) = f f+ .

f + g = (f+ f ) + (g+ g ) = (f+ + g+ ) (f + g ).

5.5. INTEGRAL PROPERTIES AND THE PRINCIPLE OF APPROPRIATE FUNCTIONS 283

Applying Lemma 5.23 and using linearity of the integral on nonnegative functions, and using the denition of the integral, we obtain (f + g ) = = (f+ + g+ ) f+ + g+ (f + g ) f g = f+ g.

Integral inequalities Theorem 5.25. Given any integrable functions f and g , (1) If f g a.e., then (2) f |f |. f g . (Monotonicity)

Proof : If f g a.e. then using that g = f +(g f ), which is dened a.e., by linearity we have g= f+ (g f ).

Since g f 0 a.e. it follows that (g f ) 0, which proves monotonicity. To prove (2), recall that |f | = f+ + f and observe that f = f+ f f+ + f = |f |.

5.5.4. The principle of appropriate functions. Recall that the principle of appropriate sets says that if a collection of sets contains the appropriate sets, then it must contain all sets in the -algebra generated by the appropriate sets. The principle of appropriate functions says: If an integration property holds for the appropriate functions, then the property holds for all integrable functions. To illustrate this new principle, consider an ane transformation (a linear transformation followed by a translation) F : Rn Rn where F (x) = T x + b for some invertible n n matrix T and some b Rn . We shall prove the following sister theorem to Theorem 4.15: Ane transformations and Lebesgue integration Theorem 5.26. For any measurable function f : Rn R, the composite function f F is measurable, and (5.20) (f F ) | det T | = f, provided f is nonnegative or integrable. (Using invariance of Lebesgue measure under ane transformations, one can check that f F is measurable for any measurable function f on Rn .) To prove the equality (5.20), we use the following principle that works not only to prove (5.20) but also for just about any integration property: Let C denote the collection of

284

5. BASICS OF INTEGRATION THEORY

integrable functions having a certain property.13 principle of appropriate functions: Suppose (1) C contains characteristic functions of measurable sets (2) C a linear space (3) C is closed under limits of nondecreasing sequences of nonnegative functions Then C contains all integrable functions. There is a similar principle if we are just interested in nonnegative measurable functions: Just replace (2) by is closed under linear combinations by nonnegative constants; the conclusion is that C contains all nonnegative measurable functions. This principle says that if you can prove an integration property for the appropriate functions namely, characteristic functions of measurable sets then under some additional conditions, the property holds for all integrable functions. Lets understand why. If (1) is fullled, then by (2), C contains all simple functions. By (3), the set C contains all nonnegative integrable functions by the MCT (since by Theorem 5.13 any nonnegative measurable function is a limit of a nondecreasing sequence of nonnegative simple functions). Finally, since any integrable function can be written as a dierence of two integrable nonnegative functions, by (2) it follows that C contains any integrable function. Lets apply this principle to outline the proof of (5.20) for integrable functions. (1) First, we need to check that (5.20) holds for characteristic functions of measurable sets. Well assume youve done this see Problem 9. (2) Both sides of the equality (5.20) are clearly linear in f . (3) If 0 f1 f2 where each fn satises (5.20), then f F = lim fn F is a limit of a nondecreasing sequence of nonnegative measurable functions, so by the MCT, f F | det T | = lim Since fn F | det T | = fn F | det T | and f = lim fn .

fn by assumption, we conclude that f F | det T | = f.

This proves Theorem 5.26! This principle will be used quite often in the sequel. Heres another application of this principle. Let g : R [0, ] be measurable and for any Lebesgue measurable set A, dene mg (A) :=
A

g (x) dx.

The function g is called a density function and if g = 1 we call g a probability density function, or pdf, which well return to in Section 6.4. If A = n=1 An where the An s are disjoint Lebesgue measurable sets, then by countable additivity of the integral (Theorem 5.22),

mg (A) =
A

g=
n=1 An

g=
n=1

mg (An ).

13I thank Anton Schick for telling me about the catchy name the principle of appropriate functions.

5.5. INTEGRAL PROPERTIES AND THE PRINCIPLE OF APPROPRIATE FUNCTIONS 285

Thus, mg is a measure, which we call the indenite integral of g and will be studied thoroughly in Section 9.3. The following theorem is a special case of Problem 14 (see also Problem 15), whose proof uses the principle of appropriate functions. Density functions and Lebesgue measure Theorem 5.27. A measurable function f : R R is mg -integrable if and only if f g is Lebesgue integrable, in which case, f dmg = f (x) g (x) dx.

We end this section with the following application of Theorem 5.26.


Example 5.16. From Theorem 5.26 we have R f (kx) dx = (1/k) R f (x) dx. Using this fact, we shall prove the following interesting result I found in [47]. If f : [1, ) [0, ] is measurable, then 1 f < implies k=1 f (kx) converges for almost all x [1, ). To prove this result, try to see why each of the following equalities and the last inequality holds:14
1

1 x

k=1

f (kx) dx =

k=1 1

1 f (kx) dx = x = = =

k=1 k=1 n=k n

1 f (x) dx x
n+1 n n+1 n

1 f (x) dx x 1 f (x) dx x

n=1 k=1 n+1

n=1 n=1

n n+1

1 f (x) dx x
1

f (x) dx =
n

f (x) dx < .

It follows that Exercises 5.5.

k=1

f (kx) < for a.e. x.

1. Let f be integrable. Prove the following properties of the integral. (a) If f 0 a.e. and A B are measurable sets, then A f B f . (b) If f a > 0 a.e. on a measurable set A, then (A) < . (c) If a f b a.e. on a measurable set A, then a (A) A f b (A). (d) If A B are measurable and (B \ A) = 0, then A f = B f . (e) If A f = 0 for all measurable A, then f = 0 a.e. 2. A measurable function g is essentially bounded if there is an M > 0 such that |g | M a.e. If f is integrable and g is essentially bounded, prove that f g is integrable. If X has nite measure, prove that any essentially bounded function is integrable. 3. In this problem we compute some integrals. (a) (Run lengths) Let Y , with Y = {0, 1}, be the sample space for an innite sequence of coin tosses where the probability of throwing a head on any given toss is p. For each n N, dene n : Y [0, ] by n (x) := the number of consecutive tosses of heads starting from the nth toss. Find E (n ), the expected run length of
b 14 of a function means R of [a,b] times the function. Since points have Lebesgue Here, a measure zero and integrals cant see sets of measure zero you could also use the characteristic function of (a, b] as well (or other sets diering from [a, b] by a set of Lebesgue measure zero).

286

5. BASICS OF INTEGRATION THEORY

heads. (By the way, if youre interested, you can show that the sequence n gives an example of Fatous strict inequality: lim inf n < lim inf n .) (b) (Random series) With the same measure space as in (a), let f : Y [0, ] be the randomized geometric series dened as follows: Given x Y , f (x) := 1 an 1 1 1 2 3 = 1 + , 2 2 2 2n n=1

where an = 1 if xn = 1 and an = 1 if xn = 0. Compute E (f ). (c) (Area under Cantors function) If : [0, 1] R is Cantors function from 1 Section 4.5, show that = 2 . Suggestion: Show that for all points not in the Cantor set, can be written as =
k=0 (a1 ,...,ak )

ak 1 a1 + + k + k+1 2 2 2

Ba1 ...ak ,

where [0, 1] \ C is the union over all sets Ba1 ...ak found in Problem 2 in Exercises 4.5. The a1 , . . . , ak s are 0s or 1s and they determine a unique natural number = ak + 2ak1 + + 2k1 a1 with 0 2k 1. (d) (An integrable everywhere unbounded function) Fix a list {r1 , r2 , . . .} of Q, n let g (x) = x1/2 (0,1) , and dene f : R [0, ] by f (x) = g (x rn ). n=1 2 (i) Using g = 1/2, compute f and conclude that f is nite a.e. (ii) Prove that f is unbounded in every open interval. (iii) Finally, prove that although f 2 is nite b a.e., for any a < b we have a f 2 = . 4. (A nowhere integrable real-valued function) In this problem we produce a meab surable function f : [0, 1] [0, ) such that a f = for all 0 a < b 1. (i) For each n N, let An be the set of points in [0, 1] which can be written in a base 2 (binary) expansion whose digits from the (n + 1)st through the 2nth positions are all zeros. Using the rst BorelCantelli lemma (see the next problem or Theorem 4.2), prove that m(A) = 0 where A = {An ; i.o.}. c (ii) Put fn = n22n An and dene f = n=1 A fn . Prove that f : [0, 1] [0, ) has the desired integral property. 5. In this problem we apply the series MCT to problems in probability. (a) (The rst BorelCantelli lemma) Let A1 , A2 , . . . be measurable and put A = {An ; i.o.}. Applying the series MCT to the series f (x) = n=1 An and observing that x A if and only if f (x) = , prove that
n=1

(An ) <

(A) = 0.

(b) (The strong law of large numbers) (i) Show that Borels SLLN, to which we refer you to Section 4.2.1 for the following notation, is equivalent to the statement + R n )4 that lim (R1 + = 0 a.e. where the Ri s are the Rademacher functions. (ii) To n4
(R1 ++Rn ) . prove this last statement, apply the series MCT to the series n=1 n4 (Equation (4.5) might help.) 6. Let be a measure and let A be a measurable set with 0 < (A) < and for each n N, let An A be measurable with n=0 (An ) < . For x A denote by N (x) the number of indices n such that x An . Prove that for some x A,
4

N (x )

n=0

(An ) . (A)

7. (The (Other) monotone convergence theorem) Let f1 f2 f3 0 be a nonincreasing sequence of nonnegative measurable functions.

5.5. INTEGRAL PROPERTIES AND THE PRINCIPLE OF APPROPRIATE FUNCTIONS 287

(i) Assuming that

f1 < , prove that lim fn = lim fn .

(ii) If f1 = , show by example that lim fn = lim fn need not hold. 8. (Lebesgues analytic denition) We shall call a partition of R a set P = {mi }, where i Z, with . . . < m2 < m1 < m0 < m1 < m2 < . . . and with P := sup{mi+1 mi ; i Z} < . Let f : X R be measurable with (X ) < . (a) If for some partition P of R, the series (5.21) P :=

mi (Ai ) ,

i=

where Ai = {x ; mi f < mi+1 },

is absolutely convergent, prove that f is integrable. Moreover, prove that (5.21) converges absolutely for any partition of R. Finally, given any sequence P1 , P2 , . . . of partitions with Pn 0, prove the formula f = lim Pn ; this formula for f is called Lebesgues analytic denition [232, Sec. 24]. Suggestion: If gP := i mi Ai , prove that |f | |gP | + P and |f gP | P . (b) Prove that if f is integrable, then the series (5.21) converges for any partition P . (c) The condition (X ) < is needed in this problem: Give counterexamples to (c) and the rst and second statements in (a), in the case X = R. 9. (Ane transformations and Lebesgue integration) Prove the equality (5.20) for characteristic functions of Lebesgue measurable sets (youll need Theorem 4.15). 10. (Counting measures) (a) Let # : P (N) [0, ] be the counting measure. Prove that any extended realvalued function f on N is measurable, and is integrable if and only if the series n=1 |f (n)| is convergent, in which case f d# =

f (n).

n=1

(b) Let : P (X ) [0, ] be the counting measure on a countable set X . (i) Prove that any extended real-valued function f on X is measurable. (ii) Let x1 , x2 , . . . be any ordering of X and prove that given any f : X [0, ], we have f d =

f (x k ).

k=1

11. (Stretching theorem) (This problem uses no integration theory, but is needed for the next problem.) Let f : [a, b] R be dierentiable at each point of a set A [a, b] such that for some constant M , |f | M on A. In this problem we prove that m (f (A)) M m (A); that is, f stretches A by at most a factor of M . (i) Let > 0 and for n N, let An = {x A ; |f (x + h) f (x)| (M + )|h| for all |h| < 1/n with x + h A}. Show that A1 A2 and A = lim An . (ii) Fix n N. Show there are pairwise disjoint left-half open intervals I1 , I2 , . . . covering An such that m(Ik ) < 1/n for each k and k m(Ik ) m(An ) + . Next, show that m (f (An Ik )) (M + )m(Ik ). Now complete the proof. 12. (Image-integral theorem) Let f : [a, b] R be dierentiable at each point of a measurable set A [a, b]. Using Problem 11, prove that the image f (A) satises m (f (A))
A

|f |.

Suggestion: For > 0, use A = n An where An = {x A ; (n 1) |f (x)| n}. 13. Let g : R R be continuously dierentiable and nondecreasing and let g : B 1 [0, ) be its corresponding LebesgueStieltjes set function. Given a Borel measurable

288

5. BASICS OF INTEGRATION THEORY

function f , prove that f is g -integrable if and only if f g is Borel integrable, in which case f dg = f g dx.

14. Let (X, S , ) be a measure space, let g be a nonnegative measurable function, and dene mg : S [0, ] by mg (A) = (a) Prove that mg : S [0, ] is a measure. (b) Given a measurable function f : X R, prove that f is mg -integrable if and only if f g is -integrable, in which case, f dmg = f g d.
A

g d for all A S .

15. Fix a Lebesgue integrable function g : Rn [0, ] that vanishes only on a set of zero Lebesgue measure (that is, g is positive a.e.). Let I = M n , the -algebra of Lebesgue A I , where m denotes Lebesgue measure. (i) Show that mg : I [0, ] is a measure. (ii) Let A I . Prove that mg (A) = 0 if and only if m(A) = 0. n n (iii) Let m g : P (R ) [0, ] denote the outer measure generated by mg . Let A R and suppose that mg (A) = 0. Show that A I . Suggestion: Use regularity to nd an element B S (I ) = I such that A B and m g (A) = mg (B ) = mg (B ). Why does m ( B ) = m ( B )? g g (iv) Show that Mg = I , where Mg denotes the set of m g -measurable sets. (v) Now suppose that g vanishes on a set of positive Lebesgue measure (instead of on a set of zero Lebesgue measure). Show that Mg = I by nding an element of Mg that is not in I . 16. Let 1 and 2 be measures on (X, S ) and let = 1 + 2 . (i) Show that is a measure on S . (ii) Prove that a measurable function f is i -integrable for i = 1, 2 if and only if f is -integrable, in which case f d = f d1 + f d2 . measurable subsets of Rn , and dene mg : I [0, ] by mg (A) := g dm for all
A

17. (Catalans constant) Here are a couple integrals involving the famous Catalans (1)n constant, G = ene Catalan (1814 n=0 (2n+1)2 = 0.915965594 . . ., named after Eug` 1894). By the way, its not known whether or not Catalans constant is rational! (For many more formulas, check out [59].) In this problem you may evaluate integrals using the fundamental theorem of calculus. Using the MCT, prove that
1

G=
0

tan1 x dx = x

x dx . cosh x

by using the Maclaurin expansion for tan1 (x) for the rst equality and by writing n (2n+1)x x x ex = . Suggestion: The series you end up trying = 1+ n=0 (1) x e cosh x e2x to integrate are alternating series, but to apply the MCT you need nonnegative terms; try to group adjacent terms together to get a series of nonnegative terms. 18. (Basel problem) In this problem we give Leonhard Eulers (17071783) rst rigorous 2 2 proof that n=1 1/n = /6, which was originally announced by Euler in 1735 (see Section 6.1 for more on Eulers sum). The following is Eulers argument from his 1743 paper Demonstration de la somme de cette suite 1+1/4+1/9+1/16 . . . (Demonstration of the sum of the following series: 1 + 1/4 + 1/9 + 1/16 . . .) [123] (cf. [340, 341]). In this problem, you may evaluate integrals using the fundamental theorem of calculus.

5.6. THE DCT, OSGOODS PRINCIPLE AND COMPLEX-VALUED FUNCTIONS

289

2 2 (a) Prove that if we can show that n=1 1/(2n 1) = /8, then it follows that 2 2 n=1 1/n = /6; thus, we just have to prove the rst equality. Suggestion: 2 Break up n=1 1/n into sums over even and odd numbers. (b) Using the binomial expansion of (1 x2 )1/2 near x = 0, prove that

(5.22)

arcsin x = x +

n=1

1 3 5 (2n 1) x2n+1 , 2 4 6 (2n) 2n + 1

15 where this series is valid for all x [0, 1]. (c) Divide both sides of (5.22) by 1 x2 , then integrate over [0, 1], stating why term2 by-term integration is permissable, to prove that 2 /8 = n=0 1/(2n + 1) . See 1 2n+1 2 1/2 2 n +1 Problem 2 of Exercises 2.5 for the integral 0 x (1x ) dx = 02 sin t dt, where we substituted x = sin t. (d) Appendix: We can modify Eulers proof slightly as follows: In (5.22) substitute x = sin t, for 0 < t < /2, then integrate the resulting equality from 0 to /2 to get the formula for 2 /8. (This method was published in [85] and is essentially the same as multiplying (5.22) by dx/ 1 x2 and then putting x = sin t.) 19. In this problem, we give Pennisis formulas [308] for and 2 :

(1)

n! = , 2 1 3 5 (2n + 1) n=0

(2)

2 4 (2n) 2 1 = + . 72 8 n=1 [1 3 5 (2n + 1)](2n + 2)22n+2

In this problem, you may evaluate integrals using the fundamental theorem of calculus. You may proceed as follows: (i) Justify the following equalities: arcsin x 1 = arctan x 1 x2 x 1 x2

(ii) Expanding (1 x2 + 2x2 t2 )1 as a geometric series, prove that (5.23) arcsin x = 2 x2n x 1 x2 n=0
1/ 2 1/ 0 2

x 1 x2

1/

2
0

dt . (1 x2 ) + 2x2 t2

(1 2t2 )n dt.

(iii) Evaluating the integral 0 (1 2t2 )n dt (see Problem 2 in Exercises 2.5) and taking a particular x in (5.23), derive the rst Pennisi formula. Then integrating (5.23) from 0 to a for a certain a, prove the second Pennisi formula.

5.6. The DCT, Osgoods principle and complex-valued functions In this section we prove probably the most important and powerful limit theorem youll ever need, the dominated convergence theorem (DCT). With minimal assumptions, it allows you to interchange limits and integrals without fear of contradictions, or of failing examinations [99, p. 229]. We also discuss Vitalis convergence theorem and many corollaries of the DCT. 5.6.1. The dominated convergence theorem. From our discussion concerning Fatous lemma, we know that if {fn } is a sequence of nonnegative functions on a measure space, then concerning the interchange of limits and integration, without making further assumptions the best we can say is lim fn lim fn ,

15 Using the so-called Raabes test, one can in fact show that this series converges uniformly for x in [1, 1], but we wont need this fact.

290

5. BASICS OF INTEGRATION THEORY

provided, of course, that the limits on both sides exist. Thus, we seek sucient conditions under which we can say = rather than . The earliest Lebesgue integration convergence theorem, the bounded convergence theorem, was proved in Lebesgues 1902 thesis [232]; in modern abstract measure theory it reads: Bounded convergence theorem Theorem 5.28. If X is a space of nite measure, {fn } is a sequence of measurable functions such that lim fn exists a.e. and there is a constant
n

M > 0 such that for each n N, |fn | M a.e., then the limit function lim fn and each fn are integrable, and lim fn = lim fn .

This theorem follows from Lebesgues DCT well present below. The bounded convergence theorem (BCT) is a vast generalization of the Arzel` a [79] and Osgood [298] bounded convergence theorem for the Riemann integral. Although state of the art at the time, Lebesgues BCT has two big drawbacks: (1) It fails in the case X has innite measure.16 (2) It does not apply to unbounded functions. That the BCT does not apply to spaces of innite measure and to unbounded functions excludes a large chuck of spaces and functions! In 1908, while applying his newly discovered theory of integration in the paper Sur la m ethode de M. Goursat pour la r esolution de l equation de Fredholm [236], Lebesgue realized just how restrictive the boundedness assumption really was. He used the BCT to solve a Fredholm integral equation (certain equations involving integrals which are used in diverse areas of mathematics and physics, named after Ivar Fredholm (18661927)). Because he used the BCT in his solution he was forced to put restrictive boundedness hypotheses on the functions involved, which made his solution impractical. In order to eliminate the restrictive hypotheses and make his solution useful, he then states and proves the DCT; heres what he said [236, p. 11-12]:
In the preceding statement I have pointed out three restrictive hypotheses . . . The theorem on integration of sequences [the BCT], that has been previously used, will be replaced by the following: A convergent sequences of integrable functions fi is integrable term by term if there exists an integrable function F such that |fi | |F | for all i and for every value of the variable.

In modern abstract measure theory, Lebesgues theorem reads:

16Recall the moving pulse sequence {f } from Example 5.14 in Section 5.4. This sequence n is bounded and converges, yet lim fn < lim fn .

5.6. THE DCT, OSGOODS PRINCIPLE AND COMPLEX-VALUED FUNCTIONS

291

Dominated convergence theorem Theorem 5.29. Let {fn } be a sequence of measurable functions such that (1) lim fn exists a.e.;
n

(2) there exists an integrable function g such that for each n N, |fn | g a.e. Then the limit function lim fn and each fn are integrable, and lim fn = lim fn .

Proof : Figure 5.15 shows a picture of the situation. To the proof, let
f fn f3 f2 f1 g

and |fn | g for all n where we assume g is integrable. The DCT says that the net area under the fn s converge to the net area under f .

Figure 5.15. fn f (everywhere in this example, where f (0) = 0)

f (x) = lim fn (x)


n

when this limit exists (and zero when it doesnt exist recall the convention around (5.19)!). The a.e. inequality |fn | g implies that each fn is integrable and taking n in |fn | g a.e. it follows that |f | g a.e. Thus, f is also integrable. It remains to prove that lim fn exists and equals f . Because lim fn exists if and only if the liminf and limsup of fn are equal, in which case lim fn is the common value (see Lemma 5.11), all we need to prove is lim inf fn = lim sup fn = f.

Before proving these equalities we need two facts. First, we note that |fn | g = g fn g = fn + g 0.

where all the inequalities hold a.e. In particular, {fn + g } is an (a.e.) nonnegative sequence. The second fact concerns lim inf: For any constant b R and extended real-valued sequence {an }, we have (5.24) lim inf(an + b) = (lim inf an ) + b and lim inf(an ) = lim sup an ;

the rst equality is an exercise and the second property follows from Property (2) in Lemma 5.11. Now to our proof, we rst work on lim inf fn . The idea is to apply Fatous lemma; unfortunately, fn may not be nonnegative, violating the hypotheses of

292

5. BASICS OF INTEGRATION THEORY

Fatous lemma. However, fn + g is nonnegative as we saw above. Hence, (f + g ) = lim(fn + g ) = lim inf(fn + g ) (fn + g ) fn + (lim inf = lim when lim exists) (Fatous lemma) g (by (5.24)).

lim inf = Subtracting (5.25) g we get lim inf

f lim inf

fn .

Applying this argument to the sequence {fn } gives (f ) lim inf (fn ). fn f.

Multiplying by 1 and using the second fact in (5.24) gives lim sup Combining this inequality with (5.25), we see that lim sup fn f lim inf fn .

However, lim infs are always less than or equal to lim sups of a given sequence, so all inequalities must in fact be equalities. This proves the result.

In Section 6.1 well show you a plethora of tricks this theorem can perform. Observe that the BCT (Theorem 5.28) is a special case of the DCT with g equal to the constant function M , which is integrable since M = M (X ) < , assuming that (X ) < . Heres an example for which the DCT applies, but not the BCT.
Example 5.17. This interesting example is given by William Osgood (1864-1943) [297, 298] in 1896: For each n N dene fn : [0, 1] R by n2 x . 1 + n3 x2 If f = 0, its easy to check that fn f pointwise. Here are some graphs: fn (x) =

Figure 5.16. fn has maximum value

n/2 occurring when x = 1/n3 . Thus, the fn s are not uniformly bounded yet fn 0 pointwise.

For this sequence we can (as Osgood did) show by computation that lim fn = lim fn .

However, can we prove this equality using a convergence theorem? Using calculus we nd that fn has the maximum value n/2 (obtained when x = 1/n3 ); in particular, the sequence {fn } is not uniformly bounded). Hence, we cannot use the

5.6. THE DCT, OSGOODS PRINCIPLE AND COMPLEX-VALUED FUNCTIONS

293

BCT to answer our question. However, we can use the DCT! Its not obvious at rst glance what dominating function will work, but we can nd one with the help of calculus. Fix x > 0 and dene F : [0, ) [0, ) by F (t ) = t2 x . 1 + t3 x 2

Using elementary calculus we nd that the maximum value of F is 22/3 x1/3 /3 (obtained when t = (2/x2 )1/3 ). Thus, F (t) C x1/3 for all t 0 where C = 22/3 /3. In particular, taking t = 1, 2, 3, . . . we see that for all n N, 0 fn g where g (x) = C x1/3 . Thus, we can apply the DCT if we can show that g is integrable. To prove this we shall assume that any Riemann integrable function is Lebesgue integrable and the two integrals are the same, which well prove in Section 6.2. Now let gn = [1/n,1] g . Then 0 g1 g2 gn g so by the MCT we have
1

g = lim

gn = lim C
n 1/n

x1/3 dx 1 1 n2/3 = 3C , 2

= lim

3C 2

where from the second to third line we use the fundamental theorem of calculus. It follows that g is integrable, so the DCT applies.

Now, you may be asking: Lebesgues DCT gives a sucient condition for the interchange of limits and integrals; is there also a necessary condition? The answer is yes and it was provided in a truly amazing paper [404] by Giuseppe Vitali (1875 1932) in 1907, a year before the DCT was stated by Lebesgue.17 Vitalis convergence theorem Theorem 5.30. Let {fn } be a sequence of integrable functions such that lim fn exists a.e. Then the limit function lim fn is integrable on X and for n any measurable set A X we have lim fn = lim
A A

fn

if and only if the following conditions hold: (1) For each > 0 there is > 0 such that for all measurable sets A with (A) < , we have | A fn | < for all n N. (2) For each > 0 there is a measurable set A of nite measure such that for all measurable sets B X \ A, we have | B fn | < for all n N. Condition (1) is called uniform absolute continuity and it says that the integrals of the fn s can be made uniformly small on sets of small measure and Condition (2), which we shall call uniform Vitali smallness, basically says that the integrals of the fn s can be made uniformly small outside sets of nite measure. See Problems 11 and 12 for more on these conditions. Giving both sucient and necessary conditions on the interchange of limits and integrals, all the convergence theorems we discussed should be corollaries of the VCT. Indeed, in Section
17 If you want to read Vitalis original proof along with some history of the VCT along with further developments due to de la Vall ee Poussin, Hahn, and Saks, see Choksis paper [86].

294

5. BASICS OF INTEGRATION THEORY

4 of Lebesgues 1909 paper sur les int egrales singuli` eres [237], he states many convergence theorems and in a footnote on page 50 he says
All these properties are particular cases of a very general theorem of Mr. Vitali.

Indeed, in Problem 12 you will show that the DCT follows directly from the VCT and in Problem 8 you will nd an example of a sequence for which the VCT applies, yet the DCT fails. The proof of the VCT is outlined in Problem 18. 5.6.2. Osgoods Principles. In William Osgoods (1864-1943) 1897 paper Non-uniform convergence and the integration of series term by term [298, p. 155], he begins his paper with18
The subject of this paper is the study . . . of the conditions under which
x x

()

William Osgood Shortly after, he mentions what we shall coin Osgoods principle, which (1864-1943). basically says that the solution to problem () applies to the interchange of the integral and other processes involving limits (such as series, dierentiation and integration); in his own words,
The four problems of 1) integration of a series term by term, 2) dierentiation of a series term by term, 3) reversal of the order of integration in a double-integral, 4) dierentiation under the sign of integration, are in certain classes of cases but dierent forms of the same problem, a problem in double limits; so that a theorem applying to one of these problems yields at once a theorem applying to the other three.

x0 n

lim sn (x)dx = lim

sn (x)dx
x0

By Osgoods principle, Problems 1)4) should be answered by the DCT, and indeed they are! In fact, using the DCT, an answer to Problem 1) is in Theorem 5.31 below; Problem 2) is solved in Problem 5 in the Exercises; Problem 3) is illustrated in the FichtenholzLichenstein theorem in Problem 8 of Exercises 6.2 (cf. Section 7.3 on Fubinis theorem), and nally, Problem 4) is answered in Theorem 5.35. Heres Osgoods Problem 1), a theorem that complements Theorem 5.21. Integration of a series term by term Theorem 5.31. If {fn } is a sequence of integrable functions such that
n=1

|fn | < ,

then the series

n=1

fn converges a.e. to an integrable function, and

fn =
n=1 n=1

fn .

18The . . . in the quote deals with Condition (A), which is the assumption that {s } is n uniformly bounded on [x0 , x]. Also, the () marking the displayed equation was not in his paper.

5.6. THE DCT, OSGOODS PRINCIPLE AND COMPLEX-VALUED FUNCTIONS


Proof : Since |fn | < , by Theorem 5.21 g := n=1 n=1 |fn | converges a.e. and g = |fn | < . In particular, g is integrable and since absolute n=1 convergence implies convergence it follows that n=1 fn converges a.e. Now if n

295

gn :=
k=1

fk ,
k=1

then |gn | g , and hence by the DCT, lim gn =


k=1

fk is integrable, and
n

fk =

lim gn = lim

gn = lim

n n

fk
k=1

= lim

fk =:
k=1

k=1

fk .

The following theorem complements Theorem 5.22. Theorem 5.32. If f is integrable and A = disjoint measurable sets, then
n=1

An where the sets An are

f=
A n=1 An

f.

Proof : This result follows from the dominated convergence theorem applied to fn = n k=1 Ak f , which converges pointwise to A f .

5.6.3. Continuity and dierentiation. Integrals depending on a continuous parameter occur often in applications. For such applications, the following continuous version of the dominated convergence theorem is helpful. A point a R is called a limit point of a set I R if there exists a sequence {an } of points in I \ {a} such that an a. The points a = are allowable. Continuous dominated convergence theorem Theorem 5.33. Let I R be a set and let a R be a limit point. For each t I , let ft be a measurable function on X and suppose that (1) for a.e. x X , the limit lim ft (x) exists;
ta

(2) there exists an integrable function g such that for a.e. x X , |ft (x)| g (x) for all t I . Then lim ft and each ft are integrable, and
ta ta

lim ft = lim

ta

ft .

Proof : The inequality |ft | g a.e. implies that each ft is integrable and then taking t a shows that |f | g a.e. where f := lim ft , so f is integrable as well. To
t a

prove that

f = lim

prove that for any sequence {tn } in I \ {a} of real numbers converging to a, we

t a

ft , by the sequence formulation of limit we just have to

296

5. BASICS OF INTEGRATION THEORY

have (5.26) f = lim


n

ftn .

However, given such a sequence, for each n N we have |ftn | g a.e. and f = lim ftn a.e., so (5.26) follows from the usual Dominated Convergence Theorem.

This theorem can be used to establish conditions assuring the continuity or dierentiability of integrals depending on a parameter. Continuity of integrals Theorem 5.34. Let f : I X R, where I is an interval in R, such that for each t I , f (t, x) is a measurable function of x. Suppose (1) for a.e. x X , f (t, x) is a continuous function of t I ; (2) there exists an integrable function g : X [0, ] such that for a.e. x X , |f (t, x)| g (x) for all t I . Then f (t, x) d F (t) =
X

is a continuous function of t I .
Proof : Dene ft (x) := f (t, x). Then for a.e. x X we have |ft (x)| g (x) for all t I where g is integrable, and for a.e. x X we have f (a, x) = lim ft (x). Thus, by the continuous dominated convergence theorem, f (a, x) and each ft are integrable, and f (a, x) d = lim
t a t a

ft d,

which is another way of writing F (a) = limta F (t). This equality implies that F (t) is continuous at t = a, and hence on I .

Theorem 5.35 below gives very general conditions under which one can dierentiate under an integral sign, a trick that in the next section well see is incredibly useful. In fact, the 1965 Nobel laureate Richard Feynman (19181988) was an expert at this trick. Heres an excerpt of a letter from Feynman to his high school teacher Abram Bader [132, p. 176177] (cf. [133, Ch. 12])
Another thing I remember as being very important to me was the time when you called me down after class and said You make too much noise in class. Then you went on to say that you understood the reason, that it was that the class was entirely too boring. Then you pulled out a book from behind you and said Here, you read this, take it up to the back of the room, sit all alone, and study this; when you know everything that is in it, you can talk again. And so, in my physics class I paid no attention to what was going on, but only studied Woods Advanced Calculus up in the back of the room. It was there that I learned about gamma functions, elliptic functions, and dierentiating under an integral sign. A trick at which I became an expert . . . Thank you very much.19
19 The full reference of the book is Advanced Calculus: A Course Arranged with Special Reference to the Needs of Students of Applied Mathematics. Boston, MA, Ginn, 1926 by Frederick

5.6. THE DCT, OSGOODS PRINCIPLE AND COMPLEX-VALUED FUNCTIONS

297

Hopefully one day youll write to me thanking me for exposing you to this trick! In honor of Feynman, I shall call the following theorem Feynmans favorite theorem (although I dont know if it really was his favorite). Feynmans favorite theorem Theorem 5.35. Let f (t, x) : I X R, where I is an interval in R, such that for each t I , f (t, x) is integrable in x. Suppose that (1) for a.e. x X , the partial derivative t f (t, x) exists for all t I ; (2) there exists an integrable function g : X [0, ] such that for a.e. x X , |t f (t, x)| g (x) for all t I . Then f (t, x) d F (t) =
X

is dierentiable at each t I , and F (t) =


X

f (t, x) d. t

Proof : Fixing a I , observe that provided the limit exists, we have F (a) := lim
t a

1 F (t ) F (a ) = lim t a t a ta = lim
t a

f (t, x) d

f (a, x) d

ft (x) d,

where for each t I not equal to a, we have ft (x) := Thus, we just have to show that
t a

f (t, x) f (a, x) . ta f (t, x) d. t


t a

lim

ft (x) d exists and equals

and from the mean value theorem from elementary calculus, for any t I , |ft (x)| = f (t, x) f (a, x) = |t f (at , x)| , ta

To see this, note that by Assumption (1), for a.e. x X , t f (a, x) = lim ft (x)

for some number at between a and t. By Assumption (2) it follows that for a.e. x, we have |ft (x)| g (x) for all t I . Thus, by the continuous dominated convergence theorem, t f (a, x) and each ft are integrable, and
t a

lim

ft (x) d =

t a

lim ft (x) d =

f (a, x) d. t

See Section 6.1 for applications of this theorem. So far we have focused on real-valued functions, but everything we have talked about works for . . .
Woods (1864-1950). Also, the dots . . . is Feynman recalling a lecture at Cornell concerning the trick.

298

5. BASICS OF INTEGRATION THEORY

5.6.4. Complex-valued functions. Given a function f : X C, we can write it in the form f = f1 + i f2 , i = 1 , where f1 and f2 are real-valued functions, called the real and imaginary parts, respectively, of f . We say that f is measurable if both f1 and f2 are measurable. For example, the function f : R C dened by f (x) = eix is Lebesgue measurable since eix = cos x + i sin x and cosine and sine are both measurable. Because measurability of complex-valued functions is dened in terms of the measurability of their real and imaginary parts, its no surprise that many the properties of measurable (extended) real-valued functions studied in Sections 5.2 and 5.3 also hold for complex-valued functions. For instance, heres a partial list of properties of measurable complex-valued function: (1) Complex-valued constant functions are measurable. (2) If f and g are measurable complex-valued functions, then f g , f + g , and |f |p where p > 0, are also measurable. (3) If on the zero set {x ; f (x) = 0}, 1/f is redened as a measurable complexvalued function, then 1/f is a measurable function on X . (4) If {fn } is a sequence of measurable complex-valued functions, and if the limit lim fn (x) exists for a.e. x, then it denes a measurable function by redening
n

the limit to be zero (or any other measurable complex-valued function) on the points where it does not converge. These results follows directly from Theorems 5.12 and 5.14 by applying these results to the real and imaginary parts of the complex-valued measurable functions. We wont bore you with the proofs. Suce to say that everything you know is true for real-valued measurable functions is also true for complex-valued ones . . . as long as we stay away from properties that specically depend on the ordering of the real numbers because complex numbers do not have an ordering that respects the usual algebraic operations.20 For example, we dont dene the limit inmum of a sequence of complex-valued functions. We now turn to integration. We say that a measurable function f : X C is integrable if its real and imaginary parts are integrable. If f = f1 + i f2 is broken up into its real and imaginary parts, then we dene f := One can check that |f 1 |, |f 2 | | f | =
2 + f 2 |f | + |f |, f1 1 2 2

f1 + i

f2 .

It follows that f is integrable (meaning both |f1 | and |f2 | are nite real numbers) if and only if |f | is an integrable real-valued function. Thus, f : X C is integrable |f | < .

20For example, supposing there were an ordering > on C and supposing i > 0, multiplying

both sides by i we get 1 > 0, an absurdity. Similarly, if i < 0, then multiplying both sides by i gives 1 < 0, another absurdity.

5.6. THE DCT, OSGOODS PRINCIPLE AND COMPLEX-VALUED FUNCTIONS

299

Here are some other properties of the integral for complex-valued functions. Theorem 5.36. Given any complex-valued integrable functions f and g and complex numbers a and b, the integral is linear: (af + b g ) = a Moreover, f |f |. f +b g.

Proof : Since the integral is linear for real-valued functions, one can easily show that (f + g ) = f+ g.

Thus, all we have to show is that af = a f for a C. Writing a = + i where , R and writing f = f1 + i f2 where f1 and f2 are real-valued, we have af = f1 f2 + i (f2 + f1 ). Thus, af = = (f1 f2 ) + i f1 (f2 + f1 ) f2 + (def. of integral) f1 (linearity)

f2 + i f1 + i f1

= + i =a f.

(algebra)

We now prove the absolute value inequality. If f = 0, then | satised, so assume that f = 0. Let a = | f |/ f . Then f =a f= af.

f|

|f | is

If we write af = g1 + i g2 where g1 and g2 are real-valued, then as af is a real 2 2 number (equalling | f |) , it follows that af = g1 . Now g1 g1 + g2 = |af | = |f | since |a| = 1. Hence, f = af = g1 |f |.

We remark that all the properties of integrals for real-valued functions hold for complex-valued functions as well, as long as the properties dont require the complex numbers to be ordered. For example, the MCT doesnt make sense as stated for complex-valued functions, but the dominated convergence theorem and its continuous version and the continuity and dierentiability theorems all hold as stated for complex-valued functions. You can verify these statements if you wish.
Exercises 5.6. 1. (Cantors LebesgueStieltjes set function cf. [185]) Let denote the Lebesgue Stieltjes set function of the Cantor function from Section 4.5.

300

5. BASICS OF INTEGRATION THEORY

(i) Let f : R R be a continuous function. Prove that for -a.e. we have n 1 + + n C1 ...n , f f = lim fn where fn = 3 3 ...
1 n

where the sum is over all n-tuples (1 , . . . , n ) of 0s and 2s. (Recall that C = C1 ...n with this union over all n-tuples (1 , . . . , n ) of n=1 Cn with Cn = 0s and 2s see Problem 2 in Exercises 4.5.) (ii) Prove that f d = lim fn d .

(iii) Fix z C, let f (x) = ezx and put Fn = fn d . Prove that for n 2, n n n z Fn = Fn1 ez/3 cosh(z/3n ). Conclude that Fn = ez/3++z/3 k=1 cosh 3k . (iv) Prove that ezx d = ez/2
k=1

cosh

z . 3k
2

2. Fix a (1/2, 1) and for each n N, dene fn : [0, 1] R by fn (x) = na x enx . Show that lim fn exists and that the DCT, but not the BCT, implies that we can interchange limits with integrals for this sequence. You may evaluate integrals using calculus. 3. Given a R, b (0, ) with b > |a|, and given a sequence {cn } in R with cn , compute the following limits:
1

(a) lim

1+
0

ax n

dx ,

(b) lim

cn

fn dx ,

(c) lim

fn dx,
0

ax n bx e . You may evaluate integrals using calculus. n 4. (Weierstrass M-test) Let {fn } be a sequence of continuous functions on an interval I R and suppose that for each n N there is a constant Mn 0 such that |fn | Mn a.e. and n=1 Mn converges. Prove that f : I R is continuous. 5. (Dierentiation of series) Let {fn } be a sequence of dierentiable functions on an interval I R such that the series f (t) = n=1 fn (t) converges for each t I . Suppose that for each n N there is a constant Mn 0 such that |fn | Mn a.e. and n=1 Mn converges. Prove that f : I R is dierentiable, and where in (b) and (c), fn = 1 + f (t ) =
fn (t ) ;

that is,

n=1

d d fn (t) = fn (t). dt n=1 dt n=1

6. (Vi` etes formula) In this problem we give a probabilistic proof of Fran cois Vi` etes (15401603) formula: (5.27) 2 = 1 2 1 1 + 2 2 1 2 1 1 + 2 2 1 1 + 2 2 1 . 2

We shall use the following result in Part (i) of Problem 4 in Exercises 4.2. Let b N, b 2, Y = {0, 1, . . . , b 1} with fair probabilities assigned, and let F : Y [0, 1] be the map dened by x1 x2 x3 F (x1 , x2 , x3 , . . .) := + 2 + 3 + for all (x1 , x2 , x3 , . . .) Y ; b b b
b1 1 that is, F = k=1 kAnk and Ank = Y Y {k } Y n=1 bn fn where fn = with {k} in the nth spot. If : S (C ) [0, 1] is the innite product measure, then

A B if and only if F 1 (A) S (C ), in which case (F 1 (A)) = m(A).

Assume this result (or, if you wish, try to prove it if you havent seen it before!).

5.6. THE DCT, OSGOODS PRINCIPLE AND COMPLEX-VALUED FUNCTIONS

301

(i) Using the principle of appropriate functions, prove that f : [0, 1] R is Borel measurable if and only if f F : Y R is Borel measurable, and (5.28)
[0,1]

f (x) dx =
Y

f F d,

provided f is nonnegative or integrable; in particular, it holds when f : [0, 1] C is integrable. (ii) Fix z C. Then applying (5.28) to the function f (x) = e2zx , prove that ez sinh z = lim N z
N Y n=1

e2zfn /b d,

where sinh z := (ez ez )/2. b1 2zkA /bn 2zfn /bn nk (iii) Write N = N as a simple function and use the n=1 e n=1 k=1 e n N 2zfn /b formula to compute Y n=1 e d. Deduce the very interesting formula ez sinh z = z n=1
b1 k=1

1+

2ekz/b sinh b

kz bn

(iv) With b = 2 and z = it, prove that sin t cos = t n=1

t 2n

1 (v) Finally, put t = /2 and use the formula cos(/2) = (1 + cos()) to prove 2 Vi` etes formula (5.27). 7. (Original proofs) Suppose (X ) < , let {fn } be a sequence of measurable realvalued functions, and let f := lim fn , assumed to exist at all x X as a real number.

(a) Lebesgues proof of the BCT from his thesis [232, p. 259]: Assume there is a constant M > 0 such that for each n N, |fn | M . Let > 0 and let An = kn {|f fk | }. Prove that lim (An ) = 0. Next, prove that
n

fn

An

|f fn | +

Ac n

|f fn | 2M (An ) + (X ).

Conclude that f = lim fn . (b) Lebesgues proof of the DCT from his 1910 paper [238, p. 375376]: In this problem, the only convergence theorem for measurable functions you are allowed to use in your proof is the BCT. Assume there is an integrable function g such that for each n N, |fn | g . Let > 0 and show there is an M > 0 such that A g < where A = {g > M }. Next, prove that f fn
A

|f fn | +

Ac

|f fn | 2

g+
A Ac

|f fn |,

and use the BCT for the integral over Ac to show that f = lim fn . (c) Fatous proof of his lemma from his 1906 paper [129, p. 37576]: In this problem, the only convergence theorem for measurable functions you are allowed to use in your proof is the BCT. Assume now that the fn s are nonnegative; we need to prove that f lim inf fn .21 (If lim inf fn = , there is nothing to
21 Fatous lemma in our textbook reads that f := lim fn exists, so Fatou proves that

lim inf fn lim inf f lim inf fn .

fn ; however, Fatou assumes

302

5. BASICS OF INTEGRATION THEORY

prove, so assume that lim inf fn < .) Now, xing k N for the moment, dene Ek = {x X ; f (x) k}, and dene gn : X [0, ) by g n (x ) = fn (x) f (x ) if fn (x) k if fn (x) > k.
Ek

Using the BCT prove limn E gn = E f . Second, prove limn k k lim inf fn . So far, youve shown that for any k,
Ek

gn

f lim inf

fn .

Third, prove that for any nonnegative simple function s, s = limk E s. Using k this result and the denition of f as the supremum of its lower sums, prove that given any simple function s with 0 s f , we have s lim inf fn . Finally, conclude that f lim inf fn . 8. Here are some interesting (counter)examples. (a) (DCT is sucient but not necessary) We show that the dominating condition in the DCT is not necessary for the interchange of limits and integrals. Let X = R with Lebesgue measure and for n = 1, 2, 3, . . ., let fn = n(1/(n+1),1/n] . Show that22 (i) lim fn = 0 at all points of X ; (ii) A lim fn = lim A fn for all measurable sets A X ; (iii) there is no integrable function g on X such that for each n N, |fn | g a.e.; (iv) the conditions of the VCT are satised as they should be. (b) (Counterexample to the dierentiation theorem) Let f : [0, ) [0, ) R be the function f (t, x) = t2 etx and dene F (t) = 0 f (t, x) dx. Prove that F (0) = 1 and 0 t f (0, x) dx = 0 (since t [0, ), the derivative at t = 0 is the right-hand derivative). What hypothesis of Theorem 5.35 is violated? 9. Prove that a complex-valued function f : X C is measurable if and only if f 1 (A) is measurable for every open set A C. 10. (Averaging theorem) Suppose that (X ) < and let f : X C be integrable. (i) Let B be a closed ball in C, say B = {z C ; |z c| r } for some c C and r 0, such that A := {f B } has positive measure. Prove that the average of f over A lies in B ; that is, prove that 1 f B. (A) A (ii) Suppose there is a closed set C C such that for every measurable set A with (A) > 0, the averages of f over A are in C , that is, (5.29) 1 (A)
A

f C

for all A with (A) > 0.

Prove that f (x) C for a.e. x X . Suggestion: Since C is closed, C c is open, and hence we can write it as a countable union of closed balls. 11. (Absolute continuity) Let f : X R be an integrable function on a measure space (X, ). The set function mf dened by mf (A) := A f for all measurable sets A is said to be absolutely continuous if given any > 0, there is a > 0 such that for all measurable sets A with m(A) < , we have | A f | < . (i) Prove that mf is absolutely continuous if and only if m|f | is absolutely continuous. (ii) Prove that m|f | , and hence mf , is absolutely continuous. Suggestion: First prove that the set function s for any nonnegative integrable simple function s is absolutely continuous, then approximate |f | by simple functions.
22 In fact, let {an } be a sequence with 0 < < a3 < a2 < a1 = 1 and an 0. Let In = (an+1 , an ] and bn = [n(an an+1 )]1 . Then fn = bn In can be substituted for n(1/(n+1),1/n] .

5.6. THE DCT, OSGOODS PRINCIPLE AND COMPLEX-VALUED FUNCTIONS

303

12. (Vitali smallness) Let (X, ) be a measure space. An integrable function f : X R is said to be Vitali small if for each > 0 there is a measurable set A of nite measure such that for all measurable sets B Ac , we have | B f | < .23 (i) Prove that an integrable nonnegative function f is Vitali small if and only if for each > 0 there is a measurable set A of nite measure such that Ac f < . (ii) Prove that for any integrable function f , if |f | is Vitali small, then f is Vitali small. (iii) Prove that any integrable function f is Vitali small. Suggestion: You just have to prove that |f | is Vitali small. To do so, rst try to prove that any nonnegative integrable simple function is Vitali small. (iv) Using this problem and the previous one, prove that the VCT implies the DCT. 13. (Fundamental theorem of calculus, to be generalized in Chapter 9.) Given any dierentiable function f : R R with a bounded derivative, we shall prove that the derivative f is Lebesgue integrable on any compact interval [a, b],24 and
b

(5.30)
a

f d m = f ( b) f ( a ) .

This result improves on Riemann integration, because for the Riemann integral we have to assume not only that f is bounded but also Riemann integrable. Here, boundedness of f automatically ensures its Lebesgue integrable. Proceed as follows. (i) Let f : R R be dierentiable almost everywhere, that is, for a.e. x R, the limit of the dierence quotients, f (x) := lim fh (x)
h0

exists, where fh (x) =

f (x + h) f (x) . h

Show that f is a Lebesgue measurable function. (ii) Henceforth assume that f exists at all points of R and is bounded. Show that
b a

f = lim

h0

fh (x) dx.
a

(iii) Using the translation invariance of the integral as discussed at the end of Section 5.4, show that
b

fh (x) dx =
a

1 h

b+h b

f (x) dx

1 h 1 h

a+h

f (x) dx.
a c+ h

(iv) Show that for any point c R, we have lim

fact, taking h 0 in (iii), prove (5.30). (v) The boundedness condition is needed for the proof; consider the following example. Let f (x) = x2 sin(1/x2 ) for x = 0 and dene f (0) = 0. Prove that f exists for all x R but f is not Lebesgue integrable on [0, 1], that is, show 1 that 0 |f | = . You may assume that the Lebesgue integral is the same as the Riemann integral on Riemann integrable functions and use any theorems on 1 Riemann integration to help you show that 0 |f | = . 14. (Another proof of Lebesgues dominated convergence theorem) Assuming that (X ) < , prove Lebesgues DCT using Egorovs theorem. 15. (Yet another proof of Lebesgues dominated convergence theorem) Let {fn } be a sequence of measurable functions such that f := lim fn exists a.e. and |fn | g
Thus, the integral of f can be made arbitrarily small outside of sets of nite measure. In order for (5.30) to hold, we just need f to be dierentiable on the interval [a, b] with a bounded derivative, and not on all of R as stated, but assuming f is dierentiable on R allows us to not have to think about right and left-hand limits to dene f (a) and f (b).
24 23

h0

f (x) dx = f (c). Using this


c

304

5. BASICS OF INTEGRATION THEORY

a.e. for some integrable function g . We shall prove that f = lim (i) Show that f fn , fn that is, lim f fn = 0.

we have to show that lim gn = 0. (ii) Show that {2g gn } is an a.e. nonnegative sequence of measurable functions. (iii) Use Fatous lemma on the sequence {2g gn } to prove that lim sup gn = 0. (iv) Prove that lim gn = 0, which completes the proof. 16. (Youngs convergence theorem) Heres a convergence theorem due to William Henry Young (18631942) who discovered an alternative formulation of Lebesgues theory of integration, which he published in the paper On the general theory of integration [419] in 1905. See [309, Ch. 5] for more on Youngs integral. Prove the following theorem of Young, proved in 1910 [316, 421]: Theorem. Let {fn }, {gn }, {hn } be measurable functions and suppose that (1) the limit functions lim fn , lim gn , lim hn exist a.e.; (2) For each n N, gn fn hn a.e.; (3) lim gn and lim hn are integrable and lim gn = lim gn and lim hn = lim hn . Then lim fn and each fn are integrable, and lim fn = lim fn .

gn , where gn = |f fn |, which is dened a.e. Thus,

17. Using Youngs convergence theorem, prove the following two results. (a) Let {fn } be a sequence of measurable functions such that lim fn exists a.e. and
n

for each n N, |fn | gn a.e. for some integrable function gn . Suppose that lim gn is integrable and lim gn = lim gn . Then lim fn and each fn are integrable, and lim fn = lim fn .

(b) (Cf. [212]) Let {fn }, {gn }, {hn } be measurable functions and suppose that (1) the series n=1 gn and n=1 hn converge a.e. to nite values; (2) For each n N, gn fn hn a.e.; (3) n=1 gn and n=1 hn are integrable and we can interchange sums and inte hn . g = gn and grals: n n=1 hn = n=1 n=1 n=1 Then n=1 fn converges a.e. to an integrable function, and
n=1

fn =

n=1

fn .

18. (Vitalis convergence theorem) In this problem we prove the celebrated VCT. Let {fn } be a sequence of integrable functions such that f := lim fn exists a.e.
n

We need to show that f is integrable on X and for any measurable set A X we have A f = lim A fn if and only if {fn } is both uniformly absolutely continuous and uniformly Vitali small. Assume the results stated in Problems 11 and 12 concerning absolute continuity and Vitali smallness. (i) Prove the only if implication. Suggestion: Observe that for any measurable set M X,
M

fn

(fn f ) +

f ,
M M

and M (fn f ) 0 as n since M f = lim dicult, so we shall attack it in pieces.

fn . The if portion is more

5.6. THE DCT, OSGOODS PRINCIPLE AND COMPLEX-VALUED FUNCTIONS

305

(ii) For this step and the next step, we assume that (X ) < and that {fn } is uniformly absolutely continuous. (In the nite measure case we can drop the Vitali smallness condition.) Let > 0 and choose > 0 such that for all measurable sets B X such that (B ) < , we have B |fn | < for all n. Using Fatous lemma, prove that f is integrable on any measurable set B X such that (B ) < , and moreover, B |f | . (iii) With and as in the previous step, use Egorovs theorem to show there is a measurable set A X such that (Ac ) < and fn f uniformly on A. Prove that f is integrable on A and A f = lim A fn . Conclude by (ii) that f is integrable on X . Also, show that f fn
A

fn +
A Ac

f +
Ac

fn

and use this to show that | f fn | < 3 for n suciently large. Conclude that f = lim fn . (iv) We now consider the general case. Assume that {fn } is both uniformly absolutely continuous and uniformly Vitali small (but not necessarily that (X ) < ). Let > 0. Using the Vitali small condition and Fatous lemma, prove that there is a measurable set B with nite measure such that Bc |fn | < for all n and |f | . By Steps (ii) and (iii), f is integrable on B and hence f is integrable Bc on X . Now let A X be measurable and show that
A

fn

AB

fn +
AB A\B

f +
A\B

fn

and use this to show that | A f A fn | < 3 for n suciently large. Conclude that A f = lim A fn . Congratulations! You have just proven one of the best theorems in integration theory! 19. (Jensens inequality) In this problem we prove an inequality named after Johan Jensen (18591925) for convex functions. A function : I R, where I R is an interval, is said to be convex if (5.31) for all x, y I and 0 t 1. This inequality has a geometric interpretation: If is graphed and z is any point between x and y , then the point (z, (z )) lies on or below the line joining the points (x, (x)) and (y, (y )). (i) Show that is convex if and only if for all x < y < z belonging to I , we have (5.32) (ii) (y ) (x) (z ) (y ) . yx zy Suppose that is dierentiable on I . Using (5.32), prove that is convex if and only if is a nondecreasing function. In particular, if is twice dierentiable, is convex if and only if 0. If is dierentiable and is strictly increasing, prove that we can replace in (5.31) by < when x = y . In particular, this holds if is twice dierentiable and > 0. Suppose that is convex on I . Using (5.32) prove that is Lipschitz on any closed interval of I , that is, given any closed interval J I there is a constant M such that |(x) (y )| M |x y | for all x, y J . In particular, any convex function is continuous. Let be a convex function on I . Let f be a measurable function on a probability space such that the range of f is contained in I . Prove that f f (Jensens inequality). (tx + (1 t)y ) t(x) + (1 t)(y )

(iii)

(iv)

(v)

306

5. BASICS OF INTEGRATION THEORY

You may proceed as follows: Let y = f . Show that y I . Let be the supremum over x (a, y ) of the left-hand side of (5.32). Show that (z ) (y ) + (z y ) for all z I . In particular, this inequality holds when z = f (x) for all x. Now set z = f (x) and integrate both sides of the inequality. Note that f is measurable by Proposition 5.8. (vi) Let X = {1, 2, . . . , n} and given 1 , . . . , n 0, dene : P (X ) [0, ] by ({j1 , . . . , jk }) = j1 + + jk . Prove that is a measure. Suppose now that (X ) = 1, that is, k = 1. Using Jensens inequality with (x) = ax where a > 1, prove that for any positive numbers a1 , . . . , an ,
n 1 2 a 1 a2 an 1 a1 + 2 a2 + + n an .

Suggestion: Let f (k) = loga (ak ). In particular, if k = 1/n for each k, deduce that the geometric mean of n nonnegative real numbers never exceeds their arithmetic mean, that is, for any nonnegative numbers a1 , . . . , an , (a1 a2 an )1/n 1 (a 1 + a 2 + + a n ). n

Remarks
5.1 : We introduced Ren e Baires sequence {Dn } (where Dn : [0, 1] R is dened by Dn (x) = 1 if x = p/q Q in lowest terms with 1 q n and Dn (x) = 0 otherwise), in order to give an example of a sequence of Riemann integrable functions whose limit, the Dirichlet function, is not Riemann integrable. Historically, this sequence was introduced by Baire in 1898 [15] not to show that Riemann integrability fails under taking limits, but rather as an example of a function of Class 2. Here, a Class 0 function is a continuous function. A Class 1 function is a function that is not of Class 0 (that is, continuous) but is a limit of a sequence Class 0 functions. More generally, a Class n function is a function that is not in any of the preceding classes, but is a limit of a sequence of Class n 1 functions. To see that D is of Class 2, we rst prove that each Dn is of class 1 by proving its a limit of spiky continuous functions; heres a proof for D1 (the remaining cases are left to you):
fk 1 k 1 D1

1 k

1 k

Since each Dn is of Class 1 and Dirichlets function D is the limit of the Dn s, it follows that D is of Class 2, provided that we can show D is not Class 1; that is, provided we can show that D is not a limit of continuous functions. This result follows from Baires theorem on functions of rst class, one version of which reads [15], [16, Ch. 2], [302, p. 33]: If a function f : [0, 1] R is of Class 1, then the set of points where f is continuous is dense in [0, 1]. (You can prove this result from the Baire category theorem.) Since D has no continuity points, it cannot be of Class 1. In particular, there is no sequence of smooth functions converging to D. (This is the reason in Theorem 5.1 we could not simply use D as an example of a non-Riemann integrable function that is a limit of smooth functions.) 5.2 : Heres Vitalis statement from his 1905 paper [403, p. 601] of Luzins theorem: If f (x) is a nite and measurable function in an interval (a, b) of length l, there exists for each positive number , as small as desired, in (a, b)

5.6. THE DCT, OSGOODS PRINCIPLE AND COMPLEX-VALUED FUNCTIONS

307

a closed set with measure greater than l such that the values of f (x) at points of it form a continuous function. At the bottom of the same page, Vitali mentions This theorem seems to be the same as the one to which it is referred in the notes of Borel and Lebesgue and which is reproduced in the cited book of Lebesgue at the foot of page 125 in the note. The author, who is Lebesgue, has not yet given an explicit proof. Thus, Luzins theorem, from 1912, seems to be the same as the results of Borel [50] (1904) and Lebesgue [234] (1903), but Borel and Lebesgues results were not proven explicitly (cf. Bourbakis book [58, p. 223]). 5.3 : The Italian mathematician Carlo Severini (18721951) in 1910 proved a dierent version of Egorovs theorem involving almost uniform convergence for the partial sums of a series of orthogonal functions [346, 347]. He gives a footnote [346, p. 3] containing (a wrong) statement of Egorovs theorem. This is why we dont call Egorovs theorem the Egorov-Severini (or Severini-Egorov) theorem as its sometimes called. 5.4, 5.5 : The papers [183, 184] and the book [309] give very informative discussions on some of the early denitions of the integral. For example, in 1905, William Henry Young (18631942) published a Riemann-Darboux formulation of Lebesgues theory of integration [419]. See Problem 5 in Exercises 5.4. He partitions the domain of a function just like for Riemann-Darboux integration, except that he (1) considers innite sums (instead of nite sums) and (2) partitions the domain into measurable sets (instead of intervals). Heres what Young said [419]: What would be the eect on the Riemann and Darboux denitions, if in those denitions the word nite were replaced by countably innite, and the word interval by set of points? A further question suggests itself: Are we at liberty to replace the segment (a, b) itself by a closed set of points, and so dene integration with respect to any closed set of points? Going one step further, recognizing that the theory of the content of open sets quite recently developed by M. Lebesgue has enabled us to deal with all known open sets in much the same way as with closed sets as regards the very properties which here come into consideration, we may attempt to replace both the segment and the intervals of the segment by any kinds of measurable sets. Another formulation of Lebesgues theory of integration, due to Young [420, 424] and Frigyes Riesz (18801956) [329], uses monotone sequences and doesnt require measure theory. James Pierpont (18661938) [310] basically replaces measure by outer measure so one can integrate functions over non-measurable sets. 5.6 : To summarize the main convergence theorems: We have proven, in order: MCT Historically, however, the BCT was proved rst, by Lebesgue in his PhD. thesis of 1902 [232], then in 1906 Beppo Levi proved the MCT [243] and independently Fatou proved his lemma [129], nally in 1908 Lebesgue stated the DCT in [236]. Some of the original proofs of these results were in Problem 7 of Exercises 5.5. = Fatous lemma = DCT = BCT.

CHAPTER 6

Some applications of integration


This chapter is devoted to a few applications Lebesgues integral. We start with . . . 6.1. Practice with the DCT and its corollaries In this section we apply Lebesgues DCT to a very small sample of problems; all the applications of the DCT could very well ll thousands of pages. In this section, we use the fact that any Riemann integrable function is Lebesgue integrable and the two integrals are equivalent, which well prove in Section 6.2. In particular, we shall freely use standard facts concerning the Riemann integral, for instance, the fundamental theorem of calculus and change of variables. 6.1.1. The probability integral and some of its relatives. As a rst application of the DCT machinery (or rather, its corollaries on continuity and dierentiation of integrals, Theorems 5.34 and 5.35), we prove the probability integral formula 2 x 2 or ex dx = , e dx = 2 R 0 which we already studied back in Section 2.5. Let us put I = R, X = [0, ), and f (t, x) = ex in Theorems 5.34 and 5.35. Let M = compute in a moment. Note that |f (t, x)| M ex
2 2

tx

es ds

0 s2 0 e

ds, which is some number that well


2

and |t f (t, x)| = |x e(1+t

) x2

| | x e x |,

where we used the fundamental theorem of calculus to compute t f (t, x), and ex 2 and x ex are both integrable functions on X . Therefore, by the continuity and dierentiation theorems, the function

F (t) :=
0

f (t, x) dx

is a continuously dierentiable function of t R. Moreover, we have F (t) =


0

t f (t, x) dx =
0

x e(1+t
2

) x2

dx = 1 1 . 2 (1 + t2 )

1 e(1+t )x 2 (1 + t2 )

x = x=0

309

310

6. SOME APPLICATIONS OF INTEGRATION

1 (t) + c for some constant c. Since f (0, x) = 0 and tan1 (0) = Hence, F (t) = 1 2 tan 1 0, we must have F (t) = 2 tan1 (t). Now using the formula for f (t, x), observe that 2 for each x (0, ), and hence a.e. on [0, ), we have lim f (t, x) = M ex . Thus, t by the continuous dominated convergence theorem, we have 2 1 = tan1 () = lim F (t) = M ex dx = M 2 . t 4 2 0 This implies that M = /2, which is to say, 2 ex dx = 2 0

as we set out to prove. What might be amazing is that ex cannot be integrated 2 in nite terms, meaning that ex does not have an antiderivative expressible as a nite sum of familiar functions (such as rational functions, logarithms, exponen2 tial functions, trig functions, etc.) so there is no way to evaluate 0 ex dx by nding an antiderivative and using the fundamental theorem of calculus. In youre 2 interested in proving that ex cannot be integrated in nite terms, the articles 1 [207, 267, 334] will show you. Lets consider another example (called a cosine transform):

G(t) =
0
2

ex cos tx dx,

t R.
2 2

If we put f (t, x) = ex cos tx, then t f (t, x) = x ex sin tx, so so by the continuity and dierentiation theorems it follows that G(t) is a continuously dierentiable function of t R. Moreover, integrating by parts, we have G (t) =
0 x = x=0

|f (t, x)| ex

and |t f (t, x)| |x| ex ,

x ex sin tx dx t 2
0
2 t ex cos tx dx = G(t). 2 2

2 1 = ex sin tx 2

t The solution to the dierential equation G (t) = 2 G(t) is G(t) = Cet /4 for some 2 constant C . To evaluate the constant, we set t = 0 and use that 0 ex dx = /2 2 to nd C = /2. Thus, G(t) = /2 et /4 , or 2 t2 /4 ex cos tx dx = e . 2 0

Replacing x with ax and t with t/a where a > 0, we obtain the interesting result:

(6.1)
0

eax cos tx dx =

1 2

t2 /4a e , a

a > 0 , t R.

1If you cant wait, the basic reason follows from a theorem of Joseph Liouville (18091882), who in 1835 proved that if f (x) and g (x) are rational functions, where g (x) is not constant, then f (x)eg(x) can be integrated in nite terms if and only if there exists a rational function R(x) such that f (x) = R (x) + R(x)g (x). Using Liouvilles theorem, its a good exercise to try and show 2 that ex cannot be integrated in nite terms. If you get stuck, see [267, p. 300].

6.1. PRACTICE WITH THE DCT AND ITS COROLLARIES

311

6.1.2. The probability integral and the Stirling, Wallis formulas. Recall from Section 2.5 that Wallis formula, named after John Wallis (16161703) who proved it in 1656, is given by 2n 2n 2 2 4 4 6 6 8 8 10 10 = = , 2 2n 1 2n + 1 1 3 3 5 5 7 7 9 9 11 n=1 which can also be written in the equivalent form (6.2) 1 = lim n n
n

k=1

2k . 2k 1

Stirlings formula, although it really should be called de Moivres formula, named after James Stirling (16921770) who published it in 1730, is the asymptotic formula n! n 2n e
n

2n nn en ,

where means that the ratio of n! and 2n nn en approaches unity as n . Using the DCT several times we shall prove the following interesting theorem. Equivalence of probability integral, Stirling and Wallis Theorem 6.1. The probability integral formula, Stirlings formula, and Wallis formula are equivalent in the sense that each one implies the other. In particular, Stirlings and Wallis formulas hold because we proved the probability integral formula. We shall break up this proof into two lemmas, the rst lemma where we show the equivalence of the probability integral and Wallis, then the probability integral and Stirling. Lemma 6.2. The probability integral holds if and only if Wallis formula holds.
Proof : The idea is to start with the integral (6.3)
0

1 + x2

dx =

(2n 3)(2n 5) 3 1 = 2 (2n 2) (2n 4) 4 2 2

n1 k=1

2k 1 . 2k

There are many ways to prove this formula; perhaps one of the quickest is to use the dierentiation theorem as youll do in Problem 5 (see Problem 2 in Exercises 2.5 for another proof). We now take n in (6.3). However, before doing this, lets replace x by x/ n in the integral to get
0

1 + x2

1 dx = n
n

1+

x2 n

dx,

therefore,
0

1+

x2 n

dx =

2k 1 n . 2 2k k=1 x2 n
n

n1

To nish the proof of this lemma, we shall apply the DCT to


0

fn (x) dx ,

where fn (x) =

1+

312

6. SOME APPLICATIONS OF INTEGRATION

By the well-known limit limn 1 + that


n

t n n

= et for any real number t, we see x2 n


n

lim fn (x) = lim

n
2

1+
n

= e x .

(eg. using the binomial theorem), we Observe that if we multiply out 1 + x n obtain n x2 (6.4) 1+ = 1 + x2 + junk 1 + x2 , n where junk is some unimportant nonnegative expression. Hence, 0 fn (x) =

1+

x2 n

1 =: g (x). 1 + x2

Since g (x) is integrable over [0, ), the DCT applies, and we see that (6.5) Thus,
n n

lim

1+

x2 n

dx =
0

ex dx.

lim

n 2

n1 k=1

2k 1 = 2k

ex dx.

From this formula and (6.2) it follows that the probability integral and Wallis formula are equivalent.

Here we see the beauty of the DCT: the limit equality (6.5) followed almost without any eort from the DCT. If we didnt know the DCT, but just had knowledge about the Riemann integral, the proof of (6.5) would take a lot longer, but it can be done. See Problem 1 for a very elementary (but long) proof of (6.5). If you look at that problem youll see that Lebesgues integral really does simplify life. The DCT will also come into play in the Leonhard Eulerproof of Lemma 6.4 below. But rst, lets take a very short detour to study (17071783). the gamma function. This function was introduced in 1729 by Leonhard Euler (17071783) and is dened, for each x > 0, by

(x) :=
0

tx1 et dt.

The gamma function generalizes the factorial function as we now show. Theorem 6.3. The gamma function has the following properties: (1) = 1, (x + 1) = x (x) for any x > 0, (n + 1) = n! for any n = 0, 1, 2, . . .. x 2 1 dx = . 2 =2 0 e
0

(1) (2) (3) (4)

Proof : We have (1) =

et dt = et

t= t=0

= 1,

which proves (1). To prove (2), we integrate by parts, (x + 1) =


0

tx et dt = tx et

t=0

+x
0

tx1 et dt.

6.1. PRACTICE WITH THE DCT AND ITS COROLLARIES

313

Since x > 0, the rst term on the right vanishes, and since the integral on the far right is just (x), Property (2) is proved. If n = 0, then (0 + 1) = (1) = 1 by (1) and 0! is (by denition) 1, so (3) holds for n = 0. If n is a positive integer, then using (2) repeatedly, we obtain Finally, making the change of variables t = x2 , we see that 1 2 =
0

(n + 1) = n (n) = n (n 1) (n 2) = n (n 1) 2 1 (1) = n!. t1/2 et dt =


0

x1 ex 2x dx = 2
0

ex dx =

Property (3) allows us to use the gamma function to dene the factorial for any x > 1 via x! := (x + 1); when x = 0, 1, 2, . . ., this reduces to the usual denition by (3) above. In particular, by Properties (2) and (4), 1 1 1 1 != = +1 = . 2 2 2 2 2 Using Property (2) one can show that the factorial of any half-integer is a rational multiple of . (This begs the question: What does (dealing with circles) have to do with factorials of half-integers? Beats me, but it does!) Lemma 6.4. The probability integral holds if and only if Stirlings formula holds.
Proof : Here, we follow [305]. Making the change of variables t = u2 we have n! = (n + 1) =
0

tn et dt = 2
0

u2n+1 eu du.

Thus, n! en n
n+ 1 2

=2 =2

en n
n+ 1 2 0

u2n+1 eu du
2n+1

If we make the substitution u = u n Hence, n! en n where


n+ 1 2 2n+1

u n

enu du.

n + x, we get and enu = en(


2

x 1+ n
n

2n+1

n + x )2

= e 2

nxx2

=2

x 1+ n
x n 2n+1

2n+1

e2x

n x 2

dx = 2
R

fn (x) dx,

We shall apply the DCT to 2 we write x 1+ n


2n+1

1+ fn (x) = 0 e2x

e2x

n x 2

for else.

nx<

fn (x) dx. We rst nd lim fn (x). To this end,


n

n x 2

= =

x 1+ n x 1+ n

x 1+ n x 1+ n

2n

e2x
n

n x 2

e x

2 n

e x

314

6. SOME APPLICATIONS OF INTEGRATION

Observe that x 1+ n Since

e x

=e

x n log 1+

x n

x lim n log 1 + n n

x n = lim

log 1 +
n

x n 1 n

x n

is of the form 0/0, a calculus exercise using lH ospitals rule shows that this limit equals x2 /2. Therefore,
n

lim

x 1+ n

e x

= e x

/2

From this it follows that


n

lim fn (x) = lim

x 1+ n

x 1+ n

e x

2 n

ex = 1 ex ex = e2x .

We now check that |fn (x)| = fn (x) is bounded by an integrable function. To this end, observe that since 1 + t et for all t R (can you prove this?), it follows that 0 fn (x) e
x n

2n+1

e2x

nx2

=e =e

x 2 nx+
x x 2 n 2

e2x

nx2

e|x|x ,
x since |x| for all x R and n N. We leave you to check that g (x) := e|x|x n is integrable over R. To conclude, we have shown that the hypotheses of the DCT are satised and hence 2 2 n! en lim fn (x) dx = 2 e2x dx = 2 ex dx, 1 = 2 lim n nn+ 2 n R R R where to get the last equality we replaced x with x/ 2. From this formula, it follows that the probability integral holds if and only if Stirlings formula holds.
2

6.1.3. The fundamental theorem of algebra. The fundamental theorem of algebra (FTA) answers the following question: Does every nonconstant polynomial have a root? Explicitly, let p(z ) be a polynomial of the complex variable z of positive degree n: p(z ) = an z n + an1 z n1 + + a1 z + a0 , where n 1, an = 0, and the ak s are complex coecients. Then the question is if there exists a z0 C such that p(z0 ) = 0. The FTA says the answer is yes: The fundamental theorem of algebra Theorem 6.5. Any nonconstant polynomial with complex coecients has a root in the complex plane. We shall present several proofs of the FTA [254, 255]. To do so, we need the following dierentiation fact. Let f (z ) be a rational function of z C. Writing z

6.1. PRACTICE WITH THE DCT AND ITS COROLLARIES

315

This function is well-dened for all complex z because p(z ) is by assumption never zero. Consider the function
2

in polar coordinate form, z = r ei , consider the function f (rei ), a function of r and . In Problem 13 you will prove that2 f (rei ) = ir f (rei ). (6.6) r Now let p(z ) = z n + an1 z n1 + + a0 , a polynomial with complex coecients with n 1, and assume, by way of contradiction, that p(z ) has no roots. (Dividing p(z ) by an if necessary we may assume that an = 1.) We shall derive a contradiction in three ways, each using some variant of the following three step recipe: Proving the FTA in three easy steps. Step 1: Dene a function F (r) using the polynomial p(z ). Step 2: Show that F (r) = 0, so F (r) = C where C is a constant. Step 3: Show that C = 0 and C = 0. Contradiction. Thus, our original assumption must have been in error and the polynomial has a root. Proof 1: Consider the rational function zn zn . = n f (z ) = n p(z ) z + an1 z 1 + + a0

F (r) =
0

f (rei ) d.

Using the dierentiation theorem (whose assumptions we leave you to check), it follows that F (r) is a continuously dierentiable function of r R and by (6.6) we see that for r = 0, 1 2 f (rei ) d = f (rei ) d r ir 0 0 =2 1 i = f (re ) ir =0 1 1 i2 f (re ) f (rei0 ) = [f (r) f (r)] = 0, = ir ir where we used that e2i = ei0 = 1. Hence, F (r) = C = a constant. Setting r = 0 and using that f (0) = 0, we obtain F (r) =
2 2

F (0) =
0

f (0) d = 0,

so C = 0. On the other hand, by denition of f (z ), we have zn 1 f (z ) = n = , z + an1 z n1 + + a0 1 + an1 /z + + a0 /z n

so f (rei ) 1 as r . From this and the Continuous Dominated Convergence Theorem (whose assumptions we leave you to check), we see that
2

F (r) =
0

f (rei ) d

1 d = 2
0

as r .

Therefore, C = 2 , implying that 0 = 2 , an absurdity. Thus, our original assumption that p has roots must have been false.
2These equations are the polar coordinate Cauchy-Riemann equations.

316

6. SOME APPLICATIONS OF INTEGRATION

The next two proofs3 and are appropriate for those who want to avoid complex numbers as much as possible. To do so, we need the following real counterpart to Equation (6.6). Given a rational function f (z ), write f (rei ) in terms of its real and imaginary parts: (6.7) f (rei ) = u(r, ) + i v (r, ) , where u and v are real-valued functions. Then, see Problem 13, (6.6) is equivalent to u u v v =r and = r . (6.8) r r The details of the following Proofs 2 & 3 are left to Problem 13. Proof 2: As before, assume that p(z ) is never zero and consider the rational 1 function f (z ) = p( z ) . In polar coordinates, as in (6.7) we can write 1 = u(r, ) + i v (r, ) , p(rei ) where u and v are real-valued functions. Here are three steps to get a contradiction:
2

(i) Dene G(r) =


0

u(r, ) d. Using (6.8), show that G (r) = 0. This shows

that G(r) = C = a constant. (ii) Take r = 0 in G(r) to show that C = 0. (iii) Now take the limit of G(r) as r to show that C = 0. Contradiction. For our last proof we assume that p(z ) has real coecients ; this loses no generality because we can replace p(z ) by another polynomial P (z ) with real coecients that has a root if and only if p(z ) has a root (see Part (c) of Problem 13). Proof 3: As usual, assume that p(z ) is never zero and consider the rational function f (z ) = [p(1 z )]2 and write, as in (6.7), 1 = u(r, ) + i v (r, ) , [p(rei )]2 where u and v are real-valued functions. We shall deviate slightly from our three step recipe for the FTA and give another three step recipe!

(i) Let H () =
0

u(r, ) dr. Using (6.8), show that H (0) = 0 and H () =

H (). Solving this ordinary dierential equation, show that H () = a cos for some constant a. (ii) Take = 0 to show that a > 0. (iii) Finally, take = to show that a < 0. Contradiction. Weve saved for last my favorite application of the DCT and its aliates: 6.1.4. Tannerys theorem and Eulers sum. The (complex) exponential function exp : C C is the function dened by

exp(z ) :=
k=0

zk , k!

for z C.

3I presented these proofs at an MAA meeting in Florida in 2004 [255]. A professor in the audience said that they smell of Liouvilles theorem (from complex analysis); if you know Liouvilles theorem, can you also smell it in the proof?

6.1. PRACTICE WITH THE DCT AND ITS COROLLARIES

317

(Its easy to see that the series converges for all z C.) The fruit of the exponential function includes trigonometry (sines and cosines), growth and decay (e.g. compounded interest), probability (e.g. the normal distribution), and so forth, including what might be called the most beautiful formula in all of mathematics: ei + 1 = 0. Let z C and let {zn } be a sequence of complex numbers with zn z . Then we claim that4 (6.9) exp(z ) = lim 1+ zn n
n

To prove this, assume n 2 and expand (1 + zn /n)n using the binomial theorem: zn 1+ n
n n

=
k=0

k n zn = 1 + zn + k nk

ank ,
k=2

k k where ank = n k zn /n . Well let the reader do the math here to expand the binomial and show that ank can be written as

(6.10)

ank =

1 k!

1 n

2 n

k1 n

k zn .

Thus, the right-hand side of (6.9) is lim 1+ zn n


n n

= 1 + z + lim

ank .
k=2

We are now tempted to exchange the limit with the summation:


n

(6.11)

lim

ank =
k=2 k=2

lim ank . 1 k z , k!

so if (6.11) were in fact valid, we could conclude that lim 1+ zn n


n n

From the formula for ank in (6.10), we see that for xed k N, lim ank =
n

=1+z+
k=2

zk =: exp(z ). k!

This proves (6.9), provided that the interchange (6.11) were valid! Jules Tannerys (18481910) theorem [379, p. 292] will imply that the interchange is indeed valid.

4Youve certainly seen at one time (in the lH opitals rule section of calculus) a real number x n . If you knew complex logarithms, you version of (6.9): If x R, then exp(x) = limn 1 + n can prove (6.9) using a lH opital type argument, but well prove it using Tannerys theorem.

318

6. SOME APPLICATIONS OF INTEGRATION

Tannerys series theorem


n Theorem 6.6. For each natural number n, let k=1 ank be a nite sum of real numbers where mn as n . If for each k , limn ank exists and there is a convergent series k=1 Mk such that |ank | Mk for all k, n, then

mn

lim

ank =
k=1 k=1

lim ank .

Proof : Consider the measure space (N, P (N), #) where # is the counting measure. Since all subsets of N are measurable, it follows that all functions f : N R are measurable. Moreover, one can check that (see e.g. Problem 10 in Exercises 5.5) f d# =
n=1

f (n)

for any integrable function f : N R. For each n, let fn : N R be the function fn (k) = ank 0 if 1 k mn otherwise.

Also, let g : N R be the function dened by g (k) = Mk for each k N; then because the series k=1 Mk converges, it follows that g : N R is integrable. Now by assumption, the limit limn fn (k) exists for each k N, and |fn (k)| g (k) for all n, k N. Therefore, by the DCT we have
n

lim

fn d# =
mn k=1

lim fn d#.
k=1

The left side is just limn This completes the proof.

ank while the right side is

limn ank .

We remark that Tannerys theorem can be proved without the DCT; the point here is to show that Tannerys theorem is just a special case of the DCT when the measure space is the natural numbers. Tannerys theorem has many applications (see Problem 9 for many examples), such as to our original question involving (6.11) and the exponential function (see Problem 9). Another application deals with the Basel problem. The Basel problem is one of my all-time favorite math problems. We begin our story with the Italian Pietro Mengoli (16251686) who in his 1650 book Novae quadraturae arithmeticae, seu de additione fractionum discussed the sum of the reciprocals of the squares, but admitted defeat in nding the sum: 1 1 1 1 = 1 + 2 + 2 + 2 + = ? 2 n 2 3 4 n=1 Heres the original Latin:5
5It took 5 years and hours of work just to nd this quite elusive original source, which is why I display it so proudly . I thank Rachele Delucchi for tracking down Mengolis book in the library at ETH Zurich and to Emanuele Delucchi for translating the Latin.

6.1. PRACTICE WITH THE DCT AND ITS COROLLARIES

319

and heres an English translation:


Having concluded with satisfaction my consideration of those arrangements of fractions, I shall move on to those other arrangements that have the unit as numerator, and square numbers as denominators. The work devoted to this consideration has bore some fruit - the question itself still awaiting solution but it [the work] requires the support of a richer mind, in order to lead to the evaluation of the precise sum of the arrangement [of fractions] that I have set myself as a task.

Later, we see a plea by the famous Jacob (Jacques) Bernoulli (16541705) on page 254 of his 1689 book Tractatus de Seriebus Innitis.6 In this book Bernoulli evaluates the sums of many series but he was unable to evaluate the sum of the reciprocals of the squares and heres what he says:
And thus with that proposition we can evaluate the sums of the series whose denominators are dierences of triangular numbers, or dierences of squares. With XV we can do this also when the denominators 1 1 1 1 1 +3 +6 + 10 + 15 + ), but are pure triangular numbers (as in 1 it is worth pointing out that if the denominators are pure squares (as 1 1 1 in 1 +1 +1 + 16 + 25 + ) this computation is more dicult than 4 9 one would expect, even if we can easily see that it converges because it is manifestly smaller than the previous sum. If someone will nd out and communicate to us what has escaped our considerations, he will have our deep gratitude.

The problem to nd the sum of the reciprocals of the squares became known as the Basel problem after Bernoullis town Basel, Switzerland. About 46 years after Bernoullis plea, Leonhard Euler (1707-1783) solved the Basel problem. Eulers rst published attempt on the Basel problem is De summatione innumerabilium progressionum (The summation of an innumerable progression) [121], presented to the St. Petersburg Academy on March 5, 1731, where he estimates the sum of the reciprocals of the squares: 1 1 1 1 = 1 + 2 + 2 + 2 + 1.644934, 2 n 2 3 4 n=1 to six decimal places. This equals 2 /6 = 1.644934066848 . . . to six decimal places,7 which Euler no doubt realized (Euler was a phenomenal human calculator), so Euler now knew what the sum should equal. He found this approximation by ingeniously
6

Bernoullis book is available at http://www.kubkou.se/pdf/mh/jacobB.pdf.

7Actually, we found that James Stirling computed the value of the series to 17 decimal places

in Example 1 after Proposition 11 in his 1730 book Methodus Dierentialis [390].

320

6. SOME APPLICATIONS OF INTEGRATION 1 n=1 n2

rewriting the sum

as8

1 1 = (log 2)2 + . 2 2 n n 2n1 n=1 n=1 The advantage is that Euler knew log 2 to many decimal places and the sum on the right converges much faster than the original sum. For example, taking only 17 terms of the new series gives the approximation 1.64493402 . . . for the right-hand side, while (using the integral test remainder estimate) one needs around 2 million terms of the original series to get six places of accuracy to 2 /6! Now has to do with circles and circles with trig functions. Four years later, Euler wrote a new paper, De summis serierum reciprocarum (On the sums of series of reciprocals) [122], which was read in the St. Petersburg Academy on December 5, 1735, where he gave three proofs of the solution of the Basel problem using trigonometry. Heres the beginning of Eulers paper [27]:
So much work has been done on the series of the reciprocals of powers of the natural numbers, that it seems hardly likely to be able to discover anything new about them. For nearly all who have thought about the sums of series have also inquired into the sums of this kind of series, and yet have not been able by any means to express them in a convenient form. I too, in spite of repeated eorts, in which I attempted various methods for summing these series, could achieve nothing more than approximate values for their sums or reduce them to the quadrature of highly transcendental curves; the former of these is described in the next article, and the latter fact I have set out in preceding ones. I speak here about the series of fractions whose numerators are 1, and indeed whose denominators are the squares, or the cubes, or other ranks, of the natural numbers; of this type are 1 1 1 1 1+ 1 +1 + 16 + 25 +etc., likewise 1+ 1 + 27 + 64 +etc. and similarly for 4 9 8 higher powers, whose general terms are contained in the form x1 n. I have recently found, quite unexpectedly, an elegant expression for the 1 sum of this series 1+ 1 +1 + 16 +etc., which depends on the quadrature 4 9 of the circle, so that if the true sum of this series is obtained, from it at once the quadrature of the circle follows. Namely, I have found for six times the sum of this series to be equal to the square of the perimeter of a circle whose diameter is 1; or by putting the sum of this series = s, then 6s will hold to 1 the ratio of the perimeter to the diameter.

In other words, Euler proved that 1 1 1 2 1 = 1 + + + + = . n2 22 32 42 6 n=1 To explain Eulers third proof of this formula, recall as a consequence of the fundamental theorem of algebra (see Part (a) of Problem 13), we can always factor an nth degree polynomial p(x) into n linear factors: p(x) = a(r1 x)(r2 x) (rn x) where a is a constant and where r1 , . . . , rn are the roots of p. Assuming the roots
8Exercise: First, integrate by parts to get 1/2 log(1 t)/t dt = (log 2)2 + 1 log(1 t)/t dt. 0 1/2

Next, use the Taylor series for log(1 t) and integrate term-by-term to prove Eulers result.

6.1. PRACTICE WITH THE DCT AND ITS COROLLARIES

321

are not zero and that p(0) = 1, we can rewrite the factorization as p(x) = 1 x r1 1 x r2 1 x rn
2

.
4

sin x x x 9 x Now the function p(x) = sin x has Taylor series x = 1 3! + 5! , so in some sense p(x) can be thought of as an innite degree polynomial. It has the roots

and it satises p(0) = 1. Hence, by analogy we should be able to write sin x x x x x x x 1+ 1 1+ 1 1+ , = 1 x 2 2 3 3 that is, (6.12) sin x = x 1 x2 2 x2 2 1 x2 22 2 1 x2 32 2 .

, , 2, 2, 3, 3, . . . ,

Now if you multiply all the terms on the right together, you will get 1 1 1 1 + 2 + 2 + 12 2 3 +
sin x x

where involves x4 , x6 , and so forth. Since we already know that 2 x4 2 1 x 3! + 5! also, comparing coecients of x we see that 1 2 1 1 1 + 2 + 2 + 2 1 2 3
2

1 . 3!

1 Simplication shows that n=1 n 2 = 6 , just as Euler said. Now this proof is pretty but not rigorous (even to Euler); the main problem is Eulers sine expansion (6.12), which was derived by applying a fact for polynomials to the non-polynomial function sin x x . So, in 1743 he gave a perfectly rigorous argument to close the Basel problem [123]. See Problem 18 in Exercises 5.5 for this solution. In Problems 11 and 12 youll give a perfectly sound derivation of Eulers sine product basically taken from Eulers famous book Introductio in analysin innitorum, volume 1 (Introduction to analysis of the innite) [127, p. 124128]. The proof uses Tannerys theorem. Heres an interesting proof of Eulers formula [253, 272], that uses the continuity and dierentiation theorems. We only outline the argument leaving the details to Problem 7. Consider the function tan1 (tx) dx. (6.13) F (t) = 1 + x2 0

(i) Show that F is a continuously dierentiable function of t R. Moreover, F (t) = log t t2 1

1 when t = 1 and when t = 1, we have F (1) = 2 . 2 ; hence, by the fundamental theorem of (ii) Show that F (0) = 0 and F (1) = 8 calculus, 1 2 log t = F (1) F (0) = dt. 21 8 t 0 9Recall that sin x = x x3 + x5 . 3! 5!

322

6. SOME APPLICATIONS OF INTEGRATION 1 1t2

(iii) Finally, expanding into a geometric series,


1 0

2k k=0 t ,

prove that

log t dt = t2 1

k=0

t2k log t dt =
k=1 1 n=1 n2

1 . (2k + 1)2

(iv) Now breaking up the sum E := into sums when n is even (when n = 2k ) and odd (when n = 2k + 1), that is, writing

E=
k=1

1 + (2k )2
2 8

k=0

1 1 = 2 (2k + 1) 4

k=1

1 + k2

k=0

E 1 = + 2 (2k + 1) 4

k=0

1 , (2k + 1)2

use

1 k=0 (2k+1)2

to prove that E =

2 6 .

Exercises 6.1. 1. In this problem we give a very elementary proof of (6.5), where elementary means that it uses material from a rst-year college calculus course. (i) Prove that for any x [0, ), we have x x2 log(1 + x) x. (ii) Given T > 0, prove that x T2 x, n log 1 + n n where the rst holds for 0 x T and the second one holds for all x 0. In particular, replacing x by x2 , taking exponentials of these inequalities, then rearranging, show that x e x
2

1+

x2 n

T2 n

e x , T.

where the rst holds for all x 0 and the second holds for 0 x (iii) Given T > 0, write
0

x2 1+ n

dx =
0

1+

x2 n

dx +

1+

x2 n

dx.

Using that (1 +

x2 n ) n

1 + x2 x2 from (6.4), show that


T

1+

x2 n

1 dx . T
T2 n

(iv) Given T > 0, using (ii) and (iii), show that


0

ex dx

1+

x2 n

dx e

ex dx +

1 . T

Setting T = n1/4 and then taking n , prove (6.5). 2. Many formulas can be written more succinctly using the gamma function. For instance, /2 let S = 0 sin x dx. Prove that S+1 = S1 , S0 = , S1 = 1. +1 2 Next, prove that for any n N, we have /2 n+1 n+1 n 2 2 ; that is, . sin x dx = Sn = n 2 2 n + 1 + 1 0 2 2 The same formula holds for 0 cosn x dx (just make the change of variables x /2 x). Without the gamma function, the formula for Sn must be broken up into even and odd n; see Problem 2 of Exercises 2.5.
/2

6.1. PRACTICE WITH THE DCT AND ITS COROLLARIES

323

3. (Gamma function version of Stieltjes method) In this problem we give a gamma function version of Stieltjes method explained in Problem 6 of Exercises 2.5. (i) Prove the following identity: For all x > 0, we have 1 2 < (x) (x + 1). x+ 2 Suggestion: Fix x > 0 and consider the polynomial 1 p(t) = at2 + bt + c, where a = (x) , b = 2 x + , c = (x + 1), 2 and show that p(t) > 0 for all t R. Subhint: Notice that p (t ) =
0

r x (x + r 1/2 )2 er dr.

(ii) Using the inequality in (i), show that for all x > 0, we have 1 2 1 1 x+ < x+ (x) (x + 1). (x + 1)2 < x + 2 2 2 (iii) Let x = n N in the inequalities in (ii) to obtain 3 2 1 2 1 2 2n + 1 2n + 1 2n 1 2 < 1< . 2 2n 4 2 2 2n Dont forget that n +
1 2

2n1 2n3 2 2

(iv) Finally, use Wallis formula on (iii) to determine the probability integral. 4. (The probability integral) Here are some more proofs of the probability integral. (a) (Cf. [426]) Show that F (t ) =
0

1 2

1 2

et(1+x ) dx 1 + x2

is continuous for t 0 with F (0) = /2 and limt F (t) = 0, and continuously dierentiable for t > 0, and use F (t) to derive the probability integral. (b) (Cf. [418, p. 273]) Heres a neat one. Show that
t

F (t ) =
0

ex dx

+
0

et(1+x ) dx 1 + x2

is continuously dierentiable for t R with F (t) 0. Use this to derive the probability integral. 5. In this problem we evaluate some integrals that occur in applications. (a) Prove that for any t > 0, we have
0

(t2 + x2 )1 dx =

1 . 2 t

Show that dierentiating under the integral sign is allowable and prove that for any n N,

(t2 + x2 )n dx =
0

Conclude that

1 3 (2n 3) 1 . 2 2 4 (2n 2) t2n1 2


n1 k=1

(1 + x2 )n dx =

2k 1 ; 2k

knowing this formula is one of the main ingredient in the proof of Wallis formula. 2 t2 /4a (b) Using the formula 0 eax cos tx dx = 1 e , where a > 0 and t R 2 a (derived in (6.1)), show that
0

xeax sin tx dx =

t 4a

t2 /4a e , a

a > 0, t R.

324

6. SOME APPLICATIONS OF INTEGRATION

(c) Show that

etx dx = 1/t. Using this fact, prove that


0

xn etx dx =
1 2

n! , tn+1

n = 0, 1, 2, . . . .

(d) Show that

0 0

etx dx =
2

/t for t > 0. Using this fact, prove that 1 3 (2n 1) , 2n+1 n = 0, 1, 2, . . . .

x2n ex dx =
2

(e) Evaluate

x et x dx, then prove that


0

x2n+1 ex dx =

n! , 2

n = 0, 1, 2, . . . .

(f) Evaluate

1 0

xt dx where t > 1, then prove that


1 0

xt (log x)n dx =

(1)n n! , (t + 1)n+1

n = 0, 1, 2, . . . .

6. Using a formula in Problem 5, prove the following interesting formulas:


1 0

1 1 dx = n xx n n=1

and
0

xx dx =

n=1

(1)n1 . nn

7. Using the continuity and dierentiation theorems, we evaluate some integrals. (a) Starting with (6.13), prove Eulers formula. Suggestion: To show that F (t) = log t/(t2 1), use partial fractions. (b) ([213], [33]) In this problem we solve one of the 1985 Putnam problems. Let a > 0 and show that F (t ) =
0

x1/2 eaxtx

dx

is continuous for t 0 with F (0) = /a and continuously dierentiable for 2 at t > 0. Compute F (t) and use it to prove that F (t) = /a e . Suggestion: After dierentiating F (t), make the change of variables x = t/(au) in the resulting integral. (c) Show that F (t ) =
0

e x

t2 /x2

dx

is continuous for t 0 and continuously dierentiable for t > 0. Compute F (t) 2t and use it to prove that F (t) = 2 e for t 0. In fact, can you see a shorter way to do this problem using the Putnam problem above? (d) Let a > 0 and show that F (t ) =
0

etx

sin ax dx x

is continuously dierentiable for t > 0. Compute F (t) and use it to prove that x e sin x F (t) = /2 tan1 (t/a). Conclude that dx = . x 4 0 (e) Show that F (t ) =
0

1 etx dx x2

is continuous for t 0 and continuously dierentiable for t > 0. Compute F (t) and use it to prove that F (t) = t.

6.1. PRACTICE WITH THE DCT AND ITS COROLLARIES

325

(f) Show that F (t ) =


0

1 cos tx x e dx x

is a continuously dierentiable function of t R. Compute F (t) and use it to 1 prove that F (t) = 2 log(1 + t2 ). 8. The Laplace transform of a measurable function f on [0, ) is L (f )(s) =
0

esx f (x) dx,

n1 n2 1 + . (c) Find lim + + + n n n n (d) Verify that ank in (6.10) satises the hypothesis of Tannerys theorem. Suggestion: Since {zn } converges, the zn s are bounded in absolute value, say by a constant C . Show that |ank | C k /k!. 10. (Tannerys product theorem) Prove the following result [379, p. 296]: For each n natural number n, let m k=1 (1 + ank ) be a nite product where mn as n . If for each k, limn ank exists, and there is a convergent series k=1 Mk of nonnegative real numbers such that |ank | Mk for all k, n, then n n
n mn

dened for those s R such that esx f (x) is integrable in x. For instance, according to Problem 5c, we have L (xn )(s) = n!/sn+1 , which is valid for s > 0. (a) If f is Lebesgue integrable on [0, ), prove that L (f )(s) exists for all s 0 and is a continuous function of s [0, ) such that lims L (f )(s) = 0. n n+1 (b) Let f (x) = converges n=0 an x be a power series such that n=0 |an | n!/s for s > c for some c 0. Prove that the Laplace transform of f exists for all s > c n+1 for all s > c. and L (f )(s) = n=0 an n!/s n+1 | converges for s > c for some c (c) Conversely, suppose that n=0 an | n!/s n 0. Prove that f (x) = a x converges for a.e. x [0, ) and L (f )(s) = n n=0 n+1 a n ! /s for all s > c . n n=0 (d) Using (a), nd the Laplace transform of sin x. (Answer: 1/(s2 + 1)). (e) Let F (s) = 1/(s2 1) where s > 1. Using (b), nd a function f with L (f )(s) = F (s) for s > 1. (Answer: sinh(x).) 9. (Tannerys theorem) Here are some problems dealing with Tannerys theorem. 2n (2n)2 (2n)n (a) Find lim . + + + 2 2 n (3n) + 4 (3n)n + 4n 3n + 4 1 1 1 (b) Find lim . + + 2 2 2 n 2 n sin 1 n2 sin 2 n2 sin n
n2 n2 n2 n n n

(6.14)

lim

(1 + ank ) =
k=1

k=1

lim (1 + ank ).

Suggestion: Choose N so that |ank | 1/2 for all k, n where k N (why can choose N mn n such an N ?) and write m k=N (1 + ank ). Show that its k=1 (1 + ank ) = k=1 (1 + ank ) enough to prove the limit (6.14) with k starting from N rather than starting at 1. Then, mn n take the log of m k=N bk (n) where bk (n) = log(1 + ank ). k=N (1 + ank ) to get a sum Show that Tannerys series theorem applies to this sum, then derive Tannerys product theorem. It might be helpful to prove that | log(1 + x)| 2|x| if |x| 1/2. 11. (The Basel problem) Heres Eulers derivation of the sine expansion from Introductio in analysin innitorum, volume 1. (I rst saw this proof reading [159, Sec. 1.5], whose argument we follow.)

326

6. SOME APPLICATIONS OF INTEGRATION

(i) Finding the nth roots of unity, prove that for n odd, z n 1 can be factored as (z n 1) = (z 1)
(n1)/2 k=1

(z e2ik/n )(z e2ik/n ).

(Here we have to assume you have taken a complex variables course.) (ii) Using the identity cos = (ei ei )/2 and given a C and replacing z by z/a in the above formula, show that (z n a n ) = (z a )
(n1)/2 k=1

(z 2 2az cos

2k + a 2 ). n

(iii) Putting z = (1 + ix/n) and a = (1 ix/n), and using trig identities, prove that 1 2i 1+ ix n
n

ix n

=x

(n1)/2 k=1 1 (eix 2i

n2

x2 tan2 (k/n)

Taking n and recalling that sin x = sin x = x lim


(n1)/2 k=1 n

eix ), we obtain .

x2 n2 tan2 (k/n)
k=1

Assume Tannerys product theorem from the previous problem (or prove it if you wish) to show that the right side equals x 12. Heres a proof of Eulers sum by Hofbauer [188]. (i) Show that
1 sin2 x

x2 2 k2

1 4

1 sin2 x 2

1 sin2
(1x) 2

(ii) Using (i) and induction, show that for any n N, 1= 2 4n


2n1 1 k odd

1 sin2
k 2n+1

where we only sum over k odd (that is, when k is even, we dene the kth term to be zero). Note that the n = 1 case is just the formula in (i) with x = 1/2. 2n 1 (iii) Write the formula in (ii) as 1 = k =0 ank for an appropriate ank . Apply Tan2 1 = nerys series theorem to derive the formula k=0 (2k+1)2 , and from this 8 deduce Eulers formula. 13. (The FTA) Here are some problems related to the FTA. (a) If p(z ) is a polynomial of positive degree n, prove that p has exactly n complex roots c1 , . . . , cn counting multiplicities, and show that for some constant a C, p(z ) = a (z c1 )(z c2 ) (z cn ). (b) Fill in the details of Proofs 2 & 3, including proving (6.6) and (6.8). Suggestion: First prove (6.6) holds for f (z ) = z k for any k N, then use the quotient rule to show it holds for any rational function. (c) Heres a complex-version of Proof 3. Let p(z ) = z n + an1 z n1 + + a0 be a polynomial with complex coecients, n 1, and suppose that p has no roots. (i) Let q (z ) = z n + an1 z n1 + + a1 z + a0 , the polynomial whose coecients are the complex conjugates of the coecients of p. Dene P (z ) = p(z ) q (z ) and prove that P (z ) has real coecients and P (z ) has a root if and only if p(z ) has a root. 1 dr and prove that F () is a continuously dier(ii) Dene F () = P (rei ) 0 entiable function of R such that F () = iF (). This is an ordinary dierential equation whose solution is F () = C ei for some constant C .

6.2. LEBESGUE, RIEMANN AND STIELTJES INTEGRATION

327

(iii) Taking = 0 and = , show that both C > 0 and C < 0, a contradiction. (Note that P (t) = |p(t)|2 > 0 for t R.)

6.2. Lebesgue, Riemann and Stieltjes integration In this section we compare Riemann integration to Lebesgue integration. In particular, we characterize Riemann integrable functions as bounded functions that are continuous almost everywhere. 6.2.1. The RiemannStieltjes integral. In elementary calculus you studied the Riemann integral, which was introduced by Bernhard Riemann (18261866) in his 1854 habilitationsschrift Ueber die Darstellbarkeit einer Function durch eine trigonometrische Reihe (On the representation of a function by trigonometric series). 10 In later analysis or probability courses you might have seen the Riemann Stieltjes integral, which is a useful generalization of the Riemann integral introduced by Thomas Stieltjes (18561894) in 1894 [368]. Before blinding you with many technical denitions, it might be helpful to review Stieltjes motivation behind his integrals. Consider point masses m1 , . . . , mN located at points x1 , . . . , xN as seen here:
m1 x1 m2 x2 m3 x3 mN xN

Given p N, the p-th moment of the masses is the sum If p = 1, this is just the center of mass familiar from physics. Stieltjes was studying moments of continuous mass distributions instead of point masses. Consider the interval [0, 1] as a solid rod and let : [0, 1] R be a nondecreasing nonnegative function such that for each x [0, 1], (x) is the mass of the rod segment [0, x]. How would one go about dening the p-th moment of the solid rod [0, 1]? Heres how: Partition the interval [0, 1] into a bunch of subintervals [x0 , x1 ], [x1 , x2 ], . . . , [xN 1 , xN ] where 0 = x0 < x1 < < xN 1 < xN = 1, as seen in Figure 6.1. Then the mass of the k th segment is (xk ) (xk1 ), so
x0 x1 x2 x3 xk1 xk xN
p p xp 1 m1 + x2 m2 + + xN mN .

Figure 6.1. (xk ) is the mass of [0, xk ] and (xk1 ) is the mass of [0, xk1 ], so (xk ) (xk1 ) is the mass of [xk1 , xk ]. choosing a point x k in the k th interval [xk1 , xk ], the sum should be a close approximation to what the true p-th moment of the rod should equal. If we put f (x) = xp , then this sum can be written as
N k=1 p p p (x 1 ) {(x1 ) (x0 )} + (x2 ) {(x2 ) (x1 )} + + (xN ) {(xN ) (xN 1 )}.

f (x k ) {(xk ) (xk1 )},

10 In Germany, one needs a Habilitation to lecture at a German university, one requirement of which is to write a second Ph.D. thesis called the habilitationsschrift. This requirement shows that one can do research after the Ph.D.

328

6. SOME APPLICATIONS OF INTEGRATION

which is nowadays called a RiemannStieltjes sum of f . We now see how to dene the p-th moment of the rod: Just take ner and ner partitions of the rod and if these RiemannStieltjes sums approach a number, which we call the Riemann Stieltjes integral of f , then this number would be (by denition) the p-th moment of the rod. With this background, we now blind you with denitions! Throughout this section we work on a compact interval [a, b] and we x a nondecreasing right-continuous function : [a, b] R. (We assume right-continuity so that the corresponding LebesgueStieltjes set function is a measure, a fact well need later.) A partition of [a, b] is just a set of numbers P = {x0 , x1 , . . . , xN } where a = x0 < x1 < < xN 1 < xN = b. The length of P , denoted by P , is the maximum of the numbers xk xk1 where k = 1, . . . , N . Given a bounded function f : [a, b] R, a RiemannStieltjes sum of f with respect to is a sum of the form
N

S (P ) =

k=1

f (x k ) {(xk ) (xk1 )},

where P is a partition of [a, b] and x k [xk1 , xk ] for each k . This sum depends on P , f , , and the choices of the x s, but to simplify notation we omit these facts k except P . To make precise the idea of taking ner and ner partitions, let P= nondecreasing sequences of partitions of [a, b] whose lengths are approaching zero ;

explicitly, an element of P is a sequence {P1 , P2 , P3 , . . .} of partitions of [a, b] such that P1 P2 P3 and Pn 0. Thus, a sequence {Pn } represents the intuitive idea of adding more and more partition points in such a way that the distance between any two adjacent partition points 0. We say that f is (RiemannStieltjes) integrable with respect to if there is a real number I such that given any sequence {Pn } P , we have
n

lim S (Pn ) = I,

where this limit means that given any > 0 there is a p such that (6.15)
11 and all choices of the intermediate points x k in the RiemannStieltjes sums. The number I is called the RiemannStieltjes integral of f with respect to and we shall denote it by

|I S (Pn )| <

for all n p,

I=
R

f d.

Gaston Darboux (18421917). 11Our sequence denition is equivalent to the traditional - denition:

If (x) = x, then we simply say that f is Riemann integrable. There is another approach to the RiemannStieltjes integral via lower and upper sums, due to Gaston Darboux (18421917), which I think allows

For each > 0 there is a > 0 such that for any partition P of [a, b] with P < , we have |I S (P )| < for any RiemannStieltjes sum corresponding to the partition P . Our sequence denition has the advantage that it gives an ecient proof of Theorem 6.11 via the DCT.

6.2. LEBESGUE, RIEMANN AND STIELTJES INTEGRATION

329

one to prove theorems easier than using the denition (6.15). He introduced this new method for understanding the Riemann integral in his 1875 paper M emoire sur les fonctions discontinues [93]. This paper is really quite remarkable: it was basically a reworking of Riemanns theory of integration from scratch, packed with all the important theorems on Riemann integration. Because of mathematical convenience, we shall introduce Darbouxs approach, which Lebesgue later used in his paper Sur une g en eralisation de lint egrale d enie. If P = {x0 , x1 , . . . , xN } is a partition of [a, b], then we dene simple functions P and uP by
N N

P =
k=1

mk (xk1 ,xk ] ,

uP =
k=1

Mk (xk1 ,xk ] ,

where These lower and upper functions of f have the property that12 (6.16) P f uP

mk = inf {f (x) ; xk1 x xk },

Mk = sup{f (x) ; xk1 x xk }.

on the interval (a, b];

see Figure 6.2 for pictures of P and uP . The lower and upper sums of f with

x0

x1

x2

x3

x4

Figure 6.2. Here f is a linear function. The solid horizontal lines represent uP and the dotted lines P . respect to are the sums
N N

L(P ) =

k=1

mk {(xk ) (xk1 )}, L(P ) = P d ,

U (P ) =

k=1

Mk {(xk ) (xk1 )}. uP d

Observe that U (P ) = where these integrals are Lebesgue integrals with respect to the LebesgueStieltjes measure : M [0, ], dened by (c, d] = (d) (c) on elements of I 1 , and where M denotes the measurable sets.13 This last observation is key to relating RiemannStieltjes integrals to Lebesgue integrals. In order to prove Darbouxs theorem relating lower and upper sums to the RiemannStieltjes integral, we need the following lemma.
12Observe that (a) = u (a) = 0. This is why (6.16) only holds on (a, b] (unless f (a) = 0, P P in which case it holds on all of [a, b]). 13To be precise we should be writing ( ) because the measure : M [0, ] is really the Carath eodory extension of the measure : I 1 [0, ), however, for sake of notational simplicity we drop the .

330

6. SOME APPLICATIONS OF INTEGRATION

Lemma 6.7. If f : [a, b] R is bounded, P and Q are partitions of [a, b], and P Q (so every partition point in P is a partition point in Q), then
Proof : Suppose that the partition Q contains exactly one more point than the partition P . An induction argument proves this lemma when P contains any number of extra points than Q. Let this one extra point be denoted by y and suppose that xk1 < y < xk . If mk = inf {f (x) ; xk1 x xk }, then observe that since inmums get bigger as sets get smaller, mk
x k 1 x y

P Q uQ uP .

inf

f (x )

and

mk

y x x k

inf

f (x ) ,

It follows that P Q . An analogous argument shows that uQ uP . Since Q uQ it follows that P Q uQ uP .

Recall that P= Given {Pn } P , let n and un be, respectively, the lower and upper simple functions of f corresponding to the partition Pn . In view of Lemma 6.7, we have so 1 2 n un u2 u1 , nondecreasing sequences {Pn } of partitions of [a, b] with Pn 0 .

Being monotone sequences, the limits lim L(Pn ) and lim U (Pn ) exist. The following theorem, which we shall call Darbouxs theorem, says that f is RiemannStieltjes integrable if and only if these limits equal the same value for all partitions in P . Darbouxs theorem Theorem 6.8. A bounded function f is RiemannStieltjes integrable with respect to if and only if for any {Pn } P , in this case, lim L(Pn ) and lim U (Pn ) equal the same value for any {Pn } P , namely I = R f d, the RiemannStieltjes integral of f . We leave the proof of Darbouxs theorem for Problem 1. 6.2.2. Lebesgue vs. Riemann integrals. We now characterize Riemann Stieltjes integrable functions and show that any RiemannStieltjes integrable function is also Lebesgue integrable and the two integrals agree. We need two important results. The rst result is an elementary, but perhaps surprising, fact concerning arbitrary monotone functions. Continuity of monotone functions Theorem 6.9. A monotone function is continuous except on a countable set. lim L(Pn ) = lim U (Pn );

L(P1 ) L(P2 ) L(Pn ) U (Pn ) U (P2 ) U (P1 ).

6.2. LEBESGUE, RIEMANN AND STIELTJES INTEGRATION

331

Proof : Since any interval can be written as a countable union of compact intervals (can you prove this?), we may focus on monotone functions dened on a compact interval [a, b]. Let f : [a, b] R be a monotone function; for example, heres a picture of a nondecreasing function:

a x1 x2 x3

Given any point t I , we dene the Jump of f at t := f (t+) f (t) = lim f (x) lim f (x).
x t + x t

Note that the jump is positive if and only if f is discontinuous at t. Now its obvious, at least by considering the above picture, that if we pick nitely many points x1 , x2 , . . . , xN strictly between a and b and add all the magnitudes of the jumps of f at the points x1 , x2 , . . . , xN , we cannot get more than the total height change |f (b) f (a+)|. More precisely, we must have
N

(6.17)
n=1

|f (xn +) f (xn )| |f (b) f (a+)|.

You will prove this obvious fact in Problem 1. Assuming this fact, well prove the result. For each k N, put Dk = {t I ; |f (t+) f (t)| > 1/k}. Since a point t is a discontinuity point of f if and only if the jump of f at t is positive, it follows that the set of discontinuity points of f equals D1 D2 D3 . Since a countable union of countable sets is countable, our result follows if we can prove that each Dk is countable. In fact, xing k N, we claim that Dk is nite. To see this, let x1 , x2 , . . . , xN Dk be strictly between a and b. Then according to (6.17) we have
N

n=1

|f (xn +) f (xn )| |f (b) f (a+)|.

Since each xn belongs to Dk , this implies that


N

n=1

1 |f (b) f (a+)| , k

or

N |f (b) f (a+)|. k

Thus, N k |f (b+) f (a)|, so N is bounded by k |f (b+) f (a)|.

The Let f : [a, b] R be a bounded, let {Pn } P , and let n and un be, respectively, the lower and upper simple functions of f corresponding to the partition Pn . Then by Lemma 6.7, for each x [a, b] we have 1 (x) 2 (x) n (x) un (x) u2 (x) u1 (x), In particular, the sequences n and un are monotone sequences, so the limits P (x) := lim n (x)
n

and uP (x) := lim un (x)


n

332

6. SOME APPLICATIONS OF INTEGRATION n=1

exist for each x [a, b]. Here, the subscript P denotes the union P = all the partitions. We have P uP and if x (a, b], then by (6.16), P f uP .

Pn of

The following lemma is a key ingredient to characterize LebesgueStieltjes integrable functions.

Lemma 6.10. If f : [a, b] R is bounded, {Pn } P , and x (a, b], then P (x) = f (x) = uP (x) f is continuous at x f is left-continuous at x if x /P if x P .

Proof : Consider rst the case x / P . Suppose that f is continuous at x. Let > 0. Then there is a > 0 such that (6.18) Since the lengths of the partitions approach zero, we can choose p such that the lengths of all of the partitions Pn are less than for all n p. Given a partition Pn = {x0 , . . . , xN } with n p there exists a k such that xk1 < x < xk ; |mk f (x)| and |Mk f (x)| . |y x| < = |f (y ) f (x)| < .

(6.19)

then (6.18) implies that Hence,

|n (x) f (x)| and |un (x) f (x)| . Taking n implies that P (x) = f (x) = uP (x) since > 0 was arbitrary. Suppose now that P (x) = f (x) = uP (x) where as above, x / P . Let > 0. Since P (x) = limn n (x) and uP (x) = limn un (x), there exists a p such that |n (x) f (x)| < and |un (x) f (x)| < for all n p. Fix n p and let Pn = {x0 , . . . , xN }. Then (6.19) holds, so n (x) = mk and un (x) = Mk , and hence, that is, f (x) is within of its inf and sup on the interval [xk1 , xk ]. It follows that if > 0 is chosen such that the interval (x , x + ) is contained in (xk1 , xk ), then |y x| < = |f (y ) f (x)| < . Thus, f is continuous at x. Now consider the case x P . In fact, the proof in this case is exactly the same as in the proof when x / P , apart from the following dierences: We can choose p large enough so that in (6.19), x = xk and in the proof of continuity of f we just have to choose > 0 such that the interval (x , x] is contained in (xk1 , xk ] (where x = xk ). Please go through the details if you wish. |mk f (x)| < and |Mk f (x)| < ;

Were now ready to prove the main result of this section, which characterizes RiemannStieltjes integrability in terms of LebesgueStieltjes measures. Characterization of RiemannStieltjes integrability

6.2. LEBESGUE, RIEMANN AND STIELTJES INTEGRATION

333

Theorem 6.11. A bounded function f is RiemannStieltjes integrable with respect to if and only if it is continuous -a.e., that is, the set of discontinuity points of f has -measure zero. When f is RiemannStieltjes integrable, then f is (in the Lebesgue sense) -integrable too, and f d =
R

f d ,

where the right-hand side denotes the Lebesgue integral of f with respect to the measure .
Proof : Let f : [a, b] R be bounded. Below we will need the following fact: is continuous at a point if and only if ({}) = 0. This fact isnt dicult to prove and was shown back in Problem 2 of Exercises 3.5. In particular, since is continuous at a (because : [a, b] R is right-continuous), we have ({a}) = 0. Step 1: We prove that f is RiemannStieltjes integrable with respect to if and only if the following condition holds: By Darbouxs Theorem we know that f is RiemannStieltjes integrable with respect to if and only if for all {Pn } P , Recall that L(Pn ) = lim L(Pn ) = lim U (Pn ). n d and Given any {Pn } P , we have P = f = uP -a.e.

U (P n ) =

un d ,

where n and un are, respectively, the lower and upper simple functions of f corresponding to the partition Pn . Since f is bounded, n and un are bounded, so the Dominated Convergence Theorem implies that lim L(Pn ) = P d and lim U (Pn ) = uP d .

Therefore, f is RiemannStieltjes integrable with respect to if and only if for all {Pn } P , (6.20) P d = uP d .

The equality (6.20) holds if and only if (uP P ) d = 0. This holds, since uP P 0, if and only if uP P = 0 -a.e., or P = uP -a.e. As P f uP on (a, b] and the set {a} has measure zero, it follows that P = uP -a.e. if and only if P = f = uP -a.e. This completes Step 1. Step 2: We now prove that when f is RiemannStieltjes integrable, it is also (Lebesgue) -integrable and the two notions of integral agree. To see this, let {Pn } P and recall from Step 1 that P = f = uP -a.e. Since P and uP are -measurable functions and is complete, it follows that f is -measurable too (see Proposition 5.10). Moreover, since Lebesgue integrals are equal on a.e. equal functions, we have P d = By (6.20) we see that P d =
R

f d =

uP d .

f d =

uP d ,

334

6. SOME APPLICATIONS OF INTEGRATION

so the RiemannStieltjes and the Lebesgue -integral of f agree. Step 3: It remains to prove that f is RiemannStieltjes integrable with respect to if and only if f is continuous -a.e. By Step 1, we just have to show that Note that the direction = follows directly from Part (a) of Lemma 6.10, so we just have to prove the direction =. Assume that P = f = uP -a.e. for any {Pn } P . Since has only countably many discontinuity points, we can choose a sequence {Pn } P such that P , the union of all the partition points, contains no discontinuity points of , except possibly b. Now, P = f = uP -a.e. on [a, b] implies, by Parts (b) and (c) of Lemma 6.10, (6.21) f is continuous -a.e. on [a, b] \ P , f is left continuous -a.e. on P . P = f = uP -a.e. for any {Pn } P f is continuous -a.e.

Now is continuous at each point of P except possibly b, hence as a singleton set consisting of a continuity point of has measure zero, we have: Two cases: (P ) = ({b}) = 0 positive if is continuous at b, if is not continuous at b.

In the rst case, the rst line in (6.21) implies that f is continuous -a.e. on [a, b] (we can drop P since it has measure zero) and our proof is complete in this case. In the second case, according to the second line in (6.21), f must be left continuous at b; that is, f is continuous at b (since f is dened on [a, b], left continuity at b is the same as continuity at b). The rst line in (6.21) then implies that f is continuous -a.e. on [a, b] \ (P \ {b}). The set P \ {b} has -measure zero, so f is continuous -a.e. on [a, b]. This completes our proof.

As a corollary, we get Lebesgues characterization of Riemann integrability. Corollary 6.12. A bounded function on a nite interval is Riemann integrable if and only if it is continuous a.e. (with respect to Lebesgue measure), in which case the function is also Lebesgue integrable and the two notions of integral agree.
Example 6.1. In particular, since the Dirichlet function D = Q[0,1] is nowhere continuous, its not Riemann integrable on [0, 1], a fact we already knew. Using this corollary, one can also show that given any closed nowhere dense set A [0, 1] of positive measure, A is not a.e. continuous; hence A is not Riemann integrable (cf. Problem 12 in Exercises 4.5).

Since now we have proved that the Lebesgue integral agrees with the Riemann integral on Riemann integrable functions, when integrating Riemann integrable functions we shall henceforth use without proof common results concerning Riemann integration, e.g. the fundamental theorem of calculus, Change of variables in 1-dimension integrals, integration by parts, and so forth. 6.2.3. Neat functions that are Riemann integrable. In Riemanns habilitationsschrift he gave the following interesting example of a Riemann integrable function. First, Riemann introduced the function : R R dened as follows:14
14Riemann denoted (x) by (x).

6.2. LEBESGUE, RIEMANN AND STIELTJES INTEGRATION

335

Put (x) = x for 1/2 < x < 1/2, (1/2) = 0, and (1/2) = 0, then extend (x) to the whole real line by reproducing its graph periodically as shown here:
0.4 0.2 -3 -2 -1 0 -0.2 -0.4 0 1 2 3

Figure 6.3. Graph of (x). Note that is continuous except at the half-integers: 1 3 5 7 odd , , , , . . . , in general at all numbers , 2 2 2 2 2 where odd is an odd integer. Next, Riemann dened his function: f (x) := (n x) . n2 n=1

Since |(x)| 1/2 it follows that this series converges for all x R. Moreover, |f (x)| 1/2 1 1 2 = = , n2 2 n=1 n2 12 n=1

so f : R R is a bounded function. Figure 6.4 shows graphs of f2 , f4 , and f8 , where fN denotes the N -th partial sum of f . Note how erratic these functions
0.6 0.4 0.4 0.4 0.2 0 0 -0.2 0.5 1 -1 -0.5 -0.2 -0.4 -0.6 0 0.5 1

0.2

0.2

0 -1 -0.5 -0.2 0 0.5 1 -1 -0.5

-0.4

-0.4

Figure 6.4. Graphs of f2 (on the left) f4 (in the middle), and f8 (on
the right).
1 look. Since is discontinuous at the half integers odd 2 , the nth term n2 (n x) in odd Riemanns function is discontinuous at the points 2n . In Problem 11 you will prove that f (x) is continuous at all real numbers except those rational numbers of the odd form even . In particular, the set of discontinuity points is a set of measure zero and hence f (x) is Riemann integrable on any nite interval. Note also that the set of discontinuity points is dense in R, so f is indeed a strange function. Because Riemanns integral can handle such discontinuous functions, mathematicians such as Karl Weierstrass (18151897) said that Riemanns integral has been seen as the

336

6. SOME APPLICATIONS OF INTEGRATION

most general thinkable and Paul du Bois-Reymond (18311889) said that Riemann extended the scope of integrable functions up to its extreme limit (quotes taken from [196, p. 266]). Another interesting function that is Riemann integrable is T : R R dened by T (x) = 1/q 0 if x Q and x = p/q in lowest terms and q > 0, if x is irrational.

This function is called Thomaes function (amongst other names such as the ruler function), named after Carl Johannes Thomae (18401921), who discussed this function in 20, page 14, of his 1875 book [382], where he gives several examples of pathological functions (including Riemanns function) that are Riemann integrable. See Figure 6.5 for a picture of Thomaes function on [0, 1]. In Problem

0.8

0.6 y 0.4

0.2

0 0 0.2 0.4 x 0.6 0.8 1

Figure 6.5. This is a plot of the points (p/q, T (p/q )) for 0 p/q 1
and q at most 13.

12 you will show that T is discontinuous on the rationals and continuous on the irrational numbers. In particular, Thomaes function is continuous a.e. and hence Riemann integrable.
Exercises 6.2. 1. (Omitted proofs) Let f : [a, b] R be bounded and : [a, b] R be nondecreasing. (a) Prove Darbouxs theorem, Theorem 6.8. Suggestion: To prove suciency, x {Qn } P and put I := lim L(Qn ) = lim U (Qn ). Given any other element {Pn } P , prove that I = lim L(Pn ) = lim U (Pn ) as well. From this, prove that given any {Pn } P , we have limn S (Pn ) = I , where this limit is in the sense of (6.15). (b) Prove the sum of jumps formula, Equation (6.17) in the proof of Theorem 6.9. 2. Evaluate
1/2 1/2

f (x) dx
0

and
0

T (x) dx

where f is Riemanns function and T is Thomaes function. 3. We only dened RiemannStieltjes integrals for bounded functions. To understand why, let f : [a, b] R be unbounded and consider the function (x) = x. (E.g. the function dened by f (0) = 0 and f (x) = 1/ x for 0 < x 1 is unbounded on [0, 1].) Given any m > 0 and given any partition P of [a, b], prove that there exists a Riemann Stieltjes sum S (P ) (since (x) = x, S (P ) is usually called a Riemann sum) such that |S (P )| > m. This shows that the condition (6.15) cannot be satised for any I R. 4. Let [a, b) R with a R and b R {} with a < b.

6.2. LEBESGUE, RIEMANN AND STIELTJES INTEGRATION

337

(i) Suppose that f : [a, b) R is Riemann integrable on every interval [a, c] where c < c b and limcb a f (x) dx exists; this limit is the improper Riemann integral of f and we say that f is improperly Riemann integrable. If f is nonnegative and improperly Riemann integrable, prove that f : [a, b) [0, ) is Lebesgue b integrable and the improper Riemann integral equals the Lebesgue integral a f . (ii) That f is nonnegative is important: Prove that f : [0, ) R, dened by n1 1 f (x ) = x (n,n+1] (x), is improperly Riemann integrable, but is n=1 (1) not Lebesgue integrable. (Another example is F (x) = x1 sin x; you can do this example instead of f (x) if you wish.) 5. Let f, : [0, 1] R be the functions f (x ) = 0 1
1 if 0 x 2 1 if 2 <x1

(x) =

0 1

if 0 x < 1 2 1 if 2 x 1.

(a) Prove, using our original sequence of partitions denition of RiemannStieltjes integrability, or use Darbouxs theorem if you wish, that f is not integrable with respect to on [0, 1]. (b) Let I1 = [0, 1/2] and I2 = [1/2, 1]. Using our sequence of partitions denition of RiemannStieltjes integrability, or Darbouxs theorem, prove that the restriction of f to I1 is integrable with respect to the restriction of on I1 and prove the similar statement for I2 . Note that although f is integrable with respect to on I1 and on I2 , f is not integrable with respect to on [0, 1] = I1 I2 . To avoid this pathology, its common to use the following denition of the integral. 6. (Another RiemannStieltjes integral) A bounded function f : [a, b] R is ARS integrable with respect to , where : [a, b] R is nondecreasing, if there is a real number I such that for some sequence {Pn } P , we have
n

lim S (Pn ) = I,

(6.22)

where this limit is in the sense of (6.15). We denote the number I by RS f d. Note that all we did was replace for every sequence in our original denition of the RiemannStieltjes integral with for some sequence. However, this slight change makes a world of dierence. (i) Prove that the number I , if it exists, is unique; that is, if {Qn } P satises limn S (Qn ) = I , then I = I . (ii) Prove that a bounded function f is ARS integrable with respect to if and only if for some partition {Pn } P , we have lim L(Pn ) = lim U (Pn );

in this case, lim L(Pn ) and lim U (Pn ) equal the same value for any {Pn } P satisfying (6.22), namely I = RS f d, the ARS integral of f . (iii) Its clear that if f is integrable with respect to (in our original denition), then f is ARS integrable with respect to . The converse is false; if f and are the functions in Problem 5, prove that f is ARS integrable with respect to on [0, 1]. (iv) (Additivity on intervals) Let a c b and suppose that f is ARS integrable with respect to on both subintervals [a, c] and [c, b]. Prove that f is ARS integrable with respect to on [a, b]. This additivity property is false for our original RiemannStieltjes denition of integrability by Problem 5. 7. (Characterization of ARS integrability) Let : [a, b] R be a nondecreasing right-continuous function and let A = the set of continuity points of . Prove the following theorem: A bounded function f is ARS integrable with respect to if and only if f is continuous -a.e. on A and f is not discontinuous from the left on [a, b] \ A. When f is ARS integrable with respect to , it is also (Lebesgue) -integrable and the two notions of integral agree.

338

6. SOME APPLICATIONS OF INTEGRATION

Another way to state the integrability condition is as follows: f is ARS integrable with respect to if and only the set of discontinuity points of f in A has measure zero and f and are never simultaneously discontinuous from the left. 8. (FichtenholzLichenstein theorem) This theorem (cf. [364], [247], [248], [134]) is named after Grigori Fichtenholz (18881959) and Leon Lichtenstein (18781933). Let (X, S , ) be a measure space and let f : [a, b] X R, where [a, b] R is a closed interval, be a function such that (a) For a.e. x X , the function fx : [a, b] R, dened by fx (t) = f (t, x) for all t [a, b], is Riemann integrable on [a, b]. (b) For all t [a, b], the function ft : X R, dened by ft (x) = f (t, x) for all x X , is -measurable. (c) There is a -integrable function g : X [0, ] such that for a.e. x X , |f (t, x)| g (x) for all t [a, b]. We shall prove that (6.23)
R X

f (t, x) d dt =
X R

f (t, x) dt d.

Here, X denotes the Lebesgue integral over X while R denotes the Riemann integral over [a, b]. To prove (6.23) you may proceed as follows. (a) Let {Pn } be a nondecreasing sequence of partitions of [a, b] whose lengths are approaching zero and let S (Pn , fx ) be a Riemann sum of fx (t) in the t variable with respect to the partition Pn . Prove that S (P n , F ) =
X

S (Pn , fx ) d,

where F (t) = X f (t, x) d. (b) Show that the dominated convergence theorem can be applied to obtain (6.23) in the limit as n . 9. (cf. [228, 245]) This theorem states that a function is continuous almost everywhere if and only if it is has a right-hand limit almost everywhere. Proceed as follows. (a) Let f be a bounded real-valued function on an interval I . Given a point a I , we dene the oscillation of f at a by osc(f, a) = lim sup{f (x) ; x I , |x a| < } inf {f (x) ; x I , |x a| < } . Let D denote the set of discontinuities of f . Prove that D = n=1 Dn where Dn = {x ; osc(f, x) > 1/n}. (b) Let L denote the set of points where f has a right-hand limit. Prove that every point of Dn L is the left endpoint of an open interval that contains no point of Dn L. Conclude that Dn L is countable, and hence so is D L. (c) Now prove that a bounded function on an interval is continuous a.e. if and only if it has a right-hand limit a.e. 10. Let I R be an interval (open, closed, half-open, bounded, unbounded, . . .). In this problem we prove that any monotone function : I R has at most a countable number of discontinuities. (i) Show that if this statement holds for any compact interval I , then the statement holds for any interval. Show that if the statement holds for any nondecreasing function, then it holds for any monotone function. (ii) Assume that : I R is nondecreasing where I is compact. By e.g. Lemma 1.19 note that is discontinuous at a point x if and only if d(x) := lim (y ) lim (y )
y x + y x 0

is positive. For each n N, let Dn = {x ; d(x) > 1/n}. Prove that Dn is a nite set and that the set of discontinuities of is n=1 Dn .

6.3. APPROXIMATIONS AND THE STONEWEIERSTRASS THEOREM

339

11. (Riemanns function) In this problem we analyze Riemanns function. (i) For each n N, let fn : R R be a function and suppose that for some a R and for each n N, limxa fn (x) exists. Suppose there exists a convergent series n=1 Mn of nonnegative real numbers such that |fn | Mn for all n. Prove that f (x ) = n=1 fn (x) converges for each x R, and
x a

lim f (x) =

n=1

x a

lim fn (x).

We can replace limxa with left or right-hand limits . . . the proof is the same. odd (ii) Let D be the set of all rational numbers of the form even . Prove that D is a dense subset of R. (nx) (iii) Let f (x) = n=1 n2 , Riemanns function. Prove that f is continuous on R \ D . For the remainder of this problem we prove that f is discontinuous at each point in D. To do so, we shall prove an interesting result of Riemann: If r D and we p write r = 2 where p Z is odd, q N, and p and q have no common factors, q then 2 . f (r ) = f (r ) 16q 2 Here, f (r +) (resp. f (r )) denotes the right (resp. left)-hand limit of f at r . In particular, the formula for f (r ) shows that f is discontinuous on D, which is a countable (and hence measure zero) subset of R. (iv) As a rst step to prove the formula for f (r ), show that for any c R, we have f (c) =
n=1

(n c) . n2

(vi) With r as in (v), show that

1 for any any half integer h. Now let r D and write (v) Prove that (h) = (h) 2 p r = 2q where p Z is odd, q N, and p and q have no common factors. Show that (nr ) if n is not a multiple of q (nr ) = 1 (nr ) 2 if n is a multiple of q .

f (r ) = f (r )

1 2q 2

1 1 1 + 2 + 2 + 12 3 5
2

just as Riemann stated. and from this prove that f (r ) = f (r ) 16 q2 12. (Thomaes function) Prove that Thomaes function T is discontinuous at each rational number. We now prove that T is continuous at each irrational number as follows: (i) Prove that T (x 1) = T (x) for every x R. This shows that T is a periodic function on R with period 1, so we just have to prove that T is continuous at each irrational number in (0, 1). (ii) Fix an irrational number c (0, 1) and let > 0. Choose any n N with n 1/

and put A =

r=

p q

numbers. Prove that for all x [0, 1] \ A, we have |T (x) T (c)| = |T (c)| < . (iii) Now prove that T is continuous at c.

Q in lowest terms ; 1 p, q n

which is a nite set of

6.3. Approximations and the StoneWeierstrass theorem In this section we give a probabilistic proof of the Weierstrass approximation theorem and we also study the celebrated StoneWeierstrass theorem.

340

6. SOME APPLICATIONS OF INTEGRATION

6.3.1. The WAT and Lebesgues very rst publication. Concerning continuous functions, Karl Weierstrass (18151897) proved two strikingly dierent results. In 1872 [413] he showed there are continuous functions that are nowhere dierentiable. In other words, there are continuous functions so jagged they dont have a tangent line anywhere! We shall study Weierstrass nondierentiable function in Section 9.1. On other hand, in 1885 Karl Weierstrass (18151897). [311, 400, 414], when he was 70 years old, he showed that any continuous function can be approximated as close as one wishes by polynomials. Thus, although continuous functions may be very jagged, they can always be approximated arbitrary close by the smoothest of all functions, polynomials! This last result is the fundamental theorem of approximation theory, or the Weierstrass approximation theorem Theorem 6.13. If f : I R is a continuous function on a compact interval I and > 0 is given, there exists a polynomial p(x) such that |p(x) f (x)| < for all x I . We remark that Lebesgues rst publication provided another proof of Weierstrass theorem. The paper was Sur lapproximation des fonctions [230] and it was published in 1898. He probably discovered the proof while an undergraduate stu dent at Ecole Normale Sup erieure in Paris, which he entered in 1894 and obtained his teaching diploma in mathematics in 1897; he obtained his doctorate in 1902. You will learn Lebesgues elegant proof in Problem 6. Now why is Weierstrass theorem in a section involving probability? The reason is that one of the most intuitive proofs of the theorem relies on probability! This proof is due to Sergei Bernstein (18801968) and was published in 1912 [34]. Heres Bernsteins idea for a function dened on [0, 1] [219]. Bernsteins game: Let f : [0, 1] R be a continuous function and let n N. Given t [0, 1], suppose that we toss a coin n times where the probability of ipping a head on any given toss is t. As we know, a sample space for this experiment is X = {0, 1}n with a head 1 assigned the probability t and tail 0 the probability 1 t on each toss. Suppose that if we toss k heads in n tosses, we gain f (k/n) dollars (if f (k/n) < 0 we lose dollars). If Sn : X R is the number of heads in n tosses, then our gain is modeled by the random variable g n : X R, dened by gn = f Sn n ,

the composition of f with Sn /n. Indeed, if k heads are tossed have Sn /n = k/n, so gn = f (k/n), which is our required gain. We know that the probability of getting k heads in n tosses = n k t (1 t)nk , k

6.3. APPROXIMATIONS AND THE STONEWEIERSTRASS THEOREM

341

so our expected gain for playing Bernsteins game is


n

E (gn ) =
k=0 n

(Gain when k heads are tossed) (probability k heads are tossed) f k n n k t (1 t)nk . k

=
k=0

Now recalling that t is the probability of getting a head on any given toss, it follows that if n is large, we should get approximately nt number of heads in playing Bernsteins game. Thus, Sn nt, so we should expect to gain approximately f (nt/n) = f (t) dollars. Since our expected gain is also E (gn ), we conclude that
n

f (t)

f
k=0

k n

n k t (1 t)nk . k

Hence, for n large, the continuous function f (t) is approximately a polynomial in t, specically, the polynomial to the right of , which is called the nth Bernstein polynomial. This heuristic argument for Weierstrass theorem will be made rigorous in Problem 1. Now that we know Weierstrass theorem, we shall present a very useful generalization due to Marshall Stone (19031989) [374, 375]. Let T be a topological space. A collection A of real-valued continuous functions on T is called an algebra of functions if for all f, g A , we have f g A and Marshall Stone af + bg A for all a, b R. The collection A is said to separate points (19031989). if given any two points x, y T there exists a function f A such that f (x) = f (y ). We denote by C (T, R) the set of all continuous real-valued functions on T . StoneWeierstrass theorem Theorem 6.14. Let T be a compact space and let A C (T, R) be an algebra of functions that separates points of T and contains the constant functions. Then any function in C (T, R) can be uniformly approximated by functions in A ; that is, given any > 0 and f C (T, R) there is a function g A such that |f (x) g (x)| < for all x T.
Proof : We shall give a Lebesgue-like proof of this result following [149]. Let A C (T, R) be the space of functions that can be uniformly approximated by functions in A . We must show that A = C (T, R). Since A is an algebra, one can check that A is an algebra. Now let f C (T, R); we need to prove that f A . Since T is compact, |f | is bounded by a constant, say M . Now f := f + M is nonnegative and if we can prove that f A , then as the constant function M belongs to A (and hence to A ) we get f = f M A . Thus, we might as well assume from now on that f is nonnegative. Let > 0 and x n N such that 1/n < /2. Following Lebesgue, lets partition the range of f using the partition 1 2 3 4 0, , , , , ..., n n n n

342

6. SOME APPLICATIONS OF INTEGRATION

and next consider the function (which should be familiar looking by now) g=
k=1

k {k/n<f (k+1)/n} . n

This function is represented by the dark horizontal steps in Figure 6.6. Note that
5/n 4/n 3/n 2/n 1/n x f (x) Here, {1/n<f } + {2/n<f } + {3/n<f } + {4/n<f } = 3 Thus, at this particular x, 3 1 {1/n<f } + {2/n<f } + {3/n<f } + {4/n<f } = n n 3 On the other hand, g (x) = . n

Figure 6.6. Here we assume that 0 f 5/n. At the particular x shown, we have {1/n<f } = {2/n<f } = {3/n<f } = 1 while {4/n<f } = 1 {1/n<f } + {2/n<f } + {3/n<f } + {4/n<f } (at least 0. Thus, g = n at this particular x, but its true in general for the function pictured).
this sum is actually nite because f is bounded. Its easy to check that for all x T, 1 |f (x) g (x)| < ; n thus, g approximates f uniformly; wed be done with our proof if g A but g is generally not even continuous! However, well prove that g can be approximated closely by elements of A . To see this, well let you verify that, as seen in Figure 6.6, another way to write g is 1 {1/n<f } + {2/n<f } + {3/n<f } + g= n where this sum is actually nite since f is bounded. Fixing k N, we shall approximate {k/n<f } , which has the property that {k/n<f } = 1 0 if k/n < f (x), if f (x) k/n. , Bk = f k n

Based on this formula, consider the two closed, disjoint sets Ak = k+1 f n

In Problem 7 you will prove that given any two closed, disjoint sets A, B T there is a function F A such that F : T [0, 1] with F = 1 on A and F = 0 on B . Thus, there is a function fk A such that fk : T [0, 1] with fk = 1 on Ak and fk = 0 on Bk . Now consider the function 1 h := f1 + f2 + f3 + n (Because f is bounded, for k suciently large, Ak = and Bk = T , so eventually all the fk s are zero.) Since A is an algebra it follows that h A and we claim that for all x T , 2 (6.24) |f (x) h(x)| < . n This shows that f can be uniformly approximated by elements of A and hence by elements of A , which completes our proof. Let us x x T . If f (x) 1/n, then x Bk for k = 1, 2, . . . so it follows that all the fk (x)s are zero and hence

6.3. APPROXIMATIONS AND THE STONEWEIERSTRASS THEOREM

343

(6.25)

k0 + 1 k0 < f (x ) . n n By denition of the sets Ak , Bk we have x A 1 , A 2 , A 3 , . . . , A k 0 1 and

h(x) = 0. Thus, |f (x) h(x)| = |f (x)| 1/n, which proves (6.24) if f (x) 1/n. Now assume that f (x) > 1/n. Then there exists a unique k0 N such that

x Bk0 +1 , Bk0 +2 , . . . .

It follows that 1 f1 + f2 + f3 + n fk k0 1 1 + 0. 1 + 1 + + 1 + fk0 + 0 + 0 + 0 + = = n n n Recalling that 0 fk 1 for each k, we see that h(x) =

k0 1 k0 h(x) . n n n Comparing these inequalities with (6.25), the inequalities in (6.24) follow.

The StoneWeierstrass theorem has some very nice corollaries. Recall that a real-valued polynomial on R is a linear combination of products of the coordinate function on R. Similarly, a polynomial on Rn is a linear combination of products of the coordinate functions on Rn . For example, denoting the coordinate functions on R3 as x1 , x2 , x3 , a polynomial function on R3 is 1 5 3 2 3 4 x x . p(x1 , x2 , x3 ) = 1 + 2x1 x2 2 + 3x1 x2 x3 2 2 3 The WAT says that any continuous function on a compact interval in R can be uniformly approximated by polynomials; the n-dimensional version is the following: Density of polynomials Corollary 6.15. Any continuous function on a compact subset of Rn can be uniformly approximated by polynomials.
Proof : Let A be the set of all real-valued polynomials restricted to a compact subset T of Rn . Then you can check that A satises the hypothesis of the StoneWeierstrass theorem.

The StoneWeierstrass theorem can be extended to complex-valued algebras. For a topological space T we denote by C (T ) the space of all complex-valued continuous functions on T ; the proof of the following result is left for Problem 8. Complex-version of the StoneWeierstrass theorem Theorem 6.16. Let T be a compact space and let A be an algebra of C (T ). Assume that A separates points of X , contains the constant complex-valued functions, and is closed under complex conjugation: if f A , then f A . Then any function in C (T ) can be uniformly approximated by functions in A ; that is, given any > 0 and f C (T ) there is a function g A such that |f (x) g (x)| < for all x T.

344

6. SOME APPLICATIONS OF INTEGRATION

A trigonometric polynomial is a function p : R C of the form


N

p() =
n=N

an ein ,

for some N = 0, 1, 2, . . . and numbers an C. Since ein = (ei )n and ei = cos + i sin (this is called de Moivres formula) we can write p() as a linear combination of powers of the trigonometric functions cosine and sine. This is why we call p() a trigonometric polynomial. A function f : R C that satises f () = f ( + 2 ) for all R is called 2 -periodic. For example, trigonometric functions are 2 -periodic, hence any trigonometric polynomial is 2 -periodic. Trig Weierstrass theorem Corollary 6.17. Any complex-valued 2 -periodic continuous function on R can be uniformly approximated by trigonometric polynomials.
Proof : Note that we cannot apply the (complex) StoneWeierstrass theorem directly (e.g. R is not compact!). The trick here is to identify 2 -periodic functions on R 1 with functions on the unit circle S = {z C ; |z | = 1} (here, if z = a + ib where a, b R, then |z | = a2 + b2 ), which is compact. A point z S1 can be written in the form z = ei = cos + i sin for some R and any two s dier by 2 :
z = ei = cos + i sin

We call a function p : S1 C a trigonometric polynomial if its of the form N n p (z ) = for some N = 0, 1, 2, . . . and numbers an C. If A is n=N an z the set of all trigonometric polynomials on S1 , then in Problem 9 you will check that the hypothesis of Theorem 6.16 are satised for A . Let f : R C be continuous and 2 -periodic. In Problem 9 we ask you to verify that there is a continuous function g : S1 C such that f () = g (ei ) for all R. Given > 0, by the complex StoneWeierstrass theorem there is a trigonometric polynomial n p (z ) = N on S1 such that |g (z ) p(z )| < for all z S1 . Putting n=N an z i z = e we obtain
N

f ( ) This proves our result. Exercises 6.3.

an ein <
N

for all R.

1. (Bernsteins proof of Weierstrass theorem) Please review Bernsteins game from Subsection 6.3.1 as the notation below is from that subsection. (i) Given a continuous function f : [a, b] R, by making a suitable translation and dilation of the interval [a, b], show that if we prove the Weierstrass theorem for the interval [0, 1] we get Weierstrass theorem for any interval. Henceforth we x a continuous function f : [0, 1] R. (ii) Fix > 0 and via uniform continuity choose > 0 such that |t1 t2 | < implies |f (t1 ) f (t2 )| < /2. Let A = {|Sn /n t| < } X (recall that t [0, 1] is

6.3. APPROXIMATIONS AND THE STONEWEIERSTRASS THEOREM

345

the probability of tossing a head on a coin toss, X = {0, 1}n with a head 1 assigned the probability t and tail 0 the probability 1 t on each toss). Let gn = f (Sn /n) and let g is the constant function on X taking the constant value f (t) at all points of X . Show that |gn g | . 2

(iii) Since f is continuous on [0, 1], its bounded, so choose M such that |f | M . Show that 2M |gn g | 2M (Ac ) 2 2 |Sn nt|2 n c A (iv) Show that |Sn nt|2 n, and conclude that
n

|gn g |

2M + 2. 2 n

k n k t (1 t)nk and show there is an N such k n k=0 that for all n >N, we have |Bn (t) f (t)| < for all t [0, 1]. Since Bn (t) is a polynomial in t, this proves Weierstrass theorem! 2. Heres a very interesting result proved by Weierstrass in the same paper [414] he proved his approximation theorem: If f : I R is a continuous function on a compact interval I , then there are polynomials p1 , p2 , . . . such that f = n=1 pn , where the convergence is uniform; that is, f can be written as a uniformly convergent innite series of polynomials! Suggestion: For each n N, let qn be a polynomial such that |f (x) qn (x)| < 1/2n for all x I . Put p1 = q1 and pn = qn qn1 for all n > 1. 3. (Dirac sequences) This problem will be used in the next problem. A Dirac sequence, named after Paul Dirac (19021984), is a sequence of functions {n } where each n : Rn C is integrable, such that (a) n = 1 and (b) Given > 0 and > 0 (v) Let Bn (t) = E (gn ) = f there is an N such that
n |x|

|n | < for all n N . = 1, prove that the sequence {n }, where for all x Rn ,

(i) If : R C is integrable with

n (x) := n (n x)

is a Dirac sequence. (ii) Let f : R C be a continuous function and suppose there is a compact set K R such that f 0 outside of K ; we say that f is compactly supported. Let {n } be a Dirac sequence and for each n N, let gn = f n , the convolution of f and n . Prove that gn f uniformly. Suggestion: Show that g n (x ) f (x ) = and note that for any > 0, |gn (x) f (x)| = |n (y )| |f (x y ) f (x)| dy + |n (y )| |f (x y ) f (x)| dy. n (y ) f (x y ) f (x) dy,

|y |<

|y |

Given > 0 show that > 0 can be chosen suciently small such that the integrals on the right are each less than for all n suciently large. 4. (Weierstrass proof ) In this problem we give Weierstrass 1885 proof [414]. (i) Assuming that Weierstrass theorem holds for the interval [0, 1] and for continuous functions vanishing at 0 and 1, prove that Weierstrass theorem holds for any continuous function on an compact interval. Thus, we henceforth let f : [0, 1] R be continuous with f (0) = f (1) = 0; we must prove that f can be uniformly approximated by polynomials. We extend f to all of R by dening f (x) = 0 for x / [0, 1] to get a continuous function f : R R.

346

6. SOME APPLICATIONS OF INTEGRATION


1 x (ii) Let (x) = e and note that = 1. In particular, by the previous problem, n (x) := n (n x) forms a Dirac sequence.15 Let > 0. Then by the previous problem we can x an n N such that for all x R, |gn (x) f (x)| < /2, where
2

(6.26)

g n (x ) =

n n (x y ) f (y ) dy =

e n

2 (x y )2

f (y ) dy,

where we recall that f = 0 outside of [0, 1]. Let M = maximum value of |f | on [0, 1]. Recalling Taylors formula with remainder from elementary real analysis, prove that there is a polynomial q (t) such that e t = q ( t ) + R ( t ) ,
n R(t) < for all t [0, n2 ]. where the function R(t) satises M 2 2 2 t (iii) Now put t = n (x y ) into e = q (t) + R(t) and then put the result into (6.26) there is a polynomial p(x) such that |gn (x) p(x)| < /2 for all x [0, 1]. Use this to complete Weierstrass proof. 5. (Landaus proof ) Heres Edmund Landaus (18771938) 1908 proof the Weierstrass approximation theorem [221]. For each n = 0, 1, . . ., dene the continuous function n on R by 1 n (x) = (1 x2 )n , 1 x 1, cn

and n (x) = 0 for x / [1, 1], where cn = 1 (1 x2 )n dx; the n s make up the now famous Landau sequence. Show that {n } is a Dirac sequence and then prove Weierstrass theorem for a continuous function f : R R with f = 0 for x / [0, 1] (this suces as you showed in the previous problem). Suggestion: For x [0, 1], show that the functions gn (x) are polynomials so you dont have to consider a Taylors expansion as you did in the previous problem. 6. (Lebesgues proof ) Heres Henri Lebesgues proof the Weierstrass approximation theorem [230]. Before presenting Lebesgues proof, we remark that a piecewise linear function on [0, 1] is a continuous function l : [0, 1] R such that for some partition 0 = x0 < x1 < x2 < < xN 1 < xN = 1, we can write (6.27) l ( x ) = bk + m k ( x x k ) for all x [xk , xk+1 ], for some constants bk , mk R. Note that mk = (bk+1 bk )/(xk+1 xk ) (why?). (i) Show that a continuous function f : [0, 1] R can be uniformly approximated by piecewise linear functions; that is, given > 0 show that there is a piecewise linear function l : [0, 1] R such that for all x [0, 1], |f (x) l(x)| < . Suggestion: f is uniformly continuous on [0, 1], so there is an N N such that if x, y [0, 1] and |x y | < 1/N , then |f (x) f (y )| < . Now let xk = k/N and bk = f (xk ) for k = 0, . . . , N and dene l(x) as in (6.27). Prove that this l works. (ii) Prove that if Weierstrass theorem holds for piecewise linear functions on [0, 1], then it holds for all continuous functions on [0, 1]. (iii) Prove that if l : [0, 1] R is piecewise continuous as in (6.27), then we can write l = b0 +
N 1 k=0

( m k m k 1 ) g k ,

where m1 = 0 and gk (x) = 1 x xk + x xk for k = 0, 1, . . . , N 1. 2 (iv) Show that if Weierstrass theorem holds for the function f (x) = |x|, then you can prove it holds for any piecewise linear function.
15

Actually, Weierstrass used the function Gk (x) =

e x

2 /k2

obtained from ours by mak-

ing the substitution n = 1/k . He then considered small k instead of large n.

6.3. APPROXIMATIONS AND THE STONEWEIERSTRASS THEOREM

347

(v) Now we just have to prove Weierstrass theorem for the function f (x) = |x|. To do so as Lebesgue did in [230, p. 279] (cf. [311, p. 29]), write |x| = x2 = 1 + t , where t = x2 1. Prove that the binomial series for 1 + t = (1 + t)1/2 converges uniformly to 1 + t for 1 t 1. Finally, using this fact, prove Weierstrass theorem for f (x) = |x|. (Actually, Lebesgue just stated as a well-known fact that the binomial series converges uniformly to 1 + t.) Suggestion: Recall that (1 + 1/2 n t)1/2 = t for all 1 < t < 1. Show that n=0 n 1/2 n = (1)n1 (2n 3)! , 22n2 n!(n 1)! n = 2, 3, . . . , . Con-

and using Stirlings formula, show that

clude e.g. by the Weierstrass M test that uniformly for 1 t 1. Now look up e.g. Abels limit theorem or otherwise and show that 1/2 n 1 + t for t [ 1 , 1]. Finally, using this fact t converges uniformly to n=0 n and the formula |x| = 1 + (x2 1), prove Weierstrass theorem for f (x) = |x|. 7. (Continuous approximations to characteristic functions) Let T be a compact topological space and let A be an algebra of continuous real-valued functions on T that separates points and contains the constant functions. Let A C (T, R) be the space of functions that can be uniformly approximated by functions in A . In this problem we prove that given closed, disjoint sets A, B T , there is a A with : T [0, 1] such that = 1 on A and 0 on B . (i) If f A , prove that |f | A . Suggestion: f is bounded in absolute value by a constant M . Suggestion: Use the fact that |t|, where t [M, M ], can be uniformly approximated by polynomials and then put t = f . (ii) Show that if f, g A , then max{f, g } A and min{f, g } A . Suggestion: 1 Show that max{f, g } = 2 f + g + |f g | with a similar formula for min{f, g }. (iii) If a, b T are dierent points, prove there is a function f A with f : T [0, 1] such that f (a) = 1 and f = 0 on a neighborhood of b. Suggestion: First dene g (x) := min{ |(f (x) f (b))/(f (a) f (b))| , 1}. Then g A , g (a) = 1, g (b) = 0, and 0 g 1. By continuity there is a neighborhood U of b such that g (x) < 1/2 on U . Try to dene f using max (or mins) involving g . (iv) If a T and B T is closed (and hence compact)16 and does not contain a, prove there is a function f A with f : T [0, 1] such that f (a) = 1 and f = 0 on B . Suggestion: For each point p B there is a function fp A with fp (p) = 1 and f = 0 on an open neighborhood Up of p. Use compactness of B to nd nitely many such open neighborhoods covering B , let fp1 , . . . , fpk be the corresponding functions, then prove that min{fp1 , . . . , fpk } satises the conditions required. (v) Finally, prove that if A, B T are closed and disjoint, then there is a A with : T [0, 1] such that = 1 on A and 0 on B . Suggestion: Try to use a similar argument as we did in (iv); that is, for each point a A, let fa A with fa : T [0, 1] such that fa (a) = 1 and fa = 0 on B . . . etc. 8. (Complex StoneWeierstrass) Prove the complex StoneWeierstrass theorem. Suggestion: Prove that the real and imaginary parts of a function in A belong to A . Then, prove that if A A consists of all real-valued functions in A , then A is an algebra and it separates points, and hence the real StoneWeierstrass theorem applies to A . 9. (Trigonometric polynomials) (i) Let f : R C be continuous and 2 -periodic function. Dene g : S1 C as follows: If z S1 write z = ei for some R; then, dene g (z ) := f (). Prove that g is well-dened, continuous, and f () = g (ei ) for
16

1/2 e 2 n3/2 as n n 1/2 n t converges n=0 n

A closed subset of a compact space is compact.

348

6. SOME APPLICATIONS OF INTEGRATION

all R. (ii) If A is the set of all trigonometric polynomials on S1 , prove that the hypothesis of the StoneWeierstrass Theorem 6.16 are satised.

Prelude to the general SLLN We saw back in Sections 2.4 and 4.2 that the right way to think about the laws of large numbers are as statements concerning limits of functions. Let X = Y with Y = {0, 1} be an innite sequence of Bernoulli trials where on each trial 1 occurs with probability p and 0 with probability 1 p and let denote the innite product measure. For each i N, dene fi := Ai : X R, where such that the {1} occurs in the ith slot. Observe that for any i, E (fi ) = (Ai ) = p. Let (which is just p) denote the common expectation of all the fi s. Then the SLLN is the statement that the event f1 + f2 + + fn = (6.28) lim n n occurs with probability one. There are two properties of the fi s that stand out. First, the values (namely, 0 and 1) of the fi s are distributed exactly the same in the sense that each value occurs with the same probability regardless of i; explicitly, for any i, fi = 1 with probability p and fi = 0 with probability 1 p. Because the values of the fi s are distributed the same, we say that the fi s are identically distributed. Second, its clear that the sets A1 , A2 , A3 , . . . are independent. Because of this we say that f1 , f2 , f3 , . . . are independent. The general SLLN is the statement (6.28) but for any independent and identically distributed (or i.i.d.) random variables. In Section 6.4 we study the notion of distributions, in Section 6.5 we study the notion of i.i.d. and in Section 6.6 we prove the general SLLN. 6.4. Probability distributions, mass functions and pdfs The goal of this section is to understand . . . 6.4.1. Probability distributions. The measurable space (R, B ), the real line with the Borel -algebra, plays a special role in real-life probabilities because numerical data, real numbers, is gathered whenever a random experiment is performed. Its important to analyze this data probabilistically and for this reason we shall call a law on R any probability measure on (R, B ); law in the sense that a measure gives a rule by which to judge the (probabilistic) behavior of data. Now data (numbers assigned to outcomes of an experiment) is described mathematically by random variables, so let (X, S , ) be a probability space and recall that a random variable on X is just a measurable function on X . In this and the next two sections we work exclusively with real-valued random variables. Given a random variable f : X R, we are interested in questions revolving around the data described by f , such as What is the likelihood that f takes values in a Borel set A R? For example, if A = (100, 120), this is the question: What is the likelihood that f lies strictly between 100 and 120? Ai = Y Y Y Y {1} Y

6.4. PROBABILITY DISTRIBUTIONS, MASS FUNCTIONS AND PDFS

349

For a general Borel set A R, the event that f takes values in the set A is Note that f 1 (A) S because f is measurable. The likelihood, or probability, that this event occurs is given by {f A} = (f 1 (A)) = the probability that the values of f are in the set A. For dierent As, the numbers (f 1 (A)) tell us probabilistically how the values of f are distributed amongst dierent Borel sets. For this reason, its natural to dene the (probability) distribution of f , or the law of f , as the measure Pf : B [0, 1] dened by Note that Pf : B [0, 1] is a measure because Pf () = (f 1 ()) = () = 0, Pf (R) = {f R} = (X ) = 1, and if A1 , A2 , . . . are pairwise disjoint Borel sets, then

{f A} = {x X ; f (x) A} = f 1 (A).

Pf (A) := {f A}

for all Borel sets A R.

Pf
n=1

An

=
n=1

f 1 (An )

=
n=1

(f 1 (An )) =
n=1

Pf (An ).

The measure Pf : B [0, 1] contains everything you need to know concerning the probabilistic behavior of the data f represents. Heres a picture of the situation:
f x {f A} (X, S , ) q R f (x) ) A (R, B , Pf ) (

Figure 6.7. Pf is a probability measure on (R, B ) such that for all Borel sets A R, Pf (A) = probability that f lies in A. (In the picture, A is an open interval and f 1 (A) = {f A} is an oval.) 6.4.2. Discrete probability distributions. Recall from Section 1.5 that if is a countable set, then a probability mass function m : [0, 1] is a function satisfying m( ) = 1.

Such a mass function determines the discrete probability measure via P (A) :=
A

P : P () [0, 1] m( ) , for all A ,

where the summation is only over those (at most countably many) points A. Conversely, given a probability measure P on P (), the function m( ) := P { } denes a mass function whose corresponding measure is P . Such probability measures are related to discrete random variables, where a random variable

350

6. SOME APPLICATIONS OF INTEGRATION

Then Pf has the mass function17 m : R dened by

f : X R is said to be discrete if f has a countable range; otherwise f is called a continuous random variable. Suppose that f is discrete and denote the range of f by . Then for any A B not intersecting , we have f 1 (A) = , so Pf (A) = 0; in particular, its common to restrict Pf to subsets of . That is, we consider Pf as a map Pf : P () [0, 1]. m( ) := Pf { } = {f = } for all .

The mass function m measures how much f is concentrated at each . Given A , we have Pf (A) = m( ),
A

where the sum on the right is summed only over those (at most countably many) points A.
Example 6.2. (A couple well-known distributions) Let be a nite set. Then the (discrete) uniform distribution on is the probability measure on with mass function m : R given by 1 for all . m ( ) = #

Any random variable with range with distribution the uniform distribution is said to be uniformly distributed because its values are distributed with equal probabilities. Observe that if A , then #A Pf (A) = , # which is just the classical fair probability measure. Let p (0, 1) and suppose that = {0, 1, 2, . . . , n} and m : R is given by the binomial mass function m(k) := b(k; n, p) = n k p (1 p)nk , k 0 k n,

studied back in Section 2.5.2. Recall from Theorem 2.13 that m(k) is the probability that in a sequence of n Bernoulli trials we obtain exactly k successes (each success occurring with probability p). The corresponding measure on is called the binomial distribution and is denoted by B (n, p). Any random variable f with such a distribution is said to be binomially distributed and we write f B (n, p). See Problem 1 for another common distribution.

Binomially distributed random variables can be obtained experimentally from the Galton board, named after Francis Galton (18221911) and looked at in the notes and references on Chapter 1. Consider a Galton board with four rows as seen in Figure 6.8. A ball is dropped at the top and suppose that for some p (0, 1), when the ball hits a peg it bumps to the right (a success) with probability p and to the left (a failure) with probability 1 p. By studying this gure, one can see that in order for the ball to land in bin k , the ball must have exactly k successful bumps. Thus, if f is the random variable: f = the bin in which the ball lands,
17m is sometimes called the distribution function of f , although I dont like using this name because it causes confusion with the (cumulative) distribution function in Theorem 6.18.

6.4. PROBABILITY DISTRIBUTIONS, MASS FUNCTIONS AND PDFS


0 1 2 3 4

351

Figure 6.8. A Galton board with n = 4 rows of pegs and bins labeled 04 where the balls eventually land. then f is binomially distributed with n = 4. Of course, the same can can said for any (nite) number n of rows of pegs. This is an abstract denition of f in the sense that we havent said what the sample space is (thus leaving out the domain of the function f ).18 Of course, its easy to dene a sample space describing this experiment. Indeed, let Y = {0, 1} = {left bump, right bump} with probabilities of p for 1 and 1 p for 0. Then a sample space is Y n with the product measure, and f : Y n R is the function To encompass all n N simultaneously, it is convenient to let X = Y with the innite product measure; then given n N, f = Sn = the random variable giving the number of successes on the rst n trials. From Section 2.5 we know that Sn looks much like the normal density function, a particular case of a probability density function, which we now describe. 6.4.3. Probability density functions. Here, a probability density function, or pdf, is a Lebesgue measurable function : R [0, ) such that (x) dx = 1.
R

f (x1 , x2 , . . . , xn ) = x1 + x2 + + xn .

Such a pdf determines a law by P (A) :=


A

P : B [0, 1] (x) dx for all Borel sets A R.

(Of course, not all laws have pdfs, such as measures vanishing everywhere except on a Borel set of measure zero.) Observe that P (A) is just the area under (x) and above A as shown here:
(x) A P (A) =
A

(x) dx

= shaded area

The most celebrated laws with pdfs are the


Example 6.3. (Normal distributions). The most famous pdf is (x ) =
(x)2 1 e 22 , 2 2

18In this business its common to describe random variables abstractly; that is, describing them in terms of what is observed without specifying the sample space. In fact, after a fascinating lecture of an expert probabilist on random variables dealing with the stock market, I asked him on what sample space his random variables were dened; he did not know!

352

6. SOME APPLICATIONS OF INTEGRATION

which >0 terms called

is called a normal density function, where R is called the mean and the standard variation with 2 called the variance (we shall discuss these in Subsection 6.5.3). The law corresponding to a normal density function is a normal distribution and is denoted by N (, ); thus, N (, ) : B [0, 1]

(x)2 1 e 22 dx. 2 2 A The standard normal distribution is the measure N (0, 1). Heres a picture of a normal distribution, where (x) has the ubiquitous bell curve shape centered at :

is the measure dened on a Borel set A R by N (, )(A) :=

1 N (, )(A) = 2 2

e
A

(x)2 22

dx

= shaded area A

Now let f : X R be a random variable and let Pf denote its distribution. If a pdf exists for the measure Pf , then f must be continuous, meaning not discrete; however, the converse is false because there are continuous random variables not having pdfs (see Problem 4 for the proof). Amongst those continuous random variables with pdfs, there is a special place for those said to be normally distributed, which means their pdfs are normal densities. We usually write when the pdf is a normal density function with mean and standard deviation . f N (, ),

A normally distributed random variable can be obtained by letting and putting f : X R as the identity function Then given any Borel set A R, we have (X, S , ) = (R, B , N (, ))

f (x) = x for all x X = R.

Hence, Pf = N (, ) and f is normally distributed. Of course, replacing N (, ) by an arbitrary probability measure on R, this trick allows one to produce a random variable whose distribution is exactly the given measure. In fact, this example (except for trivial modications) is the only normally distributed random variable we can give in this book without going into mathematics outside the scope of this book!19 What makes the normal distribution so important is not that (exactly ) normally distributed random variables are everywhere, but that, as we discussed in Section 2.5, approximately normal random variables are everywhere, where approximately normal means that Pf (A) (x) dx
A

Pf (A) = {f A} = N (, ){x ; x A} = N (, )(A).

for some normal density function , where the approximation depends on the situation. For example, as we were discussing immediately before this example, let Sn be the random variable on Y giving the number of successes on the rst n trials of a Bernoulli sequence. Then from the de MoivreLaplace theorem we
19One such example involves Brownian motion; see [157, Ch. 5] for a thorough treatment.

6.4. PROBABILITY DISTRIBUTIONS, MASS FUNCTIONS AND PDFS

353

learned back in Section 2.5 (see also Problem 8 in Exercises 2.5) we know that if Zn is the random variable Sn np Zn := npq where q = 1 p, then for any interval I [, ], we have lim Zn I

1 x2 = e 2 dx, n 2 I where is the innite product measure on Y . Thus, for all intervals I [, ],
n

(6.29)

lim Pfn (I ) = N (0, 1)(I ).

See Problem 5 for a related result. Heres another common example of a probability distribution with a pdf.
Example 6.4. (Continuous uniform distributions) Given < a < b < , the uniform (or rectangular) distribution on the interval [a, b], denoted by U (a, b), is the law U (a, b) : B [0, 1] with pdf (x) := 1 ba 0 if x [a, b], otherwise.
1 A ba

Heres a picture of the uniform distribution:


U (a, b)(A) = dx =
m (A) ba

1 ba a A b

= shaded area

U (a, b) is uniform over [a, b] in the sense that it assigns equal probabilities to subsets of [a, b] with the same length. A random variable f is said to be uniformly distributed if its probability distribution is a uniform distribution on some interval [a, b], in which case we write f U (a, b). Generally speaking, if the range of a random variable lies in an interval [a, b] and its values are equally likely to lie anywhere in [a, b], then the random variable is uniformly distributed. Some examples of uniformly distributed random variables include (under appropriate conditions) spinning a needle on a circular dial and observing where it stops and also picking a number at random from a given interval. See Problem 2 for another common distribution.

such that

6.4.4. (Cumulative) distribution functions. One can also approach probability distributions through LebesgueStieltjes set functions associated to special nondecreasing functions. A (cumulative) distribution function (cdf ) is a function F : R [0, 1] (i) F is nondecreasing ; (ii) F is right continuous; (iii)
1 0

lim F (x) = 0 ; (iv ) lim F (x) = 1,


x
1 0 1 0

Here are some pictures of cdfs:

354

6. SOME APPLICATIONS OF INTEGRATION

Cdfs are important because they characterize laws.

Laws and cdfs

Theorem 6.18. Laws are in one-to-one correspondence with cdfs in the following sense: Given a cdf $F : \mathbb{R} \to \mathbb{R}$, its corresponding Lebesgue–Stieltjes measure $\mu_F : \mathcal{B} \to [0, 1]$ is a law (probability measure). Moreover, given a law $\mu : \mathcal{B} \to [0, 1]$, the function $F : \mathbb{R} \to \mathbb{R}$ defined by
$$F(x) := \mu(-\infty, x] \quad \text{for all } x \in \mathbb{R},$$
is the unique cdf whose corresponding Lebesgue–Stieltjes measure is $\mu$.

Proof: We shall leave you to prove the first statement and the uniqueness part of the second statement. Let $\mu$ be a law and define $F : \mathbb{R} \to \mathbb{R}$ as stated. We need to show that $F$ is a cdf such that $\mu_F = \mu$.

Proof that $F$ is a cdf: Since $\mu$ is monotone it follows that $F$ is nondecreasing. To prove that $F$ is right continuous, it suffices to show that if $a \in \mathbb{R}$ and $\{a_n\}$ is a nonincreasing sequence of points approaching $a$, then $F(a_n) \to F(a)$. To see this, observe that
$$(-\infty, a] = \bigcap_{n=1}^{\infty} (-\infty, a_n],$$
hence by continuity of measures, we have
$$\mu(-\infty, a] = \lim_{n\to\infty} \mu(-\infty, a_n].$$
That is, $F(a) = \lim F(a_n)$, which shows that $F$ is right continuous. If we put $a = -\infty$ (in this case, $(-\infty, -\infty] = \varnothing$) and $\{a_n\}$ is a nonincreasing sequence approaching $-\infty$, the same argument shows that $0 = \mu(\varnothing) = \lim \mu(-\infty, a_n]$; that is, $0 = \lim F(a_n)$, which shows that $\lim_{x\to-\infty} F(x) = 0$. Finally, if $\{a_n\}$ is any nondecreasing sequence with $a_n \to \infty$, then as
$$\mathbb{R} = \bigcup_{n=1}^{\infty} (-\infty, a_n],$$
by continuity and using the fact that $\mu(\mathbb{R}) = 1$ since $\mu$ is a probability measure, we have
$$1 = \mu(\mathbb{R}) = \lim_{n\to\infty} \mu(-\infty, a_n].$$
This shows that $1 = \lim F(a_n)$, which proves that $\lim_{x\to\infty} F(x) = 1$.

Proof that $\mu_F = \mu$: If $(a, b] \in \mathcal{I}^1$ with $a < b$, then by subtractivity of measures we have
$$\mu_F(a, b] = F(b) - F(a) = \mu(-\infty, b] - \mu(-\infty, a] = \mu\big((-\infty, b] \setminus (-\infty, a]\big) = \mu(a, b].$$
Thus, $\mu_F = \mu$ on $\mathcal{I}^1$; by the extension theorem it follows that $\mu_F = \mu$ on $\mathcal{B} = \mathcal{S}(\mathcal{I}^1)$ as well. $\square$

Given a random variable $f : X \to \mathbb{R}$ on a measure space $(X, \mathcal{S}, \mu)$, the (cumulative) distribution function (cdf) of $f$ is the cdf of its law $P_f$; thus, the cdf of $f$ is the function $F : \mathbb{R} \to \mathbb{R}$ defined by $F(x) := \mu\{f \leq x\}$, where $\{f \leq x\}$ is shorthand for $\{f \in (-\infty, x]\}$, so that $F(x)$ equals $P_f(-\infty, x]$. By the previous theorem we know that the Lebesgue–Stieltjes measure $\mu_F$ and the distribution $P_f$ are identical laws on $\mathcal{B}$. Thus, the study of probability distributions of random variables is really the study of Lebesgue–Stieltjes measures (of cdfs)!
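To see the definition $F(x) = \mu\{f \leq x\}$ in action, here is a small sketch of ours (an exponential random variable is chosen purely for illustration) comparing an empirical cdf built from samples of $f$ with the exact cdf of its law.

```python
import bisect
import math
import random

random.seed(0)
samples = sorted(random.expovariate(1.0) for _ in range(100_000))  # f ~ exponential, rate 1

def empirical_cdf(x):
    # Fraction of samples <= x: an estimate of F(x) = mu{f <= x}.
    return bisect.bisect_right(samples, x) / len(samples)

for x in (0.5, 1.0, 2.0):
    exact = 1 - math.exp(-x)          # exact cdf of the exponential law
    print(x, empirical_cdf(x), exact)
```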
Exercises 6.4.

1. (Poisson distribution) In this and the next problem we introduce two important probability distributions by describing two different models of the decay of particles. Suppose that we have a large number of radioactive particles, say $n$ of them, and we observe their decay over a time interval $[0, t]$ (say time is in hours). Assume that the average rate of decay during the time interval $[0, t]$ is $\lambda$/hour, meaning that on average, $\lambda$ particles decay per hour during our observation interval $[0, t]$.
(i) Let $k \in \mathbb{N}_0 := \{0, 1, 2, \ldots\}$. Treat each of the $n$ particles as a Bernoulli trial: Assign it a 1 ("success") if it decays in the interval $[0, t]$ and a 0 ("failure") if it does not decay. Argue that the probability a given particle decays in the interval $[0, t]$ is $\lambda t/n$. Conclude that, in the time interval $[0, t]$, the probability the number of decays is $k$ is exactly the binomial mass function
$$b\Big(k; n, \frac{\lambda t}{n}\Big) = \binom{n}{k} \Big(\frac{\lambda t}{n}\Big)^k \Big(1 - \frac{\lambda t}{n}\Big)^{n-k}.$$
Conclude that the law describing probabilistically the number of decays in the interval $[0, t]$ is given by the binomial distribution $B(n, p)$ where $p = \lambda t/n$.
(ii) Prove that
$$\lim_{n\to\infty} b\Big(k; n, \frac{\lambda t}{n}\Big) = \frac{(\lambda t)^k}{k!}\, e^{-\lambda t}.$$
(A numerical illustration of this limit is sketched at the end of these exercises.)
(iii) The function $m : \mathbb{N}_0 \to [0, \infty)$ defined by $m(k) = \frac{(\lambda t)^k}{k!}\, e^{-\lambda t}$ for all $k \in \mathbb{N}_0$ is called the Poisson mass function. Prove that $m$ is a probability mass function; its corresponding measure on $\mathbb{N}_0$ is called the Poisson distribution, which we denote by $\mathrm{Pois}(\lambda t)$. This distribution is named after Siméon-Denis Poisson (1781–1840). In conclusion, you have proven that for $n$ large,$^{20}$
$$B\Big(n, \frac{\lambda t}{n}\Big) \approx \mathrm{Pois}(\lambda t);$$
or in terms of our radioactive decay model, if the average rate of decay is $\lambda$/hour in a time interval $[0, t]$ hours, then for $n$ large,
$$\text{the probability the number of decays is } k \approx \frac{(\lambda t)^k}{k!}\, e^{-\lambda t}.$$
Note that since the probability of a success is $p = \lambda t/n$, we can write this approximation as
$$\text{the probability the number of decays is } k \approx \frac{(np)^k}{k!}\, e^{-np}.$$

$^{20}$ We use the term "large" to mean: given any error bound, we can take $n$ as large as necessary to get an approximation within the error bound.

2. (Exponential distribution) We now make a different model for radioactive decay. Assume as before that we have $n$ particles to begin with. We now do not work on a fixed time interval but we let $t$ vary. Let $N(t)$ = number of particles remaining after $t$ hours, and we assume that the rate at which $N(t)$ changes is some fixed proportion $r$, called the decay constant, of the number of particles present at time $t$; that is, we assume that for some constant $r \in (0, 1)$,
$$N'(t) = -r\, N(t) \quad \text{for all } t > 0.$$
(i) Using that $N(0) = n$, prove that $N(t) = n\, e^{-rt}$ for all $t \geq 0$.
(ii) Show that the proportion of particles that have decayed during the time interval $[0, t]$ is $(1 - e^{-rt})$. From this fact, argue that the probability that a given particle decays in the time interval $[0, t]$ is also $1 - e^{-rt}$.
(iii) If $P_t$ = the probability that a given particle decays in the time interval $[0, t]$, prove that
$$P_t = \int_0^t r\, e^{-rx}\, dx.$$
(iv) Prove that the function $\rho : \mathbb{R} \to [0, \infty)$ defined by $\rho(x) := r\, e^{-rx}$ for $x \geq 0$ and $\rho(x) = 0$ otherwise, is a probability density function; such a density function is called an exponential density function. Then show that if $\nu : \mathcal{B} \to [0, 1]$ is the law whose pdf is $\rho$, called an exponential distribution, then for any interval $I \subseteq \mathbb{R}$,
$$\nu(I) = \text{the probability that a given particle decays in the time interval } I.$$

3. (The Poisson and exponential distributions) The previous two problems presented two models of radioactive decay. In this problem we relate them. We work under the assumptions of Problem 2.
(i) Fix $t > 0$ and recall from (ii) of Problem 2 that the probability a given particle decays in the time interval $[0, t]$ is $1 - e^{-rt}$. Now treat each of the $n$ particles as a Bernoulli trial with a 1 ("success") if it decays in the interval $[0, t]$ and a 0 ("failure") if it does not decay. Given $k \in \mathbb{N}_0$, prove that, in the time interval $[0, t]$,
$$(6.30)\qquad \text{the probability the number of decays is } k = \binom{n}{k} \big(1 - e^{-rt}\big)^k \big(e^{-rt}\big)^{n-k}.$$
(ii) To relate Formula (6.30) to Problem 1, recall from Problem 2 that the number of particles remaining after $t$ hours equals $n\, e^{-rt}$. Prove that the average rate of decay of the number of particles in the time interval $[0, t]$ is $n(1 - e^{-rt})/t$. Thus, using the notation from Problem 1, we have
$$\lambda = \frac{n(1 - e^{-rt})}{t};$$
solving this equation for $e^{-rt}$ and plugging the result into (6.30), obtain for the time interval $[0, t]$ and for $k \in \mathbb{N}_0$,
$$(6.31)\qquad \text{the probability the number of decays is } k = \binom{n}{k} \Big(\frac{\lambda t}{n}\Big)^k \Big(1 - \frac{\lambda t}{n}\Big)^{n-k}.$$
This is the formula in Part (i) of Problem 1.
(iii) We now relate (6.31) to the Poisson mass function. Fix $t > 0$. Recall in Part (iii) of Problem 1 we got the Poisson mass function assuming (1) the number of particles $n$ is large and (2) the average number of decays/hour, $\lambda$, is some constant. Unfortunately, in our case, $\lambda = n(1 - e^{-rt})/t$, so $\lambda$ depends on $n$ and is not constant! For this reason, let us assume the decay constant $r$ is inversely proportional to $n$; specifically, we assume that $r = \beta/n$ for some constant $\beta$. Explain why for $n$ large, $\lambda \approx \beta$; thus, $\lambda$ is approximately constant! From this,

conclude from our assumption on $r$ that, for the time interval $[0, t]$ and for $n$ large,
$$\text{the probability the number of decays is } k \approx \frac{(\lambda t)^k}{k!}\, e^{-\lambda t},$$
exactly as before.

4. (Problems on probability distributions)
(a) Prove that if a pdf exists for the probability distribution of a random variable $f$, then $f$ must be continuous (= not discrete), which means the range of $f$ is uncountable.
(b) Let $X = [0, 1]$ with Lebesgue measure and let $\psi : X \to \mathbb{R}$ be Cantor's function. Show that $\psi$ does not have a pdf.
(c) Let $X = Y^{\infty}$ with $Y = \{0, 1\}$ and consider the infinite product measure on $X$ assigning probabilities $1/2$ to both 0 and 1 on each factor. Prove that the function $f : X \to \mathbb{R}$ defined by
$$f(x_1, x_2, x_3, \ldots) := \sum_{n=1}^{\infty} \frac{x_n}{2^n} \quad \text{for all } (x_1, x_2, x_3, \ldots) \in X,$$
has distribution function $F(t) = 0$ for $t < 0$, $F(t) = t$ for $0 \leq t \leq 1$ and $F(t) = 1$ for $1 < t$. In particular, $P_f = m$ = Lebesgue measure when restricted to $[0, 1]$.

5. (Convergence to a normal) Using the notation from the de Moivre–Laplace Theorem, we know that
$$(6.32)\qquad \lim_{n\to\infty} \mu_n(A) = \nu(A),$$
where $A = (a, b]$ for any $a, b \in [-\infty, \infty]$, $\mu_n(A) := \mu\big\{ \frac{S_n - np}{\sqrt{npq}} \in A \big\}$, and $\nu = N(0, 1)$. Does (6.32) hold when $A \subseteq \mathbb{R}$ is a Borel set? The answer is "sometimes"; to understand what this means, proceed as follows.
(i) If $U \subseteq \mathbb{R}$ is open, prove that $\nu(U) \leq \liminf \mu_n(U)$. Suggestion: Write $U = \bigcup_{k=1}^{\infty} I_k$ where $I_1, I_2, \ldots \in \mathcal{I}^1$ are pairwise disjoint. Then $\mu_n(U) \geq \sum_{k=1}^{N} \mu_n(I_k)$ for any $N$. Take $\liminf$ of both sides, then take $N \to \infty$.
(ii) If $C \subseteq \mathbb{R}$ is closed, prove that $\limsup \mu_n(C) \leq \nu(C)$.
(iii) If $A \subseteq \mathbb{R}$ is a Borel set whose boundary has Lebesgue measure zero, prove that $\lim \mu_n(A) = \nu(A)$. (Recall that the boundary of $A$ is $\overline{A} \setminus A^0$ where $\overline{A}$ is the closure of $A$ and $A^0$ is the interior of $A$.)
(iv) Find a set $A \subseteq \mathbb{R}$ such that $\lim \mu_n(A) \neq \nu(A)$.

6. (Scheffé's theorem, after Henry Scheffé (1907–1977) [316, 343]) Let $\{f_n\}$ be a sequence of pdfs (thus, $f_n \geq 0$ and $\int f_n = 1$ for each $n$) and suppose that $f := \lim f_n$ exists a.e. and $f$ is also a pdf. In this problem we prove that for all Lebesgue measurable sets $A \subseteq \mathbb{R}$,
$$\lim_{n\to\infty} \int_A f_n = \int_A \lim_{n\to\infty} f_n.$$
This result is interesting because it gives the conclusion of the DCT, although there is no mention of a dominating function (indeed, a dominating function may not exist).
(i) Prove that $\int |f_n - f| \to 0$ as $n \to \infty$ and from this result prove Scheffé's theorem. Suggestion: Apply Fatou's lemma to $g_n := f_n + f - |f_n - f|$.
(ii) Here's a situation for which Scheffé's theorem applies, but the DCT does not. For each $n \in \mathbb{N}$, let $a_n = \sum_{k=1}^{n} 1/k$ and define $f_n : \mathbb{R} \to \mathbb{R}$ by $f_n(x) = 1$ for $1/(n+1) \leq x \leq 1$ and for $a_n \leq x \leq a_{n+1}$, and $f_n(x) = 0$ otherwise. Using Scheffé's theorem prove that $\int_A f_n \to \int_A \lim f_n$ for all Lebesgue measurable sets $A \subseteq \mathbb{R}$; however, prove that there is no integrable function $g : \mathbb{R} \to \mathbb{R}$ such that $|f_n| \leq g$ for all $n$.
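Here is the numerical sketch promised in Problem 1(ii); it is ours, not part of the exercises, and the values $\lambda = 2$, $t = 1.5$, $k = 4$ are arbitrary test choices. It tabulates the binomial mass $b(k; n, \lambda t/n)$ against the Poisson mass $(\lambda t)^k e^{-\lambda t}/k!$ as $n$ grows.

```python
import math

def binom_mass(k, n, p):
    # Binomial mass function b(k; n, p).
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

lam, t, k = 2.0, 1.5, 4       # decay rate lambda, time window t, number of decays k
poisson = (lam * t)**k / math.factorial(k) * math.exp(-lam * t)
for n in (10, 100, 1000, 10_000):
    print(n, binom_mass(k, n, lam * t / n))
print("Poisson limit:", poisson)
```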

6.5. Independence and identically distributed random variables

The goal of this section is to understand i.i.d. random variables.

6.5.1. I.d. random variables. Given random variables $f_1, f_2, \ldots : X \to \mathbb{R}$, we say that $f_1, f_2, \ldots$ are identically distributed (or i.d.) if $P_{f_i} = P_{f_j}$ for all $i, j$. Thus, identically distributed just means having identical distributions.
Example 6.5. (Repeating an experiment) One standard way to generate identically distributed random variables is through repeating an experiment countably many times. Let $Y$ be a sample space with a probability measure $\mu_0 : \mathcal{I} \to [0, 1]$ on some semiring $\mathcal{I}$ of subsets of $Y$ and let $f : Y \to \mathbb{R}$ be a random variable. Let $X = Y^{\infty}$ with probability measure $\mu : \mathcal{S}(\mathcal{C}) \to [0, 1]$, the infinite product of $\mu_0$ with itself, where $\mathcal{C}$ is the cylinder sets generated by $\mathcal{I}$. For each $i \in \mathbb{N}$ define $f_i : X \to \mathbb{R}$ by
$$(6.33)\qquad f_i(x_1, x_2, \ldots) = f(x_i).$$
For example, if $Y = \{0, 1\}$ and $f(x) = x$ (thus, $f(x) = 0$ when $x = 0$ and $f(x) = 1$ when $x$ is 1), then $f_i = \chi_{A_i}$ where
$$A_i = Y \times Y \times \cdots \times Y \times \{1\} \times Y \times \cdots,$$
with $\{1\}$ occurring in the $i$th slot. These $f_i$'s are exactly the random variables that occurred in Bernoulli's theorem and Borel's Strong Law of Large Numbers we covered back in Sections 2.4 and 4.2.
In the general case in (6.33) we claim that $f_1, f_2, \ldots$ are identically distributed, that is, have the same distribution. To see this, let $A \subseteq \mathbb{R}$ be a Borel set and observe that
$$f_i^{-1}(A) := \{f_i \in A\} = \{(x_1, x_2, \ldots)\, ;\ f(x_i) \in A\} = Y \times \cdots \times Y \times \{f \in A\} \times Y \times \cdots,$$
where $\{f \in A\}$ occurs in the $i$th factor. Thus,
$$P_{f_i}(A) = \mu_0\{f \in A\},$$
which is independent of $i$. Hence, $P_{f_i}(A) = P_{f_j}(A)$ for any $i$ and $j$ and therefore $f_1, f_2, \ldots$ are identically distributed.

Generally speaking, i.d. random variables have identical characteristics insofar as these characteristics can be expressed in terms of probabilities; for example, they have the same expected values. In order to prove this fact, we first prove the following theorem, whose proof is a typical use of the principle of appropriate functions as explained in Section 5.5.2.

Integration and distributions

Theorem 6.19. If $f : X \to \mathbb{R}$ is a random variable and $\varphi : \mathbb{R} \to \mathbb{R}$ is a Borel measurable function, then
$$\int_X \varphi(f)\, d\mu = \int_{\mathbb{R}} \varphi\, dP_f$$
in the sense that the left-hand integral is defined if and only if the right-hand integral is, in which case both are equal. In particular, provided $f$ is integrable, we have
$$E(f) = \int_{\mathbb{R}} x\, dP_f.$$

Proof: Given a random variable $f : X \to \mathbb{R}$, we will prove that
$$(6.34)\qquad \int_X \varphi(f)\, d\mu = \int_{\mathbb{R}} \varphi\, dP_f$$
holds for any Borel measurable function $\varphi : \mathbb{R} \to \mathbb{R}$ (for which the integrals are defined) using the principle of appropriate functions: We first prove (6.34) when $\varphi$ is a characteristic function, then for nonnegative simple functions, then for nonnegative measurable functions, and then finally for arbitrary functions.
First, let $A$ be a Borel subset of $\mathbb{R}$; we need to show that
$$\int_X \chi_A(f)\, d\mu = \int_{\mathbb{R}} \chi_A\, dP_f.$$
To see this, observe that for any point $x \in X$, we have
$$\chi_A(f(x)) = 1 \iff f(x) \in A \iff x \in f^{-1}(A) \iff \chi_{f^{-1}(A)}(x) = 1.$$
Thus, we obtain the very useful formula
$$(6.35)\qquad \chi_A(f) = \chi_{f^{-1}(A)}.$$
Hence,
$$\int_X \chi_A(f)\, d\mu = \int_X \chi_{f^{-1}(A)}\, d\mu = \mu(f^{-1}(A)) = P_f(A) = \int_{\mathbb{R}} \chi_A\, dP_f,$$
where the second and fourth equalities hold by definition of the integral and the third by definition of $P_f$. Thus, (6.34) holds for characteristic functions.
By linearity of the integral, (6.34) holds for simple functions: If $s = \sum_{n=1}^{N} a_n \chi_{A_n}$, where the $a_n$'s are nonnegative and the $A_n$'s are Borel sets, then
$$\int_X s(f)\, d\mu = \int_X \sum_{n=1}^{N} a_n \chi_{A_n}(f)\, d\mu = \sum_{n=1}^{N} a_n \int_X \chi_{A_n}(f)\, d\mu = \sum_{n=1}^{N} a_n \int_{\mathbb{R}} \chi_{A_n}\, dP_f = \int_{\mathbb{R}} \sum_{n=1}^{N} a_n \chi_{A_n}\, dP_f = \int_{\mathbb{R}} s\, dP_f.$$
If $\varphi$ is a nonnegative measurable function, then writing $\varphi = \lim s_n$ as a nondecreasing limit of nonnegative simple functions, it follows that $\varphi(f) = \lim s_n(f)$ is also a nondecreasing limit of measurable functions, so by the monotone convergence theorem, we have
$$\int_X \varphi(f)\, d\mu = \lim_{n\to\infty} \int_X s_n(f)\, d\mu = \lim_{n\to\infty} \int_{\mathbb{R}} s_n\, dP_f = \int_{\mathbb{R}} \varphi\, dP_f.$$
Thus, (6.34) holds for any nonnegative measurable function $\varphi$. Finally, if $\varphi$ is an arbitrary Borel measurable function, then writing $\varphi = \varphi_+ - \varphi_-$ as the difference of its nonnegative and nonpositive parts it follows that
$$\int_X \varphi_+(f)\, d\mu = \int_{\mathbb{R}} \varphi_+\, dP_f \quad\text{and}\quad \int_X \varphi_-(f)\, d\mu = \int_{\mathbb{R}} \varphi_-\, dP_f.$$

Hence, $\varphi(f)$ is $\mu$-integrable if and only if $\varphi$ is $P_f$-integrable, in which case
$$\int_X \varphi(f)\, d\mu := \int_X \varphi_+(f)\, d\mu - \int_X \varphi_-(f)\, d\mu = \int_{\mathbb{R}} \varphi_+\, dP_f - \int_{\mathbb{R}} \varphi_-\, dP_f =: \int_{\mathbb{R}} \varphi\, dP_f. \qquad \square$$
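A concrete check of Theorem 6.19 (a sketch of ours, using a three-point sample space so both sides can be computed exactly, with arbitrary test weights and values): on the left we integrate $\varphi(f)$ over $X$ against $\mu$, on the right we integrate $\varphi$ against the push-forward law $P_f$.

```python
# X = {1, 2, 3} with weights mu; f takes the values below; phi(x) = x**2.
mu = {1: 0.2, 2: 0.5, 3: 0.3}
f = {1: -1.0, 2: 0.0, 3: 2.0}
phi = lambda x: x**2

lhs = sum(phi(f[w]) * mu[w] for w in mu)           # integral of phi(f) over X against mu

# Law P_f: push the weights of mu forward to the values of f.
Pf = {}
for w, p in mu.items():
    Pf[f[w]] = Pf.get(f[w], 0.0) + p
rhs = sum(phi(x) * p for x, p in Pf.items())       # integral of phi against P_f

print(lhs, rhs)   # the two numbers agree
```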

Integration and pdfs

Corollary 6.20. If $f : X \to \mathbb{R}$ is a random variable whose probability distribution has pdf $\rho : \mathbb{R} \to [0, \infty)$ and $\varphi : \mathbb{R} \to \mathbb{R}$ is a Borel measurable function, then
$$\int_X \varphi(f)\, d\mu = \int_{\mathbb{R}} \varphi(x)\, \rho(x)\, dx$$
in the sense that the left-hand integral is defined if and only if the right-hand integral is, in which case both are equal. In particular, provided $f$ is integrable, we have
$$E(f) = \int_{\mathbb{R}} x\, \rho(x)\, dx.$$

Proof: Given that
$$P_f(A) = \int_A \rho(x)\, dx \quad \text{for all Borel sets } A \subseteq \mathbb{R},$$
and that $\int_X \varphi(f)\, d\mu = \int_{\mathbb{R}} \varphi\, dP_f$ from the preceding theorem, all we have to do is check that
$$\int_{\mathbb{R}} \varphi\, dP_f = \int_{\mathbb{R}} \varphi(x)\, \rho(x)\, dx.$$
This is an application of the principle of appropriate functions, so we shall leave it as an exercise for you. (We have two more PAF proofs to do in this section, as we'll see, so we won't bore you with another one; in any case, a slightly more general version of this exercise was given in Problem 14 of Exercises 5.5.) $\square$

The following theorem contains some important properties of distributions and identically distributed random variables. The proof is another typical use of the principle of appropriate functions.

Properties of i.d. R.V.s

Theorem 6.21. Let $f : X \to \mathbb{R}$ and $g : X \to \mathbb{R}$ be identically distributed random variables. Then for any Borel measurable function $\varphi : \mathbb{R} \to \mathbb{R}$,
(1) $\varphi(f)$ and $\varphi(g)$ are also identically distributed.
(2) We have
$$\int_X \varphi(f)\, d\mu = \int_X \varphi(g)\, d\mu$$
in the sense that the left-hand integral is defined if and only if the right-hand integral is, in which case both are equal.

Proof: To prove (1), observe that for any Borel set $A \subseteq \mathbb{R}$,
$$P_{\varphi(f)}(A) = \mu\big((\varphi \circ f)^{-1}(A)\big) = \mu\big(f^{-1}(\varphi^{-1}(A))\big) \qquad (\text{since } (\varphi \circ f)^{-1} = f^{-1} \circ \varphi^{-1})$$
$$= P_f(\varphi^{-1}(A)) = P_g(\varphi^{-1}(A)) \qquad (\text{since } f \text{ and } g \text{ are i.d.})$$
$$= \mu\big(g^{-1}(\varphi^{-1}(A))\big) = \mu\big((\varphi \circ g)^{-1}(A)\big) = P_{\varphi(g)}(A).$$
To prove (2), observe that if $f$ and $g$ are identically distributed, then $P_f = P_g$, so
$$\int_{\mathbb{R}} \varphi\, dP_f = \int_{\mathbb{R}} \varphi\, dP_g,$$
provided one integral (and hence both) is defined. By Theorem 6.19 the left and right sides equal $\int \varphi(f)\, d\mu$ and $\int \varphi(g)\, d\mu$, respectively. This concludes our proof. $\square$

Example 6.6. By (1) of Theorem 6.21, if $f$ and $g$ are identically distributed, then with $\varphi(x) = \max\{x, 0\}$ we see that $f_+$ and $g_+$ are identically distributed, and with $\varphi(x) = -\min\{x, 0\}$ we see that $f_-$ and $g_-$ are identically distributed. With $\varphi(x) = |x|$, we also see that $|f|$ and $|g|$ are identically distributed.

6.5.2. Independent random variables. Now that we know what identically distributed random variables are, we define independent random variables, which intuitively speaking are random variables whose values are distributed independently. More precisely, real-valued random variables $f_1, f_2, f_3, \ldots$ are independent if $\{f_1 \in B_1\}, \{f_2 \in B_2\}, \{f_3 \in B_3\}, \ldots$ are independent for any family $B_1, B_2, B_3, \ldots \subseteq \mathbb{R}$ of Borel sets. Written another way, we require that the subsets of $X$,
$$f_1^{-1}(B_1),\ f_2^{-1}(B_2),\ f_3^{-1}(B_3),\ \ldots,$$
be independent. Here, we recall that events $A_1, A_2, A_3, \ldots$ are independent means that for any finite subcollection $A_i, A_j, \ldots, A_k$ of $A_1, A_2, A_3, \ldots$, we have
$$\mu(A_i \cap A_j \cap \cdots \cap A_k) = \mu(A_i)\, \mu(A_j) \cdots \mu(A_k).$$
We say that $f_1, f_2, f_3, \ldots$ are pairwise independent if for any $i \neq j$, $f_i, f_j$ are independent. Finally, we say that random variables $f_1, f_2, f_3, \ldots$ are i.i.d. if they are independent and identically distributed.
Example 6.7. (Repeating an experiment, again) Going back to Example 6.5, we claim that the sequence $f_1, f_2, \ldots$ in (6.33) is independent. To see this, let $B_1, B_2, \ldots \subseteq \mathbb{R}$ be Borel sets and recall from Example 6.5 that
$$f_i^{-1}(B_i) = Y \times \cdots \times Y \times \{f \in B_i\} \times Y \times \cdots,$$
where $\{f \in B_i\}$ occurs in the $i$th factor. It's easy to see that such sets are independent, so $f_1, f_2, \ldots$ are independent. Thus, $f_1, f_2, \ldots$ are i.i.d.

An important property of independent random variables is that expectations behave multiplicatively on such variables. The proof is yet one more typical use of the principle of appropriate functions!

Theorem 6.22. If $f_1, \ldots, f_n$ are independent random variables, then for any Borel measurable functions $\varphi_1, \ldots, \varphi_n$,
(1) $\varphi_1(f_1), \ldots, \varphi_n(f_n)$ are independent.
(2) We have
$$\int \varphi_1(f_1)\, \varphi_2(f_2) \cdots \varphi_n(f_n) = \int \varphi_1(f_1) \cdot \int \varphi_2(f_2) \cdots \int \varphi_n(f_n),$$
provided each of these integrals is defined. That is, the integral of the product is the product of the integrals, or stated in terms of expectations,
$$E\big(\varphi_1(f_1)\, \varphi_2(f_2) \cdots \varphi_n(f_n)\big) = E(\varphi_1(f_1))\, E(\varphi_2(f_2)) \cdots E(\varphi_n(f_n)).$$
Proof: Using the definition of independence, we leave it as an exercise to check that $\varphi_1(f_1), \ldots, \varphi_n(f_n)$ are independent. For saneness of notation, we shall only prove (2) for two independent random variables $f$ and $g$. Using the principle of appropriate functions, we shall prove that
$$(6.36)\qquad \int \varphi(f)\, \psi(g) = \int \varphi(f) \cdot \int \psi(g)$$
holds for any Borel measurable functions $\varphi, \psi$ (for which the integrals are defined). First, let $A$ and $B$ be Borel subsets of $\mathbb{R}$; we need to show that
$$\int \chi_A(f)\, \chi_B(g) = \int \chi_A(f) \cdot \int \chi_B(g).$$
By the formula (6.35), we have $\chi_A(f) = \chi_{f^{-1}(A)}$ and $\chi_B(g) = \chi_{g^{-1}(B)}$. Hence,
$$\int \chi_A(f)\, \chi_B(g) = \int \chi_{f^{-1}(A)}\, \chi_{g^{-1}(B)} = \int \chi_{f^{-1}(A) \cap g^{-1}(B)} = \mu\big(f^{-1}(A) \cap g^{-1}(B)\big)$$
$$= \mu\big(f^{-1}(A)\big)\, \mu\big(g^{-1}(B)\big) = \int \chi_A(f) \cdot \int \chi_B(g),$$
where in the fourth equality we used that $f^{-1}(A)$ and $g^{-1}(B)$ are independent. Since simple functions are just linear combinations of characteristic functions, a short computation shows that
$$\int s(f)\, t(g) = \int s(f) \cdot \int t(g)$$
for any nonnegative Borel measurable simple functions $s$ and $t$. (Just write $s$ and $t$ as linear combinations of characteristic functions and multiply out the left and right-hand sides of the above equality to see that it's in fact an equality.) Using the monotone convergence theorem, it follows that if $\varphi$ and $\psi$ are nonnegative measurable functions, then writing them as limits of nondecreasing sequences of nonnegative simple functions, we get
$$\int \varphi(f)\, \psi(g) = \int \varphi(f) \cdot \int \psi(g).$$

Finally, if $\varphi$ and $\psi$ are arbitrary integrable functions, then we have $\varphi = \varphi_+ - \varphi_-$ and $\psi = \psi_+ - \psi_-$, so
$$\int \varphi(f)\, \psi(g) = \int \big(\varphi_+(f) - \varphi_-(f)\big)\big(\psi_+(g) - \psi_-(g)\big)$$
$$= \int \varphi_+(f)\, \psi_+(g) - \int \varphi_+(f)\, \psi_-(g) - \int \varphi_-(f)\, \psi_+(g) + \int \varphi_-(f)\, \psi_-(g)$$
$$= \int \varphi_+(f) \int \psi_+(g) - \int \varphi_+(f) \int \psi_-(g) - \int \varphi_-(f) \int \psi_+(g) + \int \varphi_-(f) \int \psi_-(g)$$
$$= \Big(\int \varphi_+(f) - \int \varphi_-(f)\Big)\Big(\int \psi_+(g) - \int \psi_-(g)\Big) = \int \varphi(f) \cdot \int \psi(g). \qquad \square$$

Example 6.8. By (1) of Theorem 6.22, if $f$ and $g$ are independent, then with $\varphi(x) = \psi(x) = \max\{x, 0\}$ we see that $f_+ = \varphi(f)$ and $g_+ = \psi(g)$ are independent. Similarly, $f_-$ and $g_-$ are independent, and $|f|$ and $|g|$ are independent.
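Here is a small numerical sketch of ours (not from the text) of Theorem 6.22(2): two random variables built from independent coordinates of a product space have multiplicative expectations, while the dependent pair $(f, f)$ generally does not.

```python
import random

random.seed(1)
N = 200_000
# f and g are functions of independent coordinates, hence independent.
pairs = [(random.uniform(0, 1), random.uniform(0, 1)) for _ in range(N)]
f_vals = [u for u, v in pairs]
g_vals = [v**2 for u, v in pairs]

E = lambda xs: sum(xs) / len(xs)
print(E([a * b for a, b in zip(f_vals, g_vals)]), E(f_vals) * E(g_vals))  # close
print(E([a * a for a in f_vals]), E(f_vals) ** 2)  # differ: f is not independent of itself
```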

6.5.3. Variance. Let $f : X \to \mathbb{R}$ be an integrable random variable. Recall that the idea behind defining the expectation $\mu = \int f$ was that $f$ may be quite complicated (being a function with possibly many values on a possibly very complicated sample space), so we wanted a number that gives useful information about $f$. The expectation is one such number since it represents the average, or mean, value of $f$ if the experiment is repeated a large number of times. It would also be useful to find a number that tells us how far the value of $f$ may be from its expected value on any single experiment. To define this number, observe that the random variable $|f - \mu|$ measures the deviation of $f$ from its average. We could also use $(f - \mu)^2$ as a measure of the deviation of $f$ from its average; because squaring large numbers makes them larger (e.g. $10^2 = 100$) and squaring small numbers makes them smaller (e.g. $(1/10)^2 = 1/100$), the random variable $(f - \mu)^2$ tends to emphasize the larger deviations of $f$ from $\mu$. If we take the average value of $(f - \mu)^2$ we get what is called the variance of $f$:
$$\operatorname{Var} f := E[(f - \mu)^2] = \int (f - \mu)^2.$$
Since $(f - \mu)^2 = f^2 - 2\mu f + \mu^2$, we have
$$E[(f - \mu)^2] = E(f^2) - 2\mu\, E(f) + \mu^2 = E(f^2) - 2\mu^2 + \mu^2 = E(f^2) - \mu^2,$$
so an alternative way of writing the variance is
$$\operatorname{Var} f = E(f^2) - \mu^2.$$
The standard deviation of $f$ is the square root of the variance:
$$\sigma(f) := \sqrt{\operatorname{Var} f} = \Big( \int (f - \mu)^2 \Big)^{1/2}.$$
Both $\operatorname{Var} f$ and $\sigma(f)$ measure how much the values of $f$ are spread from its mean.

Example 6.9. (Normally distributed random variables) Let $f$ be a random variable and assume that $f \sim N(\mu, \sigma)$ for some $\mu \in \mathbb{R}$ and $\sigma > 0$. We called $\mu$ the mean and $\sigma$ the standard deviation . . . we now show these labels are correct! Indeed, by Corollary 6.20 we know that
$$E(f) = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{\mathbb{R}} x\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx \quad\text{and}\quad E(f^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{\mathbb{R}} x^2\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx.$$
Making the change of variables $x = \mu + \sqrt{2\sigma^2}\, y$, we obtain
$$E(f) = \frac{1}{\sqrt{\pi}} \int_{\mathbb{R}} \big(\mu + \sqrt{2\sigma^2}\, y\big)\, e^{-y^2}\, dy \quad\text{and}\quad E(f^2) = \frac{1}{\sqrt{\pi}} \int_{\mathbb{R}} \big(\mu + \sqrt{2\sigma^2}\, y\big)^2\, e^{-y^2}\, dy.$$
Since
(1) $\int_{\mathbb{R}} e^{-y^2}\, dy = \sqrt{\pi}$,
(2) $\int_{\mathbb{R}} y\, e^{-y^2}\, dy = 0$ (since $y\, e^{-y^2}$ is an odd function on $\mathbb{R}$), and
(3) $\int_{\mathbb{R}} y^2\, e^{-y^2}\, dy = \sqrt{\pi}/2$,$^{21}$
we leave you to show that $E(f) = \mu$ and $E(f^2) = \sigma^2 + \mu^2$. This proves our result.
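The two Gaussian integrals above are easy to check numerically; the sketch below is ours (the parameter values are arbitrary) and estimates $E(f)$ and $E(f^2)$ for $f \sim N(\mu, \sigma)$ by Monte Carlo, comparing them with $\mu$ and $\sigma^2 + \mu^2$.

```python
import random

random.seed(2)
mu, sigma, N = 1.7, 0.8, 500_000
xs = [random.gauss(mu, sigma) for _ in range(N)]   # samples of f ~ N(mu, sigma)

Ef = sum(xs) / N
Ef2 = sum(x * x for x in xs) / N
print(Ef, mu)                    # E(f)   is approximately mu
print(Ef2, sigma**2 + mu**2)     # E(f^2) is approximately sigma^2 + mu^2
```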

The following theorem is useful when studying sums of random variables.

Theorem 6.23. If $f_1, \ldots, f_n$ are pairwise independent with finite expectations and variances, then
$$\operatorname{Var}\Big( \sum_{k=1}^{n} f_k \Big) = \sum_{k=1}^{n} \operatorname{Var} f_k.$$

Proof: Let $g = \sum_{k=1}^{n} f_k$, and observe that $E(g) = \sum_{k=1}^{n} \mu_k$, where $\mu_k = E(f_k)$, and that
$$(g - E(g))^2 = \Big( \sum_{k=1}^{n} (f_k - \mu_k) \Big)^2 = \sum_{j,k=1}^{n} (f_j - \mu_j)(f_k - \mu_k) = \sum_{k=1}^{n} (f_k - \mu_k)^2 + \sum_{j \neq k} (f_j - \mu_j)(f_k - \mu_k).$$
By independence and Theorem 6.22, for $j \neq k$ we have
$$E[(f_j - \mu_j)(f_k - \mu_k)] = E(f_j - \mu_j)\, E(f_k - \mu_k) = (\mu_j - \mu_j)(\mu_k - \mu_k) = 0,$$
so
$$\operatorname{Var} g = E[(g - E(g))^2] = \sum_{k=1}^{n} E(f_k - \mu_k)^2 = \sum_{k=1}^{n} \operatorname{Var} f_k. \qquad \square$$

$^{21}$ Differentiate both sides of the equality $\int_{\mathbb{R}} e^{-t y^2}\, dy = \sqrt{\pi}\, t^{-1/2}$ with respect to $t$ (to verify the equality, change variables $y \mapsto y\, t^{-1/2}$ in the integral) to get $\int_{\mathbb{R}} y^2\, e^{-t y^2}\, dy = \sqrt{\pi}\, t^{-3/2}/2$, then take $t = 1$.

6.5.4. Probabilistic quantities for i.i.d. random variables. The following theorem is no surprise.

Theorem 6.24. Identically distributed random variables have the same expectation value, variance, and standard deviation (provided these notions are defined for the random variables).

Proof: Let $f$ and $g$ be identically distributed and integrable. Then with $\varphi(x) = x$, Property (2) of Theorem 6.21 implies that
$$\int f = \int g; \quad\text{that is,}\quad E(f) = E(g),$$
so identically distributed random variables have the same expectation. With $\mu = E(f) = E(g)$ and $\varphi(x) = (x - \mu)^2$, Property (2) of Theorem 6.21 implies that
$$\int (f - \mu)^2 = \int (g - \mu)^2,$$
so $\operatorname{Var} f = \operatorname{Var} g$. In particular, $\sigma(f) = \sigma(g)$. $\square$

Let's recall the set-up in our "repeating an experiment" Examples 6.5 and 6.7. Let $Y$ be a sample space with a probability measure and let $f : Y \to \mathbb{R}$ be a random variable. Then $X = Y^{\infty}$ is the sample space with measure the infinite product of the measure on $Y$. For each $i \in \mathbb{N}$, define $f_i : X \to \mathbb{R}$ by
$$f_i(x_1, x_2, \ldots) = f(x_i),$$
which represents the data the random variable $f$ assigns to the outcome on the $i$th trial of the experiment. We summarize the content of Examples 6.5 and 6.7 and Theorem 6.24 in the following:

I.i.d. R.V.s from repeating an experiment

Theorem 6.25. The random variables $f_1, f_2, f_3, \ldots$ are i.i.d. and
$$E(f_i) = \mu, \quad \operatorname{Var}(f_i) = \sigma^2, \quad \sigma(f_i) = \sigma \quad \text{for all } i,$$
where $\mu$ and $\sigma$ are the common expectation and standard deviation of the $f_i$'s, which equal the expectation and standard deviation of the original random variable $f$. In particular, if $S_n = f_1 + \cdots + f_n$, then
$$E(S_n) = n\mu, \quad \operatorname{Var}(S_n) = n\sigma^2, \quad \sigma(S_n) = \sqrt{n}\, \sigma.$$
The equality for $\operatorname{Var}(S_n)$ (and consequently for $\sigma(S_n)$) follows from Theorem 6.23. Note that the same theorem holds for any sequence of i.i.d. random variables $f_1, f_2, f_3, \ldots$ if we drop the phrase "which equal the expectation and standard deviation of the original random variable $f$" from the theorem.
Example 6.10. (Binomially distributed random variables) As we've seen many times, let $Y = \{0, 1\}$, assign $\{1\}$ a probability $p \in (0, 1)$ and $\{0\}$ the probability $q$ where $q = 1 - p$, and define $f : Y \to \mathbb{R}$ by $f(0) = 0$ and $f(1) = 1$. Then
$$E(f) = 0 \cdot q + 1 \cdot p = p,$$

and
$$\operatorname{Var}(f) = E(f^2) - (E(f))^2 = 0^2 \cdot q + 1^2 \cdot p - p^2 = p - p^2 = p(1 - p) = pq.$$
Thus, $\sigma(f) = \sqrt{pq}$. It follows that if $S_n = f_1 + \cdots + f_n$, then
$$E(S_n) = np, \quad \operatorname{Var}(S_n) = npq, \quad\text{and}\quad \sigma(S_n) = \sqrt{npq}.$$
More generally, if $f : X \to \mathbb{R}$ is any random variable with $f \sim B(n, p)$, then $E(f) = np$, $\operatorname{Var}(f) = npq$, and $\sigma(f) = \sqrt{npq}$, as you can readily check.
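As a numerical companion to Example 6.10 (a sketch of ours, with arbitrary test values $n = 50$ and $p = 0.3$), we can simulate $S_n = f_1 + \cdots + f_n$ for i.i.d. Bernoulli$(p)$ trials and compare the sample mean and variance of $S_n$ with $np$ and $npq$.

```python
import random

random.seed(3)
n, p, trials = 50, 0.3, 100_000
# Each entry of sums is one realization of S_n = f_1 + ... + f_n.
sums = [sum(1 for _ in range(n) if random.random() < p) for _ in range(trials)]

mean = sum(sums) / trials
var = sum((s - mean) ** 2 for s in sums) / trials
print(mean, n * p)               # sample mean of S_n  ~ np
print(var, n * p * (1 - p))      # sample variance     ~ npq
```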

Exercises 6.5.

1. Here are some useful formulas, which you are free to use in subsequent problems.
(a) If $f : X \to [0, \infty)$ is integer valued (that is, $f(X) \subseteq \{0, 1, 2, \ldots\}$), prove that
$$E(f) = \sum_{n=0}^{\infty} \mu\{f > n\}.$$
(b) Let $f : \mathbb{R} \to \mathbb{R}$ be Borel measurable and let $g : X \to \mathbb{R}$ be an integrable function taking countably many values $a_1, a_2, \ldots \in \mathbb{R}$. Prove that
$$E(f(g)) = \sum_{n} f(a_n)\, \mu\{g = a_n\}.$$
(c) If $f : X \to [0, \infty]$, prove that
$$E(f) = \int_0^{\infty} \mu\{f \geq x\}\, dx.$$

Suggestion: principle of appropriate functions.

2. (Royal oak lottery) Here's a part of the preface of Abraham de Moivre's (1667–1754) famous book The Doctrine of Chances [100], first published in 1718 and whose third and final edition was in 1756:
"When the Play of the Royal Oak was in use, some Persons who lost considerably by it, had their Losses chiefly occasioned by an Argument of which they could not perceive the Fallacy. The Odds against any particular Point of the Ball were One and Thirty to One, which entitled the Adventurers, in case they were winners, to have thirty two Stakes returned, including their own; instead of which they having but Eight and Twenty, it was very plain that on the single account of the disadvantage of the Play, they lost one eighth part of all the Money they played for. But the Master of the Ball maintained that they had no reason to complain; since he would undertake that any particular point of the Ball should come up in Two and Twenty Throws; of this he would offer to lay a Wager, and actually laid it when required. The seeming contradiction between the Odds of One and Thirty to One, and Twenty-two Throws for any Chance to come up, so perplexed the Adventurers, that they begun to think the Advantage was on their side; for which reason they played on and continued to lose."
Here's what I think de Moivre is saying: For the Royal Oak Lottery, a single ball had 32 distinctive points and Adventurers would bet on which point would turn up when the Master of the Ball throws the ball. (Most lotteries are played with many balls but the Royal Oak was only played with one.) Thus, the probability of winning is 1/32. However, the Adventurers would only get paid 28 to 1 if they won and they thought this was unfair since after all, the odds are 1/32, not 1/28. However, the Master of the Ball said that any particular point of the ball should appear once in every 22 throws. Let's verify the Master of the Ball's statement.

(i) Let's fix a particular point on the ball, throw the ball an infinite number of times in sequence, and let $f$ be the number of throws to obtain that particular point. Write down a sample space and show that $f$ is measurable.
(ii) Show that for any $n = 0, 1, 2, \ldots$, $\mu\{f > n\} = (31/32)^n$. Conclude that $\mu\{f \leq 22\} > 1/2$. Thus, the Master of the Ball will throw that particular point within 22 throws more than half the time; hence, he felt confident to put a wager that he would throw any particular point at least once in 22 throws.
(iii) Show that $E(f) = 32$. Hence, the expected number of throws you need to obtain a particular point on the ball is 32 (just as you would expect)! Suggestion: To compute $E(f)$, use a formula in Problem 1.

3. (St Petersburg Paradox) You walk to a booth in St Petersburg and pay a fee to play the following game. A fair coin is flipped until a head appears. If the head appears on the $n$th flip, you are paid $\$2^n$. What do you expect to gain if you play such a game?
(i) Write down a sample space $X$, the corresponding probability measure $\mu$, and let $f : X \to \mathbb{R}$ be the function representing the amount you gain. Show that $f$ is measurable.
(ii) Show that $E(f) = \infty$, which seems to say that if you play this game, you can expect to win an infinite amount of money! Hence, it seems like you should be willing to pay any initial fee to play the St Petersburg game.
(iii) Given $n \in \mathbb{N}$, what is the probability that you win $\$2^n$? In particular, what are your chances of winning $\$2^4 = \$16$?
(iv) With (ii) and (iii) in view, do you see why this scenario is called the St Petersburg paradox?
(v) One reason the St Petersburg game is a paradox is that it assumes the booth has an infinite amount of money. Suppose that the booth only has $\$2^N$, so if the first head appears on the $n$th flip where $n \geq N$, then you only get $2^N$ dollars (instead of $\$2^n$). Now show that the expectation is finite.

4. (The Coupon collector's problem) A certain bag of chips contains one of $n$ different coupons (say labelled $1, \ldots, n$), each coupon equally likely to be found. In this problem we prove that the expected number of bags you need to obtain at least one of each type of coupon is $n \sum_{k=1}^{n} \frac{1}{k}$. In particular, since $\sum_{k=1}^{n} \frac{1}{k} \approx \log n$, meaning that $\lim_{n\to\infty} \big( \sum_{k=1}^{n} \frac{1}{k} \big)/(\log n) = 1$,$^{22}$ it follows that
$$\text{the expected number of bags to get all } n \text{ coupons} \approx n \log n;$$
what a strange place to see logarithms! (A numerical check of this formula is sketched at the end of these exercises.) Proceed as follows.
(i) Write down a sample space $X$, the corresponding probability measure $\mu$, and let $f : X \to \mathbb{R}$ be the function representing the number of bags required to complete the set of $n$ coupons. Show that $f$ is measurable.
(ii) Using an expectation formula in Problem 1 and looking back at Problem 7 in Exercises 2.3, show that
$$E(f) = n\, a_n, \quad\text{where}\quad a_n = \sum_{k=1}^{n} (-1)^{k-1} \frac{1}{k} \binom{n}{k}.$$
(iii) Prove that
$$a_{n+1} = a_n + \frac{1}{n+1}.$$
From this, prove that $a_n = \sum_{k=1}^{n} \frac{1}{k}$.
(iv) The coupon collector's problem deals with other situations too. In a pack of 52 cards, you draw a card one at a time, returning the card you drew before you draw the next one. Show that the expected number of draws you need to produce a card of every suit is $8\tfrac{1}{3}$.

$^{22}$ Can you prove this fact?

5. Let $\mu_1, \mu_2, \mu_3, \ldots$ be a sequence of probability measures on $\mathcal{B}$. Prove that there exists a probability space $(X, \mathcal{S}, \mu)$ and independent random variables $f_1, f_2, f_3, \ldots$ such that $P_{f_n} = \mu_n$ for each $n$. Suggestion: Let $X = \mathbb{R}^{\infty}$ with the infinite product measure of the $\mu_i$'s.

6. Let $f_1, f_2, \ldots$ be i.i.d. random variables on a probability space $(X, \mathcal{S}, \mu)$ and suppose that the range of each $f_i$ is $\{0, 1\}$, taking the value 1 with probability $p$ and the value 0 with probability $1 - p$. With $S_n = f_1 + \cdots + f_n$, we shall give two proofs that
$$\mu\{S_n = k\} = \binom{n}{k} p^k (1 - p)^{n-k}.$$
(i) (Easy) You can prove this using a similar argument as we did back in Theorem 2.13 in Section 2.5.
(ii) (Harder, but neat) We can also give an analytic proof as follows. First prove that for all $t \in \mathbb{R}$,
$$\int e^{t S_n} = \sum_{k=0}^{n} a_k\, e^{t k}, \quad\text{where } a_k = \mu\{S_n = k\}.$$
Next, show that for any $i$,
$$\int e^{t f_i} = p\, e^{t} + q, \quad\text{where } q = 1 - p,$$
and use this to prove
$$\int e^{t S_n} = (p\, e^{t} + q)^n.$$

Finally, prove the desired formula.
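Here is the numerical sketch promised in Problem 4 above; it is ours, not part of the exercises. It simulates the coupon collector's problem and compares the average number of bags with $n \sum_{k=1}^{n} 1/k$; the card-suit case $n = 4$ gives $25/3 = 8\tfrac{1}{3}$.

```python
import random

def bags_needed(n):
    # Draw coupons uniformly at random until all n types have been seen.
    seen, count = set(), 0
    while len(seen) < n:
        seen.add(random.randrange(n))
        count += 1
    return count

random.seed(4)
for n in (4, 10, 52):
    trials = 20_000
    avg = sum(bags_needed(n) for _ in range(trials)) / trials
    exact = n * sum(1 / k for k in range(1, n + 1))
    print(n, avg, exact)     # n = 4 gives roughly 8.333...
```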

6.6. Laws of large numbers and normal numbers

Now that we know what independent and identically distributed random variables are, in this section we prove both the WLLN and the SLLN. We also prove Borel's celebrated Normal Number Theorem [51].

6.6.1. Etemadi's strong law of large numbers. Let $(X, \mu)$ be a probability space, let $f_1, f_2, f_3, \ldots : X \to \mathbb{R}$ be pairwise independent, identically distributed random variables, and let $S_n = f_1 + \cdots + f_n$. The goal of this section is to prove the following SLLN, due to Nasrollah Etemadi in 1981 [120].

Etemadi's $L^1$ strong law of large numbers

Theorem 6.26. If the $f_n$'s are integrable, the SLLN holds:
$$\lim_{n\to\infty} \frac{S_n}{n} = \mu \ \text{ a.e.}, \quad\text{that is,}\quad \mu\Big\{ \lim_{n\to\infty} \frac{f_1 + f_2 + \cdots + f_n}{n} = \mu \Big\} = 1,$$
where $\mu$ denotes the common expectation of the $f_n$'s. Conversely, if $\lim S_n/n$ exists a.e., then the $f_n$'s must all be integrable and the SLLN holds.

The "$L^1$" in the title is synonymous with "integrable," in reference to the $f_n$'s being integrable random variables. We shall discuss more on $L^p$ matters in Chapter 8 (we shall also discuss a corresponding $L^p$ SLLN). Property (4) in Lemma 4.5 and Etemadi's SLLN imply the following WLLN:

$L^1$ weak law of large numbers

Theorem 6.27. If the $f_n$'s are integrable with common expectation $\mu$, then for each $\varepsilon > 0$, we have
$$\lim_{n\to\infty} \mu\Big\{ \Big| \frac{f_1 + f_2 + \cdots + f_n}{n} - \mu \Big| < \varepsilon \Big\} = 1.$$
This WLLN was proved by Aleksandr Khinchin (1894–1959) in 1929 [211]. To prove Etemadi's SLLN we need a few . . .

6.6.2. Preliminary results. We begin with the famous and useful . . .

Chebyshev's inequality, Version III

Theorem 6.28. For any $a > 0$ and integrable $f$,
$$\mu\big\{ |f - E(f)| \geq a \big\} \leq \frac{1}{a^2} \operatorname{Var} f.$$
Proof: Let $A = \{|f - E(f)| \geq a\} = \{(f - E(f))^2 \geq a^2\}$ and note that
$$a^2\, \chi_A \leq (f - E(f))^2\, \chi_A \leq (f - E(f))^2.$$
Integrating we get Chebyshev's inequality:
$$a^2\, \mu(A) \leq \int (f - E(f))^2 = \operatorname{Var} f. \qquad \square$$
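A quick sanity check of Chebyshev's inequality (a sketch of ours, with a uniform random variable on $[0, 1]$ chosen for illustration): the empirical value of $\mu\{|f - E(f)| \geq a\}$ never exceeds $\operatorname{Var} f / a^2$.

```python
import random

random.seed(5)
N = 200_000
xs = [random.random() for _ in range(N)]     # f ~ U(0, 1): E(f) = 1/2, Var f = 1/12

Ef = sum(xs) / N
Var = sum((x - Ef) ** 2 for x in xs) / N
for a in (0.25, 0.4, 0.45):
    prob = sum(1 for x in xs if abs(x - Ef) >= a) / N
    print(a, prob, Var / a**2)   # empirical probability <= Chebyshev bound
```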

It might be interesting to read Chebyshev's original statement of the inequality named after him (translation taken from [358, p. 581]):

"If we designate by $a, b, c, \ldots$ the mathematical expectations of the quantities $x, y, z, \ldots$, and by $a_1, b_1, c_1, \ldots$ the mathematical expectations of their squares $x^2, y^2, z^2, \ldots$, the probability that the sum $x + y + z + \cdots$ is included within the limits
$$a + b + c + \cdots + \alpha \sqrt{a_1 + b_1 + c_1 + \cdots - a^2 - b^2 - c^2 - \cdots}$$
and
$$a + b + c + \cdots - \alpha \sqrt{a_1 + b_1 + c_1 + \cdots - a^2 - b^2 - c^2 - \cdots},$$
will always be larger than $1 - \dfrac{1}{\alpha^2}$, no matter what the size of $\alpha$."

Assuming that the (finitely many) random variables $x, y, z, \ldots$ are independent,$^{23}$ can you prove that Chebyshev's original statement is equivalent to the one stated in Property (1) above for the random variable $f = x + y + z + \cdots$? It might also be interesting to note that Chebyshev's original proof takes up around four pages in [358]. Contrast this with the four-line proof above using Lebesgue integration theory. This phenomenon is not uncommon (although there are many exceptions): As mathematical technology is developed, results that once had hard and long proofs become shorter and more transparent. The next lemma we need is the following, which goes back at least to Augustin-Louis Cauchy's (1789–1857) 1821 book [78, p. 59].
$^{23}$ Actually, Chebyshev leaves out the assumption that the quantities $x, y, z, \ldots$ are independent, which is needed; otherwise his statement is false. For example, take $X = \{0, 1\}$, the sample space for a fair coin, and let $x = y$ be the random variable that assigns $1/2$ to 0 and $1/4$ to 1, and take $z$ and the rest of the random variables to be zero. Then you can check that Chebyshev's statement is false for $\alpha = 5/\sqrt{18}$.

Cauchy's arithmetic mean theorem

Lemma 6.29. If $\{a_n\}$ is a convergent sequence of real numbers and $L = \lim a_n \in \mathbb{R}$, then
$$\lim_{n\to\infty} \frac{a_1 + a_2 + \cdots + a_n}{n} = L.$$

Proof: Put $m_n := \dfrac{a_1 + a_2 + \cdots + a_n}{n}$ and observe that for any $n \in \mathbb{N}$,
$$m_n - L = \frac{1}{n}\big[(a_1 - L) + (a_2 - L) + \cdots + (a_n - L)\big].$$
Let $\varepsilon > 0$ and choose $N \in \mathbb{N}$ so that for all $n > N$, $|a_n - L| < \varepsilon/2$. Then for any $n > N$, we have
$$|m_n - L| \leq \frac{1}{n}\big|(a_1 - L) + \cdots + (a_N - L)\big| + \frac{1}{n}\big|(a_{N+1} - L) + \cdots + (a_n - L)\big| \leq \frac{C}{n} + \frac{1}{n}\Big(\frac{\varepsilon}{2} + \cdots + \frac{\varepsilon}{2}\Big) = \frac{C}{n} + \frac{n - N}{n} \cdot \frac{\varepsilon}{2} \leq \frac{C}{n} + \frac{\varepsilon}{2},$$
where $C = |(a_1 - L) + \cdots + (a_N - L)|$. By choosing $n$ larger, we can make $C/n$ also less than $\varepsilon/2$. This shows that $|m_n - L| < \varepsilon$ for $n$ sufficiently large and completes our proof. $\square$

The final lemma we need is an estimate for the integral in terms of a sum.

Lemma 6.30. For any random variable $f : X \to [0, \infty)$, we have
$$E(f) \leq \sum_{n=0}^{\infty} \mu\{f > n\} \leq E(f) + 1.$$

[Figure 6.9. The graph of $f$ with horizontal strips of height 1; the strips have areas $\mu\{f > 0\}, \mu\{f > 1\}, \mu\{f > 2\}, \mu\{f > 3\}, \ldots$]

Proof: We remark that if you take a close look at Figure 6.9, it's easy to see why this lemma holds: As seen in Figure 6.9, we certainly have
$$E(f) = \int f = \text{area below the graph of } f \leq \sum_{n=0}^{\infty} \mu\{f > n\}.$$
Also, if you imagine shifting $f$ one unit up, then the graph of $f + 1$ is above the horizontal rectangles in Figure 6.9. This shows that
$$\sum_{n=0}^{\infty} \mu\{f > n\} \leq \text{area below the graph of } f + 1 = \int (f + 1) = E(f) + 1.$$

(f + 1) = E (f ) + 1.

6.6. LAWS OF LARGE NUMBERS AND NORMAL NUMBERS

371

Heres a proof of this geometric argument:


n=0

{f > n} = = = = =

n=0 k=n k

{k < f k + 1} {k < f k + 1} (interchange summation order)

k=0 n=0 k=0 k=0

(k + 1) {k < f k + 1} {k<f k+1}

(k + 1)

k=0

(k + 1) {k<f k+1} .

One can check that f (k + 1){k<f k+1} f + 1.


n=0

k=0

Integrating each term in this inequality we obtain E (f ) {f > n} E (f ) + 1.

6.6.3. Proof of Etemadis SLLN. Be prepared, the proof might seem long,24 but every step is elementary in the sense that all we need are basic denitions and properties; the BorelCantelli Lemmas and their corollaries are probably the most advanced results we use. n Step 1: Assume that lim S n exists; we shall prove that the fn s must all be integrable, which is equivalent to saying that f1 is integrable because the fn s are identically distributed. To see that f1 must be integrable, observe that fn = Sn Sn1 , so fn Sn Sn1 Sn Sn1 n 1 = = n n n n n1 n Sn 1 n Since n 1, we see that the assumption lim a.e. implies that lim fn = 0 n n a.e. If we suppose, by way of contradiction, that f1 is not integrable, meaning that E (|f1 |) = , then Lemma 6.30 would imply that
n=0

{|f1 | > n} =

n=0

{|fn | > n} = ,

where we used that {|f1 | > n} = {|fn | > n} as the fn s (and hence |fn |s see Example 6.6) are identically distributed. Since the fn s are also pairwise independent, the sets {|fn | > n} are pairwise independent (see Example 6.8), so by the second BorelCantelli Lemma, {|fn | > n ; i.o.} = 1. This means that, a.e., n| > 1) for innitely many ns. However, this cannot be possible if |fn | > n (or |fn fn lim n = 0 a.e.
$^{24}$ Etemadi's proof is less than two pages long, but some details are left out; it's common in journal articles to leave out details that can be filled in by the readers.

372

6. SOME APPLICATIONS OF INTEGRATION

Step 2: We now begin our proof of the SLLN assuming the fn s are integrable. To do so, we claim that if the SLLN holds for integrable nonnegative random variables, then it holds as stated. To see this, write each fn in terms of its nonnegative + and nonpositive parts, fn , fn , and observe that Sn = n
n k=1 + fk

n k=1

fk

Since {fn } is pairwise independent and identically distributed, by Examples 6.6 + and 6.8 we know that {fn } and {fn } have the same properties, so if we can prove the SLLN for nonnegative random variables, it follows that n Hence, a.e. we have
n

lim

n k=1

+ fn

+ = E (f1 ) a.e. and

lim

n k=1

fn

= E (f1 ) a.e.

Sn + = E (f1 ) E (f1 ) = E (f1 ), n so the SLLN holds as stated. Thus, we shall henceforth assume that the fn s are n nonnegative. Our goal is to prove that lim S n = a.e. The idea to prove this limit is to rst truncate the fn s so that each fn is bounded. According to Boris Vladimirovich Gnedenko (19121995) [153, p. 230], this method of truncation was rst introduced by Andrei Andreyevich Markov (18561922) in 1907. The method of truncation is the following technique. For each n N, let n : R R be the truncation function
n

lim

(6.37)

n (x) :=

x 0

if x n otherwise,

and consider the truncated random variable gn := n (fn ) = fn 0 if fn n otherwise.

Let Tn = g1 + + gn . In Steps 46 shall prove the SLLN for the truncated random variables: Tn (6.38) lim = a.e. n n That this fact proves the SLLN follows from our next step. Step 3: In this step we prove the following: Claim: If lim
n

Sn Tn = a.e., then lim = a.e. n n n

To prove this claim, note that Sn T n Tn Sn = + . n n n By assumption, the last term tends to a.e. as n , so we just have to prove that the rst term on the right in (6.39) vanishes a.e. as n . To this end, observe that if fn (x) = gn (x) for all n from some point on, then limn (fn (x)gn (x)) = 0, so by Cauchys arithmetic mean theorem, we have lim (Sn (x) Tn (x))/n = 0. Another (6.39)
n

6.6. LAWS OF LARGE NUMBERS AND NORMAL NUMBERS

373

way of saying that fn (x) = gn (x) for all n from some point on is fn (x) = gn (x) for only nitely many ns, therefore fn = gn for at most nitely many n a.e. = lim Sn T n = 0 a.e. n

Now the left-hand side here means that fn = gn for innitely many n = 0, so by denition of i.o., Step 3 is completed once we show that f n = g n ; i .o . = 0 . To prove this we use the First BorelCantelli Lemma, which says that fn = gn ; i.o. = 0 if n=1 fn = gn < . To prove this, note that by denition of gn , we have fn = gn if and only if fn > n, thus

fn = gn =
n=1 n=1

fn > n =
n=1

f1 > n < ,

where we used that {f1 > n} = {fn > n} as the fn s are identically distributed. By Lemma 6.30 we have n=1 f1 > n E (f1 ) + 1, which is nite because the fn s are assumed integrable. This completes the Step 3. n Step 4: It remains to prove that limn T n = a.e. To prove this we use the following subsequence trick. Fix any > 1. Let an = n , where for any real number x, x denotes the largest integer x. In this step we shall prove that T an = a.e. n an lim

n and in Step 5, we shall use this to deduce that limn T n = a.e. Now, observe that n n E (Tn ) k=1 E (n (fn )) k=1 E (n (f1 )) = = , n n n where we used that E (n (fn )) = E (n (f1 )) as the fn s are identically distributed. Since 0 1 (f1 ) 2 (f1 ) 3 (f1 ) k (f1 ) f1 , by the MCT we have lim E (n (f1 )) = E (f1 ) = . Therefore, by Cauchys arithmetic mean theorem,

Tn ) = , which implies that lim we have lim E (n

E (Tan ) an

= as well. Hence,

(6.40)

lim

T an = a.e. an

lim

Tan E (Tan ) = 0 a.e. an

We henceforth focus on proving the right-hand side. To this end, note that by Part T E (T ) (3) of Lemma 4.5, to prove that limn an an an = 0 a.e., we just have to show that given any > 0, we have25

n=1

Tan E (Tan ) an

< ,

Tan E (Tan ) < n=1 an Tan E (Tan ) ; i.o. = 0; in other words, implies, by the First BorelCantelli Lemma, that an Tan E (Tan ) < for all n suciently large has measure the complement of the set of points where an T E (T ) zero. Since > 0 is arbitrary, with a little more work, we get that lim an a an = 0 a.e. n

25So you dont have to review Lemma 4.5, note that

374

6. SOME APPLICATIONS OF INTEGRATION

which is exactly in the form where we can try to use Chebyshevs inequality! (This, of course, was the reason we derived the equivalence (6.40), to set us up for Chebyshev.) By Chebechevs inequality, Tan E (Tan ) an 1
2 a2 n

Var Tan .

Since the {fn } are pairwise independent, by Theorem 6.22, {gn } = {n (fn )} are pairwise independent too, so by the properties of the variance (see Theorem 6.23), for any n we have
n n n

Var Tn = Var
k=1

gk =
k=1

Var gk =
k=1 n

2 E (gk ) E (gk )2 2 ) E (gk

k=1 n

E (k (fk )2 )
k=1 n

=
k=1

E (k (f1 )2 )

n E (n (f1 )2 ), here we used that E (k (fk )2 ) = E (k (f1 )2 ) as the fn s are identically distributed. and at the last step we used that k n for 1 k n, so E (k (f1 )2 ) E (n (f1 )2 ) for 1 k n. Thus, Hence,

Tan E (Tan ) an

1
2 a2 n

Var Tan

1 E (an (f1 )2 ). an 2

n=1

Tan E (Tan ) an

1 2

1 E (an (f1 )2 ) a n n=1 an (f1 )2 , an n=1

1 E 2

so we are left to show that the right-hand side is nite. To do so, we claim that for some constant C we have Claim: an (x)2 C x for all x 0. an n=1 an (f1 )2 C f1 , an n=1 which shows that

Assuming this claim for a moment, we have

n=1

Tan E (Tan ) an

C E f1 < . 2

6.6. LAWS OF LARGE NUMBERS AND NORMAL NUMBERS

375

Thus, we just have to prove our claim. Notice that so far we havent used anything about the explicit form an = n , where > 1; however, we shall do so now. Observe that a1 a2 a3 , so given x 0 there is a smallest m N such that am x. Then by denition of n in (6.37), x2 1 an (x)2 = = x2 . a a a n n n=m n=m n n=1 For any t 1, observe that which implies 1/t 2/t, so 1/an 2/n . Therefore, recalling the geometric series formula n=m rn = rm /(1 r) for any r R with |r| < 1, we see that an (x)2 1 2 = x2 x2 an a n n=m n n=m n=1 x2 2(1/)m x2 = C m, (1 1/)

t t + 1 t + t = 2 t ,

where C = 2/(1 1/). By the way we chose m, we have x am = m m , (x )2 Cx just as we claimed. so x/m 1 and hence n=1 an an Step 5: Our last step is to prove that
n

lim

T an = a.e. an

lim

Tn = a.e. n

Recalling that an = n with > 1, we have a1 = 1 a2 a3 an , so given k N, there is a largest natural number n N such that an k an+1 . (Although n depends on k , we omit the explicit dependence.) Note that as k , we also have n , a fact that will be used later. Any case, observe that the inequality an k an+1 implies that Tan Tk Tan+1 and therefore 1 an+1 1 1 1 , and k k an

Ta Tk T an n+1 , an+1 k an or written in a slightly dierent way, (6.41) Ta Tk an+1 Tan+1 an n . an+1 an k an an+1 an+1 = . n an where m N. Then

Since an = n , an argument (which you can provide) shows that lim So far, > 1 has been arbitrary. We now choose = 1 + there is a measurable set Am of measure zero such that Tan (x) = n an lim which is the precise meaning of lim
n Tan an 1 m

for x / Am ,

= a.e. Hence, for x / Am ,


n

lim

an Tan (x) an+1 an

1 and 1 + 1/m

lim

an+1 Tan+1 (x) an an+1

= 1+

1 . m

376

6. SOME APPLICATIONS OF INTEGRATION

Let A = m=1 Am , which has measure zero since its a countable union of sets of measure zero. Let > 0 be given and choose m N such that 1/m < /(2). Then the inequalities in (6.41), and the fact that as k , also n , it follows that given x X with x / A (which implies x / Am ), for k suciently large we have 1 Tk (x) 1 + . 1+ 1 + 1/m 2 k m 2

Observe that 1 1 1 1 = 1 + 3 + 1 + 1/m 2 m m2 m 2 1 1 = , m 2 where we used the fact that /m < /2. The same fact implies that 1 + < + . m 2 Thus, for k suciently large we have 1+ Tk (x) + . k
Tk (x) k

Since > 0 was arbitrary, it follows that limk proves the SLLN.

= for all x / A. This

6.6.4. Borel's normal number theorem. If you did Exercises 4 and 5 in Exercises 4.1 this section will be a breeze; we begin by reviewing material from these exercises. The basic gist of a normal number is as follows. Roughly speaking, we say that a number is normal in base 10 if any given finite sequence of digits occurs in its decimal expansion with the expected frequency. Thus, for example, if $x \in [0, 1]$ is normal in base 10 and we write $x = 0.x_1 x_2 x_3 \cdots$ in its decimal expansion, then the digit 5 would occur with frequency $1/10$, the string 34 would occur with frequency $1/10^2$, and so forth. We now make this precise. Let $b \in \mathbb{N}$ with $b \geq 2$. Given a number $x \in [0, 1]$, we can write it in its $b$-adic expansion, otherwise known as its base $b$ expansion:
$$(6.42)\qquad x = \frac{x_1}{b} + \frac{x_2}{b^2} + \frac{x_3}{b^3} + \cdots,$$
for some $x_i$'s in the set of digits $Y := \{0, 1, \ldots, b - 1\}$. If $x \in (0, 1)$ is rational and can be written with a denominator a power of $b$, it has two such expansions, one terminating and another non-terminating; in order to have a unique expansion we agree to use the non-terminating $b$-adic expansion. We shall call a word a finite string of digits. Thus, $w = (d_1, d_2, \ldots, d_k)$ for some $k \in \mathbb{N}$ (called the length of $w$) and $d_1, \ldots, d_k \in Y$; let us fix such a word. For each $i \in \mathbb{N}$, define $f_i : [0, 1] \to \mathbb{R}$ by
$$f_i(x) = \begin{cases} 1 & \text{if } (x_i, x_{i+1}, \ldots, x_{i+k-1}) = w, \\ 0 & \text{otherwise.} \end{cases}$$

Thus, $f_i$ observes if the word $w$ occurs in the $b$-adic expansion of $x$ starting from the $i$th digit of $x$. Now consider the average
$$\frac{f_1(x) + f_2(x) + \cdots + f_n(x)}{n},$$

which is exactly the average number of times the word $w$ occurs in the first $n$ digits of $x$. Intuitively speaking, since there are a total of $b$ possible digits, the word $w$ (consisting of $k$ specified digits) should occur at any given position in the $b$-adic expansion of $x$ with probability $1/b^k$. Hence, it seems reasonable that the word $w$ should appear with frequency $1/b^k$; that is, it should be that
$$(6.43)\qquad \lim_{n\to\infty} \frac{f_1(x) + f_2(x) + \cdots + f_n(x)}{n} = \frac{1}{b^k}.$$

If this is indeed the case, we say that $x$ is normal in base $b$ with respect to the word $w$. If (6.43) holds for all words, then we say that $x$ is normal in base $b$. Finally, we say that $x$ is normal or absolutely normal if it's normal in all bases $b \geq 2$. The following result was proved by Émile Borel in 1909 [51]:

Borel's Normal Number Theorem

Theorem 6.31. Almost all numbers in $[0, 1]$ are (absolutely) normal.

You shall prove this theorem in Problem 6. We remark that although almost all numbers in $[0, 1]$ are normal, and this has been known for over 100 years now, I don't know of any simple example! Maybe one of you will produce one! There are complicated examples of normal numbers, which can be computed in theory; the first such number was found by Sierpinski in 1916 [26, 352]. On the other hand, we can give simple examples of numbers normal in specific bases. For example, the first nontrivial numbers which are normal in a given base were constructed by D. G. Champernowne (1912–2000) in 1933 [81]. For example, in base 10, the number
$$0.12345678910111213141516171819202122232425262728293031\ldots,$$
obtained by stringing together all the natural numbers written in base 10, is normal in base 10. In any base $b$, the number obtained from stringing together all the natural numbers written in base $b$ is normal in base $b$. Champernowne conjectured that
$$0.2357111317192329313741434753596167717379838997101103107\ldots,$$
obtained by stringing together all the prime numbers, is also normal in base 10; this was subsequently proved in 1946 by Copeland and Erdős [89]. However, it's not known whether naturally occurring numbers such as the decimal parts of $e$, $\pi$, $\sqrt{2}$ or $\log 2$ are normal in any base.
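One can watch Champernowne's number behaving normally in base 10 by counting digit and two-digit word frequencies in its expansion. The sketch below is ours (the cutoff of $10^6$ digits is an arbitrary choice) and tabulates the frequency of the digit 5 and of the word 34.

```python
from collections import Counter

# Build the first ~10^6 digits of Champernowne's constant 0.123456789101112...
digits = []
k = 1
while len(digits) < 1_000_000:
    digits.extend(str(k))
    k += 1
digits = digits[:1_000_000]

single = Counter(digits)
pairs = Counter(a + b for a, b in zip(digits, digits[1:]))
print(single["5"] / len(digits))        # close to 1/10
print(pairs["34"] / (len(digits) - 1))  # close to 1/100
```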
Exercises 6.6. 1. (Chebyshevs WLLN) Using Chebyshevs inequality, prove the following result of Pafnuty Chebyshev (18211894), proved in 1867 [82]: If f1 , f2 , f3 , . . . are pairwise independent integrable random variables with variances bounded by some xed constant, then for each > 0, we have
n

lim

f1 + f2 + + fn E (f1 ) + E (f2 ) + + E (fn ) < n n

= 1.

2. (Cantellis SLLN) In this problem we give Cantellis SLLN, proved in 1917 [67]. Let f1 , f2 , f3 , . . . be integrable independent random variables on X . For each n, k, put
n

Bn,k =
i=1

E [|fi E (fi )|k ].

378

6. SOME APPLICATIONS OF INTEGRATION

In this problem we prove that if


n

2 Bn,4 +Bn, 2 n=1 n2

< , then

Sn E (Sn ) = 0 a.e., n where Sn = f1 + + fn . By Part (3) of Lemma 4.5, we just have to show that given any > 0, we have Sn E (Sn ) < . n n=1 lim To prove this, proceed as follows: (i) Use Chebyshevs inequality to prove that Sn E (Sn ) n
n i=1 4

1 n4 4

gi
i=1

(ii) Multiplying out


4 gi

where gi = fi E (fi ).

gi

show that we get a sum of terms of the following

2 2 form: (a) where i = 1, . . . , n, (b) gi gj , where i = j and 1 i, j n, and (c) terms of the form gi gj gk g in which at least one gm is not repeated. Show that the integrals of the type (c) terms are zero. (iii) Show that n 2 4 Bn,4 + Bn, 1 2 , g i 2 4 n4 4 n i=1

and use this to prove Cantellis SLLN. 3. (Kacs proof of the SLLN; cf. [155, 206]) Heres a proof of a SLLN by Mark Kac (19141984). Let f1 , f2 , f3 , . . . be i.i.d. random variables on X . Suppose there is a n = a.e. where constant C such that |fi | C for all i. We shall prove that lim S n Sn = f1 + + fn and = E (f1 ) = E (f2 ) = . n n (i) Show that lim S = a.e. is equivalent to lim T = 0 a.e. where Tn = g1 + + gn n n with gi = fi . (ii) Show that constant Tn 4 . n n2 4 Suggestion: Multiply out Tn and see Part (ii) of Cantellis SLLN in the previous problem to see how to deal with the various terms you get. (iii) Conclude that lim
Tn n 4 n=1 Tn n 4

< and then use Theorem 5.21 to prove that


Tn n

= 0 a.e., which implies lim


1

= 0 a.e.

4. (Another proof of the L WLLN) Let f1 , f2 , f3 , . . . be pairwise independent and identically distributed with nite common expectation and put Sn = f1 + + fn . We shall prove that for each > 0, we have
n

lim

Sn < n

= 1.

(i) We rst need the following (very) useful inequality: Given f with nite expectation and variance, prove that
2

|f |

f 2 ; that is, E (|f |)2 E (f 2 ).

Suggestion: Consider Var |f |. 2 (ii) Case I: Assume that f1 < . Prove that |Sn /n |2 = Var(f1 )/n. Use this fact, together with (i) and Chebyshevs inequality to prove Khintchines theorem. 2 Case II: We henceforth do not require f1 to be nite. We proceed as follows.

6.6. LAWS OF LARGE NUMBERS AND NORMAL NUMBERS

379

(iii) Let k be the truncation function in (6.37) and let fik = k (fi ) and Snk = f1k + + fnk . Show that |Sn Snk | n A |f1 | where Ak = {|f1 | > k}. k (Suggestion: Show that |fi fik | A |f1 | and use the triangle inequality.) k (iv) Show that Sn = n Snk Sn Snk E (f1k ) + E (f1k ) E (f1 ) + n n n Snk E (f1k ) + |f1 |. |f1 | + n Ak Ak

Show that the rst and last terms (which are the same) 0 as k and show by Case I that for xed k, the middle term 0 as n . Use these facts, plus an /3-trick to nish o the proof of Khintchins theorem. 5. (Sub-gaussian random variables; cf. [380, 387]) A random variable f : X R is called sub-gaussian with parameter > 0 if E (etf ) et
2

2 /2

for all t R.

(Properties of sub-gaussians) Prove the following properties: (i) If f is sub-gaussian with parameter , then so is f . 2 2 (ii) If f is sub-gaussian with parameter , then {f > } e /2 . (iii) if f1 , . . . , fn are independent and sub-gaussian with parameters 1 , . . . , n , re2 2. spectively, then Sn = n 1 + + n k=1 fk is sub-gaussian with parameter (iv) If f is bounded, say |f | M for some constant M , and E ( f ) = 0, then f is sub-gaussian with parameter 2 M . Suggestions: For (ii), observe that {f > } = { 2 f > 2 2 } = {e f > 2 2 e }. Dont forget your friend Mr. Chebyshev. To prove (iv), rst assume that 2 M = 1 and in this case show that E (etf ) et . If t 1, show that E (etf ) et . For 0 t 1, prove that etf = 1 + t f + t2 f 2 t3 f 3 t2 t3 + + 1 + tf + + + 1 + t f + t2 , 2! 3! 2! 3!
2 2

and use this to show that E (etf ) et . If M = 1, apply the result for M = 1 to the function f /M , which satises |f /M | 1. (v) (A SLLN for sub-gaussians) Prove the following version of the SLLN: Let f1 , f2 , . . . be independent, sub-gaussian random variables with parameters 1 , 2 , . . ., respectively, and suppose that there are constants C, d > 0 such that n 2 2d for all n. Then, k=1 k Cn lim

Sn = 0 a.e., n where Sn = f1 + + fn . Suggestion: By Part (3) of Lemma 4.5, we just have to show that given any > 0, we have
n=1

Sn n

n=1

{|Sn | n} < .

Show that {|Sn | n} = {Sn n} + {Sn n} , and use this, together with Properties (ii) and (iii) above for sub-gaussian random variables, to prove that n=1 {|Sn | n} < . (vi) Let f1 , f2 , f3 , . . . be i.i.d. random variables on X . Suppose there is a constant C such that |fi | C for all i. Using the SLLN for sug-gaussians, prove that n lim S = a.e. where Sn = f1 + + fn and = E (f1 ) = E (f2 ) = . n

380

6. SOME APPLICATIONS OF INTEGRATION

6. (Normal Number Theorem) In this problem we prove Borels fascinating result. Fix b N with b 2 and let Y = {0, 1, . . . , b 1} be the set of digits in base b. Let 0 : P (Y ) [0, 1] assign fair probabilities, 0 (A) = #A/b, and let : S (C ) [0, 1] denote the innite product of 0 with itself. Fix a word w Y k for some k and dene g : Yk R by g (x) = 1 0 if x = w, otherwise.

For each i, dene gi : Y R by gi (x1 , x2 , . . .) = g (xi , xi+1 , . . . , xi+k1 ). Thus, gi observes if the word w occurs in the sequence (x1 , x2 , . . .) starting from the ith digit. We begin our proof by proving that for a.e. x Y , 1 Sn (x ) = k , (6.44) lim n n b where Sn (x) = g1 (x) + g2 (x) + + gn (x). To prove this, and the rest of Borels theorem, proceed as follows. (i) Given N, prove that g , g+k , g+2k , . . ., are i.i.d. Conclude that for some A S (C ) with (A ) = 1, for all x A , g (x) + g+k (x) + g+2k (x) + + g+(n1)k (x) 1 = k. n b (ii) Let A = A1 Ak . Prove that (A) = 1 and for all x A,
n

lim

1 Smk (x) = k. mk b Suggestion: Break up the sum Smk (x) = g1 (x)+ g2(x)+ g3(x)+ + gmk (x) into k sums of the form g (x)+ g+k(x)+ g+2k (x)+ + g+(m1)k where = 1, 2, . . . , m and use (i). (iii) Now prove that for all x A, (6.44) holds. Suggestion: For n > k, let m = n/k, the smallest integer n/k. Prove that
m

lim

S(m1)k (x) m 1 Sn (x ) Smk (x) m , (m 1)k m n mk m1 then let n . Note that m is the unique integer satisfying mk k < n mk. (iv) Using Problem 7 in Exercises 2.4, prove that the the set of all x [0, 1] such that (6.43) holds is Borel and has Lebesgue measure 1. (v) Finally, prove that the set of all words is countable and use the fact that the set of all bases b 2, b N is countable, to complete the proof of Borels Normal Number Theorem. Congratulations, you have just proven one of the most historic theorems in probability theory!

Remarks
6.1 : The first published proof of the FTA appeared in 1748 and is due to Jean le Rond d'Alembert (1717–1783), and a second proof was published in 1749 by Leonhard Euler (1707–1783). D'Alembert's and Euler's proofs were flawed and, in fact, many others such as Joseph-Louis Lagrange (1736–1813) and Pierre-Simon Laplace (1749–1827) published flawed proofs. Many credit the first proof of the FTA to Carl Gauss (1777–1855) in his 1799 doctoral thesis, "A new proof of the theorem that every algebraic rational integral function in one variable can be resolved into real factors of the first or the second degree." During a large part of Gauss's thesis, he pointed out flaws in the previous works, starting from d'Alembert and ending with Lagrange; for example [128]:

Although the proofs for our theorem which are given in most of the elementary textbooks are so unreliable and so inconsistent with mathematical rigor that they are hardly worth mentioning, I shall nevertheless briefly touch upon them so as to leave nothing out. In order to
demonstrate that any equation $x^m + Ax^{m-1} + Bx^{m-2} + \text{etc.} + M = 0$, or $X = 0$, has indeed $m$ roots, they undertake to prove that $X$ can be resolved into $m$ simple factors. To this end they assume $m$ simple factors $x - \alpha$, $x - \beta$, $x - \gamma$, etc., where $\alpha, \beta, \gamma$, etc. are as yet unknown, and set their product equal to the function $X$ . . .

(I bolded the word assume.) In other words, they assumed what they were trying to prove! What they did was assume that a polynomial had some mysterious types of roots (which later in Gauss's paper he called impossible roots), and then they tried to show that the roots must be complex numbers (the possible roots). We sometimes forget that even geniuses such as Euler made fundamental errors. So we too shouldn't get discouraged if we make a mistake now and again. Actually, Gauss's proof was also flawed, so Gauss really shouldn't have been so critical of others; what Gauss did was make a claim he didn't prove [128]:

It seems to have been proved with sufficient certainty that an algebraic curve can neither be broken off suddenly anywhere (as happens e.g. with the transcendental curve whose equation is $y = 1/\ln x$) nor lose itself, so to say, in some point after infinitely many coils (like the logarithmic spiral). As far as I know, nobody has raised any doubts about this. However, should someone demand it then I will undertake to give a proof that is not subject to any doubt, on some other occasion.

The other occasion, however, never came until 1920, 65 years after Gauss's death, when Alexander Ostrowski (1893–1986) [301] filled in the details of Gauss's missing claim. Thus, although something may seem obvious to us or to others "with sufficient certainty," we should still prove it! Gauss was very fond of the FTA, for he gave altogether four different proofs of it during his lifetime. My favorite, and perhaps the most elementary of all proofs of the FTA, is due to Jean-Robert Argand (1768–1822), who published it in 1806 and an improved version in 1814/1815; to read about Argand's argument see [373, p. 268]. For more on the history of the Basel problem, see [13, 60, 111].

6.2 : The earliest proof of Theorem 6.11 on the characterization of Riemann–Stieltjes integrability that I could find is due to William Young [423, p. 133]; Corollary 6.12 was proved by Lebesgue in 1902 (sufficiency) [232] and 1904 (necessity) [233].

6.4–6.6 : In 1867, Chebyshev [82] generalized Bernoulli's WLLN; here's Chebyshev's original statement of his WLLN [358, p. 588]:26

If the mathematical expectations of the quantities $U_1, U_2, U_3, \ldots$ and of their squares $U_1^2, U_2^2, U_3^2, \ldots$ do not exceed a given finite limit, the probability that the difference between the arithmetic mean of $N$ of these quantities and the arithmetic mean of their mathematical expectations will be less than a given quantity, becomes unity as $N$ becomes infinite.

(Footnote 26: Chebyshev leaves out the assumption that $U_1, U_2, U_3, \ldots$ are independent. If taken literally, Chebyshev's theorem is false; can you find a counterexample?)

Aleksandr Khinchin (1894–1959) in 1928 [210] introduced the term Strong Law of Large Numbers. In 1930, Kolmogorov [217] proved Etemadi's Strong Law of Large Numbers with the additional assumption that the random variables were independent (rather than pairwise independent), and in 1933, in his famous book [215], he states a SLLN for independent and not-necessarily-identically-distributed random variables; his theorem reads:

If $f_1, f_2, \ldots$ are independent random variables with finite expectations and $\sum_{n=1}^\infty \mathrm{Var}(f_n)/n^2 < \infty$, then the SLLN holds for $f_1, f_2, \ldots$ in the
sense that the event
\[
\lim_{n \to \infty} \frac{f_1 + \cdots + f_n - \big(E(f_1) + \cdots + E(f_n)\big)}{n} = 0
\]
occurs with probability one.
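For instance (a simple special case we add here for orientation, not part of Kolmogorov's statement): if the variances are uniformly bounded, say $\mathrm{Var}(f_n) \le M$ for all $n$, then
\[
\sum_{n=1}^\infty \frac{\mathrm{Var}(f_n)}{n^2} \le M \sum_{n=1}^\infty \frac{1}{n^2} = \frac{M\pi^2}{6} < \infty,
\]
so Kolmogorov's condition holds automatically for uniformly bounded variances, which covers the i.i.d. bounded case in part (vi) above.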
