
Handout 4 Defining the Population Mean and Population Variance

In the last handout the subject of statistics was defined to be the art and science of making inferences about a population on the basis of a sample. But what sort of inferences about a population does one seek to make? One often seeks to estimate the population mean and the population variance. So the point of taking a sample and computing a sample mean and a sample variance is to estimate the population mean and population variance, respectively. The foregoing discussion is getting ahead of itself, in the sense that it presumes that the terms population mean and population variance have already been defined, which they haven't. There are two cases to consider: the case in which the population is finite, and the case in which the population is infinite.

Finite versus Infinite Populations

In the case of a finite population, say of size N, the definition of a population mean is the obvious analogue of the definition of a sample mean given in Handout 1: the population mean is

$$\frac{x_1 + x_2 + \cdots + x_N}{N}.$$

For an infinite population this clearly makes no sense at all! Infinite populations are not just the province of pure mathematicians spinning abstract theories for the sake of abstraction itself; the case of an infinite population is quite common in scientific settings. Suppose one is investigating the wing span (in millimeters) of the progeny resulting from the cross breeding of two strains of fruit fly. What is the population under investigation? The fruit flies bred by the investigator? The fruit flies bred by all researchers in the US that year? The fruit flies bred by anyone, anyplace to date? If it's a truly general scientific investigation, the population of interest is the set of all wing spans of all potential fruit flies that have been or could be bred, anytime or anyplace; which is, by the nature of the concept, an infinite set. And any scientific investigation of sufficient generality to be of significant enduring interest is likely to be about an infinite population.

Developing a Definition for the Case of Discrete Random Variables

We will consider the problem of defining the population mean in the special case of a discrete random variable: that is, a random variable that takes on a finite number of different values. (For example, consider the random variable X that takes on the value 1 if a coin toss comes up heads, and the value 0 if the coin toss comes up tails. This is a discrete random variable since there are only two different values the random variable takes on; but one can, at least conceptually, keep tossing the coin from here to eternity, and hence the population of interest is infinite.)

The goal is to develop a definition which makes sense whether the population is finite or infinite. We're not going to just plop down an unmotivated definition!! The motivation will be provided by a consideration of how to efficiently compute a sample mean when one has so-called grouped data. Keep in mind that the computational aspect is not what is of primary importance here (...although some textbook authors make a ludicrously big deal out of dealing with grouped data...); the significance lies in the motivation grouped data provide for defining the population mean and population variance.

Let's start by considering a particular example. Consider the experiment of randomly selecting a household in Fairfax County. The random variable of interest is the response to an inquiry as to the number of adults over 18 years of age residing in the selected household. Repeating this experiment 1000 times gives the data set given in the following table:

| Number of Persons Over 18 | Number of Households |
|---|---|
| 1 | 223 |
| 2 | 437 |
| 3 | 150 |
| 4 | 86 |
| 5 | 62 |
| 6 | 22 |
| 7 | 15 |
| 8 | 5 |

The sample mean may be computed as:

$$\bar{x} = \frac{(1)(223) + (2)(437) + (3)(150) + (4)(86) + (5)(62) + (6)(22) + (7)(15) + (8)(5)}{1000} = \frac{2478}{1000} = 2.478$$

To generalize the computations in this example, say that the discrete random variable X takes on the value $x_i$ with frequency $f_i$. Then the sample mean is computed by

$$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_k x_k}{n}$$

where k is the number of different values of $x_i$ in the sample and n is the total number of observations.
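The grouped-data computation can be sketched in a few lines of Python. This is only an illustration using the Fairfax County household counts from the table above; the function name `grouped_sample_mean` and the dictionary layout are choices made here, not part of the handout.

```python
# Grouped data from the household example:
# value x_i (adults over 18 in the household) -> frequency f_i (households).
frequencies = {1: 223, 2: 437, 3: 150, 4: 86, 5: 62, 6: 22, 7: 15, 8: 5}


def grouped_sample_mean(freq):
    """Return (f_1*x_1 + ... + f_k*x_k) / n, where n is the sum of the f_i."""
    n = sum(freq.values())                                # total observations
    weighted_total = sum(x * f for x, f in freq.items())  # f_1*x_1 + ... + f_k*x_k
    return weighted_total / n


n = sum(frequencies.values())
x_bar = grouped_sample_mean(frequencies)
print(n, x_bar)
```

The point of grouping is that only k = 8 products are needed, rather than a sum over all 1000 individual responses.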

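Note, just doing some algebraic rewriting, that the last formula for $\bar{x}$ can be rewritten as:

$$\bar{x} = \frac{f_1}{n} x_1 + \frac{f_2}{n} x_2 + \cdots + \frac{f_k}{n} x_k$$

The ratio $f_i/n$ is an approximation to $P(X = x_i)$. To keep the notation simple, let's agree to write $p_i$ for $P(X = x_i)$ and $\hat{p}_i$ for the approximation $f_i/n$. Then one may write the formula for $\bar{x}$ as

$$\bar{x} = \hat{p}_1 x_1 + \hat{p}_2 x_2 + \cdots + \hat{p}_k x_k$$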

Now remember that $\bar{x}$ is the sample mean. There should be an analogous average or mean for the entire population. The last formula for $\bar{x}$ motivates the following definition.

Definition. Given a discrete random variable X that takes on k different values $x_1, x_2, \ldots, x_k$, the expected value of X, or population mean $\mu_X$, is

$$\mu_X = E[X] = p_1 x_1 + p_2 x_2 + \cdots + p_k x_k$$

where $p_i$ is $P(X = x_i)$ for $i = 1, \ldots, k$.
--------

Remarks

(1) The notations $E[X]$ and $\mu_X$ are two different views of the same thing: using the notation $\mu_X$ one is looking at the population mean as a property of a set of values; using the notation $E[X]$ one is thinking of the population mean as a property of the random variable that generates the set. Both ways of looking at things have advantages in different instances.

(2) Remark on notation: lower case Latin characters generally denote characteristics of a sample, e.g. $\bar{x}$; lower case Greek characters generally denote population characteristics, e.g. $\mu$.

(3) It's a good idea to get used to using summation notation, so note that

$$\mu_X = E[X] = \sum_{i=1}^{k} p_i x_i$$

Computational Example

Consider a random variable Y which takes on values 2, 4 and 8 with probabilities 1/2, 1/4 and 1/4, respectively. Then

$$\mu_Y = E[Y] = (1/2)(2) + (1/4)(4) + (1/4)(8) = 1 + 1 + 2 = 4.$$

Computational Example

If Y is a random variable, any function of Y, e.g. $Y^2$, $Y^3 - Y$, $|Y|$, etc., is a random variable, and so also has an expectation. For example, for Y as in the example above,

$$E[Y^2] = (1/2)(2^2) + (1/4)(4^2) + (1/4)(8^2) = 2 + 4 + 16 = 22.$$

Note that $E[Y^2]$ is NOT the same as $E[Y]^2$ !!!! (Here $E[Y]^2 = 4^2 = 16$.)

Computational Example

In a certain board game, the number of spaces a player moves his token is determined by using a spinner. The spinner is just a circle printed on a rectangle of cardboard with a light metal arrow attached to the center of the circle, so that the arrow rotates freely. The spinner for the particular game I have in mind is marked off into three sections: a 180° section colored blue, a 120° section colored red, and a 60° section colored yellow. If, when spun, the arrow lands in the blue section, the token is moved forward 4 spaces; if the arrow lands in the red section, the token is moved forward 6 spaces; and if the arrow lands in the yellow section, then the token is moved back 12 spaces. Let Y be the random variable corresponding to the number of spaces the token is moved. Assuming that the head of the arrow is equally likely to stop at any point on the perimeter of the circle, one has $P(Y = 4) = 1/2$, $P(Y = 6) = 1/3$, and $P(Y = -12) = 1/6$. Then one computes:

$$\mu_Y = E[Y] = (1/2)(4) + (1/3)(6) + (1/6)(-12) = 2 + 2 - 2 = 2$$
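As a check on the three expectations just computed, here is a minimal Python sketch of the definition of expected value. The helper name `expected_value` and the representation of a distribution as a dictionary of value-to-probability pairs are illustrative assumptions, not anything the handout prescribes. Exact fractions are used so the results match the hand computations.

```python
from fractions import Fraction


def expected_value(dist, g=lambda v: v):
    """Return E[g(X)] = p_1*g(x_1) + ... + p_k*g(x_k) for a dict {x_i: p_i}."""
    assert sum(dist.values()) == 1, "probabilities must sum to 1"
    return sum(p * g(x) for x, p in dist.items())


# Y takes the values 2, 4, 8 with probabilities 1/2, 1/4, 1/4.
y_dist = {2: Fraction(1, 2), 4: Fraction(1, 4), 8: Fraction(1, 4)}
mean_y = expected_value(y_dist)                            # E[Y]
mean_y_squared = expected_value(y_dist, lambda v: v ** 2)  # E[Y^2]

# Spinner: +4 (blue, 180 degrees), +6 (red, 120), -12 (yellow, 60).
spinner_dist = {4: Fraction(1, 2), 6: Fraction(1, 3), -12: Fraction(1, 6)}
mean_spinner = expected_value(spinner_dist)

print(mean_y, mean_y_squared, mean_y ** 2, mean_spinner)  # prints: 4 22 16 2
```

The second and third printed numbers make the warning concrete: $E[Y^2] = 22$ while $E[Y]^2 = 16$.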

Computation ain't everything, ya know??? How does one interpret this result? If one plays the game a very long time, and spins the spinner billions and billions of times (to echo the late Carl Sagan...), one will average about 2 spaces forward per spin.
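The long-run interpretation can be illustrated with a small simulation. The sketch below is an assumption-laden stand-in for actually spinning the spinner: it uses Python's pseudo-random number generator, a fixed seed chosen arbitrarily so that runs are repeatable, and 200,000 spins rather than billions.

```python
import random

rng = random.Random(2024)  # arbitrary fixed seed, only for repeatability

moves = [4, 6, -12]        # spaces moved for blue, red, yellow
weights = [180, 120, 60]   # section sizes in degrees, proportional to 1/2, 1/3, 1/6

num_spins = 200_000
spins = rng.choices(moves, weights=weights, k=num_spins)

# The average move per spin should be close to E[Y] = 2, but not exactly 2.
average_move = sum(spins) / num_spins
print(average_move)
```

Increasing `num_spins` should tend to bring the average closer to 2, which is exactly what the expected value is meant to describe.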

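Defining the Population Variance

The motivation for the definition of a population variance is very similar. If one writes out the definition of the sample variance long-hand, i.e. without using summation notation, one has

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

Now for grouped data, where $x_1, x_2, \ldots, x_k$ are the k different values occurring in a sample of n observations, and where the frequency with which $x_i$ occurs in the sample is $f_i$, one has:

$$s^2 = \frac{f_1 (x_1 - \bar{x})^2 + f_2 (x_2 - \bar{x})^2 + \cdots + f_k (x_k - \bar{x})^2}{n - 1}$$

If n is very large, dividing by $n - 1$ isn't all that much different computationally than dividing by n; and if n is large we expect $\bar{x}$ to be close to $\mu$, so that for large n,

$$s^2 \approx \frac{f_1}{n} (x_1 - \mu)^2 + \frac{f_2}{n} (x_2 - \mu)^2 + \cdots + \frac{f_k}{n} (x_k - \mu)^2$$

or in compact summation notation:

$$s^2 \approx \sum_{i=1}^{k} \frac{f_i}{n} (x_i - \mu)^2$$

This expression motivates the following definition:

Definition. The population variance for a discrete random variable X, denoted $\mathrm{Var}[X]$ or $\sigma_X^2$, is

$$\sigma_X^2 = \mathrm{Var}[X] = \sum_{i=1}^{k} p_i (x_i - \mu_X)^2$$

where $p_i$ is $P(X = x_i)$ for $i = 1, \ldots, k$.
------------------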

Remarks

(1) The square root of the population variance, $\sigma_X$, is called the population standard deviation.

(2) Notice again the convention that lower case Latin characters denote sample characteristics, while lower case Greek characters denote population characteristics.

(3) Note that one could also write the definition as $\mathrm{Var}[X] = E[(X - \mu)^2]$: an observation we will make use of later.

(4) Observe that, by definition, the variance of a random variable is non-negative.

(5) When the definition of the sample standard deviation was introduced in Handout 1 we noted that it would have been more natural if the denominator in the defining expression had been n instead of $n - 1$. Here's the intuitive explanation: what one would really like to measure is the dispersion of the data about the population mean, but since the population mean is unknown, one uses the sample mean in the computation. Data is information; so having to estimate the population mean is like having one less data point.
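The definition of the population variance translates directly into code. The following Python sketch is illustrative only: the function names and the dictionary representation are assumptions made here, and the distribution is the random variable Y with values 2, 4, 8 and probabilities 1/2, 1/4, 1/4 from the earlier computational example.

```python
import math
from fractions import Fraction


def population_mean(dist):
    """mu_X = sum of p_i * x_i for a dict {x_i: p_i}."""
    return sum(p * x for x, p in dist.items())


def population_variance(dist):
    """Var[X] = sum of p_i * (x_i - mu_X)^2, i.e. E[(X - mu)^2]."""
    mu = population_mean(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())


# Y takes the values 2, 4, 8 with probabilities 1/2, 1/4, 1/4; E[Y] = 4.
y_dist = {2: Fraction(1, 2), 4: Fraction(1, 4), 8: Fraction(1, 4)}

var_y = population_variance(y_dist)  # (1/2)(-2)^2 + (1/4)(0)^2 + (1/4)(4)^2 = 6
sd_y = math.sqrt(var_y)              # population standard deviation
print(var_y, sd_y)
```

Every term in the sum is a probability times a square, which is why the result can never be negative.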

Computational Example

Referring to the random variable associated with the game spinner of the previous example,

$$\mathrm{Var}[Y] = E[(Y - \mu)^2] = (1/2)(4 - 2)^2 + (1/3)(6 - 2)^2 + (1/6)(-12 - 2)^2 = 2 + 16/3 + 196/6 = 240/6 = 40.$$

And the population standard deviation is $\sigma_Y = \sqrt{40} \approx 6.325$.

Computational Example

Consider the random experiment of rolling a fair die. Let X be the function that assigns to an outcome the number of spots that appear on the top face. Then $P(X = 1) = P(X = 2) = P(X = 3) = P(X = 4) = P(X = 5) = P(X = 6) = 1/6$.

$$E[X] = (1/6)(1) + (1/6)(2) + (1/6)(3) + (1/6)(4) + (1/6)(5) + (1/6)(6) = 3.5,$$

$$\mathrm{Var}[X] = (1/6)(1 - 3.5)^2 + (1/6)(2 - 3.5)^2 + (1/6)(3 - 3.5)^2 + (1/6)(4 - 3.5)^2 + (1/6)(5 - 3.5)^2 + (1/6)(6 - 3.5)^2 \approx 2.917.$$

Computational Example

To help illustrate the relationship between the sample mean and variance, and the population mean and variance, I rolled a fair die 600 times. (Well, I rolled it with the help of a random number generator that's built into Lotus 1-2-3...) I then computed the sample mean and variance for this sample of size 600. After that I rolled the die another 2400 times, and again computed the sample mean and variance. The results are summarized in the table given below:

| X | Frequency, First 600 Rolls | Frequency, Second 2400 Rolls |
|---|---|---|
| 1 | 91 | 403 |
| 2 | 92 | 370 |
| 3 | 118 | 383 |
| 4 | 101 | 434 |
| 5 | 99 | 417 |
| 6 | 99 | 393 |
| Total | 600 | 2400 |
| $\bar{x}$ | 3.537 | 3.530 |
| $s^2$ | 2.785 | 2.895 |

Compare the sample means and the sample variances with the population mean and variance computed earlier.
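The same experiment is easy to repeat. The sketch below is not the author's spreadsheet computation: it assumes Python's built-in pseudo-random generator in place of the one mentioned above, with an arbitrary fixed seed, so the sample values it prints will differ from those in the table. `statistics.variance` uses the $n - 1$ denominator.

```python
import random
import statistics

rng = random.Random(123)  # arbitrary fixed seed, only for repeatability

faces = [1, 2, 3, 4, 5, 6]
pop_mean = sum(faces) / 6                              # 3.5
pop_var = sum((x - pop_mean) ** 2 for x in faces) / 6  # about 2.917

results = {}
for num_rolls in (600, 2400):
    rolls = [rng.randint(1, 6) for _ in range(num_rolls)]  # simulated fair die
    sample_mean = statistics.mean(rolls)
    sample_var = statistics.variance(rolls)                # divides by n - 1
    results[num_rolls] = (sample_mean, sample_var)
    print(num_rolls, round(sample_mean, 3), round(sample_var, 3))

print("population:", pop_mean, round(pop_var, 3))
```

With samples of this size, both sample statistics should land near the population values without matching them exactly.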

Nice Examination Problem: A randomly chosen drilling site in a certain area of West Texas is quite variable in its potential for producing oil. Eighty percent of the time the well will produce only 50,000 barrels of oil; ten percent of the time the well will produce 100,000 barrels of oil; seven percent of the time the well will produce 250,000 barrels of oil; two percent of the time the well produces 500,000 barrels of oil; and, on the average, one in a hundred wells is a real gusher and produces 3,000,000 barrels of oil.

(a) If X denotes the oil production of a randomly chosen well, compute the population mean (expected value) and standard deviation. (You may choose any units for oil production you find convenient.)

(b) If a barrel of crude oil sells for about \$25, and the cost of drilling a well is about \$2,000,000, what is the average profit per well for a company that has the financial resources to stay in the oil business for the long run?

And here is a solution: Using X to denote oil production per well (in thousands of barrels), one computes the mean and standard deviation of X to be 107.5 and 301.3615, respectively. This means that in the long run the oil company will average a profit of 107,500 (25) - 2,000,000 = \$687,500 per well. It is well to emphasize "long run", since 80% of the time a well will, in fact, lose money for the company.
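The figures in the solution can be verified with a short Python sketch. The variable names are illustrative choices; the probabilities, price per barrel and drilling cost are taken from the problem statement, with production measured in thousands of barrels.

```python
import math
from fractions import Fraction

# Production in thousands of barrels -> probability, from the problem statement.
oil_dist = {
    50: Fraction(80, 100),
    100: Fraction(10, 100),
    250: Fraction(7, 100),
    500: Fraction(2, 100),
    3000: Fraction(1, 100),
}

mean_production = sum(p * x for x, p in oil_dist.items())                  # 107.5
variance = sum(p * (x - mean_production) ** 2 for x, p in oil_dist.items())
std_dev = math.sqrt(variance)                                              # about 301.3615

price_per_barrel = 25      # dollars
drilling_cost = 2_000_000  # dollars
average_profit = mean_production * 1000 * price_per_barrel - drilling_cost

# Probability that a single well earns less than it cost to drill.
prob_loss = sum(
    p for x, p in oil_dist.items() if x * 1000 * price_per_barrel < drilling_cost
)

print(float(mean_production), round(std_dev, 4), float(average_profit), float(prob_loss))
```

The last quantity shows why the long run matters: only the 50,000-barrel outcome fails to cover the drilling cost, but that outcome occurs 80% of the time.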