
Handout 3 Relating Data Sets to the Larger World

So far we've just taken data sets for granted as an object of study: we haven't asked where data sets come from, or how data sets relate to the larger world. To address questions such as these, or even just to state them in a meaningful way, requires taking the time to introduce new concepts, new terminology, and new notation. Begging the patience of the reader, we commence.

Definition. A random experiment is an experiment whose outcome is not known with certainty. A set of outcomes is called an event. The set of all possible outcomes for a random experiment is called the sample space.

Even though the outcome of a random experiment is not known with certainty, one would still like to quantify the likelihood of an event. This is done through a probability function: a function that matches events with real numbers between zero and one. The idea of a probability is to give a measure of how likely an event is: events that are very likely have a probability close to 1, and events that are very unlikely have a probability close to 0.

(A prerequisite to understanding the concepts in this handout is a basic understanding of the idea of a function. See the brief appendix for a quick review.)

Definition. Given a finite sample space S = {e1, e2, ..., en}, a probability function P on S is a function that assigns a real number P(ei) to every outcome ei of S so that

(1) For every i, 0 ≤ P(ei) ≤ 1.
(2) If E is a non-empty event, P(E), the probability of E, is the sum of the P(ei)'s for the ei's that belong to E. (For example, if E = {e3, e5, e17}, then P(E) = P(e3) + P(e5) + P(e17).)
(3) P(∅) = 0.
(4) P(S) = 1.

(So P(e1) + P(e2) + ... + P(en) = 1.)

________________________________
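The definition of a probability function on a finite sample space can be tried out on a small case. The following is a minimal sketch, assuming a hypothetical sample space (one roll of a fair six-sided die) that does not appear in the handout. The names P_outcome, is_probability_function and prob are invented for illustration.

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair six-sided die.
# Each outcome e_i is paired with its probability P(e_i).
P_outcome = {face: Fraction(1, 6) for face in range(1, 7)}

def is_probability_function(p):
    """Check conditions (1) and (4): each P(e_i) lies in [0, 1] and they sum to 1."""
    in_range = all(0 <= value <= 1 for value in p.values())
    return in_range and sum(p.values()) == 1

def prob(event, p=P_outcome):
    """Conditions (2) and (3): P(E) is the sum of P(e_i) over the outcomes in E.

    The empty sum is 0, so the empty event gets probability 0.
    """
    return sum((p[e] for e in event), Fraction(0))

even = {2, 4, 6}                           # the event "the roll is even"
print(is_probability_function(P_outcome))  # True
print(prob(even))                          # 1/2
print(prob(set()))                         # 0
print(prob(set(P_outcome)))                # 1
```

Exact fractions are used instead of floating-point numbers so that the check "the probabilities sum to 1" is not upset by rounding.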

This definition represents the axiomatic approach to probability: everything one claims to know about probability must follow from the axioms presented in the formal definition. Another, intuitively appealing way to think about a probability is as a long-run relative frequency. Thus, taking the viewpoint of a frequentist, the statement that the probability of getting heads when one flips a coin is 0.6 means that if one flips the coin a very large number of times, very close to 6 out of every 10 of those flips will be heads. It's important to be able to think about a probability either as an axiomatist or as a frequentist.

Random Variables

Note that the outcome of a random experiment need not be a number. Consider the random experiment in which I am blindfolded and choose one of my students by, say, picking a slip with a student's name from a hat full of such slips. The outcome isn't a number: it's a person. Nevertheless, there are many numbers associated with the outcome: e.g., the student's weight in pounds, the student's height in inches, the student's grade point average, or an indicator variable which takes the value 1 if the student is female and the value 0 if the student is male. Let's formalize this observation:

Definition. A random variable is a function that associates a number with each outcome of a random experiment.

Now we can address the issue of where data sets come from: data sets arise by repeated observation of a random variable associated with repetitions of some random experiment.

Remarks

(1) Random variables are commonly denoted by upper case Latin characters: e.g., X, Y, W, Z, etcetera.

(2) Note that since a random variable is a function, it doesn't make any sense to ask, "What value is X?" One can ask what value the random variable took on when one did the random experiment this time. If the random variable is denoted X, Y, W, Z, the value of the random variable observed one particular time is denoted by the corresponding lower case character: x, y, w, z.

Population versus sample

A data set usually represents a small subset, a sample, of the set of all potential observations of the random variable that could be made, the population. This observation leads to an answer to the second question posed in the introduction to this handout: How do data sets relate to the larger world? In fact this dichotomy, sample versus population, is at the heart of the subject:

Definition. Statistics is the art and science of making inferences about a population on the basis of a sample.

So, for example, consider a telephone survey of 200 Illinois hog producers conducted by the US Department of Agriculture. Just why does the USDA interrupt these lucky 200 farmers at dinner and annoy them with all manner of questions about their hog operations? Is it because these 200 hog producers are inherently interesting in themselves? No, these 200 are interesting to the extent that one can use data from this sample of 200 hog producers to make inferences about the population of all Illinois hog producers.

For another example, consider the random experiment of tossing a coin and observing whether it lands heads or tails. The probability, p, of getting heads on a toss is a characteristic of the population of all possible tosses of the coin. One estimates this probability by the proportion of heads in a sample. E.g., if a sample of five coin tosses is {H, H, T, H, T}, then our estimate of p is the sample proportion, 3/5 = 0.6.
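The coin example can be reproduced in a few lines, and the same code illustrates the frequentist reading of a probability as a long-run relative frequency. This is only a sketch: the function names X, sample_proportion and toss are invented here, and the simulated coin with true p = 0.6 is an assumption made for illustration.

```python
import random

def X(outcome):
    """Random variable: 1 if the toss is heads ("H"), 0 if it is tails ("T")."""
    return 1 if outcome == "H" else 0

def sample_proportion(tosses):
    """Proportion of heads in a sample of tosses (the estimate of p)."""
    values = [X(t) for t in tosses]
    return sum(values) / len(values)

# The five-toss sample from the text.
sample = ["H", "H", "T", "H", "T"]
print(sample_proportion(sample))  # 0.6

# Assumption for illustration: a simulated coin whose true p is 0.6.
def toss(rng, p=0.6):
    return "H" if rng.random() < p else "T"

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 1_000, 100_000):
    tosses = [toss(rng) for _ in range(n)]
    print(n, sample_proportion(tosses))
```

With larger samples the printed proportions should settle near the assumed value of p, while small samples can miss it by a fair margin.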

A common reaction of people when first presented with this definition is to express disbelief that a small (especially in relative terms) subset of a population can be used to make usefully accurate inferences about the characteristics of the population. In rebuttal to this disbelief, consider the typical blood sample drawn at a medical clinic. The size of the blood sample is quite small relative to the quantity of blood circulating in the human body, yet the patient and his physician apparently have faith in conclusions based on such a tiny sample! The key idea that underlies that faith is that the blood sample is representative of the blood circulating in the patient's body. In general, the hope of making useful inferences on the basis of a sample rests on the sample being representative of the population. Students often make the mistake of thinking that the primary goal is to have a random sample; this is putting the cart before the horse. The goal is to have a representative sample: randomness is a means to that end.

Statistical Independence

We introduce the vital concept of independent random variables.

Definition. Two random variables are independent if the probability that one of the random variables takes on values in a given set does not depend on the values the other random variable takes on.

Example. Let X be the random variable which is one if a coin toss comes up heads, and zero if the coin toss comes up tails. Let Y be the random variable which is one if a second toss comes up heads, and zero if the second toss comes up tails. Coins have no memory, hence knowing whether X was 0 or 1 has no bearing on the probability that Y is 0 or 1. That is, X and Y are independent random variables. (Note that whether the coin is fair or not is irrelevant.)

Example. Consider the random experiment of randomly choosing a US male aged 18 to 35. Let H be the random variable corresponding to the height (in inches) of the person chosen; let W be the random variable corresponding to the weight (in pounds) of the person chosen. Then H and W are dependent random variables: knowing that a person was taller than average, or shorter than average, would indeed have a bearing on the probability that the person was heavier than average or lighter than average.

Remark. Students often lose sight of the essential fact that the definition of independence is probabilistic in nature. "Weight and height are independent," argues one student, "my cousin Vinny is six-foot-four and weighs 95 pounds soaking wet." "Yeah," chimes in another student, "and my brother-in-law Ray is five-foot-two and weighs about 350 pounds." Whatever the facts about these kinfolk may be, independence is not about individual cases. Think about the behavior of a rational gambler making a wager on the weight of a randomly chosen US male aged 18 to 35. Information on the height of the randomly chosen person would certainly be relevant to a wager on the weight of that person.
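For random variables that take only finitely many values, a standard equivalent way to state independence is that P(X = x and Y = y) = P(X = x) P(Y = y) for every pair of values x, y. The sketch below checks that product rule on two small joint distributions. The helper names are invented, the heads probability of 3/5 is an assumed value, and the "tall/heavy" numbers are made up purely for illustration; they are not data.

```python
from fractions import Fraction
from itertools import product

def marginal(joint, index):
    """Marginal distribution of one coordinate of a joint distribution."""
    out = {}
    for pair, pr in joint.items():
        out[pair[index]] = out.get(pair[index], Fraction(0)) + pr
    return out

def independent(joint):
    """True if P(X=x, Y=y) == P(X=x) * P(Y=y) for every pair (x, y)."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return all(joint.get((x, y), Fraction(0)) == px[x] * py[y]
               for x, y in product(px, py))

# Two tosses of a coin with P(heads) = 3/5 (an assumed value; fairness is
# irrelevant). "Coins have no memory" is modeled by multiplying the
# single-toss probabilities.
p = Fraction(3, 5)
coin = {1: p, 0: 1 - p}
two_tosses = {(x, y): coin[x] * coin[y] for x, y in product(coin, coin)}

# Hypothetical joint distribution of "taller than average" (first coordinate)
# and "heavier than average" (second coordinate); the numbers are invented.
tall_heavy = {(1, 1): Fraction(35, 100), (1, 0): Fraction(15, 100),
              (0, 1): Fraction(15, 100), (0, 0): Fraction(35, 100)}

print(independent(two_tosses))  # True
print(independent(tall_heavy))  # False
```

In the invented height and weight table, knowing that the person is tall raises the probability of "heavy" from 1/2 to 7/10, which is the kind of information the rational gambler would use, even though short heavy people and tall light people both have positive probability.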

Appendix:

Review of the concept of a function.

Definition. A function is a process which associates members of a set A with members of a set B in such a way that every member of A is associated with precisely one member of B.

Example 1. Let A be the set of all men in the world; let B be the set of all men who have ever lived. The pairing that associates a man with his father is a function. (Every element a in A is associated with exactly one element b in B.)

Example 2. Let A be the set { #, &, 3 }, and let B be the set { 5, 7 }. The set of pairs { (#, 5), (&, 7), (3, 7) } defines a function.

Example 3. Let A be the set of all real numbers, and let B also be the set of all real numbers. Pairing a real number x with the real number 2x + 5 defines a function.

Notation: Write f(x) = 2x + 5. So, for instance, f(1) = 2(1) + 5 = 7.

Remark: Don't get hung up on notation. One could describe the same function by writing g(t) = 2t + 5.
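Examples 2 and 3 can be checked mechanically. The following is a small sketch, with the helper name is_function invented for illustration, that tests whether a set of pairs associates every member of A with exactly one member of B. It also shows that f and g describe the same function despite the different variable names.

```python
def is_function(pairs, A, B):
    """True if every member of A is paired with exactly one member of B."""
    for a in A:
        images = {b for (x, b) in pairs if x == a}
        if len(images) != 1 or not images <= B:
            return False
    # No pair may start from something outside A.
    return all(x in A for (x, _) in pairs)

# Example 2 from the appendix.
A = {"#", "&", 3}
B = {5, 7}
pairs = {("#", 5), ("&", 7), (3, 7)}
print(is_function(pairs, A, B))               # True
print(is_function(pairs | {("#", 7)}, A, B))  # False: "#" would have two partners

# Example 3: the same rule written with two different variable names.
def f(x):
    return 2 * x + 5

def g(t):
    return 2 * t + 5

print(f(1), g(1))  # 7 7
```

Note that two members of A (here & and 3) may share the same partner in B; what the definition rules out is one member of A having more than one partner, or none.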
