
AP Exam Review Unit 1: Exploring and Understanding Data

Chapter 1/2
1. Characteristics recorded about each individual are called variables.
   a. Categorical: sex, race, ethnicity
   b. Quantitative: income, weight, height (Note: quantitative variables have units!)

Chapter 3
2. Frequency tables record the counts for each category.
   a. Relative frequency tables record the percentages for each category.
3. Bar charts graph categorical data; the height of each bar represents the frequency of its category.
   a. All bars are the same width (area principle), with spaces between them to indicate that they are freestanding.
   b. Note: It is important to label all graphs, including title, scale, and axes!
4. Pie charts show the relative proportion (percentage) of each category of a categorical variable.
   a. Each pie slice represents the percent of the whole group (the entire circle) that falls in that category.
   b. Note: It is harder to compare category sizes, but easier to make comparisons against typical percentages.
5. A contingency table shows how individuals are distributed along each variable based on the other variable.
   a. Marginal distribution: the distribution of one variable alone, found in the margins of the table (total counts).
   b. Conditional distribution: the distribution of one variable for only those individuals satisfying some condition on the other variable, found inside the table.
   c. Independence: two variables are independent when the distribution of one variable is the same for all categories of the other variable.
      1. Independent: the conditional distributions in the table are similar.
      2. Dependent: the conditional distributions in the table are different.
6. Segmented bar graphs are an alternative to pie charts that divide a bar instead of a circle.
   a. Each bar is treated as a whole (100%) and is divided proportionally into segments corresponding to the percentages in each group.
   b. A great visual display for seeing whether distributions are alike or different in order to decide on independence.
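The marginal and conditional distributions described above can be computed directly from a table of counts. The following is a minimal sketch; the counts, the category labels, and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical counts for illustration only (not data from this review).
table = {
    "Female": {"Yes": 30, "No": 20},
    "Male": {"Yes": 15, "No": 35},
}


def marginal_rows(table):
    """Marginal distribution of the row variable: row totals / grand total."""
    grand_total = sum(sum(row.values()) for row in table.values())
    return {r: sum(row.values()) / grand_total for r, row in table.items()}


def conditional_by_row(table):
    """Conditional distribution of the column variable within each row."""
    return {
        r: {c: count / sum(row.values()) for c, count in row.items()}
        for r, row in table.items()
    }


marginal = marginal_rows(table)
conditional = conditional_by_row(table)
print(marginal)
print(conditional)
```

In this made-up table the conditional distributions differ (60% "Yes" among females versus 30% among males), which is the pattern that suggests the two variables are dependent.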

Chapter 4
7. A frequency table gives bins and the counts in each bin for the distribution of a quantitative variable.
   a. A histogram is one graphical display of the distribution of a quantitative variable; it plots the bin counts as the heights of bars (like a bar chart with no spaces).
      1. A histogram plots counts, whereas a relative frequency histogram shows percentages.
      2. There should be no gaps in your x-axis, since the scale should just slice the range into equal-sized bins.
      3. Make sure your graph is well labeled, scaled, and follows the area principle!
8. A stem-and-leaf display is used for smaller data sets, is easily created by hand, and preserves the individual data values.
   a. Gives a quick visual of the data's distribution, showing the same information as a histogram.
   b. Split stem plots are used for larger data sets or when more detail is needed.

Chapter 4 (cont)
9. A dotplot is a simple display for univariate data, where a dot is placed for each case in the data.
10. Describing distributions: make sure to CUSS and BS!
   C: Center
   U: Unusual features (outliers and gaps)
   S: Shape (unimodal, bimodal, or uniform; skewed left (histogram tail to the left) or skewed right (histogram tail to the right))
   S: Spread
   B: Be
   S: Specific (use context and comparative language!)
11. Timeplots show how the data behave over time.

Chapter 5
12. Measures of Center:
   a. Midrange: the middle of the range, found by taking (min + max)/2; easy to find, but very sensitive to outliers.
   b. Median: the middle of the distribution; same units as the data; the halfway point of the data in numerical order.
      1. Resistant to outliers.
   c. Mean: the arithmetic average; best used for symmetric distributions, since it is extremely sensitive to outliers.
   d. Mean vs. median: the mean is the balance point and the median is the halfway point.
      1. Skewed left distributions: Mean < Median
      2. Symmetric distributions: Mean = Median
      3. Skewed right distributions: Mean > Median
13. Measures of Spread:
   a. Range: the distance from the minimum to the maximum value (max - min); a SINGLE number that is easy to find, but VERY sensitive to outliers.
   b. Interquartile range (IQR): the range of the middle 50% of the data.
      1. Split the data at the median and find the median of each half. IQR = Q3 - Q1
      2. Ignores the extreme data values (possible outliers), so it gives a better indication of spread.
      3. The quartiles are Q1 = 25th percentile, Q2 (median) = 50th percentile, and Q3 = 75th percentile.
   c. Standard deviation (s) and variance (s²): measures of variation for symmetric data; non-resistant to outliers.
      1. They take into account how far each data value is from the mean (the variation).
         a. You must take the square root of the variance to find the standard deviation; difficult to find by hand.
         b. The standard deviation is our ruler for comparing data values with different units: it puts values on the same scale so we can look at how they deviate from the mean.
14. Five-Number Summary: Min, Q1, Q2 (median), Q3, Max
   a. A boxplot is a graphical representation of the 5-number summary.
      1. Regular (non-modified): shows the 5-number summary only.
      2. Modified: shows the 5-number summary as well as outliers. In general, all of our boxplots are modified.
   b. Test for outliers:
      1. Lower fence = Q1 - 1.5(IQR); any value below the lower fence is an outlier.
      2. Upper fence = Q3 + 1.5(IQR); any value above the upper fence is an outlier.
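The five-number summary and the 1.5(IQR) outlier test can be sketched in a few lines. This is an illustration under stated assumptions: the data values are hypothetical, and the quartiles use the "median of each half" method from these notes (leaving the median out of both halves when n is odd). Other textbooks and software use slightly different quartile conventions, so results may differ a little.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]           # values below the median position
    upper = xs[half + n % 2:]   # skips the median itself when n is odd
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


def find_outliers(data):
    """Return the values that fall outside the 1.5 * IQR fences."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]


values = [2, 4, 5, 7, 8, 9, 11, 30]  # hypothetical data
summary = five_number_summary(values)
outliers = find_outliers(values)
print(summary)   # (2, 4.5, 7.5, 10.0, 30)
print(outliers)  # [30]
```

Here IQR = 10 - 4.5 = 5.5, so the fences sit at -3.75 and 18.25, and only the value 30 is flagged.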

Chapter 5 (cont)
15. Which to use?
   a. Symmetric graph: use the mean and standard deviation.
   b. Skewed graph: use the median and IQR (or the 5-number summary).
16. An ogive is a cumulative frequency chart. To build one, separate your data into classes and tally each frequency.
   a. Determine the relative frequency in each class.
   b. Graph each cumulative frequency at the end of each class.
   c. Remember the Old Faithful worksheet.

Chapter 6
17. Standardized scores (z-scores): z is the common notation and always means the value has been standardized.
   a. Have NO UNITS, so they are all comparable regardless of the original units.
18. Shifting data: adding or subtracting the same number to every data value shifts the mean, median, quartiles, minimum, and maximum by that number.
   a. The spread (IQR, range, and standard deviation) does NOT change.
19. Rescaling data: multiplying or dividing every data value by the same number changes both the measures of location and the measures of spread.
20. Converting data to z-scores: we are shifting the mean (to 0) and rescaling the standard deviation (to 1).
   a. Standardizing does not change the shape of the distribution!
   b. A z-score gives us an indication of how unusual a value is by telling us how many standard deviations the value is from the mean.
      1. A negative z-score tells us that the data value is below the mean.
      2. A positive z-score tells us that the data value is above the mean.
      3. A z-score of 0 tells us that the data value is right at the mean.
21. Normal Model: when a distribution is bell-shaped (symmetric and unimodal), it may be modeled by the Normal distribution.
   a. Parameters are values that represent the entire population: mean μ, standard deviation σ.
   b. Statistics are denoted by Latin letters and represent our sample: mean x̄, standard deviation s.
22. Empirical Rule (68-95-99.7 rule): in a Normal model, about 68% of the values fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.
23. Calculator commands for the standard Normal distribution:
   a. To find the proportion between two z-values: 2nd: Distribution: 2: normalcdf(low z, high z)
   b. To find the cutoff z-value for a given percentage: 2nd: Distribution: 3: invNorm(proportion to left of z)
24. To determine if a data set follows a Normal model, either:
   a. Look at a histogram or stem plot and check for non-Normal features (like gaps, outliers, and skewness).
   b. Compare your actual data to the Empirical Rule.
   c. Look at a Normal probability plot (Normality plot) and check whether the plot approximates a diagonal straight line.
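The standardizing, shifting, and rescaling facts above, along with the two calculator commands, can be checked with Python's standard library. This is a sketch, not the calculator's own implementation: the data values are hypothetical, and `normalcdf` and `inv_norm` are helper names made up here to mirror the calculator menu entries.

```python
from statistics import NormalDist, mean, stdev


def z_scores(data):
    """Standardize: subtract the mean, then divide by the standard deviation."""
    m, s = mean(data), stdev(data)  # stdev is the sample (n - 1) version
    return [(x - m) / s for x in data]


STANDARD_NORMAL = NormalDist(mu=0, sigma=1)


def normalcdf(low_z, high_z):
    """Proportion of the standard Normal model between two z-values."""
    return STANDARD_NORMAL.cdf(high_z) - STANDARD_NORMAL.cdf(low_z)


def inv_norm(p_left):
    """Cutoff z-value with the given proportion to its left."""
    return STANDARD_NORMAL.inv_cdf(p_left)


data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical values
z = z_scores(data)                 # mean 0, standard deviation 1
shifted = [x + 10 for x in data]   # center moves, spread does not
rescaled = [2 * x for x in data]   # center and spread both double

print(round(mean(z), 6), round(stdev(z), 6))
print(round(stdev(shifted), 4), round(stdev(rescaled), 4))
print(round(normalcdf(-1, 1), 4))  # close to 68%
print(round(inv_norm(0.975), 2))   # close to 1.96
```

Note that `inv_norm` takes the proportion to the left of the cutoff, matching the invNorm convention described above.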

Unit 1 AP Review
Name

GIRLS
Military time of birth: 0005 0104 0405 0407 0814 0909 1049 1053 1406 1407 1433 1446 1742 1825 2010 2217 2327 2355
Birthweight in grams: 3837 3334 2208 1745 2576 3746 3523 3430 3480 3116 3428 2184 2383 3500 3866 3542 3278 3208

BOYS
Military time of birth: 0118 0155 0257 0422 0431 0708 0735 0812 1035 1133 1209 1256 1305 1514 1631 1657 1807 1854 1909 1947 1949 1951 2037 2051 2104 2123
Birthweight in grams: 3554 3838 3625 2846 3166 3520 3380 3294 3521 2902 2635 3920 3690 3783 3345 3034 3300 3428 4162 3630 3406 3402 3736 3370 2121 3150

1. There has long been an old wives' tale about the time of day that babies are born. Is there any merit to this? Create a contingency table for the gender of the baby versus the time of day the baby was born. (Note: The times in the raw data are in military time!)

            Night Shift (2300-0700)   Day Shift (0700-1500)   Evening Shift (1500-2300)   Total
   Female
   Male
   Total

2. Is there a relationship between the gender of the baby and the time of day the baby is born? Create a segmented bar graph below, remembering to clearly label the graph. Then answer the question.

3. The birthweights listed are in grams. Find the following values for EACH gender.

   Values     GIRLS   BOYS
   Mean
   Median
   Std Dev
   Minimum
   Q1
   Q3
   Maximum
   Range
   IQR

   For EACH gender, test for outliers. Remember to show all WORK!
   Girls: Outliers?
   Boys: Outliers?

4. In the space below, create side-by-side boxplots (showing any outliers) for the birthweights of the babies by gender, remembering to label and scale the graph!

5. Compare the distributions displayed above, remembering to use comparative language (CUSS & BS).

6. The birthweights listed are in grams. Typically, in the US, we report birthweights in pounds and ounces. The conversion factors you need are: 16 ounces = 1 pound and 1 ounce = 28.35 grams. Use these conversion factors to convert the summary statistics for YOUR gender.

   Mean
   Median
   Std Dev
   Minimum
   Q1
   Q3
   Maximum
   Range
   IQR

7. Assuming the birthweights for YOUR gender can be modeled by a Normal distribution, sketch the Normal model and label it based on the 68-95-99.7 rule.

8. Based on this model, what percent of babies for YOUR gender weigh below 8 pounds?

9. Based on this model, what weight would represent the 75th percentile of birthweights for YOUR gender?

AP Exam Review Unit 2: Exploring Relationships Between Variables

Chapter 7
1. A scatter plot is a visual way to show the association between two quantitative variables.
   a. In order to describe a scatter plot, look for:
      1. Form: Is the data linear or curved?
      2. Outliers: Are there any points that appear not to fit the rest of the data?
      3. Direction: Is there a positive, negative, or no association between the variables?
      4. Strength: How much scatter is apparent in the plot? The closer the points are to a straight line, the closer r is to 1 or -1.
   b. Explanatory variable = predictor; goes on the x-axis.
   c. Response variable = the variable that responds to the explanatory variable; goes on the y-axis.
   d. If the relationship between the variables is unclear, it does not matter which one we identify as the explanatory or the response variable.
2. The correlation coefficient (r) gives us a numerical measurement of the strength of the linear relationship between the explanatory and response variables.
   a. Direction: positive r indicates a positive association and negative r indicates a negative association.
   b. Strength: values close to 0 indicate a weak relationship; as r gets closer to 1 or -1, the relationship is stronger; values of exactly 1 or -1 indicate a perfect line and perfect correlation.
   c. When to use correlation:
      1. Quantitative variables: r cannot be applied to categorical data! Make sure you understand your variables.
      2. Linear data: r can always be calculated, but correlation only measures the strength of linear relationships, so watch for curvature!
      3. Outliers: r is calculated using z-scores (mean and standard deviation), so it is non-resistant to outliers!
   d. Properties of correlation:
      1. The sign gives the direction of the association.
      2. Correlation is always between -1 and +1.
      3. Flipping x and y does NOT affect r, and changing units on x or y does not affect r.
      4. NO units! It has been standardized.
      5. Measures a LINEAR relationship only and is non-resistant to outliers.
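Because r is built from z-scores, it can be computed as the sum of the products of paired z-scores divided by n - 1. The sketch below uses that formula on hypothetical data and checks two of the listed properties: swapping x and y, or changing units, leaves r unchanged. The function name and the data are illustrative assumptions.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """r = sum of (z_x * z_y) over all points, divided by (n - 1)."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)


xs = [1, 2, 3, 4, 5]   # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]   # hypothetical response values

r = correlation(xs, ys)
r_swapped = correlation(ys, xs)                       # flip x and y
r_new_units = correlation([100 * x for x in xs], ys)  # change units on x

print(round(r, 4), round(r_swapped, 4), round(r_new_units, 4))
```

For this data set r is about 0.77, a moderately strong positive linear association, and all three printed values agree.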

Chapter 8
3. Least Squares Regression Line (LSRL): a line of best fit may not hit any of the data points, so we find the line that comes closer to the points than any other line.
   a. The LSRL is the line that minimizes the sum of the squared errors (called residuals).
   b. A residual is the difference between the observed and predicted values of y (the dependent variable).
   c. ŷ = a + bx. A hat over a variable indicates a predicted value. (Here a is the intercept b0 and b is the slope b1.)
      1. The LSRL always contains the point (x-bar, y-bar). Use the variable names, not x and y!
      2. Note: The LSRL is non-resistant to outliers.
   d. Slope of the LSRL: b1 = r(sy/sx)
      1. Interpreting the slope:
         a. Slope is defined as the amount of change in y as x increases by 1.
         b. Moving any number of standard deviations in x moves r times that number of standard deviations in y.
   e. Intercept of the LSRL: b0 = ȳ - b1(x̄)
      1. Interpreting the intercept:
         a. The intercept is the value of y-hat when x = 0.
         b. The intercept is not always appropriate for interpretation!
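The slope and intercept formulas above, b1 = r(sy/sx) and b0 = ȳ - b1(x̄), translate directly into code. The following is a minimal sketch on hypothetical data; the helper names are assumptions made for this example.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """r computed from paired z-scores."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)


def lsrl(xs, ys):
    """Return (b0, b1) for the least squares regression line y-hat = b0 + b1 * x."""
    r = correlation(xs, ys)
    b1 = r * stdev(ys) / stdev(xs)   # slope: b1 = r * (sy / sx)
    b0 = mean(ys) - b1 * mean(xs)    # intercept: b0 = y-bar - b1 * x-bar
    return b0, b1


xs = [1, 2, 3, 4, 5]   # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]   # hypothetical response values

b0, b1 = lsrl(xs, ys)


def predict(x):
    """Predicted response (y-hat) for a given x."""
    return b0 + b1 * x


print(round(b0, 4), round(b1, 4))   # 2.2 0.6
print(round(predict(mean(xs)), 4))  # 4.0, which equals y-bar
```

The last line illustrates that the line passes through (x-bar, y-bar): predicting at the mean of x returns the mean of y.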

Chapter 8 (cont)
4. Residuals are the errors that occur because the line doesn't go through each point.
   a. To calculate a residual, subtract the predicted value from the actual value: Residual = y - ŷ
      1. A positive residual is above the line, where the model underestimated the value.
      2. A negative residual is below the line, where the model overestimated the value.
5. A residual plot is a scatter plot of the explanatory variable vs. the residuals.
   a. Residuals are very useful to us because we can determine how well a line fits the data by examining its errors.
   b. A good residual plot has a random scatter with no patterns. This indicates that our model is appropriate.
   c. A bad residual plot shows patterns like curves, V shapes, etc. This indicates that we should look for a better model.
6. Coefficient of Determination (R²)
   a. We have been using correlation to gauge the strength of the linear relationship, but to get a better feel for the data, we can square the r value.
   b. R² is the ratio of the explained variation of the response variable to the total variation.
   c. The squared correlation, r², gives the fraction (or percent) of the data's variation that is accounted for by the linear model.
   d. The remaining fraction (1 - r²) is the amount of the original variance that is left in the residuals (errors).
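Residuals and R² can be computed once a line has been fitted. The sketch below is an illustration on hypothetical data: it fits the least squares line using the algebraically equivalent slope formula (sum of cross-deviations over sum of squared x-deviations), computes each residual as y - ŷ, and then finds R² as 1 minus the fraction of variation left in the residuals.

```python
from statistics import mean

xs = [1, 2, 3, 4, 5]   # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]   # hypothetical response values

mx, my = mean(xs), mean(ys)

# Least squares slope and intercept (equivalent to b1 = r * sy / sx).
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx

predicted = [b0 + b1 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]  # actual - predicted

sse = sum(e ** 2 for e in residuals)    # variation left in the residuals
sst = sum((y - my) ** 2 for y in ys)    # total variation in the response
r_squared = 1 - sse / sst

print([round(e, 2) for e in residuals])  # [-0.8, 0.6, 1.0, -0.6, -0.2]
print(round(r_squared, 4))               # 0.6
```

The first point has a negative residual (the line overestimated it) and the third has a positive residual (the line underestimated it). Plotting `xs` against `residuals` would give the residual plot described above.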

Chapter 9
7. Extrapolation occurs when we use a linear model to make a prediction for an x-value that is far away from the x-values in the data (far from x̄).
   a. Extrapolations assume that past trends will continue into the far future.
   b. If you must extrapolate into the future, at least don't believe that the predictions will come true!
8. Outliers in regression
   a. Outlying points can strongly influence a regression. Even one point can dominate the regression analysis.
   b. Types of outliers in regression:
      1. y-outlier: a point that is extraordinary in its y-value.
      2. x-outlier: a point that is extraordinary in its x-value.
         a. Be wary of x-outliers, as they can have high leverage.
         b. Especially watch what happens when an x-outlier lines up with the rest of the points.
      3. Model outlier: a point that deviates from the regression line.
         a. Whenever you notice a model outlier, you should fit the line to the other points alone and then compare the resulting regression models in order to understand how the outlying point affected the model. Note: This does NOT mean to just delete any point that doesn't fit your line! You must examine and analyze all deviations! Oftentimes, this analysis tells us more than the original model did.
   c. Influential points are points that highly influence the slope of the regression line and the correlation coefficient.
9. Lurking variables are variables that do not show up as part of the model but do affect the apparent relationship between the variables in the model.
   a. No matter how strong the association is, no matter how large the r (or r-squared) value, no matter how straight the line is, you CANNOT say that one variable CAUSES the other!

Chapter 10
10. Non-Linear Regression:

   Model Name    x-axis    y-axis    Equation
   Linear        x         y         ŷ = a + bx
   Exponential   x         log(y)    log(ŷ) = a + bx
   Logarithmic   log(x)    y         ŷ = a + b(log x)
   Power         log(x)    log(y)    log(ŷ) = a + b(log x)
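To illustrate the exponential row of the table, the sketch below re-expresses hypothetical data by taking log(y), fits a line to (x, log y), and then undoes the log to predict on the original scale. The data were constructed to follow y = 2 · 3^x exactly, so the fitted slope and intercept should be log 3 and log 2; real data would only approximately follow the model. Base-10 logs are an assumption here, and any base works as long as it is used consistently.

```python
import math
from statistics import mean


def fit_line(xs, ys):
    """Least squares fit; returns (a, b) for y-hat = a + b * x."""
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b


xs = [0, 1, 2, 3, 4]
ys = [2, 6, 18, 54, 162]   # hypothetical: exactly 2 * 3**x

# Exponential model: straighten the data by plotting log(y) against x.
log_ys = [math.log10(y) for y in ys]
a, b = fit_line(xs, log_ys)   # log(y-hat) = a + b * x


def predict(x):
    """Undo the log to get a prediction on the original y scale."""
    return 10 ** (a + b * x)


print(round(a, 4), round(b, 4))  # 0.301 0.4771
print(round(predict(5), 2))      # 486.0
```

The same pattern covers the other rows: take log(x) for the logarithmic model, and both logs for the power model, before fitting the line.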

Unit 2 AP Review

Name _________________________

Sarah's parents are concerned that she seems short for her age. Their pediatrician has the following record of Sarah's height:

   Age (months):   36   48   51   54   57   60
   Height (cm):    88   90   91   93   94   95

1. Make a scatterplot of these data and describe the overall pattern of the data in context.

2. Find the mean and standard deviation of both variables.

3. Find the correlation coefficient and explain what it means.

4. What is the equation of the least-squares regression line?

5. Compute the slope and explain what the slope means in context.

6. Compute the y-intercept and explain what the y-intercept means in context.

7. Find the coefficient of determination and explain what it means.

8. According to the regression line, how much does Sarah grow each month on average?

9. For each age given, compute the predicted height using the least-squares line.

10. Make the residual plot and determine if the LSRL is a good fit. Why?

11. Use Sarah's predicted height at 60 months and determine the residual.

12. Was it a positive or negative residual? Did the model underestimate or overestimate? How do we know?

13. How tall would you predict Sarah to be at 62 months?

AP Exam Review Unit 3: Gathering Data

Chapter 11
1. Random:
   a. Don't be haphazard. Random outcomes have a lot of structure, especially in the long run.
   b. Randomizing makes the sample as representative as possible.
   c. Randomizing protects us from the influences of all the features of our population, even ones that we may not have thought about.
   d. It makes sure that, on average, the sample looks like the rest of the population.
   e. Not only does randomizing protect us from bias, it also makes it possible for us to draw inferences about the population when we see only a sample.
2. Simulations:
   a. Identify the component to be repeated.
   b. Explain how you will model the outcome.
   c. Explain how you will simulate the trial (the sequence of events that we are pretending will take place).
   d. State clearly what the response variable is.
   e. Run several trials.
   f. Analyze the response variable.
   g. State your conclusion (in the context of the problem).
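The simulation steps above map naturally onto a short program. The scenario below is hypothetical and chosen only for illustration: how many rolls of a fair die does it take, on average, to get the first six? The comments tie each piece of code to one of the steps.

```python
import random


def one_trial(rng):
    """One trial: keep rolling until a six appears (steps a, b, c)."""
    rolls = 0
    while True:
        rolls += 1                   # component: a single roll of a fair die
        if rng.randint(1, 6) == 6:   # model the outcome with a random integer 1-6
            return rolls             # response variable: number of rolls (step d)


def simulate(n_trials, seed=0):
    """Run many trials (step e) and average the response variable (step f)."""
    rng = random.Random(seed)        # fixed seed so the run can be repeated
    results = [one_trial(rng) for _ in range(n_trials)]
    return sum(results) / n_trials


estimate = simulate(10_000)
# Step g: state the conclusion in context.
print(f"On average it took about {estimate:.2f} rolls to get the first six.")
```

A hand simulation for the exam would use far fewer trials and a random digit table instead of `random`, but the structure of the write-up is the same.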

Chapter 12
3. Population (parameter) is the entire group we are interested in.
   a. The parameter's value is rarely known, and our goal is usually to estimate it.
4. Sample (statistic) is a smaller group that is selected from the population.
   a. The statistic's goal is to estimate the parameter.
   b. The proportion of the population that you've sampled doesn't really matter (unless you have a really small population).
   c. It's the sample size itself that makes the difference.
   d. The size of the population does not matter; the size of the sample DOES.

                         Population   Sample
   Mean                  μ            x̄
   Standard deviation    σ            s

5. Polls:
   a. Opinion polls are polls run by organizations such as the Gallup Poll, which are extremely diligent in their selection so that the sample represents the population.
   b. Straw polls are polls that gather information in a very poor way (such as those on websites or in magazines).
   c. A biased sample is one that does not represent the population in some important way.
      1. There is usually no way to fix a biased sample.
      2. There is no way to salvage useful information from it.
6. Problems with a census:
   a. Difficult and impractical to complete.
   b. Populations shift in their demographics.
   c. Too complex in terms of time and budget.
   d. If you are using destructive sampling, you would destroy the population.
7. Sample Types:
   a. A Simple Random Sample (SRS) is a sample in which every person AND every combination of people has an equal chance of being selected.
      1. To get an SRS: define your population of interest (where the sample will come from), assign numbers to each of the subjects, and use a random number table to select the sample.

Chapter 12 (cont)
7. Sample Types: (cont)
   b. A stratified random sample is more complicated than an SRS and involves splitting the population into subgroups. It is more useful when you think certain characteristics may be an influence in the data. (SOME from ALL)
      1. To get a stratified sample: define your population of interest; split your population into homogeneous groups, called strata; within each stratum, use an SRS to determine who is sampled; and combine the results from each stratum.
   c. A cluster sample also involves splitting the population into subgroups, and is more useful when you think all subgroups are pretty similar and each will adequately represent the population's variability. (ALL from SOME)
      1. To get a cluster sample: define your population of interest; split your population into heterogeneous groups, called clusters; use an SRS to determine which cluster(s) to select; and combine the results from each selected cluster.
   d. A systematic sample involves every nth object. This is more useful when you believe that the order of the list is not associated in any way with the responses sought.
      1. To get a systematic sample: define your population of interest, determine a starting place using a random number table, and from your starting place, sample every nth object on the list.
   e. A multistage sample is a sample made up of more than one sample type.
8. How NOT to sample:
   a. Voluntary response samples: a large group is invited to respond, and those who actually do respond are counted. But many of these respondents will probably have strong opinions or motivations.
   b. Convenience samples: we simply include the individuals who are readily available.
9. Problems to watch for:
   a. Undercoverage: bias in which some portion of the population is not sampled at all or has a smaller representation in the sample than it has in the actual population.
   b. Nonresponse bias: someone who is chosen for the sample cannot be contacted or refuses to cooperate. The problem is that those who don't respond may differ from those who do respond.
   c. Response bias: anything in the survey that influences the responses, such as wanting to please the interviewer, not wanting to answer personal or legal questions, etc.
   d. Wording of the question in a survey can also influence the responses. Asking a question with a leading statement is a good way to bias the responses.
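The four sampling procedures described in this chapter can be sketched with Python's `random` module standing in for a random number table. Everything in this example is hypothetical: a population of 60 numbered subjects split into four equal groups, with made-up group names and function names. Whether the groups act as strata or clusters depends only on how they are used.

```python
import random


def simple_random_sample(population, n, rng):
    """SRS: every subject and every combination is equally likely."""
    return rng.sample(population, n)


def stratified_sample(strata, n_per_stratum, rng):
    """SOME from ALL: take an SRS inside every stratum."""
    return {name: rng.sample(members, n_per_stratum) for name, members in strata.items()}


def cluster_sample(clusters, n_clusters, rng):
    """ALL from SOME: pick whole clusters at random and keep everyone in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]


def systematic_sample(population, k, rng):
    """Random starting place, then every k-th subject on the list."""
    start = rng.randrange(k)
    return population[start::k]


rng = random.Random(0)             # fixed seed so the example is repeatable
population = list(range(1, 61))    # 60 hypothetical numbered subjects
groups = {
    "group_1": population[0:15],
    "group_2": population[15:30],
    "group_3": population[30:45],
    "group_4": population[45:60],
}

srs = simple_random_sample(population, 5, rng)
stratified = stratified_sample(groups, 3, rng)
cluster = cluster_sample(groups, 1, rng)
systematic = systematic_sample(population, 5, rng)
print(srs, stratified, cluster, systematic, sep="\n")
```

A multistage sample would simply chain these, for example choosing clusters first and then taking an SRS within each chosen cluster.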

Chapter 13
10. An observational study is a study where the researcher simply observes the subjects, recording the choices made and the outcomes; no manipulation of factors is employed.
   a. Types of observational studies:
      1. A prospective study is an observational study in which subjects are followed to observe future outcomes.
      2. A retrospective study is an observational study in which subjects' previous conditions are determined.
      Note: neither prospective nor retrospective studies can show cause-and-effect relationships.
11. An experiment is a study that manipulates a variable to create treatments, then imposes these treatments on the subjects in order to record and compare the responses. Used if we are trying to prove causation.
   a. Elements of an experiment:
      1. A factor is an explanatory variable that the experimenter manipulates. An experiment identifies at least one explanatory variable to manipulate and at least one response variable to measure.
      2. Levels are the specific values that the experimenter chooses for a factor.
      3. Treatments are combinations of specific levels from all the factors.
         a. Once we've decided what to do to our subjects, we need to decide WHO gets what treatment.
         b. To have any hope of drawing a fair conclusion:
            1. We must assign our participants to the treatments in a random manner.
            2. We may not assign based on the participant's (or proctor's) choice.
   b. The four major principles of experimental design:
      1. Control: control sources of variation other than the factors being tested by making conditions as similar as possible for all treatment groups.
      2. Randomization
         a. Allows us to equalize the effects of unknown or uncontrollable sources of variation.
         b. It does not eliminate sources of variation; it spreads them out among the treatments and reduces bias.
      3. Replication
         a. We should repeat the experiment, applying the treatments to a number of subjects. (Replication within a study)
         b. Others should be able to repeat the study, following the same design, and produce similar results. (Replication of a study)
      4. Blocking (not essential, but useful)
         a. When groups of subjects are similar, it is often a good idea to gather them into blocks.
         b. By blocking, we isolate the variability due to differences between blocks so we can see the differences due to the treatments more clearly.
   c. Types of experimental design:
      1. Completely randomized design: all experimental units have an equal chance of receiving any treatment. (Similar to an SRS)
      2. Randomized block design: the randomization occurs only within the blocks. (Similar to a stratified sample) If we feel that a certain characteristic of our experimental units may influence the response, we can isolate the variability due to these differences by blocking our units into groups of similar characteristics and running the experiment separately within each block.
   d. Statistically significant: because we assign treatments at random, different random groups will give different responses. A difference is statistically significant when it is much bigger than what we would have expected by chance alone.
      1. Ask: can this difference be attributed only to the fact that we used a different random group, or is the difference much bigger than what we would have expected by chance alone?
   e. A diagram of an experiment can often help in thinking about the details of the experiment. However, a diagram is just a basic outline, not a complete description!
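The difference between a completely randomized design and a randomized block design comes down to where the random assignment happens. The sketch below is illustrative only: the subject numbers, block names, and treatment labels are hypothetical, and shuffling with `random` stands in for a random number table.

```python
import random


def completely_randomized(units, treatments, rng):
    """Shuffle all units, then deal them out to the treatments in turn."""
    shuffled = units[:]
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups


def randomized_block(blocks, treatments, rng):
    """Randomize separately inside each block of similar units."""
    return {
        name: completely_randomized(members, treatments, rng)
        for name, members in blocks.items()
    }


rng = random.Random(0)          # fixed seed so the example is repeatable
units = list(range(1, 13))      # 12 hypothetical subjects
treatments = ["A", "B"]

crd = completely_randomized(units, treatments, rng)

# Hypothetical blocks of similar subjects (for example, grouped by age).
blocks = {"younger": [1, 2, 3, 4, 5, 6], "older": [7, 8, 9, 10, 11, 12]}
rbd = randomized_block(blocks, treatments, rng)

print(crd)
print(rbd)
```

In the blocked design every block contributes the same number of subjects to each treatment, so differences between blocks cannot be mistaken for a treatment effect.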

12. Blinding: there are two main classes of individuals who can affect the outcome of an experiment: those who could influence the results and those who evaluate the results.
   a. Single-blind: every individual in one of these classes is blinded in the experiment.
   b. Double-blind: everyone in both classes is blinded in the experiment.
13. A placebo is a fake treatment that looks just like the treatments being tested.
   a. Some of the improvement seen with a treatment, even an effective treatment, can be due to the simple act of being treated.
   b. To separate these two effects, we can use a control treatment that mimics the treatment itself.
   c. A placebo is the best way to blind subjects from knowing whether they are receiving the treatment or not. (Placebo controlled)

Chapter 13 (cont)
14. Matching is used to reduce variation in much the same way as blocking. Subjects are paired because they are similar in ways not under study.
   a. Matching can be with another experimental unit or with yourself (as in before/after studies).
15. Confounding variables are two variables (explanatory or lurking) whose effects on the response variable cannot be distinguished from each other.

AP Exam Unit 3 Review Worksheet

Name

1. Suppose you were asked to help design a survey of adult city residents in order to estimate the proportion that would support a sales tax increase. The plan is to use a stratified random sample, and three stratification schemes have been proposed.
   Scheme #1: Stratify adult residents into four strata based on the first letter of their last name (A-G, H-N, O-T, U-Z).
   Scheme #2: Stratify adult residents into three strata: college students, non-students who work full-time, and non-students who do not work full-time.
   Scheme #3: Stratify adult residents into five strata by randomly assigning residents into one of five strata.
   Which of the three stratification schemes would be best in this situation? Explain why you chose that scheme.

2. A statistics teacher wants to know how her students feel about an introductory statistics course. She decides to administer a survey to a random sample of students taking the course. She has several sampling plans to choose from. Name the sampling strategy in each.
   a. There are four ranks of students taking the class: freshmen, sophomores, juniors, and seniors. Randomly select 15 students from each class rank.
   b. Randomly select a class rank (freshmen, sophomores, juniors, or seniors) and survey every student in that class rank.
   c. Each student has a nine-digit student number. Randomly choose 60 numbers.
   d. Using the class roster, select every fifth student from the list.

3. Listed below are the names of 20 students who are juniors. Use the random numbers listed below to select five of them to be in your sample. Clearly explain your method.
   Names: Adam, Chris, Dave, Deirdre, Don, Ellen, Eric, Joan, Jonathan, Judi, Joy, Kenny, Laura, Mary, Paul, Peter, Rachel, Robert, Sara, Stacey
   Random numbers: 39634 14595 62349 35050 74088 40469 65564 27478 16879 44526 19713 67331 39153 93365 69459 54526 17986 22356 24537 93208
   Method:

4. Is vitamin C helpful in preventing the common cold? We wish to conduct an experiment to examine if 1000 mg of vitamin C per day can reduce the incidence of colds. Suppose that all 120 AP Statistics students at a high school have volunteered for the study.
   a. Design a completely randomized experiment for this situation.
   b. Would we need to block or blind this experiment? Explain.

5. Researchers who are studying a new shampoo formula plan to compare the condition of hair for people who use the new formula with the condition of hair for people who use the current formula. Twelve volunteers are available to participate in this study. Information on these volunteers (numbered 1-12) is shown in the table.

   Volunteer   Gender   Age
   1           Male     21
   2           Female   20
   3           Male     47
   4           Female   60
   5           Female   62
   6           Male     61
   7           Male     58
   8           Female   44
   9           Male     44
   10          Female   24
   11          Male     23
   12          Female   46

   a. These researchers want to conduct an experiment involving the two formulas (new and current) of shampoo. They believe that the condition of hair changes with age but not gender. Because the researchers want the size of the blocks in an experiment to be equal to the number of treatments, they will use blocks of size 2 in their experiment. Identify volunteers (by number) that would be included in each of the six blocks and give the criteria you need to form the blocks.
      Block 1:
      Block 2:
      Block 3:
      Block 4:
      Block 5:
      Block 6:
   b. Other researchers believe that hair condition differs with both age and gender. These researchers will also use blocks of size 2 in their experiment. Identify volunteers (by number) that would be included in each of the six blocks and give the criteria you need to form the blocks.
      Block 1:
      Block 2:
      Block 3:
      Block 4:
      Block 5:
      Block 6:
   c. The researchers in part b decide to select three of the six blocks to receive the new formula and three to receive the current formula. Is this an appropriate way to assign treatments? If so, describe a method for selecting the three blocks to receive the new formula. If not, describe an appropriate method for assigning treatments.

AP Exam Review Unit 4

Probability

Chapter 14
1. Random outcomes arise in situations where we know what outcomes could happen, but we don't know which particular outcome did or will happen.
   a. In the long run, random outcomes settle down in a way that is actually consistent and predictable.
2. Probability is the measure of the likelihood that a given event will occur, expressed as a number between 0 and 1 that describes long-run behavior.
   a. Trial: each attempt that generates an outcome.
   b. Outcome: whatever happens in each trial.
   c. Sample space: the set of all individual outcomes that are possible in a trial.
   d. Event: a combination of outcomes in a trial.
   e. Conditions of probability:
      1. Probabilities must be between 0 and 1, inclusive.
      2. A probability of zero indicates that the event cannot occur.
      3. A probability of one indicates that the event is certain to occur. (In the long run, of course!)
   f. Probability Rules:
      1. For any event A, $0 \le P(A) \le 1$.
      2. The probability of the set of all possible outcomes of a trial must be 1: P(S) = 1.
      3. Sample space S: the set of all possible outcomes.
      4. Complement Rule: The set of outcomes that are not in the event A is called the complement of A, denoted $A^C$.
         a. The probability of an event occurring is 1 minus the probability that it doesn't occur: $P(A) = 1 - P(A^C)$
      5. Addition Rule: For two disjoint (mutually exclusive) events A and B, the probability that one or the other occurs is the sum of the probabilities of the two events. (Disjoint or mutually exclusive means that the two events cannot occur at the same time.)
         P(A or B) = P(A) + P(B), provided that A and B are disjoint.
      6. Multiplication Rule: For two independent events A and B, the probability that both A and B occur is the product of the probabilities of the two events.
         P(A and B) = P(A) x P(B), provided that A and B are independent.
3. Independence means the trials cannot be related: the outcome of one trial doesn't influence or change the outcome of another. We need independent trials in order to make statements about the long-run behavior of random phenomena.
4. Law of Large Numbers: the long-run relative frequency of repeated independent events gets closer and closer to the true relative frequency (the probability) as the number of trials increases.
   a. Misinterpretation of the law: Many people believe that random phenomena are supposed to compensate for whatever happened in the past. They do not.
5. Common Errors:
   a. Don't add probabilities of events if they're not disjoint!! For example, to find the probability of owning a car or a house, you cannot just add the probabilities of owning a car and owning a house because those two events are not disjoint. Many people own both!
   b. Don't multiply probabilities of events if they're not independent!! For example, to find the probability of being absent and today being Friday, you cannot just multiply these probabilities because those two events are not independent. Knowing that it is Friday changes the probability of being absent.
   c. Don't confuse disjoint and independent. Disjoint events CANNOT be independent!! If A and B are disjoint, knowing that A occurred tells you that B did not.
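The Law of Large Numbers can be watched in a small simulation. The sketch below is only an illustration: the function name, the choice of a fair coin (p = 0.5), and the fixed seed are assumptions made for this example, not part of the review sheet.

```python
import random

def running_relative_frequency(n_trials, p=0.5, seed=0):
    """Simulate independent trials and record the relative frequency
    of successes after 10, 100, 1000, ... trials."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    successes = 0
    checkpoints = {}
    next_checkpoint = 10
    for trial in range(1, n_trials + 1):
        if rng.random() < p:  # one independent trial
            successes += 1
        if trial == next_checkpoint:
            checkpoints[trial] = successes / trial
            next_checkpoint *= 10
    return checkpoints

frequencies = running_relative_frequency(100_000)
for n, freq in frequencies.items():
    print(f"after {n:>6} trials: relative frequency = {freq:.4f}")
```

Early checkpoints can sit far from 0.5, while the later ones settle close to it. Nothing "compensates" for an early streak; it is simply swamped by the trials that follow.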

Chapter 15
6. Finding Probability:
   a. When the k possible outcomes are equally likely, each has a probability of 1/k.
   b. For any event A that is made up of equally likely outcomes: $P(A) = \frac{\text{count of outcomes in } A}{\text{count of all possible outcomes}}$
   c. General Addition Rule: P(A or B) = P(A) + P(B) - P(A and B). When our events are not disjoint, adding P(A) and P(B) double counts the probability of both A and B occurring, so we subtract it once.
   d. Conditional probability: takes into account a given condition. It is written as P(B|A) and read as "the probability of B given that A has occurred": $P(B|A) = \frac{P(A \text{ and } B)}{P(A)}$
   e. General Multiplication Rule: P(A and B) = P(A) x P(B|A) when our events are not independent.
      Note: There's nothing special about which one we write as A or B, so this rule can also be stated as: P(A and B) = P(B) x P(A|B)
7. Formal Independence: Events A and B are independent whenever P(B|A) = P(B).
   a. Independence of two events means the outcome of one event does not influence the probability of the other.
8. Replacement: Sampling without replacement means that once an individual is drawn it doesn't go back into the pool.
   a. We often sample without replacement, which doesn't matter too much when dealing with a large population.
   b. When drawing from a small population, we need to take note and adjust probabilities accordingly.
9. Tree diagrams: help us think through conditional probabilities by showing sequences of events as paths that look like branches of a tree.
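The Chapter 15 rules can be checked with exact fractions. The card-deck numbers below are an illustrative example chosen for this sketch (a standard 52-card deck with 13 hearts, 12 face cards, and 3 face cards that are hearts); the variable names are not from the review sheet.

```python
from fractions import Fraction

# Standard 52-card deck: 13 hearts, 12 face cards, 3 cards that are both.
p_heart = Fraction(13, 52)
p_face = Fraction(12, 52)
p_heart_and_face = Fraction(3, 52)

# General Addition Rule: subtract the overlap so it is not double counted.
p_heart_or_face = p_heart + p_face - p_heart_and_face

# Conditional probability: P(face | heart) = P(heart and face) / P(heart).
p_face_given_heart = p_heart_and_face / p_heart

# Formal independence check: is P(face | heart) equal to P(face)?
independent = p_face_given_heart == p_face

# General Multiplication Rule, drawing two cards without replacement:
# P(ace and ace) = P(first is ace) x P(second is ace | first is ace).
p_two_aces = Fraction(4, 52) * Fraction(3, 51)

print(p_heart_or_face, p_face_given_heart, independent, p_two_aces)
```

Here suit and "face card" turn out to be independent, while the two draws are not: removing the first ace changes the probability for the second draw from 4/52 to 3/51.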

Chapter 16
10. A random variable assumes any of several different values as a result of some random event.
    a. Random variables are denoted by capital letters, such as X.
    b. A particular value of the random variable is denoted with a lowercase letter, x.
    c. Types of random variables:
       1. Discrete: a random variable that takes one of a finite number of distinct outcomes.
       2. Continuous: a random variable that takes any numeric value within a range of values.
11. A probability model consists of a collection of all the possible values and the probabilities that they occur.
12. Expected value E(X) (center) is the value we expect a random variable to take on average in the long run. It can also be notated as $\mu$ (for population mean).
    a. To find the expected value of a random variable, sum the products of each possible value and the probability that it occurs: $\mu = E(X) = \sum x \, P(x)$
13. Variance for a random variable is $\sigma^2 = Var(X) = \sum (x - \mu)^2 P(x)$
14. Standard deviation for a random variable is $\sigma = \sqrt{Var(X)} = \sqrt{\sum (x - \mu)^2 P(x)}$

Note: Whenever we report a center, we also need to report a spread. It's not enough to know where it is centered; we also need to know the variability.
Remember: To find E(X) and $\sigma$, put x in L1 and P(x) in L2, then STAT: CALC: 1-Var Stats L1, L2. E(X) = $\bar{x}$ and $\sigma = \sigma_x$.

15. Combining random variables:
    $E(X \pm Y) = E(X) \pm E(Y)$
    $\sigma_{X \pm Y} = \sqrt{\sigma_X^2 + \sigma_Y^2}$ (for independent X and Y, you always add the variances, never the standard deviations)
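The expected value, standard deviation, and combining rules can be sketched in a few lines. The function names and the coin-flip model below are made up for illustration, and the combining function assumes X and Y are independent, in which case the variances add (the standard deviations do not).

```python
import math

def mean_and_sd(values, probs):
    """Return (mu, sigma) for a discrete probability model."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    mu = sum(x * p for x, p in zip(values, probs))
    variance = sum((x - mu) ** 2 * p for x, p in zip(values, probs))
    return mu, math.sqrt(variance)

def sd_of_sum_or_difference(sd_x, sd_y):
    """SD of X + Y or X - Y, assuming X and Y are independent."""
    return math.sqrt(sd_x ** 2 + sd_y ** 2)

# Hypothetical model: X = number of heads in one fair coin flip.
mu, sigma = mean_and_sd([0, 1], [0.5, 0.5])
combined = sd_of_sum_or_difference(3, 4)
print(mu, sigma, combined)
```

Note that the combined spread for standard deviations 3 and 4 is 5, not 7: the variances 9 and 16 add to 25.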

Chapter 17
16. Bernoulli trials often arise when we run simulations. We have Bernoulli trials if:
    a. There are two possible outcomes (success and failure).
    b. The probability of success, p, is constant.
    c. The trials are independent.
       1. 10% condition: Sampling without replacement violates the independence assumption, but it is still okay to proceed as long as the sample is smaller than 10% of the population.
17. Geometric probability is the model for a random variable that counts the number of Bernoulli trials until the first success. It is completely specified by one parameter, p, the probability of success. Denoted as Geom(p).
    p = probability of success
    q = 1 - p = probability of failure
    X = number of trials until the first success occurs
    $P(x) = q^{x-1} p$
    a. The expected value is $\mu = \frac{1}{p}$. The standard deviation is $\sigma = \sqrt{\frac{q}{p^2}}$.
    b. On TI, geometcdf(p, x) returns the total for all values in the interval [1, x].
18. Binomial probability is the model for a random variable that counts the number of successes in a fixed number of Bernoulli trials. There are two parameters: n = number of trials and p = probability of success. Denoted as Binom(n, p).
    $P(x) = {}_nC_x \, p^x q^{n-x}$
    a. The expected value is $\mu = np$. The standard deviation is $\sigma = \sqrt{npq}$.
    b. On TI, binomcdf(n, p, x) returns the total for the binomial in the interval [0, x].
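The geometric and binomial formulas translate directly into code. The sketch below mirrors the argument order of the TI functions for convenience; the Python function names themselves are assumptions for this example, not calculator commands.

```python
import math

def geom_pdf(p, x):
    """P(X = x): first success on trial x, so x - 1 failures come first."""
    return (1 - p) ** (x - 1) * p

def geom_cdf(p, x):
    """P(X <= x): total over the interval [1, x], like geometcdf(p, x)."""
    return sum(geom_pdf(p, k) for k in range(1, x + 1))

def binom_pdf(n, p, x):
    """P(X = x): exactly x successes in n trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def binom_cdf(n, p, x):
    """P(X <= x): total over the interval [0, x], like binomcdf(n, p, x)."""
    return sum(binom_pdf(n, p, k) for k in range(0, x + 1))

print(geom_cdf(0.5, 2))      # first success within 2 trials
print(binom_pdf(2, 0.5, 1))  # exactly 1 success in 2 trials
```

For "at least" or "more than" questions, use the complement: for example, P(X > x) = 1 - cdf(x).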

AP Exam Unit 4 Review Worksheet

Name ____________________

1. We can hire a cab from one of three firms: X, Y, and Z. Of the hirings, 40% are X, 50% are Y, and 10% are Z. For the cabs hired from X, 9% are late; the corresponding percentages for Y and Z are 6% and 20%, respectively. Calculate the probability that the next cab hired:
   a. will be from X and will not arrive late

   b. will arrive late

   c. given that a call is made for a cab and it arrives late, find the probability that it came from Y.

2. In a group of 100 people: 40 own a cat, 25 own a dog, and 15 own a cat and dog. Find the probability that a person chosen at random:
   a. owns a cat or dog

   b. owns a dog or a cat, but not both

   c. owns a dog, given that he owns a cat

   d. does not own a cat, given that he owns a dog

3. You play tennis regularly with a friend and from past experience you believe that the outcome of each match is independent. For any given match you have a probability of 0.6 of winning. The probability that you win the next two matches is?

4. One thousand students at a city high school were classified according to both GPA and whether they consistently skipped classes.

                          GPA
                     < 2.0   2.0-3.0   > 3.0   Total
   Skipped a lot       80       25        5
   Skipped little     175      450      265
   Total

   a. What is the probability that a student has a GPA between 2 and 3?

   b. What is the probability that a student has a GPA under 2.0 and has skipped class a lot?

   c. What is the probability that a student has a GPA under 2.0 given that he skipped class a lot?

   d. What is the probability that a student has a GPA over 3.0 and has skipped class a lot?

   e. Are "GPA between 2.0-3.0" and "skipped class little" independent? Why or why not?

5. Two events A and B are such that: P(A) = .6, P(B) = .3, and P(A|B) = .8. Find the probability that:
   a. both events occur

   b. only one of the two events occurs

6. Selected boxes of a breakfast cereal contain a prize. Suppose that 5% of the boxes contain the prize and the other 95% contain the message "Sorry, try again." A consumer determined to find a prize decides to continue to buy boxes of cereal until a prize is found. Consider the random variable x, where x = number of boxes purchased to get a prize.
   a. What kind of probability distribution is this?

   b. What is the probability that at most 2 boxes must be purchased?

   c. What is the probability that exactly four boxes must be purchased?

   d. What is the probability that more than four boxes must be purchased?

7. A consumer organization estimates that 29% of new cars have a cosmetic defect, such as a scratch or a dent, when they are delivered to car dealers. This same organization believes that 7% have a functional defect (something that does not work properly) and that 2% of new cars have both kinds of problems. If you have a functional defect on a new car, what's the probability it also has a cosmetic defect?

8. A company's human resources officer reports a breakdown of employees by job type and gender, shown in the table:

                  Male   Female
   Management       7       6
   Supervision      8      12
   Production      45      72

   What's the probability that a worker selected at random is female, if the person is a supervisor?

9. To play a game, you must pay $5 for each play. There is a 10% chance you will win $100, a 40% chance you will win $50, and a 50% chance you will win only $25. What are the mean and standard deviation of your winnings?

10. Safety engineers must determine whether industrial workers can operate a machine's emergency shutoff device. Among a group of test subjects, 66% were successful with their left hands, 82% with their right hands, and 51% with both hands. What percent of these workers could not operate the switch with either hand?

11. Neurological research has shown that in about 80% of people language abilities reside in the brain's left side. Another 10% display right-brain language centers, and the remaining 10% have 2-sided language control. Assume that a freshman comp class contains 25 randomly selected people. What's the probability that no more than 15 of them have left-brain language control?

12. Since the stock market began in 1872, stock prices have risen in about 73% of the years. Assuming that market performance is independent from year to year, what's the probability that the market will rise during at least 1 of the next 5 years?

13. The Centers for Disease Control say that about 30% of high school students smoke tobacco (down from a high of 38% in 1997). Suppose you randomly select high school students to survey them on their attitudes toward scenes of smoking in the movies. What's the probability that there are no more than 3 smokers among the 15 people that you chose?

AP Exam Review Unit 5

Inference with Proportions

Chapter 18
1. Sampling Distribution: Even though we depend on sampling distribution models, we never actually get to see them.
   a. Sampling distribution models are important because they act as a bridge from the real world of data to the imaginary world of the statistic.
   b. They enable us to say something about the population when all we have is data from the real world.
2. Sampling distribution of $\hat{p}$ is the model we would get if we took every possible sample of the same size from the same population and made a histogram of the sample proportions. Provided that the sampled values are independent and the sample size is large enough, the sampling distribution of $\hat{p}$ is modeled by a Normal model with:
   a. mean: $\mu(\hat{p}) = p$, where p = true proportion of successes in the population and $\hat{p}$ = proportion of successes in the sample
   b. standard deviation: $SD(\hat{p}) = \sqrt{\frac{pq}{n}}$
   c. conditions for proportions that MUST be true before using the Normal model:
      1. 10% condition: If the sample is drawn without replacement, the sample size, n, must be less than 10% of the population.
      2. Random Sampling condition: The sample must be an SRS or representative of the population.
      3. Success/failure condition: The sample size has to be big enough so that both np and nq are at least 10.
   d. If conditions are met, we know the sampling distribution follows the Normal model.
      1. If it does, all of the previous information we have about Normal models holds true, such as the 68-95-99.7 Rule and finding probabilities from z-scores.
3. When we get to means in Unit 6, the success/failure condition changes to a Normality condition:
   a. Normality condition:
      1. Either the original population must be Normal, or
      2. the Central Limit Theorem (CLT) will assure a Normal-like distribution when our sample size exceeds 30.
   Mean: $\mu(\bar{y}) = \mu$
   Standard deviation: $\sigma(\bar{y}) = SD(\bar{y}) = \frac{\sigma}{\sqrt{n}}$
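The Normal model for a sample proportion, along with two of its conditions, can be sketched as follows. The function name, the dictionary layout, and the example numbers (p = 0.5, n = 100, population of 5000) are assumptions for illustration; the random sampling condition cannot be checked by arithmetic and is left to the reader.

```python
import math

def phat_sampling_model(p, n, population_size=None):
    """Mean and SD of the sampling distribution of p-hat,
    plus the conditions that can be checked numerically."""
    q = 1 - p
    conditions = {
        # Success/failure: np and nq both at least 10.
        "success_failure": n * p >= 10 and n * q >= 10,
        # 10% condition needs the population size; None means "unknown".
        "ten_percent": None if population_size is None
        else n < 0.10 * population_size,
    }
    return {"mean": p, "sd": math.sqrt(p * q / n), "conditions": conditions}

model = phat_sampling_model(p=0.5, n=100, population_size=5000)
print(model)
```

With these numbers the model is centered at 0.5 with a standard deviation of about 0.05, so by the 68-95-99.7 Rule roughly 95% of sample proportions would fall between 0.40 and 0.60.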

Chapter 19
4. A confidence interval builds in the idea that our guess will be a little bit off.
   a. sample proportion: $\hat{p}$
   b. center is $\hat{p}$
   c. standard deviation is $\sqrt{\frac{pq}{n}}$
   d. Standard error of $\hat{p}$: Since we don't know p, we can't find the true standard deviation of the sampling distribution model, so we find the standard error: $SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}}$
   e. 95% confidence interval: if we reach out 2 SEs in either direction of $\hat{p}$, we can be 95% confident that this interval captures the true proportion.
5. Margin of Error (ME) is the extent of the interval on either side of $\hat{p}$.
   a. The formula for a confidence interval is: estimate +/- margin of error
   b. The more confident we want to be, the larger our ME needs to be.
   c. In order to cut the ME in half, you have to quadruple the sample size.
6. Most commonly chosen confidence levels and critical values:
      90%: z* = 1.645
      95%: z* = 1.96
      99%: z* = 2.576
   To find z*, use the t table: look at the confidence levels in the bottom row and find the critical value right above it, in the last (infinity) row.
7. So what does 90% = 1.645 mean? For a 90% confidence interval, the critical value is 1.645 because for a Normal model 90% of the values are within 1.645 standard deviations of the mean.
8. Confidence level (typically 90%, 95%, or 99%) tells us the proportion of intervals that *in the long run* will capture our true population parameter by using this method.
9. When conditions are met, we may proceed with a one-proportion z-interval: $\hat{p} \pm z^* \sqrt{\frac{\hat{p}\hat{q}}{n}}$
   Procedure for Confidence Intervals (PANIC):
   P: define your Parameter
   A: state your Assumptions/conditions
   N: Name your interval
   I: find your Interval
   C: write your Conclusion in Context: "I am ___% confident that the true proportion of ____ lies between ____ and _____."
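A one-proportion z-interval can be sketched with Python's standard library. The function name and the sample numbers (50 successes out of 100) are hypothetical; `statistics.NormalDist().inv_cdf` plays the role of looking up z* in the table. The sketch assumes the conditions have already been checked.

```python
import math
from statistics import NormalDist

def one_proportion_z_interval(successes, n, confidence=0.95):
    """Return (low, high, z_star) for p-hat +/- z* x SE(p-hat)."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    # z* leaves (1 - confidence) / 2 in each tail of the Normal model.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * math.sqrt(p_hat * q_hat / n)
    return p_hat - margin, p_hat + margin, z_star

low, high, z_star = one_proportion_z_interval(50, 100, confidence=0.95)
print(f"z* = {z_star:.3f}, interval = ({low:.3f}, {high:.3f})")
```

Calling the function with 200 successes out of 400 keeps the same center but halves the margin of error, which matches the rule that quadrupling the sample size cuts the ME in half.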

Chapter 20
10. A hypothesis test starts from a proposed model for the world. If the data are consistent with that model, we have no reason to disbelieve the hypothesis. (Note: this does NOT mean the data PROVE the hypothesis.)
    a. If the facts are inconsistent with the model, we need to make a choice as to whether they are inconsistent enough to disbelieve the model.
    b. We begin by assuming that a hypothesis is true. Next we consider whether the data are consistent with the hypothesis. If they are, all we can do is retain the hypothesis we started with. If they are not, then we ask whether they are unlikely beyond a reasonable doubt.
    c. In Statistics, we can quantify our doubt by finding the probability that data like ours could occur based on our hypothesized model.
       1. Retain (fail to reject): the results seem consistent with what we would expect from natural sampling variability.
       2. Reject: the probability of seeing results like our data is really low.
11. Writing the Hypotheses
    a. Null hypothesis (Ho) specifies a population parameter of interest and proposes a value for it. (Nothing different, nothing going on.)
    b. Alternate hypothesis (Ha) contains the values of the parameter we accept if we reject the null. (There is something different, something did go on.)
12. Look at the evidence: In order to evaluate the evidence, we want to compare our data to what we would expect given that Ho is true.
    a. We can do this by finding out how many standard deviations our statistic is from the hypothesized mean.
    b. We can then make a decision by asking how likely it is to get results like ours if the null hypothesis were true.
13. Procedure for Hypothesis Test (PHANTOMS):
    P: Parameter of interest. p indicates the true proportion of ____.
    H: State your Hypotheses. Ho: p = ____   Ha: p < ____ or p > ____ or p $\ne$ ____
    A: Assumptions/Conditions
       10%: the sample is less than 10% of our population
       SRS (or representative sample)
       At least 10 expected successes and at least 10 expected failures
    N: Name your procedure. If the conditions are met, we will continue with _________________.
    T: Calculate a Test statistic. For a 1-proportion z-test, the test statistic is $z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 q_0}{n}}}$, where $p_0$ is the value proposed in Ho.
    O: Obtain a P-value. The p-value tells us the probability that the observed test statistic (or an even more extreme value) could occur if the null is true.
    M: Make a decision, written in terms of the null hypothesis. Based on the p-value, we either retain (fail to reject) or reject Ho.
    S: State conclusion in context, written in terms of the alternate hypothesis. We have enough evidence (or not enough evidence) to conclude that ___________.
14. Two-tailed test: the test we run if we are testing "not equal to" or "differs" ($\ne$); it doesn't matter whether the new result is below or above the null value. The P-value is in both tails of the distribution.
    One-tailed test: the test we run if we are concerned about one direction only (< or >). The P-value is in only one tail of the distribution.
15. Finding the p-value
    a. If your conditions are met, you can use the Normal model to find a p-value.
    b. Using your test statistic, determine the probability that you will have a value as extreme as the one you got.
    c. It will help if you sketch your model! (Remember: The smaller the P-value, the more evidence we have against the null hypothesis.)
16. How to make a decision:
    a. The p-value tells us the probability of getting a test statistic as extreme as ours if the null hypothesis is true.
    b. How small the p-value must be to reject the null hypothesis differs from situation to situation.
    c. If possible, we like to include a confidence interval for the parameter of interest when we do a hypothesis test.
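The T and O steps for a one-proportion z-test can be sketched as below. The function name, the `alternative` labels, and the sample numbers (60 successes in 100 trials, Ho: p = 0.5) are assumptions made for illustration; the standard error uses the null value, since we compute as if Ho were true.

```python
import math
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0, alternative="two-sided"):
    """Return (z, p_value) for Ho: p = p0."""
    p_hat = successes / n
    sd = math.sqrt(p0 * (1 - p0) / n)  # uses p0, not p-hat
    z = (p_hat - p0) / sd
    normal = NormalDist()
    if alternative == "less":        # Ha: p < p0, lower tail
        p_value = normal.cdf(z)
    elif alternative == "greater":   # Ha: p > p0, upper tail
        p_value = 1 - normal.cdf(z)
    elif alternative == "two-sided": # Ha: p != p0, both tails
        p_value = 2 * (1 - normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be less, greater, or two-sided")
    return z, p_value

z, p_value = one_proportion_z_test(60, 100, p0=0.5, alternative="greater")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The same data tested two-sided gives a p-value twice as large, because the "as extreme" region then includes both tails.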

Chapter 21
17. Statistically significant: When the p-value is small, it tells us that our result would be a rare occurrence if Ho were true.
    a. If the data are rare enough, we just don't think it could have happened by chance alone.
    b. We declare our results to be statistically significant when we reject the null: since our data did happen, we believe that something other than Ho is true.
18. Alpha level is a threshold or cutoff value we set to define a "rare" occurrence.
    a. If the p-value falls below the alpha level, we reject Ho.
    b. The alpha level is also called the significance level of the test. Common alpha levels are 0.01, 0.05, and 0.10.
19. The P-value provides more information than simply declaring REJECT or RETAIN: it gives a numerical value to the strength of the evidence against the null hypothesis.
    a. Before calculators/computers, statisticians used the critical value method to make decisions.
       1. Critical value is the z* from the table.
       2. It is the threshold/cutoff value that separates the Rejection Region from the Non-Rejection Region.
    b. p-values are easy to find now and provide more information for your decision!
20. Making Errors: Even when we have lots of data, when we make a decision in a hypothesis test, it's possible that we will make an error.
    Ho: I will be better off if I take no action.
    Ha: I will be better off if I take action.
    a. Type I Error: the null hypothesis is true, but we reject it.
       1. Corresponds to taking action when you would have been better off not doing so. Its probability is controlled by our alpha level!
    b. Type II Error: the null hypothesis is false, but we retain it.
       1. Corresponds to taking no action when you would have been better off taking action.
       2. Its probability is called beta ($\beta$), but it's difficult to calculate because we don't know how false Ho is.
    c. Power of the test: our ability to detect a false null hypothesis ($1 - \beta$). It corresponds to taking action when you should have.
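Beta is hard to pin down because it depends on the unknown truth, but once we assume a specific true proportion it can be computed. The sketch below does this for an upper-tail one-proportion z-test (Ha: p > p0); the function name and the example values are hypothetical, and the Normal model is assumed to be appropriate.

```python
import math
from statistics import NormalDist

def power_upper_tail(p0, p_true, n, alpha=0.05):
    """Power of an upper-tail one-proportion z-test (Ha: p > p0)
    for an assumed true proportion p_true."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    # Smallest sample proportion that leads us to reject Ho.
    cutoff = p0 + z_crit * math.sqrt(p0 * (1 - p0) / n)
    # Sampling distribution of p-hat if the true proportion is p_true.
    sd_true = math.sqrt(p_true * (1 - p_true) / n)
    beta = NormalDist(p_true, sd_true).cdf(cutoff)  # P(retain Ho | p_true)
    return 1 - beta

for p_true in (0.50, 0.55, 0.60, 0.70):
    print(p_true, round(power_upper_tail(0.5, p_true, n=100), 3))
```

When the assumed truth equals p0, the "power" is just alpha (the Type I error rate), and it grows as the truth moves farther from Ho or as n increases.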

Chapter 22
21. Two Sample Proportions is used when we want to compare two groups to see how they differ: whether a treatment is better than a placebo control, or whether this year's results are better than last year's.
    a. Assumptions/Conditions
       1. EACH sample is less than 10% of ITS population
       2. Each sample is random or representative
       3. Successes and failures for EACH sample are at least 10
       4. Samples are independent of each other. There should be no relationship between the groups.
    b. When the conditions are met, we can proceed with a two-proportion z-interval: $(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$
    c. Hypothesis Test:
       P: identify the parameter for the two groups
       H: Null hypothesis (no difference/change): Ho: p1 = p2 or p1 - p2 = 0
          Alternate hypothesis: Ha: p1 ___ p2 or p1 - p2 ___ 0 (fill in <, >, or $\ne$)
       A: The assumptions/conditions remain the same
       N: When the conditions are met, we can proceed with a two-proportion z-test.
       T: $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}}$
       O: Obtain the p-value
       M: Make a decision about the null hypothesis
       S: State your conclusion about the alternate hypothesis in context and make a comparison of higher/lower
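The two-proportion interval and test statistic share the same standard error, so both can be computed together. This is a minimal sketch using the unpooled standard error written in these notes; the function name and the sample counts are hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2, confidence=0.95):
    """Return (difference, interval, z) for comparing two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    # Unpooled standard error of p1-hat minus p2-hat.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    interval = (diff - z_star * se, diff + z_star * se)
    z = diff / se  # test statistic for Ho: p1 - p2 = 0
    return diff, interval, z

diff, interval, z = two_proportion_z(60, 100, 50, 100)
print(f"difference = {diff:.2f}, z = {z:.3f}")
print(f"95% interval = ({interval[0]:.3f}, {interval[1]:.3f})")
```

In this made-up example the interval contains 0, which agrees with a z-statistic that is not large enough to reject Ho at the 5% level in a two-tailed test.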

AP Exam Review Unit 6/7

Inference with Means, Chi-Squared, and Linear Regression

Chapter 23
1. Now that we know how to create confidence intervals and test hypotheses about proportions, it'd be nice to be able to do the same for means.
2. We will base our confidence interval and hypothesis test on the sampling distribution of sample means.
   a. The mean of the model is $\mu$, with a standard deviation of $\frac{\sigma}{\sqrt{n}}$.
3. Central Limit Theorem states that for large samples, the sampling distribution will be close to the Normal model.
   a. For small samples we use the t-distribution, a new model. We need it because we estimate $\sigma$ with s, and using s with z makes our interval too narrow; to make it wider we use t instead of z.
   b. We must allow for some extra variation.
   c. If you know $\sigma$, use the z-distribution, but that rarely happens.
4. t-distribution is a mound-shaped distribution, with mean 0 and a spread that depends on a parameter called degrees of freedom, df, where df = n - 1.
   a. Degrees of freedom are the number of sample values that are free to vary.
   b. There is a different t-distribution for each degree of freedom.
   c. The greater the df, the smaller the spread.
   d. The spread of any t-distribution is greater than that of the standard Normal distribution.
5. To use the model, we must meet certain conditions:
   a. 10% condition
   b. Randomization or representative
   c. Normality (instead of S/F)
      1. For small samples, we have to graph the data to check for outliers/skewness.
      2. For large samples, the Central Limit Theorem applies, so if n > 30, the Normal model applies.
6. If the conditions are met, we will proceed with a one-sample t-interval for means. (Don't forget to report your df!!)
   $\bar{x} \pm t^*_{df} \cdot \frac{s}{\sqrt{n}}$   (estimate +/- margin of error)
7. Finding Sample Size: We still use the ME, but because we don't know n, we don't know t*, so use the infinity line (z*).
8. Hypothesis Test (PHANTOMS):
   P: state the parameter for the ____
   H: Null hypothesis is always the statement of no change: Ho: $\mu$ = _____
      Alternate hypothesis is then the statement of change: Ha: $\mu$ < __ or $\mu$ > __ or $\mu \ne$ __
   A: Conditions/assumptions remain the same
   N: If our conditions/assumptions are met, we will proceed with a 1-sample t-test for means. Again, don't forget to report your df!!!
   T: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$, where $\mu_0$ is the value proposed in Ho
   O: Obtain the p-value
   M: Make a decision to reject or retain the null hypothesis
   S: State a conclusion based on evidence for or against the alternate hypothesis
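The one-sample t computations can be sketched with the standard library. Python's standard library has no t-distribution, so this sketch takes t* as an input that you look up for df = n - 1 (table or calculator); the function names and the five data values are made up for illustration.

```python
import math
from statistics import mean, stdev

def one_sample_t(data, mu0):
    """Return (t, df) for Ho: mu = mu0."""
    n = len(data)
    s = stdev(data)  # sample standard deviation (divides by n - 1)
    t = (mean(data) - mu0) / (s / math.sqrt(n))
    return t, n - 1

def one_sample_t_interval(data, t_star):
    """x-bar +/- t* x s / sqrt(n); t_star must match df = n - 1."""
    n = len(data)
    margin = t_star * stdev(data) / math.sqrt(n)
    return mean(data) - margin, mean(data) + margin

data = [2, 4, 6, 8, 10]  # hypothetical sample
t, df = one_sample_t(data, mu0=4)
low, high = one_sample_t_interval(data, t_star=2.776)  # 95%, df = 4
print(f"t = {t:.3f}, df = {df}, interval = ({low:.2f}, {high:.2f})")
```

With only five observations the interval is wide, which is the extra variation the t-model builds in for small samples.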

Chapter 24
9. Comparing means is not very different from comparing two proportions.
10. Two sample means PANIC
    P: The parameter of interest is the difference between the two means, $\mu_1 - \mu_2$. The sampling distribution model is centered at $\mu_1 - \mu_2$ with standard deviation $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
    A: The conditions/assumptions for 2 sample means are:
       a. 10% condition (both samples)
       b. Randomization (both samples)
       c. Nearly Normal (both samples): graph if data is given!!!
       d. Independent Groups
    N: If the conditions are met, we will proceed with a 2-sample t-interval for means. Don't forget to report the df!!! (Given on your calculator)
    I: $(\bar{x}_1 - \bar{x}_2) \pm t^*_{df} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
    C: State conclusion in context with a comparison of the difference in the two means (higher/lower)
11. Two sample means PHANTOMS
    P: The parameter of interest is the difference between the two means, $\mu_1 - \mu_2$
    H: Null hypothesis (no difference/change): Ho: $\mu_1 = \mu_2$ or $\mu_1 - \mu_2 = 0$
       Alternate hypothesis (a change): Ha: $\mu_1$ ___ $\mu_2$ or $\mu_1 - \mu_2$ ___ 0 (fill in blank with <, >, or $\ne$)
    A: Assumptions/conditions remain the same. Remember to define what groups 1 and 2 are!
    N: If conditions/assumptions are met, we will proceed with a 2-sample t-test for means. As always, don't forget the df!!!
    T: $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
    O: Obtain a p-value
    M: Make a decision to reject or retain the null hypothesis
    S: State a conclusion based on evidence for or against the alternate hypothesis in context with a comparison of the difference in the two means (higher/lower)
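The two-sample t statistic can be sketched as below. The notes only say that the df is "given on your calculator"; this sketch assumes the calculator uses the Welch-Satterthwaite approximation, which is an assumption on my part rather than something stated in the notes. The function name and the two small samples are hypothetical.

```python
import math
from statistics import mean, variance

def two_sample_t(sample1, sample2):
    """Return (t, df) for Ho: mu1 - mu2 = 0 (unpooled)."""
    n1, n2 = len(sample1), len(sample2)
    # Each term is s^2 / n for one sample.
    v1 = variance(sample1) / n1
    v2 = variance(sample2) / n2
    t = (mean(sample1) - mean(sample2)) / math.sqrt(v1 + v2)
    # Assumed df formula: Welch-Satterthwaite approximation.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

group1 = [3, 4, 5, 6, 7]  # hypothetical data
group2 = [1, 2, 3, 4, 5]
t, df = two_sample_t(group1, group2)
print(f"t = {t:.3f}, df = {df:.2f}")
```

The approximate df is usually not a whole number; it lands at n1 + n2 - 2 only when the sample sizes and sample variances are equal, as in this example.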

Chapter 25
12. Matched Pair means: data are paired when:
    a. observations are collected in pairs
    b. observations in one group are naturally related to observations in the other group.
    c. Paired data arise in a number of ways:
       1. Perhaps the most common is to compare subjects with themselves before and after a treatment.
13. Why is paired data special?
    a. The independence assumption between the groups is violated, but that means we get to do a better analysis because we can focus only on how the data has changed.
    b. If you know the data are paired, you can (and must!) take advantage of it.
    c. Once we know the data are paired, we can examine the pairwise differences.
    d. Because it is the differences we care about, we treat them as if they were the data and ignore the original two sets of data.
14. Looking at the difference:
    a. Now that we have only one set of data to consider, we can return to the one-sample t-test or interval.
    b. Mechanically, a matched pairs test or interval is just a one-sample t-test or t-interval for the mean of the pairwise differences.
15. Matched Pairs means PANIC
    P: The parameter of interest is the mean of the paired differences, $\mu_d$
    A: Conditions/assumptions for a matched pairs interval:
       Sample < 10% of the population
       Randomization (either random treatments, random order, or random selection)
       Data must be paired!!!
       Normal (check the differences, not the original data!)
    N: If the conditions are met, we will proceed with a matched pairs t-interval. d represents the difference for each pair. Don't forget to report the df = n - 1, where n = the number of pairs.
    I: $\bar{d} \pm t^*_{n-1} \cdot \frac{s_d}{\sqrt{n}}$
    C: We are ___% confident that the true average (context) for the (1st group) is between ___ and ___ (higher/lower) than the (2nd group).
16. Matched Pairs PHANTOMS
    P: The parameter of interest is the mean of the paired differences, $\mu_d$
    H: Null hypothesis (no difference/change): Ho: $\mu_d = 0$
       Alternate hypothesis (difference/change): Ha: $\mu_d$ ____ 0 (fill in blank with <, >, or $\ne$) *Note: think of what d means!
    A: Conditions/assumptions remain the same! Remember to define what d stands for!
    N: If conditions/assumptions are met, we will proceed with a matched pairs t-test. Don't forget the df!!!
    T: $t = \frac{\bar{d} - 0}{s_d / \sqrt{n}}$
    O: Obtain a p-value
    M: Make a decision to reject or retain the null hypothesis
    S: State a conclusion based on evidence for or against the alternate hypothesis in context with a comparison of the difference in the paired data (higher/lower)
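Mechanically, a matched pairs analysis is a one-sample t procedure on the differences. The sketch below makes that explicit; the function name and the before/after values are hypothetical, and the direction of subtraction (first minus second) is a choice you must state when you define d.

```python
import math
from statistics import mean, stdev

def matched_pairs_t(first, second):
    """Return (differences, t, df) for Ho: mu_d = 0,
    where d = first - second for each pair."""
    if len(first) != len(second):
        raise ValueError("paired data must have the same length")
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)  # n is the number of pairs
    d_bar = mean(differences)
    s_d = stdev(differences)
    t = (d_bar - 0) / (s_d / math.sqrt(n))
    return differences, t, n - 1

after = [11, 14, 17]   # hypothetical scores after a treatment
before = [10, 12, 14]  # the same subjects before the treatment
differences, t, df = matched_pairs_t(after, before)
print(differences, round(t, 3), df)
```

Once the differences are computed, the original two lists are no longer used, and the Normality check should be made on the differences.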

Unit 7
Chapter 26
17. Chi-Squared Distribution is a distribution made up of a family of curves, defined by degrees of freedom.
    a. The Chi-Square test is a non-parametric test of statistical significance.
    b. It does not compare the data to a specific population parameter, but to another distribution.
18. Chi-Squared Goodness of Fit Test (GOF)
    a. Hypothesis test that asks, for categorical data, whether the data fit what we expected.
    b. PHANTOMS for GOF: Data is categorical, so we are looking at df = k - 1, where k is the number of categories.
       P/H: Null hypothesis (no difference/change): Ho: The data fit the expected distribution.
            Alternate hypothesis (difference/change): Ha: The data don't fit our expectations.
       A: Assumptions stay the same for all $\chi^2$ tests:
          The data must be in counts (not percentages or amounts)
          Randomization or representative (of course!)
          Expect at least 5 individuals in each cell (and you MUST show/tell the expected values!)
       N: If our conditions are met, we can proceed with a Chi-Square Goodness of Fit Test with df = k - 1
       T: $\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$
       O: Obtain the p-value
       M: Make a decision to reject or retain
          a. GOF tests are likely performed by people who have a theory of what the proportions should be.
          b. Unfortunately, the only null hypothesis available for a GOF is that the theory is true.
          c. As we know, the hypothesis testing procedure allows us only to reject or fail to reject the null.
       S: State a conclusion in context about the alternate hypothesis
19. Chi-Squared Homogeneity Test
    a. Hypothesis test comparing the distribution of counts for two or more groups on the same categorical variable.
    b. Data is gathered from two or more populations; we are interested in whether those populations have similar or different distributions.
    c. A test of homogeneity is actually the generalization of the two-proportion z-test.
    d. Differences and similarities to GOF:
       1. The statistic that we calculate for this test is identical to the chi-square goodness-of-fit statistic.
       2. We ask whether choices have changed among different populations rather than comparing to a model.
       3. Expected counts are found directly from the data and we have different degrees of freedom.
    e. PHANTOMS for Homogeneity:
       P/H: Null hypothesis (no difference/change): Ho: Distribution of choices is the same among populations
            Alternate hypothesis (difference/change): Ha: Distribution of choices is different among populations
            This data is organized using a table. The df = (#rows - 1)(#columns - 1)
       A: Assumptions stay the same for all $\chi^2$ tests. Expected: you will use and show a matrix
       N: If our conditions are met, we can proceed with a Chi-Square Homogeneity Test with df = (R-1)(C-1)
       T: $\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$
       O: Obtain the p-value
       M: Make a decision to reject or retain
       S: State a conclusion in context about the alternate hypothesis
20. Chi-Squared Independence Test
    a. Data is gathered from one population and categorized based on two variables: are the variables associated?
    b. A test of whether the two categorical variables are independent examines the distribution of counts for one group of individuals classified according to both variables in a contingency table.
    c. A chi-square test of independence uses the same calculation as a test of homogeneity.
    d. PHANTOMS for Independence:
       P/H: Null hypothesis (no difference/change): Ho: There is no relationship between variable 1 and variable 2.
            Alternate hypothesis (difference/change): Ha: There is a relationship between variable 1 and variable 2.
            This data is organized using a table. The df = (#rows - 1)(#columns - 1)
       A: Assumptions stay the same for all $\chi^2$ tests. Expected: you will use and show a matrix
       N: If our conditions are met, we can proceed with a Chi-Square Independence Test with df = (R-1)(C-1)
       T: $\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$
       O: Obtain the p-value
       M: Make a decision to reject or retain
       S: State a conclusion in context about the alternate hypothesis
21. Homogeneity vs Independence: Homogeneity and Independence are virtually identical.
    a. The difference is:
       1. What you are trying to figure out (same distribution vs. a relationship between variables)
       2. How the data is gathered:
          Homogeneity uses a stratified sample (two or more populations) and asks each person one question (variable).
          Independence uses an SRS (one population) and classifies each person based on two variables.
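All three chi-square tests use the same statistic; they differ in where the expected counts come from. The sketch below computes the statistic for a goodness-of-fit setting and for a two-way table, where each expected count is (row total x column total) / grand total. The function names and all counts are hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def expected_counts(table):
    """Expected count per cell = row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_two_way(table):
    """Return (chi2, df, expected) for homogeneity or independence."""
    expected = expected_counts(table)
    flat_observed = [o for row in table for o in row]
    flat_expected = [e for row in expected for e in row]
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_square_statistic(flat_observed, flat_expected), df, expected

# Goodness of fit: 60 observations, theory says three equal categories.
gof = chi_square_statistic([10, 20, 30], [20, 20, 20])

# Two-way table with 2 rows and 2 columns.
chi2, df, expected = chi_square_two_way([[10, 20], [30, 40]])
print(gof, round(chi2, 4), df, expected)
```

Printing the expected matrix is useful in its own right, since the conditions require showing that every expected count is at least 5.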

Chapter 27

22. In regression, we want to model the relationship between two quantitative variables, the predictor and the response.
23. We imagine an idealized regression line, which assumes the means of the distributions of the response variable fall along the line even though individual values are scattered around it.
24. We write this line with Greek letters and consider the coefficients to be parameters: $\mu_y = \beta_0 + \beta_1 x$, where $\beta_0$ = intercept and $\beta_1$ = slope, corresponding to our fitted line $\hat{y} = b_0 + b_1 x$.
25. PHANTOMS for Inference Regression
   PH  Null hypothesis (no difference/change) Ho: $\beta_1 = 0$
       Alternate hypothesis (difference/change) Ha: $\beta_1$ ___ 0
       As always, don't forget to define your variables!
   A   Assumptions: for all inference methods, we must check some conditions.
       Linearity: check the scatterplot to see if the trend is linear.
       Independence: check to see if residuals are randomly scattered.
       Normality: check to see if the histogram of residuals is symmetric with no outliers.
       Equal Variance: check to see if residuals have a uniform spread.
   N   If our conditions are met, we can proceed with a t-test for the slope of the regression line with df = n - 2.
   T   $t = \frac{b_1 - \beta_1}{SE(b_1)}$
   O   Obtain the p-value.
   M   Make a decision to reject or retain.
   S   State a conclusion in context about the alternate hypothesis.
27. PANIC for Inference Regression
   A   Conditions/assumptions remain the same.
   N   If conditions are met, we can proceed with a t-interval for the slope of the line with df = n - 2.
   I   $b_1 \pm t^*_{n-2} \cdot SE(b_1)$, where $SE(b_1) = \frac{s_e}{\sqrt{n-1}\, s_x}$
   C   State conclusion in context.
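As a rough illustration of the slope formulas above, the following Python sketch computes b1, SE(b1), the t statistic, and a t-interval for a small made-up data set. The data, the function name `slope_inference`, and the critical value `t_star` (3.182, the 95% table value for df = 3) are assumptions chosen for the example.

```python
import math

# Hypothetical data: x = predictor, y = response
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def slope_inference(x, y):
    """Return (b0, b1, SE(b1), t statistic for Ho: beta1 = 0, df)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                # fitted slope
    b0 = y_bar - b1 * x_bar       # fitted intercept
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s_e = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))  # residual std. dev.
    s_x = math.sqrt(sxx / (n - 1))                             # std. dev. of x
    se_b1 = s_e / (math.sqrt(n - 1) * s_x)
    t = (b1 - 0) / se_b1          # test statistic under Ho: beta1 = 0
    return b0, b1, se_b1, t, n - 2

b0, b1, se_b1, t, df = slope_inference(x, y)

# t* for 95% confidence with df = 3, read from a t-table (assumed value)
t_star = 3.182
interval = (b1 - t_star * se_b1, b1 + t_star * se_b1)

print("fitted line: y-hat =", round(b0, 3), "+", round(b1, 3), "x")
print("SE(b1):", round(se_b1, 4), " t:", round(t, 3), " df:", df)
print("95% interval for slope:", tuple(round(v, 3) for v in interval))
```

The p-value for the t statistic would still be obtained from a calculator or table using df = n - 2.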

AP Exam Unit 5/7 Inference Review

You may work the PANIC and PHANTOMS all the way, if you would like. I would just go through naming the parameter, checking assumptions and naming the test.

1. A random sample of 415 potential voters was interviewed 3 weeks before the start of a state-wide campaign for governor; 223 of 415 said they favored the new candidate over the incumbent. However, the new candidate made several unfortunate remarks one week before the election. Subsequently, a new random sample of 630 potential voters showed that 317 voters favored the new candidate. Do these data support the conclusion that there was a decrease in voter support for the new candidate after the unfortunate remarks were made? Give appropriate statistical evidence to support your answer.


2. A large university provides housing for 10 percent of its graduate students to live on campus. The university's housing office thinks that the percentage of graduate students looking for housing on campus may be more than 10 percent. The housing office decides to survey a random sample of graduate students, and 62 of the 481 respondents say that they are looking for housing on campus.

a. On the basis of the survey data, would you recommend that the housing office consider increasing the amount of housing on campus available to graduate students? Give appropriate evidence to support your recommendation.

b. In addition to the 481 graduate students who responded to the survey, there were 19 who did not respond. If these 19 had responded, is it possible that your recommendation would have changed? Explain.

3. The Colorado Rocky Mountain Rescue Service wishes to study the behavior of lost hikers. If more were known about the direction in which lost hikers tend to walk, then more effective search strategies could be devised. Two hundred hikers selected at random from those applying for hiking permits are asked whether they would head uphill, downhill, or remain in the same place if they became lost while hiking. Each hiker in the sample was also classified according to whether he or she was an experienced or novice hiker. The resulting data are summarized in the following table.

                         Direction
              Uphill   Downhill   Remain in Same Place
Novice          20        50               50
Experienced     10        30               40

Do these data provide convincing evidence of an association between the level of hiking expertise and the direction the hiker would head if lost? Give appropriate statistical evidence to support your conclusion.

4. Baby walkers are seats hanging from frames that allow babies to sit upright with their legs dangling and feet touching the floor. Walkers have wheels on their legs that allow the infant to propel the walker around the house long before he or she can walk or even crawl. Typically, babies use walkers between the ages of 4 months and 11 months. Because most walkers have tray tables in front that block babies' views of their feet, child psychologists have begun to question whether walkers affect infants' cognitive development. One study compared mental skills of a random sample of those who used walkers with a random sample of those who never used walkers. Mental skill scores averaged 113 for 54 babies who used walkers (standard deviation of 12) and 123 for 55 babies who did not use walkers (standard deviation of 15).

a. Is there evidence that the mean mental skill score of babies who use walkers is different from the mean mental skill score of babies who do not use walkers? Explain your answer.

b. Suppose that a study using this design found a statistically significant result. Would it be reasonable to conclude that using a walker causes a change in mean mental skill score? Explain your answer.

5. A growing number of employers are trying to hold down the costs that they pay for medical insurance for their employees. As part of this effort, many medical insurance companies are now requiring clients to use generic brand medicines when filling prescriptions. An independent consumer advocacy group wanted to determine if there was a difference, in milligrams, in the amount of active ingredient between a certain name brand drug and its generic counterpart. Pharmacies may store drugs under different conditions. Therefore, the consumer group randomly selected ten different pharmacies in a large city and filled two prescriptions at each of these pharmacies, one for the name brand and the other for the generic brand of the drug. The consumer group's laboratory then tested a randomly selected pill from each prescription to determine the amount of active ingredient in the pill. The results are given in the following table.

ACTIVE INGREDIENT (in milligrams)
Pharmacy          1    2    3    4    5    6    7    8    9   10
Name brand      245  244  240  250  243  246  246  246  247  250
Generic brand   246  240  235  237  243  239  241  238  238  234

Based on these results, what should the consumer group's laboratory report about the difference in the active ingredient in the two brands of pills? Give appropriate statistical evidence to support your response.

6. A study was conducted to determine where moose are found in a region containing a large burned area. A map of the study area was partitioned into the following four habitat types.
   1. Inside burned area, not near edge of the burned area
   2. Inside burned area, near edge of the burned area
   3. Outside burned area, near edge of the burned area
   4. Outside burned area, not near edge of the burned area

[Figure: map of the study area showing these four habitat types. Note: Figure not drawn to scale.]

The proportion of total acreage in each of the habitat types was determined for the study area. Using an aerial survey, moose locations were observed and classified into one of the four habitat types. Results are given in the table below.

Habitat type   Proportion of total acreage   Number of moose observed
1                        0.340                          25
2                        0.101                          22
3                        0.104                          30
4                        0.455                          40
Total                    1.000                         117

a. The researchers who are conducting the study expect the number of moose observed in a habitat type to be proportional to the amount of acreage of that type of habitat. Are the data consistent with this expectation? Conduct an appropriate statistical test to support your conclusion. Assume the conditions for inference are met.

7. The statistics department at a large university is trying to determine if it is possible to predict whether an applicant will successfully complete the Ph.D. program or will leave before completing the program. The department is considering whether GPA (grade point average) in undergraduate statistics and mathematics courses (a measure of performance) and mean number of credit hours per semester (a measure of workload) would be helpful measures. To gather data, a random sample of 20 entering students from the past 5 years is taken. The data are given below.

Successfully Completed Ph.D. Program
Student          A     B     C     D     E     F     G     H     I     J     K     L     M
GPA            3.8   3.5   4.0   3.9   2.9   3.5   3.5   4.0   3.9   3.0   3.4   3.7   3.6
Credit hours  12.7  13.1  12.5  13.0  15.0  14.7  14.5  12.0  13.1  15.3  14.6  12.5  14.0

Did Not Complete Ph.D. Program
Student          N     O     P     Q     R     S     T
GPA            3.6   2.9   3.1   3.5   3.9   3.6   3.3
Credit hours  11.1  14.5  14.0  10.9  11.5  12.1  12.0

The regression output below resulted from fitting a line to the data in each group. The residual plot (not shown) indicated no unusual patterns, and assumptions necessary for inference were judged to be reasonable.

Successfully Completed Ph.D. Program
Predictor     Coef      StDev      T       P
Constant     23.514     1.684    13.95   0.000
GPA          -2.7555    0.4668   -5.90   0.000
S = 0.5658    R-Sq = 76.0%

Did Not Complete Ph.D. Program
Predictor     Coef      StDev      T       P
Constant     24.200     3.474     6.97   0.001
GPA          -3.485     1.013    -3.44   0.018
S = 0.8408    R-Sq = 70.3%

b. For students who successfully completed the Ph.D. program, is there a significant relationship between GPA and mean number of credit hours per semester? Give a statistical justification to support your response.

8. A survey given to a random sample of students at a university included a question about which of two well-known comedy shows, S or F, students preferred. The students were asked the question, "Do you prefer S or F?" The responses are shown below.

Preference     S     F    Total
             185   139     324

a. Based on the results of this survey, construct and interpret a 95% confidence interval for the proportion of students in the population who would respond S to the question, "Do you prefer S or F?"

b. What is the meaning of "95% confidence" in part (a)?

c. In a follow-up survey, a separate group of randomly selected students was asked "Do you prefer F or S?" The responses are shown below.

Preference     S     F    Total
              68    88     156

Based on these two surveys, is there evidence that the stated preference depends on the order in which the comedy shows were listed in the survey question? Justify your answer.

9. In September 1990, each student in a random sample of 200 biology majors at a large university was asked how many lab classes he or she was enrolled in. The sample results are shown below.

Number of Lab Classes    0    1    2    3    4    5   Total
Number of Students      28   62   58   28   16    8    200
x̄ = 1.83    S = 1.29

To determine whether the distribution has changed over the past 10 years, a similar survey was conducted in Sept. 2000 by selecting a random sample of 200 biology majors. Results from the year 2000 sample are below.

Number of Lab Classes    0    1    2    3    4    5   Total
Number of Students      20   72   60   10   26   12    200
x̄ = 1.93    S = 1.37

Do the data provide evidence that the distribution of the number of lab classes taken by biology majors was different in 2000 than in 1990? Perform an appropriate statistical test using α = 0.10.

10. A pharmaceutical company has developed a new drug to reduce cholesterol. A regulatory agency will recommend the new drug for use if there is convincing evidence that the mean reduction in cholesterol level after one month of use is more than 20 mg/dl, because a mean reduction of this magnitude would be greater than the mean reduction for the current most widely used drug. The pharmaceutical company collected data by giving the new drug to a random sample of 50 people from the population of people with high cholesterol. The reduction in cholesterol level after one month of use was recorded for each individual in the sample, resulting in a sample mean reduction and standard deviation of 24 mg/dl and 15 mg/dl, respectively.

a. The regulatory agency decides to use an interval estimate for the population mean reduction in cholesterol level for the new drug. Provide this 95% confidence interval and interpret it.

b. Because the 95% confidence interval includes 20, the regulatory agency is not convinced that the new drug is better than the current best-seller. The pharmaceutical company tested the following hypotheses.

   H0: μ = 20
   Ha: μ > 20
   where μ = population mean reduction in cholesterol level for the new drug.

The test procedure resulted in a t-value of 1.89 and a p-value of 0.033. Because the p-value was less than 0.05, the company believes that there is convincing evidence that the mean reduction in cholesterol level for the new drug is more than 20. Explain why the confidence interval and hypothesis test led to different conclusions.
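The figures quoted in problem 10 can be checked with a short calculation. The Python sketch below is only a rough check under stated assumptions: it uses the summary statistics from the problem (n = 50, mean 24, standard deviation 15), and the critical value `t_star` = 2.010 is an assumed table value for 95% confidence with df = 49. The interpretation asked for in parts (a) and (b) is left to the reader.

```python
import math

# Summary statistics given in the problem
n = 50
x_bar = 24.0   # sample mean reduction (mg/dl)
s = 15.0       # sample standard deviation (mg/dl)
mu_0 = 20.0    # hypothesized mean reduction

# Standard error of the sample mean
se = s / math.sqrt(n)

# One-sample t statistic with df = n - 1
t_stat = (x_bar - mu_0) / se
df = n - 1

# Two-sided 95% t-interval; t_star is an assumed table value for df = 49
t_star = 2.010
lower = x_bar - t_star * se
upper = x_bar + t_star * se

print("standard error:", round(se, 4))
print("t statistic:", round(t_stat, 2), "with df =", df)
print("95% interval:", (round(lower, 2), round(upper, 2)))
```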

Simple Things To Do To Improve Your AP Exam Scores


1. Read the problem carefully, and make sure that you understand the question that is asked. Then answer the question(s)!!!
   Suggestion: Circle or highlight key words and phrases. That will help you focus on exactly what the question is asking.
   Suggestion: When you finish writing your answer, re-read the question to make sure you haven't forgotten something important.

2. Write your answers completely but concisely. Don't feel like you need to fill up the white space provided for your answer. Nail it and move on.
   Suggestion: Long, rambling paragraphs suggest that the test taker is using a shotgun approach to cover up a gap in knowledge.

3. Don't provide parallel solutions. If multiple solutions are provided, the worst or most outrageous solution will be the one that is graded.
   Suggestion: If you see two paths, pick the one that you think is most likely to be correct, and discard the other one.

4. Beware of careless use of language. Even if your calculations are correct, weak communication can cost you points.
   Suggestion: Distinguish between sample and population; data and model; lurking and confounding variables; r and r², etc. Know what technical terms mean, and use these terms correctly.

5. A computation or calculator routine will rarely provide a complete response. Be able to write simple English and/or mathematical sentences that convey understanding.
   Suggestion: Practice writing narratives for past homework problems, and have them critiqued by your teacher or a fellow student.

6. Know the steps for performing inference:
   a. Hypotheses
   b. Assumptions/conditions
   c. Identify test (confidence interval) and calculate correctly
   d. Conclusions in context
   Suggestion: Learn the different forms for hypotheses, memorize conditions/assumptions for various inference procedures, and practice solving inference problems.

7. Understand strengths and weaknesses of different experimental designs.
   Suggestion: Study examples of completely randomized, paired, matched pairs and blocked designs.

8. Remember that a simulation can always be used to answer a probability question, as long as it is correct and you explain it adequately.
   Suggestion: Practice setting up and running simulations on your TI.

9. Be able to interpret generic computer output.
   Suggestion: Practice reconstructing the LSRL equation from a regression analysis printout. Identify and interpret the other numbers.

BEST WISHES AND GOOD LUCK TO ALL OF YOU ON THE EXAM
MRS. GARRISON
