
An association exists between two variables if a particular value for one variable is more likely to occur with certain values of another variable.

Response variable: the outcome variable on which comparisons are made (ex. income level). Explanatory variable: defines the groups to be compared with respect to values on the response variable.

Positive correlation: both variables go up together. Negative correlation: one goes up as the other goes down. Positive correlation indicates positive association; negative correlation indicates negative association. Correlation (r) always falls between -1 and 1 and denotes the strength of the linear association of x and y. The closer r is to 1 in ABSOLUTE VALUE, the stronger the STRAIGHT-LINE association; the closer to 0 in absolute value, the weaker the correlation. Ex. -.879 is still a stronger correlation than .750. Two variables have the same correlation no matter which is treated as the response variable.

r^2 is the coefficient of determination: the proportion of the variation in y that is explained by the straight-line relationship with x (it is not the % of data points closest to the line of best fit). Ex. What proportional reduction in error do we get by using the regression line to make predictions instead of simply using the mean of y? Answer: use r^2.

Correlation does not imply causation; this can also be expressed as association does not imply causation.

Simpson's Paradox: the direction of an association between two variables can change after we include a 3rd variable and analyze the data at separate levels of that variable. It is possible for the association to reverse after adjusting for a 3rd variable.

Regression analysis: often used for observations of a quantitative response variable over time. A regression equation is often called a prediction equation. The regression line predicts the response variable y as a straight-line function of the value x of the explanatory variable. *Construct a scatterplot before finding a correlation or regression line. Regression equation: y^ = a + bx; find with LinRegTTest. The correlation and regression line are NONRESISTANT: they are prone to distortion by outliers. Prediction errors are called RESIDUALS (residual = y - y^).
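The correlation, regression line, residuals, and r^2 ideas above can be checked by hand on a tiny data set. The following is a minimal sketch in Python, not the calculator's LinRegTTest routine; the data values and the function name `correlation_and_line` are made up for illustration.

```python
import math

def correlation_and_line(xs, ys):
    """Return (r, a, b) for the least-squares line y^ = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)   # correlation, always between -1 and 1
    b = sxy / sxx                    # slope
    a = mean_y - b * mean_x          # y-intercept
    return r, a, b

# Made-up illustration data (an assumption, not from the notes)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r, a, b = correlation_and_line(xs, ys)

# Residual = observed y minus predicted y^
predictions = [a + b * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predictions)]

# Proportional reduction in error: predicting with the line vs. with the mean of y
mean_y = sum(ys) / len(ys)
error_using_mean = sum((y - mean_y) ** 2 for y in ys)
error_using_line = sum(e ** 2 for e in residuals)
reduction = (error_using_mean - error_using_line) / error_using_mean

print(f"r = {r:.4f}, r^2 = {r * r:.4f}, reduction in error = {reduction:.4f}")
print(f"y^ = {a:.2f} + {b:.2f}x")
```

The proportional reduction in error matches r^2, and calling `correlation_and_line(ys, xs)` gives the same r, which illustrates that the correlation does not depend on which variable is treated as the response.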
Extrapolation: using a regression line to predict y values for x values outside the observed range of data. Predictions about the future using time series data are called forecasts. Regression outliers are well removed from the trend that the rest of the data follows. When an observation has a large effect on the results of a regression analysis it is influential; for it to be influential, its x value has to be low or high compared to the rest of the data AND it has to be a regression OUTLIER.

Sampling Frame: the list of all subjects in the population from which the sample is taken. Sampling Design: the next step after the sampling frame, the method used to select the sample (a sound sampling design can prevent sampling bias but cannot prevent response or nonresponse bias).

Lurking variable: a variable, usually unobserved (not measured in the study), that influences the association between the variables of primary interest. A lurking variable may be a common cause of both the response and explanatory variable. Lurking variables have the potential to be confounding. Confounding: when two explanatory variables are both associated with a response variable but are also associated with each other. For a LURKING variable to be CONFOUNDING it must be included in the study and associated with BOTH the response and the explanatory variable. Randomization: randomly assigning experimental units (subjects) to the treatments helps to balance out lurking variables. Multiple causes (more common): when a response has several causes, it becomes difficult to study the effect of any single variable.

ANECDOTAL EVIDENCE: comes from personal observations. Not representative of the entire population. Systematic sample: take every kth subject.

Completely Randomized Design: subjects are randomly assigned to one of the treatments. Matched pair: the two observations for a particular subject, matched because they both come from the same person. Crossover Design (Really Good Design): a matched pairs design in which subjects cross over during the experiment from using one treatment to using another treatment.
Blocking: in experiments with matching, a block is a set of matched experimental units (matching of subjects is a type of blocking). Randomized block design: a block design with random assignment of treatments to units within blocks (to reduce possible bias, treatments are usually randomly assigned within a block).

Design of Studies: the best way to collect data. REDUCE NOISE, INCREASE SIGNAL.

Observational Study: merely observes rather than experiments with study subjects. Some researchers use this term to refer only to studies that use available subjects (ex. a convenience sample) and not to sample surveys that randomly select people.

Experimental Study: assigns to each subject a treatment; subjects in an experimental study are often referred to as EXPERIMENTAL UNITS. EU = the thing to which the treatment is applied; a subject does NOT have to be a person. Researchers impose a treatment or condition (such as exposure or non-exposure to cell phone radiation). CAN CONTROL for lurking variables, gives the strongest INFERENCE, CAN ESTABLISH cause and effect. Replication involves more than one EU per condition (treatment). To reduce bias, experiments should be double blind, with neither the subject nor the data collector knowing which treatment a subject was assigned.

A simple random sample is often called a random sample. RANDOM SAMPLING IS THE BEST: all subjects in the frame have an equal chance of being selected. Simple Random Sampling: you are much more likely to get a representative sample if you let chance rather than convenience determine the sample. Random Sample Design: implemented by using random numbers to select n subjects from the sampling frame.

Methods of collecting sample surveys: personal interviews, telephone interviews, and self-administered questionnaires.

Undercoverage: having a sampling frame that lacks representation from parts of the population. Sampling Bias: results from the sampling method (ex. non-random sampling or undercoverage). Nonresponse Bias: when some sampled subjects cannot be reached or refuse to participate.
Response Bias: occurs when subjects give an incorrect response (ex. lying), or the question wording or the way the interviewer asks the questions is confusing or misleading.

Convenience Sample: not random; an easy and cheap way to obtain data. Volunteer Sample: the MOST COMMON type of CONVENIENCE SAMPLE (not ideal), however sometimes necessary in both observational studies and experiments.

Key Parts of a Sample Survey (the most common use of a sample survey is to estimate population percentages):
1) Identify the population of all subjects of interest.
2) Construct a sampling frame (attempts to list all the subjects in the population).
3) Use a random sampling design, implemented using random #s.
4) Be cautious of sampling bias due to non-random samples and undercoverage, and of response bias and nonresponse bias.

Stratified Random Sample: divides the population into separate groups, called STRATA, and then selects a simple random sample from each STRATUM. Cluster Random Sample: takes a simple random sample of clusters (such as city blocks) and uses the subjects in those clusters as the sample. Clusters are most often defined by location.

Factor: a categorical explanatory variable in an experiment; the categories are the treatments. Experimental studies are preferable to non-experimental studies but are not always possible. Multi-Factor Experiment (Factorial Design): has at least two explanatory variables; allows you to test combinations of treatments.

Case-Control Study (Case-Control Design): an example of a retrospective study. Subjects who have a response outcome of interest (ex. cancer) serve as cases; other subjects not having that outcome serve as controls. The cases and controls are compared on an explanatory variable, like whether they were smokers.
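The sampling designs named in these notes (simple random, systematic, stratified, cluster) differ only in how random numbers are applied to the sampling frame. The following is a rough sketch using Python's standard `random` module; the frame of 100 subjects, the strata "A"/"B", the cluster numbering, and all function names are hypothetical, chosen only for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: 100 subjects, each with a stratum and a cluster
frame = [
    {"id": i, "stratum": "A" if i < 60 else "B", "cluster": i // 10}
    for i in range(100)
]

def simple_random_sample(frame, n):
    """Let chance pick n subjects from the whole frame."""
    return rng.sample(frame, n)

def systematic_sample(frame, k):
    """Pick a random start, then take every kth subject."""
    start = rng.randrange(k)
    return frame[start::k]

def stratified_random_sample(frame, n_per_stratum):
    """Take a simple random sample from EACH stratum."""
    strata = {}
    for subject in frame:
        strata.setdefault(subject["stratum"], []).append(subject)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_random_sample(frame, n_clusters):
    """Randomly pick whole clusters and keep every subject in them."""
    cluster_ids = sorted({subject["cluster"] for subject in frame})
    chosen = set(rng.sample(cluster_ids, n_clusters))
    return [subject for subject in frame if subject["cluster"] in chosen]

print(len(simple_random_sample(frame, 10)))
print(len(systematic_sample(frame, 10)))
print(len(stratified_random_sample(frame, 5)))
print(len(cluster_random_sample(frame, 2)))
```

Note the contrast: the stratified sample contains some subjects from every stratum, while the cluster sample contains all subjects from only a few clusters.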

Census: a complete enumeration of an entire population. Also a survey that attempts to count the # of people in the population and to measure certain characteristics. Prospective Studies: follow subjects into the future, tracking exposure and disease status over time. Retrospective Studies: look at the subjects' past.

Cohort Study Design: at the beginning none of the subjects have the disease. Cross-Sectional: looks at subjects at one point in time. Influential Observation: can strongly affect the correlation and regression equation.

Contingency Table: used for two categorical variables. Scatterplot: used for two quantitative variables; displays the relationship and shows a positive or a negative correlation.

A RANDOM VARIABLE is a numerical measurement of the outcome of a random phenomenon. X refers to the variable itself; x refers to a particular value of the random variable. (Ex. X = number of heads in 3 flips defines the random variable; x = 2 represents a possible value for the random variable.)

Probability Distribution: the probability distribution of a random variable specifies its possible values and their probabilities. For it to be valid, the sum of all probabilities must be 1, and each probability must fall between 0 and 1. It is the randomness of the variable that allows us to specify probabilities for the outcomes.

Parameters: numerical summaries of a probability distribution or population, most denoted by Greek letters (ex. mean and SD; a population mean or a population proportion).

The mean of a probability distribution for a discrete random variable can be interpreted as the expected value of that variable. It is the value that can be expected as the average in a long run of observations. (It is not unusual for the expected value of a random variable to equal a number that is not a possible outcome.) The mean is a weighted average: each value x is weighted by its probability, which matters when the values of x are not equally likely. The mean for a continuous distribution is the value of X where the graph would be in balance.

SD: describes how far the random variable falls, on the average, from the mean of its distribution. The larger the SD, the greater the spread.

If all possible values are equally likely, then the probability distribution is constant and its graph is a flat straight line.
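The validity check, the expected value, and the SD of a discrete probability distribution can be worked out directly for the "number of heads in 3 flips" example. This is a small sketch in Python; the function names are made up for illustration, and a fair coin is assumed.

```python
import math

def is_valid_distribution(dist, tol=1e-9):
    """Each probability is between 0 and 1, and they all sum to 1."""
    probs = dist.values()
    return all(0 <= p <= 1 for p in probs) and abs(sum(probs) - 1) < tol

def mean_of(dist):
    """Expected value: each value x weighted by its probability."""
    return sum(x * p for x, p in dist.items())

def sd_of(dist):
    """Standard deviation: typical distance of X from its mean."""
    mu = mean_of(dist)
    return math.sqrt(sum((x - mu) ** 2 * p for x, p in dist.items()))

# X = number of heads in 3 flips of a fair coin (assumed fair)
heads_in_3_flips = {0: 1 / 8, 1: 3 / 8, 2: 3 / 8, 3: 1 / 8}

print(is_valid_distribution(heads_in_3_flips))
print(mean_of(heads_in_3_flips))
print(round(sd_of(heads_in_3_flips), 4))
```

The mean comes out to 1.5 heads, which is not a possible outcome of any single set of 3 flips; it is the long-run average.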
A Discrete Random Variable X has separate values such as 0, 1, 2, 3. The sum of x * p(x) over all possible values x = the mean of a probability distribution for a discrete random variable.

Continuous Random Variable: can take any value in an interval, for example time, age, and size measurements like height and weight. The interval containing all possible values has a probability equal to 1. In practice, continuous variables are measured in discrete values because of rounding.

Normal Distribution (most important): continuous, symmetric around the mean, bell-shaped, and characterized by its mean and standard deviation. The probability = 0.68 within 1 SD, 0.95 within 2 SDs, and 0.997 within 3 SDs of the mean. The mean and standard deviation completely describe the density curve. A STANDARD NORMAL DISTRIBUTION has a mean of ZERO and a standard deviation of ONE.

Z-score: for a value x of a random variable, the number of SDs that x falls from the mean. z = (x - mean) / SD. A negative (positive) z-score indicates that the value is below (above) the mean.

Probabilities for NORMAL CURVES (normal random variables) are found using normalcdf(lower bound, upper bound, mean, standard deviation). The invNorm function is used to find the value that corresponds to a certain probability: invNorm(area under the curve to the left, mean, SD). With mean 0 and SD 1 it returns a z-score.

Finding probabilities for OTHER normally distributed random variables:
1. State the problem in terms of the observed random variable, ex. P(X < x).
2. Draw a picture to show the desired probability under the given normal curve.
3. Find the area under the normal curve using normalcdf.

Conditions for a BINOMIAL DISTRIBUTION:
0) We are counting the # of successes in a fixed # of trials.
1) Each trial has exactly two possible outcomes.
2) Each trial has the same probability of success.
3) The trials are independent.
Conditions 2 and 3 often fail together (ex. drawing without replacement), but they are separate conditions.
Ex. with n = 17 and p = 0.6: n * p = 17 * 0.6 = 10.2 expected successes; n * (1 - p) = 17 * 0.4 = 6.8 expected failures.
ALWAYS CHECK TO SEE IF BINOMIAL CONDITIONS APPLY: 1) binary data, 2) the same probability of success for each trial, 3) independent trials.
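For readers without the calculator at hand, the normal-curve calculations above (z-score, normalcdf, invNorm) can be approximated with Python's standard library. This is only a sketch: the wrapper names `normalcdf` and `inv_norm` imitate the calculator functions and are not part of any library, and the mean of 100 and SD of 15 in the example are made-up values.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of SDs that x falls from the mean."""
    return (x - mean) / sd

def normalcdf(lower, upper, mean, sd):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(area_to_left, mean, sd):
    """Value of x with the given area under the curve to its left."""
    return NormalDist(mean, sd).inv_cdf(area_to_left)

# Check the 0.68 / 0.95 / 0.997 probabilities on the standard normal
for k in (1, 2, 3):
    print(k, round(normalcdf(-k, k, 0, 1), 3))

# Made-up example: X is normal with mean 100 and SD 15
print(z_score(130, 100, 15))                      # 2 SDs above the mean
print(round(normalcdf(85, 115, 100, 15), 3))      # P(85 < X < 115)
print(round(inv_norm(0.975, 0, 1), 2))            # z with 97.5% of the area below it
```

For a one-sided probability such as P(X < x), pass `float("-inf")` as the lower bound.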
Ex. of checking binomial conditions: deal 10 cards from a shuffled deck and count the # of red cards.
1. Two categories? Yes, red card = success and black card = failure.
2. Fixed # n? Yes, n = 10.
3. Independent observations? No, cards are not replaced, so they are not independent.
4. Probability is the same? No, cards are not replaced, so p will change as each new card is drawn.

P(X = x) = binompdf(# of trials, probability, # of successes looking for): finds the probability of exactly x successes out of n trials.
P(X <= x) = binomcdf(# of trials, probability, # of successes looking for): cumulative distribution function (adds up all the probabilities of successes up to and including a certain number).
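The binompdf and binomcdf calculations can be reproduced from the binomial formula. The sketch below uses Python's `math.comb`; the function names copy the calculator's for readability and are not a library API. It assumes the binomial conditions hold, so it would not apply to the card-dealing example, where cards are not replaced.

```python
from math import comb

def binompdf(n, p, x):
    """P(X = x): probability of exactly x successes in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomcdf(n, p, x):
    """P(X <= x): adds up the probabilities of 0, 1, ..., x successes."""
    return sum(binompdf(n, p, k) for k in range(x + 1))

# Values n = 17 and p = 0.6 from the expected-count example in these notes
n, p = 17, 0.6
expected_successes = sum(x * binompdf(n, p, x) for x in range(n + 1))

print(round(expected_successes, 4))   # agrees with n * p
print(round(binompdf(n, p, 10), 4))   # P(X = 10)
print(round(binomcdf(n, p, 10), 4))   # P(X <= 10)
```

To get P(X >= x), use 1 - binomcdf(n, p, x - 1), since the cumulative function includes x itself.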