
STATS 250

Section 1.1
statistics: numbers measured for some purpose; collection of procedures and principles for gathering and analyzing data to help people make decisions when faced w/ uncertainty

Section 1.2
dotplot: plot in which each dot represents response of one individual
five number summary: lowest/highest values; cutoff points for 25%, 50%, and 75% of data
median: value in middle when numbers are put in order
lower quartile: median of lower half of data
upper quartile: median of upper half of data
risk: estimate of chance that bad outcome will occur in future, based on past
base rate (baseline risk): rate or risk at beginning time period or under specific conditions
population: collection of all individuals about which information is desired
random sample: subset derived from population so that each individual has specified probability of being part of sample
sample survey: investigative gathering of opinion or information from each sample individual
margin of error: number added to/subtracted from sample information to produce interval that is 95% certain to cover population value

self selected sample (volunteer sample) => ask anyone who wishes to respond to do so
  ex. magazines, television stations, Internet websites, etc.
observational study: study in which participants are observed and measured
variable: characteristic that differs from one individual to next
confounding variable: variable that is not main concern of study, but may be partially responsible for observed results
randomized experiment: study in which treatments are randomly assigned to participants
treatment: specific regimen or procedure assigned to participants by experimenter
random assignment => each participant has specified probability of being assigned to each treatment
placebo: pill or treatment designed to look like active treatment, but w/ no active ingredients
statistically significant => relationship/difference large enough to be unlikely to have occurred in sample w/o relationship/difference in population

Section 1.3
procedure for discovering knowledge
1) ask right question(s)
2) collect useful data
3) summarize and analyze data
4) make decisions and generalizations
5) turn data and decisions into new knowledge

Section 2.1

raw data: collected numbers and category labels that have not yet been processed in any way
variable: characteristic that can differ from one individual to next
observational unit (observation): individual who participates in study
sample size: total number of observational units
dataset: complete set of raw data in survey or experiment
statistic: summary measure for sample
parameter: summary measure for population

Section 2.2
types of variables
categorical => group or category names, not necessarily w/ any logical ordering
ordinal => ordered categories
quantitative => numerical values, either measurements or counts from individuals
  also known as measurement or numerical
  split into discrete and continuous

Section 2.3
frequency: count of how many observations fall into category
frequency distribution: listing of all categories w/ frequencies
relative frequency: count in category relative to total count over all categories
relative frequency distribution: listing of all categories w/ relative frequencies
outcome variable => response variable
categorical visual summaries
  pie charts => single categorical variable w/o many categories
  bar graphs => one/two categorical variables, good for comparisons

Section 2.4
distribution: overall pattern of how often possible values occur
summary features
location (center, average) => estimates of location
  median: middle value in data

  mean: usual arithmetic average
spread (variability)
shape
outlier: data point not consistent w/ bulk of data
quantitative pictures
  histogram => similar to bar graph, but used for any number of data values, though not particularly informative w/ small sample sizes
  stem and leaf plots & dotplots => present individual values, so not very good for large datasets
  boxplot (box and whisker plot) => displays information given in five number summary
shape
  symmetric: property of being similar on both sides of center
    ex. bell shaped, uniform
  skewed: property of values being more spread out on one side of center than other
  skewed to right => higher values more spread out than lower values
  skewed to left => lower values more spread out than higher values
  mode: most frequent value
  unimodal => single prominent peak
  bimodal => two prominent peaks

Section 2.5

location measurements
  mean: usual numerical average
  median: middle data value for odd number of observations OR average of two middle data values for even number of observations
spread measurements
  range = high value - low value
  interquartile range = upper quartile - lower quartile
  standard deviation
lower quartile (Q1): median of data values located below median
upper quartile (Q3): median of data values located above median
percentiles => kth percentile is value w/ k% of data values at or below it and (100 - k)% of data values at or above it

Section 2.6
outlier: possible explanations
  legitimate data value that represents natural variability for group and variable(s) measured
  mistake in measurement or data entry => should be discarded or corrected
  individual may belong to different group than bulk of individuals
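The five number summary and interquartile range described above can be sketched in a few lines. This is a minimal illustration, assuming the "median of each half" rule for quartiles given in these notes (software packages often use slightly different rules); the function names and the data are made up for the example.

```python
from statistics import median

def five_number_summary(values):
    """Return (low, Q1, median, Q3, high) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    lower_half = data[: n // 2]         # values located below the median
    upper_half = data[(n + 1) // 2 :]   # values located above the median
    return (data[0], median(lower_half), median(data), median(upper_half), data[-1])

def interquartile_range(values):
    """IQR = upper quartile - lower quartile."""
    _, q1, _, q3, _ = five_number_summary(values)
    return q3 - q1

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]    # hypothetical data
print(five_number_summary(scores))
print(interquartile_range(scores))
```

The range is simply the last entry of the summary minus the first.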

Section 2.7
bell shaped curve: curve w/ numerical values that follow pattern of smooth curve connecting tops of bars on histogram, like symmetric bell
standard deviation: measure of spread of values
variance: squared value of standard deviation
Empirical Rule
  68% of values fall within 1 standard deviation of mean
  95% of values fall within 2 standard deviations of mean
  99.7% of values fall within 3 standard deviations of mean
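The Empirical Rule percentages are rounded areas under a normal curve. A short check, assuming an exactly normal distribution (the helper name `share_within` is made up for this sketch):

```python
from statistics import NormalDist

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    z = NormalDist()                 # standard normal: mean 0, s.d. 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

Real bell shaped data will only approximately match these proportions.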

Section 5.1
scatterplot: two dimensional graph of measurements for two numerical variables
  explanatory (x) variable may explain or cause differences in response (dependent or y) variable
types of associations
  positive => values of one variable tend to increase as values of other variable increase
  negative => values of one variable tend to decrease as values of other variable increase
  linear relationship => relationship pattern resembles straight line
  nonlinear/curvilinear => scatterplot pattern resembles curve
outlier: point that has unusual combination of data values

Section 5.2
regression analysis: statistics used to examine relationship b/w quantitative response variable and one or more explanatory variables
regression equation: key element of regression analysis; description of how, on average, response and explanatory variables are related
  used to predict response values given explanatory values

regression line: straight line that describes how values of quantitative response variable are related, on average, to values of quantitative explanatory variable
simple linear regression => methods used to analyze straight line relationships
deterministic relationship => if value of one variable is known, value of other variable can be exactly determined
statistical relationship => has variation from average pattern
prediction error: difference b/w observed and predicted values
  also known as residual
least squares: mathematical criterion that is nearly always basis for estimating equation of regression line
least squares line: line w/ property that sum of squared differences b/w observed and predicted y values is smaller for that line than for any other
sum of squared errors (SSE): sum of squared prediction errors

Section 5.3
correlation: number indicating strength and direction of straight line relationship
determinants
  strength => closeness of points to straight line

  direction => whether one variable increases/decreases when other variable increases
represented by "r"
  also called Pearson product moment correlation or correlation coefficient
squared correlation (r^2) => describes strength of linear relationship
  "proportion of variation explained by x"

Section 5.4
extrapolation: use of regression equation to predict values outside range of observed data
influential observations: outliers w/ extreme x values that have influence on correlation and regression

Section 5.5
interpretations of observed association
1) causation => explanatory variable causes change in response variable
2) causation w/ confounding factors
3) no causation
4) response variable causes change in explanatory variable
Rule for Concluding Cause and Effect => cause and effect relationships can be inferred from randomized experiments, but not from observational studies
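The least squares line, SSE, and correlation from the regression and correlation sections can be computed directly from their definitions. This is a sketch under the usual textbook formulas (slope = Sxy/Sxx, r = Sxy/sqrt(Sxx*Syy)), which these notes do not spell out; the function names and data are hypothetical.

```python
from math import sqrt

def least_squares(x, y):
    """Return (intercept, slope, r) for the least squares line y-hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / sqrt(sxx * syy)        # correlation: strength and direction
    return intercept, slope, r

def sum_squared_errors(x, y, intercept, slope):
    """SSE: sum of squared differences b/w observed and predicted y values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 3, 4, 5]                  # hypothetical explanatory values
y = [2, 4, 5, 4, 5]                  # hypothetical response values
b0, b1, r = least_squares(x, y)
print(b0, b1, r, r ** 2, sum_squared_errors(x, y, b0, b1))
```

Predicting y for an x far outside 1 to 5 with this line would be extrapolation.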

Section 6.1

contingency table (two way table): table that covers all contingencies for combinations of two variables
cell: row and column combination
conditional percentages => help make judgments about table
  row percentages: percentages across row of contingency table
  column percentages: percentages down column of contingency table
relationship => exists if two or more rows/columns have different distributions of row/column percentages

Section 6.2
risk: chance that randomly selected individual within group will fall into undesirable category
relative risk: ratio of risks in two different categories of explanatory variable
baseline risk: risk usually placed in denominator of ratio
percent increase/decrease in risk => sometimes presented instead of multiple
odds: comparison of chance that event happens to chance that it doesn't happen
odds ratio: comparison of two different groups w/ regard to odds of certain behavior or event
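A small worked example may make risk, relative risk, percent increase in risk, and odds ratio easier to tell apart. The counts below are invented (30 of 100 in one group and 10 of 100 in a baseline group fall into the undesirable category), and the function names are just for this sketch.

```python
def risk(cases, total):
    """Proportion of group falling into the undesirable category."""
    return cases / total

def relative_risk(risk_group, risk_baseline):
    """Ratio of risks, w/ baseline risk in the denominator."""
    return risk_group / risk_baseline

def odds(cases, total):
    """Chance event happens compared to chance it doesn't happen."""
    return cases / (total - cases)

risk_a = risk(30, 100)                          # hypothetical group of interest
risk_b = risk(10, 100)                          # hypothetical baseline group
rr = relative_risk(risk_a, risk_b)
percent_increase = (rr - 1) * 100               # percent increase in risk over baseline
odds_ratio = odds(30, 100) / odds(10, 100)
print(rr, percent_increase, odds_ratio)
```

Here the risk is about 3 times the baseline (a 200% increase), while the odds ratio is closer to 3.9, so the two summaries should not be used interchangeably.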

Section 6.3
Simpson's Paradox: paradox that relationship appears to be in different direction when confounding variable is not considered than when data are separated by categories of confounding variable

Section 6.4
statistically significant: inference that observed difference reflects actual difference in population
statistically significant relationship => difference large enough to be unlikely in observed sample if no relationship in population
steps to determine statistical significance
1) determine null & alternative hypotheses
2) verify necessary data conditions, summarize into appropriate test statistic
3) assuming null hypothesis is true, find p value
4) decide whether result is statistically significant based on p value
5) report conclusion in context
null hypothesis => two variables are not related
alternative hypothesis => two variables are related
chi square statistic: value examining statistical significance of association b/w two categorical variables
  measures difference b/w observed and expected counts
practical significance => statistical significance does not always indicate meaningful relationship
nonsignificant results => sample results are not strong enough to safely conclude relationship in population
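The chi square statistic compares observed counts to the counts expected if the two variables were unrelated. A minimal sketch, assuming the standard formulas (expected count = row total x column total / grand total; statistic = sum of (observed - expected)^2 / expected), which these notes describe only in words; the table is hypothetical.

```python
def chi_square_statistic(table):
    """Chi square statistic for a two way table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # count expected in this cell if null hypothesis (no relationship) is true
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

observed_counts = [[30, 70],      # hypothetical group 1: yes / no
                   [10, 90]]      # hypothetical group 2: yes / no
print(chi_square_statistic(observed_counts))
```

A larger statistic means observed counts sit farther from what the null hypothesis predicts; the p value then comes from the chi square distribution.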

Section 7.1

random circumstance: circumstance in which outcome is unpredictable
  oftentimes, outcome not determined until observed
probability: number b/w 0 and 1 assigned to possible outcome of random circumstance
  provides information about how likely particular outcome is to be result of circumstance

Section 7.2
probability: proportion of times specific outcome would occur over long run
  also called relative frequency
  cannot accurately assess w/ only few trials
two ways of determining relative frequency probabilities
1) make assumption about physical world
2) observe relative frequency
personal probability: degree to which given individual believes event will happen
  must be coherent => doesn't contradict another personal probability

subjective probability: degree of belief that may be different for different individuals

Section 7.3
sample space: collection of all possible outcomes for random circumstance
simple event: specific possible outcome
event: collection of one or more simple events in sample space
complementary (complements) => two events that do not contain any of same simple events and together cover entire sample space
mutually exclusive (disjoint) => two events that do not contain any of same simple events
independent => two events in which knowing one will occur does not change other event's probability
dependent => two events in which knowing one will occur does change other event's probability
conditional probability: long run relative frequency w/ which event occurs when another event is known to occur

Section 7.4
sample drawn w/ replacement => individuals returned to eligible pool for each selection

sample drawn w/o replacement => individuals not returned to eligible pool for each selection

Section 7.5
tree diagram: schematic representation of sequence of events and probabilities (including conditional)

Section 7.6
simulate => repeat situation by using computer or calculator and observe relative frequency of event
  done when probabilities are difficult or time consuming to calculate

Section 7.7
confusion of inverse: mistaking probability of A given B for probability of B given A
  ex. treating probability of disease given positive test as equal to probability of positive test given disease
determinants of test results
1) base rate or probability of having disease
2) sensitivity: proportion of people w/ disease who correctly test positive
3) specificity: proportion of people w/o disease who correctly test negative
coincidence: surprising concurrence of events, perceived as meaningfully related, w/ no apparent causal connection
gambler's fallacy: misperception that long run frequency of event should apply in short run
law of small numbers: misconception that small samples are highly representative of populations from which they are drawn
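The confusion of the inverse is easiest to see with numbers. The sketch below combines base rate, sensitivity, and specificity (via Bayes' rule, which these notes do not name) to get the probability of disease given a positive test; the rates used are invented for illustration.

```python
def prob_disease_given_positive(base_rate, sensitivity, specificity):
    """P(disease | positive test) from base rate, sensitivity, and specificity."""
    true_positives = base_rate * sensitivity                # have disease and test positive
    false_positives = (1 - base_rate) * (1 - specificity)   # no disease but test positive
    return true_positives / (true_positives + false_positives)

# hypothetical test: 1% base rate, 90% sensitivity, 90% specificity
print(prob_disease_given_positive(0.01, 0.90, 0.90))
```

Even though the test is right 90% of the time for people w/ the disease, only about 8% of positive tests in this example belong to people who actually have it, because the base rate is so low.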

Section 8.1

random variable => assigns number to each outcome of random circumstance or to each unit in population
classified into two broad classes
1) discrete random variable => takes one of countable list of distinct values; can find probabilities for exact outcomes
2) continuous random variable => takes any value in interval or collection of intervals; cannot find probabilities for exact outcomes, limited to intervals of values
specific families of random variables: all random variables for which same formula would be used for probabilities within broad class

Section 8.2
probability distribution function (pdf): table or rule that assigns probabilities to possible values of random variable
cumulative probability: probability that value of random variable is less than or equal to specific value
cumulative distribution function (cdf): table or rule that assigns cumulative probabilities for any real number

Section 8.3
mean value of random variable: long run average
  also called expected value
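For a discrete random variable, the pdf table, the cdf table, and the expected value are closely linked. A small sketch, using a made-up pdf (number of heads in two fair coin flips) stored as a dictionary; the function names are hypothetical.

```python
def cumulative(pdf):
    """Turn {value: probability} into {value: P(X <= value)}."""
    running = 0.0
    cdf = {}
    for value in sorted(pdf):
        running += pdf[value]
        cdf[value] = running
    return cdf

def expected_value(pdf):
    """Mean of a discrete random variable: each value weighted by its probability."""
    return sum(value * p for value, p in pdf.items())

heads_in_two_flips = {0: 0.25, 1: 0.50, 2: 0.25}   # hypothetical pdf table
print(cumulative(heads_in_two_flips))
print(expected_value(heads_in_two_flips))
```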

standard deviation of discrete random variable => useful for quantifying how spread out possible values of discrete random variable might be, weighted by how likely each value is to occur

Section 8.4
binomial random variable: X = number of successes in "n" trials of binomial experiment
  "n" specified in advance => not random value
  two possible outcomes (success and failure)
  independent outcomes
  probabilities of success and failure remain same through trials
Bernoulli random variable: random variable in binomial experiment in which n = 1, so value of X is either 0 or 1

Section 8.5
probability density function: curve such that area under curve over interval equals probability that variable is in interval

Section 8.6
normal random variable: most commonly encountered type of continuous random variable, w/ specific form of bell shaped probability density curve
normal curve: bell shaped probability density curve

  also known as normal distribution
standardized score (z score): distance b/w specified value and mean, measured in number of standard deviations
standard normal random variable: normal random variable w/ mean = 0 and standard deviation = 1
  has standard normal distribution
percentile: value of variable
percentile rank: cumulative probability for percentile

Section 8.7
normal approximation to binomial distribution => if binomial random variable is based on large enough "n" trials w/ certain success probability, then it is approximately normal random variable
continuity correction: adding/subtracting .5 to make approximation more accurate, based on which rectangles are desired

Section 8.8
sum/difference => most commonly encountered linear combinations
statistically independent => probability for any event associated w/ one random variable is not altered by whether or not any event for other random variable has happened
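The normal approximation to the binomial, w/ its continuity correction, can be compared against the exact binomial probability. This sketch assumes the standard binomial mean n*p and standard deviation sqrt(n*p*(1 - p)), which are not written out in these notes; n = 100 and p = 0.5 are arbitrary choices.

```python
from math import comb, sqrt
from statistics import NormalDist

def z_score(value, mean, sd):
    """Distance b/w value and mean, measured in standard deviations."""
    return (value - mean) / sd

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for a binomial random variable."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def normal_approx_cdf(k, n, p):
    """Approximate P(X <= k), adding .5 so the whole rectangle at k is included."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    return NormalDist().cdf(z_score(k + 0.5, mean, sd))

n, p = 100, 0.5
print(binomial_cdf(45, n, p), normal_approx_cdf(45, n, p))
```

For P(X >= k) the correction goes the other way (subtract .5), which is what "based on which rectangles are desired" refers to.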

Section 9.1

parameter: summary number characteristic of population, random situation, or comparison of populations
  associated w/ population
  usually impossible to know
  can use statistical methods to make good guess
statistic: number computed from sample of values taken from larger population
  associated w/ sample survey, observational study, or experiment
  sample estimate => when used to estimate unknown value of population parameter
  variable => two different samples taken from same population will likely have different sample statistics
statistical inference procedures => used to make conclusions about population parameters
  confidence intervals: inference technique of creating interval of values that researcher is fairly sure will cover true, unknown value of population parameter
  hypothesis (significance) testing: inference technique using sample data to attempt to reject hypothesis about population
  null value: value that would indicate that nothing of interest is happening
  statistical significance => important to determining whether observed results are plausible

Section 9.2
population proportion: number b/w 0 and 1 representing proportion w/ trait; can be turned into percentage

difference in population proportions: comparison used for two populations, or groups formed by categorical variable, to compare feature
population mean: simple summary for population that is average of variable for everyone in population
paired differences: data formed by taking differences in matched pairs
population mean for paired differences: mean from taking differences for entire population of possible pairs
independent samples => individuals in one sample are not coupled in any way w/ individuals in other sample
difference in two population means => comparing means for same quantitative variable in two different populations or groups formed by categorical variable
sampling distribution: relationship b/w parameter of interest and corresponding sample statistic

Section 9.3
relationship b/w sample statistics and population parameters => determines accuracy of confidence intervals as estimates of population parameters
sampling distribution: probability distribution of possible values of statistic for repeated samples of same size taken from population

standard deviation of x bar: standard deviation of sampling distribution of sample mean
standard deviation of p hat: standard deviation of sampling distribution of sample proportion
standard error: estimated standard deviation for sampling distribution

Section 9.4
sample proportion: proportion of sample w/ response in specified category
Normal Curve Approximation Rule for Sample Proportion => sampling distribution for sample proportion is approximately normal distribution
standard deviation of p hat => how far apart sample proportion and true population proportion are likely to be
standard error of p hat => estimate of standard deviation of p hat found by using sample value p hat

Section 9.5
mean of sampling distribution of p hat 1 - p hat 2
standard deviation of p hat 1 - p hat 2

standard error of p hat 1 - p hat 2
  also called standard error of the difference

Section 9.6
Normal Curve Approximation Rule for Sample Means => rule to understand what to expect for possible distribution of sample means in repeated sampling from same population
sampling distribution of mean => approximately normal distribution
standard deviation of sample means => do not confuse w/ s.d. for original population of measurements
standard error of mean => population s.d. rarely known, so sample s.d. used instead
larger samples => result in more accurate estimates of population values than smaller samples

Section 9.7
data collected as matched pairs => sometimes called dependent samples (not statistically independent of each other)

Section 9.9
standardized statistics => can be found for any of five sample statistics
  z = (sample statistic - population parameter)/(s.d. of sample statistic)
degrees of freedom (df) => associated w/ any t distribution
  generally function of sample size, but depends on problem type
use Student's t when replacing population s.d. w/ sample s.d.

Section 9.10
Law of Large Numbers => guarantees that sample mean will eventually get close to population mean as sample size grows
  used by insurance companies, casinos, and other businesses to provide peace of mind about average profit/loss in long run
Central Limit Theorem => if "n" is sufficiently large, sample means of random samples from population are approximately normally distributed
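The idea of a sampling distribution of sample means can be explored by simulation. The sketch below draws repeated samples from a skewed population (an exponential distribution w/ mean 1 and s.d. 1, chosen arbitrarily) and looks at the resulting sample means; it assumes the standard result that the s.d. of x bar is population s.d./sqrt(n), which these notes name but do not write out.

```python
import random
from math import sqrt
from statistics import mean, stdev

def simulate_sample_means(n, repetitions, seed=0):
    """Return sample means from repeated random samples of size n."""
    rng = random.Random(seed)        # seeded so the run is repeatable
    means = []
    for _ in range(repetitions):
        sample = [rng.expovariate(1.0) for _ in range(n)]   # skewed population
        means.append(mean(sample))
    return means

sample_means = simulate_sample_means(n=30, repetitions=2000)
print(mean(sample_means))            # should land near population mean of 1
print(stdev(sample_means))           # should land near 1 / sqrt(30)
print(1 / sqrt(30))
```

A histogram of `sample_means` would look roughly bell shaped even though the population itself is strongly skewed to the right.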

Section 10.1
statistical inference: procedures used to make inferences about populations based on samples
tools
1) confidence interval: use of sample data to estimate population parameter through interval that fairly confidently covers it
2) hypothesis tests
sampling distribution: use of knowledge of population to describe possible sample values
  opposite of confidence interval

Section 10.2
unit: individual person or object to be measured
population (universe): entire collection of units about which information is desired or would be measured if possible
sample: collection of units actually measured or obtained
sample size (n): number of units or measurements in sample
population parameter: fixed number associated w/ population; in context of confidence intervals, is unknown and desired

Fundamental Rule for Using Data for Inference => available data can be used to make inferences about much larger group if data can be considered representative w/ regard to question(s) of interest
confidence interval: interval of values computed from sample data that is likely to include true population value
  important b/c information is incomplete (and possibly inaccurate)
confidence level: indication of how likely it is that interval estimate actually captures truth
types of estimates
  point estimate: single number or point on number line
  interval estimate: interval of values attempting to estimate single, fixed value
confidence interval (interval estimate): sample estimate ± multiplier × standard error
1) sample estimate (sample statistic) => sample proportion
2) multiplier => amount of confidence
3) standard error (standard error of sample statistic) => estimate of standard deviation of sample statistic
determinants of interval width
  sample size (n)
  sample proportion (p hat)
  multiplier

Section 10.3

multiplier z* => found by using standard normal distribution
conditions for formula
1) sample is randomly selected from population
2) both n × p hat and n × (1 - p hat) should be at least 10
conservative vs. general margin of error

Section 10.4
conditions of confidence interval for difference b/w two population proportions
1) sample proportions are from independent, randomly selected samples from two populations
2) n1 × p hat1, n1 × (1 - p hat1), n2 × p hat2, and n2 × (1 - p hat2) are all at least 10
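The recipe "sample estimate ± multiplier × standard error" can be sketched for one proportion and for a difference b/w two proportions. The standard error formulas used here, sqrt(p hat × (1 - p hat)/n) and its two sample version, are the usual textbook ones and are an assumption, since these notes do not write them out; the counts are invented.

```python
from math import sqrt
from statistics import NormalDist

def z_multiplier(confidence):
    """z* multiplier from the standard normal distribution."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def proportion_interval(successes, n, confidence=0.95):
    """Confidence interval for one population proportion."""
    p_hat = successes / n
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("need n x p hat and n x (1 - p hat) of at least 10")
    se = sqrt(p_hat * (1 - p_hat) / n)
    margin = z_multiplier(confidence) * se
    return p_hat - margin, p_hat + margin

def two_proportion_interval(x1, n1, x2, n2, confidence=0.95):
    """Confidence interval for difference b/w two population proportions."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    margin = z_multiplier(confidence) * se
    return (p1 - p2) - margin, (p1 - p2) + margin

print(proportion_interval(520, 1000))              # hypothetical poll
print(two_proportion_interval(60, 100, 40, 100))   # hypothetical two groups
```

Raising the confidence level raises the multiplier and widens the interval; raising n shrinks the standard error and narrows it.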

Section 12.1
five basic steps of hypothesis test
1) determine null & alternative hypotheses
2) verify data conditions, and if met, summarize data into appropriate test statistic
3) assuming null hypothesis is true, find p value
4) decide statistical significance based on p value
5) report conclusion in context

Section 12.2
null hypothesis (H0): statement that there is nothing happening (status quo true, no relationship/difference)
alternative hypothesis (Ha): statement that something is happening (status quo false, relationship/difference)
types of alternative hypotheses
  one sided (one tailed hypothesis test)
  two sided (two tailed hypothesis test)
null value: specific number parameter equals if null hypothesis is true

reaching conclusion about two hypotheses
1) test statistic
2) p value => computed by assuming null is true and determining probability of result as extreme as (or more extreme than) observed test statistic in direction of alternative
3) level of significance => result is statistically significant when p value is less than chosen level
types of errors
  type 1 error => concluding alternative hypothesis is true when null hypothesis is actually true
    ex. false positive
    probability of type 1 error => "alpha"
  type 2 error => failing to reject null hypothesis when alternative hypothesis is actually true
    ex. false negative
    probability of type 2 error => "beta"
power: probability of making correct decision when alternative hypothesis is true
  1 - probability of type 2 error => 1 - "beta"
  determinants
  1) sample size => as sample size increases, power increases
  2) difference b/w true population value and null value => as difference increases, power increases
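The steps for a hypothesis test about one proportion can be traced in code. This is a sketch assuming the usual z statistic, w/ the standard deviation of p hat computed from the null value; the function name, the `alternative` labels, and the data (60 successes in 100 trials, null value 0.5) are all made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(successes, n, null_value, alternative="two-sided"):
    """Return (z statistic, p value) for a test about one population proportion."""
    p_hat = successes / n
    sd_null = sqrt(null_value * (1 - null_value) / n)   # s.d. of p hat if null is true
    z = (p_hat - null_value) / sd_null
    std_normal = NormalDist()
    if alternative == "greater":        # one sided: Ha says parameter > null value
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":         # one sided: Ha says parameter < null value
        p_value = std_normal.cdf(z)
    else:                               # two sided: both tails count as extreme
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return z, p_value

z, p_value = one_proportion_z_test(60, 100, 0.5, alternative="greater")
print(z, p_value)
print("statistically significant at .05:", p_value < 0.05)
```

Choosing a level of significance such as .05 fixes "alpha", the probability of a type 1 error, before looking at the data.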

Section 11.1
parameter: population characteristic
  cannot usually be determined
  estimate through sample information
statistic (estimate): characteristic of sample
  estimates parameter
paired data (samples): collection and use of pairs of related variables
independent samples: samples in which measurements are not related to one another
standard error: measure of average difference b/w statistic and population parameter; estimated standard deviation of sampling distribution for statistic
t distribution: source of appropriate multipliers for confidence intervals for parameters involving means
  t* multiplier
degrees of freedom (df): parameter associated w/ t distributions; n - 1

Section 11.2

confidence interval for population mean: interval of values computed from sample data that fairly confidently covers true population mean
  ingredients
  1) sample estimate
  2) multiplier
  3) standard error
  also called t interval
valid situations for t confidence interval
1) bell shaped population of measurements, for random sample of any size
2) not bell shaped population of measurements, but large random sample measured (n > 25 or 30)

Section 11.3
difference b/w two variables => important when two quantitative variables are collected in pairs
confidence interval for population mean of paired differences => have to check that there are enough pairs, or no excessive outliers or skew
often of interest to know whether confidence interval for mean of paired differences covers 0
  covers 0 => mean difference possibly 0, indicating same population means for two measurements
  does not cover 0 => fairly certain population means for two variables are different

Section 11.4
difference b/w two population means => two sample t interval
complicated formula for degrees of freedom => Welch's approximation
often of interest to know whether confidence interval for difference b/w two population means covers 0
  covers 0 => difference possibly 0, indicating no difference b/w population means
  does not cover 0 => fairly certain population means for two groups are different
approximate 95% confidence interval for difference in population means => use multiplier 2 instead of using t*
equal variance assumption => assuming variances (and so standard deviations) are equal for two populations
pooled variance: estimate of variance based on combined or "pooled" data
pooled standard deviation: square root of pooled variance
pooled standard error for difference b/w two means => result of replacing individual standard deviations w/ pooled versions in formula for standard error
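The "complicated formula" for Welch's degrees of freedom, the pooled standard deviation, and the approximate 95% interval w/ multiplier 2 can all be written compactly. The formulas below are the standard textbook versions (Welch-Satterthwaite for df), stated here as an assumption because these notes do not give them; the samples are invented.

```python
from math import sqrt
from statistics import mean, variance

def welch_df(var1, n1, var2, n2):
    """Welch's approximate degrees of freedom for two sample t procedures."""
    a, b = var1 / n1, var2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

def pooled_sd(var1, n1, var2, n2):
    """Square root of pooled variance (sample variances weighted by their df)."""
    return sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

def approx_95_interval(sample1, sample2):
    """Approximate 95% interval for difference in means, using multiplier 2."""
    diff = mean(sample1) - mean(sample2)
    se = sqrt(variance(sample1) / len(sample1) + variance(sample2) / len(sample2))
    return diff - 2 * se, diff + 2 * se

group1 = [1, 2, 3, 4, 5]        # hypothetical measurements
group2 = [2, 3, 4, 5, 6]
print(welch_df(variance(group1), 5, variance(group2), 5))
print(pooled_sd(variance(group1), 5, variance(group2), 5))
print(approx_95_interval(group1, group2))
```

In this example the interval covers 0, so a difference of 0 b/w the population means remains plausible.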

Section 13.1
hypotheses about means => t statistic
hypotheses about proportions => z statistic

Section 13.2
steps for testing hypotheses
1) determine null and alternative hypotheses
2) verify necessary data conditions, and if met, summarize into appropriate test statistic
  valid situations
  1) approximately normal population of measurements, for random sample of any size
  2) not approximately normal population of measurements, but random sample of large size
  *should not be used w/ notable skew or extreme outliers
  compute test statistic
3) assuming null hypothesis is true, find p value
4) decide whether result is/is not statistically significant based on p value, and report conclusion in context of situation
rejection region approach

1) find critical value and rejection region for test
2) if test statistic is in rejection region, conclude statistically significant and reject null

Section 13.3
paired data: data collected in natural pairs
  often two measurements from each observational unit
  interested in differences, not in original observations
paired t test: one sample t test used on sample of differences to examine whether sample mean difference is significantly different from 0
  conducted on n differences
  replaces "x" w/ "d" in notation
paired p test => similar process as before

Section 13.4
two sample t test (t test for difference in two means): procedure for testing null hypothesis when comparing two means
  similar data conditions, except for additional "independence" => samples must not have been measured as paired or blocked data

  df found by complicated Welch's approximation
  similar approach as w/ any hypothesis test
pooled two sample t test => based on assumption that standard deviations for both populations are equal
  pooled sample variance, pooled standard deviation => provided by combining sample variances
  changes
  1) pooled sample variance used instead of individual sample variances
  2) df = n1 + n2 - 2
  generally use Welch's approximation, but if not available and two sample sizes are equal, acceptable to use pooled version as substitute
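The paired t test and the pooled two sample t test differ mainly in what gets averaged and in the degrees of freedom. A sketch of both test statistics, assuming the standard formulas (t = mean of differences / (s.d. of differences / sqrt(n)) for paired data; pooled standard error for two independent samples); the data and function names are hypothetical, and finding the p value would still require a t distribution.

```python
from math import sqrt
from statistics import mean, stdev, variance

def paired_t_statistic(before, after):
    """Return (t, df) for a paired t test of mean difference = 0."""
    d = [a - b for a, b in zip(after, before)]   # work w/ differences only
    n = len(d)
    se = stdev(d) / sqrt(n)
    return mean(d) / se, n - 1

def pooled_t_statistic(sample1, sample2):
    """Return (t, df) for a pooled two sample t test of equal means."""
    n1, n2 = len(sample1), len(sample2)
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(sample1) - mean(sample2)) / se, n1 + n2 - 2

print(paired_t_statistic([10, 12, 9, 11], [12, 13, 12, 12]))
print(pooled_t_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
```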

Section 14.1
simple regression: analysis of straight line (linear) relationship b/w response and explanatory variables
• regression model: relationship b/w response and explanatory variables
• y variable => dependent variable, x variable => predictor variable
• relationship can be linear or curvilinear
regression lines for sample (can be determined from data) and population (can only be imagined)
• composed of intercept and slope
least squares line: line among all possible lines w/ smallest sum of squared differences b/w sample and corresponding predicted values of y
deviation => also referred to as random error, residual variation, and residual error
• residual => deviation of observed y value from sample regression line
multiple regression: regression in which mean of response variable is function of two or more explanatory variables
constant variance => general size of deviation of y values from line is same for all values of x
• must be assumed to make inferences about population

Section 14.2

standard deviation for regression: measure of general size of residuals
• useful for describing individual variation
• provides information about regression equation prediction accuracy
sum of squared errors: sum of squared residuals for sample
• needed to calculate estimated s.d.
• also known as residual sum of squares, sum of squared residuals
proportion of variation explained by x => given by r^2
• expressed as %
• SSTO: sum of squared differences b/w sample y values and their mean; measures size of deviations of y values from overall mean
• SSE => measures deviations of y values from predicted values

Section 14.3
statistical significance of a linear relationship => evaluated by testing population slope
• if 0, unrelated
• if not 0, related (depending on how large/small and positive/negative)
confidence interval for population slope

Section 14.4
prediction interval (for specified central percentage of population) => describes y values for that percentage of all individuals w/ particular x value

Section 14.5
plots of the residuals => provide information about validity of assumptions
transformation => equivalent to using different model
• may be required if conditions are not met
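SSE, SSTO, the standard deviation for regression, and r^2 fit together as shown below. This sketch assumes the standard formulas s = sqrt(SSE/(n - 2)) and r^2 = (SSTO - SSE)/SSTO, which these notes describe only in words; the data and the name `regression_summary` are made up.

```python
from math import sqrt

def regression_summary(x, y):
    """Return SSE, SSTO, s.d. for regression, and r^2 for the least squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)          # deviations from predicted values
    ssto = sum((yi - y_bar) ** 2 for yi in y)     # deviations from overall mean
    return {
        "SSE": sse,
        "SSTO": ssto,
        "s": sqrt(sse / (n - 2)),                 # general size of residuals
        "r_squared": (ssto - sse) / ssto,         # proportion of variation explained by x
    }

print(regression_summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))
```

Plotting the residuals against x is the usual next step for checking the constant variance assumption.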

Section 16.1

analysis of variance (ANOVA): significance test for determining differences among population means being compared
• versatile tool for analyzing relation b/w quantitative responses and categorical explanations
• one way vs. two way
F statistic
• arises from one way analysis of variance
• used to test hypothesis about population means
• F test => significance test
• sensitive to differences among sample means
• F distribution => probability distribution used to find p value of F statistic
assumptions for F test
• independent random samples
• normal curve distributions within each population
• possibly different population means
• same population s.d.
  largest s.d. is no more than twice that of smallest s.d.
parameters => indicate specific F distribution

• numerator degrees of freedom (= k - 1)
• denominator degrees of freedom (= N - k)
  where N = total sample size and k = number of groups
same procedure applies w/ critical values
multiple comparisons: examination of specific pattern of differences among means using two or more comparisons
family type 1 error rate: probability of making one or more type 1 errors when more than one significance test is done
family confidence level: proportion of times that all intervals in set capture true parameter values
family error rate increases (and family confidence level decreases) as more statistical tests are done => control procedures
• Tukey's procedure
• Fisher's procedure

Section 16.2
total variation = variation b/w groups + variation w/i groups

analysis of variance table => used to display information about sources of variation in response
sum of squares for groups (SS Groups): weighted sum measuring variation b/w group means
mean square for groups (MS Groups): numerator of F statistic
sum of squared errors (SSE): sum measuring variation among individuals w/i groups
mean square error (MSE): denominator of F statistic
pooled standard deviation: square root of MSE; estimates population s.d. of response
total sum of squares (SS Total): measure of total variation in data from all samples combined

Section 16.3
assumptions don't always hold
Kruskal-Wallis test: test for comparing medians based on relative sizes of data in samples
• rank test

• nonparametric test => no assumptions about specific distribution for population measurements
Mood's Median test: test for comparing population medians
• nonparametric test

Section 16.4
two way analysis of variance: test used to examine how two categorical explanations affect mean of responses
interaction => effect on response of one explanation depends on other explanation
main effect: overall effect of single explanatory variable
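The entries of a one way analysis of variance table (SS Groups, SSE, SS Total, the mean squares, and the F statistic) can be built up from the group data. This is a sketch using the standard definitions, w/ degrees of freedom k - 1 and N - k as given in the notes; the three groups are invented, and the p value would still need an F distribution.

```python
from math import sqrt
from statistics import mean

def one_way_anova(groups):
    """Return the main entries of a one way ANOVA table for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # variation b/w group means, weighted by group size
    ss_groups = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # variation among individuals w/i groups
    sse = sum((v - mean(g)) ** 2 for g in groups for v in g)
    ms_groups = ss_groups / (k - 1)
    mse = sse / (n_total - k)
    return {
        "SS Groups": ss_groups,
        "SSE": sse,
        "SS Total": ss_groups + sse,
        "F": ms_groups / mse,
        "pooled s.d.": sqrt(mse),
        "df": (k - 1, n_total - k),
    }

print(one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))   # hypothetical groups
```

A large F statistic means variation b/w group means is large relative to variation w/i groups.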
