
Pie charts show the distribution of a categorical variable

Bar graphs show the distribution of a categorical variable


Histograms: group values into classes of equal width. Take the range of the data set and pick an
equal width for the classes; the width should be chosen so the graph effectively shows the
distribution. When graphing, put the measured variable on the x-axis
Stem plots are useful for smaller data sets; start with the lowest value at the top and work downward
a stem plot preserves the actual values of the observations, histograms don't
Time plot shows the value of a variable (y) vs. the time (x) it was recorded
histograms and time plots give different kinds of data: a time plot gives time series data that shows
change over time, whereas a histogram displays cross-sectional data that shows how values vary
across different individuals
Shape, Center, Spread, Outliers
Center is described in a few ways, 2 of which are
1) showing the midpoint, where roughly half the values are above and half below
2) stating the highest and lowest values in the range
Measuring Center
Mean: add all values together, then divide by the number of observations (n): x̄ = (x₁ + x₂ + … + xₙ) / n

Median: the formal measure of the midpoint. Order all observations, then use
M = (n+1)/2, which gives the location of M in the ordered list (for an even n, average the two middle values)
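A minimal Python sketch of these two computations (the data values are made up for illustration):

```python
# hypothetical data set with n = 7 observations
data = sorted([4, 1, 7, 3, 9, 2, 5])

n = len(data)
mean = sum(data) / n              # add all values, divide by n
location = (n + 1) / 2            # M = (n+1)/2 gives the *position* of the median
median = data[int(location) - 1]  # exact for odd n; for even n, average the two middle values

print(mean, median)               # about 4.43 and 4
```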

The mean is not a resistant measure of center


The median is a resistant measure of center
The Quartiles Q1 and Q3
Finding the Quartiles:
1) find M
2) Q1 is the median of the observations that fall below the location of M in the ordered list
3) Q3 is the median of the observations that fall above the location of M
The 5-Number Summary
Minimum-----Q1-----M-----Q3-----Maximum

The interquartile range IQR is a resistant measure of spread


IQR = Q3 - Q1
IQR is useful for finding outliers
The 1.5 × IQR rule for outliers
if an observation falls more than 1.5 × IQR above Q3 or below Q1, it is a suspected outlier
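A sketch of the quartiles, IQR, and the 1.5 × IQR rule in Python, on made-up data (note that numpy's default percentile interpolation can differ slightly from the median-of-halves method described above):

```python
import numpy as np

data = np.sort(np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49]))  # made-up values

q1, m, q3 = np.percentile(data, [25, 50, 75])  # numpy interpolates between values,
                                               # unlike the median-of-halves method
iqr = q3 - q1                                  # IQR = Q3 - Q1

outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(iqr, outliers)                           # here 7 and 15 are suspected outliers
```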

Measuring Spread
Standard Deviation s
Variance s² is the average of the squares of the deviations of the observations from their mean
(dividing by n − 1): s² = Σ(xᵢ − x̄)² / (n − 1); the standard deviation s is the square root of the variance
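A minimal sketch of this formula in Python (the data values are made up):

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values
n = len(data)
xbar = sum(data) / n

# variance: squared deviations from the mean, averaged with divisor n - 1
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)
s = math.sqrt(s2)                 # standard deviation s is the square root
print(s2, s)
```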

Important facts about the standard deviation


- s measures spread about the mean and should be used only when the mean is
chosen as the measure of center.
- s is always greater than or equal to zero.
- As the observations become more spread out about their mean, s gets larger.
- s is not resistant. A few outliers can make s very large.
The 5-number summary is better for skewed distributions or ones with many outliers,
while the mean and standard deviation (x̄, s) are better for the rest (symmetric distributions)
Density Curve
has area exactly 1 underneath it
describes the overall pattern of the distribution
often a good description of the overall pattern, though outliers are not displayed
Median and Mean of a density Curve

a density curve is an idealization of the actual data; when computing the mean and standard deviation we
use different symbols to distinguish them
the mean of a density curve is written μ
the SD of a density curve is written σ
the standard deviation controls the spread: the larger σ is, the larger the spread
Normal distribution N(μ, σ)

The 68-95-99.7 Rule


68% of the observations fall within σ of the mean μ
95% of the observations fall within 2σ of the mean μ
99.7% of the observations fall within 3σ of the mean μ
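A quick empirical check of this rule, simulating draws from a hypothetical N(10, 2) with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=10, scale=2, size=100_000)   # hypothetical N(mu=10, sigma=2) draws

for k in (1, 2, 3):
    frac = np.mean(np.abs(x - 10) <= k * 2)     # share within k sigma of mu
    print(k, round(frac, 3))                    # roughly 0.683, 0.954, 0.997
```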
Standardizing and Z scores
The standardized value of x is z = (x − μ) / σ
this is known as the z-score


The z-score tells us how many standard deviations from the mean the observation falls
a positive z-score means larger than the mean μ
a negative z-score means smaller than the mean μ
if x has the N(μ, σ) distribution, then the standardized variable z = (x − μ) / σ has the standard normal
distribution
Standard normal Distribution
is the normal distribution N(0, 1), i.e. N(μ, σ) with μ = 0 and σ = 1
(z tells us how many σ from μ, x is)
by standardizing, you can use Table A on page 677 to find cumulative proportions (the area under the
density curve to the left of z)

The cumulative proportion for a value x in a distribution is the proportion of observations in the
distribution that are less than or equal to x

Put values into z-scores, then use the table to find the proportion to the left of the value
To find normal proportions by table:
draw a picture of the curve and the proportion needed
standardize
use the table
To find a value from a given proportion by table:
draw a picture of the curve and the proportion needed
use the table
un-standardize using x = μ + σ(table value)
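If Python is handy, scipy's standard normal functions can stand in for Table A in both directions; a minimal sketch (the N(μ, σ) parameters and the x value are made up):

```python
from scipy.stats import norm

mu, sigma = 64.5, 2.5    # made-up N(mu, sigma) parameters
x = 67.0

# forward: x -> z -> cumulative proportion (what Table A tabulates)
z = (x - mu) / sigma
print(norm.cdf(z))       # proportion of observations <= x

# reverse: given a proportion, look up z, then un-standardize
p = 0.90
z_val = norm.ppf(p)              # z value with cumulative proportion p
print(mu + sigma * z_val)        # x = mu + sigma * (table value)
```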
FOCUS ON CHAPTER 3
Scatter plot shows the relationship between two quantitative variables measured on the same
individuals
Correlation measures the direction and strength of the linear relationship between two quantitative
variables; the symbol is r

Correlation is the average of the products of the standardized values: r = (1/(n−1)) Σ [(xᵢ − x̄)/sx][(yᵢ − ȳ)/sy]


Correlation is always between −1 and 1; the closer the value is to 0, the weaker the correlation
if r = −1 or 1, the points lie exactly on a straight line with none off the line
requires both variables to be quantitative
only measures the strength of linear relationships, not curves
like the mean and standard deviation, correlation is sensitive to outliers and is not resistant
r is not a complete summary; you should give the means and standard deviations as well as the correlation
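A sketch of the average-of-standardized-products definition in Python, checked against numpy's built-in corrcoef (the paired data are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up paired quantitative data
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

n = len(x)
zx = (x - x.mean()) / x.std(ddof=1)       # standardized x values
zy = (y - y.mean()) / y.std(ddof=1)       # standardized y values
r = np.sum(zx * zy) / (n - 1)             # "average" of the products

print(r, np.corrcoef(x, y)[0, 1])         # the two computations agree
```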

Regression lines summarize the relationship between two variables, specifically when there is
an explanatory variable and a response variable. Often used to predict the y-value
when given the x-value
The least-squares regression line of y on x is the line that makes the sum of the squares of the
vertical distances of the data points from the line as small as possible
The Equation of the Least-Squares Regression Line
ŷ = a + bx
ŷ ("y-hat") is used to emphasize that the line gives a predicted response ŷ for a given x
To find the line:
calculate the means x̄ and ȳ
calculate the standard deviations sx and sy of the two variables
calculate the correlation r
slope: b = r(sy / sx)
intercept: a = ȳ − b x̄
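A minimal sketch of this recipe in Python (the data are made up, and numpy's corrcoef stands in for computing r by hand):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up explanatory values
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])   # made-up response values

xbar, ybar = x.mean(), y.mean()            # step 1: the means
sx, sy = x.std(ddof=1), y.std(ddof=1)      # step 2: the standard deviations
r = np.corrcoef(x, y)[0, 1]                # step 3: the correlation

b = r * sy / sx                            # slope b = r(sy / sx)
a = ybar - b * xbar                        # intercept a = y-bar - b x-bar

print(b, a, a + b * 3.5)                   # predicted response y-hat at x = 3.5
```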

Facts about least-squares regression


1) the distinction between explanatory and response variables is essential in regression (least-squares
makes the distances between the data points and the line smallest only in the y-direction; if you reverse
the roles of the variables, you get a different least-squares regression line)
2) there is a close connection between correlation and the slope of the least-squares line
3) the least-squares regression line always passes through the point (x̄, ȳ)
4) the square of the correlation, r², is the fraction of the variation in the values of y that is
explained by the least-squares regression of y on x

Residuals: a residual is the difference between an observed value of the response variable and the
value predicted by the chosen regression line: residual = y − ŷ
the mean of the least-squares residuals is ALWAYS zero
Residual plot: a scatterplot of the regression residuals against the explanatory variable. Residual
plots help us assess how well a regression line fits the data (plot the residuals of the response
variable on the y-axis and the explanatory variable on the x-axis)
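Continuing the hypothetical regression sketch above, the residuals and their mean-zero property:

```python
import numpy as np

# same made-up data and fitted line as in the regression sketch above
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
b = np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1)
a = y.mean() - b * x.mean()

residuals = y - (a + b * x)                 # observed y minus predicted y-hat
print(np.isclose(residuals.mean(), 0.0))    # True: least-squares residuals average to zero

# a residual plot is then just a scatter of x (x-axis) against residuals (y-axis)
```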
correlation and regression lines describe only linear relationships. You can do the calculations
for any relationship, but the results won't mean anything for non-linear ones
Correlation and the least-squares regression line are not resistant, so always be aware of outliers
Extrapolation: the use of a regression line for prediction far outside the range of values of the
explanatory variable x that you used to obtain the line


Lurking variable: a variable that is not among the explanatory or response variables in a study but
still influences the relationship between them
Two-way table: shows data on two categorical variables, one in the columns and one in the rows
Simpson's Paradox: when an association or comparison that holds for each of several groups reverses
direction when the data are combined to form a single group
Probability models consist of:
a list of possible outcomes
a probability for each outcome
The sample space S of a random phenomenon is the set of all possible outcomes
Finite and Discrete Probability
To assign probabilities in a finite model, list the probabilities of all the individual outcomes. They
must be numbers between 0 and 1 and must add to exactly 1
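A tiny sketch of checking these two requirements for a hypothetical finite model:

```python
# hypothetical finite model: a loaded four-sided die
model = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}

probs = model.values()
valid = all(0 <= p <= 1 for p in probs) and abs(sum(probs) - 1.0) < 1e-9
print(valid)  # True: each probability is in [0, 1] and they add to exactly 1
```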
Continuous Probability
assigns probabilities as areas under a density curve. The area under the curve and above any range of
values is the probability of an outcome in that range. The variable can take any value in an interval
(infinitely many decimal places), so probabilities are assigned to ranges rather than individual outcomes
REREAD CHAPTER 13
to estimate the unknown population mean μ
we use
the sample's known mean x̄
the population's standard deviation σ
the number of subjects in the sample, n
σ/√n
this gives us the standard deviation of the sample mean x̄
x̄ ± the spread (SD) = confidence interval
A confidence interval is
x̄ ± z*(σ/√n)
where z*(σ/√n) is known as the margin of error
-to obtain a smaller margin of error from the same data, you must be willing to accept lower confidence
-if the standard deviation σ of the population is lower, then the spread/margin of error will be lower
-if our sample size gets larger, then our margin of error will decrease; note that because the sample size
is under a square root, a 4× larger sample results in only half the margin of error
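A minimal sketch of this interval with made-up numbers (scipy's norm.ppf supplies the critical value z*):

```python
import math
from scipy.stats import norm

xbar, sigma, n = 240.8, 20.0, 31    # made-up sample mean, known sigma, and sample size
conf = 0.95

z_star = norm.ppf(1 - (1 - conf) / 2)     # critical value z*, about 1.96 for 95%
margin = z_star * sigma / math.sqrt(n)    # margin of error z*(sigma / sqrt(n))
print(xbar - margin, xbar + margin)       # the confidence interval x-bar +/- margin

# sample size needed for a target margin of error m: n = (z* sigma / m)^2, rounded up
m = 5.0
print(math.ceil((z_star * sigma / m) ** 2))
```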
The alternative hypothesis is one-sided if it states that a parameter is larger than or smaller than the null
hypothesis value. It is two-sided if it states that the parameter is different from the null value (does not
equal)

The smaller the P-value is, the stronger the evidence AGAINST H₀ provided by the data
The smaller the P-value is, the stronger the evidence FOR Hₐ provided by the data
The significance level α is between 0 and 1; the lower α is, the less likely such a result is to occur by chance when H₀ is true
If the P-value is as small as or smaller than α, we say that the data are statistically significant at level α.
If p ≤ 0.05, then there is no more than 1 chance in 20 that a sample would give evidence this strong
by chance when H₀ is actually true
z test for a population mean
draw an SRS of size n from a Normal population that has unknown mean μ and known standard
deviation σ
to test the H₀ that μ has a specified value μ₀:
H₀: μ = μ₀
calculate the one-sample z test statistic
z = (x̄ − μ₀) / (σ/√n)
in terms of a variable Z having the standard Normal distribution, the P-value for a test of H₀ against
Hₐ: μ > μ₀ is P(Z ≥ z); against Hₐ: μ < μ₀ is P(Z ≤ z); against Hₐ: μ ≠ μ₀ is 2P(Z ≥ |z|)
The numerator measures how far the sample mean deviates from the hypothesized
mean μ₀. Larger values of the numerator give stronger evidence against H₀:
μ = μ₀. The denominator is the standard deviation of x̄. It measures how much
random variation we expect. There is less variation when the number of observations
n is large. So z gets larger (more significant) when the estimated effect
x̄ − μ₀ gets larger or when the number of observations n gets larger. Significance
depends both on the size of the effect we observe and on the size of the sample.
Understanding this fact is essential to understanding significance tests.
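A sketch of the one-sample z test in Python, with made-up numbers and scipy's normal cdf supplying the P-values for all three alternatives:

```python
import math
from scipy.stats import norm

mu0, sigma = 128.0, 15.0     # hypothesized mean and known population sigma (made up)
xbar, n = 131.1, 72          # made-up sample mean and sample size

z = (xbar - mu0) / (sigma / math.sqrt(n))   # one-sample z test statistic

p_upper = 1 - norm.cdf(z)                   # P-value for Ha: mu > mu0
p_lower = norm.cdf(z)                       # P-value for Ha: mu < mu0
p_two = 2 * (1 - norm.cdf(abs(z)))          # P-value for Ha: mu != mu0
print(round(z, 3), round(p_upper, 4), round(p_two, 4))
```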
The z confidence interval for the mean of a N(μ, σ) population will have margin of error m when the
sample size is n = (z*σ / m)²
falsely rejecting H₀ is a Type I error; the significance level α is the probability of a Type I error
failing to reject H₀ when it is false is a Type II error; the power is 1 minus the probability of a Type II error

increasing the degrees of freedom (d.f.) makes the t distribution more Normal
