in Statistics
(reordered slightly for the interactive review
session)
NOTE: This PowerPoint file is not an introduction,
but rather a checklist of topics to review
Top Ten #10
Qualitative vs. Quantitative
Qualitative
Categorical data:
success vs. failure
ethnicity
marital status
color
zip code
4 star hotel in tour guide
Qualitative
If you need an average, do not calculate the
mean
However, you can compute the mode
(average person is married, buys a blue car
made in America)
Quantitative
Two cases
Case 1: discrete
Case 2: continuous
Discrete
(1) integer values (0, 1, 2, ...)
(2) example: binomial
(3) finite or countable number of possible values
(4) counting
(5) number of brothers
(6) number of cars arriving at gas station
Continuous
Real numbers, such as decimal values
($22.22)
Examples: Z, t
Infinite number of possible values
Measurement
Miles per gallon, distance, duration of time
Graphical Tools
Pie chart or bar chart: qualitative
Joint frequency table: qualitative (relate
marital status vs zip code)
Scatter diagram: quantitative (distance from
CSUN vs duration of time to reach CSUN)
Hypothesis Testing
Confidence Intervals
Quantitative: Mean
Qualitative: Proportion
Top Ten #9
Population vs. Sample
Population
Collection of all items (all light bulbs made at
factory)
Parameter: measure of population
(1) population mean (average number of
hours in life of all bulbs)
(2) population proportion (% of all bulbs that
are defective)
Sample
Part of population (bulbs tested by inspector)
Statistic: measure of sample = estimate of
parameter
(1) sample mean (average number of hours
in life of bulbs tested by inspector)
(2) sample proportion (% of bulbs in sample
that are defective)
Top Ten #1
Descriptive Statistics
Measures of Central Location
Mean
Median
Mode
Mean
Population mean = \(\mu = \sum x / N\) = (5+1+6)/3 = 12/3 = 4
Algebra: \(\sum x = N \mu\) = 3*4 = 12
Sample mean = x-bar = \(\sum x / n\)
Example: the number of hours spent on the
Internet: 4, 8, and 9
x-bar = (4+8+9)/3 = 7 hours
Do NOT use if the number of observations is
small or with extreme values
Ex: Do NOT use if 3 houses were sold this week,
and one was a mansion
Median
Median = middle value
Example: 5,1,6
Step 1: Sort data: 1,5,6
Step 2: Middle value = 5
When there is an even number of observations, the
median is computed by averaging the two
observations in the middle.
OK even if there are extreme values
Home sales: 100K,200K,900K, so
mean =400K, but median = 200K
Mode
Mode: most frequent value
Ex: female, male, female
Mode = female
Ex: 1,1,2,3,5,8
Mode = 1
It may not be a very good measure, see the
following example
Measures of Central Location -
Example
Sample: 0, 0, 5, 7, 8, 9, 12, 14, 22, 23
Sample Mean = x-bar = \(\sum x / n\) = 100/10 = 10
Median = (8+9)/2 = 8.5
Mode = 0
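The three measures for this sample can be checked with a short sketch. It assumes Python's standard `statistics` module; the variable names are illustrative only.

```python
import statistics

# Sample from the slide
sample = [0, 0, 5, 7, 8, 9, 12, 14, 22, 23]

sample_mean = statistics.mean(sample)      # sum of the values divided by n
sample_median = statistics.median(sample)  # even n: average of the two middle values
sample_mode = statistics.mode(sample)      # most frequent value

print("mean:", sample_mean)
print("median:", sample_median)
print("mode:", sample_mode)
```

Here the mode (0) sits far from the mean and the median, which is why it may not be a good measure of central location for this sample.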
Relationship
Case 1: if probability distribution symmetric
(ex. bell-shaped, normal distribution),
Mean = Median = Mode
Case 2: if distribution positively skewed to
right (ex. incomes of employees in large firm: a
large number of relatively low-paid workers
and a small number of high-paid executives),
Mode < Median < Mean
Relationship contd
Case 3: if distribution negatively skewed to left
(ex. the time taken by students to write
exams: few students hand in their exams early
and the majority turn in their exams at
the end of the exam),
Mean < Median < Mode
Dispersion: Measures of
Variability
How much spread of data
How much uncertainty
Measures
Range
Variance
Standard deviation
Range
Range = Max - Min \(\geq\) 0
But range affected by unusual values
Ex: Santa Monica has a high of 105 degrees
and a low of 30 once a century, but range
would be 105-30 = 75
Standard Deviation (SD)
Better than range because all data used
Population SD = square root of variance
= sigma = \(\sigma\)
SD \(\geq\) 0
Empirical Rule
Applies to mound or bell-shaped curves
Ex: normal distribution
68% of data within \(\pm\) one SD of mean
95% of data within \(\pm\) two SD of mean
99.7% of data within \(\pm\) three SD of mean
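The three percentages can be checked against the normal curve. This is a minimal sketch, assuming Python's standard `statistics.NormalDist`; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal value lies within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```

The rule quotes rounded figures; the exact normal probabilities differ slightly in the later decimal places.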
Standard Deviation =
Square Root of Variance
\(s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}\)
Sample Standard Deviation
x         \(x - \bar{x}\)     \((x - \bar{x})^2\)
6         6-8 = -2       (-2)(-2) = 4
6         6-8 = -2       4
7         7-8 = -1       (-1)(-1) = 1
8         8-8 = 0        0
13        13-8 = 5       (5)(5) = 25
Sum=40    Sum=0          Sum = 34
Mean = 40/5 = 8
Standard Deviation
Total variation = 34
Sample variance = 34/4 = 8.5
Sample standard deviation =
square root of 8.5 = 2.9
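The table and the three steps above can be reproduced with a short sketch in plain Python. The variable names are illustrative and not part of the slides.

```python
import math

data = [6, 6, 7, 8, 13]
n = len(data)

mean = sum(data) / n                        # 40 / 5
deviations = [x - mean for x in data]       # x minus the sample mean
squared = [d ** 2 for d in deviations]      # squared deviations

total_variation = sum(squared)              # sum of squared deviations
sample_variance = total_variation / (n - 1) # divide by n - 1, not n
sample_sd = math.sqrt(sample_variance)

print(total_variation, sample_variance, round(sample_sd, 1))
```

The deviations always sum to zero, which is why they are squared before being added.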
Measures of Variability - Example
The hourly wages earned by a sample of five students
are:
$7, $5, $11, $8, and $6
Range: 11 - 5 = 6
Variance:
\(s^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1} = \dfrac{(7 - 7.4)^2 + \dots + (6 - 7.4)^2}{5 - 1} = \dfrac{21.2}{5 - 1} = 5.30\)
Standard deviation:
\(s = \sqrt{s^2} = \sqrt{5.30} = 2.30\)
Graphical Tools
Line chart: trend over time
Scatter diagram: relationship between two
variables
Bar chart: frequency for each category
Histogram: frequency for each class of
measured data (graph of frequency distr.)
Box plot: graphical display based on
quartiles, which divide data into 4 parts
Top Ten #8
Variation Creates Uncertainty
No Variation
Certainty, exact prediction
Standard deviation = 0
Variance = 0
All data exactly same
Example: all workers in minimum wage job
High Variation
Uncertainty, unpredictable
High standard deviation
Ex #1: Workers in downtown L.A. have variation
between CEOs and garment workers
Ex #2: New York temperatures in spring range
from below freezing to very hot
Comparing Standard
Deviations
Temperature Example
Beach city: small standard deviation (single
temperature reading close to mean)
High Desert city: High standard deviation (hot
days, cool nights in spring)
Standard Error of the Mean
Standard deviation of sample mean =
standard deviation/square root of n
Ex: standard deviation = 10, n =4, so standard
error of the mean = 10/2= 5
Note that 5<10, so standard error < standard
deviation.
As n increases, standard error decreases.
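A small sketch of the formula, using the slide's numbers (standard deviation 10, n = 4) plus two larger sample sizes chosen only for illustration:

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: standard deviation / square root of n."""
    return sd / math.sqrt(n)

for n in (4, 25, 100):
    print(n, standard_error(10, n))
```

Quadrupling the sample size halves the standard error, since n enters through its square root.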
Sampling Distribution
Expected value of sample mean = population
mean, but an individual sample mean could be
smaller or larger than the population mean
Population mean is a constant parameter, but
sample mean is a random variable
Sampling distribution is distribution of sample
means
Example
Mean age of all students in the building is
population mean
Each classroom has a sample mean
Distribution of sample means from all
classrooms is sampling distribution
Central Limit Theorem (CLT)
If population standard deviation is known,
sampling distribution of sample means is
approximately normal if n > 30
CLT applies even if original population is
skewed
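The behavior of the sampling distribution can be illustrated by simulation. This is only a sketch under stated assumptions: the population is taken to be exponential with mean 1 and standard deviation 1 (a positively skewed choice made here, not one from the slides), and the seed, sample size, and number of samples are arbitrary.

```python
import random
import statistics

random.seed(302)  # arbitrary seed so the run is repeatable

SAMPLE_SIZE = 36    # n > 30
NUM_SAMPLES = 2000  # number of repeated samples

def one_sample_mean():
    """Draw one sample from the skewed population and return its mean."""
    draws = [random.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
    return statistics.mean(draws)

sample_means = [one_sample_mean() for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.mean(sample_means)  # should be near the population mean, 1
sd_of_means = statistics.stdev(sample_means)   # should be near 1 / sqrt(36)

print(round(mean_of_means, 3), round(sd_of_means, 3))
```

The sample means cluster around the population mean with a spread close to the standard error, even though individual observations come from a skewed population.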
Top Ten #5
Expected Value
Expected Value
Expected Value = E(x) = \(\sum x P(x)\)
= \(x_1 P(x_1) + x_2 P(x_2) + \dots\)
Expected value is a weighted average, also a
long-run average
Example
Find the expected age at high school
graduation if 11 were 17 years old, 80 were
18 years old, and 5 were 19 years old
Step 1: 11+80+5=96
Step 2
x     P(x)             x P(x)
17    11/96 = .115     17(.115) = 1.955
18    80/96 = .833     18(.833) = 14.994
19    5/96 = .052      19(.052) = .988
                       E(x) = 17.937
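The two steps can be sketched with exact fractions, which avoids rounding each probability to three decimals. The dictionary layout and names below are illustrative assumptions.

```python
from fractions import Fraction

# age at graduation -> number of graduates
counts = {17: 11, 18: 80, 19: 5}

total = sum(counts.values())  # Step 1: total number of graduates
probabilities = {age: Fraction(c, total) for age, c in counts.items()}

# Step 2: weighted average of the ages, weights = probabilities
expected_age = sum(age * p for age, p in probabilities.items())

print(total, float(expected_age))
```

The exact result differs from the table's total only in the last decimal place, because the table rounds each P(x) before multiplying.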
Top Ten #4
Linear Regression
Linear Regression
Regression equation: \(\hat{y} = b_0 + b_1 x\)
\(\hat{y}\) = dependent variable = predicted value
x = independent variable
\(b_0\) = y-intercept = predicted value of y if x = 0
\(b_1\) = slope = regression coefficient
= change in y per unit change in x
Slope vs Correlation
Positive slope (\(b_1 > 0\)): positive correlation
between x and y (y increases if x increases)
Negative slope (\(b_1 < 0\)): negative correlation (y
decreases if x increases)
Zero slope (\(b_1 = 0\)): no correlation (predicted
value for y is mean of y), no linear
relationship between x and y
Simple Linear Regression
Simple: one independent variable, one
dependent variable
Linear: graph of regression equation is
straight line
Example
y = salary (female manager, in thousands of
dollars)
x = number of children
n = number of observations
Given Data
x y
2 48
1 52
4 33
Totals
x y
2 48
1 52
4 33 n=3
Sum=7 Sum=133
Slope (\(b_1\)) = -6.5
Method of Least Squares formulas not on
BUS 302 exam
\(b_1\) = -6.5 given
Interpretation: If one female manager has 1
more child than another, salary is $6,500
lower; that is, salary of female managers
is expected to decrease by 6.5 (in
thousands of dollars) per child
Intercept (\(b_0\))
\(b_0 = \bar{y} - b_1 \bar{x}\)
\(\bar{x} = \dfrac{\sum x}{n} = \dfrac{7}{3} = 2.33\)
\(\bar{y} = \dfrac{\sum y}{n} = \dfrac{133}{3} = 44.33\)
\(b_0\) = 44.33 - (-6.5)(2.33) = 59.5
If number of children is zero,
expected salary is $59,500
Regression Equation
\(\hat{y} = 59.5 - 6.5x\)
Forecast Salary If 3 Children
forecast = \(\hat{y} = b_0 + b_1 x\)
59.5 - 6.5(3) = 40
$40,000 = expected salary
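The slides take the slope as given and note that the least-squares formulas are not on the exam. For readers who want to check the numbers, the sketch below assumes the standard least-squares formulas (slope = sum of cross-deviations / sum of squared x-deviations), which the slides do not show; the function name `predict` is illustrative.

```python
# Data from the example: x = number of children, y = salary in $ thousands
x = [2, 1, 4]
y = [48, 52, 33]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Standard least-squares slope and intercept (assumed, not from the slides)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

def predict(children):
    """Forecast salary (in $ thousands) from the fitted regression line."""
    return b0 + b1 * children

print(round(b1, 2), round(b0, 2), round(predict(3), 2))
```

With these three data points the fitted slope and intercept agree with the values used in the slides, and the forecast for 3 children matches the $40,000 figure.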
Standard Error of Estimate
error = \(y - \hat{y}\)
\(S_\varepsilon = \sqrt{\dfrac{SSE}{n - 2}} = \sqrt{\dfrac{\sum (y - \hat{y})^2}{n - 2}}\)
Standard Error of Estimate
(1) = x    (2) = y    (3) = \(\hat{y}\) = 59.5 - 6.5x    (4) = (2) - (3) = \(y - \hat{y}\)    \((y - \hat{y})^2\)
2          48         46.5                     1.5                       2.25
1          52         53                       -1                        1
4          33         33.5                     -.5                       .25
                                                                         SSE = 3.5
\(S_\varepsilon = \sqrt{\dfrac{3.5}{3 - 2}} = \sqrt{3.5} = 1.9\)
Standard Error of Estimate
Actual salary typically $1,900
away from expected salary
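The error table and the standard error of estimate can be reproduced with a short sketch. It uses the regression equation from the example (intercept 59.5, slope -6.5); the variable names are illustrative.

```python
import math

x = [2, 1, 4]
y = [48, 52, 33]
b0, b1 = 59.5, -6.5  # regression equation from the example

predicted = [b0 + b1 * xi for xi in x]             # column (3)
errors = [yi - yh for yi, yh in zip(y, predicted)] # column (4) = (2) - (3)
sse = sum(e ** 2 for e in errors)                  # sum of squared errors

# Divide by n - 2 for simple linear regression
std_error_of_estimate = math.sqrt(sse / (len(x) - 2))

print(predicted, errors, sse, round(std_error_of_estimate, 1))
```

Since salary is measured in thousands of dollars, a value of about 1.9 corresponds to the $1,900 quoted above.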
Coefficient of Determination
\(R^2\) = % of total variation in y that can be
explained by variation in x
Measure of how closely the linear regression
line fits the points in a scatter diagram
\(R^2 = 1\): max. possible value: perfect linear
relationship between y and x (straight line)
\(R^2 = 0\): min. value: no linear relationship
Sources of Variation (V)
Total V = Explained V + Unexplained V
SS = Sum of Squares = V
Total SS = Regression SS + Error SS
SST = SSR + SSE
SSR = Explained V, SSE = Unexplained
Coefficient of Determination
\(R^2 = \dfrac{SSR}{SST}\)
\(R^2 = \dfrac{197}{200.5} = .98\)
Interpretation: 98% of total variation in salary
can be explained by variation in number of
children
\(0 \leq R^2 \leq 1\)
0: No linear relationship since SSR=0
(explained variation =0)
1: Perfect relationship since SSR = SST
(unexplained variation = SSE = 0), but does
not prove cause and effect
R=Correlation Coefficient
Case 1: slope (\(b_1\)) < 0
R < 0
R is negative square root of coefficient of
determination
\(R = -\sqrt{R^2}\)
Our Example
Slope = \(b_1\) = -6.5
\(R^2\) = .98
R = -.99
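The coefficient of determination and the signed correlation coefficient for the salary example can be recomputed from the data. This sketch takes SSE = 3.5 and the slope -6.5 from the example and computes SST directly from the y values, so the unrounded sums differ slightly from the rounded 197 and 200.5 shown earlier. Using `math.copysign` to attach the slope's sign is an illustrative choice.

```python
import math

y = [48, 52, 33]  # salaries in $ thousands
sse = 3.5         # unexplained variation, from the example
b1 = -6.5         # slope, from the example

y_bar = sum(y) / len(y)
sst = sum((yi - y_bar) ** 2 for yi in y)  # total variation
ssr = sst - sse                           # explained variation

r_squared = ssr / sst
# R takes the sign of the slope: negative slope gives negative R
r = math.copysign(math.sqrt(r_squared), b1)

print(round(sst, 2), round(ssr, 2), round(r_squared, 2), round(r, 4))
```

The square root alone is always positive, so the sign has to be supplied from the slope.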
Case 2: Slope > 0
R is positive square root of coefficient of
determination
Ex: \(R^2\) = .49
R = .70
R has no interpretation
R overstates relationship
Caution
Nonlinear relationship (parabola, hyperbola,
etc.) can NOT be measured by \(R^2\)
In fact, you could get \(R^2 = 0\) with a nonlinear
graph on a scatter diagram
Summary: Correlation Coefficient
Case 1: If \(b_1 > 0\), R is the positive square root
of the coefficient of determination
Ex#1: y = 4+3x, \(R^2\) = .36: R = +.60
Case 2: If \(b_1 < 0\), R is the negative square
root of the coefficient of determination
Ex#2: y = 80-10x, \(R^2\) = .49: R = -.70
NOTE! Ex#2 has stronger relationship, as
measured by coefficient of determination
Extreme Values
R=+1: perfect positive correlation
R= -1: perfect negative correlation
R=0: zero correlation
MS Excel Output
Correlation Coefficient (-0.9912): Note
that you need to change the sign because
the sign of slope (\(b_1\)) is negative (-6.5)
Coefficient of Determination
Standard Error of Estimate
Regression Coefficient
Top Ten #6
What Distribution to Use?
Use Binomial Distribution If:
Random variable (x) is number of successes in n
trials
Each trial is success or failure
Independent trials
Constant probability of success on each trial
Sampling with replacement (in practice, people
may use binomial w/o replacement, but theory is
with replacement)
Success vs. Failure
The binomial experiment can result in only
one of two possible outcomes:
Male vs. Female
Defective vs. Non-defective
Yes or No
Pass (8 or more right answers) vs. Fail (fewer
than 8)
Buy drink (21 or over) vs. Cannot buy drink
Binomial Is Discrete
Integer values
0, 1, 2, ..., n
Binomial is often skewed, but may be symmetric
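A minimal sketch of a binomial calculation, based on the pass/fail case above (pass = 8 or more right answers). The number of questions (10) and the probability of a right answer (0.5) are hypothetical values chosen for illustration; the slides do not give them.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n_questions = 10  # hypothetical number of trials
p_correct = 0.5   # hypothetical constant probability of success

# Pass = 8, 9, or 10 right answers
p_pass = sum(binomial_pmf(k, n_questions, p_correct)
             for k in range(8, n_questions + 1))

print(p_pass)
```

With a success probability of 0.5 the distribution is symmetric; other values of p skew it toward one end.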
Normal Distribution
Continuous, bell-shaped, symmetric
Mean=median=mode
Measurement (dollars, inches, years)
Cumulative probability under normal curve : use
Z table if you know population mean and
population standard deviation
Sample mean: use Z table if you know
population standard deviation and either normal
population or n > 30
t Distribution
Continuous, mound-shaped, symmetric
Applications similar to normal
More spread out than normal
Use t if normal population but population
standard deviation not known
Degrees of freedom = df = n-1 if estimating the
mean of one population
t approaches z as df increases
Normal or t Distribution?
Use t table if normal population but population
standard deviation (\(\sigma\)) is not known
If you are given the sample standard deviation
(s), use t table, assuming normal population
Top Ten #3
Confidence Intervals: Mean and Proportion
Confidence Interval
A confidence interval is a range of values within
which the population parameter is expected
to occur.
Factors for Confidence Interval
The factors that determine the width of a
confidence interval are:
1. The sample size, n
2. The variability in the population, usually
estimated by standard deviation.
3. The desired level of confidence.
Confidence Interval: Mean
Use normal distribution (Z table) if:
population standard deviation (sigma)
known and either (1) or (2):
(1) Normal population
(2) Sample size > 30
Confidence Interval: Mean
If normal table, then
\(\mu = \dfrac{\sum x}{n} \pm z \dfrac{\sigma}{\sqrt{n}}\)
Normal Table
Tail = .5(1 - confidence level)
NOTE! Different statistics texts have different
normal tables
This review uses the tail of the bell curve
Ex: 95% confidence: tail = .5(1-.95) = .025
\(Z_{.025}\) = 1.96
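The tail rule can be sketched in code instead of looking up a table. This assumes Python's standard `statistics.NormalDist` in place of a printed Z table; the function name `z_for_confidence` is illustrative.

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """Z value that leaves tail = .5(1 - confidence) in the upper tail."""
    tail = 0.5 * (1 - confidence)
    return NormalDist().inv_cdf(1 - tail)

print(round(z_for_confidence(0.95), 2))
```

Because the curve is symmetric, the same Z value with a minus sign cuts off the lower tail.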
Example
n = 49, \(\sum x\) = 490, \(\sigma\) = 2, 95% confidence
\(\mu = \dfrac{490}{49} \pm 1.96 \dfrac{2}{\sqrt{49}} = 10 \pm 0.56\)
9.44 < \(\mu\) < 10.56
Another Example
One of the SOM professors wants to
estimate the mean number of hours
worked per week by students. A sample
of 49 students showed a mean of 24
hours. It is assumed that the population
standard deviation is 4 hours. What is
the population mean?
Another Example contd
95 percent confidence interval for the
population mean:
\(\bar{X} \pm 1.96 \dfrac{\sigma}{\sqrt{n}} = 24.00 \pm 1.96 \dfrac{4}{\sqrt{49}} = 24.00 \pm 1.12\)
The confidence limits range from 22.88 to
25.12. We estimate with 95 percent
confidence that the average number of hours
worked per week by students lies between
these two values.
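Both confidence-interval examples can be checked with one small function. This is a sketch assuming a known population standard deviation and Python's standard `statistics.NormalDist` in place of the Z table; the function name and argument names are illustrative.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for the mean when sigma is known (Z table case)."""
    z = NormalDist().inv_cdf(1 - 0.5 * (1 - confidence))
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hours worked per week: mean 24, sigma 4, n = 49
low, high = z_confidence_interval(24, 4, 49)
print(round(low, 2), round(high, 2))

# Earlier example: sum of x = 490, n = 49, sigma = 2
low2, high2 = z_confidence_interval(490 / 49, 2, 49)
print(round(low2, 2), round(high2, 2))
```

A larger n or a smaller sigma narrows the interval, while a higher confidence level widens it, matching the three factors listed earlier.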
Confidence Interval: Mean
t distribution
Use if normal population but population
standard deviation (\(\sigma\)) not known
If you are given the sample standard
deviation (s), use t table, assuming normal
population
If one population, n-1 degrees of freedom
\(\mu = \dfrac{\sum x}{n} \pm t_{n-1} \dfrac{s}{\sqrt{n}}\)
n
X
z
o
6. Conclusion: Do not reject the null hypothesis.
We cannot conclude the mean is greater than 16
ounces.
Example contd