Week 03

E370
2/14/2016
Descriptive Statistics
Statistics
  Sampling Methods
    Probability: Simple random, Pseudo random, Stratified, Systematic, Cluster
    Non-probability: Convenience, Judgment
  Data Types
    Qualitative or Categorical: Nominal, Ordinal
    Quantitative or Numerical: Discrete, Continuous
  Graphical Types
    Qualitative: Bar/column, Pie, Pareto Diagram
    Quantitative: Histogram, Frequency polygon, Ogive, Stem-n-Leaf Plot
  Descriptive Statistics
    Qualitative: Mode (Nominal & Ordinal), Median (Ordinal only)
    Quantitative
      Center: Mean, Median, Mode
      Spread: Range, Variance, Standard Deviation, Coefficient of Variation
      Shape: Skewness, Symmetric, Uni-, bi-modal, etc.
Measures of Center or Central Tendency
the location of the data, the middle of the data, the
balance point of the data, the most common element in
the data.
Spread
how similar to one another are the observations within
the data set, how distant they are from one another, how
far observations are from some fixed point.
Shape
how the mass of the data is arranged over the distribution
Dimensions of Data
General statistical symbol conventions:
Symbols differ for populations and samples
Populations are described by parameters
Parameters are represented by Greek letters
For example, $\mu$ = mu = population mean
$\sigma$ = sigma = population standard deviation
N=Population Size
Samples are described by statistics
Statistics are represented by Latin letters
For example, $\bar{x}$ = X-bar = sample mean
s = sample standard deviation
n = sample size
Symbols
Mean
The arithmetic average, the center of balance
for the data.
Median
The middle of the ordered data set
Mode
The most common value
Summary Methods--Center
The sum of all the observations in a data set
divided by the number of observations.

$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$ (population)    $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ (sample)

Any real number

Unique
Inclusive
$\sum_{i=1}^{N} (X_i - \mu) = 0$ or $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
Balanced
Sensitive
Mean Characteristics
It is that value that divides the data set into two parts of
equal size with respect to the number of observations.
Specifically the value than which 50% of the observations
are larger and 50% are smaller.

Position of the median in the ordered data: $L_{50} = (n + 1) \cdot \frac{50}{100}$
Unique
Any real number
More applicable
Insensitive
Exclusive
Median Characteristics
The value with the highest frequency.
Universal use
Simple
Only hope for nominal data
Insensitive
Highly unstable
Doesn't always exist.
Sometimes there is more than one mode.
Small changes in observations can dramatically change the
mode.
Mode Characteristics
Summary Statistic Behavior
Xi          Xi + 7        2Xi
1           1+7=8         2*1=2
2           9             4
3           10            6
3           10            6
4           11            8
5           12            10
$\mu$ = 3   $\mu$ = 10    $\mu$ = 6
Md = 3      Md = 10       Md = 6
Mo = 3      Mo = 10       Mo = 6
Irritating Data
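
The table above can be reproduced with a short Python sketch. It uses only the standard library; the helper name `center` is an illustrative choice, not something from the slides. It suggests that adding a constant shifts all three measures of center by that constant, and multiplying by a constant scales them by it.

```python
from statistics import mean, median, multimode

data = [1, 2, 3, 3, 4, 5]          # the Xi column from the table
shifted = [x + 7 for x in data]    # Xi + 7
scaled = [2 * x for x in data]     # 2Xi


def center(values):
    """Return (mean, median, list of modes) for a list of numbers."""
    return mean(values), median(values), multimode(values)


for label, values in [("Xi", data), ("Xi + 7", shifted), ("2Xi", scaled)]:
    m, md, mo = center(values)
    print(f"{label:7} mean={m} median={md} mode={mo}")
```

`multimode` returns a list because, as noted on the Mode slide, a data set can have more than one mode.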

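Ungrouped data: $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$    $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

Grouped data: $\mu = \frac{\sum_{i} f_i M_i}{N}$    $\bar{x} = \frac{\sum_{i} f_i M_i}{n}$, where $M_i$ is the class mark and $f_i$ the frequency of class i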

Compare these formulas.
Other Mean Formulas

Age      Freq. (fi)   Rel. Freq.   Cum. Freq.   Mark
10~15    1            0.03         1            12.5
15~20    11           0.31         12           17.5
20~25    14           0.40         26           22.5
25~30    5            0.14         31           27.5
30~35    3            0.09         34           32.5
35~40    1            0.03         35           37.5
Age of Mother at birth of 1st child

Age      Class Mark   Absolute Frequency   Product
10~15    12.5         1                    12.5
15~20    17.5         11                   192.5
20~25    22.5         14                   315
25~30    27.5         5                    137.5
30~35    32.5         3                    97.5
35~40    37.5         1                    37.5
Sum                                        792.5
Sum/n                                      22.64
Estimating Means of Grouped
Data--Absolute Frequencies
Age      Class Mark   Relative Frequency   Product
10~15    12.5         0.03                 0.36
15~20    17.5         0.31                 5.50
20~25    22.5         0.40                 9.00
25~30    27.5         0.14                 3.93
30~35    32.5         0.09                 2.79
35~40    37.5         0.03                 1.07
Sum                                        22.64
Products use the unrounded relative frequencies fi/n; the Relative Frequency column is rounded to two decimals.
Estimating Means of Grouped Data--Relative Frequencies
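
As a check on the two grouped-data tables, here is a minimal Python sketch that estimates the mean both ways from the class marks and frequencies. The variable names are illustrative. Both routes should give the same estimate, since a relative frequency is just the absolute frequency divided by n.

```python
# Class marks and absolute frequencies from the "Age of Mother" table.
marks = [12.5, 17.5, 22.5, 27.5, 32.5, 37.5]
freqs = [1, 11, 14, 5, 3, 1]

n = sum(freqs)  # total number of observations

# Absolute frequencies: sum of (mark * frequency), divided by n.
total = sum(m * f for m, f in zip(marks, freqs))
mean_abs = total / n

# Relative frequencies: sum of (mark * relative frequency).
rel_freqs = [f / n for f in freqs]
mean_rel = sum(m * r for m, r in zip(marks, rel_freqs))

print(total, round(mean_abs, 2), round(mean_rel, 2))
```

The result is an estimate because every observation in a class is treated as if it sat at the class mark.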
Range
How far apart the highest value and the lowest value
in the set are, thus, Range = (Max-Min)
Variance
It measures the average squared distance an
observation is from the mean
Standard Deviation
The square root of the variance; roughly the typical distance of an observation from the mean
Coefficient of Variation
A measure of relative dispersion, dispersion relative to
the mean.
Summary Methods--Spread
Simple and intuitive
It doesn't use all the data, so it tells nothing
about how the data falls between the high and
the low point
It is sensitive to extreme values, just like the
mean
Range Characteristics
Interquartile Range
The distance between the first and third quartiles
of the data set. IQR = Q3 - Q1
A quartile is a percentile, but instead of dividing
the data into 100 levels it divides it into 4
The first quartile is the 25th percentile
The third quartile is the 75th percentile
The IQR ignores the smallest 25% and the
largest 25% of the data, so outliers do not affect it.
A Range Alternative
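Ordered data (n = 40), read down each column:
0    10   20   35
1    10   21   35
2    12   22   35
4    13   22   38
5    13   22   39
5    13   24   45
6    14   24   50
7    16   25   56
9    17   26   60
9    19   33   63
L25 = (40+1)*(25/100) = 10.25
10th Obs = 9, 11th Obs = 10
10 - 9 = 1, 1*.25 = .25, Q1 = 9 + .25 = 9.25
L75 = (40+1)*(75/100) = 30.75
30th Obs = 33, 31st Obs = 35
35 - 33 = 2, 2*.75 = 1.5, Q3 = 33 + 1.5 = 34.5
IQR = 34.5 - 9.25 = 25.25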
Calculate the IQR
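
The location method used above, L = (n + 1) * (p / 100) followed by interpolation between the two neighbouring observations, can be sketched in Python as follows. The function name `percentile_by_location` is an illustrative choice. Other textbooks and software use slightly different quartile rules, so results may differ from this one.

```python
import math


def percentile_by_location(values, p):
    """Percentile p (0-100) using the location rule L = (n + 1) * p / 100."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100      # 1-based position, possibly fractional
    lower = math.floor(location)      # position of the observation just below
    fraction = location - lower
    if lower < 1:                     # location falls before the first observation
        return ordered[0]
    if lower >= n:                    # location falls at or past the last observation
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


data = [
    0, 1, 2, 4, 5, 5, 6, 7, 9, 9,
    10, 10, 12, 13, 13, 13, 14, 16, 17, 19,
    20, 21, 22, 22, 22, 24, 24, 25, 26, 33,
    35, 35, 35, 38, 39, 45, 50, 56, 60, 63,
]

q1 = percentile_by_location(data, 25)
q3 = percentile_by_location(data, 75)
iqr = q3 - q1
print(q1, q3, iqr)
```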
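$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$ (population)    $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ (sample)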
Why are there two formulas?

Conceptually, a sample does not include all the
information that a population does
Samples tend to underestimate the variability
found in a population.
If we divide by a slightly smaller number (n - 1 instead of n) we
get a slightly larger variance.
Variance
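
To make the difference between the two variance formulas concrete, the following Python sketch computes both by hand and compares them with the standard library's `pvariance` (divide by N) and `variance` (divide by n - 1). The six-value data set is borrowed from the "Summary Statistic Behavior" slides purely for illustration.

```python
from statistics import pvariance, variance

data = [1, 2, 3, 3, 4, 5]

mean_value = sum(data) / len(data)
# Sum of squared deviations from the mean.
sum_sq = sum((x - mean_value) ** 2 for x in data)

pop_var = sum_sq / len(data)          # population formula: divide by N
samp_var = sum_sq / (len(data) - 1)   # sample formula: divide by n - 1

print(round(pop_var, 3), round(samp_var, 3))
print(pvariance(data), variance(data))  # library equivalents
```

Dividing by n - 1 always gives the larger of the two values, which is the correction for underestimation described above.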
It is a unique value and uses all information in
the data set
It has desirable mathematical properties
It is an average, thus, it has the same failings
as the mean, that is, sensitive to outliers
It is difficult to interpret because its units are the square of the data's units
Variance Characteristics
$\sigma = \sqrt{\sigma^2}$ (population)    $s = \sqrt{s^2}$ (sample)
A close relative of the variance: its square root

It is no easier to calculate--you have to get a
variance first--it is just easier to interpret.
It is measured in the same units as the data
is measured
It is also sensitive to outliers
Standard Deviation
Calculating relative frequencies is the closest to
dispersion one can get with categorical data.
What if you want to compare two data sets and
they are in different units, or they are in
different magnitudes?
The Coefficient of Variation (CV) is a measure
of relative dispersion.
Is everything covered?

$CV = \frac{\sigma}{\mu} \times 100$ (population)    $CV = \frac{s}{\bar{x}} \times 100$ (sample)

Eliminates units and enables comparisons
Eliminates the effect of differences of
magnitude.
Often the best choice for comparative
dispersion
Concerns
Not usable for data with a 0 mean
Inappropriate for data that can be negative.
Coefficient of Variation
Summary Statistic Behavior
Xi                   Xi + 7               2Xi
1                    8                    2
2                    9                    4
3                    10                   6
3                    10                   6
4                    11                   8
5                    12                   10
$\mu$ = 3            $\mu$ = 10           $\mu$ = 6
Md = 3               Md = 10              Md = 6
Mo = 3               Mo = 10              Mo = 6
range = 4            range = 4            range = 8
$\sigma^2$ = 1.667   $\sigma^2$ = 1.667   $\sigma^2$ = 6.667
$\sigma$ = 1.291     $\sigma$ = 1.291     $\sigma$ = 2.582
More irritated data
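
The spread rows of this table can be checked with a short Python sketch. The helper name `spread` is illustrative, and the population formulas (`pvariance`, `pstdev`) are used because they match the values in the table. It suggests that adding a constant leaves every measure of spread unchanged, while doubling the data doubles the range and standard deviation and quadruples the variance.

```python
from statistics import pstdev, pvariance

data = [1, 2, 3, 3, 4, 5]          # the Xi column
shifted = [x + 7 for x in data]    # Xi + 7
scaled = [2 * x for x in data]     # 2Xi


def spread(values):
    """Return (range, population variance, population standard deviation)."""
    return max(values) - min(values), pvariance(values), pstdev(values)


for label, values in [("Xi", data), ("Xi + 7", shifted), ("2Xi", scaled)]:
    rng, var, sd = spread(values)
    print(f"{label:7} range={rng} variance={var:.3f} sd={sd:.3f}")
```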
Skewness
Measures the degree of asymmetry in a data set.
Pearson's 2nd Skewness Coefficient: $Sk = \frac{3(\bar{x} - Md)}{s}$

Ranges from -3 to 3 usually.
Reflects the general result that
when $\bar{x}$ > Md, the data is right skewed, Sk>0
when $\bar{x}$ < Md, the data is left skewed, Sk<0
when $\bar{x}$ = Md, the data is un-skewed, i.e., symmetric,
Sk=0
This is not a rule, rather a rule of thumb.
Statisticians know that the size of the sample and the
value of the mode affect the skewness.
Shape Methods
[Figures: Right Skewed Histogram, Left Skewed Histogram, Symmetric Histogram]
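Weights:              Boys      Men
Mean                  54.78     172.52
Median                53        171.5
Mode                  52        171
Standard Deviation    7.93      21.81
Sample Variance       62.91     475.48
Skewness              0.67      0.14
Range                 32        103
Minimum               41        126
Maximum               73        229
Sum                   2739      8626
Count                 50        50
CV                    14.5%     12.6%

Which group of males has the more uniform weight? How do you know?
$CV_{Boys} = \frac{7.93}{54.78} \times 100 = 14.5\%$
$CV_{Men} = \frac{21.81}{172.52} \times 100 = 12.6\%$
The men: their CV is smaller.

Which group is least symmetric?
$Sk_{Boys} = \frac{3(54.78 - 53)}{7.93} = 0.67$
$Sk_{Men} = \frac{3(172.52 - 171.5)}{21.81} = 0.14$
The boys: their skewness coefficient is farther from 0.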
Some Descriptive Statistics
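
The two comparisons on this slide can be reproduced from the summary statistics alone. The sketch below assumes the sample forms of the formulas (s and the sample mean); the function and dictionary names are illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Relative dispersion, as a percentage of the mean."""
    return std_dev / mean * 100


def pearson_skewness(mean, median, std_dev):
    """Pearson's 2nd skewness coefficient: 3 * (mean - median) / s."""
    return 3 * (mean - median) / std_dev


# Summary statistics taken from the Boys / Men weight table.
boys = {"mean": 54.78, "median": 53, "sd": 7.93}
men = {"mean": 172.52, "median": 171.5, "sd": 21.81}

for label, g in [("Boys", boys), ("Men", men)]:
    cv = coefficient_of_variation(g["sd"], g["mean"])
    sk = pearson_skewness(g["mean"], g["median"], g["sd"])
    print(f"{label}: CV = {cv:.1f}%, Sk = {sk:.2f}")
```

Comparing the raw standard deviations (7.93 against 21.81) would suggest the opposite ranking from the CV, which is why a relative measure is used when the means differ this much.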
Chebyshev's Theorem
$\%\text{OBS} \ge 1 - \frac{1}{k^2}$
k = number of standard deviations, k > 1
Universal application
Provides minimum guarantee

Empirical or Normal Rule
$\pm 1\sigma$: find about 68% of observations
$\pm 2\sigma$: find about 95% of observations
$\pm 3\sigma$: find about 99.7% of observations
Only bell-shaped and symmetric distributions.
Only integer values of k
Methods for estimating probabilities

A Chebyshev example: If k=1.5,

$\%\text{OBS} \ge 1 - \frac{1}{k^2} = 1 - \frac{1}{1.5^2} = 1 - \frac{1}{2.25} = 1 - 0.444 = 0.556$
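What minimum percent of observations does
Chebyshev predict for $\pm 2\sigma$?

$\%\text{OBS} \ge 1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 75\%$

Within how many standard deviations will at
least 44% of observations lie?

$\%\text{OBS} \ge 0.44 = 1 - \frac{1}{k^2}$

$\frac{1}{k^2} = 1 - 0.44 = 0.56$

$k = \sqrt{\frac{1}{0.56}} = \sqrt{1.786} = 1.336$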
Density Estimates: Chebyshev
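
Both directions of the Chebyshev calculation, from k to a minimum percentage and from a target percentage back to k, can be sketched in Python. The function names are illustrative, not from the slides.

```python
import math


def chebyshev_min_fraction(k):
    """Minimum fraction of observations within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return 1 - 1 / k ** 2


def chebyshev_k_for_fraction(p):
    """Number of standard deviations that guarantees at least fraction p."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return math.sqrt(1 / (1 - p))


print(round(chebyshev_min_fraction(1.5), 3))     # the k = 1.5 example
print(chebyshev_min_fraction(2))                 # within 2 standard deviations
print(round(chebyshev_k_for_fraction(0.44), 3))  # k for at least 44%
```

Because the theorem makes no assumption about shape, these are minimum guarantees; a bell-shaped distribution will typically hold far more than 75% of its observations within 2 standard deviations.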

The weights of a part Samsung Electronics receives from suppliers have
a mean of 40 micrograms, a standard deviation of 3 micrograms, and a
bell-shaped symmetric distribution.
Approximately what percent of parts weigh between 34 and 37 micrograms?
Density Estimates: Empirical Rule

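Use the Empirical Rule to isolate the area shaded in red, which is
the percentage of parts weighing between 34 and 37 mcg.
34 mcg is 2 standard deviations below the mean ($\mu - 2\sigma$) and
37 mcg is 1 standard deviation below it ($\mu - 1\sigma$).
The red area is half the difference in area between $\pm 1\sigma$ and $\pm 2\sigma$.
The area within $\pm 1\sigma$ is 68%.
The area within $\pm 2\sigma$ is 95%.
The difference in area is (95% - 68%) = 27%.
Half of 27% is 13.5%.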

The weights of a part Samsung Electronics receives from suppliers have
a mean of 40 micrograms, a standard deviation of 3 micrograms, and a
bell-shaped symmetric distribution.
Samsung rejects parts that weigh more than 46 micrograms or less
than 37 micrograms.
Approximately what percentage of parts does Samsung routinely accept?

There are two ways to think about this. The first is adding the area
of those parts accepted, using the Empirical Rule.
The area shaded in blue is the percentage of accepted parts.
37 mcg is $\mu - 1\sigma$ and 46 mcg is $\mu + 2\sigma$.
The blue is the area for $\pm 1\sigma$, which is 68%, plus half the
difference in area between $\pm 1\sigma$ and $\pm 2\sigma$.
The difference in area is (95% - 68%) = 27%, half of which is
(27%/2) = 13.5%.
Total area is 68% + 13.5% or 81.5%.

There are two ways to think about this, and the second is subtracting
from 100% the area of those rejected, using the Empirical Rule.
The area shaded in black is the percentage of rejected parts.
The black area is the area outside $\pm 2\sigma$, 100% - 95% = 5%, plus
half the difference in area between $\pm 1\sigma$ and $\pm 2\sigma$.
The difference in area is (95% - 68%) = 27%, half of which is
(27%/2) = 13.5%.
Total area of black is 5% + 13.5% = 18.5%.
The blue area is 100% - 18.5% or 81.5%.

The weights of a part Samsung Electronics receives from suppliers have
a mean of 40 micrograms, a standard deviation of 3 micrograms, and a
bell-shaped symmetric distribution.
What is the approximate range of the weights of this particular part?

Think of the standard deviation as a unit of distance along the
number line, here $1\sigma$ = 3 mcg.
The Empirical Rule says that virtually 100% of observations will fall
within $\pm 3\sigma$.
The range is the distance between the maximum value and the minimum.
Since $\pm 3\sigma$ spans $6\sigma$, the range is approximately
6*3 mcg or 18 mcg.
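
The three Samsung questions above follow one pattern: convert each weight to a whole number of standard deviations from the mean, then add or subtract Empirical Rule areas. A minimal Python sketch of that pattern follows; the names `WITHIN`, `sigmas_from_mean` and `area_between` are illustrative, and the sketch only handles cut-offs that land exactly on 0, 1, 2 or 3 standard deviations.

```python
# Approximate share of observations within +/- k standard deviations
# of the mean, as stated by the Empirical Rule.
WITHIN = {0: 0.0, 1: 0.68, 2: 0.95, 3: 0.997}


def sigmas_from_mean(x, mean, sd):
    """Signed whole number of standard deviations between x and the mean."""
    k = (x - mean) / sd
    if k != int(k) or abs(k) > 3:
        raise ValueError("only cut-offs at 0, 1, 2 or 3 standard deviations are covered")
    return int(k)


def area_between(low, high, mean, sd):
    """Approximate share of observations between the values low and high."""
    def signed_half_band(k):
        # Area between the mean and k standard deviations; negative below the mean.
        half = WITHIN[abs(k)] / 2
        return half if k >= 0 else -half

    k_low = sigmas_from_mean(low, mean, sd)
    k_high = sigmas_from_mean(high, mean, sd)
    return signed_half_band(k_high) - signed_half_band(k_low)


mean, sd = 40, 3  # micrograms, from the Samsung example

between_34_and_37 = area_between(34, 37, mean, sd)  # about 0.135
accepted = area_between(37, 46, mean, sd)           # about 0.815
approx_range = 6 * sd                               # +/- 3 sd spans 6 sd

print(f"{between_34_and_37:.1%} {accepted:.1%} {approx_range} mcg")
```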

Methods for One Variable
Three dimensions of data
Measures of Center
Mean, median and mode
Measures of Spread
Range, IQR, variance, standard deviation and coefficient of
variation.
Measures and concepts for Shape
Pearson's skewness coefficient
Symmetric, right or positive skewed, left or negative skewed
Estimating probabilities
Chebyshev's Theorem and the Empirical Rule
Descriptive Statistics
