
E370

2/14/2016
Descriptive Statistics
Statistics
  Sampling Methods
    Probability: Simple random, Pseudo random, Stratified, Systematic, Cluster
    Non-probability: Convenience, Judgment
  Data Types
    Qualitative or Categorical: Nominal, Ordinal
    Quantitative or Numerical: Discrete, Continuous
  Graphical Types
    Qualitative: Bar/column, Pie, Pareto Diagram
    Quantitative: Histogram, Frequency polygon, Ogive, Stem-n-Leaf Plot
  Descriptive Statistics
    Qualitative: Mode (Nominal & Ordinal), Median (Ordinal only)
    Quantitative
      Center: Mean, Median, Mode
      Spread: Range, Variance, Standard Deviation, Coefficient of Variation
      Shape: Skewness, Symmetric, Uni-, bi-modal, etc.
Measures of Center or Central Tendency
the location of the data, the middle of the data, the
balance point of the data, the most common element in
the data.
Spread
how similar to one another are the observations within
the data set, how distant they are from one another, how
far observations are from some fixed point.
Shape
how the mass of the data is arranged over the distribution

Dimensions of Data
General statistical symbol conventions:
Symbols differ for populations and samples
Populations are described by parameters
Parameters are represented by Greek letters
For example, μ = mu = population mean
σ = sigma = population standard deviation
N=Population Size
Samples are described by statistics
Statistics are represented by Latin letters
For example, x̄ = X-bar = sample mean
s = sample standard deviation
n = sample size

Symbols
Mean
The arithmetic average, the center of balance
for the data.

Median
The middle of the ordered data set

Mode
The most common value

Summary Methods--Center
The sum of all the observations in a data set
divided by the number of observations.

$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$ (population)    $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ (sample)

Any real number


Unique
Inclusive
= or
Balanced
Sensitive

Mean Characteristics
It is the value that divides the data set into two parts of
equal size with respect to the number of observations.
Specifically, 50% of the observations are larger than this
value and 50% are smaller.

= +

Unique

Any real number

More applicable

Insensitive

Exclusive

Median Characteristics
The value with the highest frequency.
Universal use
Simple
Only hope for nominal data
Insensitive

Highly unstable
Doesn't always exist.
Sometimes there is more than one mode.
Small changes in observations can dramatically change the
mode.

Mode Characteristics
Summary Statistic Behavior
Xi        Xi + 7     2Xi
1         1+7=8      2*1=2
2         9          4
3         10         6
3         10         6
4         11         8
5         12         10
μ = 3     μ = 10     μ = 6
Md = 3    Md = 10    Md = 6
Mo = 3    Mo = 10    Mo = 6

Irritating Data
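
The table can be checked with a short script. This is only a sketch using Python's standard `statistics` module; the variable names are chosen here for illustration and are not part of the slides.

```python
from statistics import mean, median, mode

data = [1, 2, 3, 3, 4, 5]            # the Xi column
shifted = [x + 7 for x in data]      # the Xi + 7 column
scaled = [2 * x for x in data]       # the 2Xi column

# Each measure of center moves with the data:
# adding 7 adds 7 to it, doubling the data doubles it.
for label, values in [("Xi", data), ("Xi + 7", shifted), ("2Xi", scaled)]:
    print(label, mean(values), median(values), mode(values))
```

Adding a constant shifts the mean, median and mode by that constant, and multiplying by a constant multiplies all three by it.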

= =


= =

Compare these formulas.

Other Mean Formulas


Age      Freq. (fi)   Rel. Freq.   Cum. Freq.   Class Mark
10~15     1           0.03          1           12.5
15~20    11           0.31         12           17.5
20~25    14           0.40         26           22.5
25~30     5           0.14         31           27.5
30~35     3           0.09         34           32.5
35~40     1           0.03         35           37.5

Age of Mother at birth of 1st child


Age      Class Mark   Absolute Frequency   Product
10~15    12.5          1                    12.5
15~20    17.5         11                   192.5
20~25    22.5         14                   315
25~30    27.5          5                   137.5
30~35    32.5          3                    97.5
35~40    37.5          1                    37.5
Sum                                        792.5
Sum/n                                       22.64
Estimating Means of Grouped Data--Absolute Frequencies
Age      Class Mark   Relative Frequency   Product
10~15    12.5         0.03                  0.36
15~20    17.5         0.31                  5.50
20~25    22.5         0.40                  9.00
25~30    27.5         0.14                  3.93
30~35    32.5         0.09                  2.79
35~40    37.5         0.03                  1.07
Sum                                        22.64

Estimating Means of Grouped Data


Relative Frequencies
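
Both grouped-data estimates can be reproduced in a few lines. The sketch below assumes the class marks and frequencies from the tables above; the variable names are illustrative only.

```python
marks = [12.5, 17.5, 22.5, 27.5, 32.5, 37.5]   # class marks (midpoints)
freqs = [1, 11, 14, 5, 3, 1]                    # absolute frequencies
n = sum(freqs)                                  # total number of mothers

# Absolute frequencies: sum of (frequency * class mark), divided by n.
mean_abs = sum(f * m for f, m in zip(freqs, marks)) / n

# Relative frequencies: sum of (relative frequency * class mark).
rel_freqs = [f / n for f in freqs]
mean_rel = sum(r * m for r, m in zip(rel_freqs, marks))

print(round(mean_abs, 2), round(mean_rel, 2))
```

The two routes give the same estimate because dividing each frequency by n before summing is the same as dividing the total by n afterwards.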
Range
How far apart the highest value and the lowest value
in the set are, thus, Range = (Max-Min)
Variance
It measures the average squared distance an
observation is from the mean
Standard Deviation
The square root of the variance; roughly the typical distance
of an observation from the mean
Coefficient of Variation
A measure of relative dispersion, dispersion relative to
the mean.

Summary Methods--Spread
Simple and intuitive

It doesn't use all the data, so it tells nothing
about how the data fall between the high and
the low point

It is sensitive to extreme values, just like the
mean

Range Characteristics
Interquartile Range
The distance between the first and third quartiles
of the data set. IQR = Q3 - Q1
A quartile is a percentile, but instead of dividing
the data into 100 levels it divides it into 4
The first quartile is the 25th percentile
The third quartile is the 75th percentile
The IQR cuts off the smallest 25% and the
largest 25% of the data, removing outliers.

A Range Alternative
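Data (n = 40, sorted, read down each column):
 0   10   20   35
 1   10   21   35
 2   12   22   35
 4   13   22   38
 5   13   22   39
 5   13   24   45
 6   14   24   50
 7   16   25   56
 9   17   26   60
 9   19   33   63

L25 = (40+1)*(25/100) = 10.25
  10th Obs = 9, 11th Obs = 10
  10-9 = 1, 1*.25 = .25, Q1 = 9 + .25 = 9.25
L75 = (40+1)*(75/100) = 30.75
  30th Obs = 33, 31st Obs = 35
  35-33 = 2, 2*.75 = 1.5, Q3 = 33 + 1.5 = 34.5
IQR = 34.5 - 9.25 = 25.25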
Calculate the IQR
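
The location method used above, L = (n + 1) * (p / 100) followed by interpolation between neighboring observations, can be sketched as follows. The function name `percentile` is a hypothetical helper written for this illustration.

```python
def percentile(sorted_values, p):
    """Return the p-th percentile using the location L = (n + 1) * p / 100."""
    n = len(sorted_values)
    loc = (n + 1) * p / 100        # 1-based, possibly fractional, position
    lower = int(loc)               # position of the observation just below
    frac = loc - lower             # how far to move toward the next observation
    if lower < 1:                  # location falls before the first observation
        return sorted_values[0]
    if lower >= n:                 # location falls at or after the last one
        return sorted_values[-1]
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + frac * (above - below)

data = sorted([
    0, 1, 2, 4, 5, 5, 6, 7, 9, 9,
    10, 10, 12, 13, 13, 13, 14, 16, 17, 19,
    20, 21, 22, 22, 22, 24, 24, 25, 26, 33,
    35, 35, 35, 38, 39, 45, 50, 56, 60, 63,
])

q1 = percentile(data, 25)
q3 = percentile(data, 75)
iqr = q3 - q1
print(q1, q3, iqr)
```

Other software may use a different quartile convention and return slightly different values, so it is worth checking which rule a given tool applies.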
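$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$ (population)    $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ (sample)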

Why are there two formulas?


Conceptually, a sample does not include all the
information that a population does
Samples tend to UNDER estimate the variability
found in a population.
If we divide by a slightly smaller number (n - 1) we
get a slightly larger number.

Variance
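
As a quick illustration of the two divisors, the sketch below applies both formulas to the small data set 1, 2, 3, 3, 4, 5 used elsewhere in these slides. It relies on `pvariance` (divide by N) and `variance` (divide by n - 1) from Python's standard `statistics` module.

```python
from statistics import pvariance, variance

observations = [1, 2, 3, 3, 4, 5]

# Squared deviations from the mean of 3 are 4, 1, 0, 0, 1, 4, which sum to 10.
population_variance = pvariance(observations)   # 10 / 6
sample_variance = variance(observations)        # 10 / 5

print(round(population_variance, 3), sample_variance)
```

Dividing by n - 1 gives the larger value, which is the adjustment the slide describes for samples.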
It is a unique value and uses all information in
the data set

It has desirable mathematical properties

It is an average, thus, it has the same failings
as the mean, that is, sensitive to outliers

It is difficult to interpret

Variance Characteristics
$\sigma = \sqrt{\sigma^2}$ (population)    $s = \sqrt{s^2}$ (sample)

Obviously a relative of the variance


It is no easier to calculate--you have to get a
variance first--it is just easier to interpret.
It is measured in the same units as the data
is measured
It is also sensitive to outliers

Standard Deviation
Calculating relative frequencies is the closest to
dispersion one can get with categorical data.
What if you want to compare two data sets and
they are in different units, or they are in
different magnitudes?
The Coefficient of Variation (CV) is a measure
of relative dispersion.

Is everything covered?

$CV = \frac{\sigma}{\mu} \times 100$ (population)    $CV = \frac{s}{\bar{x}} \times 100$ (sample)

Eliminates units and enables comparisons
Eliminates the effect of differences of
magnitude.
Often the best choice for comparative
dispersion
Concerns
Not usable for data with a 0 mean
Inappropriate for data that can be negative.

Coefficient of Variation
Summary Statistic Behavior
Xi            Xi + 7        2Xi
1             8             2
2             9             4
3             10            6
3             10            6
4             11            8
5             12            10
μ = 3         μ = 10        μ = 6
Md = 3        Md = 10       Md = 6
Mo = 3        Mo = 10       Mo = 6
range = 4     range = 4     range = 8
σ² = 1.667    σ² = 1.667    σ² = 6.667
σ = 1.291     σ = 1.291     σ = 2.582
More irritated data
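
The spread rows of the table can be verified the same way. This sketch uses the population formulas (`pvariance`, `pstdev`), since the tabled value 1.667 equals 10/6; `spread_summary` is a hypothetical helper name.

```python
from statistics import pstdev, pvariance

def spread_summary(values):
    """Return (range, population variance, population standard deviation)."""
    return max(values) - min(values), pvariance(values), pstdev(values)

data = [1, 2, 3, 3, 4, 5]
shifted = [x + 7 for x in data]
scaled = [2 * x for x in data]

for label, values in [("Xi", data), ("Xi + 7", shifted), ("2Xi", scaled)]:
    rng, var, sd = spread_summary(values)
    print(f"{label:7} range={rng} variance={var:.3f} sd={sd:.3f}")
```

Adding a constant leaves every measure of spread unchanged, while doubling the data doubles the range and standard deviation and quadruples the variance.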
Skewness
Measures the degree of asymmetry in a data set.
Pearson's 2nd Skewness Coefficient: $Sk = \frac{3(\bar{x} - Md)}{s}$

Ranges from -3 to 3 usually.
Reflects the general result that
when x̄ > Md, the data is right skewed, Sk>0
when x̄ < Md, the data is left skewed, Sk<0
when x̄ = Md, the data is un-skewed, i.e., symmetric,
Sk=0
This is not a rule, rather a rule of thumb.
Statisticians know that the size of the sample and the
value of the mode affect the skewness.

Shape Methods
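
A minimal sketch of Pearson's 2nd skewness coefficient, three times the gap between mean and median divided by the standard deviation. It assumes the sample standard deviation is intended; the function name and the three small data sets are made up for illustration.

```python
from statistics import mean, median, stdev

def pearson_skew(values):
    """Pearson's 2nd skewness coefficient: 3 * (mean - median) / s."""
    return 3 * (mean(values) - median(values)) / stdev(values)

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]      # hypothetical: one large value
symmetric = [1, 2, 3, 4, 5]                   # hypothetical: mean equals median
left_skewed = [-x for x in right_skewed]      # mirror image of the first set

for name, values in [("right", right_skewed),
                     ("symmetric", symmetric),
                     ("left", left_skewed)]:
    print(name, round(pearson_skew(values), 3))
```

The sign of the result follows the rule of thumb on the slide: positive when the mean exceeds the median, negative when it falls below, and zero when they coincide.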
Right Skewed Histogram

Left Skewed Histogram

Symmetric Histogram
Weights:              Boys      Men
Mean                  54.78     172.52
Median                53        171.5
Mode                  52        171
Standard Deviation    7.93      21.81
Sample Variance       62.91     475.48
Skewness              0.67      0.14
Range                 32        103
Minimum               41        126
Maximum               73        229
Sum                   2739      8626
Count                 50        50
CV                    14.5%     12.6%

Which group of males has the more uniform weight? How do you know?
$CV_{Boys} = \frac{7.93}{54.78} \times 100 = 14.5\%$
$CV_{Men} = \frac{21.81}{172.52} \times 100 = 12.6\%$

Which group is least symmetric?
$Sk_{Boys} = \frac{3(54.78 - 53)}{7.93} = 0.67$
$Sk_{Men} = \frac{3(172.52 - 171.5)}{21.81} = 0.14$
Some Descriptive Statistics
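
The two comparisons can be recomputed directly from the summary rows of the table. This is a sketch only; the dictionary layout and names are chosen here for illustration.

```python
# (mean, median, sample standard deviation) taken from the table
groups = {
    "Boys": (54.78, 53, 7.93),
    "Men": (172.52, 171.5, 21.81),
}

results = {}
for name, (avg, med, sd) in groups.items():
    cv = sd / avg * 100               # coefficient of variation, in percent
    sk = 3 * (avg - med) / sd         # Pearson's 2nd skewness coefficient
    results[name] = (cv, sk)
    print(f"{name}: CV = {cv:.1f}%, Sk = {sk:.2f}")
```

The men have the smaller CV, so their weights are more uniform relative to their mean, and the boys have the larger skewness coefficient, so their distribution is the less symmetric of the two.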
Chebyshev's Theorem
  $\%OBS \ge 1 - \frac{1}{k^2}$
  k = number of standard deviations, k > 1
  Universal application
  Provides minimum guarantee

Empirical or Normal Rule
  Within ±1σ find about 68% of observations
  Within ±2σ find about 95% of observations
  Within ±3σ find about 99.7% of observations
  Only bell-shaped and symmetric distributions
  Only integer values of k
Methods for estimating probabilities


A Chebyshev example: If k=1.5,

$\%OBS \ge 1 - \frac{1}{1.5^2} = 1 - \frac{1}{2.25} = 1 - 0.444 = 0.556$

What minimum percent of observations does
Chebyshev predict for ±2σ?

$\%OBS \ge 1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 75\%$

Within how many standard deviations will at
least 44% of observations lie?

$\%OBS = 0.44 = 1 - \frac{1}{k^2}$

$\frac{1}{k^2} = 1 - 0.44 = 0.56$

$k = \sqrt{\frac{1}{0.56}} = \sqrt{1.786} = 1.336$

Density Estimates: Chebyshev
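
Both directions of the Chebyshev calculation can be sketched in code: from k to the minimum share of observations, and from a required share back to k. The function names are hypothetical helpers, not part of the slides.

```python
from math import sqrt

def chebyshev_min_fraction(k):
    """Minimum fraction of observations within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k**2

def chebyshev_k(fraction):
    """Number of standard deviations guaranteeing at least this fraction."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    return sqrt(1 / (1 - fraction))

print(round(chebyshev_min_fraction(1.5), 3))   # k = 1.5
print(chebyshev_min_fraction(2))               # k = 2
print(round(chebyshev_k(0.44), 3))             # at least 44% of observations
```

These are guaranteed minimums for any distribution, so the actual share of observations within k standard deviations is often much higher.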


The weights of a part Samsung Electronics receives from suppliers
have a mean of 40 micrograms, a standard deviation of 3 micrograms,
and a bell-shaped symmetric distribution.
Approximately what percent of parts weigh between 34 and 37 micrograms?

Density Estimates: Empirical Rule


Use the Empirical Rule to isolate the area shaded in red, which is
the percentage of parts weighing between 34 and 37 mcg.
The red area is half the difference in area between ±1σ and ±2σ.
The area within ±1σ is 68%.
The area within ±2σ is 95%.
The difference in area is (95% - 68%) = 27%.
Half of 27% is 13.5%.

Density Estimates: Empirical Rule


The weights of a part Samsung Electronics receives from suppliers
have a mean of 40 micrograms, a standard deviation of 3 micrograms,
and a bell-shaped symmetric distribution.
Samsung rejects parts that weigh more than 46 micrograms or less
than 37 micrograms.
Approximately what percentage of parts does Samsung routinely accept?

Density Estimates: Empirical Rule


There are two ways to think about this. The first is adding the
areas of the accepted parts using the Empirical Rule.
The area shaded in blue is the percentage of accepted parts.
The blue area is the area within ±1σ, which is 68%, plus half the
difference in area between ±1σ and ±2σ.
The difference in area is (95% - 68%) = 27%, half of which is
(27%/2) = 13.5%.
Total area is 68% + 13.5% or 81.5%.

Density Estimates: Empirical Rule


There are two ways to think about this. The second is subtracting
from 1 the area of the rejected parts using the Empirical Rule.
The area shaded in black is the percentage of rejected parts.
The black area is the area outside ±2σ, 1 - 95% = 5%, plus half the
difference in area between ±1σ and ±2σ.
The difference in area is (95% - 68%) = 27%, half of which is
(27%/2) = 13.5%.
Total area of black is 5% + 13.5% = 18.5%.
The blue area is 1 - 18.5% or 81.5%.

Density Estimates: Empirical Rule
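
The Empirical Rule arithmetic from the Samsung examples can be written out as a short script. The comparison at the end assumes the weights are exactly normal, which is stronger than the "bell-shaped symmetric" wording of the problem, so treat it as a rough cross-check only; all names are illustrative.

```python
from statistics import NormalDist

# Empirical Rule: approximate share of observations within +/- k sigma.
WITHIN = {1: 0.68, 2: 0.95, 3: 0.997}

def band(k_inner, k_outer):
    """Approximate area between k_inner and k_outer sigma on ONE side of the mean."""
    return (WITHIN[k_outer] - WITHIN[k_inner]) / 2

between_34_and_37 = band(1, 2)                  # from -2 sigma to -1 sigma
accepted = WITHIN[1] + band(1, 2)               # from -1 sigma to +2 sigma
rejected = (1 - WITHIN[2]) + band(1, 2)         # both tails plus the 34-37 band

# Cross-check under the added assumption of an exactly normal distribution.
weights = NormalDist(mu=40, sigma=3)
exact_band = weights.cdf(37) - weights.cdf(34)
exact_accepted = weights.cdf(46) - weights.cdf(37)

print(f"{between_34_and_37:.3f} {accepted:.3f} {rejected:.3f}")
print(f"{exact_band:.4f} {exact_accepted:.4f}")
```

The rule-of-thumb figures land close to the normal-curve values, which is why the Empirical Rule is adequate for quick estimates at whole numbers of standard deviations.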


The weights of a part Samsung Electronics receives from suppliers
have a mean of 40 micrograms, a standard deviation of 3 micrograms,
and a bell-shaped symmetric distribution.
What is the approximate range of the weights of this particular part?

Density Estimates: Empirical Rule


Think of the standard deviation as a unit of distance along the
number line, here 1σ = 3 mcg.
The Empirical Rule says that virtually 100% of observations will
fall within ±3σ.
The range is the distance between the maximum value and the minimum.
Since ±3σ spans 6σ, the range is approximately 6*3 mcg or 18 mcg.

Density Estimates: Empirical Rule


Methods for One Variable
  Three dimensions of data
    Measures of Center
      Mean, median and mode
    Measures of Spread
      Range, IQR, variance, standard deviation and coefficient of variation
    Measures and concepts for Shape
      Pearson's skewness coefficient
      Symmetric, right or positive skewed, left or negative skewed
  Estimating probabilities
    Chebyshev's Theorem and the Empirical Rule

Descriptive Statistics
