
Stat 110, Lecture 3

Numerical Descriptive Statistics

bheavlin@stat.stanford.edu
Statistics
• No data → Probability
• Some data → Inferential statistics
• Way too much data → Descriptive statistics, which divides into Numerical and Graphical methods

Numerical measures scale better as #groups grows

Method                  # points/group     # groups
Dot plot                low-to-moderate    1–3
Stem & leaf, histogram  moderate-to-high   1–2
QQ plot                 moderate-to-high   ~1–3
Boxplot                 moderate-to-high   2–30+
Multi-vari              low-to-moderate    2–4/layer


Bin sizes
• M&S: 5 to 20 bins
• sqrt(n) bins

Bin width
• h_n = 1, 2, or 5 × 10^k (the 1-2-5 rule)
• h_n = 2 × IQR / n^(1/3), or
• h_n = 1.66 × stdev × [log_e(n) / n]^(1/3)

Rounding to the 1-2-5 rule dominates any difference between the width formulas.
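
A minimal sketch of these binning rules (assuming Python with NumPy; the helper names are illustrative, and the data used are the CPU times from the next slides):

```python
import numpy as np

def bins_sqrt(x):
    """Number of bins from the sqrt(n) rule."""
    return int(np.ceil(np.sqrt(len(x))))

def width_iqr(x):
    """Bin width h_n = 2 * IQR / n^(1/3)."""
    q75, q25 = np.percentile(x, [75, 25])
    return 2 * (q75 - q25) / len(x) ** (1 / 3)

def width_stdev(x):
    """Bin width h_n = 1.66 * stdev * (log_e(n) / n)^(1/3)."""
    n = len(x)
    return 1.66 * np.std(x, ddof=1) * (np.log(n) / n) ** (1 / 3)

def round_125(h):
    """Round a width to the nearest 1, 2, or 5 x 10^k."""
    k = np.floor(np.log10(h))
    mantissa = h / 10 ** k
    nice = min([1, 2, 5, 10], key=lambda m: abs(m - mantissa))
    return nice * 10 ** k

x = [1.17, 1.61, 1.16, 1.38, 3.53, 1.23, 3.76, 1.94, 0.96, 4.75,
     0.15, 2.41, 0.71, 0.02, 1.59, 0.19, 0.82, 0.47, 2.16, 2.01,
     0.92, 0.75, 2.59, 3.07, 1.40]
print(bins_sqrt(x), round_125(width_iqr(x)), round_125(width_stdev(x)))
```

For these CPU times both width formulas round to the same 1-2-5 value, which is the point of the last remark.
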
Some terms
A statistic is a numerical descriptive measure computed from sample data.
A parameter is a numerical descriptive measure of a population.

A measure of central tendency describes a sample or population by a single, typical value.
A measure of variation describes the width, spread, and/or uncertainty of a sample, population, or statistic.


Measures of Central Tendency
• Mean: the arithmetic average, Σi xi / n.
• Median: the middle number. With n numbers and x(i) the ith smallest:
  n odd: x((n+1)/2);
  n even: [ x(n/2) + x(n/2 + 1) ] / 2.
• Mode: the most frequently occurring value(s), often relative to nearby values. Sometimes dependent on binning choices in a histogram.


CPU times (n = 25):
1.17 1.61 1.16 1.38 3.53
1.23 3.76 1.94 0.96 4.75
0.15 2.41 0.71 0.02 1.59
0.19 0.82 0.47 2.16 2.01
0.92 0.75 2.59 3.07 1.40

Stem-and-leaf, with cumulative counts from the bottom:
Stem     Leaf                  cum count
4. 5–9   75                    25
4. 0–4
3. 5–9   53, 76                24
3. 0–4   07                    22
2. 5–9   59                    21
2. 0–4   01, 16, 41            20
1. 5–9   59, 61, 94            17
1. 0–4   16, 17, 23, 38, 40    14
0. 5–9   71, 75, 82, 92, 96     9
0. 0–4   02, 15, 19, 47         4

Mean: 1.63 = (1.17 + 1.61 + … + 1.40) / 25.
Median position: (25 + 1) / 2 = 13, so the median is the 13th smallest value.
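
A minimal sketch (plain Python; the variable names are illustrative) that reproduces the mean and the median from these 25 times:

```python
cpu_times = [1.17, 1.61, 1.16, 1.38, 3.53, 1.23, 3.76, 1.94, 0.96, 4.75,
             0.15, 2.41, 0.71, 0.02, 1.59, 0.19, 0.82, 0.47, 2.16, 2.01,
             0.92, 0.75, 2.59, 3.07, 1.40]

n = len(cpu_times)
mean = sum(cpu_times) / n                      # 1.63

x = sorted(cpu_times)                          # x[i-1] is the ith smallest, x(i)
if n % 2 == 1:
    median = x[(n + 1) // 2 - 1]               # n odd: x((n+1)/2) -> x(13) = 1.38
else:
    median = (x[n // 2 - 1] + x[n // 2]) / 2   # n even: average of the two middle values

print(mean, median)
```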


Measures of Relative Standing
The 100pth percentile of a data set is a value y located so that 100p% of the area under the relative frequency distribution lies to the left of y and 100(1–p)% lies to its right. Synonym: the pth quantile.

The lower quartile is the 25th percentile.
The second quartile is the 50th percentile, a.k.a. the median.
The upper quartile is the 75th percentile.

How to calculate quartiles:
With n numbers and x(i) the ith smallest,

• n+1 method: Q(.25) = x((n+1)/4) and Q(.75) = x(3(n+1)/4), interpolating between adjacent order statistics when the index is fractional.
• stem&leaf method: Q(.25) = median( x(i) ≤ median ) and Q(.75) = median( x(i) ≥ median ).

…more on this later


Measures of variation:
The range is the difference between the highest and lowest numbers: x(n) – x(1).
The standard deviation estimates the square root of the average squared difference from the mean. For a sample it is { Σi [x(i) – x̄]² / (n – 1) }^(1/2).

The interquartile range is the third quartile minus the first quartile, and
the mean absolute deviation (MAD) is the average of the absolute differences from the median.

For the CPU times:
Range: 4.75 – 0.02 = 4.73.

Residuals (times – average time of 1.63):
-0.46  -0.02  -0.47  -0.25   1.90
-0.40   2.13   0.31  -0.67   3.12
-1.48   0.78  -0.92  -1.61  -0.04
-1.44  -0.81  -1.16   0.53   0.38
-0.71  -0.88   0.96   1.44  -0.23

Sample variance and standard deviation:
s² = [ (1.17 – 1.63)² + … + (1.40 – 1.63)² ] / (25 – 1) = 1.4228 = 1.1928², so s = 1.1928.

Note: only 9 of the 25 residuals are positive.
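
A minimal sketch (plain Python; names illustrative) verifying the range and the sample standard deviation above:

```python
cpu_times = [1.17, 1.61, 1.16, 1.38, 3.53, 1.23, 3.76, 1.94, 0.96, 4.75,
             0.15, 2.41, 0.71, 0.02, 1.59, 0.19, 0.82, 0.47, 2.16, 2.01,
             0.92, 0.75, 2.59, 3.07, 1.40]
n = len(cpu_times)
x_bar = sum(cpu_times) / n                          # 1.63
rng = max(cpu_times) - min(cpu_times)               # 4.75 - 0.02 = 4.73
ss = sum((x - x_bar) ** 2 for x in cpu_times)       # sum of squared residuals
s = (ss / (n - 1)) ** 0.5                           # sample standard deviation, ~1.1928
print(rng, s)
```
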
IQR for the CPU times
Median: (25 + 1) / 2 = 13, so the median is x(13) = 1.38.
n+1 method: (25 + 1) / 4 = 6.5, so Q(.25) = x(6.5), interpolated.
stem&leaf method: the lower half (including the median) has 13 values, and (13 + 1) / 2 = 7, so Q(.25) = x(7).

           stem&leaf   n+1
Q(.75) =   2.16        2.285
Q(.25) =   0.82        0.785
IQR    =   1.34        1.500

MAD = 0.8908
medAD (median of the absolute deviations) = 0.63
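
A minimal sketch (plain Python; the helper order_stat is illustrative) that reproduces these quartiles, the IQR, the MAD, and the medAD:

```python
cpu_times = [1.17, 1.61, 1.16, 1.38, 3.53, 1.23, 3.76, 1.94, 0.96, 4.75,
             0.15, 2.41, 0.71, 0.02, 1.59, 0.19, 0.82, 0.47, 2.16, 2.01,
             0.92, 0.75, 2.59, 3.07, 1.40]
x = sorted(cpu_times)
n = len(x)

def order_stat(x, i):
    """x(i) for a possibly fractional rank i, interpolating between order statistics."""
    lo = int(i) - 1                      # convert 1-based rank to 0-based index
    frac = i - int(i)
    return x[lo] if frac == 0 else (1 - frac) * x[lo] + frac * x[lo + 1]

# n+1 method
q1 = order_stat(x, (n + 1) / 4)          # 0.785
q3 = order_stat(x, 3 * (n + 1) / 4)      # 2.285
print(q1, q3, q3 - q1)                   # IQR = 1.500

# stem & leaf method: medians of the two halves, each including the overall median
median = x[(n + 1) // 2 - 1]             # 1.38
lower = [v for v in x if v <= median]
upper = [v for v in x if v >= median]
print(lower[(len(lower) + 1) // 2 - 1],  # 0.82
      upper[(len(upper) + 1) // 2 - 1])  # 2.16

# deviations from the median
abs_dev = [abs(v - median) for v in x]
print(sum(abs_dev) / n)                  # MAD   = 0.8908
print(sorted(abs_dev)[(n + 1) // 2 - 1]) # medAD = 0.63
```
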
A statistical “trilemma”
Three goals in tension, drawn as the vertices of a triangle: simple, efficient, robust.


The Empirical Rule
If a data set has an approximately unimodal distribution, then the following rules of thumb may be used to describe the data set:
1. Approximately 68% of the measurements are within 1 standard deviation of the mean;
2. approximately 90–95% are within 2 standard deviations of the mean; and
3. almost all the measurements lie within 3 standard deviations of the mean.
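
A quick check of these fractions against the 25 CPU times (plain Python; the CPU-time distribution is skewed to the right, so the agreement is only rough):

```python
cpu_times = [1.17, 1.61, 1.16, 1.38, 3.53, 1.23, 3.76, 1.94, 0.96, 4.75,
             0.15, 2.41, 0.71, 0.02, 1.59, 0.19, 0.82, 0.47, 2.16, 2.01,
             0.92, 0.75, 2.59, 3.07, 1.40]
n = len(cpu_times)
mean = sum(cpu_times) / n                                        # 1.63
s = (sum((v - mean) ** 2 for v in cpu_times) / (n - 1)) ** 0.5   # 1.1928

for k in (1, 2, 3):
    inside = sum(abs(v - mean) <= k * s for v in cpu_times)
    print(f"within {k} sd: {inside}/{n}")   # 18/25, 24/25, 25/25 for these data
```
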


z-scores:

A z-score for a value y of a data set is the distance that y lies above or below the mean, measured in units of the standard deviation:

z = (y – ȳ) / s


Outliers
An observation that is unusually large or small relative to the other values in a data set is called an outlier. Outliers typically are attributable to one of the following causes:
1. The measurement is observed, recorded, or entered into the computer incorrectly.
2. The measurement comes from a different population.
3. The measurement is correct, but represents a rare, chance event.


How boxplots work
(The annotated boxplots mark, from top to bottom: the upper “fence”, the 75th %ile, the median, the 25th %ile, and the lower “fence”; points outside a fence are plotted individually as “outliers”.)

“Fence”
• IQR = 75th %ile – 25th %ile
• step = 1.5 × IQR
• lower inner fence = smallest observation > 25th %ile – step
• upper inner fence = largest observation < 75th %ile + step
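
A minimal sketch (plain Python) of this fence computation for the CPU times, using the stem&leaf quartiles from the earlier slide:

```python
cpu_times = [1.17, 1.61, 1.16, 1.38, 3.53, 1.23, 3.76, 1.94, 0.96, 4.75,
             0.15, 2.41, 0.71, 0.02, 1.59, 0.19, 0.82, 0.47, 2.16, 2.01,
             0.92, 0.75, 2.59, 3.07, 1.40]
q1, q3 = 0.82, 2.16                                        # 25th and 75th %iles (stem&leaf method)
step = 1.5 * (q3 - q1)                                     # 1.5 * IQR = 2.01

lower_fence = min(v for v in cpu_times if v > q1 - step)   # smallest value above Q1 - step
upper_fence = max(v for v in cpu_times if v < q3 + step)   # largest value below Q3 + step
outliers = [v for v in cpu_times if v < lower_fence or v > upper_fence]
print(lower_fence, upper_fence, outliers)                  # 0.02, 3.76, [4.75]
```
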
Statistical rules for detecting outliers

z-score rule: observations with z-scores greater than 3 in absolute value.

Boxplot rule:
• beyond the inner fence: “suspect” (~7000 ppm)
• beyond the outer fence: “highly suspect” (~1 ppm)


Why boxplots?
• Visual presentation of a data set.
• Scales well as the number of points grows, except as the number of outliers grows too.
• Designed especially to compare more than a few groups.
• Not as good as histograms at detecting bimodality or details in the tails.
(Example: boxplots comparing the groups “clearout”, “01-12”, and “no splits”.)

Multi-vari charts
(Example: Cu, Fe, and Pb concentrations in mg/l, 0.00–0.40, grouped by column 1–3 within each metal.)
• Multiply nested label scale.
• Horizontal format.
• Boxplot by group.
• (Optional jitter.)

Problem 2.28
(Plots of the Pb, Cu, and Fe concentrations, one panel per metal; ■ marks an outlier.)


QQ plots
(Example: pairwise QQ plots of the Pb, Cu, and Fe concentrations.)
• Plot quantiles of one group vs quantiles of another.
• A linear pattern implies similar “shape.”
• #group 1 need not equal #group 2…
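
A minimal sketch (Python with NumPy; the two groups are hypothetical, not the Problem 2.28 data) of a two-sample QQ comparison when the groups have different sizes: evaluate both empirical quantile functions on the same probability grid and look for a linear pattern.

```python
import numpy as np

# hypothetical groups of unequal size
group1 = np.array([0.11, 0.15, 0.09, 0.21, 0.18, 0.14, 0.25])
group2 = np.array([0.08, 0.19, 0.16, 0.12, 0.22])

# common probability grid
p = np.linspace(0.05, 0.95, 19)
q1 = np.quantile(group1, p)
q2 = np.quantile(group2, p)

# a roughly linear (q2 vs q1) pattern suggests the two distributions have similar shape
for a, b in zip(q1, q2):
    print(f"{a:.3f}  {b:.3f}")
```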


Scatterplots and the scatterplot matrix (draftsman plot)
(Example: a single scatterplot, or XY plot, of one pair of metals, and a scatterplot matrix of Pb, Cu, and Fe.)


Correlation coefficient (Pearson)

In terms of z-scores:
r = Σi [ (yi – ȳ)/sy ] [ (xi – x̄)/sx ] / (n – 1)

–1 ≤ r ≤ 1
r = ±1 implies a perfectly linear relationship;
r > 0 indicates an increasing relationship.

        Pb      Cu      Fe
Pb     1.00    0.36    0.31
Cu     0.36    1.00    0.70
Fe     0.31    0.70    1.00
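
A minimal sketch (Python with NumPy; the paired measurements are hypothetical, not the values tabulated above) computing Pearson's r as the average product of z-scores:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: sum of z-score products divided by n - 1."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return np.sum(zx * zy) / (len(x) - 1)

# hypothetical paired measurements
cu = [0.08, 0.12, 0.10, 0.15, 0.11, 0.18, 0.09]
fe = [0.20, 0.26, 0.22, 0.31, 0.24, 0.33, 0.21]
print(pearson_r(cu, fe))                 # close to +1: strong increasing relationship
print(np.corrcoef(cu, fe)[0, 1])         # NumPy's built-in agrees
```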


107 systems, 28 performance benchmarks
(Scatterplot matrix of benchmark factors: “Memory” (arch05), “Stream” (arch04), “Math” (arch03), “Games” (arch01), and log10 MHz.)
notes:
• green = AMD
• blue = Intel
