
Stat 110, Lecture 3

Numerical Descriptive Statistics

bheavlin@stat.stanford.edu
Statistics
• No data → Probability
• Some data → Inferential statistics
• Way too much data → Descriptive statistics, which divides into Numerical and Graphical methods

Numerical measures scale better as #groups grows

Method                  # points/group     # groups
Dot plot                low-to-moderate    1–3
Stem & leaf, histogram  moderate-to-high   1–2
QQ plot                 moderate-to-high   ~1–3
Boxplot                 moderate-to-high   2–30+
Multi-vari              low-to-moderate    2–4/layer


Bin sizes
• M&S: 5 to 20 bins
• sqrt(n) bins

Bin width
• h_n = 1, 2, or 5 × 10^k (the 1-2-5 rule)
• h_n = 2 × IQR / n^(1/3), or
• h_n = 1.66 × stdev × [log_e(n) / n]^(1/3)

Rounding to the 1-2-5 rule dominates any difference between the width formulas.
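
A minimal sketch of these binning rules (assuming Python with NumPy; the helper names are illustrative, and the data used are the CPU times from the next slides):

```python
import numpy as np

def bins_sqrt(x):
    """Number of bins from the sqrt(n) rule."""
    return int(np.ceil(np.sqrt(len(x))))

def width_iqr(x):
    """Bin width h_n = 2 * IQR / n^(1/3)."""
    q75, q25 = np.percentile(x, [75, 25])
    return 2 * (q75 - q25) / len(x) ** (1 / 3)

def width_stdev(x):
    """Bin width h_n = 1.66 * stdev * (log_e(n) / n)^(1/3)."""
    n = len(x)
    return 1.66 * np.std(x, ddof=1) * (np.log(n) / n) ** (1 / 3)

def round_125(h):
    """Round a width to the nearest 1, 2, or 5 x 10^k."""
    k = np.floor(np.log10(h))
    mantissa = h / 10 ** k
    nice = min([1, 2, 5, 10], key=lambda m: abs(m - mantissa))
    return nice * 10 ** k

x = [1.17, 1.61, 1.16, 1.38, 3.53, 1.23, 3.76, 1.94, 0.96, 4.75,
     0.15, 2.41, 0.71, 0.02, 1.59, 0.19, 0.82, 0.47, 2.16, 2.01,
     0.92, 0.75, 2.59, 3.07, 1.40]
print(bins_sqrt(x), round_125(width_iqr(x)), round_125(width_stdev(x)))
```

For these CPU times both width formulas round to the same 1-2-5 value, which is the point of the last remark.
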
Some terms
A statistic is a numerical descriptive measure computed from sample data.
A parameter is a numerical descriptive measure of a population.

A measure of central tendency describes a sample or population by a single, typical value.
A measure of variation describes the width, spread, and/or uncertainty of a sample, population, or statistic.


Measures of Central Tendency
• Mean: the arithmetic average, Σi xi / n.
• Median: the middle number. With n numbers and x(i) the ith smallest:
  n odd: x((n+1)/2);
  n even: [ x(n/2) + x(n/2 + 1) ] / 2.
• Mode: the most frequently occurring value(s), often relative to nearby values. Sometimes dependent on binning choices in a histogram.


CPU times (n = 25):
1.17 1.61 1.16 1.38 3.53
1.23 3.76 1.94 0.96 4.75
0.15 2.41 0.71 0.02 1.59
0.19 0.82 0.47 2.16 2.01
0.92 0.75 2.59 3.07 1.40

Stem-and-leaf, with cumulative counts from the bottom:
Stem     Leaf                  cum count
4. 5–9   75                    25
4. 0–4
3. 5–9   53, 76                24
3. 0–4   07                    22
2. 5–9   59                    21
2. 0–4   01, 16, 41            20
1. 5–9   59, 61, 94            17
1. 0–4   16, 17, 23, 38, 40    14
0. 5–9   71, 75, 82, 92, 96     9
0. 0–4   02, 15, 19, 47         4

Mean: 1.63 = (1.17 + 1.61 + … + 1.40) / 25.
Median position: (25 + 1) / 2 = 13, so the median is the 13th smallest value.
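
A minimal sketch (plain Python; the variable names are illustrative) that reproduces the mean and the median from these 25 times:

```python
cpu_times = [1.17, 1.61, 1.16, 1.38, 3.53, 1.23, 3.76, 1.94, 0.96, 4.75,
             0.15, 2.41, 0.71, 0.02, 1.59, 0.19, 0.82, 0.47, 2.16, 2.01,
             0.92, 0.75, 2.59, 3.07, 1.40]

n = len(cpu_times)
mean = sum(cpu_times) / n                      # 1.63

x = sorted(cpu_times)                          # x[i-1] is the ith smallest, x(i)
if n % 2 == 1:
    median = x[(n + 1) // 2 - 1]               # n odd: x((n+1)/2) -> x(13) = 1.38
else:
    median = (x[n // 2 - 1] + x[n // 2]) / 2   # n even: average of the two middle values

print(mean, median)
```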


Measures of Relative Standing
The 100pth percentile of a data set is a value y located so that 100p% of the area under the relative frequency distribution lies to the left of y and 100(1–p)% lies to its right. Synonym: the pth quantile.

The lower quartile is the 25th percentile.
The second quartile is the 50th percentile, a.k.a. the median.
The upper quartile is the 75th percentile.

How to calculate quartiles:
With n numbers and x(i) the ith smallest,

• n+1 method: Q(.25) = x((n+1)/4) and Q(.75) = x(3(n+1)/4), interpolating between adjacent order statistics when the index is fractional.
• stem&leaf method: Q(.25) = median( x(i) ≤ median ) and Q(.75) = median( x(i) ≥ median ).

…more on this later


Measures of variation:
The range is the difference between the highest and lowest numbers: x(n) – x(1).
The standard deviation estimates the square root of the average squared difference from the mean. For a sample it is { Σi [x(i) – x̄]² / (n – 1) }^(1/2).

The interquartile range is the third quartile minus the first quartile, and
the mean absolute deviation (MAD) is the average of the absolute differences from the median.

For the CPU times:
Range: 4.75 – 0.02 = 4.73.

Residuals (times – average time of 1.63):
-0.46  -0.02  -0.47  -0.25   1.90
-0.40   2.13   0.31  -0.67   3.12
-1.48   0.78  -0.92  -1.61  -0.04
-1.44  -0.81  -1.16   0.53   0.38
-0.71  -0.88   0.96   1.44  -0.23

Sample variance and standard deviation:
s² = [ (1.17 – 1.63)² + … + (1.40 – 1.63)² ] / (25 – 1) = 1.4228 = 1.1928², so s = 1.1928.

Note: only 9 of the 25 residuals are positive.
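
A minimal sketch (plain Python; names illustrative) verifying the range and the sample standard deviation above:

```python
cpu_times = [1.17, 1.61, 1.16, 1.38, 3.53, 1.23, 3.76, 1.94, 0.96, 4.75,
             0.15, 2.41, 0.71, 0.02, 1.59, 0.19, 0.82, 0.47, 2.16, 2.01,
             0.92, 0.75, 2.59, 3.07, 1.40]
n = len(cpu_times)
x_bar = sum(cpu_times) / n                          # 1.63
rng = max(cpu_times) - min(cpu_times)               # 4.75 - 0.02 = 4.73
ss = sum((x - x_bar) ** 2 for x in cpu_times)       # sum of squared residuals
s = (ss / (n - 1)) ** 0.5                           # sample standard deviation, ~1.1928
print(rng, s)
```
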
IQR for the CPU times
Median: (25 + 1) / 2 = 13, so the median is x(13) = 1.38.
n+1 method: (25 + 1) / 4 = 6.5, so Q(.25) = x(6.5), interpolated.
stem&leaf method: the lower half (including the median) has 13 values, and (13 + 1) / 2 = 7, so Q(.25) = x(7).

           stem&leaf   n+1
Q(.75) =   2.16        2.285
Q(.25) =   0.82        0.785
IQR    =   1.34        1.500

MAD = 0.8908
medAD (median of the absolute deviations) = 0.63
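
A minimal sketch (plain Python; the helper order_stat is illustrative) that reproduces these quartiles, the IQR, the MAD, and the medAD:

```python
cpu_times = [1.17, 1.61, 1.16, 1.38, 3.53, 1.23, 3.76, 1.94, 0.96, 4.75,
             0.15, 2.41, 0.71, 0.02, 1.59, 0.19, 0.82, 0.47, 2.16, 2.01,
             0.92, 0.75, 2.59, 3.07, 1.40]
x = sorted(cpu_times)
n = len(x)

def order_stat(x, i):
    """x(i) for a possibly fractional rank i, interpolating between order statistics."""
    lo = int(i) - 1                      # convert 1-based rank to 0-based index
    frac = i - int(i)
    return x[lo] if frac == 0 else (1 - frac) * x[lo] + frac * x[lo + 1]

# n+1 method
q1 = order_stat(x, (n + 1) / 4)          # 0.785
q3 = order_stat(x, 3 * (n + 1) / 4)      # 2.285
print(q1, q3, q3 - q1)                   # IQR = 1.500

# stem & leaf method: medians of the two halves, each including the overall median
median = x[(n + 1) // 2 - 1]             # 1.38
lower = [v for v in x if v <= median]
upper = [v for v in x if v >= median]
print(lower[(len(lower) + 1) // 2 - 1],  # 0.82
      upper[(len(upper) + 1) // 2 - 1])  # 2.16

# deviations from the median
abs_dev = [abs(v - median) for v in x]
print(sum(abs_dev) / n)                  # MAD   = 0.8908
print(sorted(abs_dev)[(n + 1) // 2 - 1]) # medAD = 0.63
```
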
A statistical “trilemma”
Three goals in tension, drawn as the vertices of a triangle: simple, efficient, robust.


The Empirical Rule
If a data set has an approximately unimodal distribution, then the following rules of thumb may be used to describe the data set:
1. Approximately 68% of the measurements are within 1 standard deviation of the mean;
2. approximately 90–95% are within 2 standard deviations of the mean; and
3. almost all the measurements lie within 3 standard deviations of the mean.
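
A quick check of these fractions against the 25 CPU times (plain Python; the CPU-time distribution is skewed to the right, so the agreement is only rough):

```python
cpu_times = [1.17, 1.61, 1.16, 1.38, 3.53, 1.23, 3.76, 1.94, 0.96, 4.75,
             0.15, 2.41, 0.71, 0.02, 1.59, 0.19, 0.82, 0.47, 2.16, 2.01,
             0.92, 0.75, 2.59, 3.07, 1.40]
n = len(cpu_times)
mean = sum(cpu_times) / n                                        # 1.63
s = (sum((v - mean) ** 2 for v in cpu_times) / (n - 1)) ** 0.5   # 1.1928

for k in (1, 2, 3):
    inside = sum(abs(v - mean) <= k * s for v in cpu_times)
    print(f"within {k} sd: {inside}/{n}")   # 18/25, 24/25, 25/25 for these data
```
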


z-scores:

A z-score for a value y of a data set is the distance that y lies above or below the mean, measured in units of the standard deviation:

z = (y – ȳ) / s


Outliers
An observation that is unusually large or small relative to the other values in a data set is called an outlier. Outliers typically are attributable to one of the following causes:
1. The measurement is observed, recorded, or entered into the computer incorrectly.
2. The measurement comes from a different population.
3. The measurement is correct, but represents a rare, chance event.


How boxplots work
(The annotated boxplots mark, from top to bottom: the upper “fence”, the 75th %ile, the median, the 25th %ile, and the lower “fence”; points outside a fence are plotted individually as “outliers”.)

“Fence”
• IQR = 75th %ile – 25th %ile
• step = 1.5 × IQR
• lower inner fence = smallest observation > 25th %ile – step
• upper inner fence = largest observation < 75th %ile + step
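
A minimal sketch (plain Python) of this fence computation for the CPU times, using the stem&leaf quartiles from the earlier slide:

```python
cpu_times = [1.17, 1.61, 1.16, 1.38, 3.53, 1.23, 3.76, 1.94, 0.96, 4.75,
             0.15, 2.41, 0.71, 0.02, 1.59, 0.19, 0.82, 0.47, 2.16, 2.01,
             0.92, 0.75, 2.59, 3.07, 1.40]
q1, q3 = 0.82, 2.16                                        # 25th and 75th %iles (stem&leaf method)
step = 1.5 * (q3 - q1)                                     # 1.5 * IQR = 2.01

lower_fence = min(v for v in cpu_times if v > q1 - step)   # smallest value above Q1 - step
upper_fence = max(v for v in cpu_times if v < q3 + step)   # largest value below Q3 + step
outliers = [v for v in cpu_times if v < lower_fence or v > upper_fence]
print(lower_fence, upper_fence, outliers)                  # 0.02, 3.76, [4.75]
```
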
Statistical rules for detecting outliers

z-score rule: observations with z-scores greater than 3 in absolute value.

Boxplot rule:
• beyond the inner fence: “suspect” (~7000 ppm)
• beyond the outer fence: “highly suspect” (~1 ppm)


Why boxplots?
• Visual presentation of a data set.
• Scales well as the number of points grows, except as the number of outliers grows too.
• Designed especially to compare more than a few groups.
• Not as good as histograms at detecting bimodality or details in the tails.
(Example: boxplots comparing the groups “clearout”, “01-12”, and “no splits”.)

Multi-vari charts
(Example: Cu, Fe, and Pb concentrations in mg/l, 0.00–0.40, grouped by column 1–3 within each metal.)
• Multiply nested label scale.
• Horizontal format.
• Boxplot by group.
• (Optional jitter.)

Problem 2.28
(Plots of the Pb, Cu, and Fe concentrations, one panel per metal; ■ marks an outlier.)


QQ plots
(Example: pairwise QQ plots of the Pb, Cu, and Fe concentrations.)
• Plot quantiles of one group vs quantiles of another.
• A linear pattern implies similar “shape.”
• #group 1 need not equal #group 2…
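
A minimal sketch (Python with NumPy; the two groups are hypothetical, not the Problem 2.28 data) of a two-sample QQ comparison when the groups have different sizes: evaluate both empirical quantile functions on the same probability grid and look for a linear pattern.

```python
import numpy as np

# hypothetical groups of unequal size
group1 = np.array([0.11, 0.15, 0.09, 0.21, 0.18, 0.14, 0.25])
group2 = np.array([0.08, 0.19, 0.16, 0.12, 0.22])

# common probability grid
p = np.linspace(0.05, 0.95, 19)
q1 = np.quantile(group1, p)
q2 = np.quantile(group2, p)

# a roughly linear (q2 vs q1) pattern suggests the two distributions have similar shape
for a, b in zip(q1, q2):
    print(f"{a:.3f}  {b:.3f}")
```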


Scatterplots and the scatterplot matrix (draftsman plot)
(Example: a single scatterplot, or XY plot, of one pair of metals, and a scatterplot matrix of Pb, Cu, and Fe.)


Correlation coefficient (Pearson)

In terms of z-scores:
r = Σi [ (yi – ȳ)/sy ] [ (xi – x̄)/sx ] / (n – 1)

–1 ≤ r ≤ 1
r = ±1 implies a perfectly linear relationship;
r > 0 indicates an increasing relationship.

        Pb      Cu      Fe
Pb     1.00    0.36    0.31
Cu     0.36    1.00    0.70
Fe     0.31    0.70    1.00
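
A minimal sketch (Python with NumPy; the paired measurements are hypothetical, not the values tabulated above) computing Pearson's r as the average product of z-scores:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: sum of z-score products divided by n - 1."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return np.sum(zx * zy) / (len(x) - 1)

# hypothetical paired measurements
cu = [0.08, 0.12, 0.10, 0.15, 0.11, 0.18, 0.09]
fe = [0.20, 0.26, 0.22, 0.31, 0.24, 0.33, 0.21]
print(pearson_r(cu, fe))                 # close to +1: strong increasing relationship
print(np.corrcoef(cu, fe)[0, 1])         # NumPy's built-in agrees
```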


107 systems, 28 performance benchmarks
(Scatterplot matrix of benchmark factors: “Memory” (arch05), “Stream” (arch04), “Math” (arch03), “Games” (arch01), and log10 MHz.)
notes:
• green = AMD
• blue = Intel
