Descriptive Statistics

Tieming Ji
Fall 2012
Motivation: To investigate the characteristics of a population
(one too large to enumerate every element), we usually study a
sample, which is small relative to the population. In this
chapter, we learn (1) methods to visualize a sample and (2)
statistics that quantify sample characteristics, which we then
use to infer the corresponding characteristics of the population.
Definition: A random sample of size n from the distribution of
X is a collection of n independent random variables, each with
the same distribution as X.
Example 1: To study the random variable X, the life span in
hours of the lithium battery in a particular model of pocket
calculator, we obtain a random sample of 50 batteries and
determine the life span of each. The resulting data are:
4285 564 1278 205 3920
2066 604 209 602 1379
2584 14 349 3770 99
1009 4152 478 726 510
318 737 3032 3894 582
1429 852 1461 2662 308
981 1560 701 497 3367
1402 1786 1406 35 99
1137 520 261 2778 373
414 396 83 1379 454
Stem-and-Leaf Diagram
The decimal point is 3 digit(s) to the right of the |
0 | 001112233334445555566667779
1 | 001344444568
2 | 1678
3 | 04899
4 | 23

With each stem split into two lines (leaves 0-4, then 5-9):
0 | 00111223333444
0 | 5555566667779
1 | 001344444
1 | 568
2 | 1
2 | 678
3 | 04
3 | 899
4 | 23
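
The first diagram above can be reproduced with a short script. This is a minimal sketch, assuming each life span is rounded to the nearest 100 hours so that the stem is the thousands digit and the leaf is the hundreds digit; the names `LIFE_SPANS` and `stem_and_leaf` are illustrative and are not part of the original slides.

```python
from collections import defaultdict

# Battery life spans (hours) from Example 1.
LIFE_SPANS = [
    4285, 564, 1278, 205, 3920,
    2066, 604, 209, 602, 1379,
    2584, 14, 349, 3770, 99,
    1009, 4152, 478, 726, 510,
    318, 737, 3032, 3894, 582,
    1429, 852, 1461, 2662, 308,
    981, 1560, 701, 497, 3367,
    1402, 1786, 1406, 35, 99,
    1137, 520, 261, 2778, 373,
    414, 396, 83, 1379, 454,
]


def stem_and_leaf(data, leaf_unit=100):
    """Return {stem: leaf string} after rounding to the nearest leaf_unit.

    Assumption: non-negative integer data; halves round up.
    """
    rows = defaultdict(list)
    for value in data:
        rounded = (value + leaf_unit // 2) // leaf_unit
        stem, leaf = divmod(rounded, 10)
        rows[stem].append(leaf)
    return {
        stem: "".join(str(leaf) for leaf in sorted(leaves))
        for stem, leaves in sorted(rows.items())
    }


for stem, leaves in stem_and_leaf(LIFE_SPANS).items():
    print(f"{stem} | {leaves}")
```

Note that 981 is rounded to 1000 under this assumption, so it appears as leaf 0 on stem 1 rather than on stem 0.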
Histograms
[Figure: histogram titled "Life Span of Sample Batteries"; x-axis from 0 to 4000, y-axis "Frequency" from 0 to 20.]
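
The counts behind a histogram like this one can be tabulated directly. The sketch below assumes bins of width 500 hours starting at 0; the bin width used in the original figure is not recoverable from the slides, and `histogram_counts` is a hypothetical helper name.

```python
# Battery life spans (hours) from Example 1.
LIFE_SPANS = [
    4285, 564, 1278, 205, 3920,
    2066, 604, 209, 602, 1379,
    2584, 14, 349, 3770, 99,
    1009, 4152, 478, 726, 510,
    318, 737, 3032, 3894, 582,
    1429, 852, 1461, 2662, 308,
    981, 1560, 701, 497, 3367,
    1402, 1786, 1406, 35, 99,
    1137, 520, 261, 2778, 373,
    414, 396, 83, 1379, 454,
]


def histogram_counts(data, bin_width, n_bins):
    """Count observations in the bins [0, w), [w, 2w), ..., one count per bin.

    Assumption: every value is a non-negative integer below bin_width * n_bins.
    """
    counts = [0] * n_bins
    for value in data:
        counts[value // bin_width] += 1
    return counts


BIN_WIDTH = 500  # assumed bin width, not taken from the slides
counts = histogram_counts(LIFE_SPANS, BIN_WIDTH, n_bins=9)

# Print a crude text histogram, one row per bin.
for i, count in enumerate(counts):
    low, high = i * BIN_WIDTH, (i + 1) * BIN_WIDTH
    print(f"{low:>4}-{high:<4} | {'*' * count} ({count})")
```

Most of the observations fall in the lowest bins, which is the right-skewed shape the histogram is meant to reveal.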
(Empirical) Cumulative Distribution Plots
[Figure: empirical cumulative distribution plot of the sample; x-axis "Life Span" from 0 to 4000, y-axis "Empirical F(x)" from 0.0 to 1.0.]
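
The slides show only the plot, so the following is a minimal sketch using the usual definition of the empirical cumulative distribution function: F(x) is the proportion of observations less than or equal to x. The helper name `make_ecdf` is illustrative.

```python
from bisect import bisect_right

# Battery life spans (hours) from Example 1.
LIFE_SPANS = [
    4285, 564, 1278, 205, 3920,
    2066, 604, 209, 602, 1379,
    2584, 14, 349, 3770, 99,
    1009, 4152, 478, 726, 510,
    318, 737, 3032, 3894, 582,
    1429, 852, 1461, 2662, 308,
    981, 1560, 701, 497, 3367,
    1402, 1786, 1406, 35, 99,
    1137, 520, 261, 2778, 373,
    414, 396, 83, 1379, 454,
]


def make_ecdf(data):
    """Return a function F where F(x) = (number of observations <= x) / n."""
    ordered = sorted(data)
    n = len(ordered)

    def ecdf(x):
        # bisect_right gives the count of ordered values that are <= x.
        return bisect_right(ordered, x) / n

    return ecdf


F = make_ecdf(LIFE_SPANS)
print(F(1000))  # proportion of batteries lasting at most 1000 hours
```

Plotting F at each observed life span gives the step-like rise from 0 to 1 shown in the figure.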
Location Statistics: Mean
Definition: Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$
for the random variable $X$. The statistic
$$\frac{\sum_{i=1}^{n} X_i}{n}$$
is called the sample mean and is denoted by $\bar{X}$.
Example 2: A random sample of size 9 yields the following observations
on the random variable X, the coal consumption in millions of tons by
electric utilities for a given year: 406, 395, 400, 450, 390, 410, 415, 401,
and 408. The sample mean is
$$\bar{x} = \frac{1}{9}\sum_{i=1}^{9} x_i = \frac{1}{9}(406 + 395 + \cdots + 408) \approx 408.3.$$
Thus, the average coal consumption over the 9 observations is around 408.3
million tons.
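
The arithmetic in Example 2 can be checked with a few lines of Python. This is a minimal sketch; `coal` and `sample_mean` are illustrative names.

```python
# Coal consumption (millions of tons) from Example 2.
coal = [406, 395, 400, 450, 390, 410, 415, 401, 408]


def sample_mean(xs):
    """Sum of the observations divided by the sample size n."""
    return sum(xs) / len(xs)


print(sum(coal))                    # 3675
print(round(sample_mean(coal), 1))  # 408.3
```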
Location Statistics: Median
Definition: The order statistics of a sample $x_1, x_2, \ldots, x_n$ are the
observations ordered from the smallest to the largest,
denoted by $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$.
Definition: Let $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ be the order statistics for a
sample of size $n$. The sample median is the middle observation
if $n$ is odd. It is the average of the two middle observations if
$n$ is even. We shall denote the median of a sample by $\tilde{x}$.
Definition: The median location is $\frac{n+1}{2}$.
In Example 2, the order statistics are 390, 395, 400, 401, 406, 408, 410,
415, 450. The median location is $(9+1)/2 = 5$, and the median is $\tilde{x} = 406$.
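
The definition translates directly into code once the sample is sorted. The sketch below is illustrative (`sample_median` is not a name from the slides); it maps the 1-based order statistic $x_{(k)}$ to the 0-based list index `k - 1`.

```python
# Coal consumption (millions of tons) from Example 2.
coal = [406, 395, 400, 450, 390, 410, 415, 401, 408]


def sample_median(xs):
    """Middle order statistic if n is odd, else the mean of the two middle ones."""
    ordered = sorted(xs)  # order statistics x_(1), ..., x_(n)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Median location (n + 1) / 2 is 1-based, so the 0-based index is mid.
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(sample_median(coal))          # 406
print(sample_median([1, 2, 3, 4]))  # 2.5
```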
Measures of Variability: Sample Variance and
Sample Standard Deviation
Definition: Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$
for $X$. The statistic
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
is called the sample variance. Further, the statistic $S = \sqrt{S^2}$ is
called the sample standard deviation.
Theorem: Given a sample of size $n$ for the random variable $X$,
the sample variance $S^2$ can be computed by
$$S^2 = \frac{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}{n(n-1)}.$$
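
The slides state the theorem without proof. One way to see it, sketched here using only the definitions of $\bar{X}$ and $S^2$, is to expand the square in the numerator of the definition:

```latex
\begin{align*}
\sum_{i=1}^{n} (X_i - \bar{X})^2
  &= \sum_{i=1}^{n} X_i^2 - 2\bar{X}\sum_{i=1}^{n} X_i + n\bar{X}^2 \\
  &= \sum_{i=1}^{n} X_i^2 - n\bar{X}^2
     \qquad \text{since } \sum_{i=1}^{n} X_i = n\bar{X} \\
  &= \sum_{i=1}^{n} X_i^2 - \frac{1}{n}\Bigl(\sum_{i=1}^{n} X_i\Bigr)^2 .
\end{align*}
% Dividing by n - 1 and putting everything over the common denominator n(n - 1):
\[
S^2 = \frac{n\sum_{i=1}^{n} X_i^2 - \bigl(\sum_{i=1}^{n} X_i\bigr)^2}{n(n-1)} .
\]
```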
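In Example 2, we have $\sum_{i=1}^{9} x_i = 406 + 395 + \cdots + 408 = 3675$ and
$\sum_{i=1}^{9} x_i^2 = 406^2 + 395^2 + \cdots + 408^2 = 1503051$. Thus,
$$S^2 = \frac{9\sum_{i=1}^{9} x_i^2 - \left(\sum_{i=1}^{9} x_i\right)^2}{9\,(9-1)} = \frac{9 \cdot 1503051 - (3675)^2}{9 \cdot 8} = 303.25.$$
And the sample standard deviation is
$$S = \sqrt{S^2} = \sqrt{303.25} \approx 17.4.$$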
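
Both the definition and the computational formula can be checked numerically on the Example 2 data. This is a minimal sketch with illustrative function names; the two functions should agree up to floating-point rounding.

```python
import math

# Coal consumption (millions of tons) from Example 2.
coal = [406, 395, 400, 450, 390, 410, 415, 401, 408]


def variance_by_definition(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_computational(xs):
    """(n * sum of squares - square of the sum) / (n * (n - 1))."""
    n = len(xs)
    total = sum(xs)
    total_of_squares = sum(x * x for x in xs)
    return (n * total_of_squares - total ** 2) / (n * (n - 1))


s2 = variance_computational(coal)
print(s2)                          # 303.25
print(round(math.sqrt(s2), 1))     # 17.4
print(math.isclose(s2, variance_by_definition(coal)))  # True
```

With integer data the computational form involves only integer sums before the final division, which is why it was convenient for hand calculation.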
Measures of Variability: Sample Range
Definition: The sample range of a random sample of size $n$
is defined as $x_{(n)} - x_{(1)}$.
In Example 2, the sample range is 450 - 390 = 60. This is the largest
difference between any two of the 9 observations of yearly coal consumption.
Measures of Variability: Interquartile Range
The sample range is affected by outliers, whereas the interquartile
range (iqr) is relatively robust to them. The interquartile range
is defined as the difference between the 3rd quartile (75%) and the
1st quartile (25%).
Steps for finding the sample interquartile range for a sample of size $n$:
1. Find the median location $\frac{n+1}{2}$ and round it down to the nearest
   whole number, which is called the truncated median location.
2. Define $q = \frac{\text{truncated median location} + 1}{2}$.
3. The 1st quartile, $q_1$, is $x_{(q)}$ if $q$ is an integer; otherwise, $q_1$ is the
   average of $x_{(q-0.5)}$ and $x_{(q+0.5)}$.
4. The 3rd quartile, $q_3$, is $x_{(n-q+1)}$ if $q$ is an integer; otherwise, $q_3$ is
   the average of $x_{(n-q+0.5)}$ and $x_{(n-q+1.5)}$.
5. The sample interquartile range is $\text{iqr} = q_3 - q_1$.
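
The steps above can be sketched in Python as follows. The function names `quartiles` and `iqr` are illustrative, and the code follows this chapter's rule only; statistical software often uses other quartile conventions and may give slightly different values.

```python
# Coal consumption (millions of tons) from Example 2.
coal = [406, 395, 400, 450, 390, 410, 415, 401, 408]


def quartiles(xs):
    """Return (q1, q3) using the truncated-median-location rule."""
    ordered = sorted(xs)
    n = len(ordered)

    def order_stat(k):
        # x_(k) with 1-based k, as in the slides.
        return ordered[k - 1]

    truncated = (n + 1) // 2  # median location (n + 1) / 2, rounded down
    q = (truncated + 1) / 2
    if q == int(q):
        q = int(q)
        return order_stat(q), order_stat(n - q + 1)
    low = int(q - 0.5)       # q1 averages x_(q-0.5) and x_(q+0.5)
    high = int(n - q + 0.5)  # q3 averages x_(n-q+0.5) and x_(n-q+1.5)
    q1 = (order_stat(low) + order_stat(low + 1)) / 2
    q3 = (order_stat(high) + order_stat(high + 1)) / 2
    return q1, q3


def iqr(xs):
    q1, q3 = quartiles(xs)
    return q3 - q1


print(quartiles(coal))  # (400, 410)
print(iqr(coal))        # 10
```

For the coal data, n = 9 gives a median location of 5 and q = 3, so the quartiles are the 3rd and 7th order statistics. The iqr of 10 is much smaller than the range of 60 because the single large value 450 does not enter the calculation.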
Boxplot
In Example 1, there are 50 observations of the life span of one
kind of battery.
Chapter Summary
We do not require you to draw a figure from data, though it is
not difficult. We mainly want to test whether you can read a
figure and draw useful information from it. For example, can you
tell from a box plot whether there are outliers? Can you guess the
population distribution from a sample distribution
(histogram, stem-and-leaf diagram)?
Understand the basic concepts of statistics: sample mean,
sample median, sample variance, sample standard deviation,
sample range, and sample interquartile range. Can you relate these
sample statistics to population parameters (location,
variation, etc.)?