STATS 7053: Statistics in Engineering

STATS 7053
STATISTICS IN ENGINEERING
Lecturer: A/Prof Gary Glonek

School of Mathematical Sciences
The University of Adelaide
Semester 2, 2010

© School of Mathematical Sciences
1 Review of Descriptive Statistics and Probability
1.1 Types of variables
When statistical data are recorded, it is useful to consider various summary statistics and graphs. The most appropriate method depends on the type of variable, so it is necessary to make the following distinctions.
Categorical Variables: indicate to which category an individual belongs. For example, a water pipe may be concrete, iron, or polypropylene.
Numerical Variables: may be either discrete or continuous.
Continuous variables can take any value in a given range. For example, the compressive strength of concrete pavers.
Discrete variables can take values only in a finite (or countable) set. For example, the number of water snails in a 1 m² area of reeds.

Remarks
Categorical variables are sometimes coded numerically, but this does not make them numerical.
The classification of variables can be refined further, but we will not consider the details.

1.2 Graphical summaries
Line charts
Used to describe the distribution of a discrete variable.
The line chart is a series of vertical line segments with height corresponding to relative frequency.
Example. 143 one-litre air samples were taken 1.5 m above floor level in a certain workshop, and the number of asbestos fibres was recorded for each sample. The data and line chart are given below.
Number of Fibres Frequency Relative Frequency

0 34 0.24
1 46 0.32
2 38 0.27
3 19 0.13
4 4 0.03
5 2 0.01
6+ 0 0.00
Total 143 1.00

[Figure: line chart of Relative Frequency (vertical axis) against Number of Fibres (horizontal axis, 0 to 5)]
Fig. 1.1: Line chart for the asbestos fibre data
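The relative frequencies in the table are each frequency divided by the total number of samples. A minimal Python sketch of that calculation follows; the variable names are illustrative and not part of the notes.

```python
# Frequencies from the asbestos fibre table (number of fibres -> count).
counts = {0: 34, 1: 46, 2: 38, 3: 19, 4: 4, 5: 2}

total = sum(counts.values())

# Relative frequency = frequency / total number of samples.
rel_freq = {k: c / total for k, c in counts.items()}

for k, rf in rel_freq.items():
    print(f"{k} fibres: relative frequency {rf:.2f}")
```

These relative frequencies are the heights of the vertical segments in the line chart.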

Histograms
Histograms can be constructed for continuous variables as follows.
1. Group the data into class intervals.
2. Calculate the relative frequency for each class interval.
3. The histogram is a series of rectangles such that:
• the bases are the class intervals;
• the height of each rectangle is
\[ H = \frac{\text{relative frequency}}{\text{class interval length}} . \]

Example. Compressive strengths X (in MPa) for a random sample of 200 concrete pavers are summarised below.
Interval Frequency Relative frequency Height

40 < X ≤ 45 10 0.05 0.01
45 < X ≤ 50 10 0.05 0.01
50 < X ≤ 55 30 0.15 0.03
55 < X ≤ 60 40 0.20 0.04
60 < X ≤ 65 50 0.25 0.05
65 < X ≤ 70 30 0.15 0.03
70 < X ≤ 75 20 0.10 0.02
75 < X ≤ 80 10 0.05 0.01

[Figure: histogram; vertical axis Frequency (0 to 50), horizontal axis Compressive Strength (40 to 80)]
Fig. 1.2: Compressive strength for 200 pavers

Remarks
When the class intervals are all of the same length, as in the present example, the heights of the rectangles are simply proportional to frequency.
Computer packages generally use equal length class intervals and often use frequency for the vertical scale.
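The height rule for the paver histogram can be checked with a short Python sketch. The tuple layout and variable names are assumptions made for illustration. With heights defined this way, the areas of the rectangles sum to 1.

```python
# (lower limit, upper limit, frequency) for each class interval in the paver table.
classes = [(40, 45, 10), (45, 50, 10), (50, 55, 30), (55, 60, 40),
           (60, 65, 50), (65, 70, 30), (70, 75, 20), (75, 80, 10)]

n = sum(freq for _, _, freq in classes)

# Height = relative frequency / class interval length.
heights = [(freq / n) / (upper - lower) for lower, upper, freq in classes]

# Area of each rectangle is height * width, i.e. the relative frequency.
total_area = sum(h * (upper - lower)
                 for h, (lower, upper, _) in zip(heights, classes))

for (lower, upper, _), h in zip(classes, heights):
    print(f"{lower} < X <= {upper}: height {h:.2f}")
```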

1.3 Summary statistics
Consider a sample of observations x1, x2, . . . , xn of a numerical variable X. It is often useful to consider numerical summaries of the data as well as graphical displays. The most important features of a data set are often the location and dispersion.
1.4 Measures of location
The sample mean
The sample mean of the numbers x1, x2, . . . , xn is defined by
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i . \]
It is the most commonly used measure of location.
It can be interpreted as the “centre of mass” of the distribution.

The sample median
The sample median, M, is the “middle value” in the distribution.
It is calculated as follows:
1. Sort the data in ascending order. Let x(1), x(2), . . . , x(n) represent the sorted data, so that x(1) is the minimum value, and so on.
2. Let m = (n + 1)/2.
3. If n is odd, so that m is an integer, the sample median is defined to be
M = x(m).
4. If n is even, the sample median is defined to be
\[ M = \frac{x_{(m-0.5)} + x_{(m+0.5)}}{2} , \]
that is, the average of the two middle values.

The mode
The mode is defined to be the most frequently occurring value in the data set.
Although it is superficially appealing for its simplicity, the mode is rarely used in practice.
One difficulty with the mode is that there may not be a unique most frequently occurring value.

Example. The maximum annual flood peak inflows (m³ s⁻¹) to the Hardap Dam in Namibia for the years 1962-1987 are shown below.
1864 44 46 364 911 83 477 457 782
6100 197 3259 554 1506 1508 236 635 230
125 131 30 765 408 347 412
[Figure: histogram; vertical axis Frequency (0 to 20), horizontal axis Maximum Inflow (0 to 7000)]
Fig. 1.3: Annual maximum inflow for the Hardap Dam, 1962-87

The sample mean is given by
\[ \bar{x} = \frac{1}{25}\{1864 + 44 + \dots + 412\} = \frac{21471}{25} = 858.84 . \]
To find the median:
1. The sorted data are
30 44 46 83 125 131 197 230 236
347 364 408 412 457 477 554 635 765
782 911 1506 1508 1864 3259 6100
2. m = (25 + 1)/2 = 13, so M = x(13) = 412.
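The two calculations above can be reproduced with a small Python sketch. The function names are illustrative; the median function follows the odd/even rule given in the notes.

```python
# Hardap Dam annual maximum inflows, as listed in the notes.
hardap = [1864, 44, 46, 364, 911, 83, 477, 457, 782,
          6100, 197, 3259, 554, 1506, 1508, 236, 635, 230,
          125, 131, 30, 765, 408, 347, 412]


def sample_mean(xs):
    """Sum of the observations divided by the sample size."""
    return sum(xs) / len(xs)


def sample_median(xs):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    s = sorted(xs)
    n = len(s)
    if n % 2 == 1:
        # m = (n + 1)/2 is an integer; subtract 1 for Python's 0-based indexing.
        return s[(n + 1) // 2 - 1]
    # Positions m - 0.5 and m + 0.5 are n/2 and n/2 + 1 (1-based).
    return (s[n // 2 - 1] + s[n // 2]) / 2


print(sample_mean(hardap), sample_median(hardap))
```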

Comparison of sample mean and sample median
For a symmetric distribution, x̄ ≈ M.
For a positively skewed distribution (e.g. the Hardap Dam data), typically x̄ > M.
For a negatively skewed distribution, typically x̄ < M.
The sample mean is sensitive to outliers in the data. Even a single very large or very small value can change the mean appreciably.
The sample median is resistant to outliers.

1.5 Measures of Dispersion
The sample standard deviation
The sample variance is defined by
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 . \]
It is the “average squared deviation from the sample mean x̄”.
Note: The divisor n − 1 is used for technical reasons that will be discussed later. It does not change the substantive interpretation and makes little difference numerically unless n is very small.
Because the sample variance measures squared deviations, its units are the squares of the original units of measurement.
For this reason, the sample standard deviation is defined by
\[ s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} . \]

Example
Recall that the Hardap Dam data are given by
1864 44 46 364 911 83 477 457 782
6100 197 3259 554 1506 1508 236 635 230
125 131 30 765 408 347 412
and have sample mean x̄ = 858.84.
The sample variance is therefore
\[ s^2 = \frac{1}{25-1}\left\{(1864 - 858.84)^2 + (44 - 858.84)^2 + \dots + (412 - 858.84)^2\right\} = \frac{41361857}{24} = 1723411 . \]
The sample standard deviation is
\[ s = \sqrt{1723411} = 1312.8 . \]
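As a check on the arithmetic, the sample variance and standard deviation can be computed directly from the definition. This is a minimal Python sketch with illustrative function names.

```python
import math

hardap = [1864, 44, 46, 364, 911, 83, 477, 457, 782,
          6100, 197, 3259, 554, 1506, 1508, 236, 635, 230,
          125, 131, 30, 765, 408, 347, 412]


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


s2 = sample_variance(hardap)
s = math.sqrt(s2)  # standard deviation, in the original units

print(round(s2), round(s, 1))
```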

Properties of s
• s ≥ 0.
• s = 0 if and only if x1 = x2 = . . . = xn.
• For any distribution, the following bounds apply:
– At least 75% of the data lie in the range x̄ ± 2s;
– At least 88.9% of the data lie in the range x̄ ± 3s.
• When the data are approximately normal, the 68-95-99.7 rule applies:
– Approximately 68% of the data lie in the range x̄ ± s;
– Approximately 95% of the data lie in the range x̄ ± 2s;
– Approximately 99.7% of the data lie in the range x̄ ± 3s.
[Figure: histogram; vertical axis Frequency (0 to 15), horizontal axis from −3 to 3]
Fig. 1.4: A “typical” normal sample

Example
For the Hardap Dam data, x̄ = 858.84 and s = 1312.8. Shown below are the percentages of observations in various ranges.
Range      Minimum   Normal   Actual
x̄ ± s      -         68%      92%
x̄ ± 2s     75%       95%      96%
x̄ ± 3s     88.9%     99.7%    96%
Remarks
• As might be expected, the normal approximation is very inaccurate because the data are highly skewed (non-normal);
• The general lower bounds are satisfied;
• Another consequence of the skewness is that the left-hand endpoints are negative.
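The “Actual” column can be reproduced by counting the observations that fall within k standard deviations of the mean. The sketch below is one way to do this in Python; `percent_within` is a hypothetical helper name, and the general lower bound 1 − 1/k² (Chebyshev's inequality) is included for comparison.

```python
import math

hardap = [1864, 44, 46, 364, 911, 83, 477, 457, 782,
          6100, 197, 3259, 554, 1506, 1508, 236, 635, 230,
          125, 131, 30, 765, 408, 347, 412]


def percent_within(xs, k):
    """Percentage of observations lying in the range mean ± k standard deviations."""
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    inside = sum(1 for x in xs if abs(x - xbar) <= k * s)
    return 100 * inside / n


for k in (1, 2, 3):
    # The bound 1 - 1/k^2 is only informative for k > 1.
    bound = 100 * (1 - 1 / k ** 2) if k > 1 else None
    print(k, percent_within(hardap, k), bound)
```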

Example (continued)
If we take natural logs, the data are approximately normal.
[Figure: histogram; vertical axis Frequency (0 to 8), horizontal axis Ln(Peak Inflow) (3 to 9)]
Fig. 1.5: The log transformed Hardap Dam data
For the transformed data y = ln(x) we have ȳ = 5.95 and s = 1.32, and the 68-95-99.7 rule is quite accurate.
Range      Minimum   Normal   Actual
ȳ ± s      -         68%      64%
ȳ ± 2s     75%       95%      96%
ȳ ± 3s     88.9%     99.7%    100%

The coefficient of variation
For positive variables (such as weights and volumes), it is sometimes useful to express the dispersion in relative terms.
The coefficient of variation is defined by
\[ \frac{s}{\bar{x}} . \]
Example. For the (untransformed) Hardap Dam data, the coefficient of variation is
\[ \frac{s}{\bar{x}} = \frac{1312.8}{858.84} = 152.9\% . \]

The interquartile range
The lower quartile, or 25th percentile, of a data set x1, x2, . . . , xn can be found as follows:
1. Sort the data into ascending order, x(1), x(2), . . . , x(n).
2. The position of the lower quartile in the ordered sample is q = (n + 1)/4.
3. The lower quartile is determined as follows:
• If q is an integer, then
LQ = x(q);
• If q = r + 0.25 then
LQ = (3x(r) + x(r+1))/4;
• If q = r + 0.5 then
LQ = (x(r) + x(r+1))/2;
• If q = r + 0.75 then
LQ = (x(r) + 3x(r+1))/4.

The upper quartile is defined similarly to have position 3(n + 1)/4 in the ordered sample.
The interquartile range is defined by
IQR = UQ − LQ.
The interquartile range is an alternative measure of dispersion.
It can be interpreted as the range of the middle 50% of the data.
Example. For the Hardap Dam data, n = 25 and the positions of the quartiles are 6.5 and 19.5.
The sorted data are:
30 44 46 83 125 131 197 230 236
347 364 408 412 457 477 554 635 765
782 911 1506 1508 1864 3259 6100
The lower quartile is
\[ LQ = \frac{131 + 197}{2} = 164 . \]
The upper quartile is
\[ UQ = \frac{782 + 911}{2} = 846.5 . \]
The interquartile range is
IQR = 846.5 − 164 = 682.5.
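The four cases in the quartile rule are all instances of linear interpolation between neighbouring order statistics, so they can be handled by one function. The Python sketch below assumes n ≥ 3 so that both quartile positions fall inside the sample; the function names are illustrative. Statistical packages often use other quartile conventions and may give slightly different answers.

```python
def value_at_position(sorted_xs, q):
    """Value at the (possibly fractional) 1-based position q, by linear interpolation."""
    r = int(q)       # integer part of the position
    frac = q - r     # 0, 0.25, 0.5 or 0.75 for quartile positions
    if frac == 0:
        return sorted_xs[r - 1]
    # Weighted average of x(r) and x(r+1); shift by 1 for 0-based indexing.
    return (1 - frac) * sorted_xs[r - 1] + frac * sorted_xs[r]


def quartiles(xs):
    """Lower and upper quartiles at positions (n + 1)/4 and 3(n + 1)/4."""
    s = sorted(xs)
    n = len(s)
    if n < 3:
        raise ValueError("this sketch assumes at least 3 observations")
    return value_at_position(s, (n + 1) / 4), value_at_position(s, 3 * (n + 1) / 4)


hardap = [1864, 44, 46, 364, 911, 83, 477, 457, 782,
          6100, 197, 3259, 554, 1506, 1508, 236, 635, 230,
          125, 131, 30, 765, 408, 347, 412]

lq, uq = quartiles(hardap)
iqr = uq - lq
print(lq, uq, iqr)
```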
The range
For very small data sets, the range = max − min is sometimes used as a crude measure of dispersion.
The range is not generally useful as a measure of dispersion because it depends on the sample size.

Skewness
The coefficient of skewness is defined by
\[ g = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n-1)s^3} . \]
For symmetric distributions, g ≈ 0.
A positively skewed distribution is one with a long right-hand tail. In this case we expect g > 0.
A negatively skewed distribution is one with a long left-hand tail. In this case we expect g < 0.
Some examples are shown in Figure 1.6.

[Figure: four histograms with panel titles “Hardap data: Skewness=2.74”, “Skewness=0.03”, “Skewness=−1.21” and “Skewness=0.90”; vertical axes Frequency]
Fig. 1.6: Skewness for the Hardap data and 3 artificial data sets
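The coefficient of skewness can be computed directly from its definition with divisor (n − 1)s³. The following Python sketch does so; the function name is illustrative. Software packages use slightly different divisors for this statistic, so a value computed this way need not match a figure's panel title exactly.

```python
import math


def skewness(xs):
    """Coefficient of skewness: sum of cubed deviations over (n - 1) * s**3."""
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    return sum((x - xbar) ** 3 for x in xs) / ((n - 1) * s ** 3)


hardap = [1864, 44, 46, 364, 911, 83, 477, 457, 782,
          6100, 197, 3259, 554, 1506, 1508, 236, 635, 230,
          125, 131, 30, 765, 408, 347, 412]

g_hardap = skewness(hardap)                   # long right-hand tail
g_mirror = skewness([-x for x in hardap])     # mirror image: long left-hand tail
g_symmetric = skewness([1, 2, 3, 4, 5])       # symmetric data

print(g_hardap, g_mirror, g_symmetric)
```

Reflecting the data about zero changes the sign of g but not its size, which matches the interpretation of positive and negative skewness.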

The boxplot
The boxplot is a graphical representation of the data based on the median and quartiles.
[Figure: boxplot; vertical axis from 0 to 5000]
Fig. 1.7: Boxplot of the Hardap Dam data
• The box extends from LQ to UQ.
• The median is shown within the box.
• The upper whisker extends to the largest data value that does not exceed UQ + 1.5 IQR.
• The lower whisker extends to the smallest data value that is not below LQ − 1.5 IQR.
• Any data values beyond these limits are plotted separately.
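The whisker rule can be illustrated with the quartiles found earlier for the Hardap Dam data (LQ = 164, UQ = 846.5). The Python sketch below uses illustrative variable names. Plotting software may compute the quartiles by a different convention, so a plotted boxplot can flag a slightly different set of points.

```python
hardap = [1864, 44, 46, 364, 911, 83, 477, 457, 782,
          6100, 197, 3259, 554, 1506, 1508, 236, 635, 230,
          125, 131, 30, 765, 408, 347, 412]

lq, uq = 164.0, 846.5          # quartiles from the worked example in the notes
iqr = uq - lq

lower_limit = lq - 1.5 * iqr
upper_limit = uq + 1.5 * iqr

# Whiskers reach the most extreme data values that are still within the limits.
inside = [x for x in hardap if lower_limit <= x <= upper_limit]
lower_whisker, upper_whisker = min(inside), max(inside)

# Values beyond the limits are plotted separately.
plotted_separately = sorted(x for x in hardap if x < lower_limit or x > upper_limit)

print(lower_whisker, upper_whisker, plotted_separately)
```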

Boxplots provide less information than histograms.
For a single data set it is usually more useful to consider the histogram.
Boxplots are very useful for comparing several data sets.
Example. Impact strength (in foot pounds) was measured for random samples of insulation material chosen from 5 batches.
[Figure: side-by-side boxplots; vertical axis Impact Strength (foot pounds), horizontal axis Batch (1 to 5)]
Fig. 1.8: Impact strength for five batches of insulation.

Linear Transformations
Consider data x1, x2, . . . , xn and let x̄ and sx be the sample mean and sample standard deviation respectively.
Suppose now that a linear transformation is applied to compute a new variable yi = axi + b.
It can be checked that
ȳ = ax̄ + b and sy = |a|sx.
Similar rules apply for the median and IQR:
My = aMx + b and IQRy = |a|IQRx.
Such transformations often arise because of unit conversions. For example, temperature can be measured in kelvins, degrees Celsius or degrees Fahrenheit.
For non-linear transformations (e.g. y = ln(x)) there is generally no simple formula.
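The rules for linear transformations can be checked numerically. The Python sketch below converts some hypothetical Celsius readings (invented for illustration, not data from the notes) to Fahrenheit with a = 9/5 and b = 32, and also applies a transformation with negative a to show why the absolute value is needed.

```python
import statistics

celsius = [12.0, 15.5, 19.0, 21.5, 30.0]   # hypothetical readings

a, b = 9 / 5, 32
fahrenheit = [a * c + b for c in celsius]

mean_c, sd_c, med_c = (statistics.mean(celsius), statistics.stdev(celsius),
                       statistics.median(celsius))
mean_f, sd_f, med_f = (statistics.mean(fahrenheit), statistics.stdev(fahrenheit),
                       statistics.median(fahrenheit))

# A transformation with a negative slope: the spread is scaled by |a|, not a.
a_neg, b_neg = -2, 1
flipped = [a_neg * c + b_neg for c in celsius]
sd_flipped = statistics.stdev(flipped)

print(mean_f, a * mean_c + b)
print(sd_f, abs(a) * sd_c)
print(sd_flipped, abs(a_neg) * sd_c)
```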

1.6 Covariance and Correlation
Example. Sixteen air samples from Herald Square in New York City were analysed for carbon monoxide (X, ppm) and benzo[a]pyrene (Y, µg/10⁴ m³). [data from “Carcinogenic air pollutants in relation to automobile traffic in New York City”, Environmental Science and Technology, 1971, 145-50]
The data are shown below.
i xi yi i xi yi
1 3 5 9 5 13
2 15 1 10 12 57
3 19 8 11 6 15
4 7 9 12 20 60
5 5 10 13 11 73
6 6 16 14 13 81
7 10 39 15 5 22
8 13 40 16 10 95

A scatter plot of the data is shown below and the summary statistics are
x̄ = 10, sx = 5.125, ȳ = 34, sy = 30.325.
[Figure: scatter plot; vertical axis benzo[a]pyrene (0 to 80), horizontal axis carbon monoxide (5 to 20)]
Fig. 1.9: New York air pollution data

The Sample Covariance
The sample covariance of X and Y is defined by
\[ s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) . \]
It is used to construct a measure of the association between X and Y.
The scatter plot of y vs x for the New York air pollution data reveals a positive association.
That is, small values of y tend to go with small values of x and large values of y tend to go with large values of x.
The sample covariance is sxy = 54.4.
In general:
For positive association, sxy > 0.
For negative association, sxy < 0.
When there is no association, sxy ≈ 0.
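The quoted value of the sample covariance can be reproduced from the data table. This is a minimal Python sketch; the list names `co` and `bap` and the function name are illustrative.

```python
# New York air pollution data, read from the table in the notes (i = 1, ..., 16).
co = [3, 15, 19, 7, 5, 6, 10, 13, 5, 12, 6, 20, 11, 13, 5, 10]       # x, ppm
bap = [5, 1, 8, 9, 10, 16, 39, 40, 13, 57, 15, 60, 73, 81, 22, 95]   # y


def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)


s_xy = sample_covariance(co, bap)
print(s_xy)
```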

To see how the formula for the sample covariance leads to this interpretation, divide the scatter plot into four quadrants by drawing lines at x̄ and ȳ.
[Figure: scatter plot of benzo[a]pyrene against carbon monoxide, with each point marked + or − according to the sign of its product (xi − x̄)(yi − ȳ)]
Fig. 1.10: New York air pollution data

Consider a point in the lower right quadrant. The product (xi − x̄)(yi − ȳ) will be negative.
The same will be true for points in the upper left quadrant.
Similarly, for points in the lower left and upper right quadrants the product will be positive.
When there is a positive association, there will be more points in the lower left and upper right quadrants and the sample covariance will be positive.
When there is a negative association, there will be more points in the upper left and lower right quadrants and the sample covariance will be negative.
When there is no association, the positive and negative terms will roughly cancel, so the sample covariance will be approximately zero.

The sample correlation coefficient
The intuitive explanation of the sample covariance suggests that strong associations lead to large values of |sxy|.
However, this interpretation is not valid because sxy also depends on the scale of the data. If we multiply each of the y-values by the same constant c (for example, to change units of measurement), the nature of the association will not change but the sample covariance will be multiplied by the same constant.
For this reason, the sample correlation is defined by
\[ r = \frac{s_{xy}}{s_x s_y} . \]
Example. For the air pollution data, we found
sx = 5.125, sy = 30.325 and sxy = 54.4.
The sample correlation is therefore
\[ r = \frac{54.4}{5.125 \times 30.325} = 0.35 . \]
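The scale argument can be checked numerically: rescaling y by a positive constant multiplies the covariance by that constant but leaves the correlation unchanged. The Python sketch below uses the air pollution data; the function names and the factor 1000 are illustrative choices.

```python
import math

co = [3, 15, 19, 7, 5, 6, 10, 13, 5, 12, 6, 20, 11, 13, 5, 10]
bap = [5, 1, 8, 9, 10, 16, 39, 40, 13, 57, 15, 60, 73, 81, 22, 95]


def covariance(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """r = s_xy / (s_x * s_y); note that s_x**2 is the covariance of x with itself."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


r = correlation(co, bap)

# Change the units of y by a factor of 1000 (an arbitrary illustrative constant).
bap_rescaled = [1000 * y for y in bap]
cov_rescaled = covariance(co, bap_rescaled)
r_rescaled = correlation(co, bap_rescaled)

print(round(r, 2), cov_rescaled, round(r_rescaled, 2))
```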

The sample correlation is a dimensionless quantity with the following properties:
• −1 ≤ r ≤ 1;
• r = 1 if and only if all points lie exactly on a straight line with positive slope;
• r = −1 if and only if all points lie exactly on a straight line with negative slope.
The following scatter plots are indicative of the strength of association for various values of r.

[Figure: four scatter plots with panel titles r=0, r=0.1, r=−0.2 and r=0.3]
Fig. 1.11: Example scatter plots and correlations.
[Figure: four scatter plots with panel titles r=−0.4, r=0.5, r=−0.6 and r=0.7]
[Figure: four scatter plots with panel titles r=−0.8, r=0.9, r=−0.95 and r=0.99]

Correlation is a measure of linear association
The correlation coefficient measures the degree of linear association in the data.
It will not necessarily detect a non-linear relationship.
[Figure: scatter plot with panel title r=0.032]
Fig. 1.14: Scatter plot with strong quadratic relationship and r ≈ 0.
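A small artificial example makes the point concrete. In the Python sketch below, y is an exact quadratic function of x, yet the sample correlation is zero because the x-values are symmetric about their mean. The data are invented for illustration.

```python
import math


def correlation(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    # The common divisor n - 1 cancels between numerator and denominator.
    return sxy / math.sqrt(sxx * syy)


x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]        # exact quadratic relationship

r_quadratic = correlation(x, y)
print(r_quadratic)
```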

Correlation is sensitive to outliers
Figure 1.15 shows a scatter plot of a data set containing a single outlier. With the outlier included, r = 0.9, but if it is omitted, r = −1.
[Figure: scatter plot with panel title r=0.9]
Fig. 1.15: Scatter plot with outlier.
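The same effect can be reproduced with invented data: ten points lying exactly on a line with negative slope, plus one distant point. This Python sketch is illustrative only and does not use the data behind the figure.

```python
import math


def correlation(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Ten points exactly on the line y = 10 - x (negative slope).
x = list(range(1, 11))
y = [10 - v for v in x]
r_without = correlation(x, y)

# Add a single outlier far from the rest of the data.
r_with = correlation(x + [100], y + [100])

print(r_without, r_with)
```

A single extreme point is enough to change the sign of r here, which is why a scatter plot should be inspected before interpreting a correlation.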

1.7 Time series
Statistical data often occur when the same variable is recorded at several different times.
For example, the production of clay bricks (in millions) was recorded monthly from Jan 1956 to August 1995.
Such data are referred to as time series.
Time series may arise from:
• A discrete process. For example, the physicist Simon Newcomb made 66 measurements over three consecutive days.
• Aggregation of a continuous process. For example, monthly rainfall data.
• Subsampling of a continuous process. For example, daily exchange rate data.
A time series plot is obtained by plotting the data against time with consecutive points connected by line segments.

Example. The time series plot of the brick production is shown below.
[Figure: time series plot titled “Australian Monthly Brick Production (in Millions)”; vertical axis bricks (50 to 200), horizontal axis Time (1960 to 1990)]
Fig. 1.16: Brick production time series.

The brick time series plot allows us to identify many interesting features such as
• A clearly defined annual pattern (December and January are always low);
• An increasing trend until about 1973, corresponding to steady growth;
• A precipitous drop around 1975, resulting from the world recession arising from the oil crisis.

A very simple model for a time series is that it is composed of several components:
• A trend;
• A periodic component;
• Random noise.
[Figure: four panels titled “Trend Term”, “Periodic Term”, “Noise Term” and “Observed Series”, each plotted against Time (0 to 100)]
Fig. 1.17: Composition of a hypothetical series
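A hypothetical series of this kind can be simulated by adding the three components. In the Python sketch below, the trend slope, the period of 12, the noise standard deviation and the random seed are all arbitrary assumptions, not values taken from the figure.

```python
import math
import random

random.seed(0)   # fixed seed so the simulated noise is reproducible

n = 100
trend = [0.04 * t for t in range(n)]                           # linear trend
periodic = [math.sin(2 * math.pi * t / 12) for t in range(n)]  # period of 12 time steps
noise = [random.gauss(0, 1) for _ in range(n)]                 # independent random noise

# Observed series = trend + periodic component + noise.
observed = [tr + p + e for tr, p, e in zip(trend, periodic, noise)]

print(observed[:5])
```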

In this decomposition the trend and periodic effect are thought of as deterministic.
The random component of a time series may exhibit serial dependence, that is, the value at a given time is influenced by previous values of the series.
A series with no such dependence is frequently called “white noise”. (We will give a more precise definition later.)
The dependence in a series {xt} can be measured by calculating the sample autocorrelation.
The sample autocovariance at lag k is defined by
\[ g_k = \frac{1}{n} \sum_{i=k+1}^{n} (x_i - \bar{x})(x_{i-k} - \bar{x}) . \]
The sample autocorrelation at lag k is defined by
rk = gk / g0.
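The two definitions translate directly into code. The Python sketch below uses illustrative function names and an artificial alternating series, for which the lag-1 autocorrelation is strongly negative.

```python
def autocovariance(xs, k):
    """Sample autocovariance at lag k, with divisor n as in the definition above."""
    n = len(xs)
    xbar = sum(xs) / n
    # 0-based indices i = k, ..., n - 1 correspond to i = k + 1, ..., n in the formula.
    return sum((xs[i] - xbar) * (xs[i - k] - xbar) for i in range(k, n)) / n


def autocorrelation(xs, k):
    """Sample autocorrelation r_k = g_k / g_0."""
    return autocovariance(xs, k) / autocovariance(xs, 0)


# Artificial series that alternates between 1 and -1 (20 values).
alternating = [1, -1] * 10

r0 = autocorrelation(alternating, 0)
r1 = autocorrelation(alternating, 1)
r2 = autocorrelation(alternating, 2)

print(r0, r1, r2)
```

By construction r0 is always 1, so an autocorrelation plot is read from lag 1 onwards.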

[Figure: time series plot titled “White Noise” (series z1 against Time, 0 to 100) and its ACF plotted against Lag (0 to 20)]
Fig. 1.18: Time series plot and acf function for white noise. Note that the only clearly non-zero correlation is at lag 0.

[Figure: time series plot titled “Correlated Noise Series” (series z2 against Time, 0 to 100) and its ACF plotted against Lag (0 to 20)]
Fig. 1.19: Time series plot and acf function for serial dependence. Note the presence of positive autocorrelation at lags 1, 2, 3.

