
STATS 7053

STATISTICS IN ENGINEERING

Lecturer: A/Prof Gary Glonek


School of Mathematical Sciences
The University of Adelaide

Semester 2, 2010


© School of Mathematical Sciences

1 Review of Descriptive Statistics and Probability

1.1 Types of variables

When statistical data are recorded, it is useful to consider various summary statistics and graphs. The most appropriate method depends on the type of variable, and it is necessary to make the following distinctions.

Categorical Variables: indicate to which category an individual belongs. For example, a water pipe may be concrete, iron, or polypropylene.

Numerical Variables: may be either discrete or continuous.

Continuous variables can take any value in a given range. For example, the compressive strength of concrete pavers.

Discrete variables can take values only in a finite (or countable) set. For example, the number of water snails in a 1 m² area of reeds.

c STATS 7053 Statistics in Engineering 2010 1-1
Remarks

Categorical variables are sometimes coded numerically, but this does not make them numerical.

The classification of variables can be refined further, but we will not consider the details.



1.2 Graphical summaries

Line charts

Used to describe the distribution of a discrete variable.

The line chart is a series of vertical line segments with height corresponding to relative frequency.

Example. 143 one-litre air samples were taken 1.5 m above floor level in a certain workshop, and the number of asbestos fibres was recorded for each sample. The data and line chart are given below.

Number of Fibres Frequency Relative Frequency


0 34 0.24
1 46 0.32
2 38 0.27
3 19 0.13
4 4 0.03
5 2 0.01
6+ 0 0.00
Total 143 1.00
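
The relative frequencies in the table are each frequency divided by the total number of samples. A minimal Python sketch of that calculation follows; the variable names are illustrative and not part of the notes.

```python
# Frequencies from the asbestos fibre table (0 to 5 fibres per sample).
frequencies = {0: 34, 1: 46, 2: 38, 3: 19, 4: 4, 5: 2}

n = sum(frequencies.values())  # total number of air samples

# Relative frequency = frequency / total, rounded to 2 decimals as in the table.
relative = {fibres: round(f / n, 2) for fibres, f in frequencies.items()}

for fibres, rel in relative.items():
    print(fibres, rel)
```

The heights of the line segments in the line chart are these relative frequencies.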

[Line chart: relative frequency (vertical axis) against number of fibres (horizontal axis).]

Fig. 1.1: Line chart for the asbestos fibre data



Histograms

Histograms can be constructed for continuous variables as follows.

1. Group the data into class intervals.

2. Calculate the relative frequency for each class interval.

3. The histogram is a series of rectangles such that:

• the bases are the class intervals;

• the height of each rectangle is

$$H = \frac{\text{relative frequency}}{\text{class interval length}}.$$


Example Compressive strengths X (in MPa) for a random sample of 200 concrete pavers are summarised below.

Interval Frequency Relative frequency Height


40 < X ≤ 45 10 0.05 0.01
45 < X ≤ 50 10 0.05 0.01
50 < X ≤ 55 30 0.15 0.03
55 < X ≤ 60 40 0.20 0.04
60 < X ≤ 65 50 0.25 0.05
65 < X ≤ 70 30 0.15 0.03
70 < X ≤ 75 20 0.10 0.02
75 < X ≤ 80 10 0.05 0.01
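
The Height column can be reproduced from the frequencies alone. The following Python sketch applies H = relative frequency / class interval length to each row; the variable names are illustrative. It also checks that, with this definition of height, the rectangles have total area 1.

```python
# Class intervals (lower, upper] in MPa and their frequencies, from the table.
intervals = [(40, 45), (45, 50), (50, 55), (55, 60),
             (60, 65), (65, 70), (70, 75), (75, 80)]
frequencies = [10, 10, 30, 40, 50, 30, 20, 10]

n = sum(frequencies)  # sample size

heights = []
for (lower, upper), f in zip(intervals, frequencies):
    rel = f / n                            # relative frequency
    heights.append(rel / (upper - lower))  # height = rel. freq. / interval length

# Total area of the rectangles: height times width, summed over the intervals.
area = sum(h * (u - l) for h, (l, u) in zip(heights, intervals))

print([round(h, 2) for h in heights], round(area, 6))
```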


[Histogram: frequency (vertical axis) against compressive strength (horizontal axis).]

Fig. 1.2: Compressive strength for 200 pavers


Remarks

When the class intervals are all of the same length, as in the present example, the heights of the rectangles are simply proportional to frequency.

Computer packages generally use equal length class intervals and often use frequency for the vertical scale.

1.3 Summary statistics

Consider a sample of observations $x_1, x_2, \ldots, x_n$ of a numerical variable X. It is often useful to consider numerical summaries of the data as well as graphical displays. The most important features of a data set are often the location and dispersion.

1.4 Measures of location

The sample mean

The sample mean of the numbers $x_1, x_2, \ldots, x_n$ is defined by

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

It is the most commonly used measure of location.

It can be interpreted as the “centre of mass” of the distribution.

The sample median

The sample median, M, is the “middle value” in the distribution.

It is calculated as follows:

1. Sort the data in ascending order. Let $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ represent the sorted data, so that $x_{(1)}$ is the minimum value, etc.

2. Let m = (n + 1)/2.

3. If n is odd, so that m is an integer, the sample median is defined to be

$$M = x_{(m)}.$$

4. If n is even, the sample median is defined to be

$$M = \frac{x_{(m-0.5)} + x_{(m+0.5)}}{2}.$$

That is, the average of the two middle values.


The mode

The mode is defined to be the most frequently occurring value in the data set.

Although it is superficially appealing for its simplicity, the mode is rarely used in practice.

One difficulty with the mode is that there may not be a unique most frequently occurring value.

Example The maximum annual floodpeak inflows (m³ s⁻¹) to the Hardap Dam in Namibia for the years 1962-1987 are shown below.
1864 44 46 364 911 83 477 457 782
6100 197 3259 554 1506 1508 236 635 230
125 131 30 765 408 347 412
[Histogram: frequency (vertical axis) against maximum inflow (horizontal axis).]

Fig. 1.3: Annual maximum inflow for the Hardap Dam, 1962-87

The sample mean is given by

$$\bar{x} = \frac{1}{25}\{1864 + 44 + \ldots + 412\} = \frac{21471}{25} = 858.84.$$

To find the median:

1. The sorted data are

30 44 46 83 125 131 197 230 236


347 364 408 412 457 477 554 635 765
782 911 1506 1508 1864 3259 6100

2. m = (25 + 1)/2 = 13, so M = x(13) = 412.
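
The two calculations above can be checked with a short Python sketch. The function names are illustrative; the median function follows the odd/even rule given earlier, translated to Python's 0-based indexing.

```python
hardap = [1864, 44, 46, 364, 911, 83, 477, 457, 782,
          6100, 197, 3259, 554, 1506, 1508, 236, 635, 230,
          125, 131, 30, 765, 408, 347, 412]

def sample_mean(data):
    return sum(data) / len(data)

def sample_median(data):
    x = sorted(data)          # x[0] corresponds to x_(1), the minimum
    n = len(x)
    if n % 2 == 1:
        m = (n + 1) // 2      # 1-based position of the middle value
        return x[m - 1]
    # n even: average the two middle values x_(n/2) and x_(n/2 + 1)
    return (x[n // 2 - 1] + x[n // 2]) / 2

print(sample_mean(hardap), sample_median(hardap))
```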



Comparison of sample mean and sample median

For a symmetric distribution x̄ ≈ M .

For a positively skewed distribution x̄ > M (e.g. the Hardap Dam data).

For a negatively skewed distribution x̄ < M .

The sample mean is sensitive to outliers in the data. Even a single very large or very small value can change the mean appreciably.

The sample median is resistant to outliers.


1.5 Measures of Dispersion

The sample standard deviation

The sample variance is defined by

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

It is the “average squared deviation from the sample mean x̄”.

Note: The divisor n − 1 is used for technical reasons that will be discussed later. It does not change the substantive interpretation and makes little difference numerically unless n is very small.

Because the sample variance measures squared deviations, the units are squares of the original units of measurement.

For this reason, the sample standard deviation is defined by

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

Example

Recall the Hardap Dam data are given by


1864 44 46 364 911 83 477 457 782
6100 197 3259 554 1506 1508 236 635 230
125 131 30 765 408 347 412

and have sample mean x̄ = 858.84.

The sample variance is therefore

$$s^2 = \frac{1}{25-1}\left\{(1864 - 858.84)^2 + (44 - 858.84)^2 + \ldots + (412 - 858.84)^2\right\} = \frac{41361857}{24} = 1723411.$$

The sample standard deviation is

$$s = \sqrt{1723411} = 1312.8.$$
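
As a check on the arithmetic, the following Python sketch computes the sum of squared deviations, the sample variance (divisor n − 1) and the sample standard deviation directly from the definitions. The variable names are illustrative.

```python
import math

hardap = [1864, 44, 46, 364, 911, 83, 477, 457, 782,
          6100, 197, 3259, 554, 1506, 1508, 236, 635, 230,
          125, 131, 30, 765, 408, 347, 412]

n = len(hardap)
mean = sum(hardap) / n

# Sum of squared deviations from the sample mean.
ss = sum((x - mean) ** 2 for x in hardap)

variance = ss / (n - 1)   # sample variance, divisor n - 1
sd = math.sqrt(variance)  # sample standard deviation, in the original units

print(round(ss), round(variance), round(sd, 1))
```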


Properties of s

• s ≥ 0.

• s = 0 if and only if $x_1 = x_2 = \ldots = x_n$.

• For any distribution, the following bounds apply (by Chebyshev's inequality, at least 1 − 1/k² of the data lie in the range x̄ ± ks):

– At least 75% of the data lie in the range x̄ ± 2s;

– At least 88.9% of the data lie in the range x̄ ± 3s.

• When the data are approximately normal, the 68-95-99.7 rule applies:

– Approximately 68% of the data lie in the range x̄ ± s;

– Approximately 95% of the data lie in the range x̄ ± 2s;

– Approximately 99.7% of the data lie in the range x̄ ± 3s.

[Histogram: frequency (vertical axis) against value, horizontal axis from −3 to 3.]

Fig. 1.4: A “typical” normal sample


Example
For the Hardap Dam data, x̄ = 858.84 and s = 1312.8. Shown below are the percentages of observations in various ranges.

Range Minimum Normal Actual


x̄ ± s - 68% 92%
x̄ ± 2s 75% 95% 96%
x̄ ± 3s 88.9% 99.7% 96%

Remarks

• As might be expected, the normal approximation is very inaccurate because the data are highly skewed (non-normal);

• The general lower bounds are satisfied;

• Another consequence of the skewness is that the left-hand endpoints of the ranges are negative, even though an inflow cannot be negative.

Example (continued)
If we take natural logs, the data are approximately normal.

[Histogram: frequency (vertical axis) against Ln(Peak Inflow) (horizontal axis).]

Fig. 1.5: The log transformed Hardap Dam data

For the transformed data y = ln(x) we have ȳ = 5.97 and s = 1.32, and the 68-95-99.7 rule is quite accurate.

Range Minimum Normal Actual


ȳ ± s - 68% 64%
ȳ ± 2s 75% 95% 96%
ȳ ± 3s 88.9% 99.7% 100%
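
The "Actual" columns in the two tables (raw and log-transformed data) can be reproduced by counting the observations within k standard deviations of the mean. A minimal Python sketch, with an illustrative helper named `coverage`:

```python
import math

hardap = [1864, 44, 46, 364, 911, 83, 477, 457, 782,
          6100, 197, 3259, 554, 1506, 1508, 236, 635, 230,
          125, 131, 30, 765, 408, 347, 412]

def coverage(data, k):
    """Percentage of observations within k sample standard deviations of the mean."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    inside = sum(1 for x in data if mean - k * s <= x <= mean + k * s)
    return 100 * inside / n

logs = [math.log(x) for x in hardap]  # natural logs of the inflows

raw_cov = [coverage(hardap, k) for k in (1, 2, 3)]
log_cov = [coverage(logs, k) for k in (1, 2, 3)]

print(raw_cov, log_cov)
```

With only 25 observations each percentage is a multiple of 4%, so agreement with the normal figures can only ever be approximate.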


The coefficient of variation

For positive variables (such as weights and volumes) it is sometimes useful to express the dispersion in relative terms.

The coefficient of variation is defined by

$$\frac{s}{\bar{x}}.$$

Example For the (untransformed) Hardap Dam data, the coefficient of variation is

$$\frac{s}{\bar{x}} = \frac{1312.8}{858.84} = 152.9\%.$$



The interquartile range

The lower quartile or 25th percentile of a data set $x_1, x_2, \ldots, x_n$ can be found as follows:

1. Sort the data into ascending order, $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$.

2. The position of the lower quartile in the ordered sample is q = (n + 1)/4.

3. The lower quartile is determined as follows, where r is an integer:

• If q is an integer, then $\mathrm{LQ} = x_{(q)}$;

• If q = r + 0.25 then $\mathrm{LQ} = (3x_{(r)} + x_{(r+1)})/4$;

• If q = r + 0.5 then $\mathrm{LQ} = (x_{(r)} + x_{(r+1)})/2$;

• If q = r + 0.75 then $\mathrm{LQ} = (x_{(r)} + 3x_{(r+1)})/4$.

The upper quartile is defined similarly to have position 3(n + 1)/4 in the ordered sample.

The interquartile range is defined by

IQR = UQ − LQ.

The interquartile range is an alternative measure of dispersion.

It can be interpreted as the range of the middle 50% of the data.

Example For the Hardap Dam data, n = 25 and the positions of the quartiles are 6.5 and 19.5.

The sorted data are:

30 44 46 83 125 131 197 230 236
347 364 408 412 457 477 554 635 765
782 911 1506 1508 1864 3259 6100

The lower quartile is

$$\mathrm{LQ} = \frac{131 + 197}{2} = 164.$$

The upper quartile is

$$\mathrm{UQ} = \frac{782 + 911}{2} = 846.5.$$

The interquartile range is

IQR = 846.5 − 164 = 682.5.
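
The four cases in the quartile rule amount to linear interpolation between neighbouring order statistics. The Python sketch below implements the rule in that form; the function names are illustrative, and it assumes n ≥ 3 so that both positions fall inside the sample. Statistical packages often use slightly different quartile conventions, so their output need not match exactly.

```python
import math

def value_at_position(sorted_data, q):
    """Value at 1-based (possibly fractional) position q, by linear interpolation."""
    r = math.floor(q)
    frac = q - r
    if frac == 0:
        return sorted_data[r - 1]
    # e.g. frac = 0.25 gives (3 x_(r) + x_(r+1)) / 4
    return (1 - frac) * sorted_data[r - 1] + frac * sorted_data[r]

def quartiles(data):
    """Return (LQ, UQ, IQR) using positions (n + 1)/4 and 3(n + 1)/4; assumes n >= 3."""
    x = sorted(data)
    n = len(x)
    lq = value_at_position(x, (n + 1) / 4)
    uq = value_at_position(x, 3 * (n + 1) / 4)
    return lq, uq, uq - lq

hardap = [1864, 44, 46, 364, 911, 83, 477, 457, 782,
          6100, 197, 3259, 554, 1506, 1508, 236, 635, 230,
          125, 131, 30, 765, 408, 347, 412]

lq, uq, iqr = quartiles(hardap)
print(lq, uq, iqr)
```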

The range

For very small data sets, the range = max − min is sometimes used as a crude measure of dispersion.

The range is not generally useful as a measure of dispersion because it depends on the sample size.


Skewness

The coefficient of skewness is defined by

$$g = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n-1)s^3}.$$

For symmetric distributions, g ≈ 0.

A positively skewed distribution is one with a long right-hand tail. In this case we expect g > 0.

A negatively skewed distribution is one with a long left-hand tail. In this case we expect g < 0.

Some examples are shown in Figure 1.6.
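
A minimal Python sketch of the coefficient of skewness, using the divisor (n − 1)s³ exactly as in the definition above. The function name is illustrative. Software packages use several slightly different divisor conventions, so a value reported by a package may differ a little from the one this sketch produces.

```python
import math

def skewness(data):
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    # Sum of cubed deviations divided by (n - 1) s^3.
    return sum((x - mean) ** 3 for x in data) / ((n - 1) * s ** 3)

hardap = [1864, 44, 46, 364, 911, 83, 477, 457, 782,
          6100, 197, 3259, 554, 1506, 1508, 236, 635, 230,
          125, 131, 30, 765, 408, 347, 412]

g_hardap = skewness(hardap)              # long right-hand tail, so positive
g_symmetric = skewness([1, 2, 3, 4, 5])  # symmetric data, so zero
g_mirrored = skewness([-x for x in hardap])  # mirror image, so negative

print(g_hardap, g_symmetric, g_mirrored)
```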


[Four histograms (frequency on the vertical axis): the Hardap data, skewness = 2.74, and three artificial data sets with skewness = 0.03, −1.21 and 0.90.]

Fig. 1.6: Skewness for the Hardap data and 3 artificial data sets


The boxplot

The boxplot is a graphical representation of the data based on the median and quartiles.

[Boxplot: vertical axis from 0 to 5000.]

Fig. 1.7: Boxplot of the Hardap Dam data


• The box extends from LQ to UQ.

• The median is shown within the box.

• The upper whisker extends to the largest data value within UQ + 1.5 IQR.

• The lower whisker extends to the smallest data value within LQ − 1.5 IQR.

• Any data values beyond these limits are plotted separately.
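
The whisker rules can be illustrated on the Hardap Dam data. The sketch below takes LQ = 164 and UQ = 846.5 from the worked example and finds the limits, the whisker ends (largest value at or below the upper limit, smallest value at or above the lower limit) and the points plotted separately. The variable names are illustrative.

```python
hardap = [1864, 44, 46, 364, 911, 83, 477, 457, 782,
          6100, 197, 3259, 554, 1506, 1508, 236, 635, 230,
          125, 131, 30, 765, 408, 347, 412]

lq, uq = 164.0, 846.5  # quartiles from the worked example
iqr = uq - lq

upper_limit = uq + 1.5 * iqr
lower_limit = lq - 1.5 * iqr

# Whiskers stop at the most extreme data values inside the limits.
upper_whisker = max(x for x in hardap if x <= upper_limit)
lower_whisker = min(x for x in hardap if x >= lower_limit)

# Values beyond the limits are plotted as separate points.
outliers = sorted(x for x in hardap if x < lower_limit or x > upper_limit)

print(lower_whisker, upper_whisker, outliers)
```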


Boxplots provide less information than histograms.

For a single data set it is usually more useful to consider the histogram.

Boxplots are very useful for comparing several data sets.

Example Impact strength (in foot pounds) was measured for random samples of insulation material chosen from 5 batches.

[Boxplots: impact strength (foot pounds) on the vertical axis, one boxplot for each of batches 1 to 5.]

Fig. 1.8: Impact strength for five batches of insulation.



Linear Transformations

Consider data $x_1, x_2, \ldots, x_n$ and let x̄ and $s_x$ be the sample mean and sample standard deviation respectively.

Suppose now that a linear transformation is applied to compute a new variable $y_i = a x_i + b$.

It can be checked that

$$\bar{y} = a\bar{x} + b \quad\text{and}\quad s_y = |a| s_x.$$

Similar rules apply for the median and IQR,

$$M_y = a M_x + b \quad\text{and}\quad \mathrm{IQR}_y = |a|\,\mathrm{IQR}_x.$$

Such transformations often arise because of unit conversions. For example, temperature can be measured in kelvin, degrees Celsius or degrees Fahrenheit.

For non-linear transformations (e.g. y = ln(x)) there is generally no simple formula.
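
The rules for linear transformations can be checked numerically. The sketch below uses a small set of hypothetical Celsius temperatures (invented for illustration, not data from the notes) and converts them to Fahrenheit with a = 1.8 and b = 32.

```python
import math
import statistics

celsius = [12.0, 15.5, 19.0, 22.5, 31.0]  # hypothetical temperatures
a, b = 1.8, 32.0                          # Fahrenheit = 1.8 * Celsius + 32
fahrenheit = [a * c + b for c in celsius]

# statistics.stdev uses the divisor n - 1, as in the definition of s.
mean_c, sd_c = statistics.mean(celsius), statistics.stdev(celsius)
mean_f, sd_f = statistics.mean(fahrenheit), statistics.stdev(fahrenheit)
med_c, med_f = statistics.median(celsius), statistics.median(fahrenheit)

print(math.isclose(mean_f, a * mean_c + b))  # mean transforms like the data
print(math.isclose(sd_f, abs(a) * sd_c))     # sd is scaled by |a|, unaffected by b
print(math.isclose(med_f, a * med_c + b))    # median transforms like the data
```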



1.6 Covariance and Correlation

Example. Sixteen air samples from Herald Square in New York City were analysed for carbon monoxide (X, ppm) and benzo[a]pyrene (Y, µg/10⁴ m³). [Data from "Carcinogenic air pollutants in relation to automobile traffic in New York City", Environmental Science and Technology, 1971, 145-50.]

The data are shown below.

i xi yi i xi yi
1 3 5 9 5 13
2 15 1 10 12 57
3 19 8 11 6 15
4 7 9 12 20 60
5 5 10 13 11 73
6 6 16 14 13 81
7 10 39 15 5 22
8 13 40 16 10 95


A scatter plot of the data is shown below and the summary statistics are

$$\bar{x} = 10, \quad s_x = 5.125, \quad \bar{y} = 34, \quad s_y = 30.325.$$

[Scatter plot: benzo[a]pyrene (vertical axis) against carbon monoxide (horizontal axis).]

Fig. 1.9: New York air pollution data



The Sample Covariance

The sample covariance of X and Y is defined by

$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).$$

It is used to construct a measure of the association between X and Y.

The scatter plot of y vs x for the New York air pollution data reveals a positive association.

That is, small values of y tend to go with small values of x and large values of y tend to go with large values of x.

The sample covariance is $s_{xy} = 54.4$.

In general:

For positive association, $s_{xy} > 0$.

For negative association, $s_{xy} < 0$.

When there is no association, $s_{xy} \approx 0$.



To see how the formula for the sample covariance leads to this interpretation, divide the scatter plot into four quadrants by drawing lines at x̄ and ȳ.

[Scatter plot: benzo[a]pyrene (vertical axis) against carbon monoxide (horizontal axis), with points marked + or −.]

Fig. 1.10: New York air pollution data



Consider a point in the lower right quadrant. The product $(x_i - \bar{x})(y_i - \bar{y})$ will be negative.

The same will be true for points in the upper left quadrant.

Similarly, for points in the lower left and upper right quadrants the product will be positive.

When there is a positive association, there will be more points in the lower left and upper right quadrants and the sample covariance will be positive.

When there is a negative association, there will be more points in the upper left and lower right quadrants and the sample covariance will be negative.

When there is no association, the positive and negative terms will roughly cancel, so the sample covariance will be approximately zero.


The sample correlation coefficient

The intuitive explanation of the sample covariance suggests that strong associations lead to large values of $|s_{xy}|$.

However, this interpretation is not valid because $s_{xy}$ also depends on the scale of the data. If we multiply each of the y-values by the same constant c (for example to change units of measurement), the nature of the association will not change but the sample covariance will be multiplied by the same constant.

For this reason, the sample correlation is defined by

$$r = \frac{s_{xy}}{s_x s_y}.$$

Example For the air pollution data, we found

$$s_x = 5.125, \quad s_y = 30.325 \quad\text{and}\quad s_{xy} = 54.4.$$

The sample correlation is therefore

$$r = \frac{54.4}{5.125 \times 30.325} = 0.35.$$
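
The covariance and correlation for the air pollution data can be reproduced with the Python sketch below. The function and variable names are illustrative; the constant c = 1000 is an arbitrary rescaling chosen to show that the covariance changes with the units while r does not.

```python
import math

# Carbon monoxide (x) and benzo[a]pyrene (y) for samples i = 1, ..., 16.
co = [3, 15, 19, 7, 5, 6, 10, 13, 5, 12, 6, 20, 11, 13, 5, 10]
bap = [5, 1, 8, 9, 10, 16, 39, 40, 13, 57, 15, 60, 73, 81, 22, 95]

def covariance(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

def correlation(x, y):
    # The covariance of a variable with itself is its sample variance.
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

s_xy = covariance(co, bap)
r = correlation(co, bap)

# Rescaling y (a change of units) scales the covariance but leaves r unchanged.
c = 1000
bap_scaled = [c * y for y in bap]

print(s_xy, round(r, 2))
print(covariance(co, bap_scaled), round(correlation(co, bap_scaled), 2))
```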


The sample correlation is a dimensionless quantity with the following properties:

• −1 ≤ r ≤ 1;

• r = 1 if and only if all points lie exactly on a straight line with positive slope;

• r = −1 if and only if all points lie exactly on a straight line with negative slope.

The following scatter plots are indicative of the strength of association for various values of r.


[Four scatter plots: r = 0, r = 0.1, r = −0.2 and r = 0.3.]

Fig. 1.11: Example scatter plots and correlations.


[Four scatter plots: r = −0.4, r = 0.5, r = −0.6 and r = 0.7.]

Fig. 1.12: Example scatter plots and correlations.


[Four scatter plots: r = −0.8, r = 0.9, r = −0.95 and r = 0.99.]

Fig. 1.13: Example scatter plots and correlations.



Correlation is a measure of linear association

The correlation coefficient measures the degree of linear association in the data.

It will not necessarily detect a non-linear relationship.
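
A small constructed example makes the point: in the Python sketch below y is determined exactly by x through y = x², yet the sample correlation is zero because the x-values are symmetric about their mean. The data are invented for illustration.

```python
import math

def correlation(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # the (n - 1) divisors cancel

x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]  # exact quadratic relationship

r = correlation(x, y)
print(r)
```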

[Scatter plot: r = 0.032.]

Fig. 1.14: Scatter plot with strong quadratic relationship and r ≈ 0.

Correlation is sensitive to outliers

Figure 1.15 shows a scatter plot of a data set containing a single outlier. With the outlier included, r = 0.9 but if it is omitted, r = −1.

[Scatter plot: r = 0.9.]

Fig. 1.15: Scatter plot with outlier.



1.7 Time series

Statistical data often occur when the same variable is recorded at several different times.

For example, the production of clay bricks (in millions) was recorded monthly from Jan 1956 to August 1995.

Such data are referred to as time series.

Time series may arise from:

• A discrete process. For example, the physicist Simon Newcomb made 66 measurements over three consecutive days.

• Aggregation of a continuous process. For example, monthly rainfall data.

• Subsampling of a continuous process. For example, daily exchange rate data.

A time series plot is obtained by plotting the data against time with consecutive points connected by line segments.

Example The time series plot of the brick production is shown below.

[Time series plot: Australian Monthly Brick Production (in Millions), bricks (vertical axis) against time (horizontal axis).]

Fig. 1.16: Brick production time series.



The brick time series plot allows us to identify many interesting features such as

• A clearly defined annual pattern (December and January are always low);

• An increasing trend until about 1973, corresponding to steady growth;

• A precipitous drop around 1975, resulting from the world recession arising from the oil crisis.


A very simple model for a time series is that it is composed of several components:

• A trend;

• A periodic component;

• Random noise.
[Four panels, each plotted against time from 0 to 100: Trend Term, Periodic Term, Noise Term and Observed Series.]

Fig. 1.17: Composition of a hypothetical series
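
A hypothetical series of this kind can be simulated by adding the three components. In the Python sketch below the slope of the trend, the period of 12 and the standard normal noise are all assumptions chosen for illustration; they are not taken from the figure.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible
n = 100

trend = [0.04 * t for t in range(n)]                          # linear trend (assumed slope)
periodic = [math.sin(2 * math.pi * t / 12) for t in range(n)]  # period 12 (assumed)
noise = [random.gauss(0, 1) for _ in range(n)]                # independent N(0, 1) noise

# Observed series = trend + periodic component + random noise.
observed = [tr + p + e for tr, p, e in zip(trend, periodic, noise)]

print(observed[:5])
```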




In this decomposition the trend and periodic effect are thought of as deterministic.

The random component of a time series may exhibit serial dependence. That is, the value at a given time is influenced by previous values of the series.

A series with no such dependence is frequently called “white noise”. (We will give a more precise definition later.)

The dependence in a series $\{x_t\}$ can be measured by calculating the sample autocorrelation.

The sample autocovariance at lag k is defined by

$$g_k = \frac{1}{n} \sum_{i=k+1}^{n} (x_i - \bar{x})(x_{i-k} - \bar{x}).$$

The sample autocorrelation at lag k is defined by

$$r_k = g_k / g_0.$$
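
The two definitions translate directly into code. The Python sketch below uses illustrative function names and a short artificial alternating series, for which the lag 1 autocorrelation is strongly negative and the lag 2 autocorrelation strongly positive.

```python
def autocovariance(x, k):
    """Sample autocovariance at lag k, with divisor n as in the definition."""
    n = len(x)
    xbar = sum(x) / n
    # Python index i runs over k, ..., n - 1, i.e. 1-based i = k + 1, ..., n.
    return sum((x[i] - xbar) * (x[i - k] - xbar) for i in range(k, n)) / n

def autocorrelation(x, k):
    return autocovariance(x, k) / autocovariance(x, 0)

alternating = [1, -1] * 5  # artificial series 1, -1, 1, -1, ... of length 10

for lag in range(3):
    print(lag, autocorrelation(alternating, lag))
```

Because the divisor is n at every lag while the sum has only n − k terms, the sample autocorrelations shrink towards zero as the lag increases.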


[Two panels: time series plot of the white noise series z1 against time, and the ACF of z1 against lag.]

Fig. 1.18: Time series plot and acf function for white
noise. Note that the only non-zero correlation is at lag
0.


[Two panels: time series plot of the correlated noise series z2 against time, and the ACF of z2 against lag.]

Fig. 1.19: Time series plot and acf function for serial dependence. Note the presence of positive autocorrelation at lags 1, 2, 3.

