
Week 5.

Lecture 1&2
Estimation 1: Method of Moments
In all the distributions looked at over the last few weeks, the parameters are population
parameters. That is, the distributions describe populations as a whole. Look again at the
example of the length of the gaps between surface hairline cracks in a 10 m steel girder (X).
This population is made up of all the gaps along the surface of such a girder. The parameter
1/λ in the exponential distribution is then the population mean value for the random variable
X. The only way the value for 1/λ could be known with certainty would be to scan every single
bit of space along the surface of the girder, identify every single crack and then measure
the lengths of the gaps between all these cracks. The reciprocal of the average of all these
lengths would give the population value for λ and would involve millions and millions of
measurements.
[Figure: a 10 m steel girder with a sample of 5 locations marked along its surface.]

In reality there would never be the time or funds available to make these millions of
measurements. Instead a small proportion of the girder edge space - six locations in the image
above - would be selected at random and scanned for cracks. This would constitute a
random sample, which would then be used to estimate the population average length between
cracks, 1/λ. This estimate could of course be different from the population value for 1/λ, so what is
required is a sample estimate that satisfies some desirable characteristics:

a. The sample estimate of a population parameter is, at least on average, equal to the
population or true value. For example, if lots of values are found from lots of separate random
samples of a given size, the average of these sample values would equal the true value.
Such an estimator is called an unbiased estimator.
b. As the sample size increases there is a tendency for any bias to converge on zero, and
a tendency for the probability of any one sample estimate differing from the true population
value to converge on zero. Such an estimator is called a consistent estimator.

A hat symbol will be used to distinguish a sample estimate of a parameter from the true
or population value - θ̂ is the sample estimate of θ. Such sample estimates are called statistics.

The population would be even larger when considering all manufactured 10 m girders.
Then it would be completely infeasible to observe the population as a whole.

A. Simulating the procedure for Sampling from a Population.

In general then x1, x2, x3, …, xn represent a sample of data where the sample size is n
and each xi, for example, may be the number of cracks in the ith surface region of the 10 m
girder. From such a sample an unbiased estimator for the population parameter(s) is required.
The best way to see the properties of a sample and the statistics calculated from them is through
simulation. It is quite easy to simulate on a computer the physical process of randomly selecting
n regions on a 10 m girder and measuring the length of the gaps between cracks in these regions.
Simulations can then be used to look at how good various estimators are.

The first step is to draw a sequence of n numbers between zero and one via a process
of long division. A very simple computer algorithm to do this works as follows. To draw k = 1
to n random numbers between 0 and 1 first choose an initial seed number. This can be any
whole number. To illustrate suppose an initial seed (i.e. when k = 1) value of Xk = 1 is chosen.
Then perform the following division

Yk = (c+aXk)/m

where a, c and m are whole numbers stored in and used by the algorithm or random number
generator. Some values for these constants produce a better sequence of random numbers than
others. To illustrate, let c = 0, a = 7 and m = 13. Thus

Yk = (c+aXk)/m = (0 + 7 x 1)/13 = 0.538

Next, split this Yk into an integer part (Ik) given by the number before the decimal
point and a fraction part (Fk) given by the numbers after the decimal point. That is, Ik = 0
and Fk = .538. Fk is therefore our first random number between 0 and 1. To get the next seed,
Xk+1, use the formula

Xk+1 = (c + aXk) - m x Ik

Thus Xk+1 = (0 + 7 x 1) - 13 x 0 = 7

All the above steps are repeated another n-1 times to get n random values between 0 and
1. For example, to get the next random number between 0 and 1

Yk+1 = (c+aXk+1)/m = (0 + 7 x 7)/13 = 3.769.


The second random number between 0 and 1 is thus 0.769. So Ik+1 = 3 and Fk+1 = .769.
The next seed is then

Xk+2 = (c + aXk+1) - m x Ik+1 = (0 + 7 x 7) - 13 x 3 = 49 - 39 = 10.

To get the third random number between 0 and 1

Yk+2 = (c+aXk+2)/m = (0 + 7 x 10)/13 = 5.385.

The third random number between 0 and 1 is thus 0.385. So Ik+2 = 5 and Fk+2 = .385. The
next seed is then

Xk+3 = (c + aXk+2) - m x Ik+2 = (0 + 7 x 10) - 13 x 5 = 70 - 65 = 5.

The following graph shows this process repeating to generate n = 20 random numbers
between 0 and 1. These appear random in nature despite the fixed rules for generating them.
There is no clear pattern in the data such as a rising or declining linear (or non linear) trend and
no cyclical repeating patterns of any length. Hence they are termed pseudo random numbers.
Provided the values for a, c and m are chosen wisely, the sequence of numbers will only start
to repeat itself (and so become a non-random pattern) when n is very large. Provided the
sequence is not too long, the numbers satisfy all the statistical characteristics of randomness in that, for
example, previous values in the sequence have no easily identifiable relationship to future
values. Excel generates random numbers between 0 and 1 in this way using its function Rand().
This function uses a = 1.141E+09, c = 12820163 and m = 16777216. The Rand() function is an
example of a Linear Congruential Generator (LCG).
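The recursion above can be sketched in a few lines of code. This is a minimal illustration using the worked constants c = 0, a = 7 and m = 13 with seed X1 = 1 (not Excel's actual implementation; the function name lcg is ours). Note that the fractional part Fk of (c + aXk)/m equals Xk+1/m, so the update and the draw can share one modulo step.

```python
def lcg(seed, n, a=7, c=0, m=13):
    """Generate n pseudo random numbers in [0, 1) using the linear
    congruential recursion X_{k+1} = (c + a*X_k) - m*I_k, i.e. mod m.
    The k-th draw F_k is the fractional part of (c + a*X_k)/m."""
    x = seed
    draws = []
    for _ in range(n):
        x = (c + a * x) % m      # next seed X_{k+1}
        draws.append(x / m)      # fractional part F_k = X_{k+1}/m
    return draws

# Reproduces the worked sequence 0.538, 0.769, 0.385, ...
print(lcg(1, 3))
```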

[Figure: plot of n = 20 pseudo random numbers between 0 and 1 against sequence number, n.]

The second step is to take a sequence of n random numbers between 0 and 1 and convert
them into a sequence of values drawn randomly from a particular distribution describing a
population. This can then be seen as a random sample taken from a population. The trick to
doing this is to think of the Rand() function as randomly picking a value for F(x) from the
population distribution. Remember F(x) is a cumulative probability and so has values between
0 and 1 just as Rand() randomly picks a value between 0 and 1. Having a random value for
F(x) and knowing the cdf function, it is very straightforward to solve this cdf for the random
value for X. Consider the following examples:

i. Drawing from a uniform distribution in Excel.

Consider the uniform distribution that has cdf

F(x) = (x - a)/(b - a)

where b is the maximum value that X can take and a the minimum value. Suppose the Rand()
function in Excel produces the number 0.44. The trick is to interpret this as a random value for
F(x), so F(x) = 0.44. Thus

0.44 = (x - a)/(b - a).

The corresponding random value for X is therefore

x = a + 0.44(b - a)

So in general

x = a + Rand()(b - a)

In week 4, the Oyster Pearl example was introduced. When pearl oysters are opened,
pearls of various sizes are typically found. Suppose, as an example, that each oyster contains a
pearl with a diameter in mm that has a U(0, 10) distribution, so a = 0 and b = 10. In sheet
'Uniform Draw' of Excel file 'Sim1', random draws from this distribution are made in cell D8
by hitting the F9 key repeatedly. Each time the F9 key is hit, the Rand() function in cell D7
generates a value between 0 and 1 in random fashion which is interpreted as a randomly drawn
value for F(x). The above formula is then inserted into cell D8 to convert these F(x) values into
randomly drawn values for x. The bottom graph visualizes this process, where the orange line
shows the randomly drawn value for x within the loaf shaped pdf that summarizes all possible
values that x can take. Because, for such a distribution, all values for x between a and b are
equally likely, the orange line moves over the whole width of the pdf with equal frequency.
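The same inversion can be written outside Excel. A minimal Python sketch of the U(0, 10) pearl-diameter draw, with random.random() standing in for Rand() (the function name draw_uniform is ours):

```python
import random

def draw_uniform(a, b, u=None):
    """Inverse-transform draw from U(a, b): treat u as F(x) and solve for x."""
    if u is None:
        u = random.random()   # plays the role of Excel's Rand()
    return a + u * (b - a)

# With u = 0.44 and U(0, 10) this reproduces the worked value x = 4.4
print(draw_uniform(0, 10, u=0.44))
```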

ii. Drawing from an exponential distribution.

The exponential cdf is

F(x) = 1 - e^(-λx)

Replacing F(x) with Rand() and rearranging for x enables a random draw to be made
from this distribution using

x = -ln[1 - Rand()]/λ
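A sketch of the same inversion in Python, using the λ = 0.4528 value estimated later in these notes purely as a sanity check (the average of many draws should be close to 1/λ):

```python
import math
import random

def draw_exponential(lam, u=None):
    """Inverse-transform draw: solve F(x) = 1 - exp(-lam*x) = u for x."""
    if u is None:
        u = random.random()
    return -math.log(1.0 - u) / lam

random.seed(0)
lam = 0.4528
mean = sum(draw_exponential(lam) for _ in range(100_000)) / 100_000
print(mean, 1 / lam)   # the two numbers should be close
```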

iii. Drawing from a Weibull distribution.

The Weibull cdf is

F(x) = 1 - e^(-(ηx)^β)

Replacing F(x) with Rand() and rearranging for x enables a random draw to be made
from this distribution using

ln(x) = {ln[-ln(1 - Rand())] - β ln(η)}/β

Taking the exponential of this expression gives a random value for x.
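Exponentiating the expression above gives x = (1/η)[-ln(1 - Rand())]^(1/β). A minimal sketch under the notes' parameterization F(x) = 1 - exp(-(ηx)^β); the check below confirms a draw made with a known u maps back to F(x) = u:

```python
import math
import random

def draw_weibull(eta, beta, u=None):
    """Inverse-transform draw from F(x) = 1 - exp(-(eta*x)**beta):
    x = (1/eta) * (-ln(1 - u))**(1/beta)."""
    if u is None:
        u = random.random()
    return (-math.log(1.0 - u)) ** (1.0 / beta) / eta

# Round trip: a draw made with u = 0.5 should sit at the distribution median
x = draw_weibull(0.42, 5.66, u=0.5)
print(x, 1 - math.exp(-(0.42 * x) ** 5.66))
```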

iv. Drawing from a normal and log normal distribution in Excel.

In Excel the cumulative probability F(z) associated with a particular Z value was
obtained using the function NORMSDIST(Z). Then to read the table the other way around -
getting the Z value associated with a particular F(z) value - the Normsinv(F(z)) function is
used in Excel. So to randomly draw a Z value in Excel use Normsinv(Rand()). Then given that
Z = (x - μ)/σ, a random value for X can be found in Excel using the formula

x = μ + σ Normsinv(Rand())

In week 4 the concrete block example was introduced. A company manufactures
concrete blocks that are used for construction purposes, and the weight of all the individual
concrete blocks produced by this company has a N[11, 0.3] distribution, so the mean value is 11 kg and
the standard deviation is 0.3 kg. In sheet 'Normal Draw' of Excel file 'Sim1', random draws
from this distribution are made in cell D9 by hitting the F9 key repeatedly. Each time the F9
key is hit, the Rand() function in cell D7 generates a value between 0 and 1 in random fashion
which is interpreted as a randomly drawn value for F(z). This randomly drawn value of F(z) is
shown on the vertical axis of the top left hand graph by the orange line. This is then inserted
into the Normsinv function in cell D8 to generate at random a value for Z. The Normsinv
function reads from the Z table the value for Z associated with this selected F(z) value. In the
top left hand figure the blue S shaped curve is a graphical display of the numbers in the Z table.
Thus extending the orange line to the S shaped curve allows the Z value to be read off on the
horizontal axis. The bottom left hand graph transfers this Z value from the horizontal to the
vertical axis.

The equation x = μ + σZ is then inserted into cell D9 to convert this Z value into a
randomly selected value for x - concrete weight. This equation is shown graphically in the
bottom right hand graph, where the blue line is given by the equation μ + σZ. This x value is
then transferred to the horizontal axis of the top right hand graph that shows the normal pdf
when μ = 11 and σ = 0.3. This top right hand graph visualizes the process, where the orange
line shows the randomly drawn value for x within the bell shaped pdf that summarizes all
possible values that x can take. Because, for such a distribution, values for x around the mean
or peak point are more likely, the orange line is seen with higher frequency around the peak
compared to the tails of the pdf when F9 is hit repeatedly.

To draw from a log normal distribution, first draw a random value for Y = ln(X) using

y = μy + σy Normsinv(Rand())

and then obtain the random value for X using x = exp(y). In this formula μy is the mean of Y
and σy the standard deviation of Y.
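Outside Excel the same two-step draw can be made with Python's standard library, where NormalDist().inv_cdf plays the role of Normsinv. A sketch using the concrete-block parameters μ = 11 and σ = 0.3 (the function names are ours):

```python
import math
import random
from statistics import NormalDist

def draw_normal(mu, sigma, u=None):
    """Draw x = mu + sigma*z, where z = Phi^{-1}(u) and u ~ U(0, 1)."""
    if u is None:
        u = random.random()
    z = NormalDist().inv_cdf(u)   # analogue of Excel's Normsinv(Rand())
    return mu + sigma * z

def draw_lognormal(mu_y, sigma_y, u=None):
    """Draw y ~ N(mu_y, sigma_y) for Y = ln(X), then return x = exp(y)."""
    return math.exp(draw_normal(mu_y, sigma_y, u))

print(draw_normal(11, 0.3))       # a random concrete-block weight in kg
```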

v. Drawing from a Binomial distribution in Excel.

In Excel a random value for x that has a binomial distribution can be obtained using the
formula
x = Binomdist.Inv(n, p, Rand())

vi. Drawing from a Poisson distribution in Excel.

Recall from week 3 that the Poisson distribution can be used to approximate the binomial
distribution when n is very large (for practical purposes larger than 150) and the success
probability p is very small (less than 0.01). Then the parameter λ = np should be used for the
Poisson distribution (or n = λ/p when using a binomial approximation).

So in Excel, first choose a small value for p (0.000001 is accurate enough). Then set n
equal to the Poisson parameter λ divided by this small value for p. A random value for x that
has a Poisson distribution can then be obtained using the formula

x = Binomdist.Inv(λ/0.000001, 0.000001, Rand())
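Python's standard library has no binomial inverse, so a direct inverse-transform alternative (not the notes' Excel trick) accumulates the Poisson cdf term by term until it reaches the Rand() value; this samples the same distribution without the binomial approximation:

```python
import math
import random

def draw_poisson(lam, u=None):
    """Inverse-transform Poisson draw: return the smallest x whose
    cumulative probability F(x) = sum_{k<=x} e^(-lam) lam^k / k! >= u."""
    if u is None:
        u = random.random()
    x = 0
    pmf = math.exp(-lam)        # P(X = 0)
    cdf = pmf
    while cdf < u:
        x += 1
        pmf *= lam / x          # recursion: p(k) = p(k-1) * lam / k
        cdf += pmf
    return x
```

For example, with λ = 3 the cumulative probabilities are 0.0498, 0.1991, 0.4232, 0.6472, …, so u = 0.5 yields x = 3.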

vii. Drawing from a triangular distribution in Excel.

The triangular cdf is

F(x) = (x - a)²/[(b - a)(c - a)]  when a ≤ x ≤ c

F(x) = 1 - (b - x)²/[(b - a)(b - c)]  when c ≤ x ≤ b

Replacing F(x) with Rand() and rearranging for x enables a random draw to be made
from this distribution using

x = a + [(b - a)(c - a)Rand()]^0.5  when a ≤ x ≤ c

x = b - [(b - a)(b - c)(1 - Rand())]^0.5  when c ≤ x ≤ b

To make this operational, if the value for Rand() is less than or equal to (c-a)/[(c-a)+(b-
c)] then find x using the first expression for x above. Otherwise use the second expression for
x.
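The branching rule above translates directly to code. A sketch (with the function name and the illustrative values a = 1, c = 3, b = 6 our own); note the two branches meet at the mode, since u = (c - a)/(b - a) gives x = c from either formula:

```python
import math
import random

def draw_triangular(a, c, b, u=None):
    """Inverse-transform draw from a triangular distribution with
    minimum a, mode c and maximum b."""
    if u is None:
        u = random.random()
    if u <= (c - a) / (b - a):               # left-hand branch of the cdf
        return a + math.sqrt((b - a) * (c - a) * u)
    return b - math.sqrt((b - a) * (b - c) * (1.0 - u))

print(draw_triangular(1, 3, 6, u=0.4))   # u at the threshold returns the mode
```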

B. The Method of Moments Estimator.

There are many different methods to estimate population parameter values from a
sample of data, and this module will look at two of them: the method of moments and the least
squares method. The following two examples will be used to illustrate these.

Example 1: Protein content of cultures.

The following data was obtained from a study investigating the production of
cyclodextrin-glucosyltransferase enzymes by bacterial cultures. The enzyme production was
carried out in shaken cultures. The measured protein content of 6 cultures were:

Protein content: mg ml⁻¹

1.91 2.62

1.66 2.57

2.64 1.85

Example 2: Fatigue of ceramic ball bearings.

The following data are the results from a life test on the rolling contact fatigue of
ceramic ball bearings. Ten specimens were tested at a stress of 0.87 and the recorded numbers of
cycles to failure were:

Number of cycles to failures: 10-2 cycles

1.67 14.70

4.70 3.00

2.20 27.80

7.53 3.90

2.51 37.40
i. The Normal Distribution

The normal distribution is defined by its two parameters the population mean () and
the population standard deviation (). It can be shown that an unbiased estimator of the
population mean is the sample average x where
n
1
= x = xi
n
i=1

Also an unbiased estimator of the population standard deviation is the sample


standard deviation s

n
1
= s = [xi x]2
n1
i=1

Applying these formulas to the two examples above gives:

Example 1: Protein content of cultures.

μ̂ = x̄ = (1/6)(1.91 + 1.66 + … + 2.57 + 1.85) = 2.208

with

σ̂ = s = √{(1/(6-1))([1.91 - 2.208]² + [1.66 - 2.208]² + … + [2.57 - 2.208]² + [1.85 - 2.208]²)} = 0.448

Example 2: Fatigue of ceramic ball bearings

μ̂ = x̄ = (1/10)(1.67 + 4.70 + … + 3.90 + 37.40) = 10.541

σ̂ = s = √{(1/(10-1))([1.67 - 10.541]² + [4.70 - 10.541]² + … + [3.90 - 10.541]² + [37.4 - 10.541]²)} = 12.4432

Notice in these illustrations that the total squared deviations around the sample mean
are averaged out by dividing through by the sample size minus one and not by the sample size
itself. This is because if n is used, the resulting estimator will be biased but using n-1 achieves
an unbiased estimator. Obviously as the sample size gets bigger, the degree of bias reduces
because for large n there is little difference between the value for n and n-1.
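These calculations are easy to check in code. Python's statistics.stdev already uses the n - 1 denominator (statistics.pstdev uses n, mirroring Excel's VAR versus VARP):

```python
import statistics

protein = [1.91, 1.66, 2.64, 2.62, 2.57, 1.85]
cycles = [1.67, 4.70, 2.20, 7.53, 2.51, 14.70, 3.00, 27.80, 3.90, 37.40]

for name, data in [("protein", protein), ("cycles", cycles)]:
    xbar = statistics.mean(data)   # method of moments estimate of mu
    s = statistics.stdev(data)     # n-1 denominator, as in the text
    print(f"{name}: mean = {xbar:.4f}, s = {s:.4f}")
# protein gives roughly 2.208 and 0.448; cycles roughly 10.541 and 12.443
```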

Consider now the protein content example above and suppose the population
parameters are the same as the above sample estimates, i.e. the population mean is μ = 2.208
and the population standard deviation is σ = 0.448. Now simulate the process of drawing
samples of size 6 from this population, which is assumed to be normally distributed. In the Excel
file 'Sim1' the formula in section A.iv is inserted into cell B16 of the 'Norm sim' sheet to
do this. The measured protein contents of 6 randomly selected cultures are then shown in cells
E3:J3. Cell E3 therefore shows the protein content of the first randomly selected culture, 1.773.
In the Excel file notice how the simulated random sample of 6 values is clustered
around the mean of the population in the top graph, and so it appears the simulation approach
is mimicking the population well - which is what you would hope from a random sample. The
sample average protein content from these 6 cultures is in cell L3 (2.11) - close to the
assumed population value of 2.208 (but not exactly the same). Similarly, the sample variance
in protein content from these 6 cultures is in cell M3 (0.311) - close to the population value of
0.201 but not exactly the same. So although the sample mean and variance are
unbiased estimates of μ and σ², in any one sample of size n they can be different from
them.

To get a feel for how different the estimates can be from the true values in any one
sample, the Excel file Sim1 repeats the process of collecting samples of size 6. In total 30
such samples are taken by repeatedly clicking on the button in cell range B12:B14. The graphs
(2nd and 3rd figures) show that in any one sample the estimates can be quite different from the
assumed population values and so it is important to be aware of this fact when using sample
estimates. However, over all 30 samples the average of all the sample means is 2.239 as shown
in cell L34 - which is very close to the population average of 2.208. Indeed the bias is just
1.42%. Likewise, over all 30 samples the average of all the sample variances is 0.207 as shown
in cell M34 - which is very close to the population variance of 0.2017. Indeed the bias is just
3.21%. Notice that if the sample variance had been calculated with a denominator of n (=6)
instead of n-1 (=5), as shown in column N, (VAR uses a denominator of n-1 and VARP uses a
denominator of n in the variance formula in Excel), the bias would be much higher at -14% as
shown in cell N35. Hence the need to use n-1 in the denominator to correct for the bias.

These biases in estimating the variance using a denominator of n instead of n-1
would fall further when working with sample sizes of more than 6. To accurately measure any
bias associated with a sample statistic it is necessary to average over a large number of samples
- much more than 30. Ideally, the population of samples of size n (this population is made up
of all the different samples of size n that can be formed from the population of values on x)
should be used, but a very reasonable approximation to this can be obtained by averaging over
10,000 samples of size n. Click on the button in cell range L36:N38 to see the outcome from
generating 10,000 samples of size 6. Cell L41 reveals the sample mean is unbiased even in
samples of size 6 (a bias of around only 0.2%). The bias in the variance using n-1 in the
denominator is also very small (cell M41 shows it to be around 0.45%). However, the bias in
the variance when using n in the denominator is substantial (cell N41 shows it to be around
-16.3%). Hence using n-1 corrects the bias obtained when using n in the denominator of the
variance formula.

In summary: x̄ is an unbiased estimator of μ; s² = Σ[xi - x̄]²/[n-1] is an unbiased
estimator of σ²; Σ[xi - x̄]²/[n] is a biased estimator of σ².

Hence the importance of adjusting n by the 1 degree of freedom lost in estimating
the sample mean x̄. It is important to realise, however, that s (the sample standard
deviation) is a biased estimator of σ because it involves a non-linear transformation of the
variance.
Obviously as n increases the difference between n-1 and n diminishes, and thus so too
should any bias. Click on the button in cell range L36:N38 to see the outcome from generating
10,000 samples of size n = 50. Cell L41 reveals the sample mean is again unbiased (a bias of
around only 0.01%). The bias in the variance using n-1 in the denominator is also very small
(cell M41 shows it to be around -0.07%). Now, the bias in the variance using n in the
denominator has reduced compared to using n = 6 (cell N41 shows it to be around -2.1%).
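The spreadsheet experiment can be reproduced in a few lines. A sketch (our own Monte Carlo, not the 'Sim1' workbook) comparing the n and n - 1 variance denominators over 10,000 samples of size 6 drawn from the assumed N[2.208, 0.448] population; the n-denominator estimator should show a bias of roughly -1/n:

```python
import random

random.seed(42)
mu, sigma, n, reps = 2.208, 0.448, 6, 10_000
sum_var_n1 = sum_var_n = 0.0

for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)   # total squared deviation
    sum_var_n1 += ss / (n - 1)                  # unbiased estimator of sigma^2
    sum_var_n += ss / n                         # biased estimator

bias_n1 = sum_var_n1 / reps / sigma**2 - 1
bias_n = sum_var_n / reps / sigma**2 - 1
print("bias with n-1 denominator:", bias_n1)    # close to zero
print("bias with n denominator:  ", bias_n)     # close to -1/6
```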

ii. The Log Normal Distribution

As the name suggests, if a random variable X has a log normal distribution, then that
simply means that the transformed random variable Y = ln(X) has a normal distribution as
described above (ln stands for natural logarithm). The log normal distribution is defined by its
two parameters the population log mean (y) and the population log standard deviation (y).
The method of moments estimator for y is the sample average y where
n
1
= x = yi
n
i=1

Also method of moments estimator of the population log standard deviation is the
sample standard deviation sy

n
1
= s = [yi y]2
n1
i=1

Applying these formulas to the two examples above gives:

Example 1: Protein content of cultures.

First take the natural log of all the data. Thus y1 = ln(1.91) = 0.6471 through to y6 =
ln(1.85) = 0.6152.

μ̂y = ȳ = (1/6)(0.6471 + 0.5068 + … + 0.9439 + 0.6152) = 0.774

with

σ̂y = sy = √{(1/(6-1))([0.6471 - 0.7745]² + [0.5068 - 0.7745]² + … + [0.9439 - 0.7745]² + [0.6152 - 0.7745]²)} = 0.2079

From these the mean and variance for X = EXP(Y) can be estimated

μ̂ = exp(μ̂y + 0.5σ̂y²) = exp(0.7745 + 0.5(0.2079)²) = 2.217

σ̂² = [exp(σ̂y²) - 1]exp(2μ̂y + σ̂y²) = [exp(0.2079²) - 1]exp(0.2079² + 2(0.7745)) = 0.217
Example 2: Fatigue of ceramic ball bearings

First take the natural log of all the data. Thus y1 = ln(1.67) = 0.5128 through to y10 =
ln(37.4) = 3.6217.

μ̂y = ȳ = (1/10)(0.5128 + 1.5476 + … + 1.361 + 3.6217) = 1.7882

with

σ̂y = sy = √{(1/(10-1))([0.5128 - 1.7882]² + [1.5476 - 1.7882]² + … + [1.361 - 1.7882]² + [3.6217 - 1.7882]²)} = 1.0894

From these the mean and variance for X = EXP(Y) can be estimated

μ̂ = exp(μ̂y + 0.5σ̂y²) = exp(1.7882 + 0.5(1.0894)²) = 10.8225

σ̂² = [exp(σ̂y²) - 1]exp(2μ̂y + σ̂y²) = [exp(1.0894²) - 1]exp(1.0894² + 2(1.7882)) = 266.6545
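The conversion from log-scale parameters to the mean and variance of X can be checked numerically; a small helper (the function name is ours) applied to both worked examples:

```python
import math

def lognormal_moments(mu_y, sigma_y):
    """Convert log-scale parameters of Y = ln(X) to the mean and
    variance of X = exp(Y)."""
    mean_x = math.exp(mu_y + 0.5 * sigma_y**2)
    var_x = (math.exp(sigma_y**2) - 1) * math.exp(2 * mu_y + sigma_y**2)
    return mean_x, var_x

print(lognormal_moments(0.7745, 0.2079))   # protein example: ~2.217, ~0.217
print(lognormal_moments(1.7882, 1.0894))   # bearing example: ~10.82, ~266.65
```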

Consider again the protein content example above and suppose the population
parameters are the same as the above sample estimates, i.e. the population log mean is μy =
0.774 and the population log standard deviation is σy = 0.208. Now simulate the process of drawing
samples of size 6 from this population, which is assumed to be log normally distributed. In the
Excel file 'Sim1' the formula in section A.iv is inserted into cell B16 of the 'Log Normal
sim' sheet to do this. The measured log protein contents of 6 randomly selected cultures are
then shown in cells E3:J3. Cell E3 therefore shows the log protein content of the first randomly
selected culture, 0.863. In the Excel file notice how the simulated random sample of 6 values
is clustered around the mean of the population in the top graph, and so it appears the simulation
approach is mimicking the population well - which is what you would hope from a random
sample. The sample log average protein content from these 6 cultures is in cell L3 (0.719) -
close to the assumed population value of 0.774 (but not exactly the same). Similarly, the
sample log variance in protein content from these 6 cultures is in cell M3 (0.046) - close to the
population value of 0.043 but not exactly the same. So although the sample log mean and
log variance are unbiased estimates of μy and σy², in any one sample of size n they
can be different from them.

To get a feel for how different the estimates can be from the true values in any
one sample, the Excel file Sim1 repeats the process of collecting samples of size 6. In total
30 such samples are taken by repeatedly clicking on the button in cell range B12:B14. The
graphs (2nd and 3rd figures) show that in any one sample the estimates can be quite different
from the assumed population values and so it is important to be aware of this fact when using
sample estimates. However, over all 30 samples the average of all the sample log means is
0.762 as shown in cell L34 - which is very close to the population average of 0.774. Indeed the
bias is just -1.65%. Likewise, over all 30 samples the average of all the sample log variances
is 0.041 as shown in cell M34 - which is very close to the population log variance of 0.043.
Indeed the bias is just -5.14%. Notice that if the sample variance had been calculated with a
denominator of n (=6) instead of n-1 (=5), as shown in column N, (VAR uses a denominator
of n-1 and VARP uses a denominator of n in the variance formula in Excel), the bias would be
much higher at -21% as shown in cell N35. Hence the need to use n-1 in the denominator to
correct for the bias.

These biases in estimating the variance using a denominator of n instead of n-1
would fall further when working with sample sizes of more than 6. To accurately measure any
bias associated with a sample statistic it is necessary to average over a large number of samples
- much more than 30. Click on the button in cell range L36:N38 to see the outcome from
generating 10,000 samples of size 6. Cell L41 reveals the sample log mean is unbiased even in
samples of size 6 (a bias of around only -0.1%). The bias in the log variance using n-1 in the
denominator is also very small (cell M41 shows it to be around -0.04%). However, the bias in
the log variance when using n in the denominator is substantial (cell N41 shows it to be around
-16.7%). Hence using n-1 corrects the bias obtained when using n in the denominator of the
variance formula.

In summary: ȳ is an unbiased estimator of μy; s²y = Σ[yi - ȳ]²/[n-1] is an unbiased
estimator of σy²; Σ[yi - ȳ]²/[n] is a biased estimator of σy².

Obviously as n increases the difference between n-1 and n diminishes, and thus so too
should any bias. Click on the button in cell range L36:N38 to see the outcome from generating
10,000 samples of size n = 50. Cell L41 reveals the sample log mean is again unbiased (a bias
of around only -0.01%). The bias in the sample log variance using n-1 in the denominator is
also very small (cell M41 shows it to be around 0.07%). Now, the bias in the variance using n
in the denominator has reduced compared to using n = 6 (cell N41 shows it to be around -
1.9%).

iii. The Exponential Distribution.

is the reciprocal of the population mean value for the random variable X. So the
method of moments estimator of is the reciprocal of the sample mean 1/x:

1 n
= = n
x i=1 xi

Example 1: Protein content of cultures.

λ̂ = 1/x̄ = 1/2.208 = 0.4528

Example 2: Fatigue of ceramic ball bearings

λ̂ = 1/x̄ = 1/10.541 = 0.0949
Consider again the protein content data above and suppose the population parameter is
the same as the above sample estimate, i.e. the population λ is 0.4528. In the Excel file 'Sim1'
the formula in section A.ii is inserted into cell B16 of the 'Exponential sim' sheet to do this.
The measured protein contents of 6 randomly selected cultures are then shown in cells E3:J3.
Cell E3 therefore shows the protein content of the first randomly selected culture, 2.231. In the
Excel file notice how the simulated random sample of 6 values is clustered around the mean
of the population in the top graph, and so it appears the simulation approach is mimicking the
population well. The sample λ value from these 6 cultures is in cell M3 (0.492) - close to the
population value of 0.453 but not exactly the same.

To get a feel for how different the estimates can be from the true values in any one
sample, the Excel file 'Sim1' repeats the process of collecting samples of size 6. In total 30
such samples are taken by repeatedly clicking on the button in cell range B12:B14. The graphs
(2nd and 3rd figures) show that in any one sample the estimates can be quite different from the
assumed population values, and so it is important to be aware of this fact when using sample
estimates. Even worse, the average of the 30 sample λ values appears to be above the true
value, with a bias of about 20% - as shown in cell M35. To accurately measure this bias it is
necessary to average over a large number of samples - much more than 30. Click on the button
in cell range L36:N38 to see the outcome from generating 10,000 samples of size 6. Cell M40
reveals the sample λ̂ is biased in such small samples of size 6 (the bias is still around 20%).

In summary: 1/x̄ is a biased estimator of λ in small samples.

However, 1/x̄ may be a consistent estimator of λ. Consistency at least requires
the bias to tend to zero as the sample size increases. Clicking on the button in cell range
L36:N38 to generate 10,000 samples of size n = 30 and then n = 50 reveals the bias falls to
around 2.9% and 1.8% respectively (read from cell M41).

Summary: The bias of 1/x̄ as an estimator of λ diminishes in larger samples, and so it has
one of the properties of a consistent estimator.
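The 20% figure can be reproduced with a quick Monte Carlo sketch of our own (for the exponential the bias of 1/x̄ is in fact exactly 1/(n - 1), which is 20% at n = 6):

```python
import math
import random

random.seed(7)
lam, n, reps = 0.4528, 6, 20_000
total = 0.0
for _ in range(reps):
    # inverse-transform draws from the exponential distribution
    sample = [-math.log(1 - random.random()) / lam for _ in range(n)]
    total += n / sum(sample)            # lam_hat = 1 / xbar

bias = total / reps / lam - 1
print(f"bias of 1/xbar at n = {n}: {bias:.1%}")   # close to 1/(n-1) = 20%
```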

iv. The Weibull Distribution

To recap, the population mean and variance of the Weibull distribution are given by

μ = (1/η)Γ(1 + 1/β)

σ² = (1/η²)[Γ(1 + 2/β) - Γ²(1 + 1/β)]

The ratio of the variance to the squared mean is therefore

σ²/μ² = [Γ(1 + 2/β) - Γ²(1 + 1/β)]/Γ²(1 + 1/β) = Γ(1 + 2/β)/Γ²(1 + 1/β) - 1

The value for β is then the value that sets the two sides of this equation equal. It can
be estimated by substituting into this equation the sample mean and variance. Letting β̂ be
the resulting estimate for β gives

s²/x̄² = [Σ(i=1 to n)(xi - x̄)²/(n-1)] / [Σ(i=1 to n) xi/n]² = Γ(1 + 2/β̂)/Γ²(1 + 1/β̂) - 1

Then it is quite straightforward to search numerically for the value of β̂ that sets the
left and right hand sides of this equation equal to each other. This can then be substituted
into the equation for the mean to estimate η

η̂ = (1/x̄)Γ(1 + 1/β̂)
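The numerical search can be sketched as a bisection on β, since the variance-to-squared-mean ratio is decreasing in β (the function name is ours, and the protein data are used as the worked example):

```python
import math

def weibull_mom(data, lo=0.2, hi=50.0, tol=1e-10):
    """Method of moments for F(x) = 1 - exp(-(eta*x)**beta): solve
    s^2/xbar^2 = Gamma(1+2/b)/Gamma(1+1/b)^2 - 1 for b by bisection,
    then set eta = Gamma(1+1/b)/xbar."""
    n = len(data)
    xbar = sum(data) / n
    s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)
    ratio = s2 / xbar**2

    def g(b):
        return math.gamma(1 + 2 / b) / math.gamma(1 + 1 / b) ** 2 - 1 - ratio

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid          # ratio still too large: beta must increase
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    eta = math.gamma(1 + 1 / beta) / xbar
    return beta, eta

protein = [1.91, 1.66, 2.64, 2.62, 2.57, 1.85]
print(weibull_mom(protein))   # close to the grid-search values 5.66 and 0.4187
```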

There is an alternative to these equations that involves the use of the log transformation.
If yi = ln(xi), then without providing the proof it can be shown (using the extreme value
distribution, for interested readers) that:

β = 1.28255/σy

where σy is the population standard deviation for the random variable y and 1.28255 (= π/√6) is a
universal constant. Thus a suitable estimate for β is

β̂ = 1.28255/sy = 1.28255 / √{Σ(i=1 to n)(yi - ȳ)²/(n-1)}

where ȳ is the log mean

ȳ = (1/n) Σ(i=1 to n) yi

Further it can be shown that

1/η = exp(μy + 0.4501σy)

where μy is the population mean for the random variable y. Thus a suitable estimate for η is

η̂ = 1/exp(ȳ + 0.4501sy)
It is important to realise that these two approaches do not have to produce the same
estimated values for the Weibull parameters.
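The log-transformation estimators are purely closed form, so they need no numerical search; a sketch (the function name is ours) applied to the protein data:

```python
import math

def weibull_mom_logs(data):
    """Log-transform (extreme value) estimators from the notes:
    beta = 1.28255/s_y and eta = 1/exp(ybar + 0.4501*s_y),
    where y = ln(x)."""
    n = len(data)
    y = [math.log(x) for x in data]
    ybar = sum(y) / n
    s_y = math.sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))
    return 1.28255 / s_y, 1.0 / math.exp(ybar + 0.4501 * s_y)

protein = [1.91, 1.66, 2.64, 2.62, 2.57, 1.85]
beta, eta = weibull_mom_logs(protein)
print(beta, eta)   # reproduces the Method 2 values 6.17 and 0.4198
```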

Example 1: Protein content of cultures.

Method 1.

s²/x̄² = 0.2009/4.8767 = 0.0412 = Γ(1 + 2/β̂)/Γ²(1 + 1/β̂) - 1

Inserting β̂ = 5.66 into the right hand side of this equation gives 0.0412 (you can grid
search to find this value in Excel). Hence β̂ = 5.66. Then:

η̂ = (1/x̄)Γ(1 + 1/β̂) = (1/2.2083)Γ(1.1767) = 0.4528(0.9247) = 0.4187
2.2083 5.66

Method 2.

Using the log mean and log standard deviation calculated above for the log normal
distribution:

β̂ = 1.28255/0.2079 = 6.17

and

η̂ = 1/exp(ȳ + 0.4501sy) = 1/exp(0.7745 + 0.4501(0.2079)) = 1/2.3823 = 0.4198

Note that the two methods produce similar but not identical estimates for β and η.
Example 2: Fatigue of ceramic ball bearings

Method 1.

s²/x̄² = 154.8326/111.113 = 1.393 = Γ(1 + 2/β̂)/Γ²(1 + 1/β̂) - 1

Inserting β̂ = 0.851 into the right hand side of this equation gives 1.393. Hence β̂ =
0.851. Then:

η̂ = (1/x̄)Γ(1 + 1/β̂) = (1/10.541)Γ(2.175) = 0.0949(1.0872) = 0.1031

Method 2:

Using the log mean and log standard deviation calculated above for the log normal
distribution:

β̂ = 1.28255/1.0894 = 1.18

and

η̂ = 1/exp(ȳ + 0.4501sy) = 1/exp(1.7882 + 0.4501(1.0894)) = 1/9.7626 = 0.1024

Again, note that the two methods produce similar estimates for β and η.

Consider again the protein content example above and suppose the population
parameters are similar to the above sample estimates (using method 1), i.e. the population value
for β is 5.66 and the population value for η is 0.42. In the Excel file 'Sim1' the formula in
section A.iii is inserted into cell B16 of the 'Weibull sim' sheet to do this. The measured
protein contents of 6 randomly selected cultures are then shown in cells E3:J3. Cell E3 therefore
shows the protein content of the first randomly selected culture, 0.99. In the Excel file
notice how the simulated random sample of 6 values is clustered around the mean of the
population in the top graph, and so it appears the simulation approach is mimicking the
population well. The sample η value from these 6 cultures is in cell N3 (0.408) - close to the
population value of 0.42 but not exactly the same. The sample β value from these 6 cultures is
in cell O3 (5.40) - close to the population value of 5.66 but not exactly the same.
To get a feel for how different the estimates can be from the true values in any one
sample, the Excel file 'Sim1' repeats the process of collecting samples of size 6. In total 30
such samples are taken by repeatedly clicking on the button in cell range B12:B14. The graphs
(2nd and 3rd figures) show that in any one sample the estimates can be quite different from the
assumed population values, and so it is important to be aware of this fact when using sample
estimates. Even worse, the average of the 30 sample β values appears to be well above the true
value, with a bias of about 26% - as shown in cell O35. To accurately measure this bias it is
necessary to average over a large number of samples - much more than 30. Click on the button
in cell range N36:O38 to see the outcome from generating 10,000 samples of size 6. Cells
N40:O41 reveal that the sample η̂ has a small bias in such small samples of size 6 (the bias is
around 1.5% in cell N41) but the sample β̂ has a large bias in such small samples of size 6 (the
bias is around 27% in cell O41).

In summary: β̂ = 1.28255/s_y has a large bias in small samples but λ̂ = 1/exp(ȳ + 0.4501 s_y) has a smaller bias in small samples.

This is not surprising given that the sample standard deviation itself is a biased estimator. The size of the bias is reduced for λ̂ because this estimator is a combination of the sample mean and the sample standard deviation, and the sample mean is an unbiased estimator.
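The same bias experiment can be sketched outside Excel. The following Python code is an illustrative stand-in for the Sim1 workbook, assuming the population values β = 5.66 and λ = 0.42 used above: it draws 10,000 samples of size 6 from the Weibull population and averages the Method 2 estimates.

```python
import math
import random
import statistics

random.seed(1)
BETA, LAM = 5.66, 0.42    # assumed population parameters (as in the notes)
M, N = 10_000, 6          # number of samples, sample size

beta_hats, lam_hats = [], []
for _ in range(M):
    # random.weibullvariate(alpha, beta) takes the scale alpha = 1/lambda
    x = [random.weibullvariate(1 / LAM, BETA) for _ in range(N)]
    y = [math.log(v) for v in x]
    y_bar = statistics.fmean(y)
    s_y = statistics.stdev(y)
    beta_hats.append(1.28255 / s_y)
    lam_hats.append(1 / math.exp(y_bar + 0.4501 * s_y))

# Relative bias of each estimator, averaged over the 10,000 samples
bias_beta = statistics.fmean(beta_hats) / BETA - 1   # large (tens of %)
bias_lam = statistics.fmean(lam_hats) / LAM - 1      # small (a few %)
```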

However, these estimators may still be consistent estimators. Consistency at least requires the bias to tend to zero as the sample size increases. Clicking on the button in cell range N36:O38 to generate 10,000 samples of size n = 30 and then n = 50 reveals that the bias falls to around 5.1% and 3.3% respectively for β̂. For λ̂ the corresponding bias figures are around 0.25% and 0.1% (read from cells N41:O41).

Summary: Thus the bias of λ̂ = 1/exp(ȳ + 0.4501 s_y) as an estimator of λ and the bias of β̂ = 1.28255/s_y as an estimator of β both diminish in larger samples, and so each has one of the properties of a consistent estimator.

v. Triangular and Uniform distributions

For both distributions an estimate for a is the smallest value in a sample of data and an estimate for b is the largest value in a sample of data. Then for the triangular distribution the population mean is (a + b + c)/3. Thus a sample estimate for c is given as

    ĉ = 3x̄ - â - b̂

where â and b̂ are the sample estimates of a and b described above. If the estimate for c falls outside of â and b̂ due to an outlier(s), ignore the outlier(s) when finding â and b̂.
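This estimator is easy to code. A short illustrative sketch in Python (the data values below are hypothetical, chosen only to make the arithmetic easy to follow):

```python
def triangular_mom(data):
    # Method-of-moments estimates for a triangular distribution:
    # a-hat and b-hat from the sample extremes, c-hat from the
    # population mean relation mean = (a + b + c)/3.
    a_hat = min(data)
    b_hat = max(data)
    x_bar = sum(data) / len(data)
    c_hat = 3 * x_bar - a_hat - b_hat
    return a_hat, b_hat, c_hat

# Hypothetical sample: min = 1, max = 5, mean = 3, so c-hat = 9 - 1 - 5 = 3
a, b, c = triangular_mom([1, 2, 3, 4, 5])
```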

Example 1: Protein content of cultures.

Assuming the data have a uniform or triangular distribution

a = minimum value in sample = 1.66


b = maximum value in sample = 2.64

Then for a triangular distribution

    ĉ = 3x̄ - â - b̂ = 3(2.21) - 1.66 - 2.64 ≈ 2.32

Example 2: Fatigue of ceramic ball bearings

Assuming the data have a uniform or triangular distribution

a = minimum value in sample = 1.67

b = maximum value in sample = 27.8 (the largest value may be an outlier and so is ignored)

Then for a triangular distribution

    ĉ = 3x̄ - â - b̂ = 3(10.541) - 1.67 - 27.8 = 2.153

vi. The Poisson Distribution.

λ is the population mean value for the random variable X. The method of moments estimator of λ is the sample mean x̄:

    λ̂ = x̄ = (1/n) Σ xᵢ  (summing over i = 1, ..., n)

For the Poisson distribution this is also an estimate of the variance.
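As a one-line illustration, with a hypothetical set of six error counts:

```python
# Hypothetical error counts from six blocks of 1000 lines of code
counts = [5, 3, 4, 2, 5, 4]

# Method-of-moments estimate of lambda: the sample mean
lam_hat = sum(counts) / len(counts)   # 23/6, about 3.83

# For a Poisson distribution the same value also estimates the variance
```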

Consider as an example the number of coding errors in 1000 lines of code in a materials selection software package, with this number having a Poisson distribution with population parameter λ = 4 errors. In the Excel file Sim1 the formula in section A.vi is inserted into cell B16 of the Poisson sim sheet to do this. The measured numbers of errors in 6 randomly selected sets of 1000 lines of code are then shown in cells E3:J3. Cell E3 therefore shows the number of errors in the first randomly selected sample, 5. In the Excel file notice how the simulated random sample of 6 values is clustered around the mean of the population in the top graph, and so it appears the simulation approach is mimicking the population well. The sample λ̂ value from these 6 samples is in cell M3 (3.83) - close to the population value of 4 but not exactly the same.

To get a feel for how different the estimates can be from the true values in any one
sample, the Excel file Sim1 repeats the process of collecting samples of size 6. In total 30
such samples are taken by repeatedly clicking on the button in cell range B12:B14. The graphs
(2nd and 3rd figures) show that in any one sample the estimates can be quite different from the
assumed population values and so it is important to be aware of this fact when using sample
estimates. The average of the 30 sample λ̂ values appears to be a little above the true value, with a bias of about 5% - as shown in cell M35. To accurately measure this bias it is necessary to average over a large number of samples - much more than 30. Click on the button in cell range L36:N38 to see the outcome from generating 10,000 samples of size 6. Cell M41 reveals that the sample λ̂ is unbiased even in such small samples of size 6 (the bias is around 0.05%).

In summary: x̄ is an unbiased estimator of λ even in small samples.

This is not surprising given that x̄ is an unbiased estimator of the population mean.

vii. The Binomial Distribution

The population mean value for a Binomial variable is given by μ = p n*, where n* is the number of trials and p the probability of a success. The method of moments estimator for parameter p for a stated number of trials is therefore

    p̂ = x̄/n* = (Σ xᵢ)/(n n*)  (summing over i = 1, ..., n)
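A short Python sketch of this estimator, with hypothetical success counts xᵢ (each the number of crack-free points out of n* = 10 inspected):

```python
def binomial_p_hat(x, n_star):
    # Method-of-moments estimate of p: the sample mean of the success
    # counts divided by the number of trials n* per observation.
    return sum(x) / (len(x) * n_star)

# Hypothetical counts of crack-free points in six sets of 10 points
p_hat = binomial_p_hat([8, 9, 10, 9, 10, 9], 10)   # 55/60
```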

Consider as an example a 10 m long girder where n* points are randomly selected on the edges of this girder. If at such a point no crack is observed, x = 1, whilst if one or more cracks are observed, x = 0. Suppose further that the population probability of not observing a crack at any point is 0.9. The sum of all n* values for x then has a Binomial distribution with population parameters p = 0.9 and n* = 10. In the Excel file Sim1 the formula in section A.v is inserted into cell B16 of the Binom sim sheet to do this. Cell E3 shows that the number of the n* = 10 randomly selected points with no cracks present was 8. Cells F3:J3 show the results from selecting another five randomly selected sets of 10 points along the girder's edge. In the Excel file notice how the simulated random sample of 6 values is clustered around the mean of the population in the top graph, and so it appears the simulation approach is mimicking the population well. The sample p̂ value from these 6 samples is in cell M3 (0.967) - close to the population value of 0.9 but not exactly the same.

To get a feel for how different the estimates can be from the true values in any one
sample, the Excel file Sim1 repeats the process of collecting samples of size 6. In total 30
such samples are taken by repeatedly clicking on the button in cell range B12:B14. The graphs
(2nd and 3rd figures) show that in any one sample the estimates can be quite different from the
assumed population values and so it is important to be aware of this fact when using sample
estimates. The average of the 30 sample p̂ values appears to be very close to the true value, with a bias of about -0.06% - as shown in cell M35. To accurately measure this bias it is necessary to average over a large number of samples - much more than 30. Click on the button in cell range L36:N38 to see the outcome from generating 10,000 samples of size 6. Cell M41 reveals that the sample p̂ is unbiased even in such small samples of size 6 (the bias is around 0.04%).

In summary: x̄/n* is an unbiased estimator of p even in small samples.

This is not surprising given that x̄ is an unbiased estimator of the population mean.
C. The Central Limit Theorem.

Consider a population with a mean of μ and a variance of σ² that has a non-normal distribution

    x ~ ?[μ, σ²]

where "x ~ ?" reads "x has some unknown but non-normal distribution" (for example an exponential distribution). Then imagine collecting a random sample of size n from this population. Let x1, x2, ..., xn be this sample of observations. From this sample, the sample mean x̄1 can be calculated. In theory this process can be repeated a very large number of times (m times), allowing a very large number of sample means (all calculated from the same sample size) to be calculated - x̄1, x̄2, ..., x̄m - where m is a very large number. If m is large enough, so that all possible samples of size n are taken, then these sample means also have a population distribution of values, and this distribution is called the sampling distribution of means. (By all, it is meant that there is no additional sample of size n that can be collected from the population that has not already been collected.)

The central limit theorem then states that this sampling distribution of means will have an approximate normal distribution (despite the parent population from which these samples were taken being non-normal) with mean μ_x̄ and variance σ²_x̄:

    x̄ ~ N[μ_x̄, σ²_x̄]

where μ_x̄ = μ and σ²_x̄ = σ²/n.

This approximation to the normal distribution gets better the larger the sample size n, and as a very good rule of thumb the approximation is excellent for n ≥ 30. If x is normally distributed, x ~ N[μ, σ²], then the sampling distribution of means is exactly normally distributed at any sample size, even in samples as small as n = 2:

    If x ~ N[μ, σ²] then x̄ ~ N[μ_x̄, σ²_x̄] with μ_x̄ = μ and σ²_x̄ = σ²/n

These results can be demonstrated through simulation using the protein content experiment as an example.

Illustration: Protein content of cultures.

The sheet Central Limit Theorem in Excel file Sim1 carries out this simulation demonstration. In this illustration, protein content is assumed to be exponentially distributed, x ~ Exp[λ], with λ = 0.4 (this is the assumed population value for λ). As such x is non-normally distributed. Because the distribution is assumed exponential, the population mean and variance are respectively μ = 2.5 (given by 1/λ) and σ² = 6.25 (given by 1/λ²). Cells E3:J3 simulate the physical process of testing six cultures for protein content (which essentially is randomly drawing 6 values for x from the population (exponential) distribution). The protein content of the first culture is 1.965. In the Excel file notice how the simulated random sample of 6 values is clustered around the mean of the population in the top graph, and so it appears the simulation approach is mimicking the population well. Then in cell L3 the sample average of the first 3 observations (mimicking samples of size 3) is taken, and in cell M3 the sample average of all 6 observations (mimicking samples of size 6) is taken. Thus when n = 6, x̄1 = 2.627. Cells E4:J4 simulate the physical process of testing another six cultures for protein content (which essentially is randomly drawing another 6 values for x from the population (exponential) distribution). The protein content of the first culture in this second sample is 1.002, with a sample average of x̄2 = 2.093 (when n = 6, in cell M4).

This process of repeat sampling is carried on down columns L and M so that m = 30 samples of size 3 or 6 are taken. Because m is small the population of samples of size 3 or 6 is not obtained, but m is large enough to show that the above results hold approximately. Thus in cell L39 the average of the 30 sample averages calculated from samples of size n = 3 is worked out, giving μ̂_x̄ = 2.488. Notice this is very close to μ = 2.5, demonstrating the theorem that μ_x̄ = μ. Then in cell L40 the variance of the 30 sample averages calculated from samples of size n = 3 is worked out, giving σ̂²_x̄ = 2.553. Notice this is close to σ²/n = 6.25/3 = 2.08, demonstrating the theorem that σ²_x̄ = σ²/n (the approximate equality comes from using a small value for m, m = 30). As another illustration, in cell M40 the variance of the 30 sample averages calculated from samples of size n = 6 is worked out, giving σ̂²_x̄ = 0.85. Notice this is close to σ²/n = 6.25/6 = 1.04, again demonstrating the theorem that σ²_x̄ = σ²/n.

These theorems can be demonstrated more exactly using a very large m value, so that the population of samples of size n is then very nearly obtained. In Sim1, m = 10,000 is used when clicking on the button in cell range L43:M45. If this button is clicked and n = 6 is selected, then in cell L47 the average of the 10,000 sample averages calculated from samples of size n = 6 is worked out, giving μ̂_x̄ = 2.49. Notice this is very close to μ = 2.5, demonstrating the theorem that μ_x̄ = μ. In cell L48 the variance of the 10,000 sample averages calculated from samples of size n = 6 is worked out, giving σ̂²_x̄ = 1.037. Notice this is very close to σ²/n = 6.25/6 = 1.042, demonstrating the theorem that σ²_x̄ = σ²/n.
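The same check can be sketched in Python (a stand-in for the Sim1 sheet, assuming λ = 0.4 as above): draw m = 10,000 samples of size n = 6 from the exponential population and compare the mean and variance of the sample means with μ = 2.5 and σ²/n = 6.25/6 ≈ 1.042.

```python
import random
import statistics

random.seed(2)
LAM, N, M = 0.4, 6, 10_000   # assumed population rate, sample size, repeats

# Sample means of M exponential samples of size N
means = [statistics.fmean(random.expovariate(LAM) for _ in range(N))
         for _ in range(M)]

mu_xbar = statistics.fmean(means)       # close to mu = 1/0.4 = 2.5
var_xbar = statistics.pvariance(means)  # close to sigma^2/n = 6.25/6
```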

To illustrate the convergence to normality, click the button and select n = 30 and also
select n = 30 in cell C46 using the up/down arrow buttons. The following image shows this
convergence.
[Figure: f(x) densities of the sample means for 30 samples of size n = 3, 30 samples of size n = 6, and 10,000 samples of size n = 30, together with a normal distribution with mean 2.5 and variance 6.25/n (n = 30), showing the convergence to normality.]
