Introduction of Uncertainty
1.1 Introduction
You hear statements like these every day: “it is likely to rain tomorrow;” “there is
an 80 percent chance that Tom will win the competition;” “the professor expects that
95 percent of the students will pass all of the final exams.” No one can predict exactly
whether it will rain, and you cannot be certain whether Tom will win or lose the
competition tomorrow. You cannot even be sure how many classmates will be in the
classroom when you come to class tomorrow.
As these statements show, we often face uncertainty. Nevertheless, most of the time
we are forced to make decisions under it. Uncertainty surrounds our everyday life,
and we have to understand and deal with it. To describe and quantify uncertainty, we
need the ideas of probability and statistics. The major objective of this work is to use
the concepts and methods of probability and statistics to solve real problems under
uncertainty. In addition, no one likes to work out probabilities and statistics by hand.
Fortunately, Excel and its macro language, Visual Basic for Applications (VBA),
provide suitable tools (a wide range of functions and charts) for solving problems in
probability and statistics.
The sources of uncertainty can be classified into two broad types: aleatory
uncertainty and epistemic uncertainty. Before explaining further, let us
2. Toss a die
A die is tossed six times, and the results are as follows:
1 3 4 2 1 3
In the first example, the temperature varies from one to another. In the second
example, each side of the die is equally likely to appear, and the result of a toss
cannot be predicted exactly. Uncertainties of this kind are caused by natural
randomness and are defined as aleatory uncertainty.
Aleatory uncertainty is common in daily life. For instance, if ten students are
chosen at random off the street, their physical characteristics, such as height and
weight, vary; the results of lottery games are random; the number of green lights you
see on the way home varies; and the taxi fare you pay from home to school differs
from trip to trip. These phenomena are all caused by natural randomness and are
referred to as aleatory uncertainty. Aleatory uncertainty is usually analyzed by
statistical approaches such as probability functions, and most of the time it cannot be
reduced through further measurements.
However, compared with the first two examples, the third one is different. The
wrong conclusion that the atom is indivisible stems from insufficient or
imperfect knowledge. This kind of uncertainty is defined as epistemic uncertainty,
which may be reduced through further measurements, improved experiments, or
consulting more experts. Epistemic uncertainty is also a common phenomenon.
Definition
Aleatory uncertainty: Aleatory uncertainty is caused by natural randomness and is
analyzed by statistical approaches such as probability functions.
Epistemic uncertainty: Epistemic uncertainty is caused by insufficient or
imperfect knowledge about fundamental phenomena.
On the whole, aleatory uncertainty is data based and may not be reduced or
modified, whereas epistemic uncertainty is knowledge based and may be reduced
through improved experiments. When dealing with practical problems, you can
consider these two types of uncertainty separately or together. Irrespective of the type
of uncertainty, statistics and probability provide the proper tools for modeling and
analyzing it.
The first version of Excel was released in 1985. After nearly 30 years, Excel
has become the leader in the spreadsheet market. You may also have heard of
other spreadsheets such as Office Web Apps and Google Spreadsheet.
However, these are not considered even minor threats to Microsoft. In fact, the
biggest competitor for Microsoft is itself.
As the dominant product in the commercial electronic spreadsheet market, Excel
is highly versatile. Here are just a few of the applications for Excel:
1. Powerful data analysis options: Excel provides a wide range of functions
that can be used for data analysis.
2. Creating charts and graphics: Excel provides a wide variety of highly
customizable charts and the SmartArt tools for creating professional-looking
diagrams.
3. Visual Basic for Applications: Excel provides an easy-to-learn macro
language (VBA) that helps you create structured programs directly in Excel.
4. Easy learning: Excel is user friendly, and it provides many resources to help
you learn it. Because it is the most widely used spreadsheet, you can usually solve
Excel-related problems by asking friends or searching the internet.
In this book, we focus on using Excel and its macro language (VBA) to solve
problems related to probability and statistics.
2.1 Introduction
Even without any formal knowledge of probability, you probably have the intuitive
idea that there is a 50-50 chance of getting heads when flipping a coin once; that
the probability of drawing a heart at random from a deck of 52 cards is 1/4 (13 of
the 52 cards are hearts); and that there is a 50-50 chance that the next person you
meet on the street is a girl. In everyday life, the probability of an event is the chance
that the event will happen. The formal definition is that “Probability can be referred to
as the occurrence of the events of interest relative to other events.”
In this chapter, we introduce the elementary concepts of probability and its
fundamental rules, and show basic methods of computing the probabilities of various
events. Some Excel functions that can be used in probability calculations are also
introduced.
Definition
Sample space: the set that consists of all the possible outcomes of an experiment
Sample point: the members of the sample space
Event: a set of outcomes, and simultaneously, a subset of the sample space
Example 2.1
When a die is tossed once, the possible outcomes that make up the sample
space are 1, 2, 3, 4, 5, and 6. Some compound events are as follows:
A = {1, 2}, which represents the event that the number of points is at most two.
B = {2, 4, 6}, which represents the event that the number of points is even.
With these fundamental concepts in hand, we can compute the probabilities of many
events of interest.
Definition
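For the event of interest A:
\[ \Pr(A) = \frac{\text{number of sample points in } A}{\text{number of sample points in the sample space}} \qquad (2.1) \]
Reconsider Example 2.1. We already know that the sample space is
\(\{1, 2, 3, 4, 5, 6\}\), event \(A = \{1, 2\}\), and event \(B = \{2, 4, 6\}\). The probability of event
A is \(\Pr(A) = 2/6\); the probability of event B is \(\Pr(B) = 3/6\).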
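Eq. 2.1 is a counting rule, so it can also be checked outside Excel. The following Python sketch is only an illustration: the names `sample_space`, `event_a`, `event_b`, and `probability` are hypothetical and not from the text, and the calculation assumes that all six faces of the die are equally likely.

```python
from fractions import Fraction

# Sample space and events from Example 2.1 (one toss of a fair die).
sample_space = {1, 2, 3, 4, 5, 6}
event_a = {1, 2}        # at most two points
event_b = {2, 4, 6}     # an even number of points


def probability(event, space):
    """Eq. 2.1: sample points in the event divided by sample points in the space."""
    return Fraction(len(event & space), len(space))


pr_a = probability(event_a, sample_space)
pr_b = probability(event_b, sample_space)
print(pr_a, pr_b)  # 1/3 1/2
```

`Fraction` reduces 2/6 and 3/6 to lowest terms, which is why the printed values are 1/3 and 1/2.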
The logical functions AND, OR, and IF in Excel can be used to categorize data into
small groups for further analysis. They can be used individually, or nested and
combined with other functions to perform data analysis. First, let us look at the
basics of these functions.
Function AND
The function AND returns TRUE if all of its arguments evaluate to TRUE and
returns FALSE if any argument evaluates to FALSE. You can specify up to 255
conditions.
6 2: Fundamental of Probability Models
Syntax
= AND (Logical 1, [Logical 2],…)
For instance, when planning a vacation, two important factors that affect your
decision are money and time. The function AND can help you make the
decision. Figure 2.1 illustrates the process of using the function AND.
In this example, if you have both money and time, the function returns TRUE.
Congratulations! You will have a vacation.
Function OR
The function OR returns TRUE if any argument is TRUE; it returns
FALSE if all arguments are FALSE. The function OR is used in the same way as the
function AND.
Syntax
= OR (Logical 1, [Logical 2],…)
Function IF
The function IF returns one value if the condition you evaluate is TRUE, and
another value if the condition is FALSE. Furthermore, IF functions
can be nested to provide even more decision-making ability.
Syntax
= IF (Logical test, Value if true, [Value if false])
Logical test: required, any expression that can be evaluated to TRUE or FALSE
Value if true: required, the value to be returned if the test is TRUE
Value if false: optional, the value to be returned if the test is FALSE
For instance, you can come to UST by TAXI or by MINIBUS, and the money
in your pocket is one of the important factors that affect your decision. The function
IF can help you decide. Figure 2.2 illustrates how to use the function IF.
In this example, if the money in your pocket is more than the taxi fare, you
can come to UST by TAXI; otherwise, you take the MINIBUS.
Having introduced the basics of the logical functions AND, OR, and IF,
let us see an example of how they are used in practice.
Example 2.2
Ten students from UST posted their daily spending on the internet; the details are
shown in the following table:
1) Categorize the students according to whether each day's spending is larger than
$100, using the function AND.
2) Categorize the students according to whether any day's spending is larger than
$150, using the function OR.
3) Categorize the students according to whether their total weekly
spending is larger than $800, smaller than $600, or between $600 and $800, using
the nested function IF.
[Solution]
1)
To check whether every day's spending is larger than $100, the function AND can
be used. Figure 2.3 shows the process of using the function AND to categorize the
students. If a student's spending on each day from Monday to Friday is larger than
$100, the function returns TRUE; otherwise, it returns FALSE. Part of the results is
displayed in Figure 2.3, and the complete table can be found in the Excel file named
Chapter two, in the spreadsheet named Example 2.2.
From Figure 2.3, we observe that the function returns FALSE in cell K6,
which means that the first student's spending is not larger than $100 on every day. You can
copy the function to the remaining rows by placing the mouse on the lower-right
corner (the fill handle) of cell K6 and double-clicking it.
2)
To check whether any day's spending is larger than $150, the function OR can be
used. Figure 2.4 shows the process of using the function OR. If any day's spending is
larger than $150, the function returns TRUE; otherwise, it returns FALSE. Part of the
results is displayed in Figure 2.4, and the complete table can be found in the Excel file
named Chapter two, in the spreadsheet named Example 2.2.
From Figure 2.4, we observe that the function returns TRUE in cell L6,
which means that the first student's spending is larger than $150 on at least one day. You can
complete the table easily by copying the function.
3)
IF functions can be nested to categorize the students according to whether
their total weekly spending is larger than $800, smaller than
$600, or between $600 and $800. Figure 2.5 shows the process of using the nested
function IF.
The function returns “Bad” when the total spending is larger than $800; it
returns “Good” when the total spending is less than $600; and it
returns “Medium” when the total spending is between $600 and $800. Part of the
results is displayed in Figure 2.5, and the complete table can be found in the Excel file
named Chapter two, in the spreadsheet named Example 2.2.
From Figure 2.5, we observe that the function returns “Bad” in cell M6,
which means that the first student's total weekly spending is larger than $800. You
can complete the calculation easily by copying the function.
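The three categorizations in Example 2.2 can be sketched in Python to make the logic of AND, OR, and nested IF explicit. This is an illustration only: the spending figures below are hypothetical (the real data are in the Excel file), and the function names are not from the text.

```python
# Hypothetical Monday-to-Friday spending (in dollars) for three students;
# the real data are in the Excel file that accompanies the text.
spending = {
    "Student 1": [90, 160, 200, 180, 190],
    "Student 2": [110, 120, 130, 140, 145],
    "Student 3": [80, 100, 120, 110, 95],
}


def all_days_above(days, limit=100):
    """Like =AND(...): True only if every day's spending exceeds the limit."""
    return all(amount > limit for amount in days)


def any_day_above(days, limit=150):
    """Like =OR(...): True if at least one day's spending exceeds the limit."""
    return any(amount > limit for amount in days)


def weekly_category(days):
    """Like a nested =IF(...): Bad above $800, Good below $600, else Medium."""
    total = sum(days)
    if total > 800:
        return "Bad"
    if total < 600:
        return "Good"
    return "Medium"


for name, days in spending.items():
    print(name, all_days_above(days), any_day_above(days), weekly_category(days))
```

The two `if` statements in `weekly_category` play the role of the outer and inner IF: the second test is reached only when the first one fails, just as the inner IF sits in the "value if false" position of the outer one.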
In practice, rather than using the functions AND and OR on their own, combining
them with the function IF is more common in data analysis. Example 2.3 shows
how to combine the functions AND and OR with the function IF to solve
problems related to probability calculation.
Example 2.3
Nowadays, online social network sites, which help people build
social networks, make friends, and share interests and activities, are popular, especially
among young people. A survey named “Which type of social network sites does
UST students prefer” has been conducted on three popular social network sites:
Facebook, Twitter, and Google+. Part of the survey results is displayed
in Table 2.1, and the complete table can be found in the Excel file named Chapter two,
in the spreadsheet named Example 2.3.
Table 2.1 Survey result about the online social network sites
Online Social Network
Student No. Facebook Twitter Google+
1 1 0 1
2 1 1 0
3 1 0 0
4 0 0 0
5 1 1 1
6 0 1 0
*1 stands for students who prefer the specific online social network site; 0 stands for
students who do not.
1) Calculate the probability that a student prefers both Facebook and Twitter.
2) Calculate the probability that a student prefers either Twitter or Google+.
3) Calculate the probability that a student prefers none of the online social network
sites.
[Solution]
1)
According to Eq. 2.1, we need the sample points and the sample space
to calculate the probability. In this example, 50 students participated in
the survey, so the number of possible outcomes is 50. To obtain the sample
points, we need the number of students who prefer both Facebook and Twitter.
To pick out these students, the functions IF and AND can be used.
According to Figure 2.6, if a student prefers both Facebook and Twitter, the
function returns 1; otherwise, it returns 0. The number of sample points is obtained
by adding up the 1s with the function SUM.
The probability that a student prefers both Facebook and Twitter is:
\[ \Pr(A) = \frac{13}{50} = 0.26 \]
2)
To calculate the probability that a student prefers either Twitter or Google+,
we need the sample points and the sample space, as above. The
sample space has already been obtained and is equal to 50. To obtain the sample
points, we need the number of students who prefer either Twitter or
Google+, which can be found using the functions IF and OR.
If a student prefers either Twitter or Google+, the function returns 1; otherwise, it
returns 0. The number of sample points is obtained by adding up the 1s with the
function SUM.
The probability that a student prefers either Twitter or Google+ is:
\[ \Pr(B) = \frac{30}{50} = 0.6 \]
3)
To calculate the probability that a student prefers none of them, we can first
calculate the probability that a student prefers Facebook, Twitter, or Google+, and then
obtain the probability that a student prefers none of them as its complement.
\[ \Pr(C) = \frac{41}{50} = 0.82 \]
\[ \Pr(\bar{C}) = 1 - 0.82 = 0.18 \]
In Example 2.3, the event that a student prefers both Facebook (F) and Twitter (T)
can be described as the intersection of the events F and T, written as \(F \cap T\) or \(FT\); the
event that a student prefers either Twitter (T) or Google+ (G) can be described as the
union of the events T and G, written as \(T \cup G\); the event that a student prefers at least
one of these three sites and the event that a student prefers none of them are
complementary events, written as \(\Pr(\bar{C}) = 1 - \Pr(C)\).
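The intersection, union, and complement in Example 2.3 can be illustrated with a short Python sketch. It uses only the six rows displayed in Table 2.1, not the full survey of 50 students, so the resulting values differ from 0.26, 0.6, and 0.18; the variable names are hypothetical.

```python
from fractions import Fraction

# The six rows shown in Table 2.1: (Facebook, Twitter, Google+), 1 = prefers.
rows = [
    (1, 0, 1),
    (1, 1, 0),
    (1, 0, 0),
    (0, 0, 0),
    (1, 1, 1),
    (0, 1, 0),
]
n = len(rows)

# Intersection, like =IF(AND(Facebook=1, Twitter=1), 1, 0) added up with SUM.
both_fb_tw = sum(1 for f, t, g in rows if f == 1 and t == 1)
# Union, like =IF(OR(Twitter=1, Google+=1), 1, 0) added up with SUM.
tw_or_gp = sum(1 for f, t, g in rows if t == 1 or g == 1)
# Students who prefer at least one site; "none" is the complementary event.
any_site = sum(1 for f, t, g in rows if f == 1 or t == 1 or g == 1)

pr_both = Fraction(both_fb_tw, n)
pr_either = Fraction(tw_or_gp, n)
pr_none = 1 - Fraction(any_site, n)
print(pr_both, pr_either, pr_none)
```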
Definition
In Example 2.3, the sample space is given, so the major step in calculating the
probability is to categorize the data and obtain the sample points. For
other types of experiments, however, both the possible outcomes (sample space) and
the sample points are unknown, and you need to obtain both to calculate the
probabilities. To lay out all possible outcomes (sample space), you can list them by
hand or use Excel. Examples 2.4 and 2.5 illustrate both approaches.
Example 2.4
Suppose you manage two projects at the same time and each of them has three
possible completion states:
A = 100% done, B = not sure, C = 100% failed
What is the probability that at least one project is 100% completed?
[Solution]
According to Eq. 2.1, to obtain the probability that at least one project is 100%
completed, we first need the sample space and the sample points. This
question is relatively easy, so all of the possible outcomes can be listed as
follows:
Sample space = {AA, AB, AC, BA, BB, BC, CA, CB, CC}, so there are nine possible
outcomes.
Another way to display the outcomes is a tree diagram, which presents all the
possibilities pictorially. Figure 2.9 shows the tree diagram of all
possible outcomes of Example 2.4.
Five of the nine outcomes (AA, AB, AC, BA, CA) contain at least one A, so the
probability is:
\[ \frac{5}{9} \approx 0.56 \]
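The listing in Example 2.4 can be reproduced programmatically. The sketch below is an illustration in Python, not the Excel procedure of the text; it assumes, as the example implicitly does, that all nine outcomes are equally likely, and the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import product

states = ["A", "B", "C"]   # A = 100% done, B = not sure, C = 100% failed

# Every ordered pair of states for the two projects: AA, AB, ..., CC.
sample_space = ["".join(pair) for pair in product(states, repeat=2)]

# Sample points: outcomes in which at least one project is 100% done.
at_least_one_done = [outcome for outcome in sample_space if "A" in outcome]

pr = Fraction(len(at_least_one_done), len(sample_space))
print(sample_space)
print(at_least_one_done, pr, round(float(pr), 2))
```

Changing `repeat=2` to a larger number of projects enumerates the correspondingly larger sample space in the same way.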
Example 2.4 is relatively easy, so it is possible to list all of the outcomes
by hand. For more complex experiments, however, it is hard to
list all of the possible outcomes manually. In that case, Excel becomes a
powerful tool for displaying all possible outcomes.
Example 2.5
Reconsider Example 2.4, but instead of two projects, you now manage five projects.
1 = 100% done, 2 = not sure, 3 = 100% failed.
What is the probability that at least one project is 100% completed?
[Solution]
After the outcomes are laid out, we can count that the total number of trials is
243. The functions IF and OR can be used to pick out the trials in which at least one
project is 100% completed.
Then the probability is:
\[ \frac{211}{243} \approx 0.87 \]
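The count of 211 can be cross-checked without listing the outcomes, assuming, as in Example 2.4, that all 243 outcomes are equally likely. The complement of "at least one project is 100% completed" is "no project is 100% completed", which leaves two states for each of the five projects:

```latex
\Pr(\text{at least one project done})
  = 1 - \frac{2^5}{3^5}
  = 1 - \frac{32}{243}
  = \frac{211}{243}
  \approx 0.87
```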
A card is drawn at random from a deck of 52 cards. Let A denote the event that
an ace is drawn, and let B denote the event that a diamond is drawn.
\(\Pr(A) = 4/52 = 1/13\), as there are four aces; \(\Pr(B) = 13/52 = 1/4\), since there are 13 diamonds.
\(A \cap B\) denotes the event that the ace of diamonds is drawn.
Common sense tells us that there is only one ace of diamonds in a
deck of 52 cards. Therefore, \(\Pr(A \cap B) = 1/52\), which is equal to
\(\Pr(A) \times \Pr(B) = (1/13) \times (1/4) = 1/52\). We say that the event A is independent of the
event B, meaning that the occurrence of one event does not depend on the occurrence
or nonoccurrence of the other event.
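The card example can be verified by brute-force counting. The Python sketch below is an illustration with hypothetical names; it builds a standard 52-card deck, counts the aces, the diamonds, and the cards that are both, and compares the product of the two probabilities with the probability of the intersection.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(ranks, suits))          # 52 equally likely cards

aces = [card for card in deck if card[0] == "A"]
diamonds = [card for card in deck if card[1] == "diamonds"]
ace_of_diamonds = [card for card in aces if card in diamonds]

pr_a = Fraction(len(aces), len(deck))               # 4/52 = 1/13
pr_b = Fraction(len(diamonds), len(deck))           # 13/52 = 1/4
pr_ab = Fraction(len(ace_of_diamonds), len(deck))   # 1/52

# Independence check: Pr(A and B) equals Pr(A) * Pr(B).
print(pr_ab == pr_a * pr_b)  # True
```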
Definition
Statistical independence: A and B are independent if and only if
\[ \Pr(A \cap B) = \Pr(A)\,\Pr(B) \]
The functions COUNTIF and COUNTIFS are advanced counting functions
that can handle more complex examples. We show how to use
these two functions in the following sections.
Function COUNTIF
Syntax
= COUNTIF (Range, Criteria)
Function COUNTIFS
In many cases, we want to count cells only if two or more criteria are met. The
function COUNTIFS allows us to set more than one criterion when categorizing and
counting cells. Up to 127 range/criteria pairs can be specified for the function
COUNTIFS.
Syntax
= COUNTIFS (Range 1, Criteria 1, [Range 2, Criteria 2]…)
Example 2.6
Ten students participate in a French course. Let M denote the event that a student's
midterm score is larger than 90, and let F denote the event that a student's final
score is larger than 90. Determine whether M and F are statistically independent.
[Solution]
First, the number of students whose midterm score is larger than 90 is obtained; we
find that there are six such students in this example.
According to Eq. 2.1 (sample points divided by sample space):
\[ \Pr(M) = \frac{6}{10} = 0.6 \quad \text{and} \quad \Pr(F) = \frac{5}{10} = 0.5 \]
After obtaining the probabilities that students' midterm and final
scores are larger than 90, respectively, the next step is to obtain the probability
that a student's midterm and final scores are both larger than 90, using the function
COUNTIFS.
\[ \Pr(M \cap F) = \frac{3}{10} = 0.3 \]
Because \(\Pr(M) \times \Pr(F) = 0.6 \times 0.5 = 0.3 = \Pr(M \cap F)\), the events M and F are
statistically independent.
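The counting in Example 2.6 can be mirrored in Python to show what COUNTIF and COUNTIFS do row by row. The scores below are hypothetical, chosen only so that the counts agree with the example (six, five, and three); the real scores are in the Excel file, and the variable names are not from the text.

```python
# Hypothetical scores for ten students, chosen so that the counts match
# Example 2.6; the real scores are in the accompanying Excel file.
midterm = [95, 92, 91, 96, 93, 94, 85, 80, 88, 70]
final = [93, 95, 91, 80, 85, 70, 92, 96, 60, 75]
n = len(midterm)

# Like =COUNTIF(range, ">90"): one criterion on one range.
count_m = sum(1 for m in midterm if m > 90)
count_f = sum(1 for f in final if f > 90)
# Like =COUNTIFS(range1, ">90", range2, ">90"): both criteria on the same row.
count_mf = sum(1 for m, f in zip(midterm, final) if m > 90 and f > 90)

pr_m, pr_f, pr_mf = count_m / n, count_f / n, count_mf / n
# M and F are independent when Pr(M and F) equals Pr(M) * Pr(F).
independent = abs(pr_mf - pr_m * pr_f) < 1e-12
print(pr_m, pr_f, pr_mf, independent)
```

The comparison uses a small tolerance rather than `==` because the probabilities are floating-point numbers.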
In many cases, the probability of the event A is affected by the occurrence
or nonoccurrence of another event. For instance, when you toss a die once, the
probability of landing on 1 is 1/6. However, if we have the extra information that the
die landed on 1, 3, or 5, the probability of landing on 1 changes to 1/3.
Definition
For any two events A and B with \(\Pr(B) > 0\), the conditional probability of A given
that B has occurred is defined by:
\[ \Pr(A \mid B) = \frac{\Pr(A \cap B)}{\Pr(B)} \qquad (2.3) \]
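The die illustration can be worked through Eq. 2.3 step by step. The Python sketch below is an illustration with hypothetical names; it assumes a fair die and treats "the die landed on 1, 3, or 5" as the conditioning event B.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one toss of a fair die
event_a = {1}                        # the die lands on 1
event_b = {1, 3, 5}                  # extra information: the die landed on 1, 3, or 5


def pr(event):
    """Eq. 2.1 with equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))


# Eq. 2.3: Pr(A | B) = Pr(A and B) / Pr(B), defined only when Pr(B) > 0.
pr_a_given_b = pr(event_a & event_b) / pr(event_b)
print(pr(event_a), pr_a_given_b)  # 1/6 1/3
```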
Conditional probability is common in daily life, and the following
example illustrates it.
Example 2.7
A survey named “How many languages the students could speak” has been conducted.
Seventy students participated in this survey. Part of the survey results is
displayed in Table 2.2, and the complete table can be found in the Excel file named
Chapter two, in the spreadsheet named Example 2.7.
2 0 1 0
3 0 1 0
4 0 0 0
5 1 0 1
6 0 1 0
1) What is the probability that a randomly selected student in this class can speak
Korean?
2) What is the probability that a student can speak Korean or Cantonese?
3) Given that a student can speak Cantonese, what is the probability that he or she
can speak Mandarin?
[Solution]
The first step is to find out how many students can speak Korean, Cantonese, and
Mandarin. The function SUM can be used to obtain the results:
SUM (C4 : C73) = 6, SUM (D4 : D73) = 45, and SUM (E4 : E73) = 21
1)
2)
To find the probability that a student can speak Korean or Cantonese, the
functions IF and OR in Excel can be used as described earlier. The probability
that a student can speak Korean or Cantonese is: \(\Pr(K \cup C) = 0.71\)
3)
To find the probability that a student can speak Mandarin given that he or she
can speak Cantonese, from Eq. 2.3:
\[ \Pr(M \mid C) = \frac{\Pr(M \cap C)}{\Pr(C)} = 0.29 \]
Multiplying both sides of Eq. 2.3 by \(\Pr(B)\) gives the multiplication rule:
\(\Pr(A \cap B) = \Pr(A \mid B)\,\Pr(B)\).
Definition
A and B are statistically independent if \(\Pr(A \mid B) = \Pr(A)\) and dependent otherwise.
When two events are statistically independent, the chance that A occurs
is not affected by the knowledge that B has occurred, which means that
\(\Pr(A \mid B) = \Pr(A)\) or \(\Pr(B \mid A) = \Pr(B)\).
Example 2.8
A girl has three coats and two bags, which means that she has six ways to match
a coat with a bag. Let \(A_i\) denote the event that she selects the i-th coat, for i = 1, 2, 3,
with \(\Pr(A_1) = 0.25\), \(\Pr(A_2) = 0.50\), and \(\Pr(A_3) = 0.25\). After choosing the coat, the
next step is to choose the bag. Let \(B\) denote the event that the girl chooses the first bag,
and \(\bar{B}\) the event that she chooses the second bag. The probability that the girl matches
the first coat with the first bag is 60%, whereas the corresponding percentages for the
second and third coats are 40% and 20%, respectively.
1) What is the probability that the girl chooses the first coat and matches it with the
first bag?
2) What is the probability that the girl, choosing a coat at random, matches it with the
first bag?
[Solution]
When the experiment is relatively complex, a tree diagram is helpful for laying
the situations out. Figure 2.14 shows a tree diagram that depicts the experimental
situation.
1)
The probability that the girl chooses the first coat and matches it with the first bag is:
2)
The probability that the girl randomly chooses a coat and matches it with the first bag is:
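The two quantities asked for in Example 2.8 can be sketched numerically by following the branches of the tree diagram. This is not the text's own worked solution: it is a minimal Python sketch that assumes the 60%, 40%, and 20% figures are the conditional probabilities Pr(B | A_i), and the dictionary names are hypothetical.

```python
# Values from Example 2.8. The 60%, 40%, and 20% figures are read here as the
# conditional probabilities Pr(B | A_i); this reading is an assumption.
pr_coat = {1: 0.25, 2: 0.50, 3: 0.25}                   # Pr(A_i)
pr_first_bag_given_coat = {1: 0.60, 2: 0.40, 3: 0.20}   # Pr(B | A_i)

# 1) Multiplication rule: Pr(A_1 and B) = Pr(B | A_1) * Pr(A_1).
pr_coat1_and_bag1 = pr_first_bag_given_coat[1] * pr_coat[1]

# 2) Add the three tree branches that end in the first bag to get Pr(B).
pr_bag1 = sum(pr_first_bag_given_coat[i] * pr_coat[i] for i in pr_coat)

print(round(pr_coat1_and_bag1, 2), round(pr_bag1, 2))  # 0.15 0.4
```

Under that assumption, part 1 uses a single branch of the tree, while part 2 adds the three branches that end in the first bag.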
In this chapter, we placed much emphasis on the logical functions AND,
OR, and IF, which are useful in the decision-making process. The statistical
functions COUNTIF and COUNTIFS are also important for
advanced counting.
3.1 Introduction
In the previous chapters, we learned the concepts of uncertainty and
probability. In this chapter, we introduce the concepts of random variables
and probability functions, in particular some commonly used probability
distributions such as the normal distribution. Furthermore, some Excel
functions related to these distributions are elaborated, together with
the basics of Visual Basic for Applications (VBA).
Definition
Random Variable: A random variable is a real valued function on a sample space.
The random variables are usually denoted by capital letters such as X, Y, and Z.
3: Analytical Models of Random Phenomena 27
Lowercase letters are used to represent the values of the corresponding random
variables. For instance, when tossing a coin three times, the possible outcomes are S =
{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}. Let X denote the number of heads
obtained; the possible values of X are 0, 1, 2, and 3.
There are two types of random variable: the discrete random
variable and the continuous random variable.
Definition
Discrete Variable: A random variable is discrete if its possible values form a finite set
or can be listed in an infinite sequence of numbers.
Continuous Variable: A random variable is continuous if its set of possible values
consists either of all numbers in a single interval on the number line or of all numbers
in a disjoint union of such intervals.
For instance, random phenomena such as the number of typhoons per year
and the number of points obtained when tossing a die are described by discrete random
variables, because these outcomes are countable and have physical meaning; in contrast,
random phenomena such as the time to complete a project and the lifetime of an electronic
component are described by continuous random variables, as the values of these
random variables lie in an interval.
Definition
For the discrete random variable X:
Probability Mass Function: The Probability Mass Function (PMF) of X is defined for
every number x by
\[ p(x) = \Pr(X = x) \]
Cumulative Distribution Function: The Cumulative Distribution Function (CDF) of X
is defined for every number x by
\[ F(x) = \Pr(X \le x) = \sum_{y:\, y \le x} p(y) \]
When tossing a coin three times, the number of heads is countable, so
this variable is discrete. Let X denote the number of heads obtained; the
possible values of X are 0, 1, 2, and 3. We obtain \(\Pr(X = 0) = 1/8\), \(\Pr(X = 1) = 3/8\),
\(\Pr(X = 2) = 3/8\), and \(\Pr(X = 3) = 1/8\). The PMF and associated CDF are
as follows:
\[ f(x) = \begin{cases} 1/8 & x = 0 \\ 3/8 & x = 1 \\ 3/8 & x = 2 \\ 1/8 & x = 3 \end{cases}
\qquad
F(x) = \begin{cases} 0 & x < 0 \\ 1/8 & 0 \le x < 1 \\ 4/8 & 1 \le x < 2 \\ 7/8 & 2 \le x < 3 \\ 1 & x \ge 3 \end{cases} \]
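The PMF and CDF of the number of heads can be derived by enumerating the eight outcomes. The Python sketch below is an illustration with hypothetical names; it assumes a fair coin, so that every outcome has probability 1/8.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2**3 equally likely outcomes of three coin tosses: HHH, HHT, ..., TTT.
outcomes = ["".join(tosses) for tosses in product("HT", repeat=3)]

# PMF: Pr(X = x), where X is the number of heads in an outcome.
counts = Counter(outcome.count("H") for outcome in outcomes)
pmf = {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}


def cdf(x):
    """F(x) = sum of p(y) over all possible values y <= x."""
    return sum((p for y, p in pmf.items() if y <= x), Fraction(0))


print(pmf)
print([cdf(x) for x in (-1, 0, 1, 2, 3, 4)])
```

Because `cdf` accepts any real number, it also shows the step behaviour of the CDF: it is 0 below 0, stays constant between the possible values, and equals 1 from 3 onward.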
Definition
Probability Density Function (PDF): Let X be a continuous random variable. The
PDF of X is a function f(x) such that for any two numbers a and b with \(a \le b\),
\[ \Pr(a \le X \le b) = \int_a^b f(x)\,dx \]
The three types of probability distributions described above are shown in Figure 3.2
Definition
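Uniform Distribution: A random variable X is said to be uniform over the interval
[a, b] if its PDF and CDF are:
\[ f(x) = \begin{cases} \dfrac{1}{b-a} & \text{if } x \in [a, b] \\ 0 & \text{if } x \notin [a, b] \end{cases} \qquad (3.1) \]
\[ F(x) = \begin{cases} 0 & \text{if } x < a \\ \dfrac{x-a}{b-a} & \text{if } x \in [a, b] \\ 1 & \text{if } x > b \end{cases} \qquad (3.2) \]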
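Eqs. 3.1 and 3.2 translate directly into two small functions. The following Python sketch is an illustration only; the function names and the interval [2, 6] are hypothetical, and the code assumes a < b.

```python
def uniform_pdf(x, a, b):
    """Eq. 3.1: constant density 1 / (b - a) on [a, b], zero elsewhere."""
    if a <= x <= b:
        return 1.0 / (b - a)
    return 0.0


def uniform_cdf(x, a, b):
    """Eq. 3.2: 0 below a, (x - a) / (b - a) on [a, b], 1 above b."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)


# Hypothetical example: X uniform over [2, 6].
print(uniform_pdf(3, 2, 6), uniform_cdf(3, 2, 6))  # 0.25 0.25
```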
The best-known and most widely used probability distribution is the normal
distribution. Many natural phenomena can be modeled by the family of normal
distributions. As mentioned above, the height and weight of a specified group and
SAT scores are modeled by normal distributions. Moreover,
phenomena such as the errors in scientific experiments, blood pressure,
and the time you spend traveling from your home to UST all closely fit an
appropriate normal curve.
Furthermore, several useful distributions, such as the t-distribution and the
chi-squared distribution, are based on the normal distribution; we will encounter
and elaborate on them later in this book.
Definition
Normal Distribution: The probability density function for the normal distribution
is given by
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (3.3) \]
where µ is the mean of the theoretical distribution and σ is the standard deviation.
With µ = 0 and σ = 1, Eq. 3.3 reduces to
\[ f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2} \]
which is referred to as the standard normal distribution. A
standard normal variable, which can be expressed by the capital letter Z, is
transformed from a Normal (µ, σ) random variable X by the process of
standardizing.
The standard normal variable Z, which is equal to \((x - \mu)/\sigma\), is the ratio between
\(x - \mu\) and σ. After standardizing, you can compute the probabilities for any
Function NORMSDIST
This function returns the standard normal cumulative distribution function with
µ = 0 and σ = 1 .
Syntax
= NORMSDIST (z)
Function NORMDIST
This function returns the normal distribution for the specified mean and standard
deviation. This function has a very wide range of applications in statistics, including
hypothesis testing.
Syntax
= NORMDIST(x, mean, standard_dev, cumulative)
Function NORMSINV
This function returns the inverse of the standard normal cumulative distribution.
The distribution has µ = 0 and σ = 1 .
Syntax
= NORMSINV (probability)
Function NORMINV
This function returns the inverse of the normal cumulative distribution for the
specified mean and standard deviation.
Syntax
= NORMINV (probability, mean, standard_dev)
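For readers who want to cross-check the four Excel functions outside a spreadsheet, the Python standard library offers `statistics.NormalDist` (Python 3.8 or later). The wrapper names below are hypothetical and simply echo the Excel names; the sketch assumes that a true `cumulative` argument selects the cumulative distribution and a false one the density, and the numbers 110, 100, and 10 are made up for illustration.

```python
from statistics import NormalDist

standard = NormalDist()            # mean 0, standard deviation 1


def normsdist(z):
    """Counterpart of =NORMSDIST(z): standard normal CDF."""
    return standard.cdf(z)


def normdist(x, mean, standard_dev, cumulative):
    """Counterpart of =NORMDIST(...): CDF if cumulative is true, else PDF."""
    dist = NormalDist(mean, standard_dev)
    return dist.cdf(x) if cumulative else dist.pdf(x)


def normsinv(probability):
    """Counterpart of =NORMSINV(probability): inverse standard normal CDF."""
    return standard.inv_cdf(probability)


def norminv(probability, mean, standard_dev):
    """Counterpart of =NORMINV(...): inverse CDF for the given mean and sd."""
    return NormalDist(mean, standard_dev).inv_cdf(probability)


# Standardizing: Pr(X <= x) for Normal(mean, sd) equals Pr(Z <= (x - mean) / sd).
x, mean, sd = 110.0, 100.0, 10.0   # hypothetical values
print(round(normdist(x, mean, sd, True), 4), round(normsdist((x - mean) / sd), 4))
```

The two printed numbers agree, which is the point of standardizing: any normal probability can be read from the standard normal distribution.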
If you look carefully, you will find that a normal
distribution table appears in almost every probability and statistics textbook. The
distribution tables were made by mathematicians to simplify calculation, as
integration is not an easy job and no one likes to do the integration every
single time. Nowadays, regardless of the complexity of the mathematical theory, you can
make your own standard normal distribution table with Excel functions. Example 3.1
shows how to do so.
Example 3.1
[Solution]
Generally, the standard normal distribution table contains two parts: one is the
standard normal variable, the other is the corresponding probability. It is easy to build
any kind of probability table using Excel functions. Taking the commonly used
standard normal distribution table as an example, the major steps are as
follows:
First, the z values from 0.00 to 3.49 are listed in an Excel spreadsheet, as
shown in Figure 3.3.
Second, to calculate the probability for each z, the function NORMSDIST or
NORMDIST can be used. Figures 3.3 and 3.4 show the process of using these two
functions.
Figure 3.3 Draw the normal distribution table by the function NORMSDIST
Figure 3.4 Draw the normal distribution table by the function NORMDIST
Tips:
Cell references
Sometimes a cell reference must be fixed when using or copying an Excel formula.
There are three types of references: relative reference, absolute reference,
and mixed reference. Table 3.1 gives the definitions of these three references.
We use the dollar sign ($) to fix a cell's location. After a dollar sign is added to a formula,
that part of the reference does not change automatically when the cell is copied or pasted.
You can enter the $ manually in the appropriate positions by
pressing SHIFT + 4, or use a handy shortcut, the F4 key. For instance, if you enter
= A1 to start a formula, pressing F4 converts the cell reference to = $A$1; pressing F4
again converts it to = A$1; pressing it again converts it to = $A1; pressing it one more time
returns to the original = A1.
In Example 3.1, we use mixed references. Figure 3.5 shows an example of
using mixed reference cells.
Cells formatting
In general, cell formatting is not absolutely necessary, but it can make your
tables or worksheets more professional and attractive, for example by changing the font
color, the number of decimal places, and so on. Excel provides three ways to format
cells:
(1) Using the Home tab of the Ribbon (shown in Figure 3.6).
(2) Using the Mini toolbar that appears when you right-click the cells (shown in Figure 3.7).
(3) Using the Format Cells dialog box, opened by pressing Ctrl + 1 (shown in Figure 3.8).
All three ways are available for formatting cells, and you can choose whichever
you like.
In Example 3.1, the font size of the numbers is adjusted, the numbers are centered, and
the number of decimal places is fixed. After the appropriate decimal places are set, the
table is complete. Part of the results is displayed in Table 3.2, and the
complete table can be found in the Excel file named Chapter 3, in the spreadsheet
named Example 3.1.
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
Starting of VBA
All VBA work is done in the Visual Basic Editor (VBE). The VBA modules
are invisible unless you activate the VBE. There are two ways to activate the VBE:
1. Press Alt+F11
2. Choose Developer → Code → Visual Basic
Your Excel Ribbon may not show the Developer tab. In that case, you must first turn on
the Developer tab:
After performing these steps, Excel displays a new tab named Developer, which
is shown in Figure 3.9.
After activating the VBE, you see a VBE window like the one in Figure 3.10. The
upper-left corner of the VBE window lists all projects currently open, and the
lower-left corner shows the Properties window. You write the code on the right
side of the VBE window.
Note that your VBE window may not look exactly like the window
shown in Figure 3.10.
Before you can do anything meaningful, you must have some VBA code in a
code window. This VBA code must be written within a procedure, and a procedure
consists of VBA statements. The Sub and Function procedures are the most widely
used in VBA programming. There are two ways to create the code:
1. Enter the code manually: type the code with the keyboard.
2. Use the macro-recorder feature: use Excel’s macro recorder to record
your actions and convert them into VBA code.
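To make the difference between the two procedure types concrete, here is a minimal sketch of one Sub procedure and one Function procedure entered manually. The names ShowSquare and Square are hypothetical and are used only for illustration.

```vba
' A Sub procedure performs actions; it does not return a value.
Sub ShowSquare()
    MsgBox "The square of 4 is " & Square(4)
End Sub

' A Function procedure returns a value. It can be called from other
' VBA code or typed into a worksheet cell, for example =Square(4).
Function Square(x As Double) As Double
    Square = x * x
End Function
```

Running ShowSquare displays a message box, whereas Square only returns a number to whatever called it.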
The VBA code is usually stored in a VBA module. You can insert a module by
choosing Insert → Module, which is shown in Figure 3.11.
To run a program, you can press F5 or use the Run menu, which is
shown in Figure 3.12.
Generally, the first time you save a workbook that contains macros, the default file
format is XLSX, which cannot contain macros. Excel displays a warning, which is
shown in Figure 3.13. Choose the No option.
Figure 3.13 Excel warns when saving a workbook that contains macros
After choosing the No option, select the file type called Excel
Macro-Enabled Workbook, which is shown in Figure 3.14. The file must be saved
with the XLSM extension.
Example 3.2
The proficiency test in Mandarin is one of the most popular tests nowadays in
Hong Kong. According to the Hong Kong Examination and Assessment Authority
(HKEAA), the test results fall into four classes: A ( X ≥ 90 ), B ( 80 ≤ X < 90 ), C
( 60 ≤ X < 80 ), and D ( X < 60 ). In 2011, 600 students at UST took this
test. The results are given in the Excel file named Chapter 3, in the spreadsheet
named Example 3.2.
[Solution]
1)
Using VBA
Besides using Excel functions, another way to categorize the students’ grades is to use
VBA. After activating the VBE window, you can write your code in a code
module. In Example 3.2, Code 3.1 categorizes the students’ grades into A, B,
C, and D automatically, and Code 3.2 counts the number of students
who get A, B, C, and D.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
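Code: 3.1
*********************************************************************
'Purpose: To decide which grade level each mark belongs to.
'*********start of coding***************************************
Sub test()
    Dim i As Long
    For i = 7 To 606
        ' The mark is read from column D. Writing the grade to column J is an
        ' assumption, made because Code 3.2 reads the grades starting at cell J7.
        Select Case Cells(i, 4)
            Case 90 To 100
                Cells(i, 10) = "A"
            Case 80 To 90     ' a mark of exactly 90 is already caught above
                Cells(i, 10) = "B"
            Case 60 To 80
                Cells(i, 10) = "C"
            Case Else
                Cells(i, 10) = "D"
        End Select
    Next i
End Sub
'************************************end of coding************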
This macro uses three key techniques. The role of each in this macro is
detailed as follows:
1. Sub…End Sub
VBA code must be written within a procedure; a Sub procedure is used in
Code 3.1.
2. Select Case
This structure is useful when choosing among three or more options. In this
example, the first block of code is executed if the score is between 90 and 100,
and the letter A is written to the corresponding cell; the second block of code is
executed if the score is between 80 and 90, and the letter B is written to the
corresponding cell, and so on.
3. Apostrophe (')
Any text that follows an apostrophe (') is ignored when the code is executed,
so you can use it to explain your code.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code: 3.2
*********************************************************************
'Purpose: To count the number of students who get A, B, C, and D.
'***start of coding**************************************************
Sub count()
    Dim i As Long, n As Long
    ' The grades are read from column J, starting at cell J7.
    n = 0
    For i = 1 To 1000
        If Range("j7").Cells(i, 1).Value = "A" Then
            n = n + 1
        End If
    Next
    MsgBox n & " students get A", vbOKOnly, "test"
    n = 0
    For i = 1 To 1000
        If Range("j7").Cells(i, 1).Value = "B" Then
            n = n + 1
        End If
    Next
    MsgBox n & " students get B", vbOKOnly, "test"
    n = 0
    For i = 1 To 1000
        If Range("j7").Cells(i, 1).Value = "C" Then
            n = n + 1
        End If
    Next
    MsgBox n & " students get C", vbOKOnly, "test"
    n = 0
    For i = 1 To 1000
        If Range("j7").Cells(i, 1).Value = "D" Then
            n = n + 1
        End If
    Next
    MsgBox n & " students get D", vbOKOnly, "test"
End Sub
This macro uses three key techniques. The role of each in this macro is
detailed as follows:
1. If – Then (conditional)
The If-Then construct is a widely used structure for executing statements
conditionally. The basic structure is shown as follows:
If (condition) Then
'the code statement
End If
The code is executed if the condition is true; otherwise it is skipped. In this example,
the first piece of code adds one to the counter n each time the value of a cell
is equal to A and skips the cell otherwise; the second piece of code does the same
for the cells whose value is equal to B, and so on. Notice
that each If statement has a corresponding End If statement.
2. MsgBox function
Syntax
= MsgBox (prompt, buttons, title)
prompt: required, the message that you want the user to read
buttons: optional, specifies which buttons and icon the message box displays
title: optional, the text that appears in the message box title bar
In this example, the MsgBox function displays a dialog box that reports how many
students get A, B, C, and D (shown in
Figure 3.16).
3. Ampersand (&):
The ampersand (&) is used to concatenate strings. In the above code, the number
of students (n) and the text “students get A” are concatenated, which is shown
in Figure 3.16.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2)
Based on the raw data, the functions AVERAGE and STDEV are used to obtain
E(X) and the S.D.:
E(X) = AVERAGE (E5: E604) = 78.42
S.D. = STDEV (E5: E604) = 14.99
Using Table
To obtain the probability that a student gets an A, the first step is to standardize the score:
$$ Z = \frac{78.42 - 90}{14.99} = -0.7725 $$
The second step is to check the standard normal distribution table (shown in
Appendix Table A.1). We locate the row with the first decimal of z (0.7) and the column
with the second decimal (0.07) and read Φ (0.77) = 0.7794. As the normal distribution is
symmetric, Pr (X ≥ 90) = Φ (−0.77) = 1 − Φ (0.77) = 1 − 0.7794 = 0.2206. The process to
obtain the probabilities that a student gets B, C, or D is similar.
The function NORMDIST can also be used to obtain the probabilities. Figures 3.17
and 3.18 show the process:
Similarly,
Pr (60 < X < 80) = NORMDIST(80, P6, P7, TRUE) − NORMDIST(60, P6, P7, TRUE)
= 0.43, and Pr (X < 60) = NORMDIST(60, B6, B7, TRUE) = 0.11.
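The same probabilities can be cross-checked from VBA. The following is a minimal sketch that calls the worksheet function NORMDIST through Application.WorksheetFunction, using the mean and standard deviation obtained above; the macro name GradeProbabilities is hypothetical.

```vba
Sub GradeProbabilities()
    Const mu As Double = 78.42    ' E(X) from the raw data
    Const sd As Double = 14.99    ' S.D. from the raw data
    Dim pA As Double, pB As Double, pC As Double, pD As Double

    With Application.WorksheetFunction
        pA = 1 - .NormDist(90, mu, sd, True)                            ' Pr(X >= 90)
        pB = .NormDist(90, mu, sd, True) - .NormDist(80, mu, sd, True)  ' Pr(80 <= X < 90)
        pC = .NormDist(80, mu, sd, True) - .NormDist(60, mu, sd, True)  ' Pr(60 <= X < 80)
        pD = .NormDist(60, mu, sd, True)                                ' Pr(X < 60)
    End With

    MsgBox "A: " & Format(pA, "0.00") & vbCrLf & _
           "B: " & Format(pB, "0.00") & vbCrLf & _
           "C: " & Format(pC, "0.00") & vbCrLf & _
           "D: " & Format(pD, "0.00"), vbOKOnly, "Example 3.2"
End Sub
```

The four probabilities sum to one, and the values for classes C and D should come out close to the 0.43 and 0.11 obtained with the worksheet formulas.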
3)
In this question, we want to know the lowest mark a student must obtain so
that 95 % of the students can obtain the certification. The objective is
to obtain the z value first, and then the score.
Using Table
Since Pr (X ≥ N) = 0.95 and Φ(1.65) ≈ 0.95, the equation
$$ Z = \frac{78.42 - N}{14.99} = 1.65 $$
gives N ≈ 53.75.
To determine the lowest mark that a student must obtain, the function NORMINV
can also be used:
As a result, a student obtains the certification when the score is higher than 53.75.
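The NORMINV step can also be wrapped in a small custom function. This is a sketch under the assumption that the required share of passing students is given as a probability; the name MinimumScore and its arguments are hypothetical.

```vba
Function MinimumScore(passShare As Double, mu As Double, sd As Double) As Double
    ' Find N such that Pr(X >= N) = passShare,
    ' which is the same as Pr(X < N) = 1 - passShare.
    MinimumScore = Application.WorksheetFunction.NormInv(1 - passShare, mu, sd)
End Function
```

Entering =MinimumScore(0.95, 78.42, 14.99) in a cell should return a value close to the 53.75 reported above; small differences come from rounding the mean, the standard deviation, and the z value.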
Definition
Lognormal Distribution: A nonnegative random variable X is said to have a
lognormal distribution if the random variable Y = ln X has a normal distribution.
The parameters ζ and λ of the lognormal distribution (the standard deviation and the
mean of ln X) can be calculated from the mean µ and standard deviation σ of X as
follows:
$$ \zeta^2 = \ln\{1 + (\sigma/\mu)^2\} \qquad (3.4) $$
$$ \lambda = \ln \mu - 0.5\,\zeta^2 \qquad (3.5) $$
Function LOGNORMDIST
Syntax
= LOGNORMDIST (x, mean, standard_dev)
Function LOGINV
Syntax
= LOGINV (probability, mean, standard_dev)
In both functions, mean and standard_dev are the mean and standard deviation of ln(x),
that is, λ and ζ.
Example 3.3
According to previous observations, the mean and standard deviation of the
weight of UST’s students are 136.33 pounds and 26.59 pounds, respectively. It is
not known which probability distribution the random variable follows.
1) Suppose the random variable follows the normal distribution. What is the
probability that a student’s weight is heavier than 160 pounds?
2) Suppose the random variable follows the lognormal distribution. What is the
probability that a student’s weight is heavier than 160 pounds?
[Solution]
1)
Using Table
The second step is to check the standard normal distribution table to obtain the
probability, the partial of the table is shown as follows:
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
2)
$$ \zeta^2 = \ln\{1 + (\sigma/\mu)^2\} = \ln\{1 + (26.59/136.33)^2\} = 0.04 $$
Using VBA
the function LOGNORMDIST. One way to solve this problem is to recreate a new
function using VBA as follows:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code: 3.3
*********************************************************************
’Purpose: To recreate the function LOGNORMDIST by changing the
parameters according to Eqs. 3.4 and 3.5
‘Define variables:
End Function
**************************************end of coding*****************
This macro uses one key technique. Its role in this macro is
detailed as follows:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code: 3.4
*********************************************************************
’Purpose: To recreate function LOGNORMINV by changing the
parameters according to Eqs. 3.4 & 3.5
‘Define variables:
End Function
**************************************end of coding*****************
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
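The bodies of Codes 3.3 and 3.4 are not reproduced here. As a rough sketch of the idea stated in their purpose lines, the two functions below first convert the mean and standard deviation of X into λ and ζ with Eqs. 3.4 and 3.5 and then call Excel's built-in lognormal functions. The function names LogNormDistMS and LogInvMS and their argument names are assumptions made for this sketch, not the names used in the original codes.

```vba
Function LogNormDistMS(x As Double, mu As Double, sigma As Double) As Double
    ' mu and sigma are the mean and standard deviation of X itself.
    Dim zeta As Double, lambda As Double
    zeta = Sqr(Log(1 + (sigma / mu) ^ 2))    ' Eq. 3.4 (Log is the natural logarithm in VBA)
    lambda = Log(mu) - 0.5 * zeta ^ 2        ' Eq. 3.5
    ' The built-in function expects the mean and standard deviation of ln(X).
    LogNormDistMS = Application.WorksheetFunction.LogNormDist(x, lambda, zeta)
End Function

Function LogInvMS(p As Double, mu As Double, sigma As Double) As Double
    ' Inverse of the function above: returns x such that Pr(X <= x) = p.
    Dim zeta As Double, lambda As Double
    zeta = Sqr(Log(1 + (sigma / mu) ^ 2))    ' Eq. 3.4
    lambda = Log(mu) - 0.5 * zeta ^ 2        ' Eq. 3.5
    LogInvMS = Application.WorksheetFunction.LogInv(p, lambda, zeta)
End Function
```

For part 2) of Example 3.3, the probability that a weight exceeds 160 pounds could then be entered in a cell as =1-LogNormDistMS(160, 136.33, 26.59).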
In reality, many problems involve two possible outcomes: occurrence
and nonoccurrence. Events such as whether a water test meets the
pollution control standards, whether a head or a tail appears when tossing a coin, or
whether you pass an exam are modeled as a Bernoulli sequence. A Bernoulli sequence has
several features: each trial has two possible outcomes; the probability of
success is constant in each trial; and the trials are statistically independent.
Definition
Binomial Experiment: A binomial experiment involves n independent and
identical trials such that each trial can result in one of two possible outcomes,
namely, success (S) or failure (F).
$$ b(x; n, p) = \begin{cases} \binom{n}{x} p^x (1-p)^{n-x}, & x = 0, 1, 2, \ldots, n \\ 0, & \text{otherwise} \end{cases} \qquad (3.6) $$
$$ B(x; n, p) = \sum_{y=0}^{x} b(y; n, p), \qquad x = 0, 1, \ldots, n \qquad (3.7) $$
Function BINOMDIST
Syntax
= BINOMDIST (number, trials, probability, cumulative)
Function CRITBINOM
Syntax
= CRITBINOM (trials, probability, alpha)
Example 3.4
Reconsider Example 3.2. According to the HKEAA’s rule, students get a
certification if their score is higher than 60.
1) If 10 students are chosen at random, find the probability that 8 of them get the
certification.
2) To make sure that at least 90% of students can get the certification, how many
students need to get a score higher than 60?
[Solution]
1)
Hand Calculation
According to Eq. 3.6,
$$ b(x; n, p) = \begin{cases} \binom{n}{x} p^x (1-p)^{n-x}, & x = 0, 1, 2, \ldots, n \\ 0, & \text{otherwise} \end{cases} $$
Excel Function
Pr ( X ≥ 60) = 0.89 and Pr ( X < 60) = 0.11 have already been obtained in
Example 3.2. The probability that 8 of the 10 students get the certification is as follows:
Pr(X = 8) = BINOMDIST(8, 10, 0.89, FALSE) = 0.21.
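The hand calculation with Eq. 3.6 can be sketched as a custom function and compared with BINOMDIST. The function name BinomPmf is hypothetical; COMBIN is the built-in worksheet function for the number of combinations.

```vba
Function BinomPmf(x As Long, n As Long, p As Double) As Double
    ' Eq. 3.6: b(x; n, p) = C(n, x) * p^x * (1 - p)^(n - x) for x = 0, 1, ..., n
    If x < 0 Or x > n Then
        BinomPmf = 0
    Else
        BinomPmf = Application.WorksheetFunction.Combin(n, x) * p ^ x * (1 - p) ^ (n - x)
    End If
End Function
```

Entering =BinomPmf(8, 10, 0.89) evaluates 45 × 0.89^8 × 0.11^2, which should agree with BINOMDIST(8, 10, 0.89, FALSE), about 0.21.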
2)
Excel Function
According to Example 3.2, n = 600, Pr ( X ≥ 60) = 0.89 and Pr ( X < 60) = 0.11.
As a result, to make sure that 90 percent of students get the certification,
at least 544 students should pass the exam (> 60).
Definition
Poisson distribution: The number of rare events occurring within a fixed period of
time has a Poisson distribution.
$$ f(x) = \frac{e^{-\lambda} \lambda^x}{x!}, \qquad x = 0, 1, 2, 3, \ldots \qquad (3.8) $$
$$ F(x) = \sum_{y=0}^{x} \frac{e^{-\lambda} \lambda^y}{y!} \qquad (3.9) $$
The Poisson distribution can be used when having a large number of independent
Bernoulli trials and a very small probability of success.
The function POISSON calculates the Poisson Probability Mass Function or the
Cumulative Poisson Probability Function for a supplied set of parameters.
Syntax
= POISSON ( x, mean, cumulative)
Example 3.5
[Solution]
Let X = the number of students arriving for new semester registration, and λ =
the rate parameter = 5 per hour. The probabilities can be obtained either by
hand calculation or with an Excel function.
Hand Calculation
1)
According to Eq. 3.8:
$$ f(x) = \frac{e^{-\lambda} \lambda^x}{x!} $$
$$ \Pr(X = 3) = \frac{e^{-5} \times 5^3}{3!} = 0.14 $$
2)
According to Eq. 3.9:
$$ F(x) = \sum_{y=0}^{x} \frac{e^{-\lambda} \lambda^y}{y!} $$
$$ \Pr(X \ge 3) = 1 - \left( \frac{e^{-5} \times 5^0}{0!} + \frac{e^{-5} \times 5^1}{1!} + \frac{e^{-5} \times 5^2}{2!} \right) = 1 - (0.007 + 0.034 + 0.084) = 0.875 $$
3)
In this case the time interval changes: with t = 0.75, the new parameter is equal
to 0.75 × 5 = 3.75.
According to Eq. 3.8,
$$ \Pr(X = 0) = \frac{e^{-3.75} \times 3.75^0}{0!} = 0.02 $$
Excel Function
Let X = the number of students arriving for new semester registration, and λ =
the rate parameter = 5 per hour.
1)
2)
3)
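As a sketch of how the three probabilities of the hand calculation can be evaluated with the POISSON function, the macro below calls it through Application.WorksheetFunction. The macro name PoissonExample is hypothetical, and the three calls are inferred from the hand calculation rather than taken from the worksheet.

```vba
Sub PoissonExample()
    Const rate As Double = 5      ' arrivals per hour
    Dim p1 As Double, p2 As Double, p3 As Double

    With Application.WorksheetFunction
        p1 = .Poisson(3, rate, False)           ' 1) Pr(X = 3)
        p2 = 1 - .Poisson(2, rate, True)        ' 2) Pr(X >= 3) = 1 - Pr(X <= 2)
        p3 = .Poisson(0, 0.75 * rate, False)    ' 3) Pr(X = 0) with parameter 3.75
    End With

    MsgBox "1) " & Format(p1, "0.000") & vbCrLf & _
           "2) " & Format(p2, "0.000") & vbCrLf & _
           "3) " & Format(p3, "0.000"), vbOKOnly, "Example 3.5"
End Sub
```

The three results should match the hand calculation (about 0.14, 0.875, and 0.02).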
Definition
Exponential Distribution: In a sequence of rare events, when the number of events
is Poisson, the time between events has an exponential distribution.
$$ f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0 \\ 0, & x < 0 \end{cases} \qquad (3.11) $$
$$ F(x) = \begin{cases} 1 - e^{-\lambda x}, & x \ge 0 \\ 0, & x < 0 \end{cases} \qquad (3.12) $$
where x is the time between two occurrences and λ is the expected number of occurrences in
a unit of time.
EXPONDIST
This function returns the value of the exponential distribution for a given value of
x. Generally, the function EXPONDIST is used to model the time between events,
such as the amount of time until an earthquake occurs.
Syntax
= EXPONDIST( x, λ, cumulative )
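For comparison with EXPONDIST, the cumulative distribution function of the exponential distribution can be written directly in VBA. This is a minimal sketch; the function names ExpCdf and ExpBetween and their arguments are hypothetical.

```vba
Function ExpCdf(x As Double, rate As Double) As Double
    ' Cumulative probability Pr(X <= x) = 1 - exp(-rate * x) for x >= 0, and 0 otherwise.
    If x < 0 Then
        ExpCdf = 0
    Else
        ExpCdf = 1 - Exp(-rate * x)
    End If
End Function

Function ExpBetween(a As Double, b As Double, rate As Double) As Double
    ' Probability that X falls in the interval (a, b]:
    ' the difference of the cumulative probabilities at b and at a.
    ExpBetween = ExpCdf(b, rate) - ExpCdf(a, rate)
End Function
```

For instance, =ExpCdf(2, 1/5) should agree with =EXPONDIST(2, 1/5, TRUE).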
Example 3.6
Nowadays, the computer has become an essential part of our daily life. A
student spends $10,000 on a laptop. Suppose the lifetime of the laptop follows an
exponential distribution with an average lifetime of 5 years. If the laptop fails
during the first two years, the manufacturer agrees to give a full refund. If the laptop
fails between the third and fifth years, the manufacturer will refund $1,000.
Calculate the probability that the laptop fails within the first two years, and the
probability that it fails between the third and fifth years.
[Solution]
Hand Calculation
The average lifetime is 5 years, so the failure rate is λ = 1/5 throughout the life of
the laptop. Within the first two years, according to Eq. 3.12:
$$ \Pr(X \le 2) = 1 - e^{-\frac{1}{5} \times 2} = 0.33 $$
Between the third and fifth years, the probability is the difference between the
cumulative probabilities at 5 years and at 2 years:
$$ \Pr(2 < X \le 5) = \left(1 - e^{-\frac{1}{5} \times 5}\right) - \left(1 - e^{-\frac{1}{5} \times 2}\right) = 0.632 - 0.330 = 0.302 $$
Excel Function
Within the first two years, with the failure rate λ = 1/5, the probability that the
laptop fails is:
= EXPONDIST (2, 1/5 , TRUE) = 0.33
The probability that the laptop fails between the third and fifth years is:
= EXPONDIST(5, 1/5, TRUE) − EXPONDIST(2, 1/5, TRUE) = 0.302
As a result, the probability that the manufacturer gives a refund to the student is 0.33
within the first two years and 0.302 between the third and fifth years.
Among the probability functions supported by Excel, all distributions have PDFs, and
some also have CDFs and inverse CDFs. Generally, the name of a probability
function in Excel has two parts: a base name and a suffix. The base name
is an abbreviation of the distribution name, and the suffix is either DIST or INV.
The "DIST" function evaluates the PDF and possibly the CDF. If the function
has a CUMULATIVE argument, setting this argument to TRUE causes the DIST
function to compute the CDF. If the argument is FALSE, the function returns the PDF.
The "INV" function evaluates the inverse CDF. Not all
distributions have both the "INV" and the "DIST" function; Excel’s
probability functions are summarized in Table 3.3.
NORMINV        This function returns the inverse of the normal cumulative distribution.        Ex. 3.1 & 3.2
LOGNORMDIST    This function returns the cumulative lognormal distribution.                    Ex. 3.3
LOGNORMINV     This function returns the inverse of the lognormal distribution.                Ex. 3.3
Excel’s built-in functions can be adapted using VBA. Table 3.5
summarizes the adapted functions that are used in this chapter.
4.1 Introduction
In the previous chapters, the fundamental ideas of probability were
introduced. Most of the time, we assumed that the observations come from a
particular distribution when analyzing a random phenomenon. However, no
evidence was given to verify this assumption. In Example 3.3, where we wanted the
probability that a student’s weight is heavier than 160 pounds, we assumed
that the observations follow the normal and the lognormal
distribution, respectively; when analyzing the laptop’s lifetime in Example 3.6, we
assumed that the observations follow the exponential distribution.
There is no evidence that the students’ weights and the laptop’s lifetime follow
these distributions. Therefore, we need techniques to
verify such assumptions. In this chapter, two techniques, the probability paper and
goodness-of-fit tests, are used to test whether a probability model is appropriate for
the observed data.
special probability scale which are transformed by manner adjustment. Figure 4.1
shows an example of a normal probability paper.
Figure 4.1 Normal probability paper (vertical axes: cumulative probability in percent, from 0.1 to 99.9, and the corresponding z value, from -4 to 4; horizontal axis: 0 to 100)
Plotting Position
The data points plotted on the paper consist of the observed value and
cumulative probability.
4: Determination of the Probability Distribution Models 67
Definition
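Each value from the sample is plotted as a point ( FX ( xn ) , xn ), where xn is the
nth observed value in ascending order and the cumulative probability FX ( xn ) is
calculated as follows:
FX ( xn ) = n / ( N + 1)    (4.1)
N = the number of observed data
n = the rank of the observation when the data are arranged in ascending order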
If the observed data plotted on the probability paper show a linear trend, the data
can be taken to follow the selected probability distribution. Two commonly used
probability papers are the normal and the lognormal paper. Let us take the normal
probability paper as an example to show the application of probability paper.
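The plotting position of Eq. 4.1 and the z value used as the probability scale on normal probability paper can be combined in one custom function. This is a minimal sketch; the function name PlottingZ and its arguments are hypothetical.

```vba
Function PlottingZ(n As Long, total As Long) As Double
    ' Eq. 4.1: cumulative probability of the nth smallest of "total" observations.
    Dim p As Double
    p = n / (total + 1)
    ' Standard normal z value that corresponds to this cumulative probability.
    PlottingZ = Application.WorksheetFunction.NormSInv(p)
End Function
```

If the sorted observations are plotted against these z values and the points lie close to a straight line, the normal model is plausible. Dividing by N + 1 keeps the cumulative probability strictly between 0 and 1, so the z value is always finite.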
Example 4.1
Reconsider Example 3.3, in which we assumed that the students’ weights follow the
normal distribution.
1) Draw a normal probability paper.
2) Use the normal probability paper to evaluate whether the assumption is correct.
[Solution]
1)
The next step is to obtain the corresponding z value using the function
NORMINV, and these values can be used as the special probability scale in the
vertical axis.
You can obtain the normal probability paper like Figure 4.1 if you follow the
above steps, and the complete drawing procedures can be seen in Excel file named
Chapter 4 with the spreadsheet named Example 4.1.
2)
Select the range of data you want to rearrange → right-click any cell in
the selected range → choose Sort from the shortcut menu → choose Custom Sort. Excel
then displays the Sort dialog box (shown in Figure 4.2).
Function SMALL
The function SMALL returns the kth smallest value in a data set.
Syntax
= SMALL (array, k)
array: a range of data for which you want to determine the kth smallest value
k: the position (from the smallest) in the array or range of data to return
Figure 4.3 shows the process of using the function SMALL in Example 4.1. The
reference to the data range is fixed this time, and you can fill the rest of the column
by double-clicking the fill handle of cell O12.
In Table 4.1, the weights arranged in ascending order are shown in
columns 2 and 5, and the corresponding cumulative probabilities are shown in
columns 3 and 6. Part of the survey results is displayed in Table 4.1, and the
complete table can be seen in the Excel file named Chapter 4, in the spreadsheet named
Example 4.1.
Figure 4.4 Weights of UST’s students plotted on the normal probability paper (horizontal axis: z value; vertical axis: weight in pounds; labeled points: (0, 136.33) and (1, 162.88))
Figure 4.6 Turning on the macro recorder by clicking the record macro icon
Excel displays the Record Macro dialog box after you click Record Macro,
which is shown in Figure 4.7.
After introducing the fundamentals of the macro recorder, we demonstrate the
steps of using it. In the following example, a range of cells is
formatted while the macro recorder is running. The steps are shown as follows:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code 4.1
*********************************************************************
Sub format1()
'
' format1 Macro
'
' Keyboard Shortcut: Ctrl+q
'
Range("K11:L71").Select
With Selection.Font
.Name = "Calibri"
.Size = 10
.Strikethrough = False
.Superscript = False
.Subscript = False
.OutlineFont = False
.Shadow = False
.Underline = xlUnderlineStyleNone
.ColorIndex = xlAutomatic
.TintAndShade = 0
.ThemeFont = xlThemeFontNone
End With
With Selection.Font
.Name = "Calibri"
.Size = 12
.Strikethrough = False
.Superscript = False
.Subscript = False
.OutlineFont = False
.Shadow = False
.Underline = xlUnderlineStyleNone
.ColorIndex = xlAutomatic
.TintAndShade = 0
.ThemeFont = xlThemeFontNone
End With
With Selection.Font
.Color = -4165632
.TintAndShade = 0
End With
End Sub
*********************************************************************
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In this example, the macro is named format1 and is assigned the shortcut key Ctrl
+ q. The cells from K11 to L71 are selected, and their font size and color
are changed. Excel’s macro recorder translates the actions into VBA code, which is
shown above. This technique helps a beginner to learn VBA, and it is also helpful
when you do not know how to write the code.
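Recorded macros are usually longer than necessary, because the recorder writes out every font property and works through the selection. As a sketch of what a hand-written equivalent might look like, the macro below sets only the final font name, size, and color on the same range without selecting it. The macro name FormatRangeDirect is hypothetical.

```vba
Sub FormatRangeDirect()
    ' Apply the final font settings of the recorded macro directly to the range.
    With Range("K11:L71").Font
        .Name = "Calibri"
        .Size = 12
        .Color = -4165632    ' color value taken from the recorded macro
    End With
End Sub
```

A common workflow is to record a macro first to discover the relevant objects and properties, and then trim the recorded code down in this way.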
The chi-squared test for goodness-of-fit is widely used to determine whether the
observations come from a particular distribution. The basic logic is to test whether the
difference between the expected data and the observed data is acceptable. The data are
divided into k intervals, and the observed and theoretical frequencies are then
obtained using Excel functions. The observed frequencies in the k intervals are compared
with the corresponding theoretical frequencies: if the computed chi-squared value is less
than the critical value, the prescribed model is acceptable. The equation is shown as
follows:
$$ \chi^2 = \sum_{i=1}^{k} \frac{(n_i - e_i)^2}{e_i} < c_{1-\alpha, f} \qquad (4.2) $$
Where
ni : observed frequency
ei : theoretical frequency
α : level of significance
After obtaining the value of ∑ (ni − ei)² / ei , the next step is to compare it
with the critical value c1−α , f ( α is the level of significance, and f is the number of
degrees of freedom). You can obtain the critical value from the table or with the Excel
function CHIINV. Note that f is equal to k − 1 as n → ∞; if parameters of the distribution
are estimated from the data, f must be reduced by the number of estimated parameters m,
so that f is equal to k − 1 − m.
According to Eq. 4.2, statistics such as the observed and theoretical
frequencies must be calculated before further analysis. Excel functions can
be used to obtain these statistics. Therefore, before showing the example, some Excel
functions related to the chi-squared test for goodness-of-fit are introduced.
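The comparison in Eq. 4.2 can also be sketched as two custom VBA functions. They assume that the observed and theoretical frequencies are stored in two single-column ranges of equal length; the function names ChiSquareStat and ChiSquareAccept and their arguments are hypothetical.

```vba
Function ChiSquareStat(observed As Range, expected As Range) As Double
    ' Sum of (n_i - e_i)^2 / e_i over the k intervals (left side of Eq. 4.2).
    Dim i As Long, total As Double
    For i = 1 To observed.Cells.Count
        total = total + (observed.Cells(i).Value - expected.Cells(i).Value) ^ 2 _
                        / expected.Cells(i).Value
    Next i
    ChiSquareStat = total
End Function

Function ChiSquareAccept(observed As Range, expected As Range, _
                         alpha As Double, m As Long) As Boolean
    ' f = k - 1 - m, where m is the number of parameters estimated from the data.
    Dim f As Long
    f = observed.Cells.Count - 1 - m
    ' The model is acceptable if the statistic is below the critical value c(1 - alpha, f).
    ChiSquareAccept = ChiSquareStat(observed, expected) < _
                      Application.WorksheetFunction.ChiInv(alpha, f)
End Function
```

With ten intervals and two estimated parameters, the second function uses f = 7, so at alpha = 0.05 the statistic is compared with CHIINV(0.05, 7).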
Function MAX
Function MAX returns the largest value from a supplied set of numerical values.
Syntax
= MAX ( number1, [number2], ... )
Function MIN
Function MIN returns the smallest value from a supplied set of numerical values.
Syntax
= MIN ( number1, [number2], ... )
Function ABS
Syntax
= ABS (number)
number: the real number of which you want the absolute value
Function CHIINV
Syntax
= CHIINV (probability, degrees_freedom)
Function FREQUENCY
Function FREQUENCY can be used to calculate how often values occur within a
range of values and return a vertical array of number.
Syntax
= FREQUENCY (data_array, bins_array)
data_array: a set of values for which you want to count frequencies
bins_array: an array of, or a reference to, the intervals into which you want to group the
values in data_array
Example 4.2
In Example 3.3, we assumed in turn that the random variable follows the normal and the
lognormal distribution. However, it is not known whether the random variable actually
comes from the normal or the lognormal distribution. In this example, the chi-squared test
is used to evaluate the appropriateness of the proposed normal and lognormal
distributions.
[Solution]
No. of classes: 10
Class width: 11.99
After obtaining the intervals, the next step is to count how often values occur
within each interval using the function FREQUENCY. To create the frequency
distribution, select a range of cells (in this example, B6 : B65) that corresponds to the
number of cells in the bin range (in this example, E19: E28). Then enter the
FREQUENCY formula and confirm it as an array formula by pressing Ctrl + Shift + Enter.
Figure 4.8 shows the process of using the function FREQUENCY.
Figure 4.9 shows the process to obtain the theoretical normal frequency by the
function NORMDIST.
Tips
You can see a small triangle in the corner of cells F18, G18, and H18, which
indicates a comment. Sometimes it is helpful to add a comment to explain a cell in
the spreadsheet, as the cells are too small to hold the explanation. Right-click the
cell and choose Insert Comment from the shortcut menu; the comment becomes
visible when you move the mouse over the cell.
Table 4.3 shows the summary of the calculations needed for the chi-squared test,
including the Observed Frequency (ni), Theoretical Frequency(ei), and the value of
∑ (ni − ei )2 / ei .
The histogram and two PDFs of theoretical distributions are shown in Figure 4.11.
Figure 4.11 Chi-squared test to discriminate the two distribution models (histogram of the observed frequencies with the normal and lognormal curves; horizontal axis: weight in pounds; vertical axis: frequency)
obtained value with the critical value ( c1−α , f ). In both the normal and the lognormal
distribution, two parameters are estimated from the available data.
Therefore, the number of degrees of freedom in both cases is f = 10 − 1 − 2 = 7. At the
5% significance level with f = 7, the critical value is obtained from Appendix Table A.3:
c0.95,7 = 14.07.
The function CHIINV can also be used to obtain the critical value as follows:
c0.95,7 = CHIINV(0.05, 7) = 14.07
$$ \sum (n_i - e_i)^2 / e_i = 17.23 > 14.07 $$
$$ \sum (n_i - e_i)^2 / e_i = 17.58 > 14.07 $$
Another widely used goodness-of-fit test is the Kolmogorov-Smirnov (K-S) test. It compares
the experimental cumulative frequency S n ( x) with the theoretical cumulative probability
F(x): if the maximum discrepancy between the two is smaller than the critical value for a
given sample size, the model is acceptable.
For a sample of size n, the observed data are rearranged in ascending order.
From this ordered sample, the experimental cumulative frequency function is
established as follows:
$$ S_n(x) = \begin{cases} 0, & x < x_1 \\ k/n, & x_k \le x < x_{k+1} \\ 1, & x \ge x_n \end{cases} \qquad (4.3) $$
Let D_n denote the maximum difference between S_n(x) and F(x), and let
D_n^α denote the critical value, which is tabulated in Appendix Table A.4. If D_n is less
than the critical value D_n^α at the prescribed significance level α , the theoretical
distribution is acceptable.
Compared with the chi-squared test, one advantage of the K-S test is that it is
not necessary to divide the observed data into intervals, which makes the test more
convenient to carry out. Example 4.3 shows the procedure for using the K-S test.
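The maximum discrepancy can be computed with a short custom function. The sketch below assumes that the data are stored in a single column already sorted in ascending order and that the normal distribution is the model being tested; the function name KSStat and its arguments are hypothetical. Following Eq. 4.3, it compares k/n with the theoretical cumulative probability at each observation.

```vba
Function KSStat(sortedData As Range, mu As Double, sd As Double) As Double
    ' sortedData: single-column range, sorted in ascending order (assumption).
    Dim n As Long, k As Long
    Dim d As Double, dMax As Double
    n = sortedData.Cells.Count
    For k = 1 To n
        ' |S_n(x_k) - F(x_k)| with S_n(x_k) = k / n and F taken from the normal model.
        d = Abs(k / n - Application.WorksheetFunction.NormDist( _
                sortedData.Cells(k).Value, mu, sd, True))
        If d > dMax Then dMax = d
    Next k
    KSStat = dMax
End Function
```

The returned value is then compared with the critical value at the chosen significance level; for another candidate distribution, only the call that evaluates the theoretical cumulative probability needs to change.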
Example 4.3
In Example 4.2, the chi-squared test was used to evaluate
whether the observations in Example 3.3 come from the normal or the
lognormal distribution. Now the K-S test is used for the same purpose.
[Solution]
To solve this problem, the first step is to obtain the experimental and theoretical
cumulative probabilities. Part of the results is displayed in Table 4.4, and the
complete table can be seen in the Excel file named Chapter 4, in the spreadsheet named
Example 4.3.
In Table 4.4, the second column shows the rearranged tabulated data in an
increasing order; the third column illustrates the calculations of experimental
cumulative frequency using Eq. 4.3; the fourth and fifth columns show the
corresponding cumulative frequencies of the normal and lognormal distribution; the
sixth and seventh columns show the discrepancy of the two cumulative frequencies.
Figure 4.12 K-S Tests to discriminate two distribution models (cumulative frequency of the data with the normal and lognormal CDFs; horizontal axis: xn; vertical axis: CDF)
Note that Excel does not provide a function to obtain the critical value
D_n^α. However, we can create a custom function to obtain the critical value using
VBA.
You can create your own VBA functions when no suitable built-in Excel function
exists. The syntax is shown as follows:
After showing the process of creating custom functions, let us see an example
of how to create a function that calculates the critical value D_n^α using VBA.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code 4.2
**************************************************************
n > 50.
‘Define variables:
End Function
*********************************************end of coding********
This macro consists of two key techniques. The respecting role in this macro is
detailed as follows:
In this example, the function’s name is ks, the parameters are afa and n. When
afa is equal to 0.2, the function returns ks = 1.07/ n^0.5. After finishing the creation,
you can application the functions by put the parameters into the function.
2) If – ElseIf (conditional)
This function handles several conditions (afa = 0.2, 0.1, 0.05, 0.01), so we
can use If...ElseIf statements. If afa is equal to 0.2, the macro evaluates ks =
1.07 / n ^ 0.5. If afa is equal to 0.1, the macro evaluates ks = 1.22 / n ^ 0.5,
and so on.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that this function is valid only when n is larger than 50.
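The conditional logic of the ks function can be sketched in Python for readers who want to check the numbers outside Excel. This is a minimal sketch rather than the original Code 4.2: the names ks, afa, and n follow the text, the coefficients 1.07 and 1.22 are quoted in the text, and the coefficients 1.36 and 1.63 for the 5% and 1% levels are the commonly tabulated large-sample values, assumed here.

```python
def ks(afa, n):
    """Approximate K-S critical value for significance level afa and sample size n."""
    if n <= 50:
        raise ValueError("the large-sample formula is intended for n > 50")
    if afa == 0.2:
        coefficient = 1.07   # quoted in the text
    elif afa == 0.1:
        coefficient = 1.22   # quoted in the text
    elif afa == 0.05:
        coefficient = 1.36   # assumed: commonly tabulated value
    elif afa == 0.01:
        coefficient = 1.63   # assumed: commonly tabulated value
    else:
        raise ValueError("afa must be 0.2, 0.1, 0.05, or 0.01")
    return coefficient / n ** 0.5


print(round(ks(0.05, 148), 2))  # 0.11
```

The same If...ElseIf chain carries over to VBA line by line; only the syntax of the branches changes.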
At the 5% significance level with n = 148, we obtain the critical value of $D_n^{\alpha}$
Since $D_1 = 0.08 < 0.18$ and $D_2 = 0.08 < 0.18$, the normal and lognormal
Example 4.4
Modeling System with Uncertainty is one of the required courses for CIVL
students at UST. Based on previous experience, the analysts have assumed
that the students' scores follow the normal distribution before doing further analysis.
However, there is a lack of evidence to support this hypothesis. In this example, the
K-S test provides a quantitative procedure to test the validity of three assumed
distribution models: normal, lognormal, and gamma. Table 4.5 shows part
of the testing results; the complete table is shown in the Excel file named Chapter 4, in
the spreadsheet named Example 4.4.
[Solution]
In Table 4.6, the first column shows the k value; the second column shows the
tabulated data rearranged in increasing order; the third column gives the
experimental cumulative frequency calculated using Eq. 4.3; the fourth, fifth,
and sixth columns show the corresponding cumulative probabilities from the normal,
lognormal, and gamma distributions, respectively; the seventh, eighth, and ninth columns
show the discrepancies between the experimental and theoretical cumulative frequencies.
[Figure: CDF plotted against xn, comparing the cumulative frequency of the data with the
normal, lognormal, and gamma CDFs]
Figure 4.13 K-S test to discriminate three distribution models for midterm
scores.
From Table 4.6, we observe that the maximum discrepancies between the
empirical cumulative frequency and the normal ($D_1$), lognormal ($D_2$), and gamma ($D_3$)
distributions are 0.08, 0.11 and 0.10, respectively.
At the 5% significance level with n = 148, we obtain the critical value $D_n^{\alpha}$
from Appendix Table A.4 as $D_{148}^{0.05} = 0.11$.
Since $D_1 = 0.09 < 0.11$, $D_2 = 0.12 > 0.11$, and $D_3 = 0.108 < 0.11$,
according to the K-S test, the normal and gamma distributions are accepted as
models at the 5% significance level, whereas the lognormal distribution is
rejected because its maximum discrepancy is larger than the
critical value.
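The maximum discrepancy used in the K-S test can be computed with a few lines of Python. This is a sketch under stated assumptions: the experimental cumulative frequency of the k-th ordered observation is taken as k/n (Eq. 4.3 is not reproduced here, so this form is assumed), the scores are hypothetical, and the normal model is fitted with the sample mean and standard deviation.

```python
from statistics import NormalDist, mean, stdev


def ks_statistic(data, cdf):
    """Largest gap between the experimental frequency k/n and the model CDF."""
    xs = sorted(data)
    n = len(xs)
    return max(abs(k / n - cdf(x)) for k, x in enumerate(xs, start=1))


# Hypothetical scores, used only to illustrate the calculation.
scores = [52, 61, 64, 68, 70, 73, 75, 79, 83, 91]
model = NormalDist(mu=mean(scores), sigma=stdev(scores))
d_normal = ks_statistic(scores, model.cdf)
print(round(d_normal, 3))
```

The resulting value plays the role of $D_1$ above and is compared with the critical value for the chosen significance level.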
The Excel functions used in this chapter are summarized in Table 4.7. The
functions MAX and MIN are used to find the largest and smallest values of the
observations. The functions NORMDIST, LOGNORMDIST, and GAMMADIST are
used to obtain the theoretical cumulative frequencies. The function CHIINV is used to
obtain the critical value in the chi-squared test.
Excel's built-in functions can be supplemented with custom functions written as VBA
macros. Table 4.8 summarizes the custom functions that are used in this chapter.
5.1 Introduction
In reality, many problems are difficult to solve analytically. For
instance, it is difficult to derive analytically the distribution function of an event
that is governed by two (or more) random variables following different distributions.
Under such conditions, we can apply a numerical approach.
Monte Carlo simulation (MCS) is widely used to solve problems containing
uncertainties, and it also broadens the application of probabilistic and statistical
models. The fundamental contribution of MCS is to generate a large set of random
numbers following the prescribed probability distributions.
In this chapter, some essentials of MCS will be introduced, together with the
process of demonstrating the Central Limit Theorem using the MCS method.
The name Monte Carlo was first used by the scientists developing
nuclear weapons at Los Alamos in the 1940s. Because the physicists involved in this
work were big fans of gambling, and Monte Carlo in Monaco was a center for gambling,
they gave the simulations the code name Monte Carlo.
Definition
Monte Carlo simulation: Monte Carlo simulation is a method of artificially
recreating a chance process (usually with a computer), running it many times, and
then observing the results directly.
The main contribution of MCS is to provide numerical methods for solving
probabilistic problems that are difficult to solve analytically. MCS is
now used in many diverse fields. For instance, in commercial practice, many
companies use MCS as an important forecasting tool; in the field of
probability and statistics, it can be used to compute probabilities, expected values,
and other distribution characteristics.
MCS has several advantages. For instance, the algorithms are simple, and MCS
provides the flexibility to try things out before building the actual system.
However, this method also has several disadvantages. For instance, the simulation
never corresponds fully to the actual system, so uncertainties and errors remain.
In addition, MCS requires a large number of calculations, so a simulation can take a
long time to run.
The Monte Carlo method is practical only with a computer. Statistics software
packages such as SAS, SPSS, MATLAB, and Excel have built-in procedures for
generating random variables from the most common distributions. In this
chapter, Excel-based random number generation will be introduced.
Function RAND
Syntax
= RAND()
Function RANDBETWEEN
Syntax
= RANDBETWEEN (bottom, top)
Although Excel contains built-in functions to obtain random numbers, the
Random Number Generation tool is much more flexible than the built-in
functions. To apply this tool, choose Data → Analysis → Data Analysis,
and you will see the dialog box displayed in Figure 5.1. The random
variables can be obtained by choosing the type of distribution and then entering
the relevant parameters.
Figure 5.2 shows the dialog box used for the random number generation. The
parameter box varies depending on the type of distribution that you select.
Types of Distributions(8)
Example 5.1
A player has a chance to roll a die once after paying 60 dollars. The income the
player receives is equal to the cube of the number of points. For instance, you get eight
dollars when the number of points is equal to two. Forecast whether you can make a
profit from the game (suppose the number of trials is equal to 1,000).
[Solution]
In this example, the income is obtained as the cube of the random number of points
that the player gets, and the benefit is obtained from the formula: Benefits =
Income – Cost. After that, the average benefit E(x) can be calculated, and then the
decision can be made. The Excel function based method and the VBA based method are
used to solve the problem following the steps just mentioned.
Excel Solution
function RAND is from 0 to 0.99999…, we can add 1 to make the random numbers
range from 1 to 6.
The function INT can be used to round the numbers down to the nearest integer.
Figure 5.3 shows the process of generating random numbers by the functions RAND
and INT.
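The RAND-and-INT construction can be mirrored in Python as a quick check of the range of values it produces. This is an illustrative sketch: the function name roll_die and the seed are hypothetical, and random.random() plays the role of RAND by returning a value from 0 up to, but not including, 1.

```python
import random


def roll_die(rng):
    # rng.random() lies in [0, 1), so rng.random() * 6 lies in [0, 6);
    # int() truncates to 0..5, and adding 1 shifts the result to 1..6.
    return int(rng.random() * 6) + 1


rng = random.Random(1)          # fixed seed so the run is repeatable
rolls = [roll_die(rng) for _ in range(1000)]
print(min(rolls), max(rolls))
```

The corresponding worksheet formula is INT(RAND()*6)+1.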
Figure 5.5 shows the Random Number Generation dialog box.
The settings in Figure 5.5 mean that 1000 random numbers are placed in column Q, with
values from 1 to 6.999.
2. Obtain income
The function POWER can be used to calculate the income you could get.
Figure 5.6 shows the process of calculating the income using the function POWER.
Syntax:
= POWER(number, power)
3. Obtain profit
According to the equation Profits = Income - Cost, the profit you can get each
time can be obtained (shown in Figure 5.6).
Tips
Conditional Formatting
In this example, we use an Excel feature called conditional formatting, which is
a useful way to quickly identify particular types of cells. There are different types
of conditional formatting rules, and you can find them by choosing
Home → Styles → Conditional Formatting. Figure 5.7 shows some types of conditional
formatting rules.
In Example 5.1, our objective is to display all negative values of the income in a
different color. The first step is to select all cells that you want to format, and then
choose Highlight Cells Rules. After you enter 0 into the box, the values greater than 0
are highlighted.
Besides the rules suggested by Excel, you can make
your own rules by selecting Home → Styles → Conditional Formatting → New
Rule. Figure 5.9 shows the New Formatting Rule dialog box.
4. Make a decision
As the income and benefit have already been calculated in the steps above, the
average benefit E(x) can be obtained as E(x) = AVERAGE(G6:G1005) = 13.12.
The complete calculation process is presented in the Excel file named Chapter 5, in the
spreadsheet named Example 5.1.
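As a cross-check on the simulated average, the expected benefit can also be worked out exactly. The calculation below assumes a fair die, so that each of the six faces has probability 1/6:

```latex
E(\text{income}) = \frac{1}{6}\sum_{k=1}^{6} k^{3}
                 = \frac{1 + 8 + 27 + 64 + 125 + 216}{6}
                 = \frac{441}{6} = 73.5,
\qquad
E(x) = E(\text{income}) - \text{cost} = 73.5 - 60 = 13.5 .
```

The simulated value of 13.12 from 1,000 trials is close to this exact value, and the gap is the kind of sampling variation a simulation of that size is expected to show.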
VBA Solution
Another way to generate the random numbers and obtain the profit is to use
VBA.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code 5.1
**************************************************************
'******start of coding******************************************
Sub example1()
    ' Clear the previous results in columns N and O
    Range("N6:O60000").ClearContents
    n = InputBox("n:", "numbers of trials")
    Range("r6").Value = n
    For i = 1 To n
        Range("N6").Cells(i, 1) = i
        'generate the integer random numbers from 1 to 6
        '(assumption: the dice value is written to column O)
        Range("N6").Cells(i, 2) = Int(Rnd * 6) + 1
    Next i
End Sub
*********************************************end of coding************
This macro uses six key techniques. The role of each technique in this macro is
detailed as follows:
In this example, the counter starts from 1 and ends with n, so the loop is executed
n times in total.
2. Define Cells
With the Cells property, you do not refer to a cell by an address such as "A1" or "B2."
Instead, to return a specific cell, you specify a row and column index. For instance, if
you want to return the spreadsheet cell B4, you can write Cells(4, "B") or
Cells(4, 2).
4. ClearContents
This statement is used to clear the contents of the selected range of cells; here the
range is from N6 to O60000.
5. InputBox
The input box is used to enter the number of random numbers you want to
simulate. The syntax is shown as follows:
n = InputBox("n:", "numbers of trials")
6. Rnd
The VBA built-in Rnd function returns a random number greater than or equal to 0 and
less than 1.
*********************************************************************
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
According to the previous calculation, E(x) > 0. Therefore, the player can expect to
make a profit on average, but the amount is not large.
One of the fundamental questions before doing a simulation is how
many trials are sufficient for a complex model. Generally speaking, the larger
the sample size, the higher the reliability of the results. The accuracy is measured in
terms of the C.O.V. (coefficient of variation). The following example shows the
influence of the sample size when doing the simulation.
Example 5.2
[Solution]
1)
Let $\mu_e = 150$, $\sigma_e = 30$ and $\mu_i = 100$, $\sigma_i = 25$. As both the EQ
and IQ follow the normal distribution, the sum of the two normal variates is also a
normal variate. Under this condition, an analytical solution is available.
The mean ($\mu_t$) and the variance ($\sigma_t^2$) can be obtained as follows:
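A worked version of this calculation is sketched below. It assumes that EQ and IQ are independent (an assumption, since the problem statement is not reproduced here), so that the variances add, and it uses the threshold of 300 that appears later in the solution:

```latex
\mu_t = \mu_e + \mu_i = 150 + 100 = 250, \qquad
\sigma_t^{2} = \sigma_e^{2} + \sigma_i^{2} = 30^{2} + 25^{2} = 1525, \qquad
\sigma_t = \sqrt{1525} \approx 39.05,
\\[4pt]
P(T > 300) = 1 - \Phi\!\left(\frac{300 - 250}{39.05}\right)
           = 1 - \Phi(1.28) \approx 0.10 .
```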
2)
As mentioned earlier, the key to MCS is generating the random numbers. In this
experiment, the purpose is to generate series of random numbers that follow
N(150, 30) and N(100, 25). First, random numbers between 0 and 1 are
generated using the function RAND, and then these uniform random numbers are
transformed to normally distributed numbers using the function NORMINV.
Figure 5.10 shows the process of generating the uniform random numbers with the
function RAND.
Figure 5.10 The random numbers generation using the function RAND
After these uniform random numbers have been obtained, they can be transformed
to normally distributed numbers. Figure 5.11 shows the formula used to generate the
random numbers with the function NORMINV.
The Random Number Generation tool can also be used to generate the random
numbers. Figure 5.12 shows the process of generating random numbers following the
normal distribution.
In Example 5.2, the random numbers are arranged in one column with 100 rows, with
µ = 100 and σ = 25, and the output is placed in column Q.
According to Eq. 2.1, Pr(event) = sample points / sample space, the probability
that the sum of these two random variables is larger than 300 can be obtained. The
function COUNTIF can be used to count the sample points. Table 5.1 shows the
probability that T > 300 when n is equal to 10, 15, 100, and 1000. The complete
calculation process is presented in the Excel file named Chapter 5, in the spreadsheet
named Example 5.2.
In this example, we can observe that when the number of trials is small, such as
10 and 15, the results are far from the analytical result and change easily when
F9 is pressed. However, when the number of trials is large, such as 1000, the
probability is approximately equal to 0.104, which is much closer to the analytical
result. Furthermore, the results are more stable when the sample size is large.
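The effect of the number of trials can be reproduced with a short Python sketch. It is illustrative only: EQ and IQ are assumed independent, the function name estimate_probability and the seed are hypothetical, and random.gauss is used in place of the NORMINV(RAND(), mean, sd) construction. The analytical value is computed from the sum of the two normal variates for comparison.

```python
import random
from statistics import NormalDist


def estimate_probability(n, rng):
    """Fraction of n simulated totals T = EQ + IQ that exceed 300."""
    count = 0
    for _ in range(n):
        eq = rng.gauss(150, 30)   # EQ ~ N(150, 30)
        iq = rng.gauss(100, 25)   # IQ ~ N(100, 25)
        if eq + iq > 300:         # same role as COUNTIF on the totals
            count += 1
    return count / n


rng = random.Random(2024)
exact = 1 - NormalDist(250, (30 ** 2 + 25 ** 2) ** 0.5).cdf(300)
for n in (10, 15, 100, 1000, 100000):
    print(n, estimate_probability(n, rng))
print("analytical", exact)
```

Rerunning with a different seed plays the same role as pressing F9 in the spreadsheet: the small-n estimates jump around, while the large-n estimates stay near the analytical value.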
In Example 5.2, as both EQ and IQ follow the normal distribution, the sum of the
two normal variates is also a normal variate. However, if the variables follow
different distributions, the type of distribution followed by their sum
cannot easily be determined. As mentioned earlier, one of the advantages of MCS is that
it can solve probabilistic problems that cannot be solved by analytical methods.
Example 5.3 uses the Monte Carlo method to solve a problem
that does not have an analytical solution.
Example 5.3
Midterm scores follow M ~ N(185, 30) and final scores follow F ~ U(30, 185). How many
students can have a total score (T = M + F) greater than 300?
1) What model does T follow?
2) Try simulation (MCS) to find the probability that T >300.
[Solution]
1)
For the first question, the model that T follows cannot be determined directly, as the
distributions of the midterm and final scores are different. Under this condition, the
analytical solution cannot be used to solve this problem.
2)
In this example, the functions RAND and NORMINV are used to generate random numbers
that follow the normal distribution with a mean of 185 and a standard deviation of 30.
The syntax is NORMINV(RAND(), mean, sd). Figure 5.13 shows the procedure to simulate a
series of normal variables.
The final scores follow the uniform distribution with smallest and largest
values of 30 and 185, respectively. The function RAND can be used to generate the
random numbers. The syntax is: RAND()*(upper limit – lower limit) + lower limit.
Figure 5.14 shows the procedure to simulate a series of uniform variables.
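Both transformations can be sketched in Python to make the mapping from a uniform random number explicit. The function names and the seed are hypothetical; NormalDist.inv_cdf stands in for NORMINV, and a zero draw is skipped because the inverse normal CDF is undefined at 0.

```python
import random
from statistics import NormalDist


def uniform_sample(rng, lower, upper):
    # Same form as RAND()*(upper limit - lower limit) + lower limit
    return rng.random() * (upper - lower) + lower


def normal_sample(rng, mean, sd):
    # Same idea as NORMINV(RAND(), mean, sd)
    u = rng.random()
    while u == 0.0:               # inv_cdf requires 0 < u < 1
        u = rng.random()
    return NormalDist(mean, sd).inv_cdf(u)


rng = random.Random(7)
finals = [uniform_sample(rng, 30, 185) for _ in range(500)]
midterms = [normal_sample(rng, 185, 30) for _ in range(500)]
totals = [m + f for m, f in zip(midterms, finals)]
print(min(finals), max(finals))
```

Note that the subtraction is grouped before the lower limit is added, which keeps every uniform sample between the two limits.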
The complete calculation process is presented in Excel file named Chapter 5 with
the spreadsheet named Example 5.3.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code 5.2
**************************************************************
*********start of coding**********************************************************
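Sub example3()
    ' Clear the previous results
    Range("M12:P60000").ClearContents
    n = InputBox("n", "n")
    Range("k5").Value = n
    For i = 1 To n
        Range("m12").Cells(i, 1) = i
        ' Normal random number with mean 185 and standard deviation 30
        Range("n12").Cells(i, 1).Value = _
            Application.WorksheetFunction.NormInv(Rnd, 185, 30)
        ' Uniform random number between 30 and 185:
        ' Rnd * (upper limit - lower limit) + lower limit
        Range("o12").Cells(i, 1).Value = Rnd() * (185 - 30) + 30
    Next i
End Sub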
*****************************************************end of coding***************
This macro uses four key techniques. The role of each technique in this macro is
detailed as follows:
1. For -- Next (loop)
This structure is used to loop n times to generate the random numbers.
2. ClearContents
This statement is used to clear the range of cells from M12 to P60000.
3. InputBox
The input box is used to enter the number of trials you want to simulate.
4. Rnd
The VBA built-in Rnd function returns a random number greater than or equal to 0
and less than 1. You can use the function Rnd directly to generate random numbers
following the uniform distribution; it is unnecessary to prefix it with
"Application.WorksheetFunction."
*********************************************************************
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To find the probability that the sum of these two random variables is larger
than 300, the sample points and the sample space should be obtained. The function
COUNTIF can be used to count the sample points satisfying T > 300. If
the total score is larger than 300, the indicator value is 1; otherwise, it is
0.
Hence, the probability can be obtained: Pr(T > 300) = 135/500 = 0.27. The
complete calculation process is presented in the Excel file named Chapter 5, in the
spreadsheet named Example 5.3.
If you read carefully, you will find that nearly all
probability textbooks mention the Central Limit Theorem.
However, the theoretical proof is often skipped in these textbooks, with a footnote
such as this instead: "Although we do not concern ourselves here with why the Central
Limit Theorem works, you need to understand why the veracity of this theorem is so
important." This not only makes the Central Limit Theorem less
comprehensible and more mysterious to students, but also undermines the purpose of
college education by encouraging surface learning. In this situation, MCS provides a
numerical simulation technique to demonstrate the Central Limit Theorem.
The Central Limit Theorem is one of the most remarkable theorems in
probability and statistics. Its essence is as follows: when the sample size n is
large enough (say n > 30), regardless of the particular distribution type of X, the
sample mean $\bar{X}$ follows approximately a normal distribution with mean $\mu_X$ and
standard deviation $\sigma_X/\sqrt{n}$.
Definition
Central Limit Theorem: For any population with mean $\mu_X$ and standard deviation
$\sigma_X$, the distribution of the sample means for sample size n will have a mean of
$\mu_X$ and a standard deviation of $\sigma_X/\sqrt{n}$, and will approach a normal
distribution as n increases:
$$\bar{X} \sim N(\mu_X, \sigma_X/\sqrt{n}) \qquad (5.1)$$
MCS can be used to demonstrate the Central Limit Theorem numerically, and the detailed
procedures are shown in Example 5.4.
Example 5.4
From Example 4.4, we have already concluded that the normal
distribution is an appropriate model for the students' midterm scores. In this
example, the mean and standard deviation are equal to 73.92 and 12.22, respectively.
Suppose the sample size is equal to 100 and the number of simulations is equal to 80;
demonstrate whether the sample mean follows N(73.92, 12.22/10).
[Solution]
Draw the random numbers
The functions RAND and NORMINV can be combined to generate
random numbers following N(73.92, 12.22):
= NORMINV(RAND(), 73.92, 12.22)
The summary statistics are shown in Table 5.2. The complete calculation
process is presented in the Excel file named Chapter 5, in the spreadsheet named
Example 5.4.
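The same experiment can be sketched in Python: 80 simulations, each averaging 100 random numbers drawn from N(73.92, 12.22). The constant names and the seed are hypothetical, and random.gauss replaces the NORMINV(RAND(), ...) formula. The statistics of the 80 sample means are then compared with the value σ/√n predicted by the Central Limit Theorem.

```python
import random
from statistics import mean, stdev

MU, SIGMA = 73.92, 12.22
SAMPLE_SIZE, N_SIMULATIONS = 100, 80

rng = random.Random(54)
sample_means = [
    mean(rng.gauss(MU, SIGMA) for _ in range(SAMPLE_SIZE))
    for _ in range(N_SIMULATIONS)
]

print("mean of sample means:", round(mean(sample_means), 2))
print("std of sample means: ", round(stdev(sample_means), 2))
print("predicted sigma/sqrt(n):", round(SIGMA / SAMPLE_SIZE ** 0.5, 3))  # 1.222
```

With only 80 simulations, the simulated mean and standard deviation are expected to be close to, but not exactly equal to, 73.92 and 1.222.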
Figure 5.15 shows the distribution of the students' scores. According to Figure 5.15,
we can observe that the mean of the students' scores approximately follows the
normal distribution.
[Figure: distribution of the simulated sample means; horizontal axis: scores, from
71.00 to 78.00]
mean value $\mu_{\bar{X}} = 73.92$, and standard deviation is equal to
$\sigma/\sqrt{n} = 12.22/\sqrt{100} = 1.22$.
After simulation, the mean and standard deviation can be calculated as µ = 73.85 and
According to the Central Limit Theorem, this holds not just for the normal
distribution: the sample mean $\bar{X}$ of variables following other distributions, such
as the uniform and gamma distributions, also approximately follows the normal
distribution. Example 5.5 demonstrates this.
Example 5.5
Suppose that µ = 100 and σ = 10. Using these two parameters, simulate and
test whether the sample means follow the normal distribution when the random
variable X follows the uniform and normal distributions, respectively.
[Solution]
In VBA programming, you can use Excel's worksheet functions directly in your
VBA code, which is convenient and fast. In this
example, the functions AVERAGE, STDEV, and NORMINV are
used to obtain the random variables and the essential statistics.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code 5.3
**************************************************************
‘Define variables:
*********start of coding************************************************
Sub uniform_()
    ' 1. Read the inputs
    mn = Range("h11").Value
    sd = Range("h12").Value
    sz = Range("h9").Value ' sample size
    n = Range("h10").Value ' the number of simulation
    ' 2. Limits a and b of the uniform distribution that has
    '    mean mn and standard deviation sd
    b = mn + sd / 2 * 12 ^ 0.5
    a = 2 * mn - b
    Range("h21:h50000").ClearContents
    ' 3. Simulate n sample means, each from sz uniform random numbers.
    '    The array is indexed from 1 so that no empty element s(0)
    '    enters the statistics below.
    ReDim s(1 To n)
    For j = 1 To n
        aa = 0
        For i = 1 To sz
            aa = aa + Rnd * (b - a) + a
        Next i
        s(j) = aa / sz
        Range("G21").Cells(j, 1).Value = j
        Range("G21").Cells(j, 2).Value = s(j)
    Next j
    ' 4. Mean and standard deviation of the sample means
    Range("h15").Value = Application.WorksheetFunction.Average(s)
    Range("h16").Value = Application.WorksheetFunction.StDev(s)
    ' Calculate the values that follow the uniform distribution using the
    ' original parameters, which can be used to draw the figure.
    Range("o21:p1000").ClearContents
    ' 5. Corner points of the uniform PDF
    Range("o21").Value = a
    Range("o22").Value = a
    Range("o23").Value = b
    Range("o24").Value = b
    Range("p21").Value = 0
    Range("p22").Value = 1 / (b - a)
    Range("p23").Value = 1 / (b - a)
    Range("p24").Value = 0
End Sub
**************************************end of coding*************
Comments
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
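The two lines of Code 5.3 that compute the limits b and a follow from the mean and standard deviation of a uniform distribution on [a, b]. A short derivation, with the numbers of Example 5.5 (µ = 100, σ = 10) substituted at the end, is:

```latex
\mu = \frac{a + b}{2}, \qquad \sigma = \frac{b - a}{\sqrt{12}}
\;\;\Longrightarrow\;\;
b = \mu + \frac{\sigma\sqrt{12}}{2}, \qquad a = 2\mu - b = \mu - \frac{\sigma\sqrt{12}}{2},
\\[4pt]
\mu = 100,\ \sigma = 10:\qquad
b = 100 + 10\sqrt{3} \approx 117.32, \qquad a \approx 82.68 .
```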
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code 5.4
**************************************************************
‘Define variables:
*********start of coding************************************************
Sub normal_()
    ' Read the inputs
    mn = Range("h11").Value
    sd = Range("h12").Value
    sz = Range("h9").Value ' sample size
    n = Range("h10").Value ' the number of simulation
    Range("h21:h50000").ClearContents
    ' Simulate n sample means, each from sz normal random numbers.
    ' The array is indexed from 1 so that no empty element s(0)
    ' enters the statistics below.
    ReDim s(1 To n)
    For j = 1 To n
        aa = 0
        For i = 1 To sz
            aa = aa + Application.WorksheetFunction.NormInv(Rnd, mn, sd)
        Next i
        s(j) = aa / sz
        Range("G21").Cells(j, 1).Value = j
        Range("G21").Cells(j, 2).Value = s(j)
    Next j
    ' Mean and standard deviation of the sample means
    Range("h15").Value = Application.WorksheetFunction.Average(s)
    Range("h16").Value = Application.WorksheetFunction.StDev(s)
    ' Tabulate the normal PDF at 100 points from mn - 3*sd to mn + 3*sd,
    ' which can be used to draw the figure.
    Range("o21:p1000").ClearContents
    n = 100
    For i = 1 To n
        kk = mn - 3 * sd + 6 * sd / n * i
        Range("o21").Cells(i, 1).Value = kk
        Range("o21").Cells(i, 2).Value = _
            Application.WorksheetFunction.NormDist(kk, mn, sd, False)
    Next i
End Sub
***********************************************end of coding***********
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code 5.4 is very similar to Code 5.3; the only change is the distribution
that the random numbers follow. Therefore, we do not explain the code in detail here.
2. Histogram
The second step is to plot the histogram. Before plotting the histogram, the
relevant data, such as the class boundaries and bins, should be prepared using VBA.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code 5.5
**************************************************************
*******start of coding******************************************
Sub histo_()
    Range("k21:m1000").ClearContents
    ' 1. Read the inputs
    bsize = Range("l11").Value
    sd = Range("h14").Value
    mn = Range("h13").Value
    k = Range("h10").Value
    ' 2. Define the parameters used in the further calculation
    '    (the definitions of n and lowbp are not shown here)
    ' 3. Size the arrays
    ReDim ll(n)
    ReDim uu(n)
    ReDim freq(n)
    ' 4. Count the observations in each class from ll(i) up to uu(i)
    For i = 1 To n
        ll(i) = lowbp + bsize * (i - 1)
        uu(i) = lowbp + bsize * (i)
        freq(i) = Application.WorksheetFunction.CountIf( _
            Range("h21:h" & k + 500), ">=" & ll(i)) - _
            Application.WorksheetFunction.CountIf( _
            Range("h21:h" & k + 500), ">=" & uu(i))
    Next i
End Sub
***********************************************end of coding*******
Comments
1. Read the inputs.
2. Define the parameters that are used in the further calculation.
3. ReDim statement: the ReDim statement is used to size or resize a dynamic array.
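The class counting in Code 5.5 relies on the difference of two COUNTIF results: the number of values at or above the lower boundary minus the number at or above the upper boundary. A Python sketch of the same idea is given below; the function name histogram_counts, its parameters, and the data are hypothetical.

```python
def histogram_counts(values, low, bin_size, n_bins):
    """Count the values falling in each class [lower, upper)."""
    counts = []
    for i in range(1, n_bins + 1):
        lower = low + bin_size * (i - 1)
        upper = low + bin_size * i
        at_least_lower = sum(1 for v in values if v >= lower)  # COUNTIF(range, ">=" & ll)
        at_least_upper = sum(1 for v in values if v >= upper)  # COUNTIF(range, ">=" & uu)
        counts.append(at_least_lower - at_least_upper)
    return counts


# Hypothetical sample means, used only to illustrate the counting.
sample_means = [71.2, 72.5, 73.1, 73.9, 74.0, 74.4, 75.8, 76.3]
print(histogram_counts(sample_means, low=71.0, bin_size=2.0, n_bins=3))  # [2, 4, 2]
```

Because both counts use ">=", a value that lies exactly on a boundary is assigned to the class that starts at that boundary.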
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Figures 5.16 and 5.17 show the distributions of $X$ and $\bar{X}$ for the uniform and
normal distributions, respectively.
From Figures 5.16 and 5.17, we can conclude that no matter which
distribution X follows, $\bar{X}$ follows the normal distribution.
You can run the simulation by selecting one of the probability distributions listed
in the combo box, and a series of random sample means are then numerically sampled,
with both text and graphical outputs instantly given on the spreadsheet.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code 5.6
**************************************************************
*****start of coding********************************************
Sub run_Chapter5()
a = Range("b4").Value
fname = ActiveWorkbook.Name
If a = 1 Then
Application.Run "'" & fname & "'!uniform_"
ElseIf a = 2 Then
Application.Run "'" & fname & "'!normal_"
End If
End Sub
**************************************end of coding********************
Comments
Application.Run is used to call the macro for the distribution selected in cell B4.
*********************************************************************
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can see a form control in the Excel spreadsheet named Example 5.5. The
reason for using controls on a worksheet is to make it easier for the user to provide
input. In this experiment, you may not have to create any macros for the control itself,
because you can link a control to a worksheet cell. You can access the form controls by
choosing Developer → Controls → Insert. In this example, a combo box is used as the
form control. We will introduce the essentials of the controls in Chapter 7.
Excel functions which are used in this chapter are summarized in Table 5.3. The
functions RAND and RANDBETWEEN are used to generate random numbers; the
function NORMINV is used to transform the random numbers to follow the normal
distribution in this chapter.
6.1 Introduction
In the previous chapters, we observed that once we know the probability functions
(PDFs or CDFs) and the values of the parameters such as the mean and variance, we
can obtain the probability of an event. The process of estimating the parameters and
obtaining the appropriate distributions is based on the available observational data.
To estimate the parameters and infer the appropriate distribution of a population,
the ideal way is to investigate every single point in the population. However, it is
usually difficult or impossible to investigate the entire group. Alternatively, we may
examine only a small part of the population, which is called a sample. The process of
obtaining the sample is called sampling.
The features of a population can be inferred from samples that are drawn from
that population. The process of inferring the features of a population from the results
found in the sample is known as statistical inference. For instance, 100 students are
randomly chosen to estimate the average height of UST's students, and 20 toys are
randomly selected to estimate the proportion of defective toys in a batch. The 100
students and 20 toys here are the samples that are randomly chosen to infer the
population features.
In this chapter, the fundamentals of point and interval estimation are
introduced, together with some relevant Excel functions.
estimation uses many samples to infer the true value of the parameter, which is
denoted by the Greek letter θ.
Definition
Estimator: The equation used to estimate the population parameter θ .
Point estimate: A single number that can be used as a sensible value for θ .
For instance, students at UST are randomly chosen and their heights are measured;
the sample size is n = 3, with x1 = 165 cm, x2 = 170 cm, and x3 = 172 cm. The
sample mean is (165 + 170 + 172)/3 = 169 cm. In this question, the estimator used
to obtain the point estimate of µ is $\bar{X}$, and the point estimate is the value
$\bar{x}$ = 169 cm.
Unbiased Estimator
Most of the time, there is more than one possible estimator. Reconsider the height
example above; the estimator $\tilde{X}$ can also be used to estimate µ, where the
estimate $\tilde{x} = (165 + 172)/2 = 168.5$ cm. Unbiasedness is the most important
factor when choosing an estimator. In addition, consistency, efficiency, and sufficiency
are also important features in judging the goodness of an estimator.
Definition
Unbiased estimator: A point estimator $\hat{\theta}$ is said to be an unbiased estimator
of θ if $E(\hat{\theta}) = \theta$ for every possible value of θ. If $\hat{\theta}$ is
biased, the difference $E(\hat{\theta}) - \theta$ is called the bias of $\hat{\theta}$.
Suppose there are two estimators, $\hat{\theta}_1 = \bar{X} - 2$ and
$\hat{\theta}_2 = \bar{X}$, where $\bar{X}$ is the sample mean. We obtain
$E(\hat{\theta}_1) = E(\bar{X} - 2) = \mu - 2$ and $E(\hat{\theta}_2) = E(\bar{X}) = \mu$.
According to the definition of the unbiased estimator, if the expected value of an
estimator is equal to the parameter, the estimator is said to be unbiased. The point
estimator $\hat{\theta}_1$ is a biased estimator because
$E(\hat{\theta}_1) = \mu - 2 \neq \mu$, and the point estimator $\hat{\theta}_2$ is an
unbiased estimator because $E(\hat{\theta}_2) = \mu$.
After introducing the essentials of point estimation, we describe one
important method that can be used to obtain point estimates: the method of
moments. The parameters of a distribution may be determined by first estimating the
mean and variance of the random variable; this process is the basis of the method of
moments.
For sample mean,
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (6.1)$$
For sample variance,
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (6.2)$$
The following section demonstrates whether the sample mean $\bar{X}$ and variance $S^2$
are unbiased estimators.
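[Proof]
$$E(\bar{X}) = E\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)
= \frac{1}{n}\{E(x_1) + E(x_2) + \dots + E(x_n)\}
= \frac{1}{n}(\mu + \mu + \dots + \mu) = \mu$$
Therefore, $\bar{X}$ is an unbiased estimator.
Suppose another estimator $\theta = \frac{1}{n-1}\sum_{i=1}^{n} x_i$. Then
$$E(\theta) = E\left(\frac{1}{n-1}\sum_{i=1}^{n} x_i\right) = \frac{n\mu}{n-1}$$
which is not equal to µ in general, so this estimator is biased.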
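[Proof]
For any rv Y, $V(Y) = E(Y^2) - [E(Y)]^2$, so $E(Y^2) = V(Y) + [E(Y)]^2$.
Applying this to
$$S^2 = \frac{1}{n-1}\left\{\sum X_i^2 - \frac{\left(\sum X_i\right)^2}{n}\right\}$$
gives
$$\begin{aligned}
E(S^2) &= \frac{1}{n-1}\left\{\sum E(X_i^2) - \frac{1}{n}E\left[\left(\sum X_i\right)^2\right]\right\} \\
&= \frac{1}{n-1}\left\{\sum(\sigma^2 + \mu^2) - \frac{1}{n}\left\{V\left(\sum X_i\right) + \left[E\left(\sum X_i\right)\right]^2\right\}\right\} \\
&= \frac{1}{n-1}\left\{n\sigma^2 + n\mu^2 - \frac{1}{n}n\sigma^2 - \frac{1}{n}(n\mu)^2\right\} \\
&= \frac{1}{n-1}\{n\sigma^2 - \sigma^2\} = \sigma^2
\end{aligned}$$
Therefore, $S^2$ is an unbiased estimator of $\sigma^2$.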
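The result of the proof can also be checked numerically. The Python sketch below draws many small samples from a normal population and averages two versions of the sample variance, one dividing by n − 1 and one dividing by n. The population parameters, the sample size, the number of trials, and the seed are all hypothetical choices made for illustration.

```python
import random
from statistics import mean

SIGMA = 2.0      # hypothetical population standard deviation (variance = 4)
N = 5            # hypothetical sample size
TRIALS = 20000   # number of simulated samples

rng = random.Random(61)
with_n_minus_1 = []
with_n = []
for _ in range(TRIALS):
    sample = [rng.gauss(0.0, SIGMA) for _ in range(N)]
    x_bar = sum(sample) / N
    ss = sum((x - x_bar) ** 2 for x in sample)   # sum of squared deviations
    with_n_minus_1.append(ss / (N - 1))          # Eq. 6.2
    with_n.append(ss / N)                        # divisor n instead of n - 1

avg_unbiased = mean(with_n_minus_1)
avg_biased = mean(with_n)
print(round(avg_unbiased, 2), round(avg_biased, 2), SIGMA ** 2)
```

The average of the n − 1 version settles near the population variance, while the version with divisor n settles near (n − 1)/n times the population variance.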
Example 6.1
[Solution]
Hand Calculation
According to Eqs. 6.1 and 6.2, the point estimates of the mean and variance are
$$\bar{x} = \frac{8.2 + 3.4 + \dots + 10.2}{25} = 7.2, \qquad
s^2 = \frac{(8.2 - 7.2)^2 + (3.4 - 7.2)^2 + \dots + (10.2 - 7.2)^2}{25 - 1} = 4.66.$$
The complete solution procedure can be seen in the Excel file named Chapter 6, in
the spreadsheet named Example 6.1.
Even though hand calculation can be used to obtain the mean and
variance, the calculation process is tedious. Fortunately, Excel contains functions
Excel Solution
The function VAR is used when calculating the variance for a sample. However,
when calculating the variance for an entire population, the function VARP is essential.
In this example, if we use the function VARP, s12 = VARP (J11:J35) = 7.98, and
comparing to the value obtained from the function VAR, the difference is significant.
However, if we enlarge the sample size, the gap becomes narrow. For instance, recall
Example 5.2. s 2 = VAR(H21:H520) = 25.28 and s12 = VARP (H21:H520) = 25.23,
126 6: Statistical Inferences from Observational Data
these two results are similar. The functions STDEV and STDEVP follow the same
pattern and have a similar syntax.
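The relationship between the two variance functions can also be checked outside Excel. The following is a minimal Python sketch, assuming a small made-up sample (the values are hypothetical and are not the data of Example 6.1). `statistics.variance` divides by n − 1, like VAR, and `statistics.pvariance` divides by n, like VARP.

```python
import statistics

# Hypothetical sample, used only for illustration.
sample = [8.2, 3.4, 7.9, 6.1, 10.2]
n = len(sample)

sample_var = statistics.variance(sample)       # divides by n - 1, like VAR
population_var = statistics.pvariance(sample)  # divides by n, like VARP

# The two results differ only by the factor (n - 1) / n,
# which approaches 1 as the sample size grows.
ratio = population_var / sample_var

print(f"VAR-style variance:  {sample_var:.4f}")
print(f"VARP-style variance: {population_var:.4f}")
print(f"ratio = {ratio:.4f}, (n - 1) / n = {(n - 1) / n:.4f}")
```

Because the ratio is always (n − 1)/n, the gap between the two functions shrinks as n increases, which is what the comparison with Example 5.2 illustrates.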
From Example 6.1, we have obtained $\bar{x} = 7.2$. However, it is almost never the case
that $\bar{x} = 7.2 = \mu$ exactly. Because of sampling variability, the sample mean is not
the same from one sample to the next. For instance, recall Example 6.1: if other
groups of samples are randomly chosen from the population, the results are as
shown in Table 6.1.
From Table 6.1, we can observe that the sample means differ slightly from one
another, and there is no evidence to show which sample mean is closest to µ.
In reality, the true mean of the population does exist and is a fixed number, but we
do not know exactly what it is. Most of the time, the estimated mean is not exactly
the same as the true mean.
One limitation of point estimation is that the estimate is a single number,
and it says nothing about how close it might be to µ. Rather than a point estimate, it
is sometimes more valuable to specify an interval within which the true mean is
likely to lie. For instance, you have a better chance of being right when you say that
the average height of the students is between 168 cm and 174 cm than when you give
a single estimate of 171 cm.
Before calculating the confidence interval, the confidence level, which is a
measure of the degree of reliability of the interval, should be selected. The most
frequently used confidence levels are 90%, 95% and 99%. The higher the confidence
level, the more strongly we believe that the value of the parameter being estimated
lies within the interval.
6.3.1 Confidence Interval for the Mean with a Known Population Variance
Any desired confidence level can be achieved by choosing the corresponding
z critical value. Figure 6.1 shows a probability of 1 − α that is achieved by using a
critical value $z_{\alpha/2}$. For example, if the confidence level is 95%, the critical
value $z_{\alpha/2}$ is equal to 1.96.
Figure 6.1 $P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = 1 - \alpha$
Example 6.2 illustrates the process to obtain the confidence interval for an
unknown mean with a known population variance.
Example 6.2
The canteen in UST conducts a survey named "Are You Satisfied with the Food in the
Canteen?" (suppose that the level of satisfaction ranges from 0 to 100). 100 students
are randomly chosen, and the mean and standard deviation are equal to 80 and 25,
respectively.
1) Find a 95 percent confidence interval estimate of the average satisfaction score of
all students in UST.
2) What sample size is necessary to ensure that the resulting 95 percent CI has a
width within 15?
[Solution]
1)
Based on the Central Limit Theorem, we know that $\bar{x} \sim N(80, 25/\sqrt{100})$. Under the
confidence level of 95%, the confidence interval for µ is:
$$\left(\bar{x} - z_{\alpha/2}\cdot\frac{\sigma}{\sqrt{n}},\; \bar{x} + z_{\alpha/2}\cdot\frac{\sigma}{\sqrt{n}}\right)$$
As $\bar{x}$, σ and n are given, the key step in estimating the confidence
interval is to obtain the z critical value. After that, the CI can be obtained from Eq. 6.3.
To obtain the z critical value, you can use either the cumulative standard
normal table or the functions in Excel.
Table Solution
Because the confidence level is 95%, the left-tail and right-tail areas can be
calculated as shown in Figure 6.2.
Figure 6.2 Illustration of a z critical value (central area 0.95; area 0.025 in each tail; cumulative probabilities 0.025 and 0.975 at the two critical values)
Table 6.2 shows the standard normal table, which can be used to obtain the z
critical value. From the table, $z_{0.975} = 1.96$. As the normal distribution is
symmetric, $z_{0.025} = -1.96$.
Excel Solution
The functions NORMSINV and NORMINV in Excel can also be used to obtain the z
value. For example, given a left-tail probability of 0.975, the corresponding z value is
equal to 1.96; given a left-tail probability of 0.025, the corresponding z value is equal
to -1.96.
Therefore, the 95% confidence interval for µ is:
$$\left(80 - 1.96 \times \frac{25}{\sqrt{100}},\; 80 + 1.96 \times \frac{25}{\sqrt{100}}\right) = (75.1, 84.9)$$
Hence, we are 95% confident that the mean value lies between 75.1 and 84.9.
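As a cross-check outside Excel, the interval in part 1) can be reproduced with a short Python sketch. It assumes, as the example does, that the standard deviation of 25 can be treated as the known σ; the function name `z_confidence_interval` is an illustrative choice and does not come from the text.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(mean, sigma, n, confidence=0.95):
    """Two-sided confidence interval for the mean when sigma is treated as known."""
    alpha = 1 - confidence
    # Left-tail inverse of the standard normal, like NORMSINV(1 - alpha/2).
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / sqrt(n)
    return mean - half_width, mean + half_width

low, high = z_confidence_interval(mean=80, sigma=25, n=100)
print(f"95% CI: ({low:.1f}, {high:.1f})")
```

With the values of Example 6.2 this prints an interval of (75.1, 84.9), matching the hand calculation.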
2)
6.3.2 Confidence Interval for a Normal Mean with an Unknown Variance
In Section 6.3.1, we obtained the confidence interval for µ under the condition that
the sample size is large or the population σ is given. However, when the sample size n
is small, the Central Limit Theorem can no longer be invoked. Under this condition, a
new family of probability distributions, called the t-distribution, is introduced.
The t-distribution was investigated by William S. Gosset, a chemist and statistician,
who noticed that the usual statistical practice in his daily work produced small errors
when the sample size was small. Gosset published the result in 1908 and signed it
“Student,” because the company he worked for had a policy that employees were not
permitted to publish under their own names. As a result, his name is almost unknown
outside the statistical field.
The t-distribution is a probability distribution that is used to estimate population
parameters when the sample size is small and/or when the population variance is
unknown. The t-test is very useful for handling small samples in the quality control area.
Definition
When $\bar{X}$ is the mean of a random sample of size n from a normal distribution with
mean µ, the rv
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \qquad (6.4)$$
has a probability distribution called a t-distribution with n − 1 degrees of freedom, and
the number of degrees of freedom is denoted by the Greek letter ν.
The t-distribution has several features. For instance, the t-distribution is
bell-shaped and symmetric about the origin, just like the standard normal distribution;
each t-distribution curve is more spread out than the standard normal curve; and the
spread of the t-curve decreases as ν increases.
Confidence Interval for µ: Let $\bar{x}$ and s be the sample mean and sample standard
deviation. Then a 100(1 − α)% confidence interval for µ is:
$$\left(\bar{x} - t_{\alpha/2,\,n-1}\cdot\frac{s}{\sqrt{n}},\; \bar{x} + t_{\alpha/2,\,n-1}\cdot\frac{s}{\sqrt{n}}\right) \qquad (6.5)$$
Function TDIST
TDIST gives the probability of being in the right tail, Pr(X > x), or of being in the
two tails, Pr(|X| > x).
Syntax
= TDIST(x, deg_freedom, tails)
tails: 1 = one-tailed, 2 = two-tailed
Function TINV
TINV returns the t value for a given probability of being in both tails, that is, the inverse of the two-tailed probability.
Syntax
= TINV(probability, deg_freedom)
Example 6.3 illustrates the process to obtain the confidence interval for an
unknown mean with an unknown population variance.
Example 6.3
Continuing Example 6.2, the sample size is decreased to 10, and the sample
mean and S.D. are $\bar{x} = 80$ and s = 25, respectively. What is the interval estimate
of the average score at the 95% confidence level?
[Solution]
Table Solution
You can make your own t table using the function TINV, and the process is
similar to creating the normal table. As we have already described the process of
creating the standard normal table, we omit the specific procedure for creating the t
table.
In Table 6.3, the second row corresponds to the different values of α, and the
first column corresponds to the values of ν. The function TINV is used to obtain the t
critical value. You can choose different values of α and ν as you like, and draw a
table like Table 6.3.
Excel Solution
Note that the function TINV is two-tailed. This may be inconvenient, as
TINV only considers the inverse of the probability in two tails; VBA can be used
to turn it into a right-tail or left-tail function. The code is shown as follows:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code: 6.1
'*********************************************************************
'Purpose: Right-tail version of TINV. Returns the critical value t
'         such that Pr(X > t) = pb.
'Define variables:
'  pb - right-tail probability
'  df - degrees of freedom
'*****start of coding*************************************************
Public Function tinv_right(pb, df)
    ' TInv is two-tailed, so pass twice the one-tail probability.
    tinv_right = Application.WorksheetFunction.TInv(2 * pb, df)
End Function
'******************************************end of coding**************
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code: 6.2
'*********************************************************************
'Purpose: Left-tail version of TINV. Returns the critical value t
'         such that Pr(X < t) = pb.
'Define variables:
'  pb - left-tail probability
'  df - degrees of freedom
'**********start of coding********************************************
Public Function tinv_left(pb, df)
    ' Mirror the right-tail critical value about zero.
    tinv_left = (Application.WorksheetFunction.TInv(2 * pb, df)) * -1
End Function
'********************************************end of coding************
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
After obtaining the t critical value, $t_{0.025,9} = 2.262$, a 95 percent confidence interval for µ is:
$$\left[80 - 2.262 \times \frac{25}{3.16},\; 80 + 2.262 \times \frac{25}{3.16}\right] = (62.1, 97.9)$$
Hence, we are 95% confident that the mean value lies between 62.1 and 97.9.
In Sections 6.3.1 and 6.3.2, we described the method for inferring the confidence
interval for an unknown mean. Besides the unknown mean, we might also be
interested in the spread of a population based on a sample. For example, besides
knowing the sample mean of the students’ grades, we might also want to know the
variation in their grades.
Before developing a confidence interval for the variance, we need another
distribution, called the chi-squared distribution. Suppose a random variable follows
the normal distribution with population variance $\sigma^2$ and we obtain the sample
variance $S^2$. Define the $\chi^2$ statistic as $\chi^2 = (n-1) \times S^2/\sigma^2$; then $\chi^2$ can be
viewed as a random variable following the chi-squared distribution.
Definition
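Let $X_1, \ldots, X_n$ be a sample from a normal distribution having the unknown parameters
µ and $\sigma^2$. Then the rv
$$\chi^2 = \frac{(n-1) S^2}{\sigma^2} \qquad (6.6)$$
has a chi-squared distribution with n − 1 degrees of freedom.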
CHIINV returns the inverse of the right-tail probability of the chi-squared distribution.
Syntax
= CHIINV(probability, deg_freedom)
Example 6.4
[Solution]
1)
The functions AVERAGE and VAR in Excel can be used to obtain the sample
mean $\bar{x}$ and the sample variance $s^2$:
$\bar{x} = 0.146$, $s^2 = 0.04$
2)
Since the confidence level and n are given, the critical values can be obtained from
the $\chi^2$ table or from the function CHIINV in Excel.
Table Solution
You can make your own $\chi^2$ table using the function CHIINV. The function
CHIINV is used to obtain the $\chi^2$ critical value. You can choose different values of
α and ν as you like, and draw a table like Table 6.4.
$$\chi^2_{right} = \chi^2_{0.05/2,\,14} = 26.119, \qquad \chi^2_{left} = \chi^2_{1-0.05/2,\,14} = 5.629$$
Excel Solution
The function CHIINV in Excel can also be used to obtain the $\chi^2$ value. For
example, given a confidence level of 0.95 and ν = 14, the corresponding right-tail
critical value is 26.119, and the left-tail critical value is 5.629 (shown in Figure 6.3).
According to Eq. 6.6, $\chi^2 = \frac{(n-1) S^2}{\sigma^2}$:
The left end point is equal to $\frac{(n-1) \times S^2}{\chi^2_{0.025,14}} = \frac{14 \times 0.04}{26.119} = 0.021$
The right end point is equal to $\frac{(n-1) \times S^2}{\chi^2_{0.975,14}} = \frac{14 \times 0.04}{5.629} = 0.099$
Hence, the 95 percent two-sided confidence interval for $\sigma^2$ lies between
0.021 and 0.099.
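The arithmetic of the two end points can be checked with a small Python sketch. It does not compute the chi-squared critical values itself; it simply takes the values 26.119 and 5.629 quoted above as inputs, with n = 15 inferred from the 14 degrees of freedom. The function name is an illustrative assumption.

```python
def variance_confidence_interval(s2, n, chi2_right, chi2_left):
    """Confidence interval for sigma^2 based on chi^2 = (n - 1) * S^2 / sigma^2.

    chi2_right and chi2_left are the right-tail and left-tail critical
    values, supplied by the caller (for example from CHIINV or a table).
    """
    df = n - 1
    return df * s2 / chi2_right, df * s2 / chi2_left

# s^2 = 0.04 and the critical values are taken from the worked example;
# n = 15 is inferred from the 14 degrees of freedom.
low, high = variance_confidence_interval(0.04, 15, 26.119, 5.629)
print(f"95% CI for the variance: ({low:.3f}, {high:.3f})")
```

Note that the larger critical value produces the lower end point, because the critical value appears in the denominator.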
When comparing two variances, we often ask whether one sample variance being
significantly larger than another indicates that one population is more variable
than the other. To answer this question, we introduce another family of probability
distributions called the F-distribution. Mathematically, the F-distribution is related to
the ratio of two chi-squared distributions. The symbol F was chosen in honor of the
great statistician and geneticist Sir Ronald A. Fisher, who found the density of the
central F-distribution. The F-distribution is widely used in statistical inference,
especially in the analysis of variance.
Definition
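Let $X_1, \ldots, X_m$ be a random sample from a normal distribution with variance $\sigma_1^2$,
let a second, independent random sample come from a normal distribution with
variance $\sigma_2^2$, and let $S_1^2$ and $S_2^2$ denote the two sample variances.
Then the rv
$$F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \qquad (6.7)$$
has an F-distribution.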
6.8
For instance, $F_{0.05}(5,2) = \frac{1}{F_{0.95}(2,5)} = 0.198$
Function FDIST
The function FDIST calculates the right-tail probability of the F-distribution, which
measures the degree of diversity between two data sets.
Syntax
= FDIST(x, deg_freedom1, deg_freedom2)
Function FINV
FINV is the function that returns the inverse of the right-tail F-probability distribution.
It is used to compare two data sets and find out how much their variability
differs.
Syntax
= FINV(probability, deg_freedom1, deg_freedom2)
probability: right-tail probability
Example 6.5 illustrates the process to obtain the confidence interval for the
ratio of two population variances.
Example 6.5
On a calculus test, ten girls are randomly chosen, and their mean score and
standard deviation are equal to 64 and 10, respectively. At the same time, nine boys
are randomly chosen, and their mean score and standard deviation are equal to 60 and
15, respectively. Can we conclude, at the 95 percent confidence level, that the
performance of the boys is more stable on the calculus test?
[Solution]
The first step is to obtain the F critical value. You can use either the F table
or the functions in Excel.
Table Solution
You can draw your own F table using the function FINV. Pay attention to the
tails of the table when you check the F table. Take a probability of 0.05 as
an example to show how to draw an F table.
With the probability equal to 0.05, the first column corresponds to the number
of degrees of freedom for the variance in the numerator, and the second row
corresponds to the number of degrees of freedom for the variance in the denominator.
The function FINV is used to obtain the F critical value. You can choose different
values of ν1 and ν2 as you like, and draw a table like Table 6.5.
Excel Solution
Given the right-tail probability and the numbers of degrees of freedom, the
function FINV in Excel is used to obtain the critical values of F:
$F_{right}$ = FINV(0.025, 9, 8) = 4.36, $F_{left}$ = FINV(0.975, 9, 8) = 0.24
As the confidence interval for the ratio of the variances lies between 0.1 and 1.8
and therefore contains 1, we cannot conclude at the 95 percent confidence level that
the performance of the boys is more stable than that of the girls on the calculus test.
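The intermediate arithmetic behind the bounds 0.1 and 1.8 is not written out in the solution. A minimal Python sketch of one way to obtain them is given below, assuming the interval is formed by dividing the ratio of the sample variances by the two F critical values quoted above (4.36 and 0.24). This assumption reproduces the quoted bounds, but the function name and the layout are illustrative only.

```python
def variance_ratio_interval(s1, s2, f_right, f_left):
    """Interval for sigma1^2 / sigma2^2 from two sample standard deviations.

    f_right and f_left are the right-tail and left-tail F critical values,
    supplied by the caller (for example from FINV or an F table).
    """
    ratio = s1 ** 2 / s2 ** 2
    return ratio / f_right, ratio / f_left

# Girls: s = 10 (n = 10); boys: s = 15 (n = 9); critical values from the text.
low, high = variance_ratio_interval(s1=10, s2=15, f_right=4.36, f_left=0.24)
contains_one = low < 1 < high

print(f"Interval for the variance ratio: ({low:.2f}, {high:.2f})")
print(f"Interval contains 1: {contains_one}")
```

Because the interval contains 1, equal population variances cannot be ruled out, which is why no conclusion about stability can be drawn.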
In the fields of point and interval estimation, the widely used distributions are the
normal, Student's t, chi-squared, and F distributions. For these four distributions,
Excel provides the corresponding probability functions. Generally, the functions with
the suffix “DIST” return probabilities, and the functions with the suffix “INV” return
the inverse, that is, the value that corresponds to a given probability.
Let X be a random variable, x be a value of the random variable, and Pr be a
probability. The functions TDIST, CHIDIST and FDIST give the probability of being
in the right tail, Pr(X > x). The function TDIST can also provide a two-tail
probability, Pr(|X| > x). However, the functions NORMDIST and NORMSDIST give the
probability of being in the left tail, Pr(X < x), which is the opposite of the other
three distribution functions.
These Excel functions may be confusing at times. To make them more
user-friendly, they can be modified slightly using VBA macros.
The following code shows some examples of the modified functions.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code: 6.3
*********************************************************************
‘Define variables:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For instance, recall Example 3.1, in which µ = 140 and sd = 30 are given. The function
NEWNORMDIST can be used to obtain Pr(X > 160):
1 − NORMDIST(160, 140, 30, TRUE) = NEWNORMDIST(160, 140, 30) = 0.25
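The same right-tail probability can be checked outside Excel. The sketch below is a minimal Python version that assumes the right-tail probability is simply one minus the left-tail cumulative probability; the function name `normal_right_tail` is hypothetical and is not the VBA function from the text.

```python
from statistics import NormalDist

def normal_right_tail(x, mean, sd):
    """Right-tail probability Pr(X > x) for a normal random variable."""
    # cdf gives the left-tail probability, like NORMDIST with cumulative = TRUE.
    return 1 - NormalDist(mean, sd).cdf(x)

p = normal_right_tail(160, 140, 30)
print(f"Pr(X > 160) = {p:.2f}")
```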
The left-tail function of NORMINV in Excel can be changed into right-tail using
VBA macros. The codes are shown as follows:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code: 6.4
'*********************************************************************
'Define variables:
'  x - right-tail probability
'************start of coding******************************************
Public Function Newnorminv(x)
    ' NormInv is left-tailed, so convert the right-tail probability first.
    Newnorminv = Application.WorksheetFunction.NormInv((1 - x), 0, 1)
End Function
'**********************************end of coding *********************
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For instance, recall Example 6.2, in which $\bar{x} \sim N(80, 25/\sqrt{100})$ and α = 0.05 are
given. The z value can be obtained as NEWNORMINV(0.025) = 1.96, which is equal to
NORMINV(0.975, 0, 1) = 1.96.
In addition, as the function TDIST provides both the right-tail and the two-tail
probability, you need to choose the tails each time when applying it. To avoid this,
the function TDIST can be changed into a right-tail function using VBA macros, and
the code is shown as follows:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code: 6.5
*********************************************************************
‘Purpose: To recreate the function Tdist return only the one tail
probability Pr(X > x) ,so that the input arguments can be reduced
from three to two.
‘ Define variables:
Note: let the tail = 1, then TDIST returns only the one tail distribution, the input
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In the previous sections, Excel functions were entered directly into the cells.
Another way to insert a formula is to use the Function Library group on the Formulas
tab (shown in Figure 6.4).
If you do not remember which function you need, the Formulas tab is useful.
You can click a function category such as AutoSum or Financial, and the
functions in that category will be listed. If you forget the name of the function, you
can click Insert Function, and the function you need may be displayed in the dialog
box after you enter some descriptive words. In addition, if you want to know more
about a function, you can click the link called Help on this function (shown in
Figure 6.5).
In this chapter, the foundations of point and interval estimation were
introduced. To estimate unknown means, the Excel functions NORMINV and
TINV can be used. To estimate variances, the Excel functions CHIINV and FINV can
be used. Table 6.6 shows a summary of the Excel functions applied in this chapter.
Excel’s built-in functions can be changed using VBA macros. Table 6.7
summarizes the changed functions that are used in this chapter.
Testing of Hypotheses
7.1 Introduction
You can see the following advertisements in your daily life: a drug company may
claim that its pills last at least four hours; an automobile manufacturer may claim that
its new products are better than the older ones; a food company may claim that its
products are sugar free. To verify these kinds of statements, a technique referred to as
hypothesis testing can be used. Hypothesis testing is a statistical method that uses a
sample to verify statements about the corresponding population. Hypothesis
testing involves two contradictory hypotheses: the null hypothesis (H0) and the
alternative hypothesis (H1). The statistic computed from the sample is used to decide
whether H0 should be accepted or rejected.
In this chapter, the basic concepts and major procedures used in hypothesis
testing will be introduced, together with the relevant Excel functions. In addition, the
basic steps to create UserForms in Excel will be introduced as well.
At the beginning, we need to state exactly what we are testing, including the null
hypothesis (H0) and the alternative hypothesis (H1). The null hypothesis (H0) is the
initially favored claim, and the hypothesis that is contradictory to H0 is called the
alternative hypothesis (H1). For instance, a manufacturer claims that the lifetime of
its products is at least four years, and we can state the null hypothesis H0: µ = 4
and the alternative hypothesis H1: µ ≠ 4.
Definition
The null hypothesis, denoted by H0, is the claim that is initially assumed to be true.
The alternative hypothesis, denoted by H1, is the assertion that is contradictory to
H0.
Definition
A type I error occurs when a true null hypothesis is rejected.
A type II error occurs when a false null hypothesis is accepted.
The probabilities of the type I and type II errors are usually denoted by the Greek
letters α and β, respectively. The value of α is often referred to as the level of
significance. The significance level is usually equal to 0.01 or 0.05, where 1 − α = 0.99 or 0.95, and
The following is the typical sequence that you can follow when doing hypothesis
testing:
1. Set the null hypothesis (H0) and the alternative hypothesis (H1)
Generally, H0 is formulated as an equality, whereas H1 is normally an
inequality. In general, H0 and H1 can be set as follows:
This step is very important in hypothesis testing. H0 is usually
stated as an equality. For instance, instead of setting H0: µ > µ0 or H0: µ < µ0, we
usually state H0: µ = µ0. The reason is that a null hypothesis stated as an inequality
does not fix a single rejection region, which can make the decision to accept or
reject ambiguous.
To state the appropriate H1, one trick is to choose the expected or preferred
result. For instance, a drug company claims that its new drugs can last at least 4
hours, and the statement can be set as H0: µ = 4 and H1: µ > 4; a food company
claims that its products are sugar free, so that we can state H0: µ = 0 and H1: µ ≠ 0.
Critical Value
Figure 7.1 Regions of rejection and acceptance
Example 7.1
The manufacturer claims that the weight of its product is less than 300 g. The
customer organization randomly chooses 60 products and obtains a sample average
weight of $\bar{x} = 297.25$ g and a standard deviation of s = 2.3 g. Comment on the
company’s statement at the 0.05 level of significance.
[Solution]
As mentioned earlier, defining H0 and H1 is the first and key step in
hypothesis testing. H0 is easy to set, as it is usually stated as an equality. However,
setting H1 can sometimes be confusing. Suppose we do not know any
tricks for setting H1; let us try the three possible alternatives and find the appropriate one.
1. Set H0 and H1
H0: µ = 300
Figure 7.2 z-test with the left-tailed alternative (acceptance area 1 − α; rejection area to the left of the critical value -1.64; test statistic z = -1.81)
As z (-1.81) is inside the rejection region, H0 is rejected and H1 is accepted. The
company's statement that the weight of its product is less than 300 g is accepted.
As the testing procedure is very similar to the previous one, we will not
explain the details this time. H0 and H1 can be set as H0: µ = 300 and H1: µ ≠ 300.
In addition, the test statistic z calculated from the sample is unchanged and is equal to
-1.81.
Figure 7.3 shows the regions of acceptance and rejection. This is a two-tailed test
with α = 0.05 and a critical value of $z_{\alpha/2} = 1.96$. The rejection region is z > 1.96 or z < -1.96.
[Figure 7.3: acceptance area (1 − α) in the center, with a rejection region of area α/2 in each tail]
z (-1.81) is outside the rejection region, so H0 is accepted. We cannot conclude
at the 5% level of significance that the average weight differs from 300 g. Moreover,
in this example, we are only interested in whether the average weight is less than 300
g, so a two-tailed conclusion about whether the weight differs from 300 g is of little use to us.
As the testing procedure is very similar to the previous one, we will not
explain the details this time. H0 and H1 can be set as H0: µ = 300 and H1: µ > 300.
In addition, the test statistic z calculated from the sample is unchanged and is equal to
-1.81.
This is a right-tailed test with α = 0.05, and the rejection region is z > 1.64.
Clearly, z = -1.81 does not lie in the rejection region. Therefore, H0 can be accepted.
However, in this example, our purpose is to test the company’s statement that
its product’s weight is less than 300 g. When we set the alternative
H1: µ > 300, one outcome is to accept H0, from which we conclude that the
product’s weight is equal to 300 g; the other outcome is to reject H0 and accept
H1, from which we conclude that the product’s weight is larger than 300 g. No matter
whether H0 is accepted or rejected, neither outcome supports the company’s
statement, and the test becomes meaningless.
On the whole, defining the null and alternative hypotheses is the first and key
step in the process of hypothesis testing. Whether to use a one-tailed or two-tailed
alternative H1 depends on the situation. Generally, the expected result is set as the
alternative hypothesis.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code: 7.1
'*********************************************************************
'Define variables:
'  a - level of significance entered by the user
'  z - critical value of the standard normal distribution
'***Start Coding******************************************************
Sub norminvleft()
    a = InputBox("a", "level of significance")
    z = Application.WorksheetFunction.NormInv(a, 0, 1)
    z = Application.WorksheetFunction.Round(z, 3)
    MsgBox "Rejection region is z < " & z, vbOKOnly, "norminvleft"
End Sub

Sub norminvright()
    a = InputBox("a", "level of significance")
    z = -1 * Application.WorksheetFunction.NormInv(a, 0, 1)
    z = Application.WorksheetFunction.Round(z, 3)
    MsgBox "Rejection region is z > " & z, vbOKOnly, "norminvright"
End Sub

Sub norminvtwo()
    a = InputBox("a", "level of significance")
    ' NormInv of a / 2 is the negative (left-tail) critical value.
    z = Application.WorksheetFunction.NormInv(0.5 * a, 0, 1)
    z = Application.WorksheetFunction.Round(z, 3)
    MsgBox "Rejection region is z < " & z & " or z > " & -z, _
        vbOKOnly, "Norminvtwo"
End Sub
'***********************************************end of coding*********
This macro uses two key techniques. Their roles in this macro are
detailed as follows:
1) InputBox function
This function is useful for obtaining a single input. The prompt “a” is displayed in the
input box, and “level of significance” is displayed in the title bar.
2) Function ROUND
This function is used to round the result to a fixed number of decimal places (three in this macro).
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
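The logic of the three macros can also be sketched outside Excel. The following Python version is a minimal sketch, not the VBA code itself: it assumes the standard normal inverse from Python's `statistics` module in place of NORMINV, and the function name `z_rejection_region` and its `tail` argument are illustrative choices.

```python
from statistics import NormalDist

def z_rejection_region(alpha, tail):
    """Describe the z-test rejection region for a given significance level.

    tail must be "left", "right", or "two".
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "left":
        z = round(std_normal.inv_cdf(alpha), 3)
        return f"z < {z}"
    if tail == "right":
        z = round(std_normal.inv_cdf(1 - alpha), 3)
        return f"z > {z}"
    if tail == "two":
        # Split alpha equally between the two tails.
        z = round(std_normal.inv_cdf(1 - alpha / 2), 3)
        return f"z < {-z} or z > {z}"
    raise ValueError("tail must be 'left', 'right', or 'two'")

for tail in ("left", "right", "two"):
    print(tail, "->", z_rejection_region(0.05, tail))
```

Passing the tail as an argument plays the same role as choosing among the three separate macros.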
Using controls on a worksheet makes it easier for the user to provide input. You can
access them by choosing Developer → Controls → Insert. Figure 7.4 shows the controls
that appear after following these steps.
Excel offers two different sets of controls: Form controls and ActiveX controls. In
this example, we focus on Form controls.
Excel provides different kinds of Form controls, such as the
ScrollBar control, the OptionButton control, and the TextBox control. In Example
7.1, we show an example of using OptionButton controls, which allow a user to
select one of several options.
In Example 7.1, there are three possible forms of H1: two-tailed, left-tailed, and
right-tailed. You can click and drag an option button into the cells, and then
double-click it to rename the OptionButton. Figure 7.5 shows the Form controls
that can be used to ask the user for an option.
Figure 7.5 Form controls that ask the user for an option
After completing the Form controls, the next step is to link the controls to the
macros created earlier. To make the connection, right-click the button, choose the
item Assign Macro, and then assign the code to the corresponding control.
In Example 7.1, when the left-tailed option is chosen, the left-tailed VBA code
is executed. A dialog box asks you to enter the value of a (0.05 in
this question). After pressing OK, you obtain the rejection region, which is z < -1.645.
Figures 7.7 and 7.8 show the dialog boxes that are displayed by VBA’s InputBox and
MsgBox functions.
When the variance $\sigma^2$ is unknown and the sample size is small, the
t-distribution can be applied in hypothesis testing. The testing procedure is
similar to the one above; the only difference is that the statistic is changed from z to t.
Example 7.2
The manufacturer claims that its coffee machines provide a population mean
volume of 110 ml of coffee per cup with a standard deviation of 5 ml. The volume of
coffee per cup is assumed to have a normal distribution. For quality
control, the machine is checked periodically by randomly sampling 15 cups of coffee
each day. The mean value x = 107.0 ml and standard deviation s = 6.5 . Comments
[Solution]
As mentioned earlier, defining H0 and H1 is the first and key step in
hypothesis testing. In this example, H0 and H1 are stated as follows:
H0: µ = 110
H1: µ ≠ 110
To test the mean with a small sample size, the t-test can be chosen, and the test
statistic is calculated from the sample values as follows:
$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}} = \frac{107 - 110}{6.5/\sqrt{15}} = -1.79$$
This example is a two-tailed test with α = 0.05 and n = 15. The critical value
can be obtained from the Appendix Table A.2 or from the Excel function mentioned in
previous chapters:
$t_{0.025,14}$ = TINV(0.05, 14) = 2.145
The rejection region is T < -2.145 or T > 2.145. Since T = -1.79 is not inside the
rejection region, H0 is accepted. The manufacturer’s claim is accepted.
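The decision can be checked with a short Python sketch computed directly from the stated sample values ($\bar{x}$ = 107.0, s = 6.5, n = 15). Python's standard library has no t-distribution, so the two-tailed critical value of 2.145 for 14 degrees of freedom is entered as a constant rather than computed; the function name is an illustrative assumption.

```python
from math import sqrt

def t_statistic(sample_mean, mu0, s, n):
    """One-sample t statistic: (x_bar - mu0) / (s / sqrt(n))."""
    return (sample_mean - mu0) / (s / sqrt(n))

t = t_statistic(sample_mean=107.0, mu0=110, s=6.5, n=15)

# Two-tailed critical value for alpha = 0.05 and 14 degrees of freedom,
# entered as a constant (the value TINV(0.05, 14) returns in Excel).
t_critical = 2.145
reject_h0 = abs(t) > t_critical

print(f"t = {t:.2f}, reject H0: {reject_h0}")
```

Comparing the absolute value of t with the positive critical value is equivalent to checking both tails of the rejection region.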
Similarly, the function TINV can be recreated using VBA macros. By combining the
macros with the Form controls, you can choose the version you need
depending on the situation. The code is shown as follows:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code: 7.2
'*********************************************************************
'Purpose: To recreate the TINV function as right-tailed, left-tailed
'         and two-tailed versions. The functions InputBox and MsgBox
'         are used.
'Define variables:
'  a - level of significance
'  n - sample size (degrees of freedom = n - 1)
'  x - t critical value
Sub Tinvright()
    a = InputBox("a", "level of significance")
    n = InputBox("n", "sample size")
    x = Application.WorksheetFunction.TInv(2 * a, n - 1)
    x = Application.WorksheetFunction.Round(x, 3)
    MsgBox "Rejection region is t > " & x, vbOKOnly, "tinvright"
End Sub

Sub Tinvleft()
    a = InputBox("a", "level of significance")
    n = InputBox("n", "sample size")
    x = Application.WorksheetFunction.TInv(2 * a, n - 1) * -1
    x = Application.WorksheetFunction.Round(x, 3)
    MsgBox "Rejection region is t < " & x, vbOKOnly, "Tinvleft"
End Sub

Sub Tinvtwo()
    a = InputBox("a", "level of significance")
    n = InputBox("n", "sample size")
    x = Application.WorksheetFunction.TInv(a, n - 1)
    x = Application.WorksheetFunction.Round(x, 3)
    ' TInv returns the positive critical value, so the lower bound is -x.
    MsgBox "Rejection region is t < " & -x & " or t > " & x, _
        vbOKOnly, "Tinvtwo"
End Sub
'*****************************************************end of coding***
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
After completing the codes, these codes can be assigned to the Form controls.
Figure 7.9 shows the Form controls that can be used to ask the user for an option.
Figure 7.9 Form controls that ask the user for an option
Example 7.2 uses a two-tailed test with α = 0.05 and n = 15. When the
two-tailed option is chosen, the two-tailed VBA code is executed. After entering the
values of α and n, you obtain the critical value, which is equal to 2.145. Figures 7.10,
7.11 and 7.12 show the dialog boxes that are displayed by VBA’s InputBox and MsgBox
functions.
In Example 7.1, we introduced the process of using controls
in a spreadsheet. Alternatively, you can create your own UserForm using the VB
Editor. The major steps are as follows:
To create a dialog box, the first step is to insert a new UserForm in the VB Editor
window. To insert a UserForm, press Alt + F11 and choose Insert → UserForm. The VB
Editor will display an empty UserForm as shown in Figure 7.13.
2. Add controls
The Toolbox can be used to add controls. The Toolbox is displayed by choosing
View → Toolbox, which is shown in Figure 7.14. You can click and drag the controls
you need into the UserForm. In this example, we choose the OptionButton. Figure
7.14 shows an example of a UserForm using the OptionButton control.
Every control has several properties that determine how the control looks like.
You can choose View Properties Window or press F4 to show the properties
window (shown in Figure 7.15). You can change the name of the UserForm, the
height, the color and so on.
It is a good idea to rename all the controls using meaningful names. To
change the name of a control, right-click the control and choose Properties;
a Properties window like the one in Figure 7.16 appears.
You can align the controls of the UserForm to make it look professional by selecting
Format → Align, which is shown in Figure 7.17.
5. Display a UserForm
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code: 7.3
*********************************************************************
' Place this procedure in a standard module, not in the UserForm's code module
Sub ShowUserForm()
UserForm1.Show
End Sub
*********************************************************************
To display a UserForm from VBA, you can create a procedure that uses the Show
method of the UserForm object.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You cannot display a UserForm from Excel without using at least one line of
VBA code. This procedure must be located in a standard VBA module and not in the
code module of the UserForm. After running the macro, you see the UserForm,
which uses OptionButton controls to offer several options (Figure 7.18).
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code: 7.4
*********************************************************************
‘Define variables:
***Start Coding************************************************
x = Application.WorksheetFunction.Round(x, 3)
MsgBox "Pb < " & a & " = " & x
End Sub
End Sub
End Sub
*****************************************************end of coding***
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Hypothesis testing of the population mean is covered in Examples 7.1 and
7.2. However, if you want to know whether a sample variance is significantly larger or
smaller than a population variance, the chi-squared test is the proper option.
Example 7.3
A restaurant claims that the waiting time for each customer is 5 minutes with
σ less than 1.5 minutes. Eight customers are randomly chosen and the standard
2
[Solution]
To solve this problem, the first step is to define H0 and H1. Because the claim
concerns the standard deviation of the waiting time, H0 and H1 are stated in terms of
the variance:
H0: σ² = 1.5²
H1: σ² < 1.5²
This example is a left-tailed test with α = 0.01 and n = 8. The critical value
can be obtained from Appendix Table A.3 or from the Excel function CHIINV:
$\chi^2_{0.99,7} = 1.24$, that is, CHIINV(0.99, 7) = 1.24. The rejection region is
$\chi^2 < 1.24$.
Similarly, the function CHIINV can be wrapped in VBA macros. By combining
the macros with a UserForm, you can choose the test you want depending on the
situation. Figure 7.19 shows the Form controls that ask the user to choose an
option.
Figure 7.19 Form controls that ask the user for an option
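A wrapper for CHIINV can follow the same pattern as the t-test macros. The sketch below is an illustration only: the function name ChiCritical and its tail argument are assumptions, and it relies on the fact that CHIINV(p, df) returns the value whose right-tail probability is p.

```vba
' Hypothetical helper: critical chi-squared value for a test on a variance.
' CHIINV(p, df) returns the value with right-tail probability p.
Function ChiCritical(alpha As Double, n As Long, tail As String) As Double
    Dim c As Double
    With Application.WorksheetFunction
        Select Case LCase(tail)
            Case "left"
                ' Left-tailed test: area alpha lies to the left of the value
                c = .ChiInv(1 - alpha, n - 1)
            Case "right"
                c = .ChiInv(alpha, n - 1)
            Case Else
                Err.Raise vbObjectError + 513, "ChiCritical", _
                          "tail must be left or right"
        End Select
        ChiCritical = .Round(c, 3)
    End With
End Function
```

For Example 7.3, =ChiCritical(0.01, 8, "left") corresponds to CHIINV(0.99, 7), which the text rounds to 1.24.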
In this chapter, to test the population mean when σ is given, the Excel function
NORMINV can be used; to test the population mean when σ is not given or the
sample size is small, the function TINV can be used; to test the population variance,
the function CHIINV can be used. Table 7.1 summarizes the Excel functions used in
this chapter.
Regression Analysis
8.1 Introduction
Many variables observed in real life are related, such as the area of a house and its
sale price, study hours and final grades, calorie intake and weight gain, or crop
yields and the amount of fertilizer used. The relationship between such
variables can be expressed in mathematical form. For two variables, say x
and y, the fixed variable x is called the independent variable, and y is called the
dependent variable. The process of estimating y from x is often referred to as
regression. The objective of regression analysis is to express the relationship between
two or more variables. For instance, let x represent the study hours and y
the final grade. Generally, a student who spends a long time studying (x)
will earn good marks (y). However, the variables x and y are not deterministically
related, because study time is just one of the factors that affect a student's grade.
In this chapter, the fundamentals of regression analysis are introduced,
together with some useful Excel functions and the method of creating charts.
$$Y = \beta_0 + \beta_1 x + \varepsilon \qquad (8.1)$$
ε: random error term, which follows the normal distribution with
E(ε) = 0
V(ε) = σ²
Generally, for a given set of data, more than one line appears to
fit it. Figure 8.1 shows three possible lines that can be used to fit the data, and it
is hard to say which line is the best fit.
[Figure 8.1: three candidate lines, y1 = b1 x + b0, y2 = k1 x + k0 and y3 = a1 x + a0, drawn through the same data]
Principle of Least Squares
The sum of squared vertical deviations from the points (x1, y1), …, (xn, yn) to the
line is then
$$f(b_0, b_1) = \sum_{i=1}^{n} \left[ y_i - (b_0 + b_1 x_i) \right]^2 \qquad (8.2)$$
To minimize the sum of squared residuals, we take the partial derivatives of
f(b0, b1) with respect to b0 and b1, equate both of them to zero and solve the
resulting equations.
$$\frac{\partial f(b_0, b_1)}{\partial b_0} = \sum 2 (y_i - b_0 - b_1 x_i) \times (-1) = 0$$
$$\frac{\partial f(b_0, b_1)}{\partial b_1} = \sum 2 (y_i - b_0 - b_1 x_i) \times (-x_i) = 0$$
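Cancelling the common factor of -2 and rearranging gives the following system of
equations, called the normal equations:
$$n b_0 + \left( \sum x_i \right) b_1 = \sum y_i$$
$$\left( \sum x_i \right) b_0 + \left( \sum x_i^2 \right) b_1 = \sum x_i y_i$$
Solving the normal equations gives the least squares estimates:
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}} \qquad (8.3)$$
$$b_0 = \bar{y} - b_1 \bar{x} \qquad (8.4)$$
Where:
$$S_{xy} = \sum x_i y_i - \left( \sum x_i \right) \left( \sum y_i \right) / n$$, and
$$S_{xx} = \sum x_i^2 - \left( \sum x_i \right)^2 / n$$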
According to Eqs. 8.3 and 8.4, the estimates b1 and b0 can be calculated,
and then the regression line is obtained. Example 8.1 demonstrates the process of
obtaining the parameters by hand calculation. Furthermore, the process of drawing
the scatter diagram is also introduced.
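Eqs. 8.3 and 8.4 translate almost line by line into VBA. The following is a minimal sketch under stated assumptions: the function names LsSlope and LsIntercept are hypothetical, and xs and ys are assumed to be worksheet ranges (or VBA arrays) of equal length that contain only numbers.

```vba
' Sketch of Eqs. 8.3 and 8.4. The function names are hypothetical.
' xs and ys may be worksheet ranges or VBA arrays of equal length.
Function LsSlope(xs, ys) As Double
    Dim n As Long, sxy As Double, sxx As Double
    With Application.WorksheetFunction
        n = .Count(xs)
        ' S_xy = sum(x * y) - sum(x) * sum(y) / n
        sxy = .SumProduct(xs, ys) - .Sum(xs) * .Sum(ys) / n
        ' S_xx = sum(x ^ 2) - sum(x) ^ 2 / n
        sxx = .SumSq(xs) - .Sum(xs) ^ 2 / n
    End With
    LsSlope = sxy / sxx                      ' Eq. 8.3
End Function

Function LsIntercept(xs, ys) As Double
    With Application.WorksheetFunction
        ' Eq. 8.4: b0 = mean(y) - b1 * mean(x)
        LsIntercept = .Average(ys) - LsSlope(xs, ys) * .Average(xs)
    End With
End Function
```

The two functions should agree with Excel's built-in SLOPE and INTERCEPT; they are written out only to show where each term of the formulas comes from.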
Example 8.1
Some investigations show that the heights of fathers and their sons have a strong
relationship. The table below shows the respective heights (in inches), x and y, of 15
fathers and their sons.
x 69 72 61 70 67 68 70 68 76 68 64 61 60 52 67
y 71 70 64 72 69 66 73 70 82 71 65 66 73 54 70
[Solution]
1)
The scatter diagram is often used to show the relationship between two variables.
Excel provides tools to create the scatter diagram. Figure 8.2 shows a scatter diagram
of the fathers' and their sons' heights.
[Scatter plot: father's height (inch) on the horizontal axis, son's height (inch) on the vertical axis]
Figure 8.2 Scatter diagram showing the interdependence between the fathers'
and their sons' heights
The horizontal axis represents the father's height, and the vertical axis represents the
son's height. Figure 8.2 shows that the son's height depends on
the father's height.
Excel is one of the most widely used programs for creating charts. The major steps to
create a chart like Figure 8.2 are as follows:
The first step is to select the data. In this example, the range C5:D19 is
selected (shown in Figure 8.3).
After selecting the data, the next step is to choose a chart type from
Insert → Charts. You can choose whichever type you require. Figure 8.4 displays the
Insert Chart dialog box. The main categories are listed on the left, and the subtypes
are shown as icons.
In this example, the XY (Scatter) chart is selected. The XY chart appears after
you click the OK button (shown in Figure 8.5).
[Figure 8.5: the default XY (Scatter) chart, with both axes starting at 0]
Compared with Figure 8.2, Figure 8.5 is unattractive and does not present the data
clearly. Excel allows you to modify the chart until it looks exactly as you
like.
In general, there are two ways to modify a chart. One is to use the Ribbon and
the Mini Toolbar, which appear when you click anywhere inside the chart (shown in
Figure 8.6). The other is to use the shortcut menu: right-click the element you want
to modify and choose the Format option from the
shortcut menu.
To adjust the horizontal or vertical axis, right-click the axis and choose
Format from the shortcut menu. A formatting dialog box appears after you choose
the Format option (shown in Figure 8.7). You can modify the items as you like.
To format the gridlines, choose Chart Tools → Layout → Axes → Gridlines
(shown in Figure 8.8). This drop-down control contains options for all possible
gridlines in the chart.
Tips
If the objective is only to remove a gridline, simply right-click the
gridline and choose Delete from the shortcut menu (shown in Figure 8.9).
To change the axis titles, choose Chart Tools → Layout → Axis Titles. In
this example, the text Father's height (inch) is added to the x-axis and Son's height
(inch) is added to the y-axis.
Figure 8.10 is the XY chart obtained by formatting Figure 8.5. After formatting the
axes, gridlines and axis titles, the figure looks professional.
[Figure 8.10: formatted XY chart, with Father's height (inch) on the horizontal axis and Son's height (inch) on the vertical axis, both axes starting at 50]
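The same chart can also be built with a macro instead of the Ribbon. The sketch below is one possible way to do it, assuming the data are in the range C5:D19 of the active sheet, as in Figure 8.3, with the fathers' heights in the first column; the macro name, the position and the size of the chart are arbitrary choices.

```vba
' Sketch: creates and formats an XY (Scatter) chart similar to Figure 8.10.
' Assumes x (fathers) in C5:C19 and y (sons) in D5:D19 on the active sheet.
Sub CreateHeightScatter()
    Dim co As ChartObject
    Set co = ActiveSheet.ChartObjects.Add(Left:=300, Top:=50, _
                                          Width:=360, Height:=240)
    With co.Chart
        .ChartType = xlXYScatter
        .SetSourceData Source:=ActiveSheet.Range("C5:D19")
        .HasLegend = False
        ' Axis titles
        .Axes(xlCategory).HasTitle = True
        .Axes(xlCategory).AxisTitle.Text = "Father's height (inch)"
        .Axes(xlValue).HasTitle = True
        .Axes(xlValue).AxisTitle.Text = "Son's height (inch)"
        ' Start both axes at 50 so the points fill the plot area
        .Axes(xlCategory).MinimumScale = 50
        .Axes(xlValue).MinimumScale = 50
        ' Remove the horizontal gridlines
        .Axes(xlValue).HasMajorGridlines = False
    End With
End Sub
```

A macro is convenient when the same formatting has to be applied to several data sets, because the manual steps do not have to be repeated for each chart.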
2)
The statistics used for hand calculation are shown in Table 8.1.
According to Eqs. 8.3 and 8.4, the estimates b1 and b0 can be calculated, and then
the regression line is obtained. The calculation processes are shown as follows:
$$S_{xx} = 66213 - (993)^2 / 15 = 476.4$$
$$b_1 = \frac{S_{xy}}{S_{xx}} = \frac{404.8}{476.4} = 0.85$$
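The intermediate sums behind these values can be checked directly from the fifteen (x, y) pairs listed in Example 8.1. The worked steps below are recomputed from that data; the intercept is not shown in the surviving text, so its value here is a reconstruction from Eq. 8.4, rounded to two decimals.

```latex
\begin{align*}
\sum x_i &= 993, \quad \sum y_i = 1036, \quad \sum x_i^2 = 66213, \quad \sum x_i y_i = 68988 \\
S_{xy} &= 68988 - \frac{(993)(1036)}{15} = 68988 - 68583.2 = 404.8 \\
S_{xx} &= 66213 - \frac{(993)^2}{15} = 66213 - 65736.6 = 476.4 \\
b_1 &= \frac{404.8}{476.4} \approx 0.85 \\
b_0 &= \bar{y} - b_1 \bar{x} = \frac{1036}{15} - 0.8497 \times \frac{993}{15} \approx 69.07 - 56.25 = 12.82
\end{align*}
```

With these estimates the fitted line is approximately y = 0.85x + 12.82.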
The estimates b0 and b1 can be obtained by hand calculation. However, the
calculation is repetitive and error-prone. Alternatively, it is more
efficient to use Excel's built-in functions to obtain the parameters. Example 8.2
demonstrates the use of the functions SLOPE, INTERCEPT, TREND and
FORECAST to obtain the estimates.
Example 8.2
Some researchers say that students' academic performance and their study
hours are related. Twenty students are randomly sampled from UST for a survey named
Study Or Fail. Let x represent the study hours (per day), and let y represent the grade
point average (GPA) in the semester. The table below records the 20 students' study
hours (x) and their GPA (y).
3) Suppose that five newcomers want to predict their GPA using the regression line
obtained in question 2. Their study hours (per day) are 3.3, 2.0, 8.0,
6.0 and 4.4. Use these data to predict the newcomers' GPA.
[Solution]
1)
Since the procedure for drawing a scatter diagram has already been demonstrated
in detail in Example 8.1, the detailed steps are skipped here.
Figure 8.11 shows a scatter diagram that depicts the relationship between the study
hours and GPA. The horizontal axis represents the study hours, and the vertical axis
represents the student's GPA. Figure 8.11 shows that the study hours and GPA
are positively related.
[Scatter plot: study hours (hours per day) on the horizontal axis, GPA on the vertical axis]
Figure 8.11 Scatter diagram to show the interdependence between the study
hours and the GPA
2)
As mentioned earlier, the first step in obtaining the least-squares regression line is to
estimate the parameters b0 and b1. The functions SLOPE and INTERCEPT can be
used to estimate b1 and b0, respectively, and the regression line is then
obtained by substituting b0 and b1 into the equation y = b1 x + b0.
Function SLOPE
The function SLOPE is used to calculate the slope of the linear regression line.
The syntax is shown as follows:
Syntax
= SLOPE(known_y's, known_x's)
The data are shown in columns C and D. After entering the function SLOPE
in cell N6, select the ranges of numbers required by the function. The
result is shown in Figure 8.12.
However, the argument order SLOPE(known_y's, known_x's) may feel odd,
because the y values are entered before the x values. The order of the arguments can
be switched using VBA. In Code 8.1, the function SLOPE is wrapped so that the x
values are entered first.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code 8.1
'****************************************************************
'Purpose: to recreate the function SLOPE with the arguments entered in the
'order known xs first, then known ys.
'Define variables:
'a: Known xs, which is an array or cell range of
'independent data points
'b: Known ys, which is the set of numeric dependent data
'points
*****Start Coding**********************************************
Function newslope(a, b)
'The worksheet function SLOPE expects the y values first
newslope = Application.WorksheetFunction.Slope(b, a)
End Function
***********************************************end of coding *******
Tips
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Function INTERCEPT
The function INTERCEPT is used to calculate the point at which a line will
intersect the y-axis. The syntax is shown as follows:
Syntax
= INTERCEPT(known_y's, known_x's)
The data are shown in columns C and D. After entering the function
INTERCEPT in cell N7, select the ranges of numbers required by the
function. The result is shown in Figure 8.13.
As with the function SLOPE, the arguments of the function
INTERCEPT are entered with the known ys first, and then the known xs. You may also
want to switch the order of the arguments, so that the known xs come first and the
known ys second. Code 8.2 shows an example that wraps the function INTERCEPT to
switch the order of the arguments.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code 8.2
'****************************************************************
'Purpose: to recreate the function INTERCEPT with the arguments entered in
'the order known xs first, then known ys.
'Define variables:
'a: Known xs, which is an array or cell range of
'independent data points
'b: Known ys, which is the set of numeric dependent data
'points
*****’Start Coding*********************************************
Function newintercept(a, b)
'The worksheet function INTERCEPT expects the y values first
newintercept = Application.WorksheetFunction.Intercept(b, a)
End Function
**************************************************end of coding *******
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
With the estimates b1 = 0.27 and b0 = 1.73, the regression line is y = 0.27 x + 1.73.
3)
Since the regression line y = 0.27 x + 1.73 was obtained in question 2, the
value of y is found by substituting the value of x into the regression line. However,
if you do not know the regression line and want to obtain the value of y directly,
the functions TREND and FORECAST are good choices.
Function TREND
The function TREND can be used to predict y value from each x without
knowing the regression line y = b1 x + b0 . The syntax is shown as follows:
Syntax
= TREND(known_y's, known_x's, new_x's, const)
Continuing Example 8.2, suppose there are five new students whose study
hours are already given. The function TREND can be used to predict the students'
GPA. The original data are shown in columns C and D, and the new students'
study hours are shown in column U. This function returns an array of values, so
it must be entered as an array formula (by pressing Ctrl + Shift + Enter). The result is
shown in Figure 8.14.
Function FORECAST
The function FORECAST can also be used to predict the y value for a given x value.
Syntax
= FORECAST(x, known_y's, known_x's)
Continuing Example 8.2, the known xs and known ys are shown in
columns C and D, respectively (shown in Figure 8.14). Suppose one student spends
3.3 hours per day studying; the GPA can be predicted using the function FORECAST
as:
FORECAST(3.3, D6:D25, C6:C25) = 2.6.
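FORECAST gives the same number as substituting x into the fitted line, which can be made explicit in VBA. The sketch below is for illustration only; the function name PredictY is hypothetical, and the arguments follow the order of FORECAST.

```vba
' Hypothetical helper: predicts y at newX from the fitted line,
' which is what FORECAST does in a single call.
Function PredictY(newX As Double, knownYs, knownXs) As Double
    Dim b0 As Double, b1 As Double
    With Application.WorksheetFunction
        b1 = .Slope(knownYs, knownXs)        ' slope of the regression line
        b0 = .Intercept(knownYs, knownXs)    ' intercept of the regression line
    End With
    PredictY = b0 + b1 * newX
End Function
```

With the layout of Example 8.2, =PredictY(3.3, D6:D25, C6:C25) should agree with FORECAST(3.3, D6:D25, C6:C25). Using the rounded line from question 2 as a check, 0.27 × 3.3 + 1.73 = 2.62, which rounds to the 2.6 reported above.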
In reality, it is rare that every point lies exactly on the regression line,
and some variation is unavoidable. The further the points are from the line, the less
the line is able to explain. The coefficient of determination measures how well the
regression line represents the data.
The error sum of squares (SSE) measures how much of the variation in y is not
described by the regression line. It is the sum of the squared deviations of the
observed y values from the fitted values:
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (8.5)$$
The estimate of σ² is:
$$\hat{\sigma}^2 = \frac{SSE}{n - 2} \qquad (8.6)$$
The total sum of squares (SST) is the sum of the squared deviations about the
horizontal line at the mean $\bar{y}$:
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \qquad (8.7)$$
Figure 8.15 shows the least squares line, the horizontal line at height $\bar{y}$, the
deviations about the least squares line, $(y_i - \hat{y}_i)$, and the deviations
about the horizontal line, $(y_i - \bar{y})$.
[Figure 8.15: least squares line and horizontal line at height $\bar{y}$, with the deviations $(y_i - \hat{y}_i)$ and $(y_i - \bar{y})$ marked]
SSE is the sum of squared deviations about the least squares line, and SST is the
sum of squared deviations about the horizontal line at the mean of y. The ratio
SSE/SST is the proportion of the total variation that is not explained by the least
squares line, and the coefficient of determination, r² = 1 - SSE/SST (Eq. 8.8), is the
proportion that is explained by the line. If r² is close to 1, the
regression line is a good fit.
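The definitions of SSE, SST and r² can be followed step by step in a short function. This is a sketch under stated assumptions: the name RSquaredFromSums is hypothetical, and ys and xs are assumed to be single-column (or single-row) ranges of equal length. The built-in DEVSQ function is used for SST because it returns the sum of squared deviations from the mean.

```vba
' Sketch: computes SSE (Eq. 8.5), SST (Eq. 8.7) and r^2 = 1 - SSE/SST
' directly from their definitions. The function name is hypothetical.
Function RSquaredFromSums(ys As Range, xs As Range) As Double
    Dim b0 As Double, b1 As Double
    Dim sumSqErr As Double, sumSqTot As Double, resid As Double
    Dim i As Long
    With Application.WorksheetFunction
        b1 = .Slope(ys, xs)
        b0 = .Intercept(ys, xs)
        sumSqTot = .DevSq(ys)                ' SST: sum of (y - mean(y))^2
    End With
    For i = 1 To ys.Cells.Count
        ' residual = observed y minus fitted y
        resid = ys.Cells(i).Value - (b0 + b1 * xs.Cells(i).Value)
        sumSqErr = sumSqErr + resid ^ 2      ' SSE accumulates squared residuals
    Next i
    RSquaredFromSums = 1 - sumSqErr / sumSqTot
End Function
```

The result should match the value returned by the RSQ function for the same ranges.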
The correlation coefficient r measures the strength and the direction of the linear
relationship between two variables, and varies from -1 to +1:
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
where
$$S_{xx} = \sum (x_i - \bar{x})^2$$
$$S_{yy} = \sum (y_i - \bar{y})^2$$
A positive r indicates that the value of y increases as x increases. If x and y
have a strong positive linear correlation, r is close to +1. A negative r indicates
that the value of y decreases as x increases; if x and y have a strong
negative linear correlation, r is close to -1. Furthermore, a value of r near zero means
that there is no linear relationship between the two variables, although a nonlinear
relationship may still exist.
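The correlation coefficient can be computed from the same sums used for the slope. The sketch below assumes the standard definition r = S_xy / sqrt(S_xx S_yy); the function name CorrFromSums is hypothetical, and xs and ys may be ranges or VBA arrays of equal length.

```vba
' Sketch: r = S_xy / sqrt(S_xx * S_yy). The function name is hypothetical.
Function CorrFromSums(xs, ys) As Double
    Dim n As Long, sxy As Double
    With Application.WorksheetFunction
        n = .Count(xs)
        ' S_xy = sum(x * y) - sum(x) * sum(y) / n
        sxy = .SumProduct(xs, ys) - .Sum(xs) * .Sum(ys) / n
        ' DevSq returns the sum of squared deviations from the mean,
        ' that is, S_xx for xs and S_yy for ys
        CorrFromSums = sxy / Sqr(.DevSq(xs) * .DevSq(ys))
    End With
End Function
```

For the heights in Example 8.1 this gives r of about 0.83, a fairly strong positive linear relationship, and it should agree with the built-in CORREL function.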
Excel provides built-in functions to estimate r² and r, including the functions
RSQ, CORREL and LINEST.
Function RSQ
The function RSQ returns the coefficient of determination. The syntax is as follows:
Syntax
= RSQ(known_y's, known_x's)
Function CORREL
The function CORREL returns the correlation coefficient. The syntax is
as follows:
Syntax
= CORREL(known_y's, known_x's)
Function STEYX
The function STEYX returns the standard error of the predicted y value for each x in
the regression.
Syntax
= STEYX(known_y's, known_x's)
Excel does not provide functions that return the values of SSE and SST directly.
However, you can use VBA to create such functions.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code 8.3
*****’Start Coding*********************************************
Function sse(a, b)
'a: known ys, b: known xs. STEYX ^ 2 equals SSE / (n - 2) (Eq. 8.6)
sse = Application.WorksheetFunction.StEyx(a, b) ^ 2 * _
(Application.WorksheetFunction.Count(a) - 2)
End Function
**************************************************end of coding *******
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code 8.4
*****’Start Coding*********************************************
Function SST(a, b)
'a: known ys, b: known xs. SST = SSE / (1 - r ^ 2)
SST = Application.WorksheetFunction.StEyx(a, b) ^ 2 * _
(Application.WorksheetFunction.Count(a) - 2) / (1 - _
Application.WorksheetFunction.RSq(a, b))
End Function
**************************************************end of coding *******
Comments
SST is obtained according to Eq. 8.8: $SST = \dfrac{SSE}{1 - r^2}$
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Example 8.3
[Solution]
The functions RSQ, CORREL and LINEST are used to estimate r² and r.
To estimate r²
In Example 8.3, the data are shown in columns C and D. After entering the
function RSQ in cell G6, select the ranges of numbers required by the
function. The result is shown in Figure 8.16.
As mentioned earlier, the coefficient of determination (r²) measures how well the
regression line represents the data. In this example, r² is equal to 0.74, so the linear
trendline is acceptable.
To estimate r
In Example 8.3, the data are shown in columns C and D. After entering the
function CORREL in cell G7, select the ranges of numbers required
by the function. The result is shown in Figure 8.17.
Examples 8.2 and 8.3 introduced the functions SLOPE,
INTERCEPT, RSQ and CORREL, which can be used to estimate the parameters b1,
b0, r² and r, respectively. However, it is inconvenient to calculate the parameters
one by one. The function LINEST is more convenient, as it estimates
not only b0 and b1, but also other statistics used in regression analysis, such as
r².
Function LINEST
The function LINEST is entered as an array formula and produces an array of results.
The syntax is as follows:
Syntax
= LINEST(known_y's, known_x's, const, stats)
Figure 8.18 shows the statistics related to the regression analysis. The
first and last columns of the table are not produced by the function LINEST; we add
the labels manually to show the meaning of each cell. Some statistics, such as the
overall F statistic, are not discussed here, although they are important in regression
analysis.
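In VBA, the LINEST results can be read from the returned array by row and column. The sketch below is an illustration: the function name LinestStat and its index arguments are hypothetical, and the layout in the comments applies to the case of one x variable with const and stats both set to TRUE.

```vba
' Sketch: returns one entry of the LINEST statistics table.
' With one x variable and stats = True, LINEST returns 5 rows by 2 columns:
'   row 1: b1, b0          row 2: standard errors of b1 and b0
'   row 3: r^2, s(y)       row 4: F, degrees of freedom
'   row 5: regression sum of squares, residual sum of squares (SSE)
' The function name and its arguments are hypothetical.
Function LinestStat(knownYs, knownXs, rowIdx As Long, colIdx As Long) As Double
    Dim res As Variant
    res = Application.WorksheetFunction.LinEst(knownYs, knownXs, True, True)
    LinestStat = res(rowIdx, colIdx)
End Function
```

For example, =LinestStat(D6:D25, C6:C25, 3, 1) should return the same r² as RSQ(D6:D25, C6:C25), and row 5, column 2 gives SSE without the need for a separate function.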
The previous sections focused on two variables that are linearly
related. Some variables may not have an obvious linear
relationship themselves. However, after a suitable transformation of the variables x
and/or y, the relationship between the resulting variables may be linear.
Definition
A probability model relating y to x is intrinsically linear if, by means of a
transformation on y and/or x, it can be reduced to a linear probabilistic model.
$$Y' = \beta_0 + \beta_1 x' + \varepsilon'$$
Four important intrinsically linear functions are given in Table 8.2. For an
exponential function, only y is transformed to achieve linearity. For a power function
relationship, both x and y are transformed to achieve linearity.
d. Reciprocal: $y = \alpha + \beta \cdot \dfrac{1}{x}$; transformation $x' = \dfrac{1}{x}$; linear form $y = \alpha + \beta x'$
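The exponential and power cases mentioned above can be written out explicitly. The sketch below assumes the usual parametrisations y = αe^{βx} and y = αx^{β}; the symbols α and β are an assumption of this sketch, chosen to match the intercept and slope of the transformed linear model.

```latex
\begin{align*}
\text{Exponential: } y = \alpha e^{\beta x}
  &\;\Rightarrow\; \ln(y) = \ln(\alpha) + \beta x,
  && y' = \ln(y),\quad \beta_0 = \ln(\alpha),\quad \beta_1 = \beta \\
\text{Power: } y = \alpha x^{\beta}
  &\;\Rightarrow\; \ln(y) = \ln(\alpha) + \beta \ln(x),
  && y' = \ln(y),\quad x' = \ln(x),\quad \beta_0 = \ln(\alpha),\quad \beta_1 = \beta
\end{align*}
```

In both cases the fitted intercept is on the logarithmic scale, so the original parameter is recovered as α = e^{b0}.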
One of the advantages of an intrinsically linear model is that the parameters
of the transformed model, such as b0, b1, r² and r, can be estimated immediately using
the principle of least squares. For instance, according to Eqs. 8.3 and 8.4, we can
obtain
$$b_1 = \frac{\sum (x'_i - \bar{x}')(y'_i - \bar{y}')}{\sum (x'_i - \bar{x}')^2}$$
$$b_0 = \bar{y}' - b_1 \bar{x}'$$
Example 8.4 demonstrates the process of using the intrinsically linear model to
solve the problem.
Example 8.4
Some researchers suggest that one of the important factors affecting the
moisture content (%) of chips is the frying time (sec). The table below shows the
relationship between the frying time (x) and the moisture content (y).
x 1 4 9 15 23 28 30 45 60
y 20 16.3 9.7 8.1 4.2 3.4 2.9 1.9 1.3
[Solution]
1)
[Scatter plot: frying time (sec) on the horizontal axis, moisture content (%) on the vertical axis]
Examples 8.2 and 8.3 introduced the functions for obtaining the
parameters of the regression line. A shortcut for
obtaining the regression line and r² is the Format Trendline dialog, which is opened
by right-clicking any point on the graph and choosing the Add Trendline option.
After choosing the options Linear, Display Equation on chart and
Display R-squared value on chart, you get the regression line, its equation and the
R-squared value at the same time (shown in Figure 8.21).
[Figure 8.21: scatter plot of moisture content (%) against frying time (sec) with linear trendline y = -0.292x + 14.51, R² = 0.721]
Figure 8.21 shows that the frying time and the moisture content are negatively
related. As mentioned earlier, r² measures how well the regression line represents
the data. In this example, r² = 0.721, so the linear trendline is not a very good fit.
However, this does not mean that the variables x and y are unrelated. The
exponential and power functions are used to test whether the transformed x and/or y
have a strong linear relationship.
2)
[Figure 8.23: scatter plot of ln(y) against frying time x (sec) with linear trendline y = -0.047x + 2.773, R² = 0.940]
In Figure 8.23, it is obvious that each point is close to the regression line.
Furthermore, in this example r² = 0.94, which is much better than the previous value.
For power function relationship, both x and y are transformed to achieve linearity.
Figure 8.24 shows the scatter plot of ln(y) and ln(x).
[Figure 8.24: scatter plot of ln(y) against ln(x) with linear trendline y = -0.685x + 3.475, R² = 0.882]
Figure 8.24 shows a scatter diagram of the transformed frying time and moisture
content. The horizontal axis represents ln(x), the logarithm of the frying time, and the
vertical axis represents ln(y), the logarithm of the moisture content. The chart shows
that the transformed variables are negatively related.
As mentioned earlier, the closer r² is to 1, the more successful the regression
model is in explaining the variation in y. The exponential model (r² = 0.94) fits better
than the linear model (r² = 0.721) and the power model (r² = 0.882). According to the
previous calculation, y' = -0.047x + 2.773, so the estimated regression function for the
exponential model is ln(y) = -0.047x + 2.773, that is, $y = e^{-0.047x + 2.773}$.
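The exponential fit can also be obtained without a chart by regressing ln(y) on x. The following is a minimal sketch, assuming the model y = α·e^{βx}; all function names are hypothetical, and xs and ys are assumed to be single-column ranges of equal length with every y positive.

```vba
' Sketch: fits the exponential model y = alpha * exp(beta * x) by
' regressing ln(y) on x. All names are hypothetical.
Function ExpFitBeta(xs As Range, ys As Range) As Double
    ExpFitBeta = Application.WorksheetFunction.Slope(LnColumn(ys), xs.Value)
End Function

Function ExpFitAlpha(xs As Range, ys As Range) As Double
    ' The intercept of the transformed line is ln(alpha)
    ExpFitAlpha = Exp(Application.WorksheetFunction.Intercept( _
                      LnColumn(ys), xs.Value))
End Function

' Returns an n-by-1 array holding the natural logarithm of each cell
Private Function LnColumn(rng As Range) As Variant
    Dim out() As Double, i As Long
    ReDim out(1 To rng.Cells.Count, 1 To 1)
    For i = 1 To rng.Cells.Count
        out(i, 1) = Log(rng.Cells(i).Value)   ' VBA's Log is the natural logarithm
    Next i
    LnColumn = out
End Function
```

For the data of Example 8.4, ExpFitBeta should be close to -0.047 and ExpFitAlpha close to e^{2.773}, about 16.0, so the fitted curve can be written as y ≈ 16.0·e^{-0.047x}.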
The Excel functions used in this chapter are summarized in Table 8.3. This chapter
introduced several Excel functions related to regression analysis.
The functions SLOPE and INTERCEPT can be used to estimate the parameters b1 and
b0. The functions RSQ and CORREL can be used to obtain r² and r. The function
LINEST can be used to estimate ten statistics related to regression analysis, including
the parameters b0, b1 and r². The function TREND is used to predict the y
values for new x values.
In this chapter, functions such as SLOPE and INTERCEPT are recreated to
switch the order of the arguments: known_x's first, and then known_y's. Table 8.4
summarizes the user-defined functions.
References:
Appendix Tables
Table A.4 Critical values of Dna at significance level α in the K-S test