
Contents

Chapter 1 Introduction of Uncertainty
1.1 Introduction
1.2 Types of Uncertainty
1.3 Introducing Excel

Chapter 2 Fundamental of Probability Models
2.1 Introduction
2.2 Sample Space, Event, Sample Point
2.3 Statistical Independence
2.4 Conditional Probability
2.4.1 Rule of Multiplication
2.5 Summaries of Excel Functions

Chapter 3 Analytical Models of Random Phenomena
3.1 Introduction
3.2 Random Variable
3.3 Probability Distribution of Random Variables
3.4 Useful Probability Distribution
3.4.1 Uniform Distribution
3.4.2 Normal Distribution
3.4.3 Lognormal Distribution
3.4.4 Binomial Distribution
3.4.5 Poisson Distribution
3.4.6 Exponential Distribution
3.5 Excel Functions Related to Probability Distribution
3.6 Summaries of Excel Functions

Chapter 4 Determination of the Probability Distribution Models
4.1 Introduction
4.2 Probability Paper
4.3 Goodness-of-fit Test
4.3.1 Chi-Squared Test
4.3.2 Kolmogorov-Smirnov Test (K-S test)
4.4 Summaries of Excel Functions

Chapter 5 Monte Carlo Simulation
5.1 Introduction
5.2 Monte Carlo Simulation
5.2.1 Random Number Generation
5.2.2 Trials Confirmation
5.3 Central Limit Theorem
5.4 Summaries of Excel Functions

Chapter 6 Statistical Inferences from Observational Data
6.1 Introduction
6.2 Point Estimation
6.3 Interval Estimation
6.3.1 Confidence Interval for the Mean with a Known Population Variance
6.3.2 Confidence Interval for a Normal Mean with the Variance is Unknown
6.3.3 Confidence Interval for the Variance of a Normal Distribution
6.3.4 Estimation of the Ratio of the Variance of the Two Populations
6.4 Excel Functions Used in the Point and Interval Estimation
6.5 Summaries of Excel Functions

Chapter 7 Testing of Hypotheses
7.1 Introduction
7.2 Null Hypothesis and Alternative Hypothesis
7.3 Type I and Type II Errors
7.4 Testing Procedures
7.5 Summaries of Excel Functions

Chapter 8 Regression Analysis
8.1 Introduction
8.2 The Simple Linear Regression Model
8.3 Estimating Model Parameters
8.4 Coefficient of Determination and Correlation Coefficient
8.4.1 Coefficient of Determination
8.4.2 Correlation Coefficient
8.5 Intrinsic Linear Regression
8.6 Summaries of Excel Functions

References

Appendix Tables



Introduction of Uncertainty

1.1 Introduction

Statements like these appear in daily life all the time: “it is likely to rain
tomorrow;” “there is an 80 percent chance that Tom will win the competition;”
“the professor expects that 95 percent of the students will pass all of the final exams.”
No one can predict exactly whether it will rain, and you cannot be certain
whether Tom will win or lose the competition tomorrow. You cannot even be sure how
many classmates will be in the classroom when you come to class tomorrow.
As these statements show, we are often confronted with uncertainty, and most of
the time we are forced to make decisions in spite of it. Uncertainty surrounds our
everyday life, and we have to understand and deal with it. To describe and quantify
uncertainty, we need the ideas of probability and statistics. The major objective of
this work is to use the concepts and methods of probability and statistics to solve real
problems under uncertainty. No one likes to work out probabilities and statistics by
hand. Fortunately, Excel and its macro language, Visual Basic for Applications (VBA),
provide suitable tools (a wide range of functions and charts) for solving problems
related to probability and statistics.

1.2 Types of Uncertainty

The sources of uncertainty can be classified into two broad types: aleatory
uncertainty and epistemic uncertainty. Before explaining them further, let us look at
three examples:

1. Measure the temperature

Six students measure the temperature (°C) of UST’s swimming pool. The results
are as follows:
27.6 27.5 26.9 27.1 27.3 27.7

2. Toss a die

A die is tossed six times. The results are as follows:
1 3 4 2 1 3

3. Measure the components of an atom

Before the 20th century, people thought that the atom was the smallest unit of
matter and could not be divided further. However, by the early 20th century,
physicists had discovered that the atom is divisible and consists of electrons,
protons, and neutrons.

In the first example, the measured temperature varies from one student to another.
In the second example, each side of the die is equally likely to appear, so the result
of a toss cannot be predicted exactly. Uncertainties of this kind are caused by natural
randomness and are defined as aleatory uncertainty.
Aleatory uncertainty is common in daily life. For instance, if ten students are
chosen at random off the street, their physical characteristics such as height and
weight vary; the results of lottery games are random; the number of green lights you
see on the way home varies; and the taxi fare you pay from home to school differs
from trip to trip. These phenomena are all caused by natural randomness and are
referred to as aleatory uncertainty. Aleatory uncertainty is usually analyzed by
statistical approaches such as probability functions, and in most cases it cannot be
reduced by further measurement.
The third example, however, is different from the first two. The wrong conclusion
that the atom is indivisible came from insufficient or imperfect knowledge. This kind
of uncertainty is defined as epistemic uncertainty, which may be reduced through
further measurements, improved experiments, or consulting more experts. Epistemic
uncertainty is also a common phenomenon.

Definition
Aleatory uncertainty: uncertainty caused by natural randomness and analyzed
by statistical approaches such as probability functions.
Epistemic uncertainty: uncertainty caused by insufficient or imperfect
knowledge about fundamental phenomena.

On the whole, aleatory uncertainty is data based and generally cannot be reduced
or modified, whereas epistemic uncertainty is knowledge based and may be reduced
by improved experiments. When dealing with practical problems, you can consider
these two types of uncertainty separately or combine them. Irrespective of the type
of uncertainty, statistics and probability provide the proper tools for modeling and
analyzing it.

1.3 Introducing Excel

The first version of Excel was released in 1985. After about 30 years, Excel
has become the leader in the spreadsheet market. You may also have heard of
other spreadsheets such as Office Web Apps and Google Spreadsheet.
However, these are not considered even minor threats to Microsoft. In fact, the
biggest competitor for Microsoft is itself.
As the dominant product in the commercial electronic spreadsheet market, Excel
is very versatile. Here are just a few of its applications:
1. Powerful data analysis options: Excel provides various kinds of functions
that can be used for data analysis.
2. Creating charts and graphics: Excel provides a wide variety of highly
customizable charts and the SmartArt tools to create professional-looking
diagrams.
3. Visual Basic for Applications: Excel provides an easy-to-learn macro
language (VBA) to help you create structured programs directly in Excel.
4. Easy learning: Excel is user friendly and provides many resources to help
you learn it. Because it is the most widely used spreadsheet, you can easily
solve Excel problems by asking friends or searching the internet.

In this book, we focus on using Excel and its macro language (VBA) to solve
problems related to probability and statistics.

Fundamental of Probability Models

2.1 Introduction

Even if you have no formal knowledge of probability, you probably have some
intuitive ideas: there is a 50-50 chance of getting heads when a coin is flipped
once; the probability of drawing a heart at random from a deck of 52 cards is
1/4; there is a 50-50 chance that the next person you meet on the street is a
girl. In everyday life, the probability of an event is the chance that this event will
happen. The formal definition is that “Probability can be referred to
as the occurrence of the events of interest relative to other events.”
In this chapter, we introduce the elementary concepts and fundamental rules of
probability and show the basic methods of computing the probabilities of various
events. Some Excel functions that can be used in probability calculations are also
introduced.

2.2 Sample Space, Event, Sample Point

An experiment is any activity whose outcome is uncertain. An experiment can be
simple, such as flipping a coin or measuring the temperature of a swimming pool, or
very complex, such as analyzing the factors that affect children’s Intelligence
Quotient (IQ). The collection of all elementary results of an experiment is called a
sample space.

Definition
Sample space: the set that consists of all the possible outcomes of an experiment
Sample point: a member of the sample space
Event: a set of outcomes, that is, a subset of the sample space

Example 2.1

When a die is tossed once, the possible outcomes that make up the sample
space are 1, 2, 3, 4, 5, and 6. Some compound events are as follows:
A = {1, 2}, which represents the event that the number of points is at most two.
B = {2, 4, 6}, which represents the event that the number of points is even.
With these fundamental concepts in hand, we can compute the probabilities of
many events of interest.

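Definition
For the event of interest E:
\[ \Pr(E) = \frac{\text{Sample points}}{\text{Sample space}} \qquad (2.1) \]
where "Sample points" is the number of outcomes in E and "Sample space" is the
total number of possible outcomes.

Reconsider Example 2.1. The sample space is \(S = \{1, 2, 3, 4, 5, 6\}\), event
\(A = \{1, 2\}\), and event \(B = \{2, 4, 6\}\). The probability of event
A is \(\Pr(A) = 2/6\); the probability of event B is \(\Pr(B) = 3/6\).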
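
For readers who want to check Eq. 2.1 outside Excel, the counting can be sketched in a few lines of Python. This is only an illustration and not part of the book's Excel workflow: the helper name `probability` is hypothetical, and the sketch assumes, as Eq. 2.1 does, that every outcome in the sample space is equally likely.

```python
from fractions import Fraction

def probability(event, sample_space):
    """Eq. 2.1: number of sample points in the event over the size of the sample space."""
    event = set(event)
    if not event <= set(sample_space):
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))

sample_space = {1, 2, 3, 4, 5, 6}  # one toss of a die
A = {1, 2}                         # at most two points
B = {2, 4, 6}                      # an even number of points

pr_a = probability(A, sample_space)
pr_b = probability(B, sample_space)
print(pr_a, pr_b)  # 1/3 1/2 (that is, 2/6 and 3/6 in lowest terms)
```

Exact fractions are used instead of floating-point numbers so that the results can be compared directly with the hand calculation.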

Excel Functions - AND, OR, IF

The logical functions AND, OR, and IF in Excel can be used to categorize data into
small groups for further analysis. They can be used individually, or nested and
combined with other functions to perform data analysis. First, let us look at the
basics of these functions.

Function AND

The function AND returns TRUE if all of its arguments evaluate to TRUE and
returns FALSE if any argument evaluates to FALSE. You can specify up to 255
conditions.

Syntax
= AND (Logical 1, [Logical 2],…)

Logical 1: required, the first condition that you want to test


Logical 2: optional, additional conditions that you want to test

For instance, when planning a vacation, two important factors that affect your
decision are money and time. The function AND can help you decide. Figure 2.1
illustrates the use of the function AND.

Figure 2.1 Example of using the function AND

In this example, if you have both money and time, the function returns TRUE.
Congratulations! You will have a vacation.

Function OR

The function OR returns TRUE if any argument is TRUE and returns FALSE if all
arguments are FALSE. The function OR works in a similar way to the function
AND.

Syntax
= OR (Logical 1, Logical 2,…)

Logical 1, Logical 2, …: 1 to 255 conditions that you want to test

Function IF

The function IF returns one value if the condition you evaluate is TRUE and
another value if it is FALSE. Furthermore, IF functions can be nested to provide
even more decision-making ability.

Syntax
= IF (Logical test, Value if true, Value if false)

Logical test: required, any expression that can be evaluated to TRUE or FALSE
Value if true: required, the value that you want to be returned if the test is true
Value if false: optional, the value that you want to be returned if the test is FALSE

For instance, you can take a TAXI or a MINIBUS to UST, and the money in your
pocket is one of the important factors affecting your decision. The function IF can
help you decide. Figure 2.2 illustrates how to use the function IF.

Figure 2.2 Example of using the function IF

In this example, if the money in your pocket is more than the taxi fare, you
can come to UST by TAXI; otherwise, by MINIBUS.

Having introduced the basics of the logical functions AND, OR, and IF, let us see
an example of their use in practice.

Example 2.2

Ten students from UST posted their daily spending on the internet. The details are
shown in the following table:

Student  Monday  Tuesday  Wednesday  Thursday  Friday  Total
1        100     200      120        180       210     810
2        210     90       95         120       170     685
3        150     70       170        150       95      635
4        120     150      80         80        120     550
5        95      120      150        120       85      570
6        170     95       120        95        110     590
7        165     95       95         170       130     655
8        210     170      170        165       110     825
9        100     165      165        70        95      595
10       80      205      95         95        100     575

1) Categorize the students according to whether each day’s spending is larger than
$100, using the function AND.
2) Categorize the students according to whether any day’s spending is larger than
$150, using the function OR.
3) Categorize the students according to whether their total weekly spending is
larger than $800, smaller than $600, or between $600 and $800, using the nested
function IF.

[Solution]

1)

To check whether each day’s spending is larger than $100, the function AND can
be used. Figure 2.3 shows how the function AND categorizes the students according
to whether their daily spending is larger than $100. If a student’s spending on every
day from Monday to Friday is larger than $100, the function returns TRUE; otherwise,
FALSE. Part of the results is displayed in Figure 2.3, and the complete table can be
seen in the Excel file named Chapter two, in the spreadsheet named Example 2.2.

Figure 2.3 Using the function AND in Example 2.2

From Figure 2.3, we observe that the function returns FALSE in cell K6, which
means that the first student’s daily spending is not larger than $100 on every day.
You can copy the function down the column and complete the calculation easily by
selecting cell K6 and double-clicking its fill handle.

2)

To check whether any day’s spending is larger than $150, the function OR can be
used. Figure 2.4 shows the use of the function OR. If any day’s spending is
larger than $150, the function returns TRUE; otherwise, FALSE. Part of the
results is displayed in Figure 2.4, and the complete table can be seen in the Excel
file named Chapter two, in the spreadsheet named Example 2.2.

Figure 2.4 Using the function OR in Example 2.2



From Figure 2.4, we observe that the function returns TRUE in cell L6, which
means that the first student’s spending is larger than $150 on at least one day. You
can complete the table easily by copying the function.

3)

The function IF can be nested to categorize the students according to whether
their total weekly spending is larger than $800, smaller than $600, or between
$600 and $800. Figure 2.5 shows the use of the nested function IF.
The function returns “Bad” when the total spending is larger than $800; it
returns “Good” when the total spending is less than $600; and it returns “Medium”
when the total spending is between $600 and $800. Part of the results is displayed
in Figure 2.5, and the complete table can be seen in the Excel file named Chapter
two, in the spreadsheet named Example 2.2.

Figure 2.5 Using the nested function IF in Example 2.2

From Figure 2.5, we observe that the function returns “Bad” in cell M6,
which means that the first student’s total weekly spending is larger than $800. You
can complete the calculation easily by copying the function.
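
As a cross-check on the three parts of Example 2.2, the same logic can be sketched in Python, using the ten rows of the table above. This is an illustration rather than the book's spreadsheet: `all()` plays the role of AND, `any()` the role of OR, and the hypothetical helper `grade` the role of the nested IF. Treating a total of exactly $600 or $800 as "Medium" is an assumption; no student in the table sits on either boundary.

```python
# Daily spending (Monday to Friday) of the ten students in Example 2.2
spending = {
    1: [100, 200, 120, 180, 210],
    2: [210, 90, 95, 120, 170],
    3: [150, 70, 170, 150, 95],
    4: [120, 150, 80, 80, 120],
    5: [95, 120, 150, 120, 85],
    6: [170, 95, 120, 95, 110],
    7: [165, 95, 95, 170, 130],
    8: [210, 170, 170, 165, 110],
    9: [100, 165, 165, 70, 95],
    10: [80, 205, 95, 95, 100],
}

def grade(total):
    """Nested IF: Bad above $800, Good below $600, Medium otherwise (assumed at the boundaries)."""
    if total > 800:
        return "Bad"
    if total < 600:
        return "Good"
    return "Medium"

# Part 1 (AND): every day's spending is larger than $100
all_over_100 = {s: all(x > 100 for x in days) for s, days in spending.items()}
# Part 2 (OR): at least one day's spending is larger than $150
any_over_150 = {s: any(x > 150 for x in days) for s, days in spending.items()}
# Part 3 (nested IF): label by total weekly spending
label = {s: grade(sum(days)) for s, days in spending.items()}

print(all_over_100[1], any_over_150[1], label[1])  # False True Bad
```

The printed line reproduces the results reported for the first student in cells K6, L6, and M6.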

Combining the functions AND, OR, and IF

In practice, rather than using the functions AND and OR on their own, it is more
common to combine them with the function IF when analyzing data. Example 2.3
shows how to combine the functions AND and OR with the function IF to solve
problems related to probability calculation.

Example 2.3

Nowadays, online social network sites, which help people build social networks,
make friends, and share interests and activities, are popular, especially among
young people. A survey named Which type of social network sites do UST students
prefer has been carried out, covering three popular social network sites: Facebook,
Twitter, and Google+. Part of the survey results is displayed in Table 2.1, and the
complete table can be seen in the Excel file named Chapter two, in the spreadsheet
named Example 2.3.

Table 2.1 Survey result about the online social network sites

             Online Social Network
Student No.  Facebook  Twitter  Google+
1            1         0        1
2            1         1        0
3            1         0        0
4            0         0        0
5            1         1        1
6            0         1        0

*1 stands for a student who prefers the specific online social network site; 0 stands for a student
who does not.

1) Calculate the probability that a student prefers both Facebook and Twitter.
2) Calculate the probability that a student prefers either Twitter or Google+.
3) Calculate the probability that a student prefers none of the online social
network sites.

[Solution]

1)

According to Eq. 2.1, we need the sample points and the sample space to
calculate the probability. In this example, 50 students participated in the survey,
so the number of all possible outcomes is 50. To obtain the sample points, we need
the number of students who prefer both Facebook and Twitter. To pick out these
students, the functions IF and AND can be used.

Figure 2.6 Using the functions IF and AND in Example 2.3

According to Figure 2.6, if a student prefers both Facebook and Twitter, the
function returns 1; otherwise, 0. The number of sample points can be obtained by
counting the 1s with the function SUM.
The probability that a student prefers both Facebook and Twitter is:
\[ \Pr(A) = \frac{13}{50} = 0.26 \]

2)

To calculate the probability that a student prefers either Twitter or Google+,
we need the sample points and the sample space, as before. The sample space has
already been obtained and is equal to 50. To obtain the sample points, we need to
find the number of students who prefer either Twitter or Google+, using the
functions IF and OR.

Figure 2.7 Using the functions IF and OR in Example 2.3

If a student prefers either Twitter or Google+, the function returns 1; otherwise,
0. You can obtain the number of sample points by counting the 1s with the
function SUM.
The probability that a student prefers either Twitter or Google+ is:
\[ \Pr(B) = \frac{30}{50} = 0.6 \]

3)

To calculate the probability that a student prefers none of them, we can first
calculate the probability that a student prefers Facebook, Twitter, or Google+, and
then obtain the probability of the complementary event.

Figure 2.8 Using the functions IF and OR in Example 2.3



The probability that a student prefers Facebook, Twitter, or Google+ is:

\[ \Pr(C) = \frac{41}{50} = 0.82 \]

And the probability that a student prefers none of them is:

\[ \Pr(\overline{C}) = 1 - 0.82 = 0.18 \]

In Example 2.3, the event that a student prefers both Facebook (F) and Twitter (T)
can be described as the intersection of the events F and T, written as \(F \cap T\)
or \(FT\); the event that a student prefers either Twitter (T) or Google+ (G) can be
described as the union of the events T and G, written as \(T \cup G\); the event
that a student prefers at least one of the three sites and the event that a student
prefers none of them are complementary events, written as
\(\Pr(C) = 1 - \Pr(\overline{C})\).

Definition

Union of events, written as \(E_1 \cup E_2\): the event consisting of all outcomes
that are in \(E_1\), in \(E_2\), or in both events
Intersection of events, written as \(E_1 \cap E_2\): the event consisting of all
outcomes that are in both \(E_1\) and \(E_2\)
Complementary event, written as \(\overline{E}\): the event consisting of the
sample points that are not in E, with \(\Pr(\overline{E}) = 1 - \Pr(E)\)
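
The three set operations in this definition map directly onto Python's set operators (`&`, `|`, and `-`). The sketch below is illustrative only. It assumes the six rows shown in Table 2.1, not the full 50-student survey, so its probabilities will not match the 0.26, 0.6, and 0.18 obtained from the complete data; the variable names are hypothetical.

```python
from fractions import Fraction

# ASSUMPTION: only the six rows displayed in Table 2.1 (Facebook, Twitter, Google+),
# not the complete 50-student survey used in the worked solution.
rows = {
    1: (1, 0, 1),
    2: (1, 1, 0),
    3: (1, 0, 0),
    4: (0, 0, 0),
    5: (1, 1, 1),
    6: (0, 1, 0),
}

students = set(rows)  # the sample space
F = {s for s, (f, t, g) in rows.items() if f == 1}
T = {s for s, (f, t, g) in rows.items() if t == 1}
G = {s for s, (f, t, g) in rows.items() if g == 1}

def pr(event):
    """Eq. 2.1 applied to the six displayed students."""
    return Fraction(len(event), len(students))

f_and_t = F & T                # intersection: prefers both Facebook and Twitter
t_or_g = T | G                 # union: prefers Twitter, Google+, or both
any_site = F | T | G           # prefers at least one site
no_site = students - any_site  # complement: prefers none of them

print(sorted(f_and_t), sorted(t_or_g), sorted(no_site))  # [2, 5] [1, 2, 5, 6] [4]
print(pr(no_site) == 1 - pr(any_site))                   # True
```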

In Example 2.3, the sample space is given, so the major step in calculating the
probability is to categorize the data and obtain the sample points. For another
type of experiment, however, both the possible outcomes (sample space) and the
sample points are unknown, and you need to obtain both of them to calculate the
probabilities. To lay out all possible outcomes (sample space), you can list them by
hand or use Excel. Examples 2.4 and 2.5 illustrate both methods.

Example 2.4

Suppose you manage two projects at the same time, and each of them has three
possible completion states:
A = 100% done, B = not sure, C = 100% failed
What is the probability that at least one project is 100% completed?

[Solution]

According to Eq. 2.1, to obtain the probability that at least one project is 100%
completed, the sample space and the sample points should be obtained first. This
question is relatively easy, so all of the possible outcomes can be listed as
follows:
Sample space = {AA, AB, AC, BA, BB, BC, CA, CB, CC}, so the number of possible
outcomes is nine.
Another way to display the outcomes is a tree diagram, which presents all the
possibilities pictorially. Figure 2.9 shows the tree diagram that displays all
possible outcomes of Example 2.4.

Figure 2.9 The tree diagram for Example 2.4

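As the sample points are {AA, AB, AC, BA, CA}, the probability that at least one
project is 100% completed is:

\[ \Pr(\text{at least one project is 100\% completed}) = \frac{5}{9} \approx 0.56 \]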
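
The listing of the sample space in Example 2.4 can also be generated programmatically. The following Python sketch, which is not part of the book's Excel workflow, uses `itertools.product` to enumerate the nine outcomes and then keeps those that contain at least one A. Like Eq. 2.1, it assumes that all nine outcomes are equally likely.

```python
from itertools import product

states = ["A", "B", "C"]  # A = 100% done, B = not sure, C = 100% failed

# Every ordered pair of states for the two projects
sample_space = ["".join(p) for p in product(states, repeat=2)]
# Outcomes in which at least one project is 100% done
sample_points = [outcome for outcome in sample_space if "A" in outcome]

probability = len(sample_points) / len(sample_space)
print(sample_space)    # ['AA', 'AB', 'AC', 'BA', 'BB', 'BC', 'CA', 'CB', 'CC']
print(sample_points)   # ['AA', 'AB', 'AC', 'BA', 'CA']
print(round(probability, 2))  # 0.56
```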

Example 2.4 is relatively easy, so it is possible to list all of the outcomes
by hand. For more complex experiments, however, it is hard to list all of the
possible outcomes manually. In that case, Excel becomes a powerful tool for
displaying all possible outcomes.

Example 2.5

Reconsider Example 2.4. Instead of two projects, you now manage five projects,
with 1 = 100% done, 2 = not sure, 3 = 100% failed.
What is the probability that at least one project is 100% completed?

[Solution]

The total number of possible outcomes is \(3^5 = 243\), and it is difficult to
display all of them manually. In this case, Excel is a good option. Part of the
results is displayed in Figure 2.10, and the complete table can be seen in the
Excel file named Chapter two, in the spreadsheet named Example 2.5.

Figure 2.10 Trials displaying for Example 2.5

After finishing the display, we can count that the total number of trials is
243. The functions IF and OR can be used to pick out the trials in which at least
one project is 100% completed.

Figure 2.11 Using the function IF and OR in Example 2.5

Then the probability that at least one project is 100% completed is:

\[ \Pr(\text{at least one project is 100\% completed}) = \frac{211}{243} \approx 0.87 \]
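
The count of 211 favourable trials can be verified without a spreadsheet. The Python sketch below is an illustration under the same equal-likelihood assumption as Eq. 2.1; the function name `count_at_least_one_done` is hypothetical. It enumerates all trials for any number of projects and cross-checks the result against the complement rule.

```python
from fractions import Fraction
from itertools import product

def count_at_least_one_done(n_projects, states=(1, 2, 3), done=1):
    """Return (favourable trials, total trials) for n_projects projects."""
    trials = list(product(states, repeat=n_projects))
    hits = [t for t in trials if done in t]  # at least one project is 100% done
    return len(hits), len(trials)

hits, total = count_at_least_one_done(5)
print(hits, total, round(hits / total, 2))  # 211 243 0.87

# Cross-check with the complement: no project is done in (2/3)^5 of the trials
print(Fraction(hits, total) == 1 - Fraction(2, 3) ** 5)  # True
```

Calling `count_at_least_one_done(2)` returns `(5, 9)`, which matches Example 2.4.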

2.3 Statistical Independence

A card is drawn at random from a deck of 52 cards. Let A denote the event that
an ace is drawn, and let B denote the event that a diamond is drawn.
\(\Pr(A) = 4/52 = 1/13\), as there are four aces; \(\Pr(B) = 13/52 = 1/4\), since
there are 13 diamonds.

\(A \cap B\) denotes the event that the card drawn is both an ace and a diamond.
Common sense tells us that there is only one such card, the ace of diamonds, in a
deck of 52 cards. Therefore, \(\Pr(A \cap B) = 1/52\), which is equal to
\(\Pr(A) \times \Pr(B) = (1/13) \times (1/4) = 1/52\). We say that the event A is
independent of the event B, meaning that the occurrence of one event does not
depend on the occurrence or nonoccurrence of the other event.

Definition
Statistical independence: A and B are independent if and only if

\[ \Pr(A \cap B) = \Pr(A) \times \Pr(B) \qquad (2.2) \]

Reconsider Example 2.3. The probability that students use Facebook is
\(\Pr(F) = 0.66\); the probability that students use Google+ is \(\Pr(G) = 0.34\),
and \(\Pr(F \cap G) = 0.32\). As
\(\Pr(F) \times \Pr(G) = 0.66 \times 0.34 \approx 0.22 \neq 0.32\), F is
dependent on G. The definition of independence can be extended. For instance, if
the events A, B, and C are statistically independent, then
\(\Pr(A \cap B \cap C) = \Pr(A) \times \Pr(B) \times \Pr(C)\).
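
Both independence checks in this section can be reproduced with a short Python sketch. It is illustrative only: the deck is built as hypothetical (rank, suit) pairs, every card is assumed equally likely to be drawn, and the survey figures are simply the three probabilities quoted in the text.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

def pr(event):
    """Fraction of the 52 cards for which the predicate `event` is true."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_ace(card):
    return card[0] == "A"

def is_diamond(card):
    return card[1] == "diamonds"

pr_a = pr(is_ace)
pr_b = pr(is_diamond)
pr_ab = pr(lambda card: is_ace(card) and is_diamond(card))
print(pr_a, pr_b, pr_ab, pr_ab == pr_a * pr_b)  # 1/13 1/4 1/52 True

# Survey figures quoted in the text: Pr(F) = 0.66, Pr(G) = 0.34, Pr(F and G) = 0.32
pr_f, pr_g, pr_fg = Fraction(66, 100), Fraction(34, 100), Fraction(32, 100)
print(pr_fg == pr_f * pr_g)  # False, so F and G are dependent
```

With observed survey data, an exact equality test is strict. In practice a tolerance or a formal statistical test would normally be used, but the exact comparison mirrors the check made in the text.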

Excel Functions - COUNTIF and COUNTIFS

The functions COUNTIF and COUNTIFS are more advanced counting functions
that can handle more complex examples. We show how to use these two functions
in the following sections.

Function COUNTIF

The function COUNTIF is useful for single-criterion counting.

Syntax
= COUNTIF (Range, Criteria)

Range: the cells to be counted
Criteria: determines which cells are counted (a number, expression, or text string)

Function COUNTIFS

In many cases, we want to count cells only if two or more criteria are met. The
function COUNTIFS allows us to set more than one criterion when categorizing and
counting cells. The function COUNTIFS accepts up to 127 range/criteria pairs.

Syntax
= COUNTIFS (Range 1, Criteria 1, [Range 2, Criteria 2]…)

Range 1: required, the first range in which to evaluate the associated criteria
Criteria 1: required, the criteria that define which cells will be counted
(number, expression, cell reference, or text)
Range 2, Criteria 2,...: optional, additional ranges and their associated criteria

Example 2.6

Ten students take a French course. Let M denote the event that a student’s
midterm score is larger than 90, and let F denote the event that a student’s final
score is larger than 90. Determine whether M and F are statistically independent.

[Solution]

According to the definition of statistical independence, the events A and B are
independent if and only if \(\Pr(A \cap B) = \Pr(A) \times \Pr(B)\). To test whether
M and F are statistically independent, we need to test whether
\(\Pr(M \cap F) = \Pr(M) \times \Pr(F)\).
Therefore, our objective is to obtain \(\Pr(M)\), \(\Pr(F)\), and \(\Pr(M \cap F)\).
In Examples 2.2 and 2.3, we explained how to obtain the sample points using the
functions AND, OR, and IF. In this example, we introduce two more Excel functions,
COUNTIF and COUNTIFS, to obtain the sample points and then calculate the
probabilities. Figures 2.12 and 2.13 show the use of the functions COUNTIF and
COUNTIFS.

Figure 2.12 Using the function COUNTIF in Example 2.6

The function COUNTIF gives the number of students whose score is larger than 90:
six students for the midterm and five students for the final.
According to Eq. 2.1, \(\Pr = \frac{\text{Sample points}}{\text{Sample space}}\):

\[ \Pr(M) = \frac{6}{10} = 0.6 \quad \text{and} \quad \Pr(F) = \frac{5}{10} = 0.5 \]

After obtaining the probabilities that the students’ midterm and final scores
are larger than 90, respectively, the next step is to obtain the probability that
both the midterm and final scores are larger than 90, using the function
COUNTIFS.

Figure 2.13 Using the function COUNTIFS in Example 2.6

$\Pr(M \cap F) = \dfrac{3}{10} = 0.3$

According to Eq. 2.2, $\Pr(A \cap B) = \Pr(A) \times \Pr(B)$:

$\Pr(M \cap F) = 0.3$ and $\Pr(M) \times \Pr(F) = 0.6 \times 0.5 = 0.3$

In conclusion, these two events are independent.
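The counting logic of Example 2.6 can be cross-checked outside Excel. The following is a minimal Python sketch, not the book's spreadsheet: the ten score pairs and the helper names `countif` and `countifs` are hypothetical, chosen only so that the counts match the example (six, five, and three).

```python
import math

# Hypothetical scores for ten students (the book's data are in its Excel file).
midterm = [95, 92, 91, 93, 96, 94, 85, 70, 88, 60]
final = [93, 95, 91, 80, 75, 88, 92, 94, 60, 70]


def countif(values, criterion):
    """Count the values that satisfy one criterion (like COUNTIF)."""
    return sum(1 for v in values if criterion(v))


def countifs(*range_criteria):
    """Count the rows where every range meets its criterion (like COUNTIFS).

    Arguments alternate: range 1, criterion 1, range 2, criterion 2, ...
    """
    ranges = range_criteria[0::2]
    criteria = range_criteria[1::2]
    return sum(
        1
        for row in zip(*ranges)
        if all(test(value) for test, value in zip(criteria, row))
    )


def above_90(score):
    return score > 90


n = len(midterm)
pr_m = countif(midterm, above_90) / n
pr_f = countif(final, above_90) / n
pr_mf = countifs(midterm, above_90, final, above_90) / n

# Independence check of Eq. 2.2, with a tolerance for floating-point rounding.
independent = math.isclose(pr_mf, pr_m * pr_f)
print(pr_m, pr_f, pr_mf, independent)
```

With real data the product rule rarely holds exactly, so the tolerance used in the comparison is a design choice rather than part of the definition.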

2.4 Conditional Probability

In many cases, the probability of an event A is affected by the occurrence
or nonoccurrence of another event. For instance, when you roll a die once, the
probability of landing on 1 is 1/6. However, if we have the extra information that the
die landed on 1, 3, or 5, the probability of landing on 1 changes to 1/3.

Definition
For any two events A and B with $\Pr(B) > 0$, the conditional probability of A given
that B has occurred is defined by:

$\Pr(A \mid B) = \dfrac{\Pr(A \cap B)}{\Pr(B)}$    (2.3)

Conditional probability arises often in daily life, as the following example
shows.

Example 2.7

A survey titled "How many languages can students speak?" has been carried out.
Seventy students participated in the survey. Part of the survey results is
displayed in Table 2.2, and the complete table can be seen in the Excel file named
Chapter two, in the spreadsheet named Example 2.7.

Table 2.2 Survey result about languages students can speak

               Languages students can speak
Student No.    Korean    Cantonese    Mandarin
1              0         1            1
2              0         1            0
3              0         1            0
4              0         0            0
5              1         0            1
6              0         1            0

Note: 1 means the student can speak that language; 0 means the student cannot.

1) What is the probability that a randomly selected student in this class can speak
Korean?
2) What is the probability that a student can speak Korean or Cantonese?
3) Given that a student can speak Cantonese, what is the probability that he or she
can speak Mandarin?

[Solution]

Let K denote the event that a student can speak Korean,
C the event that a student can speak Cantonese, and
M the event that a student can speak Mandarin.

The first step is to find out how many students can speak Korean, Cantonese, and
Mandarin. Because each cell holds 1 or 0, the function SUM gives the counts:
SUM (C4 : C73) = 6, SUM (D4 : D73) = 45, and SUM (E4 : E73) = 21

1)

According to Eq. 2.1, the probability that a student can speak Korean is:

$\Pr(K) = \dfrac{6}{70} \approx 0.086$

2)

To find the probability that a student can speak Korean or Cantonese, the
functions IF and OR in Excel can be used as described earlier. The probability
that a student can speak Korean or Cantonese is: $\Pr(K \cup C) = 0.71$

3)

To find the probability that a student can speak Mandarin given that he or she
can speak Cantonese, apply Eq. 2.3:

$\Pr(M \mid C) = \dfrac{\Pr(M \cap C)}{\Pr(C)} = 0.29$
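The three calculations in Example 2.7 can be sketched in Python as a cross-check of the method. The block uses only the six rows printed in Table 2.2, so its numbers differ from the full 70-student results quoted above; the variable names are illustrative assumptions.

```python
# The six rows shown in Table 2.2 as (Korean, Cantonese, Mandarin); 1 = can speak.
rows = [
    (0, 1, 1),
    (0, 1, 0),
    (0, 1, 0),
    (0, 0, 0),
    (1, 0, 1),
    (0, 1, 0),
]
n = len(rows)

# 1) Pr(K): summing a 0/1 column counts the speakers, like the SUM function.
pr_k = sum(k for k, c, m in rows) / n

# 2) Pr(K or C): a row counts if either indicator is 1, like IF combined with OR.
pr_k_or_c = sum(1 for k, c, m in rows if k or c) / n

# 3) Pr(M | C) = Pr(M and C) / Pr(C), following Eq. 2.3.
pr_c = sum(c for k, c, m in rows) / n
pr_m_and_c = sum(1 for k, c, m in rows if m and c) / n
pr_m_given_c = pr_m_and_c / pr_c

print(pr_k, pr_k_or_c, pr_m_given_c)
```

Dividing the two counts directly (speakers of both languages over Cantonese speakers) gives the same conditional probability, because the sample size cancels.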

2.4.1 Rule of Multiplication

Multiplying both sides of Eq. 2.3 by $\Pr(B)$ gives the multiplication rule.

The Multiplicative rule

$\Pr(A \cap B) = \Pr(A \mid B) \times \Pr(B)$    (2.4)

This rule is important because $\Pr(A \cap B)$ is frequently desired, whereas
$\Pr(A \mid B)$ and $\Pr(B)$ can often be specified easily from the problem description.
By Eq. 2.2, two events are independent when $\Pr(A \cap B) = \Pr(A) \times \Pr(B)$.
Substituting Eq. 2.4 into the left-hand side of Eq. 2.2 gives
$\Pr(A \mid B) \times \Pr(B) = \Pr(A) \times \Pr(B)$. After cancelling $\Pr(B)$, we
obtain $\Pr(A \mid B) = \Pr(A)$. Similarly, $\Pr(B \mid A) = \Pr(B)$.

Definition
A and B are statistically independent if $\Pr(A \mid B) = \Pr(A)$, and dependent otherwise.

When two events are statistically independent, the chance that A occurs
is not affected by the knowledge that B has occurred, which means that
$\Pr(A \mid B) = \Pr(A)$ or, equivalently, $\Pr(B \mid A) = \Pr(B)$.

Example 2.8

A girl has three coats and two bags, which means that she has six ways to match
a coat with a bag. Let $A_i$ denote the event that she selects coat i, for i = 1, 2, 3,
with Pr ( A1 ) = 0.25, Pr ( A2 ) = 0.50, Pr ( A3 ) = 0.25. After choosing the coat, the
next step is to choose the bag. Let $B$ denote the event that the girl chooses the first
bag, and $\bar{B}$ the event that she chooses the second bag. Given that she has chosen
the first coat, the probability that she chooses the first bag is 60%, whereas the
corresponding percentages for the second and third coats are 40% and 20%,
respectively.

1) What is the probability that the girl chooses the first coat and the first bag?
2) What is the probability that the girl chooses the first bag, whichever coat she
has picked?

[Solution]

When the experiment is relatively complex, a tree diagram is helpful for laying
out the possible situations. Figure 2.14 shows a tree diagram that depicts the
experimental situation.

Figure 2.14 Tree diagram for Example 2.8


1)

The probability that the girl chooses the first coat and the first bag is:

$\Pr(A_1 \cap B) = \Pr(A_1) \times \Pr(B \mid A_1) = 0.25 \times 0.60 = 0.15$

2)

The probability that the girl chooses the first bag, whichever coat she has picked, is:

$\Pr(B) = \Pr(A_1 \cap B) + \Pr(A_2 \cap B) + \Pr(A_3 \cap B) = 0.15 + 0.20 + 0.05 = 0.40$
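The tree-diagram calculation can be sketched in Python. The block applies the multiplication rule (Eq. 2.4) along each branch and then sums the branches that end in the first bag; it uses the coat probabilities and the 60%, 40%, and 20% figures exactly as stated in the example, and the dictionary names are illustrative assumptions.

```python
# Probabilities as stated in Example 2.8.
pr_coat = {1: 0.25, 2: 0.50, 3: 0.25}            # Pr(A_i)
pr_bag1_given_coat = {1: 0.60, 2: 0.40, 3: 0.20}  # Pr(B | A_i)

# Multiplication rule on each branch: Pr(A_i and B) = Pr(A_i) * Pr(B | A_i).
joint = {i: pr_coat[i] * pr_bag1_given_coat[i] for i in pr_coat}

# Summing over the three mutually exclusive coats gives Pr(B).
pr_bag1 = sum(joint.values())

print("Pr(A1 and B) =", round(joint[1], 4))
print("Pr(B) =", round(pr_bag1, 4))
```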

2.5 Summaries of Excel Functions

In this chapter, we put much emphasis on introducing the logical functions AND,
OR, and IF, which are useful in the decision-making process. The statistical
functions COUNTIF and COUNTIFS are also important functions for more
advanced counting.

Table 2.3 Summary of the built-in functions

FUNCTION    How it works                                          Notes
AND         Returns TRUE if all its arguments are TRUE.           Ex. 2.2 & 2.3
OR          Returns TRUE if any argument is TRUE.                 Ex. 2.2 & 2.3
IF          Specifies a logical test to perform.                  Ex. 2.2 & 2.3
COUNTIF     Counts the number of cells within a range that        Ex. 2.6
            meet a single criterion that you specify.
COUNTIFS    Applies criteria to cells across multiple ranges      Ex. 2.6
            and counts the number of times all criteria are
            met.

Analytical Models of Random Phenomena

3.1 Introduction

In the previous chapters, we learned the concepts of uncertainty and
probability. In this chapter, we introduce the concepts of random variables
and probability functions, particularly for some commonly used probability
distributions such as the normal distribution. Furthermore, some Excel
functions related to these distributions are elaborated, together with
the fundamental ideas of Visual Basic for Applications (VBA).

3.2 Random Variable

In reality, many outcomes are random, and these possible outcomes can be
represented by numerical values. Assigning a number to each point in a sample space
defines a function on that sample space, which is called a random variable.
A random variable allows us to transform experimental outcomes into numerical
values.

Definition
Random Variable: A random variable is a real-valued function on a sample space.

Random variables are usually denoted by capital letters such as X, Y, and Z.
Lowercase letters are used to represent the values of the corresponding random
variables. For instance, when tossing a coin three times, all possible outcomes are S =
{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}. Let X denote the number of heads
obtained; the possible values of X are 0, 1, 2, and 3.
There are two different types of random variable: the discrete random
variable and the continuous random variable.

Definition
Discrete Variable: A random variable is discrete if its possible values form a finite
set or can be listed in an infinite sequence of numbers.

Continuous Variable: A random variable is continuous if its set of possible values
consists either of all numbers in a single interval on the number line or of all numbers
in a disjoint union of such intervals.

For instance, random phenomena such as the number of typhoons per year
and the number of points obtained when rolling a die are described by discrete random
variables, because these outcomes are countable; in contrast, random
phenomena such as the time to complete a project and the lifetime of an electronic
component are described by continuous random variables, as the values of these
random variables can lie anywhere in an interval.

3.3 Probability Distribution of Random Variables

Distribution functions are used to describe random variables. The
probability distribution of X says how the total probability of 1 is distributed among
the possible values of X. Recall the example of tossing a coin three times and counting
the number of heads. Let X denote the number of heads obtained; the possible
values of X are 0, 1, 2, and 3. The probability distribution tells us how the
probability of 1 is subdivided among these four possible values.
The following functions are generally used to describe the probability
distributions of random variables.

Definition
For a discrete random variable X:
Probability Mass Function: The Probability Mass Function (PMF) of X is defined for
every number x by $p(x) = \Pr(X = x)$.
Cumulative Distribution Function: The Cumulative Distribution Function (CDF) of X
is defined for every number x by $F(x) = \Pr(X \le x) = \sum_{y:\, y \le x} p(y)$.

When tossing a coin three times, the number of heads is countable, so
this variable is discrete. Let X denote the number of heads obtained; the
possible values of X are 0, 1, 2, and 3. Counting the outcomes in S gives
$\Pr(X = 0) = 1/8$, $\Pr(X = 1) = 3/8$, $\Pr(X = 2) = 3/8$, and $\Pr(X = 3) = 1/8$.
The PMF, written here as f(x), and the associated CDF F(x) are as follows:

$f(x) = \begin{cases} 1/8 & x = 0 \\ 3/8 & x = 1 \\ 3/8 & x = 2 \\ 1/8 & x = 3 \end{cases}$
$F(x) = \begin{cases} 0 & x < 0 \\ 1/8 & 0 \le x < 1 \\ 4/8 & 1 \le x < 2 \\ 7/8 & 2 \le x < 3 \\ 1 & x \ge 3 \end{cases}$

The graphs for this example appear in Figure 3.1.

Figure 3.1 A PMF and associated CDF
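The PMF and CDF of the coin example can be reproduced by enumerating the sample space. This is a minimal Python sketch; the names `pmf` and `cdf` are illustrative, and exact fractions are used so that the values can be compared with 1/8, 3/8, and so on without rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space of three coin tosses: 8 equally likely outcomes.
outcomes = list(product("HT", repeat=3))

# X = number of heads in each outcome; count how often each value occurs.
heads = Counter(outcome.count("H") for outcome in outcomes)

# PMF: Pr(X = x) for each possible value x.
pmf = {x: Fraction(heads[x], len(outcomes)) for x in sorted(heads)}


def cdf(x):
    """F(x) = Pr(X <= x): sum the PMF over all values y <= x."""
    return sum((p for y, p in pmf.items() if y <= x), Fraction(0))


for x, p in pmf.items():
    print(x, p, cdf(x))
```

The CDF is defined for every real x, not only for 0, 1, 2, and 3; for example, `cdf(2.5)` equals `cdf(2)` because no probability mass lies between 2 and 2.5.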

However, if X is continuous, that is, if its set of possible values is an entire
interval of numbers, the distribution functions are different.

Definition
Probability Density Function (PDF): Let X be a continuous random variable.
The PDF of X is a function f(x) such that, for any two numbers a and b with $a \le b$,

$\Pr(a \le X \le b) = \int_a^b f(x)\,dx$

Cumulative Distribution Function (CDF): F(x) for a continuous random
variable X is defined for every number x by:

$F(x) = \Pr(X \le x) = \int_{-\infty}^{x} f(y)\,dy$

For each x, F(x) is the area under the density curve to the left of x.

The three types of functions described above (PMF, PDF, and CDF) are shown in Figure 3.2.

Figure 3.2 Probability functions (panels: Discrete Probability Function; Continuous Probability Function)

3.4 Useful Probability Distribution

In this section, we introduce six commonly used families of distributions:
the uniform, normal, lognormal, binomial, Poisson, and exponential
distributions. Remarkably, very different phenomena can be adequately described
by the same mathematical model. Taking the normal distribution as an example, many
natural phenomena, such as the height and weight of a specified group and the scores
on the SAT, can be modeled by the family of normal distributions.

3.4.1 Uniform Distribution

The simplest example of a continuous distribution is the uniform distribution.
When the random variable X follows the uniform distribution over the interval
[ a , b ], every value in the interval is equally likely. Generally, the uniform
distribution is used when a value is picked at random from a given interval, so that
the probability of falling in any subinterval is determined only by the length of
that subinterval.
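The "length only" property can be illustrated with a short Python sketch. It assumes the standard uniform density 1/(b - a) on [a, b] and the matching cumulative function (x - a)/(b - a); the function names and the interval [0, 10] are illustrative choices.

```python
def uniform_pdf(x, a, b):
    """Density of the uniform distribution on [a, b]."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def uniform_cdf(x, a, b):
    """Pr(X <= x) for X uniform on [a, b]."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)


a, b = 0.0, 10.0

# Two subintervals of equal length 3 at different positions in [a, b].
p_low = uniform_cdf(5.0, a, b) - uniform_cdf(2.0, a, b)
p_high = uniform_cdf(9.0, a, b) - uniform_cdf(6.0, a, b)

# Both probabilities equal length / (b - a), up to floating-point rounding.
print(p_low, p_high)
```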

Definition
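Uniform Distribution: A random variable X is said to be uniform over the interval
[ a , b ] if its PDF and CDF are:

$f(x) = \begin{cases} \dfrac{1}{b-a} & \text{if } x \in [a, b] \\ 0 & \text{if } x \notin [a, b] \end{cases}$    (3.1)

$F(x) = \begin{cases} 0 & \text{if } x < a \\ \dfrac{x-a}{b-a} & \text{if } x \in [a, b] \\ 1 & \text{if } x > b \end{cases}$    (3.2)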

3.4.2 Normal Distribution

The best known and most widely used probability distribution is the normal
distribution. Many natural phenomena can be modeled by the family of normal
distributions. As mentioned above, the height and weight of a specified group and
the scores on the SAT are modeled by normal distributions.
Moreover, phenomena such as the errors in scientific experiments, blood pressure,
and the time you spend travelling from your home to UST each day all fit closely to
an appropriate normal curve.
Furthermore, several useful distributions, such as the t-distribution and the
chi-squared distribution, are based on the normal distribution; we will encounter
and elaborate on them later in this book.

Definition
Normal Distribution: The probability density function for the normal distribution
is given by

$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$    (3.3)

where µ is the mean of the theoretical distribution and σ is the standard deviation.

When µ = 0 and σ = 1, the PDF of the normal distribution is

$f(z) = \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} z^2}$,

which is referred to as the standard normal distribution. A standard normal
variable, denoted by the capital letter Z, is obtained from a Normal ( µ , σ )
random variable X by the process of standardizing:

$Z = \dfrac{X - \mu}{\sigma}$

That is, Z is the ratio between X − µ and σ. After standardizing, you can
compute probabilities for any random variable that follows a normal distribution.


It is unnecessary to remember the PDF and CDF of the normal distribution, as
they are complex, and no one likes to integrate every single time. In the past,
distribution tables were used to obtain the probabilities; now you do not need
the tables as long as you have the Excel functions. Excel provides a variety of
statistical functions that can be used for common calculations as well as for more
complex statistical distributions and probability tests.

Excel Functions - NORMSDIST, NORMDIST, NORMSINV, NORMINV

Function NORMSDIST

This function returns the standard normal cumulative distribution function with
µ = 0 and σ = 1 .

Syntax
= NORMSDIST (z)

z: the value at which to evaluate the function

Function NORMDIST

This function returns the normal distribution for the specified mean and standard
deviation. This function has a very wide range of applications in statistics, including
hypothesis testing.

Syntax
= NORMDIST(x, mean, standard_dev, cumulative)

x: the value for which you want the distribution
mean: the arithmetic mean of the distribution
standard_dev: the standard deviation of the distribution
cumulative: FALSE returns the PDF; TRUE returns the CDF

Function NORMSINV

This function returns the inverse of the standard normal cumulative distribution.
The distribution has µ = 0 and σ = 1 .

Syntax
= NORMSINV (probability)

probability: the probability value (between 0 and 1)

Function NORMINV

This function returns the inverse of the normal cumulative distribution for the
specified mean and standard deviation.

Syntax
= NORMINV (probability, mean, standard_dev)

probability: the probability value (between 0 and 1)
mean: the arithmetic mean of the distribution
standard_dev: the standard deviation of the distribution
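For readers who want to check the four functions outside Excel, the sketch below mirrors them in Python using `statistics.NormalDist` from the standard library (Python 3.8 or later). The lowercase wrapper names are assumptions made for this illustration; they are not part of Excel or of Python.

```python
from statistics import NormalDist


def normsdist(z):
    """Standard normal CDF, like NORMSDIST(z)."""
    return NormalDist().cdf(z)


def normdist(x, mean, standard_dev, cumulative):
    """Normal CDF (cumulative=True) or PDF (cumulative=False), like NORMDIST."""
    dist = NormalDist(mean, standard_dev)
    return dist.cdf(x) if cumulative else dist.pdf(x)


def normsinv(probability):
    """Inverse of the standard normal CDF, like NORMSINV."""
    return NormalDist().inv_cdf(probability)


def norminv(probability, mean, standard_dev):
    """Inverse of the normal CDF, like NORMINV."""
    return NormalDist(mean, standard_dev).inv_cdf(probability)


print(round(normsdist(1.0), 4))
print(round(normdist(1.0, 0.0, 1.0, True), 4))
print(round(normsinv(0.975), 2))
print(round(norminv(0.5, 10.0, 2.0), 4))
```

The first two calls return the same value because the standard normal distribution is the normal distribution with mean 0 and standard deviation 1.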

If you look carefully, you will find a normal distribution table in almost every
probability and statistics textbook. Distribution tables were prepared by
mathematicians to simplify the calculation process, because integration is not an
easy job and no one likes to integrate every single time. Nowadays, regardless of
the complexity of the mathematical theory, you can make your own standard normal
distribution table with Excel functions. Example 3.1 shows how to do so.

Example 3.1

Build a standard normal distribution table using Excel functions.

[Solution]

Generally, the standard normal distribution table contains two parts: one is the
standard normal variable z, and the other is the corresponding probability. It is easy
to build any kind of probability table using Excel functions. Taking the commonly
used standard normal distribution table as an example, the major steps are as
follows:
First, list the z values from 0.00 to 3.49 in an Excel spreadsheet, as shown in
Figure 3.3.
Second, calculate the probability for each z with the function NORMSDIST or
NORMDIST. Figures 3.3 and 3.4 show the process of using these two
functions.

Figure 3.3 Draw the normal distribution table by the function NORMSDIST

As the function NORMSDIST directly returns the standard normal cumulative
distribution function, you can simply put the z value in the function to obtain the
corresponding probability.
For the function NORMDIST, you need to supply four arguments; for the standard
normal table these are the z value, a mean of 0, a standard deviation of 1, and TRUE
for the cumulative argument. Figure 3.4 shows the process of using the function
NORMDIST.

Figure 3.4 Draw the normal distribution table by the function NORMDIST

Tips:

Cell references

Sometimes cell references must be fixed when using or copying an Excel formula.
There are three types of references: relative, absolute, and mixed. Table 3.1
defines these three types.

Table 3.1 Types of cell references

Name                  Definition                               Example
Relative Reference    The row and column references            = A1:B3
                      can change when you copy the
                      formula to another cell.
Absolute Reference    The row and column references            = $A$1:$B$3
                      do not change when you copy the
                      formula.
Mixed Reference       Either the row or the column             = $A1:B$3
                      reference is relative, and the
                      other is absolute.

We use the dollar sign ($) to fix a cell's location. After a dollar sign is added to a
reference, that part of the reference does not change automatically when the cell is
copied or pasted. You can enter the $ manually in the appropriate positions by
pressing SHIFT + 4, or use a handy shortcut, the F4 key. For instance, if you enter
= A1 to start a formula, pressing F4 converts the cell reference to = $A$1; pressing F4
again converts it to = A$1; pressing it again converts it to = $A1; pressing it one more
time returns to the original = A1.
In Example 3.1, we use mixed references. Figure 3.5 shows an example of
using mixed reference cells.

Figure 3.5 Using the mixed reference cells (panels: Fixing only the column; Fixing only the row)

Cells formatting

In general, cell formatting is not strictly necessary, but changing settings such as
the font color or the number of decimal places can make your tables and worksheets
look more professional and attractive. Excel provides three ways to format
cells:
(1) Using the Home tab of the Ribbon (shown in Figure 3.6).
(2) Using the Mini toolbar that appears when you right-click the cells (shown in Figure 3.7).
(3) Using the Format Cells dialog box, opened by pressing Ctrl + 1 (shown in Figure 3.8).

Figure 3.6 Formatting tools in the Home tab

Figure 3.7 The mini toolbar to format the cells



Figure 3.8 Format Cells dialog box

All three ways are available for formatting cells, and you can choose whichever
you like.
In Example 3.1, the font size of the numbers is adjusted, the numbers are centered,
and the number of decimal places is fixed. After the appropriate number of decimal
places is set, the table is complete. Part of the result is displayed in Table 3.2, and the
complete table can be seen in the Excel file named Chapter 3, in the spreadsheet
named Example 3.1.

Table 3.2 The standard normal distribution table

z     0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0   0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
0.1   0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
0.2   0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
0.3   0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
0.4   0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879
0.5   0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
0.6   0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
0.7   0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
0.8   0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
0.9   0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
1.0   0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
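The same table can be generated programmatically as a cross-check. The Python sketch below uses `statistics.NormalDist` from the standard library in place of NORMSDIST; storing the table under integer keys (row in tenths, column in hundredths) is an illustrative design choice that avoids floating-point keys.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# table[(row, col)] holds the CDF at z = row/10 + col/100, rounded to 4 places.
# Rows 0..34 and columns 0..9 cover z = 0.00 to 3.49.
table = {
    (row, col): round(std_normal.cdf((10 * row + col) / 100), 4)
    for row in range(35)
    for col in range(10)
}

# Print the first rows in the layout of a printed z table.
print("z    " + "  ".join(f"{col / 100:.2f}  " for col in range(10)))
for row in range(3):
    cells = "  ".join(f"{table[(row, col)]:.4f}" for col in range(10))
    print(f"{row / 10:.1f}  {cells}")
```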

VBA — Visual Basic for Applications

As mentioned in Chapter 1, Excel provides an easily learned macro language
(VBA) that helps you create structured programs directly in Excel. If you have ever
wanted to automate routine operations so that you do not have to perform boring
and repetitive tasks manually, Visual Basic for Applications (VBA) is suitable for
you. VBA is a powerful programming language that can also be used to develop
worksheet functions that are not built into Excel. Once a macro has been developed,
you can execute it to perform many time-consuming procedures automatically.

Starting VBA

All VBA work is done in the Visual Basic Editor (VBE). The VBA modules
are invisible unless you activate the VBE. There are two ways to activate the VBE:
1. Press Alt+F11
2. Choose Developer > Code > Visual Basic

Your Excel Ribbon may not show the Developer tab. In that case, turn on
the Developer tab as follows:

1. Choose Office > Excel Options
2. Click the Popular tab in the Excel Options dialog box
3. Place a checkmark next to Show Developer Tab in the Ribbon

After performing these steps, Excel displays a new tab named Developer, which
is shown in Figure 3.9.

Figure 3.9 Displaying Excel's Developer tab



After activating the VBE, you will see a VBE window like the one in Figure 3.10. The
upper-left corner of the window shows the Project Explorer, which lists all projects
currently open, and the lower-left corner shows the Properties window. You write
the code in the code window on the right side.

Figure 3.10 The Visual Basic Editor window (labeled parts: the menu bar, the Project Explorer window, the Properties window, and the code window)

Note that your VBE window may not look exactly like the window
shown in Figure 3.10.

Entering the VBA code

Before you can do anything meaningful, you must have some VBA code in a
code window. This VBA code must be written within a procedure, and a procedure
consists of VBA statements. Sub and Function procedures are the two kinds most
widely used in VBA programming.

Sub procedures: A Sub procedure is a set of instructions that performs some action.
Function procedures: A Function procedure is a set of instructions that returns a
single value or an array.

You can add code to a VBA module in two ways:

1. Enter the code manually: Use the keyboard to type the code.
2. Use the macro-recorder feature: Use Excel's macro recorder to record
your actions and convert them into VBA code.

Storing the VBA code

The VBA code is usually stored in a VBA module. You can insert a module by
choosing Insert > Module, as shown in Figure 3.11.

Figure 3.11 Insert the Modules.

Executing the VBA code

To run a program, you can press F5 directly or use the Run menu, as
shown in Figure 3.12.

Figure 3.12 Executing the VBA code (labeled parts: the Run Macro button; the Run Sub/UserForm menu selection)

Saving workbooks that contain macros

The first time you save a workbook that contains macros, the default file
format is XLSX, which cannot contain macros. Excel displays a warning, which is
shown in Figure 3.13. Choose the No option.

Figure 3.13 Excel warns when saving a workbook that contains macros

After choosing the No option, choose the file type called Excel
Macro-Enabled Workbook, as shown in Figure 3.14. The file must be saved
with an XLSM extension.

Figure 3.14 Changing the file format to XLSM

Having introduced the fundamentals of VBA, let us look at an example.

Example 3.2

The proficiency test in Mandarin is one of the most popular tests in Hong Kong
nowadays. According to the Hong Kong Examination and Assessment Authority
(HKEAA), the test has four classes: A ( X ≥ 90 ), B ( 80 ≤ X < 90 ), C
( 60 ≤ X < 80 ), and D ( X < 60 ). In 2011, 600 students at UST took this
test. The results are in the Excel file named Chapter 3, in the spreadsheet
named Example 3.2.

1) Classify the students' marks into grades A, B, C, and D.
2) Obtain the probabilities that a student gets A, B, C, and D.
3) Suppose the students who take the Mandarin test in 2012 are at a level
comparable with the students in 2011. To ensure that 95 percent of
students obtain the certification, what is the lowest mark a student should need?

[Solution]

1)

Using Excel Function

To categorize the students' marks into A, B, C, and D, the function IF can be used as
mentioned earlier; the result is shown in Figure 3.15.

Figure 3.15 Using the function IF in Example 3.2

After the marks have been transformed into A, B, C, and D, the number of
students who get each grade can be counted by the function COUNTIF.
The number of students who get A = COUNTIF (D7:D607,"A") = 121; the
number of students who get B = COUNTIF (D7:D607,"B") = 257; the number of
students who get C = COUNTIF (D7:D607,"C") = 162; the number of students who
get D = COUNTIF (D7:D607,"D") = 60.

Using VBA

Besides using Excel functions, another way to categorize the students' grades is to
use VBA. After activating the VBE window, you can write your code in a code
module. In Example 3.2, Code 3.1 categorizes the students' marks into A, B,
C, and D automatically, and Code 3.2 counts the number of students
who get A, B, C, and D.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code: 3.1

*********************************************************************
'Purpose: To decide which grade each mark belongs to.

'*********start of coding***************************************

Sub test()
    Dim i As Long
    'Marks are read from column 4 (D); grades are written to column 10 (J).
    For i = 7 To 606
        'Cases are tested in order, so a boundary mark such as 90 gets the higher grade.
        Select Case Cells(i, 4).Value
            Case 90 To 100
                Cells(i, 10).Value = "A"
            Case 80 To 90
                Cells(i, 10).Value = "B"
            Case 60 To 80
                Cells(i, 10).Value = "C"
            Case 0 To 60
                Cells(i, 10).Value = "D"
        End Select
    Next i
End Sub
'************************************end of coding************

This macro uses three key techniques, whose roles are detailed as follows:

1. Sub…End Sub

VBA code must be written within a procedure; a Sub procedure is used in
Code 3.1.

2. Select Case (conditional)

This structure is useful when choosing among three or more options. In this
example, the first block of code is executed if the score is between 90 and 100,
and the letter A is written to the corresponding cell; the second block of code is
executed if the score is between 80 and 90, and the letter B is written to the
corresponding cell; and so on. The cases are tested from top to bottom and only the
first match is executed, so a score of exactly 90 receives an A.

3. Apostrophe (')

Any text that follows an apostrophe (') is ignored when the code is executed,
so you can use it to explain your code.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code: 3.2

*********************************************************************
'Purpose: To count the number of students who get A, B, C, and D.

'***start of coding**************************************************

Sub count()
    Dim i As Long, n As Long

    'Each block scans 1000 cells starting at J7 and counts one grade.
    n = 0
    For i = 1 To 1000
        If Range("J7").Cells(i, 1).Value = "A" Then
            n = n + 1
        End If
    Next i
    MsgBox n & " students get A", vbOKOnly, "test"

    n = 0
    For i = 1 To 1000
        If Range("J7").Cells(i, 1).Value = "B" Then
            n = n + 1
        End If
    Next i
    MsgBox n & " students get B", vbOKOnly, "test"

    n = 0
    For i = 1 To 1000
        If Range("J7").Cells(i, 1).Value = "C" Then
            n = n + 1
        End If
    Next i
    MsgBox n & " students get C", vbOKOnly, "test"

    n = 0
    For i = 1 To 1000
        If Range("J7").Cells(i, 1).Value = "D" Then
            n = n + 1
        End If
    Next i
    MsgBox n & " students get D", vbOKOnly, "test"

End Sub

'**************************************end of coding ************************

This macro uses three key techniques, whose roles are detailed as follows:

1. If – Then (conditional)
The If-Then construct is a widely used structure for executing statements
conditionally. The basic structure is as follows:
If (condition) Then
'the code statement
End If

The code is executed if the condition is true, and skipped otherwise. In this example,
the first piece of code adds one to the counter n whenever the value of the cell
is equal to A and skips the cell otherwise; the second piece of code does the same
for the value B, and so on. Notice that each If statement has a corresponding
End If statement.

2. MsgBox (show the results)


The MsgBox function outputs a message to the user in a message
box like the one shown in Figure 3.16.

The simplified syntax for MsgBox is:

Syntax
MsgBox (prompt, [buttons], [title])

prompt: required, the message that you want the user to read
buttons: optional, specifies which buttons appear in the message box (for example, vbOKOnly)
title: optional, the text that appears in the message box title bar

In this example, the MsgBox function displays a dialog box reporting how many
students get A, B, C, and D (shown in Figure 3.16).

MsgBox n & " students get A", vbOKOnly, "test"

Figure 3.16 The Message Box to display the message

3. Ampersand (&):
The ampersand (&) is used to concatenate strings. In the above code, the number
of students (n) and the text "students get A" are concatenated, which is shown
in Figure 3.16.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
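For comparison, the logic of Codes 3.1 and 3.2 can be sketched in Python. This is not a translation of the book's macros: the eight marks are hypothetical stand-ins for the 600 marks in the Excel file, and the function name `grade` is an assumption. Testing the boundaries from the top down plays the same role as the ordered Select Case blocks.

```python
from collections import Counter


def grade(mark):
    """Return the class for a mark: A (>= 90), B (80-90), C (60-80), D (< 60)."""
    if mark >= 90:
        return "A"
    if mark >= 80:
        return "B"
    if mark >= 60:
        return "C"
    return "D"


# Hypothetical marks; the real data are in the workbook for Example 3.2.
marks = [95, 90, 89.5, 80, 79, 60, 59, 42]

# Step 1 (Code 3.1): classify every mark.
grades = [grade(m) for m in marks]

# Step 2 (Code 3.2): count how many students received each grade.
counts = Counter(grades)
for letter in "ABCD":
    print(f"{counts[letter]} students get {letter}")
```

A single counter replaces the four separate loops; whether that is clearer than four explicit blocks is a matter of taste.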

2)

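According to the raw data, the functions AVERAGE and STDEV give E(X) and
the standard deviation (S.D.):
E(X) = AVERAGE (E5:E604) = 78.42
S.D. = STDEV (E5:E604) = 14.99
The marks are then modeled as a normal random variable with this mean and
standard deviation.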

Using Table

To obtain the probability that a student gets A, the first step is to standardize the
boundary mark:

$z = \dfrac{90 - 78.42}{14.99} = 0.7725$

The second step is to check the standard normal distribution table (shown in
Appendix Table A.1). We locate the row for the first decimal of z (0.7) and the column
for the second decimal (0.07) and read $\Phi(0.77) = 0.7794$. Therefore,
$\Pr(X \ge 90) = 1 - 0.7794 = 0.2206$. The process for obtaining the
probabilities that a student gets B, C, and D is similar.

The probability that a student gets B:

$z = \dfrac{80 - 78.42}{14.99} = 0.105$, $\Phi(0.11) = 0.5438$, so $\Pr(X \ge 80) = 1 - 0.5438 = 0.4562$
and $\Pr(80 \le X < 90) = 0.4562 - 0.2206 = 0.2356$

The probability that a student gets C:

$z = \dfrac{60 - 78.42}{14.99} = -1.229$; as the normal distribution is symmetric,
$\Pr(X \ge 60) = \Phi(1.23) = 0.8907$, so $\Pr(60 \le X < 80) = 0.8907 - 0.4562 = 0.4345$

The probability that a student gets D:
$\Pr(X < 60) = 1 - 0.8907 = 0.1093$

Using Excel Function

The function NORMDIST can be used to obtain the probabilities. Figures 3.17
and 3.18 show the process to obtain the probability:

Figure 3.17 Using the function NORMDIST in Example 3.2

Figure 3.18 Using the function NORMDIST in Example 3.2



Similarly,
Pr (60<X<80) =NORMDIST(80, P6, P7, TRUE)-NORMDIST(60, P6, P7,TRUE)
=0.43, and Pr (X< 60 ) =NORMDIST(60, B6, B7, TRUE) = 0.11.
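The four class probabilities can also be checked in Python with `statistics.NormalDist` from the standard library. The sketch assumes the marks are normal with the mean 78.42 and standard deviation 14.99 quoted above; small differences from the table-based figures come from rounding z to two decimals when reading a printed table.

```python
from statistics import NormalDist

# Normal model for the marks, using the sample mean and standard deviation.
marks = NormalDist(mu=78.42, sigma=14.99)

pr_a = 1 - marks.cdf(90)              # Pr(X >= 90)
pr_b = marks.cdf(90) - marks.cdf(80)  # Pr(80 <= X < 90)
pr_c = marks.cdf(80) - marks.cdf(60)  # Pr(60 <= X < 80)
pr_d = marks.cdf(60)                  # Pr(X < 60)

for label, p in zip("ABCD", (pr_a, pr_b, pr_c, pr_d)):
    print(label, round(p, 2))
```

Because the four classes partition the whole range of marks, the four probabilities add up to 1, which is a quick sanity check on the calculation.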

3)

In this question, we want to find the lowest mark N that a student must obtain so
that 95% of students receive the certification, that is, $\Pr(X \ge N) = 0.95$.
The objective is to obtain the z value first, and then the mark.

Using Table

From the table, $\Phi(1.65) \approx 0.95$, so by symmetry $\Pr(Z \ge -1.65) \approx 0.95$.
Solving

$\dfrac{N - 78.42}{14.99} = -1.65$

gives $N = 78.42 - 1.65 \times 14.99 \approx 53.7$.

Using Excel Function

To decide the lowest mark that a student should get, the function NORMINV
can be used:

Figure 3.19 Using the function NORMINV in Example 3.2

Therefore, a student gets the certification when the score is higher than 53.75.
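The inverse calculation can be sketched in Python with `statistics.NormalDist.inv_cdf`, which plays the role of NORMINV here. The sketch assumes the rounded mean 78.42 and standard deviation 14.99, so its result can differ in the second decimal from a value computed in Excel from the unrounded statistics.

```python
from statistics import NormalDist

marks = NormalDist(mu=78.42, sigma=14.99)

# If 95% of students must score at least N, then Pr(X < N) = 0.05.
cutoff = marks.inv_cdf(0.05)

print(round(cutoff, 2))

# Check: the share of students at or above the cutoff.
share_passing = 1 - marks.cdf(cutoff)
print(round(share_passing, 4))
```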

3.4.3 Lognormal Distribution

The lognormal distribution is also a popular probability distribution. Suppose a
random variable X does not itself follow the normal distribution but its logarithm
$\ln X$ does; then X is said to follow the lognormal distribution.
50 3: Analytical Models of Random Phenomena

Definition
Lognormal Distribution: A nonnegative random variable X is said to have a
lognormal distribution if the random variable - +, has a normal distribution.
The mean and standard deviation of the lognormal distribution can be calculated as
follows:

ς
2
= ln{1 + (σ µ ) 2 } 3.4

λ = ln u − 0.5ς 2 3.5

It is unnecessary to remember the PDF and CDF of the lognormal distribution, as
they are long and complex. One way to evaluate lognormal probabilities is to use
the standard normal distribution table, since ln X is normally distributed.
Alternatively, instead of using the table, the cumulative lognormal probability can
be obtained from Excel functions.

Excel Functions – LOGNORMDIST and LOGINV

Function LOGNORMDIST

This function returns the cumulative lognormal distribution of x, where ln(x) is
normally distributed with parameters mean and standard_dev.

Syntax
=LOGNORMDIST (x, mean, standard_dev)

x: the value at which to evaluate the function
mean: the mean of ln (x)
standard_dev: the standard deviation of ln (x)

Function LOGINV

This function returns the inverse of the lognormal cumulative distribution
function of x.

Syntax
=LOGINV (probability, mean, standard_dev)

probability: probability associated with the lognormal distribution
mean: the mean of ln (x)
standard_dev: the standard deviation of ln (x)

Example 3.3

According to previous observations, the mean and standard deviation of the
weight of UST’s students are 136.33 pounds and 26.59 pounds, respectively. Which
probability distribution the random variable follows is not known.
1) Suppose the random variable follows the normal distribution. What is the
probability that a student’s weight is heavier than 160 pounds?
2) Suppose the random variable follows the lognormal distribution. What is the
probability that a student’s weight is heavier than 160 pounds?

[Solution]

1)

The random variable follows the normal distribution.

Using Table

The first step is to standardize the value:

$$Z = \frac{160 - 136.33}{26.59} = 0.89$$

The second step is to check the standard normal distribution table to obtain the
probability. Part of the table is shown as follows:

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133

We can obtain that Pr(X > 160) = 1 − 0.81 = 0.19.

Using Excel Function

The functions NORMSDIST or NORMDIST can be used:


Pr(X > 160) = 1 − NORMSDIST(Z) = 1 − NORMSDIST(0.89) = 1 − 0.81 = 0.19
Pr(X > 160) = 1 − NORMDIST(x, mean, std, cumulative)
= 1 − NORMDIST(160, 136.33, 26.59, TRUE) = 1 − 0.81 = 0.19

2)

Random variable follows the lognormal distribution.

Using Excel Function

As the random variable follows the lognormal distribution, to obtain the
probability that a student’s weight is heavier than 160 pounds, the function
LOGNORMDIST can be used. The first step is to obtain the parameters λ and ζ.
According to Eqs. 3.4 and 3.5, the parameters can be obtained as follows:

$$\zeta^2 = \ln\{1 + (\sigma/\mu)^2\} = \ln\{1 + (26.59/136.33)^2\} = 0.037, \quad \zeta = 0.19$$

$$\lambda = \ln\mu - 0.5\,\zeta^2 = \ln 136.33 - 0.5 \times 0.037 = 4.90$$

Then, the probability Pr(X > 160) can be obtained:
Pr(X > 160) = 1 − LOGNORMDIST(x, mean, standard_dev)
= 1 − LOGNORMDIST(160, 4.90, 0.19) = 1 − 0.82 = 0.18
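
The parameter conversion of Eqs. 3.4 and 3.5 and the resulting probability can be cross-checked with a short Python sketch. It relies only on the fact that ln X is normal; the function names lognormal_params and lognormal_exceedance are hypothetical helpers introduced for this illustration.

```python
from math import log, sqrt
from statistics import NormalDist


def lognormal_params(mean, sd):
    """Return (lam, zeta): the mean and standard deviation of ln X (Eqs. 3.4, 3.5)."""
    zeta = sqrt(log(1 + (sd / mean) ** 2))
    lam = log(mean) - 0.5 * zeta ** 2
    return lam, zeta


def lognormal_exceedance(x, mean, sd):
    """Pr(X > x) for a lognormal X with the given arithmetic mean and sd."""
    lam, zeta = lognormal_params(mean, sd)
    # ln X is normal with mean lam and standard deviation zeta.
    return 1 - NormalDist(lam, zeta).cdf(log(x))


lam, zeta = lognormal_params(136.33, 26.59)
p_heavy = lognormal_exceedance(160, 136.33, 26.59)
print(f"lambda = {lam:.2f}, zeta = {zeta:.2f}")
print(f"Pr(X > 160) = {p_heavy:.2f}")
```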

Using VBA

It is inconvenient to calculate the parameters λ and ζ each time the function
LOGNORMDIST is used. One way to solve this problem is to create a new
function using VBA as follows:

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code: 3.3

*********************************************************************
'Purpose: To wrap the function LOGNORMDIST so that it takes the mean and
'         standard deviation of Y itself; the parameters of ln Y are
'         obtained from Eqs. 3.4 and 3.5. Returns Pr(Y > y).

'Define variables:

'y: the value at which to evaluate the distribution
'mean: the arithmetic mean of the distribution
'sd: the standard deviation of the distribution
'yida: the standard deviation of ln Y (Eq. 3.4)
'lamda: the mean of ln Y (Eq. 3.5)

*********Start of coding **************************************

Public Function newlogdist(y, mean, sd)
    Dim yida As Double, lamda As Double
    yida = Log(1 + (sd / mean) ^ 2) ^ 0.5
    lamda = Log(mean) - 0.5 * yida ^ 2
    ' 1 - CDF gives the probability of exceeding y
    newlogdist = 1 - Application.WorksheetFunction.LogNormDist(y, _
        lamda, yida)

End Function
**************************************end of coding*****************

This macro uses one key technique, which is detailed as follows:

1. Using the worksheet function

To use a worksheet function in a VBA statement, just precede the function name
with Application.WorksheetFunction.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code: 3.4

*********************************************************************
'Purpose: To wrap the function LOGINV so that it takes the mean and
'         standard deviation of Y itself; the parameters of ln Y are
'         obtained from Eqs. 3.4 & 3.5

'Define variables:

'pr: probability associated with the lognormal distribution
'mean: the arithmetic mean of the distribution
'sd: the standard deviation of the distribution
'yida: the standard deviation of ln Y (Eq. 3.4)
'lamda: the mean of ln Y (Eq. 3.5)

*********Start of coding **************************************

Public Function newloginv(pr, mean, sd)
    Dim yida As Double, lamda As Double
    yida = Log(1 + (sd / mean) ^ 2) ^ 0.5
    lamda = Log(mean) - 0.5 * yida ^ 2
    newloginv = Application.WorksheetFunction.LogInv(pr, lamda, yida)

End Function

**************************************end of coding*****************

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Then, the probability Pr(X > 160) can be obtained with the function of Code 3.3:

Pr(X > 160) = newlogdist(160, 136.33, 26.59) = 0.18

In Example 3.3, we cannot simply say which assumption is correct, as we do not
have enough evidence to demonstrate whether or not the probabilistic model is
appropriate for the given random phenomenon. However, a statistical test called the
goodness-of-fit test provides a tool to determine whether a given random sample
comes from a specified probability distribution. The details of the goodness-of-fit
test will be presented in the next chapter.

3.4.4 Binomial Distribution

In reality, problems often involve two possible outcomes: occurrence and
nonoccurrence. Events such as whether a water test meets the pollution control
standards, whether a head or a tail appears when tossing a coin, and whether or not
you pass an exam are referred to as a Bernoulli sequence. Such a sequence has
several features: each trial has two possible outcomes; the probability of success is
constant in each trial; and the trials are statistically independent.

Definition
Binomial Experiment: A binomial experiment involves n independent and
identical trials such that each trial can result in one of two possible outcomes,
namely, success (S) or failure (F).

We often write X ~ Bin(n, p) to indicate that X is a binomial random variable
based on n trials with success probability p.

If X ~ Bin(n, p), the PMF and CDF are denoted by:

$$b(x; n, p) = \begin{cases} \binom{n}{x} p^{x} (1-p)^{n-x}, & x = 0, 1, 2, \ldots, n \\ 0, & \text{otherwise} \end{cases} \qquad (3.6)$$

$$B(x; n, p) = \sum_{y=0}^{x} b(y; n, p), \quad x = 0, 1, \ldots, n \qquad (3.7)$$

Excel Functions – BINOMDIST and CRITBINOM

Function BINOMDIST

Returns the individual term binomial distribution probability.

Syntax
= BINOMDIST (number, trials, probability, cumulative)

number: the number of successes in the trials
trials: the number of independent trials
probability: the probability of success on each trial
cumulative: FALSE (PMF), TRUE (CDF)

Function CRITBINOM

The Excel function CRITBINOM returns the inverse of the cumulative binomial
distribution, that is, the smallest value (number of successes) for which the
cumulative binomial distribution is greater than or equal to a criterion value. This
function can be used in the area of quality assurance.

Syntax
= CRITBINOM (trials, probability, alpha)

trials: the number of Bernoulli trials


probability: the probability of a success on each trial
alpha: the criterion value

Example 3.4

Reconsider Example 3.2. According to the HKEEA’s rule, students will get a
certification if the score is higher than 60.

1) Randomly choose 10 students and find the probability that exactly 8 of them get
the certification.
2) To make sure that at least 90% of students can get the certification, how many
students need to get a score higher than 60?

[Solution]

1)

Hand Calculation

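According to Eq. 3.6,

$$b(x; n, p) = \begin{cases} \binom{n}{x} p^{x} (1-p)^{n-x}, & x = 0, 1, 2, \ldots, n \\ 0, & \text{otherwise} \end{cases}$$

$$\Pr(X = 8) = \binom{10}{8} \times 0.89^{8} \times 0.11^{2} = 0.21$$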

Excel Function

Pr(X ≥ 60) = 0.89 and Pr(X < 60) = 0.11 have already been obtained in
Example 3.2. The probability that 8 of 10 students get the certification is:
Pr(X = 8) = BINOMDIST(8, 10, 0.89, FALSE) = 0.21.

2)

Excel Function

According to Example 3.2, n = 600, Pr(X ≥ 60) = 0.89 and Pr(X < 60) = 0.11
are given, and the result is:

CRITBINOM (600, 0.89, 0.9) = 544

As a result, 544 is the smallest number of students for which the cumulative
binomial probability reaches 0.9; that is, with a probability of at least 90 percent,
no more than 544 of the 600 students pass the exam (>60).
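
Both binomial results can be cross-checked with a short Python sketch. The function crit_binom below is a hypothetical stand-in that follows the description of CRITBINOM given earlier (the smallest number of successes whose cumulative probability reaches the criterion); it is not Excel's own implementation.

```python
from math import comb


def binom_pmf(x, n, p):
    """Eq. 3.6: probability of exactly x successes in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def crit_binom(n, p, alpha):
    """Smallest k with Pr(X <= k) >= alpha, as CRITBINOM is described."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += binom_pmf(k, n, p)
        if cumulative >= alpha:
            return k
    return n


p_eight_of_ten = binom_pmf(8, 10, 0.89)
k_critical = crit_binom(600, 0.89, 0.9)
print(f"Pr(X = 8) = {p_eight_of_ten:.2f}")
print(f"critical number of successes = {k_critical}")
```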

3.4.5 Poisson Distribution

Generally, the Poisson distribution is related to the concept of rare events. A very
important application of the Poisson distribution arises in connection with the
occurrence of events of some type over time. Poisson random variables are used to
model the number of occurrences of certain events that come from a large number of
independent sources, for instance, the number of accidents in an industrial facility,
the number of calls to an office telephone during business hours, the number of
customers entering a store, the number of typing errors on a page, the number of
people living to 100 years in a certain area, and the number of vacancies occurring
during a year.

Definition
Poisson distribution: The number of rare events occurring within a fixed period of
time has a Poisson distribution.

A discrete random variable X is said to have a Poisson distribution with
parameter λ (λ > 0) if the PMF and CDF of X are:

$$f(x) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots \qquad (3.8)$$

$$F(x) = \sum_{y=0}^{x} \frac{e^{-\lambda}\lambda^{y}}{y!} \qquad (3.9)$$

The Poisson distribution can be used when having a large number of independent
Bernoulli trials and a very small probability of success.

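The Poisson distribution as the limit of the binomial: if n → ∞ and p → 0 such
that np = λ is constant, then

$$\text{Binomial: } \binom{n}{x} p^{x} (1-p)^{n-x}$$

$$\text{Poisson: } \lim_{n \to \infty} \binom{n}{x} p^{x} (1-p)^{n-x} = \frac{e^{-\lambda}\lambda^{x}}{x!} \qquad (3.10)$$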
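
The limit can be illustrated numerically with a small Python sketch. The values λ = 5 and x = 3 are chosen only for illustration, and the helper names are hypothetical; the point is that the binomial probability with p = λ/n moves toward the Poisson probability as n grows.

```python
from math import comb, exp, factorial


def binom_pmf(x, n, p):
    """Binomial probability of exactly x successes in n trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def poisson_pmf(x, lam):
    """Poisson probability of exactly x events with parameter lam."""
    return exp(-lam) * lam ** x / factorial(x)


lam, x = 5, 3  # illustrative values, with np = lam held constant
target = poisson_pmf(x, lam)
for n in (10, 100, 1000, 10000):
    approx = binom_pmf(x, n, lam / n)
    print(f"n = {n:>5}: binomial = {approx:.5f}, Poisson = {target:.5f}")
```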

Excel Function - POISSON

The function POISSON calculates the Poisson Probability Mass Function or the
Cumulative Poisson Probability Function for a supplied set of parameters.

Syntax
= POISSON ( x, mean, cumulative)

x: the number of events


mean: the expected number of events
cumulative: True = CDF, False = PMF

Example 3.5

The number of students arriving for new semester registration at an academic
room can be modeled by a Poisson process with a rate parameter of five per hour.
1) What is the probability that exactly three arrivals occur during a particular hour?
2) What is the probability that at least three arrivals occur during a particular hour?
3) If the operators of the registration service take a 45-min break for lunch, what is
the probability that they do not miss any students coming for registration?

[Solution]

Let X = the number of students arriving for new semester registration, and λ =
the rate parameter = 5 per hour. The probability can be calculated either by hand or
with Excel functions.

Hand Calculation

1)

According to Eq. 3.8:

$$f(x) = \frac{e^{-\lambda}\lambda^{x}}{x!}$$

$$\Pr(X = 3) = \frac{e^{-5} \times 5^{3}}{3!} = 0.14$$

2)
According to Eq. 3.9:

$$F(x) = \sum_{y=0}^{x} \frac{e^{-\lambda}\lambda^{y}}{y!}$$

$$\Pr(X \ge 3) = 1 - \left(\frac{e^{-5} \times 5^{0}}{0!} + \frac{e^{-5} \times 5^{1}}{1!} + \frac{e^{-5} \times 5^{2}}{2!}\right) = 1 - (0.007 + 0.034 + 0.084) = 0.875$$

3)

For the 45-min break, the parameter changes: as t = 0.75 hour, the new parameter
is equal to 0.75 × 5 = 3.75.
According to Eq. 3.8,

$$\Pr(X = 0) = \frac{e^{-3.75} \times 3.75^{0}}{0!} = 0.02$$

Using Excel Function

Let X = the number of students arriving for new semester registration, and λ =
the rate parameter = 5 per hour.

1)

From the POISSON function: POISSON (x, mean, cumulative)

Pr(X = 3) = POISSON(3, 5, FALSE) = 0.14

2)

From the POISSON function: POISSON (x, mean, cumulative)

Pr(X ≥ 3) = 1 − Pr(X ≤ 2) = 1 − POISSON(2, 5, TRUE) = 0.88

3)

The parameter is changed. As t = 0.75, the new parameter is equal to
0.75 × 5 = 3.75.
Pr(X = 0) = POISSON(0, 3.75, FALSE) = 0.02
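
The three answers of Example 3.5 can be cross-checked with a short Python sketch of Eqs. 3.8 and 3.9. The function names poisson_pmf and poisson_cdf are hypothetical helpers for this illustration, standing in for POISSON with the cumulative argument set to FALSE and TRUE.

```python
from math import exp, factorial


def poisson_pmf(x, mean):
    """Eq. 3.8: probability of exactly x events."""
    return exp(-mean) * mean ** x / factorial(x)


def poisson_cdf(x, mean):
    """Eq. 3.9: probability of at most x events."""
    return sum(poisson_pmf(y, mean) for y in range(x + 1))


rate = 5                                       # arrivals per hour
p_exactly_3 = poisson_pmf(3, rate)             # part 1
p_at_least_3 = 1 - poisson_cdf(2, rate)        # part 2
p_none_in_break = poisson_pmf(0, rate * 0.75)  # part 3: 45-min break

print(f"{p_exactly_3:.2f} {p_at_least_3:.3f} {p_none_in_break:.2f}")
```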

3.4.6 Exponential Distribution

The exponential distribution is closely related to the Poisson process. It is
frequently used as a model for the distribution of times between the occurrences of
successive events, for instance, the amount of time until an earthquake occurs, the
amount of time until a new war breaks out, and the amount of time until an
instrument’s components break down. Such waiting times tend to have an
exponential distribution. If the occurrence of an event follows a Poisson distribution,
the recurrence time of the next event is described by the exponential distribution.

Definition
Exponential Distribution: In a sequence of rare events, when the number of events
is Poisson, the time between events has an exponential distribution.

Let λ > 0 be a real number. A random variable X is said to be an exponential random
variable with parameter λ if its probability density function f(x) and cumulative
distribution function F(x) are given by

$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \ge 0 \\ 0, & x < 0 \end{cases} \qquad (3.11)$$

$$F(x) = \begin{cases} 1 - e^{-\lambda x}, & x \ge 0 \\ 0, & x < 0 \end{cases} \qquad (3.12)$$

where x is the time between two occurrences and λ is the expected number of
occurrences in a unit of time.

Excel Function - EXPONDIST

EXPONDIST

This function returns the value of the exponential distribution for a given value of
x. Generally, the function EXPONDIST is used to model the time between events,
such as the amount of time until an earthquake occurs.

Syntax
= EXPONDIST( x, λ, cumulative )

x: the value at which to evaluate the function
λ: the parameter of the distribution
cumulative: TRUE = CDF, FALSE = PDF
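
The two modes of EXPONDIST correspond to Eqs. 3.11 and 3.12, which can be sketched in Python as follows. The function names and the rate of 0.2 are assumptions made for this illustration; the interval helper simply takes the difference of two CDF values.

```python
from math import exp


def expon_pdf(x, lam):
    """Eq. 3.11: density of the exponential distribution."""
    return lam * exp(-lam * x) if x >= 0 else 0.0


def expon_cdf(x, lam):
    """Eq. 3.12: probability that the waiting time is at most x."""
    return 1 - exp(-lam * x) if x >= 0 else 0.0


def expon_between(a, b, lam):
    """Pr(a < X <= b), the difference of the CDF at b and at a."""
    return expon_cdf(b, lam) - expon_cdf(a, lam)


lam = 0.2  # illustrative rate: one event per 5 time units on average
print(f"f(2) = {expon_pdf(2, lam):.3f}")
print(f"F(2) = {expon_cdf(2, lam):.3f}")
print(f"Pr(3 < X <= 5) = {expon_between(3, 5, lam):.3f}")
```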

Example 3.6

Nowadays, the computer has become an essential part of our daily life. A
student spends $10,000 to buy a laptop. Suppose the lifetime of the laptop follows an
exponential distribution with an average lifetime of 5 years. If the laptop fails
during the first two years, the manufacturer agrees to give a full refund. If the laptop
fails after the third year but before the fifth year, the manufacturer will refund $1,000.
Calculate the probability that the laptop breaks down within the first two years, and
between the third and fifth years.

[Solution]

Hand Calculation

Within the first two years, the break rate λ = 1/5. According to Eq. 3.12:

$$\Pr(X \le 2) = 1 - e^{-\frac{1}{5} \times 2} = 0.33$$

Between the third and fifth years, the break rate is still λ = 1/5, and the probability
is the difference between the values of Eq. 3.12 at 5 years and at 3 years:

$$\Pr(3 < X \le 5) = e^{-\frac{1}{5} \times 3} - e^{-\frac{1}{5} \times 5} = 0.549 - 0.368 = 0.181$$

Using Excel Function

Within the first two years, the break rate λ = 1/5, and the probability that the laptop
breaks down is:
= EXPONDIST (2, 1/5, TRUE) = 0.33

The probability that the laptop breaks down between the third and fifth years is:
= EXPONDIST (5, 1/5, TRUE) - EXPONDIST (3, 1/5, TRUE) = 0.181
As a result, the probability that the manufacturer gives a refund to the student is
0.33 within the first two years and 0.181 between the third and fifth years.

3.5 Excel Functions Related to Probability Distribution

Among the probability functions supported by Excel, all distributions have PDFs,
and some have CDFs and inverse CDFs. Generally, the name of a probability
function in Excel can be divided into two parts: a base name and a suffix. The base
name is an abbreviation of the distribution name, and the suffix is either DIST or INV.
The "DIST" function evaluates the PDF and possibly the CDF. If the function
has a CUMULATIVE argument, setting this argument to TRUE causes the DIST
function to compute the CDF. If the argument is FALSE, the function returns the PDF.

The "INV" function evaluates the inverse CDF. Not all distributions have both
the "INV" and "DIST" functions; a summary of Excel probability functions is shown
in Table 3.3.

Table 3.3 The probability functions

Distribution       PDF            CUMULATIVE?    Quantile
Standard normal    NORMSDIST      Yes            NORMSINV
Normal             NORMDIST       Yes            NORMINV
Log normal         LOGNORMDIST    ----           LOGINV
Binomial           BINOMDIST      Yes            CRITBINOM
Poisson            POISSON        Yes            ----
Exponential        EXPONDIST      Yes            ----

3.6 Summaries of Excel Functions

In this chapter, the functions related to probability calculation were introduced.
Table 3.4 summarizes the Excel functions used in this chapter.

Table 3.4 Summaries of the built-in functions

FUNCTION       How it works?                                                                    Notes
NORMSDIST      This function returns the standard normal cumulative distribution.               Ex. 3.1 & 3.2
NORMDIST       This function returns the normal cumulative distribution.                        Ex. 3.1 & 3.2
NORMSINV       This function returns the inverse of the standard normal cumulative              Ex. 3.1 & 3.2
               distribution.
NORMINV        This function returns the inverse of the normal cumulative distribution.         Ex. 3.1 & 3.2
LOGNORMDIST    This function returns the cumulative lognormal distribution.                     Ex. 3.3
LOGINV         This function returns the inverse of the lognormal distribution.                 Ex. 3.3
BINOMDIST      This function returns the individual term binomial distribution probability.     Ex. 3.4
CRITBINOM      This function returns the smallest value for which the cumulative binomial       Ex. 3.4
               distribution is greater than or equal to a criterion value.
POISSON        This function returns the Poisson distribution.                                  Ex. 3.5
EXPONDIST      This function returns the exponential distribution.                              Ex. 3.6

Excel’s built-in functions can be adapted using VBA macros. Table 3.5
summarizes the user-defined functions that are used in this chapter.

Table 3.5 Summaries of the user defined functions

FUNCTION      How it works?
NEWLOGDIST    This function wraps the function LOGNORMDIST (Code 3.3).
NEWLOGINV     This function wraps the function LOGINV (Code 3.4).

Determination of the Probability Distribution Models

4.1 Introduction

In the previous chapters, the fundamental ideas about probability have been
introduced. Most of the time, we have assumed that the observations come from a
particular distribution when analyzing a random phenomenon. However, there has
been no evidence to verify our assumption. In Example 3.3, we wanted to obtain the
probability that the students’ weights are heavier than 160 pounds, and before doing
further calculation we assumed that the observations follow the normal and lognormal
distributions, respectively; when analyzing the laptop’s lifetime in Example 3.6, we
first assumed that the observations follow the exponential distribution.
There is no evidence to verify that the students’ weights and the laptop’s lifetime
follow these distributions. Therefore, we require some techniques to help us do the
verification. In this chapter, two techniques, the probability paper and
goodness-of-fit tests, are used to test whether a probability model is appropriate for
the given data.

4.2 Probability Paper

To determine whether a given random sample comes from a specified
probability distribution, the probability paper is one of the best choices. The observed
data are plotted on a special graph paper, and if the data points show a linear trend,
the data follow the selected distribution model. This kind of paper is called the
probability paper. To obtain a linear graph, the probability paper has a special
probability scale, which is obtained by transforming the cumulative probabilities.
Figure 4.1 shows an example of a normal probability paper.

[Figure: normal probability paper. Left vertical axis: cumulative probability (%),
from 0.1 to 99.9; right vertical axis: z value, from -4 to 4; horizontal axis: value of
the samples, from 0 to 100.]

Figure 4.1 The example of a normal probability paper

To obtain a linear graph, the probability scale is adjusted. In Figure 4.1, the
vertical axis on the right shows the z value, and the vertical axis on the left shows
the corresponding cumulative probability. The data are placed on the horizontal axis.
Note that different probability papers are associated with different probability
distributions.

Plotting Position

The data points plotted on the paper consist of the observed value and its
cumulative probability.

Definition
Each value from the sample is plotted as a point (F_X(x_n), x_n), where x_n is the
observed value (rearranged in increasing order) and F_X(x_n) can be calculated as
follows:

$$F_X(x_n) = \frac{n}{N + 1} \qquad (4.1)$$

N = the number of observed data
n = the rank of the observation in ascending order

If the observed data plotted on the probability paper show a linear trend, the data
follow the selected probability distribution. Two commonly used probability papers
are the normal and lognormal papers. Let us take the normal probability paper
as an example to show the application of probability paper.
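
The plotting positions of Eq. 4.1 can be sketched in Python as follows. The five weights are a hypothetical small sample used only to show the mechanics, and the function name plotting_positions is an assumption of this sketch; the z value of each point is what a normal probability paper shows on its transformed axis.

```python
from statistics import NormalDist


def plotting_positions(sample):
    """Return the points (F, x) with F = n / (N + 1), sorted by x (Eq. 4.1)."""
    ordered = sorted(sample)
    size = len(ordered)
    return [(n / (size + 1), x) for n, x in enumerate(ordered, start=1)]


weights = [99.0, 89.8, 98.5, 96.5, 98.0]  # hypothetical small sample
points = plotting_positions(weights)

for f, x in points:
    z = NormalDist().inv_cdf(f)  # position on the z axis of the paper
    print(f"x = {x:6.1f}  F = {f:.4f}  z = {z:+.3f}")
```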

Example 4.1

Reconsider Example 3.3, in which we assumed that the students’ weights follow the
normal distribution.
1) Draw a normal probability paper.
2) Use the normal probability paper to evaluate whether the assumption is correct.

[Solution]

1)

The major steps to draw a probability paper are as follows:

1. Obtain the special probability scale

The key step in drawing a probability paper is to obtain the special probability scale.
Taking the normal probability paper as an example, a series of cumulative
probabilities can be chosen, including 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6,
0.7, 0.8, 0.90, 0.95, 0.99, 0.999.

2. Obtain the corresponding z value

The next step is to obtain the corresponding z values using the function
NORMINV (with a mean of 0 and a standard deviation of 1), and these values are
used as the special probability scale on the vertical axis.

3. Choose the horizontal axis

The horizontal axis is in an arithmetic scale and is used to present the values of
the random variable.

4. Adjust the graph

You can add grid lines and a title to make the paper look professional.

You can obtain a normal probability paper like Figure 4.1 if you follow the
above steps. The complete drawing procedure can be seen in the Excel file named
Chapter 4, in the spreadsheet named Example 4.1.
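
Steps 1 and 2 above can be sketched in Python: each chosen cumulative probability is mapped to its standard normal z value, which gives the tick positions of the probability scale. This is only an illustration of the transformation, not the spreadsheet procedure itself, and the variable names are assumptions.

```python
from statistics import NormalDist

# Cumulative probabilities chosen for the scale (step 1).
probabilities = [0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5,
                 0.6, 0.7, 0.8, 0.90, 0.95, 0.99, 0.999]

# Corresponding z values from the standard normal inverse CDF (step 2).
standard_normal = NormalDist()
z_scale = [standard_normal.inv_cdf(p) for p in probabilities]

for p, z in zip(probabilities, z_scale):
    print(f"probability {p:>6.3f}  ->  z = {z:+.3f}")
```

The labels on the probability axis are the probabilities, while the ticks are drawn at the z positions, which is why a normal sample plots as a straight line.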

2)

The major steps to solve this problem are as follows:

1. Obtain the data points
2. Plot the data points on the specified graph paper
3. Decide whether the data points have a linear trend

1. Obtain the data points

According to the definition of the data points, each value x_n from the sample is
plotted as a point (F_X(x_n), x_n). x_n can be obtained by rearranging the observed
values in increasing order. After rearranging the observed values, F_X(x_n) can be
obtained by Eq. 4.1. Excel’s sort feature and the function SMALL are two useful
ways to sort the values in an Excel spreadsheet.

Sort dialog box

Select the range of data you want to rearrange → right-click any cell in the
selected range → choose Sort from the shortcut menu → choose Custom Sort. You
will then see the Sort dialog box (shown in Figure 4.2).

Figure 4.2 Using Excel sort features in Example 4.1

In this example, we need to rearrange the weights in ascending order, so the
option Sort On Values can be chosen. You can also choose the options Sort On
Cell Color, Font Color, or Cell Icon, depending on the conditions.

Function SMALL

The function SMALL returns the kth smallest value in a data set.

Syntax
= SMALL (array, k)

array: a range of data for which you want to determine the kth smallest value
k: the position (from the smallest) in the array or range of data to return

Figure 4.3 shows the process of using the function SMALL in Example 4.1. The
cell references are fixed this time, and you can fill in the rest of the range by
double-clicking Cell O12.

Figure 4.3 Using the function SMALL in Example 4.1

After rearranging the values in increasing order, the corresponding cumulative
probability F_X(x_n) can be calculated according to Eq. 4.1.

In Table 4.1, the weights arranged in ascending order are shown in columns 2 and 5,
and the corresponding cumulative probabilities are shown in columns 3 and 6. Only
part of the survey results is displayed in Table 4.1; the complete table can be seen in
the Excel file named Chapter 4, in the spreadsheet named Example 4.1.

Table 4.1 Weights of UST’s students


n     xn        n/(N+1)    n     xn        n/(N+1)
1     89.80     0.0164     2     96.50     0.0328
3     98.00     0.0492     4     98.50     0.0656
5     99.00     0.0820     6     99.00     0.0984
7     100.00    0.1148     8     100.50    0.1311
9     110.50    0.1475     10    102.40    0.1639
11    104.80    0.1803     12    110.00    0.1967
13    110.00    0.2131     14    110.00    0.2295
15    111.50    0.2459     16    112.40    0.2623
17    114.60    0.2787     18    120.00    0.2951
19    120.50    0.3115     20    123.50    0.3279
……                         ……

2. Plot the data points on the specified graph paper

If the random variable X follows the normal distribution, the straight line
passes through the point (x0 = µ, F_X(x0) = 0.5) with a slope equal to the standard
deviation σ, where σ = x1 − x0, µ = x0, and F_X(x1) = 0.841.
In this example, x0 = µ = 136.33. When F_X(x1) is equal to 0.841, x1 can be obtained
using the Excel function NORMINV and is equal to 162.88, so the slope is
σ = 162.88 − 136.33 = 26.55.
The points and the straight line are plotted on the normal probability paper. The x
axis shows the z values and the y axis shows the students’ weights (pounds).

[Figure: students’ weights (pounds, from 70 to 190) plotted against the z value
(from -3 to 3), with the straight line passing through the marked points (0, 136.33)
and (1, 162.88).]

Figure 4.4 Weights of UST’s students plotted on the normal probability paper

3. Decide whether the data points have a linear trend

From the straight line, we can observe that the mean value is equal to 136.33
at z = 0 (probability = 0.5), and the weight is equal to 162.88 at z = 1
(probability = 0.841). We can observe from Figure 4.4 that the data points have a
linear trend. Therefore, the data follow the normal distribution.

Creating VBA macros – Macro Recorder

As mentioned in Chapter 3, Excel provides two ways to create VBA code. We
have already introduced the method of entering the code directly into a module. In
this section, another method, which creates VBA code with the macro recorder, will
be introduced.
Excel’s macro recorder translates your actions into VBA code. You can turn on
the recorder by choosing Developer → Record Macro (shown in Figure 4.5), or just
click the Record Macro icon at the bottom left of the Excel worksheet (shown in
Figure 4.6).

Figure 4.5 Turning on the macro recorder

Figure 4.6 Turning on the macro recorder by clicking the record macro icon

Excel displays the Record Macro dialog box after you choose Record Macro, as
shown in Figure 4.7. The dialog box contains the following items:

The name of the macro, which you can change.
A shortcut key: specify a key to execute the macro by pressing Ctrl + the specified key.
The location for the macro.

Figure 4.7 The record macro dialog box.

After introducing the fundamentals of the macro recorder, we will demonstrate the
steps of using it. In the following example, a range of cells is formatted while the
macro recorder is running. The steps are as follows:

1. Activate the cell
2. Choose Developer → Code → Record Macro
3. Change the macro name and put the letter q in the edit box labeled Shortcut Key
4. Click OK and start to format the selected cell
5. Right-click the cell and format it
6. After formatting, choose Developer → Code → Stop Recording

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code 4.1

*********************************************************************
Sub format1()
'
' format1 Macro
'
' Keyboard Shortcut: Ctrl+q
'
Range("K11:L71").Select
With Selection.Font
.Name = "Calibri"
.Size = 10
.Strikethrough = False
.Superscript = False
.Subscript = False
.OutlineFont = False
.Shadow = False
.Underline = xlUnderlineStyleNone
.ColorIndex = xlAutomatic
.TintAndShade = 0
.ThemeFont = xlThemeFontNone
End With
With Selection.Font
.Name = "Calibri"
.Size = 12
.Strikethrough = False
.Superscript = False
.Subscript = False
.OutlineFont = False
.Shadow = False
.Underline = xlUnderlineStyleNone
.ColorIndex = xlAutomatic
.TintAndShade = 0
.ThemeFont = xlThemeFontNone
End With
With Selection.Font
.Color = -4165632
.TintAndShade = 0
End With
End Sub
*********************************************************************

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this example, the macro is named format1 with the shortcut key Ctrl + q. The
cells in the range K11 to L71 are selected, and the font size and color are changed.
Excel’s macro recorder translates the actions into VBA code, which is shown above.
This technique is helpful for a beginner learning VBA, and also helpful when you do
not know how to write the code.

4.3 Goodness-of-fit Test

The probability paper is a useful technique to determine whether the
observations come from a particular distribution. However, as the straightness of the
line is based on subjective judgment, it lacks objective, quantitative evidence.
Alternatively, another technique called the goodness-of-fit test provides a
quantitative procedure to test the fit between the observations and an assumed
distribution model, and it is especially suitable for determining the relative
goodness-of-fit of two or more theoretical models. Two widely used tests are the
chi-squared test and the Kolmogorov-Smirnov (K-S) test.

4.3.1 Chi-Squared Test

The chi-squared test for goodness-of-fit is widely used to determine whether the
observations come from a particular distribution. The basic logic is to test whether the
difference between the expected data and observed data can be accepted. The data are
divided into k intervals, and then the observed and theoretical frequencies can be
obtained using Excel functions. The observed frequencies in the k intervals are
compared with the corresponding theoretical frequencies; if the computed
chi-squared value is less than the critical value, the prescribed model is acceptable.
The equation is as follows:

$$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - e_i)^2}{e_i} < c_{1-\alpha, f} \qquad (4.2)$$

where

* χ² ~ chi-squared distribution with degree of freedom (f) of k-1-m, where k = the
number of bins, m = the number of parameters.

k: the number of intervals
n_i: observed frequency
e_i: theoretical frequency
α: level of significance

*Chi-square distribution: If Z1, Z 2, ...... Z n are independent standard normal


random variables, then X, defined by X = Z12 + Z 22 + .... + Z n2 , is said to have a
chi-squared distribution with n degree of freedom.

After obtaining the value of $\sum (n_i - e_i)^2 / e_i$, the next step is to compare it
with the critical value $c_{1-\alpha, f}$ ($\alpha$ is the level of significance, and f is the
degrees of freedom). The critical value can be obtained from a table or with the Excel
function CHIINV. Note that f is equal to k - 1 as n → ∞; if the parameters of the model
are estimated from the data, f must be reduced by the number of parameters m, so that
f = k - 1 - m.
According to Eq. 4.2, statistics such as the observed and theoretical
frequencies must be calculated before any further analysis. Excel functions can
be used to obtain these statistics. Therefore, before showing the example, the Excel
functions related to the chi-squared test for goodness-of-fit are introduced.
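
The comparison described above can also be sketched in VBA. This is a minimal sketch, not part of the original workbook: the function names ChiSqStat and ChiSqAccept are hypothetical, and the observed and theoretical frequencies are assumed to sit in two single-column ranges of equal size. The critical value is taken from the worksheet function CHIINV through Application.WorksheetFunction.

```vba
' Hypothetical helper: chi-squared statistic for observed vs. theoretical
' frequencies stored in two equally sized single-column ranges.
Public Function ChiSqStat(observed As Range, expected As Range) As Double
    Dim i As Long
    Dim total As Double
    For i = 1 To observed.Cells.Count
        ' one term of the sum: (n_i - e_i)^2 / e_i
        total = total + (observed.Cells(i).Value - expected.Cells(i).Value) ^ 2 _
                        / expected.Cells(i).Value
    Next i
    ChiSqStat = total
End Function

' Hypothetical helper: True when the statistic is below the critical value
' with f = k - 1 - m degrees of freedom (m = number of estimated parameters).
Public Function ChiSqAccept(observed As Range, expected As Range, _
                            alpha As Double, m As Long) As Boolean
    Dim f As Long
    f = observed.Cells.Count - 1 - m
    ChiSqAccept = ChiSqStat(observed, expected) < _
                  Application.WorksheetFunction.ChiInv(alpha, f)
End Function
```

From a worksheet, the second function could be called as =ChiSqAccept(F19:F28, G19:G28, 0.05, 2), where the cell addresses are placeholders for wherever the two frequency columns are stored.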

Excel Functions – MAX, MIN, ABS, CHIINV, FREQUENCY

Function MAX

Function MAX returns the largest value from a supplied set of numerical values.

Syntax
= MAX ( number1, [number2], ... )

Function MIN

Function MIN returns the smallest value from a supplied set of numerical values.

Syntax
= MIN ( number1, [number2], ... )

Function ABS

Returns the absolute value of a number.



Syntax
= ABS (number)

number: the real number of which you want the absolute value

Function CHIINV

Function CHIINV calculates the inverse of the right-tailed probability of the
chi-squared distribution.

Syntax
= CHIINV (probability, degrees_freedom)

Function FREQUENCY

Function FREQUENCY calculates how often values occur within a
range of values and returns a vertical array of numbers.

Syntax
= FREQUENCY (data_array, bins_array)

data_array: a set of values for which you want to count frequencies
bins_array: an array of, or reference to, intervals into which you want to group the
values in data_array

To create the frequency distribution, select a range of cells that corresponds to
the number of cells in the bin range, and then enter the FREQUENCY formula as an
array formula (pressing Ctrl+Shift+Enter).
Having introduced the basic ideas of the chi-squared test for goodness-of-fit
and the related Excel functions, let us look at an example.

Example 4.2

In Example 3.3, we assumed that the random variable follows the normal or the
lognormal distribution. However, it is not certain which of the two distributions the
random variable comes from. In this example, the chi-squared test
is used to evaluate the appropriateness of the proposed normal and lognormal
distributions.

[Solution]

As mentioned earlier, the function FREQUENCY calculates how often values occur
in a specified interval, so it can be used to obtain the observed frequencies. Before
using this function, we need to define the bins (intervals). Table 4.2 shows the
statistics that are useful for defining the intervals.

Table 4.2 Statistics used in frequency calculation

Observations    60
Max             185.70
Min             89.80
Intervals       95.90
Mean            136.33
sd              26.59
No. of Class    10
Class Widths    11.99

Where: Intervals = Max - Min

After obtaining the intervals, the next step is to calculate how often values occur
within each interval using the function FREQUENCY. To create the frequency
distribution, select an output range with as many cells as the bin range (in this
example, E19:E28), and enter the FREQUENCY formula with the data range (in this
example, B6:B65) and the bin range as arguments, pressing Ctrl + Shift + Enter.
Figure 4.8 shows the process of using the function FREQUENCY.

Figure 4.8 Using the function FREQUENCY
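
The counting that FREQUENCY performs can be written out as a loop, which may help when checking the bins by hand. The sketch below is illustrative only: CountByBin is a hypothetical name, the upper bounds are assumed to be sorted in ascending order, and the data range is assumed to contain only numbers. As with FREQUENCY, a value is counted in the first bin whose upper bound it does not exceed, and one extra element collects values above the last bound.

```vba
' Hypothetical helper: counts how many values fall in each bin, where bin j
' collects values greater than the previous upper bound and no greater than
' upperBounds(j). Element k + 1 collects values above the last bound.
Public Function CountByBin(data As Range, upperBounds As Range) As Variant
    Dim counts() As Long
    Dim k As Long, j As Long
    Dim cell As Range
    k = upperBounds.Cells.Count
    ReDim counts(1 To k + 1)
    For Each cell In data.Cells
        j = 1
        ' advance to the first bin whose upper bound is not exceeded
        Do While j <= k
            If cell.Value <= upperBounds.Cells(j).Value Then Exit Do
            j = j + 1
        Loop
        counts(j) = counts(j) + 1
    Next cell
    CountByBin = counts
End Function
```

Unlike FREQUENCY, this sketch returns a one-dimensional array, so on a worksheet it fills a row rather than a column unless it is wrapped in TRANSPOSE.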

The functions NORMDIST and LOGNORMDIST can be used to obtain the
theoretical frequency.

Suppose the random variable follows the normal distribution.

Figure 4.9 Using the function NORMDIST in Example 4.2


Figure 4.9 shows how the theoretical normal frequency is obtained with the
function NORMDIST.
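
The figure itself is not reproduced here, so the following worked equation is a reconstruction rather than the author's formula. It assumes that the theoretical frequency of an interval is the sample size times the probability the model assigns to that interval, with $b_{i-1}$ and $b_i$ denoting the interval bounds (symbols introduced here for illustration). Using the mean 136.33 and standard deviation 26.59 from Table 4.2, this assumption reproduces the normal entry for the interval ending at 101.49 in Table 4.3.

```latex
\[
  e_i = n \left[ F_X(b_i) - F_X(b_{i-1}) \right]
\]
\[
  e_2 = 60 \left[ \Phi\!\left(\frac{101.49 - 136.33}{26.59}\right)
                - \Phi\!\left(\frac{89.50 - 136.33}{26.59}\right) \right]
      \approx 60\,(0.0951 - 0.0391) \approx 3.36
\]
```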

Suppose the random variable follows the lognormal distribution.

To obtain the theoretical frequency of the lognormal distribution, the first step is to
obtain the parameters λ and ζ according to Eqs. 3.4 and 3.5, where:
$\zeta^2 = \ln[1 + (26.59 / 136.33)^2] = 0.04$, and $\zeta = 0.19$
$\lambda = \ln 136.33 - 0.5 \times 0.04 = 4.90$
Therefore, the theoretical frequency can be obtained using the function
LOGNORMDIST (shown in Figure 4.10).

Figure 4.10 Using the function LOGNORMDIST in Example 4.2

Tips

Adding the comments to the cells

You can see a little triangle in the corner of cells F18, G18, and H18, which
indicates a comment. It is sometimes helpful to add a comment explaining a cell in
the spreadsheet, as the cells are too small to hold the explanation. Right-click the
cell and choose Insert Comment from the shortcut menu; the comment becomes
visible when you move the mouse over the cell.

Table 4.3 summarizes the calculations needed for the chi-squared test,
including the observed frequency ($n_i$), the theoretical frequency ($e_i$), and the value of
$\sum (n_i - e_i)^2 / e_i$.

Table 4.3 Computations for chi-squared tests of the two distributions

Interval   Observed    Theoretical Frequency ($e_i$)   $(n_i - e_i)^2 / e_i$
           Frequency
           ($n_i$)     Norm       Lognorm              Norm      Lognorm
89.50      0           2.35       1.12                 2.35      1.12
101.49     9           3.36       3.45                 9.49      8.92
113.48     7           6.00       7.23                 0.17      0.01
125.46     6           8.78       10.36                0.88      1.83
137.45     10          10.53      11.15                0.03      0.12
149.44     10          10.33      9.66                 0.01      0.01
161.43     6           8.30       7.08                 0.64      0.16
173.41     5           5.46       4.56                 0.04      0.04
185.40     6           2.94       2.65                 3.17      4.25
197.39     1           1.95       2.74                 0.46      1.11
Sum        60          60         60                   17.23     17.58

The histogram and the two PDFs of the theoretical distributions are shown in Figure 4.11.

Figure 4.11 Chi-squared test to discriminate the two distribution models
(histogram with normal and lognormal curves; x-axis: Weight (pound), y-axis: Frequency)

After obtaining the value of $\sum (n_i - e_i)^2 / e_i$, the next step is to compare it
with the critical value $c_{1-\alpha, f}$. For both the normal and the lognormal
distribution, two parameters are estimated from the available data.
Therefore, the degrees of freedom in both cases are f = 10 - 1 - 2 = 7. At the 5%
significance level with f = 7, the critical value obtained from Appendix Table A.3 is:
$c_{0.95, 7} = 14.07$

The function CHIINV can also be used to obtain the critical value:
$c_{0.95, 7}$ = CHIINV(0.05, 7) = 14.07

Suppose the random variable follows the normal distribution:

$\sum (n_i - e_i)^2 / e_i = 17.23 > 14.07$

Suppose the random variable follows the lognormal distribution:

$\sum (n_i - e_i)^2 / e_i = 17.58 > 14.07$

Therefore, according to the chi-squared test, neither the normal nor the lognormal
distribution is acceptable at the 5% significance level.

4.3.2 Kolmogorov-Smirnov Test (K-S test)

Another widely used goodness-of-fit test is the K-S test. It compares the
experimental cumulative frequency $S_n(x)$ with the theoretical cumulative probability
$F_X(x)$: if the maximum discrepancy between the two is smaller than the critical value
for the given sample size, the model is acceptable.
For a sample of size n, the observed data are rearranged in ascending order.
From this ordered sample, the experimental cumulative frequency function is
established as follows:

$$S_n(x) = \begin{cases} 0 & x < x_1 \\ k/n & x_k \le x < x_{k+1} \\ 1 & x \ge x_n \end{cases} \qquad (4.3)$$

Let $D_n$ denote the maximum difference between $S_n(x)$ and $F_X(x)$, and let
$D_n^{\alpha}$ denote the critical value, which is tabulated in Appendix Table A.4. If $D_n$ is
less than the critical value $D_n^{\alpha}$ at the prescribed significance level α, the theoretical
distribution is acceptable.
Compared with the chi-squared test, one advantage of the K-S test is that it is
not necessary to divide the observed data into intervals, which makes the test more
convenient to carry out. Example 4.3 shows the procedure for the K-S test.
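
The maximum discrepancy can be computed with a short VBA loop. The following is a minimal sketch under stated assumptions: KsStatNormal is a hypothetical name, the data are assumed to be in a single column already sorted in ascending order, the theoretical model is taken to be normal, and the comparison is made only at the data points with $S_n(x_k) = k/n$ from Eq. 4.3.

```vba
' Hypothetical helper: maximum discrepancy D_n between the empirical
' cumulative frequency S_n(x_k) = k / n (Eq. 4.3) and a normal CDF.
' Assumes sortedData is a single column sorted in ascending order.
Public Function KsStatNormal(sortedData As Range, mean As Double, _
                             sd As Double) As Double
    Dim n As Long, k As Long
    Dim diff As Double, dMax As Double
    n = sortedData.Cells.Count
    For k = 1 To n
        ' NormDist with cumulative = True returns the normal CDF at x_k
        diff = Abs(k / n - Application.WorksheetFunction.NormDist( _
                   sortedData.Cells(k).Value, mean, sd, True))
        If diff > dMax Then dMax = diff
    Next k
    KsStatNormal = dMax
End Function
```

For another model, the NormDist call would be swapped for the matching cumulative function; the loop itself stays the same.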

Example 4.3

In Example 4.2, the chi-squared test was used to evaluate whether the
observations in Example 3.3 come from the normal or the lognormal distribution.
Here, the K-S test is applied to the same question.

[Solution]

To solve this problem, the first step is to obtain the experimental and theoretical
cumulative probabilities. Part of the results is displayed in Table 4.4; the
complete table can be found in the Excel file named Chapter 4, in the spreadsheet
named Example 4.3.

In Table 4.4, the second column shows the data rearranged in increasing order;
the third column shows the experimental cumulative frequency calculated using
Eq. 4.3; the fourth and fifth columns show the corresponding cumulative
probabilities of the normal and lognormal distributions; the sixth and seventh
columns show the discrepancies between the experimental cumulative frequency and
each theoretical distribution.

Table 4.4 Computations for K-S tests of the two distributions

                        Theoretical Frequency
ID   x        Sn(x)     Normal     Lognormal     Dn1     Dn2
                        FX(x)      FX(x)
1    89.80    0.00      0.04       0.02          0.04    0.02
2    96.50    0.03      0.07       0.04          0.03    0.01
3    98.00    0.05      0.07       0.05          0.02    0.00
4    98.50    0.07      0.08       0.05          0.01    0.02
5    99.00    0.08      0.08       0.05          0.00    0.03
6    99.00    0.10      0.08       0.05          0.02    0.05
7    100.00   0.12      0.09       0.06          0.03    0.06
8    100.50   0.13      0.09       0.06          0.04    0.07
9    100.50   0.15      0.09       0.06          0.06    0.09
10   102.40   0.17      0.10       0.08          0.07    0.09

The empirical cumulative frequencies and the corresponding theoretical
cumulative frequencies of the normal and lognormal distributions are plotted in
Figure 4.12.

Figure 4.12 K-S tests to discriminate two distribution models
(cumulative frequency with normal and lognormal CDFs; x-axis: xn, y-axis: CDF)

The process of obtaining the empirical cumulative frequencies and the corresponding
theoretical frequencies is shown in the Excel spreadsheet named Example 4.3. The
maximum discrepancies between the empirical cumulative frequency and the normal
($D_1$) and lognormal ($D_2$) distributions are found with the Excel function MAX; the
results are $D_1$ = 0.08 and $D_2$ = 0.08.

At the 5% significance level with n = 60, the critical value $D_n^{\alpha}$ obtained from
Appendix Table A.4 is $D_{60}^{0.05}$ = 0.18.

Note that Excel does not provide a function for obtaining the critical value
$D_n^{\alpha}$. However, we can create a custom function for it using VBA.

Creating the custom functions using VBA

You can create your own VBA functions when Excel does not provide the function
you need. The syntax is as follows:

Function FunctionName (parameters)
' a block of code
FunctionName = return value
End Function

Having shown how a custom function is defined, let us see an example that
creates a function to calculate the critical value $D_n^{\alpha}$ using VBA.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code 4.2

**************************************************************

'Purpose: To create a custom function in an Excel worksheet that
'calculates the critical value of Dn(alpha) at significance level
'alpha when n > 50.

'Define variables:
'afa: level of significance
'n: the number of observations
'ks: function that returns the critical value

'****** Start of coding******************************************

Public Function ks(afa, n)
    If afa = 0.2 Then
        ks = 1.07 / n ^ 0.5
    ElseIf afa = 0.1 Then
        ks = 1.22 / n ^ 0.5
    ElseIf afa = 0.05 Then
        ks = 1.36 / n ^ 0.5
    ElseIf afa = 0.01 Then
        ks = 1.63 / n ^ 0.5
    Else
        'no coefficient is tabulated for this significance level
        ks = "no afa value"
    End If
End Function
'*********************************************end of coding********

This function uses two key techniques, whose roles are detailed as follows:

1) Creating the custom functions

In this example, the function's name is ks and its parameters are afa and n. When
afa is equal to 0.2, the function returns ks = 1.07 / n^0.5. Once the function has been
created, you can apply it by passing the parameters to it.

2) If - ElseIf (conditional)

This function has four conditions (afa = 0.2, 0.1, 0.05, 0.01) plus a fallback for
any other value, so If...ElseIf statements are used. If afa is equal to 0.2, the function
evaluates ks = 1.07 / n ^ 0.5. If afa is equal to 0.1, it evaluates ks = 1.22 / n ^ 0.5,
and so on.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note that this function is valid only when n is larger than 50.
At the 5% significance level with n = 60, the critical value $D_n^{\alpha}$ is obtained
from the custom function created in VBA as follows:

$D_{60}^{0.05}$ = ks (0.05, 60) = 0.18
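
The four branches of Code 4.2 share one form, which can be written out as a worked equation. The symbol $c_\alpha$ is introduced here only as shorthand for the coefficient that the code selects; the coefficients themselves are those in Code 4.2.

```latex
\[
  D_n^{\alpha} = \frac{c_\alpha}{\sqrt{n}} \quad (n > 50), \qquad
  c_{0.20} = 1.07,\; c_{0.10} = 1.22,\; c_{0.05} = 1.36,\; c_{0.01} = 1.63
\]
\[
  D_{60}^{0.05} = \frac{1.36}{\sqrt{60}} \approx \frac{1.36}{7.746} \approx 0.176 \approx 0.18
\]
```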

Since $D_1$ = 0.08 < 0.18 and $D_2$ = 0.08 < 0.18, both the normal and the lognormal
distribution are accepted at the 5% significance level. In the next example, we use the
K-S test to evaluate whether the students' scores in a midterm test come from the
normal, lognormal, or gamma distribution.

Example 4.4

Modeling System with Uncertainty is one of the required courses for CIVL
students at UST. Based on previous experience, analysts have assumed
that the students' scores follow the normal distribution before doing further analysis.
However, there is a lack of evidence to support this hypothesis. In this example, the
K-S test provides a quantitative procedure for testing the validity of three assumed
distribution models: normal, lognormal, and gamma. Table 4.5 shows part
of the test scores; the complete table can be found in the Excel file named Chapter 4,
in the spreadsheet named Example 4.4.

Table 4.5 CIVL 2160 midterm scores in spring 2012


Name ID Score Name ID Score
Jarry 20030001 88.00 Fred 20030011 64.00
Tom 20030002 82.00 Lily 20030012 88.00
Sarah 20030003 56.00 Karl 20030013 86.00
Jaccica 20030004 94.00 William 20030014 76.00
Megan 20030005 74.00 Charles 20030015 76.00
Nicole 20030006 80.00 Michael 20030016 68.00
Berry 20030007 68.00 Bill 20030017 68.00
Tommy 20030008 88.00 Gary 20030018 70.00
Lina 20030009 80.00 Tiffny 20030019 66.00
Lisa 20030010 76.00 Henry 20030020 96.00

[Solution]

As mentioned earlier, the empirical cumulative probability Sn(x) is calculated by
Eq. 4.3; the functions NORMDIST, LOGNORMDIST, and GAMMADIST are
used to obtain the corresponding theoretical cumulative distributions. Part of the
results is displayed in Table 4.6; the complete table can be found in the Excel file
named Chapter 4, in the spreadsheet named Example 4.4.

Table 4.6 Computations of Sn(x), FX(x), and Dn

No.                     Normal    Lognormal   Gamma     Dn
k    Scores   Sn(x)     FX(x)     FX(x)       FX(x)     Dn1     Dn2     Dn3
1    36.00    0.00      0.00      0.00        0.00      0.00    0.00    0.00
2    46.00    0.01      0.01      0.00        0.00      0.00    0.01    0.01
3    48.00    0.02      0.02      0.01        0.01      0.00    0.01    0.01
4    50.00    0.03      0.03      0.01        0.02      0.00    0.02    0.01
5    50.00    0.03      0.03      0.01        0.02      0.01    0.02    0.02
6    50.00    0.04      0.03      0.01        0.02      0.02    0.03    0.03
7    54.00    0.05      0.05      0.03        0.04      0.00    0.01    0.01
8    54.00    0.05      0.05      0.03        0.04      0.00    0.02    0.01

In Table 4.6, the first column shows the k value; the second column shows the
data rearranged in increasing order; the third column shows the experimental
cumulative frequency calculated using Eq. 4.3; the fourth, fifth, and sixth columns
show the corresponding cumulative probabilities from the normal, lognormal, and
gamma distributions, respectively; the seventh, eighth, and ninth columns show
the discrepancies between the experimental cumulative frequency and each of the
three theoretical distributions.

Figure 4.13 displays the empirical cumulative frequency function of the observed
data and the CDFs of the normal, lognormal, and gamma distributions. By visual
inspection, all three CDFs appear to fit the empirical cumulative frequency.

Figure 4.13 K-S test to discriminate three distribution models for midterm
scores (cumulative frequency with normal, lognormal, and gamma CDFs; x-axis: xn, y-axis: CDF)

From Table 4.6, we observe that the maximum discrepancies between the
empirical cumulative frequency and the normal ($D_1$), lognormal ($D_2$), and gamma ($D_3$)
distributions are 0.08, 0.11, and 0.10, respectively.

At the 5% significance level with n = 148, the critical value $D_n^{\alpha}$ obtained
from Appendix Table A.4 is $D_{148}^{0.05}$ = 0.11.

Since $D_1$ = 0.09 < 0.11, $D_2$ = 0.12 > 0.11, and $D_3$ = 0.108 < 0.11,
according to the K-S test the normal and gamma distributions are accepted
at the 5% significance level, whereas the lognormal distribution is
rejected because its maximum discrepancy is larger than the
critical value.

4.4 Summaries of Excel Functions

The Excel functions used in this chapter are summarized in Table 4.7. The
functions MAX and MIN are used to find the largest and smallest values of the
observations. The functions NORMDIST, LOGNORMDIST, and GAMMADIST are
used to obtain the theoretical cumulative frequencies. The function CHIINV is used to
obtain the critical value in the chi-squared test.

Table 4.7 Summaries of the built-in functions

FUNCTION       How it works?                                  Notes
MAX            This function returns the largest value        Ex. 4.1
               from a supplied set of numerical values.
MIN            This function returns the smallest value       Ex. 4.1
               from a supplied set of numerical values.
ABS            This function returns the absolute value       Ex. 4.2
               of a number.
SMALL          This function returns the k-th smallest        Ex. 4.1 & 4.2
               value in a data set.
FREQUENCY      This function returns a frequency              Ex. 4.1 & 4.2
               distribution as a vertical array.
CHIINV         This function returns the inverse of the       Ex. 4.1
               right-tailed probability of the chi-squared
               distribution.
NORMDIST       This function returns the normal               Ex. 4.1
               cumulative distribution.
LOGNORMDIST    This function returns the cumulative           Ex. 4.2 & 4.3
               lognormal distribution.
GAMMADIST      This function returns the gamma                Ex. 4.4
               distribution.

Excel's built-in functions can be supplemented with user-defined functions written
as VBA macros. Table 4.8 summarizes the user-defined functions used in this chapter.

Table 4.8 Summaries of the user defined functions

FUNCTION      How it works?
ks            It is used to calculate the D critical value.
newlogdist    It is used to obtain the cumulative lognormal distribution.

Monte Carlo Simulation

5.1 Introduction

In reality, many problems are difficult to solve analytically. For
instance, it is difficult to derive analytically the distribution function of an event
governed by two (or more) random variables that follow different distributions.
Under such conditions, we can apply a numerical approach. Monte Carlo simulation
(MCS) is widely used to solve problems containing uncertainties, and it also extends
the application of probability and statistical models. The fundamental idea of MCS is
to generate a large set of random numbers that follow prescribed probability
distributions.
In this chapter, some essentials of MCS are introduced, together with a
demonstration of the Central Limit Theorem using MCS.

5.2 Monte Carlo Simulation

The name Monte Carlo was first used in the 1940s by scientists developing
nuclear weapons at Los Alamos. Because the physicists involved in this
work were big fans of gambling, and Monte Carlo in Monaco was a center for gambling,
they gave the simulations the code name Monte Carlo. MCS can be used to generate a
large set of random numbers following prescribed probability distributions.
5: Monte Carlo Simulation 91

Definition
Monte Carlo simulation: Monte Carlo simulation is a method of artificially
recreating a chance process (usually with a computer), running it many times, and
then observing the results directly.

The main contribution of MCS is to provide numerical methods for solving
probabilistic problems that are difficult to solve analytically. MCS is
now used in many diverse fields. For instance, in commercial practice, many
companies use MCS as an important forecasting tool; in the field of
probability and statistics, it can be used to compute probabilities, expected values,
and other distribution characteristics.
MCS has several advantages. For instance, the algorithms are simple, and MCS
provides much more flexibility to try things out before building the actual system.
However, the method also has several disadvantages. For instance, the simulation
never corresponds fully to the actual system, so uncertainties and errors exist.
In addition, MCS requires a large number of calculations, so a simulation can take
a long time.
The Monte Carlo method is practical only with a computer. Statistics software
packages such as SAS, SPSS, MATLAB, and Excel have built-in procedures for
generating random variables from the most common distributions. In this
chapter, Excel-based random number generation is introduced.

5.2.1 Random Number Generation

MCS starts with generating random numbers with prescribed probability
distributions. There are two widely used methods for generating random numbers
in Excel: Excel functions and the random number generation tool.

Excel functions – RAND and RANDBETWEEN

Function RAND

The function RAND in Excel simulates a uniform distribution on the interval
from 0 to 1. The idea is to draw random numbers from the interval 0 to 1 with
every number equally likely to be chosen.

Syntax
= RAND ( )*

*RAND takes no arguments, but the empty parentheses must be provided.

The function RAND is not limited to drawing random numbers from 0 to 1. In
fact, starting from numbers drawn between 0 and 1, we can transform the uniform
random variables into other variables with a desired distribution. For instance,
numbers uniformly distributed between 0 and 10 can be generated by multiplying the
original numbers by 10. In addition, we can add 50 to make them range from 50 to 60.
Note that the random numbers are recalculated whenever you press F9.
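
The same scale-and-shift idea carries over to VBA, where the built-in Rnd function plays the role of RAND. The sketch below is illustrative; UniformBetween is a hypothetical name and is not part of the original workbook.

```vba
' Hypothetical helper: uniform random number on [lower, upper),
' obtained by rescaling Rnd, which returns a value in [0, 1).
Public Function UniformBetween(lower As Double, upper As Double) As Double
    ' multiply by the width of the interval, then shift by its lower end
    UniformBetween = lower + (upper - lower) * Rnd
End Function
```

Calling UniformBetween(50, 60) from a macro corresponds to the example in the text: multiply by 10, then add 50.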

Function RANDBETWEEN

The function RANDBETWEEN can also be used to generate random numbers. It
returns a random integer between the two numbers you specify.

Syntax
= RANDBETWEEN (bottom, top)

bottom: the lowest number you require
top: the highest number you require

The Random number generation tool

Although Excel contains built-in functions for obtaining random numbers, the
random number generation tool is much more flexible than the built-in
functions. To use this tool, choose Data → Analysis → Data Analysis, which opens
the dialog box displayed in Figure 5.1. The random variables are obtained by
choosing the type of distribution and then entering the relevant parameters.

Figure 5.1 Data analysis dialog box

Figure 5.2 shows the dialog box used for random number generation. The
parameter box varies, depending on the type of distribution that you select.

Figure 5.2 Dialog box for random number generation
(callouts: number of columns you want; number of rows you want; type of
distribution (8 available); starting value)

Having introduced the methods of generating random numbers, let us look at an
example. As MCS has its origins in games of chance, Example 5.1 is related to a
gambling game.

Example 5.1

A player may roll a die once after paying 60 dollars. The payout is equal to the
third power of the number of points rolled. For instance, the player gets eight
dollars when the number of points is two. Forecast whether the player can profit
from the game (suppose the number of trials is 1,000).

[Solution]

In reality, it is impractical to play the game 1,000 times before making a
decision. Under this condition, MCS becomes a powerful tool, as it is well suited to
problems involving repeated events.
As mentioned earlier, MCS starts with generating random numbers with the
prescribed probability distributions. The steps for solving this problem are as
follows:
1. Generate the random numbers
2. Obtain income
3. Obtain benefit
4. Make a decision

In this example, the income is the third power of the random number of points
that the player rolls, and the benefit is obtained from the formula: Benefit =
Income - Cost. After that, the average benefit E(x) can be calculated, and the
decision can be made. Both an Excel function based method and a VBA based method
are used to solve the problem following these steps.

Excel Solution

1. Generate the random numbers


As a die has six sides, the random numbers should be integers from one
to six. To generate them, you can use Excel functions (RAND and
RANDBETWEEN) or the random number generation tool.

1.1 Use the function RAND


The function RAND returns random numbers between 0 and 1.
Numbers uniformly distributed between 0 and 6 can then be generated by
multiplying the original numbers by 6. As the random number generated by the
function RAND ranges from 0 to 0.99999…, the product ranges from 0 to 5.99999…,
and adding 1 shifts the range to 1 to 6.99999….
The function INT then rounds the numbers down to the nearest integer, which
gives the integers from 1 to 6. Figure 5.3 shows the process of generating random
numbers with the functions RAND and INT.

Figure 5.3 Random numbers generation using the function RAND
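
The multiply, shift, and round-down steps can be collected into a single VBA expression, using the built-in functions Rnd and Int in place of RAND and INT. This is a minimal sketch; RollDie is a hypothetical name.

```vba
' Hypothetical helper: one roll of a fair six-sided die.
Public Function RollDie() As Long
    ' 6 * Rnd lies in [0, 6); adding 1 shifts it to [1, 7);
    ' Int rounds down, leaving an integer from 1 to 6
    RollDie = Int(6 * Rnd + 1)
End Function
```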

1.2 Use the function RANDBETWEEN

The function RANDBETWEEN can also be used to generate random numbers
from one to six. The main difference between this function and the function RAND
is that RANDBETWEEN returns only integers, which makes it the better choice in
this experiment. Figure 5.4 shows the process of generating random numbers with
the function RANDBETWEEN.

Figure 5.4 Random numbers generation using the function RANDBETWEEN

1.3 Use the random number generation tool

Figure 5.5 shows the Random Number Generation dialog box.

Figure 5.5 Dialog box for random number generation


The settings in Figure 5.5 place 1000 random numbers in column Q, with
values ranging from 1 to 6.999.

2. Obtain income

The function POWER can be used to calculate the income you could get.
Figure 5.6 shows the process of calculating the income using the function POWER.

Syntax:
= POWER (number, power)

number: the base number
power: the exponent to which the base number is raised

Figure 5.6 Income calculation using the function POWER

3. Obtain profit

According to the equation Profit = Income - Cost, the profit from each
roll can be obtained (shown in Figure 5.6).

Tips

Conditional Formatting

In this example, we use an Excel feature called conditional formatting, which is
a useful way to quickly identify particular types of cells. There are different types
of conditional formatting rules, and you can find them by choosing
Home → Styles → Conditional Formatting. Figure 5.7 shows some types of conditional
formatting rules.

Figure 5.7 Conditional Formatting

In Example 5.1, our objective is to display the negative and positive values in
different colors. The first step is to select all the cells that you want to format, and
then choose Highlight Cells Rules. After you enter 0 into the box, the values greater
than 0 are highlighted.

Figure 5.8 One of several different conditional formatting dialog boxes


Besides the rules suggested by Excel, you can make your own rules by choosing
Home → Styles → Conditional Formatting → New Rule. Figure 5.9 shows the New
Formatting Rule dialog box.

Figure 5.9 The new formatting rule dialog box

4. Make a decision
As the income and benefit have already been calculated in the steps above, the
average benefit E(x) can be obtained by E(x) = AVERAGE (G6 : G1005) = 13.12.
The complete calculation process is presented in the Excel file named Chapter 5, in
the spreadsheet named Example 5.1.
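
The simulated average can be checked against the exact expectation, since each face of a fair die has probability 1/6. The worked equation below is a cross-check added for illustration and is not part of the original spreadsheet; $d$ denotes the number of points rolled.

```latex
\[
  E(x) = \frac{1}{6}\sum_{d=1}^{6} d^{3} - 60
       = \frac{1 + 8 + 27 + 64 + 125 + 216}{6} - 60
       = \frac{441}{6} - 60 = 13.5
\]
```

The simulated value of 13.12 from 1,000 trials is close to this exact value; the gap is sampling error, which shrinks as the number of trials grows.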

VBA Solution

Another way to create the random numbers and obtain the profit is to use
VBA.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code 5.1
**************************************************************

'Purpose: To create random numbers from 1 to 6 and then obtain
'the profit using the functions POWER, INT, RND.
'Define variables:
'n: the number of trials

'******start of coding******************************************

Sub example1()
    'clear the results of any previous run
    Range("N6:O60000").ClearContents
    n = InputBox("n:", "numbers of trials")
    Range("R6").Value = n

    For i = 1 To n
        'record the trial number in column N
        Range("N6").Cells(i, 1) = i

        'generate an integer random number from 1 to 6 and obtain
        'the profit by the equation Benefits = Income - Cost
        Range("N6").Cells(i, 2).Value = _
            Application.WorksheetFunction.Power(Int((6 - 1 + 1) * Rnd + 1), 3) - 60
    Next

End Sub
'*********************************************end of coding************

This macro uses six key techniques, whose roles are detailed as follows:

1. For - Next (loop)

The For - Next loop is one of the most widely used looping structures in VBA. It
repeats a group of statements a specified number of times. The basic
syntax is:
For counter = start To end
[instructions]
Next [counter]

In this example, the counter starts from 1 and ends with n, so the loop is executed
n times in total.

2. Define Cells
With the Cells property, a cell is not addressed by a name such as "A1" or "B2".
Instead, you specify a row index and a column index. For instance, to return the
spreadsheet cell B4, you can write Cells(4, "B") or Cells(4, 2).

3. Cells and Range


A Range object represents a single cell or a range of cells, and it is very important
in VBA programming. You can apply the Cells property to a Range object. For
instance, without a Range object, Cells(4, 2) returns the spreadsheet cell B4, because
the reference is relative to the entire worksheet. However,
Range("C2").Cells(4, 2) returns the spreadsheet cell D5.
In this example, the code Range("N6").Cells(1, 1) = 1 may look confusing. The
value 1 is placed in N6, not A1, because the Cells property uses references relative to
the selected range.
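
The relative addressing can be confirmed by printing cell addresses to the Immediate window. The short macro below is an illustrative sketch (CellsWithinRangeDemo is a hypothetical name) and assumes a worksheet is active.

```vba
Sub CellsWithinRangeDemo()
    ' Cells on its own is relative to the whole worksheet: row 4, column 2
    Debug.Print Cells(4, 2).Address               ' $B$4
    ' Cells applied to a Range is relative to that range's top-left cell
    Debug.Print Range("C2").Cells(4, 2).Address   ' $D$5
    Debug.Print Range("N6").Cells(1, 1).Address   ' $N$6
End Sub
```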

4. ClearContents
This statement clears the contents of the selected range of cells, here the range
from N6 to O60000.

5. Inputbox
The input box is used here to enter the number of random numbers you want to
simulate. The syntax is as follows:
n = InputBox("n:", "numbers of trials")

"n:" : required, the text displayed in the input box
"numbers of trials" : optional, the caption of the input box window

6. Rnd
Rnd is a built-in VBA function that returns a random number greater than or
equal to 0 and less than 1.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

According to the previous calculation, E(x) > 0. Therefore, the player can expect
to make a profit on average, although a modest one.

5.2.2 Trials Confirmation

One of the fundamental questions before running a simulation is how many
trials are sufficient for a complex model. Generally speaking, the larger
the sample size, the higher the reliability of the results. The accuracy is measured in
terms of the C.O.V. (coefficient of variation). The following example shows the
influence of the sample size on the simulation.

Example 5.2

Suppose a particular population’s Emotion Quotient (EQ) and Intelligence
Quotient (IQ) follow the normal distribution, where EQ ~ N(150, 30) and IQ ~ N(100,
25). T is the sum of the two normal variates.
1) Find the probability that T = E + I is larger than 300 using the analytical
solution.
2) Use the MCS technique to obtain the result, and compare the results when n is
equal to 10, 15, 100, and 1000.

[Solution]

1)

Let µe = 150, σe = 30 and µi = 100, σi = 25. As both EQ and IQ follow the
normal distribution, the sum of the two normal variates is also a normal variate
(EQ and IQ are assumed here to be independent). Under
this condition, an analytical solution is available for this problem.
The mean (µt) and the variance (σt²) can be obtained as follows:

µt = µe + µi = 150 + 100 = 250
σt² = σe² + σi² = 30² + 25² = 1525, so σt = √1525 ≈ 39.05

The function NORMDIST can be used to calculate the probability that T > 300.
Pr(T > 300) = 1 − NORMDIST(300, 250, 39.05, TRUE) = 0.1

Hence, the probability that T = E + I is larger than 300 is equal to 0.1.
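The analytical calculation above can also be wrapped in a small user-defined function. This is a sketch under the assumption that the two normal variables are independent; the function name ProbSumExceeds and its argument names are hypothetical and are not part of the Chapter 5 workbook.

```vba
' Probability that the sum of two independent normal variables exceeds a threshold.
' (Independence is an assumption; the function name is illustrative only.)
Function ProbSumExceeds(ByVal mu1 As Double, ByVal sd1 As Double, _
                        ByVal mu2 As Double, ByVal sd2 As Double, _
                        ByVal threshold As Double) As Double
    Dim muT As Double, sdT As Double

    muT = mu1 + mu2                      ' mean of the sum
    sdT = Sqr(sd1 ^ 2 + sd2 ^ 2)         ' sd of the sum (variances add)

    ProbSumExceeds = 1 - Application.WorksheetFunction.NormDist(threshold, muT, sdT, True)
End Function
```

Entered in a worksheet cell as =ProbSumExceeds(150, 30, 100, 25, 300), it should return a value of about 0.10, in line with the hand calculation.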

2)

Generally speaking, there are two steps to accomplish this problem:


1. Generate the random numbers
2. Calculate the probability

1. Generate the random numbers

As mentioned earlier, the key to MCS is generating the random numbers. In this
experiment, the purpose is to generate two series of random numbers that follow
N(150, 30) and N(100, 25). First, random numbers between 0 and 1 are
generated using the function RAND, and then these uniform random numbers are
transformed to normally distributed numbers using the function NORMINV.
Figure 5.10 shows the process of generating the uniform random numbers with the
function RAND.

Figure 5.10 The random numbers generation using the function RAND

After these uniform random numbers are obtained, they can be transformed
to normally distributed numbers. Figure 5.11 shows the formula that generates the
random numbers with the function NORMINV.

Figure 5.11 Random numbers generation using the function NORMINV



The Random Number Generation tool can also be used to generate random
numbers. Figure 5.12 shows the process of generating random numbers that follow the
normal distribution.

Figure 5.12 Random number generation – the normal type

In Example 5.2, the random numbers are arranged in one column with 100 rows, with
µ = 100 and σ = 25, and the output is placed in column Q.

2. Calculate the probability

According to Eq. 2.1, Pr(event) = sample points / sample space, the probability
that the sum of these two random variables is larger than 300 can be obtained. The
function COUNTIF can be used to count the sample points. Table 5.1 shows the
probability that T > 300 when n is equal to 10, 15, 100, and 1000. The complete
calculation process is presented in the Excel file named Chapter 5, in the spreadsheet
named Example 5.2.

Table 5.1 Results in Example 5.2


n Pr (T >300)
10 0
15 0.267
100 0.104
1000 0.104

In this example, we can observe that when the number of trials is small, such as
10 or 15, the results are far from the analytical result and change easily when
F9 is pressed to recalculate. However, when the number of trials is large, such as 1000, the
probability is approximately equal to 0.104, which is much closer to the analytical result.
Furthermore, the results are more stable when the sample size is large.
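The effect of the number of trials can also be explored in VBA. The following is a minimal sketch, not the workbook's own macro: the names CompareTrialCounts and DrawNormal are hypothetical, and the C.O.V. is computed with the formula sqrt((1 − p)/(n·p)), which is a common choice for an estimated probability but is an assumption here because the text does not state which formula it uses.

```vba
' Estimate Pr(T > 300) for several numbers of trials and print the
' estimate together with its approximate C.O.V. to the Immediate window.
Sub CompareTrialCounts()
    Dim sizes As Variant
    Dim k As Long, i As Long, n As Long, hits As Long
    Dim p As Double

    sizes = Array(10, 15, 100, 1000)
    Randomize                                   ' new random sequence on each run

    For k = LBound(sizes) To UBound(sizes)
        n = sizes(k)
        hits = 0
        For i = 1 To n
            ' T = EQ + IQ with EQ ~ N(150, 30) and IQ ~ N(100, 25)
            If DrawNormal(150, 30) + DrawNormal(100, 25) > 300 Then hits = hits + 1
        Next i
        p = hits / n
        If hits > 0 Then
            Debug.Print n, p, Sqr((1 - p) / (n * p))   ' assumed C.O.V. formula
        Else
            Debug.Print n, p, "C.O.V. undefined (no hits)"
        End If
    Next k
End Sub

' One normal random number by inverse transform, as NORMINV(RAND(), mean, sd).
Private Function DrawNormal(ByVal mn As Double, ByVal sd As Double) As Double
    Dim u As Double
    Do
        u = Rnd
    Loop While u = 0        ' NormInv needs a probability strictly greater than 0
    DrawNormal = Application.WorksheetFunction.NormInv(u, mn, sd)
End Function
```

Running the macro several times should show the estimates for n = 10 and n = 15 jumping around, while the estimate for n = 1000 stays comparatively close to 0.1.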

In Example 5.2, as both EQ and IQ follow the normal distribution, the sum of the
two normal variates is also a normal variate. However, if the variables follow
different distributions, the type of distribution that their sum follows
cannot readily be determined. As mentioned earlier, one of the advantages of MCS is that it can
solve probabilistic problems that are impossible to solve by analytical methods.
Example 5.3 uses the Monte Carlo method to solve a problem
that has no readily available analytical solution.

Example 5.3

The midterm scores follow M ~ N(185, 30) and the final scores follow F ~ U(30, 185). How
likely is a student to have a total score (T = M + F) greater than 300?
1) What model does T follow?
2) Use simulation (MCS) to find the probability that T > 300.

[Solution]

1)

For the first question, the model that T follows cannot readily be determined, as the
distributions of the midterm and final scores are different. Under this condition, the
analytical solution cannot be used to solve this problem.

2)

Two major steps to obtain Pr (T >300) by MCS are:


1. Generate the random numbers
2. Obtain the probability

1. Generate the random numbers

In this example, the functions RAND and NORMINV are used to generate random numbers
that follow the normal distribution with mean 185 and standard deviation 30.
The syntax is NORMINV(RAND(), mean, sd). Figure 5.13
shows the procedure to simulate a series of normal random variables.

Figure 5.13 Simulating the normal random variables

The final scores follow the uniform distribution with smallest and largest
values of 30 and 185, respectively. The function RAND can be used to generate
the random numbers. The syntax is: RAND()*(upper limit – lower limit) + lower limit.
Figure 5.14 shows the procedure to simulate a series of uniform variables.

Figure 5.14 Simulating the uniform variables



The complete calculation process is presented in Excel file named Chapter 5 with
the spreadsheet named Example 5.3.

Another way to generate the random numbers in this example is to use VBA.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code 5.2

**************************************************************

'Purpose: To generate random numbers using the functions Rnd and NormInv

'Define variables:

'n: the number of trials

*********start of coding**********************************************************
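Sub example3()
    Dim n As Long, i As Long

    Range("M12:P60000").ClearContents
    n = CLng(Val(InputBox("n", "n")))    ' number of trials (0 if the box is cancelled)
    Range("K5").Value = n

    For i = 1 To n
        Range("M12").Cells(i, 1).Value = i
        ' Midterm score: normal with mean 185 and sd 30
        Range("N12").Cells(i, 1).Value = _
            Application.WorksheetFunction.NormInv(Rnd, 185, 30)
        ' Final score: uniform between 30 and 185
        Range("O12").Cells(i, 1).Value = Rnd() * (185 - 30) + 30
    Next i

End Sub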

*****************************************************end of coding***************

This macro uses four key techniques. The role of each in this macro is
detailed as follows:
1. For...Next (loop)
This structure loops n times to generate the random numbers.
2. ClearContents
This statement clears the range of cells from M12 to P60000.
3. InputBox
The input box is used to enter the number of trials you want to simulate.
4. Rnd
Rnd is a built-in VBA function that returns a random number between 0 and 1.
You can call Rnd directly to generate random numbers
that follow the uniform distribution; it is unnecessary to prefix it with
“Application.WorksheetFunction.”
*********************************************************************

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

2. Obtain the probability

To find the probability that the sum of these two random variables is larger
than 300, the number of sample points and the size of the sample space should be obtained. The function
COUNTIF can be used to count the sample points that satisfy T > 300. For each trial, if
the total score is larger than 300, the trial is counted as 1; otherwise, it is counted as
0.
In a run with 500 trials, 135 totals exceeded 300. Hence, Pr(T > 300) = 135/500 = 0.27. The
complete calculation process is presented in the Excel file named Chapter 5, in the
spreadsheet named Example 5.3.

5.3 Central Limit Theorem

If you read carefully, you will find that nearly all
probability textbooks mention the Central Limit Theorem.
However, the theoretical proof is often skipped in these textbooks, with
a footnote such as: “Although we do not concern ourselves here with why the Central
Limit Theorem works, you need to understand why the veracity of this theorem is so
important.” This not only makes the Central Limit Theorem less
comprehensible and more mysterious to students, but also undermines the purpose of
college education by encouraging surface learning. In this situation, MCS provides a
numerical simulation technique to demonstrate the Central Limit Theorem.
The Central Limit Theorem is one of the most remarkable theorems in
probability and statistics. Its essence is as follows: when the sample size n is
large enough (say n > 30), regardless of the particular distribution type of X, the
sample mean $\bar{X}$ approximately follows a normal distribution with mean µ and
standard deviation $\sigma_X/\sqrt{n}$.

Definition
Central Limit Theorem: For any population with mean $\mu_X$ and standard deviation
$\sigma_X$, the distribution of the sample means for sample size n will have a mean of $\mu_X$ and a
standard deviation of $\sigma_X/\sqrt{n}$, and will approach a normal distribution:

$$\bar{X} \sim N\left(\mu_X,\ \sigma_X/\sqrt{n}\right) \tag{5.1}$$

MCS can be used to demonstrate the Central Limit Theorem numerically, and the detailed
procedures are shown in Example 5.4.

Example 5.4
From Example 4.4, we have already concluded that the normal
distribution is an appropriate model for the students’ midterm scores. In this
example, the mean and standard deviation are equal to 73.92 and 12.22, respectively.
Suppose the sample size is equal to 100 and the number of simulations is equal to 80;
demonstrate whether the sample mean follows N(73.92, 12.22/10).

[Solution]
Draw the random numbers
The functions RAND and NORMINV can be joined together to generate the
random numbers N (73.92, 12.22):
= NORMINV( RAND(), 73.92, 12.22)

The summarized statistics data are shown in Table 5.2. The complete calculation
process is presented in Excel file named Chapter 5 with the spreadsheet named
Example 5.4.

Table 5.2 Summary of the statistical data


Statistical Data
Sample Size n 100
No. of Simulation 80
Mean of X 73.92
SD of X 12.22
Sample Mean (CLT) 73.85
SD of Sample Mean (CLT) 1.34

To draw the histogram, the function FREQUENCY can be used, as
mentioned in Chapter 4. After getting the related statistics such as bins and frequency,
the histogram can be obtained. The complete procedures to draw the histogram are
presented in the Excel file named Chapter 5, in the spreadsheet named Example 5.4.

Figure 5.15 shows the distribution of the students’ scores. According to Figure 5.15,
we can observe that the mean of the students’ scores approximately follows the
normal distribution.

[Figure: Distribution of Students' Score. Horizontal axis: scores (71.00 to 78.00); vertical axis: probability; series: MCS and CLT]

Figure 5.15 The histogram of the sum of students’ scores

According to the Central Limit Theorem, $\bar{X} \sim N(\mu_X,\ \sigma_X/\sqrt{n})$, the theoretical
mean value is $\mu_{\bar{X}} = 73.92$, and the standard deviation is equal to
$\sigma_X/\sqrt{n} = 12.22/\sqrt{100} = 1.22$.
After simulation, the mean and standard deviation are calculated as $\mu = 73.85$ and
$\sigma = 1.34$, which are similar to the theoretical values.


Hence, we can draw a conclusion that the distribution of sample mean
approaches the normal distribution.

According to the Central Limit Theorem, this holds not just for the normal
distribution: the sample mean $\bar{X}$ of other distributions, such as the uniform
and gamma distributions, also approximately follows the normal distribution. Example 5.5
demonstrates this.

Example 5.5

Suppose that µ = 100 and σ = 10. Use these two parameters to simulate and
test whether the sample means follow the normal distribution when the random
variable X follows the uniform and the normal distribution, respectively.

[Solution]

MCS, integrated with Excel-based programming (VBA), is a powerful tool to
enhance students’ understanding of the Central Limit Theorem. In this example, VBA
will be used to generate random numbers, calculate essential statistics, and draw the
corresponding figures.

1. Generate the random variables

In VBA programming, you can use Excel’s worksheet functions directly in your
VBA code, which is convenient and fast. In this
example, the functions AVERAGE, STDEV, and NORMINV are
used to obtain the random variables and the essential statistics.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code 5.3

**************************************************************

'Purpose: To generate random numbers that follow the uniform
'distribution, calculate the mean and sd, and then generate a
'series of numbers that can be used to draw the figures.

'Define variables:

'n: the number of trials

*********start of coding************************************************
Sub uniform_()
    Dim mn As Double, sd As Double, a As Double, b As Double, aa As Double
    Dim sz As Long, n As Long, i As Long, j As Long
    Dim s() As Double

    ' 1
    mn = Range("H11").Value
    sd = Range("H12").Value
    sz = Range("H9").Value     ' sample size
    n = Range("H10").Value     ' the number of simulations

    ' 2
    b = mn + sd / 2 * 12 ^ 0.5
    a = 2 * mn - b

    Range("H21:H50000").ClearContents

    ' 3
    ReDim s(1 To n)
    For j = 1 To n
        aa = 0
        For i = 1 To sz
            aa = aa + Rnd * (b - a) + a
        Next i
        s(j) = aa / sz
        Range("G21").Cells(j, 1).Value = j
        Range("G21").Cells(j, 2).Value = s(j)
    Next j

    ' 4
    Range("H15").Value = Application.WorksheetFunction.Average(s)
    Range("H16").Value = Application.WorksheetFunction.StDev(s)

    ' Calculate the values of the uniform distribution using the original
    ' parameters, which can be used to draw the figure.
    Range("O21:P1000").ClearContents

    ' 5
    Range("O21").Value = a
    Range("O22").Value = a
    Range("O23").Value = b
    Range("O24").Value = b
    Range("P21").Value = 0
    Range("P22").Value = 1 / (b - a)
    Range("P23").Value = 1 / (b - a)
    Range("P24").Value = 0

End Sub
**************************************end of coding*************

Comments

1. Read the inputs.
2. Obtain the values of a and b from the equations:
   $\mu_X = \dfrac{a+b}{2}, \qquad \sigma_X^2 = \dfrac{(b-a)^2}{12}$
3. Sum sz random numbers that follow U(a, b), and then obtain the average value s(j)
   using s(j) = aa / sz. This process is executed n times using the For...Next structure.
   Because aa is reset to 0 in the third line of block 3, the first pass of the inner
   loop gives aa = 0 + Rnd*(b-a)+a, so aa equals the first random number; each later
   pass adds one more random number.
4. Obtain the sample mean and sd from the simulation results. Note that the
   s in the functions Average(s) and StDev(s) is the vector containing the n
   simulated sample means, s(1) to s(n).
5. Calculate the values of the uniform distribution using the original
   parameters, which can be used to draw the figure.
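As a worked check of the two equations in comment 2, the limits a and b can be written directly in terms of the mean and standard deviation, which is what block 2 of the code computes. The numerical line assumes that the inputs are µ = 100 and σ = 10 as stated in Example 5.5.

```latex
\begin{aligned}
\mu_X &= \frac{a+b}{2}, \qquad \sigma_X^2 = \frac{(b-a)^2}{12}
  \;\Rightarrow\; b - a = \sqrt{12}\,\sigma_X \\
b &= \mu_X + \frac{\sqrt{12}}{2}\,\sigma_X = \mu_X + \sqrt{3}\,\sigma_X, \qquad
a = 2\mu_X - b = \mu_X - \sqrt{3}\,\sigma_X \\
\mu_X = 100,\ \sigma_X = 10: &\quad b \approx 117.32, \qquad a \approx 82.68
\end{aligned}
```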

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code 5.4

**************************************************************

'Purpose: To generate random numbers that follow the normal
'distribution, calculate their mean and sd, and generate
'a series of numbers that can be used to draw the figure.

'Define variables:

'n: the number of trials

*********start of coding************************************************

Sub normal_()
    Dim mn As Double, sd As Double, aa As Double, kk As Double
    Dim sz As Long, n As Long, i As Long, j As Long
    Dim s() As Double

    mn = Range("H11").Value
    sd = Range("H12").Value
    sz = Range("H9").Value     ' sample size
    n = Range("H10").Value     ' the number of simulations

    Range("H21:H50000").ClearContents

    ReDim s(1 To n)
    For j = 1 To n
        aa = 0
        For i = 1 To sz
            aa = aa + Application.WorksheetFunction.NormInv(Rnd, mn, sd)
        Next i
        s(j) = aa / sz
        Range("G21").Cells(j, 1).Value = j
        Range("G21").Cells(j, 2).Value = s(j)
    Next j

    Range("H15").Value = Application.WorksheetFunction.Average(s)
    Range("H16").Value = Application.WorksheetFunction.StDev(s)

    ' Calculate the values of the normal density using the
    ' original parameters, which can be used to draw the figure.
    Range("O21:P1000").ClearContents
    n = 100                    ' number of points on the plotted curve
    For i = 1 To n
        kk = mn - 3 * sd + 6 * sd / n * i
        Range("O21").Cells(i, 1).Value = kk
        Range("O21").Cells(i, 2).Value = _
            Application.WorksheetFunction.NormDist(kk, mn, sd, False)
    Next i

End Sub
***********************************************end of coding***********

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code 5.4 is very similar to Code 5.3; the only change is the
distribution that the random numbers follow. Therefore, we do not explain the code in detail here.

2. Histogram

The second step is to plot the histogram. Before plotting the histogram, the
relevant data, such as the class boundaries and bins, should be prepared using VBA.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code 5.5

**************************************************************

'Purpose: To obtain the statistics required for drawing the
'histogram, including the bins, the frequency, and the theoretical
'frequency from the normal cumulative distribution.
'Define variables:

'n : the number of trials

*******start of coding******************************************

Sub histo_()
    Dim bsize As Double, sd As Double, mn As Double, lowbp As Double
    Dim k As Long, n As Long, i As Long
    Dim ll() As Double, uu() As Double, freq() As Double
    Dim dataRange As Range

    Range("K21:M1000").ClearContents

    ' 1
    bsize = Range("L11").Value
    sd = Range("H14").Value
    mn = Range("H13").Value
    k = Range("H10").Value

    ' 2 (3sd rule)
    lowbp = mn - 3 * sd
    n = Int(6 * sd / bsize)

    ' 3
    ReDim ll(1 To n)
    ReDim uu(1 To n)
    ReDim freq(1 To n)

    ' 4
    Set dataRange = Range("H21:H" & k + 500)
    For i = 1 To n
        ll(i) = lowbp + bsize * (i - 1)
        uu(i) = lowbp + bsize * i
        ' number of values in the bin [ll(i), uu(i))
        freq(i) = Application.WorksheetFunction.CountIf(dataRange, ">=" & ll(i)) - _
                  Application.WorksheetFunction.CountIf(dataRange, ">=" & uu(i))

        Range("K21").Cells(i, 1).Value = ll(i) / 2 + uu(i) / 2
        Range("K21").Cells(i, 2).Value = freq(i) / k
        Range("K21").Cells(i, 3).Value = _
            Application.WorksheetFunction.NormDist(uu(i), mn, sd, True) - _
            Application.WorksheetFunction.NormDist(ll(i), mn, sd, True)
    Next i

End Sub
***********************************************end of coding*******

Comments
1. Read the inputs.
2. Define the parameters that are used in the further calculation.
3. ReDim statement: the ReDim statement is used to size or resize a dynamic array.
4. Obtain the bins, the frequency, and the theoretical frequency.
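Block 4 of Code 5.5 can be summarized in formulas. In this sketch, h stands for the bin size (bsize), k for the number of simulated sample means, and F for the normal CDF with mean mn and standard deviation sd; the symbols h, F, c_i, f_i, and p_i are introduced here only for illustration and do not appear in the workbook.

```latex
\begin{aligned}
l_i &= (\mu - 3\sigma) + (i-1)\,h, \qquad u_i = l_i + h, \qquad i = 1, \dots, n, \quad n = \frac{6\sigma}{h} \\
c_i &= \tfrac{1}{2}\,(l_i + u_i) && \text{(bin midpoint, column K)} \\
f_i &= \frac{\#\{x \ge l_i\} - \#\{x \ge u_i\}}{k} && \text{(simulated relative frequency, column L)} \\
p_i &= F(u_i) - F(l_i) && \text{(theoretical frequency, column M)}
\end{aligned}
```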


*********************************************************************

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Figures 5.16 and 5.17 show the distributions of X and $\bar{X}$ for the uniform and
normal distributions, respectively.

Figure 5.16 The distributions of X and $\bar{X}$ (uniform distribution)

Figure 5.17 The distributions of X and $\bar{X}$ (normal distribution)



From Figures 5.16 and 5.17, we can conclude that no matter which
distribution X follows, $\bar{X}$ approximately follows the normal distribution.
You can run the simulation by selecting one of the probability distributions listed
in the combo box; a series of random sample means is then numerically sampled,
with both text and graphical outputs instantly given on the spreadsheet.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code 5.6

**************************************************************

‘Purpose: let us choose the distribution we need.


‘Define variables:

‘n : the number of trials

*****start of coding********************************************


Sub run_Chapter5()
    Dim a As Variant, fname As String

    a = Range("B4").Value          ' selection: 1 = uniform, 2 = normal
    fname = ActiveWorkbook.Name

    If a = 1 Then
        Application.Run "'" & fname & "'!uniform_"
    ElseIf a = 2 Then
        Application.Run "'" & fname & "'!normal_"
    End If

    ' The histogram macro in Code 5.5 is named histo_
    Application.Run "'" & fname & "'!histo_"

End Sub
**************************************end of coding********************

Comments
Application.Run is used to call the macro for the distribution we choose, followed by the histogram macro.
*********************************************************************

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can see a form control in the Excel spreadsheet named Example 5.5. The
reason to use controls on a worksheet is to make it easier for the user to provide
input. For the control itself, you do not have to write any macro, because you can link
a control to a worksheet cell. You can access the form controls by choosing Developer →
Controls → Insert. In this example, a combo box is used as the form control.
We will introduce the essentials of the controls in Chapter 7.

Figure 5.18 Example of using Form Controls

5.4 Summaries of Excel Functions

Excel functions which are used in this chapter are summarized in Table 5.3. The
functions RAND and RANDBETWEEN are used to generate random numbers; the
function NORMINV is used to transform the random numbers to follow the normal
distribution in this chapter.

Table 5.3 Summaries of the built-in functions


FUNCTION       How it works?                                                               Notes
FREQUENCY      It returns a frequency distribution as a vertical array.                    Ex. 5.4
NORMINV        It returns the inverse of the normal cumulative distribution.               Ex. 5.2
RAND           It can be used to generate random numbers from 0 to 1.                      Ex. 5.1 & 5.4
POWER          It returns the result of a number raised to a power.                        Ex. 5.1
AVERAGE        It returns the average of its arguments.                                    Ex. 5.1 & 5.2
STDEV          It can be used to calculate the standard deviation.                         Ex. 5.2
COUNTIF        It can be used to count the number of cells that meet your specification.   Ex. 5.2
INT            It can be used to round numbers down to the nearest integer.                Ex. 5.1
RANDBETWEEN    It can be used to generate integer random numbers.                          Ex. 5.1

Statistical Inferences from Observational Data

6.1 Introduction

In the previous chapters, we observed that once we know the probability functions
(PDFs or CDFs) and the values of the parameters, such as the mean and variance, we
can obtain the probability of an event. The process of estimating the parameters and
selecting the appropriate distribution is based on available observational data. To
estimate the parameters and infer the appropriate distribution of a population,
the ideal way is to investigate every single member of the population. However, it is
usually difficult or impossible to investigate the entire group. Alternatively, we may examine
only a small part of the population, which is called a sample. The process of
obtaining the sample is called sampling.
The features of a population can be inferred from samples drawn from
that population. The process of inferring the features of a population from the results
found in a sample is known as statistical inference. For instance, 100 students are
randomly chosen to estimate the average height of UST’s students, or 20 toys are
randomly selected to estimate the defect rate of a batch of toys. The 100 students
and 20 toys here are samples that are randomly chosen to infer the population
features.
In this chapter, the fundamentals of point and interval estimation are
introduced, together with some relevant Excel functions.

6.2 Point Estimation

Given a parameter of interest, such as a population mean, the objective of point
estimation is to use sample data to infer the true value of the parameter, which is
denoted by the Greek letter θ.

Definition
Estimator: The equation used to estimate the population parameter θ .
Point estimate: A single number that can be used as a sensible value for θ .

For instance, students in UST are randomly chosen to measure their heights, with
sample size n = 3: x1 = 165 cm, x2 = 170 cm, and x3 = 172 cm. The
sample mean $\bar{x}$ = (165 + 170 + 172)/3 = 169 cm. In this question, the estimator used
to obtain the point estimate of µ is $\bar{X}$, and the point estimate is the value $\bar{x}$ of
$\bar{X}$, which is equal to 169 cm.

Unbiased Estimator

Most of the time, there will be more than one estimator. Reconsidering the example
above, the estimator $\bar{X}$ is used to estimate µ, where $\bar{x}$ is equal to 169
cm; the estimator $\tilde{X}$, the average of the smallest and largest observations, can
also be used to estimate µ, where the estimate
$\tilde{x} = \dfrac{165 + 172}{2} = 168.5$ cm. Unbiasedness is one of the most important factors when
choosing an estimator. In addition, consistency, efficiency, and sufficiency are also
important features in judging the goodness of an estimator.

Definition
Unbiased estimator: A point estimator $\hat{\theta}$ is said to be an unbiased estimator of
θ if $E(\hat{\theta}) = \theta$ for every possible value of θ. If $\hat{\theta}$ is biased, the difference
$E(\hat{\theta}) - \theta$ is called the bias of $\hat{\theta}$.

Suppose there are two estimators: $\hat{\theta}_1 = \bar{X} - 2$ and $\hat{\theta}_2 = \bar{X}$. We obtain
$E(\hat{\theta}_1) = E(\bar{X} - 2) = \mu - 2$ and $E(\hat{\theta}_2) = E(\bar{X}) = \mu$. According to the definition of the
unbiased estimator, if the expected value of an estimator is equal to the parameter, the
estimator is said to be unbiased. The point estimator $\hat{\theta}_1$ is a biased estimator, as
$E(\hat{\theta}_1) = \mu - 2 \neq \mu$, and the point estimator $\hat{\theta}_2$ is an unbiased estimator, as $E(\hat{\theta}_2) = \mu$.
After introducing the essentials of point estimation, we describe one
important method for obtaining point estimates: the method of
moments. The parameters of a distribution may be determined by first estimating the
mean and variance of the random variable; this process is the basis of the method of
moments.

The sample mean $\bar{X}$ and the sample variance $S^2$ are unbiased estimators of the
population mean µ and variance σ².

For the sample mean,
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{6.1}$$

For the sample variance,
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{6.2}$$

The following section demonstrates that the sample mean $\bar{X}$ and the sample variance $S^2$
are unbiased estimators.

1. The sample mean $\bar{X}$ is an unbiased estimator of the population mean µ.

[Proof]
$$E(\bar{X}) = E\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n}\{E(x_1) + E(x_2) + \dots + E(x_n)\} = \frac{1}{n}(\mu + \mu + \dots + \mu) = \mu$$
Therefore, $\bar{X}$ is an unbiased estimator.
Suppose another estimator $\theta_2 = \dfrac{1}{n-1}\sum_{i=1}^{n} x_i$. Then
$$E(\theta_2) = E\left(\frac{1}{n-1}\sum_{i=1}^{n} x_i\right) = \frac{n\mu}{n-1} \neq \mu,$$
so $\theta_2$ is not an unbiased estimator.



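2. The sample variance $S^2$ is an unbiased estimator of the population variance σ².

[Proof]
For any rv Y, $V(Y) = E(Y^2) - [E(Y)]^2$, so $E(Y^2) = V(Y) + [E(Y)]^2$.
Applying this to
$$S^2 = \frac{1}{n-1}\left[\sum X_i^2 - \frac{\left(\sum X_i\right)^2}{n}\right]$$
gives (the $X_i$ in a random sample are independent, so $V(\sum X_i) = n\sigma^2$)
$$E(S^2) = \frac{1}{n-1}\left\{\sum E(X_i^2) - \frac{1}{n}E\left[\left(\sum X_i\right)^2\right]\right\}$$
$$= \frac{1}{n-1}\left\{\sum(\sigma^2 + \mu^2) - \frac{1}{n}\left[V\left(\sum X_i\right) + \left(E\left(\sum X_i\right)\right)^2\right]\right\}$$
$$= \frac{1}{n-1}\left\{n\sigma^2 + n\mu^2 - \frac{1}{n}\,n\sigma^2 - \frac{1}{n}(n\mu)^2\right\}$$
$$= \frac{1}{n-1}\left\{n\sigma^2 - \sigma^2\right\} = \sigma^2$$

As $E(S^2) = \sigma^2$, the estimator is unbiased. However, if the divisor n is used instead of n – 1, the
expected value of the resulting estimator $S'^2$ is:
$$E(S'^2) = \frac{1}{n}\left\{\sum E(X_i^2) - \frac{1}{n}E\left[\left(\sum X_i\right)^2\right]\right\} = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$$

As $E(S'^2) \neq \sigma^2$, the estimator is biased.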
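In the spirit of Chapter 5, the bias of the divisor-n estimator can also be checked by simulation. The macro below is only a sketch under stated assumptions: the population is taken to be N(100, 10), the sample size is 5, and 20000 samples are drawn; these values and the macro name CheckVarianceBias are chosen for illustration and do not come from the workbook.

```vba
' Average the sample variance over many small samples, once with divisor n - 1
' (worksheet function VAR) and once with divisor n (worksheet function VARP).
Sub CheckVarianceBias()
    Const mu As Double = 100        ' assumed population mean
    Const sigma As Double = 10      ' assumed population standard deviation
    Const n As Long = 5             ' sample size
    Const reps As Long = 20000      ' number of simulated samples

    Dim x(1 To n) As Double
    Dim r As Long, i As Long
    Dim u As Double, sumVar As Double, sumVarP As Double

    Randomize
    For r = 1 To reps
        For i = 1 To n
            Do
                u = Rnd
            Loop While u = 0        ' NormInv needs a probability strictly greater than 0
            x(i) = Application.WorksheetFunction.NormInv(u, mu, sigma)
        Next i
        sumVar = sumVar + Application.WorksheetFunction.Var(x)
        sumVarP = sumVarP + Application.WorksheetFunction.VarP(x)
    Next r

    Debug.Print "Average with divisor n - 1:", sumVar / reps
    Debug.Print "Average with divisor n:    ", sumVarP / reps
    Debug.Print "Population variance:       ", sigma ^ 2
End Sub
```

With these settings the first average should come out near 100 and the second near 80, which is (n − 1)/n times the population variance, as the proof predicts.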


Having finished the theoretical part, let us see an example that uses the method
of moments to obtain point estimates.

Example 6.1

25 students in UST are randomly chosen to fill in a questionnaire named How
Much Time Do You Spend Online (per day). The data are shown as follows:

8.2   3.4   8.4   7.9   2.5   10    7.2   11    7.7
4.8   7.6   5.2   5.6   3.5   8.5   6.4   5.8   7.6
6.9   7.4   6.8   8.8   9.8   7.8   10.2
Calculate the sample estimates of the mean and variance.

[Solution]

Hand Calculation

According to Eqs. 6.1 and 6.2, the point estimates of the mean and variance are

$$\bar{x} = \frac{8.2 + 3.4 + \dots + 10.2}{25} = 7.2, \qquad s^2 = \frac{(8.2 - 7.2)^2 + (3.4 - 7.2)^2 + \dots + (10.2 - 7.2)^2}{25 - 1} = 4.66.$$
The complete solution procedure can be seen in Excel file named Chapter 6 with
the spreadsheet named Example 6.1.

Even though hand calculation can be used to obtain the mean and
variance, the calculation is tedious. Fortunately, Excel contains functions
that can be used to obtain $\bar{x}$ and $s^2$.

Excel Solution

In this example, the sample mean $\bar{x}$ = AVERAGE(J11:J35) = 7.2, and the
sample variance $s^2$ = VAR(J11:J35) = 4.66. The complete solution procedure is
shown in the Excel file named Chapter 6, in the spreadsheet named Example 6.1.
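To see what AVERAGE and VAR do internally, Eqs. 6.1 and 6.2 can be coded directly as user-defined functions. This is a minimal sketch: the function names SampleMean and SampleVariance are hypothetical, and the code assumes that every cell in the range holds a number and that the range has at least two cells.

```vba
' Eq. 6.1: the sum of the observations divided by n.
Function SampleMean(ByVal rng As Range) As Double
    Dim c As Range, total As Double
    For Each c In rng.Cells
        total = total + c.Value
    Next c
    SampleMean = total / rng.Cells.Count
End Function

' Eq. 6.2: the sum of squared deviations from the mean divided by n - 1.
Function SampleVariance(ByVal rng As Range) As Double
    Dim c As Range, xbar As Double, ss As Double
    xbar = SampleMean(rng)
    For Each c In rng.Cells
        ss = ss + (c.Value - xbar) ^ 2
    Next c
    SampleVariance = ss / (rng.Cells.Count - 1)
End Function
```

Entered as =SampleMean(J11:J35) and =SampleVariance(J11:J35), the two functions should agree with AVERAGE(J11:J35) and VAR(J11:J35).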

The functions AVERAGE, VAR, and VARP

The function VAR is used when calculating the variance of a sample. However,
when calculating the variance of an entire population, the function VARP should be used; it divides by n instead of n − 1.
In this example, if we use the function VARP, $s_1^2$ = VARP(J11:J35) = 4.47 (that is, 4.66 × 24/25), and
compared with the value obtained from the function VAR, the difference is noticeable.
However, if we enlarge the sample size, the gap narrows. For instance, recall
Example 5.2: $s^2$ = VAR(H21:H520) = 25.28 and $s_1^2$ = VARP(H21:H520) = 25.23;
these two results are similar. Like the functions VAR and VARP, the functions
STDEV and STDEVP form a sample and population pair with the same syntax.

6.3 Interval Estimation

From Example 6.1, we have obtained $\bar{x} = 7.2$. However, it is almost never the case that
$\bar{x} = 7.2 = \mu$ exactly. Because of sampling variability, we cannot be sure that the
sample mean obtained each time is the same. For instance, recalling Example
6.1, other groups of samples are randomly chosen from the population, and the results are
shown in Table 6.1.

Table 6.1 Sample mean and standard deviation


Sample         1     2     3     4     5     6     7     8     9     10
Mean ($\bar{x}$)   7.2   5.4   5.2   5.4   6.0   6.1   5.9   5.5   5.2   6.5
Std. (s)       2.2   3.3   3.1   3.8   2.9   3.7   3.5   2.8   3.7   3.0

From Table 6.1, we can observe that the sample means differ from
one another, and there is no evidence to show which sample mean is
closest to µ. In reality, the true mean of the population does exist and is a fixed
number, but we do not know exactly what it is. Most of the time, the estimated mean
is not exactly the same as the true mean.
One limitation of point estimation is that the estimate itself is a single number,
and it says nothing about how close it might be to µ. Rather than a point estimate, it
is sometimes more valuable to specify an interval within which the true mean is
likely to lie. For instance, you have a better chance of being right when you say that the average height
of the students is between 168 cm and 174 cm than when you give a single estimate of 171 cm.
Before calculating the confidence interval, the confidence level, which is a
measure of the degree of reliability of the interval, should be selected. The most
frequently used confidence levels are 95%, 90%, and 99%. The higher the confidence
level, the more strongly we believe that the value of the parameter being estimated lies
within the interval.

6.3.1 Confidence Interval for the Mean with a Known Population Variance

According to the Central Limit Theorem, the sample mean $\bar{X}$ approximately follows
a normal distribution with mean µ and standard deviation $\sigma/\sqrt{n}$.
When the sample size is large and the variance is already known, the unknown µ can
be estimated within a specified interval.

Confidence Interval: A 100(1 − α)% confidence interval for the mean µ of a
normal population when the value of σ is known:

$$\left(\bar{x} - z_{\alpha/2}\cdot\frac{\sigma}{\sqrt{n}},\ \ \bar{x} + z_{\alpha/2}\cdot\frac{\sigma}{\sqrt{n}}\right) \tag{6.3}$$

Any desired confidence level can be achieved by using the corresponding
z critical value. Figure 6.1 shows that a probability of 1 − α is achieved by using the
critical value $z_{\alpha/2}$. For example, if the confidence level is 95%, the critical value $z_{\alpha/2}$ is
equal to 1.96.

[Figure: standard normal curve with the central area 1 − α lying between −z_{α/2} and z_{α/2}]

Figure 6.1 P (− zα / 2 ≤ Z ≤ zα / 2 ) = 1 − α
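Eq. 6.3 can be sketched as a user-defined function. The names ZConfidenceInterval and DemoConfidenceInterval are hypothetical; the critical value is obtained from the worksheet function NORMSINV, and the demo inputs (mean 80, σ = 25, n = 100, 95% level) are taken from Example 6.2.

```vba
' Confidence interval for the mean when sigma is known (Eq. 6.3).
' Returns a two-element array: (lower limit, upper limit).
Function ZConfidenceInterval(ByVal xbar As Double, ByVal sigma As Double, _
                             ByVal n As Long, ByVal confLevel As Double) As Variant
    Dim alpha As Double, z As Double, halfWidth As Double

    alpha = 1 - confLevel
    z = Application.WorksheetFunction.NormSInv(1 - alpha / 2)   ' z critical value
    halfWidth = z * sigma / Sqr(n)

    ZConfidenceInterval = Array(xbar - halfWidth, xbar + halfWidth)
End Function

Sub DemoConfidenceInterval()
    Dim ci As Variant
    ci = ZConfidenceInterval(80, 25, 100, 0.95)
    Debug.Print "Lower limit:", ci(0)
    Debug.Print "Upper limit:", ci(1)
End Sub
```

With a critical value of about 1.96, the half-width is about 1.96 × 25/10 = 4.9, so the demo should print limits of roughly 75.1 and 84.9.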

Example 6.2 illustrates the process to obtain the confidence interval for an
unknown mean with a known population variance.

Example 6.2

The canteen in UST conducts a survey named Are You Satisfied with the Food in the Canteen?
(suppose that the satisfaction score ranges from 0 to 100). 100 students are randomly
chosen, and the mean and standard deviation are equal to 80 and 25, respectively.
1) Find a 95 percent confidence interval estimate of the average satisfaction score of all
students in UST.
2) What sample size is necessary to ensure that the resulting 95 percent CI has a
width within 15?

[Solution]

1)

Let $\bar{x} = 80$ and σ = 25.

Based on the Central Limit Theorem, we know that approximately
$\bar{x} \sim N(80,\ 25/\sqrt{100})$, where the second parameter is the standard
deviation of the sample mean. Under the confidence level of 95%, the confidence
interval for µ is:

$\left( \bar{x} - z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}},\ \bar{x} + z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \right)$

As $\bar{x}$, σ and n are given, the key step in estimating the confidence interval
is to obtain the z critical value. After that, the CI can be obtained from Eq. 6.3.
To obtain the z critical value, you can use either the cumulative standard
normal table or the functions in Excel.

Table Solution
Because the confidence level is 95%, the left-tail and right-tail areas can be
calculated as shown in Figure 6.2.

[Figure: standard normal curve with a central area of 0.95 and an area of 0.025 in each tail; the cumulative probabilities at the two critical values are 0.025 and 0.975]
Figure 6.2 Illustration of a z critical value

Table 6.2 shows the standard normal table, which can be used to obtain the z
critical value. Looking up the cumulative probability 0.975 in the table (row 1.9,
column 0.06) gives $z_{0.975} = 1.96$, where the subscript is the cumulative
probability. As the normal distribution is symmetric, $z_{0.025} = -1.96$.

Table 6.2 Standard normal table


z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756
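A row of the standard normal table can be reproduced with a few lines of code. The sketch below uses Python's standard statistics module as an assumed stand-in for the Excel function NORMSDIST; the helper name normal_table_row is hypothetical.

```python
from statistics import NormalDist


def normal_table_row(z_tenths, columns=8):
    """Cumulative probabilities for z = z_tenths + 0.00, 0.01, ... (one table row)."""
    std = NormalDist()
    return [round(std.cdf(z_tenths + 0.01 * j), 4) for j in range(columns)]


# Row 1.9 of the table; the entry in column 0.06 corresponds to z = 1.96
row = normal_table_row(1.9)
print(row)
print("cumulative probability at z = 1.96:", row[6])
```

The entry in column 0.06 of row 1.9 is the cumulative probability 0.975 used to read off the z critical value 1.96.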

Excel Solution

The functions NORMSINV and NORMINV in Excel can also be used to obtain the z
value. Both functions take a left-tail (cumulative) probability. For example, given
a left-tail probability of 0.975, the corresponding z value is equal to 1.96; given
a left-tail probability of 0.025, the corresponding z value is equal to −1.96.
Therefore, the 95% confidence interval for µ is:

$\left( 80 - 1.96 \times \frac{25}{\sqrt{100}},\ 80 + 1.96 \times \frac{25}{\sqrt{100}} \right) = (75.1,\ 84.9)$

Hence, we are 95% confident that the mean value lies between 75.1 and 84.9.

2)

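The width of the interval is $2 \cdot z_{\alpha/2} \cdot \sigma/\sqrt{n}$, so the
sample size n must satisfy

$15 = 2 \times 1.96 \times \frac{25}{\sqrt{n}} \Rightarrow n = 42.68$

Since n must be an integer, a sample size of 43 is required.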
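Both parts of Example 6.2 can be cross-checked with a short script. This is a minimal sketch in Python (an assumption of this example, since the text uses Excel); the function names z_interval and sample_size_for_width are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_interval(xbar, sigma, n, confidence=0.95):
    """Two-sided confidence interval for the mean when sigma is known (Eq. 6.3)."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


def sample_size_for_width(sigma, width, confidence=0.95):
    """Smallest integer n for which the interval width 2*z*sigma/sqrt(n) <= width."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return ceil((2.0 * z * sigma / width) ** 2)


lower, upper = z_interval(xbar=80, sigma=25, n=100)
print(f"95% CI: ({lower:.1f}, {upper:.1f})")
print("required n:", sample_size_for_width(sigma=25, width=15))
```

Rounding the sample size up rather than to the nearest integer guarantees that the width requirement is met.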

6.3.2 Confidence Interval for a Normal Mean When the Variance Is Unknown

In Section 6.3.1, we obtained the confidence interval for µ under the condition
that the sample size is large or the population σ is given. However, when the sample
size n is small, the Central Limit Theorem can no longer be invoked. For this case, a
new family of probability distributions called the t-distribution is introduced.
The t-distribution was investigated by William S. Gosset, a chemist and
statistician, who noticed that the usual statistical practice in his daily work
produced small errors when the sample size was small. Gosset published the result in
1908 under the name "Student," because the company he worked for had a policy that
employees were not permitted to publish under their own names. As a result, his
name is almost unknown outside the statistical field.
The t-distribution is a probability distribution that is used to estimate population
parameters when the sample size is small and/or when the population variance is
unknown. The t-test is very useful for handling small samples in quality control.

Definition
When $\bar{X}$ is the mean of a random sample of size n from a normal distribution
with mean µ, the rv

$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$    (6.4)

has a probability distribution called a t-distribution with n − 1 degrees of freedom.
The number of degrees of freedom is denoted by the Greek letter ν.

The t-distribution has the following features: it is bell-shaped and symmetric
about the origin, just like the standard normal distribution; each t-distribution
curve is more spread out than the standard normal curve; and the spread of the
t-curve decreases as ν increases.

Confidence Interval for µ: Let $\bar{x}$ and s be the sample mean and sample
standard deviation. Then a 100(1 − α)% confidence interval for µ is:

$\left( \bar{x} - t_{\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}},\ \bar{x} + t_{\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}} \right)$    (6.5)

Excel functions - TDIST and TINV

Function TDIST

TDIST gives the probability of being in the right tail, Pr(X > x), or of being in
the two tails, Pr(|X| > x).

Syntax
= TDIST(x, deg_freedom, tails)

x: required, the numeric value at which to evaluate the distribution
deg_freedom: required, the number of degrees of freedom
tails: required; tails = 1 returns the one-tail probability, tails = 2 returns
the two-tail probability

Function TINV
TINV returns the inverse of the two-tail probability, that is, the value t for
which Pr(|X| > t) equals the given probability.

Syntax
= TINV(probability, deg_freedom)
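The relation between the one-tail and two-tail probabilities follows from the symmetry of the t-distribution. The short derivation below is a sketch in LaTeX notation; the symbol $t_{\alpha,\nu}$ for the right-tail critical value is the notation assumed here.

```latex
% Symmetry of the t-distribution about zero, for t > 0:
\Pr(|T| > t) = \Pr(T > t) + \Pr(T < -t) = 2\,\Pr(T > t)

% Hence TDIST(x, nu, 2) = 2 * TDIST(x, nu, 1), and the right-tail critical
% value with \Pr(T > t_{\alpha,\nu}) = \alpha satisfies
\Pr(|T| > t_{\alpha,\nu}) = 2\alpha
\quad\Longrightarrow\quad
t_{\alpha,\nu} = \mathrm{TINV}(2\alpha,\ \nu)
```

In particular, the critical value $t_{\alpha/2,\,n-1}$ needed in Eq. 6.5 is obtained as TINV(α, n − 1).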

Example 6.3 illustrates the process to obtain the confidence interval for an
unknown mean with an unknown population variance.

Example 6.3

Continuing Example 6.2, suppose the sample size is decreased to 10, and the sample
mean and standard deviation are $\bar{x} = 80$ and s = 25, respectively. What is the
interval estimate of the average score at the 95% confidence level?

[Solution]

In this example, n, s, and $\bar{x}$ are given, where n = 10, s = 25, and $\bar{x} = 80$.

The first and key step is to obtain the t critical value. After that, the CI can be
obtained from Eq. 6.5. To find the t critical value, you can use either the t
table or the functions in Excel.

Table Solution

You can make your own t table using the function TINV, and the process is
similar to creating the normal table. As we have already described the process of
creating the standard normal table, we omit the specific procedure for creating the
t table.
In Table 6.3, the second row corresponds to the different values of the right-tail
probability α, and the first column corresponds to the value of ν. The function TINV
is used to obtain the t critical value. Because TINV takes a two-tail probability,
the entry for a right-tail probability α is TINV(2α, ν). You can choose different
values of α and ν as you like, and draw a table like Table 6.3.

Table 6.3 Critical values for the t-distribution


α
ν 0.1 0.05 0.025 0.01 0.005 0.001 0.0005
1 3.078 6.314 12.706 31.821 63.657 318.309 636.619
2 1.886 2.920 4.303 6.965 9.925 22.327 31.599
3 1.638 2.353 3.182 4.541 5.841 10.215 12.924
4 1.533 2.132 2.776 3.747 4.604 7.173 8.610
5 1.476 2.015 2.571 3.365 4.032 5.893 6.869
6 1.440 1.943 2.447 3.143 3.707 5.208 5.959
7 1.415 1.895 2.365 2.998 3.499 4.785 5.408
8 1.397 1.860 2.306 2.896 3.355 4.501 5.041
9 1.383 1.833 2.262 2.821 3.250 4.297 4.781
10 1.372 1.812 2.228 2.764 3.169 4.144 4.587

The t value can be obtained from Table 6.3 as $t_{0.025,9} = 2.262$.

Excel Solution

As α = 0.05 and ν = 9 are given, the t critical value is equal to:

$t_{0.025,9}$ = TINV(0.05, 9) = 2.262

As the t-curve is symmetric about zero, $-t_{0.025,9}$ = −TINV(0.05, 9) = −2.262.

Note that the function TINV is two-tailed. Because working only with the inverse of
the two-tail probability can be inconvenient, VBA can be used to create right-tail
and left-tail versions of it. The code is shown as follows:

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code: 6.1

*********************************************************************

'Purpose: To recreate the function TINV so that it takes only the
'right-tail probability Pr(X > x).

'Define variables:

'pb: the right-tail probability

'df: the number of degrees of freedom

'*****start of coding********************************************
Public Function tinv_right(pb, df)
    'TInv expects a two-tail probability, so the right-tail
    'probability is doubled
    tinv_right = Application.WorksheetFunction.TInv(2 * pb, df)
End Function

'******************************************end of coding*********

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As α = 0.05 and ν = 9 are given, the t value can be obtained as

$t_{0.025,9}$ = TINV_right(0.025, 9) = TINV(0.05, 9) = 2.262

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code: 6.2

*********************************************************************
'Purpose: To recreate the function TINV so that it takes only the
'left-tail probability Pr(X < x).

'Define variables:

'pb: the left-tail probability

'df: the number of degrees of freedom

'**********start of coding**************************************
Public Function tinv_left(pb, df)
    'By symmetry, the left-tail critical value is the negative of
    'the right-tail critical value
    tinv_left = (Application.WorksheetFunction.TInv(2 * pb, df)) * -1
End Function
'********************************************end of coding*************

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As α = 0.05 and ν = 9 are given, the t value can be obtained as

$-t_{0.025,9}$ = TINV_left(0.025, 9) = −TINV(0.05, 9) = −2.262

After obtaining the t critical value, a 95 percent confidence interval for µ is:

$\left( 80 - 2.262 \times \frac{25}{\sqrt{10}},\ 80 + 2.262 \times \frac{25}{\sqrt{10}} \right) = (62.1,\ 97.9)$

Hence, we are 95% confident that the mean value lies between 62.1 and 97.9.
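The arithmetic of Eq. 6.5 can be checked with a short script. The sketch below is written in Python as an assumption of this example; the critical value 2.262 is copied from Table 6.3 rather than computed, and the names t_interval and T_CRIT_0025_DF9 are hypothetical.

```python
from math import sqrt

# Right-tail critical value t_{0.025, 9}, copied from Table 6.3 (not computed here)
T_CRIT_0025_DF9 = 2.262


def t_interval(xbar, s, n, t_crit):
    """Two-sided confidence interval for the mean with unknown variance (Eq. 6.5)."""
    half_width = t_crit * s / sqrt(n)
    return xbar - half_width, xbar + half_width


lower, upper = t_interval(xbar=80, s=25, n=10, t_crit=T_CRIT_0025_DF9)
print(f"95% CI: ({lower:.1f}, {upper:.1f})")
```

Compared with Example 6.2, the interval is much wider, both because n is smaller and because the t critical value exceeds 1.96.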

6.3.3 Confidence Interval for the Variance of a Normal Distribution

In Sections 6.3.1 and 6.3.2, we described how to infer the confidence interval
for an unknown mean. Besides the unknown mean, we might also be interested in the
spread of a population based on a sample. For example, apart from knowing the
sample mean of the students' grades, we might also want to know the variation in
their grades.
Before developing a confidence interval for the variance, we need another
distribution, called the chi-squared distribution. Suppose a random variable follows
the normal distribution with population variance $\sigma^2$. From a sample of size n
we can obtain the sample variance $S^2$ and define the $\chi^2$ statistic as
$\chi^2 = (n-1) S^2 / \sigma^2$, which can be viewed as a random variable following
the chi-squared distribution.

Definition
Let $X_1, \ldots, X_n$ be a sample from a normal distribution having the unknown
parameters µ and $\sigma^2$. The rv

$\chi^2 = \frac{(n-1) S^2}{\sigma^2}$    (6.6)

has a chi-squared ($\chi^2$) probability distribution with n − 1 degrees of freedom.

The $\chi^2$ distribution is not symmetric, and its shape depends on the number of
degrees of freedom. The shape becomes more symmetric as the number of degrees of
freedom increases.
If an estimate of the variance is given, we can determine a corresponding
confidence interval for the variance or standard deviation of the population.
As mentioned earlier, to obtain a confidence interval for the variance or standard
deviation, the first step is to find the critical values. As the chi-squared
distribution is not symmetric, both the right-tail and left-tail critical values must
be obtained separately. The small right-tail probability α/2 gives the right critical
value $\chi^2_{right}$ (i.e., $\chi^2_{\alpha/2,\nu}$), and the large right-tail
probability 1 − α/2 gives the left critical value $\chi^2_{left}$ (i.e.,
$\chi^2_{1-\alpha/2,\nu}$).
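The confidence interval for the variance follows directly from Eq. 6.6 and the two critical values. The derivation below is a sketch in LaTeX notation, with ν = n − 1 and with the subscripts of $\chi^2$ denoting right-tail probabilities, as in the text.

```latex
% The chi-squared statistic lies between the two critical values with probability 1 - alpha
\Pr\!\left( \chi^2_{1-\alpha/2,\nu} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{\alpha/2,\nu} \right) = 1 - \alpha

% Solving both inequalities for sigma^2 reverses their order
\frac{(n-1)\,s^2}{\chi^2_{\alpha/2,\nu}} \;\le\; \sigma^2 \;\le\; \frac{(n-1)\,s^2}{\chi^2_{1-\alpha/2,\nu}}
```

Note that the larger (right) critical value produces the lower end point of the interval, and the smaller (left) critical value produces the upper end point. Taking square roots of both end points gives the interval for σ.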

Excel function CHIINV

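CHIINV returns the inverse of the right-tail probability of the chi-squared
distribution, that is, the value x for which Pr(X > x) equals the given probability.

Syntax
= CHIINV(probability, deg_freedom)

probability: right-tail probability
deg_freedom: the number of degrees of freedom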

As mentioned earlier, if we have an estimate of the variance or standard
deviation from a sample, we can determine a corresponding confidence interval for
the variance or standard deviation of the population. The next example shows the
procedure to obtain the confidence interval for the variance.

Example 6.4

Fifteen students independently measure the sugar content of a brand A beverage
from the supermarket. The following are the measured sugar contents (g/ml):

0.091 0.1 0.093 0.092 0.083 0.089 0.095 0.093


0.091 0.089 0.088 0.097 0.099 0.893 0.099

1) Estimate the population variance.

2) Compute a 95 percent two-sided confidence interval for $\sigma^2$.

[Solution]

1)

The functions AVERAGE and VAR in Excel can be used to obtain the sample
mean $\bar{x}$ and the sample variance $s^2$:

$\bar{x} = 0.146$, $s^2 = 0.04$

Hence, the point estimate of the population variance is $\hat{\sigma}^2 = 0.04$.

2)

Since the confidence level and n are given, the $\chi^2$ table or the function
CHIINV in Excel can be used to obtain the critical values.

Table Solution

You can make your own $\chi^2$ table using the function CHIINV, which returns the
$\chi^2$ critical value for a given right-tail probability. You can choose different
values of α and ν as you like, and draw a table like Table 6.4.

Table 6.4 Critical values for the chi-squared distribution


α
ν 0.995 0.99 0.975 0.95 0.05 0.025 0.01 0.005
1 0.000 0.000 0.001 0.004 3.841 5.024 6.635 7.879
2 0.010 0.020 0.051 0.103 5.991 7.378 9.210 10.597
3 0.072 0.115 0.216 0.352 7.815 9.348 11.345 12.838
4 0.207 0.297 0.484 0.711 9.488 11.143 13.277 14.860
5 0.412 0.554 0.831 1.145 11.070 12.833 15.086 16.750
6 0.676 0.872 1.237 1.635 12.592 14.449 16.812 18.548
7 0.989 1.239 1.690 2.167 14.067 16.013 18.475 20.278
8 1.344 1.646 2.180 2.733 15.507 17.535 20.090 21.955
9 1.735 2.088 2.700 3.325 16.919 19.023 21.666 23.589
10 2.156 2.558 3.247 3.940 18.307 20.483 23.209 25.188
11 2.603 3.053 3.816 4.575 19.675 21.920 24.725 26.757
12 3.074 3.571 4.404 5.226 21.026 23.337 26.217 28.300
13 3.565 4.107 5.009 5.892 22.362 24.736 27.688 29.819
14 4.075 4.660 5.629 6.571 23.685 26.119 29.141 31.319
15 4.601 5.229 6.262 7.261 24.996 27.488 30.578 32.801
16 5.142 5.812 6.908 7.962 26.296 28.845 32.000 34.267

The $\chi^2$ values can be obtained from Table 6.4 as

$\chi^2_{right} = \chi^2_{0.05/2,\,14} = 26.119$, $\chi^2_{left} = \chi^2_{1-0.05/2,\,14} = 5.629$

Excel Solution

The function CHIINV in Excel can also be used to obtain the $\chi^2$ values. For
example, given a confidence level of 0.95 and ν = 14, the right-tail critical value
is CHIINV(0.025, 14) = 26.119, and the left-tail critical value is
CHIINV(0.975, 14) = 5.629 (shown in Figure 6.3).

Figure 6.3 Process of using the function CHIINV

According to Eq. 6.6, $\chi^2 = \frac{(n-1) S^2}{\sigma^2}$:

The left end point is equal to $\frac{(n-1) \times S^2}{\chi^2_{0.025,14}} = \frac{14 \times 0.04}{26.119} = 0.021$

The right end point is equal to $\frac{(n-1) \times S^2}{\chi^2_{0.975,14}} = \frac{14 \times 0.04}{5.629} = 0.099$

Hence, the 95 percent two-sided confidence interval for $\sigma^2$ lies between
0.021 and 0.099.
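The two end points can be cross-checked with a short script. This is a minimal sketch in Python (an assumption of this example); the critical values are copied from Table 6.4 rather than computed, and the function name variance_interval is hypothetical.

```python
# Critical values for nu = 14, copied from Table 6.4 (not computed here)
CHI2_RIGHT = 26.119  # right-tail probability 0.025
CHI2_LEFT = 5.629    # right-tail probability 0.975


def variance_interval(s2, n, chi2_right, chi2_left):
    """Two-sided CI for sigma^2: ((n-1)*s2/chi2_right, (n-1)*s2/chi2_left)."""
    nu = n - 1
    return nu * s2 / chi2_right, nu * s2 / chi2_left


lower, upper = variance_interval(s2=0.04, n=15,
                                 chi2_right=CHI2_RIGHT, chi2_left=CHI2_LEFT)
print(f"95% CI for the variance: ({lower:.3f}, {upper:.3f})")
```

The interval is not centered on the point estimate 0.04, which reflects the asymmetry of the chi-squared distribution.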

6.3.4 Estimation of the Ratio of the Variance of the Two Populations

When comparing two variances, we often ask whether one sample variance being
larger than another indicates that one population is really more variable than the
other. To answer this question, we introduce another family of probability
distributions called the F-distribution. Mathematically, the F-distribution is
related to the ratio of two chi-squared distributions. The symbol F honors the
great statistician and geneticist Sir Ronald A. Fisher, who found the density of the
central F-distribution. The F-distribution is widely used in statistical inference,
especially in the analysis of variance.

Definition
Let $X_1, \ldots, X_m$ be a random sample from a normal distribution with variance
$\sigma_1^2$, let $Y_1, \ldots, Y_n$ be another random sample (independent of the
$X_i$'s) from a normal distribution with variance $\sigma_2^2$, and let $S_1^2$ and
$S_2^2$ denote the two sample variances. Then the rv

$F = \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2}$    (6.7)

has an F-distribution with $\nu_1 = m - 1$ and $\nu_2 = n - 1$ degrees of freedom.

The F-distribution has the following properties: it is not symmetric; it is
skewed to the right; and the value of an F random variable is always greater than or
equal to zero.

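Confidence Interval: A 100(1 − α)% confidence interval for the ratio
$\sigma_1^2 / \sigma_2^2$ is:

$\frac{S_1^2}{S_2^2} \Big/ F_{\alpha/2,\,\nu_1,\nu_2} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{S_1^2}{S_2^2} \Big/ F_{1-\alpha/2,\,\nu_1,\nu_2}$    (6.8)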

One of the important properties of the F-distribution is:

$F_{1-\alpha}(\nu_1, \nu_2) = 1 / F_{\alpha}(\nu_2, \nu_1)$

For instance, $F_{0.95}(5, 2) = \frac{1}{F_{0.05}(2, 5)} = \frac{1}{5.786} = 0.173$
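The reciprocal property can be justified in a few lines. The derivation below is a sketch in LaTeX notation, where the subscript of F denotes the right-tail probability (the convention assumed here, matching FINV).

```latex
% If F has an F-distribution with (nu_1, nu_2) degrees of freedom,
% then 1/F has an F-distribution with (nu_2, nu_1) degrees of freedom.
\Pr\!\left( F > F_{1-\alpha}(\nu_1,\nu_2) \right) = 1 - \alpha
\;\Longleftrightarrow\;
\Pr\!\left( \frac{1}{F} > \frac{1}{F_{1-\alpha}(\nu_1,\nu_2)} \right) = \alpha

% The right-tail critical value of 1/F at level alpha is F_alpha(nu_2, nu_1), so
\frac{1}{F_{1-\alpha}(\nu_1,\nu_2)} = F_{\alpha}(\nu_2,\nu_1)
```

This is why a table of right-tail critical values is sufficient: a left-tail critical value is the reciprocal of a right-tail value with the degrees of freedom exchanged.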

Excel Functions - FDIST and FINV

Function FDIST

The function FDIST returns the right-tail probability of the F-distribution,
Pr(F > x), which can be used to judge whether two data sets differ in their degree
of diversity.

Syntax
= FDIST(x, deg_freedom1, deg_freedom2)

x: the value at which to evaluate the function

Function FINV

FINV is the function that returns the inverse of the right-tail probability of the
F-distribution. It is used when comparing the variability of two data sets.

Syntax
= FINV(probability, deg_freedom1, deg_freedom2)
probability: right-tail probability

Example 6.5 illustrates the process to obtain the confidence interval for the
ratio of the variances of two populations.

Example 6.5

On a calculus test, ten girls are randomly chosen, and their mean score and
standard deviation are equal to 64 and 10, respectively. At the same time, nine boys
are randomly chosen, and their mean score and standard deviation are equal to 60 and
15, respectively. Can we conclude that the performance of the boys is more stable in
the calculus test at the 95 percent confidence level?

[Solution]

The first step is to obtain the F critical value. You can use either the F table
or the functions in Excel.

Table Solution

You can draw your own F table using the function FINV. Pay attention to the
tail probability of the table when you look up the F table. We take a right-tail
probability of 0.025 as an example to show how to draw an F table.
In Table 6.5, the right-tail probability is equal to 0.025, the top row corresponds
to the number of degrees of freedom of the variance in the numerator (ν1), and the
first column corresponds to the number of degrees of freedom of the variance in the
denominator (ν2). The function FINV is used to obtain the F critical value. You can
choose different values of ν1 and ν2 as you like, and draw a table like Table 6.5.

Table 6.5 Critical value of F - distribution


v1
v2 1 2 3 4 5 6 7 8 9
1 647.79 799.50 864.16 899.58 921.85 937.11 948.22 956.66 963.28
2 38.51 39.00 39.17 39.25 39.30 39.33 39.36 39.37 39.39
3 17.44 16.04 15.44 15.10 14.88 14.73 14.62 14.54 14.47
4 12.22 10.65 9.98 9.60 9.36 9.20 9.07 8.98 8.90
5 10.01 8.43 7.76 7.39 7.15 6.98 6.85 6.76 6.68
6 8.81 7.26 6.60 6.23 5.99 5.82 5.70 5.60 5.52
7 8.07 6.54 5.89 5.52 5.29 5.12 4.99 4.90 4.82
8 7.57 6.06 5.42 5.05 4.82 4.65 4.53 4.43 4.36
9 7.21 5.71 5.08 4.72 4.48 4.32 4.20 4.10 4.03
10 6.94 5.46 4.83 4.47 4.24 4.07 3.95 3.85 3.78
11 6.72 5.26 4.63 4.28 4.04 3.88 3.76 3.66 3.59
12 6.55 5.10 4.47 4.12 3.89 3.73 3.61 3.51 3.44
13 6.41 4.97 4.35 4.00 3.77 3.60 3.48 3.39 3.31
14 6.30 4.86 4.24 3.89 3.66 3.50 3.38 3.29 3.21

In this example, $\nu_1 = n_1 - 1 = 9$ and $\nu_2 = n_2 - 1 = 8$. With a 95% confidence level,

Right-tail probability α/2 = (1 − 0.95)/2 = 0.025

$F_{right} = F_{0.025,(9,8)} = 4.36$, $F_{left} = F_{0.975,(9,8)} = 1 / F_{0.025,(8,9)} = 1/4.10 = 0.24$



Excel Solution

As the right-tail probability and the numbers of degrees of freedom are given, the
function FINV in Excel can be used to obtain the critical values of F.
$F_{right}$ = FINV(0.025, 9, 8) = 4.36, $F_{left}$ = FINV(0.975, 9, 8) = 0.24

As the confidence level is $\Pr(F_{left} \le F \le F_{right})$, the confidence
interval can be obtained as:

$(\sigma_1^2 / \sigma_2^2)_{95\%} = \{ (S_1^2 / S_2^2) / F_{right},\ (S_1^2 / S_2^2) / F_{left} \}$

$= \{ (10^2 / 15^2) / 4.36,\ (10^2 / 15^2) / 0.24 \} \approx (0.1,\ 1.8)$

As the interval for the ratio of the variances, from 0.1 to 1.8, contains 1, we
cannot conclude at the 95 percent confidence level that the performance of the boys
is more stable than that of the girls in the calculus test.
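The interval for the variance ratio can be cross-checked with a short script. This is a minimal sketch in Python (an assumption of this example); the two right-tail critical values are copied from Table 6.5 rather than computed, and the name variance_ratio_interval is hypothetical. The left critical value is obtained through the reciprocal property instead of a second table.

```python
# Right-tail (0.025) critical values copied from Table 6.5 (not computed here)
F_9_8 = 4.36  # nu1 = 9 (numerator), nu2 = 8 (denominator)
F_8_9 = 4.10  # nu1 = 8 (numerator), nu2 = 9 (denominator)


def variance_ratio_interval(s1, s2, f_upper_12, f_upper_21):
    """Two-sided CI for sigma1^2 / sigma2^2 (Eq. 6.8).

    f_upper_12 is F_{alpha/2}(nu1, nu2); f_upper_21 is F_{alpha/2}(nu2, nu1).
    """
    ratio = s1 ** 2 / s2 ** 2
    lower = ratio / f_upper_12
    # Dividing by F_{1-alpha/2}(nu1, nu2) equals multiplying by F_{alpha/2}(nu2, nu1)
    upper = ratio * f_upper_21
    return lower, upper


lower, upper = variance_ratio_interval(s1=10, s2=15,
                                       f_upper_12=F_9_8, f_upper_21=F_8_9)
print(f"95% CI for the variance ratio: ({lower:.1f}, {upper:.1f})")
print("interval contains 1:", lower < 1.0 < upper)
```

Because the interval contains 1, the data do not show that either group has the smaller variance.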

6.4 Excel Functions Used in the Point and Interval Estimation

In point and interval estimation, the widely used distributions are the normal,
Student's t, chi-squared, and F distributions. For these four distributions, Excel
provides the corresponding probability functions. Generally, Excel provides the
probability functions with the suffix "DIST" and the inverse functions with the
suffix "INV."
Let X be a random variable, x be a value of the random variable, and Pr be a
probability. The functions TDIST, CHIDIST and FDIST give the probability of being
in the right tail, Pr(X > x), and the function TDIST can also provide the two-tail
probability Pr(|X| > x). However, the functions NORMDIST and NORMSDIST give the
probability of being in the left tail, Pr(X < x), which is the opposite of the other
three distribution functions.
These Excel functions may confuse you sometimes. In order to make these
functions more user-friendly, they can be changed a little using VBA macros.
The following code shows some examples of the changed functions.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code: 6.3

*********************************************************************

'Purpose: To turn the function NORMDIST from left-tail to right-tail,
'and let the function return only the cumulative probability.

'Define variables:

'x: the value for which you want the distribution.

'mean: the arithmetic mean of the distribution.
'sd: the standard deviation of the distribution.

'*****start of coding ******************************************

Public Function Newnormdist(x, mean, sd)
    'NormDist with cumulative = True returns Pr(X < x)
    Newnormdist = 1 - Application.WorksheetFunction.NormDist(x, mean, _
        sd, True)
End Function
'*****************************************end of coding ****************

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For instance, recall Example 3.1, where µ = 140 and sd = 30 are given. The function
NEWNORMDIST can be used to obtain Pr(X > 160):

1 − NORMDIST(160, 140, 30, TRUE) = NEWNORMDIST(160, 140, 30) = 0.25

The left-tail function NORMINV in Excel can be changed into a right-tail function
using VBA macros. The code is shown as follows:

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code: 6.4

*********************************************************************

'Purpose: To recreate the function NORMINV from left-tail to
'right-tail for the standard normal distribution.

'Define variables:

'x: the right-tail probability

'************start of coding************************************
Public Function Newnorminv(x)
    'NormInv expects a left-tail probability, so 1 - x is passed
    Newnorminv = Application.WorksheetFunction.NormInv((1 - x), 0, 1)
End Function
'**********************************end of coding **********************

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For instance, recall Example 6.2, where $\bar{x} \sim N(80,\ 25/\sqrt{100})$ and
α = 0.05 are given. The z value can be obtained as NEWNORMINV(0.025) = 1.96, which
is equal to NORMINV(0.975, 0, 1) = 1.96.
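The two right-tail wrappers can be mirrored outside Excel to check the numbers quoted above. The sketch below is written in Python as an assumption of this example; the function names right_tail_prob and right_tail_inv are hypothetical counterparts of NEWNORMDIST and NEWNORMINV, built on the standard statistics module.

```python
from statistics import NormalDist


def right_tail_prob(x, mean, sd):
    """Pr(X > x) for a normal distribution, the counterpart of NEWNORMDIST."""
    return 1.0 - NormalDist(mean, sd).cdf(x)


def right_tail_inv(p):
    """Standard normal z with right-tail probability p, the counterpart of NEWNORMINV."""
    return NormalDist().inv_cdf(1.0 - p)


print(round(right_tail_prob(160, 140, 30), 2))  # compare with NEWNORMDIST(160,140,30)
print(round(right_tail_inv(0.025), 2))          # compare with NEWNORMINV(0.025)
```

Both wrappers only convert a right-tail probability into the left-tail probability that the underlying function expects.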

In addition, as the function TDIST provides both the right-tail and the two-tail
probability, you need to choose the number of tails each time you apply it. To avoid
this, the function TDIST can be changed into a right-tail function using VBA macros,
and the code is shown as follows:

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code: 6.5

*********************************************************************

'Purpose: To recreate the function TDIST so that it returns only the
'one-tail probability Pr(X > x), so that the input arguments can be
'reduced from three to two.

'Define variables:

'x: the numeric value at which to evaluate the distribution

'df: an integer indicating the number of degrees of freedom

'***********start of coding ************************************

Public Function NewTdist(x, df)
    NewTdist = Application.WorksheetFunction.TDist(x, df, 1)
End Function
'************************************************end of coding ********

Note: with tails = 1, TDIST returns only the one-tail probability, so the input
arguments are reduced from three to two.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

How to enter the formula in Excel

In the previous sections, Excel functions were entered directly into the cells.
Another way to insert a formula is to use the Function Library group on the Formulas
tab (shown in Figure 6.4).

Figure 6.4 Formula Tab

If you do not remember which function you need, the Formulas tab is useful.
When you click a function category such as AutoSum or Financial, the functions in
that category are listed. If you forget the name of the function, you can click
Insert Function, and the function you need may be displayed in the dialog box after
you enter a few descriptive words. In addition, if you want to know more about a
function, you can click the link called Help on this function (shown in
Figure 6.5).

Figure 6.5 Function arguments for TDIST



6.5 Summaries of Excel Functions

In this chapter, the foundations of point and interval estimation were
introduced. To estimate unknown means, the Excel functions NORMINV and TINV can be
used. To estimate variances, the Excel functions CHIINV and FINV can be used.
Table 6.6 summarizes the Excel functions applied in this chapter.

Table 6.6 Summaries of the built-in functions


FUNCTION    How it works?                                                                  Notes
VAR         Calculates the variance for a sample of a population.                         Ex. 6.1
VARP        Calculates the variance for an entire population.                             Ex. 6.1
NORMINV     Returns the inverse of the normal cumulative probability.                     Ex. 6.2
TINV        Returns the inverse of the two-tail probability of the t-distribution.        Ex. 6.3
CHIINV      Returns the inverse of the right-tail probability of the chi-squared          Ex. 6.4
            distribution.
FINV        Returns the inverse of the right-tail probability of the F-distribution.      Ex. 6.5

Excel's built-in functions can be changed using VBA macros. Table 6.7
summarizes the changed functions that are used in this chapter.

Table 6.7 Summaries of the user defined functions


FUNCTION       How it works?
NEWNORMDIST    Turns the function NORMDIST from left-tail to right-tail.
NEWNORMINV     Turns the function NORMINV from left-tail to right-tail.
NEWTDIST       Reduces the number of input arguments of TDIST from three to two.
TINV_right     Recreates the function TINV so that it takes only the right-tail probability.
TINV_left      Recreates the function TINV so that it takes only the left-tail probability.

Chapter 7 Testing of Hypotheses

7.1 Introduction

You can see claims like the following in your daily life: a drug company may
claim that the effect of their pills lasts at least four hours; an automobile
manufacturer may claim that their new products are better than the older ones; a
food company may claim that their products are sugar free. To verify these kinds of
statements, a technique referred to as hypothesis testing can be used. Hypothesis
testing is a statistical method that uses a sample to verify statements about the
corresponding population. Hypothesis testing involves two contradictory hypotheses:
the null hypothesis (H0) and the alternative hypothesis (H1). The statistic computed
from the sample is used to decide whether H0 should be rejected or not.
In this chapter, the basic concepts and major procedures used in hypothesis
testing will be introduced, together with the relevant Excel functions. In addition,
the basic steps to create UserForms in Excel will be introduced as well.

7.2 Null Hypothesis and Alternative Hypothesis

At the beginning, we need to state exactly what we are testing, including the null
hypothesis (H0) and the alternative hypothesis (H1). The null hypothesis (H0) is the
initially favored claim, and the hypothesis that contradicts H0 is called the
alternative hypothesis (H1). For instance, a manufacturer claims that the lifetime of
their products is at least four years, and we can state the null hypothesis H0: µ = 4
and the alternative hypothesis H1: µ ≠ 4.
7: Testing of Hypotheses 149

Definition
The null hypothesis, denoted by H0, is the claim that is initially assumed to be true.
The alternative hypothesis, denoted by H1, is the assertion that is contradictory to
H0.

H0 will be rejected only if the evidence against it is sufficient. Otherwise, we
will continue to believe H0. The two possible conclusions from hypothesis testing
are: reject H0, or fail to reject H0.
There is a familiar analogy in criminal trials. Suppose that the initial claim
(H0) is that the defendant is innocent, and the alternative claim (H1) is that he or
she is guilty. Instead of telling whether he or she is guilty or innocent, the major
role of the jury is to determine whether the existing evidence is sufficient to
reject the initial claim (H0).

7.3 Type I and Type II Errors

Because of sampling variability, errors are unavoidable in hypothesis
testing. No one likes errors. However, eliminating them is difficult. The reason is
that, for a given sample size, any attempt to decrease one type of error is
accompanied by an increase in the other type of error. The only way to reduce both
errors is to increase the sample size, which may or may not be feasible in practice.

Definition
A type I error occurs when a true null hypothesis is rejected.
A type II error occurs when a false null hypothesis is accepted.

Recall the criminal trial mentioned above, where H0 states that the defendant is
innocent, and H1 states that he or she is guilty. The type I error can be compared
with sending an innocent person to prison, the false rejection of the null
hypothesis. In contrast, the type II error can be compared with releasing a person
who is guilty, the false acceptance of the null hypothesis.

The probabilities of the type I and type II errors are usually denoted by the
Greek letters α and β, respectively. The value of α is often referred to as the level
of significance, where α = Pr{reject H0 | H0 is true}, and
1 − α = Pr{accept H0 | H0 is true}. In practice, the significance level is usually
set to 0.01 or 0.05, so that 1 − α = 0.99 or 0.95, and a true H0 is very unlikely to
be rejected.

In most fields of science, the type I error is considered more dangerous and
undesirable than the type II error, so we design the test to keep the type I error
small. Generally speaking, testing at a lower level of significance means that only
a large amount of evidence can force the rejection of H0.
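The trade-off between the two error types can be made concrete for a right-tailed z test. The sketch below is written in Python as an assumption of this example; the function name type_ii_error and all numbers (µ0 = 50, µ1 = 52, σ = 8, n = 100 or 200) are hypothetical and chosen only for illustration. It uses the fact that, when the true mean is µ1, the test statistic is normal with mean (µ1 − µ0)√n/σ and standard deviation 1.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def type_ii_error(alpha, mu0, mu1, sigma, n):
    """beta for a right-tailed z test of H0: mu = mu0 when the true mean is mu1 > mu0."""
    z_alpha = STD.inv_cdf(1.0 - alpha)        # critical value of the test
    shift = (mu1 - mu0) * sqrt(n) / sigma     # mean of the z statistic under mu1
    return STD.cdf(z_alpha - shift)           # Pr(fail to reject H0 | mu = mu1)


# Hypothetical numbers: lowering alpha raises beta for a fixed sample size
for alpha in (0.05, 0.01):
    beta = type_ii_error(alpha, mu0=50, mu1=52, sigma=8, n=100)
    print(f"alpha = {alpha:.2f}, n = 100: beta = {beta:.3f}")

# Increasing the sample size lowers beta at the same alpha
beta_large = type_ii_error(0.01, mu0=50, mu1=52, sigma=8, n=200)
print(f"alpha = 0.01, n = 200: beta = {beta_large:.3f}")
```

With the sample size fixed, moving from α = 0.05 to α = 0.01 increases β, while doubling the sample size at α = 0.01 decreases β again.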

7.4 Testing Procedures

The following is the typical sequence to follow when doing hypothesis
testing:

1. Set the null hypothesis (H0) and the alternative hypothesis (H1)
Generally, H0 is formulated as an equality, whereas H1 is normally an
inequality. In general, H0 and H1 can be set as follows:

Null hypothesis: H0: µ = µ0

Alternative hypothesis: H1: µ ≠ µ0 (two-tailed test)
                        H1: µ > µ0 (right-tailed test)
                        H1: µ < µ0 (left-tailed test)

This step is very important when doing hypothesis testing. H0 is usually
stated as an equality. For instance, instead of setting H0: µ > µ0 or µ < µ0, we
usually state H0: µ = µ0. The reason is that, without a single fixed value µ0, the
rejection region would float, so the same statistic could be accepted for some
values of µ and rejected for others.
To state the appropriate H1, one trick is to choose the expected or preferred
result. For instance, a drug company claims that the effect of their new drug lasts
at least 4 hours, and the hypotheses can be set as H0: µ = 4 and H1: µ > 4; a food
company claims that their products are sugar free, so that we can state H0: µ = 0
and H1: µ ≠ 0.

2. Determine the proper statistic and its distribution


Depending on the population parameter that is being tested, choose the
appropriate test statistic and its probability distribution. For instance, when you
want to test the population mean, you can use the z test if the sample size is large
enough or if σ is already known.

3. Choose the level of significance α


Although the selection is largely subjective, in practice the value of α is usually
between 1% and 5%.

4. Determine the statistic value from the samples


A suitable test statistic can be obtained from the given data.

5. Define the region for rejection of the null hypothesis


According to the prescribed α and the distribution, the rejection region can
be defined. If the statistic computed from the sample is inside the rejection region,
reject H0; otherwise, accept it. Figure 7.1 shows the acceptance and rejection regions
of a normal distribution.

Figure 7.1 Regions of rejection and acceptance
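
Steps 2 to 5 of the procedure can be sketched in a few lines of Python for the case of a z test. This is an illustrative sketch rather than the book's Excel workflow: the function name z_test and the sample values in the usage lines are hypothetical, and the critical value comes from the standard library's NormalDist instead of NORMINV.

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, s, n, alpha=0.05, tail="two"):
    """Return (z, critical value, reject H0?) for a z test of the mean."""
    z = (xbar - mu0) / (s / sqrt(n))              # step 4: statistic from the sample
    std_normal = NormalDist()                     # step 2: z follows N(0, 1) under H0
    if tail == "left":                            # step 5: rejection region
        crit = std_normal.inv_cdf(alpha)
        return z, crit, z < crit
    if tail == "right":
        crit = std_normal.inv_cdf(1 - alpha)
        return z, crit, z > crit
    if tail == "two":
        crit = std_normal.inv_cdf(1 - alpha / 2)
        return z, crit, abs(z) > crit
    raise ValueError("tail must be 'left', 'right', or 'two'")

# Hypothetical data: H0: mu = 50 against H1: mu < 50,
# with a sample of 64 items, sample mean 49.2, and s = 2.0
z, crit, reject = z_test(49.2, 50.0, 2.0, 64, alpha=0.05, tail="left")
print(round(z, 2), round(crit, 3), reject)
```

Step 1 (choosing H0 and H1) and step 3 (choosing α) stay with the analyst; they enter the function only through the tail and alpha arguments.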

With the fundamental concepts of hypothesis testing in place, Examples 7.1, 7.2
and 7.3 show how to use these concepts to solve problems related to hypothesis
testing.

Example 7.1

The manufacturer claims that the weight of its product is less than 300 g. A
customer organization randomly chooses 60 products and obtains a sample average
weight x̄ = 297.25 g and a standard deviation s = 2.3 g. Comment on the company's
statement at the 0.05 level of significance.

[Solution]

As mentioned earlier, defining H0 and H1 is the first and key step in
hypothesis testing. H0 is easy to set, as it is usually stated as an equality. Setting
H1, however, can be confusing. Supposing we do not know any trick for setting
H1, let us try the three possible alternatives and find the appropriate one.

i. Left - tailed test

1. Set H0 and H1
H0: µ = 300

H1: µ < 300

2. Determine the proper statistic and its distribution


As our purpose is to test the population mean, we can choose the z test:
$z = (\bar{x} - \mu_0) / (s / \sqrt{n})$

3. Choose the level of significance α


In this question, α = 0.05

4. Determine the statistic value from the samples


With n = 60, x̄ = 297.25, and s = 2.3 given, the z value is:
$z = (297.25 - 300) / (2.3 / \sqrt{60}) = -9.26$

5. Define the region for rejection of the null hypothesis

This is a left-tailed test, and the rejection region is z < -1.64. Figure 7.2 shows
the acceptance and rejection areas for the z-test.

Figure 7.2 z-test with the left-tailed alternative

As z = -9.26 is inside the rejection region, reject H0 and accept H1. The company's
statement that the weight of its product is less than 300 g is accepted.

ii. Two - tailed test

As the testing procedure is similar to the previous one, the details are not
repeated. H0 and H1 are set as H0: µ = 300 and H1: µ ≠ 300. The statistic value z
from the sample is unchanged, z = -9.26.
Figure 7.3 shows the regions of acceptance and rejection. This is a two-tailed test
with α = 0.05 and critical value z = 1.96. The rejection region is z > 1.96 or z < -1.96.

Figure 7.3 z-test with the two-tailed alternative

As z = -9.26 is inside the rejection region, reject H0. We can conclude that the
average weight is not equal to 300 g at the 5% level of significance. However,
in this example, we are only interested in whether the average weight is less than 300
g, and the conclusion that the average weight is unequal to 300 g is of little use to us.

iii. Right - tailed test

As the testing procedure is similar to the previous one, the details are not
repeated. H0 and H1 are set as H0: µ = 300 and H1: µ > 300. The statistic value z
from the sample is unchanged, z = -9.26.
This is a right-tailed test with α = 0.05, and the rejection region is z > 1.64.
Clearly, z = -9.26 does not lie in the rejection region. Therefore, H0 is accepted.
However, in this example, our purpose is to test the company's statement that
the product's weight is less than 300 g. When we set the alternative H1: µ > 300,
one outcome is to accept H0, in which case we conclude that the product's weight is
equal to 300 g; the other outcome is to reject H0 and accept H1, in which case we
conclude that the product's weight is larger than 300 g. No matter whether H0 is
accepted or rejected, the company's statement would be judged wrong, and the test
becomes meaningless.
On the whole, defining the null and alternative hypotheses is the first and key
step in the process of hypothesis testing. Whether to use a one-tailed or a two-tailed
alternative H1 depends on the situation. Generally, the expected result is set as the
alternative hypothesis.
As mentioned earlier, the alternative hypothesis has three forms: two-tailed,
right-tailed, and left-tailed. The rejection region changes with each form, and it is
inconvenient to redo the calculation each time. The three forms can be coded as
VBA macros and combined with UserForm controls, so that you can choose the one
you need depending on the situation.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code: 7.1

*********************************************************************

'Purpose: To wrap the NORMINV function for one-tailed and two-tailed tests.
'         The InputBox and MsgBox functions are used.

'Define variables:
'a: Level of significance
'z: Critical value

***Start Coding************************************************

Sub norminvleft()
    'Left-tailed test: reject H0 when z falls below the critical value
    a = CDbl(InputBox("a", "level of significance"))
    z = Application.WorksheetFunction.NormInv(a, 0, 1)
    z = Application.WorksheetFunction.Round(z, 3)
    MsgBox "Rejection region is z < " & z, vbOKOnly, "norminvleft"
End Sub

Sub norminvright()
    'Right-tailed test: mirror the left-tail critical value
    a = CDbl(InputBox("a", "level of significance"))
    z = -1 * Application.WorksheetFunction.NormInv(a, 0, 1)
    z = Application.WorksheetFunction.Round(z, 3)
    MsgBox "Rejection region is z > " & z, vbOKOnly, "norminvright"
End Sub

Sub norminvtwo()
    'Two-tailed test: put a / 2 in each tail; z is the negative critical value
    a = CDbl(InputBox("a", "level of significance"))
    z = Application.WorksheetFunction.NormInv(0.5 * a, 0, 1)
    z = Application.WorksheetFunction.Round(z, 3)
    MsgBox "Rejection region is z < " & z & " or z > " & -z, vbOKOnly, _
           "norminvtwo"
End Sub
***********************************************end of coding*********
This macro relies on two key techniques. Their roles in the macro are detailed
as follows:

1) InputBox function
This function is useful for obtaining a single input. "a" is displayed as the prompt
in the input box, and "level of significance" is displayed in the title bar.
2) Function ROUND
This function rounds the result to a fixed number of decimal places (three in this
macro).

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
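
The rejection regions reported by the three macros can be cross-checked outside Excel. The Python sketch below is an assumption-laden stand-in, not the book's code: the function name rejection_region is hypothetical, and NormalDist().inv_cdf from the standard library plays the role of NORMINV(p, 0, 1).

```python
from statistics import NormalDist

def rejection_region(alpha, tail):
    """Describe the z-test rejection region, rounded to three decimals."""
    std_normal = NormalDist()                     # mean 0, standard deviation 1
    if tail == "left":
        return f"z < {std_normal.inv_cdf(alpha):.3f}"
    if tail == "right":
        return f"z > {std_normal.inv_cdf(1 - alpha):.3f}"
    if tail == "two":
        z = std_normal.inv_cdf(1 - alpha / 2)     # alpha / 2 in each tail
        return f"z < {-z:.3f} or z > {z:.3f}"
    raise ValueError("tail must be 'left', 'right', or 'two'")

for tail in ("left", "right", "two"):
    print(tail, "->", rejection_region(0.05, tail))
```

For α = 0.05 this gives critical values of about ±1.645 for the one-tailed tests and ±1.960 for the two-tailed test, matching the values used in Example 7.1.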

Using UserForm controls in a worksheet

Using controls on a worksheet makes it easier for users to provide input. You can
access the controls by choosing Developer → Controls → Insert. Figure 7.4 shows the
controls that appear after following these steps.

Figure 7.4 Excel’s two sets of controls

Excel offers two different sets of controls: Form controls and ActiveX controls. In
this example, we focus on Form controls.
Excel provides different kinds of Form controls, such as the ScrollBar control,
the OptionButton control, and the TextBox control. In Example 7.1, we show an
example of using OptionButton controls, which allow a user to select one of
multiple options.
In Example 7.1, there are three forms of H1: two-tailed, left-tailed, and
right-tailed. You can click the option button, drag it onto the worksheet cells, and
then double-click each item to rename the option buttons. Figure 7.5 shows the Form
controls that can be used to ask the user for an option.

Figure 7.5 Form controls that ask the user for an option

After completing the Form controls, the next step is to link the controls to the
macros created earlier. To make the connection, right-click the button, choose the
item Assign Macro, and then assign the code to the corresponding control.

Figure 7.6 Assign Macro to the controls


In Example 7.1, when the left-tailed option is chosen, the left-tailed VBA code
is executed. A dialog box asks you to enter the value of a (0.05 in this question).
After pressing OK, you obtain the rejection region, which is z < -1.645 (rounded to
-1.64 in the hand calculation). Figures 7.7 and 7.8 show the dialog boxes displayed
by VBA's InputBox and MsgBox functions.

Figure 7.7 Dialog box displayed by the VBA’s InputBox function

Figure 7.8 Dialog box displayed by the VBA’s MsgBox function

When the variance σ² is unknown and the sample size is small, the
t-distribution can be applied in hypothesis testing. The testing procedure is the
same as above; the only difference is that the statistic changes from z to t.

Example 7.2

The manufacturer claims that its coffee machines provide a population mean
volume of 110 ml of coffee per cup with a standard deviation of 5 ml. The volume of
coffee per cup is assumed to follow a normal distribution. For quality control, the
machine is checked periodically by randomly sampling 15 cups of coffee each day.
The sample mean is x̄ = 107.0 ml and the sample standard deviation is s = 6.5 ml.
Comment on the manufacturer's statement at the 0.05 level of significance.

[Solution]

As mentioned earlier, defining H0 and H1 is the first and key step in
hypothesis testing. In this example, H0 and H1 are stated as follows:
H0: µ = 110

H1: µ ≠ 110

To test the population mean with a small sample size, the t test can be chosen,
and the test statistic value is calculated as follows:
$T = (\bar{X} - \mu_0) / (s / \sqrt{n}) = (107 - 110) / (6.5 / \sqrt{15}) = -1.79$

This example is a two-tailed test with α = 0.05 and n = 15. The critical value
can be obtained from Appendix Table A.2 or from the Excel function mentioned in
previous chapters:
$t_{0.05,14}$ = TINV(0.05, 14) = 2.14

The rejection region is: t > 2.14 or t < −2.14

Since T = -1.79 is not inside the rejection region, accept H0. The manufacturer's
claim is accepted.
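
The t statistic can be recomputed directly from the sample values stated in Example 7.2 (x̄ = 107.0, s = 6.5, n = 15). The Python sketch below is only a cross-check: the function name t_statistic is hypothetical, and because the standard library has no inverse t-distribution, the critical value 2.145 is copied from the TINV result quoted in the text rather than computed.

```python
from math import sqrt

def t_statistic(xbar, mu0, s, n):
    """One-sample t statistic for H0: mu = mu0."""
    return (xbar - mu0) / (s / sqrt(n))

T_CRIT = 2.145   # TINV(0.05, 14) as quoted in the text (assumed, not computed here)

t = t_statistic(xbar=107.0, mu0=110.0, s=6.5, n=15)
reject = abs(t) > T_CRIT          # two-tailed rejection region
print("t =", round(t, 2), "| reject H0:", reject)
```

With these inputs the statistic is about -1.79, which lies between -2.145 and 2.145, so H0 is not rejected.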

Similarly, the function TINV can be wrapped in VBA macros and combined with
UserForm controls, so that you can choose the one you need depending on the
situation. The code is shown as follows:

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code: 7.2

*********************************************************************

'Purpose: To wrap the TINV function for one-tailed and two-tailed tests.
'         The InputBox and MsgBox functions are used.

'Define variables:
'a: Level of significance
'n: Sample size
'x: Critical value
***Start Coding**********************************************

Sub Tinvright()
    a = CDbl(InputBox("a", "level of significance"))
    n = CLng(InputBox("n", "sample size"))
    'TINV is two-tailed, so 2 * a gives the one-tailed critical value
    x = Application.WorksheetFunction.TInv(2 * a, n - 1)
    x = Application.WorksheetFunction.Round(x, 3)
    MsgBox "Rejection region is t > " & x, vbOKOnly, "Tinvright"
End Sub

Sub Tinvleft()
    a = CDbl(InputBox("a", "level of significance"))
    n = CLng(InputBox("n", "sample size"))
    'Negate the one-tailed critical value for the left tail
    x = Application.WorksheetFunction.TInv(2 * a, n - 1) * -1
    x = Application.WorksheetFunction.Round(x, 3)
    MsgBox "Rejection region is t < " & x, vbOKOnly, "Tinvleft"
End Sub

Sub Tinvtwo()
    a = CDbl(InputBox("a", "level of significance"))
    n = CLng(InputBox("n", "sample size"))
    'x is the positive critical value; the region lies beyond -x and x
    x = Application.WorksheetFunction.TInv(a, n - 1)
    x = Application.WorksheetFunction.Round(x, 3)
    MsgBox "Rejection region is t < " & -x & " or t > " & x, vbOKOnly, _
           "Tinvtwo"
End Sub

*****************************************************end of coding***

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After the code is complete, it can be assigned to the Form controls.
Figure 7.9 shows the Form controls that can be used to ask the user for an option.

Figure 7.9 Form controls that ask the user for an option

Example 7.2 uses a two-tailed test with α = 0.05 and n = 15. When the
two-tailed option is chosen, the two-tailed VBA code is executed. After entering the
values of α and n, you obtain the critical value, which is equal to 2.145. Figures 7.10,
7.11 and 7.12 show the dialog boxes displayed by VBA's InputBox and MsgBox
functions.

Figure 7.10 Dialog box displayed by VBA’s InputBox function



Figure 7.11 Dialog box displayed by VBA’s InputBox function

Figure 7.12 Dialog box displayed by VBA’s MsgBox function

Creating UserForms by VB editor

In Example 7.1, we introduced the process of using UserForm controls in a
spreadsheet. Alternatively, you can create your own UserForm using the VB
Editor. The major steps are as follows:

1. Work with UserForm - Tool box

To create a dialog box, the first step is to insert a new UserForm in the VB Editor
window. To insert a UserForm, press Alt + F11 and choose Insert → UserForm. The VB
Editor will display an empty UserForm, as shown in Figure 7.13.

Figure 7.13 An empty UserForm

2. Add controls

The Toolbox can be used to add controls. The Toolbox is displayed by choosing
View → Toolbox, which is shown in Figure 7.14. You can click and drag the controls
you need into the UserForm. In this example, we choose the OptionButton. Figure
7.14 shows an example of a UserForm using the OptionButton control.

Figure 7.14 A UserForm with OptionButton control

3. Change the property

Every control has several properties that determine how the control looks and
behaves. You can choose View → Properties Window or press F4 to show the
Properties window (shown in Figure 7.15). You can change the name of the
UserForm, its height, its color, and so on.

Figure 7.15 Properties window for a UserForm control

It is a good idea to rename all the controls using meaningful names. To
change the name of a control, right-click it and choose Properties; a Properties
window like the one in Figure 7.16 appears.

Figure 7.16 Properties window for a CommandButton control



4. Adjust the UserForm controls

You can adjust the UserForm controls to make the form look professional by
selecting Format → Align, as shown in Figure 7.17.

Figure 7.17 Change the alignment of controls

5. Display a UserForm

After completing the UserForm, you can display it by pressing F5 or by writing
code as follows:

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code: 7.3

*********************************************************************
Sub ShowUserForm()
    'Place this procedure in a standard module, not in the UserForm's code module
    UserForm1.Show
End Sub
*********************************************************************

To display a UserForm from VBA, create a procedure that uses the Show
method of the UserForm object.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You cannot display a UserForm from Excel without using at least one line of
VBA code. This procedure must be located in a standard VBA module and not in the
code module for the UserForm. After executing the macro, you can see the UserForm,
with OptionButton controls that provide multiple options, as in Figure 7.18.

Figure 7.18 UserForm after adding an OptionButton control

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code: 7.4

*********************************************************************

'Purpose: To handle the OptionButton controls of the UserForm

'Define variables:
'a: Level of significance
'n: Sample size
'x: Critical value

***Start Coding************************************************

Private Sub OptionButton1_Click()
    'Right-tailed t test
    a = CDbl(InputBox("a", "level of significance"))
    n = CLng(InputBox("n", "sample size"))
    x = Application.WorksheetFunction.TInv(2 * a, n - 1)
    x = Application.WorksheetFunction.Round(x, 3)
    MsgBox "Rejection region is t > " & x
End Sub

Private Sub OptionButton2_Click()
    'Left-tailed t test
    a = CDbl(InputBox("a", "level of significance"))
    n = CLng(InputBox("n", "sample size"))
    x = Application.WorksheetFunction.TInv(2 * a, n - 1) * -1
    x = Application.WorksheetFunction.Round(x, 3)
    MsgBox "Rejection region is t < " & x
End Sub

Private Sub OptionButton3_Click()
    'Two-tailed t test
    a = CDbl(InputBox("a", "level of significance"))
    n = CLng(InputBox("n", "sample size"))
    x = Application.WorksheetFunction.TInv(a, n - 1)
    x = Application.WorksheetFunction.Round(x, 3)
    MsgBox "Rejection region is t < " & -x & " or t > " & x
End Sub

*****************************************************end of coding***

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Examples 7.1 and 7.2 covered hypothesis testing of the population mean.
However, if you want to know whether the variance is significantly larger or
smaller than a claimed population variance, the chi-squared test is the proper option.

Example 7.3

A restaurant claims that the waiting time for each customer is 5 minutes with a
standard deviation σ of less than 1.5 minutes. Eight customers are randomly chosen,
and the sample standard deviation is obtained as 1.1 minutes. Verify whether the
restaurant's statement is correct at the 0.01 level of significance.

[Solution]

To solve this problem, the first step is to define H0 and H1. In this example, the
parameter of interest is the standard deviation, so H0 and H1 are set as follows:
H0: σ = 1.5

H1: σ < 1.5

With the sample standard deviation s = 1.1, the test statistic value is:

$\chi^2 = (n - 1) S^2 / \sigma^2 = 7 \times 1.21 / 2.25 = 3.76$

This example is a left-tailed test with α = 0.01 and n = 8. The critical value
can be obtained from Appendix Table A.3 or from the Excel function:
$\chi^2_{0.99,7} = 1.24$, or CHIINV(0.99, 7) = 1.24. The rejection region is χ² < 1.24.

Similarly, the function CHIINV can be wrapped in VBA macros and combined with
the UserForm, so that you can choose the one you need depending on the
situation. Figure 7.19 shows the Form controls that can be used to ask the user for an
option.

Figure 7.19 Form controls that ask the user for an option

Because χ² = 3.76 is outside the rejection region, H0 is accepted: at the 0.01 level,
the sample does not show that σ is less than 1.5 minutes.
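
The chi-squared statistic for Example 7.3 can be recomputed from the stated values (n = 8, s = 1.1, claimed σ = 1.5). Note that the denominator is the variance σ² = 2.25, not σ. The Python sketch below is a cross-check only: the function name chi_square_statistic is hypothetical, and the critical value 1.24 is copied from the CHIINV result quoted in the text, since the standard library has no inverse chi-squared function.

```python
def chi_square_statistic(s, sigma0, n):
    """Chi-squared statistic (n - 1) * s^2 / sigma0^2 for a test on the variance."""
    return (n - 1) * s ** 2 / sigma0 ** 2

CHI2_CRIT = 1.24   # CHIINV(0.99, 7) as quoted in the text (assumed, not computed here)

chi2 = chi_square_statistic(s=1.1, sigma0=1.5, n=8)
reject = chi2 < CHI2_CRIT         # left-tailed rejection region
print("chi-squared =", round(chi2, 2), "| reject H0:", reject)
```

The statistic comes out at about 3.76, above the critical value, so it falls outside the left-tailed rejection region.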



7.5 Summaries of Excel Functions

In this chapter, the Excel function NORMINV is used to test the population mean
when σ is given; the function TINV is used to test the population mean when σ is
not given and the sample size is small; and the function CHIINV is used to test the
population variance. Table 7.1 summarizes the Excel functions used in this
chapter.

Table 7.1 Summaries of the built-in functions


FUNCTION   How it works?                                          Notes
NORMINV    It returns the inverse of the normal cumulative        Ex. 7.1
           distribution, i.e. of the left-tailed probability.
TINV       It returns the inverse of the two-tailed probability   Ex. 7.2
           of the t-distribution.
CHIINV     It returns the inverse of the right-tailed             Ex. 7.3
           probability of the chi-squared distribution.

Regression Analysis

8.1 Introduction

Many variables observed in real life are related, such as house area and sale
price, study hours and final grades, calorie intake and weight gain, or crop yield
and the amount of fertilizer used. The relation between such variables can be
expressed in mathematical form. For two variables, say x and y, the fixed variable x
is called the independent variable, and y is called the dependent variable. The
process of estimating y from x is often referred to as regression. The objective of
regression analysis is to express the relationship between two or more variables.
For instance, let the variable x represent the study hours and y the final grades.
Generally, a student who spends a long time (x) studying will earn good marks (y).
However, the variables x and y are not deterministically related, as study time is
just one of the factors that can affect a student's grades.
In this chapter, the fundamentals of regression analysis are introduced,
together with some useful Excel functions and the method of creating charts.

8.2 The Simple Linear Regression Model

Generally, if the regression analysis is limited to examining a linear relationship,
it is referred to as linear regression analysis. For any fixed value of the independent
variable x, the dependent variable is a random variable that is related to x and can
be expressed as:

$Y = \beta_0 + \beta_1 x + \varepsilon$    (8.1)

where ε is the random error term, which follows the normal distribution with
$E(\varepsilon) = 0$

$V(\varepsilon) = \sigma^2$

Without ε, any observed pair (x, y) would lie exactly on the line $y = \beta_0 + \beta_1 x$,
which is referred to as the true regression line. However, as random errors are
unavoidable, ε is almost never exactly zero, and the observed pairs (x, y) fall either
above or below the true regression line.
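
The role of the error term can be illustrated by simulation. The following Python sketch is not from the original text: the parameter values β0 = 2.0, β1 = 0.5, σ = 1.0, the x grid, and the seed are all hypothetical. It generates observations from Y = β0 + β1 x + ε and checks that they scatter on both sides of the true line with an average deviation near zero.

```python
import random
from statistics import fmean

def simulate(beta0, beta1, sigma, xs, seed=0):
    """Draw one Y for each x from Y = beta0 + beta1 * x + eps, eps ~ N(0, sigma)."""
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

BETA0, BETA1, SIGMA = 2.0, 0.5, 1.0            # hypothetical true parameters
xs = [i / 100 for i in range(10000)]           # hypothetical x values from 0 to 99.99
ys = simulate(BETA0, BETA1, SIGMA, xs)

# Deviation of each observation from the true regression line
errors = [y - (BETA0 + BETA1 * x) for x, y in zip(xs, ys)]
above = sum(e > 0 for e in errors)
print("mean error:", round(fmean(errors), 3))
print("points above the true line:", above, "of", len(xs))
```

Roughly half of the points land above the line and half below, which is what E(ε) = 0 implies.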

8.3 Estimating Model Parameters

Writing down the equation Y = β0 + β1 x + ε is not enough, as the values of
β0, β1, and ε are almost never known. Instead, sample data are used to
estimate the parameters and the true regression line. One of the most widely used
methods for estimating the parameters is the least squares method.

The Least Squares Method

Generally, more than one line will appear to fit a given set of data. Figure 8.1
shows three possible lines that can be used to fit the data, and it is hard to say
which one is the best fit.

[Figure: data points with three candidate lines, y1 = b1 x + b0, y2 = k1 x + k0, and
y3 = a1 x + a0; d marks the vertical distance from a data point to a line]

Figure 8.1 Three different regression lines

To estimate the true regression line, one of the most widely used methods is
the least squares method. If the vertical distance (d) from a data point to the line
is small (see Figure 8.1), the line provides a good fit. The principle of least squares
is to obtain the line that has the smallest sum of the squares of the vertical
distances. The intercept β0 and slope β1 can be estimated from the sample data,
b0 for β0 and b1 for β1.

Principle of Least Squares
The sum of squared vertical deviations from the points (x1, y1), …, (xn, yn) to the
line is
$f(b_0, b_1) = \sum_{i=1}^{n} [y_i - (b_0 + b_1 x_i)]^2$    (8.2)

To minimize the sum of squared residuals, we take the partial derivatives of
f(b0, b1), equate both of them to zero, and solve the resulting equations:

$\partial f(b_0, b_1) / \partial b_0 = \sum 2 (y_i - b_0 - b_1 x_i) \times (-1) = 0$

$\partial f(b_0, b_1) / \partial b_1 = \sum 2 (y_i - b_0 - b_1 x_i) \times (-x_i) = 0$

Cancelling the common factor −2 and rearranging gives the following system of
equations, called the normal equations:
$n b_0 + (\sum x_i) b_1 = \sum y_i$

$(\sum x_i) b_0 + (\sum x_i^2) b_1 = \sum x_i y_i$
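
The step from the normal equations to closed-form estimates can be written out explicitly. The derivation below is a sketch that uses only the symbols already defined, with $\bar{x}$ and $\bar{y}$ denoting the sample means: the first normal equation is solved for b0, and the result is substituted into the second.

```latex
\begin{aligned}
n b_0 + \Big(\sum x_i\Big) b_1 = \sum y_i
  \;&\Rightarrow\; b_0 = \bar{y} - b_1 \bar{x}, \\
\Big(\sum x_i\Big)\big(\bar{y} - b_1 \bar{x}\big) + \Big(\sum x_i^2\Big) b_1
  &= \sum x_i y_i, \\
b_1 \Big[\sum x_i^2 - \frac{(\sum x_i)^2}{n}\Big]
  &= \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}, \\
b_1 &= \frac{\sum x_i y_i - (\sum x_i)(\sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n}.
\end{aligned}
```

The numerator and denominator of the last line are the quantities written as S_xy and S_xx in Eq. 8.3.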

The estimated regression line is y = b0 + b1 x. The model parameters b1 and b0
can be estimated as follows:

$b_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2 = S_{xy} / S_{xx}$    (8.3)

$b_0 = \bar{y} - b_1 \bar{x}$    (8.4)

where
$S_{xy} = \sum x_i y_i - (\sum x_i)(\sum y_i) / n$, and
$S_{xx} = \sum x_i^2 - (\sum x_i)^2 / n$

Using Eqs. 8.3 and 8.4, the estimates b1 and b0 can be calculated, and the
regression line is then obtained. Example 8.1 demonstrates the process of
obtaining the parameters by hand calculation. Furthermore, the process of drawing
a scatter diagram is also introduced.

Example 8.1

Some investigations show that the heights of fathers and their sons are strongly
related. The table below shows the respective heights (in inches) x and y of 15
fathers and their sons.

x  69 72 61 70 67 68 70 68 76 68 64 61 60 52 67
y  71 70 64 72 69 66 73 70 82 71 65 66 73 54 70

1) Construct a scatter diagram.
2) Find the least-squares regression line between x and y.

[Solution]

1)

The scatter diagram is often used to show the relationship between two variables.
Excel provides tools to create a scatter diagram. Figure 8.2 shows a scatter diagram
of the fathers' and their sons' heights.

[Scatter plot: father's height (inch) on the horizontal axis, son's height (inch) on
the vertical axis]

Figure 8.2 Scatter diagram to show the interdependence between the fathers'
and their sons' heights

The horizontal axis represents the father's height, and the vertical axis represents
the son's height. It can be seen in Figure 8.2 that the son's height depends on
the father's height.

Creating the chart by Excel

Excel is one of the most widely used software packages for creating charts. The
major steps to create a chart like Figure 8.2 are as follows:

1. Select the data
2. Choose a chart type
3. Format chart elements

1. Select the data

The first step is to select the data. In this example, the range C5:D19 is
selected (shown in Figure 8.3).

Figure 8.3 The source data for creating the chart

2. Choose a chart type

After selecting the data, the next step is to choose a chart type from
Insert → Charts. You can choose whichever type you require. Figure 8.4 displays the
Insert Chart dialog box. The main categories are listed on the left, and the subtypes
are shown as icons.

Figure 8.4 A dialog box of insert chart

In this example, the XY (Scatter) chart is selected. You can see the XY chart after
pressing the OK button (shown in Figure 8.5).

[Default XY scatter chart with both axes starting at 0]

Figure 8.5 An XY scatter chart

3. Format chart elements

Compared with Figure 8.2, Figure 8.5 is unattractive and does not present the data
clearly. Excel allows you to modify the chart and make it look exactly as you like.
In general, there are two ways to modify the chart. One is to use the Ribbon and
the mini toolbar, which appear when you click anywhere inside the chart (shown in
Figure 8.6). The other is to use the shortcut menu: right-click the element you want
to modify and choose the Format option from the shortcut menu.

Figure 8.6 The chart tools

As the chart contains various kinds of elements, it is impossible to explain them
one by one. In the next section, we focus on formatting some of the elements
using the chart tools or the shortcut menu, including the axes, gridlines, and axis
titles.

Format the axis

To adjust the horizontal or vertical axis, right-click the axis and choose
Format from the shortcut menu. A formatting dialog box appears (shown in
Figure 8.7), in which you can modify the items as you like.

Figure 8.7 Axis formatting



Format the gridlines

To format the gridlines, choose Chart Tools → Layout → Axes → Gridlines
(shown in Figure 8.8). This drop-down control contains options for all possible
gridlines in the chart.

Figure 8.8 Gridlines formatting

Tips

If the objective is only to remove the gridlines, you can simply right-click a
gridline and choose the Delete option on the shortcut menu (shown in Figure 8.9).

Figure 8.9 Delete the gridlines

Format the axis title

To change the axis titles, choose Chart Tools → Layout → Axis Titles. In
this example, the text Father's height (inch) is added to the x-axis and Son's height
(inch) is added to the y-axis.
Figure 8.10 is an XY chart formatted from Figure 8.5. After formatting the
axes, gridlines, and axis titles, the figure looks professional.

[Formatted scatter plot: father's height (inch) on the horizontal axis, son's height
(inch) on the vertical axis]

Figure 8.10 XY chart formatted from Figure 8.5

2)

The statistics used for hand calculation are shown in Table 8.1.

Table 8.1 The statistics used for hand calculation


No.    x      y      x²      xy      y²
1      69     71     4761    4899    5041
2      72     70     5184    5040    4900
3      61     64     3721    3904    4096
4      70     72     4900    5040    5184
5      67     69     4489    4623    4761
6      68     66     4624    4488    4356
7      70     73     4900    5110    5329
8      68     70     4624    4760    4900
9      76     82     5776    6232    6724
10     68     71     4624    4828    5041
11     64     65     4096    4160    4225
12     61     66     3721    4026    4356
13     60     73     3600    4380    5329
14     52     54     2704    2808    2916
15     67     70     4489    4690    4900
∑      993    1036   66213   68988   72058

According to Eqs. 8.3 and 8.4, the estimates b1 and b0 can be calculated, and then
the regression line is obtained. The calculations are as follows:
$S_{xx} = 66213 - (993)^2 / 15 = 476.4$

$S_{xy} = 68988 - (993)(1036) / 15 = 404.8$

$b_1 = S_{xy} / S_{xx} = 404.8 / 476.4 = 0.85$

$b_0 = \bar{y} - b_1 \bar{x} = 69.1 - 0.85 \times 66.2 = 12.83$

So the least-squares regression line of y on x is: y = 0.85x + 12.83.
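
The hand calculation can be cross-checked with a short Python sketch that applies Eqs. 8.3 and 8.4 to the Example 8.1 data. The function name least_squares is hypothetical, and the sketch is offered as a check rather than as the book's method.

```python
def least_squares(xs, ys):
    """Return (Sxx, Sxy, b1, b0) using the shortcut formulas of Eqs. 8.3 and 8.4."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b1 = sxy / sxx                           # slope, Eq. 8.3
    b0 = sum(ys) / n - b1 * sum(xs) / n      # intercept, Eq. 8.4
    return sxx, sxy, b1, b0

# Heights (inches) of the 15 fathers (x) and their sons (y) from Example 8.1
fathers = [69, 72, 61, 70, 67, 68, 70, 68, 76, 68, 64, 61, 60, 52, 67]
sons = [71, 70, 64, 72, 69, 66, 73, 70, 82, 71, 65, 66, 73, 54, 70]

sxx, sxy, b1, b0 = least_squares(fathers, sons)
print(round(sxx, 1), round(sxy, 1), round(b1, 2), round(b0, 2))
```

Carrying full precision gives an intercept of about 12.82 rather than 12.83, because the hand calculation rounds b1 and the two means before subtracting.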

The estimates b0 and b1 can be obtained by hand calculation. However, the
process is repetitive and tedious. Alternatively, it is more efficient to use Excel's
built-in functions to obtain the parameters. Example 8.2 demonstrates the use of
the functions SLOPE, INTERCEPT, TREND, and FORECAST to obtain the
estimates.

Example 8.2

Some researchers say that students' academic performance and their study
hours are related. Twenty students from UST are randomly sampled for a survey
named Study Or Fail. Let x represent the study hours (per day), and let y represent
the grade point average (GPA) in the semester. The table below records the 20
students' study hours (x) and their GPA (y).

x  5    4    6    3.5  7    6    6.5  6    8    5.5
y  3.1  2.8  3.4  2.5  3.6  3.2  3.6  3.0  3.8  3.0

x  5.4  5.8  6    4.4  6.4  5    5.7  7    6.5  4
y  3.3  3.2  3.6  2.8  3.8  2.9  3.3  3.5  3.3  3.1

1) Construct a scatter diagram.
2) Find the least-squares regression line between x and y.
3) Suppose that five newcomers want to predict their GPA using the regression line
obtained in question 2. Their study hours (per day) are 3.3, 2.0, 8.0, 6.0, and 4.4.
Use these data to predict the newcomers' GPA.

[Solution]

1)

Since the procedure for drawing a scatter diagram has already been demonstrated
in detail in Example 8.1, it is skipped here. Figure 8.11 shows a scatter diagram
that depicts the relationship between the study hours and GPA. The horizontal axis
represents the study hours, and the vertical axis represents the student's GPA. It
can be seen in Figure 8.11 that the study hours and GPA are positively related.

[Scatter plot: study hours (hours per day) on the horizontal axis, GPA on the
vertical axis]

Figure 8.11 Scatter diagram to show the interdependence between the study
hours and the GPA

2)

As mentioned earlier, the first step in obtaining the least-squares regression line
is to estimate the parameters b0 and b1. The functions SLOPE and INTERCEPT can be
used to estimate b1 and b0, respectively, and the regression line is then obtained
by putting b0 and b1 into the equation y = b1 x + b0.

Excel functions – SLOPE, INTERCEPT, TREND, FORECAST

Function SLOPE

The function SLOPE is used to calculate the slope of the linear regression line.
The syntax is shown as follows:

Syntax
= SLOPE(known_y's, known_x's)

known_y's: the range of numeric dependent data points
known_x's: the range of independent data points

The data are shown in columns C and D. After entering the function SLOPE
in cell N6, select the ranges of numbers required by the function. The
result is shown in Figure 8.12.

Figure 8.12 Using the function SLOPE to obtain the slope

However, the argument order SLOPE(known_y's, known_x's), with the y values
first, may feel odd. The positions of the arguments can be switched using VBA. In
Code 8.1, the function SLOPE is wrapped to switch the order in which the arguments
are entered.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code 8.1

'****************************************************************
'Purpose: To wrap the function SLOPE and switch the order in which the
'         arguments are entered.
'Define variables:
'a: Known xs, an array or cell range of independent data points
'b: Known ys, an array or cell range of dependent data points

*****Start Coding**********************************************

Function newslope(a, b)
    'SLOPE expects the y values first, so pass b (ys) before a (xs)
    newslope = Application.WorksheetFunction.Slope(b, a)
End Function
***********************************************end of coding *******
Tips

Pay attention that a and b here are the range of cells.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this example, b1 = SLOPE (D6:D25, C6:C25) = 0.265, or, using the new function,
NEWSLOPE (C6:C25, D6:D25) = 0.265. The results are the same.

Function INTERCEPT

The function INTERCEPT is used to calculate the point at which a line will
intersect the y-axis. The syntax is shown as follows:

Syntax
= INTERCEPT(known ys, known xs)

Known ys: the dependent set of observations or data


Known xs: the independent set of observations or data

The data are shown in columns C and D. After entering the function
INTERCEPT in cell N7, select the ranges of numbers required by the
function. The result is shown in Figure 8.13.

Figure 8.13 Using the function INTERCEPT to obtain the intercept

As with the function SLOPE, the arguments of the function INTERCEPT are entered
with the known ys first, and then the known xs. You may also want to switch
the order of the arguments, so that the known xs come first, and then the known ys.
Code 8.2 shows how to recreate the function INTERCEPT with the arguments entered
in the reverse order.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code 8.2

'****************************************************************
'Purpose: to recreate the function INTERCEPT with the arguments entered
'         in the reverse order (known xs first, then known ys).
'Define variables:
'a: Known xs, an array or cell range of numeric independent data points
'b: Known ys, an array or cell range of numeric dependent data points
'*****Start coding***********************************************

Function newintercept(a, b)
    'INTERCEPT expects the known ys first, so pass b (ys) before a (xs)
    newintercept = Application.WorksheetFunction.Intercept(b, a)
End Function
'*****End of coding**********************************************

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this example, b0 = INTERCEPT (D6:D25, C6:C25) = 1.728, or, using the new
function, NEWINTERCEPT (C6:C25, D6:D25) = 1.728. The results are the same.
After calculating the values of b0 and b1 , the regression line is obtained as

y = 0.27 x + 1.73.
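As a cross-check on SLOPE and INTERCEPT, the same estimates can be computed directly from the least-squares sums. The sketch below is only illustrative: the names ManualSlope and ManualIntercept are hypothetical (they are not Excel functions and do not appear in the code listings of this chapter), and the code assumes that xs and ys are single-column ranges of equal length containing only numbers.

```vba
'Hypothetical helpers: least-squares slope and intercept computed from
'the defining sums rather than from the built-in worksheet functions.
'xs: range of independent data points; ys: range of dependent data points.
Function ManualSlope(xs As Range, ys As Range) As Double
    Dim i As Long, n As Long
    Dim xBar As Double, yBar As Double
    Dim sxy As Double, sxx As Double

    n = xs.Cells.Count
    xBar = Application.WorksheetFunction.Average(xs)
    yBar = Application.WorksheetFunction.Average(ys)

    For i = 1 To n
        sxy = sxy + (xs.Cells(i).Value - xBar) * (ys.Cells(i).Value - yBar)
        sxx = sxx + (xs.Cells(i).Value - xBar) ^ 2
    Next i

    ManualSlope = sxy / sxx      'b1 = Sxy / Sxx
End Function

Function ManualIntercept(xs As Range, ys As Range) As Double
    'b0 = mean(y) - b1 * mean(x)
    ManualIntercept = Application.WorksheetFunction.Average(ys) _
        - ManualSlope(xs, ys) * Application.WorksheetFunction.Average(xs)
End Function
```

With the worksheet layout of this example, =ManualSlope(C6:C25, D6:D25) and =ManualIntercept(C6:C25, D6:D25) should agree with the values returned by SLOPE and INTERCEPT.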

3)

Since the regression line y = 0.27 x + 1.73 was obtained in part 2, the value of y
can be found simply by substituting the value of x into the regression line. However,
if you do not know the regression line and want to obtain the value of y directly,
the functions TREND and FORECAST are good choices.

Function TREND

The function TREND can be used to predict y value from each x without
knowing the regression line y = b1 x + b0 . The syntax is shown as follows:

Syntax
= TREND (known ys, known xs, new xs, const)

Known ys : an array or cell range of numeric dependent data points
Known xs : an array or cell range of numeric independent data points
New xs : the new x-values for which you want predicted y-values
Const : a logical value specifying whether to force the constant b to equal 0
(FALSE) or to calculate it normally (TRUE or omitted)

Continuing Example 8.2, suppose there are five new students whose study
hours are already given. The function TREND can be used to predict the students'
GPA. The original data are shown in columns C and D, and the new students'
study hours are shown in column U. This function returns an array of values, so
it must be entered as an array formula (by pressing Ctrl + Shift + Enter). The result is
shown in Figure 8.14.

Figure 8.14 Using the function TREND to predict the value of y
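When the constant is calculated normally, the value that TREND returns for each new x is the fitted line evaluated at that x. The following sketch makes this explicit. The function name PredictY is hypothetical, and the code assumes that xs and ys are ranges of equal length.

```vba
'Hypothetical helper: predicted y for a single new x, obtained by
'evaluating the least-squares line y = b0 + b1 * x.
'xs: range of independent data points; ys: range of dependent data points.
Function PredictY(xs As Range, ys As Range, newX As Double) As Double
    Dim b0 As Double, b1 As Double

    'The worksheet functions expect the known ys first
    b1 = Application.WorksheetFunction.Slope(ys, xs)
    b0 = Application.WorksheetFunction.Intercept(ys, xs)

    PredictY = b0 + b1 * newX
End Function
```

For instance, assuming the first new study-hours value is in cell U6 (the row is an assumption; the text only states that the values are in column U), =PredictY($C$6:$C$25, $D$6:$D$25, U6) can be filled down to give one prediction per new student without entering an array formula.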

Function FORECAST

The function FORECAST can also be used to predict the y value for a given x value.
The syntax is shown as follows:

Syntax
= FORECAST (x, known ys, known xs)

x : the data point for which you want to predict a value
Known ys : the dependent range of data
Known xs : the independent range of data

Continuing Example 8.2, the known xs and known ys are shown in
columns C and D, respectively (see Figure 8.14). Suppose a student spends
3.3 hours per day studying. The GPA can be predicted using the function FORECAST
as:
FORECAST (3.3, D6:D25, C6:C25) = 2.6.
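As a check, the same prediction follows from substituting x = 3.3 into the fitted line, using the estimates b1 = 0.265 and b0 = 1.728 reported earlier in this example:

```latex
\hat{y} = b_1 x + b_0 = 0.265 \times 3.3 + 1.728 = 0.8745 + 1.728 = 2.6025 \approx 2.6
```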

In Example 8.2, b1 and b0 were obtained with the Excel functions SLOPE and
INTERCEPT. However, in regression analysis, obtaining the least-squares
regression line is not enough. We also need to test how well the regression line
represents the data, which is expressed by the coefficient of determination and
the correlation coefficient.

8.4 Coefficient of Determination and Correlation Coefficient

8.4.1 Coefficient of Determination

In reality, the regression line rarely passes exactly through every point,
and some variation is unavoidable. The further the line is from the points, the less
of the variation it is able to explain. The coefficient of determination measures how
well the regression line represents the data.

The error sum of squares (SSE) measures how much of the variation in y is not
described by the regression line. It is the sum of the squared deviations of the
observed y values from the fitted values $\hat{y}_i$:

$$ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{8.5} $$

The estimate of $\sigma^2$ is:

$$ \hat{\sigma}^2 = \frac{SSE}{n-2} \tag{8.6} $$

The total sum of squares (SST) is the sum of the squared deviations about the
horizontal line at the mean $\bar{y}$:

$$ SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \tag{8.7} $$

Figure 8.15 shows the least squares line, the horizontal line at height $\bar{y}$, the
deviations about the least squares line $(y_i - \hat{y}_i)$, and the deviations
about the horizontal line $(y_i - \bar{y})$. SSE and SST are the sums of the squares
of these two kinds of deviations, respectively.

[Figure 8.15: plot of y against x showing the least squares line, the horizontal line at height $\bar{y}$, and the deviations $(y_i - \hat{y}_i)$ and $(y_i - \bar{y})$]

Figure 8.15 The coefficient of determination

The coefficient of determination is:

$$ r^2 = 1 - \frac{SSE}{SST} \tag{8.8} $$

SSE is the sum of squared deviations about the least squares line, and SST is the
sum of squared deviations about the horizontal line at the mean $\bar{y}$. The ratio
SSE/SST is the proportion of the total variation that is not explained by the least
squares line, and 1 - SSE/SST is the proportion that is explained by it. If r2 is close
to 1, the regression line is a good fit.
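To make the definitions concrete, the following sketch computes SSE and SST by looping over the data points and then applies Eq. 8.8. The function name ManualRSq is hypothetical, and the code assumes that xs and ys are single-column ranges of equal length containing only numbers.

```vba
'Hypothetical helper: r squared computed from the definitions of SSE and SST.
'xs: range of independent data points; ys: range of dependent data points.
Function ManualRSq(xs As Range, ys As Range) As Double
    Dim i As Long
    Dim b0 As Double, b1 As Double, yBar As Double
    Dim yHat As Double
    Dim sumSqErr As Double, sumSqTot As Double

    b1 = Application.WorksheetFunction.Slope(ys, xs)
    b0 = Application.WorksheetFunction.Intercept(ys, xs)
    yBar = Application.WorksheetFunction.Average(ys)

    For i = 1 To ys.Cells.Count
        yHat = b0 + b1 * xs.Cells(i).Value                      'fitted value
        sumSqErr = sumSqErr + (ys.Cells(i).Value - yHat) ^ 2    'about the line
        sumSqTot = sumSqTot + (ys.Cells(i).Value - yBar) ^ 2    'about the mean
    Next i

    ManualRSq = 1 - sumSqErr / sumSqTot                         'Eq. 8.8
End Function
```

The result should match the value returned by the built-in function RSQ for the same data.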

8.4.2 Correlation Coefficient

The correlation coefficient r measures the strength and the direction of the linear
relationship between two variables. Its value ranges from -1 to +1.

The formula is:

$$ r = \frac{S_{xy}}{\sqrt{S_{xx} \times S_{yy}}} \tag{8.9} $$

Where:

$$ S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) $$

$$ S_{xx} = \sum (x_i - \bar{x})^2 $$

$$ S_{yy} = \sum (y_i - \bar{y})^2 $$

A positive r indicates that the value of y tends to increase as x increases. If x and y
have a strong positive linear correlation, r is close to +1. A negative r indicates that
the value of y tends to decrease as x increases. If x and y have a strong negative
linear correlation, r is close to -1. Furthermore, a value of r near zero means
that there is little or no linear relationship between the two variables, although
a nonlinear relationship may still exist.
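Eq. 8.9 can be evaluated step by step as in the following sketch. The function name ManualCorrel is hypothetical, and the code assumes that xs and ys are single-column ranges of equal length containing only numbers.

```vba
'Hypothetical helper: correlation coefficient computed from Sxy, Sxx and Syy.
'xs: range of independent data points; ys: range of dependent data points.
Function ManualCorrel(xs As Range, ys As Range) As Double
    Dim i As Long
    Dim xBar As Double, yBar As Double
    Dim dx As Double, dy As Double
    Dim sxy As Double, sxx As Double, syy As Double

    xBar = Application.WorksheetFunction.Average(xs)
    yBar = Application.WorksheetFunction.Average(ys)

    For i = 1 To xs.Cells.Count
        dx = xs.Cells(i).Value - xBar
        dy = ys.Cells(i).Value - yBar
        sxy = sxy + dx * dy
        sxx = sxx + dx ^ 2
        syy = syy + dy ^ 2
    Next i

    ManualCorrel = sxy / Sqr(sxx * syy)     'Eq. 8.9
End Function
```

Because the sign of r comes only from Sxy, a negative result signals that y tends to fall as x rises.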
Excel provides built-in functions to estimate r2 and r, including the functions
RSQ, CORREL and LINEST.

Excel functions – RSQ, CORREL, LINEST, STEYX

Function RSQ

The function RSQ returns the coefficient of determination. The syntax is shown

as follows:

Syntax
= RSQ (known ys, known xs)

Known ys: an array or range of data points


Known xs: an array or range of data points

Function CORREL

The function CORREL returns the correlation coefficient. The syntax is shown
as follows:

Syntax
= CORREL (known ys, known xs)

Known ys: an array or range of data points


Known xs: an array or range of data points

Function STEYX

The function STEYX returns the standard error of the predicted y-value for each x
in the regression. The syntax is shown as follows:

Syntax
= STEYX (known ys, known xs)

Known ys: is an array or range of dependent data points


Known xs : is an array or range of independent data points

Excel does not provide dedicated functions named SSE and SST. However,
you can use VBA to create them.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code 8.3

'Purpose: to create the function SSE to calculate the sum of
'         squared deviations about the least squares line.
'Define variables:
'a: Known xs, an array or cell range of numeric independent data points.
'b: Known ys, an array or cell range of numeric dependent data points.
'*****Start coding***********************************************

Function SSE(a, b)
    'STEYX expects the known ys first; STEYX ^ 2 = SSE / (n - 2)
    SSE = Application.WorksheetFunction.StEyx(b, a) ^ 2 * _
          (Application.WorksheetFunction.Count(a) - 2)
End Function
'*****End of coding**********************************************

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Code 8.4

'Purpose: to create the function SST to calculate the sum of
'         squared deviations about the horizontal line at the mean of y.
'Define variables:
'a: Known xs, an array or cell range of numeric independent data points.
'b: Known ys, an array or cell range of numeric dependent data points.
'*****Start coding***********************************************

Function SST(a, b)
    'SST = SSE / (1 - r2), where SSE = STEYX ^ 2 * (n - 2)
    SST = Application.WorksheetFunction.StEyx(b, a) ^ 2 * _
          (Application.WorksheetFunction.Count(a) - 2) / _
          (1 - Application.WorksheetFunction.RSq(b, a))
End Function
'*****End of coding**********************************************

Comments
SST is obtained by rearranging Eq. 8.8:

$$ SST = \frac{SSE}{1 - r^2} $$

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
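The identities behind Codes 8.3 and 8.4 can be written out explicitly. STEYX returns the standard error of the estimate, written here as $s_{y \cdot x}$ (this symbol is introduced only for illustration), which equals the square root of SSE/(n - 2). Combined with Eq. 8.8, this gives:

```latex
s_{y \cdot x} = \sqrt{\frac{SSE}{n-2}}
\quad\Longrightarrow\quad
SSE = s_{y \cdot x}^{2}\,(n-2)

r^{2} = 1 - \frac{SSE}{SST}
\quad\Longrightarrow\quad
SST = \frac{SSE}{1 - r^{2}}
```

Note that the expression for SST is undefined when r2 = 1, because a perfect fit gives SSE = 0 and the quotient becomes 0/0.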

Example 8.3 demonstrates the process of calculating the coefficient of
determination and the correlation coefficient.

Example 8.3

Continuing Example 8.2, we have already obtained the least-squares
regression line, y = 0.27 x + 1.73 . However, how well this linear
trend fits the actual data has not yet been determined. Find the values of the
coefficient of determination and the correlation coefficient.

[Solution]

The functions RSQ, CORREL and LINEST are used to estimate r2 and r.

To estimate r2

In Example 8.3, the data are shown in columns C and D. After entering the
function RSQ in cell G6, select the ranges of numbers required by the
function. The result is shown in Figure 8.16.

Figure 8.16 Process of using the function RSQ



As mentioned earlier, the coefficient of determination (r2) measures how well the
regression line represents the data. In this example, r2 is equal to 0.74, which means
that about 74% of the variation in GPA is explained by the regression line, so the
linear trendline is acceptable.

To estimate r

In Example 8.3, the data are shown in columns C and D. After entering the
function CORREL in cell G7, select the ranges of numbers required
by the function. The result is shown in Figure 8.17.

Figure 8.17 Process of using the function CORREL

In Example 8.3, CORREL(D6:D25,C6:C25) = 0.86, which indicates a strong
positive correlation between the study time and GPA. Notice that the square of
this result is approximately 0.74, which is equal to r2.

In Examples 8.2 and 8.3, we introduced the functions SLOPE,
INTERCEPT, RSQ, and CORREL, which can be used to estimate the parameters b1,
b0, r2 and r , respectively. However, it is inconvenient to calculate the parameters
one by one. We now introduce a more powerful function called LINEST, which estimates
not only b0 and b1, but also other statistics used in regression analysis, such as
r2.

Function LINEST

The function LINEST is an array formula, which produces an array of results.
The syntax is shown as follows:

Syntax
=LINEST (known ys, known xs, const, stats)

Known ys: an array or cell range of numeric dependent data points
Known xs: an array or cell range of numeric independent data points
Const : a logical value specifying whether to force the constant b to equal 0.
Entering 0 (or FALSE) forces the constant b to equal 0; entering 1
(or TRUE), or omitting the argument, means that b is calculated normally.
Stats : a logical value specifying whether to return additional regression
statistics. Entering 0 (or FALSE), or omitting the argument, returns only
the slope and intercept; entering 1 (or TRUE) also returns the additional
regression statistics, such as the error estimates.

This function returns an array of values, so it must be entered as an array
formula. The first step is to select the area that will hold the outputs of the
array formula (in this example, L7:M11). After that, type the formula and
press Ctrl + Shift + Enter. Figure 8.18 shows the process of using the function
LINEST.

Figure 8.18 Statistics related to regression analysis



Figure 8.18 shows the statistics that are related to the regression analysis. The
table's first and last columns are not provided by the function LINEST; we added
these labels manually to show the meaning of each cell. Some statistics, such as the
overall F-test, are not discussed here, although they are necessary in regression
analysis.
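The individual statistics can also be read from the LINEST output in VBA. The sketch below assumes the worksheet layout of the examples (known xs in C6:C25, known ys in D6:D25, on the active sheet). The procedure name ShowLinestStats is hypothetical. The array positions follow the standard five-row, two-column LINEST layout for a single x variable.

```vba
'Sketch: read individual statistics from the array returned by LINEST.
'Assumes the known xs are in C6:C25 and the known ys are in D6:D25.
Sub ShowLinestStats()
    Dim stats As Variant
    Dim xs As Range, ys As Range

    Set xs = ActiveSheet.Range("C6:C25")
    Set ys = ActiveSheet.Range("D6:D25")

    'Arguments: known ys, known xs, const, stats
    stats = Application.WorksheetFunction.LinEst(ys, xs, True, True)

    'The returned array is indexed from 1: stats(row, column)
    Debug.Print "b1 (slope):                "; stats(1, 1)
    Debug.Print "b0 (intercept):            "; stats(1, 2)
    Debug.Print "r squared:                 "; stats(3, 1)
    Debug.Print "standard error of y:       "; stats(3, 2)
    Debug.Print "regression sum of squares: "; stats(5, 1)
    Debug.Print "residual sum of squares:   "; stats(5, 2)
    Debug.Print "total sum of squares:      "; stats(5, 1) + stats(5, 2)
End Sub
```

The results are written to the Immediate window of the VBA editor. The residual sum of squares in the last row corresponds to SSE, and adding the regression sum of squares to it gives SST.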

8.5 Intrinsic Linear Regression

In the previous sections, we focused on analyzing two variables that have a
linear relationship. Some variables may not have an obvious linear
relationship themselves. However, after a suitable transformation of the variables x
and/or y, the relationship between the resulting variables may be linear.

Definition
A probability model relating y to x is intrinsically linear if, by means of a
transformation on y and/or x, it can be reduced to a linear probabilistic model:
Y ' = β 0 + β1 x '+ ε '

Four important intrinsically linear functions are given in Table 8.2. For an
exponential function, only y is transformed to achieve linearity. For a power function
relationship, both x and y are transformed to achieve linearity.

Table 8.2 Useful intrinsically linear functions


Function                            Transformation(s) to Linearize     Linear Form

a. Exponential: y = a e^(βx)        y' = ln(y)                         y' = ln(a) + βx

b. Power: y = a x^β                 y' = log(y), x' = log(x)           y' = log(a) + βx'

c. y = a + β log(x)                 x' = log(x)                        y = a + βx'

d. Reciprocal: y = a + β (1/x)      x' = 1/x                           y = a + βx'

One of the advantages of an intrinsically linear model is that the parameters
of the transformed model, such as b0, b1, r2 and r, can be estimated immediately using
the principle of least squares. For instance, according to Eqs. 8.3 and 8.4, we can
estimate b0 and b1 as follows:

$$ b_1 = \frac{\sum (x'_i - \bar{x}')(y'_i - \bar{y}')}{\sum (x'_i - \bar{x}')^2} $$

$$ b_0 = \bar{y}' - b_1 \bar{x}' $$
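The two formulas above can be applied to transformed data as in the following sketch for the exponential model, in which y' = ln(y) and x is left unchanged. The function name FitExponential is hypothetical, and the code assumes that xs and ys are single-column ranges of equal length and that every y value is positive, since the logarithm is undefined otherwise.

```vba
'Hypothetical helper: fit y = a * Exp(beta * x) by regressing ln(y) on x.
'Returns a two-element array holding a and beta, in that order.
Function FitExponential(xs As Range, ys As Range) As Variant
    Dim i As Long, n As Long
    Dim xBar As Double, lnYBar As Double
    Dim sxy As Double, sxx As Double
    Dim b0 As Double, b1 As Double

    n = ys.Cells.Count

    'Means of x and of the transformed response ln(y)
    For i = 1 To n
        xBar = xBar + xs.Cells(i).Value / n
        lnYBar = lnYBar + Log(ys.Cells(i).Value) / n    'VBA Log is the natural log
    Next i

    'Least-squares sums on the transformed data
    For i = 1 To n
        sxy = sxy + (xs.Cells(i).Value - xBar) * (Log(ys.Cells(i).Value) - lnYBar)
        sxx = sxx + (xs.Cells(i).Value - xBar) ^ 2
    Next i

    b1 = sxy / sxx
    b0 = lnYBar - b1 * xBar

    'Back-transform the intercept: ln(a) = b0, so a = Exp(b0); beta = b1
    FitExponential = Array(Exp(b0), b1)
End Function
```

To see both values on a worksheet, select two adjacent cells in a row, type the formula with the x range first and the y range second, and enter it as an array formula with Ctrl + Shift + Enter.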

Example 8.4 demonstrates the process of using the intrinsically linear model to
solve the problem.

Example 8.4

Some researchers suggest that one of the important factors affecting the
moisture content (%) of chips is the frying time (sec). The table below shows the
relationship between the frying time (x) and the moisture content (y).

x 1 4 9 15 23 28 30 45 60
y 20 16.3 9.7 8.1 4.2 3.4 2.9 1.9 1.3

1) Construct a scatter diagram.
2) Use the exponential and power functions to estimate the parameters, and
determine which one is the more appropriate model.

[Solution]

1)

The procedure for drawing a scatter diagram in Excel is demonstrated in detail
in Example 8.1 (pages 174-179). The scatter diagram that shows the relationship
between the frying time and the moisture content can be drawn using the same steps.
Figure 8.19 shows a scatter diagram of the frying time and the moisture
content. The horizontal axis represents the frying time, and the vertical axis
represents the moisture content. The chart shows that the frying time and the
moisture content are negatively related.

[Figure 8.19: scatter plot; horizontal axis: Frying Time (sec), 0 to 60; vertical axis: Moisture Content (%), 0 to 25]

Figure 8.19 Scatter plot of y vs. x

In Examples 8.2 and 8.3, we introduced the functions that estimate the
parameters of the regression line. A shortcut for obtaining the regression line
and r2 is the Format Trendline dialog, which can be opened by right-clicking
any point on the graph and choosing the option Add Trendline.

Figure 8.20 The format trendline



After choosing the options Linear, Display Equation on chart, and
Display R-squared value on chart, you get the regression line, its equation and the
R-squared value at the same time (shown in Figure 8.21).
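The same trendline options can also be set from VBA. The sketch below assumes that the scatter chart has been selected and that the data are in its first series. The procedure name AddLinearTrendline is hypothetical.

```vba
'Sketch: add a linear trendline, with its equation and R-squared value,
'to the first data series of the currently selected chart.
Sub AddLinearTrendline()
    Dim tl As Trendline

    If ActiveChart Is Nothing Then
        MsgBox "Select the scatter chart first."
        Exit Sub
    End If

    Set tl = ActiveChart.SeriesCollection(1).Trendlines.Add(Type:=xlLinear)
    tl.DisplayEquation = True       'Display Equation on chart
    tl.DisplayRSquared = True       'Display R-squared value on chart
End Sub
```

Replacing xlLinear with xlExponential or xlPower requests the corresponding trendline type instead.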

[Figure 8.21: scatter plot with linear trendline; horizontal axis: Frying Time (sec), 0 to 60; vertical axis: Moisture Content (%), 0 to 25; displayed equation: y = -0.292x + 14.51, R² = 0.721]

Figure 8.21 Scatter plot of y vs. x

Figure 8.21 shows that the frying time and the moisture content are negatively
related. As mentioned earlier, r2 measures how well the regression line represents
the data. In this example, r2 = 0.721, so the linear trendline is not a very good fit.
However, this does not mean that the variables x and y are unrelated. The
exponential and power models are used to test whether the transformed x and/or y
have a strong linear relationship.

2)

Exponential Function Relationship

For the exponential function relationship, only y is transformed, to ln(y), to achieve
linearity. Figure 8.22 shows the data values used in the linear regression analysis.

Figure 8.22 Data for Example 8.4

Figure 8.23 shows a scatter diagram of the transformed data. The horizontal axis
represents the frying time (x), and the vertical axis represents the natural logarithm
of the moisture content (ln(y)).

[Figure 8.23: scatter plot with linear trendline; horizontal axis: Frying Time (sec), 0 to 60; vertical axis: ln of Moisture Content, 0 to 3.5; displayed equation: y = -0.047x + 2.773, R² = 0.940]

Figure 8.23 Scatter plot of ln(y) vs. x

In Figure 8.23, it is obvious that each point is close to the regression line.
Furthermore, in this example, r2 = 0.94, which is considerably better than the value
of 0.721 obtained for the untransformed data.

Power Function Relationship

For the power function relationship, both x and y are transformed to achieve linearity.
Figure 8.24 shows the scatter plot of ln(y) against ln(x).

[Figure 8.24: scatter plot with linear trendline; horizontal axis: ln of Frying Time, 0 to 5; vertical axis: ln of Moisture Content, 0 to 4; displayed equation: y = -0.685x + 3.475, R² = 0.882]

Figure 8.24 Scatter plot of ln(y) vs. ln(x)

In Figure 8.24, the horizontal axis represents the transformed frying time (ln(x)),
and the vertical axis represents the transformed moisture content (ln(y)). The chart
shows that the transformed frying time and moisture content are negatively related,
with r2 = 0.882.
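As mentioned earlier, the closer r2 is to 1, the more successful the regression
model is in explaining the variation in y. The exponential model gives r2 = 0.940,
compared with 0.882 for the power model and 0.721 for the linear model, so the
exponential model is the most appropriate of the three. From the fitted line
y' = -0.047x + 2.773, the estimated regression function for the exponential model is
ln(y) = -0.047x + 2.773, that is, y = e^(-0.047x + 2.773).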
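The back-transformation to the original variables can be written out step by step. This assumes that natural logarithms were used for both transformed fits, as the captions of Figures 8.23 and 8.24 indicate. The coefficients are the rounded values shown on the trendlines.

```latex
\text{Exponential:}\quad
\ln(y) = -0.047x + 2.773
\;\Longrightarrow\;
y = e^{2.773}\,e^{-0.047x} \approx 16.0\,e^{-0.047x}

\text{Power:}\quad
\ln(y) = -0.685\ln(x) + 3.475
\;\Longrightarrow\;
y = e^{3.475}\,x^{-0.685} \approx 32.3\,x^{-0.685}
```

In the notation of Table 8.2, this corresponds to a ≈ 16.0 and β ≈ -0.047 for the exponential model, and a ≈ 32.3 and β ≈ -0.685 for the power model.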

8.6 Summaries of Excel Functions

In this chapter, we introduced several Excel functions related to regression
analysis. They are summarized in Table 8.3.
The functions SLOPE and INTERCEPT can be used to estimate the parameters b1 and
b0. The functions RSQ and CORREL can be used to obtain r2 and r. The function
LINEST can be used to estimate ten statistics relating to regression analysis,
including the parameters b0, b1, and r2. The function TREND is used to predict the
y values for new x values.

Table 8.3 Summaries of the built-in functions


FUNCTION     How it works?                                              Notes
SLOPE        It returns the slope of the linear regression line.        Ex. 8.2
INTERCEPT    It calculates the intercept of the regression line.        Ex. 8.2
LINEST       It returns the parameters of a linear trend.               Ex. 8.3
TREND        It returns the y-values along that line for the            Ex. 8.2
             array of new_x's that you specified.
FORECAST     It predicts a y-value for a given x-value.                 Ex. 8.2
RSQ          It returns the coefficient of determination.               Ex. 8.3 & 8.4
CORREL       It returns the correlation coefficient between             Ex. 8.3 & 8.4
             two data sets.
STEYX        It returns the standard error of the predicted             Ex. 8.3
             y-value for each x in the regression.

In this chapter, functions such as SLOPE and INTERCEPT are recreated to
switch the order of the arguments: known_x's first, and then known_y's. Table 8.4
summarizes the user-defined functions.

Table 8.4 Summaries of the user defined functions


FUNCTION        How it works?
NEWSLOPE        It returns the slope of the linear regression line using the
                newly defined argument order.
NEWINTERCEPT    It calculates the point at which a line will intersect the y-axis
                using the newly defined argument order.
NEWRSQ          It returns the coefficient of determination using the newly defined
                argument order.
NEWCORREL       It returns the correlation coefficient between two data sets using
                the newly defined argument order.
SSE             It returns the value of SSE.
SST             It returns the value of SST.

Appendix Tables

Table A.1 Standard normal curve areas


z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998

Table A.2 Critical values for t-distributions


d.o.f α = 0.1 α = 0.05 α = 0.025 α = 0.01 α = 0.005 α = 0.001 α = 0.0005
1 3.078 6.314 12.706 31.821 63.657 318.309 636.619
2 1.886 2.920 4.303 6.965 9.925 22.327 31.599
3 1.638 2.353 3.182 4.541 5.841 10.215 12.924
4 1.533 2.132 2.776 3.747 4.604 7.173 8.610
5 1.476 2.015 2.571 3.365 4.032 5.893 6.869
6 1.440 1.943 2.447 3.143 3.707 5.208 5.959
7 1.415 1.895 2.365 2.998 3.499 4.785 5.408
8 1.397 1.860 2.306 2.896 3.355 4.501 5.041
9 1.383 1.833 2.262 2.821 3.250 4.297 4.781
10 1.372 1.812 2.228 2.764 3.169 4.144 4.587
11 1.363 1.796 2.201 2.718 3.106 4.025 4.437
12 1.356 1.782 2.179 2.681 3.055 3.930 4.318
13 1.350 1.771 2.160 2.650 3.012 3.852 4.221
14 1.345 1.761 2.145 2.624 2.977 3.787 4.140
15 1.341 1.753 2.131 2.602 2.947 3.733 4.073
16 1.337 1.746 2.120 2.583 2.921 3.686 4.015
17 1.333 1.740 2.110 2.567 2.898 3.646 3.965
18 1.330 1.734 2.101 2.552 2.878 3.610 3.922
19 1.328 1.729 2.093 2.539 2.861 3.579 3.883
20 1.325 1.725 2.086 2.528 2.845 3.552 3.850
21 1.323 1.721 2.080 2.518 2.831 3.527 3.819
22 1.321 1.717 2.074 2.508 2.819 3.505 3.792
23 1.319 1.714 2.069 2.500 2.807 3.485 3.768
24 1.318 1.711 2.064 2.492 2.797 3.467 3.745
25 1.316 1.708 2.060 2.485 2.787 3.450 3.725
26 1.315 1.706 2.056 2.479 2.779 3.435 3.707
27 1.314 1.703 2.052 2.473 2.771 3.421 3.690
28 1.313 1.701 2.048 2.467 2.763 3.408 3.674
29 1.311 1.699 2.045 2.462 2.756 3.396 3.659
30 1.310 1.697 2.042 2.457 2.750 3.385 3.646
31 1.309 1.696 2.040 2.453 2.744 3.375 3.633
32 1.309 1.694 2.037 2.449 2.738 3.365 3.622
33 1.308 1.692 2.035 2.445 2.733 3.356 3.611

Table A.3 Critical values for chi-squared distributions


d.o.f α = 0.995 α = 0.99 α = 0.975 α = 0.95 α = 0.05 α = 0.025 α = 0.01 α = 0.005
1 0.000 0.000 0.001 0.004 3.841 5.024 6.635 7.879
2 0.010 0.020 0.051 0.103 5.991 7.378 9.210 10.597
3 0.072 0.115 0.216 0.352 7.815 9.348 11.345 12.838
4 0.207 0.297 0.484 0.711 9.488 11.143 13.277 14.860
5 0.412 0.554 0.831 1.145 11.070 12.833 15.086 16.750
6 0.676 0.872 1.237 1.635 12.592 14.449 16.812 18.548
7 0.989 1.239 1.690 2.167 14.067 16.013 18.475 20.278
8 1.344 1.646 2.180 2.733 15.507 17.535 20.090 21.955
9 1.735 2.088 2.700 3.325 16.919 19.023 21.666 23.589
10 2.156 2.558 3.247 3.940 18.307 20.483 23.209 25.188
11 2.603 3.053 3.816 4.575 19.675 21.920 24.725 26.757
12 3.074 3.571 4.404 5.226 21.026 23.337 26.217 28.300
13 3.565 4.107 5.009 5.892 22.362 24.736 27.688 29.819
14 4.075 4.660 5.629 6.571 23.685 26.119 29.141 31.319
15 4.601 5.229 6.262 7.261 24.996 27.488 30.578 32.801
16 5.142 5.812 6.908 7.962 26.296 28.845 32.000 34.267
17 5.697 6.408 7.564 8.672 27.587 30.191 33.409 35.718
18 6.265 7.015 8.231 9.390 28.869 31.526 34.805 37.156
19 6.844 7.633 8.907 10.117 30.144 32.852 36.191 38.582
20 7.434 8.260 9.591 10.851 31.410 34.170 37.566 39.997
21 8.034 8.897 10.283 11.591 32.671 35.479 38.932 41.401
22 8.643 9.542 10.982 12.338 33.924 36.781 40.289 42.796
23 9.260 10.196 11.689 13.091 35.172 38.076 41.638 44.181
24 9.886 10.856 12.401 13.848 36.415 39.364 42.980 45.559
25 10.520 11.524 13.120 14.611 37.652 40.646 44.314 46.928
26 11.160 12.198 13.844 15.379 38.885 41.923 45.642 48.290
27 11.808 12.879 14.573 16.151 40.113 43.195 46.963 49.645
28 12.461 13.565 15.308 16.928 41.337 44.461 48.278 50.993
29 13.121 14.256 16.047 17.708 42.557 45.722 49.588 52.336
30 13.787 14.953 16.791 18.493 43.773 46.979 50.892 53.672
31 14.458 15.655 17.539 19.281 44.985 48.232 52.191 55.003
32 15.134 16.362 18.291 20.072 46.194 49.480 53.486 56.328
33 15.815 17.074 19.047 20.867 47.400 50.725 54.776 57.648

Table A.4 Critical values of $D_n^{\alpha}$ at significance level α in the K-S test

d.o.f = n α = 0.20 α = 0.10 α = 0.05 α = 0.01


5 0.45 0.51 0.56 0.67
10 0.32 0.37 0.41 0.49
15 0.27 0.30 0.34 0.40
20 0.23 0.26 0.29 0.36
25 0.21 0.24 0.26 0.32
30 0.19 0.22 0.24 0.29
35 0.18 0.20 0.23 0.27
40 0.17 0.19 0.21 0.25
45 0.16 0.18 0.20 0.24
50 0.15 0.17 0.19 0.23
>50 1.07/√n 1.22/√n 1.36/√n 1.63/√n

Table A.5 Critical values of F test for a two-tailed test (P = 0.05)


V1
V2 1 2 3 4 5 6 7 8 9 10
1 647.79 799.50 864.16 899.58 921.85 937.11 948.22 956.66 963.28 968.63
2 38.51 39.00 39.17 39.25 39.30 39.33 39.36 39.37 39.39 39.40
3 17.44 16.04 15.44 15.10 14.88 14.73 14.62 14.54 14.47 14.42
4 12.22 10.65 9.98 9.60 9.36 9.20 9.07 8.98 8.90 8.84
5 10.01 8.43 7.76 7.39 7.15 6.98 6.85 6.76 6.68 6.62
6 8.81 7.26 6.60 6.23 5.99 5.82 5.70 5.60 5.52 5.46
7 8.07 6.54 5.89 5.52 5.29 5.12 4.99 4.90 4.82 4.76
8 7.57 6.06 5.42 5.05 4.82 4.65 4.53 4.43 4.36 4.30
9 7.21 5.71 5.08 4.72 4.48 4.32 4.20 4.10 4.03 3.96
10 6.94 5.46 4.83 4.47 4.24 4.07 3.95 3.85 3.78 3.72
11 6.72 5.26 4.63 4.28 4.04 3.88 3.76 3.66 3.59 3.53
12 6.55 5.10 4.47 4.12 3.89 3.73 3.61 3.51 3.44 3.37
13 6.41 4.97 4.35 4.00 3.77 3.60 3.48 3.39 3.31 3.25
14 6.30 4.86 4.24 3.89 3.66 3.50 3.38 3.29 3.21 3.15
15 6.20 4.77 4.15 3.80 3.58 3.41 3.29 3.20 3.12 3.06
16 6.12 4.69 4.08 3.73 3.50 3.34 3.22 3.12 3.05 2.99
17 6.04 4.62 4.01 3.66 3.44 3.28 3.16 3.06 2.98 2.92
