
STATISTICAL CONCEPTS IN MODELING & SIMULATION

INTRODUCTION

People use the term probability many times each day. For example, a
physician may say that a patient has a 50-50 chance of surviving a
certain operation. Another physician may say that she is 95% certain
that a patient has a particular disease.
Definition

If an event can occur in N mutually exclusive and equally likely ways,
and if m of these possess a trait, E, the probability of the
occurrence of E is given by

P(E) = m/N
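
As a small illustration of the m/N rule, the sketch below counts favorable outcomes for one roll of a fair six-sided die. The die, the "even" event, and the helper name `classical_probability` are assumptions chosen for the example, not part of the original slides.

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """Return P(E) = m/N when all outcomes are equally likely."""
    outcomes = set(sample_space)
    favorable = outcomes & set(event)  # the m outcomes that possess the trait
    return Fraction(len(favorable), len(outcomes))

# Illustrative assumption: one roll of a fair six-sided die.
die = range(1, 7)
even = {2, 4, 6}
p_even = classical_probability(even, die)
print(p_even)  # 1/2
```

The rule only applies when the N outcomes really are equally likely; a loaded die would need a different approach.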
DEFINITION

Experiment ==> any planned process of data collection. It consists of
a number of trials (replications) under the same conditions.
Definition
Sample space: the collection of unique, non-overlapping possible
outcomes of a random circumstance.

Simple event: one outcome in the sample space; a possible outcome of a
random circumstance.

Event: a collection of one or more simple events in the sample space;
often written as A, B, C, and so on.

Male, Female
Complement ==> sometimes we want to know the probability that an event
will not happen; the event opposite to the event of interest is called
the complementary event.

If A is an event, its complement is written A^C or Ā.
Example: the complement of the event "male" is the event "female".

The probabilities of an event and its complement sum to one:

P(A) + P(A^C) = 1
Views of Probability:

Subjective:

It is an estimate that reflects a person's opinion, or best guess,
about whether an outcome will occur.

Important in medicine ==> it forms the basis of a physician's opinion
(based on information gained in the history and physical examination)
about whether a patient has a specific disease. Such an estimate can be
changed by the results of diagnostic procedures.
Objective
Classical
It is well known that the probability of flipping a fair coin and
getting a "tail" is 0.50.
If a coin is flipped 10 times, is there a guarantee that exactly 5
tails will be observed? No.
What if the coin is flipped 100 times? 1000 times? There is still no
guarantee, but as the number of flips becomes larger, the proportion of
coin flips that result in tails approaches 0.50.
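
The long-run behaviour described above can be sketched with a short simulation. This is only an illustration: the seed, the flip counts, and the function name `tail_proportion` are arbitrary choices, and the exact proportions depend on the random number generator.

```python
import random

def tail_proportion(n_flips, rng):
    """Simulate n_flips of a fair coin and return the share that are tails."""
    tails = sum(1 for _ in range(n_flips) if rng.random() < 0.5)
    return tails / n_flips

rng = random.Random(0)  # fixed seed (an arbitrary choice) for repeatability
for n in (10, 100, 1000, 100_000):
    print(n, tail_proportion(n, rng))
```

Small runs can land far from 0.50, while the proportion for the largest run should sit close to it.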
THE MEANING OF A VARIABLE
A variable is any quantity that may take on more than one value.
• Population is a variable because it is not fixed or constant; it
changes over time.
• The unemployment rate is a variable because it may take on any value
from 0 to 100%.
A random variable can be thought of as an unknown value that may change
every time it is inspected.
THE MEANING OF A VARIABLE

• A random variable may be either discrete or continuous.
• A variable is discrete if its possible values have jumps or breaks.
  e.g. Population, measured in integers or whole units: 1, 2, 3, …
• A variable is continuous if its possible values have no jumps or
breaks.
  e.g. Unemployment rate, which need not be measured in whole units:
  1.77, …, 8.99, …
DESCRIPTIVE STATISTICS
• Descriptive statistics are used to describe the main features of a
collection of data in quantitative terms; they aim to summarize a data
set quantitatively.
• Some statistical summaries are especially common in descriptive
analyses, for example:
  ◦ Frequency Distribution
  ◦ Central Tendency
  ◦ Dispersion
  ◦ Association
FREQUENCY DISTRIBUTION

• Every set of data can be described in terms of how frequently certain
values occur.
• In statistics, a frequency distribution is a tabulation of the values
that one or more variables take in a sample.
• Consider the hypothetical prices of Dec CME Live Cattle Futures:

Month        Price (cents/lb)
May          67.05
June         66.89
July         67.45
August       68.39
September    67.45
October      70.10
November     68.39
FREQUENCY DISTRIBUTION
• Univariate frequency distributions are often presented as lists
ordered by quantity, showing the number of times each value appears.
• A frequency distribution may be grouped or ungrouped:
  ◦ For a small number of observations: ungrouped frequency distribution
  ◦ For a large number of observations: grouped frequency distribution

Ungrouped
Price (X)      Frequency
66.89          1
67.05          1
67.45          2
68.39          2
70.10          1

Grouped
Price (X)      Frequency
65.00-66.99    1
67.00-68.99    5
69.00-70.99    1
71.00-72.99    0
73.00-74.99    0
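
Both kinds of tabulation can be sketched in a few lines. The class boundaries (width 2.00, starting at 65.00) follow the grouped table; the variable names are illustrative assumptions.

```python
from collections import Counter

# Monthly prices (cents/lb) from the example, May through November.
prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]

# Ungrouped: count how often each distinct price occurs.
ungrouped = Counter(prices)

# Grouped: classes 65.00-66.99, 67.00-68.99, ..., 73.00-74.99.
lower, width, n_classes = 65.0, 2.0, 5
grouped = [0] * n_classes
for p in prices:
    grouped[int((p - lower) // width)] += 1

for price in sorted(ungrouped):
    print(f"{price:.2f}  {ungrouped[price]}")
for k, count in enumerate(grouped):
    lo = lower + k * width
    print(f"{lo:.2f}-{lo + width - 0.01:.2f}  {count}")
```

The frequencies in each tabulation should add up to the number of observations, which is a quick check on any hand-built table.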
CENTRAL TENDENCY
• In statistics, the term central tendency relates to the way in which
quantitative data tend to cluster around a "central value".
• A measure of central tendency is any of a number of ways of
specifying this "central value".
• There are three important descriptive statistics that give measures
of the central tendency of a variable:
  ◦ The Mean
  ◦ The Median
  ◦ The Mode
THE MEAN
• The arithmetic mean is the most commonly used type of average and is
often referred to simply as the average.
• In mathematics and statistics, the arithmetic mean (or simply the
mean) of a list of numbers is the sum of all numbers in the list
divided by the number of items in the list.
• If the list is a statistical population, then the mean of that
population is called a population mean.
• If the list is a statistical sample, we call the resulting statistic
a sample mean.
• If we denote a set of data by X = (x1, x2, ..., xn), then the sample
mean is typically denoted with a horizontal bar over the variable
(X̄, pronounced "x bar").
• The Greek letter μ is used to denote the arithmetic mean of an entire
population.
THE SAMPLE MEAN

• In mathematical notation, the sample mean of a set of data denoted as
X = (x1, x2, ..., xn) is given by

\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{n}\,(X_1 + X_2 + \dots + X_n) \]

• To calculate the mean, all of the observations (values) of X are
added and the result is divided by the number of observations (n).
• In the previous example, the mean price of the Dec CME Live Cattle
futures contract is

\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{7}\,(67.05 + 66.89 + \dots + 68.39) = 67.96 \]
THE MEDIAN

• In statistics, a median is described as the numeric value separating
the higher half of a sample or population from the lower half.
• The median of a finite list of numbers can be found by arranging all
the observations from lowest value to highest value and picking the
middle one.
• If there is an even number of observations, then there is no single
middle value, so one often takes the mean of the two middle values.
• Organize the price data in the previous example in ascending order:
  66.89, 67.05, 67.45, 67.45, 68.39, 68.39, 70.10
• The median of this price series is the fourth of the seven values,
67.45.
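
The sort-and-pick-the-middle procedure can be sketched as follows. The helper name `median_by_sorting` is an assumption for illustration; the standard library's `statistics.median` is shown alongside as a cross-check.

```python
from statistics import median

def median_by_sorting(values):
    """Sort, then take the middle value (or the mean of the two middle values)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]
print(median_by_sorting(prices), median(prices))  # 67.45 67.45
```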
THE MODE

• In statistics, the mode is the value that occurs most frequently in a
data set.
• The mode is not necessarily unique, since the same maximum frequency
may be attained at different values.
• Organize the price data in the previous example in ascending order:
  66.89, 67.05, 67.45, 67.45, 68.39, 68.39, 70.10
• There are two modes in the given price data: 67.45 and 68.39.
• Thus the mode of the sample data is not unique.
• The sample price dataset may be said to be bimodal.
• A population or sample may be unimodal, bimodal, or multimodal.
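
Because the mode need not be unique, a function that returns every tied value is convenient. The sketch below uses `statistics.multimode` (Python 3.8 or later) on the price data.

```python
from statistics import multimode  # available in Python 3.8+

prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]
modes = multimode(prices)  # every value tied for the highest frequency
print(modes)  # [67.45, 68.39]
```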
STATISTICAL DISPERSION

• In statistics, statistical dispersion (also called statistical
variability or variation) is the variability or spread in a variable or
probability distribution.
• In particular, a measure of dispersion is a statistic (formula) that
indicates how dispersed (i.e., spread out) the values of a given
variable are.
• Common measures of statistical dispersion are
  ◦ The Variance, and
  ◦ The Standard Deviation
• Dispersion is contrasted with location or central tendency, and
together they are the most used properties of distributions.
THE VARIANCE

• In statistics, the variance of a random variable or distribution is
the expected (mean) value of the square of the deviation of that
variable from its expected value or mean.
• Thus the variance is a measure of the amount of variation within the
values of that variable, taking account of all possible values and
their probabilities.
• If a random variable X has the expected (mean) value E[X] = μ, then
the variance of X is given by:

\[ \mathrm{Var}(X) = E[(X - \mu)^2] = \sigma_x^2 \]
THE VARIANCE

• The above definition of variance encompasses random variables that
are discrete or continuous. It can be expanded as follows:

\[
\begin{aligned}
\mathrm{Var}(X) &= E[(X - \mu)^2] \\
&= E[X^2 - 2\mu X + \mu^2] \\
&= E[X^2] - 2\mu E[X] + \mu^2 \\
&= E[X^2] - 2\mu^2 + \mu^2 \\
&= E[X^2] - \mu^2 \\
&= E[X^2] - (E[X])^2
\end{aligned}
\]
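
The two ends of this expansion can be compared on a small discrete distribution. The fair six-sided die below is an assumed example, and exact fractions are used so the two forms can be compared without rounding.

```python
from fractions import Fraction

# Assumed example: X is the face shown by one fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mu = sum(x * p for x, p in pmf.items())                   # E[X]
var_def = sum((x - mu) ** 2 * p for x, p in pmf.items())  # E[(X - mu)^2]
e_x2 = sum(x ** 2 * p for x, p in pmf.items())            # E[X^2]
var_short = e_x2 - mu ** 2                                # E[X^2] - (E[X])^2

print(mu, var_def, var_short)  # 7/2 35/12 35/12
```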
THE VARIANCE: PROPERTIES

• Variance is non-negative because the squared deviations are positive
or zero.
• The variance of a constant a is zero, and the variance of a variable
in a data set is 0 if and only if all entries have the same value.
\[ \mathrm{Var}(a) = 0 \]
• Variance is invariant with respect to changes in a location
parameter. That is, if a constant is added to all values of the
variable, the variance is unchanged.
\[ \mathrm{Var}(X + a) = \mathrm{Var}(X) \]
• If all values are scaled by a constant, the variance is scaled by the
square of that constant.
\[ \mathrm{Var}(aX) = a^2\,\mathrm{Var}(X) \]
\[ \mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X) \]
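
These properties can be checked numerically on the price data. The constants a = 3 and b = 10 are arbitrary illustrative choices, and `statistics.pvariance` (which divides by n) stands in for Var; the same relations hold with the n − 1 divisor.

```python
from math import isclose
from statistics import pvariance

x = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]
a, b = 3.0, 10.0  # arbitrary constants chosen for illustration

shifted = [xi + b for xi in x]
scaled = [a * xi for xi in x]
both = [a * xi + b for xi in x]

print(isclose(pvariance(shifted), pvariance(x)))        # Var(X + b) = Var(X)
print(isclose(pvariance(scaled), a**2 * pvariance(x)))  # Var(aX) = a^2 Var(X)
print(isclose(pvariance(both), a**2 * pvariance(x)))    # Var(aX + b) = a^2 Var(X)
print(pvariance([5.0, 5.0, 5.0]))                       # constant data has zero variance
```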
THE SAMPLE VARIANCE

• If we have a series of n measurements of a random variable X, written
Xi where i = 1, 2, ..., n, then the sample variance can be used to
estimate the population variance of X = (x1, x2, ..., xn). The sample
variance is calculated as

\[ S_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{1}{n - 1}\left[(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_n - \bar{X})^2\right] \]

THE SAMPLE VARIANCE

• The denominator, (n-1), is known as the degrees of freedom in
calculating S_x^2. Intuitively, once X̄ is known, only n-1 observation
values are free to vary; the remaining one is predetermined by X̄.
• When n = 1, the deviation of the single observation from its own mean
is zero regardless of the true variance, so dividing the sum of squares
by n would underestimate the variance. Dividing by n-1 corrects this
bias, which matters most when n is small.

\[ S_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{1}{n - 1}\left[(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_n - \bar{X})^2\right] \]
THE SAMPLE VARIANCE

• For the hypothetical price data for the Dec CME Live Cattle futures
contract, 67.05, 66.89, 67.45, 67.45, 68.39, 68.39, 70.10, the sample
variance can be calculated as

\[ S_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{1}{7 - 1}\left[(67.05 - 67.96)^2 + \dots + (70.10 - 67.96)^2\right] = 1.24 \]
THE STANDARD DEVIATION

• In statistics, the standard deviation of a random variable or
distribution is the square root of its variance.
• If a random variable X has the expected value (mean) E[X] = μ, then
the standard deviation of X is given by:

\[ \sigma_x = \sqrt{\sigma_x^2} = \sqrt{E[(X - \mu)^2]} \]
• That is, the standard deviation σ (sigma) is the square root of the
average value of (X − μ)².
THE STANDARD DEVIATION

• If we have a series of n measurements of a random variable X, written
Xi where i = 1, 2, ..., n, then the sample standard deviation can be
used to estimate the population standard deviation of
X = (x1, x2, ..., xn). The sample standard deviation is the square root
of the sample variance; for the price data,

\[ S_x = \sqrt{S_x^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} = \sqrt{1.24} = 1.114 \]
THE MEAN ABSOLUTE DEVIATION

• The mean or average deviation of X from its mean,

\[ \frac{\sum d_i}{n} = \frac{\sum (X_i - \bar{X})}{n}, \]

is always zero. The positive and negative deviations cancel out in the
summation, which makes it a useless measure of dispersion.
• The mean absolute deviation (MAD), calculated by

\[ \frac{\sum |d_i|}{n} = \frac{\sum |X_i - \bar{X}|}{n}, \]

solves the "canceling out" problem.
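
The cancellation, and the way absolute values avoid it, can be seen on the price data. A minimal sketch with illustrative variable names:

```python
prices = [67.05, 66.89, 67.45, 68.39, 67.45, 70.10, 68.39]
n = len(prices)
x_bar = sum(prices) / n
deviations = [x - x_bar for x in prices]

mean_deviation = sum(deviations) / n       # cancels to zero (up to rounding)
mad = sum(abs(d) for d in deviations) / n  # mean absolute deviation

print(abs(round(mean_deviation, 10)), round(mad, 2))  # 0.0 0.86
```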
THE MSD AND RMSD

• An alternative way to address the canceling-out problem is to square
the deviations from the mean to obtain the mean squared deviation
(MSD):

\[ \mathrm{MSD} = \frac{\sum d_i^2}{n} = \frac{\sum (X_i - \bar{X})^2}{n} \]
• Squaring leaves the MSD in squared units of X. This can be solved by
taking the square root of the MSD to obtain the root mean squared
deviation (RMSD):

\[ \mathrm{RMSD} = \sqrt{\mathrm{MSD}} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}} \]
RMSD VS. STANDARD DEVIATION

• When calculating the RMSD, the squaring of the deviations gives
greater importance to the deviations that are larger in absolute value,
which may or may not be desirable.
• For statistical reasons, it turns out that a slight variation of the
RMSD, known as the standard deviation (S_x), is more desirable as a
measure of dispersion.

\[ \mathrm{RMSD} = \sqrt{\mathrm{MSD}} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}} \]

\[ S_x = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \]
VARIANCE VS. MSD
STANDARD DEVIATION VS. RMSD

Price (X)   Mean    (Xi − Mean)   |Xi − Mean|   (Xi − Mean)²
67.05       67.96   -0.91         0.91          0.83
66.89       67.96   -1.07         1.07          1.14
67.45       67.96   -0.51         0.51          0.26
68.39       67.96    0.43         0.43          0.18
67.45       67.96   -0.51         0.51          0.26
70.10       67.96    2.14         2.14          4.58
68.39       67.96    0.43         0.43          0.18
Total                0.00         6.00          7.44

MAD = 0.86
Variance = 1.24     MSD = 1.06
Std. Dev. = 1.11    RMSD = 1.03
ASSOCIATION

• Bivariate statistics can be used to examine the degree to which two
variables are related or associated, without implying that one causes
the other.
• Multivariate statistics can be used to examine the degree to which
multiple variables are related or associated, without implying that one
causes any or some of the others.
• Two common measures used in bivariate and multivariate statistics are
  ◦ Covariance
  ◦ Correlation Coefficient
Association: Bivariate Statistics
• In Figure 3.3 (a), Y and X are positively but weakly correlated,
while in Figure 3.3 (b) they are negatively and strongly correlated.
THE COVARIANCE

• The covariance between two real-valued random variables X and Y, with
means (expected values) E[X] = μ and E[Y] = ν, is

\[
\begin{aligned}
\mathrm{Cov}(X, Y) &= E[(X - E[X])(Y - E[Y])] = E[(X - \mu)(Y - \nu)] \\
&= E[XY - \mu Y - \nu X + \mu\nu] \\
&= E[XY] - \mu E[Y] - \nu E[X] + \mu\nu \\
&= E[XY] - \mu\nu - \mu\nu + \mu\nu \\
&= E[XY] - \mu\nu
\end{aligned}
\]
• Cov(X, Y) can be negative, zero, or positive.
• Random variables whose covariance is zero are called uncorrelated.
Independent random variables are always uncorrelated, but uncorrelated
random variables need not be independent.
COVARIANCE

• If X and Y are independent, then their covariance is zero. This
follows because under independence,

\[ E[XY] = E[X]\,E[Y] = \mu\nu \]

• Recalling the final form of the covariance derivation given above,
and substituting, we get

\[ \mathrm{Cov}(X, Y) = \mu\nu - \mu\nu = 0 \]

• The converse, however, is generally not true: some pairs of random
variables have covariance zero although they are not independent.
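
A standard counterexample makes the last point concrete. The distribution below (X equally likely to be −1, 0 or 1, with Y = X²) is an assumed illustration, not taken from the slides; exact fractions keep the arithmetic free of rounding.

```python
from fractions import Fraction

# Assumed example: X is -1, 0 or 1 with equal probability, and Y = X**2.
xs = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(x * p for x in xs)          # E[X]
e_y = sum(x**2 * p for x in xs)       # E[Y]
e_xy = sum(x * x**2 * p for x in xs)  # E[XY] = E[X^3]
cov = e_xy - e_x * e_y                # Cov(X, Y) = E[XY] - E[X]E[Y]

# Dependence: P(X = 0 and Y = 0) differs from P(X = 0) * P(Y = 0).
p_joint = p        # X = 0 forces Y = 0
p_product = p * p  # P(X = 0) = 1/3 and P(Y = 0) = 1/3

print(cov, p_joint, p_product)  # 0 1/3 1/9
```

Y is completely determined by X here, yet the covariance is zero because the relationship is not linear.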
THE COVARIANCE: PROPERTIES

• If X and Y are real-valued random variables and a and b are constants
("constant" in this context means non-random), then the following facts
are a consequence of the definition of covariance:

\[
\begin{aligned}
\mathrm{Cov}(X, a) &= 0 \\
\mathrm{Cov}(X, X) &= \mathrm{Var}(X) \\
\mathrm{Cov}(X, Y) &= \mathrm{Cov}(Y, X) \\
\mathrm{Cov}(aX, bY) &= ab\,\mathrm{Cov}(X, Y) \\
\mathrm{Cov}(X + a, Y + b) &= \mathrm{Cov}(X, Y)
\end{aligned}
\]
VARIANCE OF THE SUM OF CORRELATED
RANDOM VARIABLES

• If X and Y are real-valued random variables and a and b are constants
("constant" in this context means non-random), then the following facts
are a consequence of the definition of variance and covariance:

\[ \mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y) \]
\[ \mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y) \]

• The variance of a finite sum of uncorrelated random variables is
equal to the sum of their variances.

\[ \mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) \]

• This is because, if X and Y are uncorrelated, their covariance is 0.
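
The general formula can be checked numerically. The paired observations and the constants below are made up purely for illustration, and `sample_cov` is a hypothetical helper; the identity holds for sample statistics as long as variance and covariance use the same divisor.

```python
from math import isclose

def sample_cov(u, v):
    """Sample covariance with an n - 1 divisor; sample_cov(u, u) is the sample variance."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v)) / (n - 1)

# Made-up paired observations, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
a, b = 2.0, -3.0

combo = [a * xi + b * yi for xi, yi in zip(x, y)]
lhs = sample_cov(combo, combo)  # Var(aX + bY)
rhs = a**2 * sample_cov(x, x) + b**2 * sample_cov(y, y) + 2 * a * b * sample_cov(x, y)
print(lhs, rhs)  # 5.5 5.5
```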


THE SAMPLE COVARIANCE

• The covariance is one measure of how closely the values taken by two
variables X and Y vary together.
• If we have a series of n measurements of X and Y written as Xi and Yi,
where i = 1, 2, ..., n, then the sample covariance can be used to
estimate the population covariance between X = (X1, X2, …, Xn) and
Y = (Y1, Y2, …, Yn). The sample covariance is calculated as

\[ S_{x,y} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \]
CORRELATION COEFFICIENT

• A disadvantage of the covariance statistic is that its magnitude
cannot be easily interpreted, since it depends on the units in which we
measure X and Y.
• The related and more widely used correlation coefficient remedies
this disadvantage by standardizing the deviations from the mean:

\[ \rho_{x,y} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)}\,\sqrt{\mathrm{Var}(Y)}} = \frac{\sigma_{X,Y}}{\sigma_X\,\sigma_Y} \]

• The correlation coefficient is symmetric, that is,

\[ \rho_{x,y} = \rho_{y,x} \]
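
The unit dependence of covariance, and its absence in the correlation coefficient, can be illustrated by expressing the same variable in cents and in dollars. The data and the helper names `sample_cov` and `corr` are made up for the example.

```python
from math import isclose, sqrt

def sample_cov(u, v):
    """Sample covariance with an n - 1 divisor."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v)) / (n - 1)

def corr(u, v):
    """Covariance divided by the product of the two standard deviations."""
    return sample_cov(u, v) / sqrt(sample_cov(u, u) * sample_cov(v, v))

# Made-up data: the same X measured in cents and in dollars.
x_cents = [100.0, 200.0, 300.0, 400.0, 500.0]
x_dollars = [c / 100 for c in x_cents]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

print(sample_cov(x_cents, y), sample_cov(x_dollars, y))          # 150.0 1.5
print(round(corr(x_cents, y), 4), round(corr(x_dollars, y), 4))  # 0.7746 0.7746
```

Changing units rescales the covariance by a factor of 100 but leaves the correlation untouched.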
CORRELATION COEFFICIENT

• If we have a series of n measurements of X and Y written as Xi and Yi,
where i = 1, 2, ..., n, then the sample correlation coefficient can be
used to estimate the population correlation coefficient between X and
Y. The sample correlation coefficient is calculated as

\[ r_{x,y} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{(n - 1)\,S_x S_y} \]
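
The formula translates directly into code. In the sketch below the function name `sample_corr` and the data are illustrative assumptions; the second and third calls use exactly linear relationships to show the extreme values.

```python
from math import sqrt

def sample_corr(x, y):
    """r = sum((Xi - Xbar)(Yi - Ybar)) / ((n - 1) * Sx * Sy)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / ((n - 1) * s_x * s_y)

# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(round(sample_corr(x, [2.0, 4.0, 5.0, 4.0, 5.0]), 4))  # 0.7746
print(round(sample_corr(x, [2 * xi + 1 for xi in x]), 4))   # 1.0
print(round(sample_corr(x, [-xi for xi in x]), 4))          # -1.0
```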
CORRELATION COEFFICIENT

• The value of the correlation coefficient falls between −1 and 1:

\[ -1 \le r_{x,y} \le 1 \]

• r_{x,y} = 0 => X and Y are uncorrelated
• r_{x,y} = 1 => X and Y are perfectly positively correlated
• r_{x,y} = −1 => X and Y are perfectly negatively correlated