
OPIM102 Computer as an Analysis Tool Week 10: Data Analysis

Instructor:
Practice Associate Prof. Michelle CHEONG michcheong@smu.edu.sg Room: 80-04-036 Tel: 6828-0269

Copyright Michelle Cheong

Week 10

Curve Fitting vs Data Fitting


Curve Fitting
In Week 2, we learnt how to fit a curve to points on a graph to obtain the equation that best describes the points, judged by the R² value.
Plot the points on an X-Y scatter graph (each x value results in a y value).
Select the points and right-click to add a Trendline.
We can select linear, logarithmic, polynomial, etc.
The equations returned are of the form y = f(x), each with its R² value.

Data Fitting
Here, we are trying to fit a distribution to data points using the cumulative format.
Distributions are plots which describe how often a certain data point occurs, so they carry elements of both frequency and probability.
We need not know what x will give y.
Very often, we are only able to collect some data on certain events (e.g. arrival time, number of passengers).
We need to know which distribution best describes these data points, so that we can generate more data points to be used in simulation.
We learnt simulation assuming a known distribution in Weeks 5 and 6. Now, we will learn to find out which distribution to use.

Probability Functions
Random variables (r.v.) are either discrete or continuous
Discrete r.v. = Uniform, Binomial, Poisson
Continuous r.v. = Uniform, Exponential, Normal, and many more

Discrete Uniform Distribution
P(X = x) = probability mass function (PMF)
X = random variable
x = possible values of X
P(X = x) = probability of X being x

Continuous Uniform Distribution
f_X(x) = probability density function (PDF)
$P(a < X < b) = \int_a^b f_X(x)\,dx$
X = random variable
x = possible values of X
P(a < X < b) = probability of X lying between a and b

Probability Functions
[Figure: PMF and CDF of the Discrete Uniform Distribution, and PDF and CDF of the Continuous Uniform Distribution, each plotted against x from 1 to 6; both CDFs rise to 1.0]

Cumulatively, both discrete and continuous probability functions sum to 1.0 (the total probability). This running total is known as the cumulative distribution function (CDF).

Discrete Probability Functions


Binomial r.v. X is best described as counting the number of successes (e.g. heads) in n tosses. If the probability of a head (x = 1) is p, and the probability of a tail (x = 0) is 1 - p, then
$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, 2, \ldots, n$

Binomial is represented as Bin(n,p)


[Figure: Binomial Distribution for Bin(7, 0.5); vertical axis: Probability, P(X = k)]
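
As a minimal sketch of the binomial formula above, the following Python snippet tabulates P(X = k) for the Bin(7, 0.5) case shown on the slide. The function name `binomial_pmf` is an illustrative choice and not part of the course material.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Return P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 7, 0.5  # the Bin(7, 0.5) case from the slide
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Print the probability of each possible number of successes.
for k, prob in enumerate(pmf):
    print(f"P(X = {k}) = {prob:.4f}")
```

The probabilities over k = 0, ..., n add up to 1, which is the total probability.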


Discrete Probability Functions


Poisson r.v. X is best described as counting the number of events that can occur within a specific time, given the expected value λ.
$P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, 2, \ldots$
[Figure: Poisson Distribution with λ = 3; vertical axis: Probability, P(X = k); horizontal axis: k = 0, 1, 2, ..., 7, ...]
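
A short Python sketch of the Poisson probabilities plotted above, assuming an expected value of 3 as in the figure. The name `poisson_pmf` is illustrative only.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """Return P(X = k) for a Poisson r.v. with expected value lam."""
    return exp(-lam) * lam**k / factorial(k)


lam = 3.0  # expected number of events, as in the figure
pmf = [poisson_pmf(k, lam) for k in range(8)]  # k = 0, 1, ..., 7

for k, prob in enumerate(pmf):
    print(f"P(X = {k}) = {prob:.4f}")
```

Unlike the binomial case, k has no upper limit, so the eight probabilities listed here add up to slightly less than 1.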


Continuous Probability Functions


Exponential r.v. X is best described as the time interval between arrivals, given the expected value 1/λ.
$f_X(x) = \lambda e^{-\lambda x}$
λ = rate of arrivals
1/λ = average inter-arrival time

Example:
1. Probability that a light bulb will fail in 20 days is area A
2. Probability that a light bulb will fail in 10 days is area B

[Figure: exponential PDF f_X(x) plotted against x, with areas A and B marked and x = 10 and x = 20 on the horizontal axis]
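
The light bulb example can be sketched in Python by evaluating the exponential CDF, which is the area under the PDF from 0 to x. The slide does not state the rate, so the average life of 15 days below is a hypothetical value, and "fail in t days" is read as "fail within t days", which is also an assumption.

```python
from math import exp


def exponential_cdf(x: float, rate: float) -> float:
    """Return P(X <= x) for an exponential r.v. with the given rate."""
    return 1.0 - exp(-rate * x) if x > 0 else 0.0


mean_life_days = 15.0        # hypothetical average life; not given on the slide
rate = 1.0 / mean_life_days  # rate = 1 / mean

p_fail_within_10 = exponential_cdf(10, rate)  # area up to x = 10
p_fail_within_20 = exponential_cdf(20, rate)  # area up to x = 20

print(f"P(fail within 10 days) = {p_fail_within_10:.4f}")
print(f"P(fail within 20 days) = {p_fail_within_20:.4f}")
```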

Continuous Probability Functions


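Normal r.v. X is best described as the natural, random occurrence of errors. A Normal distribution can be standardized to the standard form with mean 0 and variance 1.
$X \sim N(\mu, \sigma^2) \;\Rightarrow\; Z \sim N(0, 1)$
Standard Normal: $Z = (x - \mu)/\sigma$
[Figure: Normal PDF f_X(x), with the standardized curve centred at 0]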
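
To illustrate standardization, this sketch checks that the cumulative probability of x under N(mean, variance) equals the cumulative probability of z = (x - mean)/SD under the standard Normal. It uses `statistics.NormalDist` from the Python standard library; the mean, standard deviation and x values are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0  # hypothetical mean and standard deviation
x = 120.0                # hypothetical value of interest

z = (x - mu) / sigma     # standardize x

p_direct = NormalDist(mu, sigma).cdf(x)  # P(X <= x) under N(mu, sigma^2)
p_standard = NormalDist().cdf(z)         # P(Z <= z) under N(0, 1)

print(f"z = {z:.4f}")
print(f"P(X <= x) = {p_direct:.6f}, P(Z <= z) = {p_standard:.6f}")
```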



Excel Functions for Probability Distribution


Binomial
BINOMDIST(number_s, trials, probability_s, cumulative)
number_s = number of successes
trials = number of trials
probability_s = probability of success
cumulative = TRUE returns CDF, FALSE returns PMF

Poisson
POISSON(x, mean, cumulative)
x = number of events
mean = expected value
cumulative = TRUE returns CDF, FALSE returns PMF


Excel Functions for Probability Distribution


Exponential
EXPONDIST(x, 1/mean, cumulative)
x = value of interest
mean = average inter-arrival time, where 1/mean = rate of arrival
cumulative = TRUE returns CDF, FALSE returns PDF

Normal
NORMSDIST(z) = standard normal
z = value of interest
Returns only the CDF

NORMDIST(x, mean, standard_dev, cumulative)


x = value of interest
mean = mean value
standard_dev = standard deviation
cumulative = TRUE returns CDF, FALSE returns PDF

Fitting data to known distribution


Raw data
Relative frequency = frequency / total count
Cumulative relative frequency (CRFdata) = cumulative frequency / total count
Example: 1 person eats 1 bowl of pasta, 2 persons eat 2 bowls, and 1 person eats 3 bowls (total count = 4)

# of bowls of pasta    1      2      3
Relative frequency     1/4    2/4    1/4
CRF                    1/4    3/4    1

Distribution
Probability = relative frequency
Cumulative probability (CDF) = cumulative relative frequency (CRFdistb)

Fitting data to known distribution


Thus, to fit raw data to a known distribution, we try to match CRFdata to CRFdistb.

Raw data
Relative frequency
Cumulative relative frequency (CRFdata)

Probability Function
Probability
Cumulative probability (CDF or CRFdistb)

We match by minimizing the maximum absolute deviation (MAD), which is the largest gap between the cumulative relative frequency of a given data set (CRFdata) and that of its fitted statistical distribution (CRFdistb).
The best-fitted distribution is the one with the smallest MAD.
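
The MAD itself is a one-line calculation. The sketch below assumes the two cumulative columns are already available as equal-length lists; the numbers are hypothetical and only show the mechanics.

```python
def max_abs_deviation(crf_data, crf_distb):
    """Return the largest absolute gap between two cumulative columns."""
    if len(crf_data) != len(crf_distb):
        raise ValueError("both columns must have the same number of rows")
    return max(abs(d - f) for d, f in zip(crf_data, crf_distb))


# Hypothetical values, one pair per data point.
crf_data = [0.25, 0.50, 0.75, 1.00]
crf_distb = [0.20, 0.55, 0.85, 0.97]

mad = max_abs_deviation(crf_data, crf_distb)
print(f"MAD = {mad:.2f}")  # gaps are 0.05, 0.05, 0.10, 0.03
```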


What are we trying to do here?


[Figure: CDF or CRF (rising to 1.0) plotted for CRFdata and for two candidate curves, CRFdistb1 (mean1, SD1) and CRFdistb2 (mean2, SD2), assuming a Normal curve; the largest gaps to CRFdata are marked MAD1 and MAD2]

We are trying to minimize the largest gap between the two curves by setting the parameters that best describe the distribution, so that the distribution fits the data in the best way. At every iteration, as we change the parameters for CRFdistb, the position of the MAD changes. We stop when we get the smallest MAD.

Analogy to Curve Fitting


[Figure: data points on X-Y axes with two candidate lines, Y = m1X + C1 and Y = m2X + C2; label: Minimize the squared error]

As we minimize the squared error between the points and the line, the parameters that describe the line change. We stop when the squared error is at its smallest, which gives the parameters that best describe the line.

Steps
1. Distribution functions are usually defined by 4 parameters: max, min, mean and standard deviation. Compute these parameters from the raw data.
2. Sort the raw data in ascending order.
3. Compute the cumulative relative frequency (CRFdata) of the data.
4. Compute the cumulative relative frequency (CRFdistb) of a known distribution, using the raw data and arbitrary initial input parameters for the distribution.
   For discrete data, we can use uniform, binomial, Poisson.
   For continuous data, we can use uniform, exponential, normal.
5. Compute the MAD = max(abs(CRFdata - CRFdistb)).
6. Use Solver to minimize the MAD. The minimization process changes the parameters of the distribution to get the best-fit distribution.
7. Repeat steps 4 to 6 for another distribution to get its MAD.
8. Select the distribution with the smallest MAD.
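
The steps can be sketched end to end in Python for one distribution. This is only an illustration under stated assumptions: it fits a Normal distribution to the continuous data set from the Step 3 slide, and a coarse grid search stands in for Excel's Solver (which uses a different search method). The step size of 0.02 and the grid width are arbitrary choices.

```python
from statistics import NormalDist, mean, stdev

# Steps 1-2: raw data, sorted, with starting parameters computed from it.
data = sorted([1.01, 1.25, 1.33, 2.10, 2.88, 3.11, 3.78, 4.29, 4.56, 4.78])
n = len(data)
start_mu, start_sigma = mean(data), stdev(data)

# Step 3: CRF of the raw data (each value occurs once).
crf_data = [(i + 1) / n for i in range(n)]


def mad_normal(mu: float, sigma: float) -> float:
    """Steps 4-5: CRF of a Normal distribution, then the MAD against the data."""
    dist = NormalDist(mu, sigma)
    return max(abs(c - dist.cdf(x)) for x, c in zip(data, crf_data))


start_mad = mad_normal(start_mu, start_sigma)

# Step 6: grid search around the starting parameters (stand-in for Solver).
best_mad, best_mu, best_sigma = start_mad, start_mu, start_sigma
for i in range(-50, 51):
    for j in range(-50, 51):
        mu = start_mu + 0.02 * i
        sigma = start_sigma + 0.02 * j
        if sigma <= 0:
            continue  # a standard deviation must be positive
        candidate = mad_normal(mu, sigma)
        if candidate < best_mad:
            best_mad, best_mu, best_sigma = candidate, mu, sigma

print(f"start: mean={start_mu:.3f}, SD={start_sigma:.3f}, MAD={start_mad:.4f}")
print(f"best:  mean={best_mu:.3f}, SD={best_sigma:.3f}, MAD={best_mad:.4f}")
```

Steps 7 and 8 would repeat the same search with another distribution's CDF in place of `NormalDist.cdf` and then compare the resulting MAD values.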

Step 3: CRF of raw data


Discrete data
Raw values: 1, 2, 2, 2, 3, 3, 4, 4, 4, 4

Data    Freq    Cum Freq    CRF
1       1       1           1/10
2       3       4           4/10
3       2       6           6/10
4       4       10          10/10

Continuous data

Data    Freq    Cum Freq    CRF
1.01    1       1           1/10
1.25    1       2           2/10
1.33    1       3           3/10
2.10    1       4           4/10
2.88    1       5           5/10
3.11    1       6           6/10
3.78    1       7           7/10
4.29    1       8           8/10
4.56    1       9           9/10
4.78    1       10          10/10

Total Count = 10 each
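
The two tables can be reproduced with a few lines of Python. This is a sketch using the data from the slide; the variable names are illustrative.

```python
from collections import Counter
from itertools import accumulate

# Discrete data: repeated values are grouped before accumulating.
discrete = [1, 2, 2, 2, 3, 3, 4, 4, 4, 4]
counts = Counter(discrete)
values = sorted(counts)
freq = [counts[v] for v in values]
cum_freq = list(accumulate(freq))
crf_discrete = [c / len(discrete) for c in cum_freq]

# Continuous data: every value occurs once, so the CRF rises by 1/n per row.
continuous = sorted([1.01, 1.25, 1.33, 2.10, 2.88, 3.11, 3.78, 4.29, 4.56, 4.78])
crf_continuous = [(i + 1) / len(continuous) for i in range(len(continuous))]

for v, f, cf, crf in zip(values, freq, cum_freq, crf_discrete):
    print(v, f, cf, crf)
```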



Step 4: CRF of Distribution (Continuous)

All input parameters in RED are arbitrary values until solved by Solver.

Uniform
CRF = (x - min) / (max - min), for min ≤ x ≤ max
x = data point

Normal
NORMDIST(x, mean, standard_dev, cumulative)
With cumulative = TRUE: NORMDIST(x, mean, standard_dev, TRUE)

Exponential
EXPONDIST(x, 1/mean, cumulative)
With cumulative = TRUE: EXPONDIST(x, 1/mean, TRUE)
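
As a rough Python counterpart to the three spreadsheet columns, the sketch below evaluates the uniform, Normal and exponential CDFs at each data point. The starting parameters are arbitrary, in the same spirit as the RED cells, and the helper names are illustrative rather than Excel functions.

```python
from math import exp
from statistics import NormalDist

data = [1.01, 1.25, 1.33, 2.10, 2.88, 3.11, 3.78, 4.29, 4.56, 4.78]


def uniform_cdf(x: float, lo: float, hi: float) -> float:
    """(x - min) / (max - min), clipped to the range 0..1."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


def exponential_cdf(x: float, mean: float) -> float:
    """Cumulative exponential probability with rate 1/mean."""
    return 1.0 - exp(-x / mean) if x > 0 else 0.0


# Arbitrary starting parameters, to be improved later by minimization.
lo, hi = 1.0, 5.0
normal_mean, normal_sd = 3.0, 1.5
expon_mean = 3.0

crf_uniform = [uniform_cdf(x, lo, hi) for x in data]
crf_normal = [NormalDist(normal_mean, normal_sd).cdf(x) for x in data]
crf_expon = [exponential_cdf(x, expon_mean) for x in data]

for x, u, nrm, e in zip(data, crf_uniform, crf_normal, crf_expon):
    print(f"{x:5.2f}  {u:.4f}  {nrm:.4f}  {e:.4f}")
```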

Step 4: CRF of Distribution (Discrete)

All input parameters in RED are arbitrary values until solved by Solver.

Uniform
CRF = (x - min) / (max - min), for min ≤ x ≤ max
x = data point

Binomial
BINOMDIST(number_s, trials, probability_s, cumulative)
With cumulative = TRUE: BINOMDIST(x, trials, probability_s, TRUE)

Poisson
POISSON(x, mean, cumulative)
With cumulative = TRUE: POISSON(x, mean, TRUE)
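
For the discrete case, a cumulative probability is the sum of the PMF up to x. The sketch below applies this to the discrete data from the Step 3 slide and compares the MAD of a binomial and a Poisson candidate. The parameters (trials = 4, p = 0.7, mean = 3) are arbitrary starting values, not fitted ones, so the comparison only illustrates the mechanics.

```python
from math import comb, exp, factorial

values = [1, 2, 3, 4]            # distinct data values from the Step 3 slide
crf_data = [0.1, 0.4, 0.6, 1.0]  # their cumulative relative frequencies


def binomial_cdf(x: int, trials: int, p: float) -> float:
    """P(X <= x) for X ~ Bin(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p) ** (trials - k)
               for k in range(x + 1))


def poisson_cdf(x: int, mean: float) -> float:
    """P(X <= x) for a Poisson r.v. with the given mean."""
    return sum(exp(-mean) * mean**k / factorial(k) for k in range(x + 1))


# Arbitrary starting parameters (not fitted).
trials, p = 4, 0.7
poisson_mean = 3.0

mad_binomial = max(abs(d - binomial_cdf(x, trials, p))
                   for x, d in zip(values, crf_data))
mad_poisson = max(abs(d - poisson_cdf(x, poisson_mean))
                  for x, d in zip(values, crf_data))

# Pick the candidate with the smaller MAD.
best_name = min([("Binomial", mad_binomial), ("Poisson", mad_poisson)],
                key=lambda pair: pair[1])[0]
print(f"MAD binomial = {mad_binomial:.4f}, MAD Poisson = {mad_poisson:.4f}")
print("Smaller MAD at these parameters:", best_name)
```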

Hotel Apex
Fit a Normal distribution to room sales data so as to estimate the number of rooms to keep open during the refurbishment period to satisfy 70% of demand.
Because of the maximum capacity of 150 rooms, the data given does not represent real demand: any demand exceeding 150 results in only 150 rooms being sold. The data given is therefore sales data and NOT demand data.
So, can we infer real demand data from sales data?
This exercise shows the importance of understanding the data collected and the power of data inference.
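
Once a Normal distribution has been fitted, one way to read "satisfy 70% of demand" is to take the 70th percentile of the fitted demand distribution. The sketch below does this with `statistics.NormalDist`; the fitted mean and standard deviation are hypothetical placeholders, since the exercise data is not reproduced here, and the percentile reading is an assumption rather than the exercise's stated solution.

```python
from math import ceil
from statistics import NormalDist

# Hypothetical fitted parameters; the real values come from the fitting exercise.
fitted_mean, fitted_sd = 140.0, 20.0
service_level = 0.70
capacity = 150

demand = NormalDist(fitted_mean, fitted_sd)

# Smallest whole number of rooms whose cumulative probability reaches 70%.
rooms_needed = ceil(demand.inv_cdf(service_level))

# Share of days on which inferred demand would exceed the 150-room capacity.
share_above_capacity = 1.0 - demand.cdf(capacity)

print("Rooms to keep open:", rooms_needed)
print(f"P(demand > {capacity}) = {share_above_capacity:.4f}")
```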


Yankee Fruits
Help Paul decide how many melons he has to purchase weekly to satisfy different service levels.
Use 4 methods:
1. Frequency bins & Lookup() function
2. CRF of raw data & Lookup() function
3. Percentile() function
4. NORMINV() function, assuming demand is Normal
Which method is the best? It depends on the situation.

Shifting columns for Lookup() Function


Lookup() finds the largest value in the lookup_vector that is less than or equal to lookup_value, and returns the corresponding result_value from the result_vector.
In the Yankee Fruits exercise, we need to determine the minimum number of fruits to buy in order to satisfy a certain percentage of customers, as defined by the service level. For example, a service level of 70% means that 70% of customers will be served from your inventory of fruits. The higher the service level, the more inventory you should keep.
From the table, 0.7 lies between 0.57 and 0.83. To satisfy 70% of customers, we need at least 650 fruits rather than 600.
If we do not shift columns, then for a 70% service level Lookup() will return 600, which corresponds to a CRF of 0.57. Thus, we need 0.57 to correspond to the next result value, which is 650, to get the correct result.
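
The effect of shifting the column can be sketched in Python with a small stand-in for Lookup() built on `bisect`. Only the pairs 0.57 with 600 and 0.83 with 650 come from the slide; the first and last rows of the table are hypothetical fillers, and `lookup` is an illustrative helper, not Excel's function.

```python
from bisect import bisect_right


def lookup(lookup_value, lookup_vector, result_vector):
    """Return the result paired with the largest lookup entry <= lookup_value."""
    i = bisect_right(lookup_vector, lookup_value) - 1
    if i < 0:
        raise ValueError("lookup_value is smaller than every entry")
    return result_vector[i]


# CRF per quantity; the 0.30 and 1.00 rows are hypothetical.
crf = [0.30, 0.57, 0.83, 1.00]
quantity = [550, 600, 650, 700]

# Without shifting: 0.70 matches 0.57, which returns 600 (too few fruits).
unshifted = lookup(0.70, crf, quantity)

# With shifting: each CRF moves down one row, so 0.57 now pairs with 650.
shifted_crf = [0.0] + crf[:-1]
shifted = lookup(0.70, shifted_crf, quantity)

print("Unshifted:", unshifted)  # 600
print("Shifted:  ", shifted)    # 650
```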

Yankee Fruits
Method 1: Uses the Lookup() function with frequency bins. This method is coarse because the bins consolidate data into intervals, so the answer is also in terms of large intervals. However, if the order size is in multiples of 50, then this answer is appropriate.
Method 2: Uses the Lookup() function with the CRF of raw data. This method is finer than Method 1, but the answers are given to the single unit.
Method 3: Uses Percentile(). This method interpolates and generates new data points which may not be integers. Is such fine data needed?
Method 4: Uses the Norminv() function. This method assumes demand follows a Normal distribution. For service levels >= 0.95, the number of melons to order is larger than the largest raw data value of 716, so it seems to over-estimate.
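
Methods 3 and 4 can be contrasted in a short Python sketch. The demand figures below are hypothetical (only the maximum of 716 echoes the slide), the `percentile` helper uses the inclusive linear-interpolation rule that Excel's Percentile() is understood to follow, and `NormalDist.inv_cdf` plays the role of Norminv().

```python
from statistics import NormalDist, mean, stdev

# Hypothetical weekly melon demand, sorted; only the maximum of 716 is from the slide.
demand = [520, 560, 585, 600, 610, 630, 645, 660, 690, 716]


def percentile(sorted_data, p):
    """Interpolated percentile at position p * (n - 1), for 0 <= p <= 1."""
    pos = p * (len(sorted_data) - 1)
    lower = int(pos)
    frac = pos - lower
    if frac == 0:
        return float(sorted_data[lower])
    return sorted_data[lower] + frac * (sorted_data[lower + 1] - sorted_data[lower])


service_level = 0.95

# Method 3: interpolate within the observed data (never exceeds the maximum).
order_percentile = percentile(demand, service_level)

# Method 4: assume Normal demand with the sample mean and standard deviation.
order_normal = NormalDist(mean(demand), stdev(demand)).inv_cdf(service_level)

print(f"Percentile method: {order_percentile:.1f}")
print(f"Normal method:     {order_normal:.1f}")
```

With this hypothetical data the Normal method also lands above the largest observation, because the Normal curve has a tail that extends beyond the observed range.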