CAT Wk10

OPIM102 Computer as an Analysis Tool Week 10: Data Analysis
Instructor:
Practice Associate Prof. Michelle CHEONG michcheong@smu.edu.sg Room: 80-04-036 Tel: 6828-0269
Copyright Michelle Cheong
Week 10
Curve Fitting vs Data Fitting

Curve Fitting
In Week 2, we learnt how to fit a curve to points on a graph to obtain the equation that best describes the points, judged by the R² value.
Plot the points on an X-Y scatter graph, where each x value results in a y value
Select the points and right-click to add a Trendline
We can select linear, logarithmic, polynomial, etc.
The equations returned are of the form y = f(x), each with its R² value

Data Fitting
Here, we are trying to fit a distribution to data points using the cumulative format.
Distributions are plots which describe how often a certain data point occurs, so they carry elements of both frequency and probability.
We need not know what x will give y. Very often, we are only able to collect some data on certain events (e.g. arrival time, number of passengers).
We need to know which distribution best describes these data points, so that we can generate more data points to be used in simulation.
We learnt simulation assuming a known distribution in Weeks 5 and 6. Now, we will learn how to find out which distribution to use.
Probability Functions
Random variables (r.v.) are either discrete or continuous
Discrete r.v.: Uniform, Binomial, Poisson
Continuous r.v.: Uniform, Exponential, Normal, and many more

Discrete Uniform Distribution
$P(X = x)$ = probability mass function (PMF)
X = random variable
x = possible values of X
$P(X = x)$ = probability of X being x

Continuous Uniform Distribution
$f_X(x)$ = probability density function (PDF)
$P(a < X < b) = \int_a^b f_X(x)\,dx$
X = random variable
x = possible values of X
$P(a < X < b)$ = probability of X being between a and b
Probability Functions
[Figure: PMF and CDF of the Discrete Uniform Distribution for x = 1 to 6, and PDF and CDF of the Continuous Uniform Distribution over the same range; each CDF rises to 1.0]
Cumulatively, both discrete and continuous probability functions sum to 1.0, which is the total probability. This running total is known as the cumulative distribution function (CDF).
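
As a small illustration of the statement above, the sketch below accumulates the PMF of a fair six-sided die (the discrete uniform case with x = 1 to 6) into its CDF. The variable names are illustrative and are not taken from the course material.

```python
from fractions import Fraction
from itertools import accumulate

# Discrete uniform on x = 1..6: every outcome has probability 1/6.
outcomes = list(range(1, 7))
pmf = [Fraction(1, 6)] * len(outcomes)

# The CDF is the running total of the PMF.
cdf = list(accumulate(pmf))

for x, p, c in zip(outcomes, pmf, cdf):
    print(f"x={x}  P(X=x)={p}  P(X<=x)={c}")
```

The last CDF value is the total probability, which is why the CDF curve always ends at 1.0.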
Discrete Probability Functions

A Binomial r.v. X is best described as counting the number of successes (e.g. heads) in n tosses. If the probability of a head (x = 1) is p, and the probability of a tail (x = 0) is 1 - p, then
$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, 2, \ldots, n$
Binomial is represented as Bin(n, p)

[Figure: Binomial Distribution Bin(7, 0.5), probability P(X = k) plotted against k]
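
The Bin(7, 0.5) values shown on this slide can be reproduced with a few lines of Python. This is a minimal sketch of the PMF formula; the function name `binomial_pmf` is an illustrative choice, not something defined in the course.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Bin(n, p): nCk * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 7, 0.5  # the Bin(7, 0.5) case from the slide
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(f"P(X = {k}) = {prob:.4f}")
```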
Discrete Probability Functions

A Poisson r.v. X is best described as counting the number of events that can occur within a specific time, given the expected value $\lambda$
$P(X = k) = \dfrac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, 2, \ldots$

[Figure: Poisson Distribution with $\lambda = 3$, probability P(X = k) plotted against k = 0, 1, 2, ..., 7]
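
The Poisson probabilities for an expected value of 3 over k = 0 to 7 can be tabulated as follows. This is a minimal sketch of the PMF formula; `poisson_pmf` and `lam` are illustrative names.

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson r.v. with expected value lam."""
    return exp(-lam) * lam**k / factorial(k)

lam = 3  # expected number of events, as in the slide's plot
pmf = [poisson_pmf(k, lam) for k in range(8)]

for k, prob in enumerate(pmf):
    print(f"P(X = {k}) = {prob:.4f}")
```

Unlike the binomial case, k has no upper limit, so the eight values printed here sum to slightly less than 1.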
Continuous Probability Functions

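An Exponential r.v. X is best described as the time interval between arrivals, given the expected (average) inter-arrival time $1/\lambda$
$f_X(x) = \lambda e^{-\lambda x}, \quad x \ge 0$
$\lambda$ = rate of arrivals
$1/\lambda$ = average inter-arrival time

Example:
1. Probability that a light bulb will fail in 20 days is area A
2. Probability that a light bulb will fail in 10 days is area B

[Figure: exponential PDF $f_X(x)$ plotted against x, with x = 10 and x = 20 marked]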
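
The light bulb example can be worked through numerically. The sketch below reads "fail in 20 days" as "fail within 20 days", so each probability is the area under the PDF from 0 up to that day, which is the exponential CDF $1 - e^{-\lambda x}$. The mean lifetime of 30 days is a made-up value for illustration only; the slide does not give one.

```python
from math import exp

def exponential_cdf(x: float, rate: float) -> float:
    """P(X <= x) for an exponential r.v. with the given rate (lambda)."""
    return 1 - exp(-rate * x) if x >= 0 else 0.0

mean_life_days = 30.0      # hypothetical average lifetime, not from the slides
rate = 1 / mean_life_days  # lambda = 1 / mean

p_fail_10 = exponential_cdf(10, rate)  # area up to x = 10
p_fail_20 = exponential_cdf(20, rate)  # area up to x = 20

print(f"P(fail within 10 days) = {p_fail_10:.4f}")
print(f"P(fail within 20 days) = {p_fail_20:.4f}")
```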
Continuous Probability Functions

A Normal r.v. X is best described as the natural, random occurrence of errors. A Normal distribution can be standardized to the standard form with mean 0 and variance 1
$X \sim N(\mu, \sigma^2) \;\Rightarrow\; Z \sim N(0, 1)$
Standard Normal: $Z = (x - \mu)/\sigma$
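
To see that standardizing does not change the probability, the sketch below compares the CDF of a Normal distribution at x with the CDF of the standard Normal at the corresponding z. The mean, standard deviation and x value are made-up numbers for illustration.

```python
from statistics import NormalDist

mu, sigma = 100.0, 15.0  # hypothetical mean and standard deviation
x = 120.0                # hypothetical value of interest

z = (x - mu) / sigma     # standardize: Z = (x - mu) / sigma

p_original = NormalDist(mu, sigma).cdf(x)  # P(X <= x) under N(mu, sigma^2)
p_standard = NormalDist(0, 1).cdf(z)       # P(Z <= z) under N(0, 1)

print(f"z = {z:.4f}")
print(f"P(X <= {x}) = {p_original:.4f}, P(Z <= z) = {p_standard:.4f}")
```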

Excel Functions for Probability Distribution

Binomial
BINOMDIST(number_s, trials, probability_s, cumulative)
number_s = number of successes
trials = number of trials
probability_s = probability of success
cumulative = TRUE returns CDF, FALSE returns PMF
Poisson
POISSON(x, mean, cumulative)
x = number of events
mean = expected value
cumulative = TRUE returns CDF, FALSE returns PMF
Excel Functions for Probability Distribution

Exponential
EXPONDIST(x, 1/mean, cumulative)
x = value of interest
mean = average inter-arrival time, where 1/mean = rate of arrival
cumulative = TRUE returns CDF, FALSE returns PDF

Normal
NORMSDIST(z) = standard normal
z = value of interest
Returns only CDF
NORMDIST(x, mean, standard_dev, cumulative)
x = value of interest
mean = mean value
standard_dev = standard deviation
cumulative = TRUE returns CDF, FALSE returns PDF
Fitting data to known distribution

Raw data
Relative frequency = frequency / total count
Cumulative relative frequency (CRFdata) = cumulative frequency / total count
Example: 1 person eats 1 bowl of pasta, 2 persons eat 2 bowls, and 1 person eats 3 bowls

# of bowls of pasta    1      2      3
Relative frequency     1/4    2/4    1/4
CRF                    1/4    3/4    1

Distribution
Probability = relative frequency
Cumulative probability (CDF) = cumulative relative frequency (CRFdistb)
Fitting data to known distribution

Thus, to fit raw data to a known distribution, we try to match CRFdata to CRFdistb
Raw data: relative frequency -> cumulative relative frequency (CRFdata)
Probability function: probability -> cumulative probability (CDF or CRFdistb)
We match by minimizing the maximum absolute deviation (MAD), which is the largest gap between the cumulative relative frequency of a given data set (CRFdata) and that of its fitted statistical distribution (CRFdistb).
The best-fitted distribution is the one with the smallest MAD.
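
The MAD itself is a one-line calculation once both cumulative curves are evaluated at the same data points. The sketch below uses a made-up sample of five observations and an arbitrary Normal candidate, purely to show the mechanics; none of these numbers come from the course exercises.

```python
from statistics import NormalDist

def max_abs_deviation(crf_data, crf_distb):
    """Largest gap between two cumulative curves evaluated at the same points."""
    return max(abs(d - f) for d, f in zip(crf_data, crf_distb))

data = sorted([4.1, 5.0, 5.6, 6.2, 7.3])  # hypothetical observations
n = len(data)

# CRFdata: cumulative count up to each sorted point, divided by the total count.
crf_data = [(i + 1) / n for i in range(n)]

# CRFdistb: CDF of a candidate distribution with arbitrary starting parameters.
candidate = NormalDist(mu=5.5, sigma=1.2)
crf_distb = [candidate.cdf(x) for x in data]

mad = max_abs_deviation(crf_data, crf_distb)
print(f"MAD = {mad:.4f}")
```

Changing `mu` or `sigma` moves the candidate curve and therefore changes the MAD, which is what the fitting process exploits.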
What are we trying to do here?

[Figure: CDF or CRF (rising to 1.0) showing CRFdata against two candidate curves, CRFdistb1 (mean1, SD1) with gap MAD1 and CRFdistb2 (mean2, SD2) with gap MAD2; the candidate is assumed to be a normal curve]
We are trying to minimize the largest gap between the 2 curves by setting the parameters that best describe the distribution, so that the distribution fits the data in the best way. With every iteration, as we change the parameters for CRFdistb, the MAD changes position. We stop when we get the smallest MAD.
Analogy to Curve Fitting

[Figure: data points on X-Y axes with two candidate lines, Y = m1X + C1 and Y = m2X + C2; minimize the squared error]
As we minimize the squared error between the points and the line, the parameters that describe the line change. We stop when the squared error is at its smallest, which gives the parameters that best describe the line.
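
For a straight line, the parameters that minimize the squared error can be computed directly rather than by iteration. The sketch below uses the standard closed-form least-squares formulas on made-up points, then checks that nudging either parameter makes the squared error larger. The function names and data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept c for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

def squared_error(xs, ys, m, c):
    """Sum of squared vertical gaps between the points and the line."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]                 # hypothetical x values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]      # hypothetical y values

m, c = fit_line(xs, ys)
best_error = squared_error(xs, ys, m, c)
print(f"Y = {m:.3f}X + {c:.3f}, squared error = {best_error:.4f}")
print("error with a nudged slope:", squared_error(xs, ys, m + 0.1, c))
```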
Steps
1. Distribution functions are usually defined by 4 parameters: max, min, mean and standard deviation. Compute these parameters from the raw data.
2. Sort the raw data in ascending order.
3. Compute the cumulative relative frequency (CRFdata) of the data.
4. Compute the cumulative relative frequency (CRFdistb) of a known distribution using the raw data and initial arbitrary input parameters for the distribution.
   For discrete data, we can use uniform, binomial, Poisson.
   For continuous data, we can use uniform, exponential, normal.
5. Compute the MAD = max(abs(CRFdata - CRFdistb)).
6. Use Solver to minimize the MAD; the minimization process changes the parameters of the distribution to get the best-fit distribution.
7. Repeat steps 4 to 6 for another distribution to get its MAD.
8. Select the distribution with the smallest MAD.
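
Steps 1 to 6 can be sketched outside Excel as well. The example below fits a Normal distribution to the ten continuous data points from the Step 3 slide. It is only a stand-in for Solver: instead of Solver's own algorithm, it uses a simple shrinking neighbourhood search over the mean and standard deviation, which is an assumption made for illustration and may stop at a local minimum.

```python
from statistics import NormalDist, mean, stdev

# Step 2: sorted raw data (the continuous sample from the Step 3 slide).
data = sorted([1.01, 1.25, 1.33, 2.10, 2.88, 3.11, 3.78, 4.29, 4.56, 4.78])
n = len(data)

# Step 3: CRFdata at each sorted point.
crf_data = [(i + 1) / n for i in range(n)]

def mad_normal(mu, sigma):
    """Steps 4-5: MAD between CRFdata and the Normal CDF with these parameters."""
    dist = NormalDist(mu, sigma)
    return max(abs(c - dist.cdf(x)) for x, c in zip(data, crf_data))

# Step 1: starting parameters computed from the raw data.
start_mu, start_sigma = mean(data), stdev(data)
start_mad = mad_normal(start_mu, start_sigma)
best = (start_mad, start_mu, start_sigma)

# Step 6 (Solver stand-in): try neighbouring parameters, shrink the step when stuck.
step = 1.0
for _ in range(500):
    if step < 1e-4:
        break
    _, mu0, sigma0 = best
    improved = False
    for d_mu in (-step, 0.0, step):
        for d_sigma in (-step, 0.0, step):
            m, s = mu0 + d_mu, sigma0 + d_sigma
            if s <= 0:
                continue  # the standard deviation must stay positive
            candidate = (mad_normal(m, s), m, s)
            if candidate[0] < best[0]:
                best, improved = candidate, True
    if not improved:
        step /= 2

best_mad, best_mu, best_sigma = best
print(f"start: mean={start_mu:.3f}, SD={start_sigma:.3f}, MAD={start_mad:.4f}")
print(f"fit:   mean={best_mu:.3f}, SD={best_sigma:.3f}, MAD={best_mad:.4f}")
```

Steps 7 and 8 would repeat the same search with a different CDF in place of `NormalDist` and keep whichever distribution ends with the smallest MAD.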
Step 3: CRF of raw data

Discrete data (total count = 10)
Raw data: 1 2 2 2 3 3 4 4 4 4
Data    Freq    Cum Freq    CRF
1       1       1           1/10
2       3       4           4/10
3       2       6           6/10
4       4       10          10/10

Continuous data (total count = 10)
Data    Freq    Cum Freq    CRF
1.01    1       1           1/10
1.25    1       2           2/10
1.33    1       3           3/10
2.10    1       4           4/10
2.88    1       5           5/10
3.11    1       6           6/10
3.78    1       7           7/10
4.29    1       8           8/10
4.56    1       9           9/10
4.78    1       10          10/10
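
The frequency, cumulative frequency and CRF columns can be generated from the raw data in a few lines. This is a minimal sketch using the two data sets from this slide; the helper name `crf_table` is illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate

def crf_table(data):
    """Rows of (value, freq, cum_freq, crf) for the sorted distinct values."""
    counts = Counter(data)
    values = sorted(counts)
    freqs = [counts[v] for v in values]
    cum_freqs = list(accumulate(freqs))
    total = len(data)
    return [(v, f, c, Fraction(c, total)) for v, f, c in zip(values, freqs, cum_freqs)]

discrete = [1, 2, 2, 2, 3, 3, 4, 4, 4, 4]
continuous = [1.01, 1.25, 1.33, 2.10, 2.88, 3.11, 3.78, 4.29, 4.56, 4.78]

for name, data in (("Discrete", discrete), ("Continuous", continuous)):
    print(name)
    for value, freq, cum_freq, crf in crf_table(data):
        print(f"  data={value}  freq={freq}  cum_freq={cum_freq}  CRF={float(crf):.1f}")
```

Repeated values collapse into one row for the discrete data, while every continuous value gets its own row with a frequency of 1.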

Step 4: CRF of Distribution Continuous

All input parameters in RED are arbitrary values until solved by Solver

Uniform
CRF = (x - min) / (max - min), where x = data point and min <= x <= max

Normal
NORMDIST(x, mean, standard_dev, cumulative), used as NORMDIST(x, mean, standard_dev, TRUE)

Exponential
EXPONDIST(x, 1/mean, cumulative), used as EXPONDIST(x, 1/mean, TRUE)
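
The three continuous CRF formulas can be compared side by side on the continuous data from Step 3. In the sketch below, the Python functions are rough equivalents of the Excel formulas rather than the Excel functions themselves, and every parameter value is an arbitrary starting guess of the kind Solver would later adjust.

```python
from math import exp
from statistics import NormalDist

data = [1.01, 1.25, 1.33, 2.10, 2.88, 3.11, 3.78, 4.29, 4.56, 4.78]
n = len(data)
crf_data = [(i + 1) / n for i in range(n)]

def uniform_crf(x, lo, hi):
    """(x - min) / (max - min), clamped to [0, 1] outside the range."""
    return min(max((x - lo) / (hi - lo), 0.0), 1.0)

def exponential_crf(x, mean):
    """Counterpart of EXPONDIST(x, 1/mean, TRUE)."""
    return 1 - exp(-x / mean)

def normal_crf(x, mean, sd):
    """Counterpart of NORMDIST(x, mean, standard_dev, TRUE)."""
    return NormalDist(mean, sd).cdf(x)

# Arbitrary starting parameters (the values shown in red on the slide).
candidates = {
    "uniform": lambda x: uniform_crf(x, 1.0, 5.0),
    "exponential": lambda x: exponential_crf(x, 3.0),
    "normal": lambda x: normal_crf(x, 3.0, 1.5),
}

mads = {
    name: max(abs(c - crf(x)) for x, c in zip(data, crf_data))
    for name, crf in candidates.items()
}
best_fit = min(mads, key=mads.get)

for name, value in mads.items():
    print(f"{name:12s} MAD = {value:.4f}")
print("smallest MAD before any minimization:", best_fit)
```

These MAD values only reflect the starting guesses; a fair comparison between distributions needs each one minimized first.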
Step 4: CRF of Distribution Discrete

All input parameters in RED are arbitrary values until solved by Solver

Uniform
CRF = (x - min) / (max - min), where x = data point and min <= x <= max

Binomial
BINOMDIST(number_s, trials, probability_s, cumulative), used as BINOMDIST(x, trials, probability_s, TRUE)

Poisson
POISSON(x, mean, cumulative), used as POISSON(x, mean, TRUE)
Hotel Apex
Fit a Normal distribution to room sales data so as to estimate the number of rooms to keep open during the refurbishment period to satisfy 70% of demand.
Due to the maximum capacity of 150 rooms, the data given does not represent real demand, since any demand exceeding 150 only results in 150 rooms being sold.
Thus, the data given is in fact sales data and NOT demand data. So, can we infer real demand data from sales data?
This exercise shows the importance of understanding the data collected and the power of data inference.
Yankee Fruits
Help Paul decide how many melons he has to purchase weekly to satisfy different service levels.
Use 4 methods:
1. Frequency bins & Lookup() function
2. CRF of raw data & Lookup() function
3. Percentile() function
4. NORMINV() function, assuming demand is Normal
Which method is the best? It depends on the situation.
Shifting columns for Lookup() Function

Lookup() finds the largest value in the lookup_vector that is less than or equal to lookup_value, and returns the corresponding value from the result_vector.
In the Yankee Fruits exercise, we need to determine the minimum number of fruits to buy in order to satisfy a certain percentage of customers, as defined by the service level. For example, a service level of 70% means that 70% of customers will be served from your inventory of fruits. The higher the service level, the more inventory you should keep.
From the table, 0.7 lies between 0.57 and 0.83. To satisfy 70% of customers, we need to have at least 650 fruits rather than 600.
If we do not shift columns, then for a 70% service level Lookup() will return 600, which corresponds to a CRF of 0.57. Thus, we need 0.57 to correspond to the next result value, which is 650, to get the correct result.
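
The effect of shifting the column can be reproduced with a small stand-in for Lookup(). In the sketch below, only the pairs 0.57 -> 600 and 0.83 -> 650 come from the text above; the first and last rows (550 at 0.30 and 700 at 1.00) are made-up values added so that the table has neighbours to shift into, and the `lookup` helper only mimics the matching rule described above.

```python
from bisect import bisect_right

def lookup(lookup_value, lookup_vector, result_vector):
    """Use the largest entry <= lookup_value in an ascending lookup_vector."""
    i = bisect_right(lookup_vector, lookup_value) - 1
    if i < 0:
        raise ValueError("lookup_value is below the smallest entry")
    return result_vector[i]

fruits = [550, 600, 650, 700]   # 550 and 700 are hypothetical rows
crf = [0.30, 0.57, 0.83, 1.00]  # 0.30 and 1.00 are hypothetical rows

# Shift the CRF column down one row so each quantity pairs with the previous CRF.
shifted_crf = [0.0] + crf[:-1]

service_level = 0.70
unshifted = lookup(service_level, crf, fruits)
shifted = lookup(service_level, shifted_crf, fruits)

print("without shifting:", unshifted)  # 600, which covers only 57% of customers
print("with shifting:   ", shifted)    # 650, which covers 83% of customers
```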
Yankee Fruits
Method 1: Uses the Lookup() function with frequency bins. This method is coarse because the bins consolidate the data into intervals, so the answer is also given in terms of large intervals. However, if the order size is in multiples of 50, then this answer is appropriate.
Method 2: Uses the Lookup() function with the CRF of the raw data. This method is finer than Method 1, but the answers are given in terms of individual units.
Method 3: Uses Percentile(). This method interpolates and generates new data points, which may not be integers. Is such fine data needed?
Method 4: Uses the Norminv() function. This method assumes demand follows a Normal distribution. For service levels >= 0.95, the number of melons to order is larger than the largest raw data value of 716, so it seems to over-estimate.
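
Methods 3 and 4 can be contrasted in a short sketch. The demand figures below are made up and are not the Yankee Fruits data. The percentile helper uses linear interpolation between sorted values, with the minimum at 0% and the maximum at 100%, which is assumed here to be how Percentile() behaves; the Normal quantile from `NormalDist.inv_cdf` plays the role of NORMINV().

```python
from statistics import NormalDist, mean, stdev

def percentile_inclusive(sorted_data, p):
    """Interpolate between sorted values; p=0 gives the min, p=1 gives the max."""
    pos = p * (len(sorted_data) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_data) - 1)
    frac = pos - lower
    return sorted_data[lower] + frac * (sorted_data[upper] - sorted_data[lower])

# Hypothetical weekly demand, for illustration only.
demand = sorted([520, 560, 585, 600, 610, 630, 645, 660, 690, 710])

# Method 4 assumes demand is Normal with the sample mean and standard deviation.
normal_fit = NormalDist(mean(demand), stdev(demand))

for service_level in (0.70, 0.90, 0.95):
    by_percentile = percentile_inclusive(demand, service_level)
    by_normal = normal_fit.inv_cdf(service_level)
    print(f"{service_level:.0%}: percentile={by_percentile:.1f}, normal={by_normal:.1f}")
```

The percentile result can never exceed the largest observation, whereas the Normal quantile has no such limit, which is one way to understand the over-estimate noted for Method 4.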