
USV Data Management and Analysis

Laboratory 3
using Wittwer, J.W., Monte Carlo Simulation Example: Sales Forecast, (2004)

The Scenario: Company XYZ wants to know how profitable it will be to market their new gadget,
realizing there are many uncertainties associated with market size, expenses, and revenue.
The Method: Use a Monte Carlo Simulation to estimate profit and evaluate risk.

PART 1: MODELING AND SIMULATION


Step 1: Creating the Model
We are going to use a top-down approach to create the sales forecast model, starting with:
Profit = Income - Expenses
Both income and expenses are uncertain parameters, but we aren't going to stop here, because one of
the purposes of developing a model is to try to break the problem down into more fundamental
quantities. Ideally, we want all the inputs to be independent.
We'll say that Income comes solely from the number of sales (S) multiplied by the profit per sale
(P) resulting from an individual purchase of a gadget, so Income = S*P. The profit per sale takes
into account the sale price, the initial cost to manufacture or purchase the product wholesale, and
other transaction fees (credit cards, shipping, etc.). For our purposes, we'll say that P may fluctuate
between $47 and $53.
We could just leave the number of sales as one of the primary variables, but for this example,
Company XYZ generates sales through purchasing leads (see footnote 1). The number of sales per month is the
number of leads per month (L) multiplied by the conversion rate (R) (the percentage of leads that
result in sales). So our final equation for Income is:
Income = L*R*P
We'll consider the Expenses to be a combination of fixed overhead (H) plus the total cost of the
leads. For this model, the cost of a single lead (C) varies between $0.20 and $0.80. Based upon
some market research, Company XYZ expects the number of leads per month (L) to vary between
1200 and 1800. Our final model for Company XYZ's sales forecast is:
Profit = L*R*P - (H + L*C)
Notice that H is also part of the equation, but we are going to treat it as a constant in this example.
The inputs to the Monte Carlo simulation are just the uncertain parameters L, R, P, and C.
Footnote 1: A lead, in a marketing context, is a potential sales contact: an individual or organization that expresses an interest in
your goods or services.
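For readers who want to check the spreadsheet against a script, the model can also be written as a small Python function. This is only a sketch: the text does not state the range of R or the value of H (they appear in Figure 1), so the conversion rate of 3% and the overhead of $800 used below are assumptions, and the other inputs are simply the midpoints of the quoted ranges.

```python
def profit(leads, conversion_rate, profit_per_sale, cost_per_lead, overhead):
    """Profit = L*R*P - (H + L*C), the model from Step 1."""
    income = leads * conversion_rate * profit_per_sale
    expenses = overhead + leads * cost_per_lead
    return income - expenses


# Midpoints of the ranges quoted in the text for L, P and C.
# conversion_rate=0.03 and overhead=800.0 are ASSUMED values for illustration.
example = profit(
    leads=1500,
    conversion_rate=0.03,
    profit_per_sale=50.0,
    cost_per_lead=0.50,
    overhead=800.0,
)
print(f"Profit at the assumed midpoint inputs: {example:.2f}")
```

Evaluating the model at a single point like this is a quick sanity check before running the full simulation.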

Step 2: Generating Random Inputs


The key to Monte Carlo simulation is generating the set of random inputs.
For this example, we're going to use a Uniform Distribution to represent the four uncertain
parameters. (If you have additional information about some uncertain parameters, you might find
other distributions, such as Normal or Poisson, more suitable to describe those inputs.)
The inputs are summarized in the table shown below. (If you haven't already, download
MAD_Laborator_5.xlsx from the group website.)

Figure 1: Screen capture from the example sales forecast spreadsheet.

The table above uses "Min" and "Max" to indicate the uncertainty in L, C, R, and P. To generate a
random number between "Min" and "Max", we use the following formula in Excel (Replacing
"min" and "max" with cell references):
= min + RAND()*(max-min)
Let's say we want to run n=5000 evaluations of our model.
A convenient way to organize the data in Excel is to make a column for each variable as shown in
the screen capture below.

Figure 2: Screen capture from the example sales forecast spreadsheet.

Cell A2 contains the formula:


=Model!$F$14+RAND()*(Model!$G$14-Model!$F$14)
Note that the reference Model!$F$14 refers to the corresponding Min value for the variable L on the
Model worksheet, as shown in Figure 1. To generate 5000 random numbers for L, you simply copy
the formula down 5000 rows. You repeat the process for the other variables (except for H, which is
constant).
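The same input generation can be sketched in Python with the standard library. The helper name uniform_between and the dictionary layout are illustrative choices, not part of the spreadsheet. The ranges for L, C and P come from Step 1; the range for R (1% to 5%) is an assumption, because the text only shows it in Figure 1.

```python
import random

N_RUNS = 5000

# (min, max) for each uncertain input.
RANGES = {
    "L": (1200.0, 1800.0),  # leads per month (from the text)
    "C": (0.20, 0.80),      # cost per lead (from the text)
    "R": (0.01, 0.05),      # conversion rate (ASSUMED, not stated in the text)
    "P": (47.0, 53.0),      # profit per sale (from the text)
}


def uniform_between(lo, hi, rng):
    """Python counterpart of the Excel formula  = min + RAND()*(max-min)."""
    return lo + rng.random() * (hi - lo)


rng = random.Random(42)  # fixed seed so the example is repeatable
inputs = {
    name: [uniform_between(lo, hi, rng) for _ in range(N_RUNS)]
    for name, (lo, hi) in RANGES.items()
}
print({name: round(values[0], 4) for name, values in inputs.items()})
```

Each list in inputs plays the role of one spreadsheet column filled down 5000 rows.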

Step 3: Evaluating the Model


Since our model is simple, all we need to do to evaluate the model for each run of the simulation is
put the equation in another column next to the inputs, as shown in Figure 2 (the Profit column).
Cell G2 contains the formula: =A2*C2*D2-(E2+A2*B2)

Step 4: Run the Simulation


We don't need to write a fancy macro for this example in order to iteratively evaluate our model. We
simply copy the formula for profit down 5000 rows, making sure that we use relative references in
the formula (no $ signs).
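Steps 2 to 4 together amount to a short loop. The sketch below is one possible Python version under the same assumptions as before: the range for R (1% to 5%) and the overhead H = 800 are not given in the text and are assumed here. The function name simulate_profit is hypothetical.

```python
import random


def simulate_profit(n_runs=5000, overhead=800.0, seed=None):
    """Evaluate Profit = L*R*P - (H + L*C) for n_runs random input sets."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_runs):
        leads = rng.uniform(1200.0, 1800.0)   # L
        cost = rng.uniform(0.20, 0.80)        # C
        rate = rng.uniform(0.01, 0.05)        # R (ASSUMED range)
        per_sale = rng.uniform(47.0, 53.0)    # P
        results.append(leads * rate * per_sale - (overhead + leads * cost))
    return results


profits = simulate_profit(seed=1)
print(len(profits), "simulated profits")
print("average profit:", round(sum(profits) / len(profits), 2))
```

Calling simulate_profit() without a seed gives a fresh set of results each time, which corresponds to pressing F9 in the spreadsheet.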

Rerun the Simulation: F9


Although we still need to analyze the data, we have essentially completed a Monte Carlo
simulation. Because we have used the volatile RAND() formula, to re-run the simulation all we
have to do is recalculate the worksheet (F9 is the shortcut).
In practice, it is usually more convenient to buy an add-on for Excel than to do a Monte Carlo
analysis from scratch every time. But these add-ons are usually not free, and hopefully the skills
you learn from this example will help in future data analysis and modeling.

PART 2: DATA ANALYSIS


PART 2.1: Creating a Histogram in Excel
Method 1: Using the Histogram Tool in the Analysis Tool-Pak.
This is probably the easiest method, but you have to re-run the tool each time you run a new simulation.
You also still need to create an array of bins (which will be discussed below).
Method 2: Using the FREQUENCY function in Excel.
This is the method used in the spreadsheet for the sales forecast example. One of the reasons I like
this method is that you can make the histogram dynamic, meaning that every time you re-run the MC
simulation, the chart will automatically update. This is how you do it:
Step 1: Create an array of bins
The figure below shows how to easily create a dynamic array of bins. This is a basic technique for
creating an array of N evenly spaced numbers.
To create the dynamic array, enter the following formulas:
B6 = $B$2
B7 = B6+($B$3-$B$2)/5
Then, copy cell B7 down to B11
After you create the array of bins, you can go ahead and use the Histogram tool, or you can proceed
with the next step.

Figure 3: A dynamic array of 5 bins.
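The bin array can be sketched in Python as follows. The minimum of -1500 and maximum of 3500 are hypothetical stand-ins for cells B2 and B3. Unlike the spreadsheet, which adds the step to the previous cell, this version computes each edge directly from the minimum, which avoids accumulating rounding error.

```python
def make_bin_edges(lo, hi, n_bins):
    """Return n_bins + 1 evenly spaced values from lo to hi inclusive."""
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(n_bins + 1)]


# Hypothetical minimum and maximum, standing in for cells B2 and B3.
edges = make_bin_edges(-1500.0, 3500.0, 5)
print(edges)
```

As in cells B6 to B11, five bins require six values, because both end points are included.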

Step 2: Use Excel's FREQUENCY formula


The next figure is a screen shot from the example Monte Carlo simulation. I'm not going to explain
the FREQUENCY function in detail since you can look it up in Excel's help file. But one thing
to remember is that it is an array function, and after you enter the formula, you will need to press
Ctrl+Shift+Enter. Note that the simulation results (Profit) are in column G and there are 5000 data
points (Points: J5=COUNT(G:G)).
The Formula for the Count column:
FREQUENCY(data_array,bins_array)
a) Select cells J8:J48
b) Enter the array formula: {=FREQUENCY(G:G,I8:I48)}
c) Press Ctrl+Shift+Enter

Figure 4: Layout in Excel for Creating a Dynamic Scaled Histogram.
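To see what FREQUENCY is counting, here is a small Python sketch of the same idea. It assumes the bins are sorted in ascending order. Each count covers the values greater than the previous bin value and less than or equal to the current one, and one extra count at the end holds everything above the last bin. The function name frequency and the sample data are made up for illustration.

```python
from bisect import bisect_left


def frequency(data, bins):
    """Count values per bin; the last entry counts values above the last bin."""
    counts = [0] * (len(bins) + 1)
    for value in data:
        # bisect_left finds the first bin whose upper limit is >= value.
        counts[bisect_left(bins, value)] += 1
    return counts


data = [1, 2, 2, 3, 5, 8, 13]   # made-up observations
bins = [2, 5, 10]               # upper limit of each bin, ascending
counts = frequency(data, bins)
print(counts)  # [3, 2, 1, 1]
```

In the spreadsheet the selected range J8:J48 has the same length as the bin range, so the extra "above the last bin" count is not displayed.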

Creating a Scaled Histogram


If you want to compare your histogram with a probability distribution, you will need to scale the
histogram so that the area under the curve is equal to 1 (one of the properties of probability
distributions). Histograms normally include the count of the data points that fall into each bin on the
y-axis, but after scaling, the y-axis will be the relative frequency per unit of bin width, that is, an
estimate of the probability density (a not-so-easy-to-interpret number that in all practicality you can
just not worry about).
To scale the histogram, use the following method:
Scaled = (Count/Points) / (BinSize)
a) K8 = (J8/$J$5)/($I$9-$I$8)
b) Copy cell K8 down to K48
c) Press F9 to force a recalculation (may take a while)
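The scaling formula is easy to check numerically. In the sketch below the counts and the bin size are made-up numbers, and all bins are assumed to have the same width. After scaling, the bar heights multiplied by the bin width should add up to 1.

```python
def scale_counts(counts, n_points, bin_size):
    """Scaled = (Count/Points) / BinSize for every bin."""
    return [(count / n_points) / bin_size for count in counts]


counts = [50, 200, 500, 200, 50]   # made-up bin counts
n_points = sum(counts)             # plays the role of cell J5
bin_size = 1000.0                  # plays the role of $I$9-$I$8

scaled = scale_counts(counts, n_points, bin_size)
area = sum(height * bin_size for height in scaled)
print(scaled)
print("area under the scaled histogram:", area)
```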
Step 3: Create the Histogram Chart
Bar Chart, Line Chart, or Area Chart:
To create the histogram, just create a bar chart using the Bins column for the Labels and the Count
or Scaled column as the Values. Tip: To reduce the spacing between the bars, right-click on the
bars and select "Format Data Series...". Then go to the Options tab and reduce the Gap. Figure 5
below was created this way.
A More Flexible Histogram Chart
One of the problems with using bar charts and area charts is that the numbers on the x-axis are
actually just labels. This can make it very difficult to overlay data that uses a different number of
points or to show the proper scale when bins are not all the same size. However, you CAN use a
scatter plot to create a histogram. After creating a line using the Bins column for the X Values and
the Count or Scaled column for the Y Values, add Y Error Bars to the line that extend down to the
x-axis (by setting the Percentage to 100%). You can right-click on these error bars to change the line
widths, color, etc.

Figure 5: Example Histogram Created Using a Scatter Plot and Error Bars.

PART 2.2: Summary Statistics


In Part 2.1 of this example, we plotted the results as a histogram in order to visualize the uncertainty
in profit. In order to provide a concise summary of the results, it is customary to report the mean,
median, standard deviation, standard error, and a few other summary statistics to describe the
resulting distribution. The screenshot below shows these statistics calculated using simple Excel
formulas.

Figure 6: Summary statistics for the sales forecast example

Statistics Formulas in Excel


Sample Size (n): =COUNT(G:G)
Sample Mean: =AVERAGE(G:G)
Median: =MEDIAN(G:G)
Sample Standard Deviation: =STDEV(G:G)
Maximum: =MAX(G:G)
Minimum: =MIN(G:G)
Q(.75): =QUARTILE(G:G,3)
Q(.25): =QUARTILE(G:G,1)
Skewness: =SKEW(G:G)
Kurtosis: =KURT(G:G)

Explanations for these formulas can be found below.
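If you want to cross-check the spreadsheet, the basic statistics have direct counterparts in Python's statistics module. The short list of profits below is made up for illustration; in practice you would use the 5000 simulated values.

```python
import statistics

# Made-up profit values standing in for column G.
profits = [-650.0, 120.0, 480.0, 700.0, 910.0, 1500.0, 2400.0]

summary = {
    "n": len(profits),                       # =COUNT(G:G)
    "mean": statistics.mean(profits),        # =AVERAGE(G:G)
    "median": statistics.median(profits),    # =MEDIAN(G:G)
    "stdev": statistics.stdev(profits),      # =STDEV(G:G), n-1 in the denominator
    "max": max(profits),                     # =MAX(G:G)
    "min": min(profits),                     # =MIN(G:G)
}
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```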

Sample Size (n)


The sample size, n, is the number of observations or data points from a single MC simulation. For
this example, we obtained n = 5000 simulated observations. Because the Monte Carlo method is
stochastic, if we repeat the simulation, we will end up calculating a different set of summary
statistics. The larger the sample size, the smaller the difference will be between the repeated
simulations. (See standard error below).
Central Tendency: Mean and Median
The sample mean and median statistics describe the central tendency or "location" of the
distribution. The arithmetic mean is simply the average value of the observations.
If you sort the results from lowest to highest, the median is the "middle" value or the 50th
Percentile, meaning that 50% of the results from the simulation are less than the median. If there is
an even number of data points, then the median is the average of the middle two points.
Extreme values can have a large impact on the mean, but the median only depends upon the middle
point(s). This property makes the median useful for describing the center of skewed distributions. If
the distribution is symmetric, then its mean and median are identical, and the sample mean and
sample median will be close to each other.
Spread: Standard Deviation, Range, Quartiles
The standard deviation and range describe the spread of the data or observations. The standard
deviation is calculated using the STDEV function in Excel.
The range is also a helpful statistic, and it is simply the maximum value minus the minimum value.
Extreme values have a large effect on the range, so another measure of spread is something called
the Interquartile Range.
The Interquartile Range represents the central 50% of the data. If you sorted the data from lowest
to highest, and divided the data points into 4 sets, you would have 4 Quartiles:
Q0 is the Minimum value: =QUARTILE(G:G,0) or just =MIN(G:G),
Q1 or Q(0.25) is the First quartile or 25th percentile: =QUARTILE(G:G,1),
Q2 or Q(0.5) is the Median value or 50th percentile: =QUARTILE(G:G,2) or =MEDIAN(G:G),
Q3 or Q(0.75) is the Third quartile or 75th percentile: =QUARTILE(G:G,3),
Q4 is the Maximum value: =QUARTILE(G:G,4) or just =MAX(G:G).
In Excel, the Interquartile Range is calculated as Q3-Q1 or:
=QUARTILE(G:G,3)-QUARTILE(G:G,1)
The IQR is used in creating a box and whisker plot.
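The quartiles and the IQR can be reproduced in Python with statistics.quantiles. The "inclusive" method interpolates between data points in the same way as the QUARTILE function, as far as I can tell; treat that equivalence as an assumption and compare against your spreadsheet. The data values are made up.

```python
import statistics

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # made-up observations

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("Q1 =", q1)    # 2.75
print("Q2 =", q2)    # 4.5 (the median)
print("Q3 =", q3)    # 6.25
print("IQR =", iqr)  # 3.5
```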
Shape: Skewness and Kurtosis
Skewness
Skewness describes the asymmetry of the distribution relative to the mean. A positive skewness
indicates that the distribution has a longer right-hand tail (skewed towards more positive values). A
negative skewness indicates that the distribution is skewed to the left.
Kurtosis
Kurtosis describes the peakedness or flatness of a distribution relative to the Normal distribution.
Positive kurtosis indicates a more peaked distribution. Negative kurtosis indicates a flatter
distribution.
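Python's standard library has no skewness or kurtosis function, so the sketch below writes them out by hand. It uses the bias-corrected sample formulas that I believe the Excel SKEW and KURT functions use (KURT reports excess kurtosis, so a Normal distribution gives a value near 0); check against your spreadsheet before relying on an exact match. The function names and data are illustrative.

```python
import statistics


def sample_skewness(data):
    """Bias-corrected sample skewness (needs at least 3 points)."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    total = sum(((x - mean) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * total


def sample_excess_kurtosis(data):
    """Bias-corrected sample excess kurtosis (needs at least 4 points)."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    total = sum(((x - mean) / s) ** 4 for x in data)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * total - correction


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 1.0, 2.0, 10.0]

print("skewness (symmetric):   ", sample_skewness(symmetric))
print("skewness (right tail):  ", sample_skewness(right_tailed))
print("kurtosis (flat, uniform-like):", sample_excess_kurtosis(symmetric))
```

The evenly spread data set gives a negative kurtosis, which matches the "flatter than Normal" description above.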

Confidence Intervals for the True Population Mean


The sample mean is just an estimate of the true population mean. How accurate is the estimate?
You can see by repeating the simulation (using F9 in this Excel example) that the mean is not the
same for each simulation.
Standard Error
If you repeated the Monte Carlo simulation and recorded the sample mean each time, the distribution
of the sample mean would end up following a Normal distribution (based upon the Central Limit
Theorem). The standard error is a good estimate of the standard deviation of this distribution,
assuming that the sample is sufficiently large (n >= 30).
The standard error is calculated using the following formula:
Standard Error = s / SQRT(n), where s is the sample standard deviation and n is the sample size.
In Excel: =STDEV(G:G)/SQRT(COUNT(G:G))
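The same calculation in Python is a one-liner. The data below are made up; the point is only to show the sample standard deviation divided by the square root of the sample size.

```python
import math
import statistics


def standard_error(data):
    """Sample standard deviation divided by sqrt(n)."""
    return statistics.stdev(data) / math.sqrt(len(data))


sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up observations
se = standard_error(sample)
print("standard error:", round(se, 4))
```

Because n sits under a square root, you need four times as many simulation runs to cut the standard error in half.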
95% Confidence Interval
The standard error can be used to calculate confidence intervals for the true population mean. For
a 95% 2-sided confidence interval, the Upper Confidence Limit (UCL) and Lower Confidence Limit
(LCL) are calculated as:
UCL = Sample Mean + 1.96 * Standard Error
LCL = Sample Mean - 1.96 * Standard Error
To get a 90% or 99% confidence interval, you would change the value 1.96 to 1.645 or 2.576,
respectively. The value 1.96 represents the 97.5th percentile of the standard normal distribution.
(You may often see this number rounded to 2). To calculate a different percentile of the standard
normal distribution, you can use the NORMSINV() function in Excel.
Example: 1.96 = NORMSINV(1-(1-.95)/2)
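In Python, statistics.NormalDist().inv_cdf plays the role of NORMSINV. The sketch below computes the multiplier for a given confidence level and then the lower and upper limits for the mean. The function names and data are illustrative, and like the Excel version it relies on the normal approximation, so it assumes a reasonably large sample.

```python
import math
import statistics
from statistics import NormalDist


def z_value(confidence):
    """Two-sided multiplier, e.g. about 1.96 for confidence=0.95."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def mean_confidence_interval(data, confidence=0.95):
    """Return (LCL, UCL) for the mean using the normal approximation."""
    mean = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(len(data))
    z = z_value(confidence)
    return mean - z * se, mean + z * se


sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up, far too small in practice
lcl, ucl = mean_confidence_interval(sample)
print("z for 90%, 95%, 99%:", [round(z_value(c), 3) for c in (0.90, 0.95, 0.99)])
print("95% interval for the mean:", round(lcl, 3), "to", round(ucl, 3))
```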
Commentary
Keep in mind that confidence intervals make no sense (except to statisticians), but they tend to make
people feel good. The correct interpretation: "We can be 95% confident that the true mean of the
population falls somewhere between the lower and upper limits." What population? The
population we artificially created! Lest we forget, the results depend completely on the assumptions
that we made in creating the model and choosing input distributions. So, I generally just stick to
using the standard error as a measure of the uncertainty in the mean.

PART 2.3: Cumulative Probabilities


In Part 2.1 of this example, we plotted the results as a histogram in order to visualize the uncertainty
in profit. We are going to augment the histogram by including a graph of the estimated cumulative
distribution function (CDF) as shown below.

Figure 7: Graph of the estimated cumulative distribution.

The reason for showing the CDF along with the histogram is to demonstrate that an estimate of the
cumulative probability is simply the percentage of the data points to the left of the point of interest.
For example, we might want to know what percentage of the results was less than -$700.00 (the
vertical red line on the left). From the graph, the corresponding cumulative probability is about 0.05
or 5%. Similarly, we can draw a line at $2300 and find that about 95% of the results are less than
$2300.
It is fairly simple to create the cumulative distribution in Excel. Figure 8 shows how you can
estimate the CDF by calculating the probabilities using a cumulative sum of the count from the
frequency function. You simply divide the cumulative sum by the total number of points.

Figure 8: Calculating the probabilities for the cumulative distribution.
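The cumulative-sum calculation can be sketched in a few lines of Python. The bin counts are made up; each cumulative probability is the running total of the counts divided by the total number of points.

```python
from itertools import accumulate


def cumulative_probabilities(counts):
    """Running sum of the bin counts divided by the total count."""
    total = sum(counts)
    return [running / total for running in accumulate(counts)]


counts = [50, 200, 500, 200, 50]   # made-up bin counts
cdf = cumulative_probabilities(counts)
print(cdf)
```

The last value is always 1, because by the final bin every data point has been counted.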

Many of the questions we may be interested in have to do with using the CDF to go from a
cumulative probability to a percentile or vice versa. The PERCENTRANK() and PERCENTILE()
functions in Excel allow us to do this quite easily.
Note that a percentile, or quantile, refers to the value (in this case, the profit) corresponding to a
given estimated cumulative probability.
Question 1: What percentage of the results was less than -$700?
This question is answered using the percent rank function: =PERCENTRANK(array,x), where the
array is the data range (column G in Figure 2 above) and x is -$700.
If x matches one of the values in the array, this function is equivalent to the Excel formula
=(RANK(x)-1)/(N-1), where the rank is counted in ascending order and N is the number of data points. If x does not match one of the values,
then the PERCENTRANK function interpolates. You can read more about the details of the RANK,
PERCENTILE, and PERCENTRANK functions in the Excel help file (F1).
The figure below shows a screen shot of some examples where the percent rank function is used to
estimate the cumulative probability based upon results of the Monte Carlo simulation.

Figure 9: Calculating probabilities using the Excel percent rank function.
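To make the rank-and-interpolate idea concrete, here is a Python sketch of both directions: from a value to a cumulative probability, and from a cumulative probability back to a value. It follows the description above and is not guaranteed to match Excel digit for digit (Excel also rounds the percent rank to a fixed number of significant digits). The function names and data are illustrative.

```python
from bisect import bisect_left


def percent_rank(data, x):
    """Fraction of the way through the sorted data at which x falls."""
    values = sorted(data)
    n = len(values)
    if n < 2 or x < values[0] or x > values[-1]:
        raise ValueError("x must lie within the range of the data")
    below = bisect_left(values, x)          # number of values strictly less than x
    if values[below] == x:
        return below / (n - 1)              # same as (RANK - 1) / (N - 1)
    lo, hi = values[below - 1], values[below]
    lo_rank = bisect_left(values, lo) / (n - 1)
    hi_rank = below / (n - 1)
    return lo_rank + (x - lo) / (hi - lo) * (hi_rank - lo_rank)


def percentile(data, k):
    """Value at cumulative probability k (0 <= k <= 1), with interpolation."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must be between 0 and 1")
    values = sorted(data)
    position = k * (len(values) - 1)
    index = int(position)
    if index + 1 >= len(values):
        return values[-1]
    fraction = position - index
    return values[index] + fraction * (values[index + 1] - values[index])


data = [10.0, 20.0, 30.0, 40.0, 50.0]   # made-up results
print(percent_rank(data, 30.0))   # 0.5
print(percent_rank(data, 25.0))   # 0.375
print(percentile(data, 0.375))    # 25.0
```

The two functions are inverses of each other here: the percent rank of 25 is 0.375, and the 0.375 percentile is 25.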


The accuracy of the result will depend upon the number of data points and how far out on the tails
of the distribution you are (and of course on how realistic the model is, how well the input
distributions represent the true uncertainty or variation, and how good the random number generator
is). Recalculating the spreadsheet a few times by pressing F9 will give you an idea of how much the
results may vary between each simulation.

The End
This concludes the Monte Carlo simulation example using Excel. This lab is not comprehensive, and
many details having to do with Monte Carlo simulation have not been covered. However, I hope this
lab has given you a good introduction to the basics.
To generate a random number from a Normal (Gaussian) distribution, you would use the following
formula in Excel:
=NORMINV(RAND(),mean,standard_dev)
Ex: =NORMINV(RAND(),$D$4,$D$5)
To generate a random number from a Lognormal distribution with median = exp(meanlog) and
shape = sdlog, you would use the following formula in Excel:
=LOGINV(RAND(),meanlog,sdlog)
Ex: =LOGINV(RAND(),$D$6,$D$5)
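Both formulas use the same trick: feed a uniform random number into the inverse of the cumulative distribution. A Python sketch of that idea is shown below, using statistics.NormalDist for the inverse Normal CDF. The function names and parameter values are illustrative assumptions. (Python also offers random.gauss and random.lognormvariate, which are simpler if you do not need to mirror the Excel approach.)

```python
import math
import random
import statistics
from statistics import NormalDist


def open_unit_uniform(rng):
    """Uniform number strictly between 0 and 1 (inv_cdf rejects exactly 0)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def normal_sample(mean, standard_dev, rng):
    """Counterpart of =NORMINV(RAND(),mean,standard_dev)."""
    return NormalDist(mean, standard_dev).inv_cdf(open_unit_uniform(rng))


def lognormal_sample(meanlog, sdlog, rng):
    """Counterpart of =LOGINV(RAND(),meanlog,sdlog): exp of a Normal sample."""
    return math.exp(normal_sample(meanlog, sdlog, rng))


rng = random.Random(7)  # fixed seed so the example is repeatable
normal_draws = [normal_sample(50.0, 2.0, rng) for _ in range(5000)]
lognormal_draws = [lognormal_sample(0.0, 0.5, rng) for _ in range(5000)]

print("mean of Normal draws:", round(statistics.mean(normal_draws), 2))
print("median of Lognormal draws:", round(statistics.median(lognormal_draws), 2))
```

With meanlog = 0 the median of the Lognormal draws should come out close to exp(0) = 1.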
