
HBS Toolkit LICENSE AGREEMENT

Harvard Business School Publishing (the Publisher) grants you, the
individual user, limited license to use this product. By accepting and
using this product, you agree to the terms of service described below.

Terms
You accept that this product is intended for your use, and you will not
duplicate in any form or manner, electronic or otherwise, copies of this
product nor distribute this product to anyone else.

You recognize that the product and its content are the sole property of the
Publisher, and that we have copyrighted the product.

You agree that the Publisher is not responsible for any interruption of
service or malfunction that is a consequence of the Internet, a service
provider, personal computer, browser or other software or hardware
components. You accept that there is no guarantee that this product is
totally error free. You further understand and accept that the Publisher
intends to provide reliable information but does not guarantee the accuracy
or completeness of any information, and is not responsible for any results
obtained from the use of such information.

This license is effective until terminated, when the license or subscription
period ends without renewal, or when you destroy this product and any
related documentation. The Publisher may terminate your license without
notice if you fail to comply with the conditions set forth in this
agreement, and may pursue any other legal recourse.

Copyright © 1999 President and Fellows of Harvard College


Introduction to Statistics Using Excel INTRODUCTION

Contents
Introduction This sheet
Ex.1 Analyzing Single-Variable Data Sets using Microsoft Excel
Ex.2 Statistical Analysis using the Summary Statistics Tool
Ex.3 Data Representation - Histograms
Ex.4 Analyzing Two-Variable Data Sets using Correlation
Ex.5 Data Representation - Scatter Diagrams
Ex.6 Simple Forecasting - Adding Trendlines to your Scatter Diagrams

Overview
The Random House College Dictionary defines statistics as the science that deals with the
collection, classification, analysis, and interpretation of information or data. In business we
use statistical analysis to reveal such trends as the number of employees working in
high-tech companies compared to banking or consulting. One might use this data to
determine if the supply of available workers will meet demand. Often the data we analyze are
selected from a larger set of data whose characteristics we want to know something about.
For example, we might collect the number of job openings at a high-tech company as
compared to a bank and a consulting company. The companies surveyed are part of a
sample. By analyzing the data from these sample companies we hope to draw conclusions
about the larger population of all high-tech, banking, and consulting companies.

This brief explanation leads us to break the study of statistics into two broad categories:
descriptive statistics and inferential statistics.

Descriptive statistics utilize numbers and graphs to look for patterns in a data set, summarize
the data, and present the data in a convenient form.

Inferential statistics utilize sample data to help make estimates, decisions, predictions, or
other generalizations about a larger set of data.

In business, prediction (more commonly called forecasting) is an important activity that
managers have responsibility for carrying out. For this reason, inference will be the focus of
most of the statistical analysis you will do in business.

This workbook will help introduce several fundamental statistical concepts and provide you
with hands-on experience using the powerful statistical analysis tools built into Microsoft
Excel.

The Excel tools covered in this exercise are:

Mean (average)
Median (median)
Mode (mode)
Range (min, max)
Variance (var)
Standard deviation (stdev)
Summary statistics (Data analysis add-in)
Histogram (Data analysis add-in)
Correlation (Data analysis add-in)
Scatter diagram (Chart wizard)
Trendline (Chart wizard)

Directions

You may want to print these directions as a reference guide for this tool.

Introduction to Statistics Using Excel is a self-instructional workbook (tutorial) that introduces the user
to eleven Microsoft Excel statistical analysis tools and their corresponding statistical concepts. Each
exercise is self-contained but the workbook is designed to be completed in order from Exercise 1 to
Exercise 6.

Each exercise follows this standard format:


1. Exercise number, title, and description
2. List of Excel tools covered in the exercise
3. Summary of the content introduced in the exercise
4. Step-by-step instructions on how to use Excel to carry out the calculation(s)
5. Sample data set
6. User output area
7. Sample dialog box (what your entries in Excel should look like)
8. Exercise answer
9. What's next (a guide to the next exercise)

Note: you may want to print the entire workbook before you begin so that you can refer to it
as you work through the Excel-based exercises.

Note About Using Internet Explorer


The default setting in Internet Explorer is to open these tools in the Explorer application instead
of Excel. We recommend against this and provide directions in the Help section of the HBS
Toolkit web site to change this default behavior.

HBS Menu
Show Calculator: Launches Windows calculator
Show/Hide Celltips: Toggles in/out red Celltips in documented cells
Print Sheet with Celltips: Prints Celltip documentation on current sheet
Set Zoom: Provides quick access to 80%, 100%, and 125% zoom levels
Visit Web Links: Links to HBS Toolkit website, Toolkit Glossary, and Toolkit
Feedback, as well as HBS and HBS Publishing web sites
About HBS Toolkit: Launches the about box for the HBS Toolkit

Jon B. DeFriese MBA '00 developed this software under the supervision of Professor Frances X.
Frei as the basis for class discussion rather than to illustrate either the effective or ineffective
handling of an administrative situation.



EXERCISE 1
Introduction to Statistics Using Excel SINGLE-VARIABLE
DATA SET ANALYSIS

Exercise 1: Analyzing Single-Variable Data Sets using Microsoft Excel

This exercise demonstrates how to analyze sets of data that contain one variable, in this case,
the selling price of residential real-estate.

The Excel tools covered in this exercise are:


Measures of Central Tendency
Tool 1 Mean
Tool 2 Median
Tool 3 Mode

Measures of Variability
Tool 4 Range
Tool 5 Variance
Tool 6 Standard Deviation

Definitions of the tools and terms used in this exercise

Statistics The study of ways to collect, describe, draw conclusions, and make projections from data
Population A group of objects about which information is to be gained
Sample A subset of a population used to gain information about the whole population
Measures of Central Tendency Summary measures that describe the center (typical value) of a data set
Mean The sum of the data divided by the number of data points in the data set (the average)
Median The middle number when the data set is arranged in ascending (or descending) order
Mode The most frequently occurring number in the data set
Measures of Variability Summary measures that describe how spread out the data are
Range The largest number in the data set minus the smallest number in the data set
Sample Variance The sum of the squared distances of each data point from the mean, divided
by the number of data points minus one (consult the Excel Help file for the equation)
Sample Standard Deviation The positive square root of the sample variance

1. To determine the Mean (the statistical average) of the data set follow these steps:

Step 1: Place your cursor in the cell labeled Mean (Average)
Step 2: Select Function… from the Insert menu (or click on the paste function - fx - button on the menu bar)
Step 3: Select Statistical under the function category menu and Average under the function name menu
Step 4: When the dialog box appears, place your cursor in the first cell of the data set and
select the entire column. Notice that the range G66:G76 now appears in the dialog box
Step 5: Click OK and compare your results with those listed below

Formulas:
Mean (Average) AVERAGE(G66:G76)
Median MEDIAN(G66:G76)
Mode MODE(G66:G76)
Range MAX(G66:G76)-MIN(G66:G76)
Variance VAR(G66:G76)
Standard Deviation STDEV(G66:G76)
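
For readers who want to cross-check the worksheet formulas outside Excel, the following is a minimal sketch in Python that uses only the standard-library statistics module. It assumes the eleven selling prices from this exercise's data set; the variable names are illustrative and are not part of the workbook. Note that statistics.variance and statistics.stdev divide by n - 1 (the sample formulas), which is the same convention as Excel's VAR and STDEV.

```python
import statistics

# Selling prices from the Exercise 1 data set (range G66:G76 in the workbook)
prices = [109360, 137980, 131230, 130230, 125410, 124370,
          109360, 139030, 140160, 144220, 154190]

mean = statistics.mean(prices)          # counterpart of AVERAGE
median = statistics.median(prices)      # counterpart of MEDIAN
mode = statistics.mode(prices)          # counterpart of MODE
data_range = max(prices) - min(prices)  # counterpart of MAX - MIN
variance = statistics.variance(prices)  # counterpart of VAR (sample variance)
std_dev = statistics.stdev(prices)      # counterpart of STDEV (sample std. dev.)

results = [("Mean", mean), ("Median", median), ("Mode", mode),
           ("Range", data_range), ("Variance", variance),
           ("Standard Deviation", std_dev)]
for label, value in results:
    print(f"{label}: {value:,.0f}")
```

The printed values can be compared with Answers 1 through 6 in this exercise.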

Data Set

Selling Price
$109,360
$137,980
$131,230
$130,230
$125,410
$124,370
$109,360
$139,030
$140,160
$144,220
$154,190

Now follow the same steps substituting the appropriate Excel command in place of the Average command.
Check your answers below.

This is what the data entry dialog box should look like:

Answer 1 Mean (Average) $131,413 AVERAGE(G66:G76)

This is what the data entry dialog box should look like:

Answer 2 Median $131,230 MEDIAN(G66:G76)

This is what the data entry dialog box should look like:

Answer 3 Mode $109,360 MODE(G66:G76)

This is what the data entry dialog box should look like:

$154,190 MAX(G66:G76)
$109,360 MIN(G66:G76)

Answer 4 Range $44,830 MAX(G66:G76)-MIN(G66:G76)



This is what the data entry dialog box should look like:

Answer 5 Variance $192,020,762 VAR(G66:G76)

This is what the data entry dialog box should look like:

Answer 6 Standard Deviation $13,857 STDEV(G66:G76)

This concludes Exercise 1: Analyzing Single-variable data sets using Microsoft Excel.
Exercise 2 demonstrates how to use the Excel tool Summary Statistics to combine these steps into one command



Introduction to Statistics Using Excel EXERCISE 2
SUMMARY STATISTICS

Exercise 2: Statistical Analysis using the Summary Statistics Tool

This exercise continues with the same data set but introduces a tool that allows you to
quickly calculate all of the individual measures previously introduced (and several others)
using a single Excel tool called Summary Statistics.

The Excel Tools covered in this exercise:


Tool 7 Data Analysis - Summary Statistics

Definitions of the measures output by Summary Statistics:

Mean The sum of the data divided by the number of data points in the data set (the average)
Standard Error The standard error of the mean of the sample
Median The middle number when the data set is arranged in ascending (or descending) order
Mode The most frequently occurring number in the data set
Standard Deviation The positive square root of the sample variance
Sample Variance The sum of the squared distances of each data point from the mean, divided
by the number of data points minus one (consult the Excel Help file for the equation)
Kurtosis The relative peakedness or flatness of a distribution compared with the normal distribution
Skewness The degree of asymmetry of a distribution around its mean
Range The largest number in the data set minus the smallest number in the data set
Minimum The smallest number in the data set
Maximum The largest number in the data set
Sum The data points added together
Count The number of data points

Note: Summary Statistics requires the Excel Data Analysis Add-In. If Data Analysis is not
available under the Tools menu, you will need to install the Analysis Toolpak. Under Tools,
click Add-Ins..., select the Analysis Toolpak and then click OK. If the Analysis Toolpak is
already checked, uncheck the box, click OK, and then repeat this procedure.

To analyze the data table using Summary Statistics follow these steps:

Step 1: Click on Tools from the Menu Bar and select Data Analysis
Step 2: Select Descriptive Statistics and click OK
Step 3: With your cursor in the Input Range cell, use your mouse to highlight the
data in the Selling Price column, including the label
Step 4: Select the Columns option in the Grouped By section and check Labels in First Row
Step 5: Under Output Options, place your cursor in the Output Range cell and use your
mouse to select the labeled output cell
Step 6: Check off Summary Statistics and click OK.

Data Set

Selling Price
$109,360
$137,980
$131,230
$130,230
$125,410
$124,370
$109,360
$139,030
$140,160
$144,220
$154,190

Your table should match the one at the bottom of this page.

Output cell

This is what the data entry dialog box should look like:

Answer 7 Selling Price

Mean 131412.727272727
Standard Error 4178.0896223707
Median 131230
Mode 109360
Standard Deviation 13857.1556178814
Sample Variance 192020761.818182
Kurtosis -0.2688724255
Skewness -0.3079635474
Range 44830
Minimum 109360
Maximum 154190
Sum 1445540
Count 11
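
The three measures in this table that Exercise 1 did not cover (Standard Error, Skewness, and Kurtosis) can be sketched in Python as follows. This is an illustration rather than the tool's own implementation: it assumes the bias-corrected sample formulas documented for Excel's SKEW and KURT functions, and it takes the standard error to be the sample standard deviation divided by the square root of the count. The variable names are hypothetical.

```python
import math
import statistics

# Selling prices from the Exercise 2 data set
prices = [109360, 137980, 131230, 130230, 125410, 124370,
          109360, 139030, 140160, 144220, 154190]

n = len(prices)
mean = statistics.mean(prices)
s = statistics.stdev(prices)  # sample standard deviation (n - 1 divisor)

# Standard error of the mean
standard_error = s / math.sqrt(n)

# Standardized distance of each data point from the mean
z = [(x - mean) / s for x in prices]

# Sample skewness (assumed to correspond to Excel's SKEW)
skewness = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)

# Sample excess kurtosis (assumed to correspond to Excel's KURT)
kurtosis = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

print(f"Standard Error: {standard_error:.4f}")
print(f"Skewness: {skewness:.6f}")
print(f"Kurtosis: {kurtosis:.6f}")
```

Both the skewness and the kurtosis come out negative for this data set, which indicates a slightly longer tail on the low side and a distribution somewhat flatter than the normal distribution.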

This concludes Exercise 2: Statistical Analysis using the Summary Statistics Tool
Exercise 3 demonstrates how to graphically represent your data using Excel to create a histogram.



EXERCISE 3
Introduction to Statistics Using Excel DATA REPRESENTATION
HISTOGRAMS

Exercise 3: Data Representation - Histograms

This exercise introduces a powerful tool which allows you to graphically represent
your data set in addition to analyzing its summary statistics. In this example we use
the same selling price information to construct a graphical representation of the data
called a histogram.

The Excel Tools covered in this exercise:


Tool 8 Data Analysis - Histograms

Creating Histograms (Frequency Histograms) using Excel

A Histogram is one method of graphically representing a set of data (other examples
include bar graphs, line graphs, and circle graphs). Graphs help to provide a sense of
shape to the data points which, in turn, may provide some insight to the analyst about
how the data is distributed. The shape of a data distribution can help us make
predictions (forecasts) about future events, which is one of the underlying goals of
statistical analysis. Some distributions may be grouped more densely around the low
end, high end, or middle of the data set. These tendencies translate into a series of
commonly occurring distributions like the normal distribution, the uniform distribution, and
left- or right-skewed distributions. For this exercise it is only important to realize that it is
usually helpful to graph a data set and histograms are one way to do so.

To generate a histogram, we must first define a range of selling price categories (called
Bins) so the histogram can assign each value to the appropriate category.

To do this, we have added a column next to Selling Price and labeled it Bin Range. The
Bin Range is the equally spaced set of categories we want to file each data point in.

To analyze the data and create the Histogram we will use the Histogram Tool which
is part of the Data Analysis Add-In.

Note: The Histogram Tool requires the Excel Data Analysis Add-In. If Data Analysis is not
available under the Tools menu, you will need to install the Analysis Toolpak. Under Tools,
click Add-Ins..., select the Analysis Toolpak and then click OK. If the Analysis Toolpak is
already checked, uncheck the box, click OK, and then repeat this procedure.

Step 1: Click on Tools from the Menu Bar and select Data Analysis
Step 2: Select Histogram and click OK
Step 3: With your cursor in the Input Range cell, use your mouse to highlight the
data in the Selling Price column, including the label
Step 4: With your cursor in the Bin Range cell, use your mouse to highlight the
data in the Bin Range column, including the label
Step 5: Check the Labels check box
Step 6: Under Output Options, place your cursor in the Output Range cell and use your
mouse to select the labeled output cell
Step 7: Check the Chart Output check box, then click OK

Selling Price Bin Range


$109,360 100000
$137,980 110000
$131,230 120000
$130,230 130000
$125,410 140000
$124,370 150000
$109,360 160000
$139,030 170000
$140,160 180000
$144,220 190000
$154,190 200000

Your Histogram and output table should match the one at the bottom of this page.

Output cell

*Note: You may need to resize the Histogram in order to see the y axis values.
This can be done by clicking on the Histogram and dragging one of the points at the
corner with the left mouse button held down.

This is what the data entry dialog box should look like:

Answer 8

Bin Range Frequency
100000 0
110000 2
120000 0
130000 2
140000 4
150000 2
160000 1
170000 0
180000 0
190000 0
200000 0
More 0

[Chart: Histogram of Frequency (vertical axis, 0 to 5) by Bin Range (horizontal axis)]
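
The frequency table above can be reproduced with a short Python sketch. It assumes that each bin value acts as an inclusive upper bound (a data point is counted in the first bin whose value is greater than or equal to it) and that anything above the last bin is counted under More; the variable names are illustrative only.

```python
from bisect import bisect_left

# Selling prices and bin values from the Exercise 3 data table
prices = [109360, 137980, 131230, 130230, 125410, 124370,
          109360, 139030, 140160, 144220, 154190]
bins = list(range(100000, 200001, 10000))  # 100000, 110000, ..., 200000

counts = [0] * len(bins)
more = 0  # data points larger than the last bin value
for price in prices:
    i = bisect_left(bins, price)  # index of the first bin value >= price
    if i == len(bins):
        more += 1
    else:
        counts[i] += 1

print("Bin Range Frequency")
for edge, count in zip(bins, counts):
    print(f"{edge:>9} {count}")
print(f"{'More':>9} {more}")
```

Because bisect_left returns the position of the first bin value that is not smaller than the price, a price exactly equal to a bin value is counted in that bin rather than the next one.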

This concludes Exercise 3: Data Representation - Histograms


Exercise 4 demonstrates how to analyze two-variable data sets using Correlation



EXERCISE 4
Introduction to Statistics Using Excel TWO-VARIABLE DATA
SET ANALYSIS
CORRELATION

Exercise 4: Analyzing Two-Variable Data Sets using Correlation

This exercise presents a data set with two residential real-estate variables: selling price and size in square feet.

The Excel Tools covered in this exercise:


Tool 9 Correlation Coefficient

Correlation Analysis using Excel

The correlation coefficient, a summary statistic, is often used to indicate the degree to which
two variables (x and y) are related (more specifically, the degree to which they are linearly related).

The correlation coefficient is represented by the letter r. A value of r near 0 implies little or
no relationship between x and y. An r value of 1 implies a perfect positive relationship
between x and y. The closer the value is to 1, the stronger the correlation. An r value of -1
implies a perfect negative relationship between x and y. The closer the value of r is to -1 the
stronger the negative correlation.

An example of positive correlation: as the number of rainy days goes up (monsoon season),
raincoat sales go up.
An example of negative correlation: as the number of rainy days goes down (during a drought),
bathing suit sales go up.

To perform correlation analysis we will use the Correlation Tool which is part of the Data Analysis Add-In.

Note: The Correlation Tool requires the Excel Data Analysis Add-In. If Data Analysis is not
available under the Tools menu, you will need to install the Analysis Toolpak. Under Tools,
click Add-Ins..., select the Analysis Toolpak and then click OK. If the Analysis Toolpak is
already checked, uncheck the box, click OK, and then repeat this procedure.

To generate the correlation coefficient ( r ) based on the data table below:

Step 1: Click on Tools from the Menu Bar and select Data Analysis
Step 2: Select Correlation and click OK
Step 3: With your cursor in the Input Range cell, use your mouse to highlight the
data in the Square Feet and Selling Price columns, including the labels
Step 4: Select the Columns option in the Grouped By section and check Labels in First Row
Step 5: Under Output Options, select Output Range and place your cursor in the
Output Range cell
Step 6: Use your mouse to select the labeled output cell, then click OK

Data Set

Square Feet Selling Price
1500 $100,000
1600 $110,000
1700 $150,000
1800 $185,000
1900 $187,000
2000 $188,000
2100 $192,000
2200 $195,000
2300 $197,000
2400 $200,000
2500 $210,000
2600 $215,000

Output cell

The r value is the value in the Selling Price row and the Square Feet column, in this case 0.8758699.

This is what the data entry dialog box should look like:

Answer 9

               Square Feet   Selling Price
Square Feet    1
Selling Price  0.8758699251  1

The r value is the value in the Selling Price row and the Square Feet column, in this case 0.8758699.
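
As a cross-check on the Correlation Tool, the following Python sketch computes the correlation coefficient directly. It assumes r is the standard Pearson correlation coefficient (the sum of the products of the x and y distances from their means, divided by the square root of the product of the summed squared distances); the function name pearson_r is hypothetical and not part of Excel or the workbook.

```python
import math

# Two-variable data set from Exercise 4
square_feet = list(range(1500, 2601, 100))  # 1500, 1600, ..., 2600
selling_price = [100000, 110000, 150000, 185000, 187000, 188000,
                 192000, 195000, 197000, 200000, 210000, 215000]


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(square_feet, selling_price)
print(f"r = {r:.7f}")
```

The order of the two arguments does not matter, since the correlation of x with y equals the correlation of y with x. This is why the Correlation Tool fills in only the lower half of its output table.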

This concludes Exercise 4: Analyzing Two-Variable Data Sets using Correlation


Exercise 5 demonstrates how to graphically represent a two-variable data set using the Excel Chart Wizard tool to
create a Scatter Diagram.



EXERCISE 5
Introduction to Statistics Using Excel DATA REPRESENTATION
SCATTER DIAGRAMS

Exercise 5: Data Representation - Scatter Diagrams

This exercise presents a data set with two residential real-estate variables: selling price and
size in square feet.

The Excel Tools covered in this exercise:


Tool 10 Using Chart Wizard to create a Scatter Diagram

Creating Scatter Diagrams using Excel

A Scatter Diagram (also called a scatter plot, scatter chart, or scattergram) plots the
points in a two-variable data set so that any relationship between the variables, such as
an approximate straight-line relationship, is easy to see. In Scatter
Diagrams the horizontal axis (the x axis) is labeled with one variable (in our example
we use Square Feet) and the vertical axis (the y axis) is labeled with the other variable
(in this case Selling Price). For each observation, a point is plotted whose
coordinates are that observation's values on both x and y.

This, like the Histogram demonstrated in Exercise 3, is another method of graphically
representing a set of data. In this case, it is the relationship between two variables in a
two-variable data set.

To generate a Scatter Diagram based on the data table below:

Step 1: Select Chart Wizard from the top toolbar (or select Chart from the Insert menu)
Step 2: Select the XY (scatter) chart type, Click the Next Button
Step 3: Place your cursor in the Data Range cell
Step 4: Highlight the entire contents of the table, including labels, Click the Next button
Step 5: Use the default values to show the legend at the right of the graph
Note: You can label the X and Y axes if you wish
Step 6: Select the option to place the chart as an Object in Ex.5, Click the Finish button

Square Feet Selling Price


1500 $100,000
1600 $110,000
1700 $150,000
1800 $185,000
1900 $187,000
2000 $188,000
2100 $192,000
2200 $195,000
2300 $197,000
2400 $200,000
2500 $210,000
2600 $215,000

Your Scatter Diagram should match the one at the bottom of this page.

Note: You may need to click on the chart and drag it into this space.

Answer 10:

[Chart: scatter diagram of Selling Price (vertical axis, $0 to $250,000) against Square Feet (horizontal axis, 1400 to 2800)]

This concludes Exercise 5: Data Representation - Scatter Diagrams


Exercise 6: Simple Forecasting - Adding Trendlines to your Scatter Diagrams will guide you through adding
a trendline to your scatter diagram.



EXERCISE 6
Introduction to Statistics Using Excel SIMPLE FORECASTING
TRENDLINES

Exercise 6: Simple Forecasting - Adding Trendlines to your Scatter Diagrams

This exercise will walk you through adding a forward-looking and a backward-looking trendline to the scatter
diagram we created in Exercise 5.

The Excel Tools covered in this exercise:


Tool 11 Adding a Trendline to a Scatter Diagram

Trendlines

Now that you have identified the straight-line relationship between the x and y points in the
data set, you may want to extrapolate to estimate values above or below the end-points of
your scatter diagram. Trendlines are used to analyze problems of prediction. You can extend
a trendline in a chart forward or backward beyond the actual data to show a trend. For
example, since the maximum house size for which we have data is 2600 square feet, to
forecast the price at 3000 square feet we will extend the trendline forward by 400 units. We
might also be interested in what a 1000 square foot house would sell for based on our sample
data; since the minimum house size in the data is 1500 square feet, we will extend the
trendline backward by 500 units.

Note: Although it is beyond the scope of this workbook, the Add Trendline feature uses a
concept known as regression to add the trendline (also known as a regression line) and
extend it beyond the points for which we have data in our data set. For more information
about regression and trendlines, consult the Introduction to Regression Using Excel
Workbook that is part of the HBS Toolkit.
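
To make the idea of extending a regression line concrete, here is a minimal Python sketch. It assumes the linear trendline is an ordinary least-squares fit of Selling Price on Square Feet, which is the usual meaning of a regression line; the workbook itself does not spell out the calculation. The names fit_line and forecast are hypothetical helpers.

```python
# Two-variable data set from Exercises 4 and 5
square_feet = list(range(1500, 2601, 100))  # 1500, 1600, ..., 2600
selling_price = [100000, 110000, 150000, 185000, 187000, 188000,
                 192000, 195000, 197000, 200000, 210000, 215000]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(square_feet, selling_price)


def forecast(sq_ft):
    """Estimate the selling price at a given size by reading off the fitted line."""
    return intercept + slope * sq_ft


price_3000 = forecast(3000)  # 400 units beyond the largest house in the data
price_1000 = forecast(1000)  # 500 units below the smallest house in the data

print(f"Slope: {slope:.2f} dollars per square foot")
print(f"Forecast at 3000 sq. ft.: ${price_3000:,.0f}")
print(f"Forecast at 1000 sq. ft.: ${price_1000:,.0f}")
```

Both forecasts lie outside the range of the observed data, so they are only as reliable as the assumption that the straight-line relationship continues to hold there.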

To add a trendline to the scatter diagram you created in exercise 5 follow these steps (a copy of the graph is
located below):

Step 1: Use the right mouse button to click on any data point in the graph
Step 2: Select Add Trendline from the menu
Step 3: Under the Type tab select Linear
Step 4: Under the Options tab in the Forecast section place your cursor in the Forward box
Step 5: Enter 400 units (3000 sq. ft. - 2600 sq. ft.)
Step 6: Now place your cursor in the Backward box
Step 7: Enter 500 units (1500 sq. ft. - 1000 sq. ft.), then click OK

[Chart: copy of the Exercise 5 scatter diagram of Selling Price (vertical axis, $0 to $250,000) against Square Feet (horizontal axis, 1400 to 2800)]

This is what the data entry dialog box should look like:

Answer 11:

[Chart: scatter diagram titled "Chart Title" showing the Selling Price data points and the Linear (Selling Price) trendline; vertical axis $0 to $250,000, horizontal axis 1400 to 2800]

This concludes Exercise 6: Simple Forecasting - Adding Trendlines to your Scatter Diagrams
This concludes the Introduction to Statistics Using Excel Workbook

