
BUSINESS STATISTICS

COURSE DESIGN COMMITTEE
TOC Reviewer: Mr. Ravindra Babu S, Visiting Faculty, NMIMS Global Access – School for Continuing Education. Specialization: Finance
Content Reviewer: Mr. Ravindra Babu S, Visiting Faculty, NMIMS Global Access – School for Continuing Education. Specialization: Finance

Author : DP Apte

Reviewed By : Mr. Ravindra Babu S

Copyright: 2015 Publisher
ISBN: 978-81-8323-129-9
Address: A-45, Naraina, Phase-I, New Delhi – 110 028
Only for NMIMS Global Access - School for Continuing Education
School Address: V. L. Mehta Road, Vile Parle (W), Mumbai – 400 056, India.

NMIMS Global Access – School for Continuing Education



C O N T E N T S

CHAPTER NO. CHAPTER NAME PAGE NO.

1 Introduction to Business Statistics 01

2 Descriptive Statistics: Collection, Processing and Presentation of Data 23

3 Measures of Central Tendency 75

4 Measures of Dispersion 113

5 Skewness and Kurtosis 147

6 Correlation Analysis 171

7 Regression Analysis 203

8 Theory of Probability 233

9 Probability Distribution 261

10 Use of Excel Software for Statistical Analysis 297

11 Case Studies 347


C U R R I C U L U M

Introduction to Business Statistics: Development of Statistics, Definitions of Statistics,
Importance of Statistics, Classification of Statistics, Role of Statistics, Functions of Statistics,
Limitations of Statistics

Descriptive Statistics: Collection, Processing and Presentation of Data: Descriptive and
Inferential Statistics, Collection of Data, Editing and Classification of Data, Classification of
Data, Tabulation of Data, Diagrammatic and Graphical Representation of Data

Measures of Central Tendency: Characteristics of Central Tendency, Arithmetic Mean,
Median, Mode

Measures of Dispersion: Characteristics of Measures of Dispersion, Absolute and Relative
Measures of Dispersion, Range, Interquartile Range and Deviations, Variance and Standard
Deviation, Case Study Problem covering Variance, Standard Deviation and Coefficient of
Variation

Skewness and Kurtosis: Karl Pearson’s Coefficient of Skewness (SKp), Bowley’s Coefficient of
Skewness (SKB), Kelly’s Coefficient of Skewness (Skk), Measures of Kurtosis, Moments

Correlation Analysis: Types of Correlation, Methods of Calculating Correlation, Scatter
Diagram Method, Co-variance Method – The Karl Pearson’s Correlation Coefficient, Rank
Correlation Method, Correlation Coefficient using Concurrent Deviation

Regression Analysis: Regression Analysis, Simple Linear Regression, Coefficient of
Regression, Non-linear Regression Models, Correlation Analysis vs Regression Analysis, Case
Study Problem based on Regression Analysis and Correlation Analysis

Theory of Probability: Important Terms in Probability, Kinds of Probability, Simple
Propositions of Probability, Addition Theorem of Probability, Multiplication Theorem of
Probability, Conditional Probability, Law of Total Probability, Independence of Events,
Combinatorial Concepts

Probability Distribution: Random Variable, Probability Distributions of Standard Random
Variables, Bernoulli Distribution, Binomial Distribution, Poisson Distribution, Normal
Distribution


Use of Excel Software for Statistical Analysis: Introduction to Excel, Entering Data in Excel,
Descriptive Statistics, Basic Built-in Functions (Average, Mean, Mode, Count, Max and Min),
Statistical Analysis, Normal Distribution, Brief about SPSS

C H A P T E R  1

INTRODUCTION TO BUSINESS STATISTICS

CONTENTS
1.1 Introduction
1.2 Development of Statistics
1.3 Definitions of Statistics
1.4 Importance of Statistics
1.5 Classification of Statistics
1.6 Role of Statistics
1.6.1 Role of Statistics in Business
1.6.2 Role of Statistics in Decision Making
1.6.3 Role of Statistics in Research
1.7 Functions of Statistics
1.7.1 Laws of Statistics
1.8 Limitations of Statistics
1.8.1 Common Statistical Issues
1.8.2 Distrust of Statistics
1.8.3 Misuse of Statistics
1.9 Summary
1.10 Descriptive Questions
1.11 Answers and Hints
1.12 Suggested Readings for Reference


INTRODUCTORY CASELET

CAMO “HELPING SMART PEOPLE GET SMARTER”


Turning Data into Knowledge
Most organizational theory explores how organizations function.
But little is known about the vast amounts of data that an
organization generates and how it is processed into knowledge to
significantly increase intelligence and achieve better success.
Further accentuating the problem is technology that has given
organizations the ability to collect and store vast amounts of
information into data warehouses.
To build a strong business intelligence solution, organizations
need to better utilize this data and integrate it into the knowledge
building process.
Structured vs Unstructured Data

Let us define what is meant by structured and unstructured
data. The unstructured data of an organization includes e-mail
correspondence, text documents, even voice and video. But in
large part, most of the information in an organization is structured.
This is the quantified information found in financial statements,
statistical reports and other sources that include responses to
surveys, point of sale information and sales reports. In essence, it is
the non-text data the organization generates.
Structured data is extremely difficult to organize. It is generally
fragmented, resting in different silos of the organization. Often,
organizations have little ability to aggregate the data and process
it in a manner that reveals the core information that constitutes a
robust knowledge base.
New and proven approaches to processing structured data are
quickly emerging. CAMO’s approach is to integrate a self-learning
system that continually aggregates data and correlates it into a
knowledge base. When applied to an organization, intelligence
increases exponentially as the “community” adds its own content
and feeds knowledge back into the system. The system itself is always
learning by continually processing structured data from multiple
data sources both internal and external to the organization. CAMO
believes that hybrid methods to integrate artificial intelligence
and advanced modeling techniques are the most powerful ways
an organization can process data into knowledge. What results
are sophisticated ways to classify customers, define and track
multiple market segments, make predictions, optimize processes
and simulate market conditions for product concept development.
The Intelligence Grows
As the organization accepts the knowledge, they contribute to it,
according to their own knowledge and expertise of the business
process. This creates a multiple increase in intelligence as people
from throughout the organization learn and contribute to the
knowledge core. Data warehousing, great analytics or extensive
data collection are part of the process but not what determines the
success of a corporation’s business intelligence investments. For
maximum results, building a knowledge base is crucial. But without
understanding how to process organizational data into knowledge,
organizations run the risk of developing a knowledge management
system that lacks the robustness and the attributes needed to gain
considerable market share in the general marketplace.

After studying this chapter, you should be able to:


  Understand the development, importance and role of
statistics
  Explain the basic concept of statistical studies
  Understand the application of statistics in business and
management
  Learn about functions and limitations of statistics

1.1 INTRODUCTION
Information derived from good statistical analysis is always precise
and never useless. One of the primary tasks of a manager is
decision-making. Decision-making is usually based on past experience
and future projections. In many situations, decision-making based purely
on personal experience, subjective judgment and intuition is rather
difficult and inefficient. Statistical techniques offer powerful tools
in the decision-making process. These tools have the power to interpret
quantitative information in a scientific and objective manner.
They also provide a conceptual framework to the decision
maker and enable him/her to comprehend qualitative information in
a more objective way.
It is said that, “There are three kinds of lies: lies, damned lies and
statistics.” Malcolm Forbes, publisher of Forbes magazine and an
adventurer, once got lost floating for miles in one of his famous
balloons and finally landed in the middle of a cornfield. He spotted a
man coming towards him and asked, “Sir, can you tell me where I am?”
The man said, “Certainly, you are in a basket in a field of corn.” Forbes
said, “You must be a statistician.” The man said, “That’s amazing,
how did you know that?” “Easy”, said Forbes, “Your information is
concise, precise and absolutely useless!”
The story is of course in a lighter vein. Nevertheless, it conveys two
points about the use of statistics. Firstly, good statistical tools must
assist the effective decision-making process. Thus, appropriateness of
the tool and interpretation of the results are essential ingredients for
decision-making. Secondly, the information derived from irrelevant
data may not lead to the right conclusion. However, in such a case the
statistician is to blame, and not the statistical tools.

1.2 DEVELOPMENT OF STATISTICS


The word statistics is derived from the Italian word ‘Stato’ which means
‘state’; and ‘Statista’ refers to a person involved with the affairs of state.
Thus, statistics originally was meant for collection of facts useful for
affairs of the state, like taxes, land records, population demography,
etc. There is evidence of the use of some of the principles of statistics by
the ancient Indian civilization. Some of the techniques find mention
in Vedic Mathematics. However, modern statistical methods spread
from Italy to France, Holland and Germany in the 16th century.
In ancient times, even before 300 BC, rulers and kings like
Chandragupta Maurya used statistics to maintain land and revenue
records, collect taxes and register births and deaths.
During the seventeenth century, statistics was used in Europe for a
variety of purposes, such as studying life expectancy and gambling. The
theoretical development of modern statistics began in the mid-seventeenth
century with the introduction of the ‘Theory of Probability’ and the ‘Theory
of Games and Chance’. Many famous problems posed by professional
gamblers, like ‘the problem of points’ (posed by Chevalier de Méré) and
‘the gambler’s ruin’, were solved by mathematicians. These
solutions laid the foundation of the theory of probability and statistics.
Some of the notable contributors to the development of statistics are
Pascal, Fermat, James Bernoulli, De Moivre, Laplace, Gauss, Euler,
Lagrange, Bayes, Kolmogorov and Karl Pearson. One of the most
significant works in modern times is by Ronald A. Fisher (1890-1962),
who is considered to be the ‘Father of Statistics’ by statisticians
all over the world. He applied statistics to diverse fields such
as education, agriculture, genetics, biometry and psychology. He also
pioneered ‘Estimation Theory’, ‘Exact sampling distribution’, ‘Analysis
of variance’ and ‘Experimental Design’.

Significant contributions have also been made by Indians in the field of
statistics. Prof Prasant Chandra Mahalanobis was the first to pioneer the
study of statistical science in India. He founded the Indian Statistical
Institute (ISI) in 1931. Mahalanobis viewed statistics as a tool for
increasing the efficiency of all human efforts and also concentrated
on sample surveys. He is known for his famous work on
an important statistic known as the D² statistic, which is very popular
among social scientists. Prof C.R. Rao is another Indian statistician
who made significant contributions in the field of statistical inference
and multivariate analysis.

Fill in the blanks:


1. ‘..................’ refers to a person involved with the affairs of state.
2. In ancient times, even before 300 BC, the rulers and
kings, like .................. .................. used statistics to maintain the
land and revenue records, collection of taxes and registration
of births and deaths.
3. Prof Prasant Chandra Mahalanobis founded the Indian
Statistical Institute (ISI) in ...................

Study the use of statistics in ancient times in India and other
parts of the world and prepare a comparative report on the same.

The history of statistics can be said to start around 1749, although,
over time, there have been changes to the interpretation of the
word statistics. In early times, the meaning was restricted to
information about states. This was later extended to include
collections of information of all types, and later still it was extended
to include the analysis and interpretation of such data. In modern
terms, “statistics” means both sets of collected information, as
in national accounts and temperature records, and analytical work
which requires statistical inference.
Statistical activities are often associated with models expressed
using probabilities, and require probability theory for them to be
put on a firm theoretical basis.

1.3 DEFINITIONS OF STATISTICS
Since all branches of study use statistics, there are a number of definitions
of statistics, each based on the way one looks at the application of
statistics. Some of the definitions appealing to the managerial
perspective are listed below.

“Statistics are the classified facts representing the conditions of
the people in the state…. specially those facts which can be stated
in number or in table of numbers or in any tabular or classified
arrangement”. – Webster
“By statistics we mean quantitative data affected to a marked
extent by multiplicity of causes”. –Yule and Kendall
“Statistics is a science of estimates and probability”. – Boddington
“Statistics is a method of decision-making in the face of uncertainty
on the basis of numerical data and calculated risk”.
– Prof. Ya-Lun-Chou
“Statistics may be defined as the science of collection, presentation,
analysis and interpretation of data”. – Croxton and Cowden
“Statistics is the science and art of handling aggregate of facts–
observing, enumerating, recording, classifying and otherwise
systematically treating them”. – Harlow
“Statistics are measurements, enumerations or estimates of natural
phenomenon, usually systematically arranged, analyzed and
presented as to exhibit important inter-relationships among them”.
– A.M. Tuttle

Thus, statistics is a science of collection, organisation, presentation,
analysis and interpretation of data, so that it helps a manager to take
effective and knowledgeable decisions under given circumstances.

State whether the following statements are true/false:


4. Statistics is not a science of estimates and probability.
5. Statistics is a method of decision-making in the face of
uncertainty on the basis of numerical data and calculated risk.

A statistic is a quantity that is calculated from a sample of data.
It is used to give information about unknown values in the
corresponding population. For example, the average of the data in
a sample is used to give information about the overall average in
the population from which that sample was drawn.
It is possible to draw more than one sample from the same
population, and the value of a statistic will in general vary from
sample to sample. For example, the average value in a sample is a
statistic. The average values in more than one sample, drawn from
the same population, will not necessarily be equal.
Statistics are often assigned Roman letters (e.g. m and s), whereas
the equivalent unknown values in the population (parameters) are
assigned Greek letters (e.g. μ and σ).
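
The difference between a statistic and a parameter can be illustrated with a short simulation. The sketch below is only an illustration: the population of customer purchases, the sample size of 30 and the helper name sample_means are assumptions made for the example, not data from this chapter.

```python
import random
import statistics

def sample_means(population, sample_size, n_samples, seed=0):
    """Draw several random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(population, sample_size))
            for _ in range(n_samples)]

# Hypothetical population: monthly purchases (in units) of 1,000 customers.
rng = random.Random(42)
population = [rng.randint(0, 50) for _ in range(1000)]

mu = statistics.mean(population)   # population parameter (one fixed value)
means = sample_means(population, sample_size=30, n_samples=5)

print(f"Population mean (parameter): {mu:.2f}")
for i, m in enumerate(means, start=1):
    print(f"Sample {i} mean (statistic): {m:.2f}")
```

Each run of the loop prints a different sample mean, while the population mean stays fixed. This is the sense in which a statistic varies from sample to sample.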

1.4 IMPORTANCE OF STATISTICS


Whatever be the field of application, complete information can seldom
be obtained due to cost and time factors. In real life, partial information
forms the basis of most of our decisions. Statistical techniques enable
us to:
• Identify what information or data is worth collecting,
• Decide when and how judgments may be made on the basis of
partial information, and
• Measure the extent of doubt and risk associated with the use of
partial information and stochastic processes.
The key distinction between normative (or judgmental) techniques
and statistical techniques lies in the estimate of the level of confidence
in a decision. Statistical methods are explicit in nature and provide a clearly
defined measure of error. Normative techniques, on the other hand,
are based on judgment and rules of thumb; they may help in effective
decision-making but fail to specify an estimate of error.
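
One common way of attaching a measure of error to an estimate is the standard error of a sample mean, together with an approximate confidence interval. The sketch below assumes a normal approximation with z = 1.96 for roughly 95% confidence; the delivery-time data and the helper name mean_with_error are hypothetical.

```python
import math
import random
import statistics

def mean_with_error(sample, z=1.96):
    """Return the sample mean, its standard error and an approximate
    95% confidence interval (normal approximation, z = 1.96)."""
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean, std_err, (mean - z * std_err, mean + z * std_err)

# Hypothetical data: delivery times (in days) for 40 sampled orders.
rng = random.Random(7)
sample = [rng.gauss(5.0, 1.5) for _ in range(40)]

mean, std_err, (low, high) = mean_with_error(sample)
print(f"Estimate: {mean:.2f} days, standard error {std_err:.2f}")
print(f"Approximate 95% interval: {low:.2f} to {high:.2f} days")
```

A rule of thumb might produce a similar point estimate, but it would not come with the interval that tells the decision-maker how far off the estimate could reasonably be.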

Fill in the blanks:


6. The key distinction between normative (or judgmental)
techniques and statistical techniques lies in the estimate of the level of
................... in a decision.
7. ................... techniques, based on judgment and rules of
thumb, may help in effective decision-making but fail to
specify an estimate of error.

List down the areas of daily life where you feel statistics plays
an important role, for example noticing average temperature in
summers, etc.

1.5 CLASSIFICATION OF STATISTICS
Statistical methods are broadly divided into five categories. These
categories are not mutually exclusive and are often found to
overlap.
• Descriptive Statistics: When statistical methods are used,
a problem is always formulated in terms of a ‘population’ or
‘universe’, which is defined as all the elements about which
conclusions or decisions are to be made. In statistics, the words
population and universe have a specific meaning.
We shall discuss exact definitions subsequently. For example,
if we want to find customer satisfaction, all our customers
represent the population. If information or data is taken from
each and every element of the population, we are dealing with
‘Descriptive Statistics’. In research vocabulary, such a process is
called a ‘Census’. This includes methods for collection, collation,
tabulation, summarization and analysis of the data on the entire
population. Averages, trends, index numbers, dispersion and
skewness help in summarizing and describing the main features
of the statistical data. The aim is primarily to present the data in a
form easily understandable to the decision-maker. One example
is the national census conducted every 10 years.
• Analytical Statistics: This deals with establishing relationships
between two or more variables. It includes methods
like correlation and regression, association of attributes,
multivariate analysis, etc., which facilitate comparison,
interpolation and extrapolation. In these cases, we require
multiple samples from different populations or from the same
population, for example, sales of a product before and after the
launch of a promotion campaign.

• Inductive Statistics: Decision making in most business
situations requires estimates about the future, such as trends and
forecasts. Inductive statistics includes methods that help in generalizing
trends based on random observations. This process provides
estimates indirectly on the basis of partial data, or forecasts
based on past data, for example, the future price of
a share based on the inflow of funds from foreign institutional
investors (FIIs).
• Inferential Statistics: Another way in which conclusions or
decisions are made is by using a portion of the population, or a sample
from the universe. The sample data is analyzed. Then, based on
the sample evidence, conclusions are generalized to the target
population. An exit poll during elections is an example of a sample
survey. This method is referred to as ‘Statistical Inference’.
Hypotheses and significance tests form an important part of
inferential statistics.
• Applied Statistics: It is the application of statistical methods
and techniques for solving real-life problems. Quality
control, sample surveys, inventory management, simulations,
quantitative analysis for business decision making, etc., form a
part of this category.
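
As a small illustration of the analytical category, the sketch below computes Karl Pearson’s correlation coefficient between two variables. The promotion spend and sales figures are hypothetical and chosen only for the example; the helper name pearson_r is likewise an assumption.

```python
import math

def pearson_r(x, y):
    """Karl Pearson's correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical monthly figures: promotion spend and units sold.
promotion_spend = [10, 12, 15, 18, 20, 25]
units_sold = [110, 118, 131, 140, 152, 170]

r = pearson_r(promotion_spend, units_sold)
print(f"Correlation between spend and sales: {r:.3f}")
```

A value of r close to +1 indicates that the two series move together. It does not by itself show that the promotion caused the increase in sales.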

Fill in the blanks:


8. ................... is defined as all the elements about which
conclusions or decisions are to be made.
9. ................... Statistics deals with establishing relationship
between two or more variables.
10. Hypotheses and significance tests form an important part of
................... statistics.
11. Quality control, sample surveys, inventory management,
simulations, quantitative analysis for business decision
making, etc., form a part of ................... statistics.

Discuss the practical application of all the five types of statistics in
your daily life.

In statistics, where classification is often done with logistic
regression or a similar procedure, the properties of observations
are termed explanatory variables (or independent variables,
regressors, etc.), and the categories to be predicted are known
as outcomes, which are considered to be possible values of
the dependent variable.

1.6 ROLE OF STATISTICS


The role of statistics in different areas is described below.

1.6.1 ROLE OF STATISTICS IN BUSINESS


Today, statistics is not restricted to information about the state but
extends to almost every realm of business. Statistics is concerned
with scientific methods of collecting, organizing, summarizing
and analyzing data. What is even more important is drawing valid
conclusions and making effective decisions based on such analysis.
The success of a business depends to a large extent on the accuracy
and precision of its forecasts. Statistics is an indispensable tool
of production control and market research. Statistical tools are
extensively used in business for time and motion study, consumer
behaviour study, investment decisions, performance measurement
and compensation, credit ratings, inventory management,
accounting, quality control, distribution channel design, etc. Hence,
an understanding of statistical concepts and knowledge of how to use
statistical tools are essential for today’s managers.

1.6.2 ROLE OF STATISTICS IN DECISION MAKING
Very often, people consider decision-making just as an act of selection
among alternatives. However, there are two more phases in
decision-making. Nobel Laureate Herbert A. Simon identified the phases of
decision-making as:
• Information gathering: Searching the environment for
information, called the intelligence activity.
• Generation of alternatives: Inventing, developing and analyzing
possible courses of action, called the design activity.
• Selection of alternatives: Selecting a particular course of action
from those available, called the decision activity.
The most important task of a manager is to take decisions in a given
situation that help an organization to achieve its goals. Management
is a process of converting information into action; this is what we call
decision-making. Decision-making is a deliberate thought process
that uses available data to develop alternatives and choose among them,
so as to find the best solution to the problem at hand.
Statistics and statistical tools play a vital role during all
three phases of decision-making. There are two basic approaches to
decision-making, namely, quantitative (or mathematical) and qualitative (or
rational, creative and judgmental). In the first approach, statistics
and mathematics play a dominant role. Even in the second approach,
statistics plays a role in the collection and presentation of data to support
the decision-maker’s intuition. The extent to which statistical and
mathematical tools can be used depends upon the situation.

These can be briefly classified as:
• Decision-making under certainty: These are deterministic
situations amenable to mathematical tools to the fullest extent.
• Decision-making under risk: These are stochastic situations
amenable to statistical tools to a large extent, supplemented by
rational decision-making.
• Decision-making under uncertainty: These are amenable to
judgmental and creative approaches.
It is observed that middle level and senior level managers primarily
deal with decision-making under risk or, in a few cases, decision-making
under uncertainty. Thus, knowledge of statistical and mathematical
computational tools is necessary, if not mandatory, for efficient and
effective decision-making. It is not necessary to apply all advanced
statistical tools in every situation, and certain tools may not be applicable
in some cases. Simple statistics like the average, weighted average,
percentage, standard deviation and index numbers reveal a great deal
of information in many decision-making scenarios. Exploratory
investigation may, however, require some advanced tools.
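
The simple statistics mentioned above take only a few lines to compute. The sketch below uses hypothetical regional sales and product margins; the figures and the helper name weighted_average are assumptions made for illustration.

```python
import statistics

# Hypothetical quarterly sales (in lakh rupees) for three regions.
sales = {
    "North": [42, 45, 47, 50],
    "South": [30, 38, 29, 41],
    "West": [55, 54, 56, 55],
}

for region, figures in sales.items():
    mean = statistics.mean(figures)
    spread = statistics.stdev(figures)   # sample standard deviation
    print(f"{region}: mean {mean:.1f}, standard deviation {spread:.1f}")

def weighted_average(values, weights):
    """Weighted average: sum(value * weight) / sum(weights)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Average margin (%) across three products, weighted by hypothetical turnover.
margins = [12.0, 8.0, 15.0]
turnover = [500, 300, 200]
print(f"Weighted average margin: {weighted_average(margins, turnover):.1f}%")
```

Even these basic measures show that the West region is the most stable and the South region the most volatile, which may be enough for a first decision about where to investigate further.
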
1.6.3 ROLE OF STATISTICS IN RESEARCH
Statistical analysis is a vital component in every aspect of research.
Social surveys, laboratory experiments, clinical trials, marketing
research, human resource planning, inventory management, quality
management, etc., require statistical treatment before arriving at
valid conclusions. Today, with the availability of computers, we can very
effectively apply statistical techniques in every field of knowledge. The
findings of any research have to be justified in the light of statistical
logic. In business situations, the use of statistical tools in marketing
research, operations research, forecasting, factor analysis, human
resource development, etc., could immensely benefit managers in
gaining competitive advantage, improving productivity and reducing costs.
Thus, every manager must be aware of statistical tools and should
have the knowledge to use them.

State whether the following statements are true/false:


12. Inventing, developing and analyzing possible courses of action
is called the intelligence activity.
13. Decision-making under certainty involves deterministic situations
amenable to mathematical tools to the fullest extent.
14. Decision-making under uncertainty involves stochastic situations
amenable to statistical tools to a large extent, supplemented
by rational decision-making.

List down the areas in your work situation where in your opinion
statistical tools would improve decision-making.

Statistics has an important role in determining the existing position
of per capita income, unemployment, population growth rate,
housing, schooling, medical facilities, etc. in a country. Statistics now
holds a central position in almost every field, such as Industry, Commerce,
Trade, Physics, Chemistry, Economics, Mathematics, Biology,
Botany, Psychology and Astronomy, so the application of statistics is
very wide.

1.7 FUNCTIONS OF STATISTICS

Functions of statistics are described below:
• Condensation: Statistics compresses a mass of figures into a small
amount of meaningful information, for example, average sales, BSE index
(SENSEX), growth rate. It is impossible to get a precise idea
about the profitability of a business from a record of income
and expenditure transactions. Information such as Return on
Investment (ROI), Earnings per Share (EPS), profit margins,
etc., however, can be easily remembered, understood and used
in decision-making.
• Comparison: Statistics facilitates the comparison of two related
quantities. For example, the Price to Earnings Ratio (PE Ratio) of
Reliance Industries stood at 17.5, as compared to the industry
figure of 13, showing the confidence of investors.
• Forecast: Statistics helps in forecasting by looking at trends. Forecasts
are essential for planning and decision-making. Predictions based
on gut feeling or hunch could be harmful for the business.
For example, to decide the refining capacity for a petrochemical
plant, we need to predict the demand for the petrochemical product
mix, supply of crude, cost of crude, substitution products, etc.,
over the next 15 to 25 years, before committing an investment.
• Testing of hypotheses: Hypotheses are statements about
the population parameters, based on our past knowledge or
information, whose validity we would like to check in the light
of current information. Inductive inference about the population
based on sample estimates involves an element of risk.
However, sampling keeps the costs of decision-making low.
Statistics provides a quantitative base for testing our beliefs about
the population.
• Preciseness: Statistics presents facts precisely in quantitative
form. Statements of fact conveyed in exact quantitative terms
are always more convincing than vague utterances. For example,
‘increase in profit margin is less in year 2006 than in year 2005’
does not convey a definite piece of information. On the other
hand, statistics presents the information more definitely, as in
“profit margin is 10% of the turnover in year 2006 against 12% in
year 2005”.
• Expectation: Statistics provides the basic building block for
framing suitable policies. For example, how much raw material
should be imported, how much capacity should be installed, or
how much manpower should be recruited depends upon the expected
value of the outcome of our present decisions.
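
The idea of an expected value in the last function above can be made concrete with a small calculation. The sketch below assumes a hypothetical capacity decision with three demand levels; the probabilities, payoffs and the helper name expected_value are all assumptions made for illustration.

```python
def expected_value(outcomes):
    """Expected value of a list of (probability, payoff) pairs."""
    total_p = sum(p for p, _ in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * payoff for p, payoff in outcomes)

# Hypothetical capacity decision: payoffs (in crore rupees) under low,
# medium and high demand, with assumed probabilities 0.3, 0.5 and 0.2.
options = {
    "small plant": [(0.3, 20), (0.5, 30), (0.2, 30)],
    "large plant": [(0.3, -10), (0.5, 40), (0.2, 80)],
}

for name, outcomes in options.items():
    print(f"{name}: expected payoff {expected_value(outcomes):.1f}")

best = max(options, key=lambda name: expected_value(options[name]))
print(f"Highest expected payoff: {best}")
```

With these assumed figures the large plant has the higher expected payoff (33 against 27), although it also carries the possibility of a loss. Expected value is therefore one input to the decision, not the whole of it.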

1.7.1 LAWS OF STATISTICS


There are two fundamental laws of statistics. These are:
• The Law of Statistical Regularity: This law states, “A moderately
large number of items, chosen at random from a large group, are
almost sure on an average to possess the characteristics of the
large group.” For example, it is difficult to predict the failure of an
individual machine or an accident on an expressway, but not difficult
to indicate what percentage of a large number of machines might
suffer a breakdown in a given period. Similarly, the average
number of accidents on an expressway would remain stable over a
fairly long period of time unless the conditions have changed
drastically.
• The Law of Inertia of Large Numbers: It states, ‘Other things
being equal, as the sample size increases the result tends to be
more reliable and accurate.’ As the sample size increases, the
effect of extreme values in the data is reduced because deviations
on both sides tend to compensate for each other. Thus, as the sample
size increases, the stability of the results improves and confidence
in our estimate of the population increases. In the limiting case,
if the sample size reaches the population size, we can exactly
describe the characteristics of the population.
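
The effect of sample size described by these laws can be seen in a short simulation. The sketch below is illustrative only: the population of invoice amounts, the number of trials and the helper name mean_abs_error are assumptions, and samples are drawn without replacement.

```python
import random
import statistics

def mean_abs_error(population, sample_size, trials, rng):
    """Average absolute gap between the sample mean and the population mean."""
    mu = statistics.mean(population)
    errors = [abs(statistics.mean(rng.sample(population, sample_size)) - mu)
              for _ in range(trials)]
    return statistics.mean(errors)

rng = random.Random(1)
# Hypothetical population: 5,000 invoice amounts between 100 and 10,000.
population = [rng.uniform(100, 10_000) for _ in range(5000)]

for n in (10, 100, 1000, 5000):
    err = mean_abs_error(population, n, trials=50, rng=rng)
    print(f"sample size {n:>4}: average error {err:8.2f}")
```

The average error shrinks as the sample grows, and it vanishes when the sample is the whole population, which corresponds to the limiting case mentioned above.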

Fill in the blanks:


15. ................... are statements about the population parameters,
based on our past knowledge or information, whose validity we would
like to check in the light of current information.
16. Statistics present facts precisely in ................... form.
17. The Law of Inertia of Large Number states, ‘Other things
being equal, as the sample size ................... the result tends to
be more reliable and accurate.’


Describe the stages of business where statistical analysis has
become necessary and is very important.

Statistics as a discipline is considered indispensable in almost all
spheres of human knowledge. There is hardly any branch of study
which does not use statistics. Scientific, social and economic studies
use statistics in one form or another. These disciplines make use
of observations, facts and figures, enquiries and experiments, etc.
using statistics and statistical methods. Statistics studies almost all
aspects in an enquiry.

1.8 LIMITATIONS OF STATISTICS

Statistical techniques, because of their flexibility and economy, have
become popular and are used in numerous fields. But statistics is not
a cure-all technique and has limitations. It cannot be applied to all
kinds of situations and cannot be made to answer all queries. The
major limitations are:
‰‰ Statistics deals only with those problems which can be
expressed in quantitative terms and are amenable to mathematical
and numerical analysis. It is not suitable for qualitative
data such as customer loyalty, integrity of employees, emotional
bonding, motivation, initiative, etc.
‰‰ Statistics deals only with collections of data, and no importance is
attached to an individual item.
‰‰ Statistical results are only approximate and not mathematically
exact. There is always a possibility of random error.
‰‰ Statistics, if used wrongly, can lead to misleading conclusions,
and therefore should be used only after complete understanding
of the process and conceptual base.

1.8.1 COMMON STATISTICAL ISSUES


There are different types of statistical issues faced by a researcher.
These are broadly classified into the following groups.
‰‰ Data collection and recording stage: These include sampling
plan, data collection and data representation.
‰‰ Computing basic statistics: These include proportions,
computing central tendency, variation and skewness, measuring
consistency of data, frequency distribution and cross tabulation.
‰‰ Statistical tests of hypotheses: These include comparison of
means, comparison of proportions, and comparison of variances.

‰‰ Associations and relationship: These include testing of
dependence between attributes, correlation and regression and
non-parametric methods.
‰‰ Multivariate method: These include factor analysis, cluster
analysis, discriminant analysis, probit and logit analysis, path
analysis, profile analysis, multivariate ANOVA, and analysis of
factorial experiments.
Each of these requires a fundamental understanding of its statistical
origin and purpose.

1.8.2 DISTRUST OF STATISTICS


Many managers have doubts about using the results of statistical analysis
for decision-making, particularly if the analysis goes against their
intuition. Some of them also relate it to past experience in which
statistical analysis misled them. The problem could be due to
the incorrect use of data. This happens due to a lack of understanding of
statistical principles or intentional fudging of the figures with ulterior
motives. As King says, “Statistics are like clay of which one can make
a god or devil as one pleases”. According to Bowley, “Statistics only
furnishes tools, necessary though imperfect, which are dangerous in
the hands of those who do not know its use and its deficiencies”. It is
often quoted by managers that “figures don’t lie, liars figure”.
The distrust of statistics among managers is the result of bad experience,
lack of understanding of (and hence of faith in) the method, complex and
voluminous data that overwhelm thinking, or simply a preference for
subjective judgments based on gut feelings.

1.8.3 MISUSE OF STATISTICS


More dangerous than distrust is misuse of statistics to draw convenient
conclusions to satisfy selfish or ulterior motives. Arguments and
analysis supported by facts, figures, charts, graphs, index numbers,
etc. are indeed very appealing and convincing. They can be used to
intimidate opposing views. Hence, statistics is open to manipulation.
Very common examples are the charges people make against successive
governments of fudging the figures to show how good their government
is as compared to the previous government. Business houses using
statistics to mislead the public and manipulate share prices is not
uncommon. The misuse, whether through ignorance or manipulation,
is a result of one or more of the following reasons.
‰‰ Bias in sampling due to short cuts, convenience, selectivity, or
purposeful manipulation.
‰‰ Inadequate sample size that is too small to represent the underlying
characteristics of the population. Statistical inference requires a
minimum specified size of sample.
‰‰ Changing definitions, weights, attributes, or sampling method
after commencement of data collection.

‰‰ Establishing absurd correlations or associations just because
independent data appear to move together.
‰‰ Comparing and drawing causal relationship between unrelated
variables based on association.
‰‰ Changing hypotheses after collecting and analyzing the data.
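
The point about absurd correlations can be made concrete with a short sketch. It assumes two hypothetical series, `series_a` and `series_b`, that are constructed independently but both grow over time; the helper names `pearson_r` and `first_differences` are chosen for this illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def first_differences(values):
    """Period-to-period changes, which remove a steady trend."""
    return [b - a for a, b in zip(values, values[1:])]


periods = range(20)
# Two hypothetical, unrelated series that both happen to grow over time.
series_a = [t + (-1) ** t for t in periods]
series_b = [2 * t + (t % 3) - 1 for t in periods]

r_levels = pearson_r(series_a, series_b)
r_changes = pearson_r(first_differences(series_a), first_differences(series_b))
print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

The correlation of the levels is close to 1 only because both series share an upward trend. Once the trend is removed by taking period-to-period changes, the correlation is close to 0, which suggests that the apparent association says nothing about a causal link.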

Fill in the blanks:


18. According to ..................., “Statistics only furnishes tools,
necessary though imperfect, which are dangerous in the hands
of those who do not know its use and its deficiencies”.
19. The ................... of statistics among managers is result of bad
experience, lack of understanding.
20. ................... in sampling due to short cuts, convenience,

S
selectivity, or purposeful manipulation.
Give a practical example from Indian industry where statistics has
been misused, resulting in losses.

Due to the limitations of statistics, an attitude of distrust towards it
has developed. There are some people who place statistics in
the category of lying and maintain that “there are three degrees
of comparison in lying: lies, damned lies and statistics”. But this
attitude is not correct. The person who is handling statistics may be
a liar or inexperienced. But that would be the fault not of statistics
but of the person handling them.
The person using statistics should not take them at face value.
He should check the result against an independent source. Also, only
experts should handle statistics; otherwise they may be misused.

1.9 SUMMARY
‰‰ Managerial decision-making can be made efficient and effective
by analyzing available data using appropriate statistical tools.
Statistical tools not only have application in research (marketing
research included) but also in other functional areas like quality
management, inventory management, financial analysis, human
resource planning and so on.
‰‰ The word statistics is derived from the Italian word ‘Stato’ which
means ‘state’; and ‘Statista’ refers to a person involved with the
affairs of state. Thus, statistics originally was meant for collection
of facts useful for affairs of the state, like taxes, land records,
population demography, etc.
‰‰ Significant contributions have also been made by Indians in the
field of statistics. Prof. Prasanta Chandra Mahalanobis was the first
to pioneer the study of statistical science in India. He founded
the Indian Statistical Institute (ISI) in 1931. Mahalanobis viewed
statistics as a tool for increasing the efficiency of all human efforts
and also concentrated on sample surveys.
‰‰ Statistics is the classified facts representing the conditions of the
people in the state…. specially those facts which can be stated
in number or in table of numbers or in any tabular or classified
arrangement.
‰‰ Statistical methods are broadly divided into five categories.
These are Descriptive Statistics, Analytical Statistics, Inductive
Statistics, Inferential Statistics and Applied Statistics.
‰‰ Statistics is an indispensable tool of production control and
market research. Statistical tools are extensively used in business
for time and motion study, consumer behaviour study, investment
decisions, performance measurements and compensations,
credit ratings, inventory management, accounting, quality
control, distribution channel design, etc.
‰‰ Statistical analysis is a vital component in every aspect of research.
Social surveys, laboratory experiment, clinical trials, marketing
research, human resource planning, inventory management,
quality management, etc., require statistical treatment before
arriving at valid conclusions.
‰‰ Functions of statistics are Condensation, Comparison, Forecast,
Testing of hypotheses, Preciseness and Expectation.
‰‰ Statistical techniques, because of their flexibility and economy,
have become popular and are used in numerous fields. But
statistics is not a cure-all technique and has limitations. It cannot
be applied to all kinds of situations and cannot be made to answer
all queries.
‰‰ More dangerous than distrust is misuse of statistics to draw
convenient conclusions to satisfy selfish or ulterior motives.
Arguments and analysis supported by facts, figures, charts,
graphs, index numbers, etc. are indeed very appealing and
convincing. They can be used to intimidate opposing views.
Hence, statistics is open to manipulation.


‰‰ Statistics: By statistics we mean quantitative data affected to
a marked extent by multiplicity of causes.
‰‰ Information: It is that which informs, i.e. that from
which data can be derived. Information is given through the
content of a message or through direct or indirect observation of
something.
‰‰ Universe: Universe refers to the entire collection of elements
that share defined characteristics.
‰‰ Population: A population refers only to those elements
that realistically might be selected as potentially accessible
elements by researcher. It is a collection of all the elements
that are under study.
‰‰ Variables: Variables used in statistics can be divided into
“dependent variable”, “independent variable”, or other.
‰‰ Parameter: A parameter is an important element to consider in
evaluation or comprehension of an event, project, or situation. In
statistics, it is a numerical characteristic of a population, such as
the population mean.

1.10 DESCRIPTIVE QUESTIONS
1. Define Statistics. Also discuss the development of statistics.
2. Who gave the following definitions of statistics?
(i) “Statistics are the classified facts representing the conditions
of the people in the state…. specially those facts which can
be stated in number or in table of numbers or in any tabular
or classified arrangement”.
(Bowley, Webster, King, Saligman)
(ii) “Statistics is the science of estimates and probabilities”.
(Webster, Secrist, Boddington, Yule & Kendall)
(iii) “Statistics is the science and art of handling aggregate of
facts– observing, enumerating, recording, classifying and
otherwise systematically treating them”.
(Harlow, Marshall, W.I. King, Croxton & Cowden)
3. Discuss the importance of statistics in our daily lives.
4. Write a short note on classification of statistics.
5. ‘Statistics is the backbone of decision-making’. Comment
6. Discuss the role of statistics in research.
7. What are the functions of statistics? Briefly explain each one of
them.
8. What are the two fundamental laws of statistics?

9. What are the limitations of Statistics? How can statistical
techniques be misused?
10. What are common statistical issues? How can statistics mislead
us?

1.11 ANSWERS AND HINTS


ANSWERS FOR SELF ASSESSMENT QUESTIONS

Topic                          Q. No.   Answers
Development of Statistics      1.       Statista
                               2.       Chandragupta Maurya
                               3.       1931
Definition of Statistics       4.       False
                               5.       True
Importance of Statistics       6.       Confidence
                               7.       Normative
Classification of Statistics   8.       Population
                               9.       Analytical
                               10.      Inferential
                               11.      Applied
Role of Statistics             12.      False
                               13.      True
                               14.      False
Functions of Statistics        15.      Hypotheses
                               16.      Quantitative
                               17.      Increases
Limitations of Statistics      18.      Bowley
                               19.      Distrust
                               20.      Bias

HINTS FOR DESCRIPTIVE QUESTIONS


1. Refer Sections 1.2 and 1.3
“Statistics are the classified facts representing the conditions of
the people in the state…. specially those facts which can be stated
in number or in table of numbers or in any tabular or classified
arrangement”. – Webster
Theoretical development of modern statistics took place in the
mid-seventeenth century with the introduction of the ‘Theory of
Probability’ and the ‘Theory of Games and Chance’. Many famous
problems posed by professional gamblers, like ‘the problem of points’
(posed by Chevalier de Méré) and ‘the gambler’s ruin’,
were solved by mathematicians.

2. Refer Section 1.3
Answers are
(i) Webster
(ii) Boddington
(iii) Harlow
3. Refer Section 1.4
Statistical techniques enable us to identify what information or
data is worth collecting, decide when and how judgments may be
made on the basis of partial information.
4. Refer Section 1.5
Statistical methods are broadly divided into five categories.
These are Descriptive Statistics, Analytical Statistics, Inductive
Statistics, Inferential Statistics, and Applied Statistics.
5. Refer Section 1.6.2
Statistics and statistical tools play a vital role during all
three phases of decisions. There are two basic approaches to
decision-making, namely, quantitative (or mathematical) and
qualitative (or rational, creative and judgmental). In the first
approach statistics and mathematics play a dominant role. Even
in the second approach statistics plays a role in the collection and
presentation of data to help the decision-maker’s intuition.
6. Refer Section 1.6
Statistical analysis is a vital component in every aspect of research.
Social surveys, laboratory experiment, clinical trials, marketing
research, human resource planning, inventory management,
quality management, etc., require statistical treatment before
arriving at valid conclusions.
7. Refer Section 1.7
Functions of statistics are Condensation, Comparison, Forecast,
Testing of hypotheses, Preciseness and Expectation.
8. Refer Section 1.7.1
The two laws are the Law of Statistical Regularity and the Law of
Inertia of Large Number.
9. Refer Section 1.8
Statistical techniques, because of their flexibility and economy,
have become popular and are used in numerous fields. But
statistics is not a cure-all technique and has limitations. It cannot
be applied to all kinds of situations and cannot be made to answer
all queries.
Arguments and analysis supported by facts, figures, charts,
graphs, index numbers, etc. are indeed very appealing and
convincing. They can be used to intimidate opposing views.
Hence, statistics is open to manipulation.
10. Refer Sections 1.8.1 and 1.8.2
There are different types of statistical issues faced by a
researcher. The distrust of statistics among managers is the result
of bad experience, lack of understanding of (and hence of faith in)
the method, complex and voluminous data that overwhelm thinking, or
simply a preference for subjective judgments based on gut
feelings.

1.12 SUGGESTED READINGS FOR REFERENCE


SUGGESTED READINGS
‰‰ R S Bhardwaj, Mathematics and Statistics for Business, Excel
Books, 2012
‰‰ D P Apte, Statistical Tools for Managers using MS Excel, Excel
Books, 2009
‰‰ S Jaisankar, Quantitative Techniques for Management Computer
based Problem Solving, Excel Books, 2005
‰‰ R Selvaraj, Quantitative Methods in Management, Problems and
Solutions, Excel Books, 2008
‰‰ J K Sharma, Fundamentals of Business Statistics, 2010
‰‰ Bierman H, Bonini C P, and Hausman W H, Quantitative Analysis
for Business Decisions, Homewood, Illinois: Richard D. Irwin, Inc.,
1973.

E-REFERENCES
‰‰ www.statistics.com/
‰‰ http://www.statsoft.com/
‰‰ http://www.stats.gla.ac.uk/steps/glossary/basic_definitions.html

C H A P T E R  2

DESCRIPTIVE STATISTICS: COLLECTION, PROCESSING
AND PRESENTATION OF DATA

CONTENTS
2.1 Introduction
2.2 Descriptive and Inferential Statistics
2.2.1 Descriptive Statistics
2.2.2 Inferential Statistics
2.3 Collection of Data
2.3.1 Types of Data – Primary and Secondary
2.3.2 Methods of Collecting Primary Data
2.3.3 Merits and Demerits of Collecting Primary Data
2.3.4 Methods of Collecting Secondary Data
2.3.5 Designing Questionnaire
2.4 Editing and Coding of Data
2.4.1 Editing Primary Data
2.4.2 Editing Secondary Data
2.4.3 Coding of Data
2.5 Classification of Data
2.5.1 Rules of Classification
2.5.2 Bases of Classification
2.5.3 Frequency Distribution
2.6 Tabulation of Data
2.6.1 Types of Tabulation
2.6.2 One-way Tabulation
2.6.3 Two-way Tabulation
2.6.4 Multi-way Tabulation
2.6.5 Advantages of Tabulation
2.7 Diagrammatic and Graphical Presentation of Data
2.7.1 Difference between Diagrams and Graphs
2.7.2 Types of Diagrams
2.7.3 Bar Diagram
2.7.4 Histogram
2.7.5 Pie Diagram
2.7.6 Frequency Polygon
2.7.7 Ogives
2.8 Summary
2.9 Descriptive Questions
2.10 Answers and Hints
2.11 Suggested Readings for Reference


INTRODUCTORY CASELET

PROFITABILITY

The profitability of a company is defined as the ratio of its operating
profit to its operating income, typically expressed as a percentage.
The following two charts show the operating income as well as the
profitability of six companies in the financial years 2001-02 and
2002-03.

[Figure: two charts for companies A to F, titled “Operating Income” (vertical axis: Operating Income (Crores)) and “Profitability”; the chart values are not reproduced here.]

1. Which company recorded the highest operating profit in F.Y.
2002-03?
(i) A     (ii) C     (iii) E     (iv) F
2. The average operating profit in F.Y. 2002-03, of companies with
profitability exceeding 10% in F.Y. 2002-03, is approximately:
(i) 17.5 crore  (ii) 25 crore  (iii) 27.5 crore  (iv) 32.5 crore
[Answers: 1: (iii) E   2: (iv) 32.5 crore]
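
The arithmetic behind both questions follows from the definition of profitability: operating profit equals operating income multiplied by profitability. The sketch below shows the calculation on hypothetical figures for three made-up companies P, Q and R. These numbers are assumptions for illustration and are not read from the caselet's charts, so the results do not correspond to the answers given above.

```python
def operating_profit(operating_income, profitability_pct):
    """Operating profit implied by income and profitability (in %)."""
    return operating_income * profitability_pct / 100


# Hypothetical (operating income in crore, profitability in %) pairs.
# These are NOT the values shown in the caselet's charts.
companies = {"P": (200, 12.0), "Q": (350, 8.0), "R": (150, 20.0)}

profits = {
    name: operating_profit(income, pct)
    for name, (income, pct) in companies.items()
}

# Question 1 style: company with the highest operating profit.
best = max(profits, key=profits.get)

# Question 2 style: average profit of companies with profitability above 10%.
above_10 = [profits[name] for name, (_, pct) in companies.items() if pct > 10]
average_above_10 = sum(above_10) / len(above_10)

print(best, profits[best], average_above_10)
```

With these assumed figures, the company with the largest operating income (Q) does not earn the largest operating profit, which is why both charts have to be read together.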


After studying this chapter, you should be able to:


  Describe descriptive and inferential statistics
  Explain collection, editing and classification of primary and
secondary data
  Define tabulation and presentation of data
  Understand diagrammatic and graphical presentation
 Understand Bar diagram, Histogram, Pie Diagram,
Frequency polygons and Ogives

2.1 INTRODUCTION
To make a decision in any business situation you need data. Facts
expressed in quantitative form can be termed as data. The success of any
statistical investigation depends on the availability of accurate and
reliable data. These depend on the appropriateness of the method
chosen for data collection. Therefore, data collection is a very basic
activity in decision-making. Data may be classified either as primary
data or secondary data.
Data collected from the field needs to be processed and analysed.
The processing is primarily editing, coding, classification and
tabulation of the data collected so that it is amenable to analysis.
Presentation of data can either be in tabular form or through
charts. Successful use of the collected data depends to a great extent
upon the way it is arranged, displayed and summarized.

2.2 DESCRIPTIVE AND INFERENTIAL STATISTICS
There are two major divisions of the field of statistics, namely
descriptive and inferential statistics. Both the segments of statistics
are important, and accomplish different objectives.

2.2.1 DESCRIPTIVE STATISTICS


Descriptive statistics is the type of statistics that probably comes to
most people’s minds when they hear the word “statistics.” Here
the purpose is to describe. Numerical measures are used to tell about
features of a set of data. There are a number of items that belong in
this portion of statistics, such as:
‰‰ The average of data which can be measured by mean, median,
mode or midrange
‰‰ The spread of a data set, which can be measured with the range
or standard deviation

‰‰ Other measurements such as skewness and kurtosis
‰‰ The exploration of relationships and correlation between paired
data
‰‰ The presentation of statistical results in graphical form
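
The numerical measures listed above can be computed directly. The following is a minimal sketch using Python's standard `statistics` module; the data set `daily_sales` is hypothetical and serves only to show the calculations.

```python
import statistics

# Hypothetical data: units sold per day over two weeks.
daily_sales = [12, 15, 11, 15, 18, 20, 15, 9, 14, 16, 15, 13, 17, 20]

# Measures of the average (central tendency).
mean = statistics.mean(daily_sales)
median = statistics.median(daily_sales)
mode = statistics.mode(daily_sales)
midrange = (min(daily_sales) + max(daily_sales)) / 2

# Measures of spread.
data_range = max(daily_sales) - min(daily_sales)
std_dev = statistics.stdev(daily_sales)  # sample standard deviation

print(f"mean = {mean}, median = {median}, mode = {mode}, midrange = {midrange}")
print(f"range = {data_range}, standard deviation = {std_dev:.2f}")
```

Each result is a single number that describes this particular data set. None of them, on its own, says anything about days that were not observed.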

2.2.2 INFERENTIAL STATISTICS


For inferential statistics we have to differentiate between two
groups. The population is the entire collection of individuals that
we want to study. It is typically impossible or infeasible to examine
each member of the population individually. So we have to choose a
representative subset of the population, called a sample.

Inferential statistics studies a statistical sample, and from this
analysis we are able to say something about the population from
which the sample came.

There are two major divisions of inferential statistics:
‰‰ A confidence interval gives a range of values for an
unknown parameter of the population by measuring a statistical
sample. We can express this in terms of an interval and the degree
of confidence that the parameter is within the interval.
‰‰ Tests of significance, or hypothesis testing, test a claim about
the population by analyzing a statistical sample. There is some
uncertainty in this process by its design. This we can express in
terms of a level of significance.

Difference between these Areas


As we have studied above, descriptive statistics is concerned with
telling about certain features of a data set. Although this helps in
learning the amount of spread and the center of the data, we cannot
make any kind of generalization with the help of descriptive statistics
alone. In descriptive statistics, measurements such as the mean and
standard deviation give us exact numbers. Though we may use descriptive
statistics in examining a statistical sample, this branch of statistics
does not allow us to say anything about the population based on the
results of the sample.
Inferential statistics is different from descriptive statistics in many
ways. Even though there are similar calculations, such as those for
the mean and standard deviation, the focus is entirely different for
inferential statistics. Inferential statistics starts by analyzing a sample
and then generalizes to a population. This information about a population
is not stated as a single number. Instead we express these parameters as a
range of potential values, along with a degree of confidence.
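
The contrast between the two areas can be sketched in a few lines. The example below assumes a hypothetical sample of 36 customers and uses a normal approximation for the 95% confidence interval. For small samples a t-distribution would normally be used instead and would give a somewhat wider interval.

```python
import math
import statistics

# Hypothetical sample: monthly spend (in rupees) of 36 surveyed customers.
sample = [
    520, 480, 610, 450, 530, 495, 570, 505, 460, 540, 515, 590,
    470, 525, 555, 500, 485, 620, 510, 535, 445, 565, 490, 580,
    475, 545, 600, 455, 530, 510, 560, 495, 520, 465, 550, 505,
]

n = len(sample)

# Descriptive statistics: exact numbers that describe this sample.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Inferential statistics: a 95% interval for the unknown population mean.
# Assumption: normal approximation, treating n = 36 as a large sample.
z = statistics.NormalDist().inv_cdf(0.975)
margin = z * sample_sd / math.sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin

print(f"sample mean = {sample_mean:.1f}, sample sd = {sample_sd:.1f}")
print(f"95% confidence interval: ({lower:.1f}, {upper:.1f})")
```

The first printed line is a description of the sample itself. The second is a statement about the population, given as a range together with a degree of confidence rather than as one number.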


Fill in the blanks:


1. The ................. of data which can be measured by mean,
median, mode or midrange.
2. ................. statistics studies a statistical sample, and from this
analysis we are able to say something about the population
from which the sample came.

It is important to know the difference between descriptive and
inferential statistics. This knowledge is helpful when we need to
apply it to a real world situation involving statistical methods.

2.3 COLLECTION OF DATA


The collection and analysis of data constitute the main stages of
execution of any statistical investigation. The procedure for collection
of data depends upon various considerations such as objective, scope,
nature of investigation, etc. Availability of resources like money, time,
manpower, etc., also affects the choice of a procedure. Data may be
collected either from a primary or from a secondary source. They are
described below.

2.3.1 TYPES OF DATA – PRIMARY AND SECONDARY


Data used in statistical study is termed either ‘primary’ or ‘secondary’
depending upon whether it was collected specifically for the study
undertaken or for some other purposes.
When the data used in a statistical study are collected under the
control and supervision of the investigator, such data are
referred to as ‘primary data’. Primary data are collected afresh and
for the first time, and thus happen to be original in character. On
the other hand, when the data are not collected for this purpose, but are
derived from other sources, such data are referred to as ‘secondary
data’. Generally speaking, secondary data are collected by some other
organization to satisfy its own need but are used by someone else for
entirely different reasons.
The difference between primary and secondary data is only one
of degree. For example, data which are primary in the hands of one
become secondary in the hands of another. Suppose an investigator
wants to study the working conditions of labourers in an industry. If
the investigator or his agent collects the data directly, then they are
called ‘primary data’. But if subsequently someone else uses these
collected data for some other purpose, then they become ‘secondary
data’.

2.3.2 METHODS OF COLLECTING PRIMARY DATA
Generally, for managerial decision-making, it is necessary to
analyze information regarding a large number of characteristics.
Collection of primary data may thus be time consuming and expensive,
and hence requires a great deal of deliberation. According to the
nature of information required, one of the following methods or a
combination of them could be selected.

Observation Method
In this method the investigator collects the data through his/her personal
observations. This method is very useful if data is created in the system
through capturing transactions. Computerized transaction processing
could be modified to generate the necessary data or information. An
investigator well versed with the system or a part of the system is ideally
suited for collecting this kind of data. Since the investigator is solely
involved in collecting the data, his/her training, skill, and knowledge
play an important role as far as the quality of the data is concerned.
Sometimes, audio/video aids could also be used to record the observations.
IM
Indirect Investigation
In this case, data is collected from a person who is likely to have
information about the problem under study. The information collected
by oral or written interrogation forms primary data. Usually enquiry
commissions, boards of investigation, investigation teams and
committees collect data in this manner. The quality of the data largely
depends upon the person interviewed, his/her motives, memory and
cooperation, and the interviewer’s repute and rapport with the person being
interviewed. We should be careful while collecting data by this method.

Questionnaire with Personal Interview


This is by far the most common and popular method. In this method,
individuals are personally interviewed and their answers recorded
to collect the data. The questionnaire is structured and followed in
a specific sequence. Occasionally, a part of the questionnaire may
be unstructured to motivate the interviewee to give additional
information or information on intimate matters. Accuracy of the data
depends on the ability, sincerity and tactfulness of the interviewer in
conducting the interview in a friendly and professional environment.

Mailed Questionnaire
In this case a structured questionnaire is mailed to selected persons with
a request to fill it in and return it. Supplementary information clarifying
terms, explaining the process, etc., is also attached with the questions. In
a few cases, inducements for filling in and returning the questionnaire
are also given. A covering letter with the questionnaire is necessary for
developing rapport, explaining the reason for collecting the data,
and alleviating the fears of the respondent, if any. It is assumed that the
respondents are literate and can answer the questions without any
ambiguity. This is a less expensive and faster method to collect a large
volume of data, over a wide geographic area, in standard form, and at
the convenience of the respondent. This method is, therefore, very
popular and extensively used. However, we must guard against two
disadvantages of this method, viz. the absence of an interviewer, resulting
in a large proportion of non-response, and the possibility of lowering the
reliability of the responses if the respondent is not motivated enough.
These shortcomings could be reduced by increasing the sample size and
by comprehensive design of the questionnaire.
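
The effect of non-response on the number of questionnaires to be mailed is a simple calculation. The sketch below assumes that the investigator can guess an expected response rate in advance; the function name `questionnaires_to_mail` and the figures used are hypothetical.

```python
import math


def questionnaires_to_mail(required_responses, expected_response_rate):
    """Number of questionnaires to send so that, on average, the
    required number of completed responses comes back."""
    if not 0 < expected_response_rate <= 1:
        raise ValueError("response rate must be in (0, 1]")
    return math.ceil(required_responses / expected_response_rate)


# Hypothetical case: 400 completed responses needed, 25% expected to reply.
print(questionnaires_to_mail(400, 0.25))
```

A larger mailing only compensates for the number of missing replies. If those who do not reply differ systematically from those who do, the results may still be biased, which is why questionnaire design matters as well.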

Telephonic Interview
This method is less expensive but limited in scope, as the respondent
must possess a telephone and have it listed. Further, the respondent
must be available and in the frame of mind to provide correct
answers. This method is comparatively less reliable for public surveys.
However, for industrial surveys, in developed regions, and with known
customers, this method could be the best suited. Obviously, in this
method there is a limit to the number of questions that the interviewee
could answer in three to four minutes. If there are just three to five
yes/no type questions and two to three short questions, this method
is very efficient.

Internet Surveys
Of late, Internet surveys have become popular. These are less
expensive, fast and can be interactive. However, their scope is limited
to those who have regular Internet access. With rapid growth in
personal computers and Internet connectivity, this is likely to become
one of the main methods of collecting primary data. With its
interactivity and multimedia facilities it combines the advantages of
other methods.

2.3.3 MERITS AND DEMERITS OF COLLECTING PRIMARY DATA
The type of research, its purpose, and the conditions under which the data
are obtained will determine the method of collecting the data. If relatively
few items of information are required quickly, and funds are limited,
telephonic interviews are recommended. If respondents are industrial
clients, the Internet could also be used. If in-depth interviews and probing
techniques are to be used, it is necessary to employ investigators to
collect data. Thus, each method has its utility and none is superior in
all situations. We could combine two methods to improve the quality
of data collected. For example, when a wide geographical area is
being covered, mail questionnaires supplemented by personal
interviews will yield more reliable results.

Merits and Demerits of Observation Method


Merits
‰‰ Original data are collected.
‰‰ Collected data are more accurate and reliable.

‰‰ The investigator can modify or put indirect questions in order to
extract satisfactory information.
‰‰ The collected data are often homogeneous and comparable.
‰‰ Some additional information may also get collected, along with
the regular information, which may prove to be helpful in future
investigations.
‰‰ Misinterpretations or misgivings, if any, on the part of the
respondents can be avoided by the investigators.
‰‰ Since the information is collected from the persons who are well
aware of the situation, it is likely to be unbiased and reliable.
‰‰ This method is particularly suitable for the collection of
confidential information. For example, a person may not like to
reveal his habit of drinking, smoking, gambling, etc., which may
be revealed by others.
Demerits
‰‰ This method is expensive and time consuming, particularly when
the field of investigation is large.
‰‰ It is not possible to properly train a large team of investigators.
‰‰ The bias or prejudice of investigators can affect the accuracy of
data to a large extent.
‰‰ Data are collected as per the convenience and willingness of the
respondents.
‰‰ The persons, providing the information, may be prejudiced or
biased.
‰‰ Since the interest of the person, providing the information, is not
at stake, the collected information is often vague and unreliable.
‰‰ The information collected from different persons may not be
homogeneous and comparable.

Merits and Demerits of Questionnaire Method


Merits
‰‰ This method is useful for the collection of information from an
extensive area of investigation.
‰‰ This method is economical as it requires less time, money and
labour.
‰‰ The collected information is original and more reliable.
‰‰ It is free from the bias of the investigator.
Demerits
‰‰ Very often, there is a problem of ‘non-response’ as the respondents
are not willing to provide answers to certain questions.

‰‰ The respondents may provide wrong information if the questions
are not properly understood.
‰‰ It is not possible to collect information if the respondents are not
educated.
‰‰ Since it is not possible to ask supplementary questions, the
method is not flexible.
‰‰ The results of an investigation are likely to be misleading if the
attitude of the respondents is biased.
‰‰ The process is time consuming, particularly when the information
is to be obtained by post.

2.3.4 METHODS OF COLLECTING SECONDARY DATA


Secondary data are data that have been collected/analyzed by some other
agency for another purpose.

Sources of secondary data could be:
‰‰ Various publications of central, state and local governments. This
is an important and reliable source to get unbiased data.
‰‰ Various publications of foreign governments or of international
bodies. Although this is a good source, the context in which the data
were collected needs to be verified before using them. For international
situations this data could be very useful and authentic.
‰‰ Journals of trade, commerce, economics, science, engineering,
medicine, etc. This data could be very reliable for a specific
purpose.
‰‰ Other published sources like books, magazines, newspapers,
reports, etc.
‰‰ Unpublished data, based on internal records and documents of
an organization, could provide the most authentic and much cheaper
information, provided we could identify the source. Diaries,
letters, etc. could also provide secondary data. The problem with
unpublished data is that it is difficult to locate and get access to.

2.3.5 DESIGNING QUESTIONNAIRE


The success of collecting data through a questionnaire depends
mainly on how skilfully and imaginatively the questionnaire has
been designed. A badly designed questionnaire will never be able to
gather the relevant data. In designing the questionnaire, some of the
important points to be kept in mind are:
‰‰ Covering letter: Every questionnaire should contain a covering
letter. The covering letter should highlight the purpose of study and
assure the respondent that all responses will be kept confidential.
It is desirable that some inducement or motivation is provided to
the respondent for better response. The objectives of the study
and questionnaire design should be such that the respondent
derives a sense of satisfaction through his involvement.

‰‰ Number of questions should be kept to the minimum: The
fewer the questions, the greater the chances of getting a better
response and of having all the questions answered. Otherwise
the respondent may lose interest and provide inaccurate
answers, particularly towards the end of the questionnaire. As
a rough indication, the number of questions should be between
10 and 20. If the number of questions has to be more than 25, it is
desirable that the questionnaire be divided into various parts to
ensure clarity.
‰‰ Questions should be simple, short and unambiguous: The
questions should be simple, short, and easy to understand and
such that their answers are unambiguous. For example, if the
question is, “Are you literate?” the respondent may have doubts
about the meaning of literacy. To some, literacy may mean a
university degree whereas to others even the capacity to read
and write may mean literacy. Hence, it is desirable to specify
“Have passed (a) high school (b) graduation (c) post graduation”.
‰‰ Type of questions: Questions can be of Yes/No type, or of multiple
choices depending on the requirement of the investigator. Open-ended
questions should generally be avoided.
‰‰ Questions of sensitive or personal nature should be avoided: The
questions should not require the respondent to disclose any private,
personal or confidential information. For example, questions
relating to sales, profits, marital happiness, tax liability, etc., should
be avoided as far as possible. If such questions are necessary in the
survey, an assurance should be given to the respondent that the
information provided shall be kept strictly confidential and shall
not be used at any cost to respondent’s disadvantage.
‰‰ Answers to questions should not require calculations: The
questions should be framed in such a way that their answers do
not require any calculations.
‰‰ Logical arrangement: The questions should be logically arranged
so that there is a continuity of responses and the respondent
does not feel the need to refer back to the previous questions.
It is desirable that the questionnaire should begin with some
introductory questions followed by vital questions crucial to the
survey and ending with some light questions so that the overall
impression of the respondent is a happy one.
‰‰ Crosscheck and footnotes: The questionnaire should contain
some such questions, which act as a crosscheck to the reliability
of the information provided. For example, when a question
relating to income is asked, it is desirable to include a question:
“Are you an income tax payer?” Certain questions might create
a doubt in the mind of respondents. For the purpose of clarity,
it is desirable to give footnotes. The purpose of footnotes is to
clarify all possible doubts, which may emerge from the questions
and cannot be removed while framing them. For example, if a

question relates to income limits like 1000-2000, 2000-3000, etc., a
person getting exactly ₹ 2,000 should know in which income class
he has to place himself.
‰‰ Pre-test the questionnaire: Once the questionnaire has been
designed, it is important to pre-test it. The pre-testing is also known
as pilot survey because it precedes the main survey work. Pre-testing
allows rectification of problems, inconsistencies, repetition,
etc. Proper testing, revising and re-testing yield high dividends.

Fill in the blanks:


3. ................. data are collected afresh and for the first time, and
thus, happen to be original in character.
4. The difference between primary and secondary data is only in
terms of ................. .
5. ................. ................. is less expensive but limited in scope as
the respondent must possess a telephone and have it listed.
6. Once the questionnaire has been designed, it is important to
................. it.
State whether the following statements are true/false:
7. Data derived from other existing sources is referred to as
‘secondary data’.
8. The questions should be simple, short, easy to understand and
such that their answers are highly ambiguous.
9. The pre-testing is also known as pilot survey because it
succeeds the main survey work.
10. Secondary data is too expensive to get and scrutinize; and
hence not used very often.

Prepare a questionnaire for finding out students’ satisfaction at any
training institute.

There are two important techniques of data collection: (i) a census
enquiry, which implies complete enumeration of each unit of the
universe, and (ii) a sample survey, in which only a small part of the
group, taken as representative, is considered. For example, the
population census in India implies the counting of each and every
human being within the country.


2.4 EDITING AND CODING OF DATA


Between the two stages of collection of data and analysis of data there
is always an intermediate stage, known as the editing of data.

The process of editing refines the collected data by checking
inconsistencies, inaccuracies, illegible writings and other types of
deficiencies or errors present in the collected information.

2.4.1 EDITING PRIMARY DATA


Once the questionnaires have been filled and the data collected, it
is necessary to edit this data to ensure completeness, consistency,
accuracy and homogeneity.
‰‰ Completeness: Each questionnaire should be complete in all
respects, i.e. the respondent should have answered each and
every question. If some important questions have been left
unanswered, attempts should be made to contact the respondent
and get the response. If despite all efforts, answers to vital
questions are not given, such questionnaires should be dropped
from final analysis.
‰‰ Consistency: Questionnaire should be checked to see that there
are no contradictory answers. Contradictory responses may arise
due to wrong answers filled up by the respondent or because of
carelessness on the part of the investigator in recording the data.
‰‰ Accuracy: The questionnaire should be checked for the accuracy
of information provided by the respondent. This is the most
difficult job of the investigator and at the same time the most
important one. If inaccuracies are permitted, they will lead to
misleading results. Responses may be randomly crosschecked for
inaccuracies by a supervisor.
‰‰ Homogeneity: It is important to check whether all the
respondents have understood the questions in the same sense.
For instance, if there is a question on income, it should be very
clearly stated whether it refers to weekly, monthly, or yearly
income and checked that the respondents have answered in the
same way.
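
The completeness and consistency checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`age`, `monthly_income`, `years_employed`), the choice of vital questions and the contradiction rule are hypothetical and do not come from any particular survey.

```python
# Hypothetical list of vital questions; a real survey would define its own.
VITAL_QUESTIONS = ("age", "monthly_income")


def is_complete(record):
    """True when every vital question has been answered."""
    return all(record.get(q) not in (None, "") for q in VITAL_QUESTIONS)


def is_consistent(record):
    """Hypothetical contradiction check: years employed cannot exceed age."""
    age = record.get("age")
    years_employed = record.get("years_employed")
    if age is None or years_employed is None:
        return True  # nothing to compare
    return years_employed <= age


def edit_records(records):
    """Split records into those kept for analysis and those flagged for follow-up."""
    kept, flagged = [], []
    for record in records:
        if is_complete(record) and is_consistent(record):
            kept.append(record)
        else:
            flagged.append(record)
    return kept, flagged


# Illustrative (made-up) questionnaire records.
records = [
    {"id": 1, "age": 34, "monthly_income": 42000, "years_employed": 10},
    {"id": 2, "age": 29, "monthly_income": None, "years_employed": 5},
    {"id": 3, "age": 25, "monthly_income": 30000, "years_employed": 40},
]
kept, flagged = edit_records(records)
print([r["id"] for r in kept], [r["id"] for r in flagged])
```

Flagged records would first be referred back to the respondent; only if vital answers still cannot be obtained would they be dropped from the final analysis.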

2.4.2 EDITING SECONDARY DATA


The editing of the data is a process of examining the raw data to detect
errors and omissions and to correct them, if possible, so as to ensure
completeness, consistency, accuracy and homogeneity. Editing can be
done at two stages:
‰‰ Field editing: The field editing consists of reviewing the
interviewer’s report for completeness and translating what
the interviewer has written in abbreviated form at the time of

interviewing the respondent. This sort of editing should be
done as soon as possible after the interview, as memory recall
diminishes with time. Care should be taken that the interviewer
does not complete the information by simply guessing.
‰‰ Central editing: When all forms are filled up completely and
returned to the headquarters, central editing is carried out.
The editor may correct the obvious errors. If necessary, the
respondent may be contacted for clarification. All the incorrect
replies, which are obvious, must be deleted.

2.4.3 CODING OF DATA

Coding is the process of assigning symbols, either alphabetical
or numerical or both, to the answers so that the responses can be
recorded into a limited number of classes or categories.

The classes should be appropriate to the research problem being
studied. They must be exhaustive and must be mutually exclusive,
so that the answer can be placed in one and only one cell in a given
category. Further, every class must be defined in terms of only one
concept. The coding is necessary for the efficient analysis of data. The
coding decisions should usually be taken at the designing stage of the
questionnaire so that the likely responses to questions are pre-coded.
This simplifies computer tabulation of the data for further analysis.
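
As a rough sketch of pre-coding, the Python snippet below assigns numeric codes to the answers of an education question. The code values and the catch-all code 9 are assumptions made for illustration; the catch-all keeps the classes exhaustive, and the one-to-one lookup keeps them mutually exclusive.

```python
# Hypothetical codebook for the question "Have passed (a) high school
# (b) graduation (c) post graduation".
EDUCATION_CODES = {
    "high school": 1,
    "graduation": 2,
    "post graduation": 3,
}
OTHER_CODE = 9  # assumed catch-all so that every answer falls in some class


def code_response(answer):
    """Return the numeric code for one answer (exactly one code per answer)."""
    key = answer.strip().lower()
    return EDUCATION_CODES.get(key, OTHER_CODE)


# Illustrative (made-up) responses.
responses = ["Graduation", "high school", "Post Graduation", "diploma"]
coded = [code_response(r) for r in responses]
print(coded)
```

Fixing the codebook at the questionnaire design stage means the coded column can go straight into tabulation.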

Fill in the blanks:


11. Between the two stages of collection of data and analysis
of data there is always an intermediate stage, known as the
................. of data.
12. The ................. ................. consists of reviewing the interviewer’s
report for completeness and translating what the interviewer
has written in abbreviated form at the time of interviewing the
respondent.
13. ................. is the process of assigning some symbols either
alphabetical or numeral or both to the answers so that the
responses can be recorded into a limited number of classes or
categories.

Visit any newspaper printing agency. Study the process of editing
primary and secondary data there and prepare a short report.


2.5 CLASSIFICATION OF DATA

Classification refers to the grouping of data into homogeneous
classes and categories. It is the process of arranging things in
groups or classes according to their resemblances and affinities.

2.5.1 RULES OF CLASSIFICATION


The principal rules of classifying data are:
‰‰ To condense the mass of data in such a way that salient features
can be readily noticed; for example, household incomes can be
grouped as higher income group, middle-income group and
lower income group based on certain criterion.
‰‰ To facilitate comparison between attributes of variables; for
example, comparison between education and income, income
and expenditure on consumer durables, etc.
‰‰ To prepare data for tabulation.
‰‰ To highlight the significant features; for example, data is
concentrated on one side, or one particular value may be dominant.
‰‰ To enable grasp of data.
‰‰ To study the relationship.

2.5.2 BASES OF CLASSIFICATION


Some common types of bases of classification are:
‰‰ Geographical classification: In this type, the data is classified
according to area or region; for example, state wise industrial
production, city wise consumer behaviour, area wise sales
figures, etc.
‰‰ Chronological classification: In this type, the data is classified
according to the time of its occurrence; for example, monthly
sales, yearly production, daily demands, etc.
‰‰ Qualitative classification: When the data is classified according
to attributes which are not capable of measurement, it is known
as qualitative classification. In dichotomous classification, an
attribute is divided into two classes, one possessing the attribute
and the other not possessing it; for example, sex, smoker or
non-smoker, employed or unemployed, etc. In many-fold classification,
an attribute is divided so as to form several classes; for example,
education level, religion, mother tongue, etc.
‰‰ Quantitative classification: It refers to the classification of
data according to some characteristics that can be measured; for
example, salary, age, height, etc. Quantitative data may be further
classified into one of two types, discrete and continuous. In the
discrete type, the values the variable can take are countable (the
set of values may even be infinite; for example, the integers).
Examples of these are number of accidents, number of defectives,
etc. In the case of continuous quantities, data can take any real
value; for example, weight, distance, volume, etc.

2.5.3 FREQUENCY DISTRIBUTION

Classification of data, showing the different values of a variable
and their respective frequency of occurrence is called a frequency
distribution of the values.

There are two kinds of frequency distributions, namely, discrete
frequency distribution (or simple, or ungrouped frequency
distribution), and continuous frequency distribution (or condensed
or grouped frequency distribution).
Discrete Frequency Distribution
The process of preparing discrete frequency distribution is simple.
First, all possible values of variables are arranged in ascending order
in a column. Then, another column of ‘Tally’ mark is prepared to count
the number of times a particular value of the variable is repeated. To
facilitate counting, a block of five ‘Tally’ marks is prepared. The last
column contains frequency. To illustrate this let us consider one example.
Example: Construct frequency distribution table for the following
data of number of family members in 30 families:

4 3 2 3 4 5 5 7 3 2
3 4 2 1 1 6 3 4 5 4
2 7 3 4 5 6 2 1 5 3
Solution: The discrete frequency distribution with the help of tally
mark is shown below:

Number of Family Members    Tally Marks    Frequency
1                           |||            3
2                           ||||           5
3                           |||| ||        7
4                           |||| |         6
5                           ||||           5
6                           ||             2
7                           ||             2
Total                                      N = 30
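
The same table can be reproduced with a short Python sketch. It uses `collections.Counter` from the standard library in place of tally marks; the function name `discrete_frequency_distribution` is chosen here only for illustration.

```python
from collections import Counter

# Number of family members in 30 families (data from the example above).
family_sizes = [
    4, 3, 2, 3, 4, 5, 5, 7, 3, 2,
    3, 4, 2, 1, 1, 6, 3, 4, 5, 4,
    2, 7, 3, 4, 5, 6, 2, 1, 5, 3,
]


def discrete_frequency_distribution(values):
    """Return (value, frequency) pairs sorted by value in ascending order."""
    counts = Counter(values)
    return sorted(counts.items())


distribution = discrete_frequency_distribution(family_sizes)
for value, frequency in distribution:
    print(f"{value:>2}  {frequency:>2}")
print("Total N =", sum(frequency for _, frequency in distribution))
```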

Continuous Frequency Distribution
For continuous data a ‘grouped frequency distribution’ is necessary.
For discrete data, discrete frequency distribution is better than array,
but this does not condense the data. ‘Grouped frequency distribution’
is useful for condensing discrete data by putting them into smaller
groups or classes called class-intervals. Some important terms used
in case of continuous frequency distribution are as follows:
‰‰ Class limits: Class limits denote the lowest and highest value
that can be included in the class. The two boundaries of class
are known as the lower limit and upper limit of the class. For
example, 10-19.5, 20-29.5, where 10 and 19.5 are limits of the first
class; 20 and 29.5 are limits of second class, etc.
‰‰ Class intervals: The class interval represents the width (span or
size) of a class. The width may be determined by subtracting the
lower limit of one class from the lower limit of the following class.
For example, classes 10-20, 20-30, etc. have class interval 20 - 10
= 10.
‰‰ Class frequency: The number of observations falling within a
particular class is called its class frequency. Total frequency
indicates the total number of observations, N = Σf.
‰‰ Class mark or class mid-point: Mid-point of a class is defined as
sum of two successive lower limits divided by 2. Thus class mark
is the value lying halfway between lower and upper class limits.
For example, classes 10-20, 20-30, etc. have class marks 15, 25, etc.
‰‰ Types of class intervals: There are different ways in which limits
of class intervals can be shown.
‰‰ Exclusive method: The class intervals are so arranged that upper
limit of one class is the lower limit of next class. This method
always presumes that the upper limit is excluded from the class,
for example, with class limits 20-25, 25-30 observation with value
25 is included in class 25-30.
‰‰ Inclusive method: In this method, the upper limit of the class is
included in that class itself. In such case there is no overlap of
upper limit of former class and lower limit of successive class.
For example, with class limits 20-29.5, 30-39.5, 40-49.5, etc. there
is no ambiguity but values from 29.5 to 30 or 39.5 to 40 etc. are
not allowed.
‰‰ Open end: In an open-end distribution, the lower limit of the
very first class and/or upper limit of the last class is not given.
For example, while stating the distribution of monthly salary of
managers in rupees, one may specify class limits as, below 15000,
15000-25000, 25000-35000, 35000-45000, above 45000. Similarly,
while recording weights of college students in kg as grouped data
the class intervals could be less than 50, 50 to 60, 60 to 70, 70 to 80,
80 to 90 and greater than 90.

‰‰ Unequal class interval: This is another method to limit the
class intervals where the width of the classes is not equal for
all classes. This method is of practical use when there are large
gaps in the data, or distribution of the data is uneven. It is used
for explaining, visualizing and plotting data with unequal class
interval. However, we must adjust formulae for calculations
accordingly.
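
The terms defined above (class interval, class mark and the exclusive method) can be checked with a small Python sketch. The helper names are hypothetical and equal-width classes are assumed.

```python
def class_width(lower_limits):
    """Width: lower limit of the following class minus lower limit of this class."""
    return lower_limits[1] - lower_limits[0]


def class_mark(lower_limit, next_lower_limit):
    """Mid-point: sum of two successive lower limits divided by 2."""
    return (lower_limit + next_lower_limit) / 2


def find_class_exclusive(value, lower_limits, width):
    """Return the (lower, upper) class holding value; the upper limit is excluded."""
    for lower in lower_limits:
        if lower <= value < lower + width:
            return (lower, lower + width)
    return None  # value falls outside every class


# Classes 10-20, 20-30, 30-40.
limits = [10, 20, 30]
width = class_width(limits)
marks = [class_mark(low, low + width) for low in limits]

# Classes 20-25 and 25-30: the value 25 belongs to 25-30 under the exclusive method.
placed = find_class_exclusive(25, [20, 25], 5)
print(width, marks, placed)
```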

Guideline for Choosing the Class


‰‰ Number of classes should not be too small or too large, preferably
between 5 and 15.
‰‰ If possible, the widths of the intervals should be numerically
simple like 5, 10, 15, etc.
‰‰ It is desirable to have classes of equal width.
‰‰ Starting point of class should begin with 0, 5, 10 or multiple
thereof.
‰‰ Class interval should be determined based on maximum values
and number of classes to be formed.
All the above points can be explained with the help of the following
example.
Example: Ages of 50 employees are given:

22 21 37 33 28 42 56 33 32 59
40 47 29 65 45 48 55 43 42 40
37 39 56 54 38 49 60 37 28 27
32 33 47 36 35 42 43 55 53 48
29 30 32 37 43 54 55 47 38 62
Prepare a frequency distribution table.
Solution: A frequency distribution table is prepared as follows:
‰‰ First, find the highest and lowest values. These are 65 and 21
respectively. Thus, the difference is 44.
‰‰ Since the total observations are 50 we decide to select 5 classes.
‰‰ The approximate class interval works out to be (65-21)/5 = 8.8.
Hence, we select class interval as 10.
‰‰ As our lowest value is 21, we start from the lower class limit of the
first class as 20. We use exclusive method of class interval.
‰‰ We then decide class intervals as 20-30, 30-40, 40-50, 50-60 and
60-70.
‰‰ Then, each observation is checked for the class interval in which
it lies. For each observation, we make a tally mark against the
corresponding class interval. As per the convention, every fifth
tally is put horizontally across. This helps quick counting.

The frequency distribution is given below:
Age (Years)
Class Interval    Class Mark    Tally               Frequency
20-30             25            |||| ||             7
30-40             35            |||| |||| |||| |    16
40-50             45            |||| |||| ||||      15
50-60             55            |||| ||||           9
60-70             65            |||                 3
                                                    Total = 50
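
The steps of this solution can be sketched in Python as a cross-check. The function `grouped_frequency` is a hypothetical helper that assumes equal-width classes and the exclusive method, so a value equal to an upper limit is counted in the next class.

```python
# Ages of 50 employees (data from the example above).
ages = [
    22, 21, 37, 33, 28, 42, 56, 33, 32, 59,
    40, 47, 29, 65, 45, 48, 55, 43, 42, 40,
    37, 39, 56, 54, 38, 49, 60, 37, 28, 27,
    32, 33, 47, 36, 35, 42, 43, 55, 53, 48,
    29, 30, 32, 37, 43, 54, 55, 47, 38, 62,
]


def grouped_frequency(values, start, width, n_classes):
    """Count values per class [lower, upper) for equal-width classes."""
    frequencies = [0] * n_classes
    for value in values:
        index = (value - start) // width  # exclusive method
        if not 0 <= index < n_classes:
            raise ValueError(f"{value} lies outside the chosen classes")
        frequencies[index] += 1
    return [
        ((start + i * width, start + (i + 1) * width), frequency)
        for i, frequency in enumerate(frequencies)
    ]


# Approximate class interval: (highest - lowest) / number of classes.
approx_width = (max(ages) - min(ages)) / 5

# The text rounds the interval up to 10 and starts the first class at 20.
table = grouped_frequency(ages, start=20, width=10, n_classes=5)
for (lower, upper), frequency in table:
    print(f"{lower}-{upper}: {frequency}")
```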

Cumulative and Relative Frequency


In many situations rather than listing the actual frequency opposite
each class, it may be appropriate to list either cumulative frequencies
or relative frequencies or both.

Cumulative Frequencies

The cumulative frequency of a given class interval represents
the total of all the previous class frequencies including the class
against which it is written.

Relative Frequencies

Relative frequency is obtained by dividing the frequency of each
class by the total number of observations (total frequency).

If we multiply relative frequency by 100, we get percentage frequency.


There are two important advantages in looking at relative frequencies
(percentages) instead of the absolute frequencies in a frequency
distribution. These are:
‰‰ Relative frequencies facilitate the comparison of two or more
sets of data.
‰‰ Relative frequencies constitute the basis of understanding the
concept of probability.
To explain the cumulative and relative frequencies we work these on
our earlier problem.
Example: Ages of 50 employees are given:

22 21 37 33 28 42 56 33 32 59
40 47 29 65 45 48 55 43 42 40
37 39 56 54 38 49 60 37 28 27
32 33 47 36 35 42 43 55 53 48
29 30 32 37 43 54 55 47 38 62

Find cumulative frequency, relative frequency and percentage frequency.
Solution:

Class       Class          Cumulative       Relative        Percentage
Interval    Frequency      Frequency        Frequency       Frequency
20-30       7              (0+7) = 7        7/50 = 0.14     14
30-40       16             (7+16) = 23      16/50 = 0.32    32
40-50       15             (23+15) = 38     15/50 = 0.30    30
50-60       9              (38+9) = 47      9/50 = 0.18     18
60-70       3              (47+3) = 50      3/50 = 0.06     6
            N = ∑f = 50                     Total = 1       Total = 100
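
The three derived columns can also be computed in a few lines of Python. This is a minimal sketch that starts from the class frequencies of the table above and uses `itertools.accumulate` for the running total.

```python
from itertools import accumulate

class_intervals = ["20-30", "30-40", "40-50", "50-60", "60-70"]
frequencies = [7, 16, 15, 9, 3]  # class frequencies from the table above

total = sum(frequencies)                             # N = 50
cumulative = list(accumulate(frequencies))           # running total of frequencies
relative = [f / total for f in frequencies]          # frequency / N
percentage = [100 * f / total for f in frequencies]  # relative frequency x 100

for row in zip(class_intervals, frequencies, cumulative, relative, percentage):
    print("{:<6} {:>3} {:>3} {:>5.2f} {:>4.0f}".format(*row))
```

The last cumulative frequency always equals N, which gives a quick check on the arithmetic.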

Fill in the blanks:

14. ................. refers to the grouping of data into homogeneous
classes and categories.
15. There are two kinds of frequency distributions, namely,
................. frequency distribution and ................. frequency
distribution.
16. Class ................. denote the lowest and highest value that can
be included in the class.
17. The number of observations falling within a particular class is
called its class ................. .
18. In ................. method, the upper limit of the class is included in
that class itself.
19. ................. frequency is obtained by dividing the frequency of
each class by the total number of observations.
State whether the following statements are true/false:
20. In geographical classification, the data is classified according
to area or region.
21. In qualitative classification, the data is classified according to
the time of its occurrence.
22. The class interval represents the width (span or size) of a class.
23. In Exclusive method, the class intervals are so arranged that
upper limit of one class is the lower limit of next class.
24. The relative frequency of a given class interval thus, represents
the total of all the previous class frequencies including the
class against which it is written.

Obtain the data of the salary or age of employees in your company.
Construct the frequency distribution table. Using this, comment on
the organizational structure of your company.


2.6 TABULATION OF DATA


Once the raw data is collected, it needs to be summarized and
presented to the decision-maker in a form that is easy to comprehend.
The manager must be able to look at the data so as to decide what
further analysis is required. Tabulation helps this process through
effective presentation.

Tabulation is arranging the data in a flat table (two-dimensional
array) format by grouping the observations.

A table is a grid of rows and columns, with headings and
stubs indicating the class of the data.

Tabulation not only condenses the data, but also makes it easy to
understand. Tabulation is the fastest way to extract information from
the mass of data and hence popular even among those not exposed to
the statistical method. The report card of a school is the most common
example.

Objectives of Tabulation
The main objectives of tabulation are:
‰‰ To simplify complex data.
‰‰ To highlight chief characteristics of the data.
‰‰ To clarify objective of investigation.
‰‰ To present data in a minimum space.
‰‰ To detect errors and omissions in the data.
‰‰ To facilitate comparison of data.
‰‰ To facilitate reference.
‰‰ To identify trend and tendencies of the given data.
‰‰ To facilitate statistical analysis.

Main Parts of a Table


The main parts of a table are given below:
‰‰ Table Number: This number is helpful in the identification of a
table. This is often indicated at the top of the table.
‰‰ Title: Each table should have a title to indicate the scope, nature
of contents of the table in an unambiguous and concise form.
‰‰ Captions and stubs: A table is made up of rows and columns.
Headings or subheadings used to designate columns are called
captions while those used to designate rows are called stubs. A

caption or a stub should be self explanatory. A provision of totals
of each row or column should always be made in every table by
providing an additional column or row respectively.
‰‰ Main Body of the Table: This is the most important part of the
table as it contains numerical information. The size and shape of
the main body should be planned in view of the nature of figures
and the objective of investigation. The arrangement of numerical
data in main body is done from top to bottom in columns and
from left to right in rows.
‰‰ Ruling and Spacing: Proper ruling and spacing is very important
in the construction of a table. Vertical lines are drawn to separate
various columns with the exception of sides of a table. Horizontal
lines are normally not drawn in the body of a table; however, the
totals are always separated from the main body by horizontal
lines. Further, the horizontal lines are drawn at the top and the
bottom of a table.
Spacing of various horizontal and vertical lines should be done
depending on the available space. Major and minor items should
be given space according to their relative importance.
‰‰ Head-note: A head-note is often given below the title of a table
to indicate the units of measurement of the data. This is often
enclosed in brackets.
‰‰ Foot note: Abbreviations, if any, used in the table or some other
explanatory notes are given just below the last horizontal line in
the form of footnotes.
‰‰ Source-note: This note is often required when secondary data
are being tabulated. This note indicates the source from where
the information has been obtained. Source note is also given as
a footnote.
Example: The main parts of a table can also be understood by looking
at its broad structure given below:

Structure of a Table
Table No: .............
Title: .....................
Stub Captions Captions Total
Heading Captions  Captions Captions  Captions   Captions

Stub
Entries MAIN BODY

Total

Foot Note:
Source:

Rules for Tabulation


Now, let us learn about the general rules of tabulation.
‰‰ The table should be simple and compact, containing only the
essential details.
‰‰ Tabulation should be in accordance with the objective of
investigation.
‰‰ The unit of measurements must always be indicated in the table.
‰‰ The captions and stubs must be arranged in a systematic manner
so that it is easy to grasp the table.
‰‰ A table should be complete and self explanatory.
‰‰ As far as possible the interpretative figures like totals, ratios and
percentages must also be provided in a table.
‰‰ The entries in a table should be accurate.
‰‰ Table should be attractive to draw the attention of readers.

2.6.1 TYPES OF TABULATION


Statistical tables can be classified into various categories depending
upon the basis of their classification. Broadly speaking, the basis of
classification can be any of the following:


‰‰ Purpose of investigation
‰‰ Nature of presented figures
‰‰ Construction
Different types of tables, thus, obtained are shown in the following
chart.

Figure 2.1: Classification of Table

Classification on the basis of purpose of investigation
These tables are of two types viz. General purpose table and Special
purpose table.
‰‰ General purpose table: A general purpose table is also called
a reference table. This table facilitates easy reference to
the collected data. In the words of Croxton and Cowden, “The
primary and usually the sole purpose of a reference table is to
present the data in such a manner that the individual items may
be readily found by a reader.” A general purpose table is formed
without any specific objective, but can be used for a number of
specific purposes. Such a table usually contains a large mass of
data and is generally given in the appendix of a report.
An example of general purpose table is as follows:

TABLE 2.1: REPORT FORMS NAME CODES FOR
GENERAL PURPOSE REPORTS OF GINT
Position    Description    Values
1           Type           LOG (or L) - Log
                           FNC (or F) - Fence
                           GRF (or G) - Graph
                           GTB (or T) - Graphical Table
                           GTD (or X) - Graphical Text Document
                           HST (or H) - Histogram
                           TTB (or A) - Text Table
                           TXD (or D) - Text Document
                           SMP - Site Map
2           Paper Size     A - US Letter
                           4 - ISO A4
                           B - US 11×17
                           3 - ISO A3
                           L - US Legal
3           Content        G - Predominantly Geotechnical
                           E - Predominantly Environmental
                           R - Predominantly Rock Core
                           N - Not Applicable or can be generally used
4           Well           W - Has well (applies to logs & fences)
                           N - No well or Not Applicable
5           Graph          G - Has graph (applies to logs & fences)
                           N - No graph or Not Applicable
6           Legend         L - Has legend
                           N - No legend or Not Applicable

‰‰ Special purpose table: A special purpose table is also called a
text table or a summary table or an analytical table. Such a table
presents data relating to a specific problem. According to H.
Secrist, “These tables are those in which are recorded, not the
detailed data which have been analyzed, but rather the results of
analysis.” Such tables are usually of smaller size than the size of
reference tables and are generally found to highlight relationship
between various characteristics or to facilitate their comparisons.
Classification on the basis of the nature of presented figures
Tables, when classified on the basis of the nature of presented figures,
can be primary tables or derivative tables.
‰‰ Primary Table: A primary table is also known as an original table;
it contains data in the form in which they were originally collected.
‰‰ Derivative Table: A derivative table presents figures like totals,
averages, percentages, ratios, coefficients, etc., derived from
original data. A table of time series data is an original table, but
a table of trend values computed from the time series data is
known as a derivative table.
Classification on the basis of construction
Tables, when classified on the basis of construction, can be simple
tables, complex tables or cross-classified tables.
‰‰ Simple Table: In this table the data are presented according to
one characteristic only. This is the simplest form of a table and is
also known as a table of first order.
Example: The following blank table, for showing the number of
workers in each shift of a company, is an example of a simple
table.

Shifts    No. of Workers
I
II
III
Total
‰‰ Complex Table: A complex table is used to present data according
to two or more characteristics. Such a table can be two-way,
three-way or multi-way, etc.
 Two-way Table: Such a table presents data that is classified
according to two characteristics. In such a table the columns
of a table are further divided into sub-columns.

Example: An example of such a table is given below.

Shifts    No. of Workers         Total
          Males     Females
I
II
III
Total

 Three-way Table: When three characteristics of data are
shown simultaneously, we get a three-way table as shown
in the example below.
Example:

          No. of Workers
Shifts    Males                        Females                      Total No. of
          Skilled  Unskilled  Total    Skilled  Unskilled  Total    Workers
I
II
III

 Multi-way Table: If each shift is further classified into
three departments, say, manufacturing, packing and
transportation, we shall get a four-way table, and so on.
‰‰ Cross-classified Table: Tables that classify entries in both
directions, i.e., row-wise and column-wise, are called
cross-classified tables. The two ways of classification are such that
each category of one classification can occur with any category of
the other. Cross-classified tables can also be constructed for
more than two characteristics. A cross-classification can also
be used for analytical purposes, e.g., it is possible to make certain
comparisons while keeping the effect of other factors constant.
Example: Draw a blank table to show the population of a city according
to age, sex and unemployment in various years.
                Population (in thousands)
                Employed                               Unemployed
Years  Sex      Age:                                   Age:
                below 20  20-60  60 & above  Total     below 20  20-60  60 & above  Total
1991   Males
       Females
       Total
1992   Males
       Females
       Total

The table can be extended for the years 1993, 94, 95, 96, etc.

Example: In a sample study about the coffee habit in two towns, the
following information was received:
Town A: Females were 40%; total coffee drinkers were 45%; and male
non-coffee drinkers were 20%.
Town B: Males were 55%; male non-coffee drinkers were 30%; and
female coffee drinkers were 15%.
Represent the above data in a tabular form.
Solution: The figures are in percentages.

                       Town A                    Town B
Habit                  Males  Females  Total     Males  Females  Total
Coffee Drinkers          40       5      45        25      15      40
Non-coffee Drinkers      20      35      55        30      30      60
Total                    60      40     100        55      45     100
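The missing cells in the coffee example follow from simple subtraction. The sketch below is only an illustration of that arithmetic; the function name `complete_town` and its parameters are hypothetical and not part of the text.

```python
def complete_town(males, male_non_drinkers, *, total_drinkers=None, female_drinkers=None):
    """Fill a habit-by-sex percentage table from three known figures.

    Either total_drinkers (Town A style) or female_drinkers (Town B style)
    must be supplied; every other cell is obtained by subtraction.
    """
    females = 100 - males
    male_drinkers = males - male_non_drinkers
    if female_drinkers is None:
        female_drinkers = total_drinkers - male_drinkers
    female_non_drinkers = females - female_drinkers
    return {
        "Coffee Drinkers": (male_drinkers, female_drinkers,
                            male_drinkers + female_drinkers),
        "Non-coffee Drinkers": (male_non_drinkers, female_non_drinkers,
                                male_non_drinkers + female_non_drinkers),
        "Total": (males, females, 100),
    }

# Town A: females 40% (so males 60%), total coffee drinkers 45%, male non-drinkers 20%
town_a = complete_town(males=100 - 40, male_non_drinkers=20, total_drinkers=45)
# Town B: males 55%, male non-drinkers 30%, female coffee drinkers 15%
town_b = complete_town(males=55, male_non_drinkers=30, female_drinkers=15)

for name, town in (("Town A", town_a), ("Town B", town_b)):
    print(name)
    for habit, (m, f, t) in town.items():
        print(f"  {habit:<20}{m:>5}{f:>5}{t:>5}")
```

Each tuple holds the Males, Females and Total columns for one row of the table.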

2.6.2 ONE-WAY TABULATION


Tabulation is primarily counting how many observations fall in
a particular category. Tabulation is like an in-process inventory:
it may not in itself be the end of statistical processing. It may
be noted that once we tabulate the data, we usually do not go back to
the raw data, so any improper tabulation would mislead the
decision-maker in further processing. Hence, before tabulation, the
manager must give sufficient thought to what kind of tabulation is
required for decision-making. We first need to decide the
characteristics, their values and ranges, the title of the table,
stubs for the rows, headings for the columns, the scale and
dimensions used, footnotes, pivots if needed, etc. The table must
suit the purpose for which the data is being processed. We also need
to decide on the size of the table, clarity, approximations,
boundaries, appearance, order, readability, etc. A meaningful title
not only helps the manager to focus on the purpose, and thus group
the data properly, but also helps others who refer to the table
later. The next step is to decide appropriate column headings, row
stubs, units and dimensions of the quantities used, labels for
summary figures, etc., to improve the readability of the table. Many
times the requirement of statistical analysis is to count the
frequency of each distinct value of a variable. When we arrange the
values (or ranges of values) and their frequencies, the tabulation is
known as one-way. The variable could be either quantitative or
normative.

For example, examination result of MBA could be tabulated as,

Class Number of Students (Frequency f)


Distinction (≥75%) 26
First Class (60-75%) 72
Second Class (50-60%) 94
Pass Class (40-50%) 42
Fail 16
Total Students Appeared 250

Foot Note
‰‰ Each class includes its lower limit.
‰‰ Fail indicates failure in any one or more subjects irrespective of
the percentage marks.
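A one-way tabulation of this kind is just a frequency count over classes. The following is a minimal sketch with a handful of made-up student records (they are hypothetical and are not the 250 students of the table above); it also assumes that an overall percentage below 40 counts as Fail, which the footnote does not state explicitly.

```python
from collections import Counter

# Hypothetical (percentage, passed_all_subjects) records, for illustration only.
results = [(82.0, True), (75.0, True), (64.5, True), (60.0, True),
           (55.0, True), (49.9, True), (40.0, True), (71.0, False), (35.0, True)]

def classify(percentage, passed_all_subjects):
    """Assign a result class; each class includes its lower limit."""
    if not passed_all_subjects or percentage < 40:  # assumption: below 40% is Fail
        return "Fail"
    if percentage >= 75:
        return "Distinction"
    if percentage >= 60:
        return "First Class"
    if percentage >= 50:
        return "Second Class"
    return "Pass Class"

# Count how many observations fall in each class.
frequency = Counter(classify(p, ok) for p, ok in results)

order = ["Distinction", "First Class", "Second Class", "Pass Class", "Fail"]
for name in order:
    print(f"{name:<14}{frequency[name]:>4}")
print(f"{'Total':<14}{sum(frequency.values()):>4}")
```

Note that the record (71.0, False) is counted under Fail even though its percentage falls in the First Class range, in line with the second footnote.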

Example: Represent the following information in a table:
The number of students in a college in the year 1961 was 1100; of
those 980 were boys and the rest were girls. In 1971 the number of boys
increased by 100% and that of girls increased by 300% as compared to
their strength in 1961. In 1981 the total number of students in the
college was 3600, the number of boys being double the number of girls.
Solution:

Year    Number of Boys    Number of Girls    Total Students
1961          980               120               1100
1971         1960               480               2440
1981         2400              1200               3600
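The entries of this table can be checked with a few lines of arithmetic. This is only a sketch of the calculation; the variable names are illustrative.

```python
# 1961: total and boys are given, girls follow by subtraction.
boys_1961 = 980
girls_1961 = 1100 - boys_1961

# 1971: an increase of 100% doubles the figure; an increase of 300% quadruples it.
boys_1971 = boys_1961 * (100 + 100) // 100
girls_1971 = girls_1961 * (100 + 300) // 100

# 1981: boys = 2 x girls and boys + girls = 3600, so girls = 3600 / 3.
total_1981 = 3600
girls_1981 = total_1981 // 3
boys_1981 = 2 * girls_1981

rows = [
    (1961, boys_1961, girls_1961),
    (1971, boys_1971, girls_1971),
    (1981, boys_1981, girls_1981),
]
for year, boys, girls in rows:
    print(year, boys, girls, boys + girls)
```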

2.6.3 TWO-WAY TABULATION


There are occasions when we want to summarise the frequency
table against two attributes (categories) and want the count of the
same population belonging to all possible combinations of these
two attributes. For example, we may want to know the frequency of
personnel with different combinations of salary category and
educational qualification category for a given company. Since there
are two variables, we call it a two-way tabulation (also referred to
as cross-tabulation). We prepare the table with one category
varied along the rows and the other along the columns. For counting
the frequency, a pair of categories, one from each direction, is
considered. Thus, we get a table in m × n matrix form,
with each cell containing data for one combination. This is also known
as a contingency table. With m rows and n columns, the m categories
of one variable vary down the rows and the n categories of the other
variable vary across the columns. There are mn cells containing
distinct, mutually exclusive and collectively exhaustive data. It may be
noted that a two-way table can be converted to a one-way table with mn
distinct values of a combination variable. This is called a normalized
table or a flat table in database management.
Example: In a survey conducted in a city about preference for Coke,
Pepsi or Maza, the sample consisted of 400 people that included 150
women and 250 men. It was observed that 50 women preferred Coke
and 40 preferred Pepsi. In case of men the preference was 100, 80 and
70 respectively. Present the information in a two-way table and answer
the following:
1. What is the percentage of men in the Coke preferring population?
2. What is the proportion of the population preferring Pepsi?
3. What is the proportion of women preferring Maza in the total
population?
Solution:

                           Men    Women    Total
Coke Preferring People     100      50      150
Pepsi Preferring People     80      40      120
Maza Preferring People      70      60      130
Total                      250     150      400

1. Percentage of men in Coke preferring population = (100/150) × 100 =
66.67%
2. Proportion of population preferring Pepsi = 120/400 = 0.3
3. Proportion of women preferring Maza in total population = 60/400 = 0.15
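The same two-way table can be held as a flat table, with one entry per (drink, sex) combination, and the three answers read off from it. The sketch below is an illustration only; the helper names `row_total` and `column_total` are hypothetical.

```python
drinks = ("Coke", "Pepsi", "Maza")
sexes = ("Men", "Women")

# Flat table: one count per combination of the two attributes.
counts = {
    ("Coke", "Men"): 100, ("Coke", "Women"): 50,
    ("Pepsi", "Men"): 80, ("Pepsi", "Women"): 40,
    ("Maza", "Men"): 70,
}
# Women preferring Maza are not stated directly: 150 women in all, less Coke and Pepsi.
counts[("Maza", "Women")] = 150 - counts[("Coke", "Women")] - counts[("Pepsi", "Women")]

def row_total(drink):
    """Total number of people preferring the given drink."""
    return sum(counts[(drink, sex)] for sex in sexes)

def column_total(sex):
    """Total number of people of the given sex."""
    return sum(counts[(drink, sex)] for drink in drinks)

grand_total = sum(counts.values())

pct_men_in_coke = counts[("Coke", "Men")] / row_total("Coke") * 100
prop_pepsi = row_total("Pepsi") / grand_total
prop_women_maza = counts[("Maza", "Women")] / grand_total

print(round(pct_men_in_coke, 2), prop_pepsi, prop_women_maza)
```

With 3 drinks and 2 sexes the flat table has 3 × 2 = 6 cells, matching the mn cells of the two-way layout.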

2.6.4 MULTI-WAY TABULATION


We can carry out cross-tabulation with more than two variables. The
result is called a nested table. In fact, in most business situations
the tabulation may have more than two variables (usually 10 to 15). Up
to about 3 to 4 variables can be shown on two-dimensional paper.
These can also be represented as flat tables by taking one composite
variable with n1 × n2 × n3 × n4 × n5 × … values, where n1, n2, n3, n4,
n5, … are the numbers of categories of each variable (attribute).
Obviously the number of cells grows so rapidly that the table becomes
too voluminous and complex to yield any meaningful information for
decision-making. However, that does not mean such multidimensional
data is not tabulated. It is tabulated using computer databases like
MS Access, FoxPro, Oracle, etc. We cannot view it all together, but
can definitely use it for decision-making through a ‘query language’.
Database management systems and query languages are beyond the scope
of this book. One simple three-dimensional tabulation is shown in the
following example.
Example: A mutual fund wants to compare the performance of shares
on NSE over the past three years. It wants to categorize the shares as
below average, average and above average as compared to the
benchmark. It also wants to group the shares as large cap, mid-cap
and small cap. The data obtained is as follows: Of the 40 large cap
shares studied, 27 performed average and 11 above average in year 2004.
Similar figures for years 2005 and 2006 were 34 and 8 out of 50, and
32 and 16 out of 50 respectively. In the mid-cap segment the numbers of
shares below average, average and above average were 22, 35 and 23 in
year 2004. These were 17, 40 and 23 for year 2005 and 13, 38 and 29 for
year 2006 respectively. In case of small cap shares the performance
figures for years 2004, 2005 and 2006 in categories below average,
average and above average were 26, 32, 42; 25, 36, 39; and 12, 40, 48
respectively. Present the data as a multi-way table.
Solution: The below-average counts for large cap shares are obtained
by subtraction from the totals studied, e.g., 40 − 27 − 11 = 2 for
year 2004.

                             Year
                             2004    2005    2006
Large Cap    Below Average      2       8       2
             Average           27      34      32
             Above Average     11       8      16
             Total             40      50      50
Mid Cap      Below Average     22      17      13
             Average           35      40      38
             Above Average     23      23      29
             Total             80      80      80
Small Cap    Below Average     26      25      12
             Average           32      36      40
             Above Average     42      39      48
             Total            100     100     100
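A three-way table like this one can be stored as a flat table keyed by the three attributes and then interrogated with simple queries, which is roughly what a database query language does. The sketch below is a plain-Python illustration, not a database example; the `query` helper is hypothetical, and the large cap below-average counts are derived from the totals given in the problem statement (40, 50 and 50).

```python
years = (2004, 2005, 2006)
levels = ("Below Average", "Average", "Above Average")

# Large cap: the example gives only the totals studied and two of the three levels.
large_totals = (40, 50, 50)
large_average = (27, 34, 32)
large_above = (11, 8, 16)
large_below = tuple(t - a - b
                    for t, a, b in zip(large_totals, large_average, large_above))

# One row per performance level, one column per year.
by_segment = {
    "Large Cap": (large_below, large_average, large_above),
    "Mid Cap": ((22, 17, 13), (35, 40, 38), (23, 23, 29)),
    "Small Cap": ((26, 25, 12), (32, 36, 40), (42, 39, 48)),
}

# Flat table: one entry per (segment, level, year) combination.
flat = {
    (segment, level, year): count
    for segment, rows in by_segment.items()
    for level, row in zip(levels, rows)
    for year, count in zip(years, row)
}

def query(segment=None, level=None, year=None):
    """Sum the counts of all cells matching the attribute values given."""
    return sum(
        count for (s, l, y), count in flat.items()
        if segment in (None, s) and level in (None, l) and year in (None, y)
    )

print(len(flat))                                # 3 x 3 x 3 cells
print(query(level="Above Average", year=2006))  # above-average shares in 2006
print(query(segment="Mid Cap", year=2004))      # mid-cap shares studied in 2004
```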

2.6.5 ADVANTAGES OF TABULATION


Tabulation helps to achieve the following:
‰‰ It presents the data in an easy-to-understand format.
‰‰ It reduces the voluminous size of data so that it can be viewed
in a comprehensive way.
‰‰ It simplifies the data through grouping.
‰‰ It highlights common features, salient points,
characteristics, etc. of the data.
‰‰ It reveals underlying trends.
‰‰ It allows easy comparison within the data or with other tabulated
data.
‰‰ It makes data storage, reference and retrieval at a later stage
very easy.
‰‰ It allows the data to be processed through spreadsheet packages
like MS Excel.
‰‰ It makes charting of graphs and diagrams easy.

Fill in the blanks:


25. ................. is arranging the data in flat table (two dimensional
arrays) format by grouping the observations.
26. A table is made up of ................. and columns.
27. A ................. is often given below the title of a table to indicate
the units of measurement of the data.
28. Tables classified on the basis of purpose of investigation are
................. purpose table and ................. purpose table.
29. ................. table is also known as original table and it contains
data in the form in which they were originally collected.
30. A ................. table is used to present data according to two or
more characteristics.
31. Two way table is also known as ................. table.

In any organization of your choice, identify a problem and collect
data internally through a questionnaire from randomly selected
people of the organization. Present the collected data in tabular
form and suggest a solution to the problem.

It is observed quite often that even the person who has tabulated
the data finds it difficult to understand the table after a few days if
the table is ill presented. For future reference, footnotes, headers,
colour coding, etc. improve the efficiency of the table. With
present-day computers it is possible to present data very effectively
with tables, associated charts, catchy icons, 3-D surfaces, etc. It is
also possible to view only some part of the table as necessary. The
computer also provides cross-reference or stage-by-stage display of
the tables through links (or hyperlinks).

2.7 DIAGRAMMATIC AND GRAPHICAL PRESENTATION OF DATA
Tabulation and grouping do make data simple to understand and
analyze. However, numerical data alone is not attractive enough
to present to higher management, stakeholders and those not
very familiar with the particular functional area. Moreover, a
pictorial or graphical representation is easy to appreciate,
remember, grasp quickly and explain. It allows us to obtain the
underlying information at a glance. “One picture is equal to a
thousand words”, as the proverb goes. Hence, diagrams, graphs and charts
have assumed importance for managers in decision-making. To
communicate information effectively to higher management,
you must present the data in pictorial format whenever feasible, and
support it with the numerical data as a reference. Remember, higher
management may not have adequate time to analyze the numerical
data. Similarly, always present the information to junior employees
as diagrams, graphs and charts, because they may not have adequate
knowledge and grasp of numerical analysis.

2.7.1 DIFFERENCE BETWEEN DIAGRAMS AND GRAPHS


A brief distinction between a diagram and a graph is given below in
Table 2.2.

TABLE 2.2: DIFFERENCE BETWEEN DIAGRAM AND GRAPHS

Diagram                                   Graph
1. Can be drawn on an ordinary paper.     1. Can be drawn on a graph paper.
2. Easy to grasp.                         2. Needs some effort to grasp.
3. Not capable of analytical treatment.   3. Capable of analytical treatment.
4. Can be used only for comparisons.      4. Can be used to represent a
                                             mathematical relation.
5. Data are represented by bars,          5. Data are represented by lines
   rectangles, pictures, etc.                and curves.
A graphic presentation is used to represent two types of statistical
data: (i) Time Series Data and (ii) Frequency Distribution.

2.7.2 TYPES OF DIAGRAMS


There are a large number of diagrams which can be used for
presentation of data. The selection of a particular diagram depends
upon the nature of data, objective of presentation and the ability and
experience of the person doing this task. For convenience, various
diagrams can be grouped under the following categories:
‰‰ One-dimensional Diagrams: One-dimensional diagrams are
also known as bar diagrams. In case of one-dimensional diagrams,
the magnitude of the characteristics is shown by the length or
height of the bar. The width of a bar is chosen arbitrarily so that
the constructed diagram looks more elegant and attractive. It
also depends upon the number of bars to be accommodated in
the diagrams. If large numbers of items are to be included in the
diagram, lines may also be used instead of bars.
‰‰ Two-dimensional Diagrams: In case of a two-dimensional
diagram, the value of an item is represented by an area. Such
diagrams are also known as ‘surface’ or ‘area diagrams’. Popular
forms of two-dimensional diagrams are:
 Rectangular Diagrams
 Square Diagrams
 Circular or Pie Diagrams.
‰‰ Three-dimensional Diagrams: With the help of three dimensional
diagrams, the values of various items are represented by the
volume of cube, sphere, cylinder, etc. These diagrams are normally
used when the variations in the magnitudes of observations are
very large.
‰‰ Pictograms and Cartograms: These are like frequency plots.
The data points are plotted on the graph in the same manner.
Then instead of joining the data points, pictures or objects of the
height of the data points are used to depict the data. In that case,
heights of the pictures or objects represent the frequency. These
include Histograms and frequency polygon.

2.7.3 BAR DIAGRAM
Bar diagrams and column diagrams are very common in representing
business data. They are used to depict the frequencies of different
categories of variables. In bar diagrams the bars are horizontal,
with their lengths proportional to the frequencies. In column
diagrams, on the other hand, the frequencies are depicted by vertical
columns with heights proportional to the frequencies. We can also
have multiple bars or columns representing different categories of
variables. Further, data related to sub-categories within a category
can be shown on the same bar or column by stacking the sub-category
segments on top of one another.
Example: Draw a multiple bar diagram to present the following data.
Also draw a multiple column diagram.
Year    Sales (₹ '000)    Gross Profit (₹ '000)    Net Profit (₹ '000)
1996         120                  40                      20
1997         135                  45                      30
1998         140                  55                      35
1999         150                  60                      40
Solution:
Bar Diagram: We take year on the Y axis and rupees in thousands on
the X axis. Then we draw horizontal bars with lengths proportional
to the values of variables ‘Sales’, ‘Gross Profits’ and ‘Net Profits’. The
bar diagram for the above data is as follows:

[Figure: Bar Diagram: Company Results. Y axis: Year (1996 to 1999);
X axis: ₹ in Thousands (0 to 200); bars: Sales, Gross Profit and
Net Profit (₹ '000).]

Column Diagram: We take year on the X axis and rupees in
thousands on the Y axis. Then we draw vertical columns with lengths
proportional to the values of variables ‘Sales’, ‘Gross Profits’ and ‘Net
Profits’. The column diagram for the above data is as follows:

[Figure: Column Diagram: Company Results. X axis: Year (1996 to 1999);
Y axis: ₹ in Thousands (0 to 160); columns: Sales, Gross Profit and
Net Profit (₹ '000).]
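The idea of bars with lengths proportional to the values can be sketched as a plain-text multiple bar diagram for the company results data. This is only an illustration: the `bar` helper and the scale of one character per ₹ 5 thousand are assumptions, and a real report would use a charting tool.

```python
series = ("Sales", "Gross Profit", "Net Profit")
data = {
    1996: (120, 40, 20),
    1997: (135, 45, 30),
    1998: (140, 55, 35),
    1999: (150, 60, 40),
}

def bar(value, scale=5):
    """Return a bar of '#' characters, one per `scale` thousand rupees."""
    return "#" * round(value / scale)

# One group of horizontal bars per year, one bar per variable.
for year, values in data.items():
    print(year)
    for name, value in zip(series, values):
        print(f"  {name:<13}{bar(value)} {value}")
```

Swapping the roles of the axes (years across, values upward) gives the corresponding column diagram.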

Scatter Diagram
A scatter diagram is the most fundamental graph plotted to show the
relationship between two variables. It is a simple way to represent a
bivariate distribution, i.e., the distribution of two random
variables. The two variables are plotted against the X and Y axes
respectively. Thus, every data pair (xi, yi) is represented by a point
on the graph, xi being the abscissa and yi the ordinate of the point.
From a scatter diagram we can see whether there is any relationship
between x and y and, if so, what type of relationship. A scatter
diagram thus indicates the nature and strength of the correlation.
Example: Draw a scatter diagram for the following data of eight years
between income (X) and expenditure (Y).
Income (X) (₹)         100  110  113  120  125  130  130  140
Expenditure (Y) (₹)     85   90   91  100  110  125  125  130

Solution:
[Figure: Scatter Diagram. X axis: Income (X) (₹), 80 to 160;
Y axis: Expenditure (Y) (₹), 50 to 140.]

Line Diagram
It is similar to the frequency polygon, where we plot one or more
variables against one variable. The variable against which the other
variables are plotted is taken along the X axis. It is commonly used
to depict the trends in any time series data. We can show one or more
variables like economic indicators, market trends, financial results,
etc. together so that these can be compared.
Example: Draw a line diagram to present the following data.
Year    Sales (₹ '000)    Gross Profit (₹ '000)    Net Profit (₹ '000)
1996         120                  40                      20
1997         135                  45                      30
1998         140                  55                      35
1999         150                  60                      40
Solution: We take year on the X axis and rupees in thousands on the
Y axis. Then we plot the data points for the variables ‘Sales’, ‘Gross
Profits’ and ‘Net Profits’. These data points are then joined by straight
lines to draw the line diagram. The line diagram for the above data is
as follows:
[Figure: Company Performance Trend. X axis: Year (1996 to 1999);
Y axis: Thousands of ₹ (0 to 160); lines: Sales, Gross Profit and
Net Profit (₹ '000).]

2.7.4 HISTOGRAM
Besides the frequency polygon, the histogram is one of the most popular
and widely used graphical representations. It uses vertical bars whose
heights represent the frequencies. In a histogram, each vertical bar
touches its neighbouring bars, sharing one edge. Hence, if the data is
in inclusive classes, it needs to be converted to exclusive classes so
that the boundaries of adjacent classes coincide. Sometimes, we also
use histograms superimposed with frequency polygons. This helps
interpolation of data while retaining the attractive representation
of the histogram.
Example: In a city, the income tax department had the data as follows
for the number of tax payers along with the range of income tax they
paid for a particular year. Represent the data graphically with the
help of a histogram.
Tax paid in ₹ '000      20-24  25-29  30-34  35-39  40-44  45-49  Total
Number of Tax Payers      45    130    200     65     45     15    500

Solution: For plotting the data we first convert the classes to exclusive
classes. This is done by increasing the upper limits and decreasing the
lower limits by half of the difference between the upper limit of a
class and the lower limit of the next class, here (25 − 24)/2 = 0.5.
This makes adjacent class boundaries coincide. Then the class
boundaries of the tax paid classes are plotted on the X axis and the
number of tax payers on the Y axis. Vertical bars are then drawn
with widths equal to the class widths and heights equal to the
frequencies of the corresponding classes. The converted classes are
as follows.

Tax paid in ₹ '000      19.5-24.5  24.5-29.5  29.5-34.5  34.5-39.5  39.5-44.5  44.5-49.5
Number of Tax Payers        45        130        200         65         45         15
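The conversion from inclusive to exclusive classes described in the solution can be sketched as follows. The function name `to_exclusive` is hypothetical, and the sketch assumes that the gap between consecutive classes is the same throughout.

```python
def to_exclusive(classes):
    """Convert inclusive classes [(lower, upper), ...] to class boundaries.

    Half of the gap between the upper limit of one class and the lower
    limit of the next is subtracted from every lower limit and added to
    every upper limit.
    """
    gap = classes[1][0] - classes[0][1]  # e.g. 25 - 24 = 1
    half = gap / 2
    return [(lower - half, upper + half) for lower, upper in classes]

inclusive = [(20, 24), (25, 29), (30, 34), (35, 39), (40, 44), (45, 49)]
frequencies = [45, 130, 200, 65, 45, 15]

boundaries = to_exclusive(inclusive)
for (lower, upper), f in zip(boundaries, frequencies):
    print(f"{lower}-{upper}: {f}")
```

After conversion the upper boundary of each class equals the lower boundary of the next, so the histogram bars share edges.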
The histogram is shown below:

[Figure: Tax Payer Data (histogram). X axis: Tax Paid in ₹ '000;
Y axis: Number of Tax Payers (0 to 250); bar heights 45, 130, 200,
65, 45 and 15.]

2.7.5 PIE DIAGRAM
A pie diagram is a very popular visual representation in business
reports when a manager wants to show the share of various categories
in a total. The total is represented as a circle. Each category is
depicted as a sector with its central angle proportional to its share.
The percentage share of each category in the total is converted to a
sector angle using the formula:
Sector Angle in degrees = (Share Percentage / 100) × 360
Other variations of pie diagrams are the doughnut diagram and the
exploded pie diagram. These are shown below.
Example: ABC Company has a total income of ₹ 180 crore. Out of
this it has paid ₹ 10 crore as interest on borrowed capital. It has spent
₹ 80 crore on raw materials and other running expenditure. Its fixed
costs (overheads) are ₹ 30 crore. It has to pay tax at the rate of 30%
on the net profit. Further, the board of directors decides to pay a
dividend at the rate of 50% on the paid-up capital of ₹ 60 crore. The
remaining amount is retained as profit ploughed back. Depict the data
as a pie diagram, doughnut diagram and exploded pie diagram.
Solution: We need to calculate the proportion of each category of the
income distribution. Then we convert it to degrees, with the total being
360 degrees. The calculations are shown as follows:

                                        Amount in    Proportion to    Equivalent
                                        ₹ Crore      Total Income     Angle
Total Income                     (a)       180           1               360
Expenditure on Raw Material      (b)        80           0.444           160
Interest on Borrowed Capital     (c)        10           0.056            20
Fixed Expenditure                (d)        30           0.167            60
Net Profit [a – b – c – d]       (e)        60
Tax [(30/100) × e]               (f)        18           0.1              36
Dividend                         (g)        30           0.167            60
Ploughed Back Capital            (h)        12           0.067            24
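The proportions and sector angles for the ABC Company example can be reproduced with a short calculation. This is a sketch of the arithmetic only; the variable names are illustrative.

```python
total_income = 180                      # ₹ crore
raw_material, interest, fixed = 80, 10, 30

net_profit = total_income - raw_material - interest - fixed
tax = 30 * net_profit / 100             # 30% of net profit
dividend = 50 * 60 / 100                # 50% of the paid-up capital of ₹ 60 crore
ploughed_back = net_profit - tax - dividend

shares = {
    "Expenditure on Raw Material": raw_material,
    "Interest on Borrowed Capital": interest,
    "Fixed Expenditure": fixed,
    "Tax": tax,
    "Dividend": dividend,
    "Ploughed Back Capital": ploughed_back,
}

# Sector angle = share of total x 360 degrees.
angles = {name: amount * 360 / total_income for name, amount in shares.items()}

for name, amount in shares.items():
    proportion = amount / total_income
    print(f"{name:<30}{amount:>6.0f}{proportion:>8.3f}{angles[name]:>7.0f}")
```

The six sector angles add up to 360 degrees, which is a quick check that no category has been left out.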

[Figure: Pie Chart: Distribution of Total Income ₹ 180 Crore]

[Figure: Doughnut Diagram: Distribution of Total Income ₹ 180 Crore]

[Figure: Exploded Pie Diagram: Distribution of Total Income ₹ 180 Crore]

2.7.6 FREQUENCY POLYGON
A frequency polygon is used for presenting a frequency distribution
in graphical form. It can be used for discrete distributions with
grouped as well as ungrouped data. It can also be used for
continuous data by converting it to approximately discrete data through
grouping. In all these cases, values of the variable are represented on
the X axis and their frequencies (numbers of occurrences) on the Y
axis. In case of probability distributions, we use the probability as
the frequency by choosing a suitable scale on the Y axis. For plotting
the frequency polygon, we need to choose an appropriate scale and
origin so that the main data features occupy a reasonable area on the
paper. This helps readability. Although the scale chosen is usually
linear, depending on the data type we could use a logarithmic or other
type of scale. Examples of these are audio noise plots, earthquake
intensity plots, etc. Once the scale and origin are chosen, we need to
draw grid lines (or use graph paper with grid lines) to facilitate
accurate plotting. Then we take each data point and mark it on the
graph. In case of grouped data we use class marks (mid-points of the
class intervals) as variable values on the X axis. Joining these data
points by straight lines gives the frequency polygon; joining them by
a smooth curve gives the frequency distribution in graphical form.
Example: In a city, the income tax department had the data as follows
for the number of tax payers along with the range of income tax they
paid for a particular year. Represent the data graphically with the
help of a frequency polygon and frequency distribution chart.
Tax paid in ₹ '000      20-24  25-29  30-34  35-39  40-44  45-49  Total
Number of Tax Payers      45    130    200     65     45     15    500
Solution: For plotting the data we will use class marks of tax paid
classes on the X axis and number of tax payers on Y axis. Thus, the
points for plotting are as follows. Then we join these points by straight
lines.
Value on X axis 22 27 32 37 42 47
Value on Y axis 45 130 200 65 45 15
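The plotting points are simply the class marks paired with the class frequencies. A minimal sketch of that step, with illustrative variable names:

```python
classes = [(20, 24), (25, 29), (30, 34), (35, 39), (40, 44), (45, 49)]
frequencies = [45, 130, 200, 65, 45, 15]

# Class mark = mid-point of the class interval.
class_marks = [(lower + upper) / 2 for lower, upper in classes]

# Points (x, y) to be joined by straight lines for the frequency polygon.
points = list(zip(class_marks, frequencies))
print(points)
```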
The plot is shown below:

[Figure: Tax Payers' Data (frequency polygon). X axis: Tax Paid in
₹ '000 (22 to 47); Y axis: Number of Tax Payers (0 to 250).]

To draw the plot as frequency distribution, we follow the same
procedure for plotting the data points. Then we join the data points
with a smooth curve as shown below. This gives better interpolation
results. It also helps in comparing it with standard distributions.

[Figure: Tax Payers' Data (frequency distribution, smooth curve).
X axis: Tax Paid in ₹ '000 (22 to 47); Y axis: Number of Tax Payers
(0 to 250).]

2.7.7 OGIVES
Ogives are used to present the cumulative frequency of a distribution
in graphical format. There are two kinds of ogives. A ‘less than’ ogive
represents the cumulative frequency below the variable value plotted
on the X axis. A ‘more than’ ogive, on the other hand, plots the sum of
the frequencies above the variable value. For this we first calculate
the ‘less than’ and ‘more than’ cumulative frequencies for all the
variable values (corresponding to classes). Then we plot these as
points on the graph with class boundaries along the X axis and
cumulative frequencies (‘less than’ or ‘more than’) along the Y
axis. These points are then joined by a smooth curve, as for a
frequency distribution. The value of the variable (on the X axis) at
the point where the two ogives intersect is the ‘median’, i.e., the
mid-value of the data (more about the median in the next chapter). The
following example demonstrates the drawing of ogives.
Example: Before constructing a dam on a river the central water
research institute performed a series of tests to measure the water
flow, past the proposed location of the dam during the period of 246
days, when there was a sufficient flow of water. The results of testing
were used to construct the following frequency distribution.

NMIMS Global Access – School for Continuing Education


DESCRIPTIVE STATISTICS: COLLECTION, PROCESSING AND PRESENTATION OF DATA  63 

N O T E S

River Flow                        Number of Days
(thousand cubic metres per min)   (frequency)
1001-1050                          7
1051-1100                         21
1101-1150                         32
1151-1200                         49
1201-1250                         58
1251-1300                         41
1301-1350                         27
1351-1400                         11
• Draw ogive curves for the above data.
• From the ogive curve estimate the proportion of the days on
  which flow occurs at less than 1300 thousand cubic metres
  per minute.
Solution: First we calculate and prepare the ‘less than’ and ‘more than’
frequency table as follows.
River Flow        No. of   Upper Class   ‘Less than’   Lower Class   ‘More than’
(1000 cu. m       Days     Limit         Frequency     Limit         Frequency
per min)
1001 - 1050         7      1050.5           7          1000.5        246
1051 - 1100        21      1100.5          28          1050.5        239
1101 - 1150        32      1150.5          60          1100.5        218
1151 - 1200        49      1200.5         109          1150.5        186
1201 - 1250        58      1250.5         167          1200.5        137
1251 - 1300        41      1300.5         208          1250.5         79
1301 - 1350        27      1350.5         235          1300.5         38
1351 - 1400        11      1400.5         246          1350.5         11
Total             246
Now we plot the ogives with class limits on the X axis and cumulative
frequencies (less than or more than) on the Y axis, and join the points
with smooth curves. The two ogives can also be plotted superimposed.
The ogives are shown below.

[Figure: Less than Ogive. X axis: River Flow (thousand cu. m per min); Y axis: Number of Days (0 to 300).]

[Figure: More than Ogive. X axis: River Flow (thousand cu. m per min); Y axis: Number of Days (0 to 300).]

From the ‘less than’ ogive we can read that the number of days on
which flow occurs at less than 1,300 thousand cubic metres per
minute is 208.
Thus, the proportion of days on which flow occurs at less than 1,300
thousand cubic metres per minute is 208/246 = 0.846 or 84.6%.
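
The cumulative frequencies behind the two ogives can also be checked with a few lines of code. The following is a minimal Python sketch, not part of the original solution; the variable names are illustrative, and the class boundaries are assumed to lie 0.5 beyond the stated class limits, as in the table above.

```python
from itertools import accumulate

# Class limits and frequencies from the river-flow example.
classes = [(1001, 1050), (1051, 1100), (1101, 1150), (1151, 1200),
           (1201, 1250), (1251, 1300), (1301, 1350), (1351, 1400)]
days = [7, 21, 32, 49, 58, 41, 27, 11]

total = sum(days)

# 'Less than' cumulative frequency, read at each upper class boundary.
upper_boundaries = [hi + 0.5 for _, hi in classes]
less_than = list(accumulate(days))

# 'More than' cumulative frequency, read at each lower class boundary.
lower_boundaries = [lo - 0.5 for lo, _ in classes]
more_than = [total - below for below in [0] + less_than[:-1]]

# Proportion of days with flow below the 1300.5 boundary.
idx = upper_boundaries.index(1300.5)
proportion = less_than[idx] / total

print(less_than)
print(more_than)
print(round(proportion, 3))
```

The two printed lists are the Y values of the ‘less than’ and ‘more than’ ogives; plotting them against the upper and lower boundaries respectively reproduces the curves.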

Fill in the blanks:
32. Popular forms of ................. diagrams are Rectangular
    Diagrams, Square Diagrams, Circular or Pie Diagrams.
33. ................. uses vertical bars whose height represents the
    frequency.
34. ..................... is used for presenting the frequency distribution
    in graphical form.
35. ‘.................’ ogive represents cumulative frequency just below
    the variable value plotted on X axis.

Present the following information in a suitable form, supplying the
missing figures.
In the Lok Sabha there were 542 members. When a certain bill was
put to vote, 306 voted Aye, of which 30 belonged to the opposition
benches. In all, 54 members abstained from voting, of which 30
belonged to the treasury benches. Out of a total of 236 members
belonging to the opposition benches, 182 voted Nay. The bill was
passed by 306 Ayes against 182 Nays.


The choice of a diagram out of several possible ones in a given
situation is a ticklish problem. The choice primarily depends upon
two factors: (i) the nature of the data; and (ii) the type of people for
whom the diagram is needed.
The nature of the data determines whether to use a one dimensional,
two dimensional or three dimensional diagram and, if it is one
dimensional, whether to adopt the simple bar, sub-divided bar,
multiple bar or some other type. While selecting the diagram, the
type of people for whom the diagram is intended must also be
considered. For example, for drawing the attention of an uneducated
mass, pictograms and cartograms are more effective.

2.8 SUMMARY
• There are two major divisions of the field of statistics, namely
  descriptive and inferential statistics. Both the segments of
  statistics are important, and accomplish different objectives.
• Data can be obtained through primary source or secondary source
  according to need, situation, convenience, time, resources and
  availability. The most important method for primary data collection
  is through questionnaire. Data must be objective and fact-based so
  that it helps a decision-maker to arrive at a better decision.
• Statistical data is a set of facts expressed in quantitative form.
  Data is collected through various methods. Sometimes our data
  set consists of the entire population we are interested in. In other
  situations, data may constitute a sample from some population.
• Type of research, its purpose, and the conditions under which the
  data are obtained will determine the method of collecting the data.
  If relatively few items of information are required quickly, and
  funds are limited, telephonic interviews are recommended. If
  respondents are industrial clients, the Internet could also be used.
  If depth interviews and probing techniques are to be used, it is
  necessary to employ investigators to collect data.
• The quality of information collected through the filling of a
  questionnaire depends, to a large extent, upon the drafting of its
  questions. Hence, it is extremely important that the questions be
  designed or drafted very carefully and in a tactful manner.
• Before any processing of the data, editing and coding of data
  is necessary to ensure the correctness of data. In any research
  study, the voluminous data can be handled only after
  classification. Data can be presented through tables and charts.
• Classification refers to the grouping of data into homogeneous
  classes and categories. It is the process of arranging things in
  groups or classes according to their resemblances and affinities.

• A frequency distribution is the principal tabular summary of either
  discrete data or continuous data. The frequency distribution
  may show actual, relative or cumulative frequencies. Actual and
  relative frequencies may be charted as either a histogram (a bar
  chart) or a frequency polygon. Two commonly used graphs of
  cumulative frequencies are the less than ogive and the more than
  ogive.
• Once the raw data is collected, it needs to be summarized
  and presented to the decision-maker in a form that is easy to
  comprehend. Tabulation not only condenses the data, but also
  makes it easy to understand. Tabulation is the fastest way to
  extract information from the mass of data and hence popular
  even among those not exposed to the statistical method.
• Charts help in grasping the data and analysing it qualitatively.
  They also help managers to present the data effectively as a
  part of reports. Various types of chart are the bar diagram, multiple
  bar diagram, component bar diagram, deviation bar diagram,
  sliding bar diagram, histogram and pie chart.
• A graphic presentation is another way of representing the
  statistical data in a simple and intelligible form. There are two
  types of graphs which we have discussed, line graphs and ogives.

• Primary Data: Primary data are collected afresh and for the
  first time, and thus, happen to be original in character.
• Secondary Data: When the data are not collected for the
  purpose at hand, but are derived from other sources, such data
  are referred to as ‘secondary data’.
• Frequency Distribution: A tabular summary of data showing
  the number (frequency) of observations in each of several
  non-overlapping classes.
• Tabulation: Tabulation is arranging the data in flat table (two
  dimensional array) format by grouping the observations.
• Bar Graph: A graphical device for depicting data that have
  been summarized in a frequency distribution, relative
  frequency distribution, or percent frequency distribution.
• Histogram: A graphical presentation of a frequency
  distribution, relative frequency distribution, or percent
  frequency distribution of quantitative data constructed
  by placing the class intervals on the horizontal axis and the
  frequencies on the vertical axis.
• Relative Frequency Distribution: A tabular summary
  of data showing the fraction or proportion (relative
  frequency) of observations in the data set in each of several
  non-overlapping classes.

• Histogram: It uses vertical bars whose height represents
  the frequency. In a histogram, each vertical bar touches the
  neighbouring bars, sharing one edge.
• Line Graph: We plot one or more variables against one variable.
  The variable against which the other variables are plotted is taken
  along the X axis. It is commonly used to depict the trends in
  any time series data.
• Ogives: Ogives are used to present cumulative frequency of a
  distribution in graphical format. There are two kinds of ogives,
  the ‘Less than’ ogive and the ‘More than’ ogive.

2.9 DESCRIPTIVE QUESTIONS


1. Differentiate between descriptive and inferential statistics with
   examples.
2. Describe various methods of collecting primary data and
   comment on their relative advantages and disadvantages.
3. Discuss methods or sources of collecting secondary data.
4. How do you design a questionnaire? What are the important
   points to be kept in mind?
5. How is editing of primary and secondary data done? Also,
   describe coding of data.
6. Describe the classification of data. What are the rules and bases
   of classification of data?
7. What is frequency distribution? Differentiate between discrete
   and continuous frequency distribution with examples.
8. Discuss the concept of tabulation. What are the objectives and
   main parts of a table?
9. Differentiate between one-way tabulation, two-way tabulation
   and multi-way tabulation with examples.
10. Describe different types of diagrams and graphs with examples.
    Differentiate between diagrams and graphs too.

EXERCISE FOR PRACTICE


1. The income of 12 workers on a particular day was recorded as
   given below. Represent the data by a line diagram.

   S. No. of workers :   1   2   3   4   5   6   7   8   9  10  11  12
   Income (in ₹)     :  25  35  30  45  50  55  40  50  60  55  40  35

2. Represent the following data by a suitable diagram.

   Years             :  1987   1988   1989    1990    1991
   C.F.A. Enrolments :  7300   9400   12100   14600   16700

3. Show the following data of expenditure of an average working
   class family by a suitable diagram.

   Item of Expenditure        Percent of Total Expenditure
   (i) Food                   65
   (ii) Clothing              10
   (iii) Housing              12
   (iv) Fuel and Lighting      5
   (v) Miscellaneous           8

4. Represent the following data, on revenue and costs, of a company
   during July 1991 to December 1991 by a net balance chart.

   Months (1991)       :  Jul  Aug  Sep  Oct  Nov  Dec
   Revenue (in ₹ ’000) :   20   25   18   20   23   22
   Cost (in ₹ ’000)    :   18   22   20   21   19   20

5. Draw ‘less than’ and ‘more than’ ogives for the following
   distribution of monthly salary of 250 families of a certain locality.

   Income Intervals :  0-500      500-800    1000-1500  1500-2000
   No. of Families  :  50         80         40         35
   Income Intervals :  2000-2500  2500-3000  3000-3500  3500-4000
   No. of Families  :  25         15         10         5

2.10 ANSWERS AND HINTS


ANSWERS FOR SELF ASSESSMENT QUESTIONS
Topic                                             Q. No.  Answers
Descriptive and Inferential Statistics            1.      Average
                                                  2.      Inferential
Collection of Data                                3.      Primary
                                                  4.      Degree
                                                  5.      Telephonic Interview
                                                  6.      Pre-test
                                                  7.      True
                                                  8.      False
                                                  9.      False
                                                  10.     False
Editing and Coding of Data                        11.     Editing
                                                  12.     Field editing
                                                  13.     Coding
Classification of Data                            14.     Classification
                                                  15.     Discrete, Continuous
                                                  16.     Limits
                                                  17.     Frequency
                                                  18.     Inclusive
                                                  19.     Relative
                                                  20.     True
                                                  21.     False
                                                  22.     True
                                                  23.     True
                                                  24.     False
Tabulation of Data                                25.     Tabulation
                                                  26.     Rows
                                                  27.     Head-note
                                                  28.     General, Special
                                                  29.     Primary
                                                  30.     Complex
                                                  31.     Contingency
Diagrammatic and Graphical Presentation of Data   32.     Two-dimensional
                                                  33.     Histogram
                                                  34.     Frequency polygon
                                                  35.     Less than

HINTS FOR DESCRIPTIVE QUESTIONS


1. Refer Section 2.2
Descriptive statistics is the type of statistics that probably
comes to most people’s minds when they hear the word
“statistics.” Here the purpose is to describe. Numerical measures
are used to tell about features of a set of data.
For the inferential statistics we have to differentiate between
two groups. The population is the entire collection of individuals
that we have to study. It is typically impossible or infeasible to
examine each member of the population individually. So we
have to choose a representative subset of the population, called a
sample.
2. Refer Section 2.3.2
According to the nature of information required, one of the
following methods or their combination could be selected.

Observation Method, Indirect Investigation, Questionnaire with
Personal Interview, Mailed Questionnaire, Telephonic Interview,
Internet Surveys
3. Refer Section 2.3.4
Sources of secondary data could be:
(a) Various publications of central, state and local governments.
This is an important and reliable source to get unbiased
data.
(b) Various publications of foreign governments or of
international bodies. Although it is a good source, context
under which it is collected needs to be verified before using
this data. For international situations this data could be very
useful and authentic.
4. Refer Section 2.3.5

The success of collecting data through a questionnaire depends
mainly on how skilfully and imaginatively the questionnaire has
been designed. A badly designed questionnaire will never be able
to gather the relevant data. In designing the questionnaire, some
of the important points to be kept in mind are Covering letter,
Number of questions should be kept to the minimum, Questions
should be simple, short and unambiguous, Type of questions.
5. Refer Section 2.4

Once the questionnaires have been filled and the data collected, it
is necessary to edit this data to ensure completeness, consistency,
accuracy and homogeneity. The editing of the data is a process
of examining the raw data to detect errors and omissions and
to correct them, if possible, so as to ensure completeness,
consistency, accuracy and homogeneity. Editing can be done at
two stages, Field editing and central editing.
Coding is the process of assigning some symbols either
alphabetical or numeral or both to the answers so that the
responses can be recorded into a limited number of classes or
categories. The classes should be appropriate to the research
problem being studied.
6. Refer Sections 2.5.1 and 2.5.2
Classification refers to the grouping of data into homogeneous
classes and categories. It is the process of arranging things in
groups or classes according to their resemblances and affinities.
The principal rules of classifying data are:
(a) To prepare data for tabulation.
(b) To enable grasp of data.
(c) To study the relationship.

Some common types of bases of classification are: Geographical
classification, Chronological classification, Qualitative classification,
Classification of data according to some characteristics.
7. Refer Section 2.5.3
Classification of data, showing the different values of a variable
and their respective frequency of occurrence is called a frequency
distribution of the values. There are two kinds of frequency
distributions, namely, discrete frequency distribution (or simple,
or ungrouped frequency distribution), and continuous frequency
distribution (or condensed or grouped frequency distribution).
8. Refer Section 2.6
Tabulation is arranging the data in flat table (two dimensional
arrays) format by grouping the observations. Table is a spreadsheet
with rows and columns with headings and stubs indicating class
of the data. Tabulation not only condenses the data, but also
makes it easy to understand.
The main objectives of tabulation are:
(a) To simplify complex data.
(b) To highlight chief characteristics of the data.
(c) To clarify objective of investigation.
(d) To present data in a minimum space.

The main parts of a table are as given below:


(a) Table Number: This number is helpful in the identification
of a table. This is often indicated at the top of the table.
(b) Title: Each table should have a title to indicate the scope,
nature of contents of the table in an unambiguous and
concise form.
(c) Captions and stubs: A table is made up of rows and columns.
Headings or subheadings used to designate columns are
called captions while those used to designate rows are
called stubs. A caption or a stub should be self explanatory.
A provision of totals of each row or column should always be
made in every table by providing an additional column or
row respectively.
9. Refer Sections 2.6.1, 2.6.2, 2.6.3, 2.6.4 and 2.6.5
In one-way tabulation, the data are presented according to one
characteristic only. This is the simplest form of a table and is also
known as table of first order. Since there are two variables, we
call it a two-way tabulation (also referred to as cross-tabulation).
We prepare the table with one of the category varied along the
rows and other along the columns. For counting the frequency,
a pair of combinations of categories one from each direction is
considered. We can carry out cross tabulation with more than

two variables. It is called a nested table. In fact, in most of the
business situations the tabulation may have more than two
variables (usually 10 to 15). Up to about 3 to 4 variables could be
shown on two dimensional papers. These can also be represented
as flat tables by taking one composite variable of dimension n1 ×
n2 × n3 × n4 × n5 × …, where n1, n2, n3, n4, n5…are dimensions of
each variable (attribute).
10. Refer Section 2.7
Different types of bar diagrams are Line Diagram and column
diagram. Popular forms of two-dimensional diagrams are:
Rectangular Diagrams, Square Diagrams, Circular or Pie
Diagrams. With the help of three dimensional diagrams, the
values of various items are represented by the volume of cube,
sphere, cylinder, etc. These diagrams are normally used when
the variations in the magnitudes of observations are very large.

Pictograms and Cartograms are like frequency plots. The data
points are plotted on the graph in the same manner. These
include Histograms and frequency polygon.
ANSWERS FOR EXERCISE FOR PRACTICE
[Figures: the answers to exercises 1 to 5 are diagrams; no recoverable content.]


2.11 SUGGESTED READINGS FOR REFERENCE


SUGGESTED READINGS
• Levin, R.I., Statistics for Management, Prentice-Hall of India,
  New Delhi, 1979.
• Moskowitz, H. and Wright, G.P., Statistics for Management and
  Economics, Charles E. Merrill Publishing Company, Ohio, U.S.A.,
  1985.
• Gupta, S.P. and Gupta, M.P., Business Statistics, Sultan Chand &
  Sons, New Delhi, 1987.
• Loomba, M.P., Management – A Quantitative Perspective,
  MacMillan Publishing Company, New York, 1978.
• Shenoy, G.V., Srivastava, U.K. and Sharma, S.C., Quantitative
  Techniques for Managerial Decision Making, Wiley Eastern, New
  Delhi, 1985.
• Venkata Rao, K., Management Science, McGraw-Hill Book
  Company, Singapore, 1986.
• Bhardwaj, R.S., Business Statistics, 2nd Edition, Excel Books,
  New Delhi.
• Kothari, C.R., Quantitative Techniques, Vikas Publication.

E-REFERENCES
• http://elearning.sol.du.ac.in/
• http://www.okstate.edu/
• http://www.statcan.gc.ca/



CHAPTER 3

MEASURES OF CENTRAL TENDENCY

CONTENTS
3.1 Introduction


3.2 Characteristics of Central Tendency
3.3 Arithmetic Mean
3.3.1 Properties of Arithmetic Mean
3.3.2 Calculation of Simple Arithmetic Mean
3.3.3 Merits and Demerits of Arithmetic Mean
3.3.4 Weighted Arithmetic Mean
3.4 Median

3.4.1 Calculation of Median


3.4.2 Merits and Demerits of Median
3.4.3 Partition Values or Positional Measures
3.4.4 Quartiles
3.4.5 Deciles
3.4.6 Percentiles
3.5 Mode
3.5.1 Calculation of Mode
3.5.2 Merits and Demerits of Mode
3.5.3 Graphic Location of Mode
3.6 Empirical Relationship between Mean, Median and Mode
3.7 Limitations of Central Tendency
3.8 Summary
3.9 Descriptive Questions
3.10 Answers and Hints
3.11 Suggested Readings for Reference


INTRODUCTORY CASELET

MEASURE IN HEALTH CARE COST

A company wanted to determine the health care costs of its


employees. A sample of 5 employees was interviewed and their
medical expenses for the previous year were determined. Later
the company discovered that the highest medical expense in the
sample was mistakenly recorded as 10 times the actual amount.
However, after correcting the error, the corrected amount was still
greater than or equal to any other medical expense in the sample.
Which of the following sample statistics must have remained the
same after the correction was made?
A. Mean B. Median C. Mode


After studying this chapter, you should be able to:


  Understand the concept and characteristics of central
tendency
  Describe all the measures of central tendency: mean, median
and mode.
  Explain merits and demerits of all measures of central
tendency.
  Discuss partition values or positional measures like quartiles,
deciles and percentiles.

3.1 INTRODUCTION
The concept of central tendency plays a dominant role in the study of
statistics. In many frequency distributions, the tabulated values show
a distinct tendency to cluster or to group around a typical central
value. This tendency of the values to concentrate around a central
part of the distribution is called the ‘Central Tendency’ of the data. If
we find such a central value, it can be used as a representative value
for the entire data set. This helps in taking many decisions concerning
the entire set. It may be noted, however, that averages may sometimes
give strange and illogical conclusions if not used with common sense.

3.2 CHARACTERISTICS OF CENTRAL TENDENCY
Measure of central tendency enables us to get an idea of entire data
from a single value at which we consider the entire data is concentrated.
This single value could be used to represent the entire population.
Measure of central tendency also enables us to compare two or more
sets of data, for example, average sales figures for two months.
A good measure of central tendency should possess as far as possible
the following characteristics:
• Easy to understand.
• Simple to compute.
• Based on all observations.
• Uniquely defined.
• Possibility of further algebraic treatment.
• Not unduly affected by extreme values.

Common Measures of Central Tendency

The three common measures of central tendency are:
• Mean: The average value.

• Median: The middle value.
• Mode: The most frequently occurring value.
Each one has its advantages and disadvantages. Here we discuss the
definitions, concepts and methods of manual calculation. Grouping
of discrete data is not necessary for computer calculations. We can
use the discrete data directly and get faster as well as more accurate
results than by grouping the data. When only grouped data is
available, we need to use the formulae for grouped data.

Fill in the blanks:


1. Measure of ................... tendency enables us to get an idea of
   entire data from a single value at which we consider the entire
   data is concentrated.
2. Measures of central tendency are ................... defined.
3. The three common measures of central tendency are mean,
   ................... and mode.
List down various measures of central tendency and explain the
difference between them. In your day-to-day work which of the
measures of central tendency are used and why?

Averages provide us the gist and give a bird’s eye view of the huge
mass of unwieldy numerical data. Averages are the typical values
around which other items of the distribution congregate. Such a value
lies between the two extreme observations of the distribution and
gives us an idea about the concentration of the values in the central
part of the distribution. They are called the measures of central
tendency.

3.3 ARITHMETIC MEAN

The arithmetic mean of a series is the quotient obtained by dividing
the sum of the values by the number of items. In algebraic language,
if X1, X2, X3, ......, Xn are the n values of a variate X, then
\[ \bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} \]
Arithmetic Mean is again of two types, ‘Simple Arithmetic Mean’ and
‘Weighted Arithmetic Mean’.

3.3.1 PROPERTIES OF ARITHMETIC MEAN
Properties of arithmetic mean are as follows:
• The sum of the deviations of all the values of x from their
  arithmetic mean is zero.
  Justification: Since \( \bar{x} \) is a constant and
  \( \bar{x} = \frac{\sum f_i x_i}{\sum f_i} \), we have
  \( \sum f_i x_i = \bar{x} \sum f_i \). Therefore,
  \[ \sum f_i (x_i - \bar{x}) = \sum f_i x_i - \bar{x} \sum f_i = 0 \]
• The product of the arithmetic mean and the number of items
  gives the total of all items.
  Justification:
  \[ \bar{x} = \frac{\sum f_i x_i}{\sum f_i} \Rightarrow \sum f_i x_i = \bar{x} \sum f_i \]
  or, for ungrouped data,
  \[ \bar{x} = \frac{\sum x_i}{N} \Rightarrow \bar{x} \cdot N = \sum x_i \]
• If \( \bar{x}_1 \) and \( \bar{x}_2 \) are the arithmetic means of two
  samples of sizes \( n_1 \) and \( n_2 \) respectively, then the arithmetic
  mean \( \bar{x} \) of the distribution combining the two can be
  calculated as
  \[ \bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2} \]
  This formula can be extended to more groups or samples.
  Justification:
  \[ \bar{x}_1 = \frac{\sum x_{1i}}{n_1} \Rightarrow \sum x_{1i} = n_1 \bar{x}_1 \]
  which is the total of the observations of the first sample.
  Similarly, \( \sum x_{2i} = n_2 \bar{x}_2 \) is the total of the
  observations of the second sample.
  The combined mean of the two samples is the combined total divided
  by \( n_1 + n_2 \):
  \[ \bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2} \]
Arithmetic Mean of Combined Data
Arithmetic Mean is used very often in business for calculating average
sales, average cost, average earnings, etc. If there are two or more
related data groups and their arithmetic means are known, we can
calculate the arithmetic mean of the combined data without referring
to individual data points. If the first group of N1 items has arithmetic
mean μ1, the second group of N2 items has arithmetic mean μ2, and so
on, we can find the arithmetic mean of the combined data as,
\[ \mu = \frac{N_1 \mu_1 + N_2 \mu_2 + \dots + N_n \mu_n}{N_1 + N_2 + \dots + N_n} \]
Example: The weekly average salary paid to all employees in a
certain company was ₹ 600. The mean salaries paid to male and female
employees were ₹ 620 and ₹ 520 respectively. Obtain the percentage of
male and female employees in the company.
Solution: Arithmetic mean of combined data is,
\[ \mu = \frac{N_1 \mu_1 + N_2 \mu_2 + \dots + N_n \mu_n}{N_1 + N_2 + \dots + N_n} \]
In this problem N1 = number of male employees, N2 = number of
female employees, mean salary of male employees μ1 = 620, mean
salary of female employees μ2 = 520 and combined mean μ = 600.
Therefore,
\[ \mu = \frac{N_1 \mu_1 + N_2 \mu_2}{N_1 + N_2} \Rightarrow 600 = \frac{620 N_1 + 520 N_2}{N_1 + N_2} \Rightarrow 20 N_1 = 80 N_2 \]
\[ \therefore N_1 : N_2 = 4 : 1 \]
Thus, the percentage of male and female employees in the company is
80% and 20% respectively.
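
The same calculation can be sketched in Python as a check. This is only an illustration under stated assumptions: the function names `combined_mean` and `share_of_first_group` are hypothetical helpers introduced here, and the second one simply rearranges the combined-mean formula for the two-group case.

```python
def combined_mean(sizes, means):
    """Mean of the pooled data, from group sizes and group means."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)


def share_of_first_group(mean_1, mean_2, mean_all):
    """Fraction of items in group 1, given two group means and the combined mean.

    Rearranged from mean_all = p * mean_1 + (1 - p) * mean_2.
    """
    return (mean_all - mean_2) / (mean_1 - mean_2)


# Salary example: male mean 620, female mean 520, combined mean 600.
p_male = share_of_first_group(620, 520, 600)
print(p_male)                                # share of male employees
print(combined_mean([80, 20], [620, 520]))   # check with an 80:20 split
```

Feeding the 80:20 split back into `combined_mean` returns the stated company-wide mean of 600, which confirms the ratio.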

3.3.2 CALCULATION OF SIMPLE ARITHMETIC MEAN



Simple Arithmetic Mean for Ungrouped Data (AM)

It is the value obtained by dividing the sum of all the values in the
data (called data points) by the total number of such data points
(observations). It is denoted by \( \bar{X} \) (X bar) or μ, depending on
whether the data is a sample or a population. Thus,
\[ \mu = \frac{x_1 + x_2 + x_3 + \dots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N} \]
There is a short cut method for calculations based on a simple
concept that, if a constant is subtracted from or added to all data
points, the Arithmetic Mean (AM) is reduced or increased by that
amount. Thus,
\[ \mu = A + \frac{\sum_{i=1}^{N} d_i}{N} \]
Where, A = Arbitrarily selected constant value (assumed mean). This
value is selected such that it simplifies the calculations when the
deviation of each observation is used instead of the data values. A is
selected close to the expected or guessed value of the mean, so that
the calculations on the deviations can be done mentally.
d_i = (x_i – A) = Deviation of each observation from the assumed mean.
N = Number of observations.
Note that, when the assumed mean ‘A’ is exactly equal to the
Arithmetic Mean μ or \( \bar{X} \), the algebraic sum of all deviations is
equal to zero. Thus, the algebraic sum of deviations of all observations
about the Arithmetic Mean is zero. Or, about the Arithmetic Mean,
\[ \sum_{i=1}^{N} d_i = 0 \]
Now we will solve one example just to demonstrate the method.
Example: Find the arithmetic mean of 3, 6, 24, and 48.
Solution: Let the assumed mean A = 20

Sl. No.   x_i           Deviation d_i = (x_i – A)
1          3            –17
2          6            –14
3         24              4
4         48             28
N = 4     ∑ x_i = 81    ∑ d_i = 1

By the direct method,
\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{81}{4} = 20.25 \]
Or, alternatively, by the short cut method,
\[ \mu = A + \frac{\sum_{i=1}^{N} d_i}{N} = 20 + \frac{1}{4} = 20.25 \]
This is the same as the result of the direct method.

If we take the assumed mean equal to the arithmetic mean, A = 20.25,

Sl. No.   x_i           Deviation d_i = (x_i – A)
1          3            –17.25
2          6            –14.25
3         24              3.75
4         48             27.75
N = 4     ∑ x_i = 81    ∑ d_i = 0
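
The short cut (assumed mean) method can be sketched in a few lines of Python. This is a minimal illustration, not part of the original text; the function name `mean_by_assumed_mean` is a hypothetical helper.

```python
def mean_by_assumed_mean(values, assumed):
    """Return (mean, sum of deviations) using the assumed-mean short cut."""
    deviations = [x - assumed for x in values]   # d_i = x_i - A
    dev_sum = sum(deviations)
    return assumed + dev_sum / len(values), dev_sum


data = [3, 6, 24, 48]

# Assumed mean A = 20, as in the first table.
mean_a, dev_sum_a = mean_by_assumed_mean(data, 20)
print(mean_a, dev_sum_a)

# Assumed mean equal to the true mean: the deviations sum to zero.
mean_b, dev_sum_b = mean_by_assumed_mean(data, 20.25)
print(mean_b, dev_sum_b)
```

Whatever value of A is chosen, the returned mean is the same; only the size of the intermediate deviations changes, which is why A is picked close to the expected mean in manual work.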

Example: Find the arithmetic mean of 10, 12, 20, 15, 20, 12, 10, 15,
20 and 10
Solution: Arithmetic mean
\[ \mu = \frac{x_1 + x_2 + x_3 + \dots + x_N}{N} = \frac{10 + 12 + 20 + 15 + 20 + 12 + 10 + 15 + 20 + 10}{10} = \frac{144}{10} = 14.4 \]
OR, the frequency distribution of the data is,

Sl. No.   x_i   Frequency f_i   x_i f_i
1         10    3               30
2         12    2               24
3         15    2               30
4         20    3               60
                ∑ f_i = 10      ∑ x_i f_i = 144

\[ \text{Arithmetic Mean } \mu = \frac{\sum x_i f_i}{\sum f_i} = \frac{144}{10} = 14.4 \]
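
Both routes to the mean in this example can be reproduced with a short Python sketch. It is offered only as an illustration; it uses the standard library `collections.Counter` to build the frequency distribution, and the variable names are assumptions of this sketch.

```python
from collections import Counter

data = [10, 12, 20, 15, 20, 12, 10, 15, 20, 10]

# Direct method: sum of observations divided by their number.
direct_mean = sum(data) / len(data)

# Frequency-distribution method: sum of x_i * f_i divided by sum of f_i.
freq = Counter(data)                              # value -> frequency
total_f = sum(freq.values())
total_xf = sum(x * f for x, f in freq.items())
freq_mean = total_xf / total_f

print(sorted(freq.items()))
print(total_f, total_xf)
print(direct_mean, freq_mean)
```

The two means agree because multiplying each distinct value by its frequency is just a compact way of adding up the repeated observations.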

Simple Arithmetic Mean for Grouped Data


• In the case of grouped data, we take the class mark (mid-point of
  the class) as the data point (value of observation).
• In other words, we use the mid-value of the class for all the
  observations in that class (since we do not know the exact values of
  the observations, this is the best we can do while keeping grouping
  errors to the minimum).
• Multiply each class mark by the frequency of that class.
• The weighted average is then calculated by dividing the sum of
  these products (class marks with frequencies as their weights) by
  the total number of observations (sum of all frequencies).
• Thus, for grouped data,
\[ \mu = \frac{\sum_{i=1}^{n} m_i f_i}{\sum_{i=1}^{n} f_i} = \frac{\sum_{i=1}^{n} m_i f_i}{N} \]
Where, m_i = Mid-point of the class interval (class mark).
f_i = Class frequency.
N = ∑f_i = Total number of observations.
To make manual calculations easy, we may subtract a constant from, or
add a constant to, all class marks (observations). In such a case, as
discussed earlier, the ‘Arithmetic Mean’ is reduced or increased by
that amount. Thus,
\[ \mu = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i} = A + \frac{\sum_{i=1}^{n} f_i d_i}{N} \]
Where, A = Assumed mean.
d_i = (m_i – A) = Deviation of class marks from the assumed mean.
f_i = Class frequency.
N = ∑f_i = Total number of observations.
m_i = Class marks.
This method is also called the ‘Short Cut Method’. To make manual
calculations easier still, we can use the principle that if all the
observations are divided or multiplied by a constant, the ‘Arithmetic
Mean’ is divided or multiplied by that value. We select a convenient
number, usually the class width or size, and divide all deviations by
that number. We then use the following formula to calculate the
‘Arithmetic Mean’. This method is called the ‘Step Division Method’.
The formula is:
\[ \mu = A + \frac{\sum_{i=1}^{n} f_i d'_i}{\sum_{i=1}^{n} f_i} \times h \]
Where, A = Assumed mean.
\[ d'_i = \frac{m_i - A}{h} = \frac{d_i}{h} \]
m_i = Class marks.
h = Step size, usually the class interval.
N = ∑f_i = Total number of observations.
Example: From the following data, compute the Arithmetic Mean by the
direct method, the short cut method and the step division method.

Marks              0-10   10-20   20-30   30-40   40-50   50-60
No. of Students      5      10      25      30      20      10

Solution: Let the Assumed Mean be A = 35 and the Step size h = 10.

CALCULATION TABLE

Marks   Class Mark   No. of Students   mi * fi   Deviation     fi * di   Step Deviation     fi * di′
        (mi)         (fi)                        di = mi – A             di′ = (mi – A)/h
0-10         5              5             25        –30         –150          –3              –15
10-20       15             10            150        –20         –200          –2              –20
20-30       25             25            625        –10         –250          –1              –25
30-40       35             30           1050          0            0           0                0
40-50       45             20            900         10          200           1               20
50-60       55             10            550         20          200           2               20
∑                         100           3300                    –200                          –20

‰‰ Direct Method:

$$\mu = \frac{\sum_{i=1}^{6} f_i m_i}{\sum_{i=1}^{6} f_i} = \frac{3300}{100} = 33$$

‰‰ Shortcut Method:

$$\mu = A + \frac{\sum_{i=1}^{6} f_i d_i}{\sum_{i=1}^{6} f_i} = 35 + \frac{-200}{100} = 35 - 2 = 33$$

‰‰ Step Division Method:

$$\mu = A + \frac{\sum_{i=1}^{6} f_i d'_i}{\sum_{i=1}^{6} f_i} \times h = 35 + \frac{-20}{100} \times 10 = 33$$

The answer is the same irrespective of the method used.
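The three methods can also be checked with a short program. The sketch below is only an illustration using the class marks and frequencies of the example above; the function names are chosen here and are not part of the text.

```python
# Class marks and frequencies from the worked example (marks 0-10 ... 50-60).
class_marks = [5, 15, 25, 35, 45, 55]
frequencies = [5, 10, 25, 30, 20, 10]


def mean_direct(marks, freqs):
    """Direct method: sum(f * m) / sum(f)."""
    return sum(f * m for m, f in zip(marks, freqs)) / sum(freqs)


def mean_shortcut(marks, freqs, assumed_mean):
    """Short cut method: A + sum(f * d) / sum(f), with d = m - A."""
    total = sum(f * (m - assumed_mean) for m, f in zip(marks, freqs))
    return assumed_mean + total / sum(freqs)


def mean_step_division(marks, freqs, assumed_mean, step):
    """Step division method: A + (sum(f * d') / sum(f)) * h, with d' = (m - A) / h."""
    total = sum(f * (m - assumed_mean) / step for m, f in zip(marks, freqs))
    return assumed_mean + total / sum(freqs) * step


direct = mean_direct(class_marks, frequencies)
shortcut = mean_shortcut(class_marks, frequencies, assumed_mean=35)
step_division = mean_step_division(class_marks, frequencies, assumed_mean=35, step=10)
print(direct, shortcut, step_division)
```

Any other choice of assumed mean or step size should give the same mean, since both corrections are undone at the end of the calculation.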

Effect of Shift of Origin and Change of Scale


To simplify manual calculation, we may sometimes use a shift of origin
and a change of scale. Shifting of origin is achieved by adding a
constant to, or subtracting a constant from, all observations. In the
case of discrete data we add or subtract (usually subtract) the
constant from the individual observations, whereas for grouped data we
add or subtract (usually subtract) it from the class mark values. The
effect is as follows. If a constant is subtracted from or added to all
data points, the Arithmetic Mean (AM) is reduced or increased by that
amount. This principle is used in the short cut method, which was
explained earlier. In this method we first subtract a suitable constant
from all the observations, calculate the mean and then add the same
constant to the answer to get the actual value of the mean.
Change of scale is achieved by multiplying or dividing all observations
by a constant. In the case of discrete data we multiply or divide
(usually divide) the individual observations by the constant, whereas
for grouped data we multiply or divide (usually divide) the class mark
values by it. The effect is as follows. If all data points are
multiplied or divided by a constant, the Arithmetic Mean (AM) is
multiplied (stretched) or divided (shrunk) by that amount. This is the
principle behind the step division method, which was explained earlier.
In this method we first subtract a constant, say A (called the assumed
mean), from all the observations or class marks and then divide all the
resulting deviations by a suitable constant, say h (usually the class
interval for grouped data), and then calculate the mean. Then we
multiply the answer by the same constant h and then add the constant A
to get the actual value of the mean. We can use both change of origin
and change of scale together, but we must correct the answers in the
reverse order of the algebraic operations performed on the data points.
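The reverse-order correction can be seen in a short derivation. The symbol $u_i$ is introduced here only for illustration and stands for the transformed observation $u_i = (x_i - A)/h$, so that $x_i = A + h\,u_i$:

```latex
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i
        = \frac{1}{N}\sum_{i=1}^{N} \left(A + h\,u_i\right)
        = A + h \cdot \frac{1}{N}\sum_{i=1}^{N} u_i
        = A + h\,\bar{u}
```

The data were first shifted by A and then divided by h, so the mean of the transformed values is first multiplied by h and then increased by A.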

3.3.3 MERITS AND DEMERITS OF ARITHMETIC MEAN


Out of all averages arithmetic mean is the most popular average in
statistics because of its merits given below.
‰‰ Arithmetic mean is rigidly defined by an algebraic formula.
‰‰ Calculation of arithmetic mean requires simple knowledge of
addition, multiplication and division of numbers and hence, is
easy to calculate. It is also simple to understand the meaning of
arithmetic mean, e.g., the value per item or per unit, etc.
‰‰ Calculation of arithmetic mean is based on all the observations
and hence, it can be regarded as representative of the given data.
‰‰ It is capable of being treated mathematically and hence, is widely
used in statistical analysis.
‰‰ Arithmetic mean can be computed even if the detailed
distribution is not known but sum of observations and numbers
of observations are known.
‰‰ It is least affected by the fluctuations of sampling.
‰‰ It represents the centre of gravity of the distribution because it
balances the magnitudes of observations which are greater and
less than it.
‰‰ It provides a good basis for the comparison of two or more
distributions.
Although, arithmetic mean satisfies most of the properties of an ideal
average, it has certain drawbacks and should be used with care. Some
limitations of arithmetic mean are:
‰‰ It can neither be determined by inspection nor by graphical
location.
‰‰ Arithmetic mean cannot be computed for qualitative data, such as
data on intelligence, honesty, smoking habit, etc.
‰‰ It is strongly affected by extreme observations and hence, it
does not adequately represent data containing some extreme
observations.
‰‰ The value of mean obtained for a data may not be an observation
of the data and as such it is called a fictitious average.
‰‰ Arithmetic mean cannot be computed when class intervals have
open ends. To compute mean, some assumption regarding the
width of class intervals is to be made.

‰‰ In the absence of a complete distribution of observations the
arithmetic mean may lead to fallacious conclusions. For example,
there may be two entirely different distributions with same value
of arithmetic mean.
‰‰ Simple arithmetic mean gives greater importance to larger
values and lesser importance to smaller values.

3.3.4 WEIGHTED ARITHMETIC MEAN


There are cases where the relative importance of the different items is
not the same. In such a case, we need to compute the weighted arithmetic
mean. The procedure is similar to the grouped data calculations
studied earlier, where we treated the frequency as a weight associated
with the class mark. Suppose the data values are $x_1, x_2, \ldots, x_n$
and the associated weights are $W_1, W_2, \ldots, W_n$. Then the weighted
arithmetic mean is:
‰‰ Direct Method

$$\mu_w = \frac{W_1 x_1 + W_2 x_2 + \cdots + W_n x_n}{W_1 + W_2 + \cdots + W_n} = \frac{\sum W_i x_i}{\sum W_i}$$

‰‰ Short-cut Method

$$\mu_w = A_w + \frac{\sum W_i d_i}{\sum W_i}$$

Where, $A_w$ = Assumed weighted mean.
$d_i = (x_i - A_w)$ = Deviation of observations from the assumed mean.
‰‰ Utility of weighted mean
Some of the common applications where weighted mean is
extensively used are:
 Construction of index numbers, for example, the Consumer
Price Index, BSE Sensex, etc., where different weights
are associated with different items or shares. The weighted
average of outstanding shares is a calculation that
incorporates any changes in the amount of outstanding
shares over a reporting period. It is an important number,
as it is used to calculate key financial measures such as
earnings per share (EPS) for the time period.
For example, say a company has 100,000 shares outstanding at the
start of the year. Halfway through the year, it issues an additional
100,000 shares, so the total amount of shares outstanding
increases to 200,000. If at the end of the year the company reports
earnings of $200,000, which amount of shares should be used to
calculate EPS: 100,000 or 200,000? If the 200,000 shares were
used, the EPS would be $1, and if 100,000 shares were used, the
EPS would be $2 - this is quite a large range!

This potentially large range is the reason why a weighted average
is used, as it ensures that financial calculations will be as accurate
as possible in the event the amount of a company’s shares changes
over time. The weighted average number of shares is calculated
by taking the number of outstanding shares and multiplying the
portion of the reporting period those shares covered, doing this
for each portion and, finally, summing the total. The weighted
average number of outstanding shares in our example would be
150,000 shares.

                          Shares Outstanding   Period Covered   Weighted Shares
First Half of the Year         100,000              0.5              50,000
Second Half of the Year        200,000              0.5             100,000
Weighted Average                                                    150,000

The earnings per share calculation for the year would then be
calculated as earnings divided by the weighted average number
of shares ($200,000/150,000), which is equal to $1.33 per share.
 Comparison of results of the two companies when their sizes
are different.
 Computation of standardized death and birth rates.
Example: The management of a hotel has employed 2 managers, 5
cooks and 8 waiters. The monthly salaries of a manager, a cook
and a waiter are ₹ 3000, ₹ 1200 and ₹ 1000 respectively. Find the mean
salary of the employees. (Note: Although these salary figures are
probably 10 to 15 years old, we use them only to learn the principle.)
Solution: Here we need to calculate the weighted average of the
salaries, with the number of employees in each category as weights.

$$\mu_w = \frac{W_1 x_1 + W_2 x_2 + \cdots + W_n x_n}{W_1 + W_2 + \cdots + W_n} = \frac{2 \times 3000 + 5 \times 1200 + 8 \times 1000}{2 + 5 + 8} = \frac{20000}{15} = 1333.33$$

Thus, the mean salary is ₹ 1333.33.
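As an illustration, the direct method for the weighted mean can be sketched in a few lines of Python. The function name `weighted_mean` is chosen here for the sketch; the figures are those of the hotel example and of the share example discussed earlier.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean by the direct method: sum(W * x) / sum(W)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hotel example: salaries weighted by the number of employees in each role.
mean_salary = weighted_mean([3000, 1200, 1000], [2, 5, 8])

# Share example: shares outstanding weighted by the fraction of the year covered.
weighted_shares = weighted_mean([100_000, 200_000], [0.5, 0.5])
earnings_per_share = 200_000 / weighted_shares

print(round(mean_salary, 2), weighted_shares, round(earnings_per_share, 2))
```

Using the head count as the weight is what separates this result from the simple average of the three salary levels, which would ignore how many employees earn each salary.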

State whether the following statements are true/false:


4. Arithmetic Mean is of two types, ‘Simple Arithmetic Mean’
and ‘Weighted Arithmetic Mean’.
5. Algebraic sum of deviations of all observations about
Arithmetic Mean is one.
6. Shifting of origin is achieved by multiplying or dividing by a
constant to all observations.
7. Change of scale is achieved by adding or subtracting a constant
to all observations.
8. Arithmetic mean can be computed even if the detailed
distribution is not known but sum of observations and numbers
of observations are known.
9. Arithmetic mean can be computed for a qualitative data like
data on intelligence, honesty, smoking habit, etc.
10. Weighted mean is used for construction of index numbers,
for example, consumer Price Index, BSE sensex, etc., where
different weights are associated for different items or shares.

On your first four math tests you earned 85, 80, 95, and 65. What
must you earn on your next test to have a mean score of at least 80?

If the class intervals are of varying width, an effort should be made
to avoid calculating mean and mode. It is advisable to calculate the
median.

3.4 MEDIAN

Median is the value which divides the distribution of data, arranged
in ascending or descending order, into two equal parts. Thus, the
'Median' is the value of the middle observation.

3.4.1 CALCULATION OF MEDIAN

Median for Ungrouped Data


When the series is arranged in order of size or magnitude, and the total
number of observations N is odd,

$$M_d = \left(\frac{N+1}{2}\right)^{\text{th}} \text{ observation}$$

If the number of observations is even, then the median is the arithmetic
mean of the two middle observations.

$$M_d = \frac{\left(\frac{N}{2}\right)^{\text{th}} \text{ observation} + \left(\frac{N}{2}+1\right)^{\text{th}} \text{ observation}}{2}$$
Example: The students of a class were divided into two groups and
underwent tutorial training by different faculty members. Their
scores in the final examination are:
Group A: 80, 70, 50, 20, 30, 90, 10, 40, 60
Group B: 80, 70, 50, 20, 30, 90, 10, 40, 60, 100
Which group showed better performance based on the Median?
Solution: First we arrange the scores in ascending order.
Group A: 10, 20, 30, 40, 50, 60, 70, 80, 90
The number of observations is 9 (odd). Therefore,

$$M_d = \left(\frac{N+1}{2}\right)^{\text{th}} \text{ observation} = \left(\frac{9+1}{2}\right)^{\text{th}} \text{ observation} = 5^{\text{th}} \text{ observation} = 50$$

Group B: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100
The number of observations is 10 (even). Therefore,

$$M_d = \frac{5^{\text{th}} \text{ observation} + 6^{\text{th}} \text{ observation}}{2} = \frac{50 + 60}{2} = 55$$

Thus, group B has the better performance based on the median.
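The two rules for ungrouped data can be expressed as a small function. This is a minimal sketch written for this example; the name `median_ungrouped` is not from the text.

```python
def median_ungrouped(observations):
    """Median of raw data: the ((N + 1) / 2)th observation when N is odd,
    the mean of the two middle observations when N is even."""
    ordered = sorted(observations)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # index `middle` is the ((N + 1) / 2)th item
    return (ordered[middle - 1] + ordered[middle]) / 2


group_a = [80, 70, 50, 20, 30, 90, 10, 40, 60]
group_b = [80, 70, 50, 20, 30, 90, 10, 40, 60, 100]
median_a = median_ungrouped(group_a)
median_b = median_ungrouped(group_b)
print(median_a, median_b)
```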

Example: The distribution of heights of new recruits is given below:

Height in Inches   58   60   61   62   63   64   65   66   68   70
No. of Persons      4    6    5   10   20   22   24    6    2    1

Determine the median height.
Solution: There are 100 observations in total. These are already
arranged in ascending order. We now find the cumulative frequency as,

Height in Inches       58   60   61   62   63   64   65   66   68   70
No. of Persons          4    6    5   10   20   22   24    6    2    1
Cumulative Frequency    4   10   15   25   45   67   91   97   99  100

Now N = 100, hence N/2 = 50. Since N is even, the median is the mean of
the 50th and 51st observations. From the cumulative frequency, both of
these observations are 64. Hence the median is 64 inches.

Median for Grouped Data


In the case of grouped data we first find the value N/2. Then from the
cumulative frequency we find the class in which the (N/2)th item falls.
Such a class is called the Median Class. Then the median is calculated
by the formula:

$$M_d = L + \frac{\frac{N}{2} - pcf}{f} \times h$$

Where, L = Lower limit of the median class.
N = Total frequency.
pcf = Cumulative frequency preceding the median class.
f = Frequency of the median class.
h = Class interval of the median class.
Let us understand the logic of the formula. The median is the value of
the (N/2)th observation. This observation falls in the median class,
whose lower limit is L. The cumulative frequency of the class preceding
the median class is pcf. Thus, the median observation is the
(N/2 – pcf)th observation in the median class (counted from the lower
limit of the median class). Now, if we consider that all f observations
in the median class are evenly spaced from the lower limit L to the
upper limit L + h, the value of the median can be found by using
ratio proportion.
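The ratio-proportion step can be written out explicitly. Assuming, as above, that the f observations are spread evenly over the class width h, the median lies the same fraction of the way through the class as the (N/2 – pcf)th observation lies through the f observations:

```latex
\frac{M_d - L}{h} = \frac{\frac{N}{2} - pcf}{f}
\quad\Longrightarrow\quad
M_d = L + \frac{\frac{N}{2} - pcf}{f} \times h
```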
Example: Calculate the median for the following data.

Age               20-25  25-30  30-35  35-40  40-45  45-50  50-55  55-60
No. of Workers      14     28     33     30     20     15     13      7

Solution:

Age      Frequency (f)   Cumulative Frequency (cf)
20-25         14                  14
25-30         28                  42
30-35         33                  75
35-40         30                 105
40-45         20                 125
45-50         15                 140
50-55         13                 153
55-60          7                 160

Now, N = 160, or N/2 = 80.
The 80th item lies in class 35-40.
Hence, pcf = 75, f = 30, h = 5 and L = 35.
Therefore, the Median is,

$$M_d = L + \frac{\frac{N}{2} - pcf}{f} \times h = 35 + \frac{\frac{160}{2} - 75}{30} \times 5 = 35.83$$

3.4.2 MERITS AND DEMERITS OF MEDIAN


The merits of Median are:
‰‰ It is easy to understand and easy to calculate, especially in series of
individual observations and ungrouped frequency distributions.
In such cases it can even be located by inspection.

‰‰ Median can be determined even when class intervals have open
ends or are not of equal width.
‰‰ It is not much affected by extreme observations. It is also
independent of range or dispersion of the data.
‰‰ Median can also be located graphically.
‰‰ It is a centrally located average, since the sum of absolute
deviations is minimum when taken from the median.
‰‰ It is the only suitable average when data are qualitative and
it is possible to rank various items according to qualitative
characteristics.
‰‰ Median conveys the idea of a typical observation.
Demerits of median are as follows:
‰‰ In case of individual observations, the process of location of
median requires their arrangement in the order of magnitude
which may be a cumbersome task, particularly when the number
of observations is very large.
‰‰ It, being a positional average, is not capable of being treated
algebraically.
‰‰ In case of individual observations, when the number of
observations is even, the median is estimated by taking mean
of the two middle-most observations, which is not an actual
observation of the given data.


‰‰ It is not based on the magnitudes of all the observations. There
may be a situation where different sets of observations give same
value of median. For example, the following two different sets of
observations, have median equal to 30.
Set I: 10, 20, 30, 40, 50 and Set II: 15, 25, 30, 60, and 90.
‰‰ In comparison to arithmetic mean, it is more affected by the
fluctuations of sampling.
‰‰ The formula for the computation of median, in case of grouped
frequency distribution, is based on the assumption that the
observations in the median class are uniformly distributed. This
assumption is rarely met in practice.
‰‰ Since it is not possible to define weighted median like weighted
arithmetic mean, this average is not suitable when different
items are of unequal importance.

3.4.3 PARTITION VALUES OR POSITIONAL MEASURES


Quantiles are related positional measures of central tendency. These
are useful and frequently employed measures. The most familiar quantiles
are Quartiles, Deciles, and Percentiles. We are familiar with percentile
scores in competitive aptitude tests or examinations of a few institutes.
If your score is at the 90th percentile, it means that 90% of the
candidates who took the test received a score lower than yours.
Similarly, if your income is at the 95th percentile in your
organisation, you are in the group of the top 5% highest paid employees
in your company.

3.4.4 QUARTILES
Quartiles are position values similar to the Median. There are three
quartiles denoted by Q1, Q2 and Q3. Q1 is called the lower quartile or
first quartile. The second quartile Q2 is nothing but the median. In
a distribution, one fourth of the items are less than Q1 and the other
three fourths of the items are greater than Q1. Q3 is called the upper
quartile or the third quartile.

Inter-quartile range is defined as the difference between the first
and third quartile. It is a measure of spread of the data.

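Formulae: In individual observations and discrete series, the values
are known and are considered in ascending order.

Q1 is the value at the $\left(\frac{N+1}{4}\right)^{\text{th}}$ position.
Q2 is the value at the $2\left(\frac{N+1}{4}\right)^{\text{th}} = \left(\frac{N+1}{2}\right)^{\text{th}}$ position.
Q3 is the value at the $3\left(\frac{N+1}{4}\right)^{\text{th}}$ position.

In continuous series,

$$Q_1 = L + \frac{\frac{N}{4} - \text{c.f.}}{f} \times c$$

$$Q_2 = L + \frac{\frac{N}{2} - \text{c.f.}}{f} \times c$$

$$Q_3 = L + \frac{\frac{3N}{4} - \text{c.f.}}{f} \times c$$

Formulae: Or we use the formula

$$Q^{\text{th}} \text{ quartile} = \left[(n+1) \times \frac{Q}{4}\right]^{\text{th}} \text{ observation}, \quad Q = 0, 1, 2, 3, 4$$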
Example: In a computerized entrance test 20 candidates appear on a
particular day. Their scores are: 9, 6, 12, 10, 13, 15, 16, 14, 14, 16, 17, 16,
24, 21, 22, 18, 19, 18, 20, 17. Find the quartiles of the data.
Solution: First, we order the data in ascending order.
6, 9, 10, 12, 13, 14, 14, 15, 16, 16, 16, 17, 17, 18, 18, 19, 20, 21, 22, 24
‰‰ First quartile is the observation in position:

$$(n+1) \times \frac{25}{100} = (20+1) \times \frac{25}{100} = 5.25$$

Value of the observation corresponding to the 5.25th position is 13.25.
‰‰ Second quartile or median is the observation in position:

$$(n+1) \times \frac{50}{100} = (20+1) \times \frac{50}{100} = 10.5$$

Value of the observation corresponding to the 10.5th position is 16.
‰‰ Third quartile is the observation in position:

$$(n+1) \times \frac{75}{100} = (20+1) \times \frac{75}{100} = 15.75$$

Value of the observation corresponding to the 15.75th position is 18.75.
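The position rule and the interpolation between neighbouring observations can be sketched as follows. This is an illustrative sketch only; the function names are chosen here, and the same function is written so that dividing into 10 or 100 parts instead of 4 gives deciles or percentiles under the same (n + 1) rule.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position in sorted data; a fractional position
    is resolved by linear interpolation between its two neighbours."""
    rank = int(position)  # whole part of the position
    fraction = position - rank
    if rank < 1 or rank > len(ordered):
        raise ValueError("position lies outside the data")
    lower = ordered[rank - 1]
    if fraction == 0:
        return lower
    if rank == len(ordered):
        raise ValueError("position lies outside the data")
    return lower + fraction * (ordered[rank] - lower)


def quantile(data, k, parts):
    """k-th of `parts` quantiles, located at position (n + 1) * k / parts."""
    ordered = sorted(data)
    return value_at_position(ordered, (len(ordered) + 1) * k / parts)


scores = [9, 6, 12, 10, 13, 15, 16, 14, 14, 16, 17, 16,
          24, 21, 22, 18, 19, 18, 20, 17]
q1, q2, q3 = (quantile(scores, k, 4) for k in (1, 2, 3))
print(q1, q2, q3)
```

Note that statistical software may use slightly different position rules, so its quartiles can differ a little from the values obtained with this textbook rule.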
Example: Calculate Median, Q1 and Q3 from the following data.

Salary (₹ 000):    15 – 19   20 – 24   25 – 29   30 – 34   35 – 39   40 – 44
No. of officers:      15        25        40        50        40        30

Solution:

Salary (₹ 000)   No. of Officers (f)   True Class Interval   Cum. Freq. (c.f.)
15 – 19                 15                14.5 – 19.5               15
20 – 24                 25                19.5 – 24.5               40
25 – 29                 40                24.5 – 29.5               80
30 – 34                 50                29.5 – 34.5              130
35 – 39                 40                34.5 – 39.5              170
40 – 44                 30                39.5 – 44.5              200
                     N = 200

N/2 = 200/2 = 100 ∴ median class: 29.5 – 34.5
∴ L = 29.5, c = 34.5 – 29.5 = 5, f = 50, c.f. = 80

$$\text{Median} = L + \frac{\frac{N}{2} - \text{c.f.}}{f} \times c = 29.5 + \frac{100 - 80}{50} \times 5 = 29.5 + 2 = 31.50$$

Median = ₹ 31.50 thousands

N/4 = 200/4 = 50 ∴ Q1 class: 24.5 – 29.5
∴ L = 24.5, c = 5, f = 40, c.f. = 40

$$Q_1 = L + \frac{\frac{N}{4} - \text{c.f.}}{f} \times c = 24.5 + \frac{50 - 40}{40} \times 5 = 24.5 + 1.25 = 25.75$$

Q1 = ₹ 25.75 thousands

3N/4 = 3 × 200/4 = 150 ∴ Q3 class: 34.5 – 39.5
∴ L = 34.5, c = 5, f = 40, c.f. = 130

$$Q_3 = L + \frac{\frac{3N}{4} - \text{c.f.}}{f} \times c = 34.5 + \frac{150 - 130}{40} \times 5 = 34.5 + 2.5 = 37$$

Q3 = ₹ 37 thousands
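The three continuous-series formulas differ only in the fraction of N they target, so they can be sketched as one function. The code below is an illustration under the assumption of equal class widths; the function name and argument names are chosen here.

```python
def grouped_quantile(lower_bounds, width, freqs, fraction):
    """Quantile of a grouped distribution with equal class width.

    Applies L + ((fraction * N - c.f.) / f) * c to the class containing
    the (fraction * N)th item; fraction is 0.25 for Q1, 0.5 for the
    median and 0.75 for Q3.
    """
    total = sum(freqs)
    if total <= 0:
        raise ValueError("total frequency must be positive")
    target = fraction * total
    cumulative = 0  # c.f. of the classes before the current one
    for lower, f in zip(lower_bounds, freqs):
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("fraction must lie between 0 and 1")


# True lower class boundaries and frequencies from the salary example.
lower_bounds = [14.5, 19.5, 24.5, 29.5, 34.5, 39.5]
officers = [15, 25, 40, 50, 40, 30]
q1 = grouped_quantile(lower_bounds, 5, officers, 0.25)
median = grouped_quantile(lower_bounds, 5, officers, 0.50)
q3 = grouped_quantile(lower_bounds, 5, officers, 0.75)
print(q1, median, q3)
```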

3.4.5 DECILES
D1, D2, D3… and D9 are the nine deciles. They divide a series into 10
equal parts. One tenth of the items are less than or equal to D1, one
tenth of the items are more than or equal to D9, and one tenth of the
items lie between any successive pair of deciles when all the items are
in ascending order.
Formulae: Or we use the formula

$$D^{\text{th}} \text{ decile} = \left[(n+1) \times \frac{D}{10}\right]^{\text{th}} \text{ observation}, \quad D = 0, 1, 2, \ldots, 10$$
Example: Find Q1, Q3, D2, D5, D7 from the following data.

Daily pocket money (₹):   15   20   25   40   60   100
No. of days:              34   39   70   70   76    76

Solution:

X      No. of Days (f)   c.f.
15          34            34
20          39            73
25          70           143
40          70           213
60          76           289
100         76           365
         N = 365

$$\frac{N+1}{4} = \frac{365+1}{4} = 91.5$$

$$Q_1 = \frac{91^{\text{st}} \text{ value} + 92^{\text{nd}} \text{ value}}{2} = \frac{25 + 25}{2} = 25$$

Q1 = ₹ 25

$$3\left(\frac{N+1}{4}\right) = 3\left(\frac{366}{4}\right) = 3 \times 91.5 = 274.5$$

$$Q_3 = \frac{274^{\text{th}} \text{ value} + 275^{\text{th}} \text{ value}}{2} = \frac{60 + 60}{2} = 60$$

Q3 = ₹ 60

$$2\left(\frac{N+1}{10}\right) = 2\left(\frac{365+1}{10}\right) = 2 \times 36.6 = 73.2$$

∴ D2 = value at 73rd position + 0.20 (value at 74th position – value at 73rd position)
= 20 + 0.20 (25 – 20)
= 20 + 0.20 (5) = 20 + 1
= ₹ 21

$$5\left(\frac{N+1}{10}\right) = 5 \times 36.6 = 183$$

∴ D5 = ₹ 40, since x = 40 corresponds to position 183.

$$7\left(\frac{N+1}{10}\right) = 7 \times 36.6 = 256.2$$

∴ D7 = value at 256th position + 0.20 (value at 257th position – value at 256th position)
= 60 + 0.20 (60 – 60)
= 60 + 0
= ₹ 60
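For a discrete frequency distribution, the observation at a given position is read from the cumulative frequencies rather than from a sorted list. The sketch below illustrates this for the pocket money data; the function names are chosen here, and it assumes the position lies between 1 and N.

```python
from bisect import bisect_left
from itertools import accumulate


def value_at_rank(values, cumulative, rank):
    """Value of the observation with the given 1-based integer rank: the
    first value whose cumulative frequency reaches that rank."""
    return values[bisect_left(cumulative, rank)]


def quantile_from_frequencies(values, freqs, k, parts):
    """k-th of `parts` quantiles at position (N + 1) * k / parts, with
    linear interpolation between neighbouring observations."""
    cumulative = list(accumulate(freqs))
    total = cumulative[-1]
    position = (total + 1) * k / parts
    rank = int(position)
    fraction = position - rank
    lower = value_at_rank(values, cumulative, rank)
    if fraction == 0:
        return lower
    upper = value_at_rank(values, cumulative, rank + 1)
    return lower + fraction * (upper - lower)


pocket_money = [15, 20, 25, 40, 60, 100]
days = [34, 39, 70, 70, 76, 76]
q1 = quantile_from_frequencies(pocket_money, days, 1, 4)
q3 = quantile_from_frequencies(pocket_money, days, 3, 4)
d2 = quantile_from_frequencies(pocket_money, days, 2, 10)
d5 = quantile_from_frequencies(pocket_money, days, 5, 10)
d7 = quantile_from_frequencies(pocket_money, days, 7, 10)
print(q1, q3, round(d2, 2), d5, d7)
```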

3.4.6 PERCENTILES
The Pth percentile of a group of observations is that observation below
which lie P% (P percent) of the observations. The position of the Pth
percentile is given by $(n+1) \times \frac{P}{100}$, where 'n' is the
number of data points. If the value of $(n+1) \times \frac{P}{100}$ is
a fraction, we need to interpolate the value.

Formulae: For grouped data,

$$k^{\text{th}} \text{ percentile} = L + \frac{\frac{k}{100} \times N - pcf}{f} \times h$$

Where, L = Lower limit of percentile (quartile or decile) class.
N = Number of observations.
pcf = Cumulative frequency up to previous class.
f = Frequency of percentile (quartile or decile) class.
h = Class width of percentile (quartile or decile) class.
The percentile (quartile or decile) class is the class in which the
$\left(\frac{k}{100} \times N\right)^{\text{th}}$ observation lies.

Example: In a computerized entrance test 20 candidates appear on a
particular day. Their scores are: 9, 6, 12, 10, 13, 15, 16, 14, 14, 16, 17, 16,
24, 21, 22, 18, 19, 18, 20, 17. Find the 80th and 90th percentiles of the data.
Solution: First, we order the data in ascending order.
6, 9, 10, 12, 13, 14, 14, 15, 16, 16, 16, 17, 17, 18, 18, 19, 20, 21, 22, 24.
The 80th percentile of the data set is the observation lying in the position:

$$(n+1) \times \frac{P}{100} = (20+1) \times \frac{80}{100} = 16.8$$

Now, the 16th observation is 19 and the 17th observation is 20. Therefore,
the 80th percentile is the point lying 0.8 of the way from 19 to 20,
which is 19.8.
The 90th percentile is similarly found as the observation lying in position:

$$(n+1) \times \frac{P}{100} = (20+1) \times \frac{90}{100} = 18.9$$

The 18th observation is 21 and the 19th observation is 22. Therefore, the
90th percentile is the point lying 0.9 of the way from 21 to 22, which
is 21.9.

Fill in the blanks:
11. The ‘Median’ is a value of the ................... observation.
12. ................... range is defined as the difference between the first
and third quartile. It is a measure of spread of the data.
13. ................... is directly used in Kelly’s coefficient of skewness.

These are the top ten final scores for the combined results of the
Ladies’ Figure Skating event at the 2010 Winter Olympics:
Figure Skating
Yu-Na Kim 228.56
Mao Asada 205.50
Joanie Rochette 202.64
Mirai Nagasu 190.15
Miki Ando 188.86
Laura Lepisto 187.97
Rahail Flatt 182.49
Akiko Suzuki 181.44
Alena Leonova 172.46
Ksenia Makarova 171.91
Yu-Na Kim of South Korea shattered the world record with a score
18 points higher than the previous record. How would the mean
and median of this group change if we left out her score?


Quartiles are used directly in measures of dispersion to find the
Quartile Deviation (Q.D) and in measures of skewness to calculate
Bowley’s coefficient of skewness.
Deciles are directly used in Kelly’s coefficient of skewness.

3.5 MODE
The Mode of a data set is the value that occurs most frequently.
There are many situations in which arithmetic mean and median
fail to reveal the true characteristics of the data (the most
representative figure), for example, the most common size of shoes,
the most common size of garments, etc. In such cases, the mode is the
best-suited measure of central tendency. There could be multiple modal
values, which occur with equal frequency. In some cases, the mode may
be absent.

Mode is the value which has the greatest frequency density. Mode
is denoted by Z.

For grouped data, the modal class is defined as the class with the
maximum frequency.

3.5.1 CALCULATION OF MODE


Formulae: The mode is calculated as:

$$\text{Mode} = L + \frac{D_1}{D_1 + D_2} \times h$$

Where,
L = Lower limit of the modal class.
$D_1$ = Difference between the frequency of the modal class and that of the
preceding class.
$D_2$ = Difference between the frequency of the modal class and that of the
succeeding class.
h = Size of the modal class.
Example: In a computerized entrance test, 20 candidates appear on a
particular day. Their scores are: 9, 6, 12, 10, 13, 15, 16, 14, 14, 16, 17, 16,
24, 21, 22, 18, 19, 18, 20, 17. Find the mode of the data.
Solution: Now the value 16 occurs 3 times which is maximum for any
observation. Therefore,
Mode = 16
Example: In a computerized entrance test, 20 candidates appear on a
particular day. Their scores are: 9, 6, 12, 10, 13, 15, 14, 14, 16, 17, 16, 24,
21, 22, 18, 19, 18, 20, 17, 8. Find the mode of the data.

Solution: Now the values 14, 16, 17 and 18 occur 2 times which is
maximum for any observation. Therefore,
Modes are 14, 16, 17 and 18 (this is a multimodal distribution).
Example: Find the mode of the following distribution.
Class: 61-65, 66-70, 71-75, 76-80, 81-85, 86-90
Frequency: 7, 9, 11, 7, 2, 3
Solution:

True class limits   Frequency
60.5-65.5               7
65.5-70.5               9   (f0)
70.5-75.5              11   (f1, Modal Class)
75.5-80.5               7   (f2)
80.5-85.5               2
85.5-90.5               3

$$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times c$$

L = 70.5, f1 = 11, f0 = 9, f2 = 7, c = 5

$$\text{Mode} = 70.5 + \frac{11 - 9}{22 - 9 - 7} \times 5 = 70.5 + \frac{10}{6} = 72.17$$
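The grouped-data formula can be sketched as a short function. This is an illustration only: the function name is chosen here, equal class widths are assumed, and a missing neighbouring class (when the modal class is the first or last one) is treated as having frequency 0, which is an assumption of this sketch rather than a rule stated in the text.

```python
def grouped_mode(lower_bounds, width, freqs):
    """Mode of a grouped distribution with equal class width, using
    Mode = L + D1 / (D1 + D2) * h on the class of highest frequency."""
    modal = max(range(len(freqs)), key=freqs.__getitem__)  # first maximum
    f1 = freqs[modal]
    f0 = freqs[modal - 1] if modal > 0 else 0  # preceding class (assumed 0 if absent)
    f2 = freqs[modal + 1] if modal + 1 < len(freqs) else 0  # succeeding class
    d1, d2 = f1 - f0, f1 - f2
    if d1 + d2 == 0:
        raise ValueError("mode is undefined when the three frequencies are equal")
    return lower_bounds[modal] + d1 / (d1 + d2) * width


# True lower class limits and frequencies from the example above.
lower_bounds = [60.5, 65.5, 70.5, 75.5, 80.5, 85.5]
frequencies = [7, 9, 11, 7, 2, 3]
mode = grouped_mode(lower_bounds, 5, frequencies)
print(round(mode, 2))
```

Since D1 = f1 – f0 and D2 = f1 – f2, the expression D1 / (D1 + D2) is the same as (f1 – f0) / (2f1 – f0 – f2), so both forms of the formula give the same value.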
Example: From the following data, calculate mean, median and mode.

X:   50-53  53-56  56-59  59-62  62-65  65-68  68-71  71-74  74-77
F:     3      8     14     30     36     28     16     10      5

Solution:

Mid x     F     d = (x – A)/c     Fd     c.f.
51.5      3         –4           –12       3
54.5      8         –3           –24      11
57.5     14         –2           –28      25
60.5     30         –1           –30      55
63.5     36          0             0      91
66.5     28          1            28     119
69.5     16          2            32     135
72.5     10          3            30     145
75.5      5          4            20     150
Total   150                       16

A = 63.5, c = 3

$$\text{Mean } \bar{X} = A + \frac{\sum Fd}{N} \times c = 63.5 + \frac{16}{150} \times 3 = 63.5 + 0.32 = 63.82$$

N/2 = 150/2 = 75, so the median class is 62-65.
∴ L = 62, f = 36, c.f. = 55, c = 3

$$\text{Median} = L + \frac{\frac{N}{2} - \text{c.f.}}{f} \times c = 62 + \frac{75 - 55}{36} \times 3 = 62 + 1.67 = 63.67$$

The modal class is 62-65, the class with the highest frequency.
∴ L = 62, f1 = 36, f0 = 30, f2 = 28, c = 3

$$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times c = 62 + \frac{36 - 30}{2 \times 36 - 30 - 28} \times 3 = 62 + \frac{6}{14} \times 3 = 62 + 1.29 = 63.29$$

3.5.2 MERITS AND DEMERITS OF MODE


Merits of Mode are:
‰‰ It is easy to understand and easy to calculate. In many cases it
can be located just by inspection.
‰‰ It can be located in situations where the variable is not measurable
but categorization or ranking of observations is possible.
‰‰ Like the median and unlike the mean, it is not affected by extreme
observations. It can be calculated even if these extreme
observations are not known.
‰‰ It can be determined even if the distribution has open end classes.
‰‰ It can be located even when the class intervals are of unequal
width, provided that the widths of the modal class and of its
preceding and following classes are equal.

‰‰ It is a value around which there is more concentration of
observations and hence the best representative of the data.
Limitations of mode are:
‰‰ It is not based on all the observations.
‰‰ It is not capable of further mathematical treatment.
‰‰ In certain cases mode is not rigidly defined and hence, the
important requisite of a good measure of central tendency is not
satisfied.
‰‰ It is much affected by the fluctuations of sampling.
‰‰ It is not easy to calculate unless the number of observations is
sufficiently large and reveals a marked tendency of concentration
around a particular value.
‰‰ It is not suitable when different items of the data are of unequal
importance.
‰‰ It is an unstable average because the mode of a distribution depends
upon the choice of width of class intervals.

3.5.3 GRAPHIC LOCATION OF MODE
The mode of a data set is the value that occurs most frequently. The mode
can be found from the histogram. If the data is discrete (not
grouped), it is very easy to find the mode as the X value of the tallest
column of the histogram. If there is more than one tallest column, the
X values of all the tallest columns are the modes. This is called a
multimodal distribution. If all the values have the same frequency, for
instance one, the distribution is without a mode.
In case of a grouped data, the procedure to find the mode is as follows.
‰‰ First we draw the histogram as explained earlier.
‰‰ Find the highest column (modal class column).
‰‰ Find the points where preceding class and succeeding class
column tops join the modal class column.
‰‰ Join these points to the opposite top corners of the modal column
as diagonals.
‰‰ Abscissa of the point of intersection of these lines gives the mode
for the grouped data.
This is explained in the problem below.
Example: Given the following distribution for overtime, draw a
histogram and estimate the mode from it.

Overtime in hours    4-8   8-12   12-16   16-20   20-24   24-28
Number of workers      4      8      16      18      20      18

Solution:
Using the above data we draw the histogram as shown below:

[Figure: Histogram "Over time for Workers". X-axis: Over Time in Hours; Y-axis: Number of Workers.]

From the histogram we can see that Mode = 22
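The graphical reading can be cross-checked against the formula for the mode of grouped data given earlier. Here the modal class is 20-24, so L = 20, D1 = 20 – 18 = 2, D2 = 20 – 18 = 2 and h = 4:

```latex
\text{Mode} = L + \frac{D_1}{D_1 + D_2} \times h
            = 20 + \frac{2}{2 + 2} \times 4
            = 22
```

Because the classes on either side of the modal class have equal frequencies, the two diagonals cross exactly at the middle of the modal class.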


Fill in the blanks:
14. The ................... of a data set is the value that occurs most
frequently.
15. Mode is much affected by the fluctuations of ................... .
16. Mode can be found out from the ................... .

Create a histogram displaying the top ten women’s figure skating
scores for the 2010 Winter Olympics
Figure Skating
Yu-Na Kim 228.56
Mao Asada 205.5
Joanie Rochette 202.64
Mirai Nagasu 190.15
Miki Ando 188.86
Laura Lepisto 187.97
Rahail Flatt 182.49
Akiko Suzuki 181.44
Alena Leonova 172.46
Ksenia Makarova 171.91
The x-axis needs to span from at least 171 to 229 in order to
accommodate all of the data. Hence find Mode of the data.


The mode is the most “fashionable” size in the sense that it is the
most common and typical, and is defined by Zizek as “the value
occurring most frequently in series of items and around which the
other items are distributed most densely.” In the words of Croxton
and Cowden, the mode of a distribution is the value at the point
where the items tend to be most heavily concentrated. According
to A.M. Tuttle, Mode is the value which has the greatest frequency
density in its immediate neighborhood.

 EMPIRICAL RELATIONSHIP BETWEEN


3.6
MEAN, MEDIAN AND MODE
A distribution in which the mean, the median, and the mode
coincide is known as symmetrical (bell shaped) distribution.

very commonly used.


S
Normal distribution is one such a symmetric distribution, which is

If the distribution is skewed, the mean, the median and the mode
IM
are not equal. In a moderately skewed distribution distance between
the mean and the median is approximately one third of the distance
between the mean and the mode. This can be expressed as:
Mean – Median = (Mean – Mode) / 3
NM

Mode = 3 * Median – 2 * Mean


Thus, if we know the values of any two of these measures, the third can
be approximately determined in any moderately skewed distribution.
In a moderately skewed distribution with a single peak, the median usually
lies between the mean and the mode.
In case of a right-skewed (positively skewed) distribution, which has a
long right tail,
Mode < Median < Mean.
In case of a left-skewed (negatively skewed) distribution, which has a long
left tail,
Mean < Median < Mode.
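The empirical relationship can be tried out with a minimal Python sketch. The function names and the sample values of mean and median are hypothetical and chosen only for illustration; the result is an approximation that holds for moderately skewed distributions.

```python
def empirical_mode(mean, median):
    """Approximate the mode using Mode = 3 * Median - 2 * Mean."""
    return 3 * median - 2 * mean


def skew_direction(mean, median, mode):
    """Classify the skew from the ordering of the three averages."""
    if mode < median < mean:
        return "right-skewed"
    if mean < median < mode:
        return "left-skewed"
    return "symmetric or not clearly ordered"


# Hypothetical values, not taken from the text.
mean, median = 50.0, 48.0
mode = empirical_mode(mean, median)

print(mode)                                # 44.0
print(skew_direction(mean, median, mode))  # right-skewed
```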
Example: The time required in minutes for each of the 50 students to
read 20 pages of a book is recorded below:
43, 52, 49, 36, 48, 41, 47, 50, 32, 45, 48, 40, 43, 48, 36, 51, 44, 49, 53, 37, 34,
42, 47, 45, 47, 44, 50, 31, 48, 43, 45, 44, 36, 49, 51, 43, 53, 46, 39, 50, 42,
47, 38, 51, 46, 40, 38, 45, 47, 42.
1. Classify the data by considering classes 30-34, 35-39 …
2. Find median and mode.
3. Draw histogram and ogive. Also read values of median and mode
from the graph.

Solution: The classified data is shown below.

TIME REQUIRED TO READ 20 PAGES (MINUTES)

Class interval    Frequency
30-34             3
35-39             7
40-44             13
45-49             18
50-54             9
Total             N = 50

Calculations on Discrete Data


The median is the middle value. Since there are 50 observations, it is the
average of the 25th and 26th observations when the data are arranged in
ascending order.
Thus, median = (45 + 45) / 2 = 45
The mode is the value with the highest frequency. From the discrete data we
find that 47 appears 5 times, which is the highest. Hence,
Mode = 47
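The discrete-data results can be checked with a short Python sketch that uses the standard `statistics` module. The variable names are illustrative assumptions; the data are the 50 reading times from the example.

```python
import statistics
from collections import Counter

# Reading times (minutes) of the 50 students, as listed in the example.
reading_times = [
    43, 52, 49, 36, 48, 41, 47, 50, 32, 45, 48, 40, 43, 48, 36, 51, 44,
    49, 53, 37, 34, 42, 47, 45, 47, 44, 50, 31, 48, 43, 45, 44, 36, 49,
    51, 43, 53, 46, 39, 50, 42, 47, 38, 51, 46, 40, 38, 45, 47, 42,
]

# With an even number of observations, the median is the average
# of the two middle values (here the 25th and 26th).
median = statistics.median(reading_times)

# The mode is the most frequent value; Counter gives its frequency.
mode = statistics.mode(reading_times)
frequency_of_mode = Counter(reading_times)[mode]

print(median, mode, frequency_of_mode)
```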

Calculations on Grouped Data


The median class is 45-49, because the cumulative frequency up to class 40-44
is 23 and that for class 45-49 is 41, while the middle position is N/2 = 50/2 = 25.
Now, lower limit of median class L = 45
Total frequency N = 50
Cumulative frequency preceding the median class pcf = 23
Frequency of median class f = 18
Class interval of median class h = 4
Now the median is,
\[ \text{Median } M_d = L + \frac{\frac{N}{2} - pcf}{f} \times h = 45 + \frac{25 - 23}{18} \times 4 = 45 + 0.44 = 45.44 \]
Now the modal class is 45-49. Thus,
Lower limit of modal class L = 45
Difference between the frequency of the modal class and that of the preceding class
D1 = 18 – 13 = 5
Difference between the frequency of the modal class and that of the succeeding class
D2 = 18 – 9 = 9
Size of the modal class h = 4

Now the mode is,
\[ \text{Mode} = L + \frac{D_1}{D_1 + D_2} \times h = 45 + \frac{5}{5 + 9} \times 4 = 45 + 1.43 = 46.43 \]
Part 3: Do it yourself.
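The grouped-data formulas for the median and the mode can be sketched in Python as follows. The function and parameter names are illustrative assumptions; the inputs are the values of L, N, pcf, f, D1, D2 and h used in the example above (including its choice of L = 45 and h = 4).

```python
def grouped_median(lower_limit, total, preceding_cf, class_freq, width):
    """Median = L + ((N/2 - pcf) / f) * h for the median class."""
    return lower_limit + (total / 2 - preceding_cf) / class_freq * width


def grouped_mode(lower_limit, d1, d2, width):
    """Mode = L + (D1 / (D1 + D2)) * h for the modal class."""
    return lower_limit + d1 / (d1 + d2) * width


# Values taken from the worked example (class 45-49).
median = grouped_median(lower_limit=45, total=50, preceding_cf=23,
                        class_freq=18, width=4)
mode = grouped_mode(lower_limit=45, d1=18 - 13, d2=18 - 9, width=4)

print(round(median, 2))  # 45.44
print(round(mode, 2))    # 46.43
```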

Fill in the blanks:


17. A distribution in which the mean, the median, and the mode
coincide is known as .................(bell shaped) distribution.
18. In case of ................. distribution which has a long right tail,
Mode < Median < Mean.

These are the scores from last week’s geometry test:
90, 94, 53, 68, 79, 84, 87, 72, 70, 69, 65, 89, 85, 83, 72
You earned a score of 72. Your mom asks you how you did on the
test compared to the rest of the class. Calculate the three measures
of the average, and decide what to tell your mom.

3.7 LIMITATIONS OF CENTRAL TENDENCY


No single average can be regarded as the best or most suitable under
all circumstances. Each average has its merits and demerits and its
own particular field of importance and utility.
A proper selection of an average depends on the (1) nature of the data
and (2) purpose of enquiry or requirement of the data.
A.M. satisfies almost all the requisites of a good average and hence
can be regarded as the best average, but it cannot be used:
• In case of highly skewed data.
• In case of uneven or irregular spread of the data.
• In open end distributions.
• When average growth or average speed is required.
• When there are extreme values in the data.
Except in these cases, A.M. is widely used in practice.
Median is the best average in open end distributions or in distributions
which give highly skewed, J-shaped or reverse J-shaped frequency curves. In such
cases A.M. gives an unduly high or low value, whereas the median
gives a more representative value.
• In case of a fairly symmetric distribution, however, the mean, median and
mode are very close to each other.

• Mode is especially useful to describe qualitative data. According
to Freund and Williams, consumer preferences for different
kinds of products can be compared using modal preferences, as
we cannot compute a mean or median for such data. Mode can best describe the
average size of shoes or shirts.
• G.M. is useful for averaging relative changes, ratios and
percentages. It is theoretically the best average for the construction
of index numbers. But it should not be used for measuring absolute
changes.
• H.M. is useful in problems where values of a variable are
compared with a constant quantity of another variable, such as time,
distance travelled within a given time, or quantities purchased or
sold per unit.
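The difference between A.M., G.M. and H.M. can be illustrated with a minimal sketch using Python's standard `statistics` module. The growth factors and speeds below are hypothetical values chosen only for illustration.

```python
import statistics

# Hypothetical yearly growth factors: +10%, +20% and -5%.
growth_factors = [1.10, 1.20, 0.95]
# G.M. gives the constant factor that produces the same overall growth.
average_growth = statistics.geometric_mean(growth_factors)

# Hypothetical speeds (km/h) over two stretches of equal distance.
speeds = [40, 60]
# H.M. gives the true average speed over the whole journey.
average_speed = statistics.harmonic_mean(speeds)
# A.M. overstates the average speed in this situation.
arithmetic_speed = statistics.mean(speeds)

print(round(average_growth, 4))
print(round(average_speed, 2), arithmetic_speed)
```

In this sketch the harmonic mean of the two speeds is about 48 km/h, lower than the arithmetic mean of 50 km/h, because more time is spent travelling at the lower speed.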

Fill in the blanks:

19. ................... cannot be used in case of highly skewed data.
20. In case of fairly symmetric distribution mean, median and
mode are very ................... to each other.

On his first three quizzes, Patrick earned a 15, 18, and 16.  (A perfect
score would have been 20 points.) What does he need to earn on the
next quiz to have a mean score of at least 17?

In general we can say that A.M. is the best of all averages as it satisfies
almost all requirements of an ideal measure of central tendency
and other averages may be used under special circumstances.

3.8 SUMMARY
• Measures of central tendency give one of the most important
characteristics of the data. According to the situation, one of the
various measures of central tendency may be chosen as the most
representative.
• Arithmetic mean is widely used and understood. The question in any
given situation is what characterizes each of the three measures of
centrality and what the relative merits of each are.
• Mean summarizes all the information in the data. Mean can be
visualized as a single point where all the mass (the weight) of
the observations is concentrated. It is like a centre of gravity in
physics. Mean also has some desirable mathematical properties
that make it useful in the context of statistical inference.

• To simplify the manual calculation, we may sometimes use shift
of origin and change of scale. Shifting of origin is achieved by
adding or subtracting a constant to all observations. In case of
discrete data we add or subtract (usually subtract) a constant to
the individual observations, whereas for grouped data we add or
subtract (usually subtract) the constant to the class mark values.
• There are cases where relative importance of the different items
is not the same. In such a case, we need to compute the weighted
arithmetic mean. The procedure is similar to the grouped data
calculations studied earlier, when we consider frequency as a
weight associated with the class-mark.
• Median is the middle value when the data is arranged in order.
The median is resistant to the extreme observations. Median is
like the geometric centre in physics. In case we want to guard
against the influence of a few outlying observations (called
outliers), we may use the median.
• Quantiles are related positional measures of central tendency.
These are useful and frequently employed measures. Most
familiar quantiles are Quartiles, Deciles, and Percentiles.
• Quartiles are position values similar to the Median. There are
three quartiles denoted by Q1, Q2 and Q3. Q1 is called the lower
quartile or first quartile. The second quartile Q2 is nothing but
the median. In a distribution, one fourth of the items are less
than Q1 and the other three fourths are greater than Q1. Q3 is called the
upper quartile or the third quartile.
• Inter-quartile range is defined as the difference between the third
and first quartiles (Q3 – Q1). It is a measure of spread of the data.
• D1, D2, D3, … and D9 are the nine deciles. They divide a series
into 10 equal parts. One tenth of the items are less than or equal
to D1, one tenth of the items are more than or equal to D9, and
one tenth of the items lie between any successive pair of deciles
when all the items are in ascending order.
• The Pth percentile of a group of observations is that observation
below which lie P% (P percent) of the observations. The position of the
Pth percentile is given by (n + 1) × P / 100, where ‘n’ is the number of
data points.
• If the value of (n + 1) × P / 100 is a fraction, we need to interpolate
the value.
• The Mode of a data set is the value that occurs most frequently.
There are many situations in which arithmetic mean and
median fail to reveal the true characteristics of the data (most
representative figure), for example, most common size of shoes,
most common size of garments etc. In such cases, mode is the
best-suited measure of the central tendency.

• A distribution in which the mean, the median, and the mode
coincide is known as symmetrical (bell shaped) distribution.
Normal distribution is one such symmetric distribution, which
is very commonly used.
In a moderately skewed distribution, the relationship between the three
measures can be expressed as:
    Mean – Median = (Mean – Mode) / 3
    Mode = 3 * Median – 2 * Mean
• No single average can be regarded as the best or most suitable
under all circumstances. Each average has its merits and
demerits and its own particular field of importance and utility. A
proper selection of an average depends on the (1) nature of the
data and (2) purpose of enquiry or requirement of the data.

• Arithmetic Mean: It is defined as the sum of all values divided
by the number of values.
• Median: Median of a set of values is the value which is the
middle most when they are arranged in an ascending order.
• Mode: Mode is the value which has highest frequency.
• Quartile: When the distribution is divided into four equal
portions, then we get first quartile (Q1), second quartile (Q2)
and third quartile (Q3) as the positional averages.
• Percentile: Pth percentile of a group of observations is that
observation below which lie P% (P percent) observations.
• Inter-quartile Range: It is defined as the difference between
the first and third quartile. It is a measure of spread of the data.

3.9 DESCRIPTIVE QUESTIONS


1. What do you understand by measures of Central Tendency? What
are the characteristics of a good measure of central tendency?
2. What are the common measures of central tendency? Explain
with examples.
3. Define arithmetic mean and give its formulae for calculation in
grouped and ungrouped data.
4. What are the properties of arithmetic mean? Justify with examples.
5. What is the effect of shift of origin and change of scale on
calculation of arithmetic mean?
6. Define median. How do you calculate median for grouped and
ungrouped data?
7. What are partition values or positional measures of data? Explain
them with the help of examples.

8. What do you understand by mode? How do you calculate it for a
continuous data set? How will you find mode from Histogram?
9. Explain the limitations of central tendency.
10. Explain the empirical relation between mean, median and mode.

EXERCISE FOR PRACTICE


1. Compute mean, median, mode, quartiles and 90th percentile for the
data given below:

22 21 37 33 28 42 56 33 32 59
40 47 29 65 45 48 55 43 42 40
37 39 56 54 38 49 60 37 28 27
32 33 47 36 35 42 43 55 53 48
29 30 32 37 43 54 55 47 38 62

2. Compute mean, median, mode, quartiles and 90th percentile for
the grouped data of age (years) of employees given below:

Class Interval    20-30    30-40    40-50    50-60    60-70
Frequency         7        16       15       9        3

3. Compute the arithmetic mean for the following frequency
distribution:

Profit (₹) per Shop    0-100    100-200    200-300    300-400    400-500
No. of Shops           20       36         50         30         14

4. Compute median for the following data:

No. of units of Electricity Consumed    0-200    200-400    400-600    600-800    800-1000
No. of Families                         5        10         34         21         10

5. The monthly income (in ₹) of 7 families in a village is as follows:
1200, 1000, 1100, 1250, 950, 1100, 1350
Calculate median and mode.

3.10 ANSWERS AND HINTS


ANSWERS FOR SELF ASSESSMENT QUESTIONS
Topic                                     Q. No.   Answers
Characteristics of Central Tendency       1.       Central
                                          2.       Uniquely
                                          3.       Median
Arithmetic Mean                           4.       True
                                          5.       False
                                          6.       False
                                          7.       False
                                          8.       True
                                          9.       False
                                          10.      True
Median                                    11.      Middle
                                          12.      Inter-quartile
                                          13.      Deciles
Mode                                      14.      Mode
                                          15.      Sampling
                                          16.      Histogram
Empirical Relationship between Mean,      17.      Symmetrical
Median and Mode                           18.      Right (or positive)-skewed
Limitations of Central Tendency           19.      Arithmetic mean
                                          20.      Close

HINTS FOR DESCRIPTIVE QUESTIONS


1. Refer Section 3.2
Measure of central tendency enables us to get an idea of entire
data from a single value at which we consider the entire data is
concentrated. This single value could be used to represent the
entire population. A good measure of central tendency should
possess as far as possible the following characteristics:
(a) Easy to understand
(b) Simple to compute
2. Refer Section 3.2
There are three common measures of central tendency:
(a) Mean: The average value
(b) Median: The middle value
(c) Mode: Most occurring value
3. Refer Section 3.3
The arithmetic mean of a series is the quotient obtained by
dividing the sum of the values by the number of items. In algebraic
language, if x1, x2, x3, ..., xN are the N values of a variate X, their
arithmetic mean is denoted by X̄ (X bar) or μ, depending on whether the
data is a sample or a population. Thus,
\[ \mu = \frac{x_1 + x_2 + x_3 + \dots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N} \]
4. Refer Section 3.3.1
Properties of arithmetic mean are as follows:
(a) The sum of the deviations of all the values of x from their
arithmetic mean is zero.
(b) The product of the arithmetic mean and the number of items
gives the total of all items.
(c) If x̄1 and x̄2 are the arithmetic means of two samples of sizes
n1 and n2 respectively, then the arithmetic mean x̄ of the
distribution combining the two can be calculated as
\[ \bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2} \]
5. Refer Section 3.3.2
If a constant is subtracted from or added to all data points, the
Arithmetic Mean (AM) is reduced or increased by that amount.
In this method we first subtract a suitable constant from all the
observations, calculate the mean and then add the same constant
to the answer to get the actual value of the mean.
6. Refer Section 3.4
Median is the value which divides the distribution of data,
arranged in ascending or descending order, into two equal parts.
Thus, the ‘Median’ is the value of the middle observation.
When the series is arranged in order of size or magnitude, and if
the total number of observations N is odd,
\[ \text{Median } (M_d) = \left( \frac{N + 1}{2} \right)^{\text{th}} \text{ observation.} \]
7. Refer Section 3.4.3
Quantiles are related positional measures of central tendency.
These are useful and frequently employed measures. Most
familiar quantiles are Quartiles, Deciles, and Percentiles.
Quartiles are position values similar to the Median. There are
three quartiles denoted by Q1, Q2 and Q3. Q1 is called the lower
quartile or first quartile. The second quartile Q2 is nothing but
the median. In a distribution, one fourth of the items are less
than Q1 and the other three fourths are greater than Q1. Q3 is called
the upper quartile or the third quartile. D1, D2, D3, … and D9 are
the nine deciles. They divide a series into 10 equal parts. One
tenth of the items are less than or equal to D1. The Pth percentile of
a group of observations is that observation below which lie P%
(P percent) of the observations. The position of the Pth percentile is given
by (n + 1) × P / 100, where ‘n’ is the number of data points.
8. Refer Section 3.5
Mode is the value which has the greatest frequency density.
Mode is denoted by Z.
For grouped data, the modal class is defined as the class with the
maximum frequency.
The mode is calculated as:
\[ \text{Mode} = L + \frac{D_1}{D_1 + D_2} \times h \]
9. Refer Section 3.7
No single average can be regarded as the best or most suitable
under all circumstances. Each average has its merits and demerits
and its own particular field of importance and utility.
A proper selection of an average depends on the (1) nature of the
data and (2) purpose of enquiry or requirement of the data. A.M.
satisfies almost all the requisites of a good average and hence can
be regarded as the best average, but it cannot be used:
(a) In case of highly skewed data.
(b) In case of uneven or irregular spread of the data.

10. Refer Section 3.6


If the distribution is skewed, the mean, the median and the mode
are not equal. In a moderately skewed distribution distance
between the mean and the median is approximately one third
of the distance between the mean and the mode. This can be
expressed as:
Mean – Median = (Mean – Mode) / 3
Mode = 3 * Median – 2 * Mean

ANSWERS FOR EXERCISE FOR PRACTICE


1. Mean = 41.86, Median = 41, Mode = 37, 1st Quartile = 33, 2nd
Quartile = 41, 3rd Quartile = 48.75, 90th Percentile = 56
2. Mean = 42, Median = 41.33, Mode = 39, 1st Quartile = 33.44, 2nd
Quartile = 41.33, 3rd Quartile = 49.67, 90th Percentile = 57.78
3. Mean: 238
4. Median: 547.058
5. Median = 1100, Mode = 1100
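The answers to Exercise 1 can be checked with a short Python sketch. The helper names are illustrative assumptions. The default convention is the position formula (n + 1) × P / 100 described in this chapter; an alternative position, 1 + (n – 1) × P / 100, is included because it appears to be the convention behind the printed third quartile of 48.75, whereas the (n + 1) formula gives 50.

```python
import statistics


def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional position (linear interpolation)."""
    whole = int(position)
    fraction = position - whole
    lower = ordered[whole - 1]
    if fraction == 0:
        return lower
    return lower + fraction * (ordered[whole] - lower)


def percentile(values, p, inclusive=False):
    """Pth percentile at position (n + 1)P/100, or 1 + (n - 1)P/100 if inclusive."""
    ordered = sorted(values)
    n = len(ordered)
    position = 1 + (n - 1) * p / 100 if inclusive else (n + 1) * p / 100
    return value_at_position(ordered, position)


# Data of Exercise 1.
data = [
    22, 21, 37, 33, 28, 42, 56, 33, 32, 59,
    40, 47, 29, 65, 45, 48, 55, 43, 42, 40,
    37, 39, 56, 54, 38, 49, 60, 37, 28, 27,
    32, 33, 47, 36, 35, 42, 43, 55, 53, 48,
    29, 30, 32, 37, 43, 54, 55, 47, 38, 62,
]

print(statistics.mean(data), statistics.median(data), statistics.mode(data))
print(percentile(data, 25), percentile(data, 75), percentile(data, 90))
print(percentile(data, 75, inclusive=True))  # 48.75
```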


3.11 SUGGESTED READINGS FOR REFERENCE


SUGGESTED READINGS
• S Jaisankar, Quantitative Techniques for Management Computer
based Problem Solving, Excel Books, 2005
• R Selvaraj, Quantitative Methods in Management, Problems and
Solutions, Excel Books, 2008
• J K Sharma, Fundamentals of Business Statistics, 2010
• Bierman H., Bonini C.P., and Hausman W.H., Quantitative
Analysis for Business Decisions, Homewood, Illinois: Richard D.
Irwin, Inc., 1973
• Gallagher, C.A. and Watson, H.J., Quantitative Methods for
Business Decisions, McGraw Hill, Inc., 1976
• D P Apte, Probability and Combinatorics, Excel Books, 2007
• Gordon, G., and Pressman I., Quantitative Decision Making for
Business, New Delhi: National Publishing House, 1983
• Lapin, L., Quantitative Methods for Business Decisions, New
York: Harcourt Brace Jovanovich, Inc., 1976

E-REFERENCES
• http://math.about.com/
• http://www.calculatorsoup.com/
• http://www.mathgoodies.com/



CHAPTER 4

MEASURES OF DISPERSION

CONTENTS
4.1 Introduction
4.2 Characteristics of Measures of Dispersion
4.3 Absolute and Relative Measures of Dispersion
4.4 Range
    4.4.1 Merits and Demerits of Range
    4.4.2 Uses of Range
4.5 Inter-quartile Range and Deviations
    4.5.1 Inter-quartile Range
    4.5.2 Quartile Deviation
    4.5.3 Mean Deviation
4.6 Variance and Standard Deviation
    4.6.1 Different Formulae for Calculating Variance
    4.6.2 Calculation of Standard Deviation
    4.6.3 Properties of Standard Deviation
    4.6.4 Merits and Demerits of Standard Deviation
    4.6.5 Standard Deviation of Combined Means
    4.6.6 Coefficient of Variation
    4.6.7 Empirical Relationship between different Measures of Variation
4.7 Summary
4.8 Descriptive Questions
4.9 Answers and Hints
4.10 Suggested Readings for Reference


INTRODUCTORY CASELET

VARIABILITY IS IMPORTANT

A brief story may help the reader to see why variability is often
important. Some years ago a company was producing nickel
powder, which varied considerably in particle size. A metallurgical
engineer in technical sales was given the task of developing new
customers in the alloy steel industry for the powder. Some potential
buyers said they would pay a premium price for a product that was
more closely sized. After some discussion with the management of
the plant, specifications for three new products were developed:
fine powder, medium powder, and coarse powder.

An order was obtained for fine powder. Although the specifications
for this fine powder were within the size range of powder which
had been produced in the past, the engineers in the plant found
that very little of the powder produced at their best guess of the
optimum conditions would satisfy the specifications. Thus, the
mean size of the specification was satisfactory, but the specified
variability was not satisfactory from the point of view of production.
To make production of fine powder more practical, it was necessary
to change the specifications for “fine powder” to correspond to a
larger standard deviation. When this was done, the plant could
produce fine powder much more easily (but the customer was not
willing to pay such a large premium for it!).


After studying this chapter, you should be able to:


  Understand absolute and relative measures of variation
  Learn about range and inter-quartile range
  Discuss variance, standard deviation, mean deviation and
coefficient of variation
  Study the empirical relationship between different measures
of variation

4.1 INTRODUCTION
Different series may possess different dispersions of items around the
average. Measures of central tendency are averages of the first order.
Measures of dispersion are averages of the second order.
A measure of dispersion gives an idea about the extent of lack of
uniformity in the sizes and qualities of the items in a series. It helps
us to know the degree of uniformity and consistency in the series.
If the difference between items is large, the dispersion or variation is
large, and vice versa.

A measure of dispersion or variation in any data shows the extent
to which the numerical values tend to spread about an average.

If the difference between items is small, the average represents and
describes the data adequately. For large differences it is proper to
supplement information by calculating a measure of dispersion in
addition to an average.
Measures of dispersion are useful:
• To compare the current results with the past results.
• To compare two or more sets of observations.
• To suggest methods to control variation in the data.
A study of variations helps us in knowing the extent of uniformity
or consistency in any data. Uniformity in production is an essential
requirement in industry. Quality control methods are based on the
laws of dispersion.

4.2 CHARACTERISTICS OF MEASURES OF DISPERSION
There are a number of measures of variability (or dispersion). Some
of the common measures are Range, Inter-Quartile Range, Quartile
Deviation, Mean Deviation and Standard Deviation.

There are certain pre-requisites or characteristics for a good measure
of dispersion:
• It should be simple to understand.
• It should be easy to compute.
• It should be rigidly defined.
• It should be based on each individual item of the distribution.
• It should be capable of further algebraic treatment.
• It should have sampling stability.
• It should not be unduly affected by the extreme items.

Fill in the blanks:

1. A measure of ................... in any data shows the extent to which
the numerical values tend to spread about an average.
2. ................... control methods are based on the laws of dispersion.
3. Measures of dispersion should not be unduly affected by the
................... items.

List the various measures of dispersion and explain the difference
between them. Which of the measures of dispersion are used in your
day-to-day work, and why?

The simplest meaning that can be attached to the word ‘dispersion’ is a
lack of uniformity in the sizes or quantities of the items of a group
or series. According to Reiglemen, “Dispersion is the extent to
which the magnitudes or quantities of the items differ, the degree
of diversity.” The word dispersion may also be used to indicate the
spread of the data.

4.3 ABSOLUTE AND RELATIVE MEASURES OF DISPERSION
The measures of dispersion can be either ‘absolute’ or ‘relative’.
Absolute measures of dispersion are expressed in the same units in
which the original data are expressed. For example, if the series is
expressed as marks of the students in a particular subject, the absolute
dispersion will provide the value in marks. The only difficulty is that if
two or more series are expressed in different units, the series cannot
be compared on the basis of absolute dispersion.


‘Relative’ or ‘Coefficient’ of dispersion is the ratio or the percentage
of a measure of absolute dispersion to an appropriate average.
The basic advantage of this measure is that two or more series can
be compared with each other despite the fact that they are expressed in
different units.

A precise measure of dispersion is one which gives the magnitude
of the variation in a series, i.e. it measures, in numerical terms, the
extent of the scatter of the values around the average.
When dispersion is measured in terms of the original units of a series,
it is absolute dispersion or variability.
It is difficult to compare absolute values of dispersion in different
series, especially when the series are in different units or have different
sets of values. A good measure of dispersion should have properties
similar to those described for a good measure of central tendency.

TABLE 4.1: ABSOLUTE AND RELATIVE MEASURES OF DISPERSION

Measures of Dispersion      Relative Variability
The Range                   Relative Range
The Quartile Deviation      Relative Quartile Deviation
The Mean Deviation          Relative Mean Deviation
The Median Deviation        Coefficient of Variation
The Standard Deviation
Graphical Method

Fill in the blanks:


4. ................... measures of dispersion are expressed in the same
units in which the original data are expressed.
5. ‘...................’ or ‘...................’ of dispersion is the ratio or
the percentage of a measure of absolute dispersion to an
appropriate average.

Choose any work situation from your life and differentiate the
relative and absolute measures of dispersion which you use. Which
of them is more helpful and why?


Theoretically, the ‘absolute measure’ of dispersion is better. But from
a practical point of view, the relative measure or coefficient of dispersion is
considered better, as it is used to make comparisons between series.

4.4 RANGE

The ‘Range’ of the data is the difference between the largest value
of data and smallest value of data.

Formulae: Range = L – S
Where L: Largest value and S: Smallest value
This is an absolute measure of variability. However, if we have to
compare two sets of data, ‘Range’ may not give a true picture. In such
a case, a relative measure of range, called the coefficient of range, is used.
This is given by,
Formulae:
\[ \text{Coefficient of range} = \frac{L - S}{L + S} \]
In individual observations and discrete series, L and S are easily
identified. In continuous series, either of the following two methods is
used:
Method 1: L - Upper boundary of the highest class.
S - Lower boundary of the lowest class.
Method 2: L - Mid value of the highest class.
S - Mid value of the lowest class.
Example: Find the range and the coefficient of range for the set of
observations 10 5 8 11 12 9
Solution: L = 12, S = 5
Range = L – S
= 12 – 5
= 7
\[ \text{Coefficient of range} = \frac{L - S}{L + S} = \frac{12 - 5}{12 + 5} = \frac{7}{17} = 0.4118 \]
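The calculation above can be reproduced with a minimal Python sketch. The function names are illustrative assumptions; the data are the six observations of the example.

```python
def value_range(values):
    """Range = L - S, the largest value minus the smallest value."""
    return max(values) - min(values)


def coefficient_of_range(values):
    """Coefficient of range = (L - S) / (L + S)."""
    largest, smallest = max(values), min(values)
    return (largest - smallest) / (largest + smallest)


observations = [10, 5, 8, 11, 12, 9]

print(value_range(observations))                     # 7
print(round(coefficient_of_range(observations), 4))  # 0.4118
```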
4.4.1 MERITS AND DEMERITS OF RANGE
Merits
• Range is the simplest method of studying dispersion.
• It takes less time to compute the ‘absolute’ and ‘relative’ range.
• The concept of range is useful in the field of quality control and
to study the variations in the prices of the shares, etc.

Demerits
• Range does not take into account all the values of a series, i.e. it
considers only the extreme items and the middle items are not given
any importance. Therefore, Range cannot tell us anything about
the character of the distribution.
• Range cannot be computed in the case of an ‘open end’ distribution,
i.e., a distribution where the lower limit of the first group or the
upper limit of the highest group is not given.

4.4.2 USES OF RANGE


Uses of range are described as follows:
• It is commonly used in statistical quality control to decide the
upper and lower control limits for the control chart.
• Since it is very easy to measure, it is used for calculating the
acceptable limits of a product with the help of ‘Control Charts’.
• To study fluctuation in prices over a period, say a week or a
month or a year, for example, the 52-week high/low of share prices
given in newspapers.
• Weather forecast indicators like maximum and minimum
temperatures, maximum and minimum rainfall in a particular
year, etc.

Fill in the blanks:

6. The ................... of the data is the difference between the largest
value of data and smallest value of data.
7. The concept of range is useful in the field of ................... control
and to study the variations in the prices of the shares, etc.

Find the range of global observed sea surface temperatures at
each grid point over the time period December 2013 to the present.
Collect data from secondary sources.

4.5 INTER-QUARTILE RANGE AND DEVIATIONS
Inter-quartile range and deviations are described in the following
subsections.

4.5.1 INTER-QUARTILE RANGE

Inter-quartile range is the difference between the upper quartile (third
quartile) and the lower quartile (first quartile).

Thus,
Inter Quartile Range = (Q3 – Q1)

4.5.2 QUARTILE DEVIATION

Quartile Deviation is half of the difference between the upper
quartile and the lower quartile.

Formulae: Thus,
\[ \text{Quartile Deviation} = QD = \frac{Q_3 - Q_1}{2} \]
Quartile Deviation (QD) also gives the average deviation of the upper and
lower quartiles from the median.
\[ QD = \frac{Q_3 - Q_1}{2} = \frac{(Q_2 - Q_1) + (Q_3 - Q_2)}{2} \]
The relative measure of Quartile Deviation is called the Coefficient of
Quartile Deviation. It is defined as,
\[ \text{Coefficient of QD} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \]
As compared to the range, the inter-quartile range and the quartile
deviation are more resistant to the extreme values in the data. This is
the only measure of variability which can be used for an open-ended
distribution.
The main limitation of both the Inter-Quartile Range and the Quartile
Deviation is that they ignore the first 25% and the last 25% of the observations.

Example: Weekly wages of a laborer are given below. Calculate Q.D.
and coefficient of Q.D.

Weekly wages (₹):    100    200    400    500    600    Total
No. of Weeks:        5      8      21     12     6      52

Solution:

Weekly Wages (₹)    No. of Weeks    Cum. Freq.
100                 5               5
200                 8               13
400                 21              34
500                 12              46
600                 6               52
                    N = 52


\[ \text{Position of } Q_1 = \frac{N + 1}{4} = \frac{52 + 1}{4} = 13.25 \]
∴ Q1 = 13th value + 0.25 (14th value – 13th value)
= 200 + 0.25 (400 – 200)
= 200 + 0.25 × 200
= 200 + 50
= 250
\[ \text{Position of } Q_3 = 3 \left( \frac{N + 1}{4} \right) = 3 \times 13.25 = 39.75 \]
∴ Q3 = 39th value + 0.75 (40th value – 39th value)
= 500 + 0.75 (500 – 500)
= 500 + 0.75 × 0
= 500
\[ \therefore \text{Q.D.} = \frac{Q_3 - Q_1}{2} = \frac{500 - 250}{2} = \frac{250}{2} = 125 \]
\[ \therefore \text{Coefficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{500 - 250}{500 + 250} = \frac{250}{750} = 0.3333 \]
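The steps of this example can be sketched in Python as follows. The helper name `value_at_position` is an illustrative assumption; the sketch expands the frequency table into individual observations and reads the quartiles at positions (N + 1)/4 and 3(N + 1)/4 with linear interpolation, as in the solution above.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional position (linear interpolation)."""
    whole = int(position)
    fraction = position - whole
    lower = ordered[whole - 1]
    if fraction == 0:
        return lower
    return lower + fraction * (ordered[whole] - lower)


wages = [100, 200, 400, 500, 600]
weeks = [5, 8, 21, 12, 6]

# Expand the discrete frequency distribution into 52 ordered observations.
observations = []
for wage, count in zip(wages, weeks):
    observations.extend([wage] * count)

n = len(observations)
q1 = value_at_position(observations, (n + 1) / 4)      # position 13.25
q3 = value_at_position(observations, 3 * (n + 1) / 4)  # position 39.75

quartile_deviation = (q3 - q1) / 2
coefficient_of_qd = (q3 - q1) / (q3 + q1)

print(q1, q3, quartile_deviation, round(coefficient_of_qd, 4))
```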

Example: Find the quartile deviation and the quartile coefficient of
dispersion for the following data.

Class:        0-10    10-20    20-30    30-40    40-50    50-60    60-70
Frequency:    8       20       34       46       28       14       10

Solution:

Class    Frequency    Cumulative Frequency
0-10     8            8
10-20    20           28
20-30    34           62     Q1 class
30-40    46           108    Median class
40-50    28           136    Q3 class
50-60    14           150
60-70    10           160
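Here N = 160. In each formula, L is the lower limit of the class
containing the required value, c.f. is the cumulative frequency of the
preceding class, f is the frequency of that class and i is the class
width.

$$Q_1 = L_1 + \frac{\frac{N}{4} - c.f.}{f} \times i = 20 + \frac{40 - 28}{34} \times 10 = 20 + 3.53 = 23.53$$

$$\text{Median} = L + \frac{\frac{N}{2} - c.f.}{f} \times i = 30 + \frac{80 - 62}{46} \times 10 = 30 + 3.91 = 33.91$$

$$Q_3 = L_3 + \frac{\frac{3N}{4} - c.f.}{f} \times i = 40 + \frac{120 - 108}{28} \times 10 = 40 + 4.29 = 44.29$$

$$\text{Quartile deviation} = \frac{Q_3 - Q_1}{2} = \frac{44.29 - 23.53}{2} = \frac{20.76}{2} = 10.38$$

$$\text{Quartile Coefficient of Dispersion} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{44.29 - 23.53}{44.29 + 23.53} = \frac{20.76}{67.82} = 0.31$$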
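The interpolation formula for a continuous frequency distribution can be sketched as follows. The function name `grouped_quantile` is hypothetical, and the sketch assumes classes of equal width listed in increasing order.

```python
def grouped_quantile(lower_limits, width, freqs, fraction):
    """Interpolated quantile for a continuous frequency distribution.

    Applies L + ((fraction * N - c.f.) / f) * i, where c.f. is the cumulative
    frequency before the quantile class and f is that class's frequency.
    """
    n = sum(freqs)
    target = fraction * n
    cum_before = 0
    for lower, f in zip(lower_limits, freqs):
        # The first class whose cumulative frequency reaches the target
        # is the quantile class.
        if cum_before + f >= target:
            return lower + (target - cum_before) / f * width
        cum_before += f
    raise ValueError("fraction must lie between 0 and 1")


lowers = [0, 10, 20, 30, 40, 50, 60]
freqs = [8, 20, 34, 46, 28, 14, 10]
q1 = grouped_quantile(lowers, 10, freqs, 0.25)
median = grouped_quantile(lowers, 10, freqs, 0.50)
q3 = grouped_quantile(lowers, 10, freqs, 0.75)
print(round(q1, 2), round(median, 2), round(q3, 2))
print(round((q3 - q1) / 2, 2), round((q3 - q1) / (q3 + q1), 2))
```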

4.5.3 MEAN DEVIATION

Mean deviation is the arithmetic mean of the absolute deviations of


the values about their arithmetic mean or median or mode.

Mean Deviation (MD) is the average of the absolute deviations of the
observations from the data mean (or the median or the mode). It
indicates how spread out the data are. If x1, x2, …, xN are N
observations, then,
$$\text{Mean Deviation (MD)} = \frac{\sum |d_i|}{N} = \frac{\sum |x_i - \text{Average}|}{N}$$
Where,
di = Deviation of each observation = xi – Average
Average used for calculating deviation can be the mean, the median
or the mode. However, usually the mean is used. There is also an
advantage of taking deviations from the median, because ‘Mean
Deviation’ from median is lowest as compared to any other ‘Mean
Deviations’. Since absolute values of deviations ignoring sign are
taken for calculating Mean Deviation, the mean deviation is not
amenable to further algebraic treatment.

The relative measure corresponding to the ‘Mean Deviation’ is


‘coefficient of Mean Deviation’.

It is defined as:
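$$\text{Coefficient of mean deviation} = \frac{\text{Mean Deviation}}{\text{Mean or Median or Mode}}$$
It can also be expressed in percentage by multiplying it with 100.
Formulae:
Coefficient of Mean deviation (about mean)
$$= \frac{\text{Mean deviation about Mean}}{\text{Mean}} = \frac{\sum |X - \bar{X}| / N}{\bar{X}}$$
Coefficient of Mean deviation (about Median)
$$= \frac{\text{Mean deviation about Median}}{\text{Median}} = \frac{\sum |X - M| / N}{M}$$
Coefficient of Mean deviation (about Mode)
$$= \frac{\text{Mean deviation about Mode}}{\text{Mode}} = \frac{\sum |X - Z| / N}{Z}$$
Here M denotes the median and Z the mode.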
Example: Calculate mean deviation about the mean for the following:

12 7 9 7 7 4 10 9 15 20
Solution:
$$\bar{X} = \frac{12 + 7 + 9 + 7 + 7 + 4 + 10 + 9 + 15 + 20}{10} = \frac{100}{10} = 10$$

$$\text{Mean deviation about the mean} = \frac{\sum |X - \bar{X}|}{n} = \frac{2 + 3 + 1 + 3 + 3 + 6 + 0 + 1 + 5 + 10}{10} = \frac{34}{10} = 3.4$$
Example: Find the mean deviation about the mean for the following
data

X: 2 4 6 8 10
f: 1 4 6 4 1

Solution:

X       f       fX      |X − X̄|     f |X − X̄|
2       1       2       4           4
4       4       16      2           8
6       6       36      0           0
8       4       32      2           8
10      1       10      4           4
Total   N = 16  96                  24

$$\bar{X} = \frac{\sum fX}{N} = \frac{96}{16} = 6$$

$$\text{Mean Deviation about Mean} = \frac{\sum f\,|X - \bar{X}|}{N} = \frac{24}{16} = 1.50$$
Example: Find the mean deviation about the mean for the following
data

Class:       0-5   5-10   10-15   15-20   20-25
Frequency:   3     5      12      6       4
Solution:

Class    Mid X    f     d = (X − A)/C    fd     |X − X̄|    f |X − X̄|
0-5      2.5      3     –2               –6     10.5       31.5
5-10     7.5      5     –1               –5     5.5        27.5
10-15    12.5     12    0                0      0.5        6.0
15-20    17.5     6     1                6      4.5        27.0
20-25    22.5     4     2                8      9.5        38.0
Total             30                     3                 130.0

$$A = 12.5; \quad C = 5; \quad d = \frac{X - A}{C} = \frac{X - 12.5}{5}$$

$$\bar{X} = A + \frac{\sum fd}{N} \times C = 12.5 + \frac{3}{30} \times 5 = 13$$

The absolute deviations |X − X̄| are already measured in the original
units, so no multiplication by the class width C is required.

$$\text{Mean Deviation about the Mean} = \frac{\sum f\,|X - \bar{X}|}{N} = \frac{130}{30} = 4.33$$

State whether the following statements are true/false:


8. Inter-quartile range is a difference between upper quartile
(third quartile) and lower quartile (first quartile).
9. Quartile Deviation is the average of the sum between upper
quartile and lower quartile.
10. Absolute measure of Quartile Deviation is called the Coefficient
of Quartile Deviation.

11. Mean deviation is the arithmetic mean of the absolute
deviations of the values about their arithmetic mean or median
or mode.
12. The relative measure corresponding to the ‘Mean Deviation’ is
‘coefficient of Mean Deviation’.

These are the average temperatures (°F) in Miami, Florida, for each
month of the year.
Jan Feb Mar Apr May Jun Jul Aug Sept Oct Nov Dec
67.2 68.5 71.7 75.2 78.7 81.4 82.6 82.8 81.9 78.3 73.6 69.1
Find the inter-quartile range, Quartile deviation and mean
deviation of this data.

Average deviation takes into account all the items of a series and
hence, it provides sufficiently representative results. It simplifies
calculations since all signs of the deviations are taken as positive.
Average Deviation may be calculated either by taking deviations
from Mean or Median or Mode. Average Deviation is less affected
by extreme items than the standard deviation.

4.6 VARIANCE AND STANDARD DEVIATION


Variance and standard deviation are two more commonly used and
powerful measures of variability. They are more useful than other
measures because they use all the information available in the data
set and are amenable to mathematical treatment.


Variance is defined as the average of squared deviation of data


points from their mean.

When the data constitute a sample, the variance is denoted by $s_x^2$, and
averaging is done by dividing the sum of the squared deviations from
the mean by ‘n – 1’. (Note that one is subtracted from n because we lose
one degree of freedom by using the mean of the sample. More about
degrees of freedom will be discussed later.) When our observations
constitute the population, the variance is denoted by $\sigma^2$ and we divide
by N for the averaging.

4.6.1 DIFFERENT FORMULAE FOR CALCULATING VARIANCE

$$\text{Sample Variance: } \text{Var}(x) = s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}$$

$$\text{Population Variance: } \text{Var}(x) = \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
Where,
xi for i = 1, 2, …, n are observation values.
X = Sample mean

n = Sample size
μ = Population mean
N = Population size
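Population Variance is,

$$\text{Var}(x) = \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

$$= \frac{\sum_{i=1}^{N} (x_i^2 - 2\mu x_i + \mu^2)}{N} = \frac{\sum_{i=1}^{N} x_i^2 - 2\mu \sum_{i=1}^{N} x_i + \mu^2 \sum_{i=1}^{N} 1}{N}$$

$$= \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2$$

$$\text{Var}(x) = E(X^2) - [E(X)]^2$$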
This formula is very useful for manual calculations.
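As a quick check of this identity, the sketch below compares the shortcut with the variance functions in Python's standard `statistics` module. The data values and the function name `variance_shortcut` are hypothetical and chosen only for illustration.

```python
from statistics import pvariance, variance


def variance_shortcut(data):
    """Population variance computed as E(X^2) - [E(X)]^2."""
    n = len(data)
    mean_of_squares = sum(x * x for x in data) / n
    square_of_mean = (sum(data) / n) ** 2
    return mean_of_squares - square_of_mean


data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations
print(variance_shortcut(data))     # shortcut formula, divides by N
print(pvariance(data))             # population variance, divides by N
print(variance(data))              # sample variance, divides by n - 1
```

The sample variance is slightly larger than the population variance for the same data because the divisor is n – 1 rather than N.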
For grouped data, we need to multiply the class marks $m_i$ (which
represent the observations in each class) by the corresponding class
frequencies $f_i$. Then, the formula for variance becomes:

$$\text{Population Variance} = E(X^2) - [E(X)]^2 = \frac{\sum_{i=1}^{n} f_i \times m_i^2}{N} - \left(\frac{\sum_{i=1}^{n} f_i \times m_i}{N}\right)^2$$

where n is the number of classes and N is the total frequency. In case
of a sample, the sum of squared deviations from the mean is divided by
‘n – 1’ (total frequency less one) instead of ‘N’.

4.6.2 CALCULATION OF STANDARD DEVIATION

Standard Deviation is the root mean square deviation of the values


from their arithmetic mean. S.D. is denoted by symbol σ (read
sigma).

The Standard Deviation (SD) of a set of data is the positive square root
of the variance of the set. This is also referred to as the Root Mean
Square (RMS) value of the deviations of the data points. The SD of a
sample is the square root of the sample variance and is denoted by $s_x$;
the Standard Deviation of a population is the square root of the
variance of the population and is denoted by σ.
Formula for Calculating S.D.
For the set of values x1, x2, …, xn

$$\sigma = \sqrt{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2}$$

If an assumed value A is taken for mean and d = X – A, then

$$\sigma = \sqrt{\frac{\sum d^2}{n} - \left(\frac{\sum d}{n}\right)^2}$$

For a frequency distribution

$$\sigma = \sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} \times C$$

Where, d = (X – A)/C and C is the true class interval.
N = Total frequency.
Example: Find the standard deviation for the following data:

Class Interval 0-10 10-20 20-30 30-40 40-50 50-60 60-70


Frequency 6 14 10 8 1 3 8

Solution: Direct method

Class       Class      Frequency   fi × mi   di =        di²     fi × di²
Interval    Mark mi    fi                    (mi – μ)
0-10        5          6           30        –25         625     3750
10-20       15         14          210       –15         225     3150
20-30       25         10          250       –5          25      250
30-40       35         8           280       5           25      200
40-50       45         1           45        15          225     225
50-60       55         3           165       25          625     1875
60-70       65         8           520       35          1225    9800
Total                  Σfi = 50    1500                          19250

$$\text{Mean} = \mu = \frac{\sum f_i \times m_i}{\sum f_i} = \frac{1500}{50} = 30$$

$$\text{SD} = \sigma = \sqrt{\frac{\sum f_i \times d_i^2}{\sum f_i}} = \sqrt{\frac{19250}{50}} = \sqrt{385} = 19.62$$

Short-cut method: We assume the mean as A = 35. The class width
h = 10

Class       Class      Frequency   di =        d′i =     fi × d′i   fi × d′i²
Interval    Mark mi    fi          (mi – A)    di / h
0-10        5          6           –30         –3        –18        54
10-20       15         14          –20         –2        –28        56
20-30       25         10          –10         –1        –10        10
30-40       35         8           0           0         0          0
40-50       45         1           10          1         1          1
50-60       55         3           20          2         6          12
60-70       65         8           30          3         24         72
Total                  Σfi = 50                          –25        205

$$\text{Mean} = A + \frac{\sum f_i \times d'_i}{\sum f_i} \times h = 35 + \frac{-25}{50} \times 10 = 30$$

$$\text{SD} = \sqrt{\frac{\sum f_i \times d_i'^2}{\sum f_i} - \left(\frac{\sum f_i \times d'_i}{\sum f_i}\right)^2} \times h = \sqrt{\frac{205}{50} - \left(\frac{-25}{50}\right)^2} \times 10 = \sqrt{3.85} \times 10 = 19.62$$
Thus, we get the same answers.
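The agreement between the two methods can be confirmed with a short sketch. The function names `grouped_sd_direct` and `grouped_sd_shortcut` are hypothetical; both treat the class marks as the observations and divide by the total frequency, as in the solution above.

```python
from math import sqrt


def grouped_sd_direct(marks, freqs):
    """Mean and SD from deviations taken about the actual mean."""
    n = sum(freqs)
    mean = sum(f * m for m, f in zip(marks, freqs)) / n
    var = sum(f * (m - mean) ** 2 for m, f in zip(marks, freqs)) / n
    return mean, sqrt(var)


def grouped_sd_shortcut(marks, freqs, assumed_mean, width):
    """Mean and SD from step deviations d' = (m - A) / h."""
    n = sum(freqs)
    steps = [(m - assumed_mean) / width for m in marks]
    mean_step = sum(f * d for d, f in zip(steps, freqs)) / n
    mean_sq_step = sum(f * d * d for d, f in zip(steps, freqs)) / n
    mean = assumed_mean + mean_step * width
    # Multiply by the class width to undo the change of scale.
    sd = sqrt(mean_sq_step - mean_step ** 2) * width
    return mean, sd


marks = [5, 15, 25, 35, 45, 55, 65]
freqs = [6, 14, 10, 8, 1, 3, 8]
direct = grouped_sd_direct(marks, freqs)
shortcut = grouped_sd_shortcut(marks, freqs, assumed_mean=35, width=10)
print(round(direct[0], 2), round(direct[1], 2))
print(round(shortcut[0], 2), round(shortcut[1], 2))
```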

Effect of Shift of Origin and Change of Scale
To simplify the manual calculation we may sometimes use shift of
origin and change of scale. Shifting of origin is achieved by adding
or subtracting a constant to all observations. In case of discrete data
we add or subtract (usually subtract) a constant to the individual
observations. Whereas, for grouped data we add or subtract (usually
subtract) the constant to the class mark values. There is no effect of
shifting origin on standard deviation or variance.
Change of scale is achieved by multiplying or dividing all
observations by a constant. In case of discrete data we multiply or
divide (usually divide) the individual observations by a constant.
Whereas for grouped data, we multiply or divide (usually divide) the
class mark values by the constant (usually the class interval). The
effect is as follows. If all data points are multiplied or divided by
a constant, the standard deviation is multiplied (stretched) or
divided (shrunk) by that amount.
We can use both Change of Origin and Change of Scale together, but
we must correct the answers in the reverse order of the algebraic
operations performed on the data points. In this method, we first
subtract a constant, say A (called assumed mean), from all the
observations or class marks and then divide all the observations by a
suitable constant, say h (usually the class interval for grouped data),
and then calculate the Standard Deviation. Then we multiply the
answer by the same constant h to get the actual value of the Standard
Deviation. For calculating variance, of course, we need to multiply the
calculated answer by h².
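Both effects can be illustrated numerically. In the sketch below the observations, the assumed mean `A` and the scale constant `h` are hypothetical values chosen only for illustration; `pstdev` is the population standard deviation from Python's standard `statistics` module.

```python
from statistics import pstdev

data = [75, 80, 85, 90, 95, 100]   # hypothetical raw observations
A, h = 85, 5                       # assumed mean and scale constant

shifted = [x - A for x in data]         # change of origin only
scaled = [(x - A) / h for x in data]    # change of origin and scale

sd_original = pstdev(data)
sd_shifted = pstdev(shifted)   # unchanged by the shift of origin
sd_scaled = pstdev(scaled)     # shrunk by the factor h

print(sd_original, sd_shifted, sd_scaled * h)
```

Multiplying the scaled result by h recovers the original standard deviation; the variance would need to be multiplied by h squared.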
Example: The weekly salaries of a group of employees are given in
the following table. Find the mean and S.D. of the salaries.

Salaries (in ₹):   75   80   85   90   95   100
No. of Persons:    3    7    18   12   6    4
Solution:

X       f      d = (X − A)/C    fd     fd²
75      3      –2               –6     12
80      7      –1               –7     7
85      18     0                0      0
90      12     1                12     12
95      6      2                12     24
100     4      3                12     36
Total   50                      23     91

A = 85, C = 5, N = 50

$$\text{Mean } \bar{X} = A + \frac{\sum fd}{N} \times C = 85 + \frac{23}{50} \times 5 = ₹\,87.30$$

$$\sigma = \sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} \times C = \sqrt{\frac{91}{50} - \left(\frac{23}{50}\right)^2} \times 5$$

$$= \sqrt{1.82 - 0.2116} \times 5 = \sqrt{1.6084} \times 5 = 1.268 \times 5 = ₹\,6.34$$
Example: The following data were obtained while observing the life
span of a few neon lights of a company. Calculate S.D.

Life span (Years):     4-6   6-8   8-10   10-12   12-14
No. of Neon Lights:    10    17    32     21      20

Solution:
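Life span    No. of Neon    Mid value    d = (X − A)/C    fd     fd²
(Years)      Lights (f)     X
4-6          10             5            –2               –20    40
6-8          17             7            –1               –17    17
8-10         32             9            0                0      0
10-12        21             11           1                21     21
12-14        20             13           2                40     80
Total        100                                          24     158

Here A = 9 and C = 2.

$$\text{Standard Deviation, } \sigma = \sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} \times C = \sqrt{\frac{158}{100} - \left(\frac{24}{100}\right)^2} \times 2$$

$$= \sqrt{1.58 - 0.0576} \times 2 = \sqrt{1.5224} \times 2 = 1.2339 \times 2 = 2.47 \text{ years}$$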

4.6.3 PROPERTIES OF STANDARD DEVIATION


The properties of standard deviation are:
‰‰ It is the most important and widely used measure of variability.

‰‰ It is based on all the observations.
‰‰ Further mathematical treatment is possible.
‰‰ It is affected least by any sampling fluctuations.
‰‰ It is affected by the extreme values and it gives more importance
to the values away from the mean.
‰‰ The main limitation is that we cannot compare the variability of
different data sets given in different units.

Mathematical Properties of Standard Deviation


The sum of the squared deviations of the items is a minimum when the
deviations are taken from the arithmetic mean.
If all values are increased or decreased by a constant, the
standard deviation will remain the same. If all values are
multiplied or divided by a constant, then the standard deviation will
be multiplied or divided by that constant.
Combined standard deviation can be obtained for two or more series
with the formula given below:

$$\sigma^2 = \frac{n_1\sigma_1^2 + n_2\sigma_2^2 + n_3\sigma_3^2 + n_1 d_1^2 + n_2 d_2^2 + n_3 d_3^2}{n_1 + n_2 + n_3}$$

where each d is the difference between the mean of that series and the
combined mean.

4.6.4 MERITS AND DEMERITS OF STANDARD DEVIATION


Merits
‰‰ Standard deviation is the best measure of dispersion because it
takes into account all the items and is capable of further algebraic
treatment and statistical analysis.
‰‰ It is possible to calculate the combined standard deviation for
two or more series.
‰‰ This measure is most suitable for making comparisons among
two or more series about variability.
Disadvantages
It is difficult to compute.
It assigns more weight to extreme items and less weight to items that
are nearer to the mean. This is because the squares of large
deviations are proportionately greater than the squares of
comparatively small deviations.

4.6.5 STANDARD DEVIATION OF COMBINED MEANS


The mean and S.D. of two groups are given in the following table.

Group    Mean    S.D.    Size
I        x̄1      σ1      n1
II       x̄2      σ2      n2


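Let $\bar{x}$ and σ be the mean and S.D. of the combined group of (n1 + n2)
items. Then $\bar{x}$ and σ are determined by the formulae

$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$

$$\sigma^2 = \frac{n_1\sigma_1^2 + n_2\sigma_2^2 + n_1 d_1^2 + n_2 d_2^2}{n_1 + n_2} \quad \text{(or)} \quad \sigma = \sqrt{\frac{n_1\sigma_1^2 + n_2\sigma_2^2 + n_1 d_1^2 + n_2 d_2^2}{n_1 + n_2}}$$

where $d_1 = \bar{x}_1 - \bar{x}$; $d_2 = \bar{x}_2 - \bar{x}$

These results can be extended to 3 samples as follows

$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + n_3 \bar{x}_3}{n_1 + n_2 + n_3}$$

$$\sigma^2 = \frac{n_1\sigma_1^2 + n_2\sigma_2^2 + n_3\sigma_3^2 + n_1 d_1^2 + n_2 d_2^2 + n_3 d_3^2}{n_1 + n_2 + n_3}$$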

Example: Particulars regarding the income of two villages are given below:

                           Village A    Village B
Number of people:          600          500
Average income (in ₹):     175          186
Variance of income (₹):    100          81

In which village is the variation in income greater?
What is the combined SD of the village A and B put together?


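Solution: We have to compare the coefficient of variation

Village A:
$$\text{C.V. for A} = \frac{\text{SD}}{\text{Mean}} \times 100 = \frac{10}{175} \times 100 = 5.714$$

Village B:
$$\text{C.V. for B} = \frac{\text{SD}}{\text{Mean}} \times 100 = \frac{9}{186} \times 100 = 4.839$$

Therefore the variation in income is greater in village A.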

$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$

where n1 = 600; n2 = 500; $\bar{x}_1$ = 175; $\bar{x}_2$ = 186

$$\therefore \bar{x} = \frac{(600 \times 175) + (500 \times 186)}{600 + 500} = \frac{1,05,000 + 93,000}{1100} = 180$$

$$\sigma^2 = \frac{n_1\sigma_1^2 + n_2\sigma_2^2 + n_1 d_1^2 + n_2 d_2^2}{n_1 + n_2}$$

where $d_1 = \bar{x}_1 - \bar{x} = 175 - 180 = -5$; $d_2 = \bar{x}_2 - \bar{x} = 186 - 180 = 6$

$$\therefore \sigma^2 = \frac{600(100) + 500(81) + 600(-5)^2 + 500(6)^2}{1100} = \frac{60,000 + 40,500 + 15,000 + 18,000}{1100} = \frac{1,33,500}{1100} = 121.36$$

$$\sigma = \sqrt{121.36} = 11.02$$
Example: An analysis of monthly wage of workers of two organizations,
A and B yielded the following results.

                          Organisations
                          A        B
No. of Workers            50       60
Average monthly wage      ₹ 60     ₹ 48
Variance                  100      144

Obtain the average monthly wage and the S.D. of wages of all workers
in the two organizations taken together.
Solution:

$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$

where n1 = 50; n2 = 60; $\bar{x}_1$ = 60; $\bar{x}_2$ = 48

$$\therefore \bar{x} = \frac{(50 \times 60) + (60 \times 48)}{50 + 60} = \frac{3000 + 2880}{110} = ₹\,53.45$$

$$\sigma^2 = \frac{n_1\sigma_1^2 + n_2\sigma_2^2 + n_1 d_1^2 + n_2 d_2^2}{n_1 + n_2}$$

where $d_1 = \bar{x}_1 - \bar{x} = 60 - 53.45 = 6.55$; $d_2 = \bar{x}_2 - \bar{x} = 48 - 53.45 = -5.45$

$$\therefore \sigma^2 = \frac{(50 \times 100) + (60 \times 144) + 50(6.55)^2 + 60(-5.45)^2}{110} = \frac{5000 + 8640 + 2145.125 + 1782.15}{110} = \frac{17567.275}{110} = 159.7025$$

$$\sigma = \sqrt{159.7025} = 12.637$$

S.D. of two organizations taken together = ₹ 12.637
Example: There are 20, 30 and 50 employees in the three branches
of a concern. Their mean salaries are ₹ 15, 12 and 18 thousand. S.D.
of their salaries is ₹ 3, 5 and 6 thousand respectively. Find the mean
salary and the S.D. of salaries for the employees of the concern as a
whole.
Solution:

Given        No. of employees    Mean Salary    S.D. of salaries
Branch-I     n1 = 20             x̄1 = 15        σ1 = 3
Branch-II    n2 = 30             x̄2 = 12        σ2 = 5
Branch-III   n3 = 50             x̄3 = 18        σ3 = 6

$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + n_3 \bar{x}_3}{n_1 + n_2 + n_3} = \frac{20 \times 15 + 30 \times 12 + 50 \times 18}{20 + 30 + 50} = \frac{300 + 360 + 900}{100} = \frac{1560}{100} = 15.60$$

$$\sigma^2 = \frac{n_1\sigma_1^2 + n_2\sigma_2^2 + n_3\sigma_3^2 + n_1 d_1^2 + n_2 d_2^2 + n_3 d_3^2}{n_1 + n_2 + n_3}$$

where
$d_1 = \bar{x}_1 - \bar{x} = 15.0 - 15.6 = -0.6$
$d_2 = \bar{x}_2 - \bar{x} = 12.0 - 15.6 = -3.6$
$d_3 = \bar{x}_3 - \bar{x} = 18.0 - 15.6 = 2.4$

$$\sigma^2 = \frac{20 \times 3^2 + 30 \times 5^2 + 50 \times 6^2 + 20 \times (-0.6)^2 + 30 \times (-3.6)^2 + 50 \times (2.4)^2}{100}$$

$$= \frac{180.00 + 750.00 + 1800.00 + 7.20 + 388.80 + 288.00}{100} = \frac{3414}{100} = 34.14$$

$$\sigma = \sqrt{34.14} = 5.84$$
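The combined-group formulae used in these examples can be sketched as a single function that accepts any number of groups. The name `combine_groups` is hypothetical; the sketch assumes each S.D. was computed with the group size as divisor, as in the formulae above.

```python
from math import sqrt


def combine_groups(sizes, means, sds):
    """Combined mean and S.D. of several groups from their summaries."""
    total = sum(sizes)
    combined_mean = sum(n * m for n, m in zip(sizes, means)) / total
    # Within-group part: n_i * sigma_i^2
    within = sum(n * s ** 2 for n, s in zip(sizes, sds))
    # Between-group part: n_i * d_i^2, with d_i = group mean - combined mean
    between = sum(n * (m - combined_mean) ** 2 for n, m in zip(sizes, means))
    return combined_mean, sqrt((within + between) / total)


# Three-branch example: sizes, mean salaries and S.D.s (in thousand rupees).
mean_all, sd_all = combine_groups([20, 30, 50], [15, 12, 18], [3, 5, 6])
print(round(mean_all, 2), round(sd_all, 2))
```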

4.6.6 COEFFICIENT OF VARIATION
This was developed by Karl Pearson and is defined as the ratio of SD
to mean, multiplied by 100.

$$CV = \frac{\sigma}{\mu} \times 100$$

This is also called relative variability. A smaller value of CV
indicates greater stability and lesser variability.
Example: Two batsmen A and B made the following scores in the
preliminary round of World Cup Series of cricket matches.
A  14, 13, 26, 53, 17, 29, 79, 36, 84 and 49
B  37, 22, 56, 52, 28, 30, 37, 48, 20 and 40
Whom will you select for the final? Justify your answer.
Solution: We will first calculate mean, standard deviation and Karl
Pearson’s coefficient of variation. We will select the player based on
the average score as well as consistency. We want a player who not
only scores at a high average but also does so consistently. Then the
probability of his playing a good innings in the final is high.

FOR PLAYER ‘A’ (USING DIRECT METHOD)

Score xi      Deviation (xi – μ)    (xi – μ)²           xi²
14            –26                   676                 196
13            –27                   729                 169
26            –14                   196                 676
53            13                    169                 2809
17            –23                   529                 289
29            –11                   121                 841
79            39                    1521                6241
36            –4                    16                  1296
84            44                    1936                7056
49            9                     81                  2401
Σ xi = 400    Σ (xi – μ) = 0        Σ(xi – μ)² = 5974   Σ xi² = 21974
Now,

$$\text{Mean} = \mu = \frac{\sum_{i=1}^{10} x_i}{N} = \frac{400}{10} = 40$$

$$\text{Variance} = \text{Var}(x) = \frac{\sum_{i=1}^{10} (x_i - \mu)^2}{N} = \frac{5974}{10} = 597.4$$

$$\text{Standard Deviation} = \sigma = \sqrt{\text{Var}(x)} = \sqrt{597.4} = 24.44$$

Another Method

$$\text{Variance} = \text{Var}(x) = \frac{\sum_{i=1}^{10} x_i^2}{N} - \mu^2 = 2197.4 - 1600 = 597.4$$

Coefficient of variation (variability) for player ‘A’

$$CV = \frac{\sigma}{\mu} \times 100 = \frac{24.44}{40} \times 100 = 61.10$$
For player ‘B’ we will use the short-cut method. Let the assumed
mean A = 40

FOR PLAYER ‘B’ (USING SHORT-CUT METHOD)

Score xi      Deviation from Assumed Mean    di²
              di = (xi – A)
37            –3                             9
22            –18                            324
56            16                             256
52            12                             144
28            –12                            144
30            –10                            100
37            –3                             9
48            8                              64
20            –20                            400
40            0                              0
Σ xi = 370    Σ di = –30                     Σ di² = 1450
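$$\text{Now, Mean} = \mu = \frac{\sum_{i=1}^{10} x_i}{N} = \frac{370}{10} = 37$$

$$\text{Or, Mean} = \mu = A + \frac{\sum_{i=1}^{10} d_i}{N} = 40 + \frac{-30}{10} = 40 - 3 = 37$$

$$\text{Variance} = \text{Var}(x) = \frac{\sum_{i=1}^{10} d_i^2}{N} - \left(\frac{\sum_{i=1}^{10} d_i}{N}\right)^2 = \frac{1450}{10} - 9 = 136$$

$$\text{Standard Deviation} = \sigma = \sqrt{\text{Var}(x)} = \sqrt{136} = 11.66$$

$$CV = \frac{\sigma}{\mu} \times 100 = \frac{11.66}{37} \times 100 = 31.5$$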


Mean Score Coefficient of Variability


Player A 40 61.10
Player B 37 31.5
Although the average score of the player B is slightly lower than the
player A, the player B has lesser variability. Hence, player B is more
consistent. Since the difference in average scores is not substantial we
would select the player B for his consistency. He is likely to score close
to his average with more probability. Hence, we are more confident
about his performance.
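The comparison of the two batsmen can be reproduced with Python's standard `statistics` module. The function name `coefficient_of_variation` is hypothetical; the sketch treats the ten scores of each player as a population, so it uses `pstdev` (divisor N), matching the calculation above.

```python
from statistics import mean, pstdev


def coefficient_of_variation(data):
    """SD as a percentage of the mean (population SD, divisor N)."""
    return pstdev(data) / mean(data) * 100


player_a = [14, 13, 26, 53, 17, 29, 79, 36, 84, 49]
player_b = [37, 22, 56, 52, 28, 30, 37, 48, 20, 40]

cv_a = coefficient_of_variation(player_a)
cv_b = coefficient_of_variation(player_b)
print(mean(player_a), round(pstdev(player_a), 2), round(cv_a, 2))
print(mean(player_b), round(pstdev(player_b), 2), round(cv_b, 2))
```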

4.6.7 EMPIRICAL RELATIONSHIP BETWEEN DIFFERENT MEASURES OF VARIATION

Relation between the Mean and the Standard Deviation

The mean is a measure of the central tendency of the data set, and the
standard deviation is a measure of the spread. There are two general
rules that establish a relation between these measures.
Chebyshev’s Theorem
Chebyshev’s theorem states the following rules:
‰‰ At least three quarters of the observations in a set will lie within
±2 standard deviations of the mean.
‰‰ At least eight-ninths of the observations in a set will lie within ±3
standard deviations of the mean.
‰‰ In general, at least $\left(1 - \frac{1}{k^2}\right)$ of the observations will lie within ±k
standard deviations of the mean, where k is any real number greater
than 1.

The Empirical Rule


If the distribution of the data is more or less bell shaped, then we can
use empirical rule. The rule states:
‰‰ Approximately 68% of the observations will be within ±1 standard
deviation of the mean.
‰‰ Approximately 95% of the observations will be within ±2 standard
deviations of the mean.
‰‰ Approximately 99.73% of the observations will be within ±3
standard deviations of the mean.
Under the ‘Six Sigma’ philosophy, which allows for a shift of 1.5
standard deviations in the process mean, only about 3.4 observations
out of one million fall outside limits set at ±6 standard deviations.
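The two rules can be compared with observed data using a short sketch. The function names `chebyshev_lower_bound` and `fraction_within` and the sample values are hypothetical and serve only as an illustration; the sample deliberately includes one extreme value.

```python
from statistics import mean, pstdev


def chebyshev_lower_bound(k):
    """Minimum fraction of observations within k SDs of the mean."""
    if k <= 1:
        raise ValueError("the bound is informative only for k > 1")
    return 1 - 1 / k ** 2


def fraction_within(data, k):
    """Observed fraction of data within k population SDs of the mean."""
    mu, sigma = mean(data), pstdev(data)
    inside = [x for x in data if abs(x - mu) <= k * sigma]
    return len(inside) / len(data)


sample = [4, 8, 9, 10, 10, 11, 12, 13, 15, 28]   # hypothetical, one outlier
for k in (2, 3):
    print(k, round(chebyshev_lower_bound(k), 3), fraction_within(sample, k))
```

Chebyshev's bound holds for any data set, whereas the empirical percentages are only approximate and assume a roughly bell-shaped distribution.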
Let us check applicability of Chebyshev’s Theorem and empirical rule
for the data of the previous example.

Scores of player ‘A’ (μ = 40 and σ = 24.44)

Range      Interval          As per Chebyshev’s   As per Empirical   Actual   Actual
                             Theorem (% of        rule (% of         Data     % of
                             Observations)        Observations)      Points   Observations
μ ± 1σ     15.56 to 64.44    –                    68                 6        60
μ ± 2σ     0 to 88.88        75                   95                 10       100
μ ± 3σ     0 to 113.32       89                   99.7               10       100

Scores of player ‘B’ (μ = 37 and σ = 11.66)

Range      Interval          As per Chebyshev’s   As per Empirical   Actual   Actual
                             Theorem (% of        rule (% of         Data     % of
                             Observations)        Observations)      Points   Observations
μ ± 1σ     25.34 to 48.66    –                    68                 6        60
μ ± 2σ     13.68 to 60.32    75                   95                 10       100
μ ± 3σ     2.02 to 71.98     89                   99.7               10       100

Lower limits that would be negative are shown as 0, since a score
cannot be negative. Thus, Chebyshev’s Theorem and the empirical rule
provide a good rule of thumb.

Empirical Relationship between different Measures of Variation


The following approximate empirical relationships are observed for
data distributions that are more or less symmetric and mound shaped.
Quartile Deviation ≈ 2/3 Standard Deviation
Mean Deviation ≈ 4/5 Standard Deviation
Range ≈ 6 Standard Deviation

State whether the following statements are true/false:


13. Standard deviation is the most important and widely used
measure of variability.
14. If deviations of given items are taken from arithmetic mean
and squared then the sum of squared deviation should be
maximum.
15. There is no effect of shifting origin on standard deviation or
variance.
16. It is not possible to calculate standard deviation for two or
more series.
Fill in the blanks:
17. ................... is defined as the average of squared deviation of
data points from their mean.
18. ................... ................... is the root mean square deviation of the
values from their arithmetic mean.
19. Coefficient of variation was developed by ................... .
20. ................... value of CV indicates greater stability and lesser
variability.

Calculate the standard deviation of monthly cloud cover over
Equatorial Africa for January 2012 to December 2014. Collect data
from secondary sources like internet.

Standard Deviation is a more meaningful measure than variance
because, by taking the square root, the units are ‘un-squared’.
Variance is expressed in squared units and therefore tends to be
numerically larger. However, statisticians like to work with the
variance because its mathematical properties simplify the
computations. Managers, engineers and other people who apply
statistics for decision-making prefer to work with the standard
deviation because, being in the original units of the data, it is
easier to interpret.

4.7 SUMMARY
‰‰ Study of distribution is very important for decision-making.
Usually, measures of central tendency and variability are
adequate for taking decisions. However, if the data are quite
different from a normal distribution, then measures of skewness
and kurtosis need to be considered. We discussed measures of
variability: Range, Variance and Standard Deviation.
‰‰ A measure of dispersion gives an idea about the extent of lack
of uniformity in the sizes and qualities of the items in a series. It
helps us to know the degree of uniformity and consistency in the
series. If the difference between items is large the dispersion or
variation is large and vice versa.
‰‰ The measures of dispersion can be either ‘absolute’ or ‘relative’.
Absolute measures of dispersion are expressed in the same units
in which the original data are expressed. For example, if the series
is expressed as Marks of the students in a particular subject; the
absolute dispersion will provide the value in Marks. The only
difficulty is that if two or more series are expressed in different
units, the series cannot be compared on the basis of dispersion.
‰‰ The ‘Range’ of the data is the difference between the largest
value of data and smallest value of data. This is an absolute
measure of variability. However, if we have to compare two sets
of data, ‘Range’ may not give a true picture. In such case, relative
measure of range, called coefficient of range is used.
‰‰ Inter-quartile range is a difference between upper quartile (third
quartile) and lower quartile (first quartile). Quartile Deviation is
the average of the difference between upper quartile and lower
quartile.
‰‰ Average used for calculating deviation can be the mean, the
median or the mode. However, usually the mean is used. There
is also an advantage of taking deviations from the median,
because ‘Mean Deviation’ from median is lowest as compared to
any other ‘Mean Deviations’. Since absolute values of deviations
ignoring sign are taken for calculating Mean Deviation, the mean
deviation is not amenable to further algebraic treatment.
‰‰ The variance is the average squared deviation of the data from
their mean. For sample data, we take the average by dividing
with (n-1) where n is a sample size. This is to cater for degree of
freedom. For population data, we average by dividing with the
population size N.
‰‰ The Standard Deviation (SD) of a set of data is the positive square
root of the variance of the set. This is also referred to as the Root
Mean Square (RMS) value of the deviations of the data points. SD of
a sample is the square root of the sample variance.
‰‰ There is no effect of shifting origin on standard deviation or
variance.
‰‰ The measures of deviation are very effective in making reports
and presentations by the business executives to present their data
to the general public who do not understand statistical methods.
‰‰ Variance analysis also helps in managing budgets by controlling
budgeted versus actual costs. Without the standard deviation,
you can’t compare two data sets effectively.

‰‰ Range: The ‘Range’ of the data is the difference between


the largest value of data and smallest value of data. This is an
absolute measure of variability.
‰‰ Inter-quartile Range: Inter-quartile range is a difference
between upper quartile (third quartile) and lower quartile
(first quartile).
‰‰ Variance: Variance is defined as the average of squared
deviation of data points from their mean.
‰‰ Standard Deviation (SD): SD of a set of data is the positive
square root of the variance of the set. This is also referred as
Root Mean Square (RMS).
‰‰ Mean Deviation: Mean Deviation (MD) is an average value of
absolute deviation of observations from the data mean (or the
median or the mode).
‰‰ Coefficient of Variation: It is defined as the ratio of SD and
mean, multiplied by 100.

4.8 DESCRIPTIVE QUESTIONS


1. What do you understand by measures of dispersion? How is it
useful?
2. What are the characteristics of good measures of dispersion?
3. What are absolute and relative measures of dispersion?
4. Define range. Discuss merits, demerits and uses of range.
5. Write a short note on inter-quartile range and quartile deviation.
6. What do you understand by mean deviation? Give formulae for
mean deviation.
7. Define Variance. Give different formulae for variance.


8. What do you understand by standard deviation? Explain the
concept of standard deviation of combined means.
9. What is the effect of shift of origin and change of scale on standard
deviation?
10. What is coefficient of variation? How is it useful?

EXERCISE FOR PRACTICE


1. Twenty randomly chosen people are asked to rank a product on
scale 0 to 100. The results are given below:
89, 75, 59, 96, 88, 71, 43, 62, 80, 92
76, 72, 67, 60, 79, 85, 77, 83, 87, 53
Find Range, Mean, Variance and Standard Deviation
2. Calculate standard deviation and variance from the following
data:

Class 1-3 3-5 5-7 7-9 9-11


Frequency 2 3 5 3 2

3. Scores of the two teams are given below:

Team A B
Average Score 53.30 45.30
S.D. of Scores 40.93 16.89
(a) Which team is better in average?
(b) Which team is more consistent?
4. Two workers on the same job show the result over a long period
of time:

Worker A Worker B
Mean time of completion of 30 25
job (in min)
Standard deviation (in min) 6 4

(a) Which worker appears to be faster in completing the job?
(b) Which worker appears to be more consistent?
5. For a set of 100 items, the mean and SD are 60 and 6 respectively.
For another set of 200 items, the mean and SD are 63 and 4
respectively. Find the mean and SD of the combined group.

4.9 ANSWERS AND HINTS



ANSWERS FOR SELF ASSESSMENT QUESTIONS

Topic Q. No. Answers


Characteristics of Measures 1. Dispersion
of Dispersion
2. Quality
3. Extreme
Absolute and Relative 4. Absolute
Measures of Dispersion
5. Relative, Coefficient Range
Range 6. Range
7. Quality
Inter-quartile Range and 8. True
Deviations
9. False
10. False
11. True
12. True
Variance and Standard 13. True
Deviation
14. False

15. True
16. False
17. Variance
18. Standard Deviation
19. Karl Pearson
20. Smaller

HINTS FOR DESCRIPTIVE QUESTIONS


1. Refer Section 4.1
A measure of dispersion or variation in any data shows the extent
to which the numerical values tend to spread about an average.
If the difference between items is small, the average represents
and describes the data adequately. For large differences it is
proper to supplement information by calculating a measure of
dispersion in addition to an average.
2. Refer Section 4.2
There are certain pre-requisites or characteristics for a good
measure of dispersion:
(a) It should be simple to understand.
(b) It should be easy to compute.
(c) It should be rigidly defined.


3. Refer Section 4.3
Absolute measures of dispersion are expressed in the same units
in which the original data are expressed. A precise measure of
dispersion is one which gives the magnitude of the variation in
a series, i.e. it measures in numerical terms, the extent of the
scatter of the values around the average.
When dispersion is measured in terms of the original units of a
series, it is absolute dispersion or variability.
4. Refer Section 4.4
The ‘Range’ of the data is the difference between the largest value
of data and smallest value of data. This is an absolute measure
of variability. However, if we have to compare two sets of data,
‘Range’ may not give a true picture. Uses of range are described
as follows:
(a) It is commonly used for statistical quality control to decide
the upper and lower control limits for the control chart.
(b) Since it is very easy to measure, it is used for calculating
the acceptable limits of product with the help of ‘Control
Charts’.

5. Refer Section 4.5
Inter-quartile range is the difference between the upper quartile (third
quartile) and the lower quartile (first quartile). Quartile Deviation is
half of the difference between the upper quartile and the lower
quartile, i.e. (Q3 – Q1)/2.
6. Refer Section 4.5.3
Mean deviation is the arithmetic mean of the absolute deviations
of the values about their arithmetic mean or median or mode.
Mean Deviation (MD) is an average value of absolute deviation
of observations from the data mean (or the median or the mode).
It gives how spread/dispersed the data is. If x1, x2, …, xN are N
observations, then,
$$\text{Mean Deviation (MD)} = \frac{\sum |d_i|}{N} = \frac{\sum |x_i - \text{Average}|}{N}$$
7. Refer Section 4.6
Variance is defined as the average of squared deviation of data
points from their mean.
When the data constitute a sample, the variance is denoted by
$s_x^2$, and averaging is done by dividing the sum of the squared
deviations from the mean by ‘n – 1’. (Note that one is subtracted from
n because we lose one degree of freedom while using the mean of the
sample.)
$$\text{Sample Variance} = \text{Var}(x) = s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}$$
8. Refer Section 4.6.2
The Standard Deviation (SD) of a set of data is the positive square
root of the variance of the set. This is also referred to as the Root Mean
Square (RMS) value of the deviations of the data points. The SD of a
sample is the square root of the sample variance, denoted by $s_x$,
and the Standard Deviation of a population is the square root of
the variance of the population, denoted by σ.
9. Refer Section 4.6.2
Shifting of origin is achieved by adding or subtracting a constant
to all observations. In case of discrete data we add or subtract
(usually subtract) a constant to the individual observations.
Whereas, for grouped data we add or subtract (usually subtract)
the constant to the class mark values. There is no effect of shifting
origin on standard deviation or variance.
10. Refer Section 4.6.6
The coefficient of variation (CV) was developed by Karl Pearson and is defined as the ratio of
SD and mean, multiplied by 100.
$$CV = \frac{\sigma}{\mu} \times 100$$
It is a relative measure of variability. A smaller value of CV indicates
greater stability and lesser variability.

ANSWERS FOR EXERCISE FOR PRACTICE


1. Range = 53, Mean = 74.7, Variance = 184.71, Standard Deviation
= 13.59
2. Variance = 6.286, Standard Deviation = 2.507
3. Team A has better average. Team B is more consistent.
4. (a) Worker B (mean = 25) (b) Worker B (C.V. = 16%)
5. Mean = 62, Standard Deviation = 4.967

4.10 SUGGESTED READINGS FOR REFERENCE


SUGGESTED READINGS
‰‰ R S Bhardwaj, Mathematics and Statistics for Business, Excel
Books, 2012
‰‰ D P Apte, Statistical Tools for Managers using MS Excel, Excel
Books, 2009
‰‰ D P Apte, Probability and Combinatorics, Excel Books, 2007
‰‰ Sheila Cameron and Deborah Price, Business Research Methods,
A Practical Approach, Excel Books, 2010
‰‰ Richard Levin and David Rubin, Statistics for Management,
Pearson Education, 2004
‰‰ Rosen, Kenneth, H., Discrete Mathematics and its Applications,
Tata McGraw Hill Co Ltd., 2003
‰‰ Ross, Sheldon, A First Course in Probability, Pearson Education,
2003
‰‰ Salkind, N.J., Statistics for People who (They Think) Hate
Statistics, SAGE Publications, 2004
‰‰ Sharma, K.V.S., Statistics Made Simple, Prentice Hall of India,
2002
‰‰ Verma, A.P., Business Mathematics and Statistics, Asian Books
Pvt Ltd, 2002
‰‰ Clark, T.C. and Jordan, E.W., Introduction to Business and Economic
Statistics, South-Western Publishing Co., Ohio, U.S.A., 1985.

E-REFERENCES
‰‰ http://www.wyzant.com/
‰‰ http://www.princeton.edu/
‰‰ http://www.statistics.com/

CHAPTER 5

SKEWNESS AND KURTOSIS

CONTENTS
5.1 Introduction
5.2 Karl Pearson’s Coefficient of Skewness (SKP)
5.3 Bowley’s Coefficient of Skewness (SKB)
5.4 Kelly’s Coefficient of Skewness (SKK)
5.5 Measures of Kurtosis
5.6 Moments
5.6.1 Properties of Moments
5.6.2 Coefficients based on Moments

5.7 Summary
5.8 Descriptive Questions
5.9 Answers and Hints
5.10 Suggested Readings for Reference


INTRODUCTORY CASELET

KURTOSIS BY EXCEL™

The Excel™ help screens tell us that “kurtosis characterizes
the relative peakedness or flatness of a distribution compared to
the normal distribution. Positive kurtosis indicates a relatively
peaked distribution. Negative kurtosis indicates a relatively flat
distribution” (Microsoft, 1996). And, once again, that definition
doesn’t really help us understand the meaning of the numbers
resulting from this statistic.

Normal distributions produce a kurtosis statistic of about zero
(again, I say “about” because small variations can occur by chance
alone). So a kurtosis statistic of 0.09581 would be an acceptable
kurtosis value for a mesokurtic (that is, normally high) distribution
because it is close to zero. As the kurtosis statistic departs further
from zero, a positive value indicates the possibility of a leptokurtic
distribution (that is, too tall) or a negative value indicates the
possibility of a platykurtic distribution (that is, too flat, or even
concave if the value is large enough). Values of 2 standard errors
of kurtosis (sek) or more (regardless of sign) probably differ from
mesokurtic to a significant degree.

The sek can be estimated roughly using the following formula (after
Tabachnick & Fidell, 1996):
$$\text{sek} = \sqrt{\frac{24}{N}}$$
For example, let’s say you are using
Excel™ and calculate a kurtosis statistic of +1.9142 for a particular
test administered to 30 students. An approximate estimate of the
sek for this example would be:
$$\text{sek} = \sqrt{\frac{24}{30}} = 0.8944$$
Since two times the standard error of
the kurtosis is 1.7888 and the absolute value of the kurtosis statistic
was 1.9142, which is greater than 1.7888, you can assume that the
distribution has a significant kurtosis problem. Since the sign of
the kurtosis statistic is positive, you know that the distribution
is leptokurtic (too tall). Alternatively, if the kurtosis statistic had
been negative, you would have known that the distribution was
platykurtic (too flat). Yet another alternative would be that the
kurtosis statistic might fall within the range between –1.7888 and
+1.7888, in which case, you would have to assume that the kurtosis
was within the expected range of chance fluctuations in that
statistic. The existence of flat or peaked distributions as indicated
by the kurtosis statistic is important to you as a language tester
insofar as it indicates violations of the assumption of normality that
underlies many of the other statistics like correlation coefficients,
t-tests, etc. used to study the validity of a test.
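
The two-standard-error rule of thumb described above can be sketched in a few lines of Python. This is only an illustration: it assumes the rough estimate sek = sqrt(24/N), which reproduces the ±1.7888 band quoted for N = 30, and the function names are hypothetical.

```python
import math

def standard_error_of_kurtosis(n):
    """Rough standard error of kurtosis, assumed here to be sqrt(24 / n)."""
    return math.sqrt(24 / n)

def classify_kurtosis(kurtosis, n):
    """Apply the two-standard-error rule of thumb from the caselet."""
    limit = 2 * standard_error_of_kurtosis(n)
    if abs(kurtosis) <= limit:
        return "within chance fluctuation"
    return "leptokurtic" if kurtosis > 0 else "platykurtic"

sek = standard_error_of_kurtosis(30)
print(round(sek, 4))                     # 0.8944
print(classify_kurtosis(1.9142, 30))     # leptokurtic
print(classify_kurtosis(0.09581, 30))    # within chance fluctuation
```

The rule is a screening device, not a formal test, so borderline values are best checked against a histogram as well.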

Another practical implication should also be noted. If a distribution
of test scores is very leptokurtic, that is, very tall, it may indicate a
problem with the validity of your decision making processes. For
instance, at the University of Hawai’i at Manoa, we give a writing
placement test for all incoming native speaker freshmen (or should
that be fresh persons?) that produces scores on a scale of 0-20 (each
student’s score is based on four raters’ scores, which each range
from 0-5). Yearly, we test about 3400 students.

You can imagine how tall the distribution must look when it
is plotted out as a histogram: 20 points wide and hundreds of
students high. The decision that we are making is a four way
decision about the level of instruction that students should take:
remedial writing; regular writing with an extra lab tutorial; regular
writing; or honours writing. The problem that arises is that very
few points separate these four classifications and that hundreds of
students are on the borderline. So a wider distribution would help
us to spread the students out and make more responsible decisions
especially if the revisions resulted in a more reliable measure with
fewer students near each cut point.


After studying this chapter, you should be able to:


  Understand the concept and different types of skewness
  Discuss various measures of kurtosis
  Learn about moments, its properties and coefficients based
on moments

5.1 INTRODUCTION
Measures of Skewness and Kurtosis, like measures of central tendency
and dispersion, study the characteristics of a frequency distribution.
Averages tell us about the central value of the distribution and
measures of dispersion tell us about the concentration of the items
around a central value. These measures do not reveal whether the
dispersal of values on either side of an average is symmetrical or
not. If observations are arranged in a symmetrical manner around
a measure of central tendency, we get a symmetrical distribution;
otherwise, they are arranged in an asymmetrical order, which gives an
asymmetrical distribution. When the distribution stretches more to
the right than it does to the left, the distribution is said to be ‘right
skewed’ or ‘positively skewed’. Similarly, a left-skewed distribution is
the one that stretches asymmetrically to the left.
Skewness is a measure that studies the degree and direction of
departure from symmetry.
A symmetrical distribution, when presented on the graph paper, gives
a ‘symmetrical curve’, where the values of mean, median and mode are
exactly equal. On the other hand, in an asymmetrical distribution, the
values of mean, median and mode are not equal.
Note that, in a moderately skewed unimodal distribution, the median
usually lies between the mean and the mode.
When two or more symmetrical distributions are compared, the
difference in them is studied with ‘Kurtosis’. On the other hand,
when two or more asymmetrical distributions are compared, they will
give different degrees of Skewness. The two measures describe different
aspects of shape: skewness captures asymmetry and kurtosis captures
peakedness, so a distribution may exhibit both.

Nature of Skewness
Skewness can be positive or negative or zero.
When the values of mean, median and mode are equal, there is no
skewness.
‰‰ When mean > median > mode, skewness will be positive.
‰‰ When mean < median < mode, skewness will be negative.

Characteristic of a Good Measure of Skewness
‰‰ It should be a pure number in the sense that its value should be
independent of the unit of the series and also degree of variation
in the series.
‰‰ It should have zero-value, when the distribution is symmetrical.
‰‰ It should have a meaningful scale of measurement so that we
could easily interpret the measured value.
Mathematical measures of skewness can be calculated by:
‰‰ Karl Pearson’s Method
‰‰ Bowley’s Method
‰‰ Kelly’s Method
Skewness could be measured either in absolute terms as ‘mean
minus mode’ or in relative terms. When the skewness is presented
in absolute terms, i.e., in units, it is absolute skewness. If the value of
skewness is obtained in ratios or percentages, it is called relative
skewness or the coefficient of skewness.
When skewness is measured in absolute terms, we can compare one
distribution with the other only if the units of measurement are the same.
When it is presented in ratios or percentages, comparison becomes
easy.

5.2 KARL PEARSON’S COEFFICIENT OF SKEWNESS (SKP)

Karl Pearson has suggested two formulae:
‰‰ One based on the mean and the mode, for use where the mode is well defined;
‰‰ One based on the mean and the median, for use where the mode is ill-defined.

When the mode is well defined

Absolute skewness = Mean – Mode
$$SK_P = \frac{\text{Mean} - \text{Mode}}{\text{S.D.}}$$
Coefficient of skewness generally lies within ± 1

When the mode is ill-defined

This form relies on the empirical relationship Mean – Mode ≈ 3(Mean – Median),
which holds approximately for moderately skewed distributions.
Absolute skewness = 3(Mean – Median)
$$SK_P = \frac{3(\bar{X} - M_d)}{\sigma}$$
Where $\bar{X}$ = the mean, Md = the median and σ = the standard deviation for
the sample.
It is generally used when you don’t know the mode.
Coefficient of skewness generally lies within ± 3
Example: Calculate the Karl Pearson’s coefficient of skewness from
the following data:

Size:        1    2    3    4    5    6    7
Frequency:  10   18   30   25   12    3    2

Solution: To calculate Karl Pearson’s coefficient of skewness, we first
find $\bar{X}$, Mo and σ from the given distribution.

Size (X)   Frequency (f)   d = X − 4     fd    fd²
   1            10            −3        −30     90
   2            18            −2        −36     72
   3            30            −1        −30     30
   4            25             0          0      0
   5            12             1         12     12
   6             3             2          6     12
   7             2             3          6     18
Total          100                      −72    234

$$\bar{X} = A + \frac{\sum fd}{N} = 4 + \frac{-72}{100} = 3.28$$
$$\sigma = \sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} = \sqrt{\frac{234}{100} - \left(\frac{-72}{100}\right)^2} = 1.35$$
Also, Mo (by inspection) = 3.00
$$\therefore Sk = \frac{\bar{X} - M_o}{\sigma} = \frac{3.28 - 3.00}{1.35} = 0.207$$
Since Sk is positive and small, the distribution is moderately positively
skewed.
Example: Calculate Karl Pearson’s coefficient of skewness from the
following data:

Weights (lbs.)   No. of Students
90-100                 4
100-110               10
110-120               17
120-130               22
130-140               30
140-150               23
150-160               16
160-170                5
170-180                3

Solution: Calculation of $\bar{X}$, σ and Mo

Class       Frequency   Mid-Points   u = (X − 135)/10    fu    fu²
Intervals      (f)         (X)
90-100           4          95             −4           −16    64
100-110         10         105             −3           −30    90
110-120         17         115             −2           −34    68
120-130         22         125             −1           −22    22
130-140         30         135              0             0     0
140-150         23         145              1            23    23
150-160         16         155              2            32    64
160-170          5         165              3            15    45
170-180          3         175              4            12    48
Total          130                                      −20   424

1. $$\bar{X} = A + h \times \frac{\sum fu}{N} = 135 + 10 \times \frac{-20}{130} = 133.46$$
2. $$\sigma = h \times \sqrt{\frac{\sum fu^2}{N} - \left(\frac{\sum fu}{N}\right)^2} = 10 \times \sqrt{\frac{424}{130} - \left(\frac{-20}{130}\right)^2} = 18.0$$
3. $$M_o = L_m + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times h$$
By inspection, the modal class is 130-140.
∴ Lm = 130, Δ1 = 30 – 22 = 8, Δ2 = 30 – 23 = 7 and h = 10
$$\text{Thus, } M_o = 130 + \frac{8}{15} \times 10 = 135.33$$
$$\text{Hence, } Sk = \frac{\bar{X} - M_o}{\sigma} = \frac{133.46 - 135.33}{18.0} = -0.10$$
i.e., the distribution is moderately negatively skewed.
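
As a cross-check on the hand calculation, the first worked example (sizes 1 to 7) can be reproduced with a short Python sketch. The function name `pearson_skewness` is hypothetical, the mode is supplied by inspection as in the text, and the population form of the standard deviation (dividing by N) is assumed because that is what the worked solution uses.

```python
import math

def pearson_skewness(values, freqs, mode):
    """Return (mean, sd, (mean - mode) / sd) for a discrete frequency table."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    # Population variance (divide by N), matching the worked example.
    variance = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs)) / n
    sd = math.sqrt(variance)
    return mean, sd, (mean - mode) / sd

sizes = [1, 2, 3, 4, 5, 6, 7]
freqs = [10, 18, 30, 25, 12, 3, 2]
mean, sd, sk = pearson_skewness(sizes, freqs, mode=3)
print(round(mean, 2), round(sd, 2), round(sk, 3))   # 3.28 1.35 0.207
```

Computing deviations directly from the mean gives the same result as the step-deviation shortcut (d = X − 4) used in the table; the shortcut only reduces the arithmetic when working by hand.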

Fill in the blanks:


1. ................... is a measure that studies the degree and direction
of departure from symmetry.
2. When two or more symmetrical distributions are compared,
the difference in them is studied with ................... .
3. When mean > median > mode, skewness will be ................... .
4. When mean < median < mode, skewness will be ................... .
5. Since Sk is positive and small, the distribution is moderately
................... skewed.

The length of stay on the cancer floor of Apolo Hospital was
organized into a frequency distribution. The mean length of stay
was 28 days, the median is 25 days and the modal length is 23 days. The
standard deviation was computed to be 4.2 days. Is the distribution
symmetrical, or skewed? What is the coefficient of skewness?
Interpret.

You will generally find different values of the coefficient of skewness
when it is calculated by Karl Pearson’s and Bowley’s formulae.

5.3 BOWLEY’S COEFFICIENT OF SKEWNESS (SKB)

Bowley’s method of skewness is based on the values of median, lower
and upper quartiles. This method suffers from the same limitations
which are in the case of median and quartiles.
Wherever positional measures are given, skewness should be
measured by Bowley’s method. This method is also used in case of
‘open-end series’, where the importance of extreme values is ignored.
Absolute skewness = Q3 + Q1 – 2 Median
$$SK_B = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{(Q_3 - Q_2) + (Q_2 - Q_1)} = \frac{(Q_3 + Q_1) - 2 \times M_d}{Q_3 - Q_1}$$
Where, Q is quartile and Q2 is the median (Md).
Example: Calculate Bowley’s coefficient of skewness from the
following data:

Class Intervals:  0-5   5-10   10-15   15-20   20-25   25-30   30-35   35-40
Frequency:         7     10     20      13      17      10      14       9

Solution:

Class Intervals   Frequency (f)   Less than (c.f.)
0-5                     7                7
5-10                   10               17
10-15                  20               37
15-20                  13               50
20-25                  17               67
25-30                  10               77
30-35                  14               91
35-40                   9              100
Total                 100

1. Since N/2 = 50, the median class is 15-20.
Thus, Lm = 15, fm = 13, C = 37, h = 5, hence
$$M_d = 15 + \frac{50 - 37}{13} \times 5 = 20$$
2. Since N/4 = 25, the first quartile class is 10-15.
Thus, LQ1 = 10, fQ1 = 20, C = 17, h = 5, hence
$$Q_1 = 10 + \frac{25 - 17}{20} \times 5 = 12$$
3. Since 3N/4 = 75, the third quartile class is 25-30.
Thus, LQ3 = 25, fQ3 = 10, C = 67, h = 5, hence
$$Q_3 = 25 + \frac{75 - 67}{10} \times 5 = 29$$
$$\therefore \text{Bowley’s Coefficient of Skewness} = \frac{29 + 12 - 2 \times 20}{29 - 12} = \frac{1}{17} = 0.06$$
Thus, the distribution is approximately symmetrical.
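
The interpolation steps used in this example can be sketched in Python as follows. This is a minimal illustration assuming equal-width classes; `grouped_quantile` is a hypothetical helper name, not a library function.

```python
def grouped_quantile(lower_limits, width, freqs, p):
    """Interpolated quantile at proportion p for equal-width classes."""
    n = sum(freqs)
    target = p * n
    cumulative = 0
    for lower, f in zip(lower_limits, freqs):
        # First class whose cumulative frequency reaches the target.
        if cumulative + f >= target:
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("p must lie between 0 and 1")

lower_limits = [0, 5, 10, 15, 20, 25, 30, 35]
freqs = [7, 10, 20, 13, 17, 10, 14, 9]
q1 = grouped_quantile(lower_limits, 5, freqs, 0.25)
md = grouped_quantile(lower_limits, 5, freqs, 0.50)
q3 = grouped_quantile(lower_limits, 5, freqs, 0.75)
sk_b = (q3 + q1 - 2 * md) / (q3 - q1)
print(round(q1, 2), round(md, 2), round(q3, 2), round(sk_b, 2))   # 12.0 20.0 29.0 0.06
```

When the target falls exactly on a class boundary (as N/2 = 50 does here), the sketch assigns it to the lower class, which matches the choice of 15-20 as the median class in the solution.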

Fill in the blanks:
6. ................... method of skewness is based on the values of
median, lower and upper quartiles.
7. Bowley’s method is also used in case of ‘open-end series’,
where the importance of ................... values is ignored.

My computer program has a function that provides what it calls


“basic statistics.” Among those are Skew and Kurtosis. It is said that
abnormally skewed and peaked distributions may be signs of trouble
and those problems may then arise in applying testing statistics.
What are the acceptable ranges for these two statistics and how will
they affect the testing statistics if they are outside those limits? 

Bowley’s coefficient of skewness lies within the limits ± 1. This method is
quite convenient for determining skewness where one has already
calculated quartiles.

5.4 KELLY’S COEFFICIENT OF SKEWNESS (SKK)

Kelly’s coefficient of skewness is defined as:
$$SK_K = \frac{(P_{90} + P_{10}) - 2 \times M_d}{P_{90} - P_{10}}$$
Where, P is percentile.
Example: Calculate the Kelly’s coefficient of skewness from the
following data:

Wages (₹)     No. of Workers
800-900             10
900-1000            33
1000-1100           47
1100-1200          110
1200-1300          160
1300-1400           80
1400-1500           60

Solution: Calculation of P10, P50 and P90

Class Intervals   No. of Workers (f)   Less than (c.f.)
800-900                  10                  10
900-1000                 33                  43
1000-1100                47                  90
1100-1200               110                 200
1200-1300               160                 360
1300-1400                80                 440
1400-1500                60                 500
Total                   500

1. Since 10N/100 = (10 × 500)/100 = 50, P10 lies in the interval 1000-1100.
Thus, LP10 = 1000, C = 43, fP10 = 47, h = 100
$$\text{Hence, } P_{10} = 1000 + \frac{50 - 43}{47} \times 100 = ₹\,1014.89$$
2. Since 50N/100 = 250, P50 lies in the interval 1200-1300.
Thus, LP50 = 1200, C = 200, fP50 = 160, h = 100
$$\text{Hence, } P_{50} = 1200 + \frac{250 - 200}{160} \times 100 = ₹\,1231.25$$
3. Since 90N/100 = 450, P90 lies in the class 1400-1500.
Thus, LP90 = 1400, C = 440, fP90 = 60, h = 100.
$$\text{Hence, } P_{90} = 1400 + \frac{450 - 440}{60} \times 100 = ₹\,1416.67$$
$$SK_K = \frac{(P_{90} + P_{10}) - 2 \times M_d}{P_{90} - P_{10}}$$
$$\therefore SK_K = \frac{1416.67 + 1014.89 - 2 \times 1231.25}{1416.67 - 1014.89} = \frac{-30.94}{401.78} = -0.08$$
NMIMS Global Access – School for Continuing Education


SKEWNESS AND KURTOSIS  157 

N O T E S
‰‰ Skewness is also defined in term of the moment about mean. One
such measure is defined as:
3
 ( xi − m ) 
∑
 s 
Skewness =
N
‰‰ Lorenz Curve: This is a special type of graph, which is designed
to show how much a certain distribution varies from a completely
uniform distribution. It is a cumulative percentage curve
comparing the population and factor under study. For example,
we could plot a graph of percentage of population and percentage
of their wealth. Lorenz curve is very useful for comparing two
populations particularly when their means and SD are same.

Fill in the blanks:
8. Skewness is also defined in term of the moment about
................... .
9. ................... curve is a special type of graph, which is designed to
show how much a certain distribution varies from a completely
uniform distribution.
10. Lorenz curve is very useful for comparing two populations
particularly when their ................... and ................... are same.

Collect data from a manufacturing company about their employees
and their salaries. Which of the three coefficients of skewness will
you find out to know the variation in the data? Which coefficient
will define the variation best?

5.5 MEASURES OF KURTOSIS


Kurtosis is a measure of peaked-ness of distribution. Larger the
kurtosis, more and more peaked will be the distribution. The kurtosis
is calculated either as an absolute or a relative value. Absolute kurtosis
is always a positive number.
$$\text{Absolute kurtosis} = \frac{\sum \left(\frac{x_i - \mu}{\sigma}\right)^4}{N}$$
Absolute kurtosis of a normal distribution (symmetric bell shaped
distribution) is taken as 3. It is taken as datum to calculate relative
kurtosis as follows:
Relative kurtosis = Absolute kurtosis – 3
Relative kurtosis can be negative. Managers usually work with relative
kurtosis.

Negative kurtosis indicates a flatter distribution than the normal
distribution, and is called platykurtic.

A positive kurtosis means more peaked curve, called Leptokurtic.

Peakedness of normal distribution is called Mesokurtic.

Example: Find standard deviation and kurtosis of the following series
by the method of moments:

Class Intervals:  0-10   10-20   20-30   30-40   40-50
Frequency:         10      20      40      20      10

Solution: Calculation of Moments

Class       Frequency   Mid-values   u = (X − 25)/10    fu    fu²   fu⁴
Intervals      (f)         (X)
0-10            10           5             −2          −20    40    160
10-20           20          15             −1          −20    20     20
20-30           40          25              0            0     0      0
30-40           20          35              1           20    20     20
40-50           10          45              2           20    40    160
Total          100                                       0   120    360

Since ∑fu = 0, ∴ $\bar{X}$ = 25 and the calculated moments will be central.
$$\mu_2 = h^2 \times \frac{\sum fu^2}{N} = 100 \times \frac{120}{100} = 120$$
$$\text{and } \mu_4 = h^4 \times \frac{\sum fu^4}{N} = 10000 \times \frac{360}{100} = 36000$$
$$\text{Thus, measure of kurtosis } \beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{36000}{14400} = 2.5$$
Since this value is less than 3, the distribution is platykurtic.
The standard deviation σ = √120 = 10.95
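
The moment calculation in this example can be verified with a brief Python sketch. The helper name `central_moment` is hypothetical, and class mid-values stand in for the observations, as in the table.

```python
import math

def central_moment(mids, freqs, r):
    """r-th moment about the mean for a grouped frequency distribution."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(mids, freqs)) / n
    return sum(f * (x - mean) ** r for x, f in zip(mids, freqs)) / n

mids = [5, 15, 25, 35, 45]
freqs = [10, 20, 40, 20, 10]
mu2 = central_moment(mids, freqs, 2)
mu4 = central_moment(mids, freqs, 4)
beta2 = mu4 / mu2 ** 2
print(mu2, mu4, beta2)              # 120.0 36000.0 2.5
print(round(math.sqrt(mu2), 2))     # 10.95
print(beta2 - 3)                    # -0.5, relative kurtosis (platykurtic)
```

The relative kurtosis of −0.5 expresses the same conclusion as comparing β2 = 2.5 with the normal-distribution value of 3.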
Example: The first four central moments of a distribution are 0,
2.5, 0.7 and 18.75. Calculate the moment measures of skewness and
kurtosis of the distribution and comment upon the results.
Solution: The moment measures of skewness and kurtosis are given
by
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} = \frac{(0.7)^2}{(2.5)^3} = 0.031 \quad \text{and} \quad \beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{18.75}{(2.5)^2} = 3 \quad \text{respectively.}$$
Since β1 is very small, the distribution is approximately symmetrical.
Further, β2 = 3, therefore, the curve is mesokurtic.
The above calculations show that the given distribution is
approximately normal.
Example: The following data are given to an economist for the
purpose of economic analysis. The data refers to the length of life of a
certain type of batteries.
n = 100, ∑fd = 50, ∑fd² = 1970, ∑fd³ = 2948 and ∑fd⁴ = 86,752. Here
d = X – 48.
Do you think that the distribution is platykurtic?
Solution: We can calculate raw moments, from the given values, as
given below:
$$\mu'_1 = \frac{\sum fd}{N} = \frac{50}{100} = 0.5, \quad \mu'_2 = \frac{\sum fd^2}{N} = \frac{1970}{100} = 19.7$$
$$\mu'_3 = \frac{\sum fd^3}{N} = \frac{2948}{100} = 29.48, \quad \mu'_4 = \frac{\sum fd^4}{N} = \frac{86752}{100} = 867.52$$
To calculate β2, we compute µ2 and µ4, as given below:
$$\mu_2 = \mu'_2 - \mu'^2_1 = 19.7 - 0.5^2 = 19.45$$
$$\mu_4 = \mu'_4 - 4\mu'_3\mu'_1 + 6\mu'_2\mu'^2_1 - 3\mu'^4_1$$
$$= 867.52 - 4 \times 29.48 \times 0.5 + 6 \times 19.7 \times (0.5)^2 - 3 \times (0.5)^4 = 837.9$$
$$\text{Now, } \beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{837.9}{(19.45)^2} = 2.2$$
which is less than 3, therefore, the distribution is platykurtic.
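
The conversion from moments about an arbitrary value to central moments, used in the battery example, can be sketched in Python as below. The function name `central_from_raw` is hypothetical; the formulas are the ones applied in the solution, with the standard third-moment conversion added for completeness.

```python
def central_from_raw(m1, m2, m3, m4):
    """Convert the first four moments about an arbitrary value A into
    the second, third and fourth moments about the mean."""
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
    return mu2, mu3, mu4

n = 100
sums = (50, 1970, 2948, 86752)      # sums of fd, fd^2, fd^3, fd^4 with d = X - 48
raw = [s / n for s in sums]
mu2, mu3, mu4 = central_from_raw(*raw)
beta2 = mu4 / mu2 ** 2
print(round(mu2, 2), round(mu4, 1), round(beta2, 1))   # 19.45 837.9 2.2
```

Since β2 is below 3, the sketch agrees with the conclusion that the distribution is platykurtic.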

Fill in the blanks:


11. ................... is a measure of peaked-ness of distribution.
12. ................... kurtosis is always a positive number.
13. Negative kurtosis indicates a flatter distribution than the
normal distribution, and called as ................... .
14. A positive kurtosis means more peaked curve, called ...................
15. Peakedness of normal distribution is called ................... .

Give two examples where measure of skewness and kurtosis could


play important role in decision-making.

Comparison among Dispersion, Skewness and Kurtosis


Dispersion, Skewness and Kurtosis are different characteristics of
a frequency distribution. Dispersion studies the scatter of the items
around a central value or among themselves. It does not show the
extent to which deviations cluster below an average or above it.
Skewness tells us about the cluster of the deviations above and below
a measure of central tendency. Kurtosis studies the concentration
of the items at the central part of a series. If items concentrate
too much at the centre, the curve becomes ‘LEPTOKURTIC’ and
if the concentration at the centre is comparatively less, the curve
becomes ‘PLATYKURTIC’.

5.6 MOMENTS
One important concept of measuring the frequency distribution is
moments. It can be visualized as rotational effect of a force.
The concept of moments has crept into the statistical literature from
mechanics. In mechanics, this concept refers to the turning or the
rotating effect of a force whereas it is used to describe the peculiarities
of a frequency distribution in statistics. We can measure the central
tendency of a set of observations by using moments.

The arithmetic mean of various powers of the deviations of the
observations from the mean in any distribution is called the moments
of the distribution about mean.

Moments about mean are generally used in statistics. We use the Greek
letter µ (read as mu) for these moments. Consider a mass attached
at each point proportional to its frequency and take moments about
the mean. The first, second, third and fourth moments can be used as
measures of central tendency, variation (dispersion), asymmetry and
peaked-ness of the curve, respectively. We shall understand the first four moments
about mean in this section, i.e., µ1, µ2, µ3, and µ4.
We define the moments as (here µ without a subscript denotes the mean):
$$\text{First Moment } \mu_1 = \frac{\sum f_i \times (x_i - \mu)}{N}$$
$$\text{Second Moment } \mu_2 = \frac{\sum f_i \times (x_i - \mu)^2}{N}$$
$$\text{Third Moment } \mu_3 = \frac{\sum f_i \times (x_i - \mu)^3}{N}$$
$$\text{Fourth Moment } \mu_4 = \frac{\sum f_i \times (x_i - \mu)^4}{N}$$
5.6.1 PROPERTIES OF MOMENTS
First moment about mean is always zero, i.e. µ1 = 0
Second moment about mean is the variance: µ2 = σ² = Var
Third moment can be used as a measure of skewness. Karl Pearson
has suggested a different measure of skewness as
$$\beta_1 = \frac{\mu_3^2}{\mu_2^3}$$
Thus: If µ3 > 0 ⇒ Distribution is positively skewed.
If µ3 < 0 ⇒ Distribution is negatively skewed.
If µ3 = 0 ⇒ Distribution is symmetric.
Fourth moment can be used as a measure of kurtosis. Karl Pearson
gave the coefficient as
$$\beta_2 = \frac{\mu_4}{\mu_2^2}$$
If the distribution is symmetric, all odd moments are zero, i.e.
$$\mu_1 = \mu_3 = \mu_5 = \dots = 0$$

5.6.2 COEFFICIENTS BASED ON MOMENTS


There are a few useful coefficients based on the moments. These
are non-dimensional numbers and hence useful for comparison of
distributions of data. β coefficients are used for calculating
mode, skewness and kurtosis, whereas γ1 and γ2 are used to measure
skewness and kurtosis. These are:

Alpha Coefficients
It is defined as:
$$\alpha_i = \frac{\mu_i}{\sigma^i} \quad \text{where } i = 1, 2, 3, 4$$
Note that α1 = 0, α2 = 1, α3 = µ3/σ³ and α4 = µ4/σ⁴

Beta Coefficients
It is defined as:
$$\beta_1 = \alpha_3^2 = \frac{\mu_3^2}{\mu_2^3}$$
$$\beta_2 = \alpha_4 = \frac{\mu_4}{\mu_2^2}$$

Gamma Coefficient
It is defined as:
γ1 = α3
γ2 = β2 – 3
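
The relationships among the alpha, beta and gamma coefficients can be illustrated with a small Python sketch. The function name `moment_coefficients` is hypothetical, and the input values (µ2 = 2.5, µ3 = 0.7, µ4 = 18.75) are borrowed from the earlier worked example on central moments.

```python
import math

def moment_coefficients(mu2, mu3, mu4):
    """Alpha, beta and gamma coefficients from central moments."""
    sigma = math.sqrt(mu2)
    alpha3 = mu3 / sigma ** 3
    alpha4 = mu4 / sigma ** 4
    beta1 = mu3 ** 2 / mu2 ** 3      # equals alpha3 squared
    beta2 = mu4 / mu2 ** 2           # equals alpha4
    return {
        "alpha3": alpha3, "alpha4": alpha4,
        "beta1": beta1, "beta2": beta2,
        "gamma1": alpha3, "gamma2": beta2 - 3,
    }

c = moment_coefficients(2.5, 0.7, 18.75)
print(round(c["beta1"], 3), round(c["beta2"], 2), round(c["gamma1"], 3))
# 0.031 3.0 0.177
```

Unlike β1, which is always non-negative, γ1 keeps the sign of µ3 and therefore also shows the direction of skewness.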
Example: Calculate the first four moments about 30 for the following
distribution and convert them into central moments.

Class Intervals:  5-15   15-25   25-35   35-45   45-55
Frequency:          8      12      15       9       6

Solution: Calculation of Moments

Class       Freq.   M.V.   X − 30   f(X − 30)   f(X − 30)²   f(X − 30)³   f(X − 30)⁴
Intervals    (f)    (X)
5-15           8     10     −20       −160         3200       −64000       1280000
15-25         12     20     −10       −120         1200       −12000        120000
25-35         15     30       0          0            0            0             0
35-45          9     40      10         90          900         9000         90000
45-55          6     50      20        120         2400        48000        960000
Total         50                       −70         7700       −19000       2450000

$$\therefore \mu'_1 = \frac{-70}{50} = -1.40, \quad \mu'_2 = \frac{7700}{50} = 154, \quad \mu'_3 = \frac{-19000}{50} = -380, \quad \mu'_4 = \frac{24{,}50{,}000}{50} = 49{,}000$$
Conversion into central moments
$$\mu_1 = 0$$
$$\mu_2 = \mu'_2 - \mu'^2_1 = 154 - (-1.4)^2 = 152.04$$
$$\mu_3 = \mu'_3 - 3\mu'_2\mu'_1 + 2\mu'^3_1 = -380 - 3 \times 154 \times (-1.4) + 2 \times (-1.4)^3 = 261.31$$
$$\mu_4 = \mu'_4 - 4\mu'_3\mu'_1 + 6\mu'_2\mu'^2_1 - 3\mu'^4_1$$
$$= 49{,}000 - 4 \times (-380) \times (-1.4) + 6 \times 154 \times (-1.4)^2 - 3 \times (-1.4)^4 = 48{,}671.52$$
Example: The first two moments of a distribution about the value 5
are 2 and 20. Find mean and variance of the distribution.
Solution: We know that µ′1 = $\bar{X}$ – A or $\bar{X}$ = µ′1 + A = 2 + 5 = 7
Also, µ2 = µ′2 – µ′1² = 20 – 4 = 16
∴ Mean = 7 and Variance = 16.
Example: The first four moments of a distribution about 4 are as
given below:
µ′1 = 1, µ′2 = 4, µ′3 = 10 and µ′4 = 45
Find mean of the distribution and calculate the first four moments
about mean and also the first four moments about origin.
Solution: We know that µ1 = 0 and
$$\mu_2 = \mu'_2 - \mu'^2_1 = 4 - 1 = 3$$
$$\mu_3 = \mu'_3 - 3\mu'_2\mu'_1 + 2\mu'^3_1 = 10 - 12 + 2 = 0$$
$$\mu_4 = \mu'_4 - 4\mu'_3\mu'_1 + 6\mu'_2\mu'^2_1 - 3\mu'^4_1 = 45 - 4 \times 10 + 6 \times 4 - 3 = 26$$

Moments about Origin
Writing m1, m2, m3 and m4 for the moments about origin:
$$m_1 = \bar{X} = \mu'_1 + A = 1 + 4 = 5$$
$$m_2 = \mu_2 + m_1^2 = 3 + 25 = 28$$
$$m_3 = \mu_3 + 3\mu_2 m_1 + m_1^3 = 0 + 45 + 125 = 170$$
$$m_4 = \mu_4 + 4\mu_3 m_1 + 6\mu_2 m_1^2 + m_1^4 = 26 + 0 + 18 \times 25 + 625 = 1101$$
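
The two conversions in this example (moments about A = 4 to central moments, then central moments to moments about origin) can be chained in a short Python sketch. Both function names are hypothetical, and integer inputs are used so the arithmetic is exact.

```python
def central_from_raw(m1, m2, m3, m4):
    """Moments about an arbitrary value A -> moments about the mean."""
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
    return mu2, mu3, mu4

def origin_from_central(mean, mu2, mu3, mu4):
    """Moments about the mean -> moments about origin (zero)."""
    m2 = mu2 + mean ** 2
    m3 = mu3 + 3 * mu2 * mean + mean ** 3
    m4 = mu4 + 4 * mu3 * mean + 6 * mu2 * mean ** 2 + mean ** 4
    return mean, m2, m3, m4

A = 4
about_a = (1, 4, 10, 45)
central = central_from_raw(*about_a)
mean = about_a[0] + A
print(central)                               # (3, 0, 26)
print(origin_from_central(mean, *central))   # (5, 28, 170, 1101)
```

Going through the central moments first keeps each formula short; the moments about origin could equally be obtained directly by expanding the powers of (x − 4) + 4.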

Fill in the blanks:
16. The arithmetic mean of various powers of these deviations
in any distribution is called the ................... of the distribution
about mean.
17. ................... moment about mean is the variance.
18. First moment about mean is always ................... .
19. ................... moment can be used as a measure of skewness.
20. ................... moment can be used as a measure of kurtosis.

Take P/E (Price to earning) ratios of 10 banking stocks for year


2003-2004. Plot them as histogram. Then calculate various measures
of variability and compare with visual observations. What are your
observations if you want to invest in banking stocks?

Moments also help in measuring the scatteredness, asymmetry and
peakedness of a curve for a particular distribution. Moments refer
to the average of the deviations from the mean or some other value,
raised to a certain power.

5.7 SUMMARY
‰‰ Measures of Skewness and Kurtosis, like measures of central
tendency and dispersion, study the characteristics of a
frequency distribution. Averages tell us about the central value
of the distribution and measures of dispersion tell us about the
concentration of the items around a central value.
‰‰ When two or more symmetrical distributions are compared, the
difference in them is studied with ‘Kurtosis’. On the other hand,
when two or more asymmetrical distributions are compared, they
will give different degrees of Skewness. The two measures describe
different aspects of shape: skewness captures asymmetry and
kurtosis captures peakedness, so a distribution may exhibit both.
‰‰ Bowley’s method of skewness is based on the values of median,
lower and upper quartiles. This method suffers from the same
limitations which are in the case of median and quartiles.
Wherever positional measures are given, skewness should be
measured by Bowley’s method. This method is also used in case
of ‘open-end series’, where the importance of extreme values is
ignored.
‰‰ Kelly’s coefficient of skewness is defined as:
$$SK_K = \frac{(P_{90} + P_{10}) - 2 \times M_d}{P_{90} - P_{10}}$$
Where, P is percentile.
‰‰ Kurtosis is a measure of peaked-ness of distribution. Larger the
kurtosis, more and more peaked will be the distribution. The
kurtosis is calculated either as an absolute or a relative value.
Absolute kurtosis is always a positive number. Absolute kurtosis
of a normal distribution (symmetric bell shaped distribution) is
taken as 3. It is taken as datum to calculate relative kurtosis as
follows:
$$\text{Absolute kurtosis} = \frac{\sum \left(\frac{x_i - \mu}{\sigma}\right)^4}{N}$$
Relative kurtosis = Absolute kurtosis – 3
‰‰ Moments about mean are generally used in statistics. We use
the Greek letter µ (read as mu) for these moments. Consider a
mass attached at each point proportional to its frequency and
take moments about the mean. The first, second, third and fourth
moments can be used as measures of central tendency, variation
(dispersion), asymmetry and peakedness of the curve, respectively.

‰‰ Measure of Skewness: A measure of skewness is a technique
to indicate the direction and extent of skewness in the
distribution of values in the data set.
‰‰ Moments: The arithmetic means of various powers of the
deviations of observations from the mean are called the moments of the
distribution about the mean.
‰‰ Moment of Order r: It is defined as the arithmetic mean of the
rth power of the deviations of observations.
‰‰ Platykurtic: Negative kurtosis indicates a flatter distribution
than the normal distribution, and is called platykurtic.
‰‰ Leptokurtic: A positive kurtosis means more peaked curve,
called Leptokurtic.
‰‰ Mesokurtic: Peakedness of normal distribution is called
Mesokurtic.
‰‰ Kurtosis: When two or more symmetrical distributions are
compared, the difference in them is studied with Kurtosis.
‰‰ Coefficient of Kurtosis: It is a measure of the relative
peakedness of the top of a frequency curve.
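
The key words above can be illustrated with a short Python sketch. It assumes ungrouped data and population moments (division by N), and it applies the absolute and relative kurtosis formulas from the summary. The `marks` values and the function names are hypothetical and are chosen only for illustration.

```python
from math import sqrt


def central_moment(data, r):
    """Arithmetic mean of the r-th power of deviations from the mean."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** r for x in data) / len(data)


def absolute_kurtosis(data):
    """Mean of the fourth power of standardised deviations."""
    sigma = sqrt(central_moment(data, 2))
    return central_moment(data, 4) / sigma ** 4


def classify_kurtosis(data, tol=1e-9):
    """Label the curve by the sign of relative kurtosis (absolute - 3)."""
    relative = absolute_kurtosis(data) - 3
    if relative > tol:
        return "leptokurtic"
    if relative < -tol:
        return "platykurtic"
    return "mesokurtic"


# Hypothetical marks of eight students (illustrative data only).
marks = [2, 4, 4, 4, 5, 5, 7, 9]

print("second moment:", central_moment(marks, 2))
print("third moment :", central_moment(marks, 3))
print("absolute kurtosis:", absolute_kurtosis(marks))
print("shape:", classify_kurtosis(marks))
```

A positive third moment in the output points to positive skewness, and a relative kurtosis below zero points to a curve flatter than the normal curve.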

5.8 DESCRIPTIVE QUESTIONS


1. What do you understand by skewness?
2. Explain the nature of skewness.
3. What are the characteristics of a good measure of skewness?
4. How do you calculate Karl Pearson’s coefficient of skewness?
What are its key features?
5. How do you calculate Bowley’s coefficient of skewness?
6. Explain Kelly’s coefficient of skewness with examples.
7. What do you understand by measures of Kurtosis?
8. Explain the terms Platykurtic, Leptokurtic and Mesokurtic.
9. Define Moments of the distribution about mean. What are the
properties of Moments?
10. Define coefficients based on moments.

EXERCISE FOR PRACTICE


1. From the following data compute quartiles and find the coefficient
of skewness.

Income (₹)       Below 200   200-400   400-600   600-800   800-1000   Above 1000
No. of Persons   20          40        80        75        20         16
2. From the following data calculate the coefficient of skewness
based on percentiles.

Marks less than   10   20   30   40   50   60
No. of Students   4    10   30   40   47   50
3. Calculate Karl Pearson’s coefficient of skewness for the following
distribution.

Class Interval   0-10   10-20   20-30   30-40   40-50   50-60   60-70   70-80
Frequency        6      12      22      48      56      32      18      6
4. The first four moments from mean of a distribution are 0, 3.2, 3.6
and 20. The mean value is 11. Calculate the first four moments
about zero and about 10.
5. Compute the moment measure of skewness from the following
distribution.

Marks obtained:    0-10   10-20   20-30   30-40   40-50   50-60   60-70
No. of Students:   8      14      22      26      15      10      5

5.9 ANSWERS AND HINTS


ANSWERS FOR SELF ASSESSMENT QUESTIONS
Topic                                          Q. No.   Answers
Karl Pearson’s Coefficient of Skewness (Skp)   1.       Skewness
                                               2.       Kurtosis
                                               3.       Positive
                                               4.       Negative
                                               5.       Positively
Bowley’s Coefficient of Skewness (Skb)         6.       Bowley’s
                                               7.       Extreme
Kelly’s Coefficient of Skewness (Skk)          8.       Mean
                                               9.       Lorenz
                                               10.      Means, SD
Measures of Kurtosis                           11.      Kurtosis
                                               12.      Absolute
                                               13.      Platykurtic
                                               14.      Leptokurtic
                                               15.      Mesokurtic
Moments                                        16.      Moments
                                               17.      Second
                                               18.      Zero
                                               19.      Third
                                               20.      Fourth

HINTS FOR DESCRIPTIVE QUESTIONS


1. Refer Section 5.1
Skewness is a measure that studies the degree and direction
of departure from symmetry. A symmetrical distribution, when
presented on graph paper, gives a ‘symmetrical curve’, where
the values of mean, median and mode are exactly equal. On the
other hand, in an asymmetrical distribution, the values of mean,
median and mode are not equal.
2. Refer Section 5.1
Skewness can be positive or negative or zero.
(a) When the values of mean, median and mode are equal, there
is no skewness.
(b) When mean > median > mode, skewness will be positive.
(c) When mean < median < mode, skewness will be negative.
3. Refer Section 5.1
It should be a pure number in the sense that its value should be
independent of the unit of the series and also degree of variation
in the series.
4. Refer Section 5.2
Karl Pearson has suggested two formulae:
(a) Where the relationship of mean and mode is established;
(b) Where the relationship between mean and median is not
established.
(c) Coefficient of skewness, $SK_p = \frac{\text{mean} - \text{mode}}{\text{S.D.}}$
(d) Coefficient of skewness generally lies within ±1

5. Refer Section 5.3
Bowley’s method of skewness is based on the values of median,
lower and upper quartiles. This method suffers from the same
limitations which are in the case of median and quartiles.
$$SK_B = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{(Q_3 - Q_2) + (Q_2 - Q_1)} = \frac{(Q_3 + Q_1) - 2 \times Md}{Q_3 - Q_1}$$
Where, Q is quartile.
6. Refer Section 5.4
Kelly’s coefficient of skewness is defined as:
$$Sk_k = \frac{(P_{90} + P_{10}) - 2 \times Md}{P_{90} - P_{10}}$$
Where, P is percentile.
7. Refer Section 5.5
Kurtosis is a measure of the peakedness of a distribution. The larger the
kurtosis, the more peaked the distribution. Kurtosis is
calculated either as an absolute or a relative value.
Absolute kurtosis is always a positive number.
8. Refer Section 5.5
Negative kurtosis indicates a flatter distribution than the normal
distribution, and is called platykurtic. A positive kurtosis means a
more peaked curve, called leptokurtic. The peakedness of the normal
distribution is called mesokurtic.
9. Refer Section 5.6
The arithmetic mean of various powers of these deviations in
any distribution is called the moments of the distribution about
mean. Moments about mean are generally used in statistics.
10. Refer Section 5.6
There are a few useful coefficients based on the moments: Alpha
coefficients, Beta coefficients and Gamma coefficients. These
are non-dimensional numbers and hence useful for comparing
distributions of data. β coefficients are used for
calculating mode, skewness and kurtosis, whereas γ1 and γ2
are used to measure skewness and kurtosis.

ANSWERS FOR EXERCISE FOR PRACTICE
1. $Q_1 = 395$, $Q_3 = 725.333$, $Md = 557.5$, $SK_B = \frac{Q_3 + Q_1 - 2Md}{Q_3 - Q_1} = 0.016$
2. $P_{10} = 11.67$, $P_{90} = 47.14$, $Md = 27.50$, $SK_K = \frac{P_{90} + P_{10} - 2Md}{P_{90} - P_{10}} = 0.11$
3. $\bar{X} = 41.7$, $Md = 42.14$, $\sigma = 15.43$, $SK_P = \frac{3(\bar{X} - Md)}{\sigma} = -0.086$
4. The first four moments about 10 are 1, 4.2, 14.2 and 54.6
5. β1 = 0.02249
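
Answers 2 and 4 can be cross-checked with a minimal Python sketch. It assumes that the first class in exercise 2 starts at zero, that percentiles are found by linear interpolation within a class, and that moments about a point A follow from the moments about the mean through the usual relations with d = mean − A. The function names are illustrative and do not come from the text.

```python
def grouped_percentile(upper_limits, cumulative_freq, p, lowest=0.0):
    """p-th percentile of a 'less than' table by linear interpolation."""
    target = p * cumulative_freq[-1] / 100
    prev_limit, prev_cum = lowest, 0
    for limit, cum in zip(upper_limits, cumulative_freq):
        if target <= cum:
            share = (target - prev_cum) / (cum - prev_cum)
            return prev_limit + share * (limit - prev_limit)
        prev_limit, prev_cum = limit, cum
    raise ValueError("p must lie between 0 and 100")


def moments_about(point, mean, mu2, mu3, mu4):
    """First four moments about `point` from the moments about the mean."""
    d = mean - point
    return (
        d,
        mu2 + d ** 2,
        mu3 + 3 * mu2 * d + d ** 3,
        mu4 + 4 * mu3 * d + 6 * mu2 * d ** 2 + d ** 4,
    )


# Exercise 2: marks "less than" table.
limits = [10, 20, 30, 40, 50, 60]
cum_students = [4, 10, 30, 40, 47, 50]
p10 = grouped_percentile(limits, cum_students, 10)
p90 = grouped_percentile(limits, cum_students, 90)
md = grouped_percentile(limits, cum_students, 50)
kelly = (p90 + p10 - 2 * md) / (p90 - p10)
print(round(p10, 2), round(p90, 2), round(md, 2), round(kelly, 2))

# Exercise 4: moments about the mean are 0, 3.2, 3.6 and 20; mean is 11.
print([round(m, 1) for m in moments_about(10, 11, 3.2, 3.6, 20)])
print([round(m, 1) for m in moments_about(0, 11, 3.2, 3.6, 20)])
```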

5.10 SUGGESTED READINGS FOR REFERENCE


SUGGESTED READINGS
‰‰ R Selvaraj, Quantitative Methods in Management, Problems and
Solutions, Excel Books, 2008
‰‰ J K Sharma, Fundamentals of Business Statistics, 2010
‰‰ R S Bhardwaj, Mathematics and Statistics for Business, Excel
Books, 2012
‰‰ Dey, B.R., Text Book of Managerial Statistics, Macmillan India
Ltd, 2005
‰‰ Gupta, S.C., Kapoor, V.K., Fundamentals of Mathematical
Statistics, Sultan Chand & Sons, 1970
‰‰ Gallagher, C.A. and Watson, H.J., Quantitative Methods for
Business Decisions, McGraw Hill, Inc., 1976

‰‰ D P Apte, Statistical Tools for Managers using MS Excel, Excel
Books, 2009
‰‰ Bierman H., Bonini C.P., and Hausman W.H., Quantitative
Analysis for Business Decisions, Homewood, Illinois, Richard D.
Irwin, Inc., 1973.
‰‰ Gordon, G., and Pressman I., Quantitative Decision Making for
Business, New Delhi, National Publishing House, 1983.

E-REFERENCES
‰‰ www.math.uah.edu/stat/expect/Skew.html
‰‰ http://www.itl.nist.gov/
‰‰ http://www.real-statistics.com/

C H A P T E R  6

CORRELATION ANALYSIS

CONTENTS
6.1 Introduction
6.2 Types of Correlation
6.2.1 Positive or Negative Correlation
6.2.2 Simple or Multiple Correlations
6.2.3 Partial or Total Correlation
6.2.4 Linear and Non-linear Correlation
6.3 Methods of Calculating Correlation
6.4 Scatter Diagram Method
6.5 Co-variance Method – The Karl Pearson’s Correlation Coefficient
6.5.1 Assumptions Underlying Karl Pearson’s Correlation Coefficient
6.5.2 Interpretation of R
6.5.3 Estimation of Probable Error
6.6 Rank Correlation Method
6.6.1 Rank Correlation when Ranks are given
6.6.2 Rank Correlation when Ranks are not given
6.6.3 Rank Correlation when Equal Ranks are given
6.7 Correlation Coefficient using Concurrent Deviation
6.8 Summary
6.9 Descriptive Questions
6.10 Answers and Hints
6.11 Suggested Readings for Reference


INTRODUCTORY CASELET

RBI’S BALANCING ACT AMID SHAKY CURRENCY MARKET

The correlation between the Sensex and the rupee has been drifting
away from its historical averages, following RBI’s interventions
in the currency market. The central bank has been intervening
in the forex market in order to cap the significant upside in the
rupee as well as to build forex reserves. The 120-day correlation
between the Sensex and the rupee has fallen to –0.36.
Interestingly, such correlation levels were not seen before the
global financial crisis in September 2008.

A correlation is a measurement of how two variables are related to
each other, and it can range from plus one to minus one. The
prime reason for the rupee not moving in tandem with the equity
gauges is the change in RBI’s focus. Of late, RBI has been focusing
on building the foreign reserves in order to be able to hedge against
any potential outflows of funds in case the yield increases in the
US markets.


After studying this chapter, you should be able to:


  Understand the concept of correlation
  Study about different types of correlation
  Describe various methods of calculating correlation such as
scatter diagram method
  Discuss various types of correlation coefficients viz, Karl
Pearson correlation coefficient, rank correlation and
coefficient based on concurrent deviations.

6.1 INTRODUCTION
We often encounter situations where data appear as pairs of
figures relating to two variables, for example, price and demand of a
commodity, money supply and inflation, industrial growth and GDP,
advertising expenditure and market share, etc. Examples of correlation
problems are found in the study of the relationship between IQ and
aggregate percentage marks obtained in a mathematics examination, or
blood pressure and metabolism. In these examples, both variables are
observed as they naturally occur, since neither variable can be fixed
at predetermined levels.
These are some of the important definitions of correlation.
Croxton and Cowden say, “When the relationship is of a
quantitative nature, the appropriate statistical tool for discovering
and measuring the relationship and expressing it in a brief formula
is known as correlation”.
A.M. Tuttle says, “Correlation is an analysis of the covariation between
two or more variables.”
W.A. Neiswanger says, “Correlation analysis contributes to the
understanding of economic behavior, aids in locating the critically
important variables on which others depend, may reveal to the
economist the connections by which disturbances spread and suggest
to him the paths through which stabilizing forces may become
effective.”
L.R. Conner says, “If two or more quantities vary in sympathy so
that the movement in one tends to be accompanied by corresponding
movements in others, then they are said to be correlated.”
Correlation is a degree of linear association between two random
variables. We do not differentiate these two variables as
dependent and independent variables. It may be the case that one
is the cause and the other is an effect, i.e. independent and dependent
variables respectively. On the other hand, both may be dependent
on a third variable. In some cases there may not be any
cause-effect relationship at all. Therefore, if we do not consider and
study the underlying economic or physical relationship, correlation
may sometimes give absurd results. For example, take the case of global
average temperature and the Indian population. Both have been increasing over
the past 50 years, but they are obviously not related.
Correlation is an analysis of the degree to which two or more variables
fluctuate with reference to each other. Correlation is expressed by a
coefficient ranging between –1 and +1. A positive (+ve) sign indicates
movement of the variables in the same direction. For example, the quantity of
fertilizer used on a farm and the yield show a positive relationship
within technological limits. A negative (–ve) coefficient
indicates movement of the variables in opposite directions, i.e.
when one variable decreases, the other increases. For example, the price
and demand of a commodity have an inverse relationship. Absence of
correlation is indicated if the coefficient is close to zero. A value of the
coefficient close to ±1 denotes a very strong linear relationship.
The study of correlation helps managers in the following ways:
‰‰ To identify the relationship between various factors and decision variables.
‰‰ To estimate the value of one variable for a given value of the other if both
are correlated, e.g. estimating sales for a given advertising and
promotion expenditure.
‰‰ To understand economic behaviour and market forces.
‰‰ To reduce uncertainty in decision-making to a large extent.
In business, correlation analysis often helps managers take
decisions by estimating the effects of changing the values of
decision variables like promotion, advertising, price and production
processes on objective parameters like costs, sales, market share,
consumer satisfaction and competitive price. The decision becomes
more objective by removing subjectivity to a certain extent. However,
it must be understood that correlation analysis only tells us whether
two or more variables in the data fluctuate together or not. The
fluctuation is not necessarily due to a cause and effect relationship. Whether
fluctuations in one of the variables indeed affect the other
has to be established through a logical understanding of the business
environment.
Some correlations could be completely nonsensical, like
the increase in I.T. jobs and the reduction in wheat production over the past 3
years in India, or the share market bull run of 2004 to 2007 and the increase
in suicides by farmers in India. There are many reasons for such
spurious correlations. Hence, before we use correlation analysis we
must check a few factors responsible for the apparent relationship.
Firstly, the fluctuation may be a chance coincidence. In this case we
could look at the data over different periods and also study whether one factor
affects the other through a third factor that we have not considered.
Secondly, even when correlation exists, logical analysis may tell
us that one variable is independent and the other is dependent on it. For example, the
surface temperature of the Pacific Ocean (El Niño) affects monsoons
in India, but monsoons do not affect the temperature of the Pacific Ocean.
Thirdly, in some cases both variables under study may be fluctuating
together due to variation in a third variable. Thus both variables
under correlation analysis may be dependent variables and hence
not causally related to each other. In such a case, the manager cannot vary one of
them and expect the other variable to vary. For example, the correlation between an
increase in share prices and a stronger rupee against the dollar may be due
to an increase in Foreign Direct Investment (FDI). In this case, expecting
to control falling share prices through the Reserve
Bank selling dollars is incorrect. To control these two variables we need to control
FDI. Further, if the falling share prices are due to market sentiment or an
overheated market, controlling FDI may not help. Thus, the manager
needs to analyze the problem in its business environment before he/she
can apply correlation analysis in decision-making.
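
The third case, where two variables move together only because both depend on a third, can be sketched with simulated data. The following Python sketch is illustrative only: the series are randomly generated, not real market data, and the names `fdi`, `share_index` and `rupee_value` are hypothetical labels borrowed from the example in the text.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Karl Pearson's correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(42)  # fixed seed so that the run is repeatable
n = 200

# Simulated third factor (standing in for FDI) and two independent noises.
fdi = [rng.gauss(0, 1) for _ in range(n)]
share_noise = [rng.gauss(0, 0.3) for _ in range(n)]
rupee_noise = [rng.gauss(0, 0.3) for _ in range(n)]

# Both observed variables depend on the third factor, not on each other.
share_index = [f + e for f, e in zip(fdi, share_noise)]
rupee_value = [2 * f + e for f, e in zip(fdi, rupee_noise)]

r_observed = pearson_r(share_index, rupee_value)
# Once the influence of the third factor is taken out, only noise remains.
r_without_fdi = pearson_r(share_noise, rupee_noise)

print("correlation as observed:", round(r_observed, 3))
print("correlation with the third factor removed:", round(r_without_fdi, 3))
```

The observed coefficient is high even though neither simulated variable influences the other, which is why changing one of them would not move the other.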

Fill in the blanks:
1. Correlation is an analysis of the ................... between two or
more variables.
2. Correlation is a ................... of linear association between two
random variables.
3. ................... analysis helps to identify relationship of various
factors and decision variables.

A correlation considers the joint variation of two measurements
with no distinction as independent and dependent variables. It is
a measure of the linear relationship between them. In correlation, we
do not restrict or set the values of any measurement; we observe them
as they vary to different levels. It only indicates whether the
two variables move together linearly. On the other hand, the
regression problem considers the frequency distribution of one
variable when another is set at each of several possible levels.

6.2 TYPES OF CORRELATION


Correlation can be studied as positive and negative, simple
and multiple, partial and total, linear and non-linear. Further, correlation
can be studied by plotting graphs on x-y axes or by
algebraic calculation of a coefficient of correlation. Graphs are usually
scatter diagrams or line diagrams. The correlation coefficients have
been defined in different ways; of these, Karl Pearson’s correlation
coefficient, Spearman’s rank correlation coefficient and the coefficient
of determination are the most popular.

In managerial decision-making, it is good practice to draw the scatter
diagram first, and then study the logical relationship to identify the
type of correlation and the cause-effect relation. Only then should the manager
calculate the coefficient of correlation for further mathematical
analysis. The types of correlation that need to be differentiated before
using the correlation coefficient for managerial decision-making are
given below.

6.2.1 POSITIVE OR NEGATIVE CORRELATION


In positive correlation, both factors increase or decrease together.
Positive or direct correlation refers to the movement of variables in
the same direction.

The correlation is said to be positive when an increase (decrease) in
the value of one variable is accompanied by an increase (decrease)
in the value of the other variable also.

Negative or inverse correlation refers to the movement of the
variables in opposite directions. Correlation is said to be negative if
an increase (decrease) in the value of one variable is accompanied
by a decrease (increase) in the value of the other.

When correlation is perfect, the scatter diagram will show a
linear (straight line) plot with all points falling on a straight line. If we
choose an appropriate scale, the inclination of the straight line can be adjusted
to 45°, although this is not necessary as long as the inclination is not 0° or
90°. At 0° or 90° there is no correlation at all, because the value of one variable
changes without any change in the value of the other variable. In the case of
negative correlation, when one variable increases the other decreases
and vice versa. If the scatter diagram shows the points distributed
closely around an imaginary line, we say there is a high degree of correlation.
On the other hand, if we can hardly see any unique imaginary line
around which the observations are scattered, we say correlation
does not exist. Even when the imaginary line is parallel to one
of the axes, we say no correlation exists between the variables. If the
imaginary line is a straight line, we say the correlation is linear.

6.2.2 SIMPLE OR MULTIPLE CORRELATIONS


In simple correlation the variation is between only two variables under
study and the variation is hardly influenced by any external factor.
In other words, if one of the variables remains same, there won’t be
any change in other variable. For example, variation in sales against
price change in case of a price sensitive product under stable market
conditions shows a negative correlation. In multiple correlations,
more than two variables affect one another. In such a case, we need to
study correlation between all the pairs that are affecting each other
and study extent to which they have the influence.

6.2.3 PARTIAL OR TOTAL CORRELATION
In multiple correlation analysis there are two approaches to
studying the correlation. In partial correlation, we study the variation
of two variables while excluding the effects of other variables by keeping
them under controlled conditions. In a ‘total correlation’ study we
allow all relevant variables to vary with respect to each other and find
the combined effect. With few variables, it is feasible to study ‘total
correlation’. As the number of variables increases, it becomes impractical
to study ‘total correlation’. For example, the coefficient of correlation
between the yield of wheat and chemical fertilizers, excluding the effects of
pesticides and manures, is called partial correlation. Total correlation
is based upon all the variables.

6.2.4 LINEAR AND NON-LINEAR CORRELATION

When the amount of change in one variable tends to keep a
constant ratio to the amount of change in the other variable, then
the correlation is said to be linear.

But if the amount of change in one variable does not bear a
constant ratio to the amount of change in the other variable, then
the correlation is said to be non-linear.

The distinction between linear and non-linear is based upon the
consistency of the ratio of change between the variables. The manager
must be careful in analyzing the correlation using coefficients because
most of the coefficients are based on the assumption of linearity. Hence
plotting a scatter diagram is good practice. In the case of linear correlation,
the derivative of the relationship is constant, and the graph
of the data is a straight line. In the case of nonlinear correlation, the
rate of variation changes as values increase or decrease. A nonlinear
relationship could be approximated by a polynomial (parabolic, cubic,
etc.), exponential or sinusoidal function. In such cases, using
correlation coefficients based on the linear assumption will be misleading unless
they are used over a very short data range. Using computers, we could analyze
a nonlinear correlation to a certain extent, with some simplifying
assumptions.
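
The warning about linear coefficients can be made concrete with a small Python sketch. It assumes a perfectly parabolic relationship y = x², chosen purely for illustration, and compares Karl Pearson's coefficient over a wide symmetric range with the same coefficient over a short range. The helper `pearson_r` is written out here and is not taken from the text.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Karl Pearson's correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Wide range: x runs from -5 to 5 and y depends on x exactly (y = x^2).
xs_wide = list(range(-5, 6))
ys_wide = [x ** 2 for x in xs_wide]

# Short range: x runs from 1.0 to 2.0, where the curve is nearly straight.
xs_short = [1 + 0.1 * k for k in range(11)]
ys_short = [x ** 2 for x in xs_short]

r_wide = pearson_r(xs_wide, ys_wide)
r_short = pearson_r(xs_short, ys_short)

print("r over the wide range :", round(r_wide, 3))
print("r over the short range:", round(r_short, 3))
```

Over the wide range the coefficient is zero although y is fully determined by x, while over the short range it is close to +1. This is why a scatter diagram is worth drawing before a linear coefficient is trusted.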

Fill in the blanks:


4. The correlation is said to be ................... when the increase
(decrease) in the value of one variable is accompanied by an
increase (decrease) in the value of other variable also.
5. Correlation is said to be ..................., if an increase (decrease)
in the value of one variable is accompanied by a decrease
(increase) in the value of other.
6. When the amount of change in one variable tends to keep a
constant ratio to the amount of change in the other variable,
then the correlation is said to be ................... .
7. In case of ................... correlation the rate of variation changes
as values increase or decrease.

Give practical examples from your life of the different types of
correlation which you have studied above.

A scatter diagram not only tells us about linearity or nonlinearity but
also whether the data is cyclic. When the values of two variables have a
constant rate of change, it is linear correlation.

6.3 METHODS OF CALCULATING CORRELATION
Simple linear correlation is a statistical tool applied in many business
situations to find the degree to which two variables vary linearly with
one another. In many situations, even if more than
two variables are involved, two of them may be dominant. In such a case,
correlation analysis between these two variables helps us to measure
the degree of association between them. For example,
demand for a particular product depends on a number of factors.
However, the association of demand with price may be dominant.
Correlation analysis may also be necessary to eliminate a variable
which shows low or hardly any correlation with the variable of our
interest. In statistics, there are a number of measures to describe the degree
of association between variables.
These are Karl Pearson’s correlation coefficient, Spearman’s rank
correlation coefficient, the coefficient of determination, Yule’s coefficient
of association, the coefficient of colligation, etc.
There are different methods which help us to find out whether the
variables are related or not.
‰‰ Scatter Diagram Method.
‰‰ Karl Pearson’s Coefficient of correlation
‰‰ Rank Method
‰‰ Concurrent deviation method.
We shall discuss these methods one by one.


State whether the following statements are true/false:


8. Correlation analysis may also be necessary to eliminate a
variable which shows low or hardly any correlation with the
variable of our interest.
9. Simple linear correlation is a statistical tool applied in many
business situations to find the degree to which two variables
vary linearly to one another.

Suppose you have some achievement test results collected in
a project on which you had worked years ago. The achievement
test had four scales: vocabulary, reading, math concepts, and math
problem solving. How will you find the correlation between your scores in
different subjects and interpret which was your strongest subject?

6.4 SCATTER DIAGRAM METHOD
A scatter diagram is the most fundamental graph plotted to show the
relationship between two variables. It is a simple way to represent a
bivariate distribution, i.e. the distribution of two
random variables. The two variables are plotted against the X
and Y axes, one on each. Thus, every data pair (xi, yi) is represented by a point on
the graph, x being the abscissa and y being the ordinate of the point. From
a scatter diagram we can find whether there is any relationship between
x and y, and if yes, what type of relationship. A scatter diagram thus
indicates the nature and strength of the correlation.

The pattern of points obtained by plotting the observed points is
known as a scatter diagram.

It gives us two types of information.


‰‰ Whether the variables are related or not.
‰‰ If so, what kind of relationship exists, i.e. the estimating equation that
describes the relationship.
If the dots cluster around a line, the correlation is called linear
correlation. If the dots cluster around a curve, the correlation is called
non-linear or curvilinear correlation.
A scatter diagram is drawn to visualize the relationship between two
variables. The values of the more important variable are plotted on the
X-axis, while the values of the other variable are plotted on the Y-axis.
On the graph, dots are plotted to represent the different pairs of data.
When dots are plotted to represent all the pairs, we get a scatter
diagram. The way the dots scatter gives an indication of the kind of
relationship which exists between the two variables. While drawing a
scatter diagram, it is not necessary to place the zero
values of the X and Y variables at the origin; the minimum values of the variables
considered may be taken instead.
When there is a positive correlation between the variables, the dots
on the scatter diagram run from left hand bottom to the right hand
upper corner. In case of perfect positive correlation all the dots will lie
on a straight line.
When a negative correlation exists between the variables, dots on the
scatter diagram run from the upper left hand corner to the bottom
right hand corner. In case of perfect negative correlation, all the dots
lie on a straight line.
If a scatter diagram is drawn and no path is formed, there is no
correlation.

Example: Figures on advertisement expenditure (X) and Sales (Y) of
a firm for the last ten years are given below. Draw a scatter diagram.

Advertisement cost in '000 ₹   40   65   60   90   85   75   35   90   34   76
Sales in Lakh ₹                45   56   58   82   65   70   64   85   50   85

Solution:
[Figure: Scatter Diagram: Correlation between Advertisement Cost & Sales. X-axis: Advertisement cost in '000 ₹; Y-axis: Sales in Lakh ₹]

Example: Draw a scatter diagram for the following data of eight years
on income (X) and expenditure (Y).

Income (X) (₹)        100   110   113   120   125   130   130   140
Expenditure (Y) (₹)   85    90    91    100   110   125   125   130

Solution:
[Figure: Scatter Diagram. X-axis: Income (X) (₹); Y-axis: Expenditure (Y) (₹)]

Fill in the blanks:
10. Scatter diagram is the most fundamental graph plotted to
show relationship between ................... variables.
11. The pattern of points obtained by plotting the observed points
is known as ...................
12. In case of perfect positive correlation all the dots will lie on a
................... line.

Collect the data of income and expenditure of ten households in
your locality. Draw a scatter diagram to plot the correlation between
income and expenditure. Interpret the results and prepare a short
report.

6.5 CO-VARIANCE METHOD – THE KARL PEARSON’S CORRELATION COEFFICIENT

The correlation coefficient measures the degree of association
between two variables X and Y.

Karl Pearson’s formula for the correlation coefficient is given as,
$$r = \frac{\text{Cov}(x, y)}{\sigma_X \sigma_Y}$$
$$r = \frac{\frac{1}{n} \sum (X - \bar{X})(Y - \bar{Y})}{\sigma_X \sigma_Y} \quad (1)$$
Where r is the ‘Correlation Coefficient’ or ‘Product Moment Correlation
Coefficient’ between X and Y. $\sigma_X$ and $\sigma_Y$ are the standard deviations
of X and Y respectively. ‘n’ is the number of pairs of values of X
and Y in the given data. The expression $\frac{1}{n} \sum (X - \bar{X})(Y - \bar{Y})$ is known
as the covariance between the variables X and Y. It is denoted as Cov
(x, y). The correlation coefficient r is a dimensionless number whose
value lies between +1 and –1. Positive values of r indicate positive (or
direct) correlation between the two variables X and Y, i.e. both X and
Y increase or decrease together. Negative values of r indicate negative
(or inverse) correlation, meaning that an increase in one
variable X or Y is accompanied by a decrease in the value of the other variable.
A zero correlation means that there is no linear association between the two
variables.

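The formula can be modified as,
$$r = \frac{\frac{1}{n} \sum (X - \bar{X})(Y - \bar{Y})}{\sigma_X \sigma_Y} = \frac{\frac{1}{n} \sum (XY - X\bar{Y} - \bar{X}Y + \bar{X}\bar{Y})}{\sigma_X \sigma_Y}$$
$$= \frac{\frac{\sum XY}{n} - \frac{\sum X}{n} \times \frac{\sum Y}{n}}{\sqrt{\frac{\sum X^2}{n} - \left( \frac{\sum X}{n} \right)^2} \sqrt{\frac{\sum Y^2}{n} - \left( \frac{\sum Y}{n} \right)^2}} \quad (2)$$
$$= \frac{E[XY] - E[X]E[Y]}{\sqrt{E[X^2] - (E[X])^2} \sqrt{E[Y^2] - (E[Y])^2}} \quad (3)$$
Equations (2) and (3) are alternate forms of equation (1). These have the
advantage that we don’t have to subtract each value from the mean.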
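
The equivalence of equations (1) and (2) can be checked numerically. The following Python sketch is a minimal illustration, assuming population standard deviations (division by n) as in the formulas above. The `prices` and `demand` figures are hypothetical and the function names are not from the text.

```python
from math import sqrt


def r_from_deviations(xs, ys):
    """Equation (1): covariance divided by the product of the S.D.s."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)


def r_from_raw_sums(xs, ys):
    """Equation (2): only sums of X, Y, XY, X^2 and Y^2 are needed."""
    n = len(xs)
    e_x, e_y = sum(xs) / n, sum(ys) / n
    e_xy = sum(x * y for x, y in zip(xs, ys)) / n
    e_x2 = sum(x * x for x in xs) / n
    e_y2 = sum(y * y for y in ys) / n
    return (e_xy - e_x * e_y) / (sqrt(e_x2 - e_x ** 2) * sqrt(e_y2 - e_y ** 2))


# Hypothetical price and demand figures (illustrative data only).
prices = [10, 12, 15, 18, 20]
demand = [95, 90, 80, 72, 65]

r1 = r_from_deviations(prices, demand)
r2 = r_from_raw_sums(prices, demand)
print(round(r1, 4), round(r2, 4))
```

Both functions return the same negative coefficient apart from rounding error, in line with the inverse relationship between price and demand described earlier.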
Example: The data of advertisement expenditure (X) and sales (Y)
of a company for past 10 year period is given below. Determine the
correlation coefficient between these variables and comment on the
correlation.

X 50 50 50 40 30 20 20 15 10 5
Y 700 650 600 500 450 400 300 250 210 200
Solution:

$$
\bar{X} = \frac{\sum X}{n} = \frac{290}{10} = 29, \qquad
\bar{Y} = \frac{\sum Y}{n} = \frac{4260}{10} = 426
$$

S.No.      X      Y    x = (X − X̄)   y = (Y − Ȳ)     x²       y²      xy
1         50    700        21            274         441    75076    5754
2         50    650        21            224         441    50176    4704
3         50    600        21            174         441    30276    3654
4         40    500        11             74         121     5476     814
5         30    450         1             24           1      576      24
6         20    400        -9            -26          81      676     234
7         20    300        -9           -126          81    15876    1134
8         15    250       -14           -176         196    30976    2464
9         10    210       -19           -216         361    46656    4104
10         5    200       -24           -226         576    51076    5424
Total ∑  290   4260         0              0        2740   306840   28310

Now,

$$
r = \frac{\frac{1}{n}\sum (X - \bar{X})(Y - \bar{Y})}{\sigma_X \sigma_Y}
  = \frac{\frac{1}{n}\sum xy}{\sqrt{\dfrac{\sum x^2}{n}}\,\sqrt{\dfrac{\sum y^2}{n}}}
  = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}
  = \frac{28310}{\sqrt{2740 \times 306840}} = 0.976
$$

This value of Karl Pearson’s coefficient r = 0.976 indicates a high
degree of positive association between the variables X and Y.
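
The deviation-based calculation above can be checked with a few lines of Python. This is only a minimal sketch: the function name `pearson_r` is our own choice and is not part of the text, and only the standard library is assumed.

```python
from math import sqrt

# Advertisement expenditure (X) and sales (Y) from the worked example.
X = [50, 50, 50, 40, 30, 20, 20, 15, 10, 5]
Y = [700, 650, 600, 500, 450, 400, 300, 250, 210, 200]


def pearson_r(xs, ys):
    """Karl Pearson's r from deviations about the actual means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]          # x = X - mean of X
    dy = [y - mean_y for y in ys]          # y = Y - mean of Y
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / sqrt(sum_x2 * sum_y2)


r = pearson_r(X, Y)
print(round(r, 3))  # 0.976
```

The factor 1/n cancels between numerator and denominator, which is why the function works directly with the column totals ∑xy, ∑x² and ∑y².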


Effect of shifting origin and change of scale on correlation coefficient
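The values of X̄ and Ȳ may not be integers. In such a case, the
calculations become tedious. We can expand the formula as,

$$
r = \frac{\frac{1}{n}\sum (X - \bar{X})(Y - \bar{Y})}{\sigma_X \sigma_Y}
  = \frac{\sum XY - \frac{1}{n}\sum X \sum Y}
         {\sqrt{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}\,
          \sqrt{\sum Y^2 - \frac{1}{n}\left(\sum Y\right)^2}}
$$
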
Further simplification in computations can be achieved by calculating
the deviations of the observations from an assumed mean rather than
the actual mean, and also scaling these deviations conveniently.
Here we use the property that the correlation coefficient does not change
with a shift of origin, i.e. adding or subtracting any constant from
the two variables (X, Y) leaves the correlation coefficient the same. It
also remains unchanged if we change the scales by dividing or multiplying
the variables by a positive constant. Let X and Y be the two variables with
values x1, x2, ...., xn and y1, y2, ...., yn. Let us define another two variables
obtained by transformation as,

$$
U = \frac{X - a}{g} \quad \text{and} \quad V = \frac{Y - b}{h}
$$


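Where a, b, g and h are constants, with g and h positive.
In this case, we have defined variables U and V through a shift of origin
from (0, 0) to (a, b) and a change of the X and Y scales by factors ‘g’ and
‘h’ respectively. Thus for every observation pair (xi, yi) there is a
corresponding pair (ui, vi) such that,

$$
u_i = \frac{x_i - a}{g} \quad \text{and} \quad v_i = \frac{y_i - b}{h}
$$

Now,

$$
\bar{X} = \frac{\sum x_i}{n} = \frac{\sum (g \times u_i + a)}{n}
        = \frac{g \times \sum u_i + n \times a}{n} = g\bar{U} + a
$$

Similarly,

$$
\bar{Y} = h\bar{V} + b
$$

Now,

$$
x_i - \bar{X} = (g \times u_i + a) - (g\bar{U} + a) = g(u_i - \bar{U})
$$

And

$$
y_i - \bar{Y} = h(v_i - \bar{V})
$$

Hence,

$$
\sigma_X^2 = \frac{\sum (x_i - \bar{X})^2}{n}
           = \frac{g^2 \times \sum (u_i - \bar{U})^2}{n} = g^2 \sigma_U^2
$$

And

$$
\sigma_Y^2 = h^2 \sigma_V^2
$$

Now,

$$
r_{XY} = \frac{\frac{1}{n}\sum (x_i - \bar{X})(y_i - \bar{Y})}{\sigma_X \sigma_Y}
       = \frac{\sum g \times (u_i - \bar{U}) \times h \times (v_i - \bar{V})}
              {n \times (g \times \sigma_U)(h \times \sigma_V)}
       = \frac{\frac{1}{n}\sum (u_i - \bar{U})(v_i - \bar{V})}{\sigma_U \sigma_V}
       = r_{UV}
$$
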
This result is very useful for manual calculations. We can select
arbitrary constants a, b, g and h so as to simplify the data and then
find rUV, which equals rXY. Thus, if any constant is added to or
subtracted from the variables, or the variables are multiplied or divided
by any positive constant, the correlation coefficient between these two
variables does not change.
Example: The data of advertisement expenditure (X) and sales (Y)
of a company for the past 10-year period is given below. Determine the
correlation coefficient between these variables and comment on the
correlation.

X 50 50 50 40 30 20 20 15 10 5
Y 700 650 600 500 450 400 300 250 210 200
Solution: We shall take U to be the deviation of X values from the
assumed mean of 30 divided by 5. Similarly, V represents the deviation
of Y values from the assumed mean of 400 divided by 10.

Short cut procedure for calculation of correlation coefficient

Sl. No.   X = xi   Y = yi   U = ui   V = vi   uivi   ui²    vi²
1           50      700       4       30      120     16    900
2           50      650       4       25      100     16    625
3           50      600       4       20       80     16    400
4           40      500       2       10       20      4    100
5           30      450       0        5        0      0     25
6           20      400      -2        0        0      4      0
7           20      300      -2      -10       20      4    100
8           15      250      -3      -15       45      9    225
9           10      210      -4      -19       76     16    361
10           5      200      -5      -20      100     25    400
Total                        -2       26      561    110   3136

$$
r = \frac{\sum_{i=1}^{n} u_i v_i - \frac{1}{n}\sum_{i=1}^{n} u_i \sum_{i=1}^{n} v_i}
         {\sqrt{\sum_{i=1}^{n} u_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} u_i\right)^2}\,
          \sqrt{\sum_{i=1}^{n} v_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} v_i\right)^2}}
  = \frac{561 - \dfrac{(-2)(26)}{10}}
         {\sqrt{110 - \dfrac{4}{10}}\,\sqrt{3136 - \dfrac{676}{10}}}
  = \frac{561 + 5.2}{\sqrt{109.6 \times 3068.4}} = 0.976
$$
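
The short cut procedure can be verified numerically. The sketch below, which assumes nothing beyond the standard library, computes r from raw sums for both the original data and the transformed data; the helper name `r_from_sums` is hypothetical and chosen only for this illustration.

```python
from math import sqrt

X = [50, 50, 50, 40, 30, 20, 20, 15, 10, 5]
Y = [700, 650, 600, 500, 450, 400, 300, 250, 210, 200]

# Assumed means and scale factors used in the worked example.
a, g = 30, 5
b, h = 400, 10


def r_from_sums(us, vs):
    """Correlation from raw sums, without subtracting the actual means."""
    n = len(us)
    s_u, s_v = sum(us), sum(vs)
    s_uv = sum(u * v for u, v in zip(us, vs))
    s_uu = sum(u * u for u in us)
    s_vv = sum(v * v for v in vs)
    num = s_uv - s_u * s_v / n
    den = sqrt(s_uu - s_u ** 2 / n) * sqrt(s_vv - s_v ** 2 / n)
    return num / den


U = [(x - a) / g for x in X]
V = [(y - b) / h for y in Y]

r_uv = r_from_sums(U, V)
r_xy = r_from_sums(X, Y)
print(sum(U), sum(V))                  # -2.0 26.0
print(round(r_uv, 3), round(r_xy, 3))  # 0.976 0.976
```

Both calls give the same value, which illustrates that shifting the origin and rescaling by positive constants leaves r unchanged.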

Correlation of Grouped Data


Many times the observations are grouped into a ‘two-way’ frequency
distribution table, called a bivariate frequency distribution.
It is a matrix where rows are grouped for the X variable and columns are
grouped for the Y variable. Each cell, say (i, j), holds the frequency,
i.e. the count of observations that fall in both the i-th class of X and
the j-th class of Y. In this case the correlation coefficient is given by:

$$
r = \frac{\sum f \times m_x \times m_y - \frac{1}{n}\sum (f \times m_x) \sum (f \times m_y)}
         {\sqrt{\sum (f \times m_x^2) - \dfrac{\left(\sum f \times m_x\right)^2}{n}}\,
          \sqrt{\sum (f \times m_y^2) - \dfrac{\left(\sum f \times m_y\right)^2}{n}}}
$$

Where mx and my are the class marks of the frequency distributions of X
and Y, fx and fy are the marginal frequencies of X and Y, fxy are the
joint frequencies of X and Y, and n is the total frequency. As explained
earlier, to make the calculations easier, we can use the property that
shifting the origin and changing the scale do not affect the correlation
coefficient. Hence we could use the transformation,

$$
d_x = \frac{m_x - a}{g} \quad \text{and} \quad d_y = \frac{m_y - b}{h}
$$


This is explained in the following example.
Example: Calculate the coefficient of correlation for the following data.
In the table, the column classes (0-500, ...) belong to X and the row
classes (0-200, ...) belong to Y, as used in the solution.

Y \ X       0-500   500-1000   1000-1500   1500-2000   2000-2500   Total
0-200         12        6          -           -           -         18
200-400        2       18          4           2           1         27
400-600        -        4          7           3           -         14
600-800        -        1          -           2           1          4
800-1000       -        -          1           2           3          6
Total         14       29         12           9           5         69

Solution: Let the assumed mean for X be a = 1250 and the scaling
factor g = 500. Therefore, we can calculate f × dx and f × dx² from the
marginal distribution of X as,

X Class      Class Mark mx   dx = (mx − a)/g   Frequency f   f × dx   f × dx²
0-500             250             -2               14          -28       56
500-1000          750             -1               29          -29       29
1000-1500        1250              0               12            0        0
1500-2000        1750              1                9            9        9
2000-2500        2250              2                5           10       20
Total                                              69          -38      114

Similarly, let the assumed mean for Y be b = 500 and the scaling
factor h = 200. Therefore, we can calculate f × dy and f × dy² from the
marginal distribution of Y as,

Y Class      Class Mark my   dy = (my − b)/h   Frequency f   f × dy   f × dy²
0-200             100             -2               18          -36       72
200-400           300             -1               27          -27       27
400-600           500              0               14            0        0
600-800           700              1                4            4        4
800-1000          900              2                6           12       24
Total                                              69          -47      127

From the values of dx, dy and the joint frequencies given in the table,
writing each non-zero term as (dx)(dy)(f), we can find the value,

$$
\begin{aligned}
\sum f \times d_x \times d_y
 &= (-2)(-2)(12) + (-1)(-2)(6) + (-2)(-1)(2) + (-1)(-1)(18) + (1)(-1)(2) + (2)(-1)(1) \\
 &\quad + (-1)(1)(1) + (1)(1)(2) + (2)(1)(1) + (1)(2)(2) + (2)(2)(3) \\
 &= 48 + 12 + 4 + 18 - 2 - 2 - 1 + 2 + 2 + 4 + 12 = 97
\end{aligned}
$$

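Hence,

$$
\begin{aligned}
r &= \frac{\sum f \times d_x \times d_y - \frac{1}{n}\sum (f \times d_x) \sum (f \times d_y)}
          {\sqrt{\sum (f \times d_x^2) - \dfrac{\left(\sum f \times d_x\right)^2}{n}}\,
           \sqrt{\sum (f \times d_y^2) - \dfrac{\left(\sum f \times d_y\right)^2}{n}}} \\[4pt]
  &= \frac{97 - \frac{1}{69} \times (-38)(-47)}
          {\sqrt{114 - \frac{1}{69} \times (-38)^2}\,\sqrt{127 - \frac{1}{69} \times (-47)^2}}
   = \frac{71.1159}{9.647 \times 9.746} = 0.76
\end{aligned}
$$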
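
The grouped-data calculation can be reproduced in Python as a check on the totals. The sketch below is illustrative only: the variable names are our own, and the joint frequencies are typed in from the example table with each dash read as zero.

```python
from math import sqrt

# Joint frequencies: rows are Y classes (0-200 ... 800-1000),
# columns are X classes (0-500 ... 2000-2500). A dash in the table is 0.
freq = [
    [12, 6, 0, 0, 0],
    [2, 18, 4, 2, 1],
    [0, 4, 7, 3, 0],
    [0, 1, 0, 2, 1],
    [0, 0, 1, 2, 3],
]
x_marks = [250, 750, 1250, 1750, 2250]
y_marks = [100, 300, 500, 700, 900]

# Step deviations with the assumed means (1250, 500) and scales (500, 200).
dx = [(m - 1250) // 500 for m in x_marks]   # [-2, -1, 0, 1, 2]
dy = [(m - 500) // 200 for m in y_marks]    # [-2, -1, 0, 1, 2]

n = sum(sum(row) for row in freq)
s_dx = s_dy = s_dx2 = s_dy2 = s_dxdy = 0
for i, row in enumerate(freq):
    for j, f in enumerate(row):
        s_dx += f * dx[j]
        s_dy += f * dy[i]
        s_dx2 += f * dx[j] ** 2
        s_dy2 += f * dy[i] ** 2
        s_dxdy += f * dx[j] * dy[i]

r = (s_dxdy - s_dx * s_dy / n) / (
    sqrt(s_dx2 - s_dx ** 2 / n) * sqrt(s_dy2 - s_dy ** 2 / n)
)
print(n, s_dx, s_dy, s_dx2, s_dy2, s_dxdy)  # 69 -38 -47 114 127 97
print(round(r, 2))  # 0.76
```

Summing over the cells directly gives the same marginal totals as the two marginal tables, since each cell contributes its frequency to one X class and one Y class.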

6.5.1 ASSUMPTIONS UNDERLYING KARL PEARSON’S CORRELATION COEFFICIENT
The assumptions underlying Karl Pearson’s correlation coefficient
are as follows:
• Your data on both variables is measured on either an Interval
  Scale or a Ratio Scale. Interval Scales have equal intervals
  between points on the scale but do not have a true zero point.
  Ratio Scales have both equal intervals between points on the scale
  and a true zero point.
• The traits you are measuring are normally distributed in the
  population. In other words, even though the data in your sample
  may not be normally distributed (if you plot them in a histogram
  they do not form a bell-shaped curve), you are reasonably sure that
  if you could collect data from the entire population the results
  would be normally distributed.
• The relationship, if there is any, between the two variables
  is best characterized by a straight line. This is called a “linear
  relationship”. The best way to check this is to plot the variables
  on a scatter plot and see if there is a clear trend from lower left to
  upper right (a positive relationship) or from the upper left to the
  lower right (a negative relationship). If the relationship seems
  to change direction somewhere in the scatter plot, you do not have
  a linear relationship. Instead, it would be curvilinear, and
  Pearson’s r is not the best type of correlation coefficient to use.
  There are others, but they are beyond the scope of this book.
  Mild departures from linearity can be tolerated, although there is
  no precise threshold for what counts as mild.
• Homoscedasticity: The scores on the Y variable have approximately
  the same spread (variance) at each value of the X variable.
  Again, one of the easiest ways to assess homoscedasticity is to plot
  the variables on a scatter plot and make sure the “spread” of the dots
  is approximately equal along the entire length of the distribution.

6.5.2 INTERPRETATION OF R
The correlation coefficient, r ranges from −1 to 1. A value of 1 implies
that a linear equation describes the relationship between X and Y
perfectly, with all data points lying on a line for which Y increases
as X increases. A value of −1 implies that all data points lie on a line
for which Y decreases as X increases. A value of 0 implies that there
is no linear correlation between the variables.
More generally, note that (Xi − X̄)(Yi − Ȳ) is positive if and only
if Xi and Yi lie on the same side of their respective means. Thus the
correlation coefficient is positive if Xi and Yi tend to be simultaneously
greater than, or simultaneously less than, their respective means.
• The correlation coefficient is negative if Xi and Yi tend to lie on
  opposite sides of their respective means.
• The coefficient of correlation r lies between –1 and +1, inclusive
  of those values.
• When r is positive, the variables x and y increase or decrease
  together.
• r = +1 implies that there is a perfect positive correlation between
  variables x and y.
• When r is negative, the variables x and y move in opposite
  directions.
• When r = –1, there is a perfect negative correlation.
• When r = 0, the two variables are uncorrelated.

6.5.3 ESTIMATION OF PROBABLE ERROR


The probable error (P. E.) helps in judging the reliability of Karl
Pearson’s coefficient of correlation ‘r’ computed from a sample, because
‘r’ depends on the random sampling and its conditions. It is given by

$$
P.E. = 0.6745 \left( \frac{1 - r^2}{\sqrt{n}} \right)
$$

If the value of r is less than P. E., then there is no evidence of
correlation, i.e. r is not significant.
If r is more than 6 times the P. E., ‘r’ is practically certain, i.e.
significant.
By adding P. E. to and subtracting it from ‘r’, we get the upper and lower
limits within which ‘r’ of the population can be expected to lie.

Symbolically, ρ = r ± P. E.
ρ = Correlation (coefficient) of the population.
Example: If r = 0.6 and n = 64, find the probable error of the
coefficient of correlation.
Solution:

$$
P.E. = 0.6745 \left( \frac{1 - r^2}{\sqrt{n}} \right)
     = 0.6745 \left( \frac{1 - (0.6)^2}{\sqrt{64}} \right)
     = \frac{0.6745 \times 0.64}{8} = 0.054
$$
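
As a quick numerical check of the probable error example, the following sketch applies the formula P. E. = 0.6745 (1 − r²)/√n and the two rules of thumb stated above. The function name `probable_error` is hypothetical and is used only for this illustration.

```python
from math import sqrt


def probable_error(r, n):
    """Probable error of r for a sample of n pairs."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)


r, n = 0.6, 64
pe = probable_error(r, n)

# Limits within which the population coefficient is expected to lie.
lower, upper = r - pe, r + pe

# Rule of thumb from the text: r is significant if it exceeds 6 x P.E.
significant = abs(r) > 6 * pe

print(round(pe, 3))                      # 0.054
print(round(lower, 3), round(upper, 3))  # 0.546 0.654
print(significant)                       # True
```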

Fill in the blanks:

13. The correlation ................... measures the degree of association
    between two variables X and Y.
14. The expression $\frac{1}{n}\sum (X - \bar{X})(Y - \bar{Y})$ is known
    as a ................... between the variables X and Y.
15. Correlation coefficient does not change with shifting of
    ................... i.e. by adding or subtracting any constant from the
    two variables (X, Y) correlation coefficient remains same.
16. If the value of r is ................... than P. E., then there is no
    evidence of correlation i.e. r is not significant.
17. If r is ................... than 6 times the P. E. ‘r’ is practically certain
    i.e. significant.

Suppose you are doing a data analysis to understand if there is
any correlation between 5 different product categories of FMCG
products in terms of the buying attitude of customers. Collect the
data and apply Karl Pearson’s correlation coefficient to find out the
correlation between the 5 different product categories.

The coefficient of determination, r², is useful because it gives the
proportion of the variance (fluctuation) of one variable that is
predictable from the other variable. It is a measure that allows us
to determine how certain one can be in making predictions from a
certain model/graph.

6.6 RANK CORRELATION METHOD


Quite often the data is available in the form of some ranking for different
variables. Also there are occasions where it is difficult to measure the
cause-effect variables. For example, while selecting a candidate, there
are a number of factors on which the experts base their assessment. It
is not possible to measure many of these parameters in physical units,
e.g. sincerity, loyalty, integrity, tactfulness, initiative, etc. Similar is the
case during beauty contests. However, in these cases the experts may
rank the candidates. It is then necessary to find out whether the two
sets of ranks are in agreement with each other. This is measured by the
Rank Correlation Coefficient. The purpose of computing a correlation
coefficient in such situations is to determine the extent to which the
two sets of ranking are in agreement. The coefficient that is determined
from these ranks is known as Spearman’s rank coefficient, rs.
This is defined by the following formula:

$$
r_S = 1 - \frac{6 \times \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
$$

Where, n = Number of observation pairs
di = Xi – Yi
Xi = Values (ranks) of variable X and Yi = values (ranks) of variable Y

6.6.1 RANK CORRELATION WHEN RANKS ARE GIVEN
Example: Ranks obtained by a set of ten students in a mathematics
test (variable X) and a physics test (variable Y) are shown below:

Rank for Variable X    1   2   3   4   5   6   7   8    9   10
Rank for Variable Y    3   1   4   2   6   9   8   10   5   7

Determine the coefficient of rank correlation, rs.

Solution: The computations of Spearman’s rank correlation are shown
below:

Individual   Rank in Maths   Rank in Physics   di = xi – yi   di²
               (X = xi)         (Y = yi)
1                 1                3               -2          4
2                 2                1               +1          1
3                 3                4               -1          1
4                 4                2               +2          4
5                 5                6               -1          1
6                 6                9               -3          9
7                 7                8               -1          1
8                 8               10               -2          4
9                 9                5               +4         16
10               10                7               +3          9
Total                                                         50

Now, n = 10 and $\sum_{i=1}^{n} d_i^2 = 50$.

Using the formula

$$
r_S = 1 - \frac{6 \times \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
    = 1 - \frac{6 \times 50}{10(100 - 1)} = 0.697
$$

We can say that there is a high degree of positive correlation between
the performance in mathematics and physics.
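
When the ranks are already given, the formula reduces to a few lines of code. The sketch below is a minimal illustration using only the standard library; the function name `spearman_from_ranks` is our own and it assumes there are no tied ranks.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's rank coefficient for untied ranks: 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


maths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
physics = [3, 1, 4, 2, 6, 9, 8, 10, 5, 7]

rs = spearman_from_ranks(maths, physics)
print(round(rs, 3))  # 0.697
```

Because only the squared differences enter the formula, the sign convention chosen for di does not affect the result.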

6.6.2 RANK CORRELATION WHEN RANKS ARE NOT GIVEN


Example: Find the rank correlation coefficient for the following data.

X:   75   88   95   70   60   80   81   50
Y:  120  134  150  115  110  140  142  100
Solution: Let R1 and R2 denote the ranks in X and Y respectively.

X     Y     R1   R2   d = R1 − R2   d²
75    120    5    5        0         0
88    134    2    4       –2         4
95    150    1    1        0         0
70    115    6    6        0         0
60    110    7    7        0         0
80    140    4    3        1         1
81    142    3    2        1         1
50    100    8    8        0         0
Total                                6

Coefficient of correlation

$$
\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 6}{8(64 - 1)} = +0.93
$$

In this method the biggest item gets the first rank, the next biggest
second rank and so on.
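
The ranking rule just stated (largest value gets rank 1) can be sketched in code as follows. The (X, Y) pairs are taken as they appear in the solution table of the example above; the helper names are hypothetical and the sketch assumes there are no tied values.

```python
def rank_descending(values):
    """Rank 1 goes to the largest value; this sketch assumes no ties."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman(values_x, values_y):
    """Rank both series, then apply 1 - 6*sum(d^2)/(n(n^2-1))."""
    r1 = rank_descending(values_x)
    r2 = rank_descending(values_y)
    n = len(r1)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(r1, r2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


X = [75, 88, 95, 70, 60, 80, 81, 50]
Y = [120, 134, 150, 115, 110, 140, 142, 100]

print(rank_descending(X))        # [5, 2, 1, 6, 7, 4, 3, 8]
print(rank_descending(Y))        # [5, 4, 1, 6, 7, 3, 2, 8]
print(round(spearman(X, Y), 2))  # 0.93
```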

Example: Calculate the coefficient of rank correlation of the following
data:

X:  87  22  35  75  37
Y:  29  63  52  46  48
Solution:

X     Y     R1   R2   d = R1 − R2   d²
87    29     1    5       –4        16
22    63     5    1        4        16
35    52     4    2        2         4
75    46     2    4       –2         4
37    48     3    3        0         0
Total                               40

Coefficient of correlation

$$
\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 40}{5 \times 24} = -1
$$

This shows a perfect negative correlation, i.e. a perfect inverse
correlation.

6.6.3 RANK CORRELATION WHEN EQUAL RANKS ARE GIVEN


When two or more items have the same rank, a correction has to be
applied to ∑di². For example, if the ranks of X are 1, 2, 3, 3, 5, ….,
showing that there are two items with the same 3rd rank and the fourth
rank is skipped, then instead of writing 3, we write 3½ for both. Thus
the sum of these ranks, which is 7 (3 + 4 = 3½ + 3½ = 7), remains the
same, keeping the mean of ranks unaffected. But in such cases the standard
deviation is affected. Therefore, a correction is required for the Rank
Correlation Coefficient. For this, ∑di² is increased by (m³ − m)/12 for
each tie, where m is the number of items in each tie. If there is more
than one group of items with a common rank, this correction factor is
to be added once for each group.
Example: Twelve salesmen are ranked for efficiency and length of
service as below:

Salesman                 A   B   C   D   E   F   G   H   I   J    K    L
Efficiency (X)           1   2   3   4   4   4   7   8   9   10   11   12
Length of Service (Y)    2   1   5   3   9   7   7   6   4   11   10   11

Find the value of Spearman’s Rank Coefficient.
Solution:
The computations of Spearman’s rank correlation are shown below:

Individual   Efficiency (X = xi)   Length of Service (Y = yi)   di = xi – yi   di²
A            1                     2                               -1          1
B            2                     1                                1          1
C            3                     5                               -2          4
D            (4+5+6)/3 = 5         3                                2          4
E            (4+5+6)/3 = 5         9                               -4         16
F            (4+5+6)/3 = 5         (7+8)/2 = 7.5                   -2.5        6.25
G            7                     (7+8)/2 = 7.5                   -0.5        0.25
H            8                     6                                2          4
I            9                     4                                5         25
J            10                    (11+12)/2 = 11.5                -1.5        2.25
K            11                    10                               1          1
L            12                    (11+12)/2 = 11.5                 0.5        0.25
Total                                                                         65

Now, n = 12 and $\sum_{i=1}^{n} d_i^2 = 65$.

Using the formula

$$
\begin{aligned}
r_S &= 1 - \frac{6 \times \left\{ \sum_{i=1}^{n} d_i^2 + \frac{1}{12} \times (3^3 - 3)
        + \frac{1}{12} \times (2^3 - 2) + \frac{1}{12} \times (2^3 - 2) \right\}}{n(n^2 - 1)} \\[4pt]
    &= 1 - \frac{6 \times \{65 + 2 + 0.5 + 0.5\}}{12(144 - 1)} = 0.762
\end{aligned}
$$

We can conclude that there is a high degree of correlation between
efficiency and length of service.
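
The tie correction can be sketched in code as below. This is an illustration under stated assumptions rather than a general-purpose routine: it assumes the input ranks are written as in the example (tied items share the lowest rank of their group and the following ranks are skipped), and the function names are our own.

```python
from collections import Counter


def average_ranks(ranks):
    """Replace tied ranks (e.g. 4, 4, 4) by their average (5, 5, 5).

    A group of m items tied at rank k occupies positions k .. k+m-1,
    so its average rank is k + (m - 1) / 2.
    """
    counts = Counter(ranks)
    return [k + (counts[k] - 1) / 2 for k in ranks]


def spearman_with_ties(ranks_x, ranks_y):
    """Return (rs, sum of d^2, total tie correction)."""
    n = len(ranks_x)
    ax, ay = average_ranks(ranks_x), average_ranks(ranks_y)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(ax, ay))
    # Add (m^3 - m)/12 once for every tied group in either series.
    correction = sum(
        (m ** 3 - m) / 12
        for counts in (Counter(ranks_x), Counter(ranks_y))
        for m in counts.values()
        if m > 1
    )
    rs = 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))
    return rs, sum_d2, correction


efficiency = [1, 2, 3, 4, 4, 4, 7, 8, 9, 10, 11, 12]
service = [2, 1, 5, 3, 9, 7, 7, 6, 4, 11, 10, 11]

rs, sum_d2, correction = spearman_with_ties(efficiency, service)
print(sum_d2, correction, round(rs, 3))  # 65.0 3.0 0.762
```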
Example: An investigation was conducted by a company on the value of
educational and aptitude tests as assessment methods for recruiting
employees. It is the present practice of the company to give recruits
such tests when they apply for posts. The following data give the
educational and aptitude test scores, together with the assessment score
given by the Personnel department of their ability one year after joining
the company. 1 is a low score and 20 is a high score.

Employee   Educational test   Aptitude test   Assessment by officer
A                 9                17                 12
B                10                14                 14
C                15                12                 16
D                14                13                 15
E                16                10                 17
F                11                15                 10
G                12                12                 11
H                17                16                 18

• Rank each set of the data
• Calculate appropriate rank correlation coefficients
Solution: Let X denote the score in the educational test, Y the score in
the aptitude test and Z the assessment by the Personnel officer. Let
d1 = Rx − Rz and d2 = Ry − Rz.

Employee   X    Y    Z    Rx   Ry    Rz   d1    d2     d1²   d2²
A          9    17   12   8    1     6     2    –5      4    25
B          10   14   14   7    4     5     2    –1      4     1
C          15   12   16   3    6.5   3     0    3.5     0    12.25
D          14   13   15   4    5     4     0    1       0     1
E          16   10   17   2    8     2     0    6       0    36
F          11   15   10   6    3     8    –2    –5      4    25
G          12   12   11   5    6.5   7    –2    –0.5    4     0.25
H          17   16   18   1    2     1     0    1       0     1
Total                                                  16   101.5

For the educational test and the assessment score,

$$
\rho_{XZ} = 1 - \frac{6\sum d_1^2}{N(N^2 - 1)} = 1 - \frac{6 \times 16}{8 \times 63} = 0.81
$$

For the aptitude test and the assessment score, the aptitude ranks
contain one tie of m = 2 items, so the correction m(m² − 1)/12 = 0.5 is
added to ∑d2²:

$$
\rho_{YZ} = 1 - \frac{6\left[\sum d_2^2 + \sum \dfrac{m(m^2 - 1)}{12}\right]}{N(N^2 - 1)}
          = 1 - \frac{6 \times (101.5 + 0.5)}{8 \times 63} = -0.214
$$

The rank correlation coefficient between educational test and
assessment score is positive and high and therefore a high educational
test score will correspond to high ability in performance of the job.
The coefficient between aptitude test and assessment score is low and
negative, so the aptitude test score does not appear to indicate later
assessed ability.

Fill in the blanks:

18. The coefficient that is determined from these ranks is known
    as ................... rank coefficient, rs.
19. When two or more items have the same rank, a correction has
    to be applied to ................... .

Collect the data of marks of all the students of your class in any
two subjects. Convert them into ranks and find the rank correlation
between the two subjects.

6.7 CORRELATION COEFFICIENT USING CONCURRENT DEVIATION
This is the easiest method to find the correlation between two
variables. Although the method is effective in giving the direction of
the correlation as positive or negative, it fails to give the accurate
strength of the correlation. In this method we record the change in each
data series from one observation to the next as increasing (+),
decreasing (–) or equal (=). Then we count the number of items for which
both series increase, decrease or remain equal concurrently and denote
it as c. The correlation coefficient is then calculated as,

$$
r = \pm \sqrt{\pm \left( \frac{2 \times c - n}{n} \right)}
$$

Where, n = total number of pairs of deviations (one less than the
number of observations)
c = Number of concurrent changes
Example: The data of advertisement expenditure (X) and sales (Y)
of a company for the past 10-year period is given below. Determine the
correlation coefficient between these variables and comment on the
correlation.

X   50   50   50   40   30   20   20   15   10   5
Y   700  650  600  500  450  400  300  250  210  200
Solution:

S.No.    X     Deviation Sign    Y     Deviation Sign   Concurrent Deviation
1        50        ……            700       ……                ……
2        50        =             650       –                 –
3        50        =             600       –                 –
4        40        –             500       –                 +
5        30        –             450       –                 +
6        20        –             400       –                 +
7        20        =             300       –                 –
8        15        –             250       –                 +
9        10        –             210       –                 +
10        5        –             200       –                 +
Total ∑                                                      6

Therefore,

$$
r = \pm \sqrt{\pm \left( \frac{2 \times c - n}{n} \right)}
  = + \sqrt{+ \left( \frac{2 \times 6 - 9}{9} \right)} = 0.577
$$

The result indicates that there is positive correlation between
advertisement expenditure (X) and sales (Y).
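
The concurrent deviation count is easy to automate. The sketch below is one possible reading of the method, in which "no change" in both series would also count as concurrent; the function names are hypothetical and only the standard library is used.

```python
from math import sqrt

X = [50, 50, 50, 40, 30, 20, 20, 15, 10, 5]
Y = [700, 650, 600, 500, 450, 400, 300, 250, 210, 200]


def sign(delta):
    """Return +1 for a rise, -1 for a fall and 0 for no change."""
    return (delta > 0) - (delta < 0)


def concurrent_deviation(xs, ys):
    """Return (r, c, n) for the concurrent deviation method."""
    signs_x = [sign(b - a) for a, b in zip(xs, xs[1:])]
    signs_y = [sign(b - a) for a, b in zip(ys, ys[1:])]
    n = len(signs_x)                       # pairs of deviations
    c = sum(1 for p, q in zip(signs_x, signs_y) if p == q)
    ratio = (2 * c - n) / n
    # The same sign is used inside and outside the radical.
    r = sqrt(abs(ratio))
    return (r if ratio >= 0 else -r), c, n


r, c, n = concurrent_deviation(X, Y)
print(c, n, round(r, 3))  # 6 9 0.577
```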

Fill in the blank:

20. We count the number of items that increase or decrease or
    remain equal ................... and denote as c.

Collect the data of heights and weights of all the boys in your class.
Find the correlation coefficient using concurrent deviation method
between the variables height and weight.

Note:
1. The sign ± is selected to make the value of ±(2 × c − n)/n under
   the radical positive. The same sign is used outside the radical.
2. This method does not give the strength of correlation. The
   method is ad hoc and used only to reduce the effort of tedious
   calculations.


6.8 SUMMARY
• In this chapter the concept of correlation or the association
  between two variables has been discussed. A scatter plot of the
  variables may suggest that the two variables are related but
  the value of the Pearson correlation coefficient r quantifies this
  association.
• Correlation is a degree of linear association between two random
  variables. We do not differentiate these two variables as dependent
  and independent variables. It may be the case that one is the cause
  and the other is an effect, i.e. independent and dependent variables
  respectively. On the other hand, both may be dependent on a third
  variable.
• In business, correlation analysis often helps managers to take
  decisions by estimating the effects of changing the values of the
  decision variables like promotion, advertising, price and production
  processes on the objective parameters like costs, sales, market
  share, consumer satisfaction and competitive price. The decision
  becomes more objective by removing subjectivity to a certain
  extent.
• The correlation coefficient r may assume values between –1 and
  1. The sign indicates whether the association is direct (+ve) or
  inverse (–ve). A numerical value of r equal to unity indicates
  perfect association while a value of zero indicates no association.
• The correlation is said to be positive when the increase
  (decrease) in the value of one variable is accompanied by an
  increase (decrease) in the value of the other variable also. Negative
  or inverse correlation refers to the movement of the variables
  in opposite directions. Correlation is said to be negative if an
  increase (decrease) in the value of one variable is accompanied
  by a decrease (increase) in the value of the other.
• In simple correlation the variation is between only two variables
  under study and the variation is hardly influenced by any external
  factor. In other words, if one of the variables remains the same,
  there won’t be any change in the other variable.
• In case of multiple correlation analysis there are two approaches
  to study the correlation. In case of partial correlation, we study
  the variation of two variables while excluding the effects of other
  variables by keeping them under controlled conditions.
• When the amount of change in one variable tends to keep a
  constant ratio to the amount of change in the other variable, then
  the correlation is said to be linear. But if the amount of change
  in one variable does not bear a constant ratio to the amount of
  change in the other variable then the correlation is said to be
  non-linear.
• Correlation analysis may also be necessary to eliminate a variable
  which shows low or hardly any correlation with the variable
  of our interest. In statistics, there are a number of measures to
  describe the degree of association between variables. These are Karl
  Pearson’s Correlation Coefficient, Spearman’s rank correlation
  coefficient, coefficient of determination, Yule’s coefficient of
  association, coefficient of colligation, etc.
• The correlation coefficient measures the degree of association
  between two variables X and Y.
• Karl Pearson’s formula for the correlation coefficient is given as,

$$
r = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}
  = \frac{\frac{1}{n}\sum (X - \bar{X})(Y - \bar{Y})}{\sigma_X \sigma_Y}
$$

• When the data are available only as ranks, the purpose of computing
  a correlation coefficient is to determine the extent to which the
  two sets of ranking are in agreement. The coefficient that is
  determined from these ranks is known as Spearman’s rank coefficient,
  rs. This is defined by the following formula:

$$
r_S = 1 - \frac{6 \times \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
$$

  Where, n = Number of observation pairs
      di = Xi – Yi
      Xi = Values of variable X and Yi = values of variable Y
• Although the concurrent deviation method is effective in giving
  the direction of the correlation as positive or negative, it fails
  to give the accurate strength of the correlation. In this method
  we check the fluctuation in each data series as increasing (+),
  decreasing (–) or equal values. Then we count the number of
  items that increase or decrease or remain equal concurrently
  and denote it as c. The correlation coefficient is then calculated as,

$$
r = \pm \sqrt{\pm \left( \frac{2 \times c - n}{n} \right)}
$$

  Where, n = total number of pairs of deviations.
      c = Number of concurrent changes

• Correlation: Correlation is a degree of linear association
  between two random variables. We do not differentiate these two
  variables as dependent and independent variables.
• Positive Correlation: The correlation is said to be positive
  when the increase (decrease) in the value of one variable is
  accompanied by an increase (decrease) in the value of the other
  variable also.
• Negative Correlation: Correlation is said to be negative if an
  increase (decrease) in the value of one variable is accompanied
  by a decrease (increase) in the value of the other.
• Linear Correlation: When the amount of change in one
  variable tends to keep a constant ratio to the amount of change
  in the other variable, then the correlation is said to be linear.
• Non-linear Correlation: When the amount of change in one variable
  does not bear a constant ratio to the amount of change in the
  other variable, the correlation is said to be non-linear.
• Coefficient of Correlation: The correlation coefficient
  measures the degree of association between two variables X
  and Y.
• Scatter Diagram: The pattern of points obtained by plotting
  the observed points is known as a scatter diagram.

6.9 DESCRIPTIVE QUESTIONS


1. Define correlation. Explain the meaning with the help of an
   example.
2. How can a study of correlation help managers in business?
3. What are different types of correlation? Explain with example.
4. How will you find out correlation between two variables by
   scatter diagram method?
5. What are different methods of calculating correlation?
6. How do you calculate Karl Pearson’s correlation coefficient?
   Give its different formulas. What is the effect of shifting origin
   and change of scale on correlation coefficient?
7. What are the assumptions underlying Karl Pearson’s correlation
   coefficient?
8. How do you interpret coefficient of correlation?
9. How do you calculate rank correlation coefficient when ranks are
   given and when equal ranks are given? Explain with examples.
10. What is the formula for finding out correlation coefficient using
    concurrent deviations?

EXERCISE FOR PRACTICE


1. Calculate the coefficient of correlation between advertisement cost
   and sales as per the data given below:

   Advertisement cost (in ₹ ’000)   39  65  62  90  82  75  25  98  36  78
   Sales (in ₹ lakh)                47  53  58  86  62  68  60  91  51  84

2. The means and standard deviations of marks in Statistics and
   Economics are given in the table below. The correlation coefficient
   between marks in Statistics and Economics is 0.8. Estimate the marks
   in Statistics of a student who scored 50 marks in Economics.

                         Marks in Statistics   Marks in Economics
   Mean                          55                    48
   Standard Deviation             4                     5

3. Calculate the coefficient of correlation between X and Y as per the
   data given below:

   X   14  16  20  22  28  30  34  40  45
   Y   97  89  68  65  56  50  37  18  12

4. Ten competitors in a beauty contest are ranked by three judges
   in the following order. Determine which pair of judges has the
   nearest approach to common taste in beauty.

   Judge 1:   1   6   5   10   3   2    4   9    7   8
   Judge 2:   3   5   8   4    7   10   2   1    6   9
   Judge 3:   6   4   9   8    1   2    3   10   5   7

5. Ten candidates obtained the following marks in examinations in
   Statistics and Mathematics. Find the rank correlation coefficient
   to determine whether these results support the suggestion that
   ability in one subject is associated with ability in the other.

   Candidate    A   B   C   D   E   F   G   H   I   J
   Statistics   40  65  61  49  53  42  68  57  58  46
   Maths        51  58  67  55  76  45  69  56  73  63

6.10 ANSWERS AND HINTS


ANSWERS FOR SELF ASSESSMENT QUESTIONS
Topic                                     Q. No.   Answers
Introduction                                1.     Covariation
                                            2.     Degree
                                            3.     Correlation
Types of Correlation                        4.     Positive
                                            5.     Negative
                                            6.     Linear
                                            7.     Nonlinear
Methods of Calculating Correlation          8.     True
                                            9.     True
Scatter Diagram Method                     10.     Two
                                           11.     Scatter diagram
                                           12.     Straight
Co-variance Method – The Karl              13.     Coefficient
Pearson’s Correlation Coefficient
                                           14.     Covariance
                                           15.     Origin
                                           16.     Less
                                           17.     More
Rank Correlation Method                    18.     Spearman’s
                                           19.     ∑di²
Correlation coefficient using              20.     Concurrently
concurrent deviation
HINTS FOR DESCRIPTIVE QUESTIONS
1. Refer Section 6.1
   Correlation is a degree of linear association between two random
   variables. In these two variables, we do not differentiate them
   as dependent and independent variables. It may be the case
   that one is the cause and other is an effect i.e. independent and
   dependent variables respectively. On the other hand, both may
   be dependent variables on a third variable.
2. Refer Section 6.1
   The study of correlation helps managers in following ways:
   (a) To identify relationship of various factors and decision
       variables.
   (b) To estimate value of one variable for a given value of other
       if both are correlated. E.g. estimating sales for a given
       advertising and promotion expenditure.
3. Refer Section 6.2
   The correlation can be studied as positive and negative, simple
   and multiple, partial and total, linear and non linear. Further the
   method to study the correlation is plotting graphs on x-y axis or
   by algebraic calculation of coefficient of correlation.
4. Refer Section 6.4
Scatter diagram is the most fundamental graph plotted to show the
relationship between two variables. It is a simple way to represent a
bivariate distribution, i.e. the distribution of two random variables.
One variable is plotted along the X axis and the other along the Y
axis. Thus, every data pair (xi, yi) is represented by a point on the
graph, x being the abscissa and y being the ordinate of the point.
5. Refer Section 6.3
Correlation analysis may also be necessary to eliminate a variable
which shows low or hardly any correlation with the variable
of our interest. In statistics, there are number of measures to
describe degree of association between variables.
These are Karl Pearson’s Correlation Coefficient, Spearman’s
rank correlation coefficient, coefficient of determination, Yule’s
coefficient of association, coefficient of colligation, etc.

6. Refer Section 6.5
Karl Pearson’s formula for correlation coefficient is given as,
Covx.cov y
r=
s Xs Y
1
∑ ( X − X )(Y − Y )
r= n
s Xs Y
Where r is the ‘Correlation Coefficient’ or ‘Product Moment
Correlation Coefficient’ between X and Y. sX and sY are the
standard deviations of X and Y respectively. ‘n’ is the number of
the pairs of variables X and Y in the given data.
7. Refer Section 6.5.1
The assumptions underlying Karl Pearson’s correlation
coefficient are as follows:
(a) Your data on both variables is measured on either an Interval
Scale or a Ratio Scale.
(b) The traits you are measuring are normally distributed in the
population.
8. Refer Section 6.5.2
The correlation coefficient r ranges from −1 to 1. A value
of 1 implies that a linear equation describes the relationship
between X and Y perfectly, with all data points lying on a line for
which Y increases as X increases. A value of −1 implies that all
data points lie on a line for which Y decreases as X increases. A
value of 0 implies that there is no linear correlation between the
variables.
9. Refer Section 6.6
The purpose of computing a correlation coefficient in such
situations is to determine the extent to which the two sets of
rankings are in agreement. The coefficient that is determined
from these ranks is known as Spearman’s rank coefficient, rs.
It is defined by the following formula:
$$r_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
where n = number of observation pairs,
di = Xi − Yi,
Xi = rank of the i-th observation on variable X and
Yi = rank of the i-th observation on variable Y.
10. Refer Section 6.7
In this method we mark the fluctuation in each data series as
increasing (+), decreasing (−) or equal. Then we count the number of
items that increase, decrease or remain equal concurrently in both
series and denote it as c. The correlation coefficient is then
calculated as,
$$r = \pm\sqrt{\pm\left(\frac{2c - n}{n}\right)}$$
Where, n = total number of pairs.
c = Number of concurrent changes
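
The three coefficients described in hints 6, 9 and 10 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's own procedure: the function names and the sample figures are hypothetical, the Spearman function assumes its inputs are already ranks with no ties, and the concurrent-deviation function assumes n counts the pairs of successive changes (one fewer than the number of observations).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Karl Pearson's r = cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

def spearman_rs(ranks_x, ranks_y):
    """Spearman's rank coefficient; inputs are ranks without ties."""
    n = len(ranks_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

def concurrent_deviation_r(xs, ys):
    """Coefficient of concurrent deviations from the signs of successive changes."""
    def signs(values):
        # +1 for an increase, -1 for a decrease, 0 for no change
        return [(b > a) - (b < a) for a, b in zip(values, values[1:])]
    sx, sy = signs(xs), signs(ys)
    n = len(sx)                                   # pairs of deviations
    c = sum(1 for p, q in zip(sx, sy) if p == q)  # concurrent changes
    ratio = (2 * c - n) / n
    # The sign of the result follows the sign of (2c - n) / n.
    return sqrt(ratio) if ratio >= 0 else -sqrt(-ratio)

# Hypothetical figures, chosen only to illustrate the calculations.
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
rs = spearman_rs([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
rc = concurrent_deviation_r([10, 12, 11, 15, 18, 17], [5, 7, 6, 8, 7, 6])
print(round(r, 4), round(rs, 4), round(rc, 4))
```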

ANSWERS FOR EXERCISE FOR PRACTICE


1. 0.78041
2. 56.28
3. −0.99863
4. The first and third judges have the nearest approach to common
taste in beauty, because the coefficient of correlation is highest
between them.
5. 0.6
IM


CHAPTER 7

REGRESSION ANALYSIS

CONTENTS
7.1 Introduction
7.2 Regression Analysis
7.2.1 Applicability of Regression Analysis
7.3 Simple Linear Regression
7.3.1 Simple Linear Regression Model
7.3.2 Linear Regression Equation
7.4 Coefficient of Regression
7.5 Non-linear Regression Models
7.6 Correlation Analysis vs Regression Analysis
7.7 Summary
7.8 Descriptive Questions
7.9 Answers and Hints
7.10 Suggested Readings for Reference

INTRODUCTORY CASELET

PREGNANCY

A woman in the first trimester of pregnancy is greatly concerned
about the environmental factors surrounding her pregnancy and asks
her doctor what impact they might have on her unborn child. The
doctor makes a “point estimate”, based on a regression model, that
the child will have an IQ of 75. It is highly unlikely that her child
will have an IQ of exactly 75, as there is always error in the
regression procedure. Error may be incorporated into the information
given to the woman in the form of an “interval estimate.” For example,
it would make a great deal of difference if the doctor were to say
that the child had a ninety-five percent chance of having an IQ
between 70 and 80, in contrast to a ninety-five percent chance of an
IQ between 50 and 100. The concept of error in prediction will become
an important part of the discussion of regression models.

It is also worth pointing out that regression models do not make
decisions for people. Regression models are a source of information
about the world. In order to use them wisely, it is important to
understand how they work.

After studying this chapter, you should be able to:

• Understand the concept of regression analysis
• Discuss the applicability of regression
• Describe the simple linear regression and nonlinear regression
  models
• Learn about the coefficient of regression and linear regression
  equations

7.1 INTRODUCTION
The word regression was first used as a statistical concept in 1877 by
Francis Galton. When more than one variable is used for prediction,
the term multiple regression is used. In regression analysis we
develop an equation, called an estimating equation, that relates the
known and unknown variables. Correlation analysis is then used to
determine the degree of the relationship between the variables.
Using the chi-square test we can find whether there is any relationship
between the variables. Correlation and regression analysis show how
to determine the nature and strength of that relationship. In this
chapter we will learn how to calculate the regression line
mathematically.

7.2 REGRESSION ANALYSIS


We need a statistical model that will extract information from
the given data to establish the regression relationship between the
independent and dependent variables. The model should capture the
systematic behaviour of the data. The non-systematic behaviour cannot
be captured and is called error. The error is due to a random
component that cannot be predicted, as well as to any component not
adequately considered in the statistical model. A good statistical
model captures the entire systematic component, leaving only random
errors.
In any model we attempt to capture everything which is systematic
in the data. Random errors cannot be captured in any case. Assuming
the random errors are ‘Normally distributed’, we can specify the
confidence level and interval of the random errors. Thus, our
estimates are more reliable.
If the variables in a bivariate distribution are correlated, the points
in the scatter diagram cluster approximately around some curve. If the
curve is a straight line we call it linear regression. Otherwise, it is
curvilinear regression. The equation of the curve which is closest to
the observations is called the ‘best fit’.
The best fit is calculated as per Legendre’s principle of least sum of
squares of deviations of the observed data points from the
corresponding values on the ‘best fit’ curve. This is called the
minimum squared error criterion. It may be noted that the deviation
(error) can be measured in the X direction or the Y direction.
Accordingly we will get two ‘best fit’ curves. If we measure the
deviation in the Y direction, i.e. for a given xi value of the data
point (xi, yi) we take the corresponding y value on the ‘best fit’
curve and measure the deviation in y, we call it the regression
of Y on X. In the other case, if we measure deviations in the X
direction, we call it the regression of X on Y.

According to Morris Myers Blair, “regression is the measure of the
average relationship between two or more variables in terms of the
original units of the data.”

7.2.1 APPLICABILITY OF REGRESSION ANALYSIS

Regression analysis is one of the most popular and commonly used
statistical tools in business. The availability of computer packages
has simplified its use. However, one must be careful before using this
tool, as it gives only a mathematical measure based on the available
data. It does not check whether a cause-effect relationship really
exists and, if it exists, which is the independent and which is the
dependent variable.

Regression analysis is a branch of statistical theory which is widely
used in all the scientific disciplines. It is a basic technique for
measuring or estimating the relationship among economic variables
that constitute the essence of economic theory and economic life.
The uses of regression analysis are not confined to economic and
business activities. Its applications extend to almost all the
natural, physical and social sciences. Regression analysis helps in the
following ways:
• It provides a mathematical relationship between two or more
  variables. This mathematical relationship can then be used
  for further analysis and treatment of information using more
  complex techniques.
• Since most business analysis and decisions are based on
  cause-effect relationships, regression analysis is a highly valuable
  tool for providing a mathematical model of this relationship.
• The widest use of regression analysis is, of course, estimation and
  forecasting.
• Regression analysis is also used in establishing theories
  based on relationships between various parameters. Some
  common examples are demand and supply, money supply and
  expenditure, inflation and interest rates, promotion expenditure
  and sales, productivity and profitability, health of workers and
  absenteeism, etc.


Fill in the blanks:


1. The word regression was first used as a statistical concept in
1877 by ................... .
2. The best fit is calculated as per ................... principle of least
sum squares of deviations of the observed data points from the
corresponding values on the ‘...................’ curve.
3. ................... is the measure of the average relationship between
two or more variables in terms of the original units of the data.
4. ................... is a basic technique for measuring or estimating
the relationship among economic variables that constitutes
the essence of economic theory and economic life.
5. Most wide use of regression analysis is of course estimation
and ................... .

With the help of a few examples illustrate how regression analysis
helps in business decision making.

The term “regression” means the act of returning or
going back. It was first used by Sir Francis Galton in 1877
when he studied the relationship between the heights of fathers
and sons. His study revealed a very interesting relationship. Tall
fathers tend to have tall sons and short fathers short sons,
but the average height of the sons of a group of tall fathers was
less than that of the fathers, and the average height of the sons of a
group of short fathers was greater than that of the fathers. The line
describing this tendency of going back is called the “Regression Line”.

7.3 SIMPLE LINEAR REGRESSION


This model is used if we have a bivariate distribution, i.e. only two
variables are considered and the ‘best fit’ curve is approximated by
a straight line. It describes the linear relationship between the two
variables. Although it appears to be too simplistic, in many business
situations it is adequate. At the least, the initial study of any
decision-making situation can be based on this model. Then we could
use either other models or some ad hoc methods to cater for the
complexity of the business situation. If the system is found to have
many non-random components, we may have to discard this model and use
some other model. This model assumes the errors are purely due to
randomness and all non-random fluctuations are captured by our
‘best fit’ curve. Thus we can use regression analysis for predicting
the dependent variable for a given value of the independent variable,
for controlling the independent variable to get the desired results,
or for explaining the relationship so that predictions are reliable.

7.3.1 SIMPLE LINEAR REGRESSION MODEL


The linear regression model uses a straight-line relationship. The
equation of a straight line is of the form,

$$\hat{y} = \alpha + \beta x \qquad (1)$$

where ŷ is the predicted value of Y corresponding to x, and α and β
are constants. If we assume the error (deviation) in the Y direction
is ε, we can write the relationship between X and Y at the data
points as,

$$y = \alpha + \beta x + \epsilon$$

The error ε is the amount by which an observation falls off the
regression line, and it is due to randomness. α and β are called the
parameters of the linear regression model; their values are found
from the observed data.

In the case of a nonlinear relationship we use the equation,

$$y = \alpha + \beta x + \delta x^2 + \dots + \epsilon$$

The highest power of x is called the order of the model.

In the model y = α + βx + ε we cannot find ε, since it changes from
observation to observation. The values of α and β, however, are fixed.
To know the exact values of α and β we would need to know all values
of the population, which is not usually feasible. Further, if we knew
the entire population, regression analysis would not have much utility.
Thus, we can only find estimates of α and β from the sample data or
past data. We denote these estimates by a and b.
If we fit a straight line to scattered data points, obviously some of
the points will be above the line and some below. The deviation of
each point from the line is called the error. We want the error to be
as small as possible. The least squares criterion is the one most
commonly used. In this case we minimize the sum of the squares of the
errors. We could also use a criterion such as the minimum sum of
absolute deviations, but the least squares criterion is preferred
because:
• It is simple to interpret.
• It is easy to treat mathematically.
• The quality of fit and confidence intervals can be easily
  stated.

7.3.2 LINEAR REGRESSION EQUATION


Suppose the data points are (x1, y1), (x2, y2), ..., (xn, yn). Then we
can write from the regression equation,

$$y_i = a + b x_i + \epsilon_i, \qquad i = 1, 2, \dots, n \qquad (2)$$

Or,

$$\epsilon_i = y_i - a - b x_i$$

Thus, the sum of squares of errors is,

$$S = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$

To have the minimum sum of squares of errors (SSE) we must have the
condition,

$$\frac{\partial S}{\partial a} = \frac{\partial S}{\partial b} = 0$$

Or,

$$-2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0$$

And,

$$-2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0$$

Thus we obtain two linear equations in a and b. These are,

$$a \, n + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \qquad (3)$$

$$a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i \qquad (4)$$

These two equations are called the ‘Normal Equations’. By solving
these equations, we get the values of a and b. Note that these values
are estimates of α and β. Alternatively, dividing (3) by n we get,

$$a + \frac{b \sum x_i}{n} = \frac{\sum y_i}{n}$$

$$\text{Or, } a + b\bar{X} = \bar{Y} \qquad (5)$$

$$\text{Or, } a = \bar{Y} - b\bar{X} \qquad (6)$$

Substituting (6) in (4) and dividing it by n we get,

$$b = \frac{\frac{1}{n}\sum x_i y_i - \bar{X}\,\bar{Y}}{\frac{1}{n}\sum x_i^2 - \bar{X}^2} \qquad (7)$$

We denote b as bYX only to indicate that it is the regression of Y on
X. bYX is called the Regression Coefficient.
Now the equation of the regression line is,

$$\hat{y} = a + b_{YX}\, x$$

Subtracting equation (5) we get

$$(\hat{y} - \bar{Y}) = b_{YX}\,(x - \bar{X}) \qquad (8)$$

And

$$b_{YX} = \frac{\operatorname{cov}(X, Y)}{\sigma_X^2} = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i}{n} \times \frac{\sum_{i=1}^{n} y_i}{n}}{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{\sum_{i=1}^{n} x_i}{n}\right)^2} \qquad (9)$$
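
Equations (6) and (7) translate directly into code. The following is a minimal Python sketch, not part of the original text: the function name `fit_line` and the four data points are hypothetical and serve only to show that the estimates satisfy the normal equations (3) and (4).

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for y = a + b*x via equations (6) and (7)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator and denominator of equation (7)
    s_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    s_xx = sum(x * x for x in xs) / n - mean_x ** 2
    b = s_xy / s_xx
    a = mean_y - b * mean_x          # equation (6)
    return a, b

# Hypothetical data points
xs = [1, 2, 3, 4]
ys = [2, 3, 5, 4]
a, b = fit_line(xs, ys)

# Both normal equations should hold for the fitted values.
lhs3 = a * len(xs) + b * sum(xs)                     # left side of (3)
lhs4 = a * sum(xs) + b * sum(x * x for x in xs)      # left side of (4)
print(a, b, lhs3, lhs4)
```

Solving for a and b through the means, rather than inverting the 2 × 2 system directly, mirrors the route taken in the derivation above.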

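For finding the regression equation of X on Y we follow a similar
procedure and get the regression line equation as

$$(\hat{x} - \bar{X}) = b_{XY}\,(y - \bar{Y}) \qquad (10)$$

With

$$b_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_Y^2} = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i}{n} \times \frac{\sum_{i=1}^{n} y_i}{n}}{\frac{1}{n}\sum_{i=1}^{n} y_i^2 - \left(\frac{\sum_{i=1}^{n} y_i}{n}\right)^2} \qquad (11)$$

Further, the covariance of (X, Y) is,

$$\operatorname{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y}) = \frac{1}{n}\sum_{i=1}^{n} (x_i y_i - x_i \bar{Y} - \bar{X} y_i + \bar{X}\bar{Y})$$

$$= \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{Y}\,\frac{\sum_{i=1}^{n} x_i}{n} - \bar{X}\,\frac{\sum_{i=1}^{n} y_i}{n} + \bar{X}\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{Y}\bar{X} - \bar{X}\bar{Y} + \bar{X}\bar{Y}$$

$$= \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{X}\bar{Y} \qquad (12)$$

Also, the variance of X is,

$$\operatorname{var}(X) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{X})^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i^2 - 2 x_i \bar{X} + \bar{X}^2)$$

$$= \frac{1}{n}\sum_{i=1}^{n} x_i^2 - 2\bar{X}\,\frac{\sum_{i=1}^{n} x_i}{n} + \frac{\bar{X}^2}{n}\sum_{i=1}^{n} 1 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - 2\bar{X}^2 + \bar{X}^2$$

$$= \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{X}^2 \qquad (13)$$

Substituting (12) and (13) in (7),

$$b_{YX} = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)} = \frac{\operatorname{cov}(X, Y)}{\sigma_X^2} \qquad (14)$$

Further, we note that

$$\operatorname{cov}(X, Y) = b_{YX}\,\sigma_X^2 = b_{XY}\,\sigma_Y^2$$

Also, the correlation coefficient is,

$$r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$

Thus we get,

$$b_{YX} = r\,\frac{\sigma_Y}{\sigma_X} \quad \text{and} \quad b_{XY} = r\,\frac{\sigma_X}{\sigma_Y} \qquad (15)$$

$$r^2 = b_{YX}\, b_{XY}$$

$$\text{Thus, } r = \pm\sqrt{b_{YX}\, b_{XY}} \qquad (16)$$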
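
Relations (14) to (16) can be checked numerically. The sketch below is an illustration only; the function name `regression_coefficients` and the five data pairs are hypothetical. It computes both regression coefficients and r from the covariance and variances, so that the identity r² = bYX × bXY can be confirmed.

```python
from math import sqrt

def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy, r) using cov(X, Y), var(X) and var(Y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y   # equation (12)
    var_x = sum(x * x for x in xs) / n - mean_x ** 2                 # equation (13)
    var_y = sum(y * y for y in ys) / n - mean_y ** 2
    b_yx = cov / var_x               # regression of Y on X, equation (14)
    b_xy = cov / var_y               # regression of X on Y
    r = cov / sqrt(var_x * var_y)    # correlation coefficient
    return b_yx, b_xy, r

# Hypothetical data pairs
xs = [2, 4, 6, 8, 10]
ys = [1, 3, 2, 5, 4]
b_yx, b_xy, r = regression_coefficients(xs, ys)
print(b_yx, b_xy, r, b_yx * b_xy, r ** 2)
```

The two regression coefficients differ whenever the two variances differ, even though both are built from the same covariance.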


Fill in the blanks:


6. We can use the regression analysis for ................... of dependent
variable for a given value of independent variable.
7. The linear regression model uses ................... line relationship.
8. The highest power of x is called as ................... of the model.

Discuss the practical importance and use of the two regression lines.
When do we use one in preference to the other?
Regression refers to the average relationship between a dependent
variable and one or more independent variables. Such a relationship
is generally expressed by a line of regression drawn by the method
of “Least Squares”. This line of regression can be drawn
graphically or derived algebraically with the help of regression
equations. According to Tom Cars, before the equation of the least
squares line can be determined, some criterion must be established as
to what conditions the best line should satisfy. The condition usually
stipulated in regression analysis is that the sum of the squares of
the deviations of the observed Y values from the fitted line shall be
a minimum. This is known as the least squares or minimum squared
error criterion. A line fitted by the method of least squares is the
line of best fit.

7.4 COEFFICIENT OF REGRESSION


The coefficients of regression are bYX and bXY. They have the following
implications:
• The slopes of the regression lines of Y on X and X on Y, viz. bYX
  and bXY, must have the same sign (because r² cannot be negative).
• The correlation coefficient is the geometric mean of bYX and bXY.
• If both slopes bYX and bXY are positive, the correlation coefficient
  r is positive. If both bYX and bXY are negative, the correlation
  coefficient r is negative.
• If bYX = 1/bXY, then r = ±1, indicating perfect correlation.
• Both regression lines intersect at the point (X̄, Ȳ).

Properties of Regression Coefficients
• The coefficient of correlation is the geometric mean of the two
  regression coefficients.
• Both regression coefficients have the same sign, i.e. either both
  are positive or both are negative.
• The coefficient of correlation has the same sign as the regression
  coefficients.
• If one of the regression coefficients is greater than unity, the
  other must be less than unity, because the coefficient of
  correlation cannot exceed one in magnitude (−1 ≤ r ≤ 1).
• Regression coefficients are independent of a change of origin
  but not of a change of scale.
• The arithmetic mean of the regression coefficients is never less
  in magnitude than the correlation coefficient.

Solved Examples

Example: The cost of total output in a factory is linearly related to
the number of units manufactured. Data collected for 8 months is as
follows.

Month            1    2    3    4     5     6    7    8
X (’000 units)   2    3    1    2.5   3.5   4    5    5.5
Y (₹ ’000)       15   16   13   15    17    18   19   20

1. Find the best-fit linear relationship of cost Y on units X.
2. Compute the correlation coefficient and assess whether the relation
   can be deemed reasonably valid.
3. Estimate the cost for 13,500 units.
Solution: The calculations are tabulated below.

Month     xi (’000 units)   yi (₹ ’000)   xi²      yi²    xi yi
1         2                 15            4        225    30
2         3                 16            9        256    48
3         1                 13            1        169    13
4         2.5               15            6.25     225    37.5
5         3.5               17            12.25    289    59.5
6         4                 18            16       324    72
7         5                 19            25       361    95
8         5.5               20            30.25    400    110
Total ∑   26.5              133           103.75   2249   465

1. Now,

$$b_{YX} = \frac{\operatorname{cov}(X, Y)}{\sigma_X^2} = \frac{\frac{1}{n}\sum x_i y_i - \frac{\sum x_i}{n} \times \frac{\sum y_i}{n}}{\frac{1}{n}\sum x_i^2 - \left(\frac{\sum x_i}{n}\right)^2}$$

$$\operatorname{cov}(X, Y) = \frac{1}{n}\sum x_i y_i - \frac{\sum x_i}{n} \times \frac{\sum y_i}{n} = \frac{465}{8} - \frac{26.5 \times 133}{8 \times 8} = 58.125 - 55.07 = 3.055$$

And,

$$\sigma_X^2 = \frac{1}{n}\sum x_i^2 - \left(\frac{\sum x_i}{n}\right)^2 = \frac{103.75}{8} - (3.3125)^2 = 12.96875 - 10.973 = 1.99575$$

Therefore, σX = 1.4127

$$\text{Thus, } b_{YX} = \frac{58.125 - 55.07}{12.96875 - 10.973} = 1.53$$

The regression equation is

$$(\hat{y} - \bar{Y}) = b_{YX}\,(x - \bar{X})$$

$$\text{Or, } \left(\hat{y} - \frac{133}{8}\right) = 1.53 \times \left(x - \frac{26.5}{8}\right) \Rightarrow (\hat{y} - 16.625) = 1.53 \times (x - 3.3125)$$

$$\text{Or, } \hat{y} = 1.53\,x + 11.557 \quad \text{(Ans)}$$

2. Now,

$$\sigma_Y^2 = \frac{1}{n}\sum y_i^2 - \left(\frac{\sum y_i}{n}\right)^2 = \frac{2249}{8} - \left(\frac{133}{8}\right)^2 = 281.125 - 276.391 = 4.734$$

Therefore, σY = 2.176

$$\text{Hence, } r = b_{YX} \times \frac{\sigma_X}{\sigma_Y} = 1.53 \times \frac{1.4127}{2.176} = 0.993 \quad \text{(Ans)}$$

Since the correlation coefficient r is close to 1, there is a strong
association. Hence the relation can be deemed reasonably valid.
3. For 13,500 units, x = 13.5. The estimated cost of output (in
₹ ’000) is,

$$\hat{y} = 1.53\,x + 11.557 = 1.53 \times 13.5 + 11.557 = 32.212 \quad \text{(Ans)}$$
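
The arithmetic of this example can be re-run in a few lines. The sketch below is an illustrative check in Python, not part of the original solution; the variable names are hypothetical. Because it carries full precision rather than the rounded intermediate values used above, its results agree with the worked figures only to about two decimal places.

```python
from math import sqrt

units = [2, 3, 1, 2.5, 3.5, 4, 5, 5.5]       # X, in '000 units
cost = [15, 16, 13, 15, 17, 18, 19, 20]      # Y, in '000 rupees

n = len(units)
mean_x = sum(units) / n
mean_y = sum(cost) / n
cov = sum(x * y for x, y in zip(units, cost)) / n - mean_x * mean_y
var_x = sum(x * x for x in units) / n - mean_x ** 2
var_y = sum(y * y for y in cost) / n - mean_y ** 2

b_yx = cov / var_x                 # slope of the regression of Y on X
a = mean_y - b_yx * mean_x         # intercept
r = cov / sqrt(var_x * var_y)      # correlation coefficient
estimate = a + b_yx * 13.5         # cost for 13,500 units, in '000 rupees

print(round(b_yx, 3), round(a, 3), round(r, 3), round(estimate, 3))
```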
Example: The two regression line equations are given as 8x − 10y +
66 = 0 and 40x − 18y − 214 = 0. Find X̄, Ȳ, the two regression
coefficients and the correlation coefficient r.

Solution: The point of intersection of the two regression lines is
(X̄, Ȳ). Hence, solving the two equations, we get the point of
intersection as,

X̄ = 13 and Ȳ = 17

Now if we take the first equation as the regression of Y on X we can
rewrite the equation as,

$$y = \frac{8}{10}\,x + \frac{66}{10}$$

Thus, the regression coefficient is bYX = 8/10.
Similarly, taking the second equation as the regression of X on Y we
can rewrite the equation as,

$$x = \frac{18}{40}\,y + \frac{214}{40}$$

Thus, the regression coefficient is bXY = 18/40.
Now, the correlation coefficient r is given by,

$$r = \sqrt{b_{YX} \times b_{XY}} = \sqrt{\frac{8}{10} \times \frac{18}{40}} = 0.6$$

• If we had taken the first equation as the regression of X on Y and
  the second equation as the regression of Y on X, then the value of
  r² = bYX × bXY = (40/18) × (10/8) ≈ 2.78 would have been greater
  than 1. But we know that r² ≤ 1, since r always lies between −1
  and +1.
• The sign of r when taking the square root follows the signs of bYX
  and bXY. bYX and bXY must both be positive or both be negative;
  it is not possible for them to have opposite signs.
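
The reasoning of this example (intersect the lines to get the means, then choose the assignment of lines that keeps r² ≤ 1) can be sketched as follows. This is an illustrative Python fragment with hypothetical helper names; each line is stored as the coefficients (p, q, c) of p·x + q·y + c = 0.

```python
from math import sqrt

line1 = (8, -10, 66)      # 8x - 10y + 66 = 0
line2 = (40, -18, -214)   # 40x - 18y - 214 = 0

def intersection(l1, l2):
    """Solve the two linear equations by Cramer's rule."""
    (p1, q1, c1), (p2, q2, c2) = l1, l2
    det = p1 * q2 - p2 * q1
    x = (q1 * c2 - q2 * c1) / det
    y = (p2 * c1 - p1 * c2) / det
    return x, y

def slope_y_on_x(line):
    p, q, _ = line
    return -p / q          # y = -(p/q) x - c/q

def slope_x_on_y(line):
    p, q, _ = line
    return -q / p          # x = -(q/p) y - c/p

mean_x, mean_y = intersection(line1, line2)

# Try line1 as Y on X; swap the roles if that would give r^2 > 1.
b_yx, b_xy = slope_y_on_x(line1), slope_x_on_y(line2)
if b_yx * b_xy > 1:
    b_yx, b_xy = slope_y_on_x(line2), slope_x_on_y(line1)

r = sqrt(b_yx * b_xy)
if b_yx < 0:               # r carries the common sign of the coefficients
    r = -r

print(mean_x, mean_y, b_yx, b_xy, r)
```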

Example: Derive regression lines for the following data:
Σx = 30, Σx² = 190, Σxy = 192, Σy = 30, Σy² = 190, n = 5
Solution: Now, from the given data,

$$\bar{X} = \frac{\sum x}{n} = \frac{30}{5} = 6, \qquad \bar{Y} = \frac{\sum y}{n} = \frac{30}{5} = 6$$

$$b_{YX} = \frac{\sum xy - \frac{1}{n}\sum x \sum y}{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2} = \frac{192 - 180}{190 - 180} = 1.2$$

Hence the regression equation of y on x is,

$$\hat{Y} = \bar{Y} + b_{YX}\,(x - \bar{X}) = 6 + 1.2\,(x - 6)$$

$$\hat{Y} = 1.2\,x - 1.2$$
Effect of shifting of origin and change of scale on regression
coefficient bYX

Let the transformation be

$$U = \frac{X - A}{g} \quad \text{and} \quad V = \frac{Y - B}{h}$$

Since cov(U, V) = cov(X, Y)/(gh), var(U) = var(X)/g² and
var(V) = var(Y)/h², the regression coefficient of V on U is,

$$b_{VU} = \frac{g}{h}\, b_{YX}$$

And the regression coefficient of U on V is,

$$b_{UV} = \frac{h}{g}\, b_{XY}$$

Thus we can say that shifting of origin (A, B) does not change the
regression coefficients, whereas a change of scale (g, h) does. With
g = h = 1 the coefficients are unchanged.
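
A quick numerical check of the effect of origin and scale is sketched below. It is an illustration only: the helper name `slope`, the data and the constants A, g, B, h are all hypothetical. It verifies that a pure shift of origin leaves the slope unchanged, and that with U = (X − A)/g and V = (Y − B)/h the slope of V on U equals (g/h) times the slope of Y on X.

```python
def slope(xs, ys):
    """Regression coefficient of ys on xs: cov / var(xs)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    return cov / var_x

# Hypothetical data and transformation constants
xs = [10, 20, 30, 40, 50]
ys = [105, 110, 120, 125, 140]
A, g = 30, 10
B, h = 120, 5

b_yx = slope(xs, ys)

# Shift of origin only: slope is unchanged.
b_shift = slope([x - A for x in xs], [y - B for y in ys])

# Shift of origin and change of scale: slope is multiplied by g / h.
us = [(x - A) / g for x in xs]
vs = [(y - B) / h for y in ys]
b_vu = slope(us, vs)

print(b_yx, b_shift, b_vu, (g / h) * b_yx)
```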
Example: The data below give the transit time in days for a random
sample of 10 consignments, with the related distance.

X, distance (’00 km)     4   5   6   7   9   9   10   11   11   12
Y, transit time (days)   4   5   5   6   7   6   7    8    7    8

1. Find the best-fit linear relationship of transit time on distance.
2. Also estimate the transit time for a new location at a distance of
   800 km.
3. Also compute the correlation coefficient and assess whether the
   relation can be deemed reasonably valid.
4. Find the coefficient of determination R² and explain its
   significance.
Solution: The computation is shown below. We use A = 9 and B = 6, so
that ui = xi − 9 and vi = yi − 6.

Sl. No.   xi   yi   ui   vi   ui²   vi²   ui vi
1         4    4    −5   −2   25    4     10
2         5    5    −4   −1   16    1     4
3         6    5    −3   −1   9     1     3
4         7    6    −2   0    4     0     0
5         9    7    0    1    0     1     0
6         9    6    0    0    0     0     0
7         10   7    1    1    1     1     1
8         11   8    2    2    4     4     4
9         11   7    2    1    4     1     2
10        12   8    3    2    9     4     6
Total ∑   84   63   −6   3    72    17    30

1. Since there is no change of scale, bYX = bVU.

$$\text{Now, } b_{VU} = \frac{\operatorname{cov}(U, V)}{\sigma_U^2} = \frac{\frac{1}{n}\sum u_i v_i - \frac{\sum u_i}{n} \times \frac{\sum v_i}{n}}{\frac{1}{n}\sum u_i^2 - \left(\frac{\sum u_i}{n}\right)^2}$$

$$\operatorname{cov}(U, V) = \frac{30}{10} - \frac{(-6) \times 3}{10 \times 10} = 3 + 0.18 = 3.18$$

$$\text{And, } \sigma_U^2 = \frac{72}{10} - (-0.6)^2 = 7.2 - 0.36 = 6.84$$

Therefore, σU = 2.615

$$\text{Thus, } b_{VU} = \frac{3.18}{6.84} = 0.4649$$

But, bYX = bVU = 0.4649
The regression equation is

$$(\hat{y} - \bar{Y}) = b_{YX}\,(x - \bar{X})$$

$$\text{Or, } (\hat{y} - 6.3) = 0.4649 \times (x - 8.4)$$

$$\text{Or, } \hat{y} = 0.4649\,x + 2.395 \quad \text{(Ans)}$$

2. For 800 km distance, x = 8. Therefore, the estimated transit time
is,

$$\hat{y} = 0.4649 \times 8 + 2.395 = 6.1142 \text{ days}$$

3. Now,

$$\sigma_V^2 = \frac{1}{n}\sum v_i^2 - \left(\frac{\sum v_i}{n}\right)^2 = \frac{17}{10} - (0.3)^2 = 1.7 - 0.09 = 1.61$$

Therefore, σV = 1.2689

$$\text{Hence, } r = b_{VU} \times \frac{\sigma_U}{\sigma_V} = 0.4649 \times \frac{2.615}{1.2689} = 0.958 \quad \text{(Ans)}$$

Since the correlation coefficient r is close to 1, there is a strong
association. Hence the relation can be deemed reasonably valid.
4. The coefficient of determination is R² = r² = (0.958)² ≈ 0.918,
i.e. about 92% of the variation in transit time is explained by the
linear relationship with distance.
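
The coded-value computation for the transit-time data can be reproduced as below. This is an illustrative Python check with hypothetical variable names, not part of the original solution; it works with u = x − 9 and v = y − 6 exactly as in the table and also reports r², the coefficient of determination asked for in part 4.

```python
from math import sqrt

distance = [4, 5, 6, 7, 9, 9, 10, 11, 11, 12]   # X, in hundreds of km
transit = [4, 5, 5, 6, 7, 6, 7, 8, 7, 8]        # Y, in days

A, B = 9, 6                                      # assumed origins, as in the table
us = [x - A for x in distance]
vs = [y - B for y in transit]

n = len(us)
mean_u, mean_v = sum(us) / n, sum(vs) / n
cov_uv = sum(u * v for u, v in zip(us, vs)) / n - mean_u * mean_v
var_u = sum(u * u for u in us) / n - mean_u ** 2
var_v = sum(v * v for v in vs) / n - mean_v ** 2

b_yx = cov_uv / var_u                 # no change of scale, so b_YX = b_VU
mean_x, mean_y = A + mean_u, B + mean_v
a = mean_y - b_yx * mean_x            # intercept of y = a + b x
r = cov_uv / sqrt(var_u * var_v)
r_squared = r ** 2                    # coefficient of determination
estimate_800km = a + b_yx * 8

print(round(b_yx, 4), round(a, 3), round(estimate_800km, 3),
      round(r, 3), round(r_squared, 3))
```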
Example: The owner of a small garment shop is hopeful that his sales
are rising significantly week by week. Treating the sales of the
previous six weeks as a typical example of this rising trend, he
recorded them in ₹ 1000’s and analyzed the results.

Weeks:   1     2     3     4     5     6
Sales:   269   262   280   270   275   281

Fit a linear regression equation to suggest to him the weekly rate at
which his sales are rising, and use this equation to estimate the
expected sales for the 7th week.
Solution: 1. Regression line equation
The calculations are tabulated below. We shift the origin using A = 3
and B = 270, so that ui = xi − 3 and vi = yi − 270.

Sl. No.   xi (weeks)   yi (sales, ₹ ’000)   ui   vi   ui²   vi²   ui vi
1         1            269                  −2   −1   4     1     2
2         2            262                  −1   −8   1     64    8
3         3            280                  0    10   0     100   0
4         4            270                  1    0    1     0     0
5         5            275                  2    5    4     25    10
6         6            281                  3    11   9     121   33
Total ∑   21           1637                 3    17   19    311   53

Since there is no change of scale, bYX = bVU.

$$\text{Now, } b_{VU} = \frac{\operatorname{cov}(U, V)}{\sigma_U^2} = \frac{\frac{1}{n}\sum u_i v_i - \frac{\sum u_i}{n} \times \frac{\sum v_i}{n}}{\frac{1}{n}\sum u_i^2 - \left(\frac{\sum u_i}{n}\right)^2}$$

$$\operatorname{cov}(U, V) = \frac{53}{6} - \frac{3 \times 17}{6 \times 6} = 8.8333 - 1.4166 = 7.4167$$

$$\text{And, } \sigma_U^2 = \frac{19}{6} - (0.5)^2 = 2.9167$$

Therefore, σU = 1.7078

$$\text{Thus, } b_{VU} = \frac{7.4167}{2.9167} = 2.5428$$

But, bYX = bVU = 2.5428

$$\text{Also, } \bar{X} = \frac{\sum x}{n} = \frac{21}{6} = 3.5, \qquad \bar{Y} = \frac{\sum y}{n} = \frac{1637}{6} = 272.83$$

The regression equation is

$$(\hat{y} - \bar{Y}) = b_{YX}\,(x - \bar{X})$$

$$\text{Or, } (\hat{y} - 272.83) = 2.5428 \times (x - 3.5)$$

$$\text{Or, } \hat{y} = 2.5428\,x + 263.9302 \quad \text{(Ans)}$$

The slope shows that sales are rising by about ₹ 2,543 per week.
2. For the 7th week, i.e. x = 7,

$$\text{Expected sales} = \hat{y}(x = 7) = 2.5428 \times 7 + 263.9302 = 281.7298$$

i.e. about ₹ 2,81,730.
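
The fitted trend can also be examined point by point. The sketch below is an illustrative Python check with hypothetical variable names; besides the slope and the week-7 forecast, it lists the residuals, which for a least-squares line with an intercept always sum to zero (a consequence of normal equation (3)).

```python
weeks = [1, 2, 3, 4, 5, 6]
sales = [269, 262, 280, 270, 275, 281]    # in '000 rupees

n = len(weeks)
mean_x = sum(weeks) / n
mean_y = sum(sales) / n
cov = sum(x * y for x, y in zip(weeks, sales)) / n - mean_x * mean_y
var_x = sum(x * x for x in weeks) / n - mean_x ** 2

b = cov / var_x                 # weekly rate of increase in sales
a = mean_y - b * mean_x         # intercept

fitted = [a + b * x for x in weeks]
residuals = [y - f for y, f in zip(sales, fitted)]
forecast_week7 = a + b * 7

print(round(b, 4), round(a, 4), round(forecast_week7, 2))
print([round(e, 2) for e in residuals])
```

The individual residuals are several thousand rupees in size, which is a reminder that a six-point trend gives only a rough forecast.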

Example: Using the following information, obtain the line of regression
of average defective parts delivered (in hundred units) y on average
expenditure incurred on inspection (in ₹ thousands) x:

Σx = 424, Σy = 363, Σxy = 12815, Σx² = 21926, Σy² = 15123, n = 10

From the regression equation you get, estimate the number of
defective parts delivered when expenditure on inspection amounts
to ₹ 28,000.
Solution: Now, from the given data,

$$\bar{X} = \frac{\sum x}{n} = \frac{424}{10} = 42.4, \qquad \bar{Y} = \frac{\sum y}{n} = \frac{363}{10} = 36.3$$

$$b_{YX} = \frac{\sum xy - \frac{1}{n}\sum x \sum y}{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2} = \frac{12815 - 42.4 \times 363}{21926 - 42.4 \times 424} = -0.6525$$

Hence the regression equation of y on x is,

$$\hat{Y} = \bar{Y} + b_{YX}\,(x - \bar{X}) = 36.3 - 0.6525\,(x - 42.4)$$

$$\hat{Y} = 63.966 - 0.6525\,x$$

Now, for an expenditure of ₹ 28,000, x = 28 and the number of
defectives (in hundred units) is,

$$\hat{Y} = 63.966 - 0.6525 \times 28 = 45.696$$

Thus, the number of defectives is 4569.6, i.e. about 4570 parts.
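
When only the summary totals are given, as in this example, the regression line follows from the sums alone. The sketch below is an illustration; the function name `line_from_sums` and its parameter names are hypothetical.

```python
def line_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (a, b) for y = a + b*x from the summary totals."""
    mean_x = sum_x / n
    mean_y = sum_y / n
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = mean_y - b * mean_x
    return a, b

a, b_yx = line_from_sums(n=10, sum_x=424, sum_y=363, sum_xy=12815, sum_x2=21926)

# Expenditure of 28 (thousand rupees); result is in hundreds of parts.
defectives_hundreds = a + b_yx * 28
print(round(b_yx, 4), round(a, 3), round(defectives_hundreds * 100))
```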

Fill in the blanks:


9. Correlation coefficient is ................... mean of bYX and bXY.
10. Both regression lines ................... at point (X̄, Ȳ).
11. The coefficient of correlation and the regression coefficients
will also have ................... sign.

Visit any organisation of your choice and carry out a regression
analysis of the profit and expenditure of any one of its departments.
Also study its P/L accounts for the last ten years and forecast its
profits for the next two years, based on the previous data, using
regression analysis.

7.5 NON-LINEAR REGRESSION MODELS


So far we have assumed a linear relationship between two variables.
Thus, we attempted to fit a straight line to a given set of data points.
When the errors involved in a straight-line approximation are too
high, we use polynomials of higher degree to achieve a smoother and
better approximation. The least squares principle can also be applied
to the fitting of a second degree polynomial, which may be useful in
a business situation if we have some idea that the relationship between
the two variables is parabolic. In any case, a second degree polynomial
fit is likely to be a better approximation of the actual relationship.
We may use a second order model (parabolic trend) if we feel that the
variation is parabolic. Here we will discuss only one nonlinear model,
i.e. the polynomial of second degree.
Second Degree Model
Just to demonstrate the theoretical similarity of the linear (first
degree) and parabolic (second degree) models, we will describe the
normal equations. In this case the regression equation is,

$$\hat{y} = \beta_0 + \beta_1 X + \beta_2 X^2$$

And the value of the dependent variable is,

$$y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon$$

With estimates of the constants as a, b and c we can write the
regression curve as,

$$\hat{y} = a + b x + c x^2 \qquad (17)$$

To determine the three unknown constants, using the least squares fit
we get three normal equations as,

$$\sum_{i=1}^{n} y_i = a \, n + b \sum_{i=1}^{n} x_i + c \sum_{i=1}^{n} x_i^2$$

$$\sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 + c \sum_{i=1}^{n} x_i^3$$

$$\sum_{i=1}^{n} x_i^2 y_i = a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i^3 + c \sum_{i=1}^{n} x_i^4$$

Solving these simultaneous equations, we can find the coefficients
a, b and c. The data points are (x1, y1), (x2, y2), ..., (xn, yn), so
the values of xi and yi are known. After putting in those values, we
get three equations in three unknowns. Solving those equations we get
the values of a, b and c. On substituting these values in equation (17)
we get the approximating polynomial for the (xi, yi) data points. The
least squares approximation works with any type of data.
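
The three normal equations form a 3 × 3 linear system that can be solved mechanically. The following is a minimal pure-Python sketch, assuming Gauss-Jordan elimination as the solution method (the text does not prescribe one); the function name `fit_parabola` and the test data, which lie exactly on y = 1 + 2x + 3x², are hypothetical.

```python
def fit_parabola(xs, ys):
    """Solve the three normal equations for y = a + b*x + c*x**2."""
    n = len(xs)
    sx = sum(xs)
    sx2 = sum(x ** 2 for x in xs)
    sx3 = sum(x ** 3 for x in xs)
    sx4 = sum(x ** 4 for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x * x * y for x, y in zip(xs, ys))

    # Augmented matrix of the normal equations; unknowns are (a, b, c).
    m = [
        [n, sx, sx2, sy],
        [sx, sx2, sx3, sxy],
        [sx2, sx3, sx4, sx2y],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [v - factor * p for v, p in zip(m[row], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))

# Hypothetical points lying exactly on y = 1 + 2x + 3x^2
xs = [0, 1, 2, 3]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
a, b, c = fit_parabola(xs, ys)
print(a, b, c)
```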

Example: Fit a parabolic curve of second degree to the data given
below, estimate the value for 1990 and comment on it.

Year                 1984   1985   1986   1987   1988
Sales in million ₹   10     12     13     10     8

Solution: We use the normal equations for second degree regression.
Shifting the origin does not change the regression coefficients. It
only shifts the regression curve. Let the origin be shifted to
(1986, 10). Hence in the normal equations we replace xi by (xi − 1986)
and yi by (yi − 10). The calculations are shown in the following table.

Year X   Sales Y (in million ₹)   xi   yi   xi yi   xi²   xi² yi   xi³   xi⁴
1984     10                       −2   0    0       4     0        −8    16
1985     12                       −1   2    −2      1     2        −1    1
1986     13                       0    3    0       0     0        0     0
1987     10                       1    0    0       1     0        1     1
1988     8                        2    −2   −4      4     −8       8     16
Total                             0    3    −6      10    −6       0     34

Now, using the normal equations,

$$a \, n + b \sum x_i + c \sum x_i^2 = \sum y_i \;\Rightarrow\; 5a + 10c = 3 \qquad (18)$$

$$a \sum x_i + b \sum x_i^2 + c \sum x_i^3 = \sum x_i y_i \;\Rightarrow\; 10b = -6 \qquad (19)$$

$$a \sum x_i^2 + b \sum x_i^3 + c \sum x_i^4 = \sum x_i^2 y_i \;\Rightarrow\; 10a + 34c = -6 \qquad (20)$$

Solving the simultaneous equations (18), (19) and (20) we get,
a = 2.314, b = −0.6 and c = −0.857
Hence the regression equation is,

$$\hat{y} - 10 = 2.314 - 0.6\,x - 0.857\,x^2$$

$$\text{Or, } \hat{y} = 12.314 - 0.6\,x - 0.857\,x^2 \quad \text{(Ans)}$$

Now, for X = 1990 ⇒ x = 4
Sales in million ₹ is

$$\hat{y} = 12.314 - 0.6\,x - 0.857\,x^2 = 12.314 - 2.4 - 13.712 = -3.798$$

Hence the prediction of sales for 1990 is −3.798 million ₹. Now
obviously the sales cannot be negative, so we predict that the sales
would be very low or near zero. However, it may be noted that with
data of just five years, a forecast two periods ahead may not be
realistic.
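
Because the shifted x values are symmetric about zero, Σx and Σx³ vanish and the normal equations decouple, which is why (19) gives b on its own. The sketch below is an illustrative Python check with hypothetical variable names; it solves (18) and (20) by Cramer's rule and carries full precision, so the 1990 figure comes out as −3.8 rather than the rounded −3.798.

```python
years = [1984, 1985, 1986, 1987, 1988]
sales = [10, 12, 13, 10, 8]               # in million rupees

xs = [year - 1986 for year in years]      # origin shifted to 1986
ys = [s - 10 for s in sales]              # origin shifted to 10

n = len(xs)
sx2 = sum(x ** 2 for x in xs)
sx4 = sum(x ** 4 for x in xs)
sy = sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))
sx2y = sum(x * x * y for x, y in zip(xs, ys))

# Equation (19): sum(x) and sum(x^3) are zero, so b stands alone.
b = sxy / sx2

# Equations (18) and (20): a 2 x 2 system in a and c (Cramer's rule).
det = n * sx4 - sx2 * sx2
a = (sy * sx4 - sx2 * sx2y) / det
c = (n * sx2y - sx2 * sy) / det

x_1990 = 1990 - 1986
forecast_1990 = 10 + a + b * x_1990 + c * x_1990 ** 2   # undo the shift in y
print(round(a, 3), round(b, 3), round(c, 3), round(forecast_1990, 3))
```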

Other Regression Models


So far we have assumed a linear or parabolic relationship between two variables. Thus, we attempted to fit a straight line or a parabola to a given set of data points. A similar principle can also be applied to the fitting of a variety of other functions, which may be useful in business situations if we have some idea about the type of relationship between the two variables. We could use a general polynomial as,
$$ y = b_0 + b_1 X + b_2 X^2 + \dots + \epsilon $$

The least square approximation can be calculated easily for low degree polynomials, such as linear, parabolic and cubic. But for higher degrees (more than three), the system of normal equations becomes ill-conditioned. This causes large errors in the values of the coefficients, and the approximation becomes unreliable. To avoid these problems, 'orthogonal polynomials' are used for approximation.

Orthogonal polynomials determine the coefficients directly without having to solve normal equations. The Legendre and Chebyshev polynomials are the well-known orthogonal polynomials.

Non-linear models are difficult to handle. But we can often use a simple transformation to convert the model to a linear one. Taking the logarithm of the values of the variable is one such method. These are called logarithmic linear (log linear) models.

Non-linear models that can be transformed to yield linear models are called intrinsically linear.
However, there are many software packages that can handle these models. Managers working in this area must become familiar with such models as per the availability of particular software. Discussion of such models is beyond the scope of this book.
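
As an illustration of an intrinsically linear model, the sketch below fits an exponential curve y = a·e^(bx) by taking logarithms, which gives the straight line ln y = ln a + b x. The data are hypothetical, generated from a = 2 and b = 0.5 purely for demonstration, and fit_line is an assumed helper name.

```python
import math


def fit_line(x, y):
    """Least squares straight line: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data generated from y = 2 * exp(0.5 * x).
x = [0, 1, 2, 3, 4]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

# Transform: ln y is linear in x, so an ordinary straight-line fit applies.
log_y = [math.log(yi) for yi in y]
ln_a, b = fit_line(x, log_y)
a = math.exp(ln_a)  # transform the intercept back to the original scale

print(round(a, 4), round(b, 4))
```

Because the fit minimises squared errors in ln y rather than in y, the result generally differs slightly from a direct non-linear fit when the data contain noise.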


If we are not certain about such a relationship but suspect that it is not likely to be linear, we should first draw a scatter diagram. The scatter diagram gives us a fair idea about the relationship. There are two common models used in business, which we discuss below.

Seasonal Model
We know that many business parameters are highly seasonal, e.g. sales of air conditioners, sales of woollen clothing, share market prices and the price of a commodity. Many of these are cyclic in nature with a constant period such as a year, a month or the settlement period on a stock exchange. A sinusoidal model is appropriate in such cases to separate the seasonal part of the data. If F_t is the forecast for period 't', then
$$ F_t = a + u\cos\frac{2\pi t}{N} + v\sin\frac{2\pi t}{N} $$
where a, u and v are constants, t is the time period and N is the number of time periods in the complete cycle.
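
A minimal sketch of how the constants a, u and v might be estimated is given below. It assumes exactly one complete cycle of N equally spaced observations (N of at least 3), in which case the least squares estimates reduce to simple averages. The monthly figures are hypothetical and the function names are chosen only for this illustration.

```python
import math


def seasonal_forecast(t, a, u, v, N):
    """Evaluate F_t = a + u*cos(2*pi*t/N) + v*sin(2*pi*t/N)."""
    angle = 2 * math.pi * t / N
    return a + u * math.cos(angle) + v * math.sin(angle)


def fit_seasonal(values):
    """Estimate (a, u, v) from one full cycle; values[0] is period t = 1."""
    N = len(values)
    a = sum(values) / N
    u = 2 / N * sum(y * math.cos(2 * math.pi * t / N)
                    for t, y in enumerate(values, start=1))
    v = 2 / N * sum(y * math.sin(2 * math.pi * t / N)
                    for t, y in enumerate(values, start=1))
    return a, u, v


# Hypothetical monthly series built from known constants, so the fit can be checked.
N = 12
observed = [seasonal_forecast(t, 100.0, 20.0, -5.0, N) for t in range(1, N + 1)]
a, u, v = fit_seasonal(observed)

print(round(a, 4), round(u, 4), round(v, 4))
```

For data covering several cycles or containing gaps, the general normal equations would be needed instead of these shortcut formulas.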

Seasonal Model with Trend


In many situations, besides the seasonal fluctuations there is an overall trend of increase or decrease. For example, demand for a particular item may be cyclic with a one-year period. However, at the same time it may also have an underlying trend of overall increase year on year. In such a case, the seasonal model and the straight line model are superimposed as
$$ F_t = a + b \times t + u\cos\frac{2\pi t}{N} + v\sin\frac{2\pi t}{N} $$
This model has a growth term b × t.
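
The effect of the growth term can be seen in a small sketch. The coefficients below are hypothetical values chosen for illustration; comparing the same month one cycle apart isolates the trend, because the seasonal terms repeat exactly.

```python
import math


def seasonal_trend_forecast(t, a, b, u, v, N):
    """Evaluate F_t = a + b*t + u*cos(2*pi*t/N) + v*sin(2*pi*t/N)."""
    angle = 2 * math.pi * t / N
    return a + b * t + u * math.cos(angle) + v * math.sin(angle)


# Hypothetical coefficients for a monthly series with a one-year cycle.
a, b, u, v, N = 100.0, 1.5, 20.0, -5.0, 12

this_year = seasonal_trend_forecast(3, a, b, u, v, N)       # period 3
next_year = seasonal_trend_forecast(3 + N, a, b, u, v, N)   # same month, one cycle later

# The seasonal part cancels, leaving only the growth b * N.
growth = next_year - this_year
print(round(growth, 6))
```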
Coefficient of Determination
Once we know there is a correlation between two variables and have found the linear relationship between them, we would like to specify how strong the relationship is. If the relationship is strong, we can use it for decision-making with more confidence, because our estimates based on the regression equation would be more accurate. Mean Square Error (MSE) is an estimate of the variance of the regression error. MSE depends on the values of the data and their scales. Hence we need a measure of the relative degree of variation, so that it can be compared across the fits obtained from different models and for different data sets. Coefficient of determination is such a measure.
Coefficient of determination is defined as the ratio of explained variance of the dependent variable to the total variance. It can be shown that this measure is equal to the square of the correlation coefficient.

Thus,
$$ \text{Coefficient of determination } b = \frac{\text{Explained Variance}}{\text{Total Variance}} = r^2 $$
b is the proportion of variation explained by the independent variable. Remaining variation in data (1 – b) is due to some other factors. The value (1 – b) is called coefficient of non-determination and is defined as,
$$ (1 - b) = 1 - r^2 = \frac{\text{Unexplained Variance}}{\text{Total Variance}} $$
Coefficient of alienation is square root of coefficient of non-determination.

Thus,
$$ \text{Coefficient of Alienation } k = \sqrt{1 - r^2} $$
Coefficient of determination is a measure of the strength of the regression fit. It is an estimator of the population parameter of correlation and can be obtained directly from a decomposition of the variation in Y into two components, viz. due to error and due to regression. In regression, the error is the deviation of a data point from its predicted value on the regression line. As in analysis of variance (ANOVA), we also look at the total deviation of a data point from the grand mean. Thus, we consider three kinds of deviation. Firstly, the deviation of a data point from the grand mean, $(y - \bar{Y})$. Secondly, the deviation of a data point from the value predicted by regression, $(y - \hat{y})$. Thirdly, the deviation of the predicted value of y from the grand mean, $(\hat{y} - \bar{Y})$.
Thus, Total deviation = Unexplained deviation + Explained deviation
Or,
$$ (y - \bar{Y}) = (y - \hat{y}) + (\hat{y} - \bar{Y}) $$
Or, Total Deviation = Error + Regression.
$(\hat{y} - \bar{Y})$ is called the explained deviation or regression deviation because it can be explained by the regression relationship between X and Y. The part $(y - \hat{y})$, on the other hand, is not explained by the regression relationship. Hence it is called an error.

If we square the deviations for all data points and sum them over all 'n' points, the simplification gives,
$$ \sum_{i=1}^{n} (y_i - \bar{Y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{Y})^2 $$
Or
Total Sum of Squares = Sum of Squares of Error + Sum of Squares of Regression
Or
SST = SSE + SSR
Thus,
$$ \text{coefficient of determination} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} $$
Values of the coefficient of determination range from 0 to 1. When r² = 1, the variation in Y is completely explained by the variation in X. This means that all data points fall exactly on the regression line with no error. This is called a perfect fit. In real business, there is always some error that is not explained. If r² is close to 1, we say that there is a strong relationship. On the other hand, if r² is close to zero, there is hardly any linear relationship between X and Y. In such a case we cannot use the value of X to predict values of Y. The higher the value of r², the better is the fit and the more confidence we can have in our predictions using the regression line.
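
The decomposition SST = SSE + SSR can be verified numerically. The sketch below uses the test-score data of Exercise 4 at the end of this chapter, fits the least squares line of Y on X, and computes the three sums of squares; the variable names are chosen only for this illustration.

```python
import math

# Data from Exercise 4 of this chapter: test score X and sales productivity Y.
x = [41, 35, 34, 40, 33, 42, 37, 42, 30, 43]
y = [32, 20, 35, 24, 27, 28, 31, 33, 26, 41]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares line of Y on X.
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x
y_hat = [intercept + slope * xi for xi in x]

sst = sum((yi - mean_y) ** 2 for yi in y)               # total sum of squares
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # error (unexplained)
ssr = sum((fi - mean_y) ** 2 for fi in y_hat)           # regression (explained)

r_squared = ssr / sst                   # coefficient of determination
non_determination = 1 - r_squared       # coefficient of non-determination
alienation = math.sqrt(non_determination)

print(round(r_squared, 4))
```

The value of r² obtained, about 0.1958, agrees with the answer given for Exercise 4(c).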

Fill in the blanks:


12. Least square principle can also be applied to the fitting of a
second degree polynomial which may be useful in business
situation if we have some idea that the relationship between
two variables is ................... .
13. ................... polynomials determine the coefficients directly
without having to solve normal equations.
14. ................... models that can be transformed to yield linear
models are called intrinsically linear.
15. Coefficient of ................... is square root of coefficient of non-
determination.

Recently, research efforts have focused on the problem of predicting a manufacturer's market share by using information on the quality of its product. Suppose that the following data are available on market share, in percentage (Y), and product quality, on a scale of 0 to 100, determined by an objective evaluation procedure (X):

X  27  39  73  66  33  43  47  55  60  68  70  75  80
Y   2   3  10   9   4   6   5   8   7   9  10  13  12

Estimate the simple linear regression relationship between market share and product quality rating. Can you apply any nonlinear model to the above data too? Explain with reasons.

Computer models are available that deal with such estimations. MS Excel does not have any tool directly dealing with this, but in particular cases we can use the 'Moving Average' and 'Exponential Smoothing' tools from the Data Analysis ToolPak.

7.6 CORRELATION ANALYSIS VS REGRESSION ANALYSIS
Both the techniques are directed towards a common purpose of establishing the degree and direction of relationship between two or more variables, but the methods of doing so are different. The choice of one or the other will depend on the purpose. If the purpose is to know the degree and direction of relationship, correlation is an appropriate tool, but if the purpose is to estimate a dependent variable with the substitution of one or more independent variables, regression analysis shall be more helpful. The points of difference are discussed below:
• Degree and Nature of Relationship: The correlation coefficient is a measure of the degree of co-variability between two variables, whereas regression analysis is used to study the nature of the relationship between the variables so that we can predict the value of one on the basis of another. The reliance on the estimates or predictions depends upon the closeness of the relationship between the variables.
• Cause and Effect Relationship: The cause and effect relationship is explained by regression analysis. Correlation is only a tool to ascertain the degree of relationship between two variables, and we cannot say that one variable is the cause and the other the effect. A high degree of correlation between price and demand for a commodity at a particular point of time may not suggest which is the cause and which is the effect. However, in regression analysis the cause and effect relationship is clearly expressed: one variable is taken as dependent and the other as independent.
• As in correlation, regression analysis can also be studied as 'simple and multiple', 'total and partial', 'linear and nonlinear', etc. depending upon the type of data and the method we use for regression analysis. The word regression implies 'going back or falling back to mean or average value', but in most applications of regression we do not use regression in this sense. We use it for forecasting or to understand the underlying mathematical relationship.
• Although correlation and regression both attempt to establish whether a relationship exists between two or more variables or not, these two techniques differ in approach. If we only want to know the degree and direction of relationship, we use correlation analysis. But if we want to forecast or predict the values, we need regression analysis.
• In correlation, there is no distinction between independent and dependent variables. But for regression analysis we need to specify independent and dependent variables clearly. In the case of correlation we are only interested in finding whether the relationship exists. Hence measuring the error only serves to establish confidence in our analysis. However, in regression our analysis itself is based on the concept of minimizing the errors.

Fill in the blanks:


16. The cause and effect relationship is explained by ...................
analysis.
17. The word regression implies 'going back or falling back to ................... or average value', but in most applications of regression we do not use regression in this sense.
18. For regression analysis we need to specify independent and
dependent ................... clearly.
19. In regression our analysis itself is based on the concept of
................... the errors.
20. ................... is one of the most commonly used (and abused) statistical tools for predictions or forecasts of economic and business information for decision-making.


What is the relation between coefficient of correlation and regression analysis? Illustrate with the help of a practical example.

Correlation analysis indicates whether two variables fluctuate with any relationship or not. Regression provides us a measure of the relationship and also helps to predict one variable for a given value of the other variable. Thus, unlike correlation analysis, in regression analysis one variable is independent and the other dependent. Regression analysis only gives a mathematical measure of the average relationship between two variables. It is one of the most commonly used (and abused) statistical tools for predictions or forecasts of economic and business information for decision-making.

7.7 SUMMARY
• In this chapter, the concept of regression between dependent and independent variables has been discussed. Regression provides us a measure of the relationship and also helps to predict one variable for a given value of the other variable.
• Unlike correlation analysis, in regression analysis one variable is independent and the other dependent. Please note that this relationship need not be a cause-effect relationship.
• Regression analysis is a branch of statistical theory which is widely used in all the scientific disciplines. It is a basic technique for measuring or estimating the relationship among economic variables that constitute the essence of economic theory and economic life. The uses of regression analysis are not confined to economic and business activities. Its applications extend to almost all the natural, physical and social sciences.
• The simple linear regression model is used if we have a bivariate distribution, i.e. only two variables are considered and the 'best fit' curve is approximated to a straight line. This describes the linear relationship between two variables. Although it appears to be too simplistic, in many business situations it is adequate. At least, an initial study can be based on this model for any decision-making situation.
• We have studied simple linear, non-linear and multiple regression models. For multiple regression and non-linear regression models, MS Excel or any other computer package would help in reducing voluminous calculations. We also discussed the coefficient of determination as a measure of the strength of relationship.
• The least square principle can also be applied to the fitting of a second degree polynomial, which may be useful in business situations if we have some idea that the relationship between two variables is parabolic. In any case, a second degree polynomial fit is more likely to be a better approximation of the actual relationship. We may use a second order model (parabolic trend) if we feel that the variation is parabolic.
• The least square approximation can be calculated easily for low degree polynomials, such as linear, parabolic and cubic. But for higher degrees (more than three), the system of normal equations becomes ill-conditioned. This causes large errors in the values of the coefficients, and the approximation becomes unreliable. To avoid these problems, 'orthogonal polynomials' are used for approximation.
• Mean Square Error (MSE) is an estimate of the variance of the regression error. MSE depends on the values of the data and their scales. Hence we need a measure of the relative degree of variation, so that it can be compared across the fits obtained from different models and for different data sets. Coefficient of determination is such a measure.
• Coefficient of determination is a measure of the strength of the regression fit. It is an estimator of the population parameter of correlation and can be obtained directly from a decomposition of the variation in Y into two components, viz. due to error and due to regression. In regression, the error is the deviation of a data point from its predicted value on the regression line.

• Regression: Regression is the measure of the average relationship between two or more variables in terms of the original units of the data.
• Regression Analysis: Regression analysis is a branch of statistical theory which is widely used in all the scientific disciplines. It is a basic technique for measuring or estimating the relationship among economic variables that constitute the essence of economic theory and economic life.
• Orthogonal Polynomials: These polynomials determine the coefficients directly without having to solve normal equations. The Legendre and Chebyshev polynomials are the well-known orthogonal polynomials.
• Intrinsically Linear: Non-linear models that can be transformed to yield linear models are called intrinsically linear.
• Coefficient of Determination: It is defined as the ratio of explained variance of the dependent variable to the total variance. It can be shown that this measure is equal to the square of the correlation coefficient.
• Coefficient of Alienation: It is the square root of the coefficient of non-determination.


7.8 DESCRIPTIVE QUESTIONS


1. Define regression and regression analysis.
2. Explain the concept of regression analysis in detail with examples.
3. Discuss the applicability of regression analysis in business and
other situations.
4. Explain the concept of simple linear regression and its model.
5. What are the two regression equations? How do you derive regression coefficients from them?
6. What are coefficients of regression? What are the properties of
regression coefficients?
7. Write a short note on nonlinear regression models.
8. What are orthogonal polynomials?
9. Explain seasonal model and seasonal model with trend.
10. Explain the difference between correlation and regression analysis.

EXERCISE FOR PRACTICE
1. If bXY = 3/2 and bYX = 1/6, find the value of the correlation coefficient between X and Y.
2. For the following data

                       X    Y
Mean                  36   85
Standard Deviation    11    8

The correlation coefficient between X and Y is 0.66. Find the regression equation of X on Y, and hence estimate the value of X when Y = 80.
3. A student obtains the lines of regression of Y on X and X on Y as 2X – 5Y – 7 = 0 and 3X + 2Y – 8 = 0 respectively. Is this correct?
4. The following data consist of the scores that 10 salesmen made on a test designed to measure their aptitude for sales work and their sales productivity over a period of time. The test score is denoted by X and sales productivity by Y.

X 41 35 34 40 33 42 37 42 30 43
Y 32 20 35 24 27 28 31 33 26 41
(a) Calculate the correlation coefficient
(b) Find the equation of the least square line.
(c) Calculate the value of coefficient of determination and use it
to comment on the usefulness.

5. The XYZ store has been expanding its market share during the past 7 years, posting the following gross sales in millions of dollars.

Year   1994  1995  1996  1997  1998  1999  2000
Sales    15    21    25    33    38    48    52

(a) Find the linear estimating equation that best describes these data and also find the trend (estimated) values.
(b) Calculate the present trend for these data and identify the year in which the fluctuation from the trend is largest.
(c) Forecast the sales value for the year 2001.

7.9 ANSWERS AND HINTS

ANSWERS FOR SELF ASSESSMENT QUESTIONS

Topic                                          Q. No.   Answers
Regression Analysis                            1.       Francis Galton
                                               2.       Legendre's, best fit
                                               3.       Regression
                                               4.       Regression analysis
                                               5.       Forecast
Simple Linear Regression                       6.       Prediction
                                               7.       Straight
                                               8.       Order
Coefficient of Regression                      9.       Geometric
                                               10.      Intersect
                                               11.      Same
Nonlinear Regression Models                    12.      Parabolic
                                               13.      Orthogonal
                                               14.      Non-linear
                                               15.      Alienation
Correlation Analysis vs Regression Analysis    16.      Regression
                                               17.      Mean
                                               18.      Variables
                                               19.      Minimizing
                                               20.      Regression

HINTS FOR DESCRIPTIVE QUESTIONS


1. Refer Sections 7.1 and 7.2
According to Morris Myers Blair, regression is the measure of the average relationship between two or more variables in terms of the original units of the data. Regression analysis is a branch of statistical theory which is widely used in all the scientific disciplines. It is a basic technique for measuring or estimating the relationship among economic variables that constitute the essence of economic theory and economic life.
2. Refer Section 7.2
If the variables in a bivariate distribution are correlated, the points in the scatter diagram approximately cluster around some curve. If the curve is a straight line, we call it linear regression. Otherwise, it is curvilinear regression. The equation of the curve which is closest to the observations is called the 'best fit'.
The best fit is calculated as per Legendre's principle of the least sum of squares of deviations of the observed data points from the corresponding values on the 'best fit' curve. This is called the minimum squared error criterion. It may be noted that the deviation (error) can be measured in the X direction or the Y direction.
3. Refer Section 7.2.1
Regression analysis is one of the most popular and commonly used statistical tools in business. The availability of computer packages has simplified its use. The uses of regression analysis are not confined to economic and business activities. Its applications extend to almost all the natural, physical and social sciences.
It provides a mathematical relationship between two or more variables. This mathematical relationship can then be used for further analysis and treatment of information using more complex techniques.
4. Refer Section 7.3
This model is used if we have a bivariate distribution, i.e. only two variables are considered and the 'best fit' curve is approximated to a straight line. This describes the linear relationship between two variables. Although it appears to be too simplistic, in many business situations it is adequate.
This model assumes that the errors are purely due to randomness and that all non-random fluctuations are captured by our 'best fit' curve. Thus, we can use regression analysis for prediction of the dependent variable for a given value of the independent variable, for controlling the independent variable to get the desired results, or to explain the relationship for reliable predictions.
5. Refer Section 7.3.2
$$ (\hat{y} - \bar{Y}) = b_{yx}(x - \bar{X}) \quad \text{and} \quad (\hat{x} - \bar{X}) = b_{xy}(y - \bar{Y}) $$
These two equations are called 'regression equations'.

6. Refer Section 7.4
The coefficients of regression are bYX and bXY. Properties of
Regression Coefficients are:
The coefficient of correlation is the geometric mean of the two
regression coefficients.
Both the regression coefficients are either positive or negative. It
means that they always have identical sign i.e., either both have
positive sign or negative sign.
7. Refer Section 7.5
The least square principle can also be applied to the fitting of a second degree polynomial, which may be useful in business situations if we have some idea that the relationship between two variables is parabolic. In any case, a second degree polynomial fit is more likely to be a better approximation of the actual relationship. We may use a second order model (parabolic trend) if we feel that the variation is parabolic. Here we will discuss only one nonlinear model, i.e. the polynomial of second degree.
8. Refer Section 7.5
Orthogonal polynomials determine the coefficients directly without having to solve normal equations. The Legendre and Chebyshev polynomials are the well-known orthogonal polynomials.
9. Refer Section 7.5
We know that many business parameters are highly seasonal, e.g. sales of air conditioners, sales of woollen clothing, share market prices and the price of a commodity. Many of these are cyclic in nature with a constant period such as a year, a month or the settlement period on a stock exchange. A sinusoidal model is appropriate in such cases to separate the seasonal part of the data.
10. Refer Section 7.6
Both the techniques are directed towards a common purpose of
establishing the degree and direction of relationship between two
or more variables but the methods of doing so are different. The
choice of one or the other will depend on the purpose. If the purpose
is to know the degree and direction of relationship, correlation is
an appropriate tool but if the purpose is to estimate a dependent
variable with the substitution of one or more independent
variables, the regression analysis shall be more helpful.

ANSWERS FOR EXERCISE FOR PRACTICE


1. 0.5
2. 31.4625
3. Wrong. Because values of bYX and bXY have opposite signs. As a
result r2 = bYX × bXY is negative. This is not feasible.

4. (a) Correlation coefficient = 0.442547
(b) y = mx + c. Slope m = 0.587181; Y intercept c = 7.563281
(c) Coefficient of determination b = r² = 0.1958
This indicates that only 19.58% of the variation in Y is explained by the variation in X as per the trend line. About 80% of the variation is due to some other factors. Thus we cannot really estimate the variation in Y from the variation in X.
5. (a) Slope m = 6.357143; Y intercept c = –12662.1
(b) The table shows the trend values. Fluctuation from the trend is largest in the year 1999.

Year               1994      1995      1996      1997      1998      1999      2000
Sales                15        21        25        33        38        48        52
Trend (Estimate)   14.07143  20.42857  26.78571  33.14286  39.5      45.85714  52.21429
Fluctuation        0.928571  0.571429  -1.78571  -0.14286  -1.5      2.142857  -0.21429

(c) Sales estimate for 2001 is 58.5714286

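The figures in answer 5 can be reproduced with a few lines of Python. This is only a sketch of the least squares trend calculation for the Exercise 5 data; the variable names are chosen for this illustration.

```python
# Data from Exercise 5: gross sales in millions of dollars.
years = [1994, 1995, 1996, 1997, 1998, 1999, 2000]
sales = [15, 21, 25, 33, 38, 48, 52]

n = len(years)
mean_year = sum(years) / n
mean_sales = sum(sales) / n

# Least squares trend line: sales = intercept + slope * year.
slope = (sum((t - mean_year) * (s - mean_sales) for t, s in zip(years, sales))
         / sum((t - mean_year) ** 2 for t in years))
intercept = mean_sales - slope * mean_year

trend = [intercept + slope * t for t in years]
fluctuation = [s - f for s, f in zip(sales, trend)]

# Year with the largest absolute deviation from the trend.
largest_year = max(zip(years, fluctuation), key=lambda pair: abs(pair[1]))[0]
forecast_2001 = intercept + slope * 2001

print(round(slope, 6), round(intercept, 1))
print(largest_year, round(forecast_2001, 4))
```
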
IM
7.10 SUGGESTED READINGS FOR REFERENCE
SUGGESTED READINGS
• R. Selvaraj, Quantitative Methods in Management: Problems and Solutions, Excel Books, 2008.
• J.K. Sharma, Fundamentals of Business Statistics, 2010.
• Bierman, H., Bonini, C.P., and Hausman, W.H., Quantitative Analysis for Business Decisions, Homewood, Illinois: Richard D. Irwin, Inc., 1973.
• Gallagher, C.A. and Watson, H.J., Quantitative Methods for Business Decisions, McGraw Hill, Inc., 1976.
• Gordon, G., and Pressman, I., Quantitative Decision Making for Business, New Delhi: National Publishing House, 1983.
• Lapin, L., Quantitative Methods for Business Decisions, New York: Harcourt Brace Jovanovich, Inc., 1976.
• Loomba, N.P., Management – A Quantitative Perspective, New York: MacMillan Pub. Company, Inc., 1970.
• Richard, E.T., An Introduction to Quantitative Methods for Decision Making, 2nd Ed., New York: Holt, Rinehart and Winston, 1977.

E-REFERENCES
• http://www.statsoft.com/Textbook/Multiple-Regression
• http://obsessionwithregression.blogspot.in/
• http://www.statmethods.net/stats/regression.html

CHAPTER 8

THEORY OF PROBABILITY

CONTENTS
8.1 Introduction
8.2 Important Terms in Probability
8.3 Kinds of Probability
8.4 Simple Propositions of Probability
8.5 Addition Theorem of Probability
8.6 Multiplication Theorem of Probability
8.7 Conditional Probability
8.8 Law of Total Probability
8.8.1 Bayes's Formula
8.9 Independence of Events
8.10 Combinatorial Concept
8.10.1 Product Rule of Counting
8.10.2 Sum Rule of Counting
8.10.3 Permutation
8.10.4 Combination
8.11 Summary
8.12 Descriptive Questions
8.13 Answers and Hints
8.14 Suggested Readings for Reference

INTRODUCTORY CASELET

FACING A CROWD ISN'T EASY

[Chart: USA TODAY Snapshots, percentage of professional women and men who fear public speaking]

The above chart shows the percentage of professional women and men who fear public speaking. These percentages can be written as conditional probabilities as follows. Suppose one professional is selected at random. Then, given that this person is a female, the probability is .35 that she has a fear of public speaking. On the other hand, if the selected person is a male, this probability is only .11. These probabilities can be written as follows:

P (has fear of public speaking/female) = .35
P (has fear of public speaking/male) = .11

Note that these are approximate probabilities because the data given in the chart are based on a sample survey.


After studying this chapter, you should be able to:

• Understand the meaning and important terms of probability
• Learn about the addition theorem and multiplication theorem of probability
• Understand the concept of independence of events and combinatorial concepts like permutation and combination
• Solve problems of conditional probability and Bayes' Theorem and other concepts of probability

8.1 INTRODUCTION
A probability is a quantitative measure of risk. The statistician I.J. Good suggests, "The theory of probability is much older than the human species, since the assessment of uncertainty incorporates the idea of learning from experience, which most creatures do." Development of probability theory in Europe is associated with gamblers in the famous European casinos, such as the one at Monte Carlo. It is also associated with astrology.
This chapter provides exposure to fundamental concepts, since probability is inseparable from statistical methods. Readers not familiar with the subject are advised to study the details from any book on probability, or the book by this author, and to solve a few numerical problems to understand the logic. In probability, understanding and interpreting the logic of the problem is more important. There could be more than one method to solve a given problem.
The theory of probability is an indispensable tool in the analysis of situations involving risk. It is used in various fields such as quality control, management, engineering, physics, biology and economics.

8.2 IMPORTANT TERMS IN PROBABILITY


Probability and sampling are inseparable parts of statistics. Before we discuss probability and sampling distributions, we must be familiar with some common terms used in the theory of probability. Although these terms are commonly used in business, they have precise technical meanings.
Random Experiment: In the theory of probability, a process or activity that results in outcomes under study is called an experiment, for example, sampling from a production lot.

Random experiment is an experiment whose outcome is not predictable in advance.

There is a chance or risk (sometimes also called uncertainty) associated with each outcome.
Sample Space: It is the set of all possible outcomes of an experiment. It is usually represented as S. For example, if the random experiment is the rolling of a die, the sample space is the set S = {1, 2, 3, 4, 5, 6}. Similarly, if the random experiment is the tossing of three coins, the sample space is S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT} with a total of 8 possible outcomes. (H is heads and T is tails showing up.)
If we select a random sample of 2 items from a production lot and check them for defects, the sample space will be S = {DD, DS, DR, RS, RR, SS} where D stands for defective, S stands for serviceable and R stands for re-workable.
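
The sample spaces above can be enumerated mechanically. The following sketch uses Python's standard itertools module; the variable names are chosen only for this illustration, and the two-item sample is treated as unordered, as in the text.

```python
from itertools import combinations_with_replacement, product

# Rolling a die: six possible outcomes.
die_space = set(range(1, 7))

# Tossing three coins: every ordered sequence of H and T.
coin_space = ["".join(outcome) for outcome in product("HT", repeat=3)]

# Checking 2 items, each Defective, Re-workable or Serviceable,
# where the order of the two items does not matter.
pair_space = ["".join(pair) for pair in combinations_with_replacement("DRS", 2)]

print(len(die_space), len(coin_space), len(pair_space))
print(coin_space)
print(pair_space)
```

The enumeration lists "DR" where the text writes "DR" and "DS", "RS" in the same unordered sense, so the six outcomes coincide.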
• Event: One or more possible outcomes that belong to a certain
category of interest are called an event. A subset E of the sample
space S is an event. In other words, an event is a set of favourable
outcomes.
• Event space: It is the set of all possible events. It is usually
represented as E. Note that in probability and statistics we are
usually interested in the number of elements in the sample space and
the number of elements in the event space.
• Union of events: If E and F are two events, then another event
defined to include all outcomes that are either in E or in F or in
both is called the union of events E and F. It is denoted as E ∪ F.
• Intersection of events: If E and F are two events, then another
event defined to include all outcomes that are in both E and F is
called the intersection of events E and F. It is denoted as E ∩ F.
• Mutually exclusive events: The events E and F are said to be
mutually exclusive if they have no outcome of the experiment common
to them. In other words, events E and F are mutually exclusive if
E ∩ F = Φ, where Φ is the null or empty set.
• Collectively exhaustive events: The events are collectively
exhaustive if their union is the sample space.
• Complement of event: The complement of an event E is the event
which consists of all outcomes that are not in E. It is denoted
as E^C. Thus, E ∩ E^C = Φ and E ∪ E^C = S.
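The definitions above map directly onto set operations. The following is a minimal sketch using Python's built-in sets for one roll of a die; the particular events E and F are chosen only for illustration.

```python
# Sample space for one roll of a die.
S = {1, 2, 3, 4, 5, 6}

E = {2, 4, 6}   # event: an even number shows up
F = {4, 5, 6}   # event: a number greater than 3 shows up

union = E | F            # outcomes in E or in F or in both
intersection = E & F     # outcomes in both E and F
complement_E = S - E     # outcomes of S that are not in E

# E and its complement are mutually exclusive and collectively exhaustive.
mutually_exclusive = (E & complement_E) == set()
collectively_exhaustive = (E | complement_E) == S

print(union, intersection, complement_E)
print(mutually_exclusive, collectively_exhaustive)
```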

Fill in the blanks:


1. ................... experiment is an experiment whose outcome is not
predictable in advance.
2. One or more possible outcomes that belong to certain category
of our interest are called as ................... .
3. The events E and F are said to be ................... events if they have
no outcome of the experiment common to them.


A box contains a certain number of computer parts, a few of which
are defective. Two parts are selected at random from this box and
inspected to determine if they are good or defective. How many total
outcomes are possible? Draw a tree diagram for this experiment.

The events that (i) an employee would be late, and (ii) the employee
would be absent, on a particular day, are mutually exclusive since
both cannot occur simultaneously. An employee cannot be both
late and absent on a particular day. On the other hand, two or more
events which are not mutually exclusive are called overlapping
events. Suppose a card is chosen from a set of numbered cards (the
values listed below suggest cards numbered 1 to 20), A represents the
event that the number on the card chosen is divisible by 3 and B
represents the event that the number is divisible by 5. Then for A to
occur the number must be 3, 6, 9, 12, 15 or 18, and for B to occur it
must be one of 5, 10, 15 and 20. Note that if the number 15 is
obtained, both A and B have taken place. Thus, A and B are not
mutually exclusive.

8.3 KINDS OF PROBABILITY


There are four approaches to probability. Whatever the approach, the
same set of mathematical rules, theorems and postulates holds for
manipulating and analyzing probability.

Classical Probability
This is also called Mathematical Probability, Objective Probability or
A-priori Probability. This probability is based on the assumption that
certain occurrences are equally likely. For example, if an unbiased die
is rolled, numbers 1 to 6 are equally likely to appear on the top face.
If there are n mutually exclusive, collectively exhaustive and equally
likely outcomes of an experiment and if m of them are favourable to
an event E, then the probability of occurrence of E, denoted by P(E),
is defined as,
P(E) = m/n, where 0 ≤ m ≤ n. Thus, 0 ≤ P(E) ≤ 1.
This definition is based on a-priori knowledge that the outcomes are
equally likely and that the total number of outcomes is finite, for
example, drawing cards from a shuffled pack of 52 cards, throwing a
die, or tossing a coin. If any of these assumptions is not true, the
classical definition given above does not hold, for example, the toss
of a biased coin, or the throw of dice by ‘Shakuni Mama’ in the epic
Mahabharat.
This definition also has a serious drawback: How do we know with
certainty that the outcomes are equally likely? If it cannot be proven
mathematically or logically, this definition is not complete.
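The classical definition is a counting exercise, which the sketch below illustrates. The function name classical_probability and the card encoding are assumptions made for this example only.

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """Return P(E) = m / n, assuming all outcomes are equally likely."""
    m = len(event & sample_space)   # outcomes favourable to E
    n = len(sample_space)           # total number of outcomes
    return Fraction(m, n)

# One throw of an unbiased die: probability of an even number.
die = set(range(1, 7))
p_even = classical_probability({2, 4, 6}, die)

# One card from a shuffled pack of 52: probability of a king.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
pack = {(rank, suit) for rank in ranks for suit in suits}
kings = {card for card in pack if card[0] == "K"}
p_king = classical_probability(kings, pack)

print(p_even, p_king)
```

Exact fractions are used so that m/n is not blurred by floating-point rounding.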

Relative Frequency Probability
This is another type of objective probability. It is also called
experimental probability.

Suppose that an experiment, whose sample space is S, is repeatedly
performed under exactly the same conditions. If the event E, which is
a subset of the sample space S, occurs m times in a total of n trials,
then the probability of event E, denoted as P(E), is defined as,
$$P(E) = \lim_{n \to \infty} \frac{m}{n}$$
Thus, P(E) is defined as the limiting proportion of the number of times
that E occurs, i.e. the limiting relative frequency.
This definition has a drawback similar to that of the earlier one. How
do we know that the ratio m/n will converge to some constant value that
will be the same every time we carry out the experiment? If we carry
out an experiment of flipping a coin and our event is getting heads,
we do not observe any systematic series from which we could prove
mathematically that the ratio m/n converges to 1/2.
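The behaviour of the ratio m/n can be explored by simulation. This is only a sketch: it assumes Python's pseudo-random generator is an adequate stand-in for a fair coin, and a simulation can suggest convergence but cannot prove it, which is exactly the drawback noted above.

```python
import random

def relative_frequency(n_trials, seed=0):
    """Fraction of heads in n_trials simulated tosses of a fair coin."""
    rng = random.Random(seed)   # fixed seed so that runs are repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_trials))
    return heads / n_trials

for n in (10, 100, 1_000, 100_000):
    print(n, relative_frequency(n))
```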
Subjective Probability
The simplest and most natural interpretation of probabilities is that
they are measures of an individual’s belief in the statement that he or
she makes. This probability depends on personal judgment, and hence
is called personal or subjective probability. In statistics, this
is the confidence placed in the occurrence of an event by an individual,
based on the evidence available to him or her, for example, a forecast
of rainfall, an estimate of sales, or a surgeon assessing the
probability of an operated patient’s recovery.

Axiomatic Probability
The earlier definitions that we have discussed make certain
assumptions. However, to assume that m/n will necessarily converge to
some constant value every time the experiment is performed, or that
the outcomes are equally likely, seem to be very complex assumptions.
It would be more reasonable to assume a set of simpler and logically
self-evident axioms (assumptions on which a theory is based), and then
base the probability definition on these axioms. This is the modern
axiomatic approach to probability theory. Russian mathematician A.N.
Kolmogorov developed this concept, which combines both the objective
and subjective concepts of probability.
Consider an experiment whose sample space is S. For each event E
of the sample space S, we assume that a number P(E) is defined and is
referred to as the probability of event E if it satisfies the
following axioms.

Axiom 1: 0 ≤ P(E) ≤ 1
Axiom 2: P(S) = 1 (certain event). This also implies P(Φ) = 0
(impossible event), or, summing over all outcomes xi of the sample
space, ∑ P(xi) = 1.
Axiom 3: P(E1 ∪ E2 ∪ … ∪ En) = P(E1) + P(E2) + … + P(En)
for any sequence of mutually exclusive events E1, E2, …, En, in
other words, events for which Ei ∩ Ek = Φ when i ≠ k.
The axiomatic approach is valid for all situations, irrespective of
whether the outcomes of the experiment are equally likely or not. This
is the advantage of the axiomatic approach. The axioms are logically
natural and agree with our intuition. Further, we cannot find any
cases in probability theory that do not satisfy these axioms.
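For a finite sample space, the axioms can be checked on any proposed assignment of probabilities. The sketch below is illustrative only: the helper names and the loaded-die figures are assumptions, chosen to show that the outcomes need not be equally likely.

```python
import math

def satisfies_axioms(p):
    """p maps each outcome of a finite sample space to its probability."""
    axiom1 = all(0 <= value <= 1 for value in p.values())
    axiom2 = math.isclose(sum(p.values()), 1.0)   # P(S) = 1
    return axiom1 and axiom2

def prob(event, p):
    # Axiom 3: single outcomes are mutually exclusive, so P(E) is the
    # sum of the probabilities of the outcomes that make up E.
    return sum(p[outcome] for outcome in event)

# Hypothetical loaded die: a six is five times as likely as any other face.
loaded_die = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}

print(satisfies_axioms(loaded_die), prob({2, 4, 6}, loaded_die))
```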

State whether the following statements are true/false:
4. Classical probability is also called Mathematical Probability or
Objective Probability or A-priori Probability.
5. Subjective probability is also called as experimental probability.
6. Russian mathematician A.N. Kolmogorov developed the
concept of Relative frequency probability that combines both
the objective and subjective concepts of probability.

Suppose a randomly selected passenger is about to go through the
metal detector at the Indira Gandhi International airport. Consider
the following two outcomes: The passenger sets off the metal detector,
and the passenger does not set off the metal detector. Are these two
outcomes equally likely? Explain why or why not. If you are to find
the probability of these two outcomes, would you use the classical
approach or the relative frequency approach? Explain why.

The classical theory, under the assumption of equally likely
outcomes, depends on logical reasoning. It does very well when we
are concerned with balanced coins, perfect dice, well shuffled pack
of cards and all those situations where all outcomes are equally likely.
However, problems are immediately encountered when we have
to deal with the unbalanced coins, loaded dice and so on. In such
situations, we have to depend on the relative frequency approach.


8.4 SIMPLE PROPOSITIONS OF PROBABILITY


Now we state and prove some simple propositions of probabilities.
These are very handy while using statistical tools. Rather than
remembering their proofs, managers should understand the
conditions under which these are applicable.

Proposition 1
P(E^C) = 1 – P(E)
Probability of complement: Let E^C denote the complement of the
event E. By definition of the complement, E^C has all elements
of the sample space S that are not in E. Thus, E and E^C are mutually
exclusive and collectively exhaustive. Therefore, by Axioms 2 and 3 we
have,
1 = P(S) = P(E ∪ E^C) = P(E) + P(E^C)
or,
P(E^C) = 1 – P(E)

Proposition 2
If E ⊂ F, then P(E) ≤ P(F)
If the event E is contained in event F, that is, E ⊂ F, then we can
express,
F = E ∪ (E^C ∩ F).
However, as events E and (E^C ∩ F) are mutually exclusive, by Axiom 3
we get,
P(F) = P(E) + P(E^C ∩ F)
But, by Axiom 1, P(E^C ∩ F) ≥ 0. Therefore, we have proved the
proposition,
P(E) ≤ P(F)

Proposition 3
P(E ∪ F) = P(E) + P(F) – P(E ∩ F)
Probability of unions: Event E ∪ F can be written as the union of the
two disjoint events E and (E^C ∩ F). Thus, from Axiom 3,
P(E ∪ F) = P[E ∪ (E^C ∩ F)] = P(E) + P(E^C ∩ F)   (1)
Also, F = (E ∩ F) ∪ (E^C ∩ F), hence,
P(F) = P(E ∩ F) + P(E^C ∩ F)   (2)
Substituting P(E^C ∩ F) from (2) into (1), we get Proposition 3 as,
P(E ∪ F) = P(E) + P(F) – P(E ∩ F)
The extended statement of this proposition for n events is called the
inclusion-exclusion principle. For three events it reads,
P(E ∪ F ∪ G) = P(E) + P(F) + P(G) – P(E ∩ F) – P(F ∩ G) – P(E ∩ G) +
P(E ∩ F ∩ G)
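Propositions 1 to 3 can be checked numerically on a small sample space. The sketch below assumes one roll of a fair die and three illustrative events; it verifies the identities for this case only and is not a proof.

```python
from fractions import Fraction

S = set(range(1, 7))   # one roll of a fair die

def P(event):
    """Classical probability of an event within S."""
    return Fraction(len(event), len(S))

E = {2, 4, 6}      # even number
F = {4, 5, 6}      # greater than 3
G = {1, 2, 3, 4}   # less than 5

# Note: for Python sets, | is union and & is intersection (not "given").
prop1 = P(S - E) == 1 - P(E)
prop2 = P(E & F) <= P(F)          # (E ∩ F) is contained in F
prop3 = P(E | F) == P(E) + P(F) - P(E & F)
three_events = P(E | F | G) == (
    P(E) + P(F) + P(G)
    - P(E & F) - P(F & G) - P(E & G)
    + P(E & F & G)
)

print(prop1, prop2, prop3, three_events)
```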

Proposition 4
Mutually exclusive events: When the sets corresponding to two
events are disjoint (have no common elements, or the intersection is
null), the two events are called mutually exclusive.
E ∩ F = Φ Therefore,
P (E ∩ F) = P (Φ) = 0
Also, for mutually exclusive events E and F,
P (E ∪ F) = P (E) + P (F)

Proposition 5
P(E^C ∩ F) = P(F) – P(E ∩ F)
From set theory, F can be written as a union of two disjoint events
E ∩ F and E^C ∩ F. Hence, by Axiom 3, we have P(F) = P(E ∩ F) +
P(E^C ∩ F). By re-arranging the terms we get the result.

Fill in the blanks:
7. Proposition 1 is defined as P(E^C) = ...................
8. Event E ∪ F can be written as the ................... of the two disjoint
events namely E and (E^C ∩ F).
9. When the sets corresponding to two events are ...................
(have no common elements, or the intersection is null), the
two events are called mutually exclusive.

A gambler has four cards – two diamonds and two clubs. The
gambler proposes the following game to you: You will leave the
room and the gambler will put the cards face down on a table. When
you return to the room, you will pick two cards at random. You will
win $10 if both cards are diamonds, you will win $10 if both are
clubs, and for any other outcome you will lose $10. Assuming that
there is no cheating, should you accept this proposition? Support
your answer by calculating your probability of winning $10.

8.5 ADDITION THEOREM OF PROBABILITY


The addition theorem of probability determines the probability that
either event ‘A’ or event ‘B’ occurs, or both occur. The ‘or’ between
the two events ‘A’ and ‘B’ is denoted by ‘∪’ and pronounced as Union.


Let A and B be two events defined in a sample space. The union of
events A and B is the collection of all outcomes that belong either
to A or to B or to both A and B, and is denoted by ‘A or B’.

The addition theorem is generally written using set notation,
P(A ∪ B) = P(A) + P(B) – P(A ∩ B),
where, P(A) = probability of occurrence of event ‘A’
P(B) = probability of occurrence of event ‘B’
P(A ∪ B) = probability of occurrence of event ‘A’ or event ‘B’
P(A ∩ B) = probability of occurrence of both event ‘A’ and event ‘B’.
The addition theorem can be proved as follows:
Let ‘A’ and ‘B’ be subsets of a finite non-empty set ‘S’, and let n(·)
denote the number of elements in a set. Then, according to the
addition rule of counting,
n(A ∪ B) = n(A) + n(B) – n(A ∩ B),
On dividing both sides by n(S), we get
n(A ∪ B) / n(S) = n(A) / n(S) + n(B) / n(S) – n(A ∩ B) / n(S)   (1).
If the sets ‘A’ and ‘B’ correspond to two events ‘A’ and ‘B’ of a
random experiment with equally likely outcomes, and if the set ‘S’
corresponds to the sample space ‘S’ of the experiment, then
equation (1) becomes
P(A ∪ B) = P(A) + P(B) – P(A ∩ B),
This equation is known as the addition theorem in probability.
Here the event A ∪ B means that either event ‘A’ or event ‘B’ occurs,
or both occur simultaneously.
If two events A and B are mutually exclusive, then A ∩ B = Φ.
Therefore,
P(A ∪ B) = P(A) + P(B)   [since P(A ∩ B) = 0].
In the language of set theory, the event ‘A and B’ is written as A ∩ B.

Fill in the blanks:


10. The ................... theorem in the probability concept is the
process of determination of the probability that either event
‘A’ or event ‘B’ occurs or both occur.
11. Let ‘A’ and ‘B’ are Subsets of a finite non empty set ‘S’ then
according to the addition rule
P (A ∪ B) = ................... .


Vikram and Kiara are planning an outdoor reception following their
wedding. They estimate that the probability of bad weather is .25,
that of a disruptive incident (a fight breaks out, the limousine is late,
etc.) is .15, and that bad weather and a disruptive incident will occur
is .08. Assuming these estimates are correct, find the probability
that their reception will suffer bad weather or a disruptive incident.

To calculate the probability of the union of two events A and B, we
add their marginal probabilities and subtract their joint probability
from this sum. We must subtract the joint probability of A and
B from the sum of their marginal probabilities to avoid double
counting because of common outcomes in A and B.

8.6 MULTIPLICATION THEOREM OF PROBABILITY

Probability is the branch of mathematics which deals with the
occurrence of samples. The basic form of the multiplication theorem of
probability for two events ‘x’ and ‘y’ can be stated as,
P(x . y) = p(y) . p(x / y)
Here p(x) and p(y) are the probabilities of occurrence of events ‘x’
and ‘y’ respectively, and P(x . y) is the probability that both occur.
p(x / y) is the conditional probability of ‘x’ under the condition that
‘y’ has occurred before ‘x’.
p(x / y) is always calculated after ‘y’ has occurred. Here, the
occurrence of ‘x’ depends on ‘y’: ‘y’ has already changed the
conditions of the experiment, so the probability of occurrence of ‘x’
also changes.

Intersection of Events: Let A and B be two events defined in a
sample space. The intersection of A and B represents the collection
of all outcomes that are common to both A and B and is denoted by
‘A and B’.

The essential condition is that the probability of ‘y’ is not equal to
zero, that is, p(y) ≠ 0.
Now, consider the case when ‘x’ and ‘y’ are independent events.
The occurrence of ‘x’ does not depend on ‘y’, so the occurrence of ‘y’
has no effect on ‘x’. Hence,
p(x / y) = p(x)   (equation 1)
Now, according to the multiplication theorem of probability,
P(x . y) = p(y) . p(x / y)   (equation 2)
Substituting p(x / y) from equation 1 in equation 2, we get
P(x . y) = p(x) . p(y),
This is the special case of this theorem.
It is valid only when the events are independent.
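To see the multiplication theorem at work on dependent events, consider drawing two cards from a pack without replacement. This example is not from the text: y is taken to be "the first card is an ace" and x "the second card is an ace", and the brute-force count is included only as a cross-check.

```python
from fractions import Fraction
from itertools import permutations

# Pack of 52 cards as (rank, suit) pairs; rank 1 stands for the ace.
deck = [(rank, suit) for rank in range(1, 14) for suit in "SHDC"]

p_y = Fraction(4, 52)           # y: first card is an ace
p_x_given_y = Fraction(3, 51)   # x given y: one ace is already gone
p_xy = p_y * p_x_given_y        # P(x . y) = p(y) . p(x / y)

# Cross-check by counting all ordered draws of two different cards.
draws = list(permutations(deck, 2))
both_aces = sum(1 for first, second in draws if first[0] == 1 and second[0] == 1)
p_xy_counted = Fraction(both_aces, len(draws))

print(p_xy, p_xy_counted)
```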

State whether the following statements are true/false:


12. Probability is the branch of mathematics which deals with the
occurrence of samples.
13. The basic form of Addition theorems on probability for two
events ‘X’ and ‘Y’ can be stated as, P (x. y) = p (y). P(x / y)
14. The intersection of A and B represents the collection of all
outcomes that are common to both A and B and is denoted by
A and B.
IM
According to data from the Centers for Disease Control and
Prevention, there were a total of 823,542,000 visits to physicians
in the United States during 2000. Of these visits, 488,199,000 were
visits by women, and 44,313,000 were by women aged 15 to 24 years
(Advance Data from Vital and Health Statistics, June 5, 2002). If
one of these 823,542,000 visits is selected at random, what is the
probability that the patient is 15 to 24 years of age given that this
person is a woman?

8.7 CONDITIONAL PROBABILITY


As a measure of uncertainty, probability depends on the information
available. If we know occurrence of say event F, probability of event
E happening may be different as compared to original probability of
E when we had no knowledge of the event F happening. Probability
that E occurs given that F has occurred is the conditional probability
and denoted by P(E|F). If event F occurs, then our sample space is
reduced to the event space of F. Also now for event E to occur, we must
have both events E and F occur simultaneously. Hence probability
that event E occurs, given that event F has occurred, is equal to the
probability of EF (that is E ∩ F) relative to the probability of F. Thus,
$$P(E \mid F) = \frac{P(EF)}{P(F)}, \quad \text{provided } P(F) > 0$$
Another variation of the conditional probability rule is,
P(EF) = P(E|F) × P(F)

Conditional probability satisfies all the properties and axioms of
probabilities. Now onwards, we would write (E ∩ F) as EF, which is a
common convention.

Conditional probability is the probability that an event will occur
given that another event has already occurred. If A and B are two
events, then the conditional probability of A given B is written as
P(A/B) and read as “the probability of A given that B has already
occurred.”
Example: The probability that a new product will be successful if a
competitor does not launch a similar product is 0.67. The probability
that a new product will be successful in the presence of a competitor’s
new product is 0.42. The probability that the competitor will launch
a new product is 0.35. What is the probability that the product will be
a success?

Solution: Let S denote that the product is successful, L denote that
the competitor will launch a product and L^C denote that the
competitor will not launch the product. Now, from the given data,
P(S|L^C) = 0.67, P(S|L) = 0.42, P(L) = 0.35
Hence, P(L^C) = 1 − P(L) = 1 − 0.35 = 0.65
Now, L and L^C are mutually exclusive and collectively exhaustive, so
S = SL ∪ SL^C. Applying the conditional probability formula to each
part, the probability that the product will be a success, P(S), is,
P(S) = P(S|L) P(L) + P(S|L^C) P(L^C)
= 0.42 × 0.35 + 0.67 × 0.65 = 0.5825
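The arithmetic of this example can be reproduced in a few lines. The helper total_probability is a name introduced here for illustration; it simply weights each conditional probability by the probability of its condition.

```python
import math

def total_probability(conditionals, priors):
    """Sum of P(S | F_i) * P(F_i) over mutually exclusive, exhaustive F_i."""
    assert math.isclose(sum(priors), 1.0), "the F_i must be collectively exhaustive"
    return sum(c * p for c, p in zip(conditionals, priors))

p_launch = 0.35
p_success = total_probability(
    conditionals=[0.42, 0.67],          # P(S | L), P(S | L^C)
    priors=[p_launch, 1 - p_launch],    # P(L), P(L^C)
)

print(round(p_success, 4))
```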

Fill in the blanks:


15. ................... probability is the probability that an event will
occur given that another event has already occurred.
16. Another variation of conditional probability rule is,
P(EF) = ...................

A Consumer agency randomly selected 1700 flights for two major
airlines, A and B. The following table gives the two-way classification
of these flights based on airline and arrival time. Note that “less
than 30 minutes late” includes flights that arrived early or on time.

             Less than 30    30 Minutes to    More than 1
             Minutes Late    1 Hour Late      Hour Late
Airline A        429             390              92
Airline B        393             316              80

1. If one flight is selected at random from these 1700 flights, find
the probability that this flight is
(a) more than 1 hour late
(b) less than 30 minutes late
(c) a flight on airline A given that it is 30 minutes to 1 hour late
(d) more than 1 hour late given that it is a flight on airline B
2. Are the events “airline A” and “more than 1 hour late” mutually
exclusive? What about the events “less than 30 minutes late”
and “more than 1 hour late”? Why or why not?
3. Are the events “airline B” and “30 minutes to 1 hour late”
independent? Why or why not?

8.8 LAW OF TOTAL PROBABILITY

Consider two events, E and F. Whatever the events may be, we can
always say that the probability of E is equal to the probability of
the intersection of E and F, plus the probability of the intersection
of E and the complement of F. That is,
P(E) = P(E ∩ F) + P(E ∩ F^C)

8.8.1 BAYES’S FORMULA


Let E and F be events. Then,
E = (E ∩ F) ∪ (E ∩ F^C)
since any element of E must be either in both E and F, or in E but not
in F. EF and EF^C are mutually exclusive, since the former must be in
F and the latter must not be in F. Hence, by Axiom 3,
P(E) = P(EF) + P(EF^C) = P(E|F) × P(F) + P(E|F^C) × P(F^C)
= P(E|F) × P(F) + P(E|F^C) × [1 − P(F)]
The equation may be generalized for mutually exclusive and
collectively exhaustive events F1, F2, …, Fn, that is, events for which
$$\bigcup_{i=1}^{n} F_i = S \quad \text{and} \quad F_i \cap F_j = \Phi \ \text{for } i \neq j.$$
Hence, we can write,
$$E = \bigcup_{i=1}^{n} (E F_i)$$
Therefore,
$$P(E) = \sum_{i=1}^{n} P(E F_i) = \sum_{i=1}^{n} P(E \mid F_i) \times P(F_i)$$

Suppose now that E has occurred and we are interested in determining
the probability that Fi has occurred. Using the above equations, we
have the following proposition.
$$P(F_i \mid E) = \frac{P(E F_i)}{P(E)} = \frac{P(E \mid F_i) \times P(F_i)}{\sum_{j=1}^{n} P(E \mid F_j) \times P(F_j)} \quad \text{for all } i = 1, 2, \ldots, n$$

This equation is known as Bayes’ formula. If we think of the events Fi
as being possible ‘hypotheses’ about the proportionality of some
subject matter, say the market shares of competitors, then Bayes’
formula tells us how these should be modified by the new evidence of
the experiment, say a market survey.
Example: A bin contains 3 different types of lamps. The probability
that a type 1 lamp will give over 100 hours of use is 0.7, with the
corresponding probabilities for type 2 and 3 lamps being 0.4 and 0.3
respectively. Suppose that 20 per cent of the lamps in the bin are of
type 1, 30 per cent are of type 2 and 50 per cent are of type 3.
(a) What is the probability that a randomly selected lamp will last
more than 100 hours?
(b) Given that a selected lamp lasted more than 100 hours, what are the
conditional probabilities that it is of type 1, type 2 and type 3?
Solution: Let type 1, type 2 and type 3 lamps be denoted by T1, T2 and
T3 respectively. Also, we denote by S the event that a lamp lasts more
than 100 hours and by S^C the event that it does not. Now, as per the
given data,
P(S|T1) = 0.7,  P(S|T2) = 0.4,  P(S|T3) = 0.3
P(T1) = 0.2,  P(T2) = 0.3,  P(T3) = 0.5
(a) Using the law of total probability,
P(S) = P(S|T1) P(T1) + P(S|T2) P(T2) + P(S|T3) P(T3)
= 0.7 × 0.2 + 0.4 × 0.3 + 0.3 × 0.5 = 0.41
(b) Using Bayes’ formula,
P(T1|S) = P(S|T1) P(T1) / P(S) = (0.7 × 0.2) / 0.41 = 0.341
P(T2|S) = P(S|T2) P(T2) / P(S) = (0.4 × 0.3) / 0.41 = 0.293
P(T3|S) = P(S|T3) P(T3) / P(S) = (0.3 × 0.5) / 0.41 = 0.366
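The lamp example can be reproduced with a small general-purpose routine. This is a sketch only: bayes_posteriors is a name introduced here, and the lists are assumed to be given in the same order (type 1, type 2, type 3).

```python
def bayes_posteriors(priors, likelihoods):
    """Return (P(E), [P(F_i | E)]) from P(F_i) and P(E | F_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]   # P(E F_i)
    evidence = sum(joint)                                  # P(E), total probability
    return evidence, [j / evidence for j in joint]

priors = [0.2, 0.3, 0.5]        # P(T1), P(T2), P(T3)
likelihoods = [0.7, 0.4, 0.3]   # P(S | T1), P(S | T2), P(S | T3)

p_s, posteriors = bayes_posteriors(priors, likelihoods)
print(round(p_s, 2), [round(p, 3) for p in posteriors])
```

The posteriors always add up to 1, which is a quick check on hand calculations of this kind.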

Example: A certain firm has plants A, B and C producing respectively
35%, 15% and 50% of the total output. The probabilities of a
non-defective product from these plants are 0.75, 0.95 and 0.85
respectively. The products from these plants are mixed together and
dispatched randomly to the customer. A customer receives a defective
product. What is the probability that it came from plant C?
Solution: Let us use symbols D for defective and ND for non-defective.
The given data can be written as,
P(ND|A) = 0.75 ⇒ P(D|A) = 0.25
P(ND|B) = 0.95 ⇒ P(D|B) = 0.05
P(ND|C) = 0.85 ⇒ P(D|C) = 0.15
with P(A) = 0.35, P(B) = 0.15 and P(C) = 0.50.
Now we need to find the probability that the item has come from C when
we know that it is defective, i.e. P(C|D). Using Bayes’ formula,
P(C|D) = P(D|C) P(C) / [P(D|A) P(A) + P(D|B) P(B) + P(D|C) P(C)]
= (0.15 × 0.5) / (0.25 × 0.35 + 0.05 × 0.15 + 0.15 × 0.5)
= 0.075 / 0.17 = 0.44
Example: A product is produced on three different machines M1, M2
and M3 with the proportion of production from these machines being
50%, 30% and 20% respectively. Past experience shows the percentage of
defectives from these machines to be 3%, 4% and 5% respectively. At
the end of the day’s production, one unit of production is selected at
random and it is found to be defective. What is the chance that it was
manufactured by machine M2?

Solution: Let M1, M2 and M3 be the events that the product is
manufactured on machines M1, M2 and M3 respectively. Let D be the
event that the item is defective. The given information can be
written as,
P(M1) = 0.5, P(M2) = 0.3, P(M3) = 0.2,
P(D|M1) = 0.03, P(D|M2) = 0.04 and P(D|M3) = 0.05
We know that the selected item is defective. Therefore, by Bayes’
theorem the probability that the item was produced on machine M2 is,
P(M2|D) = P(M2) P(D|M2) / [P(M1) P(D|M1) + P(M2) P(D|M2) + P(M3) P(D|M3)]
= (0.3 × 0.04) / (0.5 × 0.03 + 0.3 × 0.04 + 0.2 × 0.05)
= 0.012 / 0.037 = 0.324
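The plant and machine examples follow the same pattern, so one hedged sketch covers both. The function posterior and the dictionary layout are assumptions made for illustration; exact fractions are used so that the rounded answers 0.44 and 0.324 can be compared with the unrounded values.

```python
from fractions import Fraction

def posterior(priors, likelihoods, target):
    """P(target | evidence) by Bayes' formula; both inputs are dicts keyed by source."""
    evidence = sum(priors[k] * likelihoods[k] for k in priors)
    return priors[target] * likelihoods[target] / evidence

# Plants A, B, C: P(D | plant) = 1 - P(ND | plant).
plant_share = {"A": Fraction("0.35"), "B": Fraction("0.15"), "C": Fraction("0.50")}
plant_non_defective = {"A": "0.75", "B": "0.95", "C": "0.85"}
plant_defect = {k: 1 - Fraction(nd) for k, nd in plant_non_defective.items()}
p_c_given_d = posterior(plant_share, plant_defect, "C")

# Machines M1, M2, M3 with their defect rates.
machine_share = {"M1": Fraction("0.5"), "M2": Fraction("0.3"), "M3": Fraction("0.2")}
machine_defect = {"M1": Fraction("0.03"), "M2": Fraction("0.04"), "M3": Fraction("0.05")}
p_m2_given_d = posterior(machine_share, machine_defect, "M2")

print(p_c_given_d, round(float(p_c_given_d), 3))
print(p_m2_given_d, round(float(p_m2_given_d), 3))
```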

Fill in the blank:


17. If we think of the events Fi as being possible ‘hypotheses’ about
the proportionality of some subject matter, say the market shares
of competitors, then ................... ................... tells us how
these should be modified by the new evidence of the experiment,
say a market survey.

Two thousand randomly selected adults were asked if they think they
are financially better off than their parents. The following table gives
the two-way classification of the responses based on the education
levels of the persons included in the survey and whether they are
financially better off, the same, or worse off than their parents.

             Less than      High School    More than
             High School                   High School
Better off      140            450            420
Same             60            250            110
Worse off       200            300             70

1. Suppose one adult is selected at random from these 2000
adults. Find the following probabilities.
(a) P(better off and high school)
(b) P(more than high school and worse off )
2. Find the joint probability of the events “worse off” and “better
off.” Is this probability zero?
Explain why or why not.

8.9 INDEPENDENCE OF EVENTS

Two events E and F are said to be independent of each other if and
only if the following three conditions hold:
P(EF) = P(E) × P(F) (This is the most useful result.)
P(E|F) = P(E)
P(F|E) = P(F)
(When P(E) > 0 and P(F) > 0, each of these conditions implies the
other two.)
In other words, two events are independent if knowledge of the
occurrence of one event does not modify the probability of the other
event. For example, the outcome of the first toss of a coin (heads or
tails) does not affect the probability of the second toss landing
heads. Two events that are not independent are said to be dependent.
Also, if events E and F are independent, so are E and F^C.
Example: A bag contains 4 tickets numbered 112, 121, 211 and 222.
One ticket is drawn randomly. Let Ai be the event that the ith digit of
the number on the ticket is 1, with i = 1, 2, 3. Comment on the
pair-wise and mutual independence of A1, A2 and A3.

Solution: Probability of the first digit being 1 is P(A1) = 2/4 = 1/2
Probability of the second digit being 1 is P(A2) = 2/4 = 1/2
Probability of the third digit being 1 is P(A3) = 2/4 = 1/2
Now, no ticket has 1 in all three positions, so P(A1A2A3) = 0.
Also, P(A1) P(A2) P(A3) = 1/8.
Since P(A1A2A3) ≠ P(A1) P(A2) P(A3), A1, A2 and A3 are not
mutually independent. (They are dependent.)
Now, only ticket 112 has 1 as both its first and second digit, so
P(A1A2) = 1/4 and P(A2|A1) = P(A1A2) / P(A1) = 1/2.
Since P(A2|A1) = P(A2), A1 and A2 are pair-wise independent.
Similarly, P(A3|A1) = P(A3) and P(A2|A3) = P(A2). Hence, A1 and A3
as well as A2 and A3 are pair-wise independent.
Note that P(A3|A1A2) = 0 ≠ P(A3). Hence, A1, A2 and A3 together are
not mutually independent.
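The ticket example is small enough to check by listing outcomes. The sketch below treats each ticket as a string and each event Ai as a set of tickets; the helper P and the dictionary A are illustrative names.

```python
from fractions import Fraction
from itertools import combinations

tickets = ["112", "121", "211", "222"]

def P(event):
    """Probability of a set of tickets when one ticket is drawn at random."""
    return Fraction(len(event), len(tickets))

# A[i] = set of tickets whose i-th digit is 1.
A = {i: {t for t in tickets if t[i - 1] == "1"} for i in (1, 2, 3)}

pairwise = all(
    P(A[i] & A[j]) == P(A[i]) * P(A[j]) for i, j in combinations((1, 2, 3), 2)
)
triple = P(A[1] & A[2] & A[3])
mutual = pairwise and triple == P(A[1]) * P(A[2]) * P(A[3])

print(pairwise, triple, mutual)
```

Pair-wise independence holds for every pair, yet the triple product rule fails, which is why pair-wise independence alone does not establish mutual independence.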
Example: A highway has three recovery vans, namely I, II and III.
The probabilities of their availability at any time are 0.9, 0.7 and
0.8 respectively, and are independent of each other. What is the
probability that at least one recovery van will be available at any
time to attend a break-down?

Solution: Let I, II and III be the events that the vans I, II and III
are available. The probability that at least one recovery van will be
available is the probability of the union of these events. Further,
since the availabilities of the vans are independent, their joint
probability is the product of the individual probabilities. Thus,
P(I ∪ II ∪ III) = P(I) + P(II) + P(III) − P(I ∩ II) − P(I ∩ III) − P(II ∩ III) + P(I ∩ II ∩ III)
= P(I) + P(II) + P(III) − P(I) × P(II) − P(I) × P(III) − P(II) × P(III) + P(I) × P(II) × P(III)
= 0.9 + 0.7 + 0.8 − 0.63 − 0.72 − 0.56 + 0.504 = 0.994
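The same answer can be reached through the complement, which is often shorter for "at least one" questions. The sketch below is an alternative route, not the method of the worked solution; it uses Proposition 1 together with independence of the three vans and then repeats the inclusion-exclusion sum as a check.

```python
import math

availability = {"I": 0.9, "II": 0.7, "III": 0.8}

# Complement route: "at least one available" is the complement of
# "none available"; independence lets us multiply the three misses.
p_none = math.prod(1 - p for p in availability.values())
p_at_least_one = 1 - p_none

# Inclusion-exclusion route, as in the worked solution.
a, b, c = availability.values()
p_union = a + b + c - a * b - a * c - b * c + a * b * c

print(round(p_at_least_one, 3), round(p_union, 3))
```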


Example: In a certain examination, results show that 20% of students
failed in P & C, 10% failed in Data Structure, while 5% failed in both
P & C and Data Structure. Are the two events ‘failing in P & C’ and
‘failing in Data Structure’ independent?
Solution: Let ‘A’ denote failing in P & C and ‘B’ denote failing in Data
Structure. The given data is,
P(A) = 0.2, P(B) = 0.1, P(AB) = 0.05
Now, P(A) × P(B) = 0.2 × 0.1 = 0.02, so P(AB) ≠ P(A) × P(B).
Hence, the two events ‘failing in P & C’ and ‘failing in Data
Structure’ are not independent.

Fill in the blanks:


18. Two events are ..................., if knowledge of occurrence of one
event does not modify probability of the other event.
19. Two events that are not independent are said to be ....................


I go to my friend.
He tells me, “I have two children”. What is the probability that my
friend has a son?
As I sit down, one girl comes in and offers me a glass of water. My
friend says, “Please meet my daughter”. Now, what is the probability
that my friend has a son?
After I thank her for water, my friend adds, “She is ‘Didi’ or ‘Tai’
(meaning the elder child)”. Now, what is the probability that my
friend has a son?
After some time one boy enters. My friend introduces him as his
son. Now, what is the probability that my friend has a son?

Suppose we toss a six-faced die and call the appearance of an even
number the event A and the appearance of an odd number the event B.
Now, suppose that in the first toss we get an even number. If we toss
the die a second time, we can still get an even or an odd number, and
their chances are not influenced by the result of the first trial.
Thus, the appearance of an even number in the first trial and the
appearance of an even number in the second trial is an example of
independent events.

8.10 COMBINATORIAL CONCEPT


Combinatorial concepts are useful in calculating probability of the
event, particularly when the problem can be solved by classical
probability theory. Hence we will briefly state some of the commonly
used rules of combinatorial analysis.

8.10.1 PRODUCT RULE OF COUNTING


Suppose that a procedure can be broken down into a sequence of two
tasks. If there are n1 ways to do the first task and n2 ways to do the
second task after the first task has been done, then there are
(n1 × n2) ways to do the procedure. In general, suppose r experiments
are to be performed such that the first experiment can result in n1
outcomes; having completed the first experiment, the second experiment
can result in n2 outcomes; similarly, the third experiment can result
in n3 outcomes, and so on. Then there is a total of
n1 × n2 × n3 × … × nr possible outcomes of the r experiments. When the
logical AND is used to indicate successive experiments, the ‘Product
Rule’ is applicable. For example, how many outcomes are there if we
toss a coin and then throw a die? The answer is 2 × 6 = 12.

8.10.2 SUM RULE OF COUNTING
If one task can be done in n1 ways and other task can be done in n2
ways and if these tasks cannot be done at the same time, then there are
(n1 + n2) ways of doing one of these tasks (either one task or the other).
When logical OR is used in deciding outcomes of the experiment and
events are mutually exclusive then the ‘Sum Rule’ is applicable.
For example, an urn contains 10 balls of which 5 are white, 3 black
and 2 red. If we select one ball randomly, how many ways are there
that the ball is either white or red? Answer is 5 + 2 = 7. Note that the
sum rule is nothing but the Axiom 3.

8.10.3 PERMUTATION

A permutation of a set of distinct objects is an ordered arrangement
of these objects. An ordered arrangement of r elements of a set is
called an r-permutation.
The number of r-permutations of a set with n elements, where n is a
non-negative integer and 0 ≤ r ≤ n, equals
$$P(n, r) = n \times (n-1) \times (n-2) \times \cdots \times (n-r+1) = \frac{n!}{(n-r)!}$$
This is also the number of ways of drawing r items one after the other
from a set of n items without replacing the items drawn. For example,
the number of ways of drawing three cards one after the other from a
pack without replacement is
$$P(52, 3) = \frac{52!}{(52-3)!} = 52 \times 51 \times 50 = 132600$$
The number of r-permutations can also be written as nPr.
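The permutation formula can be checked numerically. The sketch below is a minimal illustration; the helper name `n_permutations` is an assumption made for this example. It evaluates n!/(n − r)! for the card example and compares the formula with a brute-force listing on a small set.

```python
from itertools import permutations
from math import factorial


def n_permutations(n, r):
    """Number of r-permutations of n distinct objects: n! / (n - r)!."""
    return factorial(n) // factorial(n - r)


cards_3 = n_permutations(52, 3)
print(cards_3)  # 132600

# Brute-force check on a small set: ordered arrangements of 3 out of 5 items.
listed = len(list(permutations(range(5), 3)))
print(listed, n_permutations(5, 3))  # 60 60
```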
Permutations with Indistinguishable Objects
The number of different permutations of n objects, where there are n1
indistinguishable objects of type 1, n2 indistinguishable objects of
type 2, …, and nk indistinguishable objects of type k (so that
n1 + n2 + … + nk = n), is
$$C(n; n_1, n_2, \ldots, n_k) = \frac{n!}{n_1!\, n_2! \cdots n_k!}$$
This is also called an ordered partitioning or a multinomial
coefficient.
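A small sketch may help to see the multinomial coefficient at work. The word used below is an illustrative choice that does not appear in the text, and the function name `multinomial` is an assumption made for this example.

```python
from collections import Counter
from math import factorial


def multinomial(counts):
    """n! / (n1! n2! ... nk!) where n is the sum of the counts."""
    result = factorial(sum(counts))
    for c in counts:
        result //= factorial(c)
    return result


word = "MISSISSIPPI"                     # illustrative example only
counts = list(Counter(word).values())    # M:1, I:4, S:4, P:2
arrangements = multinomial(counts)
print(arrangements)  # 34650 distinct arrangements of the letters
```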

8.10.4 COMBINATION

An r-combination of a set is an unordered selection of r elements
from the set of n items. Thus, an r-combination is simply a subset
of r elements, taken from a set with n elements. The number of
r-combinations of a set with n elements, where n is a non-negative
integer and 0 ≤ r ≤ n, equals
$$C(n, r) = \frac{P(n, r)}{r!} = \frac{n!}{r!\,(n-r)!}$$
C(n, r) is also called a binomial coefficient, since it appears as a
coefficient in the binomial expansion of $(a + b)^n$. Note that the
number of r-combinations is also written as nCr or $\binom{n}{r}$.
Combinations with Repetition
The number of r-combinations of a set with n elements, when repetition
of elements is allowed, equals
$$\binom{n+r-1}{r} = \frac{(n+r-1)!}{r!\,(n-1)!}$$
For example, if we have to select 6 ice-creams (r) from 4 available
flavours (n), it can be done in
$$C(4+6-1, 6) = C(9, 6) = \frac{9!}{6!\,3!} = \frac{9 \times 8 \times 7}{3 \times 2 \times 1} = 84 \text{ ways.}$$
This is also the number of ways of distributing r identical objects in n
boxes where an empty box is allowed.
Further, it also gives the number of non-negative integer solutions of
the equation
$$x_1 + x_2 + \cdots + x_n = r$$
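The ice-cream example can be verified in two ways: by the formula and by listing every selection. The sketch below is illustrative; the variable names are assumptions, and flavours are represented simply by the numbers 0 to 3.

```python
from itertools import combinations_with_replacement
from math import comb

n_flavours, r_scoops = 4, 6

# Formula: C(n + r - 1, r)
by_formula = comb(n_flavours + r_scoops - 1, r_scoops)

# Listing: every unordered selection of 6 scoops with repetition allowed
by_listing = sum(
    1 for _ in combinations_with_replacement(range(n_flavours), r_scoops)
)

print(by_formula, by_listing)  # 84 84
```

Each listed selection also corresponds to one non-negative integer solution of x1 + x2 + x3 + x4 = 6, where xi is the number of scoops of flavour i.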
Solved Examples
Example: In a triangular series the probability of the Indian team
winning its match with Pakistan is 0.7 and that with Australia is 0.4.
If the probability of India winning both matches is 0.3, what is the
probability that India will win at least one match so that it can enter
the final?
Solution: Let A be the event that India wins the match with Pakistan
and B the event that India wins the match with Australia. We are given
P(A) = 0.7, P(B) = 0.4 and P(A ∩ B) = 0.3.
Therefore, the probability that India will win at least one match is,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 0.7 + 0.4 − 0.3 = 0.8
Example: What is the probability that a hand of 13 cards dealt from a
shuffled pack of 52 cards contains exactly 2 kings and 1 ace?
Solution: Out of the 13 cards, the 2 kings must come from the 4 kings in
$\binom{4}{2}$ ways, the 1 ace must come from the 4 aces in
$\binom{4}{1}$ ways, and the remaining 10 cards must come from the 44
non-king, non-ace cards in $\binom{44}{10}$ ways. Thus, by the product
rule, the required probability of a hand of 13 containing exactly
2 kings and 1 ace is,
$$\frac{\binom{4}{2}\binom{4}{1}\binom{44}{10}}{\binom{52}{13}} = 0.09378$$
Example: In a dairy, milk is filled in sachets of 500 gm by machines
A, B and C, which account for 25%, 35% and 40% of the total output
respectively. It is also found that 5, 4 and 2 per cent of the sachets
filled by machines A, B and C respectively are either overfilled or
underfilled. A government inspector made a random check, found that the
sachet was underfilled, and booked a case against the dairy. What are
the probabilities that it was filled by machine A, B and C?
Solution: Given: P(A) = 0.25, P(B) = 0.35, P(C) = 0.4
If we indicate underfill or overfill as D (defective),
P(D|A) = 0.05, P(D|B) = 0.04, P(D|C) = 0.02
Now, we have to find P(A|D), P(B|D) and P(C|D) respectively.
By Bayes’ theorem, the probability that it was filled by machine A is,
$$P(A \mid D) = \frac{P(D \mid A)P(A)}{P(D \mid A)P(A) + P(D \mid B)P(B) + P(D \mid C)P(C)} = \frac{0.05 \times 0.25}{0.05 \times 0.25 + 0.04 \times 0.35 + 0.02 \times 0.4} = \frac{0.0125}{0.0345} = 0.362$$
Similarly,
$$P(B \mid D) = \frac{P(D \mid B)P(B)}{P(D \mid A)P(A) + P(D \mid B)P(B) + P(D \mid C)P(C)} = \frac{0.04 \times 0.35}{0.05 \times 0.25 + 0.04 \times 0.35 + 0.02 \times 0.4} = \frac{0.014}{0.0345} = 0.406$$
Also,
$$P(C \mid D) = \frac{P(D \mid C)P(C)}{P(D \mid A)P(A) + P(D \mid B)P(B) + P(D \mid C)P(C)} = \frac{0.02 \times 0.4}{0.05 \times 0.25 + 0.04 \times 0.35 + 0.02 \times 0.4} = \frac{0.008}{0.0345} = 0.232$$
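The dairy example follows the same two steps for every machine: find P(D) by the law of total probability, then divide each product P(D|machine) × P(machine) by P(D). The sketch below repeats that calculation; the dictionary names are assumptions made for illustration.

```python
# Share of total output per machine, P(machine)
priors = {"A": 0.25, "B": 0.35, "C": 0.40}
# Fraction of wrongly filled sachets per machine, P(D | machine)
defect_rates = {"A": 0.05, "B": 0.04, "C": 0.02}

# Law of total probability: P(D) = sum of P(D | machine) * P(machine)
p_defect = sum(priors[m] * defect_rates[m] for m in priors)

# Bayes' theorem: P(machine | D) = P(D | machine) * P(machine) / P(D)
posteriors = {m: priors[m] * defect_rates[m] / p_defect for m in priors}

for machine, p in posteriors.items():
    print(machine, round(p, 3))
# A 0.362
# B 0.406
# C 0.232
```

The three posterior probabilities add up to 1, which is a useful check on the arithmetic.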


Fill in the blanks:


20. When the logical ................... is used to indicate successive
experiments then, the ‘Product Rule’ is applicable.
21. When logical ................... is used in deciding outcomes of the
experiment and events are mutually exclusive then the ‘Sum
Rule’ is applicable.
22. A ................... of a set of distinct objects is an ordered
arrangement of these objects.
23. An ................... of a set is an unordered selection of r elements
from the set of n items.

In a Monster.com online poll during January 14–21, 2001,
respondents were asked the question, “Which is the best job at the
Super Bowl?” (USA TODAY, February 1, 2002). There were a total
of 30,270 (self-selected) responses. The most popular response was
“player” with 11,715 votes, and the second most popular response
was “announcer/reporter” with 9,982 votes. If one of the 30,270
responses is selected at random, what is the probability that the
vote was for “player” or “announcer/reporter”?
Explain why this probability is not equal to 1.0.

To calculate the probability of the union of two events A and B, we


add their marginal probabilities and subtract their joint probability
from this sum. We must subtract the joint probability of A and
B from the sum of their marginal probabilities to avoid double
counting because of common outcomes in A and B.

8.11 SUMMARY
• In this chapter, we discussed the basic idea of probability. We
defined probability in different ways and pointed out the serious
limitations of each definition.
• Then we discussed the axioms of probability, which are the backbone
of the theory of probability, and studied a number of useful
propositions of probability.
• We also defined conditional probability, the law of total
probability and Bayes’ Theorem, as well as mutually exclusive events
and independence of events.
• Lastly, we discussed a few important concepts of combinatorial
analysis, which come in very handy while calculating the probability
of an event.

• Probability: A measure of the degree to which an event is likely to
occur; that is, the chance that the event happens.
• Event: A collection of one or more outcomes of an experiment.
• Experiment: A process with well-defined outcomes that, when
performed, results in one and only one of the outcomes per
repetition.
• Mutually Exclusive Events: Two or more events that do not contain
any common outcome and, hence, cannot occur together.
• Additive Rule: A property of probability stating that the
probability that at least one of two events occurs is equal to the
probability of the first event occurring, plus the probability of the
second event occurring, minus the probability that both events occur
at the same time.
• Multiplicative Rule: The probability of two independent events
occurring simultaneously is the product of the individual
probabilities.
• Conditional Probability: The probability of event (A) given that
event (B) has already occurred.
• Independent Events: Two events for which the occurrence of one does
not change the probability of the occurrence of the other.

8.12 DESCRIPTIVE QUESTIONS


1. Define Random Experiment, Sample space, Event, Mutually
exclusive events and collectively exhaustive events.
2. What are the four different types of probability? Explain in brief.
3. Discuss in brief simple propositions of probability which are used
in the statistical problems.
4. Write a short note on the addition theorem of probability.
5. Explain in brief multiplicative theorem of probability with few
examples.
6. What is conditional probability? Discuss with an example.
7. How will you define law of total probability?
8. Explain Bayes’ formula with an example.
9. What do you understand by independence of events? Give one
example.
10. Discuss all the combinatorial concepts which are used in
probability.

EXERCISE FOR PRACTICE
1. Consider an experiment of rolling a fair die. Let event A be that
an even number appears on the upper face, and event B be that the
number on the upper face is greater than 3. Find the probability
that the number appearing on the upper face belongs to either
event A or event B.
2. Three balls are randomly selected without replacement from a
bag containing 20 balls numbered 1 through 20. If we bet that
at least one of the balls has a number greater than or equal to 17,
what is the probability that we will win the bet?
3. A bag contains 4 white and 2 black balls. Another bag contains 3
white and 5 red balls. One ball is drawn from each bag. What is
the probability that they are of different colours?
4. An office has three Xerox machines X1, X2 and X3. The
probabilities that on a given day machines X1, X2 and X3 work
are 0.60, 0.75 and 0.80 respectively; that both X1 and X2 work is
0.50; that both X1 and X3 work is 0.40; and that both X2 and X3
work is 0.70. The probability that all of them work is 0.25. Find
the probability that on a given day at least one of the three
machines works.
5. A factory has 65% male workers. 70% of the total workers
are married. 47% of the total workers are married males. Find the
probability that a worker chosen randomly is,
(i) A married female.  (ii) Male or married or both.

8.13 ANSWERS AND HINTS


ANSWERS FOR SELF ASSESSMENT QUESTIONS
Topic                                   Q. No.   Answers
Important Terms in Probability          1.       Random
                                        2.       Event
                                        3.       Mutually exclusive
Kinds of Probability                    4.       True
                                        5.       False
                                        6.       False
Simple Propositions of Probability      7.       1 – P (E)
                                        8.       Union
                                        9.       Disjoint
Addition Theorem of Probability         10.      Addition
                                        11.      P (A) + P (B) – P (A) . P (B)
Multiplication Theorem of Probability   12.      True
                                        13.      False
                                        14.      True
Conditional Probability                 15.      Conditional
                                        16.      P (E | F) × P (F)
Law of Total Probability                17.      Bayes’ formula
Independence of Events                  18.      Independent
                                        19.      Dependent
Combinatorial Concept                   20.      AND
                                        21.      OR
                                        22.      Permutation
                                        23.      r-combination


HINTS FOR DESCRIPTIVE QUESTIONS
1. Refer Section 8.2
A random experiment is an experiment whose outcome is not
predictable in advance.
One or more possible outcomes that belong to a certain category of
our interest are called an event. A subset E of the sample space S
is an event. In other words, an event is a favorable outcome.

2. Refer Section 8.3


There are four kinds of approaches to the probability. Whatever
is the approach, same set of mathematical rules, theorems and
postulates hold for manipulating and analyzing probability.
They are Classical Probability, Relative Frequency Probability,
Subjective Probability, and Axiomatic Probability.
3. Refer Section 8.4
Proposition 1: $P(E^c) = 1 - P(E)$
Proposition 2: If E ⊂ F, then P (E) ≤ P (F)
Proposition: P (E ∪ F) = P (E) + P (F) – P (E ∩ F)
Proposition: P (E ∪ F) = P (E) + P (F), when E and F are mutually exclusive
Proposition: $P(E^c \cap F) = P(F) - P(E \cap F)$
4. Refer Section 8.5
Let A and B be two events defined in a sample space. The union
of events A and B is the collection of all outcomes that belong
either to A or to B or to both A and B and is denoted by A or B.
The result of this addition theorem is generally written using set
notation as P (A ∪ B) = P (A) + P (B) – P (A ∩ B).

5. Refer Section 8.6
The basic form of the multiplication theorem of probability for two
events ‘x’ and ‘y’ can be stated as,
P (x . y) = P (y) . P (x / y)
Here P (x) and P (y) are the probabilities of occurrence of events
‘x’ and ‘y’ respectively.
P (x / y) is the conditional probability of ‘x’, the condition being
that ‘y’ has occurred before ‘x’.
P (x / y) is always calculated after ‘y’ has occurred. Here, the
occurrence of ‘x’ depends on ‘y’: the occurrence of ‘y’ has already
changed the set of possible outcomes, so the probability of
occurrence of ‘x’ also changes.
6. Refer Section 8.7
Conditional probability is the probability that an event will occur
given that another event has already occurred. If A and B are two
events, then the conditional probability of A given B is written as
P (A/B) and read as “the probability of A given that B has already
occurred.”
7. Refer Section 8.8
Consider two events, E and F. Whatever the events may be, we can
always say that the probability of E is equal to the probability of
the intersection of E and F, plus the probability of the intersection
of E and the complement of F. That is,
$$P(E) = P(E \cap F) + P(E \cap F^c)$$
8. Refer Section 8.8.1
Let E and F be events. Then
$$E = (E \cap F) \cup (E \cap F^c)$$
since any element of E must be either in both E and F, or in E but not
in F. The events $(E \cap F)$ and $(E \cap F^c)$ are mutually exclusive,
since the former must be in F and the latter must not be in F. Hence,
by Axiom 3,
$$P(E) = P(E \cap F) + P(E \cap F^c) = P(E \mid F)\,P(F) + P(E \mid F^c)\,P(F^c) = P(E \mid F)\,P(F) + P(E \mid F^c)\,[1 - P(F)]$$
9. Refer Section 8.9
Two events are said to be independent of each other if and only if
the following three conditions hold:
P(EF) = P(E) × P(F) (This is the most useful result.)
P(E/F) = P(E)
P(F/E) = P(F)

10. Refer Section 8.10
Combinatorial concepts are useful in calculating probability
of the event, particularly when the problem can be solved by
classical probability theory. They are the product rule of counting,
the sum rule of counting, permutation and combination.

ANSWERS FOR EXERCISE FOR PRACTICE


1. P(A) + P(B) – P(AB) = 2/3
2. 0.509
3. 13/48
4. 0.8
5. 0.23, 0.88

8.14 SUGGESTED READINGS FOR REFERENCE


SUGGESTED READINGS
• Apte, D.P., Probability and Combinatorics, Excel Books, 2007.
• Gordon, G., and Pressman, I., Quantitative Decision Making for
Business, New Delhi: National Publishing House, 1983.
• Lapin, L., Quantitative Methods for Business Decisions, New
York: Harcourt Brace Jovanovich, Inc., 1976.
• Apte, D.P., Probability and Statistics, Excel Books, 2008.
• Dey, B.R., Text Book of Managerial Statistics, Macmillan India
Ltd, 2005.
• Ross, Sheldon, A First Course in Probability, Pearson Education,
2003.
• Sharma, K.V.S., Statistics Made Simple, Prentice Hall of India, 2002.
• Loomba, M.P., Management – A Quantitative Perspective,
MacMillan Publishing Company, New York, 1978.
• Kothari, C.R., Quantitative Techniques, Vikas Publication.

E-REFERENCES
• http://math.berkeley.edu/~isammis/55.S08/55PS7.pdf
• http://webbut.unitbv.ro/bulletin/Series%20II/BULETIN%20II/07-Pacurar.pdf
• http://www.shmoop.com/basic-statistics-probability/and-or-probability-exercises-3.html


CHAPTER 9

PROBABILITY DISTRIBUTION

CONTENTS
9.1 Introduction


9.2 Random Variable
9.2.1 Discrete and Continuous Random Variables
9.2.2 Probability Mass Function (p.m.f.)
9.2.3 Probability Density Function
9.2.4 Cumulative Distribution Function
9.2.5 Expectation Value of Random Variables
9.2.6 Expected Value of a Function of a Random Variable
9.2.7 Variance and Standard Deviation of Random Variable


9.3  Probability Distributions of Standard Random Variables
9.4 Bernoulli Distribution
9.4.1 Application of Bernoulli Distribution
9.5 Binomial Distribution
9.5.1 Applications of Binomial Distribution
9.6 Poisson Distribution
9.7 Normal Distribution
9.7.1 Equation for Normal Probability Curve
9.7.2 Standard Normal Distribution
9.7.3 Properties of Normal Distribution
9.7.4 Areas Under Standard Normal Probability Curve
9.7.5 Importance of Normal Distribution
9.8 Summary
9.9 Descriptive Questions
9.10 Answers and Hints
9.11 Suggested Readings for Reference


INTRODUCTORY CASELET

BASEBALL PLAYERS HAVE “SLUMPS” AND “STREAKS”

Going “0 for July,” as former infielder Bob Aspromonte once put


it, is enough to make a baseball player toss out his lucky bat or
start seriously searching for flaws in his hitting technique. But the
culprit is usually just simple mathematics.
Statistician Harry Roberts of the University of Chicago’s Graduate
School of Business studied the records of major-league baseball
players and found that a batter is no more likely to hit worse when
he is in a slump than when he is in a hot streak. The occurrences
of hits followed the same pattern as purely random events such
as pulling marbles out of a hat. If there were one white marble
and three black ones in the hat, for example, then a white marble
would come out about one quarter of the time – a .250 average. In
the same way, a player who hits .250 will in the long run get a hit
once in every four times at bat.
But that doesn’t mean the player will hit the ball exactly every
fourth time he comes to the plate – just as it’s unlikely that the
white marble will come out exactly every fourth time.
Even a batter who goes hitless 10 times in a row might safely be
able to pin the blame on statistical fluctuations. The odds of pulling
a black marble out of a hat 10 times in a row are about 6 percent –
not a frequent occurrence, but not impossible, either. Only in the
long run do these statistical fluctuations even out.


If we assume a player hits .250 in the long run, the probability
that this player does not hit during a specific trip to the plate is
.75. Hence, we can calculate the probability that he goes hitless 10
times in a row as follows.
P (hitless 10 times in a row) = (.75) (.75) . . . (.75) ten times
= $(.75)^{10}$ = .0563
Note that each trip to the plate is independent and the probability
that a player goes hitless 10 times in a row is given by the intersection
of 10 hitless trips. This probability has been rounded off to “about
6%” in this illustration.
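The figure quoted in the caselet is easy to reproduce. The sketch below simply raises the probability of a hitless trip to the tenth power, relying on the independence assumption stated above; the variable names are illustrative.

```python
p_no_hit = 0.75          # probability of no hit on one trip to the plate
trips = 10

# Independent trips: multiply the probability of "no hit" ten times.
p_hitless_streak = p_no_hit ** trips

print(round(p_hitless_streak, 4))  # 0.0563
```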


After studying this chapter, you should be able to:
• Differentiate between discrete and continuous random variables
• Discuss probability distributions of standard random variables
• Understand discrete probability distributions, which include the
Binomial and Poisson distributions
• Explain continuous probability distributions, which include the
Normal distribution

9.1 INTRODUCTION
Frequently, we are more interested in some function of the outcome
of an experiment or process than in the actual outcome itself. For
example, an expressway safety service may be more interested in the
probability that a particular number of accidents takes place on
a day than in the details of each accident. Or, in an experiment of
tossing a coin four times, we may be interested in the total number of
heads that occur (if we have called or bet on heads, say) and not care
at all about the actual sequence of results. These quantities of
interest are known as random variables. In statistics, we are also
interested in the probabilities associated with the values of a random
variable so that we can take decisions under risk. A number of
theoretical random variables and their distributions have been
analyzed. Many real-life situations can be approximated by these
distributions, which can then be used for decision-making. In other
cases, we have to plot the actual data as a distribution. We will study
a few common distributions in this chapter. The normal distribution has
extensive use in statistical tools and therefore readers are advised to
study it in detail. Knowledge of sequences, series and calculus is
expected.

9.2 RANDOM VARIABLE


Random variable is a real valued function defined over a sample space.
Since it is over a sample space, probability is associated with each
value of the random variable. The value of the random variable is a
value of the function related to the outcome of an experiment. Random
variables are neither ‘random’ nor ‘variable’: their possible values
and the associated probabilities are known. A random variable is
actually a function giving a correspondence between a point in the
sample space and a value of the random variable. This allows us to
determine the probabilities associated with the values of the random
variable.

A random variable, usually written X, is a variable whose possible


values are numerical outcomes of a random phenomenon. 

For example, consider an experiment of tossing an unbiased coin
four times, where our favorable event is the number of heads. (Note the
similarity of this with a real-life experiment of picking fuses out of
a box when the probability of a fuse being serviceable is 0.5.) There
are $2^4 = 16$ possible outcomes, namely TTTT, TTTH, TTHT,
THTT, HTTT, TTHH, THTH, THHT, HTHT, HTTH, HHTT, THHH,
HTHH, HHTH, HHHT, HHHH. Let our random variable X be the number
of heads. It can be seen that the random variable can take the values
0, 1, 2, 3 and 4. Since all 16 outcomes are equally likely, each has
probability 1/16. Now, counting the outcomes that give a particular
value of the random variable, we can calculate the probability
associated with that value. The rule that assigns probabilities to the
different values of a random variable is called the probability
distribution of the random variable. In our example of tossing a coin
four times, the probability distribution is as follows:

Value of random variable, Xi    0      1      2      3      4      Total
Probability, P {X = Xi}         1/16   4/16   6/16   4/16   1/16   1

Note that the sum of all probabilities is 1. This is always true for any
probability distribution, according to ‘Axiom 2’ for a probability
space.
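The counting argument for the four-toss example can be reproduced by listing all 16 outcomes and counting heads in each. The sketch below is illustrative, and the variable names are assumptions; exact fractions are used so that the probabilities match the table.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2**4 = 16 equally likely sequences of four tosses
outcomes = list(product("HT", repeat=4))

# Random variable X = number of heads in each sequence
heads_count = Counter(seq.count("H") for seq in outcomes)

# Probability distribution: favourable sequences / total sequences
distribution = {
    x: Fraction(n, len(outcomes)) for x, n in sorted(heads_count.items())
}

for x, p in distribution.items():
    print(x, p)   # fractions print in lowest terms, e.g. 4/16 as 1/4

print(sum(distribution.values()))  # 1
```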

9.2.1 DISCRETE AND CONTINUOUS RANDOM VARIABLES


There are two types of random variables, discrete and continuous.

A  discrete random variable is one which may take on only a


countable number of distinct values such as 0, 1, 2, 3, 4…
Discrete random variables are usually (but not necessarily) counts. If a
random variable can take only a finite number of distinct values, then
it must be discrete. Examples of discrete random variables include
the number of children in a family, the Friday night attendance at a
cinema, the number of patients in a doctor’s surgery, the number of
defective light bulbs in a box of ten.
The  probability distribution of a discrete random variable is a list
of probabilities associated with each of its possible values. It is also
sometimes called the probability function or the probability mass
function.

A continuous random variable is one which can take an uncountably
infinite number of possible values, typically every value in an
interval of real numbers. Continuous random variables are usually
measurements.
Examples include height, weight, the amount of sugar in an orange,
the time required to run a mile.

A continuous random variable is not defined at specific values.
Instead, it is defined over an interval of values, and is represented by
the area under a curve (in advanced mathematics, this is known as
an integral). The probability of observing any single value is equal to
0, since the number of values which may be assumed by the random
variable is infinite.
Suppose a random variable X may take all values over an interval of
real numbers. Then the probability that X is in the set of outcomes A,
P (A), is defined to be the area above A and under a curve. The curve,
which represents a function p(x), must satisfy the following:
• The curve has no negative values (p(x) ≥ 0 for all x).
• The total area under the curve is equal to 1.
A curve meeting these requirements is known as a density curve.

9.2.2 PROBABILITY MASS FUNCTION (P.M.F.)

A random variable that can take a countable number of possible
values (including a countably infinite number) is said to be discrete.
For a discrete random variable, the ‘probability mass function’
(p.m.f.) is defined as,
p (a) = P {X = a}
The p.m.f. must be non-negative and satisfy the axioms of probability.
The p.m.f. could be imagined as masses, equal to the probability values
p (xi), placed at the points xi. An example of a discrete random
variable is the number of typing mistakes on a page of a book. Its
values are at most countable. The probability distribution is a
tabulation of the values xi and p (xi).
Example: Let the random variable X be the sum of the numbers on
top faces of two dice rolled. Find probability mass function (p.m.f.) of
this discrete random variable. Also plot it as a graph.
Solution: Probability distribution of this discrete random variable is
as follows.

X = xi    2      3      4      5      6      7      8      9      10     11     12
P(xi)     1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36
The graph of the p.m.f. of this random variable is given below.

[Figure: Probability distribution (p.m.f.) of the random variable X. Horizontal axis: X = xi, from 2 to 12. Vertical axis: P(X = xi), from 0.00 to 0.20.]
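The p.m.f. of the sum of two dice can be obtained by listing all 36 equally likely pairs of faces. The sketch below is an illustration with assumed variable names; it uses exact fractions, which Python prints in lowest terms (for example 6/36 appears as 1/6).

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) pairs
pairs = list(product(range(1, 7), repeat=2))

# Count how many pairs give each total from 2 to 12
counts = Counter(a + b for a, b in pairs)

# p.m.f.: p(x) = (number of pairs with sum x) / 36
pmf = {x: Fraction(counts[x], len(pairs)) for x in sorted(counts)}

for x, p in pmf.items():
    print(x, p)
```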

9.2.3 PROBABILITY DENSITY FUNCTION
There also exist random variables whose set of possible values is
uncountable. The time taken to service a customer and the time between
accidents on an expressway are two such examples. X is a continuous
random variable if there exists a non-negative function f(x), defined
for all real values of x, having the property that for any set B of
real numbers,
$$P(X \in B) = \int_{B} f(x)\,dx$$
The function f(x) is called the probability density function of the
random variable X. Again note that f(x) must satisfy the axioms of
probability.

9.2.4 CUMULATIVE DISTRIBUTION FUNCTION


Another useful concept is the cumulative distribution function (c.d.f.),
or just the distribution function. It is defined as the sum of all
probabilities for the values of the random variable less than or equal
to the specified value. Obviously, the c.d.f. at infinity is equal to
one, as per Axiom 2.
The cumulative distribution function (c.d.f.) for a discrete random
variable is given by
$$F(a) = P(X \le a) = \sum_{x_i \le a} p(x_i)$$
The cumulative distribution function (c.d.f.) for a continuous random
variable is given by
$$F(a) = P(X \le a) = \int_{-\infty}^{a} f(x)\,dx$$

Example: A random variable is the number of tails when a coin is
flipped thrice. Find the probability distribution of the random
variable.
Solution: The sample space is HHH, THH, HTH, HHT, TTH, THT, HTT,
TTT.
The required probability distribution is,

Value of random variable, X = xi    0     1     2     3
Probability, P(X = xi)              1/8   3/8   3/8   1/8

Example: Let the random variable X be the sum of the numbers on
the top faces of two dice rolled. Find the cumulative distribution
function (c.d.f.) of this discrete random variable. Also plot it as a
graph.
Solution: The cumulative distribution of this discrete random variable
is as follows.

X = xi             2      3      4      5       6       7       8       9       10      11      12
F(a) = P(X ≤ a)    1/36   3/36   6/36   10/36   15/36   21/36   26/36   30/36   33/36   35/36   36/36

The graph of the c.d.f. of this random variable is given below.

[Figure: Cumulative distribution function (c.d.f.) of the random variable X. Horizontal axis: X = xi, from 2 to 12. Vertical axis: F(a) = P(X ≤ xi), from 0 to 1.2.]
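A c.d.f. for a discrete random variable is a running total of the p.m.f. The sketch below, with assumed variable names, rebuilds the p.m.f. of the sum of two dice and accumulates it; fractions are printed in lowest terms, so 21/36 appears as 7/12.

```python
from collections import Counter
from fractions import Fraction
from itertools import accumulate, product

# p.m.f. of the sum of two dice, built from the 36 equally likely pairs
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
totals = sorted(counts)
pmf = [Fraction(counts[t], 36) for t in totals]

# c.d.f.: F(a) = sum of p(x_i) over all x_i <= a
cdf = dict(zip(totals, accumulate(pmf)))

for t in totals:
    print(t, cdf[t])
```

The last value of the running total is 1, in line with the remark that the c.d.f. reaches one once every possible value has been included.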

9.2.5 EXPECTATION VALUE OF RANDOM VARIABLES
One of the most important concepts in probability theory is that of the
expectation of a random variable. For example, consider the random
variable X as next month’s demand for our product, say a luxury car.
Then we would have different values of X along with their associated
probabilities, as given below.

Demand for cars, xi               1000   1500   2000   2500   3000
Probability, P {X = xi} = p(xi)   0.1    0.2    0.3    0.3    0.1
Product, xi × p(xi)               100    300    600    750    300
Expected value, ∑ xi × p(xi) = 2050

Now, to plan the monthly production we need a specific quantity
that would best serve our planning. This quantity is the ‘Expected
Value’ of demand. If X is the random variable, then the expected value
of X is denoted by E [X] and given by:
For a discrete random variable,
$$E[X] = \sum_{\text{all } i} x_i\, p(x_i)$$
where p (xi) is the p.m.f.
For a continuous random variable,
$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$$
where f(x) is the p.d.f.

In other words, the expected value of X is a weighted average of all
possible values of X, the weights being the associated probabilities.
Thus, the weighted mean of the probability distribution is called the
expected value of the random variable. The term ‘expected’ is used
because of the associated probability (risk). It is a measure of
‘central tendency’, the mean of the probability distribution. Hence,
μ = E[X]
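The expected demand in the luxury car example is a probability-weighted average and can be computed in one line. The sketch below uses assumed variable names and the figures from the demand table.

```python
demand = [1000, 1500, 2000, 2500, 3000]
probability = [0.1, 0.2, 0.3, 0.3, 0.1]

# The probabilities of a distribution must add up to 1.
assert abs(sum(probability) - 1.0) < 1e-9

# E[X] = sum of x_i * p(x_i) over all i
expected_demand = sum(x * p for x, p in zip(demand, probability))

print(round(expected_demand))  # 2050
```

Note that 2050 is not itself one of the possible demand values; the expected value is a long-run average, not a prediction of a single month.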

9.2.6 EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE
It is possible to compute the expected value of a function of a random
variable. Let g(X) be a function of the random variable X. Since g(X)
itself is a random variable, it has probability distribution associated
with it, which can be determined from the probability distribution of
X. Once we have determined the probability distribution of g(X), we
can then compute the expected value of g(X) as,
For a discrete random variable,
$$E[g(X)] = \sum_{\text{all } i} g(x_i)\, p(x_i)$$
For a continuous random variable,
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx$$
g (X) could be any real-valued function of X, like $X^5$, $3 \times X^2$,
$\log X$, $(2X^5 + 5)$, etc.
In particular, if g(X) is a linear function of X, g(X) = aX + b, where a
and b are real numbers, then the expected value of g (X) is,
E[g(X)] = E [aX + b] = aE[X] + b
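The linear rule can be checked on a small distribution. In the sketch below the values 0, 1, 2 with probabilities 0.3, 0.3, 0.4 and the constants a = 2, b = 3 are illustrative assumptions; E[aX + b] is computed once directly from the definition and once through the shortcut aE[X] + b.

```python
values = [0, 1, 2]
probs = [0.3, 0.3, 0.4]
a, b = 2, 3

# E[X] from the definition
e_x = sum(x * p for x, p in zip(values, probs))

# E[g(X)] computed directly, with g(X) = aX + b
direct = sum((a * x + b) * p for x, p in zip(values, probs))

# Shortcut for a linear function: a * E[X] + b
shortcut = a * e_x + b

print(round(e_x, 2), round(direct, 2), round(shortcut, 2))  # 1.1 5.2 5.2
```

The shortcut holds only for linear functions; for a general g(X), such as a square or a logarithm, E[g(X)] has to be computed from the definition.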

9.2.7 VARIANCE AND STANDARD DEVIATION OF RANDOM VARIABLE
The variance of a random variable is the expected squared deviation of
the random variable from its mean. The idea is similar to that of the
variance of a data set discussed earlier. The variance of a random
variable is, thus, defined as,
$$\mathrm{Var}(X) = \sigma^2 = E[(X - \mu)^2]$$
For a discrete random variable,
$$\mathrm{Var}(X) = \sum_{\text{all } i} (x_i - \mu)^2\, p(x_i)$$
where p (xi) is the p.m.f.
And, for a continuous random variable,
$$\mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$$
where f(x) is the p.d.f.
By algebraic simplification, noting that μ is a constant and using the
definition of expected value and Axiom 3, it can be shown that,
$$\mathrm{Var}(X) = E[X^2] - (E[X])^2$$
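The simplification mentioned above can be written out in a few lines. The steps below are a sketch that assumes only that μ = E[X] is a constant and that the expected value of a sum is the sum of the expected values.

```latex
\begin{aligned}
\mathrm{Var}(X) &= E[(X - \mu)^2] \\
                &= E[X^2 - 2\mu X + \mu^2] \\
                &= E[X^2] - 2\mu\,E[X] + \mu^2 \\
                &= E[X^2] - 2\mu^2 + \mu^2 \qquad (\text{since } E[X] = \mu) \\
                &= E[X^2] - \mu^2 \\
                &= E[X^2] - (E[X])^2
\end{aligned}
```

This form is usually more convenient for hand calculation, because it needs only the two totals ∑ xi p(xi) and ∑ xi² p(xi).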

E [X] is called the first moment of X and $E[X^2]$ the second moment of X.
Variance gives the dispersion or spread of the probability distribution
of random variable X. It is extremely important while comparing two
or more distributions, hypothesis testing, drawing inference from the
sample, etc. For a random variable, the standard deviation is equal to
the positive square root of the variance, and denoted by σ.
Example: A random variable is the number of tails when a coin is
flipped thrice. Find the expectation (mean) of the random variable.
Solution: The required probability distribution is,

Random variable, X = xi    0     1     2     3
p.m.f., P(X = xi)          1/8   3/8   3/8   1/8
xi × P(xi)                 0     3/8   6/8   3/8

Now, the expectation of the random variable is,
$$E(X) = \sum_{i=1}^{4} x_i \times P(x_i) = 0 \times \frac{1}{8} + 1 \times \frac{3}{8} + 2 \times \frac{3}{8} + 3 \times \frac{1}{8} = \frac{12}{8} = \frac{3}{2}$$
Example: X is a random variable with probability distribution

X = xi       0     1     2
P(X = xi)    0.3   0.3   0.4

Y = g(X) = 2X + 3
Find the expected value or mean of Y, that is, E(Y).
Solution: For X = 0, 1, 2 we have Y = 3, 5, 7 respectively. Hence, the
distribution of Y is,

X = xi       0     1     2
Y = yi       3     5     7
P(Y = yi)    0.3   0.3   0.4

Hence,
$$E(Y) = E[g(X)] = \sum_{i=1}^{n} g(x_i)\, P(x_i) = \sum_{i=1}^{n} y_i\, P(x_i) = 3 \times 0.3 + 5 \times 0.3 + 7 \times 0.4 = 5.2$$
Example: Suppose we have two coffee packet filling machines that fill
200 gm packets. We promise the customers one packet free as a penalty
if the coffee is short of the specified weight of 200 gm by 5 gm or
more. Because the filling process is random, the weight of coffee in
each packet follows a probability distribution. Let X be a random
variable denoting the weight of the coffee, with distributions for the
two machines as follows:

Machine A

X = xi        190    195    200    205    210
P(X = xi)     0.1    0.2    0.4    0.2    0.1

Machine B

X = xi        198    199    200    201    202
P(X = xi)     0.1    0.2    0.4    0.2    0.1

Find the mean and variance of the weight these coffee packs will have.
Which machine would you prefer?
Solution: Machine A

X = xi         190     195     200      205     210     Total
P(X = xi)      0.1     0.2     0.4      0.2     0.1       1
xi P(xi)        19      39      80       41      21      200
xi² P(xi)     3610    7605   16000     8405    4410    40030

Thus, the mean is,
$$\mu = E(X) = \sum_{\text{all } i} x_i P(x_i) = 200 \quad \text{(Ans)}$$
Also,
$$E(X^2) = \sum_{\text{all } i} x_i^2 P(x_i) = 40030$$
Hence,
$$\text{Variance} = E(X^2) - [E(X)]^2 = 40030 - 40000 = 30 \quad \text{(Ans)}$$
Now,
$$\text{S.D.} = \sigma = \sqrt{\text{Variance}} = \sqrt{30} = 5.48$$


Machine B

X = xi          198       199      200      201       202      Total
P(X = xi)       0.1       0.2      0.4      0.2       0.1        1
xi P(xi)       19.8      39.8       80     40.2      20.2       200
xi² P(xi)    3920.4    7920.2    16000   8080.2    4080.4    40001.2

Thus, the mean is,
$$\mu = E(X) = \sum_{\text{all } i} x_i P(x_i) = 200 \quad \text{(Ans)}$$
Also,
$$E(X^2) = \sum_{\text{all } i} x_i^2 P(x_i) = 40001.2$$
Hence,
$$\text{Variance} = E(X^2) - [E(X)]^2 = 40001.2 - 40000 = 1.2 \quad \text{(Ans)}$$
Now,
$$\text{S.D.} = \sigma = \sqrt{\text{Variance}} = \sqrt{1.2} = 1.1$$
From the above results it can be seen that machine B is preferable
since it has a very small variance as compared to machine A. In
fact, we could roughly say that in the case of machine A, we will have
to give free packets as a penalty to about 27% of the customers. In
the case of machine B, not even 1% of customers will get a coffee pack
that is underweight by 5 gm. Also, the excess coffee in overweight
packs from machine B will be a very small quantity as compared to
machine A, and hence less costly.
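The comparison of the two machines can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the function name `summarize` and the penalty rule "195 gm or less" are choices made here for illustration. Under the tabulated distributions the share of penalty packs for machine A comes out as 0.1 + 0.2 = 0.30, of the same order as the rough figure quoted above, and as 0 for machine B.

```python
from math import sqrt

# Tabulated weight distributions (gm) of the two machines
machines = {
    "A": {190: 0.1, 195: 0.2, 200: 0.4, 205: 0.2, 210: 0.1},
    "B": {198: 0.1, 199: 0.2, 200: 0.4, 201: 0.2, 202: 0.1},
}

def summarize(pmf, limit=195):
    """Return mean, variance, S.D. and P(X <= limit) for a discrete p.m.f."""
    mean = sum(x * p for x, p in pmf.items())
    variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
    p_short = sum(p for x, p in pmf.items() if x <= limit)  # assumed penalty rule
    return mean, variance, sqrt(variance), p_short

for name, pmf in machines.items():
    mean, variance, sd, p_short = summarize(pmf)
    print(name, round(mean, 2), round(variance, 2), round(sd, 2), round(p_short, 2))
```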


Fill in the blanks:


1. A ................... variable is a variable whose possible values are
numerical outcomes of a random phenomenon.
2. A ................... random variable is one which may take on only a
countable number of distinct values such as 0, 1, 2, 3, 4…
3. A ................... random variable is one which takes an infinite
number of possible values. Continuous random variables are
usually measurements.

Choose any organisation of your choice. In that work environment
list down various random variables that could be studied. (For
example, arrival of customers, number of people served in unit
time, time between failures of a machine.)

9.3 PROBABILITY DISTRIBUTIONS OF STANDARD RANDOM VARIABLES
In many practical situations, the random variable of interest follows
a specific pattern. Random variables are often classified according
to the probability mass function in the case of discrete, and the
probability density function in the case of continuous, random
variables. When the distributions are known fully, all statistical
calculations are possible. In practice, however, the distributions may
not be known fully. But we may be able to approximate the random
variable by one of the known types of standard random variables by
examining the processes that make it random. These standard
distributions are also called ‘probability models’ or sample
distributions. Various characteristics of a distribution like mean,
variance, moments, etc. can be calculated using known closed-form
formulae. We will study some of the common types of probability
distributions. The normal distribution is the backbone of statistical
inference and hence we will study it in more detail.
There are broadly four theoretical distributions which are generally
applied in practice. They are:
• Bernoulli distribution
• Binomial distribution
• Poisson distribution
• Normal distribution


State whether the following statements are true/false:


4. Random variables are often classified according to the
probability mass function in case of discrete, and probability
density function in case of continuous random variable.
5. These standard distributions are also called ‘probability
models’ or sample distributions.
6. The Bernoulli distribution is the backbone of statistical
inference.

In your classroom, identify different random variables. Find the
probability of those random variables that exist and prepare a
comparative report.

Theoretical distributions are mathematical models of the relative
frequencies of a finite number of observations of a variable. A
theoretical distribution is a systematic arrangement of the
probabilities of mutually exclusive and collectively exhaustive
elementary events of an experiment. Observed frequency distributions
are based upon actual observation and experimentation. We can deduce
mathematically a frequency distribution of a certain population
based on the trend of the known values. This kind of distribution,
based on experience or theoretical considerations, is known as a
theoretical distribution or probability distribution.

9.4 BERNOULLI DISTRIBUTION


The Bernoulli distribution is a basis of many discrete random
variables, as it deals with an individual trial. It is a building block
for other random variables. It is a single-trial distribution.
Suppose that the outcome of a trial is dichotomous, i.e. it can be
classified as either a success or a failure. If we let the random
variable X = 1 when the outcome is a success and X = 0 when it is a
failure, and if p is the probability of success for the trial such that
0 ≤ p ≤ 1, then the probability mass function of X is given by,
$$P(X = 0) = P(0) = 1 - p$$
$$P(X = 1) = P(1) = p$$
This random variable is called a Bernoulli random variable with
parameter p. Its mean (expected value) and variance are given by,


Mean
$$\mu = E[X] = \sum_{i=1}^{2} x_i P(x_i) = 0 \times (1 - p) + 1 \times p = p$$
Variance
$$\mathrm{Var}(X) = p(1 - p)$$
For the variance we first calculate
$$E[X^2] = \sum_{i=1}^{2} x_i^2 P(x_i) = 0^2 \times (1 - p) + 1^2 \times p = p$$
and then use,
$$\mathrm{Var}(X) = E[X^2] - (E[X])^2 = p - p^2 = p(1 - p)$$


For example, in many experiments there are only two outcomes. For
instance:
• Flip a coin.
• Take a penalty shot on goal.
• Test a randomly selected circuit to see whether it is defective.
• Roll a die and determine whether it is a 6 or not.
• Determine whether there was flooding this year at Laguna Beach.

9.4.1 APPLICATION OF BERNOULLI DISTRIBUTION

The Bernoulli trial is fundamental to many discrete distributions like
Binomial, Poisson, Geometric, etc. Situations where the Bernoulli
distribution is commonly used are:
• Sex of a newborn child; say Male = 0, Female = 1.
• Items produced by a machine are Defective or Non-defective.
• During the next flight an engine will fail or remain serviceable.
• A student appearing for an examination will pass or fail.
Note that if p = q = 1/2, the Bernoulli distribution reduces to a
discrete uniform distribution,
$$P(X = i) = \begin{cases} \dfrac{1}{2} & \text{when } i = 0, 1 \\ 0 & \text{otherwise} \end{cases}$$

Fill in the blanks:


7. ................... Distribution is a basis of many discrete random
variables, as it deals with individual trial.
8. Suppose that a trial, whose outcome is ..................., i.e. can be
classified as either a success or a failure.
9. Student appearing for examination will pass or fail is an
example of ................... distribution.


The Bernoulli trials process, named after Jacob Bernoulli, is one of
the simplest yet most important random processes in probability.
Essentially, the process is the mathematical abstraction of coin
tossing, but because of its wide applicability, it is usually stated
in terms of a sequence of generic trials that satisfy the following
assumptions:
• Each trial has two possible outcomes, in the language of
reliability called success and failure.
• The trials are independent. Intuitively, the outcome of one
trial has no influence over the outcome of another trial.
• On each trial, the probability of success is p and the probability
of failure is 1 − p, where p ∈ [0, 1] is the success parameter of the
process.

9.5 BINOMIAL DISTRIBUTION
We often conduct many trials which are independent and identical.
Suppose we perform n independent Bernoulli trials, each of which
results in a success with probability p and in a failure with
probability (1 – p). If the random variable X represents the number of
successes that occur in the n trials (the order of successes is not
important), then X is said to be a Binomial random variable with
parameters (n, p). Note that a Bernoulli random variable is a Binomial
random variable with parameters (1, p), i.e. n = 1.

A binomial random variable is the number of successes x in n
repeated trials of a binomial experiment. The probability
distribution of a binomial random variable is called a binomial
distribution (the Bernoulli distribution is its special case n = 1).

The probability mass function of a binomial random variable with
parameters (n, p) is given by,
$$P(X = i) = \binom{n}{i} p^i (1 - p)^{n - i} \quad \text{for } i = 0, 1, 2, \ldots, n$$
Expected value and variance for a Binomial random variable are,
$$\mu = E[X] = np$$
$$\mathrm{Var}[X] = np(1 - p)$$
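The p.m.f. and the two formulae above can be verified numerically. The sketch below is illustrative only: the function name `binomial_pmf` and the parameter values n = 10, p = 0.3 are assumptions chosen here, not taken from the text. It uses `math.comb` from the Python standard library for the binomial coefficient.

```python
from math import comb

def binomial_pmf(i, n, p):
    """P(X = i) for a binomial random variable with parameters (n, p)."""
    return comb(n, i) * p ** i * (1 - p) ** (n - i)

n, p = 10, 0.3  # illustrative parameters
pmf = [binomial_pmf(i, n, p) for i in range(n + 1)]

total = sum(pmf)                                   # should be 1
mean = sum(i * q for i, q in enumerate(pmf))       # should equal n * p
variance = sum(i * i * q for i, q in enumerate(pmf)) - mean ** 2  # n * p * (1 - p)

print(round(total, 6), round(mean, 6), round(variance, 6))  # 1.0 3.0 2.1
```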

9.5.1 APPLICATIONS OF BINOMIAL DISTRIBUTION


When to use binomial distribution is an important decision. Binomial
distribution can be used when following conditions are satisfied:

• Trials are finite (and not very large) in number, performed
repeatedly ‘n’ times.
• Each trial (random experiment) should be a Bernoulli trial, one
that results in either success or failure.
• The probability of success in any trial is ‘p’ and is constant for
each trial.
• All trials are independent.
These trials are usually experiments of selection ‘with
replacement’. In cases where the population is very large, drawing a
small sample from it does not change the probability of success
significantly. Hence, we can still treat the draws as Bernoulli
trials and use the binomial distribution.
Following are some real-life examples of applications of the binomial
distribution.
• Number of defective items in a lot of n items produced by a
machine.
• Number of male births out of n births in a hospital.
• Number of correct answers in a multiple-choice test.
• Number of seeds germinated in a row of n planted seeds.
• Number of re-captured fish in a sample of n fishes.
• Number of missiles hitting the targets out of n fired.


Example: Suppose that a particular trait of a person (like the colour
of eyes) is classified on the basis of one pair of genes, and suppose
that d represents a dominant gene and r a recessive gene. The child
receives one gene from each parent. A child with pure dominant genes
‘dd’ or hybrid genes ‘dr’ or ‘rd’ shows the outward appearance of the
trait. If two hybrid parents have four children, what is the
probability that exactly one of the four children has the outward
appearance of the dominant gene?
Solution: Let X be the number of children with the outward appearance
of the trait. This outward trait will be present if a child has ‘dd’,
‘dr’, or ‘rd’ genes. Assuming the probability of receiving either the
d or the r gene from a parent is equal, the probability of any child
having the outward appearance is 3/4. Thus, X is a binomial random
variable with parameters (4, 3/4). Hence, the desired probability is,
$$P(X = 1) = \binom{4}{1} \left(\frac{3}{4}\right)^{1} \left(\frac{1}{4}\right)^{3} = 0.046875$$

Fitting of Binomial Distribution


Usually, when we want to predict, interpolate or extrapolate the
probabilities for a given probability distribution, it is easier
to get the results if the distribution is approximated by a
standard probability distribution. If the probability distribution
(or a frequency distribution, which is not necessarily a probability
distribution) concerns a random variable X which takes finite
integer values 0, 1, 2, …, n, the Binomial distribution may
work as a model for the given data. This is known as fitting a
binomial distribution to the given data. We first estimate the
parameters of the distribution (n, p) from the data and then compute
probabilities and expected frequencies.
The parameter p is estimated by equating the mean of the binomial
distribution μ = np with the data mean x̄. Thus,
$$\hat{p} = \frac{\bar{x}}{n} \quad \text{and} \quad \hat{q} = 1 - \hat{p}$$
where p̂ means the estimate of p, q̂ means the estimate of q, and
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$
With the estimated parameters we calculate the probabilities, and
hence the expected frequencies, for the given data points. If the
observed frequencies are quite close to the expected ones, the
binomial model under consideration is satisfactory.
Example: The following data gives the number of seeds germinated
in rows of 5 seeds each. Fit a binomial distribution to the data and
calculate the expected frequencies.

xi     0     1     2     3     4     5
fi    10    20    30    15    15    10

Solution: Now,
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{235}{100} = 2.35$$
Hence,
$$\hat{p} = \frac{\bar{x}}{n} = \frac{2.35}{5} = 0.47, \qquad \hat{q} = 1 - \hat{p} = 0.53$$
$$N = \sum f_i = 100, \qquad \frac{\hat{p}}{\hat{q}} = 0.8868$$
Now, either by using the p.m.f. with n = 5 and p = 0.47 or by using
the recurrence relation we can find the probabilities and hence the
expected frequencies. We demonstrate using the recurrence relation,
$$P(X = i + 1) = \frac{n - i}{i + 1} \cdot \frac{\hat{p}}{\hat{q}} \cdot P(X = i), \qquad P(X = 0) = \hat{q}^{\,n}$$

X = i                  0        1        2        3        4        5      Total
(n − i)/(i + 1)        5        2        1       0.5      0.2       0
P(X = i)            0.0418   0.1853   0.3287   0.2915   0.1293   0.0229   0.9995
Ei = N × P(X = i)    4.18    18.53    32.87    29.15    12.93     2.29    99.95

We observe that fitting is reasonably good, except at both ends.
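The fitting procedure can be sketched in Python as follows. This is an illustration rather than the author's own computation: the variable names are chosen here, and because the code carries full precision through the recurrence, its probabilities may differ from the hand-rounded table values in the fourth decimal place.

```python
# Observed frequencies: number of germinated seeds -> number of rows
observed = {0: 10, 1: 20, 2: 30, 3: 15, 4: 15, 5: 10}
n = 5                                    # seeds per row

N = sum(observed.values())               # total number of rows
x_bar = sum(x * f for x, f in observed.items()) / N
p_hat = x_bar / n                        # estimate of p from the data mean
q_hat = 1 - p_hat

# Recurrence: P(i + 1) = (n - i)/(i + 1) * (p/q) * P(i), starting at P(0) = q^n
probs = [q_hat ** n]
for i in range(n):
    probs.append(probs[-1] * (n - i) / (i + 1) * p_hat / q_hat)

expected = [N * pr for pr in probs]      # expected frequencies

for i in range(n + 1):
    print(i, observed[i], round(probs[i], 4), round(expected[i], 2))
```

Comparing the `observed` and `expected` columns row by row shows where the binomial model fits the data well and where it does not.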

Example: Suppose that the probability that a light in a classroom
will be burnt out is 1/3. The classroom has in all five lights and it
is unusable if the number of lights burning is less than two. What is
the probability that the classroom is unusable on a random occasion?
Solution: This is a case of binomial distribution with n = 5 and
p = 1/3. The classroom is unusable if the number of burnouts is 4 or 5,
that is, i = 4 or 5. Noting that,
$$P(X = i) = \binom{n}{i} p^i (1 - p)^{n - i}$$
the probability that the classroom is unusable on a random occasion
is,
$$P(X = 4) + P(X = 5) = \binom{5}{4} \left(\frac{1}{3}\right)^{4} \left(\frac{2}{3}\right)^{1} + \binom{5}{5} \left(\frac{1}{3}\right)^{5} \left(\frac{2}{3}\right)^{0}$$
$$= 0.0412 + 0.0041 = 0.0453$$

Example: It is observed that 80% of T.V. viewers watch the Aap Ki
Adalat programme. What is the probability that at least 80% of the
viewers in a random sample of 5 watch this programme?
Solution: This is a case of binomial distribution with n = 5 and
p = 0.8. At least 80% of a sample of 5 means at least 4 viewers, so
i = 4 or 5.
The probability that at least 80% of the viewers in a random sample
of 5 watch this programme is,
$$P(X \ge 4) = P(X = 4) + P(X = 5) = \binom{5}{4} (0.8)^4 (0.2)^1 + \binom{5}{5} (0.8)^5 (0.2)^0$$
$$= 0.4096 + 0.3277 = 0.7373$$

Fill in the blanks:


10. A ................... random variable is the number of successes x in
n repeated trials of a binomial experiment.
11. The probability distribution of a binomial random variable is
called a binomial ................... .
12. The parameter p is estimated by equating the mean of binomial
distribution μ = np with the data ................... .

Collect the data and prove that as n tends to infinity the Binomial
distribution approaches to normal.


A cumulative binomial probability refers to the probability that
the binomial random variable falls within a specified range (e.g., is
greater than or equal to a stated lower limit and less than or equal
to a stated upper limit).
For example, we might be interested in the cumulative binomial
probability of obtaining 45 or fewer heads in 100 tosses of a coin.
This would be the sum of all these individual binomial probabilities.
B(x ≤ 45; 100, 0.5) = b(x = 0; 100, 0.5) + b(x = 1; 100, 0.5) + ... + b(x
= 44; 100, 0.5) + b(x = 45; 100, 0.5)
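A cumulative binomial probability is simply a sum of individual p.m.f. terms, which is tedious by hand but easy to program. The following Python sketch is an illustration under assumed names (`binomial_cdf` is not from the text); it evaluates the 45-or-fewer-heads probability and an upper-tail probability obtained as a complement.

```python
from math import comb

def binomial_cdf(k, n, p):
    """Cumulative probability P(X <= k) for a binomial(n, p) random variable."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

# 45 or fewer heads in 100 tosses of a fair coin
p_45_or_fewer = binomial_cdf(45, 100, 0.5)
print(round(p_45_or_fewer, 4))

# An upper tail via the complement: P(X >= 4) = 1 - P(X <= 3) for n = 5, p = 0.8
p_at_least_4 = 1 - binomial_cdf(3, 5, 0.8)
print(round(p_at_least_4, 4))  # 0.7373
```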

9.6 POISSON DISTRIBUTION

A random variable X, taking one of the values 0, 1, 2, … is said to be
a Poisson random variable with parameter λ, if for some λ > 0,
$$P(X = i) = \frac{e^{-\lambda} \lambda^{i}}{i!} \quad \text{for } i = 0, 1, 2, \ldots$$
P(X = i) is the probability mass function (p.m.f.) of the Poisson
random variable. Its expected value and variance are,
$$\mu = E[X] = \lambda$$
$$\mathrm{Var}[X] = \lambda$$
The Poisson random variable has a wide range of applications. It can
also be used as an approximation for a binomial random variable with
parameters (n, p) if n is large and p is small enough to make the
product np of moderate size. In this case we call np = λ the average
rate. Some of the common examples where a Poisson random variable
can be used to define the probability distribution are:
• Number of accidents per day on an expressway.
• Number of earthquakes occurring over a fixed time span.
• Number of misprints on a page.
• Number of arrivals of calls at a telephone exchange per minute.
• Number of interrupts per second on a server.
Example: The average number of accidents on an expressway is five per
week. Find the probability that exactly two accidents take place
in a given week. Also find the probability that at most two accidents
take place in the next week.
Solution:
Now, λ = 5 and i = 2. Therefore,
$$P(X = 2) = \frac{e^{-5} \times 5^{2}}{2!} = 0.084224$$
$$P(X \le 2) = P(0) + P(1) + P(2) = \frac{e^{-5} \times 5^{0}}{0!} + \frac{e^{-5} \times 5^{1}}{1!} + \frac{e^{-5} \times 5^{2}}{2!} = e^{-5} \left(1 + 5 + \frac{25}{2}\right) = 0.12465$$
Example: The probability of a defective item produced on a machine is
0.1. Find the probability that a sample of 10 items will contain at
most 1 defective item.

Solution: Method I
Using the binomial distribution with parameters (n = 10, p = 0.1) we
get,
$$P\{X \le 1\} = p(0) + p(1) = \binom{10}{0} (0.1)^{0} (0.9)^{10} + \binom{10}{1} (0.1)^{1} (0.9)^{9} = 0.7361$$

Method II
Using the Poisson distribution (as an approximation to the Binomial
distribution) with parameter λ = 10 × 0.1 = 1 we get,
$$P\{X \le 1\} = p(0) + p(1) = \frac{e^{-1} \lambda^{0}}{0!} + \frac{e^{-1} \lambda^{1}}{1!} = e^{-1} + e^{-1} = 0.7358$$
Note that the Poisson distribution gives a reasonably good
approximation.
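The two methods can be compared side by side in Python. The sketch below is illustrative (the function names are chosen here); it evaluates the exact binomial probability and its Poisson approximation with λ = np.

```python
from math import comb, exp, factorial

def binomial_pmf(i, n, p):
    """Exact binomial probability P(X = i)."""
    return comb(n, i) * p ** i * (1 - p) ** (n - i)

def poisson_pmf(i, lam):
    """Poisson probability P(X = i) with parameter lam."""
    return exp(-lam) * lam ** i / factorial(i)

n, p = 10, 0.1
lam = n * p  # Poisson parameter used for the approximation

exact = sum(binomial_pmf(i, n, p) for i in (0, 1))   # Method I
approx = sum(poisson_pmf(i, lam) for i in (0, 1))    # Method II

print(round(exact, 4), round(approx, 4))  # 0.7361 0.7358
```

The two answers differ only in the fourth decimal place, even though n = 10 is not particularly large.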

Exponential Random Variable

It is a continuous random variable. A continuous random variable X
is said to be exponential with parameter λ, if for some λ > 0,
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \ge 0 \\ 0 & \text{for } x < 0 \end{cases}$$
f(x) is the probability density function (p.d.f.) of the exponential
random variable. Its expected value and variance are,
$$\mu = E[X] = \frac{1}{\lambda}$$
$$\mathrm{Var}(X) = \frac{1}{\lambda^{2}}$$
Many problems involving the exponential distribution require the
cumulative distribution function (c.d.f.), which is equal to:
$$F(a) = P(X \le a) = 1 - e^{-\lambda a} \quad \text{for } a \ge 0$$
The exponential distribution often arises as the distribution of
the amount of time until some specific event occurs. For example, the
time until the next earthquake occurs from now, the time taken to
serve a customer from now, or the time till a machine breaks down from
this moment.
The exponential random variable has an interesting property called
the ‘memory-less property’. It can be shown that, if we think of the
exponential random variable X as being the lifetime of some item (say
a bulb), the probability that the bulb will survive for at least
‘(s + t)’ hours, given that it has survived ‘t’ hours, is the same as
the initial probability that it survives for at least ‘s’ hours, i.e.
P(X > s + t | X > t) = P(X > s). That is, the bulb does not remember
that it has already been in use for the time ‘t’.
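The memory-less property follows directly from the c.d.f., since P(X > a) = 1 − F(a) = e^(−λa). The Python sketch below checks it numerically; the rate λ = 0.5 and the times s = 2, t = 3 are arbitrary illustrative values, and `survival` is a name chosen here.

```python
from math import exp

def survival(a, lam):
    """P(X > a) = 1 - F(a) = exp(-lam * a) for an exponential random variable."""
    return exp(-lam * a)

lam, s, t = 0.5, 2.0, 3.0  # illustrative values

# Conditional probability P(X > s + t | X > t) = P(X > s + t) / P(X > t)
conditional = survival(s + t, lam) / survival(t, lam)
unconditional = survival(s, lam)

print(abs(conditional - unconditional) < 1e-12)  # True
```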
Example: The average time for updating a passbook by a bank clerk is
15 seconds. Someone arrives just ahead of you. Find the probability
that you will have to wait for your turn,
1. More than 1 minute.
2. Less than ½ minute.
Solution: Now, λ = 60/15 = 4 passbooks per minute.
$$P\{X > 1\} = 1 - F(1) = e^{-4} = 0.0183$$
$$P\{X < 0.5\} = F(0.5) = 1 - e^{-2} = 1 - 0.1353 = 0.8647$$
Example: In a certain factory it was found that the average absentee
rate is 3 workers per shift. Find the probability that on a given
shift:
1. Exactly two workers will be absent.
2. More than four workers will be absent.
[Given e^(–3) = 0.04979 and e^(–0.3) = 0.7408]
Solution: This is a case of Poisson distribution with average
absentee rate λ = 3. We use
$$P(X = i) = \frac{e^{-\lambda} \lambda^{i}}{i!}$$
1. $$P(X = 2) = \frac{e^{-3} \, 3^{2}}{2!} = 0.224$$
2. $$P(X > 4) = 1 - P(X \le 4) = 1 - [P(0) + P(1) + P(2) + P(3) + P(4)]$$
$$= 1 - e^{-3} \left[1 + 3 + \frac{9}{2} + \frac{9}{2} + \frac{27}{8}\right] = 0.1847$$
Or, we can use the cumulative Poisson probabilities table to
calculate P(X ≤ 4). From the table, for λ = 3 and i = 4 we get the
cumulative probability P(X ≤ 4) as 0.8153. Hence, we calculate
$$P(X > 4) = 1 - P(X \le 4) = 1 - 0.8153 = 0.1847$$
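Where a cumulative Poisson table is not at hand, the same values can be computed directly. The following Python sketch is illustrative (the function names are chosen here, not taken from the text) and reproduces the Poisson probabilities of the worked examples in this section.

```python
from math import exp, factorial

def poisson_pmf(i, lam):
    """P(X = i) for a Poisson random variable with parameter lam."""
    return exp(-lam) * lam ** i / factorial(i)

def poisson_cdf(k, lam):
    """Cumulative probability P(X <= k)."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

# Absentee example: lam = 3 workers per shift
print(round(poisson_pmf(2, 3), 4))        # 0.224
print(round(1 - poisson_cdf(4, 3), 4))    # 0.1847

# Accident example: lam = 5 accidents per week
print(round(poisson_pmf(2, 5), 6))        # 0.084224
print(round(poisson_cdf(2, 5), 5))        # 0.12465
```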

Fill in the blanks:


13. A random variable X, taking one of the values 0, 1, 2 … is said
to be a Poisson random variable with ................... .
14. Exponential Random Variable is a ................... random variable.
15. ................... random variable has an interesting property called
‘memory-less property’.
16. ................... distribution gives reasonable good approximation.


Collect the data and prove the formulae for the cumulative
distribution function for the exponential random variable and the
Poisson random variable.

Since the cumulative probabilities of the exponential distribution
can be easily calculated using the formula for the c.d.f. given above,
cumulative probability tables are usually not given. Further,
since the exponential distribution is continuous, tabulating the
probabilities reduces the accuracy.

9.7 NORMAL DISTRIBUTION

The normal random variable and its distribution are commonly used in
many business and engineering problems. Many other distributions
like binomial, Poisson, beta, chi-square, Student's t, exponential,
etc. can also be approximated by the normal distribution under
specific conditions (usually when the sample size is large). If a
random variable is affected by many independent causes, and the effect
of each cause is not significantly large as compared to other effects,
then the random variable will closely follow the normal distribution.
For example, weights of coffee filled in packs, lengths of nails
manufactured on a machine, hardness of ball bearing surface, diameters
of shafts produced on a lathe, effectiveness of a training programme
on the employees’ productivity, etc., are examples of normally
distributed random variables. Further, many sampling statistics, e.g.
sample means x̄, are normally distributed.

9.7.1 EQUATION FOR NORMAL PROBABILITY CURVE


A random variable X is a normal random variable with parameters μ
and σ if the probability density function (p.d.f.) of X is given by,
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}}, \quad -\infty < x < \infty$$
This distribution is a bell-shaped curve that is symmetric about μ.
Here X is the normal random variable, μ is the mean, σ is the standard
deviation, π is approximately 3.14159, and e is approximately 2.71828.
This equation, called the normal equation, is the probability density
function for the normal distribution.

The mean of a normal random variable is E(X) = μ and the variance of a
normal random variable is Var(X) = σ².
If X is normally distributed with parameters μ and σ, then another
random variable Y = aX + b is also normally distributed, with mean
(aμ + b) and standard deviation |a|σ.

9.7.2 STANDARD NORMAL DISTRIBUTION


Calculating the cumulative distribution function of a normal
distribution involves integration. Further, tabulation has the problem
that we would need tables for every possible value of μ and σ² (which
is not feasible). Hence, we transform the Normal Random Variable to
another random variable known as the Standard Normal Random Variable.
For this, we use the transformation,
$$z = \frac{x - \mu}{\sigma} = \frac{1}{\sigma} x - \frac{\mu}{\sigma}$$
z is a normally distributed random variable with parameters
μ = 0 and σ = 1.
Any normal random variable can be transformed to the standard normal
random variable z. We can get the cumulative distribution function as,
$$F(a) = \int_{-\infty}^{a} f(z) \, dz = \int_{-\infty}^{a} \frac{1}{\sqrt{2\pi}} \, e^{-\frac{z^{2}}{2}} \, dz$$
This has been calculated for various values of ‘a’ and tabulated. Also,
we know that,
$$F(-a) = 1 - F(a)$$
We also note that P(a < Z < b) = F(b) − F(a).
The table giving the area under the Standard Normal Curve is available
in statistical tables and is also given in Appendix 1.
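As an alternative to the printed table, the standard normal c.d.f. can be evaluated from the error function, using the identity F(a) = ½[1 + erf(a/√2)]. The Python sketch below is illustrative; the function names `phi` and `normal_cdf` are chosen here and are not from the text.

```python
from math import erf, sqrt

def phi(a):
    """Standard normal c.d.f. F(a) = P(Z <= a)."""
    return 0.5 * (1 + erf(a / sqrt(2)))

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal variable, via the transformation z = (x - mu)/sigma."""
    return phi((x - mu) / sigma)

print(phi(0))                                   # 0.5
print(abs(phi(-1.3) - (1 - phi(1.3))) < 1e-12)  # True: F(-a) = 1 - F(a)
print(round(phi(1.96), 4))                      # 0.975
```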

Procedure to Read Standard Normal Table


The Standard Normal Table given in Appendix 1 gives the area under the
Standard Normal Distribution. In other words, it indicates the
Cumulative Distribution Function or cumulative probability
P(−∞ < Z < a) = F(a). In many tables, rather than giving the cumulative
probability from −∞, the cumulative probability from the mean, i.e.
Z = 0, is given for various positive values of z. In other words, the
table gives the values of the probability
$$P(0 < Z < a) = F(a) - F(0) = F(a) - 0.5$$
Thus, we can find the cumulative probability for a given value of z by
adding 0.5 to the value read from the table. The table in Appendix A
gives the cumulative probability from 0 to z for values of z varying
from 0 to 3.9, up to two decimal places. The calculations are as
follows. Let the probability value read from the table for a given z
be called p. Now, using the symmetry of the Standard Normal
Distribution about Z = 0, we get the probabilities as,


• (a) P(−∞ < Z < z) = F(z) = 0.5 + P(0 < Z < z) = 0.5 + p
• (b) P(0 < Z < z) = F(z) − F(0) = F(z) − 0.5 = p
• (c) P(−∞ < Z < −z) = F(−z) = 1 − F(z) = 0.5 − P(0 < Z < z) = 0.5 − p
  (Value in Left Tail)
• (d) P(z < Z < ∞) = F(∞) − F(z) = 1 − F(z) = 0.5 − P(0 < Z < z) = 0.5 − p
  (Value in Right Tail)
• (e) P(−z < Z < 0) = F(0) − F(−z) = 0.5 − [1 − F(z)] = F(z) − 0.5 = p
Now, to read the value of p = P(0 < Z < z) the procedure is,
• Calculate the value of z using the formula z = (x − μ)/σ, where μ is
the mean and σ is the standard deviation of the given normal
distribution.
• Round the calculated value of z to two decimal places.
• Look for the value of z up to the first decimal in column z of the
Standard Normal Distribution Table shown in Appendix A (first
column of the table). Look for the second decimal value of z
in the top row of the table. Read the probability value p in the cell
at the intersection of the row and column where the z value up
to the first decimal and the second decimal are located. The p value
thus read is then used for finding probabilities as indicated above.
Sometimes, we need to find the value of z, called z-critical, for a
given probability in the left tail or right tail. This is required in
testing of hypothesis for a given significance level. In such a case
we first calculate the value of p using relations (a) to (e) given
above. Then we look for the value closest to p in the probabilities
given in the Standard Normal Distribution Table. Once we identify the
cell where this value of p lies, we can read the z value up to one
decimal in the first column of that row where the identified cell
lies. We find the second decimal of the z value in the first row of
that column where the identified cell lies.
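The reverse look-up described above corresponds to evaluating the inverse of the c.d.f. A minimal Python sketch follows, using `statistics.NormalDist` from the standard library (Python 3.8 or later); the tail probabilities chosen are illustrative and not from the text.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# z-critical leaving 0.5% in the right tail, i.e. F(z) = 0.995
z_995 = std_normal.inv_cdf(0.995)
print(round(z_995, 3))  # 2.576

# z-critical leaving 5% in the right tail, i.e. F(z) = 0.95
z_95 = std_normal.inv_cdf(0.95)
print(round(z_95, 3))   # 1.645

# The table value p = P(0 < Z < z) that the look-up would locate
p = std_normal.cdf(z_995) - 0.5
print(round(p, 4))      # 0.495
```

A four-decimal table can only bracket such values between adjacent cells, so a table look-up may give a slightly different third decimal than the computed figure.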

Important Points to Remember


• The table at Appendix A gives the value of p up to four decimal
places. Some tables give it up to five decimal places.
• Check the area (probability) indicated by p. It is usually
shown as a diagram at the top of the table. Some books give the
probability p as P(–∞ < Z < z), also known as the c.d.f. In this case
z values are negative as well as positive, and the p value in the
table is 0 in the top left corner and 1 in the bottom right corner.
Some engineering books give the probability p as P(–z < Z < z). In
this case z values are positive starting from 0, and the p value in
the table is 0 in the top left corner and 1 in the bottom right
corner. Some books also give the probability p as P(z < Z < ∞), i.e.
the right-tail value. In this case z values are positive starting
from 0, and the p value in the table is 0.5 in the top left corner and
0 in the bottom right corner. In all such cases we need to readjust
the formulae (a) to (e) given above.

• The key to understanding the type of table (if the graph is not
given on the top with a shaded portion for p) is the following
properties of the Standard Normal Distribution Table.
  ◦ The probability values are symmetric about the midpoint, i.e.
  Z = 0.
  ◦ Total probability P(–∞ < Z < ∞) = 1.
  ◦ Cumulative probabilities in the left and right halves of the
  curve are 0.5, i.e. P(–∞ < Z < 0) = P(0 < Z < ∞) = 0.5.
• For calculating the probability values, either convert them into
c.d.f. values F(a) and use the formulae, or draw a simple sketch to
identify the area of interest on the probability curve and then use
the logic. Don’t mix the two methods as it can be confusing. Use the
method that is more appealing to you.

9.7.3 PROPERTIES OF NORMAL DISTRIBUTION

• It is perfectly symmetric about the mean μ.
• For a normal distribution mean = median = mode.
• It is uni-modal (one mode), with skewness = 0 and excess
kurtosis = 0.
• The normal distribution is a limiting form of the binomial
distribution when the number of trials n is large, and neither the
probability p nor (1 − p) is very small.
• The normal distribution is a limiting case of the Poisson
distribution when the mean μ = λ is very large.
• While working on probabilities of a normal distribution we usually
use normal distribution (more often standard normal distribution)
tables. While reading these tables, the relevant properties are,
  ◦ The probability that a normally distributed random variable
  with mean μ and variance σ² lies between two specified
  values a and b is P(a < X < b) = area under the curve f(x)
  between the specified values X = a and X = b.
  ◦ The total area under the curve f(x) is equal to 1, of which 0.5
  lies on either side of the mean.
  ◦ The range μ ± σ covers 68.27% of the observations.
  ◦ The range μ ± 2σ covers 95.44% of the observations.
  ◦ The range μ ± 3σ covers 99.73% of the observations.

9.7.4 AREAS UNDER STANDARD NORMAL PROBABILITY CURVE

• Approximately 68% of the area under the curve is between μ − σ
and μ + σ.
• Approximately 95% of the area under the curve is between μ − 2σ
and μ + 2σ.
• Approximately 99.7% of the area under the curve is between μ − 3σ
and μ + 3σ.

Figure 9.1: Area under the Normal Curve
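These areas can be confirmed by direct computation. The Python sketch below, an illustration using `statistics.NormalDist` from the standard library, evaluates the area within k standard deviations of the mean for k = 1, 2, 3. Computed to full precision the areas are about 68.27%, 95.45% and 99.73%; a four-decimal table gives 2 × 0.4772 = 95.44% for the middle figure because of rounding.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, S.D. 1

areas = []
for k in (1, 2, 3):
    # Area between mu - k*sigma and mu + k*sigma, expressed in z units
    area = std_normal.cdf(k) - std_normal.cdf(-k)
    areas.append(area)
    print(k, round(100 * area, 2))
```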

9.7.5 IMPORTANCE OF NORMAL DISTRIBUTION


• Data obtained from psychological, physical and biological
measurements approximately follow the normal distribution.
• Distributions like binomial, Poisson, etc. can be approximated by
the normal distribution for large n.
• For large samples any statistic (sample parameter) approximately
follows the normal distribution.
• The normal curve is used to find confidence limits of the population
parameters.
• Normal distributions are largely applied in statistical quality
control.
• The theory of errors of observations in physical measurements is
based on the normal distribution.

Conditions for Normality

• The causal forces must be numerous and have approximately
equal weight.
• These forces must be the same over the universe from which the
observations are drawn. This is the condition of homogeneity.
• The forces affecting the events must be independent of one
another.
• The operation of the causal forces must be such that deviations
about the population mean are balanced as to magnitude and
number.

Solved Examples
Example: If X is a normal random variable with parameters μ = 3 and
σ² = 9, find
(a) P(2 < X < 5) (b) P(X < 0) (c) P(|X − 3| > 6)
Solution: Here σ = 3. Using the transformation $z = \frac{x - \mu}{\sigma}$,
For x = 2, $z = \frac{2 - 3}{3} = -\frac{1}{3}$
For x = 5, $z = \frac{5 - 3}{3} = \frac{2}{3}$
Therefore,
$P(2 < X < 5) = P\left(-\frac{1}{3} < z < \frac{2}{3}\right) = F\left(\frac{2}{3}\right) - F\left(-\frac{1}{3}\right) = F\left(\frac{2}{3}\right) - \left[1 - F\left(\frac{1}{3}\right)\right]$
From Standard Normal tables, we get,
$F\left(\frac{2}{3}\right)$ = 0.5 + area under standard normal curve for z = 0.667
∴ $F\left(\frac{2}{3}\right) = 0.5 + 0.2486 = 0.7486$
$F\left(\frac{1}{3}\right)$ = 0.5 + area under standard normal curve for z = 0.333
∴ $F\left(\frac{1}{3}\right) = 0.5 + 0.1293 = 0.6293$
Thus, P(2 < X < 5) = 0.7486 + 0.6293 − 1 = 0.3779
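
The worked solution covers part (a) only. As a cross-check, all three parts can be evaluated without tables; the sketch below assumes Python's standard-library `statistics.NormalDist`, and the variable names are illustrative. Part (a) comes out at about 0.378, marginally different from 0.3779 because the tables round z to two decimals.

```python
from statistics import NormalDist

# X is normal with mean 3 and variance 9, so sigma = 3.
X = NormalDist(mu=3.0, sigma=3.0)

# (a) P(2 < X < 5) = F(5) - F(2)
p_a = X.cdf(5) - X.cdf(2)

# (b) P(X < 0) = F(0)
p_b = X.cdf(0)

# (c) P(|X - 3| > 6) = P(X < -3) + P(X > 9)
p_c = X.cdf(-3) + (1 - X.cdf(9))

print(f"(a) {p_a:.4f}  (b) {p_b:.4f}  (c) {p_c:.4f}")
```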
Example: Coffee is filled in packs of 200 gm by a machine whose fill
weight has a variance of 0.25 gm². Packs weighing less than 200 gm would
be rejected by customers and are not legally acceptable. Therefore, the
marketing and legal departments request the production manager to
set the machine to fill slightly more in each pack. However, the
finance department objects, since overfilling the packs would lead to
financial loss. The general manager wants to know the
99% confidence interval when the machine is set at 200 gm, so that
he can take a decision. Find the confidence interval. What is your advice
to the production manager?

Solution: Let the weight of the coffee in a pack be a random variable X. We
know that the mean μ = 200 gm and variance σ² = 0.25 gm², i.e. σ = 0.5
gm. First, we find the value of z for 99% confidence. The Standard Normal
Distribution curve is symmetric about the mean. Hence, corresponding
to 99% confidence, the area on each side of the mean is 0.99/2 = 0.495.
The z value corresponding to an area of 0.495 is 2.575. Thus, the 99%
confidence interval in terms of variable z is ±2.575, which in terms of
variable x is 200 ± 1.2875, or (198.71 to 201.29).
Note that $x = \sigma z + \mu = 0.5 \times (\pm 2.575) + 200 = 200 \pm 1.2875$.
Hence, we can advise the production manager to set his machine
to fill the coffee with mean weight 201.2875 gm, or say 201.29 gm. In that
case only the lower tail beyond z = −2.575 (about 0.5% of packs) falls
below 200 gm, so the legal requirement is met with at least 99% confidence
while the cost of excess filling is kept to a minimum.
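
The interval can be reproduced without tables. This is a minimal sketch assuming Python's standard-library `statistics.NormalDist`; the variable names are illustrative. The inverse CDF gives z ≈ 2.5758 rather than the table value 2.575, which does not change the interval at two decimals.

```python
from statistics import NormalDist

mu, sigma = 200.0, 0.5      # mean in gm; sigma = sqrt(0.25)
confidence = 0.99

# z such that the central area between -z and +z equals the confidence level.
z = NormalDist().inv_cdf(0.5 + confidence / 2)

half_width = z * sigma
lower, upper = mu - half_width, mu + half_width
print(f"z = {z:.4f}, interval = ({lower:.2f}, {upper:.2f})")
```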
Example: A total of 2,058 students take a difficult test. Each student
has an independent 0.6205 probability of passing the test.
1. What is the probability that between 1,250 and 1,300 students,
both numbers inclusive, will pass the test?
2. What is the probability that at least 1,300 students will pass the
test?
3. If the probability of at least 1,300 students passing the test has to
be at least 0.5, what is the minimum value for the probability of
each student passing the test?

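Solution: Let X be the number of students who pass. X is binomial with
n = 2058 and p = 0.6205 and, since n is large, it can be approximated by a
normal distribution with cumulative distribution function F. The problem
could be rewritten as:
1. Find $P(1250 \le X \le 1300) = F(1300) - F(1250)$
2. Find $P(X \ge 1300) = 1 - F(1300)$
3. Find p such that $P(X \ge 1300) \ge 0.5$
First, we find z values corresponding to 1250 and 1300 using the formula
$z = \frac{x - \mu}{\sigma}$
Now, with given n = 2058 and p = 0.6205,
$\mu = np = 2058 \times 0.6205 = 1276.99$
$\sigma = \sqrt{\text{Variance}} = \sqrt{np(1 - p)} = \sqrt{2058 \times 0.6205 \times 0.3795} = 22.014$
Hence, for X = 1250, z = −1.226 and for X = 1300, z = 1.045
1. Thus,
$F(1300) - F(1250) = F(1.045) - F(-1.226) = F(1.045) - [1 - F(1.226)]$
$= (0.5 + 0.3520) - [1 - (0.5 + 0.3897)] = 0.3520 + 0.3897 = 0.7417$
2. $P(X \ge 1300) = 1 - F(1.045) = 0.5 - 0.3520 = 0.1480$
3. $P(X \ge 1300) \ge 0.5$ requires the mean to be at least 1300. Hence, the
minimum value of p satisfies
$\mu = np \Rightarrow 1300 = 2058 \times p \Rightarrow p = 0.6317$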
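
The normal approximation used in this example can be sketched in a few lines. The code below assumes Python's standard library and, like the worked solution, applies no continuity correction; the variable names are illustrative. Small differences from the table-based figures come from rounding z.

```python
from math import sqrt
from statistics import NormalDist

n, p = 2058, 0.6205

# Mean and standard deviation of the binomial count of passing students.
mu = n * p
sigma = sqrt(n * p * (1 - p))

# Normal approximation to the binomial distribution.
approx = NormalDist(mu, sigma)

p_between = approx.cdf(1300) - approx.cdf(1250)   # part 1
p_at_least = 1 - approx.cdf(1300)                 # part 2
p_min = 1300 / n                                  # part 3: mean must reach 1300

print(f"mu = {mu:.2f}, sigma = {sigma:.3f}")
print(f"P(1250 <= X <= 1300) ~ {p_between:.4f}")
print(f"P(X >= 1300) ~ {p_at_least:.4f}")
print(f"minimum p = {p_min:.4f}")
```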


Fill in the blanks:


17. A random variable X is a ................... random variable with
parameters μ and σ.
18. For a normal distribution mean = ................... .
19. Normal distribution is a limiting case of Poisson distribution
when mean ................... is very large.
20. The range μ ± 3σ covers ................... of the observations.

Assessing Normality
Suppose that seventeen randomly selected workers at a detergent
factory were tested for exposure to a Bacillus subtilis enzyme by
measuring the ratio of forced expiratory volume (FEV) to vital
capacity (VC). (Note: FEV is the maximum volume of air a person
can exhale in one second; VC is the maximum volume of air that a
person can exhale after taking a deep breath.) Is it reasonable to
conclude that the FEV to VC (FEV/VC) ratio is normally distributed?
0.61 0.70 0.76 0.84
0.63 0.72 0.78 0.85
0.64 0.73 0.82 0.85
0.67 0.74 0.83 0.87
0.88
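
One way to approach this activity is a normal probability plot: sort the data, pair each value with the z-score expected at its rank, and check how close the pairs lie to a straight line. The sketch below is an illustration rather than the method prescribed by the text. It assumes Python's standard library, uses the plotting position (i − 0.5)/n (one common convention among several), and defines a hypothetical helper `pearson_r`.

```python
from math import sqrt
from statistics import NormalDist, mean

# FEV/VC ratios for the seventeen workers.
ratios = [0.61, 0.63, 0.64, 0.67, 0.70, 0.72, 0.73, 0.74, 0.76,
          0.78, 0.82, 0.83, 0.84, 0.85, 0.85, 0.87, 0.88]

n = len(ratios)
xs = sorted(ratios)

# Expected standard normal scores at plotting positions (i - 0.5)/n.
# The choice of plotting position is an assumption made for this sketch.
scores = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den

r = pearson_r(xs, scores)
print(f"n = {n}, mean = {mean(ratios):.2f}, probability-plot correlation r = {r:.3f}")
```

A correlation close to 1 means the sorted data track the normal scores closely, which is consistent with approximate normality. With only seventeen observations it cannot establish normality, and a formal test would need critical values that this chapter does not provide.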

Normal Distribution was introduced by the French mathematician
Abraham De Moivre in 1733 and used by him to approximate
probabilities associated with binomial random variables when
the binomial parameter ‘n’ is large. This was further extended
by Laplace and is now known as the Central Limit Theorem. It gives a
theoretical base to the observation that, in practice, many random
phenomena approximately obey a normal probability distribution.

9.8 SUMMARY
‰‰ Random variable is a real valued function defined over a sample
space with probability associated with it. The value of the random
variable is outcome of an experiment. Random variables are
neither ‘random’ nor ‘variable’.
‰‰ In this chapter we discussed several important random variables,
the associated formulae, and problem solving using formulae.
A discrete random variable is one that takes at most countably
many values. A continuous random variable can take any
real value.
‰‰ We also discussed probability distributions of random variables.
Binomial distribution is used if an experiment is carried out for
finite number of n independent trials; all trials being Bernoulli
trials with constant probability of success p.
‰‰ Random variable will follow Poisson distribution if it is the number
of occurrences of a rare event during a finite period. Waiting time
for a rare event is exponentially distributed. Negative binomial
distribution is used when Bernoulli trials are repeated until a
desired number of successes is achieved.
‰‰ A frequently needed continuous random variable is the
uniform random variable. Waiting time for an event that occurs
periodically follows uniform distribution.
‰‰ Normal probability distribution is the most important distribution
in statistics. We defined normal distribution with parameters (μ,
σ) where μ is mean and σ is standard deviation.
‰‰ Further, we defined standard normal distribution, which is a
special case of normal distribution with parameters (0, 1).
‰‰ We also discussed transformation of normal random variable X
to standard normal random variable Z using $z = \frac{x - \mu}{\sigma}$. The Z distribution is
very convenient for manual calculation as we can use standard
normal tables, which are extensively tabulated, to find probability
and interval.
‰‰ Normal distribution is used as a model in many real world
situations, either as a continuous distribution or as an approximation
to discrete distributions like binomial or Poisson.

‰‰ Random Variable: Random variable is a real valued function
defined over a sample space.
‰‰ Discrete Random Variable: Random variable is discrete when
the number of possible outcomes in a random experiment is
countable.
‰‰ Continuous Random Variable: Random variable is continuous
when the number of outcomes in a random experiment is
uncountable.
‰‰ Normal Random Variable: A random variable that follows a
normal distribution is called a normal random variable.
‰‰ Probability Distribution: The probability distribution of a
discrete random variable is a list of probabilities associated
with each of its possible values.
‰‰ Binomial Random Variable: A binomial random variable is
the number of successes x in n repeated trials of a binomial
experiment.
‰‰ Binomial Distribution: The probability distribution of a
binomial random variable is called a binomial distribution.

9.9 DESCRIPTIVE QUESTIONS


1. Define a random variable. Give few examples.
2. Differentiate between discrete and continuous random variables
with examples.
3. Explain probability mass function and probability density
function for a random variable.
4. Define expected value of a function of a random variable.
5. What are the variance and standard deviation of a random
variable? How do you calculate them?
6. Write a short note on Bernoulli distribution of random variables.
Discuss its applications also.
7. Define binomial random variable. Describe binomial distribution
and its applications.
8. How will you define Poisson random variable and exponential
random variable? Describe Poisson distribution with an example.
9. Define normal random variable. What is the equation of the normal
probability curve?
10. Write a short note on standard normal distribution along with its
properties and importance.

EXERCISE FOR PRACTICE


1. A company produces parts and sells them in packs of 10. The
company offers a refund if more than one item in the pack is
defective. The company’s records show that the defect proportion
of the parts manufactured is 0.3. For what proportion of packs
will the company have to provide a refund?
2. On an average two bulbs out of 100 produced by a company give
less than 500 hours of life. The company supplies the bulbs in
pack of 100. What is the probability that in a pack purchased
by you two bulbs give you life less than 500 hours? What is the
probability that two or more bulbs will give less than 500 hours
life?
3. A radar of a missile system has a probability of 0.1 of detecting
and locking on an aircraft within 60 km during one scan. The
radar gets 4 scans before the aircraft goes outside the missile
range.

(a) Find the probability that the target will be detected at least
twice.
(b) Find the probability that the target will be detected at the
most once.
4. In a large group of men, it is found that 5% are under the age of 60
and 40% are between the ages of 60 and 65. Assuming the distribution
of age is normal, find the mean and standard deviation.
5. If a random variable X follows a normal distribution with mean
18 and standard deviation 25, find P(−31 < X < 67).

9.10 ANSWERS AND HINTS


ANSWERS FOR SELF ASSESSMENT QUESTIONS
| Topic | Q. No. | Answers |
|---|---|---|
| Random Variable | 1. | Random |
| | 2. | Discrete |
| | 3. | Continuous |
| Probability Distributions of Standard Random Variables | 4. | True |
| | 5. | True |
| | 6. | False |
| Bernoulli Distribution | 7. | Bernoulli |
| | 8. | Dichotomous |
| | 9. | Bernoulli |
| Binomial Distribution | 10. | Binomial |
| | 11. | Distribution |
| | 12. | Mean x |
| Poisson Distribution | 13. | Parameter λ |
| | 14. | Continuous |
| | 15. | Exponential |
| | 16. | Poisson |
| Normal Distribution | 17. | Normal |
| | 18. | Median = mode |
| | 19. | μ = λ |
| | 20. | 99.73% |

HINTS FOR DESCRIPTIVE QUESTIONS


1. Refer Section 9.2
A random variable, usually written X, is a variable whose possible
values are numerical outcomes of a random phenomenon. 

2. Refer Section 9.2.1
A discrete random variable is one which may take on only a
countable number of distinct values such as 0, 1, 2, 3, 4…
Discrete random variables are usually (but not necessarily)
counts. If a random variable can take only a finite number of
distinct values, then it must be discrete. A continuous random
variable is one which can take any value in an interval, that is, an
uncountably infinite number of possible values.
Continuous random variables are usually measurements.
3. Refer Section 9.2.2
A random variable that can take countable number of possible
values (including infinite countable numbers) is said to be
discrete. For a discrete random variable the ‘probability mass function’
(p.m.f.) is defined as,
$p(a) = P\{X = a\}$
X is a continuous random variable if there exists a non-negative
function f(x), defined for all real values of x, having the property that for any
set B of real numbers,
$P(X \in B) = \int_B f(x)\,dx$
The function f(x) is called the probability density function of the
random variable X.
4. Refer Section 9.2.6
The expected value of a function g(X) of a random variable X is computed as,
For discrete random variable
$E[g(X)] = \sum_{\text{for all } i} g(x_i)\, p(x_i)$
For continuous random variable
$E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f(x)\,dx$
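5. Refer Section 9.2.7
Variance of a random variable is defined as,
$\mathrm{Var}(X) = \sigma^2 = E[(X - \mu)^2]$
For discrete random variable,
$\mathrm{Var}(X) = \sum_{\text{for all } i} (x_i - \mu)^2\, p(x_i)$, where p(x_i) is the p.m.f.
And, for continuous random variable,
$\mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$, where f(x) is the p.d.f.
The standard deviation σ is the positive square root of the variance.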

6. Refer Section 9.4
The Bernoulli distribution is a basis of many discrete random variables, as it deals with
an individual trial. It is a building block for other random variables.
It is a single trial distribution. Such a random variable is called a
Bernoulli random variable with parameter (p).
7. Refer Section 9.5
A binomial random variable is the number of
successes x in n repeated trials of a binomial experiment.
The probability distribution of a binomial random variable
is called a binomial distribution (the Bernoulli distribution is
its special case with n = 1).
The probability mass function of a binomial random variable
with parameters (n, p) is given by,
$P(X = i) = \binom{n}{i} p^i (1 - p)^{n - i}$ for i = 0, 1, 2, …, n
8. Refer Section 9.6
A random variable X, taking one of the values 0, 1, 2 … is said to
be a Poisson random variable with parameter λ, if for some λ > 0,
$P(X = i) = \frac{e^{-\lambda} \lambda^i}{i!}$ for i = 0, 1, 2, …
A continuous random variable X is said to be exponential with
parameter λ, if for some λ > 0,
$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \ge 0 \\ 0 & \text{for } x < 0 \end{cases}$
f(x) is a probability density function (p.d.f.) of the exponential
random variable.
9. Refer Section 9.7
If random variable is affected by many independent causes, and
the effect of each cause is not significantly large as compared
to other effects, then the random variable will closely follow the
normal distribution. A random variable X is a normal random
variable with parameters μ and σ if the probability density
function (p.d.f.) of X is given by,
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$, where −∞ < x < ∞
10. Refer Section 9.7.2
We transform Normal Random Variable to another random
variable known as Standard Normal Random Variable. For this,
we use a transformation,
$z = \frac{x - \mu}{\sigma} = \frac{1}{\sigma}\,x - \frac{\mu}{\sigma}$
z is a normally distributed random variable with parameters
μ = 0 and σ = 1.
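
The mass functions and the expectation and variance formulas quoted in these hints can be tried out numerically. The following is a minimal sketch using only Python's standard library; the function names and the illustrative parameters (n = 10, p = 0.3) are assumptions made for the example, not values taken from the text. It checks that a binomial variable has mean np and variance np(1 − p).

```python
from math import comb, exp, factorial

def binomial_pmf(i, n, p):
    """P(X = i) for a binomial random variable with parameters (n, p)."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)

def poisson_pmf(i, lam):
    """P(X = i) for a Poisson random variable with parameter lam."""
    return exp(-lam) * lam**i / factorial(i)

def expectation(g, support, pmf):
    """E[g(X)] = sum of g(x) * p(x) over the support of a discrete variable."""
    return sum(g(x) * pmf(x) for x in support)

# Illustrative binomial example (parameters chosen for this sketch).
n, p = 10, 0.3
support = range(n + 1)

def pmf(i):
    return binomial_pmf(i, n, p)

mean_x = expectation(lambda x: x, support, pmf)
var_x = expectation(lambda x: (x - mean_x) ** 2, support, pmf)
print(f"mean = {mean_x:.4f} (np = {n * p}), variance = {var_x:.4f}")
```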

ANSWERS FOR EXERCISE FOR PRACTICE
1. 0.85
2. 0.27, 0.594
3. 0.0523, 0.948
4. 65.41, 3.29
5. 0.950004
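
The practice exercises can be recomputed directly as a cross-check. The sketch below uses only Python's standard library and rests on stated assumptions: exercises 1 and 3 are treated as binomial, exercise 2 as Poisson with λ = 100 × 0.02 = 2 and with "two or more" read literally, and exercises 4 and 5 as normal. All names are illustrative.

```python
from math import comb, exp
from statistics import NormalDist

def binom_pmf(k, n, p):
    """P(X = k) for a binomial random variable with parameters (n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Exercise 1: refund if more than one of 10 parts is defective, p = 0.3.
ex1 = 1 - binom_pmf(0, 10, 0.3) - binom_pmf(1, 10, 0.3)

# Exercise 2: Poisson with lambda = 2 (assumed model for rare defects).
lam = 2.0
ex2_exactly_two = exp(-lam) * lam**2 / 2
ex2_two_or_more = 1 - exp(-lam) * (1 + lam)

# Exercise 3: 4 scans, detection probability 0.1 per scan.
ex3_at_most_one = binom_pmf(0, 4, 0.1) + binom_pmf(1, 4, 0.1)
ex3_at_least_two = 1 - ex3_at_most_one

# Exercise 4: P(X < 60) = 0.05 and P(X < 65) = 0.05 + 0.40 = 0.45.
z = NormalDist()
z_60, z_65 = z.inv_cdf(0.05), z.inv_cdf(0.45)
sigma = (65 - 60) / (z_65 - z_60)
mu = 60 - z_60 * sigma

# Exercise 5: X normal with mean 18 and standard deviation 25.
X = NormalDist(18, 25)
ex5 = X.cdf(67) - X.cdf(-31)

print(round(ex1, 2), round(ex2_exactly_two, 2), round(ex2_two_or_more, 3))
print(round(ex3_at_least_two, 4), round(ex3_at_most_one, 3))
print(round(mu, 2), round(sigma, 2), round(ex5, 6))
```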

9.11 SUGGESTED READINGS FOR REFERENCE


SUGGESTED READINGS
‰‰ Richard Levin; David Rubin, Statistics for Management, Pearson
Education, 2004
‰‰ Rosen, Kenneth, H., Discrete Mathematics and its Applications,
Tata McGraw Hill Co Ltd., 2003
‰‰ Ross, Sheldon, A First Course in Probability, Pearson Education,
2003
‰‰ Salkind, N.J., Statistics for People Who (They Think) Hate
Statistics, Sage Publications, 2004
‰‰ D P Apte, Statistical Tools for Managers using MS Excel, Excel
Books, 2009
‰‰ Sharma, K.V.S., Statistics Made Simple, Prentice Hall of India,
2002
‰‰ Verma, A.P., Business Mathematics and Statistics, Asian Books
Pvt Ltd, 2002
‰‰ Levin, R.I., Statistics for Management, Prentice-Hall of India,
New Delhi, 1979
‰‰ Gupta, S.P. and Gupta, M.P., Business Statistics, Sultan Chand &
Sons, New Delhi, 1987
‰‰ Bhardwaj, R.S., Business Statistics, 2nd Edition, Excel Books,
New Delhi.

E-REFERENCES
‰‰ http://www.henry.k12.ga.us/ugh/apstat/chapternotes/7supplement.html
‰‰ http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm
‰‰ http://sites.stat.psu.edu/~babu/418dist/binom.html

APPENDIX 1
THE Z-TABLE FOR NORMAL DISTRIBUTION

Appendix A

Standard Normal Distribution


Area Under Standard Normal Curve From 0 to z
 
 
 
z 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
3.1 0.4990 0.4991 0.4991 0.4991 0.4992 0.4992 0.4992 0.4992 0.4993 0.4993
3.2 0.4993 0.4993 0.4994 0.4994 0.4994 0.4994 0.4994 0.4995 0.4995 0.4995
3.3 0.4995 0.4995 0.4995 0.4996 0.4996 0.4996 0.4996 0.4996 0.4996 0.4997
3.4 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4998
3.5 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998
3.6 0.4998 0.4998 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999
3.7 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999
3.8 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999
3.9 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000
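
Any entry of this table can be regenerated from the error function, since the area from 0 to z under the standard normal curve equals ½·erf(z/√2). The sketch below assumes Python's standard-library `math.erf`; the function name `area_0_to_z` is illustrative.

```python
from math import erf, sqrt

def area_0_to_z(z):
    """Area under the standard normal curve between 0 and z (for z >= 0)."""
    return 0.5 * erf(z / sqrt(2.0))

# Reproduce a few table entries: row 0.6 column 0.07, row 1 column 0, row 1.9 column 0.06.
for z in (0.67, 1.00, 1.96):
    print(f"z = {z:.2f}  area = {area_0_to_z(z):.4f}")
```

The cumulative probability F(z) used in the worked examples is then 0.5 + area_0_to_z(z) for z ≥ 0.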



CHAPTER 10

USE OF EXCEL SOFTWARE FOR STATISTICAL ANALYSIS

CONTENTS
10.1 Introduction
10.1.1 Microsoft Office Versions
10.2 Introduction to Excel
10.2.1 Opening a Document
10.2.2 Saving and Closing a Document
10.2.3 Excel Screen
10.2.4 Workbooks and Worksheets
10.2.5 Moving around the Worksheet
10.2.6 Moving between Cells
10.3 Entering Data in Excel
10.4 Descriptive Statistics
10.5 Basic Built-in Functions (Average, Mean, Mode, Count, Max and Min)
10.6 Statistical Analysis
10.6.1 Histogram
10.6.2 Correlation Plot and Regression Analysis
10.7 Normal Distribution
10.8 Brief about SPSS
10.8.1 SPSS Files
10.9 Summary
10.10 Descriptive Questions
10.11 Answers and Hints
10.12 Suggested Readings for Reference

INTRODUCTORY CASELET

EXCEL FUNCTIONS

There are approximately 365 functions in Excel which are divided
into 11 categories. In addition, it is possible to purchase more in the
form of add-ins, which extend the power of the spreadsheet.

Excel provides some help in choosing the right function through the
Insert Function command, but to employ functions effectively you
need to be acquainted with the mathematics behind the function.

The form of an Excel function is always =FunctionName(Argument)
where the Argument may be a cell, a group of cells or other forms
of data.


After studying this chapter, you should be able to:


  Understand the basic concepts of using Microsoft Excel
  Discuss how to enter data in excel and basic built-in functions
  Gain knowledge about SPSS

10.1 INTRODUCTION
Microsoft Office is one of the most powerful office productivity tools in
the market today. The entire suite is vast and covers a wide range of
software solutions catering to various aspects of modern businesses.
The most popular software in the MS Office Suite includes the
following:
Microsoft Word: This is a text editing software that allows users to
write all kinds of letters, messages and documents. This tool is very
powerful when it comes to textual representation, allowing users
to change fonts and page layouts, insert headers and footers, and
even generate a table of contents. There are a lot of other features that
make MS Word more than just an effective text editor.
Microsoft Excel: This is a powerful accounting and calculation
solution. It has a standard tabular layout and it supports a wide range
of arithmetic, accounting and statistical functions. It is well
suited to almost any task involving numbers.


Microsoft PowerPoint: This is the presentation tool in the MS office
suite. This software allows users to display information in different
formats and is ideal for making impactful presentations.
Microsoft Access: This is Microsoft’s desktop database solution. While
Microsoft also offers SQL Server, in many cases using MS
Access is more convenient. Due to the simplicity
of its programming and usage, many users prefer to maintain their
databases in Access.
Microsoft Project: Microsoft Project, often referred to as MPP after its
.mpp plan files, is a project management tool that allows managers to
create Gantt charts to plan and track project progress. This tool is
designed to support typical project challenges such as allocating a single
resource to multiple activities or linking activities to progress
sequentially or in parallel.
Microsoft Outlook: Microsoft Outlook is the mail client that can
be set up to download mails from a mail server as well as send and
receive emails as desired. Being a part of the Microsoft Office suite,
this tool is compatible with other applications in the suite. Thus, you
can easily copy-paste information from any of the other applications into
Outlook. Recent versions of Outlook also come with a
preview function where you can preview your Word, PowerPoint or
Excel attachments from the mail itself without having to open the file.

10.1.1 MICROSOFT OFFICE VERSIONS


One of the most popular and widely used Microsoft Office Suites is
the MS Office 2003. Later Microsoft released two other versions of
Office, namely Office 2007 and Office 2010. Although Office 2010 is the
latest version, many businesses still continue to use Office 2003.
From Office 2003 to Office 2007, Microsoft radically changed the overall look
and feel of the Office suite. Many of these changes have made
the newer suites more user-friendly and intuitive. With newer
features, better design and more versatility, Office 2007 clearly marked
a major change in office productivity software. However, at the
core, the main functionalities remained the same, ensuring that
the learning curve for existing users remained minimal. Office
2010 is an upgrade to Office 2007 that sorted out the glitches and
bugs of Office 2007 besides adding more features to the overall suite.
This book is primarily designed for Office 2010 users; however, given
the similarity, a lot of it would also be valid for Office 2007 and, to a
lesser extent, Office 2003.
With Office 2007 and Office 2010, Microsoft has bundled its Office suite
in multiple packages depending on the typical usage. Because each package
is priced economically, users can pick and choose the package that meets
their needs and thus save on the license charges. For Office
2010, Microsoft has three packages as shown in Table 10.1 below:

TABLE 10.1: MICROSOFT OFFICE SUITE
| Suite Product | Home and Student | Home and Business | Professional |
|---|---|---|---|
| Word 2010 | Included | Included | Included |
| Excel 2010 | Included | Included | Included |
| PowerPoint 2010 | Included | Included | Included |
| OneNote 2010 | Included | Included | Included |
| Outlook 2010 | - | Included | Included |
| Access 2010 | - | - | Included |
| Publisher 2010 | - | - | Included |

Fill in the blanks:


1. ................... is a text editing software that allows users to write
all kinds of letters, messages and documents.
2. ................... is a powerful accounting and calculation solution.
3. The Microsoft ................... is the mail client that can be set up to
download mails from a mail server as well as send and receive
emails as desired.
10.2 INTRODUCTION TO EXCEL
Before looking at the functionalities in MS Excel, it is important
that we be comfortable with the navigation of the software. Once we
understand how Excel is designed, using it for creating financial
models and accounting records becomes very straightforward.
Microsoft Excel 2000 (version 9) provides a set of data analysis tools
called the Analysis ToolPak which you can use to save steps when
you develop complex statistical analyses. You provide the data and
parameters for each analysis; the tool uses the appropriate statistical
macro functions and then displays the results in an output table.
Some tools generate charts in addition to output tables.
Excel is available on all public-access PCs (for example, those in the Library
and PC Labs). It can be opened either by selecting Start - Programs -
Microsoft Excel or by clicking on the Excel shortcut, which is
on the desktop or on the Office toolbar.

10.2.1 OPENING A DOCUMENT


‰‰ Click on File-Open (Ctrl+O) to open/retrieve an existing
workbook; change the directory area or drive to look for files in
other locations.
‰‰ To create a new workbook, click on File-New-Blank Workbook.

10.2.2 SAVING AND CLOSING A DOCUMENT


To save your document with its current filename, location and file
format, click on File - Save. If you are saving for the first time,
click File-Save; choose/type a name for your document; then click OK.
Use File-Save As if you want to save to a different filename/location.

When you have finished working on a document you should close it.
Go to the File menu and click on Close. If you have made any changes
since the file was last saved, you will be asked if you wish to save them.

10.2.3 EXCEL SCREEN


Menu Items: The most basic navigation tool in any Excel version is
the menu bar. The screenshot below shows the menu bar with various
items, which are designed for accessing different features of Excel.
Table 10.2 below gives a snapshot of the menu items and their overall
functions.

Figure 10.1: Menu Bar in Excel

TABLE 10.2: MENU ITEMS AND THEIR OVERALL FUNCTIONS
| Menu Items | Function |
|---|---|
| File | Main menu to open, close, save and print the spreadsheet |
| Home | Default menu for clipboard, font, alignment and other editing functions |
| Insert | Menu to insert tables, pictures, charts and other details in the spreadsheet |
| Page Layout | Menu for spreadsheet themes and page setup |
| Formulas | Menu with quick reference to all the formulae in Excel and other calculation aids |
| Data | Menu for data manipulation and analysis |
| Review | Menu for review and protection features |
| View | Menu for visual display features |
| Developer | Advanced menu for exploiting Excel features beyond pre-defined features |

Screenshot of Excel screen is shown in Figure 10.2.

Figure 10.2: Excel Screen

10.2.4 WORKBOOKS AND WORKSHEETS


Excel is built on the concept of cells, rows, columns, spreadsheets and
workbooks. The entire structure is hierarchical, and this allows it to
be scalable and versatile enough to adapt to the varying needs of users
from different specialisations. Understanding the following concepts
is useful in developing complex reports and models.
‰‰ Cell: Cell is the most basic unit in Excel. You can enter text,
numbers or formulas in the cells and build a report. The cells
can be formatted to change the font, colour, alignment and other
aesthetics to present the data in the desired format. The cell is
typically identified by the intersection of the row and column,
and the cell location can be seen in the Name Box. You can
replace the cell location with a more descriptive name and use it as a
pre-defined variable in Excel.

Figure 10.3: Cell Name Box


‰‰ Row: A row is a series of cells arranged horizontally. The rows
are numbered from the top to the bottom. The first row is ‘1’
and the last row is ‘1048576’. Though Excel provides for over a
million rows, most models and accounting reports tend to use
far fewer. You can delete rows, insert new rows between two
existing rows or perform a common formatting action on all the
cells in the row.
‰‰ Column: A column is a series of cells arranged vertically. The
columns are alphabetically named starting from ‘A’. After column
‘Z’, the next column is named ‘AA’ and this continues till ‘XFD’.
As with rows, most models do not use all these columns.
You can insert, delete or format the cells in
the column just like the rows.
‰‰ Spreadsheet: At the bottom of the Excel window you can view
tabs named ‘Sheet1’, ‘Sheet2’ and ‘Sheet3’. Each of these sheets
is a spreadsheet. You can insert spreadsheets, delete existing
spreadsheets, rename them, copy-paste the entire spreadsheet
and carry out global formatting on all the cells in the spreadsheet.
While for most simple models a single spreadsheet would suffice,
sometimes it is better to use multiple spreadsheets to keep the
data logically separated.

A spreadsheet is a collection of all the rows and columns in Excel.
Figure 10.4: Spreadsheet Tabs in Excel


‰‰ Workbook: Any Excel file is basically a workbook, as a workbook
is a collection of spreadsheets. Any action performed on the file,
such as opening, saving or closing, is an action performed
on the workbook. The concept of the workbook matters, for example,
when printing the
Excel file. The default settings in Excel let you print the current
spreadsheet; however, you can change them to print the entire
workbook. If you choose to print the entire workbook, all
the sheets in the Excel file are printed without issuing a separate
print command for every spreadsheet.
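
The column lettering described in this list (A to Z, then AA, and so on up to XFD) is a base-26 scheme without a zero digit. The sketch below, in Python with a hypothetical helper `column_letter`, converts a 1-based column number into its letter name and shows that XFD is column 16,384, while the last row, 1,048,576, equals 2 to the power 20.

```python
def column_letter(n):
    """Convert a 1-based column number to its Excel-style letter name."""
    letters = ""
    while n > 0:
        # Shift by one because the lettering has no zero digit (A = 1, Z = 26).
        n, remainder = divmod(n - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters

for number in (1, 26, 27, 16384):
    print(number, column_letter(number))

print(2 ** 20)  # number of rows in a worksheet
```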
An interesting feature of MS Excel 2007/2010 is the so-called
Ribbon, which is a band with all the menu items displayed with
a visual example. For most users, the Ribbon is a quick way to
find the right menu item for the desired purpose. However, as
you get comfortable with Excel, you may want to minimise the
Ribbon to gain some extra spreadsheet area on your computer
display. To minimise the Ribbon, press ‘Ctrl + F1’ or right-click
anywhere on the Ribbon and choose the ‘Minimise Ribbon’ option
as shown. Beginners are advised to keep the Ribbon visible, as
it makes the functions easier to locate.

Figure 10.5: Minimizing the Ribbon

Figure 10.6: Minimized Ribbon

10.2.5 MOVING AROUND THE WORKSHEET


As long as you work on the soft copies, page layouts are not really
important – you can scroll a spreadsheet to view the contents.
However, when it comes to printouts it is important that one gets the
page layouts sorted out. Excel 2010 has all the page layout options
under Page Layout menu item. There are four common settings that
can be set to get a good printout of the details on the spreadsheet:
‰‰ Margins: Margins, as the name suggests, are the page margins –
the non-usable borders on a page. Depending on the content, one
can choose from Normal, Wide or Narrow margins. When you need
to squeeze many columns or rows in a single page, use of narrower
margins is a good option. When there are fewer rows and columns
to print, using wider margins gives a neat and clean printout.

Excel 2010 also has custom margin settings, and the Last Custom
setting option gives you the option to play with the settings to get the
perfect printout.

Figure 10.7: Margin Options in Excel

‰‰ Orientation: Orientation is another important tool that you can
use to get the perfect printout. Typically, there are two standard
orientations: portrait and landscape. In the portrait mode, the
shorter edge of the page is at the bottom while in the landscape
mode the longer edge is at the bottom.

Figure 10.8: Orientation Options in Excel


‰‰ Paper Size: It is vital to ensure that the paper size fed in the
printer and the one specified in the Excel setting match, or else
the printout would not come out properly. To look at all the
options supported by Excel, click on the Size item in the Page
Setup section.


Figure 10.9: Paper Size Options in Excel


‰‰ Print Area: This is another important tool that you can use to
get the perfect printout. Sometimes, a spreadsheet may contain
a lot of information, but you may only want to print a section of
the sheet to share with others. At such times, you can select the
area you want to print and click on the ‘Set Print Area’. This
ensures that only the selected area gets printed, although there
is additional content on the spreadsheet.
Similarly, if you have already set the print area but want to
remove or change it, you can first click on ‘Clear Print Area’ to
remove the settings and then you can select the new area and
click on ‘Set Print Area’ again.

Figure 10.10: Print Area Selection
10.2.6 MOVING BETWEEN CELLS
While working with any Office productivity tool, the clipboard
functions are invaluable. The most common clipboard functions
are ‘Cut’, ‘Copy’ and ‘Paste’. In the Microsoft Office suite, there are
keyboard shortcuts for these functions. The table below maps these
shortcuts. Once you become conversant with the Excel functions,
you would prefer to use the keyboard shortcuts as they are faster and
easier to use than the mouse.

TABLE 10.3: KEYBOARD SHORTCUTS


Cut Ctrl + X
Copy Ctrl + C
Paste Ctrl + V
While the cut and copy functions are self-explanatory, the paste
function has multiple options, provided to suit the varied needs
of end users. Table 10.4 below describes the various paste options
available in Excel:

TABLE 10.4: VARIOUS PASTE OPTIONS AVAILABLE IN EXCEL
Option                    Description
Advanced Paste Options
Paste Formulas            Pastes only the formulas into the specified
                          cells. The values may therefore differ from
                          the source, but the calculations are replicated.
Paste Formulas with       Pastes the formulas together with the number
Number Formatting         formatting, which saves reformatting the cells.
Keep Source Formatting    Retains the source formatting on the new cell.
                          This is used when you are not copying formulas,
                          just numbers or text.
Transpose                 Swaps columns and rows to create a transposed
                          copy of the original content.
Value Paste Options
Value                     Pastes just the values, even if they are
                          calculated using a formula at the source.
Value with Number         Pastes the values, retaining the number
Formatting                formatting.
Value with Source         Pastes the values, retaining the cell
Formatting                formatting.

There are two other clipboard features that can be very useful in
creating a good spreadsheet:

Figure 10.11: Clipboard Screenshot


‰‰ Format Painter: The format painter tool can copy the formatting
of a cell to any other cell. To use this feature, set the desired
formatting (number format, font format, cell format) on any
single cell on the spreadsheet. Select the cell and click on the
Format Painter icon. The next cell that you click on will inherit
the same formatting as the original cell. If you want to paste the
formatting on multiple cells, select the original cell and
double-click the Format Painter icon. This allows you to format
paint multiple cells by clicking them one by one.
‰‰ Clipboard: As you keep working in a spreadsheet, you would
copy multiple cells. While most of the time you would paste the
copied cell immediately, there are instances when you may want
to paste some of the previously copied cells.

Microsoft Office suite has a clipboard that maintains a list of the
previously copied cells.
To paste any of the older values, just select the destination cell and
click on the value from the clipboard.
To open the clipboard display, click on the button at the bottom
right of the Clipboard section of the Ribbon.
Fill in the blanks:
4. To create a new workbook, click on ................... Document.
5. The most basic navigation tool in any Excel version is the
................... bar.
6. A ................... is a collection of all the rows and columns in
Excel.
7. A ................... is a collection of spreadsheets.

Collect the data of marks of Mathematics of all the students in your
class and prepare a Microsoft Excel worksheet with all the data.

While the overall look and feel of Excel has undergone a sea change
from Excel 2003 to Excel 2007/2010, the basic logic behind the
features remains consistent. Hence users of Excel 2003 should not
have any trouble using the newer versions of Excel.

10.3 ENTERING DATA IN EXCEL


A new worksheet is a grid of rows and columns. The rows are labeled
with numbers, and the columns are labeled with letters. Each
intersection of a row and a column is a cell. Each cell has an address,
which is the column letter and the row number. The arrow on the
worksheet to the right points to cell A1, which is currently highlighted,
indicating that it is an active cell. A cell must be active to enter
information into it. To highlight (select) a cell, click on it.
To select more than one cell:
‰‰ Click on a cell (e.g. A1), then hold the Shift key while you click on
another (e.g. D4) to select all cells between and including A1 and
D4.
‰‰ Click on a cell (e.g. A1) and drag the mouse across the desired
range, releasing the mouse button on another cell (e.g. D4), to
select all cells between and including A1 and D4.
‰‰ To select several cells which are not adjacent, hold the Ctrl key
and click on the cells you want to select. Click a number or letter
labelling a row or column to select that entire row or column.
‰‰ In Excel 2003, one worksheet can have up to 256 columns and 65,536
rows; Excel 2007 and later allow 16,384 columns (up to ‘XFD’) and
1,048,576 rows, so it will be a while before you run out of space.
‰‰ Each cell can contain a label, value, logical value, or formula.
‰‰ Labels can contain any combination of letters, numbers, or
symbols.
‰‰ Values are numbers. Only values (numbers) can be used in
calculations. A value can also be a date or a time.
‰‰ Logical values are “true” or “false.”
Formulas automatically do calculations on the values in other specified
cells and display the result in the cell in which the formula is entered
(for example, you can specify that cell D3 is to contain the sum of the
numbers in B3 and C3; the number displayed in D3 will then be a
function of the numbers entered into B3 and C3).

Figure 10.12
‰‰ To enter information into a cell, select the cell and begin typing.
Note that as you type information into the cell, the information
you enter also displays in the formula bar. You can also enter
information into the formula bar, and the information will appear
in the selected cell.

‰‰ When you have finished entering the label or value:
   - Press “Enter” to move to the next cell below (in this case, A2)
   - Press “Tab” to move to the next cell to the right (in this case, B1)
   - Click in any cell to select it

Entering Labels
Unless the information you enter is formatted as a value or a formula,
Excel will interpret it as a label and, by default, align the text on
the left side of the cell.

Figure 10.13

If you are creating a long worksheet and will be repeating the
same label information in many different cells, you can use the
AutoComplete function. This function will look at other entries in the
same column and attempt to match a previous entry with your current
entry. For example, if you have already typed “Wesleyan” in another
cell and you type “W” in a new cell, Excel will automatically enter
“Wesleyan.” If you intended to type “Wesleyan” into the cell, your task
is done, and you can move on to the next cell. If you intended to type
something else, e.g. “Williams,” into the cell, just continue typing to
enter the term.
‰‰ To turn on the AutoComplete function in Excel 2003, click on
“Tools” in the menu bar, then select “Options,” then select “Edit,”
and click to put a check in the box beside “Enable AutoComplete
for cell values.” In Excel 2010, the same checkbox is found under
File, Options, Advanced.
‰‰ Another way to quickly enter repeated labels is to use the Pick
List feature. Right click on a cell, and then select “Pick from
List.” This will give you a menu of all other entries in cells in
that column. Click on an item in the menu to enter it into the
currently selected cell.
Entering Values
A value is a number, date, or time, plus a few symbols if necessary to
further define the numbers [such as: + – ( ) % $ /].

‰‰ Numbers are assumed to be positive; to enter a negative number,
use a minus sign “–” or enclose the number in parentheses “()”.
‰‰ Dates are stored as MM/DD/YYYY, but you do not have to enter
them precisely in that format. If you enter “Jan 9” or “jan-9”, Excel
will recognize it as January 9 of the current year (for example,
1/9/2002 if the current year is 2002). Enter the four-digit year for
a year other than the current year (e.g. “Jan 9, 1999”). To enter
the current day’s date, press “control” and “;” at the same time.
‰‰ Times default to a 24 hour clock. Use “a” or “p” to indicate “am”
or “pm” if you use a 12 hour clock (e.g. “8:30 p” is interpreted
as 8:30 PM). To enter the current time, press “control” and “:”
(shift-semicolon) at the same time.

Figure 10.14

‰‰ An entry interpreted as a value (number, date, or time) is aligned
to the right side of the cell.
To apply colors to maximum and/or minimum values:
1. Select a cell in the region, and press Ctrl+Shift+* (in Excel 2003,
press this or Ctrl+A) to select the Current Region.
2. From the Format menu, select Conditional Formatting.
3. In Condition 1, select Formula Is, and type =MAX($F:$F)=$F1.
4. Click Format, select the Font tab, select a color, and then click
OK.
5. In Condition 2, select Formula Is, and type =MIN($F:$F)=$F1.
6. Repeat step 4, select a different color than you selected for
Condition 1, and then click OK.

Rounding Numbers that Meet Specified Criteria


Problem:  Rounding all the numbers in column A to zero decimal
places, except for those that have “5” in the first decimal place.
Solution: Use the IF, MOD, and ROUND functions in the following
formula: =IF(MOD(A2,1)=0.5,A2,ROUND(A2,0))
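The logic of the formula above can be checked outside Excel with a short Python sketch. It is only an illustration under stated assumptions: the function name `round_unless_half` is hypothetical, and Python's `%` operator stands in for MOD and `round` for ROUND. The two rounding functions differ only on exact halves, which this formula leaves untouched anyway.

```python
def round_unless_half(value: float) -> float:
    """Mirror =IF(MOD(A2,1)=0.5, A2, ROUND(A2,0)) for a single number."""
    # MOD(A2,1) is the fractional part; keep the value when it is exactly 0.5.
    if value % 1 == 0.5:
        return value
    # Otherwise round to zero decimal places.
    return float(round(value))


for number in (2.5, 3.7, 2.4, -1.2):
    print(number, "->", round_unless_half(number))
```

Values ending in exactly .5 pass through unchanged, while every other value is rounded to the nearest whole number.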
To Copy and Paste All Cells in a Sheet
‰‰ Select the cells in the sheet by pressing Ctrl+A (in Excel 2003,
select a cell in a blank area before pressing Ctrl+A, or from a
selected cell in a Current Region/List range, press Ctrl+A+A).
OR
Click Select All at the top-left intersection of rows and columns.
‰‰ Press Ctrl+C.
‰‰ Press Ctrl+Page Down to select another sheet, then select cell
A1.
‰‰ Press Enter.

To Copy the Entire Sheet
Copying the entire sheet means copying the cells, the page setup
parameters, and the defined range Names.

Option 1
‰‰ Move the mouse pointer to a sheet tab.
‰‰ Press Ctrl, and hold the mouse button to drag the sheet to a
different location.
‰‰ Release the mouse button and the Ctrl key.

Option 2
‰‰ Right-click the appropriate sheet tab.
‰‰ From the shortcut menu, select Move or Copy. The Move or Copy
dialog box enables one to copy the sheet either to a different
location in the current workbook or to a different workbook. Be
sure to mark the Create a copy checkbox.

Option 3
‰‰ From the Window menu, select Arrange.
‰‰ Select Tiled to tile all open workbooks in the window.
‰‰ Use Option 1 (dragging the sheet while pressing Ctrl) to copy or
move a sheet.

Sorting by Columns
The default setting for sorting in Ascending or Descending order is by
row. To sort by columns:
‰‰ From the Data menu, select Sort, and then Options.
‰‰ Select the Sort left to right option button and click OK.
‰‰ In the Sort by option of the Sort dialog box, select the row number
by which the columns will be sorted and click OK.
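To see what the ‘Sort left to right’ option does to the data, the following Python sketch reorders the columns of a small table according to the values in one chosen row. The table contents and the helper name `sort_columns_by_row` are hypothetical and serve only to illustrate the idea.

```python
def sort_columns_by_row(rows, key_row, descending=False):
    """Reorder whole columns according to the values found in one row."""
    # Turn the list of rows into a list of columns.
    columns = list(zip(*rows))
    # Sort the columns by the entry each one holds in the chosen row.
    columns.sort(key=lambda column: column[key_row], reverse=descending)
    # Turn the sorted columns back into rows.
    return [list(row) for row in zip(*columns)]


table = [
    ["Q3", "Q1", "Q2"],   # row 0: hypothetical column labels
    [300, 100, 200],      # row 1: hypothetical values
]
sorted_table = sort_columns_by_row(table, key_row=0)
for row in sorted_table:
    print(row)
```

Each column moves as a unit, so the value under every label stays attached to it after the sort.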


State whether the following statements are true/false:


8. Unless the information you enter is formatted as a value or a
formula, Excel will interpret it as a label, and defaults to align
the text on the left side of the cell.
9. A value is a number, date, or time, plus a few symbols if
necessary to further define the numbers.
10. The default setting for sorting in Ascending or Descending
order is by column.

Collect the data of age of teachers in your college. Prepare a
frequency distribution table in an Excel worksheet.

Be sure to distinguish between absolute reference and relative
reference when entering the formulas.

10.4 DESCRIPTIVE STATISTICS


The Data Analysis ToolPak has a Descriptive Statistics tool that
provides you with an easy way to calculate summary statistics for a set
of sample data. Summary statistics include Mean, Standard Error,
Median, Mode, Standard Deviation, Variance, Kurtosis, Skewness,
Range, Minimum, Maximum, Sum, and Count. This tool eliminates
the need to type individual functions to find each of these results.
Excel includes elaborate and customisable toolbars, for example the
“standard” toolbar shown here:

Figure 10.15
Some of the icons are useful for mathematical computation:
‰‰ The “AutoSum” icon enters the formula “=SUM()” to add up a
range of cells.
‰‰ The “Function Wizard” icon gives you access to all the
functions available.
‰‰ The “Graph Wizard” icon gives access to all graph types
available, as shown in this display:

Figure 10.16
Excel can be used to generate measures of location and variability for
a variable. Suppose we wish to find descriptive statistics for the
sample data 2, 4, 6, and 8.
Step 1: Select the Tools pull-down menu. If you see Data Analysis, click
on this option; otherwise, click on the Add-Ins option to install the
Analysis ToolPak.
Step 2: Click on the Data Analysis option.
Step 3: Choose Descriptive Statistics from the Analysis Tools list.
Step 4: When the dialog box appears, enter A1:A4 in the Input Range
box. A1 is the cell in column A and row 1; in this case its value is 2.
The remaining values are entered in the same way down to the last one,
A4. If a sample consists of 20 numbers in cells A1 to A20, the input
range would be A1:A20.
Step 5: Select an output range, in this case B1. Click on Summary
Statistics to see the results, then select OK.
When you click OK, you will see the result in the selected range.
As you will see, the mean of the sample is 5, the median is 5, the
standard deviation is 2.581989, the sample variance is 6.666667,
the range is 6, and so on. Each of these statistics might be important in
your calculation of different statistical procedures.
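The figures reported above for the sample 2, 4, 6 and 8 can be cross-checked with Python's standard `statistics` module. This is a minimal sketch, not a reproduction of the ToolPak itself; the dictionary keys simply borrow the ToolPak's labels, and the standard error is computed here as the sample standard deviation divided by the square root of the count.

```python
import math
import statistics

sample = [2, 4, 6, 8]

summary = {
    "Mean": statistics.mean(sample),
    "Standard Error": statistics.stdev(sample) / math.sqrt(len(sample)),
    "Median": statistics.median(sample),
    "Standard Deviation": statistics.stdev(sample),    # n-1 (sample) method
    "Sample Variance": statistics.variance(sample),    # n-1 (sample) method
    "Range": max(sample) - min(sample),
    "Minimum": min(sample),
    "Maximum": max(sample),
    "Sum": sum(sample),
    "Count": len(sample),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The sample variance is 20/3 because the squared deviations from the mean (9, 1, 1 and 9) add up to 20 and are divided by n - 1 = 3; the standard deviation is its square root.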


Fill in the blanks:


11. The Data Analysis ToolPak has a ................... Statistics tool that
provides you with an easy way to calculate summary statistics
for a set of sample data.
12. is the “...................” icon, which enters the formula
“=sum()” to add up a range of cells.
13. Excel can be used to generate measures of location and
variability for a ................... .

Represent the following information in the form of a table in an
MS Excel worksheet.

The number of students in a college in the year 1961 was 1100;
of those 980 were boys and the rest girls. In 1971 the number of boys
increased by 100% and that of girls increased by 300% as compared
to their strength in 1961. In 1981 the total number of students in the
college was 3600, the number of boys being double the number of
girls.

10.5 BASIC BUILT-IN FUNCTIONS (AVERAGE, MEAN, MODE, COUNT, MAX AND MIN)
Excel is a very powerful accounting tool, but before going to the really
complex functions, let us see how to use Excel for simple calculations.
There are two ways of using Excel for simple calculations: you can
enter the actual arithmetic equations in the cell or use pre-defined
Excel formulas to do the same. The following sections explain how
Excel can be used to carry out simple arithmetic functions.

Manual Equation Entry


Excel cells can be used as simple calculators. By starting data entry
in any Excel cell with the ‘equal to’ (=) sign, you can create formulae
to calculate any data in Excel using standard mathematical operators.
For example, let us say that you need to calculate the total sales for
ABC Style Centre for 1st April 2013. To do this, you need the total sales
numbers of all the dress items sold at the store on that date. The figure
below shows the sales under each dress category, namely Men’s Jeans,
T-shirts, Skirts, Women’s Jeans and Tops. To calculate the total sales
for the day, you can simply add all values as =E3+E4+E5+E6+E7.
Notice that the cell entry starts with an ‘=’ sign.


Figure 10.17

Similarly, the total sale for each category is calculated using a
different mathematical operation: product. For example, the total
T-shirt sale on 1st April, 2013 is =C4*D4. C4 is the rate of each T-shirt,
while D4 is the total T-shirts sold on that day.

Figure 10.18
While simple calculations can be done by manually listing each cell
individually, this is not very scalable. Besides, accounting and financial
management are not just about sums and products. There are many
other complex functions, and Excel accommodates these using
pre-defined functions.
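The two manual calculations described in this section (a product per category, then a sum of the category totals) can be sketched in Python as follows. The rates and quantities are hypothetical placeholders, since the figures from the worksheet screenshots are not reproduced in the text; only the category names come from the example.

```python
# Hypothetical (rate, quantity sold) pairs for each dress category.
sales = {
    "Men's Jeans": (1200, 5),
    "T-shirts": (400, 10),
    "Skirts": (800, 3),
    "Women's Jeans": (1100, 4),
    "Tops": (500, 6),
}

# Equivalent of =C4*D4 on every row: rate multiplied by quantity.
category_totals = {item: rate * quantity for item, (rate, quantity) in sales.items()}

# Equivalent of =E3+E4+E5+E6+E7: add up the category totals.
total_sales = sum(category_totals.values())

for item, amount in category_totals.items():
    print(f"{item}: {amount}")
print("Total sales:", total_sales)
```

Adding a sixth category here only means adding one entry to the data, whereas the manual formula would have to be edited by hand, which is the scalability problem noted above.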

Arithmetic Functions in Excel


Excel supports a host of functions that can be used to effectively
manipulate the numbers to get the desired results. Most commonly
used Excel functions can be broadly classified into five categories:
Arithmetic, Date & Time functions, Logical, Lookup and Statistical.

TABLE 10.5: FUNCTION, SYNTAX AND DESCRIPTION
Function     Syntax            Description
FACT         FACT(n)           Factorial: product of all the integers from 1 to n
GCD          GCD(x,y,…)        Returns the greatest common divisor of the arguments
INT          INT(n)            Rounds the number n down to the nearest integer
LCM          LCM(x,y,…)        Returns the least common multiple of the arguments
POWER        POWER(n,r)        Returns n raised to the power r
PRODUCT      PRODUCT(x,y,…)    Returns the product of the arguments
QUOTIENT     QUOTIENT(n,d)     Returns the integer portion of the quotient n/d
ROUND        ROUND(n,d)        Rounds the number n to d digits after the decimal point
ROUNDDOWN    ROUNDDOWN(n,d)    Rounds n down (toward zero) to d digits after the decimal point
ROUNDUP      ROUNDUP(n,d)      Rounds n up (away from zero) to d digits after the decimal point
SIGN         SIGN(n)           Returns the sign of the number (–1 = negative, 0 = zero, 1 = positive)
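For readers who want to verify what the functions in Table 10.5 return, the sketch below reproduces their behaviour with Python's standard library. The helper names (`lcm`, `quotient`, `round_digits`, `sign`) are hypothetical and written for this illustration only. The `decimal` module is used for the rounding functions so that halves round away from zero, which is an assumption about matching Excel's ROUND rather than a statement about how Excel is implemented.

```python
import math
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP, ROUND_UP


def lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers, like LCM(a,b)."""
    return abs(a * b) // math.gcd(a, b)


def quotient(n: float, d: float) -> int:
    """Integer portion of n/d, truncated toward zero, like QUOTIENT(n,d)."""
    return math.trunc(n / d)


def round_digits(value: str, digits: int, mode: str = ROUND_HALF_UP) -> Decimal:
    """Round a decimal string to `digits` places using the given rounding mode."""
    quantum = Decimal(1).scaleb(-digits)   # digits=2 gives Decimal('0.01')
    return Decimal(value).quantize(quantum, rounding=mode)


def sign(n: float) -> int:
    """Return -1, 0 or 1 depending on the sign of n, like SIGN(n)."""
    return (n > 0) - (n < 0)


print("FACT(5)            =", math.factorial(5))
print("GCD(12,18)         =", math.gcd(12, 18))
print("INT(-2.5)          =", math.floor(-2.5))
print("LCM(4,6)           =", lcm(4, 6))
print("POWER(2,3)         =", 2 ** 3)
print("PRODUCT(2,3,4)     =", math.prod([2, 3, 4]))
print("QUOTIENT(7,2)      =", quotient(7, 2))
print("ROUND(2.345,2)     =", round_digits("2.345", 2))
print("ROUNDDOWN(2.345,2) =", round_digits("2.345", 2, ROUND_DOWN))
print("ROUNDUP(2.341,2)   =", round_digits("2.341", 2, ROUND_UP))
print("SIGN(-7)           =", sign(-7))
```

Note the difference between INT and the rounding functions: INT always moves down to the next lower integer, so a negative number such as -2.5 becomes -3.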
SUM Function
The SUM function is probably the most commonly used function in
Excel. It comes in three flavours, namely SUM, SUMIF and
SUMIFS.
‰‰ SUM(): This is a simple sum function that adds all the arguments in
the list. In the ABC Style Centre example, =E3+E4+E5+E6+E7
can be replaced with the sum function as =SUM(E3,E4,E5,E6,E7).
Since the cells E3 to E7 are contiguous, they can be written as
a range: =SUM(E3:E7). Note that the range format can only be
used for a contiguous series of cells; if there are breaks in the
list, you can combine ranges with individual cells, for example
=SUM(E3:E5,E7).
‰‰ SUMIF(): This is a conditional sum function where the items
in the list get added only when the pre-defined condition is
satisfied. This function is typically used when you want to add
certain numbers, for example calculating the annual salary for
a particular employee from the monthly expense sheet for the
entire workforce. The syntax is =SUMIF(match_range,
match_criteria, sum_range). The match_range is the list on which
the match needs to be made, the match_criteria is the condition
being matched and sum_range is the list of items to be summed.
‰‰ SUMIFS(): This is another variation of the sum function where
multiple conditions can be matched at the same time. Depending
on how many conditions need to be matched, SUMIF or SUMIFS
can be used. The syntax for this function is =SUMIFS(sum_range,
match_range1, match_criteria1, match_range2, match_criteria2, …).
As seen from the syntax, the arguments for SUMIFS are similar
to those of SUMIF, but the order of arguments is different:
sum_range comes first in SUMIFS and last in SUMIF.
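The difference between SUMIF and SUMIFS, including the different argument order, can be sketched in Python as below. This is an illustration under stated assumptions: the functions `sumif` and `sumifs` are hypothetical stand-ins, criteria are passed as Python functions rather than Excel criteria strings, and the salary data is invented for the example.

```python
def sumif(match_range, match_criteria, sum_range=None):
    """Add the items of sum_range whose matching cell satisfies the criterion."""
    if sum_range is None:
        sum_range = match_range
    return sum(value for cell, value in zip(match_range, sum_range)
               if match_criteria(cell))


def sumifs(sum_range, *ranges_and_criteria):
    """Add the items of sum_range for which every (range, criterion) pair matches."""
    match_ranges = ranges_and_criteria[0::2]
    criteria = ranges_and_criteria[1::2]
    total = 0
    for position, value in enumerate(sum_range):
        if all(test(cells[position]) for cells, test in zip(match_ranges, criteria)):
            total += value
    return total


# Hypothetical monthly salary sheet for the whole workforce.
employees = ["Asha", "Ravi", "Asha", "Ravi", "Asha"]
months = ["Jan", "Jan", "Feb", "Feb", "Mar"]
salaries = [30000, 25000, 30000, 26000, 31000]

asha_total = sumif(employees, lambda name: name == "Asha", salaries)
asha_jan_feb = sumifs(salaries,
                      employees, lambda name: name == "Asha",
                      months, lambda month: month in ("Jan", "Feb"))
print(asha_total, asha_jan_feb)
```

As in Excel, the range to be summed is the last argument of `sumif` but the first argument of `sumifs`.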

Logical Functions
Excel also supports standard logical or Boolean functions, which
are very useful in testing special conditions. Not all calculations are
limited to SUM; hence you need to rely on a combination of logical
and other functions to get results similar to SUMIF and SUMIFS. The
commonly supported logical functions are described below:
‰‰ AND(): This function performs a logical AND operation on the
arguments; if all the values test to be TRUE, the final result is
TRUE, else it is FALSE. Most of the time, AND is used inside
other functions rather than on its own, although depending on
the requirement, one could use this function directly too.
‰‰ FALSE: This is an argument-less function. Typically, FALSE is
used as a value to test a logical condition. To reduce the effort
of typing FALSE every time, you can define a cell with value
FALSE and reference the cell in your sheet.
‰‰ IF(): This is a very useful logical function used to test most
conditions. The syntax for this function is =IF(logical expression,
value if true, value if false). IF functions can be nested,
which means you can test more than one condition by placing one
IF function inside another IF function. However, while nesting IF
functions you need to ensure that you have got your logic correct,
else you may get incorrect results.
‰‰ IFERROR(): One of the common challenges you would face in
Excel is getting odd expressions for function results. Typical error
values include #DIV/0!, #VALUE! and #N/A. If you want to avoid
such errors in your spreadsheet and would prefer more graceful
error messages, you can use the IFERROR function. The syntax
for this function is =IFERROR(expression, value if error). If the
expression returns a valid result, it will be displayed; if it returns
an error, then instead of the odd error value, a more sensible
value is returned.
‰‰ NOT(): This is another logical function used mainly in other
functions rather than independently. It reverses the logical value
of its argument: NOT(TRUE)=FALSE and NOT(FALSE)=TRUE.

‰‰ OR(): This function performs a logical OR operation on the
arguments; if all the values test to be FALSE, the final result
is FALSE, else it is TRUE. Most of the time, OR is used inside
other functions rather than on its own, although depending on
the requirement, one could use this function directly too.
‰‰ TRUE: This is an argument-less function. Typically, TRUE is
used as a value to test a logical condition. To reduce the effort of
typing TRUE every time, you can define a cell with value TRUE
and reference the cell in your sheet.
Apart from the IF function, the logical Excel functions are rarely
used independently. They are mainly used in conjunction with IF
statements to validate conditions and return responses accordingly.
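The behaviour of a nested IF and of IFERROR can be illustrated with a short Python sketch. Everything named here is hypothetical: the `grade` function with its thresholds of 75 and 40 is an invented example of nesting, and `iferror` is a stand-in that treats a Python exception as the counterpart of an Excel error value.

```python
def grade(marks: float) -> str:
    """Nested test, like =IF(A1>=75,"Distinction",IF(A1>=40,"Pass","Fail"))."""
    if marks >= 75:
        return "Distinction"
    # The second IF is evaluated only when the first condition is FALSE.
    if marks >= 40:
        return "Pass"
    return "Fail"


def iferror(expression, value_if_error):
    """Evaluate a zero-argument function; return the fallback if it fails."""
    try:
        return expression()
    except (ZeroDivisionError, ValueError, TypeError, KeyError):
        return value_if_error


print(grade(80), grade(50), grade(10))
print(iferror(lambda: 10 / 2, "n/a"))   # valid result is passed through
print(iferror(lambda: 10 / 0, "n/a"))   # division by zero gives the fallback
```

The order of the nested tests matters: if the check for 40 came first, a score of 80 would be reported as a pass and the distinction branch would never be reached, which is the kind of logic error the text warns about.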
Statistical Functions
Statistical functions are invaluable in any mathematical calculations.
They can provide insights into trends, provide data for detailed
analysis, and help identify gaps that need to be plugged. Excel
provides a wide range of functions that can be used to perform basic
statistical analyses.

TABLE 10.6: FUNCTIONS THAT CAN BE USED TO PERFORM BASIC STATISTICAL ANALYSIS
Function     Syntax                        Description
AVERAGE      AVERAGE(x,y,…)                Returns the average of the list of arguments,
                                           excluding any text
AVERAGEA     AVERAGEA(x,y,…)               Returns the average of the list of arguments,
                                           including text (counted as 0)
AVERAGEIF    AVERAGEIF(match range,        Matches the cells in the match range to the
             criteria, calculation range)  match criteria and returns the average for the
                                           calculation range. If the calculation range is
                                           not specified, the match range is used.
AVERAGEIFS   AVERAGEIFS(calculation        Calculates the average for the calculation
             range, match range 1,         range based on the multiple match criteria
             criteria 1, match range 2,    specified
             criteria 2, ...)
COUNT        COUNT(x,y,…)                  Returns the number of cells with numbers in
                                           the range
COUNTA       COUNTA(x,y,…)                 Returns the number of cells with any
                                           information in the range
COUNTBLANK   COUNTBLANK(range)             Returns the number of empty or blank cells in
                                           the range
COUNTIF      COUNTIF(range, criteria)      Returns the count of cells that match the
                                           criteria
COUNTIFS     COUNTIFS(range1, criteria1,   Returns the count of cells that match multiple
             range2, criteria2, …)         criteria
MAX          MAX(x,y,…)                    Returns the largest value in the range of
                                           numbers
MAXA         MAXA(x,y,…)                   Returns the largest value from the range of
                                           cells with any information
MEDIAN       MEDIAN(x,y,…)                 Returns the median for the range of numbers
MIN          MIN(x,y,…)                    Returns the smallest value in the range of
                                           numbers
MINA         MINA(x,y,…)                   Returns the smallest value from the range of
                                           cells with any information
STDEV        STDEV(x,y,…)                  Returns the standard deviation for the range
                                           of numbers using the n-1 method; kept for
                                           backward compatibility with older Excel
                                           versions
STDEV.P      STDEV.P(x,y,…)                Returns the standard deviation for the range
                                           of numbers using the n method
STDEVPA      STDEVPA(x,y,…)                Returns the standard deviation for the range
                                           of cells with any information using the n
                                           method
STDEV.S      STDEV.S(x,y,…)                Returns the standard deviation for the range
                                           of numbers using the n-1 method
VAR          VAR(x,y,…)                    Returns the variance for a sample of the range
                                           of numbers; kept for backward compatibility
                                           with older Excel versions
VAR.P        VAR.P(x,y,…)                  Returns the variance for the complete range of
                                           numbers
VAR.S        VAR.S(x,y,…)                  Returns the variance for a sample of the range
                                           of numbers
VARA         VARA(x,y,…)                   Returns the variance for a sample of the range
                                           of cells with any information
VARPA        VARPA(x,y,…)                  Returns the variance for the complete range of
                                           cells with any information
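The distinction Table 10.6 draws between the n-1 (sample) and n (population) methods, and between COUNT, COUNTA and COUNTBLANK, can be illustrated with Python's standard `statistics` module. This is a sketch on invented data; the pairing of each Python call with an Excel function in the comments is an assumption made for illustration, and `None` is used here to represent an empty cell.

```python
import statistics

data = [2, 4, 4, 6, 9]   # hypothetical observations

average = statistics.mean(data)                  # like AVERAGE
median = statistics.median(data)                 # like MEDIAN
mode = statistics.mode(data)                     # most frequent value
sample_sd = statistics.stdev(data)               # like STDEV.S (n-1 method)
population_sd = statistics.pstdev(data)          # like STDEV.P (n method)
sample_var = statistics.variance(data)           # like VAR.S (n-1 method)
population_var = statistics.pvariance(data)      # like VAR.P (n method)

# A hypothetical range containing numbers, text and empty cells (None).
cells = [2, "n/a", None, 4.5, None]


def is_number(cell) -> bool:
    return isinstance(cell, (int, float)) and not isinstance(cell, bool)


count = sum(1 for cell in cells if is_number(cell))        # like COUNT
counta = sum(1 for cell in cells if cell is not None)      # like COUNTA
countblank = sum(1 for cell in cells if cell is None)      # like COUNTBLANK

print(average, median, mode, max(data), min(data))
print(sample_var, population_var, sample_sd, population_sd)
print(count, counta, countblank)
```

For this data the squared deviations from the mean add up to 28, so the sample variance is 28/4 = 7 while the population variance is 28/5 = 5.6. The sample figure is always the larger of the two for the same data.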

State whether the following statements are true/false:


14. SUM Function comes in three flavours in Excel, namely SUM,
SUMIF and SUMIFS.
15. SUMIFS is a conditional sum function where the items in the
list get added only when the pre-defined condition is satisfied.
16. OR function performs logical AND operation on the arguments;
if all the values test to be TRUE, the final result is TRUE, else
it is FALSE.

Find the inflation rate in India over the past one year on a monthly
basis. Find the mean, median and mode using MS Excel.

10.6 STATISTICAL ANALYSIS


One of the most powerful features of Excel is its ability to represent
large data volumes in easy-to-decipher formats including pivot tables
and varied charts. Understanding these features is very important
for any modelling or analysis exercise. While it is great to be able
to calculate numbers, they count for little if they are not presented
in the right format. In this section you will see some of the
commonly used data representation methods in Excel.

Creating Charts
‰‰ Select the data range (only numbers) for which the chart needs
to be created.
‰‰ Under the Insert Ribbon, in the Chart section, click on the type
of chart you want to create and the category. Here the clustered
chart has been used.
‰‰ Select the chart and click on Select Data button in Data section
of the Design Layout.
‰‰ In the Select Data Source dialog, select ‘Series 1’ and click on
Edit button.


Figure 10.19: Select Data Source


This opens the Edit Series dialog that allows you to change the range
of values in series and provide a Series name. For the series name,
click on icon to select the column title of Series 1.

Figure 10.20: Edit Series

‰‰ Click on icon and click OK. Observe how Series 1 changes
to the proper series name. Repeat the same for subsequent series.
‰‰ If you do not want to use the column title and instead want to
use a different name, then in the Edit Series dialog, instead of
selecting a cell reference, just type the name without the ‘=’ sign.

‰‰ By playing with the icons, you can move the series up
and down. This will change the order of the series in the clustered
graph.
‰‰ If you selected a much larger range in the first attempt and want
to trim the number of series, you can select the unwanted series
and click on Remove.
‰‰ Also, if you want to append additional series to the chart, you can
use the Add option. The Add option opens the Edit Series dialog
just like Edit does, except that this time the dialog is practically
empty and you need to manually add the contents.

10.6.1 HISTOGRAM
Follow the steps given below to draw a histogram. (The dialog names
are those of the Chart Wizard in Excel 97-2003.)
‰‰ Select the first two columns, i.e. class interval and frequency,
in the Excel sheet.
‰‰ Click on the ‘Chart Wizard’ icon on the tool bar, or select
[Insert → Chart…] from the Insert drop-down menu. A dialog box
titled ‘Chart Wizard – Step 1 of 4 – Chart Type’ will appear.
‰‰ In the ‘Standard Types’ tab, select ‘Column’. Click on the ‘Next’
button.
‰‰ The next dialog box, titled ‘Chart Wizard – Step 2 of 4 – Chart
Source Data’, will appear. Since we have already selected the
source data, click ‘Next’, but first check that ‘Columns’ is
selected for the data series.
‰‰ The next dialog box, titled ‘Chart Wizard – Step 3 of 4 – Chart
Options’, will appear.
‰‰ Enter an appropriate chart title, such as ‘Histogram: Reading Time
for 20 Pages’, the caption for the category (X) axis as ‘Time in
minutes’ and the caption for the value (Y) axis as ‘Number of
students’. Select major grid lines for the Y axis. Keep the other
options at their defaults. Then click on the ‘Next’ button.
‰‰ The next dialog box, titled ‘Chart Wizard – Step 4 of 4 – Chart
Location’, will appear.
‰‰ Select the ‘As object in’ option to place the chart in the
existing worksheet, then click the ‘Finish’ button. The chart will
appear in its final form as a column chart.

‰‰ To convert the column chart to a histogram we need to join the
columns. Left-click on any of the columns, then right-click and
select the ‘Format Data Series’ option (or select ‘Format’ from
the tool bar and click on ‘Selected Data Series’). The ‘Format
Data Series’ dialog box will appear. Select the ‘Options’ tab and
reduce the ‘Gap Width’ to zero. You can see the column chart
becoming a histogram. Now click on ‘OK’. The histogram is ready
and appears on the Excel worksheet. You can move it by dragging,
or resize it using the corner handles. You can also export it to
MS Word or PowerPoint by copy and paste.
‰‰ Use the draw option from the tool bar to draw two diagonal lines
to locate the mode: join the top-left corner of the tallest bar to
the top-left corner of the bar on its right, and the top-right
corner of the tallest bar to the top-right corner of the bar on its
left. Then draw a vertical line from the point of intersection.
The value of the mode can be read from the abscissa of the
intersection point.
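
The diagonal construction in the last step can also be checked by calculation. The following is a minimal sketch in Python rather than part of the Excel procedure. The class limits and frequencies are hypothetical values chosen only for illustration, `mode_from_histogram` is an assumed helper name, and equal class widths are assumed. It finds where the two diagonals across the tallest bar intersect.

```python
def mode_from_histogram(lower_limits, width, freqs):
    """Mode of grouped data from the intersection of the two diagonals
    drawn across the tallest bar (equal class widths assumed)."""
    k = max(range(len(freqs)), key=freqs.__getitem__)  # index of tallest bar
    f1 = freqs[k]
    f0 = freqs[k - 1] if k > 0 else 0                  # bar on the left, 0 if none
    f2 = freqs[k + 1] if k < len(freqs) - 1 else 0     # bar on the right, 0 if none
    # Diagonal A runs from (L, f1) to (L + h, f2); diagonal B from (L, f0)
    # to (L + h, f1). They cross at the fraction t of the class width.
    t = (f1 - f0) / (2 * f1 - f0 - f2)
    return lower_limits[k] + t * width

# Hypothetical data: classes 10-20, 20-30, 30-40, 40-50, 50-60
lower_limits = [10, 20, 30, 40, 50]
freqs = [4, 10, 16, 12, 8]

mode = mode_from_histogram(lower_limits, 10, freqs)
print(round(mode, 2))
```

For these illustrative figures the tallest bar is the class 30-40, and the intersection falls at 36, which is the value one would read off the X axis of the histogram. The sketch does not handle the case where the tallest bar and both its neighbours have equal height.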

10.6.2 CORRELATION PLOT AND REGRESSION ANALYSIS
Using MS Excel for Calculating Karl Pearson’s Correlation Coefficient
Calculating Karl Pearson’s correlation coefficient using MS Excel is
very simple. The steps are as follows:
‰‰ Open an Excel worksheet and enter the data values of the X and Y
variables as two arrays (columns or rows). Keep these contiguous
if possible.
‰‰ Select the cell where you want to store the result r. Enter the
formula with the syntax
‘=CORREL(array1, array2)’
where ‘array1’ is a cell range of values and ‘array2’ is a second
cell range of values.
‰‰ Alternatively, we can insert the paste function
‘=CORREL(array1,array2)’ from the menu [Insert→Function…
→Statistical→CORREL] if using MS Excel 97-2003, from the ribbon
[Formulas→Insert Function→Statistical→CORREL] if using MS Excel
2007, or by clicking the fx (‘Insert Function’) icon on the formula
bar. Once the ‘Function Arguments’ dialog box for ‘CORREL’
appears, select the values of X and Y as array1 and array2
respectively. Then press the OK button.
Besides the Insert→Function… menu, MS Excel also has a set of data
analysis tools called the Analysis ToolPak. These tools can be
accessed through the menu [Tools→Data Analysis…→Correlation] if you
are using MS Excel 97-2003, or through [Data→Data Analysis→
Correlation] on the ribbon if you are using MS Excel 2007, and then
by following the ‘Correlation’ dialog box. With the Analysis ToolPak
we can find correlation coefficients between several variables at
once; it can equally be used for finding the correlation coefficient
between two variables. The result is displayed as a correlation
matrix, in which the correlation coefficient of a variable with
itself is always 1. The procedure is as follows:
‰‰ Open an Excel worksheet and enter the data values of the X and Y
variables as two arrays (columns or rows). Keep these contiguous.
‰‰ Select any cell and open the Correlation tool from the ribbon as
[Data→Data Analysis→Correlation]. Complete the dialog box with the
following details.
 Input Range: Either type the cell references or select the data
with the mouse.
 Grouped By: Select Columns or Rows according to how the data has
been entered.
 Labels in First Row/Column: Check the box if you have used
labels.
 Output Range / New Worksheet Ply / New Workbook: Select as
appropriate. If you choose an output range, make sure it is
large enough to hold the output matrix for the number of
variables.
Then press the OK button.
‰‰ You will get the correlation coefficients between pairs of
variables as a correlation matrix.
Example: The data of advertisement expenditure (X) and sales (Y)
of a company for the past 10-year period are given below. Determine
the correlation coefficient between these variables and comment on
the correlation.
X   50   50   50   40   30   20   20   15   10    5
Y  700  650  600  500  450  400  300  250  210  200
Solution: Method I: Using paste function ‘CORREL’
‰‰ Open an Excel worksheet and enter the data values of the X and Y
variables as two arrays (columns) from cell B3 to B12 and C3
to C12 respectively. It is good practice to give headings in cells
B2 and C2.
‰‰ Select cell D3 for the result. Type =CORREL(B3:B12,C3:C12)
and press Enter. Instead, we can also use [Insert→Function…
→Statistical→CORREL], follow the dialog box to enter
Array1 and Array2, and then select the OK button.

We will get the result in cell D3 as 0.976357.

Method II: Using Data Analysis ToolPak
‰‰ Open an Excel worksheet and enter the data values of the X and Y
variables as two arrays (columns) from cell B3 to B12 and C3
to C12 respectively. It is good practice to give headings in cells
B2 and C2.
‰‰ Select any cell, say D3, and use the menu [Tools→Data Analysis…→
Correlation] in case you have Excel 97-2003. For Excel 2007,
select ‘Data’ on the ribbon and then click the ‘Data Analysis’
icon. We get the Data Analysis dialog box. Select ‘Correlation’
and press the ‘OK’ button in the box.
‰‰ Then follow the dialog box: enter ‘Input Range’ as $B$2:$C$12,
under ‘Grouped By’ select ‘Columns’, check ‘Labels in First Row’,
enter ‘Output Range’ as $D$3:$F$5 and then select the OK button.
The Excel sheet will be as follows:

We will get the result as,

        X          Y
X       1
Y       0.976357   1

Thus we get the correlation coefficient as 0.976357, which indicates
a very strong positive correlation between advertisement expenditure
and sales.
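
To see what CORREL computes behind the scenes, the calculation can be reproduced outside Excel. The sketch below is only an illustration in Python using the data of this example; `pearson_r` is an assumed helper name, not an Excel or library function. It applies the usual formula r = Sxy / sqrt(Sxx × Syy), where Sxy, Sxx and Syy are sums of products of deviations from the means.

```python
from math import sqrt

# Data of the example: advertisement expenditure (X) and sales (Y)
x = [50, 50, 50, 40, 30, 20, 20, 15, 10, 5]
y = [700, 650, 600, 500, 450, 400, 300, 250, 210, 200]

def pearson_r(xs, ys):
    """Karl Pearson's correlation coefficient from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(x, y)
print(round(r, 6))
```

The printed value should agree with the 0.976357 reported by Excel above, and `pearson_r(x, x)` returns 1, which is why the diagonal of the correlation matrix is always 1.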

Using Excel to Solve Regression


Steps for solving the problem are:
‰‰ Open an MS Excel worksheet. Enter the data of X and Y in two
adjacent columns (say X variable from B3 to B10 and Y variable
from C3 to C10).
‰‰ Select any cell and use the Analysis ToolPak tool ‘Regression’
from the menu as follows:
Select from the drop-down menu [Tools→Data Analysis…
→Regression→OK] if you are using Excel 97-2003, or
select from the ribbon [Data→Data Analysis…→Regression→OK] if you
are using Excel 2007.
‰‰ The Regression dialog box will appear. Enter ‘Input Y Range:’
and ‘Input X Range:’ and select other choices as required. Select
the output range where you want the output to be displayed. (You
can also get Residual Plots, Line Fit Plots, Normal Probability
Plots, etc. It is recommended that you try these on your own;
they are useful guides during decision-making.) Then select ‘OK’.
‰‰ We get the result (output) as summary output, ANOVA and
coefficient analysis (besides the residual plots you have asked
for).

Example: The data below give the transit time in days for a random
sample of 10 consignments, with the related distance.

X  Distance in 100 km    4  5  6  7  9  9  10  11  11  12
Y  Transit time in days  4  5  5  6  7  6   7   8   7   8

‰‰ Find the best-fit linear relationship of transit time on distance.
‰‰ Also estimate the transit time for a new location at a distance of
800 km.
‰‰ Also compute the correlation coefficient and assess whether the
relation can be deemed reasonably valid.
‰‰ Find the coefficient of determination R² and explain its
significance.
Solution: Using MS Excel
As we have seen earlier, MS Excel is fast, simple to use and
provides much more analysis when solving correlation and regression
problems. We do not need the shortcut methods of shifting origin or
changing scale. Steps for solving this problem are:
‰‰ Open an MS Excel worksheet. Enter the data of X and Y in two
adjacent columns (say X variable from B3 to B12 and Y variable
from C3 to C12).
‰‰ Select any cell and use the Analysis ToolPak tool ‘Regression’
from the menu as follows:
[Data→Data Analysis…→Regression→OK].
‰‰ The Regression dialog box will appear. Enter ‘Input Y Range:’
and ‘Input X Range:’ and select other choices as required. Select
the output range where you want the output to be displayed. (You
can also get Residual Plots, Line Fit Plots, Normal Probability
Plots, etc. It is recommended that you try these on your own;
they are useful guides during decision-making.) Then select ‘OK’.
‰‰ We get the result (output) as summary output, ANOVA and
coefficient analysis (besides the residual plots you have asked
for).
In this example, the result is

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.958266
R Square            0.918274
Adjusted R Square   0.908058
Standard Error      0.405554
Observations        10

ANOVA
             df   SS         MS         F        Significance F
Regression   1    14.78421   14.78421   89.888   1.26E-05
Residual     8    1.315789   0.164474
Total        9    16.1

              Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept     2.394737      0.43141         5.550948  0.00054   1.399903   3.389571   1.399903     3.389571
X Variable 1  0.464912      0.049037        9.480928  1.26E-05  0.351834   0.577991   0.351834     0.577991

We can see that:
Correlation coefficient r = 0.958266
Coefficient of determination R² = r² = 0.918274, that is, about
91.8% of the variation in transit time is explained by distance
Regression coefficient bYX = 0.464912
Y intercept is c = 2.394737
Hence the regression line equation is,
ŷ = 0.464912x + 2.394737
t statistic = 9.481
Hence the correlation is significant. The p-values and the F test
also show that the fit is good.
‰‰ To find the forecast value at x = 8 (that is, 800 km), enter the
data as explained in the first step. Then select any cell and
insert the paste function ‘FORECAST’. We can use the drop-down
menu [Insert→Function…→Statistical→FORECAST] if using Excel
97-2003, or [Formulas→Insert Function fx→Statistical→FORECAST],
or click the fx icon on the formula bar, and then use the dialog
box to enter ‘x’, ‘Known_y’s’ and ‘Known_x’s’. On selecting ‘OK’
we get the result, in this case 6.11403509, which is the same as
the manual calculation (in fact it is more accurate). We can also
directly enter the syntax for the paste function as
‘=FORECAST(8,C3:C12,B3:B12)’ and press the Enter key to get the
answer.
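
The key figures in the regression output can be cross-checked with a few lines of arithmetic. The sketch below is an illustration in Python using the data of this example, not a replacement for the Analysis ToolPak; the variable names are our own. It computes the least-squares slope and intercept, the coefficient of determination, the standard error, the t statistic of the slope and the forecast at x = 8.

```python
from math import sqrt

x = [4, 5, 6, 7, 9, 9, 10, 11, 11, 12]   # distance in 100 km
y = [4, 5, 5, 6, 7, 6, 7, 8, 7, 8]       # transit time in days

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

slope = sxy / sxx                         # regression coefficient bYX
intercept = mean_y - slope * mean_x       # Y intercept c
r_squared = sxy ** 2 / (sxx * syy)        # coefficient of determination

residual_ss = syy - slope * sxy           # residual sum of squares
std_error = sqrt(residual_ss / (n - 2))   # "Standard Error" in the summary
t_stat = slope / (std_error / sqrt(sxx))  # t statistic of the slope

forecast_at_8 = intercept + slope * 8     # same quantity as FORECAST(8, y, x)

print(round(slope, 6), round(intercept, 6), round(r_squared, 6))
print(round(std_error, 6), round(t_stat, 3), round(forecast_at_8, 6))
```

The values should match the Excel output above to the number of decimals shown there: slope 0.464912, intercept 2.394737, R Square 0.918274, Standard Error 0.405554, t Stat about 9.481 and a forecast of about 6.114 days.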

State whether the following statements are true/false:


17. One of the most powerful features of Excel is its ability to
represent large data volumes in easy-to-decipher formats
including pivot tables and varied charts.
18. You will get the correlation coefficients between pairs of
variables as a correlation matrix.

Collect the data of family income and expenditure per month of
at least ten families in your locality. Enter the data in an MS Excel
worksheet and find the correlation using MS Excel functions.

If an array or reference argument contains text, logical values, or
empty cells, those values are ignored; however, cells with the value
zero are included.
If array1 and array2 have a different number of data points, CORREL
returns the #N/A error value.
If either array1 or array2 is empty, or if the standard deviation of their
values equals zero, CORREL returns the #DIV/0! error value.
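
The three rules above can be made concrete with a small sketch. The Python function below is a hypothetical stand-in that merely imitates the behaviour described in these notes; it is not Excel's own implementation. In particular, dropping a whole (x, y) pair when either entry is non-numeric is our assumption about how "ignored" works.

```python
from math import sqrt

def is_number(value):
    # Booleans are excluded because logical values are ignored.
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def correl_like(array1, array2):
    """Imitates the CORREL rules in the notes above (illustrative only)."""
    if len(array1) != len(array2):
        return "#N/A"                      # different number of data points
    # Assumption: a pair is skipped if either entry is text, logical or empty.
    pairs = [(a, b) for a, b in zip(array1, array2)
             if is_number(a) and is_number(b)]
    if not pairs:
        return "#DIV/0!"                   # nothing numeric to work with
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return "#DIV/0!"                   # zero standard deviation
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    return sxy / sqrt(sxx * syy)

print(correl_like([1, 2, 3], [2, 4]))                # different lengths
print(correl_like([5, 5, 5], [1, 2, 3]))             # constant first array
print(correl_like([0, 1, "n/a", 3], [0, 2, 4, 6]))   # text skipped, zero kept
```

The first two calls return the error markers named in the notes; the third skips the text entry but keeps the zero, and returns a correlation of (approximately) 1 because the remaining points lie on a straight line.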

10.7 NORMAL DISTRIBUTION
Statistical calculations for normal random variables can be carried
out using the statistical functions available in MS Excel.
NORMDIST returns the normal distribution for the specified mean
and standard deviation. This function has a very wide range of
applications in statistics, including hypothesis testing.

Syntax: NORMDIST(x,mean,standard_dev,cumulative)
‰‰ X is the value for which you want the distribution.
‰‰ Mean is the arithmetic mean of the distribution.
‰‰ Standard_dev is the standard deviation of the distribution.
‰‰ Cumulative is a logical value that determines the form of the
function. If cumulative is TRUE, NORMDIST returns the cumulative
distribution function; if FALSE, it returns the probability density
function (Excel’s help text calls this the probability mass
function).

If mean or standard_dev is nonnumeric, NORMDIST returns the
#VALUE! error value.
If standard_dev ≤ 0, NORMDIST returns the #NUM! error value.
If mean = 0, standard_dev = 1, and cumulative = TRUE, NORMDIST
returns the standard normal distribution, NORMSDIST.
First, we open an MS Excel worksheet and select a cell. Then we
select from the formula bar [fx → Statistical → NORMDIST], or from
the ribbon [Formulas → fx → Statistical → NORMDIST], to get a paste
function dialogue box. It asks for the value of x, the mean μ, the
standard deviation σ and a logical choice ‘cumulative’ that gives
either the cumulative distribution function value (choice TRUE) or
the probability density function value (choice FALSE). We could also
directly type the paste function syntax in the selected cell.
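
The role of the ‘cumulative’ argument may be easier to see in a short sketch. The Python function below is a hypothetical imitation of NORMDIST built on the standard library class `statistics.NormalDist`; the function name and the choice to raise an error for a non-positive standard deviation are our assumptions, made only for illustration.

```python
from statistics import NormalDist

def normdist(x, mean, standard_dev, cumulative):
    """Illustrative stand-in for NORMDIST(x, mean, standard_dev, cumulative)."""
    if standard_dev <= 0:
        # Excel reports #NUM! in this case; here we raise an error instead.
        raise ValueError("standard_dev must be positive")
    dist = NormalDist(mu=mean, sigma=standard_dev)
    # TRUE  -> area under the curve to the left of x (cumulative probability)
    # FALSE -> height of the curve at x (the choice FALSE in the text)
    return dist.cdf(x) if cumulative else dist.pdf(x)

print(normdist(0, 0, 1, True))    # half of the area lies below the mean
print(normdist(0, 0, 1, False))   # height of the standard normal curve at 0
```

With mean 0 and standard deviation 1 the cumulative form gives 0.5 at x = 0, while the non-cumulative form gives about 0.3989, the height of the bell curve at its peak. Only the cumulative form is a probability.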

Standard Normal Distribution
Statistical calculations for standard normal random variables can be
carried out using the statistical functions available in MS Excel.
NORMSDIST returns the standard normal cumulative distribution
function. The distribution has a mean of 0 (zero) and a standard
deviation of one. Use this function in place of a table of standard
normal curve areas.

Syntax: NORMSDIST (z)


Z is the value for which you want the distribution.

If z is nonnumeric, NORMSDIST returns the #VALUE! error value.


First we open an MS Excel worksheet and select a cell. Then we select
from the formula bar [fx → Statistical → NORMSDIST], or from the
ribbon [Formulas → fx → Statistical → NORMSDIST], to get a paste
function dialogue box. It asks for the value of z and gives the
cumulative distribution function value. We could also directly type
the paste function syntax in the selected cell.

Example: If X is a normal random variable with parameters μ = 3 and
σ² = 9, find
(a) P(2 < x < 5) (b) P(x > 0) (c) P(|x – 3| > 6)


Solution: (a) Here σ = 3. Open an MS Excel worksheet. Select any cell
and enter the paste function, either by selecting it from the menu or
by directly typing it as,
=(NORMSDIST(2/3))-(NORMSDIST(-1/3))
Then press the ‘Enter’ key. We get the result as 0.378066.
Alternatively, without converting to z values, type the paste
function as,
=(NORMDIST(5,3,3,TRUE))-(NORMDIST(2,3,3,TRUE))
Then press the ‘Enter’ key. We again get the result as 0.378066.
Just to familiarize you with the screen, the screens for the
cumulative probability P(x < 5) for this problem using the MS Excel
paste functions NORMDIST and NORMSDIST are shown below. Please note
that the result agrees closely with the manual calculation value of
0.7486; the small difference is consistent with rounding z to two
decimals when reading tables.
‘=NORMDIST(5,3,3,TRUE)’ gives the answer as 0.747507462

Similarly,
=NORMSDIST(2/3) gives the answer as 0.747507462

Thus, we see that using MS Excel is extremely easy as compared to
manual calculations.
(b) P(x > 0) = P(z > −1) = 1 − F(−1) = F(1) = 0.5 + 0.3413 = 0.8413
(c) P(|x − 3| > 6) = P(x > 9) + P(x < −3)
                   = P(z > 2) + P(z < −2)
                   = 1 − F(2) + F(−2)
                   = 1 − F(2) + 1 − F(2)
                   = 2 − 2(0.5 + 0.4772)
                   = 0.0456
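
All three probabilities of this example can be cross-checked in a few lines. The sketch below is an illustration in Python using the standard library class `statistics.NormalDist`, with μ = 3 and σ = 3 taken from the example; the variable names are our own. Its `cdf` method plays the same role as NORMDIST with cumulative = TRUE.

```python
from statistics import NormalDist

X = NormalDist(mu=3, sigma=3)        # sigma is the square root of the variance 9

p_a = X.cdf(5) - X.cdf(2)            # P(2 < X < 5)
p_b = 1 - X.cdf(0)                   # P(X > 0), as in the manual working
p_c = (1 - X.cdf(9)) + X.cdf(-3)     # P(|X - 3| > 6) = P(X > 9) + P(X < -3)

print(round(p_a, 6), round(p_b, 4), round(p_c, 4))
```

The first value should match the 0.378066 obtained in Excel and the second the 0.8413 found manually. The third comes out as about 0.0455; the manual figure of 0.0456 differs in the last digit only because the table value 0.4772 is itself rounded.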

Fill in the blanks:


19. ................... returns the normal distribution for the specified
mean and standard deviation.
20. ................... is a logical value that determines the form of the
function.

While using MS Excel, we don’t have to find the z value and then
use the Standard Normal Distribution. We can directly use the normal
distribution.

10.8 BRIEF ABOUT SPSS


SPSS Statistics is a software package used for statistical analysis.
Long produced by SPSS Inc., it was acquired by IBM in 2009. The
current versions (2014) are officially named IBM SPSS Statistics.
Companion products in the same family are used for survey authoring
and deployment (IBM SPSS Data Collection), data mining (IBM
SPSS Modeler), text analytics, and collaboration and deployment
(batch and automated scoring services).
The software name stands for Statistical Package for the Social
Sciences (SPSS), reflecting the original market, although the software
is now popular in other fields as well, including the health sciences
and marketing.
Our first step is to see the way SPSS functions and take cognizance
of the files that it uses. Second, we will try to create a dataset using
available data. Once the data has been entered, our third step is to use
the SPSS pull-down menus to conduct the analyses of data. We will
then use SPSS to draw charts which display results. We can run SPSS
using either the pull-down menus or the syntax window (writing your
own SPSS programmes).

10.8.1 SPSS FILES


SPSS uses several types of files. First, there is the data file, which
holds what is shown in the Data View and the Variable View and is
entered using the SPSS Data Editor window. It is known as an SPSS
system file. Second, once SPSS has conducted an analysis, it displays
the results in the Output Navigator window, the contents of which can
be saved in a Navigator document. The important thing to remember is
that you create the data file and instruct SPSS what analysis to
perform; SPSS then conducts the analysis and displays the results.
SPSS has the default convention of naming data files with a .sav
extension and Navigator documents with a .spo extension.

Entering Data
Select SPSS from the Windows Start Button (that is, click the Start
Button, select Programmes, and select SPSS 11 for Windows). At the
top of your screen you will see the pull-down menus, and just below
them you will see a toolbar with several icons. If you place the mouse
pointer on any one of the toolbar icons, SPSS will display a label
telling you what that icon does. SPSS automatically opens the Data
Editor window, and your screen looks like Figure 10.21.

Figure 10.21: SPSS Data Editor Window – Data View
Notice that the Data Editor window looks quite like a spreadsheet, in
that it is made up of cells defined by both rows and columns. In the
Data Editor window, each row represents a single record, and each
column represents a single variable. By using the keyboard arrow
keys (up, down, right and left) or your mouse, you can move the cursor
around to different cells in the window.

Figure 10.22: Data Editor Window – Variable View


Notice that at this point each column of the data has automatically
been called “var” by SPSS.

To label the first variable, again click the cursor on Variable View. You
will be prompted with the dialog box as shown in Figure 10.22.

Define Variable Dialog Box


Enter the name you want to use for this variable in the Variable Name
box. In the example, we have chosen the first variable as academic
ability. Next, click on the Labels button. You will be prompted with
the dialog box shown in Figure 10.23.
Now move the cursor to the Variable View tab (you can see it at the
bottom of the screen) and click it. You will see the Variable View
window, in which the first variable has been assigned the name
“OLTS”. Then move to Type and leave it at the default, Numeric, as we
are going to enter only numeric values in that column. Move on to the
next column and set the width to three, since the maximum value that
we may enter for the online test score is 100. Leave the decimals at
the default of two. Then move on to the Label column, type “Online
Test Score” for the first variable, and leave the other columns at
their defaults. Repeat this procedure for the work experience data in
the second row, and then again for the student motivation data in the
third row. Remember that the student motivation variable has values
that have been coded.
Student Motivation
‰‰ 0 = Not willing
‰‰ 1 = Undecided
‰‰ 2 = Willing
Therefore, once the variable label has been assigned, use the tab key
(or the mouse) to bring the cursor to the Values box in the same row
and click. You will see a box as in Figure 10.23.

Figure 10.23: Value Labels – Dialog Box


Enter a “0” (which is our first value), then tab or click down to the
Value Label box and enter “Not willing,” and finally click the Add
button. Repeat for the other two values of this variable. Click the OK
button to finish defining the variable of student motivation.


Figure 10.24: Value Labels Coded with Value and Value Label
If you discover later that for some reason you need to further define
this variable (for example, if you want to change the labels), you can
always return to this dialog box. As their names suggest, the Change
button can be used to change a value label, the Remove button can be
used to remove a value label, the Cancel button can be used to cancel
your labeling work, and the Help button can be used to access the
SPSS online help file.
Notice that you have other options available to you in the same row.
For example, if you click on the Type button, you will be presented
with several different data types to choose from. By default, the
variable is considered to be a number that has up to eight digits. You
can tell SPSS to expect a larger number by entering a different size
in the Width and Decimal Places boxes, although that is certainly
not necessary for this data. It is important to notice, though, that
this is where you can tell SPSS to expect a “String” variable (that
is, an alphanumeric variable that can be coded with either numbers or
letters) if appropriate. For example, if for the “Gender” variable we
had used “M” instead of “0” for “Male” (and “F” instead of “1” for
“Female”), then the Data Editor would not let you enter these
values until you told it to expect “Gender” as a “String” variable.
One other important option that is available to you is that of declaring
the placeholders that have been used for missing values. For example,
remember that “Student Motivation” may take only the values of “0”,
“1”, and “2”. Earlier, it was suggested that if the measure of student
motivation was not available for a respondent, then an out-of-range
value such as “9” could be used to indicate that this respondent had
missing data.
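
The idea of coded values, value labels and a missing-value placeholder can be pictured as a simple look-up. The sketch below is a plain Python illustration and not SPSS syntax. The mapping of 1 to “Undecided” and 2 to “Willing” is assumed from the order in which the values are listed, and the five responses are made-up data.

```python
# Value labels for the student motivation variable (codes 1 and 2 assumed)
VALUE_LABELS = {0: "Not willing", 1: "Undecided", 2: "Willing"}
MISSING_CODE = 9   # out-of-range placeholder declared as missing

def decode_motivation(code):
    """Return the value label for a code, or None if the code means 'missing'."""
    if code == MISSING_CODE:
        return None
    return VALUE_LABELS[code]

responses = [0, 2, 9, 1, 2]                      # hypothetical coded answers
decoded = [decode_motivation(c) for c in responses]
valid = [label for label in decoded if label is not None]

print(decoded)
print(len(valid), "valid responses out of", len(responses))
```

Declaring 9 as missing matters because otherwise it would be treated as a genuine score and distort every average or count computed from the variable.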
Now go ahead and assign the variable and value labels to the remaining
variables. See the codebook in Figure 10.25 for the labels.
To enter the data for the study, simply move the cursor to the data
view and you can start entering the data in the respective columns.
Enter “99” for the online test score, then move the cursor one cell to
the right and enter “4” for work experience, and so on. Once the data
for that record has been entered, move the cursor to the left most cell
in the second line. You are now ready to enter the second record. The
Data Entry window looks like Figure 10.25.

Figure 10.25: SPSS Data Editor Window with All Records Entered
Now save the data using the File pull-down menu and the Save
choice. Because this data has not been saved previously, you will see
a dialog box prompting you to enter a file name. Notice that SPSS
provides the default data file extension (.sav). Type the file name
and click the OK button. SPSS will then save the data to this file.
(In general, SPSS will automatically attach the default file extension
if you do not type it in, e.g., .sav for a data file, .spo for a
Navigator document, .sps for a syntax file, etc.) Another alternative
is to select File, Save As… in case the dataset has already been saved
once, but you now want to save it as a new file with a new name.

Fill in the blanks:


21. SPSS Statistics is a software package used for statistical
analysis. Long produced by SPSS Inc., it was acquired
by ................... in 2009.
22. Once SPSS has conducted an analysis, it displays the results
in the ................... window.
23. SPSS has the default convention of naming data files with
a .sav extension and ................... documents with a .spo
extension.

Visit any organisation which uses statistical data. Take feedback
from their manager about the advantages and disadvantages of
using SPSS in their office and prepare a short report. Which other
statistical software do they use besides MS Excel and SPSS?


10.9 SUMMARY
‰‰ Microsoft Office is one of the most powerful office productivity
tools in the market today. The entire suite is vast and covers a
wide range of software solutions catering to various aspects of
modern businesses.
‰‰ Microsoft Excel is a powerful accounting and calculation solution.
It has a standard tabular layout and it supports a wide range of
arithmetic, accounting and statistical functions.
‰‰ The Microsoft Outlook is the mail client that can be set up to
download mails from a mail server as well as send and receive
emails as desired. Being a part of the Microsoft Office suite, this
tool is compatible with other applications in the suite.
‰‰ One of the most popular and widely used Microsoft Office suites
is MS Office 2003. Later Microsoft released two other versions
of Office, namely Office 2007 and Office 2010. Although Office
2010 is the latest version, many businesses still continue to use
Office 2003. From Office 2003 to Office 2007, Microsoft radically
changed the overall look and feel of the Office suite.
‰‰ Excel is built on the concept of cells, rows, columns, spreadsheets
and workbooks. The entire structure is hierarchical, and this
allows it to be scalable and versatile enough to adapt to the varying
needs of users from different specialisations. Understanding these
concepts is very useful in developing complex reports and models.
‰‰ As long as you work on the soft copies, page layouts are not really
important – you can scroll a spreadsheet to view the contents.
However, when it comes to printouts it is important that one gets
the page layouts sorted out. Excel 2010 has all the page layout
options under Page Layout menu item.
‰‰ While working with any Office productivity tool, the clipboard
functions are invaluable. The most common clipboard functions
are ‘Cut’, ‘Copy’ and ‘Paste’. In the Microsoft Office suite, there
are keyboard shortcuts for these functions. Once you become
conversant with the Excel functions, you would prefer to use the
keyboard shortcuts as they are faster and easier to use than the
mouse.
‰‰ A new worksheet is a grid of rows and columns. The rows are
labelled with numbers, and the columns are labelled with letters.
Each intersection of a row and a column is a cell. Each cell has
an address, which consists of the column letter and the row number.
The arrow on the worksheet to the right points to cell A1, which
is currently highlighted, indicating that it is an active cell. A cell
must be active to enter information into it.
‰‰ Excel is a very powerful accounting tool, but before going to the
really complex functions, let us see how to use Excel for simple
calculations. There are two ways of using Excel for simple
calculations: you can enter the actual arithmetic equations in the
cell or use pre-defined Excel formulas to do the same.
‰‰ Statistical calculations for normal random variables can be
carried out using the statistical functions available in MS Excel.
NORMDIST returns the normal distribution for the specified
mean and standard deviation. This function has a very wide
range of applications in statistics, including hypothesis testing.
Syntax: NORMDIST(x,mean,standard_dev,cumulative)
‰‰ SPSS Statistics is a software package used for statistical analysis.
Long produced by SPSS Inc., it was acquired by IBM in 2009. The
current versions (2014) are officially named IBM SPSS Statistics.
Companion products in the same family are used for survey
authoring and deployment (IBM SPSS Data Collection), data
mining (IBM SPSS Modeler), text analytics, and collaboration
and deployment (batch and automated scoring services).

‰‰ Microsoft Excel: An electronic spreadsheet program with
which you can create graphs and worksheets for financial
and other numeric data. After you enter your financial data,
you can analyze it for forecasts, generate numerous what-if
scenarios, and publish worksheets on the Web.
‰‰ Statistics: Statistics is a tool that enables us to impose order
on the disorganized cacophony of the real world of modern
society. The business world has grown both in size and
competition.
‰‰ Variable: A characteristic or phenomenon which may take
different values, such as weight or gender, since these differ
from individual to individual.
‰‰ Microsoft Word: This is a text editing software that allows
users to write all kinds of letters, messages and documents.
‰‰ Microsoft Project Plan: The Microsoft Project Plan or MPP
is a project management tool that allows managers to create
Gantt charts to plan and track project progress.
‰‰ Spreadsheet: A spreadsheet is a collection of all the rows and
columns in Excel.
‰‰ SPSS: SPSS Statistics is a software package used for statistical
analysis. Long produced by SPSS Inc., it was acquired by IBM
in 2009. The software name stands for Statistical Package for
the Social Sciences (SPSS).

10.10 DESCRIPTIVE QUESTIONS


1. Microsoft Office is one of the most powerful office productivity
tools in the market today. Explain the Microsoft Office suite in
detail.
2. What are the various versions of Microsoft Office?
3. Explain how you open, save and close an Excel document.
4. Explain the menu items present on the Excel screen and their
functions.
5. Explain the concept of Excel with reference to workbooks and
worksheets.
6. Discuss the procedure of entering data in an Excel file.
7. Write a short note on basic built-in functions like average, sum
and statistical functions.
8. Explain the logical functions in an Excel document.
9. How do you create a chart and a histogram in an Excel file?
10. Write a short note on SPSS. What is its importance in today’s
scenario?

EXERCISE FOR PRACTICE


1. Draw a histogram in Microsoft Excel for the following distribution.

Marks less than      10   20   30   40   50   60   70   80   90
Number of students    4    6   24   46   67   86   96   99  100

2. Suppose X is a normal random variable with parameters
(μ = 100, σ = 2), written in short as X ~ N(100, 2). Find x2 such
that P(99 ≤ X ≤ x2) = 60%. Solve using Ms-Excel.
3. Suppose the mean weight of bags is 100 gms and the standard
deviation is 2 gms. Bags weighing below 99 gms are rejected. From
the stock you want to supply 60% of the bags. You want to give bags
with as little overfill as possible. Find the highest limit of weight
of overfilled bags that you must give.
4. Calculate coefficient of correlation between X and Y using
Ms-Excel as per the data given below:

X 14 16 20 22 28 30 34 40 45
Y 97 89 68 65 56 50 37 18 12
5. The following data give the average yields of major grain
(excluding rice) for the period 1965–1973. The yields are in
quintals per hectare.

Year 1965 1966 1967 1968 1969 1970 1971 1972 1973
Yield 14.7 16.2 16.2 16.7 16.9 17.3 18.8 18.5 19.4
Find the equation of the trend line, assuming that the trend is
linear using Ms-Excel.

10.11 ANSWERS AND HINTS


ANSWERS FOR SELF ASSESSMENT QUESTIONS

Topic                      Q. No.  Answers
Introduction               1.      Microsoft Word
                           2.      Microsoft Excel
                           3.      Outlook
Introduction to Excel      4.      File-New-Blank
                           5.      menu
                           6.      spreadsheet
                           7.      workbook
Entering Data in Excel     8.      True
                           9.      True
Descriptive Statistics     10.     False
                           11.     Descriptive
                           12.     Autosum
                           13.     variable
Basic Built-in Functions   14.     True
                           15.     False
                           16.     False
Statistical Analysis       17.     True
                           18.     True
Normal Distribution        19.     NORMDIST
                           20.     Cumulative
Brief about SPSS           21.     IBM
                           22.     Output Navigator
                           23.     Navigator

HINTS FOR DESCRIPTIVE QUESTIONS


1. Refer Section 10.1
The entire suite is vast and covers a wide range of software
solutions catering to various aspects of modern businesses. The
most popular software in the MS Office Suite includes Microsoft
Word, Microsoft Excel, Microsoft PowerPoint, Microsoft Access,
Microsoft Project Plan and Microsoft Outlook.
2. Refer Section 10.1.1
One of the most popular and widely used Microsoft Office Suites
is the MS Office 2003. Later Microsoft released two other versions
of Office, namely Office 2007 and Office 2010. Although Office
2010 is the latest version, many businesses still continue to use
Office 2003.

3. Refer Section 10.2
Click on File-Open (Ctrl+O) to open/retrieve an existing
workbook; change the directory area or drive to look for files in
other locations.
To save your document with its current filename, location and
file format, click File-Save. If you are saving for the first
time, click File-Save; choose/type a name for your document;
then click OK.
4. Refer Section 10.2.3
The most basic navigation tool in any Excel version is the menu
bar. The screenshot below shows the menu bar with various
items, which are designed for accessing different features of
Excel. Table 10.1 below gives a snapshot of the menu items and
their overall functions.
5. Refer Section 10.2.4
Excel is built on the concept of cells, rows, columns, spreadsheets
and workbooks. The entire structure is hierarchical, and this
allows it to be scalable and versatile enough to adapt to the varying
needs of users from different specializations.
6. Refer Section 10.3
A new worksheet is a grid of rows and columns. The rows are
labeled with numbers, and the columns are labeled with letters.
Each intersection of a row and a column is a cell. Each cell has
an address, which is the column letter and the row number.
The arrow on the worksheet to the right points to cell A1, which
is currently highlighted, indicating that it is an active cell. A cell
must be active to enter information into it. To highlight (select) a
cell, click on it.
7. Refer Section 10.5
The SUM function is probably the most commonly used function
in Excel. It comes in three flavors in Excel, namely SUM, SUMIF
and SUMIFS.
Statistical functions are invaluable in any mathematical
calculations. They can provide insights into trends, provide data
for detailed analysis, and help identify gaps that need to be
plugged. Excel provides a wide range of functions that can be
used to perform basic statistical analyses.
8. Refer Section 10.5
Excel also supports standard logical or Boolean functions, which
are very useful in testing special conditions. Not all calculations
are limited to SUM; hence you need to rely on a combination of
logical and other functions to get results similar to SUMIF and
SUMIFS. The commonly supported logical functions are AND,
IF, FALSE, TRUE, NOT, OR.

9. Refer Section 10.6
One of the most powerful features of Excel is its ability to represent
large data volumes in easy-to-decipher formats including pivot
tables and varied charts. Understanding these features is very
important for any modeling or analysis exercise. While it is great
to be able to calculate numbers, if these are not presented in the
right format they add up to zilch.
10. Refer Section 10.8
SPSS Statistics is a software package used for statistical analysis.
Long produced by SPSS Inc., it was acquired by IBM in 2009. The
current versions (2014) are officially named IBM SPSS Statistics.
Companion products in the same family are used for survey
authoring and deployment (IBM SPSS Data Collection), data
mining (IBM SPSS Modeler), text analytics, and collaboration
and deployment (batch and automated scoring services).
The software name stands for Statistical Package for the Social
Sciences (SPSS), reflecting the original market, although
the software is now popular in other fields as well, including
the health sciences and marketing.

ANSWERS FOR EXERCISE FOR PRACTICE


1. Histogram: Distribution of Marks
[Figure: histogram with Marks on the horizontal axis (class intervals starting at 0 – 10) and Number of Students on the vertical axis]

2. P(99 ≤ X ≤ x2) = P(X ≤ x2) − P(X ≤ 99) ⇒ 0.6 = F(x2) − F(99)
⇒ F(x2) = 0.6 + F(99)
Find F(99) using MS-Excel: F(99) = 0.308538, so F(x2) = 0.908538.
3. We find ‘a’ such that P(99 < x < a) = 0.6 ⇒ a = 102.6648 (Use ‘Goal
Seek’ tool.)
4. – 0.98863
5. y = mx + c Slope m = 0.525; Y intercept c = – 1016.54
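
As a cross-check outside Excel, the answers above can be reproduced with a short Python sketch. It is only an illustration and not the method prescribed in this chapter: the function and variable names are assumptions, and Python's standard `statistics.NormalDist` stands in for NORMDIST and the Goal Seek tool. Because Goal Seek stops at a tolerance, the limit it reports (102.6648) can differ slightly from the inverse-CDF value computed here, and for the data as tabulated in Exercise 4 the sketch gives a correlation of about -0.9886.

```python
import math
from statistics import NormalDist

# Exercise 1: turn cumulative "less than" counts into class frequencies.
cumulative = [4, 6, 24, 46, 67, 86, 96, 99, 100]
class_freq = [c - p for c, p in zip(cumulative, [0] + cumulative[:-1])]

# Exercises 2 and 3: find x2 with P(99 <= X <= x2) = 0.6 (mean 100, SD 2).
weight = NormalDist(mu=100, sigma=2)
f_99 = weight.cdf(99)        # F(99)
f_x2 = 0.6 + f_99            # F(x2) = 0.6 + F(99)
x2 = weight.inv_cdf(f_x2)    # invert the cumulative distribution


def centred_sums(xs, ys):
    """Return Sxx, Syy, Sxy: sums of squares and products about the means."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, syy, sxy


def pearson_r(xs, ys):
    """Coefficient of correlation between xs and ys."""
    sxx, syy, sxy = centred_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def trend_line(xs, ys):
    """Least-squares slope and intercept of y = m*x + c."""
    sxx, _, sxy = centred_sums(xs, ys)
    slope = sxy / sxx
    intercept = sum(ys) / len(ys) - slope * sum(xs) / len(xs)
    return slope, intercept


# Exercise 4: data as tabulated in the exercise.
x = [14, 16, 20, 22, 28, 30, 34, 40, 45]
y = [97, 89, 68, 65, 56, 50, 37, 18, 12]
r = pearson_r(x, y)

# Exercise 5: yields in quintals per hectare, 1965 to 1973.
years = list(range(1965, 1974))
yields = [14.7, 16.2, 16.2, 16.7, 16.9, 17.3, 18.8, 18.5, 19.4]
slope, intercept = trend_line(years, yields)

print(class_freq)
print(round(f_x2, 6), round(x2, 2))
print(round(r, 5))
print(round(slope, 3), round(intercept, 2))
```

The class frequencies printed first are the bar heights of the histogram; the remaining lines correspond to answers 2 to 5.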

10.12 SUGGESTED READINGS FOR REFERENCE

SUGGESTED READINGS
‰‰ David, H., and Edwards, A., Annotated Readings in the History of
Statistics, Springer, 2001. Offers general historical collections of
the probability and statistical literature.
‰‰ D P Apte, Statistical Tools for Managers Using MS Excel, Excel
Books, 2009.
‰‰ Gallagher, C.A. and Watson, H.J., Quantitative Methods for
Business Decisions, McGraw Hill, Inc., 1976.
‰‰ Dey, B.R., Text Book of Managerial Statistics, Macmillan India
Ltd, 2005.
‰‰ Peters, W., Counting for Something: Statistical Principles and
Personalities, Springer, New York, 1987.
‰‰ Porter, T., The Rise of Statistical Thinking, 1820-1900, Princeton
University Press, 1986.
‰‰ Stigler, S., The History of Statistics: The Measurement of
Uncertainty before 1900, U. of Chicago Press, 1990.
‰‰ Tankard, J., The Statistical Pioneers, Schenkman Books, New
York, 1984.
‰‰ Ramgopal Rajan, Excel in Excel, Introducing the “Power of Ms
Excel” to Beginners, Excel Books, 2013.

E-REFERENCES
‰‰ http://home.ubalt.edu/ntsbarsh/excel/excel.htm
‰‰ http://people.umass.edu/evagold/excel.html
‰‰ http://www.excel-easy.com/examples/descriptive-statistics.html

CHAPTER 11

CASE STUDIES

CONTENTS

Case Study 1: Chapter 1    Using Statistics in Business Decision Analysis
Case Study 2: Chapter 2    Descriptive Statistics in Sports
Case Study 3: Chapter 3    A Look at the Average Wage
Case Study 4: Chapter 4    Variance Analysis of Cooka Ltd.
Case Study 5: Chapter 5    Skewness and Kurtosis: Important Parameters in the Characterization of Dental Implant Surface Roughness
Case Study 6: Chapter 6    Finding the Correlation between Currency, Bonds and Stocks
Case Study 7: Chapter 7    Contributions: Simple Linear Regression Background
Case Study 8: Chapter 7    Selecting Colleges
Case Study 9: Chapter 8    Death Penalty Probability
Case Study 10: Chapter 9   Shopping Attitude
Case Study 11: Chapter 10  Age Distribution in the United States
Case Study 12: Chapter 9   Birth Weights in America

CASE STUDY 1: CHAPTER 1


USING STATISTICS IN BUSINESS DECISION ANALYSIS

In a highly competitive and increasingly Internet-centric world
where information and data are available in abundance, only those
companies will survive that focus on statistics in business decision
analysis as a primary tool of decision making. XYZ Company is one
such company; it uses statistics in every field for decision making.

XYZ Company uses Statistics in Business Decision Analysis


Statistics plays a significant part in business decision analysis.
In an aggressive business environment, a business can’t survive
simply by making decisions based on intuition, guesswork
and close estimations. Procuring scientific data and information, and
investigating that data precisely, can help it to make more profitable
decisions. An organization such as XYZ that
is strong in the core area of decision making is likely to achieve
greater success for its stakeholders over the long
haul, have less risk exposure, and have a lower possibility of
missing lucrative opportunities.
Different Applications of Statistical Analysis
Any business operates under conditions of probability and
uncertainty because there are too many variables and external
factors that can influence a situation. Therefore, the decision
making process of XYZ Company must include collection and
analysis of as much data and information as possible in order to
arrive at optimal business decisions. Computerized analysis of data
has made the task simpler. The following are a few examples where
statistical methods can help in decision making in the company:
‰‰ Random sampling techniques are used by production
managers and the QC department to determine quality grades
of materials.
‰‰ Accountants use these same techniques while auditing
accounts receivables for their clients.
‰‰ Regression and Correlation analysis may be used by the
finance department to correlate a set of financial ratios with
other business variables.
‰‰ Marketing departments may apply Statistical Test of
Significance for their market research about a suitable target
market for their new products or services.
‰‰ Forecasting techniques may be used by the top management
to estimate sales volume for the next budget year.
‰‰ Standard deviation methods are used by various profit centers
within the organization to cut down the inherent risk in a
particular business decision.

XYZ uses Financial Ratios as a Tool of Business Decision Analysis
One of the valuable statistics in business decision analysis is the
internal accounting figures of the organization, or the performance
data. The decision analysis team within the company has a key
responsibility to analyze the company’s performance in measurable,
statistical terms, and evaluate the deviations from group goals, if
any. The financial performance or profitability figures, assets and
liabilities figures, inventory and sales figures are analyzed with the
help of business ratios. These ratios provide a crystallized picture
of the business and test its performance on various parameters.
For example, Current Ratio indicates the position of the company’s
current assets against current liabilities. The most critical financial
ratios for any company include Profit to Sales ratio, Debt to Equity
ratio, Current ratio, and Return on Capital Employed.

Key Elements of Statistics in Business Decision Analysis
The important elements to consider when using statistics in
business decision analysis, particularly in process improvement
of XYZ, are the accuracy of collected data and information, the
choice of statistical design or statistical model to analyze that data,
the clear presentation of findings and conclusions, and finally,
managerial recommendations on how to take corrective measures
based on these findings and conclusions.

Analyze the case above and suggest what improvements/changes
XYZ Company can make by using statistical analysis.
Source: http://www.brighthub.com/office/entrepreneurs/articles/85564.aspx

CASE STUDY 2: CHAPTER 2


DESCRIPTIVE STATISTICS IN SPORTS

Descriptive statistics are numbers that are used to summarize and
describe data. The word “data” refers to the information that has
been collected from an experiment, a survey, a historical record,
etc. (“Data” is plural; one piece of information is called a
“datum.”) If we are analyzing birth certificates, for example,
a descriptive statistic might be the percentage of certificates
issued in New York State, or the average age of the mother. Any
other number we choose to compute also counts as a descriptive
statistic for the data from which the statistic is computed. Several
descriptive statistics are often used at one time to give a full picture
of the data.
Descriptive statistics simply describe or summarize data in a
meaningful way. They do not involve generalizing beyond the data
at hand. Generalizing from our data to another set of cases is the
business of inferential statistics.
You probably know that descriptive statistics are central to the
world of sports. Every sporting event produces numerous statistics
such as the shooting percentage of players on a basketball team.
For the Olympic marathon (a foot race of 26.2 miles), we possess
data that cover more than a century of competition. The following
table shows the winning times for both men and women.
TABLE 1: WINNING OLYMPIC MARATHON TIMES

Women
Year  Winner               Country  Time
1984  Joan Benoit          USA      2:24:52
1988  Rosa Mota            POR      2:25:40
1992  Valentina Yegorova   UT       2:32:41
1996  Fatuma Roba          ETH      2:26:05
2000  Naoko Takahashi      JPN      2:23:14
2004  Mizuki Noguchi       JPN      2:26:20

Men
Year  Winner               Country  Time
1896  Spiridon Louis       GRE      2:58:50
1900  Michel Theato        FRA      2:59:45
1904  Thomas Hicks         USA      3:28:53
1906  Billy Sherring       CAN      2:51:23
1908  Johnny Hayes         USA      2:55:18
1912  Kenneth McArthur     S. Afr.  2:36:54
1920  Hannes Kolehmainen   FIN      2:32:35
1924  Albin Stenroos       FIN      2:41:22
1928  Boughra El Ouafi     FRA      2:32:57
1932  Juan Carlos Zabala   ARG      2:31:36
1936  Sohn Kee-Chung       JPN      2:29:19
1948  Delfo Cabrera        ARG      2:34:51
1952  Emil Zátopek         CZE      2:23:03
1956  Alain Mimoun         FRA      2:25:00
1960  Abebe Bikila         ETH      2:15:16
1964  Abebe Bikila         ETH      2:12:11
1968  Mamo Wolde           ETH      2:20:26
1972  Frank Shorter        USA      2:12:19
1976  Waldemar Cierpinski  E.Ger    2:09:55
1980  Waldemar Cierpinski  E.Ger    2:11:03
1984  Carlos Lopes         POR      2:09:21
1988  Gelindo Bordin       ITA      2:10:32
1992  Hwang Young-Cho      S. Kor   2:13:23
1996  Josia Thugwane       S. Afr.  2:12:36
2000  Gezahegne Abera      ETH      2:10:10
2004  Stefano Baldini      ITA      2:10:55
There are many descriptive statistics that we can compute from
the data in the table. To gain insight into the improvement in speed
over the years, let us divide the men’s times into two pieces, namely,
the first 13 races (up to 1952) and the second 13 (starting from
1956). The mean winning time for the first 13 races is 2 hours, 44
minutes, and 22 seconds (written 2:44:22). The mean winning time
for the second 13 races is 2:13:18. This is quite a difference (over
half an hour). Does this prove that the fastest men are running
faster? Or is the difference just due to chance, no more than what
often emerges from chance differences in performance from year
to year? We can’t answer this question with descriptive statistics
alone. All we can affirm is that the two means are “suggestive.”
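
The two means quoted above can be recomputed from the men's times in Table 1. The following Python sketch is only an illustration; the helper names `to_seconds` and `to_hms` are assumptions, and the 2000 time is read as 2:10:10. With the times as tabulated, the second mean comes out within a second of the figure quoted in the text.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return 3600 * hours + 60 * minutes + seconds


def to_hms(total_seconds):
    """Convert seconds back to 'h:mm:ss', rounding to the nearest second."""
    total = round(total_seconds)
    return f"{total // 3600}:{total % 3600 // 60:02d}:{total % 60:02d}"


# Men's winning times from Table 1, in race order (1896 to 2004).
mens_times = [
    "2:58:50", "2:59:45", "3:28:53", "2:51:23", "2:55:18", "2:36:54",
    "2:32:35", "2:41:22", "2:32:57", "2:31:36", "2:29:19", "2:34:51",
    "2:23:03",                       # first 13 races, up to 1952
    "2:25:00", "2:15:16", "2:12:11", "2:20:26", "2:12:19", "2:09:55",
    "2:11:03", "2:09:21", "2:10:32", "2:13:23", "2:12:36", "2:10:10",
    "2:10:55",                       # second 13 races, from 1956
]

first_13 = [to_seconds(t) for t in mens_times[:13]]
second_13 = [to_seconds(t) for t in mens_times[13:]]

mean_first = sum(first_13) / len(first_13)
mean_second = sum(second_13) / len(second_13)

print(to_hms(mean_first))                # mean of the first 13 races
print(to_hms(mean_second))               # mean of the second 13 races
print(to_hms(mean_first - mean_second))  # difference between the two means
```

Working in seconds avoids mistakes when averaging hours, minutes and seconds separately.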

Examining Table 1 leads to many other questions. We note that
Takahashi (the lead female runner in 2000) would have beaten
the male runner in 1956 and all male runners in the first 12
marathons. This fact leads us to ask whether the gender gap will
close or remain constant. When we look at the times within each
gender, we also wonder how much they will decrease (if at all)
in the next century of the Olympics. Might we one day witness
a sub-2-hour marathon? The study of statistics can help you
make reasonable guesses about the answers to these questions.

CASE STUDY 3: CHAPTER 3


A LOOK AT THE AVERAGE WAGE

Mr. Motswiri, the Head of the Union at the Matongo Manufacturing
and Marketing Company, was negotiating with Ms. Kelebogile
Matongo, the president of the company. He said, “The cost of living
is going up. Our workers need more money. No one in our union
earns more than $ 9000 a year.”
Ms. Matongo replied, “It’s true that costs are going up. It’s the same
for us—we have to pay higher prices for materials, so we get lower
profits. Besides, the average salary in our company is over $11000.
I don’t see how we can afford a wage increase at this time.”
That night the union official conducted the monthly union meeting.
A sales clerk spoke up. “We sales clerks make only $5000 a year.
Most workers in the union make $7500 a year. We want our pay
increased at least to that level.”

The union official decided to take a careful look at the salary
information. He went to the salary administration. They told him
that they had all the salary information on a spreadsheet in the
computer, and printed off this table:

Type of Job       Number Employed   Salary      Union Member
President         1                 $125 000    No
Vice president    2                 $65 000     No
Plant Manager     3                 $27 500     No
Foreman           12                $9 000      Yes
Workman           30                $7 500      Yes
Payroll clerk     3                 $6 750      Yes
Secretary         6                 $6 000      Yes
Sales Clerk       10                $5 000      Yes
Security officer  5                 $4 000      Yes
TOTAL             72                $796 750    -
The union official calculated the mean:
MEAN = $796,750 / 72 = $11,065.97
“Hmmmm,” Mr. Motswiri thought, “Ms. Matongo is right, but the
mean salary is pulled up by those high executive salaries. It doesn’t
give a really good picture of the typical worker’s salary.”
Then he thought, “The sales clerk is sort of right. Each of the
thirty workmen makes $7500. That is the most common salary –
the mode. However, there are thirty-six union members who don’t make
$7500 and of those, twenty-four make less.” Finally, the union head
said to himself, “I wonder what the middle salary is?” He thought of
the employees as being lined up in order of salary, low to high.
The middle salary (it’s called the median) is midway between
employee 36 and employee 37. He said, “Employee 36 and employee
37 each make $7500, so the middle salary is also $7500.”
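
The three averages discussed in this case can be checked with Python's standard `statistics` module. This is a minimal sketch built on the salary table above; the list name `payroll` and the way the table is expanded into one salary per employee are illustrative choices, not part of the case.

```python
from statistics import mean, median, mode

# (job, number employed, annual salary in dollars) from the salary table.
payroll = [
    ("President", 1, 125000),
    ("Vice president", 2, 65000),
    ("Plant Manager", 3, 27500),
    ("Foreman", 12, 9000),
    ("Workman", 30, 7500),
    ("Payroll clerk", 3, 6750),
    ("Secretary", 6, 6000),
    ("Sales Clerk", 10, 5000),
    ("Security officer", 5, 4000),
]

# Expand the table into one salary per employee (72 values in total).
salaries = [salary for _job, count, salary in payroll for _ in range(count)]

mean_salary = mean(salaries)      # pulled up by the executive salaries
median_salary = median(salaries)  # midway between employees 36 and 37
mode_salary = mode(salaries)      # the most common salary

print(len(salaries), sum(salaries))
print(round(mean_salary, 2), median_salary, mode_salary)
```

Changing the salaries in `payroll` and rerunning the sketch is one way to explore the questions that follow.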

1. If the twenty-four lowest salaried workers were all moved
up to $7500, what would be
(a) The new median?
(b) The new mean?
(c) The new mode?
2. What salary position do you support, and why?
Source: http://wikieducator.org/images/2/28/JSMath6_Part3.pdf

CASE STUDY 4: CHAPTER 4


VARIANCE ANALYSIS OF COOKA LTD.

Sean Thornton is the General Manager of Cooka Ltd., a
manufacturer of kitchen appliances for the catering trade (school
canteens, restaurants and commercial eateries). Sean introduced a
system of variance analysis into Cooka Ltd last month. As a result
of this, he is now reviewing the monthly financial report, in time
for the forthcoming board meeting. Although the system is an
improvement on the previous arrangements, some figures
are still missing. The report can be summarised as follows:

Monthly Financial Report: Cooka Limited

                               Budget   Actual   Variance
                               $’000    $’000    $’000
Sales Revenues                 260      00       (?)
Direct costs of production     160      180      (20)
Indirect costs of production   40       (?)      (5)
Net Profit                     (?)      (?)      (?)
Sean will suggest to the board that whilst profits in the company have
recently risen, there is still room for improvement. He proceeded
with the implementation of the system of variance analysis despite
some objections from departmental managers. Sean feels that
this was the correct decision. Aaron Thorn has complained of an
increased workload in the Production Department and Amy Babe
is concerned that the Marketing Department has insufficient funds
to spend on product promotion next month. In the meantime, the
employees of Cooka Ltd are unhappy about the non-payment of
bonuses from increased profits.

1. Explain what is meant by the terms ‘variance’ and ‘variance
analysis’.
2. Explain what is meant by the terms ‘Direct costs’ and
‘Indirect costs’ with reference to a business such as Cooka
Ltd.
3. Complete the missing information in the monthly financial
report for Cooka Ltd.
4. Calculate the standard deviation, variance and coefficient
of variation for the above information for Cooka Ltd.
5. With reference to each of the variances noted above, explain
(as far as is possible) the reasons why they may have
occurred in Cooka Ltd.

CASE STUDY 5: CHAPTER 5


SKEWNESS AND KURTOSIS: IMPORTANT PARAMETERS IN
THE CHARACTERIZATION OF DENTAL IMPLANT SURFACE
ROUGHNESS

For dental implants, the primary rationale of surface roughness
is to get increased retention strength. Implant surface roughness
is normally characterized by a number of surface roughness
parameters. There is no consensus as to which combination
of roughness parameters best characterizes the important
topographical features of implant surface roughness. Hansson and
Norton assumed that a rough implant surface can be conceptualized
as consisting of small pits. Assuming that bone grows into these pits,
creating retention, it was found that the retention strength depends
upon the size, shape, and packing density of these pits. A theoretical
study did not show any clear relationship between the estimated
retention strength, using the method suggested by Hansson and
Norton, and the values of a set of surface roughness parameters.
Wennerberg and Albrektsson suggested the use of at least one
height, one space, and one hybrid parameter for characterization
of implant surface roughness. For 2D measurements, one of the
height parameters Ra (average roughness) and Rq (root-mean-
square roughness), the space parameter RSm (mean width of
profile elements), and the hybrid parameter Rdr (developed length
ratio) were suggested. The limitations of this recommendation are
immediately realized when considering the two surfaces in Figure
1. These surfaces are mirror images of each other, and the values
of the suggested set of parameters are exactly the same for these
surfaces; these parameters cannot discriminate between surfaces
which are mirror images of each other. It is however quite obvious
that the interface shear strength is much higher for the surface in
Figure 1(a) than in Figure 1(b). The number of bone plugs which
protrude into pits on the surface per length unit is exactly the same
for the two surfaces, while the shear strength of the individual bone
knobs, protruding into the pits, is much higher for surface in Figure
1(a) than for surface in Figure 1(b). If the surface characterization
is supplemented by the skewness parameter (Rsk), discrimination
between these two surfaces is achieved. The absolute value of the
skewness is the same for the two surfaces, but the sign is different;
a plus sign for the surface in Figure 1(a) and a minus sign for the
surface in Figure 1(b).

Figure 1(a)
Figure 1(b)
Figure 1: Two rough surfaces in cross-section. The Ra, Rq,
RSm, and Rdr parameters are the same for the two surfaces.
The interface shear strength is much higher for surface (a)
than for surface (b).
An even better representation of a rough surface is obtained if the
kurtosis parameter (Rku) is added. This parameter is a descriptor
of the peakedness of the surface. As the modulus of elasticity of the
implant material is substantially higher than that of bone, stress
peaks will arise in the bone adjacent to the roughness peaks. The
sharper the asperities of the surface roughness, the higher the
stress peaks in the bone. Excessive bone stresses will result in bone
resorption. This means that theoretically the kurtosis parameter
is important in the characterization of implant surface roughness.
A review of the literature on bone implants shows that the skewness
and kurtosis parameters are seldom used in the characterization
of surface roughness. The explanation for this is probably the
experience in surface metrology that these parameters often show a
high spread which is explained by the fact that in the mathematical
expressions of skewness and kurtosis the departures from the
mean line are raised to the power of three and four, respectively
(Table 1). This makes the values of these parameters strongly
influenced by outliers, deviating from the general pattern, which is
also mentioned in the standard EN ISO 4287: 1998.
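
The effect of raising the departures from the mean line to the third and fourth power can be illustrated with a small Python sketch. It assumes the commonly used discrete forms of the parameters (Ra as the mean absolute departure, Rq as the root-mean-square departure, Rsk and Rku as the mean cubed and fourth-power departures divided by Rq cubed and Rq to the fourth power); these formulas, the function name and the sample heights are assumptions for illustration and are not taken from the study.

```python
import math


def roughness_parameters(profile):
    """Return (Ra, Rq, Rsk, Rku) for a list of profile heights."""
    n = len(profile)
    mean_line = sum(profile) / n
    z = [h - mean_line for h in profile]        # departures from the mean line
    ra = sum(abs(v) for v in z) / n             # average roughness
    rq = math.sqrt(sum(v * v for v in z) / n)   # root-mean-square roughness
    rsk = sum(v ** 3 for v in z) / (n * rq ** 3)  # skewness (third power)
    rku = sum(v ** 4 for v in z) / (n * rq ** 4)  # kurtosis (fourth power)
    return ra, rq, rsk, rku


# Hypothetical heights: a mostly flat profile with one isolated peak,
# and its mirror image (one isolated pit).
surface = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0]
mirror = [-h for h in surface]

ra_a, rq_a, rsk_a, rku_a = roughness_parameters(surface)
ra_b, rq_b, rsk_b, rku_b = roughness_parameters(mirror)

print(ra_a, rq_a, rsk_a, rku_a)
print(ra_b, rq_b, rsk_b, rku_b)
```

In this sketch Ra, Rq and Rku are identical for the two profiles and only the sign of Rsk changes, which is the discrimination between mirror-image surfaces described above. Because a single peak dominates the cubed and fourth-power terms, the same sketch also hints at why these parameters react strongly to outliers.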

TABLE 1: 2D SURFACE ROUGHNESS PARAMETERS
DEALT WITH IN THE PRESENT STUDY

According to Albrektsson and Wennerberg, implant surfaces with
a Sa (3D average roughness) value between 1.0 μm and 2.0 μm
(moderately rough surfaces) show stronger bone responses than
smoother and rougher surfaces. They also found that the majority
of the dental implants, currently on the market, have Sa values
within that interval. Sa is a three-dimensional height parameter –
the average departure from the mean surface within the sampling
area. The two-dimensional analogue of the Sa parameter is the Ra
parameter – the average departure from the mean line within the
sampling length.
The metrology standard EN ISO 4288: 1997 differentiates between
periodic and non periodic profiles. For non periodic profiles the
recommended sampling length, when measuring skewness and
kurtosis, depends on the Ra value. For Ra values between 0.1 and
2 μm, the prescribed sampling length is 0.8 mm. This means that if
a moderately rough implant surface is regarded as nonperiodic, a
sampling length of 0.8 mm should be applied for the measurement
of skew and kurtosis. For surfaces having a periodic profile, the
prescribed sampling length is based on the mean width of profile
elements (RSm) to the effect that the sampling length will be 2–6.25
times the mean width of profile elements. The mean width of profile
elements seems to be less than 40 μm for most moderately rough
implant surfaces of today which, according to EN ISO 4288 : 1997,
means that a sampling length of 0.08 mm should be applied. Thus,
the decision of whether to regard a dental implant surface as
having a periodic or non-periodic surface profile has a big impact
on the choice of sampling length. A periodic profile leads to a
sampling length of 0.08 mm, while a non periodic profile gives the
sampling length 0.8 mm. The standard EN ISO 4288: 1997 does not
provide clear information regarding the discrimination between
a periodic and non periodic profile. An inquiry at a company
specialized in surface metrology gave the answer that a blasted,
etched, or plasma sprayed surface should be regarded as having a
non periodic profile. The standard EN ISO 4288: 1997 recommends
that measurements be made on five consecutive sampling lengths;
these five sampling lengths constitute the evaluation length.
In metrology, the surface topography is assumed to consist of
three basic components: form, waviness, and roughness, which are
superimposed upon each other. Roughness is what remains when
the form and waviness components have been subtracted from the
real contour of the surface. This subtraction is effected by a digital
filter; normally a Gaussian filter. According to the standard EN ISO
4287: 1998, the characteristic wavelength of the filter (the cutoff)
should equal the sampling length.
The aim of the present study was to investigate the effect of the
sampling length on the accuracy which can be expected when
measuring skewness and kurtosis on fairly well-defined and
homogenous “semi-periodic” surfaces upon which isolated peaks
of higher amplitude are superimposed.

Conclusion
A primary aim of the surface roughness of dental implants is to
increase the bone-implant interface shear strength. The surface
roughness parameters normally used for characterization of dental
implant surface roughness cannot discriminate between surfaces
expected to give high interface shear strength from surfaces
expected to give low interface shear strength. The skewness
parameter can achieve this discrimination. Kurtosis is another
parameter which theoretically is important in the evaluation of
the quality of a rough implant surface. A problem with these two
parameters is that they are sensitive to isolated outliers. By using
small sampling lengths during measurement, it should be possible
to get accurate values of the skewness and kurtosis parameters.

Analyze the case and comment on how skewness and kurtosis
parameters are used in the characterization of surface roughness
in bone implants.
Source: http://www.hindawi.com/journals/isrn/2011/305312/


CASE STUDY 6: CHAPTER 6


FINDING THE CORRELATION BETWEEN CURRENCY,
BONDS AND STOCKS

The present exchange rate puzzle has created a new ‘trilemma’
which involves the forex, bond and stock markets. The
impressionistic feeling we get is that when the rupee falls and
the RBI intervenes, the bond market goes into a tizzy and rates
increase.
In sympathy, this prompts the stock market to move down as these
signals are interpreted perversely. Quite clearly there is a basis for
this theory which needs to be explored further.
The turning point in the rupee came when the rate crossed 55 a
dollar. From then on till now, which is just a little over 3 months,
there have been a lot of measures taken by authorities, some of
which have worked more than others.
The objective here is to examine two sets of issues. The first relates
to the linkage between the exchange rate, the 10-year G-sec and
stock market movements as denoted by the sensex. The other is a
theoretical exercise which involves the notional cost that has been
involved.

RUPEE & SENSEX

For the record, on a point-to-point basis, the rupee (RBI reference
rate) has fallen by 17.5%, the 10-year yield has gone up by around
90 bps and the Sensex has declined 9.5%. In terms of the linkage
between the two, a rudimentary statistical exercise shows that the
coefficient of correlation between the rupee and sensex at absolute
levels was -0.58 which is quite high with an inverse sign, indicating
that the market does not like a declining rupee. At the incremental
level, i.e. daily changes in both of them, the coefficient was -0.37.
In case of the rupee and the 10-year bond, it was as high as 0.70 at
the absolute level and -0.07 at the incremental level. This shows that
high rupee rates go hand-in-hand with high bond yields. However,
the exact changes in levels are not correlated. Last, higher bond
yields are negatively correlated with sensex at 0.29 (for absolute
levels) and 0.35 (for changes). At the second level, a causal relation
could also be examined between these three sets of variables.
While such correlations do have somewhere an inbuilt assumption of
causation, the causality tests do not support such a relation between
any of these variables. This probably makes sense as bond yields are
also driven mainly by liquidity conditions and regulatory conditions.
The sensex reacts also to political actions and global developments.
Therefore, while there is a tendency to move in a pre-determined
direction – the stock market does not quite like a weak rupee or
high interest rates, which sounds logical, a weak rupee should go
along with higher interest rates.
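
The distinction drawn above between correlation at absolute levels and correlation at the incremental level (daily changes) can be illustrated with a minimal sketch. The two series below are hypothetical, generated only for illustration; they are not the rupee and sensex data used in the article, and the helper names `pearson` and `daily_changes` are our own.

```python
import random
import statistics


def pearson(x, y):
    """Pearson coefficient of correlation between two equal-length series."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def daily_changes(series):
    """First differences: each day's value minus the previous day's."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Hypothetical data (an assumption, NOT the article's series):
# a rupee rate drifting up and an index drifting down, with independent noise.
random.seed(1)
rupee, sensex = [55.0], [20000.0]
for _ in range(65):
    rupee.append(rupee[-1] + 0.15 + random.gauss(0, 0.30))
    sensex.append(sensex[-1] - 30.0 + random.gauss(0, 150.0))

r_levels = pearson(rupee, sensex)
r_changes = pearson(daily_changes(rupee), daily_changes(sensex))
print(f"levels: {r_levels:.2f}, daily changes: {r_changes:.2f}")
```

Two series that trend in opposite directions can show a sizeable correlation at levels even when their day-to-day changes are unrelated, which is one reason the case reports both figures.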

Cost of Rupee Depreciation


The second exercise can be in the direction of guessing as to what
has been the cost of the rupee depreciation as there has been a
sequential drop in the stock market and an increase in bond yields.
The cost is notional as these numbers are at specific points of time
and while the changes that have been mentioned earlier are in the
same direction for this 3-month period, they exclude the peaks.
At the FY13 level of trade deficit of $190 billion, rupee depreciation
means an additional cost of ₹1.83 lakh crore. The external debt
at $390 billion in March means that in future we will have to pay
another ₹3.76 lakh crore. At the same time the fall in the stock
market can be captured by the movement in market capitalisation.
During this period, there has been a fall of ₹2.5 lakh crore, which
may not at all be because of the exchange rate.
Interestingly, the FII withdrawal at this time means that they
were moving out not only when the stock market went down but also
when the rupee depreciated. FIIs leaving today with May 20 as benchmark
would have taken a loss of above 25% as the combined effect of
rupee depreciation and stock market decline. Quite clearly, the
perceived rewards from going back home on account of the US
recovery are more attractive for these players.
In the bond market, the 10-year yield has moved up by close to 100
bps, though the increase has been higher at the lower end of the
maturity spectrum. The government will be affected under ceteris
paribus conditions. So far, it has completed ₹2.6 lakh crore of the
₹4.84 lakh crore of borrowing. Therefore, the balance ₹2.25 lakh
crore could be at around 100 bps higher, which means additional
interest of ₹2,250 crore.
The dealers in the secondary market will be affected adversely
by mark-to-market (MTM) losses, and estimates of these vary
between ₹30,000 crore and ₹50,000 crore. Lending costs have gone up, and a 25 bps
increase in base rates could mean an additional cost of ₹1,200 crore
for the borrowing community assuming 14% growth in credit for the
year and a balance of ₹4.8 lakh crore of borrowing to be undertaken
during the rest of the year. Deposit holders could gain a little less
than this as there is general stickiness in changing deposit rates.
Therefore, a slight increase in FII could be expected on this count.
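
The notional costs quoted above can be retraced with simple arithmetic. The sketch below is only a check of the figures. It assumes that the 17.5% depreciation is measured from the level of 55 rupees per dollar mentioned earlier in the case, which the article does not state explicitly; the variable names are our own.

```python
LAKH_CRORE = 1e12  # 1 lakh crore = 10^5 x 10^7 rupees
CRORE = 1e7

# Assumption: depreciation of 17.5% measured from 55 rupees per dollar.
base_rate = 55.0
depreciation = 0.175
extra_rupees_per_dollar = base_rate * depreciation

# Extra rupee cost of the dollar-denominated trade deficit and external debt.
trade_cost = 190e9 * extra_rupees_per_dollar / LAKH_CRORE
debt_cost = 390e9 * extra_rupees_per_dollar / LAKH_CRORE

# Extra interest on the remaining government borrowing at 100 bps higher.
extra_interest_crore = 2.25 * LAKH_CRORE * (100 / 10_000) / CRORE

# Extra lending cost of a 25 bps rise in base rates on 4.8 lakh crore.
lending_cost_crore = 4.8 * LAKH_CRORE * (25 / 10_000) / CRORE

print(f"Trade deficit cost: {trade_cost:.2f} lakh crore")
print(f"External debt cost: {debt_cost:.2f} lakh crore")
print(f"Extra interest:     {extra_interest_crore:,.0f} crore")
print(f"Extra lending cost: {lending_cost_crore:,.0f} crore")
```

Under this assumption the trade deficit figure comes to about 1.83 lakh crore and the external debt figure to about 3.75 lakh crore, close to the 3.76 quoted; the small gap is presumably due to rounding of the exchange rates used in the article.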
As everything appears to affect everything, we could alter Thomas
Friedman’s phraseology, and say that markets are flat.

1. What is the linkage or correlation between the exchange
rate, 10-year G-sec and stock market movements as denoted
by the sensex?
2. Analyse the case and suggest how we can strengthen the
rupee by using correlation analysis of rupee, sensex and
other variables.
Source: http://articles.economictimes.indiatimes.com/

CASE STUDY 7: CHAPTER 7



CONTRIBUTIONS: SIMPLE LINEAR REGRESSION


BACKGROUND

The Colorado Combined Campaign solicits Colorado government
employees’ participation in a fund-raising drive. Funds raised by
the campaign go to over 700 Colorado charities in all, including
the Humane Society of Boulder Valley and the Denver Children’s
Advocacy Center. Prominent state employees, such as university
presidents, chancellors and lieutenant governors, head the annual
campaigns. An advisory committee determines whether the
charities receiving contributions provide the services claimed in a
fiscally responsible manner.
All Colorado state employees may contribute to the fund. However,
certain state institutions are targeted to receive promotional
brochures and campaign literature. Employees in these targeted
groups are referred to as “eligible” employees. Each year, the
number of eligible employees is known in June. Fund-raising
activities are then conducted throughout the fall. By year’s end,
total contributions raised that year are tabulated.

The Task
It is now June 2010. The number of eligible employees for 2010 has
been determined to be 53,455. Does knowing the number of eligible
employees help predict 2010 year-end contributions?

The Data
This is an annual time-series from 1988–2009. The variables are
contribution Year and:
Actual: Total contributions to the campaign for the year in dollars
Employees: Number of eligible employees that year

Analysis
The average level of contributions during this time period was
$1,143,769, with a typical fluctuation of $339,788 around the
average. The average number of eligible employees was 45,419,
with a typical fluctuation of 9,791.

EXHIBIT 1: SUMMARY STATISTICS FOR ACTUAL AND EMPLOYEES

As we can see in Exhibit 2, contributions are growing over time:


EXHIBIT 2: TIME SERIES PLOT OF ACTUAL BY YEAR

The long-term growth in contributions is attributable to two
phenomena:
• The amount contributed per eligible employee has trended mostly
upward (Exhibit 3, top).
• The number of eligible employees is on the rise, particularly in
the 1999 to 2002 campaign years (Exhibit 3, bottom).

EXHIBIT 3: TIME SERIES PLOTS OF ACTUAL PER EMPLOYEE AND EMPLOYEES

The scatterplot and least squares regression line using Actual as
the response variable and Employees as the predictor variable are
shown in Exhibit 4. The formula for the regression line is found
below the plot under Linear Fit. The slope of the fitted line, 33.555,
estimates the contribution for each eligible employee over this
time period. Hence, the model estimates an additional $33.56
in contributions for each additional eligible employee. Under Parameter
Estimates, we see that the number of employees is a statistically
significant predictor of year-end contributions; the p-value, listed
as Prob > |t|, is < 0.0001.
The number of employees doesn’t perfectly predict contributions.
Just over 93% of the variability in contributions is associated with
variability in number of eligible employees (RSquare = 0.934907).
Comparing the standard deviation of Actual ($339,788) to the root
mean square error of the regression (RMSE = $88,832) suggests
that a substantial reduction in the variation in contributions occurs
by using the regression model to explain variation in year-end
contributions.
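
The link between the standard deviation of Actual, RSquare and RMSE can be checked from the reported figures alone. The sketch below assumes one observation per year from 1988 to 2009 (n = 22) and that the reported standard deviation is the sample standard deviation; neither is stated explicitly in the case.

```python
import math

n = 22                   # assumption: one observation per year, 1988 to 2009
sd_actual = 339_788.0    # standard deviation of Actual, from the case
r_square = 0.934907      # RSquare, from the case

# Total variation of Actual around its mean (total sum of squares).
sst = (n - 1) * sd_actual ** 2
# Variation left unexplained by the regression (error sum of squares).
sse = (1 - r_square) * sst
# RMSE uses n - 2 degrees of freedom in simple linear regression.
rmse = math.sqrt(sse / (n - 2))

reduction = 1 - rmse / sd_actual
print(f"RMSE = {rmse:,.0f}, reduction in typical error = {reduction:.0%}")
```

Under these assumptions the computed RMSE agrees with the reported $88,832 to within rounding, that is, roughly a quarter of the original standard deviation.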

EXHIBIT 4: REGRESSION WITH ACTUAL (Y) AND EMPLOYEES (X)


We’ve been informed that the number of eligible employees in 2010
is 53,455. To use the regression equation to forecast 2010 year-end
contributions, we can plug this number into the regression equation.
If the number of employees is 53,455, the predicted actual
contributions are:
Actual = – 380265.5 + (33.555042) × Employees
    = – 380265.5 + (33.555042) × (53,455)
    = 1413419.3 (or, $1,413,419)

In words, given that the number of eligible employees is 53,455,
our model estimates that 2010 year-end contributions will be
approximately $1.413 million.
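
The same substitution can be written as a small function. This is a minimal sketch using the intercept and slope quoted in the case; the function name `predict_contributions` is hypothetical.

```python
INTERCEPT = -380265.5   # intercept of the fitted line, from the case
SLOPE = 33.555042       # dollars per eligible employee, from the case


def predict_contributions(employees):
    """Predicted year-end contributions (dollars) for a given number of eligible employees."""
    return INTERCEPT + SLOPE * employees


forecast_2010 = predict_contributions(53_455)
print(f"${forecast_2010:,.0f}")  # prints $1,413,419
```

Writing the equation as a function makes it easy to repeat the forecast once the next year's number of eligible employees is known.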
Easier still, we can skip the math exercise, save the regression
formula and prediction intervals and ask JMP to calculate the
estimated contributions for 2010 (Exhibit 5). Prediction intervals
are useful, since the number of employees isn’t a perfect predictor
of contributions. The prediction interval gives us an estimate of
the interval in which the 2010 year-end contributions will fall (with
95% confidence).
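
For readers without JMP, an approximate 95% prediction interval can be reconstructed from the summary statistics quoted in the case, using the standard formula for simple linear regression. The sketch below assumes n = 22 annual observations and sample standard deviations, and takes the t critical value 2.086 (20 degrees of freedom) from a standard t table. It is a reconstruction, not the output shown in Exhibit 5.

```python
import math

n = 22              # assumption: one observation per year, 1988 to 2009
x_bar = 45_419.0    # average number of eligible employees, from the case
sd_x = 9_791.0      # standard deviation of Employees, from the case
rmse = 88_832.0     # RMSE of the regression, from the case
t_crit = 2.086      # t table value for 95% confidence, n - 2 = 20 df

x_new = 53_455
y_hat = -380265.5 + 33.555042 * x_new

# Sum of squared deviations of Employees around its mean.
sxx = (n - 1) * sd_x ** 2
# Standard error of an individual prediction at x_new.
se_pred = rmse * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / sxx)

lower = y_hat - t_crit * se_pred
upper = y_hat + t_crit * se_pred
print(f"95% prediction interval: ${lower:,.0f} to ${upper:,.0f}")
```

With these inputs the interval runs from roughly $1.2 million to roughly $1.6 million; small differences from the software output are to be expected because the inputs are rounded.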

EXHIBIT 5: PREDICTED VALUE AND PREDICTION INTERVAL FOR 2010 CONTRIBUTION


Predicted values can also be explored dynamically using the
cross-hair tool. In Exhibit 6, we see that the predicted value for Actual, if
Employees is 53,414, is around $1.402 million.

EXHIBIT 6: USING CROSS-HAIR TOOL TO EXPLORE PREDICTED CONTRIBUTION

We can also graphically explore prediction intervals (Exhibit 7).


EXHIBIT 7: PREDICTION INTERVALS FOR ACTUAL

Managerial Implications
Regression has provided a prediction for year-end 2010 Colorado
Combined Campaign contributions of $1.4M. In managerial settings
such as this, where the response variable represents a business
goal, managers often set higher expectations than the predicted
value to motivate improved performance. One such choice here
might be the upper 95% prediction limit of $1.6M.
This forecasting methodology can be repeated year after year. Once
the final contributions to 2010 are known, they can be added to the
data set and the regression line can be recalculated. By midyear of
2011, the number of eligible employees will be known. Note that,
in this case, we used only Employees as the
predictor. We could also fit a model with both Employees and Year.
We will consider regression models with more than one predictor
in a future case.

1. Perform regression analysis with the Colorado Combined
Campaign data, using Actual as the response variable and
Year as the predictor.
2. Compare your forecast for 2010 with that obtained from the
simple linear regression model in which number of eligible
employees is the predictor variable.
Hint: Compare RMSE, RSquare, and the estimated
contributions for 2010. Which model does a better job of
explaining variation in contributions?
Source: http://mis.aug.edu/drjmatls/Quan6600/case_study_library_all/08Contributions.pdf

CASE STUDY 8: CHAPTER 7



SELECTING COLLEGES

A high school student discusses plans to attend college with a guidance
counsellor. The student has a 2.04 grade point average out of 4.00
maximum and mediocre to poor scores on the ACT. He asks about
attending Harvard. The counsellor tells him he would probably not
do well at that institution, predicting he would have a grade point
average of 0.64 at the end of four years at Harvard. The student
inquires about the necessary grade point average to graduate and,
when told that it is 2.25, decides that another
institution might be more appropriate in case he becomes involved
in some “heavy duty partying.”
When asked about the large state university, the counsellor predicts
that he might succeed, but chances for success are not great, with a
predicted grade point average of 1.23. A regional institution is then
proposed, with a predicted grade point average of 1.54. Deciding
that is still not high enough to graduate, the student decides to
attend a local community college, graduates with an associate’s
degree and makes a fortune selling real estate.
If the counsellor was using a regression model to make the
predictions, he or she would know that this particular student
would not make a grade point of exactly 0.64 at Harvard, 1.23 at the state
university, and 1.54 at the regional university. These values are
just “best guesses.” It may be that this particular student was
completely bored in high school, didn’t take the standardized tests
seriously, would become challenged in college and would succeed
at Harvard. The selection committee at Harvard, however, when
faced with a choice between a student with a predicted grade
point of 3.24 and one with 0.64 would most likely make the rational
decision of choosing the most promising student.

1. Which regression model will fit best in the above situation?
2. Make regression equations and find out the regression
coefficients based on the above case if possible. Give
explanations for your answer.
Source: http://www.psychstat.missouristate.edu/introbook/sbk16.htm

CASE STUDY 9: CHAPTER 8



DEATH PENALTY PROBABILITY

Radelet (1981) studied effects of racial characteristics on whether
individuals convicted of homicide receive the death penalty. The
events that are considered in this study are the selection of a case
with “death penalty verdict”, “not death penalty verdict”, “white
defendant”, “black defendant”, “white victim”, and “black victim”.
The 326 subjects were defendants in homicide indictments in 20
Florida counties during 1976-1977.
The following table gives the number of subjects for each of the
defendant’s race, victim’s race and death penalty combinations.

[Table: number of subjects by defendant’s race, victim’s race and death penalty verdict]

The main question that one would like to answer is “Is there
evidence of racial discrimination given the evidence in
this table?” Also, one would be interested in the following
questions:
1. Is there a relation between defendant’s race and victim’s
race?
2. Is there a relation between victim’s race and death penalty?
3. If we control for the victim’s race, that is if we look at the
cases for black victims and white victims separately, what
is the relation between defendant’s race and death penalty
verdict?
Source: Agresti, A. Categorical Data Analysis, John Wiley & Sons, 1990, pg. 135-138
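
Questions 2 and 3 come down to comparing conditional probabilities, which can be sketched as follows. The counts are not printed in the case text above, so the figures used here are an assumption: they are the counts commonly quoted for Radelet's study in the cited Agresti reference, and they sum to the 326 subjects mentioned in the case. They should be checked against the source before being relied upon. The function name `death_rate` is our own.

```python
# ASSUMED counts (verify against the cited table):
# (defendant's race, victim's race) -> (death penalty, no death penalty)
counts = {
    ("white", "white"): (19, 132),
    ("white", "black"): (0, 9),
    ("black", "white"): (11, 52),
    ("black", "black"): (6, 97),
}


def death_rate(defendant=None, victim=None):
    """Proportion receiving the death penalty, optionally conditioned on race."""
    yes = no = 0
    for (d, v), (n_yes, n_no) in counts.items():
        if defendant in (None, d) and victim in (None, v):
            yes += n_yes
            no += n_no
    return yes / (yes + no)


total = sum(n_yes + n_no for n_yes, n_no in counts.values())
print("subjects:", total)
for d in ("white", "black"):
    print(f"P(death | {d} defendant) = {death_rate(defendant=d):.3f}")
    for v in ("white", "black"):
        rate = death_rate(defendant=d, victim=v)
        print(f"  P(death | {d} defendant, {v} victim) = {rate:.3f}")
```

With these counts, the overall proportion is slightly higher for white defendants, yet within each victim's-race group it is higher for black defendants. This reversal on controlling for a third variable is what question 3 is designed to bring out.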

CASE STUDY 10: CHAPTER 9



SHOPPING ATTITUDE

A nationwide random sample of 2,500 adults was asked if they
agreed or disagreed with the statement “I like buying clothes, but
shopping is often frustrating and time-consuming.” Suppose that
in fact 60% of the population of all adult U.S. residents would say
“Agree” if asked this question. What is the probability that 1,520 or
more of the sample agree?
• The responses of the 2,500 randomly chosen adults (from over
210 million adults) can be taken to be independent.
• The number X in the sample who agree has a binomial
distribution with n=2,500 and p=0.60.
• To find the probability that at least 1,520 people in the sample
agree, we would need to add the binomial probabilities of all
outcomes from X=1,520 to X=2,500. This is not practical.

Histogram of 1000 simulated values of the binomial variable X, and
the density curve of the Normal distribution with the same mean
and standard deviation:

\[
\mu = np = 2500(0.6) = 1500
\]
\[
\sigma = \sqrt{np(1-p)} = \sqrt{(2500)(0.6)(0.4)} = \sqrt{600} = 24.49
\]

Assuming X has the N(1500, 24.49) distribution
[np and n(1-p) are both ≥ 10], we have

\[
\begin{aligned}
P(X \ge 1520) &= P\left(\frac{X-\mu}{\sigma} \ge \frac{1520-1500}{24.49}\right) \\
&= P(Z \ge 0.82) \\
&= 1 - 0.7939 \quad \text{(from Standard Normal Table)} \\
&= 0.2061
\end{aligned}
\]
The probability of observing 1,520 or more adults in the sample
who agree with the statement has been calculated as 20.61% using
the Normal approximation to the Binomial.
Using a computer program to calculate the actual Binomial
probabilities for all values from 1520 to 2,500, the true probability
of observing 1,520 or more who agree is 21.31%. This is a very good
approximation!
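
The comparison between the exact Binomial probability and the Normal approximation can be reproduced with a short program. The sketch below uses exact integer arithmetic for the Binomial tail, since p = 0.60 = 3/5. The continuity-corrected figure is an addition for comparison and is not part of the case.

```python
import math

n, k_min = 2500, 1520

# Exact tail: with p = 3/5, P(X = k) = C(n, k) * 3**k * 2**(n - k) / 5**n.
numerator = sum(math.comb(n, k) * 3**k * 2**(n - k) for k in range(k_min, n + 1))
exact = numerator / 5**n

mu = n * 0.6
sigma = math.sqrt(n * 0.6 * 0.4)


def normal_upper_tail(x):
    """P(X >= x) when X is Normal with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))


approx = normal_upper_tail(k_min)            # as in the case, without rounding z
approx_cc = normal_upper_tail(k_min - 0.5)   # with continuity correction (our addition)

print(f"exact binomial:        {exact:.4f}")
print(f"normal approximation:  {approx:.4f}")
print(f"with continuity corr.: {approx_cc:.4f}")
```

The uncorrected approximation differs slightly from the 0.2061 in the case because the case rounds z to 0.82 before using the table.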

Analyze the case above and try to find the probability that at
least 1,000 people in the sample agree, and add the binomial
probabilities of all outcomes from X=1,000 to X=2,000.


CASE STUDY 11: CHAPTER 10



AGE DISTRIBUTION IN THE UNITED STATES

One of the jobs of the U.S. Census Bureau is to keep track of the age
distribution in the country. The age distribution in 2013 is shown
below.

Figure 1: Age Distribution in the U.S.

TABLE 1


We used Microsoft Excel to select 36 random samples with n = 40
from the age distribution of the United States. The means of the
36 samples were as follows.
28.14, 31.56, 36.86, 32.37, 36.12, 39.53,
36.19, 39.02, 35.62, 36.30, 34.38, 32.98,
36.41, 30.24, 34.19, 44.72, 38.84, 42.87,
38.90, 34.71, 34.13, 38.25, 38.04, 34.07,
39.74, 40.91, 42.63, 35.29, 35.91, 34.36,
36.51, 36.47, 32.88, 37.33, 31.27, 35.80
1. Enter the age distribution of the United States into
Microsoft Excel. Use the tool to find the mean age in the
United States.
2. Enter the set of sample means into Microsoft Excel. Find
the mean of the set of sample means. How does it compare
with the mean age in the United States? Does this agree
with the result predicted by the Central Limit Theorem?
3. Are the ages of people in the United States normally
distributed? Explain your reasoning.
4. Sketch a relative frequency histogram for the 36 sample
means. Use nine classes. Is the histogram approximately
bell shaped and symmetric?
Does this agree with the result predicted by the Central
Limit Theorem?
5. Use Microsoft Excel to find the standard deviation of the
ages of people in the United States.
6. Use Microsoft Excel to find the standard deviation of the
set of 36 sample means. How does it compare with the
standard deviation of the ages? Does this agree with the
result predicted by the Central Limit Theorem?
Source: U.S. Census Bureau, www.census.gov
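
The case asks for Microsoft Excel, but the calculations in questions 2 and 6 can be cross-checked in any tool. The following is a minimal sketch in Python using the 36 sample means listed in the case. The last line rearranges the Central Limit Theorem relation (standard deviation of sample means ≈ σ/√n) to give a rough implied value of σ; it is an illustration only and does not replace question 5, which needs the age distribution table.

```python
import math
import statistics

# The 36 sample means listed in the case (each from a sample of n = 40).
sample_means = [
    28.14, 31.56, 36.86, 32.37, 36.12, 39.53,
    36.19, 39.02, 35.62, 36.30, 34.38, 32.98,
    36.41, 30.24, 34.19, 44.72, 38.84, 42.87,
    38.90, 34.71, 34.13, 38.25, 38.04, 34.07,
    39.74, 40.91, 42.63, 35.29, 35.91, 34.36,
    36.51, 36.47, 32.88, 37.33, 31.27, 35.80,
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)   # sample standard deviation

# Rough population standard deviation implied by sd_of_means = sigma / sqrt(n).
implied_population_sd = sd_of_means * math.sqrt(40)

print(f"mean of sample means: {mean_of_means:.2f}")
print(f"std dev of sample means: {sd_of_means:.2f}")
print(f"implied population std dev: {implied_population_sd:.1f}")
```

In Excel, the corresponding worksheet functions are AVERAGE and STDEV.S.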

CASE STUDY 12: CHAPTER 9



BIRTH WEIGHTS IN AMERICA

The National Center for Health Statistics (NCHS) keeps records of
many health-related aspects of people, including the birth weights
of all babies born in the United States.
The birth weight of a baby is related to its gestation period (the
time between conception and birth). For a given gestation period,
the birth weights can be approximated by a normal distribution.
The means and standard deviations of the birth weights for various
gestation periods are shown at the right.
One of the many goals of the NCHS is to reduce the percentage of
babies born with low birth weights. As you can see from the graph
at the upper right, the problem of low birth weights increased from
1988 to 2002.


1. The distributions of birth weights for three gestation
periods are shown. Match the curves with the gestation
periods. Explain your reasoning.
2. What percent of the babies born with each gestation period
have a low birth weight (under 5.5 pounds)? Explain your
reasoning.
(a) Under 28 weeks (b) 32 to 35 weeks
(c) 37 to 39 weeks (d) 42 weeks and over
3. Describe the weights of the top 10% of the babies born with
each gestation period. Explain your reasoning.
(a) 37 to 39 weeks (b) 42 weeks and over
4. For each gestation period, what is the probability that a
baby will weigh between 6 and 9 pounds at birth?
(a) 32 to 35 weeks (b) 37 to 39 weeks (c) 42 weeks and over
5. A birth weight of less than 3.3 pounds is classified by the
NCHS as a very low birth weight. What is the probability
that a baby has a very low birth weight for each gestation
period?
(a) Under 28 weeks (b) 32 to 35 weeks (c) 37 to 39 weeks
