COURSE DESIGN COMMITTEE
TOC Reviewer: Mr. Ravindra Babu S, Visiting Faculty, NMIMS Global Access – School for Continuing Education. Specialization: Finance
Content Reviewer: Mr. Ravindra Babu S, Visiting Faculty, NMIMS Global Access – School for Continuing Education. Specialization: Finance
Author : DP Apte
Copyright:
2015 Publisher
ISBN:
978-81-8323-129-9
Address:
A-45, Naraina, Phase-I, New Delhi – 110 028
Only for
NMIMS Global Access - School for Continuing Education School Address
V. L. Mehta Road, Vile Parle (W), Mumbai – 400 056, India.
C O N T E N T S
4 Measures of Dispersion
5 Skewness and Kurtosis
BUSINESS STATISTICS
C U R R I C U L U M
Measures of Central Tendency: Characteristics of Central Tendency, Arithmetic Mean,
Median, Mode
Measures of Dispersion: Characteristics of Measures of Dispersion, Absolute and Relative
Measures of Dispersion, Range, Interquartile Range and Deviations, Variance and Standard
Deviation, Case Study Problem covering Variance, Standard Deviation and Coefficient of
Variation
Skewness and Kurtosis: Karl Pearson’s Coefficient of Skewness (SKp), Bowley’s Coefficient of
Skewness (SKB), Kelly’s Coefficient of Skewness (Skk), Measures of Kurtosis, Moments
Use of Excel Software for Statistical Analysis: Introduction to Excel, Entering Data in Excel,
Descriptive Statistics, Basic Built-in Functions (Average, Mean, Mode, Count, Max and Min),
Statistical Analysis, Normal Distribution, Brief about SPSS
CONTENTS
1.1 Introduction
1.2 Development of Statistics
1.3 Definitions of Statistics
1.4 Importance of Statistics
1.5 Classification of Statistics
1.6 Role of Statistics
1.6.1 Role of Statistics in Business
1.6.2 Role of Statistics in Decision Making
INTRODUCTORY CASELET
Let us define what is meant by structured and unstructured
data. The unstructured data of an organization includes e-mail
correspondence, text documents, even voice and video. But in
large part, most of the information in an organization is structured.
This is the quantified information found in financial statements,
statistical reports and other sources that include responses to
surveys, point of sale information and sales reports. In essence, it is the
non-text data the organization generates.
1.1 INTRODUCTION
Information derived from good statistical analysis is always precise
and never useless. One of the primary tasks of a manager is decision-
making. Decision-making is usually based on past experience and
future projections. In many situations, decision-making based purely
on personal experience, subjective judgment and intuition is rather
difficult and inefficient. Statistical techniques offer powerful tools
in the decision-making process. These tools have the power to interpret
quantitative information in a scientific and objective manner.
They also provide a conceptual framework to the decision
maker and enable him/her to comprehend qualitative information in
a more objective way.
It is said that, “There are three kinds of lies; lies, damn lies and
statistics.” Malcolm Forbes, publisher of Forbes magazine and an
adventurer, once got lost floating for miles in one of his famous
balloons and finally landed in the middle of a cornfield. He spotted a
man coming towards him and asked, “Sir, can you tell me where I am?”
The man said, “Certainly, you are in a basket in a field of corn.” Forbes
said, “You must be a statistician.” The man said, “That’s amazing,
how did you know that?” “Easy”, said Forbes, “Your information is
concise, precise and absolutely useless!”
The story is of course in a lighter vein. Nevertheless, it conveys two
points about the use of statistics. Firstly, good statistical tools must
assist the effective decision-making process. Thus, appropriateness of
the tool and interpretation of the results are essential ingredients for
decision-making. Secondly, the information derived from irrelevant
data may not lead to the right conclusion. However, in such a case the
statistician is to blame, and not the statistical tools.
etc. There is evidence of the use of some of the principles of statistics by
the ancient Indian civilization. Some of the techniques find mention
in Vedic Mathematics. However, the modern statistical methods spread
from Italy to France, Holland and Germany in the 16th century.
In ancient times, even before 300 BC, rulers and kings like
Chandragupta Maurya used statistics to maintain land and revenue
records, collect taxes and register births and deaths.
During the seventeenth century, statistics was used in Europe for a
variety of information like life expectancy and gambling. The theoretical
development of modern statistics took place in the mid-seventeenth
century with the introduction of the ‘Theory of Probability’ and the ‘Theory
of Games and Chance’. Many famous problems like ‘the problem of
points’ (posed by Chevalier de Méré), ‘the gambler’s ruin’ etc. posed
by professional gamblers were solved by mathematicians. These
solutions laid the foundation for the theory of probability and statistics.
Some of the notable contributors in the development of statistics are:
Pascal, Fermat, James Bernoulli, De Moivre, Laplace, Gauss, Euler,
Lagrange, Bayes, Kolmogorov, Karl Pearson and so on. One of the most
significant works in modern times is by Ronald A. Fisher (1890-1962),
who is considered to be the ‘Father of Statistics’ by the community
of statisticians all over. He applied statistics to diversified fields such
as education, agriculture, genetics, biometry, psychology, etc. He also
pioneered ‘Estimation Theory’, ‘Exact sampling distribution’, ‘Analysis
Statistical activities are often associated with models expressed
using probabilities, and require probability theory for them to be
put on a firm theoretical basis.
1.3 DEFINITIONS OF STATISTICS
Since all branches use statistics, there are a number of definitions
of statistics, each based on the way one looks at the application of
statistics. Some of the definitions appealing to the managerial
Thus, statistics is a science of collection, organisation, presentation,
analysis and interpretation of data, so that it helps a manager to take
effective and knowledgeable decisions under given circumstances.
a sample is used to give information about the overall average in
the population from which that sample was drawn.
It is possible to draw more than one sample from the same
population and the value of a statistic will in general vary from
sample to sample. For example, the average value in a sample is a
statistic. The average values in more than one sample, drawn from
the same population, will not necessarily be equal.
Statistics are often assigned Roman letters (e.g. m and s), whereas
the equivalent unknown values in the population (parameters) are
assigned Greek letters (e.g. μ and σ).
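The difference between a statistic and a parameter can be illustrated with a small simulation. The sketch below is only an illustration: the population (10,000 normally distributed values), the sample size of 100 and the seed are all assumptions chosen for the example, not figures from the text.

```python
import random
import statistics

# Hypothetical population; the mean of 50 and spread of 10 are arbitrary choices.
rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Parameters describe the whole population (Greek letters: mu, sigma).
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

# Statistics describe one sample (Roman letters: m, s) and vary between samples.
sample_means = []
for _ in range(3):
    sample = rng.sample(population, 100)
    m = statistics.fmean(sample)
    s = statistics.stdev(sample)
    sample_means.append(m)
    print(f"sample mean m = {m:.2f}, sample s.d. s = {s:.2f}")

print(f"population mean mu = {mu:.2f}, population s.d. sigma = {sigma:.2f}")
```

Each of the three samples gives a slightly different value of m, while mu stays fixed for the population.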
List down the areas of daily life where you feel statistics plays
an important role, for example noticing average temperature in
summers, etc.
1.5 CLASSIFICATION OF STATISTICS
Statistical methods are broadly divided into five categories. These
categories are not mutually exclusive. These are often found to be
overlapping.
Descriptive Statistics: When statistical methods are used,
a problem is always formulated in terms of ‘population’ or
Inductive Statistics: Decision making in most business
situations requires estimates about the future, such as trends and forecasts.
Inductive statistics includes methods that help in generalizing
trends based on random observations. This process provides
estimates indirectly on the basis of partial data, or forecasts
based on past data; for example, the future price of
a share based on the inflow of funds from FIIs.
Inferential Statistics: Another way, in which conclusions or
decisions are made, is using a portion of population or sample
from the universe. The sample data is analyzed. Then based on
the sample evidence, conclusions are generalized about the target
population. Exit poll during elections is an example of sample
survey. This method is referred to as ‘Statistical Inference’.
Hypotheses and significance tests form an important part of
inferential statistics.
Applied Statistics: It is the application of statistical methods
and techniques for solving real-life problems. Quality
control, sample surveys, inventory management, simulations,
quantitative analysis for business decision making, etc., form a
part of this category.
accounting, quality control, distribution channel design, etc. Hence,
understanding statistical concepts and knowledge of using statistical
tools is essential for today’s managers.
1.6.2 ROLE OF STATISTICS IN DECISION MAKING
Very often, people consider decision-making just as an act of selection
among alternatives. However, there are two more phases in decision-
making. Nobel Laureate Herbert A. Simon identified the phases of
decision-making as:
Information gathering: Searching the environment for
information, called the intelligence activity.
Generation of alternatives: Inventing, developing and analyzing
possible courses of action, called the design activity.
Selection of alternatives: Selecting a particular course of action
from those available, called the decision activity.
The most important task of a manager is to take decisions in a given
situation that help an organization achieve its goals. Management
is a process of converting information into action; this we call
decision-making. Decision-making is a deliberate thought process
that uses available data to develop alternatives to choose from, so as
to find the best solution to the problem at hand.
Statistics and statistical tools play a vital role during all
three phases of decision-making. There are two basic approaches to decision-
making, namely, quantitative (or mathematical) and qualitative (or
rational, creative and judgmental). In the first approach, statistics
and mathematics play a dominant role. Even in the second approach,
statistics plays a role in the collection and presentation of data to help the
decision-maker’s intuition. The extent to which statistical and
mathematical tools can be used depends upon the situation.
Decision-making situations can be briefly classified as:
Decision-making under certainty: These are deterministic
situations amenable to mathematical tools to the fullest extent.
Decision-making under risk: These are stochastic situations
amenable to statistical tools to a large extent, supplemented by
rational decision-making.
Decision-making under uncertainty: These are amenable to
judgmental and creative approaches.
It is observed that middle level and senior level managers primarily
deal with decision-making under risk or, in a few cases, decision-making
under uncertainty. Thus, knowledge of statistical and mathematical
computational tools is necessary, if not mandatory, for efficient and
effective decision-making. It is not required to apply all advanced
statistical tools in every situation. Certain tools may not be applicable
in some cases. Simple statistics like the average, weighted average,
percentage, standard deviation and index numbers reveal a great deal
of information in many decision-making scenarios. Exploratory
investigation may, however, require some advanced tools.
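As a rough illustration of how little machinery these simple statistics need, the following Python sketch computes each of them with the standard library. All figures (monthly sales, prices and quantities) are hypothetical values invented for the example.

```python
import statistics

# Hypothetical monthly sales figures (in thousands of rupees).
sales = [120, 135, 128, 150, 142, 125]

# Hypothetical unit prices and the quantities sold at each price (the weights).
unit_prices = [10, 12, 15]
units_sold = [200, 100, 50]

average = statistics.mean(sales)
std_dev = statistics.stdev(sales)  # sample standard deviation

# Weighted average: each price counts in proportion to the quantity sold.
weighted_average = sum(p * w for p, w in zip(unit_prices, units_sold)) / sum(units_sold)

# Percentage change from the first month to the last month.
pct_change = (sales[-1] - sales[0]) / sales[0] * 100

# Simple index numbers with the first month as base (base = 100).
index_numbers = [s / sales[0] * 100 for s in sales]

print(f"average = {average:.2f}, standard deviation = {std_dev:.2f}")
print(f"weighted average price = {weighted_average:.2f}")
print(f"change over the period = {pct_change:.2f}%")
print("index numbers:", [round(i, 1) for i in index_numbers])
```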
1.6.3 ROLE OF STATISTICS IN RESEARCH
Statistical analysis is a vital component in every aspect of research.
Social surveys, laboratory experiment, clinical trials, marketing
research, human resource planning, inventory management, quality
List down the areas in your work situation where in your opinion
statistical tools would improve decision-making.
Functions of statistics are described below:
Condensation: Statistics compresses a mass of figures into a small amount of
meaningful information, for example, average sales, BSE index
(SENSEX), growth rate. It is impossible to get a precise idea
about the profitability of a business from a record of income
and expenditure transactions. The information of Return on
Investment (ROI), Earnings per Share (EPS), profit margins,
are always more convincing than vague utterances. For example,
‘increase in profit margin is less in year 2006 than in year 2005’
does not convey a definite piece of information. On the other
hand, statistics presents the information more definitely like
“profit margin is 10% of the turnover in year 2006 against 12% in
year 2005”.
Expectation: Statistics provides the basic building block for
framing suitable policies. For example, how much raw material
should be imported, how much capacity should be installed, or
manpower recruited, etc., depends upon the expected value of
outcome of our present decisions.
large number of items, chosen at random from a large group, are
almost sure on an average to possess the characteristics of the
large group.” For example, it is difficult to predict the failure of an
individual machine or an accident on an expressway, but not difficult
to indicate what percentage of a large number of machines might
suffer a breakdown in a given period. Similarly, the average
number of accidents on an expressway would remain stable over a
fairly long period of time unless the conditions have changed
drastically.
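The machine-breakdown example can be mimicked with a short simulation. This is a sketch under stated assumptions: the breakdown probability of 5% per period and the group sizes are hypothetical, and each machine is assumed to fail independently of the others.

```python
import random

rng = random.Random(7)          # fixed seed so the sketch is repeatable
BREAKDOWN_PROBABILITY = 0.05    # assumed chance that one machine fails in the period


def breakdown_share(n_machines):
    """Simulate n_machines independent machines; return the share that broke down."""
    failures = sum(rng.random() < BREAKDOWN_PROBABILITY for _ in range(n_machines))
    return failures / n_machines


single_machine = breakdown_share(1)         # either 0.0 or 1.0: unpredictable
small_group = breakdown_share(20)           # still varies a lot from run to run
large_group_share = breakdown_share(100_000)  # settles close to the assumed 5%

print(f"one machine: {single_machine:.0%}")
print(f"20 machines: {small_group:.0%}")
print(f"100,000 machines: {large_group_share:.2%}")
```

The outcome for one machine tells us almost nothing, whereas the share for the large group stays close to the underlying rate.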
Statistical techniques, because of their flexibility and economy, have
become popular and are used in numerous fields. But statistics is not
a cure-all technique and has limitations. It cannot be applied to all
kinds of situations and cannot be made to answer all queries. The
major limitations are:
Statistics deals with only those problems, which can be
expressed in quantitative terms and amenable to mathematical
Associations and relationship: These include testing of
dependence between attributes, correlation and regression and
non-parametric methods.
Multivariate method: These include factor analysis, cluster
analysis, discriminant analysis, probit and logit analysis, path
analysis, profile analysis, multivariate ANOVA, and analysis of
factorial experiments.
Each of these requires a fundamental understanding of its statistical
origin and purpose.
the incorrect use of data. This happens due to a lack of understanding of
statistical principles or intentional fudging of the figures with ulterior
motives. As King says, “Statistics are like clay of which one can make
a god or devil as one pleases”. According to Bowley, “Statistics only
furnishes tools, necessary though imperfect, which are dangerous in
the hands of those who do not know its use and its deficiencies”. It is
often quoted by managers that “figures don’t lie, liars figure”.
The distrust of statistics among managers is the result of bad experience,
Establishing absurd correlations or associations just because
independent data appears moving together.
Comparing and drawing causal relationship between unrelated
variables based on association.
Changing hypotheses after collecting and analyzing the data.
selectivity, or purposeful manipulation.
Give a practical example from Indian industry where statistics has
been misused and resulted in losses.
1.9 SUMMARY
Managerial decision-making can be made efficient and effective
by analyzing available data using appropriate statistical tools.
Statistical tools not only have application in research (marketing
research included) but also in other functional areas like quality
management, inventory management, financial analysis, human
resource planning and so on.
The word statistics is derived from the Italian word ‘Stato’ which
means ‘state’; and ‘Statista’ refers to a person involved with the
affairs of state. Thus, statistics originally was meant for collection
of facts useful for affairs of the state, like taxes, land records,
population demography, etc.
Significant contribution has also been made by Indians in the
field of statistics. Prof. Prasanta Chandra Mahalanobis was the first
to pioneer the study of statistical science in India. He founded
the Indian Statistical Institute (ISI) in 1931. Mahalanobis viewed
statistics as a tool for increasing the efficiency of all human efforts
and also concentrated on sample surveys.
Statistics is the classified facts representing the conditions of the
people in the state…. specially those facts which can be stated
in number or in table of numbers or in any tabular or classified
arrangement.
Statistical methods are broadly divided into five categories.
These are Descriptive Statistics, Analytical Statistics, Inductive
Statistics, Inferential Statistics and Applied Statistics.
Statistics is an indispensable tool of production control and
market research. Statistical tools are extensively used in business
for time and motion study, consumer behaviour study, investment
decisions, performance measurements and compensations,
credit ratings, inventory management, accounting, quality
control, distribution channel design, etc.
Statistical analysis is a vital component in every aspect of research.
Social surveys, laboratory experiment, clinical trials, marketing
“dependent variable”, “independent variable”, or other.
Parameter: A parameter is an important element to consider in
evaluation or comprehension of an event, project, or situation.
1.10 DESCRIPTIVE QUESTIONS
1. Define Statistics. Also discuss the development of statistics.
2. Who gave the following definitions of statistics?
9. What are the limitations of Statistics? How can statistical
techniques be misused?
10. What are common statistical issues? How can statistics mislead
us?
13. True
14. False
Functions of Statistics
15. Hypotheses
16. Quantitative
17. Increases
Limitations of Statistics
18. Bowley
19. Distrust
20. Bias
2. Refer Section 1.3
Answers are
(i) Webster
(ii) Boddington
(iii) Harlow
3. Refer Section 1.4
Statistical techniques enable us to identify what information or
data is worth collecting, decide when and how judgments may be
made on the basis of partial information.
4. Refer Section 1.5
Statistical methods are broadly divided into five categories.
These are Descriptive Statistics, Analytical Statistics, Inductive
Statistics, Inferential Statistics, and Applied Statistics.
5. Refer Section 1.6.2
Statistics and statistical tools play a vital role during all
three phases of decision-making. There are two basic approaches to
decision-making, namely, quantitative (or mathematical) and
qualitative (or rational, creative and judgmental). In the first
approach, statistics and mathematics play a dominant role. Even
in the second approach, statistics plays a role in the collection and
presentation of data to help the decision-maker’s intuition.
convincing. They can be used to intimidate opposing views.
Hence, statistics is open to manipulation.
10. Refer Sections 1.8.1 and 1.8.2
There are different types of statistical issues faced by a
researcher. The distrust of statistics among managers is the result
of bad experience, a lack of understanding (and hence of faith in the method),
complex and voluminous data that overwhelms thinking, or
simply a preference for subjective judgments based on
gut feeling.
CONTENTS
2.1 Introduction
2.2 Descriptive and Inferential Statistics
2.2.1 Descriptive Statistics
2.2.2 Inferential Statistics
2.3 Collection of Data
2.3.1 Types of Data – Primary and Secondary
2.3.2 Methods of Collecting Primary Data
2.3.3 Merits and Demerits of Collecting Primary Data
INTRODUCTORY CASELET
[Figure: PROFITABILITY charts for Companies A, B, C, D, E and F]
1. Which company recorded the highest operating profit in F.Y.
2002-03?
(i) A (ii) C (iii) E (iv) F
2. The average operating profit in F.Y. 2002-03, of companies with
2.1 INTRODUCTION
To make a decision in any business situation you need data. Facts
expressed in quantitative form can be termed as data. Success of any
statistical investigation depends on the availability of accurate and
reliable data. These depend on the appropriateness of the method
chosen for data collection. Therefore, data collection is a very basic
activity in decision-making. Data may be classified either as primary
data or secondary data.
Data collected from the field needs to be processed and analysed.
The processing is primarily editing, coding, classification and the
Other measurements such as skewness and kurtosis
The exploration of relationships and correlation between paired
data
The presentation of statistical results in graphical form
described below.
2.3.2 METHODS OF COLLECTING PRIMARY DATA
Generally, for managerial decision-making, it is necessary to
analyze information regarding a large number of characteristics.
Collection of primary data may thus be time consuming, expensive,
and hence requires a great deal of deliberation. According to the
nature of information required, one of the following methods or their
combination could be selected.
Observation Method
In this method the investigator collects the data through his/her personal
observations. This method is very useful if data is created in the system
through capturing transactions. Computerized transaction processing
could be modified to generate necessary data or information. An
investigator well versed with the system or a part of the system is ideally
suited for collecting this kind of data. Since the investigator is solely
involved in collecting the data, his/her training, skill, and knowledge
play an important role as far as the quality of the data is concerned.
Sometimes, audio/video aids could also be used to record the observations.
Indirect Investigation
In this case, data is collected from a person, who is likely to have
information about the problem under study. The information collected
by oral or written interrogation forms a primary data. Usually enquiry
commissions, board of investigations, investigation teams and
Mailed Questionnaire
In this case a structured questionnaire is mailed to selected persons with
a request to fill it in and return it. Supplementary information clarifying
terms, explaining the process, etc., is also attached with the questions. In
a few cases, inducements for filling and returning the questionnaire
are also given. A covering letter with the questionnaire is necessary for
developing rapport, explaining the reason for collecting the data,
and alleviating fears of the respondent, if any. It is assumed that the
respondents are literate and can answer the questions without any
ambiguity. This is a less expensive and faster method to collect a large
volume of data, over a wide geographic area, in standard form, and at
the convenience of the respondent. This method is, therefore, most
popular and extensively used. However, we must guard against two
disadvantages of this method, viz. the absence of an interviewer, resulting in
a large proportion of non-response, and the possibility of lower
reliability of the responses if the respondent is not motivated enough.
These shortcomings could be overcome by increasing the sample size and
by comprehensive design of the questionnaire.
Telephonic Interview
This method is less expensive but limited in scope as the respondent
must possess a telephone and have it listed. Further, the respondent
must be available and in the frame of mind to provide correct
answers. This method is comparatively less reliable for public surveys.
However, for industrial surveys, in developed regions, and with known
customers, this method could be the best suited. Obviously, in this
method there is a limit to the number of questions that the interviewee
could answer in three to four minutes. If there are just three to five
yes/no type questions and two to three short questions, this method
is very efficient.
Internet Surveys
Of late, Internet surveys have become popular. These are less
The investigator can modify or put indirect questions in order to
extract satisfactory information.
The collected data are often homogeneous and comparable.
Some additional information may also get collected, along with
the regular information, which may prove to be helpful in future
investigations.
Misinterpretations or misgivings, if any, on the part of the
respondents can be avoided by the investigators.
Since the information is collected from the persons who are well
aware of the situation, it is likely to be unbiased and reliable.
This method is particularly suitable for the collection of
confidential information. For example, a person may not like to
reveal his habit of drinking, smoking, gambling, etc., which may
be revealed by others.
Demerits
This method is expensive and time consuming, particularly when
the field of investigation is large.
It is not possible to properly train a large team of investigators.
The bias or prejudice of investigators can affect the accuracy of
data to a large extent.
The respondents may provide wrong information if the questions
are not properly understood.
It is not possible to collect information if the respondents are not
educated.
Since it is not possible to ask supplementary questions, the
method is not flexible.
The results of an investigation are likely to be misleading if the
attitude of the respondents is biased.
The process is time consuming, particularly when the information
is to be obtained by post.
Sources of secondary data could be:
Various publications of central, state and local governments. This
is an important and reliable source to get unbiased data.
Various publications of foreign governments or of international
bodies. Although it is a good source, the context in which the data was
collected needs to be verified before using it. For international
situations this data could be very useful and authentic.
Number of questions should be kept to the minimum: The
fewer the questions, the greater the chances of getting a better
response and of having all the questions answered. Otherwise
the respondent may feel disinterested and provide inaccurate
answers, particularly towards the end of the questionnaire. As
a rough indication, the number of questions should be between
10 and 20. If the number of questions has to exceed 25, it is
desirable that the questionnaire be divided into various parts to
ensure clarity.
Questions should be simple, short and unambiguous: The
questions should be simple, short, and easy to understand and
such that their answers are unambiguous. For example, if the
question is, “Are you literate?” the respondent may have doubts
about the meaning of literacy. To some, literacy may mean a
university degree whereas to others even the capacity to read
and write may mean literacy. Hence, it is desirable to specify
“Have passed (a) high school (b) graduation (c) post graduation”.
Type of questions: Questions can be of Yes/No type, or of multiple
choice, depending on the requirement of the investigator. Open-
ended questions should generally be avoided.
Questions of sensitive or personal nature should be avoided: The
questions should not require the respondent to disclose any private,
personal or confidential information. For example, questions
question relates to income limits like 1000-2000, 2000-3000, etc., a
person getting exactly ₹ 2,000 should know in which income class
he has to place himself.
Pre-test the questionnaire: Once the questionnaire has been
designed, it is important to pre-test it. The pre-testing is also known
as pilot survey because it precedes the main survey work. Pre-
testing allows rectification of problems, inconsistencies, repetition
etc. Proper testing, revisiting, and re-testing, yields high dividends.
terms of ................. .
5. ................. ................. is less expensive but limited in scope as
the respondent must possess a telephone and have it listed.
6. Once the questionnaire has been designed, it is important to
................. it.
State whether the following statements are true/false:
7. Data derived from other existing sources is referred to as
‘secondary data’.
respects, i.e. the respondent should have answered each and
every question. If some important questions have been left
unanswered, attempts should be made to contact the respondent
and get the response. If despite all efforts, answers to vital
questions are not given, such questionnaires should be dropped
from final analysis.
Consistency: Questionnaire should be checked to see that there
are no contradictory answers. Contradictory responses may arise
interviewing the respondent. This sort of editing should be
done as soon as possible after the interview, as memory recall
diminishes with time. Care should be taken that the interviewer
does not complete the information by simply guessing.
Central editing: When all forms are filled up completely and
returned to the headquarters, central editing is carried out.
The editor may correct the obvious errors. If necessary, the
respondent may be contacted for clarification. All the incorrect
replies, which are obvious, must be deleted.
recorded into a limited number of classes or categories.
and expenditure on consumer durables, etc.
To prepare data for tabulation.
example, comparison between education and income, income
To highlight the significant features; for example, data is
concentrated on one side, or one particular value may be dominant.
To enable grasp of data.
To study the relationship.
discrete and continuous. In the case of the discrete type, the values the
variable can take are countable (the set of values may also be infinite, for
example, the integers). Examples of these are number of accidents,
number of defectives, etc. In the case of continuous quantities, data
can take any real value; for example, weight, distance, volume,
etc.
distribution), and continuous frequency distribution (or condensed
or grouped frequency distribution).
Discrete Frequency Distribution
The process of preparing discrete frequency distribution is simple.
First, all possible values of variables are arranged in ascending order
in a column. Then, another column of ‘Tally’ mark is prepared to count
the number of times a particular value of the variable is repeated. To
4 3 2 3 4 5 5 7 3 2
3 4 2 1 1 6 3 4 5 4
2 7 3 4 5 6 2 1 5 3
Solution: The discrete frequency distribution with the help of tally
mark is shown below:
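A minimal Python sketch of the same tallying process, using the 30 observations listed above. The `tally` helper and the column layout are illustrative assumptions, not the book's own table.

```python
from collections import Counter

# The 30 observations from the example, row by row.
observations = [
    4, 3, 2, 3, 4, 5, 5, 7, 3, 2,
    3, 4, 2, 1, 1, 6, 3, 4, 5, 4,
    2, 7, 3, 4, 5, 6, 2, 1, 5, 3,
]

# Count how many times each value of the variable is repeated.
frequency = Counter(observations)


def tally(count):
    """Render a count as tally marks, bundled in groups of five."""
    groups, rest = divmod(count, 5)
    return " ".join(["||||/"] * groups + (["|" * rest] if rest else []))


# Values in ascending order, with tally marks and frequency.
print(f"{'Value':>5}  {'Tally':<12}  {'Frequency':>9}")
for value in sorted(frequency):
    print(f"{value:>5}  {tally(frequency[value]):<12}  {frequency[value]:>9}")
print(f"Total observations: {sum(frequency.values())}")
```

The frequencies add up to 30, the number of observations, which is a quick check that no value was missed or counted twice.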
Continuous Frequency Distribution
For continuous data a ‘grouped frequency distribution’ is necessary.
For discrete data, discrete frequency distribution is better than array,
but this does not condense the data. ‘Grouped frequency distribution’
is useful for condensing discrete data by putting them into smaller
groups or classes called class-intervals. Some important terms used
in case of continuous frequency distribution are as follows:
Class limits: Class limits denote the lowest and highest value
that can be included in the class. The two boundaries of class
are known as the lower limit and upper limit of the class. For
example, 10-19.5, 20-29.5, where 10 and 19.5 are limits of the first
class; 20 and 29.5 are limits of second class, etc.
Class intervals: The class interval represents the width (span or
size) of a class. The width may be determined by subtracting the
lower limit of one class from the lower limit of the following class.
For example, classes 10-20, 20-30, etc. have class interval 20 – 10
= 10.
Class mark: The class mark is the mid-point of a class, i.e. the
average of its lower and upper limits.
For example, classes 10-20, 20-30, etc. have class marks 15, 25, etc.
Types of class intervals: There are different ways in which limits
of class intervals can be shown.
Exclusive method: The class intervals are so arranged that upper
limit of one class is the lower limit of next class. This method
always presumes that the upper limit is excluded from the class,
for example, with class limits 20-25, 25-30 observation with value
25 is included in class 25-30.
Inclusive method: In this method, the upper limit of the class is
included in that class itself. In such case there is no overlap of
upper limit of former class and lower limit of successive class.
For example, with class limits 20-29.5, 30-39.5, 40-49.5, etc. there
is no ambiguity but values from 29.5 to 30 or 39.5 to 40 etc. are
not allowed.
Open end: In an open-end distribution, the lower limit of the
very first class and/or upper limit of the last class is not given.
For example, while stating the distribution of monthly salary of
managers in rupees, one may specify class limits as, below 15000,
15000-25000, 25000-35000, 35000-45000, above 45000. Similarly,
while recording weights of college students in kg as grouped data
the class intervals could be less than 50, 50 to 60, 60 to 70, 70 to 80,
80 to 90 and greater than 90.
Unequal class interval: This is another method to limit the
class intervals where the width of the classes is not equal for
all classes. This method is of practical use when there are large
gaps in the data, or distribution of the data is uneven. It is used
for explaining, visualizing and plotting data with unequal class
interval. However, we must adjust formulae for calculations
accordingly.
of.
Class interval should be determined based on maximum values
and number of classes to be formed.
All the above points can be explained with the help of the following
example.
Example: Ages of 50 employees are given:
22 21 37 33 28 42 56 33 32 59
40 47 29 65 45 48 55 43 42 40
37 39 56 54 38 49 60 37 28 27
32 33 47 36 35 42 43 55 53 48
29 30 32 37 43 54 55 47 38 62
Prepare a frequency distribution table.
Solution: A frequency distribution table is prepared as follows:
First, find the highest and lowest values. These are 65 and 21
respectively. Thus, the difference is 44.
Since the total observations are 50 we decide to select 5 classes.
The approximate class interval works out to be (65-21)/5 = 8.8.
Hence, we select class interval as 10.
As our lowest value is 21, we start from the lower class limit of the
first class as 20. We use exclusive method of class interval.
We then decide class intervals as 20-30, 30-40, 40-50, 50-60 and
60-70.
Then, each observation is checked for the class interval in which
it lies. For each observation, we make a tally mark against the
corresponding class interval. As per the convention, every fifth
tally is put horizontally across. This helps quick counting.
The frequency distribution is given below:
Age (Years)
Class Interval   Class Mark   Tally               Frequency
20-30            25           |||| ||             7
30-40            35           |||| |||| |||| |    16
40-50            45           |||| |||| ||||      15
50-60            55           |||| ||||           9
60-70            65           |||                 3
Total                                             50
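The steps of this solution can be reproduced with a short Python sketch. The variable names are illustrative assumptions; the class width of 10, the starting limit of 20 and the exclusive method are taken from the worked solution.

```python
# Ages of the 50 employees from the example.
ages = [
    22, 21, 37, 33, 28, 42, 56, 33, 32, 59,
    40, 47, 29, 65, 45, 48, 55, 43, 42, 40,
    37, 39, 56, 54, 38, 49, 60, 37, 28, 27,
    32, 33, 47, 36, 35, 42, 43, 55, 53, 48,
    29, 30, 32, 37, 43, 54, 55, 47, 38, 62,
]

number_of_classes = 5
approx_width = (max(ages) - min(ages)) / number_of_classes  # (65 - 21) / 5
width = 10   # approximate width rounded up to a convenient value
start = 20   # convenient lower limit just below the lowest age

frequencies = [0] * number_of_classes
for age in ages:
    # Exclusive method: an age equal to an upper limit falls in the next class.
    frequencies[(age - start) // width] += 1

for i, count in enumerate(frequencies):
    lower, upper = start + i * width, start + (i + 1) * width
    class_mark = (lower + upper) / 2
    print(f"{lower}-{upper}  class mark = {class_mark:g}  frequency = {count}")
print("Total =", sum(frequencies))
```

Integer division by the class width assigns each age to its class directly, which is the programmatic equivalent of making one tally mark per observation.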
Cumulative Frequencies
The cumulative frequency of a given class interval thus represents
the total of all the previous class frequencies, including that of the class
against which it is written.
Relative Frequencies
22 21 37 33 28 42 56 33 32 59
40 47 29 65 45 48 55 43 42 40
37 39 56 54 38 49 60 37 28 27
32 33 47 36 35 42 43 55 53 48
29 30 32 37 43 54 55 47 38 62
Find cumulative frequency, relative frequency and percentage frequency.
Solution:
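As a hedged sketch, the three quantities asked for can be computed from the class frequencies 7, 16, 15, 9 and 3 obtained for these ages with the classes 20-30 to 60-70. The variable names below are assumptions made for illustration.

```python
from itertools import accumulate

classes = ["20-30", "30-40", "40-50", "50-60", "60-70"]
frequencies = [7, 16, 15, 9, 3]      # class frequencies for the 50 ages
total = sum(frequencies)

# Running total of the frequencies up to and including each class.
cumulative = list(accumulate(frequencies))
# Share of each class in the total, as a fraction and as a percentage.
relative = [f / total for f in frequencies]
percentage = [100 * f / total for f in frequencies]

print(f"{'Class':<8}{'Freq':>6}{'Cum.':>6}{'Rel.':>8}{'%':>6}")
for row in zip(classes, frequencies, cumulative, relative, percentage):
    print(f"{row[0]:<8}{row[1]:>6}{row[2]:>6}{row[3]:>8.2f}{row[4]:>6.0f}")
```

The last cumulative frequency equals the total number of observations, and the relative frequencies add up to 1.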
14. ................. refers to the grouping of data into homogeneous
classes and categories.
15. There are two kinds of frequency distributions, namely,
................. frequency distribution and ................. frequency
distribution.
16. Class ................. denote the lowest and highest value that can
be included in the class.
Tabulation not only condenses the data, but also makes it easy to
understand. Tabulation is the fastest way to extract information from
the mass of data and hence popular even among those not exposed to
the statistical method. The report card of a school is the most common
example.
Objectives of Tabulation
The main objectives of tabulation are:
caption or a stub should be self explanatory. A provision of totals
of each row or column should always be made in every table by
providing an additional column or row respectively.
Main Body of the Table: This is the most important part of the
table as it contains numerical information. The size and shape of
the main body should be planned in view of the nature of figures
and the objective of investigation. The arrangement of numerical
data in main body is done from top to bottom in columns and
from left to right in rows.
Ruling and Spacing: Proper ruling and spacing is very important
in the construction of a table. Vertical lines are drawn to separate
various columns with the exception of sides of a table. Horizontal
lines are normally not drawn in the body of a table; however, the
totals are always separated from the main body by horizontal
lines. Further, the horizontal lines are drawn at the top and the
bottom of a table.
Spacing of various horizontal and vertical lines should be done
depending on the available space. Major and minor items should
be given space according to their relative importance.
Head-note: A head-note is often given below the title of a table
to indicate the units of measurement of the data. This is often
enclosed in brackets.
Foot note: Abbreviations, if any, used in the table or some other
explanatory notes are given just below the last horizontal line in
the form of footnotes.
Source-note: This note is often required when secondary data
are being tabulated. This note indicates the source from where
the information has been obtained. Source note is also given as
a footnote.
Example: The main parts of a table can also be understood by looking
at its broad structure given below:
Structure of a Table
Table No: .............
Title: .....................
Stub Heading   | Captions (column headings)                | Total
               | Caption | Caption | Caption | Caption     |
Stub Entries   | MAIN BODY                                 |
Total          |                                           |
Foot Note:
Source:
As far as possible the interpretative figures like totals, ratios and
percentages must also be provided in a table.
The entries in a table should be accurate.
Table should be attractive to draw the attention of readers.
Classification on the basis of purpose of investigation
These tables are of two types viz. General purpose table and Special
purpose table.
General purpose table: A general purpose table is also called
a reference table. This table facilitates easy reference to
the collected data. In the words of Croxton and Cowden, “The
primary and usually the sole purpose of a reference table is to
present the data in such a manner that the individual items may
be readily found by a reader.” A general purpose table is formed
without any specific objective, but can be used for a number of
specific purposes. Such a table usually contains a large mass of
data and is generally given in the appendix of a report.
An example of general purpose table is as follows:
GENERAL PURPOSE REPORTS OF GINT
Position   Description   Values
1          Type          LOG (or L) - Log
                         FNC (or F) - Fence
                         GRF (or G) - Graph
                         GTB (or T) - Graphical Table
                         GTD (or X) - Graphical Text Document
Special purpose table: A special purpose table is also called a
text table or a summary table or an analytical table. Such a table
presents data relating to a specific problem. According to H.
Secrist, “These tables are those in which are recorded, not the
detailed data which have been analyzed, but rather the results of
analysis.” Such tables are usually of smaller size than the size of
reference tables and are generally found to highlight relationship
between various characteristics or to facilitate their comparisons.
Classification on the basis of the nature of presented figures
Tables, when classified on the basis of the nature of presented figures
can be Primary table and Derivative table.
Primary Table: Primary table is also known as original table and
it contains data in the form in which they were originally collected.
Derivative Table: A derivative table presents figures like totals,
averages, percentages, ratios, coefficients, etc., derived from
original data. A table of time series data is an original table but
a table of trend values computed from the time series data is
known as a derivative table.
Classification on the basis of construction
Tables when classified on the basis of construction can be Simple
table, Complex table and Cross-classified table.
Simple Table: In this table the data are presented according to
one characteristic only. This is the simplest form of a table and is
also known as table of first order.
Example: The following blank table, for showing the number of
workers in each shift of a company, is an example of a simple
table.
Example: The example of such a table is given below.
Skilled Unskilled Total Skilled Unskilled Total Workers
The table can be extended for the years 1993, 94, 95, 96, etc.
                      Town A                     Town B
Habit                 Males  Females  Total      Males  Females  Total
Coffee Drinkers       40     5        45         25     15       40
Non-coffee Drinkers   20     35       55         30     30       60
Total                 60     40       100        55     45       100
For example, examination result of MBA could be tabulated as,
Foot Note
Each class includes its lower limit.
Fail indicates failure in any one or more subjects irrespective of
the percentage marks.
Example: Represent the following information in a table:
The number of students in a college in the year 1961 was 1100; of
those 980 were boys and the rest were girls. In 1971 the number of boys
increased by 100% and that of girls increased by 300% as compared to
their strength in 1961. In 1981 the total number of students in the college
was 3600, the number of boys being double the number of girls.
Solution:
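The figures needed for the table can be derived step by step. The following Python sketch shows one way to do the arithmetic; the variable names are illustrative assumptions, and an increase of 300% is read as the original strength plus three times that strength.

```python
# 1961: total and boys are given, girls are the remainder.
boys_1961 = 980
girls_1961 = 1100 - boys_1961

# 1971: boys up by 100% (twice), girls up by 300% (four times) over 1961.
boys_1971 = boys_1961 * 2
girls_1971 = girls_1961 * 4

# 1981: total is 3600 and boys are double the girls, so girls = total / 3.
total_1981 = 3600
girls_1981 = total_1981 // 3
boys_1981 = 2 * girls_1981

table = {
    1961: (boys_1961, girls_1961),
    1971: (boys_1971, girls_1971),
    1981: (boys_1981, girls_1981),
}

print(f"{'Year':<6}{'Boys':>6}{'Girls':>7}{'Total':>7}")
for year, (boys, girls) in table.items():
    print(f"{year:<6}{boys:>6}{girls:>7}{boys + girls:>7}")
```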
noted that, a two way table can be converted to one way table with mn
distinct values of a combination variable. This is called a normalized
table or a flat table in data base management.
Example: In a survey conducted in a city about preference of Coke or
Pepsi or Mazza, the sample consisted of 400 people that included 150
women and 250 men. It was observed that 50 women preferred Coke
and 40 preferred Pepsi. In case of men the preference was 100, 80 and
70 respectively. Present the information in two way table and answer
the following:
1. What is the percentage of men in Coke preferring population?
2. What is the proportion of population preferring Pepsi?
3. What is the proportion of women preferring Mazza in total
population?
Solution:
1. Percentage of men in Coke preferring population = (100/150) × 100 = 66.67%
2. Proportion of population preferring Pepsi = 120/400 = 0.3
3. Proportion of women preferring Mazza in total population = 60/400 = 0.15
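A two-way table of this kind maps naturally onto a nested dictionary. The sketch below is illustrative only (the names are assumptions); it fills in the one figure not stated in the example, the number of women preferring Mazza, and then recomputes the three answers.

```python
# Rows: gender; columns: drink. Counts are taken from the example.
preferences = {
    "Women": {"Coke": 50, "Pepsi": 40},
    "Men": {"Coke": 100, "Pepsi": 80, "Mazza": 70},
}
# 150 women were surveyed, so the rest of them preferred Mazza.
preferences["Women"]["Mazza"] = 150 - 50 - 40

drinks = ["Coke", "Pepsi", "Mazza"]
column_totals = {d: sum(row[d] for row in preferences.values()) for d in drinks}
grand_total = sum(column_totals.values())

men_share_of_coke = 100 * preferences["Men"]["Coke"] / column_totals["Coke"]
pepsi_proportion = column_totals["Pepsi"] / grand_total
women_mazza_proportion = preferences["Women"]["Mazza"] / grand_total

print(f"Men among Coke drinkers: {men_share_of_coke:.2f}%")
print(f"Proportion preferring Pepsi: {pepsi_proportion:.2f}")
print(f"Proportion of women preferring Mazza: {women_mazza_proportion:.2f}")
```

Keeping the row and column totals alongside the cells makes it easy to see which total is the right denominator for each question.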
benchmark. It also wants to group the shares as large cap, mid-cap
and small cap. The data obtained is as follows: In 40 large cap shares
studied 27 performed average and 11 above average in year 2004.
Similar figures for year 2005 and 2006 were 34 and 8 out of 50, and
32 and 16 out of 50 respectively. In mid-cap segment the number of
shares below average, average and above average was 22, 35 and 23 in
year 2004. These were 17, 40, 23 for year 2005 and 13, 38 and 29 for year
2006 respectively. In case of small cap shares the performance figures
for year 2004, 2005 and 2006 in categories below average, average and
above average were 26, 32, 42; 25, 36, 39; and 12, 40, 48 respectively.
Present the data as multi-way table.
Solution:
Segment     Year             2004   2005   2006
Large Cap   Below Average    12     8      2
            Average          27     34     32
            Above Average    11     8      16
            Total            50     50     50
Mid Cap     Below Average    22     17     13
            Average          35     40     38
            Above Average    23     23     29
            Total            80     80     80
Small Cap   Below Average    26     25     12
            Average          32     36     40
            Above Average    42     39     48
            Total            100    100    100
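A multi-way table can also be held as a nested structure and flattened into one row per combination of segment, performance category and year, which is the flat-table idea mentioned earlier in this section. The sketch below uses the cell values of the table above; the names are illustrative assumptions, and the totals are recomputed from the cells rather than typed in.

```python
years = [2004, 2005, 2006]

# segment -> performance category -> counts for 2004, 2005, 2006
table = {
    "Large Cap": {"Below Average": [12, 8, 2],
                  "Average": [27, 34, 32],
                  "Above Average": [11, 8, 16]},
    "Mid Cap": {"Below Average": [22, 17, 13],
                "Average": [35, 40, 38],
                "Above Average": [23, 23, 29]},
    "Small Cap": {"Below Average": [26, 25, 12],
                  "Average": [32, 36, 40],
                  "Above Average": [42, 39, 48]},
}

# Flat table: one (segment, category, year, count) row per cell.
flat_rows = [
    (segment, category, year, count)
    for segment, categories in table.items()
    for category, counts in categories.items()
    for year, count in zip(years, counts)
]

# Yearly totals per segment, obtained by summing down each column.
totals = {
    segment: [sum(column) for column in zip(*categories.values())]
    for segment, categories in table.items()
}

for segment, yearly in totals.items():
    print(segment, dict(zip(years, yearly)))
print("Rows in flat table:", len(flat_rows))
```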
31. Two way table is also known as ................. table.
In any organization of your choice, identify a problem and collect
data internally through questionnaire from randomly selected
people of the organization. Present the collected data in tabular
form and suggest a solution to the problem.
It is observed quite often that even the person who has tabulated
the data finds it difficult to understand the table after a few days if
the table is ill presented. For future reference, footnotes, headers,
colour coding, etc., improve the efficiency of the table. With
present day computers it is possible to present data very effectively
with tables, associated charts, catchy icons, 3-D surfaces, etc. It is
also possible to view only some part of the table as necessary. The
computer also provides cross reference or stage by stage display of
the tables through the links (or hyperlinks).
have assumed importance for decision-making to the managers. To
communicate the information effectively to the higher management,
you must present the data in pictorial format whenever feasible, and
support it with the numerical data as a reference. Remember, higher
management may not have adequate time to analyze the numerical
data. Similarly, always present the information to junior employees
as diagrams, graphs and charts, because they may not have adequate
knowledge and grasp of numerical analysis.
diagrams are also known as ‘surface’ or ‘area diagrams’. Popular
forms of two-dimensional diagrams are:
Rectangular Diagrams
Square Diagrams
Circular or Pie Diagrams.
Three-dimensional Diagrams: With the help of three dimensional
diagrams, the values of various items are represented by the
volume of cube, sphere, cylinder, etc. These diagrams are normally
used when the variations in the magnitudes of observations are
very large.
Pictograms and Cartograms: These are like frequency plots.
The data points are plotted on the graph in the same manner.
Then instead of joining the data points, pictures or objects of the
height of the data points are used to depict the data. In that case,
heights of the pictures or objects represent the frequency. These
include Histograms and frequency polygon.
2.7.3 BAR DIAGRAM
Bar diagrams and Column diagrams are very common in representing
business data. These are used to depict the frequencies of different
categories of variables. In case of bar diagrams the bars are horizontal
with their lengths proportional to the frequencies. On the other hand,
[Figure: Bar diagram. Horizontal bars for Sales ('000 Rs.) and Gross Profit ('000 Rs.) by Year. Horizontal axis: Rs. in Thousands (0 to 200); vertical axis: Year.]
Column Diagram: We take year on the X axis and rupees in
thousands on the Y axis. Then we draw vertical columns with lengths
proportional to the values of variables ‘Sales’, ‘Gross Profits’ and ‘Net
Profits’. The column diagram for the above data is as follows:
[Figure: Column Diagram: Company Results. X axis: Year (1996 to 1999); Y axis: Rs. in Thousands (0 to 160). Series: Sales ('000 Rs.), Gross Profit ('000 Rs.), Net Profit ('000 Rs.).]
Scatter Diagram
Scatter diagram is the most fundamental graph plotted to show
relationship between two variables. It is a simple way to represent
bivariate distribution. Bivariate distribution is the distribution of two
random variables. The two variables are plotted one along each of the X
and Y axes. Thus, every data pair (xi, yi) is represented by a point on
the graph, x being the abscissa and y being the ordinate of the point. From
a scatter diagram we can find if there is any relationship between
x and y, and if yes, what type of relationship. A scatter diagram thus
indicates the nature and strength of the correlation.
Example: Draw a scatter diagram for the following data of eight years
between income (X) and expenditure (Y).
Income (X) (Rs.)        100  110  113  120  125  130  130  140
Expenditure (Y) (Rs.)   85   90   91   100  110  125  125  130
Solution:
[Figure: Scatter Diagram. X axis: Income (X) (Rs.), 80 to 160; Y axis: Expenditure (Y) (Rs.), 50 to 140.]
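The points of the scatter diagram are simply the (income, expenditure) pairs. As a rough numerical companion to the picture, the sketch below also computes Pearson's correlation coefficient for these pairs. The coefficient is not defined in this section, so treat its use here as an assumption made only to put a number on the strength of the relationship; the variable names are illustrative.

```python
from math import sqrt

income = [100, 110, 113, 120, 125, 130, 130, 140]
expenditure = [85, 90, 91, 100, 110, 125, 125, 130]

# Each pair is one point of the scatter diagram.
points = list(zip(income, expenditure))

n = len(points)
mean_x = sum(income) / n
mean_y = sum(expenditure) / n

# Sums of products and squares of deviations from the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
sxx = sum((x - mean_x) ** 2 for x in income)
syy = sum((y - mean_y) ** 2 for y in expenditure)

r = sxy / sqrt(sxx * syy)   # Pearson's correlation coefficient
print("Points:", points)
print(f"Correlation coefficient r = {r:.3f}")
```

A value of r close to +1 agrees with what the diagram suggests: expenditure rises almost steadily with income.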
Line Diagram
It is similar to the frequency polygon, where we plot one or more
variables against one variable. The variable against which the other
variables are plotted is taken along the X axis. It is commonly used
to depict the trends in any time series data. We can show one or more
variables like economic, market trends, financial results, etc. together
so that these can be compared.
Example: Draw line diagram to present following data.
2.7.4 HISTOGRAM
Besides the frequency polygon, histogram is one of the most popular
and widely used graphical representations. It uses vertical bars whose
height represents the frequency. In a histogram, the vertical bars touch
the neighbouring bars, sharing one edge. Hence, if the data is of inclusive
classes, it needs to be converted to exclusive classes so that the class
boundaries meet. Sometimes, we also use histograms superimposed
with frequency polygons. This helps interpolation of data, while
retaining the attractive representation of the histogram.
Example: In a city, the income tax department had the data as follows
for the number of tax payers along with the range of income tax they
paid for a particular year. Represent the data graphically with the
help of a histogram.
Tax paid in Rs. '000    20-24  25-29  30-34  35-39  40-44  45-49  Total
Number of Tax Payers    45     130    200    65     45     15     500
Solution: For plotting the data we will first convert the data to exclusive
classes. This is done by increasing the upper limits and decreasing the
lower limits by an amount equal to half of the difference between the upper
limit of any class and the lower limit of the subsequent class. This makes
the class boundaries meet. Then class boundaries of tax paid classes
are plotted on the X axis and number of tax payers on the Y axis. Then
vertical bars are drawn of widths equal to classes and heights equal to
the frequencies of corresponding classes. This is depicted as follows.
Tax paid in Rs. '000    19.5-24.5  24.5-29.5  29.5-34.5  34.5-39.5  39.5-44.5  44.5-49.5
Number of Tax Payers    45         130        200        65         45         15
The histogram is shown below:
[Figure: Histogram. X axis: Tax Paid in Rs. '000; Y axis: Number of Tax Payers. Bar heights: 45, 130, 200, 65, 45, 15.]
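The conversion from inclusive to exclusive classes used in this solution can be sketched as follows. The names are illustrative assumptions; the rule itself (move each limit outward by half the gap between consecutive classes) is the one stated in the solution.

```python
# Inclusive classes of tax paid (Rs. '000) and their frequencies.
classes = [(20, 24), (25, 29), (30, 34), (35, 39), (40, 44), (45, 49)]
frequencies = [45, 130, 200, 65, 45, 15]

# Gap between the upper limit of one class and the lower limit of the next.
gap = classes[1][0] - classes[0][1]
half_gap = gap / 2

# Lower limits move down and upper limits move up by half the gap.
boundaries = [(lower - half_gap, upper + half_gap) for lower, upper in classes]

for (lower, upper), count in zip(boundaries, frequencies):
    print(f"{lower}-{upper}: {count}")
```

After the adjustment, the upper boundary of each class equals the lower boundary of the next one, so the bars of the histogram touch.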
2.7.5 PIE DIAGRAM
Pie diagram is a very popular visual representation in business reports,
when a manager wants to show the share of various categories in a total.
The total is represented as a circle. Each category is depicted as a sector
with its central angle proportional to its share. The percentage share
of each category in the total is converted to a sector angle using the formula:
Sector Angle in degrees = (Share Percentage / 100) × 360
Other variations of pie diagrams are doughnut diagrams and exploded
pie diagram. These are shown below.
Example: ABC Company has a total income of Rs. 180 crore. Out of
this it has paid Rs. 10 crore as interest on borrowed capital. It has spent
Rs. 80 crore for raw materials and other running expenditure. Its fixed
costs (overheads) are Rs. 30 crore. It has to pay tax at the rate of 30%
on the net profit. Further, the board of directors
decides to pay the dividend at the rate of 50% on the paid up capital of
Rs. 60 crore. The remaining amount is retained as profit ploughed back.
Depict the data as a pie diagram, doughnut diagram and exploded pie
diagram.
                    Amount in     Proportion to    Equivalent
                    Rs. Crore     Total Income     Angle
Total Income (a)    180           1                360
[Figure: Pie Chart: Distribution of Total Income Rs. 180 Crore. Sectors: Expenditure on Raw Material, Fixed Expenditure, Interest on Borrowed Capital, Tax, Dividend, Ploughed Back Capital.]
[Figure: Doughnut Diagram: Distribution of Total Income Rs. 180 Crore, with the same sectors.]
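The amounts and sector angles for this example can be worked out as in the following sketch. The names are illustrative assumptions, and net profit is taken here as total income less interest, raw material expenditure and fixed costs, which is an interpretation of the problem statement.

```python
total_income = 180          # Rs. crore
interest = 10
raw_material = 80
fixed_costs = 30

# Assumed definition: what remains after the three expense heads.
net_profit = total_income - interest - raw_material - fixed_costs
tax = 30 * net_profit / 100         # 30% of net profit
dividend = 50 * 60 / 100            # 50% of paid up capital of Rs. 60 crore
ploughed_back = net_profit - tax - dividend

amounts = {
    "Interest on Borrowed Capital": interest,
    "Expenditure on Raw Material": raw_material,
    "Fixed Expenditure": fixed_costs,
    "Tax": tax,
    "Dividend": dividend,
    "Ploughed Back Capital": ploughed_back,
}

# Sector angle = (share of total) x 360 degrees.
angles = {name: amount * 360 / total_income for name, amount in amounts.items()}

for name, amount in amounts.items():
    share = 100 * amount / total_income
    print(f"{name:<30}{amount:>6.0f}{share:>7.1f}%{angles[name]:>7.0f} deg")
```

The sector angles should add up to 360 degrees, which serves as a check that every rupee of income has been allocated to a category.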
2.7.6 FREQUENCY POLYGON
Frequency polygon is used for presenting the frequency distribution
in graphical form. This can be used for discrete distribution with
grouped as well as ungrouped data. This can also be used for
continuous data by converting it to approximate discrete data through
grouping. In all these cases, values of variables are represented on the
X axis and their frequency (number of occurrences) on the Y axis. In
case of probability distributions, we use the probability as frequency
by choosing a suitable scale on the Y axis. For plotting the frequency
polygon, we need to choose appropriate scale and origin so that the
main data features occupy the reasonable area on the paper. This helps
the readability. Although the scale chosen is usually linear,
depending on the data type we could use logarithmic or other types of
scale. Examples of these are audio noise plots, earthquake intensity
plots, etc. Once the scale and origin are chosen, we need to draw grid
lines (or use graph paper with grid lines) to facilitate accurate plotting.
Then we take each data point and mark it on the graph. In case of
grouped data we use class marks (mid points of the class intervals) as
variable values on the X axis. These data points are joined by straight
lines to get the frequency polygon, or by a smooth curve to get the
frequency distribution in graphical form.
Example: In a city, the income tax department had the data as follows
for the number of tax payers along with the range of income tax they
paid for a particular year. Represent the data graphically with the
help of a frequency polygon and frequency distribution chart.
Tax paid in Rs. '000    20-24  25-29  30-34  35-39  40-44  45-49  Total
Number of Taxpayers     45     130    200    65     45     15     500
Solution: For plotting the data we will use class marks of tax paid
classes on the X axis and number of tax payers on Y axis. Thus, the
points for plotting are as follows. Then we join these points by straight
lines.
Value on X axis 22 27 32 37 42 47
Value on Y axis 45 130 200 65 45 15
The plot is shown below:
To draw the plot as frequency distribution, we follow the same
procedure for plotting the data points. Then we join the data points
with a smooth curve as shown below. This gives better interpolation
results. It also helps in comparing it with standard distributions.
[Figure: Tax Payers' Data, frequency polygon. X axis: Tax Paid in Rs. '000 (22, 27, 32, 37, 42, 47); Y axis: Number of Tax Payers (0 to 250).]
[Figure: Tax Payers' Data, frequency distribution drawn as a smooth curve. X axis: Tax Paid in Rs. '000 (22, 27, 32, 37, 42, 47); Y axis: Number of Tax Payers (0 to 200).]
2.7.7 OGIVES
Ogives are used to present cumulative frequency of a distribution
in graphical format. There are two kinds of ogives. A ‘Less than’ ogive
plots the cumulative frequency of all values below the variable value
on the X axis. A ‘More than’ ogive plots the sum of the
frequencies of all values above the variable value. For this we
first calculate ‘Less than’ and ‘More than’ cumulative frequencies
for all the variable values (corresponding to classes). Then we
plot these as points on the graph, with the upper class boundaries (for
‘Less than’) or the lower class boundaries (for ‘More than’) along the
X axis and the cumulative frequencies along the Y axis. These points are
then joined by a smooth curve, as for a frequency
distribution. The value of the variable (on the X axis) directly below
the point where the two ogives intersect is the ‘Median’, i.e. the mid-value
of the data (more about Median is in next chapter). The following
example demonstrates drawing of ogives.
Example: Before constructing a dam on a river the central water
research institute performed a series of tests to measure the water
flow, past the proposed location of the dam during the period of 246
days, when there was a sufficient flow of water. The results of testing
were used to construct the following frequency distribution.
River Flow (thousand cubic metres per min)   1001-1050  1051-1100  1101-1150  1151-1200  1201-1250  1251-1300  1301-1350  1351-1400
Number of Days (frequency)                   7          21         32         49         58         41         27         11
Draw ogive curves for the above data.
From the ogive curve estimate the proportion of the days on
which flow occurs at less than 1300 thousands of cubic metres
per minute.
Solution: First we calculate and prepare ‘less than’ and ‘more than’
frequency table as follows.
River Flow      No of    Upper Class    ‘Less than’    Lower Class    ‘More than’
1000 cu. m      Days     Limit          Frequency      Limit          Frequency
per min
1001 - 1050     7        1050.5         7              1000.5         246
1051 - 1100     21       1100.5         28             1050.5         239
1101 - 1150     32       1150.5         60             1100.5         218
1151 - 1200     49       1200.5         109            1150.5         186
1201 - 1250     58       1250.5         167            1200.5         137
1251 - 1300     41       1300.5         208            1250.5         79
1301 - 1350     27       1350.5         235            1300.5         38
1351 - 1400     11       1400.5         246            1350.5         11
[Figure: ‘Less than’ ogive. X axis: River Flow (Thousand Cu. m per min); Y axis: Number of Days (0 to 300).]
[Figure: ‘More than’ ogive. X axis: River Flow (Thousand Cu. m per min); Y axis: Number of Days (0 to 300).]
From the ‘less than’ ogive we can read that the number of days on
which flow occurs at less than 1,300 thousand cubic metres per
minute is 208.
Thus, the proportion of days on which flow occurs at less than 1,300
thousand cubic metres per minute is 208/246 = 0.846 or 84.6%.
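The cumulative frequencies behind the two ogives, and the proportion read off above, can be checked with a short sketch. The names are illustrative assumptions. Instead of reading the value at 1,300 from a drawn curve, the code uses the cumulative frequency at the class boundary 1300.5 as an approximation.

```python
from itertools import accumulate

# Inclusive classes of river flow (thousand cubic metres per minute).
classes = [(1001, 1050), (1051, 1100), (1101, 1150), (1151, 1200),
           (1201, 1250), (1251, 1300), (1301, 1350), (1351, 1400)]
days = [7, 21, 32, 49, 58, 41, 27, 11]
total_days = sum(days)

upper_boundaries = [upper + 0.5 for _, upper in classes]
lower_boundaries = [lower - 0.5 for lower, _ in classes]

# 'Less than': days with flow below each upper boundary.
less_than = list(accumulate(days))
# 'More than': days with flow above each lower boundary.
more_than = [total_days - below + count for below, count in zip(less_than, days)]

days_below_1300 = less_than[upper_boundaries.index(1300.5)]
proportion = days_below_1300 / total_days

for ub, lt, lb, mt in zip(upper_boundaries, less_than, lower_boundaries, more_than):
    print(f"< {ub}: {lt:>3}    > {lb}: {mt:>3}")
print(f"Proportion below 1300: {proportion:.3f}")
```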
2.8 SUMMARY
There are two major divisions of the field of statistics, namely
descriptive and inferential statistics. Both the segments of
statistics are important, and accomplish different objectives.
Data can be obtained through primary source or secondary source
according to need, situation, convenience, time, resources and
availability. The most important method for primary data collection
is through questionnaire. Data must be objective and fact-based so
that it helps a decision-maker to arrive at a better decision.
A frequency distribution is the principal tabular summary of either
discrete data or continuous data. The frequency distribution
may show actual, relative or cumulative frequencies. Actual and
relative frequencies may be charted as either a histogram (a bar
chart) or a frequency polygon. Two commonly used graphs of
cumulative frequencies are the ‘less than’ ogive and the ‘more than’ ogive.
Once the raw data is collected, it needs to be summarized
and presented to the decision-maker in a form that is easy to
comprehend. Tabulation not only condenses the data, but also
makes it easy to understand. Tabulation is the fastest way to
extract information from the mass of data and hence popular
even among those not exposed to the statistical method.
Charts help in grasping the data and analyzing it qualitatively.
They also help managers to present the data effectively as a
part of reports. Various types of chart are bar diagram, multiple
bar diagram, component bar diagram, deviation bar diagram,
sliding bar diagram, histogram and pie chart.
A graphic presentation is another way of representing the
statistical data in a simple and intelligible form. There are two
types of graphs which we have discussed, line graphs and ogives.
Primary Data: Primary data are collected afresh and for the
Histogram: It uses vertical bars whose height represents
the frequency. In histogram, the vertical bars touch the
neighboring bars sharing one edge.
Line Graph: We plot one or more variables against one variable.
The variable against which the other variables are plotted is taken
along the X axis. It is commonly used to depict the trends in
any time series data.
Ogives: Ogives are used to present cumulative frequency of a
distribution in graphical format. There are two kinds of ogives,
the ‘Less than’ ogive and the ‘More than’ ogive.
2. Describe various methods of collecting primary data and
comment on their relative advantages and disadvantages.
3. Discuss methods or sources of collecting secondary data.
4. How do you design a questionnaire? What are the important
points to be kept in mind?
5. How is Editing of primary and secondary data done? Also,
describe coding of data.
6. Describe the classification of data. What are the rules and bases
of classification of data?
7. What is frequency distribution? Differentiate between discrete
and continuous frequency distribution with examples.
8. Discuss the concept of tabulation. What are objectives and main
parts of table?
9. Differentiate between one-way tabulation, two-way tabulation
and multi-way tabulation with examples.
10. Describe different types of diagrams and graphs with examples.
Differentiate between diagrams and graphs too.
S. No. of workers:   1   2   3   4   5   6   7   8   9   10  11  12
Income (in Rs.):     25  35  30  45  50  55  40  50  60  55  40  35
2. Represent the following data by a suitable diagram.
13. Coding
Classification of Data 14. Classification
15. Discrete, Continuous
16. Limits
17. Frequency
18. Inclusive
19. Relative
20. True
21. False
22. True
23. True
24. False
Tabulation of Data 25. Tabulation
26. Rows
27. Head-note
28. General, Special
29. Primary
30. Complex
31. Contingency
Diagrammatic and Graphical Presentation of Data 32. Two-dimensional
33. Histogram
34. Frequency polygon
35. Less than
Observation Method, Indirect Investigation, Questionnaire with
Personal Interview, Mailed Questionnaire, Telephonic Interview,
Internet Surveys
3. Refer Section 2.3.4
Sources of secondary data could be:
(a) Various publications of central, state and local governments.
This is an important and reliable source to get unbiased
data.
(b) Various publications of foreign governments or of
international bodies. Although it is a good source, context
under which it is collected needs to be verified before using
this data. For international situations this data could be very
useful and authentic.
4. Refer Section 2.3.5
The success of collecting data through a questionnaire depends
mainly on how skilfully and imaginatively the questionnaire has
been designed. A badly designed questionnaire will never be able
to gather the relevant data. In designing the questionnaire, some
of the important points to be kept in mind are Covering letter,
Number of questions should be kept to the minimum, Questions
should be simple, short and unambiguous, Type of questions.
5. Refer Section 2.4
Once the questionnaires have been filled and the data collected, it
is necessary to edit this data to ensure completeness, consistency,
accuracy and homogeneity. The editing of the data is a process
of examining the raw data to detect errors and omissions and
to correct them, if possible, so as to ensure completeness,
consistency, accuracy and homogeneity. Editing can be done at
two stages, Field editing and central editing.
Coding is the process of assigning some symbols either
alphabetical or numeral or both to the answers so that the
responses can be recorded into a limited number of classes or
categories. The classes should be appropriate to the research
problem being studied.
6. Refer Sections 2.5.1 and 2.5.2
Classification refers to the grouping of data into homogeneous
classes and categories. It is the process of arranging things in
groups or classes according to their resemblances and affinities.
The principal rules of classifying data are:
(a) To prepare data for tabulation.
(b) To enable grasp of data.
(c) To study the relationship.
Some common types of bases of classification are: Geographical
classification, Chronological classification, Qualitative classification,
Classification of data according to some characteristics.
7. Refer Section 2.5.3
Classification of data, showing the different values of a variable
and their respective frequency of occurrence is called a frequency
distribution of the values. There are two kinds of frequency
distributions, namely, discrete frequency distribution (or simple,
or ungrouped frequency distribution), and continuous frequency
distribution (or condensed or grouped frequency distribution).
8. Refer Section 2.6
Tabulation is arranging the data in flat table (two dimensional
arrays) format by grouping the observations. Table is a spreadsheet
with rows and columns with headings and stubs indicating class
of the data. Tabulation not only condenses the data, but also
two variables. It is called a nested table. In fact, in most of the
business situations the tabulation may have more than two
variables (usually 10 to 15). Up to about 3 to 4 variables could be
shown on two dimensional papers. These can also be represented
as flat tables by taking one composite variable of dimension n1 ×
n2 × n3 × n4 × n5 × …, where n1, n2, n3, n4, n5…are dimensions of
each variable (attribute).
10. Refer Section 2.7
Different types of bar diagrams are Line Diagram and column
diagram. Popular forms of two-dimensional diagrams are:
Rectangular Diagrams, Square Diagrams, Circular or Pie
Diagrams. With the help of three dimensional diagrams, the
values of various items are represented by the volume of cube,
sphere, cylinder, etc. These diagrams are normally used when
the variations in the magnitudes of observations are very large.
Pictograms and Cartograms are like frequency plots. The data
points are plotted on the graph in the same manner. These
include Histograms and frequency polygon.
ANSWERS FOR EXERCISE FOR PRACTICE
1.
2.
3.
4.
5.
Loomba, M.P., Management – A Quantitative Perspective,
MacMillan Publishing Company, New York, 1978.
Shenoy, G.V., Srivastava, U.K. and Sharma, S.C., Quantitative
Techniques for Managerial Decision Making, Wiley Eastern, New
Delhi, 1985
Venkata Rao, K., Management Science, McGraw-Hill Book
Company, Singapore, 1986.
Bhardwaj, R.S., Business Statistics, 2nd Edition, Excel Books,
New Delhi.
Kothari, C.R., Quantitative Techniques, Vikas Publication.
E-REFERENCES
http://elearning.sol.du.ac.in/
http://www.okstate.edu/
http://www.statcan.gc.ca/
CONTENTS
3.1 Introduction
3.2 Characteristics of Central Tendency
3.3 Arithmetic Mean
3.3.1 Properties of Arithmetic Mean
3.3.2 Calculation of Simple Arithmetic Mean
3.3.3 Merits and Demerits of Arithmetic Mean
3.3.4 Weighted Arithmetic Mean
3.4 Median
INTRODUCTORY CASELET
3.1 INTRODUCTION
The concept of central tendency plays a dominant role in the study of
statistics. In many frequency distributions, the tabulated values show
a distinct tendency to cluster around a typical central value. This
tendency of the data to concentrate around a central part of the
distribution is called the ‘Central Tendency’ of the data. If we find
such a central value, it can be used as a representative value for the
entire data set, which helps in taking many decisions concerning the
entire set. It may be noted, however, that averages may sometimes give
strange and illogical conclusions if not used with common sense.
3.2 CHARACTERISTICS OF CENTRAL TENDENCY
Measure of central tendency enables us to get an idea of entire data
from a single value at which we consider the entire data is concentrated.
This single value could be used to represent the entire population.
Measure of central tendency also enables us to compare two or more
sets of data, for example, average sales figures for two months.
A good measure of central tendency should possess as far as possible
the following characteristics:
Easy to understand.
Simple to compute.
Based on all observations.
Uniquely defined.
Possibility of further algebraic treatment.
Not unduly affected by extreme values.
Median: The middle value.
Mode: The most frequently occurring value.
Each one has its advantages and disadvantages. Here we discuss the
definitions, concepts and methods of manual calculation. Grouping
of discrete data is not necessary for computer calculations. We can
directly use the discrete data and get faster as well as more accurate
results than by grouping the data. When only grouped data are
available, we need to use the formulae for grouped data.
Averages provide us with the gist and give a bird’s eye view of a huge
mass of unwieldy numerical data. Averages are the typical values
around which other items of the distribution congregate. An average
lies between the two extreme observations of the distribution and
gives us an idea about the concentration of the values in the central
part of the distribution. Averages are therefore called the measures
of central tendency.
3.3.1 PROPERTIES OF ARITHMETIC MEAN
Properties of arithmetic mean are as follows:
The sum of the deviations of all the values of x from their
arithmetic mean is zero.
Justification: since $\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$ is a constant, $\sum f_i x_i = \bar{x} \sum f_i$, and therefore
$$\sum f_i (x_i - \bar{x}) = \sum f_i x_i - \bar{x} \sum f_i = 0$$
The product of the arithmetic mean and the number of items
gives the total of all items.
Justification:
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} \Rightarrow \sum f_i x_i = \bar{x} \sum f_i$$
$$\text{or}\quad \bar{x} = \frac{\sum x_i}{N} \Rightarrow \bar{x} \cdot N = \sum x_i$$
If $\bar{x}_1$ and $\bar{x}_2$ are the arithmetic means of two samples of sizes n1 and
n2 respectively, then the arithmetic mean $\bar{x}$ of the distribution
combining the two can be calculated as
$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$
This formula can be extended for still more groups or samples.
Justification:
$$\bar{x}_1 = \frac{\sum x_{1i}}{n_1} \Rightarrow \sum x_{1i} = n_1 \bar{x}_1 = \text{total of the observations of the first sample}$$
Similarly, $\sum x_{2i} = n_2 \bar{x}_2$ = total of the observations of the second
sample.
The combined mean of the two samples
$$\bar{x} = \frac{\text{combined total}}{n_1 + n_2} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$
Arithmetic Mean of Combined Data
Arithmetic Mean is used very often in business for calculating average
sales, average cost, average earnings, etc. If there are two related
data groups and their arithmetic means are known, we can calculate
arithmetic mean of the combined data without referring to individual
data points. If the first group of N1 items has arithmetic mean μ1, the
second group of N2 items has arithmetic mean μ2, and so on, we can
find the arithmetic mean of the combined data as,
$$\mu = \frac{N_1 \mu_1 + N_2 \mu_2 + \dots + N_n \mu_n}{N_1 + N_2 + \dots + N_n}$$
Example: The weekly average salary paid to all employees in a
certain company was ₹ 600. The mean salaries paid to male and female
employees were ₹ 620 and ₹ 520 respectively. Obtain the percentage of
male and female employees in the company.
Solution: Arithmetic mean of combined data is,
$$\mu = \frac{N_1 \mu_1 + N_2 \mu_2 + \dots + N_n \mu_n}{N_1 + N_2 + \dots + N_n}$$
In this problem N1 = number of male employees, N2 = number of
female employees, mean salary of male employees μ1 = 620, mean
salary of female employees μ2 = 520 and combined mean μ = 600.
Therefore,
$$\mu = \frac{N_1 \mu_1 + N_2 \mu_2}{N_1 + N_2} \Rightarrow 600 = \frac{620 N_1 + 520 N_2}{N_1 + N_2} \Rightarrow 20 N_1 = 80 N_2$$
∴ N1 : N2 = 4 : 1
Hence male employees form 80% and female employees 20% of the
workforce.
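The salary example can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function names `combined_mean` and `share_of_first_group` are illustrative assumptions. It solves the combined-mean formula for the proportion of the first group and then verifies the result.

```python
def combined_mean(sizes, means):
    """Mean of pooled groups, computed from group sizes and group means."""
    total = sum(n * m for n, m in zip(sizes, means))
    return total / sum(sizes)


def share_of_first_group(mean_1, mean_2, mean_combined):
    """Fraction p of group 1 such that p*mean_1 + (1 - p)*mean_2 = mean_combined."""
    return (mean_combined - mean_2) / (mean_1 - mean_2)


# Male mean 620, female mean 520, combined mean 600 (figures from the example).
p_male = share_of_first_group(620, 520, 600)
print(f"male share   = {p_male:.0%}")
print(f"female share = {1 - p_male:.0%}")

# Cross-check: 4 males for every 1 female should reproduce the combined mean.
print(combined_mean([4, 1], [620, 520]))
```

Any group sizes in the ratio 4 : 1 give the same combined mean, which is why only the percentages, and not the head counts, can be recovered from the three averages.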
$$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{N} = \frac{\sum_{i=1}^{n} x_i}{N}$$
There is a short-cut method for calculations based on a simple
concept: if a constant is subtracted from or added to all data points,
the Arithmetic Mean (AM) is reduced or increased by that amount. Thus,
$$\mu = A + \frac{\sum_{i=1}^{n} d_i}{N}$$
Where, A = Arbitrarily selected constant value (assumed mean). This
value is selected such that it simplifies the calculations when the
deviation of each observation is used instead of the data value. A is
selected close to the expected (guessed) value of the mean, so that the
calculations on the deviations are simple enough to be done mentally.
di = Deviation of each observation from the assumed mean (xi – A).
N = Number of observations.
Note that, when the assumed mean ‘A’ is exactly equal to the arithmetic
mean μ (or $\bar{X}$), the algebraic sum of all deviations is equal to zero.
Thus, the algebraic sum of the deviations of all observations about the
Arithmetic Mean is zero. Or, about the Arithmetic Mean,
$$\sum_{i=1}^{n} d_i = 0$$
Now we will solve one example just to demonstrate the method.
Example: Find the arithmetic mean of 3, 6, 24, and 48.
Solution: Let the assumed mean A = 20
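A minimal Python sketch of the short-cut method for this example, assuming A = 20 as chosen in the solution. The function name `mean_by_assumed_mean` is illustrative and not taken from the text.

```python
def mean_by_assumed_mean(values, assumed_mean):
    """Short-cut method: mean = A + (sum of deviations from A) / N."""
    deviations = [x - assumed_mean for x in values]
    return assumed_mean + sum(deviations) / len(values)


data = [3, 6, 24, 48]
A = 20

# Deviations from A are -17, -14, 4 and 28, which sum to 1.
print([x - A for x in data])
print(mean_by_assumed_mean(data, A))   # A + 1/4
print(sum(data) / len(data))           # direct method, for comparison
```

The result does not depend on the choice of A; a value close to the true mean only keeps the deviations small enough for mental arithmetic.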
Example: Find the arithmetic mean of 10, 12, 20, 15, 20, 12, 10, 15,
20 and 10
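Solution: Arithmetic mean
$$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{N} = \frac{10 + 12 + 20 + 15 + 20 + 12 + 10 + 15 + 20 + 10}{10} = \frac{144}{10} = 14.4$$
OR Frequency distribution of the data is,
Value (xi)      10   12   15   20   Total
Frequency (fi)   3    2    2    3    10
fi × xi         30   24   30   60   144
$$\text{Arithmetic Mean} = \frac{\sum f_i x_i}{\sum f_i} = \frac{144}{10} = 14.4$$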
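$$\mu = \frac{\sum_{i=1}^{n} f_i m_i}{N} = \frac{\sum_{i=1}^{n} f_i m_i}{\sum_{i=1}^{n} f_i}$$
$$\mu = A + \frac{\sum_{i=1}^{n} f_i d_i}{N} = A + \frac{\sum_{i=1}^{n} f_i d_i}{\sum_{i=1}^{n} f_i}$$
N = ∑fi = Total number of observations.
mi = Class marks.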
This method is also called the ‘Short Cut Method’. To make manual
calculations easier still, we can use the principle that if all the
observations are divided or multiplied by a constant, the ‘Arithmetic
Mean’ is divided or multiplied by that value. We select a convenient
number, usually the class width, and divide all deviations by that
number. We then use the following formula to calculate the ‘Arithmetic
Mean’. This method is called the ‘Step Division Method’ (more commonly
known as the step deviation method). The formula is:
$$\mu = A + \frac{\sum_{i=1}^{n} f_i d'_i}{\sum_{i=1}^{n} f_i} \times h$$
$$d'_i = \frac{m_i - A}{h} = \frac{d_i}{h}$$
mi = Class Marks.
h = Step size, usually the class interval.
N = ∑fi = Total number of observations.
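Example: From the following data, compute the Arithmetic Mean by the
direct method, the short-cut method and the step division method.
Marks            0-10  10-20  20-30  30-40  40-50  50-60
No. of Students    5     10     25     30     20     10
Solution: Let the Assumed Mean be A = 35 and Step size h = 10
CALCULATION TABLE
Marks   Class Mark  No. of Students  mi × fi  Deviation     fi × di  Step Deviation     fi × di′
        (mi)        (fi)                      di = mi – A            di′ = (mi – A)/h
0-10        5            5              25       –30         –150        –3               –15
10-20      15           10             150       –20         –200        –2               –20
20-30      25           25             625       –10         –250        –1               –25
30-40      35           30            1050         0            0         0                 0
40-50      45           20             900        10          200         1                20
50-60      55           10             550        20          200         2                20
Total                  100            3300                   –200                          –20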
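Direct Method:
$$\mu = \frac{\sum_{i=1}^{6} m_i f_i}{\sum_{i=1}^{6} f_i} = \frac{3300}{100} = 33$$
Shortcut Method:
$$\mu = A + \frac{\sum_{i=1}^{6} f_i d_i}{\sum_{i=1}^{6} f_i} = 35 + \frac{-200}{100} = 35 - 2 = 33$$
Step Division Method:
$$\mu = A + \frac{\sum_{i=1}^{6} f_i d'_i}{\sum_{i=1}^{6} f_i} \times h = 35 + \frac{-20}{100} \times 10 = 35 - 2 = 33$$
The answer is the same irrespective of the method used.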
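The three methods can be compared side by side in code. This is a hedged Python sketch: the frequencies, A = 35 and h = 10 come from the example, while the class marks 5, 15, ..., 55 are an assumption inferred from the totals 3300 and 100 quoted in the solution. The function names are illustrative.

```python
def mean_direct(marks, freqs):
    """Direct method: sum(f * m) / sum(f)."""
    return sum(f * m for m, f in zip(marks, freqs)) / sum(freqs)


def mean_shortcut(marks, freqs, A):
    """Short-cut method: A + sum(f * d) / sum(f), with d = m - A."""
    return A + sum(f * (m - A) for m, f in zip(marks, freqs)) / sum(freqs)


def mean_step(marks, freqs, A, h):
    """Step division (step deviation) method: A + h * sum(f * d') / sum(f), d' = (m - A) / h."""
    return A + h * sum(f * (m - A) / h for m, f in zip(marks, freqs)) / sum(freqs)


class_marks = [5, 15, 25, 35, 45, 55]   # assumed: inferred from the quoted totals
freqs = [5, 10, 25, 30, 20, 10]         # number of students, from the example
A, h = 35, 10

print(mean_direct(class_marks, freqs))
print(mean_shortcut(class_marks, freqs, A))
print(mean_step(class_marks, freqs, A, h))
```

All three calls return the same value because shifting the origin by A and rescaling by h are undone, in reverse order, when the final answer is assembled.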
the answer by the same constant h and then add the constant A to
get the actual value of the mean. We can use both, Change of Origin
and Change of Scale together, but we must correct the answers in
the reverse order of the algebraic operations performed on the data
points.
and hence, it can be regarded as representative of the given data.
It is capable of being treated mathematically and hence, is widely
used in statistical analysis.
Arithmetic mean can be computed even if the detailed
distribution is not known but sum of observations and numbers
of observations are known.
It is least affected by the fluctuations of sampling.
In the absence of a complete distribution of observations the
arithmetic mean may lead to fallacious conclusions. For example,
there may be two entirely different distributions with same value
of arithmetic mean.
Simple arithmetic mean gives greater importance to larger
values and lesser importance to smaller values.
Direct Method
$$\mu_w = \frac{W_1 x_1 + W_2 x_2 + \dots + W_n x_n}{W_1 + W_2 + \dots + W_n} = \frac{\sum W_i x_i}{\sum W_i}$$
Short-cut Method
$$\mu_w = A_w + \frac{\sum W_i d_i}{\sum W_i}$$
This potentially large range is the reason why a weighted average
is used, as it ensures that financial calculations will be as accurate
as possible in the event the amount of a company’s shares changes
over time. The weighted average number of shares is calculated
by taking the number of outstanding shares and multiplying the
portion of the reporting period those shares covered, doing this
for each portion and, finally, summing the total. The weighted
average number of outstanding shares in our example would be
150,000 shares.
of shares ($200,000/150,000), which is equal to $1.33 per share.
Comparison of results of the two companies when their sizes
are different.
Computation of standardized death and birth rates.
Example: The management of a hotel has employed 2 managers, 5
cooks and 8 waiters. The monthly salaries of a manager, a cook
and a waiter are ₹ 3000, ₹ 1200 and ₹ 1000 respectively. Find the mean
salary of the employees.
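For the hotel data, the weighted arithmetic mean uses the head count in each job as the weight. A minimal Python sketch follows; the function name `weighted_mean` is an illustrative assumption, and the figures are those quoted in the example.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(W * x) / sum(W)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


salaries = [3000, 1200, 1000]   # manager, cook, waiter (monthly salary)
staff = [2, 5, 8]               # number of employees in each job

print(round(weighted_mean(salaries, staff), 2))   # mean salary per employee
print(round(sum(salaries) / len(salaries), 2))    # unweighted mean of the three rates
```

The unweighted figure overstates the typical salary because it gives the two managers the same importance as the eight waiters.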
8. Arithmetic mean can be computed even if the detailed
distribution is not known but sum of observations and numbers
of observations are known.
9. Arithmetic mean can be computed for a qualitative data like
data on intelligence, honesty, smoking habit, etc.
10. Weighted mean is used for construction of index numbers,
for example, consumer Price Index, BSE sensex, etc., where
different weights are associated for different items or shares.
On your first four math tests you earned 85, 80, 95, and 65. What
must you earn on your next test to have a mean score of at least 80?
If the class intervals are of varying width, an effort should be made
to avoid calculating mean and mode. It is advisable to calculate
median.
3.4 MEDIAN
Group B: 80, 70, 50, 20, 30, 90, 10, 40, 60, 100
Which group showed better performance based on Median?
Solution: First we arrange the scores in ascending order.
Group A: 10, 20, 30, 40, 50, 60, 70, 80, 90
Number of observations is 9 (odd). Therefore,
$$M_d = \left(\frac{N+1}{2}\right)^{th} \text{observation} = \left(\frac{9+1}{2}\right)^{th} \text{observation} = 5^{th} \text{ observation} = 50$$
Group B: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100
Number of observations is 10 (even). Therefore,
$$M_d = \frac{\left(\frac{N}{2}\right)^{th} \text{observation} + \left(\frac{N}{2}+1\right)^{th} \text{observation}}{2} = \frac{50 + 60}{2} = 55$$
Thus, group B has better performance on median.
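The odd and even cases can be written as one small routine. This is a minimal Python sketch with an illustrative function name; the data are the ordered scores of the two groups shown in the solution.

```python
def median(values):
    """Middle value for odd N; mean of the two middle values for even N."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((N + 1) / 2)th observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # (N/2)th and (N/2 + 1)th observations


group_a = [10, 20, 30, 40, 50, 60, 70, 80, 90]
group_b = [80, 70, 50, 20, 30, 90, 10, 40, 60, 100]

print(median(group_a))
print(median(group_b))
```

Python's standard library offers `statistics.median`, which follows the same rule; the hand-written version is shown only to mirror the two formulas.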
Example: Distribution of heights of new recruits is given below:
Height in Inches 58 60 61 62 63 64 65 66 68 70
No. of Persons 4 6 5 10 20 22 24 6 2 1
Determine the median height.
Solution: There are a total of 100 observations, already arranged in
ascending order of height. The cumulative frequencies are:
Height in Inches      58  60  61  62  63  64  65  66  68  70
No. of Persons         4   6   5  10  20  22  24   6   2   1
Cumulative Frequency   4  10  15  25  45  67  91  97  99  100
Now N = 100, hence N/2 = 50. Thus, the median is the 50th observation.
From the cumulative frequencies, the 50th observation is 64 (as is the
51st). Hence the median is 64 inches.
Such a class is called the Median Class. The median is then calculated
by the formula:
$$M_d = L + \frac{\frac{N}{2} - pcf}{f} \times h$$
Where, L = lower limit of the median class.
N = Total frequency.
pcf = Cumulative frequency of the class preceding the median class.
f = frequency of the median class.
h = class interval of the median class.
Let us understand the logic of the formula. The median is the value of
the (N/2)th observation. This observation falls in the median class,
whose lower limit is L. The cumulative frequency of the class preceding
the median class is pcf. Thus, the median observation is the
(N/2 – pcf)th observation in the median class (counted from the lower
limit of the median class). Now, if we consider that all f observations
in the median class are evenly spaced from the lower limit L to the
upper limit L + h, the value of the median can be found by simple
proportion.
Example: Calculate the median for the following data.
30-35 33 75
35-40 30 105
40-45 20 125
45-50 15 140
50-55 13 153
55-60 7 160
Now, N = 160
Or, N/2 = 80
The 80th item lies in class 35-40.
Hence, pcf = 75, f = 30, h = 5 and L = 35
Therefore, the Median is,
$$M_d = L + \frac{\frac{N}{2} - pcf}{f} \times h = 35 + \frac{\frac{160}{2} - 75}{30} \times 5 = 35.83$$
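The interpolation step can be expressed directly in code. The following Python sketch is illustrative only: the function name `grouped_median` and its parameter names are assumptions, and the arguments are the values read off in the solution (L = 35, N = 160, pcf = 75, f = 30, h = 5).

```python
def grouped_median(lower_limit, total_freq, pcf, class_freq, width):
    """Median of grouped data: L + ((N/2 - pcf) / f) * h.

    Assumes the observations are spread evenly inside the median class.
    """
    return lower_limit + (total_freq / 2 - pcf) / class_freq * width


print(round(grouped_median(35, 160, 75, 30, 5), 2))
```

The median observation is the 5th of the 30 observations in the class 35-40, so it lies 5/30 of the class width above the lower limit.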
Median can be determined even when class intervals have open
ends or not of equal width.
It is not much affected by extreme observations. It is also
independent of range or dispersion of the data.
Median can also be located graphically.
It is centrally located measure of average since the sum of
absolute deviation is minimum when taken from median.
It is the only suitable average when data are qualitative and
it is possible to rank various items according to qualitative
characteristics.
Median conveys the idea of a typical observation.
Demerits of median are as follows:
In case of individual observations, the process of location of
median requires their arrangement in the order of magnitude
which may be a cumbersome task, particularly when the number
of observations is very large.
It, being a positional average, is not capable of being treated
algebraically.
In case of individual observations, when the number of
observations is even, the median is estimated by taking mean
of the two middle-most observations, which is not an actual
If your score is at the 90th percentile, it means that 90% of the
candidates who took the test received a score lower than yours.
Similarly, if your income is at the 95th percentile in your
organisation, you are among the top 5% highest paid employees in your
company.
3.4.4 QUARTILES
Quartiles are position values similar to the Median. There are three
quartiles denoted by Q1, Q2 and Q3. Q1 is called the lower quartile or
first quartile. The second quartile Q2 is nothing but the median. Q3 is
called the upper quartile or third quartile. In a distribution, one
fourth of the items are less than Q1 and the other three fourths are
greater than Q1.
Inter-quartile range is defined as the difference between the first
and third quartile. It is a measure of spread of the data.
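Q1 is the value at the $\left(\frac{N+1}{4}\right)^{th}$ position.
Q2 is the value at the $2\left(\frac{N+1}{4}\right)^{th} = \left(\frac{N+1}{2}\right)^{th}$ position.
Q3 is the value at the $3\left(\frac{N+1}{4}\right)^{th}$ position.
In Continuous series,
$$Q_1 = L + \frac{\frac{N}{4} - c.f.}{f} \times c$$
$$Q_2 = L + \frac{\frac{N}{2} - c.f.}{f} \times c$$
$$Q_3 = L + \frac{\frac{3N}{4} - c.f.}{f} \times c$$
Formulae: Alternatively, we use the formula
$$Q^{th} \text{ quartile} = \left(\frac{(n+1) \times Q}{4}\right)^{th} \text{observation}, \quad Q = 0, 1, 2, 3, 4$$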
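First quartile is the observation in position:
$$\frac{(n+1) \times 25}{100} = 5.25$$
Value of the observation corresponding to the 5.25th position is 13.25.
Second quartile (median) is the observation in position:
$$\frac{(n+1) \times 50}{100} = 10.5$$
Value of the observation corresponding to the 10.5th position is 16.
Third quartile is the observation in position:
$$\frac{(n+1) \times 75}{100} = 15.75$$
Value of the observation corresponding to the 15.75th position is 18.75.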
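Example: Calculate Median, Q1 and Q3 from the following data
Salary (₹ 000)   15 – 19  20 – 24  25 – 29  30 – 34  35 – 39  40 – 44
No. of officers:    15       25       40       50       40       30
Solution:
N/2 = 200/2 = 100 ∴ median class: 29.5 – 34.5
∴ L = 29.5; C = 34.5 – 29.5 = 5; f = 50; c.f. = 80
$$\text{Median} = L + \frac{\frac{N}{2} - c.f.}{f} \times C = 29.5 + \frac{100 - 80}{50} \times 5 = 31.5$$
= ₹ 31.5 thousands
N/4 = 50 ∴ Q1 class: 24.5 – 29.5, with f = 40 and c.f. = 40
$$Q_1 = 24.5 + \frac{50 - 40}{40} \times 5 = 24.5 + 1.25 = 25.75$$
= ₹ 25.75 thousands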
3.4.5 DECILES
D1, D2, D3, … and D9 are the nine deciles. They divide a series into 10
equal parts. One tenth of the items are less than or equal to D1, one
tenth of the items are more than or equal to D9, and one tenth of the
items lie between any successive pair of deciles when all the items
are in ascending order.
Formulae: Alternatively, we use the formula
$$D^{th} \text{ decile} = \left(\frac{(n+1) \times D}{10}\right)^{th} \text{observation}, \quad D = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10$$
Example: Find Q1, Q3, D2, D5, D7 from the following data
$$Q_3 = \frac{274^{th} \text{ value} + 275^{th} \text{ value}}{2} = \frac{60 + 60}{2} = 60$$
i.e. Q3 = ₹ 60
$$\frac{2(N+1)}{10} = 2 \times 36.6 = 73.2$$
∴ D2 = value at 73rd position + 0.20 (value at 74th
position – value at 73rd position)
= 20 + 0.20 (25 – 20)
= 20 + 0.20 (5) = 20 + 1
= ₹ 21
$$\frac{5(N+1)}{10} = 5 \times 36.6 = 183$$
∴ D5 = ₹ 40, the value x = 40 corresponding to position 183.
$$\frac{7(N+1)}{10} = 7 \times 36.6 = 256.2$$
∴ D7 = value at 256th position + 0.20 (value at 257th
position – value at 256th position)
= 60 + 0.20 (60 – 60)
= 60 + 0
= ₹ 60
3.4.6 PERCENTILES
The Pth percentile of a group of observations is that observation below
which lie P% (P percent) of the observations. The position of the Pth
percentile is given by $\frac{(n+1) \times P}{100}$, where ‘n’ is the number of data points.
Solution: First, we order the data in ascending order.
6, 9, 10, 12, 13, 14, 14, 15, 16, 16, 16, 17, 17, 18, 18, 19, 20, 21, 22, 24.
The 80th percentile of the data set is the observation lying in the
position:
$$\frac{(n+1) \times P}{100} = \frac{(20+1) \times 80}{100} = 16.8$$
Now, the 16th observation is 19 and the 17th observation is 20.
Therefore, the 80th percentile is the point lying 0.8 of the way from
19 to 20, which is 19.8.
The 90th percentile is similarly found as the observation lying in
position:
$$\frac{(n+1) \times P}{100} = \frac{(20+1) \times 90}{100} = 18.9$$
The 18th observation is 21 and the 19th observation is 22. Therefore,
the 90th percentile is the point 0.9 of the way from 21 to 22, which
is 21.9.
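The position rule (n + 1) × P / 100 with linear interpolation can be sketched in Python as follows. The function name `percentile` is illustrative, and the clamping at the two ends of the data is an added assumption for positions that fall outside 1 to n. Statistical packages often use a different position rule by default, so their results may differ slightly from the hand method used in this section.

```python
def percentile(values, p):
    """P-th percentile using position (n + 1) * P / 100 with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) * p / 100      # 1-based, possibly fractional
    k = int(position)                 # whole part of the position
    fraction = position - k           # how far to move towards the next value
    if k < 1:
        return ordered[0]             # assumption: clamp below the first value
    if k >= n:
        return ordered[-1]            # assumption: clamp above the last value
    return ordered[k - 1] + fraction * (ordered[k] - ordered[k - 1])


data = [6, 9, 10, 12, 13, 14, 14, 15, 16, 16, 16, 17, 17, 18, 18, 19, 20, 21, 22, 24]

for p in (25, 50, 75, 80, 90):
    print(p, round(percentile(data, p), 2))
```

Quartiles and deciles are special cases of the same routine: Q1, Q2 and Q3 correspond to P = 25, 50 and 75, and the Dth decile to P = 10 × D.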
These are the top ten final scores for the combined results of the
Ladies’ Figure Skating event at the 2010 Winter Olympics:
Figure Skating
Yu-Na Kim 228.56
Mao Asada 205.50
Joanie Rochette 202.64
Mirai Nagasu 190.15
Miki Ando 188.86
Laura Lepisto 187.97
Rachael Flatt 182.49
Akiko Suzuki 181.44
Alena Leonova 172.46
Ksenia Makarova 171.91
Yu-Na Kim of South Korea shattered the world record with a score
18 points higher than the previous record. How would the mean
and median of this group change if we left out her score?
3.5 MODE
The Mode of a data set is the value that occurs most frequently.
There are many situations in which the arithmetic mean and the median
fail to reveal the true characteristics of the data (the most
representative figure), for example, the most common size of shoes or
the most common size of garments. In such cases, the mode is the
best-suited measure of central tendency. There could be multiple modal
values, which occur with equal frequency. In some cases, the mode may
be absent.
Mode is the value which has the greatest frequency density. Mode
is denoted by Z.
For grouped data, the modal class is defined as the class with the
maximum frequency.
Solution: Now the values 14, 16, 17 and 18 occur 2 times which is
maximum for any observation. Therefore,
Modes are 14, 16, 17 and 18 (this is a multimodal distribution).
Example: Find the mode of the following distribution.
Class:      61-65  66-70  71-75  76-80  81-85  86-90
Frequency:    7      9     11      7      2      3
Solution:
True class limits Frequency
60.5-65.5 7
65.5-70.5 9 (f0)
70.5-75.5 11 (f1, Modal Class)
75.5-80.5 7 (f2)
80.5-85.5 2
85.5-90.5 3
$$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times c$$
L = 70.5, f1 = 11, f0 = 9, f2 = 7, c = 5
$$\therefore \text{Mode} = 70.5 + \frac{11 - 9}{22 - 9 - 7} \times 5 = 70.5 + \frac{10}{6} = 72.17$$
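The same calculation can be sketched in Python. The function name `grouped_mode` is illustrative; the sketch assumes equal class widths and a single modal class, and it treats a missing neighbouring class as having frequency zero (an assumption not discussed in the text).

```python
def grouped_mode(lower_limits, freqs, width):
    """Mode of grouped data: L + (f1 - f0) / (2*f1 - f0 - f2) * c."""
    i = max(range(len(freqs)), key=lambda j: freqs[j])   # index of the modal class
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0                    # class before the modal class
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0       # class after the modal class
    return lower_limits[i] + (f1 - f0) / (2 * f1 - f0 - f2) * width


true_lower_limits = [60.5, 65.5, 70.5, 75.5, 80.5, 85.5]
frequencies = [7, 9, 11, 7, 2, 3]

print(round(grouped_mode(true_lower_limits, frequencies, 5), 2))
```

A quick sanity check on any hand calculation is that the result must lie inside the modal class, here between 70.5 and 75.5.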
Example: From the following data, calculate mean, median and mode.
A = 63.5
$$\text{Mean } \bar{X} = A + \frac{\sum fd}{N} \times C = 63.5 + \frac{16}{150} \times 3 = 63.5 + 0.32 = 63.82$$
Median position = N/2 = 150/2 = 75
The median class is 62-65
∴ L = 62, f = 36, c.f. = 55, c = 3
$$\therefore \text{Median} = L + \frac{\frac{N}{2} - c.f.}{f} \times C = 62 + \frac{75 - 55}{36} \times 3 = 62 + 1.67 = 63.67$$
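$$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times c$$
The modal class is 62-65, so f1 = 36, f0 = 30 and f2 = 28.
$$\text{Mode} = 62 + \frac{36 - 30}{2 \times 36 - 30 - 28} \times 3 = 62 + \frac{6 \times 3}{14} = 62 + 1.29 = 63.29$$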
It is a value around which there is more concentration of
observations and hence the best representative of the data.
Limitations of mode are:
It is not based on all the observations.
It is not capable of further mathematical treatment.
In certain cases mode is not rigidly defined and hence, the
important requisite of a good measure of central tendency is not
satisfied.
It is much affected by the fluctuations of sampling.
It is not easy to calculate unless the number of observations is
sufficiently large and reveal a marked tendency of concentration
around a particular value.
It is not suitable when different items of the data are of unequal
importance.
It is an unstable average because, mode of a distribution, depends
upon the choice of width of class intervals.
3.5.3 GRAPHIC LOCATION OF MODE
The mode of a data set is the value that occurs most frequently. Mode
can be found out from the histogram. If the data is discrete (not
grouped) it’s very easy to find the mode as the X value of the tallest
Solution:
Using the above data we draw the histogram as shown below:
[Figure: Histogram of the number of workers]
If the distribution is skewed, the mean, the median and the mode
are not equal. In a moderately skewed distribution, the distance
between the mean and the median is approximately one third of the
distance between the mean and the mode. This can be expressed as:
Mean – Median = (Mean – Mode) / 3
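Rearranging the relation gives Mode = 3 × Median – 2 × Mean, which allows any one of the three averages to be estimated from the other two. A minimal Python sketch follows; the function name `empirical_mode` and the sample figures are hypothetical, chosen only to show the arithmetic.

```python
def empirical_mode(mean, median):
    """Approximate mode from Mean - Median = (Mean - Mode) / 3."""
    return 3 * median - 2 * mean


# Hypothetical figures for a moderately skewed distribution.
mean, median = 33.0, 34.0
mode = empirical_mode(mean, median)
print(mode)

# The original relation holds for the estimate by construction.
print(mean - median, (mean - mode) / 3)
```

The relation is only an approximation for moderately skewed, unimodal data, so the estimate should be treated as a rough guide rather than an exact value.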
Solution: The classified data is shown below.
Now the mode is,
$$\text{Mode} = L + \frac{D_1}{D_1 + D_2} \times h = 45 + \frac{5}{5 + 9} \times 4 = 45 + 1.43 = 46.43$$
Part Do it yourself.
These are the scores from last week’s geometry test:
90, 94, 53, 68, 79, 84, 87, 72, 70, 69, 65, 89, 85, 83, 72
You earned a score of 72. Your mom asks you how you did on the
test compared to the rest of the class. Calculate the three measures
of the average, and decide what to tell your mom.
Mode is especially useful to describe qualitative data. According
to Freund and Williams, consumer preferences for different
kinds of products can be compared using modal preferences, as
we cannot compute the mean or median. Mode can best describe the
average size of shoes or shirts.
G.M. is useful to average relative changes, averaging ratios and
percentages. It is theoretically the best average for construction
of index number. But it should not be used for measuring absolute
changes.
H.M. is useful in problems where values of a variable are
compared with a constant quantity of another variable like time,
distance travelled within a given time, quantities purchased or
sold over a unit.
On his first three quizzes, Patrick earned a 15, 18, and 16. (A perfect
score would have been 20 points.) What does he need to earn on the
next quiz to have a mean score of at least 17?
In general we can say that A.M. is the best of all averages as it satisfies
almost all requirements of an ideal measure of central tendency
and other averages may be used under special circumstances.
3.8 SUMMARY
Measures of the central tendency give one of the very important
characteristics of the data. According to the situation, one of the
various measures of central tendency may be chosen as the most
representative.
Arithmetic mean is widely used and understood. What
characterizes the three measures of centrality, and what are the
relative merits of each in the given situation, is the question.
Mean summarizes all the information in the data. Mean can be
visualized as a single point where all the mass (the weight) of
the observations is concentrated. It is like a centre of gravity in
physics. Mean also has some desirable mathematical properties
that make it useful in the context of statistical inference.
To simplify the manual calculation, we may sometimes use shift
of origin and change of scale. Shifting of origin is achieved by
adding or subtracting a constant to all observations. In case of
discrete data we add or subtract (usually subtract) a constant to
the individual observations. Whereas for grouped data, we add or
subtract (usually subtract) the constant to the class mark values.
There are cases where relative importance of the different items
is not the same. In such a case, we need to compute the weighted
arithmetic mean. The procedure is similar to the grouped data
calculations studied earlier, when we consider frequency as a
weight associated with the class-mark.
Median is the middle value when the data is arranged in order.
The median is resistant to the extreme observations. Median is
like the geometric centre in physics. In case we want to guard
against the influence of a few outlying observations (called
outliers), we may use the median.
Quantiles are related positional measures of central tendency.
These are useful and frequently employed measures. The most
familiar quantiles are Quartiles, Deciles, and Percentiles.
Quartiles are position values similar to the Median. There are
three quartiles denoted by Q1, Q2 and Q3. Q1 is called the lower
quartile or first quartile. The second quartile Q2 is nothing but
the median. Q3 is called the upper quartile or third quartile. In a
distribution, one fourth of the items are less than Q1 and the
other three fourths are greater than Q1.
Inter-quartile range is defined as the difference between the first
and third quartile. It is a measure of spread of the data.
D1, D2, D3, … and D9 are the nine deciles. They divide a series
into 10 equal parts. One tenth of the items are less than or equal
to D1, one tenth of the items are more than or equal to D9, and
one tenth of the items lie between any successive pair of deciles
when all the items are in ascending order.
The Pth percentile of a group of observations is that observation
below which lie P% (P percent) of the observations. The position of
the Pth percentile is given by $\frac{(n+1) \times P}{100}$, where ‘n’ is the number of
data points.
A distribution in which the mean, the median, and the mode
coincide is known as symmetrical (bell shaped) distribution.
Normal distribution is one such a symmetric distribution, which
is very commonly used.
This can be expressed as:
Mean – Median = (Mean – Mode) / 3
Mode = 3 * Median – 2 * Mean
No single average can be regarded as the best or most suitable
under all circumstances. Each average has its merits and
demerits and its own particular field of importance and utility. A
proper selection of an average depends on the (1) nature of the
data and (2) purpose of enquiry or requirement of the data.
Arithmetic Mean: It is defined as the sum of all values divided
by the number of values.
8. What do you understand by mode? How do you calculate it for a
continuous data set? How will you find mode from Histogram?
9. Explain the limitations of central tendency.
10. Explain the empirical relation between mean, median and mode.
22 21 37 33 28 42 56 33 32 59
40 47 29 65 45 48 55 43 42 40
37 39 56 54 38 49 60 37 28 27
32 33 47 36 35 42 43 55 53 48
29 30 32 37 43 54 55 47 38 62
2. Compute mean, median, mode, quartiles and 90th percentile for
the grouped data of age (years) of employees given below:
Symmetrical
18. Right (or positive)-skewed
19. Arithmetic mean
Limitations of Central 20. Close
Tendency
$$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_n}{N} = \frac{\sum_{i=1}^{n} x_i}{N}$$
4. Refer Section 3.3.1
Properties of arithmetic mean are as follows:
(a) The sum of the deviations, of all the values of x, from their
arithmetic mean, is zero.
(b) The product of the arithmetic mean and the number of items
gives the total of all items.
(c) If $\bar{x}_1$ and $\bar{x}_2$ are the arithmetic means of two samples of sizes
n1 and n2 respectively, then the arithmetic mean $\bar{x}$ of the
distribution combining the two can be calculated as
$$\bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2}$$
5. Refer Section 3.3.2
If a constant is subtracted from or added to all data points, the
Arithmetic Mean (AM) is reduced or increased by that amount.
In this method we first subtract a suitable constant from all the
observations, calculate the mean and then add the same constant
to the answer to get the actual value of the mean.
(P percent) observations. The position of the Pth percentile is given
by $\frac{(n+1) \times P}{100}$, where ‘n’ is the number of data points.
8. Refer Section 3.5
Mode is the value which has the greatest frequency density.
Mode is denoted by Z.
For grouped data, the modal class is defined as the class with the
maximum frequency.
The mode is calculated as:
$$\text{Mode} = L + \frac{D_1}{D_1 + D_2} \times h$$
9. Refer Section 3.7
No single average can be regarded as the best or most suitable
under all circumstances. Each average has its merits and demerits
and its own particular field of importance and utility.
A proper selection of an average depends on the (1) nature of the
data and (2) purpose of enquiry or requirement of the data. A.M.
satisfies almost all the requisites of a good average and hence can
be regarded as the best average but, it cannot be used:
(a) In case of highly skewed data.
(b) In case of uneven or irregular spread of the data.
Gordon, G., and Pressman I., Quantitative Decision Making for
Business, New Delhi: National Publishing House, 1983.
Lapin, L., Quantitative Methods for Business Decisions, New
York: Harcourt Brace Jovanovich. Inc., 1976
E-REFERENCES
http://math.about.com/
http://www.calculatorsoup.com/
http://www.mathgoodies.com/
MEASURES OF DISPERSION
CONTENTS
4.1 Introduction
4.2 Characteristics of Measures of Dispersion
4.3 Absolute and Relative Measures of Dispersion
4.4 Range
4.4.1 Merits and Demerits of Range
4.4.2 Uses of Range
4.5 Inter-quartile Range and Deviations
4.5.1 Inter-quartile Range
INTRODUCTORY CASELET
VARIABILITY IS IMPORTANT
A brief story may help the reader to see why variability is often
important. Some years ago a company was producing nickel
powder, which varied considerably in particle size. A metallurgical
engineer in technical sales was given the task of developing new
customers in the alloy steel industry for the powder. Some potential
buyers said they would pay a premium price for a product that was
more closely sized. After some discussion with the management of
the plant, specifications for three new products were developed:
fine powder, medium powder, and coarse powder.
optimum conditions would satisfy the specifications. Thus, the
mean size of the specification was satisfactory, but the specified
variability was not satisfactory from the point of view of production.
To make production of fine powder more practical, it was necessary
to change the specifications for “fine powder” to correspond to a
larger standard deviation. When this was done, the plant could
produce fine powder much more easily (but the customer was not
willing to pay such a large premium for it!).
4.1 INTRODUCTION
Different series may possess different dispersions of items around the
average. Measures of central tendency are averages of the first order.
Measures of dispersion are averages of the second order.
A measure of dispersion gives an idea about the extent of lack of
uniformity in the sizes and qualities of the items in a series. It helps
us to know the degree of uniformity and consistency in the series.
If the difference between items is large the dispersion or variation is
large and vice versa.
4.2 CHARACTERISTICS OF MEASURES OF DISPERSION
There are a number of measures of variability (or dispersion). Some
of the common measures are Range, Inter-quartile Range, Quartile
Deviation, Mean Deviation and Standard Deviation.
There are certain pre-requisites or characteristics for a good measure
of dispersion:
It should be simple to understand.
It should be easy to compute.
It should be rigidly defined.
It should be based on each individual item of the distribution.
It should be capable of further algebraic treatment.
It should have sampling stability.
It should not be unduly affected by the extreme items.
1. A measure of ................... in any data shows the extent to which
the numerical values tend to spread about an average.
2. ................... control methods are based on the laws of dispersion.
3. Measures of dispersion should not be unduly affected by the
................... items.
It is difficult to compare absolute values of dispersion in different
series, especially when the series are in different units or have different
sets of values. A good measure of dispersion should have properties
similar to those described for a good measure of central tendency.
Choose any work situation from your life and differentiate the
relative and absolute measures of dispersion which you use. Which
of them is more helpful and why?
4.4 RANGE
The ‘Range’ of the data is the difference between the largest value
of data and smallest value of data.
Demerits
Range does not take into account all the values of a series, i.e. it
considers only the extreme items and middle items are not given
any importance. Therefore, Range cannot tell us anything about
the character of the distribution.
Range cannot be computed in the case of an ‘open-end’ distribution,
i.e., a distribution where the lower limit of the first group and the
upper limit of the highest group are not given.
To study fluctuation in prices over a period, say a week or a
month or a year, for example, 52 weeks high/low of share prices
given in newspapers.
Weather forecast indicators like maximum and minimum
temperatures, maximum and minimum rainfall in a particular
year, etc.
Thus,
\[ \text{Inter-quartile Range} = Q_3 - Q_1 \]
Formulae:
\[ \text{Quartile Deviation} = QD = \frac{Q_3 - Q_1}{2} \]
Quartile Deviation (QD) also gives the average deviation of upper and
lower quartiles from Median.
\[ QD = \frac{Q_3 - Q_1}{2} = \frac{(Q_2 - Q_1) + (Q_3 - Q_2)}{2} \]
Relative measure of Quartile Deviation is called the Coefficient of
Quartile Deviation. It is defined as,
\[ \text{Coefficient of QD} = \frac{Q_3 - Q_1}{Q_3 + Q_1} \]
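The quartile measures defined above can be sketched in a few lines of Python. This is a minimal illustration, assuming the position rule k(N + 1)/4 with linear interpolation between neighbouring ordered values; the function name `quartile_by_position` and the data in `sample` are hypothetical and chosen only for illustration.

```python
def quartile_by_position(values, k):
    """Return the k-th quartile (k = 1, 2, 3) using the k(N + 1)/4 position rule."""
    data = sorted(values)
    n = len(data)
    pos = k * (n + 1) / 4          # 1-based position, possibly fractional
    lower = int(pos)               # whole part of the position
    frac = pos - lower             # fractional part used for interpolation
    if lower < 1:                  # position falls before the first value
        return data[0]
    if lower >= n:                 # position falls at or after the last value
        return data[-1]
    return data[lower - 1] + frac * (data[lower] - data[lower - 1])


# Hypothetical data set with N = 11 observations
sample = [12, 15, 18, 20, 22, 25, 28, 30, 35, 40, 45]

q1 = quartile_by_position(sample, 1)      # 3rd ordered value
q3 = quartile_by_position(sample, 3)      # 9th ordered value
iqr = q3 - q1                             # inter-quartile range
qd = iqr / 2                              # quartile deviation
coeff_qd = (q3 - q1) / (q3 + q1)          # coefficient of quartile deviation

print(q1, q3, iqr, qd, round(coeff_qd, 4))
```

With a fractional position such as 13.25, the function takes the 13th value plus 0.25 of the gap to the 14th value.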
Position of Q1 is (N + 1)/4 = (52 + 1)/4 = 13.25
∴ Q1 = 13th value + 0.25 (14th value – 13th value)
= 200 + 0.25 (400 – 200)
= 200 + 0.25 × 200
= 200 + 50
= 250
Position of Q3 is 3(N + 1)/4 = 3 × 13.25 = 39.75
∴ Q3 = 39th value + 0.75 (40th value – 39th value)
= 500 + 0.75 (500 – 500)
= 500 + 0.75 × 0
= 500
\[ \therefore \text{Q.D.} = \frac{Q_3 - Q_1}{2} = \frac{500 - 250}{2} = \frac{250}{2} = 125 \]
\[ \therefore \text{Coefficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{500 - 250}{500 + 250} = \frac{250}{750} = 0.3333 \]
Example: Find the quartile deviation and the quartile coefficient of
dispersion for the following data.
Frequency : 8 20 34 46 28 14 10
Solution:
\[ \text{Median} = L + \frac{\frac{N}{2} - c.f.}{f} \times i = 30 + \frac{80 - 62}{46} \times 10 = 30 + 3.91 = 33.91 \]
\[ Q_3 = L_3 + \frac{\frac{3N}{4} - c.f.}{f} \times i = 40 + \frac{120 - 108}{28} \times 10 = 40 + 4.29 = 44.29 \]
\[ \text{Quartile deviation} = \frac{Q_3 - Q_1}{2} = \frac{44.29 - 23.53}{2} = \frac{20.76}{2} = 10.38 \]
\[ \text{Quartile Coefficient of Dispersion} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{44.29 - 23.53}{44.29 + 23.53} = \frac{20.76}{67.82} = 0.31 \]
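The grouped-data calculation above can be checked with a short Python sketch. The class intervals are not shown with the example, so the lower limits 0, 10, ..., 60 (width 10) used here are an assumption inferred from the values L = 30, c.f. = 62 and L3 = 40, c.f. = 108 in the solution; the helper name `grouped_quantile` is also hypothetical.

```python
def grouped_quantile(lower_limits, freqs, width, fraction):
    """Interpolate a quantile from a grouped frequency distribution.

    Implements L + ((fraction * N - c.f.) / f) * i, where c.f. is the
    cumulative frequency before the class that contains the quantile.
    """
    n = sum(freqs)
    target = fraction * n
    cumulative = 0
    for low, f in zip(lower_limits, freqs):
        if cumulative + f >= target:
            return low + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("fraction must lie between 0 and 1")


lower_limits = [0, 10, 20, 30, 40, 50, 60]   # assumed class lower limits
freqs = [8, 20, 34, 46, 28, 14, 10]          # frequencies from the example
width = 10                                   # assumed class width

q1 = grouped_quantile(lower_limits, freqs, width, 0.25)
median = grouped_quantile(lower_limits, freqs, width, 0.50)
q3 = grouped_quantile(lower_limits, freqs, width, 0.75)

qd = (q3 - q1) / 2
coeff = (q3 - q1) / (q3 + q1)

print(round(q1, 2), round(median, 2), round(q3, 2), round(qd, 2), round(coeff, 2))
```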
Deviation’ from median is lowest as compared to any other ‘Mean
Deviations’. Since absolute values of deviations ignoring sign are
taken for calculating Mean Deviation, the mean deviation is not
amenable to further algebraic treatment.
It is defined as:
\[ \text{Coefficient of mean deviation} = \frac{\text{Mean Deviation}}{\text{Mean or Median or Mode}} \]
It can also be expressed in percentage by multiplying it with 100.
Formulae:
\[ \text{Coefficient of Mean deviation (about mean)} = \frac{\text{Mean deviation about Mean}}{\text{Mean}} = \frac{\sum |X - \bar{X}| / N}{\bar{X}} \]
Coefficient of Mean deviation (about Median)
X: 2 4 6 8 10
f: 1 4 6 4 1
Solution:
X      f      fX     |X − X̄| (X̄ = 6)     f|X − X̄|
2      1      2      4                    4
4      4      16     2                    8
6      6      36     0                    0
8      4      32     2                    8
10     1      10     4                    4
       N = 16 96                          24
\[ \bar{X} = \frac{\sum fX}{N} = \frac{96}{16} = 6 \]
\[ \text{Mean Deviation about Mean} = \frac{\sum f|X - \bar{X}|}{N} = \frac{24}{16} = 1.50 \]
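The same calculation can be sketched in Python. This is a minimal illustration using the X and f values from the example; the function name `mean_deviation_about_mean` is hypothetical.

```python
def mean_deviation_about_mean(values, freqs):
    """Return (mean, mean deviation about the mean) for a frequency distribution."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    # Absolute deviations are weighted by frequency, then averaged over N
    md = sum(f * abs(x - mean) for x, f in zip(values, freqs)) / n
    return mean, md


x_values = [2, 4, 6, 8, 10]
frequencies = [1, 4, 6, 4, 1]

mean, md = mean_deviation_about_mean(x_values, frequencies)
coeff_md = md / mean   # coefficient of mean deviation (about mean)

print(mean, md, coeff_md)
```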
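Example: Find the mean deviation about the mean for the following
data
Class     Mid x     f      d = (X − A)/C     |X − X̄|     f|X − X̄|
0-5       2.5       3      –2                10.5        31.5
5-10      7.5       5      –1                5.5         27.5
10-15     12.5      12     0                 0.5         6.0
15-20     17.5      6      1                 4.5         27.0
20-25     22.5      4      2                 9.5         38.0
Total               30                                   130.0
\[ A = 12.5; \quad d = \frac{X - A}{C} = \frac{X - 12.5}{5} \]
\[ \bar{X} = A + \frac{\sum fd}{N} \times C = 12.5 + \frac{3}{30} \times 5 = 13 \]
\[ \text{Mean Deviation about Mean} = \frac{\sum f|X - \bar{X}|}{N} = \frac{130}{30} = 4.33 \]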
11. Mean deviation is the arithmetic mean of the absolute
deviations of the values about their arithmetic mean or median
or mode.
12. The relative measure corresponding to the ‘Mean Deviation’ is
‘coefficient of Mean Deviation’.
These are the average temperatures (°F) in Miami, Florida, for each
month of the year.
Jan Feb Mar Apr May Jun Jul Aug Sept Oct Nov Dec
67.2 68.5 71.7 75.2 78.7 81.4 82.6 82.8 81.9 78.3 73.6 69.1
Find the inter-quartile range, Quartile deviation and mean
deviation of this data.
Average deviation takes into account all the items of a series and
hence, it provides sufficiently representative results. It simplifies
calculations since all signs of the deviations are taken as positive.
Average Deviation may be calculated either by taking deviations
from Mean or Median or Mode. Average Deviation is less affected
by extreme items than the standard deviation.
When the data constitute a sample, the variance is denoted by s_x², and
averaging is done by dividing the sum of the squared deviations from
the mean by ‘n – 1’. (Note that one is reduced from n because we lose
one degree of freedom while using mean of the sample. More about
degree of freedom will be discussed later.) When our observations
constitute the population, the variance is denoted by σ² and we divide
by N for the averaging.
\[ \text{Sample Variance } Var(x) = s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1} \]
\[ \text{Population Variance } Var(x) = \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]
Where,
xi for i = 1, 2, …, n are observation values.
X̄ = Sample mean
n = Sample size
μ = Population mean
N = Population size
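The difference between the two divisors can be seen in a short Python sketch. The data in `observations` are hypothetical; `statistics.variance` and `statistics.pvariance` are the standard-library functions for the sample and population variance respectively.

```python
import statistics

# Hypothetical observations, used only to illustrate the two divisors
observations = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(observations)
mean = sum(observations) / n
sum_sq_dev = sum((x - mean) ** 2 for x in observations)

population_variance = sum_sq_dev / n        # divide by N
sample_variance = sum_sq_dev / (n - 1)      # divide by n - 1

# Short-cut form E(X^2) - [E(X)]^2 for the population variance
shortcut_variance = sum(x * x for x in observations) / n - mean ** 2

print(population_variance, sample_variance, shortcut_variance)
print(statistics.pvariance(observations), statistics.variance(observations))
```

The sample variance is always the larger of the two, and the gap shrinks as n grows.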
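Population Variance is,
\[ Var(x) = \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} x_i^2 - 2\mu \sum_{i=1}^{N} x_i + N\mu^2}{N} = \frac{\sum_{i=1}^{N} x_i^2}{N} - \mu^2 \]
\[ Var(x) = E(X^2) - [E(X)]^2 \]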
This formula is very useful for manual calculations.
For grouped data, we need to multiply average values of
observations (class marks) by corresponding class frequencies.
Then, the formula for variance becomes:
\[ \text{Population Variance} = E(X^2) - [E(X)]^2 = \frac{\sum_{i=1}^{n} f_i \times m_i^2}{N} - \left( \frac{\sum_{i=1}^{n} f_i \times m_i}{N} \right)^2 \]
In case of sample ‘N’ is replaced by ‘n – 1’.
The Standard Deviation (SD) of a set of data is the positive square root
of the variance of the set. This is also referred to as the Root Mean
Square (RMS) value of the deviations of the data points. The SD of a
sample is the square root of the sample variance, denoted by s_x, and the
Standard Deviation of a population is the square root of the variance of
the population, denoted by σ.
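Formula for Calculating S.D.
For the set of values x1, x2, …, xn
\[ \sigma = \sqrt{\frac{\sum x^2}{n} - \left( \frac{\sum x}{n} \right)^2} \]
If an assumed value A is taken for mean and d = X – A, then
\[ \sigma = \sqrt{\frac{\sum d^2}{n} - \left( \frac{\sum d}{n} \right)^2} \]
For a frequency distribution
\[ \sigma = \sqrt{\frac{\sum fd^2}{N} - \left( \frac{\sum fd}{N} \right)^2} \times C \]
Where, d = (X – A)/C and C is the true class interval.
N = Total frequency.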
Example: Find the standard deviation for the following data:
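Solution: Direct method
\[ \text{Mean} = \mu = \frac{\sum f_i \times m_i}{\sum f_i} = \frac{1500}{50} = 30 \]
\[ SD = \sigma = \sqrt{\frac{\sum f_i \times d_i^2}{\sum f_i}} = \sqrt{\frac{19250}{50}} = 19.62 \]
Using step deviations d′ with class interval h = 10,
\[ SD = h \times \sqrt{\frac{\sum f_i \times d_i'^2}{\sum f_i} - \left( \frac{\sum f_i \times d_i'}{\sum f_i} \right)^2} = 10 \times \sqrt{\frac{205}{50} - \left( \frac{-25}{50} \right)^2} = 19.62 \]
Thus, we get the same answers.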
Effect of Shift of Origin and Change of Scale
To simplify the manual calculation we may sometimes use shift of
origin and change of scale. Shifting of origin is achieved by adding
or subtracting a constant to all observations. In case of discrete data
we add or subtract (usually subtract) a constant to the individual
observations. Whereas, for grouped data we add or subtract (usually
subtract) the constant to the class mark values. There is no effect of
shifting origin on standard deviation or variance.
Change of scale is achieved by multiplying or dividing all observations
by a constant. In case of discrete data we multiply or divide (usually
divide) the individual observations by a constant. Whereas for
grouped data, we multiply or divide (usually divide) the class mark
values by the constant (usually the class interval). The effect is as
follows. If all data points are multiplied or divided by a constant, the
standard deviation is multiplied (stretched) or divided (shrunk) by
that amount.
We can use both Change of Origin and Change of Scale together, but
we must correct the answers in the reverse order of the algebraic
operations performed on the data points. In this method, we first
subtract a constant, say A (called assumed mean), from all the
observations or class marks and then divide all the observations by a
suitable constant, say h (usually the class interval for grouped data),
and then calculate the Standard Deviation. Then we multiply the
answer by the same constant h to get the actual value of the Standard
Deviation. For calculating variance, of course, we need to multiply the
calculated answer by h².
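A small Python sketch can illustrate both effects. The values in `data`, the assumed mean `A` and the scale constant `h` are hypothetical; `statistics.pstdev` and `statistics.pvariance` are the standard-library population measures.

```python
import statistics

# Hypothetical observations, assumed mean and scale constant
data = [70, 75, 80, 85, 90, 95, 100]
A = 85   # assumed mean (shift of origin)
h = 5    # scale constant (change of scale)

shifted = [x - A for x in data]          # shift of origin only
coded = [(x - A) / h for x in data]      # shift of origin, then change of scale

sd_original = statistics.pstdev(data)
sd_shifted = statistics.pstdev(shifted)
sd_coded = statistics.pstdev(coded)

var_original = statistics.pvariance(data)
var_coded = statistics.pvariance(coded)

# Shifting leaves the SD unchanged; scaling by 1/h shrinks it by a factor h,
# so the SD is recovered by multiplying by h and the variance by h squared.
print(sd_original, sd_shifted, h * sd_coded)
print(var_original, h ** 2 * var_coded)
```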
Example: The weekly salaries of a group of employees are given in
the following table. Find the mean and S.D. of the salaries.
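\[ \text{Mean } \bar{X} = A + \frac{\sum fd}{N} \times C = 85 + \frac{23}{50} \times 5 = 87.30 \]
Mean = ₹ 87.30
\[ \sigma = \sqrt{\frac{\sum fd^2}{N} - \left( \frac{\sum fd}{N} \right)^2} \times C = \sqrt{\frac{91}{50} - \left( \frac{23}{50} \right)^2} \times 10 = \sqrt{1.61} \times 10 = 12.69 \]
S.D. = ₹ 12.69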
Example: The following data were obtained while observing the life
span of a few neon lights of a company. Calculate S.D.
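Life span (Years):     4-6    6-8    8-10    10-12    12-14
No. of Neon Lights:    10     17     32      21       20
Solution:
Life span (Years)    No. of Neon Lights (f)    Mid value (X)    d = (X − A)/C    fd     fd²
4-6                  10                        5                –2               –20    40
6-8                  17                        7                –1               –17    17
8-10                 32                        9                0                0      0
10-12                21                        11               1                21     21
12-14                20                        13               2                40     80
Total                100                                                         24     158
Here A = 9 and C = 2.
\[ \text{Standard Deviation, } \sigma = \sqrt{\frac{\sum fd^2}{N} - \left( \frac{\sum fd}{N} \right)^2} \times C \]
\[ = \sqrt{\frac{158}{100} - \left( \frac{24}{100} \right)^2} \times 2 \]
\[ = \sqrt{1.58 - 0.0576} \times 2 \]
\[ = \sqrt{1.5224} \times 2 \]
\[ = 2.4677 \]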
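The neon-light example can be recomputed in Python as a check on the arithmetic. This is a sketch that takes A = 9 and C = 2, as the table implies, and compares the step-deviation result with a direct calculation from the class mid values. Note that fd² for the last class is 20 × 2² = 80, so ∑fd² = 158 and the standard deviation is about 2.47 years.

```python
from math import sqrt

mid_values = [5, 7, 9, 11, 13]       # class mid values for 4-6, ..., 12-14
freqs = [10, 17, 32, 21, 20]         # number of neon lights per class
A = 9                                # assumed mean (mid value of 8-10)
C = 2                                # class interval

n = sum(freqs)
d = [(x - A) / C for x in mid_values]                 # step deviations

sum_fd = sum(f * di for f, di in zip(freqs, d))
sum_fd2 = sum(f * di ** 2 for f, di in zip(freqs, d))

# Step-deviation (short-cut) method
sd_step = sqrt(sum_fd2 / n - (sum_fd / n) ** 2) * C

# Direct method from the mid values, as a cross-check
mean = A + sum_fd / n * C
sd_direct = sqrt(sum(f * (x - mean) ** 2 for x, f in zip(mid_values, freqs)) / n)

print(sum_fd, sum_fd2, round(mean, 2), round(sd_step, 4), round(sd_direct, 4))
```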
It is based on all the observations.
Further mathematical treatment is possible.
It is affected least by any sampling fluctuations.
It is affected by the extreme values and it gives more importance
to the values away from the mean.
The main limitation is; we cannot compare the variability of
different data sets given in different units.
Merits
Standard deviation is the best measure of dispersion because it
takes into account all the items and is capable of further algebraic
treatment and statistical analysis.
It is possible to calculate standard deviation for two or more series.
This measure is most suitable for making comparisons among
two or more series about variability.
Disadvantages
It is difficult to compute.
It assigns more weights to extreme items and less weight to items that
are nearer to mean. It is because of this fact that the squares of the
deviations which are large in size would be proportionately greater
than the squares of those deviations which are comparatively small.
Let x̄ and σ be the mean and S.D. of the combined group of (n1 + n2)
items. Then x̄ and σ are determined by the formulae
\[ \bar{x} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2} \]
\[ \sigma^2 = \frac{n_1 \sigma_1^2 + n_2 \sigma_2^2 + n_1 d_1^2 + n_2 d_2^2}{n_1 + n_2} \quad \text{(or)} \quad \sigma = \sqrt{\frac{n_1 \sigma_1^2 + n_2 \sigma_2^2 + n_1 d_1^2 + n_2 d_2^2}{n_1 + n_2}} \]
\[ \text{where } d_1 = \bar{x}_1 - \bar{x}; \quad d_2 = \bar{x}_2 - \bar{x} \]
Example: Particulars regarding the income of two villages are given below:
                             Village A    Village B
Number of people:            600          500
Average income (in ₹):       175          186
Variance of income (₹):      100          81
In which village is the variation in income greater?
Coefficient of variation:
\[ \text{Village A: } \frac{10}{175} \times 100 = 5.714 \qquad \text{Village B: } \frac{9}{186} \times 100 = 4.839 \]
Therefore, the variation in income is greater in village A.
\[ \bar{X} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2} \]
where n1 = 600; n2 = 500; x̄1 = 175; x̄2 = 186
\[ \therefore \bar{X} = \frac{(600 \times 175) + (500 \times 186)}{600 + 500} = \frac{1,05,000 + 93,000}{1100} = 180 \]
\[ \sigma^2 = \frac{n_1 \sigma_1^2 + n_2 \sigma_2^2 + n_1 d_1^2 + n_2 d_2^2}{n_1 + n_2} \]
where d1 = x̄1 – x̄ = 175 – 180 = –5 and d2 = x̄2 – x̄ = 186 – 180 = 6
\[ \therefore \sigma^2 = \frac{600(100) + 500(81) + 600(-5)^2 + 500(6)^2}{1100} = \frac{60,000 + 40,500 + 15,000 + 18,000}{1100} = \frac{1,33,500}{1100} = 121.36 \]
\[ \sigma = \sqrt{121.36} = 11.02 \]
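The combined mean and standard deviation can be sketched as a small Python function that works for any number of groups. The function name `combine_groups` is hypothetical; the inputs are the two-village figures above (the standard deviations 10 and 9 are the square roots of the variances 100 and 81). Since 500 × 81 = 40,500, the combined variance is 1,33,500/1100 = 121.36 and the combined S.D. is about 11.02.

```python
from math import sqrt


def combine_groups(sizes, means, sds):
    """Return (combined mean, combined SD) for several groups.

    Uses sigma^2 = sum(n_i * (sigma_i^2 + d_i^2)) / sum(n_i),
    where d_i is the group mean minus the combined mean.
    """
    n = sum(sizes)
    mean = sum(ni * mi for ni, mi in zip(sizes, means)) / n
    variance = sum(
        ni * (si ** 2 + (mi - mean) ** 2)
        for ni, mi, si in zip(sizes, means, sds)
    ) / n
    return mean, sqrt(variance)


# Two villages: sizes, mean incomes, and SDs (square roots of 100 and 81)
village_mean, village_sd = combine_groups([600, 500], [175, 186], [10, 9])
print(village_mean, round(village_sd, 2))
```

The same function accepts three or more groups, for example branches with sizes 20, 30, 50, means 15, 12, 18 and standard deviations 3, 5, 6.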
Example: An analysis of monthly wage of workers of two organizations,
A and B yielded the following results.
                          Organisations
                          A         B
No. of Workers            50        60
Average monthly wage      ₹ 60      ₹ 48
\[ \therefore \bar{X} = \frac{(50 \times 60) + (60 \times 48)}{50 + 60} = \frac{3000 + 2880}{110} = 53.45 \]
Combined mean = ₹ 53.45
\[ \sigma^2 = \frac{n_1 \sigma_1^2 + n_2 \sigma_2^2 + n_1 d_1^2 + n_2 d_2^2}{n_1 + n_2} \]
where d1 = x̄1 – x̄ = 60 – 53.45 = 6.55 and d2 = x̄2 – x̄ = 48 – 53.45 = –5.45
\[ \therefore \sigma^2 = \frac{5000 + 8640 + 2145.125 + 1782.15}{110} = \frac{17567.275}{110} = 159.7025 \]
\[ \sigma = \sqrt{159.7025} = 12.637 \]
S.D. of two organizations taken together = ₹ 12.637
Example: There are 20, 30 and 50 employees in the three branches
of a concern. Their mean salaries are ` 15, 12 and 18 thousand. S.D.
of their salaries is ` 3, 5 and 6 thousand respectively. Find the mean
salary and the S.D. of salaries for the employees of the concern as a
whole.
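Solution:
Given
               Size        Mean Salary    S.D. of salaries
Branch-I       N1 = 20     X̄1 = 15        σ1 = 3
Branch-II      N2 = 30     X̄2 = 12        σ2 = 5
Branch-III     N3 = 50     X̄3 = 18        σ3 = 6
\[ \bar{X} = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + n_3 \bar{x}_3}{n_1 + n_2 + n_3} = \frac{20 \times 15 + 30 \times 12 + 50 \times 18}{20 + 30 + 50} = \frac{300 + 360 + 900}{100} = \frac{1560}{100} = 15.60 \]
\[ \sigma^2 = \frac{n_1 \sigma_1^2 + n_2 \sigma_2^2 + n_3 \sigma_3^2 + n_1 d_1^2 + n_2 d_2^2 + n_3 d_3^2}{n_1 + n_2 + n_3} \]
where
d1 = x̄1 – x̄ = 15.0 – 15.6 = –0.6
d2 = x̄2 – x̄ = 12.0 – 15.6 = –3.6
d3 = x̄3 – x̄ = 18.0 – 15.6 = 2.4
\[ \sigma^2 = \frac{20 \times 3^2 + 30 \times 5^2 + 50 \times 6^2 + 20 \times (-0.6)^2 + 30 \times (-3.6)^2 + 50 \times (2.4)^2}{100} = \frac{3414}{100} = 34.14 \]
\[ \sigma = \sqrt{34.14} = 5.84 \]
Thus the mean salary is ₹ 15.60 thousand and the S.D. of salaries is about ₹ 5.84 thousand.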
4.6.6 COEFFICIENT OF VARIATION
This was developed by Karl Pearson and is defined as the ratio of SD
to mean, multiplied by 100.
\[ CV = \frac{\sigma}{\mu} \times 100 \]
This is also called variability. A smaller value of CV indicates greater
stability and lesser variability.
Example: Two batsmen A and B made the following scores in the
preliminary round of World Cup Series of cricket matches.
A 14, 13, 26, 53, 17, 29, 79, 36, 84 and 49
B 37, 22, 56, 52, 28, 30, 37, 48, 20 and 40
Whom will you select for the final? Justify your answer?
Solution: We will first calculate mean, standard deviation and Karl
Pearson’s coefficient of variation. We will select the player based on
the average score as well as consistency. We not only want the player
who has been scoring at a high average but also doing it consistently.
Thus, the probability of his playing a good innings in the final is high.
Now, for player ‘A’,
\[ \text{Mean} = \mu = \frac{\sum_{i=1}^{10} x_i}{N} = \frac{400}{10} = 40 \]
\[ \text{Variance} = Var(x) = \frac{\sum_{i=1}^{10} (x_i - \mu)^2}{N} = \frac{5974}{10} = 597.4 \]
\[ \text{Standard Deviation} = \sigma = \sqrt{Var(x)} = \sqrt{597.4} = 24.44 \]
Another Method
\[ \text{Variance} = Var(x) = \frac{\sum_{i=1}^{10} x_i^2}{N} - \mu^2 = 2197.4 - 1600 = 597.4 \]
Coefficient of variation (variability) for player ‘A’
\[ CV = \frac{\sigma}{\mu} \times 100 = \frac{24.44}{40} \times 100 = 61.10 \]
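For player ‘B’ we will use the short-cut method. Let the assumed
mean A = 40
xi       di = xi − A     di²
37       –3              9
22       –18             324
56       16              256
52       12              144
28       –12             144
30       –10             100
37       –3              9
48       8               64
20       –20             400
40       0               0
Σxi = 370    Σdi = –30    Σdi² = 1450
\[ \text{Now, Mean} = \mu = \frac{\sum_{i=1}^{10} x_i}{N} = \frac{370}{10} = 37 \]
\[ \text{Or, Mean} = \mu = A + \frac{\sum_{i=1}^{10} d_i}{N} = 40 + \frac{-30}{10} = 40 - 3 = 37 \]
\[ \text{Variance} = Var(x) = \frac{\sum_{i=1}^{10} d_i^2}{N} - \left( \frac{\sum_{i=1}^{10} d_i}{N} \right)^2 = \frac{1450}{10} - 9 = 136 \]
\[ \text{Standard Deviation} = \sigma = \sqrt{Var(x)} = \sqrt{136} = 11.66 \]
\[ CV = \frac{\sigma}{\mu} \times 100 = \frac{11.66}{37} \times 100 = 31.5 \]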
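The comparison of the two batsmen can be reproduced with a short Python sketch. It uses the scores given in the example and the population standard deviation (`statistics.pstdev`), which matches the division by N used above; the function name `coefficient_of_variation` is hypothetical.

```python
import statistics


def coefficient_of_variation(scores):
    """Return (mean, population SD, CV in percent) for a list of scores."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)      # divides by N, as in the worked example
    return mean, sd, sd / mean * 100


scores_a = [14, 13, 26, 53, 17, 29, 79, 36, 84, 49]
scores_b = [37, 22, 56, 52, 28, 30, 37, 48, 20, 40]

mean_a, sd_a, cv_a = coefficient_of_variation(scores_a)
mean_b, sd_b, cv_b = coefficient_of_variation(scores_b)

print(mean_a, round(sd_a, 2), round(cv_a, 1))
print(mean_b, round(sd_b, 2), round(cv_b, 1))
```

Batsman A has the higher mean, while batsman B has the smaller coefficient of variation and is therefore the more consistent scorer.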
standard deviation is a measure of the spread. There are two general
rules that establish a relation between these measures.
Chebyshev’s Theorem
Chebyshev’s theorem states the following rules:
At least three quarters of the observations in a set will lie within
±2 standard deviations of the mean.
Scores of player ‘B’ (μ = 37 and σ = 11.66)
Range       Data interval       As per Chebyshev’s Theorem,     As per Empirical rule,          Actual Data    Actual percentage
                                percentage of Observations      percentage of Observations      Points         of Observations
μ ± 1×σ     25.34 to 48.66      -                               68                              6              60
μ ± 2×σ     13.68 to 60.32      75                              95                              10             100
μ ± 3×σ     2.02 to 71.98       89                              99                              10             100
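The comparison between the theoretical percentages and the actual data can be sketched in Python. This example uses batsman B’s scores (mean 37, S.D. 11.66), whose intervals are 25.34 to 48.66, 13.68 to 60.32 and 2.02 to 71.98; counting directly, 6 of the 10 scores fall within one standard deviation of the mean. The function names are hypothetical, and Chebyshev’s lower bound is taken as 1 − 1/k² for k greater than 1.

```python
import statistics


def share_within(scores, k):
    """Return (count, percentage) of scores within k standard deviations of the mean."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    inside = [x for x in scores if mu - k * sigma <= x <= mu + k * sigma]
    return len(inside), 100 * len(inside) / len(scores)


def chebyshev_bound(k):
    """Minimum percentage of observations within k SDs of the mean (k > 1)."""
    return 100 * (1 - 1 / k ** 2)


scores_b = [37, 22, 56, 52, 28, 30, 37, 48, 20, 40]

within_1 = share_within(scores_b, 1)
within_2 = share_within(scores_b, 2)
within_3 = share_within(scores_b, 3)

print(within_1, within_2, within_3)
print(chebyshev_bound(2), round(chebyshev_bound(3), 1))
```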
15. There is no effect of shifting origin on standard deviation or
variance.
16. It is not possible to calculate standard deviation for two or
more series.
Fill in the blanks:
17. ................... is defined as the average of squared deviation of
data points from their mean.
18. ................... ................... is the root mean square deviation of the
values from their arithmetic mean.
19. Coefficient of variation was developed by ................... .
20. ................... value of CV indicates greater stability and lesser
variability.
Calculate the standard deviation of monthly cloud cover over
Equatorial Africa for January 2012 to December 2014. Collect data
from secondary sources like internet.
4.7 SUMMARY
Study of distribution is very important for decision-making.
Usually, measures of central tendency and variability are
adequate for taking decisions. However, if the data differ markedly
from a normal distribution, then measures of skewness and kurtosis
need to be considered. We discussed measures of variability:
Range, Variance and Standard Deviation.
A measure of dispersion gives an idea about the extent of lack
of uniformity in the sizes and qualities of the items in a series. It
helps us to know the degree of uniformity and consistency in the
series. If the difference between items is large the dispersion or
variation is large and vice versa.
The measures of dispersion can be either ‘absolute’ or ‘relative’.
Absolute measures of dispersion are expressed in the same units
in which the original data are expressed. For example, if the series
is expressed as Marks of the students in a particular subject; the
absolute dispersion will provide the value in Marks. The only
difficulty is that if two or more series are expressed in different
units, the series cannot be compared on the basis of dispersion.
The ‘Range’ of the data is the difference between the largest
value of data and smallest value of data. This is an absolute
measure of variability. However, if we have to compare two sets
of data, ‘Range’ may not give a true picture. In such case, relative
measure of range, called coefficient of range is used.
Inter-quartile range is a difference between upper quartile (third
quartile) and lower quartile (first quartile). Quartile Deviation is
the average of the difference between upper quartile and lower
quartile.
Average used for calculating deviation can be the mean, the
median or the mode. However, usually the mean is used. There
is also an advantage of taking deviations from the median,
because ‘Mean Deviation’ from median is lowest as compared to
any other ‘Mean Deviations’. Since absolute values of deviations
ignoring sign are taken for calculating Mean Deviation, the mean
deviation is not amenable to further algebraic treatment.
The variance is the average squared deviation of the data from
their mean. For sample data, we take the average by dividing
by (n – 1), where n is the sample size. This accounts for the degree
of freedom lost in using the sample mean. For population data, we
average by dividing by the population size N.
The Standard Deviation (SD) of a set of data is the positive square
root of the variance of the set. This is also referred to as the Root
Mean Square (RMS) value of the deviations of the data points. SD of
a sample is the square root of the sample variance.
There is no effect of shifting origin on standard deviation or
variance.
The measures of deviation are very effective in reports and
presentations made by business executives to present their data
to the general public, who may not understand statistical methods.
Variance analysis also helps in managing budgets by controlling
budgeted versus actual costs. Without the standard deviation,
you can’t compare two data sets effectively.
Variance: Variance is defined as the average of squared
deviation of data points from their mean.
Standard Deviation (SD): SD of a set of data is the positive
square root of the variance of the set. This is also referred as
Root Mean Square (RMS).
Mean Deviation: Mean Deviation (MD) is an average value of
absolute deviation of observations from the data mean (or the
median or the mode).
Coefficient of Variation: It is defined as the ratio of SD and
mean, multiplied by 100.
3. Scores of the two teams are given below:
Team A B
Average Score 53.30 45.30
S.D. of Scores 40.93 16.89
(a) Which team is better in average?
(b) Which team is more consistent?
4. Two workers on the same job show the following results over a long
period of time:
                                            Worker A    Worker B
Mean time of completion of job (in min)     30          25
Standard deviation (in min)                 6           4
(a) Which worker appears to be faster in completing the job?
(b) Which worker appears to be more consistent?
5. For a set of 100 items, the mean and SD are 60 and 6 respectively.
For another set of 200 items, the mean and SD are 63 and 4
respectively. Find the mean and SD of the combined group.
15. True
16. False
17. Variance
18. Standard Deviation
19. Karl Pearson
20. Smaller
dispersion in addition to an average.
2. Refer Section 4.2
There are certain pre-requisites or characteristics for a good
measure of dispersion:
(a) It should be simple to understand.
(b) It should be easy to compute.
5. Refer Section 4.5
Inter-quartile range is a difference between upper quartile (third
quartile) and lower quartile (first quartile). Quartile Deviation is
the average of the difference between upper quartile and lower
quartile.
6. Refer Section 4.5.3
Mean deviation is the arithmetic mean of the absolute deviations
of the values about their arithmetic mean or median or mode.
Mean Deviation (MD) is an average value of absolute deviation
of observations from the data mean (or the median or the mode).
It gives how spread/dispersed the data is. If x1, x2, …, xN are N
observations, then,
\[ \text{Mean Deviation } MD = \frac{\sum |d_i|}{N} = \frac{\sum |x_i - \text{Average}|}{N} \]
7. Refer Section 4.6
Variance is defined as the average of squared deviation of data
points from their mean.
When the data constitute a sample, the variance is denoted by
s_x², and averaging is done by dividing the sum of the squared
deviations from the mean by ‘n – 1’. (Note that one is reduced from
n because we lose one degree of freedom while using mean of the
sample.)
\[ \text{Sample Variance } Var(x) = s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1} \]
8. Refer Section 4.6.2
The Standard Deviation (SD) of a set of data is the positive square
root of the variance of the set. This is also referred to as the Root
Mean Square (RMS) value of the deviations of the data points. The SD
of a sample is the square root of the sample variance, denoted by s_x,
and the Standard Deviation of a population is the square root of
the variance of the population, denoted by σ.
9. Refer Section 4.6.2
Shifting of origin is achieved by adding or subtracting a constant
to all observations. In case of discrete data we add or subtract
(usually subtract) a constant to the individual observations.
Whereas, for grouped data we add or subtract (usually subtract)
the constant to the class mark values. There is no effect of shifting
origin on standard deviation or variance.
10. Refer Section 4.6.6
This was developed by Karl Pearson and is defined as the ratio of
SD to mean, multiplied by 100.
\[ CV = \frac{\sigma}{\mu} \times 100 \]
This is also called variability. A smaller value of CV indicates
greater stability and lesser variability.
E-REFERENCES
http://www.wyzant.com/
http://www.princeton.edu/
http://www.statistics.com/
CONTENTS
5.1 Introduction
5.2 Karl Pearson’s Coefficient of Skewness (SKp)
5.3 Bowley’s Coefficient of Skewness (SKB)
5.4 Kelly’s Coefficient of Skewness (Skk)
5.5 Measures of Kurtosis
5.6 Moments
5.6.1 Properties of Moments
5.6.2 Coefficients based on Moments
5.7 Summary
5.8 Descriptive Questions
5.9 Answers and Hints
5.10 Suggested Readings for Reference
INTRODUCTORY CASELET
KURTOSIS BY EXCEL™
distribution (that is, too tall) or a negative value indicates the
possibility of a platykurtic distribution (that is, too flat, or even
concave if the value is large enough). Values of 2 standard errors
of kurtosis (sek) or more (regardless of sign) probably differ from
mesokurtic to a significant degree.
The sek can be estimated roughly using the following formula (after
Tabachnick & Fidell, 1996): For example, let’s say you are using
Excel™ and calculate a kurtosis statistic of + 1.9142 for a particular
You can imagine how tall the distribution must look when it
is plotted out as a histogram: 20 points wide and hundreds of
students high. The decision that we are making is a four way
decision about the level of instruction that students should take:
remedial writing; regular writing with an extra lab tutorial; regular
writing; or honours writing. The problem that arises is that very
few points separate these four classifications and that hundreds of
students are on the borderline. So a wider distribution would help
us to spread the students out and make more responsible decisions
especially if the revisions resulted in a more reliable measure with
fewer students near each cut point.
5.1 INTRODUCTION
Measures of Skewness and Kurtosis, like measures of central tendency
and dispersion, study the characteristics of a frequency distribution.
Averages tell us about the central value of the distribution and
measures of dispersion tell us about the concentration of the items
around a central value. These measures do not reveal whether the
dispersal of values on either side of an average is symmetrical or
not. If observations are arranged in a symmetrical manner around
a measure of central tendency, we get a symmetrical distribution;
otherwise, they are arranged in an asymmetrical order, which gives an
asymmetrical distribution. When the distribution stretches more to
the right than it does to the left, the distribution is said to be ‘right
skewed’ or ‘positively skewed’. Similarly, a left-skewed distribution is
one that stretches asymmetrically to the left.
Nature of Skewness
Skewness can be positive or negative or zero.
When the values of mean, median and mode are equal, there is no
skewness.
When mean > median > mode, skewness will be positive.
When mean < median < mode, skewness will be negative.
Characteristic of a Good Measure of Skewness
It should be a pure number in the sense that its value should be
independent of the unit of the series and also degree of variation
in the series.
It should have zero-value, when the distribution is symmetrical.
It should have a meaningful scale of measurement so that we
could easily interpret the measured value.
Mathematical measures of skewness can be calculated by:
Karl-Pearson’s Method
Bowley’s Method
Kelly’s method
Skewness could be measured either in absolute terms as ‘mean
minus mode’ or in relative terms. When the skewness is presented
in absolute terms, i.e., in units, it is absolute skewness. If the value of
skewness is obtained in ratios or percentages, it is called relative or
coefficient of skewness.
When skewness is measured in absolute terms, we can compare one
distribution with the other if the units of measurement are same.
When it is presented in ratios or percentages, comparison become
Where X̄ = the mean, Mo = the mode and s = the standard deviation for
the sample.
It is generally used when you don’t know the mode.
Coefficient of skewness generally lies within ± 3.
Example: Calculate the Karl Pearson’s coefficient of skewness from
the following data:
Size: 1 2 3 4 5 6 7
Frequency: 10 18 30 25 12 3 2
Solution:
To calculate Karl Pearson’s coefficient of skewness, we first
find X̄, Mo and s from the given distribution.
Size (X)    Frequency (f)    d = X − 4    fd      fd²
1           10               −3           −30     90
2           18               −2           −36     72
3           30               −1           −30     30
4           25               0            0       0
5           12               1            12      12
6           3                2            6       12
7           2                3            6       18
Total       100                           −72     234
\[ \bar{X} = A + \frac{\sum fd}{N} = 4 + \frac{-72}{100} = 3.28 \]
\[ s = \sqrt{\frac{\sum fd^2}{N} - \left( \frac{\sum fd}{N} \right)^2} = \sqrt{\frac{234}{100} - \left( \frac{-72}{100} \right)^2} = 1.35 \]
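The quantities needed for Karl Pearson’s coefficient can be recomputed in Python. This sketch assumes the mode-based form of the coefficient, (mean − mode)/standard deviation, and takes the mode as the size with the highest frequency; the variable names are illustrative only.

```python
from math import sqrt

sizes = [1, 2, 3, 4, 5, 6, 7]
freqs = [10, 18, 30, 25, 12, 3, 2]

n = sum(freqs)
mean = sum(f * x for x, f in zip(sizes, freqs)) / n
sd = sqrt(sum(f * (x - mean) ** 2 for x, f in zip(sizes, freqs)) / n)

# For a discrete series the mode is the value with the largest frequency
mode = sizes[freqs.index(max(freqs))]

# Assumed form: Karl Pearson's coefficient of skewness based on the mode
sk_p = (mean - mode) / sd

print(round(mean, 2), mode, round(sd, 2), round(sk_p, 2))
```

A positive value indicates that the distribution stretches more to the right than to the left.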
Solution: Calculation of X̄, σ and Mo
Class Intervals    Frequency (f)    Mid-Points (X)    u = (X − 135)/10    fu      fu²
90-100             4                95                −4                  −16     64
100-110            10               105               −3                  −30     90
110-120            17               115               −2                  −34     68
120-130            22               125               −1                  −22     22
130-140            30               135               0                   0       0
140-150            23               145               1                   23      23
150-160            16               155               2                   32      64
160-170            5                165               3                   15      45
170-180            3                175               4                   12      48
Total              130                                                    −20     424
\[ 1. \quad \bar{X} = A + h \frac{\sum fu}{N} = 135 + 10 \times \frac{-20}{130} = 133.46 \]
\[ 2. \quad \sigma = h \times \sqrt{\frac{\sum fu^2}{N} - \left( \frac{\sum fu}{N} \right)^2} = 10 \times \sqrt{\frac{424}{130} - \left( \frac{-20}{130} \right)^2} = 18.0 \]
\[ 3. \quad Mo = L_m + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times h \]
By inspection, the modal class is 130-140.
∴ Lm = 130, Δ1 = 30 – 22 = 8, Δ2 = 30 – 23 = 7 and h = 10
\[ \text{Thus, } Mo = 130 + \frac{8}{15} \times 10 = 135.33 \]
1. Since N/2 = 50, the median class is 15 - 20.
\[ \text{Thus, } L_m = 15, \; f_m = 13, \; C = 37, \; h = 5, \text{ hence } M_d = 15 + \frac{50 - 37}{13} \times 5 = 20 \]
2. Since N/4 = 25, the first quartile class is 10 - 15.
\[ \text{Thus, } L_{Q_1} = 10, \; f_{Q_1} = 20, \; C = 17, \; h = 5, \text{ hence } Q_1 = 10 + \frac{25 - 17}{20} \times 5 = 12 \]
3. Since 3N/4 = 75, the third quartile class is 25 - 30.
\[ \text{Thus, } L_{Q_3} = 25, \; f_{Q_3} = 10, \; C = 67, \; h = 5, \text{ hence } Q_3 = 25 + \frac{75 - 67}{10} \times 5 = 29 \]
∴ Bowley’s Coefficient of Skewness = 0.06
Thus, the distribution is approximately symmetrical.
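The final step can be sketched in Python. The formula assumed here for Bowley’s coefficient is (Q3 + Q1 − 2 × Median)/(Q3 − Q1), which reproduces the value 0.06 from Q1 = 12, Md = 20 and Q3 = 29. Kelly’s coefficient is shown in the analogous percentile form, (P90 + P10 − 2 × P50)/(P90 − P10); the percentile values passed to it are hypothetical.

```python
def bowley_skewness(q1, median, q3):
    """Quartile-based skewness: (Q3 + Q1 - 2 * Median) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)


def kelly_skewness(p10, p50, p90):
    """Percentile-based skewness: (P90 + P10 - 2 * P50) / (P90 - P10)."""
    return (p90 + p10 - 2 * p50) / (p90 - p10)


# Quartiles computed in the example above
sk_b = bowley_skewness(q1=12, median=20, q3=29)

# Hypothetical percentiles, used only to show the call
sk_k = kelly_skewness(p10=1100, p50=1500, p90=2100)

print(round(sk_b, 2), round(sk_k, 2))
```

Both measures lie between −1 and +1 and equal zero when the quartiles (or percentiles) are equidistant from the median.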
Fill in the blanks:
6. ................... method of skewness is based on the values of
median, lower and upper quartiles.
7. Bowley’s method is also used in case of ‘open-end series’,
Where, P is percentile.
Example: Calculate the Kelly’s coefficient of skewness from the
following data:
1. Since 10N/100 = (10 × 500)/100 = 50, P10 lies in the interval 1000 - 1100.
\[ \text{Thus, } L_{P_{10}} = 1000, \; C = 43, \; f_{P_{10}} = 47, \; h = 100 \]
Skewness is also defined in terms of the moments about the mean. One
such measure is defined as:
\[ \text{Skewness} = \frac{\sum \left( \frac{x_i - \mu}{\sigma} \right)^3}{N} \]
Lorenz Curve: This is a special type of graph, which is designed
to show how much a certain distribution varies from a completely
uniform distribution. It is a cumulative percentage curve
comparing the population and factor under study. For example,
we could plot a graph of percentage of population and percentage
of their wealth. Lorenz curve is very useful for comparing two
populations, particularly when their means and SDs are the same.
8. Skewness is also defined in term of the moment about
................... .
IM
9. ................... curve is a special type of graph, which is designed to
show how much a certain distribution varies from a completely
uniform distribution.
10. Lorenz curve is very useful for comparing two populations
particularly when their ................... and ................... are same.
Class     f      Mid-value (X)    u     fu     fu²     fu⁴
20-30     40     25               0     0      0       0
30-40     20     35               1     20     20      20
40-50     10     45               2     20     40      160
Total     100                           0      120     360
Since ∑fu = 0, ∴ X̄ = 25 and the calculated moments will be central.
\[ \mu_2 = h^2 \frac{\sum fu^2}{N} = 100 \times \frac{120}{100} = 120 \]
\[ \text{and } \mu_4 = h^4 \frac{\sum fu^4}{N} = 10000 \times \frac{360}{100} = 36000 \]
\[ \text{Thus, measure of kurtosis } \beta_2 = \frac{\mu_4}{\mu_2^2} = \frac{36000}{14400} = 2.5 \]
Since this value is less than 3, the distribution is platykurtic.
The standard deviation σ = √120 = 10.95
Example: The first four central moments of a distribution are 0,
2.5, 0.7 and 18.75. Calculate the moment measures of skewness and
kurtosis of the distribution and comment upon the results.
Solution: The moment measures of skewness and kurtosis are given
by
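\[ \mu_1' = \frac{\sum fd}{N} = \frac{50}{100} = 0.5, \qquad \mu_2' = \frac{\sum fd^2}{N} = \frac{1970}{100} = 19.7 \]
\[ \mu_3' = \frac{\sum fd^3}{N} = \frac{2948}{100} = 29.48, \qquad \mu_4' = \frac{\sum fd^4}{N} = \frac{86752}{100} = 867.52 \]
\[ \mu_2 = \mu_2' - \mu_1'^2 = 19.7 - 0.5^2 = 19.45 \]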
distribution is platykurtic.
if the concentration at the centre is comparatively less, the curve
becomes ‘PLATYKURTIC’.
5.6 MOMENTS
One important concept of measuring the frequency distribution is
moments. It can be visualized as rotational effect of a force.
The concept of moments has crept into the statistical literature from
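\[ \text{First Moment } \mu_1 = \frac{\sum f_i \times (x_i - \mu)}{N} \]
\[ \text{Second Moment } \mu_2 = \frac{\sum f_i \times (x_i - \mu)^2}{N} \]
\[ \text{Third Moment } \mu_3 = \frac{\sum f_i \times (x_i - \mu)^3}{N} \]
\[ \text{Fourth Moment } \mu_4 = \frac{\sum f_i \times (x_i - \mu)^4}{N} \]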
5.6.1 PROPERTIES OF MOMENTS
First moment about mean is always zero, i.e. $\mu_1 = 0$
Second moment about mean is the variance: $\mu_2 = \sigma^2 = \text{Var}$
Third moment can be used as a measure of skewness. Karl Pearson
has suggested a different measure of skewness as $\beta_1 = \dfrac{\mu_3^2}{\mu_2^3}$
Thus: If $\mu_3 > 0$ ⇒ Distribution is positively skewed.
If $\mu_3 < 0$ ⇒ Distribution is negatively skewed.
If $\mu_3 = 0$ ⇒ Distribution is symmetric.
Fourth moment can be used as a measure of kurtosis. Karl Pearson
gave the coefficient as
$\beta_2 = \dfrac{\mu_4}{\mu_2^2}$
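Alpha Coefficients
It is defined as:
$\alpha_i = \dfrac{\mu_i}{\sigma^i}$, where i = 1, 2, 3, 4
Note that $\alpha_1 = 0$, $\alpha_2 = 1$, $\alpha_3 = \mu_3 / \sigma^3$ and $\alpha_4 = \mu_4 / \sigma^4$
Beta Coefficients
It is defined as:
$\beta_1 = \alpha_3^2 = \dfrac{\mu_3^2}{\mu_2^3}$
$\beta_2 = \alpha_4 = \dfrac{\mu_4}{\mu_2^2}$
Gamma Coefficient
It is defined as:
$\gamma_1 = \alpha_3$
$\gamma_2 = \beta_2 - 3$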
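The alpha, beta and gamma coefficients can be checked numerically. The following is a minimal sketch, assuming the central moments are already known; the function name `moment_coefficients` is illustrative, and the input values 2.5, 0.7 and 18.75 are the second, third and fourth central moments quoted in the earlier example of this section.

```python
import math


def moment_coefficients(mu2, mu3, mu4):
    """Return alpha, beta and gamma coefficients from central moments."""
    sigma = math.sqrt(mu2)        # mu2 is the variance, so sigma is its root
    alpha3 = mu3 / sigma ** 3     # standardised third moment
    alpha4 = mu4 / sigma ** 4     # standardised fourth moment
    beta1 = mu3 ** 2 / mu2 ** 3   # equals alpha3 squared
    beta2 = mu4 / mu2 ** 2        # equals alpha4
    return {
        "alpha3": alpha3,
        "alpha4": alpha4,
        "beta1": beta1,
        "beta2": beta2,
        "gamma1": alpha3,
        "gamma2": beta2 - 3,
    }


coeffs = moment_coefficients(2.5, 0.7, 18.75)
for name, value in coeffs.items():
    print(f"{name} = {value:.5f}")
```

A small positive beta1 points to slight positive skewness, and a gamma2 close to zero corresponds to a mesokurtic shape.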
Example: Calculate the first four moments about 30 for the following
distribution and convert them into central moments.
90
0
1200
900
0
−12000
0
9000
120000
0
90000
45-55 6 50 20 120 2400 48000 960000
Total 50 − 70 7700 −19000 2450000
$\therefore \mu_1' = \dfrac{-70}{50} = -1.40$, $\mu_2' = \dfrac{7700}{50} = 154$, $\mu_3' = \dfrac{-19000}{50} = -380$,
$\mu_4' = \dfrac{24{,}50{,}000}{50} = 49{,}000$
Conversion into central moments
$\mu_1 = 0$
$\mu_2 = \mu_2' - \mu_1'^2 = 154 - (-1.4)^2 = 152.04$
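The conversion from moments about an arbitrary point to central moments can be sketched in code. This is an illustrative sketch, not the book's own working: it uses the standard conversion identities for the third and fourth moments, and the function name `raw_to_central` is an assumption. The inputs are the four moments about 30 computed above.

```python
def raw_to_central(m1, m2, m3, m4):
    """Convert moments about an arbitrary point into central moments."""
    mu1 = 0.0                                   # first central moment is always zero
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4
    return mu1, mu2, mu3, mu4


# Moments about 30 from the example: -1.40, 154, -380 and 49,000
central = raw_to_central(-1.4, 154, -380, 49000)
for order, value in enumerate(central, start=1):
    print(f"mu{order} = {value:.4f}")
```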
Find mean of the distribution and calculate the first four moments
about mean and also the first four moments about origin.
16. The arithmetic mean of various powers of these deviations
in any distribution is called the ................... of the distribution
about mean.
17. ................... moment about mean is the variance.
18. First moment about mean is always ................... .
19. ................... moment can be used as a measure of skewness.
20. ................... moment can be used as a measure of kurtosis.
5.7 SUMMARY
Measures of Skewness and Kurtosis, like measures of central
tendency and dispersion, study the characteristics of a
frequency distribution. Averages tell us about the central value
of the distribution and measures of dispersion tell us about the
concentration of the items around a central value.
When two or more symmetrical distributions are compared, the
difference in them is studied with ‘Kurtosis’. On the other hand,
when two or more asymmetrical distributions are compared, they
will give different degrees of Skewness. These measures are
mutually exclusive i.e. the presence of skewness implies absence
of kurtosis and vice-versa.
Bowley’s method of skewness is based on the values of median,
lower and upper quartiles. This method suffers from the same
limitations which are in the case of median and quartiles.
Wherever positional measures are given, skewness should be
measured by Bowley’s method. This method is also used in case
of ‘open-end series’, where the importance of extreme values is
ignored.
Kelly’s coefficient of skewness is defined as:
$Sk_k = \dfrac{(P_{90} + P_{10}) - 2 \times Md}{(P_{90} - P_{10})}$
Where, P is percentile.
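Both positional measures share the same structure, which the following sketch makes explicit. It assumes the quartiles or percentiles have already been located. The function names are illustrative, the Bowley inputs are the quartile values quoted in the chapter's practice answers (Q1 = 395, Md = 557.5, Q3 = 725.333), and the Kelly inputs are hypothetical values chosen only for demonstration.

```python
def bowley_skewness(q1, median, q3):
    """Bowley's coefficient: (Q3 + Q1 - 2*Md) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)


def kelly_skewness(p10, median, p90):
    """Kelly's coefficient: (P90 + P10 - 2*Md) / (P90 - P10)."""
    return (p90 + p10 - 2 * median) / (p90 - p10)


sk_b = bowley_skewness(395, 557.5, 725.333)
sk_k = kelly_skewness(10, 20, 40)   # hypothetical percentiles
print(f"Bowley: {sk_b:.3f}")
print(f"Kelly:  {sk_k:.3f}")
```

Both coefficients are pure numbers bounded by -1 and +1, because the numerator can never exceed the spread in the denominator.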
Kurtosis is a measure of peaked-ness of distribution. Larger the
kurtosis, more and more peaked will be the distribution. The
kurtosis is calculated either as an absolute or a relative value.
Absolute kurtosis is always a positive number. Absolute kurtosis
of a normal distribution (symmetric bell shaped distribution) is
taken as 3. It is taken as datum to calculate relative kurtosis as
follows:
Absolute kurtosis $= \dfrac{\sum \left( \dfrac{x_i - \mu}{\sigma} \right)^4}{N}$
Relative kurtosis = Absolute kurtosis – 3
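As a rough illustration of the two definitions, the sketch below computes absolute and relative kurtosis for ungrouped data. The data values and the function name `absolute_kurtosis` are hypothetical and are not taken from the text; the population standard deviation (divisor N) is assumed.

```python
import math


def absolute_kurtosis(values):
    """Mean of the fourth power of the standardised deviations."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sum(((x - mean) / sigma) ** 4 for x in values) / n


data = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical observations
abs_k = absolute_kurtosis(data)
rel_k = abs_k - 3                         # datum of 3 for the normal curve

if rel_k > 0:
    shape = "leptokurtic"
elif rel_k < 0:
    shape = "platykurtic"
else:
    shape = "mesokurtic"
print(f"absolute = {abs_k:.5f}, relative = {rel_k:.5f}, {shape}")
```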
Moments about mean are generally used in statistics. We use
the Greek letter μ (read as mu) for these moments. Consider a
mass attached at each point proportional to its frequency and
take moments about the mean. First, second, third and fourth
moments can be used as a measure of Central Tendency, Variation
(dispersion), asymmetry and peakedness of the curve.
Leptokurtic: A positive kurtosis means more peaked curve,
called Leptokurtic.
Mesokurtic: Peakedness of normal distribution is called
Mesokurtic.
Kurtosis: When two or more symmetrical distributions are
compared, the difference in them is studied with Kurtosis.
Coefficient of Kurtosis: It is a measure of the relative
peakedness of the top of a frequency curve.
5.
What are its key features?
How do you calculate Karl Pearson’s coefficient of skewness?
4. The first four moments from mean of a distribution are 0, 3.2, 3.6
and 20. The mean value is 11. Calculate the first four moments
about zero and about 10.
5. Compute the moment measure of skewness from the following
distribution.
Karl Pearson’s Coefficient of Skewness (Skp)
1. Skewness
2. Kurtosis
3. Positive
4. Negative
5. Positively
Bowley’s Coefficient of Skewness (Skb)
6. Bowley’s
7. Extreme
Kelly’s Coefficient of Skewness (Skk)
8. Mean
9. Lorenz
10. Means, SD
Measures of Kurtosis
11. Kurtosis
12. Absolute
13. Platykurtic
14. Leptokurtic
15. Mesokurtic
Moments
16. Moments
17. Second
18. Zero
19. Third
20. Fourth
the value of mean, median and mode are exactly equal. On the
other hand, in an asymmetrical distribution, the values of mean,
median and mode are not equal.
2. Refer Section 5.1
Skewness can be positive or negative or zero.
(a) When the values of mean, median and mode are equal, there
is no skewness.
(b) When mean > median > mode, skewness will be positive.
(c) When mean < median < mode, skewness will be negative.
3. Refer Section 5.1
It should be a pure number in the sense that its value should be
independent of the unit of the series and also degree of variation
in the series.
4. Refer Section 5.2
Karl Pearson has suggested two formulae:
(a) Where the relationship of mean and mode is established;
(b) Where the relationship between mean and median is not
established.
(c) Coefficient of skewness, $SK_P = \dfrac{\text{mean} - \text{mode}}{\text{S.D.}}$
$SK_B = \dfrac{(Q_3 + Q_1) - 2 \times Md}{(Q_3 - Q_1)}$
Where, Q is quartile.
6. Refer Section 5.4
Kelly’s coefficient of skewness is defined as:
$Sk_k = \dfrac{(P_{90} + P_{10}) - 2 \times Md}{(P_{90} - P_{10})}$
Where, P is percentile
7. Refer Section 5.5
Kurtosis is a measure of peakedness of distribution. Larger the
kurtosis, more and more peaked will be the distribution. The
kurtosis is calculated either as an absolute or a relative value.
Absolute kurtosis is always a positive number.
8. Refer Section 5.5
Negative kurtosis indicates a flatter distribution than the normal
distribution, and called as platykurtic. A positive kurtosis means
more peaked curve, called Leptokurtic. Peakedness of normal
distribution is called Mesokurtic.
9. Refer Section 5.6
The arithmetic mean of various powers of these deviations in
any distribution is called the moments of the distribution about
mean. Moments about mean are generally used in statistics.
10. Refer Section 5.6
There are a few useful coefficients based on the moments. These
are Alpha Coefficients, Beta Coefficients and Gamma Coefficients.
They are non-dimensional numbers and hence useful for comparison
of distributions of data. β coefficients are used for
calculating mode, skewness and kurtosis, whereas γ1 and γ2
are used to measure skewness and kurtosis.
ANSWERS FOR EXERCISE FOR PRACTICE
1. $Q_1 = 395$, $Q_3 = 725.333$, $Md = 557.5$, $SK_B = \dfrac{Q_3 + Q_1 - 2Md}{Q_3 - Q_1} = 0.016$
P90 − P10
3. $\bar{X} = 41.7$, $Md = 42.14$, $\sigma = 15.43$, $SK_P = \dfrac{3(\bar{X} - Md)}{\sigma} = -0.086$
4. The first four moments about 10 are 1, 4.2, 14.2 and 54.6
5. β1 = 0.02249
D P Apte, Statistical Tools for Managers using MS Excel, Excel
Books, 2009
Bierman H., Bonini C.P., and Hausman W.H., Quantitative
Analysis for Business Decisions, Homewood, Illinois, Richard D.
Irwin, Inc., 1973.
Gordon, G., and Pressman I., Quantitative Decision Making for
Business, New Delhi, National Publishing House, 1983.
E-REFERENCES
www.math.uah.edu/stat/expect/Skew.html
http://www.itl.nist.gov/
http://www.real-statistics.com/
CORRELATION ANALYSIS
CONTENTS
6.1 Introduction
6.2 Types of Correlation
6.2.1 Positive or Negative Correlation
6.2.2 Simple or Multiple Correlations
6.2.3 Partial or Total Correlation
6.2.4 Linear and Non-linear Correlation
6.3 Methods of Calculating Correlation
6.4 Scatter Diagram Method
INTRODUCTORY CASELET
The correlation between the Sensex and the rupee has been drifting
away from its historical averages, following RBI’s interventions
in the currency market. The central bank has been intervening
in the forex market in order to cap the significant upside in the
rupee as well as to build forex reserves. The 120-day correlation
between the Sensex and the rupee has fallen to a negative point of
0.36. Interestingly, such correlation levels were not seen before the
global financial crisis in September 2008.
6.1 INTRODUCTION
We often encounter situations where data appear as pairs of
figures relating to two variables, for example, price and demand of a
commodity, money supply and inflation, industrial growth and GDP,
advertising expenditure and market share, etc. Examples of correlation
problems are found in the study of the relationship between IQ and
aggregate percentage marks obtained in a mathematics examination, or
blood pressure and metabolism. In these examples, both variables are
observed as they naturally occur, since neither variable can be fixed
at predetermined levels.
These are some of the important definitions about correlation.
variables on a third variable. In some cases there may not be any
cause-effect relationship at all. Therefore, if we do not consider and
study the underlying economic or physical relationship, correlation
may sometimes give absurd results. For example, take a case of global
average temperature and Indian population. Both are increasing over
past 50 years but obviously not related.
Correlation is an analysis of the degree to which two or more variables
fluctuate with reference to each other. Correlation is expressed by a
coefficient ranging between –1 and +1. A positive (+ve) sign indicates
movement of the variables in the same direction, e.g. the amount of
fertilizer used on a farm and the yield show a positive relationship
within technological limits. A negative (–ve) coefficient
indicates movement of the variables in opposite directions, i.e.
when one variable decreases, the other increases, e.g. the price
and demand of a commodity have an inverse relationship. Absence of
linear correlation is indicated if the coefficient is close to zero. A value of the
coefficient close to ±1 denotes a very strong linear relationship.
The study of correlation helps managers in following ways:
To identify relationship of various factors and decision variables.
To estimate value of one variable for a given value of other if both
are correlated. E.g. estimating sales for a given advertising and
promotion expenditure.
us that one variable is independent and other dependent on it. E.g.
surface temperature of the Pacific Ocean (El Niño) affects monsoons
in India but monsoons do not affect temperatures of the Pacific Ocean.
Thirdly, in some cases both variables under study may be fluctuating
together due to a variation in a third variable. Thus both variables
under correlation analysis may be dependent variables and hence
not causally related to each other. In such a case, the manager cannot vary one of
them and expect the other variable to vary. For example, correlation in
increase in share prices and stronger rupee against dollar may be due
to increase in Foreign Direct Investment (FDI). In this case expecting
to control falling share prices through selling dollars by the Reserve
Bank is incorrect. To control these two variables we need to control
FDI. Further, if the falling share prices are due to market sentiments or
overheated market, controlling FDI may not help. Thus, the manager
needs to analyze the problem in business environment before he/she
can apply the correlation analysis in decision-making.
In managerial decision-making, it is a good practice to draw the scatter
diagram first, and then study the logical relationship to identify the
type of correlation and the cause effect relation. Only then manager
should calculate the coefficient of correlation for further mathematical
analysis. Types of correlation that need to be differentiated before
using the correlation coefficient for managerial decision-making are
given below.
in the value of other variable also.
Negative or inverse correlation refers to the movement of the
variables in opposite direction. Correlation is said to be negative, if
an increase (decrease) in the value of one variable is accompanied
by a decrease (increase) in the value of other.
6.2.3 PARTIAL OR TOTAL CORRELATION
In case of multiple correlation analysis there are two approaches to
study the correlation. In case of partial correlation, we study variation
of two variables and excluding the effects of other variables by keeping
them under controlled condition. In case of ‘total correlation’ study we
allow all relevant variables to vary with respect to each other and find
the combined effect. With few variables, it is feasible to study ‘total
correlation’. As number of variables increase, it becomes impractical
to study the ‘total correlation’. For example, coefficient of correlation
between yield of wheat and chemical fertilizers excluding the effects of
pesticides and manures is called partial correlation. Total correlation
is based upon all the variables.
When the amount of change in one variable tends to keep a
constant ratio to the amount of change in the other variable, then
the correlation is said to be linear.
But if the amount of change in one variable does not bear a
constant ratio to the amount of change in the other variable then
5. Correlation is said to be ..................., if an increase (decrease)
in the value of one variable is accompanied by a decrease
(increase) in the value of other.
6. When the amount of change in one variable tends to keep a
constant ratio to the amount of change in the other variable,
then the correlation is said to be ................... .
7. In case of ................... correlation the rate of variation changes
as values increase or decrease.
Scatter diagram not only tells us about linearity or nonlinearity but
also whether the data is cyclic. When values of two variables have a
constant rate of change it is linear correlation.
6.3 METHODS OF CALCULATING CORRELATION
Simple linear correlation is a statistical tool applied in many business
problem solving. How will you find the correlation of your scores in
different subjects and interpret which was your strongest subject?
6.4 SCATTER DIAGRAM METHOD
Scatter diagram is the most fundamental graph plotted to show
relationship between two variables. It is a simple way to represent
bivariate distribution. Bivariate distribution is the distribution of two
random variables. The two variables are plotted one against each of the X
and Y axes. Thus, every data pair (xi, yi) is represented by a point on
the graph, x being the abscissa and y being the ordinate of the point. From
a scatter diagram we can find if there is any relationship between
x and y, and if yes, what type of relationship. The scatter diagram thus
indicates the nature and strength of the correlation.
diagram. The way the dots scatter gives an indication of the kind of
relationship which exists between the two variables. While drawing a
scatter diagram, it is not necessary to take the zero values of the X and Y
variables at the point of origin; the minimum values of the variables
considered may be taken instead.
When there is a positive correlation between the variables, the dots
on the scatter diagram run from left hand bottom to the right hand
upper corner. In case of perfect positive correlation all the dots will lie
on a straight line.
When a negative correlation exists between the variables, dots on the
scatter diagram run from the upper left hand corner to the bottom
right hand corner. In case of perfect negative correlation, all the dots
lie on a straight line.
If a scatter diagram is drawn and no path is formed, there is no
correlation.
Example: Figures on advertisement expenditure (X) and Sales (Y) of
a firm for the last ten years are given below. Draw a scatter diagram.
Advertisement cost in ’000 ₹   40 65 60 90 85 75 35 90 34 76
Sales in Lakh ₹                45 56 58 82 65 70 64 85 50 85
Solution:
[Scatter diagram: Sales in Lakh ₹ (vertical axis) plotted against Advertisement cost in ’000 ₹ (horizontal axis)]
Income (X) (₹)        100 110 113 120 125 130 130 140
Expenditure (Y) (₹)    85  90  91 100 110 125 125 130
Solution:
[Scatter Diagram: Expenditure (Y) (₹) (vertical axis) plotted against Income (X) (₹) (horizontal axis)]
Fill in the blanks:
10. Scatter diagram is the most fundamental graph plotted to
show relationship between ................... variables.
$r = \dfrac{\dfrac{1}{n} \sum (X - \bar{X})(Y - \bar{Y})}{\sigma_X \sigma_Y}$    (1)
Where r is the ‘Correlation Coefficient’ or ‘Product Moment Correlation
Coefficient’ between X and Y. $\sigma_X$ and $\sigma_Y$ are the standard deviations
of X and Y respectively. ‘n’ is the number of pairs of values of X
and Y in the given data. The expression $\dfrac{1}{n} \sum (X - \bar{X})(Y - \bar{Y})$ is known
as the covariance between the variables X and Y. It is denoted as Cov
(x, y). The Correlation Coefficient r is a dimensionless number whose
value lies between +1 and –1. Positive values of r indicate positive (or
direct) correlation between the two variables X and Y i.e. both X and
Y increase or decrease together. Negative values of r indicate negative
(or inverse) correlation, meaning that an increase in one
variable X or Y is accompanied by a decrease in the value of the other variable.
A zero correlation means that there is no linear association between the two
variables.
The formula can be modified as,
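$r = \dfrac{\dfrac{1}{n} \sum (X - \bar{X})(Y - \bar{Y})}{\sigma_X \sigma_Y} = \dfrac{\dfrac{1}{n} \sum (XY - \bar{X}Y - X\bar{Y} + \bar{X}\bar{Y})}{\sigma_X \sigma_Y}$
$= \dfrac{\dfrac{\sum XY}{n} - \dfrac{\sum X}{n} \times \dfrac{\sum Y}{n}}{\sqrt{\dfrac{\sum X^2}{n} - \left(\dfrac{\sum X}{n}\right)^2} \sqrt{\dfrac{\sum Y^2}{n} - \left(\dfrac{\sum Y}{n}\right)^2}}$    (2)
$= \dfrac{E[XY] - E[X]E[Y]}{\sqrt{E[X^2] - (E[X])^2} \sqrt{E[Y^2] - (E[Y])^2}}$    (3)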
Equations (2) and (3) are alternate forms of equation (1). These have
advantage that we don’t have to subtract each value from the mean.
Example: The data of advertisement expenditure (X) and sales (Y)
of a company for past 10 year period is given below. Determine the
correlation coefficient between these variables and comment on the
correlation.
X 50 50 50 40 30 20 20 15 10 5
Y 700 650 600 500 450 400 300 250 210 200
Solution:
$\bar{X} = \dfrac{\sum X}{n} = \dfrac{290}{10} = 29$, $\bar{Y} = \dfrac{\sum Y}{n} = \dfrac{4260}{10} = 426$
S.No.   X   Y   x = (X − X̄)   y = (Y − Ȳ)   x²   y²   xy
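Now, $r = \dfrac{\dfrac{1}{n} \sum (X - \bar{X})(Y - \bar{Y})}{\sigma_X \sigma_Y} = \dfrac{\dfrac{1}{n} \sum xy}{\sqrt{\dfrac{\sum x^2}{n}} \sqrt{\dfrac{\sum y^2}{n}}} = \dfrac{\sum xy}{\sqrt{\sum x^2 \sum y^2}}$
$r = \dfrac{28310}{\sqrt{2740 \times 306840}} = 0.976$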
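The same calculation can be reproduced programmatically. The sketch below is one possible implementation of the deviation form of the formula, using the advertisement and sales figures from this example; the function name `pearson_r` is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Product moment correlation using deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


X = [50, 50, 50, 40, 30, 20, 20, 15, 10, 5]               # advertisement expenditure
Y = [700, 650, 600, 500, 450, 400, 300, 250, 210, 200]    # sales
r = pearson_r(X, Y)
print(f"r = {r:.3f}")
```

The factor 1/n cancels between numerator and denominator, which is why the function works with plain sums of deviations.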
Where a, b, g and h are constants.
In this case, we have defined variables U and V through shift of origin
from (0, 0) to (a, b) and change the X and Y scale by factors ‘g’ and
‘h’ respectively. Thus for every observation pair (xi, yi) there is a
corresponding pair ( ui, vi) such that,
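$u_i = \dfrac{x_i - a}{g}$ and $v_i = \dfrac{y_i - b}{h}$
Now, $\bar{X} = \dfrac{\sum x_i}{n} = \dfrac{\sum (g \times u_i + a)}{n} = \dfrac{g \times \sum u_i + n \times a}{n} = g\bar{U} + a$
Similarly, $\bar{Y} = h\bar{V} + b$
Now, $x_i - \bar{X} = (g \times u_i + a) - (g\bar{U} + a) = g(u_i - \bar{U})$
And $y_i - \bar{Y} = h(v_i - \bar{V})$
Hence, $\sigma_X^2 = \dfrac{\sum (x_i - \bar{X})^2}{n} = g^2 \times \dfrac{\sum (u_i - \bar{U})^2}{n} = g^2 \sigma_U^2$
And $\sigma_Y^2 = h^2 \sigma_V^2$
Now, $r_{XY} = \dfrac{\dfrac{1}{n} \sum (x_i - \bar{X})(y_i - \bar{Y})}{\sigma_X \sigma_Y} = \dfrac{\sum g \times (u_i - \bar{U}) \times h \times (v_i - \bar{V})}{n \times (g \times \sigma_U)(h \times \sigma_V)}$
$= \dfrac{\dfrac{1}{n} \sum (u_i - \bar{U})(v_i - \bar{V})}{\sigma_U \sigma_V}$
$= r_{UV}$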
This result is very useful for manual calculations. We can select
arbitrary constants a, b, g and h (with g and h positive) so as to simplify the data and then
find rUV, which gives the result rXY. Thus, if any constant is added to or
subtracted from the variables, or the variables are multiplied or divided by
any positive constant, the correlation coefficient between these two variables
does not change.
Example: The data of advertisement expenditure (X) and sales (Y)
of a company for past 10 year period is given below. Determine the
correlation coefficient between these variables and comment on the
correlation.
X 50 50 50 40 30 20 20 15 10 5
Y 700 650 600 500 450 400 300 250 210 200
Solution: We shall take U to be the deviation of X values from the
assumed mean of 30 divided by 5. Similarly, V represents the deviation
of Y values from the assumed mean of 400 divided by 10.
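Short cut procedure for calculation of correlation coefficient
$r = \dfrac{\sum_{i=1}^{n} u_i v_i - \dfrac{1}{n} \sum_{i=1}^{n} u_i \sum_{i=1}^{n} v_i}{\sqrt{\sum_{i=1}^{n} u_i^2 - \dfrac{1}{n} \left(\sum_{i=1}^{n} u_i\right)^2} \sqrt{\sum_{i=1}^{n} v_i^2 - \dfrac{1}{n} \left(\sum_{i=1}^{n} v_i\right)^2}}$
$= \dfrac{561 - \dfrac{(-2)(26)}{10}}{\sqrt{110 - \dfrac{(-2)^2}{10}} \sqrt{3136 - \dfrac{(26)^2}{10}}} = \dfrac{561 + 5.2}{\sqrt{109.6} \sqrt{3068.4}} = 0.976$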
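A short sketch can confirm that the change of origin and scale leaves r unchanged. It applies the short cut formula both to the coded values U = (X − 30)/5 and V = (Y − 400)/10 and to the original data; the function name `shortcut_r` is illustrative.

```python
import math


def shortcut_r(us, vs):
    """Correlation from raw sums, without subtracting the means first."""
    n = len(us)
    sum_u, sum_v = sum(us), sum(vs)
    sum_uv = sum(u * v for u, v in zip(us, vs))
    sum_u2 = sum(u * u for u in us)
    sum_v2 = sum(v * v for v in vs)
    numerator = sum_uv - sum_u * sum_v / n
    denominator = math.sqrt((sum_u2 - sum_u ** 2 / n) * (sum_v2 - sum_v ** 2 / n))
    return numerator / denominator


X = [50, 50, 50, 40, 30, 20, 20, 15, 10, 5]
Y = [700, 650, 600, 500, 450, 400, 300, 250, 210, 200]
U = [(x - 30) / 5 for x in X]      # assumed mean 30, scale 5
V = [(y - 400) / 10 for y in Y]    # assumed mean 400, scale 10

r_uv = shortcut_r(U, V)
r_xy = shortcut_r(X, Y)
print(f"sum u = {sum(U)}, sum v = {sum(V)}")
print(f"r_UV = {r_uv:.3f}, r_XY = {r_xy:.3f}")
```

The coded values keep the sums small enough for hand calculation, which is the practical reason for the transformation.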
This is explained in the following example.
Example: Calculate coefficient of correlation for the following data.
Class        Mid value (mx)   dx    f    f × dx   f × dx²
0-500        250              -2    14   -28      56
500-1000     750              -1    29   -29      29
1000-1500    1250              0    12     0       0
1500-2000    1750              1     9     9       9
2000-2500    2250              2     5    10      20
∑ f × dx × dy
= (−2)(−2)(12) + (−1)(−2)(6) + (−2)(−1)(2) + (−1)(−1)(18) + (−1)(1)(2) + (−1)(2)(1)
+(1)(−1)(1) + (1)(1)(2) + (1)(2)(1) + (2)(1)(2) + (2)(2)(3)
= 48 + 12 + 4 + 18 − 2 − 2 − 1 + 2 + 2 + 4 + 12 = 97
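Hence,
$r = \dfrac{\sum f \times d_x \times d_y - \dfrac{1}{n} \sum (f \times d_x) \sum (f \times d_y)}{\sqrt{\sum (f \times d_x^2) - \dfrac{(\sum f \times d_x)^2}{n}} \sqrt{\sum (f \times d_y^2) - \dfrac{(\sum f \times d_y)^2}{n}}}$
$= \dfrac{97 - \dfrac{1}{69} \times (-38)(-47)}{\sqrt{114 - \dfrac{1}{69} \times (-38)^2} \sqrt{127 - \dfrac{1}{69} \times (-47)^2}} = \dfrac{71.1159}{9.647 \times 9.746} = 0.76$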
6.5.2 INTERPRETATION OF R
The correlation coefficient, r ranges from −1 to 1. A value of 1 implies
that a linear equation describes the relationship between X and Y
perfectly, with all data points lying on a line for which Y increases
as X increases. A value of −1 implies that all data points lie on a line
for which Y decreases as X increases. A value of 0 implies that there
is no linear correlation between the variables.
More generally, note that (Xi − X) (Yi − Y) is positive if and only
if Xi and Yi lie on the same side of their respective means. Thus the
correlation coefficient is positive if Xi and Yi tend to be simultaneously
greater than, or simultaneously less than, their respective means.
The correlation coefficient is negative if Xi and Yi tend to lie on
opposite sides of their respective means.
The coefficient of correlation r lies between –1 and +1 inclusive
of those values.
When r is positive, the variables x and y increase or decrease
together.
r=+1 implies that there is a perfect positive correlation between
variables x and y.
When r is negative, the variables x and y move in the opposite
direction.
Symbolically, ρ = r ± P. E.
ρ = Correlation (coefficient) of the population.
Example: If r = 0.6 and n = 64 find out the probable error of the
coefficient of correlation.
Solution: $P.E. = 0.6745 \times \dfrac{1 - r^2}{\sqrt{n}}$
$= 0.6745 \times \dfrac{1 - (0.6)^2}{\sqrt{64}}$
$= 0.6745 \times \dfrac{0.64}{8}$
$= 0.054$
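The probable error formula, 0.6745 × (1 − r²)/√n, is easy to evaluate in code. The sketch below uses the values of this example (r = 0.6, n = 64) and applies the usual rules of thumb: r below P.E. is not significant, r above six times P.E. is significant. The function names and the label "inconclusive" for the in-between case are assumptions made for illustration.

```python
import math


def probable_error(r, n):
    """Probable error of the correlation coefficient."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)


def significance(r, n):
    """Classify r against its probable error."""
    pe = probable_error(r, n)
    if abs(r) < pe:
        return "not significant"
    if abs(r) > 6 * pe:
        return "significant"
    return "inconclusive"   # label assumed for the in-between case


pe = probable_error(0.6, 64)
verdict = significance(0.6, 64)
print(f"P.E. = {pe:.5f}")
print(f"limits: {0.6 - pe:.3f} to {0.6 + pe:.3f}")
print(verdict)
```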
15. Correlation coefficient does not change with shifting of
................... i.e. by adding or subtracting any constant from the
two variables (X, Y) correlation coefficient remains same.
16. If the value of r is ................... than P. E., then there is no
evidence of correlation i.e. r is not significant.
17. If r is ................... than 6 times the P. E. ‘r’ is practically certain
i.e. significant.
case during beauty contests. However, in these cases the experts may
rank the candidates. It is then necessary to find out whether the two
sets of ranks are in agreement with each other. This is measured by
Rank Correlation Coefficient. The purpose of computing a correlation
coefficient in such situations is to determine the extent to which the
two sets of ranking are in agreement. The coefficient that is determined
from these ranks is known as Spearman’s rank coefficient, rs.
This is defined by the following formula:
$r_S = 1 - \dfrac{6 \times \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$
6.6.1 RANK CORRELATION WHEN RANKS ARE GIVEN
Example: Ranks obtained by a set of ten students in a mathematics
test (variable X) and a physics test (variable Y) are shown below:
Now, n = 10, $\sum_{i=1}^{n} d_i^2 = 50$
Using the formula
$r_S = 1 - \dfrac{6 \times \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \dfrac{6 \times 50}{10(100 - 1)} = 0.697$
We can say that there is a high degree of correlation between the
performance in mathematics and physics.
X: 88 95 70 60 80 81 50 75
Y: 134 150 115 110 140 142 100 120
Solution: Let R1 and R2 denote the ranks in X and Y respectively.
X     Y     R1   R2   d=R1-R2   d2
75    120   5    5     0        0
88    134   2    4    –2        4
95    150   1    1     0        0
70    115   6    6     0        0
60    110   7    7     0        0
80    140   4    3     1        1
81    142   3    2     1        1
50    100   8    8     0        0
Total                           6
Coefficient of Correlation $\rho = 1 - \dfrac{6 \sum d^2}{n(n^2 - 1)} = 1 - \dfrac{6 \times 6}{8(64 - 1)} = +0.93$
In this method the biggest item gets the first rank, the next biggest
second rank and so on.
X: 87 22 35 75 37
Y: 29 63 52 46 48
Solution:
X Y R1 R2 d=R1-R2 d2
87 29 1 5 –4 16
22 63 5 1 4 16
35 52 4 2 2 4
75 46 2 4 –2 4
37 48 3 3 0 0
Total 40
Coefficient of correlation $\rho = 1 - \dfrac{6 \sum d^2}{n(n^2 - 1)} = 1 - \dfrac{6 \times 40}{5 \times 24} = -1$
This shows an absolute negative correlation or perfect inverse
correlation.
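The two examples above, where ranks are assigned from the raw values, can be reproduced with a short sketch. It assumes there are no tied values and ranks the largest value first, as the method describes; the function names are illustrative, and the first data set is read as (X, Y) pairs from the solution table.

```python
def descending_ranks(values):
    """Rank 1 for the largest value; assumes no tied values."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman(xs, ys):
    """Spearman's rank correlation: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(xs)
    rank_x, rank_y = descending_ranks(xs), descending_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return sum_d2, 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# First example, pairs as listed in the solution table
d2_a, rho_a = spearman([75, 88, 95, 70, 60, 80, 81, 50],
                       [120, 134, 150, 115, 110, 140, 142, 100])
# Second example
d2_b, rho_b = spearman([87, 22, 35, 75, 37], [29, 63, 52, 46, 48])

print(f"sum d^2 = {d2_a}, rho = {rho_a:.2f}")
print(f"sum d^2 = {d2_b}, rho = {rho_b:.2f}")
```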
showing that there are two items with the same 3rd rank and fourth
rank is skipped, then instead of writing 3, we write 3½ for both. Thus
the sum of these ranks which is 7 (3+4= 3½+3½= 7) remains same
keeping the mean of ranks unaffected. But in such cases the standard
deviation is affected. Therefore, correction is required for the Rank
Correlation Coefficient. For this, $\sum d_i^2$ is increased by $\dfrac{m^3 - m}{12}$ for
each tie, where m is number of items in each tie. If there is more
than one group of items with a common rank, this correction factor is
to be added once for each group.
Example: Twelve salesmen are ranked for efficiency and length of
service as below:
Salesman                A  B  C  D  E  F  G  H  I  J   K   L
Efficiency (X)          1  2  3  4  4  4  7  8  9  10  11  12
Length of Service (Y)   2  1  5  3  9  7  7  6  4  11  10  11
Find the value of Spearman’s Rank Coefficient.
Solution:
Computations of Spearman’s Rank Correlation as shown below:
Individual   Efficiency (X = xi)   Length of Service (Y = yi)   di = xi – yi   di2
A 1 2 -1 1
B 2 1 1 1
C 3 5 -2 4
D (4+5+6)/3 = 5 3 2 4
E (4+5+6)/3 = 5 9 -4 16
F (4+5+6)/3 = 5 (7+8)/2 = 7.5 -2.5 6.25
G 7 (7+8)/2 = 7.5 -0.5 0.25
H 8 6 2 4
I 9 4 5 25
J 10 (11+12)/2 = 11.5 -1.5 2.25
K 11 10 1 1
L 12 (11+12)/2 = 11.5 0.5 0.25
Total 65
Now, n = 12, $\sum_{i=1}^{n} d_i^2 = 65$
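The remaining step, applying the tie correction of (m³ − m)/12 per tied group, can be sketched as follows. This is an illustrative calculation based on the correction rule stated above, not the book's printed working; the function names are assumptions, and the tied entries are given average ranks as in the table.

```python
def average_ranks(values):
    """Rank 1 for the smallest value; tied values share the mean rank."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1        # first position occupied by v
        count = ordered.count(v)            # size of the tied group
        ranks.append(first + (count - 1) / 2)
    return ranks


def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every group of m equal values."""
    return sum((values.count(v) ** 3 - values.count(v)) / 12 for v in set(values))


def spearman_with_ties(xs, ys):
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    corrected = sum_d2 + tie_correction(xs) + tie_correction(ys)
    return sum_d2, corrected, 1 - 6 * corrected / (n * (n ** 2 - 1))


efficiency = [1, 2, 3, 4, 4, 4, 7, 8, 9, 10, 11, 12]
service = [2, 1, 5, 3, 9, 7, 7, 6, 4, 11, 10, 11]
sum_d2, corrected, r_s = spearman_with_ties(efficiency, service)
print(f"sum d^2 = {sum_d2}, corrected sum = {corrected}, r_S = {r_s:.3f}")
```

Here one group of three tied ranks in X contributes 2, and two groups of two tied ranks in Y contribute 0.5 each.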
educational and aptitude test scores, together with assessment score
by the Personnel department of their ability one year after joining the
company. 1 is a low score and 20 is a high score.
Employee   Educational test   Aptitude test   Assessment by personnel officer
A 9 17 12
B 10 14 14
C 15 12 16
D 14 13 15
E 16 10 17
F 11 15 10
G 12 12 11
H 17 16 18
Rank each set of the data
Calculate appropriate rank correlation coefficients
Solution: Let X denote the score in educational tests, let Y denote the
score in aptitude test and Z denote the assessment by personnel officer.
Employee   X    Y    Z    Rx   Ry    Rz   d1 = Rx – Rz   d2 = Ry – Rz   d12    d22
A          9    17   12   8    1     6     2             –5             4      25
B          10   14   14   7    4     5     2             –1             4      1
C          15   12   16   3    6.5   3     0              3.5           0      12.25
D          14   13   15   4    5     4     0              1             0      1
E          16   10   17   2    8     2     0              6             0      36
F          11   15   10   6    3     8    –2             –5             4      25
G          12   12   11   5    6.5   7    –2             –0.5           4      0.25
H          17   16   18   1    2     1     0              1             0      1
Total                                                                   16     101.5
$\rho_{XZ} = 1 - \dfrac{6 \sum d_1^2}{N(N^2 - 1)} = 1 - \dfrac{6 \times 16}{8 \times 63} = 0.81$
$\rho_{YZ} = 1 - \dfrac{6 \left[ \sum d_2^2 + \sum m(m^2 - 1)/12 \right]}{N(N^2 - 1)} = 1 - \dfrac{6 \times (101.5 + 0.5)}{8 \times 63} = -0.2143$
The rank correlation coefficient between educational test and
assessment score is positive and high and therefore high educational
test score will correspond to high ability in performance of the job.
18. The coefficient that is determined from these ranks is known
as ................... rank coefficient, rs.
19. When two or more items have the same rank, a correction has
to be applied to ................... .
Collect the data of marks of all the students of your class of any
two subjects. Convert them into ranks and find the rank correlation
between the two subjects.
$r = \pm \sqrt{\pm \dfrac{2 \times c - n}{n}}$
Where, n = total number of pairs.
c = Number of concurrent changes
Example: The data of advertisement expenditure (X) and sales (Y)
of a company for past 10 year period is given below. Determine the
correlation coefficient between these variables and comment on the
correlation.
X 50 50 50 40 30 20 20 15 10 5
Y 700 650 600 500 450 400 300 250 210 200
Solution:
$r = \pm \sqrt{\pm \dfrac{2 \times c - n}{n}} = + \sqrt{+ \dfrac{2 \times 6 - 9}{9}} = 0.577$
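The counting step behind this result can be sketched in code. The sketch assumes that a pair of changes is concurrent when both variables move in the same direction or both stay unchanged, and that n is the number of successive changes (one less than the number of observations); the function names are illustrative.

```python
import math


def sign(change):
    """Return +1, 0 or -1 according to the direction of change."""
    return (change > 0) - (change < 0)


def concurrent_deviation_r(xs, ys):
    """Correlation by the concurrent deviation method."""
    dx = [sign(b - a) for a, b in zip(xs, xs[1:])]
    dy = [sign(b - a) for a, b in zip(ys, ys[1:])]
    n = len(dx)                                     # number of pairs of changes
    c = sum(1 for a, b in zip(dx, dy) if a == b)    # concurrent changes
    ratio = (2 * c - n) / n
    # The sign of the ratio is applied inside and outside the radical
    return c, n, math.copysign(math.sqrt(abs(ratio)), ratio)


X = [50, 50, 50, 40, 30, 20, 20, 15, 10, 5]
Y = [700, 650, 600, 500, 450, 400, 300, 250, 210, 200]
c, n, r = concurrent_deviation_r(X, Y)
print(f"c = {c}, n = {n}, r = {r:.3f}")
```

Only the direction of each change enters the calculation, so the result is much cruder than the product moment value obtained for the same data.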
Collect the data of heights and weights of all the boys in your class.
Find the correlation coefficient using concurrent deviation method
between the variables height and weight.
1. Sign ± is selected to make the value of $\dfrac{2 \times c - n}{n}$ positive. The
same sign is used outside the radical.
2. This method does not give strength of correlation. The
method is ad hoc and used only to reduce the efforts of tedious
calculations.
6.8 SUMMARY
In this chapter the concept of correlation or the association
between two variables has been discussed. A scatter plot of the
variables may suggest that the two variables are related but
the value of the Pearson correlation coefficient r quantifies this
association.
Correlation is a degree of linear association between two random
variables. In these two variables, we do not differentiate them
as dependent and independent variables. It may be the case
that one is the cause and other is an effect i.e. independent and
dependent variables respectively. On the other hand, both may
be dependent variables on a third variable.
In business, correlation analysis often helps manager to take
decisions by estimating the effects of changing the values of the
decision variables like promotion, advertising, price, production
processes, on the objective parameters like costs, sales, market
share, consumer satisfaction, competitive price. The decision
becomes more objective by removing subjectivity to certain
extent.
The correlation coefficient r may assume values between –1 and
1. The sign indicates whether the association is direct (+ve) or
inverse (-ve). A numerical value of r equal to unity indicates
perfect association while a value of zero indicates no association.
The correlation is said to be positive when the increase
coefficient, coefficient of determination, Yule’s coefficient of
association, coefficient of colligation, etc.
The correlation coefficient measures the degree of association
between two variables X and Y.
Karl Pearson’s formula for correlation coefficient is given as,
$r = \dfrac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$
$r = \dfrac{\dfrac{1}{n} \sum (X - \bar{X})(Y - \bar{Y})}{\sigma_X \sigma_Y}$
The purpose of computing a correlation coefficient in such
situations is to determine the extent to which the two sets of
ranking are in agreement. The coefficient that is determined
from these ranks is known as Spearman’s rank coefficient, rs.
This is defined by the following formula:
$r_S = 1 - \dfrac{6 \times \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$
Where, n = Number of observation pairs
di = Xi – Yi
Xi = Values of variable X and Yi = values of variable Y
Although the concurrent deviation method is effective in giving
Linear Correlation: When the amount of change in one
variable tends to keep a constant ratio to the amount of change
in the other variable, then the correlation is said to be linear.
Non-linear Correlation: The amount of change in one variable
does not bear a constant ratio to the amount of change in the
other variable then the correlation is said to be non-linear.
Coefficient of Correlation: The correlation coefficient
measures the degree of association between two variables X
and Y.
Scatter Diagram: The pattern of points obtained by plotting
the observed points is known as a scatter diagram.
Advertisement cost in ’000 ₹   39 65 62 90 82 75 25 98 36 78
Sales in Lakh ₹                47 53 58 86 62 68 60 91 51 84
2.
Marks in Marks in
Statistics Economics
Mean 55 48
Standard Deviation 4 5
The correlation coefficient between marks in statistics and
economics is 0.8 given in table above. Estimate the marks in
statistics of a student who scored 50 marks in economics.
3. Calculate coefficient of correlation between X and Y as per the
data given below:
X 14 16 20 22 28 30 34 40 45
Y 97 89 68 65 56 50 37 18 12
4. Ten competitors in a beauty contest are ranked by three judges
in the following order. Determine which pair of judge has the
nearest approach to common taste in beauty?
Judge 1: 1 6 5 10 3 2 4 9 7 8
Judge 2: 3 5 8 4 7 10 2 1 6 9
Judge 3: 6 4 9 8 1 2 3 10 5 7
5. Ten candidates obtained the following marks in examinations in
Statistics and Mathematics. Find the rank correlation coefficient
to determine whether these results support the suggestion that
ability in one subject is associated with ability in the other.
Candidate A B C D E F G H I J
Statistics 40 65 61 49 53 42 68 57 58 46
Maths 51 58 67 55 76 45 69 56 73 63
16. Less
17. More
Rank Correlation Method 18. Spearman’s
19. $\sum d_i^2$
6. Refer Section 6.5
Karl Pearson’s formula for correlation coefficient is given as,
\[ r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\frac{1}{n}\sum (X - \bar{X})(Y - \bar{Y})}{\sigma_X \sigma_Y} \]
Where r is the ‘Correlation Coefficient’ or ‘Product Moment
Correlation Coefficient’ between X and Y. sX and sY are the
standard deviations of X and Y respectively. ‘n’ is the number of
the pairs of variables X and Y in the given data.
7. Refer Section 6.5.1
The assumptions underlying Karl Pearson’s correlation
coefficient are as follows:
(a) Your data on both variables are measured on either an Interval
Scale or a Ratio Scale.
(b) The traits you are measuring are normally distributed in the
population.
8. Refer Section 6.5.2
The correlation coefficient, r ranges from −1 to 1. A value
\[ r_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \]
the number of items that increase or decrease or remains equal
concurrently and denote as c. The correlation coefficient is then
calculated as,
\[ r = \pm\sqrt{\pm\frac{2c - n}{n}} \]
Where, n = total number of pairs.
c = Number of concurrent changes
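A minimal sketch of the concurrent deviation formula follows. The function name `concurrent_deviation_r` is hypothetical, and the caller is assumed to have already counted `c` and `n` as defined above. The sign of (2c − n) is applied both inside and outside the square root, so that the quantity under the root is never negative.

```python
import math


def concurrent_deviation_r(c, n):
    """r = +/- sqrt(+/- (2c - n) / n), with both signs taken from (2c - n)."""
    ratio = (2 * c - n) / n
    sign = 1 if ratio >= 0 else -1
    # sign * ratio is always non-negative, so the square root is real
    return sign * math.sqrt(sign * ratio)


# Hypothetical counts: 8 concurrent changes out of 10 pairs.
print(concurrent_deviation_r(8, 10))
```

When every pair moves in the same direction (c = n) the result is 1, and when none do (c = 0) the result is −1.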
5. 0.6
testing beauty because the coefficient of correlation is highest
between them.
6.11 SUGGESTED READINGS FOR REFERENCE
SUGGESTED READINGS
Gupta, S.P. and Gupta, M.P., Business Statistics, Sultan Chand &
Sons, New Delhi, 1987
Loomba, M.P., Management – A Quantitative Perspective,
MacMillan Publishing Company, New York, 1978.
Levin, R.I., Statistics for Management, Prentice-Hall of India,
New Delhi, 1979
Shenoy, G.V., Srivastava, U.K. and Sharma, S.C., Quantitative
Techniques for Managerial Decision Making, Wiley Eastern, New
Delhi, 1985
Venkata Rao, K., Management Science, McGraw-Hill Book
Company, Singapore, 1986.
Bhardwaj, R.S., Business Statistics, 2nd Edition, Excel Books,
New Delhi.
Kothari, C.R., Quantitative Techniques, Vikas Publication.
E-REFERENCES
http://www.pinkmonkey.com/
https://www.tutorsland.com/
http://www.jstor.org/
REGRESSION ANALYSIS
CONTENTS
7.1 Introduction
7.2 Regression Analysis
7.2.1 Applicability of Regression Analysis
7.3 Simple Linear Regression
7.3.1 Simple Linear Regression Model
7.3.2 Linear Regression Equation
7.4 Coefficient of Regression
7.5 Non-linear Regression Models
INTRODUCTORY CASELET
PREGNANCY
It is also worth pointing out that regression models do not make
decisions for people. Regression models are a source of information
about the world. In order to use them wisely, it is important to
understand how they work.
7.1 INTRODUCTION
The word regression was first used as a statistical concept in 1877 by
Francis Galton. When more than one variable is used for prediction, the
term multiple regression is used. In regression analysis we develop an
equation, called an estimating equation, that relates known and
unknown variables. Correlation analysis is then used to determine
the degree of the relationship between the variables.
Using the chi-square test we can find whether there is any relationship
between the variables. Correlation and regression analysis show how
to determine the nature and strength of the relationship between the
variables. In this chapter we will learn how to calculate the regression
line mathematically.
values on the ‘best fit’ curve. This is called the minimum squared error
criterion. It may be noted that the deviation (error) can be measured in
the X direction or the Y direction. Accordingly we get two ‘best fit’ curves.
If we measure the deviation in the Y direction, i.e. for a given value xi of a data
point (xi, yi) we take the corresponding y value on the ‘best fit’
curve and measure the deviation in y, we call it the regression
of Y on X. In the other case, if we measure deviations in the X direction,
we call it the regression of X on Y.
Regression analysis is one of the most popular and commonly used
statistical tools in business. The availability of computer packages
has simplified its use. However, one must be careful before using this
tool, as it gives only a mathematical measure based on the available data. It
does not check whether a cause-effect relationship really exists and,
if it exists, which is the independent and which is the dependent variable.
With the help of a few examples illustrate how regression analysis
helps in business decision making.
variable for a given value of independent variable or for controlling
the independent variable to get the desired results or to explain
relationship for reliable predictions.
\[ \hat{y} = a + b x \tag{1} \]
of the linear regression model whose values are found out from the
observed data.
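\[ y_i = a + b x_i + \epsilon_i, \quad i = 1, 2, \ldots, n \tag{2} \]
Or,
\[ \epsilon_i = y_i - a - b x_i \]
Thus, sum square of errors is,
\[ S = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2 \]
S is minimum when its partial derivatives with respect to a and b are
zero, which requires
\[ 2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0 \]
And,
\[ 2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0 \]
These give the two normal equations,
\[ a \times n + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i \tag{3} \]
\[ a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i \tag{4} \]
Dividing (3) by n,
\[ a + \frac{b \sum x_i}{n} = \frac{\sum y_i}{n} \]
\[ \text{Or,} \quad a + b\bar{X} = \bar{Y} \tag{5} \]
\[ \text{Or,} \quad a = \bar{Y} - b\bar{X} \tag{6} \]
Substituting (6) in (4) and dividing it by n we get,
\[ b = \frac{\frac{1}{n}\sum x_i y_i - \bar{X}\,\bar{Y}}{\frac{1}{n}\sum x_i^2 - \bar{X}^2} \tag{7} \]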
We denote b as bYX only to indicate it is regression of Y on X. bYX is
called as Regression Coefficient.
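Equations (6) and (7) translate directly into code. The following is a minimal sketch, not a library routine: the function name `fit_line` and the four data points are hypothetical, chosen so that the points lie exactly on y = 1 + 2x.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least-squares line y = a + b*x (regression of Y on X)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_xx = sum(x * x for x in xs) / n
    b = (mean_xy - mean_x * mean_y) / (mean_xx - mean_x ** 2)  # equation (7)
    a = mean_y - b * mean_x                                    # equation (6)
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = fit_line(xs, ys)
print(a, b)
```

Because the sample points here have no scatter, the fitted intercept and slope reproduce the generating line; with real data the same two formulas give the line that minimises the sum of squared errors S.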
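Now equation of regression line is,
\[ \hat{y} = a + b_{YX}\, x \]
Subtracting equation (5) we get
\[ (\hat{y} - \bar{Y}) = b_{YX} (x - \bar{X}) \tag{8} \]
And
\[ b_{YX} = \frac{\operatorname{cov}(X, Y)}{\sigma_X^2} = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i}{n} \times \frac{\sum_{i=1}^{n} y_i}{n}}{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{\sum_{i=1}^{n} x_i}{n}\right)^2} \tag{9} \]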
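For finding regression equation of X on Y we follow similar procedure
and get the regression line equation as
\[ (\hat{x} - \bar{X}) = b_{XY} (y - \bar{Y}) \tag{10} \]
With
\[ b_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_Y^2} = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i}{n} \times \frac{\sum_{i=1}^{n} y_i}{n}}{\frac{1}{n}\sum_{i=1}^{n} y_i^2 - \left(\frac{\sum_{i=1}^{n} y_i}{n}\right)^2} \tag{11} \]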
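Further, covariance of (X, Y) is,
\[
\begin{aligned}
\operatorname{cov}(X, Y) &= \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y}) = \frac{1}{n}\sum_{i=1}^{n} (x_i y_i - x_i \bar{Y} - \bar{X} y_i + \bar{X}\bar{Y}) \\
&= \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{Y}\,\frac{\sum_{i=1}^{n} x_i}{n} - \bar{X}\,\frac{\sum_{i=1}^{n} y_i}{n} + \bar{X}\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{Y}\bar{X} - \bar{X}\bar{Y} + \bar{X}\bar{Y} \\
&= \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{X}\bar{Y}
\end{aligned}
\tag{12}
\]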
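Also, variance of X is,
\[
\begin{aligned}
\operatorname{var}(X) &= \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{X})^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i^2 - 2 x_i \bar{X} + \bar{X}^2) \\
&= \frac{1}{n}\sum_{i=1}^{n} x_i^2 - 2\bar{X}\,\frac{\sum_{i=1}^{n} x_i}{n} + \frac{\bar{X}^2}{n}\sum_{i=1}^{n} 1 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - 2\bar{X}^2 + \bar{X}^2 \\
&= \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{X}^2
\end{aligned}
\tag{13}
\]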
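Substituting (12) & (13) in (7)
\[ b_{YX} = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)} = \frac{\operatorname{cov}(X, Y)}{\sigma_X^2} \tag{14} \]
Further, we note that
\[ \operatorname{cov}(X, Y) = b_{YX}\,\sigma_X^2 = b_{XY}\,\sigma_Y^2 \]
\[ r^2 = b_{YX}\, b_{XY} \]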
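The relation between the two regression coefficients and r can be checked numerically. The sketch below is illustrative only: the function name `regression_coefficients` is hypothetical, and the data are the distance and transit-time figures from the consignment example in this chapter.

```python
import math


def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy, r) computed from the covariance and the two variances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    var_y = sum(y * y for y in ys) / n - mean_y ** 2
    b_yx = cov / var_x                    # regression of Y on X
    b_xy = cov / var_y                    # regression of X on Y
    r = cov / math.sqrt(var_x * var_y)    # Karl Pearson's coefficient
    return b_yx, b_xy, r


distance = [4, 5, 6, 7, 9, 9, 10, 11, 11, 12]   # X, in 100 km
transit = [4, 5, 5, 6, 7, 6, 7, 8, 7, 8]        # Y, in days
b_yx, b_xy, r = regression_coefficients(distance, transit)
print(b_yx, b_xy, r, b_yx * b_xy, r ** 2)
```

The last two printed values agree up to floating-point rounding, which is the identity r² = bYX × bXY.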
Regression refers to an average of relationship between a dependent
variable with one or more independent variables. Such relationship
is generally expressed by a line of regression drawn by the method
of the “Least Squares”. This line of regression can be drawn
graphically or derived algebraically with the help of regression
equations. According to Tom Cars, before the equation of the least
squares line can be determined some criterion must be established as to
what conditions the best line should satisfy. The condition usually
Properties of Regression Coefficients
The coefficient of correlation is the geometric mean of the two
regression coefficients, i.e. r = ±√(bYX × bXY).
Both the regression coefficients are either positive or negative, i.e.
they always have identical signs.
The coefficient of correlation and the regression coefficients
also have the same sign.
If one of the regression coefficients is more than unity, the other
must be less than unity, because the value of the coefficient of
correlation cannot exceed one in magnitude (−1 ≤ r ≤ 1).
Regression coefficients are independent of a change in the
origin but not of a change in scale.
The arithmetic mean of the regression coefficients is, in magnitude,
greater than or equal to the correlation coefficient.
Solved Examples
Example: The cost of total output in a factory is linearly related to
number of units manufactured. Data collected for 8 months is as
follows.
Month 1 2 3 4 5 6 7 8
1. Now,
\[ b_{YX} = \frac{\operatorname{cov}(X, Y)}{\sigma_X^2} = \frac{\frac{1}{n}\sum x_i y_i - \frac{\sum x_i}{n} \times \frac{\sum y_i}{n}}{\frac{1}{n}\sum x_i^2 - \left(\frac{\sum x_i}{n}\right)^2} \]
\[ \operatorname{cov}(X, Y) = \frac{1}{n}\sum x_i y_i - \frac{\sum x_i}{n} \times \frac{\sum y_i}{n} = \frac{465}{8} - \frac{26.5 \times 133}{8 \times 8} = 58.125 - 55.07 = 3.055 \]
And,
\[ \sigma_X^2 = \frac{1}{n}\sum x_i^2 - \left(\frac{\sum x_i}{n}\right)^2 = \frac{103.75}{8} - (3.3125)^2 = 12.96875 - 10.973 = 1.99575 \]
Therefore, σX = 1.4127
Thus,
\[ b_{YX} = \frac{58.125 - 55.07}{12.96875 - 10.973} = 1.53 \]
The regression equation is
\[ (\hat{y} - \bar{Y}) = b_{YX} (x - \bar{X}) \]
\[ \text{Or,} \quad \left(\hat{y} - \frac{133}{8}\right) = 1.53 \times \left(x - \frac{26.5}{8}\right) \;\Rightarrow\; (\hat{y} - 16.625) = 1.53 \times (x - 3.3125) \]
\[ \text{Or,} \quad \hat{y} = 1.53 x + 11.557 \]
Now,
\[ \sigma_Y^2 = \frac{1}{n}\sum y_i^2 - \left(\frac{\sum y_i}{n}\right)^2 = \frac{2249}{8} - \left(\frac{133}{8}\right)^2 = 281.125 - 276.391 = 4.734 \]
Therefore, σY = 2.176
\[ \text{Hence,} \quad r = b_{YX} \times \frac{\sigma_X}{\sigma_Y} = 1.53 \times \frac{1.4127}{2.176} = 0.993 \quad \text{(Ans)} \]
Since correlation coefficient r is close to 1, there is strong
association. Hence the relation can be deemed as reasonably
valid.
3. For number of units 13500, x = 13.5. The estimated cost of output
is,
\[ \hat{y} = 1.53 x + 11.557 = 1.53 \times 13.5 + 11.557 = 32.212 \quad \text{(Ans)} \]
Example: The two regression line equations are given as 8x – 10y +
66 = 0 and 40x – 18y – 214 = 0. Find $\bar{X}$, $\bar{Y}$, the two regression coefficients
and correlation coefficient r.
Solution: The point of intersection of two regression lines is ($\bar{X}$, $\bar{Y}$).
Hence solving the two equations we get the point of intersection as,
$\bar{X}$ = 13 and $\bar{Y}$ = 17
Now if we take first equation as regression of Y on X we can rewrite
the equation as,
\[ y = \frac{8}{10} \times x + \frac{66}{10} \]
Thus, the regression coefficient is $b_{YX} = \frac{8}{10}$
Similarly, taking second equation as regression of X on Y we can
rewrite the equation as,
\[ x = \frac{18}{40}\, y + \frac{214}{40} \]
Thus, the regression coefficient is $b_{XY} = \frac{18}{40}$
Now, correlation coefficient r is given by,
\[ r = \sqrt{b_{YX} \times b_{XY}} = \sqrt{\frac{8}{10} \times \frac{18}{40}} = 0.6 \]
If we had taken first equation as regression of X on Y and
second equation as regression of Y on X then value of r² = bYX
× bXY would have been greater than 1. But we know that
r² ≤ 1 always, since r is always between ±1.
Sign of r while taking radical is taken as per signs of bYX and bXY.
Signs of bYX and bXY both must be either positive or negative.
bYX and bXY having opposite signs is not possible.
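The steps of this example can be retraced with a short script. It is a sketch under stated assumptions: the helper name `solve_two_lines` is hypothetical, each line is written as a·x + b·y + c = 0, and Cramer's rule is used for the intersection (any method of solving two linear equations would do).

```python
import math


def solve_two_lines(a1, b1, c1, a2, b2, c2):
    """Intersection of a1*x + b1*y + c1 = 0 and a2*x + b2*y + c2 = 0 (Cramer's rule)."""
    det = a1 * b2 - a2 * b1
    x = (b1 * c2 - b2 * c1) / det
    y = (a2 * c1 - a1 * c2) / det
    return x, y


# The two regression lines of the example: 8x - 10y + 66 = 0 and 40x - 18y - 214 = 0.
line1 = (8, -10, 66)
line2 = (40, -18, -214)
mean_x, mean_y = solve_two_lines(*line1, *line2)

# Line 1 read as Y on X: y = -(a1/b1) x - c1/b1, so b_yx = -a1/b1.
b_yx = -line1[0] / line1[1]
# Line 2 read as X on Y: x = -(b2/a2) y - c2/a2, so b_xy = -b2/a2.
b_xy = -line2[1] / line2[0]
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

# The opposite reading of the two lines gives a product above 1, so it is rejected.
wrong_product = (-line2[0] / line2[1]) * (-line1[1] / line1[0])
print(mean_x, mean_y, b_yx, b_xy, r, wrong_product)
```

The check on `wrong_product` mirrors the note above: only the assignment whose product of regression coefficients does not exceed 1 is admissible.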
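\[ \bar{X} = \frac{\sum x}{n} = \frac{30}{5} = 6, \qquad \bar{Y} = \frac{\sum y}{n} = \frac{30}{5} = 6 \]
\[ b_{YX} = \frac{\sum xy - \frac{1}{n} \times \sum x \sum y}{\sum x^2 - \frac{1}{n} \times \left(\sum x\right)^2} = \frac{192 - 180}{190 - 180} = 1.2 \]
Hence the regression equation of y on x is,
\[ \hat{Y} = \bar{Y} + b_{YX} (x - \bar{X}) = 6 + 1.2 (x - 6) \]
\[ \hat{Y} = 1.2 x - 1.2 \]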
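Effect of shifting of origin and change of scale on regression coefficient
bYX
Let the transformation be
\[ U = \frac{X - A}{g} \quad \text{and} \quad V = \frac{Y - B}{h} \]
Then cov(U, V) = cov(X, Y)/(g × h) and var(U) = var(X)/g², so the
regression coefficient of V on U is,
\[ b_{VU} = \frac{g}{h}\, b_{YX} \]
And regression coefficient of U on V is,
\[ b_{UV} = \frac{h}{g}\, b_{XY} \]
Thus we can say that shifting of origin does not change the regression
coefficients, whereas a change of scale multiplies them by the ratio of
the scale factors.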
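The effect of the transformation can be verified numerically. In the sketch below the data, the constants A, B, g, h and the helper name `slope` are all hypothetical; the only point is to compare the slope before and after the change of origin and scale.

```python
def slope(xs, ys):
    """Regression coefficient of ys on xs: cov(x, y) / var(x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var = sum((x - mean_x) ** 2 for x in xs) / n
    return cov / var


# Hypothetical data and transformation constants.
xs = [10, 20, 30, 40, 50]
ys = [105, 130, 140, 170, 180]
A, g = 30, 10    # U = (X - A) / g
B, h = 140, 5    # V = (Y - B) / h

us = [(x - A) / g for x in xs]
vs = [(y - B) / h for y in ys]

b_yx = slope(xs, ys)
b_vu = slope(us, vs)
b_shift_only = slope([x - A for x in xs], [y - B for y in ys])
print(b_yx, b_vu, b_shift_only)
```

With g = h = 1 (a pure shift of origin) the slope is unchanged, which is why the worked examples in this chapter can subtract convenient constants A and B before computing the sums.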
Example: Data below gives transit time in days for random sample of
10 consignments with related distance.
X Distance in 100 km:     4 5 6 7 9 9 10 11 11 12
Y Transit time in days:   4 5 5 6 7 6 7 8 7 8
1. Find best fit linear relationship of transit time on distance.
2. Also estimate the transit time for a new location at a distance 800
km.
3. Also compute correlation coefficient and assess whether relation
can be deemed as reasonably valid.
4. Find coefficient of determination R and explain its significance.
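Now,
\[ b_{VU} = \frac{\operatorname{cov}(U, V)}{\sigma_U^2} = \frac{\frac{1}{n}\sum u_i v_i - \frac{\sum u_i}{n} \times \frac{\sum v_i}{n}}{\frac{1}{n}\sum u_i^2 - \left(\frac{\sum u_i}{n}\right)^2} \]
\[ \operatorname{cov}(U, V) = \frac{1}{n}\sum u_i v_i - \frac{\sum u_i}{n} \times \frac{\sum v_i}{n} = \frac{30}{10} - \frac{(-6) \times 3}{10 \times 10} = 3 + 0.18 = 3.18 \]
\[ \text{And,} \quad \sigma_U^2 = \frac{1}{n}\sum u_i^2 - \left(\frac{\sum u_i}{n}\right)^2 = \frac{72}{10} - (-0.6)^2 = 7.2 - 0.36 = 6.84 \]
Therefore, σU = 2.615
\[ \text{Thus,} \quad b_{VU} = \frac{3.18}{6.84} = 0.4649 \]
But, bYX = bVU = 0.4649
The regression equation is
\[ (\hat{y} - \bar{Y}) = b_{YX} (x - \bar{X}) \]
\[ \text{Or,} \quad (\hat{y} - 6.3) = 0.4649 \times (x - 8.4) \]
\[ \text{Or,} \quad \hat{y} = 0.4649 x + 2.3948 \]
3. Now,
\[ \sigma_V^2 = \frac{1}{n}\sum v_i^2 - \left(\frac{\sum v_i}{n}\right)^2 = \frac{17}{10} - (0.3)^2 = 1.7 - 0.09 = 1.61 \]
Therefore, σV = 1.2689
\[ \text{Hence,} \quad r = b_{VU} \times \frac{\sigma_U}{\sigma_V} = 0.4649 \times \frac{2.615}{1.2689} = 0.958 \quad \text{(Ans)} \]
Since correlation coefficient r is close to 1, there is strong association.
Hence the relation can be deemed as reasonably valid.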
Example: The owner of a small garment shop is hopeful that his sales
are rising significantly week by week. Treating the sales of previous
six weeks as a typical example of this rising trend, he recorded them
in ₹1000s and analyzed the results.
Weeks: 1 2 3 4 5 6
Sales: 269 262 280 270 275 281
Fit a linear regression equation to suggest to him the weekly rate at
which his sales are rising and use this equation to estimate expected
sales for the 7th week.
Solution: 1. Regression line equation
The calculations are tabulated below. We use A = 3 and B = 270 and shift
the origin.
Sl. No.   xi (Weeks)   yi (Sales in ₹1000)   ui    vi    ui²   vi²   ui vi
1         1            269                   -2    -1    4     1     2
2         2            262                   -1    -8    1     64    8
3         3            280                   0     10    0     100   0
4         4            270                   1     0     1     0     0
5         5            275                   2     5     4     25    10
6         6            281                   3     11    9     121   33
Total ∑   21           1637                  3     17    19    311   53
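Since there is no change of scale, bYX = bVU
Now,
\[ b_{VU} = \frac{\operatorname{cov}(U, V)}{\sigma_U^2} = \frac{\frac{1}{n}\sum u_i v_i - \frac{\sum u_i}{n} \times \frac{\sum v_i}{n}}{\frac{1}{n}\sum u_i^2 - \left(\frac{\sum u_i}{n}\right)^2} \]
\[ \operatorname{cov}(U, V) = \frac{1}{n}\sum u_i v_i - \frac{\sum u_i}{n} \times \frac{\sum v_i}{n} = \frac{53}{6} - \frac{3 \times 17}{6 \times 6} = 8.8333 - 1.4166 = 7.4167 \]
\[ \text{And,} \quad \sigma_U^2 = \frac{1}{n}\sum u_i^2 - \left(\frac{\sum u_i}{n}\right)^2 = \frac{19}{6} - (0.5)^2 = 2.9167 \]
Therefore, σU = 1.7078
\[ \text{Thus,} \quad b_{VU} = \frac{7.4167}{2.9167} = 2.5428 \]
But, bYX = bVU = 2.5428
Also,
\[ \bar{X} = \frac{\sum x}{n} = \frac{21}{6} = 3.5, \qquad \bar{Y} = \frac{\sum y}{n} = \frac{1637}{6} = 272.83 \]
The regression equation is
\[ (\hat{y} - \bar{Y}) = b_{YX} (x - \bar{X}) \]
\[ \text{Or,} \quad (\hat{y} - 272.83) = 2.5428 \times (x - 3.5) \]
\[ \text{Or,} \quad \hat{y} = 2.5428 x + 263.9302 \quad \text{(Ans)} \]
2. For the 7th week i.e. x = 7
\[ \text{Expected sales} = \hat{y}(x = 7) = 2.5428 \times 7 + 263.9302 = 281.7298 \]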
Example: Using the following information, obtain the line of regression
of average defective parts delivered (in hundred units) y on average
expenditure incurred on inspection (in ₹ thousands) x:
\[ \bar{X} = \frac{\sum x}{n} = \frac{424}{10} = 42.4, \qquad \bar{Y} = \frac{\sum y}{n} = \frac{363}{10} = 36.3 \]
\[ b_{YX} = \frac{\sum xy - \frac{1}{n} \times \sum x \sum y}{\sum x^2 - \frac{1}{n} \times \left(\sum x\right)^2} = \frac{12815 - 42.4 \times 363}{21926 - 42.4 \times 424} = -0.6525 \]
Hence the regression equation of y on x is,
\[ \hat{Y} = \bar{Y} + b_{YX} (x - \bar{X}) = 36.3 - 0.6525 (x - 42.4) \]
\[ \hat{Y} = 45.696 \]
Thus, number of defectives is 4569 ≈ 4570 parts.
Errors involved in a straight-line approximation are often high, hence
we use polynomials of higher degree to achieve smoothness and
a better approximation. The least square principle can also be applied
to the fitting of a second degree polynomial, which may be useful in
business situations if we have some idea that the relationship between
two variables is parabolic. In any case a second degree polynomial fit
is more likely to be a better approximation of the actual relationship.
We may use a second order model (parabolic trend) if we feel that the
variation is parabolic. Here we will discuss only one nonlinear model
i.e. polynomial of second degree.
Second Degree Model
Just to demonstrate the theoretical similarity of linear (first degree)
and parabolic (second degree) models, we will describe the normal
equations. In this case the regression equation is,
\[ \hat{y} = a + b X + c X^2 \]
and the normal equations are,
\[ \sum_{i=1}^{n} y_i = a \times n + b \sum_{i=1}^{n} x_i + c \sum_{i=1}^{n} x_i^2 \]
\[ \sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 + c \sum_{i=1}^{n} x_i^3 \]
\[ \sum_{i=1}^{n} x_i^2 y_i = a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i^3 + c \sum_{i=1}^{n} x_i^4 \]
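The three normal equations form a 3 × 3 linear system in a, b and c. The sketch below builds that system from the data and solves it by Gauss-Jordan elimination in plain Python. The function name `fit_parabola` and the sample points are hypothetical; the points were generated from y = 1 + 2x + 3x² so the expected coefficients are known in advance.

```python
def fit_parabola(xs, ys):
    """Solve the normal equations of y = a + b*x + c*x^2 and return (a, b, c)."""
    n = len(xs)
    sx = sum(xs)
    sx2 = sum(x ** 2 for x in xs)
    sx3 = sum(x ** 3 for x in xs)
    sx4 = sum(x ** 4 for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sx2y = sum(x ** 2 * y for x, y in zip(xs, ys))

    # Augmented matrix: one row per normal equation, unknowns a, b, c.
    m = [
        [n, sx, sx2, sy],
        [sx, sx2, sx3, sxy],
        [sx2, sx3, sx4, sx2y],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(m[row][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [v - factor * p for v, p in zip(m[row], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


# Hypothetical points lying exactly on y = 1 + 2x + 3x^2.
xs = [-2, -1, 0, 1, 2]
ys = [9, 2, 1, 6, 17]
print(fit_parabola(xs, ys))
```

Choosing x values symmetric about zero makes the odd-power sums vanish, which is the same simplification obtained by shifting the origin in hand calculation.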
Solution: We use normal equations for second degree regression.
Shifting origin does not change regression coefficients. It only shifts
the regression curve. Let the origin be shifted to (1986, 10). Hence in
normal equations, we replace xi by (xi – 1986) and yi by (yi – 10) . The
calculations are shown in the following table.
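\[ a \times n + b \sum_{i=1}^{n} x_i + c \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i \;\Rightarrow\; 5 \times a + 10 \times c = 3 \tag{18} \]
\[ a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2 + c \sum_{i=1}^{n} x_i^3 = \sum_{i=1}^{n} x_i y_i \;\Rightarrow\; 10 \times b = -6 \tag{19} \]
\[ a \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i^3 + c \sum_{i=1}^{n} x_i^4 = \sum_{i=1}^{n} x_i^2 y_i \;\Rightarrow\; 10 \times a + 34 \times c = -6 \tag{20} \]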
The least square approximation can be calculated easily for low
degree polynomials, like linear, parabolic, cubic, etc. But for higher
degrees (more than three), the system of normal equations becomes
ill conditioned. This causes large errors in values of coefficients.
Then the approximation becomes incorrect. To avoid these problems,
‘orthogonal polynomials’ are used for approximation.
Non-linear models are difficult to handle. But we can often use simple
transformation to convert the model to linear. Taking the logarithm of
values of the variable is one such method. These are called logarithmic
linear (log linear) models.
Non-linear models that can be transformed to yield linear models
are called intrinsically linear.
However, there are many software packages that can handle these
models. Managers working in this area must become familiar to such
models as per the availability of particular software. Discussion on
Seasonal Model
We know many business parameters are highly seasonal, e.g. sales
of air conditioners, sales of woollen clothing, share market prices,
price of a commodity, etc. Many of these are cyclic in
nature with a constant period like a year, a month, a settlement period on
the stock exchange, etc. A sinusoidal model is appropriate for such cases
to separate the seasonality part of the data. If Ft is the forecast for
period ‘t’, then
\[ F_t = a + u \cos\frac{2\pi}{N}\, t + v \sin\frac{2\pi}{N}\, t \]
Where a, u and v are constants, t is time period and N is number of
time periods in the complete cycle.
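One way to estimate a, u and v is by least squares. The sketch below assumes the data cover exactly one complete cycle of N ≥ 3 equally spaced periods t = 1, …, N; under that assumption the sine and cosine terms are orthogonal over the cycle and the least-squares estimates reduce to simple averages. The function names and the monthly figures are hypothetical.

```python
import math


def fit_seasonal(values):
    """Estimate (a, u, v) of F_t = a + u*cos(2*pi*t/N) + v*sin(2*pi*t/N).

    values[t - 1] is the observation for period t; the list must cover
    exactly one full cycle of N >= 3 periods.
    """
    n = len(values)
    a = sum(values) / n
    u = 2 / n * sum(y * math.cos(2 * math.pi * t / n)
                    for t, y in enumerate(values, start=1))
    v = 2 / n * sum(y * math.sin(2 * math.pi * t / n)
                    for t, y in enumerate(values, start=1))
    return a, u, v


def seasonal_forecast(a, u, v, t, n):
    """Forecast F_t for period t of a cycle with n periods."""
    angle = 2 * math.pi * t / n
    return a + u * math.cos(angle) + v * math.sin(angle)


# Hypothetical monthly series generated from a = 100, u = 20, v = -5, N = 12.
monthly = [seasonal_forecast(100, 20, -5, t, 12) for t in range(1, 13)]
print(fit_seasonal(monthly))
```

For data that span several cycles or have missing periods, the three constants would instead be found by solving the general normal equations, as in the linear case.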
be cyclic with one year period. However, at the same time it may also
have underlying trend of overall increase year on year. In such a case,
seasonal model and straight line model are superimposed as
\[ F_t = a + b \times t + u \cos\frac{2\pi}{N}\, t + v \sin\frac{2\pi}{N}\, t \]
This model has a growth term b × t.
Coefficient of Determination
Once we know there is a correlation between two variables and
we have found the linear relationship between them, we would like
to specify how strong the relationship is. If the relationship is strong, we
can use it for decision-making with more confidence, because our
estimates based on the regression equation would be more accurate.
Mean Square Error (MSE) is an estimate of the variance of the
regression error. MSE depends on the values of data and its scales.
Hence we need a measure that calculates relative degree of variation
so that it can be compared for the fits obtained from different models
and for different data sets. Coefficient of determination is such a
measure.
Coefficient of determination is defined as the ratio of explained
variance of the dependent variable to the total variance. It can be
Thus,
\[ \text{Coefficient of determination } b = \frac{\text{Explained Variance}}{\text{Total Variance}} = r^2 \]
b is the proportion of variation explained by the independent variable.
Remaining variation in data (1 – b) is due to some other factors. The
value (1 – b) is called coefficient of Non-determination and defined as,
\[ (1 - b) = 1 - r^2 = \frac{\text{Unexplained Variance}}{\text{Total Variance}} \]
Thus,
\[ \text{Coefficient of Alienation } k = \sqrt{1 - r^2} \]
Coefficient of determination is a measure of the strength of the
regression fit. It is an estimator of population parameter of correlation
and can be obtained directly from a decomposition of variation in Y
into two components, viz. due to error and due to regression. Error
is a deviation of a data point from its respective group mean. Thus
error is the deviation of a data point from its predicted value given
by the regression line. In analysis of variance (ANOVA) we also look at
the total deviation of a data point from the grand mean. Thus when we
consider the deviations we consider three kinds. Firstly, deviation of
a data point from the grand mean $(y - \bar{Y})$. Secondly, the deviation of a
data point from the predicted value using regression $(y - \hat{y})$. Thirdly,
the deviation of the predicted value of y from the grand mean $(\hat{y} - \bar{Y})$.
Thus, Total deviation = Unexplained deviation + Explained deviation
\[ \text{Or,} \quad (y - \bar{Y}) = (y - \hat{y}) + (\hat{y} - \bar{Y}) \]
Or, Total Deviation = Error + Regression.
$(\hat{y} - \bar{Y})$ is called explained deviation or regression deviation because
it can be explained by the regression relationship between X and
Y. Whereas the part $(y - \hat{y})$ is not explained by the regression
relationship. Hence it is called an error.
If we square deviations for all data points and sum them over all ‘n’
points, the simplification gives,
\[ \sum_{i=1}^{n} (y_i - \bar{Y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{Y})^2 \]
Or
Total Sum Squares = Sum Squares of Error + Sum Squares of
Regression
Or
SST = SSE + SSR
\[ \text{Thus, coefficient of determination} = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \]
Values of the coefficient of determination range from 0 to 1.
When r² = 1 the variation in Y is completely explained by variation in X.
This means all data points fall exactly on the regression line with no error. This
is called a perfect fit. In real business, there is always some error that is
not explained. If r² is close to 1, we say that there is a strong relationship.
On the other hand, if r² is zero or close to zero, there is hardly any linear
relationship between X and Y. In such a case we cannot use values of X to
predict values of Y. The higher the value of r², the better is the fit and we
can have more confidence in our predictions using the regression line.
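The decomposition SST = SSE + SSR can be checked on the weekly sales figures of the garment shop example. This is a sketch only; the function name `sums_of_squares` is hypothetical and the line is fitted with the least-squares formulas given earlier in the chapter.

```python
def sums_of_squares(xs, ys):
    """Fit y = a + b*x by least squares and return (SST, SSE, SSR)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y) / (
        sum(x * x for x in xs) / n - mean_x ** 2
    )
    a = mean_y - b * mean_x
    predicted = [a + b * x for x in xs]
    sst = sum((y - mean_y) ** 2 for y in ys)                     # total
    sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))       # error
    ssr = sum((p - mean_y) ** 2 for p in predicted)              # regression
    return sst, sse, ssr


weeks = [1, 2, 3, 4, 5, 6]
sales = [269, 262, 280, 270, 275, 281]   # in thousands of rupees
sst, sse, ssr = sums_of_squares(weeks, sales)
r_squared = ssr / sst
print(sst, sse + ssr, r_squared, 1 - sse / sst)
```

The first two printed values agree up to rounding, and so do the last two, illustrating that SSR/SST and 1 − SSE/SST are the same quantity for a least-squares line.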
14. ................... models that can be transformed to yield linear
models are called intrinsically linear.
15. Coefficient of ................... is square root of coefficient of non-
determination.
Estimate the simple linear regression relationship between market
share and product quality rating. Can you apply any nonlinear
model on the above data too? Explain with reason.
Computer models are available that deal with such estimations.
MS Excel does not have any tool directly dealing with this, but in
particular cases we can use ‘Moving Average’ and ‘Exponential
7.6 CORRELATION ANALYSIS VS REGRESSION ANALYSIS
Both the techniques are directed towards a common purpose of
establishing the degree and direction of relationship between two or
more variables but the methods of doing so are different. The choice of
one or the other will depend on the purpose. If the purpose is to know
the degree and direction of relationship, correlation is an appropriate
tool but if the purpose is to estimate a dependent variable with the
substitution of one or more independent variables, the regression
analysis shall be more helpful. The point of difference is discussed
below:
Degree and Nature of Relationship: The correlation coefficient
is a measure of degree of co variability between two variables
whereas regression analysis is used to study the nature of
relationship between the variables so that we can predict the value
of one on the basis of another. The reliance on the estimates or
predictions depends upon the closeness of relationship between
the variables.
Cause and Effect Relationship: The cause and effect relationship
is explained by regression analysis. Correlation is only a tool
to ascertain the degree of relationship between two variables,
and we cannot say that one variable is the cause and the other the
effect. A high degree of correlation between price and demand
for a commodity at a particular point of time may not suggest
which is the cause and which is the effect. However, in regression
analysis the cause and effect relationship is clearly expressed: one
variable is taken as dependent and the other as independent.
Like in correlation, regression analysis can also be studied as
‘simple and multiple’, ‘total and partial’, ‘linear and nonlinear’, etc.
depending upon the type of data and method we use for regression
analysis. Regression word implies ‘going back or falling back to
mean or average value’ but in most application of regression we
do not use regression in this sense. We use it for the forecasting
purpose or to understand underlying mathematical relationship.
Although correlation and regression both attempt to establish
whether relationship exists between two or more variables or
not, these two techniques differ in approach. If we only want to
know the degree and direction of relationship we use correlation
analysis. But if we want to forecast or predict the values we need
regression analysis.
In correlation, there is no distinction between independent and
dependent variables. But for regression analysis we need to
specify independent and dependent variables clearly. In case
7.7 SUMMARY
In this chapter, the concept of regression between dependent and
independent variables has been discussed. Regression provides
us a measure of the relationship and also facilitates to predict
one variable for a value of other variable.
Unlike correlation analysis, in regression analysis, one variable
we have some idea that the relationship between two variables is
parabolic. In any case second degree polynomial fit is more likely
to be better approximation of the actual relationship. We may use
second order model (parabolic trend) if we feel that the variation
is parabolic.
The least square approximation can be calculated easily for low
degree polynomials, like linear, parabolic, cubic, etc. But for higher
degrees (more than three), the system of normal equations becomes
ill conditioned. This causes large errors in values of coefficients.
Then the approximation becomes incorrect. To avoid these
problems, ‘orthogonal polynomials’ are used for approximation.
Mean Square Error (MSE) is an estimate of the variance of the
regression error. MSE depends on the values of data and its
scales. Hence we need a measure that calculates relative degree
of variation so that it can be compared for the fits obtained
from different models and for different data sets. Coefficient of
determination is such a measure.
Coefficient of determination is a measure of the strength of
the regression fit. It is an estimator of population parameter of
correlation and can be obtained directly from a decomposition
of variation in Y into two components, viz. due to error and
due to regression. Error is a deviation of a data point from its
respective group mean. Thus error is the deviation of a data point from
its predicted value given by the regression line.
Explain seasonal model and seasonal model with trend.
10. Explain the difference between correlation and regression analysis.
EXERCISE FOR PRACTICE
1. If $b_{XY} = \frac{3}{2}$ and $b_{YX} = \frac{1}{6}$, find the value of correlation coefficient
between X and Y.
                     X     Y
Mean                 36    85
Standard Deviation   11    8
The correlation coefficient between X and Y is 0.66. Find
regression equation of X on Y, hence estimate the value of X
when Y = 80.
3. A student obtains lines of regression of Y on X and X on Y as 2X
– 5Y – 7 = 0 and 3X + 2Y – 8 = 0 respectively. Is this correct?
4. The following data consist of the scores that 10 salesmen
made on a test designed to measure their aptitude for sales work
and their sales productivity over a period of time. The test score
is denoted by X and sales productivity by Y.
X 41 35 34 40 33 42 37 42 30 43
Y 32 20 35 24 27 28 31 33 26 41
(a) Calculate the correlation coefficient
(b) Find the equation of the least square line.
(c) Calculate the value of coefficient of determination and use it
to comment on the usefulness.
5. The XYZ store has been expanding market share during past 7
years, posting the following gross sales in millions of dollars.
Year 1994 1995 1996 1997 1998 1999 2000
Sales 15 21 25 33 38 48 52
(a) Find the linear estimating equation that best described
these data and also find the trend (estimated) value.
(b) Calculate the present trend for these data and identify the
year in which the fluctuation from the trend is largest.
(c) Forecast the sales value for the year 2001.
8. Order
Coefficient of Regression 9. Geometric
10. Intersect
11. Same
Nonlinear Regression Models 12. Parabolic
13. Orthogonal
14. Non-linear
15. Alienation
Correlation Analysis vs 16. Regression
Regression Analysis
17. Mean
18. Variables
19. Minimizing
20. Regression
of statistical theory which is widely used in all the scientific
disciplines. It is a basic technique for measuring or estimating
the relationship among economic variables that constitute the
essence of economic theory and economic life.
2. Refer Section 7.2
If the variables in a bivariate distribution are correlated, the
points in scatter diagram approximately cluster around some
curve. If the curve is straight line we call it as linear regression.
Otherwise, it is curvilinear regression. The equation of the curve
which is closest to the observations is called the ‘best fit’.
The best fit is calculated as per Legendre’s principle of least
sum squares of deviations of the observed data points from
the corresponding values on the ‘best fit’ curve. This is called
the minimum squared error criterion. It may be noted that the
deviation (error) can be measured in X direction or Y direction.
3. Refer Section 7.2.1
Regression analysis is one of the most popular and commonly
used statistical tools in business. The availability of computer
packages has simplified its use. The uses of regression
analysis are not confined to economic and business activities. Its
applications are extended to almost all the natural, physical and
social sciences.
6. Refer Section 7.4
The coefficients of regression are bYX and bXY. Properties of
Regression Coefficients are:
The coefficient of correlation is the geometric mean of the two
regression coefficients.
Both the regression coefficients are either positive or negative. It
means that they always have identical sign i.e., either both have
positive sign or negative sign.
7. Refer Section 7.5
Least square principle can also be applied to the fitting of a second
degree polynomial which may be useful in business situation if
we have some idea that the relationship between two variables is
parabolic. In any case second degree polynomial fit is more likely
to be better approximation of the actual relationship. We may use
second order model (parabolic trend) if we feel that the variation
is parabolic. Here we will discuss only one nonlinear model i.e.
polynomial of second degree.
8. Refer Section 7.5
Orthogonal polynomials determine the coefficients directly
without having to solve normal equations. The Legendre
and Chebyshev polynomials are the well-known orthogonal
polynomials.
4. (a) Correlation coefficient = 0.442547
(b) y = mx + c Slope m = 0.587181; Y intercept c = 7.563281
(c) Coefficient of determination b = r2 = 0.1958
This indicates that only 19.58% of the variation in Y is explained
by the variation in X as per trend line. About 80% of the variation
is due to some other factors. Thus we cannot really estimate the
variation in Y from the variation in X.
5. (a) Slope m = 6.357143; Y intercept c = -12662.1
(b) Table shows trend values. Fluctuation from the trend is
largest in year 1999
Year               1994      1995      1996      1997      1998      1999      2000
Sales              15        21        25        33        38        48        52
Trend (Estimate)   14.07143  20.42857  26.78571  33.14286  39.5      45.85714  52.21429
Fluctuation        0.928571  0.571429  -1.78571  -0.14286  -1.5      2.142857  -0.21429
E-REFERENCES
http://www.statsoft.com/Textbook/Multiple-Regression
http://obsessionwithregression.blogspot.in/
http://www.statmethods.net/stats/regression.html
THEORY OF PROBABILITY
CONTENTS
8.1 Introduction
8.2 Important Terms in Probability
8.3 Kinds of Probability
8.4 Simple Propositions of Probability
8.5 Addition Theorem of Probability
8.6 Multiplication Theorem of Probability
8.7 Conditional Probability
8.8 Law of Total Probability
INTRODUCTORY CASELET
N O T E S
S
The above chart shows the percentage of professional women and
men who fear public speaking. These percentages can be written
IM
as conditional probabilities as follows. Suppose one professional
is selected at random. Then, given that this person is a female, the
probability is .35 that she has a public speaking fear. On the other
hand, if this selected person is a male, this probability is only .11.
These probabilities can be written as follows:
8.1 INTRODUCTION
A probability is a quantitative measure of risk. The statistician I.J.
Good suggests, “The theory of probability is much older than the
human species, since the assessment of uncertainty incorporates
the idea of learning from experience, which most creatures do.”
Development of probability theory in Europe is associated with
gamblers in the famous European casinos, such as the one at Monte
Carlo. It is also associated with astrology.
This chapter provides exposure to fundamental concepts, since
probability is inseparable from statistical methods. Those not
familiar with the subject are advised to study the details from any
There is a chance or risk (sometimes also called as uncertainty)
associated with each outcome.
Sample Space: It is a set of all possible outcomes of an experiment. It
is usually represented as S. For example, if the random experiment is
rolling of a die, the sample space is a set, S = {1, 2, 3, 4, 5, 6}. Similarly,
if the random experiment is tossing of three coins, the sample space is,
S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT} with total of 8
possible outcomes. (H is heads, and T is Tails showing up.)
If we select a random sample of 2 items from a production lot and
check them for defect, the sample space will be S = {DD, DS, DR, RS,
RR, SS} where D stands for defective, S stands for serviceable and R
stands for re-workable.
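The three-coin sample space above can be listed mechanically. The following is a minimal sketch, not part of the original text; the event "exactly two heads" is a hypothetical choice made only for illustration.

```python
from itertools import product

# Each coin shows heads (H) or tails (T); three coins give 2**3 outcomes.
sample_space = ["".join(toss) for toss in product("HT", repeat=3)]

# Hypothetical event of interest: exactly two heads show up.
event = [outcome for outcome in sample_space if outcome.count("H") == 2]

print(sample_space)
print(len(sample_space), len(event))
```

Counting the elements of the event and of the sample space is all that the classical definition of probability needs.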
Event: One or more possible outcomes that belong to a certain
category of interest are called an event. A subset E of
the sample space S is an event. In other words, an event is a
favourable outcome.
Event space: It is the set of all possible events. It is usually
represented as E. Note that in probability and statistics
we are usually interested in the number of elements in the sample space and
the number of elements in the event space.
Union of events: If E and F are two events, then another event
defined to include all outcomes that are either in E or in F or in
both is called the union of events E and F. It is denoted as E ∪ F.
The events that (i) an employee would be late, and (ii) the employee
would be absent, on a particular day, are mutually exclusive since
both cannot occur simultaneously. An employee cannot be both
late and absent on a particular day. On the other hand, two or more
events which are not mutually exclusive are called overlapping
events. Suppose A represents the event that the number on the
card chosen is divisible by 3 and B represents the event that the
number is divisible by 5, then for A to occur the number must be
either 3, 6, 9, 12, 15 or 18, and for B to occur, it must be one of 5, 10,
15 and 20. Note that if the number 15 is obtained, it implies that
both A and B have taken place. Thus, A and B are not mutually
exclusive.
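The overlap between A and B can be checked with set operations. This is a minimal sketch; it assumes the cards are numbered 1 to 20, which is inferred from the values listed above and is not stated explicitly in the text.

```python
# Assumption: the cards are numbered 1 to 20.
cards = set(range(1, 21))

A = {n for n in cards if n % 3 == 0}  # number divisible by 3
B = {n for n in cards if n % 5 == 0}  # number divisible by 5

common = A & B  # outcomes for which both A and B occur
mutually_exclusive = len(common) == 0

print(sorted(common), mutually_exclusive)
```

An empty intersection would mean the events are mutually exclusive; here the intersection is not empty, so A and B overlap.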
Classical Probability
This is also called Mathematical Probability or Objective Probability or
A-priori Probability. This probability is based on the assumption that
certain occurrences are equally likely. For example, if an unbiased die
is rolled, numbers 1 to 6 are equally likely to appear on the top face.
If there are n mutually exclusive, collectively exhaustive and equally
likely outcomes of an experiment and if m of them are favourable to
an event E, then the probability of occurrence of E, denoted by P(E),
is defined as,
$P(E) = \frac{m}{n}$, where $0 \le m \le n$. Thus, $0 \le P(E) \le 1$.
This definition is based on a-priori knowledge that the outcomes are
equally likely and that the total number of outcomes is finite, for example,
the draw of a card from a shuffled pack of 52 cards, or a throw of a die, or a toss of
a coin. If any of these assumptions is not true, then the classical
definition given above does not hold, for example, the toss of a biased
coin, or the throw of dice by ‘Shakuni Mama’ in the epic Mahabharat.
This definition also has a serious drawback: How do we know with
certainty that the outcomes are equally likely? If it cannot be proven
mathematically or logically, this definition is not complete.
Relative Frequency Probability
This is another type of objective probability. It is also called as
experimental probability.
do we know that the ratio $\frac{m}{n}$ will converge to some constant value that
will be the same every time we carry out the experiment? If we carry
out an experiment of flipping a coin and our event is getting heads,
we do not observe any systematic series so as to prove mathematically
that the ratio $\frac{m}{n}$ converges to $\frac{1}{2}$.
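The behaviour of the ratio m/n can be explored by simulation. The sketch below assumes m is the number of heads observed in n flips of a fair coin (the usual reading of the ratio); the function name and the seed values are hypothetical. A simulation only illustrates the tendency, it does not prove convergence.

```python
import random

def relative_frequency(trials, seed):
    """Return m/n, where m is the number of heads in `trials` simulated fair flips."""
    rng = random.Random(seed)  # seeded so that a run can be repeated
    heads = sum(rng.random() < 0.5 for _ in range(trials))
    return heads / trials

for n in (10, 100, 10_000):
    print(n, relative_frequency(n, seed=1))
```

Changing the seed changes the sequence of ratios, which is exactly the difficulty raised above: the experiment itself gives no guarantee of a single limiting value.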
Subjective Probability
Axiomatic Probability
Earlier definitions that we have discussed make certain assumptions.
However, to assume that $\frac{m}{n}$ will necessarily converge to some
constant value every time the experiment is performed, or that the outcomes
are equally likely, seem to be very complex assumptions. It would be
more reasonable to assume a set of simpler and logically self-evident
axioms (assumptions on which a theory is based). Then base the
probability definition on these axioms. This is the modern axiomatic
approach to probability theory. Russian mathematician A.N.
Kolmogorov developed this concept that combines both the objective
and subjective concepts of probability.
Consider an experiment whose sample space is S. For each event E
of the sample space S we assume that a number P (E) is referred as
probability of event E if it satisfies the following axioms.
Axiom 1: 0 ≤ P (E) ≤ 1
Axiom 2: P(S) = 1 (certain event). This also implies,
P (Φ) = 0 (impossible event).
State whether the following statements are true/false:
4. Classical probability is also called Mathematical Probability or
Objective Probability or A-priori Probability.
5. Subjective probability is also called as experimental probability.
Proposition 1
$P(E^C) = 1 - P(E)$
Probability of complement: Let event $E^C$ denote the complement of the
event E. By definition of complement, $E^C$ has all elements
from the sample space S that are not in E. Thus, E and $E^C$ are mutually
exclusive and collectively exhaustive. Therefore, by axioms 2 and 3 we
have,
$1 = P(S) = P(E \cup E^C) = P(E) + P(E^C)$
or,
$P(E^C) = 1 - P(E)$
Proposition 2
If E ⊂ F, then P (E) ≤ P (F)
If the event E is contained in event F, that is, E ⊂ F, then we can express,
$F = E \cup (E^C \cap F)$.
However, as events E and $(E^C \cap F)$ are mutually exclusive, we get,
$P(F) = P(E) + P(E^C \cap F) \ge P(E)$, since $P(E^C \cap F) \ge 0$ by axiom 1.
Proposition 3
P (E ∪ F) = P (E) + P (F) – P (E ∩ F)
Probability of unions: Event E ∪ F can be written as the union of the
two disjoint events, namely E and $(E^C \cap F)$. Thus, from axiom 3,
$P(E \cup F) = P[E \cup (E^C \cap F)] = P(E) + P(E^C \cap F)$ (1)
Also, $F = (E \cap F) \cup (E^C \cap F)$, hence,
$P(F) = P(E \cap F) + P(E^C \cap F)$ (2)
From (1) and (2) we get the proposition 3 as,
$P(E \cup F) = P(E) + P(F) - P(E \cap F)$
The extended statement of this proposition for n events is called the
inclusion-exclusion principle. For three events E, F and G it reads,
$P(E \cup F \cup G) = P(E) + P(F) + P(G) - P(E \cap F) - P(F \cap G) - P(E \cap G) + P(E \cap F \cap G)$
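Proposition 3 and its three-event extension can be verified on a small sample space. The sketch below uses a single roll of a fair die; the three events chosen (even number, number greater than 3, prime number) are hypothetical and serve only as an illustration.

```python
from fractions import Fraction

S = set(range(1, 7))  # sample space for one roll of a fair die

def prob(event):
    """Classical probability: favourable outcomes over total outcomes."""
    return Fraction(len(event & S), len(S))

E = {2, 4, 6}  # even number
F = {4, 5, 6}  # number greater than 3
G = {2, 3, 5}  # prime number

lhs2 = prob(E | F)
rhs2 = prob(E) + prob(F) - prob(E & F)

lhs3 = prob(E | F | G)
rhs3 = (prob(E) + prob(F) + prob(G)
        - prob(E & F) - prob(F & G) - prob(E & G)
        + prob(E & F & G))

print(lhs2, rhs2, lhs3, rhs3)
```

Exact fractions are used instead of floating point numbers so that both sides can be compared for strict equality.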
Proposition 4
Mutually exclusive events: When the sets corresponding to two
events are disjoint (have no common elements, or the intersection is
null), the two events are called mutually exclusive.
E ∩ F = Φ Therefore,
P (E ∩ F) = P (Φ) = 0
Also, for mutually exclusive events E and F,
P (E ∪ F) = P (E) + P (F)
Proposition 5
$P(E^C \cap F) = P(F) - P(E \cap F)$
From set theory, F can be written as a union of two disjoint events E ∩ F
and $E^C \cap F$. Hence, by Axiom 3, we have, $P(F) = P(E \cap F) + P(E^C \cap F)$.
By re-arranging the terms we get the result.
Fill in the blanks:
7. Proposition 1 is defined as $P(E^C)$ = ...................
8. Event E ∪ F can be written as the ................... of the two disjoint
events namely E and $(E^C \cap F)$.
A gambler has four cards – two diamonds and two clubs. The
gambler proposes the following game to you: You will leave the
room and the gambler will put the cards face down on a table. When
you return to the room, you will pick two cards at random. You will
win $10 if both cards are diamonds, you will win $10 if both are
clubs, and for any other outcome you will lose $10. Assuming that
there is no cheating, should you accept this proposition? Support
your answer by calculating your probability of winning $10.
The result of this addition theorem is generally written using set notation,
P (A ∪ B) = P (A) + P (B) – P (A ∩ B),
Where, P (A) = probability of occurrence of event ‘A’
P (B) = probability of occurrence of event ‘B’
P (A ∪ B) = probability of occurrence of event ‘A’ or event ‘B’.
P (A ∩ B) = probability of occurrence of event ‘A’ and event ‘B’.
The addition theorem of probability can be stated and proved as follows:
Let ‘A’ and ‘B’ be subsets of a finite non-empty set ‘S’. Then, according
to the addition rule for sets,
n (A ∪ B) = n (A) + n (B) – n (A ∩ B),
where n ( ) denotes the number of elements in a set. On dividing both sides by n(S), we get
n (A ∪ B) / n(S) = n (A) / n(S) + n (B) / n(S) – n (A ∩ B) / n(S) (1).
If the sets ‘A’ and ‘B’ correspond to the two events ‘A’ and ‘B’
of a random experiment and if the set ‘S’ corresponds to the
sample space ‘S’ of the experiment, then the equation (1) becomes
8.6 MULTIPLICATION THEOREM OF PROBABILITY
Probability is the branch of mathematics which deals with the
occurrence of random events. The basic form of the multiplication theorem of
probability for two events ‘x’ and ‘y’ can be stated as,
P (x. y) = p (y). p (x / y)
Here p (x) and p (y) are the probabilities of occurrence of events ‘x’
and ‘y’ respectively.
p (x / y) is the conditional probability of ‘x’, the condition being that
‘y’ has occurred before ‘x’.
p (x / y) is always calculated after ‘y’ has occurred. Here, the occurrence of
‘x’ depends on ‘y’: ‘y’ has already changed the set of possible outcomes, so
the probability of occurrence of ‘x’ also changes.
Now, according to the multiplication theorem of probability,
P (x. y) = p (y). p (x / y) (equation 2)
Substituting p (x / y) from “equation 1” in “equation 2”, we get
P (x. y) = p(x).p(y),
This is the special case of this theorem.
This case is valid only when events are independent.
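The difference between the general theorem and its special case for independent events can be seen in a small worked computation. The card-drawing setting below is a hypothetical illustration and does not come from the text: y is "the first card drawn is an ace" and x is "the second card drawn is an ace".

```python
from fractions import Fraction

# Drawing without replacement: the first draw changes the pack,
# so the conditional probability p(x / y) differs from p(x).
p_y = Fraction(4, 52)          # first card is an ace
p_x_given_y = Fraction(3, 51)  # second card is an ace, given the first was
both_aces = p_y * p_x_given_y  # P(x.y) = p(y) . p(x / y)

# Drawing with replacement: the draws are independent,
# so P(x.y) reduces to p(x) . p(y).
with_replacement = Fraction(4, 52) * Fraction(4, 52)

print(both_aces, with_replacement)
```

The two results differ because only in the second case does the occurrence of y leave the probability of x unchanged.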
14. The intersection of A and B represents the collection of all
outcomes that are common to both A and B and is denoted by
A and B.
According to data from the Centers for Disease Control and
Prevention, there were a total of 823,542,000 visits to physicians
Conditional probability satisfies all the properties and axioms of
probabilities. Now onwards, we would write (E ∩ F) as EF, which is a
common convention.
Solution: Let S denote that the product is successful, L denote that the
competitor will launch a product and $L^C$ denote that the competitor will not
launch the product. Now, from given data,
$P(S/L^C) = 0.67$, $P(S/L) = 0.42$, $P(L) = 0.35$
Hence, $P(L^C) = 1 - P(L) = 1 - 0.35 = 0.65$
Now, using conditional probability formula, probability that the
product will be success P(S) is,
$P(S) = P(S/L)\,P(L) + P(S/L^C)\,P(L^C) = 0.42 \times 0.35 + 0.67 \times 0.65 = 0.5825$
1. If one flight is selected at random from these 1700 flights, find
the probability that this flight is
(a) more than 1 hour late
(b) less than 30 minutes late
(c) a flight on airline A given that it is 30 minutes to 1 hour late
(d) more than 1 hour late given that it is a flight on airline B
2. Are the events “airline A” and “more than 1 hour late” mutually
exclusive? What about the events “less than 30 minutes late”
and “more than 1 hour late”? Why or why not?
3. Are the events “airline B” and “30 minutes to 1 hour late”
independent? Why or why not?
Consider two events, E and F. Whatever the events may be, we can
always say that the probability of E is equal to the probability of the
intersection of E and F, plus the probability of the intersection of E
and the complement of F. That is,
$P(E) = P(E \cap F) + P(E \cap F^C)$
$E = (E \cap F) \cup (E \cap F^C)$
Any element in E must be either in both E and F or in E but not
in F. $(E \cap F)$ and $(E \cap F^C)$ are mutually exclusive, since the former must be in
F and the latter must not be in F. Hence, by Axiom 3,
$P(E) = P(E \cap F) + P(E \cap F^C) = P(E/F) \times P(F) + P(E/F^C) \times P(F^C)$
$\sum_{i=1} P(E/F_i) \times P(F_i)$
matter, say market shares of competitors, then Bayes’ formula
gives us how these should be modified by the new evidence of the
experiment, say a market survey.
Example: A bin contains 3 different types of lamps. The probability
that a type 1 lamp will give over 100 hours of use is 0.7, with the
corresponding probabilities for type 2 and 3 lamps being 0.4 and 0.3
respectively. Suppose that 20 per cent of the lamps in the bin are of
type 1, 30 per cent are of type 2 and 50 per cent are of type 3.
What is the probability that a randomly selected lamp will last more
than 100 hours?
Given that a selected lamp lasted more than 100 hours, what are the
conditional probabilities that it is of type 1, type 2 and type 3?
Solution: Let type 1, type 2 and type 3 lamps be denoted by T1, T2 and
T3 respectively. Also, we denote by S the event that a lamp lasts more than 100 hours
and by $S^C$ the event that it does not. Now, as per given data,
P(S|T1) = 0.7, P(S|T2) = 0.4, P(S|T3) = 0.3
P(T1) = 0.2, P(T2) = 0.3, P(T3) = 0.5
Now, using conditional probability formula,
$P(S) = P(S|T_1)P(T_1) + P(S|T_2)P(T_2) + P(S|T_3)P(T_3)$
$= 0.7 \times 0.2 + 0.4 \times 0.3 + 0.3 \times 0.5 = 0.41$
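The second part of the lamp example asks for the conditional probabilities of each type given that the lamp lasted more than 100 hours. A minimal sketch of that computation with Bayes' formula follows; the dictionary names are hypothetical, and the figures are the ones given in the example.

```python
from fractions import Fraction

# Proportion of each lamp type in the bin, P(T).
priors = {"T1": Fraction(2, 10), "T2": Fraction(3, 10), "T3": Fraction(5, 10)}
# Probability that each type lasts more than 100 hours, P(S|T).
likelihoods = {"T1": Fraction(7, 10), "T2": Fraction(4, 10), "T3": Fraction(3, 10)}

# Law of total probability: P(S) = sum of P(S|T) P(T) over the types.
p_s = sum(likelihoods[t] * priors[t] for t in priors)

# Bayes' formula: P(T|S) = P(S|T) P(T) / P(S).
posteriors = {t: likelihoods[t] * priors[t] / p_s for t in priors}

print(float(p_s))
for lamp_type, value in posteriors.items():
    print(lamp_type, round(float(value), 4))
```

The three conditional probabilities must add up to 1, which is a quick check on the arithmetic.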
Now we need to find the probability that the item has come from C when we
know that it is defective, i.e. P(C|D). Using Bayes’ formula,
$$P(C|D) = \frac{P(D|C)\,P(C)}{P(D|A)\,P(A) + P(D|B)\,P(B) + P(D|C)\,P(C)}$$
$$= \frac{0.15 \times 0.5}{0.25 \times 0.35 + 0.05 \times 0.15 + 0.15 \times 0.5} = \frac{0.075}{0.17} = 0.44$$
Example: A product is produced on three different machines M1, M2
and M3 with proportion of production from these machines as 50%,
30% and 20% respectively. The past experience shows percentage
defectives from these machines as 3%, 4% and 5% respectively. At
the end of the day’s production, one unit of production is selected at
random and it is found to be defective. What is the chance that it is
manufactured by machine M2?
Solution: Let M1, M2 and M3 be the events that the product is
manufactured on machines M1, M2 and M3 respectively. Let D be the
event that the item is defective. The given information can be written as,
P(M1) = 0.5, P(M2) = 0.3, P(M3) = 0.2,
P(D|M1) = 0.03, P(D|M2) = 0.04 and P(D|M3) = 0.05
Using Bayes’ formula,
$$P(M_2|D) = \frac{0.3 \times 0.04}{0.5 \times 0.03 + 0.3 \times 0.04 + 0.2 \times 0.05} = 0.324$$
Two thousand randomly selected adults were asked if they think they
are financially better off than their parents. The following table gives
the two-way classification of the responses based on the education
levels of the persons included in the survey and whether they are
financially better off, the same, or worse off than their parents.
              Less than High School   High School   More than High School
Better off    140                     450           420
Same          60                      250           110
Worse off     200                     300           70
1. Suppose one adult is selected at random from these 2000
adults. Find the following probabilities.
(a) P(better off and high school)
(b) P(more than high school and worse off )
2. Find the joint probability of the events “worse off” and “better
off.” Is this probability zero?
Explain why or why not.
P(F|E) = P(F)
In other words, two events are independent, if knowledge of
occurrence of one event does not modify probability of the other
event. For example, the outcome of the first toss of a coin (heads or tails) does
not affect the probability of the second toss landing heads. Two events
that are not independent are said to be dependent. Also, if events E
and F are independent, so are E and FC.
Example: A bag contains 4 tickets numbered 112, 121, 211 and 222.
One ticket is drawn randomly. Let Ai be the event that ith digit of the
number on the ticket is 1 with i = 1, 2, 3. Comment on pair-wise and
mutual independence of A1, A2 and A3.
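The ticket example can be worked through by direct enumeration. The sketch below is one possible way to check the product condition for each pair of events and for all three together; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

tickets = ["112", "121", "211", "222"]

def prob(event):
    """Classical probability of an event (a set of tickets); all tickets equally likely."""
    return Fraction(len(event), len(tickets))

# A[i] is the set of tickets whose i-th digit is 1.
A = {i: {t for t in tickets if t[i - 1] == "1"} for i in (1, 2, 3)}

# Pairwise independence: P(Ai and Aj) = P(Ai) P(Aj) for every pair.
pairwise = all(
    prob(A[i] & A[j]) == prob(A[i]) * prob(A[j])
    for i, j in combinations(A, 2)
)

# Mutual independence also needs the product rule for all three events together.
p_all = prob(A[1] & A[2] & A[3])
mutual = pairwise and p_all == prob(A[1]) * prob(A[2]) * prob(A[3])

print(pairwise, mutual)
```

No ticket has 1 in all three positions, so the joint probability of the three events is 0 while the product of their individual probabilities is 1/8. The events are therefore pairwise independent but not mutually independent.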
Solution: Let I, II, and III be the three events that the vans I, II
and III are available. The probability P that at least one recovery van
will be available is the probability of the union of these events. Further, since
the availabilities of the vans are independent, their joint
probability is the product of the individual probabilities. Thus,
$P(I \cup II \cup III) = P(I) + P(II) + P(III) - P(I \cap II) - P(I \cap III) - P(II \cap III) + P(I \cap II \cap III)$
I go to my friend.
He tells me, “I have two children”. What is the probability that my
friend has a son?
As I sit down, one girl comes in and offers me a glass of water. My
friend says, “Please meet my daughter”. Now, what is the probability
that my friend has a son?
After I thank her for water, my friend adds, “She is ‘Didi’ or ‘Tai’
(meaning the elder child)”. Now, what is the probability that my
friend has a son?
After some time one boy enters. My friend introduces him as his
son. Now, what is the probability that my friend has a son?
If we toss a six-faced die and call the event of appearance of an even
number as the event A and the appearance of an odd number as the
event B. Now, suppose that in the first toss we get an even number.
If we toss the die the second time, we can still get an even or an
odd number and their chances are not influenced by the result of
the first trial. Thus, the appearance of an even number in the first
trial and the appearance of an even number in the second trial is an
8.10.2 SUM RULE OF COUNTING
If one task can be done in n1 ways and other task can be done in n2
ways and if these tasks cannot be done at the same time, then there are
(n1 + n2) ways of doing one of these tasks (either one task or the other).
When logical OR is used in deciding outcomes of the experiment and
events are mutually exclusive then the ‘Sum Rule’ is applicable.
For example, an urn contains 10 balls of which 5 are white, 3 black
and 2 red. If we select one ball randomly, how many ways are there
that the ball is either white or red? Answer is 5 + 2 = 7. Note that the
sum rule is nothing but the Axiom 3.
8.10.3 PERMUTATION
of these objects. An ordered arrangement of r elements of a set is
called an r-permutation.
The number of r-permutations of a set with n elements, where n is a
nonnegative integer and 0 ≤ r ≤ n, equals,
$$P(n, r) = n \times (n-1) \times (n-2) \times \cdots \times (n-r+1) = \frac{n!}{(n-r)!}$$
This is also number of ways of drawing items from a set without
8.10.4 COMBINATION
$$C(n, r) = \frac{P(n, r)}{r!} = \frac{n!}{r!\,(n-r)!}$$
C (n, r) is also called a binomial coefficient, since it is a coefficient of rth
term in a binomial expansion. Note that r-combination is also written
as $^nC_r$ or $\binom{n}{r}$.
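The relation between P(n, r) and C(n, r) can be checked numerically. The sketch below uses Python's standard `math.perm` and `math.comb` functions (available from Python 3.8); the values n = 5 and r = 3 are hypothetical and chosen only for illustration.

```python
import math

n, r = 5, 3  # hypothetical values for illustration

p = math.perm(n, r)  # number of ordered arrangements, n! / (n - r)!
c = math.comb(n, r)  # number of unordered selections, n! / (r! (n - r)!)

# Each unordered selection of r items can be arranged in r! different orders.
same = c == p // math.factorial(r)

print(p, c, same)
```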
Combinations with Repetition
Number of r-combinations of a set with n elements when repetition of
elements is allowed, equals,
$$\binom{n+r-1}{r} = \frac{(n+r-1)!}{r!\,(n-1)!}$$
For example, if we have to select 6 ice-creams (r) of available 4 flavours
(n), it can be done in
$$C(4+6-1,\,6) = C(9, 6) = \frac{9!}{6!\,3!} = \frac{9 \times 8 \times 7}{3 \times 2 \times 1} = 84 \text{ ways.}$$
This is also the number of ways of distributing r identical objects in n
boxes where empty box is allowed.
Further, it also gives number of non-negative integer solutions of an
equation,
x1 + x2 + …+ xn = r
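The three readings of this count (selection with repetition, formula, and non-negative integer solutions) can be compared by brute force for the ice-cream example with n = 4 and r = 6. This is a minimal sketch; the variable names are hypothetical.

```python
import math
from itertools import combinations_with_replacement, product

n, r = 4, 6  # 4 flavours, 6 ice-creams, as in the example above

# Formula for r-combinations with repetition.
by_formula = math.comb(n + r - 1, r)

# Direct listing of all selections of r items from n flavours, repetition allowed.
by_listing = sum(1 for _ in combinations_with_replacement(range(n), r))

# Non-negative integer solutions of x1 + x2 + ... + xn = r.
solutions = sum(1 for x in product(range(r + 1), repeat=n) if sum(x) == r)

print(by_formula, by_listing, solutions)
```

Each solution (x1, ..., xn) records how many ice-creams of each flavour are chosen, which is why the counts coincide.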
Solved Examples
Example: In a triangular series the probability of Indian team
winning match with Pakistan is 0.7 and that with Australia is 0.4.
If the probability of India winning both matches is 0.3, what is the
probability that India will win at least one match so that it can enter
the final?
Solution: Let A be the event that the Indian team wins the match with
Pakistan and B the event that it wins the match with Australia. Given
that P (A) = 0.7, P (B) = 0.4 and P(A ∩ B) = 0.3,
the probability that India will win at least one match is,
$P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.7 + 0.4 - 0.3 = 0.8$
Example: What is the probability of a hand of 13 dealt from a shuffled
pack of 52 cards, containing exactly 2 kings and 1 ace?
Solution: Out of 13 cards, 2 kings must come from 4 kings in $\binom{4}{2}$
ways, 1 ace must come from 4 aces in $\binom{4}{1}$ ways, and the remaining 10
cards must come from the 44 non-king and non-ace cards in $\binom{44}{10}$ ways. Thus,
by product rule, the required probability of hand of 13 containing
exactly 2 kings and 1 ace is,
$$\frac{\binom{4}{2}\binom{4}{1}\binom{44}{10}}{\binom{52}{13}} = 0.09378$$
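The binomial coefficients in this answer are large, so the arithmetic is easier to check by machine. The sketch below evaluates the same expression with `math.comb`; the variable names are hypothetical.

```python
import math

# Ways to pick 2 of 4 kings, 1 of 4 aces and 10 of the 44 remaining cards.
favourable = math.comb(4, 2) * math.comb(4, 1) * math.comb(44, 10)

# Ways to pick any 13 cards out of 52.
total = math.comb(52, 13)

probability = favourable / total
print(probability)
```

The printed value should agree with the figure quoted above up to rounding.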
Example: In a dairy, milk is filled in sachets of 500 g by machines
A, B and C, which account for 25%, 35% and 40% of the total output respectively. It is also
found that 5, 4, and 2 per cent of the sachets filled by machines A,
B and C respectively have either over filling or under filling of milk. A government
inspector made a random check and found that the sachet was under
filled and booked a case against the dairy. What are the probabilities
that it was filled by machine A, B and C?
Solution: Given: P(A) = 0.25, P(B) = 0.35, P(C) = 0.4
If we indicate under fill or overfill as D (defective),
P(D|A) = 0.05, P(D|B) = 0.04, P(D|C) = 0.02
Now, we have to find P(A|D), P(B|D) and P(C|D) respectively.
Probability that it was filled by machine A is,
$$P(A|D) = \frac{P(D|A)\,P(A)}{P(D|A)\,P(A) + P(D|B)\,P(B) + P(D|C)\,P(C)}$$
$$= \frac{0.05 \times 0.25}{0.05 \times 0.25 + 0.04 \times 0.35 + 0.02 \times 0.4} = \frac{0.0125}{0.0345} = 0.362$$
Similarly,
$$P(B|D) = \frac{P(D|B)\,P(B)}{P(D|A)\,P(A) + P(D|B)\,P(B) + P(D|C)\,P(C)}$$
$$= \frac{0.04 \times 0.35}{0.05 \times 0.25 + 0.04 \times 0.35 + 0.02 \times 0.4} = 0.406$$
Also,
$$P(C|D) = \frac{P(D|C)\,P(C)}{P(D|A)\,P(A) + P(D|B)\,P(B) + P(D|C)\,P(C)}$$
$$= \frac{0.02 \times 0.4}{0.05 \times 0.25 + 0.04 \times 0.35 + 0.02 \times 0.4} = 0.232$$
In a Monster.com online poll during January 14–21, 2001,
respondents were asked the question, “Which is the best job at the
Super Bowl?” (USA TODAY, February 1, 2002). There were a total
of 30,270 (self-selected) responses. The most popular response was
“player” with 11,715 votes, and the second most popular response
was “announcer/reporter” with 9,982 votes. If one of the 30,270
responses is selected at random, what is the probability that the
8.11 SUMMARY
In this chapter, we discussed basic idea of probability. We defined
probability in different ways and pointed out serious limitations
of each definition.
Then we discussed axioms of probability, which are the backbone
of theory of probability. Then we studied number of useful
propositions of probability.
We also defined conditional probability, law of total probability,
and Bayes’ Theorem. We also defined mutually exclusive events,
and independence of events.
Lastly, we discussed a few important concepts of combinatorial
analysis, which come in very handy while calculating the probability
of an event.
probability of one and/or two events occurring at the same
time is equal to the probability of the first event occurring,
plus the probability of the second event occurring, minus the
probability that both events occur at the same time.
Multiplicative Rule: The probability of two independent
events occurring simultaneously is the product of the
individual probabilities.
Conditional Probability: It states the probability of event (A)
EXERCISE FOR PRACTICE
1. Consider an experiment of rolling a fair die. Let event A be that
an even number appears on the upper face, and event B be that the
number on the upper face is greater than 3. Find the probability
that the number appearing on the upper face belongs to either event A or B.
2. Three balls are randomly selected without replacement from a
bag containing 20 balls numbered 1, 2, through 20. If we bet that
at least one of the balls has a number greater than or equal to 17,
what is the probability that we will win the bet?
3. A bag contains 4 white and 2 black balls. Another bag contains 3
white and 5 red balls. One ball is drawn from each bag. What is
the probability that they are of different colours?
4. An office has three Xerox machines X1, X2 and X3. The
probability that on a given day machines X1, X2 and X3 would
work is 0.60, 0.75 and 0.80 respectively; both X1 and X2 work is
0.50; both X1 and X3 work is 0.40; both X2 and X3 work is 0.70.
The probability that all of them work is 0.25. Find the probability
that on a given day at least one of the three machines works.
5. A factory has 65% male workers. 70% of the total workers
are married. 47% of the male workers are married. Find the
probability that a worker chosen randomly is,
(i) Married female. (ii) Male, or married, or both.
13. False
14. True
Conditional Probability 15. Conditional
16. P(E|F) × P(F)
HINTS FOR DESCRIPTIVE QUESTIONS
1. Refer Section 8.2
Random experiment is an experiment whose outcome is not
predictable in advance.
One or more possible outcomes that belong to a certain category of
interest are called an event. A subset E of the sample space S
is an event. In other words, an event is a favorable outcome.
5. Refer Section 8.6
The basic form of Multiplication theorems on probability for two
events ‘X’ and ‘Y’ can be stated as,
P (x. y) = p (y). p (x / y)
Here p (x) and p (y) are the Probabilities of occurrences of events
‘x’ and ‘y’ respectively.
P (x / y) is the Conditional Probability of ‘x’ and the condition is
that ‘y’ has occurred before ‘x’.
P (x / y) is always calculated after ‘y’ has occurred. Here,
occurrence of ‘x’ depends on ‘y’. ‘y’ has changed some events
already. So, occurrence of ‘x’ also changes.
6. Refer Section 8.7
Conditional probability is the probability that an event will occur
given that another event has already occurred. If A and B are two
events, then the conditional probability of A given B is written as
P (A/B) and read as “the probability of A given that B has already
occurred.”
7. Refer Section 8.8
Consider two events, E and F. whatsoever be the events, we can
always say that the probability of E is equal to the probability of
intersection of E and F, plus, the probability of the intersection of
$= P(E|F) \times P(F) + P(E|F^C) \times [1 - P(F)]$
9. Refer Section 8.9
Two events are said to be independent of each other if and only if
the following three conditions hold:
P(EF) = P(E) × P(F) (This is the most useful result.)
P(E/F) = P(E)
P(F/E) = P(F)
10. Refer Section 8.10
Combinatorial concepts are useful in calculating probability
of the event, particularly when the problem can be solved by
classical probability theory. They are product rule of counting,
sum rule of counting, permutation and combination.
E-REFERENCES
http://math.berkeley.edu/~isammis/55.S08/55PS7.pdf
http://webbut.unitbv.ro/bulletin/Series%20II/BULETIN%20II/07-Pacurar.pdf
http://www.shmoop.com/basic-statistics-probability/and-or-probability-exercises-3.html
PROBABILITY DISTRIBUTION
CONTENTS
9.1 Introduction
9.2 Random Variable
9.2.1 Discrete and Continuous Random Variables
9.2.2 Probability Mass Function (p.m.f.)
9.2.3 Probability Density Function
9.2.4 Cumulative Distribution Function
9.2.5 Expectation Value of Random Variables
9.2.6 Expected Value of a Function of a Random Variable
INTRODUCTORY CASELET
But that doesn’t mean the player will hit the ball exactly every
fourth time he comes to the plate – just as it’s unlikely that the
white marble will come out exactly every fourth time.
Even a batter who goes hitless 10 times in a row might safely be
able to pin the blame on statistical fluctuations. The odds of pulling
a black marble out of a hat 10 times in a row are about 6 percent –
not a frequent occurrence, but not impossible, either. Only in the
9.1 INTRODUCTION
Frequently, we are more interested in some function of the outcome
of an experiment/process rather than the actual outcome itself. For
example, an expressway safety service may be more interested in the
probability that a particular number of accidents could take place on
a day than in the details of the accidents themselves. Or, in an experiment of tossing
a coin four times we may be interested in the total number of heads that
occur (if we have called or bet on heads, say) and not care at all about
the actual sequence of results. These quantities of interest are known
as random variables. In statistics, we are also interested in probability
For example, consider an experiment of tossing an unbiased coin
four times where we are interested in our favorable event of number
of heads. (Imagine the similarity of this with a real life experiment of
picking fuses out of a box when probability of fuse being serviceable
is 0.5.) Possible outcomes are $2^4 = 16$ namely, TTTT, TTTH, TTHT,
THTT, HTTT, TTHH, THTH, THHT, HTHT, HTTH, HHTT, THHH,
HTHH, HHTH, HHHT, HHHH. Let our random variable ‘X’ be the number
of heads. It can be seen that the random variable can take the values 0, 1, 2,
3, and 4. Since all the 16 outcomes are equally likely, the probability of each
is (1/16). Now counting the outcomes that give us a particular value
of the random variable, we can calculate the probability associated
with it. The rule that assigns the probabilities to the different values
of random variable is called the probability distribution of random
variable. In our example of tossing a coin four times the probability
distribution is as follows:
Value of Random Variable Xi    0     1     2     3     4     Total
Probability P {X = Xi}         1/16  4/16  6/16  4/16  1/16  1
Note that sum of all probabilities is 1. This is always true for any
probability distribution according to the ‘Axiom 2’ for probability
space.
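The distribution in the table can be reproduced by listing the 16 outcomes and counting heads in each. The following is a minimal sketch with hypothetical variable names.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2**4 equally likely outcomes of tossing a coin four times.
outcomes = list(product("HT", repeat=4))

# The random variable X maps each outcome to its number of heads.
counts = Counter(outcome.count("H") for outcome in outcomes)

# Probability distribution: P{X = x} = (outcomes giving x heads) / 16.
distribution = {x: Fraction(counts[x], len(outcomes)) for x in sorted(counts)}

for x, p in distribution.items():
    print(x, p)
print(sum(distribution.values()))
```

Note that `Fraction` prints values in lowest terms, so 4/16 appears as 1/4 and 6/16 as 3/8.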
A continuous random variable is not defined at specific values.
Instead, it is defined over an interval of values, and is represented by
the area under a curve (in advanced mathematics, this is known as
an integral). The probability of observing any single value is equal to
0, since the number of values which may be assumed by the random
variable is infinite.
Suppose a random variable X may take all values over an interval of
real numbers. Then the probability that X is in the set of outcomes A,
P (A) is defined to be the area above A and under a curve. The curve,
which represents a function p(x), must satisfy the following:
The curve has no negative values (p(x) ≥ 0 for all x)
The total area under the curve is equal to 1.
A curve meeting these requirements is known as a density curve.
A random variable that can take countable number of possible
values (including infinite countable numbers) is said to be discrete.
For discrete random variable ‘probability mass function’ (p.m.f.) is
defined as,
p (a) = P {X = a}
P.m.f. must be non-negative and satisfy axioms of probability. P.m.f. could
be imagined as masses equivalent to the probability values p (xi) are
X = xi 2 3 4 5 6 7 8 9 10 11 12
P(xi ) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36
Graph of this RV is given below.
[Figure: bar chart of P(X = xi) against X = xi, for xi = 2 to 12]
9.2.3 PROBABILITY DENSITY FUNCTION
There also exist random variables whose set of possible values is
uncountable. Time taken to service a customer, or time between
accidents on expressway are two such examples. X is a continuous
random variable if there exists a non-negative function f(x), defined for all real
values of x, having the property that for any set B of real numbers,
$$P(X \in B) = \int_B f(x)\,dx$$
the values of random variable less than or equal to the specified value.
Obviously, c.d.f. at infinity is equal to one, as per axiom 2.
Cumulative distribution function (c.d.f.) for discrete random variable
is given by
$$F(a) = P(X \le a) = \sum_{x_i \le a} p(x_i)$$
Cumulative distribution function (c.d.f.) for continuous random
variable is given by
$$F(a) = P(X \le a) = \int_{-\infty}^{a} f(x)\,dx$$
Value of Random Variable X = xi    0    1    2    3
Probability P(X = xi)              1/8  3/8  3/8  1/8
Graph of this RV is given below.
[Figure: plot of F(a) = P(X ≤ xi) against X = xi, for xi = 2 to 12]
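The c.d.f. of a discrete random variable is a running total of its p.m.f. The sketch below assumes the plotted variable is the sum of the numbers on two fair dice, which is consistent with the axis values 2 to 12 but is not stated explicitly at this point in the text; the variable names are hypothetical.

```python
from fractions import Fraction
from itertools import accumulate, product

# Assumption: X is the sum of the numbers shown by two fair dice.
sums = [a + b for a, b in product(range(1, 7), repeat=2)]
values = sorted(set(sums))

# p.m.f.: P(X = x) for each possible value x.
pmf = [Fraction(sums.count(v), len(sums)) for v in values]

# c.d.f.: F(a) = P(X <= a), the running total of the p.m.f.
cdf = dict(zip(values, accumulate(pmf)))

for a, f_a in cdf.items():
    print(a, f_a)
```

The last value of the c.d.f. is 1, in line with the remark that the c.d.f. at infinity equals one.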
9.2.5 EXPECTATION VALUE OF RANDOM VARIABLES
One of the most important concepts in probability theory is that of the
expectation of a random variable. For example, consider the random
variable X as next month’s demand for our product, say a luxury car.
Then we would have different values of X along with the associated probabilities
as given below.
value of the random variable. Due to associated probability (risk) the
term ‘expected’ is used. This is a measure of ‘central tendency’ mean
for the probability distribution. Hence,
μ = E[X]
$$Var(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$$ where f(x) is the p.d.f.
By algebraic simplification, noting that μ is a constant and using the
definition of expected value and axiom 3, it can be shown that,
$$Var(X) = E[X^2] - (E[X])^2$$
E [X] is called the first moment of X and E [X 2] as second moment of X.
Variance gives the dispersion or spread of the probability distribution
of random variable X. It is extremely important while comparing two
or more distributions, hypothesis testing, drawing inference from the
sample, etc. For a random variable, the standard deviation is equal to
the positive square root of the variance, and denoted by σ.
Example: A random variable is the number of tails when a coin is flipped
thrice. Find the expectation (mean) of the random variable.
Random Variable X = xi    0     1     2     3
p.m.f. P(X = xi)          1/8   3/8   3/8   1/8
xi × P(xi)                0     3/8   6/8   3/8
$$E(X) = \sum_{i=1}^{4} x_i P(x_i) = 0 \times \frac{1}{8} + 1 \times \frac{3}{8} + 2 \times \frac{3}{8} + 3 \times \frac{1}{8} = \frac{12}{8} = \frac{3}{2}$$
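The same calculation can be checked numerically. The following is a minimal sketch, not part of the original text; the helper names `expectation` and `variance` are chosen here for illustration, and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

# p.m.f. of X = number of tails in three flips of a fair coin (from the example)
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def expectation(pmf):
    """E[X] = sum of x_i * P(x_i) over all values x_i."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E[X^2] - (E[X])^2."""
    second_moment = sum(x * x * p for x, p in pmf.items())
    return second_moment - expectation(pmf) ** 2

mean = expectation(pmf)
var = variance(pmf)
print(mean)  # 3/2
print(var)   # 3/4
```

The variance is obtained from the first and second moments, which is the shortcut formula stated earlier in this section.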
Example: X is a random variable with probability distribution
X = xi 0 1 2
P(X = xi) 0.3 0.3 0.4
Y = g(X) = 2X + 3
Find expected value or mean of Y that is E(Y).
Solution: Now, for X = 0, 1, 2 Y = 3, 5, 7 respectively. Hence, the
distribution of Y is,
X = xi 0 1 2
Y = yi 3 5 7
p(Y = yi) 0.3 0.3 0.4
Hence,
$$E(Y) = E[g(X)] = \sum_{i=1}^{n} g(x_i) P(x_i) = \sum_{i=1}^{n} y_i P(x_i)$$
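As a rough numerical check of the formula for E[g(X)], the sum can be evaluated directly for the distribution given in this example. This is an illustrative sketch only; the names `pmf_x`, `g`, `e_x` and `e_y` are assumptions introduced here.

```python
# Distribution of X from the example
pmf_x = {0: 0.3, 1: 0.3, 2: 0.4}

def g(x):
    """The transformation Y = g(X) = 2X + 3 used in the example."""
    return 2 * x + 3

# E[Y] = sum of g(x_i) * P(x_i); the probabilities of X carry over to Y
e_y = sum(g(x) * p for x, p in pmf_x.items())

# For comparison: E[X], and the linearity shortcut 2 * E[X] + 3
e_x = sum(x * p for x, p in pmf_x.items())

print(round(e_y, 4))          # 5.2
print(round(2 * e_x + 3, 4))  # 5.2
```

Because g is linear, the direct sum and the shortcut 2E[X] + 3 agree; for a non-linear g only the direct sum is valid.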
Machine A
Thus, the mean is
$$\mu = E(X) = \sum_{\text{all } i} x_i P(x_i) = 200 \quad \text{(Ans)}$$
Also,
$$E(X^2) = \sum_{\text{all } i} x_i^2 P(x_i) = 3610 + 7605 + 16000 + 8405 + 4410 = 40030$$
Also,
$$E(X^2) = \sum_{\text{all } i} x_i^2 P(x_i) = 40001.2$$
Hence, Variance = E(X²) − [E(X)]² = 40001.2 − 40000 = 1.2 (Ans)
Now, S.D. = σ = √Variance = √1.2 = 1.1
From the above result it can be seen that machine B is preferable
since it has a very small variance as compared to machine A. In
fact, we could roughly say that in the case of machine A, we will have
to give free packets as a penalty to about 27% of the customers. In
the case of machine B, not even 1% of the customers will get a coffee pack that
is underweight by 5 gm. Also, the excess coffee in overweight packs from
machine B will be a very small quantity as compared to machine A
and hence less costly.
example, arrival of customers, number of people served in unit
time, time between failures of a machine.)
9.3 PROBABILITY DISTRIBUTIONS OF STANDARD RANDOM VARIABLES
In many practical situations, the random variable of interest follows
a specific pattern. Random variables are often classified according
Mean: $\mu = E[X] = \sum_{i=1,2} x_i P(x_i) = 0 \times (1 - p) + 1 \times p = p$
Variance: $\mathrm{Var}(X) = p(1 - p)$
For the variance we first calculate
$$E[X^2] = \sum_{i=1,2} x_i^2 P(x_i) = 0^2 \times (1 - p) + 1^2 \times p = p$$
so that Var(X) = E[X²] − (E[X])² = p − p² = p(1 − p).
9.5 BINOMIAL DISTRIBUTION
We often conduct many trials that are independent and
identical. Suppose we perform n independent Bernoulli trials, each of which
results in a success with probability p and in a failure with probability (1 – p).
If the random variable X represents the number of successes that occur
in the n trials (the order of the successes is not important), then X is said to be a
binomial random variable with parameters (n, p). Note that a Bernoulli
random variable is a binomial random variable with parameters (1, p),
i.e. n = 1.
$$P(X = i) = \binom{n}{i} p^i (1 - p)^{n - i} \quad \text{for } i = 0, 1, 2, \ldots, n$$
The expected value and variance of a binomial random variable are
μ = E[X] = np
Var[X] = np(1 – p)
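The p.m.f. and the two formulas above can be verified numerically. The sketch below is illustrative only: the function name `binomial_pmf` and the parameter values n = 10, p = 0.3 are assumptions made here, not values from the text.

```python
from math import comb

def binomial_pmf(i, n, p):
    """P(X = i) = C(n, i) * p^i * (1 - p)^(n - i)."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)

# Illustrative parameters (assumed for this sketch)
n, p = 10, 0.3

probs = [binomial_pmf(i, n, p) for i in range(n + 1)]

total = sum(probs)                                          # should be 1
mean = sum(i * q for i, q in enumerate(probs))              # should equal n * p
var = sum(i * i * q for i, q in enumerate(probs)) - mean**2 # should equal n * p * (1 - p)

print(round(total, 6), round(mean, 6), round(var, 6))
```

Summing the p.m.f. over i = 0, …, n gives 1, and the mean and variance computed from the definition agree with np and np(1 − p) up to floating-point error.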
Trials are finite (and not very large), performed repeatedly for ‘n’
times.
Each trial (random experiment) should be a Bernoulli trial, the
one that results in either success or failure.
Probability of success in any trial is ‘p’ and is constant for each
trial.
All trials are independent.
These trials are usually the experiments of selection ‘with
replacement’. In cases where the population is very
large, drawing a small sample from it does not change the probability of
success significantly. Hence, we could still treat the distribution as a
binomial distribution.
Following are some of the real life examples of applications of binomial
distribution.
Number of defective items in a lot of n items produced by a
machine.
Number of male births out of n births in a hospital.
Number of correct answers in a multiple-choice test.
Number of seeds germinated in a row of n planted seeds.
Number of re-captured fish in a sample of n fishes.
to get the results if the probability distribution is approximated to a
standard probability distribution. In case the probability distribution
(or a frequency distribution which is not necessarily a probability
distribution) concerns a random variable X which takes finite
integer values 0, 1, 2, …, n, the assumption of a binomial distribution may
work as a model for the given data. This is known as fitting a binomial
distribution to the given data. We first estimate the parameters of
distribution (n, p) from the data and then compute probabilities and
expected frequencies.
The parameter p is estimated by equating the mean of the binomial
distribution, μ = np, with the data mean x̄. Thus,
$$\hat{p} = \frac{\bar{x}}{n} \quad \text{and} \quad \hat{q} = 1 - \hat{p}$$
where p̂ means the estimate of p and q̂ means the estimate of q, and
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$
With the estimated parameters we calculate all the probability values
(frequencies) for the given data points. If the observed values are
quite close to the estimates, the binomial model under consideration
is satisfactory.
Example: The following data gives number of seeds germinated
in row of 5 seeds each. Fit a binomial distribution to the data and
calculate expected frequency.
xi 0 1 2 3 4 5
fi 10 20 30 15 15 10
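Solution: Now,
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{235}{100} = 2.35$$
Hence,
$$\hat{p} = \frac{\bar{x}}{n} = \frac{2.35}{5} = 0.47, \qquad \hat{q} = 1 - \hat{p} = 0.53$$
$$N = \sum f_i = 100, \qquad \frac{\hat{p}}{\hat{q}} = 0.8868$$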
Now, either by using p.m.f. with n = 5 and p = 0.47 or by using
recurrence relation we can find probabilities and hence expected
frequencies. We demonstrate using recurrence relation.
X = i               0       1       2       3       4       5       Total
(n − i)/(i + 1)     5       2       1       0.5     0.2     0
P(X = i)            0.0418  0.1853  0.3287  0.2915  0.1293  0.0229  0.9995
Ei = N × P(X = i)   4.18    18.53   32.87   29.15   12.93   2.29    99.95
We observe that fitting is reasonably good, except at both ends.
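The fitting procedure in this example can be sketched in a few lines of code. This is a hedged illustration, not the author's method: the variable names are chosen here, and the recurrence is taken in the form P(X = i + 1) = [(n − i)/(i + 1)] × (p̂/q̂) × P(X = i), starting from P(X = 0) = q̂ⁿ, which is what the (n − i)/(i + 1) row and the ratio p̂/q̂ in the table suggest. Results may differ from the table in the last decimal place because the table rounds at each step.

```python
# Observed frequencies f_i for x_i = 0..5 germinated seeds per row of 5
observed = {0: 10, 1: 20, 2: 30, 3: 15, 4: 15, 5: 10}
n = 5  # seeds per row

# Step 1: estimate the parameters from the data
N = sum(observed.values())                            # total number of rows
x_bar = sum(x * f for x, f in observed.items()) / N   # data mean
p_hat = x_bar / n
q_hat = 1 - p_hat

# Step 2: probabilities by the recurrence relation, starting from P(X = 0)
probs = [q_hat**n]
for i in range(n):
    probs.append(probs[-1] * (n - i) / (i + 1) * p_hat / q_hat)

# Step 3: expected frequencies E_i = N * P(X = i)
expected = [N * pr for pr in probs]

for i in range(n + 1):
    print(i, observed[i], round(probs[i], 4), round(expected[i], 2))
```

Comparing the observed and expected columns side by side shows the same pattern noted in the text: the middle values agree reasonably well, while the two ends fit poorly.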
Example: Suppose that the probability that a light in a classroom
will be burnt out is 1/3. The classroom has in all five lights and it is
unusable if the number of lights burning is less than two. What is the
probability that the class room is unusable on a random occasion?
Solution: This is a case of binomial distribution with n = 5 and p = 1/3.
The classroom is unusable if the number of burnouts is 4 or 5, that is,
i = 4 or 5. Noting that
$$P(X = i) = \binom{n}{i} p^i (1 - p)^{n - i}$$
the probability that the classroom is unusable on a random
occasion is
$$P(X = 4) + P(X = 5) = \binom{5}{4}\left(\frac{1}{3}\right)^4\left(\frac{2}{3}\right)^1 + \binom{5}{5}\left(\frac{1}{3}\right)^5\left(\frac{2}{3}\right)^0 = 0.0412 + 0.0041 \approx 0.0453$$
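A short check of this result with exact fractions is given below. It is a sketch under the same assumptions as the solution (n = 5, p = 1/3, unusable when 4 or 5 lights are burnt out); the function name `binomial_pmf` is introduced here for illustration.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(i, n, p):
    """P(X = i) = C(n, i) * p^i * (1 - p)^(n - i)."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)

n = 5
p = Fraction(1, 3)  # probability that a light is burnt out

# Classroom is unusable when 4 or 5 lights are burnt out
p_unusable = binomial_pmf(4, n, p) + binomial_pmf(5, n, p)

print(p_unusable)                   # 11/243
print(round(float(p_unusable), 4))  # 0.0453
```

Working in fractions shows that the exact answer is 11/243; the small difference from a sum of separately rounded terms is only rounding.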
Example: It is observed that 80% of T.V. viewers watch the Aap Ki Adalat
programme. What is the probability that at least 80% of the viewers in
a random sample of 5 watch this programme?
Solution: This is the case of binomial distribution with n = 5 and p =
0.8. Also i = 4 or 5.
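The probability that at least 80% of the viewers in a random sample of 5
watch the programme is
$$P(X = 4) + P(X = 5) = \binom{5}{4}(0.8)^4(0.2)^1 + \binom{5}{5}(0.8)^5(0.2)^0 = 0.4096 + 0.3277 = 0.7373$$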
Collect the data and prove that as n tends to infinity the Binomial
distribution approaches to normal.
A random variable X, taking one of the values 0, 1, 2, …, is said to be
a Poisson random variable with parameter λ if, for some λ > 0,
$$P(X = i) = \frac{e^{-\lambda} \lambda^i}{i!} \quad \text{for } i = 0, 1, 2, \ldots$$
μ = E[X] = λ
Var[X] = λ
Poisson random variable has wide range of applications. It can also
be used as an approximation for a binomial random variable with
parameters (n, p) if n is large and p is small enough to make the
product np of moderate size. In this case we call np = λ an average
rate. Some of the common examples where Poisson random variable
can be used to define the probability distribution are:
Number of accidents per day on expressway.
Number of earthquakes occurring over fixed time span.
Number of misprints on a page.
Number of arrivals of calls on telephone exchange per minute.
Number of interrupts per second on a server.
Example: The average number of accidents on an expressway is five per
week. Find the probability that exactly two accidents will take place
in a given week. Also find the probability that at most two accidents
will take place in the next week.
Solution:
Now, λ = 5 and i = 2.
Therefore,
$$P(X = 2) = \frac{e^{-5} \times 5^2}{2!} = 0.084224$$
$$P(X \le 2) = P(0) + P(1) + P(2) = \frac{e^{-5} \times 5^0}{0!} + \frac{e^{-5} \times 5^1}{1!} + \frac{e^{-5} \times 5^2}{2!} = e^{-5}\left(1 + 5 + \frac{25}{2}\right) = 0.12465$$
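The two probabilities in this example can be reproduced with a few lines of code. This is a minimal sketch; the function names `poisson_pmf` and `poisson_cdf` are chosen here for illustration and are not library functions.

```python
from math import exp, factorial

def poisson_pmf(i, lam):
    """P(X = i) = e^(-lam) * lam^i / i!."""
    return exp(-lam) * lam**i / factorial(i)

def poisson_cdf(k, lam):
    """P(X <= k), obtained by summing the p.m.f. from 0 to k."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 5  # average number of accidents per week

p_exactly_two = poisson_pmf(2, lam)
p_at_most_two = poisson_cdf(2, lam)

print(round(p_exactly_two, 6))  # 0.084224
print(round(p_at_most_two, 5))  # 0.12465
```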
Example: Probability of defective items produced on a machine is 0.1.
Find the probability that a sample of 10 items will contain at the most
1 defective item.
Solution: Method I
Using the binomial distribution with parameters (n = 10, p = 0.1) we get
$$P\{X \le 1\} = p(0) + p(1) = \binom{10}{0}(0.1)^0(0.9)^{10} + \binom{10}{1}(0.1)^1(0.9)^9 = 0.7361$$
Method II
Using the Poisson distribution (as an approximation to the binomial distribution)
with parameter λ = 10 × 0.1 = 1 we get
$$P\{X \le 1\} = p(0) + p(1) = \frac{e^{-1}\lambda^0}{0!} + \frac{e^{-1}\lambda^1}{1!} = e^{-1} + e^{-1} = 0.7358$$
Note that the Poisson distribution gives a reasonably good approximation.
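The comparison of the two methods can be sketched in code as follows. The helper names are assumptions introduced for this illustration; the parameters are those of the example.

```python
from math import comb, exp, factorial

def binomial_pmf(i, n, p):
    """Exact binomial probability P(X = i)."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)

def poisson_pmf(i, lam):
    """Poisson probability P(X = i) with parameter lam."""
    return exp(-lam) * lam**i / factorial(i)

n, p = 10, 0.1
lam = n * p  # Poisson parameter used for the approximation

# Method I: exact binomial P(X <= 1)
p_binomial = binomial_pmf(0, n, p) + binomial_pmf(1, n, p)

# Method II: Poisson approximation of P(X <= 1)
p_poisson = poisson_pmf(0, lam) + poisson_pmf(1, lam)

print(round(p_binomial, 4))  # 0.7361
print(round(p_poisson, 4))   # 0.7358
```

Even with n as small as 10, the two answers differ only in the fourth decimal place, which is why the approximation is described as reasonably good.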
random variable X as being the lifetime of some item (say bulb), the
probability that the bulb will survive for at least ‘ (s + t)’ hours, given
that it has survived ‘t’ hours, is the same as the initial probability that
it survives for at least ‘s’ hours. That is, the bulb does not remember
that it has already been in use for the time ‘t’.
Example: Average time for updating a passbook by a bank clerk is 15
seconds. Someone arrives just ahead of you. Find the probability that
you will have to wait for your turn,
1. More than 1 minute.
2. Less than ½ minute.
Solution: Now, λ = 60/15 = 4 passbooks per minute.
P{X > 1} = 1 – F(1) = e^{–4} = 0.0183
P{X < 0.5} = F(0.5) = 1 – e^{–2} = 1 – 0.1353 = 0.8647
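The waiting-time probabilities, and the memoryless property described before this example, can be checked with a short sketch. It assumes the exponential c.d.f. F(x) = 1 − e^{−λx} for x ≥ 0, which is what the solution uses; the function names and the values of s and t are illustrative choices.

```python
from math import exp

def exponential_cdf(x, rate):
    """F(x) = P(X <= x) = 1 - e^(-rate * x) for x >= 0, and 0 otherwise."""
    return 1 - exp(-rate * x) if x >= 0 else 0.0

def survival(x, rate):
    """P(X > x) = 1 - F(x)."""
    return 1 - exponential_cdf(x, rate)

rate = 60 / 15  # 4 passbooks per minute

p_wait_more_than_1 = survival(1, rate)             # e^(-4)
p_wait_less_than_half = exponential_cdf(0.5, rate) # 1 - e^(-2)

print(round(p_wait_more_than_1, 4))     # 0.0183
print(round(p_wait_less_than_half, 4))  # 0.8647

# Memoryless property: P(X > s + t | X > t) equals P(X > s)
s, t = 1.0, 0.5  # illustrative values
conditional = survival(s + t, rate) / survival(t, rate)
print(abs(conditional - survival(s, rate)) < 1e-12)  # True
```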
Example: In a certain factory it was found that the average absentee rate is
3 workers per shift. Find the probability that on a given shift:
1. Exactly two workers will be absent.
2. More than four workers will be absent.
[Given e^{–3} = 0.04979 and e^{–0.3} = 0.7408]
Solution: This is a case of Poisson distribution with an average absentee
rate of λ = 3.
We use
$$P(X = i) = \frac{e^{-\lambda}\lambda^i}{i!}$$
1. $P(X = 2) = \frac{e^{-3} \times 3^2}{2!} = 0.224$
2. $P(X > 4) = 1 - P(X \le 4) = 1 - [P(0) + P(1) + P(2) + P(3) + P(4)]$
$= 1 - e^{-3}\left[1 + 3 + \frac{9}{2} + \frac{9}{2} + \frac{27}{8}\right] = 0.1847$
Or, we can use the cumulative Poisson probabilities table to
calculate P(X ≤ 4). From the table for λ = 3 and i = 4 we get the
cumulative probability P(X ≤ 4) as 0.8153. Hence, we calculate
P(X > 4) = 1 − P(X ≤ 4) = 1 − 0.8153 = 0.1847
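When a printed cumulative Poisson table is not at hand, the relevant row can be generated directly. The sketch below is illustrative; `cumulative_row` is a name introduced here, and the row is built by accumulating the p.m.f. term by term.

```python
from math import exp, factorial

def poisson_pmf(i, lam):
    """P(X = i) = e^(-lam) * lam^i / i!."""
    return exp(-lam) * lam**i / factorial(i)

def cumulative_row(lam, k_max):
    """List of P(X <= k) for k = 0..k_max, like one row of a cumulative table."""
    row, running = [], 0.0
    for k in range(k_max + 1):
        running += poisson_pmf(k, lam)
        row.append(running)
    return row

lam = 3  # average absentees per shift
row = cumulative_row(lam, 6)

p_exactly_two = poisson_pmf(2, lam)
p_more_than_four = 1 - row[4]

print(round(p_exactly_two, 3))     # 0.224
print(round(row[4], 4))            # 0.8153
print(round(p_more_than_four, 4))  # 0.1847
```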
Collect the data and prove the formulae for cumulative density
function for exponential random variable and Poisson random
variable.
The normal random variable and its distribution are commonly used in
many business and engineering problems. Many other distributions,
like binomial, Poisson, beta, chi-square, Student's t, exponential, etc.,
could also be approximated by the normal distribution under specific
conditions (usually when the sample size is large). If a random variable is
affected by many independent causes, and the effect of each cause is
not significantly large as compared to other effects, then the random
variable will closely follow the normal distribution, e.g., weights of
The mean of a normal random variable is E(X) = μ and its variance
is Var(X) = σ².
If X is normally distributed with parameters μ and σ, then another
random variable Y = aX + b is also normally distributed, with
parameters (aμ + b) and (|a|σ).
z is a normally distributed random variable with parameters
μ = 0 and σ = 1.
Any normal random variable can be transformed to the standard normal
random variable z. We can get the cumulative distribution function as
$$F(a) = \int_{-\infty}^{a} f(z)\,dz = \int_{-\infty}^{a} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\,dz$$
This has been calculated for various values of ‘a’ and tabulated. Also,
we know that
F(−a) = 1 − F(a)
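The tabulated values can also be computed. The sketch below uses the identity F(a) = ½[1 + erf(a/√2)], a standard way of expressing the standard normal c.d.f. through the error function; it is offered as an illustration, and the function names `phi` and `table_area` are assumptions made here. The table in Appendix A lists the area between 0 and z, which is F(z) − 0.5.

```python
from math import erf, sqrt

def phi(a):
    """Standard normal c.d.f. F(a) = P(Z <= a), via the error function."""
    return 0.5 * (1 + erf(a / sqrt(2)))

def table_area(z):
    """Area under the standard normal curve between 0 and z (z >= 0)."""
    return phi(z) - 0.5

print(round(table_area(2.5), 4))  # 0.4938
print(round(table_area(3.0), 4))  # 0.4987

# Symmetry: F(-a) = 1 - F(a)
a = 1.0
print(abs(phi(-a) - (1 - phi(a))) < 1e-12)  # True
```

The two printed areas agree with the rows for z = 2.5 and z = 3 in the table of Appendix A.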
Look for the value of z up to the first decimal in column z of the
Standard Normal Distribution Table shown in Appendix A (first
column of the table). Look for the second decimal value of z
in the top row of the table. Read the probability value p in the cell at
the intersection of the row and the column so located. The p value thus
read is then used for finding probabilities as indicated above.
Sometimes, we need to find the value of z, called z_critical, for a
The key to understanding the type of table (if the graph is not
given on the top with shaded portion for p) is the following
properties of Standard Normal Distribution Table.
The probability values are symmetric about the midpoint, i.e. Z
= 0.
Total probability P(–∞ < Z < ∞) = 1.
Cumulative probabilities in the left and right halves of the curve
are 0.5 each, i.e.
P(–∞ < Z < 0) = P(0 < Z < ∞) = 0.5.
For calculating the probability values either convert them in
c.d.f. values F(a) and use the formulae or draw a simple sketch to
identify the area that we are interested on the probability curve
and then use the logic. Don’t mix the two methods as it can be
confusing. Use the method that is more appealing to you.
9.7.3 PROPERTIES OF NORMAL DISTRIBUTION
It is perfectly symmetric about the mean μ.
For a normal distribution mean = median = mode.
It is uni-modal (one mode), with skewness = 0 and excess kurtosis = 0.
Normal distribution is a limiting form of binomial distribution
when the number of trials n is large, and neither the probability p nor
Approximately 95% of the area under the curve is between μ-2σ
and μ+2σ.
Approximately 99.7% of the area under the curve is between μ-3σ
and μ+3σ.
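These areas can be confirmed with Python's standard library. The following is a minimal sketch using `statistics.NormalDist` (available from Python 3.8); the function name `area_within` and the illustrative values μ = 200, σ = 0.5 are assumptions made here.

```python
from statistics import NormalDist

def area_within(k, mu=0.0, sigma=1.0):
    """Area under the normal curve between mu - k*sigma and mu + k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973

# The areas do not depend on the particular mean and standard deviation
print(round(area_within(2, mu=200, sigma=0.5), 4))  # 0.9545
```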
The forces affecting the events must be independent of one
another.
The operation of the causal forces must be such that deviations
about the population mean are balanced as to magnitude and
number.
Solved Examples
Example: If X is a normal random variable with parameters μ = 3 and
σ² = 9, find
(a) P(2 < x < 5) (b) P(x < 0) (c) P(|x – 3| > 6)
Solution:
For x = 2, $z = \frac{x - \mu}{\sigma} = \frac{2 - 3}{3} = -\frac{1}{3}$
For x = 5, $z = \frac{x - \mu}{\sigma} = \frac{5 - 3}{3} = \frac{2}{3}$
Therefore,
$$P(2 < x < 5) = P\left(-\tfrac{1}{3} < z < \tfrac{2}{3}\right) = F\left(\tfrac{2}{3}\right) - F\left(-\tfrac{1}{3}\right) = F\left(\tfrac{2}{3}\right) - \left[1 - F\left(\tfrac{1}{3}\right)\right]$$
∴ F(2/3) = 0.5 + 0.2486 = 0.7486
F(1/3) = 0.5 + area under standard normal curve for z = 0.334
F(1/3) = 0.5 + 0.1293 = 0.6293
Thus, P(2 < x < 5) = 0.7486 + 0.6293 – 1 = 0.3779
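The result can be cross-checked without the table. The sketch below uses `statistics.NormalDist` from the Python standard library with μ = 3 and σ = 3 as in the example; the variable names are illustrative. Small differences from the hand calculation arise because the table is read at z rounded to two decimals. Parts (b) and (c) are computed in the same way.

```python
from statistics import NormalDist

# X is normal with mean 3 and variance 9, i.e. standard deviation 3
X = NormalDist(mu=3, sigma=3)

# (a) P(2 < X < 5) = F(5) - F(2)
p_a = X.cdf(5) - X.cdf(2)

# (b) P(X < 0) = F(0)
p_b = X.cdf(0)

# (c) P(|X - 3| > 6) = P(X < -3) + P(X > 9)
p_c = X.cdf(-3) + (1 - X.cdf(9))

print(round(p_a, 3))  # 0.378
print(round(p_b, 4))  # 0.1587
print(round(p_c, 4))  # 0.0455
```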
Example: Coffee is filled in packs of 200 gm by a machine with a
variance of 0.25 gm². Packs weighing less than 200 gm would
be rejected by customers and are not legally acceptable. Therefore, the
marketing and legal departments request the production manager to
set the machine to fill a slightly larger quantity in each pack. However, the
finance department objects to this since it would lead to financial loss
due to overfilling the packs. The general manager wants to know the
99% confidence interval, when the machine is set at 200 gm, so that
he can take a decision. Find the confidence interval. What is your advice
to the production manager?
Solution: Let the weight of the coffee in a pack be a random variable X. We
know that the mean μ = 200 gm and variance σ² = 0.25 gm², i.e. σ = 0.5
gm. First, we find the value of z for 99% confidence. The Standard Normal
Distribution curve is symmetric about the mean. Hence, corresponding
to 99% confidence, half the area under the curve = 0.99/2 = 0.495.
The z value corresponding to probability 0.495 is 2.575. Thus, the 99%
confidence interval in terms of variable z is ±2.575, which in terms of
variable x is 200 ± 1.2875, or (198.71 to 201.29).
Note that x = σz + μ = 0.5 × (±2.575) + 200 = 200 ± 1.2875.
Hence, we can advise the production manager to set his machine
to fill the coffee with mean weight as 201.2875 or say 201.29. In that
case we have 99% confidence of meeting legal requirement and at the
same time to keep the cost of excess filling of the coffee to minimum.
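The critical value and the interval can be computed rather than read from the table. This sketch uses `NormalDist.inv_cdf` from the Python standard library; the variable names are illustrative. A central area of 0.99 leaves 0.005 in each tail, so the critical z is the 0.995 quantile.

```python
from statistics import NormalDist

mu = 200.0          # machine setting in gm
sigma = 0.25**0.5   # standard deviation from variance 0.25 gm^2
confidence = 0.99

# Critical z for a central area of 0.99 (0.005 in each tail)
z_critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

half_width = sigma * z_critical
lower, upper = mu - half_width, mu + half_width

print(round(z_critical, 3))              # 2.576
print(round(lower, 2), round(upper, 2))  # 198.71 201.29
```

The computed critical value, about 2.576, is slightly more precise than the 2.575 read from the table, and the interval agrees with the text to two decimals.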
Example: A total of 2,058 students take a difficult test. Each student
has an independent 0.6205 probability of passing the test.
1. What is the probability that between 1,250 and 1,300 students,
both numbers inclusive, will pass the test?
2. What is the probability that at least 1,300 students will pass the
test?
3. If the probability of at least 1,300 students passing the test has to
be at least 0.5, what is the minimum value for the probability of
each student passing the test?
Assessing Normality
Suppose that seventeen randomly selected workers at a detergent
factory were tested for exposure to a Bacillus subtilis enzyme by
measuring the ratio of forced expiratory volume (FEV) to vital
capacity (VC). (Note: FEV is the maximum volume of air a person
can exhale in one second; VC is the maximum volume of air that a
person can exhale after taking a deep breath.) Is it reasonable to
conclude that the FEV to VC (FEV/VC) ratio is normally distributed?
0.61 0.70 0.76 0.84
0.63 0.72 0.78 0.85
9.8 SUMMARY
Random variable is a real valued function defined over a sample
space with probability associated with it. The value of the random
variable is outcome of an experiment. Random variables are
neither ‘random’ nor ‘variable’.
In this chapter we discussed several important random variables,
the associated formulae, and problem solving using formulae.
A discrete random variable is the one that takes at the most
countable values. A continuous random variable can take any
real value.
We also discussed probability distributions of random variables.
Binomial distribution is used if an experiment is carried out for
finite number of n independent trials; all trials being Bernoulli
trials with constant probability of success p.
Random variable will follow Poisson distribution if it is the number
of occurrences of a rare event during a finite period. Waiting time
for a rare event is exponentially distributed. Negative binomial
distribution is used if numbers of Bernoulli trials are made to
achieve desired number of successes.
One of the continuous random variable required often is
uniform random variable. Waiting time for an event that occurs
periodically follows uniform distribution.
Normal probability distribution is the most important distribution
in statistics. We defined normal distribution with parameters (μ,
σ) where μ is mean and σ is standard deviation.
Further, we defined standard normal distribution, which is a
special case of normal distribution with parameters (0, 1).
We also discussed transformation of normal random variable X
to standard random variable Z using z = (x − μ)/σ. Z distribution is
Binomial Random Variable: A binomial random variable is
the number of successes x in n repeated trials of a binomial
experiment.
Binomial Distribution: The probability distribution of a
binomial random variable is called a binomial distribution.
5. What are the variance and standard deviation of a random
variable? How do you calculate them?
6. Write a short note on Bernoulli distribution of random variables.
Discuss its applications also.
7. Define binomial random variable. Describe binomial distribution
and its applications.
8. How will you define Poisson random variable and exponential
(a) Find the probability that the target will be detected at least
twice.
(b) Find the probability that the target will be detected at the
most once.
4. In a large group of men, it is found that 5% are under the age 60
and 40% are between the age 60 and 65. Assuming the distribution
of the age is normal; find the mean and standard deviation.
5. If a random variable X follows a normal distribution with mean
18 and standard deviation 25, find P(–31 < x < 67).
8. Dichotomous
9. Bernoulli
Binomial Distribution 10. Binomial
11. Distribution
12. Mean x̄
Poisson Distribution 13. Parameter λ
14. Continuous
15. Exponential
16. Poisson
Normal Distribution 17. Normal
18. Median = mode.
19. μ = λ
20. 99.73%
2. Refer Section 9.2.1
A discrete random variable is one which may take on only a
countable number of distinct values such as 0, 1, 2, 3, 4…
Discrete random variables are usually (but not necessarily)
counts. If a random variable can take only a finite number of
distinct values, then it must be discrete. A continuous random
variable is one which takes an infinite number of possible values.
Continuous random variables are usually measurements.
3. Refer Section 9.2.2
A random variable that can take countable number of possible
values (including infinite countable numbers) is said to be
discrete. For discrete random variable ‘probability mass function’
(p.m.f.) is defined as,
p(a) = P{X = a}
X is a continuous random variable if there exists a non-negative
function f(x), for all real values of x, having the property that for any
set B of real numbers,
$$P(X \in B) = \int_B f(x)\,dx$$
$$E[g(X)] = \sum_{\text{all } i} g(x_i)\,p(x_i)$$
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx$$
5. Refer Section 9.2.7
Variance of a random variable is, thus, defined as,
$$\mathrm{Var}(X) = \sigma^2 = E[(X - \mu)^2]$$
For a discrete random variable,
$$\mathrm{Var}(X) = \sum_{\text{all } i} (x_i - \mu)^2 p(x_i)$$
where p(xi) is the p.m.f.
And, for a continuous random variable,
$$\mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$$
where f(x) is the p.d.f.
It is a single trial distribution. This random variable is called a
Bernoulli random variable with parameter (p).
7. Refer Section 9.5
A binomial random variable is the number of
successes x in n repeated trials of a binomial experiment.
The probability distribution of a binomial random variable
is called a binomial distribution (also known as a Bernoulli
distribution).
The probability mass function of a binomial random variable
with parameters (n, p) is given by,
$$P(X = i) = \binom{n}{i} p^i (1 - p)^{n - i} \quad \text{for } i = 0, 1, 2, \ldots, n$$
8. Refer Section 9.6
A random variable X, taking one of the values 0, 1, 2, …, is said to
be a Poisson random variable with parameter λ if, for some λ > 0,
$$P(X = i) = \frac{e^{-\lambda}\lambda^i}{i!} \quad \text{for } i = 0, 1, 2, \ldots$$
A continuous random variable X is said to be exponential with
parameter λ, if for some λ > 0,
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \ge 0 \\ 0 & \text{for } x < 0 \end{cases}$$
ANSWERS FOR EXERCISE FOR PRACTICE
1. 0.85
2. 0.27, 0.324
3. 0.528, 0.948
4. 65.41, 3.29
5. 0.950004
Tata McGraw Hill Co Ltd., 2003
Ross, Sheldon, A First Course in Probability, Pearson Education,
2003
Salkind, N.J., Statistics for People Who (They Think) Hate
Statistics, Sage Publications, 2004
D P Apte, Statistical Tools for Managers using MS Excel, Excel
Books, 2009
E-REFERENCES
http://www.henry.k12.ga.us/ugh/apstat/chapternotes/7supplement.html
http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm
http://sites.stat.psu.edu/~babu/418dist/binom.html
APPENDIX 1
THE Z-TABLE FOR NORMAL DISTRIBUTION
Appendix A
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
3.1 0.4990 0.4991 0.4991 0.4991 0.4992 0.4992 0.4992 0.4992 0.4993 0.4993
3.2 0.4993 0.4993 0.4994 0.4994 0.4994 0.4994 0.4994 0.4995 0.4995 0.4995
3.3 0.4995 0.4995 0.4995 0.4996 0.4996 0.4996 0.4996 0.4996 0.4996 0.4997
3.4 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4997 0.4998
3.5 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998 0.4998
3.6 0.4998 0.4998 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999
3.7 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999
3.8 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999 0.4999
3.9 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000 0.5000
CONTENTS
10.1 Introduction
10.1.1 Microsoft Office Versions
10.2 Introduction to Excel
10.2.1 Opening a Document
10.2.2 Saving and Closing a Document
10.2.3 Excel Screen
10.2.4 Workbooks and Worksheets
10.2.5 Moving around the Worksheet
INTRODUCTORY CASELET
EXCEL FUNCTIONS
Excel provides some help in choosing the right function by using the
Insert Function commands but to employ functions effectively you
need to be acquainted with the mathematics behind the function.
10.1 INTRODUCTION
Microsoft Office is one of the most powerful office productivity tools in
the market today. The entire suite is vast and covers a wide range of
software solutions catering to various aspects of modern businesses.
The most popular software in the MS Office Suite includes the
following:
Microsoft Word: This is a text editing software that allows users to
write all kinds of letters, messages and documents. This tool is very
powerful when it comes to textual representation, allowing the users
to change the fonts and page layouts, insert headers and footers, and
even include a table of contents. There are a lot of other features that
make MS Word more than just an effective text editor.
Microsoft Excel: This is a powerful accounting and calculation
solution. It has a standard tabular layout and it supports a wide range
of arithmetic, accounting and statistical functions. Actually, it is well
preview function where you can preview your Word, PowerPoint or
Excel attachments from the mail itself without having to open the file.
bugs of Office 2007 besides adding more features to the overall suite.
This book is primarily designed for Office 2010 users; however, given
the similarity, a lot of it would also be valid for Office 2007 and, to a
lesser extent, Office 2003.
With Office 2007 and Office 2010, Microsoft has bundled its Office suite
in multiple packages depending on the typical usage. By pricing it
economically, different users can pick and choose a package that meets
their needs and thus they can save on the license charges. For Office
2010, Microsoft has three packages as shown in Table 10.1 below:
Excel 2010         Included   Included   Included
PowerPoint 2010    Included   Included   Included
OneNote 2010       Included   Included   Included
Outlook 2010       -          Included   Included
Access 2010        -          -          Included
Publisher 2010     -          -          Included
When you have finished working on a document you should close it.
Go to the File menu and click on Close. If you have made any changes
since the file was last saved, you will be asked if you wish to save them.
Figure 10.1: Menu Bar in Excel
Screenshot of Excel screen is shown in Figure 10.2.
workbooks. The entire structure is hierarchical, and this allows it to
be scalable and versatile enough to adapt to varying needs for users
from different specialisations. Understanding the following concepts
is pretty useful in developing complex reports and models.
Cell: Cell is the most basic unit in Excel. You can enter text,
numbers or formulas in the cells and build a report. The cells
can be formatted to change the font, colour, alignment and other
aesthetics to present the data in the desired format. The cell is
existing rows or perform a common formatting action on all the
cells in the row.
Column: A column is a series of cells arranged vertically. The
columns are alphabetically named starting from ‘A’. After column
‘Z’, the new column is named ‘AA’ and this continues till ‘XFD’.
Again, like the rows, you would not use all these columns in your
model for most parts. You can insert, delete or format the cells in
the column just like the rows.
Spreadsheet: At the bottom of the Excel window you can view
tabs named ‘Sheet1’, ‘Sheet2’ and ‘Sheet3’. Each of these sheets
is a spreadsheet. You can insert spreadsheets, delete existing
spreadsheets, rename them, copy-paste the entire spreadsheet
and carry out global formatting on all the cells in the spreadsheet.
While for most simple models a single spreadsheet would suffice,
sometimes it is better to use multiple spreadsheets to keep the
data logically separated.
A spreadsheet is a collection of all the rows and columns in Excel.
computer display. To minimise the Ribbon, you can press ‘Ctrl +
F1’ or just right click anywhere on the Ribbon to get the ‘Minimise
Ribbon’ option as shown. For beginners, it is recommended that
they continue to keep the Ribbon as it would be easier to locate
the functions.
Figure 10.5: Minimizing the Ribbon
Excel 2010 also has custom margin settings, and the Last Custom
setting option gives you the option to play with the settings to get the
perfect printout.
Figure 10.7: Margin Options in Excel
the printout would not come out properly. To look at all the
options supported by Excel, click on the Size item in the Page
Setup section.
Figure 10.10: Print Area Selection
10.2.6 MOVING BETWEEN CELLS
While working with any Office productivity tool, the clipboard
functions are invaluable. The most common clipboard functions
are ‘Cut’, ‘Copy’ and ‘Paste’. In the Microsoft Office suite, there are
keyboard shortcuts for these functions. The table below maps these
single cell on the spreadsheet. Select the cell and click on the
Format Painter icon. The next cell that you click on will inherit
the same formatting as the original cell. If you want to paste the
formatting on multiple cells, then you must double click on the
Format Painter icon. This would allow you to format paint multiple cells
by clicking them one by one.
Clipboard: As you keep working in a spreadsheet, you would
copy multiple cells. While most of the times you would paste the
copied cell immediately, there are instances when you may want
to paste some of the previously copied cells.
Microsoft Office suite has a clipboard that maintains a list of all the
previously copied cells.
To paste any of the older values, just select the destination cell and
click on the value from the clipboard.
To open the clipboard display, click on the button at the bottom
right of the Clipboard Ribbon.
Fill in the blanks:
4. To create a new workbook, click on ................... Document.
While the overall look and feel of Excel has undergone a sea change
from Excel 2003 to Excel 2007/2010, the basic logic behind the
features remains consistent. Hence users of Excel 2003 should not
have any trouble using the newer versions of Excel.
intersection of a row and a column is a cell. Each cell has an address,
which is the column letter and the row number. The arrow on the
worksheet to the right points to cell A1, which is currently highlighted,
indicating that it is an active cell. A cell must be active to enter
information into it. To highlight (select) a cell, click on it.
To select more than one cell:
Click on a cell (e.g. A1), then hold the shift key while you click on
another (e.g. D4) to select all cells between and including A1 and
D4.
Click on a cell (e.g. A1) and drag the mouse across the desired
range, un-clicking on another cell (e.g. D4) to select all cells
between and including A1 and D4.
To select several cells which are not adjacent, press “control”
and click on the cells you want to select. Click a number or letter
labelling a row or column to select that entire row or column.
One worksheet in Excel 2003 can have up to 256 columns and 65,536 rows
(16,384 columns and 1,048,576 rows in Excel 2007/2010), so
it’ll be a while before you run out of space.
Each cell can contain a label, value, logical value, or formula.
Labels can contain any combination of letters, numbers, or
symbols.
Values are numbers. Only values (numbers) can be used in
Figure 10.12
To enter information into a cell, select the cell and begin typing.
Note that as you type information into the cell, the information
you enter also displays in the formula bar. You can also enter
information into the formula bar, and the information will appear
in the selected cell.
When you have finished entering the label or value:
Press “Enter” to move to the next cell below (in this
case, A2)
Press “Tab” to move to the next cell to the right (in this
case, B1)
Click in any cell to select it
Entering Labels
Unless the information you enter is formatted as a value or a formula,
Excel will interpret it as a label, and defaults to align the text on the
left side of the cell.
Figure 10.13
If you are creating a long worksheet and you will be repeating the
same label information in many different cells, you can use the
AutoComplete function. This function will look at other entries in the
same column and attempt to match a previous entry with your current
entry. For example, if you have already typed “Wesleyan” in another
cell and you type “W” in a new cell, Excel will automatically enter
“Wesleyan.” If you intended to type “Wesleyan” into the cell, your task
is done, and you can move on to the next cell. If you intended to type
something else, e.g. “Williams,” into the cell, just continue typing to
enter the term.
To turn on the AutoComplete function, click on “Tools” in the
menu bar, then select “Options,” then select “Edit,” and click
to put a check in the box beside “Enable AutoComplete for cell
values.”
Another way to quickly enter repeated labels is to use the Pick
List feature. Right click on a cell, and then select “Pick from
List.” This will give you a menu of all other entries in cells in
that column. Click on an item in the menu to enter it into the
currently selected cell.
Entering Values
A value is a number, date, or time, plus a few symbols if necessary to
further define the numbers [such as: + – ( ) % $ /].
Numbers are assumed to be positive; to enter a negative number,
use a minus sign “–” or enclose the number in parentheses “()”.
Dates are displayed as MM/DD/YYYY by default, but you do not have to enter
them precisely in that format. If you enter “Jan 9” or “jan-9”, Excel
will recognize it as January 9 of the current year, and store it
accordingly (for example, 1/9/2002 when the current year is 2002). Enter the four-digit year for a year other than the
current year (e.g. “Jan 9, 1999”). To enter the current day’s date,
press “control” and “;” at the same time.
Times default to a 24 hour clock. Use “a” or “p” to indicate “am”
or “pm” if you use a 12 hour clock (e.g. “8:30 p” is interpreted
as 8:30 PM). To enter the current time, press “control” and “:”
(shift-semicolon) at the same time.
Figure 10.14
Select the cells in the sheet by pressing Ctrl+A (in Excel 2003,
select a cell in a blank area before pressing Ctrl+A, or from a
selected cell in a Current Region/List range, press Ctrl+A+A).
OR
Click Select All at the top-left intersection of rows and columns.
Press Ctrl+C.
Press Ctrl+Page Down to select another sheet, then select cell
A1.
Press Enter.
To Copy the Entire Sheet.
Copying the entire sheet means copying the cells, the page setup
parameters, and the defined range Names.
Option 1
Move the mouse pointer to a sheet tab.
Press Ctrl, and hold the mouse to drag the sheet to a different
location.
Release the mouse button and the Ctrl key.
Option 2
Option 3
From the Window menu, select Arrange.
Select Tiled to tile all open workbooks in the window.
Use Option 1 (dragging the sheet while pressing Ctrl) to copy or
move a sheet.
Sorting by Columns
The default setting for sorting in Ascending or Descending order is by
row. To sort by columns:
From the Data menu, select Sort, and then Options.
Select the Sort left to right option button and click OK.
In the Sort by option of the Sort dialog box, select the row number
by which the columns will be sorted and click OK.
Be sure to distinguish between absolute reference and relative
reference when entering the formulas.
Figure 10.15
Some of the toolbar icons are useful for mathematical computation:
The “AutoSum” icon enters the formula “=SUM()” to add up a
range of cells.
The “Function Wizard” icon gives you access to all the
functions available.
Figure 10.16
Excel can be used to generate measures of location and variability for
a variable. Suppose we wish to find descriptive statistics for a sample
data: 2, 4, 6, and 8.
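As a cross-check outside Excel, the same measures of location and variability can be sketched in Python. This is only an illustration on the sample 2, 4, 6, 8 using the standard `statistics` module; the dictionary keys and variable names are assumptions made for this sketch and are not part of the Excel procedure.

```python
import statistics

sample = [2, 4, 6, 8]  # the sample data from the text

summary = {
    "mean": statistics.mean(sample),
    "median": statistics.median(sample),
    "sample_variance": statistics.variance(sample),  # divides by n - 1
    "sample_std_dev": statistics.stdev(sample),      # square root of the sample variance
    "range": max(sample) - min(sample),
    "minimum": min(sample),
    "maximum": max(sample),
    "count": len(sample),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The sample (n - 1) versions of variance and standard deviation are used here because the data are described as a sample.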
Step 1: Select the Tools pull-down menu; if you see Data Analysis, click
The number of students in a college in the year 1961 was 1100;
of those 980 were boys and rest girls. In 1971 the number of boys
increased by 100% and that of girls increased by 300% as compared
to their strength in 1961. In 1981 the total number of students in a
college was 3600, the number of boys being double the number of
girls.
10.5 BASIC BUILT-IN FUNCTIONS (AVERAGE,
MEAN, MODE, COUNT, MAX AND MIN)
Excel is a very powerful accounting tool, but before going to the really
complex functions, let us see how to use Excel for simple calculations.
There are two ways of using Excel for simple calculations: you can
enter the actual arithmetic equations in the cell or use pre-defined
Excel formulas to do the same. The following sections explain how
Excel can be used to carry out simple arithmetic functions.
Figure 10.17
Similarly, the total sale for each category is calculated using a
different mathematical operation: the product. For example, the total
T-shirt sale on 1st April, 2013 is =C4*D4, where C4 is the rate of each T-shirt
and D4 is the total number of T-shirts sold on that day.
Figure 10.18
While simple calculations can be done by manually listing each cell
individually, this is not very scalable. Besides, accounting and financial
management are not just about sums and products. There are many
other complex functions, and Excel accommodates for these using
pre-defined functions.
TABLE 10.5: FUNCTION, SYNTAX AND DESCRIPTION
Function     Syntax            Description
FACT         FACT(n)           Factorial: product of all the numbers from 1 to n
GCD          GCD(x,y,…)        Returns the greatest common divisor of the arguments
INT          INT(n)            Rounds a decimal number down to the nearest integer
LCM          LCM(x,y,…)        Returns the least common multiple of the arguments
POWER        POWER(n,r)        Returns n raised to the power r
PRODUCT      PRODUCT(x,y,…)    Returns the product of the arguments
QUOTIENT     QUOTIENT(n,d)     Returns the integer portion of the quotient n/d
ROUND        ROUND(n,d)        Rounds the decimal number n to d digits after the decimal point
ROUNDDOWN    ROUNDDOWN(n,d)    Rounds the decimal number n down (toward zero) to d digits after the decimal point
ROUNDUP ROUNDUP() Rounds the decimal n to the
criteria, sum_range). The match_range is the list on which the
match needs to be made, the match_criteria is the condition
being matched and sum_range is the list of items to be summed.
SUMIFS (): This is another variation of the sum function where
multiple conditions can be matched at the same time. Depending
on how many conditions need to be matched, SUMIF or SUMIFS
can be used. The syntax for this function is SUMIFS (sum_
range, match_range1, match_criteria1, match_range2, match_
criteria2…). As seen from the syntax, the arguments for SUMIFS
are pretty similar to SUMIF, but the order of arguments is different.
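The difference in argument order between the two functions can be sketched in Python. This is a rough illustration only: the helper names `sumif` and `sumifs`, the use of callables as criteria (Excel itself takes values or strings such as ">10"), and the sales data are all assumptions invented for this example.

```python
def sumif(match_range, match_criteria, sum_range):
    """Add the items of sum_range whose matching entry satisfies match_criteria."""
    return sum(s for m, s in zip(match_range, sum_range) if match_criteria(m))


def sumifs(sum_range, *range_criteria_pairs):
    """Add the items of sum_range for which every (range, criteria) pair holds.

    Arguments after sum_range alternate: match_range1, match_criteria1, ...
    """
    ranges = range_criteria_pairs[0::2]
    criteria = range_criteria_pairs[1::2]
    total = 0
    for i, value in enumerate(sum_range):
        if all(test(rng[i]) for rng, test in zip(ranges, criteria)):
            total += value
    return total


# Hypothetical sales data, invented for illustration only.
items = ["T-shirt", "Cap", "T-shirt", "Cap", "T-shirt"]
regions = ["North", "North", "South", "South", "North"]
amounts = [500, 200, 300, 150, 400]

# One condition: sum_range comes last, as in SUMIF.
tshirt_total = sumif(items, lambda x: x == "T-shirt", amounts)

# Two conditions: sum_range comes first, as in SUMIFS.
north_tshirt_total = sumifs(
    amounts,
    items, lambda x: x == "T-shirt",
    regions, lambda x: x == "North",
)
print(tshirt_total, north_tshirt_total)
```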
Logical Functions
Excel also supports standard logical or Boolean functions, which
are very useful in testing special conditions. Not all calculations are
limited to SUM; hence you need to rely on a combination of logical
and other functions to get results similar to SUMIF and SUMIFS. The
commonly supported logical functions are described below:
AND(): This function performs a logical AND operation on the
arguments; if all the values test TRUE, the final result is
TRUE, else it is FALSE. Most of the time, AND is used as an
operator inside other functions rather than as a separate function,
although, depending on the requirement, one could use this
function directly too.
OR(): This function performs a logical OR operation on the
arguments; if all the values test FALSE, the final result
is FALSE, else it is TRUE. Most of the time, OR is used as an
operator inside other functions rather than as a separate function,
although, depending on the requirement, one could use this
function directly too.
TRUE: This is an argument-less function. Typically, TRUE is
used as a value to test a logical condition. To reduce the effort of
typing TRUE every time, you can define a cell with value TRUE
and reference the cell in your sheet.
Apart from IF, the logical Excel functions are rarely used on their own.
They are mainly used in conjunction with IF
statements to validate conditions and return responses accordingly.
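How AND and OR combine with IF can be sketched in Python. The helper names `excel_and`, `excel_or` and `excel_if`, and the sales figures, are hypothetical and chosen only to mirror the worksheet functions described above.

```python
def excel_and(*values):
    """TRUE only if every argument is TRUE, like AND()."""
    return all(values)


def excel_or(*values):
    """FALSE only if every argument is FALSE, like OR()."""
    return any(values)


def excel_if(condition, value_if_true, value_if_false):
    """Return one of two values depending on the condition, like IF()."""
    return value_if_true if condition else value_if_false


# Hypothetical figures, invented for illustration only.
units_sold = 120
rate = 250

# Comparable to =IF(AND(units_sold>100, rate>=200), "Bonus", "No bonus")
label = excel_if(excel_and(units_sold > 100, rate >= 200), "Bonus", "No bonus")
print(label)
```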
Statistical Functions
Statistical functions are invaluable in any mathematical calculation.
They can provide insights into trends, provide data for detailed
analysis, and help identify gaps that need to be plugged. Excel
provides a wide range of functions that can be used to perform basic
statistical analyses.
numbers
MAXA      MAXA(x,y,…)      Returns the largest value from the range of cells with any information
MEDIAN    MEDIAN(x,y,…)    Returns the median for the range of numbers
MIN MIN(x,y,…) Returns the smallest
Find the inflation rate in India during the past one year on a monthly
basis. Find the mean, median and mode using MS Excel.
they add up to zilch. In this section you will get to see some of the
commonly used data representation methods used in Excel.
Creating Charts
Select the data range (only numbers) for which the chart needs
to be created.
Under the Insert Ribbon, in the Chart section, click on the type
of chart you want to create and the category. Here the clustered
chart has been used.
Select the chart and click on Select Data button in Data section
of the Design Layout.
In the Select Data Source dialog, select ‘Series 1’ and click on
Edit button.
just like Edit does, except that this time the dialog is practically
empty and you need to manually add the contents.
10.6.1 HISTOGRAM
Now follow the steps given below to draw a histogram.
Select the first two columns i.e. class interval and frequency in
the Excel sheet.
Now, to convert it to a histogram, we need to join the columns.
For this, left-click on any of the columns. Then right-click
and select the ‘Format Data Series’ option, or select
‘Format’ from the toolbar and click on the ‘Selected Data Series’
option. The ‘Format Data Series’ dialog box will appear.
Select the ‘Options’ tab and reduce the ‘Gap
Width’ to zero. You can see the column chart
becoming a histogram. Now click on ‘OK’. The histogram is now
ready and will appear on the Excel worksheet. You can move it by
dragging, or resize it using the corner handles. You can also
export it to MS Word or PowerPoint using copy and paste.
Use the draw option from the toolbar to draw diagonal lines to locate the
mode. Also draw a vertical line from the point of intersection.
The value of the mode can be read from the abscissa of the intersection
point.
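The diagonal-line construction on the histogram corresponds to the usual interpolation formula for the mode of grouped data, Mode = L + (f1 - f0) / ((f1 - f0) + (f1 - f2)) × h, where L is the lower limit of the modal class, h is the class width, and f0, f1, f2 are the frequencies of the preceding, modal and following classes. A minimal Python sketch follows; the function name `grouped_mode` and the frequency table are hypothetical, and equal class widths are assumed.

```python
def grouped_mode(lower_limits, frequencies, class_width):
    """Interpolated mode for a frequency table with equal class widths."""
    # Index of the modal class (the first one, if several classes tie).
    k = max(range(len(frequencies)), key=frequencies.__getitem__)
    f1 = frequencies[k]
    f0 = frequencies[k - 1] if k > 0 else 0                      # class before
    f2 = frequencies[k + 1] if k < len(frequencies) - 1 else 0   # class after
    denominator = (f1 - f0) + (f1 - f2)
    if denominator == 0:
        raise ValueError("mode is not defined when neighbouring frequencies are equal")
    return lower_limits[k] + (f1 - f0) / denominator * class_width


# Hypothetical class intervals 0-10, 10-20, ..., 40-50 and their frequencies.
lower_limits = [0, 10, 20, 30, 40]
frequencies = [5, 12, 20, 14, 6]

mode = grouped_mode(lower_limits, frequencies, class_width=10)
print(round(mode, 4))
```

The value obtained this way should agree with the abscissa read off the chart, up to the accuracy of the drawing.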
10.6.2 CORRELATION PLOT AND REGRESSION ANALYSIS
Using MS Excel for calculating Karl Pearson’s correlation coefficient
Calculating Karl Pearson’s correlation coefficient using MS Excel is
very simple. The steps are as follows:
Open an Excel worksheet and enter the data values of X and Y
variables as two arrays (columns or rows). Keep these contiguous
if possible.
Select the cell where you want to store the result r. Enter the
formula with syntax as,
‘=CORREL (array1, array2)’
‘array1’ is a cell range of values and ‘array2’ is a second cell range
of values.
Alternatively, we can select the paste function
‘=CORREL(array1,array2)’ from the menu as [Insert→Function…
→Statistical→CORREL] if you are using MS Excel 97-2003,
or from the quick access tool bar by selecting [Formulas→Insert
Function→Statistical→CORREL] if you are using MS Excel 2007,
or by just clicking on the fX icon on the ‘Insert Function’ tool bar. Once
we get the ‘Function Arguments’ dialog box for the ‘CORREL’ function,
follow the dialog box to select the values of X and Y as array1 and
array2 respectively. Then press the OK button.
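For readers who want to see what such a correlation formula computes, the following Python sketch implements Karl Pearson's coefficient directly from its definition. The function name `correl` simply mirrors the worksheet function, and the two short arrays are hypothetical data invented for illustration.

```python
from math import sqrt


def correl(array1, array2):
    """Karl Pearson's correlation coefficient between two equal-length arrays."""
    if len(array1) != len(array2) or len(array1) < 2:
        raise ValueError("arrays must have the same length (at least 2)")
    n = len(array1)
    mean_x = sum(array1) / n
    mean_y = sum(array2) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(array1, array2))
    sxx = sum((x - mean_x) ** 2 for x in array1)
    syy = sum((y - mean_y) ** 2 for y in array2)
    return sxy / sqrt(sxx * syy)


# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correl(x, y)
print(round(r, 6))
```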
Besides the Insert→Function… menu, MS Excel also has a set of data
analysis tools called the Data Analysis ToolPak. These tools can be
accessed through the menu [Tools→Data Analysis…→Correlation] if you
are using MS Excel 97-2003, or through [Data→Data Analysis→Correlation]
from the quick access tool bar in later versions, and then following the
‘Correlation’ dialog box. With the Data Analysis ToolPak we can find
correlation coefficients between several variables. This can also be
used for finding the correlation coefficient between two variables. In
the result, the correlation coefficient of a variable with itself is always 1. The result is
displayed as a correlation matrix. The procedure is as follows:
Open an Excel worksheet and enter the data values of X and Y
variables as two arrays (columns or rows). Keep these contiguous.
Select any cell: Select from Quick Access Tool Bar the Correlation
tool as [Data→DataAnalysis→Correlation]. Follow the dialog box
giving following details.
Input range: It is either typed as cell references or selected
by blocking the data with mouse.
Grouped By: Select as per data entered is column wise or
row wise.
Label in First Row/Column: Check (click) the box if you have
used labels.
Output Range / New Worksheet Ply / New Workbook: Select
as appropriate. Make sure the output range is large enough
for the number of variables.
Then press OK button.
You will get the correlation coefficients between pairs of variables
as a correlation matrix.
Example: The data of advertisement expenditure (X) and sales (Y)
of a company for past 10 year period is given below. Determine the
We will get result in cell D3 as 0.976357
Method II: Using Data Analysis ToolPak
Open an Excel worksheet and enter the data values of X and Y
variables as two arrays (columns) from cell B3 to B12 and C3
to C12 respectively. It is a good practice to give headings at cell
number B2 and C2.
Select any cell say D3 and use menu [Tools→Data Analysis…→
‘Output Range’ $D$3:$F$5 and then select OK button. The Excel
sheet will be as follows:
We will get the result on a new sheet as:
      X           Y
X     1
Y     0.976357    1
Example: The data below gives the transit time in days for a random sample of
10 consignments, with the related distance.
Find the best-fit linear relationship of transit time on distance.
X (Distance in 100 km):     4  5  6  7  9  9  10  11  11  12
Y (Transit time in days):   4  5  5  6  7  6   7   8   7   8
Also estimate the transit time for a new location at a distance of 800
km.
Also compute the correlation coefficient and assess whether the relation
can be deemed reasonably valid.
Find the coefficient of determination R² and explain its significance.
Solution: Using MS Excel
As we have seen earlier, MS Excel is very fast, simple to use and
provides much more analysis while solving correlation and regression
problems. We do not need the shortcut methods of shifting origin or changing
scale. The steps for solving this problem are:
Open an MS Excel worksheet. Enter the data of X and Y in two
adjacent columns (say X variable from B3 to B12 and Y variable
from C3 to C12).
Select any cell and use ‘Data Analysis Pak’ tool ‘Regression’ from
Observations    10
ANOVA
              df    SS          MS          F         Significance F
Regression    1     14.78421    14.78421    89.888    1.26E-05
Residual      8     1.315789    0.164474
Total         9     16.1
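The figures in the ANOVA table can be reproduced from the distance and transit-time data of this example with a short least-squares calculation. The Python sketch below is an independent check rather than a description of what the Excel tool does internally; the variable names are assumptions made for this illustration.

```python
distance = [4, 5, 6, 7, 9, 9, 10, 11, 11, 12]   # X, in units of 100 km
transit_time = [4, 5, 5, 6, 7, 6, 7, 8, 7, 8]   # Y, in days

n = len(distance)
mean_x = sum(distance) / n
mean_y = sum(transit_time) / n

# Sums of squares and products of deviations from the means.
sxx = sum((x - mean_x) ** 2 for x in distance)
syy = sum((y - mean_y) ** 2 for y in transit_time)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distance, transit_time))

# Least-squares line of Y on X: Y = intercept + slope * X
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# ANOVA quantities.
ss_total = syy
ss_regression = slope * sxy
ss_residual = ss_total - ss_regression
f_statistic = ss_regression / (ss_residual / (n - 2))

r_squared = ss_regression / ss_total           # coefficient of determination
predicted_800_km = intercept + slope * 8       # 800 km is X = 8

print(f"slope={slope:.6f} intercept={intercept:.6f}")
print(f"SS regression={ss_regression:.5f} residual={ss_residual:.6f} total={ss_total:.1f}")
print(f"F={f_statistic:.3f} R^2={r_squared:.6f} estimate at 800 km={predicted_800_km:.4f} days")
```

The regression, residual and total sums of squares computed this way should match the SS column of the table.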
10.7 NORMAL DISTRIBUTION
Statistical calculations for normal random variables can be
carried out using statistical functions available in MS Excel.
NORMDIST returns the normal distribution for the specified mean
and standard deviation. This function has a very wide range of
applications in statistics, including hypothesis testing.
Syntax: NORMDIST(x,mean,standard_dev,cumulative)
X is the value for which you want the distribution.
Mean is the arithmetic mean of the distribution.
Standard_dev is the standard deviation of the distribution.
Cumulative is a logical value that determines the form of the function. If
cumulative is TRUE, NORMDIST returns the cumulative distribution
function; if FALSE, it returns the probability density function.
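The role of the cumulative argument can be illustrated with Python's standard `statistics.NormalDist` class. The wrapper name `normdist` and the example values (mean 40, standard deviation 1.5) are assumptions chosen for this sketch.

```python
from statistics import NormalDist


def normdist(x, mean, standard_dev, cumulative):
    """Rough analogue of the worksheet function, built on NormalDist."""
    dist = NormalDist(mu=mean, sigma=standard_dev)
    # TRUE gives the area to the left of x; FALSE gives the height of the curve at x.
    return dist.cdf(x) if cumulative else dist.pdf(x)


# Hypothetical inputs, invented for illustration only.
area_below_42 = normdist(42, 40, 1.5, True)
height_at_42 = normdist(42, 40, 1.5, False)
area_below_mean = normdist(40, 40, 1.5, True)   # half the area lies below the mean

print(area_below_42, height_at_42, area_below_mean)
```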
Standard Normal Distribution
Statistical calculations for standard normal random variables can be
carried out using statistical functions available in MS Excel.
NORMSDIST returns the standard normal cumulative distribution
function. The distribution has a mean of 0 (zero) and a standard
deviation of one. Use this function in place of a table of standard
normal curve areas.
formula bar [fx → statistical → NORMSDIST] or from quick action
tool bar [Formulas → fx → statistical → NORMSDIST] we get a
paste function dialogue box. It asks for the value of z and gives the cumulative
distribution function value. We could also directly type the paste
function syntax in the selected cell.
Similarly,
=NORMSDIST(2/3) gives answer as 0.747507462
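The same value can be checked in Python, where a `NormalDist` created with no arguments is the standard normal distribution (mean 0, standard deviation 1). The wrapper name `normsdist` is an assumption used here only to mirror the worksheet function.

```python
from statistics import NormalDist


def normsdist(z):
    """Cumulative distribution function of the standard normal at z."""
    return NormalDist().cdf(z)


value = normsdist(2 / 3)
print(round(value, 6))
```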
Sciences (SPSS), reflecting the original market, although the software
is now popular in other fields as well, including the health sciences and
marketing.
Our first step is to see the way SPSS functions and take cognizance
of the files that it uses. Second, we will try to create a dataset using
available data. Once the data has been entered, our third step is to use
the SPSS pull-down menus to conduct the analyses of data. We will
then use SPSS to draw charts which display results. We can run SPSS
using either the pull-down menus or the syntax window (writing your
own SPSS programmes).
Entering Data
Select SPSS from the Windows Start Button (that is, click the Start
Button, select Programmes, and select SPSS 11 for Windows). At the
top of your screen you will see the pull-down menus, and just below
them you will see a toolbar with several icons. If you place the mouse
pointer on any one of the toolbar icons, SPSS will display a label
telling you what that icon does. SPSS automatically opens the Data
Editor window, and your screen looks like Figure 10.21.
Figure 10.21: SPSS Data Editor Window – Data View
Notice that the Data Editor window looks quite like a spreadsheet, in
that it is made up of cells defined by both rows and columns. In the
Data Editor window, each row represents a single record, and each
column represents a single variable. By using the keyboard arrow
keys (up, down, right and left) or your mouse, you can move the cursor
To label the first variable, again click the cursor on Variable View. You
will be prompted with the dialog box as shown in Figure 10.22.
the first variable and let other columns be default as it is. Repeat this
procedure for the work experience data in the second row, and then
again for student motivation data in the third row. Remember that the
student motivation variable had values that had been coded.
Student Motivation
Not willing
Undecided
Willing
Therefore, once the variable label has been assigned, use the tab key
(or the mouse) to bring the cursor to the Values box in the same row
and click. You will see a box as in Figure 10.23.
N O T E S
Figure 10.24: Value Labels Coded with Value and Value Label
If you discover later that for some reason you need to further define
this variable (for example, if you want to change the labels), you can
always return to this dialog box. As their names suggest, the Change
button can be used to change a value label, the Remove button can be
used to remove a value label, the Cancel button can be used to cancel
your labeling work, and the Help button can be used to access the
SPSS online help file.
Notice that you have other options available to you in the row dialog
box. For example, if you click on the Type button, you will be presented
with several different data types to choose from. By default, the
variable is considered to be a number that has up to eight digits. You
can tell SPSS to expect a larger number by entering a different size
Figure 10.25: SPSS Data Editor Window with all Record Entered
Now save the data using the File pull-down menu and the Save
choice. Because this data has not been saved previously, you will see
a dialog box prompting you to enter a file name. Notice that SPSS
provides the default data file extension (.sav). Type the file name and
click the OK button. SPSS will then save the data to this file. (In
general, SPSS will automatically attach the default file extension
if you do not type it in, e.g., .sav for a data file, .spo for a Navigator
document, .sps for a syntax file, etc.) Another alternative would be
to select File Save As… in case the dataset had already been saved
once, but you now want to save it as a new file with a new name.
10.9 SUMMARY
Microsoft Office is one of the most powerful office productivity
tools in the market today. The entire suite is vast and covers a
wide range of software solutions catering to various aspects of
modern businesses.
Microsoft Excel is a powerful accounting and calculation solution.
It has a standard tabular layout and it supports a wide range of
arithmetic, accounting and statistical functions.
Microsoft Outlook is the mail client that can be set up to
download mail from a mail server as well as send and receive
emails as desired. Being a part of the Microsoft Office suite, this
tool is compatible with other applications in the suite.
One of the most popular and widely used versions of the Microsoft Office suite
is MS Office 2003. Later, Microsoft released two other versions
of Office, namely Office 2007 and Office 2010. Although Office
2010 is the latest version, many businesses still continue to use
Office 2003. From Office 2003 to Office 2007, Microsoft radically changed
the overall look and feel of the Office suite.
Excel is built on the concept of cells, rows, columns, spreadsheets
and workbooks. The entire structure is hierarchical, and this
allows it to be scalable and versatile enough to adapt to varying
needs for users from different specialisations. Understanding the
calculations. There are two ways of using Excel for simple
calculations: you can enter the actual arithmetic equations in the
cell or use pre-defined Excel formulas to do the same.
Statistical calculations for normal random variables can
be carried out using statistical functions available in MS Excel.
NORMDIST returns the normal distribution for the specified
mean and standard deviation. This function has a very wide
range of applications in statistics, including hypothesis testing.
Syntax: NORMDIST(x,mean,standard_dev,cumulative)
SPSS Statistics is a software package used for statistical analysis.
Long produced by SPSS Inc., it was acquired by IBM in 2009. The
current versions (2014) are officially named IBM SPSS Statistics.
Companion products in the same family are used for survey
authoring and deployment (IBM SPSS Data Collection), data
mining (IBM SPSS Modeler), text analytics, and collaboration
and deployment (batch and automated scoring services).
Microsoft Excel: An electronic spreadsheet program with
which you can create graphs and worksheets for financial
and other numeric data. After you enter your financial data,
you can analyze it for forecasts, generate numerous what-if
scenarios, and publish worksheets on the Web.
2. What are the various versions of Microsoft Office?
3. Explain how you open, save and close an Excel document.
4. Explain the menu items and their functions present on the Excel
screen.
5. Explain the concept of Excel with reference to workbooks and
worksheets.
6. Discuss the procedure of entering data in an Excel file.
7. Write a short note on basic built-in functions like average, sum
and statistical functions.
8. Explain the logical functions in an Excel document.
9. How do you create a chart and a histogram in an Excel file?
10. Write a short note on SPSS. What is its importance in today’s
scenario?
X 14 16 20 22 28 30 34 40 45
Y 97 89 68 65 56 50 37 18 12
5. The following data give the average yields of major grain
(excluding rice) for the period 1965–1973. The yields are in
quintals per hectare.
Year 1965 1966 1967 1968 1969 1970 1971 1972 1973
Yield 14.7 16.2 16.2 16.7 16.9 17.3 18.8 18.5 19.4
Find the equation of the trend line using MS Excel, assuming that the trend is
linear.
Descriptive Statistics      10. False
                            11. Descriptive
                            12. Autosum
                            13. variable
Basic Built-in Functions    14. True
                            15. False
                            16. False
3. Refer Section 10.2
Click on File-Open (Ctrl+O) to open/retrieve an existing
workbook; change the directory area or drive to look for files in
other locations.
To save your document with its current filename, location and
file format, click on File-Save. If you are saving for the first
time, click File-Save; choose/type a name for your document;
then click OK.
4. Refer Section 10.2.3
The most basic navigation tool in any Excel version is the menu
bar. The screenshot below shows the menu bar with various
items, which are designed for accessing different features of
Excel. Table 10.1 below gives a snapshot of the menu items and
their overall functions.
5. Refer Section 10.2.4
Excel is built on the concept of cell, rows, columns, spreadsheets
and workbooks. The entire structure is hierarchical, and this
allows it to be scalable and versatile enough to adapt to varying
needs for users from different specializations.
6. Refer Section 10.3
A new worksheet is a grid of rows and columns. The rows are
labeled with numbers, and the columns are labeled with letters.
9. Refer Section 10.6
One of the most powerful features of Excel is its ability to represent
large data volumes in easy-to-decipher formats including pivot
tables and varied charts. Understanding these features is very
important for any modeling or analysis exercise. While it is great
to be able to calculate numbers, if these are not presented in the
right format they add up to zilch.
10. Refer Section 10.8
SPSS Statistics is a software package used for statistical analysis.
Long produced by SPSS Inc., it was acquired by IBM in 2009. The
current versions (2014) are officially named IBM SPSS Statistics.
Companion products in the same family are used for survey
authoring and deployment (IBM SPSS Data Collection), data
mining (IBM SPSS Modeler), text analytics, and collaboration
and deployment (batch and automated scoring services).
The software name stands for Statistical Package for the Social
Sciences (SPSS), reflecting the original market, although
the software is now popular in other fields as well, including
the health sciences and marketing.
Histogram:
[Figure: histogram “Distribution of Marks”; vertical axis: Number of Students; horizontal axis: Marks, in class intervals from 0 – 10 to 80 –…]
Personalities, Springer, New York, 1987.
Porter, T., The Rise of Statistical Thinking, 1820-1900, Princeton
University Press, 1986.
Stigler, S., The History of Statistics: The Measurement of
Uncertainty before 1900, U. of Chicago Press, 1990.
Tankard, J., The Statistical Pioneers, Schenkman Books, New
York, 1984.
E-REFERENCES
http://home.ubalt.edu/ntsbarsh/excel/excel.htm
http://people.umass.edu/evagold/excel.html
http://www.excel-easy.com/examples/descriptive-statistics.html
CASE STUDIES
CONTENTS
Roughness
greater success for its stakeholders over the long
term, have less risk exposure, and have a lower chance of
missing lucrative opportunities.
Different Applications of Statistical Analysis
Any business operates under conditions of probability and
uncertainty because there are too many variables and external
factors that can influence a situation. Therefore, the decision
Analysis
One of the valuable statistics in business decision analysis is the
internal accounting figures of the organization, or the performance
data. The decision analysis team within the company has a key
responsibility to analyze the company’s performance in measurable,
statistical terms, and evaluate the deviations from group goals, if
any. The financial performance or profitability figures, assets and
liabilities figures, inventory and sales figures are analyzed with the
help of business ratios. These ratios provide a crystallized picture
of the business and test its performance on various parameters.
For example, Current Ratio indicates the position of the company’s
current assets against current liabilities. The most critical financial
ratios for any company include Profit to Sales ratio, Debt to Equity
ratio, Current ratio, and Return on Capital Employed.
Key Elements of Statistics in Business Decision Analysis
The important elements to consider when using statistics in
business decision analysis, particularly in process improvement
of XYZ, are the accuracy of collected data and information, the
choice of statistical design or statistical model to analyze that data,
the clear presentation of findings and conclusions, and finally,
managerial recommendations on how to take corrective measures
data to another set of cases is the business of inferential statistics.
You probably know that descriptive statistics are central to the
world of sports. Every sporting event produces numerous statistics
such as the shooting percentage of players on a basketball team.
For the Olympic marathon (a foot race of 26.2 miles), we possess
data that cover more than a century of competition. The following
table shows the winning times for both men and women.
the data in the table. To gain insight into the improvement in speed
over the years, let us divide the men’s times into two pieces, namely,
the first 13 races (up to 1952) and the second 13 (starting from
1956). The mean winning time for the first 13 races is 2 hours, 44
minutes, and 22 seconds (written 2:44:22). The mean winning time
for the second 13 races is 2:13:18. This is quite a difference (over
half an hour). Does this prove that the fastest men are running
faster? Or is the difference just due to chance, no more than what
often emerges from chance differences in performance from year
to year? We can’t answer this question with descriptive statistics
alone. All we can affirm is that the two means are “suggestive.”
The union official decided to take a careful look at the salary
information. He went to the salary administration. They told him
that they had all the salary information on a spreadsheet in the
computer, and printed off this table:
2.
(c) The new mode?
What salary position do you support, and why?
Source: http://wikieducator.org/images/2/28/JSMath6_Part3.pdf
40
180
(?)
(20)
(5)
production
Net Profit (?) (?) (?)
Sean will suggest to the board that whilst profits in the company had
recently risen, there was still room for improvement. He proceeded
with the implementation of the system of variance analysis despite
Norton and the values of a set of surface roughness parameters.
Wennerberg and Albrektsson suggested the use of at least one
height, one space, and one hybrid parameter for characterization
of implant surface roughness. For 2D measurements, one of the
height parameters Ra (average roughness) and Rq (root-mean-square
roughness), the space parameter RSm (mean width of
profile elements), and the hybrid parameter Rdr (developed length
ratio) were suggested. The limitations of this recommendation are
immediately realized when considering the two surfaces in Figure
1. These surfaces are mirror images of each other, and the values
of the suggested set of parameters are exactly the same for these
surfaces; these parameters cannot discriminate between surfaces
which are mirror images of each other. It is however quite obvious
that the interface shear strength is much higher for the surface in
Figure 1(a) than in Figure 1(b). The number of bone plugs which
protrude into pits on the surface per length unit is exactly the same
for the two surfaces, while the shear strength of the individual bone
knobs, protruding into the pits, is much higher for the surface in Figure
1(a) than for the surface in Figure 1(b). If the surface characterization
is supplemented by the skewness parameter (Rsk), discrimination
between these two surfaces is achieved. The absolute value of the
skewness is the same for the two surfaces, but the sign is different:
a plus sign for the surface in Figure 1(a) and a minus sign for the
surface in Figure 1(b).
Figure 1 (a)
Figure 1(b)
Figure 1: Two rough surfaces in cross-section. The Ra, Rq,
RSm, and Rdr parameters are the same for the two surfaces.
The interface shear strength is much higher for surface (a)
than for surface (b).
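The mirror-image argument can be checked numerically. The following is a minimal sketch, assuming a made-up profile of eight height readings (the values are hypothetical and are not taken from Figure 1). It computes Ra, Rq and Rsk using their usual definitions for a profile and for its mirror image; RSm and Rdr are not computed here.

```python
import math

def roughness_parameters(heights):
    """Return (Ra, Rq, Rsk) for a list of profile heights."""
    n = len(heights)
    mean_line = sum(heights) / n
    dev = [z - mean_line for z in heights]          # departures from the mean line
    ra = sum(abs(d) for d in dev) / n               # average roughness
    rq = math.sqrt(sum(d ** 2 for d in dev) / n)    # root-mean-square roughness
    rsk = sum(d ** 3 for d in dev) / (n * rq ** 3)  # skewness of the height distribution
    return ra, rq, rsk

# Hypothetical profile: narrow peaks separated by wide, shallow valleys.
profile = [3, -1, -1, -1, 3, -1, -1, -1]
# Its mirror image about the mean line: narrow pits separated by wide plateaus.
mirror = [-z for z in profile]

ra_p, rq_p, rsk_p = roughness_parameters(profile)
ra_m, rq_m, rsk_m = roughness_parameters(mirror)

print(f"profile: Ra={ra_p:.3f} Rq={rq_p:.3f} Rsk={rsk_p:+.3f}")
print(f"mirror : Ra={ra_m:.3f} Rq={rq_m:.3f} Rsk={rsk_m:+.3f}")
```

Ra and Rq come out identical for the two profiles, while Rsk keeps its magnitude and flips sign, which is the discrimination described in the case.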
An even better representation of a rough surface is obtained if the
kurtosis parameter (Rku) is added. This parameter is a descriptor
of the peakedness of the surface. As the modulus of elasticity of the
implant material is substantially higher than that of bone, stress
peaks will arise in the bone adjacent to the roughness peaks. The
sharper the asperities of the surface roughness, the higher the
stress peaks in the bone. Excessive bone stresses will result in bone
resorption. This means that theoretically the kurtosis parameter
is important in the characterization of implant surface roughness.
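As an illustration of kurtosis as a descriptor of peakedness, here is a small sketch that assumes two hypothetical profiles with the same Rq: one with sharp isolated asperities and one with blunt, plateau-like features. Rku is taken as the mean fourth power of the departures from the mean line divided by Rq to the fourth power.

```python
import math

def rq_and_rku(heights):
    """Return (Rq, Rku) for a list of profile heights."""
    n = len(heights)
    mean_line = sum(heights) / n
    dev = [z - mean_line for z in heights]
    rq = math.sqrt(sum(d ** 2 for d in dev) / n)    # root-mean-square roughness
    rku = sum(d ** 4 for d in dev) / (n * rq ** 4)  # kurtosis (peakedness)
    return rq, rku

# Hypothetical data: same RMS roughness, very different peak sharpness.
sharp = [0, 0, 0, 4, 0, 0, 0, -4]     # isolated sharp asperities
blunt = [2, 2, -2, -2, 2, 2, -2, -2]  # broad plateaus and valleys

rq_sharp, rku_sharp = rq_and_rku(sharp)
rq_blunt, rku_blunt = rq_and_rku(blunt)

print(f"sharp: Rq={rq_sharp:.2f} Rku={rku_sharp:.2f}")
print(f"blunt: Rq={rq_blunt:.2f} Rku={rku_blunt:.2f}")
```

Both profiles have the same Rq, so a height parameter alone would not separate them, whereas Rku is larger for the profile with the sharper asperities.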
A review of the literature on bone implants shows that the skewness
and kurtosis parameters are seldom used in the characterization
According to Albrektsson and Wennerberg, implant surfaces with
a Sa (3D average roughness) value between 1.0 μm and 2.0 μm
(moderately rough surfaces) show stronger bone responses than
smoother and rougher surfaces. They also found that the majority
of the dental implants, currently on the market, have Sa values
within that interval. Sa is a three-dimensional height parameter –
the average departure from the mean surface within the sampling
area. The two-dimensional analogue of the Sa parameter is the Ra
parameter – the average departure from the mean line within the
sampling length.
The metrology standard EN ISO 4288:1997 differentiates between
periodic and non-periodic profiles. For non-periodic profiles the
recommended sampling length, when measuring skewness and
kurtosis, depends on the Ra value. For Ra values between 0.1 and
2 μm, the prescribed sampling length is 0.8 mm. This means that if
a moderately rough implant surface is regarded as non-periodic, a
sampling length of 0.8 mm should be applied for the measurement
of skew and kurtosis. For surfaces having a periodic profile, the
prescribed sampling length is based on the mean width of profile
elements (RSm) to the effect that the sampling length will be 2–6.25
times the mean width of profile elements. The mean width of profile
elements seems to be less than 40 μm for most moderately rough
implant surfaces of today which, according to EN ISO 4288 : 1997,
means that a sampling length of 0.08 mm should be applied. Thus,
Conclusion
A primary aim of the surface roughness of dental implants is to
increase the bone-implant interface shear strength. The surface
roughness parameters normally used for characterization of dental
implant surface roughness cannot discriminate between surfaces
expected to give high interface shear strength from surfaces
expected to give low interface shear strength. The skewness
parameter can achieve this discrimination. Kurtosis is another
parameter which theoretically is important in the evaluation of
the quality of a rough implant surface. A problem with these two
parameters is that they are sensitive to isolated outliers. By using
small sampling lengths during measurement, it should be possible
to get accurate values of the skewness and kurtosis parameters.
Analyze the case and comment on how skewness and kurtosis
parameters are used in the characterization of surface roughness
in bone implants.
Source: http://www.hindawi.com/journals/isrn/2011/305312/
The objective here is to examine two sets of issues. The first relates
to the linkage between the exchange rate, 10-year G-sec and
stock market movements as denoted by the sensex. The other is a
theoretical exercise which involves the notional cost that has been
involved.
rate) has fallen by 17.5%, the 10-year yield has gone up by around
90 bps and the Sensex has declined 9.5%. In terms of the linkage
between the two, a rudimentary statistical exercise shows that the
coefficient of correlation between the rupee and sensex at absolute
levels was -0.58 which is quite high with an inverse sign, indicating
that the market does not like a declining rupee. At the incremental
level, i.e. daily changes in both of them, the coefficient was -0.37.
In case of the rupee and the 10-year bond, it was as high as 0.70 at
the absolute level and -0.07 at the incremental level. This shows that
high rupee rates go hand-in-hand with high bond yields. However,
the exact changes in levels are not correlated. Last, higher bond
yields are negatively correlated with the Sensex, at -0.29 (for absolute
levels) and -0.35 (for changes). At the second level, a causal relation
could also be examined between these three sets of variables.
While such correlations do have somewhere an inbuilt assumption of
causation, the causality tests do not support such a relation between
any of these variables. This probably makes sense as bond yields are
also driven mainly by liquidity conditions and regulatory conditions.
The sensex reacts also to political actions and global developments.
Therefore, while there is a tendency to move in a pre-determined
direction – the stock market does not quite like a weak rupee or
high interest rates, which sounds logical, a weak rupee should go
along with higher interest rates.
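The distinction drawn above between correlation at absolute levels and correlation at the incremental level can be sketched in a few lines. The two series below are hypothetical numbers chosen only to illustrate the calculation; they are not the rupee, bond yield or Sensex data discussed in the case.

```python
import math

def pearson(x, y):
    """Pearson coefficient of correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def changes(series):
    """Period-to-period changes (first differences) of a series."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

# Hypothetical series: both trend upward, but their step sizes alternate out of phase.
series_a = [1, 3, 4, 6, 7, 9]
series_b = [2, 3, 5, 6, 8, 9]

r_levels = pearson(series_a, series_b)
r_changes = pearson(changes(series_a), changes(series_b))

print(f"correlation of levels : {r_levels:+.2f}")
print(f"correlation of changes: {r_changes:+.2f}")
```

Because both series share an upward trend, the levels are strongly correlated even though the changes move in opposite directions, which is why the two coefficients reported in the case can differ so much.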
rupee depreciated. FIIs leaving today with May 20 as benchmark
would have taken a loss of above 25% as the combined effect of
rupee depreciation and stock market decline. Quite clearly, the
perceived rewards from going back home on account of the US
recovery are more attractive for these players.
In the bond market, the 10-year yield has moved up by close to 100
bps, though the increase has been higher at the lower end of the
maturity spectrum. The government will be affected under ceteris
paribus conditions. So far, it has completed ₹2.6 lakh crore of the
number of eligible employees is known in June. Fund-raising
activities are then conducted throughout the fall. By year’s end,
total contributions raised that year are tabulated.
The Task
It is now June 2010. The number of eligible employees for 2010 has
been determined to be 53,455. Does knowing the number of eligible
employees help predict 2010 year-end contributions?
The Data
This is an annual time-series from 1988–2009. The variables are
contribution Year and:
Actual: Total contributions to the campaign for the year in dollars
Employees: Number of eligible employees that year
Analysis
The average level of contributions during this time period was
$1,143,769, with a typical fluctuation of $339,788 around the
average. The average number of eligible employees was 45,419,
with a typical fluctuation of 9,791.
The long-term growth in contributions is attributable to two
phenomena:
estimates the contribution for each eligible employee over this
time period. Hence, the model estimates an additional $33.56
in contributions for each eligible employee. Under Parameter
Estimates, we see that the number of employees is a statistically
significant predictor of year-end contributions; the p-value, listed
as Prob > |t|, is < 0.0001.
The number of employees doesn’t perfectly predict contributions.
Just over 93% of the variability in contributions is associated with
variability in number of eligible employees (RSquare = 0.934907).
Comparing the standard deviation of Actual ($339,788) to the root
mean square error of the regression (RMSE = $88,832) suggests
that a substantial reduction in the variation in contributions occurs
by using the regression model to explain variation in year-end
contributions.
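The link between the standard deviation, the RMSE and RSquare can be checked with a short calculation. This is a sketch under two assumptions not stated explicitly in the case: that the series has n = 22 annual observations (1988 to 2009 inclusive), and that the standard deviation uses n - 1 degrees of freedom while the RMSE of the one-predictor regression uses n - 2.

```python
# Figures quoted in the case (rounded to whole dollars).
sd_actual = 339_788   # standard deviation of Actual
rmse = 88_832         # root mean square error of the regression

# Assumption: 22 annual observations, 1988-2009 inclusive.
n = 22

sst = (n - 1) * sd_actual ** 2   # total sum of squares around the mean
sse = (n - 2) * rmse ** 2        # residual sum of squares around the fitted line
r_squared = 1 - sse / sst

print(f"RSquare recovered from SD and RMSE: {r_squared:.6f}")
print(f"RMSE as a fraction of SD: {rmse / sd_actual:.3f}")
```

Under these assumptions the recovered value agrees with the reported RSquare to about six decimal places, and the typical prediction error is roughly a quarter of the raw variation in contributions.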
Managerial Implications
Regression has provided a prediction for year-end 2010 Colorado
Combined Campaign contributions of $1.4M. In managerial settings
such as this, where the response variable represents a business
goal, managers often set higher expectations than the predicted
value to motivate improved performance. One such choice here
might be the upper 95% prediction limit of $1.6M.
This forecasting methodology can be repeated year after year. Once
the final contributions to 2010 are known, they can be added to the
data set and the regression line can be recalculated. By midyear of
2011, the number of eligible employees will be known. Note that,
SELECTING COLLEGES
proposed, with a predicted grade point average of 1.54. Deciding
that is still not high enough to graduate, the student decides to
attend a local community college, graduates with an associate’s
degree and makes a fortune selling real estate.
If the counsellor was using a regression model to make the
predictions, he or she would know that this particular student
would not make a grade point of 0.64 at Harvard, 1.23 at the state
university, and 1.54 at the regional university. These values are
The main question that one would like to answer is “Is there
evidence of racial discrimination given the evidence on
SHOPPING ATTITUDE
Histogram of 1000 simulated values of the binomial variable X, and
the density curve of the Normal distribution with the same mean
and standard deviation:
$$\mu = np = 2500(0.6) = 1500$$
$$\sigma = \sqrt{np(1-p)} = \sqrt{(2500)(0.6)(0.4)} = \sqrt{600} = 24.49$$
The probability of observing 1,520 or more adults in the sample
who agree with the statement has been calculated as 20.61% using
the Normal approximation to the Binomial.
Using a computer program to calculate the actual Binomial
probabilities for all values from 1,520 to 2,500, the true probability
of observing 1,520 or more who agree is 21.31%. This is a very good
approximation!
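The comparison above can be reproduced with the Python standard library. This is a minimal sketch rather than the program used in the case: the function names are illustrative, the z-value is rounded to two decimals on the assumption that the 20.61% figure was read from a standard Normal table, and the binomial terms are summed in log space to avoid numerical underflow.

```python
import math

def binomial_upper_tail(n, p, k_min):
    """Exact P(X >= k_min) for X ~ Binomial(n, p), summed in log space."""
    log_p, log_q = math.log(p), math.log(1 - p)
    total = 0.0
    for k in range(k_min, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * log_p + (n - k) * log_q)
        total += math.exp(log_pmf)
    return total

def normal_upper_tail(z):
    """P(Z >= z) for a standard Normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

n, p, k_min = 2500, 0.6, 1520
mu = n * p
sigma = math.sqrt(n * p * (1 - p))

z = (k_min - mu) / sigma
approx_table = normal_upper_tail(round(z, 2))   # z rounded as in a printed table
approx_corrected = normal_upper_tail((k_min - 0.5 - mu) / sigma)  # continuity correction
exact = binomial_upper_tail(n, p, k_min)

print(f"sigma = {sigma:.2f}, z = {z:.4f}")
print(f"Normal approximation (z rounded): {approx_table:.4f}")
print(f"Normal with continuity correction: {approx_corrected:.4f}")
print(f"Exact binomial tail: {exact:.4f}")
```

The same two functions can be reused for the exercise that follows by changing `k_min` and the upper limit of the summation.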
Analyze the case above and try to find the probability that at
least 1,000 people in the sample agree, and add the binomial
probabilities of all outcomes from X = 1,000 to X = 2,000.
One of the jobs of the U.S. Census Bureau is to keep track of the age
distribution in the country. The age distribution in 2013 is shown
below.
Figure 1: Age Distribution in the U.S.
TABLE 1