Quantitative Methods For Economic Analysis 6nov2014

QUANTITATIVE METHODS FOR
ECONOMIC ANALYSIS 1
III SEMESTER
B A ECONOMICS
(2013 Admission )
UNIVERSITY OF CALICUT
SCHOOL OF DISTANCE EDUCATION
Calicut university P.O, Malappuram Kerala, India 673 635.
Prepared by (Full Module):
Chacko Jose P, PhD

Associate Professor of Economics
Sacred Heart College
Chalakudy, Thrissur, Kerala
(Formerly Reader
UGC-Academic Staff College
University of Calicut)
Editor
Dr.C.Krishnan
Associate Professor
PG Department of Economics
Govt. College Kodanchery
Kozhikode 673580
Email: ckcalicut@rediffmail.com
Layout & Settings: Computer Section, SDE

Reserved
CONTENTS
MODULE - I
MODULE - II
MODULE - III
MODULE - IV

Syllabus
Module I. Description of Data and Sampling
Statistics-Meaning and limitations-Data: Elements, Variables, Observations-Scale of
Measurement-Types of Data: Qualitative and Quantitative; Cross-section, Time series and
Pooled Data-Frequency Distributions: Absolute and relative-Graphs: Bar chart, Histogram etc.
Summary Measure of Distributions: Measures of Central Tendency, Variability and
Shape-Sampling: Population and Sample, Methods of Sampling.
Module II. Correlation and Regression Analysis
Correlation-Meaning, Types and Degrees of Correlation- Methods of Measuring
Correlation- Graphical Methods: Scatter Diagram and Correlation Graph;
Algebraic Methods: Karl Pearson's Coefficient of Correlation and Rank Correlation
Coefficient- Properties and Interpretation of Correlation Coefficient
Module III. Index Numbers and Time Series Analysis
Index Numbers: Meaning and Uses- Laspeyres, Paasche's, Fisher's, Dorbish-Bowley,
Marshall-Edgeworth and Kelly's Methods- Tests of Index Numbers: Time Reversal and
Factor Reversal tests -Base Shifting, Splicing and Deflating- Special Purpose
Indices-Wholesale Price Index, Consumer Price Index and Stock Price Indices:
BSE SENSEX and NSE-NIFTY. Time Series Analysis-Components of Time
Series, Measurement of Trend by Moving Average and the Method of Least
Squares.
Module IV. Nature and Scope of Econometrics
Econometrics: Meaning, Scope, and Limitations - Methodology of econometrics-Modern
interpretation-Stochastic Disturbance term- Population Regression
Function and Sample Regression Function-Assumptions of Classical Linear
regression model.
Module I
Description of Data and Sampling
1. STATISTICS-MEANING
Statistics is as old as the human race. Its utility has been increasing as the ages go by. In the
olden days it was used in the administrative departments of states, and its scope was limited.
Governments used it to keep records of births, deaths, population etc., for
administrative purposes. In the 17th century John Graunt was the first to make a systematic study
of birth and death statistics and to calculate the expectation of life at different ages,
which led to the idea of life insurance.
The word Statistics seems to have been derived from the Latin word status or Italian word
statista or the German word Statistik each of which means a political state. Fields like
agriculture, economics, sociology, business management etc., are now using Statistical Methods
for different purposes.
Statistics has been defined differently by different writers. According to Webster "Statistics are
the classified facts representing the conditions of the people in a state. Specially those facts
which can be stated in numbers or any tabular or classified arrangement."
According to Bowley, statistics is "the science of counting" and "the science of averages". He
also described statistics as "numerical statements of facts in any department of enquiry placed
in relation to each other."
According to Yule and Kendall, statistics means quantitative data affected to a marked extent by
multiplicity of causes.
A broader definition of statistics was given by Horace Secrist. According to him, statistics
means "aggregate of facts affected to a marked extent by multiplicity of causes, numerically
expressed, enumerated or estimated according to a reasonable standard of accuracy, collected in
a systematic manner for a predetermined purpose and placed in relation to each other."
This definition points out some essential characteristics that numerical facts must possess so that
they may be called statistics. These characteristics are:
1. They are enumerated or estimated according to a reasonable standard of accuracy
2. They are affected by multiplicity of factors
3. They must be numerically expressed
4. They must be aggregate of facts
W.I. King defines the science of statistics as "the method of judging collective, natural or social
phenomena from the results obtained from the analysis or enumeration or collection of
estimates."
Prof. Boddington has defined statistics as the "science of estimates and probabilities."
Let us also see some other definitions of statistics.
Statistics as a discipline is the development and application of methods to collect, analyse and
interpret data.
Statistics is the science of learning from data, and of measuring, controlling, and communicating
uncertainty; and it thereby provides the navigation essential for controlling the course of
scientific and societal advances.
Statistics is a collection of mathematical techniques that help to analyse and present data.
Statistics is also used in associated tasks such as designing experiments and surveys and planning
the collection and analysis of data from these.
Statistics is the study of numerical information, called data. Statisticians acquire, organize, and
analyse data. Each part of this process is also scrutinized. The techniques of statistics are applied
to a multitude of other areas of knowledge.
Thus, to sum up, statistics are numerical statements of facts capable of analysis and
interpretation, and the science of statistics is the study of the principles and methods applied
in collecting, presenting, analysing and interpreting numerical data in any field of inquiry.
Characteristics of Statistics
1. Statistics are aggregate of facts: A single age of 20 or 30 years is not statistics, a series of ages
are. Similarly, a single figure relating to production, sales, birth, death etc., would not be
statistics although aggregates of such figures would be statistics because of their comparability
and relationship.
2. Statistics are affected to a marked extent by a multiplicity of causes: A number of causes
affect statistics in a particular field of enquiry, e.g., production statistics are affected by
climate, soil fertility, availability of raw materials and methods of quick transport.
3. Statistics are numerically expressed, enumerated or estimated: The subject of statistics is
concerned essentially with facts expressed in numerical form, with their quantitative details but
not qualitative descriptions. Therefore, facts indicated by terms such as good or poor are not
statistics unless a numerical equivalent is assigned to each expression. Also, figures may either be
enumerated or estimated, where actual enumeration is either not possible or is very difficult.
4. Statistics are enumerated or estimated according to a reasonable standard of accuracy: Personal
bias and prejudices of the enumerator should not enter into the counting or estimation of
figures, otherwise conclusions from the figures would not be accurate. The figures should be
counted or estimated according to reasonable standards of accuracy. Absolute accuracy is neither
necessary nor sometimes possible in social sciences. But whatever standard of accuracy is once
adopted should be used throughout the process of collection or estimation.
5. Statistics should be collected in a systematic manner for a predetermined purpose: The
statistical methods to be applied depend on the purpose of enquiry, since figures are always
collected with some purpose. If there is no predetermined purpose, all the efforts in collecting
the figures may prove to be wasteful. The purpose of a series of ages of husbands and wives may be
to find whether young husbands have young wives and old husbands have old wives.
6. Statistics should be capable of being placed in relation to each other: The collected figures
should be comparable and well-connected in the same department of inquiry. Ages of husbands
are to be compared only with the corresponding ages of wives, and not with, say, heights of
trees.
Functions of Statistics
The functions of statistics may be enumerated as follows :
(i) To present facts in a definite form : Without a statistical study our ideas are likely to be vague,
indefinite and hazy, but figures help us to represent things in their true perspective. For
example, the statement that some students out of 1,400 who had appeared for a certain
examination were declared successful would not give as much information as the one that 300
students out of 400 who took the examination were declared successful.
(ii) To simplify unwieldy and complex data : It is not easy to treat large numbers and hence they
are simplified either by taking a few figures to serve as a representative sample or by taking an
average to give a bird's-eye view of the large masses. For example, complex data may be
simplified by presenting them in the form of a table, graph or diagram, or by representing them
through an average etc.
(iii) To use it as a technique for making comparisons: The significance of certain figures can be
better appreciated when they are compared with others of the same type. The comparison
between two different groups is best represented by certain statistical methods, such as average,
coefficients, rates, ratios, etc.
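As a small illustration of functions (ii) and (iii), the sketch below condenses two sets of marks into an average and a pass rate so that the two groups can be compared at a glance. The marks, the pass mark of 40 and all names are hypothetical and are used only for illustration.

```python
from statistics import mean

# Hypothetical marks for two groups of students (illustrative data only).
group_a = [35, 48, 52, 61, 70, 44, 39, 58]
group_b = [41, 45, 50, 38, 66, 72, 55, 49]
PASS_MARK = 40  # assumed pass mark, not taken from the text


def summarise(marks, pass_mark=PASS_MARK):
    """Reduce a list of marks to an average and a pass rate in per cent."""
    passed = sum(1 for m in marks if m >= pass_mark)
    return {"average": mean(marks), "pass_rate": 100 * passed / len(marks)}


summary_a = summarise(group_a)
summary_b = summarise(group_b)
print("Group A:", summary_a)
print("Group B:", summary_b)
```

Two numbers per group replace the full list of marks (simplification), and the same two numbers make the groups directly comparable (comparison).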
Uses of Statistics
Statistics is primarily used either to make predictions based on the data available or to make
conclusions about a population of interest when only sample data is available.
In both cases statistics tries to make sense of the uncertainty in the available data.
Statisticians apply statistical thinking and methods to a wide variety of scientific, social, and
business endeavours in such areas as astronomy, biology, education, economics, engineering,
genetics, marketing, medicine, psychology, public health and sports, among many others. Many
economic, social, political, and military decisions cannot be made without statistical techniques,
such as the design of experiments to gain federal approval of a newly manufactured drug.
Statistics is of two types (a) Descriptive statistics involves methods of organizing, picturing and
summarizing information from data. (b) Inferential statistics involves methods of using
information from a sample to draw conclusions about the population.
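The difference between the two types can be sketched in a few lines. The example below assumes a hypothetical population of household incomes generated at random; it first summarises a sample (descriptive) and then uses the sample to form an approximate 95% interval for the population mean (inferential), using the normal approximation of mean plus or minus 1.96 standard errors. All figures and names are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so that the run is repeatable

# Hypothetical population: monthly incomes (in Rs.) of 10,000 households.
population = [random.gauss(20000, 5000) for _ in range(10000)]

# Draw a simple random sample of 100 households.
sample = random.sample(population, 100)

# (a) Descriptive statistics: organise and summarise the sample itself.
sample_mean = mean(sample)
sample_sd = stdev(sample)

# (b) Inferential statistics: use the sample to draw a conclusion about the
# population mean. Approximate 95% interval: mean +/- 1.96 * standard error.
std_error = sample_sd / sqrt(len(sample))
interval = (sample_mean - 1.96 * std_error, sample_mean + 1.96 * std_error)

print("Sample mean:", round(sample_mean, 2))
print("Approximate 95% interval for the population mean:", interval)
```

The interval is a statement about the population, not about the sample, and it carries its own measure of uncertainty.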
These days statistical methods are applicable everywhere. There is no field of work in which
statistical methods are not applied. According to A. L. Bowley, "A knowledge of statistics is like
a knowledge of foreign languages or of Algebra, it may prove of use at any time under any
circumstances." The importance of statistical science is increasing in almost all spheres of
knowledge, e.g., astronomy, biology, meteorology, demography, economics and mathematics.
Economic planning without statistics is bound to be baseless. Statistics serve in administration,
and facilitate the work of formulating new policies. Financial institutions and investors utilise
statistical data to summarise past experience. Statistics are also helpful to an auditor, when he
uses sampling techniques or test checking to audit the accounts of his client.
(a) Statistics and Economics: In the year 1890 Prof. Alfred Marshall, the renowned economist,
observed that "statistics are the straw out of which I, like every other economist, have to make
bricks." This shows the significance of statistics in economics. Economics is concerned with
production and distribution of wealth as well as with the complex institutional set-up connected
with the consumption, saving and investment of income. Statistical data and statistical methods
are of immense help in the proper understanding of economic problems and in the
formulation of economic policies. In fact these are the tools and appliances of an economist's
laboratory. In the field of economics it is almost impossible to find a problem which does not
require an extensive use of statistical data. As economic theory advances, the use of statistical
methods also increases. The laws of economics like the law of demand, law of supply etc. can be
considered true and established with the help of statistical methods. Statistics of consumption
tell us about the relative strength of the desire of a section of people. Statistics of production
describe the wealth of a nation. Exchange statistics throw light on the commercial development of
a nation. Distribution statistics disclose the economic conditions of various classes of people.
Therefore statistical methods are necessary for economics.
(b) Statistics and business: Statistics is an aid to business and commerce. When a person enters
business, he enters into the profession of forecasting. Modern statistical devices have made
business forecasting more precise and accurate. A businessman needs statistics right from the
time he proposes to start a business. He should have relevant facts and figures to prepare the
financial plan of the proposed business. Statistical methods are necessary for these purposes. In
industrial concerns statistical devices are being used not only to determine and control the
quality of products manufactured but also to reduce wastage to a minimum. The technique of
statistical control is used to maintain the quality of products.
(c) Statistics and Research: Statistics is an indispensable tool of research. Most of the
advancement in knowledge has taken place because of experiments conducted with the help of
statistical methods. For example, experiments about crop yield with different types of fertilizers
and different types of soils, or about the growth of animals under different diets and environments,
are frequently designed and analysed according to statistical methods. Statistical methods are also
useful for research in medicine and public health. In fact there is hardly any research work
today that one can find complete without statistical data and statistical methods.
Other uses of statistics are as follows.
(1) Statistics helps in providing a better understanding and exact description of a phenomenon of
nature.
(2) Statistics helps in proper and efficient planning of a statistical inquiry in any field of study.
(3) Statistics helps in collecting appropriate quantitative data.
(4) Statistics helps in presenting complex data in a suitable tabular, diagrammatic and graphic
form for an easy and clear comprehension of the data.
(5) Statistics helps in understanding the nature and pattern of variability of
a phenomenon through quantitative observations.
(6) Statistics helps in drawing valid inference, along with a measure of their reliability about the
population parameters from the sample data.
Limitations of Statistics
Statistics is indispensable to almost all sciences - social, physical and natural. It is very often
used in most of the spheres of human activity. In spite of the wide scope of the subject it has
certain limitations. Some important limitations of statistics are the following:
1. Statistics does not study qualitative phenomena: Statistics deals with facts and figures. So
the quality aspect of a variable or the subjective phenomenon falls out of the scope of statistics.
For example, qualities like beauty, honesty, intelligence etc. cannot be numerically expressed. So
these characteristics cannot be examined statistically. This limits the scope of the subject.
2. Statistical laws are not exact: Statistical laws are not exact as in the case of natural sciences.
These laws are true only on average. They hold good under certain conditions. They cannot be
universally applied. So statistics has less practical utility.
3. Statistics does not study individuals: Statistics deals with aggregate of facts. Single or
isolated figures are not statistics. This is considered to be a major handicap of statistics.
4. Statistics can be misused: Statistics is mostly a tool of analysis. Statistical techniques are
used to analyze and interpret the collected information in an enquiry. As it is, statistics does not
prove or disprove anything. It is just a means to an end. Statements supported by statistics are
more appealing and are commonly believed. For this, statistics is often misused. Statistical
methods rightly used are beneficial but if misused these become harmful. Statistical methods
used by less expert hands will lead to inaccurate results. Here the fault does not lie with the
subject of statistics but with the person who makes wrong use of it.
Other limitations are as follows.
(1) Statistical laws are true on average. Statistics are aggregates of facts. So a single
observation is not statistics; the subject deals with groups and aggregates only.
(2) Statistical methods are best applicable to quantitative data.
(3) Statistics cannot be applied to heterogeneous data.
(4) If sufficient care is not exercised in collecting, analyzing and interpreting the data,
statistical results might be misleading.
(5) Only a person who has an expert knowledge of statistics can handle statistical data
efficiently.
(6) Some errors are possible in statistical decisions. Inferential statistics in particular involves
certain errors. We do not know whether an error has been committed or not.
2. DATA: ELEMENTS, VARIABLES, OBSERVATIONS, SCALE OF
MEASUREMENT
Data may be defined as facts, observations, and information that come from investigations. Data
can be defined as groups of information that represent the qualitative or quantitative attributes of
a variable or set of variables, which is the same as saying that data can be any set of information
that describes a given entity. Data in statistics can be classified into grouped data and ungrouped
data.
1. Elements: A data element is a unit of data for which the definition, identification,
representation, and permissible values are specified by means of a set of attributes. It is the
smallest named item of data that conveys meaningful information or condenses lengthy
description into a short code called data field in the structure of a database.
2. Variable - property of an object or event that can take on different values. A variable is any
measurable characteristic or attribute that can have different values for different subjects. Height,
age, amount of income, country of birth, grades obtained at school and type of housing are
examples of variables. For example, college major is a variable that takes on values like
mathematics, computer science, English, psychology, etc.
Discrete Variable - a variable with a limited number of values (e.g., gender (male/female),
college class (freshman/sophomore/junior/senior)).
Continuous Variable - a variable that can take on many different values, in theory, any value
between the lowest and highest points on the measurement scale.
Independent Variable - a variable that is manipulated, measured, or selected by the researcher as
an antecedent condition to an observed behavior. In a hypothesized cause-and-effect
relationship, the independent variable is the cause and the dependent variable is the outcome or
effect.
Dependent Variable - a variable that is not under the experimenter's control -- the data. It is the
variable that is observed and measured in response to the independent variable.
Qualitative Variable - a variable based on categorical data.
Quantitative Variable - a variable based on quantitative data.
Qualitative vs. Quantitative Variables
Variables can be classified as qualitative (aka, categorical) or quantitative (aka, numeric).
Qualitative. Qualitative variables take on values that are names or labels. The color of a
ball (e.g., red, green, blue) or the breed of a dog (e.g., collie, shepherd, terrier) would be
examples of qualitative or categorical variables.
Quantitative. Quantitative variables are numeric. They represent a measurable quantity.
For example, when we speak of the population of a city, we are talking about the number
of people in the city - a measurable attribute of the city. Therefore, population would be a
quantitative variable.
In algebraic equations, quantitative variables are represented by symbols (e.g., x, y, or z).
Discrete vs. Continuous Variables
Quantitative variables can be further classified as discrete or continuous. If a variable can take on
any value between its minimum value and its maximum value, it is called a continuous variable;
otherwise, it is called a discrete variable.
Some examples will clarify the difference between discrete and continuous variables.
Suppose the fire department mandates that all fire fighters must weigh between 150 and
250 pounds. The weight of a fire fighter would be an example of a continuous variable;
since a fire fighter's weight could take on any value between 150 and 250 pounds.
Suppose we flip a coin and count the number of heads. The number of heads could be any
integer value between 0 and plus infinity. However, it could not be any number between
0 and plus infinity. We could not, for example, get 2.3 heads. Therefore, the number of
heads must be a discrete variable.
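The two examples above can be imitated with a short simulation. This is only a sketch: the choice of 10 flips, the random seed and the variable names are assumptions made for illustration.

```python
import random

random.seed(1)  # fixed seed so that the run is repeatable

# Discrete: the number of heads in 10 flips is a count, so it is always a
# whole number (0, 1, 2, ...) and can never be a value such as 2.3.
flips = [random.choice(["H", "T"]) for _ in range(10)]
heads = flips.count("H")

# Continuous: a fire fighter's weight may be any value between the
# mandated limits of 150 and 250 pounds.
weight = random.uniform(150, 250)

print("Number of heads:", heads)
print("Weight in pounds:", weight)
```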
Univariate vs. Bivariate Data
Statistical data are often classified according to the number of variables being studied.
Univariate data. When we conduct a study that looks at only one variable, we say that we
are working with univariate data. Suppose, for example, that we conducted a survey to
estimate the average weight of high school students. Since we are only working with one
variable (weight), we would be working with univariate data.
Bivariate data. When we conduct a study that examines the relationship between two
variables, we are working with bivariate data. Suppose we conducted a study to see if
there were a relationship between the height and weight of high school students. Since we
are working with two variables (height and weight), we would be working with bivariate
data.
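A minimal sketch of the distinction, assuming hypothetical height and weight records for five students: the univariate part looks at weight alone, while the bivariate part uses both variables together through the sample covariance, which is positive when taller students tend to be heavier.

```python
from statistics import mean

# Hypothetical records for five students: (height in cm, weight in kg).
students = [(150, 45), (155, 50), (160, 54), (165, 60), (170, 66)]

# Univariate data: only one variable (weight) is studied.
weights = [w for _, w in students]
average_weight = mean(weights)

# Bivariate data: two variables (height and weight) are studied together.
heights = [h for h, _ in students]
mean_h = mean(heights)
mean_w = average_weight
# Sample covariance: sum of products of deviations divided by (n - 1).
covariance = sum((h - mean_h) * (w - mean_w) for h, w in students) / (len(students) - 1)

print("Average weight:", average_weight)
print("Covariance of height and weight:", covariance)
```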
3. Observations
An observation is the value, at a particular period, of a particular variable, such as the individual
price of an item at a given outlet. The term also refers to a method of data collection in which
the situation of interest is watched and the relevant facts, actions and behaviors are recorded.
Observation units vary according to the specific survey or data collection: for statistical data
collected on persons the observation unit is usually one individual or a household.
4. Scale of Measurement
Normally, when one hears the term measurement, one may think in terms of measuring the
length of something (e.g., the length of a piece of wood) or measuring a quantity of something
(e.g., a cup of flour). This represents a limited use of the term measurement. In statistics, the term
measurement is used more broadly and is more appropriately termed scales of measurement.
Scales of measurement refer to ways in which variables/numbers are defined and categorized.
Each scale of measurement has certain properties which in turn determines the appropriateness
for use of certain statistical analyses. The four scales of measurement are nominal, ordinal,
interval, and ratio.
Properties of Measurement Scales
Each scale of measurement satisfies one or more of the following properties of measurement.
Identity: Each value on the measurement scale has a unique meaning. It is not equal to any other
value on the scale.
Magnitude: All values on the measurement scale have an ordered relationship to one another.
That is, some values are larger and some are smaller.
Equal intervals: Scale units along the scale are equal to one another. This means, for example,
that the difference between 1 and 2 would be equal to the difference between 19 and 20.
A minimum value of zero: The scale has a true zero point, that is, no values exist below zero.
Measurement scales are of four types, namely, Nominal Scale of Measurement, Ordinal Scale of
Measurement, Interval Scale of Measurement and Ratio Scale of Measurement
(a) Nominal Scale of Measurement
The nominal scale of measurement only satisfies the identity property of measurement. Values
assigned to variables represent a descriptive category, but have no inherent numerical value with
respect to magnitude.
Gender is an example of a variable that is measured on a nominal scale. Individuals may be
classified as "male" or "female", but neither value represents more or less "gender" than the
other. Religion and political affiliation are other examples of variables that are normally
measured on a nominal scale.
(b) Ordinal Scale of Measurement
The ordinal scale has the property of both identity and magnitude. Each value on the ordinal
scale has a unique meaning, and it has an ordered relationship to every other value on the scale.
An example of an ordinal scale in action would be the results of a horse race, reported as "win",
"place", and "show". We know the rank order in which horses finished the race. The horse that
won finished ahead of the horse that placed, and the horse that placed finished ahead of the horse
that showed. However, we cannot tell from this ordinal scale whether it was a close race or
whether the winning horse won by a mile.
(c) Interval Scale of Measurement
The interval scale of measurement has the properties of identity, magnitude, and equal intervals.
A perfect example of an interval scale is the Fahrenheit scale to measure temperature. The scale
is made up of equal temperature units, so that the difference between 40 and 50 degrees
Fahrenheit is equal to the difference between 50 and 60 degrees Fahrenheit.
With an interval scale, you know not only whether different values are bigger or smaller, you
also know how much bigger or smaller they are. For example, suppose it is 60 degrees
Fahrenheit on Monday and 70 degrees on Tuesday. You know not only that it was hotter on
Tuesday, you also know that it was 10 degrees hotter.
(d) Ratio Scale of Measurement
The ratio scale of measurement satisfies all four of the properties of measurement: identity,
magnitude, equal intervals, and a minimum value of zero.
The weight of an object would be an example of a ratio scale. Each value on the weight scale has
a unique meaning, weights can be rank ordered, units along the weight scale are equal to one
another, and the scale has a minimum value of zero.
Weight scales have a minimum value of zero because objects at rest can be weightless, but they
cannot have negative weight.
The table below will help clarify the fundamental differences between the four scales of
measurement:
Property                           | Nominal | Ordinal | Interval | Ratio
Indicates Difference               |    X    |    X    |    X     |   X
Indicates Direction of Difference  |    -    |    X    |    X     |   X
Indicates Amount of Difference     |    -    |    -    |    X     |   X
Absolute Zero                      |    -    |    -    |    -     |   X
You will notice in the above table that only the ratio scale meets the criteria for all four
properties of scales of measurement.
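The cumulative nature of the four scales can also be written down as a small lookup, sketched below. The dictionary and the helper function are hypothetical names introduced only for illustration; the temperature figures are those of the Fahrenheit example given earlier.

```python
# Properties satisfied by each scale of measurement, as described above.
SCALE_PROPERTIES = {
    "nominal": {"identity"},
    "ordinal": {"identity", "magnitude"},
    "interval": {"identity", "magnitude", "equal intervals"},
    "ratio": {"identity", "magnitude", "equal intervals", "true zero"},
}


def has_property(scale, prop):
    """Return True if the given scale satisfies the given property."""
    return prop in SCALE_PROPERTIES[scale]


# Interval scale: differences are meaningful (Tuesday was 10 degrees hotter).
monday_f, tuesday_f = 60, 70
difference = tuesday_f - monday_f

# Ratios such as "twice as much" need a true zero, which only the ratio
# scale has. Fahrenheit temperature fails this test; weight passes it.
ratio_meaningful_for_temperature = has_property("interval", "true zero")
ratio_meaningful_for_weight = has_property("ratio", "true zero")

print(difference, ratio_meaningful_for_temperature, ratio_meaningful_for_weight)
```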
Interval and Ratio data are sometimes referred to as parametric and Nominal and Ordinal data
are referred to as nonparametric. Parametric means that it meets certain requirements with
respect to parameters of the population (for example, the data will be normal--the distribution
parallels the normal or bell curve). In addition, it means that numbers can be added, subtracted,
multiplied, and divided. Parametric data are analyzed using statistical techniques identified as
Parametric Statistics. As a rule, there are more statistical technique options for the analysis of
parametric data and parametric statistics are considered more powerful than nonparametric
statistics. Nonparametric data are lacking those same parameters and cannot be added,
subtracted, multiplied, and divided. For example, it does not make sense to add Social Security
numbers to get a third person. Nonparametric data are analyzed by using Nonparametric
Statistics.
3. TYPES OF DATA: Qualitative and Quantitative; Cross-section, Time
series and Pooled Data
3.1 Qualitative and Quantitative
Data is a collection of facts, such as values or measurements. It can be numbers, words,
measurements, observations or even just descriptions of things. Some methods provide data
which are quantitative and some methods provide data which are qualitative.
Quantitative data are anything that can be expressed as a number, or quantified. Examples of
quantitative data are scores on achievement tests, number of hours of study, or weight of a
subject. These data may be represented by ordinal, interval or ratio scales and lend themselves to
most statistical manipulation. Thus quantitative data are data that can be quantified and
verified, and are amenable to statistical manipulation.
Qualitative data cannot be expressed as a number. Data that represent nominal scales such as
gender, socio-economic status, religious preference are usually considered to be qualitative data.
Thus qualitative data are data that approximate or characterize but do not measure the
attributes, characteristics, properties, etc., of a thing or phenomenon.
Quantitative data defines whereas qualitative data describes.
Both types of data are valid types of measurement. But only quantitative data can be analysed
statistically, and thus more rigorous assessments of the data are possible.
Quantitative and qualitative data provide different outcomes, and are often used together to get a
full picture of a population. For example, if data are collected on annual income (quantitative),
occupation data (qualitative) could also be gathered to get more detail on the average annual
income for each type of occupation.
Quantitative and qualitative data can be gathered from the same data unit depending on whether
the variable of interest is numerical or categorical. For example:
Example 1: Oil Painting
Qualitative data:
blue/green color, gold frame
smells old and musty
texture shows brush strokes of oil paint
peaceful scene of the country
masterful brush strokes
Quantitative data:
picture is 10" by 14"
with frame 14" by 18"
weighs 8.5 pounds
surface area of painting is 140 sq. in.
cost Rs. 5000
Example 2
Data unit | Numeric variable | Quantitative data | Categorical variable | Qualitative data
A person | "How many children do you have?" | 4 children | "In which country were your children born?" | India
A person | "How much do you earn?" | Rs. 50,000 p.a. | "What is your occupation?" | Banker
A person | "How many hours do you work?" | 45 hours per week | "Do you work full-time or part-time?" | Full-time
A house | "Plinth area of your house?" | 1000 square metres | "In which city or town is the house located?" | Thrissur
A business | "How many workers are currently employed?" | 110 employees | "What is the industry of the business?" | Textile retail
A farm | "How many milk cows are located on the farm?" | 36 cows | "What is the main activity of the farm?" | Dairy
And Quantitative data can also be Discrete data or Continuous data.

Discrete data can only take certain values (like whole numbers)
Continuous data can take any value (within a range)
Put simply: Discrete data is counted, Continuous data is measured.
See the following example.
Example: What do we know about Blacky, your pet dog?
Qualitative:
He is brown and black
He has long hair
He has lots of energy
Quantitative:
Discrete:
He has 4 legs
He has 2 brothers
Continuous:
He weighs 25.5 kg
He is 565 mm tall
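The same description can be held as a record and sorted into the three kinds of data. The sketch below uses a rough rule of thumb (text is qualitative, a whole-number count is discrete, a decimal measurement is continuous). The field names and the rule itself are assumptions for illustration; in practice the classification depends on what the variable means, not on how the value happens to be stored.

```python
# The pet-dog example as a record; the field names are hypothetical.
dog = {
    "colour": "brown and black",  # qualitative
    "hair": "long",               # qualitative
    "legs": 4,                    # quantitative, discrete (counted)
    "brothers": 2,                # quantitative, discrete (counted)
    "weight_kg": 25.5,            # quantitative, continuous (measured)
    "height_mm": 565.0,           # quantitative, continuous (measured)
}


def classify(value):
    """Rule of thumb: text is qualitative, int is discrete, float is continuous."""
    if isinstance(value, str):
        return "qualitative"
    if isinstance(value, int):
        return "quantitative (discrete)"
    return "quantitative (continuous)"


kinds = {name: classify(value) for name, value in dog.items()}
for name, kind in kinds.items():
    print(name, "->", kind)
```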
3.2 Cross Section and Time Series Data

Time series data are observations on a variable recorded at a sequence of points in time.
The BSE SENSEX is an example of a time series, as its value is recorded at a certain time on
each day. Line charts are used to plot time series data; they enable the viewer to analyze the
data with ease and to compare the values at one point in time with the values at another.
Other examples of time-series would be staff numbers at a particular institution taken on a
monthly basis in order to assess staff turnover rates, weekly sales figures of ice-cream sold
during a holiday period at a seaside resort and the number of students registered for a particular
course on a yearly basis. All of the above would be used to forecast likely data patterns in the
future.
Cross-section data is data collected on different units (individuals, households, firms, regions) at a
particular point in time. This type of statistical information is useful when observing habits
within a country, such as eating habits, voting habits, and drinking habits. Applying a certain set
of questions to a certain number of people in different areas and collating the information gives
a realistic picture that is relevant to a nation or an area as a whole.
Another example of cross-section data is business data collected to see the popularity of certain
products at a particular time; this is known as market research.
As another example, the closing prices of a group of 20 different tech stocks on the BSE on
September 15, 2014 would be cross-sectional data. Note that the
underlying population should consist of members with similar characteristics. For example,
suppose you are interested in how much companies spend on research and development
expenses. Firms in some industries such as retail spend little on research and development
(R&D), while firms in industries such as technology spend heavily on R&D. Therefore, it's
inappropriate to summarize R&D data across all companies. Rather, analysts should summarize
R&D data by industry, and then analyze the data in each industry group. Other examples of
cross-sectional data would be: an inventory of all ice creams in stock at a particular supermarket,
a list of grades obtained by a class of students for a specific test.
The major difference between time series data and cross-section data is that the former focuses
on results gained over an extended period of time, often within a small area, whilst the latter
focuses on the information received from surveys and opinions at a particular time, in various
locations, depending on the information sought.
4. FREQUENCY DISTRIBUTIONS: ABSOLUTE AND RELATIVE
Frequency distribution is a specification of the way in which the frequencies of members of a
population are distributed according to the values of the variates which they exhibit. For
observed data the distribution is usually specified in tabular form, with some grouping for
continuous variates.
The frequency distribution or frequency table is a tabular organization of statistical data,
assigning to each piece of data its corresponding frequency.
Types of Frequencies
(a) Absolute Frequency
The absolute frequency is the number of times that a certain value appears in a statistical study.
It is denoted by $f_i$.
The sum of the absolute frequencies is equal to the total number of data, which is denoted by N.
$$f_1 + f_2 + f_3 + \dots + f_n = N$$
This sum is commonly denoted by the Greek letter $\Sigma$ (capital sigma), which represents sum:
$$\sum_{i=1}^{n} f_i = N$$
(b) Relative Frequency
The relative frequency is the quotient between the absolute frequency of a certain value and the
total number of data. It can be expressed as a percentage and is denoted by $n_i$.
$$n_i = \frac{f_i}{N}$$
The sum of the relative frequencies is equal to 1.
(c) Cumulative Frequency

The cumulative frequency is the sum of the absolute frequencies of all values less than or equal
to the value considered.
It is denoted by $F_i$.
(d) Relative Cumulative Frequency
The relative cumulative frequency is the quotient between the cumulative
frequency of a particular value and the total number of data. It can be expressed as
a percentage. It is denoted by $N_i$, so that $N_i = F_i / N$.
Example
A city has recorded the following daily maximum temperatures during a month:
32, 31, 28, 29, 33, 32, 31, 30, 31, 31, 27, 28, 29, 30, 32, 31, 31, 30, 30, 29, 29, 30, 30, 31, 30, 31,
34, 33, 33, 29, 29.
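Let us form a table based on this information. The first column lists the values of the variable
($x_i$) ordered from lowest to highest, the second column gives the absolute frequency ($f_i$), that is,
the number of times each value has occurred, and the third column gives the cumulative frequency ($F_i$).
The fourth and fifth columns give the relative frequency ($n_i$) and the relative cumulative frequency ($N_i$).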
xi       fi      Fi      ni       Ni
27        1       1     0.032    0.032
28        2       3     0.065    0.097
29        6       9     0.194    0.290
30        7      16     0.226    0.516
31        8      24     0.258    0.774
32        3      27     0.097    0.871
33        3      30     0.097    0.968
34        1      31     0.032    1
Total    31             1
Discrete variables are used for this type of frequency table.
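The table above can be reproduced with a short program. The following is a minimal sketch in Python, using only the temperature data from the example; the function name `frequency_table` is an illustrative choice, not part of the original text.

```python
from collections import Counter

# Daily maximum temperatures from the example above (31 observations).
temps = [32, 31, 28, 29, 33, 32, 31, 30, 31, 31, 27, 28, 29, 30, 32, 31,
         31, 30, 30, 29, 29, 30, 30, 31, 30, 31, 34, 33, 33, 29, 29]

def frequency_table(data):
    """Return rows (x, f, F, n, N_rel): value, absolute frequency,
    cumulative frequency, relative frequency, relative cumulative frequency."""
    counts = Counter(data)
    total = len(data)
    rows = []
    cumulative = 0
    for x in sorted(counts):
        f = counts[x]
        cumulative += f
        rows.append((x, f, cumulative, f / total, cumulative / total))
    return rows

table = frequency_table(temps)
for x, f, F, n, N_rel in table:
    print(f"{x:>3} {f:>3} {F:>3} {n:6.3f} {N_rel:6.3f}")
```

Each printed row corresponds to one row of the table: the relative columns are simply the absolute and cumulative frequencies divided by N = 31.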

5. GRAPHS OF FREQUENCY DISTRIBUTION
A frequency distribution can be represented graphically in any of the following ways.
The most commonly used graphs and curves for representing a frequency distribution are:
Bar Charts
Histogram
Frequency Polygon
Smoothened frequency curve
Ogives or cumulative frequency curves.
(a) Bar Charts
A bar chart is used to present categorical, quantitative or discrete data.
The information is presented on a coordinate axis. The values of the variable are represented on
the horizontal axis and the absolute, relative or cumulative frequencies are represented on the
vertical axis.
The data is represented by bars whose height is proportional to the frequency.
Example
A study has been conducted to determine the blood group of a class of 20 students. The results
are as follows:
Blood
Group
fi
AB
9
20
Based on this we can draw a bar chart as follows.

Step 1: Number the Y-axis with the dependent variable. The dependent variable is the one being
measured in the study. In this example, the study wanted to know how many students
belonged to each blood group, so the number of students is the dependent variable and is
marked on the Y-axis.
Step 2: Label the X-axis with what the bars represent. For this problem, label the X-axis "Blood
Group" and then label the Y-axis with what it represents: "Number of students".
Step 3: Draw your bars. The height of each bar should be level with the correct number on the
Y-axis. Don't forget to label each bar under the X-axis.
Finally, give your graph a name. For this problem, call the graph "Blood group of students".
[Figure: bar chart "Blood group of students"; X-axis: Blood Group; Y-axis: Number of students]
Histogram:
A histogram is a set of vertical bars whose areas are proportional to the frequencies
represented. While constructing a histogram, the variable is always taken on the X-axis and the
frequencies on the Y-axis. The width of each bar in the histogram is proportional to the
class interval. The bars are drawn without leaving space between them. A histogram is generally
used to represent a continuous frequency distribution. If the class intervals are uniform for a
frequency distribution, then the width of all the bars will be equal.
Example:
Marks              10-15   15-20   20-25   25-30   30-35
No. of students      5      20      47      38      10
[Figure: histogram of the above distribution; X-axis: Marks; Y-axis: No. of students]
Frequency Polygon (or line graphs)

Frequency Polygon is a graph of frequency distribution. Frequency polygons are a
graphical device for understanding the shapes of distributions. They serve the same purpose as
histograms, but are especially helpful for comparing sets of data.
To create a frequency polygon, start just as for histograms, by choosing a class interval. Then
draw an X-axis representing the values of the scores in your data. Mark the middle of each class
interval with a tick mark, and label it with the middle value represented by the class. Draw the
Y-axis to indicate the frequency of each class. Place a point in the middle of each class interval at
the height corresponding to its frequency. Finally, connect the points. You should include one
class interval below the lowest value in your data and one above the highest value. The graph
will then touch the X-axis on both sides.
Another method of constructing a frequency polygon is to take the mid-points of the various class
intervals, plot the frequency corresponding to each mid-point, and join all these points by
straight lines. Here we need not construct a histogram.
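The construction described above can be sketched in a few lines of Python. The sketch below assumes equal class widths and reuses the marks distribution from the histogram example (10-15, 15-20, ... with frequencies 5, 20, 47, 38, 10); the function name `polygon_points` is hypothetical.

```python
# Class intervals and frequencies taken from the histogram example.
classes = [(10, 15), (15, 20), (20, 25), (25, 30), (30, 35)]
freqs = [5, 20, 47, 38, 10]

def polygon_points(classes, freqs):
    """Return the (mid-point, frequency) pairs of a frequency polygon.

    Assumes all classes have the same width. One empty class is added
    below the first class and one above the last, so that the polygon
    touches the X-axis on both sides.
    """
    width = classes[0][1] - classes[0][0]
    mids = [(lower + upper) / 2 for lower, upper in classes]
    start = (mids[0] - width, 0)
    end = (mids[-1] + width, 0)
    return [start] + list(zip(mids, freqs)) + [end]

points = polygon_points(classes, freqs)
for mid, f in points:
    print(mid, f)
```

Joining the listed points in order by straight lines gives the polygon; no histogram is needed.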
Example:
Draw a frequency polygon to the following frequency distribution
Marks:
10-20
20-30
No. of
13
30-40
40-50
50-60
19
28
19
60-70
11
70-80
9
Students:
[Figure: frequency polygon of the above distribution; X-axis: Marks; Y-axis: No. of students]
Frequency Curves
Frequency curves are derived from frequency polygons. A frequency curve is obtained by
joining the points of a frequency polygon by a freehand smoothed curve. Unlike a frequency
polygon, where the points are joined by straight lines, here the points are joined freehand
in order to get a smoothed frequency curve. It is used to remove the ruggedness of the
polygon and to present it in a good form or shape. We smooth only the angularities of the polygon,
without making any basic change in the shape of the curve. In this case also the curve
begins and ends at the base line, as in the case of the polygon. The area under the curve must
remain almost the same as in the case of the polygon.
Example:
Marks:
10-20
No. of
Students:
20-30
30-40
15
40-50
20
50-60
60-70
12
7
[Figure: smoothed frequency curve of the above distribution; X-axis: Marks; Y-axis: No. of students]
Difference between frequency polygon and frequency curve

Frequency polygon is drawn to frequency distribution of discrete or continuous nature.
Frequency curves are drawn to continuous frequency distribution. Frequency polygon is
obtained by joining the plotted points by straight lines. Frequency curves are smooth. They are
obtained by joining plotted points by smooth curve.
Ogives (Cumulative frequency curve)
When a frequency distribution is cumulated, we get a cumulative frequency distribution. A
series can be cumulated in two ways. In one method, the frequencies of all the preceding classes
are added to the frequency of a class. This series is called the less than cumulative series. In the
other method, the frequencies of the succeeding classes are added to the frequency of a class. This
is called the more than cumulative series. Smoothed frequency curves drawn for these two cumulative
series are called cumulative frequency curves or ogives. Thus, corresponding to the two cumulative
series, we get two ogive curves, known as the less than ogive and the more than ogive.
The less than ogive is obtained by plotting the cumulated frequencies against the upper
limits of the class intervals. The more than ogive is obtained by plotting the cumulated frequencies
against the lower limits of the class intervals. The less than ogive is an increasing curve, sloping
upwards from left to right. The more than ogive is a decreasing curve, sloping downwards from
left to right.
Example:
Form the less than and more than cumulative frequency distributions for the following frequency
distribution.
Marks              10-20   20-30   30-40   40-50   50-60   60-70
No. of Students      4       6      10      20      18       2
Cumulative frequency distribution:
Marks less than     10   20   30   40   50   60   70
No. of Students      0    4   10   20   40   58   60
Marks more than     10   20   30   40   50   60   70
No. of Students     60   56   50   40   20    2    0
[Figure: less than ogive and more than ogive; X-axis: Marks; Y-axis: No. of Students]
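The two cumulative series in this example can be computed as follows. This is a minimal Python sketch using the class limits and frequencies of the example; the variable names are illustrative.

```python
from itertools import accumulate

# Class limits 10, 20, ..., 70 and the frequency of each class 10-20, ..., 60-70.
limits = [10, 20, 30, 40, 50, 60, 70]
freqs = [4, 6, 10, 20, 18, 2]

total = sum(freqs)

# "Less than" series: number of students with marks below each limit.
less_than = [0] + list(accumulate(freqs))

# "More than" series: number of students with marks at or above each limit.
more_than = [total - below for below in less_than]

for limit, lt, mt in zip(limits, less_than, more_than):
    print(f"{limit:>3}  less than: {lt:>2}  more than: {mt:>2}")
```

Plotting `less_than` against the limits gives the less than ogive, and plotting `more_than` against the same limits gives the more than ogive.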
Pie Diagrams
One of the most common ways to represent data graphically is called a pie chart. It gets its name
by how it looks, just like a circular pie that has been cut into several slices. This kind of graph is
helpful when graphing qualitative data, where the information describes a trait or attribute and is
not numerical. Each trait corresponds to a different slice of the pie. By looking at all of the pie
pieces, you can compare how much of the data fits in each category.
Pie charts are a form of area chart that is easy to understand at a quick look. They show
each part as a share (percentage) of the total. Pie charts are useful for presenting polls,
statistics, and income or spending data, because the relative size of each category can be seen
at a glance.
Pie diagrams are used when the aggregate and its divisions are to be shown together.
The aggregate is shown by means of a circle and the divisions by the sectors of the circle. For
example, the total expenditure of a government distributed over different departments
like agriculture, irrigation, industry, transport etc. can be shown in a pie diagram. In constructing
a pie diagram the various components are first expressed as percentages, and each percentage
is then multiplied by 3.6 (since 360/100 = 3.6) to get the angle for each component. Then the circle
is divided into sectors such that the angles of the sectors equal the angles of the components.
Therefore one sector represents one component. Usually the components are shown in descending
order of their angles.
Example:
You conducted a survey as part of a project work. You took a sample of 20 individuals and
you want to represent their occupations using a pie chart.
First, put your data into a table, then add up all the values to get a total.
Divide each value by the total and multiply by 100 to get a percent.
Now you need to figure out how many degrees to give each pie slice (correctly
called a sector). A full circle has 360 degrees, so the angle of each sector is calculated using the formula
$$\text{angle} = \frac{\text{value}}{\text{total}} \times 360$$
Occupation   No. of persons   Percent         Angle (degrees)
Farmer             4          4/20 = 20%      4/20 x 360 = 72
Business           5          5/20 = 25%      5/20 x 360 = 90
Teacher            6          6/20 = 30%      6/20 x 360 = 108
Bank               1          1/20 = 5%       1/20 x 360 = 18
Driver             4          4/20 = 20%      4/20 x 360 = 72
TOTAL             20          100%            360
Draw a circle using a pair of compasses.

Use a protractor to draw the angle for each sector.
Label the circle graph and all its sectors.
[Figure: pie chart "SAMPLE POPULATION BY OCCUPATION" with sectors Farmer 20%, Business 25%, Teacher 30%, Bank 5%, Driver 20%]
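The percentage and angle calculations for this survey can be checked with a short Python sketch. The dictionary name `survey` and the layout of the output are illustrative choices.

```python
# Number of persons in each occupation, from the survey of 20 individuals.
survey = {"Farmer": 4, "Business": 5, "Teacher": 6, "Bank": 1, "Driver": 4}

total = sum(survey.values())

# For each occupation: (percentage of total, sector angle in degrees).
sectors = {
    name: (100 * count / total, 360 * count / total)
    for name, count in survey.items()
}

for name, (percent, angle) in sectors.items():
    print(f"{name:<9} {percent:5.1f}%  {angle:6.1f} degrees")
```

The angles always add up to 360 degrees and the percentages to 100, which is a quick check that no category has been left out.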
Pie charts are best used with qualitative data. However, there are some limitations in using them.
If there are too many categories, then there will be a multitude of pie pieces. Some of these are
likely to be very skinny, and can be difficult to compare to one another.
If we want to compare different categories that are close in size, a pie chart does not always help
us to do this. If one slice has central angle of 30 degrees, and another has a central angle of 29
degrees, then it would be very hard to tell at a glance which pie piece is larger than the other.
6. SUMMARY MEASURE OF DISTRIBUTIONS
We will discuss three sets of summary measures, namely measures of central tendency,
variability and shape. These are called summary measures because they summarise the data. For
example, one summary measure very familiar to you is the mean, which is a measure of
central tendency. If we take the mean mark of students in a class for a subject, it gives a rough
idea of what the marks are like. Thus, based on just one summary value, we get an idea of the entire
data.
6.1 Measures of Central Tendency
A measure of central tendency is a measure that tells us where the middle of a bunch of data lies.
A measure of central tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data. As such, measures of central tendency are
sometimes called measures of central location. They are also classed as summary statistics. The
mean (often called the average) is most likely the measure of central tendency that you are most
familiar with, but there are others, such as the median and the mode.
Mean: Mean is the most common measure of central tendency. It is simply the sum of the
numbers divided by the number of numbers in a set of data. This is also known as average.
Median: Median is the number present in the middle when the numbers in a set of data are
arranged in ascending or descending order. If the number of numbers in a data set is even, then
the median is the mean of the two middle numbers.
Mode: Mode is the value that occurs most frequently in a set of data.
The mean, median and mode are all valid measures of central tendency, but under different
conditions, some measures of central tendency become more appropriate to use than others. In
the following sections, we will look at the mean, mode and median, and learn how to calculate
them.
We will also discuss Geometric Mean and Harmonic Mean.
Requisites of a good average
Since an average is a single value representing a group of values, it is desired that such a
value satisfies the following properties.
1. Easy to understand: Since statistical methods are designed to simplify complexity, a good
average should be easy to understand.
2. Simple to compute: A good average should be easy to compute so that it can be used
widely. However, though ease of computation is desirable, it should not be sought at the
expense of other advantages; that is, a more difficult average may be preferable in the interest
of greater accuracy.
3. Based on all items: The average should depend upon each and every item of the series,
so that if any of the items is dropped, the average itself is altered.
4. Not unduly affected by extreme observations: Although each and every item should
influence the value of the average, none of the items should influence it unduly. If one or two
very small or very large items unduly affect the average, i.e. either increase or reduce its
value, the average cannot be really typical of the entire series. In other words, extremes may distort
the average and reduce its usefulness.
5. Rigidly defined: An average should be properly defined so that it has only one
interpretation. It should preferably be defined by an algebraic formula so that if different people
compute the average from the same figures they all get the same answer. The average should not
depend upon the personal prejudice and bias of the investigator, otherwise the results can be
misleading.
6. Capable of further algebraic treatment: We should prefer to have an average that can be
used for further statistical computation so that its utility is enhanced. For example, if we are
given the data about the average income and number of employees of two or more factories, we
should be able to compute the combined average.
7. Sampling stability: Last, but not least, we should prefer a value which has what
statisticians call sampling stability. This means that if we pick 10 different groups of college
students and compute the average of each group, we should expect to get approximately the
same value. It does not mean, however, that there can be no difference in the values of different
samples. There may be some differences, but those in which this difference is less
are considered better than those in which the difference is more.
(a) Mean (Arithmetic mean / average)
The mean (or average) is the most popular and well known measure of central tendency. It can
be used with both discrete and continuous data, although its use is most often with continuous
data. The mean is equal to the sum of all the
values in the data set divided by the number of values in the data set. So, if we have n values in a
data set and they have values x1, x2, ..., xn, the sample mean, usually denoted by $\bar{x}$
(pronounced "x bar"), is:
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$$
This formula is usually written in a slightly different manner using the Greek
capital letter $\Sigma$, pronounced "sigma", which means "sum of...":
$$\bar{x} = \frac{\sum x}{n}$$
Example
In a survey you collected information on monthly spending for mobile recharge by 20 students of
which 10 are male and 10 female. We illustrate below how the data is used to find mean.
Sl. No.    Male     Female    Both
1          250      100       350
2          150      150       300
3          100      150       250
4          175      100       275
5          150      200       350
6          250      150       400
7          200      125       325
8          200      150       350
9          150      130       280
10         170      180       350
Total      1795     1435      3230
Mean       179.50   143.50    161.50
First we find the mean for male students. Here $\sum x$ = 1795 and n = 10, so the mean is 1795/10 = 179.5.
Similarly, for female students $\sum x$ = 1435 and n = 10, so the mean is 1435/10 = 143.5.
We also find the mean for male and female students taken together.
Here $\sum x$ = 3230 and n = 20, so the mean is 3230/20 = 161.50.
Based on the above we can make certain observations. Male students spend Rs. 179.50 on an
average in a month for mobile recharge. Female students spend Rs. 143.50. We may conclude
that male students spend more on monthly mobile recharges. As a researcher, you may now use
this information to make further studies as to why this is so: what are the factors that make male
students spend more on mobile recharges? We have also calculated the average for all students
taken together. It is Rs. 161.50. Thus we observe that the male students spend more than the
average for all students while female students spend less than the average for all students.
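The three means in this example can be verified with a few lines of Python. This is only an illustrative sketch; the function name `arithmetic_mean` is our own.

```python
# Monthly mobile recharge spending (Rs.) of the 10 male and 10 female students.
male = [250, 150, 100, 175, 150, 250, 200, 200, 150, 170]
female = [100, 150, 150, 100, 200, 150, 125, 150, 130, 180]

def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

mean_male = arithmetic_mean(male)            # 1795 / 10
mean_female = arithmetic_mean(female)        # 1435 / 10
mean_all = arithmetic_mean(male + female)    # 3230 / 20

print(mean_male, mean_female, mean_all)
```

Because both groups have the same size, the combined mean is also the simple average of the two group means; with unequal group sizes a weighted mean would be needed.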
The mean can also be calculated using another method, called the short cut method, as explained below.
Short cut method: The arithmetic mean can also be calculated by the short cut method. This method
reduces the amount of calculation. It involves the following steps:
i. Assume any one value as an assumed mean, which is also known as working mean
or arbitrary average (A).
ii. Find out the difference of each value from the assumed mean
(d = X - A).
iii. Add all the deviations ($\sum d$).
iv. Apply the formula
$$\bar{X} = A + \frac{\sum d}{N}$$
where $\bar{X}$ = mean,
$\sum d$ = sum of deviations from the assumed mean,
A = assumed mean and N = number of items.
Example:
Calculate the arithmetic mean
Roll No.    Marks (X)    d = X - 55
1           40           -15
2           50            -5
3           55             0
4           78            23
5           58             3
6           60             5
                         $\sum d$ = 11
$$\bar{X} = A + \frac{\sum d}{N} = 55 + \frac{11}{6} = 56.83$$
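As a check on the short cut method, the sketch below (Python, with the hypothetical function name `shortcut_mean`) applies the steps to the marks of the six students and compares the result with the direct calculation.

```python
def shortcut_mean(values, assumed_mean):
    """Short cut method: assumed mean plus the average deviation from it."""
    deviations = [x - assumed_mean for x in values]
    return assumed_mean + sum(deviations) / len(values)

marks = [40, 50, 55, 78, 58, 60]

by_shortcut = shortcut_mean(marks, assumed_mean=55)
by_direct = sum(marks) / len(marks)

print(round(by_shortcut, 2), round(by_direct, 2))
```

Any value may be taken as the assumed mean; the result is the same, only the size of the deviations changes.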
Calculation of arithmetic mean - Discrete series

To find the total of the items in a discrete series, each value is multiplied by its
frequency. The products so obtained are totalled. This total is then divided by the total
number of frequencies to obtain the arithmetic mean.
Steps
1. Multiply each size of the item by its frequency (fX)
2. Add all fX ($\sum fX$)
3. Divide $\sum fX$ by the total frequency (N).
The formula is
$$\bar{X} = \frac{\sum fX}{N}$$
Example
X:    1    2    3    4    5
f:   10   12    8    7   11
Solution
X        f        fX
1        10       10
2        12       24
3         8       24
4         7       28
5        11       55
     N = 48   $\sum fX$ = 141
$$\bar{X} = \frac{\sum fX}{N} = \frac{141}{48} = 2.94$$
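The same calculation can be written as a short Python sketch. It uses the values and frequencies of the example; 141/48 works out to 2.9375.

```python
# Values (X) and their frequencies (f) from the example.
values = [1, 2, 3, 4, 5]
freqs = [10, 12, 8, 7, 11]

n = sum(freqs)                                      # total frequency N
sum_fx = sum(f * x for x, f in zip(values, freqs))  # sum of f times X
mean = sum_fx / n

print(n, sum_fx, mean)
```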
Short cut Method
Steps:
1. Take an assumed mean (A)
2. Find out the deviation of each value from A, i.e. d = X - A
3. Multiply each d by the respective frequency (fd)
4. Add up the products ($\sum fd$)
5. Apply the formula
$$\bar{X} = A + \frac{\sum fd}{N}$$
Continuous series
In a continuous frequency distribution, the individual values within each class are
unknown. We therefore assume that the frequency of each class interval is concentrated at
its centre, so the mid-point of each class interval has to be found.
In a continuous frequency distribution, the mean can be calculated
by any of the following methods.
a. Direct method
b. Short cut method
c. Step deviation method
a. Direct Method
Steps:
1. Find out the mid value of each group or class. The mid value is obtained by adding the
lower and upper limit of the class and dividing the total by two. (symbol = m)
2. Multiply the mid value of each class by the frequency of the class. In other words m will
be multiplied by f.
3. Add up all the products ($\sum fm$)
4. Divide $\sum fm$ by N, that is, $\bar{X} = \frac{\sum fm}{N}$
Example:
From the following find out the mean profit
Profit/Shop:    100-200  200-300  300-400  400-500  500-600  600-700  700-800
No. of shops:     10       18       20       26       30       28       18
Solution
Profit       Mid point (m)    No. of shops (f)    fm
100-200      150              10                   1500
200-300      250              18                   4500
300-400      350              20                   7000
400-500      450              26                  11700
500-600      550              30                  16500
600-700      650              28                  18200
700-800      750              18                  13500
                              $\sum f$ = 150      $\sum fm$ = 72900
$$\bar{X} = \frac{\sum fm}{N} = \frac{72900}{150} = 486$$
b) Short cut method

Steps:
1. Find the mid value of each class or group (m)
2. Assume any one of the mid values as an average (A)
3. Find out the deviations of the mid value of each class from the assumed mean (d = m - A)
4. Multiply the deviation of each class by its frequency (fd).
5. Add up the products of step 4 ($\sum fd$)
6. Apply the formula
$$\bar{X} = A + \frac{\sum fd}{N}$$
Example: (solving the last example)
Solution: Calculation of Mean
Profit       m      d = m - 450    f      fd
100-200      150    -300           10     -3000
200-300      250    -200           18     -3600
300-400      350    -100           20     -2000
400-500      450       0           26         0
500-600      550     100           30      3000
600-700      650     200           28      5600
700-800      750     300           18      5400
                            $\sum f$ = 150   $\sum fd$ = 5400
$$\bar{X} = A + \frac{\sum fd}{N} = 450 + \frac{5400}{150} = 486$$
c) Step deviation method
The short cut method discussed above is further simplified, and the calculations are reduced to a
great extent, by adopting the step deviation method.
Steps:
1. Find out the mid value of each class or group (m)
2. Assume any one of the mid values as an average (A)
3. Find out the deviations of the mid value of each class from the assumed mean (d)
4. Divide the deviations by a common factor c (d' = d/c)
5. Multiply the d' of each class by its frequency (fd')
6. Add up the products ($\sum fd'$)
7. Then apply the formula
$$\bar{X} = A + \frac{\sum fd'}{N} \times c$$
Where c = Common factor
Example:
Calculate mean for the last problem
Solution
Profit       m      f      d       d'     fd'
100-200      150    10     -300    -3     -30
200-300      250    18     -200    -2     -36
300-400      350    20     -100    -1     -20
400-500      450    26        0     0       0
500-600      550    30      100     1      30
600-700      650    28      200     2      56
700-800      750    18      300     3      54
             $\sum f$ = 150               $\sum fd'$ = 54
$$\bar{X} = A + \frac{\sum fd'}{N} \times c = 450 + \frac{54}{150} \times 100 = 450 + (0.36 \times 100) = 486$$
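The three methods for a continuous series (direct, short cut and step deviation) can be compared side by side. The following Python sketch uses the profit distribution of the example, with A = 450 and c = 100 as in the solutions above; the variable names are illustrative.

```python
# Profit classes and number of shops from the example.
classes = [(100, 200), (200, 300), (300, 400), (400, 500),
           (500, 600), (600, 700), (700, 800)]
freqs = [10, 18, 20, 26, 30, 28, 18]

mids = [(lower + upper) / 2 for lower, upper in classes]
n = sum(freqs)

# a. Direct method: sum of f*m divided by N.
direct = sum(f * m for f, m in zip(freqs, mids)) / n

# b. Short cut method: deviations d = m - A from an assumed mean A.
A = 450
sum_fd = sum(f * (m - A) for f, m in zip(freqs, mids))
shortcut = A + sum_fd / n

# c. Step deviation method: deviations divided by a common factor c.
c = 100
sum_fd_step = sum(f * (m - A) / c for f, m in zip(freqs, mids))
step = A + (sum_fd_step / n) * c

# All three give 486, up to floating-point rounding.
print(direct, shortcut, step)
```

The three results agree because the short cut and step deviation methods only shift and rescale the mid values before averaging, and then undo the shift and the scaling.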
The mean is essentially a model of your data set: a single value that stands for all the values. You will
notice, however, that the mean is not often one of the actual values that you have observed in
your data set. However, one of its important properties is that it minimises error in the prediction
of any one value in your data set. That is, it is the value for which the sum of squared deviations
from all the values in the data set is smallest.
An important property of the mean is that it includes every value in your data set as part of the
calculation. In addition, the mean is the only measure of central tendency where the sum of the
deviations of each value from the mean is always zero.
We complete our discussion on arithmetic mean by listing the merits and demerits of it.
Merits:
It is rigidly defined.
It is easy to calculate and simple to follow.
It is based on all the observations.

It is determined for almost every kind of data.
It is finite and not indefinite.
It is readily put to algebraic treatment.

It is least affected by fluctuations of sampling.
Demerits:
The arithmetic mean is highly affected by extreme values.
It cannot average the ratios and percentages properly.
It is not an appropriate average for highly skewed distributions.
It cannot be computed accurately if any item is missing.
The mean sometimes does not coincide with any of the observed value.
We elaborate on only one of the demerits for your better understanding. The first demerit says
the arithmetic mean is highly affected by extreme values. What does this mean? See the
following example.
Consider the following table which gives information on the marks obtained by students in a test.
Student:    1    2    3    4    5    6    7    8    9   10
Mark:      15   18   16   14   15   15   12   17   90   95
The mean mark for these ten students is 30.7. However, inspecting the raw data suggests that this
mean value might not be the best way to accurately reflect the typical mark obtained by a
student, as most students have marks in the 12 to 18 range. Here we see that the mean is being
affected by the two large figures 90 and 95. This shows that arithmetic mean is highly affected
by extreme values.
Therefore, in this situation, we would like to have a better measure of central tendency. As we
will find out later, taking the median would be a better measure of central tendency in this
situation.
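The effect of the two extreme marks can be seen by computing the mean and the median of the same data. This is a small Python sketch using the standard `statistics` module.

```python
from statistics import mean, median

# Marks of the ten students; 90 and 95 are the extreme values.
marks = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_all = mean(marks)       # pulled upwards by 90 and 95
median_all = median(marks)   # middle of the sorted marks

# The same measures without the two extreme values.
typical = [m for m in marks if m < 90]
mean_typical = mean(typical)

print(mean_all, median_all, mean_typical)
```

The median stays inside the 12 to 18 range where most marks lie, while the mean of all ten marks is roughly double the mean of the eight typical marks.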
Weighted Mean
Simple arithmetic mean gives equal importance to all items. Sometimes the items in a
series may not have equal importance. In that case the simple arithmetic mean is not suitable
and a weighted average is appropriate.
Weighted means are obtained by taking into account these weights (or importance).
Each value is multiplied by its weight and the sum of these products is divided by the total weight to
get the weighted mean.
A weighted average often gives a fair measure of central tendency. In many cases it is
better to have a weighted average than a simple average. It is invariably used in the following
circumstances.
1. When the importance of all items in a series is not equal. We associate weights with the
items.
2. For comparing the average of one group with the average of another group, when the
frequencies in the two groups are different, weighted averages are used.
3. When ratios, percentages and rates are to be averaged, weighted average is used.
4. It is also used in the calculation of birth and death rates, index numbers etc.
5. When the average of a number of series is to be found out together, weighted average is used.
Formula: Let x1, x2, x3, ..., xn be n values with corresponding weights
w1, w2, w3, ..., wn. Then the weighted average is
$$\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n} = \frac{\sum wx}{\sum w}$$
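As an illustration of the weighted mean, consider the combined average income of two factories. The figures below are hypothetical and are not taken from the text; the function name `weighted_mean` is also our own.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical data: average monthly income (Rs.) in two factories
# and the number of employees in each, used as weights.
avg_income = [12000, 15000]
employees = [40, 60]

combined = weighted_mean(avg_income, employees)
simple = sum(avg_income) / len(avg_income)

print(combined, simple)
```

The weighted mean lies closer to the average of the larger factory, whereas the simple mean treats the two factories as equally important.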
(b) Median
The median is also a frequently used measure of central tendency. The median is the midpoint of
a distribution: the same number of data points are above the median as below it. The median is
the middle score for a set of data that has been arranged in order of magnitude.
The median is determined by sorting the data set from lowest to highest values and taking the
data point in the middle of the sequence. There is an equal number of points above and below the
median. For example, in the data 7,8,9,10,11, the median is 9; there are two data points greater
than this value and two data points less than this value. Thus to find the median, we arrange the
observations in order from smallest to largest value. If there is an odd number of observations,
the median is the middle value.
If there is an even number of observations, the median is the average of the two middle values.
Thus, the median of the numbers 2, 4, 7, 12 is (4+7)/2 = 5.5.
In certain situations the mean and median of the distribution will be the same, and in some
situations it will be different. For example, in the data 1,2,3,4,5 the median is 3; there are two
data points greater than this value and two data points less than this value. In this case, the
median is equal to the mean. But consider the data 1,2,3,4,10. In this dataset, the median still is
three, but the mean is equal to 4.
The median can be determined for ordinal data as well as interval and ratio data. Unlike the
mean, the median is not influenced by outliers at the extremes of the data set. For this reason, the
median often is used when there are a few extreme values that could greatly influence the mean
and distort what might be considered typical. For data which is very skewed, the median often is
used instead of the mean.
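The rule for odd and even numbers of observations can be written as a short Python function. The sketch below (with the illustrative name `median_of`) is applied to the data sets used in this section.

```python
def median_of(values):
    """Middle value of the sorted data; for an even number of
    observations, the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_of([7, 8, 9, 10, 11]))   # odd number of observations
print(median_of([2, 4, 7, 12]))       # even number of observations

# Same median, different means: the value 10 pulls only the mean upwards.
for data in ([1, 2, 3, 4, 5], [1, 2, 3, 4, 10]):
    print(median_of(data), sum(data) / len(data))
```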
Calculation of Median : Discrete series
Steps:
Arrange the data in ascending or descending order
Find the cumulative frequencies
Apply the formula
$$\text{Median} = \text{size of } \left(\frac{N+1}{2}\right)^{\text{th}} \text{ item}$$
Example: Calculate median from the following
Size of shoes:    5     5.5    6     6.5    7     7.5    8
Frequency:       10    16     28    15     30    40     34
Solution
Size     f      Cumulative f
5        10      10
5.5      16      26
6        28      54
6.5      15      69
7        30      99
7.5      40     139
8        34     173
N = 173
Median = size of ((N+1)/2)th item = size of (174/2)th item = size of 87th item
The 87th item falls within the cumulative frequency 99, which corresponds to size 7.
Median = 7
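The search through the cumulative frequencies can be sketched in Python as follows. The function name `discrete_median` is hypothetical, and the sketch follows the (N+1)/2 rule of the text, which identifies a single item when N is odd, as it is here; for an even N the two middle items would have to be averaged.

```python
def discrete_median(values, freqs):
    """Median of a discrete series by the (N + 1) / 2 rule.

    Assumes the values are in ascending order and N is odd.
    """
    n = sum(freqs)
    position = (n + 1) / 2          # position of the median item
    cumulative = 0
    for value, f in zip(values, freqs):
        cumulative += f
        if cumulative >= position:  # first class that contains the item
            return value

sizes = [5, 5.5, 6, 6.5, 7, 7.5, 8]
freqs = [10, 16, 28, 15, 30, 40, 34]

print(discrete_median(sizes, freqs))
```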
Calculation of median Continuous frequency distribution
Steps:
Find out the median item by using N/2
Find out the class in which the median lies
Apply the formula
$$\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c\right)$$
Where L = lower limit of the median class
h = class interval of the median class
f = frequency of the median class
N = $\sum f$, the total frequency
c = cumulative frequency of the class preceding the median class

Example: Calculate median from the following data
Age in Below
years
10
No.
of
2
persons
Below
20
5
Below
30
9
Below
40
12
Below
50
14
Below
60
15
Below
70
15.5
70 and
over
15.6
Solution:
First we have to convert the distribution to a continuous frequency distribution as in the
following table and then compute the median.
Age in years     No. of persons (f)     Cumulative frequency (cf), less than
0-10             2                      2
10-20            5-2=3                  5
20-30            9-5=4                  9
30-40            12-9=3                 12
40-50            14-12=2                14
50-60            15-14=1                15
60-70            15.5-15=0.5            15.5
70 and above     15.6-15.5=0.1          15.6
$$\text{Median item} = \frac{N}{2} = \frac{15.6}{2} = 7.8$$
The cumulative frequency (c.f.) just greater than 7.8 is 9. Thus the corresponding class 20-30 is
the median class.
L = 20, h = 10, f = 4, N = 15.6, c = 5
Use the formula
$$\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c\right) = 20 + \frac{10}{4}(7.8 - 5) = 20 + \frac{5}{2} \times 2.8 = 20 + 5 \times 1.4 = 27$$
So the median age is 27.
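The grouped-median formula used above can be sketched as a short Python function. This is a sketch under stated assumptions: the name `grouped_median` is hypothetical, all classes are assumed to share one width, and the open-ended last class is simply given the same width of 10.

```python
def grouped_median(lower_limits, freqs, width):
    """Median = L + (h / f) * (N / 2 - c) for a continuous frequency distribution."""
    half = sum(freqs) / 2                   # N / 2, the position of the median item
    c = 0                                   # cumulative frequency of the preceding classes
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and c + f >= half:         # first class whose c.f. reaches N / 2
            return lower + (width / f) * (half - c)
        c += f
    raise ValueError("no median class found")


lower_limits = [0, 10, 20, 30, 40, 50, 60, 70]
freqs = [2, 3, 4, 3, 2, 1, 0.5, 0.1]        # class frequencies from the table above
print(round(grouped_median(lower_limits, freqs, 10), 2))  # median age
```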
The Mean vs. the Median

As measures of central tendency, the mean and the median each have advantages and
disadvantages. Some pros and cons of each measure are summarized below.
The median may be a better indicator of the most typical value if a set of scores has an outlier.
An outlier is an extreme value that differs greatly from other values.
However, when the sample size is large and does not include outliers, the mean score usually
provides a better measure of central tendency.
(b) Mode
The mode of a data set is the value that occurs most frequently. This measure is crude, yet it is
very easy to calculate. Suppose that a history class of eleven students scored the following (out
of 100) on a test: 60, 64, 70, 70, 70, 75, 80, 90, 95, 95, 100. We see that 70 is in the list
three times, 95 occurs twice, and each of the other scores is listed only once. Since 70 appears
in the list more often than any other score, it is the mode. If two values tie for the highest
frequency, the data is said to be bimodal.
The mode can be very useful for dealing with categorical data. For example, if a pizza shop sells
10 different types of pizza, the mode would represent the most popular pizza. The mode also can
be used with ordinal, interval, and ratio data. However, in interval and ratio scales, the data
may be spread thinly with no data points having the same value. In such cases, the mode may not
exist or may not be very meaningful.
In the case of a continuous frequency distribution, the mode is found using the formula
$$\text{Mode} = l + \frac{h(f_1 - f_0)}{(f_1 - f_0) + (f_1 - f_2)}$$
Rearranging we get
$$\text{Mode} = l + \frac{h(f_1 - f_0)}{2f_1 - f_0 - f_2}$$
Where
$l$ is the lower limit of the modal class
$f_1$ is the frequency of the modal class
$f_0$ is the frequency of the class preceding the modal class
$f_2$ is the frequency of the class succeeding the modal class
$h$ is the class interval of the modal class
See the following example, where we compute the mode using the above formula (the mean and median
are also computed).
Example
Find the values of mean, mode and median from the following data.
Weight (kg)     No. of students
93-97           3
98-102          5
103-107         12
108-112         17
113-117         14
118-122         6
123-127         3
128-132         1
Solution: Since the formula for mode requires the distribution to be continuous
with exclusive type classes, we first convert the classes into class boundaries.
Weight     Class boundaries   Mid value (X)   Number of students (f)   d = (X-110)/5   fd     Less than c.f.
93-97      92.5-97.5          95              3                        -3              -9     3
98-102     97.5-102.5         100             5                        -2              -10    8
103-107    102.5-107.5        105             12                       -1              -12    20
108-112    107.5-112.5        110             17                       0               0      37
113-117    112.5-117.5        115             14                       1               14     51
118-122    117.5-122.5        120             6                        2               12     57
123-127    122.5-127.5        125             3                        3               9      60
128-132    127.5-132.5        130             1                        4               4      61
N = 61, $\sum fd = 8$
Mean
$$\bar{X} = A + \frac{\sum fd}{N} \times h = 110 + \frac{5 \times 8}{61} = 110.66$$
Mean = 110.66 kgs.
Mode
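Here the maximum frequency is 17. The corresponding class 107.5-112.5 is the modal class, so
$l = 107.5$, $h = 5$, $f_1 = 17$, $f_0 = 12$ and $f_2 = 14$.
Using the formula of mode
$$\text{Mode} = l + \frac{h(f_1 - f_0)}{2f_1 - f_0 - f_2}$$
We get
$$\text{Mode} = 107.5 + \frac{5(17 - 12)}{2(17) - 12 - 14} = 107.5 + \frac{25}{8} = 107.5 + 3.125 = 110.625$$
Hence mode is 110.63 kgs.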
Median
Use the formula
$$\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c\right)$$
Here N/2 = 61/2 = 30.5
The cumulative frequency (c.f.) just greater than 30.5 is 37. So the corresponding class
107.5-112.5 is the median class.
Substituting values in the median formula
$$\text{Median} = 107.5 + \frac{5}{17}\left(\frac{61}{2} - 20\right) = 107.5 + \frac{5}{17}(30.5 - 20) = 107.5 + \frac{5 \times 10.5}{17} = 107.5 + 3.09 = 110.59$$
Median is 110.59 kgs.
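The mode and mean calculations of this example can be checked with a short Python sketch. The function names `grouped_mode` and `step_deviation_mean` are hypothetical, and the sketch assumes equal class widths and a single modal class.

```python
def grouped_mode(lower_boundaries, freqs, width):
    """Mode = l + h(f1 - f0) / (2*f1 - f0 - f2), using the class with the highest frequency."""
    i = max(range(len(freqs)), key=lambda k: freqs[k])   # index of the modal class
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0                    # class preceding the modal class
    f2 = freqs[i + 1] if i + 1 < len(freqs) else 0       # class succeeding the modal class
    return lower_boundaries[i] + width * (f1 - f0) / (2 * f1 - f0 - f2)


def step_deviation_mean(mid_values, freqs, assumed_mean, width):
    """Mean = A + (sum of f*d / N) * h, where d = (X - A) / h."""
    n = sum(freqs)
    sum_fd = sum(f * (x - assumed_mean) / width for x, f in zip(mid_values, freqs))
    return assumed_mean + sum_fd / n * width


boundaries = [92.5, 97.5, 102.5, 107.5, 112.5, 117.5, 122.5, 127.5]
mid_values = [95, 100, 105, 110, 115, 120, 125, 130]
freqs = [3, 5, 12, 17, 14, 6, 3, 1]

print(grouped_mode(boundaries, freqs, 5))                             # mode in kg
print(round(step_deviation_mean(mid_values, freqs, 110, 5), 2))       # mean in kg
```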
When to use Mean, Median, and Mode

The following table summarizes the appropriate methods of determining the middle or typical
value of a data set based on the measurement scale of the data.
Measurement Scale        Best Measure
Nominal (Categorical)    Mode
Ordinal                  Median
Interval                 Symmetrical data: Mean; Skewed data: Median
Ratio                    Symmetrical data: Mean; Skewed data: Median
Merits and demerits of mean, median and mode

The merits and demerits of the arithmetic mean have already been discussed; please refer to that
section. Here we discuss only the median and the mode.
Median:
The median is that value of the series which divides the group into two equal parts, one part
comprising all values greater than the median value and the other part comprising all the values
smaller than the median value.
Merits of median
(1) Simplicity: - It is a very simple measure of the central tendency of the series. In the case
of a simple statistical series, just a glance at the data is enough to locate the median value.
(2) Free from the effect of extreme values: - Unlike the arithmetic mean, the median value is not
distorted by the extreme values of the series.
(3) Certainty: - Certainty is another merit of the median. The median is always a certain,
specific value in the series.
(4) Real value: - The median is a real value and is a better representative value of the series
than the arithmetic mean, the value of which may not exist in the series at all.
(5) Graphic presentation: - Besides the algebraic approach, the median value can also be
estimated through the graphic presentation of data.
(6) Possible even when data is incomplete: - The median can be estimated even in the case of
certain incomplete series. It is enough if one knows the number of items and the middle item of
the series.
Demerits of median:
Following are the various demerits of median:
(1) Lack of representative character: - The median fails to be a representative measure for
series whose values are wide apart from each other. Also, the median is of limited representative
character as it is not based on all the items in the series.
(2) Unrealistic: - When the median is located somewhere between the two middle values, it remains
only an approximate measure, not a precise value.
(3) Lack of algebraic treatment: - The arithmetic mean is capable of further algebraic treatment,
but the median is not. For example, multiplying the median by the number of items in the series
will not give us the sum total of the values of the series.
However, the median is quite a simple method of finding the average of a series. It is commonly
used for series that relate to qualitative observations, such as the health of students.
Mode: The value of the variable which occurs most frequently in a distribution is called the
mode.
Merits of mode:
Following are the various merits of mode:
(1) Simple and popular: - The mode is a very simple measure of central tendency. Sometimes, just
a glance at the series is enough to locate the modal value. Because of its simplicity, it is a
very popular measure of central tendency.
(2) Less effect of marginal values: - Compared to the mean, the mode is less affected by marginal
values in the series. The mode is determined only by the value with the highest frequency.
(3) Graphic presentation: - The mode can be located graphically, with the help of a histogram.
(4) Best representative: - The mode is the value which occurs most frequently in the series.
Accordingly, the mode is the best representative value of the series.
(5) No need of knowing all the items or frequencies: - The calculation of the mode does not
require knowledge of all the items and frequencies of a distribution. In a simple series, it is
enough if one knows the items with the highest frequencies in the distribution.
Demerits of mode:
Following are the various demerits of mode:
(1) Uncertain and vague: - The mode is an uncertain and vague measure of central tendency.
(2) Not capable of algebraic treatment: - Unlike the mean, the mode is not capable of further
algebraic treatment.
(3) Difficult: - When the frequencies of all items are identical, it is difficult to identify the
modal value.
(4) Complex procedure of grouping: - Calculation of the mode involves a cumbersome procedure of
grouping the data. If the extent of grouping changes, the modal value changes too.
(5) Ignores extreme marginal frequencies: - It ignores extreme marginal frequencies. To that
extent the modal value is not a representative value of all the items in a series.
Besides, one can question the representative character of the modal value, as its calculation
does not involve all items of the series.
Exercises
1. Find the measures of central tendency for the data set 3, 7, 9, 4, 5, 4, 6, 7, and 9.
Mean = 6, median = 6 and the modes are 4, 7 and 9. Note that with three modes the data set is multimodal.
2. Four friends take an IQ test. Their scores are 96, 100, 106, 114. Which of the following
statements is true?
I. The mean is 103.
II. The mean is 104.
III. The median is 100.
IV. The median is 106.
(A) I only
(B) II only
(C) III only
(D) IV only
(E) None is true
The correct answer is (B). The mean score is computed from the equation:
Mean score = Σx / n = (96 + 100 + 106 + 114) / 4 = 104
Since there are an even number of scores (4 scores), the median is the average of the two middle
scores. Thus, the median is (100 + 106) / 2 = 103.
3. The owner of a shoe shop recorded the sizes of the feet of all the customers who bought shoes
in his shop in one morning. These sizes are listed below:
8 7 4 5 9 13 10 8 8 7 6 5 3 11 10 8 5 4 8 6
What is the mean of these values: 7.25
What is the median of these values: 7.5
What is the mode of these values: 8.
4. Eight people work in a shop. Their hourly wage rates of pay are:
Worker
Wage
14
Rs.
Work out the mean, median and mode for the values above.
Mean = 5.75, Median = 4.50, Mode = 4.00.
Using the above findings, if the owner of the shop wants to argue that the staff are paid well,
which measure would he use? He would use the mean, because it is the highest of the three values.
Using the above findings, if the staff in the shop want to argue that they are badly paid, which
measure would they use? The staff would use the mode, as it is the lowest of the three measures of
central tendency.
5. The table below gives the number of accidents each year at a particular road junction:
Year:        1991   1992   1993   1994   1995   1996   1997   1998
Accidents:   4      5      4      2      10     5      3      5
Work out the mean, median and mode for the values above.
Mean = 4.75
Median = 4.5
Mode = 5
Using the above measures, a road safety group wants to get the council to make this junction
safer. Which measure will they use to argue for this? They will use the mode, as it is the highest
of the three figures and so best supports their argument that the junction has a large number of
accidents.
Using the same data, the council does not want to spend money on the road junction. Which
measure will they use to argue that safety work is not necessary? The council will use the median,
as it is the lowest of the three figures and so helps them argue that the junction has fewer
accidents.
6. Mr Sasi grows two different types of tomato plant in his greenhouse.
One week he keeps a record of the number of tomatoes he picks from each type of plant.
Day       Mon   Tue   Wed   Thu   Fri   Sat   Sun
Type A    5     5     4     1     0     1     5
Type B    3     4     3     3     7     9     6
(a) Calculate the mean, median and mode for the Type A plants.
Mean = 3, Median = 4, Mode = 5.
(b) Calculate the mean, median and mode for the Type B plants.
Mean = 5, Median = 4, Mode = 3.
(c) Which measure would you use to argue that there is no difference between the types?
We will use the median, as it is the same for both plants.
(d) Which measure would you use to argue that Type A is the best plant?
We will use the mode, as the mode for type A is higher than for type B. Note that for type A the
mean is lower than for type B and the median is the same for both types.
(e) Which measure would you use to argue that Type B is the best plant?
We will use the mean, as the mean for type B (5) is higher than the mean for type A (3).
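Answers to exercises of this kind can be checked with Python's standard `statistics` module. The sketch below is only an illustration for Exercises 1 and 6; the variable names are hypothetical, and `multimode` (available from Python 3.8) is used because a data set may have more than one mode.

```python
from statistics import mean, median, multimode

exercise_1 = [3, 7, 9, 4, 5, 4, 6, 7, 9]
type_a = [5, 5, 4, 1, 0, 1, 5]     # Exercise 6, tomatoes picked from Type A plants
type_b = [3, 4, 3, 3, 7, 9, 6]     # Exercise 6, tomatoes picked from Type B plants

for name, data in [("Exercise 1", exercise_1), ("Type A", type_a), ("Type B", type_b)]:
    # multimode returns every value that shares the highest frequency
    print(name, "mean:", mean(data), "median:", median(data), "modes:", sorted(multimode(data)))
```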
Geometric Mean:
The geometric mean is a type of mean or average, which indicates the central tendency or typical
value of a set of numbers. It is similar to the arithmetic mean, which is what most people think of
with the word "average", except that the numbers are multiplied and then the nth root (where n is
the count of numbers in the set) of the resulting product is taken.
The geometric mean is defined as the nth root of the product of the n items of a series. If there
are two items, we take the square root; if there are three items, we take the cube root; and so on.
Symbolically:
$$GM = \sqrt[n]{X_1 \cdot X_2 \cdots X_n}$$
Where $X_1, X_2, \ldots, X_n$ refer to the various items of the series.
For instance, the geometric mean of two numbers, say 2 and 8, is just the square root of their
product; that is $\sqrt{2 \times 8} = 4$. As another example, the geometric mean of the three
numbers 1, 1/2 and 1/4 is the cube root of their product (1/8), which is 1/2; that is
$$\sqrt[3]{1 \times \tfrac{1}{2} \times \tfrac{1}{4}} = \sqrt[3]{\tfrac{1}{8}} = \tfrac{1}{2}$$
When the number of items is three or more, the task of multiplying the numbers and of
extracting the root becomes excessively difficult. To simplify calculations, logarithms are used.
GM is then calculated as follows, where N is the number of items (the total frequency in a
frequency distribution).
$$\log GM = \frac{\log X_1 + \log X_2 + \cdots + \log X_n}{N} = \frac{\sum \log X}{N}$$
$$GM = \text{Antilog}\left(\frac{\sum \log X}{N}\right)$$
In discrete series $GM = \text{Antilog}\left(\frac{\sum f \log X}{N}\right)$
In continuous series $GM = \text{Antilog}\left(\frac{\sum f \log m}{N}\right)$
Where f = frequency
m = mid point
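The logarithm method can be sketched in Python as follows. The function name `geometric_mean_logs` is hypothetical; base-10 logarithms are used to mirror log tables, and passing frequencies is optional.

```python
import math


def geometric_mean_logs(values, freqs=None):
    """GM = Antilog(sum of f * log X / N); values must all be positive."""
    if freqs is None:
        freqs = [1] * len(values)            # individual series: every item counted once
    if any(x <= 0 for x in values):
        raise ValueError("the geometric mean needs strictly positive values")
    n = sum(freqs)                           # N = total frequency
    mean_log = sum(f * math.log10(x) for x, f in zip(values, freqs)) / n
    return 10 ** mean_log                    # the antilog


print(round(geometric_mean_logs([2, 8]), 4))          # square root of 16
print(round(geometric_mean_logs([1, 0.5, 0.25]), 4))  # cube root of 1/8
```

For a continuous series the same function can be called with the class mid points as `values`.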
Merits of G.M
1. It is based on each and every item of the series.
2. It is rigidly defined.
3. It is useful in averaging ratios and percentages and in determining rates of increase and
decrease.
4. It is capable of algebraic manipulation.
Limitations
1. It is difficult to understand.
2. It is difficult to compute and to interpret.
3. It cannot be computed when there are both negative and positive values in a series, or when
one or more of the values is zero.
4. G.M. has very limited applications.
Harmonic Mean
The harmonic mean is a kind of average of a set of positive values. It is calculated by dividing
the number of observations by the sum of the reciprocals of the numbers in the series. Hence, the
harmonic mean of a set of n numbers $a_1, a_2, a_3, \ldots, a_n$ is given as
$$HM = \frac{n}{\frac{1}{a_1} + \frac{1}{a_2} + \cdots + \frac{1}{a_n}}$$
Example: Find the harmonic mean for the numbers 3 and 4.
Take the reciprocals of the given numbers and sum them.
$$\frac{1}{3} + \frac{1}{4} = \frac{4 + 3}{12} = \frac{7}{12}$$
Now apply the formula. Since the number of observations is two, here n = 2.
$$HM = \frac{2}{7/12} = 2 \times \frac{12}{7} = \frac{24}{7} = 3.43$$
In discrete series, $HM = \frac{N}{\sum (f / X)}$
In continuous series, $HM = \frac{N}{\sum (f / m)}$, where m is the mid point of the class
Merits of Harmonic mean:
1. Its value is based on every item of the series.
2. It lends itself to algebraic manipulation.
Limitations
1. It is not easily understood.
2. It is difficult to compute.
3. It gives the largest weight to the smallest item.
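The harmonic mean of the example above (3 and 4) can be verified with a small Python sketch. The function name `harmonic_mean_freq` is hypothetical; for ungrouped data its result can be compared with `statistics.harmonic_mean` from the standard library.

```python
from statistics import harmonic_mean


def harmonic_mean_freq(values, freqs=None):
    """HM = N / sum(f / X); values must all be positive."""
    if freqs is None:
        freqs = [1] * len(values)            # ungrouped data: each item counted once
    if any(x <= 0 for x in values):
        raise ValueError("the harmonic mean needs strictly positive values")
    n = sum(freqs)
    return n / sum(f / x for x, f in zip(values, freqs))


print(round(harmonic_mean_freq([3, 4]), 2))   # 24/7 rounded to two places
print(round(harmonic_mean([3, 4]), 2))        # same value from the standard library
```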
7. MEASURES OF VARIABILITY / DISPERSION

The terms variability, spread, and dispersion are synonyms, and refer to how spread out a
distribution is. Just as in the section on central tendency we discussed measures of the
centre of a distribution of scores, here we discuss measures of the variability of a
distribution. Measures of variability provide information about the degree to which individual
scores are clustered about or deviate from the average value in a distribution.
Quite often students find it difficult to understand what is meant by variability or dispersion,
and hence they find the measures of dispersion difficult. So we will discuss the meaning of the
term in detail. First, one should understand that dispersion or variability is a continuation of
our discussion of measures of central tendency, so any discussion of a measure of dispersion
relies on some measure of central tendency. We continue this discussion taking the mean as an
example. The mean or average measures the centre of the data. It is one aspect of the
observations. Another feature of the observations is how they are spread about the centre. The
observations may be close to the centre or they may be spread away from the centre. If the
observations are close to the centre (usually the arithmetic mean or median), we say that
dispersion or scatter or variation is small. If the observations are spread away from the centre,
we say dispersion is large.
Let us make this clear with the help of an example. Suppose we have three groups of students
who have obtained the following marks in a test. The arithmetic means of the three groups are
also given below:
Group A: 46, 48, 50, 52, 54, for this the mean is 50.
Group B: 30, 40, 50, 60, 70, for this the mean is 50.
Group C: 40, 50, 60, 70, 80, for this the mean is 60.
In groups A and B the arithmetic means are equal, i.e. mean of Group A = mean of Group B = 50.
But in group A the observations are concentrated around the centre. All students of group A have
almost the same level of performance. We say that there is consistency in the observations in
group A. In group B the mean is 50 but the observations are not close to the centre. One
observation is as small as 30 and one observation is as large as 70. Thus there is greater
dispersion in group B. In group C the mean is 60, but the spread of the observations with respect
to the centre 60 is the same as the spread of the observations in group B with respect to their own
centre, which is 50. Thus in groups B and C the means are different but their dispersion is the
same. In groups A and C the means are different and their dispersions are also different.
Dispersion is an important feature of the observations and it is measured with the help of the
measures of dispersion, scatter or variation. The word variability is also used for this idea of
dispersion.
The study of dispersion is very important in statistical data. If in a certain factory there is
consistency in the wages of workers, the workers will be satisfied. But if some workers have high
wages and some have low wages, there will be unrest among the low paid workers and they
might go on strike and arrange demonstrations. If in a certain country some people are very
poor and some are very rich, we say there is economic disparity. It means that dispersion is
large. The idea of dispersion is important in the study of wages of workers, prices of
commodities, standard of living of different people, distribution of wealth, distribution of land
among farmers and various other fields of life. Some brief definitions of dispersion are:
The degree to which numerical data tend to spread about an average value is called the
dispersion or variation of the data.
Dispersion or variation may be defined as a statistic signifying the extent of the scatteredness of
items around a measure of central tendency.
Dispersion or variation is the measurement of the scatter of the size of the items of a series about
the average.
There are five frequently used measures of variability: the Range, Interquartile range or quartile
deviation, Mean deviation or average deviation, Standard deviation and Lorenz curve.
7.1 Range
The range is the simplest measure of variability to calculate, and one you have
probably encountered many times in your life. The range is simply the highest
score minus the lowest score.
Range: R = maximum - minimum
Let's take a few examples. What is the range of the following group of numbers: 10, 2, 5, 6, 7, 3,
4? The highest number is 10 and the lowest number is 2, so 10 - 2 = 8. The range is 8.
Let's take another example. Here's a dataset with 10 numbers: 99, 45, 23, 67, 45, 91, 82, 78, 62,
51. What is the range? The highest number is 99 and the lowest number is 23, so 99 - 23 equals
76; the range is 76.
Example 2: Ms. Kesavan listed 9 integers on the blackboard. What is the range of these integers?
14, -12, 7, 0, -5, -8, 17, -11, 19
Ordering the data from least to greatest, we get:
-12, -11, -8, -5, 0, 7, 14, 17, 19
Range: R = highest - lowest = 19 - (-12) = 19 + 12 = 31
The range of these integers is 31.

Example 3: A marathon race was completed by 5 participants. What is the range of the times given
in hours below?
2.7 hr, 8.3 hr, 3.5 hr, 5.1 hr, 4.9 hr
Ordering the data from least to greatest, we get:
2.7, 3.5, 4.9, 5.1, 8.3
Range: R = highest - lowest = 8.3 hr - 2.7 hr = 5.6 hr
The range of the marathon times is 5.6 hr.
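The range examples above can be reproduced with a one-line Python function. The name `value_range` is hypothetical, and the result for the marathon times is rounded because decimal fractions are not stored exactly in floating point.

```python
def value_range(data):
    """Range: R = maximum - minimum."""
    return max(data) - min(data)


print(value_range([10, 2, 5, 6, 7, 3, 4]))
print(value_range([99, 45, 23, 67, 45, 91, 82, 78, 62, 51]))
print(value_range([14, -12, 7, 0, -5, -8, 17, -11, 19]))
print(round(value_range([2.7, 8.3, 3.5, 5.1, 4.9]), 1))   # marathon times in hours
```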
Merits and Limitations
Merits
Amongst all the methods of studying dispersion, range is the simplest to understand and the
easiest to compute.
It takes minimum time to calculate the value of the range. Hence, if one is interested in getting
a quick rather than a very accurate picture of variability, one may compute the range.
Limitations
Range is not based on each and every item of the distribution.
It is subject to fluctuation of considerable magnitude from sample to sample.
Range cannot tell us anything about the character of the distribution within the two extreme
observations.
According to King, "Range is too indefinite to be used as a practical measure of dispersion."
Uses of Range
Range is useful in studying the variations in the prices of stocks, shares and other
commodities that are sensitive to price changes from one period to another.
The meteorological department uses the range for weather forecasts, since the public is
interested to know the limits within which the temperature is likely to vary on a particular
day.
7.2 Inter Quartile Range Or Quartile Deviation
We have seen that the range is a measure of variability that concentrates on the two extreme
values. If we concentrate on two extreme values, as in the case of the range, we do not get any
idea about the scatter of the data within the range (i.e. what happens between the two extreme
values). If we discard the extreme values, the limited range thus obtained might be more
informative. For this reason the concept of the interquartile range was developed. It is the
range which includes the middle 50% of the distribution. Here 1/4 (one quarter) of the
observations at the lower end and 1/4 (one quarter) of the observations at the upper end are
excluded.
The lower quartile (Q1) is the 25th percentile and the upper quartile (Q3) is the 75th
percentile. It is interesting to note that the 50th percentile is the middle quartile (Q2), which
is in fact what you have studied under the title Median. Thus symbolically
Inter quartile range = Q3 - Q1
If we divide (Q3 - Q1) by 2 we get what is known as the semi-interquartile range, i.e.
$$Q.D. = \frac{Q_3 - Q_1}{2}$$
It is known as the Quartile Deviation (Q.D. or SIQR).
Another look at the same issue is given here to make the concept more clear for the student.
In the same way that the median divides a dataset into two halves, it can be further divided into
quarters by identifying the upper and lower quartiles. The lower quartile is found one quarter of
the way along a dataset when the values have been arranged in order of magnitude; the upper
quartile is found three quarters along the dataset. Therefore, the upper quartile lies half way
between the median and the highest value in the dataset whilst the lower quartile lies halfway
between the median and the lowest value in the dataset. The inter-quartile range is found by
subtracting the lower quartile from the upper quartile.
For example, the examination marks for 20 students following a particular module are arranged
in order of magnitude.
median lies at the mid-point between the two central values (10th and 11th)
= half-way between 60 and 62 = 61
The lower quartile lies at the mid-point between the 5th and 6th values
= half-way between 52 and 53 = 52.5
The upper quartile lies at the mid-point between the 15th and 16th values
= half-way between 70 and 71 = 70.5
The inter-quartile range for this dataset is therefore 70.5 - 52.5 = 18 whereas the range is: 80 - 43
= 37.
The inter-quartile range provides a clearer picture of the overall dataset by removing/ignoring the
outlying values.
Like the range however, the inter-quartile range is a measure of dispersion that is based upon
only two values from the dataset. Statistically, the standard deviation is a more powerful measure
of dispersion because it takes into account every value in the dataset. The standard deviation is
explored in the next section.
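Example 1
The wheat production (in Kg) of 20 acres is given as: 1120, 1240, 1320, 1040, 1080, 1200, 1440,
1360, 1680, 1730, 1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470, 1750, and 1885. Find the
quartile deviation and coefficient of quartile deviation.
After arranging the observations in ascending order, we get
1040, 1080, 1120, 1200, 1240, 1320, 1342, 1360, 1440, 1470, 1600, 1680, 1720, 1730, 1750,
1755, 1785, 1880, 1885, 1960.
$$Q_1 = \text{value of the } \left(\frac{n+1}{4}\right)^{\text{th}} \text{ item} = \text{value of the } \left(\frac{20+1}{4}\right)^{\text{th}} \text{ item} = \text{value of the } 5.25^{\text{th}} \text{ item}$$
$$Q_1 = 5^{\text{th}} \text{ item} + 0.25\,(6^{\text{th}} \text{ item} - 5^{\text{th}} \text{ item}) = 1240 + 0.25(1320 - 1240) = 1240 + 20 = 1260$$
$$Q_3 = \text{value of the } \left(\frac{3(n+1)}{4}\right)^{\text{th}} \text{ item} = \text{value of the } \left(\frac{3(20+1)}{4}\right)^{\text{th}} \text{ item} = \text{value of the } 15.75^{\text{th}} \text{ item}$$
$$Q_3 = 15^{\text{th}} \text{ item} + 0.75\,(16^{\text{th}} \text{ item} - 15^{\text{th}} \text{ item}) = 1750 + 0.75(1755 - 1750) = 1750 + 3.75 = 1753.75$$
$$\text{Quartile Deviation (Q.D.)} = \frac{Q_3 - Q_1}{2} = \frac{1753.75 - 1260}{2} = \frac{493.75}{2} = 246.88$$
$$\text{Coefficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{1753.75 - 1260}{1753.75 + 1260} = 0.164$$
Example 2
Calculate the range and Quartile deviation of wages.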
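Wages      Labourers
30-32      12
32-34      18
34-36      16
36-38      14
38-40      12
40-42      8
42-44      6
Solution
Range = L - S = 44 - 30 = 14
Calculation of Quartiles:
X          f      c.f
30-32      12     12
32-34      18     30
34-36      16     46
36-38      14     60
38-40      12     72
40-42      8      80
42-44      6      86
N = 86
$$Q_1 = \text{size of the } \left(\frac{N}{4}\right)^{\text{th}} \text{ item} = \frac{86}{4} = 21.5^{\text{th}} \text{ item}$$
i.e. $Q_1$ lies in the group 32-34
$$Q_1 = L + \frac{\frac{N}{4} - c.f.}{f} \times i = 32 + \frac{21.5 - 12}{18} \times 2 = 32 + \frac{19}{18} = 32 + 1.06 = 33.06$$
$$Q_3 = \text{size of the } 3\left(\frac{N}{4}\right)^{\text{th}} \text{ item} = 3 \times 21.5 = 64.5^{\text{th}} \text{ item}$$
$Q_3$ lies in the group 38-40
$$Q_3 = L + \frac{\frac{3N}{4} - c.f.}{f} \times i = 38 + \frac{64.5 - 60}{12} \times 2 = 38 + 0.75 = 38.75$$
$$Q.D. = \frac{Q_3 - Q_1}{2} = \frac{38.75 - 33.06}{2} = \frac{5.69}{2} = 2.85$$
$$\text{Coefficient of Q.D.} = \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{38.75 - 33.06}{38.75 + 33.06} = \frac{5.69}{71.81} = 0.08$$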
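Both quartile calculations above, the positional rule for raw data (Example 1) and the interpolation formula for grouped data (Example 2), can be sketched in Python. All function names here are hypothetical, equal class widths are assumed, and the frequencies 8 and 6 of the last two wage classes are inferred from the cumulative frequencies 80 and 86.

```python
import math


def quartile_raw(data, k):
    """k-th quartile (k = 1, 2, 3) of raw data: value of the k(n + 1)/4 th item."""
    xs = sorted(data)
    position = k * (len(xs) + 1) / 4          # 1-based and possibly fractional
    lower = min(max(math.floor(position), 1), len(xs))
    upper = min(lower + 1, len(xs))
    fraction = position - math.floor(position)
    # Interpolate between the two neighbouring ordered items.
    return xs[lower - 1] + fraction * (xs[upper - 1] - xs[lower - 1])


def quartile_grouped(lower_limits, freqs, width, k):
    """k-th quartile of grouped data: L + ((k*N/4 - c.f.) / f) * i."""
    target = k * sum(freqs) / 4
    cumulative = 0                            # c.f. of the preceding classes
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("no quartile class found")


def quartile_deviation(q1, q3):
    return (q3 - q1) / 2


def coefficient_of_qd(q1, q3):
    return (q3 - q1) / (q3 + q1)


wheat = [1120, 1240, 1320, 1040, 1080, 1200, 1440, 1360, 1680, 1730,
         1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470, 1750, 1885]
q1, q3 = quartile_raw(wheat, 1), quartile_raw(wheat, 3)
print(q1, q3, quartile_deviation(q1, q3), round(coefficient_of_qd(q1, q3), 3))

wage_lowers = [30, 32, 34, 36, 38, 40, 42]
wage_freqs = [12, 18, 16, 14, 12, 8, 6]       # last two inferred from c.f. 80 and 86
g1 = quartile_grouped(wage_lowers, wage_freqs, 2, 1)
g3 = quartile_grouped(wage_lowers, wage_freqs, 2, 3)
print(round(g1, 2), round(g3, 2), round(quartile_deviation(g1, g3), 2))
```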
Merits of Quartile Deviation
1. It is simple to understand and easy to calculate.
2. It is not influenced by extreme values.
3. It can be found out with open end distribution.
4. It is not affected by the presence of extreme values.
Demerits
1. It ignores the first 25% of the items and the last 25% of the items.
2. It is a positional average : hence not amenable to further mathematical treatment.
3. The value is affected by sampling fluctuations.
7.3 Mean Deviation or Average Deviation
The average deviation (mean deviation) is the average amount of variation (scatter) of the items
in a distribution from either the mean, the median or the mode, ignoring the signs of these
deviations. In other words, the mean deviation or average deviation is the arithmetic mean of the
absolute deviations.
Example 1: Find the Mean Deviation of 3, 6, 6, 7, 8, 11, 15, 16
Step 1: Find the mean:
$$\mu = \frac{3 + 6 + 6 + 7 + 8 + 11 + 15 + 16}{8} = \frac{72}{8} = 9$$
Step 2: Find the distance of each value from that mean:
Value              3    6    6    7    8    11   15   16
Distance from 9    6    3    3    2    1    2    6    7
Step 3. Find the mean of those distances:
$$\frac{6 + 3 + 3 + 2 + 1 + 2 + 6 + 7}{8} = \frac{30}{8} = 3.75$$
So, the mean = 9, and the mean deviation = 3.75
It tells us how far, on average, all values are from the middle.
In that example the values are, on average, 3.75 away from the middle.
The formula is:
$$\text{Mean Deviation} = \frac{\sum |x - \mu|}{N}$$
Where
$\mu$ is the mean (in our example $\mu$ = 9)
x is each value (such as 3 or 16)
N is the number of values (in our example N = 8)
Each distance we calculated is called an Absolute Deviation, because it is the
Absolute Value of the deviation (how far from the mean). To show "Absolute
Value" we put | marks either side like this: |-3| = 3. Thus the absolute value is
one where we ignore the sign. That is, whether it is - or +, we consider it as +. E.g. -3 or +3
will be taken as just 3.
Let us redo example 1 using the formula: Find the Mean Deviation of 3, 6, 6, 7,
8, 11, 15, 16
Step 1: Find the mean:
$$\mu = \frac{3 + 6 + 6 + 7 + 8 + 11 + 15 + 16}{8} = \frac{72}{8} = 9$$
Step 2: Find the Absolute Deviations:
x      x - μ    |x - μ|
3      -6       6
6      -3       3
6      -3       3
7      -2       2
8      -1       1
11     2        2
15     6        6
16     7        7
$\sum |x - \mu| = 30$
Step 3. Find the Mean Deviation:
$$\text{Mean Deviation} = \frac{\sum |x - \mu|}{N} = \frac{30}{8} = 3.75$$
Example 2
Calculate the mean deviation using mean for the following data
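Class        2-4    4-6    6-8    8-10
Frequency    3      4      2      1
Solution
Class    Mid value (X)   Frequency (f)   d = X-5   fd    |X - 5.2|   f|X - 5.2|
2-4      3               3               -2        -6    2.2         6.6
4-6      5               4               0         0     0.2         0.8
6-8      7               2               2         4     1.8         3.6
8-10     9               1               4         4     3.8         3.8
N = 10, $\sum fd = 2$, $\sum f|X - \bar{X}| = 14.8$
$$\bar{X} = A + \frac{\sum fd}{N} = 5 + \frac{2}{10} = 5.2$$
$$M.D. = \frac{\sum f|X - \bar{X}|}{N} = \frac{14.8}{10} = 1.48$$
Example 3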
Calculate mean deviation based on (a) Mean and (b) Median

Class interval    0-10   10-20   20-30   30-40   40-50   50-60   60-70
Frequency (f)     8      12      10      8       3       2       7
Solution
Let us first make the necessary computations.
Class      Mid value   Frequency   Less than   fX      |X - 29|   f|X - 29|   |X - 25|   f|X - 25|
interval   (X)         (f)         c.f.
0-10       5           8           8           40      24         192         20         160
10-20      15          12          20          180     14         168         10         120
20-30      25          10          30          250     4          40          0          0
30-40      35          8           38          280     6          48          10         80
40-50      45          3           41          135     16         48          20         60
50-60      55          2           43          110     26         52          30         60
60-70      65          7           50          455     36         252         40         280
N = 50, $\sum fX = 1450$, $\sum f|X - 29| = 800$, $\sum f|X - 25| = 760$
(a) M.D. from Mean
$$\bar{X} = \frac{\sum fX}{N} = \frac{1450}{50} = 29$$
So mean = 29. Let us now find the mean deviation about the mean.
$$M.D. = \frac{\sum f|X - \bar{X}|}{N} = \frac{800}{50} = 16$$
We see that the mean deviation based on the mean is 16.

Now let us compute M.D. about the median.
(b) M.D. from median
(N/2) = (50/2) = 25. The first c.f. that reaches 25 is 30 in the table above. So the
corresponding class 20-30 is the median class.
So L = lower limit of the median class = 20, f = frequency of the median class =
10, h = class interval of the median class = 10, c = cumulative frequency of the
class preceding the median class = 20.
Use the formula of median to substitute values.
$$\text{Median} = L + \frac{h}{f}\left(\frac{N}{2} - c\right) = 20 + \frac{10}{10}(25 - 20) = 20 + 5 = 25$$
Median = 25. Let us now find the Mean Deviation about the median.
$$M.D. = \frac{\sum f|X - \text{Median}|}{N} = \frac{760}{50} = 15.2$$
Thus we have computed the Mean Deviation from the Mean and from the Median. Let us
compare the two results. MD from Mean is 16 and MD from Median is 15.2.
So, M.D. from Median < M.D. from Mean. This implies that M.D. is least when
taken about median.
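The mean deviation calculations of Example 1 (raw data) and Example 3 (grouped data, about the mean) can be sketched in Python as follows. The function names are hypothetical, and the `center` argument is an assumption that lets the same function measure deviations about the mean, the median or any other chosen value.

```python
def mean_deviation(data, center=None):
    """Mean deviation of raw data: sum of |x - center| / N (center defaults to the mean)."""
    if center is None:
        center = sum(data) / len(data)
    return sum(abs(x - center) for x in data) / len(data)


def grouped_mean_deviation(mid_values, freqs, center=None):
    """Mean deviation of grouped data: sum of f * |X - center| / N."""
    n = sum(freqs)
    if center is None:
        center = sum(f * x for x, f in zip(mid_values, freqs)) / n   # mean = sum(fX) / N
    return sum(f * abs(x - center) for x, f in zip(mid_values, freqs)) / n


print(mean_deviation([3, 6, 6, 7, 8, 11, 15, 16]))        # Example 1

mid_values = [5, 15, 25, 35, 45, 55, 65]                   # Example 3 class mid points
freqs = [8, 12, 10, 8, 3, 2, 7]
print(grouped_mean_deviation(mid_values, freqs))           # about the mean, 29
```

Passing a median as `center` gives the mean deviation about the median in the same way.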
Merits of M.D.
i. It is simple to understand and easy to compute.
ii. It is not much affected by the fluctuations of sampling.
iii. It is based on all items of the series and gives weight according to their size.
iv. It is less affected by extreme items.
v. It is rigidly defined.
vi. It is a better measure for comparison.
Demerits of M.D.
i. It is not amenable to further algebraic treatment.
ii. Algebraic positive and negative signs are ignored. It is mathematically unsound
and illogical.
iii. It is not as popular as standard deviation.
Uses:
It helps in understanding the standard deviation. It is useful in marketing
problems. It is used in statistical analysis of economic, business and social
phenomena. It is useful in calculating the distribution of wealth in a
community or nation.
7.4 Standard Deviation
The concept of standard deviation was introduced by Karl Pearson in 1893. It is the most
important measure of dispersion and is widely used. It is a measure of the dispersion of a set of
data from its mean. The standard deviation can be thought of as a kind of average of the
deviations from the mean, and it often can help you find the story behind the data.
The standard deviation is a measure that summarises the amount by which every value within a
dataset varies from the mean. Effectively it indicates how tightly the values in the dataset are
bunched around the mean value. It is the most widely used measure of dispersion
since, unlike the range and inter-quartile range, it takes into account every value in the dataset.
When the values in a dataset are tightly bunched together the standard deviation is small.
When the values are spread apart the standard deviation will be relatively large.
In finance, standard deviation is defined as a statistical measure of dispersion in the value of
an asset around its mean. The standard deviation calculation tells you how spread out the numbers
are in your sample. Standard Deviation is represented using the symbol $\sigma$.
For example, if you want to measure the performance of a mutual fund, SD can be used. It gives an
idea of how volatile a fund's performance is likely to be. It is an important measure of a fund's
performance. It gives an idea of how much the return on the asset at a given time differs or
deviates from the average return. Generally, it gives an idea of a fund's volatility, i.e. a higher
dispersion (indicated by a higher standard deviation) shows that the value of the asset has
fluctuated over a wide range.
The formula for finding SD, in sentence form, is: it is the square root of the Variance. So now
you ask, what is the Variance?
The Variance is defined as: the average of the squared differences from the Mean.
We can calculate the variance by following these steps:
a. Work out the Mean (the simple average of the numbers)
b. Then for each number: subtract the Mean and square the result (the squared difference).
c. Then work out the average of those squared differences.
You may ask why we square the differences. If we just added up the differences from the mean,
the negatives would cancel the positives. So we take the square.
Example
You have the marks obtained by your five bench mates, which are as
follows: 600, 470, 170, 430 and 300. Find the Mean, the Variance, and the
Standard Deviation.
Your first step is to find the Mean:
$\bar{x} = \frac{600 + 470 + 170 + 430 + 300}{5} = \frac{1970}{5} = 394$
So the mean (average) mark is 394. Now take the difference of each mark from the mean and
square it:
x        x − x̄      (x − x̄)²
600      206        42436
470      76         5776
170      −224       50176
430      36         1296
300      −94        8836
Total    0          108520
$\sum (x - \bar{x})^2 = 108520$
To calculate the Variance, take each difference, square it, find the sum
(108520) and find the average:
$\sigma^2 = \frac{108520}{5} = 21704$
So, the Variance is 21,704.
The Standard Deviation is just the square root of the Variance, so:
$SD = \sigma = \sqrt{21704} = 147.32 \approx 147$
Now we can see which marks are within one Standard Deviation (147) of the
Mean, that is, between 247 and 541: they are 470, 430 and 300.
Please note that there is a slight difference between finding the variance of a
population and that of a sample. In the above example we found the variance for data
collected from all your bench mates, so it may be considered a population.
Suppose now you collect data only from some of your bench mates. Now it may
be considered a sample. If you are finding the variance for sample data, divide by
N-1 instead of N in the variance formula.
For example, if we say that in our problem the marks are of some students in a
class, they should be treated as a sample. In that case
Variance (or to be precise Sample Variance) = 108,520 / 4 = 27,130. Note that
instead of N (i.e. 5) we divided by N-1 (5-1=4).
Standard Deviation (Sample Standard Deviation) = $\sqrt{27130} = 164.71 \approx 165$
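The population and sample calculations above can be checked with a short script. This is only a minimal sketch in Python; the function name `variance` and its `sample` flag are illustrative choices, not part of the text.

```python
import math

# Marks of the five bench mates from the example.
marks = [600, 470, 170, 430, 300]

def variance(values, sample=False):
    """Average of squared differences from the mean.

    Divides by N for a population and by N - 1 for a sample.
    """
    n = len(values)
    mean = sum(values) / n
    squared_diffs = [(x - mean) ** 2 for x in values]
    divisor = n - 1 if sample else n
    return sum(squared_diffs) / divisor

pop_var = variance(marks)                 # treat the five marks as a population
samp_var = variance(marks, sample=True)   # treat the five marks as a sample
pop_sd = math.sqrt(pop_var)
samp_sd = math.sqrt(samp_var)

print("Population variance and SD:", pop_var, round(pop_sd, 2))
print("Sample variance and SD:", samp_var, round(samp_sd, 2))
```

The only difference between the two calls is the divisor, which is why the sample SD always comes out slightly larger than the population SD for the same figures.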
Based on the above information, let us build the formula for finding SD. Since
we use different formulae for population data and for sample data,
we have two different formulae for SD as well.
The "Population Standard Deviation": $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
The "Sample Standard Deviation": $s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}$
Computation of Standard Deviation: There are different methods to compute SD. They are
illustrated through examples below.
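Example 1
Calculate SD for the following observations using different methods.
160, 160, 161, 162, 163, 163, 163, 164, 164, 170
(a) Direct method No.1
Formula: $\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$
X        X − X̄      (X − X̄)²
160      −3         9
160      −3         9
161      −2         4
162      −1         1
163      0          0
163      0          0
163      0          0
164      1          1
164      1          1
170      7          49
Total    0          74
$\sum X = 1630$, $\bar{X} = \frac{1630}{10} = 163$, $\sum (X - \bar{X})^2 = 74$
Now compute SD
$\sigma = \sqrt{\frac{74}{10}} = \sqrt{7.4} = 2.72$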
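(b) Direct method No.2
Here the formula is $\sigma = \sqrt{\frac{\sum X^2 - (\sum X)^2 / N}{N}}$
X        X²
160      25600
160      25600
161      25921
162      26244
163      26569
163      26569
163      26569
164      26896
164      26896
170      28900
$\sum X = 1630$, $\sum X^2 = 265764$
$\sigma = \sqrt{\frac{265764 - 1630^2 / 10}{10}} = \sqrt{\frac{74}{10}} = \sqrt{7.4} = 2.72$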
(c) Method 3 (Short Cut Method): in this method, instead of finding the mean, we assume a
figure as the mean. Here we have arbitrarily assumed 162 as the mean.
We use the formula $\sigma = \sqrt{\frac{\sum dx^2}{N} - \left(\frac{\sum dx}{N}\right)^2}$
where dx is the deviation from the assumed mean (here 162).
X        dx       dx²
160      −2       4
160      −2       4
161      −1       1
162      0        0
163      1        1
163      1        1
163      1        1
164      2        4
164      2        4
170      8        64
1630     +10      84
$\sigma = \sqrt{\frac{84}{10} - \left(\frac{10}{10}\right)^2} = \sqrt{8.4 - 1} = \sqrt{7.4} = 2.72$
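The three methods are algebraically equivalent, which can be confirmed numerically. The following is a minimal Python sketch; the variable names (`sd_direct_1`, `sd_direct_2`, `sd_shortcut`) are illustrative and not from the text.

```python
import math

x = [160, 160, 161, 162, 163, 163, 163, 164, 164, 170]
n = len(x)

# (a) Direct method No.1: deviations taken from the actual mean.
mean = sum(x) / n
sd_direct_1 = math.sqrt(sum((v - mean) ** 2 for v in x) / n)

# (b) Direct method No.2: uses only the sum of X and the sum of X squared.
sum_x = sum(x)
sum_x2 = sum(v * v for v in x)
sd_direct_2 = math.sqrt((sum_x2 - sum_x ** 2 / n) / n)

# (c) Short-cut method: deviations taken from an assumed mean of 162.
assumed_mean = 162
dx = [v - assumed_mean for v in x]
sd_shortcut = math.sqrt(sum(d * d for d in dx) / n - (sum(dx) / n) ** 2)

print(round(sd_direct_1, 2), round(sd_direct_2, 2), round(sd_shortcut, 2))
```

Changing `assumed_mean` to any other figure leaves `sd_shortcut` unchanged, because the correction term (Σdx/N)² removes the effect of the arbitrary choice.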
Another example where we find many of the concepts together.

Example:
Given the series: 3, 5, 2, 7, 6, 4, 9.
Calculate:
(a) the mode, (b) the median and (c) the mean,
(d) the variance, (e) the standard deviation and (f) the average deviation.
(a) Mode: Does not exist because all the scores have the same frequency.
(b) Median: Arrange the values in order:
2, 3, 4, 5, 6, 7, 9.
Median = 5
(c) Mean
$\bar{x} = \frac{2+3+4+5+6+7+9}{7} = \frac{36}{7} = 5.143$
(d) Variance
$\sigma^2 = \frac{2^2+3^2+4^2+5^2+6^2+7^2+9^2}{7} - 5.143^2 = 4.978$
(e) Standard Deviation
$\sigma = \sqrt{4.978} = 2.231$
(f) Average Deviation
x        |x − 5.143|
2        3.143
3        2.143
4        1.143
5        0.143
6        0.857
7        1.857
9        3.857
$\sum |x - \bar{x}| = 13.143$
Average deviation $= \frac{13.143}{7} = 1.878$
Calculation of SD for continuous series
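The step deviation method is easy to use to find SD for continuous series.
$\sigma = \sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} \times c$
where d = (m − A)/c, m is the midpoint of the class, A is the assumed mean and c is the class
interval.
Calculate Mean and SD for the following data
Class        0-10   10-20   20-30   30-40   40-50   50-60   60-70
Frequency    5      12      30      45      50      37      21
Make the necessary computations
x         Midpoint (m)   f      d = (m − 35)/10   fd      fd²
0-10      5              5      −3                −15     45
10-20     15             12     −2                −24     48
20-30     25             30     −1                −30     30
30-40     35             45     0                 0       0
40-50     45             50     1                 50      50
50-60     55             37     2                 74      148
60-70     65             21     3                 63      189
Total                    N = 200                  118     510
$\bar{x} = A + \frac{\sum fd}{N} \times c = 35 + \frac{118}{200} \times 10 = 35 + 5.9 = 40.9$
$\sigma = \sqrt{\frac{510}{200} - \left(\frac{118}{200}\right)^2} \times 10 = \sqrt{2.55 - 0.3481} \times 10 = 1.4839 \times 10 = 14.839$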
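The step deviation computation can be reproduced in a few lines. This is a minimal Python sketch under two assumptions: the frequency of the 0-10 class is taken as 5 (the value implied by fd = −15 at d = −3 and by N = 200), and the names `assumed_mean` and `class_width` are illustrative.

```python
import math

midpoints = [5, 15, 25, 35, 45, 55, 65]
freqs = [5, 12, 30, 45, 50, 37, 21]   # first frequency (5) inferred from fd = -15
assumed_mean = 35                      # A
class_width = 10                       # c

# Step deviations d = (m - A) / c; the division is exact for these midpoints.
d = [(m - assumed_mean) // class_width for m in midpoints]

N = sum(freqs)
sum_fd = sum(f * di for f, di in zip(freqs, d))
sum_fd2 = sum(f * di * di for f, di in zip(freqs, d))

mean = assumed_mean + sum_fd / N * class_width
sd = math.sqrt(sum_fd2 / N - (sum_fd / N) ** 2) * class_width

print("N =", N, " sum fd =", sum_fd, " sum fd^2 =", sum_fd2)
print("Mean =", round(mean, 1), " SD =", round(sd, 2))
```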
Merits of Standard Deviation
1. It is rigidly defined and its value is always definite and based on all observations.
2. As it is based on the arithmetic mean, it has all the merits of the arithmetic mean.
3. It is amenable to further algebraic treatment.
4. It is less affected by sampling fluctuations.
Demerits
1. It is not easy to calculate.
2. It gives more weight to extreme values, because the values are squared.
Coefficient of Variation
Standard deviation is an absolute measure of dispersion. It is expressed in
terms of the units in which the original figures are collected and stated. The relative
measure of standard deviation is known as the coefficient of variation.
Variance: square of the standard deviation.
Symbolically,
Variance $= \sigma^2 = \frac{\sum (x - \bar{x})^2}{N}$
Coefficient of standard deviation $= \frac{\sigma}{\bar{x}}$
Coefficient of variation $= \frac{\sigma}{\bar{x}} \times 100$
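Because the coefficient of variation is a pure number, it can compare series measured on very different scales. The sketch below, in Python, applies the standard definition (σ / mean × 100) to two data sets used earlier in this module; the function name `coefficient_of_variation` is an illustrative choice.

```python
import math

def coefficient_of_variation(values):
    """Population SD expressed as a percentage of the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sd / mean * 100

marks = [600, 470, 170, 430, 300]
observations = [160, 160, 161, 162, 163, 163, 163, 164, 164, 170]

cv_marks = coefficient_of_variation(marks)
cv_observations = coefficient_of_variation(observations)

print(round(cv_marks, 2), round(cv_observations, 2))
```

The series with the larger coefficient of variation is the more variable one relative to its own average, whatever the units.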
8. MEASURES OF VARIABILITY IN SHAPE

- Graphic Method of Dispersion
Dispersion or variance can be represented using graphs also. We discuss here some of
the graphical methods which rely on the shape of the curve to represent the deviations.
We will see the Lorenz Curve, Gini's Coefficient, Skewness and Kurtosis.
8.1 - LORENZ CURVE

Lorenz Curve is a graphical representation of wealth distribution developed in 1905 by the
American economist and statistician Dr. Max O. Lorenz. He
studied the distribution of wealth and income with its help. On the graph, a straight
diagonal line represents perfect equality of wealth distribution; the Lorenz curve lies
beneath it, showing the reality of wealth distribution. The difference between the
straight line and the curved line is the amount of inequality of wealth distribution, a
figure described by the Gini coefficient. One practical use of The Lorenz curve is that it
can be used to show what percentage of a nation's residents possess what percentage of
that nation's wealth. For example, it might show that the country's poorest 10% possess
2% of the country's wealth.
It is graphic method to study dispersion. It helps in studying the variability in different
components of distribution especially economic. The base of Lorenz Curve is that we
take cumulative percentages along X and Y axis. Joining these points we get the Lorenz
Curve. Lorenz Curve is of much importance in the comparison of two series
graphically. It gives us a clear cut visual view of the series to be compared.
Steps to plot 'Lorenz Curve'
1. Cumulate both values and their corresponding frequencies.
2. Find the percentage of each of the cumulated figures, taking the grand total of each
corresponding column as 100.
3. Represent the percentages of the cumulated frequencies on the X axis and those of the values
on the Y axis.
4. Draw a diagonal line designated as the line of equal distribution.
5. Plot the percentages of cumulated values against the percentages of the cumulated
frequencies of a given distribution and join the points so plotted through a free hand
curve.
The greater the distance between the curve and the line of equal distribution, the
greater the dispersion. If the Lorenz curve is nearer to the line of equal distribution, the
dispersion or variation is smaller.
Based on data on the annual income of 8 individuals we have drawn a Lorenz curve
below using MS Excel.
Individual   Income    Cumulative % population   % income    Cumulative income %
1            5000      12.5                      1.204819    1.204819
2            12000     25                        2.891566    4.096385
3            18000     37.5                      4.337349    8.433735
4            30000     50                        7.228916    15.66265
5            40000     62.5                      9.638554    25.3012
6            60000     75                        14.45783    39.75904
7            100000    87.5                      24.09639    63.85542
8            150000    100                       36.14458    100
Total        415000
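The percentages for the eight incomes can be recomputed with a short script instead of a spreadsheet. This is a minimal Python sketch; the function name `lorenz_points` is illustrative, and the points it returns are the ones to be plotted (cumulative % of population on the X axis, cumulative % of income on the Y axis).

```python
incomes = [5000, 12000, 18000, 30000, 40000, 60000, 100000, 150000]

def lorenz_points(values):
    """Return (cumulative % of population, cumulative % of income) pairs."""
    ordered = sorted(values)          # start with the poorest individual
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]             # the curve starts at the origin
    running = 0
    for i, value in enumerate(ordered, start=1):
        running += value
        points.append((100 * i / n, 100 * running / total))
    return points

for pop_pct, income_pct in lorenz_points(incomes):
    print(f"{pop_pct:6.1f}  {income_pct:9.4f}")
```

Joining these points gives the Lorenz curve; the straight line from (0, 0) to (100, 100) is the line of equal distribution.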
Example
From the following table giving data regarding the income of workers in a factory, draw a
Lorenz Curve to study inequality of income.
The following method is used for constructing the Lorenz Curve.
1. The sizes of the items and their frequencies are to be cumulated.
2. Percentages must be calculated for each cumulated value of the size and
frequency of items.
3. Plot the percentages of the cumulated values of the variable against the
percentages of the corresponding cumulated frequencies. Join these points with a
smooth free hand curve. This curve is called the Lorenz curve.
4. The zero per cent point of both axes must be joined by a straight line with the point
where both percentages are 100. This line is called the line of equal distribution.
Income      Mid value   Cumulative   % of cumulative   No. of        Cumulative no.   % of cumulative
                        income       income            workers (f)   of workers       no. of workers
0-500       250         250          2.94              6000          6000             37.50
500-1000    750         1000         11.76             4250          10250            64.06
1000-2000   1500        2500         29.41             3600          13850            86.56
2000-3000   2500        5000         58.82             1500          15350            95.94
3000-4000   3500        8500         100.00            650           16000            100.00
Total       8500                                       16000
Uses of Lorenz Curve

1. To study the variability in a distribution.
2. To compare the variability relating to a phenomenon for two regions.
3. To study the changes in variability over a period.
8.2 - Gini index / Gini coefficient

A Lorenz curve plots the cumulative percentages of total income received against the cumulative
number of recipients, starting with the poorest individual or household. The Gini index measures
the area between the Lorenz curve and a hypothetical line of absolute equality, expressed as a
percentage of the maximum area under the line. This is the most commonly used measure of
inequality. The coefficient varies between 0, which reflects complete equality, and 1 (or 100
when expressed as a percentage), which indicates complete inequality (one person has all the
income or consumption, all others have none). The Gini coefficient is found by measuring the
areas A and B as marked in the following diagram and using the formula A/(A+B), where A is the
area between the line of equality and the Lorenz curve and B is the area under the Lorenz
curve. If the Gini coefficient is to be presented as a percentage, use A/(A+B) × 100.
The Gini coefficient (also known as the Gini index or Gini ratio) is a measure of statistical
dispersion intended to represent the income distribution of a nation's residents. It was
developed by the Italian statistician and sociologist Corrado Gini in 1912.
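The ratio A/(A+B) can be approximated numerically from the Lorenz curve. The Python sketch below assumes that B is the area under the Lorenz curve and A the area between the curve and the line of equality, and it estimates B with the trapezium rule; the function name `gini` and the use of the eight incomes from the Lorenz curve illustration are choices made for this sketch.

```python
def gini(values):
    """Approximate Gini coefficient A / (A + B) using the trapezium rule."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    area_b = 0.0          # area under the Lorenz curve
    previous_share = 0.0
    running = 0
    for value in ordered:
        running += value
        share = running / total                      # cumulative income share
        area_b += (previous_share + share) / 2 / n   # one trapezium of width 1/n
        previous_share = share
    area_a = 0.5 - area_b   # the triangle under the line of equality has area 0.5
    return area_a / (area_a + area_b)

incomes = [5000, 12000, 18000, 30000, 40000, 60000, 100000, 150000]
print(round(gini(incomes), 4))
print(gini([10, 10, 10, 10]))   # equal incomes: no inequality
```

With only a handful of individuals the trapezium rule gives an approximation; with grouped or large data sets the same idea applies to the plotted points of the curve.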
8.3 - Skewness
We have discussed earlier techniques to measure the deviations of a
distribution from its measure of central tendency (mean, median or mode).
Here we see another characteristic of a distribution, named skewness. Skewness characterizes
the degree of asymmetry of a distribution around its mean. If there is only one mode (peak)
in our data (unimodal), and if the other data are distributed evenly to the left and right of
this value, then plotting the data gives a symmetric, bell-shaped curve such as the normal curve
(see figure below). Here we say that there is no skewness, or skewness = 0. If there is zero
skewness (i.e., the distribution is symmetric), then mean = median for this distribution.
However, data need not always be like this. When the bulk of the data is at the left and the
right tail is longer, we say that the distribution is skewed right or positively skewed. Positive
skewness indicates a distribution with an asymmetric tail extending towards more positive
values. On the other hand, when the bulk of the data is at the right and the left tail is
longer, we say that the distribution is skewed left or negatively skewed. Negative skewness
indicates a distribution with an asymmetric tail extending towards more negative values.
[Figure: three curves labelled Skewed Left, Symmetric and Skewed Right]
Tests of Skewness
There are certain tests to know whether skewness does or does not exist in a frequency
distribution.
They are :
1. In a skewed distribution, the values of mean, median and mode do not coincide. The
mean and mode are pulled apart and the median lies between them.
In a moderately skewed distribution, Mean − Mode = 3 (Mean − Median), or equivalently
Median − Mode = 2/3 (Mean − Mode).
2. Quartiles will not be equidistant from the median.
3. When the asymmetrical distribution is drawn on graph paper, it will not give a
bell-shaped curve.
4. Sum of the positive deviations from the median is not equal to sum of negative
deviations.
5. Frequencies are not equal at points of equal deviations from the mode.
Nature of Skewness
Skewness can be positive or negative or zero.
1. When the values of mean, median and mode are equal, there is no skewness.
2. When mean > median > mode, skewness will be positive.
3. When mean < median < mode, skewness will be negative.
Characteristic of a good measure of skewness
1. It should be a pure number in the sense that its value should be independent of the
unit of the series and also degree of variation in the series.
2. It should have zero-value, when the distribution is symmetrical.
3. It should have a meaningful scale of measurement so that we could easily interpret
the measured value.
Measures of Skewness
Skewness can be studied graphically and mathematically. When we study
skewness graphically, we can find out whether skewness is positive, negative or zero.
This is what we have shown above.
Mathematically Skewness can be studied as :
(a) Absolute Skewness
(b) Relative or coefficient of skewness
When skewness is presented in absolute terms, i.e. in the units of the data, it is absolute
skewness. When it is expressed as a ratio or percentage, it is called
relative skewness or the coefficient of skewness. Absolute measures allow us to
compare one distribution with another only if the units of measurement are the same;
relative measures make comparison easy whatever the units.
(a) Absolute measure of Skewness:
Skewness can be measured in absolute terms by taking the difference between the
mean and the mode.
Absolute Skewness = Mean − Mode
If the value of the mean is greater than the mode, the skewness is positive.
If the value of the mode is greater than the mean, the skewness is negative.
The greater the amount of skewness (negative or positive), the greater the tendency
towards asymmetry. The absolute measure of skewness is not a proper measure for
comparison, and hence, for each series a relative measure or coefficient of skewness
has to be computed.
(b) Relative measure of skewness
There are three important measures of relative skewness.
1. Karl Pearson's coefficient of skewness.
2. Bowley's coefficient of skewness.
3. Kelly's coefficient of skewness.
(b 1) Karl Pearson's coefficient of Skewness
The mean, median and mode are not equal in a skewed distribution. Karl Pearson's
measure of skewness is based upon the divergence of the mean from the mode in a skewed
distribution.
Karl Pearson's measure of skewness is usually denoted by Skp:
$Sk_p = \frac{\text{Mean} - \text{Mode}}{\sigma}$
Properties of Karl Pearson coefficient of Skewness
(1) −1 ≤ Skp ≤ 1 (in most practical cases).
(2) Skp = 0 ⇒ distribution is symmetrical about mean.
(3) Skp > 0 ⇒ distribution is skewed to the right.
(4) Skp < 0 ⇒ distribution is skewed to the left.
Advantage of Karl Pearson coefficient of Skewness

Skp is independent of the scale. Because (mean-mode) and standard deviation have
same scale and it will be canceled out when taking the ratio.
Disadvantage of Karl Pearson coefficient of Skewness
Skp depends on the extreme values.
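As an illustration of the formula (Mean − Mode) / σ, the Python sketch below computes Karl Pearson's coefficient for a small data set. The data are hypothetical (chosen so that the mode is unique) and the function name `pearson_skewness` is illustrative; the population standard deviation is used, in line with the formulae of this module.

```python
import statistics

def pearson_skewness(values):
    """Karl Pearson's coefficient: (mean - mode) / standard deviation."""
    mean = statistics.mean(values)
    mode = statistics.mode(values)     # assumes the data have a single mode
    sd = statistics.pstdev(values)     # population standard deviation
    return (mean - mode) / sd

# Hypothetical data with a long right tail: mean = 3, mode = 2.
data = [1, 2, 2, 2, 3, 4, 7]
sk_p = pearson_skewness(data)
print(round(sk_p, 3))   # positive, so the distribution is skewed to the right
```

When a series has more than one mode, the mode is ill defined and this version of the coefficient should not be used as it stands.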
Example: 1
Calculate the coefficient of skewness of the following data by using Karl Pearson's
method for the data 2 3 3 4 4 6 6
Step 1. Find the mean:
Step 2. Find the standard deviation:
Then
Step 3. Find the coefficient of skewness:

Here skewness is negative.
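(b 2) Bowley's coefficient of skewness
Bowley's formula for measuring skewness is based on quartiles. For a symmetrical
distribution, Q1 and Q3 are equidistant from the median (Q2).
Thus (Q3 − Q2) − (Q2 − Q1) can be taken as an absolute measure of skewness. Dividing it by the
sum of the same two distances gives the relative measure:
$Sk_q = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{(Q_3 - Q_2) + (Q_2 - Q_1)} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$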
Note:
In the above equation, Q1, Q2 and Q3 denote the quartiles. To find them, divide the data into two
groups (high and low) of equal size at the median if there is an even number of data
points, or into two groups consisting of the points on either side of the median plus the
median itself if there is an odd number of data points. Find the medians of the low
and high groups, and denote these first and third quartiles by Q1 and Q3. The interquartile range
is then defined by IQR = Q3 - Q1.
Properties of Bowley's coefficient of skewness
1. −1 ≤ Skq ≤ 1.
2. Skq = 0 ⇒ distribution is symmetrical about mean.
3. Skq > 0 ⇒ distribution is skewed to the right.
4. Skq < 0 ⇒ distribution is skewed to the left.
Advantage of Bowley's coefficient of skewness
Skq does not depend on extreme values.
Disadvantage of Bowley's coefficient of skewness
Skq does not utilize the data fully.
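Example
The following table shows the distribution of 128 families according to the number of
children.
No. of children    0    1    2    3    4    5    6   7   8 or more
No. of families    20   15   25   30   18   10   6   3   1
Compute Bowley's coefficient of skewness.
We use the formula for Bowley's coefficient of skewness
$Sk_q = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$
Let us find the necessary values
No. of children   No. of families   Cumulative frequency
0                 20                20
1                 15                35
2                 25                60
3                 30                90
4                 18                108
5                 10                118
6                 6                 124
7                 3                 127
8 or more         1                 128
$Q_1$ = value of the $\frac{N+1}{4}$th = (32.25)th observation = 1
$Q_2$ = value of the $\frac{2(N+1)}{4}$th = (64.5)th observation = 3
$Q_3$ = value of the $\frac{3(N+1)}{4}$th = (96.75)th observation = 4
$Sk_q = \frac{4 + 1 - 2(3)}{4 - 1} = -\frac{1}{3} = -0.333$
Since Skq < 0, the distribution is skewed left.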
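The quartile positions and Bowley's coefficient for this example can be traced step by step in code. This is a minimal Python sketch; the helper name `value_at_position` is illustrative, the child counts 1 to 7 are taken as the row labels of the table, and the open class "8 or more" is coded as 8, which does not affect the quartiles here.

```python
children = [0, 1, 2, 3, 4, 5, 6, 7, 8]          # last class stands for "8 or more"
families = [20, 15, 25, 30, 18, 10, 6, 3, 1]

def value_at_position(values, freqs, position):
    """Return the value whose cumulative frequency first reaches the position."""
    cumulative = 0
    for value, freq in zip(values, freqs):
        cumulative += freq
        if cumulative >= position:
            return value
    raise ValueError("position exceeds the total frequency")

n = sum(families)                                             # 128 families
q1 = value_at_position(children, families, (n + 1) / 4)       # 32.25th observation
q2 = value_at_position(children, families, 2 * (n + 1) / 4)   # 64.5th observation
q3 = value_at_position(children, families, 3 * (n + 1) / 4)   # 96.75th observation

sk_q = (q3 + q1 - 2 * q2) / (q3 - q1)
print(q1, q2, q3, round(sk_q, 3))
```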

(b 3) Kelly's coefficient of skewness
Bowley's measure of skewness is based on the middle 50% of the observations because
it leaves 25% of the observations on each extreme of the distribution. As an
improvement over Bowley's measure, Kelly has suggested a measure based on P10 and
P90 so that only 10% of the observations on each extreme are ignored.
$Sk_k = \frac{(P_{90} - P_{50}) - (P_{50} - P_{10})}{(P_{90} - P_{50}) + (P_{50} - P_{10})}$
8.4 - KURTOSIS
As we saw above, skewness is a measure of symmetry, or more precisely, the lack of
symmetry. A distribution, or data set, is symmetric if it looks the same to the left and
right of the center point.
Kurtosis is a measure of whether the data are peaked or flat relative to a normal
distribution. That is, data sets with high kurtosis tend to have a distinct peak near the
mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to
have a flat top near the mean rather than a sharp peak. A uniform distribution would be
the extreme case. The word kurtosis comes from the Greek kurtos, meaning curved or bulging.
Distributions of data and probability distributions are not all the same shape. Some are
asymmetric and skewed to the left or to the right. Other distributions are bimodal and
have two peaks. In other words there are two values that dominate the distribution of
values. Another feature to consider when talking about a distribution is not just the
number of peaks but the shape of them. Kurtosis is the measure of the peak of a
distribution, and indicates how high the distribution is around the mean. The kurtosis
of a distribution falls into one of three categories:
Mesokurtic
Leptokurtic
Platykurtic
We will consider each of these classifications in turn.

Mesokurtic
Kurtosis is typically measured with respect to the normal distribution. A distribution

that is peaked in the same way as any normal distribution, not just the standard normal
distribution, is said to be mesokurtic. The peak of a mesokurtic distribution is neither
high nor low, rather it is considered to be a baseline for the two other classifications.
Besides normal distributions, binomial distributions for which p is close to 1/2 are
considered to be mesokurtic.
Leptokurtic
A leptokurtic distribution is one that has kurtosis greater than a mesokurtic
distribution. Leptokurtic distributions are identified by peaks that are thin and tall. The
tails of these distributions, to both the right and the left, are thick and heavy.
Leptokurtic distributions are named by the prefix "lepto" meaning "skinny."
There are many examples of leptokurtic distributions. One of the most well-known
leptokurtic distributions is Student's t distribution.
Platykurtic
The third classification for kurtosis is platykurtic. Platykurtic distributions are those
that have a peak lower than a mesokurtic distribution. Platykurtic distributions are
characterized by a certain flatness to the peak, and have slender tails. The name of these
types of distributions come from the meaning of the prefix "platy" meaning "broad."
All uniform distributions are platykurtic. In addition to this the discrete probability
distribution from a single flip of a coin is platykurtic.
Measures of Kurtosis
The moment ratio and the percentile coefficient of kurtosis are used to measure kurtosis.
Moment Coefficient of Kurtosis: $\beta_2 = \frac{M_4}{M_2^2}$
where M4 = 4th moment and M2 = 2nd moment about the mean.
If $\beta_2$ = 3, the distribution is said to be normal (i.e. mesokurtic).
If $\beta_2$ > 3, the distribution is more peaked and the curve is leptokurtic.
If $\beta_2$ < 3, the distribution is said to be flat topped and the curve is platykurtic.
Percentile Coefficient of Kurtosis: $\kappa = \frac{Q.D.}{P_{90} - P_{10}}$
where $Q.D. = \frac{1}{2}(Q_3 - Q_1)$ is the semi-interquartile range. For the normal
distribution this has the value 0.263.
A normal random variable has a kurtosis of 3 irrespective of its mean or standard
deviation. If a random variable's kurtosis is greater than 3, it is said to be leptokurtic. If
its kurtosis is less than 3, it is said to be platykurtic.
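The moment coefficient of kurtosis (fourth moment about the mean divided by the square of the second) is easy to compute directly. The Python sketch below is illustrative only: the function names `moment_kurtosis` and `classify` are assumptions of this sketch, and the two data sets are the single coin flip mentioned under platykurtic distributions (coded 0 and 1) and the ten observations used in the SD example.

```python
def moment_kurtosis(values):
    """Moment coefficient of kurtosis: M4 / M2 squared (moments about the mean)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2

def classify(beta2, tolerance=1e-9):
    """Label a distribution by comparing its kurtosis with 3."""
    if abs(beta2 - 3) <= tolerance:
        return "mesokurtic"
    return "leptokurtic" if beta2 > 3 else "platykurtic"

coin = [0, 1]                          # one flip of a fair coin: tail or head
observations = [160, 160, 161, 162, 163, 163, 163, 164, 164, 170]

for name, data in [("coin", coin), ("observations", observations)]:
    b2 = moment_kurtosis(data)
    print(name, round(b2, 3), classify(b2))
```

The single extreme value 170 in the second data set is what pushes its kurtosis above 3, which matches the remark that high kurtosis comes from a few extreme differences from the mean.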
Thus we conclude our discussion by saying that kurtosis is any measure of the
peakedness of a distribution. The height and sharpness of the peak relative to the rest
of the data are measured by a number called kurtosis. Higher values indicate a higher,
sharper peak; lower values indicate a lower, less distinct peak. This occurs because,
higher kurtosis means more of the variability is due to a few extreme differences from
the mean, rather than a lot of modest differences from the mean. A normal distribution
has kurtosis exactly 3. Any distribution with kurtosis =3 is called mesokurtic. A
distribution with kurtosis <3 is called platykurtic. Compared to a normal distribution,
its central peak is lower and broader, and its tails are shorter and thinner. A distribution
with kurtosis >3 is called leptokurtic. Compared to a normal distribution, its central
peak is higher and sharper, and its tails are longer and fatter.
Comparison among dispersion, skewness and kurtosis
Dispersion, Skewness and Kurtosis are different characteristics of frequency
distribution. Dispersion studies the scatter of the items round a central value or among
themselves. It does not show the extent to which deviations cluster below an average or
above it. Skewness tells us about the cluster of the deviations above and below a
measure of central tendency. Kurtosis studies the concentration of the items at the
central part of a series. If items concentrate too much at the centre, the curve becomes
leptokurtic and if the concentration at the centre is comparatively less, the curve
becomes platykurtic.
POPULATION AND SAMPLE

The study of statistics revolves around the study of data sets. Here we describe two
important types of data sets: populations and samples.
Population
In statistics the term population has a slightly different meaning from the one given to
it in ordinary speech. When we think of the term population, we usually think of people
in our town, region, state or country (the population of India, for example) and their
respective characteristics such as gender, age, marital status, religion, caste and so on.
In statistics, the population need not refer only to people or to animate creatures: it
includes all members of a defined group that we are studying or collecting information on
for data-driven decisions.
A population is a group of phenomena that have something in common.
A population is any entire collection of people, animals, plants or things from which we
may collect data. It is the entire group we are interested in, which we wish to describe
or draw conclusions about.
A population is an entire set of individuals or objects, which may be finite or infinite.
Examples of finite populations include the employees of a given company, the number
of airplanes owned by an airline, or the potential consumers in a target market.
Examples of infinite populations include the number of watches manufactured by a
company that plans to be in business forever, or the grains of sand on the beaches of the
world or stars in the sky.
For a deeper understanding of a population, consider a market researcher for a fast food
chain who might want to determine the flavour preferences of Indian customers
between the ages of 15 and 25. The population in this example is finite and includes
every Indian in this age group of 15-25.
Note that population does not refer to people only. Statisticians also speak of a
population of objects, or events, or procedures, or observations, including such things
as the quantity of haemoglobin in blood, number of visits to the doctor by a patient, or
number of surgical operations by a doctor. A population is thus an aggregate of creatures,
things, cases and so on.
Sample
A population commonly contains too many individuals to study conveniently, so gathering data
from every individual in the population would be nearly impossible or prohibitively expensive.
An investigation is therefore often restricted to a part drawn from the population, which is
called a sample. A sample is a portion of the population, a slice of it, that is expected to
carry the population's characteristics.
A sample is a group of units selected from a larger group (the population). By studying the
sample it is hoped to draw valid conclusions about the larger group.
A sample is a smaller group of members of a population selected to represent the population.
A sample is a subset of population.
A sample is a scientifically drawn group that actually possesses the same characteristics as the
population if it is drawn randomly. Thus a well-chosen sample will contain most of the
information about a particular population parameter but the relation between the sample and the
population must be such as to allow true inferences to be made about a population from that
sample.
A familiar example of sampling is what a cook does in the kitchen to see whether rice has cooked
enough: tasting just one grain.
If the sample is to be used to make inferences about the population the sample data must be
unbiased. In order for a sample to be unbiased, it must be
representative of the population

randomly selected
sufficiently large
Representative of the population: A representative sample contains members from the
population of interest. In the case of the flavour preferences study we discussed above, the
sample would need to include Indians between the ages of 15 and 25. If people outside of the
target age range are included, the sample would not be representative.
Randomly selected: A random sample is one in which every member of a population has an
equal chance of being selected. In a random sample, each member of the population has an
equally likely chance of being selected for the sample. Suppose that the sample data for the
flavour preferences study discussed earlier came exclusively from students at one university in
India. This sample is not random due to the limited opportunity for the rest of the population
to be involved in the study. Data from this sample would not be representative of the entire
Indian population between ages 15 and 25, because the students attending this university may
have a different preference than other groups of young people. Drawing conclusions about the
overall population from this sample could lead to mistakes. The most commonly used sample is a
simple random sample. It requires that every possible sample of the selected size has an equal
chance of being used.
Sufficiently large: A sample must also be large enough in order for its data to reflect the
population. A sample that is too small may bias population estimates. When larger samples are
used, data collected from idiosyncratic individuals have less influence than when smaller
samples are used.
Imagine what would happen if the flavour preferences study collected data from a sample of
three students and, based on the results from this sample, concluded that Indians between the
ages of 15 and 25 favour a particular flavour, say masala flavour. A sample of three people is
too small to serve as the basis for drawing conclusions about the population in general.
How many people must be included in a sample in order for it to represent the population? The
optimal sample size depends on, among other things, the desired confidence level and the
precision of the confidence interval. A sample size of 30 or more is often desired so that
the distribution of the sample mean is approximately normal. In general, more is better.
Population vs Sample
The main difference between a population and sample has to do with how observations are
assigned to the data set.
A population includes each element from the set of observations that can be made.
A sample consists only of observations drawn from the population.
Depending on the sampling method, a sample can have fewer observations than the population,
the same number of observations, or (when sampling with replacement) more observations. More
than one sample can be derived from the same population.
Other differences are related to terms used. For example,
A measurable characteristic of a population, such as a mean or standard deviation, is
called a parameter; but a measurable characteristic of a sample is called a statistic.
The mean of a population is denoted by the symbol μ; but the mean of a sample is
denoted by the symbol x̄.
What is the difference between information based on a sample and information based on a
population: Information based on a sample is, by definition, incomplete; as such, a sample
demands that inferences be drawn regarding the population from which it came. Information
based on a population, however, is considered complete, and therefore requires no inferential
leap to be made.
What Characteristics are necessary before a sample can be considered random: The members of
the sample must be chosen based on chance from the population. Each member of the population
must have an equal likelihood of being chosen.
What is the consequence of failing to have a random sample from a population?: A sample is a
subset of a population. If a sample is randomly selected and sufficiently large, the information
obtained from the sample will be representative of the population. A small sample, or one that is
not drawn in a random fashion, may be biased. Making inferences from a biased sample to a
population is ill-advised and may lead to costly business mistakes.
Different methods of sampling
There are numerous sample selection methods for drawing the sample from the population,
broadly classified into random or probability-based sampling schemes or survey design methods,
and non-random or non-probability based sampling.
Probability Sampling
Probability samples are selected in such a way as to be representative of the population. They
provide the most valid or credible results because they reflect the characteristics of the
population from which they are selected.
The following sampling methods are types of probability sampling:
1. Simple Random Sampling (SRS)
2. Stratified Sampling
3. Cluster Sampling
4. Multistage Sampling
5. Random-Digit Dialing
6. Systematic Sampling
1. Simple Random Sampling
The most widely known type of a random sample is the simple random sample (SRS). This is
characterized by the fact that the probability of selection is the same for every case in the
population. All have an equal chance of being selected. Simple random sampling is a method of
selecting n units from a population of size N such that every unit of the population has equal
chance of being selected.
There are two methods by which we can select a random sample
(a) Lottery Method
An example may make this easier to understand. Imagine you want to carry out a survey of 100
voters in a small town with a population of 1,000 eligible voters. One method of SRS is to
write the name of each voter on a separate slip of paper, put all the slips into a box and draw
100 slips at random. The draw is done in this manner: shake the box, draw a slip and set
it aside, shake again, draw another, set it aside, and so on until we have 100 slips. These 100
form our sample. And this sample would be drawn through a simple random sampling procedure
- at each draw, every name in the box had the same probability of being chosen. This is called
the lottery method of random sampling.
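The lottery method can be imitated on a computer by drawing without replacement. The Python sketch below is only an illustration: the voter names are hypothetical placeholders, the function name `lottery_sample` is an assumption of this sketch, and the optional `seed` is there only to make a draw repeatable.

```python
import random

def lottery_sample(population, sample_size, seed=None):
    """Draw sample_size slips without replacement, as in the lottery method."""
    rng = random.Random(seed)
    return rng.sample(population, sample_size)

# Hypothetical list of 1,000 eligible voters, one "slip" per voter.
voters = [f"Voter {i}" for i in range(1, 1001)]

sample = lottery_sample(voters, 100, seed=42)
print(len(sample), "voters drawn; first five:", sample[:5])
```

Because each slip is set aside after it is drawn, no voter can appear twice in the sample, exactly as in the physical draw.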
(b) Table of random numbers:
The lottery method is a clumsy physical process for choosing random samples. Often it is more
convenient to use a ready-made table of random numbers. A random number table is a table of
digits. The digit in each position in the table was originally chosen from the digits
1,2,3,4,5,6,7,8,9,0 by a random process in which each digit is equally likely to be chosen.
Thus a random number table is a series of digits (0 to 9) arranged randomly through the rows and
columns. Table 1 gives part of a table of random numbers. The digits are often grouped in fives
as shown here.
Table 1: Table of Random Numbers
The researcher can use the table of random numbers to draw a simple random sample from a
population. Here is one procedure for using Table 1 to select a simple random sample:
Step 1: Each element in the population from which the sample is to be drawn must be assigned a
unique number. This is usually done by numbering the elements in the population consecutively.
If there were 280 elements in the population, for example, they would be numbered 001, 002,
003, . . ., 280.
Step 2: Determine a starting point in the table by closing your eyes and placing the point of your
pencil anywhere in the table.
Step 3: Using the starting point you have selected, begin reading the numbers in the table either
across the rows or down the columns. If your population consists of 99 or fewer elements, read
the numbers in two-digit units; for 999 or fewer elements in the population, read the numbers in
three-digit units, and so forth. If a table number is larger than the number of elements in the
population (e.g., if the table number is 323 and your population is 280), skip that number and
read the next. If you come to a number equivalent to one you have already drawn, you can either
skip the number and read the next one (sampling without replacement) or count the data for that
unit of analysis twice (sampling with replacement). Continue until you have selected as many
valid numbers as there are elements in your desired sample.
The population elements that comprise the simple random sample are those whose
numbers correspond to the numbers read from the table.
For example, suppose you have to select a sample of 5 students from a population of 75 students.
First number all the students from 1 to 75. Then, as in Step 2 above, place your pencil anywhere
on the table. Suppose it lands on 62570 in the 2nd column and 4th row. Since Step 3 says that for
a population of 99 or fewer elements the numbers are read in two-digit units, we read only the
first two digits, which gives 62. So the 62nd student is the 1st member of our sample. (If you
get a number which is bigger than your population size, skip it and take the next number from the
table.) To get the next member, move in the table in any direction from the number you have
chosen. Suppose we decide to keep moving down the column. The next entry is 26440. We take the
first two digits, so the number is 26, and the 26th student is the 2nd member of the sample.
Going down the column, we get 47174, which gives 47, so the 47th student is the 3rd member. Next
comes 34378, which gives 34, so the 34th student is the 4th member. The next entry is 22466, so
the 22nd student is the 5th member of the sample.
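The reading rule in Step 3 can be sketched in a few lines of Python. This is only an illustration,
not part of the original material: the function name read_random_table is hypothetical, and the
only table entries used are the five quoted in the example above (62570, 26440, 47174, 34378,
22466). Repeated numbers are skipped, which corresponds to sampling without replacement.

```python
def read_random_table(entries, population_size, sample_size):
    """Pick unit numbers by reading the leading digits of random-table entries."""
    width = len(str(population_size))      # population of 75 -> read two-digit units
    chosen = []
    for entry in entries:
        number = int(str(entry)[:width])   # leading digits of the table entry
        if number < 1 or number > population_size:
            continue                       # no such unit: skip and read the next
        if number in chosen:
            continue                       # already drawn: skip (no replacement)
        chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen


# Entries read down the column in the example above, kept as strings so that
# any leading zeros in a table entry would be preserved.
column = ["62570", "26440", "47174", "34378", "22466"]
print(read_random_table(column, population_size=75, sample_size=5))
# [62, 26, 47, 34, 22]
```

The result matches the five students selected by hand in the example.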
Stratified Random Sampling
In this form of sampling, the population is first divided into two or more mutually exclusive
segments based on some categories of variables of interest in the research. The aim is to
organize the population into homogeneous subsets before sampling and then to draw a random
sample within each subset. With stratified random sampling the population of N units is divided
into subpopulations of $N_1, N_2, \ldots, N_L$ units respectively. These subpopulations, called
strata, are non-overlapping and together they comprise the whole of the population, so that
$N_1 + N_2 + \cdots + N_L = N$. When these have been determined, a sample is drawn from each,
with a separate draw for each of the different strata. The sample sizes within the strata are
denoted by $n_1, n_2, \ldots, n_L$ respectively. If an SRS is taken within each stratum, then the
whole sampling procedure is described as stratified random sampling.
The primary benefit of this method is to ensure that cases from smaller strata of the population
are included in sufficient numbers to allow comparison.
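A minimal sketch of the procedure, assuming the strata and the per-stratum sample sizes are
already known. The stratum names ("urban", "rural"), the unit numbers and the sample sizes below
are hypothetical illustration data, and stratified_sample is an illustrative helper, not a
library function.

```python
import random


def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample (without replacement) inside each stratum."""
    rng = random.Random(seed)              # seed only makes the draw repeatable
    return {name: rng.sample(units, sizes[name]) for name, units in strata.items()}


# Hypothetical population of N = 100 units split into two non-overlapping strata.
strata = {
    "urban": list(range(1, 61)),    # N_1 = 60 units, numbered 1..60
    "rural": list(range(61, 101)),  # N_2 = 40 units, numbered 61..100
}
sizes = {"urban": 6, "rural": 4}    # n_1 = 6, n_2 = 4

sample = stratified_sample(strata, sizes, seed=1)
for name, units in sample.items():
    print(name, sorted(units))
```

Each stratum contributes exactly its own sample size, so a small stratum cannot be left out by
chance as it could be in a single SRS of the whole population.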
Systematic Sampling
This method of sampling is at first glance very different from SRS. In practice, it is a variant of
simple random sampling that involves some listing of elements: every kth element of the list is
then drawn for inclusion in the sample. Say you have a list of 10,000 people and you want a
sample of 1,000.
Creating such a sample includes three steps:
1. Divide number of cases in the population by the desired sample size. In this example,
dividing 10,000 by 1,000 gives a value of 10.
2. Select a random number between one and the value attained in Step 1. In this example,
we choose a number between 1 and 10 - say we pick 7.
3. Starting with case number chosen in Step 2, take every tenth record (7, 17, 27, etc.).
More generally, suppose that the N units in the population are numbered 1 to N in some order (e.g.,
alphabetic). To select a sample of n units, we take a unit at random from the first k units, where
k = N/n, and take every kth unit thereafter.
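The three steps can be written out as a short Python sketch. It assumes that N is an exact
multiple of n, as in the 10,000 and 1,000 example; the function name systematic_sample is
illustrative only.

```python
import random


def systematic_sample(population_size, sample_size, start=None):
    """Return the case numbers picked by systematic sampling."""
    k = population_size // sample_size     # Step 1: sampling interval k = N / n
    if start is None:
        start = random.randint(1, k)       # Step 2: random start between 1 and k
    # Step 3: take the starting case and every kth case after it.
    return [start + i * k for i in range(sample_size)]


cases = systematic_sample(10_000, 1_000, start=7)   # start fixed at 7 as in the text
print(cases[:3], cases[-1], len(cases))
# [7, 17, 27] 9997 1000
```

Leaving out the start argument lets the function choose the random start itself.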
Cluster Sampling
In some instances the sampling unit consists of a group or cluster of smaller units that we call
elements or subunits (these are the units of analysis for your study). There are two main reasons
for the widespread application of cluster sampling. Although the first intention may be to use the
elements as sampling units, it is found in many surveys that no reliable list of elements in the
population is available and that it would be prohibitively expensive to construct such a list. In
many countries there are no complete and updated lists of the people, the houses or the farms in
any large geographical region.
Even when a list of individual houses is available, economic considerations may point to the
choice of a larger cluster unit. For a given size of sample, a small unit usually gives more precise
results than a large unit. For example a SRS of 600 houses covers a town more evenly than 20
city blocks containing an average of 30 houses each. But greater field costs are incurred in
locating 600 houses and in traveling between them than in covering 20 city blocks. When cost is
balanced against precision, the larger unit may prove superior.
Nonprobability Sampling
Social research is often conducted in situations where a researcher cannot select the kinds of
probability samples used in large-scale social surveys. For example, say you wanted to study
homelessness - there is no list of homeless individuals nor are you likely to create such a list.
However, you need to get some kind of a sample of respondents in order to conduct your
research. To gather such a sample, you would likely use some form of non-probability sampling.
To restate, the primary difference between probability methods of sampling and non-probability
methods is that in the latter you do not know the likelihood that any element of a population will
be selected for study.
There are four primary types of non-probability sampling methods:
Availability Sampling
Availability sampling is a method of choosing subjects who are available or easy to find. This
method is also sometimes referred to as haphazard, accidental, or convenience sampling. The
primary advantage of the method is that it is very easy to carry out, relative to other methods. For
example, if you want to collect data from women alone, you may stand in a crowded market place
and distribute your schedule as you wish.
Quota Sampling
Quota sampling is designed to overcome the most obvious flaw of availability sampling. Rather
than taking just anyone, you set quotas to ensure that the sample you get represents certain
characteristics in proportion to their prevalence in the population. Note that for this method, you
have to know something about the characteristics of the population ahead of time. Say you want
to make sure you have a sample proportional to the population in terms of gender - you have to
know what percentage of the population is male and female, then collect sample until yours
matches. Marketing studies are particularly fond of this form of research design.
Purposive or judgmental Sampling
Purposive sampling is a sampling method in which elements are chosen based on purpose of the
study. Purposive sampling may involve studying the entire population of some limited group
(Economics BA students of Calicut University) or a subset of a population (Economics BA
students of Calicut University who are women). As with other non-probability sampling
methods, purposive sampling does not produce a sample that is representative of a larger
population, but it can be exactly what is needed in some cases - study of organization,
community, or some other clearly defined and relatively limited group.
Snowball Sampling
Snowball sampling is a method in which a researcher identifies one member of some population
of interest, speaks to him/her, then asks that person to identify others in the population that the
researcher might speak to. This person is then asked to refer the researcher to yet another person,
and so on. Snowball sampling is very good for cases where members of a special population are
difficult to locate.
The best sampling method is the sampling method that most effectively meets the particular
goals of the study in question. The effectiveness of a sampling method depends on many factors.
Because these factors interact in complex ways, the best sampling method is seldom obvious.
Good researchers use the following strategy to identify the best sampling method:
1. List the research goals (usually some combination of accuracy, precision, and/or cost).
2. Identify potential sampling methods that might effectively achieve those goals.
3. Test the ability of each method to achieve each goal.
4. Choose the method that does the best job of achieving the goals.
***********************************
Module II
CORRELATION AND REGRESSION ANALYSIS
Module II. Correlation and Regression Analysis
Correlation - Meaning, Types and Degrees of Correlation - Methods of Measuring Correlation -
Graphical Methods: Scatter Diagram and Correlation Graph; Algebraic Methods: Karl
Pearson's Coefficient of Correlation and Rank Correlation Coefficient - Properties and
Interpretation of Correlation Coefficient
Introduction
Correlation is a statistical technique which tells us if two variables are related. For
example, consider the variables family income and family expenditure. It is well known that
income and expenditure increase or decrease together. Thus they are related in the sense that
a change in one variable is accompanied by a change in the other. Again, price and
demand of a commodity are related variables; when price increases, demand tends to
decrease, and vice versa. If a change in one variable is accompanied by a change in the other,
then the variables are said to be correlated. We can therefore say that family income and family
expenditure are correlated, and so are price and demand.
Correlation can tell us something about the relationship between variables. It is used to
understand:
a. whether the relationship is positive or negative
b. the strength of the relationship.
Correlation is a powerful tool that provides these vital pieces of information.
In the case of family income and family expenditure, it is easy to see that they both rise or fall
together in the same direction. This is called positive correlation.
In case of price and demand, change occurs in the opposite direction so that increase in one is
accompanied by decrease in the other. This is called negative correlation.
Coefficient of Correlation
Correlation is measured by what is called the coefficient of correlation (r). A correlation
coefficient is a statistical measure of the degree to which changes in the value of one variable
predict changes in the value of another. Correlation coefficients take values between -1 and +1.
The numerical value gives us an indication of the strength of the relationship. In general, r > 0
indicates a positive relationship, r < 0 indicates a negative relationship, while r = 0 indicates
no linear relationship between the variables. Here r = +1.0 describes a perfect positive
correlation and r = -1.0 describes a perfect negative correlation. The closer the coefficient is
to +1.0 or -1.0, the greater is the strength of the relationship between the variables. As a rule
of thumb, the following guidelines on strength of relationship are often useful (though many
experts would somewhat disagree on the choice of boundaries).
Value of r                       Strength of relationship
-1.0 to -0.5 or 0.5 to 1.0       Strong
-0.5 to -0.3 or 0.3 to 0.5       Moderate
-0.3 to -0.1 or 0.1 to 0.3       Weak
-0.1 to 0.1                      None or very weak
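The rule of thumb can be expressed as a small lookup function. This is a sketch only: the ranges
in the table share their end points (0.5, 0.3 and 0.1), so the code assumes, as one possible
convention, that a boundary value belongs to the stronger category. The function name is
illustrative.

```python
def strength_of_relationship(r):
    """Classify a correlation coefficient using the rule-of-thumb table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    size = abs(r)                  # strength depends on magnitude, not on sign
    if size >= 0.5:
        return "Strong"
    if size >= 0.3:
        return "Moderate"
    if size >= 0.1:
        return "Weak"
    return "None or very weak"


for r in (0.985, -0.4, 0.2, 0.05):
    print(r, strength_of_relationship(r))
# 0.985 Strong
# -0.4 Moderate
# 0.2 Weak
# 0.05 None or very weak
```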
[Figure: three scatter diagrams illustrating a perfect positive correlation, no correlation (no
relation between two variables), and a perfect negative correlation]
Correlation is only appropriate for examining the relationship between meaningful quantifiable
data (e.g. air pressure, temperature) rather than categorical data such as gender, favourite colour
etc.
A key thing to remember when working with correlations is never to assume a correlation means
that a change in one variable causes a change in another. Sales of personal computers and
athletic shoes have both risen strongly in the last several years and there is a high correlation
between them, but you cannot assume that buying computers causes people to buy athletic shoes
(or vice versa).
The second caution is that the Pearson correlation technique (which we are about to see) works
best with linear relationships: as one variable gets larger (or smaller), the other gets larger (or
smaller) in direct proportion. It does not work well with curvilinear relationships (in which the
relationship does not follow a straight line). An example of a curvilinear relationship is age and
health care. They are related, but the relationship doesn't follow a straight line. Young children
and older people both tend to use much more health care than teenagers or young adults. (In such
cases, the technique of multiple regression can be used to examine curvilinear relationships.)
METHODS OF MEASURING CORRELATION

I. Graphical Method
(a) Scatter Diagram
(b) Correlation Graph
II. Algebraic Method (Coefficient of Correlation)
(a) Karl Pearson's Coefficient of Correlation
(b) Spearman's Rank Correlation Coefficient
I. (a) Scatter Diagram
Scatter Diagram (also called scatter plot, XY graph) is a graph that shows the relationship
between two quantitative variables measured on the same individual. Each individual in the data
set is represented by a point in the scatter diagram. The predictor variable is plotted on the
horizontal axis and the response variable is plotted on the vertical axis. Do not connect the points
when drawing a scatter diagram. The scatter diagram graphs pairs of numerical data, with one
variable on each axis, to look for a relationship between them. If the variables are correlated, the
points will fall along a line or curve. The better the correlation, the tighter the points will hug the
line. Scatter Diagram is a graphical measure of correlation.
Examples of Scatter Diagram. Given below each diagram is the value of correlation.
Note that the value shows how good the correlation is (not how steep the line is), and if it is
positive or negative.
Scatter Diagram Procedure
1. Collect pairs of data where a relationship is suspected.
2. Draw a graph with the independent variable on the horizontal axis and the dependent variable
on the vertical axis. For each pair of data, put a dot or a symbol where the x-axis value intersects
the y-axis value. (If two dots fall together, put them side by side, touching, so that you can see
both.)
3. Look at the pattern of points to see if a relationship is obvious. If the data clearly form a line
or a curve, you may stop. The variables are correlated.
The data set below represents a random sample of 5 workers in a particular industry. The
productivity of each worker was measured at one point in time, and the worker was asked the
number of years of job experience. The dependent variable is productivity, measured in number
of units produced per day, and the independent variable is experience, measured in years.
Worker    y = Productivity (output/day)    x = Experience (in years)
1         33                               10
2         19                               6
3         32                               12
4         26                               8
5         15                               4
[Figure: Scatter Chart for Worker Productivity vs Experience. Horizontal axis: Experience;
vertical axis: Productivity.]
This scatter diagram tells us that the two variables, productivity and experience, are
positively correlated.
Merits of Scatter Diagram Method:
1. It is an easy way of finding the nature of correlation between two variables.
2. By drawing a line of best fit by free hand method through the plotted dots, the method
can be used for estimating the missing value of the dependent variable for a given value
of independent variable.
3. Scatter diagram can be used to find out the nature of linear as well as non-linear
correlation.
4. The values of extreme observations do not affect the method.
Demerits of Scatter Diagram Method:
It gives only rough idea of how the two variables are related. It gives an idea about the
direction of correlation and also whether it is high or low. But this method does not give any
quantitative measure of the degree or extent of correlation.
I (b) Correlation Graph
Correlation graph is also used as a measure of correlation. Under this method, separate curves
are drawn for the X variable and the Y variable on the same graph paper, with the values of each
variable taken as ordinates of the points plotted. The direction of the curves is then examined
to understand the nature of the correlation. From the direction and closeness of the two curves
we can infer whether the variables are related. If both the curves move in the same direction
(upward or downward), correlation is said to be positive. If the curves move in opposite
directions, correlation is said to be negative.
But correlation graphs are not capable of doing anything more than suggesting the fact
of a possible relationship between two variables. We can neither establish any causal
relationship between two variables nor obtain the exact degree of correlation through them.
They only tell us whether the two variables are positively or negatively correlated. Example of a
graph is given below.
II. Algebraic Method (Coefficient of Correlation)
II. (a) Karl Pearson's Coefficient of Correlation (Pearson product-moment correlation
coefficient)
Karl Pearson's Product-Moment Correlation Coefficient, or simply Pearson's Correlation
Coefficient for short, is one of the important methods used in Statistics to measure
correlation between two variables. Karl Pearson was a British mathematician,
statistician, lawyer and eugenicist. He established the discipline of mathematical
statistics. He founded the world's first statistics department in the University of London
in the year 1911. He along with his colleagues Weldon and Galton founded the journal
Biometrika, whose object was the development of statistical theory.
The Pearson product-moment correlation coefficient (r) is a common measure of the
correlation between two variables X and Y. When measured in a population, the Pearson
product-moment correlation is designated by the Greek letter rho (ρ). When computed
in a sample, it is designated by the letter "r" and is sometimes called "Pearson's r."
Pearson's correlation reflects the degree of linear relationship between two variables.
Mathematical Formula:
The quantity r, called the linear correlation coefficient, measures the strength and the
direction of a linear relationship between two quantitative variables. As noted above, the
Greek letter ρ (rho) represents the population correlation coefficient and r represents
the sample correlation coefficient.
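Correlation coefficient for ungrouped data

$$ r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n \, \sigma_X \, \sigma_Y} $$

Where
$X_i$ is the ith observation of the variable X
$Y_i$ is the ith observation of the variable Y
$\bar{X}$ is the mean of the observations of the variable X
$\bar{Y}$ is the mean of the observations of the variable Y
n is the number of pairs of observations of X and Y
$\sigma_X$ is the standard deviation of the variable X (computed with divisor n)
$\sigma_Y$ is the standard deviation of the variable Y (computed with divisor n)
The above formula may be presented in the following form

$$ r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \, \sum (Y_i - \bar{Y})^2}} $$

The same may be computed using the Pearson product-moment correlation coefficient
formula as shown below.

$$ r = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{\sqrt{\left[ n \sum X_i^2 - \left( \sum X_i \right)^2 \right] \left[ n \sum Y_i^2 - \left( \sum Y_i \right)^2 \right]}} $$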
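Example: The data below give annual advertising expenditure (Xi) and annual sales (Yi) for 10
years.
Year (i)    Annual advertising expenditure (Xi)    Annual Sales (Yi)
1           10                                     20
2           12                                     30
3           14                                     37
4           16                                     50
5           18                                     56
6           20                                     78
7           22                                     89
8           24                                     100
9           26                                     120
10          28                                     110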
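Compute the necessary values and substitute them in the formula. We will solve the problem using
both formulas. We get

$$ \bar{X} = \frac{\sum X_i}{n} = \frac{190}{10} = 19 $$

$$ \bar{Y} = \frac{\sum Y_i}{n} = \frac{690}{10} = 69 $$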
Year (i)   Xi     Yi     Xi-X̄    Yi-Ȳ    (Xi-X̄)²    (Yi-Ȳ)²    (Xi-X̄)(Yi-Ȳ)
1          10     20     -9      -49     81         2401       441
2          12     30     -7      -39     49         1521       273
3          14     37     -5      -32     25         1024       160
4          16     50     -3      -19     9          361        57
5          18     56     -1      -13     1          169        13
6          20     78     1       9       1          81         9
7          22     89     3       20      9          400        60
8          24     100    5       31      25         961        155
9          26     120    7       51      49         2601       357
10         28     110    9       41      81         1681       369
Total      190    690    0       0       330        11200      1894
We make the additional computations for the Pearson product-moment correlation
coefficient formula.
Year (i)   XiYi     Xi²     Yi²
1          200      100     400
2          360      144     900
3          518      196     1369
4          800      256     2500
5          1008     324     3136
6          1560     400     6084
7          1958     484     7921
8          2400     576     10000
9          3120     676     14400
10         3080     784     12100
Total      15004    3940    58810
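Substitute the values in the respective formula.
Using the basic formula

$$ r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \, \sum (Y_i - \bar{Y})^2}} = \frac{1894}{\sqrt{330 \times 11200}} = 0.985 $$

Now let us redo the problem using the Pearson product-moment correlation coefficient
formula

$$ r = \frac{10 \times 15004 - 190 \times 690}{\sqrt{\left[ 10 \times 3940 - 190^2 \right] \left[ 10 \times 58810 - 690^2 \right]}} = \frac{18940}{\sqrt{3300 \times 112000}} = 0.985 $$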
The correlation coefficient between annual advertising expenditure and annual sales revenue is
0.985. This is a positive value and is very close to 1. So it implies there is a very strong
positive correlation between annual advertising expenditure and annual sales revenue.
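The two routes to r can be cross-checked with a short Python sketch that uses the ten pairs of
advertising and sales figures from the worked example. The variable names are illustrative; only
the standard library is used.

```python
from math import sqrt

# Annual advertising expenditure (X) and annual sales (Y) from the worked example.
X = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
Y = [20, 30, 37, 50, 56, 78, 89, 100, 120, 110]
n = len(X)

# Basic formula: work with deviations from the means.
mean_x, mean_y = sum(X) / n, sum(Y) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))   # 1894
s_xx = sum((x - mean_x) ** 2 for x in X)                         # 330
s_yy = sum((y - mean_y) ** 2 for y in Y)                         # 11200
r_deviation = s_xy / sqrt(s_xx * s_yy)

# Product-moment formula: work with raw sums only.
sum_x, sum_y = sum(X), sum(Y)                                    # 190, 690
sum_xy = sum(x * y for x, y in zip(X, Y))                        # 15004
sum_x2 = sum(x * x for x in X)                                   # 3940
sum_y2 = sum(y * y for y in Y)                                   # 58810
r_raw = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

print(round(r_deviation, 3), round(r_raw, 3))
# 0.985 0.985
```

Both formulas are algebraically the same quantity, so any difference between the two results
would point to an arithmetic slip in one of the tables.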
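Properties of Correlation coefficient
1. The correlation coefficient lies between -1 and +1; symbolically, $-1 \le r \le +1$.
2. The correlation coefficient is independent of the change of origin and scale.
3. The coefficient of correlation is the geometric mean of the two regression coefficients:

$$ r = \pm \sqrt{b_{xy} \times b_{yx}} $$

where $b_{xy}$ is the regression coefficient of X on Y and $b_{yx}$ is the regression coefficient
of Y on X. The two regression coefficients always have the same sign and r takes that sign: if
one regression coefficient is positive, the other is also positive and the correlation
coefficient is positive.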
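Property 2 can be checked numerically. The sketch below reuses the advertising and sales figures
from the worked example and shifts and rescales both variables; the new origins (20 and 70) and
scale factors (2 and 10) are arbitrary choices made for illustration, and pearson_r is an
illustrative helper. The scale factors are taken as positive; a negative factor on one variable
would reverse the sign of r.

```python
from math import isclose, sqrt


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


X = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
Y = [20, 30, 37, 50, 56, 78, 89, 100, 120, 110]

# Change of origin and scale: U = (X - 20) / 2 and V = (Y - 70) / 10.
U = [(x - 20) / 2 for x in X]
V = [(y - 70) / 10 for y in Y]

print(isclose(pearson_r(X, Y), pearson_r(U, V)))
# True
```

This is why deviations may be taken from any convenient assumed mean, and divided by a common
factor, without changing the value of r.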
Assumptions of Pearson's Correlation Coefficient
1. There is a linear relationship between the two variables, i.e. when the two variables are
plotted on a scatter diagram, the points fall approximately along a straight line.
2. Cause and effect relation exists between different forces operating on the items of the two
variable series.
Advantages of Pearson's Coefficient
1. It summarizes in one value both the degree and the direction of correlation.
Disadvantages
While 'r' (correlation coefficient) is a powerful tool, it has to be handled with care.
1. The most used correlation coefficients only measure linear relationship. It is therefore
perfectly possible that while there is strong non-linear relationship between the variables,
r is close to 0 or even 0. In such a case, a scatter diagram can roughly indicate the
existence or otherwise of a non-linear relationship.
2. One has to be careful in interpreting the value of 'r'. For example, one could compute 'r'
between the size of shoe and intelligence of individuals, heights and income. Irrespective
of the value of 'r', it makes no sense and is hence termed chance or non-sense correlation.
3. 'r' should not be used to say anything about cause and effect relationship. Put differently,
by examining the value of 'r', we could conclude that variables X and Y are related.
However the same value of 'r' does not tell us if X influences Y or the other way round.
Statistical correlation should not be the primary tool used to study causation, because of
the problem with third variables.
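Point 1 above can be illustrated with a small constructed data set in which Y is completely
determined by X, yet r is zero. The data (X running from -3 to 3 and Y equal to the square of X)
are hypothetical and chosen only to make the point; pearson_r is an illustrative helper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]      # exact non-linear (U-shaped) relationship

r = pearson_r(xs, ys)
print(abs(r) < 1e-12)
# True
```

The falling half of the U-shape cancels the rising half, so the linear measure reports no
relationship even though a scatter diagram of the same data would show a clear curve.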
Coefficient of Determination
A convenient way of interpreting the value of the correlation coefficient is to use the square of
the coefficient of correlation, which is called the Coefficient of Determination.
The Coefficient of Determination = r².
Suppose r = 0.9; then r² = 0.81. This would mean that 81% of the variation in the dependent
variable has been explained by the independent variable.
The maximum value of r² is 1 because it is possible to explain all of the variation in y but it is
not possible to explain more than all of it.
Coefficient of Determination: An example
Suppose r = 0.60 in one case and r = 0.30 in another case. It does not mean that the first
correlation is twice as strong as the second. The relative strength can be understood by
computing the value of r².
When r = 0.60, r² = 0.36 -----(1)
When r = 0.30, r² = 0.09 -----(2)
This implies that in the first case 36% of the total variation is explained whereas in the second
case only 9% of the total variation is explained, so the first relationship explains four times
as much variation as the second.
II. (b) Spearman's Rank Correlation Coefficient
The Spearman's rank-order correlation is the nonparametric version of the Pearson
product-moment correlation. Spearman's correlation coefficient (denoted $r_s$)
measures the strength of association between two ranked variables.
Data which are arranged in numerical order, usually from largest to smallest, and numbered 1, 2, 3,
... are said to be in ranks, or ranked data. These ranks prove useful at certain times when two or
more values of one variable are the same. The coefficient of correlation for such type of data is
given by Spearman's rank difference correlation coefficient.
The Spearman Rank Correlation Coefficient uses ranks to calculate correlation. It is the analogue
of Pearson's correlation coefficient when the data are in terms of ranks. One can therefore also
call it the correlation coefficient between the ranks.
The Spearman's rank-order correlation is used when there is a monotonic relationship between
our variables. A monotonic relationship is a relationship that does one of the following: (1) as the
value of one variable increases, so does the value of the other variable; or (2) as the value of one
variable increases, the other variable value decreases. A monotonic relationship is an important
underlying assumption of the Spearman rank-order correlation. It is also important to recognize
the assumption of a monotonic relationship is less restrictive than a linear relationship (an
assumption that has to be met by the Pearson product-moment correlation). The middle image
above illustrates this point well: A non-linear relationship exists, but the relationship is
monotonic and is suitable for analysis by Spearman's correlation, but not by Pearson's
correlation.
Let us illustrate the relevance of the Spearman Rank Correlation Coefficient with the aid of an
example. Consider a musical talent contest where 10 competitors are evaluated by
two judges, A and B. Usually judges award numerical scores to each contestant after his/her
performance.
A product moment correlation coefficient of the scores given by the two judges hardly makes sense
here, as we are not interested in examining the existence or otherwise of a linear relationship
between the scores.
What makes more sense is the correlation between the ranks of contestants as judged by the two
judges. The Spearman Rank Correlation Coefficient can indicate whether the judges agree with each
other as far as the talent of the contestants is concerned (though they might award different
numerical scores), in other words whether the judges are unanimous.
The numerical value of the correlation coefficient, rs, ranges between -1 and +1. It indicates
how closely the two sets of ranks are related.
In general,
rs > 0 implies positive agreement among ranks
rs < 0 implies negative agreement (or agreement in the reverse direction)
rs = 0 implies no agreement
The closer rs is to 1, the better is the agreement, while rs closer to -1 indicates strong
agreement in the reverse direction.
The formula for finding Spearman Rank Correlation Coefficient is

$$ r_s = 1 - \frac{6 \sum (X_i - Y_i)^2}{n (n^2 - 1)} $$

Where
$X_i$ is the rank of the ith observation of the variable X
$Y_i$ is the rank of the ith observation of the variable Y
n is the number of pairs of observations
Let us calculate the Spearman Rank Correlation Coefficient for our example of the musical talent
contest where 10 competitors are evaluated by two judges. The ratings (ranks) are given below.
Contestant            1    2    3    4    5    6    7    8    9    10
Rating by judge 1     1    2    3    4    5    6    7    8    9    10
Rating by judge 2     2    4    5    1    3    6    7    9    10   8
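Let us first make the necessary calculations
Contestant   Rating by judge 1 (Xi)   Rating by judge 2 (Yi)   Xi - Yi   (Xi - Yi)²
1            1                        2                        -1        1
2            2                        4                        -2        4
3            3                        5                        -2        4
4            4                        1                        3         9
5            5                        3                        2         4
6            6                        6                        0         0
7            7                        7                        0         0
8            8                        9                        -1        1
9            9                        10                       -1        1
10           10                       8                        2         4
Total                                                                    28

$$ r_s = 1 - \frac{6 \sum (X_i - Y_i)^2}{n (n^2 - 1)} = 1 - \frac{6 \times 28}{10 (10^2 - 1)} = 1 - \frac{168}{990} = 0.8303 $$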
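The same calculation can be reproduced with a few lines of Python, using the two judges' rankings
from the example. The variable names are illustrative. The rank-difference formula used here
assumes that no two contestants share the same rank, which holds for these data.

```python
# Ranks given to contestants 1..10 by the two judges (from the example).
judge_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
judge_2 = [2, 4, 5, 1, 3, 6, 7, 9, 10, 8]

n = len(judge_1)
# Sum of squared rank differences, i.e. the total of the last column of the table.
sum_d2 = sum((x - y) ** 2 for x, y in zip(judge_1, judge_2))

r_s = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))
print(sum_d2, round(r_s, 4))
# 28 0.8303
```

A value of about 0.83 indicates that the two judges agree closely, though not perfectly, on the
ordering of the contestants.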
Spearman Rank Correlation Coefficient tries to assess the relationship between ranks
without making any assumptions about the nature of their relationship. Hence it is a
non-parametric measure - a feature which has contributed to its popularity and
widespread use.
Interpretation of Rank Correlation Coefficient (R)
1. The value of rank correlation coefficient, R ranges from -1 to +1
2. If R = +1, then there is complete agreement in the order of the ranks and the ranks are
in the same direction
3. If R = -1, then there is complete agreement in the order of the ranks and the ranks are
in the opposite direction
4. If R = 0, then there is no correlation
Advantages of Spearman's Rank Correlation
1. This method is simpler to understand and easier to apply compared to Karl Pearson's
correlation method.
2. This method is useful where we can give the ranks and not the actual data
(qualitative terms).
3. This method is to be used where the initial data are in the form of ranks.
Disadvantages of Spearman's Rank Correlation
1. It cannot be used for finding out correlation in a grouped frequency distribution.
2. This method should not be applied where N exceeds 30 unless the ranks are already given,
since ranking a large number of items makes the calculations tedious.
3. As Spearman's rank only uses rank, it is not affected by significant variations in
readings. As long as the order remains the same, the coefficient will stay the same. As
with any comparison, the possibility of chance will have to be evaluated to ensure that
the two quantities are actually connected.
4. A significant correlation does not necessarily mean cause and effect.
Advantages of Correlation studies

1. Show the amount (strength) of relationship present.
2. Can be used to make predictions about the variables under study.
3. Can be used in many places, including natural settings, libraries, etc.
4. Easier to collect correlational data
REGRESSION ANALYSIS*
* Note: In the syllabus for III Semester BA Economics paper Quantitative Methods for
Economic Analysis 1, though the title of this Module II is given as Correlation and Regression
Analysis, regression is not included in the contents. Hence here we give only a brief discussion
on regression.
If two variables are significantly correlated, and if there is some theoretical basis for doing so, it
is possible to predict values of one variable from the other. This observation leads to a very
important concept known as Regression Analysis.
Regression analysis, in the general sense, means the estimation or prediction of the unknown value
of one variable from the known value of the other variable. It is one of the most important
statistical tools and is extensively used in almost all sciences: natural, social and physical. It
is especially used in business and economics to study the relationship between two or more
variables that are related causally and for the estimation of demand and supply curves, cost
functions, production and consumption functions and so on.
Prediction or estimation is one of the major problems in almost all the spheres of human activity.
The estimation or prediction of future production, consumption, prices, investments, sales,
profits, income etc. are of very great importance to business professionals. Similarly, population
estimates and population projections, GNP, Revenue and Expenditure etc. are indispensable for
economists and efficient planning of an economy.
Regression analysis was explained by M. M. Blair as follows:
Regression analysis is a mathematical measure of the average relationship between two or more
variables in terms of the original units of the data.
Regression analysis is thus a very powerful tool in the field of statistical analysis for
predicting the value of one variable, given the value of another variable, when those variables
are related to each other.
Advantages of Regression Analysis
1. Regression analysis provides estimates of values of the dependent variables from the values of
independent variables.
2. Regression analysis also helps to obtain a measure of the error involved in using the
regression line as a basis for estimation.
3. Regression analysis helps in obtaining a measure of the degree of association or correlation
that exists between the two variables.
Assumptions in Regression Analysis
1. An actual linear relationship exists between the variables.
2. The regression equation is used to estimate values only within the range for which it is valid.
3. The relationship between the dependent and independent variables remains the same till the
regression equation is calculated.
4. The dependent variable takes any random value but the values of the independent variables are
fixed.
5. In regression, we have only one dependent variable in our estimating equation. However, we
can use more than one independent variable.
Regression line
A regression line summarizes the relationship between two variables in the setting when one of
the variables helps explain or predict the other.
A regression line is a straight line that describes how a response variable y changes as an
explanatory variable x changes. A regression line is used to predict the value of y for a given
value of x. Regression, unlike correlation, requires that we have an explanatory variable and a
response variable.
Regression line is the line which gives the best estimate of one variable from the value of any
other given variable. The regression line gives the average relationship between the two variables
in mathematical form.
For two variables X and Y, there are always two lines of regression.
Regression line of X on Y: gives the best estimate of the value of X for any given
value of Y:
X = a + bY
Where
a = X intercept
b = Slope of the line
X = Dependent variable
Y = Independent variable
Regression line of Y on X: gives the best estimate of the value of Y for any given
value of X:
Y = a + bX
Where
a = Y intercept
b = Slope of the line
Y = Dependent variable
X = Independent variable
Simple Linear Regression
Regression analysis is most often used for prediction. The goal in regression analysis is to create
a mathematical model that can be used to predict the values of a dependent variable based upon
the values of an independent variable. In other words, we use the model to predict the value of Y
when we know the value of X. (The dependent variable is the one to be predicted). Correlation
analysis is often used with regression analysis because correlation analysis is used to measure the
strength of association between the two variables X and Y.
In regression analysis involving one independent variable and one dependent variable the values
are frequently plotted in two dimensions as a scatter plot. The scatter plot allows us to visually
inspect the data prior to running a regression analysis. Often this step allows us to see whether the
relationship between the two variables is increasing or decreasing, though it gives only a rough idea of
the relationship. The simplest relationship between two variables is a straight-line or linear
relationship. Of course the data may well be curvilinear, and in that case we would have to use a
different model to describe the relationship. Simple linear regression analysis finds the straight
line that best fits the data.
Fitting a Line to Data

Fitting a line to data means drawing a line that comes as close as possible to the points. (Note
that, in general, no straight line passes exactly through all of the points.) The overall pattern can be
described by drawing a straight line through the points.
Example:
The data in the table below were obtained by measuring the heights of 161 children
from a village each month from 18 to 29 months of age.
Table: Mean height of children
Age in months (x)    Height in centimeters (y)
18                   76.1
19                   77.0
20                   78.1
21                   78.2
22                   78.8
23                   79.7
24                   79.9
25                   81.1
26                   81.2
27                   81.8
28                   82.8
29                   83.5
The figure below is a scatterplot of the data in the above table.
Age is the explanatory variable, which is plotted on the x axis. Mean height (in cm) is
the response variable, which is plotted on the y axis.
[Figure: Scatterplot of mean height in cm (vertical axis) against age in months (horizontal axis)]
We can see on the plot a strong positive linear association with no outliers. The correlation is
r=0.994, close to the r = 1 of points that lie exactly on a line.
If we draw a line through the points, it will describe these data very well. This line is called the
regression line and the process of drawing it is called fitting a line. This is done in the figure below.
Let y be a response variable and x an explanatory variable.
A straight line relating y to x has an equation of the form y = a + bx.
In this equation, b is the slope, the amount by which y changes when x increases by one unit.
The number a is the intercept, the value of y when x = 0.
The straight line describing the data has the form
height = a + (b × age).
In the figure below the regression line has been drawn with the following equation
height = 64.93 + (0.635 × age).
[Figure: Regression line (y = 0.635x + 64.92) drawn through the scatterplot of mean height in cm against age in months]
The figure above shows that this line fits the data well.
The slope b = 0.635 tells us that the height of children increases by about 0.6 cm for each
month of age.
The slope b of a line y = a + bx is the rate of change in the response y as the explanatory
variable x changes.
The slope of a regression line is an important numerical description of the relationship
between the two variables.
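The slope, intercept and correlation quoted for the height data can be checked with a short calculation. The following is only a minimal sketch in Python, assuming the usual least-squares formulas in deviation form (b = Sxy/Sxx, a = mean of y minus b times mean of x); the variable names are our own and are not part of the original example.

```python
import math

# Mean height (cm) at each age (months), taken from the table above.
ages = list(range(18, 30))
heights = [76.1, 77.0, 78.1, 78.2, 78.8, 79.7,
           79.9, 81.1, 81.2, 81.8, 82.8, 83.5]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(heights) / n

# Sums of squares and cross-products of deviations from the means.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, heights))
s_xx = sum((x - mean_x) ** 2 for x in ages)
s_yy = sum((y - mean_y) ** 2 for y in heights)

b = s_xy / s_xx                      # slope: change in height per month
a = mean_y - b * mean_x              # intercept: fitted height at age 0
r = s_xy / math.sqrt(s_xx * s_yy)    # correlation coefficient

print(f"height = {a:.2f} + {b:.3f} * age, r = {r:.3f}")
```

Rounded to the precision used in the text, the sketch gives a slope of 0.635, an intercept of 64.93 and r = 0.994.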
Regression for prediction

We use the regression equation to predict the value of a variable.
Suppose we have a sample of size n with two sets of measures, denoted by x and
y. We can predict the values of y given the values of x by using the equation, called
the regression equation, given below.
y* = a + bx
where the coefficients a and b are given by
$$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$$
$$a = \frac{\sum y - b\sum x}{n}$$
In the regression equation the symbol y* refers to the value of y predicted from a given
value of x.
Let us see with the aid of an example how regression is used for prediction.
Example:
Scores made by students in a statistics class in the mid - term and final examination are
given here. Develop a regression equation which may be used to predict final
examination scores from the mid term score.
STUDENT    MID TERM    FINAL
1          98          90
2          66          74
3          100         98
4          96          88
5          88          80
6          45          62
7          76          78
8          60          74
9          74          86
10         82          80
Solution:
We want to predict the final exam scores from the mid term scores. So let us designate
y for the final exam scores and x for the mid term exam scores. We open the
following table for the calculations.
STUDENT    X      Y      X²       XY
1          98     90     9604     8820
2          66     74     4356     4884
3          100    98     10000    9800
4          96     88     9216     8448
5          88     80     7744     7040
6          45     62     2025     2790
7          76     78     5776     5928
8          60     74     3600     4440
9          74     86     5476     6364
10         82     80     6724     6560
Total      785    810    64521    65074
First find b and then find a and substitute in the equation.
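$$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} = \frac{10(65074) - (785)(810)}{10(64521) - (785)^2} = \frac{650740 - 635850}{645210 - 616225} = \frac{14890}{28985} = 0.514$$
$$a = \frac{\sum y - b\sum x}{n} = \frac{810 - (0.514)(785)}{10} = \frac{810 - 403.49}{10} = \frac{406.51}{10} = 40.651$$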
So a = 40.651 and b = 0.514.
Substituting in the equation of the regression line, y* = a + bx, we get
y* = 40.651 + (0.514)x
We can use this equation to find the projected or estimated final scores of the students.
For example, for a midterm score of 50 the projected final score is
y* = 40.651 + (0.514)(50) = 40.651 + 25.70 = 66.351.
To give another example, for a midterm score of 70 the projected final
score is
y* = 40.651 + (0.514)(70) = 40.651 + 35.98 = 76.631.
Both estimates are consistent with the final scores of the students in the table who had similar midterm scores.
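The same working can be reproduced in a few lines of Python. This is a sketch only: the function name predict_final is our own, and the rounding of b to three decimal places before computing a is done deliberately to follow the hand calculation above (without that rounding a comes out slightly larger).

```python
# Midterm (x) and final (y) scores of the ten students.
mid_term = [98, 66, 100, 96, 88, 45, 76, 60, 74, 82]
final = [90, 74, 98, 88, 80, 62, 78, 74, 86, 80]

n = len(mid_term)
sum_x = sum(mid_term)                                  # 785
sum_y = sum(final)                                     # 810
sum_x2 = sum(x * x for x in mid_term)                  # 64521
sum_xy = sum(x * y for x, y in zip(mid_term, final))   # 65074

# b is rounded to 3 decimals first, as in the hand calculation.
b = round((n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2), 3)
a = round((sum_y - b * sum_x) / n, 3)

def predict_final(mid_score):
    """Predicted final score y* = a + b*x for a given midterm score."""
    return a + b * mid_score

print(a, b)
print(round(predict_final(50), 3), round(predict_final(70), 3))
```

The predictions for midterm scores of 50 and 70 agree with the values 66.351 and 76.631 obtained above.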
Applications (uses) of regression analysis

1. Predicting the future: The most common use of regression in business is to predict events that
have yet to occur. Demand analysis, for example, predicts how many units consumers will
purchase. Many key parameters other than demand are also used as dependent variables in regression
models. Predicting the number of shoppers who will pass in front of a particular
billboard or the number of viewers who will watch the Champions Trophy Cricket may help
management assess what to pay for an advertisement.
2. Insurance: Insurance companies rely heavily on regression analysis to estimate, for example, how many
policy holders will be involved in accidents or be victims of theft.
3. Optimization: Another key use of regression models is the optimization of business processes.
A factory manager might, for example, build a model to understand the relationship between
oven temperature and the shelf life of the cookies baked in those ovens. A company operating a
call center may wish to know the relationship between wait times of callers and number of
complaints.
4. Productivity: A fundamental driver of enhanced productivity in business and rapid economic advancement
around the globe during the 20th century was the frequent use of statistical tools in
manufacturing as well as service industries. Today, managers consider regression an
indispensable tool.
Limitations of Regression Analysis

There are three main limitations:
1. Parameter Instability - This is the tendency for relationships between variables to change over
time due to changes in the economy or the markets, among other uncertainties. If a mutual fund
produced a return history in a market where technology was a leadership sector, the model may
not work when foreign and small-cap markets are leaders.
2. Public Dissemination of the Relationship - In an efficient market, this can limit the
effectiveness of that relationship in future periods. For example, the discovery that low price-to-book
value stocks outperform high price-to-book value stocks means that these stocks can be bid
higher, and value-based investment approaches will not retain the same relationship as in the
past.
3. Violation of Regression Relationships - Earlier we listed the assumptions of
linear regression. In the real world these assumptions are often unrealistic - e.g. assuming the
independent variable X is not random.
Correlation or Regression
Correlation and regression analysis are related in the sense that both deal with relationships
among variables. Whether to use Correlation or Regression in an analysis is often confusing for
researchers.
In regression the emphasis is on predicting one variable from the other; in correlation the
emphasis is on the degree to which a linear model may describe the relationship between two
variables. In regression the interest is directional: one variable is predicted and the other is the
predictor. In correlation the interest is non-directional: the relationship itself is the critical aspect.
Correlation makes no a priori assumption as to whether one variable is dependent on the other(s)
and is not concerned with the form of the relationship between variables; instead it gives an estimate of
the degree of association between the variables. In fact, correlation analysis tests for
interdependence of the variables.
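The directional nature of regression and the non-directional nature of correlation can be illustrated numerically. The sketch below reuses the midterm and final scores from the earlier example; the helper functions slope and correlation are our own names, and the identity that the product of the two regression slopes equals r² is a standard result added here for illustration.

```python
import math

mid_term = [98, 66, 100, 96, 88, 45, 76, 60, 74, 82]
final = [90, 74, 98, 88, 80, 62, 78, 74, 86, 80]

def slope(x, y):
    """Slope of the least-squares regression line of y on x."""
    n = len(x)
    s_xy = n * sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y)
    s_xx = n * sum(u * u for u in x) - sum(x) ** 2
    return s_xy / s_xx

def correlation(x, y):
    """Karl Pearson's coefficient of correlation between x and y."""
    n = len(x)
    s_xy = n * sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y)
    s_xx = n * sum(u * u for u in x) - sum(x) ** 2
    s_yy = n * sum(v * v for v in y) - sum(y) ** 2
    return s_xy / math.sqrt(s_xx * s_yy)

b_final_on_mid = slope(mid_term, final)   # predicts final from midterm
b_mid_on_final = slope(final, mid_term)   # predicts midterm from final
r = correlation(mid_term, final)

print(b_final_on_mid, b_mid_on_final)     # two different slopes
print(r, correlation(final, mid_term))    # the same r either way
```

Swapping the roles of the two variables changes the regression line but leaves the correlation coefficient unchanged, which is the sense in which regression is directional and correlation is not.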
As regression attempts to describe the dependence of a variable on one (or more) explanatory
variables, it implicitly assumes that there is a one-way causal effect from the explanatory
variable(s) to the response variable, regardless of whether the path of effect is direct or indirect.
There are advanced methods that allow a non-dependence based relationship to be
described (e.g. Principal Components Analysis or PCA) and these will be touched on later.
The best way to appreciate this difference is by example.

Take for instance samples of the leg length and skull size from a population of elephants. It
would be reasonable to suggest that these two variables are associated in some way, as elephants
with short legs tend to have small heads and elephants with long legs tend to have big heads. We
may, therefore, formally demonstrate an association exists by performing a correlation analysis.
However, would regression be an appropriate tool to describe a relationship between skull size
and leg length? Does an increase in skull size cause an increase in leg length? Does a decrease in
leg length cause the skull to shrink? As you can see, it is meaningless to apply a causal
regression analysis to these variables: they are interdependent, neither is wholly dependent
on the other, and more likely some other factor affects them both (e.g. food supply, genetic
makeup).
Consider two variables: crop yield and temperature. These are measured independently, one by
the weather station thermometer and the other by Farmer Giles' scales. While correlation analysis
would show a high degree of association between these two variables, regression analysis would
be able to demonstrate the dependence of crop yield on temperature. However, careless use of
regression analysis could also demonstrate that temperature is dependent on crop yield: this
would suggest that if you grow really big crops you will be guaranteed a hot summer.
Thus, neither regression nor correlation analyses can be interpreted as establishing cause-and-effect
relationships. They can indicate only how or to what extent variables are associated with
each other. The correlation coefficient measures only the degree of linear association between
two variables. Any conclusions about a cause-and-effect relationship must be based on the
judgment of the analyst.
Uses of Correlation and Regression
There are three main uses for correlation and regression.
1. One is to test hypotheses about cause-and-effect relationships. In this case, the experimenter
determines the values of the X-variable and sees whether variation in X causes variation in Y.
For example, giving people different amounts of a drug and measuring their blood pressure.
2. The second main use for correlation and regression is to see whether two variables are
associated, without necessarily inferring a cause-and-effect relationship. In this case, neither
variable is determined by the experimenter; both are naturally variable. If an association is found,
the inference is that variation in X may cause variation in Y, or variation in Y may cause
variation in X, or variation in some other factor may affect both X and Y.
3. The third common use of linear regression is estimating the value of one variable
corresponding to a particular value of the other variable.
*************************
MODULE III
INDEX NUMBERS AND TIME SERIES ANALYSIS
Index Numbers: Meaning and Uses - Laspeyres', Paasche's, Fisher's, Dorbish-Bowley,
Marshall-Edgeworth and Kelley's Methods - Tests of Index Numbers: Time Reversal and Factor
Reversal tests - Base Shifting, Splicing and Deflating - Special Purpose Indices: Wholesale Price
Index, Consumer Price Index and Stock Price Indices: BSE SENSEX and NSE-NIFTY. Time
Series Analysis - Components of Time Series, Measurement of Trend by Moving Average and the
Method of Least Squares.
Introduction
Historically, the first index was constructed in 1764 to compare the price level in Italy in
1750 with the price level in 1500. Though originally developed for measuring the effect of
change in prices, index numbers have today become one of the most widely used statistical
devices and there is hardly any field where they are not used. Newspapers headline the fact that
prices are going up or down, that industrial production is rising or falling, that imports are
increasing or decreasing, that crimes are rising in a particular period compared to the previous
period, as disclosed by index numbers. They are used to feel the pulse of the economy and they
have come to be used as indicators of inflationary or deflationary tendencies. In fact, they are
described as barometers of economic activity, i.e., if one wants to get an idea as to what is
happening to an economy, one should look at important indices like the index number of
industrial production, agricultural production, business activity, etc.
Of the important statistical devices and techniques, index numbers have thus become one of
the most widely used for judging the pulse of the economy, although they were
originally constructed to gauge the effect of changes in prices. Today we use index numbers for
cost of living, industrial production, agricultural production, imports and exports, etc.
Index numbers are indicators which measure percentage changes in a variable (or a group of
variables) over a specified time. For example, if we say that the index of exports for the year 2013
is 125, taking 2010 as the base year, it means that there is an increase of 25% in the country's exports
as compared to the corresponding figure for the year 2010.
Definitions of Index number
According to
Spiegel: An index number is a statistical measure, designed to measure changes in a variable,
or a group of related variables with respect to time, geographical location or other characteristics
such as income, profession, etc.
Patternson: In its simplest form, an index number is the ratio of two index numbers expressed as
a percent. An index is a statistical measure, a measure designed to show changes in one variable
or a group of related variables over time, with respect to geographical location or other
characteristics.
Bowley: Index numbers are used to measure the changes in some quantity which we cannot
observe directly.
We can thus say that index numbers are economic barometers to judge the inflationary (increase in
prices) or deflationary (decrease in prices) tendencies of the economy. They help the government
in adjusting its policies in case of inflationary situations.
TYPES OF INDEX NUMBERS

Index numbers are named after the activity they measure. Their types are as under:
Price Index: Measures changes in price over a specified period of time. It is basically the ratio of
the price of a certain number of commodities in the current year to their price in the base year.
Quantity Index: As the name suggests, these indices measure changes in the volume of
commodities, such as goods produced or goods consumed.
Value Index: These compare changes in the monetary value of imports, exports,
production or consumption of commodities.
Purpose of Index Numbers

An index number which is designed keeping a specific objective in mind is a very powerful tool.
For example, an index whose purpose is to measure consumer prices should not include
wholesale rates of items, and an index number meant for slum colonies should not consider
luxury items like air conditioners, cars, refrigerators, etc.
Index numbers are meant to study changes in the effects of factors which cannot be
measured directly. For example, changes in business activity in a country are not capable of
direct measurement, but it is possible to study relative changes in business activity by studying
the variations in the values of factors which affect business activity and which are
capable of direct measurement.
CHARACTERISTICS OF INDEX NUMBERS

Following are some of the important characteristics of index numbers:
(a) Index numbers are expressed in terms of percentages to show the extent of relative
change.
(b) Index numbers measure relative changes. They measure the relative change in the
value of a variable or a group of related variables over a period of time or between places.
(c) Index numbers measure changes which are not directly measurable.
The cost of living, the price level or the business activity in a country are not directly
measurable, but it is possible to study relative changes in these activities by measuring the
changes in the values of variables/factors which affect these activities.
PROBLEMS IN THE CONSTRUCTION OF INDEX NUMBERS

Decisions regarding the following problems/aspects have to be taken before starting the actual
construction of any type of index number.
(i) Purpose of index numbers under construction
(ii) Selection of items
(iii) Choice of an appropriate average
(iv) Assignment of weights (importance)
(v) Choice of base period
Let us discuss these one by one.
(i) Purpose of Index Numbers
An index number which is designed keeping a specific objective in mind is a very powerful tool.
For example, an index whose purpose is to measure consumer prices should not include
wholesale rates of items, and an index number meant for slum colonies should not consider
luxury items like air conditioners, cars, refrigerators, etc.
(ii) Selection of Items
After the objective of construction of index numbers is defined, only those items which are
related to and relevant to the purpose should be included.
(iii) Choice of Average
As index numbers are themselves specialised averages, it has to be decided first which
average should be used for their construction. The arithmetic mean, being easy to use and
calculate, is preferred over other averages (median, mode or geometric mean). In this lesson, we
will mainly be using the arithmetic mean for construction of index numbers.
(iv) Assignment of weights
Proper importance has to be given to the items used for construction of index numbers. For example,
where wheat is the most important cereal in consumption as against other cereals, it
should be given correspondingly greater weight.
(v) Choice of Base year
The index number for a particular year is calculated by comparison with a year in the near past, which
is called the base year. It may be kept in mind that the base year should be a normal and
economically stable year.
USES OF INDEX NUMBERS

Index numbers are a commonly used statistical device for measuring the combined fluctuations in
a group of related variables. If we wish to compare the price level of consumer items today with
that prevalent ten years ago, we are not interested in comparing the prices of only one item, but
in comparing some sort of average price levels. We may wish to compare the present agricultural
production or industrial production with that at the time of independence. Here again, we have to
consider all items of production and each item may have undergone a different fractional
increase (or even a decrease). How do we obtain a composite measure? This composite measure
is provided by index numbers, which may be defined as a device for combining the variations that
have occurred in a group of related variables over a period of time, with a view to obtaining a figure that
represents the net result of the change in the constituent variables.
Index numbers may be classified in terms of the variables that they are intended to measure. In
business, the different groups of variables for which index number techniques are
commonly used are (i) price, (ii) quantity, (iii) value and (iv) business activity. Thus, we have
index of wholesale prices, index of consumer prices, index of industrial output, index of value of
exports and index of business activity, etc. Here we shall be mainly interested in index numbers
of prices showing changes with respect to time, although the methods described can be applied to
other cases. In general, the present level of prices is compared with the level of prices in the past.
The present period is called the current period and some period in the past is called the base
period.
1) Index numbers are used as economic barometers.
An index number is a special type of average which helps to measure economic
fluctuations in the price level, the money market and economic cycles such as inflation and deflation.
G. Simpson and F. Kafka say that index numbers are today one of the most widely used
statistical devices. They are used to take the pulse of the economy and as indicators
of inflationary or deflationary tendencies. So index numbers are called economic barometers.
2) Index numbers help in formulating suitable economic policies and planning.
Many economic and business policies are guided by index numbers. For
example, while deciding the increase in the dearness allowance (DA) of employees, employers have to depend
primarily on the cost of living index. If salaries or wages are not increased in line with the
cost of living, it can lead to strikes, lock-outs etc. Index numbers thus provide guidelines that
one can use in making decisions.
3) They are used in studying trends and tendencies.
Since index numbers are most widely used for measuring changes over a period of
time, the time series so formed enable us to study the general trend of the phenomenon under
study. For example, looking at the last 8 to 10 years we may be able to say that imports are showing an upward
tendency.
4) They are useful in forecasting future economic activity.
Index numbers are used not only in studying the past and present workings of our
economy but are also important in forecasting future economic activity.
5) Index numbers measure the purchasing power of money.
The cost of living index numbers determine whether real wages are rising, falling
or remaining constant. Real wages are obtained by dividing the money wages by the
corresponding price index and multiplying by 100. Real wages help us in determining the
purchasing power of money.
6) Index numbers are used in deflating.
Index numbers are highly useful in deflating, i.e. they are used to adjust wages for cost of
living changes and thus transform nominal wages into real wages, nominal income into real
income, nominal sales into real sales etc. through appropriate index numbers.
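The deflating rule described in points 5 and 6 (real wage = money wage ÷ price index × 100) can be sketched as follows. The wage and index figures below are purely hypothetical and are used only to show the arithmetic; they do not come from any actual data.

```python
# Hypothetical money wages (Rs.) and cost of living index (base 2010 = 100).
years = [2010, 2011, 2012, 2013]
money_wages = [10000, 11000, 12000, 13500]
cost_of_living_index = [100, 110, 125, 135]

def real_wage(money_wage, price_index):
    """Deflate a money wage: money wage divided by price index, times 100."""
    return money_wage * 100 / price_index

real_wages = [real_wage(w, i)
              for w, i in zip(money_wages, cost_of_living_index)]

for year, nominal, real in zip(years, money_wages, real_wages):
    print(year, nominal, real)
```

In this made-up series the money wage rises every year, yet the real wage in 2012 falls below its 2010 level because prices rose faster than wages in that year.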
Methods of Constructing Index Numbers

Methods of constructing index numbers can be divided into two types:
(a) Unweighted indices
(i) Simple Aggregative method
(ii) Simple average of price relatives method
(b) Weighted indices
(i) Weighted Aggregative Indices
1. Laspeyres' Method
2. Paasche's Method
3. Dorbish and Bowley's Method
4. Fisher's Ideal Method
5. Marshall-Edgeworth Method, and
6. Kelley's Method
(ii) Weighted Average of relatives
Let us see them in detail.
a (i) Simple Aggregative Method
This is a simple method for constructing index numbers. In this method, the total of the prices of
commodities in a given (current) year is divided by the total of the prices of commodities in the
base year and expressed as a percentage.
$$P_{01} = \frac{\sum P_1}{\sum P_0} \times 100$$
where
$\sum P_1$ = Total of current year prices for various commodities
$\sum P_0$ = Total of base year prices for various commodities
Example 1
Let us take an example to illustrate.
Construct the price index number for 2013, taking the year 2010 as base year.
Commodity    Price in the year 2010    Price in the year 2013
A            60                        80
B            50                        60
C            70                        100
D            120                       160
E            100                       150
Solution:
Calculation of the simple aggregative index number for 2013 (against the year 2010) using the
formula.
Commodity    Price in the year 2010 (P0)    Price in the year 2013 (P1)
A            60                             80
B            50                             60
C            70                             100
D            120                            160
E            100                            150
Total        ΣP0 = 400                      ΣP1 = 550
Substituting in the formula,
$$P_{01} = \frac{\sum P_1}{\sum P_0} \times 100 = \frac{550}{400} \times 100 = 137.50$$
This means that the price index for the year 2013, taking 2010 as base year, is 137.5, showing
that there is an increase of 37.5% in the prices in 2013 as against 2010.
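As a quick check, the simple aggregative index for this example can be computed in Python. This is only a sketch; the function name simple_aggregative_index is our own.

```python
# Prices of commodities A to E in the base year (2010) and current year (2013).
prices_2010 = {"A": 60, "B": 50, "C": 70, "D": 120, "E": 100}
prices_2013 = {"A": 80, "B": 60, "C": 100, "D": 160, "E": 150}

def simple_aggregative_index(base_prices, current_prices):
    """Total of current year prices over total of base year prices, times 100."""
    return sum(current_prices.values()) / sum(base_prices.values()) * 100

index_2013 = simple_aggregative_index(prices_2010, prices_2013)
print(index_2013)
```

The result is 137.5, matching the hand calculation.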
Example 2
Compute the index number for the years 2011, 2012, 2013 and 2014, taking 2010 as base year,
from the following data.
Year    Price
2010    120
2011    144
2012    168
2013    204
2014    216
Solution:
Price relatives for the different years are
Year    Price    Price relative
2010    120      (120/120) × 100 = 100
2011    144      (144/120) × 100 = 120
2012    168      (168/120) × 100 = 140
2013    204      (204/120) × 100 = 170
2014    216      (216/120) × 100 = 180
Price indices for the different years are as in the following table.
Year    Price Index
2010    100
2011    120
2012    140
2013    170
2014    180
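The series of price relatives in this example can be generated with a short loop. The sketch below assumes nothing beyond the data given; the dictionary names are our own.

```python
# Price of the commodity in each year; 2010 is the base year.
prices = {2010: 120, 2011: 144, 2012: 168, 2013: 204, 2014: 216}
base_year = 2010

# Price relative of each year: (price in that year / base year price) x 100.
price_index = {year: price * 100 / prices[base_year]
               for year, price in prices.items()}

for year, index in price_index.items():
    print(year, index)
```

Changing base_year to another year in the table re-expresses the whole series relative to that year.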
There are two main limitations of this method. They are:
(i) The units used in the price or quantity quotations can exert a big influence on the value of the
index, and
(ii) No consideration is given to the relative importance of the commodities.
a (ii) Simple Average of price Relatives Method
Price relative means the ratio of the price of a certain item in the current year to the price of that item in
the base year, expressed as a percentage (i.e. Price Relative = (P1/P0) × 100). For example, if an item
such as a fridge or a TV cost Rs. 12000 in 2005 and Rs. 18000 in 2013, the price relative is
(18000/12000) × 100 = 150.
When this method is used to construct a price index, first of all price relatives are
obtained for the various items included in the index and then an average of these relatives is
obtained using any one of the measures of central value, i.e., arithmetic mean, median, mode,
geometric mean or harmonic mean. When the arithmetic mean is used for averaging the relatives, the
formula for computing the index is:
$$P_{01} = \frac{\sum \left( \frac{P_1}{P_0} \times 100 \right)}{N}$$
where P01 is the price index, N is the number of items, P0 is the price
in the base year and P1 is the price of the corresponding commodity in the current year (for which the index
is to be calculated).
Example
Construct by the simple average of price relatives method the price index of 2013, taking 2010 as
base year, from the following data.
Commodity    Price in 2010    Price in 2013
A            60               80
B            50               60
C            60               72
D            50               75
E            25               37.5
F            20               30
Solution
Find the price relative for each commodity, take the sum and substitute in the formula.
Commodity    Price in 2010 (P0)    Price in 2013 (P1)    Price relative (P1/P0) × 100
A            60                    80                    (80/60) × 100 = 133.33
B            50                    60                    (60/50) × 100 = 120.00
C            60                    72                    (72/60) × 100 = 120.00
D            50                    75                    (75/50) × 100 = 150.00
E            25                    37.5                  (37.5/25) × 100 = 150.00
F            20                    30                    (30/20) × 100 = 150.00
Total                                                    Σ(P1/P0 × 100) = 823.33
Substituting we get
$$P_{01} = \frac{\sum \left( \frac{P_1}{P_0} \times 100 \right)}{N} = \frac{823.33}{6} = 137.22$$
Price index for 2013, taking 2010 as base year = 137.22
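The same calculation can be sketched in Python, using the arithmetic mean of the price relatives. The variable names are our own.

```python
# Base year and current year prices of the six commodities in the example.
base_prices = [60, 50, 60, 50, 25, 20]
current_prices = [80, 60, 72, 75, 37.5, 30]

# Price relative of each commodity: (P1 / P0) x 100.
relatives = [p1 * 100 / p0 for p0, p1 in zip(base_prices, current_prices)]

# Simple (unweighted) arithmetic mean of the price relatives.
index = sum(relatives) / len(relatives)

print([round(rel, 2) for rel in relatives])
print(round(index, 2))
```

The sum of the relatives is 823.33 and the index works out to 137.22, as in the table above.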
An un-weighted aggregate price index represents the changes in prices, over time, for an
entire group of commodities. However, an un-weighted aggregate price index has two
shortcomings. First, this index treats each commodity in the group as equally
important, so the commodities that are most expensive per unit are overly influential. Second,
not all the commodities are consumed at the same rate. In an un-weighted index, changes
in the price of the least consumed commodities are overly influential.
(b) i. Weighted Aggregative Indices

Due to the shortcomings of un-weighted aggregate price indices, weighted aggregate price
indices are generally preferable. Weighted aggregate price indices account for differences in the
magnitude of prices per unit and differences in the consumption levels of the items in the market
basket.
When the commodities are not all of equal importance, this method is used. Here we assign a weight
to each commodity according to its importance, and an index number computed with these weights is
called a weighted index number.
b. i. (i) Laspeyres' Method
In this index number the base year quantities are used as weights, so it is also called the base year
weighted index.
$$P_{01} = \frac{\sum P_1 Q_0}{\sum P_0 Q_0} \times 100$$
where P0 and P1 are the base year and current year prices and Q0 is the base year quantity.
The primary disadvantage of the Laspeyres Method is that it does not take into consideration
changes in the consumption pattern. The Laspeyres Index has an upward bias. When prices increase, there
is a tendency to reduce the consumption of higher priced items. Similarly when prices decline,
consumers shift their purchases to those items whose prices decline the most.
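b. i. (ii) Paasche's Method
Under this method the weights are the quantities in the given (current) year.
$$P_{01} = \frac{\sum P_1 Q_1}{\sum P_0 Q_1} \times 100$$
where P0 and P1 are the base year and current year prices and Q1 is the current year quantity.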
The Paasche price index uses the consumption quantities in the year of interest instead of using
the initial quantities. Thus, the Paasche index is a more accurate reflection of total consumption
costs at that point in time. However, there are two major drawbacks of the Paasche index. First,
accurate consumption values for current purchases are often difficult to obtain. Thus, many
important indices, such as the consumer price index (CPI), use the Laspeyres method. Second, if
a particular product increases greatly in price compared to the other items in the market basket,
consumers will avoid the high-priced item out of necessity, not because of changes in what they
might prefer to purchase.
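b. i. (iii) Dorbish and Bowley's Method
Dorbish and Bowley have suggested the simple arithmetic mean of the two indices (Laspeyres and
Paasche) mentioned above so as to take into account the influence of both the periods, i.e., current
as well as base periods. The formula for constructing the index is:
$$P_{01} = \frac{L + P}{2}$$
Where L = Laspeyres Index
P = Paasche's Index
OR it may be written as
$$P_{01} = \frac{1}{2}\left[\frac{\sum P_1 Q_0}{\sum P_0 Q_0} + \frac{\sum P_1 Q_1}{\sum P_0 Q_1}\right] \times 100$$
where P and Q denote prices and quantities and the subscripts 0 and 1 denote the base and current years.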
b.i. (iv) Fisher's Ideal Index
The geometric mean of the Laspeyres and Paasche price indices is called Fisher's price index.
Fisher's price index uses both current year and base year quantities as weights. This index corrects
the positive bias inherent in the Laspeyres index and the negative bias inherent in the Paasche
index. Fisher's price index is also a weighted aggregative price index because it is an average
(G.M.) of two weighted aggregative indices. The computational formula for the Fisher ideal price
index is:
$$P_{01} = \sqrt{L \times P}$$
OR
$$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100$$
Fisher's Index is known as ideal because (1) it is based on the geometric mean, which
is considered to be the best average for constructing index numbers, (2) it takes into account
both current as well as base year prices and quantities, (3) it satisfies both the time reversal and
the factor reversal tests (which we will study soon) and (4) it is free from bias.
It is not, however, a practical index to compute because it is excessively laborious, and
the data, particularly for the Paasche segment of the index, are not readily available.
b.i. (v) Marshall-Edgeworth Method
If the weights are taken as the arithmetic mean of base and current year quantities, then the
weighted aggregative index is called the Marshall-Edgeworth index. Like Fisher's index, the
Marshall-Edgeworth index also requires too much labour in the selection of commodities. In some cases the
usage of this index is not suitable, for example the comparison of the price level of a large
country to a small country. The Marshall-Edgeworth index can be calculated by using the formula
given below.
$$P_{01} = \frac{\sum p_1 (q_0 + q_1)}{\sum p_0 (q_0 + q_1)} \times 100 = \frac{\sum p_1 q_0 + \sum p_1 q_1}{\sum p_0 q_0 + \sum p_0 q_1} \times 100$$
It is a simple, readily constructed measure, giving a very close approximation to the
results obtained by the ideal formula.
Like the Fisher 'Ideal' index, it is impracticable to use as a
timely indicator of price change because it requires the use of quantities purchased in the current
period. In practice, the Marshall-Edgeworth index and the Fisher Ideal index give similar
results.
b.i. (vi) Kelley's Method
According to Truman L. Kelley, the formula for constructing index numbers is
$$P_{01} = \frac{\sum p_1 q}{\sum p_0 q} \times 100$$
Where q refers to the quantities of some period, not necessarily the base year or the current year.
Example 1
From the following data calculate Price Index Numbers for 2013 with 2000 as base year by using
(i) Laspeyres Method (ii) Paasche's Method (iii) Dorbish & Bowley's Method (iv) Fisher's Ideal
Index (v) Marshall-Edgeworth Method

                   2000                 2013
Commodity     Price   Quantity     Price   Quantity
A               20        8          40        6
B               50       10          60        5
C               40       15          50       15
D               20       20          20       25

Solution
Let us first compute the necessary values.
(i) Laspeyres Method

               2000          2013
Commodity    p0    q0      p1    q1     p1q0    p0q0
A            20     8      40     6      320     160
B            50    10      60     5      600     500
C            40    15      50    15      750     600
D            20    20      20    25      400     400
Total                                   2070    1660

$$P_{01} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100 = \frac{2070}{1660} \times 100 = 124.70$$
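(ii) Paasche's Method

               2000          2013
Commodity    p0    q0      p1    q1     p1q1    p0q1
A            20     8      40     6      240     120
B            50    10      60     5      300     250
C            40    15      50    15      750     600
D            20    20      20    25      500     500
Total                                   1790    1470

$$P_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100 = \frac{1790}{1470} \times 100 = 121.77$$
(iii) Dorbish & Bowley's Method
$$P_{01} = \frac{L + P}{2} = \frac{124.70 + 121.77}{2} = \frac{246.47}{2} \approx 123.23$$
(iv) Fisher's Ideal Index
$$P_{01} = \sqrt{L \times P} = \sqrt{124.70 \times 121.77} = \sqrt{15184.72} = 123.23$$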
(v) Marshall-Edgeworth Method

               2000          2013
Commodity    p0    q0      p1    q1     p1q1    p0q1    p1q0    p0q0
A            20     8      40     6      240     120     320     160
B            50    10      60     5      300     250     600     500
C            40    15      50    15      750     600     750     600
D            20    20      20    25      500     500     400     400
Total                                   1790    1470    2070    1660

$$P_{01} = \frac{\sum p_1 q_0 + \sum p_1 q_1}{\sum p_0 q_0 + \sum p_0 q_1} \times 100 = \frac{2070 + 1790}{1660 + 1470} \times 100 = \frac{3860}{3130} \times 100 = 1.233226837 \times 100 = 123.32$$
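The five results of Example 1 can be cross-checked with a few lines of Python. This is only a sketch for verification: the variable and function names (`data`, `aggregate`, and so on) are illustrative and are not part of the text, and the data are the Example 1 figures with 2000 as the base year.

```python
# Example 1 data: commodity -> (p0, q0, p1, q1); 2000 is the base year, 2013 the current year.
data = {
    "A": (20, 8, 40, 6),
    "B": (50, 10, 60, 5),
    "C": (40, 15, 50, 15),
    "D": (20, 20, 20, 25),
}


def aggregate(rows, price_col, qty_col):
    """Return the sum of price * quantity over all commodities."""
    return sum(row[price_col] * row[qty_col] for row in rows)


rows = list(data.values())
sum_p0q0 = aggregate(rows, 0, 1)
sum_p0q1 = aggregate(rows, 0, 3)
sum_p1q0 = aggregate(rows, 2, 1)
sum_p1q1 = aggregate(rows, 2, 3)

laspeyres = sum_p1q0 / sum_p0q0 * 100           # base year quantities as weights
paasche = sum_p1q1 / sum_p0q1 * 100             # current year quantities as weights
dorbish_bowley = (laspeyres + paasche) / 2      # arithmetic mean of L and P
fisher = (laspeyres * paasche) ** 0.5           # geometric mean of L and P
marshall_edgeworth = (sum_p1q0 + sum_p1q1) / (sum_p0q0 + sum_p0q1) * 100

results = {
    "Laspeyres": laspeyres,
    "Paasche": paasche,
    "Dorbish-Bowley": dorbish_bowley,
    "Fisher": fisher,
    "Marshall-Edgeworth": marshall_edgeworth,
}
for name, value in results.items():
    print(f"{name:<20}{value:8.2f}")
```

Because Python keeps the unrounded Laspeyres and Paasche values, the last digit of the derived indices can differ slightly from a hand calculation that starts from figures already rounded to two decimals.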
Example 2
Compute the index number from the following data

                                               Price (Rs.)
Materials   Unit        Quantity required     2000     2010
Cement      100 lb      500 lb                 5.0      8.0
Timber      c.ft.       2000 c.ft.             9.5     14.2
Steel       cwt.        50 cwt.               34.0     42.0
Bricks      per '000    20000                 12.0     24.0

Solution
Since the quantities (weights) required of the different materials are fixed for both the base year and
the current year, we will use Kelley's formula.
For some materials we first have to express the quantity required in pricing units. For example, cement
is priced per 100 lb and the quantity required is 500 lb. Hence, the quantity for cement is 500/100 =
5 units. Similarly, bricks are priced per 1000, so the quantity for bricks is 20000/1000 = 20 units.
By Kelley's Method,
$$P_{01} = \frac{\sum p_1 q}{\sum p_0 q} \times 100$$
Let us make the necessary computations.

                                                    Price (Rs.)
Materials   Unit        Quantity required     q     2000 (p0)   2010 (p1)     p1q      p0q
Cement      100 lb      500 lb                5        5.0         8.0          40       25
Timber      c.ft.       2000 c.ft.         2000        9.5        14.2       28400    19000
Steel       cwt.        50 cwt.              50       34.0        42.0        2100     1700
Bricks      per '000    20000                20       12.0        24.0         480      240
Total                                                                        31020    20965

Substituting
$$P_{01} = \frac{31020}{20965} \times 100 = 1.4796 \times 100 = 147.96$$
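The unit conversion and the fixed-weight calculation of Example 2 can be sketched in Python as follows. The dictionary layout and the function name `kelley_index` are assumptions made for illustration, and the 2010 steel price is taken as 42.0, the value used in the computation table.

```python
# material -> (quantity required, size of one pricing unit, p0 in 2000, p1 in 2010)
materials = {
    "Cement": (500, 100, 5.0, 8.0),        # priced per 100 lb
    "Timber": (2000, 1, 9.5, 14.2),        # priced per c.ft.
    "Steel": (50, 1, 34.0, 42.0),          # priced per cwt.
    "Bricks": (20000, 1000, 12.0, 24.0),   # priced per 1000 bricks
}


def kelley_index(items):
    """Fixed-weight aggregative index: sum(p1 * q) / sum(p0 * q) * 100."""
    sum_p1q = 0.0
    sum_p0q = 0.0
    for required, unit_size, p0, p1 in items.values():
        q = required / unit_size           # quantity expressed in pricing units
        sum_p1q += p1 * q
        sum_p0q += p0 * q
    return sum_p1q / sum_p0q * 100


index_2010 = kelley_index(materials)
print(round(index_2010, 2))
```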
B. (II) WEIGHTED AVERAGE OF RELATIVES

I. Weighted Average of Price Relatives Method

In this method, appropriate weights are assigned to the commodities according to the
relative importance of those commodities in the group. The index for the whole group is then
obtained by taking the weighted average of the price relatives. To find the average, either the Arithmetic
Mean or the Geometric Mean can be used.
When AM is used, the index is
$$P_{01} = \frac{\sum PV}{\sum V}$$
Where
P = Price relative, i.e. $P = \frac{p_1}{p_0} \times 100$
V = Value weights, i.e. $V = p_0 q_0$
Example:
From the following data compute the price index by applying the weighted average of price relatives
method using Arithmetic Mean

Commodity    p0     q0         p1
Sugar        3.0    20 Kg.     4.0
Flour        1.5    40 Kg.     1.6
Milk         1.0    10 Lit.    1.5

By using Arithmetic Mean

Commodity    p0     q0         p1     V = p0q0    P = (p1/p0) × 100       PV
Sugar        3.0    20 Kg.     4.0       60       (4.0/3.0) × 100        8000
Flour        1.5    40 Kg.     1.6       60       (1.6/1.5) × 100        6400
Milk         1.0    10 Lit.    1.5       10       (1.5/1.0) × 100        1500
Total                                 ΣV = 130                      ΣPV = 15900

$$P_{01} = \frac{\sum PV}{\sum V} = \frac{15900}{130} = 122.31$$
Instead of Arithmetic Mean, we can use Geometric Mean.

When GM is used, the index is
$$P_{01} = \operatorname{antilog}\left(\frac{\sum V \log P}{\sum V}\right)$$
Where $P = \frac{p_1}{p_0} \times 100$ and
V = Value of weight
The above example can be reworked using GM as follows.
By using Geometric Mean

Commodity    p0     q0         p1     V        P       Log P     V Log P
Sugar        3.0    20 Kg.     4.0    60     133.3     2.1249    127.494
Flour        1.5    40 Kg.     1.6    60     106.7     2.0282    121.692
Milk         1.0    10 Lit.    1.5    10     150.0     2.1761     21.761
Total                             ΣV = 130                       270.947

$$P_{01} = \operatorname{antilog}\left(\frac{270.947}{130}\right) = \operatorname{antilog}(2.0842) = 121.4$$
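Both versions of the weighted average of price relatives can be checked numerically. The sketch below uses the sugar, flour and milk data of the example; the function name `price_relative_indices` is illustrative, and `math.log10` stands in for the log tables used in the hand calculation, so the geometric-mean result may differ from a table-based answer in the last digit.

```python
import math

# commodity -> (p0, q0, p1)
items = {
    "Sugar": (3.0, 20, 4.0),
    "Flour": (1.5, 40, 1.6),
    "Milk": (1.0, 10, 1.5),
}


def price_relative_indices(items):
    """Return (AM index, GM index) of price relatives weighted by base year values."""
    relatives = [p1 / p0 * 100 for p0, q0, p1 in items.values()]   # P
    weights = [p0 * q0 for p0, q0, p1 in items.values()]           # V = p0 * q0
    total_weight = sum(weights)
    am = sum(p * v for p, v in zip(relatives, weights)) / total_weight
    mean_log = sum(v * math.log10(p) for p, v in zip(relatives, weights)) / total_weight
    gm = 10 ** mean_log                                            # antilog
    return am, gm


am_index, gm_index = price_relative_indices(items)
print(round(am_index, 2), round(gm_index, 2))
```

The geometric-mean index comes out below the arithmetic-mean index, as expected for any set of unequal relatives.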
Merits of Weighted Average of Relative Indices
When different index numbers are constructed by the average of relatives method, all of
which have the same base, they can be combined to form a new index.
When an index is computed by selecting one item from each of the many subgroups of
items, the values of each subgroup may be used as weights. In that case only the method of
weighted average of relatives is appropriate.
When a new commodity is introduced to replace the one formerly used, the relative for
the new item may be spliced to the relative for the old one, using the former value weights.
The price or quantity relative for each single item in the aggregate is, in effect, itself
a simple index that often yields valuable information for analysis.
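TESTS OF INDEX NUMBERS

The following are the most important tests through which one can test the consistency of
index numbers.
1. The Time Reversal Test
2. The Factor Reversal Test
1. The Time Reversal Test
An index number formula satisfies this test if
$$P_{01} \times P_{10} = 1$$
Where P01 is the price index for year 1 with year 0 as base year and P10 is the price index for
year 0 with year 1 as base (both taken without the factor 100).
This test is satisfied by neither the Laspeyres nor the Paasche index numbers.
Laspeyres Method:
$$P_{01} \times P_{10} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_0 q_1}{\sum p_1 q_1} \neq 1$$
Paasche's Method:
$$P_{01} \times P_{10} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times \frac{\sum p_0 q_0}{\sum p_1 q_0} \neq 1$$
Fisher's formula satisfies this test.
Fisher's Method:
$$P_{01} \times P_{10} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times \sqrt{\frac{\sum p_0 q_1}{\sum p_1 q_1} \times \frac{\sum p_0 q_0}{\sum p_1 q_0}} = 1$$
2. The Factor Reversal Test
An index number formula satisfies this test if
$$P_{01} \times Q_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_0}$$
Where P01 stands for the price index for year 1 with base year 0 and Q01 stands for the
quantity index for year 1 with base year 0, obtained by interchanging prices and quantities in the
formula (both taken without the factor 100).
This test is satisfied by neither the Laspeyres nor the Paasche index numbers.
Laspeyres Formula:
$$P_{01} \times Q_{01} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum q_1 p_0}{\sum q_0 p_0} \neq \frac{\sum p_1 q_1}{\sum p_0 q_0}$$
Paasche's Formula:
$$P_{01} \times Q_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times \frac{\sum q_1 p_1}{\sum q_0 p_1} \neq \frac{\sum p_1 q_1}{\sum p_0 q_0}$$
Fisher's formula satisfies this test.
Fisher's Formula:
$$P_{01} \times Q_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}} = \frac{\sum p_1 q_1}{\sum p_0 q_0}$$
Fisher's formula satisfies both the time reversal and the factor reversal tests. This is why
Fisher's formula is often called Fisher's Ideal Index Number.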
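Example
For the following data prove that Fisher's Ideal Index satisfies both the Time Reversal Test
and the Factor Reversal Test.

                 Base Year            Current Year
Commodity     Price   Quantity      Price   Quantity
A               6        50           10       56
B               2       100            2      120
C               4        60            6       60
D              10        30           12       24

Solution

Commodity    p0     q0     p1     q1     p0q0    p0q1    p1q0    p1q1
A             6     50     10     56      300     336     500     560
B             2    100      2    120      200     240     200     240
C             4     60      6     60      240     240     360     360
D            10     30     12     24      300     240     360     288
Total                                    1040    1056    1420    1448

Fisher's price index number is given by
$$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100$$
Substituting the values we get
$$P_{01} = \sqrt{\frac{1420}{1040} \times \frac{1448}{1056}} \times 100 = 1.3683 \times 100 = 136.83$$
Time Reversal Test:
We have P01 = 1.3683 (without factor 100)
And
$$P_{10} = \sqrt{\frac{\sum p_0 q_1}{\sum p_1 q_1} \times \frac{\sum p_0 q_0}{\sum p_1 q_0}}$$
Substituting
$$P_{10} = \sqrt{\frac{1056}{1448} \times \frac{1040}{1420}} = 0.7308$$
$$P_{01} \times P_{10} = 1.3683 \times 0.7308 = 0.9999 \approx 1$$
Hence, Fisher's index satisfies the Time Reversal Test.
Factor Reversal Test:
We have (without factor 100)
$$Q_{01} = \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}} = \sqrt{\frac{1056}{1040} \times \frac{1448}{1420}}$$
$$P_{01} \times Q_{01} = \sqrt{\frac{1420}{1040} \times \frac{1448}{1056} \times \frac{1056}{1040} \times \frac{1448}{1420}} = \sqrt{\left(\frac{1448}{1040}\right)^2} = \frac{1448}{1040} = \frac{\sum p_1 q_1}{\sum p_0 q_0}$$
Hence, Fisher's index satisfies the Factor Reversal Test.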
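The two tests can also be verified mechanically. The following Python sketch applies them to the Laspeyres, Paasche and Fisher formulas using the data of this example; the helper names (`swap_time`, `swap_factors`, `passes_time_reversal`, and so on) are illustrative choices, and all indices are computed as plain ratios without the factor 100.

```python
from math import isclose, sqrt

# Each row is (p0, q0, p1, q1) for commodities A to D.
rows = [(6, 50, 10, 56), (2, 100, 2, 120), (4, 60, 6, 60), (10, 30, 12, 24)]


def agg(data, i, j):
    """Sum of column i times column j over all commodities."""
    return sum(r[i] * r[j] for r in data)


def laspeyres(data):
    return agg(data, 2, 1) / agg(data, 0, 1)   # sum(p1*q0) / sum(p0*q0)


def paasche(data):
    return agg(data, 2, 3) / agg(data, 0, 3)   # sum(p1*q1) / sum(p0*q1)


def fisher(data):
    return sqrt(laspeyres(data) * paasche(data))


def swap_time(data):
    """Interchange the base period and the current period."""
    return [(p1, q1, p0, q0) for p0, q0, p1, q1 in data]


def swap_factors(data):
    """Interchange prices and quantities, turning a price index into a quantity index."""
    return [(q0, p0, q1, p1) for p0, q0, p1, q1 in data]


value_ratio = agg(rows, 2, 3) / agg(rows, 0, 1)   # sum(p1*q1) / sum(p0*q0)


def passes_time_reversal(index):
    return isclose(index(rows) * index(swap_time(rows)), 1.0)


def passes_factor_reversal(index):
    return isclose(index(rows) * index(swap_factors(rows)), value_ratio)


for formula in (laspeyres, paasche, fisher):
    print(formula.__name__, passes_time_reversal(formula), passes_factor_reversal(formula))
```

Only the Fisher formula passes both checks; the Laspeyres and Paasche products miss the target by a fraction of a per cent on these data, which is easy to overlook in hand calculations.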
BASE SHIFTING, SPLICING AND DEFLATING THE INDEX NUMBERS
(a) Base shifting

Most index numbers are subject to revision from time to time for various reasons. In most
cases it becomes necessary to change the base year because numerous changes take place with
the passage of time. For example, changes may happen due to the disappearance of old items, the
inclusion of new ones, changes in the weights of commodities or changes in conditions, habits and
standard of living.
One of the most frequent operations necessary in the use of index numbers is changing the base
of an index from one period to another without recompiling the entire series. Such a change is
referred to as base shifting. The reasons for shifting the base are:
If the previous base has become too old and is almost useless for purposes of comparison.
If the comparison is to be made with another series of index numbers having a different base.
The following formula is used for base shifting:
$$\text{Index number based on new base year} = \frac{\text{current year's old index number}}{\text{new base year's old index number}} \times 100$$
Shifting from one fixed base to another fixed base
To convert a fixed base series to a new fixed base, each old index is divided by the old index of the
new base year and multiplied by 100. This can be illustrated with the help of the following problem.
Example:
The following series is given to the base year 2000. Convert it into a new series with base
year 2003.

Year     2000   2001   2002   2003   2004   2005
Index     100    130    145    155    205    255

            Fixed Base Index
Year     Base = 2000     Base = 2003
2000         100         (100/155) × 100 = 64.52
2001         130         (130/155) × 100 = 83.87
2002         145         (145/155) × 100 = 93.55
2003         155         (155/155) × 100 = 100.00
2004         205         (205/155) × 100 = 132.26
2005         255         (255/155) × 100 = 164.52
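The base-shifting rule is a single division per year, as the short Python sketch below shows for the series of this example. The function name `shift_base` and the dictionary representation are illustrative assumptions.

```python
def shift_base(series, new_base_year):
    """Re-express a fixed base index series on a new base year (new base = 100)."""
    base_value = series[new_base_year]
    return {year: value / base_value * 100 for year, value in series.items()}


old_series = {2000: 100, 2001: 130, 2002: 145, 2003: 155, 2004: 205, 2005: 255}
new_series = shift_base(old_series, 2003)

for year, value in new_series.items():
    print(year, round(value, 2))
```

Base shifting leaves the ratio between any two years unchanged, which is why it can be done without recompiling the series from the original price data.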
Shifting from chain base to fixed base

One of the disadvantages of the chain base method is that the comparison between distant periods is
not immediately evident. Therefore it becomes necessary to convert chain base indices into fixed
base indices. Each fixed base index is obtained by multiplying the fixed base index of the previous
year by the chain index of the current year and dividing by 100. This can be illustrated with the
help of the following example.
Example:
Convert the following chain indexes into a new series with base year 2005.

Year     2005   2006   2007   2008   2009   2010
Index     100    105    110    107    112    107

Year     Chain Base Index     Fixed Index (2005 = 100)
2005          100             100
2006          105             100 × 105/100 = 105
2007          110             105 × 110/100 = 115.5
2008          107             115.5 × 107/100 = 123.59
2009          112             123.59 × 112/100 = 138.42
2010          107             138.42 × 107/100 = 148.10
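The chaining step above amounts to a running product. The Python sketch below carries it out for this example and also includes the reverse conversion (fixed base to chain base, i.e. each year's index divided by the previous year's), so that the two operations can be checked against each other. The function names are illustrative, and the lists are assumed to be in time order with the first period as the base.

```python
def chain_to_fixed(chain):
    """Convert chain base indices to fixed base indices (first period = 100)."""
    fixed = [100.0]
    for link in chain[1:]:
        fixed.append(fixed[-1] * link / 100)   # previous fixed index times current link
    return fixed


def fixed_to_chain(fixed):
    """Convert fixed base indices to chain base indices (first period = 100)."""
    chain = [100.0]
    for previous, current in zip(fixed, fixed[1:]):
        chain.append(current / previous * 100)
    return chain


chain_indices = [100, 105, 110, 107, 112, 107]       # 2005 to 2010
fixed_indices = chain_to_fixed(chain_indices)
print([round(x, 2) for x in fixed_indices])

round_trip = fixed_to_chain(fixed_indices)            # should reproduce the chain indices
print([round(x, 2) for x in round_trip])
```

The code keeps unrounded intermediate values, so a result may differ by 0.01 from a hand calculation that rounds at every step.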
Shifting from fixed base to chain base

As discussed earlier, conditions change over a period due to a revised weighting system, the inclusion
of new items, the disappearance of old ones etc. Due to all these factors, it is sometimes
necessary to convert the indices from fixed base to chain base. Each chain index is obtained by
dividing the fixed base index of the current year by the fixed base index of the previous year and
multiplying by 100. This can be explained with the help of the following problem.
Problem: Convert the following indexes with base 2005 to chain indexes.

Year           2005   2006   2007   2008   2009   2010
Fixed Index     100    105    115    130    150    175

Year     Fixed Index (Base = 2005)     Chain Base Index
2005              100                  100
2006              105                  (105/100) × 100 = 105
2007              115                  (115/105) × 100 = 109.52
2008              130                  (130/115) × 100 = 113.04
2009              150                  (150/130) × 100 = 115.38
2010              175                  (175/150) × 100 = 116.67

Splicing of two series of index numbers
Splicing of index numbers means combining two or more series of overlapping index numbers to
obtain a single series of index numbers on a common base. This is done by the same technique as used in
base shifting.
It is of two types: (i) Splicing of new index numbers to old index numbers
(ii) Splicing of old index numbers to new index numbers.
Splicing of index numbers can be done only if the index numbers are constructed with the same
items and have an overlapping year. Suppose we have an index number with a base year of 2001
and another index number (using the same items as the first one) with a base of 2011. Suppose
both index numbers are continuing. Then we can splice the first series of index numbers to the
second series and have a common index with base 2011. We can also splice index number series
two to series one and have a common index number with base 2001. Splicing is generally
done when an old index number with an old base is being discontinued and a new index with a
new base is being started.
The following formula is used for splicing:
$$\text{Index number after splicing} = \frac{\text{index number to be spliced} \times \text{old index number of existing base}}{100}$$
Example
Index Number A given below was started in 1981 and discontinued in 2001, when another index
B was started which continues up to date. From the data given in the table below splice the index
number B to index number A so that a continuous series of index numbers from 1981 up to date
is available.
Splicing of Index B to Index A
Here we multiply index B by a common factor, which is the ratio of index A to index B in
the overlapping year 2001, i.e. 200/100.

Year     Index A     Index B     Index B Spliced to A
1981       100          -                 -
.
.
.
2000       180          -                 -
2001       200         100       (200/100) × 100 = 200
2002        -          120       (200/100) × 120 = 240
2003        -          140       (200/100) × 140 = 280
.
.
2013        -          250       (200/100) × 250 = 500

Thus we have a continuous series of index numbers with base 1981 which continues up to date.
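The splicing calculation can be sketched in Python as follows. The function name `splice_to_old_base` is an illustrative choice, and only the years for which the example gives Index B values are included.

```python
def splice_to_old_base(new_series, overlap_year, old_index_at_overlap):
    """Express the new series on the base of the old series.

    Each new index is multiplied by the old index of the overlapping year
    and divided by the new index of that year (100 when it is the new base).
    """
    factor = old_index_at_overlap / new_series[overlap_year]
    return {year: value * factor for year, value in new_series.items()}


index_b = {2001: 100, 2002: 120, 2003: 140, 2013: 250}   # Index B, base 2001
spliced = splice_to_old_base(index_b, overlap_year=2001, old_index_at_overlap=200)
print(spliced)
```

Splicing in the other direction (old series expressed on the new base) uses the reciprocal factor, here 100/200, applied to Index A.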
DEFLATING THE INDEX NUMBERS
By deflating we mean making allowances for the effect of changing price levels. A rise in the price
level means a reduction in the purchasing power of money. To take the case of a single
commodity, suppose the price of wheat rises from Rs. 500 per quintal in 1999 to Rs. 1,000 per
quintal in 2009. It means that in 2009 a person can buy only half as much wheat if he spends the same
amount which he was spending on wheat in 1999. Thus the value (or purchasing power) of a
rupee is simply the reciprocal of an appropriate price index written as a proportion. If prices
increase by 60 per cent, the price index is 1.60 and what a rupee will buy is only 1/1.60 or 5/8 of
what it used to buy. In other words, the purchasing power of the rupee is 5/8 of what it was.
Similarly, if prices increase by 25 per cent, the price index is 1.25 (125 per cent) and the
purchasing power of the rupee is 1/1.25 = 0.80.
$$\text{Thus the purchasing power of money} = \frac{1}{\text{price index}}$$
In times of rising prices the money wages should be deflated by the price index to get the
figure of real wages. The real wages alone tell whether a wage earner is in a better or a
worse position.
For calculating the real wage, the money wage or income is divided by the corresponding
price index and multiplied by 100.
$$\text{i.e. Real wages} = \frac{\text{Money wages}}{\text{Price index}} \times 100$$
$$\text{Thus Real Wage Index} = \frac{\text{Real wage of current year}}{\text{Real wage of base year}} \times 100$$
Example
The annual wages of workers (in Rs.) are given along with the Consumer Price Indices.
Find (i) the real wages and (ii) the real wage indices.

Year                        2010    2011    2012    2013
Wages                       1800    2200    3400    3600
Consumer Price Indices       100     170     300     320

Year    Wage    Price Index    Real Wage                         Real Wage Indices (2010 = 100)
2010    1800       100         (1800/100) × 100 = 1800           100
2011    2200       170         (2200/170) × 100 = 1294.12        (1294.12/1800) × 100 = 71.90
2012    3400       300         (3400/300) × 100 = 1133.33        (1133.33/1800) × 100 = 62.96
2013    3600       320         (3600/320) × 100 = 1125           (1125/1800) × 100 = 62.50
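The deflation of money wages can be reproduced with a short Python sketch using the figures of this example. The dictionary names are illustrative; the calculation is simply money wage divided by price index times 100, followed by a comparison with the base year's real wage.

```python
wages = {2010: 1800, 2011: 2200, 2012: 3400, 2013: 3600}   # money wages in Rs.
cpi = {2010: 100, 2011: 170, 2012: 300, 2013: 320}         # consumer price index

# Real wage = money wage / price index * 100
real_wage = {year: wages[year] / cpi[year] * 100 for year in wages}

# Real wage index with 2010 = 100
base_real_wage = real_wage[2010]
real_wage_index = {year: real_wage[year] / base_real_wage * 100 for year in wages}

for year in wages:
    print(year, round(real_wage[year], 2), round(real_wage_index[year], 2))
```

Although money wages double between 2010 and 2013, the real wage index falls to about 62.5, so the wage earner is worse off in real terms.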
SPECIAL PURPOSE INDICES

Price Index: The price index is an indicator of the average price movement over time of a fixed
basket of goods and services. The basket of goods and services is constituted
taking into consideration whether the changes are to be measured in retail, wholesale, or
producer prices etc. The basket will also vary for economy-wide, regional, or sector specific
series. At present, separate series of index numbers are compiled to capture the price movements
at retail and wholesale level in India. There are four main series of price indices compiled at the
national level. Out of these four, Consumer Price Index for Industrial Workers (CPI-IW),
Consumer Price Index for Agricultural Labourers / Rural Labourers (CPI -AL/RL), Consumer
Price Index for Urban Non-Manual Employees (CPI-UNME) are consumer price indices. The
Wholesale Price Index (WPI) number is a weekly measure of wholesale price movement for the
economy. Some states also compile variants of CPI and WPI indices at the state level.
1. Wholesale Price Index

The wholesale price index numbers indicate the general condition of the national economy. They
measure the change in prices of products produced by different sectors of an economy. The
wholesale prices of major items manufactured or produced are included in the construction of
these index numbers.
Wholesale Price Index (WPI) represents the price of goods at a wholesale stage i.e. goods that
are sold in bulk and traded between organizations instead of consumers. WPI is used as a
measure of inflation in some economies.
Uses
In a dynamic world, prices do not remain constant. Inflation rate calculated on the basis of the
movement of the Wholesale Price Index (WPI) is an important measure to monitor the dynamic
movement of prices. As WPI captures price movements in a most comprehensive way, it is
widely used by Government, banks, industry and business circles. Important monetary and fiscal
policy changes are often linked to WPI movements. Similarly, the movement of WPI serves as
an important determinant, in formulation of trade, fiscal and other economic policies by the
Government of India. The WPI indices are also used for the purpose of escalation clauses in the
supply of raw materials, machinery and construction work.
WPI is used as an important measure of inflation in India. Fiscal and monetary policy changes
are greatly influenced by changes in WPI.
WPI is an easy and convenient method to calculate inflation. The inflation rate is based on the
difference between the WPI at the beginning and at the end of a year: the percentage increase in WPI
over a year gives the rate of inflation for that year.
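As a minimal sketch of this calculation, the Python function below computes the inflation rate as the percentage change in an index between two dates. The WPI values used are hypothetical figures chosen only to illustrate the arithmetic; they are not actual published data.

```python
def inflation_rate(index_start, index_end):
    """Percentage change in a price index between two dates."""
    return (index_end - index_start) / index_start * 100


# Hypothetical WPI values at the beginning and at the end of a year.
wpi_start = 150.0
wpi_end = 159.0

rate = inflation_rate(wpi_start, wpi_end)
print(f"Inflation rate: {rate:.1f} per cent")
```

Note that the difference in index points (here 9 points) is not itself the inflation rate; it has to be divided by the starting level of the index.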
WPI computation in India
WPI is the most widely used inflation indicator in India. This is published by the Office of
Economic Adviser, Ministry of Commerce and Industry. WPI captures price movements in a
most comprehensive way. It is widely used by Government, banks, industry and business
circles. Important monetary and fiscal policy changes are linked to WPI movements. It is in use
since 1939 and is being published since 1947 regularly. We are well aware that with the
changing times, the economies too undergo structural changes. Thus, there is a need for
revisiting such indices from time to time and new set of articles / commodities are required to be
included based on current economic scenarios. Thus, since 1939, the base year of WPI has been
revised on number of occasions. The current series of Wholesale Price Index has 2004-05 as
the base year.
Wholesale price index comprises as far as possible all transactions at first point of bulk sale in
the domestic market. Provisional monthly WPI for All Commodities is released on 14th of every
month (next working day, if 14th is holiday). Detailed item level WPI is put on official website
(http://www.eaindustry.nic.in/) for public use. The provisional index is made final after a period
of eight weeks/ two months.
The Office of the Economic Adviser to the Government of India undertook to publish for the
first time, an index number of wholesale prices, with base week ended August 19, 1939 = 100,
from the week commencing January 10, 1942. The index was calculated as the geometric mean
of the price relatives of 23 commodities classified into four groups: (1) food & tobacco; (2)
agricultural commodities; (3) raw materials; and (4) manufactured articles. Each item was
assigned equal weight and for each item, there was a single price quotation. That was a modest
beginning to what became an important weekly activity for the monitoring and management of
the Indian economy and a benchmark for business transactions.
Steps in compilation of WPI in India
Like most of the price indices, WPI is based on the Laspeyres formula for reasons of practical
convenience. Therefore, once the concept of wholesale price is defined and the base year is
finalized, the exercise of index compilation involves finalization of the item basket and allocation of
weights (W) at item and group/sub-group level. Simultaneously, the exercise to collect base prices
(P0) and current prices (P1) and to finalize item specifications, price data sources and the data
collection machinery is undertaken. These steps are:
1. Definition of the Concept of Wholesale Prices:
Wholesale price has divergent connotations adopted by the different departments using them.
There is no uniform definition for agricultural and nonagricultural commodities as all the
wholesale prices cannot be collected from the established markets. So proper definition has to be
made by the competent authority.
For example in the case of agricultural commodities, in practice, there are three types of
wholesale markets viz., primary, secondary and terminal in the agricultural sector. The price
movements and price levels in all three vary. Price movement in the terminal market may tend to
converge toward the retail prices. The option to collect the wholesale prices for these three different
stages of wholesale transactions exists for agricultural commodities, though the primary market is
preferred. So, the Ministry of Agriculture has defined wholesale price as the rate at which
a relatively large transaction of purchase, usually for further sale, is effected.
Similarly, for non-agricultural commodities, which are predominantly manufacturing items, the
problem arises, as there are no established sources in markets. This is true of mining and fuel
items also. The issue of ex-factory vis-à-vis wholesale prices for non-agricultural items has been
discussed by the successive Working Groups set up for the revision of WPI and all have reached
the conclusion that in practice, it is not feasible to collect wholesale prices for most of the
manufacturing items. It has also been observed that the margin of wholesalers in the case of
non-agricultural commodities remains unchanged over a long period of time. As a result, it is felt
that the trends in the index compiled on the basis of ex-factory prices would not be much
different from the index compiled on the basis of wholesale prices, if it were feasible to get
these prices. The last Working Group has recommended collecting wholesale prices from the
markets as far as possible, because the economy is moving towards globalization and open
trade with inputs increasing in the commodities set.
2) Choice of Base Year
The second step is choice of base year. The well-known criteria for the selection of base year
are (i) a normal year i.e. a year in which there are no abnormalities in the level of production,
trade and in the price level and price variations, (ii) a year for which reliable production, price
and other required data are available and (iii) a year as recent as possible and comparable with
other data series at national and state level. The National Statistical Commission has
recommended that the base year should be revised every five years and not later than ten years.
3. Selection of Items, Varieties/ Grades, Markets:
To ensure that the items in the index basket are as best representatives as possible, efforts are
made to include all the important items transacted in the economy during the base year. The
importance of an item in the free market will depend on its traded value during the base year. At
wholesale level, bulk transactions of goods and services need to be captured. As the services are
not covered so far, the WPI basket mainly consists of items from goods sector. In the absence of
single source of data on traded value, the selection procedures followed for agricultural
commodities and non-agricultural commodities have also been different.
For example, in the case of agricultural commodities: As there is a little scope of emergence of
new commodities in the agriculture, the selection of new items in the basket is done on the basis
of increased importance in wholesale markets. Varieties, which have declined in importance,
need to be dropped in the revised series. Final inclusion or exclusion of an item in the basket is
based on the process of consultation with the various departments. The exercise of adding
/deleting commodities, specifications and markets is completed once the consultation process is
over. In the existing WPI series, items, their specifications and markets have been finalized in
consultation with the Directorate of E&S (M/O Agriculture), National Horticulture Board,
Spices Board, Tea Board, Coffee Board, Rubber Board, Silk Board, Directorate of Tobacco,
Cotton Corporation of India etc.
4. Derivation of Weighting Diagram
Weights used in the WPI are value weights, not quantity weights, as it is difficult to assign quantity
weights. Distribution of the appropriate weight to each of the items is the most important exercise for
a reliable index. Unlike consumer price indices, where weights are derived on the basis of the results
of Expenditure Surveys, several sources of data are used for the derivation of weights for WPI.
5) Collection of Prices
In WPI pricing methodology used is specification pricing. Under this, in consultation with the
identified source agencies, precise specifications of all items in the basket are defined for repeat
pricing every week. All characteristics like make, model, features along with the unit of sale,
type of packaging, if applicable, etc are recorded and printed in the price collection schedule. At
the time of scrutiny of price data all these are kept in mind. This pricing to constant quality
technique is the cornerstone of Laspeyres formula. In case of changes in quality and
specifications, due adjustments are made as per the standard procedures.
The collection of base prices is done concurrently while the work on finalisation of index basket
is on. Therefore, price collection is normally done for larger number of items pending
finalisation. Once the basket is ready, current prices are collected only as per the final basket
from the designated sources. Weekly prices need to be collected for pre-determined day of the
week. For the current series prices are quoted on the basis of the prevailing prices of every
Friday. Agricultural wholesale prices are for bulk transactions and include transport cost.
Non-agricultural prices are ex-mine or ex-factory inclusive of excise duty but exclusive of rebate if
any.
6) Treatment of prices collected from open market & administered prices:
There are some items which constitute part of index baskets but the prices for these items are
either totally administered by the Government or are under dual pricing policy. The issue of
using administered prices for index compilation is resolved by taking into account appropriate
ratio between the levy and non-levy portions. Where these ratios are not available, the issues can
be resolved through taking the appropriate number of price quotations of the administered prices
and the open market prices after periodic review.
Due to variation in quality and different price movements of the commodities belonging to
unorganized sector, separate quotations from organized and unorganized units have to be taken
and merged based on the turnover value of both the sectors at item level. For pricing from
unorganized sector, adequate number of price quotations has to be drawn out of the list of units
by criteria of share of production as far as possible.
7) Classification structure:
The Working Groups over the period have been suggesting to bring the classification of various
items under different groups and sub-groups as per the latest revised National Industrial
Classification (NIC) which in turn is comparable to International Standard Industrial
Classification (ISIC). The classification based on NIC renders the WPI data amenable to
comparison with the Index of Industrial Production (IIP) and National Income data.
Major Group/Groups: I. Primary Articles II. Fuel, Power, Light & Lubricants III. Manufactured
Products
8) Methodology of Index Calculation
Actual index compilation is done in stages.
In the first stage, once the price data are scrutinized, the price relative for each price quote is
calculated. The price relative is calculated as the ratio of the current price to the base price multiplied
by 100, i.e. (P1/P0) × 100.
In the next stage, the commodity/item level index is arrived at as the simple arithmetic average of
the price relatives of all the varieties (each quote) included under that commodity. A simple average of
price relatives is used under the implicit assumption that each price quotation collected for an
item/commodity index compilation has equal importance, i.e. the shares of production value are
equal.
Next, the indices for the sub-groups/groups/major groups are compiled, and the aggregation
method is based on the Laspeyres formula as below:
$$I = \frac{\sum (I_i \times W_i)}{\sum W_i}$$
Where,
I = Index number of wholesale prices of a sub-group/group/major group/all commodities
Σ = the summation operation,
Ii = Index of the ith item/sub-group/group/major group.
Wi = Weight assigned to the ith item/sub-group/group/major group.
The weights are value weights. Aggregation is first done at sub-group and group level. All
commodities index is compiled by aggregating Major group indices.
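The staged compilation described above can be sketched in Python. The price quotes, group indices and weights below are hypothetical numbers chosen only to illustrate the arithmetic; they are not actual WPI data, and the function names are illustrative.

```python
def item_index(quotes):
    """Item level index: simple average of price relatives (P1/P0) * 100.

    quotes is a list of (base price P0, current price P1) pairs, one per price quote.
    """
    relatives = [p1 / p0 * 100 for p0, p1 in quotes]
    return sum(relatives) / len(relatives)


def aggregate_index(indices, weights):
    """Weighted aggregation I = sum(Ii * Wi) / sum(Wi) over the listed components."""
    total_weight = sum(weights[name] for name in indices)
    return sum(indices[name] * weights[name] for name in indices) / total_weight


# Stage 1 and 2: two hypothetical price quotes for one item.
rice_index = item_index([(10.0, 12.0), (20.0, 22.0)])

# Stage 3: hypothetical major group indices and value weights.
group_indices = {
    "Primary Articles": 120.0,
    "Fuel, Power, Light & Lubricants": 150.0,
    "Manufactured Products": 110.0,
}
group_weights = {
    "Primary Articles": 20.0,
    "Fuel, Power, Light & Lubricants": 15.0,
    "Manufactured Products": 65.0,
}
all_commodities = aggregate_index(group_indices, group_weights)
print(round(rice_index, 2), round(all_commodities, 2))
```

The same `aggregate_index` step can be applied repeatedly, first from items to sub-groups, then to groups, major groups and finally all commodities.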
9) Handling of the Seasonal Commodities :
There are a number of agricultural items, especially some fruits and vegetables, which are of
a seasonal nature. When a particular seasonal item disappears from the market and its prices are
not available because it is out of season, the weight of such an item is imputed amongst the
other items on a pro rata basis within the sub-group of vegetables or fruits. The underlying
assumption is that if the item had remained available, its prices would have moved in
the same proportion as the prices of the other items in the sub-group, which did remain available.
This is equivalent to giving a greater weight to the remaining items. The seasonality problem can
also be handled by adopting other methods: (i) prices of unavailable items can be extrapolated
forward from the period of availability, or (ii) if such a seasonal item has an insignificant weight it can
be removed permanently from the basket.
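The pro rata imputation of weights can be illustrated with a small Python sketch. The items, weights and indices are hypothetical, and the function names are illustrative; the point is only that the weight of the out-of-season item is spread over the remaining items in proportion to their own weights, leaving the sub-group total unchanged.

```python
def impute_weights(weights, unavailable):
    """Distribute the weights of unavailable items pro rata over the available ones."""
    available = {item: w for item, w in weights.items() if item not in unavailable}
    scale = sum(weights.values()) / sum(available.values())
    return {item: w * scale for item, w in available.items()}


def weighted_index(indices, weights):
    """Weighted average of item indices: sum(Ii * Wi) / sum(Wi)."""
    return sum(indices[item] * weights[item] for item in weights) / sum(weights.values())


# Hypothetical fruit sub-group: mango is out of season this week.
weights = {"Mango": 2.0, "Banana": 3.0, "Apple": 5.0}
current_indices = {"Banana": 110.0, "Apple": 120.0}

adjusted = impute_weights(weights, unavailable={"Mango"})
subgroup_index = weighted_index(current_indices, adjusted)
print(adjusted, subgroup_index)
```

The resulting sub-group index is the same as the one obtained from the available items with their original weights, which is exactly the assumption that the missing item's price moved in line with the rest of the sub-group.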
2. Consumer Price Index Number

The Consumer Price Index (CPI) is a measure of the average change over time in the prices of
consumer items -goods and services that people buy for day-to-day living. The CPI is a complex
construct that combines economic theory with sampling and other statistical techniques and uses
data from several surveys to produce a timely and precise measure of average price change for
the consumption sector.
The Consumer Price Index is a comprehensive measure used for the estimation of price changes in a
basket of goods and services representative of consumption expenditure.
The calculation involved in the estimation of CPI is quite rigorous. Various categories and
sub-categories have been made for classifying consumption items and on the basis of consumer
categories like urban or rural. Based on these indices and sub indices obtained, the final overall
index of price is calculated mostly by national statistical agencies. It is one of the most important
statistics for an economy and is generally based on the weighted average of the prices of
commodities. It gives an idea of the cost of living.
Inflation is measured using CPI. The percentage change in this index over a period of time gives
the amount of inflation over that specific period, i.e. the increase in prices of a representative
basket of goods consumed.
The CPI frequently is called a cost-of-living index, but it differs in important ways from a
complete cost-of-living measure. A cost-of-living index would measure changes over time in the
amount that consumers need to spend to reach a certain utility level or standard of living. Both
the CPI and a cost-of-living index would reflect changes in the prices of goods and services, such
as food and clothing that are directly purchased in the marketplace; but a complete cost-of-living
index would go beyond this role to also take into account changes in other governmental or
environmental factors that affect consumers' well-being. It is very difficult to determine the
proper treatment of public goods, such as safety and education, and other broad concerns, such as
health, water quality, and crime, that would constitute a complete cost-of-living framework.
How do we read or interpret an index?
An index is a tool that simplifies the measurement of movements in a numerical series. Most of
the specific CPI indexes published by the U.S. Bureau of Labor Statistics have a 1982-84 reference
base. That is, the agency computing the index sets the average index level (representing the
average price level) for the 36-month period covering the years 1982, 1983, and 1984 equal to 100.
The agency then measures changes in
relation to that figure. An index of 110, for example, means there has been a 10-percent increase
in price since the reference period; similarly, an index of 90 means a 10-percent decrease.
Movements of the index from one date to another can be expressed as changes in index points
(simply, the difference between index levels), but it is more useful to express the movements as
percent changes. This is because index points are affected by the level of the index in relation to
its reference period, while percent changes are not.
          Year I     Year II    Change in       Percent change
                                index points
Item A    112.500    121.500     9.000           9.0/112.500 x 100 =  8.0
Item B    225.000    243.000    18.000          18.0/225.000 x 100 =  8.0
Item C    110.000    128.000    18.000          18.0/110.000 x 100 = 16.4
In the table above, Item A increased by half as many index points as Item B between Year I and
Year II. Yet, because of different starting indexes, both items had the same percent change; that
is, prices advanced at the same rate. By contrast, Items B and C show the same change in index
points, but the percent change is greater for Item C because of its lower starting index value.
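The comparison in the table can be reproduced with a short calculation. The sketch below uses the index levels from the table; the function names `index_point_change` and `percent_change` are chosen here for illustration only.

```python
# Index levels for Year I and Year II, taken from the table above.
index_levels = {
    "Item A": (112.5, 121.5),
    "Item B": (225.0, 243.0),
    "Item C": (110.0, 128.0),
}


def index_point_change(start, end):
    """Change expressed in index points (a simple difference)."""
    return end - start


def percent_change(start, end):
    """Change expressed as a percentage of the starting index level."""
    return (end - start) / start * 100


for item, (year1, year2) in index_levels.items():
    points = index_point_change(year1, year2)
    pct = percent_change(year1, year2)
    print(f"{item}: {points:.1f} index points, {pct:.1f} percent")
```

Items B and C both move by 18 index points, yet the percent change differs because the starting levels differ, which is why percent changes are the more useful basis for comparison.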
Uses of cost of living index numbers:
1. Cost of living index numbers indicate whether the real wages are rising or falling. In
other words they are used for calculating the real wages and to determine the change in
the purchasing power of money.
$$\text{Purchasing power of money} = \frac{1}{\text{Cost of living index number}}$$
$$\text{Real wages} = \frac{\text{Money wages}}{\text{Cost of living index number}} \times 100$$
2. Cost of living indices are used for the regulation of dearness allowance (D.A.) or the grant of
bonus to the workers so as to enable them to meet the increased cost of living.
3. Cost of living index numbers are used widely in wage negotiations.
4. These index numbers are also used for analyzing markets for particular kinds of goods.
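The first use listed above, deflating money wages by the cost of living index, can be sketched as follows. The wage and index figures are hypothetical and the function names are illustrative; only the two formulas come from the text.

```python
def real_wage(money_wage, cost_of_living_index):
    """Real wages = money wages / cost of living index number x 100."""
    return money_wage / cost_of_living_index * 100


def purchasing_power(cost_of_living_index):
    """Purchasing power of money = 1 / cost of living index number."""
    return 1 / cost_of_living_index


# Hypothetical worker: money wage of Rs. 15,000 when the index stands at 125
# (base period = 100).
wage = real_wage(15000, 125)
print(f"Real wage: Rs. {wage:.0f}")
print(f"Purchasing power of money: {purchasing_power(125):.4f}")
```

With these assumed figures the real wage is Rs. 12,000: although the worker receives Rs. 15,000 in money terms, it buys only what Rs. 12,000 bought in the base period.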
Main steps or problems in construction of cost of living index numbers
Production of the CPI requires the skills of many professionals, including economists,
statisticians, computer scientists, data collectors, and others.
A cost of living index number measures the changes in the level of prices of commodities
which directly affect the cost of living of a specified group of persons at a specified place. A
general index number fails to give an idea of the cost of living of different classes of people at
different places.
Different classes of people consume different types of commodities, and consumption
habits also vary from person to person, place to place and class to class, i.e. richer class, middle
class and poor class. For example, the cost of living of rickshaw pullers in Bhubaneswar (BBSR) is
different from that of rickshaw pullers in Kolkata. The consumer price index helps us in determining
the effect of rise and fall in prices on different classes of consumers living in different areas.
The following are the main steps in constructing a cost of living index number.
1. Decision about the class of people for whom the index is meant
It is absolutely essential to decide clearly the class of people for whom the index
is meant, i.e. whether it relates to industrial workers, teachers, officers, labourers, etc. Along
with the class of people, it is also necessary to decide the geographical area covered by the
index, such as a city, an industrial area or a particular locality in a city.
2. Conducting family budget enquiry
Once the scope of the index is clearly defined the next step is to conduct a sample
family budget enquiry i.e. we select a sample of families from the class of people for
whom the index is intended and scrutinize their budgets in detail. The enquiry should be
conducted during a normal period i.e. a period free from economic booms or depressions.
The purpose of the enquiry is to determine the amount an average family spends on
different items. The family budget enquiry gives information about the nature and quality
of the commodities consumed by the people. The commodities are classified under the
following heads:
i) Food ii) Clothing iii) Fuel and lighting iv) House rent v) Miscellaneous
3. Collecting retail prices of different commodities
The collection of retail prices is a very important and at the same time very
difficult task, because such prices may vary from place to place, shop to shop and person
to person. Price quotations should be obtained from the local markets where the class of
people reside, or from super bazaars or departmental stores from which they usually make
their purchases.
Method of Constructing the Index

The index may be constructed by applying any of the following methods :
1) Aggregate Expenditure Method or Aggregation Method
2) Family Budget Method or the Method of Weighted Relatives.
1. Aggregate Expenditure Method.
When this method is applied the quantities of commodities consumed by the particular group in
the base year are estimated and these figures are used as weights. Then the total expenditure on
each commodity for each year is calculated.
$$\text{Consumer price index} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$$
where $p_1$ and $p_0$ stand for the prices of the current year and the base year, and
$q_1$ and $q_0$ stand for the quantities of the current year and the base year.
Steps:
i) The prices of the commodities for the current year are multiplied by the quantities
of the base year and the aggregate expenditure of the current year is obtained, i.e. $\sum p_1 q_0$.
ii) Similarly obtain $\sum p_0 q_0$, the aggregate expenditure of the base year.
iii) The aggregate expenditure of the current year is divided by the aggregate expenditure of the
base year and the quotient is multiplied by 100.
2. Family Budget Method

When this method is applied, the family budgets of a large number of families are carefully
studied and the aggregate expenditure of the average family on various items is estimated. These
values are used as weights.
$$\text{Consumer price index} = \frac{\sum PV}{\sum V}$$
where $P = \frac{p_1}{p_0} \times 100$ for each item and $V = p_0 q_0$, the value in the base year.
Example
Construct the consumer price index number for 2013 on the basis of 2009 from the following
data using 1) the aggregate expenditure method and 2) the family budget method.
Commodity    Quantity in units    Price per unit    Price per unit
             in 2009              in 2009 (Rs.)     in 2013 (Rs.)
A            100                   8                12
B             25                   6                 7.50
C             10                   5                 5.25
D             20                  48                52
E             25                  15                16.50
F             30                   9                27
Solution
(1) Aggregate expenditure method
Formula for the aggregate expenditure method:
$$\text{Consumer price index} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$$
Here p0 is the price per unit in 2009 (Rs.), p1 is the price per unit in 2013 (Rs.) and q0 is the
quantity in units in 2009.
Commodity    p0      p1       q0     p0q0     p1q0
A             8      12       100     800     1200
B             6       7.5      25     150      187.5
C             5       5.25     10      50       52.5
D            48      52        20     960     1040
E            15      16.5      25     375      412.5
F             9      27        30     270      810
Total                                 2605     3702.50
$$\text{Consumer price index} = \frac{3702.50}{2605} \times 100 = 142.13$$
(2) The family budget method
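$$\text{Consumer price index} = \frac{\sum PV}{\sum V}$$
where $P = \frac{p_1}{p_0} \times 100$ for each item and $V = p_0 q_0$, the value in the base year.
Here p0 is the price per unit in the base year (Rs.), p1 is the price per unit in 2013 (Rs.) and q0
is the quantity in units in 2009.
Commodity    p0      p1       q0     P         V        PV
A             8      12       100    150       800     120000
B             6       7.5      25    125       150      18750
C             5       5.25     10    105        50       5250
D            48      52        20    108.33    960     104000
E            15      16.5      25    110       375      41250
F             9      27        30    300       270      81000
Total                                898.33    2605    370250
$$\text{Consumer price index} = \frac{370250}{2605} = 142.13$$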
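Note: The aggregate expenditure method and the family budget method give the same answer.
This is because $PV = \frac{p_1}{p_0} \times 100 \times p_0 q_0 = 100 \, p_1 q_0$, so that
$\frac{\sum PV}{\sum V} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$.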
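The two methods in the example above can be checked with a short calculation. This is a minimal sketch using the six commodities of the example; the function and variable names are chosen here for illustration.

```python
# Each entry: commodity -> (base-year price p0, current-year price p1,
# base-year quantity q0), taken from the worked example above.
basket = {
    "A": (8, 12, 100),
    "B": (6, 7.5, 25),
    "C": (5, 5.25, 10),
    "D": (48, 52, 20),
    "E": (15, 16.5, 25),
    "F": (9, 27, 30),
}


def cpi_aggregate_expenditure(items):
    """Sum of p1*q0 divided by sum of p0*q0, times 100."""
    current = sum(p1 * q0 for p0, p1, q0 in items.values())
    base = sum(p0 * q0 for p0, p1, q0 in items.values())
    return current / base * 100


def cpi_family_budget(items):
    """Weighted average of price relatives P with base-year values V as weights."""
    weighted = sum((p1 / p0 * 100) * (p0 * q0) for p0, p1, q0 in items.values())
    weights = sum(p0 * q0 for p0, p1, q0 in items.values())
    return weighted / weights


print(round(cpi_aggregate_expenditure(basket), 2))
print(round(cpi_family_budget(basket), 2))
```

Both functions return 142.13 (to two decimals) for this basket, in line with the hand calculation.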
Given below is an example of Consumer Price Index for Kerala
Possible errors in construction of cost of living index numbers:

Cost of living index numbers, now more commonly called consumer price index numbers,
are not perfectly accurate, for various reasons.
1. Errors may occur in the construction because of inaccurate specification of groups for
whom the index is meant.
2. Faulty selection of representative commodities resulting from unscientific family budget
enquiries.
3. Inadequate and unrepresentative nature of price quotations and use of inaccurate weights.
4. Frequent changes in demand and prices of the commodities.
5. The average family might not be always a representative one.
Wholesale price index numbers vs. consumer price index numbers:
1. The wholesale price index number measures the change in the price level in a country as a
whole, for example the Economic Adviser's index number of wholesale prices,
whereas a cost of living index number measures the change in the cost of living
of a particular class of people stationed at a particular place. In the latter index number we take
the retail prices of the commodities.
2. The wholesale price index number and the consumer price index number are generally
different because there is a lag between the movement of wholesale prices and retail
prices.
3. The retail prices required for the construction of the consumer price index number may increase
much faster than wholesale prices, i.e. there might be erratic changes in the consumer
price index number, unlike the wholesale price index number.
4. The method of constructing index numbers is in general the same for wholesale prices and
the cost of living. However, the wholesale price index number is based on a different weighting
system, and the selection of commodities is also different compared with the cost of living index
number.
Limitations or demerits of index numbers:
Although index numbers are indispensable tools in economics, business, management
etc, they have their limitations and proper care should be taken while interpreting them. Some of
the limitations of index numbers are
1. Since index numbers are generally based on a sample, it is not possible to take into
account each and every item in the construction of index.
2. At each stage of the construction of index numbers, starting from selection of
commodities to the choice of formulae there is a chance of the error being introduced.
3. Index numbers are also a special type of average. Since the various averages like the mean,
median and G.M. have their relative limitations, their use may also introduce some error.
4. None of the formulae for the construction of index numbers is exact; each contains the so-called
formula error. For example, Laspeyres' index number has an upward bias while
Paasche's index has a downward bias.
5. An index number is used to measure the change for a particular purpose only. Its misuse
for other purposes would lead to unreliable conclusions.
6. In the construction of price or quantity index numbers it may not be possible to retain the
uniform quality of commodities during the period of investigation.
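The formula error mentioned in point 4 can be illustrated with a small calculation. The sketch below uses the standard Laspeyres (base-year quantity weights) and Paasche (current-year quantity weights) price index formulas; the two-commodity data set is purely hypothetical and is built so that consumers buy less of the good whose price rises more.

```python
# Hypothetical data for two commodities: (p0, q0, p1, q1)
# p0, q0 = base-year price and quantity; p1, q1 = current-year price and quantity.
goods = [
    (10, 5, 15, 3),   # price rises 50%, quantity bought falls
    (20, 4, 22, 4),   # price rises 10%, quantity unchanged
]


def laspeyres(items):
    """Sum of p1*q0 over sum of p0*q0, times 100 (base-year weights)."""
    return sum(p1 * q0 for p0, q0, p1, q1 in items) / sum(
        p0 * q0 for p0, q0, p1, q1 in items) * 100


def paasche(items):
    """Sum of p1*q1 over sum of p0*q1, times 100 (current-year weights)."""
    return sum(p1 * q1 for p0, q0, p1, q1 in items) / sum(
        p0 * q1 for p0, q0, p1, q1 in items) * 100


print(f"Laspeyres: {laspeyres(goods):.2f}")
print(f"Paasche:   {paasche(goods):.2f}")
```

With these assumed figures the Laspeyres index comes out higher than the Paasche index, because the base-year weights overstate the importance of the good that has become relatively dearer.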
3. STOCK MARKET INDEX NUMBER

A stock market index is a measure of the relative value of a group of stocks in numerical terms.
As the stocks within an index change value, the index value changes. An index is important because
the performance of investments is measured against a relevant market index.
An Index is used to give information about the price movements of products in the financial,
commodities or any other markets. Financial indexes are constructed to measure price
movements of stocks, bonds, T-bills and other forms of investments. Stock market indexes are
meant to capture the overall behaviour of equity markets. A stock market index is created by
selecting a group of stocks that are representative of the whole market or a specified sector or
segment of the market. An Index is calculated with reference to a base period and a base index
value.
Stock indexes are useful for benchmarking portfolios, for generalizing the experience of all
investors, and for determining the market return used in the Capital Asset Pricing Model
(CAPM).
A hypothetical portfolio encompassing all possible securities would be too broad to measure, so
proxies such as stock indexes have been developed to serve as indicators of the overall market's
performance. In addition, specialized indexes have been developed to measure the performance
of more specific parts of the market, such as small companies.
It is important to realize that a stock price index by itself does not represent an average return to
shareholders. By definition, a stock price index considers only the prices of the underlying stocks
and not the dividends paid. Dividends can account for a large percentage of the total investment
return.
A stock market index (or just index) is a number that measures the relative value of a group of
stocks. As the stocks in this group change value, the index also changes value. If an index goes
up by 1%, that means the total value of the securities which make up the index has gone up
by 1%.
A stock market index measures the change in the stock prices of the index's components.
How it works/Example:
Let's say we want to measure the performance of the Indian stock market. Assume there are
currently four public companies that operate in India: Company A, Company B,
Company C, and Company D.
In the year 2000, the four companies' stock prices were as follows:
Company A    10
Company B     8
Company C    12
Company D    25
Total        55
To create an index, we simply set the total (55) in the year 2000 equal to 100 and measure any
future periods against that total. For example, let's assume that in 2001 the stock prices were:
Company A     4
Company B    38
Company C    12
Company D    24
Total        78
Because 78 is 41.82% higher than the 2000 base, the index is now at 141.82. Every day,
month, year, or other period, the index can be recalculated based on current stock prices.
Note that this index is price-weighted (i.e., the larger the stock price, the more influence it has on
the index). Indexes can be weighted by any number of metrics, including shares outstanding,
market capitalization, or stock price.
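The price-weighted calculation described above can be sketched in a few lines. The prices for Company B in 2000 (8) and Company A in 2001 (4) are not printed in the tables; they are inferred here from the stated totals of 55 and 78, and the function name is illustrative.

```python
# Stock prices of the four companies in the base year and in 2001.
# Company B (2000) and Company A (2001) are inferred from the totals 55 and 78.
prices_2000 = {"Company A": 10, "Company B": 8, "Company C": 12, "Company D": 25}
prices_2001 = {"Company A": 4, "Company B": 38, "Company C": 12, "Company D": 24}


def price_weighted_index(current, base, base_value=100):
    """Total of current prices relative to the base-period total, scaled to base_value."""
    return sum(current.values()) / sum(base.values()) * base_value


index_2001 = price_weighted_index(prices_2001, prices_2000)
print(f"Index in 2001: {index_2001:.2f}")
```

The result, 141.82, matches the figure in the text. Note how the jump in Company B's price dominates the index even though Company A lost more than half its value, which is the price-weighting effect described above.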
Some Important Stock Market Indices

Symbol      Name
XAX         Amex Composite
VOLNDX      DWS NASDAQ-100 Volatility Target Index
FTSEQ500    FTSE NASDAQ 500 Index
RCMP        NASDAQ Capital Market Composite Index
IXIC        NASDAQ Composite
NQGM        NASDAQ Global Market Composite
NQGS        NASDAQ Global Select Market Composite
QOMX        NASDAQ OMX 100 Index
ILTI        NASDAQ OMX AeA Illinois Tech Index
QMEA        NASDAQ OMX Middle East North Africa Index
IXNDX       NASDAQ-100
NYA         NYSE Composite
OMXB10      OMX Baltic 10
OMXC20      OMX Copenhagen 20
OMXH25      OMX Helsinki 25
OMXN40      OMX Nordic 40
OMXS30      OMX Stockholm 30 Index
RUI         Russell 1000
RUT         Russell 2000
RUA         Russell 3000
OEX         S&P 100
SPX         S&P 500
MID         S&P MidCap
NDXE        The NASDAQ-100 Equal Weighted Index
VINX30      VINX 30
WLX         Wilshire 5000
Types of Stock Market Indices (National Stock Exchange)

(a) Broad Market Indices
These indices are broad-market indices, consisting of the large, liquid stocks listed on the
Exchange. They serve as a benchmark for measuring the performance of the stocks or portfolios
such as mutual fund investments.
Examples
CNX Nifty (The CNX Nifty is a well-diversified 50-stock index accounting for 23 sectors
of the economy. It is used for a variety of purposes such as benchmarking fund portfolios,
index-based derivatives and index funds.)
CNX Nifty Junior
LIX15 Midcap
CNX 100
Nifty Midcap 50
CNX Midcap
CNX Smallcap Index
India VIX
(b) Sectoral Indices

Sector-based indices are designed to provide a single value for the aggregate performance of a
number of companies representing a group of related industries or a sector of the
economy.
Examples
CNX Auto Index (The CNX Auto Index is designed to reflect the behaviour and performance of
the automobiles sector, which includes manufacturers of cars and motorcycles, heavy vehicles,
auto ancillaries, tyres, etc. The CNX Auto Index comprises 15 stocks that are listed on the
National Stock Exchange.)
CNX Bank Index
CNX Metal Index
CNX Energy Index
CNX Pharma Index
CNX Finance Index
CNX PSU Bank Index
CNX FMCG Index
CNX Realty Index
CNX IT Index
IISL CNX Industry Indices
CNX Media Index
(c) Thematic Indices

Thematic indices are designed to provide a single value for the aggregate performance of a
number of companies representing a theme.
Examples
CNX Commodities Index (The CNX Commodities Index is designed to reflect the behaviour and
performance of a diversified portfolio of companies representing the commodities segment
which includes sectors like Oil, Petroleum Products, Cement, Power, Chemical, Sugar, Metals
and Mining. The CNX Commodities Index comprises 30 companies that are listed on the
National Stock Exchange (NSE).)
CNX Consumption Index
CNX Service Sector Index
CPSE Index
CNX Shariah25
CNX Infrastructure Index
CNX Nifty Shariah / CNX 500 Shariah
CNX MNC Index
CNX PSE Index
(d) Strategy Indices

Strategy indices are designed on the basis of quantitative models / investment strategies to
provide a single value for the aggregate performance of a number of companies.
CNX 100 Equal Weight (The CNX 100 Equal Weight Index comprises the same constituents as the
CNX 100 Index (a free-float market capitalization based index).
The CNX 100 tracks the behaviour of the combined portfolio of two indices, viz. CNX Nifty and CNX
Nifty Junior. It is a diversified 100-stock index. The maintenance of the CNX Nifty and the CNX
Nifty Junior is synchronized so that the two indices will always be disjoint sets, i.e. a stock will
never appear in both indices at the same time.)
CNX Alpha Index
CNX Nifty Dividend
CNX Defty
NV20 Index
CNX Dividend Opportunities Index
NI15 Index
CNX High Beta Index
Nifty TR 2X Leverage
CNX Low Volatility Index
Nifty TR 1X Inverse
(e) Fixed Income Indices

A fixed income index is used to measure the performance of the bond market. Fixed income
indices are a useful tool for investors to measure and compare the performance of bond portfolios.
Fixed income indices are also used for the introduction of Exchange Traded Funds.
Examples
GSEC10 NSE Index (The GSEC10 NSE index is constructed using the prices of the top 5 (in terms of
traded value) liquid GOI bonds with residual maturity between 8 and 13 years and
outstanding issuance exceeding Rs. 5000 crores. The individual bonds are assigned weights
considering the traded value and outstanding issuance in the ratio of 40:60. The index measures
the changes in the prices of the bond basket.)
GSECBM NSE Index
(f) Index Concepts
Indices and index-linked investment products provide considerable benefits, but it is equally
important to know the associated risk that comes as part of such exposure. In the investment
world, risk is inseparable from performance and, rather than being
desirable or undesirable, is simply necessary. Understanding risk is one of the most important
parts of a financial education.
Important concepts and terminologies are associated with index construction. For example, Beta
helps us to understand the concepts of passive and active risk. Impact cost represents the cost of
executing a transaction in a given
stock, for a specific predefined order size, at any given point of time. These concepts are
important for understanding indices and for investors to learn from the information that indices
contain about investment opportunities.
(g) Index Funds

An Index Fund is a type of mutual fund with a portfolio constructed to match the constituents of
the market index, such as CNX Nifty. An index fund provides broad market exposure and lower
operating expenses for investors.
Index Funds today are a source of investment for investors looking at a long term, less risky form
of investment. The success of index funds depends on their low volatility and therefore the
choice of the index.
Examples
Principal Index Fund
UTI Nifty Index Fund
Franklin India Index Fund
SBI Nifty Index Fund
ICICI Prudential Index Fund
HDFC Index Fund - Nifty Plan
Birla Sun Life Index Fund
LIC NOMURA MF Index Fund - Nifty Plan
Uses of Stock Market Indices

With any type of investment it's important to measure the performance of that investment.
Otherwise there's no way for you to distinguish between a good return on your money versus a
bad one.
A relevant stock market index serves that purpose. If your investments consistently lag behind
the index then you know you have a poor performer, and it may be time to find a new
investment.
Stock market indexes are useful for a variety of reasons. Some of them are :
They provide a historical comparison of returns on money invested in the stock market
against other forms of investments such as gold or debt.
They can be used as a standard against which to compare the performance of an equity
fund.
They are a lead indicator of the performance of the overall economy or a sector of the
economy
Stock indexes reflect highly up to date information
Modern financial applications such as Index Funds, Index Futures, Index Options play an
important role in financial investments and risk management
BSE SENSEX (Bombay Stock Exchange Sensitive Index)
The Sensex is an "index". What is an index? An index is basically an indicator. It gives you a
general idea about whether most of the stocks have gone up or most of the stocks have gone
down. The Sensex is an indicator of all the major companies of the BSE.
BSE SENSEX is considered as the Barometer of Indian Capital Markets. If the Sensex goes up,
it means that the prices of the stocks of most of the major companies on the BSE have gone up.
If the Sensex goes down, this tells you that the stock prices of most of the major companies on the
BSE have gone down.
BSE SENSEX, first compiled in 1986, was initially calculated on a "Market Capitalization-Weighted"
methodology of 30 component stocks representing large, well-established and financially sound
companies across key sectors. The base year of S&P BSE SENSEX was taken as 1978-79. S&P
BSE SENSEX today is widely reported in both domestic and international markets through print
as well as electronic media. It is scientifically designed and is based on globally accepted
construction and review methodology. Since September 1, 2003, BSE SENSEX has been
calculated on a free-float market capitalization methodology. The "free-float market
capitalization-weighted" methodology is a widely followed index construction methodology on
which the majority of global equity indices are based; all major index providers like MSCI, FTSE,
STOXX, and Dow Jones use the free-float methodology.
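The general idea of a free-float market capitalization-weighted index can be sketched as below. This is a simplified illustration, not the exchange's actual procedure: the two stocks, their prices, share counts and free-float factors are all hypothetical, the base value of 100 is an assumption made for the example, and real indices also adjust the base for corporate actions, which is ignored here.

```python
# Each stock: (price, shares outstanding, free-float factor).
# The free-float factor is the assumed fraction of shares available for trading.
base_period = [
    (100, 1000, 0.5),   # hypothetical stock X
    (50, 2000, 0.8),    # hypothetical stock Y
]
current_period = [
    (110, 1000, 0.5),
    (55, 2000, 0.8),
]


def free_float_market_cap(stocks):
    """Sum of price x shares x free-float factor over all constituents."""
    return sum(price * shares * factor for price, shares, factor in stocks)


def index_value(current, base, base_value=100):
    """Current free-float market cap relative to the base period, scaled to base_value."""
    return free_float_market_cap(current) / free_float_market_cap(base) * base_value


print(f"Index value: {index_value(current_period, base_period):.2f}")
```

In this sketch both prices rise by 10 percent, so the index moves from 100 to 110. A stock with a larger free-float market capitalization would pull the index more than one with a smaller one.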
As of October 2014, the BSE Sensex consists of the following 30 major Indian companies:
Axis Bank Ltd
Bajaj Auto Ltd
Bharat Heavy Electricals Ltd
Bharti Airtel Ltd
Cipla Ltd
Coal India Ltd
Dr.Reddy's Laboratories Ltd
GAIL (India) Ltd
HDFC Bank Ltd
Hero MotoCorp Ltd
Hindalco Industries Ltd
Hindustan Unilever Ltd
Housing Development Finance Corporation Ltd
ICICI Bank Ltd
Infosys Ltd
ITC Ltd
Larsen & Toubro Ltd
Mahindra and Mahindra Ltd
Maruti Suzuki India Ltd
NTPC Ltd
Oil and Natural Gas Corporation Ltd
Reliance Industries Ltd
Sesa Goa Ltd
State Bank of India
Sun Pharmaceutical Industries Ltd
Tata Consultancy Services Ltd
Tata Motors Ltd
Tata Power Company Ltd
Tata Steel Ltd
Wipro Ltd
Nifty (National Stock Exchange Index)

Just like the Sensex which was introduced by the Bombay stock exchange, Nifty is a major stock
index in India introduced by the National stock exchange.
NIFTY was coined from the two words National and FIFTY. The word fifty is used because
the index consists of 50 actively traded stocks from various sectors.
So the nifty index is a bit broader than the Sensex which is constructed using 30 actively traded
stocks in the BSE.
Nifty is calculated using the same methodology adopted by the BSE in calculating the Sensex,
but with three differences:
The base year is taken as 1995
The base value is set to 1000
Nifty is calculated on 50 stocks actively traded on the NSE
The top 50 stocks are selected from 24 sectors.
The selection criteria for the 50 stocks are also similar to the methodology adopted by the
Bombay Stock Exchange.
Nifty is a weighted average of 50 stocks, meaning some stocks carry more weight than other
stocks. For example, ITC has more weight than Lupin.
List of 50 stocks that have been included in the nifty as on October 2014.
Name                                            Sector
ACC Ltd.                                        CEMENT AND CEMENT PRODUCTS
Ambuja Cements Ltd.                             CEMENT AND CEMENT PRODUCTS
Asian Paints Ltd.                               PAINTS
Axis Bank Ltd.                                  BANKS
Bajaj Auto Ltd.                                 AUTOMOBILES - 2 AND 3 WHEELERS
Bank of Baroda                                  BANKS
Bharat Heavy Electricals Ltd.                   ELECTRICAL EQUIPMENT
Bharat Petroleum Corporation Ltd.               REFINERIES
Bharti Airtel Ltd.                              TELECOMMUNICATION - SERVICES
Cairn India Ltd.                                OIL EXPLORATION/PRODUCTION
Cipla Ltd.                                      PHARMACEUTICALS
Coal India Ltd                                  MINING
DLF Ltd.                                        CONSTRUCTION
Dr. Reddy's Laboratories Ltd.                   PHARMACEUTICALS
GAIL (India) Ltd.                               GAS
Grasim Industries Ltd.                          CEMENT AND CEMENT PRODUCTS
HCL Technologies Ltd.                           COMPUTERS - SOFTWARE
HDFC Bank Ltd.                                  BANKS
Hero Honda Motors Ltd.                          AUTOMOBILES - 2 AND 3 WHEELERS
Hindalco Industries Ltd.                        ALUMINIUM
Hindustan Unilever Ltd.                         PERSONAL CARE
Housing Development Finance Corporation Ltd.    FINANCE - HOUSING
I T C Ltd.                                      CIGARETTES
ICICI Bank Ltd.                                 BANKS
IndusInd Bank Ltd.                              BANKS
Infosys Technologies Ltd.                       COMPUTERS - SOFTWARE
Infrastructure Development Finance Co. Ltd.     FINANCIAL INSTITUTION
Jindal Steel & Power Ltd.                       STEEL AND STEEL PRODUCTS
Kotak Mahindra Bank Ltd.                        BANKS
Larsen & Toubro Ltd.                            ENGINEERING
Lupin Ltd.                                      PHARMACEUTICALS
Mahindra & Mahindra Ltd.                        AUTOMOBILES - 4 WHEELERS
Maruti Suzuki India Ltd.                        AUTOMOBILES - 4 WHEELERS
NMDC Ltd.                                       MINING
NTPC Ltd.                                       POWER
Oil & Natural Gas Corporation Ltd.              OIL EXPLORATION/PRODUCTION
Power Grid Corporation of India Ltd.            POWER
Punjab National Bank                            BANKS
Reliance Industries Ltd.                        REFINERIES
SesaSterlite Ltd.                               MINING
State Bank of India                             BANKS
Sun Pharmaceutical Industries Ltd.              PHARMACEUTICALS
Tata Consultancy Services Ltd.                  COMPUTERS - SOFTWARE
Tata Motors Ltd.                                AUTOMOBILES - 4 WHEELERS
Tata Power Co. Ltd.                             POWER
Tata Steel Ltd.                                 STEEL AND STEEL PRODUCTS
Tech Mahindra Ltd.                              COMPUTERS - SOFTWARE
UltraTech Cement Ltd.                           CEMENT AND CEMENT PRODUCTS
United Spirits Ltd.                             BREW/DISTILLERIES
Wipro Ltd.                                      COMPUTERS - SOFTWARE
Nifty and the Sensex

The Sensex and Nifty are both indices. The Sensex, also called the BSE 30, is a stock market
index of 30 well-established and financially sound companies listed on the Bombay Stock Exchange
(BSE). The Nifty, similarly, is an indicator of the top 50 major companies on the National Stock
Exchange (NSE).
The Sensex and Nifty are both indicators of market movement. If the Sensex or Nifty goes up, it
means that most of the major stocks in India went up during the given period. If the Nifty goes
down, this tells you that the stock prices of most of the major stocks on the NSE have gone down.
The BSE is the Bombay Stock Exchange and the NSE is the
National Stock Exchange. Both the BSE and the NSE are situated in Mumbai (Bombay).
These are the major stock exchanges in the country. There are other stock exchanges, like the
Calcutta Stock Exchange, but they are not as popular as the BSE and the NSE. Most of the
stock trading in the country is done through the BSE and the NSE.
TIME SERIES ANALYSIS
In plain English, a time series is simply a sequence of numbers collected at regular intervals over
a period of time. In statistics, a time series is a sequence of numerical data points in successive
order, usually occurring in uniform intervals. This concerns the analysis of data collected over
time, such as weekly values, monthly values, quarterly values, yearly values, etc.
Many statistical methods relate to data which are independent, or at least uncorrelated. There are
many practical situations where data might be correlated. This is particularly so where repeated
observations on a given system are made sequentially in time. Data gathered sequentially in time
are called a time series.
Here are some examples in which time series arise:
Economics and Finance
Environmental Modelling
Meteorology and Hydrology
Demographics
Medicine
Engineering
Quality Control
The simplest form of data is a longish series of continuous measurements at equally spaced time
points. That is, observations are made at distinct points in time, these time points being
equally spaced, and the observations may take values from a continuous distribution.
The above setup could be easily generalized: for example, the times of observation need not be
equally spaced, or the observations may only take values from a discrete distribution.
If we repeatedly observe a given system at regular time intervals, it is very likely that the
observations we make will be correlated. So we cannot assume that the data constitute a random
sample. The time-order in which the observations are made is vital.
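One common way to check whether successive observations are correlated is the lag-1 sample autocorrelation. The sketch below is only an illustration: the series is hypothetical, and the estimator shown (sum of lagged cross-products of deviations divided by the total sum of squared deviations) is one standard choice among several.

```python
def lag_autocorrelation(series, lag=1):
    """Sample autocorrelation of a series with itself shifted by `lag` periods."""
    n = len(series)
    mean = sum(series) / n
    # Total sum of squared deviations from the mean.
    denominator = sum((x - mean) ** 2 for x in series)
    # Cross-products of deviations that are `lag` periods apart.
    numerator = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator


# Hypothetical yearly observations that drift upwards over time.
observations = [10, 12, 13, 15, 16, 18, 21, 22, 24, 27]
r1 = lag_autocorrelation(observations, lag=1)
print(f"Lag-1 autocorrelation: {r1:.2f}")
```

A value well above zero, as in this assumed series, indicates that neighbouring observations move together, so the data cannot be treated as a random sample.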
Objectives of time series analysis:
description - summary statistics, graphs
analysis and interpretation - find a model to describe the time dependence in the data, and ask
whether we can interpret the model
forecasting or prediction - given a sample from the series, forecast the next value, or the next
few values
control - adjust various control parameters to make the series fit closer to a target
adjustment - in a linear model the errors could form a time series of correlated observations,
and we might want to adjust estimated variances to allow for this
Types of time Series
1. continuous
2. discrete
Discrete means that observations are recorded at discrete times; it says nothing about the nature
of the observed variable. The time intervals can be annual, quarterly, monthly, weekly, daily,
hourly, etc.
Continuous means that observations are recorded continuously, e.g. temperature and/or humidity
in some laboratory. Again, a time series can be continuous regardless of the nature of the observed
variable.
Discrete time series can result when continuous time series are sampled. Sometimes quantities
that don't have an instantaneous value get aggregated, also resulting in a discrete time series, e.g.
daily or monthly rainfall. We will mostly study discrete time series in this course.
Uses of time series
There are two main uses of time series analysis: (a) identifying the nature of the phenomenon
represented by the sequence of observations, and (b) forecasting (predicting future values of the
time series variable). Both of these goals require that the pattern of observed time series data is
identified and more or less formally described. Once the pattern is established, we can interpret
and integrate it with other data (i.e., use it in our theory of the investigated phenomenon, e.g.,
seasonal commodity prices). Regardless of the depth of our understanding and the validity of our
interpretation (theory) of the phenomenon, we can extrapolate the identified pattern to predict
future events.
The usage of time series models is twofold:
Obtain an understanding of the underlying forces and structure that produced the
observed data
Fit a model and proceed to forecasting, monitoring or even feedback and feedforward
control.
Time Series Analysis is used for many applications such as:
Economic Forecasting
Sales Forecasting
Budgetary Analysis
Stock Market Analysis
Yield Projections
Process and Quality Control
Inventory Studies
Workload Projections
Utility Studies
Census Analysis
Time series analysis can be useful to see how a given asset, security or economic variable
changes over time or how it changes compared to other variables over the same time period. For
example, in stock market investments, suppose you wanted to analyze a time series of daily
closing stock prices for a given stock over a period of one year. You would obtain a list of all the
closing prices for the stock over each day for the past year and list them in chronological order.
This would be a one-year, daily closing price time series for the stock. Delving a bit deeper, you
might be interested to know if a given stock's time series shows any seasonality, meaning it goes
through peaks and valleys at regular times each year. Or you might want to know how a stock's
share price changes as an economic variable, such as the unemployment rate, changes.
The analysis of time series is of great significance not only to economists and businessmen
but also to scientists, astronomers, geologists, etc., for the reasons given below.
1) It helps in understanding past behavior. It helps to understand what changes have taken
place in the past. Such analysis is helpful in predicting the future behavior.
2) It helps in planning future operations. Statistical techniques have been evolved which
enable time series to be analysed in such a way that the influences which have determined
the form of that series may be ascertained. If the regularity of occurrence of any feature
over a sufficiently long period can be clearly established, then, within limits, prediction
of probable future variations becomes possible.
3) It helps in evaluating current accomplishments. The actual performance can be compared
with the expected performance and the cause of variation analysed. For example, if the
expected sale for 2000-01 was 10,000 washing machines and the actual sale was only
9,000, one can investigate the cause of the shortfall in achievement.
4) It facilitates comparison. Different time series are often compared and important
conclusions drawn therefrom.
Components of Time Series

The fluctuations of a time series can be classified into four basic types of variation. They
are often called the components or elements of a time series. They are:
(1) Secular Trend or Long Term Movements (T)
(2) Seasonal Variations (S)
(3) Cyclical Variations (C)
(4) Irregular Variations (I)
The value (y) of a phenomenon observed at any point of time (t) is the net effect of all the above
mentioned categories of components of a time series. We will see them in detail here.
(1) Secular Trend
The secular trend is the main component of a time series, which results from the long term effect
of socio-economic and political factors. This trend may show the growth or decline in a time series
over a long period. This is the type of tendency which continues to persist for a very long period.
Prices and export and import data, for example, show clearly increasing tendencies over time.
(2) Seasonal Variations (Seasonal Trend)
These are short term movements occurring in data due to seasonal factors. The short term is
generally considered as a period in which changes occur in a time series with variations in
weather or festivities. For example, it is commonly observed that the consumption of ice-cream
during summer is generally high, and hence the sales of an ice-cream dealer would be higher in
the summer months and relatively lower during the winter months. Employment, output, exports, etc.
are subject to change due to variation in weather. Similarly, sales of garments, umbrellas,
greeting cards and fireworks are subject to large variation during festivals like Onam, Eid,
Christmas, New Year, etc. These types of variation in a time series can be isolated only when the
series is provided biannually, quarterly or monthly.
(3) Cyclical Variations
These are long term oscillations occurring in a time series. These oscillations are mostly observed
in economic data, and the periods of such oscillations generally extend from five to twelve
years or more. These oscillations are associated with the well known business cycles. These cyclic
movements can be studied provided a long series of measurements, free from irregular
fluctuations, is available.
(4) Irregular Variations (Irregular Fluctuations)
These are sudden changes occurring in a time series which are unlikely to be repeated. It is that
component of a time series which cannot be explained by trend, seasonal or cyclic movements.
For this reason these variations are sometimes called the residual or random component. These
variations, though accidental in nature, can cause a continual change in the trend, seasonal and
cyclical oscillations during the forthcoming period. Floods, fires, earthquakes, revolutions,
epidemics, strikes, etc. are the root cause of such irregularities.
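The way the components combine into an observed value can be sketched in a few lines of Python. This is only an illustration under stated assumptions: all the numbers are invented, the components are combined additively (a multiplicative form is also common), and the cyclical component is left out because the hypothetical series covers only three years.

```python
# Hypothetical quarterly data: 3 years x 4 quarters (all numbers invented for illustration).
trend = [100 + 2 * t for t in range(12)]              # T: steady long term growth
seasonal_pattern = [5, -3, -6, 4]                     # S: repeats every year, sums to zero
seasonal = [seasonal_pattern[t % 4] for t in range(12)]
irregular = [0, 1, -1, 0, 2, 0, -2, 0, 0, -1, 1, 0]   # I: one-off shocks

# Additive combination (an assumption of this sketch): y = T + S + I.
y = [trend[t] + seasonal[t] + irregular[t] for t in range(12)]

# Averaging over a full year cancels the seasonal pattern,
# leaving the trend plus whatever irregular movement remains.
first_year_avg = sum(y[0:4]) / 4
first_year_trend_avg = sum(trend[0:4]) / 4
print(first_year_avg, first_year_trend_avg)
```

In this made-up series the irregular shocks of the first year happen to cancel, so the yearly average of the observed values equals the yearly average of the trend. With real data the match would only be approximate.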
Measurement of Trend: Moving Average and the Method of Least Squares

A moving average is the mean of time series data (observations equally spaced in time) from
several consecutive periods. It is called 'moving' because it is continually recomputed as new
data becomes available: it progresses by dropping the earliest value and adding the latest value.
For example, the moving average of six-month sales may be computed by taking the average of sales
from January to June, then the average of sales from February to July, then of March to August,
and so on. Moving averages (1) reduce the effect of temporary variations in data, (2) improve the
'fit' of data to a line (a process called 'smoothing') to show the data's trend more clearly, and
(3) highlight any value above or below the trend.
1. Method of Moving Averages
Let us explain the concept of Moving Average with the aid of an example.
Suppose that the demand for skilled laborers for a construction project is given for the last 7
months as shown in the following table:
Month      1     2     3     4     5     6     7
Demand   120   110    90   115   125   117   121
The engineer who is in charge of this project needs to predict the demand for the next month (the
8th month) based on the available data. He decided to take the average of the data and predicted
the demand as follows.
Average = (120 + 110 + 90 + 115 + 125 + 117 + 121)/7 = 114
The above method is known as the Simple Mean Forecasting Method, and it has some disadvantages.
The main problem with this method is the storage needed to keep all of the past data. If the data
contains several thousand items, each of which has several hundred data records, you need a lot of
memory space to store this data on your computer. In addition, this method is not very sensitive
to a shift in recent data if it contains a large number of data points.
A solution to these problems is the Moving Averages technique. Using this method, you need
to maintain only the N most recent periods of data points. At the end of each period, the oldest
period's data is discarded and the newest period's data is added to the data base. The sum of
these N values is then divided by N and used as a forecast for the next period.
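The formula for a three period moving average is given below:
$$MA(3) = \frac{D_t + D_{t-1} + D_{t-2}}{3}$$
where $D_t$, $D_{t-1}$ and $D_{t-2}$ are the values of the three most recent periods.
Now using the three period moving average, the forecast for the above problem can be calculated
as follows.
$$MA(3) = \frac{125 + 117 + 121}{3} = \frac{363}{3} = 121$$
So from the above example we can summarize as follows.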
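The two forecasts for the 8th month can be checked with a short Python sketch. The function name `moving_average_forecast` is only illustrative; the demand figures are those of the seven month table in the example.

```python
def moving_average_forecast(values, n):
    """Forecast the next period as the mean of the n most recent values."""
    if n <= 0 or n > len(values):
        raise ValueError("n must be between 1 and len(values)")
    recent = values[-n:]          # only the last n periods are needed
    return sum(recent) / n

demand = [120, 110, 90, 115, 125, 117, 121]   # months 1 to 7

simple_mean = sum(demand) / len(demand)        # Simple Mean Forecasting Method
forecast_3 = moving_average_forecast(demand, 3)

print(simple_mean)   # 114.0
print(forecast_3)    # 121.0
```

The three period forecast (121) is higher than the simple mean (114) because it ignores the low demand of month 3, which illustrates how a moving average responds more quickly to recent data.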
When a trend is to be determined by the method of moving averages, the average value for a number
of years is computed, and this average is taken as the normal or trend value for the unit of time
falling at the middle of the period covered in the calculation of the average. While applying this
method, it is necessary to select a period for the moving average, such as a 3 yearly, 5 yearly or
8 yearly moving average.
The 3 yearly moving averages shall be computed as follows:
$$\frac{a+b+c}{3},\quad \frac{b+c+d}{3},\quad \frac{c+d+e}{3},\quad \dots$$
The 5 yearly moving averages:
$$\frac{a+b+c+d+e}{5},\quad \frac{b+c+d+e+f}{5},\quad \frac{c+d+e+f+g}{5},\quad \dots$$
where a, b, c, ... are the successive values of the series.
Example
Calculate the 3 yearly moving average and 5 yearly moving average of the production figures
given below.
For computing the three yearly trend, first find the three yearly moving totals a+b+c, b+c+d,
c+d+e, etc. (Column 3 in the following table). Then find the average of each. Since each total is
the sum of three observations, divide each by 3 to get the average. Repeat the same process for
5 years, taking 5 instead of 3. Each total and average is written against the middle year of the
period it covers.

Year    Value    3 yearly    3 yearly moving    5 yearly    5 yearly moving
                 moving      averages           moving      averages
                 totals      (trend values)     totals      (trend values)
(1)     (2)      (3)         (4) = (3) ÷ 3      (5)         (6) = (5) ÷ 5
1990    242      -           -                  -           -
1991    250      744         248.0              -           -
1992    252      751         250.3              1246        249.2
1993    249      754         251.3              1259        251.8
1994    253      757         252.3              1260        252.0
1995    255      759         253.0              1265        253.0
1996    251      763         254.3              1276        255.2
1997    257      768         256.0              1288        257.6
1998    260      782         260.7              1295        259.0
1999    265      787         262.3              -           -
2000    262      -           -                  -           -
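The moving totals and trend values of this example can be reproduced with the following Python sketch. The helper name `centered_moving_average` is hypothetical, and the sketch assumes an odd period so that each average falls on a middle year; positions without a full window are left as `None`.

```python
def centered_moving_average(values, window):
    """Return (totals, averages), each aligned to the middle of an odd-length window.

    Positions at the start and end that lack a full window are None.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("this sketch handles odd windows only")
    half = window // 2
    totals = [None] * len(values)
    averages = [None] * len(values)
    for mid in range(half, len(values) - half):
        total = sum(values[mid - half: mid + half + 1])
        totals[mid] = total
        averages[mid] = round(total / window, 1)
    return totals, averages

years = list(range(1990, 2001))
production = [242, 250, 252, 249, 253, 255, 251, 257, 260, 265, 262]

totals3, avg3 = centered_moving_average(production, 3)
totals5, avg5 = centered_moving_average(production, 5)

for row in zip(years, production, totals3, avg3, totals5, avg5):
    print(row)
```

The `None` entries at both ends correspond to the years for which no trend value can be computed, one year at each end for the 3 yearly average and two years at each end for the 5 yearly average.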
[Figure: Three yearly moving average, plotted by year, 1990 to 2000]
[Figure: Five yearly moving average, plotted by year, 1990 to 1998]
[Figure: Moving Average chart showing Actual and Forecast (3 period moving average) values; vertical axis: Value, horizontal axis: Data Point]
Merits of Moving Average Method

It is simple as compared to the method of least squares.
It is flexible. If a few more figures are added to the data, the entire calculations are not
changed.
It has the advantage that it follows the general movements of the data and that its shape is
determined by the data rather than by the statistician's choice of a mathematical function.
It is particularly effective if the trend of a series is very irregular.
Limitations :
Trend values cannot be computed for all the years. The moving averages for the first few
years and the last few years cannot be obtained. It is often these extreme years in which
we may be interested.
Selection of the proper period is a great difficulty. If a wrong period is selected, there is
every likelihood that the conclusions may be misleading.
Since the moving average is not represented by a mathematical function, this method
cannot be used for forecasting.
It can be applied only to those series which show periodicity.
2. METHOD OF LEAST SQUARES:

The least squares method is a statistical technique to determine the line of best fit for a model.
The method fits an equation with certain parameters to observed data. It is extensively used in
regression analysis and estimation.
In the most common application - linear or ordinary least squares - a straight line is fitted
through a number of points so as to minimize the sum of the squares of the distances (hence the
name "least squares") from the points to this line of best fit.
In contrast to a linear problem, a non-linear least squares problem has no closed-form solution and
is generally solved by iteration. The method is usually credited to Carl Friedrich Gauss, who
claimed to have used it from 1795; the first published account was given by Legendre in 1805.
Field data is often accompanied by noise. Even though all control parameters (independent
variables) remain constant, the resultant outcomes (dependent variables) vary. A process of
quantitatively estimating the trend of the outcomes, also known as regression or curve fitting,
therefore becomes necessary.
The curve fitting process fits equations of approximating curves to the raw field data.
Nevertheless, for a given set of data, the fitting curves of a given type are generally NOT unique.
Thus, a curve with a minimal deviation from all data points is desired. This best-fitting curve can
be obtained by the method of least squares.
The principle of least squares provides us an analytical or mathematical device to obtain an
objective fit to the trend of the given time series. Most of the data relating to economic and
business time series conform to definite laws of growth or predictions. This technique can be
used to fit linear as well as nonlinear trends.
Fitting linear trend

A straight line can be fitted to the data by the method of curve fitting based on the most popular
principle called the principle of least squares. Such a straight line is also known as Line of Best
fit. Let the line of best fit be described by an equation of the type y = a+bx where y is the value
of dependent variable, a and b are two unknown constants whose values are to be determined.
To find a and b, we apply the method of least squares. Let E be the sum of the squares
of the deviations of all the original values from their respective values derived from the
equation, so that
$$E = \sum [y - (a + bx)]^2$$
By the calculus method, for a minimum,
$$\frac{\partial E}{\partial a} = 0, \qquad \frac{\partial E}{\partial b} = 0$$
Thus we get the two equations known as normal equations. They are:
$$\sum y = na + b\sum x$$
$$\sum xy = a\sum x + b\sum x^2$$
where n is the number of observations. Solving these two normal equations, we get a and b.
Substituting these values in the equation y = a+bx, we get the trend equation.
Example:
Fit a linear trend to the following data by the least square method.
Year          2000   2002   2004   2006   2008
Production      18     21     23     27     16
Solution
Let x = t - 2004 .......... (I)
Let the trend line of y (production) on x be
$y = a + bx$, (Origin: 2004) .......... (II)
Year (t)    y      x = t-2004    x²     xy     Ye = 21+0.1x    Y-Ye
2000        18     -4            16     -72    20.6            -2.6
2002        21     -2             4     -42    20.8             0.2
2004        23      0             0       0    21.0             2.0
2006        27      2             4      54    21.2             5.8
2008        16      4            16      64    21.4            -5.4
Total       Σy = 105    Σx = 0    Σx² = 40    Σxy = 4
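The normal equations for estimating a and b in (II) are
$$\sum y = na + b\sum x$$
$$\sum xy = a\sum x + b\sum x^2$$
Substituting the totals from the table, with n = 5 and $\sum x = 0$:
$$105 = 5a + 0 \implies a = \frac{105}{5} = 21$$
$$4 = 0 + 40b \implies b = \frac{4}{40} = \frac{1}{10} = 0.1$$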
Substituting in (II), the straight line trend equation is given by

Y = 21 + 0.1x, (Origin: 2004) .......... (III)
[x units = 1 year and y = production in '000 units]
Putting x = -4, -2, 0, 2 and 4 in (III), we obtain the trend values (Ye) for the years 2000,
2002, ..., 2008 respectively, as given in the last but one column of the table above.
The difference (Y - Ye) is calculated in the last column of the table. We have
$$\sum (Y - Y_e) = -2.6 + 0.2 + 2.0 + 5.8 - 5.4 = 8 - 8 = 0$$
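The same fit can be verified numerically. The sketch below solves the two normal equations directly; `fit_linear_trend` is an illustrative name, and the data are the five production figures of the example with origin 2004.

```python
def fit_linear_trend(x, y):
    """Solve the two normal equations for a and b in y = a + b*x."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # Normal equations: sum_y = n*a + b*sum_x  and  sum_xy = a*sum_x + b*sum_x2
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

years = [2000, 2002, 2004, 2006, 2008]
production = [18, 21, 23, 27, 16]
x = [t - 2004 for t in years]          # shift the origin to 2004

a, b = fit_linear_trend(x, production)
trend = [a + b * xi for xi in x]
residuals = [yi - ti for yi, ti in zip(production, trend)]

print(round(a, 4), round(b, 4))               # 21.0 0.1
print([round(t, 1) for t in trend])           # [20.6, 20.8, 21.0, 21.2, 21.4]
print(round(sum(residuals), 10))
```

Choosing the origin at the middle year makes the sum of x equal to zero, which is why each normal equation in the worked solution contains only one unknown.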
Uses of Method of Least Squares
The least squares method (LSM) is probably the most popular technique in statistics. This is due
to several factors.
First, most common estimators can be cast within this framework. For example, the mean of a
distribution is the value that minimizes the sum of squared deviations of the scores.
Second, using squares makes LSM mathematically very tractable because the Pythagorean
theorem indicates that, when the error is independent of an estimated quantity, one can add the
squared error and the squared estimated quantity.
Third, the mathematical tools and algorithms involved in LSM (e.g. derivatives) have been
well studied for a relatively long time.
The use of LSM in a modern statistical framework can be traced to Galton (1886) who used it in
his work on the heritability of size which laid down the foundations of correlation and (also gave
the name to) regression analysis. The two antagonistic giants of statistics Pearson and Fisher,
who did so much in the early development of statistics, used and developed it in different
contexts (factor analysis for Pearson and experimental design for Fisher).
Nowadays, the least squares method is widely used to find or estimate the numerical values of the
parameters to fit a function to a set of data and to characterize the statistical properties of
estimates. It exists in several variations: its simplest version is called ordinary least
squares (OLS); a more sophisticated version is called weighted least squares (WLS), which often
performs better than OLS because it can modulate the importance of each observation in the final
solution. Recent variations of the least squares method are alternating least squares (ALS) and
partial least squares (PLS).
Problems with least squares

Despite its popularity and versatility, LSM has its problems. Probably the most important
drawback of LSM is its high sensitivity to outliers (i.e., extreme observations). This is a
consequence of using squares, because squaring exaggerates the magnitude of differences (e.g.,
the difference between 20 and 10 is equal to 10, but the difference between 20² and 10² is equal
to 300) and therefore gives a much stronger importance to extreme observations. This problem is
addressed by using robust techniques which are less sensitive to the effect of outliers. This field
is currently under development and is likely to become more important in the near future.
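A small numerical sketch may make the sensitivity to outliers concrete. It uses invented data and compares the mean (the least squares estimate of a central value) with the median, which is used here only as one simple example of a more robust measure.

```python
import statistics

def sum_sq_dev(data, c):
    """Sum of squared deviations of the data from a candidate value c."""
    return sum((v - c) ** 2 for v in data)

clean = [10, 11, 9, 10, 10]
with_outlier = clean + [60]            # one hypothetical extreme observation

mean_clean = statistics.mean(clean)
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

# The mean minimizes the sum of squared deviations: moving away from it
# in either direction increases that sum.
print(sum_sq_dev(clean, mean_clean),
      sum_sq_dev(clean, mean_clean + 0.5),
      sum_sq_dev(clean, mean_clean - 0.5))

# A single extreme value drags the mean far from the bulk of the data,
# while the median stays where most observations are.
print(mean_clean, round(mean_outlier, 2), median_outlier)

# Squaring exaggerates differences, as in the 20 versus 10 example.
print(20 - 10, 20 ** 2 - 10 ** 2)
```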
MODULE IV
NATURE AND SCOPE OF ECONOMETRICS
Econometrics: Meaning, Scope, and Limitations - Methodology of econometrics-Modern
interpretation-Stochastic Disturbance term- Population Regression Function and Sample
Regression Function-Assumptions of Classical Linear regression model.
Introduction
Between the world wars, advances in mathematical statistics and a cadre of
mathematically trained economists led to econometrics, which was the name proposed for the
discipline of advancing economics by using mathematics and statistics. The roots of modern
econometrics can be traced to the American economist Henry L. Moore. Moore studied
agricultural productivity and attempted to fit changing values of productivity for plots of corn
and other crops to a curve using different values of elasticity. Moore made several errors in his
work, some from his choice of models and some from limitations in his use of mathematics.
Ragnar Frisch coined the word econometrics and helped to found both the Econometric
Society in 1930 and the Journal Econometrica in 1933.
It may be described as a branch of economics in which economic theory and statistical
methods are fused in the analysis of numerical and institutional data. The term econometrics
means economic measurement, which is synonymous with empirical research in economics.
Econometrics is concerned with the measurement of data or the application of statistical
procedures, which have been formulated in mathematical terms. It is therefore a branch of
mathematical economics. Statistical data and statistical procedures are employed to provide
numerical results, which may be used for verification of or to help in verification of economic
theorems. Econometrics provides the quantitative information that may be used to make a
qualitative analysis empirically truer and more meaningful.
The term econometrics is formed from two Greek words meaning economy and measure.
Econometrics is a rapidly developing branch of economics. Econometrics aims to give empirical
content to economic relations. The term econometrics was first used by Pawel Ciompa in 1910.
But the credit for coining the term in its modern sense should be given to Ragnar Frisch (1926),
one of the founders of the Econometric Society. He was the person who established the subject in
the sense in which it is known today. Econometrics can be defined generally as the application of
mathematics and statistical methods to the analysis of economic data. In the words of
Samuelson, Koopmans and Stone, econometrics is defined as "the quantitative analysis of actual
economic phenomena based on the concurrent development of theory and observation,
related by appropriate methods of inference" (1954). Other definitions of econometrics are:
Every application of mathematics or of statistical methods to the study of economic phenomena
(Malinvaud 1966)
The production of quantitative economic statements that either explain the behaviour of variables
we have already seen, or forecast (ie. predict) behaviour that we have not yet seen, or both
(Christ 1966)
Econometrics is the art and science of using statistical methods for the measurement of economic
relations (Chow, 1983).
Need for econometrics
Economic theory makes statements or hypotheses that are mostly qualitative in nature.
For example, microeconomic theory states that, other things remaining the same, a reduction in the
price of a commodity is expected to increase the quantity demanded of that commodity. Thus
economic theory postulates a negative or inverse relation between price and quantity. But the
theory does not provide any numerical measure of the relationship between the two. It is the job
of the econometrician to provide such numerical estimates. Econometrics gives empirical content
to most economic theory.
Scope of Econometrics
To make the meaning of econometrics more clear and detailed, it is appropriate to quote Frisch
(1933) in full: "Econometrics is by no means the same as economic statistics. Nor is it
identical with what we call general economic theory, although a considerable portion of this
theory has a definitely quantitative character. Nor should econometrics be taken as synonymous
with the application of mathematics to economics. Experience has shown that each of these
three view points, that of statistics, economic theory, and mathematics, is a necessary, but not by
itself a sufficient, condition for a real understanding of the quantitative relations in modern
economic life. It is this unification of all three that is powerful. And it is this unification that
constitutes econometrics."
Let us consider the following example to understand this unification more clearly. From +2
classes onwards we learn demand function which explains that demand is a function of price,
assuming ceteris paribus. When we relax the assumption of ceteris paribus, we argue that
demand is influenced by four factors namely, price, price of substitutes, income and taste of the
consumer. So when we consider these four factors together, it is a case of exact relation. This
exact relation can be expressed in the form of a regression model, where quantity demanded is
dependent variable and price, price of substitutes, income and taste are the independent variables.
So this mathematical representation is again an exact relation. But practical wisdom suggests
that there are many more factors which influence the quantity demanded. Some such factors are
the expectation of a price rise, the arrival of a new product, government policy and so on. Because
of the influence of these factors, our price-quantity relation is no longer exact. Naturally, then,
there should be a provision to incorporate the influence of other factors. The inclusion of a
provision for other factors is what makes econometrics unique, and how it is done is explained
in later pages.
Goals of econometrics
There are three main goals:
1. Analysis - the testing of economic theory
2. Policy making - supplying numerical estimates which can be used for decision making
3. Forecasting - using numerical estimates to forecast future values
1. Analysis: Testing Economic theory
The earlier economic theories started from a set of observations concerning the behaviour
of individuals as consumers or producers. Some basic assumptions were set regarding the
motivations of individual economic units. From these assumptions the economists, by pure
logical reasoning, derived some general conclusions regarding the working of the economic
system. Economic theories thus developed at an abstract level were not tested against economic
reality. No attempt was made to examine whether the theories adequately explained the actual
economic behaviour of individuals.
Econometrics aims primarily at the verification of economic theories, that is, at obtaining
empirical evidence to test the explanatory power of economic theories and to decide how well they
explain the observed behaviour of the economic units.
2. Policy making
Various econometric techniques can be applied in order to obtain reliable estimates of
the individual coefficients of economic relationships. Knowledge of the numerical values of these
coefficients is very important for the decisions of the firm as well as for the formulation of the
economic policy of the government. It helps to compare the effects of alternative policy
decisions.
For example, if the price elasticity of demand for a product is less than one (inelastic demand),
it will not benefit the manufacturer to decrease its price, because his revenue would be reduced.
Since econometrics can provide numerical estimates of the coefficients of economic relationships,
it becomes an essential tool for the formulation of sound economic policies.
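The elasticity argument can be illustrated with a small calculation. The sketch assumes a hypothetical constant-elasticity demand curve, Q = scale × P^(-elasticity); the functional form, the scale factor and the prices are all invented for illustration and are not estimates from any data.

```python
def revenue(price, elasticity, scale=1000.0):
    """Total revenue P*Q under a hypothetical constant-elasticity demand curve.

    Q = scale * price**(-elasticity), where elasticity is the absolute
    value of the price elasticity of demand.
    """
    quantity = scale * price ** (-elasticity)
    return price * quantity

old_price, new_price = 10.0, 9.0       # a 10% price cut

for elasticity in (0.5, 1.5):          # inelastic and elastic demand
    change = revenue(new_price, elasticity) - revenue(old_price, elasticity)
    direction = "falls" if change < 0 else "rises"
    print(f"elasticity {elasticity}: revenue {direction} after the price cut")
```

With an elasticity below one the extra quantity sold does not make up for the lower price, so revenue falls; with an elasticity above one the opposite holds. This is why a numerical estimate of the elasticity matters for the pricing decision.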
3. Forecasting future values
In formulating policy decisions, it is essential to be able to forecast the values of
economic variables. Such forecasts enable policy makers to make efficient decisions. For
example, what will be the demand for food grains in India by 2020? Estimates
of this are essential for formulating agricultural production policies. Similarly, what will be
the impact of a rise in deposit rates on the share market? It is known that if bank deposit
rates go up, the day to day demand for shares will come down. Econometric tools help in making
such decisions.
Methodology of Econometric model building

As mentioned earlier, the scope of econometrics is widening day by day. The development of
computers further promoted the use of econometric tools. Thus it is relevant and useful to have
an insight into the methodology of developing an econometric model. The development of an
econometric model undergoes the following important stages or phases.
1. Specification of the model
2. Estimation of the model
3. Evaluation of estimates
4. Forecasting power of the model
1 Specification of the model
In econometric analysis we have to identify the relevant variables, express the relationship in
an appropriate mathematical form and make estimates. In order to complete this process, we have
to go step by step.
The first step is to identify the relation to be studied and express that relation in the form of a
hypothesis. For example, if we are interested in testing the relevance of the law of demand, we
choose the law of demand and express it in the form of a hypothesis. The law of demand states that
there is an inverse relation between price and quantity demanded. This can be expressed in the
form of a null hypothesis and an alternative hypothesis.
The null hypothesis is: "Quantity demanded and price are unrelated", or "quantity demanded and
price are independent".
When we formulate a null hypothesis, an alternative hypothesis is automatically formed as well.
In this example, the alternative hypothesis will be "quantity demanded and price are related".
If we consider another example, the validity of the psychological law of Keynes, which relates
consumption expenditure and income, the suitable null hypothesis is "consumption expenditure
and income are unrelated" and the alternative hypothesis will be "consumption expenditure and
income are related". These hypotheses will be used for testing the validity of the estimated
coefficients, which will be discussed later.
Now let us discuss how to develop econometric models to test these hypotheses. First let us
start with law of demand. The first step is identifying the relevant variables
(a) Identification of variables: The most important and difficult part in developing an
econometric model is the identification of the relevant variables. One source for identifying the
variables is theory. Based on the law of demand, we know that the variables are quantity demanded,
price, price of substitutes, income and taste of the consumers. Conventionally we believe that
demand depends on these factors. Thus demand is the dependent variable or regressand, and price,
price of substitutes, income and taste are the independent variables or regressors. There are
certain practical difficulties at this stage. There may be a host of variables influencing a
phenomenon. Is it possible to identify all those variables? Even if we could identify them all, is
it appropriate to include all of them in the model? If we omit certain important
variables, it will lead to errors. Similarly, if we include a large number of variables or
unnecessary variables, it will also lead to errors. When such errors are committed in the
development of an econometric model, it is called specification bias or specification error. So
let us assume that we are considering only price as the variable influencing quantity demanded,
assuming other factors remain constant. So let us write,
D = f (P)
where D represents quantity demanded and P represents price.
(b) Sign and magnitude of parameters: Once the function is identified, the next task is to
attribute signs to the coefficients. Based on the general theory, we know that price takes a
negative sign. Thus we can convert the demand function into a demand equation as follows:
$D = \alpha + \beta P$
where $\alpha$ represents the intercept of the demand equation and $\beta$ represents the slope of
the demand equation.
But we know that price is not the only factor influencing demand, and at the same time it is
difficult to add all the variables. Thus, to accommodate the unexplained variables or variables
which are not included in the model, we add a stochastic term U to the model, called the
disturbance term or error term. The inclusion of an error term makes an econometric model
unique and distinct from a mathematical model or exact model. When an error term is included,
our demand equation model will become,
$D = \alpha + \beta P + U$
This is a unique econometric model.
Similarly, in the case of the consumption function, the variables are consumption expenditure,
income, savings, government policy and so on. Conventionally we assume that consumption
expenditure depends on income, assuming other factors remain constant. Thus our consumption
function model will be,
$C = \alpha + \beta Y + U$
where C is consumption expenditure, Y is income, $\alpha$ is the intercept and $\beta$ is the
slope of the consumption function.
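The difference between an exact relation and an econometric relation with a disturbance term can be sketched by simulation. Everything in the block is an assumption made for illustration: the "true" intercept and slope, the price range, and the normally distributed disturbance generated with Python's `random` module. The helper `ols` uses the standard least squares formulas for a single regressor.

```python
import random

def ols(x, y):
    """Least squares intercept and slope for y = alpha + beta*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta

random.seed(0)                      # fixed seed so the run is repeatable
ALPHA, BETA = 50.0, -2.0            # hypothetical intercept and (negative) slope
prices = [float(p) for p in range(1, 21)]

# Exact (mathematical) model: D = alpha + beta*P
exact_demand = [ALPHA + BETA * p for p in prices]

# Econometric model: D = alpha + beta*P + U, with U a random disturbance
disturbance = [random.gauss(0, 1) for _ in prices]
observed_demand = [d + u for d, u in zip(exact_demand, disturbance)]

alpha_exact, beta_exact = ols(prices, exact_demand)
alpha_hat, beta_hat = ols(prices, observed_demand)

print(round(alpha_exact, 6), round(beta_exact, 6))   # recovers 50 and -2 exactly
print(round(alpha_hat, 3), round(beta_hat, 3))       # close to, but not equal to, 50 and -2
```

With the disturbance term present, the estimated coefficients differ slightly from the values used to generate the data, which is why econometric results are stated as estimates rather than exact values.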

(c) Mathematical form of the model: There are two issues to be discussed here. The first issue is
whether we should follow a single equation approach or a simultaneous equation approach.
The second issue is whether we should follow a linear equation or a non-linear equation. Economic
theory does not explain whether the system follows single equation or simultaneous equation
models. It is true that demand is a function of price. But at the same time, demand is also a
function of supply. If we are considering the interrelationships among economic variables, the
appropriate method is a simultaneous equation model. However, in the present discussion let us
limit ourselves to single equation models.
The second issue is also very relevant. If we use a linear equation, there is an implied
assumption that the rate of change remains constant, or more precisely that the coefficients
remain constant. When we estimate a demand equation, we assume that
the rate of change in quantity demanded for a change in price is constant. Similarly, in the case
of the consumption function, we assume that the slope remains constant; in other words, the
marginal propensity to consume remains constant. With a little numerical wisdom, we can realize
that the marginal propensity to consume is unlikely to remain constant. Then what is the logic in
assuming a linear equation? We have to keep in mind that linear equations are suitable for class
room analysis but not for policy research. However, after this caution, for the time being let us
assume that we follow a linear equation for the purpose of simple understanding and explanation.
When we develop an econometric model, time specifications are also very important.
Conventionally, all current values are given the suffix t, previous values t-1 and
future values t+1 (t*). Thus our models can be written as,
$D_t = \alpha + \beta P_t + U_t$ (Demand equation)
$C_t = \alpha + \beta Y_t + U_t$ (Consumption equation)
Normally, the dependent variable is denoted by Y and the independent variable by X. Thus the
general framework of an econometric model can be written as,
$Y_t = \alpha + \beta X_t + U_t$
When we incorporate only one independent variable, the model captures only a narrow part of
reality. To make our model more realistic, we have to incorporate more
independent variables. When we use two independent variables, the model can be written as,
$Y_t = \alpha + \beta_1 X_{1t} + \beta_2 X_{2t} + U_t$
This is the simplest multiple regression model. When we have two or more independent
variables, the model becomes a multiple regression model. The general form of a multiple
regression model can be written as,
$Y_t = \alpha + \beta_1 X_{1t} + \beta_2 X_{2t} + \beta_3 X_{3t} + \dots + \beta_n X_{nt} + U_t$
This is also written as,
$Y_t = \alpha + \sum_{i=1}^{n} \beta_i X_{it} + U_t$
Just like incorporating current variables, it is easy to incorporate lagged variables or expected
variables in a model. See the following example.
$Y_t = \alpha + \beta_1 P_t + \beta_2 Y_{t-1} + \beta_3 W^* + U_t$
where the new variables are $Y_{t-1}$, which is the lagged value of the variable Y, and $W^*$,
which is the expected value of W ($W_{t+1}$).
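In practice a lagged variable is just the same series shifted by one period. A minimal sketch, using an invented income series and a hypothetical helper `lagged`, shows how the pairs of current and lagged values line up and why the first observation is lost.

```python
def lagged(series, k=1):
    """Shift a series k periods back in time.

    The first k entries have no lagged value and are set to None.
    """
    return [None] * k + series[:len(series) - k]

income = [100, 104, 109, 115, 122]     # hypothetical values of Y for t = 1..5
income_lag1 = lagged(income, 1)        # corresponding values of Y at t-1

# Keep only the periods for which both the current and the lagged value exist.
rows = [(t, y, y_prev)
        for t, (y, y_prev) in enumerate(zip(income, income_lag1), start=1)
        if y_prev is not None]

for t, y, y_prev in rows:
    print(t, y, y_prev)
```

Five observations give only four usable rows, so a model with one lag is estimated on one observation fewer than the original series.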

Similarly there are situations where we can not measure variables directly. In such situations, we
can define a proxy variable or an instrument variable and incorporate in the system as usual. See
the following example
Yt =
+ 1X1t + 2Zt + Ut
where Z is an instrument variable or proxy variable. Proxy variable is a variable used to

represent qualitative or non measurable phenomenon.
Another important question in developing an econometric model is whether we should go for
linear models or non-linear models. This is a highly debatable issue and beyond the scope of this
course. The following are the other forms available.
Lin-log model: $Y_t = \alpha + \beta \log X_t + U_t$
Log-lin model: $\log Y_t = \alpha + \beta X_t + U_t$
Double-log model: $\log Y_t = \alpha + \beta \log X_t + U_t$
The choice of the model depends on many factors, particularly the scatter diagram of the
dependent and independent variables. Among these forms, the double-log model is often preferred
because its slope coefficient directly gives the elasticity of Y with respect to X.
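The elasticity interpretation of the double-log model can be illustrated with a minimal sketch. The price and quantity figures below are hypothetical, generated from an assumed constant-elasticity relation, and the helper name `ols_simple` is introduced only for this illustration.

```python
import math

def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data generated exactly from Q = 50 * P**(-0.8),
# i.e. an assumed constant price elasticity of -0.8 (no error term).
prices = [1.0, 2.0, 4.0, 8.0, 16.0]
quantities = [50.0 * p ** -0.8 for p in prices]

# Double-log model: regress log Q on log P.
log_p = [math.log(p) for p in prices]
log_q = [math.log(q) for q in quantities]
alpha_hat, beta_hat = ols_simple(log_p, log_q)

print("estimated elasticity (slope):", round(beta_hat, 4))
print("estimated scale constant:", round(math.exp(alpha_hat), 4))
```

Because the data contain no disturbance, the slope of the log-log regression reproduces the assumed elasticity; with real data it would only be an estimate.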
Thus in the model specification stage we consider mainly, the variables to be included in the
model, and also the mathematical form of the model. Any error committed in this stage will lead
to errors termed as specification bias or specification error, as mentioned earlier.
2 Estimation of the model
As mentioned above, one of the objectives of econometric models is to estimate the
coefficients. Estimations are possible only if data are gathered. Data can be collected either by
census method or sample method. Important sampling methods used are simple random sample,
stratified sample, systematic sample, multistage sampling, cluster sampling and quota sampling.
Similarly, data are classified into primary data, secondary data, time series data, cross section
data and pooled data.
In econometric models, the distinction between time series data and cross section data is
important. To make this distinction clear, let us consider the following example,
Year    Sales
1999    15
2000    14
2002    17
2003    14
2004    12
2005    14
2007    17
2008    14
2010    12
A casual look at the data set gives the impression that it is a time series, because it is
ordered in time. But the given set is neither time series nor cross section. Why?
For a data set to be a time series, two conditions must hold: the data must be collected at
equal intervals, and they must relate to a single entity. In the given data set the years are not
equally spaced (2001, 2006 and 2009 are missing), and hence it is not a time series. But if we are provided with sales data for a few years at
regular intervals, such as one year or six months, they definitely constitute time series data.
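The equal-interval condition can be checked mechanically. The following minimal sketch uses the Year and Sales figures from the example above; the function name `has_equal_intervals` is an assumption made for this illustration.

```python
# Year and Sales figures from the example in the text.
years = [1999, 2000, 2002, 2003, 2004, 2005, 2007, 2008, 2010]
sales = [15, 14, 17, 14, 12, 14, 17, 14, 12]

def has_equal_intervals(periods):
    """True if every pair of consecutive periods is separated by the same gap."""
    gaps = {later - earlier for earlier, later in zip(periods, periods[1:])}
    return len(gaps) == 1

# Years that would be needed to make the series a regular annual one.
missing_years = sorted(set(range(years[0], years[-1] + 1)) - set(years))

print("equal intervals:", has_equal_intervals(years))
print("missing years:", missing_years)
```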
Now what is cross section data? When we gather information on multiple entities at a point
of time, it is called cross section data. For example, if we are gathering details of income,
savings, education, occupation etc of a group of 35 persons at a point of time, it is the best
example of cross section data. In other words, survey data are broadly cross section data.
In short, time series data are gathered over successive intervals of time, while cross section data are
gathered at a point of time. The classification into time series and cross section data is important
because the choice of appropriate techniques depends on the nature of the data, whether it is time
series or cross section.
Another set of data used in econometric modelling is pooled data. Pooled data, put simply,
is the integration or mixing of time series and cross section data. But the treatment of a pooled
data set is a little complicated.
Aggregation problem
Once the data are collected, another issue to be dealt with is the aggregation problem. The aggregation
problem arises from the irrational pooling of data. Aggregation problems are classified into
aggregation over individuals, over commodities, over space and over time.
Aggregation over individuals arises when we take the sum total of the incomes of a few individuals
or of firms. When we do this exercise, we are likely to commit errors. For example, if
the incomes of three persons, namely X, Y and Z, are Rs 100000, Rs 10000 and Rs 500
respectively, their aggregate income can be easily computed as Rs 110500 and their average income as
Rs 36833. But neither figure represents the income of any of the three persons well, so this computation, and any comparison
based on it, is unscientific and leads to the aggregation problem
over individuals. We may aggregate over the quantities of various commodities using
appropriate quantity indexes, or over the prices of a group of commodities using some
appropriate price index. But these aggregations may lead to errors, called the problem of aggregation over
commodities.
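The three-person income example can be worked through in a few lines. This is only an illustrative sketch: the median and the income shares are added here as assumptions of convenience, to show how poorly the aggregate and the average describe the individuals.

```python
import statistics

# Incomes (in rupees) of the three persons X, Y and Z from the example above.
incomes = {"X": 100000, "Y": 10000, "Z": 500}

total = sum(incomes.values())
mean_income = total / len(incomes)
median_income = statistics.median(list(incomes.values()))

# Share of each person in the aggregate income.
shares = {name: value / total for name, value in incomes.items()}

print("aggregate income:", total)
print("average income:", round(mean_income))
print("median income:", median_income)
print("shares:", {name: round(share, 3) for name, share in shares.items()})
```

The average lies far above two of the three incomes and far below the third, which is one way of seeing why the aggregate conceals the underlying distribution.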
When we collect data for different purposes, periodicity is very important. But in many
practical situations, this periodicity is not maintained. For example, in India, data are recorded on
two bases: some by calendar year and others by
financial year. Accountants admit that these differences create considerable
difficulties while computing certain ratios or while comparing different years. This problem is
called aggregation over time.
Finally, the aggregation of the populations of different towns, countries or regions also creates
problems. This problem is called aggregation over space. The above sources of aggregation
create various complications which may impart some aggregation bias to the estimates of the
coefficients.
Identification problem
While discussing econometric methodology, econometricians mention the problem of
identification of coefficients. This problem arises seriously only in the case of simultaneous
equation models, but a brief mention is made below.
We know that demand is a function of price. Similarly, supply is also a function of price.
At the equilibrium point, demand equals supply, and the observed price and quantity data are equilibrium values. From such data alone, we do not know whether
we are estimating the parameters of the demand function or of the supply function. The problem
becomes more complex when we deal with a system of a large number of equations.
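The identification problem can be made concrete with a small simulation. The sketch below is illustrative only: the linear demand and supply equations, their parameter values and the normally distributed shocks are all assumptions chosen for the example, and `ols_slope` is a helper defined here.

```python
import random

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical structural parameters (assumptions for illustration only).
A, B = 100.0, 1.5   # demand: Q = A - B*P + u
C, D = 10.0, 2.0    # supply: Q = C + D*P + v

prices, quantities = [], []
for _ in range(5000):
    u = random.gauss(0, 5)          # demand shock
    v = random.gauss(0, 5)          # supply shock
    p = (A - C + u - v) / (B + D)   # price at which demand equals supply
    q = A - B * p + u               # equilibrium quantity
    prices.append(p)
    quantities.append(q)

slope = ols_slope(prices, quantities)
print("true demand slope:", -B)
print("true supply slope:", D)
print("OLS slope of Q on P:", round(slope, 3))
```

Under these assumptions the regression of equilibrium quantity on equilibrium price recovers neither the demand slope nor the supply slope, but a mixture of the two.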
Choice of the appropriate econometric technique: The next issue is the selection of the
appropriate method for estimating the coefficients of economic relationships. The kit of
econometric tools provides different techniques, which can be split into single equation
techniques and simultaneous equation techniques. The important single equation techniques are the
Ordinary Least Squares method, the Indirect Least Squares or Reduced Form technique, the Two Stage Least
Squares method, the Limited Information Maximum Likelihood method and mixed estimation.
Simultaneous equation techniques are techniques which are applied to all equations of a system at
once, and give estimates of the coefficients of all the functions simultaneously. The most
important are the three stage least squares method and the full information maximum likelihood
method. The selection of the method depends on the following.
1. The nature of the relation and its identification condition.
2. The properties of the estimates of the coefficients obtained from each technique
3. Simplicity of the method
4. Time and cost requirements of the method
5. The desirable properties expected for the coefficients.
3 Evaluation of estimates
After the estimation of the model, the econometrician must proceed with the evaluation of
the results of the computations. That is, we are testing the reliability of the results. The
evaluation consists of deciding whether the estimates of the parameters are theoretically
meaningful and statistically satisfactory. For this purpose, we use different criteria, namely
a priori criteria, statistical criteria and econometric criteria.
Economic or a priori criteria
These are decided by the principles of economic theory and refer to the sign and magnitude of the
parameters of economic relationships. Consider the example of the demand equation. In the
demand equation, $D = \alpha + \beta P$, the coefficient $\beta$ should be negative in the case of a normal
good. Similarly, there is a range within which the values of $\alpha$ and $\beta$ can vary. When we
consider the consumption function, $\alpha$ and $\beta$ respectively represent autonomous
consumption and the marginal propensity to consume. Normally the sign of $\beta$ will be positive and its value
lies between 0 and 1. If our coefficients take an unexpected sign or magnitude, the
reliability of the estimates is doubtful and needs a review.
Statistical criteria (First order test): The estimated coefficients may be acceptable a priori but need
not be statistically valid. Thus the validity of the model is to be ascertained using statistical
criteria. The frequently used tests are the standard error, the t test, the coefficient of determination and the F
ratio. These tests are discussed later in detail.
Econometric criteria (Second order test): The validity of the model also depends on the
validity of the assumptions of the model or, more specifically, the stochastic assumptions. If the
assumptions of the econometric method applied by the investigator are not satisfied, either the
estimates of the parameters cease to possess some of their desirable properties or the statistical
criteria lose their validity and become unreliable for determining the significance of
these estimates.
When the model does not satisfy the economic, statistical or econometric criteria, it is
appropriate to re-specify the model. This process of re-specification and re-estimation should continue until we get
reliable estimates.
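The a priori and first-order checks described above can be sketched for a simple consumption function. The income and consumption figures are hypothetical and used only for illustration; the variable names (`alpha_hat`, `beta_hat` and so on) are likewise assumptions of this sketch.

```python
import math

# Hypothetical weekly income (X) and consumption (Y) for ten families.
income = [80, 100, 120, 140, 160, 180, 200, 220, 240, 260]
consumption = [70, 65, 90, 95, 110, 115, 120, 140, 155, 150]

n = len(income)
mean_x = sum(income) / n
mean_y = sum(consumption) / n
sxx = sum((x - mean_x) ** 2 for x in income)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(income, consumption))
syy = sum((y - mean_y) ** 2 for y in consumption)

# OLS estimates of the intercept (autonomous consumption) and slope (MPC).
beta_hat = sxy / sxx
alpha_hat = mean_y - beta_hat * mean_x

# First-order (statistical) criteria.
residuals = [y - (alpha_hat + beta_hat * x) for x, y in zip(income, consumption)]
rss = sum(e ** 2 for e in residuals)
sigma2_hat = rss / (n - 2)                # estimated error variance
se_beta = math.sqrt(sigma2_hat / sxx)     # standard error of the slope
t_beta = beta_hat / se_beta               # t ratio for H0: slope = 0
r_squared = 1 - rss / syy                 # coefficient of determination

# A priori (economic) criteria: positive intercept, MPC between 0 and 1.
sign_ok = alpha_hat > 0 and 0 < beta_hat < 1

print("alpha_hat:", round(alpha_hat, 4), "beta_hat:", round(beta_hat, 4))
print("se(beta_hat):", round(se_beta, 4), "t:", round(t_beta, 2))
print("R squared:", round(r_squared, 4), "a priori criteria met:", sign_ok)
```

The second-order (econometric) criteria, which concern the assumptions about the disturbance term, are not examined in this sketch.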
4 Evaluating the forecasting power of the estimated model

Forecasting is one of the prime aims of econometric analysis and research. The forecasting
power will be based on the stability of the estimates, their sensitivity to changes in the size of the
sample. We must establish whether the estimated function performs adequately outside the
sample of data whose average variation it represents. One way of establishing the forecasting
power of a model is to use the estimates of the model for a period not included in the sample.
The estimated value or forecast value is compared with the actual or realized magnitude of the
relevant dependent variable. Usually there will be a difference between the actual and the
forecast value of the variable, which is tested with the aim of establishing whether it is
statistically significant. If, after conducting the relevant test of significance, we find that the
difference between the realized value of the dependent variable and that estimated from the
model is statistically significant, we conclude that the forecasting power of the model is poor.
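A minimal sketch of this out-of-sample check follows. It assumes hypothetical income and consumption data, holds back the last two observations, and compares each forecast error with its standard error using a tabulated t value; the data, the choice of holdout and the 5 per cent level are all assumptions of the illustration.

```python
import math

# Hypothetical data; the last two observations are kept out of the sample.
income = [80, 100, 120, 140, 160, 180, 200, 220, 240, 260]
consumption = [70, 65, 90, 95, 110, 115, 120, 140, 155, 150]
x_fit, y_fit = income[:8], consumption[:8]
x_new, y_new = income[8:], consumption[8:]

n = len(x_fit)
mean_x = sum(x_fit) / n
mean_y = sum(y_fit) / n
sxx = sum((x - mean_x) ** 2 for x in x_fit)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_fit, y_fit))
beta_hat = sxy / sxx
alpha_hat = mean_y - beta_hat * mean_x

rss = sum((y - alpha_hat - beta_hat * x) ** 2 for x, y in zip(x_fit, y_fit))
sigma_hat = math.sqrt(rss / (n - 2))

T_CRITICAL = 2.447  # tabulated two-tailed 5% value of t with n - 2 = 6 d.f.

results = []
for x0, actual in zip(x_new, y_new):
    forecast = alpha_hat + beta_hat * x0
    # Standard error of the forecast of an individual value of Y at X = x0.
    se_forecast = sigma_hat * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / sxx)
    t_ratio = (actual - forecast) / se_forecast
    significant = abs(t_ratio) > T_CRITICAL
    results.append((x0, actual, forecast, t_ratio, significant))

for x0, actual, forecast, t_ratio, significant in results:
    print(x0, actual, round(forecast, 2), round(t_ratio, 2), significant)
```

If none of the forecast errors is statistically significant, the sketch gives no evidence of poor forecasting power for these particular data.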
Another way of establishing the stability of the estimates and the performance of the model
outside the sample of data from which it has been estimated is to re-estimate the function with
an expanded sample, that is, a sample including additional observations. The original estimates
will normally differ from the new estimates. The difference is tested for statistical significance
with appropriate methods.
Desirable properties of an Econometric model

1. Theoretical plausibility: The model should explain clearly the economic theory or
phenomena to which it relates.
2. Explanatory ability: The model should be able to explain the observations of the actual
world.
3. Accuracy of the estimates of the parameters: The estimates of the coefficients should be
accurate in the sense that they should possess the desirable properties of unbiasedness,
consistency and efficiency.
4. Forecasting ability: The model should produce satisfactory predictions of future values
of the dependent variables.
5. Simplicity: The model should represent the economic relationships with the maximum possible
simplicity. Other things being equal, the fewer the equations and the simpler the mathematical
form, the better the model.
Types of econometrics
Econometrics may be divided into two broad categories: theoretical econometrics and applied
econometrics.
Theoretical econometrics is concerned with the development of appropriate methods for
measuring economic relationships specified by econometric models. For example, one of the methods
used extensively is the principle of least squares.
In applied econometrics, the tools of theoretical econometrics are used to study some special area
of economics and business, such as the production function, investment function, demand and
supply functions etc.
Uses of Econometrics
1. Econometrics is widely used in policy formulation.
For example, suppose the government wants to devalue its currency to correct a balance of
payments problem. To estimate the consequences of devaluation, the price elasticities of
imports and exports are needed. If imports and exports are inelastic, devaluation will not
produce the necessary change. If imports and exports are elastic, the BOP of the country
will improve with devaluation. Price elasticities can be estimated with the help of the demand functions
for imports and exports. An econometric model can be built through which these elasticities can be
estimated.
2. Econometrics helps producers in making rational calculations.
3. Econometrics is also useful in verifying theories.
4. Studies in econometrics mainly consist of testing hypotheses, estimating parameters
and ascertaining the proper functional form of economic relations.
5 Limitations of Econometric Approach

Econometrics has come a long way over a relatively short period of time. Important advances
have been made in the compilation of data, development of concepts, theories and tools for the
construction and evaluation of a wide variety of econometric models. Applications of
econometrics can be found in almost every field of economics. Nowadays there is even a
tendency to use econometric tools in certain other sciences like sociology, political science,
agriculture and management. Econometric models have been used frequently by government
departments, international organizations and commercial enterprises. At the same time,
experience has also brought out a number of difficulties in the use of econometric tools. The
important limitations are,
1. Quality of data: Econometric analysis and research depend on an extensive data base. One
of the serious problems of Indian econometric research is the non-availability of accurate,
timely and reliable data.
2. Imperfections in economic theory: Earlier it was felt that economic theory is
sufficient to provide a base for model building. But later it was realized that many
economic theories are illusory because they are based on the assumption of ceteris
paribus, and hence models cannot fully accommodate the dynamic forces behind a
phenomenon.
3. There are institutional features and accounting conventions that have to be allowed for in
econometric models but which are either ignored or are only partially dealt with at the
theoretical level.
4. Any economic phenomenon is influenced by social, cultural, political, physiological and
even physical factors. These factors cannot be easily quantified. Even if quantified, they
may not be capable of explaining the phenomenon properly. For example, it is said that
the intelligentsia among Indian planners gave birth to very beautiful mathematical models, but
forgot to feed the hungry masses.
Thus we may conclude our discussion on econometrics by restating the following.
Economists develop economic models to explain consistently recurring relationships. Their
models link one or more economic variables to other economic variables. For example,
economists connect the amount individuals spend on consumer goods to disposable income and
wealth, and expect consumption to increase as disposable income and wealth increase (that is,
the relationship is positive).
There are often competing models capable of explaining the same recurring relationship, called
an empirical regularity, but few models provide useful clues to the magnitude of the association.
Yet this is what matters most to policymakers. When setting monetary policy, for example,
central bankers need to know the likely impact of changes in official interest rates on inflation
and the growth rate of the economy. It is in cases like this that economists turn to econometrics.
Econometrics uses economic theory, mathematics, and statistical inference to quantify economic
phenomena. In other words, it turns theoretical economic models into useful tools for economic
policymaking. The objective of econometrics is to convert qualitative statements (such as "the
relationship between two or more variables is positive") into quantitative statements (such as
"consumption expenditure increases by 95 cents for every one dollar increase in disposable
income"). Econometricians, the practitioners of econometrics, transform models developed by
economic theorists into versions that can be estimated. As Stock and Watson put it, econometric
methods are used in many branches of economics, including finance, labor economics,
macroeconomics, microeconomics, and economic policy. Economic policy decisions are rarely
made without econometric analysis to assess their impact.
Econometrics can be divided into theoretical and applied components.
Theoretical econometricians investigate the properties of existing statistical tests and procedures
for estimating unknowns in the model. They also seek to develop new statistical procedures that
are valid (or robust) despite the peculiarities of economic data, such as their tendency to change
simultaneously. Theoretical econometrics relies heavily on mathematics, theoretical statistics,
and numerical methods to prove that the new procedures have the ability to draw correct
inferences.
Applied econometricians, by contrast, use econometric techniques developed by the theorists to
translate qualitative economic statements into quantitative ones. Because applied
econometricians are closer to the data, they often run into (and alert their theoretical
counterparts to) data attributes that lead to problems with existing estimation techniques. For
example, the econometrician might discover that the variance of the data (how much individual
values in a series differ from the overall average) is changing over time.
The main tool of econometrics is the linear multiple regression model, which provides a formal
approach to estimating how a change in one economic variable, the explanatory variable, affects
the variable being explained, the dependent variable, taking into account the impact of all the
other determinants of the dependent variable. This qualification is important because a regression
seeks to estimate the marginal impact of a particular explanatory variable after taking into
account the impact of the other explanatory variables in the model.
The methodology of econometrics is fairly straightforward. It involves four steps, as explained
below.
The first step is to suggest a theory or hypothesis to explain the data being examined. The
explanatory variables in the model are specified, and the sign and/or magnitude of the
relationship between each explanatory variable and the dependent variable are clearly stated. At
this stage of the analysis, applied econometricians rely heavily on economic theory to formulate
the hypothesis. For example, a tenet of international economics is that prices across open borders
move together after allowing for nominal exchange rate movements (purchasing power parity).
The empirical relationship between domestic prices and foreign prices (adjusted for nominal
exchange rate movements) should be positive, and they should move together approximately one
for one.
The second step is the specification of a statistical model that captures the essence of the theory
the economist is testing. The model proposes a specific mathematical relationship between the
dependent variable and the explanatory variables, on which, unfortunately, economic theory is
usually silent. By far the most common approach is to assume linearity, meaning that any
change in an explanatory variable will always produce the same change in the dependent variable
(that is, a straight-line relationship).
Because it is impossible to account for every influence on the dependent variable, a catchall
variable is added to the statistical model to complete its specification. The role of the catchall is
to represent all the determinants of the dependent variable that cannot be accounted for, because
of either the complexity of the data or its absence. Economists usually assume that this "error"
term averages to zero and is unpredictable, simply to be consistent with the premise that the
statistical model accounts for all the important explanatory variables.
The third step involves using an appropriate statistical procedure and an econometric software
package to estimate the unknown parameters (coefficients) of the model using economic data.
This is often the easiest part of the analysis thanks to readily available economic data and
excellent econometric software. However, just because something can be computed doesn't mean it makes
economic sense to do so.
The fourth step is by far the most important: administering the smell test. Does the estimated
model make economic sense, that is, does it yield meaningful economic predictions? For example, are
the signs of the estimated parameters that connect the dependent variable to the explanatory
variables consistent with the predictions of the underlying economic theory? (In the household
consumption example, for instance, the validity of the statistical model would be in question if it
predicted a decline in consumer spending when income increased.) If the estimated parameters
do not make sense, how should the econometrician change the statistical model to yield sensible
estimates? And does a more sensible estimate imply an economically significant effect? This
step, in particular, calls on and tests the applied econometrician's skill and experience.
REGRESSION ANALYSIS
The term "regression" was introduced by Francis Galton. Regression analysis is concerned
with the study of the dependence of one variable (the dependent variable) on one or more other
variables (the explanatory variables), with a view to estimating the average (mean) value of the
former in terms of known (fixed) values of the latter.
Galton found that, although there was a tendency for tall parents to have tall children and for
short parents to have short children, the average height of children born of parents of a given
height tended to move or "regress" towards the average height in the population as a whole. In
other words, the height of the children of unusually tall or unusually short parents tends to move
towards the average height of the population. In the modern view of regression, the concern is
with finding out how the average height of sons changes, given the fathers' height. Regression
analysis is largely concerned with estimating and/or predicting the (population) mean value of
the dependent variable on the basis of the known or fixed values of the explanatory variable.
Origin of the Linear Regression Model
There are different methods for estimating the parameters of a model. Of these different
methods, the most popular and widely used is the regression technique using the Ordinary Least
Squares (OLS) method. This method is used because of the inherent properties of the estimates
derived using it. But first let us try to understand the rationale of this method. For
this purpose, let us go back to the demand theory as well as the consumption function which we
discussed in the earlier chapter. Demand theory says that there is a negative relation between
price and quantity demanded, ceteris paribus. In the case of the consumption function, there is a
positive relation between consumption expenditure and income. There are three important
questions here.
1. Which is the dependent variable and which is the independent variable?
2. Which is the appropriate mathematical form which explains the phenomenon?
3. What is the expected sign and magnitude of the coefficients?
In order to answer these questions, the theory will give the necessary support.
In the case of the demand equation, quantity demanded is the dependent variable and price is the
independent variable. Economic theory does not discuss the choice between single equation
models and simultaneous equation models to describe the relationship. So naturally we may
assume that the relation is explained with the help of a single equation, and a linear
one at that. As far as the sign and magnitude of the coefficients are concerned, in the equation
$D = \alpha + \beta P + U$, $\alpha$ can take any value but is expected to be zero or positive. It actually shows the
quantity demanded at a price of zero. A negative quantity demanded has no economic meaning, and
hence if we get a negative value, it can be approximated to zero. As for $\beta$, it can be
positive or negative. But normally it will be negative, assuming that the commodity demanded is
a normal good. Of course, the elasticity of demand for the commodity also influences the magnitude and
nature of this value.
In the case of the consumption function, consumption is the dependent variable and income is the
independent variable. Whether the relation is linear or non-linear is a debatable issue. For
instance, the psychological law of Keynes suggests that when income increases, consumption also
increases, but less than proportionately. So assuming that consumption and income are linearly
related is, in one way, an oversimplification. But for the time being let us assume so, just for
explanatory purposes. The sign and magnitude of the parameters $\alpha$ and $\beta$ each have a
meaning and interpretation. $\alpha$ represents the consumption when income takes the value zero;
that is, according to theory, it is autonomous consumption. Similarly, $\beta$ is nothing but the value
of the marginal propensity to consume, which is normally less than 1 and cannot be negative.
Based on the rationale discussed above, let us rewrite the demand equation as
$D = \alpha + \beta P + U$, where D is the quantity demanded, P is price, and $\alpha$ and $\beta$ are the parameters to be
estimated. In order to estimate these parameters, we use the Ordinary Least Squares (OLS) method.
Once we plot the observations and a fitted line on a graph, we can see the deviations between the actual and estimated
observations, popularly called errors. Naturally, a rational decision is to minimize these
errors. Thus from all possible lines, we choose the one for which the deviations of the points are
the smallest possible. The least squares criterion requires that the regression line be drawn in
such a way as to minimize the sum of the squares of the deviations of the observations from
it. The first step is to draw the line so that the sum of the simple deviations of the observations is
zero. Some observations will lie above the line and will have a positive deviation, some will lie
below the line, in which case they will have a negative deviation, and finally the points lying on
the line will have a zero deviation. In summing these deviations the positive values will offset
the negative values, so that the final algebraic sum of these residuals will equal zero.
Mathematically, $\sum e = 0$. Since the sum total of deviations is zero, it cannot be minimized as such.
So we square the deviations and minimize the sum of the squares, $\sum e^2$. Thus we call this
method the least squares method.
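The reasoning above can be checked numerically. In the minimal sketch below (hypothetical data; the helper names are assumptions of this illustration), several lines drawn through the point of means all have deviations summing to zero, which is why the simple sum cannot serve as a criterion, while the least squares line has the smallest sum of squared deviations among them.

```python
# Hypothetical observations on an explanatory variable X and a dependent variable Y.
x = [80, 100, 120, 140, 160, 180, 200, 220, 240, 260]
y = [70, 65, 90, 95, 110, 115, 120, 140, 155, 150]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

def deviations(intercept, slope):
    """Deviations of the observations from the line Y = intercept + slope * X."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def line_through_means(slope):
    """Intercept that makes a line with the given slope pass through (mean_x, mean_y)."""
    return mean_y - slope * mean_x

# Least squares slope and intercept.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
ols_slope = sxy / sxx
ols_intercept = line_through_means(ols_slope)

ols_dev = deviations(ols_intercept, ols_slope)
ols_sum = sum(ols_dev)
ols_ssq = sum(e ** 2 for e in ols_dev)

# Other lines through the point of means: the sum of deviations is still zero,
# but the sum of squared deviations is larger than for the OLS line.
alternatives = []
for slope in [0.3, 0.4, 0.6, 0.7]:
    dev = deviations(line_through_means(slope), slope)
    alternatives.append((slope, sum(dev), sum(e ** 2 for e in dev)))

print("OLS:", round(ols_slope, 4), round(ols_sum, 6), round(ols_ssq, 2))
for slope, dev_sum, dev_ssq in alternatives:
    print("slope", slope, "sum of deviations", round(dev_sum, 6), "sum of squares", round(dev_ssq, 2))
```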
Population Regression Function (PRF)
Mathematically, a population regression function (PRF) or conditional expectation function
(CEF) can be defined as the average value of the dependent variable for a given value of the
explanatory or independent variable. In other words, the PRF shows how the average value
of the dependent variable varies with the given value of the explanatory variable. On the other
hand, when we estimate the average value of the dependent variable with the help of a sample, the
result is called the sample regression function (SRF).
$E(Y \mid X_i) = f(X_i)$
where $f(X_i)$ denotes some function of the explanatory variable X.
This equation is known as the conditional expectation function
(CEF) or population regression function (PRF). It states merely that the expected value of the
distribution of Y given $X_i$ is functionally related to $X_i$. In simple terms, it tells how the mean or
average response of Y varies with X. For example, an economist might posit that consumption
expenditure is linearly related to income. Therefore, as a first approximation or a working
hypothesis, we may assume that the PRF $E(Y \mid X_i)$ is a linear function of $X_i$,
$E(Y \mid X_i) = \beta_1 + \beta_2 X_i$
where $\beta_1$ and $\beta_2$ are unknown but fixed parameters known as the regression coefficients;
$\beta_1$ and $\beta_2$ are also known as the intercept and slope coefficients, respectively.
We can express the deviation of an individual $Y_i$ around its expected value as follows:
$u_i = Y_i - E(Y \mid X_i)$, or
$Y_i = E(Y \mid X_i) + u_i$
where the deviation $u_i$ is an unobservable random variable taking
positive or negative values. Technically, $u_i$ is known as the stochastic disturbance or stochastic
error term.
We can say that the expenditure of an individual family, given its income level, can be
expressed as the sum of two components: (1) $E(Y \mid X_i)$, which is simply the mean consumption
expenditure of all the families with the same level of income, known as the
systematic, or deterministic, component; and (2) $u_i$, which is the random, or nonsystematic,
component. The latter is a surrogate or proxy for all the omitted or neglected variables that may affect Y but
are not (or cannot be) included in the regression model.
If $E(Y \mid X_i)$ is assumed to be linear in $X_i$, the equation may be written as
$Y_i = E(Y \mid X_i) + u_i$
$= \beta_1 + \beta_2 X_i + u_i$
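The idea of a conditional mean can be illustrated with a very small hypothetical population. In the sketch below, the families, their incomes and their consumption figures are all invented for illustration; the conditional means play the role of the population regression function and the deviations from them play the role of the disturbance term.

```python
# Hypothetical population: weekly consumption of every family at each income level.
population = {
    80: [55, 60, 65, 70, 75],
    100: [65, 70, 74, 80, 85, 88],
    120: [79, 84, 90, 94, 98],
}

# Conditional mean of Y for each value of X: the systematic component.
conditional_mean = {x: sum(ys) / len(ys) for x, ys in population.items()}

# Deviation of each family from its conditional mean: the random component.
disturbances = {
    x: [y - conditional_mean[x] for y in ys] for x, ys in population.items()
}

for x in population:
    print("income", x, "mean consumption", conditional_mean[x],
          "sum of deviations", round(sum(disturbances[x]), 6))
```

In this invented population the conditional means happen to lie on a straight line in income, so the population regression function is linear, and the deviations average to zero at every income level.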
Sample regression function (SRF)
Since the entire population is not available to estimate Y from given $X_i$, we have to
estimate the PRF on the basis of sample information. From a given sample we can estimate the
mean value of Y corresponding to chosen $X_i$ values. The estimated PRF value may not be
accurate because of sampling fluctuations, so only an approximation to the PRF
can be obtained. In general, we would get N different sample regression functions (SRFs) for N
different samples, and these SRFs are not likely to be the same.
We can develop the concept of the sample regression function (SRF) to represent the
sample regression line:
$\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i$
where $\hat{Y}$ is read as "Y-hat" or "Y-cap",
$\hat{Y}_i$ = estimator of $E(Y \mid X_i)$
$\hat{\beta}_1$ = estimator of $\beta_1$
$\hat{\beta}_2$ = estimator of $\beta_2$
Note that an estimator, also known as a (sample) statistic, is simply a rule or method that tells
how to estimate the population parameter from the information provided by the sample at hand.
We can express the SRF in its stochastic form as follows:
$Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i$
where, in addition to the symbols already defined, $\hat{u}_i$ denotes the (sample) residual, which is the
estimate of the error term $u_i$.
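Sampling fluctuation can be seen in a small simulation. The sketch below assumes a hypothetical linear PRF with normally distributed disturbances (the parameter values, sample size and number of samples are all assumptions), draws repeated samples and fits a separate SRF to each.

```python
import random

def ols(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

random.seed(42)  # fixed seed so the sketch is reproducible

# Assumed population regression function: E(Y | X) = 17 + 0.6 X.
TRUE_INTERCEPT, TRUE_SLOPE = 17.0, 0.6
incomes = list(range(80, 280, 20))  # ten fixed income levels, 80 to 260

def draw_sample():
    """One sample: a single family observed at each income level."""
    return [TRUE_INTERCEPT + TRUE_SLOPE * x + random.gauss(0, 5) for x in incomes]

# Fit one SRF per sample.
srfs = [ols(incomes, draw_sample()) for _ in range(200)]
slopes = [slope for _, slope in srfs]
mean_slope = sum(slopes) / len(slopes)

print("first three SRFs:", [(round(a, 2), round(b, 3)) for a, b in srfs[:3]])
print("range of slopes:", round(min(slopes), 3), "to", round(max(slopes), 3))
print("average slope:", round(mean_slope, 3))
```

Each sample yields a different SRF, yet under these assumptions the estimated slopes scatter around the slope of the PRF.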
Significance of the stochastic Error term
The disturbance term ui is a surrogate for all those variables that are omitted from the
model but that collectively affect Y.
1. Vagueness of theory
The theory determining the behavior of Y may be incomplete. We might know for certain that
weekly income X influences weekly consumption expenditure Y, but we might be ignorant or
unsure about the other variables affecting Y. Therefore ui may be used as a substitute for all the
excluded or omitted variables from the model.
2. Unavailability of data
Even if we know what some of the excluded variables are we may not have quantitative
information about these variables. For example, in principle we could introduce family wealth as
an explanatory variable in addition to the income to explain family consumption expenditure.
But unfortunately, information on family wealth generally is not available.
3. Core variables versus peripheral variables
Assume in our consumption income example that besides income X1, the number of children per
family X2, sex X3, religion X4, education X5, and geographical region X6 also affect
consumption expenditure. But it is quite possible that the joint influence of all these variables
may be so small that it need not be introduced in the model. Their combined effect can be treated
as a random variable ui.
4. Intrinsic randomness in human behavior
Even if all the relevant variables affecting Y are introduced into the model, there may be
variations due to intrinsic randomness in individual behavior which cannot be explained. The disturbance
term $u_i$ also includes this intrinsic randomness.
5. Poor proxy variables
Although the classical regression model assumes that the variables Y and X are measured accurately,
there may in practice be errors of measurement. Variables used as proxies may
not measure the underlying variable accurately. The disturbance term ui can also be used to capture these errors of
measurement.
6. Principle of parsimony
A regression model should be kept as simple as possible. If the behavior of Y can be
explained with the help of two or three explanatory variables, then more variables need not be
included in the model. Let ui represent all other variables. This does not mean that relevant and
important variables should be excluded just to keep the regression model simple.
7. Wrong functional form
Even if we have theoretically correct variables explaining a phenomenon, and even if it is possible
to get data on these variables, the functional relationship between the dependent and
independent variables is very often uncertain. In two-variable models the functional relation can be
ascertained with the help of a scattergram. But in a multiple regression model it is not easy to
determine the appropriate functional form, since a scattergram cannot be visualised in multidimensional
form.
For all these reasons, the stochastic disturbance ui assumes an extremely critical role in regression analysis.
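The idea that ui stands in for omitted variables can be sketched with a simulation. Everything in it is an assumption made for illustration: the coefficients 10, 0.6 and 0.05, the ranges of income and wealth, and the use of least squares to fit the line. Consumption is generated from both income and wealth, but the fitted model uses income alone, as it would if wealth data were unavailable.

```python
import random

def mean(values):
    return sum(values) / len(values)

def ols(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def correlation(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((p - a_bar) * (q - b_bar) for p, q in zip(a, b))
    var_a = sum((p - a_bar) ** 2 for p in a)
    var_b = sum((q - b_bar) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(7)  # fixed seed so the run is repeatable
n = 200
income = [rng.uniform(80, 260) for _ in range(n)]
wealth = [rng.uniform(0, 1000) for _ in range(n)]  # not observed by the analyst
consumption = [10 + 0.6 * x + 0.05 * w + rng.gauss(0, 2)
               for x, w in zip(income, wealth)]

# Fit consumption on income only; wealth is left out of the model.
b1_hat, b2_hat = ols(income, consumption)
residuals = [c - (b1_hat + b2_hat * x) for c, x in zip(consumption, income)]

# The residuals carry the influence of the omitted variable.
r_resid_wealth = correlation(residuals, wealth)
print("slope on income = %.3f" % b2_hat)
print("correlation(residuals, wealth) = %.3f" % r_resid_wealth)
```

The residuals of the income-only model move closely with the omitted wealth variable, which is the sense in which the disturbance term acts as a surrogate for what the model leaves out.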
Assumptions of Classical Linear Regression Model
1. U is a real-valued random variable. The value which U may assume in any one period depends on
chance. It may be positive, zero or negative. Each value has a certain probability of
being assumed by U in any particular instance.
2. The mean value of U in any particular period is zero. If we consider all the possible
values of U for any given value of X, they would have an average value equal to zero.
With this assumption we may say that $Y = \beta_1 + \beta_2 X + U$ gives the relationship between
X and Y on the average. That is, when X assumes the value X1, the dependent variable
will on the average assume the value Y1, although the actual value of Y observed on any
particular occasion may display some variation.
3. The variance of U is constant in each period. The variance of U about its mean is
constant at all values of X. In other words, for all values of X, the values of U show the same
dispersion around their mean.
4. The variable U has a normal distribution.
5. The random terms of different observations are independent. This means that the
covariance of any ui with any other uj is equal to zero.
6. U is independent of the explanatory variables.
The above assumptions are the classical assumptions of regression estimation, and they make the
OLS method efficient.
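In symbols, the six assumptions above can be summarised roughly as follows. This is a sketch for a model of the form $Y_i = \beta_1 + \beta_2 X_i + u_i$. The symbol $\sigma^2$ for the common variance of the disturbances is introduced here for convenience and is not notation used in the text. Line 6 states only the zero-covariance consequence of the independence assumed in the text.

```latex
\begin{align*}
&\text{1. } u_i \text{ is a real-valued random variable} \\
&\text{2. } E(u_i \mid X_i) = 0 \\
&\text{3. } \operatorname{Var}(u_i \mid X_i) = \sigma^2 \quad \text{for every } i \\
&\text{4. } u_i \sim N(0, \sigma^2) \\
&\text{5. } \operatorname{Cov}(u_i, u_j) = 0 \quad \text{for } i \neq j \\
&\text{6. } \operatorname{Cov}(u_i, X_i) = 0
\end{align*}
```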
A few other assumptions are also used in OLS estimation. They are:
(i) The explanatory variables are measured without error. In the case of the dependent variable,
errors of measurement may or may not arise.
(ii) The explanatory variables are not perfectly linearly correlated. If there is more than one
explanatory variable in the relationship, it is assumed that they are not perfectly correlated with
each other. More specifically, we are assuming the absence of multicollinearity.
(iii) There is no aggregation problem. In the previous chapter, we discussed aggregation over
individuals, time, space and commodities. So we assume the absence of all these problems.
(iv) The relationship being estimated is identified. This means that it has a unique
mathematical form, so there is no confusion about the coefficients and the equation to which they
belong.
(v) The relationship is correctly specified. It is assumed that we have not committed any
specification error in choosing the explanatory variables or in deciding the mathematical form.
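Assumption (ii) can be made concrete with a small numerical sketch. With two explanatory variables, the least-squares slopes are found by solving a 2x2 system built from the sums of squares and cross-products of the variables in deviation form. If one variable is an exact multiple of the other, the determinant of that system is zero and no unique solution exists. The data and the function names below are hypothetical.

```python
def deviations(values):
    """Deviations of each value from the mean of the list."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def normal_equations_determinant(x_a, x_b):
    """Determinant of the 2x2 system that determines the two slope coefficients."""
    da, db = deviations(x_a), deviations(x_b)
    s_aa = sum(p * p for p in da)
    s_bb = sum(q * q for q in db)
    s_ab = sum(p * q for p, q in zip(da, db))
    return s_aa * s_bb - s_ab ** 2

x1 = [1, 2, 3, 4, 5]
x2_perfect = [2 * v for v in x1]   # exact linear function of x1
x2_imperfect = [2, 1, 4, 3, 5]     # correlated with x1, but not perfectly

det_perfect = normal_equations_determinant(x1, x2_perfect)
det_imperfect = normal_equations_determinant(x1, x2_imperfect)
print("perfect collinearity:   determinant =", det_perfect)
print("imperfect collinearity: determinant =", det_imperfect)
```

With these numbers the determinant is 0 for the perfectly correlated pair and 36 for the other pair, so only in the second case can the separate effects of the two variables be estimated.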
*************