
QM

Brajaballav Kar
B. Tech (Electrical, CET)
PGDM (XIMB)

1
DESCRIPTIVE STATISTICS
Descriptive statistics quantitatively describes the main features of a
collection of data. It differs from inferential statistics (or
inductive statistics): it aims to summarize a sample, not to
learn about the population that the sample of data
represents. This generally means that descriptive
statistics, unlike inferential statistics, are not developed
on the basis of probability theory. Descriptive statistics
include measures of central tendency (mean, median
and mode) and measures of variability or dispersion
(standard deviation or variance, the minimum and
maximum values of the variables, kurtosis and
skewness).
Entries in an analysis of variance table can also be
regarded as summary statistics.

2
Data array: arrange the values in ascending or descending
order (advantages: we can notice the largest and smallest values,
divide the data into sections, notice whether a value appears more
than once, and observe the distance between succeeding values)
Frequency distribution: a table that organizes data
into classes (groups); it shows the number of
observations from the data set that fall into each of the
classes.
Relative frequency distribution: expresses the frequency of each class
as a fraction or percentage of the total number of observations
(class frequency / total observations); see the sketch below
Mutually exclusive: no data point falls into more than
one class
All inclusive: the sum of all the relative frequencies equals
100% or 1.0
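A minimal Python sketch of these ideas (the data values and class width are illustrative assumptions, not from the slides): it builds a frequency, relative-frequency and cumulative-frequency distribution with mutually exclusive, all-inclusive classes.

# Minimal sketch: frequency, relative-frequency and cumulative-frequency
# distribution for illustrative data (values and class width are assumed).
data = [3, 7, 12, 15, 18, 21, 22, 25, 29, 33, 34, 38, 41, 44, 47]
width = 10                              # chosen class width
low = (min(data) // width) * width

freq = {}
while low <= max(data):
    upper = low + width
    # classes are mutually exclusive (lower bound inclusive, upper exclusive) and all inclusive
    freq[(low, upper)] = sum(low <= x < upper for x in data)
    low = upper

n = sum(freq.values())
cumulative = 0
for (lo, hi), f in freq.items():
    cumulative += f
    print(f"{lo:>3}-{hi:<3}  freq={f:<2}  rel={f / n:.2f}  cum={cumulative}")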

3
Open-ended class: allows the upper or the lower
end of a quantitative classification to be limitless (Age: 11-
20, 21-30, 31-40, 41-50, 51-60, 61 and older)
Discrete classes: separate entities that do not progress
from one class to the next without a break.
Continuous classes: progress from one class to the next
without a break (e.g. weights of cans of tomatoes)
The range must be divided into equal classes; that is, the
width of the interval from the beginning of one class to the
beginning of the next class must be the same for every class.
Number of classes: rule of thumb, 6 to 15 classes
Width of class interval = (largest value - smallest value) / number of classes
Ogives: a cumulative frequency distribution lets us see
how many observations lie above or below certain values,
rather than only recording the number of items within each
interval. The graph of a cumulative frequency distribution is
called an ogive.
4
CHAPTER 2: MEASURES OF CENTRAL TENDENCY

Summary statistics: e.g. central tendency and
dispersion (which describe the characteristics
of the data set)
Central tendency
Dispersion
Skewness: the opposite of symmetry; a distribution is
skewed when its frequencies are lopsided rather than
concentrated at the middle. Positively skewed (frequencies
bunched at the beginning); negatively skewed
(frequencies bunched at the end)
Kurtosis (peakedness)

5
CENTRAL TENDENCY:
Arithmetic mean:
(Characteristics of a sample are called statistics;
those of a population are called parameters)
Population mean: μ = Σx / N; sample mean: x̄ = Σx / n
Grouped data: x̄ = Σ(f × x) / n, where n = Σf and x is the class midpoint
(For grouped data the midpoint is taken as the representative value: if the
class intervals are x1-x2, x3-x4, ..., the midpoint of the first class is taken as
(x1 + x3) / 2. This is an assumption and an
approximation.)
-ve: affected by extreme values; if a class is open-ended
the mean cannot be computed; every data point is used,
except in the case of grouped data (where only class midpoints are used)
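A small sketch of the two mean formulas (the numbers are illustrative, not from the slides):

# Minimal sketch: arithmetic mean for raw and for grouped data (illustrative numbers).
raw = [4, 8, 15, 16, 23, 42]
mean_raw = sum(raw) / len(raw)                     # x-bar = sum(x) / n

# Grouped data: class midpoints stand in for the actual values.
midpoints   = [25, 75, 125, 175]                   # midpoints of classes 0-50, 50-100, ...
frequencies = [3, 7, 6, 4]                         # f for each class
n = sum(frequencies)                               # n = sum of f
mean_grouped = sum(f * x for f, x in zip(frequencies, midpoints)) / n

print(round(mean_raw, 2), round(mean_grouped, 2))  # -> 18.0 and 102.5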
6
CENTRAL TENDENCY
Weighted mean:
x̄_w = Σ(w × x) / Σw
Geometric mean = nth root of (product of all x values)
(used for averaging growth rates or ratios)
Example: average growth over three periods = cube root of (1.1 × 1.15 × 1.2)
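A small sketch of the weighted and geometric means; the marks and weights are assumed for illustration, while the growth factors repeat the slide's example.

# Minimal sketch: weighted mean and geometric mean.
marks   = [80, 70, 90]          # x values (illustrative)
weights = [2, 3, 5]             # w values (illustrative, e.g. credit hours)
weighted_mean = sum(w * x for w, x in zip(weights, marks)) / sum(weights)

growth = [1.10, 1.15, 1.20]     # growth factors from the slide example
geometric_mean = (growth[0] * growth[1] * growth[2]) ** (1 / len(growth))

print(round(weighted_mean, 2), round(geometric_mean, 4))  # -> 82.0 and ~1.1494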

CENTRAL TENDENCY-MEDIAN
The middle-most or most central item of the data set
Median of ungrouped data:
Array the data in ascending or descending order; the
((n + 1) / 2)th item locates the median in both odd and even
cases.
If the data set has an odd number of items, the middle item is the median.
If the data set has an even number of items, the median is the average
of the two middle items.
Median of grouped data:
Median class: the class in which the cumulative
frequency reaches (n + 1) / 2
The assumption is that the data points are evenly
spread over the entire class interval.
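A small sketch of the ungrouped-data median for odd and even data sets (values are illustrative):

# Minimal sketch: median of ungrouped data for odd- and even-sized data sets.
def median(values):
    s = sorted(values)                      # array the data
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                       # odd: the middle item
    return (s[mid - 1] + s[mid]) / 2        # even: average of the two middle items

print(median([7, 1, 5, 3, 9]))              # -> 5
print(median([7, 1, 5, 3, 9, 11]))          # -> (5 + 7) / 2 = 6.0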
8
EXAMPLE
Account Balance    Frequency
0-49.99            78
50.00-99.99        123
100.00-149.99      187
150.00-199.99      82
200.00-249.99      51
250.00-299.99      47
300.00-349.99      13
350.00-399.99      9
400.00-449.99      6
450.00-499.99      4
Total              600
9
MEDIAN EXAMPLE
Median class: 100.00-149.99
The median lies at position (600 + 1) / 2 = 300.5, i.e. between the
300th and 301st items; the 300th item is the 99th item of the
median class (300 - (78 + 123) = 99)
Width per item in the median class: (150.00 - 100.00) / 187
= 0.267; the 1st item of the median class is 100.00, so the
99th item = 100.00 + 98 × 0.267 = 126.17
The 100th item = 126.17 + 0.267 = 126.44, so the median =
(126.17 + 126.44) / 2 = 126.30
10
MEDIAN FORMULA
Median formula: median = [((n + 1)/2 - (F + 1)) / f_m] × w + L_m
n = total number of items
F = sum of all the class frequencies up to, BUT not
including, the median class
f_m = frequency of the median class
w = class interval width
L_m = lower limit of the median class interval
For the example above the formula gives 126.35; the
difference from 126.30 is due to rounding.
+ve of the median: extreme values don't affect the median; it can
be calculated for open-ended grouped data, unless the
median falls in an open-ended class. It can be calculated for
qualitative data (excellent, very good, good, average, bad:
find the frequencies and then locate the median)
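A short sketch applying this grouped-data median formula to the account-balance table above:

# Minimal sketch: grouped-data median using [((n+1)/2 - (F+1)) / f_m] * w + L_m,
# with the account-balance frequency table from the slides.
classes = [(0.0, 78), (50.0, 123), (100.0, 187), (150.0, 82), (200.0, 51),
           (250.0, 47), (300.0, 13), (350.0, 9), (400.0, 6), (450.0, 4)]  # (lower limit, frequency)
w = 50.0                                   # class width
n = sum(f for _, f in classes)             # 600

position = (n + 1) / 2                     # 300.5
F = 0                                      # frequencies below the median class
for lower, f in classes:
    if F + f >= position:                  # this class contains the median
        median = ((position - (F + 1)) / f) * w + lower
        break
    F += f

print(round(median, 2))                    # -> 126.34 (the slide rounds to 126.35)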
11
CENTRAL TENDENCY-MODE
Mode: the value that is repeated most often in the data set
The mode of ungrouped data is rarely used; the reason is that chance can
make an unrepresentative value the most frequent one.
Data set: 0, 0, 1, 1, 2, 2, 4, 4, 5, 5, 6, 6, 7, 7, 8, 12, 15, 15, 15, 19 => the mode is
15, but it is unrepresentative of the data set, since most of the values
are below 10
Number of data points = 20
So class interval width = (20 - 0) / 6 = 3.3 => 4; number of classes = 20 / 4 = 5

Class:     0-3   4-7   8-11   12-15   16-19
Frequency:  6     8     1      4       1      => the modal class is 4-7

Mo = L_Mo + {d1 / (d1 + d2)} × w
L_Mo: lower limit of the modal class
d1 = frequency of the modal class minus the frequency of the class
directly below it
d2 = frequency of the modal class minus the frequency of the class
directly above it
w = width of the modal class interval
12
MODE EXAMPLE
Account Balance    Frequency
0-49.99            78
50.00-99.99        123
100.00-149.99      187
150.00-199.99      82
200.00-249.99      51
250.00-299.99      47
300.00-349.99      13
350.00-399.99      9
400.00-449.99      6
450.00-499.99      4
Total              600

L_Mo = 100, d1 = 187 - 123 = 64, d2 = 187 - 82 = 105, w = 50
=> Mo = 100 + {64 / (64 + 105)} × 50 = 118.93 ≈ 119.00
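A short sketch applying the modal formula to the same table:

# Minimal sketch: mode of grouped data, Mo = L_Mo + d1 / (d1 + d2) * w,
# using the account-balance table from the slides.
lowers = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450]
freqs  = [78, 123, 187, 82, 51, 47, 13, 9, 6, 4]
w = 50

i = freqs.index(max(freqs))                                   # index of the modal class
d1 = freqs[i] - (freqs[i - 1] if i > 0 else 0)                # vs class directly below
d2 = freqs[i] - (freqs[i + 1] if i + 1 < len(freqs) else 0)   # vs class directly above
mode = lowers[i] + d1 / (d1 + d2) * w

print(round(mode, 2))                                         # -> 118.93 (the slide rounds to 119.00)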
13
ADVANTAGE MODE
Advantages: like the median, it can be used as a
measure of central location for qualitative as well as
quantitative data; the mode is not affected by
extreme values; it can also be used for open-ended
classes
-ve: if all values occur with the same frequency
there is no usable mode, and in the case of multiple
modes, they are difficult to interpret and compare

14
MEAN MEDIAN-MODE
Mean, median, and mode are identical in a symmetrical
distribution
In a positively skewed distribution (skewed to right), the
mode is at the highest point of the distribution, median is to
the right of that and the mean is to the right of both median
and mode
In a negatively skewed distribution (skewed to left), the
mode is at the highest point of the distribution, median is to
the left of that and the mean is to the left of both median
and mode.
When the population is skewed positively or negatively the
median is often the best measure of location because it is
always between the mean and the mode. The median is not
as highly influenced by the frequency of occurrence of a
single value as the mode nor is it pulled by extreme values
as is the mean.
15
DISPERSION:
Variability
Why dispersion?
It gives additional information that enables us to
judge the reliability of our measure of central
tendency. Example: mean age 26 (case 1: Age1 = 2, Age2 = 52;
case 2: Age1 = 24, Age2 = 28). If the data are widely spread,
the mean is less representative.
It lets us compare the dispersion of different samples.
Usage: more dispersed financial earnings => more
risk
Quality parameters
Drug purity

16
MEASURES OF DISPERSION
Range (difference between the highest and lowest observed
values): easy to understand and find, but its usefulness is limited.
It is heavily influenced by extremes, and open-ended distributions don't
have a range.
Interfractile range: in a frequency distribution, a given fraction or
proportion of the data lies at or below a fractile. The median, for example,
is the 0.5 fractile, because half the data set is less than or equal to this
value.
The interfractile range is a measure of spread between two fractiles in a
frequency distribution, i.e. the difference between the values of the two
fractiles.
Fractiles: if they divide the data into 10 equal parts they are called deciles, if
into 4, quartiles, and if into 100, percentiles.
The interquartile range is the difference between the values of the first and third
quartiles (Q3 - Q1).
Other measures: variance and standard deviation; both indicate the
average distance of any observation in the data set from the mean
of the distribution.
17
VARIANCE
Population variance: σ² = Σ(x - μ)² / N, which is
equivalent to (Σx² / N) - μ² (the second form is used when the x values
are large and the x - μ values are small)
(The square of a unit of measure is not intuitive.)
Variance of grouped data:
σ² = Σf(x - μ)² / N = Σ(f x²) / N - μ²
Sample variance:
s² = Σ(x - x̄)² / (n - 1) = Σx² / (n - 1) - n x̄² / (n - 1)
Standard deviation: square root of the variance; only the positive
root is considered
Population standard deviation σ = square root of the
population variance
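A small sketch of the population and sample variance formulas and the standard deviation (values are illustrative):

# Minimal sketch: population and sample variance / standard deviation.
import math

x = [2, 4, 4, 4, 5, 5, 7, 9]
N = len(x)
mu = sum(x) / N

pop_var = sum((xi - mu) ** 2 for xi in x) / N              # sigma^2 = sum((x - mu)^2) / N
pop_var_shortcut = sum(xi ** 2 for xi in x) / N - mu ** 2  # equivalent computational form
sample_var = sum((xi - mu) ** 2 for xi in x) / (N - 1)     # s^2 divides by n - 1

print(pop_var, pop_var_shortcut, round(sample_var, 3), math.sqrt(pop_var))
# -> 4.0 4.0 4.571 2.0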
18
CHEBYSHEV'S THEOREM
Chebyshev's theorem says that NO MATTER
what the shape of the distribution, at least 75% of
the values will fall within ±2 standard deviations
of the mean of the distribution, and at least 89
percent of the values will lie within ±3 standard
deviations of the mean.
For symmetrical, bell-shaped (normal) distributions we can be more precise:
about 68% of the values lie within ±1 std dev
about 95% of the values lie within ±2 std dev
about 99% of the values lie within ±3 std dev
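A small sketch that checks Chebyshev's guarantee empirically; the exponential sample is an arbitrary, deliberately non-normal choice (an assumption, not from the slides).

# Minimal sketch: empirically checking Chebyshev's bound on a skewed sample.
import math, random

random.seed(0)
data = [random.expovariate(1.0) for _ in range(10_000)]   # deliberately non-normal data
n = len(data)
mean = sum(data) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)

for k in (2, 3):
    within = sum(abs(x - mean) <= k * sd for x in data) / n
    bound = 1 - 1 / k ** 2                                  # Chebyshev: at least 1 - 1/k^2
    print(f"k={k}: observed {within:.3f} >= guaranteed {bound:.3f}")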
19
STANDARD SCORE: & COEFFICIENT OF VARIATION
Standard score:
The standard score gives the number of standard deviations a
particular observation lies below or above the mean.
Population standard score z = (x - μ) / σ
Relative dispersion: the coefficient of variation
1. The standard deviation is an absolute measure of
dispersion that expresses the variation in the same units
as the original data. 2. Standard deviations alone can't be
compared across data sets. So we need to know a. the mean, b. the
standard deviation, c. and how the standard deviation
compares with the mean.

So to compare we need a relative measure, the
coefficient of variation = σ / μ (often expressed as a percentage)
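A small sketch of the standard score and coefficient of variation; the two data sets are illustrative.

# Minimal sketch: standard score z = (x - mean) / sd and coefficient of variation sd / mean.
import math

def mean_sd(values):
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return m, sd

a = [24, 25, 26, 27, 28]            # tightly clustered around 26
b = [2, 14, 26, 38, 52]             # widely spread around a similar mean

for name, data in (("a", a), ("b", b)):
    m, sd = mean_sd(data)
    z = (data[-1] - m) / sd         # standard score of the last observation
    cv = sd / m                     # coefficient of variation: relative dispersion
    print(f"{name}: mean={m:.1f} sd={sd:.2f} z(last)={z:.2f} CV={cv:.2%}")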
20
PROBABILITY
Probability is the chance that something will happen, expressed
as a fraction or a percentage
Event: one or more of the possible outcomes of
doing something
Experiment: an activity that produces the
events
Sample space: the set of all possible outcomes
of an experiment
Mutually exclusive: one and only one of the
events can take place at a time
Collectively exhaustive: when a list of the possible
events that can result from an experiment
includes every possible outcome, the list is called
collectively exhaustive
21
TYPES OF PROBABILITY:
Classical approach
Relative frequency approach
Subjective approach (not discussed here)
Classical approach: a priori, symmetrical, assumed (fair
coin, unbiased die); we can know the probability beforehand
Relative frequency approach:
In this approach the probability is defined as
1. the observed relative frequency of an event in a very large
number of trials (e.g. CA pass percentage), or
2. the proportion of times that an event occurs in the long run
when conditions are stable (this method uses the relative
frequencies of past occurrences as probabilities)
Relative frequencies become stable as the number of
trials becomes large (under uniform conditions)
22
RULES:
Single = marginal = unconditional probability => only
one event can take place
Mutually exclusive events, add the probabilities:
either/or events
P(A or B) = P(A) + P(B)
Proportion of families having this many children:

Number of children:  0     1     2     3     4     5     6 or more
Proportion:          0.05  0.10  0.30  0.25  0.15  0.10  0.05

What is P(4 or more children)?
= 0.15 + 0.10 + 0.05 = 0.30
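A one-line check of this addition, with the distribution stored as a dictionary (the "6+" key is just a label for the open-ended category):

# Minimal sketch: addition rule for mutually exclusive events, using the children distribution above.
p_children = {0: 0.05, 1: 0.10, 2: 0.30, 3: 0.25, 4: 0.15, 5: 0.10, "6+": 0.05}

p_four_or_more = p_children[4] + p_children[5] + p_children["6+"]
print(round(p_four_or_more, 2))        # -> 0.3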

23
NOT MUTUALLY EXCLUSIVE EVENT
Not mutually exclusive events; addition rule:
P(A or B) = P(A) + P(B) - P(A and B)
Example data (five people):

Male, age 30
Male, age 32
Female, age 45
Female, age 20
Male, age 40

Choose one person; what is the probability that the person is either female or
over 35?
=> P(female or over 35) = P(female) + P(over 35) - P(female and over 35)
= 2/5 + 2/5 - 1/5 = 3/5
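A short sketch of the general addition rule on the five-person data above:

# Minimal sketch: general addition rule P(A or B) = P(A) + P(B) - P(A and B).
people = [("male", 30), ("male", 32), ("female", 45), ("female", 20), ("male", 40)]
n = len(people)

p_female = sum(sex == "female" for sex, _ in people) / n
p_over35 = sum(age > 35 for _, age in people) / n
p_both   = sum(sex == "female" and age > 35 for sex, age in people) / n

print(round(p_female + p_over35 - p_both, 2))      # -> 0.6 (= 3/5)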
24
PROBABILITIES UNDER STATISTICAL
INDEPENDENCE:
Statistical Independence: The occurrence of
one has no effect on the probability of
occurrence of any other event.
Rolling a die: getting a 6 the first time and
getting a 6 the second time are independent events.
But:
Getting a 6 the first time a die is rolled and the
event that the sum of the numbers seen on the
first and second trials is 8 are not independent.
25
3 TYPES OF PROBABILITIES UNDER STATISTICAL
INDEPENDENCE:

1. Marginal
2. Joint
3. Conditional
Marginal probability of an independent event: a simple
probability (e.g. fair coin toss P(H) = 0.5; if the coin is unfair with P(H) = 0.8,
then it is 0.8 on every toss)
Joint probability of two independent events: P(AB) =
P(A) × P(B)
P(AB) = probability of events A and B occurring together or
in succession (joint probability)
P(A) = marginal probability of event A occurring
P(B) = marginal probability of event B occurring
(Examples: two heads in succession; dice: first a 1 and then a 6)
P(H1) = P(H2) = P(H3) = 0.5 (marginal or absolute
probability), but
P(H1 H2 H3) = 0.5 × 0.5 × 0.5 = 0.125 (joint probability)
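A small sketch of the multiplication rule for independent events, using the coin and dice examples above:

# Minimal sketch: joint probability of independent events, P(AB) = P(A) * P(B).
p_head = 0.5

# Three heads in succession with a fair coin:
p_three_heads = p_head * p_head * p_head
print(p_three_heads)                     # -> 0.125

# Dice: a 1 on the first roll and a 6 on the second (independent rolls):
p_1_then_6 = (1 / 6) * (1 / 6)
print(round(p_1_then_6, 4))              # -> 0.0278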
26
CONDITIONAL PROBABILITY UNDER STATISTICAL
INDEPENDENCE:

Conditional probability of independent
events: The conditional probability of event
B given that Event A has occurred is simply
the probability of B (Because they are
independent, by definition)
P(B|A) = P(B)
Ex: Probability of Head in second toss, given
that first toss resulted in Head = 0.5

27
PROBABILITY UNDER STATISTICAL
INDEPENDENCE:
Type of Probability    Symbol    Formula
Marginal               P(A)      P(A)
Joint                  P(AB)     P(A) × P(B)
Conditional            P(B|A)    P(B)
28
SIMPLE REGRESSION
Regression & correlation => nature and strength
of the relationship between two variables
"Regress": to go back towards the mean
Regression analysis: finding an estimating equation
(mathematically relating the variables)
Types of relationship
Dependent and independent variables
One dependent variable -> one or more independent variables
Direct relationship: X increases, Y increases; slope
+ve
Inverse relationship: X increases, Y decreases;
slope -ve
30
REGRESSION- CAUSE & EFFECT ?
Differentiate cause and effect from
dependent and independent variables.
Not all relationships are cause and effect:
a relationship found by regression is one of
association, not of cause and effect.
Cause and effect requires:
The cause should precede the effect in time
Presence of the cause indicates presence of the
effect
Presence of the effect indicates presence of the
cause
No confounding (a variable that impacts both)
31
SCATTER DIAGRAM:
Transform tabular information into a graph
Observe visually
Draw a fitted line: how?
The line need not touch each point; roughly equal
numbers of points should lie on either side of the line.
Relationships could be linear or curvilinear

32
33
TOTAL COST
It is known that total cost is the sum of
variable cost and fixed cost. A businessman
knows that for a raw material cost of
5 crore the total cost comes to 8.5 crore, and for
an RM cost of 8 crore the total cost is 10.6
crore. The businessman assumes a linear
relationship between the costs involved.
If he plans his raw material cost to be 10 crore,
what total cost should he be ready to incur?
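One way to work this out (the worked solution below is not in the slides): fit the straight line through the two given points and read off the cost at 10 crore.

# Minimal sketch: fit the line total = a + b * raw_material through the two given points
# and predict the total cost at a raw-material cost of 10 crore.
x1, y1 = 5.0, 8.5      # (RM cost, total cost) in crore
x2, y2 = 8.0, 10.6

b = (y2 - y1) / (x2 - x1)      # slope = variable cost per crore of raw material
a = y1 - b * x1                # intercept = fixed cost

print(round(b, 2), round(a, 2))      # -> 0.7 and 5.0
print(round(a + b * 10.0, 2))        # -> 12.0 crore total cost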
34
REGRESSION LINE
We will only examine linear relationships.
Y (dependent) = a (Y intercept) + b (slope) × X
(independent variable)
For two points: b = (Y2 - Y1) / (X2 - X1)
Estimating Y:
Ŷ (Y-hat) = a + b X
How to choose the best-fitting line?
Add the errors and take the lowest? Individual differences may be +ve
or -ve and will cancel.
Add the absolute values and take the lowest? This does not consider a
large single deviation => it does not stress the magnitude of error.
So, square the errors => this penalizes large absolute
deviations; take the least (least squares).
Mathematically:
b = {Σ(XY) - n X̄ Ȳ} / {ΣX² - n X̄²}
a = Ȳ - b X̄
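A short sketch of the least-squares slope and intercept formulas on illustrative data (the points are assumed, not from the slides):

# Minimal sketch: least-squares slope b = (sum(XY) - n*Xbar*Ybar) / (sum(X^2) - n*Xbar^2)
# and intercept a = Ybar - b*Xbar.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.3, 5.9, 8.2, 9.8]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

b = (sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar) / (sum(x * x for x in xs) - n * xbar ** 2)
a = ybar - b * xbar                          # the fitted line passes through (xbar, ybar)

print(round(a, 3), round(b, 3))              # estimating equation: y-hat = a + b * x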
35
