Statistical Inference

Adnan Butt
4:28 AM
Course Outline
1. Review of Descriptive Statistics and SPSS
2. Random Variable and Mathematical Expectation
3. Discrete Probability Distributions (Binomial, Poisson)
4. Continuous Probability Distribution (Normal)
5. Sampling Theory
6. Confidence Intervals
7. Hypothesis Testing
8. Goodness of Fit
9. Regression and Correlation with ANOVA
10. Multiple Regression
11. All the topics will be SPSS oriented
Recommended Readings (Books)

Introduction to Mathematical Statistics, Hogg, R. V., Craig, A. and McKean, J. W., 6th Edition (2004)
Statistical Inference, Casella, G. and Berger, R. L., 2nd Edition (2002)
Applied Regression Analysis: A Research Tool, Rawlings, J. O., Pantula, S. G. and Dickey, D. A., 2nd Edition (2001)
Introduction to Statistics, Walpole, R. E., 3rd Edition (2000)
Mode of Teaching
Lecture
SPSS Workshop
Discussion Session
Marks Distribution
Mid term                25 Marks
Final                   40 Marks
Quizzes                 15 Marks
SPSS                    15 Marks
Class Participation      5 Marks
Total                  100 Marks
Variable
A characteristic or property that varies from individual to individual.
Constant
A characteristic or property that does not change from individual to individual.
Types of Variables
Qualitative
Quantitative
    Discrete
    Continuous
Nominal Scale
Variable categories are mutually
exclusive and exhaustive.
Variable categories have no
logical order.
Eye Color, Hair Color, Gender.
Ordinal Scale
Data categories are mutually exclusive and exhaustive.
Data classifications are ranked or ordered according to the particular trait they possess.
Level of Knowledge about SPSS
Interval Scale
Data categories are mutually exclusive and exhaustive.
Data classifications are ranked or ordered according to the particular trait they possess.
Equal differences in the characteristic are represented by equal differences in the measurements.
The zero point is arbitrary; it does not mean absence of the characteristic.
Temperature, Shoe Size and IQ scores
Ratio Scale
Data categories are mutually exclusive and exhaustive.
Data classifications are ranked or ordered according to the particular trait they possess.
Equal differences in the characteristic are represented by equal differences in the measurements.
The zero point represents the absence of the characteristic.
Height, Weight, Distance.
Measurement Scales
Scale      Property                                          Examples
Nominal    Data may only be classified                       Eye Color, Hair Color, Gender
Ordinal    Data are ranked                                   Level of Knowledge about SPSS
Interval   True zero point does not exist                    Temperature, Shoe Size, IQ Scores
Ratio      Meaningful zero point and ratio between values    Height, Weight, Distance
Data
The information collected
for any kind of investigation.
Usually Numerical but can
be Qualitative.
Primary Data
The initial material collected during the research process.
The information collected directly from the respondent.
Personal Investigation, Through Investigator, Through Questionnaire, Through Local Sources, Through Telephone.
Secondary Data
The information collected and processed by people other than the researcher.
Government Organizations, Semi-Government Organizations.
Data Collection
Any of the following methods may be
adopted:
(a) Personal interview
(b) Direct observation
(c) Mail interview (internet interview)
(d) Telephone interview
What are the pros and cons of each?
Data management
Office Editing,
Post Coding,
Data entry and Verification.
Data organization and Analysis

Preparing data for analysis,
Extracting descriptive measures from the data,
Using advanced statistical techniques to analyze the data and draw inferences therefrom.
Measures of Central Tendency

Arithmetic Mean
Quantiles
(Median, Quartiles, Deciles, Percentiles)
Mode
Arithmetic Mean
A value obtained by dividing the sum of all the observations by their number.
$\text{Arithmetic Mean} = \dfrac{\text{Sum of all the observations}}{\text{Number of the observations}}$
If $X_1, X_2, \ldots, X_n$ are $n$ observations of a variable $X$, then
$\bar{X} = \dfrac{X_1 + X_2 + \cdots + X_n}{n} = \dfrac{\sum_{i=1}^{n} X_i}{n}$
Arithmetic Mean
The marks obtained by 8 students are:
67 72 68 70 65 68 75 63
$\bar{X} = \dfrac{67 + 72 + \cdots + 63}{8} = \dfrac{548}{8} = 68.5$ Marks
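The arithmetic for the marks example can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are chosen here and are not part of the course material.

```python
# Marks of the 8 students from the example above.
marks = [67, 72, 68, 70, 65, 68, 75, 63]

total = sum(marks)           # sum of all the observations
count = len(marks)           # number of the observations
mean = total / count         # arithmetic mean

print(f"Sum = {total}, n = {count}, mean = {mean} marks")
```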
Quantiles
For individual observations or a discrete frequency distribution, the ith quartile, jth decile and kth percentile are located in the array/discrete frequency distribution by the following relations:
$Q_i = \dfrac{i(n+1)}{4}$th observation in the distribution, $i = 1, 2, 3$
$D_j = \dfrac{j(n+1)}{10}$th observation in the distribution, $j = 1, 2, \ldots, 9$
$P_k = \dfrac{k(n+1)}{100}$th observation in the distribution, $k = 1, 2, \ldots, 99$
Quartiles
The weekly TV Watching times (Hours):
25 41 27 32 43 66 35 31 15 5
34 26 32 38 16 30 37 30 20 21
The array of the above data is given below:
5 15 16 20 21 25 26 27 30 30
31 32 32 34 35 37 38 41 43 66
Quartiles
$Q_1 = \dfrac{1(20+1)}{4}$th observation in the distribution
$= 5.25$th observation in the distribution
$= 5\text{th obs.} + 0.25\,\{6\text{th obs.} - 5\text{th obs.}\}$
$= 21 + 0.25\,\{25 - 21\} = 22.0$ Hours
Quartiles
$Q_2 = \dfrac{2(20+1)}{4}$th observation in the distribution
$= 10.50$th observation in the distribution
$= 10\text{th obs.} + 0.50\,\{11\text{th obs.} - 10\text{th obs.}\}$
$= 30 + 0.50\,\{31 - 30\} = 30.5$ Hours
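The location rule i(n + 1)/4 with linear interpolation between neighbouring observations can be sketched in Python as follows. The function name `quantile_n_plus_1` is hypothetical, and the data are the sorted TV watching times from the array above. Statistical packages offer several quantile conventions, so their output may differ slightly from this rule.

```python
def quantile_n_plus_1(sorted_values, i, parts):
    """Return the i-th quantile when the data are split into `parts` parts,
    using the position i(n + 1)/parts and linear interpolation."""
    n = len(sorted_values)
    position = i * (n + 1) / parts        # 1-based, possibly fractional
    lower = int(position)                 # integer part of the position
    fraction = position - lower           # fractional part of the position
    if lower < 1:                         # position before the first value
        return sorted_values[0]
    if lower >= n:                        # position at or beyond the last value
        return sorted_values[-1]
    low_val = sorted_values[lower - 1]    # the lower-th observation
    high_val = sorted_values[lower]       # the (lower + 1)-th observation
    return low_val + fraction * (high_val - low_val)


# Sorted weekly TV watching times (hours) from the array above.
tv_hours = [5, 15, 16, 20, 21, 25, 26, 27, 30, 30,
            31, 32, 32, 34, 35, 37, 38, 41, 43, 66]

q1 = quantile_n_plus_1(tv_hours, 1, 4)
q2 = quantile_n_plus_1(tv_hours, 2, 4)
q3 = quantile_n_plus_1(tv_hours, 3, 4)
print(q1, q2, q3)
```

The same function gives deciles with `parts=10` and percentiles with `parts=100`.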
Mode
The mode is the value that occurs most frequently in a set of data, that is, the value that occurs the maximum number of times in a sequence of observations.
Mode
The total automobile sales (in millions) in the United States for the last 14 years:
9.0 8.2 8.0 9.1 10.3 11.0 11.5 10.3 10.5 9.8 9.3 8.2 8.2 8.5
Mode = 8.2 million
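As a quick check of the automobile sales example, the mode can be found by counting how often each value occurs. The sketch below uses the standard-library `statistics.multimode` function (Python 3.8 or later), which returns every value tied for the highest frequency; the variable names are illustrative.

```python
from collections import Counter
from statistics import multimode

# Automobile sales (in millions) for the 14 years in the example above.
sales = [9.0, 8.2, 8.0, 9.1, 10.3, 11.0, 11.5,
         10.3, 10.5, 9.8, 9.3, 8.2, 8.2, 8.5]

counts = Counter(sales)      # frequency of each distinct value
modes = multimode(sales)     # list of the most frequent value(s)

print(f"8.2 occurs {counts[8.2]} times; mode(s) = {modes}")
```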
Measures of variation measure the variation present among the values of a data set; that is, they measure the spread of the values in the data.
Absolute Measures of
Dispersion
Range
Quartile Deviation
Mean (Average) Deviation
Variance and Standard Deviation
Relative Measures of
Dispersion
Coefficient of Range
Coefficient of Quartile Deviation
Coefficient of Mean Deviation
Coefficient of Variation (CV)
Range
Difference between the largest and the smallest observations.
$\text{Range} = X_{\text{Largest}} - X_{\text{Smallest}}$
Disadvantages of the Range

Ignores the way in which data are distributed:
    Two data sets spread differently between 7 and 12 both give Range = 12 - 7 = 5
Sensitive to outliers:
    1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
    Range = 5 - 1 = 4
    1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
    Range = 120 - 1 = 119
Inter-quartile Range (IQR)

Inter-quartile range = 3rd quartile - 1st quartile = $Q_3 - Q_1$
IQR is not affected by outliers
Inter-quartile Range
[Figure: a number line for X marked at minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57 and maximum = 70, with 25% of the observations in each of the four segments.]
Inter-quartile Range (IQR) = 57 - 30 = 27
The Mean (absolute) Deviation

Mean Deviation is the average of absolute deviations taken from the mean value.
X        X - Mean    |X - Mean|
8         3           3
5         0           0
2        -3           3
Total     0           6
$\text{MD} = \dfrac{\sum |X - \bar{X}|}{n} = \dfrac{6}{3} = 2$
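The mean deviation table can be reproduced in a few lines. This is a minimal sketch using the three observations 8, 5 and 2 from the slide; the variable names are illustrative.

```python
# Observations from the mean deviation example above.
x = [8, 5, 2]

n = len(x)
mean = sum(x) / n                            # 15 / 3 = 5
deviations = [xi - mean for xi in x]         # 3, 0, -3 (these always sum to 0)
abs_deviations = [abs(d) for d in deviations]

mean_deviation = sum(abs_deviations) / n     # 6 / 3

print(f"mean = {mean}, sum of deviations = {sum(deviations)}, MD = {mean_deviation}")
```

The plain deviations cancel out, which is why the absolute values are averaged instead.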
Variance
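Variance is the average of the squared deviations taken from the mean value.
(i) $S^2 = \dfrac{\sum (X - \bar{X})^2}{n} = \dfrac{102}{6} = 17\ \text{cm}^2$
(ii) $S^2 = \dfrac{\sum X^2}{n} - \left(\dfrac{\sum X}{n}\right)^2 = \dfrac{702}{6} - \left(\dfrac{60}{6}\right)^2 = 17\ \text{cm}^2$
X (cm)    (X - Mean)^2    X^2
4         36               16
6         16               36
9          1               81
12         4              144
13         9              169
16        36              256
Total 60  102             702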
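Both variance formulas can be verified numerically. The sketch below assumes the six observations are 4, 6, 9, 12, 13 and 16 cm, which are the values consistent with the column totals 60, 102 and 702 in the table; the divisor is n, as in the slide.

```python
# Observations (cm) consistent with the totals in the variance table above.
x = [4, 6, 9, 12, 13, 16]

n = len(x)
mean = sum(x) / n                                        # 60 / 6 = 10

# (i) Definition: average of the squared deviations from the mean.
sum_sq_dev = sum((xi - mean) ** 2 for xi in x)           # 102
variance_definition = sum_sq_dev / n

# (ii) Shortcut: mean of the squares minus the square of the mean.
sum_sq = sum(xi ** 2 for xi in x)                        # 702
variance_shortcut = sum_sq / n - mean ** 2

print(variance_definition, variance_shortcut)            # both in cm^2
```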
Standard Deviation
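Standard deviation is the positive square root of the mean-square deviations of the observations from their arithmetic mean.
$\text{SD} = \sqrt{\text{variance}}$
Population: $\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{N}}$
Sample: $s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$
Standard Deviation for Grouped Data
SD is: $\sqrt{\dfrac{\sum f_i (x_i - \bar{x})^2}{N}}$
Where $\bar{x} = \dfrac{\sum f_i x_i}{N}$ and $N = \sum f_i$
Simplified formula: $\text{SD} = \sqrt{\dfrac{\sum f x^2}{N} - \left(\dfrac{\sum f x}{N}\right)^2}$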
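Example-1: Find Standard Deviation of Ungrouped Data
Family No.   Size (x_i)   x_i - mean   (x_i - mean)^2   x_i^2
1            3            -2           4                 9
2            3            -2           4                 9
3            4            -1           1                16
4            4            -1           1                16
5            5             0           0                25
6            5             0           0                25
7            6             1           1                36
8            6             1           1                36
9            7             2           4                49
10           7             2           4                49
Total        50            0           20               270
Here, $\bar{x} = \dfrac{\sum x_i}{n} = \dfrac{50}{10} = 5$
$s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n} = \dfrac{20}{10} = 2$
$s = \sqrt{2} = 1.41$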
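The family-size example can be checked with the ungrouped formula, and the same data can be fed to the simplified grouped-data formula by treating each distinct size as a class with its frequency. This sketch assumes the ten family sizes are 3, 3, 4, 4, 5, 5, 6, 6, 7, 7 (the values consistent with the totals 50, 20 and 270) and uses the divisor n, as the example does; the frequency grouping is an illustration added here.

```python
from math import sqrt

# Ungrouped data: sizes of the 10 families.
sizes = [3, 3, 4, 4, 5, 5, 6, 6, 7, 7]
n = len(sizes)
mean = sum(sizes) / n                                    # 50 / 10 = 5
variance = sum((x - mean) ** 2 for x in sizes) / n       # 20 / 10 = 2
sd = sqrt(variance)

# The same data as a frequency distribution: value x occurs f times.
values = [3, 4, 5, 6, 7]
freqs = [2, 2, 2, 2, 2]
N = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, values))       # 50
sum_fx2 = sum(f * x * x for f, x in zip(freqs, values))  # 270
sd_grouped = sqrt(sum_fx2 / N - (sum_fx / N) ** 2)       # simplified formula

print(round(sd, 2), round(sd_grouped, 2))
```

Both routes give the same standard deviation because the grouped formula is only a compact way of summing repeated values.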
Comparing Standard Deviations

[Figure: dot plots of three data sets on a common axis from 11 to 21.]
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.567
The smaller the standard deviation, the more tightly clustered the scores around the mean.
The larger the standard deviation, the more spread out the scores from the mean.
Relative Measures of Variation

Coefficient of Range $= \dfrac{X_{\text{Largest}} - X_{\text{Smallest}}}{X_{\text{Largest}} + X_{\text{Smallest}}}$
Coefficient of Quartile Deviation $= \dfrac{Q_3 - Q_1}{Q_3 + Q_1}$
Coefficient of Mean Deviation $= \dfrac{\text{MD}}{\text{Mean}}$
Coefficient of Variation (CV)

$CV = \dfrac{S}{\bar{X}} \times 100\%$
Can be used to compare two or more sets of data measured in different units, or in the same units but with different average sizes.
Use of Coefficient of Variation

Stock A:
Average price last year = $50
Standard deviation = $5
$CV_A = \dfrac{S}{\bar{X}} \times 100\% = \dfrac{\$5}{\$50} \times 100\% = 10\%$
Stock B:
Average price last year = $100
Standard deviation = $5
$CV_B = \dfrac{S}{\bar{X}} \times 100\% = \dfrac{\$5}{\$100} \times 100\% = 5\%$
Both stocks have the same standard deviation, but stock B is less variable relative to its price.
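The stock comparison reduces to one small function. The sketch below uses a hypothetical helper `coefficient_of_variation` and the two stocks' figures from the example.

```python
def coefficient_of_variation(sd, mean):
    """Return the coefficient of variation as a percentage."""
    return 100 * sd / mean


cv_a = coefficient_of_variation(sd=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(sd=5, mean=100)   # Stock B

print(f"CV of A = {cv_a}%, CV of B = {cv_b}%")
```

Because the standard deviation is divided by the mean, the result has no units, which is what allows the two stocks to be compared directly.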
Appropriate Choice of Measure of Variability
If data are symmetric, with no serious outliers, use range and standard deviation.
If data are skewed, and/or have serious outliers, use IQR.
If comparing variation across two data sets, use coefficient of variation (C.V.)
Five Number Summary

The five number summary of a data set consists of the minimum value, the first quartile, the second quartile, the third quartile and the maximum value written in that order: Min, Q1, Q2, Q3, Max.
From the three quartiles we can obtain a measure of central tendency (the median, Q2) and measures of variation of the two middle quarters of the distribution, Q2-Q1 for the second quarter and Q3-Q2 for the third quarter.
Five Number Summary

The weekly TV viewing times (in hours):
25 41 27 32 43 66 35 31 15 5
34 26 32 38 16 30 37 30 20 21
The array of the above data is given below:
5 15 16 20 21 25 26 27 30 30
31 32 32 34 35 37 38 41 43 66
Five Number Summary

Location of $Q_1$: $\dfrac{1(20+1)}{4}$th obs. in the data = 5.25th obs.
Value of $Q_1$: 5th obs. + 0.25{6th obs. - 5th obs.} = 21 + 0.25{25 - 21} = 22.0 Hrs
Location of $Q_2$: $\dfrac{2(20+1)}{4}$th obs. in the data = 10.50th obs.
Value of $Q_2$: 10th obs. + 0.50{11th obs. - 10th obs.} = 30 + 0.50{31 - 30} = 30.5 Hrs
Location of $Q_3$: $\dfrac{3(20+1)}{4}$th obs. in the data = 15.75th obs.
Value of $Q_3$: 15th obs. + 0.75{16th obs. - 15th obs.} = 35 + 0.75{37 - 35} = 36.5 Hrs
Minimum value = 5.0
Maximum value = 66.0
Box and Whisker Diagram

A box and whisker diagram, or box-plot, is a graphical means of displaying the five number summary of a set of data. In a box-plot the first quartile is placed at the lower hinge and the third quartile is placed at the upper hinge. The median is placed in between these two hinges. The two lines emanating from the box are called whiskers. The box and whisker diagram was introduced by Professor John W. Tukey.
Construction of Box-Plot
1. Start the box from Q1 and end at Q3
2. Within the box draw a line to represent Q2
3. Draw the lower whisker from the Min. Value up to Q1
4. Draw the upper whisker from Q3 up to the Max. Value
[Figure: box-plot with Min Value, Q1, Q2, Q3 and Max Value labelled.]
Construction of Box-Plot
[Figure: box-plot drawn on a vertical axis from 0 to 70, following steps 1 to 4.]
Q1=22.0, Q2=30.5, Q3=36.5
Minimum Value=5.0
Maximum Value=66.0
Interpretation of Box-Plot
Box-Whisker Plot is useful to identify
Maximum and Minimum Values in the data
Median of the data
IQR=Q3-Q1; a lengthy box indicates more variability in the data
Shape of the data, from the position of the line within the box
    Line at the center of the box: Symmetrical
    Line above center of the box: Negatively skewed
    Line below center of the box: Positively skewed
Detection of Outliers in the data
Outliers
An outlier is a value that falls well outside the overall pattern of the data. It might be
the result of a measurement or recording error,
a member from a different population,
simply an unusual extreme value.
An extreme value need not be an outlier; it might, instead, be an indication of skewness.
Inner and Outer Fences

If Q1=22.0, Q2=30.5 and Q3=36.5, then IQR = Q3 - Q1 = 14.5
Inner Fences:
Lower Inner Fence $= Q_1 - 1.5\,\text{IQR} = 0.25$
Upper Inner Fence $= Q_3 + 1.5\,\text{IQR} = 58.25$
Outer Fences:
Lower Outer Fence $= Q_1 - 3\,\text{IQR} = -21.5$
Upper Outer Fence $= Q_3 + 3\,\text{IQR} = 80.0$
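The fences, and the classification of observations that they imply, can be sketched as follows. The quartiles are taken as given from the worked example (Q1 = 22.0, Q3 = 36.5), the data are the sorted TV viewing times, and the helper name `classify` is hypothetical.

```python
# Quartiles from the worked example and the sorted TV viewing times (hours).
q1, q3 = 22.0, 36.5
tv_hours = [5, 15, 16, 20, 21, 25, 26, 27, 30, 30,
            31, 32, 32, 34, 35, 37, 38, 41, 43, 66]

iqr = q3 - q1                       # 14.5
lower_inner = q1 - 1.5 * iqr        # 0.25
upper_inner = q3 + 1.5 * iqr        # 58.25
lower_outer = q1 - 3 * iqr          # -21.5
upper_outer = q3 + 3 * iqr          # 80.0


def classify(value):
    """Label a value as 'normal', 'mild outlier' or 'sure outlier'."""
    if lower_inner <= value <= upper_inner:
        return "normal"
    if lower_outer <= value <= upper_outer:
        return "mild outlier"
    return "sure outlier"


mild = [v for v in tv_hours if classify(v) == "mild outlier"]
sure = [v for v in tv_hours if classify(v) == "sure outlier"]
print(f"mild outliers: {mild}, sure outliers: {sure}")
```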
Identification of the Outliers
1. The values that lie within inner fences are normal values
2. The values that lie outside inner fences but inside outer fences are possible/suspected/mild outliers
3. The values that lie outside outer fences are sure outliers
Plot each suspected outlier with an asterisk and each sure outlier with a hollow dot.
[Figure: box-plot on a vertical axis from 10 to 80, with 66 marked by an asterisk.]
Only 66 is a mild outlier
Uses of Box and Whisker Diagram

Box plots are especially suitable for comparing two or more data sets. In such a situation the box plots are constructed on the same scale.
[Figure: side-by-side box plots for Male and Female.]
Standardized Variable
A variable that has mean 0 and variance 1 is called a standardized variable.
Values of a standardized variable are called standard scores.
Standard scores are unit-less.
Construction:
$Z = \dfrac{\text{Variable} - \text{Mean of Variable}}{\text{Standard Deviation of Variable}}$
Standardized Variable
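X        (X - Mean)^2    Z          (Z - Mean of Z)^2
3        25              -1.3624    1.8561
6         4              -0.5450    0.2970
11        9               0.8174    0.6682
12       16               1.0899    1.1879
Total 32 54                         4.009
$\bar{X} = \dfrac{\sum X}{n} = \dfrac{32}{4} = 8$
$S_x^2 = \dfrac{\sum (X - \bar{X})^2}{n} = \dfrac{54}{4} = 13.5$, so $S_x = 3.67$
$S_z^2 = \dfrac{\sum (Z - \bar{Z})^2}{n} = \dfrac{4.009}{4} \approx 1$
Variable Z has mean 0 and variance 1, so Z is a standard variable.
Standard Score at X=11 is $Z = \dfrac{X - \bar{X}}{S_x} = \dfrac{11 - 8}{3.67} = 0.8174$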
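The standardization example can be reproduced as a sketch, assuming the four observations are 3, 6, 11 and 12 (the values consistent with the totals 32 and 54). The code keeps the standard deviation at full precision, whereas the slide rounds it to 3.67, so the z-scores here differ from the tabulated ones in the third decimal place.

```python
from math import sqrt

# Observations consistent with the totals in the table above.
x = [3, 6, 11, 12]
n = len(x)

mean_x = sum(x) / n                                   # 32 / 4 = 8
var_x = sum((xi - mean_x) ** 2 for xi in x) / n       # 54 / 4 = 13.5
sd_x = sqrt(var_x)                                    # about 3.674

# Standard scores: subtract the mean, divide by the standard deviation.
z = [(xi - mean_x) / sd_x for xi in x]

mean_z = sum(z) / n
var_z = sum((zi - mean_z) ** 2 for zi in z) / n

print([round(zi, 4) for zi in z])
print(f"mean of Z = {mean_z:.4f}, variance of Z = {var_z:.4f}")
```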
Performance evaluation by z-scores

The industry in which sales rep Mr. Atif works has mean annual sales = $2,500 and standard deviation = $500.
The industry in which sales rep Mr. Asad works has mean annual sales = $4,800 and standard deviation = $600.
Last year Mr. Atif's sales were $4,000 and Mr. Asad's sales were $6,000.
Which of the representatives would you hire if you have one sales position to fill?
Performance evaluation by z-scores

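Sales rep. Atif (subscript B): industry mean = $2,500, $S_B$ = $500, sales $X_B$ = $4,000
$Z_B = \dfrac{X_B - \bar{X}_B}{S_B} = \dfrac{4{,}000 - 2{,}500}{500} = 3$
Sales rep. Asad (subscript P): industry mean = $4,800, $S_P$ = $600, sales $X_P$ = $6,000
$Z_P = \dfrac{X_P - \bar{X}_P}{S_P} = \dfrac{6{,}000 - 4{,}800}{600} = 2$
Mr. Atif is the better choice: his sales lie 3 standard deviations above his industry mean, against 2 for Mr. Asad.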
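The comparison of the two sales representatives is a direct application of the z-score formula. A minimal sketch, with a hypothetical helper `z_score` and the figures from the example:

```python
def z_score(value, mean, sd):
    """Number of standard deviations by which `value` exceeds `mean`."""
    return (value - mean) / sd


z_atif = z_score(4000, mean=2500, sd=500)
z_asad = z_score(6000, mean=4800, sd=600)

better = "Atif" if z_atif > z_asad else "Asad"
print(f"Atif: z = {z_atif}, Asad: z = {z_asad}, better relative performer: {better}")
```

Mr. Asad sold more in absolute terms, but relative to his own industry Mr. Atif's result is the more exceptional one.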
The Empirical Rule

For bell-shaped (approximately normal) data:
$\bar{X} \pm 1S$ contains about 68% of values
$\bar{X} \pm 2S$ contains about 95% of values
$\bar{X} \pm 3S$ contains about 99.7% of values
Measures of Skewness
A distribution in which the values equidistant from the centre have equal frequencies is defined to be symmetrical, and any departure from symmetry is called skewness.
For a symmetrical distribution:
1. Length of Right Tail = Length of Left Tail
2. Mean = Median = Mode
3. Sk=0
Measures of skewness:
a) Sk=(Mean-Mode)/SD
b) Sk=(Q3-2Q2+Q1)/(Q3-Q1)
A distribution is positively skewed if the observations tend to concentrate more at the lower end of the possible values of the variable than at the upper end. A positively skewed frequency curve has a longer tail on the right hand side.
1. Length of Right Tail > Length of Left Tail
2. Mean > Median > Mode
3. Sk>0
A distribution is negatively skewed if the observations tend to concentrate more at the upper end of the possible values of the variable than at the lower end. A negatively skewed frequency curve has a longer tail on the left side.
1. Length of Right Tail < Length of Left Tail
2. Mean < Median < Mode
3. Sk<0
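The two skewness measures, Sk = (Mean - Mode)/SD and Sk = (Q3 - 2Q2 + Q1)/(Q3 - Q1), can be written as small functions. The sketch below applies the quartile-based measure to the TV viewing quartiles found earlier (Q1 = 22.0, Q2 = 30.5, Q3 = 36.5); the function names are hypothetical.

```python
def skewness_mean_mode(mean, mode, sd):
    """Skewness measure (a): (Mean - Mode) / SD."""
    return (mean - mode) / sd


def skewness_quartiles(q1, q2, q3):
    """Skewness measure (b): (Q3 - 2*Q2 + Q1) / (Q3 - Q1)."""
    return (q3 - 2 * q2 + q1) / (q3 - q1)


sk_tv = skewness_quartiles(q1=22.0, q2=30.5, q3=36.5)
print(f"Quartile-based skewness of TV viewing times = {sk_tv:.4f}")
```

The quartile-based measure looks only at the middle half of the data, so a single extreme value such as 66 hours does not affect it; here the median lies slightly closer to Q3 than to Q1, which gives a small negative value.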
Measures of Kurtosis
Kurtosis is the degree of peakedness or flatness of a unimodal (single humped) distribution.
When the values of a variable are highly concentrated around the mode, the peak of the curve becomes relatively high; the curve is Leptokurtic.
When the values of a variable have low concentration around the mode, the peak of the curve becomes relatively flat; the curve is Platykurtic.
A curve that is neither very peaked nor very flat-topped is taken as a basis for comparison and is called Mesokurtic/Normal.
Coefficient of Kurtosis $= \dfrac{n \sum (X - \bar{X})^4}{\left[\sum (X - \bar{X})^2\right]^2}$
1. If Coefficient of Kurtosis > 3: Leptokurtic.
2. If Coefficient of Kurtosis = 3: Mesokurtic.
3. If Coefficient of Kurtosis < 3: Platykurtic.
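A minimal sketch of the coefficient of kurtosis, computed as n times the sum of fourth-power deviations divided by the square of the sum of squared deviations. The data set 4, 6, 9, 12, 13, 16 is used purely as an illustration (it is an assumption here, not a kurtosis example from the slides), and the function name is hypothetical.

```python
def coefficient_of_kurtosis(values):
    """Return n * sum((x - mean)^4) / (sum((x - mean)^2))^2."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)      # sum of squared deviations
    sum_fourth = sum((x - mean) ** 4 for x in values)  # sum of fourth-power deviations
    return n * sum_fourth / sum_sq ** 2


def describe(kurtosis):
    """Map the coefficient to the three shapes named above."""
    if kurtosis > 3:
        return "Leptokurtic"
    if kurtosis < 3:
        return "Platykurtic"
    return "Mesokurtic"


data = [4, 6, 9, 12, 13, 16]      # illustrative data only
k = coefficient_of_kurtosis(data)
print(f"Coefficient of kurtosis = {k:.3f} ({describe(k)})")
```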
