
Statistical Inference
Adnan Butt

4:28 AM

Course Outline
1. Review of Descriptive Statistics and SPSS
2. Random Variable and Mathematical Expectation
3. Discrete Probability Distributions (Binomial, Poisson)
4. Continuous Probability Distribution (Normal)
5. Sampling Theory
6. Confidence Intervals
7. Hypotheses Testing
8. Goodness of Fit
9. Regression and Correlation with ANOVA
10. Multiple Regression
11. All the topics will be SPSS oriented

Recommended Readings (Books)


Introduction to Mathematical Statistics, Hogg, R. V., Craig, A. and McKean, J. W., 6th Edition (2004)
Statistical Inference, Casella, G. and Berger, R. L., 2nd Edition (2002)
Applied Regression Analysis: A Research Tool, Rawlings, J. O., Pantula, S. G. and Dickey, D. A., 2nd Edition (2001)
Introduction to Statistics, Walpole, R. E., 3rd Edition (2000)

Mode of Teaching
Lecture
SPSS Workshop

Discussion Session

Marks Distribution

Mid term               25 Marks
Final                  40 Marks
Quizzes                15 Marks
SPSS                   15 Marks
Class Participation     5 Marks
Total                 100 Marks

Variable
A characteristic or property that varies from individual to individual.

Constant
A characteristic or property that does not change from individual to individual.

Types of Variables
Variables are either Qualitative or Quantitative.
Quantitative variables are either Discrete or Continuous.

Nominal Scale
Variable categories are mutually exclusive and exhaustive.
Variable categories have no logical order.
Examples: Eye Color, Hair Color, Gender.

Ordinal Scale
Data categories are mutually exclusive and exhaustive.
Data classifications are ranked or ordered according to the particular trait they possess.
Example: Level of Knowledge about SPSS.

Interval Scale
Data categories are mutually exclusive and exhaustive.
Data classifications are ranked or ordered according to the particular trait they possess.
Equal differences in the characteristic are represented by equal differences in the measurements.
There is no true zero point.
Examples: Temperature, Shoe Size and IQ scores.

Ratio Scale
Data categories are mutually exclusive and exhaustive.
Data classifications are ranked or ordered according to the particular trait they possess.
Equal differences in the characteristic are represented by equal differences in the measurements.
The zero point is the absence of the characteristic.
Examples: Height, Weight, Distance.

Measurement Scales
Scale      Property                                          Examples
Nominal    Data may only be classified                       Eye Color, Hair Color, Gender
Ordinal    Data are ranked                                   Level of Knowledge about SPSS
Interval   True zero point does not exist                    Temperature, Shoe Size, IQ Scores
Ratio      Meaningful zero point and ratio between values    Height, Weight, Distance

Data
The information collected for any kind of investigation.
Usually Numerical but can be Qualitative.

Primary Data
The initial material collected during the research process.
The information collected directly from the respondent.
Sources: Personal Investigation, Through Investigator, Through Questionnaire, Through Local Sources, Through Telephone.

Secondary Data
The information collected and processed by people other than the researcher.
Sources: Government Organizations, Semi-Government Organizations.

Data Collection
Any of the following methods may be
adopted:
(a) Personal interview
(b) Direct observation
(c) Mail interview (internet interview)
(d) Telephone interview
What are the pros and cons of each?

Data management
Office Editing,
Post Coding,
Data Entry and Verification.

Data organization and Analysis


Preparing data for analysis,
Extracting descriptive measures from the data,
Using advanced statistical techniques to analyze the data and draw inferences from it.

Measures of Central Tendency


Arithmetic Mean
Quantiles (Median, Quartiles, Deciles, Percentiles)
Mode

Arithmetic Mean
A value obtained by dividing the sum of all the observations by their number.

$$\text{Arithmetic Mean} = \frac{\text{Sum of all the observations}}{\text{Number of the observations}}$$

If $X_1, X_2, \ldots, X_n$ are $n$ observations of a variable $X$, then

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Arithmetic Mean
The marks obtained by 8 students are:

67 72 68 70 65 68 75 63

$$\bar{X} = \frac{67 + 72 + \cdots + 63}{8} = \frac{548}{8} = 68.5 \text{ Marks}$$
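The marks example can be checked with a few lines of Python. This is only an illustrative sketch; the function name `arithmetic_mean` is an assumption and does not come from the slides.

```python
# Marks obtained by the 8 students in the example.
marks = [67, 72, 68, 70, 65, 68, 75, 63]


def arithmetic_mean(values):
    """Return the sum of the observations divided by their number."""
    return sum(values) / len(values)


mean_marks = arithmetic_mean(marks)
print(sum(marks), mean_marks)  # 548 68.5
```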

Quantiles
For individual observations or a discrete frequency distribution, the $i$th quartile, $j$th decile and $k$th percentile are located in the array or discrete frequency distribution by the following relations:

$$Q_i = \frac{i(n+1)}{4}\text{th observation in the distribution}, \quad i = 1, 2, 3$$

$$D_j = \frac{j(n+1)}{10}\text{th observation in the distribution}, \quad j = 1, 2, \ldots, 9$$

$$P_k = \frac{k(n+1)}{100}\text{th observation in the distribution}, \quad k = 1, 2, \ldots, 99$$

Quartiles
The weekly TV Watching times (Hours):
25 41 27 32 43 66 35 31 15 5
34 26 32 38 16 30 37 30 20 21

The array of the above data is given below:

5 15 16 20 21 25 26 27 30 30
31 32 32 34 35 37 38 41 43 66


Quartiles
$$Q_1 = \frac{1(20+1)}{4}\text{th observation in the distribution} = 5.25\text{th observation in the distribution}$$

$$= 5\text{th obs.} + 0.25\{6\text{th obs.} - 5\text{th obs.}\}$$

$$= 21 + 0.25\{25 - 21\} = 22.0 \text{ Hours}$$

Quartiles
$$Q_2 = \frac{2(20+1)}{4}\text{th observation in the distribution} = 10.50\text{th observation in the distribution}$$

$$= 10\text{th obs.} + 0.50\{11\text{th obs.} - 10\text{th obs.}\}$$

$$= 30 + 0.50\{31 - 30\} = 30.5 \text{ Hours}$$
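The location rule $i(n+1)/4$ followed by linear interpolation between neighbouring observations can be sketched in Python as below. The function name `quantile` and the clamping of positions that fall outside the array are assumptions made for illustration. The data are the sorted array of TV watching times.

```python
def quantile(sorted_values, i, parts):
    """Locate the i-th quantile at position i(n+1)/parts (1-based) and
    interpolate linearly between the two neighbouring observations."""
    n = len(sorted_values)
    position = i * (n + 1) / parts
    whole = int(position)        # integer part: the lower observation
    fraction = position - whole  # fractional part: interpolation weight
    if whole < 1:                # assumption: clamp to the smallest value
        return float(sorted_values[0])
    if whole >= n:               # assumption: clamp to the largest value
        return float(sorted_values[-1])
    lower = sorted_values[whole - 1]
    upper = sorted_values[whole]
    return lower + fraction * (upper - lower)


tv_hours = [5, 15, 16, 20, 21, 25, 26, 27, 30, 30,
            31, 32, 32, 34, 35, 37, 38, 41, 43, 66]

q1 = quantile(tv_hours, 1, 4)  # 5.25th observation
q2 = quantile(tv_hours, 2, 4)  # 10.50th observation
print(q1, q2)  # 22.0 30.5
```

The same function gives deciles with `parts=10` and percentiles with `parts=100`.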


Mode
The mode is the value that occurs most frequently in a set of data, that is, the value that occurs the maximum number of times in a sequence of observations.

Mode
The total automobile sales (in millions) in
the United States for the last 14 years.
9.0 8.2 8.0 9.1 10.3 11.0 11.5
10.3 10.5 9.8 9.3 8.2 8.2 8.5

Mode = 8.2 million
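As a quick check of the automobile sales example, the frequencies can be tallied with `collections.Counter`. This is a minimal sketch; the variable names are illustrative only. Collecting every value tied for the highest frequency also handles data with more than one mode.

```python
from collections import Counter

# Total automobile sales (in millions) for the 14 years in the example.
sales = [9.0, 8.2, 8.0, 9.1, 10.3, 11.0, 11.5,
         10.3, 10.5, 9.8, 9.3, 8.2, 8.2, 8.5]

counts = Counter(sales)
highest = max(counts.values())
# Keep every value that reaches the highest frequency.
modes = [value for value, count in counts.items() if count == highest]
print(modes, highest)  # [8.2] 3
```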

Measures of variation quantify the variation present among the values of a data set; that is, they measure the spread of the values in the data.

Absolute Measures of Dispersion
Range
Quartile Deviation
Mean (Average) Deviation
Variance and Standard Deviation

Relative Measures of Dispersion
Coefficient of Range
Coefficient of Quartile Deviation
Coefficient of Mean Deviation
Coefficient of Variation (CV)

Range
Difference between the largest and the smallest observations.

$$\text{Range} = X_{\text{Largest}} - X_{\text{Smallest}}$$

Disadvantages of the Range


Ignores the way in which data are distributed
[Figure: two dot plots on an axis from 7 to 12 with differently distributed values; in both, Range = 12 - 7 = 5]

Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119

Inter-quartile Range (IQR)


Inter-quartile range = 3rd quartile - 1st quartile = $Q_3 - Q_1$

IQR is not affected by outliers, because it depends only on the middle 50% of the data.

Inter-quartile Range
Minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, Maximum = 70
Each of the four intervals between these values contains 25% of the observations.

Inter-quartile Range (IQR) = 57 - 30 = 27

The Mean (absolute) Deviation


Mean Deviation is the average of the absolute deviations taken from the mean value.

$X$      $X - \bar{X}$    $|X - \bar{X}|$
8        3                3
5        0                0
2        -3               3
Total    0                6

With $\bar{X} = 15/3 = 5$:

$$MD = \frac{\sum |x - \bar{x}|}{n} = \frac{6}{3} = 2$$
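The mean deviation of the three observations 8, 5 and 2 can be reproduced with a short Python sketch. The function name `mean_deviation` is an assumption for illustration.

```python
def mean_deviation(values):
    """Average of the absolute deviations taken from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


md = mean_deviation([8, 5, 2])
print(md)  # 2.0
```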

Variance
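Variance is the average of the squared deviations taken from the mean value.

$X$ (cm)    $(X - \bar{X})^2$    $X^2$
6           16                   36
4           36                   16
9           1                    81
12          4                    144
13          9                    169
16          36                   256
Total 60    102                  702

(i) $S^2 = \frac{\sum (x - \bar{x})^2}{n} = \frac{102}{6} = 17 \text{ cm}^2$

(ii) $S^2 = \frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2 = \frac{702}{6} - \left(\frac{60}{6}\right)^2 = 17 \text{ cm}^2$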
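Both variance formulas, the definition (i) and the shortcut (ii), can be compared numerically. The sketch below uses the six observations 6, 4, 9, 12, 13 and 16 cm, which reproduce the column totals 60, 102 and 702; the variable names are illustrative.

```python
lengths_cm = [6, 4, 9, 12, 13, 16]
n = len(lengths_cm)
mean = sum(lengths_cm) / n  # 60 / 6 = 10.0

# (i) Definition: average squared deviation from the mean (divisor n).
var_definition = sum((x - mean) ** 2 for x in lengths_cm) / n

# (ii) Shortcut: mean of the squares minus the square of the mean.
var_shortcut = sum(x * x for x in lengths_cm) / n - (sum(lengths_cm) / n) ** 2

print(var_definition, var_shortcut)  # 17.0 17.0
```

Both formulas divide by n, as on the slide, so the result is the population-style variance rather than the n - 1 sample variance.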

Standard Deviation
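Standard deviation is the positive square root of the mean-square deviations of the observations from their arithmetic mean.

$$SD = \sqrt{\text{variance}}$$

Population: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$

Sample: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N - 1}}$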

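Standard Deviation for Grouped Data

SD is:

$$SD = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{N}}$$

where $\bar{x} = \frac{\sum f_i x_i}{N}$ and $N = \sum f_i$.

Simplified formula:

$$SD = \sqrt{\frac{\sum f x^2}{\sum f} - \left(\frac{\sum f x}{N}\right)^2}$$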
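Example-1: Find Standard Deviation of Ungrouped Data

Family No.    Size ($x_i$)    $x_i - \bar{x}$    $(x_i - \bar{x})^2$    $x_i^2$
1             3               -2                 4                      9
2             3               -2                 4                      9
3             4               -1                 1                      16
4             4               -1                 1                      16
5             5               0                  0                      25
6             5               0                  0                      25
7             6               1                  1                      36
8             6               1                  1                      36
9             7               2                  4                      49
10            7               2                  4                      49
Total         50              0                  20                     270

Here, $\bar{x} = \frac{\sum x_i}{n} = \frac{50}{10} = 5$

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n} = \frac{20}{10} = 2$$

$$s = \sqrt{2} = 1.41$$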
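Python's standard `statistics` module provides both divisors: `pstdev` divides by n (as in Example-1) and `stdev` divides by n - 1. The sketch below assumes the ten family sizes are 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, the values implied by the totals 50, 20 and 270.

```python
import math
import statistics

# Assumed family sizes, consistent with the totals 50, 20 and 270.
family_sizes = [3, 3, 4, 4, 5, 5, 6, 6, 7, 7]

population_sd = statistics.pstdev(family_sizes)  # sqrt(20 / 10), divisor n
sample_sd = statistics.stdev(family_sizes)       # sqrt(20 / 9), divisor n - 1

print(round(population_sd, 2), round(sample_sd, 2))  # 1.41 1.49
```

The divisor matters most for small samples; for large n the two values are nearly identical.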

Comparing Standard Deviations


[Figure: dot plots of three data sets on a common axis from 11 to 21]
Data A: Mean = 15.5, S = 3.338
Data B: Mean = 15.5, S = 0.926
Data C: Mean = 15.5, S = 4.567

The smaller the standard deviation, the more tightly clustered the scores are around the mean.
The larger the standard deviation, the more spread out the scores are from the mean.

Relative Measures of Variation


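$$\text{Coefficient of Range} = \frac{X_{\text{Largest}} - X_{\text{Smallest}}}{X_{\text{Largest}} + X_{\text{Smallest}}}$$

$$\text{Coefficient of Quartile Deviation} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

$$\text{Coefficient of Mean Deviation} = \frac{MD}{\text{Mean}}$$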

Coefficient of Variation (CV)


$$CV = \frac{S}{\bar{X}} \times 100\%$$

Can be used to compare two or more sets of data measured in different units, or in the same units but with different average size.

Use of Coefficient of Variation


Stock A:
Average price last year = $50
Standard deviation = $5

$$CV_A = \frac{S}{\bar{X}} \times 100\% = \frac{\$5}{\$50} \times 100\% = 10\%$$

Stock B:
Average price last year = $100
Standard deviation = $5

$$CV_B = \frac{S}{\bar{X}} \times 100\% = \frac{\$5}{\$100} \times 100\% = 5\%$$

Both stocks have the same standard deviation, but stock B is less variable relative to its price.
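The stock comparison can be reproduced with a one-line function. This is a minimal sketch; the name `coefficient_of_variation` is an assumption.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * sd / mean


cv_a = coefficient_of_variation(5, 50)   # Stock A
cv_b = coefficient_of_variation(5, 100)  # Stock B
print(cv_a, cv_b)  # 10.0 5.0
```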

Appropriate Choice of Measure of Variability
If data are symmetric, with no serious outliers, use range and standard deviation.
If data are skewed, and/or have serious outliers, use IQR.
If comparing variation across two data sets, use coefficient of variation (C.V.).

Five Number Summary


The five number summary of a data set consists of the minimum value, the first quartile, the second quartile, the third quartile and the maximum value, written in that order: Min, Q1, Q2, Q3, Max.

From the three quartiles we can obtain a measure of central tendency (the median, Q2) and measures of variation of the two middle quarters of the distribution: Q2 - Q1 for the second quarter and Q3 - Q2 for the third quarter.

Five Number Summary


The weekly TV viewing times (in hours):
25 41 27 32 43 66 35 31 15 5
34 26 32 38 16 30 37 30 20 21

The array of the above data is given below:

5 15 16 20 21 25 26 27 30 30
31 32 32 34 35 37 38 41 43 66

Five Number Summary


LOCATION of $Q_1$: $\frac{1(20+1)}{4}$th obs. in the data $= 5.25$th obs.
VALUE of $Q_1$: 5th obs. $+\ 0.25\{$6th obs. $-$ 5th obs.$\} = 21 + 0.25\{25 - 21\} = 22.0$ Hrs

LOCATION of $Q_2$: $\frac{2(20+1)}{4}$th obs. in the data $= 10.50$th obs.
VALUE of $Q_2$: 10th obs. $+\ 0.50\{$11th obs. $-$ 10th obs.$\} = 30 + 0.50\{31 - 30\} = 30.5$ Hrs

LOCATION of $Q_3$: $\frac{3(20+1)}{4}$th obs. in the data $= 15.75$th obs.
VALUE of $Q_3$: 15th obs. $+\ 0.75\{$16th obs. $-$ 15th obs.$\} = 35 + 0.75\{37 - 35\} = 36.5$ Hrs

Minimum value = 5.0
Maximum value = 66.0
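For a quick cross-check, the standard library function `statistics.quantiles` with `method="exclusive"` places the quartiles at the same i(n+1)/4 positions used above. The sketch below applies it to the sorted array of viewing times; the tuple name `summary` is illustrative.

```python
import statistics

tv_hours = [5, 15, 16, 20, 21, 25, 26, 27, 30, 30,
            31, 32, 32, 34, 35, 37, 38, 41, 43, 66]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(tv_hours, n=4, method="exclusive")

summary = (min(tv_hours), q1, q2, q3, max(tv_hours))
print(summary)  # (5, 22.0, 30.5, 36.5, 66)
```

Other software may use different quantile conventions by default, so results from other tools can differ slightly from these hand calculations.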

Box and Whisker Diagram


A box and whisker diagram, or box-plot, is a graphical means of displaying the five number summary of a set of data. In a box-plot the first quartile is placed at the lower hinge and the third quartile at the upper hinge. The median is placed between these two hinges. The two lines emanating from the box are called whiskers. The box and whisker diagram was introduced by Professor John W. Tukey.

Construction of Box-Plot
1. Start the box from Q1 and end at Q3
2. Within the box draw a line to represent Q2
3. Draw the lower whisker from Min. Value up to Q1
4. Draw the upper whisker from Q3 up to Max. Value
[Figure: box plot labelled, from bottom to top, Min Value, Q1, Q2, Q3, Max Value]

Construction of Box-Plot
Q1 = 22.0, Q2 = 30.5, Q3 = 36.5
Minimum Value = 5.0
Maximum Value = 66.0
[Figure: box plot of the TV viewing data on a vertical axis from 0 to 70]

Interpretation of Box-Plot
Box-Whisker Plot is useful to identify
Maximum and Minimum Values in the data
Median of the data
IQR = Q3 - Q1; a lengthy box indicates more variability in the data
Shape of the data, from the position of the line within the box
Line at the center of the box: Symmetrical
Line above center of the box: Negatively skewed
Line below center of the box: Positively skewed
Detection of Outliers in the data
[Figure: box plot on a vertical axis from 0 to 70]

Outliers
An outlier is a value that falls well outside the overall pattern of the data. It might be
the result of a measurement or recording error,
a member from a different population,
simply an unusual extreme value.

An extreme value need not be an outlier; it might, instead, be an indication of skewness.

Inner and Outer Fences


If $Q_1 = 22.0$, $Q_2 = 30.5$ and $Q_3 = 36.5$, then $IQR = Q_3 - Q_1 = 14.5$.

Inner Fences:
Lower Inner Fence $= Q_1 - 1.5\,IQR = 0.25$
Upper Inner Fence $= Q_3 + 1.5\,IQR = 58.25$

Outer Fences:
Lower Outer Fence $= Q_1 - 3\,IQR = -21.5$
Upper Outer Fence $= Q_3 + 3\,IQR = 80.0$

Identification of the Outliers

1. The values that lie within the inner fences are normal values
2. The values that lie outside the inner fences but inside the outer fences are possible/suspected/mild outliers
3. The values that lie outside the outer fences are sure outliers

Plot each suspected outlier with an asterisk and each sure outlier with a hollow dot.

Only 66 is a mild outlier.
[Figure: box plot of the TV viewing data on a vertical axis from 0 to 80, with 66 marked by an asterisk]
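The fence rules can be sketched as a small classifier. The function names `fences` and `classify` are assumptions, as is the choice to treat a value lying exactly on a fence as inside it; the slides do not specify the boundary case.

```python
def fences(q1, q3):
    """Return (inner, outer) fences as (lower, upper) pairs."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return inner, outer


def classify(value, inner, outer):
    """Label a value using the three rules; fence values count as inside."""
    if inner[0] <= value <= inner[1]:
        return "normal"
    if outer[0] <= value <= outer[1]:
        return "mild outlier"
    return "sure outlier"


tv_hours = [5, 15, 16, 20, 21, 25, 26, 27, 30, 30,
            31, 32, 32, 34, 35, 37, 38, 41, 43, 66]

inner, outer = fences(22.0, 36.5)
flagged = [x for x in tv_hours if classify(x, inner, outer) != "normal"]
print(inner, outer, flagged)  # (0.25, 58.25) (-21.5, 80.0) [66]
```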

Uses of Box and Whisker Diagram


Box plots are especially suitable for comparing two or more data sets. In such a situation the box plots are constructed on the same scale.
[Figure: side-by-side box plots for Male and Female]

Standardized Variable
A variable that has mean 0 and variance 1 is called a standardized variable.
Values of a standardized variable are called standard scores.
Standard scores are unit-less.
Construction:

$$Z = \frac{\text{Variable} - \text{Mean of Variable}}{\text{Standard Deviation of Variable}}$$

Standardized Variable
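$X$       $(X - \bar{X})^2$    $Z$        $(Z - \bar{Z})^2$
3         25                   -1.3624    1.8561
6         4                    -0.5450    0.2970
11        9                    0.8174     0.6682
12        16                   1.0899     1.1879
Total 32  54                   0          4.009

$$\bar{X} = \frac{\sum X}{n} = \frac{32}{4} = 8, \qquad S_x^2 = \frac{\sum (X - \bar{X})^2}{n} = \frac{54}{4} = 13.5, \qquad S_x = 3.67$$

$$\bar{Z} = \frac{\sum Z}{n} = 0, \qquad S_z^2 = \frac{\sum (Z - \bar{Z})^2}{n} = \frac{4.009}{4} = 1$$

Variable Z has mean 0 and variance 1, so Z is a standard variable.

Standard Score at X = 11 is $Z = \frac{X - \bar{X}}{S_x} = \frac{11 - 8}{3.67} = 0.8174$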
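That standardizing always yields mean 0 and variance 1 can be verified numerically. The sketch below uses the four observations 3, 6, 11 and 12, which are consistent with the totals 32 and 54; the variable names are illustrative.

```python
import math

x = [3, 6, 11, 12]
n = len(x)

mean_x = sum(x) / n                                        # 8.0
sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / n)    # sqrt(13.5)

# Standard scores: subtract the mean, divide by the standard deviation.
z = [(v - mean_x) / sd_x for v in x]

mean_z = sum(z) / n
var_z = sum((v - mean_z) ** 2 for v in z) / n
print(round(mean_z, 10) == 0, round(var_z, 10))  # True 1.0
```

With the unrounded standard deviation the standard score at X = 11 is about 0.8165; the slide's 0.8174 comes from rounding the standard deviation to 3.67 first.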

Performance evaluation by z-scores


The industry in which sales rep Mr. Atif works has mean annual sales = $2,500 and standard deviation = $500.
The industry in which sales rep Mr. Asad works has mean annual sales = $4,800 and standard deviation = $600.

Last year Mr. Atif's sales were $4,000 and Mr. Asad's sales were $6,000.
Which of the representatives would you hire if you have one sales position to fill?

Performance evaluation by z-scores


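Sales rep. Atif: $\bar{X}_B = \$2{,}500$, $S_B = \$500$, $X_B = \$4{,}000$

$$Z_B = \frac{X_B - \bar{X}_B}{S_B} = \frac{4{,}000 - 2{,}500}{500} = 3$$

Sales rep. Asad: $\bar{X}_P = \$4{,}800$, $S_P = \$600$, $X_P = \$6{,}000$

$$Z_P = \frac{X_P - \bar{X}_P}{S_P} = \frac{6{,}000 - 4{,}800}{600} = 2$$

Mr. Atif is the better choice: his sales are 3 standard deviations above his industry mean, against 2 for Mr. Asad.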
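The comparison of the two sales representatives reduces to two z-scores. A minimal sketch, with `z_score` as an assumed helper name:

```python
def z_score(value, mean, sd):
    """Number of standard deviations by which value exceeds the mean."""
    return (value - mean) / sd


z_atif = z_score(4000, 2500, 500)
z_asad = z_score(6000, 4800, 600)
print(z_atif, z_asad)  # 3.0 2.0
```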

The Empirical Rule


For bell-shaped (approximately normal) data:
$\bar{X} \pm 1S$ contains about 68% of values
$\bar{X} \pm 2S$ contains about 95% of values
$\bar{X} \pm 3S$ contains about 99.7% of values

Measures of Skewness
A distribution in which the values equidistant from the centre have equal frequencies is defined to be symmetrical, and any departure from symmetry is called skewness.

For a symmetrical distribution:
1. Length of Right Tail = Length of Left Tail
2. Mean = Median = Mode
3. Sk = 0, where
a) Sk = (Mean - Mode)/SD
b) Sk = (Q3 - 2Q2 + Q1)/(Q3 - Q1)

Measures of Skewness
A distribution is positively skewed if the observations tend to concentrate more at the lower end of the possible values of the variable than at the upper end. A positively skewed frequency curve has a longer tail on the right-hand side.

1. Length of Right Tail > Length of Left Tail
2. Mean > Median > Mode
3. Sk > 0

Measures of Skewness
A distribution is negatively skewed if the observations tend to concentrate more at the upper end of the possible values of the variable than at the lower end. A negatively skewed frequency curve has a longer tail on the left-hand side.

1. Length of Right Tail < Length of Left Tail
2. Mean < Median < Mode
3. Sk < 0
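The two skewness coefficients, Sk = (Mean - Mode)/SD and Sk = (Q3 - 2Q2 + Q1)/(Q3 - Q1), can be sketched as below. The function names are assumptions; the coefficients are commonly attributed to Pearson and Bowley respectively, which the slides do not state. The quartile version is applied to the TV viewing quartiles (22.0, 30.5, 36.5).

```python
def mode_skewness(mean, mode, sd):
    """Sk = (Mean - Mode) / SD."""
    return (mean - mode) / sd


def quartile_skewness(q1, q2, q3):
    """Sk = (Q3 - 2*Q2 + Q1) / (Q3 - Q1); always lies between -1 and 1."""
    return (q3 - 2 * q2 + q1) / (q3 - q1)


sk_tv = quartile_skewness(22.0, 30.5, 36.5)
print(round(sk_tv, 4))  # -0.1724
```

The quartile coefficient only looks at the middle half of the data, so it can be slightly negative here even though the single extreme value 66 lies in the right tail.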

Measures of Kurtosis

Kurtosis is the degree of peakedness or flatness of a unimodal (single-humped) distribution.
When the values of a variable are highly concentrated around the mode, the peak of the curve becomes relatively high; the curve is Leptokurtic.
When the values of a variable have low concentration around the mode, the peak of the curve becomes relatively flat; the curve is Platykurtic.
A curve that is neither very peaked nor very flat-topped, and is taken as a basis for comparison, is called Mesokurtic (Normal).


Measures of Kurtosis
$$\text{Coefficient of Kurtosis} = \frac{n \sum (X - \bar{X})^4}{\left[\sum (X - \bar{X})^2\right]^2}$$

1. If Coefficient of Kurtosis > 3: Leptokurtic.
2. If Coefficient of Kurtosis = 3: Mesokurtic.
3. If Coefficient of Kurtosis < 3: Platykurtic.
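A minimal Python sketch of the coefficient, n times the sum of fourth-power deviations divided by the squared sum of squared deviations, is given below. The function name and the data set 1, 2, 3, 4, 5 are hypothetical and chosen only to illustrate the arithmetic.

```python
def kurtosis_coefficient(values):
    """n * sum((x - mean)^4) / (sum((x - mean)^2))^2."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    sum_fourth = sum((x - mean) ** 4 for x in values)
    return n * sum_fourth / sum_sq ** 2


# Hypothetical, evenly spread data: 5 * 34 / 10**2 = 1.7.
k = kurtosis_coefficient([1, 2, 3, 4, 5])
shape = "Leptokurtic" if k > 3 else "Platykurtic" if k < 3 else "Mesokurtic"
print(round(k, 4), shape)  # 1.7 Platykurtic
```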