Basic Statistics
Basic Statistics - Agenda
1. Introduction
2. Variable Data
3. Measures of Location and Dispersion
4. The Normal Distribution
5. Non-Normal Data
6. Data Transformation
7. Testing for Normality
8. Attribute Data
9. Defective Items – Binomial Distribution
10. Defects – Poisson Distribution
11. Z-Tables
Six Sigma - Statistics
Decisions are best made with the aid of facts and data
We collect DATA. Statistics translates it into USEFUL INFORMATION.
The Use of Statistics
By examining a sample of 50
items we can:
• Describe and summarise the set
of data (descriptive statistics)
• Make predictions about the
whole population (inferential
statistics)
Characteristics of Variable Data
• Frequency Distribution
• Tally Chart
• Histogram
• Measures of Location - Mean, Median, Mode
• Measures of Variation – Range, Standard Deviation
• etc.
Random Sample - 100 SAT Verbal Scores
546 592 591 602 691
689 644 546 602 695
490 536 618 669 599
531 586 622 689 560
603 555 464 599 618
549 612 641 597 622
663 546 534 740 644
515 496 503 599 618
557 631 502 605 547
673 708 624 528 645
650 656 599 586 536
546 515 644 599 734
502 541 530 663 599
547 579 666 578 635
496 541 605 560 695
426 555 483 641 546
515 609 534 645 572
637 457 631 721 578
541 592 666 619 663
547 624 567 489 528
Random Sample - 100 SAT Verbal Scores
P P P P P
P P P P P
F P P P P
P P P P P
P P F P P
P P P P P
P P P P P
P F P P P
P P P P P
P P P P P
P P P P P
P P P P P
P P P P P
P P P P P
F P P P P
F P F P P
P P P P P
P F P P P
P P P P P
P P P F P
0-499 Fail 8
500+ Pass 92
Random Sample - 100 SAT Verbal Scores
400-499 Red 8
500-599 Yellow 48
600+ Green 44
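The pass/fail and colour-band counts above can be reproduced with a few lines of Python. This is a minimal sketch: the scores are copied from the sample listed earlier, the cut-offs are the ones shown on the slides, and the function and variable names are illustrative assumptions.

```python
from collections import Counter

# The 100 SAT verbal scores from the sample, in the order listed.
SAT_SCORES = [
    546, 592, 591, 602, 691, 689, 644, 546, 602, 695,
    490, 536, 618, 669, 599, 531, 586, 622, 689, 560,
    603, 555, 464, 599, 618, 549, 612, 641, 597, 622,
    663, 546, 534, 740, 644, 515, 496, 503, 599, 618,
    557, 631, 502, 605, 547, 673, 708, 624, 528, 645,
    650, 656, 599, 586, 536, 546, 515, 644, 599, 734,
    502, 541, 530, 663, 599, 547, 579, 666, 578, 635,
    496, 541, 605, 560, 695, 426, 555, 483, 641, 546,
    515, 609, 534, 645, 572, 637, 457, 631, 721, 578,
    541, 592, 666, 619, 663, 547, 624, 567, 489, 528,
]


def band(score):
    """Return the colour band for one score (cut-offs taken from the slide)."""
    if score < 500:
        return "Red"
    if score < 600:
        return "Yellow"
    return "Green"


# Attribute view 1: three colour bands.
band_counts = Counter(band(score) for score in SAT_SCORES)

# Attribute view 2: pass (500 or more) versus fail.
pass_count = sum(1 for score in SAT_SCORES if score >= 500)
fail_count = len(SAT_SCORES) - pass_count

print(dict(band_counts), pass_count, fail_count)
```

Turning the scores into bands discards detail: two scores of 500 and 599 land in the same band even though they are far apart.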
SAT Verbal Data - Arranged in Order
426 536 572 605 645
457 536 578 609 645
464 541 578 612 650
483 541 579 618 656
489 541 586 618 663
490 546 586 618 663
496 546 591 619 663
496 546 592 622 666
502 546 592 622 666
502 546 597 624 669
503 547 599 624 673
515 547 599 631 689
515 547 599 631 689
515 549 599 635 691
528 555 599 637 695
528 555 599 641 695
530 557 602 641 708
531 560 602 644 721
534 560 603 644 734
534 567 605 644 740
Tally Chart
Class  Class limits  Tallies                  Class frequency
1      425-449       I                        1
2      450-474       II                       2
3      475-499       IIIII                    5
4      500-524       IIIII I                  6
5      525-549       IIIII IIIII IIIII IIIII  20
6      550-574       IIIII II                 7
7      575-599       IIIII IIIII IIIII        15
8      600-624       IIIII IIIII IIIII        15
9      625-649       IIIII IIIII I            11
10     650-674       IIIII IIII               9
11     675-699       IIIII                    5
12     700-724       II                       2
13     725-749       II                       2
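A tally chart is a frequency distribution over equal-width classes. The sketch below rebuilds one from the raw scores, assuming the class width of 25 and the first lower limit of 425 used in the chart. The helper name `class_lower_limit` is an illustrative assumption.

```python
from collections import Counter

# The 100 SAT verbal scores from the sample, in the order listed.
SAT_SCORES = [
    546, 592, 591, 602, 691, 689, 644, 546, 602, 695,
    490, 536, 618, 669, 599, 531, 586, 622, 689, 560,
    603, 555, 464, 599, 618, 549, 612, 641, 597, 622,
    663, 546, 534, 740, 644, 515, 496, 503, 599, 618,
    557, 631, 502, 605, 547, 673, 708, 624, 528, 645,
    650, 656, 599, 586, 536, 546, 515, 644, 599, 734,
    502, 541, 530, 663, 599, 547, 579, 666, 578, 635,
    496, 541, 605, 560, 695, 426, 555, 483, 641, 546,
    515, 609, 534, 645, 572, 637, 457, 631, 721, 578,
    541, 592, 666, 619, 663, 547, 624, 567, 489, 528,
]

CLASS_WIDTH = 25         # width of each class, as in the tally chart
FIRST_LOWER_LIMIT = 425  # lower limit of class 1


def class_lower_limit(score):
    """Return the lower limit of the class that contains the score."""
    steps = (score - FIRST_LOWER_LIMIT) // CLASS_WIDTH
    return FIRST_LOWER_LIMIT + steps * CLASS_WIDTH


frequencies = Counter(class_lower_limit(score) for score in SAT_SCORES)

# Print one row per class: limits, tally marks, frequency.
for lower in range(FIRST_LOWER_LIMIT, 750, CLASS_WIDTH):
    count = frequencies[lower]
    tallies = "I" * count
    print(f"{lower}-{lower + CLASS_WIDTH - 1}  {tallies:<20}  {count}")
```

The same counts, drawn as bars over the class boundaries, give the histogram.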
Histogram
[Figure: histogram of the SAT scores; x-axis class boundaries from 424.5 to 749.5 in steps of 25, y-axis Frequency]
[Figure: Histogram of SAT; x-axis SAT (380 to 780), y-axis Frequency]
Dotplot
[Figure: Dotplot of SAT]
The Mode
• The mode is defined as the value in the sample
which occurs most frequently.
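The Median
• The median is defined as the middle value in the sample when the observations are arranged in order. With an even number of observations it is the average of the two middle values.
The Mean
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$
usually abbreviated to
$$\bar{y} = \frac{\sum y}{n}$$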
Measures of Dispersion
The Range
The range is defined as the largest sample observation
minus the smallest sample observation
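As an illustration, the measures of location and the range can be computed for the SAT sample with Python's standard `statistics` module. This is a sketch; the scores are copied from the sample listed earlier and the variable names are assumptions.

```python
import statistics

# The 100 SAT verbal scores from the sample, in the order listed.
SAT_SCORES = [
    546, 592, 591, 602, 691, 689, 644, 546, 602, 695,
    490, 536, 618, 669, 599, 531, 586, 622, 689, 560,
    603, 555, 464, 599, 618, 549, 612, 641, 597, 622,
    663, 546, 534, 740, 644, 515, 496, 503, 599, 618,
    557, 631, 502, 605, 547, 673, 708, 624, 528, 645,
    650, 656, 599, 586, 536, 546, 515, 644, 599, 734,
    502, 541, 530, 663, 599, 547, 579, 666, 578, 635,
    496, 541, 605, 560, 695, 426, 555, 483, 641, 546,
    515, 609, 534, 645, 572, 637, 457, 631, 721, 578,
    541, 592, 666, 619, 663, 547, 624, 567, 489, 528,
]

# Measures of location.
sat_mode = statistics.mode(SAT_SCORES)      # most frequent value
sat_median = statistics.median(SAT_SCORES)  # middle of the ordered data
sat_mean = statistics.mean(SAT_SCORES)      # sum divided by n

# Measure of dispersion: largest observation minus smallest observation.
sat_range = max(SAT_SCORES) - min(SAT_SCORES)

print(sat_mode, sat_median, sat_mean, sat_range)
```

With 100 observations the median is the average of the 50th and 51st ordered values.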
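The Standard Deviation
$$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$$
Where: $\sum$ = summation
$\bar{y}$ = mean
$y$ = individual data value
$n$ = number of data values
Worked example: $y$ = 2, 4, 4, 5, 6, 9. Tabulate $y$, $y - \bar{y}$ and $(y - \bar{y})^2$ for each value, then calculate $\bar{y} = \frac{\sum y}{n}$ and $s$.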
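The worked example can be checked step by step. The sketch below uses the six values from the y column above (2, 4, 4, 5, 6, 9) and builds the deviation and squared-deviation columns explicitly. The variable names are illustrative assumptions.

```python
import math

# The six observations from the worked example.
y_values = [2, 4, 4, 5, 6, 9]
n = len(y_values)

# Mean: sum of the observations divided by n.
mean = sum(y_values) / n

# Column 2: deviation of each value from the mean.
deviations = [y - mean for y in y_values]

# Column 3: squared deviations.
squared_deviations = [d ** 2 for d in deviations]

# Sample standard deviation: divide by n - 1, then take the square root.
s = math.sqrt(sum(squared_deviations) / (n - 1))

for y, d, d2 in zip(y_values, deviations, squared_deviations):
    print(f"{y:>3} {d:>6.1f} {d2:>6.1f}")
print(f"mean = {mean}, s = {s:.3f}")
```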
Standard Deviation & Variance
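$$s^2 = \frac{\sum y^2 - \frac{(\sum y)^2}{n}}{n - 1}, \qquad s^2 = \sigma_{n-1}^2 = \text{sample variance}$$
$$s = \sqrt{\frac{\sum y^2 - \frac{(\sum y)^2}{n}}{n - 1}}, \qquad s = \sigma_{n-1} = \text{sample standard deviation}$$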
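The computational form needs only two running totals, the sum of y and the sum of y squared. As a sketch, it can be compared with the library routine `statistics.variance`, which also divides by n - 1. The data are the six values 2, 4, 4, 5, 6, 9 from the earlier worked example; the variable names are assumptions.

```python
import math
import statistics

y_values = [2, 4, 4, 5, 6, 9]
n = len(y_values)

# The two totals required by the computational formula.
sum_y = sum(y_values)
sum_y_squared = sum(y * y for y in y_values)

# Sample variance and sample standard deviation (n - 1 divisor).
sample_variance = (sum_y_squared - sum_y ** 2 / n) / (n - 1)
sample_sd = math.sqrt(sample_variance)

# Cross-check against the standard library.
library_variance = statistics.variance(y_values)

print(sum_y, sum_y_squared, sample_variance, sample_sd, library_variance)
```

The shortcut is convenient for hand calculation, but with large values and small spread it can lose precision in floating point, so the deviation-based form is usually the safer choice in code.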
Descriptive Statistics
            Mean        Standard deviation  Variance
Sample      $\bar{y}$   $s$                 $s^2$
Population  $\mu$       $\sigma$            $\sigma^2$
[Figure: normal curve with the horizontal axis marked at -3σ, -2σ, -σ, the mean, σ, 2σ and 3σ]
$$z = \frac{y - \mu}{\sigma}$$
This equation is extremely useful in determining areas
under the standard normal curve. The variable z, the
standard normal variable, is used for this purpose and
values of z are tabulated in statistical tables.
Standard Normal Distribution
[Figure: standard normal curve with the horizontal axis marked at -3σ, -2σ, -σ, the mean, σ, 2σ and 3σ]
Example
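[Figure: normal curve with mean 35 and standard deviation 5; horizontal axis from 20 to 50]
$$z = \frac{y - \mu}{\sigma} = \frac{43 - 35}{5} = 1.60$$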
Looking up a value of z = 1.60, we see that a value of 0.9452
is tabulated. This means that 94.52% of our population
would be expected to have a value less than 43.
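The table look-up can be reproduced numerically. The sketch below assumes the values used in the example (mean 35, standard deviation 5, y = 43) and evaluates the standard normal cumulative probability with `math.erf` from the standard library. The function names are illustrative assumptions.

```python
import math


def z_score(y, mu, sigma):
    """Standardise y: how many standard deviations it lies from the mean."""
    return (y - mu) / sigma


def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


z = z_score(43, 35, 5)
proportion_below = standard_normal_cdf(z)
proportion_above = 1.0 - proportion_below

print(f"z = {z:.2f}")
print(f"below 43: {proportion_below:.4f}, above 43: {proportion_above:.4f}")
```

The area above a value is one minus the tabulated area, because the total area under the curve is 1.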
Workshop
[Figure: normal curve with the horizontal axis from 20 to 50]
The Normal Distribution - Things to Remember
• 95% confidence (risk 0.05): 19 times out of 20 the true value will lie within the confidence interval; only 1 in 20 times will the true value lie outside the confidence interval.
• 99% confidence (risk 0.01): 99 times out of 100 the true value will lie within the confidence interval; only 1 in 100 times will the true value lie outside the confidence interval.
• 99.9% confidence (risk 0.001): 999 times out of 1000 the true value will lie within the confidence interval; only 1 in 1000 times will the true value lie outside the confidence interval.
Confidence Interval for the Mean
The confidence interval for the mean can be calculated as
follows:
$$\bar{y} \pm t_{\alpha/2} \times \frac{\sigma_{n-1}}{\sqrt{n}}$$
Where: $\bar{y}$ = sample mean
$t_{\alpha/2}$ = t distribution critical value, with n-1 df
$\sigma_{n-1}$ = sample standard deviation
$n$ = sample size
This assumes that the underlying distribution of y is normal, but
the calculation is fairly robust to violations of this
assumption.
Confidence Interval for Mean
(Normal Distribution)
Summary for mpg
Anderson-Darling Normality Test
A-Squared 0.63
P-Value 0.092
Mean 33.417
StDev 1.604
Variance 2.572
Skewness -0.21121
Kurtosis -1.16145
N 30
Minimum 30.450
1st Quartile 31.861
Median 33.844
3rd Quartile 34.890
Maximum 36.162
95% Confidence Interval for Mean 32.818 34.016
95% Confidence Interval for Median 32.378 34.380
[Figure: histogram of mpg; horizontal axis from 30 to 36]
$$\bar{y} \pm t_{\alpha/2} \times \frac{\sigma_{n-1}}{\sqrt{n}} = 33.417 \pm 2.045 \times \frac{1.604}{\sqrt{30}} = 33.417 \pm 2.045 \times 0.293 = 33.417 \pm 0.599$$
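The interval for the mean can be recomputed from the summary statistics alone. This sketch assumes the figures reported for mpg (mean 33.417, standard deviation 1.604, N = 30) and takes the t critical value 2.045 for 29 degrees of freedom from the worked calculation rather than computing it. The function name is an illustrative assumption.

```python
import math


def mean_confidence_interval(mean, sd, n, t_critical):
    """Return (lower, upper) for: mean +/- t * sd / sqrt(n)."""
    standard_error = sd / math.sqrt(n)
    half_width = t_critical * standard_error
    return mean - half_width, mean + half_width


# Summary statistics for mpg; t value for 95% confidence and 29 df
# is taken from the slide, not calculated here.
lower, upper = mean_confidence_interval(
    mean=33.417, sd=1.604, n=30, t_critical=2.045
)

print(f"95% CI for the mean: {lower:.3f} to {upper:.3f}")
```

For a different sample size or confidence level the critical value has to be looked up again in a t table, since it depends on both.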
Where: $\chi^2$ = critical value of the Chi-squared distribution
$\sigma_{n-1}$ = sample standard deviation
$n$ = sample size
This formula assumes normality. Large errors are likely if
the underlying distribution is non-normal.
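For reference, the usual textbook form of this interval is sketched below. It is stated here as an assumption about the intended formula, using the symbols defined above, with $\chi^2_{p,\,n-1}$ taken to mean the value exceeded with probability $p$ by a Chi-squared variable with n-1 degrees of freedom.

```latex
\sqrt{\frac{(n-1)\,\sigma_{n-1}^{2}}{\chi^{2}_{\alpha/2,\;n-1}}}
\;\le\; \sigma \;\le\;
\sqrt{\frac{(n-1)\,\sigma_{n-1}^{2}}{\chi^{2}_{1-\alpha/2,\;n-1}}}
```

The larger critical value gives the lower limit and the smaller one gives the upper limit, so the interval is not symmetric about the sample standard deviation.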
Confidence Interval for Standard Deviation
(Normal Distribution)
[Figure: histogram of Weight; x-axis Weight (60 to 140), y-axis Frequency]
A P-value is returned.
The P-value is the probability of getting sample data at least as extreme as that observed if the null
hypothesis (that the data is normal) is true. We generally accept that the data is from a
normal distribution if the P-value is greater than 0.05 (the alpha risk).
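The decision rule can be written as a small function. This is only a sketch of the comparison with the alpha risk. It does not perform a normality test, and the function name and return strings are illustrative assumptions. The two P-values used are the ones reported in this deck for Weight (0.015) and Ln Weight (0.518).

```python
ALPHA = 0.05  # alpha risk used on the slide


def normality_decision(p_value, alpha=ALPHA):
    """Compare a normality-test P-value with the alpha risk."""
    if p_value > alpha:
        return "do not reject the null hypothesis (treat data as normal)"
    return "reject the null hypothesis (treat data as non-normal)"


weight_decision = normality_decision(0.015)     # P-value reported for Weight
ln_weight_decision = normality_decision(0.518)  # P-value reported for Ln Weight

print("Weight:   ", weight_decision)
print("Ln Weight:", ln_weight_decision)
```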
Normality Test
[Figure: normal probability plot of Weight]
We accept the alternate hypothesis that the data is non-normal.
Reasons for Failing a Normality Test
1. A shift occurred in the middle of the data
2. Mixed populations
3. Truncated data
4. Rounding to a small number of values
5. Outliers
6. Too much data
7. The underlying distribution is not normal
With this data set, the most likely reason for failing is that
the underlying distribution is not normal. We generally
would need to investigate the other reasons before
reaching this conclusion!
Data Transformation
[Figure: Histogram of Weight; x-axis Weight (60 to 140), y-axis Frequency]
Power           -2.0     -1.0   -0.5         0      0.5       1.0  2.0    3.0
Transformation  $1/y^2$  $1/y$  $1/\sqrt{y}$ $\ln y$ $\sqrt{y}$ $y$  $y^2$  $y^3$
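The ladder of transformations maps each power to a function of y, with the natural log taking the place of power 0. A minimal sketch follows; the function name and the four sample weights are hypothetical and are not taken from the Weight data set.

```python
import math

# Powers on the ladder, in the order shown above.
LADDER = [-2.0, -1.0, -0.5, 0, 0.5, 1.0, 2.0, 3.0]


def power_transform(y, power):
    """Apply one rung of the ladder; power 0 means the natural log."""
    if y <= 0:
        raise ValueError("these transformations need positive data")
    if power == 0:
        return math.log(y)
    return y ** power


# Hypothetical weights, for illustration only.
sample_weights = [60.0, 80.0, 100.0, 140.0]

transformed = {
    power: [power_transform(w, power) for w in sample_weights]
    for power in LADDER
}

for power in LADDER:
    print(power, [round(v, 4) for v in transformed[power]])
```

Powers below 1 pull in a long right-hand tail, which is why the log is a common first choice for right-skewed data such as Weight.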
[Figure: normal probability plot of Weight (Percent against Weight)]
StDev 14.91
N 100
AD 0.954
P-Value 0.015
We reject the null hypothesis (p<0.05) and accept the alternate hypothesis that the data is non-normal.
Histogram – Transformed Data
[Figure: Histogram of Ln Weight; x-axis Ln Weight (4.0 to 4.8), y-axis Frequency]
Normality Test – Transformed Data
[Figure: normal probability plot of Ln Weight]
We do not reject the null hypothesis that the data is normal.
Ln(Weight) – Graphical Summary
Summary for Ln Weight
Anderson-Darling Normality Test
A-Squared 0.33
P-Value 0.518
Mean 4.3978
StDev 0.1758
Variance 0.0309
Skewness 0.184802
Kurtosis 0.561530
N 100
Minimum 3.9703
1st Quartile 4.2905
Median 4.3820
3rd Quartile 4.5081
Maximum 4.9273
95% Confidence Interval for Mean 4.3629 4.4327
95% Confidence Interval for Median 4.3567 4.4308
95% Confidence Interval for StDev 0.1543 0.2042
[Figure: histogram of Ln Weight (4.0 to 4.8), with 95% confidence intervals for the mean and median]
Shift
Number of defects
Number of changes
Number of accidents
Number of failures
Attribute Data
9 Items (9 invoices)
3 Defective Items (3 invoices with errors)
5 Defects (5 errors)
Defective Items
Parameters
$$P(y) = \frac{n!}{y!\,(n - y)!}\; p^{y} q^{n-y}$$
Where:
y = number of successes
n = number of independent trials (or items)
p = probability of success
q = probability of failure
Note: p + q = 1
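The binomial formula translates directly into code. The sketch below uses `math.comb` for the n!/(y!(n-y)!) term. Here "success" means finding a defective item, and the values n = 9 and p = 0.1 are hypothetical, chosen only for illustration.

```python
import math


def binomial_pmf(y, n, p):
    """Probability of exactly y successes in n independent trials."""
    q = 1.0 - p  # probability of failure, so that p + q = 1
    return math.comb(n, y) * p ** y * q ** (n - y)


# Hypothetical example: 9 invoices, each with a 10% chance of being defective.
n_items = 9
p_defective = 0.1

probabilities = [binomial_pmf(y, n_items, p_defective) for y in range(n_items + 1)]

for y, prob in enumerate(probabilities):
    print(f"P({y} defective) = {prob:.4f}")
```

The probabilities over all possible values of y, from 0 to n, add up to 1.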
Binomial Data – Workshop 1
$$P(0) = \frac{e^{-\mu}\,\mu^{y}}{y!} = \frac{e^{-0.1575} \times 0.1575^{0}}{0!} = e^{-0.1575} \approx 0.8543$$
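The Poisson calculation above can be sketched in code as follows. The mean of 0.1575 defects per unit is the value used in the calculation; the function name is an illustrative assumption.

```python
import math


def poisson_pmf(y, mu):
    """Probability of exactly y defects when the mean number of defects is mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)


mean_defects = 0.1575  # mean defects per unit, from the calculation above

# Probability of a unit with no defects at all.
p_zero_defects = poisson_pmf(0, mean_defects)

# Probability of at least one defect is the complement.
p_one_or_more = 1.0 - p_zero_defects

print(f"P(0) = {p_zero_defects:.4f}, P(1 or more) = {p_one_or_more:.4f}")
```

For y = 0 the formula reduces to e to the power of minus the mean, so the chance of a defect-free unit depends only on the mean number of defects.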
9 Items (9 invoices)
3 Defective Items (3 invoices with errors)
5 Defects (5 errors)
Binomial v Poisson
9 Items (9 invoices)
3 Defective Items (3 invoices with errors)
5 Defects (5 errors)
Basic Statistics - Summary
• Variable data provides a fuller description of our processes than
attribute data
• Many continuous distributions follow the Normal Distribution
• Normality must not be assumed
• There is a difference between non-Normal and unnatural data
• Some data sets are naturally non-Normal
• Defective Items can often be characterised using the Binomial
Distribution
• Defects (errors) can often be characterised using the Poisson
Distribution
• Understanding the underlying distribution of a data set allows us to
employ the correct statistical procedures