
Appendix 2

Introduction to basic
statistics

Statistics
Statistics is the art and science of using numerical facts and figures.
In Wikipedia, statistics is defined as a mathematical science pertaining to the
collection, analysis, interpretation or explanation, and presentation of data.
There are three kinds of statistics:
Descriptive statistics
Inferential statistics
Causal modelling
Descriptive statistics primarily deals with the description and interpretation
of data by figures, graphs and charts.
Inferential statistics is the science of making decisions in the face of
uncertainty by using techniques such as sampling and probability.
Causal modelling is a part of inferential statistics. It is aimed at advancing
reasonable hypotheses about underlying causal relationships between the
dependent and independent variables.
In Six Sigma projects, the most frequently used statistics are descriptive
statistics, and the most useful distribution is the normal distribution.

Descriptive statistics
Data distribution
Normal Distribution, other distributions
Measures of central tendency
Mean, Median, Mode
Measures of dispersion
Standard Deviation, Variance and Range

Normal Distribution
Most data tends to follow the normal distribution or bell shaped curve. One
of the key properties of the normal distribution is the relationship between
the shape of the curve and the standard deviation.
Appendix 2: Introduction to basic statistics 321

99.73% of the area of the normal distribution is contained between −3
sigma and +3 sigma from the mean. Another way of expressing this is that
0.27% of the data falls outside ±3 standard deviations from the mean.

[Figure: Normal Distribution bell-shaped curve. Areas under the curve:
34.13% on each side of the mean within 1 sigma, 13.60% between 1 and
2 sigma, 2.14% between 2 and 3 sigma, and 0.13% in each tail beyond
3 sigma.]

s = Sigma
68.26% fall within ±1 Sigma
95.46% fall within ±2 Sigma
99.73% fall within ±3 Sigma
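These within-k-sigma percentages can be reproduced from the error function. A minimal sketch (Python standard library only, added here purely for illustration):

```python
import math

def within_k_sigma(k):
    """Fraction of a normal distribution lying within +/- k standard deviations."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within +/-{k} sigma: {within_k_sigma(k) * 100:.2f}%")
```

This prints approximately 68.27%, 95.45% and 99.73%; the 68.26% and 95.46% quoted above differ only in rounding convention.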


A population is the total number in the study (e.g. a census) and a sample is
a small subset of the population. The Mean and Standard Deviation are
expressed by Greek letters (μ, σ) for the population and by Roman letters
(X̄, s) for a sample.

Measures of central tendency


Measures of central tendency are determined by Mean, Median and Mode.
Mean is the arithmetic average of a set of data. It is a measure of central
tendency, not a measure of variation. However, it is required to calculate some
of the statistical measures of variation:

Sample mean (X̄) = (Σ Xi)/n, summed from i = 1 to n

where
X̄ = Sample mean
Xi = Data point i
n = Group sample size
322 Implementing Six Sigma and Lean

Median is the central or middle data point, e.g. in the series 32, 33, 34, 34,
35, 37, 37, 39, 41, the Median is 35.
Mode is the most frequently occurring data point, e.g. in the same series
both 34 and 37 occur twice, so the Modes are 34 and 37.
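Using the series from the examples above, the three measures can be computed with Python's standard statistics module (a sketch for illustration only; multimode reports every value that occurs most often):

```python
import statistics

data = [32, 33, 34, 34, 35, 37, 37, 39, 41]

mean = statistics.mean(data)        # arithmetic average of the series
median = statistics.median(data)    # middle value of the sorted series
modes = statistics.multimode(data)  # most frequently occurring value(s)

print(mean)    # ≈ 35.78
print(median)  # 35
print(modes)   # [34, 37]
```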

Measure of dispersion
The measure of dispersion is determined by Standard Deviation, Variance and
Range and the shape of the distribution is measured by Skewness and Kurtosis.
Range is the simplest measure of dispersion. It defines the spread of the data
and is the distance between the largest and the smallest values of a sample fre-
quency distribution.
Variance is the average of the squared deviations from the mean.
Standard Deviation is defined as follows:

s = √[Σ (xi − x̄)²/(n − 1)], summed from i = 1 to n

It is the square root of Variance and, for a more intuitive definition, think
of it as roughly the average distance from each data point to the mean.
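The three measures of dispersion can be sketched in Python using the same illustrative series as before (statistics.variance and statistics.stdev use the sample form, dividing by n − 1):

```python
import statistics

data = [32, 33, 34, 34, 35, 37, 37, 39, 41]

data_range = max(data) - min(data)    # Range: largest minus smallest value
variance = statistics.variance(data)  # sample variance, divides by n - 1
stdev = statistics.stdev(data)        # Standard Deviation: square root of Variance

print(data_range, variance, stdev)
```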

Skewness and Kurtosis


Skewness measures the departure from a symmetrical distribution. When the
tail of the distribution stretches to the left (smaller values) it is negatively
skewed, and when it stretches to the right it is positively skewed.
Kurtosis is a measure of a distribution's peakedness or flatness. A curve is
too peaked when the kurtosis exceeds 3 and too flat when the kurtosis is less
than 3.
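Both shape measures can be sketched directly from their moment definitions (a Python illustration; the symmetric test series is invented for the example):

```python
import math

def skewness(data):
    """Third standardized moment: 0 for a symmetric distribution."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return sum(((x - mu) / sigma) ** 3 for x in data) / n

def kurtosis(data):
    """Fourth standardized moment: above 3 is too peaked, below 3 too flat."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return sum(((x - mu) / sigma) ** 4 for x in data) / n

symmetric = [1, 2, 3, 4, 5]
print(skewness(symmetric))  # ≈ 0.0 for a symmetric series
print(kurtosis(symmetric))  # ≈ 1.7 -- less than 3, so flatter than normal
```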

Sources of variation
There are two types or sources of variation as shown in the following table:

Type            Definition                       Characteristics

Common cause    No way to remove;                Always present
                influenced by one of the 6 Ms    Expected
                                                 Normal
                                                 Random

Special cause   Can be removed;                  Not always present
                influence of the 6 Ms            Unexpected
                                                 Not normal
                                                 Not random
To distinguish between common cause and special causes of variation, use display
tools that study variation over time such as Run Charts and Control Charts.

Process capability
Process capability refers to the ability of a process to produce a product
that meets a given specification.
It is measured by one or a combination of four indices:

DPM = Defects per million
Sigma level = Number of standard deviations from the centre
Cp = Potential process index
Cpk = Process capability index

Sigma level = minimum[(USL − μ)/σ, (μ − LSL)/σ]

where
USL = upper specification limit
LSL = lower specification limit
σ = standard deviation
μ = mean

Cpk = Process capability index
    = minimum[(USL − μ)/3σ, (μ − LSL)/3σ]
    = Sigma level/3

Cp = |USL − LSL|/6σ
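These indices can be sketched in a few lines of Python (the specification limits, mean and standard deviation below are hypothetical values chosen for illustration):

```python
def capability(usl, lsl, mu, sigma):
    """Sigma level, Cpk and Cp for given spec limits, mean and std deviation."""
    sigma_level = min((usl - mu) / sigma, (mu - lsl) / sigma)
    cpk = sigma_level / 3
    cp = abs(usl - lsl) / (6 * sigma)
    return sigma_level, cpk, cp

# Hypothetical process: spec limits 4 to 10, centred at 7, sigma = 1
level, cpk, cp = capability(usl=10, lsl=4, mu=7, sigma=1)
print(level, cpk, cp)  # 3.0 1.0 1.0
```

Because this process is centred between its limits, Cpk equals Cp; an off-centre mean would pull Cpk below Cp.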

Process Sigma
Process Sigma is an expression of process yield based on DPMO (defects per
million opportunities) to define variation relative to customer specification.
A higher Process Sigma value indicates lower variation relative to the
specification.
Historical data indicates a shift of 1.5 in the Sigma value when long-term and
short-term (actual) Process Sigma values are compared. The long-term values
are higher by 1.5, e.g. 6 Sigma is the long-term Process Sigma for 3.4 DPMO.
A sample Process Sigma calculation:
Given 7 defects (D) for 100 units (N) processed with 2 defect opportunities
(O) per unit:
Defects per opportunity, DPO = D/(N × O) = 7/(100 × 2) = 0.035
Yield = (1 − DPO) × 100 = 96.5%
From the lookup table (see Appendix 2B):
Long-term Process Sigma = 3.3
Actual Process Sigma = 3.3 − 1.5 = 1.8
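The worked example above can be checked with a short Python sketch (the sigma value itself still comes from the lookup table in Appendix 2B):

```python
defects = 7        # D: defects observed
units = 100        # N: units processed
opportunities = 2  # O: defect opportunities per unit

dpo = defects / (units * opportunities)                 # defects per opportunity
dpmo = defects * 1_000_000 / (units * opportunities)    # defects per million opportunities
process_yield = (1 - dpo) * 100                         # percentage yield

print(f"DPO = {dpo}, DPMO = {dpmo:.0f}, Yield = {process_yield:.1f}%")
```

The yield of 96.5% corresponds to a long-term Process Sigma of about 3.3 in the table, hence an actual Process Sigma of 1.8 after subtracting the 1.5 shift.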

Inferential statistics
Inferential statistics helps us to make judgments about the population from
a sample. A sample is a small subset of the total number in a population.

Appendix 2B
Yield Conversion Table

These are estimates of long-term sigma values. Subtract 1.5 from these
values to obtain actual sigma values.

Sigma   DPMO      Yield %        Sigma   DPMO      Yield %

6.0     3.4       99.99966       2.9     80757     91.9
5.9     5.4       99.99946       2.8     96801     90.3
5.8     8.5       99.99915       2.7     115070    88.5
5.7     13        99.99866       2.6     135666    86.4
5.6     21        99.9979        2.5     158655    84.1
5.5     32        99.9968        2.4     184060    81.6
5.4     48        99.9952        2.3     211855    78.8
5.3     72        99.9928        2.2     241964    75.8
5.2     108       99.9892        2.1     274253    72.6
5.1     159       99.984         2.0     308538    69.1
5.0     233       99.977         1.9     344578    65.5
4.9     337       99.966         1.8     382089    61.8
4.8     483       99.952         1.7     420740    57.9
4.7     687       99.931         1.6     460172    54.0
4.6     968       99.90          1.5     500000    50.0
4.5     1350      99.87          1.4     539828    46.0
4.4     1866      99.81          1.3     579260    42.1
4.3     2555      99.74          1.2     617911    38.2
4.2     3467      99.65          1.1     655422    34.5
4.1     4661      99.53          1.0     691462    30.9
4.0     6210      99.38          0.9     725747    27.4
3.9     8198      99.18          0.8     758036    24.2
3.8     10724     98.9           0.7     788145    21.2
3.7     13903     98.6           0.6     815940    18.4
3.6     17864     98.2           0.5     841345    15.9
3.5     22750     97.7           0.4     864334    13.6
3.4     28716     97.1           0.3     884930    11.5
3.3     35930     96.4           0.2     903199    9.7
3.2     44565     95.5           0.1     919243    8.1
3.1     54799     94.5
3.0     66807     93.3

Sample statistics are summary values of a sample and are calculated using all
the values of the sample. Population parameters are summary values of the
population, but they are not known. That is why we use sample statistics to
infer population parameters.
Sample size (SS) = (DC × V/DP)²
where
DC = Degree of confidence = the number of standard errors for the degree
of confidence
Appendix 2: Introduction to basic statistics 325

V = Variability = the standard deviation of the population
DP = Desired precision = the acceptable difference between the sample
estimate and the population value.
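As an illustrative sketch, a hypothetical sample size calculation in Python (1.96 standard errors correspond to a 95% degree of confidence; the variability and precision figures are invented for the example):

```python
import math

dc = 1.96  # degree of confidence: standard errors for 95% confidence
v = 15     # variability: estimated population standard deviation
dp = 3     # desired precision: acceptable +/- difference from the true value

sample_size = (dc * v / dp) ** 2
print(sample_size, "-> round up to", math.ceil(sample_size))
```

Since a fraction of an observation cannot be collected, the result is rounded up to the next whole unit.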
A null hypothesis refers to a population parameter and not a sample statis-
tic. Based on the sample data, the researcher can accept the null hypothesis
or reject it in favour of the alternative hypothesis.
The t-test compares the actual difference between two means in relation to
the variation in the data (expressed as the standard deviation of the difference
between the means). It is applied when sample sizes are small (less than 30),
relying on an assumption of normality, and tests the null hypothesis that the
means of two normally distributed populations are equal.
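A pooled two-sample t statistic can be sketched as follows (Python, with two invented small samples; this is an illustration, not a worked example from the text):

```python
import math
import statistics

def two_sample_t(a, b):
    """Pooled two-sample t statistic for the difference between two means."""
    na, nb = len(a), len(b)
    # pooled estimate of the common variance of the two samples
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    # standard deviation (standard error) of the difference between the means
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Two hypothetical small samples (n < 30)
print(two_sample_t([32, 33, 34, 35, 37], [36, 38, 39, 41, 42]))
```

A t value far from zero suggests the difference between the means is large relative to the variation in the data, so the null hypothesis of equal means becomes hard to sustain.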
Analysis of variance (ANOVA) is used to assess the statistical differences
between the means of two or more groups of a population. ANOVA can also
examine research problems that involve several independent variables.
The F-test assesses the differences between the group means when we use
ANOVA:
F = Variance between groups/Variance within groups
Larger F-ratios indicate significant differences between the groups, and the
null hypothesis will most likely be rejected.
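The F ratio can be sketched directly from this definition (Python; the three groups below are invented for illustration):

```python
import statistics

def f_ratio(*groups):
    """One-way ANOVA F ratio: variance between groups / variance within groups."""
    k = len(groups)                   # number of groups
    n = sum(len(g) for g in groups)   # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # between-groups sum of squares: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    # within-groups sum of squares: spread of the observations inside each group
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Three hypothetical groups with clearly different means
print(f_ratio([10, 12, 11], [20, 22, 21], [30, 31, 29]))  # ≈ 271 -- far above 1
```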

Causal modelling
Many business objectives are concerned with the relationship between two or
more variables. Variables are linked together if they exhibit co-variation, i.e.
when one variable consistently changes relative to another variable. The co-
efficient of correlation is used to assess this linkage. Large co-efficients indicate
high co-variation and a strong relationship, and vice versa.
Causal modelling develops equations underlying causal relationships
between the dependent and independent variables. Values of variables are
plotted graphically and analysed by regression analysis and co-efficient of
correlation to establish relationships between variables and their validity.
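The co-efficient of correlation (Pearson's r) can be sketched from its definition (Python; the spend and sales figures below are invented for illustration):

```python
import math

def correlation(x, y):
    """Pearson co-efficient of correlation between two paired variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # co-variation of the two variables about their means
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data: advertising spend vs. sales
spend = [1, 2, 3, 4, 5]
sales = [12, 15, 21, 24, 28]
print(correlation(spend, sales))  # close to +1: strong positive relationship
```

Values near +1 or −1 indicate high co-variation and a strong linear relationship; values near 0 indicate little or none.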

SPSS
The data analysis and statistical techniques presented in this appendix are all
available in the popular software package SPSS (www.spss.com).
