Introduction to basic statistics
Statistics
Statistics is the art and science of using numerical facts and figures.
In Wikipedia, statistics is defined as a mathematical science pertaining to the
collection, analysis, interpretation or explanation, and presentation of data.
There are three kinds of statistics:
Descriptive statistics
Inferential statistics
Causal modelling
Descriptive statistics primarily deals with the description and interpretation
of data by figures, graphs and charts.
Inferential statistics is the science of making decisions in the face of
uncertainty by using techniques such as sampling and probability.
Causal modelling is a part of inferential statistics. It is aimed at advancing
reasonable hypotheses about underlying causal relationships between the
dependent and independent variables.
In Six Sigma projects, the most frequently used branch of statistics is
descriptive statistics, and the most useful distribution is the normal distribution.
Descriptive statistics
Data distribution
Normal Distribution, other distributions
Measures of central tendency
Mean, Median, Mode
Measures of dispersion
Standard Deviation, Variance and Range
Normal Distribution
Many data sets tend to follow the normal distribution, or bell-shaped curve. One
of the key properties of the normal distribution is the relationship between
the shape of the curve and the standard deviation: a fixed share of the data
falls within each standard deviation of the mean.
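The link between the standard deviation and the shape of the curve can be checked numerically. The following is a minimal sketch using Python's standard library; the names `standard_normal`, `share_within` and `one_side` are illustrative and not part of the original text.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k: float) -> float:
    """Fraction of a normal population within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


# Share between the mean and one standard deviation above it (one side only).
one_side = standard_normal.cdf(1.0) - standard_normal.cdf(0.0)

for k in (1, 2, 3):
    print(f"within +/-{k} standard deviations: {share_within(k):.2%}")
print(f"mean to +1 standard deviation: {one_side:.2%}")
```

The one-sided share comes to about 34.13%, so roughly 68.27% of the data lies within one standard deviation on either side of the mean.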
Appendix 2: Introduction to basic statistics 321
[Figure: normal distribution curve; 34.13% of the data lies within one standard deviation on each side of the mean]
A population is the total number in the study, e.g. a census, and a sample is a
small subset of the population. The Mean and Standard Deviation are expressed
by Greek letters (μ, σ) for the population and by Roman letters (X̄, s) for the sample.
Sample mean:
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
where
$\bar{X}$ = Sample mean
$X_i$ = Data point i
$n$ = Group sample size
322 Implementing Six Sigma and Lean
Median is the central or middle data point, e.g. in the series 32, 33, 34, 34,
35, 37, 37, 39, 41, the Median is 35.
Mode is the most frequently occurring data point, e.g. in the series 32, 33, 34, 34,
35, 37, 37, 39, 41, the Modes are 34 and 37, each of which occurs twice.
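The three measures of central tendency can be computed for the example series as a quick check. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median, multimode

# The example series used for the median and mode above.
series = [32, 33, 34, 34, 35, 37, 37, 39, 41]

sample_mean = mean(series)        # sum of the data points divided by n
sample_median = median(series)    # middle value of the sorted series
sample_modes = multimode(series)  # every value tied for the highest frequency

print(f"mean   = {sample_mean:.2f}")
print(f"median = {sample_median}")
print(f"modes  = {sample_modes}")
```

`multimode` is used rather than `mode` because this series has two values, 34 and 37, that occur equally often.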
Measures of dispersion
Dispersion is measured by the Standard Deviation, Variance and Range, and
the shape of the distribution is measured by Skewness and Kurtosis.
Range is the simplest measure of dispersion. It defines the spread of the data
and is the distance between the largest and the smallest values of a sample
frequency distribution.
Variance is the average of the squared deviations from the mean. For a sample,
the sum of the squared deviations is divided by n − 1 rather than n.
Standard Deviation is defined as follows:
$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
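The measures of dispersion can be illustrated on the same series used for the median and mode examples. The sketch below uses Python's standard `statistics` module, whose `variance` and `stdev` functions divide by n − 1; `manual_sd` repeats the calculation by hand for comparison. All variable names are illustrative.

```python
from statistics import mean, stdev, variance

series = [32, 33, 34, 34, 35, 37, 37, 39, 41]

# Range: distance between the largest and the smallest values.
data_range = max(series) - min(series)

# Sample variance and standard deviation (both divide by n - 1).
sample_variance = variance(series)
sample_sd = stdev(series)

# The same standard deviation computed directly from the definition.
n = len(series)
x_bar = mean(series)
manual_sd = (sum((x - x_bar) ** 2 for x in series) / (n - 1)) ** 0.5

print(f"range              = {data_range}")
print(f"variance           = {sample_variance:.3f}")
print(f"standard deviation = {sample_sd:.3f}")
```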
Sources of variation
There are two types or sources of variation as shown in the following table:
To distinguish between common causes and special causes of variation, use display
tools that study variation over time, such as Run Charts and Control Charts.
Process capability
Process capability refers to the ability of a process to produce a product
that meets a given specification.
It is measured by one or a combination of four indices:
Process Sigma
Process Sigma is an expression of process yield based on DPMO (defects per
million opportunities) to define variation relative to customer specification.
A higher Process Sigma value indicates lower variation relative to the
specification.
Historical data indicates a difference of 1.5 in Sigma value when long-term and
short-term (actual) Process Sigma values are compared. The long-term values
are higher by 1.5, e.g. 6 Sigma is the long-term Process Sigma for 3.4 DPMO.
A sample Process Sigma calculation:
Given 7 defects (D) for 100 units (N) processed with 2 defect opportunities
(O) per unit.
Defects per opportunity, DPO = D/(N × O) = 7/(100 × 2) = 0.035
Yield = (1 − DPO) × 100 = 96.5%
From the lookup table (see Appendix 2B):
Long-term Process Sigma = 3.3
Actual Process Sigma = 3.3 − 1.5 = 1.8
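The sample calculation can be reproduced in a few lines of code. The sketch below replaces the lookup table with an assumption: that the table values can be approximated by the standard normal quantile of the yield plus the 1.5 shift. The function name `process_sigma` and the constant `SIGMA_SHIFT` are illustrative and do not come from the original text.

```python
from statistics import NormalDist

SIGMA_SHIFT = 1.5  # difference between long-term and actual values, as stated in the text


def process_sigma(defects: int, units: int, opportunities_per_unit: int):
    """Return (DPO, yield fraction, long-term sigma, actual sigma).

    Assumption: the yield conversion table is approximated by the standard
    normal quantile of the yield, shifted by 1.5. The yield must lie strictly
    between 0 and 1, so a process with zero defects cannot be converted this way.
    """
    dpo = defects / (units * opportunities_per_unit)
    process_yield = 1.0 - dpo
    actual = NormalDist().inv_cdf(process_yield)
    long_term = actual + SIGMA_SHIFT
    return dpo, process_yield, long_term, actual


dpo, process_yield, long_term, actual = process_sigma(7, 100, 2)
print(f"DPO = {dpo:.3f}, yield = {process_yield:.1%}")
print(f"long-term Process Sigma = {long_term:.1f}, actual Process Sigma = {actual:.1f}")
```

Under the same assumption, a yield corresponding to 3.4 DPMO gives a long-term value of approximately 6.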
Inferential statistics
Inferential statistics helps us to make judgments about the population from
a sample. A sample is a small subset of the total number in a population.
Appendix 2B
Yield Conversion Table
These are estimates of long-term sigma values. Subtract 1.5 from these
values to obtain actual sigma values.
Sample statistics are summary values of a sample and are calculated using all
the values of the sample. Population parameters are summary values of the
population, but they are not known. That is why we use sample statistics to
infer population parameters.
Sample size: $SS = \left(\frac{DC \times V}{DP}\right)^2$
where
DC = Degree of confidence = the number of standard errors for the degree
of confidence
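The sample size formula can be sketched in code. The definitions of V and DP are not given in the text above, so the example makes two loudly stated assumptions: V is taken to be the variability of the data (its standard deviation) and DP the desired precision (the acceptable margin of error, in the same units). The function name `sample_size` and the input values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size(confidence: float, variability: float, desired_precision: float) -> int:
    """Sample size SS = (DC * V / DP)^2, rounded up to a whole unit.

    Assumptions (not defined in the source text): V is the standard deviation
    of the data and DP is the acceptable margin of error in the same units.
    DC is derived as the two-sided standard normal quantile for the
    requested degree of confidence.
    """
    dc = NormalDist().inv_cdf((1.0 + confidence) / 2.0)
    return ceil((dc * variability / desired_precision) ** 2)


# Hypothetical inputs: 95% confidence, standard deviation 10, precision of +/-2.
print(sample_size(0.95, 10.0, 2.0))
```

Rounding up is a design choice here: a fractional result means the smaller whole number would fall short of the requested precision.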
Causal modelling
Many business objectives are concerned with the relationship between two or
more variables. Variables are linked together if they exhibit co-variation, i.e.
when one variable consistently changes relative to another variable. The co-efficient
of correlation is used to assess this linkage. Large co-efficients indicate high
co-variation and a strong relationship, and vice versa.
Causal modelling develops equations underlying causal relationships
between the dependent and independent variables. Values of variables are
plotted graphically and analysed by regression analysis and co-efficient of
correlation to establish relationships between variables and their validity.
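As an illustration of the two techniques named above, the sketch below computes a co-efficient of correlation and fits a straight-line regression equation. It assumes the Pearson co-efficient and ordinary least squares, which the text does not specify; the data values and function names are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson co-efficient of correlation between two equal-length series."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def least_squares_line(xs, ys):
    """Slope and intercept of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical data: independent variable x and dependent variable y.
x_values = [1, 2, 3, 4, 5]
y_values = [2.1, 3.9, 6.2, 8.0, 9.8]

r = pearson_r(x_values, y_values)
slope, intercept = least_squares_line(x_values, y_values)
print(f"r = {r:.3f}, y = {intercept:.2f} + {slope:.2f} x")
```

A co-efficient close to 1 or −1 indicates strong co-variation, but on its own it does not establish that one variable causes the other.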
SPSS
The data analysis and statistical techniques presented in this appendix are all
available in the popular software package SPSS (www.spss.com).