
1 Introduction

This HowTo document describes how to use Scientific Python (SciPy) for Design for Six Sigma (DfSS). DfSS is an informal collection of best practices and methods for product development and design. Some of these methods support project management and requirements gathering, while others are of a mathematical nature. The following chapters show, with examples, how to attack these numerical/statistical problems using SciPy.
DfSS tasks addressed:

• Statistical observations

• Run-chart analysis and Statistical Process Control (SPC)

• Statistical inference from observations

• Statistical tolerance stackup

• Optimization of estimated functions

• Minimization of variance method

• DfSS DoE - ANOVA analysis of multivariate linear relationships

1.1 Dear Reader


This HowTo assumes that you have a basic knowledge of Python and a basic understanding of the DfSS methods. There are other excellent books on both these topics, and there is a good introduction to SciPy developed by the SciPy community.
You will need a working installation of Python, SciPy and SymPy. On a Microsoft Windows computer, you can install the open source bundle PythonXY (www.pythonxy.com). On Ubuntu/Debian Linux you can easily install the packages you need: python, python-scipy, python-sympy.
The examples assume the following packages to be imported:
import scipy as sp
import numpy as np
import matplotlib as mplt
import matplotlib.pyplot as plt
import scipy.stats as stats
from pylab import *

All examples are written for Python 2.6, and will most probably work in other Python 2.x versions as well.
2 Statistical Observations
The basic concept of DfSS is that there is variation everywhere. Every process has some variability in its inputs and outputs. The key to improving the process is to understand the nature of the variation, and the best way to do that is to use statistical models. This section describes how to look at measurement data in a statistical way.
First of all, measurement data must be imported into the SciPy environment, either by typing it into the source code using the r_[...] directive to create a row vector, or by importing the data from a file, e.g. a .csv file (comma-separated values).
A simple example with twelve temperature observations done once per hour,
entered and plotted:
timeVector = r_[1:13]
dataVector = r_[2.3,3.3,4.1,5.5,5.1,6.7,6.3,6.9,7.0,7.7,7.2,7.4]
plot( timeVector, dataVector )
xlabel('Time [h]')
ylabel('Temperature [C]')
The same data could instead be read from a data file with the following format:
1,2.3
2,3.3
3,4.1
...
The code would then look like this, and the plot is shown in Fig. 1:
measurementVector = sp.loadtxt('import_csv_example.csv',delimiter=',')
timeVector = measurementVector[:,0]
dataVector = measurementVector[:,1]

plot( timeVector, dataVector )


xlabel('Time [h]')
ylabel('Temperature [C]')
Here is another example of a larger dataset, from a weather observation station in Maarn, The Netherlands. There are 2810 weather observations in the data file, logged every 5 minutes for a month. (The data comes from
http://weather.gladstonefamily.net/site/C1652
and is used according to the license agreement on the website.)
The data file looks like this, and the barometric pressure is plotted in Fig. 2.
"Time(UTC)","Pressure(mbar)","Temp(F)","Dewpoint(F)","RH(%)","Wind(mph)","W
2010-02-25 11:28:00,992.80,46.0,44.4,93,6,262
2010-02-25 11:43:00,992.80,46.0,44.4,93,6,258
...
Let's assume that we are interested in the variations of the barometric pressure over this time span. First, plot the variation over time and see if it is a random variation, then go on with the analysis.
Figure 1: Importing measurement data from an external file and plotting the data.

measurementVector = sp.loadtxt('example_weather_observations.csv',\
delimiter=',', skiprows=1, converters = {0: datestr2num}, \
usecols=(0,1,2,3,4,5) )
timeVector = measurementVector[:,0]
dateVector = mplt.dates.num2date( timeVector )

pressureVector = measurementVector[:,1]

fig = figure(figsize=(12,6)) # 12x6 inches


plot( dateVector, pressureVector )
xlabel('Time')
ylabel('Barometric pressure [mbar]')
# plt.savefig('example_weather_observations.png', dpi=300)

Figure 2: Example data imported from an external file. Barometric pressure in Maarn, in March
2010.

In the figure of barometric pressure (Fig. 2) we should make an observation: the variation is not random. There is a period in the first week that has much lower pressure than the subsequent three weeks.
Statistical analysis only makes sense when the variation is random.
There are thus two sources of variation in this signal: one caused by a real change of weather type, and one random variation within a given weather type.
It is meaningful to do two types of statistical tests here:
1. Quantify the random variation of the pressure during weeks 2-3.
2. Compare the average pressure in week 1 with week 2 to see if it is significantly different.
The first test is done visually and computationally. We plot the relative frequency of occurrence of the pressures using a histogram. We also assume a normal distribution as a starting point, because it is very common for natural phenomena, as explained by the Central Limit Theorem of statistics. The normal distribution has two parameters: the mean μ (location parameter) and the standard deviation σ (scale parameter). These two are easy to compute from the data. In the graph, we also plot the probability density function of the estimated normal distribution (stats.norm.pdf(x, mean, std)), see Fig. 3.
pressureVectorW23 = pressureVector[700:2100]
hist( pressureVectorW23, bins = 30 )
xlabel('Barometric pressure [mbar]')
ylabel('Frequency')

hold(True)
p_mean = mean(pressureVectorW23)
p_std = std(pressureVectorW23)
p_values = linspace( p_mean-3*p_std, p_mean+3*p_std, 100)
# scale the pdf to the histogram counts (1400 observations)
plot( p_values, 1400*stats.norm.pdf( p_values, p_mean, p_std ) )
The second test, the comparison of the means of the two weeks, is done using Student's t-test. Visually, we clearly see that the levels are different in the two weeks, and it is quite straightforward to compare the datasets:
pressureVectorW1 = pressureVector[0:700]
pressureVectorW2 = pressureVector[700:1400]
(t,p) = stats.ttest_ind(pressureVectorW1, pressureVectorW2)
print 'T-test of different means, p = %.2f <0.05' % p
Note that we use the ttest_ind() function for independent measurements: the measurements were not of the same item and not correlated pairwise in some way. However, if we were to compare the measurements of two different weather stations for the same time period, then the measurements would be correlated and we would only be interested in the difference of the means. In that case the other t-test function is used, ttest_rel().
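A minimal sketch of the paired case, using made-up data for two hypothetical stations that share the same weather signal (all numbers below are assumptions for illustration):

```python
import numpy as np
import scipy.stats as stats

# Two hypothetical stations measuring the same weather: both see the same
# large pressure variation, plus their own small sensor noise.  Station B
# is assumed to read 0.3 mbar higher on average (made-up numbers).
np.random.seed(0)
common_weather = np.random.normal(1000.0, 5.0, 200)
station_a = common_weather + np.random.normal(0.0, 0.5, 200)
station_b = common_weather + np.random.normal(0.3, 0.5, 200)

# ttest_rel() tests whether the mean of the pairwise differences is zero.
(t, p) = stats.ttest_rel(station_a, station_b)
print('Paired t-test:      p = %.2g' % p)

# The independent-samples test is swamped by the large shared weather
# variation and will typically miss the small 0.3 mbar offset.
(t_ind, p_ind) = stats.ttest_ind(station_a, station_b)
print('Independent t-test: p = %.2g' % p_ind)
```

The paired test removes the shared weather variation by working on the pairwise differences, which is why it can detect an offset that the independent test cannot.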
Figure 3: Histogram of the pressure during weeks 2-3, with an estimated normal distribution.

3 Statistical Process Control (SPC)


One of the key tools of statistical process control is the run chart. It is a graphical tool to check the hypothesis that a process is truly random. The base assumption is that a process is in control, in an optimum operating condition, when all variation is random, i.e. the variation has a zero mean and the current output has no influence on the next output value.
In practice, there is very often some autocorrelation on a small time scale but none at larger time scales. To compensate for this, we compare blocks of outputs with each other.
[ Autocorrelation and specific time scale ]
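As a sketch of what such a check could look like, the lag-k autocorrelation can be estimated directly from the data; the AR(1) signal below is synthetic, made up for illustration:

```python
import numpy as np

def autocorr(x, lag):
    # Sample autocorrelation at the given lag: correlation between the
    # series and a shifted copy of itself.
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[:-lag], x[lag:])[0, 1]

# Synthetic AR(1) signal: each value depends strongly on the previous one,
# giving autocorrelation on a small time scale but not on a large one.
np.random.seed(1)
n = 2000
noise = np.random.normal(0.0, 1.0, n)
signal = np.zeros(n)
for i in range(1, n):
    signal[i] = 0.8 * signal[i - 1] + noise[i]

print('lag-1  autocorrelation: %.2f' % autocorr(signal, 1))
print('lag-50 autocorrelation: %.2f' % autocorr(signal, 50))
```

A large lag-1 value with near-zero values at large lags is exactly the situation where grouping the output into blocks (samples) makes the block statistics approximately independent.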
The run chart has two graphs: one for the mean of the sample and one for the standard deviation. By plotting these parameters over time it is easy to check the hypotheses:

• The mean of the variation is zero.

• The standard deviation is constant and known (σ).

For each sample, i.e. N items, the mean x̄ and standard deviation are computed and plotted in the two graphs. The mean of the sample is plotted in the top chart, with dotted lines for the Upper Control Limit (UCL) and Lower Control Limit (LCL).

UCL_{\bar{x}} = \mu + 3\sigma/\sqrt{N}, \quad LCL_{\bar{x}} = \mu - 3\sigma/\sqrt{N} \qquad (1)


[Insert picture of run chart]
The main hypothesis is true if the sample points stay nicely and randomly
within the UCL-LCL lines. Based on experience, a number of rules-of-thumb
have been devised to check whether or not the hypotheses hold true and the
process is in control.
[items, seven rules]
Run chart analysis.
Process capability.
Remember: the run chart does not show process output acceptance levels. It does not say whether or not the production is within tolerances; it only says whether or not the process is stable.
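A minimal sketch of the computations behind a run chart, assuming a hypothetical in-control process with known μ and σ and the classical x̄-chart limits μ ± 3σ/√N:

```python
import numpy as np

# Hypothetical in-control process with known mean and sigma (made up).
np.random.seed(2)
mu, sigma, N = 10.0, 0.5, 5        # N items per sample
output = np.random.normal(mu, sigma, 40 * N)

# Group the output into 40 samples of N items and compute the per-sample
# statistics that go into the two graphs of the run chart.
samples = output.reshape(-1, N)
sample_means = samples.mean(axis=1)
sample_stds = samples.std(axis=1, ddof=1)

# Control limits for the means chart, assuming the classical x-bar limits.
UCL = mu + 3 * sigma / np.sqrt(N)
LCL = mu - 3 * sigma / np.sqrt(N)

outside = np.sum((sample_means > UCL) | (sample_means < LCL))
print('UCL = %.3f, LCL = %.3f, samples outside limits: %d' % (UCL, LCL, outside))
```

The two charts themselves can then be drawn with plot(sample_means) and plot(sample_stds), with the control limits added as dotted lines using axhline().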

4 Statistical Inference
Based on a parametrized statistical distribution, say something about the probabilities for something to happen...
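As an example of what such an inference could look like, assuming a normal distribution with parameters estimated as in Section 2 (the numbers here are made up for illustration):

```python
import scipy.stats as stats

# Hypothetical normal-distribution parameters, e.g. estimated from
# observations as in Section 2 (made-up values).
p_mean = 1014.0   # mbar
p_std = 6.0       # mbar

# Probability that the pressure drops below 1000 mbar, assuming normality.
p_low = stats.norm.cdf(1000.0, p_mean, p_std)
print('P(pressure < 1000 mbar) = %.4f' % p_low)

# Central interval that contains 95% of the observations.
(low, high) = stats.norm.interval(0.95, p_mean, p_std)
print('95%% interval: %.1f .. %.1f mbar' % (low, high))
```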

5 Measurement Assurance?
Measurement Capability - repeatability/reproducibility.

6 Linear Regression
Linear model parameter estimation.
stats.linregress()
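A minimal sketch with stats.linregress(), reusing the twelve hourly temperature observations from Section 2:

```python
import numpy as np
import scipy.stats as stats

# The twelve hourly temperature observations from Section 2.
timeVector = np.r_[1:13]
dataVector = np.r_[2.3, 3.3, 4.1, 5.5, 5.1, 6.7, 6.3, 6.9, 7.0, 7.7, 7.2, 7.4]

# Fit y = slope*x + intercept and report the goodness of fit.
(slope, intercept, r_value, p_value, std_err) = stats.linregress(timeVector, dataVector)
print('slope = %.3f C/h, intercept = %.3f C' % (slope, intercept))
print('r^2 = %.3f, p = %.2g' % (r_value ** 2, p_value))
```

The squared correlation coefficient r² indicates how much of the variation in the data is explained by the linear model.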

7 Analysis of Variance (ANOVA)


Linear statistical model - contribution of variance. - see Wikipedia
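A minimal one-way ANOVA sketch with stats.f_oneway(); the three groups below are synthetic, made up for illustration:

```python
import numpy as np
import scipy.stats as stats

# Hypothetical output of the same process under three machine settings
# (synthetic data; setting C is deliberately shifted).
np.random.seed(3)
setting_a = np.random.normal(10.0, 1.0, 30)
setting_b = np.random.normal(10.2, 1.0, 30)
setting_c = np.random.normal(11.5, 1.0, 30)

# One-way ANOVA: do the group means differ more than the within-group
# variation can explain?
(F, p) = stats.f_oneway(setting_a, setting_b, setting_c)
print('F = %.2f, p = %.2g' % (F, p))
```

A small p-value says that at least one group mean differs from the others; it does not say which one, so a follow-up comparison of the individual groups is still needed.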

8 Experimental Design (DoE)


?

9 Process Target Optimization


? fmin?
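One way this section could be filled in, as hinted by the fmin note above: minimize the squared deviation of a (made-up) process function from its target:

```python
from scipy.optimize import fmin

# Made-up process output as a function of one input setting x.
def process_output(x):
    return 4.0 + 2.5 * x - 0.3 * x ** 2

y_target = 9.0

# Bring the output onto the target by minimizing the squared deviation.
def deviation(x):
    return (process_output(x[0]) - y_target) ** 2

x_opt = fmin(deviation, [1.0], disp=False)
print('x = %.3f gives y = %.3f' % (x_opt[0], process_output(x_opt[0])))
```

fmin() uses the Nelder-Mead simplex method, which needs no derivatives; that makes it convenient when the process function is only available as a black box.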

10 Minimization of Process Variance


We often want to minimize the variance of a process. Sometimes the variance is a crucial metric for the quality of the output, so it is interesting to see how the variance can be minimized while the output level is still kept constant.
This is a field where Scientific Python shows its strength. The integration of numeric math (numpy/scipy) and symbolic math (sympy) enables effective and easy-to-use solutions.
The methodology explained here is based on a set of assumptions:

• All input parameters (x_1, x_2, etc.) are normally distributed with known variance.

• The output target level y_target is known.

• The process target function y = f(x) is known.

A good approximation of the variance of a function, based on the variances of the input variables, uses the first derivative of the function with respect to each variable:

\mathrm{var}_y(x) = \sum_i \left( \frac{\partial y}{\partial x_i}(x) \right)^2 \mathrm{var}_{x_i} \qquad (2)
The variance function is thus a scalar function of the parameter vector x. However, while minimizing this function, we are only interested in sets of x that also satisfy the nominal value f(x) = y_target.
Another interesting phenomenon is that symmetric random variation can be asymmetrically distributed by a nonlinear function. The mean of the noise is then non-zero. This phenomenon is called bias. The bias is a shift of the mean of the output process value, and should be compensated for to achieve the target output. The bias depends on the second derivative of the process function with respect to the input parameters:

\mathrm{bias}_y(x) = \frac{1}{2} \sum_i \frac{\partial^2 y}{\partial x_i^2}(x) \, \mathrm{var}_{x_i} \qquad (3)
Taken together, the optimization problem can be formulated as follows: find the values of the input parameters x that minimize var_y(x) while also satisfying the constraint f(x) = y_target - bias_y(x).
In the DfSS toolbox, functions are available to do this, and the operation goes in a number of steps. This is shown below with a hypothetical example of a nonlinear process function.
First you define the process function f(x):
fmin leastsquares
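A sketch of these steps, combining SymPy derivatives with scipy.optimize.fmin. The process function f = x1²·x2, the input variances and the target are all assumptions made up for this example, and the constraint is enforced with a simple quadratic penalty, which is one possible approach:

```python
import sympy
from scipy.optimize import fmin

# Hypothetical nonlinear process function y = f(x1, x2); the function,
# the input variances and the target below are all made-up assumptions.
x1, x2 = sympy.symbols('x1 x2')
f = x1 ** 2 * x2
var_x = {x1: 0.01, x2: 0.04}     # known input variances
y_target = 10.0

# Eq. (2): variance of y from the first derivatives.
var_y = sum(sympy.diff(f, xi) ** 2 * v for (xi, v) in var_x.items())

# Eq. (3): bias of y from the second derivatives.
bias_y = sympy.Rational(1, 2) * sum(sympy.diff(f, xi, 2) * v
                                    for (xi, v) in var_x.items())

# Turn the symbolic expressions into numeric functions.
f_num = sympy.lambdify((x1, x2), f)
var_y_num = sympy.lambdify((x1, x2), var_y)
bias_y_num = sympy.lambdify((x1, x2), bias_y)

# Minimize var_y subject to f(x) = y_target - bias_y(x), enforced here
# with a quadratic penalty term.
def objective(x):
    err = f_num(x[0], x[1]) - (y_target - bias_y_num(x[0], x[1]))
    return var_y_num(x[0], x[1]) + 1e4 * err ** 2

# Initial guess near the expected operating point.
x_opt = fmin(objective, [2.0, 2.5], xtol=1e-8, ftol=1e-8,
             maxiter=2000, maxfun=4000, disp=False)
print('x1 = %.3f, x2 = %.3f' % (x_opt[0], x_opt[1]))
print('var_y = %.3f, f(x) = %.3f, bias = %.4f' %
      (var_y_num(x_opt[0], x_opt[1]), f_num(x_opt[0], x_opt[1]),
       bias_y_num(x_opt[0], x_opt[1])))
```

The symbolic step means that equations (2) and (3) never have to be derived by hand: changing the process function f automatically updates the variance and bias expressions.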

11 Conclusions
