Purpose of Module 17
What is the purpose of this module?
This module provides a basic introduction to multivariate analysis
(MVA) as it is applied to chemical engineering. After completing
this module, the student should have sufficient understanding to
apply this statistical method to real data.
The target audience for this module is:
Upper-year engineering students, and
Practising engineers, particularly those in an industrial setting.
Structure of Module 17
What is the structure of this module?
Module 17 is divided into 3 tiers, each with a specific goal:
Tier 1: Basic introduction
Tier 2: Worked example
Tier 3: Open-ended problem
These tiers are intended to be completed in order. Students are
quizzed at various points, to measure their degree of understanding,
before proceeding.
Each tier contains a statement of intent at the beginning, and a quiz
at the end.
TIER 1:
Basic Introduction
Tier 1: Contents
Tier 1 is broken down into two sections:
[Figure: from data to knowledge. DATA (raw numbers) + relations = INFORMATION (observed associations); INFORMATION + patterns = KNOWLEDGE (scientific principles). Axes: connectedness and understanding.]
Theoretical Model
Chemical engineers create two types of models to simulate an
industrial process. The first of these is a theoretical model, which
uses First Principles to mimic the inner workings of the process.
These models are based on a process flowsheet, and each unit
operation is modelled separately: reactors, tanks, mixers, heat
exchangers, and so forth. Heat and mass balances are calculated,
along with other thermodynamic factors. Chemical reactions are
accounted for explicitly, as are the physical properties of the various
gas, liquid and solid streams.
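As a minimal sketch of the kind of balance such a model contains, the steady-state mass balance around a single mixer can be written in a few lines. The function name, the two-inlet layout and the stream values are assumptions made for illustration; they do not come from the module.

```python
def mixer_balance(f1, x1, f2, x2):
    """Steady-state mass balance around a two-inlet mixer.

    f1, f2: inlet mass flows (same units, e.g. kg/h)
    x1, x2: mass fraction of one component in each inlet
    Returns the outlet flow and the outlet mass fraction.
    """
    f_out = f1 + f2                        # total mass balance
    x_out = (f1 * x1 + f2 * x2) / f_out    # component mass balance
    return f_out, x_out


# Hypothetical streams: 10 kg/h at 20% solids mixed with 30 kg/h at 40% solids.
flow, fraction = mixer_balance(10.0, 0.20, 30.0, 0.40)
print(flow, fraction)
```

A full theoretical model chains many such unit balances together, adding energy balances, reaction terms and physical property calculations.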
Empirical Model
The second type of model created by chemical engineers is the
empirical or black-box model. This approach uses the plant
process data directly, to establish mathematical correlations.
Unlike the theoretical models, empirical models do NOT take the
process fundamentals into account. They only use pure
mathematical and statistical techniques. MVA is one such method,
because it reveals patterns and correlations independently of any
pre-conceived notions.
Obviously this approach is very sensitive to "garbage in, garbage out", which is why validation of the model is so important.
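The simplest empirical relationship is a correlation coefficient between two measured variables. The sketch below computes Pearson's r from scratch; the variable names and readings are invented for illustration and are not plant data from this module.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical readings of two process variables.
steam_flow = [10.2, 11.0, 11.9, 13.1, 14.0]
dryer_temp = [80.5, 82.1, 83.8, 86.4, 88.0]
print(round(pearson(steam_flow, dryer_temp), 3))
```

A value near +1 or -1 indicates a strong linear association, but says nothing about cause. This is one reason why an empirical model has to be validated against data that was not used to build it.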
NAMP Module 17: Introduction to Multivariate Analysis
What is MVA?
Multivariate analysis (MVA) is defined as the simultaneous analysis of
more than five variables. Some people use the term megavariate
analysis to denote cases where there are more than a hundred
variables.
MVA uses ALL available data to capture the most information
possible. The basic principle is to boil hundreds of variables
down to a mere handful.
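The idea of condensing correlated variables into a few components can be sketched with a principal component calculation, one common MVA technique. This is a simplified illustration, not the algorithm of any particular MVA package: it extracts only the first component, uses power iteration, and runs on invented data.

```python
import math


def first_component(rows, iters=200):
    """Return (loadings, scores, explained_fraction) for the first principal component."""
    n, m = len(rows), len(rows[0])
    # Mean-centre each column (variable).
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    X = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix of the variables.
    C = [[sum(X[i][a] * X[i][b] for i in range(n)) / (n - 1)
          for b in range(m)] for a in range(m)]
    # Power iteration converges to the dominant eigenvector of C.
    p = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        q = [sum(C[a][b] * p[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(v * v for v in q))
        p = [v / norm for v in q]
    # Scores: projection of each centred observation onto the component.
    scores = [sum(X[i][j] * p[j] for j in range(m)) for i in range(n)]
    var_scores = sum(s * s for s in scores) / (n - 1)
    total_var = sum(C[j][j] for j in range(m))
    return p, scores, var_scores / total_var


# Invented data: three variables that rise and fall together.
rows = [[1, 2, 3.0], [2, 4, 6.1], [3, 6, 8.9], [4, 8, 12.2], [5, 10, 15.0]]
loadings, scores, explained = first_component(rows)
print(f"first component explains {explained:.1%} of the variance")
```

Because the three columns are almost perfectly correlated, a single component carries nearly all of the variance. The same effect lets MVA summarise hundreds of process variables with a handful of components.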
[Figure: MVA reduces a raw data table (thousands of rows, hundreds of columns; impossible to interpret as raw numbers) to a few plots of trends. The calculation is internal to the software. Example table columns: X1, X4, X5, Rep, Y with, Y without.]
Look at the table on the following page. Can you tell anything
from the raw numbers? Of course not. No one could.
Note that MVA can handle up to 10-20% missing data.
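The module does not say how the software copes with missing values. One widely used approach is the NIPALS algorithm, which fits each score and loading using only the entries that are present. The sketch below is an assumption-laden illustration of that idea: it extracts one component, expects columns that are already mean-centred, and uses invented data with one missing entry marked as None.

```python
import math


def nipals_one_component(X, iters=100, tol=1e-12):
    """First component of X (rows = observations), skipping None entries.

    Assumes the columns of X are already mean-centred.
    """
    n, m = len(X), len(X[0])
    # Start from the first column, treating missing entries as zero.
    t = [row[0] if row[0] is not None else 0.0 for row in X]
    p = [0.0] * m
    for _ in range(iters):
        # Loadings: regress each column on t using only the present values.
        for j in range(m):
            present = [i for i in range(n) if X[i][j] is not None]
            num = sum(X[i][j] * t[i] for i in present)
            den = sum(t[i] ** 2 for i in present)
            p[j] = num / den
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        # Scores: regress each row on p using only the present values.
        t_new = []
        for i in range(n):
            present = [j for j in range(m) if X[i][j] is not None]
            num = sum(X[i][j] * p[j] for j in present)
            den = sum(p[j] ** 2 for j in present)
            t_new.append(num / den)
        change = sum((a - b) ** 2 for a, b in zip(t_new, t))
        t = t_new
        if change < tol:
            break
    return t, p


# Invented, already centred data; the true value of the missing entry is -3.
X = [[-2, -4, -6], [-1, -2, None], [0, 0, 0], [1, 2, 3], [2, 4, 6]]
t, p = nipals_one_component(X)
estimate = t[1] * p[2]   # model's reconstruction of the missing entry
print(estimate)
```

The missing entry is never filled in beforehand; the remaining values in its row and column are enough to place the observation on the component. This only works while the gaps are a modest fraction of the table, which is consistent with the 10-20% guideline above.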
Score Plot
The MVA software generates two main types of plots to represent
the data: Score plots and Loadings plots.
The first of these, the Score plot, shows all the original data points
(observations) in a new set of coordinates or components. Each
score is the value of that data point on one of the new component
dimensions:
A score plot shows how the observations are arranged in the new
component space. The score plot for the food data is shown on
the next page. Note how similar countries cluster together.
Score Plot = observations
Loadings Plot
The second type of data plot generated by the MVA software is the
Loadings plot. This is the equivalent of the score plot, only from the
point of view of the original variables.
Each component has a set of loadings or weights, which express the
projection of each original variable onto each new component.
Loadings show how strongly each variable is associated with each
new component. The loadings plot for the food example is shown on
the next page. The further from the origin, the more significant the
correlation.
Note that the quadrants are the same on each type of plot. Sweden
and Denmark are in the top-right corner; so are frozen fish and
vegetables. Using both plots, variables and observations can be
correlated with one another.
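Assuming the components come from a principal component type of projection, which the module implies but does not state, the link between scores and loadings can be written compactly. The symbols below are assumptions introduced for illustration: X is the centred data matrix with N observations and K variables, T holds the scores, P holds the loadings for A components, and E is the residual.

```latex
X = T P^{\mathsf{T}} + E,
\qquad
t_{ia} = \sum_{k=1}^{K} x_{ik}\, p_{ka},
\qquad i = 1,\dots,N, \quad a = 1,\dots,A
```

Each row of T is one point on the score plot and each row of P is one point on the loadings plot. Both plots share the same component axes, which is why an observation and a variable lying in the same quadrant can be related to one another.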
Loadings Plot = variables (projection of old variables onto new components)
Reading List
There is no paint-by-numbers way to learn MVA. Students are
strongly encouraged to read the following papers, in order to begin to
develop an independent understanding of what MVA is used for and
how it works.
After completing this online course, reading the references and
experimenting with real data, the student should at some point
experience a "Eureka!" moment when MVA suddenly makes sense.
Unfortunately, there is no shortcut to achieving this insight. The
recommended papers are:
Broderick, G., J. Paris, J.L. Valade and J. Wood. Applying Latent Vector
Analysis to Pulp Characterization, Paperi ja Puu 77(6-7): 410-419.
Saltin, J.F., and B.C. Strand. Analysis and Control of Newsprint Quality
and Paper Machine Operation Using Integrated Factor Networks, Pulp and
Paper Canada 96(7): 48-51.
Basic Statistics
It is assumed that the student is familiar with the following basic
statistical concepts:
Statistical Tests
Classical statistics is severely hampered by certain assumptions about data:
- All values are accurate
- All variables are uncorrelated
- There are no missing data
For real process data, such assumptions are totally unrealistic.
Regression
Regression can be summarised as follows:
Take a set of data points, each described by a vector of values
(y, x1, x2, ..., xn).
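A minimal sketch of ordinary least-squares regression on such vectors is given below. It assumes a linear model with an intercept, solves the normal equations by Gaussian elimination, and uses invented data. The function names are hypothetical and the module does not prescribe this method.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def fit_linear(xs, ys):
    """Least-squares coefficients [b0, b1, ..., bn] for y = b0 + sum(bk * xk)."""
    rows = [[1.0] + list(x) for x in xs]               # prepend intercept column
    k = len(rows[0])
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * y for r, y in zip(rows, ys)) for a in range(k)]
    return solve(XtX, Xty)


# Invented data generated from y = 1 + 2*x1 - 3*x2.
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1, 3, -2, 0, 2]
coefficients = fit_linear(xs, ys)
print(coefficients)
```

When the x variables are strongly correlated, the normal equations become nearly singular and the coefficients unstable. This is the practical consequence of the "uncorrelated variables" assumption listed under Statistical Tests.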
[Figure: plots of Y against the X variables.]
Tier 1, Part 1, Rev.: 0