
NC STATE UNIVERSITY

Program for North American Mobility in Higher Education
Introducing Process Integration for Environmental Control in Engineering Curricula

MODULE 17: Introduction to Multivariate Analysis

Created at: Ecole Polytechnique de Montreal & North Carolina State University, 2003.
NAMP Module 17: Introduction to Multivariate Analysis

Tier 1, Part 1, Rev.: 0

Purpose of Module 17
What is the purpose of this module?
This module provides a basic introduction to multivariate analysis
(MVA) as it is applied to chemical engineering. After completing
this module, the student should have sufficient understanding to
apply this statistical method to real data.
The target audience for this module is:
Upper-year engineering students, and
Practising engineers, particularly those in an industrial setting.


Prerequisites for Module 17


What are the prerequisites for this module?
Before starting this module, the student must have first completed
Module 8, Introduction to Process Integration. This module
includes basic concepts not repeated here, notably those related to
data quality.
Applying MVA to real data, without having an understanding of data
quality, is a recipe for disaster. The software will generate results,
but they could be totally meaningless and misleading.
It is further assumed that students already have an introductory-level
background in statistics, such as would normally be part of any
undergraduate engineering curriculum.


Structure of Module 17
What is the structure of this module?
Module 17 is divided into 3 tiers, each with a specific goal:
Tier 1: Basic introduction
Tier 2: Worked example
Tier 3: Open-ended problem
These tiers are intended to be completed in order. Students are
quizzed at various points, to measure their degree of understanding,
before proceeding.
Each tier contains a statement of intent at the beginning, and a quiz
at the end.


TIER 1:
Basic Introduction


Tier 1: Statement of Intent


Tier 1: Statement of intent:
The goal of Tier 1 is to familiarise the student with the basic
concepts of multivariate analysis (MVA). At the end of Tier 1, the
student should be able to answer the following questions:
What is the difference between univariate and multivariate
statistics?
Why is MVA used in a process integration context?
How does MVA fit into the bigger picture?
What are the specific types of MVA?
Tier 1 also includes some selected readings, to help the student
acquire a deeper understanding of this subject. It is impossible to
spoon-feed someone about a technique as complex as MVA. The
student must begin to delve into this topic independently right from
the start.

Tier 1: Contents
Tier 1 is broken down into two sections:

1.1 What is MVA used for?


1.2 How does MVA work?
At the end of Tier 1 there is a short multiple-answer quiz.


1.1: What is MVA used for?


Process Integration Challenge: Make sense of masses of data
Drowning in data!
Many organisations today are faced with the same challenge: TOO
MUCH DATA. These include:
Business - customer transactions
Communications - website use
Government - intelligence
Science - astronomical data
Pharmaceuticals - molecular configurations
Industry - process data
It is the last item that is of interest to us as chemical engineers.

Too Much Process Data


A typical industrial plant has hundreds of control loops, and
thousands of measured variables, many of which are updated every
few seconds.
This situation generates tens of millions of new data points each
day, and billions of data points each year. Obviously, this is far too
much for a human brain to absorb. Because of the way we visualise
things, we are basically limited to looking at only one or two
variables at a time.
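
The orders of magnitude quoted above can be checked with a rough calculation. The sketch below assumes, purely for illustration, a plant with 2,000 measured variables that are each logged every 5 seconds; neither figure comes from a specific plant.

```python
# Rough data-volume estimate for a hypothetical plant.
# ASSUMPTIONS (illustrative only): 2,000 measured variables, one reading every 5 s.
SECONDS_PER_DAY = 24 * 60 * 60


def data_points_per_day(n_variables: int, sample_period_s: float) -> float:
    """Number of new readings logged per day across all variables."""
    return n_variables * SECONDS_PER_DAY / sample_period_s


per_day = data_points_per_day(n_variables=2000, sample_period_s=5)
per_year = per_day * 365

print(f"{per_day:,.0f} data points per day")    # 34,560,000
print(f"{per_year:,.0f} data points per year")  # 12,614,400,000
```

Under these assumptions the plant produces roughly 35 million readings per day and about 13 billion per year, consistent with "tens of millions" and "billions".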


Data-Rich but Knowledge-Poor


As a result, we have become data-rich but knowledge-poor.
The biggest problem is that interesting, useful patterns and
relationships which are not intuitively obvious lie hidden inside
enormous, unwieldy databases. Also, many variables are correlated.
This has led to the creation of data-mining techniques, aimed at
extracting this useful knowledge. Some examples are:
Neural Networks
Multiple Regression
Decision Trees
Genetic Algorithms
Clustering
MVA (subject of this module)
[Figure: mining data]

Data → Information → Knowledge

The aim of data-mining can be illustrated graphically as follows:
Data: unrelated facts
Information: facts plus relations
Knowledge: information plus patterns

[Figure: DATA, INFORMATION (+ relations) and KNOWLEDGE (+ patterns) plotted on axes of Connectedness versus Understanding; labelled stages: raw numbers, observed associations, scientific principles.]


Process Modelling from First Principles

Theoretical Model (INSIDE → OUT)
Chemical engineers create two types of models to simulate an
industrial process. The first of these is a theoretical model, which
uses First Principles to mimic the inner workings of the process.
These models are based on a process flowsheet, and each unit
operation is modelled separately: reactors, tanks, mixers, heat
exchangers, and so forth. Heat and mass balances are calculated,
along with other thermodynamic factors. Chemical reactions are
accounted for explicitly, as are the physical properties of the various
gas, liquid and solid streams.


Data-Driven Process Modelling

Empirical Model (OUTSIDE → IN)
The second type of model created by chemical engineers is the
empirical or black-box model. This approach uses the plant
process data directly, to establish mathematical correlations.
Unlike the theoretical models, empirical models do NOT take the
process fundamentals into account. They only use pure
mathematical and statistical techniques. MVA is one such method,
because it reveals patterns and correlations independently of any
pre-conceived notions.
Obviously, this approach is very sensitive to "garbage in, garbage out", which is why validation of the model is so important.

What is MVA?
Multivariate analysis (MVA) is defined as the simultaneous analysis of
more than five variables. Some people use the term megavariate
analysis to denote cases where there are more than a hundred
variables.
MVA uses ALL available data to capture the most information
possible. The basic principle is to boil hundreds of variables
down to a mere handful.


Multivariate Analysis is Based on Ockham's Razor

"Pluralitas non est ponenda sine necessitate."
Rough translation: "Don't make things more complicated than they need to be."

William of Ockham (1285-1347) was an English monk who laid one of the
cornerstones of the Scientific Method with his famous "razor" (so
named because it serves to cut out the unnecessary parts of a
scientific theory).

Essentially, Ockham realised back in the 14th century that deep down,
Nature is simple.


Example: Apples and Oranges


A good example of these ideas is Apple versus Orange.
Clever scientists could easily come up with hundreds of different
things to measure on apples and oranges, to tell them apart:
Colour, shape, firmness, reflectivity,
Skin: smoothness, thickness, morphology,
Juice: water content, pH, composition,
Seeds: colour, weight, size distribution,
etc.

[Figure: the two fruits, labelled +1 and -1]

However, there will never be more than one difference: is it an
apple or an orange? In MVA parlance, we would say that there is
only one latent attribute.
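
The idea of a single latent attribute can be sketched numerically. The example below uses synthetic numbers (not real fruit measurements): twelve hypothetical measured variables are all generated from one hidden class coded +1 or -1, plus noise. The variable names, noise level and sensitivities are assumptions made for illustration only.

```python
import numpy as np

# Synthetic illustration (not real fruit data): 200 fruits, 12 measured
# variables, all driven by ONE hidden attribute coded +1 or -1.
rng = np.random.default_rng(0)
n_fruits, n_vars = 200, 12

latent = rng.choice([-1.0, 1.0], size=n_fruits)         # the hidden class
sensitivity = rng.uniform(0.5, 2.0, size=n_vars)         # response of each variable
noise = 0.3 * rng.standard_normal((n_fruits, n_vars))    # measurement noise
X = np.outer(latent, sensitivity) + noise

# Centre the columns, then use the singular values to see how the total
# variance is shared among the principal components.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)

print("Share of variance per component:", np.round(explained[:3], 3))
```

Although twelve variables are measured, the first component should carry nearly all of the variance, because only one underlying attribute is actually varying.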

Graphical Representation of MVA


The main element of MVA is the reduction in dimensionality.
Taken to its extreme, this can mean going from hundreds of
dimensions (variables) down to just two, allowing us to create a 2-dimensional graph.
Using these graphs, which our eyes and brains can easily handle,
we are able to peer into the database and identify trends and
correlations.
This is illustrated on the next page.

[Figure: peering into the data]


Graphical representation of MVA

[Figure: Raw Data table (hundreds of columns, thousands of rows; impossible to interpret) → Statistical Model (internal to software) → 2-D Visual Outputs showing trends.]


Illustrative Data Set: Food Consumption in European Countries
To illustrate these concepts, we take an easy-to-understand
example involving food.
Data on food preferences in 16 different European countries are
considered, involving the consumption patterns for 18 different
food groups.

Look at the table on the following page. Can you tell anything
from the raw numbers? Of course not. No one could.

Data Table: Food Consumption in European Countries

[Table: food consumption data by country and food group. Courtesy of Umetrics corp.]

Note that MVA can handle up to 10-20% missing data.


Score Plot
The MVA software generates two main types of plots to represent
the data: Score plots and Loadings plots.
The first of these, the Score plot, shows all the original data points
(observations) in a new set of coordinates or components. Each
score is the value of that data point on one of the new component
dimensions:

The Score Plot is the projection of the original data points onto a
plane defined by two new components.

A score plot shows how the observations are arranged in the new
component space. The score plot for the food data is shown on
the next page. Note how similar countries cluster together.

Score Plot for Food Example

[Figure: score plot of the food data, with the 95% confidence interval (analogous to a t-test) marked.]
Score Plot = observations


Loadings Plot
The second type of data plot generated by the MVA software is the
Loadings plot. This is the equivalent of the score plot, only from the
point of view of the original variables.
Each component has a set of loadings or weights, which express the
projection of each original variable onto each new component.
Loadings show how strongly each variable is associated with each
new component. The loadings plot for the food example is shown on
the next page. The further from the origin, the more significant the
correlation.
Note that the quadrants are the same on each type of plot. Sweden
and Denmark are in the top-right corner; so are frozen fish and
vegetables. Using both plots, variables and observations can be
correlated with one another.
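
How scores and loadings relate can be sketched with a principal component analysis computed by singular value decomposition. This is one common way to obtain such components and is not necessarily what any particular MVA package does internally. The data are synthetic (not the food data), and the function name `pca_scores_loadings` is hypothetical.

```python
import numpy as np


def pca_scores_loadings(X, n_components=2):
    """Return (scores T, loadings P) of a PCA computed via SVD.

    X is an (observations x variables) array. Columns are mean-centred and
    scaled to unit variance first, a common (but not universal) pre-treatment.
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    P = Vt[:n_components].T   # loadings: one row per original variable
    T = Xs @ P                # scores: one row per observation
    return T, P


# Synthetic stand-in for a data table: 16 observations, 6 variables built
# from 2 hidden factors plus a little noise (NOT the food data).
rng = np.random.default_rng(1)
factors = rng.standard_normal((16, 2))
mixing = rng.standard_normal((2, 6))
X = factors @ mixing + 0.1 * rng.standard_normal((16, 6))

T, P = pca_scores_loadings(X)
print("scores shape:  ", T.shape)   # one row per observation
print("loadings shape:", P.shape)   # one row per variable
```

Plotting the two columns of `T` against each other gives a score plot; plotting the two columns of `P` gives the matching loadings plot. Because both use the same pair of components, an observation and a variable lying in the same direction from the origin are associated.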

Use of loadings (illustration)

[Figure: loadings plot for the food example, showing the projection of the old variables onto the new components.]
Loadings Plot = variables

To MVA, Data Overload is Good!


One great advantage of MVA is that the more data are available,
the less noise matters (assuming that the noise is normally
distributed). This is one of the reasons MVA is used to mine huge
amounts of data.
This is analogous to NMR measurements in a laboratory. The
more trials there are, the clearer the spectrum becomes:

[Figure: NMR spectrum in three stages (1, 2, 3). Labels: "Looks random"; "After 1500 trials"; "Not random at all" (+ve and -ve noise cancels out).]

Too Much Data is Good!


Another analogy is the toy compass that used to be given as a prize
in a box of Cracker Jack. One of these compasses alone was next to
useless.

However, if somebody had a thousand compasses and took an average, a
useful result might be obtained.
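
The compass analogy can be simulated in a few lines. The sketch assumes, for illustration only, a true heading of 40 degrees and a normally distributed reading error with a standard deviation of 30 degrees per compass. Angle wrap-around at 0/360 degrees is ignored for simplicity.

```python
import numpy as np

# Hypothetical set-up: the true heading is 40 degrees, and each toy compass
# reads it with a normally distributed error (standard deviation 30 degrees).
# Wrap-around at 0/360 degrees is ignored in this simple sketch.
rng = np.random.default_rng(42)
true_heading = 40.0
error_sd = 30.0


def average_reading(n_compasses: int) -> float:
    """Mean of n independent noisy compass readings, in degrees."""
    readings = true_heading + error_sd * rng.standard_normal(n_compasses)
    return float(readings.mean())


for n in (1, 10, 1000):
    # The standard error of the mean shrinks as error_sd / sqrt(n).
    standard_error = error_sd / np.sqrt(n)
    print(f"n={n:5d}  average={average_reading(n):7.2f}  "
          f"standard error={standard_error:5.2f}")
```

With one compass the expected error is 30 degrees; with a thousand it falls below 1 degree, which is the sense in which more data make the noise matter less.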
Dictionary time: Look up the definitions of "induction" and "deduction".


Multivariate Analysis: Benefits


What is the point of doing MVA?
The first potential benefit is to explore the inter-relationships
between different process variables. It is well known that simply
creating a model can provide insight into the process itself ("Learn by
modelling").
Once a representative model has been created, the engineer can
perform "what if?" exercises without affecting the real process. This
is a low-cost way to investigate options.
Some important parameters, like final product quality, cannot be
measured in real time. They can, however, be inferred from other
variables that are measured on-line. When incorporated in the
process control system, this "inferential controller" or "soft sensor"
can greatly improve process performance.

MVA is Different to Neural Networks

Both are data-driven black box models


Both learn using real data
However, what is inside the black box is totally different
(NN is non-linear)
Neural Networks seek to reproduce the neuron-to-neuron
linkages in the brain
Much as genetic algorithms seek to reproduce Darwinian
evolution


Reading List
There is no paint-by-numbers way to learn MVA. Students are
strongly encouraged to read the following papers, in order to begin to
develop an independent understanding of what MVA is used for and
how it works.
After doing this on-line course, reading the references and playing
around with real data, the student should at some point experience a
Eureka! moment when suddenly MVA makes sense. Unfortunately,
there is no shortcut to achieving this insight:
Broderick, G., J. Paris, J.L. Valade and J. Wood. "Applying Latent Vector Analysis to Pulp Characterization", Paperi ja Puu, 77 (6-7): 410-419.
Saltin, J. F., and B. C. Strand. "Analysis and Control of Newsprint Quality and Paper Machine Operation Using Integrated Factor Networks", Pulp and Paper Canada, 96 (7): 48-51.



Kooi, S. "Adaptive Inferential Control of Wood Chip Refiner", Tappi Journal, 77 (11): 185-194.
Kresta, J. V., T. E. Marlin and J. F. MacGregor (1994). "Development of Inferential Process Models Using PLS", Computers and Chemical Engineering, 18 (7): 597-611.
Marklund, A. "Prediction of Strength Parameters for Softwood Kraft Pulps", Nordic Pulp & Paper Research Journal, 13 (3): 211-219.
Tessier, P., G. Broderick, P. Plouffe (2001). "Competitive Analysis of North American Newsprint Producers Using Composite Statistical Indicators of Product and Process Performance", TAPPI Journal, 84 (3).


1.2: How does MVA work?


Basic Statistics
It is assumed that the student is familiar with the following basic
statistical concepts:

Mean / median / mode
Standard deviation / variance
Normality / symmetry
Degree of association: correlation coefficients
Degree of explanation: R², F-test
Significance of differences: t-test, Chi-square

If not, or if it's been a while, it is advisable to consult an introductory
statistics text and do a cursory review.

Statistical Tests
Classical statistics is severely hampered by certain assumptions about data:
All values are accurate
All variables are uncorrelated
There are no missing data
For real process data, such assumptions are totally unrealistic.

Statistical tests help characterise an existing dataset. They do NOT
enable you to make predictions about future data. For this we must
turn to regression techniques.


Regression
Regression can be summarised as follows:
Take a set of data points, each described by a vector of values $(y, x_1, x_2, \dots, x_n)$.
Find an algebraic equation
$y = b_1 x_1 + b_2 x_2 + \dots + b_n x_n + e$
that best expresses the relationship between $y$ and the $x_i$'s.
This equation can be used to predict a new y-value given new $x_i$'s.
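
The fit-then-predict procedure described above can be sketched with NumPy's least-squares solver. The data are synthetic and generated from known coefficients (`true_b`, an assumption made so the result can be checked); the model has no intercept, matching the form of the equation given here.

```python
import numpy as np

# Synthetic data generated from known coefficients, so the fit can be checked.
rng = np.random.default_rng(7)
true_b = np.array([2.0, -1.0, 0.5])
X = rng.standard_normal((100, 3))                  # 100 points, 3 x-variables
y = X @ true_b + 0.05 * rng.standard_normal(100)   # e = small random error

# Least-squares estimate of b in y = b1*x1 + b2*x2 + b3*x3 + e
# (no intercept term, matching the form of the equation in the text).
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict y for a new observation.
x_new = np.array([1.0, 2.0, 3.0])
y_pred = float(x_new @ b_hat)

print("estimated b:", np.round(b_hat, 2))
print("predicted y:", round(y_pred, 2))
```

The estimated coefficients should land close to `true_b`, and the prediction close to 2(1) - 1(2) + 0.5(3) = 1.5, with small differences due to the error term.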


Independent vs. Dependent Variables

The $x_i$'s in the preceding equation are called independent
variables. They are used to predict $y$.

$y$ is called the dependent variable because, the way the
equation is written, its value depends on the $x_i$'s.


Simple vs. Multiple Regression

Simple regression has only one x:
$y = b x + e$

Multiple regression has more than one x:
$y = b_1 x_1 + b_2 x_2 + \dots + b_n x_n + e$


Linear vs. Nonlinear Regression

Linear regression involves no powers of $x_i$ (square, cube, etc.)
and no cross-product terms of the form $x_i x_j$.

If such terms are present, we are dealing with nonlinear regression.


The Error Term e

The error term expresses the uncertainty in an empirical predictive
equation derived from imperfect observations.

Factors contributing to the error term include:
measurement error
measurement noise
unaccounted-for natural variations
disturbances to the process being measured


The Least Squares Principle

Regression tries to produce a "best fit" equation, but what is "best"?

Criterion: minimise the sum of squared deviations of data points from
the regression line.
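
The criterion can be made concrete for the simple regression y = bx + e. Setting the derivative of the sum of squared deviations with respect to b to zero gives b = Σxy / Σx². The sketch below uses a small made-up data set (an assumption for illustration) and checks that moving b away from this value in either direction increases the sum of squares.

```python
import numpy as np


def sum_squared_deviations(b: float, x: np.ndarray, y: np.ndarray) -> float:
    """Sum of squared vertical deviations of the points from the line y = b*x."""
    return float(np.sum((y - b * x) ** 2))


# Small made-up data set, roughly following y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# For y = b*x + e, setting d(SSE)/db = 0 gives b = sum(x*y) / sum(x*x).
b_best = float(np.sum(x * y) / np.sum(x * x))

for b in (b_best - 0.1, b_best, b_best + 0.1):
    print(f"b={b:.3f}  SSE={sum_squared_deviations(b, x, y):.4f}")
```

The middle row, at the least-squares value of b, has the smallest sum of squared deviations; this is what "best" means under this criterion.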
