User Manual
Version 1.0
CAMO SOFTWARE AS
Nedre Vollgate 8, N-0158, Oslo, NORWAY
Tel: (47) 223 963 00
Fax: (47) 223 963 22
E-mail : info@camo.com | www.camo.com
The Unscrambler X v10.3
Copyright
All intellectual property rights in this work belong to CAMO Software AS. The information contained in this work
must not be reproduced or distributed to others in any form or by any means, electronic or mechanical, for any
purpose, without the express written permission of CAMO Software AS. This document is provided on the
understanding that its use will be confined to the officers of the organization (whose name is stated on the front
cover of this document) who acquired it and that no part of its contents will be disclosed to third parties without
prior written consent of CAMO Software AS.
Copyright © 2014 CAMO Software AS. All Rights Reserved
All other trademarks and copyrights mentioned in the document are acknowledged and belong to their respective
owners.
Disclaimer
This document has been reviewed and quality assured for accuracy of content. Succeeding versions of this
document are subject to change without notice and will reflect changes made to subsequent software version.
It is the sole responsibility of the organization using this document to ensure all tests meet the criteria specified in
the test scripts. CAMO Software takes no responsibility for the end use of the product as this requires the
performance of suitable feasibility trials and performance qualification to ensure the software is fit for purpose for
its intended use.
Table of Contents
1. Welcome to The Unscrambler® X ................................................................................. 1
2. Support Resources........................................................................................................ 3
3. Overview ...................................................................................................................... 5
7. Plots.......................................................................................................................... 231
1. Welcome to The Unscrambler® X
The Unscrambler® is a complete multivariate data analysis and experimental design software
solution, equipped with powerful methods including PCA, PLS, clustering and classification.
See the release notes for a list of fixes, new features and known limitations.
2. Support Resources
2.1. Support resources on our website
Our web site is filled with resources, case studies, recorded webinars as well as information
about our products and commercial offerings, including courses and professional services.
Support
Webinars
Training courses
Consulting
3. Overview
3.1. What is The Unscrambler® X?
A brief review of the tasks that can be carried out using The Unscrambler® X.
Set up experiments, analyze effects and find optima using the Design of Experiments
(DoE) module;
Reformat and preprocess data to enhance future analyses;
Find relevant variation in one data matrix (X);
Find relationships between two data matrices (X and Y);
Validate multivariate models with Uncertainty Testing;
Resolve unknown mixtures by finding the number of pure components and
estimating their concentration profiles and spectra;
Predict the unknown values of a response variable;
Classify unknown samples into various possible categories.
One should always remember, however, that there is no point in trying to analyze data if
they do not contain any meaningful information. Experimental design is a valuable tool for
building data tables which give such meaningful information. The Unscrambler® can help to
do this in an elegant way.
The Unscrambler® satisfies the US FDA’s requirements for 21 CFR Part 11 compliance.
The Unscrambler X Main
The purpose of experimental design is to generate experimental data that enable one to
determine which design variables (X) have an influence on the response variables (Y), in
order to understand the interactions between the design variables and thus determine the
optimum conditions. Of course, it is equally important to do this with a minimum number of
experiments to reduce costs. An experimental design program should offer appropriate
design methods and encourage good experimental practice, i.e. allow one to perform few
but useful experiments which span the important variations.
Screening designs (e.g. fractional, full factorial and Plackett-Burman) are used to find out
which design variables have an effect on the responses and are suitable for collection of data
spanning all important variations.
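As an illustration, a two-level full factorial design simply enumerates every combination of the coded variable levels. The sketch below is a hypothetical helper (`full_factorial` is not an Unscrambler® function) showing the idea for a 2^3 design:

```python
from itertools import product

def full_factorial(levels_per_var):
    """Generate all runs of a full factorial design.

    levels_per_var: dict mapping variable name -> list of levels.
    Returns a list of dicts, one per experimental run.
    """
    names = list(levels_per_var)
    runs = []
    for combo in product(*(levels_per_var[n] for n in names)):
        runs.append(dict(zip(names, combo)))
    return runs

# A 2^3 design: three design variables at two coded levels (-1, +1)
design = full_factorial({"Temp": [-1, 1], "Time": [-1, 1], "pH": [-1, 1]})
print(len(design))  # 8 runs
```

Fractional factorial and Plackett-Burman designs reduce this run count further by sacrificing some interaction information.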
Optimization designs (e.g. central composite, Box-Behnken) aim to find the optimum
conditions for a process and generate nonlinear (quadratic) models. They generate data
tables that describe relationships in more detail, and are usually used to refine a model, i.e.
after the initial screening has been performed.
Whether the purpose of designed experiments is screening or optimization, there may be
multilinear constraints among some of the design variables. In such a case a D-optimal
design may be required.
Another special case is that of mixture designs, where the main design variables are the
components of a mixture. The Unscrambler® provides the classical types of mixture designs,
with or without additional constraints.
There are several methods for analysis of experimental designs. The Unscrambler® uses Multiple Linear Regression (MLR) as its default method for orthogonal designs. For non-orthogonal designs, or when the levels of a design cannot be reached, The Unscrambler® allows the use of other methods, such as PCR or PLS, for this purpose.
The Unscrambler® finds this information by decomposing the data matrix into a structured
part and a noise part, using a technique called Principal Component Analysis (PCA).
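The decomposition into a structured part and a noise part can be illustrated with a few lines of NumPy. This is a sketch of the idea behind PCA (via the singular value decomposition), not The Unscrambler®'s actual implementation, and the data are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 20 samples, 5 variables, with 2 underlying components plus small noise
X = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(20, 5))

Xc = X - X.mean(axis=0)            # mean-center the data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                              # number of principal components kept
T = U[:, :k] * s[:k]               # scores
P = Vt[:k]                         # loadings
E = Xc - T @ P                     # residual (noise) part

# Centered X = structured part (T P) + noise part (E)
print(np.abs(E).max())             # small: almost all variation lies in 2 PCs
```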
Related topics:
User interface basics
Principles of regression
Principles of classification
Purposes of classification
Classification methods
SIMCA classification
Linear Discriminant Analysis
Support Vector Machines classification
PLS Discriminant Analysis
Steps in SIMCA classification
Classifying new samples
Outcomes of a classification
Classification based on a regression model
Cluster analysis
Projection
Once the PLS model has been checked and validated (see the chapter about multivariate
regression for more details on diagnosing and validating a model), one can run a Prediction
in order to classify new samples. The prediction results are interpreted by viewing the plot
Predicted with Deviations for each class indicator Y-variable:
Samples with Ypred > 0.5 and a deviation that does not cross the 0.5 line are
predicted members;
Samples with Ypred < 0.5 and a deviation that does not cross the 0.5 line are
predicted nonmembers;
Samples with a deviation that crosses the 0.5 line cannot be safely classified.
See Chapter Prediction for more details on how to run a prediction and interpret results. A
tutorial explaining PLS-DA in practice is also available: PLS Discriminant Analysis.
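The decision rule above can be sketched as a small helper function. This is illustrative only; `classify_plsda`, its arguments and the returned labels are hypothetical and not part of The Unscrambler®:

```python
def classify_plsda(y_pred, deviation, threshold=0.5):
    """Apply the Predicted-with-Deviations decision rule for one
    class-indicator Y-variable (illustrative sketch)."""
    lo, hi = y_pred - deviation, y_pred + deviation
    if lo > threshold:
        return "member"           # interval entirely above the 0.5 line
    if hi < threshold:
        return "nonmember"        # interval entirely below the 0.5 line
    return "unclassified"         # interval crosses the 0.5 line

print(classify_plsda(0.9, 0.1))   # member
print(classify_plsda(0.1, 0.2))   # nonmember
print(classify_plsda(0.55, 0.2))  # unclassified
```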
The last case is not necessarily a problem. It may be a quite interpretable outcome,
especially in a one-class problem. A typical example is product quality prediction, which can
be done by modeling the single class of acceptable products. If a new sample belongs to the
modeled class, it is accepted; otherwise, it is rejected.
Use Find in page to search for a phrase within the current page.
What is regression?
General notation and definitions
The whys and hows of regression modeling
What is a good regression model?
Regression methods in The Unscrambler®
Multiple Linear Regression (MLR)
Principal Component Regression (PCR)
Partial Least Squares Regression (PLSR)
L-PLS Regression
Support Vector Machine Regression (SVMR)
Calibration, validation and related samples
Main results of regression
Making the right choice with regression methods
How to interpret regression results
How to detect nonlinearities (lack of fit)
What are outliers and how are they detected?
Guidelines for calibration of spectroscopic data
The simplest regression model relates a single predictor x to the response y by a straight line,

y = b0 + b1x + e

where b0 is an intercept term and b1 is a regression coefficient; in this case, the slope of the straight line.

Multivariate regression takes several predictor variables into account, thus modeling the property of interest with more accuracy. The form of the model is

y = b0 + b1x1 + b2x2 + … + bpxp + e

where the terms in the equation are defined as above, with one regression coefficient per predictor. This chapter focuses on the general principles of multivariate regression.
The whys and hows of regression modeling
Building a regression model involves collecting the predictors and the corresponding
response values for a set of samples, and then finding the optimal parameters in a
predefined mathematical relationship to the collected data. A commonly used measure of
optimality is the minimization of the sum of squares of the deviations between the
measured and predicted responses.
For example, in analytical chemistry, spectroscopic measurements are made on solutions
with known concentrations of a component of interest. Regression is then used to relate the
concentration of the component of interest to the spectrum.
Once a regression model has been built, it can be used to predict the unknown
concentration for new samples, using the spectroscopic measurements as predictors. The
advantage is obvious if the concentration is difficult or expensive to measure directly.
Replacement with the spectroscopic method is less expensive and in some cases, requires
minimal to no sample preparation. It also allows for development of spectroscopic
measurements for real-time process monitoring.
The most common motivations for developing regression models as predictive tools may
include:
Replacement of expensive or time-consuming analysis methods, with cheap, rapid,
easy-to-perform measurements (e.g. NIR spectroscopy, mass spectrometry for gas
analysis).
Building a response surface model from the results of an experimental design, i.e. describing precisely the response levels according to the values of a few controlled factors.
What is a good regression model?
The purpose of a regression model is to extract all the information relevant for the
prediction of the response from the available data.
Unfortunately, observed data usually contains some amount of noise and in some cases,
irrelevant information.
Noise can be random variation in the response due to experimental error, or it can be
random variation in the data values due to measurement error. It may also be some amount
of response variation due to factors which are not included in the model.
Irrelevant information is carried by predictors which have little or nothing to do with the
modeled phenomenon. For instance, NIR absorbance spectra may carry some information
relative to the solvent and not only to the compound of interest in developing a model to
predict the concentration of the compound in solution.
A good regression model should be able to:
Model only relevant information, by giving high weight to relevant sources of information and downweighting any irrelevant variation.
Avoid overfitting, i.e. distinguish between variation in the response that can be explained by variation in the predictors and variation caused by mere noise.
Regression methods in The Unscrambler®
The Unscrambler® provides five regression method choices: MLR, PCR, PLSR, L-PLS and SVMR.
Multiple Linear Regression (MLR) fits the model directly by least squares, which involves a matrix inversion. This inversion can be numerically unstable when there is collinearity, that is, when the variables are not linearly independent. Incidentally, this is the
reason why the predictors are called independent variables in MLR; the ability to vary
independently of each other is a crucial requirement to variables used as predictors with this
method. MLR requires more samples than predictors since the system with more variables
than samples would not have a unique solution.
The Unscrambler® uses QR decomposition to find the MLR solution. No missing values are accepted.
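The QR route can be sketched in a few lines of NumPy. This is an illustrative sketch, not The Unscrambler®'s actual implementation, and the data and true coefficients are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))               # 30 samples, 3 predictors (more samples than predictors)
X1 = np.column_stack([np.ones(30), X])     # add an intercept column
b_true = np.array([2.0, 1.0, -0.5, 0.3])
y = X1 @ b_true + 0.01 * rng.normal(size=30)

# Solve the least-squares problem via QR instead of inverting X'X:
# X1 = Q R  =>  b = R^{-1} Q' y   (numerically more stable under near-collinearity)
Q, R = np.linalg.qr(X1)
b = np.linalg.solve(R, Q.T @ y)
print(np.round(b, 2))   # close to b_true
```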
More details about MLR regression can be found in the section Multiple Linear Regression
(MLR)
More about PCR can be found in the help section Principal Component Regression (PCR)
More information about the PCR algorithm can be found in Method References.
More about PLS regression can be found in the help section Partial Least Squares Regression
(PLSR)
More details regarding the PLSR algorithm are given in the Method References.
More about L-PLS regression can be found in the help section L-PLS Regression
More details regarding the L-PLSR algorithm are given in the Method References.
More about SVMR can be found in the help section Support Vector Machine Regression
(SVMR)
More details regarding the SVMR algorithm are given in the Method References.
Validation
Checking whether the model is capable of performing its task on a separate test set
of data.
Calibration is the fitting stage in the regression modeling process. The main data set,
containing only the calibration sample set, is used to compute the model parameters (PCs,
regression coefficients).
It is essential to validate models to get an idea of how well a regression model will perform
when it is used to predict new, unknown samples. A test set consisting of samples with
known response values is used. Only the X-values are fed into the model, from which
response values are predicted and compared to the known, actual response values. The
model is validated if the prediction residuals are low and there is no evidence of lack of fit in
the model.
Each of the two steps described above requires its own set of samples; thus, the following terms are used interchangeably: calibration samples = training samples, and validation samples = test samples.
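The calibration/test-set workflow can be sketched as follows. This is an illustrative NumPy sketch of the principle (fitting on calibration samples, predicting the test set from X-values only, and summarizing the prediction residuals), not The Unscrambler®'s implementation; the data are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + 0.05 * rng.normal(size=40)

# Calibration (training) samples fit the model; test samples validate it
Xcal, ycal, Xtest, ytest = X[:30], y[:30], X[30:], y[30:]
b, *_ = np.linalg.lstsq(Xcal, ycal, rcond=None)  # fit on calibration set only

y_pred = Xtest @ b                               # only X-values of the test set are fed in
rmsep = np.sqrt(np.mean((ytest - y_pred) ** 2))  # root mean square error of prediction
print(round(rmsep, 3))                           # small => low prediction residuals
```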
A more detailed description of validation techniques and their interpretation is to be found
in the chapter Validate a Model.
Result            Use   MLR   PCR   PLSR
B-coefficients    I     X     X     X
Residuals         D     X     X     X
Error measures    D     X     X     X
ANOVA             D     X

(I = interpretation, D = diagnostic)
In short, all three regression methods give a model with an equation expressed by the
regression coefficients (b-coefficients), from which predicted Y-values are computed. For all
methods, residuals can be computed as the difference between predicted (fitted) values and
actual (observed) values; these residuals can then be combined into error measures that tell
how well a model performs.
PCR and PLSR, in addition to those standard results, provide powerful interpretation and
diagnostic tools linked to projection: more elaborate error measures, as well as scores and
loadings.
The simplicity of MLR, on the other hand, allows for simple significance testing of the model
with ANOVA and of the b-coefficients with a Student’s t-test (ANOVA will not be presented
hereafter; read more about it in the ANOVA section from Chapter “Analyze Results from
Designed Experiments”.) However, significance testing is also possible in PCR and PLSR, using
Martens’ Uncertainty Test.
B-coefficients
The regression model can be written
meaning that the observed response values (Y) are approximated by a linear combination of
the values of the predictors (X). The coefficients of that combination are called regression
coefficients or B-coefficients.
Several diagnostic statistics are associated with the regression coefficients (available only for
MLR):
Standard error is a measure of the precision of the estimation of a coefficient;
From that, a Student's t-value can be computed;
Comparing the t-value to a reference t-distribution will then yield a significance level or p-value, which indicates whether a regression coefficient is significantly different from 0. If the t-value is found to be nonsignificant, the regression coefficient cannot be distinguished from 0.
Predicted Y-values
Predicted Y-values are computed for each sample by applying the model equation (i.e. the B-
coefficients) to new (or existing) observed X-values.
For PCR or PLSR models, the predicted Y-values can also be computed using projection along
the successive components of the model. This has the advantage of diagnosing samples
which are badly represented by the model, and therefore have high prediction uncertainty.
This is discussed more fully in the chapter Predictions.
Residuals
For each sample, the residual is the difference between the observed Y-value and the
predicted Y-value. It appears as the term e in the model equation.
More generally, residuals may also be computed for each fitting operation in a projection
model: thus the samples have X- and Y-residuals along each PC (factor) in PCR and PLSR
models. Read more about how sample and variable residuals are computed in the chapter
More Details About the Theory of PCA.
Selecting a specific kernel function that is capable of mapping the variable space.
Fine tuning the parameters of the chosen function such that the best calibration and
prediction statistics are achieved.
SVMR provides the least graphical output and diagnostic statistics of all the regression methods implemented in The Unscrambler®, which can make it difficult for the user to develop robust models. However, when they work, SVMR models are much better able to handle nonlinearities than MLR/PCR/PLSR models and can provide an alternative to Artificial Neural Networks (ANN).
Measurement error
Wrong labeling
For projection methods like PCA, PCR and PLSR, outliers can be detected using scores plots,
residuals, leverages and influence plots.
Outliers in regression
In regression, there are many ways for a sample to be classified as an outlier. It may be
outlying according to the X-variables only, or to the Y-variables only, or to both. It may also
not be an outlier for either separate set of variables, but become an outlier when one
considers the (X,Y) relationship. In the latter case, the X-Y Relation Outliers plot (only
available for PLSR) is a very powerful tool showing the (X,Y) relationship and how well the
data points fit into it.
Use of residuals to detect outliers
One can use the residuals in several ways. For instance, first use the residual variance per sample plot to detect samples with a large squared residual, then use a variable residual plot for those samples. The first of the two plots is used to indicate samples with outlying variables, while the latter is used for a detailed study of each of these samples. In both cases, points located far from the zero line indicate outlying samples or variables.
Use of leverages to detect outliers
The leverages are usually plotted vs. sample number. Samples showing a much larger
leverage than the rest of the samples may be outliers and may have had a strong influence
on the model, which should be avoided.
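Leverage has a simple algebraic definition: it is the diagonal of the "hat" matrix H = X (X'X)^-1 X'. The NumPy sketch below (illustrative, not The Unscrambler®'s implementation, with made-up data) shows how one extreme sample stands out:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 3))
X[0] = [10.0, 10.0, 10.0]            # one sample far from all the others

# Leverage: diagonal of the hat matrix H = X (X'X)^-1 X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

print(int(np.argmax(leverage)))      # index of the sample with the largest leverage
```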
For calibration samples, it is also natural to use an influence plot. This is a plot of squared
residuals (either X or Y) vs. leverages. Samples with both large residuals and large leverage
can then be detected. These are the samples with the strongest influence on the model, and
may disturb (influence) the model towards themselves.
The features of the two plots can be combined by plotting influence and Y-residuals vs. predicted Y together. Some example plots are shown below:
Scores plot showing a gross outlier
All of these plots can be helpful in detecting outliers, or possible errors in the data.
Note: It is advisable to aim for a boxcar distribution of Y-values, as this provides the
most even coverage of the region of interest.
Preprocess (transform the data)
Tasks - Transform… allows for spectroscopic transformations, derivatives, smoothing, etc. Tasks - Transform - Reduce (Average) may also be useful when replicates have been measured, or variable reduction is required. The Preview Result option in the transform dialog provides a graphical preview of spectral data as transform parameters are changed; these changes are presented to the user in real time.
Statistics
Tasks - Analyze - Descriptive Statistics… may be used to reveal scatter effects and to visually detect large changes in specific wavelength regions. Use the Scatter option to reveal potential scatter effects before the application of transforms such as Multiplicative Scatter Correction (MSC).
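The idea behind MSC can be sketched in a few lines. This is an illustrative NumPy sketch using the mean spectrum as the reference (not The Unscrambler®'s implementation; `msc` is a hypothetical helper and the "spectra" are synthetic):

```python
import numpy as np

def msc(spectra):
    """Multiplicative Scatter Correction (illustrative sketch).

    Each spectrum is regressed against the mean spectrum,
    x_i ~ a_i + b_i * x_mean, and corrected as (x_i - a_i) / b_i.
    """
    spectra = np.asarray(spectra, float)
    ref = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        b, a = np.polyfit(ref, x, 1)     # slope and offset vs. the mean spectrum
        corrected[i] = (x - a) / b
    return corrected

# Two "spectra" differing only by offset and scaling collapse onto one curve
base = np.sin(np.linspace(0, 3, 50))
S = np.vstack([2.0 * base + 1.0, 0.5 * base - 0.3])
print(np.allclose(msc(S)[0], msc(S)[1]))  # True
```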
Select samples
The Edit - Mark option is useful for selecting a more balanced data set from a large
data set from PCA, PCR or PLSR scores. This can be applied to either the spectra or
the constituents (if more than one component is being analyzed). Mark samples that
span all the important components (samples far away from the origin, including the
extremes when selecting calibration samples). Use the Create Range option to
extract marked samples as a new row set in the project navigator.
Reduce spectra
Use the Tasks - Transform - Reduce (Average)… options to reduce spectra with high data point densities (being careful not to lose resolution) to fewer data points, or to average out replicate spectra in a data set.
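Both reductions amount to block averaging, which can be sketched with NumPy reshapes (an illustrative sketch with made-up data, not The Unscrambler®'s implementation):

```python
import numpy as np

rng = np.random.default_rng(5)
spectra = rng.normal(size=(12, 600))   # 12 spectra = 4 samples x 3 replicates, 600 points

# Average out replicates: 3 consecutive rows per sample -> 4 averaged spectra
averaged = spectra.reshape(4, 3, 600).mean(axis=1)

# Reduce data points: average each block of 4 adjacent points -> 150 points
reduced = averaged.reshape(4, 150, 4).mean(axis=2)
print(averaged.shape, reduced.shape)   # (4, 600) (4, 150)
```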
investigating interesting patterns in the data. View the loadings as line plots and see
if the variables of importance coincide with the spectral regions related to the
property being measured.
Delete variables (wavelengths).
From the Important variables plot the Edit - Mark option can be used to define
ranges in the spectra that are not important (potentially due to noise). Use the
Recalculate - Without Marked option to generate a new model based on fewer
wavelengths. Apply the Uncertainty test during PLS regression to aid in the
identification of important variables for modeling.
Validation
It is essential to ensure that a developed model is properly validated using a suitable
validation method (cross validation or test set validation). Cross validation can be set
up to look at the effect of removing an entire set of replicates from an analysis or
single replicates can be removed to test the predictive ability of the model for single
replicates.
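Cross validation with whole replicate sets removed can be sketched as follows. This is an illustrative Python sketch of the segmentation logic only (`leave_replicate_set_out` is a hypothetical helper, not an Unscrambler® function):

```python
import numpy as np

def leave_replicate_set_out(groups):
    """Yield (train_idx, test_idx) pairs, removing one entire set of
    replicates (one group) per cross-validation segment."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        test = np.where(groups == g)[0]
        train = np.where(groups != g)[0]
        yield train, test

# 6 measurements = 3 samples x 2 replicates
groups = [0, 0, 1, 1, 2, 2]
for train, test in leave_replicate_set_out(groups):
    print(list(test))   # each segment holds out both replicates of one sample
```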
Instrument compatibility
Some instrument vendors (for example Perten, Brimrose, Guided Wave, Foss NIRSystems,
Thermo, etc.) make use of The Unscrambler® Online Predictor/ Classifier software available
for integration of The Unscrambler® models into third party systems. These packages are
DLL-based programs that are incorporated into the instrument software, allowing the use of
The Unscrambler® predictive or classification models on the data, providing the model
results to the instrument interface for either graphical or numerical display when a new
(spectral) measurement is made. Visit http://www.camo.com/ for more information on
these applications.
The Unscrambler® X uses the Save Model option to save predictive, or classification models
as separate files from a project. The Unscrambler® Generation X family of online software
uses these model files directly for applications. The Unscrambler® X is backward compatible
for use in previous versions of The Unscrambler® Online Predictor and Classifier (back to
version 9.2). Use the File - Export - Unscrambler option to export model files for use in these previous versions. This option allows users to save data or models for backward compatibility. Contact CAMO for this plug-in option.
Some instrument software can read the B vector (regression coefficients). Use File - Export -
ASCII…, or JCAMP-DX. Use File - Export - ASCII MOD… , which is a simple file format
containing all information necessary to make predictions, either using full PLSR or PCR
models, or just the B vector. It can be used with user-defined conversion routines.
Use The Unscrambler® to develop models for instruments that do not support The
Unscrambler® Online Predictor/Classifier
If an instrument vendor's software does not support models developed in The Unscrambler®, import the instrument data in a common format (e.g. ASCII, Excel, JCAMP) and develop a model using the powerful diagnostic and algorithmic capabilities. Use
this model to select appropriate calibration and validation samples, determine the
optimal PCs/factors to use and match the preprocessing to the options available in
the vendor software. Redevelop the model in the vendors’ software and compare
the two results. This will provide added assurance that the developed model is
robust and performs as required.
The various residuals and error measures are available for each PC in PCR and PLSR, while for MLR there is only one of each type.
There are two types of scores and loadings in PLSR, only one in PCR.
4. Application Framework
4.1. User interface basics
The purpose of this chapter is to give the user an overall introduction to the principles used
in The Unscrambler®. A short overview of The Unscrambler® user interface and workplace is
provided in this section, covering the various menu options, and the data organization
environment:
Menu walk-through:
File
Edit
View
Insert
Plot
Tasks
Tools
Help
Edit
Insert
Data matrix…
Duplicate matrix…
Custom layout…
Tools
Matrix calculator…
Report generator…
Audit trail…
Options…
Help
Modify license…
User setup…
Application window
Workspace
Editor
Viewer
Project navigator
Project information
Page tab bar
The menu bar
The toolbar
The status bar
Dialogs
Setting up the user environment
Getting help
4.2.2 Workspace
The Workspace occupies the largest area of the application window, containing either a
table view of a data set, called the Editor, or a Viewer which displays results either
graphically as plots or numerically as tables.
Editor
The Editor presents a data table that may or may not be modified depending on its
protection status:
If a table can be edited, it is possible to:
Type in values.
Change the column and row headers.
Create ranges.
Plotting raw data from the editor: either for a data matrix or a matrix from a result.
Displaying predefined plots.
Custom layout.
To learn more about working in this mode, please refer to the chapter on plotting data.
options which are valid for the selected area, which will save a user the work of having to
click through all the menus on the Menu bar.
4.2.9 Dialogs
The Unscrambler® aims to aid the user through dialogs that provide detailed instructions to
the application.
When working in The Unscrambler® the user will often have to enter information or make
choices in order to be able to complete an analysis. This includes activities such as specifying
the names of data matrices/files to work with, the data sets to analyze, how many PCs to
compute, or the type of validation methods to choose. This is done in dialogs, which will
normally look something like the one pictured below.
The Unscrambler® dialog
This particular dialog is the one associated with running a Principal Component Analysis on
data. Items that are predefined, such as rows/samples, columns/variables, etc. are selected
from a drop-down list. Options which are mutually exclusive are selected via radio buttons.
The settings for many of the analysis dialogs will be remembered from the last time the
dialog was open.
Any dialog can also be canceled by pressing the Esc (escape) key on the keyboard. Ongoing calculations can also be aborted by pressing Esc.
What is a matrix?
Matrix structure
Samples and variables
Adding data matrices
Manually
Drag and drop from other applications
Altering data tables
Using ranges
Create ranges to organize subsets
Superimposed ranges
Storing data as separate matrices
Data types
Possible data types
Converting data types
Keeping versions of data
Saving data
See insert matrix dialog box for more information on how to create a blank table, fill it with
data and rename it.
Manually
Enter data manually into a matrix by simply typing while an entry is focused, double clicking
on a specific entry, or pressing F2 and entering the value. This operation can be done for the
data table as well as the sample and variable name.
Category entries have a drop-down list, allowing the user to select one of the levels already used. A value can also be typed in, and typing a new value adds a new level.
Date-time entries have a calendar pop-out, allowing the user to pick a date from it.
Drag and drop from other applications
Data can be copied from any application, e.g. Microsoft Excel, to The Unscrambler® by either
drag and drop, or by copy and paste.
Files can also be dragged from the file manager onto The Unscrambler® application window.
The window title bar is a good drop target.
The above case is typical of creating two sets of variables, X (predictors) and Y (responses), and two sets of samples, for calibration and validation.
Storing data as separate matrices
In The Unscrambler® one can use different matrices in the analysis as long as they are
compatible in size and stored in the same project.
Hence one can store data in several matrices that will appear in the project navigator as
illustrated below:
Data type   Alignment
Numerical   Right
Date-time   Left
The file names are given in glob notation: "*" means any number of characters, "?" any single character, and "[ABC]" any one of A, B or C.
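Python's standard fnmatch module implements the same glob conventions, which makes the rules easy to test:

```python
from fnmatch import fnmatch

# "*" matches any number of characters, "?" one character, "[ABC]" any of A, B or C
print(fnmatch("data01.csv", "data??.csv"))   # True
print(fnmatch("runA.txt", "run[ABC].txt"))   # True
print(fnmatch("runD.txt", "run[ABC].txt"))   # False
```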
Matrices
Plots
Results: Each analysis will create a new node containing model or prediction details
Data set
Plot
Rename
Rename the node
Delete
Delete the node. This operation cannot be undone, so use with caution. This action
has to be confirmed in a pop-up dialog in order for the node to be deleted.
Actions for data table nodes
Data table node menu
Transform
Shortcut all the pretreatment available in the Tasks – Transform menu.
Plot
Shortcut to all the plots available in the Plot menu.
Export
Export the data using one of the supported external data formats.
Range
The Range option allows the following actions to be performed:
Define Range allows the definition of row and column ranges and special intervals in a data set. For more information see the Define Range dialog.
Copy Range copies the selected ranges (rows or columns) to another matrix of the same dimensions.
Paste Range pastes copied ranges into the same or another matrix of the same dimensions.
Duplicate Matrix
This will create a new copy of the data matrix in the project
navigator. It is a shortcut to the Insert - Duplicate Matrix
(Insert – Duplicate Matrix…) option.
Spectra
Define a selected columnset to hold spectral data, in order to change the default
view of certain model result plots (e.g. PLS regression coefficients plotted as line in
Regression Overview, or X-loadings plotted as line in PCA Overview).
Save Matrix
Recalculate
Rebuild the model with the following changes
In the dialog, one has the option to save several different types of model files. These smaller model files do not support the plots, and do not include the raw data and some of the validation matrices that are present in the entire model. The prediction (or classification) results that can be computed depend on the type of model that is saved.
Entire model
This saves all the results and supports all visualizations that are available when a
model is developed in The Unscrambler® X. This option also permits recalculation of
the model by keeping out any selected data. This option is available for MLR, PLS,
PCR and PCA models.
Prediction
The prediction result options save the model in smaller files; the model result file
does not include many of the result matrices, such as the validation results and
other matrices used in the prediction visualizations.
Full with support for inlier detection: The model result file does not include the
following matrices: Y scores, Beta coefficients (weighted), Variable leverage, X
Correlation loading, Y correlation loading, Square sums, and Rotation. Three of the
validation matrices are saved in this model format: X total residuals, X value
validation residuals, and Y value validation residuals. This model can be used for
prediction, giving all the results that The Unscrambler® computes on prediction,
including the deviation.
Full: This model results file allows one to predict new values, and get the deviation
with that value, as well as to detect outliers (based on Hotelling’s T2 and Q
residuals). With this model, inliers cannot be computed during the prediction stage.
The Hotelling’s T2 and Q residual limits and X values are computed, but not plotted
during prediction with the Full model. Compared with the entire model, this version
saves 11 of the 20 validation matrices. It does not compute the Inlier limit and the
Sample inlier distance, nor the seven matrices that are saved with the Full (with
inlier detection) prediction result.
Short: In the short model, only the raw beta coefficients are saved, at the optimal
(or user-defined) number of components. No validation matrices are saved. With a
short prediction model, one can get the predicted results for new data, but no other results.
4.7.1 Prediction:
This will be enabled only for regression techniques (MLR, PCR and PLSR). Low and high limits
can be set for the Deviation and Scores matrices, and likewise for each of the Y responses. Only high
limits can be set for Hotelling’s T², Sample Leverage, X Sample Q-Residuals and Validation
Residuals. For Explained X Sample Validation Variance, low limits can be set.
Set Alarm States for output matrix of Prediction
4.7.2 Classification:
Only high limits can be set for X Residuals, Si/S0 and Leverage matrices that will be used for
classifying new samples for models developed from PCA, PCR and PLSR.
Set Alarm States for output matrix of Classification
4.7.3 Projection:
Scores matrix provides the option to set low and high limits. For Hotelling’s T², Sample
Leverage and X Sample Q-Residuals matrices only high limits can be set. For Explained X
Sample Validation Variance, low limits can be set. Projection for new samples is available
only for models developed from PCA, PCR and PLSR.
Set Alarm States for output matrix of Projection
4.7.4 Input:
This feature helps the user understand whether the inputs are from one or different sources.
If the user has already defined the columnset matrices using the Scalar and Vector dialog, those will
be listed for selection. Alternatively, the Define button opens the Scalar and Vector
dialog for defining limits for columnset matrices.
Set Alarms for input matrix
improve. Despite the risks, bias and slope correction has been proven useful in some
industries such as the agricultural sector.
4.9.1 Algorithm
Bias and slope correction is performed on the prediction data Yhat by subtracting the bias and
then dividing by the slope: Yhat_corrected = (Yhat – bias)/slope
The bias and slope estimates in the above equation can be taken directly from a test set
validated Predicted vs. Reference plot, or they can be input manually by the user. Default values
when not explicitly specified are bias=0 and slope=1.
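A minimal sketch of this correction in Python (the function name and the plain-list representation of predictions are illustrative, not part of the software):

```python
def bias_slope_correct(y_hat, bias=0.0, slope=1.0):
    """Apply bias and slope correction: y_corrected = (y_hat - bias) / slope.

    The defaults bias=0 and slope=1 leave the predictions unchanged,
    matching the behavior when no correction factors are specified.
    """
    return [(y - bias) / slope for y in y_hat]

predictions = [10.2, 11.0, 12.4]

# With the default factors, predictions pass through unchanged
assert bias_slope_correct(predictions) == predictions

# With explicit factors, each value is shifted by the bias and rescaled by the slope
corrected = bias_slope_correct(predictions, bias=0.2, slope=1.1)
```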
4.9.3 Usage
In the dialog, the user has the option to check Apply Bias and Slope correction. When
checked, the model will perform bias and slope correction during prediction based on one of the
options selected below.
Re-calculate from Prediction data: When selected, the bias and slope correction
factors will be the offset and slope, respectively, as taken from the ‘Predicted vs.
Reference’ plots for the new prediction data. The underlying assumption is that any
differences in bias and slope between the calibration and prediction data are due to
systematic and repeatable differences between the instruments used to collect the
two data sets. If used indiscriminately this may decrease the actual prediction
performance, and the option should therefore be used with caution. When selected,
reference Y data are mandatory in prediction.
Set or apply default correction factors: With this option default correction factors
based on the calibration model are suggested. For test-set validated models these
are the validation Offset and Slope values of the ‘Predicted vs. Reference’ plot,
under the assumption that the test set data are measured on a different instrument
that is representative also for future predictions. For leverage and cross-validated
models this assumption cannot be met and the default bias and slope are therefore 0
and 1, respectively. The user is free to manually change the default values, in which
case a message will be displayed that the values have been manually edited. A Reset
button will revert the bias and slope correction factors back to the default values.
4.10. Login
Two modes of operation are available in The Unscrambler®.
The choice of installation procedure and internal program setup determines what level of
login is required by a user. This is described further in the following sections.
The Guest login requires no password or definition of a user group domain, so by clicking on
Login a user is entered into the program.
In Non-Compliance mode, a user name and login password can be set up from the Help –
User Setup menu.
If a user name and password have been set up, when a user attempts to log in to the
program, a dialog similar to the one shown below is provided:
Login with defined User Name and Password, Non-Compliance mode
In this case a user called User 1 was set up. This time, a password is required to enter the
software. If a user forgets their password, the Forgot? option should be selected. This is
described further in the next section.
Password reminders
It is possible to click Forgot? next to the password entry for a password reminder question
that is configured during user setup.
Password recovery dialog
In this dialog, a user is required to enter the correct answer to the security question and is
then required to enter a new password (with confirmation).
If the wrong answer to the question is entered, the following warning will be provided,
Set up compliance mode with Login dialog shown each time the program is started
Set up compliance mode with a hidden Login dialog
The user’s Windows name is shown in the login screen. To enter the program, the user must
enter their Windows password.
Automatic entry
When the program is installed in Compliance mode but the Hide login screen option is
chosen, a user starting The Unscrambler® is automatically logged into the program,
and the Windows authentication details are used in the Audit Trail.
This authentication method takes advantage of centralized user management features used
in regulated network configurations, instead of redefining the user names.
For more information on how The Unscrambler® security features help a company to comply
with the requirements of 21 CFR Part 11, please have a look at the Statement of compliance.
4.11. File
4.11.1 File menu
File – New
or Ctrl+N
This option is used to create a new project.
A new, blank workspace is created with a single node entry in the project navigator named
“New Project”.
See organizing data to get started adding data to a project.
File – Open…
or Ctrl+O
This option opens an existing project, using a regular file selector dialog.
File – Close
or Ctrl+W
This option closes the current project file. If changes to the project have not been saved, The
Unscrambler® prompts the user to save the project before closing it.
File – Import
This option allows the import of data from an external data file. This may be data from
another project file, an earlier version of The Unscrambler® or one with a different format,
e.g. Excel, ASCII, or data files from instrument formats.
For more information see the importing data documentation.
File – Save
or Ctrl+S
Saves the currently open project file.
File – Export
This is a menu option which allows one to export all or selected parts of a data matrix to an
external file, in one of the available export formats.
For more information see the exporting data documentation.
File – Print…
or Ctrl+P
This will open the Print dialog, where the user selects settings to print the current document
to a printer or file.
For more information see the print dialog documentation.
File – Security
The Security function contains two options, Protect and Sign.
Protect
This command enables a user to protect a project with a password. Whenever this project is
accessed, the user will need to provide the password to open it. A project file can also be
Unprotected by using the command File-Unprotect, and entering the correct password.
Note: The password must be remembered! If it is lost, the project cannot be opened again.
Sign
For a more detailed description of how The Unscrambler® implements digital signatures,
see the Digital Signatures documentation.
The Security feature is part of the overall data integrity and compliance capabilities of the
software, which also includes Windows Authentication and Audit Trails.
For more details on how The Unscrambler® meets the requirements of digital and electronic
signatures, please refer to the section on Data Integrity and Compliance
File – Recent
The list of recently opened projects is displayed. Selecting an entry opens that project.
File – Exit
This allows one to quit The Unscrambler®. If any project files have been changed since the
project was last saved, there is a prompt asking if changes are to be saved.
Plots are scaled to fit within the margins set for the designated paper size and will retain the
same aspect ratio as is seen on the screen.
Data tables will normally print with 50 rows and 6 columns per page, depending on the
numeric format and font settings. Row and variable names and numbers will be included on
each page.
Print options from The Unscrambler® work as in any Windows application, where the user
selects printer, paper size, orientation, margins, etc.:
Print preview
It is a good idea to preview a document before sending it to the printer. Print preview
shows how the pages will look when printed. The option is only
available if a file is currently open.
4.12. Edit
4.12.1 Edit menu
The Edit menu has three different modes, and the displayed options depend on which part
of the application window is active at any given time. There are separate modes for the
workspace editor and viewer as well as for the project navigator. Some menu items are
common for two or three modes.
Common actions
Edit – Undo
Edit – Redo
Edit – Cut
Edit – Copy
Edit – Paste
Edit – Delete
Navigator mode
Edit – Rename
Edit – Spectra
Editor mode
Edit – Copy with Headers
Edit - Insert Copied Cells
Edit - Append Copied Cells
Edit - Reverse
Edit - Convert
Edit - Fill
Edit – Find and Replace
Edit – Go To…
Edit – Select
Edit – Sort
Edit – Append
Row(s)/Column(s)…
Category Variable…
Edit – Insert
Row(s)/Column(s)…
Category Variable…
Edit – Split Text/Category Variable
Edit – Change Data Type
Edit – Scalar and Vector
Edit – Define Range…
Edit – Group rows…
Edit – Make header
Edit – Add Header
Edit - Category Property
Viewer mode
Edit - Add Data
Edit - Create Range
Edit - Sample Grouping
Edit - Copy all
Edit – Draw
Edit – Mark
The workspace editor Edit menu mode is activated by clicking anywhere in a data table.
The workspace editor Edit menu
The workspace viewer Edit menu mode is activated by clicking in a plot. The same menu will
be shown irrespective of whether it is a raw data plot or a model results plot, however some
menu items will be grayed out when not applicable to specific plots.
The workspace viewer Edit menu
Common actions
Edit – Undo
or Ctrl+Z
This option reverses the last operation(s) performed on the data in the editor. This can be
used to Undo up to the last 10 operations. The size of the undo stack can be increased, see
Tools – Options… menu.
The following operations can be reversed with the undo operation:
Edit – Redo
or Ctrl+Y
It is possible to recover the results of an editing operation(s) that has just been undone with
the help of the Redo command.
A selection can be recovered from the clipboard using the Paste command or Ctrl+V.
Edit – Cut
or Ctrl+X
This option removes the selected range, either data in the Editor or a plot in the Viewer, and
places it on the clipboard. Anything placed on the clipboard remains there until it is replaced
with a new item. Use the Paste command to copy the selection to a new location.
Edit – Copy
or Ctrl+C
With this option one can copy the selected range to the clipboard, overwriting its previous
contents. The selected range is not removed from its original place. Use the Paste command
to copy the selection to a new location.
Edit – Paste
or Ctrl+V
This command allows one to insert a copy of the clipboard contents at the insertion point. The
command is not available if the clipboard is empty or the selected range cannot be replaced.
Edit – Delete
, Ctrl+D or Del
This option enables one to delete columns or rows. One can select one or more
columns/variables or rows/samples, and delete the selected section(s).
Any previously-defined sets are adjusted for the deleted range.
Navigator mode
Edit – Rename
Rename the currently selected matrix.
Edit – Spectra
Ranges can be defined as being spectra, and once this setting is ticked for a given range,
loadings plots for these data ranges will display as line plots rather than 2D scatter plots.
Editor mode
Edit – Copy with Headers
or Ctrl+Shift+C
With this option one can copy the selected range to the clipboard, overwriting its previous
contents. The selected range is not removed from its original place. Use the Paste command
to copy the selection to a new location.
Edit - Reverse
With this option one can reverse the sample order and/or variable order in a selected
matrix. For more information see the reverse documentation.
Edit - Convert
This command allows one to convert the units of the column headers for spectral data from
wavelength in nanometers (nm) to wavenumber (cm-1) and vice versa. This function is
active when the column header of a matrix is selected.
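The unit conversion itself follows the standard relation wavenumber [cm⁻¹] = 10⁷ / wavelength [nm]; the sketch below illustrates that relation and is not code from the software:

```python
def nm_to_wavenumber(nm):
    """Convert wavelength in nanometers to wavenumber in cm^-1.

    Since 1 cm = 1e7 nm, wavenumber = 1e7 / wavelength.
    The identical formula also converts wavenumber back to wavelength.
    """
    return 1e7 / nm

# 1000 nm corresponds to 10000 cm^-1
assert nm_to_wavenumber(1000) == 10000

# Applying the formula twice returns the original value
assert nm_to_wavenumber(nm_to_wavenumber(2500)) == 2500
```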
Edit - Fill
This command allows a user to fill a highlighted row or column range with either numeric or
categorical data.
For more details see the Fill section.
Edit – Go To…
Allows the user to move focus to a specific entry in the data table.
For more information see the go to dialog documentation.
Edit – Select
Edit – Select has the following options:
Select Rows
To select the respective sample(s).
Select Columns
To select the respective variable(s).
Select Range
To select a range of samples and variables.
Select All (Ctrl+A)
To select the entire matrix.
In the first three cases, the user is asked to enter a range to select. It uses the same syntax as
the Define range dialog, e.g. 1,3-5,8-20.
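The range syntax can be sketched with a small parser (illustrative only; the software's internal parsing is not documented here):

```python
def parse_range(spec):
    """Expand a range specification like '1,3-5,8-20' into a list of indices."""
    indices = []
    for part in spec.split(","):
        if "-" in part:
            # A dash denotes an inclusive span, e.g. "3-5" -> 3, 4, 5
            lo, hi = part.split("-")
            indices.extend(range(int(lo), int(hi) + 1))
        else:
            # A bare number denotes a single index
            indices.append(int(part))
    return indices

assert parse_range("1,3-5") == [1, 3, 4, 5]
```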
Note: The Unscrambler® always works with either rows or columns. This also
applies when the whole matrix is selected. Look at the cursor shape or the
row/column numbers to see whether the selection is in row or column mode.
Sample names will also be selected when operating on rows, and column headers
when operating on columns.
Edit – Sort
Sort samples according to their numerical values for the selected variable.
Sort has two options: Ascending and Descending.
Select one or more columns to sort. Headers can also be selected and used as sort keys.
This method uses the quick sort algorithm, which performs an unstable sort; that is,
if two elements are equal, their order might not be preserved. In contrast, a stable
sort preserves the order of elements that are equal.
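The difference can be illustrated in Python, whose built-in sort is stable (in contrast to the quick sort used here):

```python
# Rows as (sample_name, value); two rows share the value 5.0
rows = [("S1", 5.0), ("S2", 3.0), ("S3", 5.0)]

# Python's sorted() is stable: equal elements keep their original order,
# so S1 still precedes S3 after sorting by value.
stable = sorted(rows, key=lambda r: r[1])
assert stable == [("S2", 3.0), ("S1", 5.0), ("S3", 5.0)]

# An unstable sort such as quick sort gives no such guarantee:
# S3 could legitimately end up before S1, since both have value 5.0.
```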
Edit – Append
Row(s)/Column(s)…
This option can be used to append rows or columns, depending which entries are selected in
the data table.
A dialog is displayed allowing the user to enter the number of rows (or columns) that are to be
appended at the end of the existing data matrix.
See Edit – Insert – Row(s)/Column(s)… below for details.
Category Variable…
Append a new category variable (column).
Details on how to specify a category variable can be found here.
Edit – Insert
Row(s)/Column(s)…
Insert new rows or columns.
Select a row or a column to insert either one or more rows or columns, respectively.
A dialog will pop-up to ask how many rows or columns to insert:
Text
Numeric
Date-time
Category
Edit – Define Range…
or Ctrl+E
Create and edit ranges for easy access to often-used selections.
For more information see the define range dialog documentation.
Edit – Category Property
This option allows one to change the properties of category variables; more details
can be found in the Property dialog documentation.
Viewer mode
Edit – Draw
This option allows a user to add a drawing object to the plot. It is possible to draw with five
different types of objects: line, arrow, rectangle, ellipse or text. This option can also be
accessed by right clicking on a plot and selecting Insert Draw Item.
For more information see the plot annotation documentation.
Edit – Mark
Mark objects (samples or variables) to bring focus to them in plots and interpretation. There
are options for automatic sample or variable selection based on modeled data, or for
manual marking using the one by one, rectangle or lasso tools.
The submenu for marking objects
Right click
Select a variable. Right click. Select the menu Change Data Type – Category….
Right click access to the Category Converter
The preselected variable is shown in the field Select Variable. If a different variable is
to be used, select it using the drop-down list.
The field Value based on selected Variable gives information on the selected variables such
as:
This information is displayed to guide the selection of the number of levels and the
definition of the intermediate ranges.
Select the number of levels using the associated box.
Choose the method used to define the ranges from the two following options:
Divide total range of variation into intervals of equal width
If this option is selected, the ranges will be automatically defined when changing
the number of levels.
Specify each range manually
Double-click on the entry to define the ranges.
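The equal-width option above can be sketched as follows (an illustrative helper, not the software's own code):

```python
def equal_width_levels(vmin, vmax, n_levels):
    """Split the total range [vmin, vmax] into n_levels intervals of equal width."""
    width = (vmax - vmin) / n_levels
    return [(vmin + i * width, vmin + (i + 1) * width) for i in range(n_levels)]

# A variable ranging from 0 to 9 split into 3 levels
assert equal_width_levels(0, 9, 3) == [(0.0, 3.0), (3.0, 6.0), (6.0, 9.0)]
```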
This is also available as a right click option. Highlight a column and right click; the following
options will be displayed:
To fill a column/row with a specified value, either highlight the entire row/column or select a
sub-section using the mouse and select Edit - Fill. Enter the specified value (or text) in the
Value box and click on OK. The selected region will be filled with this value.
Note: A block of rows and columns can also be selected using this option.
To fill rows/columns with a category variable, first define the categories using Edit - Change
Data Type - Category. Then select specified cells and use the Edit - Fill option, this time
selecting the desired category from the Level drop-down list. Click on OK and the cells will be
filled with this new category.
The Fill option is also available as a right click option from the Editor.
Find option
By selecting the Options button, one is presented with Find Option choices that
enable one to match case, replace entire entry contents with the specified search criteria,
and search in indicated directions in the data matrix.
Select search type Numeric, Text or Date time from the Search mode drop-down
list.
Type a word, a number, or a date to search for in the Find what field.
Or tick Range to search within numeric or date limits. This option works only for
Numeric and Date time variables.
For replacing category values, select the variable and use the Find and Replace
option.
Text mode will match category variables. A category level labeled "200" is still
a text string. It is recommended to use words to label category levels, both to avoid
confusion and to give each level meaning, such as "High" or "Low".
Click the Find Next button to locate a cell with the chosen value or sequence of characters.
If the search is successful, the entry is marked in the editor with a black frame (or a white
frame if the search is occurring in a selected area). If no match is found, the cursor does not
move from its original place.
Result after:
This function allows one to quickly move to specific entries in a data matrix.
or Ctrl+E
Ranges define specific parts of the data table on which to perform analyses. When a set of
columns is defined, this is called a Column range and usually defines a specific set of
variables. These variable sets may define a single independent (X-data) range for methods
like PCA, or two sets, such as the X-data and the dependent Y-data, for methods such as PLSR.
When a set of rows (or samples) is defined, this is known as a Row range and these are
useful when defining training and validation sets for any analysis method in The
Unscrambler®.
Combinations of row and column sets together define specific data regions to be used for
analysis purposes and the preparation of data can be performed using the Define Range
option.
Get information on:
If the case arises that a new range has to be defined during an analysis setup, most of the
plotting and analysis dialogs in The Unscrambler® have the Define button available. An
example from the PCR dialog is shown below.
Define buttons in the PCR dialog
Dialog Usage
Functions
The dialog box contains the following functions for easily defining sets within a selected data
table.
Row and Column Ranges
This section provides two lists of the row and column sets available in a
table. To add a new row/column set, either interactively select the sets using the
data viewer with a mouse, or manually enter specific ranges into the text dialog
boxes. For example, if a new row set is to be defined called training, and it is to
cover rows 1-10 of the current table, the dialog for Row ranges should be set up as
follows,
To add the new row set to the list, click on the Create button. Use a similar procedure for
defining new column sets.
Use this option to define evenly spaced calibration (or validation) samples and use the Invert
function described above to easily define such sets.
Random
Insert random row or column indices by choosing “Samples” or “Variables” from the
drop-down list and entering the number to define in the manual entry box.
Category
Insert row indices based on a category variable. Select the category variable in the
drop-down list.
When the appropriate ranges have been selected click OK to apply the changes.
Create range from data editor
Ranges can be created directly within the data set editor: Begin by selecting the part of the
table that will be included in the range and right click to select the option Create Range,
Create Row Range or Create Column Range as appropriate.
Create Row Range
When working with data selectors that have keep-out samples/variables, a warning will be
displayed allowing the user either to accept and proceed with the keep-outs or to cancel the
action. The Details option will display the list of keep-outs.
To keep track of row and column exclusions, the data selectors provide a warning to users
that exclusions have been defined. Click on the More details link to see what has been
excluded.
More details
Automatic keep-outs can only be removed manually. This means that in cases where a
category variable has been converted to a numeric column, or missing entries have been
filled in, the keep-out lists must be edited to include the given entries in further analyses.
Then access the option Group Rows from the menu Edit. A dialog box will open.
Add row ranges on a category variable
When the variable selected is a category variable, all levels will be used to define new
ranges. Therefore the Number of groups field is disabled.
Add row ranges dialog from category variable
When clicking OK, new row ranges are defined, named after the levels.
When clicking OK, new row ranges are defined being named range1, range2, etc.
When clicking on the menu Edit – Sample grouping…, the dialog box Sample grouping &
marking opens.
Select the matrix to use for sample grouping in the Data frame. All available row sets will
appear in the dialog. They can be selected and moved to Marker settings by using the
arrows. The sample grouping will be based on the groups added to this box. Clear the
available row sets using the Clear button.
Alternatively the user can select a single column from the matrix to use for sample grouping.
If the selected column is a category variable, click Create Row Sets in order to make each
category level available for grouping. If the selected column is of numeric data type, Create
Row Sets will split the samples into a number of equally spaced ranges defined by the
Number of groups box. When created in this dialog, the ranges are created temporarily for
marking the samples. These ranges are not added to the data table in the project navigator.
To delete a selected group from Marker settings, mark the group and use the Remove
button. Alternatively use the Clear All button to remove all defined groups.
The user has the option to separate samples based on colors, symbols or both, and the
group name can optionally be used as point labels. Use the Apply button to preview the plot
settings, or click OK to apply the settings and close the dialog.
The user also has the option to label the samples by pre-defined values that may be
available in a particular column of a data sheet. The appropriate matrix and the
corresponding column need to be selected using the Data for labeling matrix. This will be
enabled only when Value is selected from the Label option.
Character position:
This feature splits text variables into new variables based only on the position of the
characters. The first split value indicates the character position at which to split, and
likewise for the second split. The default value for the first split is 0 and for the second split is 6.
Split by character position
Output options:
The following output options are available.
If the user wants to retain one or a few of the new variables after the split, the
range of columns can be defined numerically in ‘Insert Columns’ using commas and
dashes. The selection can also be set using the mouse in the preview window.
The output variables can either be converted to category type using the option
‘Convert to category’ or appended as text to the existing row
headers using the option ‘Add headers’.
4.13. View
4.13.1 View menu
The View menu has two different modes, and the displayed options depend on which part
of the application window is active at any given time. There are separate modes for the
workspace editor and viewer.
Editor mode
View – Navigator
View – Info
View – Level Indices
Viewer mode
View – Graphical
View – Numerical
View – Auto Scale
View – Frame Scale
View – Zoom In
View – Zoom Out
View – Legend
View – Properties
View – Full Screen
Context dependent plot indicator lines
View – Trend Lines – Target Line
View – Trend Lines – Regression Line
View – Uncertainty Limit
The workspace editor View menu mode is activated by clicking anywhere in a data table.
The workspace editor View menu
The workspace viewer View menu mode is activated by clicking in a plot. The same menu
will be shown irrespective of whether it is a raw data plot or a model results plot, however
some menu items will be grayed out when not applicable to specific plots.
The workspace viewer View menu
Editor mode
View – Navigator
Toggle project navigator pane on/off.
View – Info
Toggle information pane on/off.
View – Graphical
This lets the user view the selected data of a Viewer in a graphical mode. This is the default
view for The Unscrambler®.
View – Numerical
Through this option a user may display results plotted in a Viewer as a numerical table. One
can copy that data table to the Clipboard and paste it into an Editor.
Restore the plot using View – Graphical
View – Auto Scale
This option scales the plot so that all data points are shown within the Viewer window. This
command is useful after using Add Plot and Scaling.
View – Frame Scale
This option scales the plot in a selected frame. One can change the plot by scaling its axes to
fit the desired range. Select the desired area to zoom in a frame.
Use Auto Scale to display the plot as it was originally.
View – Zoom In
This option changes the plot scaling upwards in discrete steps, allowing one to view a
smaller part of the original plot at a larger scale. This can also be done by using the + key on
the graph.
View – Zoom Out
This option scales the plot down by zooming out on the middle of the plot, so that more of
the plot becomes evident, but at a smaller scale. This can also be done by using the - key on
the graph.
View – Legend
View – Properties
This opens a dialog where a user can customize a plot. Here one can change plot
appearance, such as grid, axes, titles, fonts and colors.
See the formatting of plots documentation.
View – Full Screen
Make the plot fill the whole screen. Press Esc on the keyboard or right click to leave the full
screen mode.
A regression line is drawn between the data points of a 2-D scatter plot, using the least
squares algorithm.
Available for Predicted vs. reference plots.
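The least squares fit behind such a regression line can be sketched as follows (an illustrative helper, not the software's implementation):

```python
def least_squares_line(xs, ys):
    """Ordinary least squares fit y = offset + slope * x for a 2-D scatter plot."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    # The fitted line passes through the point of means
    offset = mean_y - slope * mean_x
    return offset, slope

# Points lying exactly on y = 1 + 2x are recovered exactly
offset, slope = least_squares_line([0, 1, 2], [1, 3, 5])
assert abs(offset - 1) < 1e-12 and abs(slope - 2) < 1e-12
```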
4.14. Insert
4.14.1 Insert menu
Use the Insert menu to add items to the project navigator.
A window will open to enable a specific selection of the matrix and ranges to
duplicate.
Duplicate matrix dialog
When hitting the OK button, a second data set will be created, bearing the same name with
a replication number in parentheses, for example “(1)” for the first replication.
The structure of the table (row and column ranges) will be maintained.
Duplicated matrix
Blank
Unit matrix (diagonal 1 rest 0)
Random values (0-1)
Random values (Gaussian)
Constant
Serial numbered rows
Serial numbered columns
Serial rows with shift
If Constant is chosen, this value should then be entered in the Constant value field.
The Include Headers option will automatically display the default header names for Rows
and Columns in the data matrix.
After clicking on OK, a matrix will be created with the default name “Data Matrix”. It
contains no values if Initial values were set to Blank, otherwise the designated values are in
the entries. Data can be entered into the empty cells.
Fill a data table
Data may be entered into a blank data table in several ways.
Manually
Data can be entered manually by double clicking on the specific cell and entering the
value. This operation can be done for the data table as well as the sample and
variable name.
Copying data from a spreadsheet (Excel)
Data can be copied from Excel to The Unscrambler® by either drag and drop, or by
copying and pasting it. To drag and drop the data from Excel, it must be selected in
Excel and then dragged into the specific entry or to the beginning (top left corner) of
the area where the data are to be added. The same can be done for the sample and
variable names. Data can also be entered from Excel by using the copy and paste
functions.
Rename
The default name of the data table is “Data Matrix”, but this can be renamed with a more
descriptive name. Rename the data matrix by right clicking on the data matrix icon in the
project navigator and selecting the option Rename.
When this is done, the name will be updated in the project navigator as well as in the
visualization window and navigation bar.
Other functions are also available from this right click menu.
Other approaches to adding data matrices
There are two other options to generate a data table in The Unscrambler®:
Importing data
Create a design table
To access this option select the menu Insert – Custom Layout… and select the desired
layout:
Four viewers,
Two Horizontal…,
Two Vertical….
This menu gives access to a dialog box divided into four parts corresponding to the four frames of the visualization window, all containing the same options:
Custom Layout Dialog
Choose Matrix
This button is used to select the data set and variables to be plotted. By clicking on
Matrix it is possible to select a data matrix from the navigator. Adjust the Rows and
Cols to display only what is appropriate.
Choose Matrix dialogue box
To select a matrix that was generated during an analysis, click the Select result matrix button. The following dialog box will appear. From here it is possible to select any matrix.
Choose Matrix - Analysis dialogue box
Type
This drop-down list presents the plot options:
Type drop-down list
Title
Type in the title to be displayed on the specific plot.
Once all the necessary plots have been defined, click the OK button to display the selected plots.
It is always possible to abort this action by clicking the Cancel button.
Once the plots are displayed they can be edited using the Properties menu, accessible by right-clicking on the plot or from the menu shortcut.
Further information is available for the following options:
Format a plot,
Annotate a plot,
Zoom and re-scale a plot,
Save and copy a plot.
Filter settings:
The Filter settings tab provides options for primary and secondary filter settings.
Filtering can be done based on the models available in the project navigator; the compatible models are PCA, PCR, PLSR and SCA. Models with auto-pretreatments can also be defined by clicking the pretreatment button. Only full models are acceptable.
Data Compiler - Filter Setting
Upon selection of the model, the available filter type can be selected. For PLS, PCR and PCA
the available filter matrices are
The Limit settings are active for the following filter types:
For additional filtering, ‘Include Secondary Filter’ must be selected; it offers the same features as the primary filter.
Output options:
The following output options are available.
Data Compiler - Output Options
Add Statistics: When selected, the tested model statistics from the filtered model (based on the primary and secondary filters) are added as new column(s) to the original data table.
Add status: When selected, the status results from the filter model are added as new category column(s) to the original data. The Influence filter type has four status levels: Good, Extreme, Suspect and Outlier. For all other filter types, the status levels are Good and Outlier. Additionally, users have the option to add the Good and Rejected row ranges to the existing matrix.
Add ranges for Good and Rejected: When checked (default), two row ranges, ‘Good’ and ‘Rejected’, are added to the original (existing) data table. The ‘Good’ and ‘Rejected’ status is defined by the output from both filters as well as the minimum number of replicates. Any sample that has status Good in either the primary or secondary filter, and that exceeds the minimum number of replicates, will be interpreted as Good. All others will be tagged as Rejected.
Add mean matrix: When checked, the average of all non-rejected observations is calculated and returned for each sample. Users also have the additional option to add the standard deviation for each sample. The average and standard deviation are calculated only if the number of non-rejected replicates exceeds the minimum number entered in the Input data tab.
Add median matrix: When checked, the median of all non-rejected observations is calculated and returned for each sample. Users also have the additional option to add the range for each sample. The median and range are calculated only if the number of non-rejected replicates exceeds the minimum number entered in the Input data tab.
Include column with number of replicates: When checked, the first column in output
matrices will be the number of replicates used for calculating the summary statistics.
4.15. Plot
4.15.1 Plot menu
The Plot menu has different modes: one is available from the matrix editor, and for each analysis it gives a list of plots related to that analysis.
The plot interpretations chapter provides more detailed information for generic plots.
Editor mode
Plot – Line
The Line plot displays one or more data vectors. When plotting from the Editor, mark the
row(s) or variable(s) (Columns) to be plotted; one sample/variable gives a one-dimensional
plot; specifying a range adds several line plots.
One can define ranges or create ranges for samples as well as variables from the edit menu
Edit - Define Range, see using define range.
For more information see the line plot documentation.
Plot – Bar
The Bar plot displays data vectors as bars.
For more information see the bar plot documentation.
Plot – Scatter
The Scatter plot shows two data vectors plotted against each other.
When plotting from the Editor, select the two rows or variables (columns) to be plotted
before using the Plot command.
For more information see the scatter plot documentation.
Plot – Matrix
In this plot, a two-dimensional matrix is visualized. The plot is useful to get an overview of
the data before starting any analyses, as obvious errors in the data and outliers may be seen
at once. One may also want to take a look at this plot before deciding whether to scale or
transform the data for analysis.
For more information see the matrix plot documentation.
Plot – Histogram
This plot displays the distribution of the data points in a data vector, as well as the normal
distribution curve. A histogram gives useful information for exploring raw data. The height of
each bar in the histogram shows the number of elements within the value limits of the bar.
For more information see the histograms documentation.
4.16. Tasks
4.16.1 Tasks menu
This menu is divided into three main groups of actions: Transform, Analyze and Predict.
Tasks – Transform
The Tasks – Transform options allow one to transform samples or variables to obtain data properties that are more suitable for analysis and easier to interpret. Bilinear models, e.g. PCA and PLS, basically assume linear data. The transformations should therefore result in a more symmetric distribution of the data and more linear behavior, if there are nonlinearities.
The Unscrambler® offers many spectral pretreatments like derivatives, smoothing,
normalization, and standard transformations. All these can be found under Tasks –
Transform.
There is also a Compute_General function to transform data using basic elementary and
trigonometric mathematical expressions, and the matrix calculator, which has options for
linear algebra, matrix operations and reshaping of data.
For more information and a list of available transformations, see documentation for each
transformation
Tasks – Analyze
The Tasks – Analyze option provides multivariate analysis options consisting of:
Univariate statistics:
L-PLSR,
Linear Discriminant Analysis (LDA),
Support Vector Machine (SVM) classification, and
Analyze design matrices
Tasks – Predict
The Tasks – Predict options provide means of applying a model to new samples for prediction, projection or classification.
Projection
Project new samples to determine similarity with samples in a PCA, PCR or PLSR
model.
Regression
Predict unknown samples from regression models.
Prediction
SVM Prediction
Classification
Classification of unknowns by applying SIMCA, LDA, or SVM models.
SIMCA classification
LDA classification
SVM classification
4.17. Tools
4.17.1 Tools menu
or Ctrl + Shift + M
Open an existing experimental design for modifications.
See the modify design dialog documentation.
or Ctrl + M
The Matrix calculator is used to perform simple linear algebra functions like matrix
multiplication, addition, division, inverse etc. and to reshape, append or combine two
matrices.
See the matrix calculator dialog documentation.
Tools – Report…
or Ctrl + R
A tool to create reports as PDF documents with plots and data.
See the report generator dialog documentation.
This command displays the audit trail for the active project. The audit trail is a log of actions
by a user, showing a date and time stamp for the actions.
See the audit trail dialog documentation.
Tools – Options…
This dialog can be used to change the appearance of the data editor or viewer, as well as
other options in The Unscrambler®. Default numeric formats and plot settings can be
defined here.
See the options dialog documentation for details.
Date
Time Zone
Time
User name
Action.
The types of actions that are tracked in the audit trail include:
Creation of the project
Import of data
Transformation: compute functions, smoothing, MSC, derivative, etc.
Formatting: sorting, delete
Analysis: statistics, PCA, regression, prediction, etc., with detailed model settings
Audit trail dialog
In Non-Compliance mode, the audit trail can be emptied by selecting the Empty button in
the dialog.
The audit trail can be disabled from the Tools - Options under the General tab.
When in Compliance Mode, the Audit Trail cannot be emptied. It can only be saved as a non-editable PDF document for further printing, if desired.
The Audit Trail for Compliance Mode is shown below. Also, in Tools - Options the Audit Trail
cannot be disabled in Compliance Mode.
Audit Trail in Compliance Mode
The calculator tool should be used only with matrices that are purely numeric. Columns containing missing values are left out of the calculations, as are text and category columns. For the remaining matrix contents, compatibility depends on the feasibility of the matrix operations.
See also the Compute_General transform that can do calculations on samples and variables
using basic mathematical expressions.
Matrix calculator dialog
unique for all matrices whose entries are real or complex numbers and can be calculated
using the singular value decomposition.
QR decomposition
QR decomposition (also called QR factorization) of a matrix allows for the solution of linear systems of equations.
It is a decomposition of the matrix into an orthogonal matrix (Q) and a right (upper) triangular matrix (R). QR decomposition is the basis for a particular eigenvalue algorithm, the QR algorithm.
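How QR factorization solves a linear system can be sketched in NumPy (an illustration only, not The Unscrambler®’s own implementation): since A = QR and Q is orthogonal, Ax = b becomes Rx = Qᵀb, a triangular system solved by back-substitution.

```python
import numpy as np

# Solve A x = b via QR factorization.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

Q, R = np.linalg.qr(A)            # A = Q R, Q orthogonal, R upper triangular
x = np.linalg.solve(R, Q.T @ b)   # solve the triangular system R x = Q' b
```

Here x comes out as (0.8, 1.4), which satisfies both equations.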
Element-by-element operations
Array arithmetic operations that are carried out element by element on one matrix.
X’X
Outer product of the matrix with itself
1./X
Reciprocal of the individual matrix elements
X.*X
Square of the elements of X (element-by-element product)
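As a minimal illustration (a NumPy sketch, not The Unscrambler®’s own engine), the three single-matrix operations correspond to:

```python
import numpy as np

# A small matrix to illustrate the element-by-element operations.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

xtx    = X.T @ X    # X'X: transpose of X multiplied by X
recip  = 1.0 / X    # 1./X: reciprocal of each element
square = X * X      # X.*X: element-by-element product (square of each element)
```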
Two matrix operations
Binary operations imply that the arithmetic operation is computed on the data and an operand, as defined by the rules of linear algebra:
Addition: X+Y
Subtraction: X-Y
Multiplication: X*Y
Matrix division: X*inv(Y)
Element by element division: X/Y
The calculations that are possible depend on dimensionality of the matrices X and Y that
have been selected in the scope.
Add, Hadamard product and subtract require X and Y to have the same number of rows and columns, or Y has to be a row or column vector with dimensions matching those of X.
The X and Y matrices in the calculations should not be confused with inputs and outputs of a
model.
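The two-matrix operations listed above can be sketched in NumPy (illustrative only; the matrices are made up):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
Y = np.array([[5.0, 6.0],
              [7.0, 8.0]])

added      = X + Y                  # addition: same shape required
subtracted = X - Y                  # subtraction
matmul     = X @ Y                  # matrix multiplication
matdiv     = X @ np.linalg.inv(Y)   # "matrix division" X*inv(Y); Y must be square and invertible
elemdiv    = X / Y                  # element-by-element division
hadamard   = X * Y                  # Hadamard (element-by-element) product
```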
Reshape matrix
Change dimensions of a two-dimensional matrix.
One can rearrange the elements of a matrix to change the number of rows and columns.
This is especially useful when importing data where a matrix has been stored as a one-
dimensional list of values.
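The rearrangement can be illustrated in NumPy (a sketch; row-major filling is assumed here):

```python
import numpy as np

# A matrix that arrived on import as a flat, one-dimensional list of values...
flat = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# ...reshaped into 2 rows and 3 columns, filled row by row.
reshaped = flat.reshape(2, 3)
```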
Augment X|Y: column-wise combination of matrices; e.g. 4x2 and 4x2 gives 4x4
Append Y to X: row-wise combination of matrices; e.g. 4x2 and 4x2 gives 8x2
Augment requires X and Y to have the same number of rows. Append requires X and Y to
have the same number of columns.
These are binary operations in the shaping tab available only when the Binary operand box is
checked. This requires that the values be numeric. If there are columns of non-numeric data,
they will be kept out of the calculation. If there are missing values in either matrix, the rows
(columns) containing them will be kept out of the calculation.
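The two shaping operations can be sketched in NumPy (illustrative only):

```python
import numpy as np

X = np.ones((4, 2))
Y = np.zeros((4, 2))

augmented = np.hstack([X, Y])  # Augment X|Y: needs same number of rows  -> 4x4
appended  = np.vstack([X, Y])  # Append Y to X: needs same number of columns -> 8x2
```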
Use this option to enable/disable the audit trail. Note: This option is not active when
the program is installed in Compliance Mode.
Prompt user to view plots
When checked, the user will be prompted to view the model plots when opening a project, after training a model, and after predictions. This option will be unchecked if the ‘Do not ask me again’ option is selected in the View Plots dialog.
Viewer
These options allow a user to set the default appearance properties of plots at the
application level. The settings can still be customized and changed at the plot level by editing
the properties for a given plot.
The following are properties that can be set from the Viewer:
Antialiasing
Use this option to set antialiasing in all analysis-generated plots.
Point label visible
Use this option to have the default view on plots have the point labels visible. Point
labels can be toggled on/off from a plot.
Line plot point visible
Use this option to have the default view on line plots have the points visible. The
point can be toggled on/off from a plot.
Point size
Use this option to set the default size of points. This can be changed for individual plots under Properties.
Line size
Use this option to set the default line size. This can be changed for individual plots under Properties.
To access the Report Generator, select Tools – Report…. The Report generator dialog
appears and gives access to all matrices and plots in the current project. Add plots and
matrices in the field Included in report to create a customized report.
To add a matrix use the Data tables field and:
Either select a data matrix that is in the Navigator as a node from the drop-down list
Or select one from an analysis using the Select result matrix button.
At the bottom of the dialog are three tabs where the user can choose settings for the
security, report content, and page setup.
Security
Passwords can be enabled to limit access for editing and viewing the report. The user can enable password-protected editing of reports.
Printing, editing, copying, or annotating can be disabled for added security.
Content
Under the content tab the user can select to append notes, and/or use the editor
format for numbers.
Report Generator Content
Page Setup
On the Page Setup tab, a user can define the paper size (A2, A3, A4, letter, legal),
and orientation (portrait or landscape).
Report Generator Page setup
4.18. Help
4.18.1 Help menu
The help menu provides access to help topics and licensing-related information in The
Unscrambler®.
Help – Contents
or F1
Open help viewer for browsing.
See the How to use help documentation.
Help – Search
Ctrl+F1
Open help viewer for searching.
Help – About
Shows:
The System Info button will open the “Windows System Information” utility.
Company name and Email address fields become active when the activation key is for a
time-limited or perpetual license.
Contact details can be found at http://www.camo.com/contact
Login
Compliance
Users are advised to create a login and identification, which will not only secure their work with The Unscrambler®, but also provide valuable information for keeping track of actions taken on data through the audit trail, where the user name is logged with every action.
Use the menu option Help - User Setup… to access the dialog.
User setup dialog
The above image shows an example of a completed setup. Enter the pertinent information
in the provided fields and then click Save.
The following is a brief explanation of the fields:
User Name
This is the name that will be shown in the login dialog each time the program is
started.
First Name
Security Question
Select from a list of pre-defined questions to provide an answer to.
Answer
Enter the answer to the question here.
If a password is forgotten, it can be retrieved provided the answer to the security question is known. See the section on Login for more details.
Contact CAMO Software for information about how to register more than one user.
Contact details can be found at http://www.camo.com/contact
5. Import
5.1. Importing data
This section describes how to import data from supported instruments and software utilities
into The Unscrambler®.
The Unscrambler® X
The Unscrambler® 9.8 and earlier versions
NetCDF
JCAMP-DX
Instruments
Interface protocols
Databases
Other interfaces such as OPC and MyInstrument are supported. Contact CAMO Software for
details. http://www.camo.com/contact
The file names are given in glob notation: “*” means any number of characters, “?” any single character, and “[ABC]” any of A, B or C.
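Python’s fnmatch module implements the same notation and can illustrate it (the file names below are hypothetical):

```python
from fnmatch import fnmatch

# "*" matches any run of characters, "?" a single character,
# "[ABC]" exactly one of A, B or C.
assert fnmatch("spectrum01.spc", "*.spc")
assert fnmatch("s1.txt", "s?.txt")
assert fnmatch("run_B.dat", "run_[ABC].dat")
assert not fnmatch("run_D.dat", "run_[ABC].dat")
```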
5.2. ASCII
5.2.1 ASCII (CSV, text)
Type of data
Array
Software
ASCII (American Standard Code for Information Interchange) is a character encoding
scheme and the de-facto file standard supported by many applications.
File name extension
*.csv, *.txt, *.*
Data delimiters
Numbers may be delimited by different characters in different ASCII files. Specify which
delimiter is used in the file to be imported, in the field Separator. The choices are
Comma
Semicolon
Space
Tab
Custom
Note: Carriage Return, Line Feed and Tabulation are not among the available
delimiters in the dialog. They are default item delimiters, and will automatically be
recognized as such. Do not specify them in the Custom field!
There is an additional list of check box options below:
Missing data
Any text string entries in a numeric column will be imported as empty or missing data.
Make sure that Treat consecutive separators as one is unchecked when importing ASCII files
that have empty entries for missing data, such as:
s4,0.618,,0.6022
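Why this matters can be illustrated with Python’s csv module (a sketch only; The Unscrambler® performs its own parsing):

```python
import csv
import io

# One record from an ASCII file where the third field is empty (missing data).
line = "s4,0.618,,0.6022\n"
row = next(csv.reader(io.StringIO(line)))

# The empty entry is preserved as an empty string, i.e. a missing value.
# Treating consecutive separators as one would drop it and shift the columns.
values = [float(v) if v else None for v in row[1:]]
```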
Batch import
Often spectrometers output spectra in individual files, such that each file contains a single
spectrum (with or without headers). A selection of such single spectrum text-files can be
imported in a single step in The Unscrambler®, simply by selecting multiple files to open. A
simplified dialog is used for batch import.
Batch import dialog
Each spectrum is imported and appended to the previous spectra row-wise. If spectra are
given as a single row in the files, this means that each spectrum will become a single row in
the imported data table. If spectra are given column-wise (i.e. separated by carriage
return/newline), they should be transposed using the Transpose the data before import
check-box.
The sample file-names are included in a row-header in the imported table.
See section on single file import above for general import options.
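The row-wise stacking described above can be sketched in NumPy (the file names and spectral values here are hypothetical):

```python
import numpy as np

# Hypothetical single-spectrum files, already parsed into 1-D arrays.
spectra = {
    "sampleA.txt": np.array([0.11, 0.12, 0.13]),
    "sampleB.txt": np.array([0.21, 0.22, 0.23]),
}

row_names = list(spectra)                  # file names become row headers
table = np.vstack(list(spectra.values()))  # one file -> one row of the table

# A spectrum stored column-wise (one value per line) is transposed first,
# mirroring the "Transpose the data before import" check-box.
column_wise = np.array([[0.31], [0.32], [0.33]])
as_row = column_wise.T
```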
5.3. BRIMROSE
5.3.1 Brimrose
Type of data/instrument
NIR
Data dimensions
Multiple spectra
Instrument/hardware
Snap!32 v2.03 (BFF3)
Snap!32 v3.01 (BFF4)
Vendor
Brimrose
File name extension
*.dat
The source files may contain one or more samples per file; multiple selections allow several
samples to be imported at the same time.
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Once Auto select matching spectra has been checked, only those files that have the same number of variables will be selected.
Sorting data
The file name, number of samples, number of X-variables, wavelengths for the first and last
X-variables, and step (increase in wavelength), are displayed for each file.
Step is the increment in wavelength (or wave number) between two successive variables.
The following relationship should be true:
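The relationship itself is not reproduced in this extract; for an evenly spaced axis it is presumably the following (an assumed formula, shown as a small Python check):

```python
# Assumed relationship for an evenly spaced wavelength axis: the step ties
# the first and last wavelengths to the number of X-variables.
# The values below are made up for illustration.
first_wl, last_wl, n_vars = 1100.0, 2500.0, 701

step = (last_wl - first_wl) / (n_vars - 1)  # increment between successive variables
```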
Preview
Preview spectra displays a line plot of the files that have been selected for import.
5.4. Bruker
5.4.1 OPUS from Bruker
Type of data/instrument
FT-IR, FT-NIR, Raman
Data dimensions
Single spectra
Instrument/hardware
—
Software
OPUS
Vendor
Bruker
File name extension
*.0x, *.1
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Interpolate
Checking the Interpolate option allows the import of data with different starting and ending points, provided the number of points is the same in all sets to be imported.
When the % button is selected, the following dialog appears allowing a user to set
the Tolerance for allowing data with different start or end points to be imported.
Interpolate Tolerance Dialog
Once Auto select matching spectra has been checked, the files in the list having the same
number of variables will be selected.
Use the Interpolate option to import data with different start or end points.
Sorting data
The file name, number of X-variables, wavelengths for the first and last X-variables are
displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays a line plot of the files that have been selected for import.
5.5. DataBase
5.5.1 Databases
Type of data
Array
Software
Note: The Data Link Properties dialog is a standard Windows dialog. Depending on the local language setup, this dialog may be displayed in a language other than English. The name of the dialog and the text of the fields will differ, but the layout and meaning of all fields will be the same as described hereafter. For additional information, click Help; this will start the Microsoft help system related to the current sheet in the Data Link Properties dialog.
The next two sections describe the standard stages to go through in order to establish a
connection from The Unscrambler® to a database.
To edit a value, select it, and click the Edit Value… button, which opens the dialog where a
property can be changed.
Press the Next button to preview the data and proceed to complete the import.
Preview data before import
The data types will be detected for individual columns and imported as numeric values or
text.
5.6. DeltaNu
5.6.1 DeltaNu
Type of data/instrument
Raman spectrometer
Data dimensions
single vector spectrum or multiple spectra in an array
Instrument/hardware
NuSpec software
Pharma-ID Raman spectrometer
Vendor
DeltaNu
File name extension
*.dnu, *.lib
Multiple selections are possible, by checking the box next to more than one file. The
selected samples must be of the same size (variables must match).
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Sorting data
The file name, number of X-variables, wavelengths for the first and last X-variables, and step
(increase in wavelength), are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays a line plot of the files that have been selected for import.
5.7. Excel
5.7.1 Microsoft Excel spreadsheets
Type of data
Array (spreadsheet)
Software
Excel (part of Microsoft Office)
Vendor
Microsoft
File name extension
*.xls, *.xlt, *.xlsx, *.xlsm
How to use it
All ranges that have been defined with names in the selected Excel sheet are listed under
Range names. Multiple row and column headers can be specified in headers, with up to a
maximum of 5 headers.
The sheet range is updated automatically if a range name is selected. The range can also be
entered manually, specifying the Rows and Columns, e.g. 2:1. All cells lying within this
rectangle are then imported.
Select the appropriate ranges as described above for the data values from the selection
option, as well as for the rows/sample and columns/variable names, if relevant.
Columns and rows can be removed from the import by selecting them within the preview
grid and pressing Del on the keyboard.
Data type
If the worksheet contains non-numeric values or a mixture of numeric and non-numeric
values, they can be imported. The radio button Auto can be selected to detect the data
format in the Excel spreadsheet and maintain that on import. If all the data are non-numeric,
they can be imported as text by selecting the radio button text. If the spreadsheet has a mix
of text and numeric values, and one data type is selected, only data of that type will be
imported.
Skip lines
If there are rows of data at the top of the spreadsheet that you do not want to import, you
can use the Skip lines option to enter the number of lines from the top to skip.
5.8. GRAMS
5.8.1 GRAMS from Thermo Scientific
Type of data
Array
Data dimensions
Multiple spectra, constituents
Software
GRAMS
Vendor
Thermo Scientific (formerly Galactic)
File name extension
*.spc, *.cfl
When a .cfl file is imported into The Unscrambler®, both spectra and constituents are read. If a .spc file is imported, the spectra
are read, and accompanying Y values can also be imported with them.
“X-values” (usually wavelengths) in .spc files are imported as X-variable names.
Constituents in .cfl files are imported as Y-variables. “Y-values” are imported as separate
column sets with the name of the Y values for the columns.
Some .spc files contain a log block. This may include file names and sample numbers. To
import these, one can select Sample naming… and designate whether to use one, both or
none of these fields.
The binary part of the log block (which usually contains the imaginary part of complex
spectral data) is not imported, nor is the ASCII part of the log.
The source files may contain one or more samples per file (i.e. single spectra or multifiles1);
multiple selections allow one to import several samples with the same number of variables
at the same time. The dialog will include details about the files that are eligible for import. It
will show the number of samples per file, the number of X variables, number of Y variables,
and the starting and ending X variables.
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import. If the data files also include Y values, these will also be imported.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Interpolate
Checking the Interpolate option allows the import of data with different starting and ending points, provided the number of points is the same in all sets to be imported.
When the % button is selected, the following dialog appears allowing a user to set
the Tolerance for allowing data with different start or end points to be imported.
Interpolate Tolerance Dialog
Once the Auto select matching spectra option has been checked it will select only those files
that have the same number of variables as the first selected file.
Use the Interpolate option to import data with different start or end points.
Sorting data
The file name, number of samples, number of X-variables, wavelengths for the first and last
X-variables are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list. Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays a line plot of the files that have been selected for import.
Multifiles are a specific kind of GRAMS file that has multiple spectra in a single file,
as opposed to a single spectrum per file.
5.9. GuidedWave
5.9.1 CLASS-PA & SpectrOn from Guided Wave
Type of data/instrument
spectrometer (UV, UV-vis, NIR)
Data dimensions
Single spectra, constituents
Instrument/hardware
CLASS-PA, SpectrOn
Vendor
Guided Wave
File name extension
*.asc, *.scn, *.autoscan, *.gva
Multiple selections are possible, by checking the box next to more than one file. The
selected samples must be of the same size (variables must match).
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names, sample numbers or timestamps in the resulting data table.
Sample names will only be imported if they are present in the source file.
Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears, allowing the user to set
the Tolerance for importing data with different start or end points.
Interpolate Tolerance Dialog
Y-variables
Constituents may also be imported by checking the following options:
Import Y-variables
Import Predicted Y-variables
Use the Interpolate option to import data with different start or end points.
Sorting data
The file name, number of X-variables, wavelengths for the first and last X-variables, and step
(increase in wavelength), are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays a line plot of the files selected for import.
5.10. Interpolation
If there is a mismatch in any of these values, there are two possible scenarios:
If the number of points in the spectra do not match, a matrix cannot be formed,
as the columns do not have the same dimension.
If the start points do not match, again a matrix cannot be formed; however, if the
differences between the values are small, interpolation can be used to reconcile
them.
The Interpolation function used in the Import menus is different from that found in Tasks -
Transform (which may be useful for trying to match data from two sets collected at different
resolutions).
Find out more about the Interpolate Transform here.
Data Imports Supporting Interpolation
The following file imports support the interpolate functionality in The Unscrambler® import
dialog boxes.
JAMP-DX
Thermo Galactic GRAMS
OPUS (Bruker Optics)
CLASS-PA & SpectrOn
Indico (ASD)
OMNIC™ (Thermo)
Varian
PerkinElmer
Functionality
When a file import supporting interpolate is selected, the Interpolate checkbox will be
present, see below
The % button opens the Tolerance dialog box that has a slider bar for setting how far
beyond the reference spectrum limit to set the interpolation.
Tolerance Dialog
Any points that lie within +/- the set percentage tolerance of the starting point will be
included in the import.
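The tolerance check and the subsequent resampling onto the reference grid can be sketched as follows. This is a simplified Python illustration under stated assumptions (linear interpolation, tolerance expressed as a percentage of the reference wavelength span); it is not The Unscrambler's internal algorithm.

```python
def within_tolerance(ref_x, x, tol_pct):
    # Same number of points, and start/end wavelengths within
    # +/- tol_pct of the reference spectrum's wavelength span.
    span = ref_x[-1] - ref_x[0]
    tol = span * tol_pct / 100.0
    return (len(x) == len(ref_x)
            and abs(x[0] - ref_x[0]) <= tol
            and abs(x[-1] - ref_x[-1]) <= tol)

def interpolate_to(ref_x, x, y):
    # Linear interpolation of the spectrum (x, y) onto the reference grid.
    out = []
    for rx in ref_x:
        if rx <= x[0]:
            out.append(y[0])
        elif rx >= x[-1]:
            out.append(y[-1])
        else:
            i = max(j for j in range(len(x)) if x[j] <= rx)
            t = (rx - x[i]) / (x[i + 1] - x[i])
            out.append(y[i] + t * (y[i + 1] - y[i]))
    return out

ref = [100.0, 110.0, 120.0]
shifted = [101.0, 111.0, 121.0]          # same point count, shifted start
ok = within_tolerance(ref, shifted, 10.0)  # small shift: importable
resampled = interpolate_to(ref, shifted, [1.0, 2.0, 3.0])
```

A spectrum that fails the tolerance test would be left out of the import rather than resampled.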
Example
Nine spectra were collected on three different Bruker spectrometers using 8 wavenumber
resolution. Three replicate spectra were collected on each instrument. Each spectrum
consists of 1154 points; however, the starting point of each spectrum is different. By
selecting the first spectrum and then checking the Auto select matching spectra box, only
the first three spectra are selected, see below.
To import all data into one table, check the Interpolate box and set the Tolerance to include
all spectra in the set, see below
When the Auto select matching spectra box is reselected, all spectra are now included in the
import, see below,
The data are now displayed as a node in the project navigator using the column headers of
the reference spectrum selected.
5.11. Indico
5.11.1 Indico
Type of data/instrument
—
Data dimensions
Single spectra
Software
Indico Pro 5.6 (version 6 files)
RS3 5.6 (version 7 files)
Indico Pro 6.0 (version 8 files)
Vendor
ASD Inc.
File name extension
*.asd, *.001, *.002, *.3456, etc. (any number)
The source files contain one sample per file; multiple selection allows for the import of
several files (samples) at the same time.
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears, allowing the user to set
the Tolerance for importing data with different start or end points.
Interpolate Tolerance Dialog
Use the Interpolate option to import data with different start or end points.
Sorting data
The file name, number of X-variables, wavelengths for the first and last X-variables, and step
(increase in wavelength), are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays a line plot of the files selected for import.
5.12. JcampDX
5.12.1 JCAMP-DX
Type of data/instrument
Vector and arrays. Standard
Data dimensions
Multiple spectra, constituents
Vendor
JCAMP/IUPAC
File name extensions
*.jdx, *.dx, *.jcm
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears, allowing the user to set
the Tolerance for importing data with different start or end points.
Interpolate Tolerance Dialog
Use the Interpolate option to import data with different start or end points.
Sorting data
The file name, number of samples, number of X variables, number of Y variables, and
wavelengths for the first and last X-variables are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays line plots of selected files for import.
General
JCAMP-DX are ASCII-files with file headers containing information about the data and their
origin, etc., and they may contain both X-data (spectra) and Y-data (concentrations).
Only the most essential information of the JCAMP-DX file will be imported. The first title in
the JCAMP-DX file will be used, and one has the additional option of also importing file
names and sample numbers. There is no limit on the length of a file name. If several
JCAMP-DX files are imported and saved in the same Unscrambler® file, the matrix name will
be that of the first imported JCAMP-DX file.
JCAMP “X-values” (usually wavelengths) become X-variable names, while JCAMP “Y-values”
become X-variable values. “Concentrations” are interpreted as Y-variables. Variable names
are imported, with no limit on the number of characters. The “Sample description” is used
as the sample name. Unfortunately, there are different dialects of JCAMP-DX, so in some cases
one may lose e.g. sample names if they were used erroneously in the original file.
The XYPOINTS variant demands more disk space than XYDATA.
Examples of the XYDATA and XYPOINTS formats follow.
JCAMP-DX XYPOINTS
The example below shows only one sample.
JCAMP-DX XYDATA
The example below shows only one sample.
##XUNITS= NANOMETERS
##YUNITS= ABSORBANCE
##XFACTOR= 1.0
##YFACTOR= 0.000001
##FIRSTX= 1100
##LASTX= 2500
##FIRSTY= 0.139460
##MINY= 0.131600
##MAXY= 1.380070
##NPOINTS= 281
##CONCENTRATIONS= (NCU)
(<CARBOHYDRATE>, 89.400, %)
(<PROTEIN>, 9.410, %)
##DELTAX= 5
##XYDATA= (X++(Y..Y))
1100 139459 137435 135089 133060 131669 131599 133794 138899
1140 145740 151897 158459 167527 180800 195522 206585 216499
...
...
2460 1378929 1379632 1378464 1374972 1378929 1376837 1372945 1377632
2500 1380069
##END=
BASELINEC= YES or NO
APCOM= String60
JCAMP-DX= String
ORIGIN= String
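The XYDATA encoding above can be decoded line by line: the first number on each data line is the X value of the first point, and the remaining numbers are raw Y values to be multiplied by ##YFACTOR. A minimal Python sketch follows; it is an illustration, not a full JCAMP-DX parser, and it ignores compressed JCAMP forms such as DIF/DUP encoding.

```python
def decode_xydata_line(line, xfactor=1.0, yfactor=1.0):
    # "1100 139459 137435 ..." -> (x of first point, list of scaled Y values)
    parts = line.split()
    x0 = float(parts[0]) * xfactor
    ys = [int(v) * yfactor for v in parts[1:]]
    return x0, ys

# First data line of the XYDATA example, with ##YFACTOR= 0.000001:
x0, ys = decode_xydata_line("1100 139459 137435 135089", yfactor=0.000001)
# ys[0] reproduces ##FIRSTY= 0.139460 to within rounding
```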
5.13. Konica_Minolta
5.13.1 Konica_Minolta
Type of data/instrument
KONICA MINOLTA NIR spectrometer
Data dimensions
single vector spectrum or multiple spectra in an array
Instrument/hardware
Vendor
Konica_Minolta
File name extension
Upon selection of ASCII files, the spectrum is displayed in the dialog box as a line plot. After
selecting multiple files, the user can click OK to import the data.
Konica_Minolta Import
5.14. Matlab
5.14.1 Matlab
Type of data
Array
Software
Matlab
Vendor
MathWorks, Inc.
File name extension
*.mat
This will create a Matlab formatted .mat file. For more help on using the save command,
type help save in Matlab.
Matlab variables representing sample and variable names must be character arrays.
What Cannot be Converted
The following cannot be imported from Matlab to The Unscrambler®
5.15. MyInstrument
5.15.1 MyInstrument
Type of data/instrument
Instrument interface standard defined by Thermo Electron (formerly Galactic) and
supported by many instrument vendors.
A MyInstrument driver provided by the specific instrument vendor and the
corresponding MyInstrument add-on for The Unscrambler® are required. These
modules are available separately from CAMO Software and may not be part of the
standard package.
Additional information
How to use it
The Unscrambler® makes use of the MyInstrument standard to allow for instrument
configuration and definition of experiments in order to run scans. The functionality provided
depends on the instrument. After acquisition, the spectral data are inserted directly, one row
per scan, into The Unscrambler® editor, ready for further processing or modeling. The
MyInstrument add-on removes the need for acquiring data using other instrument-specific
software, saving to a file and then importing into The Unscrambler®.
The next window will show the vendor specific MyInstrument control screen, e.g. for a Zeiss
instrument:
The appearance and usage of the control dialog will depend on the particular instrument
vendor. Details of using the instrument interface will be available from the manuals provided
by the instrument vendor. Using the instrument may require specific configuration and
setup procedures provided by the vendor before being able to run scans.
Sample scan result. This may appear entirely different for the instrument being used and is
provided here only as an example.
Click OK to end the scan acquisition session. The scans should now be available within The
Unscrambler® editor for subsequent processing and modeling.
5.16. NetCDF
5.16.1 NetCDF
Type of data
Open standard for array-oriented data
Developed by
University Corporation for Atmospheric Research (UCAR)
File name extension
*.cdf, *.nc
The NetCDF software was developed by Glenn Davis, Russ Rew, Ed Hartnett, John Caron,
Steve Emmerson, and Harvey Davies at the Unidata Program Center in Boulder, Colorado,
with contributions from many other NetCDF users.
One can select Sample Names and Variable names as shown above.
5.17. NSAS
5.17.1 NSAS
Type of data/instrument
NIR
Data dimensions
Multiple spectra, constituents
Instrument/hardware
Foss 5000, 6500, XDS
Vendor
FOSS
File name extension
*.da, *.cn, *.cal
The source files may contain one or more samples per file; multiple selections allow several
samples to be imported at the same time.
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Once Auto select matching spectra has been checked, the files in the list having the same
number of variables will be selected.
Sorting data
The file name, number of samples, number of X-variables, and wavelengths for the first and
last X-variables are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays a line plot of the files selected for import.
NSAS_AmpType String: 1
NSAS_CellType String: 2
NSAS_Volume String: 3
NSAS_Math2_Type =
NSAS_Math3_Type =
NSAS_Math2_SegmentSize =
NSAS_Math3_SegmentSize =
NSAS_Math2_GapSize =
NSAS_Math3_GapSize =
NSAS_Math2_DivisorPoint =
NSAS_Math3_DivisorPoint =
NSAS_Math2_SubtractionPoint =
NSAS_Math3_SubtractionPoint =
NSAS_AmpType | String:
“Reflectance”, “Transmittance”, “(Reflect/Reflect)”, “(Reflect/Transmit)”,
“(Transmit/Reflect)”, “(Transmit/Transmit)”, “Not used”
NSAS_CellType | String:
“Standard sample cup”, “Manual”, “Web analyzer”, “Coarse sample”, “Remote
reflectance”, “Powder module”, “High fat/moisture”, “Rotating drawer”, “Flow-
through liquid”, “Cuvette”, “Paste cell”, “Cuvette cell”, “3 mm liquid cell”, “30 mm
liquid cell”, “Coarse sample with sample dump”
NSAS_Volume | String:
“1/4 full”, “1/2 full”, “3/4 full”, “Completely full”
5.18. Omnic
5.18.1 OMNIC
Type of data/instrument
FTIR, FT-NIR, Raman
Data dimensions
Single spectra
Instrument/hardware
Nicolet IR, Antaris, NXR
Vendor
Thermo Scientific (Nicolet)
File name extension
*.spa, *.spg
The source files contain one sample per file. Multiple selection allows several files (samples)
to be imported at the same time.
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears, allowing the user to set
the Tolerance for importing data with different start or end points.
Interpolate Tolerance Dialog
Once the Auto select matching spectra option has been checked, the files in the list that
have the same variables will be selected.
Use the Interpolate option to import data with different start or end points.
Sorting data
The file name, number of X-variables, wavelengths for the first and last X-variables, and step
(increase in wavelength), are displayed for each file.
Step is the increment in wavelength (or wavenumber) between two successive variables.
The following relationship should hold: last wavelength = first wavelength + step ×
(number of X-variables − 1).
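This relationship can be checked with the values from the JCAMP-DX example earlier in this chapter (##FIRSTX= 1100, ##DELTAX= 5, ##NPOINTS= 281, ##LASTX= 2500):

```python
# Last wavelength = first wavelength + step * (number of points - 1).
# Values from the JCAMP-DX example earlier in this chapter:
first_x, step, n_points = 1100.0, 5.0, 281
last_x = first_x + step * (n_points - 1)
# 1100 + 5 * 280 = 2500, matching ##LASTX= 2500
```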
Preview
Preview spectra displays a line plot of the files selected for import.
5.19. OPC
5.19.1 OPC protocol
Type of data/instrument
Standard data transfer protocol
Vendor
OPC Foundation
File format information
How to use it
All configured servers on the PC will be recognized, and displayed in the list of OPC servers.
The user must make selections for the Computer name/IP, the OPC Server, and the OPC
Group from the respective drop-down lists. The user also has provision to type in computer
name/IP, the OPC server, and the OPC Group. Once they have been selected, available items
will be given in the OPC Items list. An item is selected, and by clicking on GO, the data will be
generated from OPC, and populate the fields in the OPC Import Dialog. Click Stop to stop the
collection process from OPC, showing the data in the preview.
OPC Tag - Use this option to specify the OPC tag directly. This is useful when many OPC
groups and OPC items are available on the servers, as typing the tag avoids the delay of
listing and selecting individual OPC groups and items.
Update Rate - The rate (in milliseconds) at which data is retrieved from the OPC
Server.
Show preview - Check this option to see the last 10 rows retrieved from the OPC
Server.
Set number of columns - Use this option to increase the number of columns.
Filled OPC Dialog
5.20. OSISoftPI
5.20.1 PI
Type of data
PI Server - real time data collection, archiving and distribution engines
The PI Import dialog allows the user to specify and connect to an active server. Click Add to
search a PI Server for tags using the Tag Search dialog. This dialog allows the user to search
all connected PI Servers for tags meeting a given set of criteria, such as one or more tag
attribute values. Tags can be selected using the Search option. Three search options are
available in the Tag Search dialog: Basic, Advanced, and Alias.
Tag Search dialog
After the tags are selected (use Ctrl key for multiple tag selection) from the search list panel
and OK is clicked, they can be seen in the Tags window of the PI Import dialog. For more
details on options available in Tag Search dialog box, click on Help.
The three sections below describe the data modes used to preview and retrieve data for the
selected tags from the PI server.
The help option available in the PISDKUtility provides more details about the usage of PI-SDK
configuration utility.
5.21. PerkinElmer
5.21.1 PerkinElmer
Type of data/instrument
UV-Vis, NIR, FTIR, Raman
Data dimensions
Multiple spectra
Instrument/hardware
—
Software
Spectrum 6, Spectrum 10
Vendor
PerkinElmer
File name extension
*.sp, *.spp
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears, allowing the user to set
the Tolerance for importing data with different start or end points.
Interpolate Tolerance Dialog
Once Auto select matching spectra has been checked, the files in the list having the same
number of variables will be selected.
Use the Interpolate option to import data with different start or end points.
Sorting data
The file name, number of X-variables, and wavelengths for the first and last X-variables are
displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays a line plot of the files selected for import.
5.22. PertenDX
5.22.1 Perten-DX
Type of data/instrument
Vector and arrays. Standard
Data dimensions
Multiple spectra, constituents
Vendor
Perten Instruments following JCAMP/IUPAC
File name extensions
*.jdx, *.dx, *.jcm
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears, allowing the user to set
the Tolerance for importing data with different start or end points.
Interpolate Tolerance Dialog
Use the Interpolate option to import data with different start or end points.
Sorting data
The file name, number of samples, number of X variables, number of Y variables, and
wavelengths for the first and last X-variables are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays line plots of selected files for import.
General
Perten-DX supports additional tags specific to Perten Instruments. These are:
Tag name Imported in Unscrambler as
Perten-DX file
The example below shows a Perten-DX sample file.
##TITLE=2
##INSTRUMENT S/N=1201530
##INSTRUMENT TYPE=DA7250
##SPECTROMETER S/N=SNIR2148
##JCAMP-DX=4.24
##DATATYPE= NEAR INFRARED SPECTRUM
##LONG DATE=2013-10-18T01:59:18+02:00
##SAMPLE DESCRIPTION=2
##SMOOTHED=YES
##XUNITS= Nanometers (nm)
##YUNITS= Absorbance
##CONCENTRATIONS= (NCU)
(Protein Dry basis,-9.973E+23,<unknown>)
##PERTEN-TYPES= (KV)
(Product Type, Wheat),
(Shape Type, Unknown),
(Tray Type, Large Tray. rotating)
##PERTEN-REPACK=1
##PERTEN-REPEAT=1
##PERTEN-SAMPLEINFO= (KV)
##XFACTOR= 1.0
##YFACTOR= 0.000000001
##FIRSTX= 950.00
##LASTX= 1650.00
##NPOINTS= 141
##DELTAX= 5.0
##XYDATA= (X++(Y..Y))
950.0 186225975 188992413 193629553 199835249 207323496 215294014
222310809 227316331 230163481
995.0 231218537 230973747 229930179 228344771 226101418 223436221
220348573 216993825 213526732
1040.0 210076812 206678859 203519066 200372073 197183083 193896477
190813849 187961026 185361544
1085.0 183060794 181031311 179367942 178144637 177316150 176997467
177158004 178485737 182057610
1130.0 189131917 200696556 216125124 233953784 253292157 272636547
291094037 307752989 322292848
1175.0 335720686 348497384 360603909 370580710 377233357 380561567
380739361 377437577 370749286
1220.0 361610474 351741516 342353572 334328973 327783482 322877222
319254364 316585214 314597761
1265.0 313006114 311340643 309259709 306673122 303654410 300820687
298877629 297995673 298450579
5.23. RapID
5.23.1 RapID
Type of data
Array
Data dimensions
single vector spectrum
Instrument/hardware
Particle size analysers
Raman Spectrometers
Laser Induced Breakdown Spectrometers (LIBS)
Vendor
rap-ID Particle Systems
File name extension
.txt,.jcm
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Once Auto select matching spectra has been checked, only those files that have the same
number of variables will be selected.
Sorting data
The file name, number of samples, and number of X-variables are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays a line plot of the files selected for import.
5.24. U5Data
5.24.1 U5 Data
File name extension
*.UNS
Note: The Unscrambler® recognizes the extensions: .UNS, .UNM, .UNP, and .CLA.
Rename the files if they have other extensions.
5.25. UnscFileReader
5.25.1 The Unscrambler® 9.8
Type of data
Array
Software
The Unscrambler® 9.8
Vendor
CAMO Software
File name extensions
*.??M, *.??D
Statistics .10D
PCA .11M
Prediction .30D
Classification .31D
MLR .40M
PLS1 .41M
PLS2 .42M
PCR .43M
MSC .50D
The Unscrambler® 9.8 introduced a merged file format combining .??[DLPTW] into one file,
.??M.
A few details to remember about the file sets that comprise each data table or saved result:
When transferring data to another place using the Windows Explorer, make sure
that all the associated physical files are copied!
Do not change the file name extensions The Unscrambler® uses. Doing so may
create problems accessing the files from within The Unscrambler®.
The log and notes files are plain ASCII files which can be opened and viewed using a
text editor.
5.26. UnscramblerX
5.26.1 The Unscrambler® X
Type of data
Array
Software
The Unscrambler® X
Vendor
CAMO Software
File name extensions
*.unsb
After selecting the import target, click OK to enter the Import dialog.
5.27. Varian
5.27.1 Varian
Type of data/instrument
—
Data dimensions
Multiple spectra, constituents
Instrument/hardware
Cary UV-Vis
Software
—
Vendor
Varian, Inc.
File name extension
*.bsw
The source files may contain one or more samples per file. Multiple selections allow several
samples to be imported at the same time.
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears, allowing the user to set
the Tolerance for importing data with different start or end points.
Interpolate Tolerance Dialog
Once the Auto select matching spectra option has been checked, the files in the list having
the same variables will be selected.
Use the Interpolate option to import data with different start or end points.
Sorting data
The file name, number of samples, number of X variables, number of Y variables, and
wavelengths for the first and last X-variables are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays a line plot of the files selected for import. A screenshot of the
Varian Import dialog with the preview spectra chosen is given below.
5.28. VisioTec
5.28.1 VisioTec
Type of data/instrument
Data dimensions
single vector spectrum or multiple spectra in an array
Instrument/hardware
Vendor
VisioTec
File name extension
The source files may contain one or many samples per file; multiple selection allows for the
import of several files (blocks of data) at the same time.
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
6. Export
6.1. Exporting data
This section describes how to export data from The Unscrambler®.
ASCII
JCAMP-DX
NetCDF
Matlab
AMO: The Unscrambler® ASCII Model
DeltaNu
6.2. AMO
6.2.1 Export models to ASCII
The Unscrambler® ASCII-MOD file is an ASCII-based file format used to transfer models from
The Unscrambler® to compatible instruments and prediction software.
Select model
A drop-down list contains all models found in the currently open project. Select the
one to export.
Type
Choose between Full and Short prediction storage, where the latter is used to
achieve a smaller file size when only the regression coefficients are used for
prediction.
PCs
The number of Principal Components or factors to include in the exported model.
Y-Variable
Select the Y-variables to be included with the model.
Press OK and use the file dialog to select the destination directory and give a file name to
save the model.
(Fragment of the matrix dimension table; only two rows survive: B0, 1 row, PC (1-a);
ResXValTot, PC (0-a).)
Note: The contents of the columns “Rows” and “Columns” show the contents of
the ASCII-MOD file, not the contents of the matrices in the main model file.
TYPE=FULL // (MINI,FULL)
VERSION=1
MODELNAME=F:\U\EX\DATA\TUTBPCA.11D
MODELDATE=10/27/95 11:41:13
CREATOR=Joe Doe
METHOD=PCA // (PCA, PCR, PLS1, PLS2)
CALDATA=F:\U\EX\DATA\TUTB.00D
SAMPLES=28
XVARS=16
YVARS=0
VALIDATION=LEVCORR // (NONE,LEVCORR,TESTSET,CROSS)
COMPONENTS=2
SUGGESTED=2
CENTERING=YES // (YES,NO)
CALSAMPLES=28
TESTSAMPLES=28
NUMCVS=0
NUMTRANS=2
TRD:DNO // ,,,,,,,complete transformation string
TRD:DSG // ,,,,,,,complete transformation string
NUMINSTRPAR=1
##GAIN=5.2
MATRICES=13
"xWeight" // (Name of 13 matrices)
"xCent"
"ResXValTot"
"ResXCalVar"
"ResXValVar"
"ResXCalSamp"
"Pax"
"Wax"
"SquSum"
"TaiCalSDev"
"xCalMean"
"xCalSDev"
"xCal"
%XvarNames
"Xvar1" "Xvar2" "Xvar3" "Xvar4"
"Xvar5" "Xvar6" "Xvar7" "Xvar8"
"Xvar9" "Xvar10" "Xvar11" "Xvar12"
"Xvar13" "Xvar14" "Xvar15" "Xvar16"
%xWeight 1 16
.1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01
.1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01
.1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01
.1000000E+01
%xCent 1 16
.1677847E+01 .2258536E+01 .2231011E+01 .2404268E+01 .2179311E+01
.2470489E+01 .2079168E+01 .1734536E+01 .1475164E+01 .1480657E+01
.1644097E+01 .1805900E+01 .1980229E+01 .1795443E+01 .1622796E+01
.1497418E+01
,,,
,,,etc.
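The header portion of this listing is line-oriented KEY=VALUE text with optional // comments, so it can be read with a few lines of code. The following Python sketch is a hypothetical illustration, not CAMO's reader; it handles only the simple header fields, not the % matrix blocks.

```python
def parse_amo_field(line):
    # "METHOD=PCA // (PCA, PCR, PLS1, PLS2)" -> ("METHOD", "PCA")
    line = line.split("//")[0].strip()   # drop the trailing comment
    key, _, value = line.partition("=")
    return key.strip(), value.strip()

header = {}
for raw in ["TYPE=FULL // (MINI,FULL)", "SAMPLES=28", "COMPONENTS=2"]:
    k, v = parse_amo_field(raw)
    header[k] = v
```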
Description of fields
The table below lists the data field codes used in ASCII-MOD files.
Field Description
VERSION Increases by one for each change of the file format after release
MODELDATE Date for creation of the model (not the ASCII-MOD file)
CREATOR Name of the user who made the model (not the ASCII-MOD file)
VALIDATION (TEST,LEV,CROSS)
CENTERING (YES,NO)
INSTRUMENT
See below
PARAM.
Number of matrices on this file. One name for each matrix follows
MATRICES
below
Transformation strings
There is one line for each transformation, and the format of the line depends on the type of
transformation. If a transformation needs more data, as is the case for MSC, this extra
data will be stored as matrices at the end of the file. References to these matrices are
made by name.
Examples
A transformation named TRANS using one parameter could look like this:
TRANS:TEMP=38.8;
An MSC transformation may look something like this:
MSC:VARS=19,SAMPS=23,MEAN="ResultMatrix18",TOT="ResultMatrix19";
Transformation strings may also contain an error status, which is the case when the MSC
base has been deleted from the file before the ASCII-MOD file was made.
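Such transformation strings can be split into a name and key=value parameters with a small helper. The sketch below is illustrative only; the helper name is an assumption, and the exact grammar of each transformation type is defined by the software, not by this example:

```python
def parse_transformation(line):
    """Split an ASCII-MOD transformation string into its name and parameters.

    Example input: 'MSC:VARS=19,SAMPS=23,MEAN="ResultMatrix18"'
    Returns the transformation name and a dict of its parameters.
    """
    line = line.strip().rstrip(";")          # a trailing ';' terminates the string
    name, _, params = line.partition(":")
    fields = {}
    if params:
        for item in params.split(","):
            key, _, value = item.partition("=")
            fields[key.strip()] = value.strip().strip('"')
    return name, fields

# the single-parameter example from the text
name, fields = parse_transformation('TRANS:TEMP=38.8;')
```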
Transformation string codes
Code  Description
CLA   Classification
PRE   Prediction
STA   Statistics
VAR   Variable
VEC   Vector
IMP   Import
REP   Replace
BAS   Baseline
RED   Reduce
TSP   Transpose
USR   User-Defined
Storage of matrices
Each matrix starts with a header line, as in this example:
%Pax 10 155
This states that the matrix is named Pax and has the dimensions 10 rows by 155 columns.
The data elements follow from the next line.
If the calibration model was made using one Y-variable, the AMO file uses PLS1; if it was
created using more than one Y-variable, it uses PLS2.
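As an illustration, one such matrix block can be read back with plain text processing. The reader below is a hypothetical sketch, assuming whitespace-separated values wrapped over several lines in row order, as in the %xCent listing above:

```python
def read_ascii_mod_matrix(lines):
    """Read one matrix from ASCII-MOD text lines.

    The first line is a header such as '%Pax 10 155' (name, rows, columns);
    the following lines hold rows*columns numbers (assumed row by row).
    """
    header = lines[0].split()
    name = header[0].lstrip("%")
    rows, cols = int(header[1]), int(header[2])
    values = []
    for line in lines[1:]:
        # Fortran-style numbers like '.1677847E+01' parse directly as floats
        values.extend(float(tok) for tok in line.split())
        if len(values) >= rows * cols:
            break
    # reshape the flat value list into rows of length 'cols'
    matrix = [values[r * cols:(r + 1) * cols] for r in range(rows)]
    return name, matrix

name, m = read_ascii_mod_matrix([
    "%xCent 1 4",
    ".1677847E+01 .2258536E+01",
    ".2231011E+01 .2404268E+01",
])
```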
6.3. ASCII
6.3.1 ASCII export
The ASCII export option is very useful if one wants to work with the data table in another
program.
Select the matrix and data ranges that make up the data to be exported, or use Define to
create a new range.
Options
Include headers
Specify whether sample names and variable names are to be exported by selecting them in
the Include headers field. They will be placed in the first column and in the first row,
respectively.
Name qualifier
String data, such as headers, may be quoted, using either double quotes ", or single
quotes '.
It is recommended to quote text and to leave numbers unquoted, because this makes it
easier for importing programs to assign the correct data types to text and numbers.
Default is ".
Numeric qualifier
Numeric data may be quoted in the same way as headers.
Default is None.
Item delimiter
Table cell entries may be delimited by different characters.
Default is ,.
String representation of missing data
Specify how missing data are to be coded in the ASCII file.
Default is m.
For compatibility with software that does not support importing missing data as strings,
use a large negative number, such as -9.9730e+023, instead.
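The combined effect of these options can be sketched in a few lines of Python. This is an illustration only, not the actual export code; the function name and the number formatting are assumptions:

```python
def write_ascii(rows, headers, delimiter=",", qualifier='"', missing="m"):
    """Return ASCII export text with quoted headers, unquoted numbers,
    and missing values (None) coded with the given string."""
    def fmt(cell):
        if cell is None:
            return missing                       # missing-data code
        if isinstance(cell, str):
            return f"{qualifier}{cell}{qualifier}"  # quote text only
        return repr(cell)                        # numbers stay unquoted
    lines = [delimiter.join(fmt(h) for h in headers)]
    for row in rows:
        lines.append(delimiter.join(fmt(c) for c in row))
    return "\n".join(lines)

# one sample row with a missing value in the last variable
text = write_ascii([["S1", 1.5, None]], ["Sample", "Var1", "Var2"])
```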
6.4. DeltaNu
6.4.1 DeltaNu
The DeltaNu file is a model file format developed for use with the DeltaNu Pharma-ID Raman
spectrometers. It contains all the necessary information for projection and classification. PCA
models created in The Unscrambler® X can be exported to this file format. Such models are
compatible with DeltaNu Raman instrumentation for real-time projections.
The files are saved with a .dnub file name extension.
Select model
A drop-down list contains all models found in the currently open project. Select the
one to export. Only PCA models are supported in the DeltaNu format.
PCs
The number of Principal Components to include in the exported model. The default is the
optimal number of PCs for the model, and exporting with this number is recommended. To
export with a different number of PCs, choose it from the drop-down list.
Press OK and use the file dialog to select the destination directory and give a file name to
save the model.
6.5. JCampDX
6.5.1 JCAMP-DX export
Select the matrix and data ranges that make up the data to be exported, or use Define to
create a new range.
Metadata
Then, in the File Info tab, enter information related to the JCAMP-DX file as a whole. Here
one must choose between two JCAMP-DX formats: XYPoints and XYData. XYData requires
that the distance between consecutive variables is the same throughout the whole X-variable
set, and it produces smaller files than XYPoints.
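Whether a data set qualifies for XYData can be checked by verifying that consecutive X-values are equally spaced. The sketch below is a hypothetical check; the tolerance used by the actual exporter is an assumption:

```python
def is_equally_spaced(x, tol=1e-9):
    """Return True if consecutive x values have a constant step, i.e. the
    data can use the compact XYData form instead of XYPoints."""
    if len(x) < 3:
        return True
    step = x[1] - x[0]
    # every consecutive difference must match the first step within 'tol'
    return all(abs((b - a) - step) <= tol for a, b in zip(x, x[1:]))
```

For example, a wavelength axis sampled every 2 nm qualifies for XYData, while an axis with one irregular step does not.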
JCAMP-DX export dialog: File info
Title
Name of the data set
Origin
Can be the name of the lab, client name, batch number, or location where data
came from.
Owner
Name of the person conducting the experiment or the analysis.
Enter information related to the samples in the Samples Info tab. This information is saved
with each sample.
JCAMP-DX export dialog: Sample info
Sample names
Select either Use sample name from data table or Use text to specify manually
Sampling procedure
Details on how the data was collected.
Data processing
List the transformations applied to prepare the data.
Data type
Select appropriate value from the drop-down list.
X units
Select appropriate value from the drop-down list.
Y units
Select appropriate value from the drop-down list.
Click OK to save the file.
6.6. Matlab
6.6.1 Matlab export
The Unscrambler® provides the capability to export data tables to Matlab including sample
names (row headings in The Unscrambler®) and variable names (column names in The
Unscrambler®).
Select the matrix and data ranges that make up the data to be exported, or use Define to
create a new range.
Options
Select whether sample and variable names should be exported. If this option is selected,
these names are stored in separate arrays within the exported file, as is common practice in Matlab.
Select Use Compression to use gzip-compression for arrays stored to the Matlab file. This
will reduce the file size.
The exported data is saved as filename.mat, where “filename” represents the name
entered for the file on saving.
6.7. NetCDF
6.7.1 NetCDF export
Select the matrix and data ranges that make up the data to be exported, or use Define to
create a new range.
Metadata
In the field Global Attributes, enter all other relevant details:
Data set origin
Can be the name of the lab, client name, batch number, or location where data
came from.
Equipment ID
Can be the product name, product number, serial number, or IP address of the
instrument used.
Equipment manufacturer
Name of the instrument vendor.
Equipment type
Type of instrument used, e.g. NIR.
Operator name
Name of the person conducting the experiment or the analysis.
Experiment date time
Date and time of the data collection. It is suggested to enter the date according to
the ISO 8601 standard, e.g. 2010-01-27T09:55:41+0100.
All attributes are optional. It is generally recommended to add metadata to files for better
file search results.
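A timestamp in the suggested ISO 8601 form can be produced with Python's standard datetime module. This is a sketch for illustration; note that isoformat() writes the offset as +01:00, while the manual's example uses the equally valid +0100 form:

```python
from datetime import datetime, timezone, timedelta

# a fixed example instant in a +01:00 zone (chosen to match the manual's example)
tz = timezone(timedelta(hours=1))
stamp = datetime(2010, 1, 27, 9, 55, 41, tzinfo=tz).isoformat()
# 'stamp' now holds an ISO 8601 string suitable for the attribute value
```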
6.8. UnscFileWriter
6.8.1 Export models to The Unscrambler® v9.8
The Unscrambler® 9.8 file is the previous file format and models in this format contain all
the necessary information for prediction and classification. Models (PCA, MLR, PCR and PLS)
created in The Unscrambler® X can be exported to this previous file format using the File
writer plug-in. Such models are compatible with OLUP and OLUC 9.8 software for real-time
classification and prediction.
Model  File extension
PCA    .11M
MLR    .40M
PLS1   .41M
PLS2   .42M
PCR    .43M
Available models
A drop-down list contains all models found in the currently open project that can be
exported. Select the one to export.
Model Information
This contains details about the selected model.
Notes
The time the chosen model was created is given here, along with any other
information that has been added to the Notes section of the chosen model. Users
may also add additional information in the Notes section, which will be available in
the exported model.
Save model with components
Use the components box to select the correct number of components for saving the
model in 9.8 format. The set number of components for the model will be displayed
and used by default.
Save as micro model
The check box allows the user to save the model in the 9.8 micro format.
Press OK and use the file dialog to select the destination directory and give a file name to
save the model.
7. Plots
7.1. Line plot
A line plot displays a single series of numerical values with a label for each element. The plot
has two axes:
The horizontal axis shows the labels, in the same physical order as they are stored in
the source file;
The vertical axis shows the scale for the plotted numerical values.
With Symbols
Symbols produce the same visual impression as a 2-D scatter plot (see Scatter Plot),
and are therefore not recommended.
Line plot: symbol display
Several series of values which share the same labels can be displayed on the same line plot.
The series are then distinguished by means of colors.
Line plot: 2 series with curve display
The horizontal axis shows the labels, in the same physical order as they are stored in
the source file;
The vertical axis shows the scale for the plotted numerical values.
Several series of values which share the same labels can be displayed on the same bar plot.
The series are then distinguished by means of colors, and an additional layout is possible:
accumulated or stacked bars. Accumulated bars are relevant if the sum of the values for
series1, series2, etc. has a concrete meaning (e.g. total production or composition).
Two layouts of a bar plot for two series of values: Bars and Accumulated Bars
A regression line visualizing the relationship between the two series of values
Plot statistics, including among others the slope and offset of the regression line
(even if the line itself is not displayed) and the correlation coefficient.
All the plots can be customized. This is done from the properties dialog, which is accessed
by right-clicking on the plot and selecting the Properties menu.
Background
Header: title, color, font, visibility, color of the background
Legend: title, color, font, visibility, color of the background
Plot Area: Chart area, color, font, visibility, borders, surface
Properties Appearance
For the Header and Legend the text can be edited. One can customize the name,
such as only having part of the name displayed, the font and the color.
Properties Header
Graphic Objects
It is possible to include some graphical objects in the plot such as line, arrow,
rectangle, ellipse and text. Each of those objects can be configured in terms of color,
thickness and font if necessary.
3-D scatter plots can be enhanced by:
Addition of vertical lines
They “anchor” the points and can facilitate the interpretation of the plot.
A 3-D Scatter plot displayed with anchors
To add vertical lines, click on More (see section below on Additional Options).
Rotation
The plot can be rotated so as to show the relative positions of the points from a
more relevant angle; this can help detect clusters. Click on the plot and move it with
the cursor in the appropriate direction.
A 3-D Scatter plot after rotation
The axes can be interchanged in the plot, using the arrows on the toolbar. If more than three
columns are selected, the axes can be changed from the drop-down lists next to the axis
arrows on the toolbar.
Additional options
Click on More to access more options for 3D scatter plots.
Scroll through the Gallery, Data, and 3D-View options to customize the appearance of
3-D scatter plots. These features are described in the following sections.
3D Scatter plot gallery
Select from the gallery of plots to obtain the desired appearance of the plot.
3-D Scatter plot data
The rotation, perspective, and axis scales can be changed under the 3-D view tab.
The first two axes show the labels, in the same physical order as they are stored in the
source file;
The vertical axis shows the scale for the plotted numerical values.
Depending on the layout, the third axis may be replaced by a color code indicating a range of
values (contour plot), thus making the surface plot essentially a contour plot, or a map plot
when viewed straight from above. The layout can be changed by right-clicking on the plot
and selecting Plot type for a shortcut to predefined layouts, or Properties to customize
3-D plots and make changes to the axes, legends, etc.
The Plot type submenu
The points can either be represented individually, or summarized according to one of the
following layouts:
Surface
It shows the table as a 3-D landscape.
Matrix plot with a landscape display
Contour
The contour plot has only two axes. A few discrete levels are selected, and points
(actual or interpolated) with exactly those values are shown as a contour line. It
looks like a geographical map with altitude lines;
Matrix plot with a contour display
This option is accessible from Plot type – Contour, or the Properties of the plot:
Surface plot menu
Map
On a map, each point of the table is represented by a small colored square, the color
depending on the range of the individual value. The result is a completely colored
rectangle, where zones sharing close values are easy to detect. The plot looks a bit
like an infrared picture.
This option is accessible from Plot type – Map, or the Properties of the plot, the
option is Scatter chart, zoned, 2D projection.
Scatter plot menu
Bars
This option gives roughly the same visual impression as the landscape plot if there
are many points, otherwise the “surface” appears more rugged.
Matrix plot with a 3-D bar display
3-D-Scatter is also accessible via this Properties menu, see 3-D scatter plot for help on that
plot.
The histogram is one of the seven basic tools of quality control, which also include the
Pareto chart, check sheet, control chart, cause-and-effect diagram, flowchart, and scatter
diagram.
If the points are close to a straight line, the distribution is approximately normal
(Gaussian).
Normal probability plot showing a series following a Normal distribution
Once the variables are selected, click OK and the plot will appear in the viewer.
Multiple scatter plot
If more than four variables have been selected for the multiple scatter plot, others can be
displayed by choosing them from the drop-down list on the diagonal of the plots.
Variable drop-down list menu
This is what has been done in the special plot “Mean and SDev”.
Special plot: Mean and SDev
To remove drawing objects from plots, use either the Edit - Undo option (or toolbar
button), or select the drawing object with the mouse pointer and press the Delete key.
Sample Selection : Select whether the marked or unmarked samples (or both)
should be extracted from the model, and give the ranges informative names. By
default the marked and unmarked sample ranges will be named Outliers and Good
Samples, respectively.
Create Range : The new range will be created based on one or more data tables
available in the project navigator. All data tables with the correct number of rows
will be listed in this frame. Use the radio buttons to define whether a new data table
should be created or if the ranges should be added to existing tables. As an
additional quality control it is possible to list only data tables with matching sample
names. A yellow warning sign next to a table indicates that the sample names are
missing or non-matching.
Line plot
Bar plot
Scatter plot
3-D scatter plot
Matrix plot
Histograms
In addition, to cover a few special cases, two more kinds of representations are provided:
Table plot
Special plot
Interpreting plots
To get specific information on all the available plots for each analysis, see the specific Plot
sections under the respective methods.
Design of Experiments
Descriptive statistics
Statistical tests
Principal Component Analysis (PCA)
Multiple Linear Regression (MLR)
Principal Components Regression (PCR)
Partial Least Squares Regression (PLS)
L-shaped PLS Regression (L-PLS)
Multivariate Curve Resolution (MCR)
Cluster analysis
Projection
SIMCA
Prediction
It is also possible to enter the dialog from the icon in the Mark Toolbar.
Number of samples
Number of calibration samples to select with the K-S algorithm. The default is 15.
Number of components
The number of components to use for the selection. The default is the optimal
number as found in the model.
Pre-Select samples - Include already marked samples
When selected, any marked samples in the score plot will be included in the
calibration sample set in addition to those identified by the K-S sample selection.
Pre-Select samples - Manually pre-select samples
Opens the Select samples dialog window for selecting samples from the data matrix
to be included in the calibration sample set.
Augment set with boxcar samples
Works only for PCR and PLSR models. When checked, the initial calibration set from
K-S will be augmented with samples to produce a more uniform distribution of
response values. Additional options are available for setting the number of bins for
boxcar samples and the number of samples to select from the sample selection.
This option will be disabled if Select validation samples is checked.
Create row set as new matrix
When selected, the samples will be extracted into a new matrix, with KS-Calibration
and optionally KS-Validation row sets added.
Create row set in selected matrix(es)
When selected, Calibration and optionally Validation row sets will be added to the
selected, matching matrices.
Allow mis-matching sample names
When not checked, only matrices with identical sample names in the same order will
be listed. An exclamation mark is shown for the matrices where the sample names
do not match.
The figure below shows the score plot after specifying 15 samples for calibration and
validation. The calibration samples are marked with green rectangles and the validation
samples with orange triangles.
The score plot with marked calibration and validation samples
When the option to create the sample set in selected Matrices is chosen, the matrices will
be added in the project navigator as shown below:
If the option to Create row set as new matrix has been chosen, a matrix with the name of
the X matrix from the scores plot will be created with KS appended to the matrix name.
7.16. Marking
It is often useful to mark some samples or variables in a plot. Several marking tools are available:
One by one
This option enables one to use the cursor to select an item to mark by clicking on it.
Rectangular
This option allows several grouped samples to be selected at the same time. The
cursor is transformed into a pointer that will allow the user to define the top left
corner and the bottom right corner of the rectangle.
Samples marked with rectangle option
The different types of marking can be accessed from Edit - Mark or from toolbar shortcuts.
Lasso
This option activates a cursor that is used to define a free-form area; all samples
inside the area will be marked. To define the area, click and hold while tracing its
contour. When the mouse button is released, the selection is made.
Samples marked with lasso
7.16.2 How to create a new range of samples or variables from the marked items
Once some samples / variables are selected in a plot it is possible to create a new range
including them. To do so right click on the plot with the selected items and select the option
Create Range.
Menu create range
For all raw data plots and for model plots of variables (e.g. PCA loadings), the new range
appears under the corresponding data table node with the default name “RowRange” or
“ColumnRange”.
New range created
When a sample range is created from within a model scores plot, a dialog is opened to allow
sample extraction into a new or existing data table. See the extract samples documentation
for details.
Without Marked…
The marked samples and/or variables are not included in the analysis; the unmarked
ones are.
Background
Header: title, color, font, visibility, color of the background
Legend: title, color, font, visibility, color of the background
Point Label: color, font, visibility
Axis Label: title, color, font, visibility, borders
Properties Appearance
For the Point Label and Axis Label the text can be edited. One can customize the
name, such as only having part of the name displayed. For this option use the
drop-down list in Label layout - Show.
Properties: Point Label
Graphic Objects
It is possible to include some graphical objects in the plot such as line, arrow,
rectangle, ellipse and text. Each of those objects can be configured in terms of color,
thickness and font if necessary.
Properties Appearance
Chart properties
It is possible to further customize the chart properties by selecting More, which will
open up the Chart properties dialogue. Here one can define simple or complex chart
types from the options in the chart gallery. Further selection of chart properties can
be made, and the chart previewed.
Chart Properties
Background
Header: title, color, font, visibility, color of the background
Legend: title, color, font, visibility, color of the background
Plot Area: Chart area, color, font, visibility, borders, surface
Properties Appearance
For the Header and Legend the text can be edited. One can customize the name,
such as only having part of the name displayed, the font and the color.
Graphic Objects
It is possible to include some graphical objects in the plot such as line, arrow,
rectangle, ellipse and text. Each of those objects can be configured in terms of color,
thickness and font if necessary.
Properties Graphic Objects
Chart properties
It is possible to further customize the chart properties by selecting More, which will
open up the 3D Chart properties dialogue. Here one can define the chart types from
the options in the chart gallery.
Chart Properties
Additional options of a 3-D plot can be changed from the tab in the properties dialog. In the
Data tab, the layout of the data can be changed.
3-D Scatter plot data properties dialog
The rotation, perspective, and axis scales can be changed under the 3-D view tab.
3-D Scatter plot 3-D view properties dialog
Select where the plot should be stored in the field Save in.
Enter a name for the plot in the field File name and select a format.
Types of format
There are six possible graphics file formats available for compatibility with many needs:
EMF
Use the EMF format which is vector graphics whenever possible. Vector graphics can
be scaled and will give the best quality.
Compatibility: EMF support is often limited to Microsoft applications. When sending
the plot graphics file for instance by email, a recipient may encounter problems
viewing and reusing it.
PNG
The second choice is PNG, which is raster graphics, and does not look as good when
enlarged.
This format is most suitable for web publishing and email.
This will generally result in smaller files than the following formats.
Compatibility: 5-10 year old applications may not support this image format.
Select one of the above formats. The following formats are also raster graphics, each having
its limitations; they are included only for compatibility.
GIF
Limited to 256 colors.
JPEG
Lossy compression that will give artifacts. (JPEG is best suited for photographic
images.)
TIFF
Will produce larger files.
BMP
Will produce larger files.
Available image formats
Pasting plots
Depending on the application to be used there may be different options such as the shortcut
Ctrl+V or from an Edit menu.
A common dialog appears when selecting any of the plotting options from Plot:
Line
Bar
3D Scatter
Matrix
Histogram
Normal Probability
Multiple Scatter
Define the row and column ranges from predefined ranges using the drop-down list.
To use new ranges, click on the icon that looks like a matrix to access a matrix from the
project navigator, and on Define to access the Define Range dialog.
Plot scope dialog
To use data that are part of a results matrix, use the select result matrix button to
choose the desired results matrix.
Min/Max
Selects the samples most separated in the data set.
A number of extreme samples will be picked out for each PC, according to the
specification in the right column in the table below the method choice. It will be
labeled Number of min/max, and for each min/max selected, two extreme samples
are marked (max and min value). Thus, setting the number to 2 will mark a total of
four samples.
Classes
The samples will be divided into a number of classes for each PC. One pair of
extreme samples (max and min value) will be picked out for each PC, according to the
user's specification in the right column of the list below the Methods field. It will be
labeled Number of classes, and for each class, two extreme samples are marked.
Thus, setting the number to 2 will mark a total of four samples.
Then, in the list below the method choice, specify the number of PCs (listed in the left
column) for which to mark samples, and how many (listed in the right column). No samples
are marked for PCs with 0 in the right column, i.e., in the above figure, only PC 1 is marked.
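The Min/Max rule can be sketched as a small selection function operating on a plain list of score values for one PC. This is an illustration only, not the actual marking code; the function name is an assumption:

```python
def minmax_marks(scores, n_pairs):
    """Return the indices of the n_pairs highest and n_pairs lowest score
    values for one PC, i.e. 2*n_pairs marked samples in total."""
    # sort sample indices by their score value on this PC
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:n_pairs] + order[-n_pairs:])

# setting the number to 2 marks a total of four samples, as described above
marked = minmax_marks([0.3, -1.2, 2.5, 0.1, -0.7, 1.9], 2)
```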
Zoom-out
To zoom out of a displayed plot, zooming out from the center of the area, there are
two options:
Frame-scale
To zoom in on a specific area, it is more convenient to define the area to zoom into
with a rectangle. To access this functionality, use the Frame-scale button.
A cross will appear, which is to be used to define the area to zoom into. A dotted
rectangle will appear around the defined frame and when releasing, the zoom will
be performed.
Defining the frame to zoom-in
Move
It is possible to move inside the plot itself. To do so use the keyboard: Ctrl+Shift.
Auto-scale
To come back to the original view of the plot defined by The Unscrambler® use the
Auto-scale button
Using the mouse wheel will zoom the points and bars within the cube
Using Ctrl+Left mouse drag up and down will zoom the cube itself
From the viewer, one can resize the four-pane view by dragging the center + handle.
8. Design of Experiments
8.1. Experimental design
Experimental design is a strategy to gather empirical knowledge, i.e. knowledge based on
the analysis of experimental data and not on theoretical models. It can be applied when
investigating a phenomenon in order to gain understanding or improve performance.
Building a design means carefully choosing a small number of experiments that are to be
performed under controlled conditions.
Learn about the concepts and methods of experimental design in the Introduction to Design
of Experiments section.
Learn how to use the Design of Experiments tools offered by The Unscrambler®:
DoE basics
Why use experimental design?
What is experimental design?
Investigation stages and design objectives
Screening
Factor Influence Study
Optimization
Available designs in The Unscrambler®
Types of variables in experimental design
Design vs. non-design variables
Continuous vs. category variables
Mixture variables
Process variables
Designs for unconstrained screening situations
Full-factorial designs
Fractional-factorial designs
Plackett-Burman designs
Designs for unconstrained optimization situations
Central composite designs
Box-Behnken designs
Designs for constrained situations
Mixture designs
Axial designs: Screening of mixture components
Simplex-centroid designs: Optimization of mixtures
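Of the standard designs listed above, the full-factorial design is the simplest to enumerate: its runs are the Cartesian product of the factor levels. The sketch below is a minimal illustration; actual design generation in The Unscrambler® also handles randomization, center samples, and the other design families:

```python
from itertools import product

def full_factorial(levels):
    """Enumerate all runs of a full-factorial design.

    'levels' maps each design variable to its list of levels; the number of
    runs is the product of the numbers of levels per variable.
    """
    names = list(levels)
    return [dict(zip(names, combo)) for combo in product(*levels.values())]

# two levels of Temperature times three levels of pH gives 2 * 3 = 6 runs
# (variable names and levels here are made up for illustration)
runs = full_factorial({"Temperature": [120, 160], "pH": [3, 5, 7]})
```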
Obtain historical data (from a database, from plant records, etc.). However, such
data may be biased by changes occurring during the period between acquisition and
analysis. It is nevertheless a good start for identifying general trends and ideas.
Collect new data: record measurements directly from the production line, for
example, make observations in fish farms, process development lab, formulation
lab, etc. This will ensure that the data apply to the system being studied today (not
another system, three years ago). However most processes tend to be kept under
tight control and variation is minimal. This may lead to problems finding enough
variability to develop a model.
Run specific experiments by disturbing (exciting) the system being studied. Thus the
data will encompass more variation than is to be naturally expected in a stable
system running as usual.
Design experiments in a structured, mathematical way. By choosing symmetrical
ranges of variation and applying this variation in a balanced way among the
variables being studied, one will end up with data where effects can be studied in a
simple and powerful way. With designed experiments there is a better possibility of
testing the significance of the effects and the relevance of the whole model.
Define the objective of the investigation: e.g. “better understand” or “sort out
important variables” or “find the optimum conditions”.
Define the variables that will be controlled during the experiment (design variables),
and their levels or ranges of variation.
Define the variables that will be measured to describe the outcome of the
experimental runs (response variables), and examine their precision.
Choose among the available standard designs the one that is compatible with the
objective, number of design variables and precision of measurements, and has a
reasonable cost.
Most of the standard experimental designs can be generated in The Unscrambler® once the
experimental objective, the number (and nature) of the design variables, the nature of the
responses and the economical number of experimental runs have been defined. Generating
such a design will provide the user with the list of all experiments to be performed in order
to gather the required information to meet the objectives.
variable on the responses with the help of a screening design. The variables which have
“large” effects can be considered as important. The isolated effects of single variables are
known as main effects and the purpose of screening designs is to isolate these only. There
are several ways to judge the importance of a main effect, for instance significance testing or
use of a normal probability plot of effects.
Some screening designs are capable of estimating interaction effects. These occur when the
effect of changing one variable depends on the level of other variables in the study. Some
variables may be important even though they do not seem to have an impact on the
response by themselves. The reason is that the presence of interaction effects may mask
otherwise significant main effects.
Models for screening designs
The user must choose the adequate form of the model that relates response variations to
variations in the design variables. This will depend on how precisely one wants to screen the
potentially influential variables and describe how they affect the responses. The
Unscrambler® contains two standard choices:
The simplest form is a linear model. Choosing a linear model will allow one to
investigate main effects only with possible check for curvature effect;
To study the possible interactions between several design variables, one will have to
include interaction effects in the model in addition to the linear effects.
When building a mixture or D-optimal design, one must choose a model form explicitly,
because the adequate type of design depends on this choice. For other types of designs, the
model choice is implicit in the design that has been selected.
Factor Influence Study
After an initial screening design has been performed and a number of important variables
have been isolated, a factor influence study can be performed using full factorial or high-resolution fractional factorial designs. These are used to further study the main effects of the variables, and also to investigate interactions of various orders: two-factor interactions involve two design variables, three-factor interactions involve three variables, etc. The importance of an interaction can be assessed with the same tools as for main
effects.
Design variables that have an important main effect are important variables. Variables that
participate in an important interaction, even if their main effects are negligible, are also
important variables. The models generated in a factor influence study usually perform well
as predictive models and form the basis for optimization designs.
Optimization
At a later stage of investigation, when the variables that are important are already known,
one may wish to study the effects of these variables in more detail. Such a purpose will be
referred to as optimization. At the analysis stage this is also referred to as response surface
modeling.
Objectives of optimization
Optimization designs actually cover quite a wide range of objectives. They are particularly
useful in the following cases:
Maximizing a single response, i.e. to find out which combination of design variable
levels leads to the maximum value of a specific response, and what this maximum
response is.
Minimizing a single response, i.e. to find out which combination of design variable
levels leads to the minimum value of a specific response, and what this minimum is.
Finding a stable region, i.e. to find out which combination of design variable levels
corresponds to a specific target response, with the added criterion that small
deviations from those settings would cause negligible change in the response value.
Finding a compromise between several responses, i.e. to find out which combination
of design variable levels leads to the best compromise between several responses.
Describing response variations, i.e. to model response variations inside the
experimental region as precisely as possible in order to predict what will happen if
the settings of some design variables were changed in the future.
Models for optimization designs
The underlying idea of optimization designs is that the model should be able to describe a
response surface which has a minimum or a maximum inside the experimental range. To
achieve that purpose, linear and interaction effects are not sufficient. An optimization model
should also include quadratic effects, i.e. square effects, which describe the curvature of a
surface.
A model that includes linear, interaction and quadratic effects is called a quadratic model.
Design types, their objectives and fields of use

Fractional Factorial Design (Screening, Factor Influence; 3 - 13 design variables): Depending on the number of variables, choose to study lower order effects independently from each other, or create a screening design aimed at finding the most important main effects among many.

Box-Behnken Design (Optimization; 3 - 6 design variables): An alternative to central composite designs, when the optimum response is not located at the extremes of the experimental region and when previous results from a factorial design are not available. All design variables must be continuous.

Axial (Mixture) Design (Screening; 3 - 20 design variables): Contains mixture variables only; the design region is a simplex. Only linear (first order) effects can be found.

Simplex-Lattice (Mixture) Design (Screening, Factor Influence, Optimization; 3 - 6 design variables, 9 if linear only): Contains mixture variables only; the design region is a simplex. The lattice degree (order) is tuneable.

Simplex-Centroid (Mixture) Design (Optimization; 3 - 6 design variables): Contains mixture variables only; the design region is a simplex.
A D-Optimal design will be used with mixture variables if the experimental region is not a
simplex, or if there is a combination of mixture and process variables in the design. The
design region is often non-simplex when upper limit constraints are added to some of the
mixture components.
Design variables
A design variable is characterized by:
Its name;
Its type: continuous or category;
Its constraints: mixture, linear;
Its levels.
Response variables
These are the first type of non-design variable: the measured output variables that describe the outcome (usually a quality attribute) of the experiments. These variables are often the subject of an optimization.
Non-controllable variables
This second type of non-design variable refers to variables that can be monitored and may have an influence on the response variables, but that cannot be controlled or reliably fixed to a value; for example, the air humidity or the temperature in a plant.
Continuous vs. category variables
All variables have a pre-defined format or data type, and this format defines how the
variables are treated numerically and how they should be interpreted.
Continuous variables
All variables that have numerical values and that can be measured quantitatively are called
continuous variables. Note that this definition also covers discrete quantitative variables,
such as counts. It reflects the implicit use which is made of these variables, namely the
modeling of their variations using continuous functions.
Examples of continuous variables are: temperature, concentrations of ingredients (e.g. in %),
pH, length (e.g. in mm), age (e.g. in years), number of failures in one year, etc.
The variations of continuous design variables are usually set within a predefined range,
which goes from a lower level to an upper level. Those two levels have to be specified when
defining a continuous design variable. More levels between the extremes may be specified if
the values are to be studied more specifically.
If only two levels are specified, the other necessary levels will be computed automatically.
This applies to center samples (which use a mid-level, halfway between lower and upper),
and axial (star) samples in optimization designs (which use extreme levels outside the
predefined range).
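As an illustration only (this is not Unscrambler® code), the automatic computation of center and axial levels from a two-level specification can be sketched in Python; the axial distance used here is an assumption (the rotatable value for a two-factor central composite design):

```python
# Illustrative sketch, not part of The Unscrambler(R): derive the implicit
# levels of a continuous design variable from its lower and upper levels.
def derived_levels(low, high, alpha=2 ** (2 / 4)):
    """Return the center level and the two axial (star) levels.

    alpha is the axial distance in coded units; the default is the
    rotatable value for a two-factor central composite design."""
    center = (low + high) / 2                 # mid-level for center samples
    half_range = (high - low) / 2
    axial_low = center - alpha * half_range   # star point below the range
    axial_high = center + alpha * half_range  # star point above the range
    return center, axial_low, axial_high

# e.g. a temperature variable defined between 20 and 60 degrees C
center, ax_lo, ax_hi = derived_levels(20.0, 60.0)
```

The axial levels fall outside the predefined range, as described above.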
Category variables
In The Unscrambler®, all non-continuous variables are called category variables. Their levels
can be named, but not measured quantitatively. Examples of category variables are: color
(Blue, Red, Green), type of catalyst (A, B, C, D), place of origin (Africa, The Caribbean Islands,
…), etc.
Binary variables are a special type of category variables that have only two levels
(sometimes referred to as dichotomous). Examples of binary variables are: use of a catalyst
(Yes/No), recipe (New/Old), type of electric power (AC/DC), type of sweetener (Artificial/
Natural), etc.
For each category variable, the user must specify all levels. The number of levels can vary from 2 to 20.
Note: Since there is a kind of quantum jump from one level to another (there is no
intermediate level in between), center samples cannot be defined for category
variables. If there is a mix of category and continuous variables in the design, center
samples are defined for all continuous variables at each level of the category
variables.
Mixture variables
When performing experiments where some ingredients are mixed according to a recipe, one
may be in a situation where the amounts of the various ingredients cannot be varied
independently from each other. In such a case, one will need to use a special kind of design
called a Mixture design, and the design variables are called mixture variables (or mixture
components).
An example of a mixture situation is blending concrete from the following three ingredients:
cement, sand and water. If the percentage of water in the blend is increased by 10%, the
proportions of one of the other ingredients (or both) will have to be reduced so that the
blend still amounts to 100%.
However, there are many situations where ingredients are blended, which do not require a
mixture design. For instance in a water solution of four ingredients whose proportions do
not exceed a few percent, one may vary the four ingredients independently from each other
and just add water at the end as a “filler”. Therefore it is important to carefully consider the
experimental situation before deciding whether the recipe being followed requires a mixture
design or not!
Process variables
In a mixture situation, one may also want to investigate the effects of variations in some
other design variables which are not themselves a component of the mixture. Such variables
are called process variables in The Unscrambler®, and these are analyzed using a D-optimal
design.
Typical process variables are: temperature, stirring rate, type of solvent, amount of catalyst,
etc.
Fractional-factorial designs
In the specific case where there are only two-level variables (continuous with lower and
upper levels, and/or binary variables), one can define fractions of full factorial designs that
enable the investigation of as many design variables as the chosen full-factorial designs with
fewer experiments. These “economic” designs are called fractional factorial designs.
Given that a full-factorial design suitable for the investigation has already been defined, a
fractional design might be set up by selecting half the experimental runs of the original
design. For instance, one might study the effects of three design variables with only 4 (2³⁻¹) instead of 8 (2³) experiments. Larger factorial designs admit fractional designs with a higher degree of fractionality, i.e. even more economical designs, such as investigating nine design variables with only 16 (2⁹⁻⁵) experiments instead of 512 (2⁹). Such a design is referred to as a fractional design with a degree of fractionality of 5: nine variables are investigated at the usual cost of four (thus saving the cost of five).
Full-factorial design 2³
Experiment A B C
1 – – –
2 + – –
3 – + –
4 + + –
5 – – +
6 + – +
7 – + +
8 + + +
In the table below additional columns are generated, which are computed from the products
of the original three columns A, B, C. These additional columns represent the interactions
between the design variables.
Full-factorial design 2³ with interaction columns
Experiment A B C AB AC BC ABC
1 – – – + + + –
2 + – – – – + +
3 – + – – + – +
4 + + – + – – –
5 – – + + – – +
6 + – + – + – –
7 – + + – – + –
8 + + + + + + +
The above design table is an example of an orthogonal table, i.e. the effect of each column
(main effect and interaction) can be estimated independently of each other.
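This orthogonality is easy to verify numerically. The following Python sketch (purely illustrative; The Unscrambler® builds such designs through its interface) constructs the 2³ table with its interaction columns and checks that all distinct columns are mutually orthogonal:

```python
import itertools
import numpy as np

# Sketch: build the 2^3 full factorial in coded units (-1/+1) and append
# the interaction columns AB, AC, BC, ABC as element-wise products.
runs = np.array(list(itertools.product([-1, 1], repeat=3)))  # columns A, B, C
A, B, C = runs.T
X = np.column_stack([A, B, C, A * B, A * C, B * C, A * B * C])

# Orthogonality: every pair of distinct columns has zero dot product,
# so X'X is diagonal (8 on the diagonal, one entry per run).
assert np.array_equal(X.T @ X, 8 * np.eye(7, dtype=int))
```

Because X'X is diagonal, each main effect and interaction can be estimated independently of the others.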
In the table below, the column representing the highest degree of interaction (the ABC
interaction) is assigned to the variable, D, as it is assumed that the ABC interaction is
negligible:
Fractional factorial design 2⁴⁻¹
Experiment A B C D
1 – – – –
2 + – – +
3 – + – +
4 + + – –
5 – – + +
6 + – + –
7 – + + –
8 + + + +
This new design allows the main effects of the four design variables to be studied
independently of each other; but what about their interactions? The table below shows all
of the two-factor interactions calculated after setting D = ABC.
Fractional factorial design 2⁴⁻¹ with interaction columns
Experiment A B C D AB = CD AC = BD BC = AD
1 – – – – + + +
2 + – – + – – +
3 – + – + – + –
4 + + – – + – –
5 – – + + + – –
6 + – + – – + –
7 – + + – – – +
8 + + + + + + +
This table shows that each of the last three columns is shared by two different interactions
(for instance, AB and CD share the same column).
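The construction of this fraction, and the resulting aliases, can be sketched as follows (illustrative Python, not Unscrambler® code):

```python
import itertools
import numpy as np

# Sketch: construct the 2^(4-1) fraction by assigning D = ABC in the
# 2^3 full factorial, then verify the resulting confounding.
base = np.array(list(itertools.product([-1, 1], repeat=3)))
A, B, C = base.T
D = A * B * C                         # generator: D = ABC

# Two-factor interactions come in confounded (aliased) pairs:
assert np.array_equal(A * B, C * D)   # AB = CD
assert np.array_equal(A * C, B * D)   # AC = BD
assert np.array_equal(B * C, A * D)   # BC = AD

# The four main-effect columns remain mutually orthogonal:
X = np.column_stack([A, B, C, D])
assert np.array_equal(X.T @ X, 8 * np.eye(4, dtype=int))
```

The assertions reproduce exactly the shared columns shown in the table above.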
Confounding
Unfortunately, as the above example shows, there is a price to be paid for saving on the
experimental costs! “He who invests less, will also harvest less”.
In the case of fractional factorial designs, this means that if one does not use the full-
factorial set of experiments, it is not possible to study the interactions as well as the main
effects of all design variables. This happens because of the way those fractions are built,
using some of the resources that would otherwise have been devoted to the study of
interactions, to study main effects of more variables instead.
This side effect of using fractional designs is called confounding. Confounding means that
some effects cannot be studied independently of each other.
For instance, in the above example, the two-factor interactions are all confounded with each
other. The practical consequences are the following:
All main effects can be studied independently of each other, and independently of
the interactions;
If the objective is to study the interactions themselves, using this specific design will
only enable one to detect whether either of the confounded interactions are
important. The experiments will not allow one to decide which are the important
ones. For instance, if AB (confounded with CD, “AB=CD”) turns out as significant, one
will not know whether AB or CD (or a combination of both) is responsible for the
observed effect.
The list of confounded effects is called the confounding pattern of the design.
Resolution of a fractional factorial design
How well a fractional-factorial design avoids confounding is expressed through its resolution.
The three most common cases are as follows:
Resolution III designs: Main effects are confounded with two-factor interactions.
Resolution IV designs: Main effects are free of confounding with two-factor
interactions, but two-factor interactions are confounded with each other.
Resolution V designs: Main effects and two-factor interactions are free of
confounding with each other, however some two-factor interactions are
confounded with three-factor interactions.
Definition: In a resolution R design, effects of order k are free of confounding with all effects
of order less than R-k.
In practice, before deciding on a particular factorial design, it is important to check its
resolution and its confounding pattern to make sure that it fits the experimental objectives!
Examples of factorial designs
A screening situation with three design variables is illustrated in the two examples below:
Options for screening design with three design variables
Full factorial (left) and fractional factorial (right) designs illustrated. The design points are marked in red. The points in the fractional factorial design are selected so as to cover the maximum volume of the design space.
Plackett-Burman designs
If the experimental objective is to study the main effects only, and there are many design
variables to investigate (e.g. > 10), Plackett-Burman (PB) designs may be the solution. They
are very economical, since they require only one to four more experiments than the number
of design variables.
Plackett–Burman designs (Plackett and Burman, 1946) are experimental designs developed
while the authors were working in the British Ministry of Supply. Their goal was to find economical designs capable of estimating the main effects of up to N-1 variables in only N experimental runs. The table below shows the 12-run Plackett-Burman design for up to 11 design variables:
Experiment A B C D E F G H I J K
1 + − + − − − + + + − +
2 + + − + − − − + + + −
3 − + + − + − − − + + +
4 + − + + − + − − − + +
5 + + − + + − + − − − +
6 + + + − + + − + − − −
7 − + + + − + + − + − −
8 − − + + + − + + − + −
9 − − − + + + − + + − +
10 + − − − + + + − + + −
11 − + − − − + + + − + +
12 − − − − − − − − − − −
For the case of two levels (L=2), Plackett and Burman used the construction of Paley (Paley,
1933) for generating orthogonal matrices whose elements are all either 1 or -1 (Hadamard
matrices). Paley’s method could be used to find such matrices of N rows for most N equal to
a multiple of 4. In particular, it worked for all such N up to 100 except N = 92. If N is a power
of 2, however, the resulting design is identical to a fractional factorial design. In The
Unscrambler® the maximum limit of N is 36, which can accommodate n = N-1 = 35 design
variables (main effects). If there are fewer than N-1 effects to estimate, a subset of the columns of the matrix is used.
The price to pay for estimating all these effects in a minimum number of runs is the very complex confounding pattern of Plackett-Burman designs. Main effects are often partially confounded with several interactions, and these designs should therefore be used with great care.
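For illustration (this is not how The Unscrambler® exposes it), the 12-run design tabulated above can be reproduced by cyclically shifting its first row and appending a final row of minus signs, and its orthogonality verified:

```python
import numpy as np

# Sketch: build the 12-run Plackett-Burman design by cyclically shifting
# a generator row (the first row of the table above) and appending a
# final row of all minus signs.
gen = np.array([+1, -1, +1, -1, -1, -1, +1, +1, +1, -1, +1])
rows = [np.roll(gen, k) for k in range(11)]   # 11 cyclic right-shifts
rows.append(-np.ones(11, dtype=int))          # run 12: all low levels
X = np.array(rows)

# The 11 design columns are mutually orthogonal (Hadamard property).
assert np.array_equal(X.T @ X, 12 * np.eye(11, dtype=int))
```

The orthogonality of the columns is what allows all 11 main effects to be estimated from only 12 runs; the complex partial confounding with interactions, however, is not visible from this check.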
Central composite designs (CCD)
A central composite design (CCD) contains three types of samples:
Factorial (cube) samples are experiments which combine the regular lower and
upper levels of the design variables; they are the “factorial” part of the design;
Center samples are replicates of the experiment for which all design variables are at
their mid-level;
Axial (star) samples are located such that they extend beyond the factorial levels of the design for one factor at a time, all other design variables being at their mid-level. These samples are specific to CCDs.
Properties of a CCD
The properties of the simplest CCD, with two design variables, are shown below.
Central composite design with two design variables
From the figure it can be seen that each design variable has five levels: 1) low axial, 2) low
factorial, 3) center, 4) high factorial, and 5) high axial. Low factorial and high factorial are the
lower and upper levels that are specified when defining the design variable.
The four factorial samples are located at the corners of a square (or a cube if there
are three variables, or a hypercube if there are more);
The center samples are located at the center of the square;
The four axial samples are located outside the square; by default, their distance to
the center is set to ensure rotatability (see below).
Because we do not know the position of the response surface optimum, we try to ensure
that the prediction error is the same for any point at the same distance from the center of
the design. This property is called rotatability, as the design axes can be rotated around the
origin without influencing the variance of the predicted response. This implies that the
information carried by any design point will have equal weight on the analysis, i.e. the design
points will have equal leverage. This property is important if one wants to achieve uniform
quality of prediction in all directions from the center. The distance that ensures rotatability is given by 2^(k/4), k being the number of factors.
A spherical design is one in which all factorial and axial points have the same distance from the origin. The 2- and 4-factor rotatable designs are also spherical designs (distance given by k^(1/2)).
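These two distances are easy to compare numerically; a small Python sketch (illustrative only):

```python
# Sketch: the axial (star) distance that makes a CCD rotatable, in coded
# units, is alpha = 2**(k/4) for k factors; the design is also spherical
# when alpha equals the distance of the factorial corners, k**0.5.
def axial_distance(k):
    return 2 ** (k / 4)

for k in (2, 3, 4):
    alpha = axial_distance(k)
    corner = k ** 0.5      # distance of a factorial point from the center
    print(k, round(alpha, 3), round(corner, 3))
# For k = 2 and k = 4 the two distances coincide (spherical designs);
# for k = 3 they differ slightly (1.682 vs 1.732).
```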
Types of CCD
Circumscribed central composite design (CCC)
This general type is the one described in the previous section, with factorial points
defined at the lower and upper levels and with axial points outside of these ranges.
Faced central composite design (CCF)
If for some reason one cannot use levels outside the factorial range, one can tune
the axial point distances down such that these points lie at the center of the cube
faces. This is called a faced central composite design (CCF). CCF designs are not
rotatable.
Inscribed central composite design (CCI)
Another way to keep all experiments within the pre-defined range is to use an axial
sample distance that ensures rotatability, but to shrink the entire design such that
the axial points fall on the pre-defined levels. This will result in a smaller investigated
range, but will guarantee a rotatable design. This is called an inscribed central
composite design (CCI).
Efficiency of the CCD
Depending on the constraints of the experiments and the accuracy to be achieved, select the appropriate CC design using the following table:
Central composite design: constraints and accuracy
Design | Number of levels | Uses points outside high and low levels | Accuracy of estimates
Box-Behnken designs
Box-Behnken designs are not built on a factorial basis, but they are nevertheless good
optimization designs for second order models.
In a Box-Behnken design, all design variables have three levels: low cube, center, and high
cube. Each experiment combines the extreme levels of two or three design variables with
the mid-levels of the others. In addition, the design includes a number of center samples.
The properties of Box-Behnken designs are the following:
The actual range of each design variable is low cube to high cube, which makes it
easy to handle;
All non-center samples are located on a sphere, achieving rotatability for the 4-factor design, and near-rotatability for the designs with 3, 5, or 6 factors.
The figure below shows the Box-Behnken design drawn in two different ways. In the left
drawing one can see how it is built, while the drawing to the right shows how the design is
rotatable.
Box-Behnken design
Designs for constrained situations
Two types of constrained situations can be distinguished:
General constraints in which the allowed levels of a design variable depend on the
levels of one or more of the other design variables: linear constraints;
The special case of mixture situations, in which the levels of all design variables sum
to a fixed, total amount.
Each of these situations will then be described extensively in the following sections.
Note: Understanding the sections that follow requires basic knowledge about the
purposes and principles of experimental design. If the principles of experimental
design are unfamiliar, the user is strongly urged to read about it in the previous
sections (see What Is Experimental Design?) before proceeding with this section.
Mixture designs
A simple mixture design example
We will start describing the mixture situation by using an example.
A product development specialist has a specific problem to solve related to the optimization
of a pancake mix. The mix consists of the following ingredients: wheat flour, sugar and egg
powder. It will be sold in retail units of 100 g, to be mixed with milk for reconstitution of
pancake batter.
The product developer has learned about experimental design, and tries to set up an
adequate design to study the properties of the pancake batter as a function of the amounts
of flour, sugar and egg in the mix. She starts by plotting the region that encompasses all
possible combinations of those three ingredients, and soon discovers that it has a distinct
shape.
The pancake mix experimental region
The reason, as you may have guessed, is that the mixture always has to add up to a total of
100 g. This is a special case of multilinear constraint, which can be written with a single
equation:
Flour + Sugar + Egg powder = 100 g
The resulting region is a triangle: a regular figure known as a simplex.
This simplex contains all possible combinations of the three ingredients flour, sugar and egg.
One can see that it is completely symmetrical. One could substitute egg for flour, sugar for
egg and flour for sugar in the figure, and still get exactly the same shape.
Classical mixture designs, first introduced by Scheffé (1958), take advantage of this symmetry.
They include a varying number of experimental points, depending on the purposes of the
investigation. But whatever this purpose and whatever the total number of experiments,
these points are always symmetrically distributed, so that all mixture variables play equally
important roles.
These designs thus ensure that the effects of all investigated mixture variables will be
studied with the same precision. This property is equivalent to the properties of factorial,
central composite or Box-Behnken designs for non-constrained situations.
The figure below shows two examples of classical mixture designs.
Two classical designs for three mixture components
The first design is very simple. It contains three vertices (pure mixture components), three edge centers (binary mixtures) and only one ternary mixture, the centroid. The second
design contains more points, spanning the mixture region regularly in a triangular lattice
pattern. It contains all possible combinations (within the mixture constraint) of five levels of
each ingredient. It is similar to a five-level full factorial design - except that many
combinations, such as “25%, 25%, 25%” or “50%, 75%, 100%”, are excluded because they
are outside the simplex.
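The lattice just described can be enumerated directly. The following Python sketch (illustrative, not Unscrambler® code; the function name simplex_lattice is hypothetical) generates the blends of the degree-four lattice for three components:

```python
import itertools

# Sketch: enumerate the {3, 4} simplex-lattice: all blends of three
# components whose proportions are multiples of 1/4 and sum to 100%.
def simplex_lattice(q=3, m=4):
    levels = [100 * i / m for i in range(m + 1)]   # five levels: 0..100%
    return [p for p in itertools.product(levels, repeat=q)
            if sum(p) == 100]

points = simplex_lattice()
# Blends such as (25, 25, 25) are excluded automatically, because they
# sum to 75% rather than 100%.
assert (25.0, 25.0, 25.0) not in points
```

For three components and degree four, the lattice contains 15 blends.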
Simplex with different boundaries
This example, taken from John A. Cornell’s reference book “Experiments With Mixtures” (Cornell, 1990), illustrates how additional constraints are sometimes useful in practical situations.
A fruit punch is to be prepared by blending three types of fruit juice: watermelon, pineapple
and orange. The purpose of the manufacturer is to use their large supplies of watermelons
by introducing watermelon juice, of little value by itself, into a blend of fruit juices.
Therefore, the fruit punch should contain at least 30% of watermelon juice. Pineapple and
orange have been selected as the other components of the mixture.
The manufacturer decides to use design of experiments to find the combination of fruit
juices that scores highest in a consumer preference survey. The ranges of variation selected
for the experiment are as follows:
Ranges of variation for the fruit punch design
Ingredient Low High Centroid
In a factorial design, the effect of each design variable can be studied without regard to the remaining factors, because its low and high levels have been combined with the same levels of all the other design variables.
In a mixture situation, this is no longer possible, as demonstrated in the previous figure.
While 30% watermelon can be combined with e.g. (70% P, 0% O) or (0% P, 70% O), 100%
watermelon can only be combined with (0% P, 0% O).
To find a solution to this problem the concept of “otherwise comparable conditions” must
be adapted to the constrained mixture situation. To screen what happens when watermelon
varies from 30% to 100%, this variation must be compensated in such a way that the mixture
still adds up to 100%, without disturbing the balance of the other mixture components. This
is achieved by moving along an axis where the proportions of the other mixture components
remain constant. In practice such mixtures are easily achieved by starting with the low level
of the component in question while having equal proportions of the remaining
components. Subsequent addition of the first component to the mix would correspond to
moving up the axis. This is illustrated for the watermelon example in the figure below.
Studying variations in the proportion of watermelon
Mixture designs with points along the axes of the simplex are called axial designs. They are
best suited for screening purposes because they capture the main effect of each mixture
component in a simple and economical way.
An axial design in four components is represented in the next figure. It can be seen that
several points are located inside the simplex: they are mixtures of all four components. Only
the four corners, or vertices (containing the maximum concentration of an individual
component) are located on the surface of the experimental region.
A four-component axial design
Each axial point is placed halfway between the overall centroid of the simplex (25%, 25%,
25%, 25%) and a specific vertex. Thus the path leading from the centroid (“neutral”
situation) to a vertex (100% of a single component) is well described with the help of the
axial point.
In addition, end points can be included; they are located on the surface of the simplex,
opposite a vertex (they are marked by crosses on the figure). They contain the minimum
concentration of a specific component. When end points are included in an axial design, the
whole path leading from minimum to maximum concentration is studied. The above figure
Design for the optimization of the fruit punch composition is an example of a three-
component axial design where end points have been included.
In general terms, if N mixture components vary from 0 to 100%, the blends forming the
simplex-centroid design are as follows:
The vertices are pure components;
The second order centroids (edge centers) are binary mixtures with equal
proportions of two selected components;
The third order centroids (face centers) are ternary mixtures with equal proportions
of three selected components;
The Nth order centroids have equal proportions of selected N components, any
remaining components being zero.
Note: The overall centroid is a mixture where all N components have equal
proportions.
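As an illustration, the simplex-centroid blends listed above can be generated by enumerating all non-empty subsets of components, mixed in equal proportions (Python sketch, not Unscrambler® code):

```python
import itertools

# Sketch: the blends of a simplex-centroid design for n components are
# all non-empty subsets of components, mixed in equal proportions.
def simplex_centroid(n=3):
    points = []
    for k in range(1, n + 1):
        for subset in itertools.combinations(range(n), k):
            blend = tuple(100 / k if i in subset else 0 for i in range(n))
            points.append(blend)
    return points

points = simplex_centroid(3)
# 2^3 - 1 = 7 blends: 3 vertices, 3 edge centers, 1 overall centroid
assert len(points) == 7
```

For n components this always yields 2^n - 1 blends, matching the vertices, centroids of increasing order, and overall centroid described above.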
In addition, interior points can be included in the design. They improve the precision of the
results by “anchoring” the design with additional complete mixtures (i.e. mixtures where all
components are present), and they enable computation of cubic terms. The interior points
are located halfway between the overall centroid and each vertex, and they have the same
composition as the axial points in an axial design. When a design includes interior points, it is
said to be augmented. Note that for 3 mixture components, a centroid design augmented
with axial points equals an axial design with end points included (see e.g. fruit punch
example above).
Depending on its degree, a simplex-lattice design can serve several purposes:
Feasibility study (degree one or two): are the blends feasible at all?
Optimization: with a lattice of degree three or more, there are enough points to fit a
precise response surface model.
Search for a special behavior or property which only occurs in an unknown, limited
subregion of the simplex.
Calibration: prepare a set of blends on which several types of properties will be
measured, in order to fit a regression model to these properties. For instance, one
may wish to relate the texture of a product, as assessed by a sensory panel, to the
parameters measured by a texture analyzer. If it is known that texture is likely to
vary as a function of the composition of the blend, a simplex-lattice design is
probably the best way to generate a representative, balanced calibration data set.
D-optimal designs
A simple design subject to linear constraints
Consider a cooked meat example in which a process engineer wants to study the effects of marinating time (6 to 18 minutes), steaming time (5 to 15 minutes) and frying time (5 to 15 minutes). A full factorial design in these three variables gives the following eight runs:
The cooked meat full factorial design
Sample Mar. Time Steam. Time Fry. Time
1 6 5 5
2 18 5 5
3 6 15 5
4 18 15 5
5 6 5 15
6 18 5 15
7 6 15 15
8 18 15 15
After carefully analyzing this table, the process engineer expresses strong doubts that
experimental design can be of any help in this situation.
“Why?” asks the statistician in charge. “Well,” replies the engineer, “if the
meat is steamed then fried for 5 minutes each it will not be cooked, and at
15 minutes each it will be overcooked and burned on the surface. In either
case, we won’t get any valid sensory ratings, because the products will be far
beyond the ranges of acceptability.”
After some discussion, the process engineer and the statistician agree that an additional
condition should be included:
“In order for the meat to be suitably cooked, the sum of the two cooking
times should remain between 16 and 24 minutes for all experiments”.
This type of restriction is called a multilinear constraint. In the current case, it can be written in a mathematical form requiring two equations, as follows:
Steam. Time + Fry. Time ≥ 16
Steam. Time + Fry. Time ≤ 24
The constrained experimental region is no longer a cube! It follows that a full factorial design explores that region poorly.
The design that best spans the new region is given in the table below:
The cooked meat constrained design
Sample Mar. Time Steam. Time Fry. Time
1 6 5 11
2 6 5 15
3 6 9 15
4 6 11 5
5 6 15 5
6 6 15 9
7 18 5 11
8 18 5 15
9 18 9 15
10 18 11 5
11 18 15 5
12 18 15 9
This design contains all “corners” of the experimental region, in the same way as the full
factorial design does when the experimental region has the shape of a cube.
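The constraint can be checked directly against both designs (illustrative Python; the run values are taken from the tables in this section):

```python
import itertools

# Sketch: every run of the constrained cooked-meat design (table above)
# satisfies 16 <= steaming time + frying time <= 24, while the original
# full factorial violates the constraint in half of its runs.
constrained = [(6, 5, 11), (6, 5, 15), (6, 9, 15), (6, 11, 5),
               (6, 15, 5), (6, 15, 9), (18, 5, 11), (18, 5, 15),
               (18, 9, 15), (18, 11, 5), (18, 15, 5), (18, 15, 9)]
assert all(16 <= steam + fry <= 24 for _, steam, fry in constrained)

full_factorial = list(itertools.product([6, 18], [5, 15], [5, 15]))
violations = [r for r in full_factorial if not 16 <= r[1] + r[2] <= 24]
# The (5, 5) and (15, 15) cooking-time combinations fail at both
# marinating times, i.e. 4 of the 8 factorial runs are infeasible.
assert len(violations) == 4
```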
Depending on the number and complexity of multilinear constraints, the shape of the
experimental region can be more or less complex. In the worst cases, it may be almost
impossible to imagine! Therefore, building a design to screen or optimize variables linked by
multilinear constraints requires special methods. The following section will introduce a
special class of designs beneficial for these situations. More complex examples will be given
in the section Advanced topics for constrained situations ways to build constrained designs.
Introduction to the D-optimal principle
Those familiar with factorial designs are most likely aware that one of their most important
characteristics is their ability to study all effects independently of each other. This property,
called orthogonality, is important for relating variations in responses to variations in the
design variables. Without orthogonality, the estimated effects may become unreliable.
As soon as multilinear constraints are introduced among the design variables, it is no longer
possible to build an orthogonal design. Considering that the effect of a variable is estimated
on the premise that all other influences are held constant, it may not come as a surprise that
associations between design variables make the interpretations more difficult. In the more
severe cases of dependencies between variables, the effects will become indistinguishable
or the numerical calculations will fail. As soon as the variations in one of the design variables
are linked to those of another design variable, orthogonality cannot be achieved.
The D-optimal principle ensures that, based on a set of candidate points, the selected design matrix has columns as close to orthogonal as possible. Mathematically, this is achieved by maximizing the determinant of the information matrix X'X, which is known as the D-optimality criterion (the apostrophe meaning 'transposed'). The volume of the joint confidence region of the resulting regression coefficients is thereby minimized, i.e. the precision of the model parameter estimates will be maximized. An example of a design matrix X could be the cooked meat constrained design table above, including some or all of the available design points (rows) as well as any center points or replicates. Also, any interaction or higher order terms would be included as additional columns in X.
Because the determinant of X'X tends to increase as more experimental runs are included in the design, the D-optimality criterion is not well suited for comparing designs of different sizes. The related D-efficiency is independent of the number of runs:

D-efficiency = 100 · det(X'X)^(1/p) / n

Here, n is the number of experimental runs and p is the number of model terms. The D-efficiency ranges from 0 to 100%, where a factorial design without center points has a D-efficiency of 100%. While a large design will tend to have a larger value of det(X'X) and yield a smaller confidence region for the parameters, the average point precision as estimated by the D-efficiency will be comparable for differently sized designs.
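The D-efficiency computation can be sketched in a few lines (a hedged illustration assuming the usual -1/+1 coding of factor levels; the function name is ours, not The Unscrambler®'s):

```python
import numpy as np

def d_efficiency(X):
    """D-efficiency = 100 * det(X'X)^(1/p) / n for a model matrix X
    whose factor columns are coded to the range [-1, +1]."""
    n, p = X.shape
    return 100.0 * np.linalg.det(X.T @ X) ** (1.0 / p) / n

# A 2^2 full factorial with an intercept column: orthogonal, hence 100% efficient
X = np.array([
    [1, -1, -1],
    [1, -1,  1],
    [1,  1, -1],
    [1,  1,  1],
], dtype=float)
print(d_efficiency(X))  # 100 (up to floating point)
```

Here X'X = 4I, so det(X'X) = 64, det^(1/3) = 4 and the efficiency is 100 · 4/4 = 100%, matching the statement that an unmodified factorial design is the benchmark.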
Candidate design points
A point exchange algorithm is used to find the D-optimal design points in The Unscrambler®.
These points may optionally be augmented with a number of space filling points to ensure
good coverage also inside the experimental region. Both these procedures require a set of
candidate points as input. These points are set up in such a manner that they span the
maximum allowed design region as well as the interior region. The candidate points are
All extreme vertices. These are the outer corners of the design region:
The extreme vertices of a square design region
All edge centers. These are defined as the midpoint between any two vertices constituting
an outer edge of the design region:
The edge centers of a square design region
All face centers. These are defined as the center point on any outer surface of the design
region as spanned by three or more edges:
The face centers of a square design region
The overall centroid. This is the center point of the design. For a design with two design
variables only the overall centroid overlaps with the single face center.
All axial check blends. These are defined as the midpoint on any axis spanned by the overall
centroid and the extreme vertices. These do not improve the coverage of the outer design
region but can be very useful space filling points for more robust models:
The axial check blends of a square design region
The selection is performed without replacement, so the number of points is bounded by the number of candidate points in each case. The number of
additional center points (overall centroids) as well as the number of replicates for the entire
design is specified separately. This enables a higher level of user control over the
replications, and it favours a better spread of points over the design region compared to
selection with replacement. On the other hand the D-efficiency of the resulting design may
be slightly lower than if replication had been allowed. For practical use we believe the
benefits of a good spread in design points far outweigh a small reduction in D-efficiency
(see next section).
Addition of space filling points
The list of D-optimal points returned from the FFEA is optionally used as a starting point for
a subsequent Kennard-Stone selection process (Kennard and Stone, 1969). During this
process, the design is augmented with a specified number of space filling points in order to
span the entire design region as evenly as possible. These points are taken from the
remaining candidate list, i.e. the selection is based on candidate points that have not already
been selected in the point exchange algorithm.
While D-optimal designs provide precise model terms and good predictions of training data,
they tend to focus on the outer regions of the design space. It has been shown that designs
with samples spread evenly across the entire design region tend to be more robust in many
cases (Naes and Isaksson, 1989). Inclusion of space filling points by Kennard-Stone enables
better modeling of the interior design region and may therefore give more accurate
response surfaces and stable predictions when applying the model to new data. Also, space filling points tend to make the design less dependent on which model terms are included.
This is beneficial because the exact model equation is usually not known in advance.
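A minimal sketch of the Kennard-Stone idea, maximin selection from the remaining candidates (our illustration of the principle, not the exact implementation in The Unscrambler®; `selected` stands for the indices already returned by the point exchange step):

```python
import numpy as np

def kennard_stone_augment(candidates, selected, k):
    """Add k space-filling points: each new point is the remaining candidate
    farthest from its nearest already-selected point (maximin criterion)."""
    chosen = list(selected)
    remaining = [i for i in range(len(candidates)) if i not in chosen]
    for _ in range(k):
        # Distance from each remaining candidate to its nearest chosen point
        dmin = [min(np.linalg.norm(candidates[i] - candidates[j]) for j in chosen)
                for i in remaining]
        best = remaining[int(np.argmax(dmin))]
        chosen.append(best)
        remaining.remove(best)
    return chosen

# 3x3 grid of candidate points; the four corners are already selected
grid = np.array([(x, y) for x in (-1, 0, 1) for y in (-1, 0, 1)], dtype=float)
corners = [0, 2, 6, 8]
print(kennard_stone_augment(grid, corners, 1))  # the centre point (index 4) is added
```

The centre of the region is selected first because it is the candidate farthest from all corners, which is exactly the "fill the interior evenly" behavior described above.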
The condition number (C.N.)
In order to minimize the negative consequences of a deviation from the ideal orthogonal
case, one needs a measure of the “lack of orthogonality” of a design. This measure is
provided by the condition number (C.N.) (Golub, 1996):

C.N. = largest eigenvalue / smallest eigenvalue of the matrix X'X
It indicates the degree of multicollinearity in the design matrix as follows:
C.N. = 1: no multicollinearity, i.e. orthogonal
C.N. < 100: multicollinearity not a serious problem
100 < C.N. < 1000: moderate to severe multicollinearity
C.N. > 1000: severe multicollinearity
It is also linked to the elongation or degree of “non-sphericity” of the region actually
explored by the design. The smaller the condition number, the more spherical the region,
and the closer a design is to being orthogonal.
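The condition number follows directly from the eigenvalues of X'X; a small illustration (the design matrices are made-up examples):

```python
import numpy as np

def condition_number(X):
    """C.N. = largest / smallest eigenvalue of X'X (1 for an orthogonal design)."""
    eig = np.linalg.eigvalsh(X.T @ X)
    return eig.max() / eig.min()

# Orthogonal 2^2 factorial in coded levels
orthogonal = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
print(condition_number(orthogonal))  # 1.0

# A constrained design: the two columns are no longer independent
constrained = np.array([[-1, -0.5], [-1, 1], [1, -1], [1, 0.5]], dtype=float)
print(condition_number(constrained))  # larger than 1
```

The second design has correlated columns, so its condition number exceeds 1, in line with the interpretation scale above.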
Another important property of an experimental design is its ability to explore the whole
region spanned by the design variables. It can be shown that once the shape of the
experimental region has been determined by the constraints, the design with the smallest
condition number is the one that encloses maximal volume. It follows that if all extreme
vertices are included in the design, it has the smallest attainable condition number. If that
solution is too expensive, however, one needs to select a smaller number of points. The
consequence is that the condition number will increase and the enclosed volume will
decrease.
How good is the calculated design?
The condition number of an orthogonal design such as a non-modified factorial design is
exactly 1. Such a design has optimal properties in terms of interpretation, mathematical
robustness and economical considerations. The condition number of a non-orthogonal
(constrained) design will always be larger than one, and the larger the deviation, the less
favorable is the design. In general, caution should be exercised when analyzing a non-orthogonal (constrained) design using classical DoE analysis (ANOVA/MLR). The Unscrambler® suggests analysis by Partial Least Squares Regression for D-optimal designs, as correlated effects are handled much better by this method and misinterpretations will be rare.
If the design has a condition number much larger than, say, 100, this is an indication that the experimental region is heavily constrained. In such a case, either of several correlated design factors may have an influence on the response, but it is impossible to find out which (ANOVA might suggest one of them arbitrarily; PLSR will correctly reveal that both are correlated with the response). This may occur when there is insufficient individual variation in the design levels
compared to the noise level of the experiment. To ensure sufficient orthogonal variation for
each effect, it is recommended that all of the design variables and constraints be critically re-
examined. One should search for ways to simplify the problem (see the section on Advanced Topics for Constrained Situations); otherwise there is the risk of starting an expensive series of experiments which will not give any useful information.
A full factorial design applied to this situation would result in a sub-optimal solution that left
one half of the experimental region unexplored (i.e. the triangle spanned by the remaining 3
points). So where should we place the 4th point in order to span the experimental region as
well as possible?
We could imagine two candidate points where the dashed line of the linear constraint
crosses the factorial design region in the above figure. Two alternative solutions for selecting
4 design points are illustrated below.
Designs with four points leaving out a portion of the experimental region
Design II in the figure seems to be a better option than design I, because the excluded region
is smaller. A design using points (1, 3, 4, 5) would be equivalent to (I), and a design using
points (1, 2, 4, 5) would be equivalent to (II). The worst solution of all would be a design with
points (2, 3, 4, 5): this would leave out the whole corner defined by points 1, 2 and 5.
It follows that if the whole experimental region was to be explored, more than four points
would be needed. The above example shows that a minimum of five points (1, 2, 3, 4, 5) are
necessary. These five crucial points are the extreme vertices of the constrained experimental
region. They have the following property: if a sheet of paper were wrapped around those
points, the shape of the experimental region would appear, revealed by the wrapping.
If there are more than two design variables or multiple constraints, it might not be straightforward to find the best set of design points. The D-optimal criterion is commonly used to
find the best design in these situations.
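For small problems the D-optimal subset can be found by brute force, maximizing det(X'X) over candidate subsets. A sketch for the five-vertex example (the coordinates are hypothetical, chosen to mimic a square region cut by one linear constraint x1 + x2 ≤ 1):

```python
import numpy as np
from itertools import combinations

# Hypothetical coordinates for the five extreme vertices of a square region
# cut by the linear constraint x1 + x2 <= 1
vertices = np.array([[-1, -1], [1, -1], [1, 0], [0, 1], [-1, 1]], dtype=float)

def det_info(points):
    # Model matrix for a main-effects model: intercept, x1, x2
    X = np.column_stack([np.ones(len(points)), points])
    return np.linalg.det(X.T @ X)

# Brute-force D-optimal choice of 4 out of the 5 extreme vertices
best = max(combinations(range(5), 4), key=lambda s: det_info(vertices[list(s)]))
print(best, round(det_info(vertices[list(best)]), 1))
```

The winning subsets are those that drop one of the two vertices created by the constraint cut (the two points lie close together), while dropping one of the original square corners excludes a large part of the region and gives a much smaller determinant, in line with the discussion above.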
Process/mixture designs
Sometimes the product properties of interest depend on a combination of a mixture recipe
with specific process settings. In such cases, it is useful to investigate mixture and process
variables together. The process variables and the mixture variables are then combined using
the pattern of subfactorial designs and a D-optimal design can be generated.
Factorial samples
Factorial samples can be found in factorial designs and their extensions. They are a
combination of high and low levels of the design variables in experimental plans based on
two levels of each variable. This forms a square for 2 variables or a (multidimensional) cube
for 3 (or more) variables. These samples are therefore sometimes referred to as cube
samples.
The same factorial design points are also found among other samples in central composite
designs. In Box-Behnken designs, all samples found on the factorial cube are also called
factorial samples (even though these design points are positioned on the edges rather than
the vertices of the cube).
All combinations of levels of the design variables in N-level full factorials are also called
factorial samples.
Center samples
Center samples are samples for which each design variable is set at its mid-level. When all
variables are continuous, the center points are located at the exact center of the
experimental region.
Center samples are not defined for categorical factors. When there is a combination of
continuous and category variables in the design, center points corresponding to the mid-
level of all continuous factors can be added for each unique combination of levels for up to 4
category variables.
For instance, if the number of two-level category variables in the design is (1, 2, 3, 4), this
results in (2, 4, 8, 16) single replicate center points, respectively. If two replicates of center
points are required, this doubles the total number of center points in the design. If we have
a three variable full factorial design with two two-level categorical variables, there are four
unique center points corresponding to the different level combinations of the categorical
factors. If 2 replicates of the center points are required, this results in 8 center points in
total.
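The bookkeeping in the example above can be sketched as follows (a hypothetical helper, not an Unscrambler® function):

```python
from itertools import product

def center_points(cont_mid_levels, cat_levels, replicates=1):
    """One center point (mid-level of all continuous factors) per unique
    combination of categorical levels, times the number of replicates."""
    combos = list(product(*cat_levels))
    return [{"cont": cont_mid_levels, "cat": c}
            for c in combos for _ in range(replicates)]

# Three continuous factors at mid-level, two 2-level categorical variables,
# two replicates of the center points -> 2 * 2 * 2 = 8 center points
pts = center_points((0.0, 0.0, 0.0), [("A", "B"), ("low", "high")], replicates=2)
print(len(pts))  # 8
```

This reproduces the count from the text: four unique categorical combinations, doubled by the two replicates.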
The higher the number of levels of the categorical variables and the more replication required, the more quickly the number of center points can grow. It is suggested that when either the number of categorical variables or their number of levels becomes larger than 2, design replication may be a better option.
Center samples in screening designs. In screening designs, center samples are used for
curvature checking: Since the underlying model in such a design assumes that all main
effects are linear, it is useful to have at least one design point with an intermediate level for
all factors. Thus, when all experiments have been performed, one can check whether the
intermediate value of the response fits with the global linear pattern, or whether there are
signs of deviation from the straight line fit.
In the case of high curvature, one will have to build a new design which accepts a quadratic
model. The Unscrambler® provides an option to calculate curvature in a design when all
variables are continuous and at least one center point is present.
If at least 2 center samples are present (preferably 3), the model will also be tested for lack
of fit (LOF). This is a test comparing the variation of the measured responses within center
samples with the overall variation between measured and fitted (i.e. predicted) response
values. A significant LOF indicates that the model might benefit from additional terms.
In screening designs, center samples are optional; however, it is recommended that at least
three are included if possible. See the section on replicates for more details.
Center samples in optimization designs. In optimization designs, center samples are
important also for fitting higher order models. It is therefore recommended that 5 or more
are included in the design. In particular for Box-Behnken designs, ample center samples are
needed to fit a precise response surface.
Axial samples
Axial samples can lie on the centers of cube faces or they can lie outside the cube, at a given
distance from the center of the cube. This distance can be tuned, but it is recommended to
use the default distance (for the given design) whenever possible.
Three cases can be considered:
The default axial to center point distance ensures that all design samples have
exactly the same leverage, i.e. the same influence on the model. Such a design is
said to be “rotatable”. If the number of design variables is two or four, this distance
also ensures that all factorial and design points lie with the same distance from the
center, giving a “spherical” design region. For other numbers of factors, rotatability
almost, but not quite, corresponds with a spherical design;
The axial to center point distance can be tuned down to 1. In that case, the star
samples will be located at the centers of the faces of the cube. This ensures that a
Central Composite design can be built even if levels lower than “low cube” or higher
than “high cube” are impossible. However, the design is no longer rotatable;
Any intermediate value for the star distance to center is also possible. The design
will not be rotatable.
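The default rotatable axial distance follows the standard formula for central composite designs with a full 2^k factorial core, alpha = (number of factorial runs)^(1/4); a quick check of the "two or four variables give a spherical design" remark:

```python
# Rotatable axial (star) distance for a CCD with k design variables,
# compared with the distance of a factorial corner from the centre
for k in (2, 3, 4):
    alpha = (2 ** k) ** 0.25   # rotatability criterion
    corner = k ** 0.5          # corner distance in coded units
    print(k, round(alpha, 3), round(corner, 3))
```

For k = 2 and k = 4 the axial distance equals the corner distance, so all points lie on a common sphere; for k = 3 the two distances differ slightly, matching the "almost, but not quite" statement above.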
Sample types in mixture designs
An overview of the various sample types used in mixture designs is provided below:
Axial design: vertex and axial samples, optionally end points and overall centroids;
Simplex-centroid design: vertex samples, centroids of various orders, optional
interior (axial) points;
Simplex-lattice designs: samples positioned in a regular grid (similar to multi-level
factorial samples), overall centroid.
Reference samples
When trying to improve an existing product or process, the current recipe or process
settings may be used as a reference.
When trying to copy an existing product, for which the recipe is not known, one
might still include that product as reference and measure the responses on that
sample as well as on the others, in order to know how close the experimental
samples have come to that product.
To check curvature in the case where some of the design variables are category
variables, one can include one reference sample with center levels of all continuous
variables for each level (or combination of levels) of the category variable(s).
Note: For reference samples, only response values can be taken automatically into account in the Analysis of Effects and Response Surface analyses. Values of the design variables may, however, be entered manually after converting to a non-designed data table; a PLS analysis can then be run on the resulting table.
Replicates
Replicates are experiments performed several times under reproduced conditions. They
should not be confused with repeated measurements, where the samples are only prepared
once but the measurements are performed several times on each.
Why include replicates?
Replicates are included in a design in order to estimate the experimental error associated with the system. This is doubly useful as it:
provides an estimate of the pure experimental error, which is needed for testing the significance of the effects;
increases the precision of the estimated effects.
8.2.10 Blocking
In some situations it may not be possible to run all experiments under the exact same
conditions, or there may be other reasons to split the full set of runs into blocks that are
performed independently from the others in some sense. A common scenario is that raw material comes from different batches because there is not enough material in a single batch to accommodate the full set of experiments. Often screening designs are extended into
factor influence studies, or factor influence studies are extended into optimization studies. If
this is performed in a planned manner, it will often be possible to re-use previous
measurements and supplement them with new ones. For instance, a low resolution
fractional factorial can be extended into a high resolution or full factorial design, which again
can be extended into a circumscribed or faced central composite design (see section
Extending a design below). Because these blocks of experiments are necessarily performed
in different points of time, there is a higher risk that non-controllable or unknown factors
differ between blocks. Whether such variation has an unwanted effect on the response
should always be investigated.
Any blocked experiment should be tested for unequal block means. For experiments where
measurements are divided into two distinct blocks, the response(s) can be tested using a
Student’s t-test for equality of means. A low p-value, or equivalently a large difference
between the plotted quantiles, indicates that there is a significant blocking effect. Any effect
confounded with blocks cannot be trusted if this is the case. Careful planning of the
experiment is required to avoid that effects of interest are confounded with, or non-
distinguishable from, blocks.
For any number of blocks the responses can be plotted in a quantiles plot, where the block
means and variances can be compared using the sample grouping option. If the distributions
of response values are similar across blocks, there is no evidence that block effects have had
an influence on the response.
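For two blocks, the test for equal block means can be sketched with a standard t-test (illustrative data; SciPy's Welch t-test, which does not assume equal variances, is one way to obtain the p-value):

```python
import numpy as np
from scipy import stats

# Hypothetical response values measured in two raw-material batches (blocks)
block1 = np.array([10.1, 10.2, 9.9, 10.0, 10.3])
block2 = np.array([12.0, 12.1, 11.9, 12.2, 11.8])

# Welch two-sample t-test for equality of the block means
t_stat, p_value = stats.ttest_ind(block1, block2, equal_var=False)
print(round(p_value, 6))  # a small p-value indicates a significant block effect
```

Here the two blocks clearly differ in level, so the p-value is very small; with such a result, any effect confounded with blocks could not be trusted.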
Incomplete blocking of full factorial designs
If the full experiment is replicated, one should strive to include the full set of unique design
points in each block. This will ensure that any blocking effect is confounded with replicates
only, and all effects will be free of confounding with blocks. When all the treatment
combinations are included in each block, the design is referred to as a complete block design
and block effects should be tested as described above.
If this is not possible some effects will always be confounded with blocks, and the estimated
effects in question will include the block contribution as well. This is referred to as an
incomplete block design, and the efficiency of such a design depends on which effects are
confounded with blocks. Of course one would not want to create a design where any of the
main effects were confounded with blocks, as these main effects would be indistinguishable
from the block effects. Preferably the blocks should be set up such that they are confounded
with high order interactions only.
The Unscrambler® supports blocking of most full factorial experiments into 2^p blocks, p being smaller than the number of design variables. A full factorial design with three 2-level factors may be divided into two or four blocks. A full factorial design with four to seven 2-level factors may be split into two, four or eight blocks. The blocking generators are selected to ensure that as
many low-order interactions as possible can be estimated without confounding with blocks.
For instance, in a six-variable design divided into two blocks, the blocking effect will be
confounded with the six-variable interaction only.
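The statement about the six-variable design can be checked numerically; a sketch in which block membership is given by the sign of the ABCDEF interaction column (coded -1/+1 levels):

```python
import numpy as np
from itertools import product

# Full 2^6 factorial in coded -1/+1 levels
runs = np.array(list(product([-1, 1], repeat=6)))

# Blocking generator: the six-variable interaction ABCDEF
block = np.prod(runs, axis=1)

print(np.sum(block == 1), np.sum(block == -1))  # 32 runs in each block
# No main effect is confounded with blocks: every main-effect contrast is
# orthogonal to the block column
print(runs.T @ block)  # a vector of zeros
```

Each main-effect column has zero inner product with the block column, so only the six-variable interaction is sacrificed to blocking, as stated above.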
In the ANOVA, all interactions confounded with blocks will be summarized in a separate sum of squares for blocks. These individual interaction effects will not be given or tested in
the ANOVA, as they are indistinguishable from the blocking effects.
Extending a design
When a series of designed experiments has been performed and analyzed, one of two situations usually applies:
The experiments have provided all the information needed, which means that the
project is completed.
The experiments have given valuable information which can be used to build a new
series of experiments that will lead closer to the experimental objective.
In the latter case, the new series of experiments can sometimes be designed as a
complement to, or an extension of, the previous design. This allows one to minimize the
number of new experimental runs, and the whole set of results from the two series of runs
can be analyzed together.
Why extend a design?
In principle, one should make use of the extension feature whenever possible, because it
enables progression to the next stage of an investigation using a minimum of additional
experimental runs.
Extending an existing design is also a convenient way of building a new, similar design that
can be analyzed together with the original one. For example, if a chemical reaction has been
investigated using a specific type of catalyst, one might want to investigate another type of
catalyst under the same conditions as the first reaction, in order to compare their
performances. This can be achieved by adding a new design variable, namely type of
catalyst, to the existing design.
Design extensions can also be used as a basis for an efficient sequential experimental
strategy. That strategy consists of breaking the initial problem into a series of smaller,
intermediate problems and investing in a small number of experiments to achieve each of
the intermediate objectives. Thus, if something goes wrong at one stage, the losses are cut;
and if all goes well, one may end up solving the initial problem at a lower cost than if a huge
design had been used initially.
When and how to extend a design
The following text briefly describes the most common extension cases:
Add levels: Used whenever one is interested in investigating more levels of already
included design variables, especially for category variables.
Add a design variable: Used whenever a parameter that has been kept constant is
suspected to have a potential influence on the responses, as well as when one
wishes to duplicate an existing design in order to apply it to new conditions that
differ by the values of one specific variable (continuous or category), and analyze the
results together. For instance, if a chemical reaction using a specific catalyst has
been investigated, and now another similar catalyst for the same reaction will be
studied to compare its performances to the other one’s, the first design can be
extended by adding a new variable; type of catalyst.
Delete a design variable: If the analysis of effects has established that one or a few of the variables in the original design are clearly insignificant, the power of the conclusions can be increased by deleting the variable(s) and reanalyzing the
design. Deleting a design variable can also be a first step before extending a
screening design into an optimization design. This option should be exercised with
caution if the effect of the removed variable is close to significance. Also be sure
that the variable to be removed does not participate in any significant interactions.
Add more replicates: If the first series of experiments shows that the experimental
error is unexpectedly high, replicating all experiments might make the results
clearer.
Add more center samples: In order to get a better estimation of the experimental
error, adding a few center samples is a good and inexpensive solution.
Add more reference samples whenever new references are of interest. More
replicates of existing reference samples may be used in order to get a better
estimation of the experimental error.
Extend to higher resolution: Use this option for fractional factorial designs where
some of the effects of interest are confounded with each other. This option can be
used whenever some of the confounded interactions are significant and one needs
to find out exactly which ones. This is only possible if there is a higher resolution
fractional factorial design available. Otherwise, one can extend to a full factorial
design instead.
Extend to full factorial: This applies to fractional factorial designs where some of the
effects of interest are confounded with each other and no higher resolution
fractional factorial designs are available.
Extend to central composite: This option completes a full factorial design by adding
star samples and (optionally) a few more center samples. Fractional factorial designs
can also be completed this way, by adding the necessary cube samples as well. This
should be used only when the number of design variables is small; an intermediate
step may be to delete a few variables first.
Each step of the strategy consists of a design involving a reasonably small number of
experiments. Thus, the mere size of each subproject is more easily manageable.
A smaller number of experiments also means that the underlying conditions can
more easily be kept constant for the whole design, which will make the effects of
the design variables appear more clearly.
If something goes wrong at a given step, the damage is restricted to that particular
step.
If all goes well, the global cost is usually smaller than with one huge design, and the
final objective is achieved all the same.
First, build a fractional factorial design 2^(6-2) (resolution IV), with two center samples, and perform the corresponding 18 experiments.
After analyzing the results, it turns out that only variables A, B, C and E have
significant main effects and/or interactions. But those interactions are confounded,
so the design needs to be extended in order to know which are really significant.
The first design is extended by deleting variables D and F and extending the
remaining part (which is now a 2^(4-1), resolution IV design) to a full factorial design
with one more center sample. Additional cost: nine experiments.
After analyzing the new design, the significant interactions which are not
confounded only involve A, B and C. The effect of E is clear and goes in the same
direction for all responses. But since the center samples show some curvature, one
must proceed to the optimization stage for the remaining variables.
Thus, variable E is kept constant at its most interesting level, and after deleting that
variable from the design, the remaining 2³ full factorial design is extended to a CCD
with six center samples. Additional cost: nine experiments.
Analysis of the final results yielded a desired optimum point. Final cost: 18+9+9=36
experiments, which is less than half of the initial estimate.
If the design variables have any effect at all, the experimental design structure
should be reflected in some way or other in the response data; graphical analysis
and PCA will visualize this structure and help one detect abnormal features.
The Unscrambler® includes automatic features that take advantage of the design
structure (grouping according to levels of design variables when computing
descriptive statistics or viewing a PCA scores plot). When the structure of the design
shows in the plots (e.g. as subgroups in a box-plot, or with different colors on a
scores plot), it is easy to spot any sample or variable with an illogical behavior.
Summary
If the variance is strongly associated with the magnitude of the response, a variance-stabilizing transform such as log(Y), Y^(1/2), or 1/Y might be considered (Tip: histograms can be used to examine the influence of different transforms on the response). If the precision of runs improves somewhat in the course of the experiment, a model based on randomized runs will most likely be robust to these changes.
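The effect of a log transform on level-dependent spread can be illustrated quickly (made-up numbers with a multiplicative error structure):

```python
import numpy as np

# Three groups of responses where the spread grows with the mean level
groups = [np.array([1.0, 1.2, 1.5]),
          np.array([10.0, 12.0, 15.0]),
          np.array([100.0, 120.0, 150.0])]

print([round(g.std(), 3) for g in groups])          # spread grows with level
print([round(np.log(g).std(), 3) for g in groups])  # roughly constant after log(Y)
```

Before the transform the standard deviation scales with the group mean; after log(Y) the three groups have essentially equal spread, which is the situation ordinary least-squares models assume.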
Note that if there are very few residual degrees of freedom left after estimating all the
effects in the model, artificial structure in the residuals can be expected simply due to lack of
information in the data. In the extreme case that the residual degrees of freedom is zero, all
the residuals will be zero as well. If a little more than the minimum number of experiments
can be afforded, this will benefit the interpretation of results.
Analysis of effects using classical methods
An analysis of the effects is usually performed for screening and factor influence designs:
Plackett-Burman, Fractional Factorial and Full Factorial designs. These designs allow estimation of main effects, and some of them also allow estimation of two- and three-variable interactions.
The classical DoE analysis method for studying effects is based on the ANOVA-table. Main
effects or interactions found to be important in the ANOVA table can be investigated further
in an effects visualization plot. This will reveal the direction and magnitude of the individual
effects. It is important to note that even if a main effect seems to be irrelevant, the factor
can still have a large impact on the model if it takes part in a significant interaction effect.
Other checks that can be applied after analyzing the ANOVA table include the detection of
curvature effects. These can be detected in the main effects plot. If a nonlinear trend
is detected when checking the position of the center sample, one may consider a possible
curvature effect and include the square term of the effect in the model.
Main effect plot with curvature
When a variable is categorical, it is necessary to check which effects are significant and also whether the levels are significantly different from each other. The multiple comparison test provides this type of information. It is based on a comparison of the averages of the response variable at the different levels. If the difference between two averages is greater than the critical limit, the two levels are significantly different; if not, they have a similar effect. If no level has an effect, all levels will have a statistically similar effect, and the averages of the response variable at the different levels will not differ significantly.
In The Unscrambler®, there are three specific outputs for the multiple comparison test:
A table of distances, that gives the two-by-two distance between the levels.
A group table, that indicates the different grouping between the levels.
A plot displaying the levels in their group.
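As an illustration only, the comparison rule behind these outputs can be sketched in a few lines of Python. The function name, the example averages and the fixed critical limit are hypothetical; in practice the critical limit is derived from the test statistic.

```python
from itertools import combinations

def multiple_comparison(level_means, critical_limit):
    """Pairwise comparison of the response averages at each level.
    Two levels are flagged as significantly different when the absolute
    difference of their averages exceeds the critical limit."""
    distances, different = {}, {}
    for a, b in combinations(sorted(level_means), 2):
        d = abs(level_means[a] - level_means[b])
        distances[(a, b)] = d
        different[(a, b)] = d > critical_limit
    return distances, different

# Hypothetical averages of a response at three category levels:
means = {"apple": 10.2, "pear": 10.5, "plum": 14.1}
dist, sig = multiple_comparison(means, critical_limit=1.0)
# "apple" and "pear" group together; "plum" forms its own group
```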
Main effects
Main effects + interactions (2-variable)
Main effects + interactions (2-variable) + quadratic terms
The list above corresponds to pre-defined model alternatives, and it is possible to remove
terms from any of these models in a hierarchical manner (except linear mixture terms, which
cannot be removed).
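The nesting of these alternatives can be made concrete with a small sketch (the term-naming convention below is illustrative, not The Unscrambler®'s):

```python
from itertools import combinations

def model_terms(factors, interactions=False, quadratics=False):
    """Enumerate the terms of the three pre-defined model alternatives:
    main effects, optionally 2-variable interactions, optionally squares."""
    terms = list(factors)                                  # main effects
    if interactions:
        terms += [a + "*" + b for a, b in combinations(factors, 2)]
    if quadratics:
        terms += [f + "^2" for f in factors]
    return terms

# Three design variables A, B, C:
screening = model_terms(["A", "B", "C"])
influence = model_terms(["A", "B", "C"], interactions=True)
optimization = model_terms(["A", "B", "C"], interactions=True, quadratics=True)
```

Removing terms hierarchically corresponds to dropping entries from the end of such a list while keeping the main effects.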
The response surface can be used to find optimal design settings. For CCD and BB designs,
one fitted response is plotted over the entire area spanned by two design variables, with any
remaining variables held constant at their minimum levels. Maxima, minima, saddle points or
stable regions can be detected by changing which variables to plot while varying the levels of
the remaining variables. For mixture designs, the plotted design region consists of three
mixture components forming a simplex/triangle.
More information on how to vary the condition can be found in the RS table section in the
plot interpretation page.
Response surface
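Plotting a response surface amounts to evaluating the fitted second-order model over a grid spanned by two design variables, with the remaining variables fixed. A minimal sketch with hypothetical coefficients:

```python
def quadratic_response(coef, x1, x2):
    """Fitted second-order model for two free design variables:
    y = b0 + b1*x1 + b2*x2 + b12*x1*x2 + b11*x1^2 + b22*x2^2.
    Variables held constant are assumed folded into the coefficients."""
    b0, b1, b2, b12, b11, b22 = coef
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2 + b11 * x1 ** 2 + b22 * x2 ** 2

# Hypothetical coefficients; grid over the region spanned by x1 and x2:
coef = (5.0, 1.0, -2.0, 0.5, -1.0, -0.5)
grid = [[quadratic_response(coef, x1 / 2.0, x2 / 2.0) for x1 in range(-2, 3)]
        for x2 in range(-2, 3)]
```

Scanning such grids while varying the fixed levels of the remaining variables is what reveals maxima, minima and saddle points.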
Limitations of ANOVA
Analyses based on MLR/ANOVA are very useful for orthogonal designs or mixture designs
where one or two (non-related) responses have been measured accurately following the
experimental conditions. ANOVA has some important shortcomings, however:
The underlying MLR is based on the assumption that all variables can be measured
independently of all other variables in the model. This is always the case for
orthogonal designs such as the factorial designs. For some designs, such as
optimization designs including quadratic terms, mixture designs, D-optimal designs
or for any design where some experimental measurements are missing, some of the
model terms (effects) will become more or less correlated. If two correlated terms
both have an influence on the response, one of these will often (arbitrarily) come
out as significant at the expense of the other. While the ANOVA will automatically
handle standard designs such as mixture designs of simplex shape, a bilinear method
such as PLSR can take into account any number of correlated variables.
If several responses are modeled, the MLR will fit a model to each response
independently. If all responses are orthogonal, one can then assess the ANOVA table
for each response without taking the remaining responses into account. The
problem is that real data are seldom or never orthogonal. For any two sufficiently
correlated responses, it is sub-optimal to try to assess the effects on one
independently from the other, and trying to find the main conclusions from several
ANOVA tables together is difficult in itself. A bilinear method such as PLSR can take
into account any number of correlated responses, and any relationships between
responses and descriptors will be easily detected.
The reliability of the p-value estimates in the ANOVA table depends highly on the residual
degrees of freedom (DF) left in the data after estimating all the parameters of the model. If
the error DF is low, the reliability of the estimated p-values is low as well. This also limits
the ability to check the assumptions of the model. When several correlated effects are
estimated, the MLR consumes more DF than the true
number of underlying, independent effects. In contrast, with the bilinear methods
such as PLSR, the user estimates the optimal model rank based on the predictive
ability of the model.
In the ANOVA table, the predictive ability of the model is given by the ‘PRESS’ and
‘R-square prediction’ values. These are based on leverage corrected residuals, which
in the case of MLR is identical to residuals obtained from a leave-one-out (LOO)
cross-validation. This reflects the ability of the model to predict each measurement
based on models fitted using all samples except the one in question. If some
samples are replicated, the LOO procedure will be overly optimistic. If there are for
instance 3 center samples in total, these will be predicted based on models where
the 2 remaining center samples have been accounted for. The prediction error will
therefore be smaller than if all center samples were kept out in the same step. In
general, all replicated measurements of any experimental point should be kept out
in a single cross-validation segment to ensure conservative error estimates.
Non-controllable variables, i.e. variables that are believed to have an effect on the
responses but that are difficult to control at the required level of precision, are
currently not included in the ANOVA. In general, an attempt to include many of
these variables in an MLR model will have a high expense in terms of residual DF,
and the above considerations about correlation between terms would also have to
be taken into account. In PLSR any number of non-controllable variables can be
included, and they can optionally be downweighted in order to discover their
influence on the data without actually allowing them to influence the model. If e.g.
the run order was mixed up in the experiment, a passive descriptor giving the run
order or time-points of the individual measurements will reveal if any effects are
aliased with a time effect.
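The replicate-wise segmentation recommended above can be sketched as follows (illustrative code, not The Unscrambler®'s implementation):

```python
from collections import defaultdict

def replicate_segments(design_points):
    """Group run indices into cross-validation segments so that all
    replicates of the same experimental point are kept out together,
    giving more conservative error estimates than leave-one-out."""
    groups = defaultdict(list)
    for run_index, point in enumerate(design_points):
        groups[tuple(point)].append(run_index)
    return list(groups.values())

# A 2^2 design with three replicated center samples (0, 0):
runs = [(-1, -1), (1, -1), (-1, 1), (1, 1), (0, 0), (0, 0), (0, 0)]
segments = replicate_segments(runs)   # the three center runs share a segment
```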
Analysis with PLS Regression
If some or all of the considerations above make analysis by ANOVA difficult, PLSR can always
be used as a powerful alternative. A refresher on the theory of PLSR is given in the chapter
on Partial Least Squares Regression.
Include all design variables, together with any interactions, quadratic or cubic effects of
interest, in the descriptor (X) matrix. Any additional non-controllable variables, background
information about the samples, and experimental details such as time of measurement, batch, or
change of instruments can be included here as well. Include all response variables. Weight
all variables with 1/SDev, or optionally downweight some of the descriptors.
Validate with cross-validation. The level of validation depends on the cross-validation
segments. If e.g. all experimental runs are replicated once, the replication error can be
assessed by leaving out a full set of experimental runs in two cross-validation segments.
Note that this will not tell you how well the model will predict new samples but rather it will
reflect the experimental error in the experiment. In order to estimate how well the model
predicts new measurements (when level combinations are allowed to vary within the design
region), keep out all replicates of each point once. This will be a more conservative and
correct estimate for the predictive power of the model.
Include the uncertainty test to get an estimate of the significance of the effects. The
following are important tools to interpret the model and make conclusions:
Weighted Beta coefficients with their uncertainty limit
The weighted B-coefficients are used to determine which effects are the most
important and their direction of influence. Effects with high positive or negative
regression coefficients have a larger influence on the response in question.
The uncertainty test shows which effects are significantly non-zero, averaged over
responses. Coefficients with high absolute values and little variation across
cross-validation segments point to significant effects.
Estimated p-values
The uncertainty test will estimate p-values for all effects and interactions included in
the PLSR model. These are based on the size and stability of the PLSR regression
coefficients in the cross-validation.
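The principle (a large coefficient that varies little across cross-validation segments indicates a significant effect) can be illustrated with a schematic jackknife-style statistic. This is a simplified sketch, not the exact formula of the uncertainty test:

```python
import math

def stability_t(full_coef, segment_coefs):
    """Schematic stability statistic for one regression coefficient:
    the full-model coefficient divided by a jackknife-style standard
    deviation over the cross-validation segment coefficients."""
    m = len(segment_coefs)
    var = sum((c - full_coef) ** 2 for c in segment_coefs) * (m - 1) / m
    return full_coef / math.sqrt(var) if var > 0 else float("inf")

# Hypothetical coefficients: a stable effect versus an unstable one
stable = stability_t(0.80, [0.78, 0.82, 0.79, 0.81])
unstable = stability_t(0.10, [0.60, -0.45, 0.35, -0.20])
```

The stable effect yields a large statistic (significant), while the unstable one stays near zero.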
Explained variance
This plot will reveal the optimal number of components in the model, its fit (blue
line) and predictive ability (red line). The optimal number of components
corresponds with the number of independent phenomena in the data that exceeds
the noise level of the measurements.
Correlation loadings
The loadings or loading weights will reveal the main dependencies between
descriptors and responses in two dimensions. Often these dimensions will capture
the majority of the co-variation between descriptors and responses.
The correlation between the factors and each original variable is captured by the
distance from the origin in the correlation loadings plot. Even downweighted
variables are easily mapped in these plots.
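A correlation loading is simply the correlation between a factor's score vector and an original variable, which is why the distance from the origin maps explained correlation. An illustrative sketch:

```python
def correlation_loading(score, variable):
    """Correlation between one factor's scores and an original variable;
    in the correlation loadings plot this is the coordinate along that
    factor, so fully explained variables lie on the unit circle."""
    n = len(score)
    ms, mv = sum(score) / n, sum(variable) / n
    cov = sum((s - ms) * (v - mv) for s, v in zip(score, variable))
    ss = sum((s - ms) ** 2 for s in score) ** 0.5
    sv = sum((v - mv) ** 2 for v in variable) ** 0.5
    return cov / (ss * sv)

# A variable perfectly aligned with factor 1 has correlation close to 1:
t1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
r = correlation_loading(t1, x)
```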
Outlier detection
The sample outlier or influence plots can reveal erroneous measurements or typos
that should be mended or removed.
Predicted vs. Reference
Use this plot to assess the model’s goodness of fit (blue points) and predictive ability
(red points) for each response variable, to look for deviating runs, and to review the
prediction statistics.
When data are missing or experimental conditions have not been reached
In a real life situation it is not always possible to reach the target for the experimental
conditions or an experiment may not go as planned. In such cases one cannot apply the
classical DOE analysis methods. In these situations one can use a PLS fitting method. The
validation procedure of the PLS by jack-knifing will provide approximate p-values for the
B-coefficients; see the section on Analysis with PLS Regression above.
More information on PLS regression can be found in the chapter on Partial Least Squares
Regression.
For a more extensive screening, variables that are known not to interact with other variables
can be left out. If those variables have a negligible linear effect, one can choose a constant
level for them (e.g. the least expensive). If those variables have a significant linear effect,
they should be fixed at the level most likely to give the desired effect on the response.
The previous rule also applies to optimization designs, if it is known that the variables in
question have no quadratic effect. If it is suspected that a variable can have a nonlinear
effect, it should be included in the optimization stage.
Note: If some of the ingredients do not vary in concentration, these are left out
from the mixture equation such that the ‘total amount’ refers to the sum of the
remaining mixture components. For instance if one wishes to prepare a fruit punch
by blending varying amounts of watermelon, pineapple and orange juice, with a
fixed 10% of sugar, the mixture components sum to 90% of the juice blend but to
100% of the ‘total amount’ (mixture sum). This ensures that the three mixture
components will span a 2-dimensional simplex that can be modeled by a regular
mixture design.
Whenever the mixture components are further constrained, like in the example shown
below, the mixture region is usually not a simplex.
With a multilinear constraint, the mixture region is not a simplex
In the absence of multilinear constraints, the shape of the mixture region depends on the
relationship between the lower and upper bounds of the mixture components. It is a simplex
if for each mixture component, the upper bound + the sum of lower bounds for the
remaining components equals 100% (the total amount).
The figure below illustrates one case where the mixture region is a simplex and one case
where it is not.
Changing the upper bound of watermelon affects the shape of the mixture region
In the leftmost figure, the upper bound of watermelon is 100% - (17% + 17%) = 66%, and the
mixture region is a simplex. If the upper bound of watermelon is lowered to 55%, as in the
figure to the right, this value is smaller than 100% - (17% + 17%) and the mixture region is
no longer a simplex.
Note: When the mixture components only have lower bounds, the mixture region is
always a simplex.
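The simplex rule above can be written as a small check. The bounds below follow the fruit-juice example (17% lower bounds on pineapple and orange); the function name and tolerance are illustrative:

```python
def is_simplex(lower, upper, total=100.0):
    """A consistent mixture region is a simplex when, for every
    component, its upper bound is not tighter than the total minus the
    sum of the lower bounds of the remaining components."""
    for i, up in enumerate(upper):
        reachable = total - (sum(lower) - lower[i])
        if up < reachable - 1e-9:    # a tighter upper bound cuts off a corner
            return False
    return True

# Watermelon, pineapple, orange (in %): an upper bound of 66 on watermelon
# keeps the simplex shape, while lowering it to 55 does not.
simplex_66 = is_simplex([0.0, 17.0, 17.0], [66.0, 83.0, 83.0])   # True
simplex_55 = is_simplex([0.0, 17.0, 17.0], [55.0, 83.0, 83.0])   # False
```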
So whenever one of the minor constituents of a mixture plays an important role in the
product properties, one can investigate its effects by treating it as a process variable.
an important study. This can be done at the interpretation stage, where the mixture that
gives the desired properties with the smallest amount of that constituent is chosen.
General buttons
Start
Define Variables
Choose the Design
Design Details
Plackett-Burman designs
Fractional factorial designs
Full factorial designs
Full factorial designs without blocking
Full factorial designs with incomplete blocking
D-optimal designs
D-optimal designs including mixture constraints
Central Composite and Box-Behnken designs
Mixture designs
Simplex mixture designs
Non-simplex mixture designs and process+mixture designs
Additional Experiments
Randomization
Summary
Design Table
When sufficient information has been entered into the tab, the Finish button is made active.
Pressing this button completes all tasks in the design wizard and creates the design in
The Unscrambler® navigator.
8.3.2 Start
The first tab in the sequence is divided in four sections:
Name
Goal
Description
History
Start tab
Name
By default the design will be named “MyDesign”. You may change this to the name you
would like the design to have in the project navigator later.
Goal
Select the most appropriate goal of the experiment. Based on this selection and the
number/type of design variables, the wizard will propose a suitable design.
Screening
In a screening experiment the goal is to isolate design variables that have a
significant main effect on the response variable(s).
When selecting this goal, the Design Experiment Wizard will favour either a Plackett-
Burman design or a low resolution Fractional Factorial design, provided the design
variables are not under any constraints. For mixtures an Axial design will be
suggested, and a low number of samples will be suggested if a D-optimal design is
selected.
Screening with interaction
In a screening with interaction experiment (often referred to as a factor influence
study) the goal is to assess both the main effects and the interactions of the design
variables on the response variable(s).
When selecting this goal, the Design Experiment Wizard will favour either a higher
resolution (IV or V) Fractional Factorial or a Full Factorial design, provided the
designed variables are not under any constraints. For mixtures a Simplex Lattice
design will be suggested, and the default terms and number of samples for a D-
optimal design will be adjusted accordingly.
Optimization
When choosing optimization as the goal, the design investigates main effects,
interactions and square terms on the response variable(s).
By choosing optimization as the goal, the Design Experiment Wizard will favour
either a Central Composite or Box-Behnken design, provided the designed variables
are not under any constraints. The suggested mixture design will be a Simplex
Centroid design, and the number of terms and samples for a D-optimal design will
be higher.
Description
Edit the blank section to store information on the design and specific details about the
experiments.
History
This part contains information on the history of the design such as the creator, the date of
creation and possible revisions. It is auto-generated by the Design Experiment Wizard.
Variable table
This table contains information on all the variables to be included in the experiment. The
variables are ordered as follows:
The variables can be re-ordered within their category by using Ctrl+arrow up or down.
To edit a variable, highlight the corresponding row, modify the information in the variable
editor, and click OK.
To delete a variable, highlight the corresponding row and click the Delete button.
Variable editor
Click the Add button to add a new variable.
Specify the characteristics of the new variable as follows:
ID
The identity of the variable will be auto-generated. Design variables will have upper
case IDs (A-Z, except reserved letter I), response variables will have integer IDs, and
non-controllable variables will have lower case IDs (a-z, except i). Design variables
no. 26 and onwards are denoted A1, B1, etc.
Name
Enter a descriptive name in the Name field. If nothing is added here, the ID will be
used as name.
Type
Select the variable type from the following list using the radio buttons:
Constraints
Select the appropriate constraint setting for the variable (by default no constraints):
Type of levels
The levels are either continuous or category:
Use Category if the variable can change between 2 or more distinct levels or
groups, but where one group/level cannot be ranked on a numerical scale in
relation to the others. For instance the level ‘apple’ cannot be ranked as
higher/lower/better/worse than level ‘pear’. Similarly it is not possible to
calculate an average level between category groups. Two or more levels can
be defined for category variables (max. 20). If category variables of more
than two levels are included, the only available design will be the Full
Factorial (without blocking).
Note: Never define a numeric variable as category in order to enable more levels in
the design. These are interpreted differently and the analysis will be wrong. For
optimization designs that require more than two levels to fit a response surface,
additional levels will be added later based on the defined high and low levels.
For continuous variables: set the bounds of the design space with the low
and high values in the Level range field. By default the levels are -1 and
1 (or 0 and 100 for mixture variables).
For category variables: the Levels section makes it possible to edit the
number and names of the levels. The default values are “Level 1” and “Level
2”.
Units
Specify any unit for the variable in question. For mixture variables the default unit is
’%’.
Mixture Sum
(Available for mixture variables only.) This is the sum of all mixture components in
the blend. The default value is 100 (%), but any positive value is allowed.
Number of variables
Constraints on the variables
Goal of the experiment.
The Unscrambler® suggests the most appropriate design following a set of rules. Use the
radio buttons to select a different design than the suggested one. Note that there are
limitations on which designs can be selected based on the number and type of design
variables; the goal of the experiment, however, can be overridden by the user. The suggested
design remains displayed in bold.
When a full factorial design is selected, a check-box is used to enable (incomplete) blocking.
Select blocking in cases where groups of experimental runs have to be performed under
different settings. For instance if one batch of raw material is insufficient for the full
experiment, different batches will have to be used for different runs. Blocking ensures that
any potential batch effect will not be confounded with other important effects such as main
effects.
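As an illustration of the principle, a two-level full factorial can be split into two blocks by the sign of a blocking generator; the generator ABC used below is a common choice for three factors (a sketch, not The Unscrambler®'s implementation):

```python
from itertools import product

def assign_blocks(k, generator):
    """Assign each run of a 2^k full factorial to one of two blocks by
    the sign of the blocking generator (e.g. the ABC interaction for
    k = 3). The generator and its aliases become confounded with the
    block effect, so no results are returned for them in the ANOVA."""
    blocks = {}
    for run in product([-1, 1], repeat=k):
        sign = 1
        for idx in generator:
            sign *= run[idx]
        blocks[run] = 1 if sign > 0 else 2
    return blocks

# 2^3 design blocked on the ABC interaction (factor indices 0, 1, 2):
blocks = assign_blocks(3, generator=(0, 1, 2))
```

Each block receives half of the runs, so a batch effect shifts one block as a whole instead of biasing the main effects.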
Information
The information box provides information on the selected design.
Goal
Number of variables
Constraints on the variables
defined goal: Screening selects an axial design, Screening with interaction selects a
Simplex-Lattice design and Optimization selects a Simplex-centroid design.
If additional constraints on the mixture components are imposed, the design region
might be non-simplex. Also, if process (i.e. non-mixture) variables are included
together with the mixture components, regular mixture designs cannot be used. The
appropriate choice for these setups is a D-optimal design.
In the situation where linear constraints are applied, for non-simplex mixture
designs, or for designs containing both process and mixture variables:
The appropriate choice is a D-optimal design. Such designs require at least two process
variables or at least three mixture variables.
Use the drop-down box to select among the available number of design points
Change the resolution with the radio buttons.
The confounding patterns for the selected design are displayed in a separate box. They can be
visualized using the variable IDs, in the form A + BC, or using the names of the variables. To
see the variable names, tick the Show names box.
After finishing a fractional factorial design, the resolution and confounding patterns will be
given in the Info box below the project navigator.
Full factorial designs
The Design Details tab looks different depending on whether blocking was selected in the
previous tab.
The blocking generators, as well as all their confounding interactions, will be treated
separately from the remaining effects in the subsequent ANOVA. This means that no results
will be returned for any effects confounded with blocks. The Patterns frame allows
identification of the effects confounded with blocks.
After finishing a full factorial design with incomplete blocking, the block confounding
patterns will be given in the Info box below the project navigator.
D-optimal designs
This design type corresponds to variables with constraints applied, such as:
Note:
To add a new constraint, use the button Click to add new constraint. A list of all design
variables that are defined to have either Linear or Mixture constraints will be available for
editing. Select a multiplier for each constrained variable, or set a variable to 0 if it is
not part of the current constraint.
The operator to be used in the multilinear constraint is selected from the drop-down list:
The ’<’ and ’>’ operators are convenience functions only. When setting up the candidate
points, ’<=’ and ’>=’ will be used instead, with the target value shifted down or up by 0.01
relative to the specified target. After specifying the target value, the new constraint will
be added to the Current constraints box.
Repeat the above procedure for adding additional constraints, or edit an existing constraint
by clicking on the relevant box in Current constraints.
If mixture variables are included in the design, a constraint that they sum to 100% (as given
by the Mixture sum), is added automatically. This constraint cannot be edited or removed.
To delete a constraint select it in the Current constraints table and click on the Delete
button.
Click OK when all of the desired constraints have been added. The constraints will then be
tested to verify that they are both active and consistent.
An inactive constraint is one that is superfluous because it does not constrain the design
region beyond what the variable levels already specify. If, for instance, the ranges of A and
B are both [0 10], a constraint that A+B>=0 will be inactive.
Inactive constraint warning
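Since the constraints are linear, whether a constraint actually restricts the design region can be decided by checking the corners of the box spanned by the variable ranges. An illustrative sketch of such a test, using the A+B>=0 example:

```python
from itertools import product

def constraint_is_active(coefs, bounds, target, op=">="):
    """A multilinear constraint sum(c_i * x_i) >= target (or <=) is
    inactive when it already holds over the whole box defined by the
    variable ranges; for a linear expression, checking the corners of
    the box is sufficient."""
    values = [sum(c * x for c, x in zip(coefs, corner))
              for corner in product(*bounds)]
    if op == ">=":
        return min(values) < target   # active only if some corner violates it
    return max(values) > target

# A and B both range over [0, 10]: A + B >= 0 never binds (inactive),
# while A + B >= 5 cuts the region (active).
inactive = constraint_is_active([1, 1], [(0, 10), (0, 10)], 0)
active = constraint_is_active([1, 1], [(0, 10), (0, 10)], 5)
```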
Second order mixture: These are all 2-variable interaction terms between the
mixture components;
Process interactions: These are all 2-variable interaction terms between the process
variables;
Process squares: These are all quadratic terms of the process variables;
Mixture and process interactions: These are all interactions of the first order mixture
terms with any first or second order process term.
Check the appropriate boxes to pre-select any of these groups of terms. For designs with
process (non-mixture) variables only, use the following guidelines:
For mixture designs, include second order mixture terms if the goal is Screening with
interaction or Optimization.
For process/mixture designs it may be useful to optimize either the process or mixture
variables, while sampling for the main effects only of the remaining group. It is also possible
to include the second order terms for both types of variables while not including interactions
between the two. By assuming that there are no interactions between the process and
mixture variables, the number of experiments can be greatly reduced.
For a more specific selection of model terms click the Modify button. This will bring up a
dialog listing all higher order terms available for selection. The selected effects are listed in
the left box and the non-selected effects are listed in the right box. All main effect terms
(and offset if non-mixture design) are included by default and will not be listed. Any second
order mixture, process interaction and process square terms will be available for selection.
Any mixture and process interaction terms will be available for selection only if this box is
checked in the Model terms frame.
Dialog for selection of interaction and square terms
The Add and Remove buttons can be used to move highlighted terms from
one box to the other. The Add All and Remove All buttons do the same for all available
terms. The Add Int button adds all second order mixture as well as process interaction terms
to the model, whereas Add Square moves all process square terms to the Selected Effects
box. Click OK to keep the changes or Cancel to discard them. If some but not all of the terms
of a given order are selected, the corresponding check-box will be shown in a filled state
(intermediate between the checked and empty states).
Edit the design settings
The total number of design points is divided between a number of D-optimal design points,
space filling points and additional center points. The default sum of D-optimal and space
filling points is given by the number of model terms and the Goal of the experiment. An
offset is included in the model terms only if no mixture components are specified.
If Goal=Screening, three points more than the number of model terms are suggested,
plus three additional center points.
If Goal=Screening with interaction, six points more than the number of model terms
are suggested, plus four additional center points.
If Goal=Optimization, nine points more than the number of model terms are
suggested, plus five additional center points.
The minimum number of design points is the same as the minimum number of D-optimal
points. These are limited by the number of model terms.
The maximum number of design points is the same as the maximum number of D-optimal
points, which is limited by the number of candidate points. As the candidate points are
generated only when the Generate button is pressed, a warning will be given if too many
design points are specified.
The minimum number of space filling and additional center points is zero. Note that the
candidate point list will contain one center point, which might be added even though the
number of additional center points is set to zero.
Change the default number of center points in the Additional Experiments tab. Note that the
center sample coordinates will be calculated (or re-calculated) only when the Generate
button is pressed.
An Advanced Design Settings dialog opens when clicking the More button. Three settings can be
tuned in this window:
Number of initial tries: There is no guarantee that a single run of the D-optimal
algorithm will return the globally optimal set of design points. To avoid getting stuck
in local optima the algorithm can be run multiple times using different starting
conditions. Only the result with highest D-optimality is returned. The default number
of initial tries is 5, and this value can be changed between 1 and 1000.
Random points in the initial sets: To speed up the algorithm the starting set is not
completely random. Rather, a smaller random set is used and points are added
sequentially to maximize the D-optimality of the starting design. The number of
random points in the initial sets can be tuned between the number of model
terms and the specified number of D-optimal points.
Max number of iterations: Here you can set an upper limit on the number of point
exchange operations that will be performed. The default limit is 100, the lower limit
is 10 and the upper limit is 1000 iterations. You may try to increase the number if
you experience convergence problems.
The Advanced Design Settings dialog
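The multi-start point-exchange idea can be illustrated on a toy problem with two model terms (an offset and one variable). This is a drastically simplified sketch, not the algorithm used by The Unscrambler®:

```python
import random

def d_value(rows):
    """D-optimality criterion det(X'X) for a design with two model terms."""
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(2)] for i in range(2)]
    return xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]

def d_optimal(candidates, n_points, tries=5, seed=0):
    """Toy multi-start search: each try starts from a random subset and
    exchanges single points against the candidate set for as long as
    det(X'X) improves; the best result over all tries is kept."""
    rng = random.Random(seed)
    best, best_d = None, float("-inf")
    for _ in range(tries):
        design = rng.sample(candidates, n_points)
        improved = True
        while improved:
            improved = False
            for i in range(n_points):
                for cand in candidates:
                    trial = design[:i] + [cand] + design[i + 1:]
                    if d_value(trial) > d_value(design):
                        design, improved = trial, True
        if d_value(design) > best_d:
            best, best_d = design, d_value(design)
    return best, best_d

# Model terms (1, x): the extreme levels -1 and 1 are the D-optimal choice.
cands = [(1.0, x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
design, d = d_optimal(cands, n_points=2)
```

Running several tries with different random starts, as the Number of initial tries setting does, guards against a single run stalling in a local optimum.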
region may have a non-simplex shape. D-optimal designs should be used for non-simplex
design regions as the standard mixture designs will not work.
Such a design is set up in a similar manner to a D-optimal design without mixture
components. The main difference is that a mixture constraint including all mixture
components is added automatically. These are required to sum to 100%.
Note: Currently classical ANOVA and response surface plots are not available for
non-simplex and process/mixture designs. In order to take advantage of these
features, you might consider if a regular mixture design could be an alternative.
Use the radio buttons to select the most appropriate design. For more information on these
designs please refer to the Theory section.
Design Details: Central Composite and Box-Behnken designs
The star point distance is the distance from the origin to the axial points in normalized units
(i.e. given that the upper and lower levels of the factorial points are 1 and -1, respectively).
The default star point distance for CCC designs ensures rotatable designs. For ICC designs
the inverted value is used, which by default gives rotatable designs for ICC designs as well.
The star point distance for FCC designs is always 1 (non-rotatable).
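For reference, the rotatable star point distance of a circumscribed design is the fourth root of the number of factorial points (a standard result for central composite designs). A small sketch; the `fraction` parameter for a fractional factorial core is our assumption:

```python
def star_distance(k, design="CCC", fraction=0):
    """Star point distance in normalized units (factorial levels at -1, 1).
    CCC: alpha = F**0.25 with F = 2**(k - fraction) factorial points, which
    makes the design rotatable; ICC: the inverted value 1/alpha; FCC: 1.
    The `fraction` argument (fractional factorial core) is illustrative."""
    if design == "FCC":
        return 1.0
    alpha = (2.0 ** (k - fraction)) ** 0.25
    return 1.0 / alpha if design == "ICC" else alpha

# Two design variables: alpha = 4**0.25, i.e. about 1.414
alpha2 = star_distance(2)
```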
The following table is given as a guide to find the most appropriate design:
Design | Uses points outside high and low levels | Number of levels | Accuracy of estimates
Mixture designs
Axial
In an axial design all points lie on axes that go from each vertex through the overall
centroid, ending up at the opposite surface or edge. At these end points the
component in question is zero and the remaining components have equal
concentrations.
The end points allow the study of blending processes where each component may
be reduced to zero concentration. These can optionally be left out from the
experiment by un-checking the Include end points box.
Simplex lattice
A simplex lattice design is the mixture equivalent of a full-factorial design where the
number of levels can be tuned. It can be used for both screening and optimization
purposes, according to the lattice degree of the design.
The Lattice degree equals the number of segments into which each edge is divided.
This corresponds to the maximal order that can be calculated for the subsequent
model. Edit the degree by changing the default value.
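Generating the design points of a simplex lattice amounts to enumerating all blends whose proportions are multiples of 1/degree and sum to 1. A sketch:

```python
def simplex_lattice(q, degree):
    """All blends of q components whose proportions are multiples of
    1/degree and sum to 1: the {q, degree} simplex lattice."""
    points = []

    def fill(prefix, remaining):
        if len(prefix) == q - 1:
            points.append(prefix + [remaining])
            return
        for i in range(remaining + 1):
            fill(prefix + [i], remaining - i)

    fill([], degree)
    return [[i / degree for i in p] for p in points]

# {3, 2} lattice: the vertices plus the edge midpoints, 6 design points
pts = simplex_lattice(3, 2)
```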
Simplex centroid
A Simplex centroid design consists of extreme vertices, center points of all “sub-
simplexes”, and the overall centroid. A “sub-simplex” is a simplex defined by a
subset of the design variables.
Simplex centroid designs are well suited for optimization purposes. If Augmented
design is checked, axial check blends are added to the design. These are the same as
the Axial points in an Axial design.
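A simplex centroid design can be enumerated directly: one equal-proportion blend for every non-empty subset of the components, 2^q - 1 points in total. A sketch:

```python
from itertools import combinations

def simplex_centroid(q):
    """Vertices, centroids of every sub-simplex and the overall
    centroid: one blend per non-empty subset of the q components,
    mixed in equal proportions."""
    points = []
    for size in range(1, q + 1):
        for subset in combinations(range(q), size):
            blend = [0.0] * q
            for idx in subset:
                blend[idx] = 1.0 / size
            points.append(blend)
    return points

# Three components: 3 vertices, 3 binary blends and 1 overall centroid
pts3 = simplex_centroid(3)
```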
Adjust mixture levels
There are certain limitations on which ranges are allowed for the components in a
mixture design:
1) The design levels must be consistent. This has to do with the mixture constraint
that all component concentrations must sum to the Mixture Sum (100%). If for
instance the lower level of one component is constrained to 20%, the upper level of
the remaining components cannot exceed 80% (see image below).
2) Any (consistent) design region has to be of simplex shape, i.e. it must form a
triangle for 3 components, a tetrahedron for 4 components, etc. Imposing upper
limit constraints on some of the mixture components will often lead to a non-
simplex design region.
A mixture design is automatically tested for condition 1) above, and if the design is
consistent it is tested for condition 2). If either test fails, a warning is given and an
Adjust mixture levels button is activated. Clicking this button opens the Adjust
mixture levels dialog with several options.
Adjust Mixture Levels
Make levels consistent: Active whenever the test for consistency fails. The
bounds will be adjusted for consistency with the mixture constraint.
Adjust with normalized levels: Active whenever any range differs from the
default [0, 100%]. All mixture bounds will be adjusted to their maximum
range as bounded by 0 and the Mixture Sum.
On pressing OK, the upper and lower levels of the components are updated with the
new values. On pressing Cancel, the dialog is closed without applying any changes.
Only when the mixture design is both consistent and of simplex shape will the Finish
button be activated in the Design Experiment Wizard.
Design variables
Replicated samples
Center samples
Reference samples
Design of Experiments
Design variables
The design variables table provides a running summary of the design variables’ levels and
constraints.
Replicated samples
The number of replicated samples indicates the number of times the base design
experiments are run. Replication is used to measure the experimental error. Usually this is
done on center samples; however, increasing the number of replicates in the design
improves its precision estimates by measuring replicates over the entire design space. It is
suggested to use at least two replicates of the design if the experimental results are likely
to vary significantly during the running of the experiment.
Note: Replicates (or replicated samples) are not the same as repeated
measurements. Replicates require a new experiment to be run using the same
settings for the design variables with a new experimental setup, while repeated
measurements are measures performed on the same samples numerous times in a
short time period.
Center samples
Center samples are used as a test for curvature and as a source for error variance
estimation. In the latter case, use at least two (preferably three or more) center samples as
this improves the precision of any estimates. By default the Design Experiment Wizard
suggests a number of center samples. These can be modified by using the spin box next to
Number of center samples.
The center samples are experimental runs at the mid-level of the design variable ranges
when all design variables are continuous. This corresponds to the average (mean) of the
different variables in the design.
If 1-4 variables in the design are categorical and at least one is continuous, center points
can still be defined; however, these are only defined for the continuous variables in the
design.
Then a specified number of center points will be given for all combinations of categorical
levels. This ensures that the resulting design remains orthogonal.
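The construction just described — a center point at the mid-levels of the continuous variables, repeated for every combination of category levels — can be sketched as follows (variable names and ranges are made up for illustration):

```python
from itertools import product

# Two continuous design variables and one two-level category variable
continuous = {"A": (10.0, 30.0), "B": (1.0, 2.0)}
categories = {"D": ["x", "y"]}

# Center point = mid-level of every continuous variable
center = {name: (low + high) / 2 for name, (low, high) in continuous.items()}

# One center run per combination of category levels keeps the design
# orthogonal: two runs here (D='x' and D='y', both at A=20.0, B=1.5),
# four runs with two two-level category variables
center_runs = [
    {**center, **dict(zip(categories, combo))}
    for combo in product(*categories.values())
]
print(center_runs)
```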
An example is shown below for the simplest 2 factor factorial design at two levels with one
category variable, and for the 3 factor case with one center point defined.
Center point configurations of two factorial designs with one category variable
For the above designs it can be seen that two center points are required when there is one
categorical variable in the design. The center point is located at the mid-point of the
remaining continuous variables. The diagram below shows the 3 factor design with two
categorical variables, in which case 2² = 4 center points are needed.
In the situations described above, one replicate of center points was defined. In this case,
pure error cannot be calculated as the center points are all unique. In order to calculate pure
error, replicates of these center points are required. For the 2 factor design, two replicates
of center points yield 4 center points in total. Each replicated center point provides one
degree of freedom per categorical level, i.e. 2 degrees of freedom in total for pure error.
For the 3 factor example with two categorical variables, two replicates of center points
results in 8 runs for center points alone. In this case, there are 4 unique center points,
therefore this situation provides 4 degrees of freedom for pure error. The more categorical
variables, the more center points are required, i.e. 2 center points minimum per categorical
variable. If replication is required, the number of center points can increase rapidly, to the
point where the number of center points exceeds the number of design points. In these
In the example presented here, variable D is categorical. Its value can be changed using the
drop-down list. It is also possible to delete this specific center sample by clicking on the
Delete button. When the level values for the category variables have been specified, click
OK.
Reference samples
In the field reference samples, it is possible to define samples which are incorporated for
comparison. A typical reference sample is a target sample, a competitor’s sample or a
sample produced after changes to a given recipe. The values of the design variables are not
entered and are set as missing; they can be modified later in The Unscrambler®.
8.3.7 Randomization
This tab allows a user to randomize the order of the experiments.
Randomization tab
Randomized experiments
This table shows the sequence of experiments to run.
Re-randomize
If for any reason it is necessary to change the order of the samples, select the Re-
randomize button, and a new sequence of experiments will be generated.
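Randomization itself is straightforward; an illustrative sketch (not the product's actual algorithm) for a 2² factorial design:

```python
import random

# Standard order of a 2^2 factorial design in coded levels
standard_order = [(-1, -1), (1, -1), (-1, 1), (1, 1)]

def randomize(runs, seed=None):
    """Return a new, shuffled run sequence; call again to re-randomize."""
    order = list(runs)                 # keep the standard order intact
    random.Random(seed).shuffle(order)
    return order

print(randomize(standard_order))
```

Running the experiments in random order protects the estimated effects against systematic drift, e.g. instrument warm-up or raw-material aging.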
8.3.8 Summary
This tab gives a summary of the complete design set-up, as well as the ability to calculate the
power of the design to detect small changes in the individual responses. A small change
means that the effect should be significant at a 5% level.
Summary tab
A dialog box will appear where one can select the appropriate design matrix to modify in the
field Choose design.
Modify/Extend Design dialog box
Give the new design a unique name, modify any settings and click Finish when satisfied. This
will create a new design table in the project navigator.
All response values will be set to zero in the modified design.
Check the Insert – Create design… section to get more information about the design wizard.
8.4.1 To remember
When extending a design where some experiments have already been run, it is
recommended to add some extra center samples to check for bias over time in the
analysis.
Refer to the theory section Extending a design for more details.
Design
Response
Non-controllable
Main effects
Main effects + Interactions (2-var)
Main effects + Interactions (2- and 3-var)
Main effects + Interactions (2-var) + quadratic
Main effects + Interactions (2-var) + quadratic + cubic
Main effects + Interactions (2 and 3-var) + quadratic + cubic
Design
Response
Non-controllable
First order (Linear)
Second order (Quadratic)
Special cubic
Full cubic
Main effects + Responses
The tables are also divided into three to five sample sets (row ranges):
All samples
All design samples
Center samples
Design and center samples
Reference samples
Standard: This is the accepted standard order for design variables. In particular,
factorial designs adopt the standard (1), a, b, ab, … notation.
Randomized: This order is the one generated after randomization; it provides the
experimental sequence in which the runs should be performed.
The order can be changed by clicking on one of the two columns, selecting Edit–Sort
and then choosing Ascending or Descending.
Sort menu
Model Inputs
Select the Predictors and Responses to analyze. Only data tables created using the
Design Experiment Wizard (Insert–Create Design…) are accepted as input.
Usually the predefined column sets Design and Response should be selected in the
Cols box of the Predictors and Responses, respectively. Select All rows. Note that
selecting less or more data may alter desirable properties of the design.
Select the Effects to include in the model. It can include more or fewer terms; try a
simpler model first.
In subsequent analysis, terms can be removed or added to the model. Select the
relevant effects and use the Move button to add/remove them from the analysis.
For factorial designs with no category variables and at least one centre point, there
is an option to calculate Curvature. A Curvature term can be found in the Not
Estimated box and is calculated by moving it to the Estimated box. Curvature
removes one degree of freedom from Lack of Fit calculations and is used to
determine whether the model is linear or not. Note that even if the curvature term
is added in the ANOVA, the final model (i.e. regression coefficients and predicted
responses) does not include the curvature term. Because the residual degrees of
freedom are reduced when testing for curvature, avoid using it indiscriminately.
Note: The test for curvature will also remove some variation from the error term. In
some cases this may result in a low p-value for the model even though the model
itself does not include the curvature term. Therefore you should always verify your
final model by recalculating without curvature.
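For reference, the single-degree-of-freedom curvature quantity follows the standard textbook form: the mean of the factorial points is compared with the mean of the center points. A sketch with made-up responses (not taken from the software):

```python
# Responses at the factorial (cube) points and at replicated center points
factorial_y = [10.0, 12.0, 18.0, 20.0]
center_y = [16.0, 17.0, 18.0]

n_f, n_c = len(factorial_y), len(center_y)
mean_f = sum(factorial_y) / n_f
mean_c = sum(center_y) / n_c

# For a purely linear model the two means coincide; the curvature sum
# of squares (1 DF) measures their discrepancy.
ss_curvature = n_f * n_c * (mean_f - mean_c) ** 2 / (n_f + n_c)
print(round(ss_curvature, 3))  # 6.857
```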
Method
Most designs may be analyzed using Classical DoE Analysis, which performs individual
ANOVAs for each response. If the design is heavily constrained or if multiple correlated
responses should be analyzed together, Partial Least Squares Regression may be a better
option. Other changes to a design such as modified factor levels or missing values might also
favour PLSR over ANOVA in some cases. Please refer to the theory section for a discussion
on the limitations of ANOVA.
The Method tab displays some useful properties of the design to make it easier to decide on
the best analysis method.
Note: Modify design levels with caution, as such changes to the design matrix
cannot currently be undone (change back manually or use Tools–Modify/Extend
design if needed).
Note: Mixture designs are by definition non-orthogonal and can have both large
condition numbers and small D-efficiencies. These designs can still be analyzed using
Classical DoE.
Select the preferred analysis method using the radio buttons and click OK to perform
analysis.
Analysis with ANOVA
For further information on how to interpret the plots that are generated, please refer to the
section on interpreting DoE plots.
Accessing plots
Available plots for Classical DoE Analysis (Scheffe and MLR)
ANOVA overview
ANOVA table
Summary
Variables
Model check
Lack of fit
Diagnostics
Effect visualization
Effect summary
Effect and B-coefficient overview
Regression coefficients and their confidence interval
B-coefficient table
Effect visualization
Effect summary
Residuals overview
Normal probability of Y-residuals
Y-residuals vs. Y-predicted
Histogram of Y-residuals
Y-residuals in experimental order
ANOVA table
Diagnostics
B-coefficients
Regression coefficients and their confidence interval
B-coefficient table
Effect visualization
Effect visualization
Effect summary
Cube plot
Error table
Predicted vs. Reference
Response surface
Response surface plot
Response surface table
Multiple comparison
Multiple comparison plot
Group table
Distance table
B-coefficient table
Available plots for Partial Least Squares Regression (DoE PLS)
Overview
The availability of these plots is toggled by the options ‘Show plots’/’Hide plots’, accessible
by right-clicking on the DoE model in the project navigator. This will add or remove the
Plots branch of the model. The plots are also available from the toolbar or from right-
clicking in any of the plot windows.
8.8.2 Available plots for Classical DoE Analysis (Scheffe and MLR)
ANOVA overview
The ANOVA overview plot node contains four plots. The plots described below are given for
all Plackett-Burman, Fractional Factorial and Full Factorial designs (unless otherwise noted).
For Optimization and Mixture designs, the Effect visualization and Effect summary plots are
replaced with a Response surface plot and table.
ANOVA table
The ANOVA table contains all sources of variation included in the model.
Sums of squares (SS)
This is an unscaled measure of the dispersion or variability of the data table. It is the
sum of squares of the distance from the samples to the average point. It increases
with the number of samples.
All calculations are based on coded levels, i.e. the variable ranges are scaled
between [-1, 1] for process variables and between [0, 1] for mixture variables.
Degrees of freedom (DF)
The number of degrees of freedom of a phenomenon is the number of independent
ways this phenomenon can be varied. In the model there is one DF for each
independent parameter estimated.
Mean squares (MS)
This is the ratio of SS over the degrees of freedom. It estimates the variance, or
spread, of the observations of the different sources in a comparable unit.
F-ratio
This is the ratio between explained variance (associated with a given predictor) and
residual variance. F-ratios are not immediately interpretable, since their significance
depends on the number of degrees of freedom. However, they can be used as a
visual diagnostic: effects with high F-ratios are more likely to be significant than
effects with small F-ratios.
p-value
A small value (for instance less than 0.05 or 0.01) indicates that the effect is
significantly different from zero, i.e. that there is little chance that the observed
effect is due to mere random variation.
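These quantities can be illustrated on a minimal hypothetical example — one two-level effect with duplicated runs. The data are made up and this is illustrative only; real ANOVA tables contain many more sources:

```python
# Duplicated responses at the Low and High level of one design variable
groups = {"low": [10.0, 12.0], "high": [18.0, 20.0]}
all_y = [y for g in groups.values() for y in g]
grand_mean = sum(all_y) / len(all_y)

# SS: unscaled dispersion around the average
ss_effect = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                for g in groups.values())
ss_total = sum((y - grand_mean) ** 2 for y in all_y)
ss_residual = ss_total - ss_effect

# DF: one per independent parameter estimated
df_effect = len(groups) - 1              # 1
df_residual = len(all_y) - len(groups)   # 2

# MS = SS / DF; F = explained variance over residual variance
ms_effect = ss_effect / df_effect
ms_residual = ss_residual / df_residual
f_ratio = ms_effect / ms_residual
print(ss_effect, ss_residual, f_ratio)  # 64.0 4.0 32.0
```

The p-value would then be obtained from the F(1, 2) distribution; with so few residual DF the test has little power, which is one reason replicates matter.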
There are several types of sources of variations grouped in different parts of the table:
Summary
Variables
Model check
Lack of fit
In addition, some Quality values are found at the end of the table, including:
Method used
This refers to the type of samples used to calculate the error values. It can take three
values:
Design: the design is not saturated so the error values can be calculated on
the residual degree of freedom from the model.
R-square
Coefficient of multiple determination. A value close to 1 indicates a good fit, while a
value close to 0 indicates a poor fit.
Adjusted R-square
Coefficient of multiple determination adjusted for the DF. While R-square will
increase towards 1 as more parameters (effects) are added to the model, this
statistic will favour additional terms only if the increase in SS is sufficiently high.
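Both statistics follow the standard formulas; a sketch with assumed values:

```python
# Assumed sums of squares and model size (illustrative values)
ss_total, ss_residual = 68.0, 4.0
n, p = 8, 3            # observations; estimated effects besides B0

r2 = 1 - ss_residual / ss_total
# Adjusted for DF: grows only if added terms pay for themselves
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2, 4), round(adj_r2, 4))  # 0.9412 0.8971
```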
R-square prediction
R-square computed on the predicted values; this is the most conservative of the three
R-squares and says something about the predictive ability of the model.
S
Estimate for standard deviation (Root Mean Squared Error of Calibration; RMSEC)
Mean
Average value of the reference Y values on samples taking part in the analysis.
C.V. in %
The coefficient of variation is a normalized measure of dispersion of a probability
distribution: the standard deviation expressed as a percentage of the mean.
PRESS
PRediction Error Sum of Squares is an estimate of the dispersion of leverage
corrected residuals. It accounts for the predictive ability of the model in the sense
that each residual value is estimated as if the sample was left out from the model
calibration. The magnitude of this statistic can be compared with the corrected total
SS (the smaller the better).
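The leverage-corrected residuals can be computed without actually refitting the model once per sample; a sketch with made-up residuals and leverages:

```python
# Raw residuals e_i and leverages h_i from a fitted model (assumed values)
residuals = [0.5, -1.0, 0.8, -0.3]
leverages = [0.40, 0.25, 0.25, 0.40]

# e_i / (1 - h_i) is the residual the sample would have had if it were
# left out of the calibration; PRESS is the sum of their squares.
press = sum((e / (1 - h)) ** 2 for e, h in zip(residuals, leverages))
print(round(press, 2))  # 3.86
```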
ANOVA table
Summary
The first part of the ANOVA table tests the significance of the model when all specified
effects are included. If the model p-value is small (e.g. smaller than 0.05), it means that the
model explains more of the variation in the response variable than could be expected from
random phenomena. In other words, the model is significant at the 5% level. The smaller the
p-value, the more significant (and useful) the model is.
Variables
The second part of the ANOVA table deals with each individual effect (main effects,
optionally also interactions and square terms). If the p-value for an effect is small, it explains
more of the variations of the response variable than could be expected from random
phenomena. The effect is significant at the 5% level if the p-value is smaller than 0.05. The
smaller the p-value, the more significant the effect is.
There are different ways to calculate sums of squares (SS), however for orthogonal designs
such as factorial designs they all give the same results. For non-orthogonal designs such as
D-optimal and mixture designs, this section tests the so-called Marginal (Type III) SS. This
corrects for the contribution of all other terms in the model irrespective of order, however
the individual contributions may not sum to the Model SS.
Model check
The model check tests whether it is beneficial to add terms of successively higher order to
the model. For orthogonal designs such as factorial designs, the individual contributions of
the terms of a particular order sum to the model check SS. If the p-value for a group of
effects is large it means that these terms do not contribute much to the model and that a
simpler model should be considered.
For D-optimal and mixture designs, the so-called sequential (Type I) SS is given in the Model
check section. Also higher order terms than the ones actually included in the model are
given here when relevant. This section will indicate the optimal complexity of the model
when adding terms in a hierarchical manner (i.e. lower order terms added before higher
order terms). If all tested terms are included in the model, the sum of contributions will
equal the Model SS.
Lack of fit
The lack of fit part tests whether the error in response prediction is mostly due to
experimental variability or to an inadequate shape of the model. If the p-value for lack of fit
is smaller than 0.05, it means that the model does not describe the true shape of the
response surface. In such cases, it may be helpful to apply a transformation to the response
variable.
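The decomposition behind this test can be sketched as follows: replicated center samples supply a pure-error estimate, and the rest of the residual is attributed to lack of fit (illustrative numbers, not the software's output):

```python
# Replicated center responses give a model-independent error estimate
center_y = [16.0, 17.0, 18.0]
mean_c = sum(center_y) / len(center_y)
ss_pure = sum((y - mean_c) ** 2 for y in center_y)   # 2.0
df_pure = len(center_y) - 1                          # 2

# Assumed total residual from the fitted model
ss_residual, df_residual = 6.0, 4

# Whatever pure error cannot explain is lack of fit
ss_lof = ss_residual - ss_pure
df_lof = df_residual - df_pure

# A large F (small p-value) suggests the model shape is inadequate
f_lof = (ss_lof / df_lof) / (ss_pure / df_pure)
print(f_lof)  # 2.0
```

This also shows why lack of fit cannot be tested when all replicated center responses are identical: the pure-error SS would be zero.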
Note:
For screening designs, the model can be saturated. In such cases, one
cannot use the design samples for significance testing; the center samples
or reference samples are used.
If the design has design variables with more than two levels, use the
Multiple Comparison plot and B-coefficient table in order to see which
levels of a given variable differ significantly from each other.
Lack of fit can only be tested if the replicated center samples do not all
have the same response values (which may sometimes happen by
accident).
Diagnostics
This plot presents several values for assessing the quality of the fit of the model to each
individual response.
Standard Order
The standard order is the non-randomized order from the experiment generator.
Actual Value
These are the measured response values as given in the design table.
Predicted Value
This is the fitted response value as calculated from the model.
Compare this value to the actual value; the closer those values are, the better the
fit of the model.
Residual
This is the difference between the actual and the predicted value.
Study all the values; the smaller they are, the better the fit of the model. Note that
this says nothing about the predictive ability of the model when applied to
new samples.
Leverage
The leverage is the distance of the projected samples to the center of the model. A
sample with high leverage is an influential sample or an outlier. Note that for
saturated models, the leverage is 1 for all samples and there is no residual DF to
estimate error in the model.
Student Residual
A studentized residual is the result from the division of a residual by the estimate of
the sample dependent standard deviation of the residual. The presented values are
the so-called internally studentized residuals, meaning that all samples have been
included in the estimation of the standard deviation. This statistic can be used for
detection of outliers. For any reasonably sized experiment (e.g. n>30), 95% of
normally distributed, studentized residuals will fall in the interval [-2, 2].
Cook’s Distance
The Cook’s distance of an observation is a measure of the global influence of this
observation on all the predicted values. This is done by measuring the effect of
deleting this given observation. Data points with large residuals and/or high leverage
may distort the outcome and accuracy of a regression.
The Cook’s distance gives an actual threshold to judge the samples. Points with a
Cook’s distance of 1 or more are considered to be potential outliers.
Run Order
The run order is the (randomized) order of experimentation. There should not be a
run-order dependent trend in the above diagnostic tools.
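Both outlier statistics can be sketched from residuals and leverages using the standard formulas (all values here are made up):

```python
import math

residuals = [0.5, -1.0, 0.8, -0.3]    # raw residuals e_i (assumed)
leverages = [0.60, 0.40, 0.40, 0.60]  # leverages h_i (assumed)
p = 2                                 # parameters in the model
n = len(residuals)

# "Internally" studentized: s is estimated from all samples
s = math.sqrt(sum(e * e for e in residuals) / (n - p))
studentized = [e / (s * math.sqrt(1 - h))
               for e, h in zip(residuals, leverages)]

# Cook's distance combines residual size and leverage;
# values of 1 or more flag potential outliers
cooks = [r * r * h / (p * (1 - h))
         for r, h in zip(studentized, leverages)]
print([round(r, 2) for r in studentized])
print([round(d, 2) for d in cooks])
```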
Diagnostics
Effect visualization
This plot displays one effect at a time for a given response. To change the displayed effect
and the response click on the arrows or on one of the cells of the “Summary of the
effects” table.
It is useful to study the magnitude of the effects (change in the response value when the
design variable increases from Low to High) and the interactions.
There are two types of effects that can be visualized.
Main Effects
The plot shows the average response value for a specific response variable at the
Low and High levels of the design variable. If there are center samples, the average
response value for the center samples is also displayed. It is useful to study the
magnitude of the main effect (change in the response value when the design
variable increases from Low to High). If there are center samples, one can also
detect a curvature visually. For category variables with more than two levels, the
average response value for each category level is given.
Main effects with curvature
Interaction effects
The plot shows the average change in response values for a design variable
depending on the level of the other variable in a two-factor interaction. One line is
given for the Low level of the second design variable, and one line is given for the
High level of the second design variable.
It is possible to study the magnitude of the interaction effect (1/2 * change in the
effect of the first design variable when the second design variable changes from Low
to High).
For a positive interaction, the slope of the effect for “High” is larger than for
“Low”;
For a negative interaction, the slope of the effect for “High” is smaller than
for “Low”;
For no interaction the curves are parallel.
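The arithmetic for a 2² design can be sketched as follows (hypothetical responses at coded levels):

```python
# (A, B, response) for the four runs of a 2^2 factorial design
runs = [(-1, -1, 10.0), (1, -1, 14.0), (-1, 1, 12.0), (1, 1, 22.0)]

def mean(values):
    return sum(values) / len(values)

# Main effect of A: average at High A minus average at Low A
effect_a = (mean([y for a, _, y in runs if a == 1])
            - mean([y for a, _, y in runs if a == -1]))

# Effect of A at each level of B
a_at_low_b = 14.0 - 10.0     # 4.0
a_at_high_b = 22.0 - 12.0    # 10.0

# Interaction = half the change in A's effect when B goes Low -> High;
# positive here, so the "High B" line is the steeper one
interaction_ab = (a_at_high_b - a_at_low_b) / 2
print(effect_a, interaction_ab)  # 7.0 3.0
```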
Effect summary
This table plot gives an overview of the significance of all effects for all responses. There are
three values per effect and per response:
Significance: This coded value indicates if the effect is significant for the specific
response. The significance level is also reflected by the color of the row. See the
Significance levels and associated codes table below.
Effect value: This is the value of the effect for the specific response variable.
p-value: Result of the test of significance for the effect.
Use the arrows to navigate from one response variable to another, or click on the
response variable to be plotted in the Regression coefficient table.
B-coefficient table
This table presents the value of the B-coefficient for the associated design variables as well
as B0.
It also gives the 95% confidence interval for the B-coefficients. These values give an idea of
the accuracy of the estimate of the coefficients.
The p- and t-values are computed to test the null hypothesis, H0: the coefficient is equal to
0. Rejection of this hypothesis for a variable means that the variable is important for
describing the response in question. By comparing the t-value with its theoretical
distribution (Student’s T-distribution), the significance level of the studied effect is obtained.
The associated p-value represents the significance of the effect associated with the B-
coefficient. H0 can be rejected if the p-value is smaller than, say 5% (green color). This
implies that the effect in question is important for modelling the response.
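The underlying computation is the usual t-test; a sketch with assumed numbers (2.306 is the tabulated two-sided 5% critical value for 8 residual DF):

```python
b, se_b = 4.1, 1.2     # coefficient and its standard error (assumed values)
t_value = b / se_b     # test statistic for H0: coefficient = 0

# Compare |t| with the Student's t critical value for the residual DF
t_crit_5pct_8df = 2.306
significant = abs(t_value) > t_crit_5pct_8df
print(round(t_value, 2), significant)  # 3.42 True
```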
B-coefficient table
Effect visualization
This plot is shown for all designs except mixture designs. For more information on this plot,
check the ANOVA overview section.
Effect summary
For more information on this plot, check the ANOVA overview section
Residuals overview
These plots can be used to check the adequacy of the model or look for outliers, provided
that there are ample residual degrees of freedom left to study the residuals. If the model is
close to saturated, i.e. the number of effects is almost as high as the number of
observations, artificially structured residuals will result that cannot be interpreted properly.
The presence of an outlier is shown in the example below. The outlying sample has a much
larger residual than the others; however, it does not seem to disturb the model to a large
extent.
A simple outlier has a large residual
The figure below shows the case of an influential outlier: not only does it have a large
residual, it also attracts the whole model so that the remaining residuals show a very clear
trend. Such samples should usually be excluded from the analysis, unless there is an error in
the data table that can be corrected.
An influential outlier changes the structure of the residuals
Small residuals (compared to the variance of Y) which are randomly distributed indicate
adequate models.
Histogram of Y-residuals
This plot shows the distribution of the residuals, optionally with a statistics table displayed.
Histogram of Y-residuals
A symmetric bell-shaped histogram which is evenly distributed around zero indicates that
the normality assumption is likely to be true. This is the case in the above plot. Moderate
departures from normality are usually acceptable. Change the resolution of the histogram by
toggling the number of bars in the toolbar.
ANOVA table
For more information check the ANOVA overview section
Diagnostics
For more information check the ANOVA overview section
B-coefficients
This plot node is available for all designs except designs with categorical design variables
with three levels or more and for mixture designs.
B-coefficient table
For more information on this plot, look at the section Effect and B-coefficient overview
Effect visualization
This plot node is available for all designs except designs with categorical design variables
with three levels or more and for mixture designs.
Effect visualization
For more information check the DoE overview section
Effect summary
For more information check the DoE overview section
Cube plot
This plot is available for all factorial designs (incl. Plackett-Burman). It displays the average of
a specified response variable at the experimental points.
Cube plot
The plot is most useful when there are two or three design variables. If there are more than
three design variables it is possible to choose which cube to represent using the arrows for
X, Y and Z.
Error table
The error table is a summary of the quality parameters available for the analysis of design
data. See ANOVA table for a description of the individual terms.
Error table
Response surface
There are two types of response surface (RS) plots. A square response surface is given for
non-mixture designs and a triangular response surface is given for mixture designs.
The response surface can also be rotated and viewed in 3D from any angle using the mouse:
Rotated response surface plot
Different representations of the response surface can be seen by selecting the toolbar
options Mesh, Floor Contour or Surface Contour.
Response surface right click options
The following options are available from the right click menu in a response surface plot.
From the DOE menu all available analysis plots can be accessed.
Click View to switch between Graphical or Numerical view (also accessible from the toolbar),
or to toggle the colorbar (Legend) on or off.
Copy a bitmap representation to the clipboard for pasting into other applications, or Save
Plot using either of the formats JPEG, PNG, BMP, PNM or TIFF.
The Auto Scale option available from the right click or toolbar menu will return to a default
size 2D-plot.
The following Properties can be tuned from the plot properties dialog:
Appearance
Plot Font
Bold: Toggle bold font for title, axis, colorbar and tooltip text on and off.
Italic: Toggle italic font for title, axis, colorbar and tooltip text on and off.
Name: Switch between font families Arial, Courier and Times for title, axis,
colorbar and tooltip text.
Size: Set font size as a relative number. The plotting library automatically
attempts to find the best font size for different text. You may increase or
decrease the size of all plot text within the range of 0.1 (very small) and 4.0
(very large).
To set the level of the non-plotted variables enter the value manually in the column
Current. By default this value is the average value.
For mixture designs the levels of the components cannot vary independently of each
other, as the mixture constraint imposes that all components must always sum to the
Mixture Sum. Therefore, if a non-plotted variable is tuned, the axes and Max
levels of the plotted variables are updated accordingly. A minimum Max value
corresponding to 3.5% of the total range is enforced for plotted mixture
components.
For mixture designs there is an additional column with Freeze check-boxes. This is
useful for designs with 5 components or more. If the current level of a non-plotted
mixture variable is increased until the plotted variable axes cannot be reduced any
more, the levels of other non-plotted components will be reduced instead. If freeze
is checked for a non-plotted variable, its current value cannot be changed due to a
change in other variables.
For category variables select one of the levels using the drop-down list.
Response variables
Only one response variable can be plotted at a time. Select the response to plot by
ticking the variable of interest.
Optimization constraints for response variables can be set using the sliders or by
entering the values manually in the Min and Max columns. Setting optimization
constraints for multiple responses simultaneously is a very useful tool for finding the
optimal design settings.
Response surface table
Multiple comparison
This node is given for non-saturated designs with at least one category variable. It shows
whether the distance between levels is larger than a critical distance, in which case the
levels are considered to belong to different groups. Because the critical distance is calculated
from the data, residual degrees of freedom are required for these plots to be displayed.
one categorical level and the other levels, the average response values are plotted in
different groups along the X-axis.
Multiple Comparisons
The average response value is displayed as a red square and its value can be read on
the vertical axis or by mouse-over.
The levels are grouped along the horizontal axis by significantly different groups.
The names of the different levels can be seen by mouse-over.
Levels that are not significantly different are linked by blue vertical bars. Each
vertical bar is the size of half the critical distance. Two levels have significantly
different average response values if they are not linked by any bar.
The critical distance is indicated in the x-axis title.
Group table
The group table shows the levels associated with the different groups. This table takes the
value 1 if the level is part of the specified group and 0 if not. One level can be associated
with several groups.
Group table
Distance table
This table shows, for a specific response variable and a specific category variable, the
distance between the average values of each pair of levels.
Distance table
B-coefficient table
For more information look at the description in the B-coefficients section. If one of the
categorical variables has three levels or more, an Effect visualization is plotted instead of the
B-coefficient table.
8.8.3 Available plots for Partial Least Squares Regression (DoE PLS)
When PLSR is performed on designed data, all the regular PLSR plots are available. The DoE
PLS model in addition has some plots useful for DOE purposes.
Overview
Explained Variance
This is the total explained variance plot for models of an increasing number of components.
Use the toolbar buttons to switch between X-/Y-variance, calibration/validation variance and
Turn on Plot Statistics (using the View menu) to check the slope and offset, RMSEP/RMSEC
and R-squared. Generally all the y-variables should be studied and give good results.
Note: Before interpreting the plot, check whether the plots are displaying
Calibration or Validation results (or both).
Menu option Window - Identification tells whether the plots are displaying Calibration (if
Ordinate is yPredCal) or Validation (yPredVal) results.
Use the buttons to switch Calibration and Validation results off or on.
It is also useful to show the regression line using the icon, and compare it with the
The figures below show two different situations: one indicating a good fit, the other a poor
fit of the model.
Predicted vs. Reference shows how well the model fits
In the above plot, sample 3 does not follow the regression line whereas all the other
samples do. Sample 3 may be an outlier.
How to detect nonlinearity
In other cases, there may be a nonlinear relationship between the X- and Y-variables, so that
the predictions do not have the same level of accuracy over the whole range of variation of
Y. In such cases, the plot may look like the one shown below. Such nonlinearities should be
corrected if possible (for instance by a suitable transformation), because otherwise there
will be a systematic bias in the predictions depending on the range of the sample.
Predicted vs. Reference shows a nonlinear relationship
Predictors (X) projected in roughly the same direction from the center as a response,
are positively linked to that response. In the example below, predictors sweet, red
and color have a positive link with response Pref.
Predictors projected in the opposite direction have a negative relationship with that
response, as does the predictor thick in the example below.
Predictors projected close to the center, as bitter in the example below, are not well
represented in that plot and cannot be interpreted.
Maturity has a negative effect on the adhesiveness of the cheese; they are anti-correlated.
The amount of dry matter positively affects stickiness, and negatively affects glossiness
and meltiness. Glossiness and meltiness, two responses, are correlated.
Caution! If the X-variables have been standardized, one should also standardize the
Y-variable so that the X- and Y-loadings have the same scale; otherwise the plot
may be difficult to interpret.
The plot shows the importance of the different variables for the two components specified.
Correlation Loadings of process variables (X) and the quality of the cheese (Y) along
(factor 1, factor 2)
Variables close to each other in the loadings plot will have a high positive correlation if the
two components explain a large portion of the variance of X. The same is true for variables in
the same quadrant lying close to a straight line through the origin. Variables in diagonally
opposed quadrants will have a tendency to be negatively correlated. For example, in the
figure above, variables dry matter and stickiness have a high positive correlation on factor 1
and factor 2, and they are negatively correlated to variables meltiness and glossiness.
Variables adhesiveness and stickiness have independent variations. Variables addition of
recycled dry matter and pH are very close to the center; they are not well described by
factor 1 and factor 2.
Note: Variables lying close to the center are poorly explained by the plotted factors
(or PCs). They cannot be interpreted in that plot.
8.10. Bibliography
R. C. Bose and K. Kishen, On the problem of confounding in the general symmetrical factorial
design, Sankhya, 5, 21, (1940).
J.A. Cornell, Experiments with Mixtures: Designs, Models, and the Analysis of Mixture Data,
Second edition, John Wiley and Sons, New York, 1990.
G.H. Golub and C.F. Van Loan, Matrix Computations, Third edition, Johns Hopkins University
Press, 1996.
R.W. Kennard and L.A. Stone, Computer Aided Design of Experiments, Technometrics, 11(1),
137-148, (1969).
G.A. Lewis, D. Mathieu, and R. Phan-Tan-Luu, Pharmaceutical Experimental Design, Marcel
Dekker, Inc., New York, 1999.
D.C. Montgomery, Design and Analysis of Experiments, Sixth edition, John Wiley & Sons,
New York, 2004.
R.H. Myers and D.C. Montgomery, Response Surface Methodology: Process and Product
Optimization using Designed Experiments, Second edition, Wiley, New York, 2002.
T. Naes and T. Isaksson, Selection of Samples for Calibration in Near-Infrared Spectroscopy.
Part I: General Principles Illustrated by Example, Appl. Spectrosc., 43(2), 328-335, (1989).
N.-K. Nguyen and G.F. Piepel, Computer-Generated Experimental Designs for Irregular-
Shaped Regions, QTQM, 2(2), 147-160, (2005).
R.E.A.C. Paley, On orthogonal matrices, J. Math. Phys., 12, 311–320, (1933).
R.L. Plackett and J.P. Burman, The Design of Optimum Multifactorial Experiments,
Biometrika, 33, 305-325, (1946).
H. Scheffé, Experiments with Mixtures, J. Roy. Stat. Soc. Ser. B, 20, 344-366, (1958).
9. Validation
9.1. Validation
Model validation is performed for PCA or regression models to estimate how useful the
model will be for future observations. It estimates the predictive ability of the model, as
opposed to the model’s fit to the training data.
Theory
Dialog usage: Validation tab
Dialog usage: Cross validation setup
Cross validation
Though the objective is to have enough samples to put a reasonable amount aside as a test
set, this is not always possible due, for example, to the cost of samples or reference testing.
The best alternative to an independent test set for validation is to apply cross validation.
With cross validation, the same samples are used both for model estimation and testing. A
few samples are left out from the calibration data set and the model is calibrated on the
remaining data points. Then the values for the left-out samples are predicted and the
prediction residuals are computed. The process is repeated with another subset of the
calibration set, and so on until every object has been left out once; then all prediction
residuals are combined to compute the validation residual variance and RMSEP. It is of
utmost importance to be aware of which level of cross validation is to be validated. For
example, if one physical sample is measured three times, and the objective is to
establish a model across samples, the three replicates must be held out in the same cross
validation segment. If the objective is to validate the repeated measurement, keep out one
replicate for all samples and generate three cross validation segments. The calibration
variance is always the same; it is the validation curve that is the important figure of merit
(and the RMSECV for regression models).
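The procedure described above can be sketched in a few lines. This is a minimal illustration, not the software's implementation: a simple univariate least-squares fit stands in for the actual multivariate model, and the data and segment assignments are hypothetical.

```python
import math

def fit_line(xs, ys):
    # Ordinary least squares for y = a + b*x, a stand-in for the
    # real multivariate model being validated.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def rmsecv(xs, ys, segments):
    # Segmented cross validation: each segment is left out once, the
    # model is refit on the remaining samples, the left-out samples
    # are predicted, and all prediction residuals are pooled.
    sq = []
    for seg in segments:
        train = [i for i in range(len(xs)) if i not in seg]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq += [(ys[i] - (a + b * xs[i])) ** 2 for i in seg]
    return math.sqrt(sum(sq) / len(sq))

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

# Full cross validation: one sample per segment.
full_cv = rmsecv(xs, ys, [[i] for i in range(len(xs))])
# Segmented: replicates (here, hypothetical pairs) held out together.
seg_cv = rmsecv(xs, ys, [[0, 1], [2, 3], [4, 5]])
```

Note how the choice of segments encodes the validation level discussed above: pooling replicates into one segment validates across physical samples, not across repeated measurements.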
Several versions of the cross validation approach can be used:
Full cross validation
leaves out only one sample at a time; it is the original version of the method;
Segmented cross validation
leaves out a whole group of samples at a time. A typical example is when there are
systematic replicated measurements of one physical sample;
Test-set switch
divides the global data set into two subsets, each of which will be used alternatively
as calibration set and as test set;
Category variable
enables the user to validate across levels of category variables. This is useful for
evaluating how robust the model is across season, raw material supplier, location,
operator, etc.
When running a cross validation, one can get prediction diagnostics for the cross validation
segments. These are not available when full cross validation is used. This option provides
information on the validation results for each cross validation segment, including RMSEP,
SEP, bias, slope, offset and correlation. The CV prediction diagnostics are added as a matrix
in the Validation folder of the PLSR model.
Leverage correction
Leverage correction is an approximation to cross validation that enables prediction residuals
to be estimated without actually performing any prediction. It is based on an equation that
is valid for MLR, but is only an approximation for PLSR and PCR.
According to this equation, the prediction residual equals the calibration residual divided
by one minus the sample leverage: f_i = e_i / (1 - h_i).
Modern computers offer the possibility to perform cross validation for most data sets
without much computation time, making leverage correction more of a relic of the old days.
The reason why such cases are difficult is that there is too little information for estimation
of a model and each sample is “unique”. Therefore all known validation methods are doomed to fail.
For MLR, leverage correction is strictly equivalent to (and much faster than) full cross
validation.
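A minimal sketch of the idea, assuming the standard leverage-correction relation f_i = e_i / (1 - h_i) (the residual and leverage values below are hypothetical):

```python
def leverage_corrected_residuals(cal_residuals, leverages):
    # Estimate prediction residuals from calibration residuals without
    # refitting any model: f_i = e_i / (1 - h_i), where h_i is the
    # leverage of sample i (exact for MLR, approximate for PCR/PLSR).
    return [e / (1.0 - h) for e, h in zip(cal_residuals, leverages)]

# A high-leverage sample gets its residual inflated the most.
corrected = leverage_corrected_residuals([0.1, 0.1, 0.1], [0.1, 0.5, 0.9])
```

The inflation reflects that a high-leverage sample pulls the fit toward itself, so its calibration residual understates the error that would be seen in prediction.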
Dr. Harald Martens has (re-)developed a generic method for uncertainty testing, which gives
a safer interpretation of models. The concept for uncertainty testing is based on cross
validation, jack-knifing and stability plots. This section introduces how the Uncertainty Test
works and shows how it can be used in The Unscrambler® through an application.
The following sections will present the method with a non-mathematical approach.
How does the uncertainty test work?
The test works with PLSR or PCA models with cross validation, choosing full cross validation
or segmented cross validation as is appropriate for the data. When the optimal number of
components (factors) for PLSR has been chosen, tick Uncertainty test on the Validation tab
of The Unscrambler® modeling dialog box.
Under cross validation, a number of submodels are created. These submodels are based on
all the samples that were not kept out in the cross validation segment. For every submodel,
a set of model parameters (B-coefficients, loadings and loading weights) is calculated.
Variations over these submodels will be estimated so as to assess the stability of the results.
In addition a total model is generated, based on all the samples. This is the model that will
be used for interpretation.
Stability plots
The results of all these calculations can also be visualized as stability plots in scores, loadings,
and loading weights plots. Stability plots can be used to understand the influence of specific
samples and variables on the model, and explain for example why a variable with a large
regression coefficient is not significant. This will be illustrated in the example that follows
(see Application Example).
See tutorial M to learn how to use the Uncertainty Test results in practice.
Therefore, the individual models m=1,2,…,M may be rotated, e.g. towards a common model:
After rotation, the rotated parameters T(m) and [P’,Q’](m) may be compared to the
corresponding parameters from the common model T and [P’,Q’]. The perturbations may
then be written as (T(m) - T)g and ([P’,Q’](m) - [P’,Q’])g for the scores and the loadings,
respectively, where g is a scaling factor (here: g=1).
In the implemented code, an orthogonal Procrustes rotation is used. The same rotation
principle is also applied for the loading weights, W, where a separate rotation matrix is
computed for W. The uncertainty estimates for P, Q and W are estimated in the same
manner as for B below.
Significance testing
When the variances for B, P, Q, and W have been estimated, they can be utilized to find
significant parameters.
As a rough significance test, a Student’s t-test is performed for each element in B relative to
the square root of its estimated uncertainty variance S²B, giving the significance level for
each parameter. In addition to the significance for B, which gives the overall significance for
a specific number of components, the significance levels for Q are useful to find in which
components the Y-variables are modeled with statistical relevance.
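A rough sketch of this kind of significance screen. It assumes a common jackknife variance convention and a normal approximation to Student's t for the p-value; the coefficient values below are hypothetical, not taken from any real model.

```python
import math

def jackknife_significance(b_full, b_sub):
    # b_full: a coefficient from the total model (all samples);
    # b_sub:  the same coefficient from each cross validation submodel.
    m = len(b_sub)
    # Jackknife-style uncertainty variance from the perturbations of the
    # submodels relative to the total model (one common convention).
    s2 = sum((bm - b_full) ** 2 for bm in b_sub) * (m - 1) / m
    t = abs(b_full) / math.sqrt(s2)
    # Two-sided p-value from a normal approximation to Student's t
    # (adequate as a rough screen when there are many segments).
    p = math.erfc(t / math.sqrt(2.0))
    return t, p

# Stable across submodels -> small uncertainty -> significant.
t_stable, p_stable = jackknife_significance(2.0, [1.9, 2.1, 2.0, 1.95, 2.05])
# Wildly varying across submodels -> large uncertainty -> not significant.
t_noisy, p_noisy = jackknife_significance(0.5, [2.0, -1.5, 0.1, 1.2, -0.9])
```

This illustrates the point made by the stability plots: a coefficient can be large in the total model yet non-significant if it swings strongly between submodels.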
Significance testing
The Uncertainty Test option can be used to estimate the significance of variables, when
using cross validation. During cross validation, the differences between the model
parameters for all samples and the parameters of the submodel for each cross validation
segment are squared and summed. The significance (p-value) is estimated by a t-test with
the model parameter and its standard deviation as input. For PCA, the p-values for loadings
per variable and component are returned. For PLS regression, p-values are returned for
x-loadings, loading weights, y-loadings and regression coefficients.
This is referred to as Martens’ Uncertainty Test.
Use the Matrix drop-down list to select the test set, or use the Rows and Columns selector
drop-down lists to define a test set within a selected matrix for both X and Y.
Discard residuals
By discarding residuals, the matrices
X-Residuals
X-Validated Residuals
Y-Residuals
Y-Validated Residuals
are removed from the Validation folder in the analysis. These are 3-dimensional matrices
and use up a lot of memory. As an indication of the reduced size when enabling Discard
Residuals, a PLS regression model with 400 samples, 100 x-variables, 1 y-variable and 10
factors will only take up 10% of the full model size. As the number of samples, X- and
Y-variables and factors increases, the reduced-size model will be even smaller as a
percentage of the full model.
Note: When the residuals are discarded, some of the plot options will not be available. All
plots where the data are taken from the X-Residuals or Y-Residuals matrices will not be
listed in the plot menus. The Plot - Residuals submenus now only allow Residuals and
Influence (with Q-residuals), and under Plot - Residuals - General only the Influence Plot
and Variance per Sample plots are available.
Plots available in the Residuals menu when Discard Residuals is selected
Results - Regression
Display the PLSR Overview results. From here additional results plots can be
accessed from the menu.
Results - All
Display results for any analysis.
Category variable
Allows for model cross validation by removing samples belonging to defined
categories as a group. This is useful for evaluating how robust the model is across
season, raw material supplier, location, operator etc.
10. Transform
10.1. Transformations
This section covers transformations available in The Unscrambler®. Transformation (often
referred to as preprocessing) is applied to data to reduce or remove effects which do not
carry relevant information for the modeling of the system. Transformations can reduce the
complexity of a model (fewer factors needed) and improve the interpretability of the data
and models. They include the application of derivatives to spectral data to reduce baseline
offset and tilt effects while accentuating small spectral differences. Scattering corrections
are often applied to diffuse reflectance spectra to reduce differences such as light scatter
and path length. These transforms can only be performed on numerical data, and some of
them cannot be performed when there are missing data (e.g. the Norris-Gap derivative).
The Unscrambler® provides the following transformations:
Baseline correction
Center_and_scale
Compute general
COW
Deresolve
Derivatives
Detrending
MSC/EMSC
Interaction & Square Effects
Missing_value_imputation
Noise
Normalize
OSC
Quantile_Normalize
Reduce and average
Smoothing
Spectroscopic transformations
SNV
Transpose
Weights
Interpolation
More details regarding transformation methods available in The Unscrambler® are given in
the Method References.
How it works
How to use it
The correction is x_corrected = x - min(X), where x is a variable and X denotes all selected
variables for this sample.
For each sample, the value of the lowest point in the spectrum is subtracted from all the
variables. The result of this is that the minimum value is set as 0 and the rest are positive
values. To use this consistently for a set of samples, make sure that the lowest point pertains
to the same variable for all samples.
Linear baseline correction
This transformation transforms a sloped baseline into a horizontal baseline. The technique is
to point out two variables which should define the new baseline. These are both defined as
0, and the rest of the variables are transformed according to this with linear
interpolation/extrapolation. It is important to take precautions not to select basis variables
that have spectroscopic bands. As for the offset correction, make sure that the lowest points
pertain to the same variables for all samples.
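Both corrections can be sketched in a few lines. This is a minimal illustration with hypothetical numbers; the software applies the corrections row-wise to the whole selected scope.

```python
def baseline_offset(spectrum):
    # Subtract the sample's lowest point so the minimum becomes 0
    # and all remaining values are non-negative.
    m = min(spectrum)
    return [v - m for v in spectrum]

def linear_baseline(spectrum, i, j):
    # Set variables i and j to 0 and remove the straight line through
    # them from every point by linear interpolation/extrapolation.
    slope = (spectrum[j] - spectrum[i]) / (j - i)
    return [v - (spectrum[i] + slope * (k - i))
            for k, v in enumerate(spectrum)]

tilted = [1.0, 2.1, 4.0, 3.1, 4.0]
flat = linear_baseline(tilted, 0, 4)  # endpoints define the new baseline
```

After the linear correction the two chosen basis variables are exactly zero, which is why they should be baseline points rather than spectroscopic bands.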
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined. This transform requires that only numerical
data be chosen.
After the range has been selected, select the method of the baseline transformation. A
method must be selected in order to carry out the transform. If Linear baseline correction is
selected, the two variables which define the new baseline must also be defined (Baseline
end variables). The first and last variables are selected by default. The first and last values
must be different for the transform to be performed. By checking the Preview result box,
one can see the outcome of the data after the baseline transformation has been applied.
When the baseline transformation is completed, a new matrix is created in the project with
the word Baseline appended to the original matrix name. This name may be changed by
selecting the matrix, right clicking and selecting Rename from the menu.
Method options
Choose between two baseline transforms:
Baseline offset
The value of the lowest point in the spectrum is subtracted from all the variables.
Linear baseline correction
Transform a sloped baseline into a horizontal baseline.
Do not select basis variables that have spectroscopic bands.
For the offset correction in both methods, make sure that the lowest points pertain to the
same variables for all samples.
How it works
How to use it
The range is the difference between the highest and lowest observation for each variable.
Such scaling results in a range of one for all variables. The presence of outliers in the data
will heavily influence this transformation, however. A safer alternative would be to use the
IQR, which is the difference between the observations at the 25th and 75th percentiles.
(There are several different ways of calculating the IQR; The Unscrambler® utilizes the
‘Type 7’ algorithm of Hyndman and Fan, 1996.) As extreme observations are not included in
the IQR estimate, it is less likely to be affected by outliers.
The MAD is defined as the median of absolute differences between each observation in the
column and the median observation. This measure of population spread is little affected by
the tail behaviour of the distribution. For instance if a histogram of the data reveals a ‘wide’
peak where many observations fall in the tails, the standard deviation will be grossly inflated
while the MAD will remain a good estimate for the population’s spread. The MAD will
similarly be more robust for data with sharp peaks and long tails. The Scaled MAD is the
MAD multiplied by the factor 1.4826. This makes the estimate similar to the standard
deviation when many observations are collected from a normal distribution.
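A minimal sketch of the Scaled MAD and the corresponding robust column scaling, using Python's standard statistics module (the data values are hypothetical):

```python
import statistics

def scaled_mad(values):
    # Median absolute deviation scaled by 1.4826, so the estimate is
    # comparable to the standard deviation for normally distributed data.
    med = statistics.median(values)
    return 1.4826 * statistics.median(abs(v - med) for v in values)

def robust_scale(column):
    # Median centering with Scaled MAD scaling: a robust alternative to
    # mean centering with standard-deviation scaling (autoscaling).
    med = statistics.median(column)
    s = scaled_mad(column)
    return [(v - med) / s for v in column]

# One wild outlier barely moves the Scaled MAD, unlike the stdev.
spread = scaled_mad([1.0, 2.0, 3.0, 4.0, 100.0])
```

Replacing 100.0 with 5.0 leaves the Scaled MAD unchanged, which is exactly the robustness property described above.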
Centering and/or scaling data may be useful to study the data in various plots, or prior to
running Tasks – Analyze – Descriptive Statistics. It may for example allow one to compare
the distributions of variables of different scales within one plot. In subsequent analysis,
these scaled variables will contribute similarly to the model regardless of measurement unit.
These transformations are all column-oriented: the transformed values are computed as a
function of the values in the same column of the data table.
Notes: 1. Mean centering is included as a default option in the relevant analysis
dialogs, and the computations are done as a first stage of the analysis. Scaling using
the standard deviation may be applied in the Weights tabs of most analysis dialogs.
2. Centering and scaling are also available as a transformation to be performed
manually from the Editor (Tasks – Transform – Center_and_scale). Use this dialog
to perform one of the available non-parametric centering and scaling options.
A special type of standardization is the Spherize function (Martinez and Martinez, 2005). It is
the multivariate equivalent of the univariate scaling methods described above. The
transformed variables have a p-dimensional mean of 0 and a covariance matrix given by the
identity matrix. It is also known in some application domains as the whitening
transformation since the resulting matrix has the signal properties of “white noise”.
More details regarding center and scale methods are given in the Method References.
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . The rows and columns to be included in the computation must be
specified as well. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
In the Transformation frame, three options are available:
Center
within the selected sample and variable scope. This subtracts a value, e.g. the
variable mean, from each observation in each column. There is an option to center
by the mean, median, or minimum value, or not use any centering. Choose the
desired option for centering from the Center drop-down list.
Dialog showing centering options
Scale
within the selected sample and variable scope. This divides each data value by an
estimate of the column spread. Options available are the Standard deviation
(SDev), Interquartile range (IQR), Range, or Scaled median absolute deviation (MAD)
scaling, or not to use any scaling. Choose the desired option for scaling from the
Scale drop-down list as shown below.
Dialog showing scaling options
Spherize
This is a multivariate equivalent of univariate center and scaling, useful in
exploratory data analysis.
The Center and Scaling options can be selected either separately or in combination. Often
mean centering is combined with SDev scaling (autoscaling). Due to their non-parametric
nature, the Range, IQR, or Scaled MAD transformations are often used after median centering.
The type of centering and scaling is selected from the drop-down list.
By checking the Preview result box, a line plot of the observations before and after scaling is
displayed.
Notes: 1. To display the mean and standard deviation of the variables in a data set,
use menu option Tasks – Analyze- Descriptive Statistics. 2. The Center and Scale
transformations are supported in autopretreatments, meaning they can be
automatically applied when new data are analysed (classification, prediction and
sample projection analyses), using a model which was developed with this
transformation applied. See next note. 3. The principal component analysis (PCA)
and Regression dialog boxes include options for centering and scaling variables
directly at the analysis stage. It is recommended to perform centering and scaling at
the model-building stage, especially if the model will be used for future prediction
or classification. The same centering and scaling options will be applied as when the
model was built. 4. Centering and/or scaling the data more than once will not affect
the structure of the data any further. Consequently, if the Center and Scale
transformation has been applied to the data from the Tasks – Transform – Center
and Scale dialog, the data may harmlessly be recentered and/or rescaled at the
modeling stage (PCA or regression).
How it works
How to use it
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . For the matrix, the rows and columns to be included in the computation
are then selected.
If new data ranges need to be defined, choose Define to open the Define Range dialog
where new ranges can be defined. One must also define if the selection is for the variables
or samples.
There are three ways of defining the mathematical expression to be applied:
Use the drop-down list, which provides the most recently used expressions (if this is
the first time using the Compute_General dialog, no formerly used expressions will
show in the drop-down list).
Click on the Build Expression button. This opens the Build Expression dialog wherein
a mathematical expression can be defined using the ready-made functions and
operators allowed in The Unscrambler®.
Syntax
The Expression field accepts a formula of the type: X=LN(ABS(X))-e or S4=(S1*S2)+S3 or
V1=V1/2+SIN(V8/V9) where S stands for sample, V stands for variable, and the number is
the sample or variable number in the Editor. To build general expressions that are not
related to a particular sample or variable, use X. X stands for the whole matrix defined by the
variable and sample set chosen in Scope. RH and CH are row and column headers,
respectively.
Note: The formula cannot contain mixed references to samples (S), variables (V)
and X.
+ Addition
- Subtraction
* Multiplication
/ Division
= Equals to
( Left Parenthesis
) Right Parenthesis
EXP(X) Exponential(X) = e^X
Name Description
COS(X) Cosine
SIN(X) Sine
TAN(X) Tangent
PI 3.14
e 2.718
”X” can denote both samples and variables in this table.
Function names are case insensitive, meaning that log, Log, and LOG will give the same
result. In the above functions a comma is used as the list separator; however, this depends
on the regional settings of the computer. Different list separators may be valid for
different countries, e.g. POW(X;n).
Notes: A commonly used expression is X=log(X). This expression generally
transforms skewed variable distributions into more symmetrical ones. Use a
histogram plot or Tasks – Analyze – Descriptive Statistics… in order to check
whether the skewness was improved or deteriorated after applying the
transformation.
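A sketch of what such an expression does to the data, assuming LOG denotes the base-10 logarithm (LN being the natural logarithm in the function list above); the matrix values are hypothetical:

```python
import math

def compute_log(matrix):
    # Cell-wise equivalent of the Compute General expression X = LOG(X),
    # applied to the whole selected scope (all values must be positive).
    return [[math.log10(v) for v in row] for row in matrix]

# Values spanning several orders of magnitude become evenly spaced,
# which is how the transform reduces right-skewness.
skewed = [[1.0, 10.0], [100.0, 1000.0]]
symmetric = compute_log(skewed)
```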
The upper text field shows the expression as it is being built. In Display, choose whether the
text field should show the sample/variable Numbers or the sample/variable Names. In the
Insert field, choose to insert specific samples, specific variables or (general expression). After
choosing the Sample or the Variable options, the drop-down list is enabled and one can
select the relevant object(s) from the list. The available samples or variables are only those
belonging to the Scope formerly selected in the Compute dialog.
The Arithmetic Functions, Trigonometric Functions, Other Functions, and Numbers fields
offer buttons that are used following the same principle as for a calculator.
Click Clear to clear the expression. Click Undo to undo the latest insertion in the expression
text. Click OK to return to the Compute_General dialog.
10.5. COW
10.5.1 Correlation Optimized Warping (COW)
COW is a method for aligning data where the signals exhibit shifts in their position along the
x axis. COW cannot be performed with non-numeric data, or when there are missing data.
How it works
How to use it
COW corrects for shifts along the axis of measurement (such as chromatography retention
times, chemical shifts in NMR data, and Raman spectral x-axis alignment).
COW cannot be performed with non-numeric data, or when there are missing data. The
minimum number of variables required to use COW is 20.
COW Dialog
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
Three inputs must be specified in the dialog:
Reference Sample: Select which sample in the data table is to act as the reference
profile.
This is a typical sample (e.g. near the origin in a scores plot) with preferably the main
peaks present. If the COW will be applied to new data at some later point of time,
include the reference sample in a new data table as well.
Segment Size: The length of the segments into which the data are divided before
searching for the optimal correlation. It must be smaller than the number of
variables divided by 4.
Slack: The allowed change in segment position to be searched for; its value must be
<= Segment Size.
By selecting the preview result, one can see how the transformed data will look.
COW dialog with preview
When the COW transformation is completed, a new matrix is created in the project with the
word COW appended to the original matrix name. This name may be changed by selecting
the matrix, right clicking and selecting Rename from the menu.
10.6. Deresolve
10.6.1 Deresolve
The Deresolve function can be used to change the apparent resolution of an instrument,
changing a high resolution spectrum to low resolution. It may also be used for noise
reduction.
How it works
How to use it
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined. There must be at least 4 variables to
perform the deresolve transformation.
In the Parameters field, choose the number of channels to use for convolution. The
minimum number of channels that can be used is 2, and the maximum is (#variables/2).
By selecting the preview result, one can see how the transformed data will look.
When the deresolve transformation is completed, a new matrix is created in the project with
the word Deresolve appended to the original matrix name. This name may be changed by
selecting the matrix, right clicking and selecting Rename from the menu.
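The idea of lowering apparent resolution by convolution can be sketched as follows. This is a minimal moving-average illustration with hypothetical data; the actual convolution kernel used by the software may differ.

```python
def deresolve(spectrum, channels):
    # Lower the apparent resolution by replacing each point with the
    # average over a window of neighboring channels (truncated at the
    # edges so the output has the same length as the input).
    half = channels // 2
    out = []
    for i in range(len(spectrum)):
        lo, hi = max(0, i - half), min(len(spectrum), i + half + 1)
        out.append(sum(spectrum[lo:hi]) / (hi - lo))
    return out

# A sharp, isolated peak is broadened and lowered.
smoothed = deresolve([0.0, 0.0, 4.0, 0.0, 0.0], 3)
```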
10.7. Derivatives
10.7.1 Derivatives
Differentiation, i.e. computing derivatives of various orders, is a classical technique widely
used for spectroscopic applications. Some of the information “hidden” in a spectrum may be
more easily revealed when working on a first or second derivative. It is a row-oriented
transformation; that is to say the contents of a cell are likely to be influenced by its
horizontal neighbors.
Derivatives cannot be performed with non-numeric data or where there are missing data.
Like smoothing, this transformation is relevant for variables which are themselves a function
of some underlying variable, e.g. absorbance at various wavelengths. Computing a derivative
is also called differentiation. Derivatives can help to resolve overlapped bands, but also lead
to a lower signal in the transformed data.
The segment parameter of Gap-Segment derivatives is an interval over which data values are
averaged.
In smoothing, X-values are averaged over one segment symmetrically surrounding a data
point. The raw value on this point is replaced by the average over the segment, thus creating
a smoothing effect.
In Gap-Segment derivatives (designed by Karl Norris), X-values are averaged separately over
one segment on each side of the data point. The two segments are separated by a gap. The
raw value on this point is replaced by the difference of the two averages, thus creating an
estimate of the derivative on this point.
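The averaging-and-difference idea above can be sketched as follows. This is a simplified illustration; the exact averaging and scaling conventions of the Norris implementation may differ:

```python
import numpy as np

def gap_segment_first_derivative(x, gap, segment):
    """First-derivative estimate at each point: the difference between
    the mean of one segment on each side of the point, the two segments
    being separated by the gap. Edge points where a full segment does
    not fit are left as NaN. Simplified sketch only.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    half_gap = gap // 2
    d = np.full(n, np.nan)
    for i in range(n):
        lo = i - half_gap - segment   # left segment: x[lo : lo+segment]
        hi = i + half_gap + 1         # right segment: x[hi : hi+segment]
        if lo >= 0 and hi + segment <= n:
            left = x[lo:lo + segment].mean()
            right = x[hi:hi + segment].mean()
            d[i] = right - left
    return d

# On a straight line the estimate is constant (proportional to the slope)
line = np.arange(10, dtype=float)
print(gap_segment_first_derivative(line, gap=1, segment=2))
```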
The Unscrambler® offers three methods for computing derivatives, as described in the
following sections:
Gap Derivatives
Gap-Segment
Savitzky-Golay
Mathematically, a derivative is the slope of the curve. A purely additive offset (as in the
curves above) contributes a constant term, and differentiation reduces a constant to zero.
All spectra should therefore have a mean of zero, and the spectral profiles are changed to
the slopes of the curves.
The next figure displays the first order derivative for the Gaussian curves.
First derivative of Gaussian curves
The zero point can be explained by the fact that at a peak maximum (or minimum), the
derivative is zero.
In complex spectra, there may be many zero points and while it is adequate to transform a
purely linear offset with a first derivative, interpretation of zero points becomes difficult.
The second derivative may be useful in this instance.
Another important feature of the second derivative is that the intensities of the original
curves can be seen in the second derivatives in order of intensity. This is an extremely useful
property, especially when performing quantitative analyses such as regression analysis.
Third and fourth derivatives
Third and fourth derivatives are available in The Unscrambler® although they are not as
popular as first and second derivatives. They may reveal phenomena which do not appear
clearly when using lower-order derivatives and can be helpful in understanding the spectral
data. Prudent use of the fourth derivative has been shown to emphasize small variations
caused by temperature changes and compositional changes. Higher-order derivatives do
significantly reduce the signal in the transformed data.
Savitzky-Golay vs. Gap-Segment
The Savitzky-Golay method and the Gap-Segment method use information from a localized
segment of the spectrum to calculate the derivative at a particular wavelength rather than
the difference between adjacent data points. In most cases, this avoids the problem of noise
enhancement from the simple difference method and may actually apply some smoothing to
the data.
The Gap-Segment method requires gap size and smoothing segment size (usually measured
in wavelength span, but sometimes in terms of data points). The Savitzky-Golay method uses
a convolution function, and thus the number of data points (segment) in the function must
be specified. If the segment is too small, the result may be no better than using the simple
difference method. If it is too large, the derivative will not represent the local behavior of
the spectrum (especially in the case of Gap-Segment), and it will smooth out too much of the
important information (especially in the case of Savitzky-Golay). Although there have been
many studies done on the appropriate size of the spectral segment to use, a good general
rule is to use a sufficient number of points to cover the full width at half height of the largest
absorbing band in the spectrum. One can also find optimum segment sizes by checking
model accuracy and robustness under different segment size settings.
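A minimal sketch of the Savitzky-Golay derivative idea: fit a polynomial to each local segment of points and evaluate its derivative at the centre. The function name and edge handling below are illustrative, not the software's exact implementation:

```python
import numpy as np

def savgol_derivative(y, left, right, polyorder, deriv=1):
    """Savitzky-Golay derivative sketch: fit a polynomial of the given
    order to each window of (left + right + 1) points and evaluate its
    derivative at the centre point. Edge points without a full window
    are set to zero.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    out = np.zeros(n)
    offsets = np.arange(-left, right + 1, dtype=float)
    for i in range(left, n - right):
        coeffs = np.polyfit(offsets, y[i - left:i + right + 1], polyorder)
        # np.polyder gives the derivative polynomial; evaluate at offset 0
        out[i] = np.polyval(np.polyder(np.poly1d(coeffs), deriv), 0.0)
    return out

# First derivative of y = t**2 is 2t (exact for polyorder >= 2)
t = np.arange(11, dtype=float)
d = savgol_derivative(t**2, left=2, right=2, polyorder=2, deriv=1)
```

Trying different window sizes (left/right points) on real spectra reproduces the trade-off described above: too few points leave the noise in, too many smooth away local features.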
Example:
Using data from a FT-NIR spectrometer, the next figure shows what happens when the
selected segment size is too small (Savitzky-Golay derivative, 3 points segment and second
order of polynomial). Noisy features remain in the spectra when the segment size is too
small.
Derivatized data with a segment size set too small
In the figure that follows, the selected segment size is too large: (Savitzky-Golay derivative,
31 points segment and second order of polynomial). One can see that some relevant
information has been smoothed out.
Derivatized data with a segment size set too large
The main disadvantage of using derivative preprocessing is that the resulting spectra can be
difficult to interpret. However, this can also be advantageous, especially when a user is
looking for both specificity and selectivity of particular constituents in complex sample
matrices.
More details regarding Derivative transforms are given in the Method References.
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined. This derivative requires that the data all be
numeric and that there are at least five variables for each sample.
In the Parameters field, choose the Derivative order, i.e. whether to compute the first,
second, third, or the fourth derivative of the samples, from the drop-down list. Then, select
the required Gap size (width of the interval between the two values used for
differentiation). The gap size should be less than or equal to (Number of Variables -
Derivative Order - 1)/Derivative Order
By selecting the preview result, one can see how the preprocessed data will look.
When the Gap derivative transformation is completed, a new matrix is created in the project
with the word Gap Derivative appended to the original matrix name. This name may be
changed by selecting the matrix, right clicking and selecting Rename from the menu.
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
In the Parameters field, choose the Derivative order, i.e. whether to compute the first,
second, third, or the fourth derivative of the samples, from the drop-down list. Then, select
the required Gap size and Segment size. The segment size + gap size should be less than or
equal to (number of variables)/(derivative order + 1).
By selecting the Preview result, one can see a preview of what the derivative data will look
like with the chosen parameter settings.
Note:
- The segment size must be an odd number for second or fourth derivative.
- The gap size must be an odd number for first or third derivative.
Make the appropriate choices in the Savitzky-Golay Derivatives dialog by first selecting the
sample and variable sets that define the matrix to be transformed by a derivative in the
Scope field. Begin by choosing the data matrix from the drop-down list. This transform can
also be performed on a results matrix, which may be selected by clicking on the select result
matrix button . For the matrix, the rows and columns to be included in the
computation are then selected. If new data ranges need to be defined, choose Define to
open the Define Range dialog where new ranges can be defined. This derivative requires
that the data all be numeric.
In the Parameters field, choose the Derivative order, i.e. the first, second, third, or the fourth
derivative of the samples, from the drop-down list. The derivative order must be less than or
equal to polynomial order. Then select the Polynomial order, i.e. the order of the polynomial
to be fitted. A polynomial order of 2 means that a second-degree equation will be used to fit
the data points. A higher number means a more flexible polynomial, i.e. a more precise
differentiation. The polynomial order must be less than or equal to the sum of left and right
side points.
One may then select the smoothing points. Note that a larger range will give a smoother
shape to the sample, but may result in a loss of valuable information. Choose the number of
left side points and right side points. From this the total number of smoothing points is
calculated (# left + # right + 1). The number of smoothing points must be less than number
of variables.
By selecting the Preview result, one can see a preview of the data before the transform and
what the derivative data will look like with the chosen parameter settings.
Note that, after the operation is completed, the data will be slightly truncated at both ends.
If p is the number of left side points and q the number of right side points in the smoothing
segment, the first p and the last q variables in the smoothed variable set will be set to zero.
This is because there are not enough points to the left (resp. right) of these variables to
compute the smoothing function.
When the Savitzky-Golay derivative transformation is completed, a new matrix is created in
the project with the word SGolay appended to the original matrix name. This name may be
changed by selecting the matrix, right clicking and selecting Rename from the menu.
10.8. Detrend
10.8.1 Detrending
Detrending is a transformation which seeks to remove nonlinear trends in spectroscopic
data.
How it works
How to use it
where A, B, C (and D, E) are the regression coefficients. The light blue expression within the
brackets is used if a third or fourth degree polynomial fit is considered. The base curve in the
above relationship is given by the fitted values ŷSNV,i and thus derived spectral values
subjected to SNV followed by DT become:
This calculation removes baseline shift and curvature which may be found in diffuse
reflectance NIR data of powders, particularly if they are densely packed. The use of this
transform does not change the shape of the data, as can happen when derivatives are
applied.
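The baseline-removal step can be sketched as a per-row polynomial fit and subtraction. This is a sketch of the basic idea only; the SNV step that the software applies before detrending is omitted here for brevity:

```python
import numpy as np

def detrend(X, order):
    """Remove a fitted polynomial baseline of the given order (1-4)
    from each row. Illustration of detrending; the preceding SNV
    step used by the software is not included.
    """
    X = np.asarray(X, dtype=float)
    n_vars = X.shape[1]
    if not 1 <= order < n_vars:
        raise ValueError("order must be >= 1 and < number of variables")
    t = np.arange(n_vars, dtype=float)
    out = np.empty_like(X)
    for i, row in enumerate(X):
        baseline = np.polyval(np.polyfit(t, row, order), t)
        out[i] = row - baseline
    return out

# A pure linear trend is removed completely by a first-order detrend
trend = np.arange(8, dtype=float).reshape(1, -1)
print(detrend(trend, 1))
```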
Example
The spectroscopic data shown hereafter display a clear nonlinear trend.
NIR Diffuse reflectance spectra of cellulose.
There is a nonlinear trend in the data, roughly indicated by the dashed, red curve (right).
The four plots hereafter show the same data after Detrending was applied with varying
polynomial orders.
NIR diffuse reflectance spectra of cellulose: the same spectra after Detrending with
polynomial order 1 to 4.
Begin by defining the data matrix from the drop-down list. For the matrix, the rows and
columns to be included in the computation are then selected. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
In the Parameters frame, select the Polynomial order (1 to 4) to apply to the data. The
polynomial order must be less than the number of variables selected to perform detrending.
By selecting the Preview result, one can see a preview of what the preprocessed data will
look like with the chosen parameter settings.
Detrending dialog with preview of results
When the detrending transformation is completed, a new matrix is created in the project
with the word Detrend appended to the original matrix name. This name may be changed by
selecting the matrix, right clicking and selecting Rename from the menu.
10.9. EMSC
10.9.1 MSC/EMSC
Multiplicative Scatter Correction (MSC) is a transformation method used to compensate for
additive and/or multiplicative effects in spectral data. Extended Multiplicative Scatter
Correction (EMSC) works in a similar way; in addition, it allows for compensation of
wavelength-dependent spectral effects.
How it works
How to use it
The idea behind MSC is that the two effects, amplification (multiplicative, scattering) and
offset (additive, chemical), should be removed from the data table to prevent them from
dominating the information (signal) in the data table.
The correction is done by two simple transformations. Two correction coefficients, a and b,
are calculated from a reference (usually the average spectrum in the data set) and used in
these computations, as represented graphically below:
Multiplicative (left) and additive (right) scatter effects:
The correction coefficients are computed from a regression of each individual spectrum onto
the average spectrum. Coefficient a is the intercept (offset) of the regression line, coefficient
b is the slope. As the MSC preprocessing uses the mean spectrum for the data set, its
success depends on how well the calculated mean spectrum resembles the true mean
spectrum, which in turn requires a sufficiently large sample set.
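The regression-and-correction step can be sketched directly from the description above: regress each spectrum on the reference, then remove the fitted offset a and slope b. A minimal sketch of the full-MSC case, assuming the mean spectrum as reference:

```python
import numpy as np

def msc(X, reference=None):
    """Multiplicative Scatter Correction sketch: regress each spectrum
    on the reference (default: the mean spectrum), then correct it as
    (spectrum - a) / b, where a is the intercept and b the slope.
    """
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, float)
    corrected = np.empty_like(X)
    for i, spectrum in enumerate(X):
        b, a = np.polyfit(ref, spectrum, 1)   # spectrum ~ a + b * ref
        corrected[i] = (spectrum - a) / b
    return corrected

# Spectra that are pure offset-and-scale copies of each other collapse
# onto the same corrected curve
base = np.array([1.0, 2.0, 4.0, 2.0, 1.0])
X = np.vstack([2.0 * base + 0.5, 0.5 * base - 0.2])
print(msc(X))
```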
EMSC
EMSC is an extension to conventional MSC, which is not limited to only removing
multiplicative and additive effects from spectra. This extended version allows a separation
of physical light scattering effects from chemical light absorbance effects in spectra.
In EMSC, new parameters h, d and e are introduced to account for physical and chemical
phenomena that affect the measured spectra. Parameters d and e are wavelength specific,
and used to compensate regions where such unwanted effects are present. EMSC can make
estimates of these parameters, but the best result is obtained by providing prior knowledge
in the form of spectra that are assumed to be relevant for one or more of the underlying
constituents within the spectra and spectra containing undesired effects. The parameter h is
estimated on the basis of a reference spectrum representative for the data set, either
provided by the user or calculated as the average of all spectra. Spectra of the pure
components known to be present in the data set can be used as Good Spectra in the EMSC
calculation, while spectra which represent the unwanted scatter effects can be used as Bad
Spectra.
More details regarding MSC/EMSC transforms are given in the Method References.
In the Multiplicative Scatter Correction dialog select the Sample (Rows) and variable (Cols)
sets that define the matrix to correct in the Scope field. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined. The minimum number of variables required
to perform this transformation is 2. If a valid MSC or EMSC model exists, check the Use
existing MSC or EMSC model box to transform the current data in exactly the same
way as was done for an earlier data matrix. This is useful if different data matrices should be
treated in the same way, e.g. new prediction samples. From the drop-down list one can
choose the model.
If test samples are to be used, check the Enable test samples box, and enter the numbers for
the rows holding those samples. At least two samples must be left for the transformation.
Variables can be omitted from the MSC/EMSC transform by checking the Enable omit
variables box, and entering the column numbers in the space provided. At least two
variables must be left to perform the transformation.
The default choice is to compute and use a new MSC or EMSC model which must then be
defined on the Options tab. One must then decide whether to make a full MSC model,
common offset (additive effects) model, or common amplification (multiplicative effects)
model in the Function field. In addition to regular MSC, one can also activate EMSC by
clicking the check box Extended options. Three extra options are now available, indicating
the available options for spectral information, channel weights and squared channel weights
used in EMSC.
Multiplicative Scatter Correction options field
When EMSC is enabled, the user must decide which effects to include. The options channel
number and squared channel number model physical effects related to wavelength-
dependent light scatter variations. Chemical effects are included in the squared spectrum.
For all three options, one can choose Not used from the drop-down list, and the effect will
not be included in the transformation. If Model only is selected, the effect will be included to
calculate EMSC parameters. By choosing Model & subtract, the effect will not only be
included, but the effect will also be subtracted from the EMSC corrected spectra. When the
extended options are chosen, two additional tabs appear on the Dialog: Spectral Info, and
Channel Weights.
The Enable Reference Spectrum field allows one to select a single spectrum from the data
acting as a typical spectrum without any additional effects. If not selected, a reference will
be calculated using the mean of all spectra. In the Enable Good Spectra and the Enable Bad
Spectra fields, one can specify several spectra from a data table that are defined as good and
bad representatives of the spectral data, respectively. Spectra of the pure components
known to be present in the data set can be used as Good Spectra. Spectra which represent
the unwanted scatter effects can be used as Bad Spectra. If the Good Spectra and the Bad
Spectra have been selected, one may also enter a subtraction weight for the respective
spectra. These subtraction weights are multiplied with the good and the bad spectra, and
the results are subtracted from the corrected spectra.
It should also be noted that the background spectra available for selection in the Enable
Reference Spectrum must have the same number of variables as the spectra to be
transformed, though they may reside in a different data matrix. It is also recommended that
the background spectrum selected be from different samples than those in the selected
scope of the data table. Overlap between reference, good and bad spectra is not allowed. A
warning message will appear if this happens.
The last tab is for setting the Channel Weights, and is available only when using EMSC. Here,
one can choose to select different weighting of the variables. It is also possible to iteratively
find better weights than the default choice, by entering a number in the Reweightings field.
The number of reweightings to be used must be between 0 and 5. The EMSC will then be run
iteratively this number of times to find improved weights.
The options for weightings are:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows the weighting of selected variables by predefined constant values.
Downweight
This allows the multiplication of selected variables by a very small number, such that
the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
By selecting the Advanced tab one can apply weights from an existing matrix by selecting a
row in a data matrix.
MSC/EMSC results
When the EMSC or MSC transformation is completed, a new matrix is created in the project
with the word MSC or EMSC appended to the original matrix name. This name may be
changed by selecting the matrix, right clicking and selecting Rename from the menu. The
results of the transform also include a model, which is an additional node in the project
navigator with several matrices for the results. The model name is MSC (or EMSC) prefixed
to the matrix from which the model was developed. The MSCMeanVar matrix gives
complete data values for the data matrix for the MSC transform. For an EMSC model the
matrix Reference Spectrum has the details on the transform.
Example
Consider a data table that consists of several spectra measured on different mixtures of two
chemical compounds where the amount of each of the two substances is varying.
The reference spectrum for the transformation can be a spectrum measured on a mixture
where the two compounds are equally represented.
Good spectra would then be spectra measured on each compound alone.
The bad spectra could then be selected as spectra believed to contain additional effects, not
caused by the chemicals.
How it works
How to use it
the a * b interaction term and can provide more meaningful interpretations of the
regression coefficients for a and b. Whether the data are centered or not, the
regression coefficient for a * b will be the same. The coefficients for a and b will
differ depending on which method is used.
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined. This transform can only be applied to
numeric data.
The dialog contains two lists: Available Effects to the left and Selected Effects to the right.
The former lists all available effects with their full names.
Select the combinations to include in the transform and press the right arrow button to
include them in the right list under Selected Effects.
To Add All, use the double right arrow button.
Use the left arrow or double left arrow buttons to remove effects from the right-most list.
The transform is applied to the data as given in the matrix. One can choose to perform the
transformation on centered and scaled data by checking the box Rescale Interactions and
square effects. The interaction level can be chosen from the drop-down list next to
Interaction level.
When the Interaction and Square effects transformation is completed, a new matrix is
created in the project with the abbreviation InS appended to the original matrix name. This
name may be changed by selecting the matrix, right clicking and selecting Rename from the
menu.
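The effect-generation step above can be sketched as follows: for each pair of selected columns, an interaction column (a*b) and the square terms (a*a, b*b) are appended to the matrix. The function and column naming are illustrative only:

```python
import numpy as np

def add_interactions_and_squares(X, names):
    """Append interaction (a*b) and square (a*a, b*b) columns to a
    design matrix. Sketch of the idea; the software's own column
    naming and ordering may differ.
    """
    X = np.asarray(X, dtype=float)
    cols = [X]
    labels = list(names)
    n = X.shape[1]
    for i in range(n):
        for j in range(i, n):          # j == i gives the square terms
            cols.append((X[:, i] * X[:, j])[:, None])
            labels.append(f"{names[i]}*{names[j]}")
    return np.hstack(cols), labels

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
extended, labels = add_interactions_and_squares(X, ["a", "b"])
print(labels)   # ['a', 'b', 'a*a', 'a*b', 'b*b']
```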
10.11. Interpolate
10.11.1 Interpolation
This transformation operates by computing piecewise smooth cubic curves and allowing the
computation of values at any intermediate points.
How it works
How to use it
In the Interpolation dialog, select the Matrix. You can choose a specific sample and variable
set within the matrix. If new data ranges need to be defined, choose Define to open the
Define Range dialog where new ranges can be defined.
If the data has numeric headers, the start and step values are detected based on the first
two headers. This is suitable for spectral data with continuous and regular intervals. In the
exceptional case where the intervals are not regular, the header values may be used as the
original scale.
The target scale to which the interpolation is to be performed needs to be specified by
entering the start and step values. The number of columns of data can also be chosen. The
maximum number of columns is restricted to three times the original number of
columns.
The interpolated data is added as a new node in the project tree.
Note: interpolation can also be performed on data without actual wavelengths or
wavenumbers by specifying arbitrary units. For instance if the data consisted of 10 columns,
one could specify the inputs as follows to reverse the columns.
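Re-sampling each row onto a new start/step scale can be sketched as below. Note the software uses piecewise cubic curves; `np.interp` is piecewise linear and is used here only as a simplified stand-in for the idea:

```python
import numpy as np

def interpolate_rows(X, old_scale, new_scale):
    """Re-sample each row from its original axis onto a new axis.
    Linear interpolation sketch; the software's actual transform
    uses piecewise cubic curves.
    """
    X = np.asarray(X, dtype=float)
    return np.array([np.interp(new_scale, old_scale, row) for row in X])

# Halve the step: 1100-1108 nm in steps of 2 becomes steps of 1
# (the wavelength values are made up for illustration)
old = np.arange(1100.0, 1110.0, 2.0)   # 5 points
new = np.arange(1100.0, 1109.0, 1.0)   # 9 points
X = np.array([[0.0, 1.0, 0.0, 1.0, 0.0]])
print(interpolate_rows(X, old, new))
```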
How it works
How to use it
Although some of the analysis methods (PCA, PCR, PLS, MCR) available in The Unscrambler®
can cope with a reasonable amount of missing values, there are still multiple advantages in
filling empty cells with estimated values:
In the Fill missing values dialog choose the data matrix from the drop- down menu. This
transform can also be performed on a results matrix, which may be selected by clicking on
the select result matrix button . For the matrix, the rows and columns to be
included in the computation are then selected. If new data ranges need to be defined,
choose Define to open the Define Range dialog where new ranges can be defined.
Fill Missing cannot be applied if one or more rows have more missing data than non-missing.
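The idea of filling empty cells with estimated values can be sketched with a generic column-mean imputation. This is an illustration only; the software's own estimator is not described here and may be more sophisticated (e.g. based on the PCA-type algorithms mentioned above):

```python
import numpy as np

def fill_missing_with_column_means(X):
    """Replace NaN cells by the mean of the non-missing values in the
    same column. Generic illustration, not the software's algorithm.
    """
    X = np.asarray(X, dtype=float).copy()
    # Refuse rows where missing values outnumber the observed ones,
    # mirroring the restriction stated above
    if (np.isnan(X).sum(axis=1) > X.shape[1] / 2).any():
        raise ValueError("a row has more missing than non-missing data")
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

X = np.array([[1.0, np.nan, 3.0],
              [2.0, 4.0, np.nan],
              [3.0, 6.0, 9.0]])
print(fill_missing_with_column_means(X))
```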
10.13. Noise
10.13.1 Noise
This transformation operates by adding additive or multiplicative noise to variables, which
can be helpful to see how this affects the model.
How it works
How to use it
In the Noise dialog, select the Matrix, and then the sample and variable sets that are to be
processed. This transform can also be performed on a results matrix, which may be selected
by clicking on the select result matrix button . If new data ranges need to be
defined, choose Define to open the Define Range dialog where new ranges can be defined.
In the Parameters field, specify the level of proportional noise (e.g. 5%) and the standard
deviation of the additive noise to be added to the data.
Noise on a variable is said to be additive when its size is independent of the level of the data
value. The range of additive noise is the same for small data values as for larger data values.
The additive noise must be greater than or equal to 0.
Noise on a variable is said to be proportional when its size depends on the level of the data
value. The range of proportional noise is a percentage of the original data values. The
designated value for proportional noise must be between 0 and 100.
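The two noise types can be sketched as follows, assuming Gaussian noise for both (the actual noise distribution used by the software is not specified here):

```python
import numpy as np

def add_noise(X, proportional_pct=0.0, additive_sd=0.0, seed=0):
    """Add proportional noise (a percentage of each data value) and
    additive noise (a constant standard deviation, independent of the
    data level) to a matrix. Gaussian noise is an assumption made for
    this sketch.
    """
    if not 0.0 <= proportional_pct <= 100.0:
        raise ValueError("proportional noise must be between 0 and 100")
    if additive_sd < 0.0:
        raise ValueError("additive noise must be >= 0")
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    proportional = X * (proportional_pct / 100.0) * rng.standard_normal(X.shape)
    additive = additive_sd * rng.standard_normal(X.shape)
    return X + proportional + additive

X = np.ones((3, 4))
noisy = add_noise(X, proportional_pct=5.0, additive_sd=0.01)
```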
By selecting the preview result, one can see how the transformed data will look.
Noise dialog with preview
When the noise transformation is completed, a new matrix is created in the project with the
word Noise appended to the original matrix name. This name may be changed by selecting
the matrix, right clicking and selecting Rename from the menu.
10.14. Normalize
10.14.1 Normalization
Normalization is used to “scale” samples in order to get all data on approximately the same
scale.
The following normalization methods are available in The Unscrambler®:
Area normalization;
Unit vector normalization;
Mean normalization;
Maximum normalization;
Range normalization;
Peak normalization.
How it works
How to use it
Area normalization;
Unit vector normalization;
Mean normalization;
Maximum normalization;
Range normalization;
Peak normalization.
Area normalization
This transformation normalizes an observation (i.e. spectrum, chromatogram) Xi by
calculating the area under the curve for the observation. It attempts to correct the
transmission spectra for indeterminate path length when there is no way of measuring it, or
isolating a band of a constant constituent or of an internal standard.
It is equivalent to replacing the original variables by a profile centered around 1; only the
relative values of the variables are used to describe the sample, and the information carried
by their absolute level is dropped. This is indicated in the specific case where all variables are
measured in the same unit, and their values are assumed to be proportional to a factor
which cannot be directly taken into account in the analysis.
For instance, this transformation is used in chromatography to express the results in the
same units for all samples, no matter which volume was used for each of them.
Caution! This transformation is not relevant if all values of the curve do not have
the same sign. It was originally designed for positive values only, but can easily be
applied to all-negative values through division by the absolute value of the average
instead of the raw average. Thus the original sign is kept.
Property of mean-normalized samples
The area under the curve becomes the same for all samples.
Maximum normalization
This is an alternative to classical normalization which divides each row by its maximum
absolute value instead of the average.
Caution! The relevance of this transformation is doubtful if all values of the curve
do not have the same sign.
Property of maximum-normalized samples
The maximum absolute value becomes 1 for all samples.
Range normalization
Here each row is divided by its range, i.e. “max value – min value”.
Property of range-normalized samples
The curve span becomes 1.
Peak normalization
This transformation normalizes a sample Xi by the value at a chosen kth data point; the
same point must be used for both the training set and the “unknowns” in prediction.
It attempts to correct spectra for indeterminate path length. Since the chosen spectral point
(usually the maximum peak of a band of the constant constituent or internal standard, or
the isosbestic point) is assumed to be concentration invariant in all samples, an increase or
decrease of the point intensity can be assumed to be entirely due to an increase or decrease
in the sample path length. Therefore, by normalizing the spectrum to the intensity of the
peak, the path length variation is effectively removed.
For peak normalization, the maximum allowed Peak variable equals the total number of variables.
Property of peak-normalized samples
All transformed samples take value 1 at the chosen constant point, as shown in the figures
below.
Raw UV-Vis spectra
Caution! One potential problem with this method is that it is extremely susceptible
to baseline offset, slope effects and wavelength shift in the spectrum.
The method requires that the samples have an isosbestic point, or have a constant
concentration constituent and that an isolated spectral band can be identified which is solely
due to that constituent.
More details regarding normalization methods are given in the Method References.
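The row-wise normalization methods described above can be sketched in one helper. The exact scaling constants used by the software (e.g. whether area is taken as a plain or absolute sum) are assumptions of this sketch:

```python
import numpy as np

def normalize(X, method, peak=None):
    """Row-wise normalization sketches for the listed methods.
    `peak` is a 0-based variable index for peak normalization.
    """
    X = np.asarray(X, dtype=float)
    out = np.empty_like(X)
    for i, row in enumerate(X):
        if method == "area":
            out[i] = row / np.abs(row).sum()
        elif method == "unit_vector":
            out[i] = row / np.linalg.norm(row)
        elif method == "mean":
            out[i] = row / row.mean()
        elif method == "maximum":
            out[i] = row / np.abs(row).max()
        elif method == "range":
            out[i] = row / (row.max() - row.min())
        elif method == "peak":
            out[i] = row / row[peak]
        else:
            raise ValueError(f"unknown method: {method}")
    return out

X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0]])
# After peak normalization at the first variable, both rows coincide,
# matching the manual's peak-normalization example
print(normalize(X, "peak", peak=0))
```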
Normalization cannot be carried out with non-numeric data, but can proceed if there are
missing values in the data.
Normalize
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
Then, select the normalization type in the Type field. The following six normalization
methods are available:
Area normalization;
Unit vector normalization;
Mean normalization;
Maximum normalization;
Range normalization;
Peak normalization.
Area normalization attempts to correct the spectra for indeterminate path length when
there is no way of measuring it, or isolating a band of a constant constituent or an internal
standard. The transformation normalizes a sample Xi by calculating the area under the curve
for the sample (i.e. spectrum, chromatogram).
Result of area normalization on two different samples
Before: 0.3 0.5 1.0 2.5 3.0 2.5 1.0
After: 0.111 0.185 0.370 0.926 1.111 0.926 0.370
Peak normalization normalizes a sample as the ratio of each value by the value at a selected
variable (wavelength, retention time). The chosen point (usually the maximum peak of a
band of the constant constituent, or the isosbestic point) is assumed to be concentration
invariant in all samples.
Peak Normalization
Type in the number of the peak variable in box next to Peak normalization.
By selecting the preview result, one can see how the preprocessed data will look.
Note: If data are peak-normalized before building a model for later use in
prediction or classification, make sure that the same peak variable is selected when
normalizing the prediction samples!
Result of peak normalization on two different samples

Before     After
1 2 3 4    1 2 3 4
2 4 6 8    1 2 3 4
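The ratio operation described above can be sketched in a few lines; `peak_index` is an illustrative name for the chosen peak variable.

```python
import numpy as np

def peak_normalize(X, peak_index):
    """Divide each sample (row) by its value at the selected peak
    variable, assumed concentration-invariant across samples."""
    X = np.asarray(X, dtype=float)
    return X / X[:, [peak_index]]

X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0]])
out = peak_normalize(X, peak_index=0)   # normalize on the first variable
```

Both samples collapse onto the same profile once each is divided by its own value at the peak variable.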
10.15. OSC
10.15.1 Orthogonal Signal Correction (OSC)
OSC can be used as a transformation method for building PLS regression models from
spectral data. It removes extraneous variance from the x data, sometimes making the PLS
model more accurate.
PLS models built on OSC transformed data should be interpreted with great caution.
OSC will make the model fit appear very good, but may not improve predictions on separate
test sets. It is important to hold out some test samples as a final sanity check on the model
and how the OSC has improved it.
The OSC transform removes from the X data the variation that is orthogonal to Y.
Inputs
The inputs are the matrix of predictor variables (X) and predicted variable(s) (Y),
scaled as desired, and the number of OSC components to calculate.
Usually, 1-3 OSC components are sufficient. Optional input variables are the
maximum number of iterations used in attempting to maximize the variance
captured by the orthogonal component, and the tolerance on percent of X-variance
to consider in formation of the final w-vector.
Outputs
The outputs are the OSC corrected X-matrix and the weights, loadings and scores
that were used in making the correction.
Once the OSC model has been made, new (scaled) x data can be corrected from Tasks – Transform – OSC… by selecting a saved OSC model.
More details regarding OSC transforms are given in the Method References.
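The correction can be sketched for a single OSC component. This is a simplified illustration of the general idea (find a large-variance score in X, force it orthogonal to Y, and deflate X by it), not the exact algorithm implemented in the software.

```python
import numpy as np

def osc_one_component(X, y):
    """Simplified one-component OSC sketch: start from the dominant
    principal component score of X, orthogonalize it against y,
    and remove the corresponding variation from X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    # first principal component score of X
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    t = U[:, [0]] * s[0]
    # remove the part of the score that is correlated with y
    t = t - y @ np.linalg.lstsq(y, t, rcond=None)[0]
    # express the orthogonal score through X via a weight vector w
    w = np.linalg.lstsq(X, t, rcond=None)[0]
    t = X @ w
    t = t - y @ np.linalg.lstsq(y, t, rcond=None)[0]  # re-orthogonalize
    p = X.T @ t / float(t.T @ t)                      # loading vector
    return X - t @ p.T, w, p, t

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
y = rng.normal(size=20)
X_osc, w, p, t = osc_one_component(X, y)
```

The removed score is orthogonal to y by construction, so only Y-irrelevant variation is taken out of X; the weights, loadings and scores returned correspond to the outputs listed above.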
Begin by defining the data matrix for the Predictor Variables (X) from the drop-down list.
This transform can also be performed on a results matrix. Choose these matrices by clicking
on the select result matrix button. Next, select the rows and columns to be
included in the computation. If new data ranges need to be defined, choose Define to open
the Define Range dialog where new ranges can be defined. Then proceed to select the
matrix for the Predicted variables (Y).
If a valid OSC model already exists, it can be used for the transformation of a new matrix by
selecting it next to the Use existing OSC Model. The model must have loadings and weights
matrices saved to it.
By selecting the preview result, the effect of the OSC transformed data can be visualized.
Weights tabs
In the X- or Y-Weights dialog, choose the data matrix from the drop-down list. This
transform can also be performed on a results matrix, which may be selected by clicking on
the more button. For the matrix, the rows and columns to be included in the
computation are then selected (containing only numeric data). If new data ranges need to
be defined, choose Define to open the Define Range dialog where new ranges can be
defined.
Then, select the variables that the weighting will be applied to; all variables can be
selected by selecting one variable, and then clicking the All button under the variable
selection window. The selection can also be made by typing in the variable numbers and
clicking Select. After making the selection of variables, select the weighting to be used using
the radio buttons in the Select tab. To apply the weighting, click Update, and then OK.
There are four weighting methods available:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows the weighting of selected variables by predefined constant values.
Downweight
This allows the multiplication of selected variables by a very small number, such that
the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
Options tab
On the Options tab, choose the Number of OSC Components. Usually, 1-3 OSC components
are sufficient. Then, select the algorithm to apply from the following:
NIPALS
Non-linear Iterative Partial Least Squares. This algorithm handles missing values and
is suitable for computing only the first few components of a large data set. This
method however accumulates errors that can become large in higher principal
components. Since the NIPALS algorithm is iterative, the maximum number of
iterations can be tuned in the Max iterations box. The default value of 100 should
be sufficient for most data sets, however some large and noisy data may require
more iterations to converge properly. The maximum allowed number of iterations is
30,000.
SVD
Singular Value Decomposition. This algorithm does not handle missing values and is
best suited for small data sets or “tall” or “wide” data. This algorithm produces
higher accuracy results but it is not suited for data sets with a high number of both
samples and variables since the algorithm always computes all components.
The NIPALS algorithm calculates one principal component at a time and it handles missing
values well, whereas the SVD algorithm calculates all of the principal components in one
calculation, but does not handle missing values.
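The iterative nature of NIPALS can be illustrated with a minimal single-component loop. This is a sketch only; the software's implementation, including its missing-value handling, is more elaborate.

```python
import numpy as np

def nipals_component(X, max_iter=100, tol=1e-12):
    """Extract one principal component of a centered matrix X by the
    NIPALS iteration (no missing-value handling in this sketch)."""
    X = np.asarray(X, dtype=float)
    t = X[:, [0]].copy()                    # initial score vector
    for _ in range(max_iter):
        p = X.T @ t / float(t.T @ t)        # loading vector
        p /= np.linalg.norm(p)
        t_new = X @ p                       # updated score vector
        if np.linalg.norm(t_new - t) < tol:
            t = t_new
            break
        t = t_new
    return t, p

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
X = X - X.mean(axis=0)                      # NIPALS assumes centered data
t, p = nipals_component(X)
```

The loop converges to the same dominant component an SVD would return, but one component at a time, which is why NIPALS is preferred when only the first few components of a large data set are needed.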
OSC Options
When the OSC transform has been applied to the data, there will be two new nodes created
in the project navigator: one for the OSC model (and corresponding result matrices that
have been designated to be included in the outputs), and another for the transformed data.
The transformed data matrix will have OSC appended to the original data matrix name.
OSC results in project navigator
distribution. Then, for each observation, the lowest value is replaced with the lowest value
of the reference distribution, the second lowest value is replaced with the second lowest
value of the reference distribution, and so on. The end result is that each transformed row
contains exactly the same data as the reference distribution, however sorted in the order of
the original observations.
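The rank-and-replace procedure just described can be sketched as follows; using the mean of identically ranked values as the default reference corresponds to the 'Mean row' option.

```python
import numpy as np

def quantile_normalize(X, reference=None):
    """Replace each row's sorted values by the reference distribution,
    preserving the row's original rank order. If no reference vector is
    given, the mean of identically ranked values is used ('Mean row')."""
    X = np.asarray(X, dtype=float)
    sorted_rows = np.sort(X, axis=1)
    ref = sorted_rows.mean(axis=0) if reference is None else np.asarray(reference, float)
    ranks = np.argsort(np.argsort(X, axis=1), axis=1)  # rank of each value in its row
    return ref[ranks]

X = np.array([[5.0, 2.0, 3.0],
              [4.0, 1.0, 6.0]])
out = quantile_normalize(X)
```

Each transformed row now contains exactly the reference values, sorted in the order of the original observations, as described above.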
Quantile normalization should be used with caution and only when the reference
distribution can be assumed to be representative for all samples in the data table. It is
particularly dangerous to use if the reference distribution contains more than a single peak,
as data values will be forced to move between neighbouring peaks (disguising differences) if
the cluster sizes vary from one observation to the next.
Missing or non-numeric data are not allowed in QN.
In the Quantile Normalization dialog, select the Matrix to transform, including the relevant
row and columns sets. Data from previous results may be selected by pressing the select
result matrix button. New data ranges may be selected from the Define Range
dialog if Define is pressed.
Three choices of reference distributions are available. The mean or median of identically
ranked data values across observations is estimated by selecting the ‘Mean row’ or ‘Median
row’ radio button, respectively. Alternatively, the ‘Reference vector’ allows you to input
your own choice of reference distribution. Make sure that neither the data nor the reference
vector contains non-numeric or missing values.
Note: Never use quantile normalization unless you have good reason to believe that your observations should be distributed identically.
The Preview result option enables you to compare the data before and after transformation.
Quantile dialog with preview
Increase precision;
Get more stable results;
Reduce noise;
Interpret the results more easily.
In The Unscrambler® this is done from the menu using Tasks – Transform – Reduce
(Average)…
Application example
Improve the precision in sensory assessments by taking the average of the sensory ratings
over all panelists.
Average replicate measurements of the same sample to increase signal to noise.
Reduce the number of variables in spectral data with a very large number of variables to make the data more manageable.
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button. For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
A minimum of two samples and two variables is required to perform this transformation.
Choose whether to Reduce along Variables or Samples in the Reduce (Average) dialog. The number of adjacent samples or variables to be averaged must be given in the
Reduction Factor field, where the value can be changed using the spin box from 2 up to the
number of variables being transformed.
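The averaging of adjacent variables can be sketched as a reshape-and-mean. This illustration assumes the number of variables is divisible by the reduction factor; how the software handles remainders is not shown here.

```python
import numpy as np

def reduce_average(X, factor):
    """Average each group of `factor` adjacent columns (variables).
    Assumes the column count is divisible by the reduction factor."""
    X = np.asarray(X, dtype=float)
    n_groups = X.shape[1] // factor
    return X[:, :n_groups * factor].reshape(X.shape[0], n_groups, factor).mean(axis=2)

X = np.array([[1.0, 3.0, 5.0, 7.0],
              [2.0, 4.0, 6.0, 8.0]])
out = reduce_average(X, factor=2)
```

With a reduction factor of 2, each pair of adjacent variables is replaced by its average, halving the number of columns.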
Note: All defined sets will be adjusted according to the reduction performed.
10.18. Smoothing
10.18.1 Smoothing methods
Smoothing helps reduce the noise in the data without reducing the number of variables. It is a row-oriented transformation; that is to say, the contents of a cell are influenced by its horizontal neighbors.
This transformation is relevant for variables which are themselves a function of some underlying variable, for instance time, or where intrinsic spectral intervals exist. Smoothing cannot be performed with non-numeric data, but can be applied when there are missing data.
In smoothing, X-values are averaged over one segment symmetrically surrounding a data
point. The raw value on this point is replaced by the average over the segment, thus creating
a smoothing effect.
A submenu to the Tasks – Transform – Smoothing menu provides four different methods for
smoothing of data:
Moving Average
first finds a data value by averaging the values within a segment of data points
Savitzky-Golay
finds a data value by making a polynomial to fit the data points using a number of
data points on each side
Median Filter
finds a data value by taking the median within a segment of data points
Gaussian Filter
finds a data value by computing a weighted moving average within a segment of
data points.
More details regarding Smoothing methods are given in the Method References.
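The three simplest methods can be sketched with one helper. The handling of the segment at the ends of the vector is an assumption here (the segment is simply truncated), and the Gaussian weights shown are illustrative.

```python
import numpy as np

def smooth(x, segment=5, method="moving_average"):
    """Replace each point by a statistic of the symmetric segment
    centered on it (sketch of moving average, median and Gaussian
    filters; end points use a truncated segment here)."""
    x = np.asarray(x, dtype=float)
    half = segment // 2
    out = np.empty_like(x)
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        window = x[lo:hi]
        if method == "moving_average":
            out[i] = window.mean()
        elif method == "median":
            out[i] = np.median(window)
        elif method == "gaussian":
            j = np.arange(lo, hi) - i        # distance from the center point
            w = np.exp(-0.5 * (j / max(half, 1)) ** 2)
            out[i] = (w * window).sum() / w.sum()
    return out

noisy = np.array([1.0, 1.2, 5.0, 1.1, 0.9, 1.0, 1.1])
despiked = smooth(noisy, segment=3, method="median")   # removes the spike
```

Note how the Gaussian weights decrease away from the center point, in contrast to the equal weights of the moving average, and how the median filter removes an isolated spike that an average would only dilute.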
As can be seen, points closer to the center have a larger coefficient in the Gaussian filter
than in the moving average, while the opposite is true of points close to the borders of the
segment.
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button. For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
Then, enter the size of the segment to be used for smoothing, i.e. how many adjacent
columns should be used to compute the Gaussian fitted value, in the Parameters field. The
segment size must be less than or equal to the number of variables.
By selecting the Preview result, one can see a preview of what the preprocessed data will
look like with the chosen parameter settings.
Begin by defining the data matrix from the drop-down list. For the matrix, the rows and
columns to be included in the computation are then selected. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
Then in the Parameters field, enter the size of the segment to be smoothed, i.e. how many adjacent columns should be used to compute the median. The segment size must be less than or equal to the number of variables.
By selecting the Preview result, one can see a preview of what the preprocessed data will
look like with the chosen parameter settings.
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button. For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
The size of the segment to be averaged is then entered, i.e. how many adjacent columns
should be used to compute the average value, in the Parameters field. In smoothing, X
values are averaged over one segment symmetrically surrounding a data point. The raw
value on this point is replaced by the average over the segment, thus creating a smoothing
effect. The segment size for smoothing must be less than or equal to the number of variables.
By selecting the Preview result, one can see a preview of what the preprocessed data will
look like with the chosen parameter settings.
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button. For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog.
Three parameters need to be defined to perform a Robust Lowess Smooth:
The number of iterations the calculation should perform to reach convergence;
The Smoothing f factor, which is set between 0 and 1;
The Delta value.
By selecting the Preview result, a preview of what the preprocessed data will look like with
the chosen parameter settings will be displayed. This can also be used to look at the effect of
the transformation in real time.
Savitzky-Golay smoothing cannot be performed with non-numeric data or where there are
missing data.
The minimum number of variables required for Savitzky-Golay smoothing is 3.
Savitzky-Golay Smoothing
In the Savitzky-Golay Smoothing dialog, begin by defining the data matrix to be smoothed from the drop-down list. This transform can also be performed
on a results matrix, which may be selected by clicking on the select result matrix button. For the matrix, the rows and columns to be included in the computation are then
selected. If new data ranges need to be defined, choose Define to open the Define Range
dialog where new ranges can be defined.
The polynomial order is selected in the Parameters field. For instance, a polynomial order of
2 means that a second-degree equation will be used to fit the data points. The polynomial
order must be less than or equal to the sum of left and right side points.
The smoothing points are defined by choosing the number of left side points and right side points separately. The number of smoothing points must be less than the number of variables. The number of smoothing points on the left and right side must be the same if the symmetric kernel box is checked. By unchecking this box, a different number of points may be set for each side (though this is not recommended for spectral data). Note that a larger value for smoothing points will give a smoother shape to the data, but may result in the loss of some information.
By selecting the preview result, one can see what the data look like with given smoothing
settings.
Savitzky-Golay smoothing dialog with preview
Note that, after the smoothing operation is completed, the data will be slightly truncated at
both ends. If p is the number of left side points and q the number of right side points in the
smoothing segment, the first p and the last q variables in the smoothed variable set will be
set to zero. This is because there are not enough points to the left (resp. right) of these
variables to compute the smoothing function.
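A minimal numeric sketch of the procedure, including the end-point truncation described above (the first p and last q output values set to zero):

```python
import numpy as np

def savgol_smooth(x, left=2, right=2, polyorder=2):
    """Savitzky-Golay smoothing sketch: least-squares fit a polynomial
    over each window of left + right + 1 points and keep the fitted
    value at the center. The first `left` and last `right` points are
    set to zero, as the text describes."""
    x = np.asarray(x, dtype=float)
    j = np.arange(-left, right + 1)
    A = np.vander(j, polyorder + 1, increasing=True)   # columns 1, j, j^2, ...
    h = np.linalg.pinv(A)[0]        # coefficients giving the fitted value at j = 0
    out = np.zeros_like(x)
    for i in range(left, len(x) - right):
        out[i] = h @ x[i - left: i + right + 1]
    return out

x = np.arange(10.0) ** 2            # an exact quadratic
y = savgol_smooth(x, left=2, right=2, polyorder=2)
```

Because the data are exactly quadratic and the polynomial order is 2, the interior points are reproduced exactly; a noisy signal would instead be smoothed.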
Select the data matrix with spectra in the Spectroscopic Transformation dialog. This
transform can also be performed on a results matrix, which may be selected by clicking on
the select result matrix button. Then the rows and columns to include must be
selected. If the ranges of interest are not available in the drop-down boxes, choose Define to
open the Define Range dialog where new ranges can be selected.
Choose among the available transformations in the Type frame. Four types of
transformations can be performed:
Absorbance to reflectance, or
Absorbance to transmittance;
Reflectance to absorbance, or
Transmittance to absorbance;
Reflectance to Kubelka-Munk units
Basic ATR Correction.
When Basic ATR Correction is selected, Units and Reference value boxes will be available with a default value of 1000. This is the wave number at which the absorbance transformed ATR spectrum is expected to be the same as an absorbance transformed transmission spectrum of the same sample. Available units are and .
Select Preview result to view the spectra before and after transformation.
Spectroscopic transformation with preview
When the spectroscopic transformation is completed, a new matrix is created in the project
with the word Spectroscopic appended to the original matrix name. This name may be
changed by selecting the matrix, right clicking and selecting Rename from the menu.
Like MSC, the practical result of SNV is that it removes multiplicative interferences of scatter
and particle size effects from spectral data. These transforms for scatter corrections are
typically used with diffuse reflectance data.
An effect of SNV is that on the vertical scale, each spectrum is centered on zero and varies
roughly from –2 to +2. Apart from the different scaling, the result is similar to that of MSC.
The practical difference is that SNV standardizes each spectrum using only the data from
that spectrum; it does not use the mean spectrum of any set. The choice between SNV and
MSC is a matter of taste. Since the MSC normalizes based on the mean spectrum in a data
set, it is best suited for similar sample sets.
More details regarding SNV transform are given in the Method References.
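The per-spectrum standardization can be sketched directly. The sample standard deviation (N - 1 denominator) is used here; whether the software divides by N or N - 1 is not specified in this text.

```python
import numpy as np

def snv(X):
    """Standard Normal Variate: center each spectrum (row) on zero and
    divide by its own standard deviation; no mean spectrum of the set
    is involved, unlike MSC."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)   # sample SD (assumption)
    return (X - mean) / std

X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0]])
out = snv(X)
```

The two rows, which differ only by a multiplicative factor, become identical after SNV, illustrating the removal of multiplicative scatter effects.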
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button. For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
By selecting the preview result, one can see how the preprocessed data will look.
When the SNV transformation is completed, a new matrix is created in the project with the
word SNV appended to the original matrix name. This name may be changed by selecting
the matrix, right clicking and selecting Rename from the menu.
10.21. Transpose
10.21.1 Transposition
Matrix transposition consists of exchanging the rows and columns of the data table.
It is particularly useful if the data have been imported from external files where they were
stored with one row for each variable.
Category variables are automatically split when a table containing such variables is transposed. A transpose cannot be performed on a matrix containing non-numeric data.
Note: All defined sets are also transposed.
Select the data matrix to be transposed by highlighting it, and go to Tasks – Transform – Transpose. Alternatively, one can select the data matrix and right-click to select Transform – Transpose.
When the transpose transformation is completed, a new matrix is created in the project with
the word Transposed appended to the original matrix name. This name may be changed by
selecting the matrix, right clicking and selecting Rename from the menu.
10.23. Weights
10.23.1 Weights
Depending on the kind of information to be extracted from data, it may be necessary to
apply weights to the variables. Often the weights are based on the standard deviation of the
variables, i.e. square root of variance, which expresses the variance in the same unit as the
original variable.
Weighting of spectra may make it more difficult to interpret loadings plots, and one runs the
risk of inflating noise in wavelengths with little information. Thus, spectral data are generally
not weighted, but there are exceptions.
Constant
A/SDev+B
Downweight
Block Weighting
standardization of variables gives an analysis that interprets the variation relative to the
extremes in the data table.
The opposite, no weighting at all, gives an analysis that has a closer relationship to the
individual assessor’s personal extremes, and these are strongly related to their very
subjective experience and background.
It is generally recommended to use standardization for sensory data. This procedure,
however, has an important disadvantage: it may increase the relative influence of unreliable
or noisy attributes (see the Caution in the Weighting Options section).
Weighting: The case of spectroscopic data
Standardization of spectra may make it more difficult to interpret loadings plots, and one
may risk inflating noise in wavelengths with little information. Thus, spectra are generally
not weighted, but there are exceptions.
In the Weights dialog, choose the data matrix from the drop-down list. This transform can
also be performed on a results matrix, which may be selected by clicking on the more button
. For the matrix, the rows and columns to be included in the computation are then
selected (containing only numeric data). If new data ranges need to be defined, choose
Define to open the Define Range dialog where new ranges can be defined.
Then, select the variables that the weighting will be applied to; all variables can be
selected by selecting one variable, and then clicking the All button under the variable
selection window. The selection can also be made by typing in the variable numbers and
clicking Select. After making the selection of variables, select the weighting to be used using
the radio buttons in the Select tab. To apply the weighting, click Update, and then OK.
There are four weighting methods available:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows the weighting of selected variables by predefined constant values.
Downweight
This allows the multiplication of selected variables by a very small number, such that
the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
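The A/(SDev + B) option can be sketched as a per-variable multiplication; with the defaults A = 1, B = 0 it reduces to ordinary 1/SDev standardization.

```python
import numpy as np

def sdev_weights(X, A=1.0, B=0.0):
    """Weight each variable (column) by A / (SDev + B). The sample
    standard deviation (N - 1 denominator) is assumed here."""
    X = np.asarray(X, dtype=float)
    w = A / (X.std(axis=0, ddof=1) + B)
    return X * w, w

X = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [3.0, 50.0]])
Xw, w = sdev_weights(X)
```

After weighting, every column has unit standard deviation, so variables measured on very different scales get the same influence in a model.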
When the weights transformation is completed, a new matrix is created in the project with
the word Weighted appended to the original matrix name. This name may be changed by
selecting the matrix, right clicking and selecting Rename from the menu.
Weighting can also be done when beginning an analysis, if one does not want to transform
the data, but is only concerned with applying weights during the analysis itself. A tab for
weights (or X weights, Y weights, and Z weights) is presented in the option for many
analyses, such as PCA and regression (MLR, PLS, PCR, L-PLS), as well as when doing linear
discriminant analysis or support vector machine classification.
Weights tab within PCA dialog
11. Univariate Statistics
11.1. Descriptive statistics
The Descriptive Statistics option in The Unscrambler® provides some simple and effective
plotting tools for gaining an overview of small to medium sized data sets. The tools in this
menu option are mainly used to confirm observations found in multivariate models.
Theory
Usage
Plot Interpretation
Method reference
Purposes
Parametric statistics
Terminology
The normal distribution
Measures of central tendency
The mean
The median
The mode
Measures of dispersion
Variance
Standard deviation
Range
Degrees of freedom
Skewness and kurtosis
Quartiles
11.2.1 Purposes
The main results to be found by performing Descriptive Statistics are:
Plots of the Mean and Standard Deviation of the chosen variables.
Box plots of the variables.
Scatter Effects plots, used to compare the linearity of data when plotted against the
mean of the data.
Cross-correlation matrix, for investigating variable correlations.
There are no formal statistical tests performed in the Descriptive Statistics module, these
can be found in the Tasks - Analyze - Statistical Tests… menu.
Parametric statistics
By parametric statistics, it is assumed that the samples under investigation come from a population with a known underlying distribution, typically a normal distribution. Parametric statistics are sensitive to the underlying parameters, which in the case of a normal distribution are the mean (μ) and the standard deviation (σ).
Terminology
In the statistical literature, it is common practice to denote parameters, i.e. those measures related to a population, by Greek symbols and to denote statistics, i.e. those measures related to samples, by Roman letters (Miller and Miller, 2005). The following table provides examples of some common parameters and statistics.

           Mean   Variance   Standard deviation
Parameter  μ      σ²         σ
Statistic  x̄      s²         s
The median
Another common measure used in statistics to describe central tendency is the median. The
median is known as a non-parametric or robust statistic. The median is calculated as the
pivot point of a set of ordered observations. For instance, consider the number sequence
below:
1 2 3 4 5
The number of observations is odd. Therefore placing the pivot point under the value 3
balances the data, i.e. two observations on either side. When the number of observations is
even, as in the case below:
1 2 3 4 5 6
the balance point now does not lie on a single number, but midway between the numbers 3
and 4. Therefore the median would in this case be 3.5.
In the first case above, the median was 3 and it can be shown that the mean value is also 3.
Now consider the following set of numbers:
1 2 3 4 50
The median is still 3, while the mean is now much greater than 3. This is why the median is
referred to as a robust statistic, i.e. it is robust to outliers.
The mode
The Mode is defined as the most commonly occurring value in a data set. For example, in the
following set of observations:
1 2 3 3 3 4 5
the mode is 3 as this is the most commonly occurring value.
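The worked examples above can be checked directly with Python's standard library:

```python
from statistics import mean, median, mode

assert median([1, 2, 3, 4, 5]) == 3 and mean([1, 2, 3, 4, 5]) == 3
assert median([1, 2, 3, 4, 5, 6]) == 3.5        # even count: midway value

outliers = [1, 2, 3, 4, 50]
assert median(outliers) == 3                    # the median is unchanged...
assert mean(outliers) == 12                     # ...while the mean is pulled up

assert mode([1, 2, 3, 3, 3, 4, 5]) == 3         # most frequent value
```

This makes the robustness of the median concrete: a single extreme observation moves the mean from 3 to 12 but leaves the median untouched.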
Standard deviation
From the formula for variance, it can be seen that the value obtained for variance is in the
original units of measure squared. The Standard Deviation is a measure of spread, given in
the same units as the original observations. In parametric statistics, this value is most
commonly used when describing a normal distribution and is used in many of the hypothesis
tests to be discussed later in this section.
Range
The Range of a data set is defined as the highest observed value minus the lowest observed
value in a data set. It is a non-parametric method of describing dispersion and should be
used instead of the standard deviation when the number of observations is less than 5.
Degrees of freedom
The Degrees of Freedom (DOF) is the number of independent measures in a data set that can
be varied independently when a value of a chosen statistic is fixed. Put simply, if all but one
value in a set of observations are known, as well as the mean, one can calculate the missing
value. Therefore the degrees of freedom are calculated as the number of observations
minus 1.
The formula for variance and standard deviation reflect this, and correct for bias using N-1 as
the denominator. For large samples, the difference diminishes.
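The N - 1 correction is what separates the sample from the population standard deviation; Python's standard library exposes both:

```python
from statistics import pstdev, stdev

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
sample_sd = stdev(data)        # divides the squared deviations by N - 1
population_sd = pstdev(data)   # divides by N
# the sample estimate is always slightly larger, and the difference
# shrinks as the number of observations grows
```

For these eight values the population standard deviation is exactly 2.0, while the N - 1 version is slightly larger, reflecting the lost degree of freedom.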
Skewness and kurtosis
The Skewness of a distribution is a measure of its asymmetry and is referred to as the third
central moment of the distribution. The degree of this asymmetry is determined by the
coefficient of skewness.
Distributions that are skewed to the left have a negative coefficient of skewness and distributions skewed to the right have a positive coefficient of skewness (Hogg and Craig, 1978). The
following represent some common distributions, including the left and right skew
distributions.
IQR = Q3 - Q1
This provides a non-parametric estimate of the dispersion of a data set.
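A quick sketch; note that different software packages use slightly different conventions for estimating quantiles, so the exact Q1 and Q3 values may differ from The Unscrambler's.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 100])   # one extreme value
q1, q3 = np.percentile(values, [25, 75])        # NumPy's default (linear) rule
iqr = q3 - q1
full_range = values.max() - values.min()
# the IQR ignores the extreme value, while the range is dominated by it
```

This illustrates why the IQR is a robust measure of dispersion: the extreme observation inflates the range but barely affects the interquartile range.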
Use the Data input options to select a matrix to analyze, and use the rows and columns
drop-down lists to select predefined sets.
Use the Define button to add new sub ranges of the original matrix to analyze.
Check the Compute Correlation matrix box to display a matrix plot of the variable correlations.
Solution: Ensure that enough samples and variables are available for the calculation using
the Define option.
To view which samples/variables have been kept out of a particular data set, click on the
More Details option in the data input dialog, as shown below.
When the data has been correctly set up for analysis, click on OK to display the descriptive
statistics results.
Proceed to interpreting the results.
Quantiles
This plot contains one box plot for each variable, either over the whole sample set or for different subgroups. It shows the minimum, the 25th percentile (lower quartile), the median, the 75th percentile (upper quartile) and the maximum.
The box-plot shows 5 percentiles
Note: If there are fewer than five samples in the data set, the percentiles are not
calculated. The plot then displays one small horizontal bar for each value (each
sample). Otherwise, individual samples do not appear on the plot, except for the
maximum and minimum values.
General case
This plot is an excellent summary of the distributions of the variables. It shows the
total range of variation of each variable. Check whether all variables are within the
expected range. If not, out-of-range values are either outliers or data transcription
errors. Check the data and correct the errors!
If groups of samples have been plotted (e.g. Design samples, Center samples), there
is one box-plot per group.
Check that the spread (distance between Min and Max) over the Center samples is
much smaller than the spread over the Design samples. If not, some possible
explanations include,
Spectra
A quantiles plot can also be used as a diagnostic tool to study the distribution of a
whole set of related variables, for instance in spectroscopy the absorbances for
several wavelengths. In such cases, it is recommended not to use subgroups,
otherwise the plot may be too complex to provide interpretable information.
In the figure below, the percentile plot shows the general profile of a spectrum,
which may be common to all samples in the data set. The plot can be used to detect
which wavelengths (regions of the spectrum) have the largest variation. It is most
likely that these contain the most information.
Percentile plot for variables making up a spectrum
In some cases, the variation contained in certain parts of a spectrum may not be
relevant to the problem under study. The figure below demonstrates this by
showing an almost uniform spread over all wavelengths. This may cause suspicion,
as wavelengths with absorbances close to zero (i.e. baseline) have a large variation
for the samples analyzed. This may indicate a baseline shift, which can be corrected
using multiplicative scatter correction (MSC).
The scatter effects plot may be used to check such a hypothesis!
Equal baseline and major absorbance variation should be treated as suspicious
The average response value indicates the central tendency of the samples under
investigation. The standard deviation is a measure of the spread of the variable around that
average. If several variables are studied together, compare their standard deviations. If
there is considerable variation in the standard deviation values between variables, it is
recommended that the variables be standardized in later multivariate analyses (e.g. PCA,
PLS). This applies to variables of differing orders of magnitude (e.g. process variables),
sensory data, or other data coming from a number of different sources.
Standardization should not be applied to spectral data as this may inflate the
variance of non-important regions, possibly making them artificially significant.
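Standardization as described here divides each mean-centred variable by its standard deviation, giving every variable unit variance. A minimal sketch (illustrative data, not The Unscrambler's implementation):

```python
import numpy as np

# Rows are samples; the two variables differ by orders of magnitude
X = np.array([[1.0, 1000.0],
              [2.0, 1100.0],
              [3.0,  900.0],
              [4.0, 1200.0]])

# Centre each variable, then divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```

After standardization both columns have unit variance, so neither dominates a subsequent PCA or PLS model.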
Mean
Bar Plot
For each variable, the average value over all samples is displayed as a vertical bar for a
single variable, or as a series of bars for many variables.
Mean plot
Standard deviation
Bar Plot
For each variable, the standard deviation (square root of the variance) over all samples in
the chosen sample set is displayed. This plot may be useful for detecting which variables
have the largest absolute variation. If the variables have different standard deviations, it
may be necessary to standardize them in later multivariate analyses.
Standard Deviation plot of spectral data
Quantiles
See the description in the General section.
Scatter effects
The scatter effects plot shows each sample plotted against the average (mean) sample.
Scatter effects display themselves as differences in slope and/or offset between the lines in
the plot. Differences in the slope are caused by multiplicative scatter effects. Offset error is
due to additive effects. Sometimes the lines show profiles that deviate considerably from a
straight line. In such instances, caution must be taken when applying scatter correction, as
major chemical information may be confused with systematic scatter effects and therefore
lost in the transformation. For an excellent reference on this situation, see the article by
Martens et al. in the reference section for this chapter.
Applying Multiplicative Scatter Correction will improve the model if these scatter effects are
detected in the data table. The examples below provide a basic guide as to what to look for.
Two cases of scatter effects: Additive (left), Multiplicative (right)
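MSC itself is a simple regression of each spectrum on the mean spectrum: the fitted offset removes the additive effect and the fitted slope the multiplicative effect. A minimal sketch on synthetic "spectra" (not the product's implementation):

```python
import numpy as np

def msc(spectra):
    """Multiplicative scatter correction: regress each spectrum on the
    mean spectrum, then remove the fitted offset (additive effect)
    and slope (multiplicative effect)."""
    ref = spectra.mean(axis=0)                    # mean spectrum
    corrected = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        slope, offset = np.polyfit(ref, s, 1)     # s ~ offset + slope * ref
        corrected[i] = (s - offset) / slope
    return corrected

# One underlying profile distorted by additive and multiplicative scatter
profile = np.sin(np.linspace(0.0, np.pi, 50))
spectra = np.vstack([0.5 + 1.2 * profile,
                     -0.3 + 0.8 * profile,
                     0.1 + 1.0 * profile])
corrected = msc(spectra)   # all three rows collapse onto one curve
```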
Cross-correlation
Matrix Plot
The Matrix plot shows the cross-correlations between all variables included in a statistics
analysis. The matrix is symmetrical (the correlation between A and B is the same as between
B and A) and its diagonal elements are all 1, since the correlation between a
variable and itself is 1. All other values are between -1 and +1. A large positive value (as
shown in red in the figure below) indicates that the corresponding two variables have a
tendency to increase simultaneously. A large negative value (as shown in blue in the figure
below) indicates an inverse relationship of the variables. A correlation close to 0 (light green
in the figure below) indicates that the two variables vary independently from each other.
It is suggested to use a matrix plot consisting of “bars” (the default) or a “map” for
studying cross-correlations. Examples are provided below.
Cross-correlation plot, with Bars and Map layout
      A      B      C
A     1      0.76   -0.32
B     0.76   1      -0.09
C     -0.32  -0.09   1
The table is symmetrical, like the corresponding matrix plot, and is used to isolate the
quantitative values of correlation that exist between the variables under study.
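Such a cross-correlation matrix can be reproduced with NumPy; the data below are synthetic, with B constructed to follow A and C independent of both:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=100)
B = 0.8 * A + 0.2 * rng.normal(size=100)   # strongly tied to A
C = rng.normal(size=100)                   # independent of A and B

# Correlation matrix; columns are treated as variables
R = np.corrcoef(np.column_stack([A, B, C]), rowvar=False)
# R is symmetric with ones on the diagonal; R[0, 1] is large and
# positive, while R[0, 2] is close to zero
```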
Min, Max & Mean
This option shows a whisker plot with the minimum, mean and maximum value for each
variable in the top plot, with the value for the selected sample shown on that plot as a green
dot. The bottom plot shows all the values for the first variable in a control chart, with lower
and upper limit lines in red representing the lower and upper limit of that variable in the
selected data set. The green line is the mean value for the variable in the data set. The value
for a different sample can be shown in the whisker plot by using the arrows at the toolbar in
the top of the screen display; this will also move the dot along to the selected sample in the
bottom control chart. The whisker plot values can be centered and scaled by selecting the
11.6. Bibliography
R. Hogg and A. Craig, “Introduction to Mathematical Statistics”, 4th Edition, New York,
Macmillan Publishing Co, 1978.
J.N. Miller and J.C. Miller, “Statistics and Chemometrics for Analytical Chemistry” Fifth
Edition, Harlow, UK, Prentice Hall, 2005.
12. Basic Statistical Tests
12.1. Statistical tests
The Unscrambler® provides some basic hypothesis testing features, including tests for
normality, comparison of means and variances.
The tests included are:
To perform the analysis, use the menu option Tasks – Analyze – Statistical Tests…
The following sections briefly describe the ideas behind these methods, how to perform
them, and how to interpret the plots.
Theory
Usage
Plot Interpretation
Method reference
Equal variance assumption
Non-equal variance assumption
Comparison of two dependent means
The paired t-test
Comparison of categorical data
Chi-square test
Fisher’s exact test
Bayes exact test
Both of these principles should be obeyed as much as possible in order to make true
inferences about the population being investigated.
The null and alternative hypotheses are also dependent on the type of test to be performed.
This can be either a one-sided or a two-sided test. Before one- and two-sided tests can be
described, the principles of significance levels and p-values must be discussed.
Significance levels and p-values
The significance level of a statistical test is the risk one is willing to take of making a wrong
decision. The most commonly used significance level is α = 0.05, corresponding to the 95%
confidence level, where α is called the significance level or the risk. It is defined by the analyst
before the test is calculated. At 95% confidence, one is willing to take a 1 in 20 chance of
making an incorrect decision. Other common significance levels include 0.01 (99% confidence)
and 0.1 (90% confidence). The following diagram shows the common significance levels as histograms.
It is recognized that other tests for normality exist and may be used instead of the KS test.
Mardia’s test for multivariate normality
Kanti V. Mardia showed that the univariate calculations of skewness and kurtosis could be
extended to the multivariate case (Mardia, 1970). These calculations were used to develop a
test of multivariate normality (Mardia, 1974). To describe multivariate normality (sometimes
referred to as multinormality), the simplest case is considered. This is known as the bivariate
case and is shown graphically below.
This diagram shows that the bivariate normal distribution occupies a region in space defined
by a series of ellipses.
The diagram also shows one of the major principles behind multivariate methods such as
principal component analysis (PCA), described in other chapters of this help document. The
bivariate normal distribution consists of a number of ellipses of equal probability density
that show elongation along the direction of maximum variance.
For a multinormal distribution, Mardia has shown that the multivariate sample counterparts
of skewness and kurtosis can be defined as b1,p and b2,p, where p is the number of variables
being tested (Mardia, 1970). These test statistics can be used to test the null hypothesis of
multinormality. The null hypothesis is rejected for large b1,p and/or for large absolute values
of b2,p (Mardia, Kent and Bibby, 1979). Critical values of these statistics for small samples are
provided in Mardia, 1974.
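A sketch of the two sample statistics (following Mardia, 1970; the biased covariance is used, and the data are synthetic) may clarify the definitions:

```python
import numpy as np

def mardia_statistics(X):
    """Mardia's multivariate sample skewness b1,p and kurtosis b2,p."""
    n, p = X.shape
    D = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False, bias=True))
    G = D @ S_inv @ D.T                  # Mahalanobis inner products
    b1p = (G ** 3).sum() / n ** 2        # skewness statistic
    b2p = (np.diag(G) ** 2).sum() / n    # kurtosis statistic
    return b1p, b2p

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))            # multinormal by construction
b1p, b2p = mardia_statistics(X)
# For multinormal data, b2,p should be close to p(p + 2) = 15
```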
The F-test calculates the ratio of two sample set variances. The null hypothesis is set up such
that there is no significant difference between the variances, and the alternative hypothesis
is set such that one variance is greater than the other. If the null hypothesis stands, the ratio
of the variances should be close to a value of one (within the limits of random variation).
When it cannot be assumed that the difference is due to random variation, a significant
difference between the two variances exists.
The calculated test statistic F0 is compared to an F-table (the so-called Snedecor F-table) for
a specified number of degrees of freedom. The form of the test statistic is F0 = s1^2/s2^2,
which is compared with the critical value F(α, n1, n2),
where α = significance level, n1 = degrees of freedom for observation set 1 and n2 = degrees
of freedom for observation set 2. A p-value is also generated for the test. If p > 0.05 (at 95%
confidence) then the null hypothesis cannot be rejected; if p < 0.05, the null hypothesis that
the variances are equivalent is rejected.
When it can be safely accepted that the variances of the two observation sets are
equivalent, the variances can be pooled together for further analysis or the results can be
used to show that one method is equivalent to, or better than another.
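The F-test is straightforward to reproduce with SciPy (illustrative replicate measurements; the two-sided p-value convention shown here places the larger variance in the numerator):

```python
import numpy as np
from scipy import stats

method_1 = np.array([10.1, 10.4, 9.8, 10.2, 10.0, 10.3])
method_2 = np.array([10.0, 10.9, 9.2, 10.6, 9.5, 10.5])

v1 = np.var(method_1, ddof=1)
v2 = np.var(method_2, ddof=1)
F0 = max(v1, v2) / min(v1, v2)       # larger variance over smaller
df1 = df2 = len(method_1) - 1

p = 2 * stats.f.sf(F0, df1, df2)     # two-sided p-value
# p < 0.05 here: the two variances differ significantly
```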
Bartlett’s test
Bartlett’s test (Bartlett, 1937) can be used to test if two (or more) sample sets have equal
variances. Statistical tests, such as ANOVA, assume that variances are equal across groups of
samples. The Bartlett test can be used to verify this assumption.
Bartlett’s test is a parametric test that is sensitive to departures from normality, i.e. it is not
robust to outliers (non-normal results). In these cases, Levene’s test and the modification
proposed by Brown and Forsythe (1974) may be used as alternatives.
Bartlett’s test is used to test the null hypothesis, H0, that the population variances are equal
against the alternative that at least one pair of variances differs.
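With SciPy, Bartlett's test accepts two or more groups directly (illustrative data with similar spreads):

```python
from scipy import stats

group_1 = [34.0, 35.1, 33.8, 34.6, 34.9]
group_2 = [34.2, 35.0, 34.1, 34.7, 34.5]
group_3 = [33.9, 34.8, 34.3, 34.4, 35.2]

# H0: all population variances are equal
stat, p = stats.bartlett(group_1, group_2, group_3)
# A large p-value here: no evidence against equal variances
```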
Levene’s test
Levene’s test (Levene, 1960) is an inferential statistic which can be used to assess the equality
of variances of two samples. Levene’s test assesses the assumption that the variances of the
populations from which different samples were drawn are equal. If the calculated p-value is
less than some critical value (α = 0.05), the sample variances are unlikely to have occurred
by random sampling, and it is therefore concluded that there is a difference between the
variances in the population.
Levene’s test is less sensitive to departures from normality than Bartlett’s test and is
widely used before comparison of means (t-test). In the case where Levene’s test is
significant, subsequent tests must be performed that are based on the assumption of
non-equal variances.
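Levene's test is likewise available in SciPy; in the sketch below the two illustrative methods share a mean but differ clearly in spread:

```python
from scipy import stats

method_a = [5.2, 5.0, 5.4, 5.1, 5.3, 5.2]   # tight spread
method_b = [5.1, 6.0, 4.3, 5.8, 4.5, 5.7]   # much wider spread

# Brown-Forsythe variant: deviations taken from the group medians
stat, p = stats.levene(method_a, method_b, center='median')
# A small p-value here: the variances differ significantly
```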
Test for the equality of means using the assumption of equal variances.
Test for the equality of means using the assumption of non-equal variances.
The above description shows that a particular workflow is required for testing the
equivalence of two means:
The numerator contains the term x̄1 - x̄2, which measures the difference between the means
of the two sets of data; the closer this value is to 0, the more likely the two sets of
observations come from the same population. The denominator contains the term sp, which
is the pooled standard deviation,
The pooled standard deviation is a measure of the common spread of the two populations
and can only be representative of both populations when the variances are equivalent (F-test).
The other term in the denominator is a correction for the number of observations used to
calculate the t-statistic. The entire denominator defines a quantity known as the Standard
Error of the Mean (SE). Therefore, the t-statistic is a measure of the ratio of the difference
between two sample sets and the precision of the mean value. Significance is established by
comparing the calculated t-value (t0) with a tabulated t-value (tcrit) computed at a specified
significance level (usually 0.05) for a particular number of degrees of freedom.
The two-sample t-test can be either one-sided or two-sided. The null hypothesis is usually
set up as follows:
or
A p-value > 0.05 (or |t0| < tcrit) indicates that the null hypothesis cannot be rejected, i.e.
there is no significant difference between x̄1 and x̄2.
A p-value < 0.05 (or |t0| > tcrit) suggests that the sets of observations are significantly
different and therefore the null hypothesis must be rejected.
In this case, the individual variances of the two sets of observations are used in the
calculation of the t-statistic; however, the DF for this case must be estimated by the following formula:
The t0 value calculated is compared to a critical t-value obtained using the estimated degrees
of freedom. The test can be either one-sided or two-sided. At 95% confidence, when p >
0.05 (|t0| < tcrit) the null hypothesis cannot be rejected and when p < 0.05 (|t0| > tcrit), the
null hypothesis is rejected and the conclusion is that the two sets of observations are
significantly different.
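Both forms of the two-sample t-test can be sketched with SciPy: `equal_var=True` gives the pooled (equal-variance) test, and `equal_var=False` the Welch test with estimated degrees of freedom (data illustrative):

```python
from scipy import stats

lab_a = [12.1, 12.4, 11.9, 12.3, 12.2, 12.0]
lab_b = [12.6, 12.9, 12.5, 12.8, 12.4, 12.7]

# Pooled two-sample t-test (assumes equal variances)
t_eq, p_eq = stats.ttest_ind(lab_a, lab_b, equal_var=True)

# Welch's t-test (no equal-variance assumption, estimated DF)
t_w, p_w = stats.ttest_ind(lab_a, lab_b, equal_var=False)
# Small p-values here: the two laboratory means differ significantly
```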
Comparison of two dependent means
Compare the variances of the two sets of observations for equivalence. Note that in this
case, if the two sets have significantly different variances, there is no point in
continuing with the t-test.
Compute the paired t-statistic and compare it to a t-table at a specified level of
confidence for a particular number of DF.
The form of the paired t-statistic is provided below.
It is similar in form to all t-statistic formulas. In this case, the numerator contains the term
d̄, which is the mean difference (i.e. the bias) between the two sets of observations. The
closer d̄ is to zero, the more likely the two sets of observations are equivalent to each other.
The denominator contains the term SDev/sqrt(N), which is the standard error of the mean
difference of the observations, or the precision of the sample set. The paired t-test is useful
not only for determining whether two operators, methods, etc. are equivalent; the
calculated standard deviation of the differences (SDD) can provide a value for the expected
error of an analytical procedure, and the mean difference (d̄) can be used to determine if
there is any systematic difference between operators, methods, etc.
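A sketch of the paired t-test with SciPy; the two illustrative operators measure the same ten samples, with operator 2 reading consistently slightly high:

```python
import numpy as np
from scipy import stats

operator_1 = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4, 5.8, 5.0])
operator_2 = np.array([4.4, 5.2, 3.9, 6.3, 5.6, 5.1, 5.4, 4.5, 6.0, 5.2])

t0, p = stats.ttest_rel(operator_1, operator_2)

d = operator_1 - operator_2
d_bar = d.mean()          # mean difference (the bias)
sdd = d.std(ddof=1)       # standard deviation of the differences (SDD)
# Small p and negative d_bar: operator 2 reads systematically higher
```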
           Left-handed   Right-handed   Total
Men             10            79          89
Women            5            73          78
Total           15           152         167
In this example the expected value for left-handed men is 89 x 15/167 = 7.99, whereas the
observed value is 10. Applying the same calculation to the other three combinations of sex
and dexterity gives a total sum of 1.184, which is to be compared to the critical value 3.84
(chi-square distribution with 1 degree of freedom at α = 0.05).
The diagram below shows the main data input dialog when the Tasks – Analyze – Statistical
Tests… option is selected. All of the available methods can be found in the Test drop-down
list.
Ensure that data is available for a test to be conducted. In the case where all samples and
variables have been excluded, the following warning will be provided.
Solution: Use the Define button to deselect kept out rows and columns.
The following sections describe how to apply these basic statistical tests to data using The
Unscrambler®.
The Kolmogorov-Smirnov test of normality
The Kolmogorov-Smirnov (KS) test of normality requires only one column of input for
testing. If a data set is selected that contains more than one variable, the following warning
will be provided.
Select the matrix to test from the drop-down list and select the rows and columns containing
the data. Use the Define range button to create new ranges.
From the Test drop-down list, select Kolmogorov-Smirnov test of Normality.
Use the Significance level drop-down list to select the desired confidence associated with
the test and click on OK to start the analysis.
The results of the test are displayed as a node in the project navigator named Kolmogorov-
Smirnov normality test and can be plotted as a Cumulative Distribution Function (CDF). Use
the KS test statistic and the Critical Value with Lilliefors Correction to determine whether the
assumption of normality can be supported or not. When a KS test is performed, a CDF matrix
is generated in the project navigator under the analysis node.
The CDF folder contains the following information.
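Outside The Unscrambler®, the same kind of check can be sketched with SciPy. Note that when the normal parameters are estimated from the data, as below, the standard KS tables are too lenient and the Lilliefors-corrected critical values should be used instead:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=100)   # synthetic, normal data

# KS statistic against a normal CDF with estimated mean and std
stat, p = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))
# stat is the maximum vertical distance between sample and model CDFs
```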
These tests require data sets with only one column each. If more than one column is selected
in any of the data input boxes, the following warning will be provided.
Use the appropriate test based on knowledge of the system. It is always recommended to
apply the KS test to the data first (to assure normality, or near normality) and then test for
equal variances, before application of the t-tests.
Go to the menu Tasks – Analyze – Statistical Tests… and then in the Statistical Tests dialog
box, select the appropriate t-test to use from the drop-down list. Then use the Data drop-
down lists to select the columns to be tested. These can be from different data matrices, but
cannot include non-numeric data. Choose the significance level for the test and then click on
OK to start the test. The results are displayed as a new node in the project navigator named
Student’s t test, which has subnodes for data and test statistics.
In the special case of the paired t-test, the number of rows (samples) in both data sets
selected must be equal. If this is not the case, the following error message will be provided.
Use the graphical and tabular output to determine whether the two sample sets being
compared are statistically equivalent, or different. The Mean Comparison plot can be used in
this case. This plot also shows the relevant statistics for these tests. For more information on
plot and result interpretation, see Plot Interpretation for Statistical Tests
Tests for the comparison of variances
The Unscrambler® supports three common tests for the comparison of variances:
These tests require data sets with only one column each. If more than one column is selected
in any of the data input boxes, the following warning will be provided.
Use the appropriate test based on knowledge of the system. In this case, it is recommended
to apply the KS test to the data first before application of any of these tests.
Go to the menu Tasks – Analyze – Statistical Tests…, and then in the Statistical Tests dialog
box, select the appropriate variance test to use from the drop-down list. Then use the Data
drop-down lists to select the columns to be tested. Choose the significance level for the test
and then click on OK to start the test. The results are displayed as a node in the project
navigator.
Use the graphical and tabular output to determine whether the variances of the two sample
sets being compared are statistically equivalent, or different. The Variance Comparison plot
can be used in this case. This plot also shows the relevant statistics for these tests. For more
information on plot and result interpretation, see Plot Interpretation for Statistical Tests
Mardia’s test of multivariate normality
Mardia’s test of multivariate normality is used to test whether the data in a matrix exhibits
multivariate normality. Select the matrix to test from the Data drop-down list and select the
Mardia’s Test of Multivariate Normality option from the Test drop-down list. Select the
significance level from the drop-down list and click OK to start the analysis. The results of
the analysis are displayed as a node in the project navigator named Mardia’s test with
subnodes for data and test statistics.
Mardia’s test requires a data set of at least two rows and two columns to perform the test. If
the data set does not meet this criterion, the following error message will be provided.
In the case where there are any missing data, the following warning will be provided when
trying to apply Mardia’s test of multivariate normality.
The output of Mardia’s test of normality is a matrix of skewness and kurtosis test values.
Multivariate normality requires that the null hypothesis for both skewness and kurtosis are
not rejected.
Normal Skewness hypothesis: A value of “0” indicates that there is not enough evidence in
the data to suggest that the skewness deviates from a multivariate normal distribution. A
value of “1” indicates that the null hypothesis can be rejected at the chosen significance
level. A “small sample correction” is automatically applied when the number of data points
is 30 or fewer.
Normal kurtosis hypothesis: A value of “0” indicates that the null hypothesis of multivariate
normal kurtosis cannot be rejected, while a value of “1” indicates that the null hypothesis is
rejected at the chosen significance level. That is, a value of “1” means that the data display a
multivariate kurtosis that is not consistent with a multivariate normal distribution.
Both test results are accompanied by the p-values, critical values and Mardia’s statistics for
the skewness and kurtosis tests.
Note that this test is unreliable for highly collinear data, in which case a warning will be
given.
For more details on interpreting the output of this test, see Mardia’s test for multivariate
normality
Tests for association or independence (categorical data)
Categorical data from two columns can be cross-tabulated to produce a contingency table
and the observed frequencies can be compared with expected frequencies using classical or
Pearson’s Chi-square. For small samples (below 30) the Chi-squared values are also
computed with Yates’ correction. For 2x2 contingency tables the test also computes Fisher’s
and Bayes exact probabilities. Samples that have missing values are dropped automatically.
Contingency analysis requires that two columns of data be compared containing categorical
variables. If at least one column is not categorical, the warning shown below will be displayed.
The main result of a Contingency Analysis is the Contingency Table and a matrix of statistics
containing Chi-squared and p-values. These are discussed below.
Contingency Table
The Contingency (or Cross tabulation) table displays the multivariate frequency
distribution of categorical variables in order to find the relationship between them.
For example, suppose a clinical trial was performed using two main indicators: one is
sex (M or F), the other is response to drug (R for responsive and N for non-responsive).
In this example the study was performed on 2232 subjects, of which
1024 were female and 1208 were male. The Contingency Table provides a
condensed view of the proportion of males and females who responded or not to
the drug under study. An example table is shown below for this study.
The contingency table is found in the project navigator in the Test Statistics folder.
The table shows that a greater proportion of males positively responded to the drug than
females, but how do we assess that this is a significant difference? The Statistics folder holds
the answers.
Statistics
An example Statistics Table is shown below and is accessed from the Test Statistics
folder.
Student’s t-tests
Variance comparison tests
For a KS normality test, the actual sample value CDF (stepped red curve) is plotted along
with the expected CDF (blue smooth curve). If the two curves significantly depart from each
other over part of the curve, this is an indication that the sample distribution is non-normal.
If the two curves follow each other closely, then this is an indication that the sample
distribution is normal.
The KS statistic is displayed on the curve and is defined by the maximum vertical distance
between the two functions. The statistic is compared to tabulated values of the KS statistic
(in this case with the correction suggested by Lilliefors). If the KS statistic is less than the
critical value (from the KS table), then the null hypothesis that the distribution is normal
cannot be rejected. If however, the KS statistic is greater than the critical value, the
assumption of normality cannot be supported. The plot provides a statement regarding
whether the null hypothesis should, or should not be rejected.
Student’s t-tests
The main results output for the two sample and paired t-tests in The Unscrambler® is the
Mean Comparison plot. An example of this plot is provided below.
Mean Comparison Plot
This plot shows the mean value and the range of values around the mean for the two
variables tested. Visually assess whether the means of the two variables line up with each
other and that the spreads of the two variables are equivalent. The plot also provides
information on the type of test (two sample, paired), whether the test was one-sided or
two-sided, the significance level the test was performed at and the test statistics for the
analysis. Use the tabulated p-value to determine whether the means of the two variables
are statistically equivalent. If the p-value is less than the significance level at which the test
was carried out (usually 0.05), then the null hypothesis of no difference in the means is
rejected. If the p-value is greater than the significance level of the test, the null
hypothesis cannot be rejected. The plot provides a statement regarding whether the null
hypothesis should, or should not be rejected.
Variance comparison tests
For the variance comparison tests (Levene’s, Bartlett’s and the F-Test), the main results
output is the Variance Comparison plot. An example of this plot is provided below.
Variance Comparison Plot
This plot provides a comparison of the variance of the two variables along with their
confidence intervals. Interpret these plots by visually assessing the variance range for both
variables. The closer the two variables are in variance, the more likely they come from
similarly distributed populations. The plots also provide the Levene’s, Bartlett’s and F-test
statistics (depending on which test was chosen) along with the corresponding critical value
and p-value. If the p-value is less than the level of significance chosen (usually 0.05), then
the null hypothesis of equal variances is rejected. If the p-value for the test is
greater than the significance level, the null hypothesis cannot be rejected. The plot provides
a statement regarding whether the null hypothesis should, or should not be rejected.
12.6. Bibliography
M.S. Bartlett, Properties of sufficiency and statistical tests, Proceedings of the Royal
Statistical Society Series A 160, 268–282, (1937).
M.B. Brown and A.B. Forsythe, Robust tests for the equality of variance, J. American
Statistical Assoc., 69, 364-367, (1974).
R.B. D’Agostino, Tests for Normal Distribution, in Goodness-of-fit Techniques, R.B.
D’Agostino, M.A. Stephens(Eds), Marcel Dekker, New York, 1986.
G.E. Dallal and L. Wilkinson, An analytic approximation to the distribution of Lilliefors’ test
for normality, The American Statistician, 40, 294–296, (1986).
H. Levene, Robust tests for equality of variances, in Contributions to Probability and
Statistics: Essays in Honor of Harold Hotelling, Ingram Olkin, Harold Hotelling et al.(Eds),
Stanford University Press, Stanford, CA, 278-292, 1960.
K.V. Mardia, Measures of Multivariate Skewness and Kurtosis with Applications, Biometrika,
57, 519-530, (1970).
K.V. Mardia, Applications of Some Measures of Multivariate Skewness and Kurtosis in
Testing Normality and Robustness Studies, Sankhyā, Series B, 36, 115-128, (1974).
K.V. Mardia, J.T. Kent and J.M. Bibby, “Multivariate Analysis”, Academic Press, London, UK,
1979.
J.N. Miller and J.C. Miller, Statistics and Chemometrics for Analytical Chemistry, Fifth Edition,
Prentice Hall, UK, 2005.
13. Principal Components Analysis
13.1. Principal Component Analysis (PCA)
PCA can be used to reveal the hidden structure within large data sets. It provides a visual
representation of the relationships between samples and variables, and provides insights
into how the measured variables cause some samples to be similar to, or different from,
each other.
This section provides the details of the PCA approach to understanding data structure. When
considering a data table, each row represents an object (or individual, or sample), and each
column represents a descriptor (or measure, or variable). Throughout the rest of this
section, rows will be referred to as samples, and the columns as variables.
Theory
Usage
Plot Interpretation
Method reference
2001 for a more complete description of PCA. Other valuable references include Jackson,
1991 and Mardia et al., 1979. Additional references may also be found in the Bibliography
section of the help.
X = TP' + E, where T is the scores matrix, P the loadings matrix, E the error matrix, and P'
denotes the transpose of P. These terms will be explained in more detail in this document.
The combination of scores and loadings is the structured part of the data: the part that is
most informative. What remains is called error or residual, and represents the fraction of
variation that cannot be modeled well. By multiplying the scores and the loadings together,
the entire structure of the original data set can be reconstructed and hopefully, only a small
residual is left, consisting of random fluctuations which cannot be meaningfully modeled.
When interpreting the results of a PCA, one focuses on the structure part and discards the
residual part. This is acceptable, provided that the residuals are indeed negligible. It is a
question of how large an error one is willing to accept.
Geometrical interpretation of the difference between samples
Since humans can only visualize data in three dimensions, the following is used to describe
higher order space. Each sample in a data table may be represented by a point in a
multidimensional space (see figure below, for three dimensions). The location of the point is
determined by its coordinates, which are the cell values of the corresponding row in the
table. Each variable thus plays the role of a coordinate axis in multidimensional space.
Sample (object) representation in multidimensional space
Let us consider the whole data table geometrically. Two samples can be described as similar
if the values of most of their variables are close to each other. This results in data points that
are close to each other in space. On the other hand, two samples can be described as
different if their values greatly differ for at least some of the variables. This results in data
points occupying distinctly different areas in multidimensional space. This is represented for
two groups, A and B in the figure below.
Sample differences in multidimensional space
Principles of projection
The major principle of PCA is defined as follows: find the directions in space along which the
distance between (i.e. the dispersion of) the data points is the largest. This can be
interpreted as finding the linear combinations of the initial variables that contribute most to
making the samples different from each other. This is shown graphically below.
The First Principal Component
These directions, or combinations, are called Principal Components (PCs). They are
computed iteratively, in such a way that the first PC is the one that carries most information
(or in statistical terms, the most explained variance). The second PC will then carry the
maximum share of the residual information (i.e. not taken into account by the previous PC),
and so on.
This process can continue until as many PCs have been computed as there are variables (or
samples, whichever is smaller) in the data table. At that point, all the
variation between samples has been accounted for, and the PCs form a new set of
coordinate axes which has two advantages over the original set of axes (i.e. the original
variables). First, the PCs are orthogonal to each other. Second, they are ranked so that each
one carries more information than any subsequent ones. Thus, one can prioritize the
interpretation, focusing on the first few, since they carry the most information.
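The iterative extraction of orthogonal, variance-ranked PCs described above can be sketched numerically. The snippet below is a minimal illustration using SVD (not the NIPALS algorithm used in The Unscrambler®), with invented data values:

```python
import numpy as np

# Small illustrative data table: 6 samples (rows) x 3 variables (columns).
X = np.array([[2.0, 4.1, 0.5],
              [1.8, 3.9, 0.7],
              [3.1, 6.2, 0.2],
              [2.9, 5.8, 0.4],
              [1.2, 2.4, 0.9],
              [3.5, 7.1, 0.1]])

Xc = X - X.mean(axis=0)          # mean centering
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                   # sample coordinates along the PCs
loadings = Vt.T                  # each column is one PC direction

# The PCs are orthogonal to each other...
print(np.round(loadings.T @ loadings, 6))   # identity matrix

# ...and ranked so each carries more variance than any subsequent one
explained = s**2 / np.sum(s**2) * 100
print(np.round(explained, 1))
```

Here the first PC captures nearly all the variation because the invented variables are strongly correlated; the scores and loadings together reconstruct the centered data table.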
The new set of axes can be described as a new “window” for looking into the greatest
sources of information contained in the data. This is represented in the figure below of a
scores plot.
PCs 1 and 2: a new window for looking into multidimensional space
The way PCs are generated ensures that this new set of coordinate axes is the most suitable
basis for a graphical representation for interpreting the data structure.
Separating information from noise
In well defined data sets, it is common that the first few PCs contain interpretable
information, while the later PCs mostly describe noise. Therefore, it is useful to study the
first PCs only instead of the whole raw data table: not only is this less complex, but it also
ensures that noise is not mistaken for information.
All PCA models should be validated. Validation is the only way of making sure that only
informative PCs are retained in a model. The validation procedures associated with
multivariate models are described in detail in the chapter on Validation. The following
provides a short description of the most common validation methods used for PCA.
In PCA, like most multivariate methods, there are a number of ways to validate the model
generated. The two most commonly used methods are Cross Validation (CV) and Test Set
Validation. In CV, the analyst may set up the number of samples and segments to validate
the model, based on prior knowledge of the data set. In Full Cross Validation, (sometimes
called Leave-One-Out or LOO) each sample takes part in both the calibration and validation
steps individually. This method is commonly used when there is not enough variation in the
samples selected, or there are too few samples to do test set validation. LOO is a good
method for isolating influential samples in a small data set. Other forms of cross validation include systematic, for assessing the model's ability to model replicate data; random, when the data sets are larger and the analyst wants to understand the robustness of a model; and custom, when there is a priori information about the data set.
The preferred method of validation for all multivariate methods is test set validation. This
provides the most representative assessment of the model in future applications. The
samples used in validation are not used in the calibration (or training) step and therefore,
the model performance is not overly optimistic, as is the case for cross validation.
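Full cross validation (LOO) as described above can be sketched as follows: each sample is held out in turn, a PCA model is fitted on the remaining samples, and the held-out sample's projection residual is accumulated. This is a simplified illustration on random data, not The Unscrambler®'s implementation:

```python
import numpy as np

def pca_loadings(X, n_comp):
    """Mean and loadings of a PCA model fitted on X via SVD."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_comp].T

def loo_press(X, n_comp):
    """Leave-one-out prediction residual sum of squares for a PCA model."""
    press = 0.0
    for i in range(len(X)):
        train = np.delete(X, i, axis=0)          # leave sample i out
        mean, P = pca_loadings(train, n_comp)
        xc = X[i] - mean                         # center with the training mean
        resid = xc - P @ (P.T @ xc)              # residual after projection
        press += resid @ resid
    return press

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
print(loo_press(X, 2))
```

Comparing `loo_press` across component counts mimics how validation residual variance is used to decide how many PCs carry information rather than noise.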
Is PCA the most relevant summary of the data?
PCA produces an orthogonal bilinear matrix decomposition, where the PCs are computed in
a sequential way, explaining maximum variance in the data. Using these constraints plus
normalization during the bilinear matrix decomposition, PCA produces unique solutions.
These ‘abstract’ unique and orthogonal (independent) solutions are extremely helpful in
deducing the number of different sources of variation present in the data. However, it must
be noted that these are ‘abstract’ solutions in the sense that they are not the ‘true’
underlying factors causing the data variation, but orthogonal linear combinations of them.
In most cases one is interested in finding the “true” underlying sources of data variation. It is
not only a question of how many different sources are present and how they can be interpreted, but also of what they are in reality. This can sometimes be achieved using
either PC Rotation, or another type of bilinear method called Multivariate Curve Resolution
(MCR). A disadvantage of MCR methods is they do not yield a unique solution unless
external information is provided during the matrix decomposition.
Read more about Curve Resolution methods in the Help chapter Multivariate Curve
Resolution.
Where Cov is the covariance between x and y. There is a direct relationship between the
covariance of two vectors and the cosine of the angle between them. This is shown as
follows.
Provided x and y have been mean centered, the diagram shows the relationships between
loadings and the PCs and the following statements can be made about variables 1, 2 and 3.
The angle between variable 1 and PC1 is close to zero, Cos(0) = 1, therefore PC1
completely describes variable 1.
The angle between variable 2 and PC2 is zero, therefore PC2 completely describes
variable 2.
The angle between variables 1 and 2 is 90°. Cos(90) = 0, therefore variables 1 and 2
are uncorrelated.
The angle between variable 3 and PC1 is greater than 180° and the angle between
variable 3 and PC2 is greater than 90°, therefore variable 3 is negatively correlated
to both PC1 and PC2.
Variable 4 sits at the intersection of PC1 and PC2 and is not described well by either PC.
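The relation stated above between covariance and the cosine of the angle can be verified numerically: for mean-centered vectors, the cosine of the angle between them equals their Pearson correlation. The numbers below are arbitrary:

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
y = np.array([2.0, 2.5, 2.1, 4.0, 3.3])

xc = x - x.mean()                 # mean centering, as assumed in the text
yc = y - y.mean()

# cosine of the angle between the centered vectors
cos_angle = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

# Pearson correlation of the original vectors
corr = np.corrcoef(x, y)[0, 1]

print(round(cos_angle, 6) == round(corr, 6))   # True
```

This is why an angle of 0 means a PC completely describes a variable (correlation 1), and 90° means the two are uncorrelated (correlation 0).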
Variable residuals
From the variables’ point of view, the original variable vectors are being approximated by
their projections onto the model components. The difference between the original vector
and the projected one is the variable residual.
It can also be broken down into as many numbers as there are components.
Residual variation
The residual variation of a sample is the sum of squares of its residuals for all model
components. It is geometrically interpretable as the squared distance between the original
location of the sample and its projection onto the model.
The residual variations of Variables are computed the same way.
Residual variance
The residual variance of a variable is the mean square of its residuals for all model
components. It differs from the residual variation by a factor which takes into account the
remaining degrees of freedom in the data, thus making it a valid expression of the modeling
error for that variable.
Total residual variance is the average residual variance over all variables. This expression
summarizes the overall modeling error; i.e. it is the variance of the error part of the data.
Explained variance
Explained variance is the complement of residual variance, expressed as a percentage of the
global variance in the data. Thus the explained variance of a variable is the fraction of the
global variance of the variable taken into account by the model.
Total explained variance measures how much of the original variation in the data is
described by the model. It expresses the proportion of structure found in the data by the
model.
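The residual and explained variance definitions above can be illustrated as follows: after fitting a given number of components, the residual matrix is what the model leaves unexplained, and total explained variance is its complement as a percentage of the global variance. (The degrees-of-freedom correction mentioned for residual variance is omitted here for simplicity; the data are random.)

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
Xc = X - X.mean(axis=0)
total_var = np.sum(Xc**2)                 # global variation in the data

_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_by_a = []
for a in range(1, 4):
    P = Vt[:a].T                          # loadings of the first a PCs
    E = Xc - Xc @ P @ P.T                 # residual matrix after a components
    residual = np.sum(E**2)               # residual variation
    explained_by_a.append(100 * (1 - residual / total_var))

print([round(v, 1) for v in explained_by_a])  # cumulative explained variance (%)
```

The values increase with each added component, mirroring how each successive PC accounts for a share of the remaining variation.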
Variable variances
Variables with small residual variance (or large explained variance) for a particular
component are well explained by the corresponding model. Variables with large residual
variance for all or for the three to four first components have a small or moderate
relationship with the other variables.
If some variables have much larger residual variance than the other variables for all
components (or for the first three to four of them), try to keep these variables out and make
a new calculation. This may produce a model which is easier to interpret.
Calibration vs. validation variance
The calibration variance is based on fitting the calibration data to the model. The validation
variance is computed by testing the model on data not used in building the model. Look at
both variances to evaluate their difference. If the difference is large, there is reason to
question whether the calibration data or the test data are representative.
Outliers can sometimes be the reason for large residual variance. The next section discusses
outliers.
How to detect outliers in PCA
An outlier is a sample which looks so different from the others that it either is not well
described by the model or influences the model too much. As a consequence, it is possible
that one or more of the model components focuses only on trying to describe how this
sample is different from the others, even if this is irrelevant to the more important structure
present in the other samples. The diagram below depicts a typical situation where an outlier
influences the model completely, leaving the most important source of variation for the
second PC to describe.
Scores plot showing a gross outlier
In PCA, outliers can be detected using scores plots, residuals and leverages.
Different types of outliers can be detected using the various graphical tools available in The Unscrambler®:
Scores plots
show sample patterns according to one, two, or three components. It is easy to spot
a sample lying far away from the others. Such samples are likely to be outliers.
Residuals
measure how well samples or variables fit the model determined by the
components. Samples with a high residual are poorly described by the model, which
nevertheless fits the other samples quite well.
Leverages
measure the distance from the projected sample (i.e. its model approximation) to
the center (mean point). Samples with high leverages have a stronger influence on
the model than other samples; they may or may not be outliers, but they are
influential. An influential outlier (high residual + high leverage) is the worst case; it
can however easily be detected using an influence plot.
The diagram below provides an example of an influence plot, showing four typical classes of
sample. Samples with high leverage are considered extreme in the model as they lie furthest
from the center of the PCA model.
To summarize, if the score of a sample and the loading of a variable on a particular PC have
the same sign, the sample has higher than average value for that variable and vice-versa.
The larger the scores and loadings, the stronger that relation.
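The sign rule above can be checked on a small noise-free, rank-one example (values invented): when a sample's score and a variable's loading on a PC share a sign, the sample's value for that variable lies above that variable's mean.

```python
import numpy as np

rng = np.random.default_rng(2)
t_true = rng.normal(size=(8, 1))           # one underlying factor
p_true = np.array([[0.7, -0.5, 0.5]])      # variable loadings, mixed signs
X = 5.0 + t_true @ p_true                  # rank-one data table

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
t = (U * s)[:, 0]                          # scores on PC1
p = Vt[0]                                  # loadings on PC1

# Same sign of score and loading -> value above the variable's mean
i, j = 0, 0
above_avg = X[i, j] > X[:, j].mean()
print((t[i] * p[j] > 0) == above_avg)      # True for this noise-free table
```

Because the centered table is exactly the outer product of the PC1 scores and loadings here, the sign of their product equals the sign of each sample's deviation from the variable mean.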
If one now considers two PCs simultaneously, a two-vector loading plot and a two-vector
scores plot can be built. The same principles apply to their interpretation, with a further
advantage: one can now interpret any direction in the plot - not only the principal directions.
components explain. In general, only a (small) subset of components is kept for further
consideration and the remaining components are considered as noninformative, irrelevant
or nonexistent (i.e. they are assumed to reflect measurement error or noise).
In order to interpret the components that are considered relevant, one can follow the PCA
by a rotation of the components that were retained. Two main types of rotation are used:
orthogonal when the new axes are also orthogonal to each other, and oblique when the new
axes are not required to be orthogonal to each other. Nonorthogonal or oblique rotation is
the subject of Independent Component Analysis (ICA).
Why will a rotation help?
Since the rotations are always performed in a subspace (the so-called component space), the
new axes will always explain less variance than the original factors (which are computed to
be optimal), but obviously the part of variance explained by the total subspace after rotation
is the same as it was before rotation – only the partition of the variance has changed.
Because the rotated axes are not defined according to a statistical criterion, such rotations
are performed to facilitate the interpretation of the components, thus also giving more
direct meaning to the data analysis.
Rotation was designed to obtain simple structure by clustering variables into groups that
might aid in the examination of the structure of a multivariate data set. It has found most
use in psychology, market research, education and sensory analysis. In physical applications,
rotation is usually of secondary interest.
Varimax
Quartimax
Equimax
Parsimax
The rotation, R, is defined so as to maximize the variance of the squared loadings, given by the variance measure v:

v = Σⱼ [ Σᵢ (pᵢⱼ/hᵢ)⁴ − (γ/n) ( Σᵢ (pᵢⱼ/hᵢ)² )² ]

where n is the number of samples, pᵢⱼ are the elements of the scores, hᵢ is a normalization factor, and γ is a scaling factor defining the different types of rotation:
Rotation method Scaling factor
Varimax γ=1
Quartimax γ=0
Equimax γ=(NumOfPCs)/2
An orthogonal rotation, R, can be defined for the loadings, such that the rotated loadings are equal to P × R. For the model to remain invariant under the rotation, the scores must also be rotated, T × R, and the rotation must satisfy:

R × Rᵀ = I

where I is the identity matrix, i.e. R must be orthogonal. The original data can thus be reconstructed from the rotated loadings and scores by:

X ≈ (T × R) × (P × R)ᵀ = T × R × Rᵀ × Pᵀ = T × Pᵀ
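The orthomax family of rotations (varimax, quartimax, equamax) can be sketched with the standard SVD-based algorithm below, using the γ values from the table above. This is an illustrative implementation, not the one in The Unscrambler®, and the example loadings are invented:

```python
import numpy as np

def orthomax(P, gamma=1.0, max_iter=100, tol=1e-8):
    """Orthogonal rotation of a loadings matrix P.
    gamma=1 gives varimax, gamma=0 quartimax, gamma=k/2 equamax.
    Returns the rotated loadings and the rotation matrix R."""
    n, k = P.shape
    R = np.eye(k)
    crit = 0.0
    for _ in range(max_iter):
        L = P @ R
        # SVD step of the standard orthomax ascent algorithm
        U, s, Vt = np.linalg.svd(
            P.T @ (L**3 - (gamma / n) * L @ np.diag(np.sum(L**2, axis=0))))
        R = U @ Vt                      # product of orthogonal matrices
        crit_new = np.sum(s)
        if crit_new < crit * (1 + tol):  # converged
            break
        crit = crit_new
    return P @ R, R

# invented 4-variable, 2-component loadings matrix
P = np.array([[0.7, 0.3], [0.6, 0.4], [0.2, 0.8], [0.1, 0.7]])
L, R = orthomax(P, gamma=1.0)           # varimax
print(np.round(R @ R.T, 6))             # identity: the rotation is orthogonal
```

Since R is orthogonal, rotating loadings and scores together leaves the reconstructed data (and the total explained variance of the subspace) unchanged, as the text notes.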
Some important tips and warnings associated with the Model Inputs tab
PCA is a multivariate analysis technique; therefore, in The Unscrambler® it requires a minimum of three samples (rows) and two variables (columns) to be present in a data set in order to complete the calculation. The following are some of the warnings given when certain analysis criteria are not met.
Not enough samples or variables present
Solution: Check that the data table (or selected row set) contains a minimum of 3 samples.
Not enough variables present
Solution: Check that the data table (or selected column set) contains a minimum of 2
variables.
Too many excluded samples/variables
The same warning as for Not enough samples or variables (described above) will be given.
Solution: Check that all samples/variables have not been excluded from the data set.
To keep track of row and column exclusions, the model inputs tab provides a warning to
users that exclusions have been defined. See automatic keep outs for more details.
Individual variables can be selected from the variable list table provided in this dialog by
holding down the control (Ctrl) key and selecting variables. Alternatively, the variable
numbers can be manually entered into the text dialog box. The Select button can be used
(which will bring up the Define Range dialog), or every variable in the table can be selected
by simply clicking on All.
Once the variables have been selected, to weight them, use the options in the Change
Selected Variable(s) dialog box, under the Select tab. The options include:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows the weighting of selected variables by predefined constant values.
Downweight
This allows the multiplication of selected variables by a very small number, such that
the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
Use the Advanced tab in the Weights dialog to apply predetermined weights to each
variable. To use this option, set up a row in the data set containing the weights (or create a
separate row matrix in the project navigator). Select the Advanced tab in the Weights dialog
and select the matrix containing the weights from the drop-down list. Use the Rows option
to define the row containing the weights and click on Update to apply the new weights.
Another feature of the Advanced tab is the ability to use the results matrix of another analysis as weights, using the Select Results Matrix button. This option provides an internal project navigator for selecting the appropriate results matrix to use as a weight.
The dialog box for the Advanced option is provided below.
PCA Advanced Weights Option
Once the weighting and variables have been selected, click Update to apply them.
Select the desired rotation method from the dialog box and a rotated model will be
displayed in the project navigator.
See Available rotation methods for information about the rotation methods available in The
Unscrambler®.
The differences between the algorithms are described in the Introduction to PCA. The
NIPALS algorithm is iterative and the maximum number of iterations can be tuned in the
Max. iterations box. The default value of 100 should be sufficient for most data sets; however, some large and noisy data sets may require more iterations to converge properly. The maximum allowed number of iterations is 30,000.
When there are missing values in the data, the options are to impute them automatically using the NIPALS algorithm, or as a pre-processing step using Fill Missing.
Note: If there are missing values in the data and SVD is selected, a warning will be
given as shown below.
Q-residual limits are by default approximated based on the calculated model components only, which works well in many cases. Exact Q-residual limits will be calculated when the check box is marked. Note that estimation of exact limits may be slow for large data sets.
Pretreatments can also be registered from the PCA node in the project navigator. To register
the pretreatment, right click on the PCA analysis node and select Register Pretreatment.
This is shown below.
Registering a Pretreatment From The Project Navigator
The Autopretreatment dialog box will appear, where the desired pretreatments can be
selected.
Note: Some caution is required when the data table dimensions are changed after a first pretreatment has been registered. The Autopretreatment is applied to the same column indices as the original transformation, so inserting new variables (columns) before or in between the original data will result in autopretreatment of the wrong variables.
To be safe, always insert any new variables into the table before applying any transformations, or make a habit of always appending rather than inserting new columns.
Set this tab up based on a priori knowledge of the data set in order to return outlier
warnings in the PCA model. Settings for estimating the optimal number of components can
also be tuned here. The values shown in the dialog box above are default values and might
be used as a starting point for the analysis.
The warning limits in The Unscrambler® serve two major purposes: detecting outlying samples and variables, and estimating the optimal number of components.
The leverage and residual (outlier) limits are given as standard scores. This means that a limit of e.g. 3.0 corresponds to a 99.7% probability that a value will lie within 3.0 standard deviations of the mean of a normal distribution. The following limits can be specified:
Leverage Limit
(default 3.0) The ratio between the leverage for an individual sample and the
average leverage for the model.
Sample Outlier Limit, Calibration
(default 3.0) The square root of the ratio between the residual calibration variance
per sample (Sample Residuals) and the average residual calibration variance for the
model (Total Residuals).
Sample Outlier Limit, Validation
(default 3.0) The square root of the ratio between the residual validation variance
per sample (Sample Validation Residuals) and the total residual validation variance
for the model (Total Residuals).
Individual Value Outlier, Calibration
(default 3.0) For individual, absolute values in the calibration residual matrix
(Residuals), the ratio to the model average is computed (square root of the Variable
Residuals). For spectroscopic data this limit may be set to 5.0 to avoid many false
positive warnings due to the high number of variables.
Individual Value Outlier, Validation
(default 2.6) For individual, absolute values in the validation residual matrix
(Residuals), the ratio to the validation model average is computed (square root of
the Variable Validation Residuals). For spectroscopic data this limit may be set to 5.0
to avoid many false positive warnings due to the high number of variables.
Variable Outlier Limit, Calibration
(default 3.0) The square root of the ratio between the residual calibration variance
per variable (Variable Residuals) and the average residual calibration variance for
the model (Total Residuals).
Variable Outlier Limit, Validation
(default 3.0) The square root of the ratio between the residual validation variance
per variable (Variable Validation Residuals) and the total residual validation variance
for the model (Total Residuals).
Total Explained Variance (%)
(default 20) If the model explains less than 20% of the variance, the optimal number of components is set to 0 (see the Info Box).
Ratio of Calibrated to Validated Residual Variance
(default 0.5) If the residual variance from the validation is much higher than that from the calibration, a warning is given.
Ratio of Validated to Calibrated Residual Variance
(default 0.75) If the residual variance from the calibration is much higher than that from the validation, a warning is given. This may occur with test set validation when the test samples do not span the same space as the training data.
Residual Variance Increase Limit (%)
(default 6) This limit is applied when selecting the optimal number of components and is calculated from the residual variances of two consecutive components. If the variance for the next component is less than x% lower than that of the previous component, the default number of components is set to the previous one.
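As a sketch of how the standard-score limits above are applied (using hypothetical per-sample statistics rather than actual Unscrambler® output), a sample triggers a warning when the square root of its residual-variance ratio, or its leverage ratio, exceeds the limit:

```python
import numpy as np

def outlier_warnings(sample_resid_var, leverage, limit=3.0):
    """Flag samples whose residual variance (as sqrt of the ratio to the
    model average) or leverage (as a ratio to the average leverage)
    exceeds the warning limit (default 3.0)."""
    resid_flag = np.sqrt(sample_resid_var / sample_resid_var.mean()) > limit
    lev_flag = leverage / leverage.mean() > limit
    return resid_flag | lev_flag

# hypothetical per-sample statistics from a fitted PCA model
resid = np.array([0.20, 0.25, 0.22, 0.18, 0.24, 0.21, 0.19, 0.23, 0.20, 8.0])
lev = np.array([0.10, 0.12, 0.09, 0.11, 0.10, 0.08, 0.12, 0.11, 0.10, 0.90])
print(outlier_warnings(resid, lev))
```

Note that because the averages include the suspect sample itself, a single gross outlier inflates the denominator; in this invented example the last sample is flagged through its leverage ratio.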
When all the options are specified, click OK.
2 x 2-D scatter
4 x 2-D scatter
Loadings
Line
2-D scatter
3-D scatter
2 x 2-D scatter
4 x 2-D scatter
Residuals
Residuals and influence
Influence plot
Variance per sample
Variable residuals
Sample residuals
Sample and variable residuals
Leverage / Hotelling’s T²
Leverages
Line
Matrix
Hotelling’s T²
Line
Matrix
Scores
This is a two-dimensional scatter plot (or map) of scores for two specified components (PCs)
from PCA. The plot gives information about patterns in the samples. The scores plot for
(PC1,PC2) is especially useful, since these two components summarize more variation in the
data than any other pair of components.
The closer the samples are in the scores plot, the more similar they are with respect to the
two components concerned. Conversely, samples far away from each other are different
from each other. The plot can be used to interpret differences and similarities among
samples. Look at the scores plot together with the corresponding loadings plot, for the same
two components. This can help determine which variables are responsible for differences
between samples. For example, samples to the right of the scores plot will usually have a
large value for variables to the right of the loadings plot, and a small value for variables to
the left of the loadings plot.
Here are some things to look for in the 2-D scores plot.
Finding groups in a scores plot
Is there any indication of clustering in the set of samples? The figure below shows a
situation with four distinct clusters. Samples within a cluster are similar.
Detecting grouping in a scores plot
Are the samples evenly spread over the whole region, or is there any accumulation
of samples at one end? The figure below shows a typical fan-shaped layout, with
most samples accumulated to the bottom left of the plot, then progressively
spreading more and more. This means that the variables responsible for the major
variations are asymmetrically distributed. In such a situation, study the distributions
of those variables (histograms), and use an appropriate transformation (most often
a logarithm).
Asymmetrical distribution of the samples on a scores plot
have been errors in data collection or transcription, or those samples may have to
be removed if they do not belong to the population of interest.
An outlier sticks out of the major group of samples
Furthermore, the display of the Hotelling's T² ellipse for a model in two dimensions is also a good way to detect outliers. To display it, click on the Hotelling's T² ellipse button.
Scores plot with Hotelling’s T² limit
In addition, the display of the stability plot can help in detecting outliers. This plot represents the projections of the samples onto the submodels used for the validation, in which they can be part of the model or left out. Hence this plot is only available when some type of cross validation has been selected. It is available from the toolbar icon.
An outlier disturbs the model
In the above image, the sample 143_1 is projected very differently in one particular projection. It is also visible that one particular projection deviates for all the samples. The
study of the samples left out for this particular projection indicates that sample 143_1 is the
source of this variation. This sample is an outlier.
How representative is the picture?
Check how much of the total variation each of the components explains. This is
displayed in parentheses next to the axis name. If the sum of the explained variances
for the 2 components is large (for instance 70-80%), the plot shows a large portion
of the information in the data, so the relationships can be interpreted with a high
degree of certainty. On the other hand if it is smaller, more components or a
transformation should be considered, or there may simply be little meaningful
information in the data under study.
Loadings
A two-dimensional scatter plot of X-loadings for two specified components from PCA is a
good way to detect important variables. The plot is most useful for interpreting component
1 vs. component 2, since they represent the largest variations in the X-data.
The plot shows the importance of the different variables for the two components specified.
It should preferably be used together with the corresponding scores plot. Variables with X-
loadings to the right in the loadings plot will be X-variables which usually have high values
for samples to the right in the scores plot, etc.
Note: Downweighted variables are displayed in a different color so as to be easily
identified.
X-variables correlation structure
Variables close to each other in the loadings plot will have a high positive correlation
if the two components explain a large portion of the variance of X. The same is true
for variables in the same quadrant lying close to a straight line through the origin.
Variables in diagonally opposed quadrants will have a tendency to be negatively
correlated.
For example, in the figure below, variables “redness” and “colour” have a high
positive correlation, and they are negatively correlated to variable “thickness”.
Variables “redness” and “off-flavour” have independent variations. Variables
“raspberry flavour” and “off-flavour” are negatively correlated. Variable
“sweetness” and “chew” resistance cannot be interpreted in this plot, because they
are very close to the center.
Loadings of 12 sensory variables along (PC1,PC2)
Note: Variables lying close to the center are poorly explained by the plotted PCs. Do
not interpret them in that plot!
When working with spectroscopic or time series data, line loadings plots will aid better
interpretation. This is because the loadings will have a profile similar to the original data and
may highlight regions of high importance. The plot below shows how a number of PCs can be overlaid in a line loadings plot to determine which components capture the important
sources of information.
When working with discrete variables, line loadings plots can also be used to represent data.
The Ascending and Descending buttons can be used to order the loadings in
terms of the variables with highest (or lowest) contribution to the PC.
Line plot of loadings in ascending order of importance to PC1
In the above plot, three variables are located in the inner circle: Chew resistance,
Sweetness and Bitterness. They do not contain enough structured variation to be
discriminating for the jam samples.
Correlation loadings are also available for 1-D line loadings plots. When a line plot is generated, the 1-D correlation loadings toolbar icon is displayed. These are especially useful when interpreting important wavelengths in the analysis of spectroscopic data, or contributing variables in time series data. An example is shown below.
Correlation line loadings of spectroscopic variables in PC1
Values that lie within the upper and lower bounds of the plot are modelled by that PC. Those that lie between the two lower bounds are not.
Influence plot
This plot shows the Q- or F-residuals vs. Leverage or Hotelling’s T² statistics. These represent
two different kinds of outliers. The residual statistics on the ordinate axis describe the
sample distance to model, whereas the Leverage and Hotelling’s T² describe how well the
sample is described by the model.
Samples with high residual variance, i.e. lying in the upper regions of the plot, are poorly
described by the model. Including additional components may result in these samples being
described better, however caution is required that the additional components are predictive
and not modelling noise. As long as the samples with high residual variance are not
influential (see below), keeping them in the model may not be a problem as such (the high
residual variance may be due to non-important regions of a spectrum, for instance).
Samples with high leverage, i.e. lying to the right of the plot, are well described by the
model. They are well described in the sense that the sample scores may have very high or
low values for some components compared to the rest of the samples. Such samples are
dangerous in the calibration phase because they are influential to the model. A sufficiently
extreme sample may by itself span an entire component, in which case the model will
become unreliable. Removal of a highly influential sample from the model will make the
model look entirely different and the axes will span different phenomena altogether. If the
variance described by the sample is important but unique, one should try to obtain more
samples of the same type to stabilize the model. Otherwise the sample should be discarded
as an outlier.
Note that a sample with both high residual variance and high leverage is the most dangerous
outlier. Not only is it poorly described by the model but it is also influential. Samples such as
these may span up to several components single handedly. Because they also disagree with
the majority of the other calibration samples, the ability of the model to describe new
samples is likely poor.
The Q- and F-residuals are two different methods for testing the same thing. The F-residuals are available for both calibration and validation, in contrast to the Q-residuals, which are available for calibration only. The validated residuals reflect the scheme chosen in the validation and are a more conservative assessment of residual outliers. If the residual variance from validation is much higher than that for calibration, one should investigate the residuals in more detail.
The difference between Leverage and Hotelling's T² is only a scaling factor. The critical limit for Leverage is based on an ad hoc rule, whereas the Hotelling's T² critical limit is based on the assumption of a Student's t-distribution.
Influence plot
In the above plot, sample 25 has a high leverage on PC6, which is the dimensionality of the
model. This sample has to be checked as it is a probable outlier.
Three cases can be detected from the influence plot:
Case 1: A sample has a high leverage
This is an influential sample. Check the reasons for it to be influential and decide
what to do.
Case 2: A sample has a high residual
Check which variables are poorly described by the model for this sample. Decide if
this sample is an outlier.
Case 3: A sample has a high leverage and a high residual
This sample is most likely an outlier. Retaining this sample in the model is risky.
Note: When working with designed data, the leverage of each sample in the design
is known by construction, and these leverages are optimal, i.e. all design samples
contribute equally to the model. There is therefore no need to worry about
leverages when running a regression: the design has accounted for them.
What to do with an influential sample
The first thing to do is to understand why the sample has a high leverage (and, possibly, a
high residual variance). Investigate by looking at the raw data and checking them against the
original recordings.
There are two cases to consider:
Case 1
There is an error in the data. Correct it, or if the true value cannot be found or the
experiment cannot be redone to get a more valid value, replace the erroneous value
with “missing”.
Case 2
There is no error, but the sample is different from the others. For instance, it has
extreme values for several of the variables. Check whether this sample is “of
interest” (e.g. it has the properties to be achieved, to a higher degree than the other
samples), or “not relevant” (e.g. it belongs to another population than the one
under study). In the former case, try to generate more samples of the same kind:
they are the most interesting ones! In the latter case (and only then), remove the
high-leverage sample from the model.
Calibration and validation samples can be displayed in the influence plot by toggling
between them with the toolbar buttons. This is only possible if the validation
method chosen was cross validation or test set validation.
Explained variance
This plot gives an indication of how much of the variation in the data is described by the
different components.
Total residual variance is computed as the sum of squares of the residuals for all the
variables, divided by the number of degrees of freedom.
Total explained variance is then computed as the share of the total variance accounted for
by the model:
Total explained variance (%) = 100 × (total variance − total residual variance) / total variance
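As an illustration, here is a minimal numpy sketch of how explained variance accumulates per component (simulated data, PCA via SVD; for simplicity this uses plain sums of squares and ignores the degrees-of-freedom correction mentioned above):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 5))
Xc = X - X.mean(axis=0)                      # mean-center the data

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
ss_total = np.sum(Xc**2)                     # total variance (as sum of squares)

explained = []
for a in range(1, len(s) + 1):
    Xhat = (U[:, :a] * s[:a]) @ Vt[:a]       # a-component reconstruction
    ss_resid = np.sum((Xc - Xhat)**2)        # total residual sum of squares
    explained.append(100.0 * (1.0 - ss_resid / ss_total))

# Explained variance grows monotonically toward 100%
assert all(e2 >= e1 for e1, e2 in zip(explained, explained[1:]))
assert abs(explained[-1] - 100.0) < 1e-8
```

Plotting `explained` against the component number reproduces the explained-variance curve described in this section.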
Calibration variance is based on fitting the calibration data to the model. Validation variance
is computed by testing the model on data that were not used to build the model. Compare
the two variances: if they differ significantly, there is good reason to question whether
either the calibration data or the test data are truly representative. The figure below
shows a situation where the residual validation variance is much larger than the residual
calibration variance (or the explained validation variance is much smaller than the explained
calibration variance). This means that although the calibration data are well fitted (small
residual calibration variances), the model does not describe new data well (large residual
validation variance).
Conversely, if the two residual variance curves are close together, the model is
representative.
Total residual variance curves for Calibration and Validation showing the presence of outliers
Outliers can sometimes cause large residual variance (or small explained variance).
They can also cause a drop in the explained validation variance, as can be seen in the
plot below.
Outlier causes a drop of explained variance in validation
If some variables have much larger residual variance than all the other variables for all
components in the model (or for the first 3-4 of them), try rebuilding the model with these
variables deleted. This may produce a model that is easier to interpret.
Note: Both calibration and validation variances are available.
Sample outliers
Scores
See the description in the overview section
Influence
See the description in the overview section
Scores and Loadings
Scores
See the description in the overview section
Loadings
See the description in the overview section
Residuals and influence
Influence Plot
This plot shows the Q- or F-residuals vs. Leverage or Hotelling’s T². See the general
description about the influence plot in the overview section for more details.
The toggle buttons in the toolbar can be used to switch between the various combinations.
Leverage / Hotelling’s T²
The lower left pane of the Residuals and Influence overview displays a line plot of
Hotelling’s T² by default. A toolbar toggle can be used to switch between the
Hotelling’s T² and Leverage views.
Hotelling’s T² statistics
The plot displays the Hotelling’s T² statistic for each sample as a line plot. The associated
critical limit (with a default p-value of 5%) is displayed as a red line.
Hotelling’s T² plot
The Hotelling’s T² statistic has a linear relationship to the leverage for a given sample. Its
critical limit is based on an F-test. Use it to identify outliers or to detect situations where a
process is operating outside normal conditions. There are six different significance levels to
choose from in the drop-down list.
The number of factors (or PCs) may be tuned up or down with the toolbar tools.
Leverage
Leverages are useful for detecting samples which are far from the center within the space
described by the model. Samples with high leverage differ from the average samples; in
other words, they are likely outliers. A large leverage also indicates a high influence on the
model. The figure below shows a situation where sample 5 is obviously very different from
the rest and may disturb the model.
One sample has a high leverage
There is an ad-hoc critical limit for Leverage (one that does not depend on any assumptions
about distribution), computed from the number of components and the number of
calibration samples. Leverages can be interpreted in two ways: absolute and relative.
Absolute leverage values
Leverage values are always larger than zero, and can go up to 1 for samples in the
calibration set. As a rule of thumb, samples with a leverage above 0.4 - 0.5 start
to be a concern.
Relative leverage values
Influence on the model is best measured in terms of relative leverage. For instance,
if all samples have leverages between 0.02 and 0.1 except for one with a leverage
of 0.3, then although this value is not extremely large, that sample is likely to be
influential.
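A hedged sketch of how this screening might be automated in Python (the 0.45 absolute cutoff comes from the 0.4 - 0.5 rule of thumb above; the factor-of-3 relative screen against the median is an illustrative choice, not a rule from this manual):

```python
import numpy as np

def flag_high_leverage(h, abs_limit=0.45, rel_factor=3.0):
    """Flag samples by an absolute and a relative leverage screen.

    h          : leverages, one per calibration sample
    abs_limit  : absolute rule of thumb (0.4 - 0.5 per the text)
    rel_factor : flag leverages above rel_factor times the median
                 (illustrative assumption, not from the manual)
    """
    h = np.asarray(h, dtype=float)
    absolute = h > abs_limit
    relative = h > rel_factor * np.median(h)
    return absolute, relative

# The example from the text: leverages of 0.02 - 0.1 plus one sample at 0.3
h = [0.02, 0.05, 0.1, 0.04, 0.3]
absolute, relative = flag_high_leverage(h)
assert not absolute.any()      # 0.3 is below the absolute rule of thumb
assert relative.tolist() == [False, False, False, False, True]
```

The sample at 0.3 passes the absolute screen but fails the relative one, matching the interpretation given in the text.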
Leverages in designed data
Residuals
The lower right pane of the Residuals and Influence overview displays a line plot of the
sample residual statistics. A toolbar toggle can be used to switch between the Q- and
F-residuals views.
Q-residuals
This plot shows the sample Q-residuals as a line plot with associated limits.
Q-residual sample variance
F-residuals
This plot shows the sample F-residuals as a line plot with associated limits.
Note that the F-residuals are available for both calibration and validation. If the residual
X-variance from validation is much higher than for calibration, one should investigate the
residuals in more detail. The validated residuals reflect the scheme chosen in the validation
and give a more conservative assessment of residual outliers.
Leverage / Hotelling’s T²
See the description in the overview section.
Residuals
See the description in the overview section.
Two plots
The scores and loadings plots will be displayed for PC1-PC2 in two frames.
Four plots
The scores and loadings plots will be displayed for PC1-PC2 in the first two frames and for
PC3-PC4 in the third and fourth frames.
Bi-plot
This is a two-dimensional scatter plot or map of scores for two specified components (PCs),
with the X-loadings displayed on the same plot. It is called a bi-plot. It enables one to
interpret sample properties and variable relationships simultaneously.
Scores
The closer two samples are in the scores plot, the more similar they are with respect to the
two components concerned. Conversely, samples far away from each other are different
from each other.
Here are a few things to look for in the scores plot:
Is there any indication of clustering in the set of samples?
Are the samples evenly spread over the whole region, or is there any accumulation
of samples at one end?
Are some samples very different from the rest?
Loadings
The plot shows the importance of the different variables for the two components specified.
Variables with loadings to the right in the loadings plot will be variables which usually have
high values for samples to the right in the scores plot, etc.
Note: Downweighted variables are displayed in a different color so as to be easily
identified.
Interpret variable projections on the loadings plot. Variables close to each other in the
loadings plot will have a high positive correlation if the two components explain a large
portion of the variance of X. The same is true for variables in the same quadrant lying close
to a straight line through the origin. Variables in diagonally opposed quadrants will have a
tendency to be negatively correlated.
Scores and loadings together
The plot can be used to interpret sample properties. Look for variables projected far away
from the center. Samples lying in an extreme position in the same direction as a given
variable have large values for that variable; samples lying in the opposite direction have low
values.
For instance, in the figure below, C1H3 is the most colorful, while C1H2 has the highest off-
flavor (and probably lowest Raspberry taste). C4H3 is very different from C3H2: C4H3 has
highest Raspberry taste and lowest off-flavor, otherwise those two jams do not differ much
in color and thickness. C3H3 has high Raspberry taste, and is rather colorful. C2H1, C1H1 and
C3H1 are thick, and have little color. The jams cannot be compared with respect to
sweetness, because variable Sweetness is projected close to the center.
Bi-plot for 8 jam samples and 12 sensory properties
Scores
Line
This is a plot of score values vs. sample number for a specified component. Although it is
usually better to look at 2-D or 3-D scores plots because they contain more information, this
plot can be useful whenever the samples are sorted according to the values of an underlying
variable, e.g. time, to detect trends or patterns (see figure below). Also look for systematic
patterns, like a regular increase or decrease, periodicity, etc. (only relevant if the sample
number has a meaning, like time, for instance).
Trend in a scores plot
The smaller the vertical variation (i.e. the closer the score values are to each other), the
more similar the samples are for this particular component. Look for samples that have a
very large positive or negative score value compared to the others: these may be outliers.
An outlier sticks out on a line plot of the scores
2-D scatter
See the description in the overview section
3-D scatter
This is a 3-D scatter plot or map of the scores for three specified components from PCA. The
plot gives information about patterns in the samples and is most useful when interpreting
components 1, 2 and 3, since these components summarize most of the variation in the
data. It is usually easier to look at 2-D scores plots but if three components are needed to
describe enough variation in the data, the 3-D plot is a practical alternative.
Scores plot in 3-D
Like with the 2-D plot, the closer the samples are in the 3-D scores plot, the more similar
they are with respect to the three components.
The 3-D plot can be used to interpret differences and similarities among samples. Look at
the scores plot and the corresponding loadings plot, for the same three components.
Together they can be used to determine which variables are responsible for differences
between samples. Samples with high scores along the first component usually have large
values for variables with high loadings along the first component, etc.
Here are a few patterns to look for in a scores plot.
Finding groups in a scores plot
Do the samples show any tendency towards clustering? A plot with three distinct
clusters is shown below. Samples within the same cluster are similar to each other.
Three groups of samples appear on the scores plot
Check how much of the total variation is explained by each component (these
numbers are displayed at the bottom of the plot). If it is large, the plot shows a
significant portion of the information in the data and it can be used to interpret
relationships with a high degree of certainty. If the explained variation is smaller,
more components or a transformation may be considered, or there may be little
information in the original data.
2 x 2-D scatter
The visualization frame is divided in two. A 2-D scatter plot is displayed in each subframe.
The first plot is in the PC1-PC2 plane and the second in the PC3-PC4 plane.
4 x 2-D scatter
The visualization frame is divided in four. A 2-D scatter plot is displayed in each subframe:
the first in the PC1-PC2 plane, the second in the PC3-PC4 plane, the third in the PC5-PC6
plane, and the fourth in the PC7-PC8 plane.
Loadings
Line
This is a plot of X-loadings for a specified component vs. variable number. It is useful for
detecting important variables. In many cases it is better to look at two- or three-vector
loadings plots instead because they contain more information.
Line plots are most useful for multichannel measurements, for instance spectra from a
spectrophotometer, or in any case where the variables are implicit functions of an
underlying parameter, like wavelength, time, etc. The plot shows the relationship between
the specified component and the different X-variables. If a variable has a large positive or
negative loading, this means that the variable is important for the component concerned;
see the figure below. For example, a sample with a large score value for this component will
have a large positive value for a variable with large positive loading.
Spectral data can default to use line plots for the loadings plot. To set this, right-click on the
given range in the project navigator and tick the Spectra option.
Line plot of the X-loadings, important variables in a spectra
Variables with large loadings in early components are the ones that vary most. This means
that these variables are responsible for the greatest differences between the samples.
Note: Downweighted variables are displayed in a different color to be easily
identified.
2-D scatter
See the description in the overview section
3-D scatter
This is a three-dimensional scatter plot of X-loadings for three specified components from
PCA. The plot is most useful for interpreting directions, in connection to a 3-D scores plot.
Otherwise it is recommended to use line- or 2-D loadings plots.
Note: Downweighted variables are displayed in a different color so as to be easily
identified.
2 x 2-D scatter
The visualization frame is divided in two. A 2-D scatter plot is displayed in each subframe.
The first is in the PC1-PC2 plane and the second in the PC3-PC4 plane.
4 x 2-D scatter
The visualization frame is divided in four. A 2-D scatter plot is displayed in each subframe:
the first in the PC1-PC2 plane, the second in the PC3-PC4 plane, the third in the PC5-PC6
plane, and the fourth in the PC7-PC8 plane.
Residuals
Influence plot
See the description in the overview section
Samples with small residual variance (or large explained variance) for a particular
component are well explained by the corresponding model, and vice versa. In the above
plot, four samples, such as B3, seem poorly explained by the model and may be outliers.
Variable residuals
This is a plot of residuals for a specified X-variable and component number for all the
samples. The plot is useful for detecting outlying sample/variable combinations, as shown
below. Outliers can sometimes be modeled by incorporating more components. This
should, however, be avoided, since it will reduce the prediction ability of the model.
Line plot of the variable residuals
Whereas the sample residual plot gives information about residuals for all variables for a
particular sample, this plot gives information about all possible samples for a particular
variable. It is therefore more useful when investigating how one specific variable behaves in
all the samples.
Sample residuals
This is a plot of the residuals for a specified sample and component number for all the X-
variables. It is useful for detecting outlying sample or variable combinations. Although
outliers can sometimes be modeled by incorporating more components, this should be
avoided since it will reduce the prediction ability of the model.
Line plot of the sample residuals: one variable is an outlier
In the above plot, variable 1: Adhesiveness at 1 day is, for a particular sample, not well
described by a model with four components. If this is the case for most of the samples,
this variable may be noisy and can be considered an outlier.
In contrast to the variable residual plot, which gives information about residuals for all
samples for a particular variable, this plot gives information about all possible variables for a
particular sample. It is therefore useful when studying how a specific sample fits to the
model.
In the above map, one sample is suspect and should be further investigated.
Leverage / Hotelling’s T²
Leverages
Line
See the description in the Plot accessible from the Navigator section
Matrix
This is a matrix plot of leverages for all samples and all model components. The X-axis
represents the components and the Y-axis the samples. The color represents the Z-value
which is the leverage, the color scale can be customized. It is a useful plot for studying how
the influence of each sample evolves with the number of components in the model.
Hotelling’s T²
Line
See the description in the predefined plot section
Matrix
This is a matrix plot of Hotelling’s T² statistics for all samples and all model components. It is
equivalent to the matrix plot of leverages, to which it has a linear relationship. The Y-axis
represents the components and the X-axis the samples. The color represents the Z-value
which is the Hotelling’s T² statistic for a specific PC and sample, the color scale can be
customized.
13.6. Bibliography
C.B. Crawford and G.A. Ferguson, A general rotation criterion and its use in orthogonal
rotation, Psychometrika, 35(3), 321-332, (1970).
R.A. Darton, Rotation in Factor Analysis, The Statistician, 29, 167-194, (1980).
K. Esbensen, Multivariate Data Analysis - In Practice, 5th Edition, CAMO Process AS, Oslo,
2002.
H.H. Harman, Modern Factor Analysis, 3rd Edition, revised, University of Chicago Press,
1976.
H. Hotelling, Analysis of a complex of statistical variables into principal components, J. Educ.
Psych., 24, 417-441, 498-520, (1933).
J.E. Jackson, A Users Guide to Principal Components, Wiley & Sons Inc., New York, 1991.
H.F. Kaiser, The varimax criterion for analytic rotation in factor analysis, Psychometrika, 23,
187-200, (1958).
K.V. Mardia, J.T. Kent, and J.M. Bibby, Multivariate Analysis, Academic Press Inc, London,
1979.
J.O. Neuhaus and C. Wrigley, The Quartimax Method: An analytic approach to orthogonal
simple structure, British J. Statistical Psychology, 7(2), 81-91, (1954).
D.R. Saunders, An analytic method for rotation to orthogonal simple structure, Princeton,
Educational Testing Service Research Bulletin, 53-10, (1953).
14. Multiple Linear Regression
14.1. Multiple Linear Regression
Multiple Linear Regression (MLR) is the classical method that combines a set of several
X-variables in a linear combination which correlates as closely as possible with the
corresponding single Y-vector.
Theory
Usage
Plot Interpretation
Method reference
Basics
Principles behind Multiple Linear Regression (MLR)
Sum of squares due to error SSE
Sum of squares due to regression SSreg
The ANOVA for regression
Interpreting the results of MLR
Regression coefficients (b-coefficients)
Predicted vs. reference plot
Residuals
Random and normally distributed residuals
Non-constant variance
Curvature in residuals
Systematic variance
Form of the model
More details about regression methods
14.2.1 Basics
MLR: Regressing one Y-variable on a set of X-variables
The theory behind MLR has been well described in the literature and texts such as the book
by Montgomery, Peck, Vining, 2001 and Weisberg, 1985 are excellent sources for subject
matter on this topic.
In MLR a direct “least squares” regression is performed between the Y- and the X-matrix. In
this section, the case of regression on one column vector Y will be addressed for simplicity,
but the method can readily be extended to a whole Y-matrix (as is common when MLR is
applied to designed experiment (DOE) data with multiple responses). In this case one can
make independent MLR models, one for each Y-variable, based on the same X-matrix.
The following MLR model equation is just an extension of the normal univariate straight-line
equation:
y = b0 + b1x1 + b2x2 + … + bpxp + f
The objective is to find the vector of regression coefficients b that minimizes f, the error
term. This is where the least squares criterion on the squared error terms is used, i.e. find b
so that fᵀf is minimized. MLR estimates the model coefficients using the equation:
b = (XᵀX)⁻¹Xᵀy
This operation involves the inversion of the so-called dispersion matrix, (XᵀX)⁻¹. If any of
the X-variables show collinearity with each other, i.e. if the variables are not linearly
independent, then the MLR solution will not be stable (if there is a solution at all).
Incidentally, this is the reason why the predictors are called independent variables in MLR;
the ability to vary the X-variables independently of each other is a crucial requirement to
variables used as predictors with this method. This is why in DOE, the initial design matrix is
generated in such a way as to establish this independence (also called orthogonality) in the
first place. MLR also requires more samples than predictors or the matrix cannot be
inverted.
MLR has the following properties and behavior:
The number of X-variables must be smaller than the number of samples;
In case of collinearity among X-variables, the b-coefficients are not reliable and the
model may be unstable;
MLR tends to overfit when noisy data are used.
The Unscrambler® uses the QR decomposition to find the MLR solution. No missing values
are accepted in this decomposition.
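A minimal numpy sketch of QR-based least squares on simulated data (illustrative only, not The Unscrambler's implementation) shows that it reproduces the normal-equation solution b = (XᵀX)⁻¹Xᵀy without explicitly inverting the dispersion matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.5 + 2.0 * x1 - 0.5 * x2 + 0.01 * rng.normal(size=n)   # small noise

X = np.column_stack([np.ones(n), x1, x2])    # design matrix with intercept

# QR-based least squares: X = QR, then solve the triangular system R b = Q^T y
Q, R = np.linalg.qr(X)
b = np.linalg.solve(R, Q.T @ y)

# Matches the normal-equation solution b = (X^T X)^{-1} X^T y
b_ne = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(b, b_ne)
assert np.allclose(b, [1.5, 2.0, -0.5], atol=0.05)   # recovers the true coefficients
```

The QR route is numerically more stable than forming XᵀX, which is one common motivation for using it in least-squares solvers.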
For example, for a model with an intercept and two terms X1 and X2, there are 3 DOF
contributed by the model: one for the intercept and two from the model terms X1 and X2.
The total DOF for a data set is equal to the number of observations (n) minus 1. Using the
ANOVA model definition, the residual DOF can then be found as the total DOF minus the
model DOF.
When MSreg is larger than MSE, this implies that a greater part of the total variance is being
described by the fitted model. Significance can then be established from the p-value
calculated at a particular significance level. If MSE is larger than MSreg or the regression is
found to be insignificant, then questions must be raised regarding the validity of the fitted
model.
The general form of the ANOVA table for regression is provided below.
Generic ANOVA table for regression
Source of variation   Sum of Squares   Degrees of Freedom   Mean Square             F0
Regression            SSreg            p                    MSreg = SSreg/p         MSreg/MSE
Residual (error)      SSE              n - p - 1            MSE = SSE/(n - p - 1)
Total                 SST              n - 1
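The table entries can be computed directly. The following numpy/scipy sketch (simulated data with an intercept and two model terms; illustrative, not The Unscrambler's code) fills in the generic table and derives the p-value from the upper tail of the F-distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 25, 2                                  # observations, model terms (excl. intercept)
x1, x2 = rng.normal(size=(2, n))
y = 4.0 + 1.2 * x1 - 0.8 * x2 + rng.normal(scale=0.3, size=n)

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

ss_total = np.sum((y - y.mean())**2)          # SST, n-1 DOF
ss_reg = np.sum((y_hat - y.mean())**2)        # SSreg, p DOF
ss_err = np.sum((y - y_hat)**2)               # SSE, n-p-1 DOF

ms_reg = ss_reg / p
ms_err = ss_err / (n - p - 1)
f0 = ms_reg / ms_err
p_value = stats.f.sf(f0, p, n - p - 1)        # upper-tail F probability
r2 = ss_reg / ss_total                        # R-squared, as defined below

assert np.isclose(ss_total, ss_reg + ss_err)  # sums of squares partition exactly
assert p_value < 0.05                         # clearly significant fit here
```

With a strong simulated signal and small noise, MSreg dwarfs MSE, F0 is large, and the p-value is far below 0.05, i.e. the model is significant.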
The fitted MLR model is linear, meaning that the observed response values are
approximated by a linear combination of the values of the predictors. The coefficients of
that combination are called regression coefficients or b-coefficients.
Several diagnostic tools are associated with the regression coefficients (available only for
MLR):
Standard error is a measure of the precision of the estimation of a coefficient;
A Student’s t-value can be computed; comparing it to a reference t-distribution
then yields a significance level or p-value. This is the probability of a t-value equal
to or larger than the observed one, if the true value of the regression coefficient
were 0.
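A small numpy/scipy sketch of these diagnostics on simulated data (illustrative, not The Unscrambler's code): the standard errors come from the diagonal of the inverted dispersion matrix (XᵀX)⁻¹ scaled by the error variance, and two-sided p-values from a Student's t-distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 + 0.0 * x2 + rng.normal(scale=0.5, size=n)   # x2 has no true effect

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b
dof = n - X.shape[1]
mse = resid @ resid / dof                        # error variance estimate

# Standard error of each coefficient from the dispersion matrix (X^T X)^{-1}
se = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))
t_vals = b / se
p_vals = 2 * stats.t.sf(np.abs(t_vals), dof)     # two-sided p-values

assert p_vals[1] < 1e-6        # x1 is highly significant
assert p_vals[2] > p_vals[1]   # x2 (no true effect) is far less significant
```

A coefficient whose p-value exceeds the chosen significance level (e.g. 0.05) would typically be considered for removal from the model.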
Regression coefficients show how each variable is weighted when predicting a particular Y
response. Regression coefficients are a characteristic of all regression methods and provide
great interpretive insight into the quality of a model. Examples include
Spectroscopy
chosen wavelengths should exhibit changes related to chemical signals in the
samples and not show noise or unexplainable characteristics.
Designed data
When different variable types exist, regression coefficients show the relative
importance of the variables, and their interactions can also be displayed as cross
terms of the type b12x1x2.
Predicted vs. reference plot
The predicted vs. reference plot is also another common feature of all regression methods.
The predicted vs. reference plot should show a straight line relationship between predicted
and measured values, ideally with a slope of 1 and a correlation coefficient (R²) close to 1.
More details on plots can be found in interpreting MLR plots.
For MLR, the correlation coefficient R² is calculated as the ratio of SSreg and SST, i.e.
R² = SSreg / SST
It is the ratio of the variance explained by the model to the total variance that can be
explained. Other variants of the R² statistic are available when terms are added to or
removed from the model.
Residuals
Residuals relate to the SSE term in the ANOVA and, for a good model, should have a mean
value of zero and a variance s² indicative of the experimental error associated with the
analysis. Residuals can be plotted as Y-predicted vs. Y-actual or as Studentized residuals.
This plot should show that the residuals are randomly distributed around zero with no
visible trending. Some examples of residual patterns are provided in the figure below.
Non-constant variance
This is also known as heteroscedasticity. It occurs when the precision of the analyzing
instrument decreases or the variability of a data set increases in a particular direction. In this
case, the range of the Y-variables should be decreased or other analysis methods, such as
weighted least squares should be used.
Curvature in residuals
This occurs when the form of the model is incorrect. MLR attempts to fit a linear model to
the data, however, if the underlying relationship is quadratic in nature, then the linear
model is not the best fit. This can be detected using Lack of Fit (LoF) tests.
Systematic variance
This can occur when important model terms are left out of the final equation, or an
important source of variance has not been included in the initial design. This is the most
difficult situation to deal with in the MLR problem and the source of the variation may be
either controllable or uncontrollable.
Form of the model
The general MLR equation is called a linear model because it is linear in terms of the
coefficients. The following model is also linear and takes into account interaction terms
between the regressors:
y = b0 + b1X1 + b2X2 + b12X1X2 + f
The term b12X1X2 takes into account the possibility that the interaction term contributes
significantly to the final model. In this situation, this term adds extra DOF to the regression
terms in the model and may account for any observed curvature in the residuals (should
they exist). The significance of the interaction term can be established using a t-test. If
interaction terms are found to be insignificant, they should be removed from the
model, since their inclusion inflates the SSE term in the ANOVA model.
Another important term that can be added to the MLR model to account for curvature is a
square term. The form of the model can be described as follows:
y = b0 + b1X1 + b2X2 + b11X1² + b22X2² + f
where the coefficients b11 and b22 refer to the square terms in the model. The significance
of these terms should be established using a t-test.
The reason why the MLR model can still be described as linear (even with interaction and
square terms) is that terms of the form b12X1X2 and b11X1² can be rewritten as ordinary
linear terms by treating the products X1X2 and X1² as new variables.
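This substitution can be demonstrated with a short numpy sketch (simulated, noise-free data; illustrative only): the interaction and square terms are simply appended as extra columns of the design matrix, and ordinary least squares recovers the coefficients.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
# True model with an interaction term and a square term (no noise)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + 0.8 * x1**2

# Treat X1*X2 and X1^2 as ordinary extra columns: the model stays
# linear in the coefficients, so plain least squares still applies.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2, x1**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(b, [1.0, 2.0, -1.0, 0.5, 0.8], atol=1e-8)
```

Because the data are noise-free and the augmented columns are linearly independent, the least-squares fit reproduces the true coefficients exactly (to numerical precision).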
In the Model Inputs tab, first select an X-matrix to be analyzed in the Predictors frame.
Select pre-defined row and column ranges in the Rows and Cols boxes, or click the Define
button to perform the selection manually in the Define Range dialog. For MLR analysis the
number of samples must exceed the number of variables.
Next select a Y-matrix to be analyzed in the Responses frame. The responses may be taken
from the same data table as the predictors or from any other data table in the project
navigator. Models may be developed for single or multiple responses.
Note: If a separate Y-response matrix is being used, ensure that the row names of Y
correspond to the row names in X. Otherwise, non-meaningful regression results
will be obtained.
The Include Intercept Term check box can be used to add an intercept term in the model. If
the data have been previously mean centered, the intercept term will be zero. If an intercept
term is found to be nonsignificant, then it can be removed from the analysis.
The Significance Level (alpha) box allows a user to set the significance level applied to the
regression results. The value 0.05 (i.e. 95% confidence) is used by default.
The Identify Outliers check box allows a user to set up certain criteria in the Warning Limits
tab and use these to identify potential outliers during the analysis.
The details of the analysis setup are provided in the Information box on the model inputs
tab. It is important to check the details in this box each time an analysis is performed, to
ensure that the correct parameters have been set. The information contained in this box is:
Some important tips and warnings associated with the Model Inputs tab
MLR is the simplest multivariate regression analysis technique. It does not work if there are
more variables than samples. If there are more variables than samples present in a defined
data set, the following warning will be provided.
More variables than samples present
Solution: Define a data set where there are at least 2 more samples than variables present.
If the number of rows in X does not match that of Y, the following warning will be provided:
Number of X rows does not match number of Y rows
Solution: Ensure that the row set dimensions of X match the row set dimensions of Y.
If too many samples or variables are excluded, the following warning will be provided:
Too many excluded samples/variables
Solution: Check that all samples/variables have not been excluded in a data set.
To keep track of row and column exclusions, the model inputs tab provides a warning to
users that exclusions have been defined. See automatic keep outs for more details.
Use the Matrix drop-down list to select the test set from the rows and columns drop-down
lists, or define a set using the Define button.
If the variable dimension of the test set does not match that of the set used for calibration,
the following warning is provided:
Solution: Define a meaningful set of variables to match those of the calibration set.
In the case where too many samples or variables have been excluded from the test set, the
following warning will be provided.
Solution: Ensure that there are some variables defined for the calculation.
Set this tab up based on a priori knowledge of the data set in order to return outlier
warnings for the model. Settings for estimating the optimal number of components can
also be tuned here. The values shown in the dialog box above are default values and might
be used as a starting point for the analysis.
The warning limits in The Unscrambler® serve two major purposes: flagging possible
outliers and guiding the estimation of the optimal number of components.
The leverage and residual (outlier) limits are given as standard scores. This means that a
limit of e.g. 3.0 corresponds to a 99.7% probability that a value will lie within 3.0 standard
deviations of the mean of a normal distribution. The following limits can be specified:
Leverage Limit
(default 3.0) The ratio between the leverage for an individual sample and the
average leverage for the model.
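The 99.7% figure quoted above follows directly from the normal distribution and can be verified with a short, purely illustrative snippet:

```python
import math

def normal_coverage(k):
    """Probability that a normally distributed value lies within k standard
    deviations of the mean: P(|Z| < k) = erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2.0))

normal_coverage(3.0)  # about 0.9973, i.e. the 99.7% quoted above
```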
As variable weighting will change the relative sizes of MLR regression coefficients, we do not
recommend using weighting indiscriminately, and there is no Weights tab in MLR. To assess
standardized regression coefficients whose magnitudes do not depend on the variance of the
variables, auto-scale the variables as a pre-processing step: go to
Tasks–Transform–Weights prior to analysis and divide each variable by its standard
deviation.
See Theory of weighting for more details.
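As a sketch of the auto-scaling step described above (centering each variable and dividing by its standard deviation; the function name is illustrative, not a product API):

```python
import statistics

def autoscale(column):
    """Center one variable and divide by its standard deviation, as in the
    pre-processing step described above. Returns the scaled values."""
    mean = statistics.mean(column)
    sdev = statistics.stdev(column)
    return [(x - mean) / sdev for x in column]

scaled = autoscale([1.0, 2.0, 3.0, 4.0])
# After scaling, the variable has mean 0 and standard deviation 1,
# so coefficient magnitudes no longer depend on the original units.
```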
When all the settings are done, click OK to perform the analysis.
ANOVA Table
The ANOVA table contains degrees of freedom, sums of squares, mean squares, F-values and
p-values for all sources of variation included in the model. The Multiple Correlation
coefficient and the R-squared are also presented above the main table. A value close to 1
indicates a good fit, while a value close to 0 indicates a poor fit.
Summary
The first part of the ANOVA table is a summary of the significance of the global model. If the
p-value for the global model is smaller than 0.05, it means that the model explains more of
the variations of the response variable than could be expected from random phenomena. In
other words, the model is significant at the 5% level. The smaller the p-value, the more
significant (and useful) the model is.
Variables
The second part of the ANOVA table deals with each individual
effect (main effects, optionally also interactions and square terms). If the p-value for an
effect is smaller than 0.05, it means that the corresponding source of variation explains
more of the variations of the response variable than could be expected from random
phenomena. In other words, the effect is significant at the 5% level. The smaller the p-value,
the more significant the effect is.
Model check
The model check tests whether the nonlinear part of the model is significant. It includes up
to three groups of effects.
If the p-value for a group of effects is larger than 0.05, it means that these effects are not
useful, and that a simpler model would perform as well. Try to recompute the response
surface without those effects!
Lack of fit
The lack of fit part tests whether the error in response prediction is mostly due to
experimental variability or to an inadequate shape of the model. If the p-value for lack of fit
is smaller than 0.05, it means that the model does not describe the true shape of the
response surface. In such cases, try a transformation of the response variable.
Regression (t-values)
The t-value for each coefficient is computed as the ratio between deviation from the mean
accounted for by the variable represented by the coefficient, and standard error of the
mean.
By comparing the t-value with its theoretical distribution (Student’s t-distribution), the
significance level of the studied coefficient is assessed.
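To make the computation concrete, here is a minimal sketch for the simplest case of a single predictor (simple linear regression), where the t-value is the estimated coefficient divided by its standard error. This is an illustration only, not the product's internal code:

```python
import math

def slope_t_value(x, y):
    """t-value for the slope in simple linear regression:
    t = coefficient / standard error of the coefficient."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx                                   # slope estimate
    residuals = [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in residuals) / (n - 2)    # residual variance
    se = math.sqrt(s2 / sxx)                        # standard error of the slope
    return b / se

# A strong linear relationship gives a large t-value for the slope.
t = slope_t_value([1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 1.9, 3.2, 3.8, 5.1])
```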
The t-values plot presents all the t-values for all coefficients.
Regression (t-values)
In the above plot the predictive variables “Protein”, “Carbohydrates” and “Fat” show high t-
values; they are likely to have significant effects in the model.
“Saturated fat” shows a t-value close to 0 and therefore is likely to be non-significant.
For predefined limits look at the p-value plot.
The presence of an outlier is shown in the example below. The outlying sample has a much
larger residual than the others; however, it does not seem to disturb the model to a large
extent.
A simple outlier has a large residual
The figure below shows the case of an influential outlier: not only does it have a large
residual, it also distorts the whole model so that the remaining residuals show a very clear
trend. Such samples should usually be excluded from the analysis, unless there is an error in
the data or some data transformation can correct for the phenomenon.
An influential outlier changes the structure of the residuals
Small residuals (compared to the variance of Y) which are randomly distributed indicate
adequate models.
Some statistics giving an idea of the quality of the regression are available from the
Predicted vs. Reference plot.
When the calibration and validation samples are similar and lie close to a straight line of
slope 1, the fit can be considered as good.
Predicted vs. Reference plot for Calibration and Validation, with a good fit.
To determine the quality of the fit, the following statistics are available:
Slope
The closer the slope is to 1, the better the data are modelled.
Offset
This is the intercept of the line with the Y-axis, i.e. the value where the line
crosses X = 0. (Note: this value is not necessarily zero!)
RMSE
The first value (in blue) is the calibration error, RMSEC; the second one (in red) is the
expected prediction/estimation error, RMSEP or RMSECV, depending on the validation
method used. Both are expressed in the same unit as the response variable Y.
R-squared
The first one (in blue) is the raw R-squared of the model, the second one (in red) is
also called adjusted R-squared and tells how good a fit can be expected for future
predictions. R-squared varies between 0 and 1. A value of 0.9 is usually considered
as pretty good but this varies depending on the application and on the number of
samples.
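The statistics above can be sketched from paired reference and predicted values as follows (an illustrative computation, not the product's internal code; here R-squared is computed from the prediction residuals):

```python
import math

def fit_statistics(reference, predicted):
    """Slope, offset, RMSE and R-squared for a Predicted vs. Reference plot."""
    n = len(reference)
    mr = sum(reference) / n
    mp = sum(predicted) / n
    sxx = sum((r - mr) ** 2 for r in reference)
    sxy = sum((r - mr) * (p - mp) for r, p in zip(reference, predicted))
    slope = sxy / sxx                     # slope of predicted vs. reference
    offset = mp - slope * mr              # intercept with the Y-axis
    ss_res = sum((p - r) ** 2 for r, p in zip(reference, predicted))
    rmse = math.sqrt(ss_res / n)          # in the same unit as Y
    r2 = 1.0 - ss_res / sxx               # 1 means a perfect fit
    return slope, offset, rmse, r2
```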
When the statistics are toggled on, more detailed statistics are displayed. The Calibration
plot is shown below with statistics.
Predicted vs. Reference plot for MLR Calibration samples
In the above plot, sample 3 does not follow the regression line whereas all the other
samples do. Sample 3 may be an outlier.
How to detect nonlinearity
In other cases, there may be a nonlinear relationship between the X- and Y-
variables, so that the predictions do not have the same level of accuracy over the
whole range of variation of Y. In such cases, the plot may look like the one shown
below. Such nonlinearities should be corrected if possible (for instance by a suitable
transformation), because otherwise there will be a systematic bias in the predictions
depending on the range of the sample.
Predicted vs. Reference shows a nonlinear relationship
Regression coefficients
Regression coefficients summarize the relationship between all predictors and a given
response.
The regression coefficients line plot is available for the weighted beta coefficients (Bw).
Note: If no weights were applied, the weighted coefficients are identical to the
raw coefficients.
Weighted regression coefficients
The above plot shows the weighted regression coefficients for the response variable (Y).
Each predictor variable (X) defines one point of the line (or one bar of the plot). It is
recommended to configure the layout of this type of plot as bars. Variables 1, 7, 9 and 11
have the highest weighted B coefficients.
The B0 coefficient is displayed along with the X-axis name. In this case B0 = 0.03708.
The weighted coefficients reflect the importance of the X-variables in the model.
However the raw coefficients are also interesting as those are used to write the model
equation in original units:
The raw coefficients do not reflect the importance of the X-variables in the model, because
the sizes of these coefficients depend on the range of variation (and indirectly, on the
original units) of the X-variables. A small raw coefficient does not necessarily indicate an
unimportant variable; a large raw coefficient does not necessarily indicate an important
variable.
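The relation between weighted and raw coefficients can be sketched as follows: if a variable was multiplied by a weight w before modelling, its raw coefficient is the weighted coefficient times w (a simplified sketch that ignores centering; illustrative only):

```python
def raw_from_weighted(b_weighted, weights):
    """If the model was fitted on x_w = w * x, then b_w * x_w = (b_w * w) * x,
    so the raw coefficient equals the weighted coefficient times the weight."""
    return [bw * w for bw, w in zip(b_weighted, weights)]

# A variable weighted by 1/SDev = 0.5 with weighted coefficient 2.0
# has raw coefficient 1.0 in the original units.
b_raw = raw_from_weighted([2.0], [0.5])
```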
If the purpose is to identify important predictors, use plots with t-values and p-values when
available.
Regression (t-values)
For more information look into the overview section.
Regression (p-values)
The p-value measures the probability that a parameter estimate should be as large as it is
if the real (theoretical, non-observable) value of that parameter were actually zero. Thus,
the p-value is used to assess the significance of observed variations: a small p-value means
that there is little risk of mistakenly concluding that the observed effect is real.
The usual limit used in the interpretation of a p-value is 0.05 (or 5%). If p-value < 0.05, the
observed effect is not due to random variations. Thus, the variable under study has a
significant effect.
The plot of the p-values presents the p-values for each coefficient included in the MLR.
Regression (p-values)
In the above plot “Protein” is significant below 5%. “Fat” and “Carbohydrates” show
significant effects below 20% and 10%, respectively. “Saturated fat” does not have a
significant effect.
The p-value is also called the “significance level”.
Leverage
Leverages are useful for detecting samples which are far from the center within the space
described by the model. Samples with high leverage differ from the average samples; in
other words, they are likely outliers. A large leverage also indicates a high influence on the
model. The figure below shows a situation where sample 5 is obviously very different from
the rest and may disturb the model.
One sample has a high leverage
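For the simplest case of a single predictor with an intercept, leverage has a closed form, h_i = 1/n + (x_i − mean)² / Sxx, which can be sketched as follows (illustrative only):

```python
def leverages(x):
    """Leverage of each sample in simple regression with an intercept:
    h_i = 1/n + (x_i - mean)^2 / Sxx. Samples far from the center get
    high leverage; the leverages sum to the number of model parameters."""
    n = len(x)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    return [1.0 / n + (xi - mx) ** 2 / sxx for xi in x]

# The last sample is far from the others and gets by far the highest leverage.
h = leverages([1.0, 2.0, 3.0, 4.0, 20.0])
```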
Response surface
This plot is used to find the settings of the X-variables which give an optimal response value
for the variable Y, and to study the general shape of the response surface fitted by the
Regression model.
It is necessary to specify which X-variables should be plotted; use the dialog box that
appears for this purpose.
Response Surface dialogue
This plot can appear in various layouts. The most relevant are:
Contour plot;
Landscape plot.
Analysis of variance
See the description in the Interpreting MLR plots section
Regression coefficients
See the description in the Interpreting MLR plots section
Residuals
General
Y-residuals vs. predicted Y
See the description in the Interpreting MLR plots section
Normal probability Y-residuals
This plot displays the cumulative distribution of the Y-residuals with a special scale, so that
normally distributed values should appear along a straight line. The plot shows all residuals
for one particular Y-variable (look for its name in the plot ID). There is one point per sample.
If the model explains the complete structure present in the data, the residuals should be
randomly distributed - and usually, normally distributed as well. So if all the residuals are
along a straight line, it means that the model explains everything that can be explained in
the variations of the variables that are being predicted.
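The idea behind the normal probability plot can be sketched by pairing the sorted residuals with theoretical normal quantiles; the plotting positions (i + 0.5)/n used here are one common convention (illustrative only):

```python
from statistics import NormalDist

def normal_probability_points(residuals):
    """Pair each sorted residual with its theoretical standard-normal
    quantile. If the pairs fall on a straight line, the residuals are
    consistent with a normal distribution."""
    n = len(residuals)
    nd = NormalDist()
    quantiles = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(quantiles, sorted(residuals)))

# Symmetric residuals give symmetric quantile/residual pairs.
points = normal_probability_points([2.0, -1.0, 0.0, 1.0, -2.0])
```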
If most of the residuals are normally distributed, and one or two stick out, these particular
samples are outliers. This is shown in the figure below. If there are outliers, mark them and
check the data.
Two outliers are sticking out
If the plot shows a strong deviation from a straight line, the residuals are not normally
distributed, as in the figure below. In some cases - but not always - this can indicate lack of
fit of the model. However it can also be an indication that the error terms are simply not
normally distributed.
The residuals have a regular but non-normal distribution
Influence plot
This plot displays the sample residual X-variances against leverages. It is most useful for
detecting outliers, influential samples and dangerous outliers.
Samples with high residual variance, i.e. lying to the top of the plot, are likely outliers.
Samples with high leverage, i.e. lying to the right of the plot, are influential; this means that
they somehow distort the model so that it describes them better. Influential samples are not
necessarily dangerous, if they obey the same model as more “average” samples.
A sample with both high residual variance and high leverage is a dangerous outlier: it is not
well described by a model which correctly describes most samples, and it distorts the model
so as to be better described, which means that the model then focuses on the difference
between that particular sample and the others, instead of describing more general features
common to all samples.
Three cases can be detected from the influence plot:
Leverages in designed data
For designed samples, the leverages should be interpreted differently depending on whether
the analysis is a regression (with the design variables as X-variables) or a PCA on the responses.
By construction, the leverage of each sample in the design is known, and these leverages are
optimal, i.e. all design samples have the same contribution to the model. So do not worry
about the leverages when running a regression: the design has taken care of it.
However, when running a PCA on the response variables, the leverage of each sample is now
determined with respect to the response values. Thus some samples may have high
leverages, either in an absolute or a relative sense. Such samples are either outliers, or just
samples with extreme values for some of the responses.
What to do with an influential sample?
The first thing to do is to understand why the sample has a high leverage (and, possibly, a
high residual variance). Investigate by looking at the raw data and checking them against the
original recordings.
There are two possible cases.
Case 1
There is an error in the data. Correct it; if the true value cannot be found and the
experiment cannot be re-done to get a more valid value, replace the erroneous
value with “missing”.
Case 2
There is no error, but the sample is different from the others. For instance, it has
extreme values for several of the variables. Check whether this sample is “of
interest” (e.g. it has the properties to be achieved, to a higher degree than the other
samples), or “not relevant” (e.g. it belongs to another population than the one
under study). In the former case, try to generate more samples of the same kind:
they are the most interesting ones! In the latter case (and only then), remove the
high-leverage sample from the model.
Variance per sample
This plot shows the residual (or explained) X-variance for all samples for the regression. The
plot is useful for detecting outlying samples, as shown below.
An outlying sample has high residual variance
Samples with small residual variance (or large explained variance) are well explained by the
regression model, and vice versa. In the above plot, four samples, such as B3, seem not to
be well explained by the model and may be outliers.
Variable residuals
This is a plot of residuals for the Y-variable for all the samples. The plot is useful for detecting
outlying sample or variable combinations, as shown in the figure below.
Line plot of the variable residuals
This plot gives information about all samples for a particular variable (as opposed to the
sample residual plot, which gives information about residuals for all variables for a
particular sample); hence it is more useful for studying how a specific variable behaves
across all the samples.
Sample residuals
This plot shows the residuals for a specified sample for the Y-variable. It is useful for
detecting outlying samples.
Go through the different samples to see if any sample has a residual that is too high in
comparison with the others. To do so, use the arrows or the drop-down list for sample
selection.
Sample Residual
Outliers
Influence plot
See the description in the above section
Y-residuals vs. predicted Y
See the description in the Interpreting MLR plots section
Leverage
See the description in the Interpreting MLR plots section
Response Surface
See the description in the Interpreting MLR plots section
14.6. Bibliography
C. R. Goodall, “Computation Using the QR Decomposition”, in Handbook of Statistics,
Vol. 9, Elsevier, Amsterdam, 1993.
D.C. Montgomery, E.A. Peck, and G.G. Vining, Introduction to Linear Regression Analysis,
Third Edition, Wiley-Interscience, New York, 2001.
S. Weisberg, Applied Linear Regression, Second Edition, Wiley, New York, 1985.
15. Principal Components Regression
15.1. Principal Component Regression
PCR is a method for relating the variations in a response variable (Y-variable) to the
variations of several predictors (X-variables), for explanatory or predictive purposes.
PCR is a two-step procedure which first decomposes an X-matrix by PCA, then fits an MLR
model, using the PC scores instead of the original X-variables as predictors.
Theory
Usage
Plot Interpretation
Method reference
Basics
Interpreting the results of a Principal Component Regression (PCR)
Scores and loadings
Regression coefficients
Predicted vs. reference plot
Error measures for PCR
Some more theory of PCR
PCR algorithm options
15.2.1 Basics
PCR is a two-step procedure which first decomposes an X-matrix by PCA, then fits an MLR
model, using the PC scores instead of the original X-variables as predictors.
This method performs particularly well when the various X-variables express common
information, i.e. when there is a large amount of correlation, or even collinearity.
Since the scores are orthogonal, the MLR solution is stable and therefore the PCR model
does not suffer from collinearity effects. It is the belief of some data analysis purists that PCR
is superior to PLS since it forces analysts to better understand their data and its
preprocessing (transformations) before the application of a regression procedure. The
procedure for performing PCR is shown graphically below.
PCR Procedure
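The two-step procedure can be sketched for the simplest case of a single component, using power iteration for the PCA step. This is an illustrative toy implementation, not the product's algorithm:

```python
import math

def pcr_one_component(x, y):
    """PCR sketch with one component: (1) PCA on centered X via power
    iteration, (2) MLR of centered y on the score vector, (3) map the
    score coefficient back to coefficients for the original X-variables."""
    n, p = len(x), len(x[0])
    # Step 0: center X and y
    means = [sum(row[j] for row in x) / n for j in range(p)]
    xc = [[row[j] - means[j] for j in range(p)] for row in x]
    my = sum(y) / n
    yc = [yi - my for yi in y]
    # Step 1: first principal component loading via power iteration on X'X
    v = [1.0] * p
    for _ in range(200):
        t = [sum(row[j] * v[j] for j in range(p)) for row in xc]        # t = X v
        w = [sum(t[i] * xc[i][j] for i in range(n)) for j in range(p)]  # w = X' t
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    scores = [sum(row[j] * v[j] for j in range(p)) for row in xc]
    # Step 2: regress centered y on the (single) score vector
    q = sum(t * yi for t, yi in zip(scores, yc)) / sum(t * t for t in scores)
    # Step 3: regression coefficients for the centered X-variables
    return [q * vj for vj in v]

# Perfectly collinear X-columns would break plain MLR, but PCR handles them.
b = pcr_one_component([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]],
                      [5.0, 10.0, 15.0, 20.0])
```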
The above situation is a result of the PCA decomposition not being guided by the Y-data (as
is the case of PLS). However, in most cases, PCR and PLS provide similar results, though PLS
usually converges in fewer factors than PCR. Most vendor spectroscopic devices only support
PLS regression in their software packages; this is the main reason why PCR is not as popular
as a spectroscopic regression tool. Read more about how sample and variable residuals, as
well as explained and residual variances, are computed in the chapter with theory about
PCA.
The next step is to regress Y on the first few scores using MLR, and then to calculate the
regression coefficients.
The SVD algorithm handles both ‘tall and thin’ data (many samples and relatively few
variables) and ‘short and fat’ data (a large number of variables and relatively few
samples). The algorithm does not handle missing values.
More information about the algorithms can be found in the method reference.
Some important tips and warnings associated with the Model Inputs tab
PCR is a multivariate regression analysis technique; in The Unscrambler® it requires a
minimum of three samples (rows) and two variables (columns) to be present in a data set
in order to complete the calculation. The following warnings are given when certain
analysis criteria are not met.
Not enough samples or variables present
Solution: Check that the data table (or selected row set) contains a minimum of 3 samples
and 2 variables.
Minimum 2 variables needed to perform analysis
Solution: Ensure that a minimum of 2 variables have been defined in a data set.
Number of X rows does not match number of Y rows
Solution: Ensure that the row set dimensions of X match the row set dimensions of Y.
Too many excluded samples/variables
Solution: Check that not all samples/variables have been excluded from the data set.
To keep track of row and column exclusions, the model inputs tab provides a warning to
users that exclusions have been defined. See automatic keep outs for more details.
Individual X- and Y-variables can be selected from the variable list table provided in this
dialog by holding down the control (Ctrl) key and selecting variables. Alternatively, the
variable numbers can be manually entered into the text dialog box. The Select button can be
used (which opens the Define Range dialog box), or one can simply click All to select every
variable in the table.
Once the variables have been selected, to weight them, use the options in the Change
Selected Variable(s) dialog box, under the Select tab. The options include:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows selected variables to be weighted by predefined constant values.
Downweight
This allows for the multiplication of selected variables by a very small number, such
that the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
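The A/(SDev + B) option above, for example, can be sketched as follows (an illustrative snippet; the function name is hypothetical):

```python
import statistics

def sdev_weight(column, a=1.0, b=0.0):
    """Standard-deviation weighting: the variable's weight is A / (SDev + B).
    With the defaults A = 1 and B = 0 this is plain 1/SDev weighting."""
    return a / (statistics.stdev(column) + b)

w = sdev_weight([1.0, 2.0, 3.0, 4.0])
# Multiplying the column by w gives it unit standard deviation.
```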
Use the Advanced tab in the X- and Y-Weights dialog to apply predetermined weights to
each variable. To use this option, set up a row in the data set containing the weights (or
create a separate row matrix in the project navigator). Select the Advanced tab in the
Weights dialog and select the matrix containing the weights from the drop-down list. Use
the Rows option to define the row containing the weights and click on Update to apply the
new weights. The dialog box for the Advanced option is provided below.
Another feature of the Advanced tab is the ability to use the results matrix of another
analysis as weights, using the Select Results Matrix button. This option provides
an internal project navigator for selecting the appropriate results matrix to use as a weight.
PCR Advanced Weights Option
Once the weighting and variables have been selected, click Update to apply them.
The differences between the algorithms are described in the Introduction to PCR. The
NIPALS algorithm is iterative and the maximum number of iterations can be tuned in the
Max. iterations box. The default value of 100 should be sufficient for most data sets,
however some large and noisy data may require more iterations to converge properly. The
maximum allowed number of iterations is 30,000.
When there are missing values in the data, the options are to impute them automatically
using the NIPALS algorithm, or as a pre-processing step using Fill Missing.
Note: If there are missing values in the data and SVD is selected, a warning will be
given as shown below.
Q-residual limits are by default approximated based on calculated model components only,
which works well in many cases. Calculation of exact Q-residual limits will be performed
when the check box is marked. Note that estimation of exact limits may be slow for large
data.
Pretreatments can also be registered from the PCR node in the project navigator. To register
the pretreatment, right click on the PCR analysis node and select Register Pretreatment.
This is shown below.
Registering a Pretreatment from the Project Navigator
The Autopretreatment dialog box will appear, where the desired pretreatments can be
selected.
Note: Some caution is required when data table dimensions are changed after the
first pretreatment. The Autopretreatment is applied to the same column indices as
the original transformation, so inserting new variables (columns) before or in
between the original data will result in autopretreatment of the wrong variables.
To be safe, always insert any new variables in the table before applying any
transformations, or make a habit of always appending rather than inserting new
columns.
Set this tab up based on a priori knowledge of the data set in order to return outlier
warnings in the PCR model. Settings for estimating the optimal number of components can
also be tuned here. The values shown in the dialog box above are default values and may
be used as a starting point for the analysis.
The warning limits in The Unscrambler® serve two major purposes: flagging possible
outliers and guiding the estimation of the optimal number of components.
The leverage and residual (outlier) limits are given as standard scores. This means that a
limit of e.g. 3.0 corresponds to a 99.7% probability that a value will lie within 3.0 standard
deviations of the mean of a normal distribution. The following limits can be specified:
Leverage Limit
(default 3.0) The ratio between the leverage for an individual sample and the
average leverage for the model.
Sample Outlier Limit, Calibration
(default 3.0) The square root of the ratio between the residual calibration variance
per sample (Sample Residuals) and the average residual calibration variance for the
model (Total Residuals).
Sample Outlier Limit, Validation
(default 3.0) The square root of the ratio between the residual validation variance
per sample (Sample Validation Residuals) and the total residual validation variance
for the model (Total Residuals).
Individual Value Outlier, Calibration
(default 3.0) For individual, absolute values in the calibration residual matrix
(Residuals), the ratio to the model average is computed (square root of the Variable
Residuals). For spectroscopic data this limit may be set to 5.0 to avoid many false
positive warnings due to the high number of variables.
Individual Value Outlier, Validation
(default 2.6) For individual, absolute values in the validation residual matrix
(Residuals), the ratio to the validation model average is computed (square root of
the Variable Validation Residuals). For spectroscopic data this limit may be set to 5.0
to avoid many false positive warnings due to the high number of variables.
Variable Outlier Limit, Calibration
(default 3.0) The square root of the ratio between the residual calibration variance
per variable (Variable Residuals) and the average residual calibration variance for
the model (Total Residuals).
Variable Outlier Limit, Validation
(default 3.0) The square root of the ratio between the residual validation variance
per variable (Variable Validation Residuals) and the total residual validation variance
for the model (Total Residuals).
Total Explained Variance (%)
(default 20) If the model explains less than 20% of the variance, the optimal number
of components is set to 0 (see the Info Box).
Ratio of Calibrated to Validated Residual Variance
(default 0.5) If the residual variance from the validation is much higher than that
from the calibration, a warning is given.
Ratio of Validated to Calibrated Residual Variance
(default 0.75) If the residual variance from the calibration is much higher than that
from the validation, a warning is given. This may occur in the case of test set
validation where the test samples do not span the same space as the training data.
Residual Variance Increase Limit (%)
(default 6) This limit is applied when selecting the optimal number of components and
is calculated from the residual variances of two consecutive components. If the
variance for the next component is less than x% lower than that of the previous
component, the default number of components is set to the previous one.
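One possible reading of the residual-variance rule above can be sketched as follows (illustrative only; the product's exact selection logic may differ):

```python
def optimal_components(residual_variances, limit_pct=6.0):
    """Pick the optimal number of components: keep adding components while
    each new one lowers the residual variance by at least limit_pct percent;
    stop at the first component that fails this test."""
    opt = 0
    for k in range(1, len(residual_variances)):
        prev, cur = residual_variances[k - 1], residual_variances[k]
        if cur < prev * (1.0 - limit_pct / 100.0):
            opt = k          # component k gives a worthwhile improvement
        else:
            break            # improvement too small: keep the previous count
    return opt

# Variance drops sharply for 2 components, then flattens out.
n = optimal_components([100.0, 50.0, 40.0, 39.5, 39.4])
```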
When all the settings are made, click OK.
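The sample outlier limits above compare each sample's residual variance to the model average as a standard score; a minimal sketch of that ratio (illustrative only):

```python
import math

def sample_outlier_ratios(sample_residual_variances):
    """Standard-score ratio between each sample's residual variance and the
    average residual variance for the model: sqrt(v_i / mean(v)). Samples
    whose ratio exceeds the outlier limit (default 3.0) are flagged."""
    avg = sum(sample_residual_variances) / len(sample_residual_variances)
    return [math.sqrt(v / avg) for v in sample_residual_variances]

# Nine well-fitted samples and one with a very large residual variance.
ratios = sample_outlier_ratios([1.0] * 9 + [100.0])
```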
Explained Y-Variance
Explained X-Variance
Predicted vs. Reference
Variances and RMSEP
Sample Outliers
Scores
Influence
Residual Sample X-Variance
Residual Sample Y-Variance
Scores and Loadings
Scores
Loadings
Important Variables
Regression coefficients
X-loadings
Regression coefficients
Regression and Prediction
Predicted vs. Reference
Regression coefficients
Residuals and influence
Influence Plot
Influence plot with Hotelling’s T² statistic
Influence plot with Leverage
Influence plot with F-residuals
Influence plot with Q-residuals
Explained sample variance or sample residuals
Leverage / Hotelling’s T²
Hotelling’s T² statistics
Leverage
Residuals
Q-residuals
F-residuals
Leverage / Hotelling’s T²
Residuals
Response Surface
Plots accessible from the PCR plot menu
PCR Overview
Variances and RMSEP
X- or Y- Variance
X- and Y- Variance
RMSE
Sample Outliers
Scores and Loadings
2 plots
4 plots
Bi-plot
Scores
Line
2-D Scatter
3-D Scatter
2 x 2-D Scatter
4 x 2-D Scatter
Loadings
Line
Loadings for the X-variables
Loadings for the Y-variable
2-D Scatter
3-D Scatter
Loadings for the X-variables
Loadings for the Y-variable
2 x 2-D Scatter
4 x 2-D Scatter
Important Variables
Regression Coefficients
Weighted coefficients (Bw)
Raw coefficients (B)
Residuals
Residuals and influence
General
Y-residuals vs. Predicted Y
Normal Probability Y-residuals
Y-residuals vs. Score
Influence Plot
Variance per sample
Variable residuals
Sample residuals
Sample and variable residuals
Outliers
Influence Plot
Y-residuals vs. Predicted Y
Patterns
Normal Probability Y-residuals
Y-residuals vs. Score
Leverage/Hotelling’s T²
Leverage
Line
Matrix
Hotelling’s T²
Line
Matrix
Response Surface
Scores
This is a two-dimensional scatter plot (or map) of scores for two specified components (PCs)
from PCR. The plot gives information about patterns in the samples. The scores plot for
(PC1,PC2) is especially useful, since these two components summarize more variation in the
data than any other pair of components.
The closer the samples are in the scores plot, the more similar they are with respect to the
two components concerned. Conversely, samples far away from each other are different
from each other. The plot can be used to interpret differences and similarities among
samples. Look at the scores plot together with the corresponding loadings plot, for the same
two components. This can help determine which variables are responsible for differences
between samples. For example, samples to the right of the scores plot will usually have a
large value for variables to the right of the loadings plot, and a small value for variables to
the left of the loadings plot.
Here are some things to look for in the 2-D scores plot.
Finding groups in a scores plot
Is there any indication of clustering in the set of samples? The figure below shows a
situation with four distinct clusters. Samples within a cluster are similar.
Detecting grouping in a scores plot
have been errors in data collection or transcription, or those samples may have to
be removed if they do not belong to the population of interest.
An outlier sticks out of the major group of samples
Furthermore, the display of the Hotelling’s T² ellipse for a model in two dimensions is
also a good way to detect outliers. To display it, click on the Hotelling’s T² ellipse
button.
Scores plot with Hotelling’s T² limit
In addition, the display of the stability plot can help in detecting outliers. This plot
represents the projection of the samples onto the submodels used for the validation,
whether they are part of the model or left out. Hence this plot is only available when
some type of cross-validation has been selected. It is available from the toolbar icon.
An outlier disturbs the model
In the above image, sample 143_1 is projected very differently for one particular
projection, which can be seen to deviate from all the others. The study of the samples
left out for this particular projection indicates that sample 143_1 is the source of this
variation. This sample is an outlier.
Calibration and Validation Scores
When the methods of cross validation and test set validation are used, The
Unscrambler® will by default display Calibration and Validation (Test) scores in the
same plot. Use this plot to determine whether the test set covers the entire span of
the calibration set, or whether any cross validation segments/samples differ from
the rest of the set.
X- and Y-Loadings
A 2-D scatter plot of X- and Y-loadings for two specified components from PCR is a good way
to detect important variables. The plot is most useful for interpreting component 1 vs.
component 2, since they represent the largest variations in the X-data. By default both Y-
and X-variables are displayed but it is possible to modify this by clicking on the X and Y icons.
X- and Y-Loadings of sensory variables (X) and the mean preference (Y) along (PC1,PC2)
The plot shows the importance of the different variables for the two components specified.
It is possible to change the displayed components by using the arrows or the PC drop-down list.
The loadings plot should preferably be used together with the corresponding scores plot.
Variables with loadings to the right in the loadings plot will be X-variables which usually have
high values for samples to the right in the scores plot, etc. This plot can be used to study the
relationships among the X-variables and between the X-variables and the Y-variable.
If the Uncertainty test was activated, the important variables will be circled. It is also possible to mark them by using the toolbar icon.
Loadings plot with circled important variables
When working with discrete variables, line loadings plots can also be used to represent data.
The Ascending and Descending buttons can be used to order the loadings in
terms of the variables with highest (or lowest) contribution to the PC.
Line plot of loadings in ascending order of importance to PC1
More on line loadings plots can be found in a later section of this document.
Correlation Loadings Emphasize Variable Correlations
When a PCR analysis has been performed and a two-dimensional plot of loadings is
displayed on the screen, the correlation loadings option (available from the View menu and
the toolbar icon) can be used to aid in visualizing the structure in the data. Correlation loadings
are computed for each variable for the displayed Principal Components (factors). In addition,
the plot contains two ellipses to help check how much variance is taken into account. The
outer ellipse is the unit-circle and indicates 100% explained variance. The inner ellipse
indicates 50% of explained variance. The importance of individual variables is visualized
more clearly in the correlation loadings plot compared to the standard loadings plot.
Correlation Loadings of sensory variables (X) and the mean preference (Y) along (PC1,PC2)
Variables close to each other in the loadings plot will have a high positive correlation if the
two components explain a large portion of the variance of X. The same is true for variables in
the same quadrant lying close to a straight line through the origin. Variables in diagonally
opposed quadrants will have a tendency to be negatively correlated. For example, in the
figure above, variables Acidity and Bitterness have a high positive correlation on PC1, and
they are negatively correlated to variable Odor banana. Variables Color intensity and Odor
orange have independent variations. Variables Mean preference and Bitterness are
negatively correlated.
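A hedged sketch of how correlation loadings could be computed (synthetic data; the variable names are illustrative, not from the software): each entry is the correlation between one original variable and the score vector of one component, so every variable falls inside the unit circle, and its squared distance from the origin over the two plotted PCs equals the variance of that variable explained by them.

```python
# Correlation loadings sketch (illustrative, synthetic data)
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s

# Correlation of each variable with each component's score vector
corr_load = np.array(
    [[np.corrcoef(Xc[:, j], scores[:, a])[0, 1] for a in range(scores.shape[1])]
     for j in range(Xc.shape[1])]
)

# Squared radius over (PC1, PC2): the explained variance per variable
# (inner ellipse of the plot corresponds to 0.5, outer circle to 1.0)
r2_pc12 = (corr_load[:, :2] ** 2).sum(axis=1)
```

Because the score vectors are orthogonal and span the variable space, the squared correlations of a variable across all components sum to one, which is why the unit circle bounds the plot.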
Note: Variables lying close to the center are poorly explained by the plotted PCs.
They cannot be interpreted in that plot!
Correlation loadings are also available for 1D line loading plots. When a line plot is
generated, the 1D correlation loadings toolbar icon is displayed.
These are especially useful when interpreting important wavelengths in the analysis of
spectroscopic data, or contributing variables in time series data. An example is shown below.
Correlation Line Loadings of Spectroscopic variables in PC1
Values that lie between the inner and outer bounds of the plot are well modelled by that PC; those that lie within the inner bounds are not.
Explained Variance
There are two explained variance curves to look at in PCR: explained X- and Y-variance. It is
possible to change from one to the other by using the toolbar icon.
Explained Y-Variance
This plot illustrates how much of the variation in the response is described by each different
component. Total residual variance is computed as the sum of squared residuals of the Y-variable,
divided by the number of degrees of freedom.
Total explained variance is then computed as:
Total explained variance (%) = 100 × (Total variance − Residual variance) / Total variance
Compare the two variances: if they differ significantly, there is good reason to question
whether either the calibration data or the test data are truly representative. The figure
below shows a situation where the residual validation variance is much larger than the
residual calibration variance (or the explained validation variance is much smaller than the
explained calibration variance). This means that although the calibration data are well fitted
(small residual calibration variances), the model does not describe new data well (large
residual validation variance).
By contrast, if the two residual variance curves are close together, the model is
representative (figure below).
Total residual variance curves and Total explained variance curves
Outliers can sometimes cause large residual variance (or small explained variance). They can
also cause a drop in the explained validation variance, as can be seen in the plot
below.
Explained X-Variance
This plot gives an indication of how much of the variation in the predictor (X) variables is
described by the different components. The total X-variance is computed in the same way as
the Y-variance; see the above description for more information.
In PCR, since the PCs are computed taking only the X-variance into account, it may be
necessary to include more PCs to explain most of the variance in Y.
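This behaviour can be sketched with synthetic data (illustrative only, not the software's internal code): the PCs are extracted from X alone, and the explained Y-variance is then computed for each number of components by regressing the centered response on the scores.

```python
# Explained Y-variance vs. number of PCR components (synthetic sketch)
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 6))
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=40)

Xc, yc = X - X.mean(axis=0), y - y.mean()
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * s                                   # PCR scores, computed from X only

expl_y = []
for a in range(1, T.shape[1] + 1):
    q = T[:, :a].T @ yc / (s[:a] ** 2)      # LS fit on the orthogonal scores
    resid = yc - T[:, :a] @ q
    expl_y.append(100.0 * (1.0 - resid @ resid / (yc @ yc)))
```

Because the components are chosen to describe X-variance, the curve may climb slowly for the first PCs, but it can never decrease as further components are added.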
Some statistics are available that give an idea of the quality of the regression.
Note: If there are large differences between the calibration and validation results,
the model cannot be trusted.
To determine the quality of the fit, the following statistics are available,
Slope
The closer the slope is to 1, the better the data are modelled.
Offset
This is the intercept of the regression line with the Y-axis, i.e. its value where the
X-axis is zero. (Note: It is not a necessity that this value is zero!)
RMSE
The first one (in blue) is the Calibration error RMSEC, the second one (in red) is the
expected Prediction error, depending on the validation method used. Both are
expressed in the same unit as the response variable Y.
R-squared
The first one (in blue) is the calibration R-Squared value taken from the calibration
Explained Variance plot for the number of components in the model, the second
one (in red) is also calculated from the Explained Variance plot, this time for the
validation set. It tells how good a fit can be expected for future predictions for a
defined number of components.
Note: RMSE and R-Squared values are highly dependent on the validation method
used and the number of components in a model. It is important not to use too
many components and overfit the model.
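The statistics above can be sketched as follows (the function name and data values are illustrative, not taken from the software):

```python
# Fit statistics for a predicted vs. reference plot (illustrative sketch)
import numpy as np

def fit_statistics(y_ref, y_pred):
    resid = y_ref - y_pred
    rmse = np.sqrt(np.mean(resid ** 2))                 # RMSEC / RMSEP
    ss_tot = (y_ref - y_ref.mean()) @ (y_ref - y_ref.mean())
    r2 = 1.0 - resid @ resid / ss_tot                   # R-squared
    slope, offset = np.polyfit(y_ref, y_pred, 1)        # regression line
    return rmse, r2, slope, offset

y_ref = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])
rmse, r2, slope, offset = fit_statistics(y_ref, y_pred)
```

For a good model the slope approaches 1, the offset approaches 0, R-squared approaches 1, and the RMSE is small relative to the range of Y.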
When the statistics buttons are toggled, more detailed statistics are displayed. The Calibration plot is
shown below with statistics:
Predicted vs. Reference plot for PCR Calibration samples
SECV
Standard Error of Cross Validation. This is the RMSECV corrected for bias.
RMSEP
Root Mean Square Error of Prediction. This is a measure of the dispersion of the
validation samples around the regression line when Test Set validation is used.
SEP
Standard Error of Prediction. This is the RMSEP corrected for bias
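The relation between RMSEP, bias and SEP can be written out numerically. This is a sketch using the definitions as commonly stated in chemometrics (divisor conventions can vary between texts):

```python
# RMSEP, bias and SEP for a set of prediction errors (illustrative values)
import numpy as np

e = np.array([0.3, 0.5, 0.2, 0.4, 0.6])     # errors: predicted minus reference
n = e.size
rmsep = np.sqrt(np.mean(e ** 2))            # overall prediction error
bias = e.mean()                             # systematic part of the error
sep = np.sqrt(np.sum((e - bias) ** 2) / (n - 1))   # bias-corrected spread
# Identity: RMSEP^2 = bias^2 + SEP^2 * (n - 1) / n
```

The identity in the last comment is what "corrected for bias" means: SEP measures the spread of the errors around their mean, while RMSEP also includes the systematic offset.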
When Leverage Correction is used to first check the model, the errors become estimation
errors. For more details on the definitions, see the section on Multiple Linear Regression
(Interpreting MLR plots).
How to detect cases of good fit / poor fit
The figures below show two different situations: one indicating a good fit, the other
a poor fit of the model.
Predicted vs. Reference shows how well the model fits
In the above plot, sample 3 is not following the regression line whereas all the other
samples do. Sample 3 may be an outlier.
How to detect nonlinearity
In other cases, there may be a nonlinear relationship between the X- and Y-
variables, so that the predictions do not have the same level of accuracy over the
whole range of variation of Y. In such cases, the plot may look like the one shown
below. Such nonlinearities should be corrected if possible (for instance by a suitable
transformation), because otherwise there will be a systematic bias in the predictions
depending on the range of the sample.
Predicted vs. Reference shows a nonlinear relationship
If some variables have much larger residual variance than all the other variables for all
components in the model (or for the first 3-4 of them), try rebuilding the model with these
variables deleted. This may produce a model that is easier to interpret.
Sample Outliers
Scores
See the description in the Interpreting PCR plots section
Influence
This plot shows the Q- or F-residuals vs. Leverage or Hotelling’s T² statistics. These represent
two different kinds of outliers. The residual statistics on the ordinate axis describe the
sample distance to model, whereas the Leverage and Hotelling’s T² describe how well the
sample is described by the model.
Samples with high residual variance, i.e. lying in the upper regions of the plot, are poorly
described by the model. Including additional components may result in these samples being
described better, however caution is required that the additional components are predictive
and not modelling noise. As long as the samples with high residual variance are not
influential (see below), keeping them in the model may not be a problem as such (the high
residual variance may be due to non-important regions of a spectrum, for instance).
Samples with high leverage, i.e. lying to the right of the plot, are well described by the
model. They are well described in the sense that the sample scores may have very high or
low values for some components compared to the rest of the samples. Such samples are
dangerous in the calibration phase because they are influential to the model. A sufficiently
extreme sample may by itself span an entire component, in which case the model will
become unreliable. Removal of a highly influential sample from the model will make the
model look entirely different and the axes will span different phenomena altogether. If the
variance described by the sample is important but unique, one should try to obtain more
samples of the same type to stabilize the model. Otherwise the sample should be discarded
as an outlier.
Note that a sample with both high residual variance and high leverage is the most dangerous
outlier. Not only is it poorly described by the model but it is also influential. Samples such as
these may span up to several components single-handedly. Because they also disagree with
the majority of the other calibration samples, the ability of the model to describe new
samples is likely poor.
The Q- and F-residuals are two different methods for testing the same thing. The F-residuals
are available for both calibration and validation, in contrast to the Q-residuals, which are
available for calibration only. The validated residuals reflect the scheme chosen in the
validation and are a more conservative assessment of residual outliers. If the residual variance
from validation is much higher than for calibration, one should investigate the residuals in
more detail.
The difference between Leverage and Hotelling’s T² is only a scaling factor. The critical limit
for Leverage is based on an ad-hoc rule, whereas the Hotelling’s T² critical limit is based on
the assumption of a Student’s t-distribution.
Calibration and validation samples can be displayed in the influence plot by toggling
between them using the toolbar buttons. The toggle is available for F-residuals if the
validation method chosen was cross validation or test set validation.
High residuals indicate an outlier. Incorporating more components can sometimes model
outliers; avoid doing so since it will reduce the prediction ability of the model.
Small residual variance (or large explained variance) indicates that, for a particular number
of components, the samples are well explained by the model. Therefore a sample with a
high Y-residual may be an outlier.
Scores and Loadings
This overview shows two plots: the score and loadings plots.
Scores
See the description in the Interpreting PCR plots section
Loadings
See the description in the Interpreting PCR plots section
Important Variables
Regression coefficients
If the X-variables were weighted, this plot presents the weighted regression coefficients;
otherwise the B-coefficients and the weighted (Bw) coefficients coincide. The number of PCs
is shown and can be changed using the arrows.
In general, this plot shows the weighted regression coefficients for the response or Y-
variable.
Regression coefficients summarize the relationship between all predictors and the response.
For PCR, the regression coefficients can be computed for any number of components or
factors. The regression coefficients for 3 factors, for example, summarize the relationship
between the predictors and the response, as a model with 3 components approximates it.
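A hedged numerical sketch of those regression coefficients (synthetic data, not The Unscrambler's code): the response is regressed on the scores of the first A components, and the result is mapped back to coefficients on the original X-variables.

```python
# PCR regression coefficients for an A-component model (synthetic sketch)
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + 0.05 * rng.normal(size=25)

xm, ym = X.mean(axis=0), y.mean()
Xc, yc = X - xm, y - ym
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

A = 3
T, P = (U * s)[:, :A], Vt.T[:, :A]      # scores and loadings, A components
q = T.T @ yc / (s[:A] ** 2)             # regress y on the orthogonal scores
b = P @ q                               # coefficients on the original X-variables
b0 = ym - xm @ b                        # intercept B0
y_hat = X @ b + b0                      # predictions of the A-component model
```

Changing A changes b: each value of A gives the coefficient vector of the model truncated at that many components, which is why the plot lets the number of PCs be adjusted.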
The weighted regression coefficients (Bw) inform about the importance of the X-variables.
X-variables with a large regression coefficient play an important role in the regression
model; a positive coefficient shows a positive link with the response, and a negative
coefficient shows a negative link. Predictors with a small coefficient are negligible. Mark
them and recalculate the model without those variables. The constant value B0W is
indicated at the bottom of the plot, in the Plot ID field (use View - Plot ID).
Weighted regression coefficients for 3 factors (or PCs)
The plot shows that variables 0, 3 and 4 are contributing the most to the model.
Note: The weighted coefficients (Bw) and raw coefficients (B) are identical if no
weights were applied on the variables.
If the predictor variables have been weighted with 1/SDev (standardization), the weighted
regression coefficients (Bw) take these weights into account. Since all predictors are brought
back to the same scale, the coefficients show the relative importance of the X-variables in
the model.
X-loadings
This is a plot of X-loadings for all the components vs. variable number. It is useful for
detecting important variables. If a variable has a large positive or negative loading, this
means that the variable is important for the component concerned. For example, a sample
with a large score value for this component will have a large positive value for a variable
with large positive loading.
If a variable has the same sign for all the important components, it is most likely to be an
important variable.
Regression coefficients
For more information see the previous section.
Regression and Prediction
Regression coefficients
See the description in the above section
Residuals and influence
Influence Plot
This plot shows the Q- or F-residuals vs. Leverage or Hotelling’s T². See the general
description about the influence plot in the overview section for more details.
The toggle buttons in the toolbar can be used to switch between the various combinations.
Leverage / Hotelling’s T²
The lower left pane of the Residuals and Influence overview displays a line plot of the
Hotelling’s T² by default. A toolbar toggle can be used to switch between the
Hotelling’s T² and Leverage views.
Hotelling’s T² statistics
The plot displays the Hotelling’s T² statistic for each sample as a line plot. The associated
critical limit (with a default p-value of 5%) is displayed as a red line.
Hotelling’s T² plot
The Hotelling’s T² statistic has a linear relationship to the leverage for a given sample. Its
critical limit is based on an F-test. Use it to identify outliers or detect situations where a
process is operating outside normal conditions. There are 6 different significance levels to
choose from using the drop-down list.
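The statistic and its limit can be sketched as follows, using the standard formulas with scipy's F-quantile (the score values are synthetic, and names are illustrative):

```python
# Hotelling's T² per sample and its F-based critical limit (synthetic scores)
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(4)
T = rng.normal(size=(30, 2))              # scores for A = 2 components
N, A = T.shape                            # (real PCA scores are centered;
s2 = (T ** 2).sum(axis=0) / (N - 1)       #  these synthetic ones roughly are)
t2 = ((T ** 2) / s2).sum(axis=1)          # T² statistic for each sample

alpha = 0.05                              # significance level (default 5%)
t2_crit = A * (N - 1) / (N - A) * f.ppf(1 - alpha, A, N - A)
outliers = np.where(t2 > t2_crit)[0]      # samples above the red limit line
```

Lowering alpha raises the critical line, flagging fewer samples; this corresponds to choosing among the significance levels in the drop-down list.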
The number of factors (or PCs) may be tuned up or down with the arrow tools.
Leverage
Leverages are useful for detecting samples which are far from the center within the space
described by the model. Samples with high leverage differ from the average samples; in
other words, they are likely outliers. A large leverage also indicates a high influence on the
model. The figure below shows a situation where sample 5 is obviously very different from
the rest and may disturb the model.
One sample has a high leverage
There is an ad-hoc critical limit for Leverage (one not depending on any assumptions
about distribution), computed from the number of components and the number of
calibration samples.
The leverage values are always larger than zero, and can go up to 1 for samples in the
calibration set. As a rule of thumb, samples with a leverage above 0.4 - 0.5 start to be a concern.
Influence on the model is best measured in terms of relative leverage. For instance, if all
samples have leverages between 0.02 and 0.1, except for one, which has a leverage of 0.3,
although this value is not extremely large, the sample is likely to be influential.
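Leverage computation can be sketched from the score matrix (synthetic data; the formula below, with the 1/N term for the centered model, is the usual one, under which the leverages sum to A + 1):

```python
# Sample leverages from PCR scores (synthetic sketch)
# h_i = 1/N + sum over components a of t_ia^2 / (t_a . t_a)
import numpy as np

rng = np.random.default_rng(5)
T = rng.normal(size=(20, 3))            # scores for A = 3 components
T = T - T.mean(axis=0)                  # scores are centered in PCA/PCR
N, A = T.shape
h = 1.0 / N + ((T ** 2) / (T ** 2).sum(axis=0)).sum(axis=1)
```

As the surrounding text notes, the relative comparison is what matters: one sample whose h stands well above the rest is influential even if its absolute value is moderate.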
What Should Be Done with a High-Leverage Sample? The first thing to do is to understand
why the sample has a high leverage. Investigate by looking at the raw data and checking
them against the original recordings. Once an explanation has been found, there are two
following cases:
Case 1
There is an error in the data. Correct it; or, if the true value cannot be found and the
experiment cannot be redone to give a more valid value, the erroneous value may
be replaced with “missing”.
Case 2
There is no error, but the sample is different from the others. For instance, it has
extreme values for several of the variables. Check whether this sample is “of
interest” (e.g. it has the properties of interest, to a higher degree than the other
samples), or “not relevant” (e.g. it belongs to another population than that being
studied). In the former case, one should try to generate more samples of the same
kind: they are the most interesting ones! In the latter case (and only then), the high-
leverage sample may be removed from the model.
Residuals
The lower right pane of the Residuals and Influence overview displays a line plot of the
sample residual statistics. A toolbar toggle can be used to switch between the Q- and
F-residuals views.
Q-residuals
This plot shows the sample Q-residuals as a line plot with associated limits.
Q-residual sample variance
F-residuals
This plot shows the sample F-residuals as a line plot with associated limits.
Note that the F-residuals are available for both calibration and validation. If the residual X-
variance from validation is much higher than for calibration, one should investigate the
residuals in more detail. The validated residuals reflect the scheme chosen in the validation
and are a more conservative assessment of residual outliers.
Leverage / Hotelling’s T²
See the description in the overview section.
Residuals
See the description in the overview section.
Response Surface
This plot is used to find the settings of the X-variables which give an optimal response value
for the variable Y, and to study the general shape of the response surface fitted by the
Regression model.
It is necessary to specify which X-variables should be plotted, as well as the number of
components; use the dialogue box that appears for this purpose.
Response Surface dialogue
This plot can appear in various layouts. The most relevant are:
Contour plot.
Landscape plot.
Interpretation: Contour Plot
This plot gives a map for locating the region where the experimental goal is met. The plot has two
axes: two predictor variables are studied over their range of variation; the remaining
ones are kept constant. The constant levels are indicated in the Plot ID at the
bottom. The response values are displayed as contour lines, i.e. lines that show
where the response variable has the same predicted value. Clicking on a line, or on
any spot within the map, will display the predicted response value for that point,
and the coordinates of the point (i.e. the settings of the two predictor variables
giving that particular response value).
Interpretation: Landscape Plot
Look at this plot to study the 3-D shape of the response surface. Here it is obvious
whether there is a maximum, a minimum or a saddle point. This plot, however, does
not show precisely how the optimum can be achieved.
Response surface plot, with Landscape layout
X- or Y- Variance
One-frame plot where it is possible to display either the Explained X- or Y-Variance, with
Calibration and/or Validation curves. See the description in the Interpreting PCR plots section.
X- and Y- Variance
A two-frame plot with the Explained X-Variance plot on top and the Explained Y-Variance
below, both with Calibration and Validation variances. See the description in the
Interpreting PCR plots section.
RMSE
Root Mean Square Error for the Y-variables. This plot gives the square root of the residual
variance for individual responses, back-transformed into the same units as the original
response values. This is called: RMSEC (Root Mean Square Error of Calibration) when plotting
Calibration results; RMSEP (Root Mean Square Error of Prediction) when plotting Validation
results.
RMSE Line Plot
The RMSE is plotted as a function of the number of components in the model. There is one
curve per response (or two if Cal and Val together are selected). The optimal number of
components can be determined by looking at where the Val curve (i.e. RMSEP) reaches a
minimum.
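Picking the optimum from the curve can be as simple as locating the minimum of the validation RMSE (the values below are illustrative, not produced by the software):

```python
# Optimal number of components from an RMSEP curve (illustrative values)
import numpy as np

rmsep = np.array([1.20, 0.80, 0.55, 0.50, 0.52, 0.60])  # for 1..6 components
optimal = int(np.argmin(rmsep)) + 1                     # components are 1-based
# In practice a more parsimonious model (fewer components) is often preferred
# when the curve is nearly flat around the minimum.
```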
Sample Outliers
See the description in the Interpreting PCR plots section
Scores and Loadings
2 plots
See the description in the Interpreting PCR plots section
4 plots
When displaying 4 plots, the screen shows two paired plots of scores and loadings, one
displaying PC1-PC2 and the other PC3-PC4.
Bi-plot
The plot can be used to interpret sample properties. Look for variables projected far away
from the center. Samples lying in an extreme position in the same direction as a given
variable have large values for that variable; samples lying in the opposite direction have low
values. For instance, in the figure below, samples 6, 7 and 8 are the most colour intense,
while samples 2,3,4 and 12 are most likely to have the highest banana odor (and probably
lowest acidity). C3_H3 has high Raspberry taste, and is rather colorful. C1_H1, C2_H1 and
C3_H1 are thick, and have little color. The samples cannot be compared with respect to the
variables close to the center of the bi-plot.
Bi-plot for 12 jam samples and 12 sensory properties (X-variables)
Scores
Line
This is a plot of score values vs. sample number for a specified component. Although it is
usually better to look at 2-D or 3-D scores plots because they contain more information, this
plot can be useful whenever the samples are sorted according to the values of an underlying
variable, e.g. time, to detect trends or patterns (see figure below). Also look for systematic
patterns, like a regular increase or decrease, periodicity, etc. (only relevant if the sample
number has a meaning, like time for instance).
Trend in a Scores plot
The smaller the vertical variation (i.e. the closer the score values are to each other), the
more similar the samples are for this particular component. Look for samples that have a
very large positive or negative score value compared to the others: these may be outliers.
2-D Scatter
See the description in the Interpreting PCR plots section
3-D Scatter
This is a 3-D scatter plot or map of the scores for three specified components from PCR. The
plot gives information about patterns in the samples and is most useful when interpreting
components 1, 2 and 3, since these components summarize most of the variation in the
data. It is usually easier to look at 2-D scores plots but if three components are needed to
describe enough variation in the data, the 3-D plot is a practical alternative. The same
analysis as with a 2-D scatter plot should be done. See the description in the Interpreting
PCR plots section
2 x 2-D Scatter
The visualization window is divided into two frames. The top one shows the scatter plot of
the scores of the samples along PC1 and PC2. The bottom plot shows the scatter plot of the
scores along PC3 and PC4.
4 x 2-D Scatter
The visualization window is divided into four frames. The top left one shows the scatter plot
of the scores of the samples along PC1 and PC2. To its right is displayed the scores plot in the
PC3-PC4 plane. The bottom left plot shows the scatter plot of the scores along PC5 and PC6.
To its right is displayed the scatter plot of the scores of the samples for PC7 and PC8.
Loadings
Line
Loadings for the X-variables
This is a plot of X-loadings for a specified component vs. variable number. It is useful for
detecting important variables. In many cases it is better to look at two- or three-vector
loadings plots instead because they contain more information. Line plots are most useful for
multichannel measurements, for instance spectra from a spectrophotometer, or in any case
where the variables are implicit functions of an underlying parameter, like wavelength, time,
etc. The plot shows the relationship between the specified component and the different X-
variables. If a variable has a large positive or negative loading, this means that the variable is
important for the component concerned; see the figure below. For example, a sample with a
large score value for this component will have a large positive value for a variable with large
positive loading.
Spectral data can default to using line plots for the loadings plot. To set this, right-click on the
given range in the project navigator, and tick the Spectra option.
Line plot of the X-loadings, important variables in a spectra
Variables with large loadings in early components are the ones that vary most. This means
that these variables are responsible for the greatest differences between the samples.
Loadings for the Y-variable
This is a plot of Y-loading for a specified component vs. variable number. It is usually better
to look at 2-D or 3-D loadings plots instead because they contain more information.
However, if there is reason to study the X-loadings as line plots, then one should also display
the Y-loadings as line plots in order to make interpretation easier. The plot shows the
relationship between the specified component and the Y-variable. If a variable has a high
positive or negative loading, this means that the variable is well explained by the
component. A sample with a large score for the specified component will have a high value
for all variables with large positive loadings.
A Y-variable with large loadings in early components is easily modeled as a function of the X-
variables.
2-D Scatter
See the description in the Interpreting PCR plots section
3-D Scatter
This plot can present either the X-loadings, the Y-loadings or both. To select or deselect one
of them, click on the corresponding toolbar icon.
Loadings for the X-variables
This is a three-dimensional scatter plot of X-loadings for three specified components from
PCR. The plot is most useful for interpreting directions, in connection to a 3-D scores plot.
Otherwise it is recommended that one use line or 2-D loadings plots.
Loadings for the Y-variable
This is a three-dimensional scatter plot of Y-loadings for three specified components from
PCR. As there is only one Y-variable in PCR, this plot is most useful for interpreting directions,
in connection to a 3-D scores plot and together with the X-loadings. Otherwise it is
recommended that one use line or 2-D loadings plots.
Read more about loadings and the different displays and information in the Interpreting PCR
plots section.
2 x 2-D Scatter
The visualization window is divided into two frames. The top one shows the scatter plot of
the loadings of the variables along PC1 and PC2. The bottom plot shows the scatter plot of
loadings of the variables along PC3 and PC4.
4 x 2-D Scatter
The visualization window is divided into four frames. The top left one shows the scatter plot
of the loadings of the variables along PC1 and PC2. To its right is displayed the loadings plot
in the PC3-PC4 plane. The bottom left plot shows the scatter plot of the loadings of the
variables along PC5 and PC6. To its right is displayed the scatter plot of the loadings of the
variables for PC7 and PC8.
Important Variables
See the description in the Interpreting PCR plots section
Regression Coefficients
The above plot shows the regression coefficients for the response variable (Y), and for a
model with a particular number of components (3). Each predictor variable (X) defines one
point of the line (or one bar of the plot). It is recommended to configure the layout of this
plot as bars. Variables 1 and 4 have the highest B coefficients.
Note: The weighted coefficients (Bw) and raw coefficients (B) are identical if no
weights were applied on the variables.
The raw coefficients are those that may be used to write the model equation in original
units:
y = b0 + b1x1 + b2x2 + … + bKxK
Since the predictors are kept in their original scales, the coefficients do not reflect the
relative importance of the X-variables in the model. If no weights have been applied to the
X-variables, displaying the Uncertainty Limits may be informative. This option is available if Cross-
Validation and the Uncertainty Test option were selected in the Regression dialog.
Use View – Uncertainty Limit from the menu to toggle this indication on or off.
Residuals
General
Y-residuals vs. Predicted Y
This is a plot of Y-residuals against predicted Y values. If the model adequately predicts
variations in Y, any residual variations should be due to noise only, which means that the
residuals should be randomly distributed. If this is not the case, the model is not completely
satisfactory, and appropriate action should be taken. If strong systematic structures (e.g.
curved patterns) are observed, this can be an indication of lack of fit of the regression
model. The figure below shows a situation that strongly indicates lack of fit of the model.
This may be corrected by transforming the Y variable.
Structure in the residuals: a transformation of the y variable is recommended
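A simple numeric complement to the visual check is to fit a low-order polynomial to the residuals against the predicted values: a sizeable quadratic term suggests curvature. This is an ad-hoc sketch with synthetic data and an arbitrary threshold, not a method from the software:

```python
# Detecting curvature in Y-residuals vs. predicted Y (ad-hoc sketch)
import numpy as np

y_pred = np.linspace(0.0, 10.0, 50)
resid = 0.05 * (y_pred - 5.0) ** 2 - 0.4     # synthetic curved residuals
c2, c1, c0 = np.polyfit(y_pred, resid, 2)    # quadratic fit to the residuals
span = y_pred.max() - y_pred.min()
curved = abs(c2) * span ** 2 > 4.0 * resid.std()   # arbitrary threshold
```

Randomly scattered residuals would give a quadratic coefficient near zero; a clearly curved pattern, as in the figure, gives a coefficient large relative to the residual spread.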
The presence of an outlier is shown in the example below. The outlying sample (18) has a
much larger residual than the others; however, it does not seem to disturb the model to a
large extent.
A simple outlier has a large residual
The figure below shows the case of an influential outlier: not only does it have a large
residual, it also attracts the whole model so that the remaining residuals show a very clear
trend. Such samples should usually be excluded from the analysis, unless there is an error in
the data or some data transformation can correct for the phenomenon.
An influential outlier changes the structure of the residuals
The Unscrambler X Main
Small residuals (compared to the variance of Y) which are randomly distributed indicate
adequate models.
Normal Probability Y-residuals
This plot displays the cumulative distribution of the Y-residuals with a special scale, so that
normally distributed values should appear along a straight line. The plot shows all residuals
for one particular Y-variable (look for its name in the plot ID). There is one point per sample.
If the model explains the complete structure present in the data, the residuals should be
randomly distributed - and usually, normally distributed as well. So if all the residuals are
along a straight line, it means that the model explains everything that can be explained in
the variations of the variables to be predicted. If most of the residuals are normally
distributed, and one or two stick out, these particular samples are outliers. This is shown in
the figure below. If there are outliers, mark them and check the data.
Two outliers are sticking out
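The coordinates behind such a normal-probability plot can be sketched in Python with the standard library. The residual values below are purely illustrative, and the (i + 0.5)/n plotting-position rule is one common convention, not necessarily the one used by the software:

```python
from statistics import NormalDist

# Hypothetical Y-residuals for 8 calibration samples; the last one sticks out.
residuals = [-0.42, -0.18, -0.05, 0.02, 0.11, 0.19, 0.33, 1.90]

n = len(residuals)
ordered = sorted(residuals)

# Pair the i-th sorted residual with the standard-normal quantile of
# (i + 0.5) / n (a common plotting-position convention).
theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# If the residuals are normally distributed, the points (theoretical[i],
# ordered[i]) fall close to a straight line; isolated points far off the
# line at either end are candidate outliers.
pairs = list(zip(theoretical, ordered))
```

Plotting `pairs` reproduces the special scale described above: the theoretical quantiles form the straightened axis.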
If the plot shows a strong deviation from a straight line, the residuals are not normally
distributed, as in the figure below. In some cases - but not always - this can indicate lack of
fit of the model. However it can also be an indication that the error terms are simply not
normally distributed.
The residuals have a regular but non-normal distribution
Influence Plot
See the description in the Interpreting PCR plots section
Samples with small residual variance (or large explained variance) for a particular
component are well explained by the corresponding model, and vice versa. In the above plot,
four samples seem to be poorly explained by the model and may be outliers, such as sample B3.
Variable residuals
This is a plot of residuals for a specified X-variable and component number for all the
samples. The plot is useful for detecting outlying sample/variable combinations, as shown
below. Although outliers can sometimes be modeled by incorporating more components, this
should be avoided since it will reduce the prediction ability of the model.
Line plot of the variable residuals
Whereas the sample residual plot gives information about residuals for all variables for a
particular sample, this plot gives information about all possible samples for a particular
variable. It is therefore more useful when investigating how one specific variable behaves in
all the samples.
Sample residuals
This is a plot of the residuals for a specified sample and component number for all the X-
variables. It is useful for detecting outlying sample or variable combinations. Although
outliers can sometimes be modeled by incorporating more components, this should be
avoided since it will reduce the prediction ability of the model.
Line plot of the sample residuals: one variable is an outlier
In the above plot, variable 1 (Adhesiveness at 1 day) for a particular sample is not very
well described by a model with a certain number of components (here 4). If this is the case
for most of the samples, this variable may be noisy and can be considered an outlier.
In contrast to the variable residual plot, which gives information about residuals for all
samples for a particular variable, this plot gives information about all possible variables for a
particular sample. It is therefore useful when studying how a specific sample fits to the
model.
In the above map, two variables are repeatedly not well described by the model. They are to
be checked.
Outliers
Influence Plot
See the description in the Interpreting PCR plots section
Y-residuals vs. Predicted Y
See the description in the Interpreting PCR plots section
Patterns
Normal Probability Y-residuals
See the description in the above section
Y-residuals vs. Score
See the description in the above section
Leverage/Hotelling’s T²
Leverage
Line
See the description in the Interpreting PCR plots section
Matrix
This is a matrix plot of leverages for all samples and all model components. The X-axis
represents the components and the Y-axis the samples. The color represents the Z-value,
which is the leverage; the color scale can be customized. It is a useful plot for studying how
the influence of each sample evolves with the number of components in the model. The
leverages can also be displayed as Hotelling’s T² statistics.
Leverage as a matrix plot
Hotelling’s T²
Line
See the description in the Interpreting PCR plots section.
Matrix
This is a matrix plot of Hotelling’s T² statistics for all samples and all model components. It is
equivalent to the matrix plot of leverages, to which it has a linear relationship. The Y-axis
represents the components and the X-axis the samples. The color represents the Z-value,
which is the Hotelling’s T² statistic for a specific PC and sample; the color scale can be
customized.
Hotelling’s T² as a matrix plot
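As an illustration, Hotelling’s T² values, and their linear relationship to leverage, can be computed from a score matrix. The scores below are hypothetical, and the formulas (score variance with an n − 1 divisor; leverage of mean-centered scores) are the standard textbook conventions, assumed rather than taken from the software:

```python
# Hypothetical score matrix T: rows = samples, columns = model components.
T = [
    [ 2.1, -0.3],
    [-1.4,  0.8],
    [ 0.2, -1.1],
    [-0.9,  0.6],
]
n = len(T)
n_comp = len(T[0])

# Sum of squared scores per component.
ss = [sum(row[k] ** 2 for row in T) for k in range(n_comp)]

# Hotelling's T^2 per sample: squared scores divided by the score variance
# s_k^2 = ss[k] / (n - 1), summed over the components.
t2 = [sum(row[k] ** 2 / (ss[k] / (n - 1)) for k in range(n_comp)) for row in T]

# Leverage is linearly related to T^2 (for mean-centered scores).
leverage = [1 / n + t2_i / (n - 1) for t2_i in t2]
```

The linear mapping between `t2` and `leverage` is what makes the two matrix plots equivalent.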
Response Surface
See the description in the Interpreting PCR plots section
15.6. Bibliography
K. Esbensen, Multivariate Data Analysis - In Practice, 5th Edition, CAMO Process AS, Oslo,
2002.
H. Hotelling, “Analysis of a complex of statistical variables into principal components”, J.
Educ. Psych., 24, 417-441, 498-520 (1933).
J.E. Jackson, A User’s Guide to Principal Components, Wiley & Sons Inc., New York, 1991.
K.V. Mardia, J.T. Kent, and J.M. Bibby, Multivariate Analysis, Academic Press Inc., London,
1979.
16. Partial Least Squares
16.1. Partial Least Squares regression
Partial Least Squares — or Projection to Latent Structures — (PLS) models both the X- and Y-
matrices simultaneously to find the latent variables in X that will best predict the latent
variables in Y. These PLS components are similar to principal components; however, they are
referred to as Factors. PLS maximizes the covariance between X and Y.
Theory
Usage
Plot Interpretation
Method reference
Basics
Interpreting the results of a PLS regression
Scores and loadings (in general)
PLS scores
PLS loadings
PLS loading weights
X-Y relationship outliers
Regression coefficients
Predicted vs. reference plot
Error measures for PLSR
More details about regression methods
PLSR algorithm options
16.2.1 Basics
PLSR maximizes the covariance between X and Y. As a result, convergence to a minimum
residual error is often achieved with fewer factors than with PCR. This is in contrast to PCR,
which first performs Principal Component Analysis (PCA) on X and then regresses the scores
(T) against the Y data. A conceptual illustration of PLSR is shown graphically below.
PLSR Procedure
PLSR may be carried out with one or more Y variables, meaning that multiple Y responses
can be used during regression modeling.
There are three algorithms available in The Unscrambler® for PLS regression.
NIPALS
Kernel PLS
Wide Kernel PLS
Read about how PLSR compares to other regression methods in More details about
regression methods.
PLSR results are described in Main Results Of Regression.
Details regarding the PLSR algorithms are given in the Method reference.
As with PCA and PCR, the results of a PLS regression provide similar graphical outputs and
diagnostics. However, in the case of PLSR, some more interesting and powerful diagnostic
tools are available. The following provides a summary of these tools.
PLS loadings can also be plotted as X, Y and X-Y Correlation Loadings. For more details on
correlation loadings, see interpreting plots.
PLS loading weights
Loading weights are specific to PLSR (they have no equivalent in PCR) and express how the
information in each X-variable relates to the variation in Y summarized by the u-scores. They
are called loading weights because they also express, in the PLSR algorithm, how the t-scores
are to be computed from the X-matrix to obtain an orthogonal decomposition. The loading
weights are normalized, so that their lengths can be interpreted as well as their directions.
Variables with large loading weight values are important for the prediction of Y.
X-Y relationship outliers
X-Y Relationship Outliers plots the t-scores from X vs. the u-scores from Y and is used for two
main purposes: detecting outliers, and assessing the optimal number of factors.
This plot is unique to the PLSR algorithm. Since PLSR attempts to maximize the covariance
between the X- and Y-variables in the first calculated factors, the t vs. u plot should ideally
show a straight-line relationship. Samples that deviate noticeably are potential outliers. This
is shown graphically below.
The X-Y Relationship Outlier Plot for Ideal and Outlier Situations
To determine the optimal number of factors, visually assess which pair of t vs. u scores
starts to deviate from a straight line. The Quadrupole Plot is useful in this regard. This is
shown diagrammatically below.
The X-Y Relationship Outlier Quadrupole Plot
The X-Y Relationship Outliers plot is also useful for detecting nonlinear relationships that
may exist in the data. This may suggest that a different preprocessing should be considered.
Regression coefficients
Regression coefficients show how each variable is weighted when predicting a particular Y
response. They are a characteristic of all regression methods and may provide interpretive
insight into the quality of a model. Examples include:
Spectroscopy: Regression coefficients should have “spectral characteristics” about
them and not show noise characteristics.
Process data: When different variable types exist, regression coefficients show the
relative importance of the variables; their interactions can also be displayed.
Predicted vs. reference plot
The predicted vs. reference plot is another common feature of all regression methods. It
should show a straight-line relationship between predicted and measured values, ideally
with a slope of 1 and a correlation close to 1.
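As a sketch, the slope and correlation of such a predicted vs. reference relationship can be computed directly; the measured and predicted values below are illustrative:

```python
# Hypothetical reference (measured) and predicted values for 6 samples.
reference = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]

n = len(reference)
mean_r = sum(reference) / n
mean_p = sum(predicted) / n

# Cross- and auto-sums of squares about the means.
sxy = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
sxx = sum((r - mean_r) ** 2 for r in reference)
syy = sum((p - mean_p) ** 2 for p in predicted)

# Least-squares slope of predicted on reference, and the correlation r.
slope = sxy / sxx
corr = sxy / (sxx * syy) ** 0.5

# A good model gives both values close to 1.
```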
Error measures for PLSR
In PLSR (and PCR) models, not only the Y-variables are projected (fitted) onto the model; X-
variables are too. Sample residuals are computed for each PC of the model. The residuals
may then be combined:
Across samples
for each variable, to obtain a variance curve describing how the residual (or
explained) variance of an individual variable evolves with the number of PCs in the
model;
Across variables
(all X-variables or all Y-variables), to obtain a Total variance curve describing the
global fit of the model. The Total Y-variance curve shows how the prediction of Y
improves when more PCs are added to the model; the Total X-variance curve
expresses how much of the variation in the X-variables is taken into account to
predict variation in Y.
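As an illustration, the two ways of combining residuals can be sketched in Python. The residual matrix E below is hypothetical, and the simple 1/n variance convention is used for brevity; the software divides by the appropriate number of degrees of freedom:

```python
# Hypothetical X-residual matrix E after k components:
# rows = samples, columns = X-variables.
E = [
    [ 0.10, -0.20,  0.05],
    [-0.15,  0.10,  0.02],
    [ 0.05,  0.25, -0.08],
]
n_samples = len(E)
n_vars = len(E[0])

# "Across samples": one residual variance per variable (column).
var_per_variable = [
    sum(E[i][j] ** 2 for i in range(n_samples)) / n_samples
    for j in range(n_vars)
]

# "Across variables": one total residual variance for the whole model.
total_var = sum(sum(e ** 2 for e in row) for row in E) / (n_samples * n_vars)
```

Repeating this for each number of components produces the variance curves described above.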
Read more about how sample and variable residuals, as well as explained and residual
variances, are computed in the chapter with theory about PCA.
In addition, the Y-calibration error can be expressed in the same units as the original
response variable using the Root Mean Square Error of Calibration (RMSEC), and the Y-
prediction error as the Root Mean Square Error of Prediction (RMSEP).
RMSEC and RMSEP also vary as a function of the number of factors in the model.
E and F are initially X and Y and are “deflated” during the calculation of PLS factors.
The so-called Y-scores, u, are calculated from
The inner relation in PLS regression is the relation between T and U for the individual
factors:
The process continues by deflating: the information of the PLS factors (i.e. the outer
products tpᵀ and tqᵀ) is subtracted from E and F to obtain
The process is now repeated to find the next PLS factor by finding the corresponding
eigenvector. The estimation of the PLS loadings, loading weights and scores may also be
achieved by extracting eigenvectors of the smallest-sized products of X, Xᵀ, Y and Yᵀ, which
are the basis for other PLSR algorithms such as the kernel and wide kernel methods (see below).
The matrices, W, T, P and Q are then stored in The Unscrambler® Project Navigator with the
PLSR results, for further diagnostic purposes. To ensure that the columns of the matrix W
relate to the original matrix X, the weights may be expressed as,
The scores T are now used to calculate the regression coefficients, using the following
expression,
Since the normalization step can be introduced at various points in the calculation in other
variants of the PLSR algorithm, it can be difficult to compare scores and loadings calculated
by these variants.
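The deflation scheme described above can be sketched for the single-response case (PLS1) in pure Python. This is an illustrative outline of a NIPALS-style PLS1, not The Unscrambler’s implementation; X and y are assumed to be mean-centered, and the loading weights w are normalized to unit length as in the text:

```python
def pls1(X, y, n_factors):
    """NIPALS-style PLS1 sketch: X is a list of rows, y a list of values,
    both assumed mean-centered. Returns loading weights W, scores T,
    X-loadings P and Y-loadings Q, one entry per factor."""
    n, m = len(X), len(X[0])
    E = [row[:] for row in X]          # X-residuals, deflated each factor
    F = y[:]                           # y-residuals
    W, T, P, Q = [], [], [], []
    for _ in range(n_factors):
        # Loading weights: w = E'F, normalized to unit length.
        w = [sum(E[i][j] * F[i] for i in range(n)) for j in range(m)]
        norm = sum(wj ** 2 for wj in w) ** 0.5
        w = [wj / norm for wj in w]
        # Scores: t = Ew.
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]
        tt = sum(ti ** 2 for ti in t)
        # Loadings: p = E't / t't and q = F't / t't.
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = sum(F[i] * t[i] for i in range(n)) / tt
        # Deflate: subtract the outer products t p' and t q from E and F.
        for i in range(n):
            for j in range(m):
                E[i][j] -= t[i] * p[j]
            F[i] -= t[i] * q
        W.append(w); T.append(t); P.append(p); Q.append(q)
    return W, T, P, Q
```

Note that with a single response there is no inner iteration loop, which is why the Max. iterations setting (described later) does not apply to PLS1.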
This is a variant of the Kernel PLS that is expected to perform better for data
containing a large number of variables and relatively few samples (‘short and fat’
data). The implementation is based on Rännar et al., 1994 and does not handle
missing values.
More details on the algorithms are given in the method reference.
Some important tips and warnings associated with the Model Inputs tab
PLSR is a multivariate regression analysis technique; therefore, in The Unscrambler® it
requires a minimum of three samples (rows) and two variables (columns) to be present in a
data set in order to complete the calculation. The following describes the warnings given
when certain analysis criteria are not met.
Not enough samples present
Solution: Check that the data table (or selected row set) contains a minimum of 3 samples.
Not enough variables present
Solution: Check that the data table (or selected column set) contains a minimum of 2
variables.
Number of X rows does not match number of Y rows
Solution: Ensure that the row set dimensions of X match the row set dimensions of Y.
Too many excluded samples/variables
Solution: Check that not all samples/variables have been excluded from the data set.
To keep track of row and column exclusions, the model inputs tab provides a warning to
users that exclusions have been defined. See automatic keep outs for more details.
Individual X- and Y-variables can be selected from the variable list table provided in this
dialog by holding down the control (Ctrl) key and selecting variables. Alternatively, the
variable numbers can be manually entered into the text dialog box, the Select button can be
used (which opens the Define Range dialog box), or All can be clicked to select every
variable in the table.
Once the variables have been selected, to weight them, use the options in the Change
Selected Variable(s) dialog box, under the Select tab. The options include:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows selected variables to be weighted by predefined constant values.
Downweight
This allows for the multiplication of selected variables by a very small number, such
that the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box to also weight
the variables by their standard deviation in addition to the block weighting.
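The two most common of these weighting schemes can be sketched as follows. The values are hypothetical, and the 1/sqrt(k) block-weight convention (k variables per block, so each block gets the same total weight) is an assumption for illustration, not necessarily The Unscrambler’s exact formula:

```python
from math import sqrt
from statistics import stdev

# A/(SDev + B) weighting: with the defaults A = 1, B = 0 this reduces to
# dividing each variable by its standard deviation (standardization).
values = [2.0, 4.0, 6.0, 8.0]      # hypothetical values of one variable
A, B = 1.0, 0.0
sd_weight = A / (stdev(values) + B)
weighted = [v * sd_weight for v in values]

# Block weighting: one common convention (assumed here) gives every
# variable in a block of k variables the weight 1/sqrt(k), so that each
# block contributes the same total weight to the model.
blocks = {"spectra": 700, "process": 7}   # hypothetical block sizes
block_weight = {name: 1 / sqrt(k) for name, k in blocks.items()}
```

After the A/(SDev + B) step the weighted variable has unit standard deviation, which is what puts variables of different scales on an equal footing.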
Advanced tab
Use the Advanced tab in the X- and Y-Weights dialog to apply predetermined weights to
each variable. To use this option, set up a row in the data set containing the weights (or
create a separate row matrix in the project navigator). Select the Advanced tab in the
Weights dialog and select the matrix containing the weights from the drop-down list. Use
the Rows option to define the row containing the weights and click on Update to apply the
new weights.
Another feature of the Advanced tab is the ability to use the results matrix of another
analysis as weights, using the Select Results Matrix button. This option provides
an internal project navigator for selecting the appropriate results matrix to use as a weight.
The dialog box for the Advanced option is provided below.
PLSR Advanced Weights Option
Once the weighting and variables have been selected for X and Y, click Update to apply
them.
The differences between the algorithms are described in the Introduction to PLSR. Contrary
to the Kernel-based methods, the NIPALS algorithm is iterative and the maximum number of
iterations can be tuned in the Max. iterations box. The default value of 100 should be
sufficient for most data sets; however, some large and noisy data sets may require more
iterations to converge properly. The maximum allowed number of iterations is 30,000.
In the special case of a single response variable (i.e. PLS1), there are no iterations and the
Max. iterations box is grayed out.
When there are missing values in the data, the options are to impute them automatically
using the NIPALS algorithm or as a pre-processing step using Fill Missing.
Note: If there are missing values in the data and one of the Kernel methods is
selected, a warning will be given as shown below.
Q-residual limits are by default approximated based on calculated model components only,
which works well in many cases. Calculation of exact Q-residual limits will be performed
when the check box is marked. Note that estimation of exact limits may be slow for large
data.
Pretreatments can also be registered from the PLSR node in the project navigator. To
register the pretreatment, right click on the PLSR analysis node and select Register
Pretreatment. This is shown below.
Registering a Pretreatment From The Project Navigator
The Autopretreatment dialog box will appear, where the desired pretreatments can be
selected.
Note: Some caution is required when data table dimensions are changed after the
first pretreatment. The Autopretreatment is applied to the same column indices as
the original transformation, and inserting new variables (columns) before or in
between the original data will result in autopretreatment of the wrong variables.
To be safe, always insert any new variables in the table before applying any
transformations, or make a habit of always appending rather than inserting new
columns.
Set this tab up based on a priori knowledge of the data set in order to return outlier
warnings in the model. Settings for estimating the optimal number of components can
also be tuned here. The values shown in the dialog box above are default values and may
be used as a starting point for the analysis.
The warning limits in the Unscrambler® serve two major purposes:
The leverage and residual (outlier) limits are given as standard scores. This means that a
limit of e.g. 3.0 corresponds to a 99.7% probability that a value will lie within 3.0 standard
deviations of the mean of a normal distribution. The following limits can be specified:
Leverage Limit
(default 3.0) The ratio between the leverage for an individual sample and the
average leverage for the model.
Sample Outlier Limit, Calibration
(default 3.0) The square root of the ratio between the residual calibration variance
per sample (Sample Residuals) and the average residual calibration variance for the
model (Total Residuals).
Sample Outlier Limit, Validation
(default 3.0) The square root of the ratio between the residual validation variance
per sample (Sample Validation Residuals) and the total residual validation variance
for the model (Total Residuals).
Individual Value Outlier, Calibration
(default 3.0) For individual, absolute values in the calibration residual matrix
(Residuals), the ratio to the model average is computed (square root of the Variable
Residuals). For spectroscopic data this limit may be set to 5.0 to avoid many false
positive warnings due to the high number of variables.
Individual Value Outlier, Validation
(default 2.6) For individual, absolute values in the validation residual matrix
(Residuals), the ratio to the validation model average is computed (square root of
the Variable Validation Residuals). For spectroscopic data this limit may be set to 5.0
to avoid many false positive warnings due to the high number of variables.
Variable Outlier Limit, Calibration
(default 3.0) The square root of the ratio between the residual calibration variance
per variable (Variable Residuals) and the average residual calibration variance for
the model (Total Residuals).
Variable Outlier Limit, Validation
(default 3.0) The square root of the ratio between the residual validation variance
per variable (Variable Validation Residuals) and the total residual validation variance
for the model (Total Residuals).
Total Explained Variance (%)
(default 20) If the model explains less than 20% of the variance, the optimal number
of components is set to 0 (see the Info Box).
Ratio of Calibrated to Validated Residual Variance
(default 0.5) If the residual variance from the validation is much higher than that
from the calibration, a warning is given.
Ratio of Validated to Calibrated Residual Variance
(default 0.75) If the residual variance from the calibration is much higher than that
from the validation, a warning is given. This may occur in case of test set validation
where the test samples do not span the same space as the training data.
Residual Variance Increase Limit (%)
(default 6) This limit is applied for selecting the optimal number of components and
is calculated from the residual variance for two consecutive components. If the
variance for the next component is less than x% lower than that of the previous
component, the default number of components is set to the previous one.
When all the settings are made, click OK.
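As an illustration, the 99.7% coverage behind a standard-score limit of 3.0, and the Sample Outlier ratio test, can be sketched as follows. The residual variances are hypothetical: 19 ordinary samples and one with a much larger residual:

```python
from math import sqrt
from statistics import NormalDist

# A limit given as a standard score of 3.0 corresponds to roughly 99.7% of
# a normal distribution lying within 3 standard deviations of the mean.
coverage = NormalDist().cdf(3.0) - NormalDist().cdf(-3.0)

# Sample Outlier Limit (calibration): the tested ratio is the square root
# of a sample's residual variance over the model's average residual
# variance. Hypothetical residual variances per sample:
sample_residual_variance = [0.1] * 19 + [5.0]
total_residual_variance = (
    sum(sample_residual_variance) / len(sample_residual_variance)
)
ratios = [sqrt(v / total_residual_variance) for v in sample_residual_variance]

# Samples whose ratio exceeds the limit (default 3.0) trigger a warning.
limit = 3.0
flagged = [i for i, r in enumerate(ratios) if r > limit]
```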
Scores
Line
2-D scatter
3-D scatter
2 x 2-D Scatter
4 x 2-D Scatter
Loadings
Line
Loadings for the X-variables
Loadings for the Y-variables
2-D scatter
3-D scatter
Loadings for the X-variables
Loadings for the Y-variables
2 x 2-D scatter
4 x 2-D scatter
Loadings weights
Line
2-D scatter
3-D scatter
2 x 2-D scatter
4 x 2-D scatter
Important variables
Regression coefficients
Weighted coefficients (Bw)
Line plot
Matrix
Raw coefficients (B)
Line plot
Matrix
Residuals
Residuals and influence
General
Y-residuals vs. Predicted Y
Normal probability Y-residuals
Y-residuals vs. Score
Influence plot
Variance per sample
Variable residuals
Sample residuals
Sample and variable residuals
Outliers
Influence Plot
Y-residuals vs. Predicted Y
Patterns
Normal Probability Y-residuals
Y-residuals vs. Score
Leverage/Hotelling’s T²
Leverage
Line
Matrix
Hotelling’s T²
Line
Matrix
Response Surface
Scores
This is a two-dimensional scatter plot (or map) of scores for two specified factors (latent
variables or PCs) from PLS regression. The plot gives information about patterns in the
samples. The scores plot for (factor 1,factor 2) is especially useful, since these two
components summarize more variation in the data than any other pair of components.
The closer the samples are in the scores plot, the more similar they are with respect to the
two components concerned. Conversely, samples far away from each other are different
from each other. The plot can be used to interpret differences and similarities among
samples. Look at the scores plot together with the corresponding loadings plot for the same
two components. This can help in determining which variables are responsible for
differences between samples. For example, samples to the right of the scores plot will
usually have a large value for variables to the right of the loadings plot, and a small value for
variables to the left of the loadings plot.
Here are some things to look for in the 2-D scores plot.
Finding groups in a scores plot
Is there any indication of clustering in the set of samples? The figure below shows a situation
with four distinct clusters. Samples within a cluster are similar.
Detecting grouping in a scores plot
Furthermore, the display of the Hotelling’s T² ellipse for a model in two dimensions is
also a good way to detect outliers. To display it, click on the Hotelling’s T² ellipse
button.
Scores plot with Hotelling’s T² limit
In addition, the display of the stability plot can help in detecting outliers. This plot represents
the projection of the samples onto the submodels used for the validation; they can be part of
the model or left out. Hence this plot is only available when some type of cross-validation has
been selected. It is available from the toolbar icon.
How representative is the picture?
Check how much of the total variation each of the components explains. This is displayed in
parentheses next to each axis name: Factor-1 (86%). If the sum of the explained variances
for the two components is large (for instance 70-80%), the plot shows a large portion of the
information in the data, so the relationships can be interpreted with a high degree of
confidence.
X- and Y-loadings
A 2-D scatter plot of X- and Y-loadings for two specified components (factors) from PLS is a
good way to detect important variables and relationships between variables. The plot is
most useful for interpreting component 1 vs. component 2, since these represent the largest
variations in the X-data that explain the largest variation in the Y-data. By default both Y-
and X-variables are displayed but it is possible to modify that by clicking on the X and Y
icons.
Interpret the X-Y relationships
To interpret the relationships between X and Y-variables, start by looking at the response (Y)
variables.
Predictors (X) projected in roughly the same direction from the center as a response,
are positively linked to that response.
Predictors projected in the opposite direction have a negative link.
Predictors projected close to the center, are not well represented in that model and
cannot be interpreted.
698
Partial Least Squares
The maturity has a negative effect on the adhesiveness of the cheese; they are
anticorrelated. The amount of dry matter affects the stickiness positively and the
glossiness and meltiness negatively. Glossiness and meltiness, two responses, are correlated.
Caution! If the X-variables have been standardized, one should also standardize the
Y-variable so that the X- and Y-loadings have the same scale; otherwise the plot
may be difficult to interpret.
The plot shows the importance of the different variables for the two components specified.
It is possible to change the display by using the factor drop-down list. It should
preferably be used together with the corresponding scores plot. Variables with loadings to
the right in the loadings plot will be X-variables which usually have high values for samples
to the right in the scores plot, etc. This plot can be used to study the relationships among
the X-variables, and between the X- and Y-variables.
If the Uncertainty Test was activated, the important variables will be circled. It is also
possible to mark them by using the toolbar icon.
Loadings plot with circled important variables
When working with discrete variables, line loadings plots can also be used to represent data.
The Ascending and Descending buttons can be used to order the loadings in
terms of the variables with highest (or lowest) contribution to the Factor.
Line plot of loadings in ascending order of importance to Factor 1
Variables close to each other in the loadings plot will have a high positive correlation if the
two components explain a large portion of the variance of X. The same is true for variables in
the same quadrant lying close to a straight line through the origin. Variables in diagonally
opposed quadrants will have a tendency to be negatively correlated. For example, in the
figure above, variables dry matter and stickiness have a high positive correlation on factor 1
and factor 2, and they are negatively correlated to variables meltiness and glossiness.
Variables adhesiveness and stickiness have independent variations. Variables addition of
recycled dry matter and pH are very close to the center; they are not well described by
factor 1 and factor 2.
Note: Variables lying close to the center are poorly explained by the plotted factors
(or PCs). They cannot be interpreted in that plot!
Correlation loadings are also available for 1-D line loadings plots. When a line plot is
generated, the 1-D correlation loadings toolbar icon is displayed.
These are especially useful when interpreting important wavelengths in the analysis of
spectroscopic data, or contributing variables in time series data. An example is shown below.
Correlation Line Loadings of Spectroscopic variables in Factor 1
Values that lie within the upper and lower bounds of the plot are modelled by that Factor.
Those that lie between the two lower bounds are not.
Explained variance
This plot illustrates how much of the total variation in X or Y is described by models including
different numbers of components. The total residual variance is computed as the sum of
squares of the X- or Y-residuals divided by the number of degrees of freedom.
The total explained variance is then computed as: Total explained variance (%) =
100 * (1 - total residual variance / total initial variance).
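A minimal sketch of this computation for one response variable follows; the measured and fitted values are illustrative, and plain sums of squares are used so that any common degrees-of-freedom divisor cancels in the ratio:

```python
# Hypothetical measured Y-values and model-fitted values.
y     = [2.0, 3.0, 5.0, 6.0, 9.0]
y_fit = [2.2, 2.9, 4.6, 6.3, 8.8]

mean_y = sum(y) / len(y)

# Total variation about the mean, and residual variation left by the model.
total_ss = sum((yi - mean_y) ** 2 for yi in y)
residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_fit))

# Total explained variance (%): the share of the initial variation that the
# model accounts for.
explained_pct = 100 * (1 - residual_ss / total_ss)
```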
Outliers can sometimes cause large residual variance (or small explained variance). They can
also cause a decrease in the explained validation variance as can be seen in the plot below.
Outlier causes a drop of explained variance in validation
To display the results for other Y-variables, use the variable icon. In addition,
by default the results are shown for a specific number of factors, which should reflect the
dimensionality of the model. If the number of factors is not satisfactory, it is possible to
change it.
Note: Before interpreting the plot, check whether the plots are displaying
Calibration or Validation results (or both).
Menu option Window - Identification tells whether the plots are displaying Calibration (if
Ordinate is yPredCal) or Validation (yPredVal) results.
Use the buttons to switch Calibration and Validation results off or on.
It is also useful to show the regression line and compare it with the target line; these can be
displayed from the toolbar. Some statistics are available that give an idea of the quality of
the regression.
Note: RMSE and R-squared values are highly dependent on the validation method
used and the number of factors in a model. It is important not to use too many
factors and overfit the model.
When the statistics are toggled on, more detailed statistics are displayed. The Calibration
plot is shown below with statistics.
Predicted vs. Reference plot for PLS Calibration samples
In the above plot, sample 3 does not follow the regression line whereas all the other
samples do. Sample 3 may be an outlier.
How to detect nonlinearity
In other cases, there may be a nonlinear relationship between the X- and Y-variables, so that
the predictions do not have the same level of accuracy over the whole range of variation of
Y. In such cases, the plot may look like the one shown below. Such nonlinearities should be
corrected if possible (for instance by a suitable transformation), because otherwise there
will be a systematic bias in the predictions depending on the range of the sample.
Predicted vs. Reference shows a nonlinear relationship
Explained Variance
This plot shows the explained variance for each X- or Y-variable individually for different
model complexities. It can be used to identify which variables are described by the different
components in a model. Use the toolbar buttons to switch between X- and Y-variables, and
to add the total X- or Y-variance to the plot for comparison.
By default, ALL X- or Y-variables are plotted together. Use the toolbar drop-down box or
arrows to scroll between individual variables to plot. You may also type in comma separated
variable indexes manually in the box:
Toolbar variable selection box
Use this plot to see which components explain the individual variables, and whether this is
due to irrelevant or predictive variation (calibration vs. validation variance). The below plot
shows the explained validation variance for some X-variables. The first component is seen to
explain Opacity, Scatter and Weight, whereas the second component spans Roughness.
Many components would have to be included in order to model Brightness, and Ink is hardly
modeled at all.
Explained variances for several individual X-variables
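The per-variable bookkeeping behind this plot can be sketched as follows. The example uses a PCA (SVD) decomposition on random data as a stand-in for the model's X-decomposition; it is an illustration, not The Unscrambler's own computation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
Xc = X - X.mean(axis=0)                  # mean-center the data

# Rank-k reconstruction of X from its first k components (via SVD).
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Xhat = U[:, :k] * s[:k] @ Vt[:k]

# Explained variance per variable: 100 * (1 - residual SS / total SS),
# computed column by column.
resid = Xc - Xhat
expl_var = 100.0 * (1.0 - (resid ** 2).sum(axis=0) / (Xc ** 2).sum(axis=0))
print(np.round(expl_var, 1))
```

A variable whose explained variance stays low across all components (like Ink in the example above) is hardly modeled at all.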
Sample Outliers
Scores
See the description in the Interpreting PLS plots section
Influence
This is a plot of the residual X- and Y-variances vs. leverages. Look for samples with a high
leverage and high residual X- or Y-variance.
To study such samples in more detail, it is recommended to mark them and then plot X-Y
relation outliers for several model components. This way their influence on the shape of
the X-Y relationship can be determined, and it may be found that they are dangerous outliers.
High residuals indicate an outlier. Incorporating more components can sometimes model
outliers; avoid doing so since it will reduce the prediction ability of the model.
Small residual variance (or large explained variance) indicates that, for a particular number
of factors or components, the samples are well explained by the model. Therefore a sample
with a high Y-residual may be an outlier.
X-Y Relation outliers
This plot visualizes the regression relation along a particular component of the PLS model. It
shows the t-scores as abscissa and the u-scores as ordinate. In other words, it shows the
relationship between the projection of the samples in the X-space (horizontal axis) and the
projection of the samples in the Y-space (vertical axis).
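A minimal sketch of where the t- and u-scores come from, using a single NIPALS-style PLS component on simulated data (an illustration of the idea, not The Unscrambler's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=15)
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# One PLS component (NIPALS, single y): loading weight, then t- and u-scores.
w = Xc.T @ yc
w /= np.linalg.norm(w)       # normalized loading weight
t = Xc @ w                   # projection of samples in X-space (abscissa)
q = yc @ t / (t @ t)         # y-loading
u = yc / q                   # projection of samples in Y-space (ordinate)

# In the X-Y relation plot, samples far from the t-u trend are candidate outliers.
corr = np.corrcoef(t, u)[0, 1]
print(round(corr, 3))
```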
Note: The X-Y relation outlier plot for factor 1 is exactly the same as the Predicted
vs. Reference plot for factor 1.
This summary can be used for two purposes.
Detecting outliers
A sample may be outlying according to the X-variables only, or to the Y-variables only, or to
both. It may also not have extreme or outlying values for either separate set of variables, but
become an outlier when one considers the (X,Y) relationship. In the X-Y Relation Outlier plot,
such a sample sticks out as being far away from the relation defined by the other samples, as
shown in the figure below. If a sample appears to be outlying, it is advisable to check the
data: there may be a data transcription error for that sample.
A simple X-Y outlier
If a sample sticks out in such a way that it is projected far away from the center along the
model component, it is an influential outlier (see the figure below). Such samples are
dangerous to the model: they change the orientation of the component. Check the data. If
there is no data transcription error for that sample, investigate more and decide whether it
belongs to another population. If so, the sample can be removed as an outlier (mark it and
recalculate the model without the marked sample). If not, more samples of the same kind
may be needed, in order to make the data more balanced.
An influential outlier
A sigmoid-shaped curvature may indicate that there are interactions between the
predictors. Adding a cross-term to the model may improve it.
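Adding a cross-term simply means appending the product of two predictors as an extra column of X, for example:

```python
import numpy as np

# Two hypothetical predictor columns for three samples.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Append the cross-term x1*x2 as an extra predictor column,
# letting the model capture the interaction between the two predictors.
X_aug = np.column_stack([X, X[:, 0] * X[:, 1]])
print(X_aug.shape)
```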
Sample groups may indicate the need for separate modeling of each subgroup.
Scores and loadings
This overview shows two plots: the score and loadings plots.
Scores
See the description in the section
Loadings
X-Loadings
This plot displays by default the X-loadings along one factor (or PC) at a time, and the
maximal PC should be the same as the dimensionality of the model used to study the Bw
coefficients. It is possible to change the factor to be displayed by using the blue arrows.
This view is most useful if the X-data are spectral data. It is then possible to detect the area
of the signal that is responsible for a discrimination of the samples along the specified factor.
X-loading for spectra
In the above plot, the peak at 960 is responsible for the discrimination of the samples along
factor 2.
If the data are not spectral, it is generally more informative to look at the loadings in a
scatter plot. For more information on the scatter plot, see the description in the
Interpreting PLS plots section.
Y-Loadings
Regression coefficients
If the X-variables were weighted, this plot presents the weighted regression coefficients;
otherwise the B-coefficients and Bw-coefficients are identical. The number of factors (or
PCs) is fixed and can be changed using the arrows.
In general, this plot shows the weighted regression coefficients for a specific response or
Y-variable. By default it shows the coefficients for the first Y-variable; the other
responses can be accessed from the toolbar. The plot summarizes the relationship between
the predictors and the response, as approximated by a model with 3 components. The weighted
regression coefficients (Bw) provide information about the importance of the X-variables.
X-variables with a large regression coefficient play an important role in the regression
model; a positive coefficient shows a positive link with the response, and a negative
coefficient a negative link. Predictors with a small coefficient are negligible; mark them
and recalculate the model without those variables. The constant value B0W is indicated
within the X-axis label.
Weighted regression coefficients for 2 factors (or PCs)
In this plot it can be seen that variables Ti, Ba, Sr and Zr contribute the most to the model.
Important variables can also be plotted as a two-pane window of regression coefficients and
loading weights. This plot is useful for determining which factors most influence the
profile of the regression coefficients, particularly for spectroscopic applications.
Important variables showing regression coefficients and loadings weights
Note: The weighted coefficients (Bw) and raw coefficients (B) are identical if no
weights were applied on the variables.
If the predictor variables have been weighted with 1/SDev (standardization), the weighted
regression coefficients (Bw) take these weights into account. Since all predictors are brought
back to the same scale, the coefficients show the relative importance of the X-variables in
the model.
X-loading weights
This is a plot of X-loading weights for all the components vs. variable number. It is useful for
detecting important variables. If a variable has a large positive or negative loading weight,
this means that the variable is important for the component concerned. For example, a
sample with a large score value for this component will have a large positive value for a
variable with large positive loading weight.
If a variable has the same sign for all the important components, it is most likely to be an
important variable.
Regression coefficients
See the description in the previous section
Regression and prediction
See the description in the Overview section
Residuals and influence
Influence Plot
This plot shows the Q- or F-residuals vs. Leverage or Hotelling’s T² statistics. These represent
two different kinds of outliers. The residual statistics on the ordinate axis describe the
sample distance to the model, whereas the Leverage and Hotelling's T² describe how extreme
the sample is within the model space.
Samples with high residual variance, i.e. lying in the upper regions of the plot, are poorly
described by the model. Including additional components may result in these samples being
described better, however caution is required that the additional components are predictive
and not modelling noise. As long as the samples with high residual variance are not
influential (see below), keeping them in the model may not be a problem as such (the high
residual variance may be due to non-important regions of a spectrum, for instance).
Samples with high leverage, i.e. lying to the right of the plot, are well described by the
model. They are well described in the sense that the sample scores may have very high or
low values for some components compared to the rest of the samples. Such samples are
dangerous in the calibration phase because they are influential to the model. A sufficiently
extreme sample may by itself span an entire component, in which case the model will
become unreliable. Removal of a highly influential sample from the model will make the
model look entirely different and the axes will span different phenomena altogether. If the
variance described by the sample is important but unique, one should try to obtain more
samples of the same type to stabilize the model. Otherwise the sample should be discarded
as an outlier.
Note that a sample with both high residual variance and high leverage is the most dangerous
outlier. Not only is it poorly described by the model but it is also influential. Samples such as
these may span up to several components single handedly. Because they also disagree with
the majority of the other calibration samples, the ability of the model to describe new
samples is likely poor.
The Q- and F-residuals are two different methods for testing the same thing. The F-residuals
are available for both calibration and validation, in contrast to the Q-residuals, which are
available for calibration only. The validated residuals reflect the scheme chosen in the
validation and provide a more conservative assessment of residual outliers. If the residual
variance from validation is much higher than for calibration, one should investigate the
residuals in more detail.
The difference between Leverage and Hotelling's T² is only a scaling factor. The critical
limit for Leverage is based on an ad-hoc rule, whereas the Hotelling's T² critical limit is
based on an assumed F-distribution.
The toggle buttons in the toolbar can be used to switch between the various combinations.
Click the Y icon in the source taskbar to display the explained Y sample variance
plot. This plot displays the Y-sample variance explained for each sample in the model for the
number of factors selected.
X Sample Residuals
Switch between explained and residual variances using the buttons in the source
taskbar to view the X sample residuals plot. This plot displays the X Sample Residuals for
each sample in the model for the number of factors selected.
Y Sample Residuals
This plot displays the Y Sample Residuals for each sample in the model for the number of
factors selected.
High residuals indicate an outlier. Incorporating more components can sometimes model
outliers; avoid doing so since it will reduce the prediction ability of the model.
Leverage / Hotelling’s T²
The lower left pane of the Residuals and Influence overview displays a line plot of the
Hotelling’s T² by default. A toolbar toggle can be used to switch between the
Hotelling’s T² and Leverage views.
Hotelling’s T² statistics
The plot displays the Hotelling’s T² statistic for each sample as a line plot. The associated
critical limit (with a default p-value of 5%) is displayed as a red line.
Hotelling’s T² plot
The Hotelling’s T² statistic has a linear relationship to the leverage for a given sample.
Its critical limit is based on an F-test. Use it to identify outliers.
The number of factors (or PCs) may be tuned up or down with the toolbar arrows.
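For reference, the Hotelling's T² statistic itself is computed from the score matrix by scaling each squared score with the variance of its component. A minimal sketch on simulated scores follows; the critical limit, which requires an F-distribution quantile, is omitted here:

```python
import numpy as np

rng = np.random.default_rng(2)
T = rng.normal(size=(30, 3))   # score matrix: 30 samples, 3 components
                               # (scores from a centered model are assumed)

# Hotelling's T² per sample: squared scores scaled by each component's variance.
t_var = (T ** 2).sum(axis=0) / (T.shape[0] - 1)
T2 = ((T ** 2) / t_var).sum(axis=1)
print(np.round(T2[:3], 2))
```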
Leverage
Leverages are useful for detecting samples which are far from the center within the space
described by the model. Samples with high leverage differ from the average samples; in
other words, they are likely outliers. A large leverage also indicates a high influence on the
model. The figure below shows a situation where sample 5 is obviously very different from
the rest and may disturb the model.
One sample has a high leverage
There is an ad-hoc critical limit for Leverage (not depending on any distributional
assumptions), expressed in terms of the number of components and the number of calibration
samples.
The leverage values are always larger than zero, and can go up to 1 for samples in the
calibration set. As a rule of thumb, samples with a leverage above 0.4 - 0.5 start to be a
concern.
Influence on the model is best measured in terms of relative leverage. For instance, if all
samples have leverages between 0.02 and 0.1, except for one, which has a leverage of 0.3,
although this value is not extremely large, the sample is likely to be influential.
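A common way to compute leverages from the score matrix is sketched below. Conventions vary on whether the 1/N term is included; this is an illustration on simulated centered scores, not The Unscrambler's exact formula:

```python
import numpy as np

rng = np.random.default_rng(3)
T = rng.normal(size=(10, 2))   # centered scores: 10 samples, 2 components

# Leverage of sample i: h_i = 1/N + sum over components a of t_ia^2 / (t_a' t_a).
N = T.shape[0]
h = 1.0 / N + ((T ** 2) / (T ** 2).sum(axis=0)).sum(axis=1)
print(np.round(h, 3))
```

Comparing each sample's leverage to the bulk of the others (relative leverage) is often more telling than any absolute cut-off.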
What Should Be Done with a High-Leverage Sample? The first thing to do is to understand
why the sample has a high leverage. Investigate by looking at the raw data and checking
them against the original recordings. Once an explanation has been found, there are two
following cases:
Case 1
There is an error in the data. Correct it; or, if the true value cannot be found and
the experiment cannot be redone to give a more valid value, the erroneous value may
be replaced with “missing”.
Case 2
There is no error, but the sample is different from the others. For instance, it has
extreme values for several of the variables. Check whether this sample is “of
interest” (e.g. it has the properties of interest, to a higher degree than the other
samples), or “not relevant” (e.g. it belongs to another population than that being
studied). In the former case, one should try to generate more samples of the same
kind: they are the most interesting ones! In the latter case (and only then), the high-
leverage sample may be removed from the model.
Residuals
The lower right pane of the Residuals and Influence overview displays a line plot of the
sample residual statistics. A toolbar toggle can be used to switch between the Q- and
F-residuals views.
Q-residuals
This plot shows the sample Q-residuals as a line plot with associated limits.
Q-residual sample variance
F-residuals
This plot shows the sample F-residuals as a line plot with associated limits.
Note that the F-residuals are available for both calibration and validation. If the residual
X-variance from validation is much higher than for calibration, one should investigate the
residuals in more detail. The validated residuals reflect the scheme chosen in the
validation and provide a more conservative assessment of residual outliers.
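The Q-residual of a sample is the sum of its squared X-residuals after projection onto the model. The sketch below uses a PCA subspace as a stand-in for the PLS X-decomposition, with simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(12, 6))
Xc = X - X.mean(axis=0)

# Model subspace from the first 2 components (via SVD).
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T                   # loadings
E = Xc - Xc @ P @ P.T          # X-residual matrix after projection

# Q-residual (squared prediction error) per sample.
Q = (E ** 2).sum(axis=1)
print(np.round(Q[:3], 3))
```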
Leverage / Hotelling’s T²
See the description in the overview section.
Residuals
See the description in the overview section.
Response surface
This plot shows the response surface for a specific response or Y-variable. By default it
shows the response surface for the first Y-variable; the other responses can be accessed
from the toolbar.
This plot can appear in various layouts. The most relevant are:
Contour plot;
Landscape plot.
X- or Y- Variance
One-frame plot where it is possible to display either the Explained X- or Y-Variance with
Calibration and or Validation curves. See the description in the Interpreting PLS plots section
X- and Y- variance
A two-frame plot with the Explained X-Variance plot on the top, and below the Explained Y-
Variance with both Calibration and Validation variances. See the description in the
Interpreting PLS plots section
RMSE
This plot shows the RMSE results for a specific response or Y-variable. By default it shows
the results for the first Y-variable; the other responses can be accessed from the toolbar.
The RMSE is plotted as a function of the number of factors or components in the model.
There is one curve per response (or two if Cal and Val together are selected). The optimal
number of factors (or PCs) can be determined by looking at where the Val curve (i.e. RMSEP)
reaches a minimum.
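Programmatically, with a vector of validation RMSE values per model size, this amounts to locating the minimum. The values below are made up for illustration:

```python
import numpy as np

# Hypothetical validation RMSE (RMSEP) for models with 1..8 factors.
rmsep = np.array([0.95, 0.60, 0.42, 0.40, 0.41, 0.43, 0.46, 0.50])

# The optimal model size is where the validation curve reaches its minimum.
# (A common refinement is to prefer the smallest size within a tolerance
# of that minimum, to guard against overfitting.)
optimal = int(np.argmin(rmsep)) + 1
print(optimal)
```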
Sample outliers
See the description in the Interpreting PLS plots section
X-Y relation outliers
See the description in the Interpreting PLS plots section
Scores and loadings
See the description in the Interpreting PLS plots section
Scores
Line
This is a plot of score values vs. sample number for a specified component. It is usually
better to look at 2-D or 3-D scores plots because they contain more information, but this
plot can be useful whenever the samples are sorted according to the values of an underlying
variable, e.g. time, to detect trends or patterns (see figure below). Also look for systematic
patterns, like a regular increase or decrease, periodicity, etc. (only relevant if the sample
number has a meaning, like time for instance).
Trend in a Scores plot
The smaller the vertical variation (i.e. the closer the score values are to each other), the
more similar the samples are for this particular component. Look for samples that have a
very large positive or negative score value compared to the others: these may be outliers.
2-D scatter
See the description in the Interpreting PLS plots section
3-D scatter
This is a 3-D scatter plot or map of the scores for three specified components from PLS. The
plot gives information about patterns in the samples and is most useful when interpreting
components 1, 2 and 3, since these components summarize most of the variation in the
data. It is usually easier to look at 2-D scores plots but if three components are needed to
describe enough variation in the data, the 3-D plot is a practical alternative. The same
analysis as with a 2-D scatter plot should be done. See the description in the Interpreting PLS
plots section
2 x 2-D Scatter
The visualization window is divided into two frames. The top one shows the scatter plot of
the scores of the samples along factor 1 and factor 2. The bottom plot shows the scatter plot
of the scores along factor 3 and factor 4.
4 x 2-D Scatter
The visualization window is divided into four frames. The top left one shows the scatter plot
of the scores of the samples along factor 1 and factor 2. To its right is displayed the scores
plot in the factor 3-factor 4 plane. The bottom left plot shows the scatter plot of the scores
along factor 5 and factor 6. To its right is displayed the scatter plot of the scores of the
sample for factor 7 and factor 8.
Loadings
Line
Loadings for the X-variables
This is a plot of X-loadings for a specified component vs. variable number. It is useful for
detecting important variables. In many cases it is better to look at two- or three-vector
loadings plots instead because they contain more information. Line plots are most useful for
multichannel measurements, for instance spectra from a spectrophotometer, or in any case
where the variables are implicit functions of an underlying parameter, like wavelength, time,
etc. The plot shows the relationship between the specified component and the different X-
variables. If a variable has a large positive or negative loading, this means that the variable is
important for the component concerned; see the figure below. For example, a sample with a
large score value for this component will have a large positive value for a variable with large
positive loading.
Spectral data can be set to use line plots for the loadings plot by default. To do this,
right-click on the given range in the project navigator and tick the Spectra option.
Line plot of the X-loadings, important variables in a spectra
Variables with large loadings in early components are the ones that vary most. This means
that these variables are responsible for the greatest differences between the samples.
Note: Downweighted variables are displayed in a different color so as to be easily
identified.
Y-variables with large loadings in early components are the ones that are most easily
modeled as a function of the X-variables.
Note: Downweighted variables are displayed in a different color so as to be easily
identified.
2-D scatter
See the description in the Interpreting PLS plots section
3-D scatter
Loadings for the X-variables
This is a three-dimensional scatter plot of X-loadings for three specified components from
PLS. The plot is most useful for interpreting directions, in connection to a 3-D scores plot.
Otherwise it is recommended that one use line or 2-D loadings plots.
Note: Downweighted variables are displayed in a different color so as to be easily
identified.
2 x 2-D scatter
The visualization window is divided into two frames. The top one shows the scatter plot of
the loadings of the variables along factor 1 and factor 2. The bottom plot shows the scatter
plot of loadings of the variables along factor 3 and factor 4.
4 x 2-D scatter
The visualization window is divided into four frames. The top left one shows the scatter plot
of the loadings of the variables along factor 1 and factor 2. To its right is displayed the
loadings plot in the factor 3-factor 4 plane. The bottom left plot shows the scatter plot of loadings of
the variables along factor 5 and factor 6. To its right is displayed the scatter plot of loadings
of the variables for factor 7 and factor 8.
Loadings weights
Line
Loading weights are specific to PLS (they have no equivalent in PCR) and express how the
information in each X-variable relates to the variation in Y summarized by the u-scores. They
are called loading weights because they also express, in the PLS algorithm, how the t-scores
are to be computed from the X-matrix to obtain an orthogonal decomposition. The loading
weights are normalized, so that their lengths can be interpreted as well as their directions.
Variables with large loading weight values are important for the prediction of Y.
Looking at a line plot of the loading weights shows how much the variables participate in
the plotted factor.
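A sketch of the first loading-weight vector from a NIPALS-style step, showing the normalization and the link between large absolute values and important variables. The data are simulated; this illustrates the idea rather than The Unscrambler's implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 8))
y = X[:, 2] + 0.5 * X[:, 5] + 0.05 * rng.normal(size=20)
Xc, yc = X - X.mean(axis=0), y - y.mean()

# First PLS loading-weight vector: the direction in X most covariant with y.
w = Xc.T @ yc
w /= np.linalg.norm(w)   # normalized, so both direction and length are interpretable

# Variables with large |w| are important for the prediction of y;
# here the dominant one should be variable 2 (0-based index).
print(int(np.argmax(np.abs(w))))
```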
2-D scatter
See the description in the Interpreting PLS plots section
3-D scatter
This is a three-dimensional scatter plot of X-loading weights for three specified components
from PLS; this plot may be difficult to interpret, both because it is three-dimensional and
because it does not include the Y-loadings. Thus it is usually recommended that one use the
2-D scatter plot of X-loading weights and Y-loadings instead.
2 x 2-D scatter
The visualization window is divided into two frames. The top one shows the scatter plot of
the loading weights of the variables along factor 1 and factor 2. The bottom plot shows the
scatter plot of loading weights of the variables along factor 3 and factor 4.
4 x 2-D scatter
The visualization window is divided into four frames. The top left one shows the scatter plot
of the loading weights of the variables along factor 1 and factor 2. To its right is displayed
the loading weights plot in the factor 3-factor 4 plane. The bottom left plot shows the scatter plot of
loading weights of the variables along factor 5 and factor 6. To its right is displayed the
scatter plot of loading weights of the variables for factor 7 and factor 8.
Important variables
See the description in the Interpreting PLS plots section
Regression coefficients
The above plot shows the regression coefficients for one particular response variable (Y),
and for a model with a particular number of components (3). Each predictor variable (X)
defines one point of the line (or one bar of the plot). It is recommended to configure the
layout of this plot as bars. Variables 1 and 4 have the highest B coefficients.
Note: The weighted coefficients (Bw) and raw coefficients (B) are identical if no
weights were applied on the variables.
The raw coefficients are those that may be used to write the model equation in original
units: y = B0 + B1·x1 + B2·x2 + … + Bk·xk
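In matrix form the equation is y = B0 + X·B. A minimal sketch applying hypothetical raw coefficients to new samples in original units:

```python
import numpy as np

# Hypothetical raw regression coefficients for a 3-variable model.
B0 = 1.5                               # constant term
B = np.array([0.2, -0.4, 0.1])         # one coefficient per X-variable

# Predicting y for new samples in original units: y = B0 + X @ B.
X_new = np.array([[10.0, 5.0, 2.0],
                  [8.0, 3.0, 1.0]])
y_hat = B0 + X_new @ B
print(y_hat)
```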
Since the predictors are kept in their original scales, the coefficients do not reflect the
relative importance of the X-variables in the model. If no weights have been applied to the
X-variables, the display of the Uncertainty Limits may be informative. It is available if Cross-
Validation and the Uncertainty Test option were selected in the Regression dialog.
Use View Uncertainty Limit from the menu to toggle this indication on or off.
Matrix
The matrix plot is useful when there are several Y-variables. It helps to interpret the B-
coefficients for all responses. The plot below shows the B-coefficients for two responses.
There are seven X-variables corresponding to B1, B2,… B7. B0 is the constant term of the
model; it is not presented in the plot. Variable 2 has a negative impact on the second
response but a positive one on the first response.
Regression coefficients for 2 responses
Residuals
General
Y-residuals vs. Predicted Y
This is a plot of Y-residuals against predicted Y values. If the model adequately predicts
variations in Y, any residual variations should be due to noise only, which means that the
residuals should be randomly distributed. If this is not the case, the model is not completely
satisfactory, and appropriate action should be taken. If strong systematic structure (e.g.
curved patterns) is observed, this can be an indication of lack of fit of the regression model.
The figure below shows a situation that strongly indicates lack of fit of the model. This may
be corrected by transforming the Y variable. This plot can be shown with the studentized
residuals by toggling the icon . The studentized residuals are also an option in many of
the other general Y residuals plots.
Structure in the residuals: a transformation of the y variable is recommended
The presence of an outlier is shown in the example below. The outlying sample has a much
larger residual than the others; however, it does not seem to disturb the model to a large
extent.
A single sample has a large residual
The figure below shows the case of an influential outlier: not only does it have a large
residual, it also attracts the whole model so that the remaining residuals show a very clear
trend. Such samples should usually be excluded from the analysis, unless there is an error in
the data or some data transformation can correct for the phenomenon.
An influential outlier changes the structure of the residuals
Small residuals (compared to the variance of Y) which are randomly distributed indicate
adequate models.
Normal probability Y-residuals
This plot displays the cumulative distribution of the Y-residuals with a special scale, so that
normally distributed values should appear along a straight line. The plot shows all residuals
for one particular Y-variable (look for its name in the axis label). There is one point per
sample. If the model explains the complete structure present in the data, the residuals
should be randomly distributed - and usually, normally distributed as well. So if all the
residuals are along a straight line, it means that the model explains everything that can be
explained in the variations of the variables to be predicted. If most of the residuals are
normally distributed, and one or two stick out, these particular samples are outliers. This is
shown in the figure below. If there are outliers, mark them and check the data.
Outliers are sticking out on Normal Probability Plot of Residuals
If the plot shows a strong deviation from a straight line, the residuals are not normally
distributed, as in the figure below. In some cases - but not always - this can indicate lack of
fit of the model. However it can also be an indication that the error terms are simply not
normally distributed.
The residuals have a regular but non-normal distribution
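The coordinates of a normal probability plot can be computed with standard plotting positions; near-normal residuals then line up along a straight line. A sketch on simulated residuals using only numpy and the standard library:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(6)
residuals = rng.normal(size=50)        # simulated, roughly normal residuals

# Normal probability plot coordinates: sorted residuals vs theoretical
# normal quantiles, using the common (i + 0.5)/n plotting positions.
sorted_res = np.sort(residuals)
n = len(sorted_res)
theo_q = np.array([NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)])

# Near-normal residuals line up: the correlation of the two axes is close to 1.
straightness = np.corrcoef(theo_q, sorted_res)[0, 1]
print(round(straightness, 3))
```

A markedly lower correlation, or isolated points far off the line, corresponds to the non-normal distributions and outliers discussed above.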
Influence plot
See the description in the Interpreting PLS plots section
Samples with small residual variance (or large explained variance) for a particular
component are well explained by the corresponding model, and vice versa. In the above plot,
four samples, such as B3, are not well explained by the model and may be outliers.
Variable residuals
This is a plot of residuals for a specified X-variable and component number for all the
samples. The plot is useful for detecting outlying sample/variable combinations, as shown
below. An outlier can sometimes be modeled by incorporating more components. This
should, however, be avoided since it will reduce the predictive ability of the model.
Line plot of the variable residuals
Whereas the sample residual plot gives information about residuals for all variables for a
particular sample, this plot gives information about all possible samples for a particular
variable. It is therefore more useful when investigating how one specific variable behaves in
all the samples.
Sample residuals
This is a plot of the residuals for a specified sample and component number for all the X-
variables. It is useful for detecting outlying sample or variable combinations. Although
outliers can sometimes be modeled by incorporating more components, this should be
avoided since it will reduce the predictive ability of the model.
Line plot of the sample residuals: one variable is an outlier
In the above plot, variable 1 (Adhesiveness at 1 day) is, for a particular sample, not very
well described by a model with a certain number of components, here 4. If this is the case
for most of the samples, this variable may be noisy and can be considered an outlier.
In contrast to the variable residual plot, which gives information about residuals for all
samples for a particular variable, this plot gives information about all possible variables for a
particular sample. It is therefore useful when studying how a specific sample fits to the
model.
In the above map, two variables are not well described by the model. They should be further
investigated.
Outliers
Influence Plot
See the description in the Interpreting PLS plots section
Y-residuals vs. Predicted Y
See the description in the above section.
Patterns
Normal Probability Y-residuals
See the description in the above section
Y-residuals vs. Score
See the description in the above section
Leverage/Hotelling’s T²
Leverage
Line
See the description in the Interpreting PLS plots section
Matrix
This is a matrix plot of leverages for all samples and all model components. The X-axis
represents the components and the Y-axis the samples. The color represents the Z-value
which is the leverage; the color scale can be customized. It is a useful plot for studying how
the influence of each sample evolves with the number of components in the model. The
leverages can also be displayed as Hotelling’s T² statistics.
Leverage as a matrix plot
Hotelling’s T²
Line
See the description in the Interpreting PLS plots section
Matrix
This is a matrix plot of Hotelling’s T² statistics for all samples and all model components.
It is equivalent to the matrix plot of leverages, to which it has a linear relationship. The
X-axis represents the components and the Y-axis the samples. The color represents the
Z-value, which is the Hotelling’s T² statistic for a specific PC and sample; the color scale
can be customized.
Hotelling’s T² as a matrix plot
Response Surface
See the description in the Interpreting PLS plots section
16.6. Bibliography
B. S. Dayal and J. F. MacGregor, Improved PLS Algorithms, J. Chemom., 11, 73-85 (1997).
F. Lindgren, P. Geladi, S. Wold, The kernel algorithm for PLS, J. Chemom., 7, 45-59 (1993).
S. Rannar, F. Lindgren, P. Geladi and S. Wold, A PLS kernel algorithm for data sets with many
variables and fewer objects, Part 1: Theory and Algorithm, J. Chemom., 8, 111-125 (1994).
17. LPLS
17.1. L-PLS regression
Traditionally, science demanded that a one-to-one relationship between a cause and effect
existed; however, this tradition can hinder the study of more complex systems. Such systems
may be characterized by many-to-many relationships, which are often hidden in large tables
of data.
In the sections on bilinear modeling such as PLS the data are arranged in such a way that the
information obtained on a dependent variable Y is related to some independent measures X.
In some cases, the Y data may have descriptors of its columns, organized in a third table Z
(containing the same number of columns as in Y).
The three matrices X, Y and Z can together be visualized in the form of an L-shaped
arrangement. Such data analysis has potential widespread use in areas such as consumer
preference studies, medical diagnosis and spectroscopic applications.
Theory
Usage
Plot Interpretation
Method reference
Basics
The L-PLS model
L-PLS by example
17.2.1 Basics
Traditionally, science demanded that a one-to-one relationship between a cause and effect
existed; however, this tradition can hinder the study of more complex systems. Such systems
may be characterized by many-to-many relationships, which are often hidden in large tables
of data.
In the sections on bilinear modeling such as PLS regression the data are arranged in such a
way that the information obtained on a dependent variable Y is related to some
independent measures X. In some cases, the Y data may have descriptors of its columns,
organized in a third table Z (containing the same number of columns as in Y).
The three matrices X, Y and Z can together be visualized in the form of an L-shaped
arrangement. Such data analysis has potential widespread use in areas such as consumer
preference studies, medical diagnosis and spectroscopic applications.
The Y-matrix is thus modeled as a function of both X and Z. The correlation loadings may also
be computed similarly to what is done for PCA, PCR and PLS.
The strength in interpretation from the L-PLS regression is that the correlation loadings plot
shows the relationship between the variables for all the three matrices. In addition, the
scores for the objects are also included. This enables direct interpretation of the rows in Z
and columns in X, two matrices that share no common dimension.
It is common to weight the variables of X and Z to unit variance. Y is then by default double
centered, or double centered and scaled; again, this depends on the properties of the
variables.
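Double centering, the default Y pretreatment mentioned above, can be sketched in a few lines (a minimal illustration; the function name is ours, not the product’s):

```python
import numpy as np

# Minimal sketch of double centering: subtract row means and column means,
# then add back the grand mean so both margins become (numerically) zero.
def double_center(Y):
    return (Y - Y.mean(axis=1, keepdims=True)
              - Y.mean(axis=0, keepdims=True) + Y.mean())

Y = np.array([[5.0, 3.0],
              [1.0, 7.0]])
Ydc = double_center(Y)   # every row sum and column sum is now ~0
```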
As can be seen, the combined plot of correlation loadings and samples gives a view of the
relation between the variables and samples.
For example, the sensory Attribute 4 is anti-correlated with the background Z-variables
Married and Age. Age and Married are correlated, which shows that the older a person is,
the more likely that person is to be married, and also to dislike Attribute 4 and thus product
E.
Gender is close to the center, which indicates that it does not play a role in the preference of
products.
People practicing sport regularly may prefer product D, which is characterized by Attribute
3.
For algorithm details please refer to the method reference.
In the Model Inputs tab, first select an X-matrix to be analyzed from the X-matrix drop-down
list. If new data ranges need to be defined, choose New or Edit from the drop-down list next
to Rows and/or Cols. This will open the Define Range editor where new ranges can be
defined.
Next select a Y-matrix to be analyzed from the Y-matrix drop-down list. The Y-variables may
be defined as a row and column set within the X-matrix selected for analysis, or may be a
separate matrix of Y-variables available from the project navigator.
Finally select a Z-matrix to be analyzed from the Z-matrix drop-down list. The Z-variables
may be defined as a row set within the Y-matrix or as a separate matrix of Z-variables
available from the project navigator.
Once the data to be used in modeling are defined, choose a starting number of Components
(latent variables, factors) to calculate, in the Maximum components spin box.
The Mean Center check box allows a user to subtract the column means from every variable
before analysis.
Some important tips and warnings associated with the Model Inputs tab
L-PLS puts some constraints on the three input matrices in order to complete the
calculation. The following explains the warnings given when certain analysis criteria are not
met.
First, matrix shapes must match:
Constraint between X and Y not fulfilled
Solution: Make sure that the number of rows in the selected X and Y matrices matches.
Constraint between Y and Z not fulfilled
Solution: Make sure that the number of columns in the selected Y and Z matrices matches.
To understand this better, see the theory section for a diagram that illustrates how to
organize data for L-PLS analysis.
Too many excluded samples/variables
Solution: Check that all samples/variables have not been excluded in a data set.
To keep track of row and column exclusions, the model inputs tab provides a warning to
users that exclusions have been defined. See automatic keep outs for more details.
17.3.2 X weights
If it is necessary to weight the variables to make realistic comparisons among them
(particularly useful for process and sensory data), click on the X-, Y- and Z-Weights tabs
and the following dialog box will appear.
X Weights Dialog
Individual X-variables can be selected from the variable list table provided in this dialog by
holding down the control (Ctrl) key and selecting variables. Alternatively, the variable
numbers can be entered manually into the text box, the Select button can be used (which
opens the Define Range dialog box), or clicking All will select every variable in the table.
Once the variables have been selected, to weight them, use the options in the Change
Selected Variable(s) dialog box, under the Select tab. The options include:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows selected variables to be weighted by predefined constant values.
Downweight
This allows for the multiplication of selected variables by a very small number, such
that the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
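The weighting options above can be sketched numerically. The formulas are our reading of the text (e.g. 1/sqrt(K) per variable for block weighting is one common convention), not the exact internal implementation:

```python
import numpy as np

# Sketch of the weighting options described above (assumed formulas).
def sdev_weights(X, A=1.0, B=0.0):
    """A/(SDev + B); the defaults A=1, B=0 give classical 1/SDev scaling."""
    return A / (X.std(axis=0, ddof=1) + B)

def block_weights(n_vars_in_block, divide_by_sdev=False, X=None):
    """Equal weight 1/sqrt(K) for a block of K variables (one common
    convention), optionally combined with 1/SDev weighting."""
    w = np.full(n_vars_in_block, 1.0 / np.sqrt(n_vars_in_block))
    if divide_by_sdev:
        w = w / X.std(axis=0, ddof=1)
    return w

DOWNWEIGHT = 1e-6   # multiply by a tiny constant: the variable no longer
                    # influences the model but still shows up in the plots

X = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]])
w = sdev_weights(X)   # weighted variables (X * w) then have unit variance
```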
Advanced tab
Use the Advanced tab in the X-, Y-, and Z-Weights dialog to apply predetermined
weights to each variable. To use this option, set up a row in the data set containing
the weights (or create a separate row matrix in the project navigator). Select the
Advanced tab in the Weights dialog and select the matrix containing the weights
from the drop-down list. Use the Rows option to define the row containing the
weights and click on Update to apply them. The dialog box for the Advanced option
is provided below.
L-PLS Advanced Weights Option
Once the weighting and variables have been selected, click Update to apply them.
17.3.3 Y weights
The weights in the Y weights tab should be handled in the same way as the weights in the X
weights tab.
17.3.4 Z weights
The weights in the Z weights tab should be handled in the same way as the weights in the X
weights tab.
When all the settings are specified click OK.
Cross validation is not currently implemented for L-PLS regression in The Unscrambler®. It
is suggested to first model (X,Y) and (Y,Z) separately with PLS regression, to evaluate the
quality of the data and the validated variance.
See the method reference for details.
Explained X-variance
This plot gives an indication of how much of the variation in the X data is described by the
different components.
Total residual X-variance is computed as the sum of squares of the residuals for all the X-
variables, divided by the number of degrees of freedom.
Total explained X-variance is then computed as the percentage of the total variance
accounted for by the model, i.e. 100% × (1 − total residual variance / total variance).
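Under the assumption that explained variance is quoted as a percentage of the total variance, a minimal sketch is:

```python
import numpy as np

# Sketch: explained X-variance after k components, assuming the ratio of
# residual to total sum of squares (any common degrees-of-freedom divisor
# cancels when both terms use it).
def explained_x_variance(X, X_hat):
    ss_res = ((X - X_hat) ** 2).sum()
    ss_tot = ((X - X.mean(axis=0)) ** 2).sum()
    return 100.0 * (1.0 - ss_res / ss_tot)

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])
ev_full = explained_x_variance(X, X)                               # 100.0
ev_none = explained_x_variance(X, np.tile(X.mean(axis=0), (3, 1))) # 0.0
```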
Explained Y-variance
This plot gives an indication of how much of the variation in the Y data is described by the
different components.
Explained Y-variance
Explained Z-variance
This plot gives an indication of how much of the variation in the Z data is described by the
different components.
Explained Z-variance
Explained variance
This plot gives an indication of how much of the variation in the three data tables: X, Y, Z is
described by the different components.
Explained all variances
Scores
This is a two-dimensional scatter plot (or map) of scores for two specified components
(factors or PCs). The plot gives information about patterns in the samples. The scores plot
for (PC1,PC2) is especially useful, since these two components summarize more variation in
the data than any other pair of components.
Scores plot
The closer the samples are in the scores plot, the more similar they are with respect to the
two components concerned. Conversely, samples far away from each other are different
from each other.
The plot can be used to interpret differences and similarities among samples. Look at the
scores plot together with the corresponding loadings plot, for the same two components.
This can help in determining which variables are responsible for differences between
samples. For example, samples to the right of the scores plot will usually have a large value
for variables to the right of the loadings plot, and a small value for variables to the left of the
loadings plot.
Here are some things to look for in the 2-D scores plot.
Finding groups in a scores plot
Is there any indication of clustering in the set of samples? The figure below shows a
situation with three distinct clusters. Samples within a cluster are similar.
Detecting grouping in a scores plot
X Correlation Loadings
A two-dimensional scatter plot of X correlation loadings for two specified components is a
good way to detect important variables. The importance of individual variables is
visualized more clearly in the correlation loadings plot than in the standard loadings
plot. The plot is most useful for interpreting component 1 vs. component 2, since they
represent the largest variations in the data.
The plot shows the importance of the different variables for the two components specified.
It should preferably be used together with the corresponding scores plot. Variables with X
correlation loadings to the right in the correlation loadings plot will be X-variables which
usually have high values for samples to the right in the scores plot, etc.
Variables close to each other in the loadings plot will have a high positive correlation if the
two components explain a large portion of the variance of X. The same is true for variables in
the same quadrant lying close to a straight line through the origin. Variables in diagonally
opposed quadrants will have a tendency to be negatively correlated. Variables Red and
Firm,Instr have independent variations. Variables Red and Acids/Sugars are negatively
correlated.
X Correlation Loadings of 10 sensory variables along (PC1,PC2)
Note: Variables lying close to the center are poorly explained by the plotted factors
(or PCs). Do not interpret them in that plot!
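Correlation loadings can be sketched as the plain correlation coefficient between each original variable and each score vector. This follows the standard definition used with PCA/PLS, not the product’s source code:

```python
import numpy as np

# Sketch: correlation loadings = correlation between each X-variable (column
# of X) and each score vector (column of T); entries always lie in [-1, 1].
def correlation_loadings(X, T):
    Xc = X - X.mean(axis=0)
    Tc = T - T.mean(axis=0)
    return (Xc.T @ Tc) / np.outer(np.linalg.norm(Xc, axis=0),
                                  np.linalg.norm(Tc, axis=0))

X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
T = X[:, [0]]                   # pretend the score vector is the first variable
R = correlation_loadings(X, T)  # R[0, 0] is exactly 1.0
```

Variables near the unit circle are well explained by the plotted components; variables near the center are not, which is why the note above warns against interpreting them.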
Y Correlation Loadings
Variables close to each other in the correlation loadings plot will have a high positive
correlation if the two components explain a large portion of the variance of Y. The same is
true for variables in the same quadrant lying close to a straight line through the origin.
Variables in diagonally opposed quadrants will have a tendency to be negatively correlated.
In this example the Y-variables are the individuals, so two individuals close together will have
similar behavior and will like the samples in their quadrant.
Y Correlation Loadings
Z Correlation Loadings
Variables close to each other in the correlation loadings plot will have a high positive
correlation if the two components explain a large portion of the variance of Z. The same is
true for variables in the same quadrant lying close to a straight line through the origin.
Variables in diagonally opposed quadrants will have a tendency to be negatively correlated.
In this example the Z-variables are the background information on the individuals, more
specifically what they say they like in general.
Z Correlation Loadings
Correlation
This plot shows the correlation loadings of the three data tables: X, Y and Z. Interpretation
between tables can be done in this plot.
For example, individuals who say they like Gala apples also like apples with high sweetness
and sugar content; individuals 17 and 25 are examples.
All correlation loadings
17.6. Bibliography
H. Martens, E. Anderssen, A. Flatberg, L. H. Gidskehaug, M. Hoy, F. Westad, A. Thybo and M.
Martens, Regression of a data matrix on descriptors of both its rows and of its columns via
latent variables: L-PLSR, Computational Statistics & Data Analysis, 48, 103-123 (2005).
18. Support Vector Machine Regression
Theory
Usage: Create model
Results
Usage: Prediction
Result interpretation
Method reference
As in all methods that can be described as statistical learning methods there is a balance
between achieving a small training error and the complexity of the model. The parsimony
principle strives to find the simplest model with an acceptable error; not only in the training
stage but more importantly for prediction. This is one reason that the SVMR implementation
in The Unscrambler® includes an option for cross validation. See the section on dialog usage
for details.
One of the most important ideas in Support Vector Machine classification and regression is
that representing the solution by means of a small subset of training points gives a sparse
model, which is attractive both computationally and for interpretation.
Parameter epsilon controls the width of the epsilon-insensitive zone used to fit the training
data. The value of epsilon can affect the number of support vectors used to construct the
regression function: the bigger epsilon is, the fewer support vectors are selected (cf. the
illustration above). Hence, both the C and epsilon values affect model complexity, but in
different ways.
When using nu-SVR, the nu value must be defined (default value = 0.5). Nu serves as an
upper bound on the fraction of training errors and a lower bound on the fraction of
support vectors.
SVMR also has a parameter C that determines the trade-off between model complexity
(flatness) and the degree to which deviations larger than epsilon are tolerated in the
optimization formulation. For example, if C is too large (approaching infinity), the objective
is to minimize the empirical risk only, without regard to the model complexity part of the
optimization formulation.
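The roles of epsilon and C can be made concrete with the epsilon-insensitive loss that underlies epsilon-SVR. This is a conceptual sketch of the standard loss, not The Unscrambler’s implementation:

```python
import numpy as np

# Conceptual sketch: residuals inside the epsilon tube cost nothing; larger
# deviations are penalized linearly, and C scales the error term in the
# (schematic) objective  C * sum(loss) + flatness penalty.
def eps_insensitive_loss(y, y_hat, epsilon):
    return np.maximum(0.0, np.abs(np.asarray(y) - np.asarray(y_hat)) - epsilon)

loss = eps_insensitive_loss([1.0, 2.0, 3.0], [1.05, 2.5, 3.0], epsilon=0.1)
# residual 0.05 falls inside the tube (zero loss); 0.5 is penalized by 0.4
```

Widening the tube (larger epsilon) zeroes out more residuals, which is why fewer points end up as support vectors.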
The kernel type to be used can be chosen from the following four options:
Linear
Polynomial
Radial basis function
Sigmoid
The linear function is set as the default kernel because it is the simplest one and is not very
susceptible to overfitting. If the number of variables is very large, the data do not need to be
mapped to a higher-dimensional space, and the linear kernel function is preferred. The radial
basis function is also a simple function and can model systems of varying complexity; it is an
extension of the linear kernel. If a polynomial kernel is chosen, the order of the polynomial
must also be given. The best value for C is often not known a priori.
Through a grid search, applying cross validation to reduce the chance of overfitting, one can
identify an optimal value of C so that unknowns can be properly predicted using the SVM
model.
Through cross validation, one can identify an optimal value of C from the RMSECV, as
displayed in the grid search dialog.
Support vectors
Parameters
Probabilities
Prediction
Diagnostics
The main result for SVM Regression is the matrix of predicted values which may be
compared to the reference values in a predicted versus reference plot as for any type of
regression.
The RMSEC (root mean square error of calibration) and RMSECV (from cross validation)
are given in the statistics box in the Predicted versus Reference plot. Note that in the current
version the cross-validated predictions are not shown; the difference between calibration
and validation is expressed by RMSEC and RMSECV. The support vectors can be visualized in
this plot by clicking the “SV” icon on the Mark toolbar above the plot. As for modeling in
general, the RMSECV should preferably be close to the RMSEC, which indicates that the
model has not been overfitted.
The Diagnostics subnode holds RMSEC, RMSECV and the corresponding correlations
between predicted and reference.
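Both figures of merit are plain root-mean-square errors; the definition below is our assumption of the standard formula, applied to calibration fits for RMSEC and cross-validated predictions for RMSECV:

```python
import numpy as np

# Sketch of the figures of merit quoted in the statistics box.
def rmse(y_ref, y_pred):
    y_ref, y_pred = np.asarray(y_ref, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_ref - y_pred) ** 2)))

# rmse(reference, fitted_values)   -> RMSEC
# rmse(reference, cv_predictions)  -> RMSECV; a large gap flags overfitting
```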
Some important tips and warnings associated with the Model Inputs tab
SVM Regression can be used as both a univariate and multivariate regression analysis
technique. In The Unscrambler® it requires a minimum of three samples (rows) and one
variable (column) to be present in a data set in order to complete the calculation. The
following warnings are given when certain analysis criteria are not met.
Not enough samples present
Solution: Check that the data table (or selected row set) contains a minimum of 3 samples or
2 variables.
Missing values in X or Y
Solution: Ensure that X and Y have no missing values. If required, use the Fill Missing
function to impute values for X (use caution!). For missing Y-values, it is suggested to keep
these rows out of the calculation.
Number of X rows does not match number of Y rows
Solution: Ensure that the row set dimensions of X match the row set dimensions of Y.
Non-numerical values in Response (Y)
Solution: Ensure that all values in the response (Y) column(s) are numerical.
To keep track of row and column exclusions, the model inputs tab provides a warning to
users that exclusions have been defined. See automatic keep outs for more details.
18.3.2 Options
This tab provides options for choosing the SVM type of regression to use, either epsilon-SVR
or nu-SVR, from the drop-down list next to SVM type. The kernel type used to
determine the hyperplane that best models the data can be selected from the drop-down
list. The default setting of Radial basis function is a simple kernel that can model complex data.
Support Vector Machine Options
Linear
Polynomial
Radial basis function
Sigmoid
For a polynomial kernel type, the degree of the polynomial should be defined.
The epsilon-SVR has an input parameter named epsilon, which defines the width of the
epsilon-insensitive zone used to fit the training data. Epsilon must be greater
than 0.
The nu-SVR has the parameter nu, which lies in the range 0-1 and serves as an upper bound
on the fraction of training errors and a lower bound on the fraction of support vectors.
Support Vector Machine Options for epsilon-SVR
In the Options tab the Grid Search button is available. Clicking on the Grid
Search button will open a dialog for grid search.
The dialog asks for input for the parameters Gamma and C in the case of epsilon-SVMR, and
Gamma and Nu in the case of nu-SVMR. It has been reported in the literature that an
exponentially growing sequence of the parameters is good as a first coarse grid search. This
is why the inputs are given on the log scale; in the grid table, however, the actual
values are given. It is recommended to use cross validation in grid search to avoid overfitting
when many combinations of the parameters are tried. After an initial grid search, the search
may be refined with smaller ranges for the parameters once the best range has been found.
Click on the Start button for the calculations to commence. Note that it is possible to click on
Stop during the computations, so that if the results become worse for higher parameter
values one may stop to save time. The default is to start with five levels of each
parameter. Click on one (the “best”) value in the grid after completion to see detailed
results. The SVs column lists how many support vectors were selected; this depends on the
epsilon or nu value and should be considered relative to the number of samples in the data.
Click on Use setting to return to the previous dialog and run the SVMR again with these
parameter settings. Notice that since the cross validation segments are selected randomly,
the RMSE and the R-square from validation may differ in the second run; this is a function of
the distribution of the samples.
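The exponentially growing coarse grid described above can be sketched as follows; the particular ranges and the five-level default are illustrative, not the dialog’s fixed values:

```python
import numpy as np

# Sketch of a coarse log-scale grid for (gamma, C) in epsilon-SVMR.
log2_gamma = np.linspace(-15.0, 3.0, 5)   # grid inputs given on the log2 scale
log2_C     = np.linspace(-5.0, 15.0, 5)
grid = [(2.0 ** g, 2.0 ** c) for g in log2_gamma for c in log2_C]
# 25 (gamma, C) cells; score each with cross-validated RMSE, pick the best
# cell, then repeat with a narrower grid around it.
```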
18.3.4 Weights
If the analysis calls for variables to be weighted to make realistic comparisons among them
(particularly useful for process and sensory data), click on the Weights tab and the
following dialog box will appear.
Support Vector Machine Weights
Another feature of the Advanced tab is the ability to use the results matrix of another
analysis as weights, using the Select Results Matrix button. This option provides
an internal project navigator for selecting the appropriate results matrix to use as a weight.
The dialog box for the Advanced option is provided below.
SVM Advanced Weights Option
Once the weighting and variables have been selected, click Update to apply them.
18.3.5 Validation
Validation is an important part of any method applied in modeling data. Settings for the
Validation of the SVR are set under the Validation tab as shown below. First select to cross
validate the model by checking the check box. The number of segments to use can be
chosen in the segments entry. Cross validation is helpful in model development but should
not be a replacement for full model validation using a test set. In the case of SVR, test set
validation is performed using the options under Tasks - Predict - SVR Prediction.
Support Vector Machine Validation
Autopretreatment may be used with SVR. This allows a user to automatically apply the
transforms used during the calibration phase of the SVR model to new samples during the
prediction phase.
Support Vector Machine Autopretreatment
When all of the parameters have been defined, the SVR is run by clicking OK. A new node,
SVR, is added to the project navigator with a folder for Data, and another for Results.
More details regarding Support Vector Machine classification are given in the section SVM
Classify or in the link given under License.
The SVM prediction results are given in a new matrix in the project navigator named
Predicted_Range. The matrix holds the predicted value for each sample.
Support vectors
Parameters
Probabilities
Prediction
Diagnostics
When an SVM Regression model is created a new node is added in the project navigator
with a folder for the data used in the model, and the results folder.
The results folder has the following matrices:
SVMR node
18.5.2 Parameters
The parameters matrix carries information on the following parameters for all the identified
classes:
SVM type
Kernel type - as defined in the options for the SVM learning step
Degree - as defined in the options for the SVM learning step
Gamma - the kernel gamma value set in the options
Offset
Classes - Relevant for SVM Classification only
SV Count - the number of support vectors needed for the regression model of the
data
Labels - Relevant for SVM Classification only
Numbers - Relevant for SVM Classification only
Parameters matrix
18.5.3 Probabilities
The probabilities matrix has three rows: the Rho value, and the probabilities A and B.
Probabilities matrix
18.5.4 Diagnostics
Diagnostics matrix
The Diagnostics subnode holds RMSEC, RMSECV and the corresponding correlations
between predicted and reference. Ideally the validation figures of merit should be close to
those from calibration; if not, it indicates that the model was overfitted in the calibration stage.
18.5.5 Prediction
The prediction matrix contains the predicted value for each sample in the training set.
Prediction
18.5.7 Predicted values after applying the SVM model on new samples
After an SVM model has been applied to predict new samples from a data matrix in the
project (Tasks - Predict - SVR Prediction), a new matrix with the predicted values is added to
the project navigator. The name given by default is Predicted.
Predicted
18.7. Bibliography
Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, A Practical Guide to Support Vector
Classification, last updated: May 19, 2009, accessed August 27, 2009.
http://www.csie.ntu.edu.tw/~cjlin
T. Czekaj, W. Wu and B. Walczak, About kernel latent variable approaches and SVM, J.
Chemom., 19, 341-354 (2005).
J. A. Fernandez Pierna, V. Baeten, A. Michotte Renier, R. P. Cogdill and P. Dardenne,
Combination of support vector machines (SVM) and near-infrared (NIR) imaging
spectroscopy for the detection of meat and bone meal (MBM) in compound feeds, J.
Chemom., 18, 341-349 (2004).
A. I. Belousov, S. A. Verzakov and J. von Frese, Applicational aspects of support vector
machines, J. Chemom., 16, 482-489 (2002).
19. Multivariate Curve Resolution
19.1. Multivariate Curve Resolution (MCR)
MCR methods may be defined as a group of techniques which aim to recover the
concentration profiles (pH profiles, time/kinetic profiles, elution profiles, chemical
composition changes, …) and response profiles (spectra, voltammograms, …) of the
components in an unresolved mixture, using a minimal number of assumptions about the
nature and composition of these mixtures. MCR methods can be easily extended to the
analysis of many types of experimental data, including multiway data.
Theory
Usage
Plot Interpretation
Method reference
MCR basics
What is MCR?
Data suitable for MCR
Purposes of MCR
Limitations of PCA
The Alternative: Curve resolution
Ambiguities and constraints in MCR
Rotational and intensity ambiguities in MCR
Constraints in MCR
What is a constraint?
When to apply a constraint?
Constraint types in MCR
Non-negativity
Unimodality
Closure
Other constraints
MCR and 3-D data
Algorithm implemented in The Unscrambler®: Alternating Least Squares (MCR-ALS)
Initial estimates for MCR-ALS
Computational parameters of MCR
Constraint settings are known beforehand
How to tune sensitivity to pure components?
When to tune sensitivity up or down?
Main results of MCR
Residuals
Estimated concentrations
Estimated spectra
Practical use of estimated concentrations and spectra
The matrix X of raw data (spectra) is decomposed into two matrices: the concentrations, C,
and the sources, S, so that X ≈ C Sᵀ plus residuals.
The sizes must be compatible: I represents the number of samples, N the number of
sources and J the number of spectral/signal variables, so that X is I×J, C is I×N and S is J×N.
This can also be explained by an example. The spectra of some samples are decomposed
into concentrations and sources or single component spectra.
MCR principles: Example
Spectra
Decomposition
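The decomposition illustrated above can be sketched with made-up dimensions (the numbers and random profiles are purely illustrative):

```python
import numpy as np

# Sketch of the bilinear MCR model: X (I x J) mixture spectra are the product
# of C (I x N) concentration profiles and the transpose of S (J x N) spectra.
I, N, J = 6, 2, 50
rng = np.random.default_rng(1)
C = rng.uniform(0.0, 1.0, size=(I, N))   # concentration profiles
S = rng.uniform(0.0, 1.0, size=(J, N))   # pure-component ("source") spectra
X = C @ S.T                              # noise-free mixture data, shape (I, J)
```

Because X is built from N = 2 sources, its rank is 2; MCR aims to recover C and S from X alone.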
Purposes of MCR
Multivariate Curve Resolution has been shown to be a powerful tool to describe
multicomponent mixture systems through a bilinear model of pure component
contributions. MCR, like PCA, assumes the fulfillment of a bilinear model, i.e. X = C Sᵀ + E.
Bilinear model
Comparison of constraints
Constraint     PCA          MCR
T              Orthogonal   T = C
Resolution     No           Yes
Limitations of PCA
PCA produces an orthogonal bilinear matrix decomposition, where components or factors
are obtained in a sequential way explaining maximum variance. Using these constraints plus
normalization during the bilinear matrix decomposition, PCA produces unique solutions.
These ‘abstract’ unique and orthogonal (independent) solutions are very helpful in deducing
the number of different sources of variation present in the data and, eventually, they allow
for their identification and interpretation. However, these solutions are ‘abstract’ solutions
in the sense that they are not the ‘true’ underlying factors causing the data variation, but
orthogonal linear combinations of them.
The intensity ambiguity can be written as X = C Sᵀ = (C K⁻¹)(S K)ᵀ, where K is a diagonal
matrix whose entries ki are scalars and n refers to the number of components. Each
concentration profile of the new C’ = C K⁻¹ matrix would have the same shape as the real
one, but be ki times smaller, whereas the related spectra of the new S’ = S K matrix would be
equal in shape to the real spectra, though ki times more intense.
Constraints in MCR
Although resolution does not require previous information about the chemical system under
study, additional knowledge, when it exists, can be used to tailor the sought pure profiles
according to certain known features and, as a consequence, to minimize the ambiguity in
the data decomposition and in the results obtained.
The introduction of this information is carried out through the implementation of
constraints.
What is a constraint?
A constraint can be defined as any mathematical or chemical property systematically fulfilled
by the whole system or by some of its pure contributions. Constraints are translated into
mathematical language and force the iterative optimization to model the profiles respecting
the conditions desired.
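As an illustration of how a constraint enters the optimization, here is a bare-bones alternating least squares loop with non-negativity imposed by clipping. This is a didactic sketch only; the MCR-ALS implementation in the product is more elaborate:

```python
import numpy as np

# Didactic MCR-ALS sketch: alternate least-squares updates of C and S,
# forcing the non-negativity constraint after each update by clipping at zero.
def mcr_als(X, S0, n_iter=50):
    S = S0.copy()
    for _ in range(n_iter):
        C = np.clip(X @ np.linalg.pinv(S.T), 0.0, None)    # X ~ C S.T, solve for C
        S = np.clip((np.linalg.pinv(C) @ X).T, 0.0, None)  # then solve for S
    return C, S
```

Each clip is where the constraint "forces the iterative optimization to model the profiles respecting the conditions desired"; other constraints would replace or accompany the clipping step.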
Unimodality
The unimodality constraint allows the presence of only one maximum per profile.
This condition is fulfilled by many peak-shaped concentration profiles, like chromatograms,
by some types of reaction profiles and by some instrumental signals, like certain
voltammetric responses.
It is important to note that this constraint applies not only to peaks, but also to profiles that
have a constant maximum (plateau) or a monotonically decreasing tendency. This is the case
for many monotonic reaction profiles that show only the decay or the emergence of a
compound, such as the most protonated and deprotonated species in an acid-base titration,
respectively.
Closure
The closure constraint is applied to closed reaction systems, where the principle of mass
balance is fulfilled. With this constraint, the sum of the concentrations of all the species
involved in the reaction (the suitable elements in each row of the C matrix) is forced to be
equal to a constant value (the total concentration) at each stage in the reaction. The closure
constraint is an example of equality constraint.
In practice, the closure constraint in MCR forces the sum of the concentrations of all the
mixture components to be equal to a constant value (the total concentration) across all
samples included in the model.
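A simple proportional implementation of this equality constraint can be sketched as follows (one common way to impose closure; the function name is ours):

```python
import numpy as np

# Sketch of the closure (mass-balance) constraint: rescale each row of C
# (one sample / reaction stage) to sum to the known total concentration.
def apply_closure(C, total=1.0):
    return total * C / C.sum(axis=1, keepdims=True)

C = np.array([[0.2, 0.6],
              [1.0, 3.0]])
C_closed = apply_closure(C, total=2.0)   # each row now sums to 2.0
```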
Other constraints
Apart from the three constraints previously defined, other types of constraints can be
applied. See literature on curve resolution for more information about them.
Local rank constraints
Particularly important for the correct resolution of two-way data systems are the so-called
local rank constraints, selectivity and zero-concentration windows. These
types of constraints are associated with the concept of local rank, which describes
how the number and distribution of components vary locally along the data set.
The key constraint within this family is selectivity. Selectivity constraints can be used
in concentration and spectral windows where only one component is present to
completely suppress the ambiguity linked to the complementary profile.
Multivariate Curve Resolution
Residuals are error measures; they tell how much variation remains in the data after
k components have been estimated;
Estimated concentrations describe the estimated pure components’ profiles across
all the samples included in the model;
Estimated spectra describe the instrumental properties (e.g. spectra) of the
estimated pure components.
Residuals
The residuals are a measure of the fit (or rather, lack of fit) of the model. The smaller the
residuals, the better the fit. MCR residuals can be studied from three different points of
view.
Variable Residuals
is a measure of the variation remaining in each variable after k components have
been estimated. In The Unscrambler®, the variable residuals are plotted as a line
plot where each variable is represented by one value: its residual in the k-
component model.
Sample Residuals
is a measure of the distance between each sample and its model approximation. In
The Unscrambler®, the sample residuals are plotted as a line plot where each
sample is represented by one value: its residual after k components have been
estimated.
Total Residuals
these results express how much variation in the data remains to be explained after k
components have been estimated. Their role in the interpretation of MCR results is
similar to that of variances in PCA. They are plotted as a line plot showing the total
residual after a varying number of components (from 2 to n+1).
The three types of MCR residuals are available for MCR Fitting: these are the actual values of
the residuals after the data have been resolved to k pure components.
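Assuming the usual bilinear MCR convention D ≈ C Sᵀ (samples × variables; the notation is an assumption, not taken from this manual), the three residual views can be sketched as sums of squared entries of the residual matrix:

```python
import numpy as np

# Hypothetical resolved model: D (samples x variables) approximated by C @ S.T,
# with k = 2 estimated pure components.
rng = np.random.default_rng(0)
D = rng.random((6, 10))
C = rng.random((6, 2))   # estimated concentrations
S = rng.random((10, 2))  # estimated spectra

E = D - C @ S.T                            # residual matrix after k components
variable_residuals = (E ** 2).sum(axis=0)  # one value per variable
sample_residuals = (E ** 2).sum(axis=1)    # one value per sample
total_residual = (E ** 2).sum()            # single global measure of misfit

# The total residual is the sum over either breakdown:
assert np.isclose(total_residual, variable_residuals.sum())
assert np.isclose(total_residual, sample_residuals.sum())
```

The squared-sum convention here is one common choice; the software may scale or normalize the reported values differently.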
Estimated concentrations
The estimated concentrations show the profile of each estimated pure component across
the samples included in the MCR model. In The Unscrambler®, the estimated concentrations
are plotted as a line plot where the abscissa shows the samples, and each of the k pure
components is represented by one curve.
The k estimated concentration profiles can be interpreted as k new variables showing how
much each of the original samples contains of each estimated pure component.
Estimated spectra
The estimated spectra show the estimated instrumental profile (e.g. spectrum) of each pure
component across the X-variables included in the analysis. In The Unscrambler®, the
estimated spectra are plotted as a line plot where the abscissa shows the X-variables, and
each of the k pure components is represented by one curve. The k estimated spectra can be
interpreted as the spectra of k new samples consisting each of the pure components
estimated by the model. Comparison of the spectra of the original samples to the estimated
spectra may be useful for finding out which of the actual samples are closest to the pure
components.
Note: Estimated spectra are unit-vector normalized.
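Such a comparison can be sketched with cosine similarity, which is a natural choice here since the estimated spectra are unit-vector normalized. The function name and similarity measure below are illustrative, not Unscrambler features:

```python
import numpy as np

def closest_pure_component(X, S):
    """For each sample spectrum (row of X), return the index of the most
    similar estimated pure spectrum (row of S) by cosine similarity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)  # unit-vector normalized
    return np.argmax(Xn @ Sn.T, axis=1)

X = np.array([[1.0, 0.1, 0.0],    # sample resembling pure component 0
              [0.0, 0.2, 1.0]])   # sample resembling pure component 1
S = np.array([[0.9, 0.2, 0.1],
              [0.1, 0.1, 0.95]])
print(closest_pure_component(X, S))  # -> [0 1]
```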
The sections that follow explain what can be done to improve the quality of a model. It may
take several iterations before obtaining a satisfying model.
Once the model is found satisfactory, interpretation of the MCR results in regards to
information on the system under study (e.g. chemical reaction mechanism or process) is the
next step. The last section hereafter will show how to do it.
The main tool for diagnosing noisy variables in MCR is the plot of variable
residuals, accessed with the menu option Plot - Variable Residuals, or by selecting this
plot from MCR - Plots in the project navigator.
Any variable that sticks out on the plots of variable residuals (either with MCR fitting or PCA
fitting) may be disturbing the model, thus reducing the quality of the resolution; try
recalculating the MCR model without that variable.
One can utilize estimated concentration profiles and other experimental information
to analyze a chemical/ biochemical reaction mechanism.
One can utilize estimated spectral profiles to study the mixture composition or even
intermediates during a chemical/biochemical process.
Note: What follows is not a tutorial. See the Tutorials chapter for more examples
and hands-on training.
Model Inputs
Options
Some important tips and warnings associated with the Model Inputs tab
The only requirements for MCR are actual numeric input data and at least
four samples and four variables. A warning is given when this is not the case:
Too many excluded samples/variables
Solution: Check that all samples/variables have not been excluded in a data set.
To keep track of row and column exclusions, the model inputs tab provides a warning to
users that exclusions have been defined. See automatic keep outs for more details.
19.3.2 Options
Select constraint options:
Non-negative concentrations
Non-negative spectra
Closure
Unimodality.
Information on those constraints can be found in the theory section: Constraints in MCR.
It is possible to tune the sensitivity using the field Sensitivity to pure components; read
more about how and when to do so in the theory chapter: How to tune sensitivity to pure
components?
The number of iterations can also be changed when detecting convergence is difficult. The
default setting is 50 iterations. Warnings will be added to the MCR results node if the
alternating least-squares calculation does not converge for the optimal and/or optimal plus
one number of pure components.
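The alternating least-squares loop with a capped iteration count can be sketched as follows. This is a minimal illustration under stated assumptions (function name, tolerance, and the simple clipping used for non-negativity are all assumptions, not The Unscrambler's implementation):

```python
import numpy as np

def mcr_als(D, C0, max_iter=50, tol=1e-7):
    """Minimal MCR-ALS sketch: alternately solve D ~ C @ S for S and C by
    least squares, clip to enforce non-negativity, and stop when the fit
    no longer improves by more than `tol`."""
    C = C0.copy()
    prev_fit = np.inf
    for it in range(max_iter):
        S = np.linalg.lstsq(C, D, rcond=None)[0]        # spectra, D ~ C @ S
        S = np.clip(S, 0.0, None)                       # non-negative spectra
        C = np.linalg.lstsq(S.T, D.T, rcond=None)[0].T  # concentrations
        C = np.clip(C, 0.0, None)                       # non-negative concentrations
        fit = np.linalg.norm(D - C @ S)
        if abs(prev_fit - fit) < tol:
            return C, S, it + 1, True                   # converged
        prev_fit = fit
    return C, S, max_iter, False  # not converged: a warning should be raised

rng = np.random.default_rng(1)
C_true = rng.random((8, 2))
S_true = rng.random((2, 12))
D = C_true @ S_true                   # exact two-component data
C, S, n_iter, converged = mcr_als(D, C_true + 0.01 * rng.random((8, 2)))
print(converged, n_iter)
```

On exact low-rank data such as this, the loop converges in a handful of iterations; noisy real data may need the iteration limit raised, as described above.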
MCR Options
Component concentrations
This plot displays the estimated concentrations of two or more constituents across all the
samples included in the analysis. Each plotted curve is the estimated concentration profile of
one given constituent.
The curves are plotted for a fixed number of components in the model; note that in MCR,
the number of model dimensions (components) also determines the number of resolved
constituents. Therefore, if the number of components is tuned up or down with the toolbar
buttons, this will also affect the number of curves displayed. For
instance, if the plot currently displays two curves, clicking the right arrow toolbar button will update the
plot to three curves representing the profiles of three constituents in a 3-dimensional MCR
model.
Component concentrations
Component spectra
This plot displays the estimated spectra of two or more constituents across all the variables
included in the analysis. Each plotted curve is the estimated spectrum of one pure
constituent.
The curves are plotted for a fixed number of components in the model; note that in MCR,
the number of model dimensions (components) also determines the number of resolved
constituents. Therefore, if the number of components is tuned up or down with the toolbar
buttons, this will also affect the number of curves displayed. For
instance, if the plot currently displays two curves, clicking on the right arrow will update the
plot to three curves representing the spectra of three constituents in a 3-dimensional MCR
model.
Note: the star button enables one to go back to the suggested number of
components for the model.
Component spectra
Sample residuals
This plot displays the residuals for each sample for a given number of components in an MCR
model.
The size of the residuals is displayed on the scale of the vertical axis. The plot contains one
point for each sample included in the analysis; the samples are listed along the horizontal
axis.
The sample residuals are a measure of the distance between each sample and the MCR
model. Each sample residual varies depending on the number of components in the model
(displayed in parentheses after the name of the model, at the bottom of the plot). The
number of components for which the residuals are displayed can be tuned up or down using
the toolbar buttons.
The size of the residuals gives an indication about the misfit of the model. It may be a good
idea to compare the sample residuals from an MCR fitting to a PCA fit on the same data.
Since PCA provides the best possible fit along a set of orthogonal components, the
comparison tells how well the MCR model is performing in terms of fit.
Sample residuals
Total residuals
This plot displays the total residuals (all samples and all variables) against increasing number
of components in an MCR model.
The size of the residuals is displayed on the scale of the vertical axis. The plot contains one
point for each number of components in the model, starting at two. The total residuals are a
measure of the global fit of the MCR model, equivalent to the total residual variance
computed in projection models like PCA.
It is a good idea to compare the total residuals from an MCR fitting to a PCA fit on the same
data. Since PCA provides the best possible fit along a set of orthogonal components, the
comparison tells how well the MCR model is performing in terms of fit.
Total residuals
Variable residuals
This plot displays the residuals for each variable for a given number of components in an
MCR model.
The size of the residuals is displayed on the scale of the vertical axis. The plot contains one
point for each variable included in the analysis; the variables are listed along the horizontal
axis.
The variable residuals are a measure of how well the MCR model takes into account each
variable; the better a variable is modeled, the smaller the residual. Variable residuals vary
depending on the number of components in the model (displayed in parentheses after the
name of the model, at the bottom of the plot). The number of components for which the
residuals are displayed can be tuned up or down, using the toolbar buttons.
The size of the residuals tells about the misfit of the model. It is a good idea to compare the
variable residuals from an MCR fitting to a PCA fit on the same data. Since PCA provides the
best possible fit along a set of orthogonal components, the comparison tells how well the
MCR model is performing in terms of fit.
Variable residuals
19.6. Bibliography
R. Tauler, S. Lacorte and D. Barceló, Application of multivariate self-modeling curve
resolution for the quantitation of trace levels of organophosphorus pesticides in natural
waters from interlaboratory studies, J. Chromatogr. A, 730, 177–183 (1996).
20. Hierarchical Modeling
20.1. Hierarchical Modeling
Hierarchical Modeling (HM) is not a method defined by a specific algorithm, but a
predefined set of Unscrambler models run in a predefined order (i.e. a hierarchy). At each
stage of the hierarchy, a decision must be made based on Boolean logic, which directs the
modeling to the next step. The process stops when a final decision has been reached in the
logic.
Theory
Usage
Prediction
Overall workflow
Setup
Expected Scenarios
The Classification - Classification Hierarchy
The Classification – Prediction Hierarchy
The Prediction – Prediction Hierarchy
20.2.2 Setup
HM can be thought of as a cascading tree of decision making. It is expected that all
projection, prediction and classification models generated in The Unscrambler X are
candidates for hierarchical model development. The HM module supports up to 10 levels of
hierarchy and multiple models can be included within each level.
Within each level, one or more models can be defined based on the output from the
previous level. Alternatively, the output may be satisfactory and simply reported, or it may be ambiguous
or out of limits, in which case a warning can be displayed or the HM can be told to exit. This
behaviour is completely in the hands of the user, who has to make sure that the provided
sequence of steps and the limits used are sensible.
Also, for each model within each level, an ordered list of logical conditions is specified by
the user and executed in an IF-ELSE manner. This means that if the first condition is satisfied,
any remaining conditions will not be executed. It follows that the order of the conditions is
important. If for instance condition 1 finds that the predicted response is out of limits, a
condition 2 testing for e.g. leverage of the predicted sample will never be executed.
Note that the program will not attempt to detect or fix ambiguous logic in subsequent
conditions. If condition 1 states that a PLS model should be calculated if a parameter is
within limits, and condition 2 states that a PCA model should be calculated for the same
parameter values, only the PLS model will ever be calculated due to the order of the
conditions.
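The ordered IF-ELSE evaluation can be sketched as follows; the condition functions, parameter names and action labels are hypothetical, chosen only to illustrate that the first satisfied condition wins:

```python
def evaluate_conditions(sample, conditions):
    """Evaluate an ordered list of (test, action) pairs in IF-ELSE fashion:
    the first test that passes determines the action, and any remaining
    tests are never executed."""
    for test, action in conditions:
        if test(sample):
            return action
    return "No Evaluation"  # default ELSE-style reporting action

conditions = [
    (lambda s: not (0 <= s["y_pred"] <= 100), "Warn: out of limits"),
    (lambda s: s["leverage"] > 0.5, "Warn: high leverage"),
    (lambda s: True, "Apply 2nd-level model"),
]

# The out-of-limits test fires first, so the leverage test never runs:
print(evaluate_conditions({"y_pred": 150, "leverage": 0.9}, conditions))
```

Swapping the order of the first two entries would change the outcome for this sample, which is exactly the ordering effect described above.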
It should be noted here that classes 1-5 can be separated uniquely from every other class.
Classes 6-8 cannot be separated from each other, but can from all other classes and classes
9-10 cannot be separated from each other, but can be separated from all other classes,
using the 1st level (global) model.
The next step is to define a 2nd level hierarchy in which two models are defined:
a. A separation model for classes 6-8
b. A separation model for classes 9-10
Continuing on with the example, say for instance a new model (SIMCA/LDA/SVM/PLS-DA)
can be defined to separate class 8 from classes 6 and 7, then a 3rd level is required, in which
a new model is defined to separate classes 6 and 7. The 2nd level also contains a model for
separating classes 9 and 10. The 2nd level is shown in the figure below.
Resolving ambiguities with a second level of model hierarchy
The final step in this particular process is to define a 3rd level of hierarchy with a single
model for separating class 6 and 7 from each other. Therefore in summary, this process
requires 3 levels of hierarchy; the first contains a single “global” model that uniquely
separates classes 1-5 but cannot uniquely separate classes 6-8 from each other or classes 9-
10 from each other. The second level has two models one for separating class 8 from classes
6-7 and one for separating class 9 from 10. The third level separates classes 6 and 7.
There is also the situation where a sample does not classify into any of the models. The entire
process described above is shown in the following flow diagram.
Expected workflow of a hierarchical model for separating 10 classes uniquely
If a sample is uniquely classified, then (at least) one specified prediction model
is applied to the sample.
Suppose there are five (5) groups to be classified and in this case, assume that no
ambiguities are present in the classification step, i.e. the 1st level model uniquely separates
classes 1-5. For each class there are separate sets of prediction models (PLS/PCR/MLR)
assigned to each class (it may also be feasible to have a PCA projection model here).
The figure below shows an example of a Classification - Prediction Hierarchy
The Classification – Prediction hierarchy
If predicted y lies between 0 and some upper limit a in the 1st level, then use a local
regression model developed for that region in the 2nd level.
If predicted y lies between a and some upper limit b in the 1st level, then use a local
regression model developed for that region in the 2nd level.
If predicted y lies between b and some upper limit c in the 1st level, then use a local
regression model developed for that region in the 2nd level.
If predicted y lies above some upper limit c or below some predefined lower limit in
the 1st level, then terminate operation and provide a warning that the value is
outside the normal calibration range.
It is of course possible to define specific steps to be taken if the predicted value is close to a
junction between two models. Then the prediction intervals above should be shrunk
accordingly, so that no intervals overlap.
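The interval-based dispatch above can be sketched like this, with placeholder limits a, b, c and model labels (all of which are hypothetical, not values from the manual):

```python
def select_local_model(y_pred, a=10.0, b=20.0, c=30.0, lower=0.0):
    """Dispatch a 1st-level predicted value to a 2nd-level local model based
    on which non-overlapping interval it falls in; values outside the
    calibration range terminate with a warning."""
    if y_pred < lower or y_pred > c:
        return "warning: outside calibration range"
    if y_pred <= a:
        return "local model [0, a]"
    if y_pred <= b:
        return "local model (a, b]"
    return "local model (b, c]"

print(select_local_model(15.0))   # falls in (a, b]
print(select_local_model(35.0))   # above c, so a warning is reported
```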
Defining actions
Classification setup
Prediction setup
Projection setup
Report setup
Setting up a hierarchical model
Add level
Define Conditions and Actions
Expression Builder
Define, Remove and Report buttons
Classification
Prediction
Projection
Report
The action setup dialog is slightly different depending on which type of action to be
performed. These setup dialogs are described in the next sub-sections.
Classification setup
A classification model is defined using the following dialog window:
Add classification model dialog
Add a name to the Method name frame. This will be displayed in the HM model structure
and also in the output matrix of HM Predict. You should choose an informative name to
make interpretation of the hierarchical model and the results easier.
Select the type of classification model in the Classification type frame. For SIMCA
classification, any number of bilinear models (PCA, PCR, PLSR) can be included, while LDA
and SVM classification expects a single model.
The individual models are defined in the ‘Add models for classification’ frame. A drop-down
box will list all available models from the project navigator. Once a model is selected, verify
that the correct auto-pretreatments will be performed and that the correct settings are
selected for centering and the number of components.
Highlighting an already added model will activate the Remove and Details buttons for that
model. The first button will remove the model from the list of added models and clear the
list of selected output matrices (see below). The Details button will bring up a separate
dialog listing details about the selected model.
The complete list of available output matrices is shown in the bottom left portion of the
window. Use the arrow buttons to select output data. These will be saved in a Results matrix
when HM Prediction is applied and may be subjected to conditional statements in the next
level of the hierarchy. Make sure to include all necessary output data that may be of interest
later, as these will otherwise be lost.
The available model outputs from SIMCA classification are class memberships at different
significance levels between 0.1-25%. In addition, the ‘X Residuals’, ‘Si/S0’ and ‘Leverage’
values can be selected for each of the individual models. For LDA and SVM classification, the
only available output is the predicted class.
Prediction setup
The following dialog is used for prediction type actions:
Refer to the Classification setup section for an explanation of the different frames and
buttons. The model drop-down box will be populated with supported prediction models
from the project navigator. These are PLSR, PCR and MLR models. The available outputs
from PLSR and PCR models are
Projection setup
Only a single PLSR, PCR or PCA model can be used for projection, and the dialog is therefore
simpler:
Add projection model dialog
Refer to the Classification setup section for an explanation of the different frames and
buttons.
Available outputs are
Projected Scores
Projected Hotelling’s T²
Projected Sample Leverage
Projected X Sample Residuals
Projected Explained X Sample Variance
Report setup
Once all the desired levels of the hierarchy have been modeled, or if a conditional statement
causes the modeling to stop prematurely due to an undesired outcome, a reporting action
will define how the results are displayed. Reporting involves coloring the output in the
results table and optionally adding an informative tool tip comment.
Contrary to the classification, prediction and projection action types, there is no additional
output being produced by a reporting action. This means that there can be no additional
hierarchical levels based on a reported result. An example Report setup dialog is given
below.
Example of report setup dialog
This example condition is the default ‘No Evaluation’ condition, which is evaluated at the
end of a conditional statement if none of the other conditions hold TRUE (If you are familiar
with programming syntax, this is the ELSE statement). The dialog has a Method name box,
where the name of the reporting action can be specified. The Expressions column lists the
conditions that will lead to the current reporting action. Available Reporting Options are
AlarmHigh: Red
AlarmLow: Red
Normal: Green
WarningHigh: Yellow
WarningLow: Yellow
Alarm: Red
Warning: Yellow
The standard colors may be edited by clicking on the “Edit standard states” button. This will
bring up the “Define Reporting States” dialog with the 7 standard sub-options and their
associated colors indicated:
The Define Reporting States dialog for standard states
Click on any of the colored boxes to bring up a color editing dialog. Press OK to
save any changes or Cancel to discard.
A Custom list of sub-options with associated colors can similarly be set up by pressing the
“Define custom states” button. This dialog allows you to define the number of reporting
states, their names and their associated colors.
The Define Reporting States dialog for custom states
This dialog can be inactivated by checking “Do not show this next time”. On clicking OK, the
Hierarchical Model dialog will open, initially with no levels specified. This will be the first
dialog shown if the information dialog has been inactivated.
Initial setup dialog screen for Hierarchical Modeling
The Hierarchical Levels frame will be populated with different conditions and actions at
multiple levels once these have been specified.
Add level
Add levels using the ‘Add Level’ button. If no levels have been specified, a dialog will open
with the options to specify a Classification, Prediction, or Projection model as the global
(Level 1) model. Depending on your selection, the relevant setup dialog will open, as
described in the previous section.
Define Action dialog box
Once the first level model(s) has been specified, click OK to add the first level to the
Hierarchical Levels frame of the main HM setup dialog. Because a hierarchical model
requires at least two levels, click the ‘Add level’ button again to set up the second level.
Clicking this button for any level between 2-10 will bring up the “Define Conditions and
Actions” dialog.
Expression Builder
To add a new condition, specify a condition name and press the Expression Builder button.
Alternatively, click on an existing condition (row) to populate the condition name and
expression with existing values. A unique classification has a value of 1 (TRUE) for the class in
question and 0 (FALSE) for all other classes.
Expression Builder Dialog for a SIMCA model
For SIMCA models it is possible to define different combinations of classes to evaluate. For
instance, a separate action can be specified for the case where a sample is ambiguously
classified into two classes. Also, for SIMCA, Prediction and Projection models, multiple
statements can be defined and connected with AND, OR or XOR:
Multiple conditions evaluate in a greedy manner, starting with the first two statements and
comparing with the remaining statements one at a time. E.g. an expression “cond1 AND cond2
comparing with remaining statements one at the time. E.g. an expression “cond1 AND cond2
OR cond3” will evaluate to TRUE if cond2 and cond3 are TRUE while cond1 is FALSE. This is
because “cond1 AND cond2” will be evaluated first, as in the expression “(cond1 AND cond2)
OR cond3”. To add multiple conditions, use the check box to activate a new statement. The
prediction model expression below evaluates to TRUE if predicted octane is between 88 and
90, while the deviation is less than 3.
Expression Builder Dialog with multiple statements
The user must take care not to build meaningless statements, such as “X > 1 AND X < 0”,
which will always evaluate to FALSE.
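The left-to-right (no operator precedence) evaluation described above can be sketched as:

```python
def evaluate_left_to_right(values, operators):
    """Combine boolean statements strictly left to right, as described:
    'cond1 AND cond2 OR cond3' is treated as '(cond1 AND cond2) OR cond3'.
    No operator precedence is applied."""
    ops = {"AND": lambda a, b: a and b,
           "OR": lambda a, b: a or b,
           "XOR": lambda a, b: a != b}
    result = values[0]
    for op, v in zip(operators, values[1:]):
        result = ops[op](result, v)
    return result

# cond1 = FALSE, cond2 = TRUE, cond3 = TRUE:
# (FALSE AND TRUE) OR TRUE evaluates to TRUE, matching the example above.
print(evaluate_left_to_right([False, True, True], ["AND", "OR"]))
```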
Once the expression is set up, press OK to close the expression builder dialog. Then, to save
the expression as a new condition press New, or press Update to modify an existing
condition.
Edit Level
Once at least one level has been added to the hierarchical model, it is possible to return to
the Define Conditions And Actions dialog and change any of the settings. Click on the level of
interest and verify that the correct level is displayed in the Selected Level box on the right
hand side. Click the Edit Level button to bring up a dialog to modify the settings.
Note that changing the output or the conditions in one of the lower levels may break
dependencies in some of the higher levels. Edit such lower levels with extreme caution.
Remove Level
Clicking on Remove Level will bring up a warning that the currently selected level will be
removed permanently. Note that if a lower level is deleted, all higher levels are necessarily
deleted as well.
Remove Level warning dialog
Details
The Details button will bring up a dialog with additional information about the currently
selected level. Click on OK to close the window and return to the main HM builder dialog.
Preview
The Preview button brings up a dialog with an expandable tree-structure with information
about the complete hierarchical model. Nodes in the tree containing additional sub-
branches can be expanded (or collapsed) by clicking on the ‘+’ (or ‘-’) symbol at the junction
of the node. Click the Expand All button to expand all sub-branches in the tree. Click OK to
close the dialog and return to the main HM builder dialog.
Example of a HM Preview tree dialog
Right click on the HM model node in the project navigator to Edit, Rename, Delete or Save
the model. On saving a HM model, all required classification, prediction and projection
models are saved in the same project file.
This will open the HM Predict dialog. A drop-down box will be populated with all available
HM models in the current project. Select the model of interest and specify the data to apply
in the Data frame. Make sure to specify the correct Row and Column sets, or bring up the
Define Range dialog to specify the ranges if they are not pre-defined.
HM Predict dialog
When clicking OK, the number of columns specified is compared with the dimension of the
training data for the models in the first hierarchical level. An attempt to specify data with
incorrect dimensions will bring up a warning that the number of columns does not match
what is specified by the model:
Data size warning
Once the correct data are specified, click OK to start the hierarchical sequence of modeling
steps.
The Results table contains the specified output at each level for the conditions that
evaluated to TRUE for the samples in question. One column set for each level is defined by
default.
Toggle the toolbar icons to hide/show ranges. When ranges are not shown, the
colors of the Reported values are displayed. Click on a colored cell to display the tool-tip Alarm
state and comments for the individual output values.
HM Predict Results table with alarm states shown
21. Segmented Correlation Outlier Analysis
21.1. Segmented Correlation Outlier Analysis (SCA)
The SCA method provides a means to detect gross and subtle outliers in large spectral
data sets so that they can be removed objectively. Furthermore, SCA can also be used at run time
for outlier detection, as it is a modified PCA approach and fits well with the concept of
projection. This requires a Target Spectral Profile (TSP) to be defined and saved to a run time
model for application to new data.
Theory
Usage: Create model
Results
SCA Save Model
Usage: Prediction
Prediction Results
Method reference
Correlation calculations
The correlation calculations form the basis of SCA. A reference spectrum is used for this,
defined either as a single spectrum or by calculating the mean or median for a selected
number of spectra. Two types of correlation values will be used:
• Overall correlation: a single Pearson’s r² value for each spectrum, showing how
close the individual spectra are to a reference spectrum for outlier detection.
• Segmented correlation: performs localized correlation calculations for each segment
defined; for each sample, a value of correlation for each segment will be generated, and for
multiple samples, a matrix of local correlation values will be available.
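A sketch of the two correlation calculations. Splitting the wavelength axis into equal-width segments is an assumption for illustration; in practice the segmentation is user-defined:

```python
import numpy as np

def correlation_values(spectra, reference, n_segments):
    """Overall r-squared of each spectrum against the reference, plus a
    matrix of per-segment r-squared values (one row per sample)."""
    overall = np.array([np.corrcoef(s, reference)[0, 1] ** 2 for s in spectra])
    segments = np.array_split(np.arange(len(reference)), n_segments)
    segmented = np.array([
        [np.corrcoef(s[idx], reference[idx])[0, 1] ** 2 for idx in segments]
        for s in spectra
    ])
    return overall, segmented

rng = np.random.default_rng(2)
reference = rng.random(20)                # e.g. the mean library spectrum
good = reference + 0.01 * rng.random(20)  # close to the reference
odd = rng.random(20)                      # unrelated spectrum
overall, segmented = correlation_values(np.vstack([good, odd]), reference, 4)
print(overall.round(3), segmented.shape)
```

The first spectrum yields an overall r² near 1, while the unrelated one scores far lower, which is the basis for the outlier marking described below.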
PCA calculations
The segmented correlation values matrix calculated above will be used for PCA calculations.
The results will be similar to PCA, except they are for segments and not for individual
variables.
Outlier detection
All samples below the correlation limit in the overall correlation plot, as well as those with larger
than threshold T² or Q values, will be marked automatically, and any marked samples are
to be interpreted as outliers. Any or all samples can be unmarked using the regular tools, in
which case the lines in question in the SCA Overview plots are coloured grey, and the circles
are removed from all other plots. Clicking the Mark Outliers button will revert to the default
selection based on current correlation, T² and Q thresholds.
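The automatic marking rule can be sketched as a simple combination of the three thresholds; the limit values below are arbitrary placeholders, not the software defaults (apart from the 0.95 correlation limit mentioned elsewhere in this chapter):

```python
import numpy as np

def mark_outliers(overall_r2, t2, q, r2_limit=0.95, t2_limit=10.0, q_limit=1.0):
    """Flag any sample that falls below the correlation limit or exceeds
    either the T-squared or Q threshold."""
    overall_r2, t2, q = map(np.asarray, (overall_r2, t2, q))
    return (overall_r2 < r2_limit) | (t2 > t2_limit) | (q > q_limit)

# Second sample fails the correlation limit, third fails the T2 threshold:
flags = mark_outliers([0.99, 0.90, 0.98], [2.0, 3.0, 15.0], [0.1, 0.2, 0.3])
print(flags)
```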
Overall workflow
The following describes the overall workflow of SCA.
where the Conformity Index at each wavelength i is calculated as

CI_i = |A_i − Ā_i| / s_i

The vertical bars (| |) indicate the absolute value
CI_i = Conformity Index at wavelength i
A_i = Absorbance of the test material at wavelength i
Ā_i = Absorbance of the target spectrum profile (mean/median library spectrum) at wavelength i
s_i = Standard deviation of library absorbance at wavelength i
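The conformity index definitions above can be sketched numerically. This assumes the library mean as the target spectrum profile and the sample standard deviation of the library spectra at each wavelength (the `ddof=1` choice is an assumption):

```python
import numpy as np

def conformity_index(test_spectrum, library):
    """CI_i = |A_i - mean_i| / s_i at each wavelength i, where mean_i and
    s_i are computed column-wise from the library spectra."""
    library = np.asarray(library, dtype=float)
    mean = library.mean(axis=0)
    sd = library.std(axis=0, ddof=1)
    return np.abs(np.asarray(test_spectrum, dtype=float) - mean) / sd

# Three library spectra at two wavelengths; the test deviates at wavelength 2:
library = np.array([[1.0, 2.0],
                    [1.2, 2.2],
                    [0.8, 1.8]])
print(conformity_index([1.1, 2.6], library))  # -> [0.5 3. ]
```

A large CI value at a wavelength indicates that the test spectrum departs from the library there by many standard deviations.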
where the correlation at each wavelength i is calculated as

r_i = cov(X_i, Y) / (s_Xi · s_Y)

r_i = Correlation between wavelength i and the specified Y-variable
X_i = Wavelength array over all samples at wavelength i
Y = Selected Y-variable to calculate correlation against
s_Xi = Standard deviation of the X array at wavelength i
s_Y = Standard deviation of the Y-variable
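The wavelength-by-wavelength correlation with a Y-variable described above can be sketched as follows (a hypothetical implementation of the Pearson formula, not Unscrambler code):

```python
import numpy as np

def correlogram(X, y):
    """Pearson correlation of each wavelength column of X with y:
    r_i = cov(X_i, y) / (s_Xi * s_y), computed column-wise."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)            # center each wavelength column
    yc = y - y.mean()                  # center the Y-variable
    return (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))

# First wavelength tracks y perfectly; second is perfectly anti-correlated:
X = np.array([[1.0, 5.0],
              [2.0, 3.0],
              [3.0, 1.0]])
y = np.array([10.0, 20.0, 30.0])
print(correlogram(X, y))  # -> [ 1. -1.]
```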
individual spectra. The correlation values span the range from 0 to 1, and the default
correlation limit is 0.95.
Segmented Correlation Outlier Analysis - Scope
PCA Options
The Maximum components option allows the user to set the maximum number of Principal
Components to use for the analysis. The default value is 7; however, the upper bound
is the minimum of the number of samples minus one and the number of window segments.
The Cross Validation method is used when either there are not enough samples available to
make a separate test set, or for simulating the effects of different validation test cases, e.g.
systematically leaving samples out vs. randomly leaving samples out, etc.
The cross validation procedures associated with multivariate models are described in detail
in the chapter on the cross validation setup dialog.
Segmented Correlation Outlier Analysis - PCA Options
Correlogram
The Compute correlogram option gives users the ability to calculate a correlogram.
If the above option is enabled, the Response frame is enabled to select a matrix to be used.
Select pre-defined row and column ranges in the Rows and Cols boxes, or click the Define
button to perform the selection manually in the Define Range dialog.
Segmented Correlation Outlier Analysis - Correlogram
The SCA module allows one to save the entire model or separate models as a project. There are
several options for how the results file is saved. Depending on which option is used, the file
size can be reduced so that the saved model is best suited for use in conformity prediction. Select an
SCA model in the project navigator and right click to select Save Model.
Entire model
This saves all the results and supports all visualizations that are available when a
model is developed in The Unscrambler® X. This option does not allow recalculation
of the model, as is available in MLR, PLS, PCR and PCA models, but it allows saving
separate models. Use the option Number of Components to set the number of
components for a model to a value other than the optimal recommended number.
This number of components will then be used when the model is used for prediction
and/or classification. The Standard Deviations option helps to set the limits for the
trend plots around the average of mean Conformity Index values. The default value
is 3. The Spectral match limit option helps to set the limits for the overall correlation
plot. The default value is 0.95.
SCA model
This option saves the model containing only the data required for detecting
Influence outliers for the selected number of components or less. This model results
file does not include plots and some of the results matrices that are not used in the
prediction visualization. Use the option Number of Components to set the number
of components for a model to a value other than the optimal recommended
number. This number of components will then be used when the model is used for
prediction and/or classification.
Conformity model
This option saves the model containing only data required for detecting Conformity
Index outliers for the selected number of standard deviations. In the short model,
only the target spectrum profile and its confidence limits, conformity statistics and
conformity values are saved. No validation matrices are saved. The Standard
Deviations option helps to set the limits for the trend plots around the average of
mean Conformity Index values. The default value is 3.
Correlation (Spectral match)
This option saves the model containing only the data required for detecting
Conformity outliers at the specified limit. This model saves only the reference
spectrum and the overall correlation values. The Spectral match limit option helps
to set the limits for the overall correlation plot. The default value is 0.95.
Spectra
A plot of the samples used in the analysis is available with confidence limits. The
reference spectrum is also shown in this plot, highlighted in green, to show how all
other spectral samples behave with respect to it. The confidence limits can be set to K Std
Deviations from the reference spectrum, where K is an integer between 1 and 6. A
confidence limit is calculated for each wavelength and plotted along with the reference
spectrum.
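The per-wavelength limit calculation described above can be sketched as follows. This is an illustrative sketch only: it assumes the reference spectrum is the mean spectrum, which may not match the software's exact definition.

```python
import numpy as np

def conformity_limits(spectra, k=3):
    """Per-wavelength confidence limits at K standard deviations.

    Illustrative sketch: the reference spectrum is taken here as the mean
    spectrum, which may differ from The Unscrambler's definition.
    `spectra` is a samples x wavelengths array.
    """
    reference = spectra.mean(axis=0)      # reference spectrum
    sd = spectra.std(axis=0, ddof=1)      # standard deviation per wavelength
    return reference - k * sd, reference + k * sd

spectra = np.array([[1.0, 2.0, 3.0],
                    [1.2, 2.1, 2.9],
                    [0.9, 1.9, 3.1]])
lower, upper = conformity_limits(spectra, k=3)
```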
The samples are marked in gray. The reference spectrum is marked in green, and its
associated confidence limits are drawn as dashed green lines. Any outliers identified in
the analysis are marked in red.
Toggle the corresponding toolbar button to view the Conformity Index plot.
Influence Plot
This plot shows the Q-residual X-variance or F-residuals vs. Leverage or Hotelling’s T². The
toggle buttons in the toolbar can be used to switch between the various combinations.
Scores
This is a two-dimensional scatter plot (or map) of scores for two specified components (PCs)
from PCA, performed on the segmented correlation values matrix. The plot gives
information about patterns in the samples. The scores plot for (PC1,PC2) is especially useful,
since these two components summarize more variation in the data than any other pair of
components. Outliers will be marked automatically (in red).
Use the Hotelling’s T² ellipse in the scores plot to detect outliers. To display it, click on the
Hotelling’s T² ellipse button.
SCA loadings
A line plot of segmented correlation loadings for each (or selected) component(s) is a good
way to detect important segments, and thus their associated variables/wavelengths, and to
understand which components capture the important sources of information.
Use the correlation loadings option to discover the important segments lying within the
upper and lower bounds of the plot, being modelled by that particular PC.
Influence
See the description in the overview section.
Explained Variance
This plot gives an indication of how much of the variation in the data is described by the
different components.
Total residual variance is computed as the sum of squares of the residuals for all the
variables, divided by the number of degrees of freedom.
Total explained variance is then computed as:
Calibration variance is based on fitting the calibration data to the model. Validation variance
is computed by testing the model on data that were not used to build the model.
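The calculation above can be illustrated with a minimal sketch. It uses SVD-based PCA on mean-centered data; the exact degrees-of-freedom corrections used by the software may differ.

```python
import numpy as np

def explained_variance(X, max_components):
    """Total explained calibration variance (%) for 1..max_components PCs.

    Illustrative sketch: SVD-based PCA on mean-centered data, with total
    variance measured as the raw sum of squares.
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    total_ss = (Xc ** 2).sum()
    result = []
    for a in range(1, max_components + 1):
        E = Xc - (U[:, :a] * s[:a]) @ Vt[:a]   # residuals after a components
        result.append(100.0 * (1.0 - (E ** 2).sum() / total_ss))
    return result

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
ev = explained_variance(X, 5)
```

With all five components retained on this 20 x 5 data set, the residual is zero and the explained variance reaches 100%.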
Influence
See the description in the overview section.
The Hotelling’s T² statistic has a linear relationship to the leverage for a given sample. Its
critical limit is based on an F-test. Use it to identify outliers or detect situations where a
process is operating outside normal conditions. There are 6 different significance levels to
choose from using the drop-down list:
The number of factors (or PCs) may be tuned up or down with the toolbar tools.
To access the Leverage plot, use the toggle button. Leverages are useful for detecting
samples which are far from the center within the
space described by the model. Samples with high leverage differ from the average samples;
in other words, they are likely outliers. A large leverage also indicates a high influence on the
model.
Leverage plot
Q-Residuals/F-Residuals
The Q-residual is the sum of squares of the residuals over the variables for each object.
This test serves the purpose of finding outliers in terms of the distance to the model space,
i.e. the residual distance. Given the model X = TPᵀ + E, the Q-residuals for the objects in X
are computed from the diagonal of EEᵀ.
A critical value of the Q-residual can be estimated from the eigenvalues of E, which can be
approximated to a normal distribution (Jackson and Mudholkar, 1979). This is the horizontal
red line.
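A minimal sketch of the Q-residual computation, assuming mean-centered data and SVD-based PCA (function and variable names are illustrative):

```python
import numpy as np

def q_residuals(X, n_components):
    """Q-residual per object: the row-wise sum of squared residuals,
    i.e. the diagonal of E @ E.T. Illustrative sketch only."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T          # loadings
    T = Xc @ P                       # scores
    E = Xc - T @ P.T                 # residual matrix E
    return (E ** 2).sum(axis=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 6))
q = q_residuals(X, 2)
```

With all six components retained the residual matrix is zero up to rounding, so the Q-residuals vanish.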
Q-residual sample variance
To access the F-Residuals plot, use the toggle button to switch to the F-Residuals plot.
The F-residuals are calculated from the calibration as well as the validated residual X-
variance, and thus reflect the validation method chosen for the model. They may give a
more realistic view of the residuals than the Q-residuals, which are based on the residuals
from the calibration only.
F-residual sample variance
Conformity Analysis
A Conformity Limit is plotted as a dashed green line at K Standard Deviations. Any samples
falling outside the Conformity limits will be tagged as a Conformity outlier and marked with
red lines in this plot. There are six levels of standard deviation to choose from using the drop
down list:
A Conformity Limit is plotted as a dashed green line at K Standard Deviations. Any samples
falling outside the Conformity limits will be tagged as a Conformity outlier and marked with
red circles in this plot.
Use the drop down list from the menu to access the trend plots:
Correlelogram
This plot is available only when the Calculate Correlelogram option is checked in the
Correlelogram dialog during analysis. The Correlelogram plot is the correlation of each
wavelength present or defined in the data set with the Y-variables, plotted one Y-variable at
a time. It is plotted as a line plot with a maximum of +1 and a minimum of -1, and allows the
overlay of selected spectra, all spectra or the mean spectrum to show areas of maximized
spectral correlation to the Y-variables.
Correlelogram Plot
Outlier Marking
The following section discusses the four different types of outliers available from the SCA
analysis.
Influence Outliers
The outliers are identified based on the Influence plot. Any sample not in the lower left
quadrant of the plot will be tagged as an influence outlier (circled in red). This can be
accessed by toggling the IO button in the Mark menu.
Correlation Outliers
Any sample with an overall correlation value below the set correlation limit will be tagged as
a Correlation outlier. The correlation limit is set as the lower threshold on the squared
Pearson’s correlation value calculated between the entire reference spectrum and the
individual spectra. The correlation values span the range from 0 to 1. The default value of
correlation limit is set to 0.95. This can be accessed by toggling the CO button in the Mark
menu.
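The rule above can be sketched as follows; the function and variable names are illustrative and the reference spectrum is assumed to be given:

```python
import numpy as np

def correlation_outliers(spectra, reference, limit=0.95):
    """Flag spectra whose squared Pearson correlation with the reference
    spectrum falls below the correlation limit. Illustrative sketch."""
    ref_c = reference - reference.mean()
    flags = []
    for s in spectra:
        s_c = s - s.mean()
        r = (ref_c @ s_c) / np.sqrt((ref_c @ ref_c) * (s_c @ s_c))
        flags.append(r ** 2 < limit)
    return np.array(flags)

reference = np.array([0.1, 0.5, 1.0, 0.5, 0.1])
spectra = np.array([
    [0.11, 0.52, 0.98, 0.49, 0.10],   # close to the reference shape
    [1.00, 0.10, 0.20, 0.90, 0.30],   # very different shape
])
flags = correlation_outliers(spectra, reference)
```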
Conformity Index Outlier
A conformity outlier is a sample that exceeds the currently selected conformity limit in the CI
trend chart. The limit is defined by Y = K Std Deviations, as selected in the toolbar. A
conformity outlier is marked with a red solid line or a red circle, depending on plot type.
Whenever the limit changes, any new outliers will be tagged and marked accordingly.
Previous outliers falling inside the new limit will be un-tagged and un-marked. This can be
accessed by toggling the CIO button in the Mark menu.
Manual Outliers
For more information see the How to mark samples/variables documentation.
22. Instrument Diagnostics
22.1. Instrument Diagnostics
The Instrument Diagnostics plug-in was designed to provide users of spectroscopic
instrumentation with a way of assessing the quality of background scans prior to the
collection of reflectance, transmittance or absorbance spectra. It can also be useful for
many other types of sensors, not only spectral instruments. The plug-in contains specific
algorithms for calculating the following quality parameters:
RMS Noise: Provides an assessment of the baseline signal to noise ratio that
indicates that the instrument response is not being influenced by extraneous
electronic noise.
Peak Model: This functionality provides a means of calculating peak heights, areas
and ratios such that assessment can be made to critical limits. These are particularly
important for monitoring contaminant levels, such as build up in specific
instrumentation. Baseline correction is built in as a preprocessing specific to the
Peak Model functionality.
Peak Position: Wavelength accuracy of instrumentation is a critical aspect of good
instrument calibration. If the peak position shifts significantly during analysis, this
has the potential to be detrimental to the predicted values generated by a
chemometric model. Peak Position provides a measure of selected peak positions
and assesses them against a tight window of acceptance.
Loss of Intensity: This diagnostic assesses the quality of the spectral luminescence
source for deterioration in intensity. Comparison of a new background is made to
either a historical background or the last known good reference and is expressed in
terms of deviation from an established 100% intensity.
PCA Projection: Utilizes the power of Principal Component Analysis (PCA) to assess if
the new background scan is in the same population as a library of scans known to
have acceptable variability.
The Instrument Diagnostics module also comes with a prediction plug-in to assess new
background scans within The Unscrambler® X environment. Instrument diagnostic models
developed are used in a similar way to other predictive models (such as PLSR, PCR etc.) and
can be further utilized in real time applications in conjunction with e.g. The Unscrambler® X
ADI Insight Server or Process Pulse.
Theory
Usage
Prediction
The returned value indicates if the RMS is higher than the alarm or warning limit.
Absolute Area: Computes the integral of the absolute amplitudes within the
specified region.
Average Height: Computes the average amplitude within the specified region.
Both low and high alarm and warning limits can be set for this diagnostic, and the result is
returned as one of the possible states.
1. Find all amplitudes for the specified range above the minimum amplitude.
2. Find the position, among the amplitudes remaining from step 1, that is closest to the
reference peak position.
3. Check if the difference between the two positions exceeds the alarm or warning
limits.
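The three steps above can be sketched as follows (all names and the returned states are illustrative):

```python
import numpy as np

def check_peak_position(x_axis, amplitudes, min_amp, ref_pos, warning, alarm):
    """Sketch of the three-step peak-position check described above."""
    mask = amplitudes > min_amp                                  # step 1
    if not mask.any():
        return "Alarm"                                           # no qualifying peak
    candidates = x_axis[mask]
    found = candidates[np.argmin(np.abs(candidates - ref_pos))]  # step 2
    shift = abs(found - ref_pos)                                 # step 3
    if shift > alarm:
        return "Alarm"
    if shift > warning:
        return "Warning"
    return "OK"

x = np.array([100.0, 101.0, 102.0, 103.0])
amps = np.array([0.1, 0.9, 0.4, 0.05])
state = check_peak_position(x, amps, min_amp=0.3, ref_pos=101.0,
                            warning=0.5, alarm=1.5)
```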
In case of percentage:
The returned value indicates if the intensity is lower than the alarm or warning limit.
Hotelling’s T²
Leverage.
In this equation, TNew is the projected score, P is the loading from the PCA model used for
projection and XNew is the new spectrum to be projected onto the PCA model.
Hotellings T² is calculated as the sum of squared projected scores, each divided by the
variance of the corresponding calibration scores.
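Under these definitions, the projection and the statistic can be sketched as follows; the mean-centering and the variance scaling are assumptions about the preprocessing, not a transcript of the product's code:

```python
import numpy as np

def hotellings_t2(x_new, x_mean, P, score_var):
    """Project a new spectrum onto the PCA loadings P (T_new = X_new P)
    and compute Hotelling's T2 as the sum of squared scores, each scaled
    by the variance of the corresponding calibration scores. Sketch only."""
    t_new = (x_new - x_mean) @ P
    return float(np.sum(t_new ** 2 / score_var))

# Build a small PCA model from calibration data (illustrative)
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
x_mean = X.mean(axis=0)
U, s, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
P = Vt[:2].T                                   # 2-component loadings
scores = (X - x_mean) @ P
score_var = scores.var(axis=0, ddof=1)
t2 = hotellings_t2(X[0], x_mean, P, score_var)
```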
The following sections describe how to set up each diagnostic method type.
The following table gives the functionality of the RMS Noise dialog.
Input: Defines the column range of the spectra to apply the RMS Noise model to. Start
defines the starting point and End is the final point of the spectrum.
Threshold
  Alarm: Allows a user to set the upper limit for RMS Noise; beyond this limit an Alarm
  state will be tagged to the RMS Noise value calculated for the new spectrum.
  Warning: Allows a user to set an upper limit for RMS Noise; beyond this limit and below
  the Alarm limit, a Warning state will be tagged to the RMS Noise value calculated for
  the new spectrum.
Multiple RMS Noise Models
RMS Noise models can be calculated over multiple regions of the same spectrum
and individual alarms and warnings can be set up accordingly. To add additional RMS
noise models, right click in the RMS Noise node in the navigator and select RMS
Model. This will add a new RMS Noise model to the navigator called RMS 2.
Setup dialog for an RMS Noise model
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the RMS 1, RMS 2 nodes, an option is available to rename the nodes.
The following table gives the functionality of the Peak Model dialog.
Input: Defines the column range of the spectra to apply the Peak Model to. Start defines
the starting point and End is the final point of the spectrum.
Alarm (High/Low): Allows a user to set the upper/lower limits for the Peak Model; beyond
these limits an Alarm state will be tagged to the Peak Model value calculated for the new
spectrum. One-directional models are possible.
To add a second (or consecutive) Peak Model, right click in the Peak Model node in the
navigator and select Peak Model. This will add a new Peak Model to the navigator called
Peak Model 2.
Dialog with multiple Peak Models added
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the Model nodes, an option is available to rename the nodes.
The following table gives the functionality of the Peak Position dialog.
Input: Defines the column range of the spectra to apply the Peak Position model to. Start
defines the starting point and End is the final point of the spectrum.
Expected Peak Position: A user must enter the peak position where the peak maximum is
expected to occur.
Minimum Peak Amplitude: A user must enter the minimum amplitude expected for finding
a peak in the defined region and at the expected position.
Threshold: Two options are available for Threshold, Absolute: Uses absolute
Alarm (High/Low): Allows a user to set the upper/lower limits for where the peak is
expected to lie; beyond these limits an Alarm state will be tagged to the Peak Position
value calculated. One-directional models are possible.
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the Model nodes, an option is available to rename the nodes.
The following table gives the functionality of the Loss of Intensity dialog.
Input: Defines the column range of the spectra to apply the Loss of Intensity model to.
Start defines the starting point and End is the final point of the spectrum.
Alarm: Allows a user to set the minimum limit for loss of intensity that should be alarmed
when compared to the original or last known good spectrum. This is a lower bound alarm.
Warning: Allows a user to set a warning limit for loss of intensity that should be flagged
when compared to the original or last known good spectrum. This is a lower bound alarm
for warning a user that the spectrometer's lamp should be considered for changing.
Multiple Loss of Intensity Models
Loss of Intensity models can be calculated at multiple regions of the same spectrum
and individual alarms and warnings can be set up accordingly. To add a second (or
consecutive) Loss of Intensity model, right click in the Loss of Intensity node in the
navigator and select Loss of Intensity. This will add a new Loss of Intensity Model to
the navigator called Loss of Intensity 2.
Dialog with multiple Loss of Intensity Models added
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the Model nodes, an option is available to rename the nodes.
The PCA functionality is accessed by right clicking in the Instrument Diagnostics node
in the navigator and selecting Add – PCA. A new node called PCA will be added to
the dialog navigator and a sub-node called PCA 1 will be added to show that one
PCA model is being evaluated.
Setup dialog for a PCA model
Input: Defines the column range of the spectra to apply the PCA model to. Start defines
the starting point and End is the final point of the spectrum.
Use Hotellings T²: This function provides two options for the user. Model: Uses the critical
Hotellings T² limits for the components selected for the model, at the significance level
selected from the dropdown box; User Defined: Allows a user to manually enter a limit for
the Hotellings T² value.
Use Leverage: This function provides two options for the user. Model: Uses the critical
Leverage value for the components selected for the model; User Defined: Allows a user to
manually enter a limit for the Leverage value.
Significance levels for Hotellings T² values in the PCA Instrument Diagnostics dialog
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the Model nodes, an option is available to rename the nodes.
Background spectra collected from a particular instrument can be loaded into The
Unscrambler® X project and the appropriate Instrument Diagnostics model is also loaded
into the project.
Functionality of the Instrument Diagnostics Predict dialog box
When implemented at run time, the results are sent to a third party application and a quality
decision can be made based on the outputs.
23. Spectral Diagnostics
23.1. Spectral Diagnostics
The Spectral Diagnostics plug-in is designed to provide users of spectroscopic
instrumentation with a way of assessing the quality of background scans prior to the
collection of reflectance, transmittance or absorbance spectra. The plug-in contains
specific algorithms for calculating the following quality parameters:
RMS Noise: Provides an assessment of the baseline signal to noise ratio that
indicates that the instrument response is not being influenced by extraneous
electronic noise.
Peak Model: This functionality provides a means of calculating peak heights, areas
and ratios such that assessment can be made to critical limits. These are particularly
important for monitoring contaminant levels, such as build up in specific
instrumentation. Baseline correction is built in as a preprocessing specific to the
Peak Model functionality.
Peak Position: Wavelength accuracy of instrumentation is a critical aspect of good
instrument calibration. If the peak position shifts significantly during analysis, this
has the potential to be detrimental to the predicted values generated by a
chemometric model. Peak Position provides a measure of selected peak positions
and assesses them against a tight window of acceptance.
Loss of Intensity: This diagnostic assesses the quality of the spectral luminescence
source for deterioration in intensity. Comparison of a new background is made to
either a historical background or the last known good reference and is expressed in
terms of deviation from an established 100% intensity.
PCA Projection: Utilizes the power of Principal Component Analysis (PCA) to assess if
the new background scan is in the same population as a library of scans known to
have acceptable variability.
The Spectral Diagnostics module also comes with a prediction plug-in to assess new
background scans within The Unscrambler® X environment. Spectral diagnostic models
developed are used in a similar way to other predictive models (such as PLSR, PCR etc.) and
can be further utilized in real time applications in conjunction with e.g. The Unscrambler® X
ADI Insight server.
Theory
Usage
Prediction
The returned value indicates if the RMS is higher than the alarm or warning limit.
Absolute Area: Computes the integral of the absolute amplitudes within the
specified region.
Average Height: Computes the average amplitude within the specified region.
Both low and high alarm and warning limits can be set for this diagnostic, and the result is
returned as one of the possible states.
1. Find all amplitudes for the specified range above the minimum amplitude.
2. Find the position, among the amplitudes remaining from step 1, that is closest to the
reference peak position.
3. Check if the difference between the two positions exceeds the alarm or warning
limits.
In case of percentage:
The returned value indicates if the intensity is lower than the alarm or warning limit.
Hotelling’s T²
Leverage.
In this equation, TNew is the projected score, P is the loading from the PCA model used for
projection and XNew is the new spectrum to be projected onto the PCA model.
Hotellings T² is calculated as the sum of squared projected scores, each divided by the
variance of the corresponding calibration scores.
The following sections describe how to set up each diagnostic method type.
The following table gives the functionality of the RMS Noise dialog.
Input: Defines the column range of the spectra to apply the RMS Noise model to. Start
defines the starting point and End is the final point of the spectrum.
Threshold
  Alarm: Allows a user to set the upper limit for RMS Noise; beyond this limit an Alarm
  state will be tagged to the RMS Noise value calculated for the new spectrum.
  Warning: Allows a user to set an upper limit for RMS Noise; beyond this limit and below
  the Alarm limit, a Warning state will be tagged to the RMS Noise value calculated for
  the new spectrum.
Multiple RMS Noise Models
RMS Noise models can be calculated over multiple regions of the same spectrum
and individual alarms and warnings can be set up accordingly. To add additional RMS
noise models, right click in the RMS Noise node in the navigator and select RMS
Model. This will add a new RMS Noise model to the navigator called RMS 2.
Setup dialog for an RMS Noise model
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the RMS 1, RMS 2 nodes, an option is available to rename the nodes.
The following table gives the functionality of the Peak Model dialog.
Input: Defines the column range of the spectra to apply the Peak Model to. Start defines
the starting point and End is the final point of the spectrum.
Alarm (High/Low): Allows a user to set the upper/lower limits for the Peak Model; beyond
these limits an Alarm state will be tagged to the Peak Model value calculated for the new
spectrum. One-directional models are possible.
To add a second (or consecutive) Peak Model, right click in the Peak Model node in the
navigator and select Peak Model. This will add a new Peak Model to the navigator called
Peak Model 2.
Dialog with multiple Peak Models added
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the Model nodes, an option is available to rename the nodes.
The following table gives the functionality of the Peak Position dialog.
Input: Defines the column range of the spectra to apply the Peak Position model to. Start
defines the starting point and End is the final point of the spectrum.
Expected Peak Position: A user must enter the peak position where the peak maximum is
expected to occur.
Minimum Peak Amplitude: A user must enter the minimum amplitude expected for finding
a peak in the defined region and at the expected position.
Alarm (High/Low): Allows a user to set the upper/lower limits for where the peak is
expected to lie; beyond these limits an Alarm state will be tagged to the Peak Position
value calculated. One-directional models are possible.
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the Model nodes, an option is available to rename the nodes.
The following table gives the functionality of the Loss of Intensity dialog.
Input: Defines the column range of the spectra to apply the Loss of Intensity model to.
Start defines the starting point and End is the final point of the spectrum.
Alarm: Allows a user to set the minimum limit for loss of intensity that should be alarmed
when compared to the original or last known good spectrum. This is a lower bound alarm.
Warning: Allows a user to set a warning limit for loss of intensity that should be flagged
when compared to the original or last known good spectrum. This is a lower bound alarm
for warning a user that the spectrometer's lamp should be considered for changing.
Multiple Loss of Intensity Models
Loss of Intensity models can be calculated at multiple regions of the same spectrum
and individual alarms and warnings can be set up accordingly. To add a second (or
consecutive) Loss of Intensity model, right click in the Loss of Intensity node in the
navigator and select Loss of Intensity. This will add a new Loss of Intensity Model to
the navigator called Loss of Intensity 2.
Dialog with multiple Loss of Intensity Models added
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the Model nodes, an option is available to rename the nodes.
Input: Defines the column range of the spectra to apply the PCA model to. Start defines
the starting point and End is the final point of the spectrum.
Use Hotellings T²: This function provides two options for the user. Model: Uses the critical
Hotellings T² limits for the components selected for the model, at the significance level
selected from the dropdown box; User Defined: Allows a user to manually enter a limit for
the Hotellings T² value.
Use Leverage: This function provides two options for the user. Model: Uses the critical
Leverage value for the components selected for the model; User Defined: Allows a user to
manually enter a limit for the Leverage value.
Significance levels for Hotellings T² values in the PCA Spectral Diagnostics dialog
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the Model nodes, an option is available to rename the nodes.
The Spectral Diagnostics Predict plug-in is found under the Tasks – Predict menu.
Background spectra collected from a particular instrument can be loaded into The
Unscrambler® X project and the appropriate Spectral Diagnostics model is also loaded into
the project.
Functionality of the Spectral Diagnostics Predict dialog box
When implemented at run time, the results are sent to a third party application and a quality
decision is made based on the outputs.
24. Cluster Analysis
24.1. Cluster analysis
Cluster analysis includes a range of quasi-statistical techniques used in unsupervised
classification. They are suitable for exploratory analysis of data and can be used to classify
samples into groups. Cluster analysis in The Unscrambler® works on the objects (or rows).
The data may be transposed prior to analysis to analyze the data in terms of variables. K-
means and K-medians clustering iteratively add or remove members from a set of clusters so
as to minimize the sum of distances of cluster members to their cluster centers. These
methods use less memory than hierarchical clustering methods and are therefore suitable
for large data sets. Hierarchical clustering methods in The Unscrambler® provide a
dendrogram plot as a visualization of clustering results.
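The assign-and-update loop that K-means performs can be sketched as follows (a minimal illustration, not the product's implementation):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch: assign each sample to its nearest center,
    then move each center to the mean of its members, until stable."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)                 # nearest center per sample
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):     # converged
            break
        centers = new_centers
    return labels, centers

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centers = kmeans(X, k=2)
```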
Theory
Usage
Plot Interpretation
Method reference
Basics
Principles of cluster analysis
Nonhierarchical clustering
Hierarchical clustering
HCA linkage methods
Distance measures
Quality of the clustering
Main results of cluster analysis
24.2.1 Basics
A valuable tool for exploratory data analysis is the use of cluster analysis to understand the
natural grouping of objects. Cluster analysis is an unsupervised methodology for grouping
objects based on their similarity with respect to specified characteristics (variables). It grew
out of work by biologists working on numerical taxonomy, and is a valuable visualization
tool in data mining. One can perform clustering using either partitional methods (K-means
or K-medians clustering) or agglomerative hierarchical clustering with different linkage
measures (single-linkage, complete-linkage, average-linkage, median-linkage, etc.).
Agglomerative methods begin by treating each sample as a single cluster and merge
clusters based on their similarity until one large cluster is formed.
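The agglomerative procedure can be sketched for the single-linkage case as follows (an O(n³) illustration; real implementations are far more efficient):

```python
import numpy as np

def single_linkage(X, n_clusters):
    """Naive agglomerative clustering with single (nearest-neighbor)
    linkage: start with one cluster per sample and repeatedly merge the
    two clusters with the smallest minimum pairwise distance."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # Euclidean distances
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = min(d[a, b] for a in clusters[i] for b in clusters[j])
                if dist < best:
                    best, pair = dist, (i, j)
        i, j = pair
        clusters[i] += clusters.pop(j)               # merge the closest pair
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
groups = single_linkage(X, n_clusters=2)
```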
Although cluster analysis is usually performed to find patterns among objects (termed Q
mode), it may also be applied to find similarities among the variables (R mode). This can be
achieved by transposing the data matrix so that the rows correspond to variables.
K-means
K-medians
HCA single-linkage
HCA complete-linkage
HCA average-linkage
HCA median-linkage
Ward’s method
classification. The distance measure should ideally be chosen based on the application
domain and on whether the distance or similarity measure has a real-world interpretation.
Note that not all distances fulfill the triangle inequality. The triangle inequality for a metric
holds if the sum of any two sides of a triangle is at least the length of the third. If it does not
hold, the resulting dendrograms in hierarchical clustering can be deformed.
With hierarchical clustering, a dendrogram is generated as a result, based on the distances
between samples. There are several methods by which the distances between clusters are
defined when using one of the HCA options.
HCA linkage methods
HCA single-linkage: The single-linkage (also called nearest neighbor) measure, uses
the distance between the closest samples to define a cluster. The method tends to
make large clusters and does not provide a very good classification of groups that
differ, but are not well separated. This method tends to produce elongated clusters.
HCA complete-linkage: This is also known as the farthest-neighbor method, and uses
the greatest distance between any two samples as the basis of the clustering.
Clusters from the complete-linkage method are more compact and rounded.
HCA average-linkage: The average linkage is a compromise between the single- and
complete-linkage, based on the average distance between samples for the
clustering.
HCA median-linkage: The median (or centroid) linkage is very similar to the average-
linkage method, and uses the geometrical distance between a cluster and the
weighted center of gravity of the other groups.
Ward’s method: Ward’s method aims to cluster samples so as to maximize the
homogeneity of the groups. At each step, the two clusters whose merger gives the
smallest increase in heterogeneity are joined.
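As an illustration of how these linkage choices behave, the following sketch clusters the same small data set with each method. This uses SciPy, not The Unscrambler® itself, and the data are invented for the example.

```python
# Hypothetical example data: two well-separated groups in two variables.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(6, 1, (10, 2))])

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                    # build the cluster tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut it into 2 clusters
    print(method, labels)
```

With well-separated groups all linkage methods agree; on elongated or overlapping groups they begin to differ in the ways described above.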
Distance measures
For all hierarchical clustering methods, a distance measure must be chosen to quantify the
similarity between samples. A sample is then assigned to the group to which it is closest.
HCA results are displayed as a dendrogram plot, which is a depiction of the clustering of
samples into sets and subsets, along with the threshold distances between samples and
clusters.
In The Unscrambler® there are many distance measures available for clustering.
Squared Euclidean distance
The squared Euclidean distance as a means of measuring similarity between clusters is
useful in cases where some feature (variable) may dominate the distance between groups,
and serves as a type of normalization to the data.
Euclidean distance
This is the most common, “natural” and intuitive way of computing a distance between two
samples. It takes into account the difference between two samples directly, based on the
magnitude of changes in the sample levels. This distance type is usually used for data sets
that are suitably normalized or without any special distribution problem.
City-block distance
Also known as Manhattan distance, this distance measurement is especially relevant for
discrete data sets. While the Euclidean distance corresponds to the length of the shortest
path between two samples (i.e. “as the crow flies”), the Manhattan distance refers to the
sum of distances along each dimension (i.e. “walking round the block”).
Pearson correlation distance
This distance is based on the Pearson correlation coefficient that is calculated from the
sample values and their standard deviations. The correlation coefficient r takes values from
–1 (large, negative correlation) to +1 (large, positive correlation). Effectively, the Pearson
distance dp is computed as

dp = 1 − r

and lies between 0 (when the correlation coefficient is +1, i.e. the two samples are most
similar) and 2 (when the correlation coefficient is −1). Note that the data are centered by
subtracting the mean, and scaled by dividing by the standard deviation.
Absolute Pearson correlation distance
In this distance, the absolute value of the Pearson correlation coefficient is used; hence the
corresponding distance lies between 0 and 1, just like the correlation coefficient. The
equation for the absolute Pearson distance da is

da = 1 − |r|

Taking the absolute value gives equal meaning to positive and negative correlations, so
anti-correlated samples will be clustered together.
Uncentered correlation distance
This is the same as the Pearson correlation, except that the sample means are set to zero in
the expression for uncentered correlation. The uncentered correlation coefficient lies
between –1 and +1; hence the distance lies between 0 and 2.
Kendall’s tau distance
This distance is based on Kendall’s rank correlation coefficient, computed from the
rankings of the values in the two samples as

τ = (nc − nd) / (n(n − 1)/2)

where
nc = number of concordant rank pairs
nd = number of discordant rank pairs
and n is the number of ranked values. The corresponding distance, 1 − τ, lies between 0
(identical rankings) and 2 (reversed rankings).
Chebyshev distance
The Chebyshev, or maximum value, distance is the greatest absolute difference between
the coordinates of a pair of objects. This distance measure may be best in cases
where the difference between points is best reflected by individual dimension differences,
and not by all the dimensions considered together. Note that the Chebyshev distance is very
sensitive to outlying measurements.
Bray-Curtis distance
This value, also referred to as the Bray-Curtis dissimilarity, or the Sorenson distance, is
commonly used in ecology, biology and oceanography studies for quantifying dissimilarity
between populations.
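The distance measures above are standard and can be reproduced with SciPy for two sample vectors. This is a sketch for illustration only; the vectors are invented, and SciPy’s `correlation` distance corresponds to the Pearson distance 1 − r.

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 2.5, 5.0])

print("Euclidean:        ", distance.euclidean(x, y))
print("Squared Euclidean:", distance.sqeuclidean(x, y))
print("City-block:       ", distance.cityblock(x, y))    # sum of per-variable differences
print("Pearson (1 - r):  ", distance.correlation(x, y))  # centered and scaled
print("Chebyshev:        ", distance.chebyshev(x, y))    # largest single difference
print("Bray-Curtis:      ", distance.braycurtis(x, y))
```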
Ward’s Method
This method is a minimum-variance hierarchical clustering method and uses an analysis of
variance approach to evaluate the distances between clusters. At each step of the analysis,
Ward’s method merges the two clusters whose fusion gives the smallest increase in the
within-cluster Sum of Squares (SS).
The Sum of Distances (SOD) is the sum of the distances between each sample and its
respective cluster centroid, summed over all k clusters. This parameter is uniquely
calculated for a particular set of cluster IDs resulting from a cluster calculation. The results
from different cluster analyses are compared based on their Sum of Distances values; the
solution with the lowest Sum of Distances is a good indicator of an acceptable cluster
assignment. It is therefore recommended to initiate the analysis with a small Iteration
Number, say 10 for a sample set of 500, and proceed towards higher Iteration Numbers to
obtain an optimal cluster solution. Once an optimal (lowest) Sum of Distances has been
obtained, it is unlikely to decline further when the Iteration Number is set to higher values.
The cluster ID assignment with the optimal Sum of Distances is considered the most
appropriate result. The results for non-hierarchical methods present just the class ID as a
numerical value, without giving the SOD values.
Note: Since the first step of the K-means algorithm is based on a random
distribution of the samples into k different clusters, there is a good possibility that
the final clustering solution will not be exactly the same in every run for a fairly
large sample data set.
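The effect of the iteration number can be sketched with scikit-learn’s K-means (an illustration on invented data, not Unscrambler code): `n_init` plays the role of the Iteration Number, and `inertia_` (the sum of squared distances to the cluster centroids) plays the role of the Sum of Distances criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])  # invented data

for n_init in (1, 10, 50):
    # The best of n_init random starts is kept; more starts can only
    # lower (never raise) the final objective.
    km = KMeans(n_clusters=2, n_init=n_init, random_state=0).fit(X)
    print(f"n_init={n_init:3d}  sum of squared distances={km.inertia_:.2f}")
```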
For hierarchical cluster analysis, the results of the clustering are a column matrix with a
category variable (0, 1, 2, …) for the class, as well as a dendrogram, which is a plot of the
clusters vs. the relative distance between the clusters.
Tip: Before performing a cluster analysis, it is helpful to determine if the data being
considered exhibits any tendency to cluster. This can be done by doing a PCA over
the data to see if there are any groupings which could then form the basis of
clusters.
24.3.1 Inputs
To run a cluster analysis:
Choose the data to be clustered by defining the matrix and range to be clustered.
The data selected must not have any missing values. There must be at least two
samples and two variables to perform a cluster analysis.
Decide the number of clusters or categories to be identified (Default: 2 clusters).
Choose clustering method (Default: K-means).
Choose distance criterion (Default: Squared Euclidean).
Defining the centers of the clusters based on prior knowledge can force a better
solution.
For each cluster one can either enter a range of sample indexes by typing or through
the selection dialog.
Individual sample indexes can be comma separated, while ranges can be indicated
with hyphens. For example 1-5,7.
24.3.3 Results
When a cluster analysis has been performed, a new node, Cluster analysis, is added to the
project navigator with a folder for results and one for plots (if hierarchical clustering has
been used). The node may be renamed by right clicking on it and selecting Rename. A typical
entry is shown below.
Cluster analysis results node
The results folder contains the matrix Range_Classified, which has the raw data used for
clustering, and an additional column for the class. Row sets are also created, one for each
cluster that has been identified. The column set Class has the numerical identifiers for each
sample, as can be seen below.
Cluster analysis class ID
Dendrogram
24.4.1 Dendrogram
A dendrogram (from Greek dendron “tree”, -gramma “drawing”) is a tree diagram
frequently used to illustrate the arrangement of the clusters produced by hierarchical
clustering.
Depending on the selected number of clusters, the sample names will be displayed by
cluster color. In the following example three clusters were selected, hence the plot has three
groups of samples shown in different colors. The clusters are separated based on the
distance between clusters.
Dendrogram plot
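A comparable dendrogram can be generated outside The Unscrambler®, for instance with SciPy and matplotlib. This sketch uses invented data with three groups.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(2)
# Three invented groups of five samples each
X = np.vstack([rng.normal(i * 4, 1, (5, 2)) for i in range(3)])

Z = linkage(X, method="average", metric="euclidean")
dendrogram(Z)                 # samples on the x-axis, merge distance on the y-axis
plt.ylabel("Distance")
plt.savefig("dendrogram.png")
```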
25. Projection
25.1. Projection
Latent space models project the data into new spaces. This is done by multiplying the new
data with the loading vectors. This approach is applicable to PCA, PCR and PLS regression
models.
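In matrix terms, projection is simply t_new = (x_new − mean) P, where P holds the loading vectors of the existing model. A minimal NumPy sketch (invented data, PCA via SVD; not Unscrambler code):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 5))            # calibration data (invented)
mean = X.mean(axis=0)
Xc = X - mean
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T                            # loadings of a 2-component PCA model

X_new = rng.normal(size=(4, 5))         # new samples, same 5 variables
T_new = (X_new - mean) @ P              # projected scores
print(T_new.shape)                      # 4 samples, 2 components
```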
Theory
Usage
Plot Interpretation
Method reference
Basics of projection
Sample comparison after a change
Detection of time shifts
Using projection to validate a process with a new test set
How to interpret projected samples
Sample comparison or detection of time shifts
Validation with new test data
By projecting the data for new samples onto the PCA model based on product
produced with the existing supplier, one can see if the product properties are
impacted by the change in raw material supplier.
Has the product quality changed after a piece of equipment was repaired?
How do samples produced in factory B compare to samples from factory A?
To make this comparison one can project the new samples (e.g. from factory B) onto a PCA
of the reference samples (e.g. factory A), and see if they overlap in the scores plot.
Detection of time shifts
A model was developed one year ago. Are today’s samples still well described by the model?
Projecting new samples onto the one-year old PCA model provides information about
whether there has been a drift in sample distribution, change in the average scores,
increased spread, larger residuals, etc.
Using projection to validate a process with a new test set
In the initial stages of a process development few samples may exist and methods such as
cross-validation may be the only viable way of developing a first interpretive model. As more
experience and data are gathered from the process, these data can be used as a test set,
without recomputing the original model.
The initial PCA model may also have been developed by another scientist or engineer and
the original data may not be available to run a more complete PCA. This is not a problem for
projection, as long as the new data were collected for the same variables as the original PCA
model.
Projecting the new samples onto the existing model and checking residual variances and
leverages will allow one to determine whether the model is valid for the new samples.
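These checks can be sketched as follows, on invented data. The leverage here uses the common formula of 1/n plus the squared scores scaled by the calibration score sums of squares; this form is an assumption for illustration rather than a statement of the software's exact computation.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 6))                 # calibration data (invented)
mean = X.mean(axis=0)
Xc = X - mean
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
A = 2                                        # number of components retained
P = Vt[:A].T
T = Xc @ P                                   # calibration scores

x_new = rng.normal(size=6)                   # one new sample
t_new = (x_new - mean) @ P                   # projected scores
residual = (x_new - mean) - t_new @ P.T      # part not explained by the model
res_var = residual @ residual / (len(residual) - A)
leverage = 1 / len(X) + np.sum(t_new**2 / np.sum(T**2, axis=0))
print(f"residual variance = {res_var:.3f}, leverage = {leverage:.3f}")
```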
The main difference compared to standard PCA results is that the variance plot now depicts
Calibration, Validation and Projection. Also, the projected samples are shown in the scores
plot. The following plots are relevant for the new samples:
Scores.
Variances.
Residuals.
Leverages and Hotelling’s T².
The influence plot helps one detect whether some of the projected samples are badly
described by the model or far away from the center.
The Hotelling’s T² ellipse can be plotted in the scores plot, with a critical limit that can be
tuned up or down by varying the p-value between 0.1% and 25%. These limits show which of
the projected samples can be “rejected” by the model (outside the limit). If the proportion
of “rejected” samples is larger than the chosen p-value, one may conclude that there is a
difference between the original samples and the projected samples as a whole.
Validation with new test data
Compare the Projection variance curve to the Calibration and Validation curves. If they are
similar, one can consider the model validated by the projected samples. The diagram below
provides an example of a well chosen calibration and validation set of data using the method
of PCA projection.
Refer to the chapter on How to Interpret PCA Scores and Loadings for more details.
To run a projection, a project must be opened containing either a PCA or regression model
(MLR is not included in this case). If this is not the case, the following warning will be
provided.
Solution: Ensure that a PCA, PCR or PLSR model is available for projection.
Data Input
The following dialog boxes are available to input data.
Select Model
Choose the model (PCA, PCR, PLSR) to be used for projection from those available in
the project navigator.
Components
Allows the user to choose the number of components to use for projection.
Data
Matrix: Allows the user to select the matrix containing the data to be projected onto
the model. The data can be a new matrix, or a subset of the data used to generate
the model.
Use the Rows and Columns drop-down lists to define the samples and variables to
be projected.
If the variable dimensions of the new data set do not match those of the model, The
Unscrambler® will provide a warning to adjust this. This warning is shown below. A
data set of equivalent dimension must be chosen. It must not contain any non-
numeric or missing values.
New data set does not have same dimensions of original model
Solution: Ensure that the data set to be projected has the same range as that used in the
original model.
Other warnings associated with the Data input dialog box include the following:
Too many samples or variables excluded
Solution: Ensure that enough samples or variables are present for analysis.
Non-numeric data
Solution: Ensure that the data set only contains numerical values.
Note: When a model has been developed and is to be used for projection, it
is important to define the variable ranges in the new data table so that they
match the dimensions of the original model.
Click on OK to perform the projection.
X-Loadings
Influence
Residual/explained variance
Variances
Scores
Loadings
Residuals
Leverage/Hotelling’s T²
Plots accessible from the Projection menu
Projection overview
Variances
Scores
Line
2-D
3-D
Loadings
Line
2-D
3-D
Residuals
Influence Plot
Variance per Sample
Sample Residuals
Leverage/Hotelling’s T²
Leverage
Line
Matrix
Hotelling’s T²
Line
Matrix
Scores
This is a two-dimensional scatter plot (or sample map) of scores for two specified
components (PCs) from Projection results. The original samples used to develop the PCA
model are displayed in blue, the new projected samples in green. Use this plot to check how
close the projections of the new samples are to the original samples.
Projection of samples in a scores plot
In the above plot, most of the projected samples (green) fall within the two groups defined
by the model samples (blue). There are a group of four samples that lie outside the main
population in the region defined by samples M62 and H59. It may be important to check
whether these are outliers, or just unique samples.
X-Loadings
The default X-loadings plot is a two-dimensional scatter plot of the loadings for two
specified components. Use this plot to detect important variables. The plot is most useful
for interpreting components 1 vs. 2, since they represent the largest variations in the X-data.
It must be interpreted together with the corresponding scores plot. Variables with high X-
loadings to the right of the plot relate to samples to the right in the scores plot, etc.
Loadings may also be displayed as line plots. These are useful when interpreting the results
generated from spectral data.
Influence
This plot displays the sample residual X-variances against leverages for the projected
samples at a given number of PCs. The original samples used to develop the PCA model are
displayed in blue, the new projected samples in green. Samples with a high residual variance
are poorly described by the original model. Samples with a high leverage are projected far
from the center of the original model. A sample with both high residual variance and high
leverage usually represents a highly influential outlier, i.e. it is not well described by the
model it is projected onto and it distorts the model to itself. In this case, the model only
describes why the influential sample is so different from the rest of the population.
Influence in projection
Residual/explained variance
This plot gives an indication of how much of the variation in the data is described by the
different components.
Total residual variance is computed as the sum of squares of the residuals for all the
variables, divided by the number of degrees of freedom.
Total explained variance is then computed as:

Total explained variance (%) = 100 × (total variance − total residual variance) / total variance
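These two quantities can be sketched numerically on invented data; here both are expressed as sums of squares, so the explained variance comes out as a percentage of the total variation.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(25, 4))                 # invented data
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

total_ss = np.sum(Xc**2)                     # total variation in the data
for a in range(1, 4):
    recon = (U[:, :a] * s[:a]) @ Vt[:a]      # model with a components
    residual_ss = np.sum((Xc - recon) ** 2)  # unexplained variation
    explained = 100 * (total_ss - residual_ss) / total_ss
    print(f"{a} components: explained variance = {explained:.1f}%")
```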
Variances
For information on this plot check the Projection Overview section
Scores
For information on this plot check the Projection overview section
Loadings
For information on this plot check the Projection overview section
Residuals
Residuals can be plotted either as Residual Sample Variance or as Sample Residuals.
Examples of these plots are shown below.
The residual sample variance displays the per sample variation compared to the projected
model and the sample residuals show the variance associated with each variable, for a
particular sample.
Leverage/Hotelling’s T²
Leverage
Line
This is a plot of score values vs. sample number for a specified component. Although it is
usually better to look at 2-D or 3-D scores plots because they contain more information, this
plot can be useful whenever the samples are sorted according to the values of an underlying
variable, e.g. time, to detect trends or patterns.
Also look for systematic patterns, like a regular increase or decrease, periodicity, etc. (only
relevant if the sample number has a meaning, like time for instance).
2-D
For information on this plot check the Interpreting Projection plots section
3-D
This is a 3-D scatter plot or map of the scores for three specified components from PCA. The
plot gives information about patterns in the samples and is most useful when interpreting
components 1, 2 and 3, since these components summarize most of the variation in the
data. It is usually easier to look at 2-D scores plots but if three components are needed to
describe enough variation in the data, the 3-D plot is a practical alternative.
Scores plot in 3-D
Like with the 2-D plot, the closer the samples are in the 3-D scores plot, the more similar
they are with respect to the three components.
The 3-D plot can be used to interpret differences and similarities among samples. Look at
the scores plot and the corresponding loadings plot, for the same three components.
Together they can be used to determine which variables are responsible for differences
between samples. Samples with high scores along the first component usually have large
values for variables with high loadings along the first component, etc.
For information about what to look for in a scores plot check the information in the 2-D
scores plot section
Loadings
Line
This is a plot of X-loadings for a specified component vs. variable number. It is useful for
detecting important variables. In many cases it is usually better to look at two- or three-
vector loadings plots instead because they contain more information.
Line plots are most useful for multichannel measurements, for instance spectra from a
spectrophotometer, or in any case where the variables are implicit functions of an
underlying parameter, like wavelength, time, etc.
Loading line plot
The plot shows the relationship between the specified component and the different X-
variables. If a variable has a large positive or negative loading, this means that the variable is
important for the component concerned. For example, a sample with a large score value for
this component will have a large positive value for a variable with large positive loading.
2-D
For information on this plot check the Interpreting Projection plots section
3-D
This is a three-dimensional scatter plot of X-loadings for three specified components from
the original PCA model. The plot is most useful for interpreting directions, in connection to a
3-D scores plot. Otherwise it is recommended to use line- or 2-D loadings plots.
Loadings plot in 3-D in projection
Residuals
Influence Plot
For information on this plot check the Interpreting Projection plots section
Samples with small residual variance (or large explained variance) for a particular
component are well explained by the corresponding model, and vice versa.
Sample Residuals
This is a plot of the residuals for a specified sample and component number for all the X-
variables. It is useful for detecting outlying sample or variable combinations. Although
outliers can sometimes be modeled by incorporating more components, this should be
avoided since it will reduce the prediction ability of the model.
Bar plot of the sample residuals
In contrast to the variable residual plot, which gives information about residuals for all
samples for a particular variable, this plot gives information about all possible variables for a
particular sample. It is therefore useful when studying how a specific sample fits to the
model.
To change the displayed sample, use the Sample drop-down list.
To change the PC plotted, use the arrow tools.
Leverage/Hotelling’s T²
Leverage
Line
Leverages are useful for detecting samples which are far from the center within the space
described by the model. Samples with high leverage differ from the average samples; in
other words, they are likely outliers. A large leverage also indicates a high influence on the
model.
Leverage plot in projection
The absolute leverage values are always larger than zero, and can (in theory) go up to 1. As
a rule of thumb, samples with a leverage above 0.4–0.5 begin to be a concern.
Influence on the model is best measured in terms of relative leverage. For instance, if all
samples have leverages between 0.02 and 0.1, except for one, which has a leverage of 0.3,
although this value is not extremely large, the sample is likely to be influential.
For a critical limit on the leverages, look up the Hotelling’s T² line plot.
Matrix
This is a matrix plot of leverages for all samples and all model components. The X-axis
represents the components and the Y-axis the samples. The color represents the Z-value
which is the leverage, the color scale can be customized. It is a useful plot for studying how
the influence of each sample evolves with the number of components in the model. The
leverages can also be displayed as Hotelling’s T² statistics.
Leverage matrix plot in projection
Hotelling’s T²
Line
The Hotelling’s T² plot is an alternative to plotting sample leverages. The plot displays the
Hotelling’s T² statistic for each sample as a line plot. The associated critical limit (with a
default p-value of 5%) is displayed as a red line.
The Hotelling’s T² limit at 5% defines a distance from the model within which 95% of the
samples belonging to the model should fall. The samples outside this limit are likely to be
outliers. However, remember that 5% of the samples belonging to the model can be
outside.
Hotelling’s T² plot in projection
In the above plot, some samples have a Hotelling’s T² statistic higher than the 5% limit for a
model including the number of PCs necessary for an explanatory model. Hence those
samples are likely to be outliers.
The Hotelling’s T² statistic has a linear relationship to the leverage for a given sample. Its
critical limit is based on an F-test. Use it to identify outliers or detect situations where a
process is operating outside normal conditions. There are six different significance levels to
choose from using the drop-down list.
Tune the number of PCs up or down as desired with the arrow tools.
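The F-test based limit can be sketched as follows, with invented scores. The limit formula A(n−1)/(n−A)·F(1−α; A, n−A) is the commonly used form for PCA scores and is assumed here rather than taken from the manual.

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(6)
n, A = 40, 2                                  # calibration samples, components
T = rng.normal(size=(n, A))                   # invented score matrix
score_var = np.sum(T**2, axis=0) / (n - 1)    # variance of each component's scores
T2 = np.sum(T**2 / score_var, axis=1)         # Hotelling's T^2 per sample

alpha = 0.05                                  # the default 5% significance level
limit = A * (n - 1) / (n - A) * f.ppf(1 - alpha, A, n - A)
print(f"T^2 limit at {alpha:.0%}: {limit:.2f}")
print("samples outside the limit:", int(np.sum(T2 > limit)))
```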
Matrix
This is a matrix plot of Hotelling’s T² statistics for all projected samples and all model
components. It is equivalent to the matrix plot of leverages, to which it has a linear
relationship. The Y-axis represents the components and the X-axis the samples. The color
represents the Z-value which is the Hotelling’s T² statistic for a specific PC and sample, the
color scale can be customized.
Hotelling’s T² matrix plot in projection
26. SIMCA
26.1. SIMCA classification
Soft Independent Modeling of Class Analogy (SIMCA) is based on making a PCA model for
each class in the training set. Unknown samples are then compared to the class models and
assigned to classes according to their proximity to the training samples.
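The core idea can be sketched in a few lines of NumPy. This is an illustration only, with invented classes and helper names; The Unscrambler® additionally applies statistical limits for Si and Hi rather than a bare minimum-distance rule.

```python
import numpy as np

def fit_pca(X, a):
    """Mean-center X and keep the first a loading vectors."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:a].T

def residual_distance(x, model):
    """Distance from sample x to a class model (norm of the residual)."""
    mean, P = model
    xc = x - mean
    r = xc - (xc @ P) @ P.T          # part not explained by the class model
    return np.sqrt(r @ r)

rng = np.random.default_rng(7)
classes = {"A": rng.normal(0, 1, (20, 4)), "B": rng.normal(5, 1, (20, 4))}
models = {name: fit_pca(Xk, a=2) for name, Xk in classes.items()}

x_new = rng.normal(5, 1, 4)          # an unknown sample drawn near class B
best = min(models, key=lambda name: residual_distance(x_new, models[name]))
print("assigned to class:", best)
```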
Theory
Usage
Plot Interpretation
Method reference
Model results
For each pair of models, the model distance between the two models is computed. This
gives a measure of how separable the class models are. A distance larger than three
indicates good class separation.
Variable results
Modeling power (of one variable in one model) is a measure of the relevance of a variable to
a model. It has a value between 0 and 1, with a value of 1 signifying importance. Variables
with modeling power less than about 0.3 are of little importance to a model.
Discrimination power (of one variable between two models) is a measure of how useful a
variable is in discriminating between two classes. Discrimination power of ~ 1 indicates no
discriminating power, while a value greater than ~ 3 indicates good discrimination for a
given variable.
Sample results
Si = object-to-model distance (of one sample to one model) is a measure of how far a sample
is from a modeled class.
Hi = leverage (of one sample to one model). Hi describes how different a sample is from
other class members.
Model distance
This measure (which could more accurately be called “model-to-model distance”) shows how
different two (or more) models are from each other. It is computed from the results of
fitting all samples from each class to their own model and to the other ones being used to
classify new samples.
The value of this measure should be compared to 1, i.e. the distance of a model to itself. A
model distance much larger than 1 (for instance, 3 or more) shows that the two models are
quite different, which in turn implies that the two classes are likely to be well distinguished
from each other.
Modeling power
Modeling power is a measure of the influence of a variable on a given model. It is
computed as

Modeling power = 1 − (residual standard deviation of the variable after fitting the model) / (initial standard deviation of the variable)

so that a value close to 1 indicates a variable that is well described by the model.
Discrimination power
The discrimination power of a variable indicates the ability of that variable to discriminate
between two classes. Thus, a variable with a high discrimination power (with regard to two
particular models) is very important for the differentiation between the two corresponding
classes.
Like model distance, this measure should be compared to 1 (no discrimination power at all);
variables with a discrimination power higher than 3 can be considered quite important.
Si vs. Hi plot
This plot is a graphical tool used to view the sample-to-model distance (Si) and sample
leverage (Hi) for a given model at the same time. It includes the class membership limits for
both measures, so that samples can easily be classified according to that model by checking
whether they fall inside both limits.
An equivalent plot in PCA is the influence plot (refer to section on the influence plot in the
chapter on PCA).
Coomans’ plot
This is an “Si vs. Si” plot, where the sample-to-model distances are plotted against each
other for two models. It includes class membership limits for both models, so that one can
see whether a sample is likely to belong to one class, or both, or none. This is an orthogonal
distance measure, therefore, samples can be plotted along orthogonal axes. If any two class
models share a space around the origin of the Coomans’ plot then there is a high likelihood
that the PCA models will not discriminate between the two classes.
Use this option to mean center the data to be classified, prior to the classification
process. The default is that this option is checked.
Use components
Use this option to vary the number of components to be included in each model. As
a general rule, this should always be set to the number of principal components/
factors found to be optimal during model development process.
Some important tips and warnings associated with the Model Inputs tab
If the data are pretreated before building the PCA model and the pretreatment ranges differ
from the model building range, SIMCA will ask the user to select all variables in the data.
In the event that there is no valid model present in the project navigator, the following
warning will be provided.
No valid model present for classification
Solution: Either create a model using a training set of data, or import an existing model from
another project.
When non-numeric values are present in a new data set for classification, the following
warning will be provided.
Non-numeric values in data set warning
Solution: Ensure the data set being classified only contains numerical values.
The diagram below provides an example of a completed dialog box.
Completed SIMCA dialog
Click on OK to run the classification on the data selected. A new node named SIMCA will
appear in the project navigator providing all of the model details and associated plots in the
three folders: raw data, results, and plots. The node can be renamed by selecting it, right
clicking and selecting Rename. By right clicking, one also has the option to hide the plots.
Look for samples that are not recognized by any of the classes, or those that are allocated to
more than one class.
Classification table
Coomans’
This plot shows the orthogonal distances from the new objects to two different classes
(models) at the same time. The membership limits (S0) are indicated. Membership limits
reflect the significance level used in the classification.
The two models can be changed to study other pairs of models using the model selection
tool in the toolbar.
The significance level for Hi can be adjusted using the significance drop-down list; there are
six different levels, the default value being 5%.
Coomans’ Plot
Samples that fall within the membership limit of a class are recognized as members of that
class. Different colors denote different types of sample: new samples being classified,
calibration samples for the model along the abscissa (A) axis, calibration samples for the
model along the ordinate (B) axis, as shown in the figure above.
Si vs. Hi
This plot is a graphical tool used to get a view of the sample-to-model distance (Si) and
sample leverage (Hi) for a given model at the same time. It includes the class membership
limits for both measures, so that samples can easily be classified according to that model by
checking whether they fall inside both limits.
The displayed results can be changed using the model selection tool in the toolbar.
Si vs. Hi
In the above plot the samples that will be classified as Setosa are the ones in the bottom left
corner defined by the two limits Si and Hi. The other samples will not be classified in this
group.
The significance level for Hi can be adjusted using the significance drop-down list; there are
six different levels, the default value being 5%.
Si/S0 vs. Hi
The Si/S0 vs. Hi plot shows the two limits used for classification: the relative distance from
the new sample to the model (residual standard deviation) and the leverage (distance from
the new sample to the model center).
Si/S0 vs. Hi
In the above plot the samples that will be classified as Setosa are the ones in the bottom left
corner defined by the two limits Si and Hi. The other samples will not be classified in this
group.
The displayed results can be changed using the model selection tool in the toolbar.
The significance level for Hi can be adjusted using the significance drop-down list; there are
six different levels, the default value being 5%.
Model Distance
This plot shows the distances between different models. It is possible to compare different
models using the buttons in the toolbar. A distance larger than three indicates good class
separation and that the models are different.
Model Distance
It is clear from the plot that the other models are very different from the Setosa model. The
closest one is Versicolor, with a distance of around 20.
Discrimination Power
This plot shows how much each variable contributes to separating two models.
It is possible to see a different pair of models using the buttons in
the tool bar.
Discrimination Power
In the above plot, the two models under study are Setosa and Virginica. The variable with
the highest discrimination power between these two classes is petal width.
Modeling Power
This plot shows how much the variables contribute to the model.
Variables with a modeling power near one are important for the model. A rule of thumb is
that variables with modeling power less than 0.3 are of little importance for the model.
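This rule of thumb can be expressed as a small illustrative helper (hypothetical, not part of the software), which keeps the indices of variables whose modeling power is at or above the threshold:

```python
import numpy as np

def important_variables(modeling_power, threshold=0.3):
    """Apply the rule of thumb: variables with modeling power below
    `threshold` (default 0.3) are of little importance for the model."""
    mp = np.asarray(modeling_power, dtype=float)
    return np.flatnonzero(mp >= threshold)
```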
Modeling Power
The above plot shows that three of the variables have a modeling power larger than 0.3,
which means that these variables are important for describing the model. Since petal width
does not have a very high power, it could be deleted from the modeling.
It is possible to look at the modeling power for all the tested models using the drop-down
list in the toolbar.
27. Linear Discriminant Analysis
27.1. Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is the simplest of all possible classification methods that
are based on Bayes’ formula. The objective of LDA is to determine the best fit parameters
for classification of samples by a developed model. The model can then be used to classify
unknown samples. It is based on the normal distribution assumption and the assumption
that the covariance matrices of the two (or more) groups are identical.
Theory
Usage: Create model
Usage: Classification
Results
Method reference
Basics
Data suitable for LDA
Purposes of LDA
Main results of LDA
LDA application examples
How to interpret LDA results
Using an LDA model for classification of unknowns
27.2.1 Basics
LDA is the simplest of all possible classification methods that are based on Bayes’ formula.
From Bayes’ rule one develops a classification model assuming the probability distribution
within all groups is known, and that the prior probabilities for groups are given, and sum to
100% over all groups. It is based on the normal distribution assumption and the assumption
that the covariance matrices of the two (or more) groups are identical. This means that the
variability within each group has the same structure. The only difference between groups is
that they have different centers. LDA considers both within-group variance and between-
group variance. The estimated covariance matrix for LDA is obtained by pooling covariance
matrices across groups.
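As an illustrative sketch of the theory above (not The Unscrambler® implementation), a linear discriminant rule using a pooled covariance matrix under the identical-covariance assumption can be written as:

```python
import numpy as np

def lda_fit(X, y):
    """Fit a linear discriminant model with a pooled covariance matrix."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    n = len(y)
    # Pool the within-group covariance across groups (equal-covariance assumption).
    pooled = sum(np.cov(X[y == c], rowvar=False) * (np.sum(y == c) - 1)
                 for c in classes)
    pooled /= (n - len(classes))
    priors = {c: np.mean(y == c) for c in classes}  # priors from the training set
    return classes, means, np.linalg.inv(pooled), priors

def lda_predict(model, X):
    """Assign each sample to the class with the highest discriminant score."""
    classes, means, inv_cov, priors = model
    scores = []
    for c in classes:
        m = means[c]
        w = inv_cov @ m
        b = -0.5 * m @ inv_cov @ m + np.log(priors[c])
        scores.append(X @ w + b)
    return classes[np.argmax(scores, axis=0)]
```

Note that this sketch requires more samples per group than variables, as stated below, so that the pooled covariance matrix is invertible.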
When the variability of each group does not have the same structure (unequal covariance
matrix), the shape of the curve separating groups is not linear, and therefore quadratic
discriminant analysis will provide a better classification model. The distance of observations
from the center of the groups can also be measured using the Mahalanobis distance.
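The Mahalanobis distance mentioned above can be computed as follows (illustrative helper; the group center and covariance matrix are assumed given):

```python
import numpy as np

def mahalanobis(x, center, cov):
    """Mahalanobis distance of observation x from a group center,
    accounting for the covariance structure of the group."""
    d = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))
```

With an identity covariance matrix this reduces to the ordinary Euclidean distance.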
Note: For an LDA to be performed, the number of samples within each category
must be more than the number of variables.
When PCA-LDA is used, the results also include a matrix of Loadings and the Grand Mean
Matrix.
Classified_Range
Here the probabilities for each sample to belong to a group are given, and
classification is made based on the highest probability of membership.
Linear
Quadratic
Mahalanobis
The default setting assumes equal prior probabilities for class membership or 1/G where G is
the number of groups in the data set. The user has the option of having the software
calculate prior probabilities of class membership based on the training samples.
27.3.1 Inputs
One begins by defining the data matrix to be used for the predictors, and then that to be
used for the classifications. This can be part of the same data matrix, but the classifications
must have category variables in a single column.
Linear Discriminant Analysis Inputs
Begin by defining the data matrix for the predictors and the classifiers from the drop-down
list. For the matrix, the rows and columns to be included in the computation are then
selected. The X values (descriptors) should be numerical data and should not contain missing
values. There must be more samples in each class than there are variables in order to develop an LDA
classification model. The Y data (classification) must be a single column of category values,
and contain the same number of rows as the descriptors, with no missing values.
If new data ranges need to be defined, choose new or Edit from the drop-down list next to
Rows and/or Cols. This will open the Define Range editor where new ranges can be defined.
The classification matrix to define is that containing the category data, and must have a
single column only. This may be the same matrix as given in Predictors or another, but must
have the same number of rows as the first, and have only a single column of data, with no
missing values. If the appropriate selection is not made for the classifier, the following
warning will be displayed.
Linear Discriminant Analysis Input Warnings
27.3.2 Weights
Weights can be set for individual variables in an analysis. The variables can be selected from
the variable list table provided in the dialog by holding down the control (Ctrl) key and
selecting variables. Alternatively, the variable numbers can be manually entered into the
text dialog box. The Select button can be used (which will open the Define Range dialog
box), or every variable in the table can be selected by simply clicking on All.
Once the variables have been selected, to weight them, use the options in the Change
Selected Variable(s) dialog box, under the Select tab. The options include:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows the weighting of selected variables by predefined constant values.
Downweight
This allows the multiplication of selected variables by a very small number, such that
the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
Once the weighting and variables have been selected, click Update to apply them.
Linear Discriminant Analysis Weights
27.3.3 Options
Once the data to be used in modeling are defined, the method for the LDA is defined in the
Options tab.
Linear Discriminant Analysis Options
Three different methods for LDA are available under the Options tab:
Linear
Quadratic
Mahalanobis
The method chosen from the drop-down list will depend on the similarity of the different
classes to be discriminated. If the variability within the groups has the same structure, the
linear method may be used. Otherwise, the Quadratic or Mahalanobis method may model
the classes better, and can be chosen from the drop-down list.
The prior probabilities can also be set, either assuming equal prior probabilities, or by
calculating prior probabilities from the training set. With equal priors, the software uses 1/G,
where G is the number of groups in the data set; priors calculated from the training set are
proportional to the number of training samples in each group.
If a data set contains more variables than samples (e.g. spectral data), one can choose the
option of running a PCA-LDA. In this case a PCA with the number of components defined by
the user is run first on the data, and the LDA is performed using the PCA scores.
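The PCA-LDA sequence described above (PCA first, then LDA on the scores) can be sketched as follows. `pca_scores` is a hypothetical helper based on the singular value decomposition, not the software's own routine:

```python
import numpy as np

def pca_scores(X, n_components):
    """Project mean-centered data onto its first principal components.

    Returns the scores (used as LDA input) and the loadings
    (used to project new samples before classification)."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data: scores are U*S, loadings are the rows of Vt.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components], Vt[:n_components].T

# Typical use: scores, loadings = pca_scores(X, 3); the LDA is then fitted
# on `scores`, and new samples are projected with (Xnew - mean) @ loadings.
```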
27.3.4 Autopretreatment
The Autopretreatments tab allows a user to register the pretreatments used during the LDA
analysis, so that when future predictions are made, these pretreatments are automatically
applied to the new data, before the LDA equation is applied. The pretreatments become
part of the saved model.
Once the data matrix and parameters have been set, the LDA modeling is run by selecting
OK.
A new node, LDA, is added to the project navigator with a folder for Data, and another for
Results.
Click “OK” after all parameters have been set, and a new matrix with the LDA classification
results, Classified_Range will be created in the project navigator. This then shows the class
identifier, added as the column class, for the unknowns based on the LDA classification
model.
Two additional matrices are generated for PCA-LDA, including the Loadings and the Grand
Mean which are used in projection.
There is also a Discrimination plot that is created as a visual display of the LDA results.
Note: The Discrimination plot is only available in calibration.
LDA node
27.5.1 Prediction
The prediction matrix exhibits the discriminant value for each class, as well as the predicted
class for each sample. The predicted class is the class with the highest discriminant value.
Note that this value can be negative.
The confusion matrix carries information about the predicted and actual classifications of
samples, with each row showing the instances in a predicted class, and each column
representing the instances in an actual class.
In the confusion matrix below, all the “Setosa” samples are correctly attributed to the “Setosa”
group.
Two samples with actual value “Virginica” are predicted as “Versicolor”.
In the same way two samples with actual value “Versicolor” are predicted as “Virginica”.
Confusion matrix
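The row/column convention described above (rows = predicted classes, columns = actual classes) can be sketched with a hypothetical helper:

```python
import numpy as np

def confusion_matrix(predicted, actual, labels):
    """Build a confusion matrix with rows = predicted classes and
    columns = actual classes, following the convention in the text."""
    index = {lab: i for i, lab in enumerate(labels)}
    m = np.zeros((len(labels), len(labels)), dtype=int)
    for p, a in zip(predicted, actual):
        m[index[p], index[a]] += 1
    return m
```

Off-diagonal entries are the misclassifications, e.g. a count in the (Versicolor, Virginica) cell means a sample whose actual class is Virginica was predicted as Versicolor.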
Method reference: http://www.camo.com/helpdocs/The_Unscrambler_Method_References.pdf
27.7. Bibliography
D. Cozzolino, A. Vadell, F. Ballesteros, G. Galietta, N. Barlocco, Combining visible and near-
infrared spectroscopy with chemometrics to trace muscles from an autochthonous breed of
pig produced in Uruguay: a feasibility study, Anal. Bioanal. Chem., 385(5), 931-936 (2006).
C. Medina-Gutiérrez, J. Luis Quintanar, C. Frausto-Reyes, R. Sato-Berrú, The application of
NIR Raman spectroscopy in the assessment of serum thyroid-stimulating hormone in rats,
Spectrochimica Acta Part A, 61 (1-2), 87-91 (2005).
T. Næs, T. Isaksson, T. Fearn and T. Davies, A User-friendly Guide to Multivariate Calibration
and Classification, NIR Publications, Chichester, UK, 2002.
28. Support Vector Machine Classification
28.1. Support Vector Machine Classification (SVMC)
SVM is a classification method based on statistical learning. Sometimes, a linear function is
not able to model complex separations, so SVM employs kernel functions to map from the
original space to the feature space. The function can be of many forms, thus providing the
ability to handle nonlinear classification cases. The kernels can be viewed as a mapping of
nonlinear data to a higher dimensional feature space, while providing a computational
shortcut by allowing linear algorithms to work in the higher dimensional feature space.
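The kernel shortcut described above can be illustrated with the radial basis function kernel, which evaluates inner products in an implicit higher-dimensional feature space without ever forming that space explicitly (illustrative sketch):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Radial basis function kernel: k(x, y) = exp(-gamma * ||x - y||^2).

    Each entry is an inner product between the images of x and y in an
    implicit (infinite-dimensional) feature space, computed directly
    from the original data."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)
```

A linear algorithm that only needs inner products can thus operate in the feature space by replacing every dot product with a kernel evaluation.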
Theory
Usage: Create model
Results
Usage: Classification
Result interpretation
Method reference
the kernel. The figure below illustrates the principle of applying a kernel function to achieve
separability.
In this new space SVM will search for the samples that lie on the borderline between the
classes, i.e. to find the samples that are ideal for separating the classes; these samples are
named support vectors. The figure below illustrates this in that only the samples marked
with + for the two classes are used to generate the rule for classifying new samples.
A situation where SVM will perform well is when some classes are inhomogeneous and
partly overlapping, and thus, building local PCA models with all samples will not be
successful because one class may encompass other classes if all samples are used.
SVM will in this case find a set of the most relevant samples in terms of discriminating
between the classes and is invariant to samples far from the discrimination line.
SVM has advantages over classification methods such as neural networks, as it has a unique
solution, and has less tendency of overfitting when compared to other nonlinear
classification methodologies. Of course, the model validation is the critical aspect in avoiding
overfitting for any method. SVMs are effective for modeling of nonlinear data, and are
relatively insensitive to variation in parameters. SVM uses an iterative training algorithm to
achieve separation of different classes.
Two SVM classification types are available in The Unscrambler®, which are based on different
means of minimizing the error function of the classification.
In the c-SVM classification, a capacity factor, C, can be defined. The value of C should be
chosen based on knowledge of the noise in the data being modeled. Its value can be
optimized through cross-validation procedures. When using nu-SVM classification, the nu
value must be defined (default value = 0.5). Nu serves as the upper bound of the fraction of
errors and is the lower bound for the fraction of support vectors.
Increasing nu will allow more errors, while increasing the margin of class separation.
The kernel type to be used as a separation of classes can be chosen from the following four
options:
Linear
Polynomial
Radial basis function
Sigmoid
The linear kernel is set as the default option. If the number of variables is very large, the data
do not need to be mapped to a higher dimensional space, and the linear kernel function is
preferred. The radial basis function is also a simple function and can model systems of varying
complexity; the linear kernel can be regarded as a special case of it.
If a polynomial kernel is chosen, the order of the polynomial must also be given. In SVM
classification, the best value for C is often not known a priori. Through a grid search and
applying cross validation to reduce the chance of overfit, one can identify an optimal value
of C so that unknowns can be properly classified using the SVM model.
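The coarse grid search with cross-validation can be illustrated with a toy kernel classifier. The classifier below is a hypothetical stand-in for the SVM optimizer (which is not reproduced here); the exponentially growing sequence of parameter values follows the recommendation in the text:

```python
import numpy as np

def rbf(X, Y, gamma):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def kernel_centroid_predict(Xtr, ytr, Xte, gamma):
    # Toy kernel classifier: assign each test sample to the class with the
    # largest mean kernel similarity (NOT the SVM decision rule).
    classes = np.unique(ytr)
    K = rbf(Xte, Xtr, gamma)
    sims = np.stack([K[:, ytr == c].mean(axis=1) for c in classes])
    return classes[np.argmax(sims, axis=0)]

def cv_accuracy(X, y, gamma, segments=4, seed=0):
    # Random cross-validation segments, as in the validation setup.
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), segments)
    correct = 0
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = kernel_centroid_predict(X[train], y[train], X[test], gamma)
        correct += np.sum(pred == y[test])
    return correct / len(y)

def coarse_grid_search(X, y, gammas):
    # First coarse search over an exponentially growing parameter sequence;
    # a real SVM search would sweep C (or nu) on a second grid axis as well.
    scores = {g: cv_accuracy(X, y, g) for g in gammas}
    return max(scores, key=scores.get), scores
```

After the coarse search, the grid would be refined with smaller ranges around the best value, mirroring the dialog's workflow.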
Support vectors
Confusion matrix
Parameters
Probabilities
Prediction
The main results of the SVM are the confusion matrix, which indicates how many samples
were classified in each class, and the prediction matrix, which indicates the classification
determined for each sample in the training set.
may be from the same matrix or another, but must have the same number of rows as the
first, and have only a single column of category data.
Support Vector Machine Model Inputs
To build the SVM model, go to the column drop-down list and select a single column
containing category variables. If the appropriate selection is not made for the classifier, the
following warning will be displayed.
Support Vector Machine Model Inputs Warnings
28.3.2 Options
Here one can choose the SVM type of classification to use, either C-SVC or nu-SVC, from the
drop-down list next to SVM type. The kernel type used to determine the hyperplane that
best separates the classes can be selected from the following types in the drop-down list.
The default setting, Radial basis function, is a simple kernel that can model complex data.
Support Vector Machine Options
Linear
Polynomial
Radial basis function
Sigmoid
For a polynomial kernel type, the degree of the polynomial should be defined. The C-SVM
has an input parameter named C, which is a capacity factor (also called penalty factor), a
measure of the robustness of the model. C must be greater than 0.
When using nu-SVM classification, the nu value must be defined (default value = 0.5). Nu serves
as the upper bound of the fraction of errors and is the lower bound for the fraction of
support vectors.
Support Vector Machine Options for nu-SVM
In the Options tab the Grid Search button is available. Clicking on the Grid
Search button will open a dialog for grid search. The figure below shows the grid search
dialog after a grid search has been performed.
The dialog asks for input for the parameters Gamma and C in the case of C-SVC, and
Gamma and Nu in the case of nu-SVC. It has been reported in the literature that an
exponentially growing sequence of the parameters is good as a first coarse grid search. This
is why the inputs Gamma and C are given on the log scale, but not nu, since it lies between
0 and 1. In the grid table above, however, the actual values are given. It is recommended to
use cross-validation in the grid search to avoid overfitting when many combinations of the
parameters are tried. After an initial grid search, the search may be refined with smaller ranges
for the parameters once the best range has been found. Click on the Start button for the
calculations to commence. Note that it is possible to click on Stop during the computations,
so that if the results become worse for higher parameter values one may stop to
save time. The default is to start with five levels of each parameter. After completion, click
on one (the “best”) value for the Validation accuracy in the grid to see detailed results. The SVs
entry lists how many samples were selected as support vectors; this number should be seen
in relation to the number of samples in the data.
Click on Use setting to return to the previous dialog and run the SVMC again with
these parameter settings. Notice that since the cross validation is random, the RMSE and the
R-square from validation may be different in the second run. This again is a function of the
distribution of the samples.
To understand more in detail how SVMC selects the support vectors (samples that are lying
on the boundary between the classes) one may run a PCA on the same data and make use of
the Sample Grouping option in the score plot to visualize the support vectors.
28.3.4 Weights
If the analysis calls for variables to be weighted for making realistic comparisons to each
other (particularly useful for process and sensory data), click on the Weights tab and the
following dialog box will appear.
Support Vector Machine Weights
Individual variables can be selected from the variable list table provided in this dialog by
holding down the control (Ctrl) key and selecting variables. Alternatively, the variable
numbers can be manually entered into the text dialog box. The Select button can be used
(which will bring up the Define Range dialog), or every variable in the table can be selected
by simply clicking on All.
Once the variables have been selected, to weight them, use the options in the Change
Selected Variable(s) dialog box, under the Select tab. The options include:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows the weighting of selected variables by predefined constant values.
Downweight
This allows the multiplication of selected variables by a very small number, such that
the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
Use the Advanced tab in the Weights dialog to apply predetermined weights to each
variable. To use this option, set up a row in the data set containing the weights (or create a
separate row matrix in the project navigator). Select the Advanced tab in the Weights dialog
and select the matrix containing the weights from the drop-down list. Use the Rows option
to define the row containing the weights and click on Update to apply the new weights.
Another feature of the Advanced tab is the ability to use the results matrix of another
analysis as weights, using the Select Results Matrix button. This option provides
an internal project navigator for selecting the appropriate results matrix to use as a weight.
The dialog box for the Advanced option is provided below.
SVM Advanced Weights Option
Once the weighting and variables have been selected, click Update to apply them.
28.3.5 Validation
Validation is an important part of any method applied in modeling data. Settings for the
Validation of the SVM are set under the Validation tab as shown below. First select to cross
validate the model by checking the check box. The number of segments to use can be
chosen in the segments entry. Cross validation is helpful in model development but should
not be a replacement for full model validation using a test set.
Support Vector Machine Validation
Autopretreatment may be used with SVM. This allows a user to automatically apply the
transforms used with the data in developing the SVM model to data used in the classification
of new samples with this model.
Support Vector Machine Autopretreatment
When all of the parameters have been defined, the SVM is run by clicking OK. A new node,
SVM, is added to the project navigator with a folder for Data, and another for Results.
More details regarding Support Vector Machine classification are given in the section SVM
Classify or in the link given under License.
The SVM classification results are given in a new matrix in the project navigator named
Classified_Range. The matrix has the predicted class for each sample.
Support vectors
Confusion matrix
Parameters
Probabilities
Prediction
Accuracy
There is only one matrix generated when predicting with an SVM model: Classified_Range.
SVM node
The confusion matrix carries information about the predicted and actual classifications of
samples, with each row showing the instances in a predicted class, and each column
representing the instances in an actual class.
In the confusion matrix below, all the “Setosa” samples are correctly attributed to the “Setosa”
group.
Two samples with actual value “Virginica” are predicted as “Versicolor”.
In the same way two samples with actual value “Versicolor” are predicted as “Virginica”.
Confusion matrix
28.5.3 Parameters
The parameters matrix carries information on the following parameters for all the identified
classes:
SVM type
Kernel type - as defined in the options for the SVM learning step
Degree - as defined in the options for the SVM learning step
Gamma - the kernel parameter gamma, as set in the options
Coef0 - the kernel coefficient, as set in the options
Classes - the number of classes identified by the SVM model
SV Count - the number of support vectors needed for the classification of the data
Labels - the labels of the corresponding classes, given as numerical values starting
with 0
Numbers - the number of samples classified in a given class
Parameters matrix
28.5.4 Probabilities
The probabilities matrix has three rows, for the Rho, and probabilities A and B for each of
the identified classes.
Probabilities matrix
28.5.5 Prediction
The prediction matrix exhibits the predicted class for each sample in the training set.
Prediction
28.5.6 Accuracy
Accuracy holds the percentage of correctly classified samples from calibration and validation. If cross
validation was not chosen, the validation field is left blank. However, cross validation is highly
recommended to avoid overfitting. See the Confusion Matrix for details on false
positives and false negatives.
28.7. Bibliography
Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, A Practical Guide to Support Vector
Classification, last updated: May 19, 2009, accessed August 27, 2009.
http://www.csie.ntu.edu.tw/~cjlin
T. Czekaj, W.Wu and B.Walczak, About kernel latent variable approaches and SVM, J.
Chemom., 19, 341–354 (2005).
J.A.Fernandez Pierna, V.Baeten, A.Michotte Renier, R.P.Cogdill and P.Dardenne,
Combination of support vector machines (SVM) and near-infrared (NIR) imaging
spectroscopy for the detection of meat and bone meal (MBM) in compound feeds, J.
Chemom., 18, 341–349 (2004).
A. I. Belousov, S. A. Verzakov and J. von Frese, Applicational aspects of support vector
machines, J. Chemom., 16, 482-489 (2002).
29. Batch Modeling
29.1. Batch Modeling (BM)
The main objective of the Batch Modeling plug-in is to model and monitor data from batch
processes to give information on whether the batch is progressing as expected.
Theory
Usage
Plot Interpretation
Method reference
Once the data to be used in modeling are defined, choose the number of Principal
Components (PCs) to calculate, from the Maximum Components box.
The Mean center data check box allows a user to subtract the column means from every
variable before analysis.
The Identify outliers check box allows a user to identify potential outliers based on
parameters set up in the Warning Limits tab.
The details of the analysis setup are provided in the Information box on the model inputs
tab. It is important to check the details in this box each time an analysis is performed, to
ensure that the correct parameters have been set. The information contained in this box is:
The Global Batch Modeling check box allows a user to build a global Batch model.
BM Model Inputs
Individual variables can be selected from the variable list table provided in this dialog by
holding down the control (Ctrl) key and selecting variables. Alternatively, the variable
numbers can be manually entered into the text dialog box. The Select button can be used
(which will bring up the Define Range dialog), or every variable in the table can be selected
by simply clicking on All.
Once the variables have been selected, to weight them, use the options in the Change
Selected Variable(s) dialog box, under the Select tab. The options include:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows the weighting of selected variables by predefined constant values.
Downweight
This allows the multiplication of selected variables by a very small number, such that
the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
Use the Advanced tab in the Weights dialog to apply predetermined weights to each
variable. To use this option, set up a row in the data set containing the weights (or create a
separate row matrix in the project navigator). Select the Advanced tab in the Weights dialog
and select the matrix containing the weights from the drop-down list. Use the Rows option
to define the row containing the weights and click on Update to apply the new weights.
Another feature of the Advanced tab is the ability to use the results matrix of another
analysis as weights, using the Select Results Matrix button. This option provides an internal
project navigator for selecting the appropriate results matrix to use as a weight.
The dialog box for the Advanced option is provided below.
BM Advanced Weights Option
Once the weighting and variables have been selected, click Update to apply them.
Set this tab up based on a priori knowledge of the data set in order to return outlier
warnings in the batch model. Settings for estimating the optimal number of components can
also be tuned here. The values shown in the dialog box above are default values and might
be used as a starting point for the analysis.
The warning limits in the Unscrambler® serve two major purposes:
The leverage and residual (outlier) limits are given as standard scores. This means that limit
of e.g. 3.0 corresponds to a 99.7% probability that a value will lie within 3.0 standard
deviations from the mean of a normal distribution. The following limits can be specified:
Leverage Limit
(default 3.0) The ratio between the leverage for an individual sample and the
average leverage for the model.
Sample Outlier Limit, Calibration
(default 3.0) The square root of the ratio between the residual calibration variance
per sample (Sample Residuals) and the average residual calibration variance for the
model (Total Residuals).
Sample Outlier Limit, Validation
(default 3.0) The square root of the ratio between the residual validation variance
per sample (Sample Validation Residuals) and the total residual validation variance
for the model (Total Residuals).
Individual Value Outlier, Calibration
(default 3.0) For individual values in the calibration residual matrix (Residuals), the
ratio to the model average is computed (square root of the Variable Residuals). For
spectroscopic data this limit may be set to 5.0 to avoid many false positive warnings
due to the high number of variables.
Individual Value Outlier, Validation
(default 2.6) For individual values in the calibration residual matrix (Residuals), the
ratio to the validation model average is computed (square root of the Variable
Validation Residuals). For spectroscopic data this limit may be set to 5.0 to avoid
many false positive warnings due to the high number of variables.
Variable Outlier Limit, Calibration
(default 3.0) The square root of the ratio between the residual calibration variance
per variable (Variable Residuals) and the average residual calibration variance for
the model (Total Residuals).
Variable Outlier Limit, Validation
(default 3.0) The square root of the ratio between the residual validation variance
per variable (Variable Validation Residuals) and the total residual validation variance
for the model (Total Residuals).
Total Explained Variance (%)
(default 20) If the model explains less than 20% of the variance, the optimal number
of components is set to 0 (see the Info Box).
Ratio of Calibrated to Validated Residual Variance
(default 0.5) If the residual variance from the validation is much higher than the
calibration a warning is given.
Ratio of Validated to Calibrated Residual Variance
(default 0.75) If the residual variance from the calibration is much higher than the
validation a warning is given. This may occur in case of test set validation where the
test samples do not span the same space as the training data.
Residual Variance Increase Limit (%)
(default 6) This limit is applied for selecting the optimal number of components and
is calculated from the residual variances of two consecutive components. If the
residual variance for the next component is less than 6% lower than that of the
previous component, the optimal number of components is set to the previous one.
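As an illustration, one plausible reading of this component-selection rule is sketched below in Python. The helper and its details are assumptions for illustration only, not the software's exact implementation:

```python
def optimal_components(residual_variances, increase_limit=0.06, min_explained=0.20):
    """Illustrative component selection from total residual variances.

    residual_variances[k] is assumed to be the total residual variance with
    k components (index 0 = no components). Stop adding components when the
    next one lowers the residual variance by less than `increase_limit`
    (6%) relative to the previous component; return 0 if the total
    explained variance stays below `min_explained` (20%)."""
    opt = 0
    for k in range(1, len(residual_variances)):
        prev, cur = residual_variances[k - 1], residual_variances[k]
        if prev > 0 and (prev - cur) / prev < increase_limit:
            break
        opt = k
    explained = 1.0 - residual_variances[opt] / residual_variances[0]
    return opt if explained >= min_explained else 0
```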
When all the options are specified click OK.
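The quoted correspondence between a standard-score limit and normal-distribution coverage (a limit of 3.0 corresponding to 99.7%) can be checked numerically; this small helper is illustrative only:

```python
import math

def coverage(limit):
    """Two-sided probability that a standard normal value lies within
    ±limit standard deviations of the mean."""
    return math.erf(limit / math.sqrt(2.0))
```

For example, coverage(3.0) evaluates to about 0.9973, matching the 99.7% quoted for the default limit of 3.0.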
Predefined BM plots
PCA overview
Scores
Scores
30. Moving Block
30.1. Moving Block
Block methods are a particular form of evolutionary process modeling. Statistics such as
mean and standard deviation are reported for single or multivariate sensor data collected at
regular time intervals during a process. These can be used to trend the progress of an
evolving system, such as blending, mixing and drying operations.
The moving block statistics can be based on either raw data or scores (i.e. projections of the
data onto a multivariate model).
Theory
Usage
Plot Interpretation
Prediction
Block Definitions
Individual Block Mean (IBM)
Individual Block Standard Deviation (IBSD)
Moving Block Mean (MBM)
Moving Block Standard Deviation (MBSD)
Percent Relative Standard Deviation (%RSD)
The Unscrambler X Main
The statistics for a particular block, calculated over a region of a given length, are
described in the following.
The IBM for a block is similar to the associated sensor reading for a single sample. If these
are spectra the IBM will resemble a spectrum from the same spectral region. The collection
of IBMs can be plotted as line or bar plots to assess the differences between multiple
blocks.
Individual Block Means for a Collection of Spectra
As for the IBM, the IBSD can also be plotted as line or bar plots. These will indicate the
degree of sample spread within different blocks.
Individual Block Standard Deviations for a Collection of Spectra
Upper and lower limits can be defined and plotted with the MBM in a trend chart to monitor
e.g. when a process reaches stable conditions.
Moving Block Mean Trend Chart
Upper and lower limits can be defined and plotted with the MBSD in a trend chart to
monitor e.g. when a process reaches stable conditions.
Moving Block Standard Deviation
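As a rough illustration of the block statistics defined above (IBM, IBSD, MBM, MBSD and %RSD), the following sketch computes them for a single sensor signal; the function names are ours, not the software's:

```python
import numpy as np

def individual_blocks(x, block_size):
    # IBM / IBSD: mean and standard deviation of non-overlapping blocks
    x = np.asarray(x, dtype=float)
    starts = range(0, len(x) - block_size + 1, block_size)
    ibm = np.array([x[i:i + block_size].mean() for i in starts])
    ibsd = np.array([x[i:i + block_size].std(ddof=1) for i in starts])
    return ibm, ibsd

def moving_blocks(x, block_size):
    # MBM / MBSD / %RSD: the same statistics over a moving (overlapping) window
    x = np.asarray(x, dtype=float)
    windows = np.array([x[i:i + block_size]
                        for i in range(len(x) - block_size + 1)])
    mbm = windows.mean(axis=1)
    mbsd = windows.std(axis=1, ddof=1)
    rsd = 100.0 * mbsd / mbm   # Percent Relative Standard Deviation
    return mbm, mbsd, rsd
```

The MBM and MBSD vectors are what would be plotted in the trend charts, with the user-defined upper and lower limits overlaid.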
Once the data to be used in modeling are defined, click Add to specify the combination of
methods and wavelength region.
30.3.2 Region
Once the input data are valid, regions can be added.
Region Pane
Range
Allows the user to define the first and last column to be included for this region, relative to the full range of the Input data.
Apply to scores
When checked, allows the user to select all PCA models in the project with a matching number of required variables. Additionally, the Components option allows the number of components used in the selected model to be changed. The default value is the number of components set for prediction.
If there are multiple regions in the model, use the toolbar drop-down box to select which
region to plot. The toggle buttons can be used to switch between Individual Block Mean
and Standard Deviation plots.
Individual Block Mean for a Collection of Spectra
If there are multiple regions in the model, use the toolbar drop-down box to select which
region to plot. The toggle buttons can be used to switch between Moving Block Mean and
Standard Deviation plots, or a combination of both.
Moving Block Combined Trend Chart
Upper and lower limits for the trend charts can be set using the right-click Set Limits
function. Also use the toggle button to see the Moving Block Mean trend plots.
Moving Block Mean Trend Chart
By clicking on any Method Node in the tree, the Region and Trend plot names become
visible in the dialog box, allowing the user to set limits for the methods. By default, the
Method Node selected will be the one associated with the plot that was right-clicked.
By selecting the Upper Limit radio button, the user can set an upper limit in the Trend
Plot and save it to the model for use in Tasks – Prediction, for comparing new data to an
established model.
By selecting the Lower Limit radio button, the user can set a lower limit in the Trend
Plot and save it to the model for use in Tasks – Prediction, for comparing new data to an
established model.
31. Orthogonal Projections to Latent Structures
31.1. Orthogonal Projection to Latent Structures
Theory
Usage
Plot Interpretation
Method reference
Orthogonal Projection to Latent Structures (OPLS) models both the X- and Y-matrices
simultaneously in terms of components (or factors, latent variables). The difference between
PLSR and OPLS lies in the way these components are calculated. The loading weights vector
of the first component is identical to that of PLSR, whereas the subsequent components in
OPLS are calculated so as to be orthogonal to the first one. In the case of a single response
variable (y), the first loading weights vector for PLSR and OPLS represents the individual
covariances (or correlations, if the variables are scaled to unit variance), except that the
vector is normalized to 1.0. Note that the final regression coefficient vector is identical to
that of PLS in the case of one y-variable; thus the predictions are also identical in the case of
a single y-variable.
It is known that a regression model with one y-variable can always be described with one
component, where the y-orthogonal part of X can be separated from the predictive part. The
direct way of orthogonalizing X on Y is Direct Orthogonalization, where all orthogonal
variance in X is represented by one matrix, E. OPLS separates the y-orthogonal part of X into
a structured part and the residual (error).
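The separation can be sketched for a single y in the spirit of the Trygg & Wold algorithm; this is a minimal illustration with our own variable names, not the software's implementation, and it assumes mean-centered X and y:

```python
import numpy as np

def remove_one_orthogonal_component(X, y):
    """Remove one y-orthogonal component from mean-centered X (single y)."""
    w = X.T @ y / (y @ y)              # covariance direction (loading weights)
    w /= np.linalg.norm(w)
    t = X @ w                          # predictive scores
    p = X.T @ t / (t @ t)              # X-loadings
    w_o = p - (w @ p) * w              # part of p orthogonal to w
    w_o /= np.linalg.norm(w_o)
    t_o = X @ w_o                      # orthogonal scores
    p_o = X.T @ t_o / (t_o @ t_o)      # orthogonal loadings
    X_filt = X - np.outer(t_o, p_o)    # X with the y-orthogonal structure removed
    return X_filt, t_o, p_o
```

By construction the orthogonal scores t_o carry no information about y (their covariance with y is zero), which is the defining property of the structured y-orthogonal part.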
The total X-variance from the predictive and orthogonal components is the same as the X-
variance for PLSR with the corresponding total number of components. E.g. if there is one
predictive and two orthogonal components in OPLS, this corresponds to a 3-component
model for PLSR. That is, the orthogonal loading weights from OPLS may differ from the
loading weights from PLS for components one through the optimal number found by proper
validation. It is recommended to first run a PLSR model to find the optimal validated number
of components and then run OPLS. If there are more y-variables there may be more than
one predictive component, but not more than the number of y-variables.
Orthogonal Signal Correction (OSC) is another method that separates the y-orthogonal part
of X. The difference to OPLS is mainly that the orthogonal part is not a part of the model
itself, but is separated out as a pre-processing step. See also OSC theory. More details can be
found in the OPLS literature.
Which method will reveal the true underlying structures of a given dataset cannot be known
beforehand. Multivariate Curve Resolution
(../27_Multivariate_Curve_Resolution/theory.htm) is an alternative method that does not
assume the true signals to be orthogonal.
If the classical PLSR indicates that one component is optimal, then the so-called predictive
component in OPLS will carry relevant qualitative, and sometimes also quantitative,
information as to which variables are important and how important they are. If the classical
PLSR indicates the optimal number of components to be e.g. four, one cannot in general
assume that the first component reveals the correct qualitative (or quantitative) information.
OPLS may be carried out with one or more Y-variables, meaning that multiple Y responses
can be used during regression modeling. In the case of multiple y-variables, OPLS gives
similar, but not exactly the same, results as PLS.
Thus, multiplying the individual score for each sample by the loading weights and squaring
the values can be used to estimate the sample variance due to the predictive part.
The predictive loading weight vector for each component is normalized to sum 1.0. Variables
with large loading weight values are important for the prediction of Y. Because of this
normalization, no absolute rule of thumb for important/not important can be set; one may
instead use the uncertainty test to estimate the significance of each variable.
31.2.2 Y-loadings
The Y-loadings for individual y-variables in OPLS are represented by the direct relationship
between the Y-variables and the predictive scores.
Regression coefficients
Regression coefficients show how each variable is weighted when predicting a particular Y
response. In the case of OPLS they are calculated from the predictive loading weights, the
orthogonal loadings and the y-loadings. Regression coefficients are a characteristic of all
regression methods and may provide interpretive insight into the quality of a model.
Examples include:
Spectroscopy: Regression coefficients should have “spectral characteristics” about
them and not show noise characteristics.
Process data: When different variable types exist the variables should be scaled to
unit variance. Regression coefficients show the relative importance of the variables
and their interactions can also be displayed if added to the original data table with
Tasks - Transform - Interaction_and_Square_Effects.
As the regression coefficients in OPLS are identical to those of PLS, they are presented in the same way.
Predicted vs. reference plot
The predicted vs. reference plot is another common feature of all regression methods. The
predicted vs. reference plot should ideally show a straight-line relationship between
predicted and reference values, with a slope of 1 and a correlation close to 1.
When a data table is available in the Project Navigator use the Tasks-Analyze menu to run a
suitable analysis – here, Orthogonal Projection to Latent Structures.
Orthogonal Projection to Latent Structures Inputs
The Mean Center check box allows a user to subtract the column means from every variable
before analysis. This option should be enabled unless one can assume that the origin is a
valid sample in the data, i.e. when zero concentration means no signal.
The details of the analysis setup are provided in the Information box on the model inputs
tab. It is important to check the details in this box each time an analysis is performed, to
ensure that the correct parameters have been set. The information contained in this box is:
Individual X- and Y-variables can be selected from the variable list table provided in this
dialog by holding down the control (Ctrl) key and selecting variables. Alternatively, the
variable numbers can be manually entered into the text dialog box, the Select button can be
used (which takes one to the Define Range dialog box), or one can simply click All to select
every variable in the table.
Once the variables have been selected, to weight them, use the options in the Change
Selected Variable(s) dialog box, under the Select tab. The options include:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows selected variables to be weighted by predefined constant values.
Downweight
This allows for the multiplication of selected variables by a very small number, such
that the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
Advanced tab
Use the Advanced tab in the X- and Y-Weights dialog to apply predetermined weights to
each variable. To use this option, set up a row in the data set containing the weights (or
create a separate row matrix in the project navigator). Select the Advanced tab in the
Weights dialog and select the matrix containing the weights from the drop-down list. Use
the Rows option to define the row containing the weights and click on Update to apply the
new weights.
The methods provided in The Unscrambler® for the validation of OPLS models are:
Leverage Correction
A first pass validation technique used for checking for the presence of gross outliers
and for “big data”.
Cross Validation
Used to simulate a test set, when there are not enough samples to define an
independent test set.
Uncertainty Test
Can be used to determine the significance of variables when using cross validation.
Check the Uncertainty Test box; the available options are to use the optimal number
of factors found in a model, or to define the number of factors to use for the test.
For OPLS the number of factors is related to the number of orthogonal factors
specified in the main dialog.
When there are missing values in the data, the options are to impute them automatically
using the NIPALS algorithm, or as a pre-processing step using Fill Missing.
Test Set
The most reliable way of assessing the performance of a PLSR model. It uses samples
that are independent of the calibration set.
When applying Test Set validation, the user must ensure that the test matrices have
the same column dimensions as the calibration set.
31.3.4 Autopretreatments
The Autopretreatments tab allows a user to register the pretreatments used during the OPLS
analysis, so that when future predictions are made, these pretreatments are automatically
applied to the new data. The pretreatments become part of the saved model. An example
dialog box for Autopretreatment is provided below.
The OPLS Autopretreatment Tab Options
Pretreatments can also be registered from the OPLS node in the project navigator. To
register the pretreatment, right click on the OPLS analysis node and select Register
Pretreatment.
Many of the OPLS plots are the same as, or similar to, those for PLSR. The OPLS plots are
described below. For more details, refer to the section on PLS.
Predictive Scores
This is a one-dimensional bar plot of scores for one specified component. Samples with a
high absolute score value are influential in estimating the predictive loading weights.
Explained Y-variance
This plot illustrates how much of the variation in the responses is described by each
component.
Total explained variance is computed as:
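A standard way of computing it (our sketch; the software's exact formula may differ in details such as calibration vs. validation residuals) compares the residual sum of squares with the total mean-centered sum of squares:

```python
def total_explained_variance_pct(y, y_hat):
    """Sketch: 100 * (1 - SS_res / SS_tot) for reference y and fitted y_hat."""
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 100.0 * (1.0 - ss_res / ss_tot)
```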
To display the results for other Y-variables, use the variable icon. By default the results are
shown for a specific number of factors, which should reflect the dimensionality of the
model. If the number of factors is not satisfactory, it is possible to tune it up or down with
the arrow tools.
Some statistics are available giving an idea of the quality of the regression. They are:
Slope
The closer the slope is to 1, the better the data are modeled.
Offset
This is the intercept of the regression line with the Y-axis, i.e. its value when X is
zero. (Note: this value is not necessarily zero!)
RMSE
The first one (in blue) is the Calibration error RMSEC, the second one (in red) is the
expected Prediction error, depending on the validation method used. Both are
expressed in the same unit as the response variable Y.
R-squared
The first one (in blue) is the calibration R-Squared value taken from the calibration
Explained Variance plot for the number of components in the model, the second
one (in red) is also calculated from the Explained Variance plot, this time for the
validation set. It is an estimate of how good a fit can be expected for future
predictions.
Note: RMSE and R-Squared values are highly dependent on the validation method
used and the number of components in a model.
When the plot statistics are toggled on, more detailed statistics are displayed. The
Calibration plot is shown below with statistics.
Predicted vs. Reference plot for Calibration samples
Bias
This is the mean deviation over all points; it indicates whether the points lie
systematically above (or below) the regression line. A value close to zero indicates a
random distribution of points about the regression line.
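The statistics above (slope, offset, RMSE, R-squared and bias) can be sketched for a set of reference values y and predictions y_hat; the function name is ours and this is illustrative only:

```python
import numpy as np

def pred_vs_ref_stats(y, y_hat):
    """Sketch of predicted-vs-reference plot statistics."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    slope, offset = np.polyfit(y, y_hat, 1)       # regression of predicted on reference
    rmse = np.sqrt(np.mean((y_hat - y) ** 2))     # root mean squared error
    r2 = 1.0 - np.sum((y_hat - y) ** 2) / np.sum((y - y.mean()) ** 2)
    bias = np.mean(y_hat - y)                     # systematic over-/under-prediction
    return slope, offset, rmse, r2, bias
```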
Orthogonal Scores
This is either a bar plot or a two-dimensional scatter plot (or map) of scores for two specified
orthogonal components.
The closer the samples are in the scores plot, the more similar they are with respect to the
two components concerned. Conversely, samples far away from each other are different
from each other. The plot can be used to interpret differences and similarities among
samples. Look at the scores plot together with the corresponding loadings plot for the same
two components. This can help in determining which variables are responsible for
differences between samples. For example, samples to the right of the scores plot will
usually have a large value for variables to the right of the loadings plot, and a small value for
variables to the left of the loadings plot.
Orthogonal X-Loadings
This is either a bar plot or a two-dimensional scatter plot (or map) of the variables for two
specified orthogonal components.
Variables close to each other in the loadings plot will have a high positive correlation if the
two components explain a large portion of the variance of X. Variables in diagonally opposed
quadrants will have a tendency to be negatively correlated.
Regression coefficients
General
Y-residuals vs. Predicted Y
This is a plot of Y-residuals against predicted Y values. If the model adequately predicts
variations in Y, any residual variations should be due to noise only, which means that the
residuals should be randomly distributed. If this is not the case, the model is not completely
satisfactory, and appropriate action should be taken. If strong systematic structure (e.g.
curved patterns) is observed, this can be an indication of lack of fit of the regression model.
The figure below shows a situation where one sample has a much higher Y-residual than the
other samples.
Leverage/Hotelling’s T²
Leverage
Leverages are useful to find influential samples in the model space. If all samples have
leverages between 0.02 and 0.1, except for one, which has a leverage of 0.3, although this
value is not extremely large, the sample is likely to be influential.
Leverage plot
There is an ad-hoc critical limit for leverage which is shown as a red line. The limit is 3 times
the average leverage for the calibration samples.
Hotelling’s T²
The Hotelling’s T² plot is an alternative to plotting sample leverages. The plot displays the
Hotelling’s T2 statistic for each sample as a line plot. The associated critical limit (with a
default p-value of 5%) is displayed as a red line.
Hotelling’s T² plot
The Hotelling’s T² statistic has a linear relationship to the leverage for a given sample. Its
critical limit is based on an F-test. There are 6 different significance levels to choose from
using the drop-down list:
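The linear relationship mentioned above can be sketched as follows for a mean-centered model with scores matrix T (n samples by A components); the names and the leverage convention (including the 1/n term) are our assumptions, not the software's internals:

```python
import numpy as np

def leverage_and_t2(T):
    """Leverage h and Hotelling's T-squared from calibration scores T (n x A)."""
    n = T.shape[0]
    # leverage: 1/n plus the scaled squared scores summed over components
    h = 1.0 / n + np.sum(T ** 2 / np.sum(T ** 2, axis=0), axis=1)
    # T-squared is a linear function of leverage
    t2 = (n - 1) * (h - 1.0 / n)
    return h, t2
```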
31.6. Bibliography
J. Trygg and S. Wold, Orthogonal projections to latent structures (O-PLS), Journal of
Chemometrics, 16, 119-128 (2002).
O. Svensson, D. Kourti and J. MacGregor, An investigation on orthogonal signal correction
algorithms and their characteristics, Journal of Chemometrics, 16, 176-188 (2002).
R. Ergon, Finding Y-relevant part of X by use of PCR and PLSR model reduction methods,
Journal of Chemometrics, 21, 537-546 (2007).
E.K. Kemsley and H.S. Tapp, OPLS filtered data can be obtained directly from non-
orthogonalized PLS1, Journal of Chemometrics, 23, 263-264 (2009).
32. Prediction
32.1. Prediction
Prediction (estimation of unknown response values using a regression model) may be the
purpose of a regression application. This section describes how to use an existing regression
model to predict response values for new samples.
Theory
Usage
Plot Interpretation
Method reference
Note: The model validation can only be considered successful when one has:
This prediction method is simple and easy to understand. However, it has the disadvantage
that few sample or variable outlier diagnostics are available, compared to projection
methods such as full PCR and PLSR predictions. In The Unscrambler® this method, which
uses just the regression coefficients, is called short prediction.
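A short prediction is then just a linear combination of the new X-values with the stored coefficients; a minimal sketch (names are illustrative):

```python
def short_predict(X_new, b, b0=0.0):
    """Sketch of short prediction: y = b0 + x . b for each new sample row."""
    return [b0 + sum(x * w for x, w in zip(row, b)) for row in X_new]
```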
X = TPᵀ + E
and
Y = TQᵀ + F
For these models Y is expressed as an indirect function of the X-variables using the scores T,
the X-loadings P and the Y-loadings Q (for PLSR).
The advantage of using the projection equation for prediction is that when projecting a new
sample onto the X-part of the model (this operation gives the t-scores for the new sample),
one simultaneously gets a leverage value and an X-residual for the new sample, hence
allowing outlier detection.
A prediction sample with a high leverage and/or a large X-residual may be a prediction
outlier. Such samples may not be considered as belonging to the same “population” as the
samples the regression model was based on, and therefore one should treat the predicted Y-
values with caution.
Note: Using leverages and X-residuals, prediction outliers can be detected without
any knowledge of the true value of Y.
Inlier statistic
Hotelling’s T² statistic
Q residual statistic
Inlier statistic
The inlier statistic is based on the principle that if samples, when predicted, lie far from the
nearest calibration sample in the scores plot, they should be flagged as an “inlier”. An
“inlier” should be interpreted as a potential outlier. Whereas samples with high leverages
will be found far from the origin of the scores plot (outside the Hotelling’s ellipse), an inlier
may be found anywhere in the scores plot.
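A sketch of this statistic, assuming it is computed in score space with each component scaled by its variance (the exact scaling is our assumption):

```python
import numpy as np

def inlier_distance(t_new, T_cal):
    """Minimum Mahalanobis-type distance from a projected sample to the
    calibration samples, computed in score space (illustrative sketch)."""
    s2 = T_cal.var(axis=0, ddof=1)                 # per-component score variance
    d2 = np.sum((T_cal - t_new) ** 2 / s2, axis=1) # squared distance to each sample
    return np.sqrt(d2.min())                       # distance to the nearest one
```

A sample whose inlier distance exceeds the inlier limit would be flagged as a potential "inlier" even though it may sit well inside the Hotelling's ellipse.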
In the plots below, the sample marked “E” in the scores plot is inside the range of possible
samples but far from any calibration sample, and is therefore considered an inlier. It is
above the inlier limit, as can be seen in the Inlier plot.
Scores plot showing the inlier in the calibration range/Inlier plot with one inlier
In The Unscrambler®, the inlier statistics for predicted samples can also be displayed as a 2-D
scatter plot together with the Hotelling’s T² statistic critical limits (with a default p-value of
5%).
Inlier vs. Hotelling’s T² plot with one inlier
Hotelling’s T² statistic
Predicted samples which have model distances far away from the samples in the calibration
set may also be outside the Hotelling’s T² limit (and consequently the Hotelling’s T² ellipse in
the scores plot). The Hotelling’s T² statistic is computed as a linear function of sample
leverage and can be compared to a critical limit according to an F-test. In The Unscrambler®,
the Hotelling’s T² statistics for prediction samples are displayed as a 2-D scatter plot
together with the inlier statistics.
Q residual statistic
When a full prediction is run, the Q residual limits are calculated, and the X sample Q
residual matrix is also included with the Outputs. This additional statistic is the sum of the
squares of the residuals and can be used to determine whether predicted samples are outliers.
The Q residual contributions for each predicted sample are also provided along with the
average model Q residual contribution. These results are found in the Outputs folder and
can be plotted to view how variables in the prediction samples differ from the average
variable values in the calibration model.
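For a model with orthonormal loadings P, the sample Q residual can be sketched as the squared distance between a sample and its reconstruction from the model; names are illustrative:

```python
import numpy as np

def q_residual(x, P):
    """Q residual of one sample x for a model with orthonormal loadings P
    (columns = components). Illustrative sketch."""
    t = x @ P          # scores for the sample
    x_hat = t @ P.T    # reconstruction from the model
    e = x - x_hat      # per-variable residuals (the Q contributions, squared)
    return float(e @ e)
```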
To run a prediction, a project should be opened containing a regression model and a data set
to be predicted. In the case where a prediction model is not available, the following warning
will be displayed.
Solution: First calculate a regression model on a training data set before applying the predict
function to new data.
For Bias and Slope correction, refer to Bias and Slope
Data Input
The following dialog boxes are available to enter data into.
Select model
From the Select model drop-down list, select the regression model to apply to new
data.
Components
Use the Components box to select the correct number of principal components for a
PCR model or factors for a PLSR model. The optimal number of components for the
model will be displayed and used by default.
Full Prediction/ Short Prediction
Full Prediction uses a projection on the latent space in the calculation. It will
provide comprehensive results such as plots and additional matrices for
increased data interpretation and outlier diagnostics.
Short Prediction uses only the extracted Regression (Beta) coefficients.
There are no plots associated with this type of prediction.
Inlier limit
The inlier limit is a measure of the maximum Mahalanobis distance between two
neighboring calibration samples. This feature is used for detecting outliers in the
prediction step.
Sample inlier distance
The sample inlier distance is a measure of the minimum Mahalanobis distance to the
calibration samples for each sample. This feature provides the individual values for
detecting outliers in the prediction step.
Identify Outliers
This option enables an automatic identification of outliers based on predefined
criteria. Several options are available for setting limits for outlier detection,
including,
Leverage limit.
Sample outlier limit, validation.
Individual value outlier, validation.
Total explained variance (%).
Data
Matrix: From the Data drop-down list, select the matrix to apply the
prediction model to.
Rows and Cols: Use the Rows and Columns boxes to define the range of the
data to be predicted.
Several criteria of the input data must be met for a successful prediction step. Warnings
associated with this option are presented as follows:
All samples or variable kept out
Solution: Ensure there are rows and columns available in the data set for prediction.
The dimensions of the test set do not match those of the calibration set
Solution: Ensure that the dimensions of the new data set match those of the calibration set.
Non-numeric values in a new data set
Solution: Ensure that the new data set does not contain any non-numeric columns.
Note: When a model has been developed and is to be used for prediction, it is
important to define the variable ranges in the new data table so that they match
the dimensions of the original model.
Include Y reference
Use the Include Y Reference option to add reference data if they are available so
that the predicted vs. reference plot and actual residuals can be calculated.
Matrix: From the Data drop-down list, select the matrix where the reference
data are.
Rows and Cols: Use Rows and Columns to define the Y-reference data to
include.
It is important to ensure that the same number of Y-variable data is available as was used to
develop the calibration model. The following warning will be provided if this is not the case,
Number of Y-variables should match those in the developed model
Solution: Ensure that the same number of Y-variables have data available as in the
original calibration model.
Click on OK to start the prediction.
If measured Y-values were added as input to the prediction, the Root Mean Squared Error of
Prediction (RMSEP) will be indicated by vertical red lines in each box.
Samples with large deviation are potential outliers. You should check the X-variable values
for the sample and see how they deviate from the calibration samples. If there has been an
error, correct it. If the values are correct, the conclusion is that the prediction sample does
not belong to the same population as the samples the model is based upon, and the
predicted Y values are not reliable.
Prediction table
This table plot shows the predicted values, their deviation, and the reference value (if
predicted with a reference value included).
The objective is to have predictions with as small a deviation as possible. Predictions with
high deviations may be outliers.
Prediction table
Note: This plot is built in the same way as the Predicted vs. Reference plot used
during calibration. It is possible to turn on Plot Statistics as well as the target and
the regression lines. The prediction R-square is useful to assess the quality of the
prediction.
Residuals/leverage
Sample residuals
This is a plot of the residuals for a specified sample and component number for all the X-
variables. It is useful for detecting outlying sample or variable combinations. Although
outliers can sometimes be modeled by incorporating more components, this should be
avoided since it will reduce the prediction ability of the model.
Detect variables that are not very well described by a model with a certain number of
components (factors). If this is the case for most of the samples, the variable(s) isolated
may be noisy and can be considered outliers.
In contrast to the variable residual plot, which gives information about residuals for all
samples for a particular variable, this plot gives information about all possible variables for a
particular sample. It is therefore useful when studying how a specific sample fits to the
model.
Leverage
This plot shows the leverage of the predicted samples. It is the distance from the projected
sample to the center of the model. The absolute leverage values are always larger than zero
and can (in theory) go up to 1 for a model sample. In prediction, an outlier sample can have
a leverage greater than 1. As a rule of thumb, samples with a leverage above 0.4 - 0.5
start being of concern.
In the plot below sample “S.057” has a leverage greater than 0.4. The last four samples show
high leverages, i.e. they are not as well described by the model compared to the other
samples.
Leverage in Prediction
Influence on the model is best measured in terms of relative leverage. For instance, if all
samples have leverages between 0.02 and 0.1, except for one, which has a leverage of 0.3,
although this value is not extremely large, the sample is likely to be influential.
For a critical limit on the leverages, look at the Hotelling’s T² line plot.
Inlier/Hotelling’s T²
Inliers
This plot displays the inlier statistic (minimum Mahalanobis distance to the calibration
samples) for each sample as a line plot. The associated critical limit (with a default p-value of
5%) is displayed as a red line.
This feature is a test for detecting outliers in the classification or prediction step. It is based
on the concept that a model may have an object space where there are “holes”, i.e. the
density of objects in some part of the calibration space is low.
All results on samples below the Inlier limit can be trusted.
Inliers
Note: It is possible to tune the number of PCs/Factors up or down with the arrow
tools.
Hotelling’s T²
The Hotelling’s T² plot is an alternative to plotting sample leverages. The plot displays the
Hotelling’s T² statistic for each sample as a line plot. The associated critical limit (with a
default p-value of 5%) is displayed as a red line.
The Hotelling’s T² statistic has a linear relationship to the leverage for a given sample. Its
critical limit is based on an F-test. Use it to identify outliers or detect situations where a
process is operating outside normal conditions.
Hotelling’s T²
Note: It is possible to tune the number of PCs/Factors up or down with the arrow
tools.
33. Batch Prediction
33.1. Batch Prediction
Batch Prediction may be used to generate scores and predicted values for a large set of files
in a directory or to predict files that will be added to a directory by an external application.
Note that the model selected needs to be compatible with the files; incompatible data files
are silently skipped.
Usage
All files in a chosen directory (and matching the extension filter) are processed.
Optionally, these files may be sorted by name prior to being queued for prediction.
Factors
Use the Factors box to select a suitable number of principal components for a PCR
model or factors for a PLS model. The optimal number of components for the model
will be displayed and used by default.
The location for the output data to be stored must also be defined in the output path.
33.2.2 Display
Go to the display tab to choose from predefined plots to display as the results are
generated. The number of data points to display is set at a default of 15, and can be changed
by the user. The standard options of plots that can be displayed are the predicted values, the
scores, and the Hotelling’s T² values with a limit set at a user-specified significance level
(default is 5%).
Batch predict display options
33.2.3 Options
On the Options tab, prediction limits can be set, and an alarm can be set to sound if those
limits or the Hotelling’s T² limit are crossed.
Batch predict options
33.2.4 Outputs
After the settings have been made, the batch prediction will run, and the designated plots
displayed on the screen as the data are analyzed.
Batch monitor
When the analysis is completed, click Close on the monitoring screen. The results are stored
as a csv file in the folder designated in the setup. The user is then prompted to load the
results into the open project.
Load batch results
When the results are loaded, the matrix is added to the project navigator.
34. Multiple Model Comparison
34.1. Multiple Model Comparison
Multiple Model Comparison is used for comparison of models in terms of their y-residuals
(from the chosen validation procedure) to assess whether the models are significantly
different with respect to prediction performance.
Theory
Usage
Plot Interpretation
Method reference
The models are compared through a two-way ANOVA on their validated residuals. For the M
models (m = 1, 2,…, M), the difference matrix D (I x M) contains the absolute validated
residual for the j = 1, 2,…, J response variables. Each element is modeled as

d_im = mu + alpha_i + beta_m + e_im

where mu is the overall mean, alpha_i is the effect of sample number i, beta_m are the
effects of the models m that are being compared, and e_im is the residual in the ANOVA
model. In the case of only comparing two models the 2-way ANOVA is identical to a
pair-wise t-test.
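For the two-model case, the reduction of the 2-way ANOVA to a pair-wise t-test can be sketched like this (illustrative only; `res_a` and `res_b` are hypothetical arrays holding each model's validated residuals for the same samples):

```python
import numpy as np

def compare_two_models(res_a, res_b):
    """Paired t-statistic on the absolute validated residuals of two models.

    With only two models, the two-way ANOVA on the I x M matrix of
    absolute residuals is equivalent to this pair-wise t-test.
    """
    d = np.abs(res_a) - np.abs(res_b)          # per-sample differences
    n = d.size
    return d.mean() / (d.std(ddof=1) / np.sqrt(n))
```

A clearly positive t-statistic would indicate that the first model has the larger absolute residuals, i.e. the poorer prediction performance.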
The comparison can be made for the existing data or new data.
For the option Re-use calibration data, the X and Y data are taken from the raw data
node of the first selected model.
For the option Apply to new data, the Predictors (X) and Responses (Y) will be user
provided.
The Select models tab provides the option to select the models for comparison.
Multiple Model Comparison dialog - select models
Before adding the first model, all the available models in the project navigator will be
displayed in the drop-down list. After the first model has been added, only models with
matching number of (validation) samples to the first model will be listed. The number of Y
variables has to match in all models. The first selected model will be used as the reference
for the number of responses.
Click Finish to start the prediction.
35. Tutorials
35.1. Tutorials
The tutorials section of The Unscrambler® was developed to help users implement methods in
practice and to guide them through the practical aspects of experimental design, data
analysis and interpretation of real results using The Unscrambler®.
The tutorials help to establish a basic understanding of the capabilities of The Unscrambler®,
an introduction to interpretation of results, and a feeling for the procedures of multivariate
data analysis. However, analysis of real-world data is seldom this straightforward! Normally,
data must be processed in some way before analysis, and numerous calibration iterations may
be required before the desired performance of a model is reached.
Quick Start
Complete cases
Tip: Arrange The Unscrambler® application window and the Help browser side by
side for greater workflow efficiency.
Tip: Copy this directory to the home directory of the working computer, e.g. into the
“Documents” directory, and use File – Open… to load the files, in order to avoid
overwriting the original data. This way a copy of the unaltered data is always available
in the event a working copy has been altered.
From within each tutorial there is a convenient hypertext link to directly import the data set
used in the given tutorial. An example link is provided below:
Open the tutorial A data set
35.2. Complete
35.2.1 Complete cases
Read the details below to understand which tutorials are useful in specific application cases,
and also to gain some practical advice for running the tutorials. The tutorials present
application examples and contain detailed step-by-step instructions on how to use The
Unscrambler®.
Depending on an analyst’s degree of experience in using The Unscrambler® and the
particular fields of interest for application of the program, the following lists the
recommended tutorials for a specific user experience level:
Summary of The Unscrambler® tutorials
Tutorial A: a simple example of calibration (prerequisites: PLS, univariate analysis)
Tutorial C: spectroscopy and interference (prerequisites: PLS, transformations, spectroscopy)
Description
Expected outcomes of this tutorial
Data table
Opening the project file
Define ranges
Univariate regression
Calibration
Interpretation of the results
Prediction
Evaluation of the predicted results
Description
This tutorial aims to provide an example of the measurement of the concentration (Y) of a
chemical constituent “a” by use of conventional transmission spectroscopy. The situation is
complicated by the presence of an interferent “b”, which is present in varying unknown
quantities. Under these conditions, the instrument response of “b” strongly overlaps that of
“a”.
References:
Data table
The data for this tutorial can be found in the project file “Tutorial A” in the “Data” directory
installed with The Unscrambler®.
Seven solutions (samples) of known concentration (Y) of the constituent “a” will be used as
the calibration set. Three other (test) samples of unknown concentration are available.
These will be predicted using the developed regression model.
Light absorbance was measured at two different wavelengths, namely Red and Blue. Red is
variable 1, Blue is variable 2. Variable 3 has been designated as the concentration of a.
Opening the project file
Task
Open the project “Tutorial A” into The Unscrambler® project navigator and study the data in
the Editor. Use the Descriptive Statistics functionality to view some basic characteristics of
the data table.
How to do it
Use File - Open to select the project file “Tutorial_A.unsb” in The Unscrambler® data
samples directory. This directory is typically located in C:\Program Files\The
Unscrambler X\Data.
For the purposes of this tutorial, click the following link to import the data. Tutorial A data
set
The project should now be visible in the project navigator and the data should be displayed
in the editor.
Note that the values for variable Comp “a” are missing (blank) for the 3 Unknown samples.
Use the Tasks-Analyze-Descriptive Statistics… option to view some basic statistics of the
data, including the Mean, Standard Deviation, Skewness etc.
Tasks-Analyze-Descriptive Statistics…
The following dialog will open. Select the data matrix to be analyzed and ensure that no
rows or columns have been excluded from the analysis.
Descriptive Statistics Dialog
After clicking OK, the statistics will be computed. A new analysis node will appear in the
project navigator providing some simple plots and analysis of the data.
Descriptive Statistics Results Matrix
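The statistics listed here can also be reproduced by hand; a minimal sketch follows (the skewness convention may differ slightly from the software's):

```python
import numpy as np

def descriptive_stats(x):
    """Mean, sample standard deviation and skewness of one variable."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    sdev = x.std(ddof=1)            # sample standard deviation
    m2 = ((x - mean) ** 2).mean()   # population central moments
    m3 = ((x - mean) ** 3).mean()
    skew = m3 / m2 ** 1.5           # Fisher's moment coefficient of skewness
    return mean, sdev, skew
```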
Define ranges
In most practical applications of multivariate data analysis, it is necessary to work on subsets
of the data table. To do this, one must define ranges for variables and samples. One Sample
Set (Row range) and one Variable Set (column range) make up a virtual matrix which is used
in the analysis.
Task
Define two Column ranges (variable sets), one for “Light Absorb” and the other
for “Constituent a”. Also define two Row ranges (sample sets), “Calibration Samples” and
“Prediction Samples”.
How to do it
There are two options for defining data ranges in The Unscrambler®:
Create Row/Column ranges using the right mouse click option
Highlight a range of variables to be defined and right click in the column header. This
will display the Create Column Range option. Sample sets can also be defined as row
ranges using a similar method and selecting Create Row Range.
Create a column range
Rename the column range that is automatically highlighted in the project navigator.
If it is not highlighted, select it and right click. Choose the Rename option, and change the
name to “Constituent a”.
Repeat this process for the “Light Absorbance” set containing the first two columns
and the row sets: “Calibration” containing samples 1 to 7 and “Prediction”
containing samples 8 to 10.
Use Edit - Define Range… to create row and column sets.
Open the Define Range dialog from the Edit menu. Define the data as follows,
Name: Light Absorbance
Interval: columns 1-2
Define Range Dialog
Enter the Column numbers directly into the Set Interval field under rows and
columns.
Deselect variables marked by mistake by pressing Ctrl while clicking on the variable
to be removed from the set.
Click OK.
Similarly define the second variable Set using the Edit -Define Range option and
specifying:
Name: Constituent A
Click OK.
Choose Edit - Create Row Range to create sample sets.
Four sample and variable sets should now be displayed in the project navigator.
Data set with ranges
By organizing the data into sets from the beginning, one can add value to the analysis and
also use this information to communicate results. All analyses and plotting will be much
easier to set up, and can be used in the visualization of results.
Remember to save the project before proceeding, select File - Save or press the button.
Univariate regression
The simplest regression method (univariate regression) can be simply visualized in a 2-
dimensional scatter plot.
Task
Make a regression model of component “a” and the absorbance of red light.
How to do it
Perform the regression by plotting the red light variable against Constituent a. Select Plot -
Scatter from the Plot menu. The following plot should appear.
Scatter plot
The univariate regression should be performed on the calibration samples only, as the Y-
values are missing in the prediction set.
The plot is displayed without the trend lines visible. Toggle the regression and/or target
line on and off using the toolbar shortcut, and view the statistics for the plot.
The displayed correlation value of 0.91 indicates that the two variables are highly correlated.
The univariate model for these data can be generated from the Offset and Slope values shown
in the plot statistics.
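The Offset and Slope behind the plot's regression line come from an ordinary least-squares fit, which can be sketched as follows (with hypothetical data, not the tutorial's values):

```python
import numpy as np

def univariate_fit(x, y):
    """Least-squares offset and slope for y = offset + slope * x."""
    slope, offset = np.polyfit(x, y, 1)   # degree-1 polynomial fit
    return offset, slope
```

The correlation value shown in the plot statistics corresponds to `np.corrcoef(x, y)[0, 1]`.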
Weights
Click the tabs for both X and Y weights to see which options apply for each sheet.
Since the data are of spectral origin, ensure the weights are All 1.0
Validation
Under the validation tab select the cross validation option. Click on Setup to choose
Full from the drop-down list.
It is important to properly validate models. Leverage correction is not recommended,
as it gives an overly optimistic estimate of the error of a model. The estimate of
the prediction error (validation variance) is more conservative with cross validation
than with leverage correction!
Cross Validation Dialog
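The reasoning behind this recommendation can be illustrated with a toy leave-one-out (full cross validation) error estimate for a univariate model (a sketch only, with hypothetical data):

```python
import numpy as np

def loo_rmse(x, y):
    """Full (leave-one-out) cross-validation RMSE for a univariate fit."""
    errors = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i          # leave sample i out
        slope, offset = np.polyfit(x[keep], y[keep], 1)
        errors.append(y[i] - (offset + slope * x[i]))
    return float(np.sqrt(np.mean(np.square(errors))))
```

Because every sample is predicted by a model that never saw it, this estimate is more conservative than the calibration (fitting) error.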
Scores,
Loadings,
Variance,
and Predicted vs. reference.
When OK has been selected in the PLS dialog box and Yes has been selected to view the
plots, a PLS node will be added to the project navigator. This node contains the following,
Raw data,
Results,
Validation,
Plots.
The raw data used for building the model is stored in the results folder. Validation results
matrices generated from the model can be viewed along with predefined plots for the
analysis.
Toggle between different plots from those available in the project navigator. Alternatively
use the Plot… menu option, or right click in a plot to select a desired plot.
Information about the model is available in the Information field, located at the bottom of
the project navigator view. Information such as how many samples were used to develop
the model and the optimal number of factors is contained here.
Model info box
A number of important calculated results matrices may be obtained from the PLS node.
Returning to the PLS overview, activate the Scores plot, which is in the upper left quadrant
of the overview, by clicking in it.
Right click on this plot and select the Properties option.
Properties option
Select Point label from the available options, and in the dialog change the label to sample
number instead of sample name.
Properties: Point label
Activate the Predicted vs. Reference plot (lower right quadrant of the PLS overview). In this
plot, colors are used to differentiate between Calibration results (in blue) and Validation
results (in red).
Use the Next Horizontal PC and Previous Horizontal PC buttons to display the
Predicted vs. Reference for one and two PLS Factors.
Use the Cal/Val buttons to toggle between the calibration and validation samples. It
is also possible to toggle the regression and trend lines on and off.
Interpret the Y-Residual Validation Variance Curve
Activate the Y residuals plot in the lower left quadrant of the PLS overview and choose
Cal/Val for Y from the toolbar shortcuts.
Notice that the residual variance is down to 0 after factor 2. This usually indicates that the
model size is 2. Also, there is more Y-variance explained in the second factor than in the
first (39 vs. 61); this indicates that there may be an outlier.
Residual Y variance plot
Study the Predicted vs. Reference Plot
Under the PLS node in the project navigator, expand the Plots folder and select Predicted vs.
Reference to display this plot in the viewer.
The Predicted vs. Reference plot appears. The estimated prediction quality of the model
may be determined.
Use the toolbar icons to toggle between the regression and/or target lines.
High quality predictions were obtained from this PLS model. Comparison of the multivariate
regression model with the univariate regression model shows the marked improvement gained
by using the multivariate model. This gives confidence in the future prediction of unknown
values.
Study the Regression Coefficients Plot
From the main menu, choose the Plot - Regression Coefficients - Raw Coefficients (B) - Line
option. Change the plot layout to a bar chart using the toolbar shortcut .
Regression coefficients
This illustrates how to view raw regression coefficients (B), which define the model
equation. View the regression coefficients for the preceding factors using the arrows on
the toolbar.
In the present case, the values of the regression coefficients remain unchanged
when shifting from Weighted coefficients (Bw) to Raw coefficients (B). The reason is
that the weights were chosen as All 1.0 (no weighting) for the purposes of
calibration.
Regression coefficients can be viewed in different ways, such as lines, bars and
accumulated bars from the respective shortcut buttons found in the toolbar.
Hovering the mouse cursor over one of the bars displays numerical information associated
with the particular variable. Click once more to get the object information window. For the
two factor model developed in this tutorial, the b-coefficient for the Red absorbance is
1.0417, the b-coefficient for the Blue absorbance is -0.2083 and the offset (B0) is 1E-15, i.e.
approximately zero.
The b-coefficients can also be shown as a table by selecting the matrix Beta coefficients
(raw) in the Result folder of the PLS node in the project navigator.
Regression coefficients matrix
The b-coefficients represent the model equation relating the
concentration of “a” to the Red and Blue light absorbances:
Concentration of “a”: a = 0 + 1.0417 * Red – 0.2083 * Blue
Recall the value of the coefficient for Red in the univariate model (0.59524). This result
differs from what was found in the multivariate model.
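Using the coefficients reported above, applying the model equation to new samples can be sketched as:

```python
def predict_concentration(red, blue, b0=0.0, b_red=1.0417, b_blue=-0.2083):
    """Apply the model equation a = B0 + b_red * Red + b_blue * Blue.

    Default coefficients are the two-factor PLS values from this tutorial;
    B0 is taken as zero since the reported offset is ~1E-15.
    """
    return b0 + b_red * red + b_blue * blue
```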
The results should be saved in the project with the data.
Select File - Save or use the save tool and give the project file the name “Tutorial A”.
Prediction
The main purpose of developing a regression model is for future prediction of the properties
of new samples measured in a similar way.
Task
Use the PLS calibration model to predict the concentration of “a” for the three unknown
samples in the data table.
How to do it
Use the Tasks - Predict- Regression… option to predict the values of the new samples. Enter
the parameters below in the Prediction dialog:
Prediction dialog
It is possible to find all models in the current project using the drop-down list next to Select
model. Select the PLS model developed and click OK to start the prediction.
Evaluation of the predicted results
During the development stage of a regression model, the quality of the predictions must be
checked by evaluating the quality of the Predicted vs. Reference plot.
The predictions can be checked when some reference measurements are available. This is
not possible for the unknown samples in this tutorial as there are no reference
measurements available for these samples. However, a method exists for determining the
quality of the predictions, based on the properties of projection modeling.
Task
Perform a prediction and evaluate the quality of the predicted results.
How to do it
First, evaluate the predicted results of the unknown samples and determine if these values
are in the same range as the calibration range of samples. Select the Prediction plot under
the new Predict – Plots node in the project navigator to visually assess the results.
Prediction with deviation
The predicted values are displayed as horizontal bars. The size of the bars represents the
deviation (uncertainty) in the estimates. The numerical values for the Y Predicted values and
Y deviations can be found in the output matrices, and are displayed under the plot. A
comparison of these predictions to actual values cannot be made; however, if the new
samples have predicted values similar to those in the calibration set and the size of the
deviation bars is small, the quality of the prediction may be trusted.
Predicted values
Another method for determining the reliability of the predicted values is to study the Inlier
vs. Hotelling’s T² plot available as a right click option in any plot.
Select the Prediction - Inlier/Hotelling’s T² - Inliers vs. Hotelling’s T² option to display this
plot.
For a prediction to be trusted, the sample must not be too far from the calibration samples.
This may be checked using the Inlier distance. The predicted sample’s projection onto the
model should not be too far from the center. This may be checked using the Hotelling’s T²
distance.
Inliers vs. Hotelling’s T²
In this case all the samples were found to be in the bottom left corner of the plot,
indicating that the predicted results can be trusted.
Description
Main learning outcomes
Data table
Preparing the data
Insert category variables
Check column (variable) sets
Define sample sets from category variable column
Objective 1: Find the main sensory qualities
Make a PCA model
Interpret the variance plot in the PCA overview
Interpretation of the scores plot for the PCA
Interpretation of the correlation loadings plot
Interpretation of scores and loadings
Interpretation of the influence plot
Objective 2: Explore the relationships between instrumental/chemical data (X) and
sensory data (Y)
Make a PLS regression model
Interpretation of the variance plot
Interpretation of the scores plot
Interpretation of the loadings and loading weights plot
Interpretation of the predicted vs. reference plot
Objective 3: Predict user preference from sensory measurements
Make a PLS regression model for preference
Interpretation of the regression overview
Description
This tutorial aims to use multivariate techniques to analyze the quality of raspberry jam in
order to determine which sensory attributes are relevant to “perceived quality”. The analysis
will cover three aspects as follows.
A trained tasting panel has provided scores for a number of different variables using
descriptive sensory analysis. In this tutorial the first objective is to find the main
sensory quality properties relevant for raspberry jam.
The second objective is to find a way of rationalizing quality control, since the use of
taste panels is very costly. In this application a number of laboratory instrumental
measurements were investigated to potentially replace the sensory testing panel.
The third and final objective of this application is to be able to predict consumer
preference for raspberry jam from descriptive sensory analysis. The use of PLS
regression modeling techniques was investigated in order to potentially find a
relationship between sensory data and preference.
References:
Data table
Click the following link to import the Tutorial B data set used in this tutorial.
The analysis is based on 12 samples of jam (objects), selected to span the expected, normal
quality variations inherent in such products. Several observations and measurements were
made on the samples.
Agronomic production variables
The samples were taken from four different cultivars, at three different harvesting times.
The table below describes the sampling plan for this analysis.
Sample description
No Name Cultivar Harvest time
1 C1-H1 1 1
2 C1-H2 1 2
3 C1-H3 1 3
4 C2-H1 2 1
5 C2-H2 2 2
6 C2-H3 2 3
7 C3-H1 3 1
8 C3-H2 3 2
9 C3-H3 3 3
10 C4-H1 4 1
11 C4-H2 4 2
12 C4-H3 4 3
Note that the agronomic production variables are not used as input variables in any of the
matrices. These represent known information which may be extremely valuable for the
interpretation of the results of the data analysis. They will be utilized as category variables in
the analyses performed in this tutorial.
Column (variable) set Instrumental
Three chemical and three instrumental (APHA colorimetry) variables were also
measured on the samples tested by the sensory panel. These are described in the table
below.
Instrumental variables
No Name Method
1 L Lightness
2 a Green-red axis
3 b Blue-yellow axis
4 Absorbance Absorbance
Sensory variables
No Name Type
1 Redness Redness
3 Shininess Shininess
6 Sweetness Sweetness
7 Sourness Sourness
8 Bitterness Bitterness
9 Off-flav Off-flavor
10 Juiciness Juiciness
11 Thickness Viscosity/thickness
Some additional information about the cultivar and harvest time now needs to be added to
this data as two new columns.
To select a column, click on the header cell containing the column number. Activate the first
column of the table, right mouse click and select Insert - Category Variable or use the menu
options and select Edit - Insert - Category variable.
Highlight column to activate insert options
In the dialog box, enter the category variable name “Harvest Time”. Keep the default option
Select the level manually selected.
Enter the level names: “H1”, “H2” and “H3” followed by a click on Add.
Click OK.
In the new column, double click in each cell and select the appropriate value for each sample
as given in the sample names.
Note: Category variable cells are orange in the editor to distinguish them from
ordinary variables.
Add a second column in the same way, after highlighting the first column: Edit - Insert -
Category Variable. In the dialog box, enter the category variable name “Cultivar”.
Keep the default option Select the level manually selected.
Enter the level names: “C1”, “C2”, “C3”, and “C4” followed by a click on Add.
Click OK.
In the new column, double click in each cell and select the appropriate value for each sample
as given in the sample names. Alternatively, select all cells of each cultivar in sequence and
fill in the category level using the right-click Fill function.
The Tutorial_b data table displayed in the Editor (after insertion of Cultivar and Harvest
Time)
Task
Check that the three column (Variable) Sets: “Instrumental”, “Sensory” and “Preference”
have been defined.
Verify the existence of two sample sets “Calibration Samples” and “Prediction Samples”.
These sets can be visualized in the project navigator.
How to do it
To create column and row ranges, select Edit - Define Range to open the Define Range
dialog.
Three sets have been predefined in the project Tutorial_B data set.
Column name: Instrumental
Interval: 3-8
Column name: Preference
Interval: 14
Column name: Sensory
Interval: 9-13, 15-21
To verify these definitions use the Edit - Define range and inspect the information in this
dialog.
The Define range dialog with three column sets
Additional row sets will be added for the various levels of the category variables harvest
time and cultivar.
How to do it
Begin by selecting the column “Cultivar” in the data editor, and select Edit- Group Rows…,
which will open the Create row ranges from column dialog.
Edit- Group rows…
The column that was selected, “Cultivar”, is already in the Cols field.
There is no need to specify the Number of Groups as it is based on a category variable.
Create row ranges from column
Click OK.
Automatically 4 row ranges have been added. Look in the Row folder to see them:
New row ranges
Maximum components: 6
Check the Identify outliers and Mean center data boxes, if these check boxes are not
already selected.
Principal Component Analysis dialog: Model inputs
Weights
From the Weights tab verify that the weights are all 1.0 (constant).
No weighting is used in this model as the sensory panel is known to be well trained.
However, sensory variables are often weighted when there is evidence that the
panel is not well trained, or when investigating relationships with other variables.
The most common weighting to use is 1/SDev.
Weights tab dialog
Validation
From the Validation tab select the option Cross Validation and press Setup which
opens the Cross Validation Setup dialog. Here select Full from the drop-down list for
cross validation method.
Validation Dialog
This validation method is more time consuming than other options, but the estimate of the
residual variance is more reliable.
Click OK to start the PCA. After the PCA analysis is completed, the program will ask,
“Do you want to view plots of model PCA now?”. Click Yes to see the PCA Overview plots. A
new node has been added to the project navigator containing all the PCA result matrices and
plots.
The scores plot is a map of the samples, and shows how they are distributed. It can be used
to isolate samples that are similar, or dissimilar to one another. In this analysis, the plot
labels show that PC-1 explains 58% and PC-2 28% of the total variance in the data. The
explained variance curve (in the lower right corner) is an excellent tool for selecting the
optimal number of components in the model.
The explained variance increases until PC 5 is reached. The software does suggest the
optimal number of PCs for a model, but it is up to the user to analyze the data and confirm
the optimal number of PCs in this model, usually based on this plot.
The highest explained variance is found with 5 PCs, but a model using 3 PCs explains a
similar amount of variation. A simple (parsimonious) model is usually
more robust than a complex one, and easier to interpret. It is always suggested to work with
a model consisting of as few PCs as possible. The info box in the lower left corner of the
main workspace indicates that 3 PCs are considered optimal for this model.
Info Box
Task
Change the explained variance plot to a residual variance plot.
How to do it
Activate the lower right plot by clicking in it. Toggle between the Explained / Residual
views using the toolbar shortcuts.
The explained variance is now converted to residual variance. The information is the same,
but presented in another way. The residual variance is well suited to finding the optimal
number of PCs to use in a model, while the explained variance is a better measure for
explaining how much of the variation is described by the model. The plot layout can be
changed to a bar chart by using the plot layout shortcut .
The PCA Explained Variance Bar plot
The model with 3 PCs describes 92% of the total validation variance in the data; for
calibration it is 96%. These values may be obtained by clicking on the specific data point in
the plot.
Use the toolbar buttons to change between having only the calibration or validation
variance curve plotted, or both.
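The explained calibration variance for a given number of PCs can be sketched with a small numpy computation (mean centering as in the model setup; an illustration, not the software's exact algorithm):

```python
import numpy as np

def explained_variance(X, n_components):
    """Cumulative % of total (calibration) variance captured by the first PCs."""
    Xc = X - X.mean(axis=0)                       # mean center the data
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values
    var = s ** 2                                  # variance per component
    return 100.0 * var[:n_components].sum() / var.sum()
```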
The scores plot for this analysis indicates that the 12 samples are not arranged in a random
way. By moving from left to right along this plot, a pattern can be observed where samples
harvested at time H1 are mainly found on the left. These then change to H2 and finally H3.
Moreover, moving from the top to the bottom, C4 samples occupy the top region, followed
by C3, then C2, and finally C1.
The row sets based on the category variables that were inserted into the data table can be
used to better visualize these trends.
In the scores plot, right mouse click and select Sample Grouping to open the dialog where
different row sets can be used for grouping and color-coding the plot.
Select all the cultivar row sets (C1, C2, C3, C4) individually and use the arrow to add them to
Marker settings for grouping purposes.
Tick or untick the box Use group name as label to either have the real name or the level of
each sample as a point label.
The marker color, shape and size can be customized here for optimized viewing of the data.
Sample Grouping Dialog
When the desired settings have been defined, click OK to complete the operation.
In the scores plot, right mouse click to select Properties, where customization of the plot
appearance is possible. Select header and change the plot heading to “Scores plot with
Cultivar Grouping”. Choose a different font size or color if so desired. Click Apply to preview
and OK to apply and exit the dialog.
Properties Dialog
Repeat the above sample grouping process, this time using the category variable Harvest
Time.
The plot shows that two variables (redness and colour) have an extreme position to the right
of the plot along PC1. They are close to each other (i.e. they are highly positively correlated),
and far from the center and are very close to the edge of the 100% explained variance
ellipse. This also means that samples lying to the right of the scores plot have higher values
for those two variables.
Along the vertical axis (PC2), two variables can be observed, with high positive values for this
PC. These are R.SMELL and R.FLAV. These two variables are opposite to the variable OFF
FLAV which has lower values for this PC. This indicates that raspberry smell and flavor
correlate positively with each other, and negatively with off-flavor.
In this new plot, the horizontal axis is unchanged (PC1) and the vertical axis now shows PC3.
All of the results for the PCA are now part of the project Tutorial_B. Save the project to
capture the PCA results. The next steps in this tutorial will make use of the sensory,
instrumental and preference data.
Close the PCA overview by selecting its name in the navigation bar at the bottom of the
viewer and right clicking to select Close.
Objective 2: Explore the relationships between instrumental/chemical data (X) and
sensory data (Y)
Is it possible to predict the quality variations observed in the jam data by using
instrumental measurements only? Training and employing a sensory panel is costly and time
consuming. Producers of jam would find it most convenient if they could predict quality
variations by measuring some properties by instrumental means. The next task in this
tutorial is to make a regression model between the sensory and instrumental data and
analyze the results for a possible solution.
Responses
Maximum components: 6
X and Y weights tabs
Select the X and Y Weights tabs to access their dialogs. Weighting will be applied to
all the X and Y variables for regression purposes.
X Weights Dialog
Press All to change the weighting of all variables at the same time. Variables can also
be selected by clicking on them in the list. Remember to hold the Ctrl key down
while selecting several variables. Choose the A / (SDev + B) radio button. Use
constants A = 1 and B = 0. Press Update and ensure that the weights change in the
list.
All variables are weighted by dividing them with their own standard deviations. This
allows all variables to contribute to the model, regardless of whether they have a
small or large standard deviation from the outset; only the systematic variation is of
interest here.
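The effect of this weighting can be sketched in a few lines of Python (an illustration only; the variable values below are made up, and the software performs this step internally):

```python
import statistics

def weight_variable(values, a=1.0, b=0.0):
    """Weight a variable by A / (SDev + B), as in the X Weights dialog.

    With A = 1 and B = 0 this is plain 1/SDev standardization.
    Illustrative sketch only, not The Unscrambler's internal code.
    """
    sdev = statistics.stdev(values)   # sample standard deviation
    w = a / (sdev + b)                # the weight applied to every value
    return [v * w for v in values]

# After weighting, a variable with a large spread contributes no more
# than one with a small spread: both end up with unit standard deviation.
raw = [2.0, 4.0, 6.0, 8.0]
weighted = weight_variable(raw)
print(round(statistics.stdev(weighted), 6))
```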
Now go to the Y Weights tab and do the same. Do not click OK, but after the
Update, go to the Validation tab.
Validation tab
Select Cross validation from the Validation tab.
Press the Setup button to access the Cross Validation Setup dialog and choose Full
from the drop-down list. It is always recommended to use test set or cross validation
to develop final models.
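Full cross validation leaves out one sample at a time and refits the model on the remaining samples. The resampling mechanics can be sketched as follows, with a trivial "predict the training mean" model standing in for PLS (the values are made up):

```python
# Minimal sketch of full (leave-one-out) cross validation. The Unscrambler
# refits a PLS model for each left-out sample; here a mean predictor is
# used so the example stays self-contained.
def loo_cross_validation(y):
    residuals = []
    for i in range(len(y)):
        train = y[:i] + y[i + 1:]             # leave sample i out
        prediction = sum(train) / len(train)  # refit the (trivial) model
        residuals.append(y[i] - prediction)   # validation residual for sample i
    return residuals

resid = loo_cross_validation([1.0, 2.0, 3.0, 6.0])
print([round(r, 2) for r in resid])
```

Each sample is thus predicted by a model that never saw it, which is what makes the validation variance an honest estimate.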
Click OK in the regression dialog when all parameters have been set up. The computation of
the model will begin. After PLS analysis is completed, the system will ask “Do you want to
view the plots of model PLS now?”.
Click Yes to see the PLS Overview plots. A new node, PLS, has been added to the project
navigator.
PLS Regression Overview
This overview provides the most useful and common predefined result plots for PLS,
including loading weights and residuals, etc. The model can always be reviewed during the
analysis stage by selecting any of the result plots under the PLS - Plots node in the project
navigator. For this exercise, various Y response values were used for model development.
Therefore the overview results for each of these responses are available by choosing the Y
value of interest in the tool bar. When performing this type of analysis with multiple
responses the non-significant variables may be determined for each of the responses. It can
also provide information on which sensory responses can best be predicted from the
instrumental measurements without making a separate PLS model for each response. When
the Predicted vs. reference plot (lower right quadrant) is active, the name of the Y
value being analyzed appears in the toolbar. Another Y-response
can be chosen from the drop-down menu, or one can scroll through the values using the
arrow tool on the right.
Interpret the explained variance curve, which can be shown as residual variance, or as
explained variance. The two different views are useful for different tasks.
How to do it
The Y-explained variance plot is in the lower left quadrant. This plot can be changed to the
residual variance plot by using the toolbar shortcuts, and to the X-explained variance by
clicking on the X button.
A local maximum is achieved for five PLS factors. The next task is to determine why the
validation curve does not follow the general trend. This can be done by looking at the
explained variance for the variables individually.
Y-explained variance plot
From the plot menu select Variances and RMSEP - X- and Y-Variance… Make sure the
bottom plot shows the Explained Variance for the 12 individual Y variables. If not, change it
by using the toolbar shortcut. Also do not select Total, but select Cal from the toolbar
shortcuts.
Add a legend to the plot by right clicking and selecting Properties. Select legend, and check
the box visible to add the legend to the plot.
PLS, Explained Validation Variance Plot displayed for the 12 individual Y-variables
The conclusion reached from the residual variance curve was that two PLS factors were
optimal. The variables that are well described are reflected in the information conveyed by
these factors.
About 85% of the color variation (variables 1 and 2), and 80% of the variation in sweetness
(variable 6) can be explained by a combination of the chemical and instrumental variables.
Note that only 23% of the total Y-variance is explained by the model using two factors.
Use the drop-down list in the toolbar to observe the prediction quality for other variables
measured in this analysis. Make sure these plots are displayed for two PLS factors, as this is
the correct number for this model. Note that for several of the properties, including
raspberry flavor, raspberry smell, and off-flavor, the instrumental values do not provide any
real information. This analysis shows that the chosen instrumental measurements are not a
good substitute for the sensory analysis of these jams.
Responses
Maximum components: 6
PLS Regression Dialog
Weights in X and Y
It is necessary to standardize all variables with the option 1/SDev.
Select the X Weights tab and weight all the X variables with 1/SDev so that each
variable will contribute equally in the modeling step. Also weight the Preference
values (Y) by 1/SDev in the Y Weights tab.
Validation
Full Cross Validation
Press Setup to access the Cross Validation Setup dialog and choose Full cross
validation as the cross validation method.
Press OK.
Activate the explained variance plot in the lower left quadrant, and change it to the residual
Y variance plot by using the toolbar shortcuts. The prediction error tapers off
significantly after two PLS factors. This represents the optimal model conditions.
Residual Y Validation Variance Plot
Turn on the regression line and the target line with the toolbar shortcuts.
Predicted vs. reference Plot with Trend Lines
It can be observed that the predictions are of good quality. Some samples are not so well
predicted, but the overall correlation is satisfactory.
Redness, Color and Sweetness (B1, B2 and B6) are significant in predicting Preference.
Raspberry Smell (B4) is also significant, but contributing negatively to the Preference.
Thickness (B11) seems to be of importance also as it has a large (negative) coefficient.
Save the project file with the name “Tutorial_B”. It may also be saved as the model file
itself, providing a smaller file with just the model information that can be used for predicting
new samples in real time using The Unscrambler® Prediction Engine and The Unscrambler® X
Process Pulse products. To save the model only, right click on the model node in the project
navigator and select the option Save Model. In the dialog choose what size model to save.
Models other than the full model do not include all the results matrices, and therefore
provide fewer results in addition to the predicted values when used.
Save Model
Interpret the prediction results to see whether the predictions can be trusted.
How to do it
Activate the “JAMdemo” data matrix. Select Tasks - Predict - Regression… and specify the
following parameters in the Prediction dialog:
Check the boxes for Inlier statistics and Sample Inlier dist (Mahalanobis distance) to provide
valuable statistical measures of the similarity of the prediction samples to the calibration
samples.
Click OK to perform the prediction.
The Prediction dialog
The predicted preferences for the “unknown” new jams have some uncertainty limits, i.e. the
accuracy of new predictions is limited. However, this model can be used to predict the
preference of new jam samples, providing an indication of which ones will or will not be
accepted by consumers.
View the Inlier vs. Hotelling’s T² plot by selecting Plot – Inlier/Hotelling’s T² - Inlier vs
Hotelling’s T². This plot shows how similar the new samples are to those used in developing
the calibration model. For a prediction to be trusted the predicted sample must not be too
far from a calibration sample. This is checked by the Inlier distance. The projection of the
new sample onto the model also should not be too far from the center. This may be checked
using the Hotelling’s T² distance.
Save the project file under the name “Tutorial B_complete”. This now includes all the data,
three models, and the predicted results for preference.
To gain a better approximation of what to expect in future predictions, the RMSECV should
be analyzed.
The RMSECV may be studied for Preference for all PLS factors. RMSECV (using two factors) is
0.83. This means that any predicted new sample on the scale from 1 to 9 will have a
prediction error around 0.8. This is an acceptable error level in sensory analysis, which has
much uncertainty in all measurements.
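The RMSECV (and RMSEP) arithmetic is simply the square root of the mean squared residual between cross-validated predictions and reference values. A sketch with made-up values on the 1-9 preference scale:

```python
import math

def rmse(predicted, reference):
    """Root mean square error between predicted and reference values."""
    residuals = [p - r for p, r in zip(predicted, reference)]
    return math.sqrt(sum(e * e for e in residuals) / len(residuals))

# Hypothetical cross-validated predictions (not the tutorial data):
# an RMSECV near 0.8 means a typical prediction misses by about 0.8 units.
y_ref = [3.0, 5.0, 7.0, 6.0]
y_pred = [3.5, 4.2, 7.9, 5.6]
print(round(rmse(y_pred, y_ref), 3))
```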
Verify that the correct number of factors has been chosen for the selected model. The
optimal number of components should be used for the export. Therefore, change the
number of factors to 2 before clicking OK.
Two types of model export are available:
Full
Short prediction: corresponding to export of only the regression coefficients
Observe the ASCII file that is generated; it has the file name extension .AMO. The format of
the file is described in the ASCII-MOD Technical Reference.
Similarly, any of the result or validation matrices can be selected for export into other
formats. Supported export formats are:
ASCII
JCAMP-DX
Matlab
NetCDF
ASCII-MOD
Full ASCII-MOD export includes all results that are necessary to perform outlier detection,
etc. This format can be used for applying models outside The Unscrambler® environment,
for example in a custom written program script. The ASCII-MOD file is readable by any text
editor, such as Notepad.
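As a sketch of such a custom script: applying a “short prediction” export outside The Unscrambler amounts to an intercept plus a dot product of the exported regression coefficients with the new spectrum. The coefficient and spectrum values below are made up, and parsing of the actual .AMO file is omitted:

```python
# Applying exported regression coefficients to a new sample: the predicted
# value is the intercept b0 plus the dot product of the coefficients with
# the new measurement vector. All numbers are hypothetical; a real .AMO
# file would need to be parsed first.
def predict(b0, coeffs, x):
    return b0 + sum(b * xi for b, xi in zip(coeffs, x))

b0 = 0.25
coeffs = [0.0, 1.2, -0.4]   # hypothetical regression coefficients
spectrum = [0.8, 0.5, 0.3]  # hypothetical new absorbance values
print(round(predict(b0, coeffs, spectrum), 3))
```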
Description
What you will learn
Data table
Get to know the data
Read data file and define sets
Plot raw data
Univariate regression
Calibration
Interpretation of the calibration model
Study the predicted vs. reference plot
Study the explained variance plot
Multiplicative Scatter Correction (MSC)
Check the error in original units: RMSE
Predict new MSCorrected samples
Guidelines for calibration of spectroscopic data
Description
There is a need for an easy way to determine the concentration of dye (a bright, red-colored
heme protein, Cytochrome-C) in water solutions. Dye absorbs light in the visible range, and
the concentration determination will be based on this light absorbance.
In the solutions to be analyzed there are varying, unknown amounts of milk, which absorbs
some light in the same wavelength range as dye and therefore causes chemical interference
in the measurements. In addition, milk contains particles that give serious light scattering.
Another effect that will influence the absorbance spectra is the varying sample path length.
The light absorbance spectrum figure shows the light absorbance spectrum of one sample of
the dye/milk/water solution.
Absorbance Spectrum
The vertical lines represent the 16 different wavelength channels selected as predicting
variables for this sample set.
This example is constructed so that it can be duplicated in a lab. It illustrates the interference
effects and other effects that make spectroscopy challenging. However, similar problems
occur in many industrial applications, e.g. measuring the concentration of different
chemical species in sewer water, which contains many other chemical agents as well as
physical interferences like slurries and particles, or measuring moisture and solvents in a
granulation process.
The two major peaks (variables Xvar4 and Xvar6) represent the absorbance of dye, while the
first peak (Xvar2) represents absorbance due to an absorbing component in the milk. The
broad peak to the right (Xvar12, Xvar13, Xvar14) is due to light absorption by water itself.
PLS regression
Handling of interference problems, Multiplicative Scatter Correction (MSC)
Check list for calibration of spectroscopic data
Data table
Click the following link to import the Tutorial C data set used in this tutorial. This is best
done in a new project (File-New).
The data matrix, Tutorial_C, is imported into the project. It consists of 28 samples (samples of
solutions) that span the two most important types of variation: the dye and milk
concentrations. The composition of dye/milk/water in each calibration sample is shown. The
values are given in ml making a total of 20 ml in each solution (sample).
Sample Dye Milk Water Sample Dye Milk Water
In the project navigator, expand the tree under the data matrix Tutorial_C to see the file
content. An Editor with the data table is launched in the viewer.
Project navigator view of data
One can see that some sets have already been defined, but one additional column set
named Statistical will be defined.
The data table already has the following: Column (Variable) Ranges:
Put the cursor in the data viewer. Now one can define a new column set (variable range) by
going to Edit - Define Range… which will open the Define Range dialog. Define the column
set by putting the name “Statistical” in the Range - Column space, and for interval, enter
3-19 for columns as shown below.
Define Range Dialog
Click OK when finished defining the column and row sets. Use File-Save As… to save the
project with the updated name “Tutorial_C_updated” in a convenient location before
continuing. The organized data will now have numerous nodes for column and sample sets
in the project navigator, and a color-coded data matrix.
Change the data type of the column range “Absorbance” into spectral data. To do so, select
the range “Absorbance” and right click. Select the option Spectra.
This will change the display of some plots, which are used differently with spectral data
than with other types of variables.
Spectra
Plot some calibration samples in order to see how the spectra vary with varying amounts of
dye and milk.
How to do it
Make a line plot of samples that have the same amount of milk, 10 ml. The line plot is just of
the X-variables for these samples, so in the data table editor, select the four samples having
10 ml of milk by marking the samples in the Editor (samples 6, 14, 19, and 23) by clicking the
sample numbers while holding down the Ctrl key. Then right click and select Plot - Line.
Line plot dialog
In the Line Plot dialog that appears, select the column set Absorbance from the drop-down
list. Click OK and note that the four samples are highlighted in the Editor.
The same could be done by selecting the menu option Plot - Line… after having selected the
samples in the viewer, and specifying use the Column set Absorbance in the Line Plot dialog.
Line Plot of sample with 10 ml milk
Use the shortcut keys to change the layout of the plot to a bar chart.
These four samples have the same milk level and the line plot shows that the dye level has
influence on the absorbance of variables number 2 - 8 only.
Plot samples 20, 21, 22, and 23 the same way, using the Ctrl key to select just these
specific rows. These samples have the same dye level: 6 ml.
The plot shows that increasing milk level will increase the absorbance of light of all
wavelengths from number 1 to number 16. There seems to be a great deal of interference or
scattering to deal with, over the whole spectrum. This indicates that some transformations
of the data may be useful to get an optimal model.
Univariate regression
Is it possible to predict the dye level from the absorbance of one single wavelength? Before
we enter the multivariate world we want to see what can be done by univariate regression.
Task
Find the best wavelength on which to make a univariate regression model.
How to do it
You find the best wavelength by looking at the correlation between each absorbance
variable and the dye level variable. Select the data set Statistical from the project navigator.
Select Tasks - Analyze - Descriptive Statistics… and specify the following parameters in the
Descriptive Statistics dialog.
When the computation is done, there will be a prompt asking if you want to view the plots.
Click Yes, and the two plots summarizing the statistics will be displayed. You will find a new
node, Descriptive statistics, in the project navigator, which consists of three folders: raw
data, results, and plots.
In the project navigator, expand the folder results. Select the Variable Correlation matrix
from this folder to view this in the viewer. We will use these data to find the highest
correlation between Dye Level and some X-variable. You may select the first row, dye level,
and plot it (Plot - Bar) to see the highest correlation (after the correlation between Dye level
and Dye level, which of course is 1).
Bar chart of variable correlation
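The correlation coefficient behind this screening step can also be computed directly. A minimal sketch with made-up dye levels and absorbances (not the tutorial data): the X-variable with the larger absolute correlation is the better univariate predictor.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical dye levels (ml) and absorbances at two wavelengths.
dye = [2.0, 4.0, 6.0, 8.0]
xvar_a = [0.11, 0.19, 0.32, 0.38]   # tracks the dye level closely
xvar_b = [0.40, 0.10, 0.35, 0.20]   # mostly unrelated variation
print(round(pearson(dye, xvar_a), 3), round(pearson(dye, xvar_b), 3))
```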
The variable with the highest correlation coefficient to Dye Level is Xvar6 with a correlation
coefficient of 0.49. You can close the bar plot of the correlation matrix by selecting the tab in
the navigation bar at the bottom of the viewer and right clicking to select close.
Now we should illustrate the regression in a plot. To get the right plot go back to the original
data set, Tutorial_C, and select the columns Xvar6 and Dye level using the Ctrl key and Plot -
Scatter. In the Scatter Plot dialog, remember to select only the calibration samples from the
row drop-down list.
Another way to do this is to go to Plot - Scatter and in the Scatter plot dialog click on the define
button next to Cols., which will open the Define Range dialog. Here you can select the
columns Dye level and Xvar6, or type in columns 3, 9 in the Interval box. Select the
calibration samples for the rows.
Scatter plot dialog showing define option
Turn on the Regression Line and Target Line with the shortcut buttons. We can
also add the plot statistics from the toolbar shortcut. From the plot we see our results
are not very good using just one variable to model the dye level. Hopefully we can do better
with multivariate regression models.
Scatter plot of Xvar6 vs. Dye level with target and regression lines
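The regression line itself comes from ordinary least squares on a single variable. A sketch with made-up values (not the tutorial data), fitting y = b0 + b1*x:

```python
# Univariate least-squares regression of dye level on one wavelength.
def fit_line(x, y):
    """Return intercept b0 and slope b1 of the least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    return b0, b1

absorbance = [0.1, 0.2, 0.3, 0.4]   # hypothetical Xvar6 values
dye = [1.0, 3.2, 4.8, 7.0]          # hypothetical dye levels (ml)
b0, b1 = fit_line(absorbance, dye)
print(round(b0, 3), round(b1, 3))
```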
Calibration
We choose to make a PLS regression model because PLS takes the variation in Y into
consideration when the model is calibrated.
Task
Make a PLS regression model between the variable set Absorbance (X) and the response Dye
Level(Y).
How to do it
Activate the Tutorial_C data Editor from project navigator and select Tasks - Analyze -
Partial Least Squares Regression…. In the PLS dialog, specify the following parameters:
Go to the Validation tab to select the option cross validation. You can further define the
settings for this by clicking Setup…, it opens the Cross validation setup dialog. Select
Random as the cross validation method and set the number of segments to “7”.
Cross validation setup dialog
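Random segmented cross validation simply partitions the samples into the requested number of segments, each of which is left out once while the model is refit on the rest. A sketch for 28 samples and 7 segments:

```python
import random

def random_segments(n_samples, n_segments, seed=0):
    """Split sample indices into random cross-validation segments.

    Illustrative sketch of the resampling scheme; The Unscrambler does
    this internally when Random cross validation is selected.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # deterministic shuffle
    return [indices[i::n_segments] for i in range(n_segments)]

segments = random_segments(28, 7)
print([len(s) for s in segments])
```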
Start the calibration by clicking OK. When the computation is complete you will be asked if
you want to view the PLS plots now. Click Yes, and the regression overview plots will be
displayed.
A new node, PLS, has been added to the project navigator. This has four folders with the raw
data, results, validation, and plots for the PLS model. Rename the PLS node in the project
navigator for this analysis to “PLS Tutorial C” before you continue. You can do this by right
clicking the latest PLS model in the project navigator and selecting Rename.
Scores plot
The plot in the upper left quadrant is the Scores plot. From the scores plot we can interpret
that the combination of two main factors, factor 1 and factor 2, reflects the variations in the
milk and water levels. The first two factors show that 99% of the X-variance (factor 1: 84%,
factor 2: 15%) explains 75% of the variance in the response dye level (factor 1: 19%, factor
2: 56%). By studying the samples in the plot we
can see that the milk level increases from upper left to lower right in the plot, while the
water level increases from right to left.
Regression coefficients
The regression coefficients plot summarizes the relationship between all predictors and a
given response. It is easiest to access this plot by selecting it from the plots folder in the
project navigator.
Plots folder in project navigator
It is also possible to see this plot when any PLS plot is active in the viewer by going to Plot
- Regression Coefficients - Raw coefficients (B) - …, or by right clicking and selecting
PLS - Regression Coefficients - Raw coefficients (B) -…. Select the line plot of the raw
regression coefficients. Since we did not apply any weighting to the data, the plots of
weighted and raw regression coefficients will be identical.
The regression coefficients plot indicates that the wavelength numbers (X-variables) 4 and 6
are the most important for the prediction of Y (concentration) in the first factor. The pattern
is clearer here than in the loadings plot.
Regression coefficients plot
Compare the regression coefficients plot to the raw absorbance data. Note that high
values, indicating important variables, are present in the region where we know that milk
and dye absorb light.
Click OK to calculate the statistics, and select Yes to view the plots now. Since descriptive
statistics were already run before, using 17 of the variables rather than just the absorbance
values, the current results appear as a new node, Descriptive Statistics(1), in the project
navigator. We are not interested in the default plots that are shown, but want a plot that
helps us to understand the scatter in the data. Make the plot window active by clicking in it,
and select the menu option Plot - Scatter effects. In this plot of the mean value of each
X-variable we see that the scatter is not the same for all variables. The first 8 variables lie
approximately on a straight line. For the other variables, one can observe a spread in the
scatter effects.
Scatter effects plot
Select the data matrix Tutorial_C. Select Tasks - Transform - MSC/E… Specify the following
parameters in the Multiplicative Scatter Correction dialog:
Prediction samples are not used to find the correction factors that will now be found and
used in the MSC.
Variables 1-8 are omitted because the light absorption at these wavelengths varies with the
dye level, while wavelengths 9 to 16 (the water absorption peak) are independent of the
concentration of dye. The difference at these wavelengths is instead caused by the general
light scatter due to milk addition. It is important that only wavelengths with no chemical
information are used to find the correction factors.
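The MSC arithmetic can be sketched as follows: each spectrum is regressed against the mean spectrum over the chemically inert wavelengths only, and the fitted offset a and slope b are then used to correct the whole spectrum as (x - a) / b. The six-point spectra below are made up, with the second a scatter-distorted copy of the first:

```python
# Sketch of Multiplicative Scatter Correction. Only the "inert" points
# (here the last three, mimicking wavelengths 9-16) are used to fit the
# offset a and slope b of each spectrum against the mean spectrum.
def msc(spectra, inert):
    n = len(spectra)
    mean = [sum(s[j] for s in spectra) / n for j in range(len(spectra[0]))]
    corrected = []
    for s in spectra:
        x = [mean[j] for j in inert]
        y = [s[j] for j in inert]
        mx, my = sum(x) / len(x), sum(y) / len(y)
        b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
        a = my - b * mx
        corrected.append([(v - a) / b for v in s])   # remove offset and scaling
    return corrected

s1 = [0.1, 0.4, 0.2, 0.3, 0.5, 0.6]
s2 = [2 * v + 0.1 for v in s1]        # same chemistry, distorted by scatter
out = msc([s1, s2], inert=[3, 4, 5])
# The two distorted copies collapse onto each other after correction.
print(all(abs(p - q) < 1e-9 for p, q in zip(out[0], out[1])))
```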
The transformed data are now displayed in the project navigator with the name
“Tutorial_C_MSC”. There is also a node with the MSC model for transformation, which can
be applied to future samples. This is called “MSC_Tutorial_C”, and has a folder with the
model under it.
Look at the corrected data by selecting the data from the new project navigator node, and
going to Plot - Line. Select the new sample matrix with the corrected data in the Line Plot
dialog, row set calibration, and column set Absorbance.
Line plot of MSC transformed data.
We want to compare the corrected data with the original data. Select the raw data matrix in
the project navigator (Tutorial_C) and make a line plot of the calibration samples for the
absorbance values. You see that the MSCorrected data are different from the original. The
interference and light scatter effects have successfully been corrected for. You can display
the plots on the same screen by going to the navigation bar at the bottom of the screen and
right clicking to select Pop out to give an undocked plot of the MSC Corrected data that can
be moved around as you wish.
Pop out menu
You can then choose the line plot of the uncorrected data from the navigation bar, making it
active in the Viewer, and move the other window to the same view for easier comparison.
Line plots of the MSC corrected and the original data
Another way to get a view of both plots together is to go to Insert-Custom Layout - Two
Horizontal… and select the two samples matrices, selecting the calibration samples for rows,
absorbance for columns, and setting the plots to be line plots in the custom layout dialog.
You can also give a title for each plot as shown below.
Custom layout dialog
View the same plot for the model PLS MSCorrected by going to the PLS Overview plot of the
MSC corrected data (which should still be an open tab in the navigator bar at the bottom of
the viewer). Highlight the lower left quadrant, the explained variance plot, and change the
view to the residual Y variance plot by using the toolbar shortcuts,
selecting Y and Res, for just the validation samples.
Y Residual validation variance: MSC Corrected data
The plot shows the validated residual Y-variance for the two models. From these plots
we find that the minimum squared error is lower for the MSC corrected model with two
factors (1.87). So although the recommended optimal number of factors is four, even with
two factors we can model the system well (more of the Y-variance is explained by two
factors than when using the raw data; see the scores plot). The system can be modeled well
with the MSC corrected data, whereas with the raw data a much higher error is obtained,
and less of the Y-variance is explained with two factors. This shows that MSC has removed
the interfering amplification effect in these data.
The model Tutorial C MSC corrected with four factors gives the lowest estimate for the
residual Y-variance, so predictions made by this model using four factors will give the
predicted values with the lowest prediction error. We could also model this system well
enough with two factors, but as we do not have information here on the error of the
reference method for measuring the dye level, we will follow the model’s suggestion of
four factors.
Check the error in original units: RMSE
The numerical residual variance values we used in order to find the best model and decide
the optimal number of factors in the model are not related directly to the predictions. We
cannot use the residual variance to tell how large we can expect the deviations in future
predictions. We have to use the RMSEP for that purpose.
Task
Let us see how large an error in ml dye we can expect in future predictions: RMSEP.
How to do it
Activate the regression overview plot for the model PLS-MSCorrected. Select Plot - Variance
and RMSEP - RMSE
Deselect the calibration samples box and select the validation samples (RMSEP) instead from
the shortcut keys.
You see that the shape of the curve is exactly that of the residual variance, but the values
have changed. The plot says that predictions done with this model and using four factors will
have an average prediction error of 0.9.
RMSE: MSC Corrected data
Predict new MSCorrected samples
The model with MSC is the one we will use for the prediction of new samples.
Run a prediction with automatic pretreatment
The prediction samples will be transformed automatically with the same MSC model as the
calibration samples. This requires that the variables selected for the data matrix include
the same number of variables as are associated with the MSC model. This must be set
correctly in the Prediction dialog.
Task
Predict the dye level of the unknown samples.
How to do it
Select Tasks - Predict- Regression…. Specify the following parameters in the Prediction
dialog:
Model name: “PLS MSCorrected”
Number of Components: “4”
Click View after the prediction is done. The prediction overview plot appears where the
predicted values are shown together with the deviations. A new node, Predict, has been
added to the project navigator. This has folders for raw data, validation, and plots. The
prediction overview shows a plot of the predicted values with their estimated uncertainties,
and also a table of the values with these deviations.
Predicted values with deviation
Large deviations indicate that the predictions cannot be trusted. For a prediction to be
trusted, the predicted sample must not be too far from a calibration sample; this is checked
by the Inlier distance. Its projection onto the model should also not be too far from the
center; this is checked with the Hotelling’s T² distance.
Study the Inlier vs. Hotelling’s T² plot available from a right click on the plot and then
Prediction - Inlier/Hotelling’s T² - Inliers vs. Hotelling’s T²
Inliers vs. Hotelling’s T²
In this case all the samples are found below the Inlier distance limit, showing that these
samples are similar to those used in making the model. One sample is outside the Hotelling’s
T² limit line (with 95% confidence) and is therefore an outlier. The prediction for this outlier
cannot be trusted.
Guidelines for calibration of spectroscopic data
Now that you have learned the basics of calibration, let us suggest steps and useful functions
for the development of calibration models.
See the guidelines for spectroscopic calibrations
Description
The global objective of this study is to develop a new processed cheese. The study is in a
screening stage to study the main effects and detect whether there are any interactions. The
experiments have been performed, and the responses have been measured. The response
values have been gathered into an Excel worksheet; they should now be imported into The
Unscrambler® as a response data table. The first step is to create the design and then import
the response variables. The next step, after importing the response values, will be to get
acquainted with the data and perform first checks such as descriptive statistics. Then a
proper analysis of effects will be run.
Data table
From a brainstorming session with the different experts in cheese production, six
continuous process and recipe parameters have been selected for a screening design.
Variable Low High
B: pH 5.7 6.1
Glossiness,
Ability to retain shape,
Adhesiveness,
Firmness,
Graininess,
Stickiness,
Meltability,
Condensed milk taste.
In the Design Experiment Wizard, on the first tab, Start, type a name for the table, for
example “Cheese”. Select the Goal, which for now is Screening. It is possible to type
information in the Information section.
Start tab filled
B pH Design None Continuous 5.7 - 6.1
After all design variables have been defined, go to the next tab Choose the Design, to select
the appropriate design.
By default, in the Beginner mode, the selected design is “Screening of many design
variables” which refers to a Fractional factorial design as can be seen in the box below the
Design section.
This design corresponds to the goal of the experimentation so no change is needed.
The Design Wizard - Choose the design tab
The 2-variable interactions are confounded two by two. This is going to limit the study and
the conclusions, but in a screening stage this is acceptable.
The Design Wizard - Design Details tab
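The confounding pattern can be worked out by hand or with a short script. The sketch below assumes the generators E = ABC and F = BCD for a 2^(6-2) design; the actual generators of the chosen design may differ, and with them the alias pairs:

```python
# Sketch of two-factor-interaction aliasing in a 2^(6-2) fractional
# factorial with assumed generators E = ABC and F = BCD. The defining
# contrast subgroup (besides I) is then ABCE, BCDF, and their product ADEF.
from itertools import combinations

words = [frozenset("ABCE"), frozenset("BCDF"), frozenset("ADEF")]

def aliases(effect):
    """Two-factor interactions confounded with the given effect."""
    eff = frozenset(effect)
    # Multiplying two effects corresponds to the symmetric difference
    # of their factor sets; keep only the two-factor results.
    return sorted("".join(sorted(eff ^ w)) for w in words if len(eff ^ w) == 2)

for pair in combinations("ABCDEF", 2):
    name = "".join(pair)
    print(name, "=", " = ".join(aliases(name)))
```

Under these assumed generators most interactions alias in pairs (e.g. AB with CE), which is what limits the conclusions at the screening stage.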
Proceed to the next tab, Randomization. There is no need to make any further specification
in this tab. Try different options just to get familiar with the possibilities.
The Design Wizard - Randomization tab
Delta: the difference to detect. In this example we want to know which type of
difference is likely to be detected.
Std. dev.: estimated standard deviation. In this example the sensory parameters
have a standard deviation of about 0.4.
Go to the final tab: Design Table. Here the data table is presented with several view options.
Check them out to become familiar with the options.
The Design Wizard - Design table tab
How to do it
First, import the response values.
Click on the following link Tutorial D1 responses.
A data table containing all the response variables is now added to the project as an
additional matrix. Note that the data are in standard order.
Copy and paste the response data into the appropriate columns of the matrix CheeseDesign.
Make sure the rows are sorted in experimental order.
Sample Standard order
(1) 1
ae 2
bef 3
abf 4
cef 5
acf 6
bc 7
abce 8
df 9
adef 10
bde 11
abd 12
cde 13
acd 14
bcd 15
abdcef 16
cp01 17
cp02 18
cp03 19
Before the full analysis, we will familiarize ourselves with the data. Go to Tasks - Analyze -
Descriptive Statistics. Choose all the rows and the Responses column set for columns, and then
click OK to compute the statistics. Review the results, and note that some of the responses
(Retain Shape and Stickiness) have some extreme values as noted in the quantiles plot. On
careful investigation, it appears there is an error on the response “stickiness” for one
sample. It should read 2.93, and not 12.93. Correct this value before proceeding with the
analysis.
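The way the quantiles expose such a typo can be sketched as follows. Only the 12.93 comes from the tutorial; the surrounding sample values are hypothetical:

```python
import numpy as np

# Illustrative "Stickiness" values containing the typo from the tutorial
# (12.93 entered instead of 2.93); the other values are made up.
stickiness = np.array([2.10, 2.50, 3.00, 2.80, 12.93, 2.60, 2.90])
quantiles = np.quantile(stickiness, [0.0, 0.25, 0.5, 0.75, 1.0])
print(quantiles[-1])   # the maximum exposes the extreme value
```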
To start the analysis, choose Tasks - Analyze - Analyze Design Matrix….
Model inputs
Predictors
In the Predictors part set the X matrix to be “Cheese_Design”, Rows “All” and the
Cols “Design(6)”.
Responses
For the Responses set the Matrix to be “Cheese_Design”, Rows “All” and the Cols
“Response(8)”.
Model
The Model should include the “Main effects + Interactions (2-var)”.
The list of estimated effects should be “A, B, C, D, E, F, AB, AC, BC, AD, BD, CD, DE”.
Note: Not all the interactions are presented. Remember that there is a confounding
pattern.
In the Method dialog select the Classical DoE analysis and click OK.
Method dialog
When the computations are done, click Yes to study the results. A new node called DOE
Analysis is added into the navigator. Before doing anything else, use File - Save As to save
the project with a name such as “Cheese Project”.
How to do it
The ANOVA Overview plot shows four informative plots:
ANOVA table
Look at the Summary section of the ANOVA table to check the significance of the
models for all the response variables. We say that a model is significant at the 5%
level if the p-value is smaller than 0.05. This is true for response variables
“RetainShape” (0.0136) and “Firmness” (0.0213), while “Meltability” is just over
(0.0524). Always check the validity of the model by assessing the R-square
prediction value. This is an estimate of how well the model will work for new
(currently unknown) data. As the value is negative for “Meltability”, this particular
model cannot be trusted. For “RetainShape” and “Firmness” the values are higher
(around 0.5), which is not necessarily bad, but caution is required.
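The 5% significance check described above amounts to a simple filter on the quoted p-values:

```python
# Flag significant models at the 5% level, using the ANOVA summary
# p-values quoted in the tutorial.
p_values = {"RetainShape": 0.0136, "Firmness": 0.0213, "Meltability": 0.0524}
significant = [name for name, p in p_values.items() if p < 0.05]
print(significant)
```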
For these three responses find out which effects are important by looking at the
Variables section. Again, the significant effects are the ones with a p-value less than
0.05. They are in shades of green. For “RetainShape” the main effects B(pH),
C(DM%), and D(Maturity) and the interaction effect BC=AE are found significant at
the 5% level. For “Firmness” the same effects are found significant except B(pH).
ANOVA table
Note: The interaction effect BC=AE is a possible significant effect. Checking the effect value
or the B-coefficient should help to determine whether it is significant or not.
The effect viewer
Look at the effects for the response “RetainShape” and check for curvature. See whether
the center sample average lies such that the averages at the low and high levels are
linked by a linear relation. If this is the case there is no curvature effect. Use the
arrows on the toolbar to scroll through the effects for the different variables.
Here a curvature effect can be found on all effects.
Effect of Maturity (D) on “RetainShape”
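The curvature check can be sketched numerically: compare the center-point average with the overall factorial average. All values below are hypothetical:

```python
# Curvature check sketch: compare the center-point average with the average
# of the factorial runs; a clear difference suggests curvature.
# All values are hypothetical.
factorial_runs = [5.2, 6.8, 4.9, 7.1]    # low/high level responses
center_points = [7.4, 7.6, 7.5]
curvature = sum(center_points) / len(center_points) - sum(factorial_runs) / len(factorial_runs)
print(round(curvature, 2))  # far from 0 suggests curvature
```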
In addition the study of the interaction effects shows that the interaction effect of
B*C is the most probable, as the effects A and E are not significant.
The diagnostics
Look at the residuals to see if the model fits the samples well. The table is presented
in the experimental (randomized) order, which makes it possible to check for any
deviation with time.
Diagnostics for “RetainShape”
Note that the first center sample has a very high residual. However, center samples
are not taken into account when calculating the effects.
The summary table
See which effect is the most important (size) and the most significant (smallest p-
value) for all variables.
Go through the other plots and check the plot interpretation in the DOE section.
Draw a conclusion from the screening design
The final conclusions of the screening experiments are the following:
Not all sensory variables are affected by the changes in the design. Only three are in
fact affected and “RetainShape” is the variable showing the most interesting
behavior.
Four effects were found likely to be significant for “RetainShape”, one of them
a confounded interaction. Since the main effects of B and C are significant, we can
make an educated guess and assume that the significant interaction is BC (and not
AE, with which it is confounded).
Description
What you will learn
Build an optimization design
Compute the response surface
Run a response surface analysis
Interpret analysis of variance results
Check the residuals
Interpret the response surface plots
Draw a conclusion from the optimization design
Description
This tutorial is built from the enamine synthesis example published by R. Carlson in his book
“Design and Optimization in Organic Synthesis”, Elsevier, 1992.
A standard method for the synthesis of enamine from a ketone gave some problems, and a
modified procedure was investigated. A first series of experiments gave two important
results:
Reaction time can be shortened considerably.
The optimal operational conditions were highly dependent on the structure of the
original ketone.
Thus, a new investigation had to be conducted to study the specific case of the formation of
morpholine enamine from methyl isobutyl ketone. Two factors may have an impact on this
reaction: the relative amounts of the two reagents.
In the Design Experiment Wizard, on the first tab Start, type a name for the table, for
example “Enamine_Opt”. Select the Goal, which for now is Optimization. It is possible to type
information in the Information section.
Do this by clicking the Add button and filling in the Variable editor. Validate by clicking OK
and enter the next variable by clicking Add again.
Define variables tab
In the next section Design Details, four options are proposed. Look at the bottom table to
see the differences between the different designs and their performance.
As it is possible to do experiments outside the selected range, the Circumscribed
Central Composite (CCC) design is chosen. Check the value of the star point distance to the
center: it should be 1.414 for two design variables.
Design Details tab
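The star point distance quoted above comes from the rotatability condition for central composite designs, alpha = (2^k)^(1/4) where k is the number of design variables; a one-line check:

```python
# Star-point (axial) distance for a rotatable central composite design:
# alpha = (number of factorial runs) ** (1/4) = (2 ** k) ** 0.25.
def star_distance(k):
    return (2 ** k) ** 0.25

print(round(star_distance(2), 3))   # sqrt(2) for two design variables
```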
In the Summary tab check that the design includes a total of 13 experiments. Otherwise, go
back to the appropriate tab and make the necessary corrections.
Summary tab
Go to the Design Table tab, and display the experiment in different views.
Design Table tab
Sample Yield
Cube1 73.4
Cube2 69.7
Cube3 88.7
Cube4 98.7
Axial_A(low) 76.8
Axial_A(high) 84.9
Axial_B(low) 56.6
Axial_B(high) 81.3
cp01 96.4
cp02 96.8
cp03 87.5
cp04 96.1
cp05 90.5
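As a sketch of what the response surface analysis computes (an illustration, not CAMO's implementation), the coded CCC design and the yields above can be fit by ordinary least squares and the stationary point located:

```python
import numpy as np

# Coded CCC design (alpha = sqrt(2)) in run order, with the observed yields
# from the design table above. A sketch, not CAMO's implementation.
a = np.sqrt(2.0)
x1 = np.array([-1, 1, -1, 1, -a, a, 0, 0, 0, 0, 0, 0, 0])  # TiCl4, coded
x2 = np.array([-1, -1, 1, 1, 0, 0, -a, a, 0, 0, 0, 0, 0])  # Morpholine, coded
y = np.array([73.4, 69.7, 88.7, 98.7, 76.8, 84.9, 56.6, 81.3,
              96.4, 96.8, 87.5, 96.1, 90.5])

# Full quadratic model: b0 + b1*x1 + b2*x2 + b11*x1^2 + b22*x2^2 + b12*x1*x2
X = np.column_stack([np.ones_like(y), x1, x2, x1**2, x2**2, x1 * x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2, b11, b22, b12 = b

# Stationary point (the fitted optimum): solve the zero-gradient system
H = np.array([[2 * b11, b12], [b12, 2 * b22]])
s1, s2 = np.linalg.solve(H, [-b1, -b2])
y_max = b0 + b1*s1 + b2*s2 + b11*s1**2 + b22*s2**2 + b12*s1*s2
print(round(y_max, 1))  # predicted maximum yield, in coded units
```

Both quadratic coefficients come out negative, so the surface has a maximum, consistent with the optimum discussed in the analysis below.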
Choose Tasks – Analyze – Analyze Design Matrix….
Go to the Model Inputs tab.
In the dialog box, make the following selections:
Model inputs
The Summary shows that the model is globally significant, so it is possible to go on with the
interpretation.
The ANOVA table for variables displays the p-values for each effect. The most
significant coefficients are the linear and quadratic effects of Morpholine. The TiCl4 effects
look less important, although the square term may be significant (p-value = 0.07);
the interaction is more doubtful.
The Quality section tells about the quality of the fit of the response surface model: R-square
for the calibration and prediction are very good.
In the Results node in the project navigator, check the tables Model check and Lack of fit.
The Model Check indicates that the quadratic part of the model is significant, which shows
that the interaction and square effects included in the model are useful.
The Lack of Fit section shows that, with a p-value greater than 0.05, there is no significant lack
of fit in the model. Thus the model can be trusted to describe the response surface
adequately.
Go to the predefined plot Residuals overview, found in the Plots folder in the project
navigator.
Start with the Normal Probability plot of the residuals. This plot can be used to detect
outliers. Here the residuals form two groups (positive residuals and negative ones). Apart
from that, they lie roughly along a straight line, with one extreme residual,
“cp03”. This may be an outlier.
Normal Probability plot of the residuals
In the residuals plot, all values are within the (-6;+6) range. There is no clear pattern in the
residuals, so nothing seems to be wrong with the model.
Look at the bottom right plot, Y-residuals in experimental order, and check whether there is
a bias with time. Look at the residuals of the 5 center samples.
The center samples show quite some variation. This is why so few effects in the model are
very significant. There is quite a large amount of experimental variability.
Move the mouse over the surface to see the coordinates and the corresponding yield.
It is also possible to see it as a 3-D plot. To do so click on the surface and hold while moving
the mouse to rotate the view of the surface.
Response surface as a 3-D plot
Move the mouse over the surface to see the coordinates and the corresponding yield.
Inspect various points in the neighborhood of the optimum to see how fast the predicted
values decrease. Notice that the top of the surface is rather flat, but that further away
the yield decreases more steeply.
In this example there are only two variables so it is not necessary to use the generator table
below the response surface to change the view.
Finally, notice that the predicted maximum value, found in the table below the plot, is
smaller than several of the actually observed Yield values (sample Cube4, for instance,
has a Yield of 98.7). This is not paradoxical, since the model smooths the observed values.
Those high observed values might not be reproduced when the same experiments are
performed again.
Draw a conclusion from the optimization design
The analysis gave a significant model, in which the quadratic part in particular was
significant, thus justifying the optimization experiments.
Since there was no apparent lack of fit, no outliers, and the residuals showed no clear
pattern, the model could be considered valid and its results interpreted more thoroughly.
The values of the b-coefficients and their significance indicate that the most significant
coefficients are the linear and quadratic effects of Morpholine; the quadratic effect of TiCl4
is close to the 0.05 significance level.
The response surface showed an optimum predicted Yield of 96.815 for TiCl4=0.8250 and
Morpholine=6.555. The predicted Yield is larger than 95 in the neighboring area, so that
even small deviations from the optimal settings of the two variables will give quite
acceptable results.
Description
What you will learn
Data table
Reformat the data table
Graphical clustering
Graphical clustering based on hierarchical clustering
Graphical clustering based on scores plots
Make class models
Classify unknown samples
Interpretation of classification results
Diagnosing the classification model
Description
The data to be classified in this tutorial are taken from the classical paper by Fisher
(R.A. Fisher, “The use of multiple measurements in taxonomic problems”, Ann. Eugenics, 7,
179–188 (1936)). The task is to see whether three different types of iris flowers can be
classified by four measurements made on them: the length and width of the sepal and petal.
Data table
Click the following link to import the Tutorial E data set used in this tutorial.
The data contains 75 training (calibration) samples and 75 testing (validation) samples.
The training samples are divided into three Row (Sample) ranges, each containing 25
samples. The three sets are: Setosa, Versicolor, and Virginica. The row set Testing will later
be used to test the classification.
Four variables are measured: Sepal length, Sepal width, Petal length, and Petal width. The
measurements are given in centimeters. These four variables are collectively defined as the
column set Iris properties.
Now a new column, “Iris type”, has been created, containing the appropriate class value for
each sample.
Data table with category variable “Iris type”
Graphical clustering
It is always a good idea to start a classification with some exploratory data analysis. You can
run a PCA model and/or hierarchical clustering of all samples. If you do not know the classes
in advance, this is a way of visualizing if there is clustering. The calibration samples must be
assigned to the different classes to give a sense of whether a classification model can be
developed.
Matrix: Tutorial_E
Rows: Calibration
Columns: Iris properties
Number of clusters: 3
Clustering method: Hierarchical Complete-linkage
Distance measure: Squared Euclidean.
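To make the complete-linkage idea concrete, here is a minimal, naive sketch with squared Euclidean distances on made-up 2-D data. Real analyses should use the software or a library implementation; this is only to show the mechanics:

```python
import numpy as np

def complete_linkage(X, k):
    """Naive agglomerative clustering with complete linkage and
    squared Euclidean distance; merges until k clusters remain."""
    clusters = [[i] for i in range(len(X))]
    # pairwise squared Euclidean distances between samples
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    while len(clusters) > k:
        best = (None, None, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete linkage: cluster-to-cluster distance is the
                # maximum sample-to-sample distance
                d = D[np.ix_(clusters[i], clusters[j])].max()
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        clusters[i] += clusters.pop(j)
    return clusters

# Tiny illustrative data: two tight groups and one distant point
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [20.0, 0.0]])
print(complete_linkage(X, 3))
```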
In the options tab, you can assign samples to the initial clusters, but for this exercise, we will
make a completely unsupervised cluster analysis.
Click OK for the Cluster analysis to run.
When the clustering is complete, a dialog asking if you want to view the plots will appear.
Click Yes.
The Dendrogram showing the clustering of samples will be displayed. Notice that three
clusters are identified, but they are not all of equal size. All the results are in a new Cluster
analysis node in the project.
Dendrogram: Complete-linkage squared Euclidean distance
Open the Results folder for the cluster analysis, and expand the levels so that you see the
different row sets; one has been defined for each cluster.
Cluster analysis results in project navigator view
By looking at the row sets, one can see that the Setosa samples are all assigned to one
cluster, and that there is a small cluster that contains only Virginica samples, but a larger
group has a mix of both Virginica and Versicolor samples. These results suggest that based
on the four variables provided for these irises, an unambiguous classification may be
difficult.
Matrix: Tutorial_E
Rows: Calibration
Columns: Iris properties
Maximum components: 4
Keep the default ticks in the boxes Mean center data and Identify outliers.
Weights
On the weights tab, select all the variables by highlighting them, and set the weight
by selecting the correct radio button.
Weights: 1/SDev
Click Update.
Validation
Proceed to the Validation tab to set the validation.
You can see the three groups in different colors; one very distinct (Setosa) and two that are
not so well separated (Versicolor and Virginica). This indicates that it may be difficult to
differentiate Versicolor from Virginica in an overall classification model.
Make class models
Before we classify new samples, each class must be described by a PCA model. These models
should be made independently of each other. This means that the number of components
must be determined for each model, outliers found and removed separately, etc.
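The core of this per-class modeling can be sketched in a few lines: fit one PCA model per class, then assign a new sample to the class with the smallest sample-to-model distance (Si). The data here are synthetic and the scaling of Si is simplified relative to the software's statistic:

```python
import numpy as np

def fit_pca(X, n_comp):
    """Class model: mean plus leading principal directions (via SVD)."""
    mu = X.mean(0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:n_comp]

def si(x, model):
    """Sample-to-model distance: RMS residual after projection onto the
    class subspace (a simplified version of the Si statistic)."""
    mu, P = model
    r = (x - mu) - P.T @ (P @ (x - mu))
    return np.sqrt((r ** 2).mean())

# Hypothetical 2-class example: classify by smallest Si
rng = np.random.default_rng(0)
A = rng.normal([0, 0, 0], 0.1, (20, 3))
B = rng.normal([5, 5, 5], 0.1, (20, 3))
models = {"A": fit_pca(A, 1), "B": fit_pca(B, 1)}
new = np.array([5.0, 5.1, 4.9])
print(min(models, key=lambda c: si(new, models[c])))  # nearest class
```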
Task
Make PCA models for the three classes Setosa, Versicolor, and Virginica.
How to do it
Select Tasks - Analyze - Principal Component Analysis… and make the first PCA model for
Setosa with the following parameters:
Model Inputs
Matrix: Tutorial_E
Rows: Setosa
Cols: Iris properties
Maximum components: 4
Weights
1/SDev
Validation
Proceed to the Validation tab to set the validation.
Validation Method: Cross validation. Click Setup and choose Full from the
Cross validation method dropdown menu.
When the model is computed, view the plots. In the project navigator rename the PCA class
model to PCA Setosa by highlighting the new PCA node, right clicking and selecting
Rename.
Rename menu
Repeat the procedure successively on Row Sets Versicolor and Virginica, also renaming each
new PCA model.
Classify unknown samples
When the different class models have been made and new samples are collected, it is time
to assign them to the known classes. In our case the test samples are already in the data
table, ready to use.
Task
Assign the Sample Set Testing to the classes Setosa, Versicolor, and Virginica.
How to do it
Select Tasks - Predict - Classification - SIMCA….
Menu Tasks - Predict - Classification - SIMCA…
Matrix: Tutorial_E
Rows: Testing
Columns: Iris properties
Make sure that Centered Models is checked. Add the three PCA class models Setosa,
Versicolor, and Virginica.
SIMCA classification dialog
The suggested number of PCs to use is 3 for all models; keep that default (it is based on the
variance curve for each model).
Click OK to start the classification.
Interpretation of classification results
The classification results are displayed directly in a table, but you may also investigate the
classification model more closely in some plots.
Interpret the classification table
Task
Interpret the classification results displayed in the SIMCA results.
How to do it
Click View when the classification is finished.
A table plot is displayed, called Classification membership. There are three columns: one for
each class model.
Samples “recognized” as members of a class (they are within the limits on sample-to-model
distance and leverage) have a star in the corresponding column.
SIMCA classification table
The significance level can be changed with the Significance option on the menu bar.
At the 5% significance level, we can see that all but three samples (false negatives: virg1,
virg36, virg42) are recognized by their rightful class model.
However, some samples are classified as belonging to two classes (false positives): 12
Versicolor samples are also classified as Virginica, while 6 Virginica samples are also
classified as Versicolor. Only the Setosa samples are 100% correctly classified (no false
positives, no false negatives). This is an outcome we may have expected since a clear
separation of these two classes was not seen in the overall PCA model of the calibration
samples.
If you raise the significance level to 25%, the number of false positives is reduced, but
the number of false negatives increases (vers41 and virg35 come in addition).
Interpret the Coomans’ plot
If a sample is doubly classified, you should study both Si (sample-to-model distance) and Hi
(leverage) to find the best fit; at similar Si levels, the sample is probably closest to the model
to which it has the smallest Hi. The classification results are well displayed in the Coomans’
plot.
Task
Look at the Coomans’ plot.
How to do it
Under the SIMCA/Plots node choose the Coomans’ plot. You can change which classes it
displays using the toolbar; now set it for models Virginica and
Versicolor.
This plot displays the sample-to-model distance for each sample to two models. The newly
classified samples (from sample set Testing) are displayed in green color, while the
calibration samples for the two models are displayed in blue and red.
Coomans’ plot for Versicolor vs. Virginica
The Coomans’ plot for the classes Virginica and Versicolor shows that all Setosa samples are
far away from the Virginica model (they appear far to the right). However, we can see that
many Virginica and Versicolor samples are within the distance limits for both models. This
suggests some classification problems.
Interpret the Si vs. Hi plot
We also have to look at the distance from the model center to the projected location of the
sample, i.e. the leverage. This is done in the Si vs. Hi plot.
Task
Look at the Si vs. Hi plots.
How to do it
Under the SIMCA/Plots node choose the Si vs. Hi plot, and set it for the model Versicolor
using the arrows on the toolbar. Before you start interpreting the plot, turn on Sample
Grouping by right clicking in the plot window and selecting the Sample Grouping option. In
the sample grouping & marking dialog, select the row sets Setosa, Versicolor and Virginica.
The point labels can be changed to show just the first two characters of their name by right
clicking and selecting Properties. In the left list, select Point Label to get to the Point Label
dialog. Here one has the option to change the label name to just the first 2 characters of the
name. Select the radio button Name, and under the Label layout use the drop-down list for
show to select first, and in number of characters box enter 2, as shown in the dialog.
Point layout dialog
This then provides a plot which is much easier to interpret: the iris type appears clearly with
the initials Se, Ve, Vi in three different colors.
Si vs. Hi plot for the model Versicolor
Some Virginica samples are classified as belonging to the class Versicolor, but most samples
that are not Versicolor are outside the lower left quadrant. The reason for the difficult
classification between Versicolor and Virginica is that the samples are overlapping in the
scores plot. They are very similar with respect to the sepal and petal width.
This plot allows you to compare the different models. A model distance larger than three
indicates good class separation, and here the models are different.
It is clear from this plot that the Setosa model is different from the Versicolor, with a
distance close to 10, while the distance to Virginica is smaller.
Interpret discrimination power
Task
Look at the Discrimination Power plots.
How to do it
Under the SIMCA/Plots node choose the Discrimination Power plot. Using the arrows on the
toolbar, choose the discrimination power for Versicolor projected onto the Setosa model.
This plot tells which of the variables are most useful in describing the difference between
the two types of iris.
Discrimination power:Versicolor onto Setosa
We can see that variables sepal length and sepal width have high discrimination powers
between these classes, while it is lower for the petal length and width.
Do the same for Versicolor onto Virginica: all variables have discrimination powers around 3.
This is obviously not enough to completely discriminate these classes.
Interpret modeling power
Task
Look at the Modeling Power plots.
How to do it
From the plots choose the Modeling Power for Versicolor.
Variables with a modeling power near one are important for the model. A rule of thumb says
that variables with modeling power less than 0.3 are of little importance for the model.
Modeling power for Versicolor
The plot tells us that all variables have a modeling power larger than 0.3, which means that
all variables are important for describing the model. None of the variables should be deleted
from the modeling. The only chance to improve on the classification between Versicolor and
Virginica is to measure some additional variables.
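The 0.3 rule of thumb reduces to a simple threshold filter; the numeric values below are hypothetical, for illustration only:

```python
# Rule-of-thumb filter sketch: keep variables with modeling power >= 0.3.
# The numeric values are hypothetical, for illustration only.
modeling_power = {"Sepal length": 0.55, "Sepal width": 0.48,
                  "Petal length": 0.62, "Petal width": 0.51}
important = [v for v, mp in modeling_power.items() if mp >= 0.3]
print(important)
```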
In this exercise, it was found even from the initial exploratory analysis that the three types of
irises cannot be clearly distinguished based on the four measured variables. In the
dendrogram from clustering, as well as the global PCA, there was not a clear separation of
the Virginica and Versicolor class of irises. Nonetheless, a SIMCA classification was
attempted. With PCA-based classification by SIMCA, all Setosa samples could be properly
classified, while there were some ambiguities between the other two classes. It is
recommended that some other distinguishing feature be measured to enable a clean
classification of all three classes of these irises. The classification results provide many useful
model diagnostics to determine how similar the models are, and which variables are most
important in the modeling.
Description
What you will learn
Data table
Import spectra from an ASCII file
Import responses from Excel
Create a category variable
Append a variable to the data set
Organizing the data
Study the data before modeling
Plot spectral data
Basic statistics on data
Make a PLS Model
Interpretation of the Regression Overview
Customizing plots and copying them into other programs
Save PLS model file
Export ASCII-MOD file
Export data to ASCII file
Description
It is not uncommon to use The Unscrambler® together with other programs in one’s daily
work. This could be a word processor used to document the latest work, or instrument software.
This tutorial shows some of the capabilities The Unscrambler® has to interact with other
programs under the Windows operating system. The main focus here is how The
Unscrambler® is used in conjunction with other software.
Data table
The data are NIR spectra of wheat samples collected at a mill. Fifty-five samples were
collected and their NIR spectra measured on an instrument using 20 channels.
The water content of wheat samples was measured by a reference method and is the
response variable in the data. These values are stored in a separate file.
Click the following links to save the data files to be used in this tutorial:
Click OK to import the file and the data are read into The Unscrambler®, creating a data
table called “Tutorial_F” in the project.
Import responses from Excel
Spreadsheet applications are commonly used for storing data. It is easy to transfer data
between such a program and The Unscrambler®. The water content of the wheat samples is
stored in an Excel file together with the sample names.
Task
Import the water values from the Excel data file “Tutorial_F_responses.xls” into the existing
data table.
How to do it
There are two procedures. Use procedure 1 if you have Microsoft Excel or another
spreadsheet application installed on your computer or procedure 2 if you do not have a
spreadsheet program that can read the file “Tutorial_F_responses.xls”. You only need to
follow one of the procedures.
We will begin by appending a column to the existing data table. Put the cursor in the data
viewer and select Edit – Append, and in the dialog, enter 1 to add a single column.
Copy and paste from Excel
Launch Microsoft Excel and open the file “Tutorial_F_responses.xls” located in the
‘Data’ folder in your Unscrambler directory. Copy the values from the column water,
and paste them into the empty column that you appended in data matrix “Tutorial
F”.
Alternatively, follow this link Tutorial_F_responses.xls to open the spreadsheet
containing the responses
Import data from the Excel file
From File – Import data – Excel…, select “Tutorial_F_responses.xls” from the ‘Data’
folder in your Unscrambler directory and click Import.
Alternatively, click the following link to import the responses from
Tutorial_F_responses.xls directly.
In the project navigator you will find the two data matrices which you imported from the
ASCII and Excel files, respectively. Rename the matrices by selecting them, right clicking and
choosing Rename; name them Wheat NIR Spectra and Water content.
Data matrices in the Navigator
We could leave the response Y values (water content) in a separate matrix and do the
analysis from these two matrices. But for consistency of data organization in this exercise,
we will copy the values from the Water content matrix into the empty column (21) that we
appended to the data matrix “Wheat NIR Spectra”.
Create a category variable
Category variables are useful to calculate statistics and to use in plot interpretation.
Task
Insert a variable to group the samples into three categories, depending on the water content
level.
How to do it
Place the cursor in the first column and select Edit – Insert… and insert one empty column.
Then use copy (Ctrl+C) - paste (Ctrl+V) to copy the water content data into the new column.
Rename the column as “Water levels”.
Then select the “Water levels” column and go to the menu Edit – Change Data Type and
select Category.
Edit – Change Data Type - Category menu
The category converter dialog appears. Select the option New levels based upon ranges of
values.
Add three levels by entering 3 for the Desired number of levels, and specify the following
ranges manually:
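What the category converter does can be sketched as a binning operation. The water values and cut points below are hypothetical stand-ins for the actual ranges entered in the dialog:

```python
import numpy as np

# Sketch of turning a continuous variable into a 3-level category.
# The water values and the cut points below are hypothetical; the real
# ranges are the ones entered in the category converter dialog.
water = np.array([11.8, 13.2, 14.9, 12.5, 16.1])
cut_points = [13.0, 15.0]
labels = np.array(["low", "medium", "high"])
levels = labels[np.digitize(water, cut_points)]
print([str(v) for v in levels])
```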
The column of the category values is orange to distinguish this kind of variable from the
ordinary ones.
Data after insertion of a category variable
Do the same to define the column range for “Water levels” in column 1, and “NIR Spectra” in
columns 2-21.
The list of defined data ranges is found in the project navigator as nodes under the data
matrix.
Project navigator with data sets defined
NIR spectrum. There is now a new entry in the project navigator for the Line plot. You can
rename this by right clicking and choosing Rename.
Line Plot of Spectral Data
We can also compute the statistics without the plot by going to Tasks - Analyze - Descriptive
Statistics…. In the dialog, select all the rows, and the column “Water” and click OK. When
the computation is complete, click Yes to see the plots. A quantile plot and a mean with
standard deviation plot are displayed. If you had more than one variable, the plots would
show results for all the variables. A new node has been added to the project navigator,
“Descriptive Statistics”. This has subfolders containing the raw data, results, and plots of the
statistical analysis. Expand the folder “Results” and select the matrix “Statistics” to see the
numerical results.
Statistics on water content
If not already done, check the boxes Mean center data and Identify outliers.
Go to the X weights and Y weights tabs to verify that these are all set to 1.0 (the default
setting). On the Validation tab, select Cross validation.
PLS Dialog
The Scores plot shows that the samples are scattered in the model space with no evidence
of groupings, and that the first two factors explain 92% and 8% of the variance in the data,
respectively. The Explained X-variance increases nicely and is close to 100% after two factors
(PCs). The Predicted vs. Reference plot looks OK and the fit is quite good. The info box in the
lower left panel of the display indicates that two factors are optimal for this model.
Another very useful plot is that of the regression coefficients. Activate the upper-right quadrant
and right click to go to PLS-Regression coefficients - Raw coefficients (B) - Line. From the
regression coefficients one can see that there is a distinct peak around 1940 nm, as expected,
since this is where the water absorbance peak is located in the NIR spectrum.
Raw Regression Coefficients
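For readers curious about where such coefficients come from, here is a minimal PLS1 (NIPALS) sketch that produces regression coefficients from centered data. It is an illustration on synthetic data, not The Unscrambler's implementation:

```python
import numpy as np

def pls1(X, y, n_comp):
    """Minimal PLS1 (NIPALS) sketch: regression coefficients for
    centered data. An illustration, not The Unscrambler's algorithm."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    W, P, Q = [], [], []
    for _ in range(n_comp):
        w = X.T @ y
        w = w / np.linalg.norm(w)           # weight vector
        t = X @ w                           # scores
        p = X.T @ t / (t @ t)               # X loadings
        q = (y @ t) / (t @ t)               # y loading
        X = X - np.outer(t, p)              # deflate X
        y = y - t * q                       # deflate y
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    return W @ np.linalg.solve(P.T @ W, Q)  # b-coefficients

# Synthetic check: with all components, PLS recovers an exact linear model
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
beta = np.array([1.0, 0.0, 2.0, 0.0, 0.0])
y = X @ beta
b = pls1(X, y, 5)
```

With fewer components than variables, as in the wheat model, the coefficients become a regularized compromise driven by the dominant covariance directions.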
Save the project. All the results and plots that have been generated will be part of the saved
project.
Change the plot heading name, as well as the font used for it.
Annotations can be added to a plot by right clicking and selecting Insert Draw Item…, or
from the shortcut keys on the toolbar.
When the plot has been customized it can readily be saved or copied into another
application. Right click and select Copy to copy just the highlighted plot, or Copy All to
copy all four overview plots. Go to another program and place the cursor where the
plot is to appear in the document. Select Edit - Paste. The plot is now inserted as a graphical
object in the other document.
The plot can be saved as a picture file. The picture file option will usually give better quality
plots, but also larger files. Highlight a plot, and right click Save as… to save the plot in a
choice of graphics image file formats, such as EMF or PNG.
Save as options
Verify that the correct model is selected, and the correct number of factors. It is possible to
select two types of model:
Full: corresponding to the complete model
Regr.Coef. only: corresponding to only the regression coefficients
Take a look at the ASCII file that is generated, which has the file name extension .AMO. The
format of the file is described in the ASCII-MOD Technical Reference.
Export data to ASCII file
A common file format that most programs read is the simple ASCII file. There are different
ways of writing the ASCII file. Determine the format needed based on the requirements of
other programs that will be used to read the ASCII files.
Task
Write the Wheat NIR Spectra data table to an ASCII file.
How to do it
Select the Wheat NIR Spectra table and select File - Export - ASCII. Use only the columns of
the NIR Spectra, by choosing this column set from the drop-down list. Make sure that the
item delimiter is comma, as suggested in the Export ASCII dialog.
Export ASCII Dialog
Provide a file name and location when prompted. Open the file in an ASCII editor and look
at the file. All names are enclosed in double quotes.
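The same kind of comma-delimited output with quoted names can be reproduced in a few lines with Python's standard csv module. The table fragment below is made up for illustration; it merely mimics the shape of the exported file:

```python
import csv
import io

# Hypothetical fragment of the exported table: sample names plus two wavelengths
rows = [("Sample", "1100", "1102"),
        ("wheat_1", 0.412, 0.418),
        ("wheat_2", 0.399, 0.404)]

buf = io.StringIO()
# QUOTE_NONNUMERIC puts double quotes around every text field,
# as in the exported ASCII file, while numbers stay unquoted
writer = csv.writer(buf, delimiter=",", quoting=csv.QUOTE_NONNUMERIC)
writer.writerows(rows)
text = buf.getvalue()
```

Writing to a real file instead of `io.StringIO` works the same way with `open(path, "w", newline="")`.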
Description
What you will learn
Data table
Design variables and responses
Building a Simplex Centroid design
Import response values from Excel
Check response variations with statistics
Model the mixture response surface
Conclusions
Description
This tutorial is taken from an example presented in John A. Cornell’s reference book
“Experiments With Mixtures”, to illustrate the basic principles and applications of mixture
designs to a constrained system.
A beverage known as Fruit Punch is to be prepared by blending three types of fruit juice:
watermelon,
pineapple and
orange.
The financial driver for the manufacturer is to use up its large supplies of watermelons by
introducing the juice into its current blend of fruit juices. As watermelon juice is
relatively cheap compared to the other juices used, the final fruit punch blend should ideally
contain a substantial amount of watermelon - in this case specified as a minimum of 30% of
the total. Pineapple and orange juice have been selected as the other components of the
mixture, based on their availability and preference by most consumers.
To develop suitable blends for preference testing and cost analysis, the manufacturer used
experimental design, in this case, a special class of designs known as mixture designs.
References:
Mixture designs
Data import from a spreadsheet
Descriptive statistics
Analysis of mixture design results
Data table
The data in this exercise consist of two parts:
The design table, which will be created in the tutorial.
Measured responses: Sensory data: acceptance, sweetness, bitterness, fruitiness of
the juice as well the cost of production. We begin by setting up the design in The
Unscrambler®. Then you will import the response variables into the design table.
Design variables and responses
The ranges of variation selected for the experiment are as follows:
Ranges of variation for the fruit punch design
Ingredient   Low   High
Watermelon   30%   100%
Pineapple    0%    70%
Orange       0%    70%
The above constraints define what is known as a Simplex.
The responses of interest for the manufacturer are detailed in the table below.
Response              Description                                       Goal
Consumer acceptance   Average of 63 individual ratings on a 0-5 scale   Maximum
Sweetness             Average ratings by sensory panel on a 0-9 scale   Descriptive only
Bitterness            Average ratings by sensory panel on a 0-9 scale   Descriptive only
Fruitiness            Average ratings by sensory panel on a 0-9 scale   Descriptive only
Consumer acceptance is the response of primary interest. Should the analysis reveal two
responses of high consumer acceptance, the mixture with lower production cost will be
preferred. The sensory descriptors provide an explanation of the consumer acceptance
based on pre-specified properties. These provide possible directions for meeting consumer
expectations and their optimization usually leads to widely acceptable products.
Building a Simplex Centroid design
Since there are only three design variables (called components in the mixture case), setting
up an optimization design is a straightforward process. In this case, the chosen design is the
Simplex Centroid design as the points of this design allow you to investigate the importance
of the pure components, binary (two juice) blends and finally ternary (three component)
blends within the mixture space.
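The seven runs of a three-component Simplex Centroid design are easy to enumerate programmatically. The sketch below is independent of The Unscrambler® itself; it also maps the pseudocomponent points onto the constrained region, assuming the 30% lower bound on watermelon stated above:

```python
from itertools import combinations

def simplex_centroid(q):
    """All centroids of the (q-1)-simplex: pure blends, 50/50 binary
    blends, ..., up to the overall centroid (2**q - 1 points)."""
    points = []
    for k in range(1, q + 1):
        for idx in combinations(range(q), k):
            p = [0.0] * q
            for i in idx:
                p[i] = 1.0 / k
            points.append(tuple(p))
    return points

design = simplex_centroid(3)   # 7 runs: 3 pure, 3 binary, 1 centroid

# Map pseudocomponents to actual juice fractions, assuming the lower
# bounds from the tutorial text: watermelon >= 30%, others >= 0%
lower = (0.30, 0.0, 0.0)       # watermelon, pineapple, orange
span = 1.0 - sum(lower)        # 0.70 left to distribute
actual = [tuple(l + span * x for l, x in zip(lower, p)) for p in design]
```

Every actual blend still sums to 100%, and the centroid run contains 30% + 70%/3 of watermelon.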
Task
Build a Simplex Centroid design with the help of the design experiment wizard, by selecting
Insert – Create design….
How to do it
Use Insert – Create design… to start the Design Experiment Wizard. The first tab is the Start
tab, where you enter the name of the design and the goal of the experiment. It is also
possible to add additional information in the description field.
Enter “Punch” as a name for the design and select Optimization as the goal.
Start tab for the Punch experiment
Go to the next tab: Define variables. Specify the variables as shown in the following table:
Variables to define
ID Name Type Constraints Type of levels Level range
1 Acceptance Response - - -
2 Cost Response - - -
3 Sweet Response - - -
4 Bitter Response - - -
5 Fruity Response - - -
Do this by clicking the Add button and entering details into the Variable editor including the
level range for the design variables. Validate each component and response by clicking
OK.
Variables involved in the design
Go to the next tab: Additional Experiments. There is no need to replicate the design samples
so the Number of replications should be kept at its default value: “1”.
In this study, the centroid is to be replicated 3 times to provide an estimate of the model
error. The Simplex Centroid design contains the "centroid" by default; in this case,
select 3 to add a further 3 replicates of the centroid.
Additional experiments tab
Now proceed to the next tab, Randomization. There is no need to make any further
adjustments in this tab; however, try some re-randomizations just to get familiar with this
option.
Randomization tab
Next look at the Summary tab. The displayed table presents a summary of the information
in the design.
Summary tab
Go to the final tab Design Table. Here the data table is presented with several view options.
In this case, select the Display Order as Standard and leave the Design display mode as
Actual Values.
Design table tab for the fruit punch experiment
Once all necessary checks have been made, click the Finish button to generate the design
table in The Unscrambler® editor.
Now the designed data table appears in the Navigator. The design variables are given first,
followed by their interactions. The responses are given to the right of the interactions in the
same table. The response variables are empty and you need to fill in the responses obtained
for the experimental runs. The design matrix is organized into row and column sets
according to the types of samples (design, center, etc.) and effects.
The first part of the design table, including the mixture components
To change the order from the standard sample sequence to the experiment sample
sequence click on column randomized, and select Edit - Sort - Descending.
To change from the actual values to the level values click on the table and then View
- Level indices.
Save the new project with File - Save and specify a name such as “Punch Optimization”.
Import response values from Excel
The responses for this design are stored in a separate Excel spreadsheet, which can be
directly imported into the navigator and then copied and pasted into the response columns
of the Punch_Design matrix.
Task
Open the Excel table containing the response values and copy them into the response
columns of the design table.
How to do it
Go to File - Import Data - Excel…, select the Excel file “Tutorial_G.xls” (found in the “Data”
sub-directory under your Unscrambler installation folder) and click Open. Alternatively, click
the following link to open the Excel sheet to import the responses from Tutorial_G.xls
directly as a new matrix in the project.
If you are importing the Excel table, in the Excel Preview window, select the “Sheet1”, and
select the 5 responses:
Accept
Cost
Sweet
Bitter
Fruity
Excel Preview
Click on OK, and note that a new node “Tutorial_G.xls” is formed in the project navigator.
Look at the sample order of the imported data table. It is very important that the tables
"Punch_Design" and "Tutorial_G.xls" match in their order. If the "Punch_Design" table is not
given in standard order, you can highlight the Standard row header in the design table and
click Edit - Sort - Descending.
Select all the data in “Tutorial_G.xls” and copy them using right click and the option Copy or
with the shortcut Ctrl+C and paste them into the corresponding columns of “Punch_Design”.
To do so place the cursor in the first cell and use right click and the option Paste or the
shortcut Ctrl+V.
Imported response data
Click Yes to view the results. The results are displayed as two main plots. The upper plot is
the Quantiles plot, the lower the Mean and SDev plot.
Let us have a look at the upper plot: Quantiles.
If you have never interpreted a box-plot (or Quantiles plot) before, follow this link.
Right click on the plot and select View - Numerical View to display the min, max, median, Q1
and Q3 for the responses. Ensure all variations are within their expected ranges for the
responses (0-5 for Acceptance, 0-3 for Cost and 1-9 for the sensory responses on flavor).
Now display the same two plots for design samples and center samples, in order to compare
variation over the whole design to variation over the replicated Center samples. If the
experiments have been performed correctly, there should be much more variation among
design points than among the three replicates of the Centroid.
Return to the graphical view (View - Graphical view).
Right click on the plot and select Sample Grouping. A dialog box opens.
Select the sets Center samples and All design samples from the matrix Punch_Design.
Sample grouping and marking for the statistics
Note: It is possible to edit the color of the bars in the plot and set marker names.
Click OK.
To display the legend, click on the plot and then on the -icon in the toolbar.
Quantiles plot with sample grouping
The quantiles plot is now displayed separated into three groups. The boxes for all samples
appear in blue, for design samples in red and the center samples in green. From the
quantiles plot, you can see that there is much more variation between design points than
within the center samples.
Summary of Descriptive Statistics Analysis
The ranges of variation of the 5 responses are within their expected ranges.
There were no abnormal values observed for any response.
There is much more variation over the whole design than among the center samples, which
indicates that the experiments were performed correctly.
Model the mixture response surface
The next step after checking the quality of the data is to model the responses. By this we
mean that we want to study the quantitative relationships between fruit punch composition
and consumer acceptance, production cost and measured sensory properties.
Task
Analyze the design with a Response Surface analysis using a Scheffé model. View the results
and interpret them.
How to do it
Highlight the data table Punch_Design and run Tasks - Analyze - Analyze Design Matrix….
Make the following choices in the Design Analysis dialog:
Method
Classical
Model inputs
Predictors
Matrix: “Punch_Design (13x15)”
Rows: All
Cols: Design (10)
Model: Special cubic
Responses
Matrix: “Punch_Design (13x15)”
Rows: All
Cols: Response (5)
Design Analysis
Note: The Special Cubic model is used here as there are enough points in the Simplex
Centroid design to support the calculation of the three binary mixture interactions and the
ternary blend interaction present within the design. There are also degrees of freedom left
in the design to test the significance of the effects estimated.
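The Scheffé special cubic model is linear in its seven blending terms, so it can be fitted by ordinary least squares. A sketch with synthetic, noise-free data; the coefficient values and the 13-run layout below are made up for illustration:

```python
import numpy as np

def special_cubic_matrix(X):
    """Scheffé special-cubic terms for a 3-component mixture:
    x1, x2, x3, x1*x2, x1*x3, x2*x3, x1*x2*x3 (no intercept)."""
    x1, x2, x3 = X.T
    return np.column_stack([x1, x2, x3, x1 * x2, x1 * x3, x2 * x3, x1 * x2 * x3])

# Hypothetical illustration: recover known coefficients from noise-free data
rng = np.random.default_rng(0)
raw = rng.random((13, 3))
X = raw / raw.sum(axis=1, keepdims=True)   # 13 mixtures, each summing to 1
b_true = np.array([6.0, 4.0, 3.0, 2.0, 1.0, -1.0, 8.0])
y = special_cubic_matrix(X) @ b_true
b_hat, *_ = np.linalg.lstsq(special_cubic_matrix(X), y, rcond=None)
```

With real data the fit is of course not exact, and the significance of each term is judged from the ANOVA table as described above.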
Click OK, then Yes to view the model diagnostics and plots when the computation is
complete.
Diagnosing the model
ANOVA results
The ANOVA table provides the overall fit summary for a particular response. It is found in
the upper left quadrant of the DoE overview.
The first ANOVA table is for the response variable “Accept”.
ANOVA Punch
The first thing to look for is the p-value for the model: in this case it is 0.0085 and, since it is
smaller than 0.05, this suggests that the model is describing something other than noise.
The p-values for the binary and ternary blending terms (e.g. Watermelon x Pineapple)
are all significant. This indicates that the special cubic model fit may be justified.
Before analysing the ANOVA tables of the other responses, look at the Quality section of the
ANOVA table for the response Acceptance. This is shown below.
The R-Square value for the model is OK, however the Adjusted R-Square is much lower. This
may be indicating that the model is not a good predictor of future results. This is confirmed
by the negative R-Square Prediction value. Negative R-Square Prediction values indicate that
the mean is a better predictor of future data than the model is. Remember, validation is
always the key to good results.
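The quality statistics discussed here can be reproduced for any linear model. The sketch below assumes the common PRESS shortcut for leave-one-out residuals, and its adjusted R-square counts every model column as a parameter:

```python
import numpy as np

def fit_quality(X, y):
    """R-square, adjusted R-square and a PRESS-based R-square of prediction
    for a linear model y ~ X b (a sketch; every column of X is counted as a
    parameter). PRESS uses the leave-one-out shortcut e_i / (1 - h_ii)
    computed from the hat-matrix leverages."""
    n, p = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    h = np.diag(X @ np.linalg.pinv(X.T @ X) @ X.T)   # leverages h_ii
    press = np.sum((e / (1.0 - h)) ** 2)
    sstot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - np.sum(e ** 2) / sstot
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p)
    r2_pred = 1.0 - press / sstot   # negative: the mean predicts better than the model
    return r2, r2_adj, r2_pred
```

A negative `r2_pred` reproduces exactly the situation seen for Acceptance above.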
View the results for the other responses by using the drop-down menu or the arrows in the
menu bar.
A summary of the results is provided below:
Cost: The model p-value is highly significant for this response (p = 0.0007). Closer
inspection of the sums of squares indicates that a linear model is more applicable.
Sweetness: The model p-value is highly significant (p = 0.0000). The individual sum
of squares terms indicate that the special cubic is a good fit to the data.
Bitter: The model p-value is not significant (p = 0.1857). This suggests that Bitter
is not modelled well at all.
Fruity: The model p-value is highly significant (p = 0.0001). The Watermelon x
Pineapple binary blending is the most significant term in the model and the ternary
blend term is also significant. This indicates that the response is dependent on all of
the components in the blend.
Select the Error Table from the project navigator. This provides an overall summary of the
quality statistics for each response in one table.
Diagnostics
Examine the diagnostic table for Accept. Look for extreme residuals and note high values of
Cook's Distance. These statistics help to isolate outliers based on high leverage.
Diagnostics for response "Accept"
Response surface
Response surfaces are usually the key output desired in the mixture setting as they provide
the location of the "optimal" blend. The following image is the response surface obtained for
Acceptance.
Response surface for acceptance
The response surface shows that an acceptable blend can be achieved containing 55%
Watermelon juice. This exceeded the manufacturer's expectations and allows the excess
watermelon supplies to be used.
The diagram below presents the response surfaces that best model each response. The
desired optimized response is also shown in each figure. Bitterness has been omitted as it
was not modelled well.
The optima were chosen on the basis of acceptance and cost as primary responses and that
the sweetness should not be too high and the fruitiness is maximized.
Conclusions
The mixture design and analysis showed that suitable models could be developed for
Acceptance, Cost, Sweetness and Fruitiness. Bitterness could not be modelled well. The
response surface analysis showed that the four modelled responses could be optimized to
develop a blend that uses more than the minimum stated 30% watermelon juice. The best
formulation was achieved with 55% Watermelon juice, 24% Pineapple juice and 21% Orange
juice. This blend also minimised the usage of the highest cost orange juice.
PLS-DA is the use of PLS regression for discrimination or classification purposes. In The
Unscrambler® PLS-DA is not listed as a separate method. This tutorial explains how to do it.
Description
Running a PLS Discriminant Analysis
What you will learn
Data table
Build PLS regression model
Classify unknown samples
Some general comments on classification
Description
PLS Discriminant Analysis (PLS-DA) is a classification method based on modeling the
differences between several classes with PLS. If there are only two classes to separate, the
PLS model uses one response variable, which codes for class membership as follows: -1 for
members of one class, +1 for members of the other one.
If there are three classes or more, the model uses one response variable (-1/+1 or 0/1, which
is equivalent) coding for each class. There are then several Y-variables in the model.
In this tutorial we will analyze the chemical composition of spear heads excavated in the
African desert. 19 samples known to belong to two tribes (classes A and B) are used for
building a discriminant model, while seven new samples of unknown origin make up a test
set to be classified.
The X variables are 10 chemical elements characterizing the composition of the spear heads.
The 19 training samples are divided into 10 from class A and 9 from class B.
The normal way to make dummy variables for classes is to assign 1 if the sample belongs to
the class and 0 if not. A small trick to have a decision line of 0 and not 0.5 in the predicted vs.
reference plot is to use values -1 and 1, which gives an easier visualization.
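The -1/+1 coding trick can be sketched in a few lines. The labels below are a hypothetical stand-in for the 19 training samples, and the `margin` parameter is an optional extra for leaving ambiguous predictions unassigned, not something the tutorial itself uses:

```python
# Hypothetical stand-in for the 19 training spear heads: 10 from class A, 9 from class B
labels = ["A"] * 10 + ["B"] * 9

# Code class membership as +1 / -1, so the decision line in the
# predicted vs. reference plot sits at 0 rather than 0.5
y = [1.0 if c == "A" else -1.0 for c in labels]

def assign(pred, margin=0.0):
    """Assign a predicted value to a class; predictions within
    +/- margin of zero are left unassigned (returns None)."""
    if pred > margin:
        return "A"
    if pred < -margin:
        return "B"
    return None
```

A prediction near 0, like the ambiguous test samples later in this tutorial, then falls in neither class when a margin is used.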
Data table
Click the following link to import the Tutorial H data set used in this tutorial. The data have
already been organized for you into row sets, and with the class variable, as well as the
indicators for the classes.
Tutorial H data
Model inputs
X Weights
1/SDev
Y Weights
1/SDev
Validation
Full cross-validation
Set the weights on the X-weights and Y-weights tabs. Select all the variables, select the radio
button A/(SDev+B), and click Update. Do this for both the X and Y weights.
X weights dialog
To set the validation method, go to the Validation tab in the PLS Regression dialog. Select
Cross validation, and then click Setup… to get to the dialog to select full cross validation.
Select Full from the cross validation method drop-down list.
Cross Validation Dialog
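Full cross validation means that every sample is left out once and predicted by a model built on all the remaining samples. A sketch of the validation scheme, with ordinary least squares standing in for the PLS fit:

```python
import numpy as np

def loo_cv_rmse(X, y):
    """Full (leave-one-out) cross validation: each sample is predicted once
    by a model built on all the other samples. Ordinary least squares stands
    in here for the PLS fit, so this illustrates the validation scheme only."""
    n = len(y)
    errs = []
    for i in range(n):
        keep = np.arange(n) != i           # boolean mask excluding sample i
        b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errs.append(y[i] - X[i] @ b)       # prediction error for left-out sample
    return float(np.sqrt(np.mean(np.square(errs))))
```

The returned value corresponds to the validation RMSE that the software reports for the cross-validated model.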
After the computations are finished the default PLS regression plots will be shown. The
scores plot shows the separation of the two classes.
Scores plot
For better visualization of the classes you may use the sample grouping option. Right click in
the scores plot and select Sample Grouping from the menu.
In the Sample grouping dialog, select the row sets “A” and “B” for visualization. You can
double-click in the small boxes showing the colors to change to your preference.
The same goes for the symbols, and their size.
Sample Grouping Dialog
The scores plot shows that the two classes are well separated in the first two factors.
Scores plot with grouping
Thus, a discrimination line may be inserted in the plot with the line drawing tool in The
Unscrambler®.
Study the explained variance plot for Y shown in the lower-left quadrant. If need be, switch
it to the view for Y by using the X-Y button. The explained variance plot for Y shows
around 98% explained calibration variance and 94% explained validation variance for 2
factors. The red validation curve indicates that two factors is the optimal number, as there
is only a small increase in explained variance after factor two.
Note: Explained variance or RMSE is not the main figure of merit for PLS-DA,
however.
Variance plot
To interpret which variables are important for the classification, the loading weights plot is
the one to look into. This is given in the upper-right quadrant.
In this case the loadings express the same information as the loading weights, and since
correlation loadings show the explained variance directly, this is the preferred view. Make
the loadings plot active, and change it to the Correlation loadings view by selecting the
correlation loadings shortcut .
In the correlation loadings plot for factors one and two we see that Ba, Zr and Sr are the
variables that separate the two classes, as well as Ti, although with a slightly lower
discrimination ability. These are the variables closest to the response variable class, and lie
between the 50% and 100% explained-variance circles.
The remaining elements are mostly modeling the variance within the classes.
Correlation Loadings Plot
The regression vector is a summary of the important variables, in this case representing the
loading weights plot after 2 factors. In the project navigator, select the plot Regression
Coefficients, and change it to a bar chart by using the toolbar shortcut .
Weighted Regression Coefficients
Note that the blue points are from calibration where the samples are merely put back in the
same model they were a part of. The red points are from cross validation which is more
conservative as the sample was not a part of the model when it was predicted. You can
toggle on/off the regression line, trend line, and statistics for the plot using the toolbar
shortcut.
Recall that “prediction” in this context does not mean that the model has been tested by
predicting a real test set. In this case all samples are correctly classified for the cross
validation.
To investigate how the model will behave on unknown samples, the next section will show
how to predict unknown sample class.
It is a good idea to save your work so far. The project will include all the data, as well as all
the results generated thus far. Use File – Save… to save the project.
Classify unknown samples
Assign the unknown samples to the known classes by predicting (classifying) with the PLS
regression model.
Task
Assign the Sample Set Test to the classes A or B.
How to do it
Select Tasks - Predict - Regression….
Tasks - Predict - Regression…
Matrix: Tutorial H
Rows: Test
Cols: X
Prediction
Full Prediction
Inlier limit
Sample Inlier dist
Identify Outliers
Prediction Dialog
Click OK.
The predicted values are shown in the main plot of predicted values with estimated
uncertainties.
All F samples have predicted values close to -1, classifying these as belonging to class "B".
The E sample 2 has a predicted value around 1, which assigns it to class "A". As for E samples
1, 3 and 4, their predictions are close to 0 and have high uncertainties. It could be that these
cannot be said to belong to either class, because the estimated deviation (uncertainty)
around the prediction value includes 0 in the plot.
Predicted values and deviation
A small trick to present the results more clearly is to select Tasks - Predict - Projection and
select the PLS model from above. In the scores plot you see that all F samples lie in the
"B" class and E samples 2 and 3 probably belong to class "A", as discussed above. The
position of test samples 1 and 4 shows that they are in fact closer to class "A", as the
predicted values also indicate.
Note: Try to analyze the same data by doing PCA on the two groups and then select
Tasks - Predict - Classification - SIMCA and compare results with the PLS-DA.
To check if the prediction can be trusted, study the Inlier vs. Hotelling’s T² plot available from
a right click on the plot and then Prediction - Inlier/Hotelling's T² - Inliers vs. Hotelling's T².
Prediction - Inlier/Hotelling’s T² - Inliers vs. Hotelling’s T² menu
For a prediction to be trusted, the predicted sample must not be too far from a calibration
sample; this is checked by the Inlier distance. The projection of the sample onto the model
should also not be too far from the center; this is checked with the Hotelling's T² distance.
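Hotelling's T² can be recomputed from the model scores if desired. A sketch, assuming mean-centered scores and the usual normalization by the per-factor score variance:

```python
import numpy as np

def hotellings_t2(scores):
    """Hotelling's T-square per sample from mean-centered model scores:
    each factor's squared score divided by that factor's score variance,
    summed over the factors kept in the model."""
    var = scores.var(axis=0, ddof=1)       # score variance per factor
    return np.sum(scores ** 2 / var, axis=1)
```

Samples with a large T² sit far from the model center and, as noted above, their predictions should be treated with caution.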
Inliers vs. Hotelling’s T²
In this case the samples are found to be widely spread in the plot. If samples fall
outside the limit lines, the prediction cannot be trusted.
Some general comments on classification
LDA is the basic method that is typically taught in introductory classification courses and is
available as a reference method for comparison with other classification methods such as
SIMCA. Remember that LDA has the same issue with collinearity as MLR, and that more
samples than variables are required in each class. Using PLS regression for classification, as
in PLS-DA, can give very good results in discriminating between classes. In this
context it may also be useful to apply the uncertainty test after deciding on the model
dimensionality and remove the nonrelevant variables. This can in some cases improve
results both in simpler visualization and model performance. However, PLS-DA does not take
into account the within-class variability, and predicted values around 0 (assuming -1 and 1
are used as levels for the classes) are difficult to assign. One alternative procedure is to use
the scores from the PLS-DA in an LDA to have a more “statistical” result. As the score vectors
are orthogonal there is no problem with collinearity in this case.
Using local PCA models, which for historical reasons have been given the name "SIMCA", is a
good approach because it also makes it possible to assign new samples to none of the
existing classes. However, as there is no objective in the individual PCA models to
discriminate between the classes, one does not know whether the variance modeled is optimal
for this purpose. The Modeling and Discrimination Power diagnostics are helpful in this
context. One useful procedure is to first do PLS-DA and select the “best” set of variables for
discrimination. Then use these together with the most important variables in the individual
PCA models to obtain a variable set that models both the within-class and between-class variability.
SVM is a powerful method which can handle nonlinearities, and very good results have been
reported in the literature. However, it is not as transparent as PCA and PLS, and the choice of
values for input parameters must be decided from cross validation to ensure a robust model.
As for all methods, the proof of the method lies in the classification of a large independent
test set with known reference.
Description
What you will learn
Data table
Data plotting
Run MCR with default options
Plot MCR results
Interpret MCR results
Run MCR with initial guess
Validate the estimated results with reference information
View an MCR result matrix
Description
Multivariate Curve Resolution (MCR) attempts to recover the response profiles (spectra, pH
profiles, time profiles, elution profiles, etc.) of the components in an unresolved mixture of
two or more components. This is especially useful for mixtures obtained in evolutionary
processes and when no prior information is available about the nature and composition of
these mixtures.
The Unscrambler® MCR algorithm is based on pure variable selection from PCA loadings to
find the initial estimation of spectral profiles, and then Alternating Least Squares (ALS) to
optimize resolved spectral and concentration profiles.
The algorithm can apply a constraint of Non-negativity in either spectral or concentration
profiles or both.
It can also apply a constraint of Unimodality in concentration profiles that have only one
maximum, and/or a constraint of Closure in concentration profiles where the sum of the
mixture constituents is constant.
The Unscrambler® MCR functionality does not require any initial guess input. A mixture data
set suitable for MCR analysis should have at least four samples and four variables. If no
initial guess is used, the maximum number of variables is 5000.
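The ALS step can be sketched compactly. The following is a bare-bones illustration of alternating least squares with non-negativity imposed by clipping; it is not The Unscrambler®'s actual algorithm, which also derives its own initial spectral estimates from pure-variable selection:

```python
import numpy as np

def mcr_als(D, S0, n_iter=100):
    """Alternating least squares for D ~ C @ S.T, with non-negativity
    imposed on both the concentration profiles C and the spectral
    profiles S by simple clipping. S0 is an initial guess of the pure
    spectra (rows of D are mixture spectra, columns are wavelengths)."""
    S = S0.copy()
    for _ in range(n_iter):
        C = np.clip(D @ np.linalg.pinv(S.T), 0.0, None)    # concentrations, S fixed
        S = np.clip((np.linalg.pinv(C) @ D).T, 0.0, None)  # spectra, C fixed
    return C, S
```

With noise-free data and a good initial guess the factorization reproduces the mixture matrix essentially exactly; real data converge to a least-squares compromise.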
In this tutorial we will utilize UV-Vis spectra of dye mixtures to extract pure dye spectra and
their relative concentrations. The data are from the Institute of Applied Research (Prof. W.
Kessler), Reutlingen University, Germany.
Data table
Click the following link to import the Tutorial I data set used in this tutorial.
Organizing the data table
The samples consist of 39 spectra of dye mixture samples. Samples 1 to 3 are pure dyes of
blue, green and orange, respectively. Samples 4 to 39 are 36 mixture samples of those 3
dyes at known concentrations. The X variables are the UV-Vis spectra measured over the
range 250-800 nm with data at 10 nm increments. We will begin by organizing the data for
the analysis into row (sample) and column (variable) sets. The column sets have already
been defined for you, and are found in the folder Column in the project navigator. There are
5 column sets for the different variables of interest in the analysis, including the
concentrations of the three dyes, and two overlapping spectral ranges.
We begin by defining the row sets for these data. Select the entire first row in the data table,
Blue_50, and go to Edit – Define Range… to open the Define Range dialog box. In the dialog,
enter the name “Blue” in the Range row box and click OK.
Define Range Dialog
From the data table, select the sample Green_50, and go to Edit-Define Range to now make
this row set Green. Do the same for the sample Orange_50, and then for samples 4 to 39,
giving that row set the name Mixture. Additionally, create the row set Original by selecting
samples 1 to 3 and following the same procedure, Edit - Define Range…
The first three columns are concentration measurements of blue, green and orange dyes.
Columns 4 to 59 are UV-Vis spectra measured over the range 250-800 nm in steps of 10 nm. In the
project navigator expand the node Column to see the list of existing column sets. The
organized data will look like this in the navigator and viewer, with color-coding for the
defined set.
Navigator view of organized data
Data plotting
Before starting any analysis, it is a good idea to have a look at the data. We want to make a
line plot of the spectra of all mixture samples together. Go to the original data table and
highlight it in the navigator.
Use Plot - Line, which will open the Line plot dialog where the row set Mixture can be
selected from the drop-down list, and for Cols, the set 250-800nm. This will give an overlay
plot of the spectra.
Line plot of mixture spectra
We will now plot the reference spectra of the three pure components: select the row set
Original and the column set 250-800nm. Go to Plot - Line… and select these rows and
columns in the dialog.
Line plot dialog
This will result in the following plot, where we can see that the maximum absorbance for
each of the dyes is at a different wavelength. It is these component spectra that we expect
to be able to extract through the MCR analysis of the data in this tutorial.
Line plot of pure dyes
To plot the reference concentrations of the three dyes, select columns 1-3 and make a Line
plot of Sample set “Mixture” by right clicking and selecting Plot – Line.
Line plot of sample concentrations
When the MCR calculation is completed, a new node, named MCR, is added to the project
navigator and the MCR overview plots are displayed in the viewer. The MCR results overview
includes four plots, from upper-left to lower-right: Component Concentrations, Component
Spectra, Sample Residuals and Total Residuals. The results overview plots are displayed at
the optimum number of pure components, which the system estimates to be 3 in this case. The
optimal number of components (3) is displayed on the toolbar. A summary of the analysis
results is given in the Info tab in the lower left corner of the display, which also states the
optimal number of pure components.
MCR Info Box
The MCR model results are all together in the new node in the project navigator named
MCR. Rename the MCR model in the project navigator by highlighting the MCR node, right
clicking and choosing Rename. Rename your first MCR model as MCR Original.
Plot MCR results
Task
Plot MCR results for various numbers of pure components.
How to do it
The Unscrambler® MCR procedure actually generates several sets of results, covering
numbers of estimated pure components from 2 to the optimum +1. By default, the results are
plotted for the optimal number of components.
You may view the results for varying numbers of pure components. Let us plot the spectral
profiles for a 2-component solution. Click the shortcut to select Component Number 2.
The plot of (estimated) component spectra for a resolution with two pure components is
displayed.
In a similar manner, click on the right arrow shortcut to plot the 4-component solution.
MCR fitting and PCA fitting results are also available for varying numbers of pure
components from 2 to optimum +1. Each fitting includes Variable Residuals, Sample
Residuals and Total Residuals plots, which are stored in result matrices in the MCR node of the
project navigator. The user can plot these results upon selection of the respective matrices, or
by selecting the plot from the plots node of the project navigator. The plot of Total Residuals
for MCR fitting is shown by default in the lower-right subframe. Like any other plot, it can
also be accessed from the Plot menu. Change this plot to variable residuals by clicking to
activate the lower-left subframe, then selecting MCR - Variable Residuals to have this plot
displayed in place of the sample residuals plot.
Variable residuals plot
This suggests that the model with 3 components is the optimum solution.
Click and activate the Component Spectra plot with 3 components in the upper-right
quadrant. The toolbar contains a set of arrows, which are used to navigate between
results at different numbers of components. Use the arrows to increase and decrease the
number of components, and watch the impact on the spectral profiles.
Run MCR with initial guess
Task
Run the MCR calculation again, this time using an Initial Guess.
How to do it
If prior knowledge such as spectra of pure components or concentrations of mixture samples
exists, this information may be included in the MCR calculation to help the algorithm
converge towards the right solution of curve resolution.
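The alternating least squares idea behind this can be sketched outside the software. The following NumPy sketch is illustrative only: the simulated data, sizes and variable names are stand-ins chosen to mirror the tutorial (36 samples, 56 wavelengths, 3 dyes), not The Unscrambler's internal algorithm or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated mixtures: D (samples x wavelengths) = C_true @ S_true.
n_samples, n_wavelengths, n_components = 36, 56, 3
C_true = rng.random((n_samples, n_components))
S_true = rng.random((n_components, n_wavelengths))
D = C_true @ S_true

# Initial guess: slightly perturbed pure spectra ("Use initial guess / Pure spectra").
S = S_true + 0.05 * rng.random((n_components, n_wavelengths))

for _ in range(200):  # alternating least squares with non-negativity
    C = np.clip(D @ np.linalg.pinv(S), 0.0, None)   # non-negative concentrations
    S = np.clip(np.linalg.pinv(C) @ D, 0.0, None)   # non-negative spectra
    S /= np.linalg.norm(S, axis=1, keepdims=True)   # unit-vector normalized spectra

C = np.clip(D @ np.linalg.pinv(S), 0.0, None)       # final concentration update
relative_residual = np.linalg.norm(D - C @ S) / np.linalg.norm(D)
```

As the notes below point out, the recovered concentrations are only relative: their absolute scale is absorbed by the normalization of the spectra.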
Go back to the data table Tutorial_I by selecting the tab at the bottom of the viewer. Go to
Tasks - Analyze - Multivariate Curve Resolution…. The MCR dialog box with default settings
will open up. Select the same data as before, and then check the box Use initial guess and
select option Pure spectra.
MCR dialog with initial guess
Select Row Set Original as the initial guess for spectra, making sure to use the same column set
for both the analysis data and the initial guess. Then click OK to launch the calculations.
When asked if you want to view the plots now, select yes.
Rename the new MCR results node in the project navigator as MCR Initial Guess.
Notes:
When using the initial guess option, The Unscrambler® requires all pure
components to be included as initial guess inputs. Partial reference will
generate erroneous results. It is recommended to run MCR without initial
guess if only partial reference is available.
The Unscrambler® can be run with either spectra or concentrations of pure
components as an initial guess input.
In the navigator tab, right click to choose Pop out, giving an undocked plot that can now be
docked wherever you wish for ease of viewing.
You can observe that the first estimated concentration profile is similar to the reference
profile of the blue dye (blue curves on the plots), the second estimated concentration profile
is similar to the reference profile of the green dye, and the third estimated concentration
profile is very close to the reference concentration of the orange dye (green curves on the
plots).
Caution: Estimated concentrations are relative values within an individual
component itself. Estimated concentrations of a sample are not its real
composition.
The estimated spectral profiles can be compared to the reference spectral profiles in the
same way as for the concentrations. Because we used the spectra as initial guess inputs in
this example, the comparison shows a perfect match. However, estimated spectra are unit-
vector normalized; they are not the “real” spectral profile of the samples.
Plots of the Pure and Estimated Spectra
Rename the matrix named Component concentrations, which has been added to the bottom
of the project navigator, as Concentrations comparison.
With the cursor in the data matrix, go to Edit - Append and choose to add 3 columns to this
matrix. Go to table Tutorial_I, select the first three columns (blue, green and orange), from
rows 4-39. Copy them and paste them in the empty columns of the Concentrations
comparison matrix, and enter names for columns 4-6 as blue, green, and orange
respectively. We now have a table of six columns, containing the three estimated
concentrations of the pure dyes followed by the three measured concentrations.
New Data Matrix with Estimated and Real Concentrations
Select columns “Blue” and “1” (press the Ctrl key on your keyboard to select several columns
at a time). Click Plot - Scatter to display a 2-D Scatter plot of these columns. The correlation
between estimated and reference concentrations for the blue dye is 0.994. If the box
containing plot statistics (including the correlation) is not displayed in the upper-left corner
of your plot, use the toolbar buttons to display it. These can also be used to add a
regression line and target line to the plot.
Continue to make the scatter plots for the green dye (columns “Green” and “2” in the table),
which has a correlation between estimated and reference concentrations of 0.997.
For the orange dye (columns “Orange” and “3”), the correlation is 0.998. These very high
correlations indicate that the MCR calculations have determined concentration profiles
accurately in this case.
Scatter plot of orange dye concentration
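The correlation shown in the plot statistics is the ordinary Pearson correlation coefficient. A quick sketch with invented numbers (not the tutorial's actual values) shows the calculation:

```python
import numpy as np

# Hypothetical reference vs. MCR-estimated dye concentrations; the numbers are
# made up for illustration (the tutorial reports r = 0.998 for the orange dye).
reference = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
estimated = np.array([0.11, 0.19, 0.32, 0.41, 0.48, 0.61])  # relative scale

r = np.corrcoef(reference, estimated)[0, 1]
```

A high r indicates that the estimated profile tracks the reference well, even though its absolute scale is arbitrary.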
These plots can be customized by right clicking and choosing Properties to make changes to
the plot appearance.
Now let us convert the estimated Orange concentrations to real scale. In order to do this, at
least one reference measurement is needed. The estimated concentrations (in relative scale)
of all samples can be converted into the real concentration scale by multiplying by a factor
(real concentration / estimated concentration).
In the present case, we can use for example sample PROBE_11, which has a reference
concentration of Orange dye of 7 and an estimated concentration of 0.4443.
Use menu Edit - Append - … to append a new column at the end of the table, and name it
“MCR Orange real scale”. Go to Tasks - Transform - Compute_General…, and type the
expression:
V7=V3*(7/0.4443)
in the Expression space.
Compute_General Dialog
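The transform is just multiplication by a fixed scale factor. In the sketch below, only the first value (sample PROBE_11: reference 7, estimated 0.4443) comes from the tutorial; the other entries are invented for illustration.

```python
# Relative-to-real scale conversion, mirroring V7 = V3*(7/0.4443).
estimated_orange = [0.4443, 0.2101, 0.0888]
factor = 7 / 0.4443                      # real concentration / estimated concentration
real_scale = [v * factor for v in estimated_orange]
```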
Click OK to perform the calculation. A new matrix is created where the new column has
been filled with the values of estimated Orange dye concentrations converted to real scale.
Data matrix with new values
Description
What you will learn
Data table
Data plotting
Estimate the number of pure components and detect outliers with PCA
Run MCR with default settings
Tune the model’s sensitivity to pure components
Run MCR with a constraint of closure
Remove outliers and noisy wavelengths with recalculate
Description
In this tutorial we will utilize FTIR spectra of an esterification reaction to extract pure spectra
and their relative concentrations. The original data are from the University of Rhode Island
(Prof. Chris Brown), USA.
In situ FTIR spectroscopy was used to monitor the esterification reaction of isopropyl alcohol
and acetic anhydride using pyridine as a catalyst in carbon tetrachloride solution. The initial
concentrations of these three chemicals were 15%, 10% and 5% in volume, respectively.
Isopropyl acetate was one of the products in this typical esterification reaction. The reaction
was carried out in a ZnSe cell, and mixture spectra were measured at 4 cm-1 resolution. The
data set consisted of 25 spectra, covering approximately 75 minutes of the reaction. To shift
the equilibrium of the esterification, one-tenth of the volume was removed from the cell at
24, 45 and 60 minutes. An equal amount of a single reactant was added to the cell in the
sequence of acetic anhydride, pyridine and isopropyl alcohol.
Estimate the number of pure components and detect outliers with PCA
Run MCR with default settings
Tune the sensitivity to pure components setting
Run MCR with a constraint of closure
Use the Recalculate functionality in MCR
References:
Data table
Click the following link to import the Tutorial J data set used in this tutorial.
The data consist of 25 FTIR spectra of 262 variables covering the spectral region from 1860
to 852 cm-1. There are two row sets already defined: mixture and closure. Mixture contains
all the data, while the row set closure has the samples that will be used when using the
constraint of closure during the MCR.
Data plotting
Before starting the analysis, it is always important to have a look at the data. Make a line
plot of all of the spectra together.
Select all the samples by selecting the data set Tutorial_J in the project navigator. The data
table for the FTIR spectra of the samples will then be displayed in the data editor. Highlight
the samples, and use Plot - Line to display an overlay of the spectra in the viewer.
Line plot dialog
From this plot, one can see that there is a region around 1240 cm-1 that is changing over the
course of the reaction being monitored.
Line plot of FTIR spectra
Estimate the number of pure components and detect outliers with PCA
Principal Component Analysis (PCA) is recommended before running an MCR calculation. It
provides some information on the number of pure components and on sample outliers.
Task
Run a PCA on the raw data.
How to do it
Click Tasks - Analyze - Principal Component Analysis to run a PCA and choose the following
settings:
Matrix: Tutorial_J
Rows: All
Columns: All
Maximum components: 8
Mean center data: Not selected
Identify outliers: Selected
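Since mean centering is not selected, this PCA amounts to a singular value decomposition of the raw data matrix. The NumPy sketch below uses randomly simulated rank-3 "spectra" as a stand-in for Tutorial_J (only the sizes, 25 x 262, follow the text) to show how the explained variance flattens once the true number of components is reached:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 25 spectra x 262 variables built from 3 underlying components
# plus a little noise; a random stand-in, not the actual FTIR data.
scores_true = rng.random((25, 3))
loadings_true = rng.random((3, 262))
X = scores_true @ loadings_true + 1e-3 * rng.standard_normal((25, 262))

# PCA without mean centering = SVD of X itself.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)          # fraction of total variance per component
cumulative = np.cumsum(explained)
n_components = int(np.searchsorted(cumulative, 0.999)) + 1
```

On this simulated data the cumulative explained variance plateaus after the third component, mirroring what the tutorial observes in the Explained Variance plot.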
PCA Dialog
On the Validations tab, select Cross validation, then Setup… and choose full cross
validation from the drop-down list of cross validation methods. Click OK, then OK again on
the model inputs page.
Cross Validation Setup
Once the PCA calculations are done, click Yes to view the plots of the PCA model
immediately. The four-plot PCA Overview will be displayed in the viewer.
The upper right quadrant is a 2-D plot of the PCA loadings. For spectral data, it is more
informative to have a line plot of the loadings, as it then resembles a spectrum. Select the
existing loading plot, and go to Plot - Loadings - Line, which will give the plot of the first PC
loading, to replace the default plot in this quadrant. This plot, one can see, closely
resembles the FTIR spectra of the raw data. Scroll through the loadings plots for the other
PCs using the arrows on the toolbar.
You can see that the loadings begin to get noisy at about the sixth principal component. The
program recommends three components as the optimal number of PCs in this model. This is
seen in the Info box in the lower left corner of the display, and by clicking on the star on
the menu toolbar. Select the Explained Variance plot in the lower-right quadrant by clicking
on it with the mouse, then right mouse click to select View - Numerical View.
As you can see, the explained variance globally reaches a plateau from the third principal
component. The fourth and fifth PCs still show some slight increase; at that stage, it is
difficult to know whether they represent noise or real information. Now, click on the
Influence plot at the bottom-left corner of the Viewer, and use the PC navigation tool
to display the influence plot at PC4. You may observe that sample 1 sticks out to the right
with a high leverage, and that sample 8 sticks out upwards with a high residual variance.
PCA Influence Plot for PC4
Go to menu Plot - Sample Outliers to display a combination of four useful plots for outlier
detection. Highlight the Residual Sample Variance at the bottom-left quadrant, and use the
PC navigation arrows to change that to show results for PC4. This plot indicates a high
validation residual for sample 8.
Residual Sample Variance Plot for PC4
As there is no validation check in MCR, we may use the outlier information obtained from
PCA in our MCR modeling later on.
Rename the PCA model file in the project navigator by highlighting the PCA node, right
clicking and choosing Rename. Rename the model to “PCA Tutorial J”.
Run MCR with default settings
Task
Build a first MCR model with default settings.
How to do it
Go back to the data table Tutorial_J in the project navigator. Run an MCR by going to the
menu and selecting Tasks - Analyze - Multivariate Curve Resolution… and keep the default
settings:
Matrix: Tutorial_J
Rows: All
Columns: All
Go to the Options tab and verify that the default settings are selected. Make changes as
needed.
Information Box
One can compare those profiles with FTIR spectra of known constituents, and identify the 5
estimated spectra as pyridine, isopropyl alcohol, a possible intermediate, isopropyl acetate
and acetic anhydride, from curves 1-5 respectively.
Rename the new MCR model file created in the project navigator as MCR_Sensitivity150.
Run MCR with a constraint of closure
Task
Run MCR with a closure constraint. Compare two MCR models on the same data, with and
without closure.
How to do it
Among the MCR settings we have used so far, two types of constraints were not selected.
A constraint of Unimodality can be applied to restrict the resolution to concentration
profiles that have only one maximum.
With a constraint of Closure, the resolution will yield concentration profiles whose sum is
constant.
In the present case, acetic anhydride was added at 24 minutes (between the eighth and the
ninth samples), which means that the first 8 samples can be treated in closure conditions.
Go back to the data table and run a new MCR model with the following settings:
Rows: Closure [8] (contains the first 8 samples of the data table)
Cols: All
Non-negative concentrations: selected
Non-negative spectra: selected
Closure: selected
Unimodality: not selected
Sensitivity to pure components: 100
Once the computations are finished, choose to view the plots when prompted. Rename the
new MCR model file as “MCR_Closure”.
You may compare the resolved concentration and spectral profiles of pure components with
and without the closure setting. To do that, compute a new MCR model on sample set
“Closure” without checking the Closure constraint option. Save the new MCR model file as
“MCR_No_Closure” and compare the results to “MCR_Closure”.
The spectral profiles with and without the constraint of closure are very similar.
MCR Component Spectra
You can also observe that under constraint of closure, the concentrations of the pure
components always add up to 1.
MCR Component Concentrations
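Closure simply rescales each sample's concentration profile so that its components sum to a constant. A minimal sketch with invented numbers:

```python
import numpy as np

# Under closure, each row (sample) of the concentration matrix is rescaled so
# its components sum to 1. The values below are illustrative only.
C = np.array([[0.2, 0.5, 0.3],
              [0.9, 0.3, 0.6]])
C_closed = C / C.sum(axis=1, keepdims=True)
```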
Click on the bottom-left subframe where the Sample residuals are plotted to highlight it. If
needed, use the PC navigation arrow tool to change the view to show the sample residual
for the 4-component model.
Here you may notice a high residual showing for Sample 8, compared to the other samples.
Let us build a model without this sample. You will notice in the sample residuals plot that
the shape is similar to what is observed in the residual sample variance plot from the PCA
model on this same data set.
MCR Sample Residuals
Select the MCR_Defaults model in the project navigator, and right click to select Recalculate
- Without Marked… to specify a new MCR calculation without sample 8.
Menu to recalculate without marked
This brings you back to the MCR dialog, where sample 8 is now included in the Keep Out Of
Calculation field. You may launch the calculations to get the new MCR results.
MCR menu with sample 8 kept out
Similarly, you may want to keep non-targeted or highly overlapped wavelength regions
out of the model.
From the MCR_Defaults overview plots, click Plot - Variable Residuals.
MCR Variable Residuals
Mark any unwanted variables on the plot using the marking tools, for example variables
around 1100-1140 cm-1, which present very high residuals, then select the model
“MCR_Defaults” and right click to choose Recalculate - Without Marked… to specify a new
MCR calculation.
Description
What you will learn
Data table
Transform the raw spectra
Application of K-Means clustering
Application of Hierarchical Cluster Analysis (HCA)
Repeat the HCA using a correlation-based measure
Using the results of HCA to confirm the results of PCA
Description
This tutorial investigates the use of two well-known clustering methods, K-Means and
Hierarchical Cluster Analysis (HCA), for classification of raw materials used in the
pharmaceutical industry, by means of reflectance Near Infrared (NIR) spectroscopy.
References
Data table
Click the following link to import the Tutorial K data set used in this tutorial.
The data table contains 35 NIR spectra of seven classes of raw materials often used in
pharmaceutical manufacturing. Typically when developing classification models it is
recommended that more samples be used, being sure to cover the natural variability of each
class, but for this exercise, we use just five spectra for each class.
The diffuse reflectance spectra have been truncated to the wavelength region 1200 - 2200
nm for this particular example.
The type of raw material is defined in the name of each sample, and includes:
Citric acid
Dextrose anhydrous
Dextrose monohydrate
Ibuprofen
Lactose
Magnesium stearate
Starch
Click on OK and view the plot. Notice that there are distinct groups of spectra with similar
profiles. The main source of variation within each group comes from differences in the
absorbance (Y) axis. This baseline shifting is due to differences in sampling when preparing
and scanning, resulting in differences in light scattering by the samples measured in
reflectance by NIR spectroscopy.
Line plot of NIR spectral data
A convenient way to remove this variation is by the use of the SNV transform. This transform
reduces the scattering effects in such data by removing the mean value from each point in
the spectrum and dividing each point by the standard deviation of all points in the spectrum,
i.e. the SNV transform normalizes the spectrum to itself. The effect of the SNV transform is
to remove the variation in the absorbance scale (baseline shifting), while retaining the
original profile of the spectral data.
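The SNV transform described above can be written directly: for each spectrum (row), subtract its mean and divide by its standard deviation. The NumPy sketch below uses random toy data with deliberate per-row baseline offsets, not the tutorial's NIR spectra:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "spectra" with a different additive baseline per row, standing in for
# the 35 NIR spectra of the tutorial (random data, for illustration only).
spectra = rng.random((35, 101)) + 10 * rng.random((35, 1))

# SNV: center and scale each spectrum by its own mean and standard deviation.
snv = (spectra - spectra.mean(axis=1, keepdims=True)) / spectra.std(axis=1, keepdims=True)
```

After the transform every spectrum has mean 0 and standard deviation 1, so the row-wise baseline offsets are gone while each spectrum's shape is preserved.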
This is a commonly used practice in many NIR applications, especially for reflectance spectra
of solids. To perform the SNV transformation, right click in the matrix Tutor K Data and
select Transform - SNV. In the Rows dialog box, select All and in the Columns dialog box,
select All. You can preview the effect of the transformation by clicking in the Preview result
box, or just click OK to perform the transformation.
SNV dialog
The transformed data are displayed as a new node in the project navigator and the matrix is
called Tutor K Data_SNV. Plot the data to see how they now look by selecting all samples in
the new matrix and going to Plot - Line.
The resulting SNV-transformed spectra can be seen below.
Line plot of SNV-transformed NIR Spectra
The spectra are now ready for application of the clustering algorithms described below.
It is a good idea to save your work as you go. Save your project by going to File - Save As….
With K-means one can also make initial class assignments on the options tab, and set the
number of iterations to use to find the optimal number of clusters. Here we will allow the
algorithm to make assignments with no further input, and use the default number of 50
iterations.
Cluster analysis dialog options tab
Click OK to start the analysis and a new node will appear in the project navigator called
Cluster analysis. Right click on the node and select Rename and call this analysis K-Means.
You will notice that there is no graphical output for K-Means clustering. The output of the
cluster analysis is found in the Results folder. Expand this folder to display a node called
Tutor K Data_SNV_Classified, where the results reside. The classified data matrix is color-
coded according to the clusters (row sets) that have been identified. Expand this matrix.
Expand the rows and the columns folders and you will see that the rows contain seven
assigned clusters from Cluster-0 to Cluster-6. The columns folder contains the class, a single
column of classification results.
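The assignment just described can be reproduced outside the software with a minimal K-Means (Lloyd's algorithm) sketch. The toy data mimic this tutorial's 35-sample, 7-class layout but are randomly generated, and the farthest-point initialization is used here only to keep the small example deterministic:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "spectra": 7 well-separated materials x 5 replicates each (35 x 20).
centers_true = 10 * rng.random((7, 20))
X = np.repeat(centers_true, 5, axis=0) + 0.05 * rng.standard_normal((35, 20))

k = 7
# Farthest-point initialization: start somewhere, then repeatedly add the
# sample farthest from all centroids chosen so far.
centroids = [X[0]]
for _ in range(k - 1):
    dist = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
    centroids.append(X[np.argmax(dist)])
centroids = np.array(centroids)

for _ in range(50):  # 50 iterations, matching the dialog's default
    labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2), axis=1)
    centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else centroids[j] for j in range(k)])
```

On these well-separated toy groups every replicate of a material lands in the same cluster, matching what the tutorial observes: each cluster holds the 5 samples of one material.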
The K-Means data table is now classified by different colors, corresponding to the various
assigned classes. Study this table. You will notice that the K-Means algorithm has
successfully classified the data into seven distinct classes, each containing a single raw
material type. Click on the various cluster nodes in the project navigator and confirm that
each cluster contains 5 samples of the same material type. Using the Rename function,
assign cluster names according to the table above. The results of this operation are shown
below.
View of Assigned Classes in Navigator
Now that the separate classes have been defined, you can use this information to group
samples in plots. Go back to the matrix Tutor K Data_SNV and right click to
select Plot - Line. In the plot, you can now right click to select Sample Grouping. In the sample
grouping & marking dialog , first select the matrix containing the clustered data by clicking
on the Select result matrix button, which will allow you to choose the newly formed matrix
Tutor K Data_SNV_Classified. For cols, choose Class1; the row sets you have just
renamed will be listed as available row sets. Select all of these using », and click OK. The line plot will
now have all samples of each set displayed in a single color.
Sample grouping option
The structure of the dendrogram is dependent on the distance measure used, and great
care must be taken when interpreting the structures.
Task
Make a HCA model using the method of single linkage and Euclidean distance.
How to do it
Select Tasks - Analyze - Cluster Analysis… and make a model with the following parameters:
Use the drop-down lists to change the clustering method and distance measure. Click OK to
start the analysis. When the analysis is completed, the dendrogram is displayed in the
editor window, and a new Cluster analysis node is added to the project navigator.
HCA Euclidean Dendrogram
Before reviewing the analysis results, rename the new cluster analysis node in the project
navigator as HCA Euclidean.
Analyze the dendrogram and look at the order of the clusters from top to bottom. It can be
seen that each raw material type is uniquely defined and the carbohydrate materials Starch,
Lactose, Dextrose Monohydrate and Dextrose Anhydrous all group together in the
dendrogram. Towards the bottom, the clustering is not as distinct. This indicates that the
sample classification is based on some similarity in the chemistry of the samples, but it is not
as well defined as it could be. This is one aspect of HCA that must be kept in mind when
using this method.
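The same analysis can be sketched with SciPy's hierarchical clustering tools. The data here are randomly generated stand-ins laid out like the tutorial's 7 materials with 5 replicates each, not the real NIR spectra:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)

# Toy stand-in for the 35 SNV spectra: 7 separated groups of 5 samples.
centers = 10 * rng.random((7, 50))
X = np.repeat(centers, 5, axis=0) + 0.05 * rng.standard_normal((35, 50))

# Single linkage on Euclidean distances, then cut the tree into 7 clusters.
Z = linkage(pdist(X, metric="euclidean"), method="single")
clusters = fcluster(Z, t=7, criterion="maxclust")
```

Calling `scipy.cluster.hierarchy.dendrogram(Z)` would draw the tree itself.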
In the project navigator, expand the results folder for the HCA and under the rows folder,
you will see that seven clusters have been assigned to this analysis. These can be renamed
as was done above, so that the names coincide with the class name.
Repeat the HCA using a correlation-based measure
When dealing with spectroscopic data, the spectrum of a material is analogous to its
fingerprint. Using a straight distance measure such as the Euclidean measure may not be the
most sensitive way of assessing the similarities present within the data. The Absolute
correlation measure provides a better way of capturing the similarities among the spectral
variables of the materials. We will also change to complete linkage, which looks for the
farthest neighbor, as opposed to the nearest neighbor used in single-linkage HCA.
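An absolute-correlation distance is commonly defined as d = 1 - |r|, with r the Pearson correlation between two spectra, so two spectra with the same shape (even if scaled or inverted) are "close". A small sketch on random stand-in spectra:

```python
import numpy as np

rng = np.random.default_rng(5)

# Random stand-in "spectra" (6 samples x 40 variables), illustration only.
X = rng.random((6, 40))

r = np.corrcoef(X)        # 6 x 6 matrix of correlations between the rows
D = 1.0 - np.abs(r)       # absolute-correlation distance: identical shapes -> 0
```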
Task
Make a HCA model using the method of complete linkage and absolute correlation.
How to do it
Select Tasks - Analyze - Cluster Analysis. Use the following parameters:
Click OK to start the analysis and then click Yes to view the plots. The dendrogram for this
analysis is displayed in the editor window, and from the results node it is seen that 7 clusters
are identified.
Before reviewing the analysis results, rename the new cluster analysis node in the project
navigator as “HCA Correlation”.
Notice that all samples are uniquely classified into classes based on the raw material type.
This time there are three distinct clusters in the dendrogram. At the top of the dendrogram
is Starch. The next cluster of samples contains mostly carbohydrates: Lactose, Dextrose
Monohydrate, Dextrose Anhydrous and Citric acid. The last cluster includes the materials
Ibuprofen and Magnesium stearate, whose NIR spectra have features in the 1400 and 1700
nm regions.
HCA Absolute correlation distance dendrogram
The method of absolute correlation not only uniquely classified the individual raw materials,
but it was also able to use the information in the spectral variables far better, by grouping
the materials by their chemical properties.
In the results folder, select the data table Tutor K Data_SNV_Classified. Go to Insert -
Duplicate Matrix…. The following dialog box opens.
Duplicate Matrix
Rename the clusters of the duplicated matrix based on the materials’ name.
Renamed row ranges
We will use these results, in conjunction with PCA, to show how the two methods of
unsupervised pattern recognition can be used together.
Using the results of HCA to confirm the results of PCA
Task
Perform a PCA on the SNV transformed data and group the samples based on the results of
HCA.
How to do it
Select Tasks - Analyze - Principal Component Analysis…. Use the following parameters:
PCA dialog
Click OK to start the analysis and then click Yes to view the plots. The PCA Overview for this
analysis is displayed in the workspace.
In the Scores Plot, right click and select Sample Grouping. From the Select drop-down list,
choose the results from your clustering to list the available row sets of the different
clusters. Click on the » button to select all clusters in the analysis and then click OK.
Sample grouping dialog
Drag the updated scores plot so that it fills most of the screen and analyze the clustering.
The scores plot shows that PC1 explains 66% of the data variance, and PC2 describes 19%.
The main difference along PC1 is between carbohydrate materials and fatty acid based
materials (i.e. Magnesium Stearate and Citric Acid) and PC2 is differentiating between the
starch and ibuprofen samples.
It can be seen that the clustering of the materials as established by HCA is consistent with
that of PCA. PCA provides more information on the groupings as the spectral loadings can be
related to the spectral features which describe the materials. To have a more informative
view of the PCA loadings it is better to look at them as a line plot, which then resembles a
spectrum. Activate the loadings plot in the upper-right quadrant, and right click to select
PCA - Loadings - Line. The loadings plot now shows which spectral features are related to
the first PC, which explains most of the variance in this data set. Use the next arrow to
scroll to the next PC loadings plot.
PCA Overview Plot
Now that the work has been done it is a good idea to save the results so you can refer to
them in the future.
This exercise has shown that, when more data (more samples for each class) are available,
one can proceed to make a classification model to identify these seven raw materials
from their NIR spectra. Classification methods such as PLS-DA and SIMCA can be
used to develop models for the classification of future samples.
Description
What you will learn
Data table
Open and study the data
Build an L-PLSR model
Interpret the results
Variances
Products: X Scores
Product descriptors: X Correlation Loadings
Consumer descriptors: Z Correlation Loadings
Consumer liking of the products: Y Correlation Loadings
Overview of the L-PLS Regression solution
Verify the results
Products liking
Liking Y vs. consumer background Z
Product descriptor rows in X
Product descriptor columns in X
Bibliography
Description
Consumer studies represent an application field where “L-shaped” data matrix structures
(X, Y, Z) such as described in the following are common: A set of I products has been assessed
by a set of J consumers, e.g. with respect to liking, with results collected in “liking” data table
Y (I×J). In addition, each of the I products has been “measured” by K product descriptors
(“X-variables”), reflecting chemical or physical measurements, sensory descriptions,
production facts etc., in data table X (I×K). Moreover, each of the J consumers has been
characterized by L consumer descriptors (“Z-variables”), comprising sociological background
variables like gender, age, income, etc., as well as the individual’s general attitude and
consumption patterns; these are collected in data table Z (J×L). Relevant questions could
then be: Is it possible to find reliable patterns of variation in the liking data Y, which can be
explained from both product descriptors X and from consumer descriptors Z? Is it possible to
predict how a new product will be liked by these consumers, by measuring its X-variables? Is
it possible to predict how a new consumer group will like these products, from their
background Z-variables?
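The three tables can be kept straight with a quick shape check: X and Y share their product rows, while Y's columns and Z's rows index the same consumers. The sizes below follow the text (I = 6, J = 125, K = 10); L = 4 is an arbitrary placeholder, and the arrays are random stand-ins for the actual tables.

```python
import numpy as np

rng = np.random.default_rng(6)

# I products, J consumers, K product descriptors, L consumer descriptors.
I, J, K, L = 6, 125, 10, 4
X = rng.random((I, K))   # X - ApplesSensoryChem: products x product descriptors
Y = rng.random((I, J))   # Y - ApplesLiking: products x consumers
Z = rng.random((J, L))   # Z - AppleChildBackground: consumers x descriptors

# The "L" shape: X and Y share product rows; Y's columns and Z's rows
# refer to the same consumers.
shared_products = X.shape[0] == Y.shape[0]
shared_consumers = Y.shape[1] == Z.shape[0]
```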
The data consist of information gathered on Danish children’s liking of apples. Their
response to various apple types is termed Y. Chemical, physical and sensory descriptors of
these apple types are called X, and sociological and attitude descriptors on these children
are in matrix Z. The purpose of the analysis is to find patterns in these X-Y-Z data that are
causally interpretable and have predictive reliability.
We are now going to build an L-PLS regression (L-PLSR) model linking the panelists’ sensory,
chemical and physical evaluations to the consumers and their sociological and attitude
descriptors. The model will summarize all the information about consumers, consumers’
preference, the products and their characteristics.
References:
Data table
We are going to study three data tables of different sizes. The structure of the data set is as
follows:
X - ApplesSensoryChem
Y - ApplesLiking
Z - AppleChildBackground
L-PLSR Structure
Red
Sweet
Sour
Glossy
Hard
Round
The contents of acid (ACIDS) and sugar (SUGARS) were determined as malic acid and
soluble solids, respectively.
Based on prior theory on human sensation of sourness, the ratio ACIDS/SUGARS was
included as a separate variable (Kuhn and Thybo, 2001).
Together, the sensory, chemical and instrumental variables constituted K = 10 product
descriptors, which will here be referred to as X(I×K) for the I = 6 products.
Y data
The Y data (Y - ApplesLiking) consist of information gathered on Danish children’s liking of
apples. Their response to various apple types is termed Y. Each child was asked to express
the liking of the appearance of the six apple cultivars, using a five-point facial hedonic scale.
One apple at a time was shown to the child, so that the child would not concentrate on
comparing the appearances. All samples were presented in randomized order. The resulting
liking data for the I = 6 products × J = 125 consumers will here be termed Y(I×J).
Z data
The Z data table (Z - AppleChildBackground) contains the information collected about the
consumers: sociological and attitude descriptors on these children.
The consumers were children aged 6 to 10 years (51% boys, 49% girls), recruited from a local
elementary school. A total of 146 children were tested and included in the original
publication of Thybo et al. (2004). For simplicity, only the J = 125 children that had no
missing values in their liking and background data are included in the present study.
First, each child was asked to look at a table with five different fruits and answer the
questions: “If you were asked to eat a fruit, which fruit would you then choose, and which
fruit would be your last choice?” The resulting responses are named “fruitFirst” and
“fruitLast”, where fruit is one of RedA (Red apple), GreenA (Green apple), Pear, Bana
(Banana), or Orange. Additional descriptors “AFirst” and “ALast” are also available which
correspond to either red or green apples.
The child was also questioned about how often he/she ate apples, by having the following
opportunities: “every day” (here coded as value 4), a couple of times weekly (3), “a couple of
times monthly” (2), “very seldom” (1); this descriptor is here named “EatAOften”. (A few of
the children responded “do not know” to how often he/she ate apples. To reduce the
number of missing values, this was taken as indicating very low apple consumption, and
coded as 0.) In addition, the child’s gender and age were noted. These two sociological
descriptors were used, together with the attitude variables fruitFirst and fruitLast and eating
habit-variable EatAOften, as L = 15 consumer background descriptors Z(J×L) for the J = 125
children.
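The re-coding of the questionnaire answers into numeric Z-variables can be sketched as follows (a hypothetical helper, not part of the software; the AFirst/ALast descriptors are omitted for brevity):

```python
# Coding of the apple-eating frequency question, as described in the text.
eat_a_often_code = {
    "every day": 4,
    "a couple of times weekly": 3,
    "a couple of times monthly": 2,
    "very seldom": 1,
    "do not know": 0,   # treated as very low apple consumption
}

fruits = ["RedA", "GreenA", "Pear", "Bana", "Orange"]

def code_child(gender, age, fruit_first, fruit_last, eat_answer):
    """Return one row of Z: gender, age, five fruitFirst dummies,
    five fruitLast dummies, and EatAOften."""
    row = [1.0 if gender == "boy" else 0.0, float(age)]
    row += [1.0 if fruit_first == f else 0.0 for f in fruits]
    row += [1.0 if fruit_last == f else 0.0 for f in fruits]
    row.append(float(eat_a_often_code[eat_answer]))
    return row

z_row = code_child("girl", 8, "GreenA", "Bana", "every day")
# 2 sociological + 5 + 5 dummies + 1 habit variable = 13 descriptors here;
# the tutorial's Z table has L = 15 because it also includes AFirst and ALast.
```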
Open and study the data
Click the following link to import the Tutorial L data set used in this tutorial.
There are three matrices:
X - ApplesSensoryChem
Y - ApplesLiking
Z - AppleChildBackground
Click on the X Weights option. Select all the variables by clicking on the All button.
Select the option “A / (SDev + B)” with the radio button. Finally, click on the Update
button.
Click on the Y Weights option and use weighting option “A / (SDev + B)” for all the
variables.
Click on the Z Weights option and use weighting option “A / (SDev + B)” for all the
variables.
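With A = 1 and B = 0, the “A / (SDev + B)” weighting is plain standardization to unit variance. A minimal NumPy sketch of what this weighting does to a data table (an illustration, not the software's implementation):

```python
import numpy as np

def weight_matrix(M, A=1.0, B=0.0):
    """Weight each column of M by A / (SDev + B). With A=1, B=0 this is
    standardization to unit variance; a small B > 0 guards against
    division by zero for near-constant columns."""
    sdev = M.std(axis=0, ddof=1)
    return M * (A / (sdev + B))

rng = np.random.default_rng(1)
# Columns with very different raw scales, as in mixed sensory/chemical data:
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 0.1], size=(20, 3))
Xw = weight_matrix(X)
# After weighting, every column has sample standard deviation A = 1,
# so no variable dominates the model purely because of its scale.
```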
Once all necessary options have been selected, click OK to start the computations.
Interpret the results
View the results and study the different plots:
LPLS Overview
Correlation Loadings
Correlation
Variances
Study the bottom right plot in the LPLS overview. It presents the explained variances of the
three data tables: X (blue), Y (red) and Z (green).
Most variation in the product descriptor table X is explained in 3-4 factors, whereas all 5
factors seem to be relevant for explaining variation in the Y- and Z-tables. A total of 72% of
the consumer background variation in Z is explained by the full model.
In total 21% of the variation in the product liking table Y is explained using all 5 factors. The
majority (13%) is explained by Factor-1, whereas 4% is explained by Factor-2.
Products: X Scores
A scatter plot of the X scores describing apple types is given in the top left corner under
Correlation Loadings.
Scores plot
The first two factors explain 54% and 14% of the variation in X. Factor-1 describes variation
separating GrannySmith (and to some degree Mutzu) from the group of products defined by
Gloster, Jona, and Gala. Factor-2 spans a direction where Granny Smith and Mutzu represent
the extremes.
This plot shows the main patterns of the sensory, instrumental and chemical product
descriptors. Interpreting Factor-1 first, it seems the main variation spans two groups of
predictors, where a group describing redness and sweetness is negatively correlated to a
group related to sourness, hardness and roundness. Factor-2, on the other hand, separates
glossiness and roundness from sugar content, indicating that round, glossy cultivars tend to
contain less sugar than the other apples in the study.
Comparison with the previous scores plot confirms that e.g. Granny Smith is somewhat sour,
hard and round, and it is not red (but green). As expected, the red cultivars Gala, Jona and
Gloster are found to the right. Elstar has a red and green, marbled appearance, which
explains why its score value for Factor-1 is close to zero (neither red nor green).
Here, the main patterns of the consumer background descriptors picked up by the model are
seen. Factor-1 spans a tendency to choose the green apple first (GreenAFirst) against the
tendency to choose the red apple first (RedAFirst). This component explains 16% of Z (as can
be seen from the scores plot or explained variance plot above). It also seems that older
children tend to prefer red to green apples, while gender is a poor descriptor for children’s
preferences.
The second factor (explaining 22% of Z) reflects the children’s preferences for different fruits.
Those who eat apples often tend to prefer apples over bananas, for instance. Similarly, the
children who particularly dislike green apples seem to have a somewhat higher preference
for other fruits.
This plot shows the main, product-related patterns of the consumers with respect to liking.
The children grouping towards either end of the horizontal axis likely have a very clear
preference for green or red apples over the alternative.
Products liking
Plot a scatter plot of the most extreme products (liking GrannySmith vs. liking Jonagold) and
look at the correlation. As the responses are restricted to 5 levels, many of the values are
superimposed in the plot. Add a regression line to get a better impression of the relation
between the factors. Optionally, add a statistics table to the plot, and change the point sizes,
point labels and x-axis limits via the menu View - Properties.
With only five response levels possible, many data points are superimposed and the pattern
is difficult to see. But the raw liking data are clearly negatively correlated (r = -0.4 over the
125 subjects), as expected.
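The value reported here is an ordinary Pearson correlation over the 125 children. A sketch with simulated liking scores (random placeholders, not the tutorial data):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical liking scores for two products over 125 children, on a
# 5-point scale: one latent preference axis drives them in opposite ways.
base = rng.normal(size=125)
liking_granny = np.clip(np.round(3 - base + rng.normal(scale=1.0, size=125)), 1, 5)
liking_jona = np.clip(np.round(3 + base + rng.normal(scale=1.0, size=125)), 1, 5)

# Pearson correlation between the two liking columns:
r = np.corrcoef(liking_granny, liking_jona)[0, 1]
print(round(r, 2))  # clearly negative, as the cultivars appeal to opposite preferences
```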
There is a tendency (r = 0.52 over 125 subjects) that children who chose the green apple first
also reported liking GrannySmith.
Select All for Rows and Cols. For the Transformation field, select Mean for Center and
Standard deviation for Scale. Optionally, check Preview result.
Center and Scale window
Again, these two products are seen to be described by quite opposite terms: Jonagold is
sweet, red and high in sugars compared to GrannySmith, while GrannySmith has a high
acids/sugars ratio and is sour, hard and round compared to Jonagold. The correlation is -0.72
between these two rows of 10 standardized X-variables.
As expected from the L-PLS regression model, these two variables are almost orthogonal,
with r = 0.07 over the six products.
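The Center and Scale transformation, and a row correlation of the kind quoted above, can be reproduced in a few lines (a sketch on random placeholder data, not the apple data set):

```python
import numpy as np

def center_and_scale(M):
    """Mean-center each column and divide it by its standard deviation,
    matching the Center (Mean) and Scale (Standard deviation) options."""
    return (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 10))      # 6 products x 10 descriptors
Xs = center_and_scale(X)

# Correlation between two product rows across the 10 standardized variables:
r = np.corrcoef(Xs[0], Xs[1])[0, 1]
assert -1.0 <= r <= 1.0
```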
Bibliography
B.F. Kuhn, A.K. Thybo, The influence of sensory and physiochemical quality on Danish
children’s preferences for apples, Food Qual. Pref. 12, 543-550 (2001).
H. Martens, E. Anderssen, A. Flatberg, L.H. Gidskehaug, M. Hoy, F. Westad, A. Thybo, M.
Martens, Regression of a data matrix on descriptors of both its rows and of its columns via
latent variables: L-PLSR, Computational Statistics & Data Analysis 48, 103-123 (2005).
A.K. Thybo, B.F. Kuhn, H. Martens, Explaining Danish children’s preferences for apples using
instrumental, sensory and demographic/behavioral data, Food Qual. Pref. 15, 53-63 (2004).
Description
What you will learn
Data table
Create a PLS model
Interpret a PLS model
Variance plot
Scores plot
Loadings plot
Weighted regression coefficients
Stability plots
Stability in loading weights plots
Stability in scores plots
Conclusions
Description
In this work environment study, PLS regression was used to model 34 samples corresponding
to 34 departments in a company. The data were collected from a questionnaire about
overall job satisfaction (Y), modeled from 26 questions (X1, X2, …, X26) about repetitive
tasks, inspiration from the boss, helpful colleagues, positive feedback from the boss, etc.
The unit for these questions was the percentage of people in each department who ticked
“yes”, e.g. “I can decide the pace of my work”. The response variable was the overall job
satisfaction, on a scale from 1 to 9.
PLS regression
Validation methods
Uncertainty estimates
Interpretation of plots
This tutorial is also presented differently from the other tutorials, with less detailed
instructions for each task, making it slightly more demanding.
Data table
Click the following link to import the Tutorial M data set used in this tutorial. The data
already have several row and column sets defined, but you must define the column set for
the response variable, job satisfaction.
Create a PLS model
Click Tasks - Analyze - Partial Least Squares Regression to run a PLS regression and choose
the following settings:
Model inputs
X Weights
1/SDev
Select all the variables, select the radio button A/(SDev+B), and click Update.
Y Weights
1/SDev
Select “Job satisfaction”, select the radio button A/(SDev+B), and click Update.
Validation
Full cross-validation. Click on the button Setup… to select this option.
Select the Uncertainty test for the optimal number of factors.
Select Uncertainty test
Variance plot
The initial model indicated 2 factors as the optimal model dimension by full cross-validation,
which created 34 submodels, each with 1 sample left out. As a second step, the uncertainties
for all X-variables were estimated by jack-knifing of various model parameters, based on the
two-factor model.
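The resampling logic of full cross-validation, one submodel per left-out sample, can be sketched as follows (ordinary least squares stands in for the PLS submodels; this illustrates the scheme, not the software's algorithm):

```python
import numpy as np

def loo_cv_press(X, y, fit, predict):
    """Leave-one-out cross-validation: one submodel per sample, returning
    the squared prediction error for each left-out sample."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        model = fit(X[keep], y[keep])
        errors[i] = (predict(model, X[i:i + 1])[0] - y[i]) ** 2
    return errors

# Ordinary least squares as a stand-in for the PLS submodels:
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda b, X: X @ b

rng = np.random.default_rng(4)
X = rng.normal(size=(34, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=34)
press = loo_cv_press(X, y, fit, predict)
print(len(press))  # 34 submodels, one per department
```

The collected submodel parameters are what the jack-knife uncertainty estimates are built from.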
In the variance plot the validation curve (red) shows 62% explained variance for 2 factors,
which is rather good for data of this kind.
Plot of explained y-variance
Scores plot
The scores plot shows that the samples are well distributed with no apparent outliers.
Plot of scores
Loadings plot
The relations between all variables are more easily interpreted in the correlation loadings
plot rather than the loadings as the explained variance can be seen directly in the plot; the
inner circle depicts 50% explained variance and the outer 100%.
Activate the X-Loadings plot by clicking in it, then use the corresponding shortcut button; it
will display the two circles.
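Correlation loadings are simply the correlations between each variable and the score vectors; the squared correlations sum to the fraction of a variable's variance the factors explain, which is what the 50% and 100% circles mark. A sketch (principal component scores stand in for the model's scores; random placeholder data):

```python
import numpy as np

def correlation_loadings(M, scores):
    """Correlation of each variable in M with each score vector."""
    n_vars, n_comp = M.shape[1], scores.shape[1]
    R = np.empty((n_vars, n_comp))
    for j in range(n_vars):
        for a in range(n_comp):
            R[j, a] = np.corrcoef(M[:, j], scores[:, a])[0, 1]
    return R

rng = np.random.default_rng(5)
X = rng.normal(size=(34, 6))
Xc = X - X.mean(axis=0)
# Stand-in scores: the first two principal components of the centered data.
U, s, _ = np.linalg.svd(Xc, full_matrices=False)
T = U[:, :2] * s[:2]

R = correlation_loadings(Xc, T)
# Per-variable variance explained by the two factors; a variable on the
# inner circle has 0.5, one on the outer circle has 1.0.
explained = (R ** 2).sum(axis=1)
assert (explained <= 1.0 + 1e-9).all()
```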
The most important variables for job satisfaction (Y) seem to be related to how the
employees evaluate their leader. Questions related to the work span the direction from
upper left to lower right in the plot.
Plot of correlation loadings
The variables found significant are marked with circles in the loadings plot. If they are not
shown by default, activate the marking of significant variables with the corresponding toolbar button.
Although the variable pattern can be interpreted in the correlation loadings, the importance
of the variables is better summarized in terms of the regression coefficients in this case.
Recall that the loadings describe the structure in X and Y, whereas the loading weights are
more relevant for interpreting the importance of the variables in modeling Y. Alternatively,
the predefined plots under the weighted regression coefficients may be investigated.
The automatic function Mark significant variables shows clearly which variables have a
significant effect on Y.
When plotting the regression coefficients one can also plot the estimated uncertainty limits
as an approximate 95% confidence interval as shown below.
Plot of the weighted regression coefficients
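The idea behind the plotted uncertainty limits can be sketched as follows: collect the leave-one-out estimates of a coefficient, derive a jack-knife standard error, and form an approximate 95% interval (a simplified illustration; the software's uncertainty test uses a related but not necessarily identical formula):

```python
import numpy as np

def jackknife_ci(estimates, full_estimate, z=1.96):
    """Approximate 95% confidence limits for a coefficient from its
    leave-one-out estimates and the full-model estimate (jack-knife sketch)."""
    n = len(estimates)
    se = np.sqrt(((estimates - full_estimate) ** 2).sum() * (n - 1) / n)
    return full_estimate - z * se, full_estimate + z * se

# A coefficient whose interval crosses zero is not significant at the 5% level:
lo, hi = jackknife_ci(np.array([0.02, -0.01, 0.03, 0.00, -0.02]), 0.01)
print(lo < 0 < hi)  # interval crosses zero here
```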
For example, the variable “disrespect” has uncertainty limits crossing the zero line: it is not
significant at the 5% level. Zoom in with Ctrl+right click to see details.
13 out of 26 X-variables are found to be significant at the 5% level. However, nothing
prevents one from setting the cutoff at another level, depending on the application.
Variables with large regression coefficients may not be significant because the uncertainty
estimate indicates that the relation between this variable and Y is due to only some samples
spanning the range. One effective way to visualize this is to show the stability plot.
The corresponding p-values are given in the output node, in the validation folder.
p-values for the regression coefficients
Stability plots
Stability in loading weights plots
Go back to the loadings plot. By clicking the toolbar button Stability plot the model
stability is clearly visualized.
Stability in loading weights plots
Variable 11, “Help”, is not very stable: the two departments 15 and 26 have a much lower
value than the others and are thus influential for this variable. This indicates that the variable
is probably not reliable for predicting “job satisfaction”.
This can be studied by looking at the scatter plot of the “Help” vs. “job satisfaction”.
To plot it, go back to the data table “Work environment case”. Select column 11, “Help”, as
well as column 27, “Job satisfaction” (hold Ctrl to select both).
Then go to Plot - Scatter or click on the corresponding toolbar icon.
“Help” vs. “job satisfaction”
This plot shows that the variable X11, “Help” (Do you find your colleagues helpful?), is only
weakly correlated with “job satisfaction”. The two suspicious departments are influential in
this relation.
Go back to the scores plot. By clicking the toolbar button Stability plot the model
stability is clearly visualized.
Stability plot of scores
For each sample one can see a swarm of its scores from each submodel. There are 34
sample swarms. In the middle of each swarm is the score for the sample in the total model.
By clicking on any point, information about the segment is given. Thus, in the case of full
cross-validation one can directly see how the models change when a particular sample is
kept out. In other words, a sample that makes the model change when it is left out of the
segment has influenced all the other submodels due to its uniqueness.
The score and loading stability plots are also very useful for higher factors in models as they
indicate when noise is becoming the main source for a specific component.
Conclusions
In the work environment example, looking at the global picture from the stability scores
plot, one can conclude that all samples seem good and the model seems robust. Also, the
uncertainty test indicates 13 significant variables at the 5% level, as visualized with the
95% confidence intervals.
35.3. Quick start
35.3.1 Quick start tutorials
PCA
Projection
SIMCA
MLR
PCR
PLS
Prediction
Cluster
MCR
LDA
LDA classification
SVM
SVM classification
LPLS
Data structure
The PCA model has been developed on the variable set “Descriptors”. It needs 4 PCs. Have a
look at the PCA quick start tutorial for more information on this model.
Go to Tasks - Predict - Projection.
In the dialog box Project to Latent Space, make the following selections:
Projection inputs
Look at the residual variance plot to see how well the new samples are described. Look at
the green line: it goes rapidly to zero, indicating a good description.
Projection residual variance
For more information on the plots go to the Interpreting Projection plots section
PCA model 1
PCA model 2
PCA model 3
PCA model 4
Data structure
Matrix: “FiveRawMaterials-small”
Rows: “Test”
Cols: “Spectra”
Class model: “PCA_AcDiSol”, “PCA_DiCaP”, “PCA_Kollidon”, “PCA_MCC”
Leave the default value for the Suggested number of PCs. For more information on the
optimal number of PCs to use, look up the PCA theory.
SIMCA inputs
For more information on the plots go to the Interpreting SIMCA plots section
Data structure
Look at the quality of the regression in the predicted vs. reference plot. The R-square is
about 1, which is very good. In addition, the error is small.
MLR predicted vs. reference
For more information on the plots go to the Interpreting MLR plots section
Data structure
In the validation tab, select a full cross-validation. To do so, select the radio button Cross
validation, then click on the Setup button and select the Full option in the drop-down
menu.
For PCR it is useful to enable the Uncertainty test by ticking the associated box. This test
will mark the important variables in the model in the loadings plot and the regression
coefficients plot. For the Number of factors to use, leave the default option, use optimal
number of factors.
PCR validation - cross-validation setup
There are several options in the algorithm tab. Look at the information in the Additional
information field. Select the SVD option as the data set is rather small.
PCR algorithm
Look at the scores. Notice that the 2 sundae samples are clustered together. “Apple pie”
and “Pommes frites” are also very close, which means that these samples have similar
composition.
PCR scores
Along PC1, “Protein (%)” and “Carbohydrates (%)” are diametrically opposite. This means
that they are anti-correlated, varying in opposite directions. Samples with positive scores
on PC1 are rich in protein, like all the burgers. Samples with negative scores on PC1 are rich
in carbohydrates and low in protein, such as “Pommes Frites”.
PC2 is describing the variation of “Fat (%)” and “Energy”. The more fat in the composition
the more energetic the product. The products that will have negative scores along PC2 have
a high fat content such as “Filet-O-Fish”.
The variable “Saturated fat (%)”, which is inside the 50% correlation circle, is not a
descriptive variable. Its variations are not structured, and it may be considered irrelevant
for this data set.
Also note that some variables are circled; they are the important variables as determined by
the uncertainty test.
PCR correlation loadings
Check the quality of the regression with 2 and 3 factors. As can be seen, the results for 3
factors are much better, both for the R-square and the RMSE.
The R-square in validation (red value) is 0.998, which is very good. The error in
cross-validation is about 0.08 on a scale of 6 to 13 kJ/g, which is rather small.
PCR predicted vs. reference
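The two quality numbers in the predicted vs. reference plot can be recomputed directly from the paired values (a sketch with made-up numbers on the same 6 to 13 kJ/g scale):

```python
import numpy as np

def r_square_and_rmse(y_ref, y_pred):
    """R-square and root mean squared error between reference and
    predicted values, the quantities shown in the predicted vs. reference plot."""
    resid = y_ref - y_pred
    ss_res = (resid ** 2).sum()
    ss_tot = ((y_ref - y_ref.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot, np.sqrt((resid ** 2).mean())

y_ref = np.array([6.2, 7.5, 8.8, 10.1, 11.4, 12.7])            # energy, kJ/g
y_pred = y_ref + np.array([0.05, -0.1, 0.08, -0.05, 0.1, -0.08])  # small errors
r2, rmse = r_square_and_rmse(y_ref, y_pred)
print(round(r2, 3), round(rmse, 3))
```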
For more information on the plots go to the Interpreting PCR plots section
Data structure
X - Cols: “Composition”
Y - Matrix: “mcdo”
Y - Rows: “Training”
Y - Cols: “Energy”
Keep the Identify outliers and Mean center data boxes ticked. Leave the Maximum
components at “4”.
Go to the next tab, X - Weights. Keep the default settings, which do not apply any weighting
to the variables, as the variables have the same range of variation (even the energy, which
is not in the same unit). Keep the default setting also for Y - Weights.
PLS model inputs - weights
In the validation tab, select a full cross-validation. To do so, select the radio button Cross
validation, then click on the Setup button and select the Full option in the drop-down
menu.
For PLS it is useful to enable the Uncertainty test by ticking the associated box. This test
will mark the important variables in the model in the loadings plot and the regression
coefficients plot. For the Number of factors to use, leave the default option, use optimal
number of factors.
PLS validation - cross-validation setup
There are several options in the algorithm tab. Look at the information in the Additional
information field. Select the NIPALS option, as it is the classical algorithm.
PLS algorithm
Look at the scores. Notice that the 2 sundae samples are clustered together. “Apple pie”
and “Pommes frites” are also very close, which means that these samples have similar
composition.
PLS scores
Factor 1 is describing the variation of the X-variable “Fat (%)” and Y-variable “Energy”. The
more fat in the composition the more energetic the product. The products that have positive
scores along factor 1 have a high fat content such as “Filet-O-Fish”, “Pommes Frites” and
“Apple Pie”.
Along factor 2, “Protein (%)” and “Carbohydrates (%)” are diametrically opposite. This
means that they are anti-correlated, varying in opposite directions. Samples with positive
scores on factor 2 are rich in protein, like all the burgers. Samples with negative scores on
factor 2 are rich in carbohydrates and low in protein, such as “Pommes Frites”.
The variable “Saturated fat (%)”, which is inside the 50% correlation circle, is not a
descriptive variable. Its variations are not structured, and it may be considered irrelevant
for this data set.
Also note that some variables are circled; they are the important variables.
PLS correlation loadings
Check the quality of the regression with the appropriate number of factors: 2.
The R-square in validation (red value) is 0.99, which is very good. The error in
cross-validation is about 0.16 on a scale of 6 to 13 kJ/g, which is rather small.
PLS predicted vs. reference
For more information on the plots go to the Interpreting PLS plots section
Data structure
The PLS model has been developed on the X-variables “Composition” and Y-variable
“Energy”. It needs 2 factors. Have a look at the PLS quick start tutorial for more information
on this model.
Go to Tasks - Predict - Regression.
In the dialog box Predict Using Regression Model, make the following selections:
Prediction inputs
The results can also be seen as a table, where it is even easier to compare the quality of the
prediction by looking at how close the predicted values are to the reference values. Don’t
forget to look at the results for 2 factors.
In the table, look at the values for “Grilled chicken”: the predicted value of 8.23 is quite
close to the reference value of 8.14.
Prediction results as table
For more information on the plots go to the Interpreting Prediction plots section
Data structure
For more information on the plots go to the Interpreting cluster analysis plots section
Data structure
Matrix: “Dye”
Rows: “samples”
Cols: “360-800nm”
Column ranges “Blue”, “Green” and “Orange” describe the composition, and “360-800nm”
describes the spectra; the row range “samples” is a set of 36 samples.
We will not use any initial guess, but if you wish to learn more, read the MCR dialogs section.
MCR model inputs
Look at the total residuals. The minimum is reached at 3 components, which means 3
components are needed.
Total residuals
Look at the spectra with 3 components displayed. The shape of the spectra looks good, as it
is very close to a signal shape. The 3 spectra have the same intensity, which is a good sign
for the results.
Spectra
Look at the concentrations. The concentrations almost sum up to 1, which is expected for a
mixture. The green component seems to always be in the highest concentration.
Concentrations
For more information on the plots go to the Interpreting MCR plots section
Data structure
Go to the next tab, Weights. Leave the weights equal to 1, as the data are spectral data.
LDA inputs - weights
Method: “Mahalanobis”
Prior probability: “Calculate prior probabilities from training set”
LDA options
Go to the Results folder in the project navigator to look at the confusion matrix. All the
samples are well classified.
Confusion matrix
For more information on the results go to the Interpreting LDA results section
Data structure
The LDA model has been developed on the “Training” set. For more information on the
model check the instruction of the LDA quick start.
Go to Tasks - Predict - Classification - LDA….
In the dialog that opens make the following selections:
Data structure
Go to the next tab, Weights. Leave the weights equal to 1, as the data are spectral data. In
the validation tab, enable cross-validation and select 3 for the number of segments.
SVM weights - validation
For more information on the results go to the Interpreting SVM results section
column range: “All variables” containing all the continuous variables and a category
variable “Type”;
row ranges: “AcDiSol”, “DiCaP”, “Kollidon”, “MCC” that group the samples by
category and a “Training” and a “Test” set of 4 samples.
Data structure
Matrix: “FiveRawMaterials”
Rows: “Test”
Cols: “All variables”
Data structure
Matrix: “mcdo”
Rows: “Training”
Cols: “Descriptors”
Keep the Identify outliers and Mean center data boxes ticked. Leave the Maximum
components set to “5”.
Go to the next tab, Weights. Keep the default settings, which do not apply any weighting to
the variables, as the variables have the same range of variation (even the energy, which is
not in the same unit).
PCA model inputs - weights
In the validation tab, select a full cross-validation. To do so, select the radio button Cross
validation, then click on the Setup button and select the Full option in the drop-down
menu.
PCA validation - cross-validation setup
There are two options in the algorithm tab. Select the SVD option as there are no missing
values.
PCA algorithm
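The SVD route to PCA, suitable when the data are complete, can be sketched in a few lines (an illustration only, not the software's implementation):

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA of a complete (no missing values) matrix: mean-center, then SVD."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    explained = s ** 2 / (s ** 2).sum()   # variance fraction per component
    return scores, loadings, explained[:n_components]

rng = np.random.default_rng(6)
X = rng.normal(size=(14, 6))              # e.g. 14 menu items x 6 nutrients
scores, loadings, explained = pca_svd(X, 2)
# Projecting the centered data through the loadings reproduces the scores:
assert np.allclose((X - X.mean(axis=0)) @ loadings, scores)
```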
Look at the scores plot. Notice that the 2 sundae samples are clustered together. “Apple
pie” and “Pommes frites” are also very close, which means that the samples have the same
type of composition.
PCA scores
Along PC1, “Protein (%)” and “Carbohydrates (%)” are diametrically opposite. This means
that they are anti-correlated, varying in opposite directions. Samples with positive scores
on PC1 are rich in carbohydrates and low in protein, such as “Pommes Frites”. Samples with
negative scores on PC1 are rich in protein, like all the burgers.
PC2 is describing the variation of “Fat (%)” and “Energy”. The more fat in the composition
the more energetic the product. The products that will have negative scores along PC2 have
a high fat content such as “Filet-O-Fish”.
The variable “Saturated fat (%)”, which is inside the 50% correlation circle, is not a
descriptive variable. Its variations are not structured, and it may be considered irrelevant
for this data set.
PCA correlation loadings
For more information on the plots go to the Interpreting PCA plots section
36. Data Integrity and Compliance
36.1. Data Integrity
This section covers how The Unscrambler® X can help an organization working in a regulated
environment, particularly those that must show compliance to the rules and regulations of
electronic records and signatures as outlined in 21 CFR Part 11.
The following sections cover aspects of data integrity and security, particularly related to
electronic and digital signatures, the compliance mode of the software and audit trails.
Compliance Statement
General Application
Digital Signatures
Reference
36.2.1 Introduction
This section provides CAMO Software’s position on helping an organization meet the
requirements of 21 CFR Part 11 (Electronic Signatures and Records). All necessary steps have
been followed to align with the requirements; however, it must be stated that certain
procedures, such as verification of a user’s identity with respect to their electronic
signatures with the FDA, and the development of internal SOPs are the sole responsibility of
the Organization implementing The Unscrambler® X. Also, regulations and enforcement
activities change over time and the implementations for meeting 21 CFR Part 11 are based
on current best practices and subject knowledge at the time of the present build of the
program.
36.2.2 Overview
The Unscrambler® X provides the necessary functions for an organization to meet the
requirements of 21 CFR Part 11, as defined in Subparts A, B and C. It is CAMO Software’s
belief that, with proper due diligence on the part of the organization using The Unscrambler®
X, compliance with the regulations can be achieved.
Most, if not all, of the industries concerned require strict traceability of actions applied to
documents and data, and in particular to data generated electronically. The 21 CFR Part 11
regulations were developed to provide a way for organizations to attach the same meaning
of hand written signatures to electronic documents. The term electronic signature is defined
in the regulation as,
A computer data compilation of any symbol or series of symbols executed, adopted or
authorised by an individual to be the legally binding equivalent of the individual’s
handwritten signature.
For clarity, a handwritten signature is defined as,
The scripted or legal mark of an individual used to authenticate a document in a permanent
form.
Therefore, an electronic signature must meet the following basic criteria,
The goal is to have a system that can replace traditional handwritten signatures by an
electronic means for authoring, reviewing and releasing data and information based on the
four criteria listed above. This is where the compliance mode in Unscrambler® X can help an
organization achieve these goals.
Logins
There are two ways to use compliance mode,
A login to the software can be enforced, which means that a user has to reenter their
electronic signature to access the program. This is useful if the program is installed on a
shared computer; in order to use it, the domain has to be set such that the authorised
user is the only one who can access the program.
The login can be hidden. In this case, it is the responsibility of the organization (and
the user) to ensure that the program is installed on a computer that can only be
accessed by the user assigned to that computer. In this case, when the program is
launched, it starts immediately and the Windows authentication details are used to
record actions in the Audit Trail.
Note: In compliance mode, the Help - User Setup function is deactivated. The only way to
access the program is via Windows authentication.
Audit Trails and Info boxes
In compliance mode, the Audit Trail is always enforced and cannot be deactivated in the
Tools - Options menu. In the Audit Trail itself, the Empty button is disabled and its contents
can either be printed, or saved as a non-editable PDF file.
The Info box will also display the mode of operation that the program is operating in. This
can be found by clicking on The Unscrambler® icon in the project navigator and viewing the
details.
CAMO Software cannot warrant the legal enforceability of the digital signature generated,
and evidentiary laws may vary by jurisdiction.
The Unscrambler® X implements a digital signature by first passing the document through a
hashing algorithm. This creates a digest file that is a unique document number for the
project. This digital signature is saved to the project and recorded in the Info box and the
Audit Trail against the user’s login credentials. When the project is sent (via electronic
media, email, etc.) to a colleague and they open it, The Unscrambler® X recomputes the
digital signature and compares it with the one saved to the project. If both signatures
match, the integrity of the data is assured. If not, the user is warned that the project has
been tampered with.
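The verify-on-open logic described above can be sketched as follows (a minimal illustration using SHA-256 via Python's hashlib; the actual hashing algorithm used by the software is not disclosed in this manual):

```python
import hashlib

def compute_digest(project_bytes: bytes) -> str:
    # Pass the document through a hashing algorithm to obtain a
    # digest that acts as a unique document number for the project.
    # SHA-256 is an assumed choice for illustration; the manual does
    # not name the algorithm the software uses.
    return hashlib.sha256(project_bytes).hexdigest()

# When the project is signed, the digest is saved with the project:
original = b"project contents ..."
saved_digest = compute_digest(original)

# When a colleague opens the project, the digest is recomputed and
# compared with the saved one:
received = b"project contents ..."
if compute_digest(received) == saved_digest:
    print("Integrity assured: signatures match")
else:
    print("Warning: the project has been tampered with")
```

Any change to the project bytes produces a different digest, which is why saved changes invalidate the signature.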
If the program has been installed in compliance mode, the digital signature uses the
user’s electronic signature details as the security certificate of the digital signature.
Once signed, the user is warned that any changes saved to the project will result in a
loss of the signature.
If a project has not been saved before signing, a warning is displayed and the user is
taken to the Save As dialog, where they can provide a name for the project before it is
saved.
Info box
The Info box records information on the current sign status of the project.
Audit Trail
The Audit Trail shows the current status of the digital signature.
Status bar
Digitally signed projects display the sign icon at the bottom of the viewer in the status bar.
36.5. References
Guidance for industry, Part 11, Electronic Records: Electronic Signatures - Scope and
Application (available on www.fda.gov).
McDowall, R. D. Electronic Signatures and Logical Security, LC-GC Europe, 13(5), 331-
339 (2000).
McDowall, R. D. Digital Signatures, LC-GC Europe, 14(1), (2001).
37. References
37.1. Reference documentation
Glossary of terms
Method references
Keyboard shortcuts
Upgrading documentation
ANOVA
See Analysis of Variance.
Axial design
One of the three types of mixture designs with a simplex-shaped experimental region. An
axial design consists of extreme vertices, the overall center, axial points, and end points. It
can only be used for linear modeling and is therefore not available for optimization purposes.
Axial point
In an axial design, an axial point is positioned on the axis of one of the mixture variables, and
must be above the overall center, opposite the end point.
B-Coefficient
See Regression Coefficient.
Bias
Systematic difference between predicted and measured values. The bias is computed as the
average value of the residuals.
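As a minimal illustration of this definition (the values below are hypothetical):

```python
# Bias: the average of the residuals (predicted minus measured);
# the values below are hypothetical.
predicted = [2.1, 3.9, 6.2, 8.0]
measured  = [2.0, 4.0, 6.0, 8.1]
residuals = [p - m for p, m in zip(predicted, measured)]
bias = sum(residuals) / len(residuals)
print(bias)   # ≈ 0.025: predictions are on average slightly too high
```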
BIF-PLS
See Bifocal PLS.
Bifocal PLS
A method similar to L-PLS.
Bilinear modeling
Bilinear modeling (BLM) is one of several possible approaches for data compression.
The bilinear modeling methods are designed for situations where collinearity exists among
the original variables. Common information in the original variables is used to build new
variables, that reflect the underlying (“latent”) structure. These variables are therefore
called latent variables. The latent variables are estimated as linear functions of both the
original variables and the observations, thereby the name bilinear.
PCA, PCR and PLS are bilinear methods.
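A minimal sketch of how one bilinear component can be estimated (a pure-Python NIPALS-style iteration, shown for illustration only; it is not the software's implementation):

```python
# One bilinear component extracted with a NIPALS-style iteration,
# in pure Python.  The scores t are linear functions of the original
# variables, and the loadings p are linear functions of the
# observations -- hence the name "bilinear".
def nipals_first_component(X, n_iter=100):
    # X: list of rows (samples x variables), assumed mean-centered
    n, m = len(X), len(X[0])
    t = [row[0] for row in X]          # initial scores: first column
    for _ in range(n_iter):
        tt = sum(x * x for x in t)
        # loadings: p = X't / (t't), then normalized to unit length
        p = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        norm = sum(x * x for x in p) ** 0.5
        p = [x / norm for x in p]
        # scores: t = Xp
        t = [sum(X[i][j] * p[j] for j in range(m)) for i in range(n)]
    return t, p

# Two perfectly collinear variables: one latent variable captures
# both, and X is reconstructed as the outer product of t and p.
X = [[-1.0, -2.0], [0.0, 0.0], [1.0, 2.0]]
t, p = nipals_first_component(X)
```

For collinear data like this, the single latent variable reproduces the whole table, which is exactly the data compression the entry describes.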
Box-Behnken design
A class of experimental designs for response surface modeling and optimization, based on
only 3 levels of each design variable. The mid-levels of some variables are combined with
extreme levels of others. The combinations of only extreme levels (i.e. cube samples of a
factorial design) are not included in the design.
Box-Behnken designs are always rotatable. On the other hand, they cannot be built as an
extension of an existing factorial design, so they are more often recommended when
changing the ranges of variation for some of the design variables after a screening stage, or
when it is necessary to avoid too extreme situations.
Calibration
Stage of data analysis where a model is fitted to the available data, so that it describes the
data as well as possible.
After calibration, the variation in the data can be expressed as the sum of a modeled part
(structure) and a residual part (noise).
Calibration samples
Samples on which the calibration is based. The variation observed in the variables measured
on the calibration samples provides the information that is used to build the model.
If the purpose of the calibration is to build a model that will later be applied on new samples
for prediction, it is important to collect calibration samples that span the variations expected
in the future prediction samples.
Category variable
A category variable is a class variable, i.e. each of its levels is a category (or class, or type),
without any possible quantitative equivalent.
Examples: type of catalyst, choice among several instruments, wheat variety, material
identification, etc.
Candidate point
In D-optimal design generation, a number of candidate points are first calculated. These
candidate points consist of extreme vertices and centroid points. A subset of the candidate
points is then selected D-optimally to create the set of design points.
Center sample
Sample for which the value of every design variable is set at its mid-level (halfway between
low and high).
Center samples have a double purpose: introducing one center sample in a screening design
enables curvature checking, and replicating the center sample provides a direct estimation
of the experimental error.
Real center samples can be included when all design variables are continuous.
For designs containing category variables, real center points do not exist; however, it is
possible to generate faced center points by taking the mid-range values for the continuous
variables and selecting a level for the category variables.
Center samples
See Center sample.
Centering
See Mean centering.
Central composite design
A class of experimental designs for response surface modeling and optimization, based on a
two-level factorial design on continuous design variables. Star samples and center samples
are added to the full factorial design to provide the intermediate levels necessary for fitting
a quadratic model.
Central composite designs have the advantage that they can be built as an extension of a
previous factorial design, if there is no reason to change the ranges of variation of the design
variables.
If the default star point distance to center is selected, these designs are rotatable.
Centroid design
See Simplex-centroid design.
Centroid point
A centroid point is calculated as the mean of the extreme vertices on the design region
surface associated with this centroid point. It is used in Simplex-centroid designs, axial
designs and D-optimal designs.
Classification
Data analysis method used for predicting class membership. Classification can be seen as a
predictive method where the response is a category variable. The purpose of the analysis is
to be able to predict which category a new sample belongs to. Classification methods
implemented in The Unscrambler® include SIMCA, SVM classification, LDA, and PLS-
discriminant analysis.
Classification can for instance be used to determine the geographical origin of a raw material
from the levels of various impurities, or to accept or reject a product depending on its
quality.
To run a SIMCA classification, one needs:
One or several PCA models (one for each class) based on the same variables;
Values of those variables collected on known or unknown samples.
Each new sample is projected onto each PCA model. According to the outcome of this
projection, the sample is either recognized as a member of the corresponding class, or
rejected.
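The accept/reject decision can be sketched for a hypothetical one-component class model (the loading vector and threshold below are invented for illustration):

```python
# Hypothetical one-component class model: a unit loading vector p and
# a residual threshold estimated from the calibration samples.  A new
# sample is projected onto the model; if its residual distance is
# below the threshold, it is recognized as a member of the class.
def residual_distance(x, p):
    t = sum(xi * pi for xi, pi in zip(x, p))         # projection score
    residual = [xi - t * pi for xi, pi in zip(x, p)]
    return sum(r * r for r in residual) ** 0.5

p = [0.6, 0.8]         # invented class loading (unit length)
threshold = 0.5        # invented critical residual distance

member  = [1.2, 1.6]   # lies along p: zero residual, accepted
outlier = [1.6, -1.2]  # orthogonal to p: large residual, rejected
print(residual_distance(member, p) <= threshold)    # → True
print(residual_distance(outlier, p) <= threshold)   # → False
```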
Closure
In MCR, the closure constraint forces the sum of the concentrations of all the mixture
components to be equal to a constant value (the total concentration) across all samples.
Clustering
Clustering is a classification method that does not require any prior knowledge about the
available samples. The basic principle consists in grouping together in a “cluster” several
samples which are sufficiently close to each other.
The clustering methods available in The Unscrambler® include the K-means algorithm; the
behavior of the algorithm may be tuned by choosing among various ways of computing the
distance between samples. Hierarchical clustering can also be run, as can clustering using
Ward’s method.
Coefficient of determination
See R-square.
Collinear
See Collinearity.
Collinearity
Linear relationship between variables. Two variables are collinear if the value of one variable
can be computed from the other, using a linear relation. Three or more variables are
collinear if one of them can be expressed as a linear function of the others.
Variables which are not collinear are said to be linearly independent. Collinearity - or near-
collinearity, i.e. very strong correlation - is the major cause of trouble for MLR models,
whereas projection methods like PCA, PCR and PLS handle collinearity well.
Component
Condition number
It is the square root of the ratio of the highest eigenvalue to the smallest eigenvalue of the
experimental matrix. The higher the condition number, the more spread out the region;
conversely, the lower the condition number, the more spherical the region. The ideal
condition number is 1; the closer to 1, the better.
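A small illustration of this definition for a two-variable design, using the closed-form eigenvalues of the 2x2 information matrix:

```python
import math

# Condition number of a 2-variable experimental matrix: the square
# root of the ratio of the largest to the smallest eigenvalue of X'X
# (the 2x2 eigenvalues come from the characteristic equation).
def condition_number_2var(X):
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    mean, delta = (a + d) / 2, math.sqrt(((a - d) / 2) ** 2 + b * b)
    return math.sqrt((mean + delta) / (mean - delta))

# An orthogonal two-level design gives the ideal value 1 (spherical
# region); a correlated design gives a much larger value (stretched).
orthogonal = [[-1, -1], [1, -1], [-1, 1], [1, 1]]
correlated = [[1, 1], [-1, -1], [1, 0.9], [-1, -0.9]]
print(condition_number_2var(orthogonal))   # → 1.0
print(condition_number_2var(correlated))   # much larger than 1
```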
Confounded
See Confounded effects.
Confounded effects
Two (or more) effects are said to be confounded when variation in the responses cannot be
traced back to the variation in the design variables to which those effects are associated.
Confounded effects can be separated by performing a few new experiments. This is useful
when some of the confounded effects have been found significant.
Confounding pattern
The confounding pattern of an experimental design is the list of the effects that can be
studied with this design, with confounded effects listed on the same line.
Confusion matrix
The confusion matrix is a matrix used for visualization for classification results from
supervised methods such as support vector machine classification or linear discriminant
analysis classification. It carries information about the predicted and actual classifications of
samples, with each row showing the instances in a predicted class, and each column
representing the instances in an actual class.
Constrained design
Experimental design involving multilinear constraints between some of the designed
variables. There are two types of constrained designed: classical mixture designs and D-
optimal designs.
Constrained experimental region
Experimental region which is not only delimited by the ranges of the designed variables, but
also by multilinear constraints existing between these variables. For classical mixture
designs, the constrained experimental region has the shape of a simplex.
Constraint
Curve Resolution:
A constraint is a restriction imposed on the solutions to the multivariate curve
resolution problem.
Many constraints take the form of a linear relationship between two or more variables:
a1X1 + a2X2 + … + anXn = c
or
a1X1 + a2X2 + … + anXn ≤ c
where the Xi are relevant variables (e.g. estimated concentrations), and each constraint
is specified by the set of constants a1, …, an and c.
Mixture Designs: See Multilinear constraint.
Continuous variable
Quantitative variable measured on a continuous scale.
Examples of continuous variables are:
Corner sample
See vertex sample.
Correlations
See Correlation.
Correlation
A unitless measure of the amount of linear relationship between two variables.
The correlation is computed as the covariance between the two variables divided by the
square root of the product of their variances. It varies from –1 to +1.
Positive correlation indicates a positive link between the two variables, i.e. when one
increases, the other has a tendency to increase too. The closer to +1, the stronger this link.
Negative correlation indicates a negative link between the two variables, i.e. when one
increases, the other has a tendency to decrease. The closer to –1, the stronger this link.
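The definition translates directly into code (a minimal sketch using sample variances and covariance):

```python
# Correlation as defined above: covariance divided by the square root
# of the product of the variances; always between -1 and +1.
def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov   = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - mx) ** 2 for a in x) / (n - 1)
    var_y = sum((b - my) ** 2 for b in y) / (n - 1)
    return cov / (var_x * var_y) ** 0.5

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # ≈ +1: positive link
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))   # ≈ -1: negative link
```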
Correlation loadings
Loadings plot marking the 50% and 100% explained variance limits. Correlation loadings are
helpful in revealing variable correlations.
Correlation Optimized Warping (COW)
COW is a method for aligning data where the signals exhibit shifts in their position along the
x-axis. This transform is often used for time-shifting chromatographic spectra.
C
A method used to check the significance of effects using a scale-independent distribution as
comparison. This method is useful when there are no residual degrees of freedom.
Covariance
A measure of the linear relationship between two variables.
The covariance is given on a scale which is a function of the scales of the two variables, and
may not be easy to interpret. Therefore, it is usually simpler to study the correlation instead.
Cross terms
See Interaction effects.
Cross validation
Validation method where some samples are kept out of the calibration and used for
prediction. This is repeated until all samples have been kept out once. Validation residual
variance can then be computed from the prediction residuals.
In segmented cross validation, the samples are divided into subgroups or “segments”. One
segment at a time is kept out of the calibration. There are as many calibration rounds as
segments, so that predictions can be made on all samples. A final calibration is then
performed with all samples.
In full cross validation, only one sample at a time is kept out of the calibration per iteration.
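Full cross validation can be sketched for a simple univariate least-squares fit (the data below are hypothetical):

```python
# Full cross validation of a univariate least-squares fit, in pure
# Python: each sample is kept out once, predicted from the remaining
# samples, and the prediction residuals give a validation residual
# variance.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx            # slope, intercept

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
press = 0.0                                  # prediction error sum of squares
for i in range(len(x)):                      # keep sample i out
    xs, ys = x[:i] + x[i+1:], y[:i] + y[i+1:]
    slope, intercept = fit_line(xs, ys)
    press += (y[i] - (intercept + slope * x[i])) ** 2
val_residual_variance = press / len(x)
```

Segmented cross validation works the same way, except that whole groups of samples are kept out per round instead of one sample at a time.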
Cube sample
Any sample which is a combination of high and low levels of the design variables, in
experimental plans based on two levels of each variable.
In Box-Behnken designs, all samples which are a combination of high or low levels of some
design variables, and center level of others, are also referred to as cube samples.
Cubic effects
See Cubic effect.
Cubic effect
When analyzing the results from designed experiments, cubic effects can be included in the
model to handle complex cases of nonlinear effects or multiple interactions between the X-
variables.
Also called third order effects, they comprise:
Curvature
Curvature means that the true relationship between response variations and predictor
variations is nonlinear.
In screening designs, curvature can be detected by introducing a center sample.
Data compression
Concentration of the information carried by several variables onto a few underlying
variables.
The basic idea behind data compression is that observed variables often contain common
information, and that this information can be expressed by a smaller number of variables
than originally observed.
Data mining
This is the practice of studying large amounts of data to find patterns or trends. MVA is a
form of data mining.
Detrending (DT)
A transformation which seeks to remove nonlinear trends in spectroscopic data. Like
Standard Normal Variate (SNV), it is applied to individual spectra. DT and SNV are often
used in combination to reduce multicollinearity, baseline shift and curvature in spectra.
Degree of fractionality
The degree of fractionality of a factorial design expresses how much the design has been
reduced compared to a full factorial design with the same number of variables. It can be
interpreted as the number of design variables that should be dropped to compute a full
factorial design with the same number of experiments.
Example: with 5 design variables, one can either build
Degrees of freedom
The number of degrees of freedom of a phenomenon is the number of independent ways
this phenomenon can be varied.
Degrees of freedom are used to compute variances and theoretical variable distributions.
For instance, an estimated variance is said to be “corrected for degrees of freedom” if it is
computed as the sum of square of deviations from the mean, divided by the number of
degrees of freedom of this sum.
Dendrogram
A dendrogram (from Greek dendron “tree”, -gramma “drawing”) is a tree diagram
frequently used to illustrate the arrangement of the clusters produced by hierarchical
clustering.
Design analysis
Calculation of the effects of design variables on the responses. It consists mainly of Analysis
of Variance (ANOVA), various significance tests, multiple comparisons, and response surface
generation whenever they apply.
Design variable
Experimental factor for which the variations are controlled in an experimental design.
Design variables
See Design Variable.
Distribution
Shape of the frequency diagram of a measured variable or calculated parameter. Observed
distributions can be represented by a histogram.
Some statistical parameters have a well-known theoretical distribution which can be used
for significance testing.
D-optimal design
Experimental design generated by a D-optimal algorithm. A D-optimal design takes into
account the multilinear relationships existing between design variables, and thus works with
constrained experimental regions. There are two types of D-optimal designs depending on
their initial points: D-optimal mixture designs which are based on subsimplexes and general
D-optimal designs which are based on subfactorial designs.
F-ratio
The F-ratio is the ratio between explained variance (associated to a given predictor) and
residual variance. It shows how large the effect of the predictor is, as compared with
random noise.
By comparing the F-ratio with its theoretical distribution (F-distribution), one obtains the
significance level (given by a p-value) of the effect.
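For example, with hypothetical ANOVA sums of squares:

```python
# F-ratio: mean square explained by a predictor divided by the
# residual mean square (the ANOVA entries below are hypothetical).
ss_effect, df_effect = 24.0, 1
ss_residual, df_residual = 12.0, 6
f_ratio = (ss_effect / df_effect) / (ss_residual / df_residual)
print(f_ratio)  # → 12.0: the effect is large compared with the noise
```

The p-value then follows from comparing this value with the F-distribution with (1, 6) degrees of freedom.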
Full factorial design
Experimental design where all levels of all design variables are combined.
Such designs are often used for extensive study of the effects of few variables, especially if
some variables have more than two levels. They are also appropriate in screening with
interaction designs, to study both main effects and interactions, especially if no Resolution V
design is available.
Gap
One of the parameters of the Gap-Segment and Norris Gap derivatives, the gap is the length
of the interval that separates the two segments that are being averaged.
See Segment for more information.
General D-optimal design
D-optimal design in which some of the process variables are multilinearly linked, or which
contains a mix of mixture and non-mixture variables.
Histogram
A plot showing the observed distribution of data points. The data range is divided into a
number of bins (i.e. intervals) and the number of data points that fall into each bin is
summed up.
The height of the bar in the histograms shows how many data points fall within the data
range of the bin.
Hotelling’s T² statistic
See Hotelling’s T² statistic.
Hotelling’s T² ellipse
This 95% confidence ellipse can be included in scores plots and reveals potential outliers,
lying outside the ellipse.
See Hotelling’s T² statistic for more information.
Hotelling’s T² statistics
A linear function of the leverage that can be compared to a critical limit according to an F-
test. This statistic is useful for the detection of outliers at the modeling or prediction stage.
See Hotelling’s T² Ellipse for more information.
Influence
A measure of how much impact a single data point (or a single variable) has on the model.
The influence depends on the leverage and the residuals.
Inlier
A prediction sample far away from the calibration samples in the regression model. Local
“holes” or areas with low density in terms of calibration samples can result in a situation
where some prediction samples are detected as inliers.
Inner relation
In PLS regression models, scores in X are used to predict the scores in Y, and from these
predictions the estimated Y-values are found. This connection between X and Y through their
scores is called the inner relation.
Interaction
See Interaction effects.
Interactions
See Interaction effects.
Interaction effects
There is an interaction between two design variables when the effect of the first variable
depends on the level of the other. This means that the combined effect of the two variables
is not equal to the sum of their main effects.
An interaction that increases the main effects is a synergy. If it goes in the opposite
direction, it can be called an antagonism.
Intercept
(Also called Offset). The point where a regression line crosses the ordinate (Y-axis).
Interior point
Point which is not located on the surface, but inside of the experimental region. For
example, an axial point is a particular kind of interior point. Interior points are used in
classical mixture designs.
K-means
An algorithm for data clustering. The samples will be grouped into K (user-determined
number) clusters based on a specific distance measurement, so that the sum of distances
between each sample and its cluster centroid is minimized.
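A minimal sketch of the algorithm on one-dimensional data (K = 2, hypothetical starting centroids; ties and empty clusters are ignored for simplicity):

```python
# K-means on one-dimensional data with K = 2: each sample is assigned
# to the nearest centroid, then centroids are recomputed as cluster
# means; empty clusters and ties are ignored for simplicity.
def kmeans_1d(samples, centroids, n_iter=20):
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for s in samples:
            distances = [abs(s - c) for c in centroids]
            clusters[distances.index(min(distances))].append(s)
        centroids = [sum(c) / len(c) for c in clusters if c]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(kmeans_1d(data, centroids=[0.0, 5.0]))   # ≈ [1.0, 9.0]
```

In the multivariate case the absolute difference is replaced by the chosen distance measure (e.g. Euclidean distance between sample vectors).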
Lack of fit
In Response Surface Analysis, the ANOVA table includes a special chapter which checks
whether the regression model describes the true shape of the response surface. Lack of fit
means that the true shape is likely to be different from the shape indicated by the model.
If there is a significant lack of fit, one can investigate the residuals and try a transformation.
Latent variable
A variable that is not directly observed but is rather inferred (through a mathematical
model) from other variables that are observed and directly measured. Principal components
(PCs) and PLS factors are examples of latent variables.
Lattice degree
The degree of a Simplex-lattice design corresponds to the maximal number of experimental
points -1 for a level 0 of one of the Mixture variables.
Lattice design
See Simplex-lattice design.
LDA
See Linear Discriminant Analysis.
Least squares criterion
Basis of classical regression methods, that consists in minimizing the sum of squares of the
residuals. It is equivalent to minimizing the average squared distance between the original
response values and the fitted values.
Leveled variable
A leveled variable is a variable which consists of discrete values instead of a range of
continuous values.
Examples are design variables and category variables.
Leveled variables can be used to separate a data table into different groups. This feature is
used by the Statistics task, and in sample plots from PCA, PCR, PLS, MLR, Prediction and
Classification results.
Leveled variables
See Leveled Variable.
Level
See Levels.
Levels
Possible values of a variable. A category variable has several levels, which are all possible
categories. A design variable has at least a low and a high level, which are the lower and
higher bounds of its range of variation. Sometimes, intermediate levels are also included in
the design.
Leverage
A measure of how extreme a data point or a variable is compared to the majority.
In PCA, PCR and PLS, leverage can be interpreted as the distance between a projected point
(or projected variable) and the model center. In MLR, it is the object distance to the model
center.
Average data points have a low leverage. Points or variables with a high leverage are likely to
have a high influence on the model.
Leverage correction
A quick method to simulate model validation without performing any actual predictions.
It is based on the assumption that samples with a higher leverage will be more difficult to
predict accurately than more central samples. Thus a validation residual variance is
computed from the calibration sample residuals, using a correction factor which increases
with the sample leverage.
Note! For MLR, leverage correction is strictly equivalent to full cross-validation. For
other methods, leverage correction should only be used as a quick-and-dirty
method for a first calibration, and a proper validation method should be employed
later on to estimate the optimal number of components correctly.
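One common formulation of the correction divides the calibration residual by (1 - leverage); the sketch below assumes this form, with hypothetical numbers:

```python
# Leverage correction (one common formulation, assumed here): the
# calibration residual is divided by (1 - h), where h is the sample's
# leverage, so high-leverage samples get inflated validation residuals.
def corrected_residual(residual, h):
    return residual / (1.0 - h)

print(corrected_residual(0.10, h=0.05))   # ≈ 0.105 (central sample)
print(corrected_residual(0.10, h=0.60))   # ≈ 0.25  (extreme sample)
```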
L-PLS
The three matrices X, Y and Z can together be visualized in the form of an L-shaped
arrangement. Such data analysis has potential widespread use in areas such as consumer
preference studies, medical diagnosis and spectroscopic applications.
Main effect
Average variation observed in a response when a design variable goes from its low to its high
level.
The main effect of a design variable can be interpreted as linear variation generated in the
response, when this design variable varies and the other design variables have their average
values.
Main effects
See Main Effect.
Martens’ Uncertainty Test
See Uncertainty test.
MCR
See Multivariate Curve Resolution.
Mean
Average value of a variable over a specific sample set. The mean is computed as the sum of
the variable values, divided by the number of samples.
The mean gives a value around which all values in the sample set are distributed. In Statistics
results, the mean can be displayed together with the standard deviation.
Mean centering
Subtracting the mean (average value) from a variable, for each data point.
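For example:

```python
# Mean centering: subtract the variable's mean from each data point,
# so the centered values vary around zero.
column = [2.0, 4.0, 9.0]
mean = sum(column) / len(column)
centered = [x - mean for x in column]
print(centered)   # → [-3.0, -1.0, 4.0]
```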
Median
The median of an observed distribution is the variable value that splits the distribution in its
middle: half the observations have a lower value than the median, and the other half have a
higher value. It can also be called 50% percentile.
Missing values
Whenever the value of a given variable for a given sample is unknown or not available, this
results in a hole in the data. Such holes are called missing values; in The Unscrambler®,
the corresponding cells of the data table are left empty.
In some cases, it is only natural to have missing values — for instance when the
concentration of a compound (Y) in a new sample is supposed to be predicted from its
spectrum (X).
Sometimes it is useful to reconstruct the missing values, for instance when applying a
data analysis method that does not handle missing values well, such as MLR, kernel-PLS or
wide-kernel. Missing values can be filled using the command Tasks - Transform - Missing
Values….
MixSum
Term used in The Unscrambler® for “mixture sum”. See Mixture sum.
Mixture components
Ingredients of a mixture.
There must be at least three components to define a mixture design. A single component
cannot be called a mixture.
Two components mixed together do not require a mixture design to be studied: study the
variation in quantity of one of them as a classical process variable.
Mixture constraint
Multilinear constraint between mixture variables. The general equation for the mixture
constraint is
X1 + X2 + … + Xn = S
where the Xi represent the ingredients of the mixture, and S is the total amount of mixture.
In most cases, S is equal to 100%.
Mixture design
Special type of experimental design, applying to the case of a mixture constraint. There are
three types of classical Mixture designs: Simplex-Lattice design, Simplex-Centroid design,
and Axial design. Mixture designs that do not have a simplex experimental region are
generated D-optimally; they are called D-optimal mixture designs.
Mixture region
Experimental region for a mixture design. The mixture region for a classical mixture design is
a simplex.
Mixture sum
Total proportion of a mixture which varies in a mixture design. Generally, the mixture sum is
equal to 100%. However, it can be lower than 100% if the quantity in one of the components
has a fixed value.
The mixture sum can also be expressed as fractions, with values varying from 0 to 1.
Mixture variables
See Mixture Variable.
Mixture variable
Experimental factor for which the variations are controlled in a mixture design or D-optimal
mixture design. Mixture variables are multilinearly linked by a special constraint called
mixture constraint.
There must be at least three mixture variables to define a mixture design. See Mixture
components.
MLR
See Multiple Linear Regression.
Model
Mathematical equation summarizing variations in a data set.
Models are built so that the structure of a data table can be understood better than by just
looking at all raw values.
Statistical models consist of a structure part and an error part. The structure part
(information) is intended to be used for interpretation or prediction, and the error part
(noise) should be as small as possible for the model to be reliable.
Model center
The model center is the origin around which variations in the data are modeled. It is the
(0,0) point on a scores plot.
If the variables have been centered, samples close to the average will lie close to the model
center.
Model check
In Response Surface Analysis, a section of the ANOVA table checks how useful the
interactions and squares are, compared with a purely linear model. This section is called
model check.
If one part of the model is not significant, it can be removed so that the remaining effects
are estimated with a better precision.
MVA
See Multivariate Analysis
Multiple comparison tests
Tests associating the levels of a category design variable with a response variable, to detect
differences in effects between different levels.
For continuous or binary design variables, if an effect is found to be significant by ANOVA,
the magnitude and direction of the effect can be interpreted directly from the effect value of
that variable. For multi-level category variables the ANOVA will test whether at least one
level is significantly different from the others, however there is no single effect value for
each category variable or level to interpret. A multiple comparison test is used to assess
which category levels are associated with the optimal response.
Interpretation of multiple comparisons in The Unscrambler® X is described in more detail in
the Design of Experiments section.
Multilinear constraints
See Multilinear constraint.
Multilinear constraint
This is a linear relationship between two or more variables. A constraint has the general
form
a1X1 + a2X2 + … + anXn = c
or
a1X1 + a2X2 + … + anXn ≤ c
where the Xi are designed variables (mixture or process), and each constraint is specified
by the set of constants a1, …, an and c.
A multilinear constraint cannot involve both Mixture and Process variables.
Multiple Linear Regression (MLR)
A method for relating the variations in a response variable (Y-variable) to the variations of
several predictors (X-variables), with explanatory or predictive purposes.
An important assumption for the method is that the X-variables are linearly independent, i.e.
that no linear relationship exists between the X-variables. When the X-variables carry
common information, problems can arise due to exact or approximate collinearity.
Normal probability plot
The observed values are used as abscissa, and the ordinate displays the corresponding
percentiles on a special scale. Thus if the values are approximately normally distributed
around zero, the points will appear close to a straight line going through (0,50%).
A normal probability plot can be used to check the normality of the residuals (they should be
normal; outliers will stick out), and to visually detect significant effects in screening designs
with few residual degrees of freedom.
Offset
See Intercept.
Optimization
Finding the settings of design variables that generate optimal response values.
Orthogonal
Two variables are said to be orthogonal if they are completely uncorrelated, i.e. their
correlation is 0.
In PCA and PCR, the principal components are orthogonal to each other.
Factorial designs, Plackett-Burman designs, Central Composite designs and Box-Behnken
designs are built in such a way that the studied effects are orthogonal to each other.
Orthogonal design
Designs built in such a way that the studied effects are orthogonal to each other, are called
orthogonal designs.
Examples: Factorial designs, Plackett-Burman designs, Central Composite designs and Box-
Behnken designs.
D-optimal designs and classical mixture designs are not orthogonal.
Outlier
An observation (outlying sample) or variable (outlying variable) which is abnormal compared
to the major part of the data.
Extreme points are not necessarily outliers; outliers are points that apparently do not belong
to the same population as the others, or that are badly described by a model.
Outliers should be investigated before they are removed from a model, as an apparent
outlier may be due to an error in the data.
Overfitting
For a model, overfitting is a tendency to describe too much of the variation in the data, so
that not only consistent structure is taken into account, but also some noise or
noninformative variation.
Overfitting should be avoided, since it usually results in a lower quality of prediction.
Validation is an efficient way to avoid model overfitting.
Partial Least Squares regression
See PLS regression.
Passified
See Downweight.
In previous versions of The Unscrambler®, the term passify was used when a variable was
weighted by multiplying by a very small number. The variable was said to be Passified,
meaning that it loses all influence on the model, but it is not removed from the analysis.
The term for this type of weighting has been changed to Downweight.
PCA
See Principal Component Analysis.
PCR
See Principal Component Regression.
PCs
See Principal Component.
Percentile
The X% percentile of an observed distribution is the variable value that splits the
observations into X% lower values, and 100-X% higher values.
Quartiles and median are percentiles. The percentiles are displayed using a box-plot.
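For illustration, percentiles can be computed with NumPy (hypothetical data; the interpolation convention may differ from The Unscrambler®'s):

```python
import numpy as np

values = np.array([2.0, 7.0, 4.0, 9.0, 1.0, 6.0, 3.0])

q1     = np.percentile(values, 25)   # lower quartile: 25% of values below
median = np.percentile(values, 50)   # the 50% percentile
q3     = np.percentile(values, 75)   # upper quartile: 75% of values below
```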
Plackett-Burman design
A very reduced experimental plan used for a first screening of many variables. It gives
information about the main effects of the design variables with the smallest possible
number of experiments.
No interactions can be studied with a Plackett-Burman design, and moreover, each main
effect is confounded with a combination of several interactions, so that these designs should
be used only as a first stage, to check whether there is any meaningful variation at all in the
investigated phenomena.
PLS
See PLS regression.
PLS Discriminant Analysis (PLS-DA)
Classification method based on modeling the differences between several classes with PLS.
If there are only two classes to separate, the PLS model uses one response variable, which
codes for class membership as follows: -1 for members of one class, +1 for members of the
other one.
If there are three classes or more, the PLS model uses one response variable per class (coded -1/+1 or 0/1, which are equivalent) for class membership.
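The class-membership coding can be sketched as follows (a hypothetical NumPy example using 0/1 coding, one response column per class):

```python
import numpy as np

# Hypothetical class labels for six samples, three classes.
labels = np.array(["A", "B", "C", "A", "C", "B"])
classes = sorted(set(labels))                      # ['A', 'B', 'C']

# 0/1 indicator matrix: one response variable (column) per class.
Y = (labels[:, None] == np.array(classes)[None, :]).astype(float)
# A PLS2 model of X against Y then serves as the classifier.
```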
PLS regression
A method for relating the variations in one or several response variables (Y-variables) to the
variations of several predictors (X-variables), with explanatory or predictive purposes.
This method performs particularly well when the various X-variables express common
information, i.e. when there is a large amount of correlation, or even collinearity.
Partial Least Squares Regression is a bilinear modeling method where information in the
original X-data is projected onto a small number of underlying (“latent”) variables called PLS
components. The Y-data are actively used in estimating the “latent” variables to ensure that
the first components are those that are most relevant for predicting the Y-variables.
Interpretation of the relationship between X-data and Y-data is then simplified, as this relationship is concentrated on the smallest possible number of components.
By plotting the first PLS components one can view main associations between X-variables
and Y-variables, and also interrelationships within X-data and within Y-data.
PLS1
Version of the PLS method with only one Y-variable.
PLS2
Version of the PLS method in which several Y-variables are modeled simultaneously, thus
taking advantage of possible correlations or collinearity between Y-variables.
PLS-DA
See PLS Discriminant Analysis.
Precision
The precision of an instrument or a measurement method is its ability to give consistent
results over repeated measurements performed on the same object. A precise method will
give several values that are very close to each other.
Precision can be measured by standard deviation over repeated measurements.
If precision is poor, it can be improved by systematically repeating the measurements over
each sample, and replacing the original values by their average for that sample.
Precision differs from accuracy, which has to do with how close the average measured value
is to the target value.
Prediction
Computing response values from predictor values, using a regression model.
To make predictions, two things are needed: a regression model (which provides the regression coefficients) and new X-values for the samples to be predicted. The new X-values are fed into the model equation, and predicted Y-values are computed.
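In equation form, prediction is simply the model equation applied to new data; a minimal sketch with hypothetical coefficients:

```python
import numpy as np

# Hypothetical regression model: y = b0 + b1*x1 + b2*x2.
b0 = 0.5
b = np.array([1.2, -0.3])

# New X-values for two samples to be predicted.
X_new = np.array([[2.0, 1.0],
                  [4.0, 3.0]])

y_pred = b0 + X_new @ b   # predicted Y-values
```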
Predictor
Variable used as input in a regression model. Predictors are usually denoted X-variables.
Predictors
See Predictor.
Principal component (PC)
Principal Components (PCs) are composite variables, i.e. linear functions of the original
variables, estimated to contain, in decreasing order, the main structured information in the
data. A PC is the same as a score vector, and is also called a latent variable or a factor.
Principal components are estimated in PCA and PCR. PLS components are also denoted PCs.
Principal Component Analysis (PCA)
PCA is a bilinear modeling method which gives an interpretable overview of the main
information in a multidimensional data table.
The information carried by the original variables is projected onto a smaller number of
underlying (“latent”) variables called principal components. The first principal component
covers as much of the variation in the data as possible. The second principal component is
orthogonal to the first and covers as much of the remaining variation as possible, and so on.
By plotting the principal components, one can view interrelationships between different
variables, and detect and interpret sample patterns, groupings, similarities or differences.
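A minimal sketch of the projection step, via SVD on mean-centered data (illustrative NumPy code on hypothetical data, not The Unscrambler®'s algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))      # hypothetical: 20 samples, 4 variables

Xc = X - X.mean(axis=0)           # mean-center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                    # sample projections (scores)
loadings = Vt.T                   # variable projections (loadings)

# Each successive component explains a decreasing share of the variance.
explained = s**2 / np.sum(s**2)
```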
Principal Component Regression (PCR)
PCR is a method for relating the variations in a response variable (Y-variable) to the
variations of several predictors (X-variables), with explanatory or predictive purposes.
This method performs particularly well when the various X-variables express common
information, i.e. when there is a large amount of correlation, or even collinearity.
Principal Component Regression is a two-step method. First, a Principal Component Analysis
is carried out on the X-variables. The principal components are then used as predictors in a
Multiple Linear Regression.
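The two steps can be sketched in NumPy as follows (hypothetical data; an illustration of the principle, not The Unscrambler®'s implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 6))                         # hypothetical predictors
y = X @ np.array([1.0, 0.5, 0, 0, 0, 0]) + 0.01 * rng.normal(size=30)

# Step 1: PCA on mean-centered X; keep the first A principal components.
A = 3
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = (U * s)[:, :A]                                   # score matrix

# Step 2: MLR of y on the A scores (with intercept).
Tb = np.column_stack([np.ones(len(T)), T])
b, *_ = np.linalg.lstsq(Tb, y, rcond=None)
y_fit = Tb @ b
```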
Process variable
Experimental factor for which the variations are controlled in an experimental design, and to
which the mixture variable definition does not apply.
Process variables
See Process variable.
Project samples
New samples can be projected onto an existing PCA model, thus creating the PCA equivalent
of prediction for a regression model. The projection of a new sample onto the PCA model is
a kind of “prediction” of that sample according to the PCA model.
Projection
Principle underlying bilinear modeling methods such as PCA, PCR and PLS.
In those methods, each sample can be considered as a point in a multidimensional space.
The model will be built as a series of components onto which the samples - and the variables
- can be projected. Sample projections are called scores, variable projections are called
loadings.
The model approximation of the data is equivalent to the orthogonal projection of the
samples onto the model. The residual variance of each sample is the squared distance to its
projection.
Proportional noise
Noise on a variable is said to be proportional when its size depends on the level of the data
value. The range of proportional noise is a percentage of the original data values.
Pure components
In MCR, an unknown mixture is resolved into n pure components. The number of
components and their concentrations and instrumental profiles are estimated in a way that
explains the structure of the observed data under the chosen model constraints.
p-value
The p-value measures the probability that a parameter estimated from experimental data
should be as large as it is, if the real (theoretical, non-observable) value of that parameter
were actually zero. Thus, p-value is used to assess the significance of observed effects or
variations: a small p-value means a small risk of mistakenly concluding that the observed
effect is real.
The usual limit used in the interpretation of a p-value is 0.05 (or 5%). If p-value < 0.05, the
observed effect can be presumed to be significant and is not due to random variations.
p-value is also called “significance level”.
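As a numerical illustration (a generic two-sided z-test with a hypothetical effect size, not any specific test in The Unscrambler®), a p-value can be obtained from the standard normal distribution:

```python
import math

# Hypothetical observed effect, 2.3 standard errors away from zero.
z = 2.3

# Two-sided p-value under the standard normal distribution.
p = math.erfc(abs(z) / math.sqrt(2.0))

significant = p < 0.05   # compare against the usual 5% limit
```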
Q-residual limits
The Q-residual limits for components 0-A are computed as a function of the remaining
eigenvalues A+1:Amax, where Amax is the maximum number of components that can be
calculated, limited by the number of samples or variables.
When PCA is computed by the SVD algorithm all eigenvalues are returned, and Q-residuals can be estimated. When the NIPALS algorithm is chosen, only a few components are normally estimated, thus Q-residual limits are not available.
Similarly for PLS regression, the Q-residual limits are correct only if the maximum number of
factors is computed, i.e. all the variance in X is modeled.
As the Q-residual limit is a function of the eigenvalue to the power of 3, one may get a
reasonable estimate if more than 95% of the X-variance is explained in the model although
the number of factors is less than the maximum.
Q-residuals
See Q-residual limits.
Quadratic model
Regression model including as X-variables the linear effects of each predictor, all two-
variable interactions, and the square effects.
With a quadratic model, the curvature of the response surface can be approximated in a
satisfactory way.
Quantile plot
The Quantile plot represents the distribution of a variable in terms of percentiles for a given
population. It shows the minimum, the 25% percentile (lower quartile), the median, the 75%
percentile (upper quartile) and the maximum.
Random effect
Effect of a variable for which the levels studied in an experimental design can be considered
to be a small selection of a larger (or infinite) number of possibilities.
Example: the effect of a raw material batch, when the batches used in the experiments are a random selection among all available batches.
Reference sample
The design file will contain only response values for the reference samples, whereas the
input part (the design part) is missing (m).
Reference samples
See Reference sample.
Regression coefficient
In a regression model equation, regression coefficients are the numerical coefficients that
express the link between variation in the predictors and variation in the response.
Regression coefficients
See Regression coefficient.
Regression
Generic name for all methods relating the variations in one or several response variables (Y-
variables) to the variations of several predictors (X-variables), with explanatory or predictive
purposes.
Regression can be used to describe and interpret the relationship between the X-variables
and the Y-variables, and to predict the Y-values of new samples from the values of the X-
variables.
Repeated measurement
Measurement performed several times on one single experiment or sample.
The purpose of repeated measurements is to estimate the measurement error, and to
improve the precision of an instrument or measurement method by averaging over several
measurements.
Repeated measurements
See Repeated measurement.
Replicate
Replicates are experiments that are carried out several times. The purpose of including
replicates in a data table is to estimate the experimental error.
Replicates should not be confused with repeated measurements, which give information
about measurement error. In cross validation, replicates should be excluded as a group.
Replicates
See Replicate.
Residual
A measure of the variation that is not taken into account by the model.
The residual for a given sample and a given variable is computed as the difference between
observed value and fitted (or projected, or predicted) value of the variable on the sample.
Residuals
See Residual.
Residual variance
The mean square of all residuals, sample- or variable-wise.
This is a measure of the error made when observed values are approximated by fitted
values, i.e. when a sample or a variable is replaced by its projection onto the model.
The complement to residual variance is explained variance.
Residual X-variance
See Residual variance.
Residual Y-variance
See Residual variance.
Resolution
Context: Experimental design
Information on the degree of confounding in fractional factorial designs.
Resolution is expressed as a Roman numeral. In a resolution III design, main effects are confounded with two-factor interactions. In a resolution IV design, main effects are clear of two-factor interactions, but two-factor interactions are confounded with each other. In a resolution V design, main effects and two-factor interactions are clear of each other.
Response variable
Observed or measured parameter which a regression model tries to predict.
Responses are usually denoted Y-variables.
Response variables
See Response variable.
Responses
See Response variable.
RMSEC
Root Mean Square Error of Calibration. A measurement of the average difference between
predicted and measured response values, at the calibration stage.
RMSEC can be interpreted as the average modeling error, expressed in the same units as the
original response values.
RMSED
Root Mean Square Error of Deviations. A measurement of the average difference between
the abscissa and ordinate values of data points in any 2-D scatter plot.
RMSEP
Root Mean Square Error of Prediction. A measurement of the average difference between
predicted and measured response values, at the prediction or validation stage.
RMSEP can be interpreted as the average prediction error, expressed in the same units as
the original response values.
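The common root-mean-square computation behind RMSEC, RMSED and RMSEP can be sketched as follows (hypothetical values):

```python
import numpy as np

# Hypothetical measured and predicted response values.
measured  = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.1, 1.9, 3.2, 3.8])

# Root mean square of the residuals, in the units of the response.
rmse = np.sqrt(np.mean((predicted - measured) ** 2))
```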
R-square
The R-square of a regression model is a measure of the quality of the model. Also known as the coefficient of determination, it is computed as 1 - (Residual Y-variance / Total Y-variance), or equivalently (Explained Y-variance in %)/100. For Calibration results, this is also the square of the correlation coefficient between predicted and measured values, and the R-square value is always between 0 and 1: the closer to 1, the better.
The R-square is displayed among the plot statistics of a Predicted vs. Reference plot. When
based on the calibration samples, it tells about the quality of the fit. When computed from
the validation samples (similar to the “adjusted R-square” found in the literature) it tells
about the predictive ability of the model.
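A minimal numeric sketch of the computation, on hypothetical values:

```python
import numpy as np

measured  = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
predicted = np.array([1.2, 1.8, 3.1, 4.2, 4.7])

ss_res = np.sum((measured - predicted) ** 2)        # residual sum of squares
ss_tot = np.sum((measured - measured.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot                          # coefficient of determination
```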
Sample
Object or individual on which data values are collected, and which builds up a row in a data
table.
In experimental design, each separate experiment is a sample.
Sample projection
See Project samples.
Scaling
See Weighting.
Scatter effects
In spectroscopy, scatter effects are effects that are caused by physical phenomena, like
particle size, rather than chemical properties. They interfere with the relationship between
chemical properties and shape of the spectrum. There can be additive and multiplicative
scatter effects.
Additive and multiplicative effects can be removed from the data by different methods.
Multiplicative Scatter Correction removes the effects by adjusting the spectra from ranges of
wavelengths supposed to carry no specific chemical information.
Scores
Scores are estimated in bilinear modeling methods where information carried by several
variables is concentrated onto a few underlying variables. Each sample has a score along
each model component.
The scores show the locations of the samples along each model component, and can be
used to detect sample patterns, groupings, similarities or differences.
Screening
First stage of an investigation, where information is sought about the effects of many
variables. Since many variables have to be investigated, only main effects, and optionally
interactions, can be studied at this stage.
There are specific experimental designs for screening, such as factorial or Plackett-Burman
designs.
Segment
One of the parameters of Gap-Segment derivatives and Moving Average smoothing, a
segment is an interval over which data values are averaged.
In smoothing, X-values are averaged over one segment symmetrically surrounding a data
point. The raw value on this point is replaced by the average over the segment, thus creating
a smoothing effect.
In Gap-Segment derivatives (designed by Karl Norris), X-values are averaged separately over
one segment on each side of the data point. The two segments are separated by a gap. The
raw value on this point is replaced by the difference of the two averages, thus creating an
estimate of the derivative on this point.
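The smoothing case can be sketched as follows (a toy example with hypothetical values; endpoint handling in The Unscrambler® may differ):

```python
import numpy as np

x = np.array([1.0, 4.0, 2.0, 8.0, 6.0, 3.0, 5.0])

# 3-point segment: each interior value is replaced by the average over
# the symmetric window around it (endpoints left unchanged here).
smoothed = x.copy()
for i in range(1, len(x) - 1):
    smoothed[i] = x[i - 1 : i + 2].mean()
```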
Sensitivity to pure components
In MCR computations, sensitivity to pure components is one of the parameters influencing
the convergence properties of the algorithm. It can be roughly interpreted as how
dominating the last estimated primary principal component is (the one that generates the
weakest structure in the data), compared to the first one.
The higher the sensitivity, the more pure components will be extracted.
SEP
See Standard Error of Performance.
Significance level
See p-value.
Significant
An observed effect (or variation) is declared significant if there is a small probability that it is
due to chance.
SIMCA
See SIMCA classification.
SIMCA classification
Classification method based on disjoint PCA modeling.
SIMCA focuses on modeling the similarities between members of the same class. A new
sample will be recognized as a member of a class if it is similar enough to the other
members; else it will be rejected.
Simplex
Specific shape of the experimental region for a classical mixture design. A Simplex has N
corners but N-1 independent variables in an N-dimensional space. This results from the fact
that whatever the proportions of the ingredients in the mixture, the total amount of mixture
has to remain the same: the Nth variable depends on the N-1 other ones. When mixing three
components, the resulting simplex is a triangle.
Simplex-Centroid design
One of the three types of mixture designs with a simplex-shaped experimental region. A
Simplex-centroid design consists of extreme vertices, center points of all “subsimplexes”,
and the overall center. A “subsimplex” is a simplex defined by a subset of the design
variables. Simplex-centroid designs are available for optimization purposes, but not for a
screening of variables.
Simplex-Lattice design
One of the three types of mixture designs with a simplex-shaped experimental region. A
Simplex-lattice design is a mixture variant of the full-factorial design. It is available for both
screening and optimization purposes, according to the degree of the design (see lattice
degree).
SVD
See Singular Value Decomposition
Singular Value Decomposition (SVD)
In linear algebra, the singular value decomposition (SVD) is an important
factorization of a rectangular real or complex matrix, with many applications
in signal processing and statistics. Applications which employ the SVD
include computing the pseudoinverse, least squares fitting of data, matrix
approximation, and determining the rank, range and null space of a matrix.
Source: Wikipedia
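A small NumPy illustration of the factorization and one of its uses (rank determination), on a hypothetical matrix:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The factorization reconstructs A, and the rank equals the
# number of nonzero singular values.
A_rec = U @ np.diag(s) @ Vt
rank = int(np.sum(s > 1e-10))
```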
SNV
See Standard Normal Variate.
Square effect
Average variation observed in a response when a design variable goes from its center level
to an extreme level (low or high).
The square effect of a design variable can be interpreted as the curvature observed in the
response surface, with respect to this particular design variable.
Square effects
See Square effect.
Standard deviation
SDev is a measure of a variable’s spread around its mean value, expressed in the same unit
as the original values.
Standard deviation is computed as the square root of the mean square of deviations from
the mean.
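As a numeric illustration of the definition above (hypothetical values, using the mean-square form stated here):

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Square root of the mean squared deviation from the mean.
sdev = np.sqrt(np.mean((values - values.mean()) ** 2))
```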
Student’s t-distribution
When the number of observations increases towards an infinite number, the Student t-
distribution becomes identical to the normal distribution.
A Student’s t-distribution can be described by two parameters: the mean value, which is the
center of the distribution, and the standard deviation, which is the spread of the individual
observations around the mean. Given those two parameters, the shape of the distribution
further depends on the number of degrees of freedom, usually n-1, if n is the number of
observations.
t-distribution
See Student’s t-distribution.
Test samples
Additional samples which are not used during the calibration stage, but only to validate an
already calibrated model.
The data for those samples consist of X-values (for PCA) or of both X- and Y-values (for
regression). The model is used to predict new values for those samples, and the predicted
values are then compared to the observed ones.
Test set validation
Validation method based on the use of different data sets for calibration and validation.
During the calibration stage, calibration samples are used. Then the calibrated model is used
on the test samples, and the validation residual variance is computed from their prediction
residuals.
Third order effects
See Cubic Effect.
Training samples
See Calibration samples.
T-scores
The scores found by PCA, PCR and PLS in the X-matrix.
See Scores for more details.
Tukey’s test
A multiple comparison test (see Multiple comparison tests for more details).
t-value
The t-value is computed as the ratio between the deviation from the mean accounted for by
a studied effect, and the standard error of the mean.
By comparing the t-value with its theoretical distribution (Student’s t-distribution), one
obtains the significance level of the studied effect.
Uncertainty limits
Limits produced by Uncertainty Testing, helping one assess the significance of the X-
variables in a regression model. Variables with uncertainty limits that do not cross the “0”
axis are significant.
Uncertainty test
Martens’ Uncertainty Test is a significance testing method implemented in The
Unscrambler® which assesses the stability of PCA or Regression results. Many plots and results are associated with the test, allowing the estimation of the model stability, the identification of perturbing samples or variables, and the selection of significant X-variables.
The test is performed with cross validation, and is based on the jack-knifing principle.
Underfit
A model that leaves aside some of the structured variation in the data is said to underfit.
Unimodality
In MCR, the Unimodality constraint allows the presence of only one maximum per profile.
Upper quartile
The upper quartile of an observed distribution is the variable value that splits the
observations into 75% lower values, and 25% higher values. It can also be called 75%
percentile.
U-scores
The scores found by PLS in the Y-matrix.
See Scores for more details.
Validation samples
See Test samples.
Validation
Validation means checking how well a model will perform for future samples taken from the
same population as the calibration samples. In regression, validation also allows for
estimation of the prediction error in future predictions.
The outcome of the validation stage is generally expressed by a validation variance. The
closer the validation variance is to the calibration variance, the more reliable the model
conclusions.
When explained validation variance stops increasing with additional model components, it
means that the noise level has been reached. Thus the validation variance is a good
diagnostic tool for determining the proper number of components in a model.
Validation variance can also be used as a way to determine how well a single variable is
taken into account in an analysis. A variable with a high explained validation variance is
reliably modeled and is probably quite precise; a variable with a low explained validation
variance is badly taken into account and is probably quite noisy.
Three validation methods are available in The Unscrambler®: test set validation, cross validation, and leverage correction.
Variable
Any measured or controlled parameter that has varying values over a given set of samples.
A variable determines a column in a data table.
Variances
See Variance.
Variance
A measure of a variable’s spread around its mean value, expressed in square units as
compared to the original values.
Variance is computed as the mean square of deviations from the mean. It is equal to the
square of the standard deviation.
Vertex sample
A vertex is a point where two lines meet to form an angle. Vertex samples are used in
Simplex-centroid, axial and D-optimal mixture/non-mixture designs.
Weighting
A technique to modify the relative influences of the variables on a model. This is achieved by
giving each variable a new weight, i.e. multiplying the original values by a constant which
differs between variables. This is also called scaling.
The most common weighting technique is standardization, where the weight is the inverse of the standard deviation (1/SDev) of the variable. Other weighting options in The Unscrambler® are constant, and downweighted.
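Standardization can be sketched in NumPy as follows (hypothetical data; an illustration of the weighting principle, not The Unscrambler®'s implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: two variables on very different scales.
X = rng.normal(loc=[0.0, 100.0], scale=[1.0, 50.0], size=(40, 2))

# Standardization: weight each variable by 1/SDev so that all
# variables end up with unit spread and comparable influence.
weights = 1.0 / X.std(axis=0)
Xw = X * weights
```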
Keyboard shortcuts
Save Ctrl+S
Print Ctrl+P
Close Ctrl+W
Exit Alt+F4
Cut Ctrl+X
Copy Ctrl+C
Paste Ctrl+V
Undo Ctrl+Z
Redo Ctrl+Y
Find/replace Ctrl+H
Go to Ctrl+G
Zoom in Ctrl+Up-arrow, +
Report Ctrl+R
Edit cell F2
Tasks menu
Tools menu (new)
Help
By keeping the original data intact, the user will never lose important information, and thus
complies with the guidances recommended by regulatory agencies for data integrity (e.g. US FDA 21 CFR Part 11 compliance in the pharmaceutical industry). The other major advantage
of having successive nodes in the project navigator is that each new transform node forms
the basis for new directions in data pretreatment. In short, The Unscrambler® X, through the
use of the project navigator, has greatly simplified data visualization and management.
Models developed using The Unscrambler® X are presented as nodes in the project
navigator with the original data, results, validation and plots all included as subnodes. These
subnodes are used to navigate around the model. The results in the subnodes can be used
for further investigation. This replaces the File - Import - Unscrambler Results option
available in previous releases of The Unscrambler® and has been developed to make the
task of result importation much simpler.
Projects are also saved as XML-based files. This means that in the future, projects will not be
legacy system dependent as they are based on a universally accepted standard format.
Improved security
The Unscrambler® X allows a user to sign into the program using Windows Domain
Authentication (as well as the usual password access).
The program can be set up to accept Windows user credentials as the login, or a predefined user name and password set up within The Unscrambler® X. This system of login is compliant with the requirements for electronic signatures and records required by the US FDA.
To further improve security, the Lock function in previous versions of The Unscrambler® is
now replaced by the Protect function. This is an internal, password based system for
protecting individual projects, data tables and models. Protected data can be unprotected
by reentering the password, when the unprotect option is chosen.
The Audit Trail system has been greatly improved and now follows the US FDA’s guidance on
time stamping of audit trails.
The improved security functions of The Unscrambler® X provide greater assurance to users
in all application areas.
The menu bar
The Menu bar in The Unscrambler® X has been optimized for better work flow. Notable
omissions when compared to previous versions include:
Modify
Results
Window
The Modify options are now shared over the Edit and Tasks menus. In particular, the options
now found in the Edit menu include
Also note that for the first release of The Unscrambler® X, 3-way data options are not
supported. These will be included in future developments.
The following functionality from the Modify menu can now be found in the Tasks menu:
The Tasks menu is optimized for work flow with the following options:
Transform
Analyze
Predict
The Results menu is now superseded due to the project navigator. All results are available
for a particular project in the project navigator, under the node particular to the analysis
performed.
The General View option in Results has been greatly simplified and is now part of the new
Insert menu as the Custom Layout option.
The Window menu is now obsolete. Results are displayed from the project navigator, and
stored within a project. The window functionality is now dispersed throughout the program
through various graphic and data table tab options.
The notable inclusions in the menu bar are:
Insert
Tools
the Matrix Calculator for performing basic matrix operations on data in the project
navigator
Report, allowing a user to develop custom reports, based on the output of a
developed model.
the Audit Trail
The Tools - Options menu options have been migrated from the File menu in previous versions. The Tools - Audit Trail menu supersedes the previous File - Properties - Log options.
Plotting
General plots
Plotting data in The Unscrambler® X is much easier and more powerful than in previous
versions. The Plot menu has been expanded to include additional features:
The ability to use the mouse scroll wheel to zoom in and out of plots;
The ability to left-click and drag a plot's position within the current viewer;
The ability to modify the plot region, headers, include legends and change the font
and size of axes. These are all available by choosing Properties from the Edit Menu
or by right-clicking on a plot and selecting Properties.
Three-dimensional (3-D) rotation of scatter and matrix plots can be performed using the
mouse in a continuous way.
A new plotting option in The Unscrambler® X is the Multiple Scatter Plot. This is a collection of 2-D scatter plots of the chosen variables, plotting each pair of variables against each other.
All plots have a much sharper appearance and are better suited for journal publications,
reports and presentations.
Results plots
The project navigator now contains a Plots subnode for each analysis procedure containing
plotted results. Simply highlight a plot pane in the viewer, and click on the desired plot from
the project navigator to display it. The plot is updated automatically, thus simplifying the
previous Plot menu routine.
All results plots have the ability to be modified using the Properties menu option when
right-clicking on a plot.
Importing data from previous versions of The Unscrambler
Data and models generated in previous versions of The Unscrambler® (back to version 9.2)
may be directly imported into The Unscrambler® X using the File - Import - Unscrambler
menu option. The Unscrambler® X imports data tables with formatting intact, i.e. column
and row sets defined by the previous Modify - Edit Set function are preserved and displayed
as subnodes in the project navigator.
The Unscrambler® X models still preserve their existing file format. Backwards compatibility of models is available by using the File - Export - Unscrambler option, making it possible to use models developed in The Unscrambler® X in previous versions of The Unscrambler® Online Predictor and The Unscrambler® Online Classifier.
Analysis methods
The Unscrambler® X comes with a number of new analyses for advanced MVA applications.
These are listed as follows:
Statistical Tests: Basic statistical hypothesis tests are included, providing a valuable
tool for thorough data analysis within The Unscrambler®.
Normality test, both univariate and multivariate
Tests for comparing means (t-tests)
Tests for comparing variances (F-, Levene’s and Bartlett’s tests)
A completely new and easy-to-use Design Wizard with the possibility to go back and forth when defining the design;
Suggestions for the best suited design and guidance for the user;
Inclusion of Scheffe polynomials for the analysis of mixture data;
More interactive results output and graphical options;
A DoE PLS option with some featured plots.
Import ASCII and Excel: Easy import using a dedicated import dialog box. The dialog
box allows the import of all, or only part, of the data, and allows easy assignment of
row and column headers.
netCDF import of chromatographic data
Support of the OPC protocol.
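The hypothesis tests listed above (normality, t-tests, and the F-, Levene's and Bartlett's variance tests) are standard statistical methods. The following is a minimal SciPy sketch of what each test computes, on hypothetical data; it is an illustration, not The Unscrambler's own implementation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(10.0, 1.0, size=50)   # hypothetical sample A
b = rng.normal(10.5, 1.5, size=50)   # hypothetical sample B

# Univariate normality test (Shapiro-Wilk)
_, p_norm = stats.shapiro(a)

# Comparing means: two-sample t-test (Welch's, unequal variances)
_, p_t = stats.ttest_ind(a, b, equal_var=False)

# Comparing variances: two-sided F-test computed from the variance ratio...
f = np.var(a, ddof=1) / np.var(b, ddof=1)
dfa, dfb = len(a) - 1, len(b) - 1
p_f = 2 * min(stats.f.sf(f, dfa, dfb), stats.f.cdf(f, dfa, dfb))

# ...and the Levene's and Bartlett's tests
_, p_levene = stats.levene(a, b)
_, p_bartlett = stats.bartlett(a, b)
```

Levene's test is less sensitive to departures from normality than Bartlett's test or the F-test, which is why the three are typically offered together.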
As additional formats are continually being added, refer to the chapter on File Import.
Improved dialog boxes
Edit - Define Range
Defining data ranges has been simplified and is more interactive.
Insert - Create Design
Adding a designed experiment is much more interactive and flexible than the
generation of designed experiments in previous versions.
Tasks - Transform menu
The dialog boxes allow a preview of the transformation on the data before it is
applied, providing invaluable visualization.
Tasks - Analyze
More tabs have been added to the dialog boxes, making the analyses more self-
contained.
Tasks menu
Improved workflow, with the Transform menu added to Tasks
Inclusion of OSC, Deresolve, Interactions and Squares, Weights, Compute_General,
Fill Missing and COW as registered pretreatments.
Help
The Help System has been completely updated to be more comprehensive and to reflect
current software operation. It has also been simplified: there is no longer context-sensitive
help for every user interface element, as in the 9.x series. Pressing F1 still brings up the
appropriate help page.
New methods
A completely new response surface plotting module with high-resolution, fast
graphics rendering and improved plotting controls for graphical optimization.
A new D-optimal design module with an option to augment the design with
space-filling points (more robust).
Re-introduction of PLS-DoE, and more design information displayed in ‘Tasks –
Analyze – Analyze Design Matrix’ to help you find the best method for your data.
Plotting
Plot settings in ‘Tools – Options – Viewer’ can be used to change the default
appearance of plots.
New plots and plot layouts for Residuals and Influence plots in PCA, PCR, PLSR and
Projection, including F-residuals with limits.
Point labeling using the value of any matching variable (Sample Grouping).
General
ASCII file import with default list separator based on system settings.
New Alarms tab in analysis dialogs of PCA, MLR, PCR and PLSR and right-click option
for setting alarm limits in the project navigator (these limits are applied for online
prediction using some of our prediction engines).
New dialog for assigning Scalar/Vector tags as well as units (‘Edit – Scalar and
Vector’ in editor mode or right-click option in project navigator). This information is
used for collecting data from various sources during online monitoring of processes.
General enhancements and bug fixes.
This document provides information about The Unscrambler® X version 10.2, which contains
several enhancements and new features for data import and export, analysis, graphics, and
Design of Experiments (DoE). These updates have been implemented following the release of
version 10.1. The Unscrambler® 10.2 is available in 32-bit and 64-bit versions.
37.8. Applicability
Corrections have been made to address several issues:
Overall performance of the program has been optimized, mainly based on the way
data is stored in memory during calculations.
More details of analysis methods and data have been added to info boxes.
The Find and Replace functionality has been optimized.
More time is allowed for renaming project navigator nodes.
The definition of Identity matrices in Insert - Data Matrix has been corrected to
produce only square matrices.
Median Absolute Deviation (MAD) scaling has now been moved to Tasks - Transform
- Centre and Scale as a scaling option in the dropdown list.
Compute_General has been optimized to handle case-sensitive entries.
Audit Trails now have a save option for printing and recording project details.
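The MAD scaling option mentioned above centres each variable by its median and scales it by the Median Absolute Deviation. A small illustrative sketch of the calculation (not The Unscrambler's internal code):

```python
import numpy as np

def mad_scale(x):
    """Centre by the median and scale by the Median Absolute Deviation (MAD).

    Illustrative sketch of MAD-based centring and scaling; this is not
    The Unscrambler's internal implementation.
    """
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))  # robust analogue of the standard deviation
    return (x - med) / mad

# The outlier (100) barely affects the median and the MAD, so the
# remaining points keep a sensible scale after the transform.
scaled = mad_scale([1.0, 2.0, 3.0, 4.0, 100.0])
```

This robustness to outliers is the usual reason for preferring MAD scaling over standard-deviation scaling.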
In Multiple Linear Regression, a rank dependency test has been added to better
handle singularities.
All analysis plots now have titles.
Known issues
The 9th design variable is by default called ‘J’, not the reserved letter ‘I’.
Some larger fractional factorial designs have been removed due to large memory usage.
An upper limit is imposed on the number of experimental runs for full factorial designs.
Display of B coefficients and effects plots/tables removed for designs with
categorical variables with 3 levels or more.
The DoE PLS option accessible from Tasks – Analyze – Analyze design matrix has
been disabled. For D-optimal designs, use Tasks – Analyze – Partial Least Squares
Regression instead. To analyze other designs using PLSR, change data type to
numeric first.
For models with category variables and center points included, the total degrees of
freedom differs from version 10.1.
User-Friendly Enhancements
The Define Range dialog has been completely overhauled for better ease of use and
functionality.
Edit - Convert allows the conversion of data collected in nanometers to be displayed
in reciprocal centimeters (and vice versa).
The Fill function is available as a right-click option in the data editor.
Legend and Display Points icons are now available in the toolbar.
Duplicate Matrix is now available as a right-click option in the project navigator.
Keep Outs handling in all dialog boxes has been optimized.
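The nanometre/wavenumber conversion offered by Edit - Convert is the simple reciprocal relation between wavelength and wavenumber (1 cm = 10^7 nm), sketched here for illustration:

```python
def nm_to_wavenumber(wavelength_nm):
    """Wavelength in nm -> wavenumber in reciprocal centimeters (cm^-1)."""
    return 1e7 / wavelength_nm

def wavenumber_to_nm(wavenumber_cm1):
    """Wavenumber in cm^-1 -> wavelength in nm (same reciprocal relation)."""
    return 1e7 / wavenumber_cm1

# e.g. 2500 nm corresponds to 4000 cm^-1, a conventional NIR range boundary
```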
When the Uncertainty test is applied to PLSR or PCR, the uncertainty limits are
provided for weighted coefficients only.
Models with block weights are not compatible with v10.1. When such models are used
for recalculation in 10.1, they will produce different weights. Workaround: reselect the
weights when recalculating.
If OSC was used as a transform in version 10.1, these values will not match those of
version 10.2.
p-values of jackknife matrices will not match those of v9.8; the p-value is set to 1 if
the variables are down-weighted.
Jackknife matrices will not match those of v9.8 and v10.1.
Correlation loadings will not match those of v9.8 and v10.1 if weights are set to zero.
This document provides information about The Unscrambler® X version 10.1, which contains
several enhancements and new features for data import and export, graphics, and Design of
Experiments (DoE). These updates have been implemented following the release of
version 10.0.1. The Unscrambler® 10.1 is available in 32-bit and 64-bit versions.
The import of Excel data files did not always import all the columns from the Excel
spreadsheet. This has been corrected.
ASCII files can be batch imported.
U5 data can be imported into The Unscrambler®.
37.15. Applicability
Corrections have been made to address several issues:
The axis labels in the influence plots in projection have been corrected to properly
reflect the information plotted.
Predictions made with models that include the MSC transform on part of the
columns did not give consistent results. This issue has been addressed.
Issues around the display of the correct sample names in the Coomans’ plot for
classified samples using SIMCA have been resolved.
The info box for a PLS model did not always correctly reflect the validation method
used. When full cross validation was used, the validation was displayed as having
been random with 20 segments.
The Compute General function now allows mathematical formulae with non-integer
values.
Category variables can be copied and pasted.
The x-axis values can now be scaled based on the variable values.
Compact, mini and micro models from previous versions of The Unscrambler® can
be imported.
For experiments that include category variables, center points are defined for each
level of the category variables.
Response surface plots have been improved.
The grid editor has been modified to give improved performance. Data that are
generated in version 10.1 cannot be opened in previous versions of The
Unscrambler® X.
Copy and paste and drag and drop have been implemented in the editor.
In defining ranges, one can now select the reverse of the selected rows (columns)
with a single click.
Prediction diagnostics per segment are available when a cross-validation other than
full is used in developing PLS or PCR models.
The quantile normalization function of median absolute deviation (MAD) has been
added as a new transform.
Q residual limits and Q residuals for samples are available within the results for
predictions.
The ability to save model files as smaller files for easier model file transportability
has been added.
A user can set the number of components to save in a model file.
Defined ranges (row and column) can be copied and pasted into a matrix of the
same dimensions.
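The Q residuals referred to above measure, per sample, the variation left outside the PCA model. A NumPy sketch of the computation on hypothetical data (control limits, such as the Jackson-Mudholkar approach cited in the bibliography, are omitted here):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 6))   # hypothetical data: 30 samples, 6 variables
Xc = X - X.mean(axis=0)        # column-centred data

# PCA via the thin SVD, keeping k components
k = 2
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:k].T                   # loadings (variables x components)
T = Xc @ P                     # scores
E = Xc - T @ P.T               # part of X not explained by the model
Q = np.sum(E**2, axis=1)       # Q residual (residual sum of squares) per sample
```

Samples with a Q residual above the limit fit poorly into the model subspace even if their scores look unremarkable, which is why Q residuals complement leverage/Hotelling's T² diagnostics.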
The properties of category variables, including their names and order, can be
changed.
In LDA, the ability to do an automatic PCA-LDA on sample sets with many variables
has been added under the options for LDA.
Plot legends are now presented according to the sample grouping used in a given
plot.
Users now have the ability to use sample grouping on 3-D scatter plots, and to more
readily change the properties of 3-D plots.
Sample grouping is now possible in all the relevant PLS results plots, including
the X-Y relation outliers, Y-residuals vs. predicted Y, and Y-residuals vs. scores.
With sample grouping, the groups in a plot can be separated by symbol, color,
or both.
Greater flexibility in changing plot and axis scales and labels.
Additional options for the plot types have been added.
Some issues still remain when calculating mixture designs with constraints, mainly
due to summing of mixture amounts.
37.21. Tutorials
Additional tutorials have been added to give users a quick start in using The Unscrambler® X.
The ability to mark evenly distributed samples in scores plots and other PLS results plots has
been added.
37.22. Applicability
Several improvements in the graphics have been made, including:
The OSC transformation has been modified, hence the old OSC model (from version
10.0) cannot be used as a registered pretreatment in prediction.
In The Unscrambler® 9.8 and previous versions, the definition of the design, response, and
uncontrollable variables was made in three different windows. This has been reduced to the
Define Variables table in the Design Experiment Wizard.
Method reference documentation is yet to be updated.
37.27. Installation
1) Run The Unscrambler® X setup application and follow the setup wizard. Double-click the
“TheUnscramblerX_Setup.msi” file to start the installation wizard. The InstallShield Wizard
for The Unscrambler® X is launched. Follow the on-screen instructions.
2) Finish the setup. When the setup is complete, click Close.
3) Start The Unscrambler® X from the Start menu.
4) The Activation Wizard dialog opens. Click the Obtain button.
5) After receiving The Unscrambler® X activation key, paste it in the Activate window and
click the Activate button.
OR
Send the “machine ID” from The Unscrambler® X Activation window along with your user
name and e-mail address to support@camo.com. The CAMO Support Team will send you The
Unscrambler® X activation key.
38. Bibliography
38.1. Bibliography
G.H. Golub, C.F. van Loan, Matrix Computations, 2nd ed., The Johns Hopkins University Press,
Baltimore, 1989.
C. R. Goodall, Computation Using the QR Decomposition in Handbook in Statistics Vol. 9,
Elsevier, Amsterdam, 1993.
H.H. Harman, Modern Factor Analysis, 3rd Edition, revised, University of Chicago Press,
1976.
A. Höskuldsson, PLS regression methods, J. Chemom., 2, 211–228 (1988).
H. Hotelling, Analysis of a complex of statistical variables into principal components, J. Educ.
Psych., 24, 417–441, 498–520 (1933).
J.E. Jackson, A User's Guide to Principal Components, Wiley & Sons Inc., New York, 1991.
J.E. Jackson and G.S. Mudholkar, Control procedures for residuals associated with principal
component analysis, Technometrics, 21, 341-349 (1979).
J.E. Jackson and G.S. Mudholkar, Control procedures for residuals associated with principal
component analysis, Addendum, Technometrics, 22, 136 (1980).
R.A. Johnson, D.W. Wichern, Applied Multivariate Statistical Analysis, Prentice-Hall, Upper
Saddle River, NJ, 1988.
H.F. Kaiser, The varimax criterion for analytic rotation in factor analysis, Psychometrika, 23,
187–200 (1958).
R. Kramer, Chemometric Techniques for Quantitative Analysis, Marcel Dekker, Inc., New
York, 1998.
F. Lindgren, P. Geladi, S. Wold, The kernel algorithm for PLS, J. Chemom., 7, 45–59 (1993).
R. Manne, Analysis of two partial least squares algorithms for multivariate calibration,
Chemom. Intell. Lab. Syst., 2, 187–197 (1987).
K.V. Mardia, J.T. Kent, J.M. Bibby, Multivariate Analysis, Academic Press Inc, London, 1979.
H. Martens, T. Næs, Multivariate Calibration, John Wiley & Sons Inc, Chichester, 1989.
W.L. Martinez and A.R. Martinez, Exploratory Data Analysis with MATLAB, Chapman and
Hall, London, 2005.
D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte, L. Kaufman, Chemometrics: A
Textbook, Elsevier Publ., Amsterdam, 1988.
D.C. Montgomery, E.A. Peck, and G.G. Vining, Introduction to Linear Regression Analysis,
Third Edition, Wiley-Interscience, New York, 2001.
T. Næs, T. Isaksson, T. Fearn and T. Davies, A User-friendly Guide to Multivariate Calibration
and Classification, NIR Publications, Chichester, 2002.
J.O. Neuhaus and C. Wrigley, The Quartimax Method: An analytic approach to orthogonal
simple structure, British J. Statistical Psychology, 7(2), 81–91 (1954).
S. Rannar, F. Lindgren, P. Geladi and S. Wold, A PLS kernel algorithm for data sets with many
variables and fewer objects, Part 1: Theory and Algorithm, J. Chemom., 8, 111–125 (1994).
D.R. Saunders, An analytic method for rotation to orthogonal simple structure, Princeton,
Educational Testing Service Research Bulletin, 53–10 (1953).
S. Weisberg, Applied Linear Regression Second Edition, Wiley, New York, 1985.
S. Wold, Cross-validatory estimation of the number of components in factor and principal
components models, Technometrics, 20(4), 397–405 (1978).
S. Wold, K. Esbensen, P. Geladi, Principal component analysis — A tutorial, Chemom. Intell.
Lab. Syst., 2, 37–52 (1987).
S. Wold, Pattern recognition by means of disjoint principal components models, Pattern
Recognition, 8, 127–139 (1976).
JDSU Application
spectroscopy for the detection of meat and bone meal (MBM) in compound feeds, J.
Chemom., 18, 341–349 (2004).
Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, A Practical Guide to Support Vector
Classification, last updated: May 19, 2009, accessed August 27, 2009.
http://www.csie.ntu.edu.tw/~cjlin
C. Medina-Gutiérrez, J. Luis Quintanar, C. Frausto-Reyes, R. Sato-Berrú, The application of
NIR Raman spectroscopy in the assessment of serum thyroid-stimulating hormone in rats,
Spectrochimica Acta Part A, 61 (1–2), 87–91 (2005).
T. Næs, T. Isaksson, T. Fearn and T. Davies, A User-friendly Guide to Multivariate Calibration
and Classification, NIR Publications, Chichester, UK, 2002.
Burling-Claridge, S.E. Holroyd and R.M.W. Sumner (Eds), New Zealand NIRS Society Inc.,
Hamilton, 2007.
G. Tomasi, F.v.d. Berg, C. Andersson, Correlation optimized warping and dynamic time
warping as preprocessing methods for chromatographic data, J. Chemom., 18,
231–241 (2004).
F. Westad, H. Martens, Shift and intensity modelling in spectroscopy - general concept and
applications, Chemom. Intel. Lab. Syst., 45, 361–370 (1999).