User Manual
Version 1.0
CAMO SOFTWARE AS
Nedre Vollgate 8, N-0158, Oslo, NORWAY
Tel: (47) 223 963 00
Fax: (47) 223 963 22
E-mail : info@camo.com | www.camo.com
The Unscrambler X v10.3
Copyright
All intellectual property rights in this work belong to CAMO Software AS. The information contained in this work
must not be reproduced or distributed to others in any form or by any means, electronic or mechanical, for any
purpose, without the express written permission of CAMO Software AS. This document is provided on the
understanding that its use will be confined to the officers of the organization (whose name is stated on the front
cover of this document) who acquired it and that no part of its contents will be disclosed to third parties without
prior written consent of CAMO Software AS.
Copyright © 2014 CAMO Software AS. All Rights Reserved
All other trademarks and copyrights mentioned in the document are acknowledged and belong to their respective
owners.
Disclaimer
This document has been reviewed and quality assured for accuracy of content. Succeeding versions of this
document are subject to change without notice and will reflect changes made to subsequent software version.
It is the sole responsibility of the organization using this document to ensure all tests meet the criteria specified in
the test scripts. CAMO Software takes no responsibility for the end use of the product as this requires the
performance of suitable feasibility trials and performance qualification to ensure the software is fit for purpose for
its intended use.
Table of Contents
1. Welcome to The Unscrambler® X ................................................................................. 1
2. Support Resources........................................................................................................ 3
3. Overview ...................................................................................................................... 5
7. Plots.......................................................................................................................... 231
1. Welcome to The Unscrambler® X
The Unscrambler® is a complete multivariate data analysis and experimental design software
solution, equipped with powerful methods including PCA, PLS, clustering and classification.
See the release notes for a list of fixes, new features and known limitations.
2. Support Resources
2.1. Support resources on our website
Our web site is filled with resources, case studies, recorded webinars as well as information
about our products and commercial offerings, including courses and professional services.
Support
Webinars
Training courses
Consulting
3. Overview
3.1. What is The Unscrambler® X?
A brief review of the tasks that can be carried out using The Unscrambler® X.
Set up experiments, analyze effects and find optima using the Design of Experiments
(DoE) module;
Reformat and preprocess data to enhance future analyses;
Find relevant variation in one data matrix (X);
Find relationships between two data matrices (X and Y);
Validate multivariate models with Uncertainty Testing;
Resolve unknown mixtures by finding the number of pure components and
estimating their concentration profiles and spectra;
Predict the unknown values of a response variable;
Classify unknown samples into various possible categories.
One should always remember, however, that there is no point in trying to analyze data if
they do not contain any meaningful information. Experimental design is a valuable tool for
building data tables which give such meaningful information. The Unscrambler® can help to
do this in an elegant way.
The Unscrambler® satisfies the US FDA’s requirements for 21 CFR Part 11 compliance.
The Unscrambler X Main
The purpose of experimental design is to generate experimental data that enable one to
determine which design variables (X) have an influence on the response variables (Y), in
order to understand the interactions between the design variables and thus determine the
optimum conditions. Of course, it is equally important to do this with a minimum number of
experiments to reduce costs. An experimental design program should offer appropriate
design methods and encourage good experimental practice, i.e. allow one to perform few
but useful experiments which span the important variations.
Screening designs (e.g. fractional, full factorial and Plackett-Burman) are used to find out
which design variables have an effect on the responses and are suitable for collection of data
spanning all important variations.
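As an illustration, a two-level full factorial design simply enumerates every combination of the coded variable levels. The sketch below is a hypothetical helper (`full_factorial` is not an Unscrambler® function) showing the idea for a 2^3 design:

```python
from itertools import product

def full_factorial(levels_per_var):
    """Generate all runs of a full factorial design.

    levels_per_var: dict mapping variable name -> list of levels.
    Returns a list of dicts, one per experimental run.
    """
    names = list(levels_per_var)
    runs = []
    for combo in product(*(levels_per_var[n] for n in names)):
        runs.append(dict(zip(names, combo)))
    return runs

# A 2^3 design: three design variables at two coded levels (-1, +1)
design = full_factorial({"Temp": [-1, 1], "Time": [-1, 1], "pH": [-1, 1]})
print(len(design))  # 8 runs
```

Fractional factorial and Plackett-Burman designs reduce this run count further by sacrificing some interaction information.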
Optimization designs (e.g. central composite, Box-Behnken) aim to find the optimum
conditions for a process and generate nonlinear (quadratic) models. They generate data
tables that describe relationships in more detail, and are usually used to refine a model, i.e.
after the initial screening has been performed.
Whether the purpose of designed experiments is screening or optimization, there may be
multilinear constraints among some of the design variables. In such a case a D-optimal
design may be required.
Another special case is that of mixture designs, where the main design variables are the
components of a mixture. The Unscrambler® provides the classical types of mixture designs,
with or without additional constraints.
There are several methods for analysis of experimental designs. The Unscrambler® uses Multiple Linear Regression (MLR) as its default method for orthogonal designs. For non-orthogonal designs, or when the levels of a design cannot be reached, The Unscrambler® allows the use of other methods, such as PCR or PLS, for this purpose.
The Unscrambler® finds this information by decomposing the data matrix into a structured
part and a noise part, using a technique called Principal Component Analysis (PCA).
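The decomposition into a structured part and a noise part can be illustrated with a few lines of NumPy. This is a sketch of the idea behind PCA (via the singular value decomposition), not The Unscrambler®'s actual implementation, and the data are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 20 samples, 5 variables, with 2 underlying components plus small noise
X = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(20, 5))

Xc = X - X.mean(axis=0)            # mean-center the data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                              # number of principal components kept
T = U[:, :k] * s[:k]               # scores
P = Vt[:k]                         # loadings
E = Xc - T @ P                     # residual (noise) part

# Centered X = structured part (T P) + noise part (E)
print(np.abs(E).max())             # small: almost all variation lies in 2 PCs
```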
Related topics:
User interface basics
Principles of regression
Principles of classification
Purposes of classification
Classification methods
SIMCA classification
Linear Discriminant Analysis
Support Vector Machines classification
PLS Discriminant Analysis
Steps in SIMCA classification
Classifying new samples
Outcomes of a classification
Classification based on a regression model
Cluster analysis
Projection
Once the PLS model has been checked and validated (see the chapter about multivariate
regression for more details on diagnosing and validating a model), one can run a Prediction
in order to classify new samples. The prediction results are interpreted by viewing the plot
Predicted with Deviations for each class indicator Y-variable:
Samples with Ypred > 0.5 and a deviation that does not cross the 0.5 line are
predicted members;
Samples with Ypred < 0.5 and a deviation that does not cross the 0.5 line are
predicted nonmembers;
Samples with a deviation that crosses the 0.5 line cannot be safely classified.
See Chapter Prediction for more details on how to run a prediction and interpret results. A
tutorial explaining PLS-DA in practice is also available: PLS Discriminant Analysis.
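The decision rule above can be sketched as a small helper function. This is illustrative only; `classify_plsda`, its arguments and the returned labels are hypothetical and not part of The Unscrambler®:

```python
def classify_plsda(y_pred, deviation, threshold=0.5):
    """Apply the Predicted-with-Deviations decision rule for one
    class-indicator Y-variable (illustrative sketch)."""
    lo, hi = y_pred - deviation, y_pred + deviation
    if lo > threshold:
        return "member"           # interval entirely above the 0.5 line
    if hi < threshold:
        return "nonmember"        # interval entirely below the 0.5 line
    return "unclassified"         # interval crosses the 0.5 line

print(classify_plsda(0.9, 0.1))   # member
print(classify_plsda(0.1, 0.2))   # nonmember
print(classify_plsda(0.55, 0.2))  # unclassified
```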
The last case is not necessarily a problem. It may be a quite interpretable outcome,
especially in a one-class problem. A typical example is product quality prediction, which can
be done by modeling the single class of acceptable products. If a new sample belongs to the
modeled class, it is accepted; otherwise, it is rejected.
Use Find in page to search for a phrase within the current page.
What is regression?
General notation and definitions
The whys and hows of regression modeling
What is a good regression model?
Regression methods in The Unscrambler®
Multiple Linear Regression (MLR)
Principal Component Regression (PCR)
Partial Least Squares Regression (PLSR)
L-PLS Regression
Support Vector Machine Regression (SVMR)
Calibration, validation and related samples
Main results of regression
Making the right choice with regression methods
How to interpret regression results
How to detect nonlinearities (lack of fit)
What are outliers and how are they detected?
Guidelines for calibration of spectroscopic data
The simplest regression model relates a single predictor x to the response y by a straight line,

y = b0 + b1x + e

where b0 is an intercept term and b1 is a regression coefficient; in this case, the slope of the straight line.

Multivariate regression takes several predictor variables into account, thus modeling the property of interest with more accuracy. The form of the model is

y = b0 + b1x1 + b2x2 + … + bpxp + e

where the terms in the equation are defined as above, with one regression coefficient per predictor. This chapter focuses on the general principles of multivariate regression.
The whys and hows of regression modeling
Building a regression model involves collecting the predictors and the corresponding
response values for a set of samples, and then finding the optimal parameters in a
predefined mathematical relationship to the collected data. A commonly used measure of
optimality is the minimization of the sum of squares of the deviations between the
measured and predicted responses.
For example, in analytical chemistry, spectroscopic measurements are made on solutions
with known concentrations of a component of interest. Regression is then used to relate the
concentration of the component of interest to the spectrum.
Once a regression model has been built, it can be used to predict the unknown
concentration for new samples, using the spectroscopic measurements as predictors. The
advantage is obvious if the concentration is difficult or expensive to measure directly.
Replacement with the spectroscopic method is less expensive and in some cases, requires
minimal to no sample preparation. It also allows for development of spectroscopic
measurements for real-time process monitoring.
The most common motivations for developing regression models as predictive tools may
include:
Replacement of expensive or time-consuming analysis methods, with cheap, rapid,
easy-to-perform measurements (e.g. NIR spectroscopy, mass spectrometry for gas
analysis).
Building a response surface model from the results of an experimental design, i.e. describing precisely the response levels according to the values of a few controlled factors.
What is a good regression model?
The purpose of a regression model is to extract all the information relevant for the
prediction of the response from the available data.
Unfortunately, observed data usually contains some amount of noise and in some cases,
irrelevant information.
Noise can be random variation in the response due to experimental error, or it can be
random variation in the data values due to measurement error. It may also be some amount
of response variation due to factors which are not included in the model.
Irrelevant information is carried by predictors which have little or nothing to do with the
modeled phenomenon. For instance, NIR absorbance spectra may carry some information
relative to the solvent and not only to the compound of interest in developing a model to
predict the concentration of the compound in solution.
A good regression model should be able to:
Model only relevant information, by giving high weight to relevant sources of information and downweighting any irrelevant variation.
Avoid overfitting, i.e. distinguish between variation in the response that can be explained by variation in the predictors and variation caused by mere noise.
Regression methods in The Unscrambler®
The Unscrambler® provides five regression method choices: MLR, PCR, PLSR, L-PLS and SVMR.
Multiple Linear Regression (MLR) fits the model directly by least squares, which involves a matrix inversion. This inversion can be numerically unstable when there is collinearity, that is, when the variables are not linearly independent. Incidentally, this is the
reason why the predictors are called independent variables in MLR; the ability to vary
independently of each other is a crucial requirement to variables used as predictors with this
method. MLR requires more samples than predictors since the system with more variables
than samples would not have a unique solution.
The Unscrambler® uses QR decomposition to find the MLR solution. No missing values are accepted.
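The QR route can be sketched in a few lines of NumPy. This is an illustrative sketch, not The Unscrambler®'s actual implementation, and the data and true coefficients are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))               # 30 samples, 3 predictors (more samples than predictors)
X1 = np.column_stack([np.ones(30), X])     # add an intercept column
b_true = np.array([2.0, 1.0, -0.5, 0.3])
y = X1 @ b_true + 0.01 * rng.normal(size=30)

# Solve the least-squares problem via QR instead of inverting X'X:
# X1 = Q R  =>  b = R^{-1} Q' y   (numerically more stable under near-collinearity)
Q, R = np.linalg.qr(X1)
b = np.linalg.solve(R, Q.T @ y)
print(np.round(b, 2))   # close to b_true
```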
More details about MLR regression can be found in the section Multiple Linear Regression
(MLR)
More about PCR can be found in the help section Principal Component Regression (PCR)
More information about the PCR algorithm can be found in Method References.
More about PLS regression can be found in the help section Partial Least Squares Regression
(PLSR)
More details regarding the PLSR algorithm are given in the Method References.
More about L-PLS regression can be found in the help section L-PLS Regression
More details regarding the L-PLSR algorithm are given in the Method References.
More about SVMR can be found in the help section Support Vector Machine Regression
(SVMR)
More details regarding the SVMR algorithm are given in the Method References.
Validation
Checking whether the model is capable of performing its task on a separate test set
of data.
Calibration is the fitting stage in the regression modeling process. The main data set,
containing only the calibration sample set, is used to compute the model parameters (PCs,
regression coefficients).
It is essential to validate models to get an idea of how well a regression model will perform
when it is used to predict new, unknown samples. A test set consisting of samples with
known response values is used. Only the X-values are fed into the model, from which
response values are predicted and compared to the known, actual response values. The
model is validated if the prediction residuals are low and there is no evidence of lack of fit in
the model.
Each of the two steps described above requires its own set of samples; thus, the following terms are used interchangeably: calibration samples = training samples, and validation samples = test samples.
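The calibration/test-set workflow can be sketched as follows. This is an illustrative NumPy sketch of the principle (fitting on calibration samples, predicting the test set from X-values only, and summarizing the prediction residuals), not The Unscrambler®'s implementation; the data are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + 0.05 * rng.normal(size=40)

# Calibration (training) samples fit the model; test samples validate it
Xcal, ycal, Xtest, ytest = X[:30], y[:30], X[30:], y[30:]
b, *_ = np.linalg.lstsq(Xcal, ycal, rcond=None)  # fit on calibration set only

y_pred = Xtest @ b                               # only X-values of the test set are fed in
rmsep = np.sqrt(np.mean((ytest - y_pred) ** 2))  # root mean square error of prediction
print(round(rmsep, 3))                           # small => low prediction residuals
```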
A more detailed description of validation techniques and their interpretation is to be found
in the chapter Validate a Model.
Result            Use   MLR   PCR   PLSR
B-coefficients    I     X     X     X
Residuals         D     X     X     X
Error measures    D     X     X     X
ANOVA             D     X

(I = interpretation, D = diagnostic)
In short, all three regression methods give a model with an equation expressed by the
regression coefficients (b-coefficients), from which predicted Y-values are computed. For all
methods, residuals can be computed as the difference between predicted (fitted) values and
actual (observed) values; these residuals can then be combined into error measures that tell
how well a model performs.
PCR and PLSR, in addition to those standard results, provide powerful interpretation and
diagnostic tools linked to projection: more elaborate error measures, as well as scores and
loadings.
The simplicity of MLR, on the other hand, allows for simple significance testing of the model
with ANOVA and of the b-coefficients with a Student’s t-test (ANOVA will not be presented
hereafter; read more about it in the ANOVA section from Chapter “Analyze Results from
Designed Experiments”.) However, significance testing is also possible in PCR and PLSR, using
Martens’ Uncertainty Test.
B-coefficients
The regression model can be written
meaning that the observed response values (Y) are approximated by a linear combination of
the values of the predictors (X). The coefficients of that combination are called regression
coefficients or B-coefficients.
Several diagnostic statistics are associated with the regression coefficients (available only for
MLR):
Standard error is a measure of the precision of the estimation of a coefficient;
From that, a Student's t-value can be computed;
Comparing the t-value to a reference t-distribution will then yield a significance level or p-value, which indicates whether a regression coefficient is significantly different from 0. If the t-value is found to be nonsignificant, the regression coefficient cannot be distinguished from 0.
Predicted Y-values
Predicted Y-values are computed for each sample by applying the model equation (i.e. the B-
coefficients) to new (or existing) observed X-values.
For PCR or PLSR models, the predicted Y-values can also be computed using projection along
the successive components of the model. This has the advantage of diagnosing samples
which are badly represented by the model, and therefore have high prediction uncertainty.
This is discussed more fully in the chapter Predictions.
Residuals
For each sample, the residual is the difference between the observed Y-value and the
predicted Y-value. It appears as the term e in the model equation.
More generally, residuals may also be computed for each fitting operation in a projection
model: thus the samples have X- and Y-residuals along each PC (factor) in PCR and PLSR
models. Read more about how sample and variable residuals are computed in the chapter
More Details About the Theory of PCA.
Selecting a specific kernel function that is capable of mapping the variable space.
Fine tuning the parameters of the chosen function such that the best calibration and
prediction statistics are achieved.
SVMR provides the least graphical output and diagnostic statistics of all the regression methods implemented in The Unscrambler®, which can make it difficult for the user to develop robust models. However, when they work, SVMR models are much better able to handle nonlinearities than MLR/PCR/PLSR models and can provide an alternative to Artificial Neural Networks (ANN).
Measurement error
Wrong labeling
For projection methods like PCA, PCR and PLSR, outliers can be detected using scores plots,
residuals, leverages and influence plots.
Outliers in regression
In regression, there are many ways for a sample to be classified as an outlier. It may be
outlying according to the X-variables only, or to the Y-variables only, or to both. It may also
not be an outlier for either separate set of variables, but become an outlier when one
considers the (X,Y) relationship. In the latter case, the X-Y Relation Outliers plot (only
available for PLSR) is a very powerful tool showing the (X,Y) relationship and how well the
data points fit into it.
Use of residuals to detect outliers
One can use the residuals in several ways. For instance, first use the residual variance per sample plot to detect samples with a large squared residual, then use a variable residual plot for those samples. The first of the two plots is used to indicate samples with outlying variables, while the latter is used for a detailed study of each of these samples. In both cases, points located far from the zero line indicate outlying samples or variables.
Use of leverages to detect outliers
The leverages are usually plotted vs. sample number. Samples showing a much larger
leverage than the rest of the samples may be outliers and may have had a strong influence
on the model, which should be avoided.
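Leverage has a simple algebraic definition: it is the diagonal of the "hat" matrix H = X (X'X)^-1 X'. The NumPy sketch below (illustrative, not The Unscrambler®'s implementation, with made-up data) shows how one extreme sample stands out:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 3))
X[0] = [10.0, 10.0, 10.0]            # one sample far from all the others

# Leverage: diagonal of the hat matrix H = X (X'X)^-1 X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

print(int(np.argmax(leverage)))      # index of the sample with the largest leverage
```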
For calibration samples, it is also natural to use an influence plot. This is a plot of squared
residuals (either X or Y) vs. leverages. Samples with both large residuals and large leverage
can then be detected. These are the samples with the strongest influence on the model, and
may disturb (influence) the model towards themselves.
The features of the two plots can be combined by plotting influence and Y-residuals vs. predicted Y together. Some example plots are shown below:
Scores plot showing a gross outlier
All of these plots can be helpful in detecting outliers, or possible errors in the data.
Note: It is advisable to aim for a boxcar distribution of Y-values, as this provides the
most even coverage of the region of interest.
Preprocess (transform the data)
Tasks - Transform… allows for spectroscopic transformations, derivatives, smoothing, etc. Tasks - Transform - Reduce (Average) may also be useful when replicates have been measured, or variable reduction is required. The Preview Result option in the transform dialog provides a graphical preview of spectral data as transform parameters are changed; these changes are presented to the user in real time.
Statistics
Tasks - Analyze - Descriptive Statistics… may be used to reveal scatter effects and to visually detect large changes in specific wavelength regions. Use the Scatter option to reveal potential scatter effects before the application of transforms such as Multiplicative Scatter Correction (MSC).
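The idea behind MSC can be sketched in a few lines. This is an illustrative NumPy sketch using the mean spectrum as the reference (not The Unscrambler®'s implementation; `msc` is a hypothetical helper and the "spectra" are synthetic):

```python
import numpy as np

def msc(spectra):
    """Multiplicative Scatter Correction (illustrative sketch).

    Each spectrum is regressed against the mean spectrum,
    x_i ~ a_i + b_i * x_mean, and corrected as (x_i - a_i) / b_i.
    """
    spectra = np.asarray(spectra, float)
    ref = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        b, a = np.polyfit(ref, x, 1)     # slope and offset vs. the mean spectrum
        corrected[i] = (x - a) / b
    return corrected

# Two "spectra" differing only by offset and scaling collapse onto one curve
base = np.sin(np.linspace(0, 3, 50))
S = np.vstack([2.0 * base + 1.0, 0.5 * base - 0.3])
print(np.allclose(msc(S)[0], msc(S)[1]))  # True
```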
Select samples
The Edit - Mark option is useful for selecting a more balanced data set from a large
data set from PCA, PCR or PLSR scores. This can be applied to either the spectra or
the constituents (if more than one component is being analyzed). Mark samples that
span all the important components (samples far away from the origin, including the
extremes when selecting calibration samples). Use the Create Range option to
extract marked samples as a new row set in the project navigator.
Reduce spectra
Use the Tasks - Transform - Reduce (Average)… options to reduce spectra with high data point densities (being careful not to lose resolution) to fewer data points, or to average out replicate spectra in a data set.
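Both reductions amount to block averaging, which can be sketched with NumPy reshapes (an illustrative sketch with made-up data, not The Unscrambler®'s implementation):

```python
import numpy as np

rng = np.random.default_rng(5)
spectra = rng.normal(size=(12, 600))   # 12 spectra = 4 samples x 3 replicates, 600 points

# Average out replicates: 3 consecutive rows per sample -> 4 averaged spectra
averaged = spectra.reshape(4, 3, 600).mean(axis=1)

# Reduce data points: average each block of 4 adjacent points -> 150 points
reduced = averaged.reshape(4, 150, 4).mean(axis=2)
print(averaged.shape, reduced.shape)   # (4, 600) (4, 150)
```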
investigating interesting patterns in the data. View the loadings as line plots and see
if the variables of importance coincide with the spectral regions related to the
property being measured.
Delete variables (wavelengths).
From the Important variables plot the Edit - Mark option can be used to define
ranges in the spectra that are not important (potentially due to noise). Use the
Recalculate - Without Marked option to generate a new model based on fewer
wavelengths. Apply the Uncertainty test during PLS regression to aid in the
identification of important variables for modeling.
Validation
It is essential to ensure that a developed model is properly validated using a suitable
validation method (cross validation or test set validation). Cross validation can be set
up to look at the effect of removing an entire set of replicates from an analysis or
single replicates can be removed to test the predictive ability of the model for single
replicates.
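Cross validation with whole replicate sets removed can be sketched as follows. This is an illustrative Python sketch of the segmentation logic only (`leave_replicate_set_out` is a hypothetical helper, not an Unscrambler® function):

```python
import numpy as np

def leave_replicate_set_out(groups):
    """Yield (train_idx, test_idx) pairs, removing one entire set of
    replicates (one group) per cross-validation segment."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        test = np.where(groups == g)[0]
        train = np.where(groups != g)[0]
        yield train, test

# 6 measurements = 3 samples x 2 replicates
groups = [0, 0, 1, 1, 2, 2]
for train, test in leave_replicate_set_out(groups):
    print(list(test))   # each segment holds out both replicates of one sample
```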
Instrument compatibility
Some instrument vendors (for example Perten, Brimrose, Guided Wave, Foss NIRSystems,
Thermo, etc.) make use of The Unscrambler® Online Predictor/ Classifier software available
for integration of The Unscrambler® models into third party systems. These packages are
DLL-based programs that are incorporated into the instrument software, allowing the use of
The Unscrambler® predictive or classification models on the data, providing the model
results to the instrument interface for either graphical or numerical display when a new
(spectral) measurement is made. Visit http://www.camo.com/ for more information on
these applications.
The Unscrambler® X uses the Save Model option to save predictive, or classification models
as separate files from a project. The Unscrambler® Generation X family of online software
uses these model files directly for applications. The Unscrambler® X is backward compatible
for use in previous versions of The Unscrambler® Online Predictor and Classifier (back to
version 9.2). Use the File - Export - Unscrambler option to export model files for use in these previous versions. This option allows users to save data or models for backward compatibility. Contact CAMO for this plug-in option.
Some instrument software can read the B vector (regression coefficients). Use File - Export -
ASCII…, or JCAMP-DX. Use File - Export - ASCII MOD… , which is a simple file format
containing all information necessary to make predictions, either using full PLSR or PCR
models, or just the B vector. It can be used with user-defined conversion routines.
Use The Unscrambler® to develop models for instruments that do not support The
Unscrambler® Online Predictor/Classifier
If an instrument vendor's software does not support models developed in The Unscrambler®, import the instrument data in a common format (e.g. ASCII, Excel, JCAMP) and develop a model using the powerful diagnostic and algorithmic capabilities. Use
this model to select appropriate calibration and validation samples, determine the
optimal PCs/factors to use and match the preprocessing to the options available in
the vendor software. Redevelop the model in the vendors’ software and compare
the two results. This will provide added assurance that the developed model is
robust and performs as required.
The various residuals and error measures are available for each PC in PCR and PLSR, while for MLR there is only one of each type.
There are two types of scores and loadings in PLSR, only one in PCR.
4. Application Framework
4.1. User interface basics
The purpose of this chapter is to give the user an overall introduction to the principles used
in The Unscrambler®. A short overview of The Unscrambler® user interface and workplace is
provided in this section, covering the various menu options, and the data organization
environment:
Menu walk-through:
File
Edit
View
Insert
Plot
Tasks
Tools
Help
Edit
Insert
Data matrix…
Duplicate matrix…
Custom layout…
Tools
Matrix calculator…
Report generator…
Audit trail…
Options…
Help
Modify license…
User setup…
Application window
Workspace
Editor
Viewer
Project navigator
Project information
Page tab bar
The menu bar
The toolbar
The status bar
Dialogs
Setting up the user environment
Getting help
4.2.2 Workspace
The Workspace occupies the largest area of the application window, containing either a
table view of a data set, called the Editor, or a Viewer which displays results either
graphically as plots or numerically as tables.
Editor
The Editor presents a data table that may or may not be modified depending on its
protection status:
If a table can be edited, it is possible to:
Type in values.
Change the column and row headers.
Create ranges.
Plotting raw data from the editor: either for a data matrix or a matrix from a result.
Displaying predefined plots.
Custom layout.
To learn more about working in this mode, please refer to the chapter on plotting data.
options which are valid for the selected area, which will save a user the work of having to
click through all the menus on the Menu bar.
4.2.9 Dialogs
The Unscrambler® aims to aid the user through dialogs that provide detailed instructions to
the application.
When working in The Unscrambler® the user will often have to enter information or make
choices in order to be able to complete an analysis. This includes activities such as specifying
the names of data matrices/files to work with, the data sets to analyze, how many PCs to
compute, or the type of validation methods to choose. This is done in dialogs, which will
normally look something like the one pictured below.
The Unscrambler® dialog
This particular dialog is the one associated with running a Principal Component Analysis on
data. Items that are predefined, such as rows/samples, columns/variables, etc. are selected
from a drop-down list. Options which are mutually exclusive are selected via radio buttons.
The settings for many of the analysis dialogs will be remembered from the last time the
dialog was open.
Any dialog can also be canceled by pressing the Esc (escape) key on the keyboard. Ongoing calculations can also be aborted by pressing Esc.
What is a matrix?
Matrix structure
Samples and variables
Adding data matrices
Manually
Drag and drop from other applications
Altering data tables
Using ranges
Create ranges to organize subsets
Superimposed ranges
Storing data as separate matrices
Data types
Possible data types
Converting data types
Keeping versions of data
Saving data
See insert matrix dialog box for more information on how to create a blank table, fill it with
data and rename it.
Manually
Enter data manually into a matrix by simply typing while an entry is focused, double clicking
on a specific entry, or pressing F2 and entering the value. This operation can be done for the
data table as well as the sample and variable name.
Category entries have a drop-down list, allowing the user to select one of the levels already used. A value can also be typed in, and typing a new value adds a new level.
Date-time entries have a calendar pop-out, allowing the user to pick a date from it.
Drag and drop from other applications
Data can be copied from any application, e.g. Microsoft Excel, to The Unscrambler® by either
drag and drop, or by copy and paste.
Files can also be dragged from the file manager onto The Unscrambler® application window.
The window title bar is a good drop target.
The above case is typical of creating two sets of variables, X (predictors) and Y (responses), and two sets of samples, for calibration and validation.
Storing data as separate matrices
In The Unscrambler® one can use different matrices in the analysis as long as they are
compatible in size and stored in the same project.
Hence one can store data in several matrices that will appear in the project navigator as
illustrated below:
Data type   Alignment
Numerical   Right
Date-time   Left
The file names are given in glob notation: "*" means any number of characters, "?" any single character, and "[ABC]" any one of A, B or C.
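Python's standard fnmatch module implements the same glob conventions, which makes the rules easy to test:

```python
from fnmatch import fnmatch

# "*" matches any number of characters, "?" one character, "[ABC]" any of A, B or C
print(fnmatch("data01.csv", "data??.csv"))   # True
print(fnmatch("runA.txt", "run[ABC].txt"))   # True
print(fnmatch("runD.txt", "run[ABC].txt"))   # False
```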
Matrices
Plots
Results: Each analysis will create a new node containing model or prediction details
Data set
Plot
Rename
Rename the node
Delete
Delete the node. This operation cannot be undone, so use with caution. This action
has to be confirmed in a pop-up dialog in order for the node to be deleted.
Actions for data table nodes
Data table node menu
Transform
Shortcut all the pretreatment available in the Tasks – Transform menu.
Plot
Shortcut to all the plots available in the Plot menu.
Export
Export the data using one of the supported external data formats.
Range
The Range option allows the following actions to be performed:
Define Range allows the definition of row and column ranges and special intervals in a data set. For more information see the Define Range dialog.
Copy Range copies the selected ranges (rows or columns) to another matrix of the same dimensions.
Paste Range pastes copied ranges into the same or another matrix of the same dimensions.
Duplicate Matrix
This will create a new copy of the data matrix in the project
navigator. It is a shortcut to the Insert - Duplicate Matrix
(Insert – Duplicate Matrix…) option.
Spectra
Define a selected columnset to hold spectral data, in order to change the default
view of certain model result plots (e.g. PLS regression coefficients plotted as line in
Regression Overview, or X-loadings plotted as line in PCA Overview).
Save Matrix
Recalculate
Rebuild the model with the following changes
In the dialog, one has the option to save several different types of model files. These smaller model files do not support the plots, and do not include the raw data and some of the validation matrices that are present in the entire model. The prediction (or classification) results that can be computed depend on the type of model that is saved.
Entire model
This saves all the results and supports all visualizations that are available when a
model is developed in The Unscrambler® X. This option also permits recalculation of
the model by keeping out any selected data. This option is available for MLR, PLS,
PCR and PCA models.
Prediction
The prediction result options save the model in smaller files; the model result file
does not include many of the result matrices, such as the validation results and
other matrices used in the prediction visualizations.
Full with support for inlier detection: The model result file does not include the
following matrices: Y scores, Beta coefficients (weighted), Variable leverage, X
Correlation loading, Y correlation loading, Square sums, and Rotation. Three of the
validation matrices are saved in this model format: X total residuals, X value
validation residuals, and Y value validation residuals. This model can be used for
prediction, giving all the results that The Unscrambler® computes on prediction,
including the deviation.
Full: This model results file allows one to predict new values, and get the deviation
with that value, as well as to detect outliers (based on Hotelling’s T2 and Q
residuals). With this model, inliers cannot be computed during the prediction stage.
The Hotelling’s T2 and Q residual limits and X values are computed, but not plotted
during prediction with the Full model. Compared with the entire model, this version
saves 11 of the 20 validation matrices. It does not compute the Inlier limit and the
Sample inlier distance, nor the seven matrices that are saved with the Full (with
inlier detection) prediction result.
Short: In the short model, only the raw beta coefficients are saved, at the optimal
(or user-defined) number of components. No validation matrices are saved. With a
short prediction model, one can get the predicted results for new data, but no other results.
4.7.1 Prediction:
This will be enabled only for regression techniques (MLR, PCR and PLSR). Low and high limits
can be set for the Deviation and Scores matrices, and likewise for each of the Y responses. Only high
limits can be set for Hotelling’s T², Sample Leverage, X Sample Q-Residuals and Validation
Residuals. For Explained X Sample Validation Variance, low limits can be set.
Set Alarm States for output matrix of Prediction
4.7.2 Classification:
Only high limits can be set for X Residuals, Si/S0 and Leverage matrices that will be used for
classifying new samples for models developed from PCA, PCR and PLSR.
Set Alarm States for output matrix of Classification
4.7.3 Projection:
Scores matrix provides the option to set low and high limits. For Hotelling’s T², Sample
Leverage and X Sample Q-Residuals matrices only high limits can be set. For Explained X
Sample Validation Variance, low limits can be set. Projection for new samples is available
only for models developed from PCA, PCR and PLSR.
Set Alarm States for output matrix of Projection
4.7.4 Input:
This feature helps the user understand whether the inputs are from one or different sources.
If the user has already defined the columnset matrices using the Scalar and Vector dialog, those will
be listed for selection. Alternatively, the Define button opens the Scalar and Vector
dialog for defining limits for columnset matrices.
Set Alarms for input matrix
improve. Despite the risks, bias and slope correction has been proven useful in some
industries such as the agricultural sector.
4.9.1 Algorithm
Bias and slope correction is performed on the prediction data Yhat by subtracting the bias and
then dividing by the slope: Yhat_corrected = (Yhat – bias)/slope
The bias and slope estimates in the above equation can be taken directly from a test set
validated Predicted vs. Reference plot, or they can be input manually by the user. Default values
when not explicitly specified are bias=0 and slope=1.
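A minimal sketch of this correction in Python (the function name and the plain-list representation of predictions are illustrative, not part of the software):

```python
def bias_slope_correct(y_hat, bias=0.0, slope=1.0):
    """Apply bias and slope correction: y_corrected = (y_hat - bias) / slope.

    The defaults bias=0 and slope=1 leave the predictions unchanged,
    matching the behavior when no correction factors are specified.
    """
    return [(y - bias) / slope for y in y_hat]

predictions = [10.2, 11.0, 12.4]

# With the default factors, predictions pass through unchanged
assert bias_slope_correct(predictions) == predictions

# With explicit factors, each value is shifted by the bias and rescaled by the slope
corrected = bias_slope_correct(predictions, bias=0.2, slope=1.1)
```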
4.9.3 Usage
In the dialog, the user has the option to check Apply Bias and Slope correction. When
checked, the model will perform bias and slope correction during prediction based on one of the
options selected below.
Re-calculate from Prediction data: When selected, the bias and slope correction
factors will be the offset and slope, respectively, as taken from the ‘Predicted vs.
Reference’ plots for the new prediction data. The underlying assumption is that any
differences in bias and slope between the calibration and prediction data are due to
systematic and repeatable differences between the instruments used to collect the
two data sets. If used indiscriminately this may decrease the actual prediction
performance, and the option should therefore be used with caution. When selected,
reference Y data are mandatory in prediction.
Set or apply default correction factors: With this option default correction factors
based on the calibration model are suggested. For test-set validated models these
are the validation Offset and Slope values of the ‘Predicted vs. Reference’ plot,
under the assumption that the test set data are measured on a different instrument
that is representative also for future predictions. For leverage and cross-validated
models this assumption cannot be met and the default bias and slope are therefore 0
and 1, respectively. The user is free to manually change the default values, in which
case a message will be displayed that the values have been manually edited. A Reset
button will revert the bias and slope correction factors back to the default values.
4.10. Login
Two modes of operation are available in The Unscrambler®.
The choice of installation procedure and internal program setup determines what level of
login is required by a user. This is described further in the following sections.
The Guest login requires no password or definition of a user group domain, so by clicking on
Login a user is entered into the program.
In Non-Compliance mode, a user name and login password can be set up from the Help –
User Setup menu.
If a user name and password have been set up, when a user attempts to log in to the
program, a dialog similar to the one shown below is provided:
Login with defined User Name and Password, Non-Compliance mode
In this case a user called User 1 was set up. This time, a password is required to enter the
software. If a user forgets their password, the Forgot? option should be selected. This is
described further in the next section.
Password reminders
It is possible to click Forgot? next to the password entry for a password reminder question
that is configured during user setup.
Password recovery dialog
In this dialog, a user is required to enter the correct answer to the security question and is
then required to enter a new password (with confirmation).
If the wrong answer to the question is entered, the following warning will be provided,
Set up compliance mode with Login dialog shown each time the program is started
Set up compliance mode with a hidden Login dialog
The user’s Windows name is shown in the login screen. To enter the program, the user must
enter their Windows password.
Automatic entry
When the program is installed in Compliance mode but the Hide login screen option is
chosen, a user starting The Unscrambler® is automatically logged into the program,
and the Windows authentication details are used in the Audit Trail.
This authentication method takes advantage of centralized user management features used
in regulated network configurations, instead of redefining the user names.
For more information on how The Unscrambler® security features help a company to comply
with the requirements of 21 CFR Part 11, please have a look at the Statement of compliance.
4.11. File
4.11.1 File menu
File – New
or Ctrl+N
This option is used to create a new project.
A new, blank workspace is created with a single node entry in the project navigator named
“New Project”.
See organizing data to get started adding data to a project.
File – Open…
or Ctrl+O
This option opens an existing project, using a regular file selector dialog.
File – Close
or Ctrl+W
This option closes the current project file. If changes to the project have not been saved, The
Unscrambler® prompts the user to save the project before closing it.
File – Import
This option allows the import of data from an external data file. This may be data from
another project file, an earlier version of The Unscrambler® or one with a different format,
e.g. Excel, ASCII, or data files from instrument formats.
For more information see the importing data documentation.
File – Save
or Ctrl+S
Saves the currently open project file.
File – Export
This is a menu option which allows one to export all or selected parts of a data matrix to an
external file, in one of the available export formats.
For more information see the exporting data documentation.
File – Print…
or Ctrl+P
This will open the Print dialog, where the user selects settings to print the current document
to a printer or file.
For more information see the print dialog documentation.
File – Security
The Security function contains two options, Protect and Sign.
Protect
This command enables a user to protect a project with a password. Whenever this project is
accessed, the user will need to provide the password to open it. A project file can also be
Unprotected by using the command File-Unprotect, and entering the correct password.
Note: The password must be remembered! If it is lost, the project cannot be opened again.
Sign
For a more detailed description of how The Unscrambler® implements digital signatures,
see the Digital Signatures documentation.
The Security feature is part of the overall data integrity and compliance capabilities of the
software, which also includes Windows Authentication and Audit Trails.
For more details on how The Unscrambler® meets the requirements of digital and electronic
signatures, please refer to the section on Data Integrity and Compliance
File – Recent
The list of recently opened projects is displayed. Selecting an entry opens that project.
File – Exit
This allows one to quit The Unscrambler®. If any project files have been changed since the
project was last saved, there is a prompt asking if changes are to be saved.
Plots are scaled to fit within the margins set for the designated paper size and will retain the
same aspect ratio as is seen on the screen.
Data tables will normally print with 50 rows and 6 columns per page, depending on the
numeric format and font settings. Row and variable names and numbers will be included on
each page.
Print options from The Unscrambler® work as in any Windows application, where the user
selects printer, paper size, orientation, margins, etc.:
Print preview
It is a good idea to preview a document before sending it to the printer. Print preview
shows how the pages will look when printed. The option is only
available if a file is currently open.
4.12. Edit
4.12.1 Edit menu
The Edit menu has three different modes, and the displayed options depend on which part
of the application window is active at any given time. There are separate modes for the
workspace editor and viewer as well as for the project navigator. Some menu items are
common for two or three modes.
Common actions
Edit – Undo
Edit – Redo
Edit – Cut
Edit – Copy
Edit – Paste
Edit – Delete
Navigator mode
Edit – Rename
Edit – Spectra
Editor mode
Edit – Copy with Headers
Edit - Insert Copied Cells
Edit - Append Copied Cells
Edit - Reverse
Edit - Convert
Edit - Fill
Edit – Find and Replace
Edit – Go To…
Edit – Select
Edit – Sort
Edit – Append
Row(s)/Column(s)…
Category Variable…
Edit – Insert
Row(s)/Column(s)…
Category Variable…
Edit – Split Text/Category Variable
Edit – Change Data Type
Edit – Scalar and Vector
Edit – Define Range…
Edit – Group rows…
Edit – Make header
Edit – Add Header
Edit - Category Property
Viewer mode
Edit - Add Data
Edit - Create Range
Edit - Sample Grouping
Edit - Copy all
Edit – Draw
Edit – Mark
The workspace editor Edit menu mode is activated by clicking anywhere in a data table.
The workspace editor Edit menu
The workspace viewer Edit menu mode is activated by clicking in a plot. The same menu will
be shown irrespective of whether it is a raw data plot or a model results plot, however some
menu items will be grayed out when not applicable to specific plots.
The workspace viewer Edit menu
Common actions
Edit – Undo
or Ctrl+Z
This option reverses the last operation(s) performed on the data in the editor. This can be
used to Undo up to the last 10 operations. The size of the undo stack can be increased, see
Tools – Options… menu.
The following operations can be reversed with the undo operation:
Edit – Redo
or Ctrl+Y
It is possible to recover the results of an editing operation(s) that has just been undone with
the help of the Redo command.
A selection can be recovered from the clipboard using the Paste command or Ctrl+V.
Edit – Cut
or Ctrl+X
This option removes the selected range, either data in the Editor or a plot in the Viewer, and
places it on the clipboard. Anything placed on the clipboard remains there until it is replaced
with a new item. Use the Paste command to copy the selection to a new location.
Edit – Copy
or Ctrl+C
With this option one can copy the selected range to the clipboard, overwriting its previous
contents. The selected range is not removed from its original place. Use the Paste command
to copy the selection to a new location.
Edit – Paste
or Ctrl+V
This command allows one to insert a copy of the clipboard contents at the insertion point. The
command is not available if the clipboard is empty or the selected range cannot be replaced.
Edit – Delete
, Ctrl+D or Del
This option enables one to delete columns or rows. One can select one or more
columns/variables or rows/samples, and delete the selected section(s).
Any previously-defined sets are adjusted for the deleted range.
Navigator mode
Edit – Rename
Rename the currently selected matrix.
Edit – Spectra
Ranges can be defined as being spectra, and once this setting is ticked for a given range,
loadings plots for these data ranges will display as line plots rather than 2D scatter plots.
Editor mode
Edit – Copy with Headers
or Ctrl+Shift+C
With this option one can copy the selected range to the clipboard, overwriting its previous
contents. The selected range is not removed from its original place. Use the Paste command
to copy the selection to a new location.
Edit - Reverse
With this option one can reverse the sample order and/or variable order in a selected
matrix. For more information see the reverse documentation.
Edit - Convert
This command allows one to convert the units of the column headers for spectral data from
wavelength in nanometers (nm) to wavenumber (cm-1) and vice versa. This function is
active when the column header of a matrix is selected.
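The unit conversion itself follows the standard relation wavenumber [cm⁻¹] = 10⁷ / wavelength [nm]; the sketch below illustrates that relation and is not code from the software:

```python
def nm_to_wavenumber(nm):
    """Convert wavelength in nanometers to wavenumber in cm^-1.

    Since 1 cm = 1e7 nm, wavenumber = 1e7 / wavelength.
    The identical formula also converts wavenumber back to wavelength.
    """
    return 1e7 / nm

# 1000 nm corresponds to 10000 cm^-1
assert nm_to_wavenumber(1000) == 10000

# Applying the formula twice returns the original value
assert nm_to_wavenumber(nm_to_wavenumber(2500)) == 2500
```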
Edit - Fill
This command allows a user to fill a highlighted row or column range with either numeric or
categorical data.
For more details see the Fill section.
Edit – Go To…
Allows the user to move focus to a specific entry in the data table.
For more information see the go to dialog documentation.
Edit – Select
Edit – Select has the following options:
Select Rows
To select the respective sample(s).
Select Columns
To select the respective variable(s).
Select Range
To select a range of samples and variables.
Select All (Ctrl+A)
To select the entire matrix.
In the first three cases, the user is asked to enter a range to select. It uses the same syntax as
the Define range dialog, e.g. 1,3-5,8-20.
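The range syntax can be sketched with a small parser (illustrative only; the software's internal parsing is not documented here):

```python
def parse_range(spec):
    """Expand a range specification like '1,3-5,8-20' into a list of indices."""
    indices = []
    for part in spec.split(","):
        if "-" in part:
            # A dash denotes an inclusive span, e.g. "3-5" -> 3, 4, 5
            lo, hi = part.split("-")
            indices.extend(range(int(lo), int(hi) + 1))
        else:
            # A bare number denotes a single index
            indices.append(int(part))
    return indices

assert parse_range("1,3-5") == [1, 3, 4, 5]
```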
Note: The Unscrambler® always works with either rows or columns. This also
applies when the whole matrix is selected. Look at the cursor shape or the
row/column numbers to see whether the selection is in row or column mode.
Sample names will also be selected when operating on rows, and column headers
when operating on columns.
Edit – Sort
Sort samples according to their numerical values for the selected variable.
Sort has two options: Ascending and Descending.
Select one or more columns to sort. Headers can also be selected and used as sort keys.
This method uses the quick sort algorithm, which performs an unstable sort; that is,
if two elements are equal, their order might not be preserved. In contrast, a stable
sort preserves the order of elements that are equal.
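The difference can be illustrated in Python, whose built-in sort is stable (in contrast to the quick sort used here):

```python
# Rows as (sample_name, value); two rows share the value 5.0
rows = [("S1", 5.0), ("S2", 3.0), ("S3", 5.0)]

# Python's sorted() is stable: equal elements keep their original order,
# so S1 still precedes S3 after sorting by value.
stable = sorted(rows, key=lambda r: r[1])
assert stable == [("S2", 3.0), ("S1", 5.0), ("S3", 5.0)]

# An unstable sort such as quick sort gives no such guarantee:
# S3 could legitimately end up before S1, since both have value 5.0.
```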
Edit – Append
Row(s)/Column(s)…
This option can be used to append rows or columns, depending which entries are selected in
the data table.
A dialog is displayed allowing the user to enter the number of rows (or columns) that are to be
appended at the end of the existing data matrix.
See Edit – Insert – Row(s)/Column(s)… below for details.
Category Variable…
Append a new category variable (column).
Details on how to specify a category variable can be found here.
Edit – Insert
Row(s)/Column(s)…
Insert new rows or columns.
Select a row or a column to insert either one or more rows or columns, respectively.
A dialog will pop-up to ask how many rows or columns to insert:
Text
Numeric
Date-time
Category
Edit – Define Range…
or Ctrl+E
Create and edit ranges for easy access to often-used selections.
For more information see the define range dialog documentation.
Edit – Category Property
This option allows one to change the properties of category variables; more details
can be found in the Property dialog documentation.
Viewer mode
Edit – Draw
This option allows a user to add a drawing object to the plot. It is possible to draw with five
different types of objects: line, arrow, rectangle, ellipse or text. This option can also be
accessed by right clicking on a plot and selecting Insert Draw Item.
For more information see the plot annotation documentation.
Edit – Mark
Mark objects (samples or variables) to bring focus to them in plots and interpretation. There
are options for automatic sample or variable selection based on modeled data, or for
manual marking using the one by one, rectangle or lasso tools.
The submenu for marking objects
Right click
Select a variable. Right click. Select the menu Change Data Type – Category….
Right click access to the Category Converter
The preselected variable is shown in the field Select Variable. If a different variable is
to be used, select it using the drop-down list.
The field Value based on selected Variable gives information on the selected variables such
as:
This information is displayed to guide the selection of the number of levels and the
definition of the intermediate ranges.
Select the number of levels using the associated box.
Choose the method used to define the ranges from the two following options:
Divide total range of variation into intervals of equal width
If this option is selected, the ranges will be automatically defined when changing
the number of levels.
Specify each range manually
Double-click on the entry to define the ranges.
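The equal-width option above can be sketched as follows (an illustrative helper, not the software's own code):

```python
def equal_width_levels(vmin, vmax, n_levels):
    """Split the total range [vmin, vmax] into n_levels intervals of equal width."""
    width = (vmax - vmin) / n_levels
    return [(vmin + i * width, vmin + (i + 1) * width) for i in range(n_levels)]

# A variable ranging from 0 to 9 split into 3 levels
assert equal_width_levels(0, 9, 3) == [(0.0, 3.0), (3.0, 6.0), (6.0, 9.0)]
```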
This is also available as a right click option. Highlight a column and right click; the following
options will be displayed:
To fill a column/row with a specified value, either highlight the entire row/column or select a
sub-section using the mouse and select Edit - Fill. Enter the specified value (or text) in the
Value box and click on OK. The selected region will be filled with this value.
Note: A block of rows and columns can also be selected using this option.
To fill rows/columns with a category variable, first define the categories using Edit - Change
Data Type - Category. Then select specified cells and use the Edit - Fill option, this time
selecting the desired category from the Level drop-down list. Click on OK and the cells will be
filled with this new category.
The Fill option is also available as a right click option from the Editor.
Find option
By selecting the Options button, one is presented with Find Option choices that
enable one to match case, replace entire entry contents with the specified search criteria,
and search in indicated directions in the data matrix.
Select search type Numeric, Text or Date time from the Search mode drop-down
list.
Type a word, a number, or a date to search for in the Find what field.
Or tick Range to search within numeric or date limits. This option works only for
Numeric and Date time variables.
For replacing category values, select the variable and use the Find and Replace
option.
Text mode will match category variables. A category level labeled "200" is still
a text string. It is recommended to use words to label category levels, both to avoid
confusion and to give each level meaning, such as "High" or "Low".
Click the Find Next button to locate a cell with the chosen value or sequence of characters.
If the search is successful, the entry is marked in the editor with a black frame (or a white
frame if the search is occurring in a selected area). If no match is found, the cursor does not
move from its original place.
Result after:
This function allows one to quickly move to specific entries in a data matrix.
or Ctrl+E
Ranges define specific parts of the data table on which to perform analyses. When a set of
columns is defined, this is called a Column range and usually defines a specific set of
variables. These variable sets may define a single independent (X-data) range for methods
like PCA, or two sets, such as the X-data and the dependent Y-data, for methods such as PLSR.
When a set of rows (or samples) is defined, this is known as a Row range and these are
useful when defining training and validation sets for any analysis method in The
Unscrambler®.
Combinations of row and column sets together define specific data regions to be used for
analysis purposes and the preparation of data can be performed using the Define Range
option.
Get information on:
If the case arises that a new range has to be defined during an analysis setup, most of the
plotting and analysis dialogs in The Unscrambler® have the Define button available. An
example from the PCR dialog is shown below.
Define buttons in the PCR dialog
Dialog Usage
Functions
The dialog box contains the following functions for easily defining sets within a selected data
table.
Row and Column Ranges
This section provides two lists of the row and column sets available in a
table. To add a new row/column set, either interactively select the sets using the
data viewer with a mouse, or manually enter specific ranges into the text dialog
boxes. For example, if a new row set is to be defined called training, and it is to
cover rows 1-10 of the current table, the dialog for Row ranges should be set up as
follows,
To add the new row set to the list, click on the Create button. Use a similar procedure for
defining new column sets.
Use this option to define evenly spaced calibration (or validation) samples and use the Invert
function described above to easily define such sets.
Random
Insert random row or column indices by choosing “Samples” or “Variables” from the
drop-down list and entering the number to define in the manual entry box.
Category
Insert row indices based on a category variable. Select the category variable in the
drop-down list.
When the appropriate ranges have been selected click OK to apply the changes.
Create range from data editor
Ranges can be created directly within the data set editor: Begin by selecting the part of the
table that will be included in the range and right click to select the option Create Range,
Create Row Range or Create Column Range as appropriate.
Create Row Range
When working with data selectors that have keep-out samples/variables, a warning will be
displayed allowing the user either to accept and proceed with the keep-outs or to cancel the
action. The Details option will display the list of keep-outs.
To keep track of row and column exclusions, the data selectors provide a warning to users
that exclusions have been defined. Click on the More details link to see what has been
excluded.
More details
Automatic keep-outs can only be removed manually. This means that in cases where a
category variable has been converted to a numeric column, or missing entries have been
filled in, the keep-out lists must be edited to include the given entries in further analyses.
Then access the option Group Rows from the menu Edit. A dialog box will open.
Add row ranges on a category variable
When the variable selected is a category variable, all levels will be used to define new
ranges. Therefore the Number of groups field is disabled.
Add row ranges dialog from category variable
When clicking OK, new row ranges are defined, named after the levels.
When clicking OK, new row ranges are defined being named range1, range2, etc.
When clicking on the menu Edit – Sample grouping…, the dialog box Sample grouping &
marking opens.
Select the matrix to use for sample grouping in the Data frame. All available row sets will
appear in the dialog. They can be selected and moved to Marker settings by using the
arrows. The sample grouping will be based on the groups added to this box. Clear the
available row sets using the Clear button.
Alternatively the user can select a single column from the matrix to use for sample grouping.
If the selected column is a category variable, click Create Row Sets in order to make each
category level available for grouping. If the selected column is of numeric data type, Create
Row Sets will split the samples into a number of equally spaced ranges defined by the
Number of groups box. When created in this dialog, the ranges are created temporarily for
marking the samples. These ranges are not added to the data table in the project navigator.
To delete a selected group from Marker settings, mark the group and use the Remove
button. Alternatively use the Clear All button to remove all defined groups.
The user has the option to separate samples based on colors, symbols or both, and the
group name can optionally be used as point labels. Use the Apply button to preview the plot
settings, or click OK to apply the settings and close the dialog.
The user also has the option to label the samples by pre-defined values that may be
available in a particular column of a data sheet. The appropriate matrix and the
corresponding column need to be selected using the Data for labeling matrix. This will be
enabled only when Value is selected from the Label option.
Character position:
This feature splits text variables into new variables based only on the position of the
characters. The first split value indicates the character position at which to split, and
likewise for the second split. The default value for the first split is 0 and for the second split is 6.
Split by character position
Output options:
The following output options are available.
If the user wants to retain one or a few of the new variables after the split, the
range of columns can be defined numerically in ‘Insert Columns’ using commas and
dashes. The selection can also be set using the mouse in the preview window.
The output variables can either be converted to category type using the option
‘Convert to category’ or appended as text to the existing row
headers using the option ‘Add headers’.
4.13. View
4.13.1 View menu
The View menu has two different modes, and the displayed options depend on which part
of the application window is active at any given time. There are separate modes for the
workspace editor and viewer.
Editor mode
View – Navigator
View – Info
View – Level Indices
Viewer mode
View – Graphical
View – Numerical
View – Auto Scale
View – Frame Scale
View – Zoom In
View – Zoom Out
View – Legend
View – Properties
View – Full Screen
Context dependent plot indicator lines
View – Trend Lines – Target Line
View – Trend Lines – Regression Line
View – Uncertainty Limit
The workspace editor View menu mode is activated by clicking anywhere in a data table.
The workspace editor View menu
The workspace viewer View menu mode is activated by clicking in a plot. The same menu
will be shown irrespective of whether it is a raw data plot or a model results plot, however
some menu items will be grayed out when not applicable to specific plots.
The workspace viewer View menu
Editor mode
View – Navigator
Toggle project navigator pane on/off.
View – Info
Toggle information pane on/off.
View – Graphical
This lets the user view the selected data of a Viewer in a graphical mode. This is the default
view for The Unscrambler®.
View – Numerical
Through this option a user may display results plotted in a Viewer as a numerical table. One
can copy that data table to the Clipboard and paste it into an Editor.
Restore the plot using View – Graphical
View – Auto Scale
This option scales the plot so that all data points are shown within the Viewer window. This
command is useful after using Add Plot and Scaling.
View – Frame Scale
This option scales the plot in a selected frame. One can change the plot by scaling its axes to
fit the desired range. Select the desired area to zoom in a frame.
Use Auto Scale to display the plot as it was originally.
View – Zoom In
This option changes the plot scaling upwards in discrete steps, allowing one to view a
smaller part of the original plot at a larger scale. This can also be done by using the + key on
the graph.
View – Zoom Out
This option scales the plot down by zooming out on the middle of the plot, so that more of
the plot becomes evident, but at a smaller scale. This can also be done by using the - key on
the graph.
View – Legend
View – Properties
This opens a dialog where a user can customize a plot. Here one can change plot
appearance, such as grid, axes, titles, fonts and colors.
See the formatting of plots documentation.
View – Full Screen
Make the plot fill the whole screen. Press Esc on the keyboard or right click to leave the full
screen mode.
A regression line is drawn between the data points of a 2-D scatter plot, using the least
squares algorithm.
Available for Predicted vs. reference plots.
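The least squares fit behind such a regression line can be sketched as follows (an illustrative helper, not the software's implementation):

```python
def least_squares_line(xs, ys):
    """Ordinary least squares fit y = offset + slope * x for a 2-D scatter plot."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    # The fitted line passes through the point of means
    offset = mean_y - slope * mean_x
    return offset, slope

# Points lying exactly on y = 1 + 2x are recovered exactly
offset, slope = least_squares_line([0, 1, 2], [1, 3, 5])
assert abs(offset - 1) < 1e-12 and abs(slope - 2) < 1e-12
```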
4.14. Insert
4.14.1 Insert menu
Use the Insert menu to add items to the project navigator.
A window will open to enable a specific selection of the matrix and ranges to
duplicate.
Duplicate matrix dialog
When hitting the OK button, a second data set will be created, bearing the same name with
a replication number in parentheses, for example “(1)” for the first replication.
The structure of the table (row and column ranges) will be maintained.
Duplicated matrix
Blank
Unit matrix (diagonal 1 rest 0)
Random values (0-1)
Random values (Gaussian)
Constant
Serial numbered rows
Serial numbered columns
Serial rows with shift
If Constant is chosen, this value should then be entered in the Constant value field.
The Include Headers option will automatically display the default header names for Rows
and Columns in the data matrix.
After clicking on OK, a matrix will be created with the default name “Data Matrix”. It
contains no values if Initial values were set to Blank, otherwise the designated values are in
the entries. Data can be entered into the empty cells.
Fill a data table
Data may be entered into a blank data table in several ways.
Manually
Data can be entered manually by double clicking on the specific cell and entering the
value. This operation can be done for the data table as well as the sample and
variable name.
Copying data from a spreadsheet (Excel)
Data can be copied from Excel to The Unscrambler® by either drag and drop, or by
copying and pasting it. To drag and drop the data from Excel, it must be selected in
Excel and then dragged into the specific entry or to the beginning (top left corner) of
the area where the data are to be added. The same can be done for the sample and
variable names. Data can also be entered from Excel by using the copy and paste
functions.
Rename
The default name of the data table is “Data Matrix”, but this can be renamed with a more
descriptive name. Rename the data matrix by right clicking on the data matrix icon in the
project navigator and selecting the option Rename.
When this is done, the name will be updated in the project navigator as well as in the
visualization window and navigation bar.
Other functions are also available from this right click menu.
Other approaches to adding data matrices
There are two other options to generate a data table in The Unscrambler®:
Importing data
Create a design table
To access this option select the menu Insert – Custom Layout… and select the desired
layout:
Four viewers,
Two Horizontal…,
Two Vertical….
This menu gives access to a dialog box divided into four parts corresponding to the four frames of the visualization window, all containing the same options:
Custom Layout Dialog
Choose Matrix
This button is used to select the data set and variables to be plotted. By clicking on
Matrix it is possible to select a data matrix from the navigator. Adjust the Rows and
Cols to display only what is appropriate.
Choose Matrix dialogue box
To select a matrix that was generated during an analysis, click the Select result matrix button. The following dialog box will appear. From here it is possible to select any matrix.
Choose Matrix - Analysis dialogue box
Type
This drop-down list presents the plot options:
Type drop-down list
Title
Type in the title to be displayed on the specific plot.
Once all the necessary plots have been defined, click the OK button to display the selected plots.
It is always possible to abort this action by clicking the Cancel button.
Once the plots are displayed they can be edited using the Properties menu, accessible by right-clicking on the plot or from the menu shortcut.
Further information is available for the following options:
Format a plot,
Annotate a plot,
Zoom and re-scale a plot,
Save and copy a plot.
Filter settings:
The Filter settings tab provides options for primary and secondary filter settings.
Filtering can be done based on the models available in the project navigator; the compatible models are PCA, PCR, PLSR and SCA. Models with auto-pretreatments can also be defined by clicking the pretreatment button. Only full models are acceptable.
Data Compiler - Filter Setting
Upon selection of the model, the available filter type can be selected. For PLS, PCR and PCA
the available filter matrices are
The Limit settings are active for the following filter types:
For additional filtering, ‘Include Secondary Filter’ must be selected; it offers the same features as the primary filter.
Output options:
The following output options are available.
Data Compiler - Output Options
Add Statistics: When selected, the tested model statistics from the filtered model (based on the primary and secondary filters) are added as new column(s) to the original data table.
Add status: When selected, the status results from the filter model are added as new category column(s) to the original data. The Influence filter type has four status levels: Good, Extreme, Suspect and Outlier. For all other filter types, the status levels are Good and Outlier. Additionally, users have the option to add the Good and Rejected row ranges to the existing matrix.
Add ranges for Good and Rejected: When checked (default), two row ranges, ‘Good’ and ‘Rejected’, are added to the original (existing) data table. The ‘Good’ and ‘Rejected’ status is defined by the output from both filters as well as the minimum number of replicates. Any sample that has status Good in either the primary or secondary filter, and that exceeds the minimum number of replicates, will be interpreted as Good. All others will be tagged as Rejected.
Add mean matrix: When checked, the average of all non-rejected observations is calculated and returned for each sample. Users also have the additional option to add the standard deviation for each sample. The average and standard deviation are calculated only if the number of non-rejected replicates exceeds the minimum number entered in the Input data tab.
Add median matrix: When checked, the median of all non-rejected observations is calculated and returned for each sample. Users also have the additional option to add the range for each sample. The median and range are calculated only if the number of non-rejected replicates exceeds the minimum number entered in the Input data tab.
Include column with number of replicates: When checked, the first column in output
matrices will be the number of replicates used for calculating the summary statistics.
4.15. Plot
4.15.1 Plot menu
The Plot menu has different modes: one is available from the matrix editor, and for each analysis it gives a list of plots related to that analysis.
The plot interpretations chapter provides more detailed information for generic plots.
Editor mode
Plot – Line
The Line plot displays one or more data vectors. When plotting from the Editor, mark the
row(s) or variable(s) (Columns) to be plotted; one sample/variable gives a one-dimensional
plot; specifying a range adds several line plots.
One can define ranges or create ranges for samples as well as variables from the edit menu
Edit - Define Range, see using define range.
For more information see the line plot documentation.
Plot – Bar
The Bar plot displays data vectors as bars.
For more information see the bar plot documentation.
Plot – Scatter
The Scatter plot shows two data vectors plotted against each other.
When plotting from the Editor, select the two rows or variables (columns) to be plotted
before using the Plot command.
For more information see the scatter plot documentation.
Plot – Matrix
In this plot, a two-dimensional matrix is visualized. The plot is useful to get an overview of
the data before starting any analyses, as obvious errors in the data and outliers may be seen
at once. One may also want to take a look at this plot before deciding whether to scale or
transform the data for analysis.
For more information see the matrix plot documentation.
Plot – Histogram
This plot displays the distribution of the data points in a data vector, as well as the normal
distribution curve. A histogram gives useful information for exploring raw data. The height of
each bar in the histogram shows the number of elements within the value limits of the bar.
For more information see the histograms documentation.
4.16. Tasks
4.16.1 Tasks menu
This menu is divided into three main groups of actions: Transform, Analyze and Predict.
Tasks – Transform
The Tasks – Transform options allow one to transform samples or variables to obtain data properties that are more suitable for analysis and easier to interpret. Bilinear models, e.g. PCA and PLS, basically assume linear data. The transformations should therefore result in a more symmetric distribution of the data and more linear behavior, if there are nonlinearities.
The Unscrambler® offers many spectral pretreatments like derivatives, smoothing,
normalization, and standard transformations. All these can be found under Tasks –
Transform.
There is also a Compute_General function to transform data using basic elementary and
trigonometric mathematical expressions, and the matrix calculator, which has options for
linear algebra, matrix operations and reshaping of data.
For more information and a list of available transformations, see documentation for each
transformation
Tasks – Analyze
The Tasks – Analyze option provides multivariate analysis options consisting of:
Univariate statistics:
L-PLSR,
Linear Discriminant Analysis (LDA),
Support Vector Machine (SVM) classification, and
Analyze design matrices
Tasks – Predict
The Tasks – Predict options provide means of applying a model to new samples for prediction, projection or classification.
Projection
Project new samples to determine similarity with samples in a PCA, PCR or PLSR
model.
Regression
Predict unknown samples from regression models.
Prediction
SVM Prediction
Classification
Classification of unknowns by applying SIMCA, LDA, or SVM models.
SIMCA classification
LDA classification
SVM classification
4.17. Tools
4.17.1 Tools menu
or Ctrl + Shift + M
Open an existing experimental design for modifications.
See the modify design dialog documentation.
or Ctrl + M
The Matrix calculator is used to perform simple linear algebra functions like matrix
multiplication, addition, division, inverse etc. and to reshape, append or combine two
matrices.
See the matrix calculator dialog documentation.
Tools – Report…
or Ctrl + R
A tool to create reports as PDF documents with plots and data.
See the report generator dialog documentation.
This command displays the audit trail for the active project. The audit trail is a log of actions
by a user, showing a date and time stamp for the actions.
See the audit trail dialog documentation.
Tools – Options…
This dialog can be used to change the appearance of the data editor or viewer, as well as
other options in The Unscrambler®. Default numeric formats and plot settings can be
defined here.
See the options dialog documentation for details.
Date
Time Zone
Time
User name
Action.
The types of actions that are tracked in the audit trail include:
Creation of the project
Import of data
Transformation: compute functions, smoothing, MSC, derivative, etc.
Formatting: sorting, delete
Analysis: statistics, PCA, regression, prediction, etc., with detailed model settings
Audit trail dialog
In Non-Compliance mode, the audit trail can be emptied by selecting the Empty button in
the dialog.
The audit trail can be disabled from the Tools - Options under the General tab.
When in Compliance Mode, the Audit Trail cannot be emptied. It can only be saved as a non-editable PDF document for further printing, if desired.
The Audit Trail for Compliance Mode is shown below. Also, in Tools - Options the Audit Trail
cannot be disabled in Compliance Mode.
Audit Trail in Compliance Mode
The calculator tool should be used only with matrices that are purely numeric. Columns containing missing values are left out of the calculations, as are text and category columns. For the remaining matrix contents, compatibility depends on the feasibility of the matrix operations.
See also the Compute_General transform that can do calculations on samples and variables
using basic mathematical expressions.
Matrix calculator dialog
unique for all matrices whose entries are real or complex numbers and can be calculated
using the singular value decomposition.
QR decomposition
QR decomposition (also called QR factorization) of a matrix allows for the solution of linear systems of equations.
It is a decomposition of the matrix into an orthogonal matrix (Q) and a right (upper) triangular matrix (R). QR decomposition is the basis for a particular eigenvalue algorithm, the QR algorithm.
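How QR factorization solves a linear system can be sketched in NumPy (an illustration only, not The Unscrambler®’s own implementation): since A = QR and Q is orthogonal, Ax = b becomes Rx = Qᵀb, a triangular system solved by back-substitution.

```python
import numpy as np

# Solve A x = b via QR factorization.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

Q, R = np.linalg.qr(A)            # A = Q R, Q orthogonal, R upper triangular
x = np.linalg.solve(R, Q.T @ b)   # solve the triangular system R x = Q' b
```

Here x comes out as (0.8, 1.4), which satisfies both equations.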
Element-by-element operations
Array arithmetic operations that are carried out element by element on one matrix.
X’X
Outer product of the matrix with itself
1./X
Reciprocal of the individual matrix elements
X.*X
Square of the elements of X (element-by-element product)
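As a minimal illustration (a NumPy sketch, not The Unscrambler®’s own engine), the three single-matrix operations correspond to:

```python
import numpy as np

# A small matrix to illustrate the element-by-element operations.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

xtx    = X.T @ X    # X'X: transpose of X multiplied by X
recip  = 1.0 / X    # 1./X: reciprocal of each element
square = X * X      # X.*X: element-by-element product (square of each element)
```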
Two matrix operations
Binary operations imply that the arithmetic operation is computed on the data and an operand, as defined by the rules of linear algebra:
Addition: X+Y
Subtraction: X-Y
Multiplication: X*Y
Matrix division: X*inv(Y)
Element by element division: X/Y
The calculations that are possible depend on dimensionality of the matrices X and Y that
have been selected in the scope.
Add, Hadamard product and subtract require X and Y to have the same number of rows and columns, or Y has to be a row or column vector with dimensions matching those of X.
The X and Y matrices in the calculations should not be confused with inputs and outputs of a
model.
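The two-matrix operations listed above can be sketched in NumPy (illustrative only; the matrices are made up):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
Y = np.array([[5.0, 6.0],
              [7.0, 8.0]])

added      = X + Y                  # addition: same shape required
subtracted = X - Y                  # subtraction
matmul     = X @ Y                  # matrix multiplication
matdiv     = X @ np.linalg.inv(Y)   # "matrix division" X*inv(Y); Y must be square and invertible
elemdiv    = X / Y                  # element-by-element division
hadamard   = X * Y                  # Hadamard (element-by-element) product
```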
Reshape matrix
Change dimensions of a two-dimensional matrix.
One can rearrange the elements of a matrix to change the number of rows and columns.
This is especially useful when importing data where a matrix has been stored as a one-
dimensional list of values.
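The rearrangement can be illustrated in NumPy (a sketch; row-major filling is assumed here):

```python
import numpy as np

# A matrix that arrived on import as a flat, one-dimensional list of values...
flat = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# ...reshaped into 2 rows and 3 columns, filled row by row.
reshaped = flat.reshape(2, 3)
```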
Augment X|Y: column-wise combination of matrices; e.g. 4x2 and 4x2 gives 4x4
Append Y to X: row-wise combination of matrices; e.g. 4x2 and 4x2 gives 8x2
Augment requires X and Y to have the same number of rows. Append requires X and Y to
have the same number of columns.
These are binary operations in the shaping tab available only when the Binary operand box is
checked. This requires that the values be numeric. If there are columns of non-numeric data,
they will be kept out of the calculation. If there are missing values in either matrix, the rows
(columns) containing them will be kept out of the calculation.
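The two shaping operations can be sketched in NumPy (illustrative only):

```python
import numpy as np

X = np.ones((4, 2))
Y = np.zeros((4, 2))

augmented = np.hstack([X, Y])  # Augment X|Y: needs same number of rows  -> 4x4
appended  = np.vstack([X, Y])  # Append Y to X: needs same number of columns -> 8x2
```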
Use this option to enable/disable the audit trail. Note: This option is not active when
the program is installed in Compliance Mode.
Prompt user to view plots
When checked, the user will be prompted to view the model plots when opening a project, after training a model, and after predictions. This option will be unchecked if the ‘Do not ask me again’ option is selected in the View Plots dialog.
Viewer
These options allow a user to set the default appearance properties of plots at the
application level. The settings can still be customized and changed at the plot level by editing
the properties for a given plot.
The following are properties that can be set from the Viewer:
Antialiasing
Use this option to set antialiasing in all analysis-generated plots.
Point label visible
Use this option to have the default view on plots have the point labels visible. Point
labels can be toggled on/off from a plot.
Line plot point visible
Use this option to have the default view on line plots have the points visible. The
point can be toggled on/off from a plot.
Point size
Use this option to set the default size of points. This can be changed for individual plots under Properties.
Line size
Use this option to set the default line size. This can be changed for individual plots under Properties.
To access the Report Generator, select Tools – Report…. The Report generator dialog
appears and gives access to all matrices and plots in the current project. Add plots and
matrices in the field Included in report to create a customized report.
To add a matrix use the Data tables field and:
Either select a data matrix that is in the Navigator as a node from the drop-down list
Or select one from an analysis using the Select result matrix button.
At the bottom of the dialog are three tabs where the user can choose settings for the
security, report content, and page setup.
Security
Passwords can be enabled to limit access for editing and viewing the report. The user can enable password-protected editing of reports.
Printing, editing, copying, or annotating can be disabled for added security.
Content
Under the content tab the user can select to append notes, and/or use the editor
format for numbers.
Report Generator Content
Page Setup
On the Page Setup tab, a user can define the paper size (A2, A3, A4, letter, legal),
and orientation (portrait or landscape).
Report Generator Page setup
4.18. Help
4.18.1 Help menu
The help menu provides access to help topics and licensing-related information in The
Unscrambler®.
Help – Contents
or F1
Open help viewer for browsing.
See the How to use help documentation.
Help – Search
Ctrl+F1
Open help viewer for searching.
Help – About
Shows:
The System Info button will open the “Windows System Information” utility.
Company name and Email address fields become active when the activation key is for a
time-limited or perpetual license.
Contact details can be found at http://www.camo.com/contact
Login
Compliance
Users are advised to create a login and identification, which will not only secure their work with The Unscrambler®, but also provide valuable information for keeping track of actions taken on data through the audit trail, where the user name is logged with every action.
Use the menu option Help - User Setup… to access the dialog.
User setup dialog
The above image shows an example of a completed setup. Enter the pertinent information
in the provided fields and then click Save.
The following is a brief explanation of the fields:
User Name
This is the name that will be shown in the login dialog each time the program is
started.
First Name
Security Question
Select from a list of pre-defined questions to provide an answer to.
Answer
Enter the answer to the question here.
If a password is forgotten, it can be retrieved provided the answer to the security question is known. See the section on Login for more details.
Contact CAMO Software for information about how to register more than one user.
Contact details can be found at http://www.camo.com/contact
5. Import
5.1. Importing data
This section describes how to import data from supported instruments and software utilities
into The Unscrambler®.
The Unscrambler® X
The Unscrambler® 9.8 and earlier versions
NetCDF
JCAMP-DX
Instruments
Interface protocols
Databases
Other interfaces such as OPC and MyInstrument are supported. Contact CAMO Software for
details. http://www.camo.com/contact
The file names are given in glob notation: “*” means any number of characters, “?” any single character, and “[ABC]” any of A, B or C.
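Python’s fnmatch module implements the same notation and can illustrate it (the file names below are hypothetical):

```python
from fnmatch import fnmatch

# "*" matches any run of characters, "?" a single character,
# "[ABC]" exactly one of A, B or C.
assert fnmatch("spectrum01.spc", "*.spc")
assert fnmatch("s1.txt", "s?.txt")
assert fnmatch("run_B.dat", "run_[ABC].dat")
assert not fnmatch("run_D.dat", "run_[ABC].dat")
```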
5.2. ASCII
5.2.1 ASCII (CSV, text)
Type of data
Array
Software
ASCII (American Standard Code for Information Interchange) is a character encoding
scheme and the de-facto file standard supported by many applications.
File name extension
*.csv, *.txt, *.*
Data delimiters
Numbers may be delimited by different characters in different ASCII files. Specify which
delimiter is used in the file to be imported, in the field Separator. The choices are
Comma
Semicolon
Space
Tab
Custom
Note: Carriage Return, Line Feed and Tabulation are not among the available
delimiters in the dialog. They are default item delimiters, and will automatically be
recognized as such. Do not specify them in the Custom field!
There is an additional list of check box options below:
Missing data
Any text string entries in a numeric column will be imported as empty or missing data.
Make sure that Treat consecutive separators as one is unchecked when importing ASCII files
that have empty entries for missing data, such as:
s4,0.618,,0.6022
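Why this matters can be illustrated with Python’s csv module (a sketch only; The Unscrambler® performs its own parsing):

```python
import csv
import io

# One record from an ASCII file where the third field is empty (missing data).
line = "s4,0.618,,0.6022\n"
row = next(csv.reader(io.StringIO(line)))

# The empty entry is preserved as an empty string, i.e. a missing value.
# Treating consecutive separators as one would drop it and shift the columns.
values = [float(v) if v else None for v in row[1:]]
```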
Batch import
Often spectrometers output spectra in individual files, such that each file contains a single
spectrum (with or without headers). A selection of such single spectrum text-files can be
imported in a single step in The Unscrambler®, simply by selecting multiple files to open. A
simplified dialog is used for batch import.
Batch import dialog
Each spectrum is imported and appended to the previous spectra row-wise. If spectra are
given as a single row in the files, this means that each spectrum will become a single row in
the imported data table. If spectra are given column-wise (i.e. separated by carriage
return/newline), they should be transposed using the Transpose the data before import
check-box.
The sample file-names are included in a row-header in the imported table.
See section on single file import above for general import options.
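The row-wise stacking described above can be sketched in NumPy (the file names and spectral values here are hypothetical):

```python
import numpy as np

# Hypothetical single-spectrum files, already parsed into 1-D arrays.
spectra = {
    "sampleA.txt": np.array([0.11, 0.12, 0.13]),
    "sampleB.txt": np.array([0.21, 0.22, 0.23]),
}

row_names = list(spectra)                  # file names become row headers
table = np.vstack(list(spectra.values()))  # one file -> one row of the table

# A spectrum stored column-wise (one value per line) is transposed first,
# mirroring the "Transpose the data before import" check-box.
column_wise = np.array([[0.31], [0.32], [0.33]])
as_row = column_wise.T
```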
5.3. BRIMROSE
5.3.1 Brimrose
Type of data/instrument
NIR
Data dimensions
Multiple spectra
Instrument/hardware
Snap!32 v2.03 (BFF3)
Snap!32 v3.01 (BFF4)
Vendor
Brimrose
File name extension
*.dat
The source files may contain one or more samples per file; multiple selections allow several
samples to be imported at the same time.
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Once Auto select matching spectra has been checked, only those files that have the same number of variables will be selected.
Sorting data
The file name, number of samples, number of X-variables, wavelengths for the first and last
X-variables, and step (increase in wavelength), are displayed for each file.
Step is the increment in wavelength (or wave number) between two successive variables.
The following relationship should be true:
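The relationship itself is not reproduced in this extract; for an evenly spaced axis it is presumably the following (an assumed formula, shown as a small Python check):

```python
# Assumed relationship for an evenly spaced wavelength axis: the step ties
# the first and last wavelengths to the number of X-variables.
# The values below are made up for illustration.
first_wl, last_wl, n_vars = 1100.0, 2500.0, 701

step = (last_wl - first_wl) / (n_vars - 1)  # increment between successive variables
```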
Preview
Preview spectra displays a line plot of the files that have been selected for import.
5.4. Bruker
5.4.1 OPUS from Bruker
Type of data/instrument
FT-IR, FT-NIR, Raman
Data dimensions
Single spectra
Instrument/hardware
—
Software
OPUS
Vendor
Bruker
File name extension
*.0x, *.1
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Interpolate
Checking the Interpolate option allows the import of data with different starting and ending points, provided the number of points is the same in all sets to be imported.
When the % button is selected, the following dialog appears allowing a user to set
the Tolerance for allowing data with different start or end points to be imported.
Interpolate Tolerance Dialog
Once Auto select matching spectra has been checked, the files in the list having the same
number of variables will be selected.
Use the Interpolate option to import data with different start or end points.
Sorting data
The file name, number of X-variables, wavelengths for the first and last X-variables are
displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays a line plot of the files that have been selected for import.
5.5. DataBase
5.5.1 Databases
Type of data
Array
Software
Note: The Data Link Properties dialog is a standard Windows dialog. Depending on the local language setup, this dialog may be displayed in a language other than English. The name of the dialog and the text of the fields will differ, but the layout and meaning of all fields will be the same as described hereafter. For additional information, click Help; this will start the Microsoft help system related to the current sheet in the Data Link Properties dialog.
The next two sections describe the standard stages to go through in order to establish a
connection from The Unscrambler® to a database.
To edit a value, select it, and click the Edit Value… button, which opens the dialog where a
property can be changed.
Press the Next button to preview the data and proceed to complete the import.
Preview data before import
The data types will be detected for individual columns and imported as numeric values or
text.
5.6. DeltaNu
5.6.1 DeltaNu
Type of data/instrument
Raman spectrometer
Data dimensions
single vector spectrum or multiple spectra in an array
Instrument/hardware
NuSpec software
Pharma-ID Raman spectrometer
Vendor
DeltaNu
File name extension
*.dnu, *.lib
Multiple selections are possible, by checking the box next to more than one file. The
selected samples must be of the same size (variables must match).
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Sorting data
The file name, number of X-variables, wavelengths for the first and last X-variables, and step
(increase in wavelength), are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays a line plot of the files that have been selected for import.
5.7. Excel
5.7.1 Microsoft Excel spreadsheets
Type of data
Array (spreadsheet)
Software
Excel (part of Microsoft Office)
Vendor
Microsoft
File name extension
*.xls, *.xlt, *.xlsx, *.xlsm
How to use it
All ranges that have been defined with names in the selected Excel sheet are listed under
Range names. Multiple row and column headers can be specified in headers, with up to a
maximum of 5 headers.
The sheet range is updated automatically if a range name is selected. The range can also be
entered manually, specifying the Rows and Columns, e.g. 2:1. All cells lying within this
rectangle are then imported.
Select the appropriate ranges as described above for the data values from the selection
option, as well as for the rows/sample and columns/variable names, if relevant.
Columns and rows can be removed from the import by selecting them within the preview
grid and pressing Del on the keyboard.
Data type
If the worksheet contains non-numeric values or a mixture of numeric and non-numeric
values, they can be imported. The radio button Auto can be selected to detect the data
format in the Excel spreadsheet and maintain that on import. If all the data are non-numeric,
they can be imported as text by selecting the radio button text. If the spreadsheet has a mix
of text and numeric values, and one data type is selected, only data of that type will be
imported.
Skip lines
If there are rows of data at the top of the spreadsheet that you do not want to import, you
can use the Skip lines option to enter the number of lines from the top to skip.
5.8. GRAMS
5.8.1 GRAMS from Thermo Scientific
Type of data
Array
Data dimensions
Multiple spectra, constituents
Software
GRAMS
Vendor
Thermo Scientific (formerly Galactic)
File name extension
*.spc, *.cfl
When a .cfl file is imported into The Unscrambler®, both spectra and constituents are read. If a .spc file is imported, the spectra
are read, and accompanying Y values can also be imported with them.
“X-values” (usually wavelengths) in .spc files are imported as X-variable names.
Constituents in .cfl files are imported as Y-variables. “Y-values” are imported as separate
column sets with the name of the Y values for the columns.
Some .spc files contain a log block. This may include file names and sample numbers. To
import these, one can select Sample naming… and designate whether to use one, both or
none of these fields.
The binary part of the log block (which usually contains the imaginary part of complex
spectral data) is not imported, nor is the ASCII part of the log.
The source files may contain one or more samples per file (i.e. single spectra or multifiles1);
multiple selections allow one to import several samples with the same number of variables
at the same time. The dialog will include details about the files that are eligible for import. It
will show the number of samples per file, the number of X variables, number of Y variables,
and the starting and ending X variables.
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import. If the data files also include Y values, these will also be imported.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Interpolate
Checking the Interpolate option allows the import of data with different starting and ending points, provided the number of points is the same in all sets to be imported.
When the % button is selected, the following dialog appears allowing a user to set
the Tolerance for allowing data with different start or end points to be imported.
Interpolate Tolerance Dialog
Once the Auto select matching spectra option has been checked it will select only those files
that have the same number of variables as the first selected file.
Use the Interpolate option to import data with different start or end points.
Sorting data
The file name, number of samples, number of X-variables, wavelengths for the first and last
X-variables are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list. Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays a line plot of the files that have been selected for import.
Multifiles are a specific kind of GRAMS file that has multiple spectra in a single file,
as opposed to a single spectrum per file.
5.9. GuidedWave
5.9.1 CLASS-PA & SpectrOn from Guided Wave
Type of data/instrument
spectrometer (UV, UV-vis, NIR)
Data dimensions
Single spectra, constituents
Instrument/hardware
CLASS-PA, SpectrOn
Vendor
Guided Wave
File name extension
*.asc, *.scn, *.autoscan, *.gva
Multiple selections are possible, by checking the box next to more than one file. The
selected samples must be of the same size (variables must match).
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names, sample numbers or timestamps in the resulting data table.
Sample names will only be imported if they are present in the source file.
Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears, allowing the user to set
the Tolerance for importing data with different start or end points.
Interpolate Tolerance Dialog
Y-variables
Constituents may also be imported by checking the following options:
Import Y-variables
Import Predicted Y-variables
Use the Interpolate option to import data with different start or end points.
Sorting data
The file name, number of X-variables, wavelengths for the first and last X-variables, and step
(increase in wavelength), are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays a line plot of the files selected for import.
5.10. Interpolation
If there is a mismatch in any of these values, there are two possible scenarios:
If the number of points in the spectra do not match, a matrix cannot be formed,
as the columns do not have the same dimension.
If the start points do not match, again a matrix cannot be formed; however, if the
differences between the values are small, interpolation can be used to reconcile
them.
The Interpolation function used in the Import menus is different from that found in Tasks -
Transform (which may be useful for trying to match data from two sets collected at different
resolutions).
Find out more about the Interpolate Transform here.
Data Imports Supporting Interpolation
The following file imports support the interpolate functionality in The Unscrambler® import
dialog boxes.
JAMP-DX
Thermo Galactic GRAMS
OPUS (Bruker Optics)
CLASS-PA & SpectrOn
Indico (ASD)
OMNIC™ (Thermo)
Varian
PerkinElmer
Functionality
When a file import supporting interpolate is selected, the Interpolate checkbox will be
present, see below
The % button opens the Tolerance dialog box that has a slider bar for setting how far
beyond the reference spectrum limit to set the interpolation.
Tolerance Dialog
Any points that lie within +/- the set percentage tolerance of the starting point will be
included in the import.
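The tolerance check and the subsequent resampling onto the reference grid can be sketched as follows. This is a simplified Python illustration under stated assumptions (linear interpolation, tolerance expressed as a percentage of the reference wavelength span); it is not The Unscrambler's internal algorithm.

```python
def within_tolerance(ref_x, x, tol_pct):
    # Same number of points, and start/end wavelengths within
    # +/- tol_pct of the reference spectrum's wavelength span.
    span = ref_x[-1] - ref_x[0]
    tol = span * tol_pct / 100.0
    return (len(x) == len(ref_x)
            and abs(x[0] - ref_x[0]) <= tol
            and abs(x[-1] - ref_x[-1]) <= tol)

def interpolate_to(ref_x, x, y):
    # Linear interpolation of the spectrum (x, y) onto the reference grid.
    out = []
    for rx in ref_x:
        if rx <= x[0]:
            out.append(y[0])
        elif rx >= x[-1]:
            out.append(y[-1])
        else:
            i = max(j for j in range(len(x)) if x[j] <= rx)
            t = (rx - x[i]) / (x[i + 1] - x[i])
            out.append(y[i] + t * (y[i + 1] - y[i]))
    return out

ref = [100.0, 110.0, 120.0]
shifted = [101.0, 111.0, 121.0]          # same point count, shifted start
ok = within_tolerance(ref, shifted, 10.0)  # small shift: importable
resampled = interpolate_to(ref, shifted, [1.0, 2.0, 3.0])
```

A spectrum that fails the tolerance test would be left out of the import rather than resampled.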
Example
Nine spectra were collected on three different Bruker spectrometers using 8 wavenumber
resolution. Three replicate spectra were collected on each instrument. Each spectrum
consists of 1154 points; however, the starting point of each spectrum is different. By
selecting the first spectrum and then checking the Auto select matching spectra box, only
the first three spectra are selected, see below.
To import all data into one table, check the Interpolate box and set the Tolerance to include
all spectra in the set, see below
When the Auto select matching spectra box is reselected, all spectra are now included in the
import, see below,
The data are now displayed as a node in the project navigator using the column headers of
the reference spectrum selected.
5.11. Indico
5.11.1 Indico
Type of data/instrument
—
Data dimensions
Single spectra
Software
Indico Pro 5.6 (version 6 files)
RS3 5.6 (version 7 files)
Indico Pro 6.0 (version 8 files)
Vendor
ASD Inc.
File name extension
*.asd, *.001, *.002, *.3456, etc. (any number)
The source files contain one sample per file; multiple selection allows for the import of
several files (samples) at the same time.
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears, allowing the user to set
the Tolerance for importing data with different start or end points.
Interpolate Tolerance Dialog
Use the Interpolate option to import data with different start or end points.
Sorting data
The file name, number of X-variables, wavelengths for the first and last X-variables, and step
(increase in wavelength), are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays a line plot of the files selected for import.
5.12. JcampDX
5.12.1 JCAMP-DX
Type of data/instrument
Vector and arrays. Standard
Data dimensions
Multiple spectra, constituents
Vendor
JCAMP/IUPAC
File name extensions
*.jdx, *.dx, *.jcm
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears, allowing the user to set
the Tolerance for importing data with different start or end points.
Interpolate Tolerance Dialog
Use the Interpolate option to import data with different start or end points.
Sorting data
The file name, number of samples, number of X variables, number of Y variables, and
wavelengths for the first and last X-variables are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays line plots of selected files for import.
General
JCAMP-DX are ASCII-files with file headers containing information about the data and their
origin, etc., and they may contain both X-data (spectra) and Y-data (concentrations).
Only the most essential information of the JCAMP-DX file will be imported. The first title in
the JCAMP-DX file will be used, and one has the additional option of also importing file
names and sample numbers. There is no limit on the length of a file name. If several
JCAMP-DX files are imported and saved in the same Unscrambler® file, the matrix name will
be that of the first imported JCAMP-DX file.
JCAMP “X-values” (usually wavelengths) become X-variable names, while JCAMP “Y-values”
become X-variable values. “Concentrations” are interpreted as Y-variables. Variable names
are imported, with no limit on the number of characters. The “Sample description” is used
as the sample name. Unfortunately, there are different dialects of JCAMP-DX, so in some cases
one may lose e.g. sample names if they were used erroneously in the original file.
The XYPOINTS variant demands more disk space than XYDATA.
Examples of the XYDATA and XYPOINTS formats follow.
JCAMP-DX XYPOINTS
The example below shows only one sample.
JCAMP-DX XYDATA
The example below shows only one sample.
##XUNITS= NANOMETERS
##YUNITS= ABSORBANCE
##XFACTOR= 1.0
##YFACTOR= 0.000001
##FIRSTX= 1100
##LASTX= 2500
##FIRSTY= 0.139460
##MINY= 0.131600
##MAXY= 1.380070
##NPOINTS= 281
##CONCENTRATIONS= (NCU)
(<CARBOHYDRATE>, 89.400, %)
(<PROTEIN>, 9.410, %)
##DELTAX= 5
##XYDATA= (X++(Y..Y))
1100 139459 137435 135089 133060 131669 131599 133794 138899
1140 145740 151897 158459 167527 180800 195522 206585 216499
...
...
2460 1378929 1379632 1378464 1374972 1378929 1376837 1372945 1377632
2500 1380069
##END=
BASELINEC= YES or NO
APCOM= String60
JCAMP-DX= String
ORIGIN= String
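The XYDATA encoding above can be decoded line by line: the first number on each data line is the X value of the first point, and the remaining numbers are raw Y values to be multiplied by ##YFACTOR. A minimal Python sketch follows; it is an illustration, not a full JCAMP-DX parser, and it ignores compressed JCAMP forms such as DIF/DUP encoding.

```python
def decode_xydata_line(line, xfactor=1.0, yfactor=1.0):
    # "1100 139459 137435 ..." -> (x of first point, list of scaled Y values)
    parts = line.split()
    x0 = float(parts[0]) * xfactor
    ys = [int(v) * yfactor for v in parts[1:]]
    return x0, ys

# First data line of the XYDATA example, with ##YFACTOR= 0.000001:
x0, ys = decode_xydata_line("1100 139459 137435 135089", yfactor=0.000001)
# ys[0] reproduces ##FIRSTY= 0.139460 to within rounding
```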
5.13. Konica_Minolta
5.13.1 Konica_Minolta
Type of data/instrument
KONICA MINOLTA NIR spectrometer
Data dimensions
single vector spectrum or multiple spectra in an array
Instrument/hardware
Vendor
Konica_Minolta
File name extension
Upon selection of ASCII files, the spectrum is displayed in the dialog box as a line plot. After
selecting multiple files, the user can click OK to import the data.
Konica_Minolta Import
5.14. Matlab
5.14.1 Matlab
Type of data
Array
Software
Matlab
Vendor
MathWorks, Inc.
File name extension
*.mat
This will create a Matlab formatted .mat file. For more help on using the save command,
type help save in Matlab.
Matlab variables representing sample and variable names must be character arrays.
What Cannot be Converted
The following cannot be imported from Matlab to The Unscrambler®
5.15. MyInstrument
5.15.1 MyInstrument
Type of data/instrument
Instrument interface standard defined by Thermo Electron (formerly Galactic) and
supported by many instrument vendors.
A MyInstrument driver provided by the specific instrument vendor and the
corresponding MyInstrument add-on for The Unscrambler® are required. These
modules are available separately from CAMO Software and may not be part of the
standard package.
Additional information
How to use it
The Unscrambler® makes use of the MyInstrument standard to allow for instrument
configuration and definition of experiments in order to run scans. The functionality provided
depends on the instrument. After acquisition, the spectral data are inserted directly, one row
per scan, into The Unscrambler® editor, ready for further processing or modeling. The
MyInstrument add-on removes the need for acquiring data using other instrument-specific
software, saving to a file and then importing into The Unscrambler®.
The next window will show the vendor specific MyInstrument control screen, e.g. for a Zeiss
instrument:
The appearance and usage of the control dialog will depend on the particular instrument
vendor. Details of using the instrument interface will be available from the manuals provided
by the instrument vendor. Using the instrument may require specific configuration and
setup procedures provided by the vendor before being able to run scans.
Sample scan result. This may appear entirely different for the instrument being used and is
provided here only as an example.
Click OK to end the scan acquisition session. The scans should now be available within The
Unscrambler® editor for subsequent processing and modeling.
5.16. NetCDF
5.16.1 NetCDF
Type of data
Open standard for array-oriented data
Developed by
University Corporation for Atmospheric Research (UCAR)
File name extension
*.cdf, *.nc
The NetCDF software was developed by Glenn Davis, Russ Rew, Ed Hartnett, John Caron,
Steve Emmerson, and Harvey Davies at the Unidata Program Center in Boulder, Colorado,
with contributions from many other NetCDF users.
One can select Sample Names and Variable names as shown above.
5.17. NSAS
5.17.1 NSAS
Type of data/instrument
NIR
Data dimensions
Multiple spectra, constituents
Instrument/hardware
Foss 5000, 6500, XDS
Vendor
FOSS
File name extension
*.da, *.cn, *.cal
The source files may contain one or more samples per file; multiple selections allow several
samples to be imported at the same time.
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Once Auto select matching spectra has been checked, the files in the list having the same
number of variables will be selected.
Sorting data
The file name, number of samples, number of X-variables, and wavelengths for the first and
last X-variables are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays a line plot of the files selected for import.
NSAS_AmpType String: 1
NSAS_CellType String: 2
NSAS_Volume String: 3
NSAS_Math2_Type =
NSAS_Math3_Type =
NSAS_Math2_SegmentSize =
NSAS_Math3_SegmentSize =
NSAS_Math2_GapSize =
NSAS_Math3_GapSize =
NSAS_Math2_DivisorPoint =
NSAS_Math3_DivisorPoint =
NSAS_Math2_SubtractionPoint =
NSAS_Math3_SubtractionPoint =
NSAS_AmpType | String:
“Reflectance”, “Transmittance”, “(Reflect/Reflect)”, “(Reflect/Transmit)”,
“(Transmit/Reflect)”, “(Transmit/Transmit)”, “Not used”
NSAS_CellType | String:
“Standard sample cup”, “Manual”, “Web analyzer”, “Coarse sample”, “Remote
reflectance”, “Powder module”, “High fat/moisture”, “Rotating drawer”, “Flow-
through liquid”, “Cuvette”, “Paste cell”, “Cuvette cell”, “3 mm liquid cell”, “30 mm
liquid cell”, “Coarse sample with sample dump”
NSAS_Volume | String:
“1/4 full”, “1/2 full”, “3/4 full”, “Completely full”
5.18. Omnic
5.18.1 OMNIC
Type of data/instrument
FTIR, FT-NIR, Raman
Data dimensions
Single spectra
Instrument/hardware
Nicolet IR, Antaris, NXR
Vendor
Thermo Scientific (Nicolet)
File name extension
*.spa, *.spg
The source files contain one sample per file. Multiple selection allows several files (samples)
to be imported at the same time.
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears, allowing the user to set
the Tolerance for importing data with different start or end points.
Interpolate Tolerance Dialog
Once the Auto select matching spectra option has been checked, the files in the list that
have the same variables will be selected.
Use the Interpolate option to import data with different start or end points.
Sorting data
The file name, number of X-variables, wavelengths for the first and last X-variables, and step
(increase in wavelength), are displayed for each file.
Step is the increment in wavelength (or wavenumber) between two successive variables.
The following relationship should hold: last wavelength = first wavelength + step ×
(number of X-variables − 1).
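This relationship can be checked with the values from the JCAMP-DX example earlier in this chapter (##FIRSTX= 1100, ##DELTAX= 5, ##NPOINTS= 281, ##LASTX= 2500):

```python
# Last wavelength = first wavelength + step * (number of points - 1).
# Values from the JCAMP-DX example earlier in this chapter:
first_x, step, n_points = 1100.0, 5.0, 281
last_x = first_x + step * (n_points - 1)
# 1100 + 5 * 280 = 2500, matching ##LASTX= 2500
```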
Preview
Preview spectra displays a line plot of the files selected for import.
5.19. OPC
5.19.1 OPC protocol
Type of data/instrument
Standard data transfer protocol
Vendor
OPC Foundation
File format information
How to use it
All configured servers on the PC will be recognized, and displayed in the list of OPC servers.
The user must make selections for the Computer name/IP, the OPC Server, and the OPC
Group from the respective drop-down lists. The user also has provision to type in computer
name/IP, the OPC server, and the OPC Group. Once they have been selected, available items
will be given in the OPC Items list. An item is selected, and by clicking on GO, the data will be
generated from OPC, and populate the fields in the OPC Import Dialog. Click Stop to stop the
collection process from OPC, showing the data in the preview.
OPC Tag - Use this option to specify the OPC tag directly. This is useful when many OPC
groups and OPC items are available on the servers, as typing the tag avoids the delay of
listing and selecting individual OPC groups and items.
Update Rate - The rate (in milliseconds) at which data is retrieved from the OPC
Server.
Show preview - Check this option to see the last 10 rows retrieved from the OPC
Server.
Set number of columns - Use this option to increase the number of columns.
Filled OPC Dialog
5.20. OSISoftPI
5.20.1 PI
Type of data
PI Server - real time data collection, archiving and distribution engines
The PI Import dialog allows the user to specify and connect to an active server. Click Add to
search a PI Server for tags using the Tag Search dialog. This dialog allows the user to search
all connected PI Servers for tags meeting a given set of criteria, such as one or more tag
attribute values. Tags can be selected using the Search option. Three search options are
available in the Tag Search dialog: Basic, Advanced, and Alias.
Tag Search dialog
After the tags are selected (use Ctrl key for multiple tag selection) from the search list panel
and OK is clicked, they can be seen in the Tags window of the PI Import dialog. For more
details on options available in Tag Search dialog box, click on Help.
The three sections below describe the data modes used to preview and retrieve data for the
selected tags from the PI server.
The help option available in the PISDKUtility provides more details about the usage of PI-SDK
configuration utility.
5.21. PerkinElmer
5.21.1 PerkinElmer
Type of data/instrument
UV-Vis, NIR, FTIR, Raman
Data dimensions
Multiple spectra
Instrument/hardware
—
Software
Spectrum 6, Spectrum 10
Vendor
PerkinElmer
File name extension
*.sp, *.spp
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears, allowing the user to set
the Tolerance for importing data with different start or end points.
Interpolate Tolerance Dialog
Once Auto select matching spectra has been checked, the files in the list having the same
number of variables will be selected.
Use the Interpolate option to import data with different start or end points.
Sorting data
The file name, number of X-variables, and wavelengths for the first and last X-variables are
displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays a line plot of the files selected for import.
5.22. PertenDX
5.22.1 Perten-DX
Type of data/instrument
Vector and arrays. Standard
Data dimensions
Multiple spectra, constituents
Vendor
Perten Instruments following JCAMP/IUPAC
File name extensions
*.jdx, *.dx, *.jcm
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears, allowing the user to set
the Tolerance for importing data with different start or end points.
Interpolate Tolerance Dialog
Use the Interpolate option to import data with different start or end points.
Sorting data
The file name, number of samples, number of X variables, number of Y variables, and
wavelengths for the first and last X-variables are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays line plots of selected files for import.
General
Perten-DX supports additional tags specific to Perten Instruments. These are:
Tag name Imported in Unscrambler as
Perten-DX file
The example below shows a Perten-DX sample file.
##TITLE=2
##INSTRUMENT S/N=1201530
##INSTRUMENT TYPE=DA7250
##SPECTROMETER S/N=SNIR2148
##JCAMP-DX=4.24
##DATATYPE= NEAR INFRARED SPECTRUM
##LONG DATE=2013-10-18T01:59:18+02:00
##SAMPLE DESCRIPTION=2
##SMOOTHED=YES
##XUNITS= Nanometers (nm)
##YUNITS= Absorbance
##CONCENTRATIONS= (NCU)
(Protein Dry basis,-9.973E+23,<unknown>)
##PERTEN-TYPES= (KV)
(Product Type, Wheat),
(Shape Type, Unknown),
(Tray Type, Large Tray. rotating)
##PERTEN-REPACK=1
##PERTEN-REPEAT=1
##PERTEN-SAMPLEINFO= (KV)
##XFACTOR= 1.0
##YFACTOR= 0.000000001
##FIRSTX= 950.00
##LASTX= 1650.00
##NPOINTS= 141
##DELTAX= 5.0
##XYDATA= (X++(Y..Y))
950.0 186225975 188992413 193629553 199835249 207323496 215294014
222310809 227316331 230163481
995.0 231218537 230973747 229930179 228344771 226101418 223436221
220348573 216993825 213526732
1040.0 210076812 206678859 203519066 200372073 197183083 193896477
190813849 187961026 185361544
1085.0 183060794 181031311 179367942 178144637 177316150 176997467
177158004 178485737 182057610
1130.0 189131917 200696556 216125124 233953784 253292157 272636547
291094037 307752989 322292848
1175.0 335720686 348497384 360603909 370580710 377233357 380561567
380739361 377437577 370749286
1220.0 361610474 351741516 342353572 334328973 327783482 322877222
319254364 316585214 314597761
1265.0 313006114 311340643 309259709 306673122 303654410 300820687
298877629 297995673 298450579
5.23. RapID
5.23.1 RapID
Type of data
Array
Data dimensions
single vector spectrum
Instrument/hardware
Particle size analysers
Raman Spectrometers
Laser Induced Breakdown Spectrometers (LIBS)
Vendor
rap-ID Particle Systems
File name extension
.txt,.jcm
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Once Auto select matching spectra has been checked, only those files that have the same
number of variables will be selected.
Sorting data
The file name, number of samples, and number of X-variables are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays a line plot of the files selected for import.
5.24. U5Data
5.24.1 U5 Data
File name extension
*.UNS
Note: The Unscrambler® recognizes the extensions: .UNS, .UNM, .UNP, and .CLA.
Rename the files if they have other extensions.
5.25. UnscFileReader
5.25.1 The Unscrambler® 9.8
Type of data
Array
Software
The Unscrambler® 9.8
Vendor
CAMO Software
File name extensions
*.??M, *.??D
Statistics .10D
PCA .11M
Prediction .30D
Classification .31D
MLR .40M
PLS1 .41M
PLS2 .42M
PCR .43M
MSC .50D
The Unscrambler® 9.8 introduced a merged file format combining .??[DLPTW] into one file,
.??M.
A few details to remember about the file sets that comprise each data table or saved result:
When transferring data to another place using the Windows Explorer, make sure
that all the associated physical files are copied!
Do not change the file name extensions The Unscrambler® uses. Doing so may
create problems accessing the files from within The Unscrambler®.
The log and notes files are plain ASCII files which can be opened and viewed using a
text editor.
5.26. UnscramblerX
5.26.1 The Unscrambler® X
Type of data
Array
Software
The Unscrambler® X
Vendor
CAMO Software
File name extensions
*.unsb
After selecting the import target, click OK to enter the Import dialog.
5.27. Varian
5.27.1 Varian
Type of data/instrument
—
Data dimensions
Multiple spectra, constituents
Instrument/hardware
Cary UV-Vis
Software
—
Vendor
Varian, Inc.
File name extension
*.bsw
The source files may contain one or more samples per file. Multiple selections allow several
samples to be imported at the same time.
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
Sample naming…
Include sample names or sample numbers in the resulting data table.
Sample names will only be imported if they are present in the source file.
Interpolate
Checking the Interpolate option allows the import of data with different
starting and ending points, provided the number of points is the same in all sets to
be imported.
When the % button is selected, the following dialog appears, allowing the user to set
the Tolerance for importing data with different start or end points.
Interpolate Tolerance Dialog
Once the Auto select matching spectra option has been checked, the files in the list having
the same variables will be selected.
Use the Interpolate option to import data with different start or end points.
Sorting data
The file name, number of samples, number of X variables, number of Y variables, and
wavelengths for the first and last X-variables are displayed for each file.
The data table resulting from the import can be sorted based on any of these columns in the
file list: Click on a column header to set sort order, and a second time to reverse the sort
order.
Preview
Preview spectra displays a line plot of the files selected for import. A screenshot of the
Varian Import dialog with the preview spectra chosen is given below.
5.28. VisioTec
5.28.1 VisioTec
Type of data/instrument
Data dimensions
single vector spectrum or multiple spectra in an array
Instrument/hardware
Vendor
VisioTec
File name extension
The source files may contain one or many samples per file; multiple selection allows for the
import of several files (blocks of data) at the same time.
Multiple selections
Select one or more files to import by checking the check box next to each file, or by using
Auto select matching spectra.
The contents of all the selected spectra will be merged to create one data matrix during
import.
Deselect all
Clear the current selection by unselecting all samples.
Preview spectra
Check to review a plot of selected spectra before importing.
6. Export
6.1. Exporting data
This section describes how to export data from The Unscrambler®.
ASCII
JCAMP-DX
NetCDF
Matlab
AMO: The Unscrambler® ASCII Model
DeltaNu
6.2. AMO
6.2.1 Export models to ASCII
The Unscrambler® ASCII-MOD file is an ASCII-based file format used to transfer models from
The Unscrambler® to compatible instruments and prediction software.
Select model
A drop-down list contains all models found in the currently open project. Select the
one to export.
Type
Choose between Full and Short prediction storage, where the latter is used to
achieve a smaller file size when only the regression coefficients are used for
prediction.
PCs
The number of Principal Components or factors to include in the exported model.
Y-Variable
Select the Y-variables to be included with the model.
Press OK and use the file dialog to select the destination directory and give a file name to
save the model.
(Fragment of the matrix dimension table; only two rows survive: B0, 1 row, PC (1-a);
ResXValTot, PC (0-a).)
Note: The contents of the columns “Rows” and “Columns” show the contents of
the ASCII-MOD file, not the contents of the matrices in the main model file.
TYPE=FULL // (MINI,FULL)
VERSION=1
MODELNAME=F:\U\EX\DATA\TUTBPCA.11D
MODELDATE=10/27/95 11:41:13
CREATOR=Joe Doe
METHOD=PCA // (PCA, PCR, PLS1, PLS2)
CALDATA=F:\U\EX\DATA\TUTB.00D
SAMPLES=28
XVARS=16
YVARS=0
VALIDATION=LEVCORR // (NONE,LEVCORR,TESTSET,CROSS)
COMPONENTS=2
SUGGESTED=2
CENTERING=YES // (YES,NO)
CALSAMPLES=28
TESTSAMPLES=28
NUMCVS=0
NUMTRANS=2
TRD:DNO // ,,,,,,,complete transformation string
TRD:DSG // ,,,,,,,complete transformation string
NUMINSTRPAR=1
##GAIN=5.2
MATRICES=13
"xWeight" // (Name of 13 matrices)
"xCent"
"ResXValTot"
"ResXCalVar"
"ResXValVar"
"ResXCalSamp"
"Pax"
"Wax"
"SquSum"
"TaiCalSDev"
"xCalMean"
"xCalSDev"
"xCal"
%XvarNames
"Xvar1" "Xvar2" "Xvar3" "Xvar4"
"Xvar5" "Xvar6" "Xvar7" "Xvar8"
"Xvar9" "Xvar10" "Xvar11" "Xvar12"
"Xvar13" "Xvar14" "Xvar15" "Xvar16"
%xWeight 1 16
.1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01
.1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01
.1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01 .1000000E+01
.1000000E+01
%xCent 1 16
.1677847E+01 .2258536E+01 .2231011E+01 .2404268E+01 .2179311E+01
.2470489E+01 .2079168E+01 .1734536E+01 .1475164E+01 .1480657E+01
.1644097E+01 .1805900E+01 .1980229E+01 .1795443E+01 .1622796E+01
.1497418E+01
,,,
,,,etc.
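The header portion of this listing is line-oriented KEY=VALUE text with optional // comments, so it can be read with a few lines of code. The following Python sketch is a hypothetical illustration, not CAMO's reader; it handles only the simple header fields, not the % matrix blocks.

```python
def parse_amo_field(line):
    # "METHOD=PCA // (PCA, PCR, PLS1, PLS2)" -> ("METHOD", "PCA")
    line = line.split("//")[0].strip()   # drop the trailing comment
    key, _, value = line.partition("=")
    return key.strip(), value.strip()

header = {}
for raw in ["TYPE=FULL // (MINI,FULL)", "SAMPLES=28", "COMPONENTS=2"]:
    k, v = parse_amo_field(raw)
    header[k] = v
```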
Description of fields
The table below lists the data field codes used in ASCII-MOD files.
Field Description
VERSION Increases by one for each change of the file format after release
MODELDATE Date for creation of the model (not the ASCII-MOD file)
CREATOR Name of the user who made the model (not the ASCII-MOD file)
VALIDATION (TEST,LEV,CROSS)
CENTERING (YES,NO)
INSTRUMENT
See below
PARAM.
Number of matrices on this file. One name for each matrix follows
MATRICES
below
Transformation strings
There is one line for each transformation, and the format of the line depends on the type of
transformation. If a transformation needs more data, as is the case for MSC, this extra
data will be stored as matrices at the end of the file. References to these matrices are
made by name.
Examples
A transformation named TRANS using one parameter could look like this:
TRANS:TEMP=38.8;
An MSC transformation may look something like this:
MSC:VARS=19,SAMPS=23,MEAN="ResultMatrix18",TOT="ResultMatrix19";
Transformation strings may also contain an error status, which is the case when the MSC
base has been deleted from the file before the ASCII-MOD file was made.
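Such transformation strings can be split into a name and key=value parameters with a small helper. The sketch below is illustrative only; the helper name is an assumption, and the exact grammar of each transformation type is defined by the software, not by this example:

```python
def parse_transformation(line):
    """Split an ASCII-MOD transformation string into its name and parameters.

    Example input: 'MSC:VARS=19,SAMPS=23,MEAN="ResultMatrix18"'
    Returns the transformation name and a dict of its parameters.
    """
    line = line.strip().rstrip(";")          # a trailing ';' terminates the string
    name, _, params = line.partition(":")
    fields = {}
    if params:
        for item in params.split(","):
            key, _, value = item.partition("=")
            fields[key.strip()] = value.strip().strip('"')
    return name, fields

# the single-parameter example from the text
name, fields = parse_transformation('TRANS:TEMP=38.8;')
```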
Transformation string codes
Code  Description
CLA   Classification
PRE   Prediction
STA   Statistics
VAR   Variable
VEC   Vector
IMP   Import
REP   Replace
BAS   Baseline
RED   Reduce
TSP   Transpose
USR   User-Defined
Storage of matrices
Each matrix starts with a header line, as in this example:
%Pax 10 155
This states that the matrix is named Pax and has the dimensions 10 rows by 155 columns.
The data elements follow from the next line.
If the calibration model was made using one Y-variable, the AMO file uses PLS1; if it was
created using more than one Y-variable, it uses PLS2.
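As an illustration, one such matrix block can be read back with plain text processing. The reader below is a hypothetical sketch, assuming whitespace-separated values wrapped over several lines in row order, as in the %xCent listing above:

```python
def read_ascii_mod_matrix(lines):
    """Read one matrix from ASCII-MOD text lines.

    The first line is a header such as '%Pax 10 155' (name, rows, columns);
    the following lines hold rows*columns numbers (assumed row by row).
    """
    header = lines[0].split()
    name = header[0].lstrip("%")
    rows, cols = int(header[1]), int(header[2])
    values = []
    for line in lines[1:]:
        # Fortran-style numbers like '.1677847E+01' parse directly as floats
        values.extend(float(tok) for tok in line.split())
        if len(values) >= rows * cols:
            break
    # reshape the flat value list into rows of length 'cols'
    matrix = [values[r * cols:(r + 1) * cols] for r in range(rows)]
    return name, matrix

name, m = read_ascii_mod_matrix([
    "%xCent 1 4",
    ".1677847E+01 .2258536E+01",
    ".2231011E+01 .2404268E+01",
])
```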
6.3. ASCII
6.3.1 ASCII export
The ASCII export option is very useful if one wants to work with the data table in another
program.
Select the matrix and data ranges that make up the data to be exported, or use Define to
create a new range.
Options
Include headers
Specify whether sample names and variable names are to be exported by selecting them in
the Include headers field. They will be placed in the first column and in the first row,
respectively.
Name qualifier
String data, such as headers, may be quoted, using either double quotes ", or single
quotes '.
It is recommended to quote text and to leave numbers unquoted, because this makes it
easier for importing programs to assign the correct data types to text and numbers.
Default is ".
Numeric qualifier
Numeric data may be quoted in the same way as headers.
Default is None.
Item delimiter
Table cell entries may be delimited by different characters.
Default is ,.
String representation of missing data
Specify how missing data are to be coded in the ASCII file.
Default is m.
For compatibility with software that does not support importing missing data as strings,
use a large negative number, such as -9.9730e+023, instead.
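The combined effect of these options can be sketched in a few lines of Python. This is an illustration only, not the actual export code; the function name and the number formatting are assumptions:

```python
def write_ascii(rows, headers, delimiter=",", qualifier='"', missing="m"):
    """Return ASCII export text with quoted headers, unquoted numbers,
    and missing values (None) coded with the given string."""
    def fmt(cell):
        if cell is None:
            return missing                       # missing-data code
        if isinstance(cell, str):
            return f"{qualifier}{cell}{qualifier}"  # quote text only
        return repr(cell)                        # numbers stay unquoted
    lines = [delimiter.join(fmt(h) for h in headers)]
    for row in rows:
        lines.append(delimiter.join(fmt(c) for c in row))
    return "\n".join(lines)

# one sample row with a missing value in the last variable
text = write_ascii([["S1", 1.5, None]], ["Sample", "Var1", "Var2"])
```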
6.4. DeltaNu
6.4.1 DeltaNu
The DeltaNu file is a model file format developed for use with the DeltaNu Pharma-ID Raman
spectrometers. It contains all the necessary information for projection and classification. PCA
models created in The Unscrambler® X can be exported to this file format. Such models are
compatible with DeltaNu Raman instrumentation for real-time projections.
The files are saved with a .dnub file name extension.
Select model
A drop-down list contains all models found in the currently open project. Select the
one to export. Only PCA models are supported in the DeltaNu format.
PCs
The number of Principal Components to include in the exported model. The default is the
optimal number of PCs for the model, and exporting with this number is recommended. To
export with a different number of PCs, choose it from the drop-down list.
Press OK and use the file dialog to select the destination directory and give a file name to
save the model.
6.5. JCampDX
6.5.1 JCAMP-DX export
Select the matrix and data ranges that make up the data to be exported, or use Define to
create a new range.
Metadata
Then, in the File Info tab, enter information related to the JCAMP-DX file as a whole. Here
one must choose between two JCAMP-DX formats: XYPoints and XYData. XYData requires
that the distance between consecutive variables is the same throughout the whole X-variable
set, and it produces smaller files than XYPoints.
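Whether a data set qualifies for XYData can be checked by verifying that consecutive X-values are equally spaced. The sketch below is a hypothetical check; the tolerance used by the actual exporter is an assumption:

```python
def is_equally_spaced(x, tol=1e-9):
    """Return True if consecutive x values have a constant step, i.e. the
    data can use the compact XYData form instead of XYPoints."""
    if len(x) < 3:
        return True
    step = x[1] - x[0]
    # every consecutive difference must match the first step within 'tol'
    return all(abs((b - a) - step) <= tol for a, b in zip(x, x[1:]))
```

For example, a wavelength axis sampled every 2 nm qualifies for XYData, while an axis with one irregular step does not.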
JCAMP-DX export dialog: File info
Title
Name of the data set
Origin
Can be the name of the lab, client name, batch number, or location where data
came from.
Owner
Name of the person conducting the experiment or the analysis.
Enter information related to the samples in the Samples Info tab. This information is saved
with each sample.
JCAMP-DX export dialog: Sample info
Sample names
Select either Use sample name from data table or Use text to specify manually
Sampling procedure
Details on how the data was collected.
Data processing
List the transformations applied to prepare the data.
Data type
Select appropriate value from the drop-down list.
X units
Select appropriate value from the drop-down list.
Y units
Select appropriate value from the drop-down list.
Click OK to save the file.
6.6. Matlab
6.6.1 Matlab export
The Unscrambler® provides the capability to export data tables to Matlab including sample
names (row headings in The Unscrambler®) and variable names (column names in The
Unscrambler®).
Select the matrix and data ranges that make up the data to be exported, or use Define to
create a new range.
Options
Select whether sample and variable names should be exported. If this option is selected,
these names are stored in separate arrays within the exported file, as is common practice in Matlab.
Select Use Compression to use gzip-compression for arrays stored to the Matlab file. This
will reduce the file size.
The exported data is saved as filename.mat, where “filename” represents the name
entered for the file on saving.
6.7. NetCDF
6.7.1 NetCDF export
Select the matrix and data ranges that make up the data to be exported, or use Define to
create a new range.
Metadata
In the field Global Attributes, enter all other relevant details:
Data set origin
Can be the name of the lab, client name, batch number, or location where data
came from.
Equipment ID
Can be the product name, product number, serial number, or IP address of the
instrument used.
Equipment manufacturer
Name of the instrument vendor.
Equipment type
Type of instrument used, e.g. NIR.
Operator name
Name of the person conducting the experiment or the analysis.
Experiment date time
Date and time of the data collection. It is suggested to enter the date according to
the ISO 8601 standard, e.g. 2010-01-27T09:55:41+0100.
All attributes are optional. It is generally recommended to add metadata to files for better
file search results.
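A timestamp in the suggested ISO 8601 form can be produced with Python's standard datetime module. This is a sketch for illustration; note that isoformat() writes the offset as +01:00, while the manual's example uses the equally valid +0100 form:

```python
from datetime import datetime, timezone, timedelta

# a fixed example instant in a +01:00 zone (chosen to match the manual's example)
tz = timezone(timedelta(hours=1))
stamp = datetime(2010, 1, 27, 9, 55, 41, tzinfo=tz).isoformat()
# 'stamp' now holds an ISO 8601 string suitable for the attribute value
```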
6.8. UnscFileWriter
6.8.1 Export models to The Unscrambler® v9.8
The Unscrambler® 9.8 file is the previous file format and models in this format contain all
the necessary information for prediction and classification. Models (PCA, MLR, PCR and PLS)
created in The Unscrambler® X can be exported to this previous file format using the File
writer plug-in. Such models are compatible with OLUP and OLUC 9.8 software for real-time
classification and prediction.
Model  File extension
PCA    .11M
MLR    .40M
PLS1   .41M
PLS2   .42M
PCR    .43M
Available models
A drop-down list contains all models found in the currently open project that can be
exported. Select the one to export.
Model Information
This contains details about the selected model.
Notes
The time the chosen model was created is given here, along with any other
information that has been added to the Notes section of the chosen model. Users
may also add additional information in the Notes section, which will be available in
the exported model.
Save model with components
Use the components box to select the correct number of components for saving the
model in 9.8 format. The set number of components for the model will be displayed
and used by default.
Save as micro model
The check box allows the user to save the model in the 9.8 micro format.
Press OK and use the file dialog to select the destination directory and give a file name to
save the model.
7. Plots
7.1. Line plot
A line plot displays a single series of numerical values with a label for each element. The plot
has two axes:
The horizontal axis shows the labels, in the same physical order as they are stored in
the source file;
The vertical axis shows the scale for the plotted numerical values.
With Symbols
Symbols produce the same visual impression as a 2-D scatter plot (see Scatter Plot),
and are therefore not recommended.
Line plot: symbol display
Several series of values which share the same labels can be displayed on the same line plot.
The series are then distinguished by means of colors.
Line plot: 2 series with curve display
The horizontal axis shows the labels, in the same physical order as they are stored in
the source file;
The vertical axis shows the scale for the plotted numerical values.
Several series of values which share the same labels can be displayed on the same bar plot.
The series are then distinguished by means of colors, and an additional layout is possible:
accumulated or stacked bars. Accumulated bars are relevant if the sum of the values for
series1, series2, etc. has a concrete meaning (e.g. total production or composition).
Two layouts of a bar plot for two series of values: Bars and Accumulated Bars
A regression line visualizing the relationship between the two series of values
Plot statistics, including among others the slope and offset of the regression line
(even if the line itself is not displayed) and the correlation coefficient.
All the plots can be customized. This is done from the properties dialog, which is accessed
by right-clicking on the plot and selecting the Properties menu.
Background
Header: title, color, font, visibility, color of the background
Legend: title, color, font, visibility, color of the background
Plot Area: Chart area, color, font, visibility, borders, surface
Properties Appearance
For the Header and Legend the text can be edited. One can customize the name,
such as only having part of the name displayed, the font and the color.
Properties Header
Graphic Objects
It is possible to include some graphical objects in the plot such as line, arrow,
rectangle, ellipse and text. Each of those objects can be configured in terms of color,
thickness and font if necessary.
3-D scatter plots can be enhanced by:
Addition of vertical lines
They “anchor” the points and can facilitate the interpretation of the plot.
A 3-D Scatter plot displayed with anchors
To add vertical lines, click on More (see section below on Additional Options).
Rotation
The plot can be rotated so as to show the relative positions of the points from a
more relevant angle; this can help detect clusters. Click on the plot and move it with
the cursor in the appropriate direction.
A 3-D Scatter plot after rotation
The axes can be interchanged in the plot, using the arrows on the toolbar. If more than three
columns are selected, the axes can be changed from the drop-down lists next to the axis
arrows on the toolbar.
Additional options
Click on More to access more options for 3D scatter plots.
Scroll through the Gallery, Data, and 3D-View options to customize the appearance of
3-D scatter plots. These features are described in the following sections.
3D Scatter plot gallery
Select from the gallery of plots to obtain the desired appearance of the plot.
3-D Scatter plot data
The rotation, perspective, and axis scales can be changed under the 3-D view tab.
The first two axes show the labels, in the same physical order as they are stored in the
source file;
The vertical axis shows the scale for the plotted numerical values.
Depending on the layout, the third axis may be replaced by a color code indicating a range of
values (contour plot), thus making the surface plot essentially a contour plot, or a map plot
when viewed straight from above. The layout can be changed by right-clicking on the plot
and selecting Plot type for a shortcut to predefined layouts, or Properties to customize
3-D plots and make changes to the axes, legends, etc.
The Plot type submenu
The points can either be represented individually, or summarized according to one of the
following layouts:
Surface
It shows the table as a 3-D landscape.
Matrix plot with a landscape display
Contour
The contour plot has only two axes. A few discrete levels are selected, and points
(actual or interpolated) with exactly those values are shown as a contour line. It
looks like a geographical map with altitude lines;
Matrix plot with a contour display
This option is accessible from Plot type – Contour, or the Properties of the plot:
Surface plot menu
Map
On a map, each point of the table is represented by a small colored square, the color
depending on the range of the individual value. The result is a completely colored
rectangle, where zones sharing close values are easy to detect. The plot looks a bit
like an infrared picture.
This option is accessible from Plot type – Map, or the Properties of the plot, the
option is Scatter chart, zoned, 2D projection.
Scatter plot menu
Bars
This option gives roughly the same visual impression as the landscape plot if there
are many points, otherwise the “surface” appears more rugged.
Matrix plot with a 3-D bar display
3-D-Scatter is also accessible via this Properties menu, see 3-D scatter plot for help on that
plot.
The histogram is one of the seven basic tools of quality control, which also include the
Pareto chart, check sheet, control chart, cause-and-effect diagram, flowchart, and scatter
diagram.
If the points are close to a straight line, the distribution is approximately normal
(Gaussian).
Normal probability plot showing a series following a Normal distribution
Once the variables are selected, click OK and the plot will appear in the viewer.
Multiple scatter plot
If more than four variables have been selected for the multiple scatter plot, others can be
displayed by choosing them from the drop-down list on the diagonal of the plots.
Variable drop-down list menu
This is what has been done in the special plot “Mean and SDev”.
Special plot: Mean and SDev
To remove drawing objects from plots, use either the Edit - Undo option (or toolbar
button), or select the drawing object with the mouse pointer and press the Delete key.
Sample Selection : Select whether the marked or unmarked samples (or both)
should be extracted from the model, and give the ranges informative names. By
default the marked and unmarked sample ranges will be named Outliers and Good
Samples, respectively.
Create Range : The new range will be created based on one or more data tables
available in the project navigator. All data tables with the correct number of rows
will be listed in this frame. Use the radio buttons to define whether a new data table
should be created or if the ranges should be added to existing tables. As an
additional quality control it is possible to list only data tables with matching sample
names. A yellow warning sign next to a table indicates that the sample names are
missing or non-matching.
Line plot
Bar plot
Scatter plot
3-D scatter plot
Matrix plot
Histograms
In addition, to cover a few special cases, two more kinds of representations are provided:
Table plot
Special plot
Interpreting plots
To get specific information on all the available plots for each analysis, see the specific Plot
sections under the respective methods.
Design of Experiments
Descriptive statistics
Statistical tests
Principal Component Analysis (PCA)
Multiple Linear Regression (MLR)
Principal Components Regression (PCR)
Partial Least Squares Regression (PLS)
L-shaped PLS Regression (L-PLS)
Multivariate Curve Resolution (MCR)
Cluster analysis
Projection
SIMCA
Prediction
It is also possible to enter the dialog from the icon in the Mark Toolbar.
Number of samples
Number of calibration samples to select with the K-S algorithm. The default is 15.
Number of components
The number of components to use for the selection. The default is the optimal
number as found in the model.
Pre-Select samples - Include already marked samples
When selected, any marked samples in the score plot will be included in the
calibration sample set in addition to those identified by the K-S sample selection.
Pre-Select samples - Manually pre-select samples
Opens the Select samples dialog window for selecting samples from the data matrix
to be included in the calibration sample set.
Augment set with boxcar samples
Works only for PCR and PLSR models. When checked, the initial calibration set from
K-S will be augmented with samples to produce a more uniform distribution of
response values. Additional options are available for setting the number of bins for
boxcar samples and the number of samples to select from the sample selection.
This option will be disabled if Select validation samples is checked.
Create row set as new matrix
When selected, the samples will be extracted into a new matrix, with KS-Calibration
and optionally KS-Validation row sets added.
Create row set in selected matrix(es)
When selected, Calibration and optionally Validation row sets will be added to the
selected, matching matrices.
Allow mis-matching sample names
When not checked, only matrices with identical sample names in the same order will
be listed. An exclamation mark is shown for the matrices where the sample names
do not match.
The figure below shows the score plot after specifying 15 samples for calibration and
validation. The calibration samples are marked with green rectangles and the validation
samples with orange triangles.
The score plot with marked calibration and validation samples
When the option to create the sample set in selected Matrices is chosen, the matrices will
be added in the project navigator as shown below:
If the option to Create row set as new matrix has been chosen, a matrix with the name of
the X matrix from the scores plot will be created with KS appended to the matrix name.
7.16. Marking
It is often useful to mark some samples or variables in a plot. Several marking tools are available:
One by one
This option enables one to use the cursor to select an item to mark by clicking on it.
Rectangular
This option allows several grouped samples to be selected at the same time. The
cursor is transformed into a pointer that will allow the user to define the top left
corner and the bottom right corner of the rectangle.
Samples marked with rectangle option
The different types of marking can be accessed from Edit - Mark or from toolbar shortcuts.
Lasso
This option activates a cursor that is used to define a free-form area; all samples
inside the area will be marked. To define the area, click and hold while tracing its
contour. When the mouse button is released, the selection is made.
Samples marked with lasso
7.16.2 How to create a new range of samples or variables from the marked items
Once some samples / variables are selected in a plot it is possible to create a new range
including them. To do so right click on the plot with the selected items and select the option
Create Range.
Menu create range
For all raw data plots and for model plots of variables (e.g. PCA loadings), the new range
appears under the corresponding data table node with the default name “RowRange” or
“ColumnRange”.
New range created
When a sample range is created from within a model scores plot, a dialog is opened to allow
sample extraction into a new or existing data table. See the extract samples documentation
for details.
Without Marked…
The marked samples and/or variables are not included in the analysis; the unmarked
ones are.
Background
Header: title, color, font, visibility, color of the background
Legend: title, color, font, visibility, color of the background
Point Label: color, font, visibility
Axis Label: title, color, font, visibility, borders
Properties Appearance
For the Point Label and Axis Label the text can be edited. One can customize the
name, such as only having part of the name displayed. For this option use the
drop-down list in Label layout - Show.
Properties: Point Label
Graphic Objects
It is possible to include some graphical objects in the plot such as line, arrow,
rectangle, ellipse and text. Each of those objects can be configured in terms of color,
thickness and font if necessary.
Properties Appearance
Chart properties
It is possible to further customize the chart properties by selecting More, which will
open up the Chart properties dialogue. Here one can define simple or complex chart
types from the options in the chart gallery. Further selection of chart properties can
be made, and the chart previewed.
Chart Properties
Background
Header: title, color, font, visibility, color of the background
Legend: title, color, font, visibility, color of the background
Plot Area: Chart area, color, font, visibility, borders, surface
Properties Appearance
For the Header and Legend the text can be edited. One can customize the name,
such as only having part of the name displayed, the font and the color.
Graphic Objects
It is possible to include some graphical objects in the plot such as line, arrow,
rectangle, ellipse and text. Each of those objects can be configured in terms of color,
thickness and font if necessary.
Properties Graphic Objects
Chart properties
It is possible to further customize the chart properties by selecting More, which will
open up the 3D Chart properties dialogue. Here one can define the chart types from
the options in the chart gallery.
Chart Properties
Additional options of a 3-D plot can be changed from the tab in the properties dialog. In the
Data tab, the layout of the data can be changed.
3-D Scatter plot data properties dialog
The rotation, perspective, and axis scales can be changed under the 3-D view tab.
3-D Scatter plot 3-D view properties dialog
Select where the plot should be stored in the field Save in.
Enter a name for the plot in the field File name and select a format.
Types of format
There are six possible graphics file formats available for compatibility with many needs:
EMF
Use the EMF format which is vector graphics whenever possible. Vector graphics can
be scaled and will give the best quality.
Compatibility: EMF support is often limited to Microsoft applications. When sending
the plot graphics file for instance by email, a recipient may encounter problems
viewing and reusing it.
PNG
The second choice is PNG, which is raster graphics, and does not look as good when
enlarged.
This format is most suitable for web publishing and email.
This will generally result in smaller files than the following formats.
Compatibility: 5-10 year old applications may not support this image format.
Select one of the above formats. The following formats are also raster graphics, each having
its limitations; they are included only for compatibility.
GIF
Limited to 256 colors.
JPEG
Lossy compression that will give artifacts. (JPEG is best suited for photographic
images.)
TIFF
Will produce larger files.
BMP
Will produce larger files.
Available image formats
Pasting plots
Depending on the application to be used there may be different options such as the shortcut
Ctrl+V or from an Edit menu.
A common dialog appears when selecting any of the plotting options from Plot:
Line
Bar
3D Scatter
Matrix
Histogram
Normal Probability
Multiple Scatter
Define the row and column ranges from predefined ranges using the drop-down list.
To use new ranges, click on the icon that looks like a matrix to access a matrix from the
project navigator, and on Define to access the Define Range dialog.
Plot scope dialog
To use data that are part of a results matrix, use the select result matrix button to
choose the desired results matrix.
Min/Max
Selects the samples most separated in the data set.
A number of extreme samples will be picked out for each PC, according to the
specification in the right column in the table below the method choice. It will be
labeled Number of min/max, and for each min/max selected, two extreme samples
are marked (max and min value). Thus, setting the number to 2 will mark a total of
four samples.
Classes
The samples will be divided into a number of classes for each PC. One pair of
extreme samples (max and min value) will be picked out for each PC, according to the
user's specification in the right column of the list below the Methods field. It will be
labeled Number of classes, and for each class, two extreme samples are marked.
Thus, setting the number to 2 will mark a total of four samples.
Then, in the list below the method choice, specify the number of PCs (listed in the left
column) for which to mark samples, and how many (listed in the right column). No samples
are marked for PCs with 0 in the right column, i.e., in the above figure, only PC 1 is marked.
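The Min/Max rule can be sketched as a small selection function operating on a plain list of score values for one PC. This is an illustration only, not the actual marking code; the function name is an assumption:

```python
def minmax_marks(scores, n_pairs):
    """Return the indices of the n_pairs highest and n_pairs lowest score
    values for one PC, i.e. 2*n_pairs marked samples in total."""
    # sort sample indices by their score value on this PC
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:n_pairs] + order[-n_pairs:])

# setting the number to 2 marks a total of four samples, as described above
marked = minmax_marks([0.3, -1.2, 2.5, 0.1, -0.7, 1.9], 2)
```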
Zoom-out
To zoom out of a displayed plot, zooming out from the center of the area, there are
two options:
Frame-scale
To zoom in on a specific area, it is more convenient to define the area to zoom into
with a rectangle. To access this functionality, use the Frame-scale button.
A cross will appear, which is to be used to define the area to zoom into. A dotted
rectangle will appear around the defined frame and when releasing, the zoom will
be performed.
Defining the frame to zoom-in
Move
It is possible to move inside the plot itself. To do so use the keyboard: Ctrl+Shift.
Auto-scale
To come back to the original view of the plot defined by The Unscrambler® use the
Auto-scale button
Using the mouse wheel will zoom the points and bars within the cube
Using Ctrl+Left mouse drag up and down will zoom the cube itself
From the viewer, one can resize the four-pane view by dragging the center + handle.
8. Design of Experiments
8.1. Experimental design
Experimental design is a strategy to gather empirical knowledge, i.e. knowledge based on
the analysis of experimental data and not on theoretical models. It can be applied when
investigating a phenomenon in order to gain understanding or improve performance.
Building a design means carefully choosing a small number of experiments that are to be
performed under controlled conditions.
Learn about the concepts and methods of experimental design in the Introduction to Design
of Experiments section.
Learn how to use the Design of Experiments tools offered by The Unscrambler®:
DoE basics
Why use experimental design?
What is experimental design?
Investigation stages and design objectives
Screening
Factor Influence Study
Optimization
Available designs in The Unscrambler®
Types of variables in experimental design
Design vs. non-design variables
Continuous vs. category variables
Mixture variables
Process variables
Designs for unconstrained screening situations
Full-factorial designs
Fractional-factorial designs
Plackett-Burman designs
Designs for unconstrained optimization situations
Central composite designs
Box-Behnken designs
Designs for constrained situations
Mixture designs
Axial designs: Screening of mixture components
Simplex-centroid designs: Optimization of mixtures
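Of the standard designs listed above, the full-factorial design is the simplest to enumerate: its runs are the Cartesian product of the factor levels. The sketch below is a minimal illustration; actual design generation in The Unscrambler® also handles randomization, center samples, and the other design families:

```python
from itertools import product

def full_factorial(levels):
    """Enumerate all runs of a full-factorial design.

    'levels' maps each design variable to its list of levels; the number of
    runs is the product of the numbers of levels per variable.
    """
    names = list(levels)
    return [dict(zip(names, combo)) for combo in product(*levels.values())]

# two levels of Temperature times three levels of pH gives 2 * 3 = 6 runs
# (variable names and levels here are made up for illustration)
runs = full_factorial({"Temperature": [120, 160], "pH": [3, 5, 7]})
```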
Obtain historical data (from a database, from plant records, etc.). However, such
data may be biased by changes occurring during the period between acquisition and
analysis. It is nevertheless a good start for identifying general trends and ideas.
Collect new data: record measurements directly from the production line, for
example, make observations in fish farms, process development lab, formulation
lab, etc. This will ensure that the data apply to the system being studied today (not
another system, three years ago). However most processes tend to be kept under
tight control and variation is minimal. This may lead to problems finding enough
variability to develop a model.
Run specific experiments by disturbing (exciting) the system being studied. Thus the
data will encompass more variation than is to be naturally expected in a stable
system running as usual.
Design experiments in a structured, mathematical way. By choosing symmetrical
ranges of variation and applying this variation in a balanced way among the
variables being studied, one will end up with data where effects can be studied in a
simple and powerful way. With designed experiments there is a better possibility of
testing the significance of the effects and the relevance of the whole model.
Define the objective of the investigation: e.g. “better understand” or “sort out
important variables” or “find the optimum conditions”.
Define the variables that will be controlled during the experiment (design variables),
and their levels or ranges of variation.
Define the variables that will be measured to describe the outcome of the
experimental runs (response variables), and examine their precision.
Choose among the available standard designs the one that is compatible with the
objective, number of design variables and precision of measurements, and has a
reasonable cost.
Most of the standard experimental designs can be generated in The Unscrambler® once the
experimental objective, the number (and nature) of the design variables, the nature of the
responses and the economical number of experimental runs have been defined. Generating
such a design will provide the user with the list of all experiments to be performed in order
to gather the required information to meet the objectives.
variable on the responses with the help of a screening design. The variables which have
“large” effects can be considered as important. The isolated effects of single variables are
known as main effects and the purpose of screening designs is to isolate these only. There
are several ways to judge the importance of a main effect, for instance significance testing or
use of a normal probability plot of effects.
Some screening designs are capable of estimating interaction effects. These occur when the
effect of changing one variable depends on the level of other variables in the study. Some
variables may be important even though they do not seem to have an impact on the
response by themselves. The reason is that the presence of interaction effects may mask
otherwise significant main effects.
Models for screening designs
The user must choose the adequate form of the model that relates response variations to
variations in the design variables. This will depend on how precisely one wants to screen the
potentially influential variables and describe how they affect the responses. The
Unscrambler® contains two standard choices:
The simplest form is a linear model. Choosing a linear model will allow one to
investigate main effects only with possible check for curvature effect;
To study the possible interactions between several design variables, one will have to
include interaction effects in the model in addition to the linear effects.
When building a mixture or D-optimal design, one must choose a model form explicitly,
because the adequate type of design depends on this choice. For other types of designs, the
model choice is implicit in the design that has been selected.
Factor Influence Study
After an initial screening design has been performed and a number of important variables
have been isolated, a factor influence study can be performed using full factorial or high-resolution fractional factorial designs. These are used to further study the main effects of the variables, and also to investigate interactions of various orders: two-factor interactions involve two design variables, three-factor interactions involve three variables, etc. The importance of an interaction can be assessed with the same tools as for main
effects.
Design variables that have an important main effect are important variables. Variables that
participate in an important interaction, even if their main effects are negligible, are also
important variables. The models generated in a factor influence study usually perform well
as predictive models and form the basis for optimization designs.
Optimization
At a later stage of investigation, when the variables that are important are already known,
one may wish to study the effects of these variables in more detail. Such a purpose will be
referred to as optimization. At the analysis stage this is also referred to as response surface
modeling.
Objectives of optimization
Optimization designs actually cover quite a wide range of objectives. They are particularly
useful in the following cases:
Maximizing a single response, i.e. to find out which combination of design variable
levels leads to the maximum value of a specific response, and what this maximum
response is.
Minimizing a single response, i.e. to find out which combination of design variable
levels leads to the minimum value of a specific response, and what this minimum is.
Finding a stable region, i.e. to find out which combination of design variable levels
corresponds to a specific target response, with the added criterion that small
deviations from those settings would cause negligible change in the response value.
Finding a compromise between several responses, i.e. to find out which combination
of design variable levels leads to the best compromise between several responses.
Describing response variations, i.e. to model response variations inside the
experimental region as precisely as possible in order to predict what will happen if
the settings of some design variables were changed in the future.
Models for optimization designs
The underlying idea of optimization designs is that the model should be able to describe a
response surface which has a minimum or a maximum inside the experimental range. To
achieve that purpose, linear and interaction effects are not sufficient. An optimization model
should also include quadratic effects, i.e. square effects, which describe the curvature of a
surface.
A model that includes linear, interaction and quadratic effects is called a quadratic model.
Design types, their objectives and fields of use

Fractional Factorial Design (Screening, Factor Influence; 3 - 13 design variables): Depending on the number of variables, choose to study lower order effects independently from each other, or create a screening design aimed at finding the most important main effects among many.

Box-Behnken Design (Optimization; 3 - 6 design variables): An alternative to central composite designs, when the optimum response is not located at the extremes of the experimental region and when previous results from a factorial design are not available. All design variables must be continuous.

Axial (Mixture) Design (Screening; 3 - 20 design variables): Contains mixture variables only; the design region is a simplex. Only linear (first order) effects can be found.

Simplex-Lattice (Mixture) Design (Screening, Factor Influence, Optimization; 3 - 6 design variables, 9 if linear only): Contains mixture variables only; the design region is a simplex. The lattice degree (order) is tuneable.

Simplex-Centroid (Mixture) Design (Optimization; 3 - 6 design variables): Contains mixture variables only; the design region is a simplex.
A D-Optimal design will be used with mixture variables if the experimental region is not a
simplex, or if there is a combination of mixture and process variables in the design. The
design region is often non-simplex when upper limit constraints are added to some of the
mixture components.
Design variables
A design variable is characterized by:
Its name;
Its type: continuous or category;
Its constraints: mixture, linear;
Its levels.
Response variables
These are the first type of non-design variable: the measured output variables that describe the outcome (usually a quality attribute) of the experiments. These variables are often the subject of an optimization.
Non-controllable variables
This second type of non-design variable refers to variables that can be monitored and may have an influence on the response variables, but that cannot be controlled or reliably fixed to a value; for example, the air humidity or the temperature in a plant.
Continuous vs. category variables
All variables have a pre-defined format or data type, and this format defines how the
variables are treated numerically and how they should be interpreted.
Continuous variables
All variables that have numerical values and that can be measured quantitatively are called
continuous variables. Note that this definition also covers discrete quantitative variables,
such as counts. It reflects the implicit use which is made of these variables, namely the
modeling of their variations using continuous functions.
Examples of continuous variables are: temperature, concentrations of ingredients (e.g. in %),
pH, length (e.g. in mm), age (e.g. in years), number of failures in one year, etc.
The variations of continuous design variables are usually set within a predefined range,
which goes from a lower level to an upper level. Those two levels have to be specified when
defining a continuous design variable. More levels between the extremes may be specified if
the values are to be studied more specifically.
If only two levels are specified, the other necessary levels will be computed automatically.
This applies to center samples (which use a mid-level, halfway between lower and upper),
and axial (star) samples in optimization designs (which use extreme levels outside the
predefined range).
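As an illustration only (this is not Unscrambler® code), the automatic computation of center and axial levels from a two-level specification can be sketched in Python; the axial distance used here is an assumption (the rotatable value for a two-factor central composite design):

```python
# Illustrative sketch, not part of The Unscrambler(R): derive the implicit
# levels of a continuous design variable from its lower and upper levels.
def derived_levels(low, high, alpha=2 ** (2 / 4)):
    """Return the center level and the two axial (star) levels.

    alpha is the axial distance in coded units; the default is the
    rotatable value for a two-factor central composite design."""
    center = (low + high) / 2                 # mid-level for center samples
    half_range = (high - low) / 2
    axial_low = center - alpha * half_range   # star point below the range
    axial_high = center + alpha * half_range  # star point above the range
    return center, axial_low, axial_high

# e.g. a temperature variable defined between 20 and 60 degrees C
center, ax_lo, ax_hi = derived_levels(20.0, 60.0)
```

The axial levels fall outside the predefined range, as described above.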
Category variables
In The Unscrambler®, all non-continuous variables are called category variables. Their levels
can be named, but not measured quantitatively. Examples of category variables are: color
(Blue, Red, Green), type of catalyst (A, B, C, D), place of origin (Africa, The Caribbean Islands,
…), etc.
Binary variables are a special type of category variables that have only two levels
(sometimes referred to as dichotomous). Examples of binary variables are: use of a catalyst
(Yes/No), recipe (New/Old), type of electric power (AC/DC), type of sweetener (Artificial/
Natural), etc.
For each category variable, the user must specify all levels. The number of levels can vary from 2 to 20.
Note: Since there is a kind of quantum jump from one level to another (there is no
intermediate level in between), center samples cannot be defined for category
variables. If there is a mix of category and continuous variables in the design, center
samples are defined for all continuous variables at each level of the category
variables.
Mixture variables
When performing experiments where some ingredients are mixed according to a recipe, one
may be in a situation where the amounts of the various ingredients cannot be varied
independently from each other. In such a case, one will need to use a special kind of design
called a Mixture design, and the design variables are called mixture variables (or mixture
components).
An example of a mixture situation is blending concrete from the following three ingredients:
cement, sand and water. If the percentage of water in the blend is increased by 10%, the
proportions of one of the other ingredients (or both) will have to be reduced so that the
blend still amounts to 100%.
However, there are many situations where ingredients are blended, which do not require a
mixture design. For instance in a water solution of four ingredients whose proportions do
not exceed a few percent, one may vary the four ingredients independently from each other
and just add water at the end as a “filler”. Therefore it is important to carefully consider the
experimental situation before deciding whether the recipe being followed requires a mixture
design or not!
Process variables
In a mixture situation, one may also want to investigate the effects of variations in some
other design variables which are not themselves a component of the mixture. Such variables
are called process variables in The Unscrambler®, and these are analyzed using a D-optimal
design.
Typical process variables are: temperature, stirring rate, type of solvent, amount of catalyst,
etc.
Fractional-factorial designs
In the specific case where there are only two-level variables (continuous with lower and
upper levels, and/or binary variables), one can define fractions of full factorial designs that
enable the investigation of as many design variables as the chosen full-factorial designs with
fewer experiments. These “economic” designs are called fractional factorial designs.
Given that a full-factorial design suitable for the investigation has already been defined, a
fractional design might be set up by selecting half the experimental runs of the original
design. For instance, one might study the effects of three design variables with only 4 (2³⁻¹) instead of 8 (2³) experiments. Larger factorial designs admit fractional designs with a higher degree of fractionality, i.e. even more economical designs, such as investigating nine design variables with only 16 (2⁹⁻⁵) experiments instead of 512 (2⁹). Such a design is referred to as a fractional design with a degree of fractionality of 5: nine variables are investigated at the usual cost of four (thus saving the cost of five).
Full-factorial design 2³
Experiment A B C
1 – – –
2 + – –
3 – + –
4 + + –
5 – – +
6 + – +
7 – + +
8 + + +
In the table below additional columns are generated, which are computed from the products
of the original three columns A, B, C. These additional columns represent the interactions
between the design variables.
Full-factorial design 2³ with interaction columns
Experiment A B C AB AC BC ABC
1 – – – + + + –
2 + – – – – + +
3 – + – – + – +
4 + + – + – – –
5 – – + + – – +
6 + – + – + – –
7 – + + – – + –
8 + + + + + + +
The above design table is an example of an orthogonal table, i.e. the effect of each column
(main effect and interaction) can be estimated independently of each other.
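This orthogonality is easy to verify numerically. The following Python sketch (purely illustrative; The Unscrambler® builds such designs through its interface) constructs the 2³ table with its interaction columns and checks that all distinct columns are mutually orthogonal:

```python
import itertools
import numpy as np

# Sketch: build the 2^3 full factorial in coded units (-1/+1) and append
# the interaction columns AB, AC, BC, ABC as element-wise products.
runs = np.array(list(itertools.product([-1, 1], repeat=3)))  # columns A, B, C
A, B, C = runs.T
X = np.column_stack([A, B, C, A * B, A * C, B * C, A * B * C])

# Orthogonality: every pair of distinct columns has zero dot product,
# so X'X is diagonal (8 on the diagonal, one entry per run).
assert np.array_equal(X.T @ X, 8 * np.eye(7, dtype=int))
```

Because X'X is diagonal, each main effect and interaction can be estimated independently of the others.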
In the table below, the column representing the highest degree of interaction (the ABC
interaction) is assigned to the variable, D, as it is assumed that the ABC interaction is
negligible:
Fractional factorial design 2⁴⁻¹
Experiment A B C D
1 – – – –
2 + – – +
3 – + – +
4 + + – –
5 – – + +
6 + – + –
7 – + + –
8 + + + +
This new design allows the main effects of the four design variables to be studied
independently of each other; but what about their interactions? The table below shows all
of the two-factor interactions calculated after setting D = ABC.
Fractional factorial design 2⁴⁻¹ with interaction columns
Experiment A B C D AB = CD AC = BD BC = AD
1 – – – – + + +
2 + – – + – – +
3 – + – + – + –
4 + + – – + – –
5 – – + + + – –
6 + – + – – + –
7 – + + – – – +
8 + + + + + + +
This table shows that each of the last three columns is shared by two different interactions
(for instance, AB and CD share the same column).
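The construction of this fraction, and the resulting aliases, can be sketched as follows (illustrative Python, not Unscrambler® code):

```python
import itertools
import numpy as np

# Sketch: construct the 2^(4-1) fraction by assigning D = ABC in the
# 2^3 full factorial, then verify the resulting confounding.
base = np.array(list(itertools.product([-1, 1], repeat=3)))
A, B, C = base.T
D = A * B * C                         # generator: D = ABC

# Two-factor interactions come in confounded (aliased) pairs:
assert np.array_equal(A * B, C * D)   # AB = CD
assert np.array_equal(A * C, B * D)   # AC = BD
assert np.array_equal(B * C, A * D)   # BC = AD

# The four main-effect columns remain mutually orthogonal:
X = np.column_stack([A, B, C, D])
assert np.array_equal(X.T @ X, 8 * np.eye(4, dtype=int))
```

The assertions reproduce exactly the shared columns shown in the table above.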
Confounding
Unfortunately, as the above example shows, there is a price to be paid for saving on the
experimental costs! “He who invests less, will also harvest less”.
In the case of fractional factorial designs, this means that if one does not use the full-
factorial set of experiments, it is not possible to study the interactions as well as the main
effects of all design variables. This happens because of the way those fractions are built,
using some of the resources that would otherwise have been devoted to the study of
interactions, to study main effects of more variables instead.
This side effect of using fractional designs is called confounding. Confounding means that
some effects cannot be studied independently of each other.
For instance, in the above example, the two-factor interactions are all confounded with each
other. The practical consequences are the following:
All main effects can be studied independently of each other, and independently of
the interactions;
If the objective is to study the interactions themselves, using this specific design will
only enable one to detect whether either of the confounded interactions are
important. The experiments will not allow one to decide which are the important
ones. For instance, if AB (confounded with CD, “AB=CD”) turns out as significant, one
will not know whether AB or CD (or a combination of both) is responsible for the
observed effect.
The list of confounded effects is called the confounding pattern of the design.
Resolution of a fractional factorial design
How well a fractional-factorial design avoids confounding is expressed through its resolution.
The three most common cases are as follows:
Resolution III designs: Main effects are confounded with two-factor interactions.
Resolution IV designs: Main effects are free of confounding with two-factor
interactions, but two-factor interactions are confounded with each other.
Resolution V designs: Main effects and two-factor interactions are free of
confounding with each other, however some two-factor interactions are
confounded with three-factor interactions.
Definition: In a resolution R design, effects of order k are free of confounding with all effects
of order less than R-k.
In practice, before deciding on a particular factorial design, it is important to check its
resolution and its confounding pattern to make sure that it fits the experimental objectives!
Examples of factorial designs
A screening situation with three design variables is illustrated in the two examples below:
Options for screening design with three design variables
Full factorial (left) and fractional factorial (right) designs illustrated. The design points are marked in red. The points in the fractional factorial design are selected so as to cover the maximum volume of the design space.
Plackett-Burman designs
If the experimental objective is to study the main effects only, and there are many design
variables to investigate (e.g. > 10), Plackett-Burman (PB) designs may be the solution. They
are very economical, since they require only one to four more experiments than the number
of design variables.
Plackett–Burman designs (Plackett and Burman, 1946) are experimental designs developed
while the authors were working in the British Ministry of Supply. Their goal was to find economical designs capable of estimating the main effects of up to N-1 variables in only N experimental runs. The table below shows the 12-run Plackett-Burman design for up to 11 design variables:
Experiment A B C D E F G H I J K
1 + − + − − − + + + − +
2 + + − + − − − + + + −
3 − + + − + − − − + + +
4 + − + + − + − − − + +
5 + + − + + − + − − − +
6 + + + − + + − + − − −
7 − + + + − + + − + − −
8 − − + + + − + + − + −
9 − − − + + + − + + − +
10 + − − − + + + − + + −
11 − + − − − + + + − + +
12 − − − − − − − − − − −
For the case of two levels (L=2), Plackett and Burman used the construction of Paley (Paley,
1933) for generating orthogonal matrices whose elements are all either 1 or -1 (Hadamard
matrices). Paley’s method could be used to find such matrices of N rows for most N equal to
a multiple of 4. In particular, it worked for all such N up to 100 except N = 92. If N is a power
of 2, however, the resulting design is identical to a fractional factorial design. In The
Unscrambler® the maximum limit of N is 36, which can accommodate n = N-1 = 35 design
variables (main effects). If there are fewer than N-1 effects to estimate, a subset of the columns of the matrix is used.
The price to pay for estimating all these effects in a minimum number of runs is the very complex confounding pattern of Plackett-Burman designs. Main effects are often partially confounded with several interactions, and these designs should therefore be used with great care.
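For illustration (this is not how The Unscrambler® exposes it), the 12-run design tabulated above can be reproduced by cyclically shifting its first row and appending a final row of minus signs, and its orthogonality verified:

```python
import numpy as np

# Sketch: build the 12-run Plackett-Burman design by cyclically shifting
# a generator row (the first row of the table above) and appending a
# final row of all minus signs.
gen = np.array([+1, -1, +1, -1, -1, -1, +1, +1, +1, -1, +1])
rows = [np.roll(gen, k) for k in range(11)]   # 11 cyclic right-shifts
rows.append(-np.ones(11, dtype=int))          # run 12: all low levels
X = np.array(rows)

# The 11 design columns are mutually orthogonal (Hadamard property).
assert np.array_equal(X.T @ X, 12 * np.eye(11, dtype=int))
```

The orthogonality of the columns is what allows all 11 main effects to be estimated from only 12 runs; the complex partial confounding with interactions, however, is not visible from this check.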
Central composite designs (CCD)
A central composite design (CCD) contains three types of samples:
Factorial (cube) samples are experiments which combine the regular lower and
upper levels of the design variables; they are the “factorial” part of the design;
Center samples are replicates of the experiment for which all design variables are at
their mid-level;
Axial (star) samples are located such that they extend beyond the factorial levels of the design for one factor at a time, all other design variables being at their mid-level. These samples are specific to CCDs.
Properties of a CCD
The properties of the simplest CCD, with two design variables, are shown below.
Central composite design with two design variables
From the figure it can be seen that each design variable has five levels: 1) low axial, 2) low
factorial, 3) center, 4) high factorial, and 5) high axial. Low factorial and high factorial are the
lower and upper levels that are specified when defining the design variable.
The four factorial samples are located at the corners of a square (or a cube if there
are three variables, or a hypercube if there are more);
The center samples are located at the center of the square;
The four axial samples are located outside the square; by default, their distance to
the center is set to ensure rotatability (see below).
Because we do not know the position of the response surface optimum, we try to ensure
that the prediction error is the same for any point at the same distance from the center of
the design. This property is called rotatability, as the design axes can be rotated around the
origin without influencing the variance of the predicted response. This implies that the
information carried by any design point will have equal weight on the analysis, i.e. the design
points will have equal leverage. This property is important if one wants to achieve uniform
quality of prediction in all directions from the center. The distance that ensures rotatability is given by 2^(k/4), k being the number of factors.
A spherical design is one in which all factorial and axial points have the same distance from the origin. The 2- and 4-factor rotatable designs are also spherical designs (distance given by k^(1/2)).
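These two distances are easy to compare numerically; a small Python sketch (illustrative only):

```python
# Sketch: the axial (star) distance that makes a CCD rotatable, in coded
# units, is alpha = 2**(k/4) for k factors; the design is also spherical
# when alpha equals the distance of the factorial corners, k**0.5.
def axial_distance(k):
    return 2 ** (k / 4)

for k in (2, 3, 4):
    alpha = axial_distance(k)
    corner = k ** 0.5      # distance of a factorial point from the center
    print(k, round(alpha, 3), round(corner, 3))
# For k = 2 and k = 4 the two distances coincide (spherical designs);
# for k = 3 they differ slightly (1.682 vs 1.732).
```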
Types of CCD
Circumscribed central composite design (CCC)
This general type is the one described in the previous section, with factorial points
defined at the lower and upper levels and with axial points outside of these ranges.
Faced central composite design (CCF)
If for some reason one cannot use levels outside the factorial range, one can tune
the axial point distances down such that these points lie at the center of the cube
faces. This is called a faced central composite design (CCF). CCF designs are not
rotatable.
Inscribed central composite design (CCI)
Another way to keep all experiments within the pre-defined range is to use an axial
sample distance that ensures rotatability, but to shrink the entire design such that
the axial points fall on the pre-defined levels. This will result in a smaller investigated
range, but will guarantee a rotatable design. This is called an inscribed central
composite design (CCI).
Efficiency of the CCD
Depending on the constraints of the experiments and the accuracy to be achieved, select the appropriate CC design using the following table:
Central composite design: constraints and accuracy
Design | Number of levels | Uses points outside high and low levels | Accuracy of estimates
Box-Behnken designs
Box-Behnken designs are not built on a factorial basis, but they are nevertheless good
optimization designs for second order models.
In a Box-Behnken design, all design variables have three levels: low cube, center, and high
cube. Each experiment combines the extreme levels of two or three design variables with
the mid-levels of the others. In addition, the design includes a number of center samples.
The properties of Box-Behnken designs are the following:
The actual range of each design variable is low cube to high cube, which makes it
easy to handle;
All non-center samples are located on a sphere, achieving rotatability for the 4-factor design, and near-rotatability for the designs with 3, 5, or 6 factors.
The figure below shows the Box-Behnken design drawn in two different ways. In the left
drawing one can see how it is built, while the drawing to the right shows how the design is
rotatable.
Box-Behnken design
Designs for constrained situations
Two types of constrained situations can be distinguished:
General constraints in which the allowed levels of a design variable depend on the
levels of one or more of the other design variables: linear constraints;
The special case of mixture situations, in which the levels of all design variables sum
to a fixed, total amount.
Each of these situations will then be described extensively in the following sections.
Note: Understanding the sections that follow requires basic knowledge about the
purposes and principles of experimental design. If the principles of experimental
design are unfamiliar, the user is strongly urged to read about it in the previous
sections (see What Is Experimental Design?) before proceeding with this section.
Mixture designs
A simple mixture design example
We will start describing the mixture situation by using an example.
A product development specialist has a specific problem to solve related to the optimization
of a pancake mix. The mix consists of the following ingredients: wheat flour, sugar and egg
powder. It will be sold in retail units of 100 g, to be mixed with milk for reconstitution of
pancake batter.
The product developer has learned about experimental design, and tries to set up an
adequate design to study the properties of the pancake batter as a function of the amounts
of flour, sugar and egg in the mix. She starts by plotting the region that encompasses all
possible combinations of those three ingredients, and soon discovers that it has a distinct
shape.
The pancake mix experimental region
The reason, as you may have guessed, is that the mixture always has to add up to a total of
100 g. This is a special case of multilinear constraint, which can be written with a single
equation:
Flour + Sugar + Egg powder = 100 g
The resulting region is a triangle: a regular figure known as a simplex.
This simplex contains all possible combinations of the three ingredients flour, sugar and egg.
One can see that it is completely symmetrical. One could substitute egg for flour, sugar for
egg and flour for sugar in the figure, and still get exactly the same shape.
Classical mixture designs, first introduced by Scheffé (1958), take advantage of this symmetry.
They include a varying number of experimental points, depending on the purposes of the
investigation. But whatever this purpose and whatever the total number of experiments,
these points are always symmetrically distributed, so that all mixture variables play equally
important roles.
These designs thus ensure that the effects of all investigated mixture variables will be
studied with the same precision. This property is equivalent to the properties of factorial,
central composite or Box-Behnken designs for non-constrained situations.
The figure below shows two examples of classical mixture designs.
Two classical designs for three mixture components
The first design is very simple. It contains three vertices (pure mixture components), three edge centers (binary mixtures) and only one ternary mixture, the centroid. The second
design contains more points, spanning the mixture region regularly in a triangular lattice
pattern. It contains all possible combinations (within the mixture constraint) of five levels of
each ingredient. It is similar to a five-level full factorial design - except that many
combinations, such as “25%, 25%, 25%” or “50%, 75%, 100%”, are excluded because they
are outside the simplex.
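The lattice just described can be enumerated directly. The following Python sketch (illustrative, not Unscrambler® code; the function name simplex_lattice is hypothetical) generates the blends of the degree-four lattice for three components:

```python
import itertools

# Sketch: enumerate the {3, 4} simplex-lattice: all blends of three
# components whose proportions are multiples of 1/4 and sum to 100%.
def simplex_lattice(q=3, m=4):
    levels = [100 * i / m for i in range(m + 1)]   # five levels: 0..100%
    return [p for p in itertools.product(levels, repeat=q)
            if sum(p) == 100]

points = simplex_lattice()
# Blends such as (25, 25, 25) are excluded automatically, because they
# sum to 75% rather than 100%.
assert (25.0, 25.0, 25.0) not in points
```

For three components and degree four, the lattice contains 15 blends.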
Simplex with different boundaries
This example, taken from John A. Cornell’s reference book “Experiments With Mixtures” (Cornell, 1990), illustrates how additional constraints are sometimes useful in practical situations.
A fruit punch is to be prepared by blending three types of fruit juice: watermelon, pineapple
and orange. The purpose of the manufacturer is to use their large supplies of watermelons
by introducing watermelon juice, of little value by itself, into a blend of fruit juices.
Therefore, the fruit punch should contain at least 30% of watermelon juice. Pineapple and
orange have been selected as the other components of the mixture.
The manufacturer decides to use design of experiments to find the combination of fruit
juices that scores highest in a consumer preference survey. The ranges of variation selected
for the experiment are as follows:
Ranges of variation for the fruit punch design
Ingredient Low High Centroid
In a factorial design, the effect of each design variable can be studied without regard to the remaining factors, because its low and high levels have been combined with the same levels of all the other design variables.
In a mixture situation, this is no longer possible, as demonstrated in the previous figure.
While 30% watermelon can be combined with e.g. (70% P, 0% O) or (0% P, 70% O), 100%
watermelon can only be combined with (0% P, 0% O).
To find a solution to this problem the concept of “otherwise comparable conditions” must
be adapted to the constrained mixture situation. To screen what happens when watermelon
varies from 30% to 100%, this variation must be compensated in such a way that the mixture
still adds up to 100%, without disturbing the balance of the other mixture components. This
is achieved by moving along an axis where the proportions of the other mixture components
remain constant. In practice such mixtures are easily achieved by starting with the low level
of the component in question while having equal proportions of the remaining
components. Subsequent addition of the first component to the mix would correspond to
moving up the axis. This is illustrated for the watermelon example in the figure below.
Studying variations in the proportion of watermelon
Mixture designs with points along the axes of the simplex are called axial designs. They are
best suited for screening purposes because they capture the main effect of each mixture
component in a simple and economical way.
An axial design in four components is represented in the next figure. It can be seen that
several points are located inside the simplex: they are mixtures of all four components. Only
the four corners, or vertices (containing the maximum concentration of an individual
component) are located on the surface of the experimental region.
A four-component axial design
Each axial point is placed halfway between the overall centroid of the simplex (25%, 25%,
25%, 25%) and a specific vertex. Thus the path leading from the centroid (“neutral”
situation) to a vertex (100% of a single component) is well described with the help of the
axial point.
In addition, end points can be included; they are located on the surface of the simplex,
opposite a vertex (they are marked by crosses on the figure). They contain the minimum
concentration of a specific component. When end points are included in an axial design, the
whole path leading from minimum to maximum concentration is studied. The above figure
Design for the optimization of the fruit punch composition is an example of a three-
component axial design where end points have been included.
In general terms, if N mixture components vary from 0 to 100%, the blends forming the
simplex-centroid design are as follows:
The vertices are pure components;
The second order centroids (edge centers) are binary mixtures with equal
proportions of two selected components;
The third order centroids (face centers) are ternary mixtures with equal proportions
of three selected components;
The Nth order centroids have equal proportions of selected N components, any
remaining components being zero.
Note: The overall centroid is a mixture where all N components have equal
proportions.
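As an illustration, the simplex-centroid blends listed above can be generated by enumerating all non-empty subsets of components, mixed in equal proportions (Python sketch, not Unscrambler® code):

```python
import itertools

# Sketch: the blends of a simplex-centroid design for n components are
# all non-empty subsets of components, mixed in equal proportions.
def simplex_centroid(n=3):
    points = []
    for k in range(1, n + 1):
        for subset in itertools.combinations(range(n), k):
            blend = tuple(100 / k if i in subset else 0 for i in range(n))
            points.append(blend)
    return points

points = simplex_centroid(3)
# 2^3 - 1 = 7 blends: 3 vertices, 3 edge centers, 1 overall centroid
assert len(points) == 7
```

For n components this always yields 2^n - 1 blends, matching the vertices, centroids of increasing order, and overall centroid described above.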
In addition, interior points can be included in the design. They improve the precision of the
results by “anchoring” the design with additional complete mixtures (i.e. mixtures where all
components are present), and they enable computation of cubic terms. The interior points
are located halfway between the overall centroid and each vertex, and they have the same
composition as the axial points in an axial design. When a design includes interior points, it is
said to be augmented. Note that for 3 mixture components, a centroid design augmented
with axial points equals an axial design with end points included (see e.g. fruit punch
example above).
Depending on its degree, a simplex-lattice design can serve several purposes:
Feasibility study (degree one or two): are the blends feasible at all?
Optimization: with a lattice of degree three or more, there are enough points to fit a
precise response surface model.
Search for a special behavior or property which only occurs in an unknown, limited
subregion of the simplex.
Calibration: prepare a set of blends on which several types of properties will be
measured, in order to fit a regression model to these properties. For instance, one
may wish to relate the texture of a product, as assessed by a sensory panel, to the
parameters measured by a texture analyzer. If it is known that texture is likely to
vary as a function of the composition of the blend, a simplex-lattice design is
probably the best way to generate a representative, balanced calibration data set.
D-optimal designs
A simple design subject to linear constraints
Consider a cooked meat example in which a process engineer wants to study the effects of marinating time (6 to 18 minutes), steaming time (5 to 15 minutes) and frying time (5 to 15 minutes). A full factorial design in these three variables gives the following eight runs:
The cooked meat full factorial design
Sample Mar. Time Steam. Time Fry. Time
1 6 5 5
2 18 5 5
3 6 15 5
4 18 15 5
5 6 5 15
6 18 5 15
7 6 15 15
8 18 15 15
After carefully analyzing this table, the process engineer expresses strong doubts that
experimental design can be of any help in this situation.
“Why?” asks the statistician in charge. “Well,” replies the engineer, “if the
meat is steamed then fried for 5 minutes each it will not be cooked, and at
15 minutes each it will be overcooked and burned on the surface. In either
case, we won’t get any valid sensory ratings, because the products will be far
beyond the ranges of acceptability.”
After some discussion, the process engineer and the statistician agree that an additional
condition should be included:
“In order for the meat to be suitably cooked, the sum of the two cooking
times should remain between 16 and 24 minutes for all experiments”.
This type of restriction is called a multilinear constraint. In the current case, it can be written in a mathematical form requiring two equations, as follows:
Steam. Time + Fry. Time ≥ 16
Steam. Time + Fry. Time ≤ 24
The constrained experimental region is no longer a cube! It follows that a full factorial design explores that region poorly.
The design that best spans the new region is given in the table below:
The cooked meat constrained design
Sample Mar. Time Steam. Time Fry. Time
1 6 5 11
2 6 5 15
3 6 9 15
4 6 11 5
5 6 15 5
6 6 15 9
7 18 5 11
8 18 5 15
9 18 9 15
10 18 11 5
11 18 15 5
12 18 15 9
This design contains all “corners” of the experimental region, in the same way as the full
factorial design does when the experimental region has the shape of a cube.
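The constraint can be checked directly against both designs (illustrative Python; the run values are taken from the tables in this section):

```python
import itertools

# Sketch: every run of the constrained cooked-meat design (table above)
# satisfies 16 <= steaming time + frying time <= 24, while the original
# full factorial violates the constraint in half of its runs.
constrained = [(6, 5, 11), (6, 5, 15), (6, 9, 15), (6, 11, 5),
               (6, 15, 5), (6, 15, 9), (18, 5, 11), (18, 5, 15),
               (18, 9, 15), (18, 11, 5), (18, 15, 5), (18, 15, 9)]
assert all(16 <= steam + fry <= 24 for _, steam, fry in constrained)

full_factorial = list(itertools.product([6, 18], [5, 15], [5, 15]))
violations = [r for r in full_factorial if not 16 <= r[1] + r[2] <= 24]
# The (5, 5) and (15, 15) cooking-time combinations fail at both
# marinating times, i.e. 4 of the 8 factorial runs are infeasible.
assert len(violations) == 4
```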
Depending on the number and complexity of multilinear constraints, the shape of the
experimental region can be more or less complex. In the worst cases, it may be almost
impossible to imagine! Therefore, building a design to screen or optimize variables linked by
multilinear constraints requires special methods. The following section will introduce a
special class of designs beneficial for these situations. More complex examples will be given
in the section Advanced topics for constrained situations ways to build constrained designs.
Introduction to the D-optimal principle
Those familiar with factorial designs are most likely aware that one of their most important
characteristics is their ability to study all effects independently of each other. This property,
called orthogonality, is important for relating variations in responses to variations in the
design variables. Without orthogonality, the estimated effects may become unreliable.
As soon as multilinear constraints are introduced among the design variables, it is no longer
possible to build an orthogonal design. Considering that the effect of a variable is estimated
on the premise that all other influences are held constant, it may not come as a surprise that
associations between design variables make the interpretations more difficult. In the more
severe cases of dependencies between variables, the effects will become indistinguishable
or the numerical calculations will fail. As soon as the variations in one of the design variables
are linked to those of another design variable, orthogonality cannot be achieved.
The D-optimal principle ensures that, based on a set of candidate points, the selected design matrix has columns as close to orthogonal as possible. Mathematically, this is achieved by maximizing the determinant of the information matrix X'X, which is known as the D-optimality criterion (the apostrophe meaning 'transposed'). The volume of the joint confidence region of the resulting regression coefficients is thereby minimized, i.e. the precision of the model parameter estimates will be maximized. An example of a design matrix X could be the cooked meat constrained design table above, including some or all of the available design points (rows) as well as any center points or replicates. Also, any interaction or higher order terms would be included as additional columns in X.
Because the determinant of X'X tends to increase as more experimental runs are included in the design, the D-optimality criterion is not well suited for comparing designs of different sizes. The related D-efficiency is independent of the number of runs:

D-efficiency = 100 · det(X'X)^(1/p) / n

Here, n is the number of experimental runs and p is the number of model terms. The D-efficiency ranges from 0 to 100%, where a factorial design without center points has a D-efficiency of 100%. While a large design will tend to have a larger value of det(X'X) and yield a smaller confidence region for the parameters, the average point precision as estimated by the D-efficiency will be comparable for differently sized designs.
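The D-efficiency computation can be sketched in a few lines (a hedged illustration assuming the usual -1/+1 coding of factor levels; the function name is ours, not The Unscrambler®'s):

```python
import numpy as np

def d_efficiency(X):
    """D-efficiency = 100 * det(X'X)^(1/p) / n for a model matrix X
    whose factor columns are coded to the range [-1, +1]."""
    n, p = X.shape
    return 100.0 * np.linalg.det(X.T @ X) ** (1.0 / p) / n

# A 2^2 full factorial with an intercept column: orthogonal, hence 100% efficient
X = np.array([
    [1, -1, -1],
    [1, -1,  1],
    [1,  1, -1],
    [1,  1,  1],
], dtype=float)
print(d_efficiency(X))  # 100 (up to floating point)
```

Here X'X = 4I, so det(X'X) = 64, det^(1/3) = 4 and the efficiency is 100 · 4/4 = 100%, matching the statement that an unmodified factorial design is the benchmark.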
Candidate design points
A point exchange algorithm is used to find the D-optimal design points in The Unscrambler®.
These points may optionally be augmented with a number of space filling points to ensure
good coverage also inside the experimental region. Both these procedures require a set of
candidate points as input. These points are set up in such a manner that they span the
maximum allowed design region as well as the interior region. The candidate points are
All extreme vertices. These are the outer corners of the design region:
The extreme vertices of a square design region
All edge centers. These are defined as the midpoint between any two vertices constituting
an outer edge of the design region:
The edge centers of a square design region
All face centers. These are defined as the center point on any outer surface of the design
region as spanned by three or more edges:
The face centers of a square design region
The overall centroid. This is the center point of the design. For a design with two design
variables only the overall centroid overlaps with the single face center.
All axial check blends. These are defined as the midpoint on any axis spanned by the overall
centroid and the extreme vertices. These do not improve the coverage of the outer design
region but can be very useful space filling points for more robust models:
The axial check blends of a square design region
The selection is performed without replacement, so the number of points is bounded by the number of candidate points in each case. The number of
additional center points (overall centroids) as well as the number of replicates for the entire
design is specified separately. This enables a higher level of user control over the
replications, and it favours a better spread of points over the design region compared to
selection with replacement. On the other hand the D-efficiency of the resulting design may
be slightly lower than if replication had been allowed. For practical use we believe the
benefits of a good spread in design points far outweigh a small reduction in D-efficiency
(see next section).
Addition of space filling points
The list of D-optimal points returned from the FFEA is optionally used as a starting point for
a subsequent Kennard-Stone selection process (Kennard and Stone, 1969). During this
process, the design is augmented with a specified number of space filling points in order to
span the entire design region as evenly as possible. These points are taken from the
remaining candidate list, i.e. the selection is based on candidate points that have not already
been selected in the point exchange algorithm.
While D-optimal designs provide precise model terms and good predictions of training data,
they tend to focus on the outer regions of the design space. It has been shown that designs
with samples spread evenly across the entire design region tend to be more robust in many
cases (Naes and Isaksson, 1989). Inclusion of space filling points by Kennard-Stone enables
better modeling of the interior design region and may therefore give more accurate
response surfaces and stable predictions when applying the model to new data. Also, space filling points tend to make the design less dependent on which model terms are included.
This is beneficial because the exact model equation is usually not known in advance.
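A minimal sketch of the Kennard-Stone idea, maximin selection from the remaining candidates (our illustration of the principle, not the exact implementation in The Unscrambler®; `selected` stands for the indices already returned by the point exchange step):

```python
import numpy as np

def kennard_stone_augment(candidates, selected, k):
    """Add k space-filling points: each new point is the remaining candidate
    farthest from its nearest already-selected point (maximin criterion)."""
    chosen = list(selected)
    remaining = [i for i in range(len(candidates)) if i not in chosen]
    for _ in range(k):
        # Distance from each remaining candidate to its nearest chosen point
        dmin = [min(np.linalg.norm(candidates[i] - candidates[j]) for j in chosen)
                for i in remaining]
        best = remaining[int(np.argmax(dmin))]
        chosen.append(best)
        remaining.remove(best)
    return chosen

# 3x3 grid of candidate points; the four corners are already selected
grid = np.array([(x, y) for x in (-1, 0, 1) for y in (-1, 0, 1)], dtype=float)
corners = [0, 2, 6, 8]
print(kennard_stone_augment(grid, corners, 1))  # the centre point (index 4) is added
```

The centre of the region is selected first because it is the candidate farthest from all corners, which is exactly the "fill the interior evenly" behavior described above.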
The condition number (C.N.)
In order to minimize the negative consequences of a deviation from the ideal orthogonal
case, one needs a measure of the “lack of orthogonality” of a design. This measure is
provided by the condition number (C.N.) (Golub, 1996):

C.N. = largest eigenvalue / smallest eigenvalue of the matrix X'X
It indicates the degree of multicollinearity in the design matrix as follows:
C.N. = 1: no multicollinearity, i.e. orthogonal
C.N. < 100: multicollinearity not a serious problem
100 < C.N. < 1000: moderate to severe multicollinearity
C.N. > 1000: severe multicollinearity
It is also linked to the elongation or degree of “non-sphericity” of the region actually
explored by the design. The smaller the condition number, the more spherical the region,
and the closer a design is to being orthogonal.
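The condition number follows directly from the eigenvalues of X'X; a small illustration (the design matrices are made-up examples):

```python
import numpy as np

def condition_number(X):
    """C.N. = largest / smallest eigenvalue of X'X (1 for an orthogonal design)."""
    eig = np.linalg.eigvalsh(X.T @ X)
    return eig.max() / eig.min()

# Orthogonal 2^2 factorial in coded levels
orthogonal = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
print(condition_number(orthogonal))  # 1.0

# A constrained design: the two columns are no longer independent
constrained = np.array([[-1, -0.5], [-1, 1], [1, -1], [1, 0.5]], dtype=float)
print(condition_number(constrained))  # larger than 1
```

The second design has correlated columns, so its condition number exceeds 1, in line with the interpretation scale above.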
Another important property of an experimental design is its ability to explore the whole
region spanned by the design variables. It can be shown that once the shape of the
experimental region has been determined by the constraints, the design with the smallest
condition number is the one that encloses maximal volume. It follows that if all extreme
vertices are included in the design, it has the smallest attainable condition number. If that
solution is too expensive, however, one needs to select a smaller number of points. The
consequence is that the condition number will increase and the enclosed volume will
decrease.
How good is the calculated design?
The condition number of an orthogonal design such as a non-modified factorial design is
exactly 1. Such a design has optimal properties in terms of interpretation, mathematical
robustness and economical considerations. The condition number of a non-orthogonal
(constrained) design will always be larger than one, and the larger the deviation, the less
favorable is the design. In general, caution should be exercised when analyzing a non-orthogonal (constrained) design using classical DoE analysis (ANOVA/MLR). The Unscrambler® suggests analysis by Partial Least Squares Regression for D-optimal designs, as correlated effects are handled much better by this method and misinterpretations will be rare.
If the design has a condition number much larger than, say, 100, this is an indication that the experimental region is heavily constrained. In such a case, either of several correlated design factors may have an influence on the response, but it is impossible to find out which (ANOVA might suggest one of them arbitrarily; PLSR will correctly reveal that both are correlated with the response). This may occur when there is insufficient individual variation in the design levels
compared to the noise level of the experiment. To ensure sufficient orthogonal variation for
each effect, it is recommended that all of the design variables and constraints be critically re-
examined. One should search for ways to simplify the problem (see the section on Advanced Topics for Constrained Situations); otherwise there is the risk of starting an expensive series of experiments which will not give any useful information.
A full factorial design applied to this situation would result in a sub-optimal solution that left
one half of the experimental region unexplored (i.e. the triangle spanned by the remaining 3
points). So where should we place the 4th point in order to span the experimental region as
well as possible?
We could imagine two candidate points where the dashed line of the linear constraint
crosses the factorial design region in the above figure. Two alternative solutions for selecting
4 design points are illustrated below.
Designs with four points leaving out a portion of the experimental region
Design II in the figure seems to be a better option than design I, because the excluded region
is smaller. A design using points (1, 3, 4, 5) would be equivalent to (I), and a design using
points (1, 2, 4, 5) would be equivalent to (II). The worst solution of all would be a design with
points (2, 3, 4, 5): this would leave out the whole corner defined by points 1, 2 and 5.
It follows that if the whole experimental region was to be explored, more than four points
would be needed. The above example shows that a minimum of five points (1, 2, 3, 4, 5) are
necessary. These five crucial points are the extreme vertices of the constrained experimental
region. They have the following property: if a sheet of paper were wrapped around those
points, the shape of the experimental region would appear, revealed by the wrapping.
If there are more than two design variables or multiple constraints, it might not be straightforward to find the best set of design points. The D-optimal criterion is commonly used to
find the best design in these situations.
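For small problems the D-optimal subset can be found by brute force, maximizing det(X'X) over candidate subsets. A sketch for the five-vertex example (the coordinates are hypothetical, chosen to mimic a square region cut by one linear constraint x1 + x2 ≤ 1):

```python
import numpy as np
from itertools import combinations

# Hypothetical coordinates for the five extreme vertices of a square region
# cut by the linear constraint x1 + x2 <= 1
vertices = np.array([[-1, -1], [1, -1], [1, 0], [0, 1], [-1, 1]], dtype=float)

def det_info(points):
    # Model matrix for a main-effects model: intercept, x1, x2
    X = np.column_stack([np.ones(len(points)), points])
    return np.linalg.det(X.T @ X)

# Brute-force D-optimal choice of 4 out of the 5 extreme vertices
best = max(combinations(range(5), 4), key=lambda s: det_info(vertices[list(s)]))
print(best, round(det_info(vertices[list(best)]), 1))
```

The winning subsets are those that drop one of the two vertices created by the constraint cut (the two points lie close together), while dropping one of the original square corners excludes a large part of the region and gives a much smaller determinant, in line with the discussion above.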
Process/mixture designs
Sometimes the product properties of interest depend on a combination of a mixture recipe
with specific process settings. In such cases, it is useful to investigate mixture and process
variables together. The process variables and the mixture variables are then combined using
the pattern of subfactorial designs and a D-optimal design can be generated.
Factorial samples
Factorial samples can be found in factorial designs and their extensions. They are a
combination of high and low levels of the design variables in experimental plans based on
two levels of each variable. This forms a square for 2 variables or a (multidimensional) cube
for 3 (or more) variables. These samples are therefore sometimes referred to as cube
samples.
The same factorial design points are also found among other samples in central composite
designs. In Box-Behnken designs, all samples found on the factorial cube are also called
factorial samples (even though these design points are positioned on the edges rather than
the vertices of the cube).
All combinations of levels of the design variables in N-level full factorials are also called
factorial samples.
Center samples
Center samples are samples for which each design variable is set at its mid-level. When all
variables are continuous, the center points are located at the exact center of the
experimental region.
Center samples are not defined for categorical factors. When there is a combination of
continuous and category variables in the design, center points corresponding to the mid-
level of all continuous factors can be added for each unique combination of levels for up to 4
category variables.
For instance, if the number of two-level category variables in the design is (1, 2, 3, 4), this
results in (2, 4, 8, 16) single replicate center points, respectively. If two replicates of center
points are required, this doubles the total number of center points in the design. If we have
a three variable full factorial design with two two-level categorical variables, there are four
unique center points corresponding to the different level combinations of the categorical
factors. If 2 replicates of the center points are required, this results in 8 center points in
total.
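The bookkeeping in the example above can be sketched as follows (a hypothetical helper, not an Unscrambler® function):

```python
from itertools import product

def center_points(cont_mid_levels, cat_levels, replicates=1):
    """One center point (mid-level of all continuous factors) per unique
    combination of categorical levels, times the number of replicates."""
    combos = list(product(*cat_levels))
    return [{"cont": cont_mid_levels, "cat": c}
            for c in combos for _ in range(replicates)]

# Three continuous factors at mid-level, two 2-level categorical variables,
# two replicates of the center points -> 2 * 2 * 2 = 8 center points
pts = center_points((0.0, 0.0, 0.0), [("A", "B"), ("low", "high")], replicates=2)
print(len(pts))  # 8
```

This reproduces the count from the text: four unique categorical combinations, doubled by the two replicates.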
The higher the number of levels of the categorical variables and the more replication required, the more quickly the number of center points can grow. It is suggested that when either the number of categorical variables or their number of levels becomes larger than 2, design replication may be a better option.
Center samples in screening designs. In screening designs, center samples are used for
curvature checking: Since the underlying model in such a design assumes that all main
effects are linear, it is useful to have at least one design point with an intermediate level for
all factors. Thus, when all experiments have been performed, one can check whether the
intermediate value of the response fits with the global linear pattern, or whether there are
signs of deviation from the straight line fit.
In the case of high curvature, one will have to build a new design which accepts a quadratic
model. The Unscrambler® provides an option to calculate curvature in a design when all
variables are continuous and at least one center point is present.
If at least 2 center samples are present (preferably 3), the model will also be tested for lack
of fit (LOF). This is a test comparing the variation of the measured responses within center
samples with the overall variation between measured and fitted (i.e. predicted) response
values. A significant LOF indicates that the model might benefit from additional terms.
In screening designs, center samples are optional; however, it is recommended that at least
three are included if possible. See the section on replicates for more details.
Center samples in optimization designs. In optimization designs, center samples are
important also for fitting higher order models. It is therefore recommended that 5 or more
are included in the design. In particular for Box-Behnken designs, ample center samples are
needed to fit a precise response surface.
Axial samples
Axial samples can lie on the centers of cube faces or they can lie outside the cube, at a given
distance from the center of the cube. This distance can be tuned, but it is recommended to
use the default distance (for the given design) whenever possible.
Three cases can be considered:
The default axial to center point distance ensures that all design samples have
exactly the same leverage, i.e. the same influence on the model. Such a design is
said to be “rotatable”. If the number of design variables is two or four, this distance
also ensures that all factorial and design points lie with the same distance from the
center, giving a “spherical” design region. For other numbers of factors, rotatability
almost, but not quite, corresponds with a spherical design;
The axial to center point distance can be tuned down to 1. In that case, the star
samples will be located at the centers of the faces of the cube. This ensures that a
Central Composite design can be built even if levels lower than “low cube” or higher
than “high cube” are impossible. However, the design is no longer rotatable;
Any intermediate value for the star distance to center is also possible. The design
will not be rotatable.
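The default rotatable axial distance follows the standard formula for central composite designs with a full 2^k factorial core, alpha = (number of factorial runs)^(1/4); a quick check of the "two or four variables give a spherical design" remark:

```python
# Rotatable axial (star) distance for a CCD with k design variables,
# compared with the distance of a factorial corner from the centre
for k in (2, 3, 4):
    alpha = (2 ** k) ** 0.25   # rotatability criterion
    corner = k ** 0.5          # corner distance in coded units
    print(k, round(alpha, 3), round(corner, 3))
```

For k = 2 and k = 4 the axial distance equals the corner distance, so all points lie on a common sphere; for k = 3 the two distances differ slightly, matching the "almost, but not quite" statement above.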
Sample types in mixture designs
An overview of the various sample types used in mixture designs is provided below:
Axial design: vertex and axial samples, optionally end points and overall centroids;
Simplex-centroid design: vertex samples, centroids of various orders, optional
interior (axial) points;
Simplex-lattice designs: samples positioned in a regular grid (similar to multi-level
factorial samples), overall centroid.
Reference samples
When trying to improve an existing product or process, the current recipe or process
settings may be used as a reference.
When trying to copy an existing product, for which the recipe is not known, one
might still include that product as reference and measure the responses on that
sample as well as on the others, in order to know how close the experimental
samples have come to that product.
To check curvature in the case where some of the design variables are category
variables, one can include one reference sample with center levels of all continuous
variables for each level (or combination of levels) of the category variable(s).
Note: For reference samples, only response values can be taken automatically into account in the Analysis of Effects and Response Surface analyses. Values of the design variables may, however, be entered manually after converting to a non-designed data table; a PLS analysis can then be run on the resulting table.
Replicates
Replicates are experiments performed several times under reproduced conditions. They
should not be confused with repeated measurements, where the samples are only prepared
once but the measurements are performed several times on each.
Why include replicates?
Replicates are included in a design in order to estimate the experimental error associated with the system. This is doubly useful as it:
provides an estimate of the pure experimental error, which is needed for testing the significance of the effects;
increases the precision of the estimated effects.
8.2.10 Blocking
In some situations it may not be possible to run all experiments under the exact same
conditions, or there may be other reasons to split the full set of runs into blocks that are
performed independently from the others in some sense. A common scenario is that raw material comes from different batches because there is not enough material in a single batch to accommodate the full set of experiments. Often screening designs are extended into
factor influence studies, or factor influence studies are extended into optimization studies. If
this is performed in a planned manner, it will often be possible to re-use previous
measurements and supplement them with new ones. For instance, a low resolution
fractional factorial can be extended into a high resolution or full factorial design, which again
can be extended into a circumscribed or faced central composite design (see section
Extending a design below). Because these blocks of experiments are necessarily performed
in different points of time, there is a higher risk that non-controllable or unknown factors
differ between blocks. Whether such variation has an unwanted effect on the response
should always be investigated.
Any blocked experiment should be tested for unequal block means. For experiments where
measurements are divided into two distinct blocks, the response(s) can be tested using a
Student’s t-test for equality of means. A low p-value, or equivalently a large difference
between the plotted quantiles, indicates that there is a significant blocking effect. Any effect
confounded with blocks cannot be trusted if this is the case. Careful planning of the
experiment is required to avoid that effects of interest are confounded with, or non-
distinguishable from, blocks.
For any number of blocks the responses can be plotted in a quantiles plot, where the block
means and variances can be compared using the sample grouping option. If the distributions
of response values are similar across blocks, there is no evidence that block effects have had
an influence on the response.
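For two blocks, the test for equal block means can be sketched with a standard t-test (illustrative data; SciPy's Welch t-test, which does not assume equal variances, is one way to obtain the p-value):

```python
import numpy as np
from scipy import stats

# Hypothetical response values measured in two raw-material batches (blocks)
block1 = np.array([10.1, 10.2, 9.9, 10.0, 10.3])
block2 = np.array([12.0, 12.1, 11.9, 12.2, 11.8])

# Welch two-sample t-test for equality of the block means
t_stat, p_value = stats.ttest_ind(block1, block2, equal_var=False)
print(round(p_value, 6))  # a small p-value indicates a significant block effect
```

Here the two blocks clearly differ in level, so the p-value is very small; with such a result, any effect confounded with blocks could not be trusted.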
Incomplete blocking of full factorial designs
If the full experiment is replicated, one should strive to include the full set of unique design
points in each block. This will ensure that any blocking effect is confounded with replicates
only, and all effects will be free of confounding with blocks. When all the treatment
combinations are included in each block, the design is referred to as a complete block design
and block effects should be tested as described above.
If this is not possible some effects will always be confounded with blocks, and the estimated
effects in question will include the block contribution as well. This is referred to as an
incomplete block design, and the efficiency of such a design depends on which effects are
confounded with blocks. Of course one would not want to create a design where any of the
main effects were confounded with blocks, as these main effects would be indistinguishable
from the block effects. Preferably the blocks should be set up such that they are confounded
with high order interactions only.
The Unscrambler® supports blocking of most full factorial experiments into 2^p blocks, p being smaller than the number of design variables. A full factorial design with three 2-level factors may be divided into two or four blocks. A full factorial design with four to seven 2-level factors may be split into two, four or eight blocks. The blocking generators are selected to ensure that as
many low-order interactions as possible can be estimated without confounding with blocks.
For instance, in a six-variable design divided into two blocks, the blocking effect will be
confounded with the six-variable interaction only.
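The statement about the six-variable design can be checked numerically; a sketch in which block membership is given by the sign of the ABCDEF interaction column (coded -1/+1 levels):

```python
import numpy as np
from itertools import product

# Full 2^6 factorial in coded -1/+1 levels
runs = np.array(list(product([-1, 1], repeat=6)))

# Blocking generator: the six-variable interaction ABCDEF
block = np.prod(runs, axis=1)

print(np.sum(block == 1), np.sum(block == -1))  # 32 runs in each block
# No main effect is confounded with blocks: every main-effect contrast is
# orthogonal to the block column
print(runs.T @ block)  # a vector of zeros
```

Each main-effect column has zero inner product with the block column, so only the six-variable interaction is sacrificed to blocking, as stated above.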
In the ANOVA, all interactions confounded with blocks will be summarized in a separate sum of squares for blocks. These individual interaction effects will not be given or tested in
the ANOVA, as they are indistinguishable from the blocking effects.
Extending a design
When a series of designed experiments has been performed and analyzed, one of two situations usually applies:
The experiments have provided all the information needed, which means that the
project is completed.
The experiments have given valuable information which can be used to build a new
series of experiments that will lead closer to the experimental objective.
In the latter case, the new series of experiments can sometimes be designed as a
complement to, or an extension of, the previous design. This allows one to minimize the
number of new experimental runs, and the whole set of results from the two series of runs
can be analyzed together.
Why extend a design?
In principle, one should make use of the extension feature whenever possible, because it
enables progression to the next stage of an investigation using a minimum of additional
experimental runs.
Extending an existing design is also a convenient way of building a new, similar design that
can be analyzed together with the original one. For example, if a chemical reaction has been
investigated using a specific type of catalyst, one might want to investigate another type of
catalyst under the same conditions as the first reaction, in order to compare their
performances. This can be achieved by adding a new design variable, namely type of
catalyst, to the existing design.
Design extensions can also be used as a basis for an efficient sequential experimental
strategy. That strategy consists of breaking the initial problem into a series of smaller,
intermediate problems and investing in a small number of experiments to achieve each of
the intermediate objectives. Thus, if something goes wrong at one stage, the losses are cut;
and if all goes well, one may end up solving the initial problem at a lower cost than if a huge
design had been used initially.
When and how to extend a design
The following text briefly describes the most common extension cases:
Add levels: Used whenever one is interested in investigating more levels of already
included design variables, especially for category variables.
Add a design variable: Used whenever a parameter that has been kept constant is
suspected to have a potential influence on the responses, as well as when one
wishes to duplicate an existing design in order to apply it to new conditions that
differ by the values of one specific variable (continuous or category), and analyze the
results together. For instance, if a chemical reaction using a specific catalyst has
been investigated, and now another similar catalyst for the same reaction will be
studied to compare its performances to the other one’s, the first design can be
extended by adding a new variable; type of catalyst.
Delete a design variable: If the analysis of effects has established that one or a few of the variables in the original design are clearly insignificant, the power of the conclusions can be increased by deleting the variable(s) and reanalyzing the
design. Deleting a design variable can also be a first step before extending a
screening design into an optimization design. This option should be exercised with
caution if the effect of the removed variable is close to significance. Also be sure
that the variable to be removed does not participate in any significant interactions.
Add more replicates: If the first series of experiments shows that the experimental
error is unexpectedly high, replicating all experiments might make the results
clearer.
Add more center samples: In order to get a better estimation of the experimental
error, adding a few center samples is a good and inexpensive solution.
Add more reference samples whenever new references are of interest. More
replicates of existing reference samples may be used in order to get a better
estimation of the experimental error.
Extend to higher resolution: Use this option for fractional factorial designs where
some of the effects of interest are confounded with each other. This option can be
used whenever some of the confounded interactions are significant and one needs
to find out exactly which ones. This is only possible if there is a higher resolution
fractional factorial design available. Otherwise, one can extend to a full factorial
design instead.
Extend to full factorial: This applies to fractional factorial designs where some of the
effects of interest are confounded with each other and no higher resolution
fractional factorial designs are available.
Extend to central composite: This option completes a full factorial design by adding
star samples and (optionally) a few more center samples. Fractional factorial designs
can also be completed this way, by adding the necessary cube samples as well. This
should be used only when the number of design variables is small; an intermediate
step may be to delete a few variables first.
Each step of the strategy consists of a design involving a reasonably small number of
experiments. Thus, the mere size of each subproject is more easily manageable.
A smaller number of experiments also means that the underlying conditions can
more easily be kept constant for the whole design, which will make the effects of
the design variables appear more clearly.
If something goes wrong at a given step, the damage is restricted to that particular
step.
If all goes well, the global cost is usually smaller than with one huge design, and the
final objective is achieved all the same.
First, build a fractional factorial design 2^(6-2) (resolution IV), with two center samples, and perform the corresponding 18 experiments.
After analyzing the results, it turns out that only variables A, B, C and E have
significant main effects and/or interactions. But those interactions are confounded,
so the design needs to be extended in order to know which are really significant.
The first design is extended by deleting variables D and F and extending the
remaining part (which is now a 2^(4-1), resolution IV design) to a full factorial design
with one more center sample. Additional cost: nine experiments.
After analyzing the new design, the significant interactions which are not
confounded only involve A, B and C. The effect of E is clear and goes in the same
direction for all responses. But since the center samples show some curvature, one
must proceed to the optimization stage for the remaining variables.
Thus, variable E is kept constant at its most interesting level, and after deleting that
variable from the design, the remaining 2³ full factorial design is extended to a CCD
with six center samples. Additional cost: nine experiments.
Analysis of the final results yielded a desired optimum point. Final cost: 18+9+9=36
experiments, which is less than half of the initial estimate.
If the design variables have any effect at all, the experimental design structure
should be reflected in some way or other in the response data; graphical analysis
and PCA will visualize this structure and help one detect abnormal features.
The Unscrambler® includes automatic features that take advantage of the design
structure (grouping according to levels of design variables when computing
descriptive statistics or viewing a PCA scores plot). When the structure of the design
shows in the plots (e.g. as subgroups in a box-plot, or with different colors on a
scores plot), it is easy to spot any sample or variable with an illogical behavior.
Summary
If the variance is strongly associated with the magnitude of the response, a variance-stabilizing transform such as log(Y), Y^(1/2), or 1/Y might be considered (Tip: histograms can be used to examine the influence of different transforms on the response). If the precision of runs improves somewhat in the course of the experiment, a model based on randomized runs will most likely be robust to these changes.
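The effect of a log transform on level-dependent spread can be illustrated quickly (made-up numbers with a multiplicative error structure):

```python
import numpy as np

# Three groups of responses where the spread grows with the mean level
groups = [np.array([1.0, 1.2, 1.5]),
          np.array([10.0, 12.0, 15.0]),
          np.array([100.0, 120.0, 150.0])]

print([round(g.std(), 3) for g in groups])          # spread grows with level
print([round(np.log(g).std(), 3) for g in groups])  # roughly constant after log(Y)
```

Before the transform the standard deviation scales with the group mean; after log(Y) the three groups have essentially equal spread, which is the situation ordinary least-squares models assume.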
Note that if there are very few residual degrees of freedom left after estimating all the
effects in the model, artificial structure in the residuals can be expected simply due to lack of
information in the data. In the extreme case that the residual degrees of freedom is zero, all
the residuals will be zero as well. If a little more than the minimum number of experiments
can be afforded, this will benefit the interpretation of results.
Analysis of effects using classical methods
An analysis of the effects is usually performed for screening and factor influence designs:
Plackett-Burman, Fractional Factorial and Full Factorial designs. These designs allow estimation of main effects, and some of them also allow estimation of two- and three-variable interactions.
The classical DoE analysis method for studying effects is based on the ANOVA-table. Main
effects or interactions found to be important in the ANOVA table can be investigated further
in an effects visualization plot. This will reveal the direction and magnitude of the individual
effects. It is important to note that even if a main effect seems to be irrelevant, the factor
can still have a large impact on the model if it takes part in a significant interaction effect.
Other checks that can be applied after analyzing the ANOVA table include the detection of
curvature effects. These can be detected in the main effects plot. If a nonlinear trend
is detected when checking the position of the center sample, one may consider a possible
curvature effect and include the square term of the effect in the model.
Main effect plot with curvature
When a variable is categorical, it is necessary to check which effects are significant and also whether the levels are significantly different from each other. The multiple comparison test provides this type of information. It is based on a comparison of the averages of the response variable at the different levels. If the difference between two averages is greater than the critical limit, the two levels are significantly different; if not, they have a similar effect. If no level has an effect, all levels will have a statistically similar effect, and the averages of the response variable at the different levels will not differ significantly.
In The Unscrambler®, there are three specific outputs for the multiple comparison test:
A table of distances, that gives the two-by-two distance between the levels.
A group table, that indicates the different grouping between the levels.
A plot displaying the levels in their group.
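As an illustration only, the comparison rule behind these outputs can be sketched in a few lines of Python. The function name, the example averages and the fixed critical limit are hypothetical; in practice the critical limit is derived from the test statistic.

```python
from itertools import combinations

def multiple_comparison(level_means, critical_limit):
    """Pairwise comparison of the response averages at each level.
    Two levels are flagged as significantly different when the absolute
    difference of their averages exceeds the critical limit."""
    distances, different = {}, {}
    for a, b in combinations(sorted(level_means), 2):
        d = abs(level_means[a] - level_means[b])
        distances[(a, b)] = d
        different[(a, b)] = d > critical_limit
    return distances, different

# Hypothetical averages of a response at three category levels:
means = {"apple": 10.2, "pear": 10.5, "plum": 14.1}
dist, sig = multiple_comparison(means, critical_limit=1.0)
# "apple" and "pear" group together; "plum" forms its own group
```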
Main effects
Main effects + interactions (2-variable)
Main effects + interactions (2-variable) + quadratic terms
The list above corresponds to pre-defined model alternatives, and it is possible to remove
terms from any of these models in a hierarchical manner (except linear mixture terms, which
cannot be removed).
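The nesting of these alternatives can be made concrete with a small sketch (the term-naming convention below is illustrative, not The Unscrambler®'s):

```python
from itertools import combinations

def model_terms(factors, interactions=False, quadratics=False):
    """Enumerate the terms of the three pre-defined model alternatives:
    main effects, optionally 2-variable interactions, optionally squares."""
    terms = list(factors)                                  # main effects
    if interactions:
        terms += [a + "*" + b for a, b in combinations(factors, 2)]
    if quadratics:
        terms += [f + "^2" for f in factors]
    return terms

# Three design variables A, B, C:
screening = model_terms(["A", "B", "C"])
influence = model_terms(["A", "B", "C"], interactions=True)
optimization = model_terms(["A", "B", "C"], interactions=True, quadratics=True)
```

Removing terms hierarchically corresponds to dropping entries from the end of such a list while keeping the main effects.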
The response surface can be used to find optimal design settings. For CCD and BB designs,
one fitted response is plotted over the entire area spanned by two design variables, with any
remaining variables held constant at their minimum levels. Maxima, minima, saddle points or
stable regions can be detected by changing which variables to plot while varying the levels of
the remaining variables. For mixture designs, the plotted design region consists of three
mixture components forming a simplex/triangle.
More information on how to vary the condition can be found in the RS table section in the
plot interpretation page.
Response surface
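Plotting a response surface amounts to evaluating the fitted second-order model over a grid spanned by two design variables, with the remaining variables fixed. A minimal sketch with hypothetical coefficients:

```python
def quadratic_response(coef, x1, x2):
    """Fitted second-order model for two free design variables:
    y = b0 + b1*x1 + b2*x2 + b12*x1*x2 + b11*x1^2 + b22*x2^2.
    Variables held constant are assumed folded into the coefficients."""
    b0, b1, b2, b12, b11, b22 = coef
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2 + b11 * x1 ** 2 + b22 * x2 ** 2

# Hypothetical coefficients; grid over the region spanned by x1 and x2:
coef = (5.0, 1.0, -2.0, 0.5, -1.0, -0.5)
grid = [[quadratic_response(coef, x1 / 2.0, x2 / 2.0) for x1 in range(-2, 3)]
        for x2 in range(-2, 3)]
```

Scanning such grids while varying the fixed levels of the remaining variables is what reveals maxima, minima and saddle points.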
Limitations of ANOVA
Analyses based on MLR/ANOVA are very useful for orthogonal designs or mixture designs
where one or two (non-related) responses have been measured accurately following the
experimental conditions. ANOVA has some important shortcomings, however:
The underlying MLR is based on the assumption that all variables can be measured
independently of all other variables in the model. This is always the case for
orthogonal designs such as the factorial designs. For some designs, such as
optimization designs including quadratic terms, mixture designs, D-optimal designs
or for any design where some experimental measurements are missing, some of the
model terms (effects) will become more or less correlated. If two correlated terms
both have an influence on the response, one of these will often (arbitrarily) come
out as significant at the expense of the other. While the ANOVA will automatically
handle standard designs such as mixture designs of simplex shape, a bilinear method
such as PLSR can take into account any number of correlated variables.
If several responses are modeled, the MLR will fit a model to each response
independently. If all responses are orthogonal, one can then assess the ANOVA table
for each response without taking the remaining responses into account. The
problem is that real data are seldom or never orthogonal. For any two sufficiently
correlated responses, it is sub-optimal to try to assess the effects on one
independently from the other, and trying to find the main conclusions from several
ANOVA tables together is difficult in itself. A bilinear method such as PLSR can take
into account any number of correlated responses, and any relationships between
responses and descriptors will be easily detected.
The reliability of the p-value estimates in the ANOVA table depends highly on the residual
degrees of freedom (DF) left in the data after estimating all the parameters of the model. If
the error DF is low, the reliability of the estimated p-values is low as well. This also limits
the ability to check the assumptions of the model. When several correlated effects are
estimated, the MLR consumes more DF than the true
number of underlying, independent effects. In contrast, with the bilinear methods
such as PLSR, the user estimates the optimal model rank based on the predictive
ability of the model.
In the ANOVA table, the predictive ability of the model is given by the ‘PRESS’ and
‘R-square prediction’ values. These are based on leverage corrected residuals, which
in the case of MLR is identical to residuals obtained from a leave-one-out (LOO)
cross-validation. This reflects the ability of the model to predict each measurement
based on models fitted using all samples except the one in question. If some
samples are replicated, the LOO procedure will be overly optimistic. If there are for
instance 3 center samples in total, these will be predicted based on models where
the 2 remaining center samples have been accounted for. The prediction error will
therefore be smaller than if all center samples were kept out in the same step. In
general, all replicated measurements of any experimental point should be kept out
in a single cross-validation segment to ensure conservative error estimates.
Non-controllable variables, i.e. variables that are believed to have an effect on the
responses but that are difficult to control at the required level of precision, are
currently not included in the ANOVA. In general, an attempt to include many of
these variables in an MLR model will have a high expense in terms of residual DF,
and the above considerations about correlation between terms would also have to
be taken into account. In PLSR any number of non-controllable variables can be
included, and they can optionally be downweighted in order to discover their
influence on the data without actually allowing them to influence the model. If e.g.
the run order was mixed up in the experiment, a passive descriptor giving the run
order or time-points of the individual measurements will reveal if any effects are
aliased with a time effect.
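The replicate-wise segmentation recommended above can be sketched as follows (illustrative code, not The Unscrambler®'s implementation):

```python
from collections import defaultdict

def replicate_segments(design_points):
    """Group run indices into cross-validation segments so that all
    replicates of the same experimental point are kept out together,
    giving more conservative error estimates than leave-one-out."""
    groups = defaultdict(list)
    for run_index, point in enumerate(design_points):
        groups[tuple(point)].append(run_index)
    return list(groups.values())

# A 2^2 design with three replicated center samples (0, 0):
runs = [(-1, -1), (1, -1), (-1, 1), (1, 1), (0, 0), (0, 0), (0, 0)]
segments = replicate_segments(runs)   # the three center runs share a segment
```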
Analysis with PLS Regression
If some or all of the considerations above make analysis by ANOVA difficult, PLSR can always
be used as a powerful alternative. A refresher on the theory of PLSR is given in the chapter
on Partial Least Squares Regression.
Include all design variables, together with any interactions, quadratic or cubic effects of
interest, in the descriptor (X) matrix. Any additional non-controllable variables, background
information about the samples, and experimental details such as time of measurement, batch, or
change of instruments can be included here as well. Include all response variables. Weight
all variables with 1/SDev, or optionally downweight some of the descriptors.
Validate with cross-validation. The level of validation depends on the cross-validation
segments. If e.g. all experimental runs are replicated once, the replication error can be
assessed by leaving out a full set of experimental runs in two cross-validation segments.
Note that this will not tell you how well the model will predict new samples but rather it will
reflect the experimental error in the experiment. In order to estimate how well the model
predicts new measurements (when level combinations are allowed to vary within the design
region), keep out all replicates of each point once. This will be a more conservative and
correct estimate for the predictive power of the model.
Include the uncertainty test to get an estimate of the significance of the effects. The
following are important tools to interpret the model and make conclusions:
Weighted Beta coefficients with their uncertainty limit
The weighted B-coefficients are used to determine which effects are the most
important and their direction of influence. Effects with high positive or negative
regression coefficients have a larger influence on the response in question.
The uncertainty test shows which effects are significantly non-zero, averaged over
responses. Coefficients with high absolute values and little variation across
cross-validation segments point to significant effects.
Estimated p-values
The uncertainty test will estimate p-values for all effects and interactions included in
the PLSR model. These are based on the size and stability of the PLSR regression
coefficients in the cross-validation.
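The principle (a large coefficient that varies little across cross-validation segments indicates a significant effect) can be illustrated with a schematic jackknife-style statistic. This is a simplified sketch, not the exact formula of the uncertainty test:

```python
import math

def stability_t(full_coef, segment_coefs):
    """Schematic stability statistic for one regression coefficient:
    the full-model coefficient divided by a jackknife-style standard
    deviation over the cross-validation segment coefficients."""
    m = len(segment_coefs)
    var = sum((c - full_coef) ** 2 for c in segment_coefs) * (m - 1) / m
    return full_coef / math.sqrt(var) if var > 0 else float("inf")

# Hypothetical coefficients: a stable effect versus an unstable one
stable = stability_t(0.80, [0.78, 0.82, 0.79, 0.81])
unstable = stability_t(0.10, [0.60, -0.45, 0.35, -0.20])
```

The stable effect yields a large statistic (significant), while the unstable one stays near zero.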
Explained variance
This plot will reveal the optimal number of components in the model, its fit (blue
line) and predictive ability (red line). The optimal number of components
corresponds with the number of independent phenomena in the data that exceeds
the noise level of the measurements.
Correlation loadings
The loadings or loading weights will reveal the main dependencies between
descriptors and responses in two dimensions. Often these dimensions will capture
the majority of the co-variation between descriptors and responses.
The correlation between the factors and each original variable is captured by the
distance from the origin in the correlation loadings plot. Even downweighted
variables are easily mapped in these plots.
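A correlation loading is simply the correlation between a factor's score vector and an original variable, which is why the distance from the origin maps explained correlation. An illustrative sketch:

```python
def correlation_loading(score, variable):
    """Correlation between one factor's scores and an original variable;
    in the correlation loadings plot this is the coordinate along that
    factor, so fully explained variables lie on the unit circle."""
    n = len(score)
    ms, mv = sum(score) / n, sum(variable) / n
    cov = sum((s - ms) * (v - mv) for s, v in zip(score, variable))
    ss = sum((s - ms) ** 2 for s in score) ** 0.5
    sv = sum((v - mv) ** 2 for v in variable) ** 0.5
    return cov / (ss * sv)

# A variable perfectly aligned with factor 1 has correlation close to 1:
t1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
r = correlation_loading(t1, x)
```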
Outlier detection
The sample outlier or influence plots can reveal erroneous measurements or typos
that should be mended or removed.
Predicted vs. Reference
Use this plot to assess the model’s goodness of fit (blue points) and predictive ability
(red points) for each response variable, to look for deviating runs, and to review the
prediction statistics.
When data are missing or experimental conditions have not been reached
In a real life situation it is not always possible to reach the target for the experimental
conditions or an experiment may not go as planned. In such cases one cannot apply the
classical DOE analysis methods. In these situations one can use a PLS fitting method. The
validation procedure of the PLS by jack-knifing will provide approximate p-values for the
B-coefficients; see the section on Analysis with PLS Regression above.
More information on PLS regression can be found in the chapter on Partial Least Squares
Regression.
For a more extensive screening, variables that are known not to interact with other variables
can be left out. If those variables have a negligible linear effect, one can choose a constant
level for them (e.g. the least expensive). If those variables have a significant linear effect,
they should be fixed at the level most likely to give the desired effect on the response.
The previous rule also applies to optimization designs, if it is known that the variables in
question have no quadratic effect. If it is suspected that a variable can have a nonlinear
effect, it should be included in the optimization stage.
Note: If some of the ingredients do not vary in concentration, these are left out
from the mixture equation such that the ‘total amount’ refers to the sum of the
remaining mixture components. For instance if one wishes to prepare a fruit punch
by blending varying amounts of watermelon, pineapple and orange juice, with a
fixed 10% of sugar, the mixture components sum to 90% of the juice blend but to
100% of the ‘total amount’ (mixture sum). This ensures that the three mixture
components will span a 2-dimensional simplex that can be modeled by a regular
mixture design.
Whenever the mixture components are further constrained, like in the example shown
below, the mixture region is usually not a simplex.
With a multilinear constraint, the mixture region is not a simplex
In the absence of multilinear constraints, the shape of the mixture region depends on the
relationship between the lower and upper bounds of the mixture components. It is a simplex
if for each mixture component, the upper bound + the sum of lower bounds for the
remaining components equals 100% (the total amount).
The figure below illustrates one case where the mixture region is a simplex and one case
where it is not.
Changing the upper bound of watermelon affects the shape of the mixture region
In the leftmost figure, the upper bound of watermelon is 100% - (17% + 17%) = 66%, and the
mixture region is a simplex. If the upper bound of watermelon is lowered to 55%, as in the
figure to the right, this value is smaller than 100% - (17% + 17%) and the mixture region is
no longer a simplex.
Note: When the mixture components only have lower bounds, the mixture region is
always a simplex.
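The simplex rule above can be written as a small check. The bounds below follow the fruit-juice example (17% lower bounds on pineapple and orange); the function name and tolerance are illustrative:

```python
def is_simplex(lower, upper, total=100.0):
    """A consistent mixture region is a simplex when, for every
    component, its upper bound is not tighter than the total minus the
    sum of the lower bounds of the remaining components."""
    for i, up in enumerate(upper):
        reachable = total - (sum(lower) - lower[i])
        if up < reachable - 1e-9:    # a tighter upper bound cuts off a corner
            return False
    return True

# Watermelon, pineapple, orange (in %): an upper bound of 66 on watermelon
# keeps the simplex shape, while lowering it to 55 does not.
simplex_66 = is_simplex([0.0, 17.0, 17.0], [66.0, 83.0, 83.0])   # True
simplex_55 = is_simplex([0.0, 17.0, 17.0], [55.0, 83.0, 83.0])   # False
```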
So whenever one of the minor constituents of a mixture plays an important role in the
product properties, one can investigate its effects by treating it as a process variable.
an important study. This can be done at the interpretation stage, where the mixture that
gives the desired properties with the smallest amount of that constituent is chosen.
General buttons
Start
Define Variables
Choose the Design
Design Details
Plackett-Burman designs
Fractional factorial designs
Full factorial designs
Full factorial designs without blocking
Full factorial designs with incomplete blocking
D-optimal designs
D-optimal designs including mixture constraints
Central Composite and Box-Behnken designs
Mixture designs
Simplex mixture designs
Non-simplex mixture designs and process+mixture designs
Additional Experiments
Randomization
Summary
Design Table
When sufficient information has been entered into the tab, the Finish button is made active.
Pressing this button completes all tasks in the design wizard and creates the design in
The Unscrambler® navigator.
8.3.2 Start
The first tab in the sequence is divided in four sections:
Name
Goal
Description
History
Start tab
Name
By default the design will be named “MyDesign”. You may change this to the name you
would like the design to have in the project navigator later.
Goal
Select the most appropriate goal of the experiment. Based on this selection and the
number/type of design variables, the wizard will propose a suitable design.
Screening
In a screening experiment the goal is to isolate design variables that have a
significant main effect on the response variable(s).
When selecting this goal, the Design Experiment Wizard will favour either a Plackett-
Burman design or a low resolution Fractional Factorial design, provided the design
variables are not under any constraints. For mixtures an Axial design will be
suggested, and a low number of samples will be suggested if a D-optimal design is
selected.
Screening with interaction
In a screening with interaction experiment (often referred to as a factor influence
study) the goal is to assess both the main effects and the interactions of the design
variables on the response variable(s).
When selecting this goal, the Design Experiment Wizard will favour either a higher
resolution (IV or V) Fractional Factorial or a Full Factorial design, provided the
designed variables are not under any constraints. For mixtures a Simplex Lattice
design will be suggested, and the default terms and number of samples for a D-
optimal design will be adjusted accordingly.
Optimization
When choosing optimization as the goal, the design investigates main effects,
interactions and square terms on the response variable(s).
By choosing optimization as the goal, the Design Experiment Wizard will favour
either a Central Composite or Box-Behnken design, provided the designed variables
are not under any constraints. The suggested mixture design will be a Simplex
Centroid design, and the number of terms and samples for a D-optimal design will
be higher.
Description
Edit the blank section to store information on the design and specific details about the
experiments.
History
This part contains information on the history of the design such as the creator, the date of
creation and possible revisions. It is auto-generated by the Design Experiment Wizard.
Variable table
This table contains information on all the variables to be included in the experiment. The
variables are ordered as follows:
The variables can be re-ordered within their category by using Ctrl+arrow up or down.
To edit a variable, highlight the corresponding row, modify the information in the variable
editor, and click OK.
To delete a variable, highlight the corresponding row and click the Delete button.
Variable editor
Click the Add button to add a new variable.
Specify the characteristics of the new variable as follows:
ID
The identity of the variable will be auto-generated. Design variables will have upper
case IDs (A-Z, except reserved letter I), response variables will have integer IDs, and
non-controllable variables will have lower case IDs (a-z, except i). Design variables
no. 26 and onwards are denoted A1, B1, etc.
Name
Enter a descriptive name in the Name field. If nothing is added here, the ID will be
used as name.
Type
Select the variable type from the following list using the radio buttons:
Constraints
Select the appropriate constraint setting for the variable (by default no constraints):
Type of levels
The levels are either continuous or category:
Use Category if the variable can change between 2 or more distinct levels or
groups, but where one group/level cannot be ranked on a numerical scale in
relation to the others. For instance the level ‘apple’ cannot be ranked as
higher/lower/better/worse than level ‘pear’. Similarly it is not possible to
calculate an average level between category groups. Two or more levels can
be defined for category variables (max. 20). If category variables of more
than two levels are included, the only available design will be the Full
Factorial (without blocking).
Note: Never define a numeric variable as category in order to enable more levels in
the design. These are interpreted differently and the analysis will be wrong. For
optimization designs that require more than two levels to fit a response surface,
additional levels will be added later based on the defined high and low levels.
For continuous variables: set the bounds of the design space with the low
and high values in the Level range field. By default the levels are -1 and
1 (or 0 and 100 for mixture variables).
For category variables: the Levels section makes it possible to edit the
number and names of the levels. The default values are “Level 1” and “Level
2”.
Units
Specify any unit for the variable in question. For mixture variables the default unit is
’%’.
Mixture Sum
(Available for mixture variables only.) This is the sum of all mixture components in
the blend. The default value is 100 (%), but any positive value is allowed.
Number of variables
Constraints on the variables
Goal of the experiment.
The Unscrambler® suggests the most appropriate design following a set of rules. Use the
radio buttons to select a different design than the suggested one. Note that there are
limitations on which designs can be selected based on the number and type of design
variables; the goal of the experiment, however, can be overridden by the user. The suggested
design remains displayed in bold.
When a full factorial design is selected, a check-box is used to enable (incomplete) blocking.
Select blocking in cases where groups of experimental runs have to be performed under
different settings. For instance if one batch of raw material is insufficient for the full
experiment, different batches will have to be used for different runs. Blocking ensures that
any potential batch effect will not be confounded with other important effects such as main
effects.
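As an illustration of the principle, a two-level full factorial can be split into two blocks by the sign of a blocking generator; the generator ABC used below is a common choice for three factors (a sketch, not The Unscrambler®'s implementation):

```python
from itertools import product

def assign_blocks(k, generator):
    """Assign each run of a 2^k full factorial to one of two blocks by
    the sign of the blocking generator (e.g. the ABC interaction for
    k = 3). The generator and its aliases become confounded with the
    block effect, so no results are returned for them in the ANOVA."""
    blocks = {}
    for run in product([-1, 1], repeat=k):
        sign = 1
        for idx in generator:
            sign *= run[idx]
        blocks[run] = 1 if sign > 0 else 2
    return blocks

# 2^3 design blocked on the ABC interaction (factor indices 0, 1, 2):
blocks = assign_blocks(3, generator=(0, 1, 2))
```

Each block receives half of the runs, so a batch effect shifts one block as a whole instead of biasing the main effects.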
Information
The information box provides information on the selected design.
Goal
Number of variables
Constraints on the variables
defined goal: Screening selects an axial design, Screening with interaction selects a
Simplex-Lattice design and Optimization selects a Simplex-centroid design.
If additional constraints on the mixture components are imposed, the design region
might be non-simplex. Also, if process (i.e. non-mixture) variables are included
together with the mixture components, regular mixture designs cannot be used. The
appropriate choice for these setups is a D-optimal design.
In the situation where linear constraints are applied, for non-simplex mixture
designs, or for designs containing both process and mixture variables:
The appropriate choice is a D-optimal design. Such designs require at least two process
variables or at least three mixture variables.
Use the drop-down box to select among the available number of design points
Change the resolution with the radio buttons.
The confounding patterns for the selected design are displayed in a separate box. They can be
visualized using the variable IDs, in the form A + BC, or using the names of the variables. To
see the variable names, tick the Show names box.
After finishing a fractional factorial design, the resolution and confounding patterns will be
given in the Info box below the project navigator.
Full factorial designs
The Design Details tab looks different depending on whether blocking was selected in the
previous tab.
The blocking generators, as well as all their confounding interactions, will be treated
separately from the remaining effects in the subsequent ANOVA. This means that no results
will be returned for any effects confounded with blocks. The Patterns frame allows
identification of the effects confounded with blocks.
After finishing a full factorial design with incomplete blocking, the block confounding
patterns will be given in the Info box below the project navigator.
D-optimal designs
This design type corresponds to variables with constraints applied, such as:
Note:
To add a new constraint, use the button Click to add new constraint. A list of all design
variables that are defined to have either Linear or Mixture constraints will be available for
editing. Select a multiplier for each constrained variable, or set a variable to 0 if it is
not part of the current constraint.
The operator to be used in the multilinear constraint is selected from the drop-down list:
The ’<’ and ’>’ operators are convenience functions only. When setting up the candidate
points, ’<=’ and ’>=’ will be used instead, with the target value shifted down or up by 0.01
relative to the specified target. After specifying the target value, the new constraint will
be added to the Current constraints box.
Repeat the above procedure for adding additional constraints, or edit an existing constraint
by clicking on the relevant box in Current constraints.
If mixture variables are included in the design, a constraint that they sum to 100% (as given
by the Mixture sum), is added automatically. This constraint cannot be edited or removed.
To delete a constraint select it in the Current constraints table and click on the Delete
button.
Click OK when all of the desired constraints have been added. The constraints will then be
tested to verify that they are both active and consistent.
An inactive constraint is one that is superfluous because it does not constrain the design
region beyond what the variable levels already specify. If, for instance, the ranges of A and
B are both [0 10], a constraint that A+B>=0 will be inactive.
Inactive constraint warning
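Since the constraints are linear, whether a constraint actually restricts the design region can be decided by checking the corners of the box spanned by the variable ranges. An illustrative sketch of such a test, using the A+B>=0 example:

```python
from itertools import product

def constraint_is_active(coefs, bounds, target, op=">="):
    """A multilinear constraint sum(c_i * x_i) >= target (or <=) is
    inactive when it already holds over the whole box defined by the
    variable ranges; for a linear expression, checking the corners of
    the box is sufficient."""
    values = [sum(c * x for c, x in zip(coefs, corner))
              for corner in product(*bounds)]
    if op == ">=":
        return min(values) < target   # active only if some corner violates it
    return max(values) > target

# A and B both range over [0, 10]: A + B >= 0 never binds (inactive),
# while A + B >= 5 cuts the region (active).
inactive = constraint_is_active([1, 1], [(0, 10), (0, 10)], 0)
active = constraint_is_active([1, 1], [(0, 10), (0, 10)], 5)
```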
Second order mixture: These are all 2-variable interaction terms between the
mixture components;
Process interactions: These are all 2-variable interaction terms between the process
variables;
Process squares: These are all quadratic terms of the process variables;
Mixture and process interactions: These are all interactions of the first order mixture
terms with any first or second order process term.
Check the appropriate boxes to pre-select any of these groups of terms. For designs with
process (non-mixture) variables only, use the following guidelines:
For mixture designs, include second order mixture terms if the goal is Screening with
interaction or Optimization.
For process/mixture designs it may be useful to optimize either the process or mixture
variables, while sampling for the main effects only of the remaining group. It is also possible
to include the second order terms for both types of variables while not including interactions
between the two. By assuming that there are no interactions between the process and
mixture variables, the number of experiments can be greatly reduced.
For a more specific selection of model terms click the Modify button. This will bring up a
dialog listing all higher order terms available for selection. The selected effects are listed in
the left box and the non-selected effects are listed in the right box. All main effect terms
(and offset if non-mixture design) are included by default and will not be listed. Any second
order mixture, process interaction and process square terms will be available for selection.
Any mixture and process interaction terms will be available for selection only if this box is
checked in the Model terms frame.
Dialog for selection of interaction and square terms
The Add and Remove buttons can be used to move highlighted terms from
one box to the other. The Add All and Remove All buttons do the same for all available
terms. The Add Int button adds all second order mixture as well as process interaction terms
to the model, whereas Add Square moves all process square terms to the Selected Effects
box. Click OK to keep the changes or Cancel to discard them. If some but not all of the terms
of a given order are selected, the corresponding check-box will be shown in a filled state
(intermediate between the checked and empty states).
Edit the design settings
The total number of design points is divided between a number of D-optimal design points,
space filling points and additional center points. The default sum of D-optimal and space
filling points is given by the number of model terms and the Goal of the experiment. An
offset is included in the model terms only if no mixture components are specified.
If Goal=Screening, three points more than the number of model terms are suggested,
plus three additional center points.
If Goal=Screening with interaction, six points more than the number of model terms
are suggested, plus four additional center points.
If Goal=Optimization, nine points more than the number of model terms are
suggested, plus five additional center points.
The minimum number of design points is the same as the minimum number of D-optimal
points. These are limited by the number of model terms.
The maximum number of design points is the same as the maximum number of D-optimal
points, which is limited by the number of candidate points. As the candidate points are
generated only when the Generate button is pressed, a warning will be given if too many
design points are specified.
The minimum number of space filling and additional center points is zero. Note that the
candidate point list will contain one center point, which might be added even though the
number of additional center points is set to zero.
Change the default number of center points in the Additional Experiments tab. Note that the
center sample coordinates will be calculated (or re-calculated) only when the Generate
button is pressed.
An Advanced Design Settings dialog opens when clicking the More button. Three settings can be
tuned in this window:
Number of initial tries: There is no guarantee that a single run of the D-optimal
algorithm will return the globally optimal set of design points. To avoid getting stuck
in local optima the algorithm can be run multiple times using different starting
conditions. Only the result with highest D-optimality is returned. The default number
of initial tries is 5, and this value can be changed between 1 and 1000.
Random points in the initial sets: To speed up the algorithm the starting set is not
completely random. Rather, a smaller random set is used and points are added
sequentially to maximize the D-optimality of the starting design. The number of
random points in the initial sets can be tuned between the number of model
terms and the specified number of D-optimal points.
Max number of iterations: Here you can set an upper limit on the number of point
exchange operations that will be performed. The default limit is 100, the lower limit
is 10 and the upper limit is 1000 iterations. You may try to increase the number if
you experience convergence problems.
The Advanced Design Settings dialog
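The multi-start point-exchange idea can be illustrated on a toy problem with two model terms (an offset and one variable). This is a drastically simplified sketch, not the algorithm used by The Unscrambler®:

```python
import random

def d_value(rows):
    """D-optimality criterion det(X'X) for a design with two model terms."""
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(2)] for i in range(2)]
    return xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]

def d_optimal(candidates, n_points, tries=5, seed=0):
    """Toy multi-start search: each try starts from a random subset and
    exchanges single points against the candidate set for as long as
    det(X'X) improves; the best result over all tries is kept."""
    rng = random.Random(seed)
    best, best_d = None, float("-inf")
    for _ in range(tries):
        design = rng.sample(candidates, n_points)
        improved = True
        while improved:
            improved = False
            for i in range(n_points):
                for cand in candidates:
                    trial = design[:i] + [cand] + design[i + 1:]
                    if d_value(trial) > d_value(design):
                        design, improved = trial, True
        if d_value(design) > best_d:
            best, best_d = design, d_value(design)
    return best, best_d

# Model terms (1, x): the extreme levels -1 and 1 are the D-optimal choice.
cands = [(1.0, x) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
design, d = d_optimal(cands, n_points=2)
```

Running several tries with different random starts, as the Number of initial tries setting does, guards against a single run stalling in a local optimum.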
region may have a non-simplex shape. D-optimal designs should be used for non-simplex
design regions as the standard mixture designs will not work.
Such a design is set up in a similar manner to a D-optimal design without mixture
components. The main difference is that a mixture constraint including all mixture
components is added automatically. These are required to sum to 100%.
Note: Currently classical ANOVA and response surface plots are not available for
non-simplex and process/mixture designs. In order to take advantage of these
features, you might consider if a regular mixture design could be an alternative.
Use the radio buttons to select the most appropriate design. For more information on these
designs please refer to the Theory section.
Design Details: Central Composite and Box-Behnken designs
The star point distance is the distance from the origin to the axial points in normalized units
(i.e. given that the upper and lower levels of the factorial points are 1 and -1, respectively).
The default star point distance for CCC designs ensures rotatable designs. For ICC designs
the inverted value is used, which by default gives rotatable designs for ICC designs as well.
The star point distance for FCC designs is always 1 (non-rotatable).
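For reference, the rotatable star point distance of a circumscribed design is the fourth root of the number of factorial points (a standard result for central composite designs). A small sketch; the `fraction` parameter for a fractional factorial core is our assumption:

```python
def star_distance(k, design="CCC", fraction=0):
    """Star point distance in normalized units (factorial levels at -1, 1).
    CCC: alpha = F**0.25 with F = 2**(k - fraction) factorial points, which
    makes the design rotatable; ICC: the inverted value 1/alpha; FCC: 1.
    The `fraction` argument (fractional factorial core) is illustrative."""
    if design == "FCC":
        return 1.0
    alpha = (2.0 ** (k - fraction)) ** 0.25
    return 1.0 / alpha if design == "ICC" else alpha

# Two design variables: alpha = 4**0.25, i.e. about 1.414
alpha2 = star_distance(2)
```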
The following table is given as a guide to find the most appropriate design:
Design | Uses points outside high and low levels | Number of levels | Accuracy of estimates
Mixture designs
Axial
In an axial design all points lie on axes that go from each vertex through the overall
centroid, ending up at the opposite surface or edge. At these end points the
component in question is zero and the remaining components have equal
concentrations.
The end points allow the study of blending processes where each component may
be reduced to zero concentration. These can optionally be left out from the
experiment by un-checking the Include end points box.
Simplex lattice
A simplex lattice design is the mixture equivalent of a full-factorial design where the
number of levels can be tuned. It can be used for both screening and optimization
purposes, according to the lattice degree of the design.
The Lattice degree equals the number of segments into which each edge is divided.
This corresponds to the maximal order that can be calculated for the subsequent
model. Edit the degree by changing the default value.
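Generating the design points of a simplex lattice amounts to enumerating all blends whose proportions are multiples of 1/degree and sum to 1. A sketch:

```python
def simplex_lattice(q, degree):
    """All blends of q components whose proportions are multiples of
    1/degree and sum to 1: the {q, degree} simplex lattice."""
    points = []

    def fill(prefix, remaining):
        if len(prefix) == q - 1:
            points.append(prefix + [remaining])
            return
        for i in range(remaining + 1):
            fill(prefix + [i], remaining - i)

    fill([], degree)
    return [[i / degree for i in p] for p in points]

# {3, 2} lattice: the vertices plus the edge midpoints, 6 design points
pts = simplex_lattice(3, 2)
```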
Simplex centroid
A Simplex centroid design consists of extreme vertices, center points of all “sub-
simplexes”, and the overall centroid. A “sub-simplex” is a simplex defined by a
subset of the design variables.
Simplex centroid designs are well suited for optimization purposes. If Augmented
design is checked, axial check blends are added to the design. These are the same as
the Axial points in an Axial design.
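A simplex centroid design can be enumerated directly: one equal-proportion blend for every non-empty subset of the components, 2^q - 1 points in total. A sketch:

```python
from itertools import combinations

def simplex_centroid(q):
    """Vertices, centroids of every sub-simplex and the overall
    centroid: one blend per non-empty subset of the q components,
    mixed in equal proportions."""
    points = []
    for size in range(1, q + 1):
        for subset in combinations(range(q), size):
            blend = [0.0] * q
            for idx in subset:
                blend[idx] = 1.0 / size
            points.append(blend)
    return points

# Three components: 3 vertices, 3 binary blends and 1 overall centroid
pts3 = simplex_centroid(3)
```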
Adjust mixture levels
There are certain limitations on which ranges are allowed for the components in a
mixture design:
1) The design levels must be consistent. This has to do with the mixture constraint
that all component concentrations must sum to the Mixture Sum (100%). If for
instance the lower level of one component is constrained to 20%, the upper level of
the remaining components cannot exceed 80% (see image below).
2) Any (consistent) design region has to be of simplex shape, i.e. it must form a
triangle for 3 components, a tetrahedron for 4 components, etc. Imposing upper
limit constraints on some of the mixture components will often lead to a non-
simplex design region.
A mixture design is automatically tested for condition 1) above, and if the design is
consistent it is tested for condition 2). If either test fails, a warning is given and an
Adjust mixture levels button is activated. Clicking this button opens the Adjust
mixture levels dialog with several options.
Adjust Mixture Levels
Make levels consistent: Active whenever the test for consistency fails. The
bounds will be adjusted for consistency with the mixture constraint.
Adjust with normalized levels: Active whenever any range differs from the
default [0, 100%]. All mixture bounds will be adjusted to their maximum
range as bounded by 0 and the Mixture Sum.
On pressing OK, the upper and lower levels of the components are updated with the
new values. On pressing Cancel, the dialog is closed without applying any changes.
Only when the mixture design is both consistent and of simplex shape will the Finish
button be activated in the Design Experiment Wizard.
Design variables
Replicated samples
Center samples
Reference samples
Design of Experiments
Design variables
The design variables table provides a running summary of the design variables’ levels and
constraints.
Replicated samples
The number of replicated samples indicates the number of times the base design
experiments are run. Replication is used to measure the experimental error. Usually this is
done on center samples; however, increasing the number of replicates in the design
improves its precision estimates by measuring replicates over the entire design space. It is
suggested to use at least two replicates of the design if the experimental results are likely
to vary significantly during the running of the experiment.
Note: Replicates (or replicated samples) are not the same as repeated
measurements. Replicates require a new experiment to be run using the same
settings for the design variables with a new experimental setup, while repeated
measurements are measures performed on the same samples numerous times in a
short time period.
Center samples
Center samples are used as a test for curvature and as a source for error variance
estimation. In the latter case, use at least two (preferably three or more) center samples as
this improves the precision of any estimates. By default the Design Experiment Wizard
suggests a number of center samples. These can be modified by using the spin box next to
Number of center samples.
The center samples are experimental runs at the mid-level of the design variable ranges
when all design variables are continuous. This corresponds to the average (mean) of the
different variables in the design.
If 1-4 variables in the design are categorical and at least one is continuous, center points
can still be defined; however, these are only defined for the continuous variables in the
design.
Then a specified number of center points will be given for all combinations of categorical
levels. This ensures that the resulting design remains orthogonal.
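The construction just described — a center point at the mid-levels of the continuous variables, repeated for every combination of category levels — can be sketched as follows (variable names and ranges are made up for illustration):

```python
from itertools import product

# Two continuous design variables and one two-level category variable
continuous = {"A": (10.0, 30.0), "B": (1.0, 2.0)}
categories = {"D": ["x", "y"]}

# Center point = mid-level of every continuous variable
center = {name: (low + high) / 2 for name, (low, high) in continuous.items()}

# One center run per combination of category levels keeps the design
# orthogonal: two runs here (D='x' and D='y', both at A=20.0, B=1.5),
# four runs with two two-level category variables
center_runs = [
    {**center, **dict(zip(categories, combo))}
    for combo in product(*categories.values())
]
print(center_runs)
```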
An example is shown below for the simplest 2 factor factorial design at two levels with one
category variable, and for the 3 factor case with one center point defined.
Center point configurations of two factorial designs with one category variable
For the above designs it can be seen that two center points are required when there is one
categorical variable in the design. The center point is located at the mid-point of the
remaining continuous variables. The diagram below shows the 3 factor design with two
categorical variables, in which case 2² = 4 center points are needed.
In the situations described above, one replicate of center points was defined. In this case,
pure error cannot be calculated as the center points are all unique. In order to calculate pure
error, replicates of these center points are required. For the 2 factor design, two replicates
of center points yield 4 center points in total. Each replicated center point provides one
degree of freedom per categorical level, i.e. 2 degrees of freedom in total for pure error.
For the 3 factor example with two categorical variables, two replicates of center points
results in 8 runs for center points alone. In this case, there are 4 unique center points,
therefore this situation provides 4 degrees of freedom for pure error. The more categorical
variables, the more center points are required, i.e. 2 center points minimum per categorical
variable. If replication is required, the number of center points can increase rapidly, to the
point where the number of center points exceeds the number of design points. In these
In the example presented here, variable D is categorical. Its value can be changed using the
drop-down list. It is also possible to delete this specific center sample by clicking on the
Delete button. When the level values for the category variables have been specified, click
OK.
Reference samples
In the field reference samples, it is possible to define samples which are incorporated for
comparison. A typical reference sample is a target sample, a competitor’s sample or a
sample produced after changes to a given recipe. The values of the design variables are not
entered and are set as missing; they can be modified later in The Unscrambler®.
8.3.7 Randomization
This tab allows a user to randomize the order of the experiments.
Randomization tab
Randomized experiments
This table shows the sequence of experiments to run.
Re-randomize
If for any reason it is necessary to change the order of the samples, select the Re-
randomize button, and a new sequence of experiments will be generated.
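Randomization itself is straightforward; an illustrative sketch (not the product's actual algorithm) for a 2² factorial design:

```python
import random

# Standard order of a 2^2 factorial design in coded levels
standard_order = [(-1, -1), (1, -1), (-1, 1), (1, 1)]

def randomize(runs, seed=None):
    """Return a new, shuffled run sequence; call again to re-randomize."""
    order = list(runs)                 # keep the standard order intact
    random.Random(seed).shuffle(order)
    return order

print(randomize(standard_order))
```

Running the experiments in random order protects the estimated effects against systematic drift, e.g. instrument warm-up or raw-material aging.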
8.3.8 Summary
This tab gives a summary of the complete design set-up, as well as the ability to calculate the
power of the design to detect small changes in the individual responses. A small change
means that the effect should be significant at a 5% level.
Summary tab
A dialog box will appear where one can select the appropriate design matrix to modify in the
field Choose design.
Modify/Extend Design dialog box
Give the new design a unique name, modify any settings and click Finish when satisfied. This
will create a new design table in the project navigator.
All response values will be set to zero in the modified design.
Check the Insert – Create design… section to get more information about the design wizard.
8.4.1 To remember
When extending a design where some experiments have already been run, it is
recommended to add some extra center samples to check for bias over time in the
analysis.
Refer to the theory section Extending a design for more details.
Design
Response
Non-controllable
Main effects
Main effects + Interactions (2-var)
Main effects + Interactions (2- and 3-var)
Main effects + Interactions (2-var) + quadratic
Main effects + Interactions (2-var) + quadratic + cubic
Main effects + Interactions (2 and 3-var) + quadratic + cubic
Design
Response
Non-controllable
First order (Linear)
Second order (Quadratic)
Special cubic
Full cubic
Main effects + Responses
The tables are also divided into three to five sample sets (row ranges):
All samples
All design samples
Center samples
Design and center samples
Reference samples
Standard: This is the accepted standard order for design variables. In particular,
factorial designs adopt the standard (1), a, b, ab, … notation.
Randomized: This order is the one generated after randomization; it provides the
experimental sequence in which the runs should be performed.
The order can be changed by clicking on one of the two columns, selecting Edit–Sort
and then choosing Ascending or Descending.
Sort menu
Model Inputs
Select the Predictors and Responses to analyze. Only data tables created using the
Design Experiment Wizard (Insert–Create Design…) are accepted as input.
Usually the predefined column sets Design and Response should be selected in the
Cols box of the Predictors and Responses, respectively. Select All rows. Note that
selecting less or more data may alter desirable properties of the design.
Select the Effects to include in the model. It can include more or fewer terms; try a
simpler model first.
In subsequent analysis, terms can be removed or added to the model. Select the
relevant effects and use the Move button to add/remove them from the analysis.
For factorial designs with no category variables and at least one centre point, there
is an option to calculate Curvature. A Curvature term can be found in the Not
Estimated box and is calculated by moving it to the Estimated box. Curvature
removes one degree of freedom from Lack of Fit calculations and is used to
determine whether the model is linear or not. Note that even if the curvature term
is added in the ANOVA, the final model (i.e. regression coefficients and predicted
responses) does not include the curvature term. Because the residual degrees of
freedom are reduced when testing for curvature, avoid using it indiscriminately.
Note: The test for curvature will also remove some variation from the error term. In
some cases this may result in a low p-value for the model even though the model
itself does not include the curvature term. Therefore you should always verify your
final model by recalculating without curvature.
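For reference, the single-degree-of-freedom curvature quantity follows the standard textbook form: the mean of the factorial points is compared with the mean of the center points. A sketch with made-up responses (not taken from the software):

```python
# Responses at the factorial (cube) points and at replicated center points
factorial_y = [10.0, 12.0, 18.0, 20.0]
center_y = [16.0, 17.0, 18.0]

n_f, n_c = len(factorial_y), len(center_y)
mean_f = sum(factorial_y) / n_f
mean_c = sum(center_y) / n_c

# For a purely linear model the two means coincide; the curvature sum
# of squares (1 DF) measures their discrepancy.
ss_curvature = n_f * n_c * (mean_f - mean_c) ** 2 / (n_f + n_c)
print(round(ss_curvature, 3))  # 6.857
```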
Method
Most designs may be analyzed using Classical DoE Analysis, which performs individual
ANOVAs for each response. If the design is heavily constrained or if multiple correlated
responses should be analyzed together, Partial Least Squares Regression may be a better
option. Other changes to a design such as modified factor levels or missing values might also
favour PLSR over ANOVA in some cases. Please refer to the theory section for a discussion
on the limitations of ANOVA.
The Method tab displays some useful properties of the design to make it easier to decide on
the best analysis method.
Note: Modify design levels with caution, as such changes to the design matrix
cannot currently be undone (change back manually or use Tools–Modify/Extend
design if needed).
Note: Mixture designs are by definition non-orthogonal and can have both large
condition numbers and small D-efficiencies. These designs can still be analyzed using
Classical DoE.
Select the preferred analysis method using the radio buttons and click OK to perform
analysis.
Analysis with ANOVA
For further information on how to interpret the plots that are generated, please refer to the
section on interpreting DoE plots.
Accessing plots
Available plots for Classical DoE Analysis (Scheffe and MLR)
ANOVA overview
ANOVA table
Summary
Variables
Model check
Lack of fit
Diagnostics
Effect visualization
Effect summary
Effect and B-coefficient overview
Regression coefficients and their confidence interval
B-coefficient table
Effect visualization
Effect summary
Residuals overview
Normal probability of Y-residuals
Y-residuals vs. Y-predicted
Histogram of Y-residuals
Y-residuals in experimental order
ANOVA table
Diagnostics
B-coefficients
Regression coefficients and their confidence interval
B-coefficient table
Effect visualization
Effect visualization
Effect summary
Cube plot
Error table
Predicted vs. Reference
Response surface
Response surface plot
Response surface table
Multiple comparison
Multiple comparison plot
Group table
Distance table
B-coefficient table
Available plots for Partial Least Squares Regression (DoE PLS)
Overview
The availability of these plots is toggled by the options ‘Show plots’/’Hide plots’, accessible
by right-clicking on the DoE model in the project navigator. This will add or remove the
Plots branch of the model. The plots are also available from the toolbar or from right-
clicking in any of the plot windows.
8.8.2 Available plots for Classical DoE Analysis (Scheffe and MLR)
ANOVA overview
The ANOVA overview plot node contains four plots. The plots described below are given for
all Plackett-Burman, Fractional Factorial and Full Factorial designs (unless otherwise noted).
For Optimization and Mixture designs, the Effect visualization and Effect summary plots are
replaced with a Response surface plot and table.
ANOVA table
The ANOVA table contains all sources of variation included in the model.
Sums of squares (SS)
This is an unscaled measure of the dispersion or variability of the data table. It is the
sum of squares of the distance from the samples to the average point. It increases
with the number of samples.
All calculations are based on coded levels, i.e. the variable ranges are scaled
between [-1, 1] for process variables and between [0, 1] for mixture variables.
Degrees of freedom (DF)
The number of degrees of freedom of a phenomenon is the number of independent
ways this phenomenon can be varied. In the model there is one DF for each
independent parameter estimated.
Mean squares (MS)
This is the ratio of SS over the degrees of freedom. It estimates the variance, or
spread, of the observations of the different sources in a comparable unit.
F-ratio
This is the ratio between explained variance (associated with a given predictor) and
residual variance. F-ratios are not immediately interpretable, since their significance
depends on the number of degrees of freedom. However, they can be used as a
visual diagnostic: effects with high F-ratios are more likely to be significant than
effects with small F-ratios.
p-value
A small value (for instance less than 0.05 or 0.01) indicates that the effect is
significantly different from zero, i.e. that there is little chance that the observed
effect is due to mere random variation.
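These quantities can be illustrated on a minimal hypothetical example — one two-level effect with duplicated runs. The data are made up and this is illustrative only; real ANOVA tables contain many more sources:

```python
# Duplicated responses at the Low and High level of one design variable
groups = {"low": [10.0, 12.0], "high": [18.0, 20.0]}
all_y = [y for g in groups.values() for y in g]
grand_mean = sum(all_y) / len(all_y)

# SS: unscaled dispersion around the average
ss_effect = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                for g in groups.values())
ss_total = sum((y - grand_mean) ** 2 for y in all_y)
ss_residual = ss_total - ss_effect

# DF: one per independent parameter estimated
df_effect = len(groups) - 1              # 1
df_residual = len(all_y) - len(groups)   # 2

# MS = SS / DF; F = explained variance over residual variance
ms_effect = ss_effect / df_effect
ms_residual = ss_residual / df_residual
f_ratio = ms_effect / ms_residual
print(ss_effect, ss_residual, f_ratio)  # 64.0 4.0 32.0
```

The p-value would then be obtained from the F(1, 2) distribution; with so few residual DF the test has little power, which is one reason replicates matter.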
There are several types of sources of variations grouped in different parts of the table:
Summary
Variables
Model check
Lack of fit
In addition, some Quality values are found at the end of the table, including:
Method used
This refers to the type of samples used to calculate the error values. It can take three
values:
Design: the design is not saturated so the error values can be calculated on
the residual degree of freedom from the model.
R-square
Coefficient of multiple determination. A value close to 1 indicates a good fit, while a
value close to 0 indicates a poor fit.
Adjusted R-square
Coefficient of multiple determination adjusted for the DF. While R-square will
increase towards 1 as more parameters (effects) are added to the model, this
statistic will favour additional terms only if the increase in SS is sufficiently high.
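Both statistics follow the standard formulas; a sketch with assumed values:

```python
# Assumed sums of squares and model size (illustrative values)
ss_total, ss_residual = 68.0, 4.0
n, p = 8, 3            # observations; estimated effects besides B0

r2 = 1 - ss_residual / ss_total
# Adjusted for DF: grows only if added terms pay for themselves
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2, 4), round(adj_r2, 4))  # 0.9412 0.8971
```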
R-square prediction
R-square computed on the predicted values; this is the most conservative of the three
R-squares and says something about the predictive ability of the model.
S
Estimate for standard deviation (Root Mean Squared Error of Calibration; RMSEC)
Mean
Average value of the reference Y values on samples taking part in the analysis.
C.V. in %
The coefficient of variation is a normalized measure of dispersion of a probability
distribution: the standard deviation expressed as a percentage of the mean.
PRESS
PRediction Error Sum of Squares is an estimate of the dispersion of leverage
corrected residuals. It accounts for the predictive ability of the model in the sense
that each residual value is estimated as if the sample was left out from the model
calibration. The magnitude of this statistic can be compared with the corrected total
SS (the smaller the better).
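The leverage-corrected residuals can be computed without actually refitting the model once per sample; a sketch with made-up residuals and leverages:

```python
# Raw residuals e_i and leverages h_i from a fitted model (assumed values)
residuals = [0.5, -1.0, 0.8, -0.3]
leverages = [0.40, 0.25, 0.25, 0.40]

# e_i / (1 - h_i) is the residual the sample would have had if it were
# left out of the calibration; PRESS is the sum of their squares.
press = sum((e / (1 - h)) ** 2 for e, h in zip(residuals, leverages))
print(round(press, 2))  # 3.86
```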
ANOVA table
Summary
The first part of the ANOVA table tests the significance of the model when all specified
effects are included. If the model p-value is small (e.g. smaller than 0.05), it means that the
model explains more of the variation in the response variable than could be expected from
random phenomena. In other words, the model is significant at the 5% level. The smaller the
p-value, the more significant (and useful) the model is.
Variables
The second part of the ANOVA table deals with each individual effect (main effects,
optionally also interactions and square terms). If the p-value for an effect is small, it explains
more of the variations of the response variable than could be expected from random
phenomena. The effect is significant at the 5% level if the p-value is smaller than 0.05. The
smaller the p-value, the more significant the effect is.
There are different ways to calculate sums of squares (SS), however for orthogonal designs
such as factorial designs they all give the same results. For non-orthogonal designs such as
D-optimal and mixture designs, this section tests the so-called Marginal (Type III) SS. This
corrects for the contribution of all other terms in the model irrespective of order, however
the individual contributions may not sum to the Model SS.
Model check
The model check tests whether it is beneficial to add terms of successively higher order to
the model. For orthogonal designs such as factorial designs, the individual contributions of
the terms of a particular order sum to the model check SS. If the p-value for a group of
effects is large it means that these terms do not contribute much to the model and that a
simpler model should be considered.
For D-optimal and mixture designs, the so-called sequential (Type I) SS is given in the Model
check section. Also higher order terms than the ones actually included in the model are
given here when relevant. This section will indicate the optimal complexity of the model
when adding terms in a hierarchical manner (i.e. lower order terms added before higher
order terms). If all tested terms are included in the model, the sum of contributions will
equal the Model SS.
Lack of fit
The lack of fit part tests whether the error in response prediction is mostly due to
experimental variability or to an inadequate shape of the model. If the p-value for lack of fit
is smaller than 0.05, it means that the model does not describe the true shape of the
response surface. In such cases, it may be helpful to apply a transformation to the response
variable.
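The decomposition behind this test can be sketched as follows: replicated center samples supply a pure-error estimate, and the rest of the residual is attributed to lack of fit (illustrative numbers, not the software's output):

```python
# Replicated center responses give a model-independent error estimate
center_y = [16.0, 17.0, 18.0]
mean_c = sum(center_y) / len(center_y)
ss_pure = sum((y - mean_c) ** 2 for y in center_y)   # 2.0
df_pure = len(center_y) - 1                          # 2

# Assumed total residual from the fitted model
ss_residual, df_residual = 6.0, 4

# Whatever pure error cannot explain is lack of fit
ss_lof = ss_residual - ss_pure
df_lof = df_residual - df_pure

# A large F (small p-value) suggests the model shape is inadequate
f_lof = (ss_lof / df_lof) / (ss_pure / df_pure)
print(f_lof)  # 2.0
```

This also shows why lack of fit cannot be tested when all replicated center responses are identical: the pure-error SS would be zero.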
Note:
For screening designs, the model can be saturated. In such cases, one
cannot use the design samples for significance testing; the center samples
or reference samples are used.
If the design has design variables with more than two levels, use the
Multiple Comparison plot and B-coefficient table in order to see which
levels of a given variable differ significantly from each other.
Lack of fit can only be tested if the replicated center samples do not all
have the same response values (which may sometimes happen by
accident).
Diagnostics
This plot presents several values for assessing the quality of the fit of the model to each
individual response.
Standard Order
The standard order is the non-randomized order from the experiment generator.
Actual Value
These are the measured response values as given in the design table.
Predicted Value
This is the fitted response value as calculated from the model.
Compare this value to the actual value; the closer those values are, the better the
fit of the model.
Residual
This is the difference between the actual and the predicted value.
Study all the values; the smaller they are, the better the fit of the model. Note that
this says nothing about the predictive ability of the model when applied to
new samples.
Leverage
The leverage is the distance of the projected samples to the center of the model. A
sample with high leverage is an influential sample or an outlier. Note that for
saturated models, the leverage is 1 for all samples and there is no residual DF to
estimate error in the model.
Student Residual
A studentized residual is the result from the division of a residual by the estimate of
the sample dependent standard deviation of the residual. The presented values are
the so-called internally studentized residuals, meaning that all samples have been
included in the estimation of the standard deviation. This statistic can be used for
detection of outliers. For any reasonably sized experiment (e.g. n>30), 95% of
normally distributed, studentized residuals will fall in the interval [-2, 2].
Cook’s Distance
The Cook’s distance of an observation is a measure of the global influence of this
observation on all the predicted values. This is done by measuring the effect of
deleting this given observation. Data points with large residuals and/or high leverage
may distort the outcome and accuracy of a regression.
The Cook’s distance gives an actual threshold to judge the samples. Points with a
Cook’s distance of 1 or more are considered to be potential outliers.
Run Order
The run order is the (randomized) order of experimentation. There should not be a
run-order dependent trend in the above diagnostic tools.
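Both outlier statistics can be sketched from residuals and leverages using the standard formulas (all values here are made up):

```python
import math

residuals = [0.5, -1.0, 0.8, -0.3]    # raw residuals e_i (assumed)
leverages = [0.60, 0.40, 0.40, 0.60]  # leverages h_i (assumed)
p = 2                                 # parameters in the model
n = len(residuals)

# "Internally" studentized: s is estimated from all samples
s = math.sqrt(sum(e * e for e in residuals) / (n - p))
studentized = [e / (s * math.sqrt(1 - h))
               for e, h in zip(residuals, leverages)]

# Cook's distance combines residual size and leverage;
# values of 1 or more flag potential outliers
cooks = [r * r * h / (p * (1 - h))
         for r, h in zip(studentized, leverages)]
print([round(r, 2) for r in studentized])
print([round(d, 2) for d in cooks])
```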
Diagnostics
Effect visualization
This plot displays one effect at a time for a given response. To change the displayed effect
and the response click on the arrows or on one of the cells of the “Summary of the
effects” table.
It is useful to study the magnitude of the effects (change in the response value when the
design variable increases from Low to High) and the interactions.
There are two types of effects that can be visualized.
Main Effects
The plot shows the average response value for a specific response variable at the
Low and High levels of the design variable. If there are center samples, the average
response value for the center samples is also displayed. It is useful to study the
magnitude of the main effect (change in the response value when the design
variable increases from Low to High). If there are center samples, one can also
detect a curvature visually. For category variables with more than two levels, the
average response value for each category level is given.
Main effects with curvature
Interaction effects
The plot shows the average change in response values for a design variable
depending on the level of the other variable in a two-factor interaction. One line is
given for the Low level of the second design variable, and one line is given for the
High level of the second design variable.
It is possible to study the magnitude of the interaction effect (1/2 * change in the
effect of the first design variable when the second design variable changes from Low
to High).
For a positive interaction, the slope of the effect for “High” is larger than for
“Low”;
For a negative interaction, the slope of the effect for “High” is smaller than
for “Low”;
For no interaction the curves are parallel.
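The arithmetic for a 2² design can be sketched as follows (hypothetical responses at coded levels):

```python
# (A, B, response) for the four runs of a 2^2 factorial design
runs = [(-1, -1, 10.0), (1, -1, 14.0), (-1, 1, 12.0), (1, 1, 22.0)]

def mean(values):
    return sum(values) / len(values)

# Main effect of A: average at High A minus average at Low A
effect_a = (mean([y for a, _, y in runs if a == 1])
            - mean([y for a, _, y in runs if a == -1]))

# Effect of A at each level of B
a_at_low_b = 14.0 - 10.0     # 4.0
a_at_high_b = 22.0 - 12.0    # 10.0

# Interaction = half the change in A's effect when B goes Low -> High;
# positive here, so the "High B" line is the steeper one
interaction_ab = (a_at_high_b - a_at_low_b) / 2
print(effect_a, interaction_ab)  # 7.0 3.0
```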
Effect summary
This table plot gives an overview of the significance of all effects for all responses. There are
three values per effect and per response:
Significance: This coded value indicates if the effect is significant for the specific
response. The significance level is also reflected by the color of the row. See the
Significance levels and associated codes table below.
Effect value: This is the value of the effect for the specific response variable.
p-value: Result of the test of significance for the effect.
Use the arrows to navigate from one response variable to another, or click on the
response variable to be plotted in the Regression coefficient table.
B-coefficient table
This table presents the value of the B-coefficient for the associated design variables as well
as B0.
It also gives the 95% confidence interval for the B-coefficients. These values give an idea of
the accuracy of the estimate of the coefficients.
The p- and t-values are computed to test the null hypothesis, H0: the coefficient is equal to
0. Rejection of this hypothesis for a variable means that the variable is important for
describing the response in question. By comparing the t-value with its theoretical
distribution (Student’s T-distribution), the significance level of the studied effect is obtained.
The associated p-value represents the significance of the effect associated with the B-
coefficient. H0 can be rejected if the p-value is smaller than, say 5% (green color). This
implies that the effect in question is important for modelling the response.
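The underlying computation is the usual t-test; a sketch with assumed numbers (2.306 is the tabulated two-sided 5% critical value for 8 residual DF):

```python
b, se_b = 4.1, 1.2     # coefficient and its standard error (assumed values)
t_value = b / se_b     # test statistic for H0: coefficient = 0

# Compare |t| with the Student's t critical value for the residual DF
t_crit_5pct_8df = 2.306
significant = abs(t_value) > t_crit_5pct_8df
print(round(t_value, 2), significant)  # 3.42 True
```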
B-coefficient table
Effect visualization
This plot is shown for all designs except mixture designs. For more information on this plot,
check the ANOVA overview section.
Effect summary
For more information on this plot, check the ANOVA overview section
Residuals overview
These plots can be used to check the adequacy of the model or look for outliers, provided
that there are ample residual degrees of freedom left to study the residuals. If the model is
close to saturated, i.e. the number of effects is almost as high as the number of
observations, artificially structured residuals will result that cannot be interpreted properly.
The presence of an outlier is shown in the example below. The outlying sample has a much
larger residual than the others; however, it does not seem to disturb the model to a large
extent.
A simple outlier has a large residual
The figure below shows the case of an influential outlier: not only does it have a large
residual, it also attracts the whole model so that the remaining residuals show a very clear
trend. Such samples should usually be excluded from the analysis, unless there is an error in
the data table that can be corrected.
An influential outlier changes the structure of the residuals
Small residuals (compared to the variance of Y) which are randomly distributed indicate
adequate models.
Histogram of Y-residuals
This plot shows the distribution of the residuals, optionally with a statistics table displayed.
Histogram of Y-residuals
A symmetric bell-shaped histogram which is evenly distributed around zero indicates that
the normality assumption is likely to be true. This is the case in the above plot. Moderate
departures from normality are usually acceptable. Change the resolution of the histogram by
toggling the number of bars in the toolbar.
ANOVA table
For more information check the ANOVA overview section
Diagnostics
For more information check the ANOVA overview section
B-coefficients
This plot node is available for all designs except designs with categorical design variables
with three levels or more and for mixture designs.
B-coefficient table
For more information on this plot, look at the section Effect and B-coefficient overview
Effect visualization
This plot node is available for all designs except designs with categorical design variables
with three levels or more and for mixture designs.
Effect visualization
For more information check the DoE overview section
Effect summary
For more information check the DoE overview section
Cube plot
This plot is available for all factorial designs (incl. Plackett-Burman). It displays the average of
a specified response variable at the experimental points.
Cube plot
The plot is most useful when there are two or three design variables. If there are more than
three design variables it is possible to choose which cube to represent using the arrows for
X, Y and Z.
Error table
The error table is a summary of the quality parameters available for the analysis of design
data. See ANOVA table for a description of the individual terms.
Error table
Response surface
There are two types of response surface (RS) plots. A square response surface is given for
non-mixture designs and a triangular response surface is given for mixture designs.
The response surface can also be rotated and viewed in 3D from any angle using the mouse:
Rotated response surface plot
Different representations of the response surface can be seen by selecting the toolbar
options Mesh, Floor Contour or Surface Contour.
Response surface right click options
The following options are available from the right click menu in a response surface plot.
From the DOE menu all available analysis plots can be accessed.
Click View to switch between Graphical or Numerical view (also accessible from the toolbar),
or to toggle the colorbar (Legend) on or off.
Copy a bitmap representation to the clipboard for pasting into other applications, or Save
Plot using either of the formats JPEG, PNG, BMP, PNM or TIFF.
The Auto Scale option available from the right click or toolbar menu will return to a default
size 2D-plot.
The following Properties can be tuned from the plot properties dialog:
Appearance
Plot Font
Bold: Toggle bold font for title, axis, colorbar and tooltip text on and off.
Italic: Toggle italic font for title, axis, colorbar and tooltip text on and off.
Name: Switch between font families Arial, Courier and Times for title, axis,
colorbar and tooltip text.
Size: Set font size as a relative number. The plotting library automatically
attempts to find the best font size for different text. You may increase or
decrease the size of all plot text within the range of 0.1 (very small) and 4.0
(very large).
To set the level of the non-plotted variables enter the value manually in the column
Current. By default this value is the average value.
For mixture designs the levels of the components cannot vary independently of each
other, as the mixture constraint imposes that all components must always sum to the
Mixture Sum. Therefore, if a non-plotted variable is tuned, the axes and Max
levels of the plotted variables are updated accordingly. A minimum Max value
corresponding to 3.5% of the total range is enforced for plotted mixture
components.
For mixture designs there is an additional column with Freeze check-boxes. This is
useful for designs with 5 components or more. If the current level of a non-plotted
mixture variable is increased until the plotted variable axes cannot be reduced any
more, the levels of other non-plotted components will be reduced instead. If freeze
is checked for a non-plotted variable, its current value cannot be changed due to a
change in other variables.
For category variables select one of the levels using the drop-down list.
Response variables
Only one response variable can be plotted at a time. Select the response to plot by
ticking the variable of interest.
Optimization constraints for response variables can be set using the sliders or by
entering the values manually in the Min and Max columns. Setting optimization
constraints for multiple responses simultaneously is a very useful tool for finding the
optimal design settings.
Response surface table
Multiple comparison
This node is given for non-saturated designs with at least one category variable. It shows
whether the distance between levels is larger than a critical distance, in which case the
levels are considered to belong to different groups. Because the critical distance is calculated
from the data, residual degrees of freedom are required for these plots to be displayed.
one categorical level and the other levels, the average response values are plotted in
different groups along the X-axis.
Multiple Comparisons
The average response value is displayed as a red square and its value can be read on
the vertical axis or by mouse-over.
The levels are grouped along the horizontal axis by significantly different groups.
The names of the different levels can be seen by mouse-over.
Levels that are not significantly different are linked by blue vertical bars. Each
vertical bar is the size of half the critical distance. Two levels have significantly
different average response values if they are not linked by any bar.
The critical distance is indicated in the x-axis title.
Group table
The group table shows the levels associated with the different groups. This table takes the
value 1 if the level is part of the specified group and 0 if not. One level can be associated
with several groups.
Group table
Distance table
This table shows, for a specific response variable and a specific category variable, the
distance between the average values of each pair of levels.
Distance table
B-coefficient table
For more information look at the description in the B-coefficients section. If one of the
categorical variables has three levels or more, an Effect visualization is plotted instead of the
B-coefficient table.
8.8.3 Available plots for Partial Least Squares Regression (DoE PLS)
When PLSR is performed on designed data, all the regular PLSR plots are available. The DoE
PLS model in addition has some plots useful for DOE purposes.
Overview
Explained Variance
This is the total explained variance plot for models of an increasing number of components.
Use the toolbar buttons to switch between X-/Y-variance, calibration/validation variance and
Turn on Plot Statistics (using the View menu) to check the slope and offset, RMSEP/RMSEC
and R-squared. Generally all the y-variables should be studied and give good results.
Note: Before interpreting the plot, check whether the plots are displaying
Calibration or Validation results (or both).
Menu option Window - Identification tells whether the plots are displaying Calibration (if
Ordinate is yPredCal) or Validation (yPredVal) results.
Use the buttons to switch Calibration and Validation results off or on.
It is also useful to show the regression line using the icon, and compare it with the
The figures below show two different situations: one indicating a good fit, the other a poor
fit of the model.
Predicted vs. Reference shows how well the model fits
In the above plot, sample 3 does not follow the regression line whereas all the other
samples do. Sample 3 may be an outlier.
How to detect nonlinearity
In other cases, there may be a nonlinear relationship between the X- and Y-variables, so that
the predictions do not have the same level of accuracy over the whole range of variation of
Y. In such cases, the plot may look like the one shown below. Such nonlinearities should be
corrected if possible (for instance by a suitable transformation), because otherwise there
will be a systematic bias in the predictions depending on the range of the sample.
Predicted vs. Reference shows a nonlinear relationship
Predictors (X) projected in roughly the same direction from the center as a response,
are positively linked to that response. In the example below, predictors sweet, red
and color have a positive link with response Pref.
Predictors projected in the opposite direction have a negative relationship with that
response, as does the predictor thick in the example below.
Predictors projected close to the center, as bitter in the example below, are not well
represented in that plot and cannot be interpreted.
Maturity has a negative effect on the adhesiveness of the cheese; they are anti-correlated.
The amount of dry matter positively affects stickiness, and negatively affects glossiness
and meltiness. Glossiness and meltiness, two responses, are correlated.
Caution! If the X-variables have been standardized, one should also standardize the
Y-variable so that the X- and Y-loadings have the same scale; otherwise the plot
may be difficult to interpret.
The plot shows the importance of the different variables for the two components specified.
Correlation Loadings of process variables (X) and the quality of the cheese (Y) along
(factor 1, factor 2)
Variables close to each other in the loadings plot will have a high positive correlation if the
two components explain a large portion of the variance of X. The same is true for variables in
the same quadrant lying close to a straight line through the origin. Variables in diagonally
opposed quadrants will have a tendency to be negatively correlated. For example, in the
figure above, variables dry matter and stickiness have a high positive correlation on factor 1
and factor 2, and they are negatively correlated to variables meltiness and glossiness.
Variables adhesiveness and stickiness have independent variations. Variables addition of
recycled dry matter and pH are very close to the center; they are not well described by
factor 1 and factor 2.
Note: Variables lying close to the center are poorly explained by the plotted factors
(or PCs). They cannot be interpreted in that plot.
8.10. Bibliography
R. C. Bose and K. Kishen, On the problem of confounding in the general symmetrical factorial
design, Sankhya, 5, 21, (1940).
J.A. Cornell, Experiments with Mixtures: Designs, Models, and the Analysis of Mixture Data,
Second edition, John Wiley and Sons, New York, 1990.
G.H. Golub and C.F. Van Loan, Matrix Computations, Third edition, Johns Hopkins University
Press, 1996.
R.W. Kennard and L.A. Stone, Computer Aided Design of Experiments, Technometrics, 11(1),
137-148, (1969).
G.A. Lewis, D. Mathieu, and R. Phan-Tan-Luu, Pharmaceutical Experimental Design, Marcel
Dekker, Inc., New York, 1999.
D.C. Montgomery, Design and Analysis of Experiments, Sixth edition, John Wiley & Sons,
New York, 2004.
R.H. Myers and D.C. Montgomery, Response Surface Methodology: Process and Product
Optimization using Designed Experiments, Second edition, Wiley, New York, 2002.
T. Naes and T. Isaksson, Selection of Samples for Calibration in Near-Infrared Spectroscopy.
Part I: General Principles Illustrated by Example, Appl. Spectrosc., 43(2), 328-335, (1989).
N.-K. Nguyen and G.F. Piepel, Computer-Generated Experimental Designs for Irregular-
Shaped Regions, QTQM, 2(2), 147-160, (2005).
R.E.A.C. Paley, On orthogonal matrices, J. Math. Phys., 12, 311–320, (1933).
R.L. Plackett and J.P. Burman, The Design of Optimum Multifactorial Experiments,
Biometrika, 33, 305-325, (1946).
H. Scheffé, Experiments with Mixtures, J. Roy. Stat. Soc. Ser. B, 20, 344-366, (1958).
9. Validation
9.1. Validation
Model validation is performed for PCA or regression models to estimate how useful the
model will be for future observations. It estimates the predictive ability of the model, as
opposed to the model’s fit to the training data.
Theory
Dialog usage: Validation tab
Dialog usage: Cross validation setup
Cross validation
Though the objective is to have enough samples to put a reasonable amount aside as a test
set, this is not always possible due, for example, to the cost of samples or reference testing.
The best alternative to an independent test set for validation is to apply cross validation.
With cross validation, the same samples are used both for model estimation and testing. A
few samples are left out from the calibration data set and the model is calibrated on the
remaining data points. Then the values for the left-out samples are predicted and the
prediction residuals are computed. The process is repeated with another subset of the
calibration set, and so on until every object has been left out once; then all prediction
residuals are combined to compute the validation residual variance and RMSEP. It is of
utmost importance to be aware of which level of cross validation is to be validated. For
example, if one physical sample is measured three times, and the objective is to
establish a model across samples, the three replicates must be held out in the same cross
validation segment. If the objective is to validate the repeated measurement, keep out one
replicate for all samples and generate three cross validation segments. The calibration
variance is always the same; it is the validation curve that is the important figure of merit
(and the RMSECV for regression models).
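The procedure described above can be sketched in a few lines. This is a minimal illustration, not the software's implementation: a simple univariate least-squares fit stands in for the actual multivariate model, and the data and segment assignments are hypothetical.

```python
import math

def fit_line(xs, ys):
    # Ordinary least squares for y = a + b*x, a stand-in for the
    # real multivariate model being validated.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def rmsecv(xs, ys, segments):
    # Segmented cross validation: each segment is left out once, the
    # model is refit on the remaining samples, the left-out samples
    # are predicted, and all prediction residuals are pooled.
    sq = []
    for seg in segments:
        train = [i for i in range(len(xs)) if i not in seg]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq += [(ys[i] - (a + b * xs[i])) ** 2 for i in seg]
    return math.sqrt(sum(sq) / len(sq))

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

# Full cross validation: one sample per segment.
full_cv = rmsecv(xs, ys, [[i] for i in range(len(xs))])
# Segmented: replicates (here, hypothetical pairs) held out together.
seg_cv = rmsecv(xs, ys, [[0, 1], [2, 3], [4, 5]])
```

Note how the choice of segments encodes the validation level discussed above: pooling replicates into one segment validates across physical samples, not across repeated measurements.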
Several versions of the cross validation approach can be used:
Full cross validation
leaves out only one sample at a time; it is the original version of the method;
Segmented cross validation
leaves out a whole group of samples at a time. A typical example is when there are
systematic replicated measurements of one physical sample;
Test-set switch
divides the global data set into two subsets, each of which will be used alternatively
as calibration set and as test set;
Category variable
enables the user to validate across levels of category variables. This is useful for
evaluating how robust the model is across season, raw material supplier, location,
operator, etc.
When running a cross validation, one can get prediction diagnostics for the cross validation
segments. These are not available when full cross validation is used. This option provides
information on the validation results for each cross validation segment, including RMSEP,
SEP, bias, slope, offset and correlation. The CV prediction diagnostics are added as a matrix
in the Validation folder of the PLSR model.
Leverage correction
Leverage correction is an approximation to cross validation that enables prediction residuals
to be estimated without actually performing any prediction. It is based on an equation that
is valid for MLR, but is only an approximation for PLSR and PCR.
According to this equation, the prediction residual equals the calibration residual divided
by one minus the sample leverage: f_i = e_i / (1 - h_i).
Modern computers offer the possibility to perform cross validation for most data sets
without much computation time, making leverage correction more of a relic of the old days.
The reason why such cases are difficult is that there is too little information for estimation
of a model and each sample is “unique”. Therefore all known validation methods are doomed to fail.
For MLR, leverage correction is strictly equivalent to (and much faster than) full cross
validation.
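A minimal sketch of the idea, assuming the standard leverage-correction relation f_i = e_i / (1 - h_i) (the residual and leverage values below are hypothetical):

```python
def leverage_corrected_residuals(cal_residuals, leverages):
    # Estimate prediction residuals from calibration residuals without
    # refitting any model: f_i = e_i / (1 - h_i), where h_i is the
    # leverage of sample i (exact for MLR, approximate for PCR/PLSR).
    return [e / (1.0 - h) for e, h in zip(cal_residuals, leverages)]

# A high-leverage sample gets its residual inflated the most.
corrected = leverage_corrected_residuals([0.1, 0.1, 0.1], [0.1, 0.5, 0.9])
```

The inflation reflects that a high-leverage sample pulls the fit toward itself, so its calibration residual understates the error that would be seen in prediction.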
Dr. Harald Martens has (re-)developed a generic method for uncertainty testing, which gives
a safer interpretation of models. The concept for uncertainty testing is based on cross
validation, jack-knifing and stability plots. This section introduces how the Uncertainty Test
works and shows how it can be used in The Unscrambler® through an application.
The following sections will present the method with a non-mathematical approach.
How does the uncertainty test work?
The test works with PLSR or PCA models with cross validation, choosing full cross validation
or segmented cross validation as is appropriate for the data. When the optimal number of
components (factors) for PLSR has been chosen, tick Uncertainty test on the Validation tab
of The Unscrambler® modeling dialog box.
Under cross validation, a number of submodels are created. These submodels are based on
all the samples that were not kept out in the cross validation segment. For every submodel,
a set of model parameters (B-coefficients, loadings and loading weights) is calculated.
Variations over these submodels will be estimated so as to assess the stability of the results.
In addition a total model is generated, based on all the samples. This is the model that will
be used for interpretation.
Stability plots
The results of all these calculations can also be visualized as stability plots in scores, loadings,
and loading weights plots. Stability plots can be used to understand the influence of specific
samples and variables on the model, and explain for example why a variable with a large
regression coefficient is not significant. This will be illustrated in the example that follows
(see Application Example).
See tutorial M to learn how to use the Uncertainty Test results in practice.
Therefore, the individual models m=1,2,…,M may be rotated, e.g. towards a common model:
After rotation, the rotated parameters T(m) and [P’,Q’](m) may be compared to the
corresponding parameters from the common model T and [P’,Q’]. The perturbations may
then be written as (T(m) - T)g and ([P’,Q’](m) - [P’,Q’])g for the scores and the loadings,
respectively, where g is a scaling factor (here: g=1).
In the implemented code, an orthogonal Procrustes rotation is used. The same rotation
principle is also applied for the loading weights, W, where a separate rotation matrix is
computed for W. The uncertainty estimates for P, Q and W are estimated in the same
manner as for B below.
Significance testing
When the variances for B, P, Q, and W have been estimated, they can be utilized to find
significant parameters.
As a rough significance test, a Student’s t-test is performed for each element in B relative to
the square root of its estimated uncertainty variance S²B, giving the significance level for
each parameter. In addition to the significance for B, which gives the overall significance for
a specific number of components, the significance levels for Q are useful to find in which
components the Y-variables are modeled with statistical relevance.
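A rough sketch of this kind of significance screen. It assumes a common jackknife variance convention and a normal approximation to Student's t for the p-value; the coefficient values below are hypothetical, not taken from any real model.

```python
import math

def jackknife_significance(b_full, b_sub):
    # b_full: a coefficient from the total model (all samples);
    # b_sub:  the same coefficient from each cross validation submodel.
    m = len(b_sub)
    # Jackknife-style uncertainty variance from the perturbations of the
    # submodels relative to the total model (one common convention).
    s2 = sum((bm - b_full) ** 2 for bm in b_sub) * (m - 1) / m
    t = abs(b_full) / math.sqrt(s2)
    # Two-sided p-value from a normal approximation to Student's t
    # (adequate as a rough screen when there are many segments).
    p = math.erfc(t / math.sqrt(2.0))
    return t, p

# Stable across submodels -> small uncertainty -> significant.
t_stable, p_stable = jackknife_significance(2.0, [1.9, 2.1, 2.0, 1.95, 2.05])
# Wildly varying across submodels -> large uncertainty -> not significant.
t_noisy, p_noisy = jackknife_significance(0.5, [2.0, -1.5, 0.1, 1.2, -0.9])
```

This illustrates the point made by the stability plots: a coefficient can be large in the total model yet non-significant if it swings strongly between submodels.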
Significance testing
The Uncertainty Test option can be used to estimate the significance of variables, when
using cross validation. During cross validation, the differences between the model
parameters for all samples and the parameters of the submodel for each cross validation
segment are squared and summed. The significance (p-value) is estimated by a t-test with
the model parameter and its standard deviation as input. For PCA, the p-values for loadings
per variable and component are returned. For PLS regression, p-values are returned for
x-loadings, loading weights, y-loadings and regression coefficients.
This is referred to as Martens’ Uncertainty Test.
Use the Matrix drop-down list to select the test set, or use the Rows and Columns selector
drop-down lists to define a test set within a selected matrix for both X and Y.
Discard residuals
By discarding residuals, the matrices
X-Residuals
X-Validated Residuals
Y-Residuals
Y-Validated Residuals
are removed from the Validation folder in the analysis. These are 3-dimensional matrices
and use up a lot of memory. As an indication of the reduced size when enabling Discard
Residuals, a PLS regression model with 400 samples, 100 x-variables, 1 y-variable and 10
factors will only take up 10% of the full model size. As the number of samples, X- and
Y-variables and factors increases, the reduced-size model will be even smaller as a
percentage of the full model.
Note: When the residuals are discarded, some of the plot options will not be available. All
plots where the data are taken from the X-Residuals or Y-Residuals matrices will not be
listed in the plot menus. The Plot - Residuals submenus now only allow Residuals and
Influence (with Q-residuals), and under Plot - Residuals - General only the Influence Plot
and Variance per Sample plots are available.
Plots available in the Residuals menu when Discard Residuals is selected
Results - Regression
Display the PLSR Overview results. From here additional results plots can be
accessed from the menu.
Results - All
Display results for any analysis.
Category variable
Allows for model cross validation by removing samples belonging to defined
categories as a group. This is useful for evaluating how robust the model is across
season, raw material supplier, location, operator etc.
10. Transform
10.1. Transformations
This section covers transformations available in The Unscrambler®. Transformation (often
referred to as preprocessing) is applied to data to reduce or remove effects which do not
carry relevant information for the modeling of the system. Transformations can reduce the
complexity of a model (fewer factors needed) and improve the interpretability of the data
and models. They include the application of derivatives to spectral data to reduce baseline
offset and tilt effects while accentuating small spectral differences. Scattering corrections
are often applied to diffuse reflectance spectra to reduce differences such as light scatter
and path length. These transforms can only be performed on numerical data, and some of
them cannot be performed when there are missing data (e.g. the Norris-Gap derivative).
The Unscrambler® provides the following transformations:
Baseline correction
Center_and_scale
Compute general
COW
Deresolve
Derivatives
Detrending
MSC/EMSC
Interaction & Square Effects
Missing_value_imputation
Noise
Normalize
OSC
Quantile_Normalize
Reduce and average
Smoothing
Spectroscopic transformations
SNV
Transpose
Weights
Interpolation
More details regarding transformation methods available in The Unscrambler® are given in
the Method References.
How it works
How to use it
The correction is x_corrected = x - min(X), where x is a variable and X denotes all selected
variables for this sample.
For each sample, the value of the lowest point in the spectrum is subtracted from all the
variables. The result of this is that the minimum value is set as 0 and the rest are positive
values. To use this consistently for a set of samples, make sure that the lowest point pertains
to the same variable for all samples.
Linear baseline correction
This transformation transforms a sloped baseline into a horizontal baseline. The technique is
to point out two variables which should define the new baseline. These are both defined as
0, and the rest of the variables are transformed according to this with linear
interpolation/extrapolation. It is important to take precautions not to select basis variables
that have spectroscopic bands. As for the offset correction, make sure that the lowest points
pertain to the same variables for all samples.
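Both corrections can be sketched in a few lines. This is a minimal illustration with hypothetical numbers; the software applies the corrections row-wise to the whole selected scope.

```python
def baseline_offset(spectrum):
    # Subtract the sample's lowest point so the minimum becomes 0
    # and all remaining values are non-negative.
    m = min(spectrum)
    return [v - m for v in spectrum]

def linear_baseline(spectrum, i, j):
    # Set variables i and j to 0 and remove the straight line through
    # them from every point by linear interpolation/extrapolation.
    slope = (spectrum[j] - spectrum[i]) / (j - i)
    return [v - (spectrum[i] + slope * (k - i))
            for k, v in enumerate(spectrum)]

tilted = [1.0, 2.1, 4.0, 3.1, 4.0]
flat = linear_baseline(tilted, 0, 4)  # endpoints define the new baseline
```

After the linear correction the two chosen basis variables are exactly zero, which is why they should be baseline points rather than spectroscopic bands.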
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined. This transform requires that only numerical
data be chosen.
After the range has been selected, select the method of the baseline transformation. A
method must be selected in order to carry out the transform. If Linear baseline correction is
selected, the two variables which define the new baseline must also be defined (Baseline
end variables). The first and last variables are selected by default. The first and last values
must be different for the transform to be performed. By checking the Preview result box,
one can see the outcome of the data after the baseline transformation has been applied.
When the baseline transformation is completed, a new matrix is created in the project with
the word Baseline appended to the original matrix name. This name may be changed by
selecting the matrix, right clicking and selecting Rename from the menu.
Method options
Choose between two baseline transforms:
Baseline offset
The value of the lowest point in the spectrum is subtracted from all the variables.
Linear baseline correction
Transform a sloped baseline into a horizontal baseline.
Do not select basis variables that have spectroscopic bands.
For the offset correction in both methods, make sure that the lowest points pertain to the
same variables for all samples.
How it works
How to use it
The range is the difference between the highest and lowest observation for each variable.
Such scaling results in a range of one for all variables. The presence of outliers in the data
will heavily influence this transformation, however. A safer alternative would be to use the
IQR, which is the difference between the observations at the 25th and 75th percentiles.
(There are several different ways of calculating the IQR; The Unscrambler® utilizes the
‘Type 7’ algorithm of Hyndman and Fan, 1996.) As extreme observations are not included in
the IQR estimate, it is less likely to be affected by outliers.
The MAD is defined as the median of absolute differences between each observation in the
column and the median observation. This measure of population spread is little affected by
the tail behaviour of the distribution. For instance if a histogram of the data reveals a ‘wide’
peak where many observations fall in the tails, the standard deviation will be grossly inflated
while the MAD will remain a good estimate for the population’s spread. The MAD will
similarly be more robust for data with sharp peaks and long tails. The Scaled MAD is the
MAD multiplied by the factor 1.4826. This makes the estimate similar to the standard
deviation when many observations are collected from a normal distribution.
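A minimal sketch of the Scaled MAD and the corresponding robust column scaling, using Python's standard statistics module (the data values are hypothetical):

```python
import statistics

def scaled_mad(values):
    # Median absolute deviation scaled by 1.4826, so the estimate is
    # comparable to the standard deviation for normally distributed data.
    med = statistics.median(values)
    return 1.4826 * statistics.median(abs(v - med) for v in values)

def robust_scale(column):
    # Median centering with Scaled MAD scaling: a robust alternative to
    # mean centering with standard-deviation scaling (autoscaling).
    med = statistics.median(column)
    s = scaled_mad(column)
    return [(v - med) / s for v in column]

# One wild outlier barely moves the Scaled MAD, unlike the stdev.
spread = scaled_mad([1.0, 2.0, 3.0, 4.0, 100.0])
```

Replacing 100.0 with 5.0 leaves the Scaled MAD unchanged, which is exactly the robustness property described above.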
Centering and/or scaling data may be useful to study the data in various plots, or prior to
running Tasks – Analyze – Descriptive Statistics. It may for example allow one to compare
the distributions of variables of different scales within one plot. In subsequent analysis,
these scaled variables will contribute similarly to the model regardless of measurement unit.
These transformations are all column-oriented: the transformed values are computed as a
function of the values in the same column of the data table.
Notes: 1. Mean centering is included as a default option in the relevant analysis
dialogs, and the computations are done as a first stage of the analysis. Scaling using
the standard deviation may be applied in the Weights tabs of most analysis dialogs.
2. Centering and scaling are also available as a transformation to be performed
manually from the Editor (Tasks – Transform – Center_and_scale). Use this dialog
to perform one of the available non-parametric centering and scaling options.
A special type of standardization is the Spherize function (Martinez and Martinez, 2005). It is
the multivariate equivalent of the univariate scaling methods described above. The
transformed variables have a p-dimensional mean of 0 and a covariance matrix given by the
identity matrix. It is also known in some application domains as the whitening
transformation since the resulting matrix has the signal properties of “white noise”.
More details regarding center and scale methods are given in the Method References.
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . The rows and columns to be included in the computation must be
specified as well. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
In the Transformation frame, three options are available:
Center
within the selected sample and variable scope. This subtracts a value, e.g. the
variable mean, from each observation in each column. There is an option to center
by the mean, median, or minimum value, or not use any centering. Choose the
desired option for centering from the Center drop-down list.
Dialog showing centering options
Scale
within the selected sample and variable scope. This divides each data value by an
estimate of the column spread. Options available are the Standard deviation
(SDev), Interquartile range (IQR), Range, or Scaled median absolute deviation (MAD)
scaling, or not to use any scaling. Choose the desired option for scaling from the
Scale drop-down list as shown below.
Dialog showing scaling options
Spherize
This is a multivariate equivalent of univariate center and scaling, useful in
exploratory data analysis.
The Center and Scaling options can be selected either separately or in combination. Often
mean centering is combined with SDev scaling (autoscaling). Due to their non-parametric
nature, the Range, IQR, or Scaled MAD transformations are often used after median centering.
The type of centering and scaling is selected from the drop-down list.
By checking the Preview result box, a line plot of the observations before and after scaling is
displayed.
Notes: 1. To display the mean and standard deviation of the variables in a data set,
use menu option Tasks – Analyze- Descriptive Statistics. 2. The Center and Scale
transformations are supported in autopretreatments, meaning they can be
automatically applied when new data are analysed (classification, prediction and
sample projection analyses), using a model which was developed with this
transformation applied. See next note. 3. The principal component analysis (PCA)
and Regression dialog boxes include options for centering and scaling variables
directly at the analysis stage. It is recommended to perform centering and scaling at
the model-building stage, especially if the model will be used for future prediction
or classification. The same centering and scaling options will be applied as when the
model was built. 4. Centering and/or scaling the data more than once will not affect
the structure of the data any further. Consequently, if the Center and Scale
transformation has been applied to the data from the Tasks – Transform – Center
and Scale dialog, the data may harmlessly be recentered and/or rescaled at the
modeling stage (PCA or regression).
How it works
How to use it
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . For the matrix, the rows and columns to be included in the computation
are then selected.
If new data ranges need to be defined, choose Define to open the Define Range dialog
where new ranges can be defined. One must also define if the selection is for the variables
or samples.
There are three ways of defining the mathematical expression to be applied:
Use the drop-down list, which provides the most recently used expressions (if this is
the first time using the Compute_General dialog, no formerly used expressions will
show in the drop-down list).
Click on the Build Expression button. This opens the Build Expression dialog wherein
a mathematical expression can be defined using the ready-made functions and
operators allowed in The Unscrambler®.
Syntax
The Expression field accepts a formula of the type: X=LN(ABS(X))-e or S4=(S1*S2)+S3 or
V1=V1/2+SIN(V8/V9) where S stands for sample, V stands for variable, and the number is
the sample or variable number in the Editor. To build general expressions that are not
related to a particular sample or variable, use X. X stands for the whole matrix defined by the
variable and sample set chosen in Scope. RH and CH are row and column headers,
respectively.
Note: The formula cannot contain mixed references to samples (S), variables (V)
and X.
+ Addition
- Subtraction
* Multiplication
/ Division
= Equals to
( Left Parenthesis
) Right Parenthesis
EXP(X) Exponential(X) = e^X
Name Description
COS(X) Cosine
SIN(X) Sine
TAN(X) Tangent
PI 3.14
e 2.718
”X” can denote both samples and variables in this table.
Function names are case insensitive, meaning that log, Log, and LOG will give the same
result. In the above functions a comma is used as the list separator; however, this depends
on the regional settings of the computer. Different list separators may be valid for
different countries, e.g. POW(X;n).
Notes: A commonly used expression is X=log(X). This expression generally
transforms skewed variable distributions into more symmetrical ones. Use a
histogram plot or Tasks – Analyze – Descriptive Statistics… in order to check
whether the skewness was improved or deteriorated after applying the
transformation.
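A sketch of what such an expression does to the data, assuming LOG denotes the base-10 logarithm (LN being the natural logarithm in the function list above); the matrix values are hypothetical:

```python
import math

def compute_log(matrix):
    # Cell-wise equivalent of the Compute General expression X = LOG(X),
    # applied to the whole selected scope (all values must be positive).
    return [[math.log10(v) for v in row] for row in matrix]

# Values spanning several orders of magnitude become evenly spaced,
# which is how the transform reduces right-skewness.
skewed = [[1.0, 10.0], [100.0, 1000.0]]
symmetric = compute_log(skewed)
```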
The upper text field shows the expression as it is being built. In Display, choose whether the
text field should show the sample/variable Numbers or the sample/variable Names. In the
Insert field, choose to insert specific samples, specific variables or (general expression). After
choosing the Sample or the Variable options, the drop-down list is enabled and one can
select the relevant object(s) from the list. The available samples or variables are only those
belonging to the Scope formerly selected in the Compute dialog.
The Arithmetic Functions, Trigonometric Functions, Other Functions, and Numbers fields
offer buttons that are used following the same principle as for a calculator.
Click Clear to clear the expression. Click Undo to undo the latest insertion in the expression
text. Click OK to return to the Compute_General dialog.
10.5. COW
10.5.1 Correlation Optimized Warping (COW)
COW is a method for aligning data where the signals exhibit shifts in their position along the
x axis. COW cannot be performed with non-numeric data, or when there are missing data.
How it works
How to use it
COW corrects for shifts along the axis of measurement (such as chromatography retention
times, chemical shifts in NMR data, and Raman spectral x-axis alignment).
COW cannot be performed with non-numeric data, or when there are missing data. The
minimum number of variables required to use COW is 20.
COW Dialog
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
Three inputs must be specified in the dialog:
Reference Sample: Select which sample in the data table is to act as the reference
profile.
This is a typical sample (e.g. near the origin in a scores plot) with preferably the main
peaks present. If the COW will be applied to new data at some later point of time,
include the reference sample in a new data table as well.
Segment Size: The length of the segments into which the data are divided before
searching for the optimal correlation. It must be smaller than the number of
variables divided by 4.
Slack: The allowed change in segment position to be searched for; its value must be
<= Segment Size.
By selecting the preview result, one can see how the transformed data will look.
COW dialog with preview
When the COW transformation is completed, a new matrix is created in the project with the
word COW appended to the original matrix name. This name may be changed by selecting
the matrix, right clicking and selecting Rename from the menu.
10.6. Deresolve
10.6.1 Deresolve
The Deresolve function can be used to change the apparent resolution of an instrument,
changing a high resolution spectrum to low resolution. It may also be used for noise
reduction.
How it works
How to use it
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined. There must be at least 4 variables to
perform the deresolve transformation.
In the Parameters field, choose the number of channels to use for convolution. The
minimum number of channels that can be used is 2, and the maximum is (#variables/2).
By selecting the preview result, one can see how the transformed data will look.
When the deresolve transformation is completed, a new matrix is created in the project with
the word Deresolve appended to the original matrix name. This name may be changed by
selecting the matrix, right clicking and selecting Rename from the menu.
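The idea of lowering apparent resolution by convolution can be sketched as follows. This is a minimal moving-average illustration with hypothetical data; the actual convolution kernel used by the software may differ.

```python
def deresolve(spectrum, channels):
    # Lower the apparent resolution by replacing each point with the
    # average over a window of neighboring channels (truncated at the
    # edges so the output has the same length as the input).
    half = channels // 2
    out = []
    for i in range(len(spectrum)):
        lo, hi = max(0, i - half), min(len(spectrum), i + half + 1)
        out.append(sum(spectrum[lo:hi]) / (hi - lo))
    return out

# A sharp, isolated peak is broadened and lowered.
smoothed = deresolve([0.0, 0.0, 4.0, 0.0, 0.0], 3)
```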
10.7. Derivatives
10.7.1 Derivatives
Differentiation, i.e. computing derivatives of various orders, is a classical technique widely
used for spectroscopic applications. Some of the information “hidden” in a spectrum may be
more easily revealed when working on a first or second derivative. It is a row-oriented
transformation; that is to say the contents of a cell are likely to be influenced by its
horizontal neighbors.
Derivatives cannot be performed with non-numeric data or where there are missing data.
Like smoothing, this transformation is relevant for variables which are themselves a function
of some underlying variable, e.g. absorbance at various wavelengths. Computing a derivative
is also called differentiation. Derivatives can help to resolve overlapped bands, but also lead
to a lower signal in the transformed data.
The segment parameter of Gap-Segment derivatives is an interval over which data values are
averaged.
In smoothing, X-values are averaged over one segment symmetrically surrounding a data
point. The raw value on this point is replaced by the average over the segment, thus creating
a smoothing effect.
In Gap-Segment derivatives (designed by Karl Norris), X-values are averaged separately over
one segment on each side of the data point. The two segments are separated by a gap. The
raw value on this point is replaced by the difference of the two averages, thus creating an
estimate of the derivative on this point.
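The averaging-and-difference idea above can be sketched as follows. This is a simplified illustration; the exact averaging and scaling conventions of the Norris implementation may differ:

```python
import numpy as np

def gap_segment_first_derivative(x, gap, segment):
    """First-derivative estimate at each point: the difference between
    the mean of one segment on each side of the point, the two segments
    being separated by the gap. Edge points where a full segment does
    not fit are left as NaN. Simplified sketch only.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    half_gap = gap // 2
    d = np.full(n, np.nan)
    for i in range(n):
        lo = i - half_gap - segment   # left segment: x[lo : lo+segment]
        hi = i + half_gap + 1         # right segment: x[hi : hi+segment]
        if lo >= 0 and hi + segment <= n:
            left = x[lo:lo + segment].mean()
            right = x[hi:hi + segment].mean()
            d[i] = right - left
    return d

# On a straight line the estimate is constant (proportional to the slope)
line = np.arange(10, dtype=float)
print(gap_segment_first_derivative(line, gap=1, segment=2))
```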
The Unscrambler® offers three methods for computing derivatives, as described in the
following sections:
Gap Derivatives
Gap-Segment
Savitzky-Golay
Mathematically, a derivative is the slope of the curve. A purely additive offset (as in the
curves above) contributes a constant term, and differentiation reduces a constant to zero.
All spectra should therefore have a mean of zero, and the spectral profiles are changed to
the slopes of the curves.
The next figure displays the first order derivative for the Gaussian curves.
First derivative of Gaussian curves
The zero point can be explained by the fact that at a peak maximum (or minimum), the
derivative is zero.
In complex spectra, there may be many zero points and while it is adequate to transform a
purely linear offset with a first derivative, interpretation of zero points becomes difficult.
The second derivative may be useful in this instance.
Another important feature of the second derivative is that the intensities of the original
curves can be seen in the second derivatives in order of intensity. This is an extremely useful
property, especially when performing quantitative analyses such as regression analysis.
Third and fourth derivatives
Third and fourth derivatives are available in The Unscrambler® although they are not as
popular as first and second derivatives. They may reveal phenomena which do not appear
clearly when using lower-order derivatives and can be helpful in understanding the spectral
data. Prudent use of the fourth derivative has been shown to emphasize small variations
caused by temperature changes and compositional changes. Higher-order derivatives do
significantly reduce the signal in the transformed data.
Savitzky-Golay vs. Gap-Segment
The Savitzky-Golay method and the Gap-Segment method use information from a localized
segment of the spectrum to calculate the derivative at a particular wavelength rather than
the difference between adjacent data points. In most cases, this avoids the problem of noise
enhancement from the simple difference method and may actually apply some smoothing to
the data.
The Gap-Segment method requires gap size and smoothing segment size (usually measured
in wavelength span, but sometimes in terms of data points). The Savitzky-Golay method uses
a convolution function, and thus the number of data points (segment) in the function must
be specified. If the segment is too small, the result may be no better than using the simple
difference method. If it is too large, the derivative will not represent the local behavior of
the spectrum (especially in the case of Gap-Segment), and it will smooth out too much of the
important information (especially in the case of Savitzky-Golay). Although there have been
many studies done on the appropriate size of the spectral segment to use, a good general
rule is to use a sufficient number of points to cover the full width at half height of the largest
absorbing band in the spectrum. One can also find optimum segment sizes by checking
model accuracy and robustness under different segment size settings.
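A minimal sketch of the Savitzky-Golay derivative idea: fit a polynomial to each local segment of points and evaluate its derivative at the centre. The function name and edge handling below are illustrative, not the software's exact implementation:

```python
import numpy as np

def savgol_derivative(y, left, right, polyorder, deriv=1):
    """Savitzky-Golay derivative sketch: fit a polynomial of the given
    order to each window of (left + right + 1) points and evaluate its
    derivative at the centre point. Edge points without a full window
    are set to zero.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    out = np.zeros(n)
    offsets = np.arange(-left, right + 1, dtype=float)
    for i in range(left, n - right):
        coeffs = np.polyfit(offsets, y[i - left:i + right + 1], polyorder)
        # np.polyder gives the derivative polynomial; evaluate at offset 0
        out[i] = np.polyval(np.polyder(np.poly1d(coeffs), deriv), 0.0)
    return out

# First derivative of y = t**2 is 2t (exact for polyorder >= 2)
t = np.arange(11, dtype=float)
d = savgol_derivative(t**2, left=2, right=2, polyorder=2, deriv=1)
```

Trying different window sizes (left/right points) on real spectra reproduces the trade-off described above: too few points leave the noise in, too many smooth away local features.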
Example:
Using data from a FT-NIR spectrometer, the next figure shows what happens when the
selected segment size is too small (Savitzky-Golay derivative, 3 points segment and second
order of polynomial). Noisy features remain in the spectra when the segment size is too
small.
Derivatized data with a segment size set too small
In the figure that follows, the selected segment size is too large: (Savitzky-Golay derivative,
31 points segment and second order of polynomial). One can see that some relevant
information has been smoothed out.
Derivatized data with a segment size set too large
The main disadvantage of using derivative preprocessing is that the resulting spectra can be
difficult to interpret. However, this can also be advantageous, especially when a user is
looking for both specificity and selectivity of particular constituents in complex sample
matrices.
More details regarding Derivative transforms are given in the Method References.
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined. This derivative requires that the data all be
numeric and that there are at least five variables for each sample.
In the Parameters field, choose the Derivative order, i.e. whether to compute the first,
second, third, or the fourth derivative of the samples, from the drop-down list. Then, select
the required Gap size (width of the interval between the two values used for
differentiation). The gap size should be less than or equal to (Number of Variables -
Derivative Order - 1)/Derivative Order
By selecting the preview result, one can see how the preprocessed data will look.
When the Gap derivative transformation is completed, a new matrix is created in the project
with the word Gap Derivative appended to the original matrix name. This name may be
changed by selecting the matrix, right clicking and selecting Rename from the menu.
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
In the Parameters field, choose the Derivative order, i.e. whether to compute the first,
second, third, or the fourth derivative of the samples, from the drop-down list. Then, select
the required Gap size and Segment size. The segment size + gap size should be less than or
equal to (number of variables)/(derivative order + 1).
By selecting the Preview result, one can see a preview of what the derivative data will look
like with the chosen parameter settings.
Note:
- The segment size must be an odd number for second or fourth derivative.
- The gap size must be an odd number for first or third derivative.
Make the appropriate choices in the Savitzky-Golay Derivatives dialog by first selecting the
sample and variable sets that define the matrix to be transformed by a derivative in the
Scope field. Begin by choosing the data matrix from the drop-down list. This transform can
also be performed on a results matrix, which may be selected by clicking on the select result
matrix button . For the matrix, the rows and columns to be included in the
computation are then selected. If new data ranges need to be defined, choose Define to
open the Define Range dialog where new ranges can be defined. This derivative requires
that the data all be numeric.
In the Parameters field, choose the Derivative order, i.e. the first, second, third, or the fourth
derivative of the samples, from the drop-down list. The derivative order must be less than or
equal to polynomial order. Then select the Polynomial order, i.e. the order of the polynomial
to be fitted. A polynomial order of 2 means that a second-degree equation will be used to fit
the data points. A higher number means a more flexible polynomial, i.e. a more precise
differentiation. The polynomial order must be less than or equal to the sum of left and right
side points.
One may then select the smoothing points. Note that a larger range will give a smoother
shape to the sample, but may result in a loss of valuable information. Choose the number of
left side points and right side points. From this the total number of smoothing points is
calculated (# left + # right + 1). The number of smoothing points must be less than number
of variables.
By selecting the Preview result, one can see a preview of the data before the transform and
what the derivative data will look like with the chosen parameter settings.
Note that, after the operation is completed, the data will be slightly truncated at both ends.
If p is the number of left side points and q the number of right side points in the smoothing
segment, the first p and the last q variables in the smoothed variable set will be set to zero.
This is because there are not enough points to the left (resp. right) of these variables to
compute the smoothing function.
When the Savitzky-Golay derivative transformation is completed, a new matrix is created in
the project with the word SGolay appended to the original matrix name. This name may be
changed by selecting the matrix, right clicking and selecting Rename from the menu.
10.8. Detrend
10.8.1 Detrending
Detrending is a transformation which seeks to remove nonlinear trends in spectroscopic
data.
How it works
How to use it
where A, B, C (and D, E) are the regression coefficients. The light blue expression within the
brackets is used if a third or fourth degree polynomial fit is considered. The base curve in the
above relationship is given by the fitted values ŷSNV,i and thus derived spectral values
subjected to SNV followed by DT become:
This calculation removes baseline shift and curvature which may be found in diffuse
reflectance NIR data of powders, particularly if they are densely packed. The use of this
transform does not change the shape of the data, as can happen when derivatives are
applied.
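The baseline-removal step can be sketched as a per-row polynomial fit and subtraction. This is a sketch of the basic idea only; the SNV step that the software applies before detrending is omitted here for brevity:

```python
import numpy as np

def detrend(X, order):
    """Remove a fitted polynomial baseline of the given order (1-4)
    from each row. Illustration of detrending; the preceding SNV
    step used by the software is not included.
    """
    X = np.asarray(X, dtype=float)
    n_vars = X.shape[1]
    if not 1 <= order < n_vars:
        raise ValueError("order must be >= 1 and < number of variables")
    t = np.arange(n_vars, dtype=float)
    out = np.empty_like(X)
    for i, row in enumerate(X):
        baseline = np.polyval(np.polyfit(t, row, order), t)
        out[i] = row - baseline
    return out

# A pure linear trend is removed completely by a first-order detrend
trend = np.arange(8, dtype=float).reshape(1, -1)
print(detrend(trend, 1))
```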
Example
The spectroscopic data shown hereafter display a clear nonlinear trend.
NIR Diffuse reflectance spectra of cellulose.
There is a nonlinear trend in the data, roughly indicated by the dashed, red curve (right).
The four plots hereafter show the same data after Detrending was applied with varying
polynomial orders.
NIR diffuse reflectance spectra of cellulose: the same spectra after Detrending with
polynomial order 1 to 4.
Begin by defining the data matrix from the drop-down list. For the matrix, the rows and
columns to be included in the computation are then selected. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
In the Parameters frame, select the Polynomial order (1 to 4) to apply to the data. The
polynomial order must be less than the number of variables selected to perform detrending.
By selecting the Preview result, one can see a preview of what the preprocessed data will
look like with the chosen parameter settings.
Detrending dialog with preview of results
When the detrending transformation is completed, a new matrix is created in the project
with the word Detrend appended to the original matrix name. This name may be changed by
selecting the matrix, right clicking and selecting Rename from the menu.
10.9. EMSC
10.9.1 MSC/EMSC
Multiplicative Scatter Correction (MSC) is a transformation method used to compensate for
additive and/or multiplicative effects in spectral data. Extended Multiplicative Scatter
Correction (EMSC) works in a similar way; in addition, it allows for compensation of
wavelength-dependent spectral effects.
How it works
How to use it
The idea behind MSC is that the two effects, amplification (multiplicative, scattering) and
offset (additive, chemical), should be removed from the data table to prevent them from
dominating the information (signal) in the data table.
The correction is done by two simple transformations. Two correction coefficients, a and b,
are calculated from a reference (usually the average spectrum in the data set) and used in
these computations, as represented graphically below:
Multiplicative (left) and additive (right) scatter effects:
The correction coefficients are computed from a regression of each individual spectrum onto
the average spectrum. Coefficient a is the intercept (offset) of the regression line, coefficient
b is the slope. As the MSC preprocessing uses the mean spectrum for the data set, its
success depends on how well the calculated mean spectrum resembles the true mean
spectrum, which in turn requires a sufficiently large sample set.
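The regression-and-correction step can be sketched directly from the description above: regress each spectrum on the reference, then remove the fitted offset a and slope b. A minimal sketch of the full-MSC case, assuming the mean spectrum as reference:

```python
import numpy as np

def msc(X, reference=None):
    """Multiplicative Scatter Correction sketch: regress each spectrum
    on the reference (default: the mean spectrum), then correct it as
    (spectrum - a) / b, where a is the intercept and b the slope.
    """
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, float)
    corrected = np.empty_like(X)
    for i, spectrum in enumerate(X):
        b, a = np.polyfit(ref, spectrum, 1)   # spectrum ~ a + b * ref
        corrected[i] = (spectrum - a) / b
    return corrected

# Spectra that are pure offset-and-scale copies of each other collapse
# onto the same corrected curve
base = np.array([1.0, 2.0, 4.0, 2.0, 1.0])
X = np.vstack([2.0 * base + 0.5, 0.5 * base - 0.2])
print(msc(X))
```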
EMSC
EMSC is an extension to conventional MSC, which is not limited to only removing
multiplicative and additive effects from spectra. This extended version allows a separation
of physical light scattering effects from chemical light absorbance effects in spectra.
In EMSC, new parameters h, d and e are introduced to account for physical and chemical
phenomena that affect the measured spectra. Parameters d and e are wavelength specific,
and used to compensate regions where such unwanted effects are present. EMSC can make
estimates of these parameters, but the best result is obtained by providing prior knowledge
in the form of spectra that are assumed to be relevant for one or more of the underlying
constituents within the spectra and spectra containing undesired effects. The parameter h is
estimated on the basis of a reference spectrum representative for the data set, either
provided by the user or calculated as the average of all spectra. Spectra of the pure
components known to be present in the data set can be used as Good Spectra in the EMSC
calculation, while spectra which represent the unwanted scatter effects can be used as Bad
Spectra.
More details regarding MSC/EMSC transforms are given in the Method References.
In the Multiplicative Scatter Correction dialog select the Sample (Rows) and variable (Cols)
sets that define the matrix to correct in the Scope field. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined. The minimum number of variables required
to perform this transformation is 2. If a valid MSC or EMSC model exists, check the Use
existing MSC or EMSC model box to transform the current data in exactly the same
way as was done for an earlier data matrix. This is useful if different data matrices should be
treated in the same way, e.g. new prediction samples. From the drop-down list one can
choose the model.
If test samples are to be used, check the Enable test samples box, and enter the numbers for
the rows holding those samples. At least two samples must be left for the transformation.
Variables can be omitted from the MSC/EMSC transform by checking the Enable omit
variables box, and entering the column numbers in the space provided. At least two
variables must be left to perform the transformation.
The default choice is to compute and use a new MSC or EMSC model which must then be
defined on the Options tab. One must then decide whether to make a full MSC model,
common offset (additive effects) model, or common amplification (multiplicative effects)
model in the Function field. In addition to regular MSC, one can also activate EMSC by
clicking the check box Extended options. Three extra options are now available, indicating
the available options for spectral information, channel weights and squared channel weights
used in EMSC.
Multiplicative Scatter Correction options field
When EMSC is enabled, the user must decide which effects to include. The options channel
number and squared channel number model physical effects related to wavelength-
dependent light scatter variations. Chemical effects are included in the squared spectrum.
For all three options, one can choose Not used from the drop-down list, and the effect will
not be included in the transformation. If Model only is selected, the effect will be included to
calculate EMSC parameters. By choosing Model & subtract, the effect will not only be
included, but the effect will also be subtracted from the EMSC corrected spectra. When the
extended options are chosen, two additional tabs appear on the Dialog: Spectral Info, and
Channel Weights.
The Enable Reference Spectrum field allows one to select a single spectrum from the data
acting as a typical spectrum without any additional effects. If not selected, a reference will
be calculated using the mean of all spectra. In the Enable Good Spectra and the Enable Bad
Spectra fields, one can specify several spectra from a data table that are defined as good and
bad representatives of the spectral data, respectively. Spectra of the pure components
known to be present in the data set can be used as Good Spectra. Spectra which represent
the unwanted scatter effects can be used as Bad Spectra. If the Good Spectra and the Bad
Spectra have been selected, one may also enter a subtraction weight for the respective
spectra. These subtraction weights are multiplied with the good and the bad spectra, and
the results are subtracted from the corrected spectra.
It should also be noted that the background spectra available for selection in the Enable
Reference Spectrum must have the same number of variables as the spectra to be
transformed, though they may reside in a different data matrix. It is also recommended that
the background spectrum selected be from different samples than those in the selected
scope of the data table. Overlap between reference, good and bad spectra is not allowed. A
warning message will appear if this happens.
The last tab is for setting the Channel Weights, and is available only when using EMSC. Here,
one can choose to select different weighting of the variables. It is also possible to iteratively
find better weights than the default choice, by entering a number in the Reweightings field.
The number of reweightings to be used must be between 0 and 5. The EMSC will then be run
iteratively this number of times to find improved weights.
The options for weightings are:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows the weighting of selected variables by predefined constant values.
Downweight
This allows the multiplication of selected variables by a very small number, such that
the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
By selecting the Advanced tab one can apply weights from an existing matrix by selecting a
row in a data matrix.
MSC/EMSC results
When the EMSC or MSC transformation is completed, a new matrix is created in the project
with the word MSC or EMSC appended to the original matrix name. This name may be
changed by selecting the matrix, right clicking and selecting Rename from the menu. The
results of the transform also include a model, which is an additional node in the project
navigator with several matrices for the results. The model name is MSC (or EMSC) prefixed
to the matrix from which the model was developed. The MSCMeanVar matrix gives
complete data values for the data matrix for the MSC transform. For an EMSC model the
matrix Reference Spectrum has the details on the transform.
Example
Consider a data table that consists of several spectra measured on different mixtures of two
chemical compounds where the amount of each of the two substances is varying.
The reference spectrum for the transformation can be a spectrum measured on a mixture
where the two compounds are equally represented.
Good spectra would then be spectra measured on each compound alone.
The bad spectra could then be selected as spectra believed to contain additional effects, not
caused by the chemicals.
How it works
How to use it
the a * b interaction term and can provide more meaningful interpretations of the
regression coefficients for a and b. Whether the data are centered or not, the
regression coefficient for a * b will be the same. The coefficients for a and b will
differ depending on which method is used.
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined. This transform can only be applied to
numeric data.
The dialog contains two lists: Available Effects to the left and Selected Effects to the right.
The former lists all available effects with their full names.
Select the combinations to include in the transform and press the right arrow button to
include them in the right list under Selected Effects.
To Add All, use the double right arrow button.
Use the left arrow or double left arrow buttons to remove effects from the right-most list.
The transform is applied to the data as given in the matrix. One can choose to perform the
transformation on centered and scaled data by checking the box Rescale Interactions and
square effects. The interaction level can be chosen from the drop-down list next to
Interaction level.
When the Interaction and Square effects transformation is completed, a new matrix is
created in the project with the abbreviation InS appended to the original matrix name. This
name may be changed by selecting the matrix, right clicking and selecting Rename from the
menu.
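The effect-generation step above can be sketched as follows: for each pair of selected columns, an interaction column (a*b) and the square terms (a*a, b*b) are appended to the matrix. The function and column naming are illustrative only:

```python
import numpy as np

def add_interactions_and_squares(X, names):
    """Append interaction (a*b) and square (a*a, b*b) columns to a
    design matrix. Sketch of the idea; the software's own column
    naming and ordering may differ.
    """
    X = np.asarray(X, dtype=float)
    cols = [X]
    labels = list(names)
    n = X.shape[1]
    for i in range(n):
        for j in range(i, n):          # j == i gives the square terms
            cols.append((X[:, i] * X[:, j])[:, None])
            labels.append(f"{names[i]}*{names[j]}")
    return np.hstack(cols), labels

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
extended, labels = add_interactions_and_squares(X, ["a", "b"])
print(labels)   # ['a', 'b', 'a*a', 'a*b', 'b*b']
```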
10.11. Interpolate
10.11.1 Interpolation
This transformation operates by computing piecewise smooth cubic curves and allowing the
computation of values at any intermediate points.
How it works
How to use it
In the Interpolation dialog, select the Matrix. You can choose a specific sample and variable
set within the matrix. If new data ranges need to be defined, choose Define to open the
Define Range dialog where new ranges can be defined.
If the data has numeric headers, the start and step values are detected based on the first
two headers. This is suitable for spectral data with continuous and regular intervals. In the
exceptional case where the intervals are not regular, the header values may be used as the
original scale.
The target scale to which the interpolation is to be performed needs to be specified by
entering the start and step values. The number of columns of data can also be chosen. The
maximum number of columns is restricted to three times the original number of
columns.
The interpolated data is added as a new node in the project tree.
Note: interpolation can also be performed on data without actual wavelengths or
wavenumbers by specifying arbitrary units. For instance if the data consisted of 10 columns,
one could specify the inputs as follows to reverse the columns.
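Re-sampling each row onto a new start/step scale can be sketched as below. Note the software uses piecewise cubic curves; `np.interp` is piecewise linear and is used here only as a simplified stand-in for the idea:

```python
import numpy as np

def interpolate_rows(X, old_scale, new_scale):
    """Re-sample each row from its original axis onto a new axis.
    Linear interpolation sketch; the software's actual transform
    uses piecewise cubic curves.
    """
    X = np.asarray(X, dtype=float)
    return np.array([np.interp(new_scale, old_scale, row) for row in X])

# Halve the step: 1100-1108 nm in steps of 2 becomes steps of 1
# (the wavelength values are made up for illustration)
old = np.arange(1100.0, 1110.0, 2.0)   # 5 points
new = np.arange(1100.0, 1109.0, 1.0)   # 9 points
X = np.array([[0.0, 1.0, 0.0, 1.0, 0.0]])
print(interpolate_rows(X, old, new))
```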
How it works
How to use it
Although some of the analysis methods (PCA, PCR, PLS, MCR) available in The Unscrambler®
can cope with a reasonable amount of missing values, there are still multiple advantages in
filling empty cells with estimated values:
In the Fill missing values dialog choose the data matrix from the drop- down menu. This
transform can also be performed on a results matrix, which may be selected by clicking on
the select result matrix button . For the matrix, the rows and columns to be
included in the computation are then selected. If new data ranges need to be defined,
choose Define to open the Define Range dialog where new ranges can be defined.
Fill Missing cannot be applied if one or more rows have more missing data than non-missing.
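The idea of filling empty cells with estimated values can be sketched with a generic column-mean imputation. This is an illustration only; the software's own estimator is not described here and may be more sophisticated (e.g. based on the PCA-type algorithms mentioned above):

```python
import numpy as np

def fill_missing_with_column_means(X):
    """Replace NaN cells by the mean of the non-missing values in the
    same column. Generic illustration, not the software's algorithm.
    """
    X = np.asarray(X, dtype=float).copy()
    # Refuse rows where missing values outnumber the observed ones,
    # mirroring the restriction stated above
    if (np.isnan(X).sum(axis=1) > X.shape[1] / 2).any():
        raise ValueError("a row has more missing than non-missing data")
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

X = np.array([[1.0, np.nan, 3.0],
              [2.0, 4.0, np.nan],
              [3.0, 6.0, 9.0]])
print(fill_missing_with_column_means(X))
```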
10.13. Noise
10.13.1 Noise
This transformation operates by adding additive or multiplicative noise to variables, which
can be helpful to see how this affects the model.
How it works
How to use it
In the Noise dialog, select the Matrix, and then the sample and variable sets that are to be
processed. This transform can also be performed on a results matrix, which may be selected
by clicking on the select result matrix button . If new data ranges need to be
defined, choose Define to open the Define Range dialog where new ranges can be defined.
In the Parameters field, specify the level of proportional noise (e.g. 5%) and the standard
deviation of the additive noise to be added to the data.
Noise on a variable is said to be additive when its size is independent of the level of the data
value. The range of additive noise is the same for small data values as for larger data values.
The additive noise must be greater than or equal to 0.
Noise on a variable is said to be proportional when its size depends on the level of the data
value. The range of proportional noise is a percentage of the original data values. The
designated value for proportional noise must be between 0 and 100.
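The two noise types can be sketched as follows, assuming Gaussian noise for both (the actual noise distribution used by the software is not specified here):

```python
import numpy as np

def add_noise(X, proportional_pct=0.0, additive_sd=0.0, seed=0):
    """Add proportional noise (a percentage of each data value) and
    additive noise (a constant standard deviation, independent of the
    data level) to a matrix. Gaussian noise is an assumption made for
    this sketch.
    """
    if not 0.0 <= proportional_pct <= 100.0:
        raise ValueError("proportional noise must be between 0 and 100")
    if additive_sd < 0.0:
        raise ValueError("additive noise must be >= 0")
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    proportional = X * (proportional_pct / 100.0) * rng.standard_normal(X.shape)
    additive = additive_sd * rng.standard_normal(X.shape)
    return X + proportional + additive

X = np.ones((3, 4))
noisy = add_noise(X, proportional_pct=5.0, additive_sd=0.01)
```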
By selecting the preview result, one can see how the transformed data will look.
Noise dialog with preview
When the noise transformation is completed, a new matrix is created in the project with the
word Noise appended to the original matrix name. This name may be changed by selecting
the matrix, right clicking and selecting Rename from the menu.
10.14. Normalize
10.14.1 Normalization
Normalization is used to “scale” samples in order to get all data on approximately the same
scale.
The following normalization methods are available in The Unscrambler®:
Area normalization;
Unit vector normalization;
Mean normalization;
Maximum normalization;
Range normalization;
Peak normalization.
How it works
How to use it
Area normalization;
Unit vector normalization;
Mean normalization;
Maximum normalization;
Range normalization;
Peak normalization.
Area normalization
This transformation normalizes an observation (i.e. spectrum, chromatogram) Xi by
calculating the area under the curve for the observation. It attempts to correct the
transmission spectra for indeterminate path length when there is no way of measuring it, or
isolating a band of a constant constituent or of an internal standard.
It is equivalent to replacing the original variables by a profile centered around 1; only the
relative values of the variables are used to describe the sample, and the information carried
by their absolute level is dropped. This is indicated in the specific case where all variables are
measured in the same unit, and their values are assumed to be proportional to a factor
which cannot be directly taken into account in the analysis.
For instance, this transformation is used in chromatography to express the results in the
same units for all samples, no matter which volume was used for each of them.
Caution! This transformation is not relevant if all values of the curve do not have
the same sign. It was originally designed for positive values only, but can easily be
applied to all-negative values through division by the absolute value of the average
instead of the raw average. Thus the original sign is kept.
Property of mean-normalized samples
The area under the curve becomes the same for all samples.
Maximum normalization
This is an alternative to classical normalization which divides each row by its maximum
absolute value instead of the average.
Caution! The relevance of this transformation is doubtful if all values of the curve
do not have the same sign.
Property of maximum-normalized samples
The maximum absolute value becomes 1 for all samples.
Range normalization
Here each row is divided by its range, i.e. “max value – min value”.
Property of range-normalized samples
The curve span becomes 1.
Peak normalization
This transformation normalizes a sample Xi by the value at a chosen kth data point; the
same point must be used for both the training set and the “unknowns” in prediction.
It attempts to correct spectra for indeterminate path length. Since the chosen spectral point
(usually the maximum peak of a band of the constant constituent or internal standard, or
the isosbestic point) is assumed to be concentration invariant in all samples, an increase or
decrease of the point intensity can be assumed to be entirely due to an increase or decrease
in the sample path length. Therefore, by normalizing the spectrum to the intensity of the
peak, the path length variation is effectively removed.
For peak normalization, the maximum allowed Peak variable equals the total number of variables.
Property of peak-normalized samples
All transformed samples take value 1 at the chosen constant point, as shown in the figures
below.
Raw UV-Vis spectra
Caution! One potential problem with this method is that it is extremely susceptible
to baseline offset, slope effects and wavelength shift in the spectrum.
The method requires that the samples have an isosbestic point, or have a constant
concentration constituent and that an isolated spectral band can be identified which is solely
due to that constituent.
More details regarding normalization methods are given in the Method References.
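The row-wise normalization methods described above can be sketched in one helper. The exact scaling constants used by the software (e.g. whether area is taken as a plain or absolute sum) are assumptions of this sketch:

```python
import numpy as np

def normalize(X, method, peak=None):
    """Row-wise normalization sketches for the listed methods.
    `peak` is a 0-based variable index for peak normalization.
    """
    X = np.asarray(X, dtype=float)
    out = np.empty_like(X)
    for i, row in enumerate(X):
        if method == "area":
            out[i] = row / np.abs(row).sum()
        elif method == "unit_vector":
            out[i] = row / np.linalg.norm(row)
        elif method == "mean":
            out[i] = row / row.mean()
        elif method == "maximum":
            out[i] = row / np.abs(row).max()
        elif method == "range":
            out[i] = row / (row.max() - row.min())
        elif method == "peak":
            out[i] = row / row[peak]
        else:
            raise ValueError(f"unknown method: {method}")
    return out

X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0]])
# After peak normalization at the first variable, both rows coincide,
# matching the manual's peak-normalization example
print(normalize(X, "peak", peak=0))
```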
Normalization cannot be carried out with non-numeric data, but can proceed if there are
missing values in the data.
Normalize
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button . For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
Then, select the normalization type in the Type field. The following six normalization
methods are available:
Area normalization;
Unit vector normalization;
Mean normalization;
Maximum normalization;
Range normalization;
Peak normalization.
Area normalization attempts to correct the spectra for indeterminate path length when
there is no way of measuring it, or isolating a band of a constant constituent or an internal
standard. The transformation normalizes a sample Xi by calculating the area under the curve
for the sample (i.e. spectrum, chromatogram).
Result of area normalization on two different samples
Before: 0.3 0.5 1.0 2.5 3.0 2.5 1.0
After: 0.111 0.185 0.370 0.926 1.111 0.926 0.370
Peak normalization normalizes a sample as the ratio of each value by the value at a selected
variable (wavelength, retention time). The chosen point (usually the maximum peak of a
band of the constant constituent, or the isosbestic point) is assumed to be concentration
invariant in all samples.
Peak Normalization
Type in the number of the peak variable in box next to Peak normalization.
By selecting the preview result, one can see how the preprocessed data will look.
Note: If data are peak-normalized before building a model for later use in
prediction or classification, make sure that the same peak variable is selected when
normalizing the prediction samples!
Result of peak normalization on two different samples

Before     After
1 2 3 4    1 2 3 4
2 4 6 8    1 2 3 4
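The ratio operation described above can be sketched in a few lines; `peak_index` is an illustrative name for the chosen peak variable.

```python
import numpy as np

def peak_normalize(X, peak_index):
    """Divide each sample (row) by its value at the selected peak
    variable, assumed concentration-invariant across samples."""
    X = np.asarray(X, dtype=float)
    return X / X[:, [peak_index]]

X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0]])
out = peak_normalize(X, peak_index=0)   # normalize on the first variable
```

Both samples collapse onto the same profile once each is divided by its own value at the peak variable.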
10.15. OSC
10.15.1 Orthogonal Signal Correction (OSC)
OSC can be used as a transformation method for building PLS regression models from
spectral data. It removes extraneous variance from the x data, sometimes making the PLS
model more accurate.
PLS models built on OSC transformed data should be interpreted with great caution.
OSC will make the model fit appear very good, but may not improve predictions on separate
test sets. It is important to hold out some test samples as a final sanity check on the model
and how the OSC has improved it.
The OSC transform removes from the X data the variation that is orthogonal to Y.
Inputs
The inputs are the matrix of predictor variables (X) and predicted variable(s) (Y),
scaled as desired, and the number of OSC components to calculate.
Usually, 1-3 OSC components are sufficient. Optional input variables are the
maximum number of iterations used in attempting to maximize the variance
captured by the orthogonal component, and the tolerance on percent of X-variance
to consider in formation of the final w-vector.
Outputs
The outputs are the OSC corrected X-matrix and the weights, loadings and scores
that were used in making the correction.
Once the OSC model has been made, new (scaled) x data can be corrected from Tasks – Transform – OSC… by selecting a saved OSC model.
More details regarding OSC transforms are given in the Method References.
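The correction can be sketched for a single OSC component. This is a simplified illustration of the general idea (find a large-variance score in X, force it orthogonal to Y, and deflate X by it), not the exact algorithm implemented in the software.

```python
import numpy as np

def osc_one_component(X, y):
    """Simplified one-component OSC sketch: start from the dominant
    principal component score of X, orthogonalize it against y,
    and remove the corresponding variation from X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    # first principal component score of X
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    t = U[:, [0]] * s[0]
    # remove the part of the score that is correlated with y
    t = t - y @ np.linalg.lstsq(y, t, rcond=None)[0]
    # express the orthogonal score through X via a weight vector w
    w = np.linalg.lstsq(X, t, rcond=None)[0]
    t = X @ w
    t = t - y @ np.linalg.lstsq(y, t, rcond=None)[0]  # re-orthogonalize
    p = X.T @ t / float(t.T @ t)                      # loading vector
    return X - t @ p.T, w, p, t

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
y = rng.normal(size=20)
X_osc, w, p, t = osc_one_component(X, y)
```

The removed score is orthogonal to y by construction, so only Y-irrelevant variation is taken out of X; the weights, loadings and scores returned correspond to the outputs listed above.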
Begin by defining the data matrix for the Predictor Variables (X) from the drop-down list.
This transform can also be performed on a results matrix. Choose these matrices by clicking
on the select result matrix button. Next, select the rows and columns to be
included in the computation. If new data ranges need to be defined, choose Define to open
the Define Range dialog where new ranges can be defined. Then proceed to select the
matrix for the Predicted variables (Y).
If a valid OSC model already exists, it can be used for the transformation of a new matrix by
selecting it next to the Use existing OSC Model. The model must have loadings and weights
matrices saved to it.
By selecting the preview result, the effect of the OSC transformed data can be visualized.
Weights tabs
In the X- or Y-Weights dialog, choose the data matrix from the drop-down list. This
transform can also be performed on a results matrix, which may be selected by clicking on
the more button. For the matrix, the rows and columns to be included in the
computation are then selected (containing only numeric data). If new data ranges need to
be defined, choose Define to open the Define Range dialog where new ranges can be
defined.
Then, select the variables that the weighting will be applied to; all variables can be
selected by selecting one variable, and then clicking the All button under the variable
selection window. The selection can also be made by typing in the variable numbers and
clicking Select. After making the selection of variables, select the weighting to be used using
the radio buttons in the Select tab. To apply the weighting, click Update, and then OK.
There are four weighting methods available:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows the weighting of selected variables by predefined constant values.
Downweight
This allows the multiplication of selected variables by a very small number, such that
the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
Options tab
On the Options tab, choose the Number of OSC Components. Usually, 1-3 OSC components
are sufficient. Then, select the algorithm to apply from the following:
NIPALS
Non-linear Iterative Partial Least Squares. This algorithm handles missing values and
is suitable for computing only the first few components of a large data set. This
method however accumulates errors that can become large in higher principal
components. Since the NIPALS algorithm is iterative, the maximum number of
iterations can be tuned in the Max iterations box. The default value of 100 should
be sufficient for most data sets, however some large and noisy data may require
more iterations to converge properly. The maximum allowed number of iterations is
30,000.
SVD
Singular Value Decomposition. This algorithm does not handle missing values and is
best suited for small data sets or “tall” or “wide” data. This algorithm produces
higher accuracy results but it is not suited for data sets with a high number of both
samples and variables since the algorithm always computes all components.
The NIPALS algorithm calculates one principal component at a time and it handles missing
values well, whereas the SVD algorithm calculates all of the principal components in one
calculation, but does not handle missing values.
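The iterative nature of NIPALS can be illustrated with a minimal single-component loop. This is a sketch only; the software's implementation, including its missing-value handling, is more elaborate.

```python
import numpy as np

def nipals_component(X, max_iter=100, tol=1e-12):
    """Extract one principal component of a centered matrix X by the
    NIPALS iteration (no missing-value handling in this sketch)."""
    X = np.asarray(X, dtype=float)
    t = X[:, [0]].copy()                    # initial score vector
    for _ in range(max_iter):
        p = X.T @ t / float(t.T @ t)        # loading vector
        p /= np.linalg.norm(p)
        t_new = X @ p                       # updated score vector
        if np.linalg.norm(t_new - t) < tol:
            t = t_new
            break
        t = t_new
    return t, p

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
X = X - X.mean(axis=0)                      # NIPALS assumes centered data
t, p = nipals_component(X)
```

The loop converges to the same dominant component an SVD would return, but one component at a time, which is why NIPALS is preferred when only the first few components of a large data set are needed.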
OSC Options
When the OSC transform has been applied to the data, there will be two new nodes created
in the project navigator: one for the OSC model (and corresponding result matrices that
have been designated to be included in the outputs), and another for the transformed data.
The transformed data matrix will have OSC appended to the original data matrix name.
OSC results in project navigator
distribution. Then, for each observation, the lowest value is replaced with the lowest value
of the reference distribution, the second lowest value is replaced with the second lowest
value of the reference distribution, and so on. The end result is that each transformed row
contains exactly the same data as the reference distribution, however sorted in the order of
the original observations.
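The rank-and-replace procedure just described can be sketched as follows; using the mean of identically ranked values as the default reference corresponds to the 'Mean row' option.

```python
import numpy as np

def quantile_normalize(X, reference=None):
    """Replace each row's sorted values by the reference distribution,
    preserving the row's original rank order. If no reference vector is
    given, the mean of identically ranked values is used ('Mean row')."""
    X = np.asarray(X, dtype=float)
    sorted_rows = np.sort(X, axis=1)
    ref = sorted_rows.mean(axis=0) if reference is None else np.asarray(reference, float)
    ranks = np.argsort(np.argsort(X, axis=1), axis=1)  # rank of each value in its row
    return ref[ranks]

X = np.array([[5.0, 2.0, 3.0],
              [4.0, 1.0, 6.0]])
out = quantile_normalize(X)
```

Each transformed row now contains exactly the reference values, sorted in the order of the original observations, as described above.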
Quantile normalization should be used with caution and only when the reference
distribution can be assumed to be representative for all samples in the data table. It is
particularly dangerous to use if the reference distribution contains more than a single peak,
as data values will be forced to move between neighbouring peaks (disguising differences) if
the cluster sizes vary from one observation to the next.
Missing or non-numeric data are not allowed in QN.
In the Quantile Normalization dialog, select the Matrix to transform, including the relevant
row and columns sets. Data from previous results may be selected by pressing the select
result matrix button. New data ranges may be selected from the Define Range
dialog if Define is pressed.
Three choices of reference distributions are available. The mean or median of identically
ranked data values across observations is estimated by selecting the ‘Mean row’ or ‘Median
row’ radio button, respectively. Alternatively, the ‘Reference vector’ allows you to input
your own choice of reference distribution. Make sure that neither the data nor the reference
vector contains non-numeric or missing values.
Note: Never use quantile normalization unless you have good reason to believe that your observations should be distributed identically.
The Preview result option enables you to compare the data before and after transformation.
Quantile dialog with preview
Increase precision;
Get more stable results;
Reduce noise;
Interpret the results more easily.
In The Unscrambler® this is done from the menu using Tasks – Transform – Reduce
(Average)…
Application example
Improve the precision in sensory assessments by taking the average of the sensory ratings
over all panelists.
Average replicate measurements of the same sample to increase signal to noise.
Reduce the number of variables in spectral data with a very large number of variables to make the data more manageable.
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button. For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
A minimum of two samples and two variables is required to perform this transformation.
Choose whether to Reduce along Variables or Samples in the Reduce (Average) dialog. The number of adjacent samples or variables to be averaged must be given in the
Reduction Factor field, where the value can be changed using the spin box from 2 up to the
number of variables being transformed.
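The averaging of adjacent variables can be sketched as a reshape-and-mean. This illustration assumes the number of variables is divisible by the reduction factor; how the software handles remainders is not shown here.

```python
import numpy as np

def reduce_average(X, factor):
    """Average each group of `factor` adjacent columns (variables).
    Assumes the column count is divisible by the reduction factor."""
    X = np.asarray(X, dtype=float)
    n_groups = X.shape[1] // factor
    return X[:, :n_groups * factor].reshape(X.shape[0], n_groups, factor).mean(axis=2)

X = np.array([[1.0, 3.0, 5.0, 7.0],
              [2.0, 4.0, 6.0, 8.0]])
out = reduce_average(X, factor=2)
```

With a reduction factor of 2, each pair of adjacent variables is replaced by its average, halving the number of columns.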
Note: All defined sets will be adjusted according to the reduction performed.
10.18. Smoothing
10.18.1 Smoothing methods
Smoothing helps reduce the noise in the data without reducing the number of variables. It is a row-oriented transformation; that is to say, the contents of a cell are influenced by its horizontal neighbors.
This transformation is relevant for variables which are themselves a function of some underlying variable, for instance time, or where intrinsic spectral intervals exist. Smoothing cannot be performed with non-numeric data, but can be applied when there are missing data.
In smoothing, X-values are averaged over one segment symmetrically surrounding a data
point. The raw value on this point is replaced by the average over the segment, thus creating
a smoothing effect.
A submenu to the Tasks – Transform – Smoothing menu provides four different methods for
smoothing of data:
Moving Average
first finds a data value by averaging the values within a segment of data points
Savitzky-Golay
finds a data value by making a polynomial to fit the data points using a number of
data points on each side
Median Filter
finds a data value by taking the median within a segment of data points
Gaussian Filter
finds a data value by computing a weighted moving average within a segment of
data points.
More details regarding Smoothing methods are given in the Method References.
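The three simplest methods can be sketched with one helper. The handling of the segment at the ends of the vector is an assumption here (the segment is simply truncated), and the Gaussian weights shown are illustrative.

```python
import numpy as np

def smooth(x, segment=5, method="moving_average"):
    """Replace each point by a statistic of the symmetric segment
    centered on it (sketch of moving average, median and Gaussian
    filters; end points use a truncated segment here)."""
    x = np.asarray(x, dtype=float)
    half = segment // 2
    out = np.empty_like(x)
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        window = x[lo:hi]
        if method == "moving_average":
            out[i] = window.mean()
        elif method == "median":
            out[i] = np.median(window)
        elif method == "gaussian":
            j = np.arange(lo, hi) - i        # distance from the center point
            w = np.exp(-0.5 * (j / max(half, 1)) ** 2)
            out[i] = (w * window).sum() / w.sum()
    return out

noisy = np.array([1.0, 1.2, 5.0, 1.1, 0.9, 1.0, 1.1])
despiked = smooth(noisy, segment=3, method="median")   # removes the spike
```

Note how the Gaussian weights decrease away from the center point, in contrast to the equal weights of the moving average, and how the median filter removes an isolated spike that an average would only dilute.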
As can be seen, points closer to the center have a larger coefficient in the Gaussian filter
than in the moving average, while the opposite is true of points close to the borders of the
segment.
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button. For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
Then, enter the size of the segment to be used for smoothing, i.e. how many adjacent
columns should be used to compute the Gaussian fitted value, in the Parameters field. The
segment size must be less than or equal to the number of variables.
By selecting the Preview result, one can see a preview of what the preprocessed data will
look like with the chosen parameter settings.
Begin by defining the data matrix from the drop-down list. For the matrix, the rows and
columns to be included in the computation are then selected. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
Then in the Parameters field, enter the size of the segment to be smoothed, i.e. how many adjacent columns should be used to compute the median. The segment size must be less than or equal to the number of variables.
By selecting the Preview result, one can see a preview of what the preprocessed data will
look like with the chosen parameter settings.
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button. For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
The size of the segment to be averaged is then entered, i.e. how many adjacent columns
should be used to compute the average value, in the Parameters field. In smoothing, X
values are averaged over one segment symmetrically surrounding a data point. The raw
value on this point is replaced by the average over the segment, thus creating a smoothing
effect. The segment size for smoothing must be less than or equal to the number of variables.
By selecting the Preview result, one can see a preview of what the preprocessed data will
look like with the chosen parameter settings.
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button. For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog.
Three parameters need to be defined to perform a Robust Lowess Smooth:
The number of iterations the calculation should perform to reach convergence;
The Smoothing f factor, which is set between 0 and 1;
The Delta value.
By selecting the Preview result, a preview of what the preprocessed data will look like with
the chosen parameter settings will be displayed. This can also be used to look at the effect of
the transformation in real time.
Savitzky-Golay smoothing cannot be performed with non-numeric data or where there are
missing data.
The minimum number of variables required for Savitzky-Golay smoothing is 3.
Savitzky-Golay Smoothing
In the Savitzky-Golay Smoothing dialog, begin by defining the data matrix to be smoothed from the drop-down list. This transform can also be performed
on a results matrix, which may be selected by clicking on the select result matrix button. For the matrix, the rows and columns to be included in the computation are then
selected. If new data ranges need to be defined, choose Define to open the Define Range
dialog where new ranges can be defined.
The polynomial order is selected in the Parameters field. For instance, a polynomial order of
2 means that a second-degree equation will be used to fit the data points. The polynomial
order must be less than or equal to the sum of left and right side points.
The smoothing points are defined by choosing the number of left side points and right side points separately. The number of smoothing points must be less than the number of variables. The number of smoothing points on the left and right side must be the same if the symmetric kernel box is checked. By unchecking this box, a different number of points may be set for each side (though this is not recommended for spectral data). Note that a larger value for smoothing points will give a smoother shape to the data, but may result in the loss of some information.
By selecting the preview result, one can see what the data look like with given smoothing
settings.
Savitzky-Golay smoothing dialog with preview
Note that, after the smoothing operation is completed, the data will be slightly truncated at
both ends. If p is the number of left side points and q the number of right side points in the
smoothing segment, the first p and the last q variables in the smoothed variable set will be
set to zero. This is because there are not enough points to the left (resp. right) of these
variables to compute the smoothing function.
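A minimal numeric sketch of the procedure, including the end-point truncation described above (the first p and last q output values set to zero):

```python
import numpy as np

def savgol_smooth(x, left=2, right=2, polyorder=2):
    """Savitzky-Golay smoothing sketch: least-squares fit a polynomial
    over each window of left + right + 1 points and keep the fitted
    value at the center. The first `left` and last `right` points are
    set to zero, as the text describes."""
    x = np.asarray(x, dtype=float)
    j = np.arange(-left, right + 1)
    A = np.vander(j, polyorder + 1, increasing=True)   # columns 1, j, j^2, ...
    h = np.linalg.pinv(A)[0]        # coefficients giving the fitted value at j = 0
    out = np.zeros_like(x)
    for i in range(left, len(x) - right):
        out[i] = h @ x[i - left: i + right + 1]
    return out

x = np.arange(10.0) ** 2            # an exact quadratic
y = savgol_smooth(x, left=2, right=2, polyorder=2)
```

Because the data are exactly quadratic and the polynomial order is 2, the interior points are reproduced exactly; a noisy signal would instead be smoothed.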
Select the data matrix with spectra in the Spectroscopic Transformation dialog. This
transform can also be performed on a results matrix, which may be selected by clicking on
the select result matrix button. Then the rows and columns to include must be
selected. If the ranges of interest are not available in the drop-down boxes, choose Define to
open the Define Range dialog where new ranges can be selected.
Choose among the available transformations in the Type frame. Four types of
transformations can be performed:
Absorbance to reflectance, or
Absorbance to transmittance;
Reflectance to absorbance, or
Transmittance to absorbance;
Reflectance to Kubelka-Munk units
Basic ATR Correction.
When Basic ATR Correction is selected, Units and Reference value boxes will be available with a default value of 1000. This is the wave number at which the absorbance transformed ATR spectrum is expected to be the same as an absorbance transformed transmission spectrum of the same sample. Available units are and .
Select Preview result to view the spectra before and after transformation.
Spectroscopic transformation with preview
When the spectroscopic transformation is completed, a new matrix is created in the project
with the word Spectroscopic appended to the original matrix name. This name may be
changed by selecting the matrix, right clicking and selecting Rename from the menu.
Like MSC, the practical result of SNV is that it removes multiplicative interferences of scatter
and particle size effects from spectral data. These transforms for scatter corrections are
typically used with diffuse reflectance data.
An effect of SNV is that on the vertical scale, each spectrum is centered on zero and varies
roughly from –2 to +2. Apart from the different scaling, the result is similar to that of MSC.
The practical difference is that SNV standardizes each spectrum using only the data from
that spectrum; it does not use the mean spectrum of any set. The choice between SNV and
MSC is a matter of taste. Since the MSC normalizes based on the mean spectrum in a data
set, it is best suited for similar sample sets.
More details regarding SNV transform are given in the Method References.
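The per-spectrum standardization can be sketched directly. The sample standard deviation (N - 1 denominator) is used here; whether the software divides by N or N - 1 is not specified in this text.

```python
import numpy as np

def snv(X):
    """Standard Normal Variate: center each spectrum (row) on zero and
    divide by its own standard deviation; no mean spectrum of the set
    is involved, unlike MSC."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)   # sample SD (assumption)
    return (X - mean) / std

X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0]])
out = snv(X)
```

The two rows, which differ only by a multiplicative factor, become identical after SNV, illustrating the removal of multiplicative scatter effects.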
Begin by defining the data matrix from the drop-down list. This transform can also be
performed on a results matrix, which may be selected by clicking on the select result matrix
button. For the matrix, the rows and columns to be included in the computation
are then selected. If new data ranges need to be defined, choose Define to open the Define
Range dialog where new ranges can be defined.
By selecting the preview result, one can see how the preprocessed data will look.
When the SNV transformation is completed, a new matrix is created in the project with the
word SNV appended to the original matrix name. This name may be changed by selecting
the matrix, right clicking and selecting Rename from the menu.
10.21. Transpose
10.21.1 Transposition
Matrix transposition consists of exchanging the rows and columns of the data table.
It is particularly useful if the data have been imported from external files where they were
stored with one row for each variable.
Category variables are automatically split when a table containing such variables is transposed. A transpose cannot be performed on a matrix containing non-numeric data.
Note: All defined sets are also transposed.
Select the data matrix to be transposed by highlighting it, and go to Tasks – Transform – Transpose. Alternatively, one can select the data matrix and right-click to select Transform – Transpose.
When the transpose transformation is completed, a new matrix is created in the project with
the word Transposed appended to the original matrix name. This name may be changed by
selecting the matrix, right clicking and selecting Rename from the menu.
10.23. Weights
10.23.1 Weights
Depending on the kind of information to be extracted from data, it may be necessary to
apply weights to the variables. Often the weights are based on the standard deviation of the
variables, i.e. square root of variance, which expresses the variance in the same unit as the
original variable.
Weighting of spectra may make it more difficult to interpret loadings plots, and one runs the
risk of inflating noise in wavelengths with little information. Thus, spectral data are generally
not weighted, but there are exceptions.
Constant
A/SDev+B
Downweight
Block Weighting
standardization of variables gives an analysis that interprets the variation relative to the
extremes in the data table.
The opposite, no weighting at all, gives an analysis that has a closer relationship to the
individual assessor’s personal extremes, and these are strongly related to their very
subjective experience and background.
It is generally recommended to use standardization for sensory data. This procedure,
however, has an important disadvantage: it may increase the relative influence of unreliable
or noisy attributes (see the Caution in the Weighting Options section).
Weighting: The case of spectroscopic data
Standardization of spectra may make it more difficult to interpret loadings plots, and one
may risk inflating noise in wavelengths with little information. Thus, spectra are generally
not weighted, but there are exceptions.
In the Weights dialog, choose the data matrix from the drop-down list. This transform can
also be performed on a results matrix, which may be selected by clicking on the more button
. For the matrix, the rows and columns to be included in the computation are then
selected (containing only numeric data). If new data ranges need to be defined, choose
Define to open the Define Range dialog where new ranges can be defined.
Then, select the variables that the weighting will be applied to; all variables can be
selected by selecting one variable, and then clicking the All button under the variable
selection window. The selection can also be made by typing in the variable numbers and
clicking Select. After making the selection of variables, select the weighting to be used using
the radio buttons in the Select tab. To apply the weighting, click Update, and then OK.
There are four weighting methods available:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows the weighting of selected variables by predefined constant values.
Downweight
This allows the multiplication of selected variables by a very small number, such that
the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
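The A/(SDev + B) option can be sketched as a per-variable multiplication; with the defaults A = 1, B = 0 it reduces to ordinary 1/SDev standardization.

```python
import numpy as np

def sdev_weights(X, A=1.0, B=0.0):
    """Weight each variable (column) by A / (SDev + B). The sample
    standard deviation (N - 1 denominator) is assumed here."""
    X = np.asarray(X, dtype=float)
    w = A / (X.std(axis=0, ddof=1) + B)
    return X * w, w

X = np.array([[1.0, 10.0],
              [2.0, 30.0],
              [3.0, 50.0]])
Xw, w = sdev_weights(X)
```

After weighting, every column has unit standard deviation, so variables measured on very different scales get the same influence in a model.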
When the weights transformation is completed, a new matrix is created in the project with
the word Weighted appended to the original matrix name. This name may be changed by
selecting the matrix, right clicking and selecting Rename from the menu.
Weighting can also be done when beginning an analysis, if one does not want to transform
the data, but is only concerned with applying weights during the analysis itself. A tab for
weights (or X weights, Y weights, and Z weights) is presented in the option for many
analyses, such as PCA and regression (MLR, PLS, PCR, L-PLS), as well as when doing linear
discriminant analysis or support vector machine classification.
Weights tab within PCA dialog
11. Univariate Statistics
11.1. Descriptive statistics
The Descriptive Statistics option in The Unscrambler® provides some simple and effective
plotting tools for gaining an overview of small to medium sized data sets. The tools in this
menu option are mainly used to confirm observations found in multivariate models.
Theory
Usage
Plot Interpretation
Method reference
Purposes
Parametric statistics
Terminology
The normal distribution
Measures of central tendency
The mean
The median
The mode
Measures of dispersion
Variance
Standard deviation
Range
Degrees of freedom
Skewness and kurtosis
Quartiles
11.2.1 Purposes
The main results to be found by performing Descriptive Statistics are:
Plots of the Mean and Standard Deviation of the chosen variables.
Box plots of the variables.
Scatter Effects plots, used to compare the linearity of data when plotted against the
mean of the data.
Cross-correlation matrix, for investigating variable correlations.
There are no formal statistical tests performed in the Descriptive Statistics module, these
can be found in the Tasks - Analyze - Statistical Tests… menu.
Parametric statistics
By parametric statistics, it is assumed that the samples under investigation come from a population with a known underlying distribution, typically a normal distribution. Parametric statistics are sensitive to the underlying parameters, which in the case of a normal distribution are the mean (μ) and the standard deviation (σ).
Terminology
In the statistical literature, it is common practice to denote parameters, i.e. those measures related to a population, by Greek symbols and to denote statistics, i.e. those measures related to samples, by Roman letters (Miller and Miller, 2005). The following table provides examples of some common parameters and statistics.

           Mean   Variance   Standard deviation
Parameter  μ      σ²         σ
Statistic  x̄      s²         s
The median
Another common measure used in statistics to describe central tendency is the median. The
median is known as a non-parametric or robust statistic. The median is calculated as the
pivot point of a set of ordered observations. For instance, consider the number sequence
below:
1 2 3 4 5
The number of observations is odd. Therefore placing the pivot point under the value 3
balances the data, i.e. two observations on either side. When the number of observations is
even, as in the case below:
1 2 3 4 5 6
the balance point now does not lie on a single number, but midway between the numbers 3
and 4. Therefore the median would in this case be 3.5.
In the first case above, the median was 3 and it can be shown that the mean value is also 3.
Now consider the following set of numbers:
1 2 3 4 50
The median is still 3, while the mean is now much greater than 3. This is why the median is
referred to as a robust statistic, i.e. it is robust to outliers.
The mode
The Mode is defined as the most commonly occurring value in a data set. For example, in the
following set of observations:
1 2 3 3 3 4 5
the mode is 3 as this is the most commonly occurring value.
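The worked examples above can be checked directly with Python's standard library:

```python
from statistics import mean, median, mode

assert median([1, 2, 3, 4, 5]) == 3 and mean([1, 2, 3, 4, 5]) == 3
assert median([1, 2, 3, 4, 5, 6]) == 3.5        # even count: midway value

outliers = [1, 2, 3, 4, 50]
assert median(outliers) == 3                    # the median is unchanged...
assert mean(outliers) == 12                     # ...while the mean is pulled up

assert mode([1, 2, 3, 3, 3, 4, 5]) == 3         # most frequent value
```

This makes the robustness of the median concrete: a single extreme observation moves the mean from 3 to 12 but leaves the median untouched.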
Standard deviation
From the formula for variance, it can be seen that the value obtained for variance is in the
original units of measure squared. The Standard Deviation is a measure of spread, given in
the same units as the original observations. In parametric statistics, this value is most
commonly used when describing a normal distribution and is used in many of the hypothesis
tests to be discussed later in this section.
Range
The Range of a data set is defined as the highest observed value minus the lowest observed
value in a data set. It is a non-parametric method of describing dispersion and should be
used instead of the standard deviation when the number of observations is less than 5.
Degrees of freedom
The Degrees of Freedom (DOF) is the number of independent measures in a data set that can
be varied independently when a value of a chosen statistic is fixed. Put simply, if all but one
value in a set of observations are known, as well as the mean, one can calculate the missing
value. Therefore the degrees of freedom are calculated as the number of observations
minus 1.
The formula for variance and standard deviation reflect this, and correct for bias using N-1 as
the denominator. For large samples, the difference diminishes.
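The N - 1 correction is what separates the sample from the population standard deviation; Python's standard library exposes both:

```python
from statistics import pstdev, stdev

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
sample_sd = stdev(data)        # divides the squared deviations by N - 1
population_sd = pstdev(data)   # divides by N
# the sample estimate is always slightly larger, and the difference
# shrinks as the number of observations grows
```

For these eight values the population standard deviation is exactly 2.0, while the N - 1 version is slightly larger, reflecting the lost degree of freedom.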
Skewness and kurtosis
The Skewness of a distribution is a measure of its asymmetry and is referred to as the third
central moment of the distribution. The degree of this asymmetry is determined by the
coefficient of skewness.
Distributions that are skewed to the left have a negative coefficient of skewness and distributions skewed to the right have a positive coefficient of skewness (Hogg and Craig, 1978). The
following represent some common distributions, including the left and right skew
distributions.
IQR = Q3 - Q1
This provides a non-parametric estimate of the dispersion of a data set.
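A quick sketch; note that different software packages use slightly different conventions for estimating quantiles, so the exact Q1 and Q3 values may differ from The Unscrambler's.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 100])   # one extreme value
q1, q3 = np.percentile(values, [25, 75])        # NumPy's default (linear) rule
iqr = q3 - q1
full_range = values.max() - values.min()
# the IQR ignores the extreme value, while the range is dominated by it
```

This illustrates why the IQR is a robust measure of dispersion: the extreme observation inflates the range but barely affects the interquartile range.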
Use the Data input options to select a matrix to analyze, and use the rows and columns
drop-down lists to select predefined sets.
Use the Define button to add new sub ranges of the original matrix to analyze.
Check the Compute Correlation matrix box to display a matrix plot of the variable correlations.
Solution: Ensure that enough samples and variables are available for the calculation using
the Define option.
To view which samples/variables have been kept out of a particular data set, click on the
More Details option in the data input dialog, as shown below.
When the data has been correctly set up for analysis, click on OK to display the descriptive
statistics results.
Proceed to interpreting the results.
Quantiles
This plot contains one box plot for each variable, either over the whole sample set or for different subgroups. It shows the minimum, the 25th percentile (lower quartile), the median, the 75th percentile (upper quartile) and the maximum.
The box-plot shows 5 percentiles
Note: If there are fewer than five samples in the data set, the percentiles are not
calculated. The plot then displays one small horizontal bar for each value (each
sample). Otherwise, individual samples do not appear on the plot, except for the
maximum and minimum values.
General case
This plot is an excellent summary of the distributions of the variables. It shows the
total range of variation of each variable. Check whether all variables are within the
expected range. If not, out-of-range values are either outliers or data transcription
errors. Check the data and correct the errors!
If groups of samples have been plotted (e.g. Design samples, Center samples), there
is one box-plot per group.
Check that the spread (distance between Min and Max) over the Center samples is
much smaller than the spread over the Design samples. If not, some possible
explanations include,
Spectra
A quantiles plot can also be used as a diagnostic tool to study the distribution of a
whole set of related variables, for instance in spectroscopy the absorbances for
several wavelengths. In such cases, it is recommended not to use subgroups,
otherwise the plot may be too complex to provide interpretable information.
In the figure below, the percentile plot shows the general profile of a spectrum,
which may be common to all samples in the data set. The plot can be used to detect
which wavelengths (regions of the spectrum) have the largest variation. It is most
likely that these contain the most information.
Percentile plot for variables making up a spectrum
In some cases, the variation contained in certain parts of a spectrum may not be
relevant to the problem under study. The figure below demonstrates this by
showing an almost uniform spread over all wavelengths. This may cause suspicion,
as wavelengths with absorbances close to zero (i.e. baseline) have a large variation
for the samples analyzed. This may indicate a baseline shift, which can be corrected
using multiplicative scatter correction (MSC).
The scatter effects plot may be used to check such a hypothesis!
Equal baseline and major absorbance variation should be treated as suspicious
The average response value indicates the central tendency of the samples under
investigation. The standard deviation is a measure of the spread of the variable around that
average. If several variables are studied together, compare their standard deviations. If
there is considerable variation in the standard deviation values between variables, it is
recommended that the variables be standardized in later multivariate analyses (e.g. PCA,
PLS). This applies to variables of differing orders of magnitude (e.g. process variables),
sensory data, or other data coming from a number of different sources.
Standardization should not be applied to spectral data as this may inflate the
variance of non-important regions, possibly making them artificially significant.
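Standardization as described here divides each mean-centred variable by its standard deviation, giving every variable unit variance. A minimal sketch (illustrative data, not The Unscrambler's implementation):

```python
import numpy as np

# Rows are samples; the two variables differ by orders of magnitude
X = np.array([[1.0, 1000.0],
              [2.0, 1100.0],
              [3.0,  900.0],
              [4.0, 1200.0]])

# Centre each variable, then divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```

After standardization both columns have unit variance, so neither dominates a subsequent PCA or PLS model.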
Mean
Bar Plot
For each variable, the average value over all samples is displayed as a vertical bar for a
single variable, or as a series of bars for many variables.
Mean plot
Standard deviation
Bar Plot
For each variable, the standard deviation (square root of the variance) over all samples in
the chosen sample set is displayed. This plot may be useful for detecting which variables
have the largest absolute variation. If the variables have different standard deviations, it
may be necessary to standardize them in later multivariate analyses.
Standard Deviation plot of spectral data
Quantiles
See the description in the General section.
Scatter effects
The scatter effects plot shows each sample plotted against the average (mean) sample.
Scatter effects display themselves as differences in slope and/or offset between the lines in
the plot. Differences in the slope are caused by multiplicative scatter effects. Offset error is
due to additive effects. Sometimes the lines show profiles that deviate considerably from a
straight line. In such instances, caution must be taken when applying scatter correction, as
major chemical information may be confused with systematic scatter effects and therefore
lost in the transformation. For an excellent reference on this situation, see the article by
Martens et al. in the reference section for this chapter.
Applying Multiplicative Scatter Correction will improve the model if these scatter effects are
detected in the data table. The examples below provide a basic guide as to what to look for.
Two cases of scatter effects: Additive (left), Multiplicative (right)
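MSC itself is a simple regression of each spectrum on the mean spectrum: the fitted offset removes the additive effect and the fitted slope the multiplicative effect. A minimal sketch on synthetic "spectra" (not the product's implementation):

```python
import numpy as np

def msc(spectra):
    """Multiplicative scatter correction: regress each spectrum on the
    mean spectrum, then remove the fitted offset (additive effect)
    and slope (multiplicative effect)."""
    ref = spectra.mean(axis=0)                    # mean spectrum
    corrected = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        slope, offset = np.polyfit(ref, s, 1)     # s ~ offset + slope * ref
        corrected[i] = (s - offset) / slope
    return corrected

# One underlying profile distorted by additive and multiplicative scatter
profile = np.sin(np.linspace(0.0, np.pi, 50))
spectra = np.vstack([0.5 + 1.2 * profile,
                     -0.3 + 0.8 * profile,
                     0.1 + 1.0 * profile])
corrected = msc(spectra)   # all three rows collapse onto one curve
```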
Cross-correlation
Matrix Plot
The Matrix plot shows the cross-correlations between all variables included in a statistics
analysis. The matrix is symmetrical (the correlation between A and B is the same as between
B and A) and its diagonal elements are all 1, since the correlation between a
variable and itself is 1. All other values are between -1 and +1. A large positive value (as
shown in red in the figure below) indicates that the corresponding two variables have a
tendency to increase simultaneously. A large negative value (as shown in blue in the figure
below) indicates an inverse relationship of the variables. A correlation close to 0 (light green
in the figure below) indicates that the two variables vary independently from each other.
It is suggested to use a matrix plot consisting of “bars” (the default) or a “map” for
studying cross-correlations. Examples are provided below.
Cross-correlation plot, with Bars and Map layout
      A      B      C
A     1      0.76   -0.32
B     0.76   1      -0.09
C     -0.32  -0.09   1
The table is symmetrical, like the corresponding matrix plot, and is used to isolate the
quantitative values of correlation that exist between the variables under study.
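Such a cross-correlation matrix can be reproduced with NumPy; the data below are synthetic, with B constructed to follow A and C independent of both:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=100)
B = 0.8 * A + 0.2 * rng.normal(size=100)   # strongly tied to A
C = rng.normal(size=100)                   # independent of A and B

# Correlation matrix; columns are treated as variables
R = np.corrcoef(np.column_stack([A, B, C]), rowvar=False)
# R is symmetric with ones on the diagonal; R[0, 1] is large and
# positive, while R[0, 2] is close to zero
```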
Min, Max & Mean
This option shows a whisker plot with the minimum, mean and maximum value for each
variable in the top plot, with the value for the selected sample shown on that plot as a green
dot. The bottom plot shows all the values for the first variable in a control chart, with lower
and upper limit lines in red representing the lower and upper limit of that variable in the
selected data set. The green line is the mean value for the variable in the data set. The value
for a different sample can be shown in the whisker plot by using the arrows at the toolbar in
the top of the screen display; this will also move the dot along to the selected sample in the
bottom control chart. The whisker plot values can be centered and scaled by selecting the
11.6. Bibliography
R. Hogg and A. Craig, “Introduction to Mathematical Statistics”, 4th Edition, New York,
Macmillan Publishing Co, 1978.
J.N. Miller and J.C. Miller, “Statistics and Chemometrics for Analytical Chemistry” Fifth
Edition, Harlow, UK, Prentice Hall, 2005.
12. Basic Statistical Tests
12.1. Statistical tests
The Unscrambler® provides some basic hypothesis testing features, including tests for
normality, comparison of means and variances.
The tests included are:
To perform the analysis, use the menu option Tasks – Analyze – Statistical Tests…
The following sections briefly describe the ideas behind these methods, how to perform
them, and how to interpret the plots.
Theory
Usage
Plot Interpretation
Method reference
Equal variance assumption
Non-equal variance assumption
Comparison of two dependent means
The paired t-test
Comparison of categorical data
Chi-square test
Fisher’s exact test
Bayes exact test
Both of these principles should be obeyed as much as possible in order to make true
inferences about the population being investigated.
The null and alternative hypotheses are also dependent on the type of test to be performed.
This can be either a one-sided or a two-sided test. Before one- and two-sided tests can be
described, the principles of significance levels and p-values must be discussed.
Significance levels and p-values
The significance level of a statistical test is the risk one is willing to take of making a wrong
decision. The most commonly used significance level is α = 0.05, corresponding to the 95%
confidence level, where α is called the significance level or the risk. It is defined by the analyst
before the test is calculated. At 95% confidence, one is willing to take a 1 in 20 chance of
making an incorrect decision. Other common significance levels include 0.01 (99% confidence)
and 0.1 (90% confidence). The following diagram shows the common significance levels as histograms.
It is recognized that other tests for normality exist and may be used instead of the KS test.
Mardia’s test for multivariate normality
Kanti V. Mardia showed that the univariate calculations of skewness and kurtosis could be
extended to the multivariate case (Mardia, 1970). These calculations were used to develop a
test of multivariate normality (Mardia, 1974). To describe multivariate normality (sometimes
referred to as multinormality), the simplest case is considered. This is known as the bivariate
case and is shown graphically below.
This diagram shows that the bivariate normal distribution occupies a region in space defined
by a series of ellipses.
The diagram also shows one of the major principles behind multivariate methods such as
principal component analysis (PCA), described in other chapters of this help document. The
bivariate normal distribution consists of a number of ellipses of equal probability density
that show elongation along the direction of maximum variance.
For a multinormal distribution, Mardia has shown that the multivariate sample counterparts
of skewness and kurtosis can be defined as b1,p and b2,p, where p is the number of variables
being tested (Mardia, 1970). These test statistics can be used to test the null hypothesis of
multinormality. The null hypothesis is rejected for large b1,p and/or for large absolute values
of b2,p (Mardia, Kent and Bibby, 1979). Critical values of these statistics for small samples are
provided in Mardia, 1974.
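A sketch of the two sample statistics (following Mardia, 1970; the biased covariance is used, and the data are synthetic) may clarify the definitions:

```python
import numpy as np

def mardia_statistics(X):
    """Mardia's multivariate sample skewness b1,p and kurtosis b2,p."""
    n, p = X.shape
    D = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False, bias=True))
    G = D @ S_inv @ D.T                  # Mahalanobis inner products
    b1p = (G ** 3).sum() / n ** 2        # skewness statistic
    b2p = (np.diag(G) ** 2).sum() / n    # kurtosis statistic
    return b1p, b2p

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))            # multinormal by construction
b1p, b2p = mardia_statistics(X)
# For multinormal data, b2,p should be close to p(p + 2) = 15
```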
The F-test calculates the ratio of two sample set variances. The null hypothesis is set up such
that there is no significant difference between the variances, and the alternative hypothesis
is set such that one variance is greater than the other. If the null hypothesis stands, the ratio
of the variances should be close to a value of one (within the limits of random variation).
When it cannot be assumed that the difference is due to random variation, a significant
difference between the two variances exists.
The calculated test statistic F0 is compared to an F-table (the so-called Snedecor F-table) for
a specified number of degrees of freedom. The form of the test statistic is F0 = s1^2/s2^2,
which is compared with the critical value F(α, n1, n2),
where α = significance level, n1 = degrees of freedom for observation set 1 and n2 = degrees
of freedom for observation set 2. A p-value is also generated for the test. If p > 0.05 (at 95%
confidence) then the null hypothesis cannot be rejected; if p < 0.05, the null hypothesis that
the variances are equivalent is rejected.
When it can be safely accepted that the variances of the two observation sets are
equivalent, the variances can be pooled together for further analysis or the results can be
used to show that one method is equivalent to, or better than another.
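The F-test is straightforward to reproduce with SciPy (illustrative replicate measurements; the two-sided p-value convention shown here places the larger variance in the numerator):

```python
import numpy as np
from scipy import stats

method_1 = np.array([10.1, 10.4, 9.8, 10.2, 10.0, 10.3])
method_2 = np.array([10.0, 10.9, 9.2, 10.6, 9.5, 10.5])

v1 = np.var(method_1, ddof=1)
v2 = np.var(method_2, ddof=1)
F0 = max(v1, v2) / min(v1, v2)       # larger variance over smaller
df1 = df2 = len(method_1) - 1

p = 2 * stats.f.sf(F0, df1, df2)     # two-sided p-value
# p < 0.05 here: the two variances differ significantly
```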
Bartlett’s test
Bartlett’s test (Bartlett, 1937) can be used to test if two (or more) sample sets have equal
variances. Statistical tests, such as ANOVA, assume that variances are equal across groups of
samples. The Bartlett test can be used to verify this assumption.
Bartlett’s test is a parametric test that is sensitive to departures from normality, i.e. it is not
robust to outliers (non-normal results). In these cases, Levene’s test and the modification
proposed by Brown and Forsythe (1974) may be used as alternatives.
Bartlett’s test is used to test the null hypothesis, H0, that the population variances are equal
against the alternative that at least one pair of variances differs.
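With SciPy, Bartlett's test accepts two or more groups directly (illustrative data with similar spreads):

```python
from scipy import stats

group_1 = [34.0, 35.1, 33.8, 34.6, 34.9]
group_2 = [34.2, 35.0, 34.1, 34.7, 34.5]
group_3 = [33.9, 34.8, 34.3, 34.4, 35.2]

# H0: all population variances are equal
stat, p = stats.bartlett(group_1, group_2, group_3)
# A large p-value here: no evidence against equal variances
```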
Levene’s test
Levene’s test (Levene, 1960) is an inferential statistic which can be used to assess the equality
of variances of two samples. Levene’s test assesses the assumption that the variances of the
populations from which different samples were drawn are equal. If the calculated p-value is
less than some critical value (α = 0.05), the sample variances are unlikely to have occurred
by random sampling, and it is therefore concluded that there is a difference between the
variances in the population.
Levene’s test is less sensitive to departures from normality than Bartlett’s test and is
widely used before comparison of means (t-test). In the case where Levene’s test is
significant, subsequent tests must be performed that are based on the assumption of
non-equal variances.
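Levene's test is likewise available in SciPy; in the sketch below the two illustrative methods share a mean but differ clearly in spread:

```python
from scipy import stats

method_a = [5.2, 5.0, 5.4, 5.1, 5.3, 5.2]   # tight spread
method_b = [5.1, 6.0, 4.3, 5.8, 4.5, 5.7]   # much wider spread

# Brown-Forsythe variant: deviations taken from the group medians
stat, p = stats.levene(method_a, method_b, center='median')
# A small p-value here: the variances differ significantly
```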
Test for the equality of means using the assumption of equal variances.
Test for the equality of means using the assumption of non-equal variances.
The above description shows that a particular workflow is required for testing the
equivalence of two means:
The numerator contains the term x̄1 - x̄2, which measures the difference between the means
of the two sets of data; the closer this value is to 0, the more likely the two sets of
observations come from the same population. The denominator contains the term sp, which
is the pooled standard deviation,
The pooled standard deviation is a measure of the common spread of the two populations
and can only be representative of both populations when the variances are equivalent (F-test).
The other term in the denominator is a correction for the number of observations used to
calculate the t-statistic. The entire denominator defines a quantity known as the Standard
Error of the Mean (SE). Therefore, the t-statistic is a measure of the ratio of the difference
between two sample sets and the precision of the mean value. Significance is established by
comparing the calculated t-value (t0) with a tabulated t-value (tcrit) computed at a specified
significance level (usually 0.05) for a particular number of degrees of freedom.
The two-sample t-test can be either one-sided or two-sided. The null hypothesis is usually
set up as follows:
or
A p-value > 0.05 (or |t0| < tcrit) indicates that the null hypothesis cannot be rejected, i.e.
there is no significant difference between x̄1 and x̄2.
A p-value < 0.05 (or |t0| > tcrit) suggests that the sets of observations are significantly
different and therefore the null hypothesis must be rejected.
In this case, the individual variances of the two sets of observations are used in the
calculation of the t-statistic; however, the DF for this case must be estimated by the following formula:
The t0 value calculated is compared to a critical t-value obtained using the estimated degrees
of freedom. The test can be either one-sided or two-sided. At 95% confidence, when p >
0.05 (|t0| < tcrit) the null hypothesis cannot be rejected and when p < 0.05 (|t0| > tcrit), the
null hypothesis is rejected and the conclusion is that the two sets of observations are
significantly different.
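Both forms of the two-sample t-test can be sketched with SciPy: `equal_var=True` gives the pooled (equal-variance) test, and `equal_var=False` the Welch test with estimated degrees of freedom (data illustrative):

```python
from scipy import stats

lab_a = [12.1, 12.4, 11.9, 12.3, 12.2, 12.0]
lab_b = [12.6, 12.9, 12.5, 12.8, 12.4, 12.7]

# Pooled two-sample t-test (assumes equal variances)
t_eq, p_eq = stats.ttest_ind(lab_a, lab_b, equal_var=True)

# Welch's t-test (no equal-variance assumption, estimated DF)
t_w, p_w = stats.ttest_ind(lab_a, lab_b, equal_var=False)
# Small p-values here: the two laboratory means differ significantly
```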
Comparison of two dependent means
Compare the variances of the two sets of observations for equivalence. Note that in this
case, if the two sets have significantly different variances, there is no point in
continuing with the t-test.
Compute the paired t-statistic and compare it to a t-table at a specified level of
confidence for a particular number of DF.
The form of the paired t-statistic is provided below.
It is similar in form to all t-statistic formulas. In this case, the numerator contains the term
d̄, which is the mean difference (i.e. the bias) between the two sets of observations. The
closer d̄ is to zero, the more likely the two sets of observations are equivalent to each other.
The denominator contains the term SDev/sqrt(N), which is the standard error of the mean
difference of the observations, or the precision of the sample set. The paired t-test is useful
not only for determining whether two operators, methods, etc. are equivalent; the
calculated standard deviation of the differences (SDD) can provide a value for the expected
error of an analytical procedure, and the mean difference (d̄) can be used to determine if
there is any systematic difference between operators, methods, etc.
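A sketch of the paired t-test with SciPy; the two illustrative operators measure the same ten samples, with operator 2 reading consistently slightly high:

```python
import numpy as np
from scipy import stats

operator_1 = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4, 5.8, 5.0])
operator_2 = np.array([4.4, 5.2, 3.9, 6.3, 5.6, 5.1, 5.4, 4.5, 6.0, 5.2])

t0, p = stats.ttest_rel(operator_1, operator_2)

d = operator_1 - operator_2
d_bar = d.mean()          # mean difference (the bias)
sdd = d.std(ddof=1)       # standard deviation of the differences (SDD)
# Small p and negative d_bar: operator 2 reads systematically higher
```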
           Left-handed   Right-handed   Total
Men             10            79          89
Women            5            73          78
Total           15           152         167
In this example the expected value for left-handed men is 89 x 15/167 = 7.99, whereas the
observed value is 10. Applying the same calculation to the other three combinations of sex
and dexterity gives a total sum of 1.184, which is to be compared to the critical value 3.84
(chi-square distribution with 1 degree of freedom at α = 0.05).
The diagram below shows the main data input dialog when the Tasks – Analyze – Statistical
Tests… option is selected. All of the available methods can be found in the Test drop-down
list.
Ensure that data is available for a test to be conducted. In the case where all samples and
variables have been excluded, the following warning will be provided.
Solution: Use the Define button to deselect kept out rows and columns.
The following sections describe how to apply these basic statistical tests to data using The
Unscrambler®.
The Kolmogorov-Smirnov test of normality
The Kolmogorov-Smirnov (KS) test of normality requires only one column of input for
testing. If a data set is selected that contains more than one variable, the following warning
will be provided.
Select the matrix to test from the drop-down list and select the rows and columns containing
the data. Use the Define range button to create new ranges.
From the Test drop-down list, select Kolmogorov-Smirnov test of Normality.
Use the Significance level drop-down list to select the desired confidence associated with
the test and click on OK to start the analysis.
The results of the test are displayed as a node in the project navigator named Kolmogorov-
Smirnov normality test and can be plotted as a Cumulative Distribution Function (CDF). Use
the KS test statistic and the Critical Value with Lilliefors Correction to determine whether the
assumption of normality can be supported or not. When a KS test is performed, a CDF matrix
is generated in the project navigator under the analysis node.
The CDF folder contains the following information.
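Outside The Unscrambler®, the same kind of check can be sketched with SciPy. Note that when the normal parameters are estimated from the data, as below, the standard KS tables are too lenient and the Lilliefors-corrected critical values should be used instead:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=100)   # synthetic, normal data

# KS statistic against a normal CDF with estimated mean and std
stat, p = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))
# stat is the maximum vertical distance between sample and model CDFs
```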
These tests require data sets with only one column each. If more than one column is selected
in any of the data input boxes, the following warning will be provided.
Use the appropriate test based on knowledge of the system. It is always recommended to
apply the KS test to the data first (to assure normality, or near normality) and then test for
equal variances, before application of the t-tests.
Go to the menu Tasks – Analyze – Statistical Tests… and then in the Statistical Tests dialog
box, select the appropriate t-test to use from the drop-down list. Then use the Data drop-
down lists to select the columns to be tested. These can be from different data matrices, but
cannot include non-numeric data. Choose the significance level for the test and then click on
OK to start the test. The results are displayed as a new node in the project navigator named
Student’s t test, which has subnodes for data and test statistics.
In the special case of the paired t-test, the number of rows (samples) in both data sets
selected must be equal. If this is not the case, the following error message will be provided.
Use the graphical and tabular output to determine whether the two sample sets being
compared are statistically equivalent, or different. The Mean Comparison plot can be used in
this case. This plot also shows the relevant statistics for these tests. For more information on
plot and result interpretation, see Plot Interpretation for Statistical Tests
Tests for the comparison of variances
The Unscrambler® supports three common tests for the comparison of variances:
These tests require data sets with only one column each. If more than one column is selected
in any of the data input boxes, the following warning will be provided.
Use the appropriate test based on knowledge of the system. In this case, it is recommended
to apply the KS test to the data first before application of any of these tests.
Go to the menu Tasks – Analyze – Statistical Tests…, and then in the Statistical Tests dialog
box, select the appropriate variance test to use from the drop-down list. Then use the Data
drop-down lists to select the columns to be tested. Choose the significance level for the test
and then click on OK to start the test. The results are displayed as a node in the project
navigator.
Use the graphical and tabular output to determine whether the variances of the two sample
sets being compared are statistically equivalent, or different. The Variance Comparison plot
can be used in this case. This plot also shows the relevant statistics for these tests. For more
information on plot and result interpretation, see Plot Interpretation for Statistical Tests
Mardia’s test of multivariate normality
Mardia’s test of multivariate normality is used to test whether the data in a matrix exhibits
multivariate normality. Select the matrix to test from the Data drop-down list and select the
Mardia’s Test of Multivariate Normality option from the Test drop-down list. Select the
significance level from the drop-down list and click OK to start the analysis. The results of
the analysis are displayed as a node in the project navigator named Mardia’s test with
subnodes for data and test statistics.
Mardia’s test requires a data set of at least two rows and two columns to perform the test. If
the data set does not meet this criterion, the following error message will be provided.
In the case where there are any missing data, the following warning will be provided when
trying to apply Mardia’s test of multivariate normality.
The output of Mardia’s test of normality is a matrix of skewness and kurtosis test values.
Multivariate normality requires that the null hypothesis for both skewness and kurtosis are
not rejected.
Normal Skewness hypothesis: A value of “0” indicates that there is not enough evidence in
the data to suggest that the skewness deviates from a multivariate normal distribution. A
value of “1” indicates that the null hypothesis can be rejected at the chosen significance
level. A “small sample correction” is automatically applied when the number of data points
is 30 or fewer.
Normal kurtosis hypothesis: A value of “0” indicates that the null hypothesis of multivariate
normal kurtosis cannot be rejected, while a value of “1” indicates that the null hypothesis is
rejected at the chosen significance level. That is, a value of “1” means that the data display a
multivariate kurtosis that is not consistent with a multivariate normal distribution.
Both test results are accompanied by the p-values, critical values and Mardia’s statistics for
the skewness and kurtosis tests.
Note that this test is unreliable for highly collinear data, in which case a warning will be
given.
For more details on interpreting the output of this test, see Mardia’s test for multivariate
normality
Tests for association or independence (categorical data)
Categorical data from two columns can be cross-tabulated to produce a contingency table
and the observed frequencies can be compared with expected frequencies using classical or
Pearson’s Chi-square. For small samples (below 30) the Chi-squared values are also
computed with Yates’ correction. For 2x2 contingency tables the test also computes Fisher’s
and Bayes exact probabilities. Samples that have missing values are dropped automatically.
Contingency analysis requires that two columns of data be compared containing categorical
variables. If at least one column is not categorical, the warning shown below will be displayed.
The main result of a Contingency Analysis is the Contingency Table and a matrix of statistics
containing Chi-squared and p-values. These are discussed below.
Contingency Table
The Contingency (or Cross tabulation) table displays the multivariate frequency
distribution of categorical variables in order to find the relationship between them.
For example, suppose a clinical trial was performed using two main indicators: one is
sex (M or F), the other is response to drug (R for responsive and N for non-responsive).
In this example the study was performed on 2232 subjects, of which
1024 were female and 1208 were male. The Contingency Table provides a
condensed view of the proportion of males and females who responded or not to
the drug under study. An example table is shown below for this study.
The contingency table is found in the project navigator in the Test Statistics folder.
The table shows that a greater proportion of males positively responded to the drug than
females, but how do we assess that this is a significant difference? The Statistics folder holds
the answers.
Statistics
An example Statistics Table is shown below and is accessed from the Test Statistics
folder.
Student’s t-tests
Variance comparison tests
For a KS normality test, the actual sample value CDF (stepped red curve) is plotted along
with the expected CDF (blue smooth curve). If the two curves significantly depart from each
other over part of the curve, this is an indication that the sample distribution is non-normal.
If the two curves follow each other closely, then this is an indication that the sample
distribution is normal.
The KS statistic is displayed on the curve and is defined by the maximum vertical distance
between the two functions. The statistic is compared to tabulated values of the KS statistic
(in this case with the correction suggested by Lilliefors). If the KS statistic is less than the
critical value (from the KS table), then the null hypothesis that the distribution is normal
cannot be rejected. If however, the KS statistic is greater than the critical value, the
assumption of normality cannot be supported. The plot provides a statement regarding
whether the null hypothesis should, or should not be rejected.
Student’s t-tests
The main results output for the two sample and paired t-tests in The Unscrambler® is the
Mean Comparison plot. An example of this plot is provided below.
Mean Comparison Plot
This plot shows the mean value and the range of values around the mean for the two
variables tested. Visually assess whether the means of the two variables line up with each
other and that the spreads of the two variables are equivalent. The plot also provides
information on the type of test (two sample, paired), whether the test was one-sided or
two-sided, the significance level the test was performed at and the test statistics for the
analysis. Use the tabulated p-value to determine whether the means of the two variables
are statistically equivalent. If the p-value is less than the significance level at which the test
was carried out (usually 0.05), then the null hypothesis of no difference in the means is
rejected. If the p-value is greater than the significance level of the test, the null
hypothesis cannot be rejected. The plot provides a statement regarding whether the null
hypothesis should, or should not be rejected.
Variance comparison tests
For the variance comparison tests (Levene’s, Bartlett’s and the F-Test), the main results
output is the Variance Comparison plot. An example of this plot is provided below.
Variance Comparison Plot
This plot provides a comparison of the variance of the two variables along with their
confidence intervals. Interpret these plots by visually assessing the variance range for both
variables. The closer the two variables are in variance, the more likely they come from
similarly distributed populations. The plots also provide the Levene’s, Bartlett’s and F-test
statistics (depending on which test was chosen) along with the corresponding critical value
and p-value. If the p-value is less than the level of significance chosen (usually 0.05), then
the null hypothesis of equal variances is rejected. If the p-value for the test is
greater than the significance level, the null hypothesis cannot be rejected. The plot provides
a statement regarding whether the null hypothesis should, or should not be rejected.
12.6. Bibliography
M.S. Bartlett, Properties of sufficiency and statistical tests, Proceedings of the Royal
Statistical Society Series A 160, 268–282, (1937).
M.B. Brown and A.B. Forsythe, Robust tests for the equality of variance, J. American
Statistical Assoc., 69, 364-367, (1974).
R.B. D’Agostino, Tests for Normal Distribution, in Goodness-of-fit Techniques, R.B.
D’Agostino, M.A. Stephens(Eds), Marcel Dekker, New York, 1986.
G.E. Dallal and L. Wilkinson, An analytic approximation to the distribution of Lilliefors’ test
for normality, The American Statistician, 40, 294–296, (1986).
H. Levene, Robust tests for equality of variances, in Contributions to Probability and
Statistics: Essays in Honor of Harold Hotelling, Ingram Olkin, Harold Hotelling et al.(Eds),
Stanford University Press, Stanford, CA, 278-292, 1960.
K.V. Mardia, Measures of Multivariate Skewness and Kurtosis with Applications, Biometrika,
57, 519-530, (1970).
K.V. Mardia, Applications of Some Measures of Multivariate Skewness and Kurtosis in
Testing Normality and Robustness Studies, Sankhyā, Series B, 36, 115-128, (1974).
K.V. Mardia, J.T. Kent and J.M. Bibby, “Multivariate Analysis”, Academic Press, London, UK,
1979.
J.N. Miller and J.C. Miller, Statistics and Chemometrics for Analytical Chemistry, Fifth Edition,
Prentice Hall, UK, 2005.
13. Principal Components Analysis
13.1. Principal Component Analysis (PCA)
PCA can be used to reveal the hidden structure within large data sets. It provides a visual
representation of the relationships between samples and variables, and provides insights
into how the measured variables cause some samples to be similar to, or different from,
each other.
This section provides the details of the PCA approach to understanding data structure. When
considering a data table, each row represents an object (or individual, or sample), and each
column represents a descriptor (or measure, or variable). Throughout the rest of this
section, rows will be referred to as samples, and the columns as variables.
Theory
Usage
Plot Interpretation
Method reference
2001 for a more complete description of PCA. Other valuable references include Jackson,
1991 and Mardia et al., 1979. Additional references may also be found in the Bibliography
section of the help.
X = TP' + E, where T is the scores matrix, P the loadings matrix, E the error matrix, and P'
denotes the transpose of P. These terms will be explained in more detail in this document.
The combination of scores and loadings is the structured part of the data: the part that is
most informative. What remains is called error or residual, and represents the fraction of
variation that cannot be modeled well. By multiplying the scores and the loadings together,
the entire structure of the original data set can be reconstructed and hopefully, only a small
residual is left, consisting of random fluctuations which cannot be meaningfully modeled.
When interpreting the results of a PCA, one focuses on the structure part and discards the
residual part. This is acceptable, provided that the residuals are indeed negligible. It is a
question of how large an error one is willing to accept.
Geometrical interpretation of the difference between samples
Since humans can only visualize data in three dimensions, the following is used to describe
higher order space. Each sample in a data table may be represented by a point in a
multidimensional space (see figure below, for three dimensions). The location of the point is
determined by its coordinates, which are the cell values of the corresponding row in the
table. Each variable thus plays the role of a coordinate axis in multidimensional space.
Sample (object) representation in multidimensional space
Let us consider the whole data table geometrically. Two samples can be described as similar
if the values of most of their variables are close to each other. This results in data points that
are close to each other in space. On the other hand, two samples can be described as
different if their values greatly differ for at least some of the variables. This results in data
points occupying distinctly different areas in multidimensional space. This is represented for
two groups, A and B in the figure below.
Sample differences in multidimensional space
Principles of projection
The major principle of PCA is defined as follows: find the directions in space along which the
distance between (i.e. the dispersion of) the data points is the largest. This can be
interpreted as finding the linear combinations of the initial variables that contribute most to
making the samples different from each other. This is shown graphically below.
The First Principal Component
These directions, or combinations, are called Principal Components (PCs). They are
computed iteratively, in such a way that the first PC is the one that carries most information
(or in statistical terms, the most explained variance). The second PC will then carry the
maximum share of the residual information (i.e. not taken into account by the previous PC),
and so on.
This process can continue until as many PCs have been computed as there are variables (or
samples, whichever is smaller) in the data table. At that point, all the
variation between samples has been accounted for, and the PCs form a new set of
coordinate axes which has two advantages over the original set of axes (i.e. the original
variables). First, the PCs are orthogonal to each other. Second, they are ranked so that each
one carries more information than any subsequent ones. Thus, one can prioritize the
interpretation, focusing on the first few, since they carry the most information.
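The iterative extraction of orthogonal, variance-ranked PCs described above can be sketched numerically. The snippet below is a minimal illustration using SVD (not the NIPALS algorithm used in The Unscrambler®), with invented data values:

```python
import numpy as np

# Small illustrative data table: 6 samples (rows) x 3 variables (columns).
X = np.array([[2.0, 4.1, 0.5],
              [1.8, 3.9, 0.7],
              [3.1, 6.2, 0.2],
              [2.9, 5.8, 0.4],
              [1.2, 2.4, 0.9],
              [3.5, 7.1, 0.1]])

Xc = X - X.mean(axis=0)          # mean centering
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                   # sample coordinates along the PCs
loadings = Vt.T                  # each column is one PC direction

# The PCs are orthogonal to each other...
print(np.round(loadings.T @ loadings, 6))   # identity matrix

# ...and ranked so each carries more variance than any subsequent one
explained = s**2 / np.sum(s**2) * 100
print(np.round(explained, 1))
```

Here the first PC captures nearly all the variation because the invented variables are strongly correlated; the scores and loadings together reconstruct the centered data table.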
The new set of axes can be described as a new “window” for looking into the greatest
sources of information contained in the data. This is represented in the figure below of a
scores plot.
PCs 1 and 2: a new window for looking into multidimensional space
The way PCs are generated ensures that this new set of coordinate axes is the most suitable
basis for a graphical representation for interpreting the data structure.
Separating information from noise
In well defined data sets, it is common that the first few PCs contain interpretable
information, while the later PCs mostly describe noise. Therefore, it is useful to study the
first PCs only instead of the whole raw data table: not only is this less complex, but it also
ensures that noise is not mistaken for information.
All PCA models should be validated. Validation is the only way of making sure that only
informative PCs are retained in a model. The validation procedures associated with
multivariate models are described in detail in the chapter on Validation. The following
provides a short description of the most common validation methods used for PCA.
In PCA, like most multivariate methods, there are a number of ways to validate the model
generated. The two most commonly used methods are Cross Validation (CV) and Test Set
Validation. In CV, the analyst may set up the number of samples and segments to validate
the model, based on prior knowledge of the data set. In Full Cross Validation, (sometimes
called Leave-One-Out or LOO) each sample takes part in both the calibration and validation
steps individually. This method is commonly used when there is not enough variation in the
samples selected, or there are too few samples to do test set validation. LOO is a good
method for isolating influential samples in a small data set. Other forms of cross validation include systematic, for assessing the model's ability to model replicate data; random, when the data sets are larger and the analyst wants to understand the robustness of a model; and custom, when there is a priori information about the data set.
The preferred method of validation for all multivariate methods is test set validation. This
provides the most representative assessment of the model in future applications. The
samples used in validation are not used in the calibration (or training) step and therefore,
the model performance is not overly optimistic, as is the case for cross validation.
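Full cross validation (LOO) as described above can be sketched as follows: each sample is held out in turn, a PCA model is fitted on the remaining samples, and the held-out sample's projection residual is accumulated. This is a simplified illustration on random data, not The Unscrambler®'s implementation:

```python
import numpy as np

def pca_loadings(X, n_comp):
    """Mean and loadings of a PCA model fitted on X via SVD."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_comp].T

def loo_press(X, n_comp):
    """Leave-one-out prediction residual sum of squares for a PCA model."""
    press = 0.0
    for i in range(len(X)):
        train = np.delete(X, i, axis=0)          # leave sample i out
        mean, P = pca_loadings(train, n_comp)
        xc = X[i] - mean                         # center with the training mean
        resid = xc - P @ (P.T @ xc)              # residual after projection
        press += resid @ resid
    return press

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
print(loo_press(X, 2))
```

Comparing `loo_press` across component counts mimics how validation residual variance is used to decide how many PCs carry information rather than noise.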
Is PCA the most relevant summary of the data?
PCA produces an orthogonal bilinear matrix decomposition, where the PCs are computed in
a sequential way, explaining maximum variance in the data. Using these constraints plus
normalization during the bilinear matrix decomposition, PCA produces unique solutions.
These ‘abstract’ unique and orthogonal (independent) solutions are extremely helpful in
deducing the number of different sources of variation present in the data. However, it must
be noted that these are ‘abstract’ solutions in the sense that they are not the ‘true’
underlying factors causing the data variation, but orthogonal linear combinations of them.
In most cases one is interested in finding the “true” underlying sources of data variation. It is
not only a question of how many different sources are present and how they can be interpreted, but also of what they are in reality. This can sometimes be achieved using
either PC Rotation, or another type of bilinear method called Multivariate Curve Resolution
(MCR). A disadvantage of MCR methods is they do not yield a unique solution unless
external information is provided during the matrix decomposition.
Read more about Curve Resolution methods in the Help chapter Multivariate Curve
Resolution.
Where Cov is the covariance between x and y. There is a direct relationship between the
covariance of two vectors and the cosine of the angle between them. This is shown as
follows.
Provided x and y have been mean centered, the diagram shows the relationships between
loadings and the PCs and the following statements can be made about variables 1, 2 and 3.
The angle between variable 1 and PC1 is close to zero, Cos(0) = 1, therefore PC1
completely describes variable 1.
The angle between variable 2 and PC2 is zero, therefore PC2 completely describes
variable 2.
The angle between variables 1 and 2 is 90°. Cos(90) = 0, therefore variables 1 and 2
are uncorrelated.
The angle between variable 3 and PC1 is greater than 180° and the angle between
variable 3 and PC2 is greater than 90°, therefore variable 3 is negatively correlated
to both PC1 and PC2.
Variable 4 sits at the intersection of PC1 and PC2 and is not described well by either PC.
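The relation stated above between covariance and the cosine of the angle can be verified numerically: for mean-centered vectors, the cosine of the angle between them equals their Pearson correlation. The numbers below are arbitrary:

```python
import numpy as np

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
y = np.array([2.0, 2.5, 2.1, 4.0, 3.3])

xc = x - x.mean()                 # mean centering, as assumed in the text
yc = y - y.mean()

# cosine of the angle between the centered vectors
cos_angle = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

# Pearson correlation of the original vectors
corr = np.corrcoef(x, y)[0, 1]

print(round(cos_angle, 6) == round(corr, 6))   # True
```

This is why an angle of 0 means a PC completely describes a variable (correlation 1), and 90° means the two are uncorrelated (correlation 0).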
Variable residuals
From the variables’ point of view, the original variable vectors are being approximated by
their projections onto the model components. The difference between the original vector
and the projected one is the variable residual.
It can also be broken down into as many numbers as there are components.
Residual variation
The residual variation of a sample is the sum of squares of its residuals for all model
components. It is geometrically interpretable as the squared distance between the original
location of the sample and its projection onto the model.
The residual variations of Variables are computed the same way.
Residual variance
The residual variance of a variable is the mean square of its residuals for all model
components. It differs from the residual variation by a factor which takes into account the
remaining degrees of freedom in the data, thus making it a valid expression of the modeling
error for that variable.
Total residual variance is the average residual variance over all variables. This expression
summarizes the overall modeling error; i.e. it is the variance of the error part of the data.
Explained variance
Explained variance is the complement of residual variance, expressed as a percentage of the
global variance in the data. Thus the explained variance of a variable is the fraction of the
global variance of the variable taken into account by the model.
Total explained variance measures how much of the original variation in the data is
described by the model. It expresses the proportion of structure found in the data by the
model.
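The residual and explained variance definitions above can be illustrated as follows: after fitting a given number of components, the residual matrix is what the model leaves unexplained, and total explained variance is its complement as a percentage of the global variance. (The degrees-of-freedom correction mentioned for residual variance is omitted here for simplicity; the data are random.)

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
Xc = X - X.mean(axis=0)
total_var = np.sum(Xc**2)                 # global variation in the data

_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_by_a = []
for a in range(1, 4):
    P = Vt[:a].T                          # loadings of the first a PCs
    E = Xc - Xc @ P @ P.T                 # residual matrix after a components
    residual = np.sum(E**2)               # residual variation
    explained_by_a.append(100 * (1 - residual / total_var))

print([round(v, 1) for v in explained_by_a])  # cumulative explained variance (%)
```

The values increase with each added component, mirroring how each successive PC accounts for a share of the remaining variation.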
Variable variances
Variables with small residual variance (or large explained variance) for a particular
component are well explained by the corresponding model. Variables with large residual
variance for all or for the three to four first components have a small or moderate
relationship with the other variables.
If some variables have much larger residual variance than the other variables for all
components (or for the first three to four of them), try to keep these variables out and make
a new calculation. This may produce a model which is easier to interpret.
Calibration vs. validation variance
The calibration variance is based on fitting the calibration data to the model. The validation
variance is computed by testing the model on data not used in building the model. Look at
both variances to evaluate their difference. If the difference is large, there is reason to
question whether the calibration data or the test data are representative.
Outliers can sometimes be the reason for large residual variance. The next section discusses
outliers.
How to detect outliers in PCA
An outlier is a sample which looks so different from the others that it either is not well
described by the model or influences the model too much. As a consequence, it is possible
that one or more of the model components focuses only on trying to describe how this
sample is different from the others, even if this is irrelevant to the more important structure
present in the other samples. The diagram below depicts a typical situation where an outlier
influences the model completely, leaving the most important source of variation for the
second PC to describe.
Scores plot showing a gross outlier
In PCA, outliers can be detected using scores plots, residuals and leverages.
Different types of outliers can be detected using the various graphical tools available in The Unscrambler®:
Scores plots
show sample patterns according to one, two, or three components. It is easy to spot
a sample lying far away from the others. Such samples are likely to be outliers.
Residuals
measure how well samples or variables fit the model determined by the
components. Samples with a high residual are poorly described by the model, which
nevertheless fits the other samples quite well.
Leverages
measure the distance from the projected sample (i.e. its model approximation) to
the center (mean point). Samples with high leverages have a stronger influence on
the model than other samples; they may or may not be outliers, but they are
influential. An influential outlier (high residual + high leverage) is the worst case; it
can however easily be detected using an influence plot.
The diagram below provides an example of an influence plot, showing four typical classes of
sample. Samples with high leverage are considered extreme in the model as they lie furthest
from the center of the PCA model.
To summarize, if the score of a sample and the loading of a variable on a particular PC have
the same sign, the sample has higher than average value for that variable and vice-versa.
The larger the scores and loadings, the stronger that relation.
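The sign rule above can be checked on a small noise-free, rank-one example (values invented): when a sample's score and a variable's loading on a PC share a sign, the sample's value for that variable lies above that variable's mean.

```python
import numpy as np

rng = np.random.default_rng(2)
t_true = rng.normal(size=(8, 1))           # one underlying factor
p_true = np.array([[0.7, -0.5, 0.5]])      # variable loadings, mixed signs
X = 5.0 + t_true @ p_true                  # rank-one data table

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
t = (U * s)[:, 0]                          # scores on PC1
p = Vt[0]                                  # loadings on PC1

# Same sign of score and loading -> value above the variable's mean
i, j = 0, 0
above_avg = X[i, j] > X[:, j].mean()
print((t[i] * p[j] > 0) == above_avg)      # True for this noise-free table
```

Because the centered table is exactly the outer product of the PC1 scores and loadings here, the sign of their product equals the sign of each sample's deviation from the variable mean.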
If one now considers two PCs simultaneously, a two-vector loading plot and a two-vector
scores plot can be built. The same principles apply to their interpretation, with a further
advantage: one can now interpret any direction in the plot - not only the principal directions.
components explain. In general, only a (small) subset of components is kept for further
consideration and the remaining components are considered as noninformative, irrelevant
or nonexistent (i.e. they are assumed to reflect measurement error or noise).
In order to interpret the components that are considered relevant, one can follow the PCA
by a rotation of the components that were retained. Two main types of rotation are used:
orthogonal when the new axes are also orthogonal to each other, and oblique when the new
axes are not required to be orthogonal to each other. Nonorthogonal or oblique rotation is
the subject of Independent Component Analysis (ICA).
Why will a rotation help?
Since the rotations are always performed in a subspace (the so-called component space), the
new axes will always explain less variance than the original factors (which are computed to
be optimal), but obviously the part of variance explained by the total subspace after rotation
is the same as it was before rotation – only the partition of the variance has changed.
Because the rotated axes are not defined according to a statistical criterion, such rotations
are performed to facilitate the interpretation of the components, thus also giving more
direct meaning to the data analysis.
Rotation was designed to obtain simple structure by clustering variables into groups that
might aid in the examination of the structure of a multivariate data set. It has found most
use in psychology, market research, education and sensory analysis. In physical applications,
rotation is usually of secondary interest.
Varimax
Quartimax
Equimax
Parsimax
The rotation, R, is defined so as to maximize the variance of the squared loadings, given by the variance measure v:

v = Σⱼ [ Σᵢ (pᵢⱼ/hᵢ)⁴ − (γ/n) ( Σᵢ (pᵢⱼ/hᵢ)² )² ]

where n is the number of samples, pᵢⱼ are the elements of the scores, hᵢ is a normalization factor, and γ is a scaling factor defining the different types of rotation:
Rotation method Scaling factor
Varimax γ=1
Quartimax γ=0
Equimax γ=(NumOfPCs)/2
An orthogonal rotation, R, can be defined for the loadings, such that the rotated loadings are equal to P × R. For the model to remain invariant under the rotation, the scores must also be rotated, T × R, and the rotation must satisfy:

R × Rᵀ = I

where I is the identity matrix, i.e. R must be orthogonal. The original data can thus be reconstructed from the rotated loadings and scores by:

X ≈ (T × R) × (P × R)ᵀ = T × R × Rᵀ × Pᵀ = T × Pᵀ
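The orthomax family of rotations (varimax, quartimax, equamax) can be sketched with the standard SVD-based algorithm below, using the γ values from the table above. This is an illustrative implementation, not the one in The Unscrambler®, and the example loadings are invented:

```python
import numpy as np

def orthomax(P, gamma=1.0, max_iter=100, tol=1e-8):
    """Orthogonal rotation of a loadings matrix P.
    gamma=1 gives varimax, gamma=0 quartimax, gamma=k/2 equamax.
    Returns the rotated loadings and the rotation matrix R."""
    n, k = P.shape
    R = np.eye(k)
    crit = 0.0
    for _ in range(max_iter):
        L = P @ R
        # SVD step of the standard orthomax ascent algorithm
        U, s, Vt = np.linalg.svd(
            P.T @ (L**3 - (gamma / n) * L @ np.diag(np.sum(L**2, axis=0))))
        R = U @ Vt                      # product of orthogonal matrices
        crit_new = np.sum(s)
        if crit_new < crit * (1 + tol):  # converged
            break
        crit = crit_new
    return P @ R, R

# invented 4-variable, 2-component loadings matrix
P = np.array([[0.7, 0.3], [0.6, 0.4], [0.2, 0.8], [0.1, 0.7]])
L, R = orthomax(P, gamma=1.0)           # varimax
print(np.round(R @ R.T, 6))             # identity: the rotation is orthogonal
```

Since R is orthogonal, rotating loadings and scores together leaves the reconstructed data (and the total explained variance of the subspace) unchanged, as the text notes.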
Some important tips and warnings associated with the Model Inputs tab
PCA is a multivariate analysis technique; therefore, in The Unscrambler® it requires a minimum of three samples (rows) and two variables (columns) to be present in a data set in order to complete the calculation. The following are some of the warnings given when certain analysis criteria are not met.
Not enough samples or variables present
Solution: Check that the data table (or selected row set) contains a minimum of 3 samples.
Not enough variables present
Solution: Check that the data table (or selected column set) contains a minimum of 2
variables.
Too many excluded samples/variables
The same warning as for Not enough samples or variables (described above) will be given.
Solution: Check that all samples/variables have not been excluded from the data set.
To keep track of row and column exclusions, the model inputs tab provides a warning to
users that exclusions have been defined. See automatic keep outs for more details.
Individual variables can be selected from the variable list table provided in this dialog by
holding down the control (Ctrl) key and selecting variables. Alternatively, the variable
numbers can be manually entered into the text dialog box. The Select button can be used
(which will bring up the Define Range dialog), or every variable in the table can be selected
by simply clicking on All.
Once the variables have been selected, to weight them, use the options in the Change
Selected Variable(s) dialog box, under the Select tab. The options include:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows the weighting of selected variables by predefined constant values.
Downweight
This allows the multiplication of selected variables by a very small number, such that
the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
Use the Advanced tab in the Weights dialog to apply predetermined weights to each
variable. To use this option, set up a row in the data set containing the weights (or create a
separate row matrix in the project navigator). Select the Advanced tab in the Weights dialog
and select the matrix containing the weights from the drop-down list. Use the Rows option
to define the row containing the weights and click on Update to apply the new weights.
Another feature of the Advanced tab is the ability to use the results matrix of another analysis as weights, using the Select Results Matrix button. This option provides an internal project navigator for selecting the appropriate results matrix to use as a weight.
The dialog box for the Advanced option is provided below.
PCA Advanced Weights Option
Once the weighting and variables have been selected, click Update to apply them.
Select the desired rotation method from the dialog box and a rotated model will be
displayed in the project navigator.
See Available rotation methods for information about the rotation methods available in The
Unscrambler®.
The differences between the algorithms are described in the Introduction to PCA. The
NIPALS algorithm is iterative and the maximum number of iterations can be tuned in the
Max. iterations box. The default value of 100 should be sufficient for most data sets; however, some large and noisy data sets may require more iterations to converge properly. The maximum allowed number of iterations is 30,000.
When there are missing values in the data, the options are to impute them automatically using the NIPALS algorithm, or as a pre-processing step using Fill Missing.
Note: If there are missing values in the data and SVD is selected, a warning will be
given as shown below.
Q-residual limits are by default approximated based on the calculated model components only, which works well in many cases. Exact Q-residual limits will be calculated when the check box is marked. Note that estimation of exact limits may be slow for large data sets.
Pretreatments can also be registered from the PCA node in the project navigator. To register
the pretreatment, right click on the PCA analysis node and select Register Pretreatment.
This is shown below.
Registering a Pretreatment From The Project Navigator
The Autopretreatment dialog box will appear, where the desired pretreatments can be
selected.
Note: Some caution is required when the data table dimensions are changed after a first pretreatment has been registered. The Autopretreatment is applied to the same column indices as the original transformation, so inserting new variables (columns) before or in between the original data will result in autopretreatment of the wrong variables.
To be safe, always insert any new variables into the table before applying any transformations, or make a habit of always appending rather than inserting new columns.
Set this tab up based on a priori knowledge of the data set in order to return outlier
warnings in the PCA model. Settings for estimating the optimal number of components can
also be tuned here. The values shown in the dialog box above are default values and might
be used as a starting point for the analysis.
The warning limits in The Unscrambler® serve two major purposes: detecting outlying samples and variables, and estimating the optimal number of components.
The leverage and residual (outlier) limits are given as standard scores. This means that a limit of e.g. 3.0 corresponds to a 99.7% probability that a value will lie within 3.0 standard deviations of the mean of a normal distribution. The following limits can be specified:
Leverage Limit
(default 3.0) The ratio between the leverage for an individual sample and the
average leverage for the model.
Sample Outlier Limit, Calibration
(default 3.0) The square root of the ratio between the residual calibration variance
per sample (Sample Residuals) and the average residual calibration variance for the
model (Total Residuals).
Sample Outlier Limit, Validation
(default 3.0) The square root of the ratio between the residual validation variance
per sample (Sample Validation Residuals) and the total residual validation variance
for the model (Total Residuals).
Individual Value Outlier, Calibration
(default 3.0) For individual, absolute values in the calibration residual matrix
(Residuals), the ratio to the model average is computed (square root of the Variable
Residuals). For spectroscopic data this limit may be set to 5.0 to avoid many false
positive warnings due to the high number of variables.
Individual Value Outlier, Validation
(default 2.6) For individual, absolute values in the validation residual matrix
(Residuals), the ratio to the validation model average is computed (square root of
the Variable Validation Residuals). For spectroscopic data this limit may be set to 5.0
to avoid many false positive warnings due to the high number of variables.
Variable Outlier Limit, Calibration
(default 3.0) The square root of the ratio between the residual calibration variance
per variable (Variable Residuals) and the average residual calibration variance for
the model (Total Residuals).
Variable Outlier Limit, Validation
(default 3.0) The square root of the ratio between the residual validation variance
per variable (Variable Validation Residuals) and the total residual validation variance
for the model (Total Residuals).
Total Explained Variance (%)
(default 20) If the model explains less than 20% of the variance, the optimal number of components is set to 0 (see the Info Box).
Ratio of Calibrated to Validated Residual Variance
(default 0.5) If the residual variance from the validation is much higher than that from the calibration, a warning is given.
Ratio of Validated to Calibrated Residual Variance
(default 0.75) If the residual variance from the calibration is much higher than that from the validation, a warning is given. This may occur with test set validation when the test samples do not span the same space as the training data.
Residual Variance Increase Limit (%)
(default 6) This limit is applied when selecting the optimal number of components and is calculated from the residual variances of two consecutive components. If the variance for the next component is less than x% lower than that of the previous component, the default number of components is set to the previous one.
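As a sketch of how the standard-score limits above are applied (using hypothetical per-sample statistics rather than actual Unscrambler® output), a sample triggers a warning when the square root of its residual-variance ratio, or its leverage ratio, exceeds the limit:

```python
import numpy as np

def outlier_warnings(sample_resid_var, leverage, limit=3.0):
    """Flag samples whose residual variance (as sqrt of the ratio to the
    model average) or leverage (as a ratio to the average leverage)
    exceeds the warning limit (default 3.0)."""
    resid_flag = np.sqrt(sample_resid_var / sample_resid_var.mean()) > limit
    lev_flag = leverage / leverage.mean() > limit
    return resid_flag | lev_flag

# hypothetical per-sample statistics from a fitted PCA model
resid = np.array([0.20, 0.25, 0.22, 0.18, 0.24, 0.21, 0.19, 0.23, 0.20, 8.0])
lev = np.array([0.10, 0.12, 0.09, 0.11, 0.10, 0.08, 0.12, 0.11, 0.10, 0.90])
print(outlier_warnings(resid, lev))
```

Note that because the averages include the suspect sample itself, a single gross outlier inflates the denominator; in this invented example the last sample is flagged through its leverage ratio.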
When all the options are specified, click OK.
2 x 2-D scatter
4 x 2-D scatter
Loadings
Line
2-D scatter
3-D scatter
2 x 2-D scatter
4 x 2-D scatter
Residuals
Residuals and influence
Influence plot
Variance per sample
Variable residuals
Sample residuals
Sample and variable residuals
Leverage / Hotelling’s T²
Leverages
Line
Matrix
Hotelling’s T²
Line
Matrix
Scores
This is a two-dimensional scatter plot (or map) of scores for two specified components (PCs)
from PCA. The plot gives information about patterns in the samples. The scores plot for
(PC1,PC2) is especially useful, since these two components summarize more variation in the
data than any other pair of components.
The closer the samples are in the scores plot, the more similar they are with respect to the
two components concerned. Conversely, samples far away from each other are different
from each other. The plot can be used to interpret differences and similarities among
samples. Look at the scores plot together with the corresponding loadings plot, for the same
two components. This can help determine which variables are responsible for differences
between samples. For example, samples to the right of the scores plot will usually have a
large value for variables to the right of the loadings plot, and a small value for variables to
the left of the loadings plot.
Here are some things to look for in the 2-D scores plot.
Finding groups in a scores plot
Is there any indication of clustering in the set of samples? The figure below shows a
situation with four distinct clusters. Samples within a cluster are similar.
Detecting grouping in a scores plot
Are the samples evenly spread over the whole region, or is there any accumulation
of samples at one end? The figure below shows a typical fan-shaped layout, with
most samples accumulated to the bottom left of the plot, then progressively
spreading more and more. This means that the variables responsible for the major
variations are asymmetrically distributed. In such a situation, study the distributions
of those variables (histograms), and use an appropriate transformation (most often
a logarithm).
Asymmetrical distribution of the samples on a scores plot
have been errors in data collection or transcription, or those samples may have to
be removed if they do not belong to the population of interest.
An outlier sticks out of the major group of samples
Furthermore, the display of the Hotelling's T² ellipse for a model in two dimensions is also a good way to detect outliers. To display it, click on the Hotelling's T² ellipse button.
Scores plot with Hotelling’s T² limit
In addition, the display of the stability plot can help in detecting outliers. This plot represents the projections of the samples onto the submodels used for the validation, in which they can be part of the model or left out. Hence this plot is only available when some type of cross validation has been selected. It is available from the toolbar icon.
An outlier disturbs the model
In the above image, the sample 143_1 is projected very differently in one particular projection. It is also visible that one particular projection deviates for all the samples. The
study of the samples left out for this particular projection indicates that sample 143_1 is the
source of this variation. This sample is an outlier.
How representative is the picture?
Check how much of the total variation each of the components explains. This is
displayed in parentheses next to the axis name. If the sum of the explained variances
for the 2 components is large (for instance 70-80%), the plot shows a large portion
of the information in the data, so the relationships can be interpreted with a high
degree of certainty. On the other hand if it is smaller, more components or a
transformation should be considered, or there may simply be little meaningful
information in the data under study.
Loadings
A two-dimensional scatter plot of X-loadings for two specified components from PCA is a
good way to detect important variables. The plot is most useful for interpreting component
1 vs. component 2, since they represent the largest variations in the X-data.
The plot shows the importance of the different variables for the two components specified.
It should preferably be used together with the corresponding scores plot. Variables with X-
loadings to the right in the loadings plot will be X-variables which usually have high values
for samples to the right in the scores plot, etc.
Note: Downweighted variables are displayed in a different color so as to be easily
identified.
X-variables correlation structure
Variables close to each other in the loadings plot will have a high positive correlation
if the two components explain a large portion of the variance of X. The same is true
for variables in the same quadrant lying close to a straight line through the origin.
Variables in diagonally opposed quadrants will have a tendency to be negatively
correlated.
For example, in the figure below, variables “redness” and “colour” have a high
positive correlation, and they are negatively correlated to variable “thickness”.
Variables “redness” and “off-flavour” have independent variations. Variables
“raspberry flavour” and “off-flavour” are negatively correlated. Variable
“sweetness” and “chew” resistance cannot be interpreted in this plot, because they
are very close to the center.
Loadings of 12 sensory variables along (PC1,PC2)
Note: Variables lying close to the center are poorly explained by the plotted PCs. Do
not interpret them in that plot!
When working with spectroscopic or time series data, line loadings plots will aid better
interpretation. This is because the loadings will have a profile similar to the original data and
may highlight regions of high importance. The plot below shows how a number of PCs can be overlaid in a line loadings plot to determine which components capture the important
sources of information.
When working with discrete variables, line loadings plots can also be used to represent data.
The Ascending and Descending buttons can be used to order the loadings in
terms of the variables with highest (or lowest) contribution to the PC.
Line plot of loadings in ascending order of importance to PC1
In the above plot, three variables are located in the inner circle: Chew resistance,
Sweetness and Bitterness. They do not contain enough structured variation to be
discriminating for the jam samples.
Correlation loadings are also available for 1-D line loadings plots. When a line plot is generated, the 1-D correlation loadings toolbar icon is displayed. These are especially useful when interpreting important wavelengths in the analysis of spectroscopic data, or contributing variables in time series data. An example is shown below.
Correlation line loadings of spectroscopic variables in PC1
Values that lie within the upper and lower bounds of the plot are modelled by that PC. Those that lie between the two lower bounds are not.
Influence plot
This plot shows the Q- or F-residuals vs. Leverage or Hotelling’s T² statistics. These represent
two different kinds of outliers. The residual statistics on the ordinate axis describe the
sample distance to model, whereas the Leverage and Hotelling’s T² describe how well the
sample is described by the model.
Samples with high residual variance, i.e. lying in the upper regions of the plot, are poorly
described by the model. Including additional components may result in these samples being
described better, however caution is required that the additional components are predictive
and not modelling noise. As long as the samples with high residual variance are not
influential (see below), keeping them in the model may not be a problem as such (the high
residual variance may be due to non-important regions of a spectrum, for instance).
Samples with high leverage, i.e. lying to the right of the plot, are well described by the
model. They are well described in the sense that the sample scores may have very high or
low values for some components compared to the rest of the samples. Such samples are
dangerous in the calibration phase because they are influential to the model. A sufficiently
extreme sample may by itself span an entire component, in which case the model will
become unreliable. Removal of a highly influential sample from the model will make the
model look entirely different and the axes will span different phenomena altogether. If the
variance described by the sample is important but unique, one should try to obtain more
samples of the same type to stabilize the model. Otherwise the sample should be discarded
as an outlier.
Note that a sample with both high residual variance and high leverage is the most dangerous
outlier. Not only is it poorly described by the model but it is also influential. Samples such as
these may span up to several components single handedly. Because they also disagree with
the majority of the other calibration samples, the ability of the model to describe new
samples is likely poor.
The Q- and F-residuals are two different methods for testing the same thing. The F-residuals are available for both calibration and validation, in contrast to the Q-residuals, which are available for calibration only. The validated residuals reflect the scheme chosen in the validation and are a more conservative assessment of residual outliers. If the residual variance from validation is much higher than that for calibration, one should investigate the residuals in more detail.
The difference between Leverage and Hotelling's T² is only a scaling factor. The critical limit for Leverage is based on an ad hoc rule, whereas the Hotelling's T² critical limit is based on the assumption of a Student's t-distribution.
Influence plot
In the above plot, sample 25 has a high leverage on PC6, which is the dimensionality of the
model. This sample has to be checked as it is a probable outlier.
Three cases can be detected from the influence plot:
Case 1: A sample has a high leverage
This is an influential sample. Check the reasons for it to be influential and decide
what to do.
Case 2: A sample has a high residual
Check which variables are poorly described by the model for this sample. Decide if
this sample is an outlier.
Case 3: A sample has a high leverage and a high residual
This sample is most likely an outlier. Retaining this sample in the model is risky.
Note: When working with designed data, the leverage of each sample in the design
is known by construction, and these leverages are optimal, i.e. all design samples
contribute equally to the model. There is therefore no need to worry about
leverages when running a regression: the design has accounted for them.
What to do with an influential sample
The first thing to do is to understand why the sample has a high leverage (and, possibly, a
high residual variance). Investigate by looking at the raw data and checking them against the
original recordings.
There are two cases to consider:
Case 1
There is an error in the data. Correct it, or if the true value cannot be found or the
experiment cannot be redone to get a more valid value, replace the erroneous value
with “missing”.
Case 2
There is no error, but the sample is different from the others. For instance, it has
extreme values for several of the variables. Check whether this sample is “of
interest” (e.g. it has the properties to be achieved, to a higher degree than the other
samples), or “not relevant” (e.g. it belongs to another population than the one
under study). In the former case, try to generate more samples of the same kind:
they are the most interesting ones! In the latter case (and only then), remove the
high-leverage sample from the model.
Calibration and validation samples can be displayed in the influence plot by toggling
between them with the toolbar buttons. This is only possible if the validation
method chosen was cross validation or test set validation.
Explained variance
This plot gives an indication of how much of the variation in the data is described by the
different components.
Total residual variance is computed as the sum of squares of the residuals for all the
variables, divided by the number of degrees of freedom.
Total explained variance is then computed as the share of the total variance accounted for
by the model:
Total explained variance (%) = 100 × (total variance − total residual variance) / total variance
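As an illustration, here is a minimal numpy sketch of how explained variance accumulates per component (simulated data, PCA via SVD; for simplicity this uses plain sums of squares and ignores the degrees-of-freedom correction mentioned above):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 5))
Xc = X - X.mean(axis=0)                      # mean-center the data

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
ss_total = np.sum(Xc**2)                     # total variance (as sum of squares)

explained = []
for a in range(1, len(s) + 1):
    Xhat = (U[:, :a] * s[:a]) @ Vt[:a]       # a-component reconstruction
    ss_resid = np.sum((Xc - Xhat)**2)        # total residual sum of squares
    explained.append(100.0 * (1.0 - ss_resid / ss_total))

# Explained variance grows monotonically toward 100%
assert all(e2 >= e1 for e1, e2 in zip(explained, explained[1:]))
assert abs(explained[-1] - 100.0) < 1e-8
```

Plotting `explained` against the component number reproduces the explained-variance curve described in this section.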
Calibration variance is based on fitting the calibration data to the model. Validation variance
is computed by testing the model on data that were not used to build the model. Compare
the two variances: if they differ significantly, there is good reason to question whether
either the calibration data or the test data are truly representative. The figure below
shows a situation where the residual validation variance is much larger than the residual
calibration variance (or the explained validation variance is much smaller than the explained
calibration variance). This means that although the calibration data are well fitted (small
residual calibration variances), the model does not describe new data well (large residual
validation variance).
Conversely, if the two residual variance curves are close together, the model is
representative.
Total residual variance curves for Calibration and Validation showing the presence of outliers
Outliers can sometimes cause large residual variance (or small explained variance).
They can also cause a drop in the explained validation variance, as can be seen in the
plot below.
Outlier causes a drop of explained variance in validation
If some variables have much larger residual variance than all the other variables for all
components in the model (or for the first 3-4 of them), try rebuilding the model with these
variables deleted. This may produce a model that is easier to interpret.
Note: Both calibration and validation variances are available.
Sample outliers
Scores
See the description in the overview section
Influence
See the description in the overview section
Scores and Loadings
Scores
See the description in the overview section
Loadings
See the description in the overview section
Residuals and influence
Influence Plot
This plot shows the Q- or F-residuals vs. Leverage or Hotelling’s T². See the general
description about the influence plot in the overview section for more details.
The toggle buttons in the toolbar can be used to switch between the various combinations.
Leverage / Hotelling’s T²
The lower left pane of the Residuals and Influence overview displays a line plot of
Hotelling’s T² by default. A toolbar toggle can be used to switch between the
Hotelling’s T² and Leverage views.
Hotelling’s T² statistics
The plot displays the Hotelling’s T² statistic for each sample as a line plot. The associated
critical limit (with a default p-value of 5%) is displayed as a red line.
Hotelling’s T² plot
The Hotelling’s T² statistic has a linear relationship to the leverage for a given sample. Its
critical limit is based on an F-test. Use it to identify outliers or to detect situations where a
process is operating outside normal conditions. There are six different significance levels to
choose from in the drop-down list.
The number of factors (or PCs) may be tuned up or down with the toolbar tools.
Leverage
Leverages are useful for detecting samples which are far from the center within the space
described by the model. Samples with high leverage differ from the average samples; in
other words, they are likely outliers. A large leverage also indicates a high influence on the
model. The figure below shows a situation where sample 5 is obviously very different from
the rest and may disturb the model.
One sample has a high leverage
There is an ad-hoc critical limit for Leverage (one that does not depend on any assumptions
about distribution), computed from the number of components and the number of
calibration samples. Leverages can be interpreted in two ways: absolute and relative.
Absolute leverage values
Leverage values are always larger than zero, and can go up to 1 for samples in the
calibration set. As a rule of thumb, samples with a leverage above 0.4 - 0.5 start
to be a concern.
Relative leverage values
Influence on the model is best measured in terms of relative leverage. For instance,
if all samples have leverages between 0.02 and 0.1 except for one with a leverage
of 0.3, then although this value is not extremely large, that sample is likely to be
influential.
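A hedged sketch of how this screening might be automated in Python (the 0.45 absolute cutoff comes from the 0.4 - 0.5 rule of thumb above; the factor-of-3 relative screen against the median is an illustrative choice, not a rule from this manual):

```python
import numpy as np

def flag_high_leverage(h, abs_limit=0.45, rel_factor=3.0):
    """Flag samples by an absolute and a relative leverage screen.

    h          : leverages, one per calibration sample
    abs_limit  : absolute rule of thumb (0.4 - 0.5 per the text)
    rel_factor : flag leverages above rel_factor times the median
                 (illustrative assumption, not from the manual)
    """
    h = np.asarray(h, dtype=float)
    absolute = h > abs_limit
    relative = h > rel_factor * np.median(h)
    return absolute, relative

# The example from the text: leverages of 0.02 - 0.1 plus one sample at 0.3
h = [0.02, 0.05, 0.1, 0.04, 0.3]
absolute, relative = flag_high_leverage(h)
assert not absolute.any()      # 0.3 is below the absolute rule of thumb
assert relative.tolist() == [False, False, False, False, True]
```

The sample at 0.3 passes the absolute screen but fails the relative one, matching the interpretation given in the text.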
Leverages in designed data
Residuals
The lower right pane of the Residuals and Influence overview displays a line plot of the
sample residual statistics. A toolbar toggle can be used to switch between the Q- and
F-residuals views.
Q-residuals
This plot shows the sample Q-residuals as a line plot with associated limits.
Q-residual sample variance
F-residuals
This plot shows the sample F-residuals as a line plot with associated limits.
Note that the F-residuals are available for both calibration and validation. If the residual
X-variance from validation is much higher than for calibration, one should investigate the
residuals in more detail. The validated residuals reflect the scheme chosen in the validation
and give a more conservative assessment of residual outliers.
Leverage / Hotelling’s T²
See the description in the overview section.
Residuals
See the description in the overview section.
Two plots
The scores and loadings plots will be displayed for PC1-PC2 in two frames.
Four plots
The scores and loadings plots will be displayed for PC1-PC2 in the first two frames and for
PC3-PC4 in the third and fourth frames.
Bi-plot
This is a two-dimensional scatter plot or map of scores for two specified components (PCs),
with the X-loadings displayed on the same plot. It is called a bi-plot. It enables one to
interpret sample properties and variable relationships simultaneously.
Scores
The closer two samples are in the scores plot, the more similar they are with respect to the
two components concerned. Conversely, samples far away from each other are different
from each other.
Here are a few things to look for in the scores plot:
Is there any indication of clustering in the set of samples?
Are the samples evenly spread over the whole region, or is there any accumulation
of samples at one end?
Are some samples very different from the rest?
Loadings
The plot shows the importance of the different variables for the two components specified.
Variables with loadings to the right in the loadings plot will be variables which usually have
high values for samples to the right in the scores plot, etc.
Note: Downweighted variables are displayed in a different color so as to be easily
identified.
Interpret variable projections on the loadings plot. Variables close to each other in the
loadings plot will have a high positive correlation if the two components explain a large
portion of the variance of X. The same is true for variables in the same quadrant lying close
to a straight line through the origin. Variables in diagonally opposed quadrants will have a
tendency to be negatively correlated.
Scores and loadings together
The plot can be used to interpret sample properties. Look for variables projected far away
from the center. Samples lying in an extreme position in the same direction as a given
variable have large values for that variable; samples lying in the opposite direction have low
values.
For instance, in the figure below, C1H3 is the most colorful, while C1H2 has the highest off-
flavor (and probably lowest Raspberry taste). C4H3 is very different from C3H2: C4H3 has
highest Raspberry taste and lowest off-flavor, otherwise those two jams do not differ much
in color and thickness. C3H3 has high Raspberry taste, and is rather colorful. C2H1, C1H1 and
C3H1 are thick, and have little color. The jams cannot be compared with respect to
sweetness, because variable Sweetness is projected close to the center.
Bi-plot for 8 jam samples and 12 sensory properties
Scores
Line
This is a plot of score values vs. sample number for a specified component. Although it is
usually better to look at 2-D or 3-D scores plots because they contain more information, this
plot can be useful whenever the samples are sorted according to the values of an underlying
variable, e.g. time, to detect trends or patterns (see figure below). Also look for systematic
patterns, like a regular increase or decrease, periodicity, etc. (only relevant if the sample
number has a meaning, like time, for instance).
Trend in a scores plot
The smaller the vertical variation (i.e. the closer the score values are to each other), the
more similar the samples are for this particular component. Look for samples that have a
very large positive or negative score value compared to the others: these may be outliers.
An outlier sticks out on a line plot of the scores
2-D scatter
See the description in the overview section
3-D scatter
This is a 3-D scatter plot or map of the scores for three specified components from PCA. The
plot gives information about patterns in the samples and is most useful when interpreting
components 1, 2 and 3, since these components summarize most of the variation in the
data. It is usually easier to look at 2-D scores plots but if three components are needed to
describe enough variation in the data, the 3-D plot is a practical alternative.
Scores plot in 3-D
Like with the 2-D plot, the closer the samples are in the 3-D scores plot, the more similar
they are with respect to the three components.
The 3-D plot can be used to interpret differences and similarities among samples. Look at
the scores plot and the corresponding loadings plot, for the same three components.
Together they can be used to determine which variables are responsible for differences
between samples. Samples with high scores along the first component usually have large
values for variables with high loadings along the first component, etc.
Here are a few patterns to look for in a scores plot.
Finding groups in a scores plot
Do the samples show any tendency towards clustering? A plot with three distinct
clusters is shown below. Samples within the same cluster are similar to each other.
Three groups of samples appear on the scores plot
Check how much of the total variation is explained by each component (these
numbers are displayed at the bottom of the plot). If it is large, the plot shows a
significant portion of the information in the data and it can be used to interpret
relationships with a high degree of certainty. If the explained variation is smaller,
more components or a transformation may be considered, or there may be little
information in the original data.
2 x 2-D scatter
The visualization frame is divided in two. A 2-D scatter plot is displayed in each subframe.
The first plot is in the PC1-PC2 plane and the second in the PC3-PC4 plane.
4 x 2-D scatter
The visualization frame is divided in four. A 2-D scatter plot is displayed in each subframe:
the first in the PC1-PC2 plane, the second in the PC3-PC4 plane, the third in the PC5-PC6
plane, and the fourth in the PC7-PC8 plane.
Loadings
Line
This is a plot of X-loadings for a specified component vs. variable number. It is useful for
detecting important variables. In many cases it is better to look at two- or three-vector
loadings plots instead because they contain more information.
Line plots are most useful for multichannel measurements, for instance spectra from a
spectrophotometer, or in any case where the variables are implicit functions of an
underlying parameter, like wavelength, time, etc. The plot shows the relationship between
the specified component and the different X-variables. If a variable has a large positive or
negative loading, this means that the variable is important for the component concerned;
see the figure below. For example, a sample with a large score value for this component will
have a large positive value for a variable with large positive loading.
Spectral data can default to use line plots for the loadings plot. To set this, right-click on the
given range in the project navigator and tick the Spectra option.
Line plot of the X-loadings, important variables in a spectra
Variables with large loadings in early components are the ones that vary most. This means
that these variables are responsible for the greatest differences between the samples.
Note: Downweighted variables are displayed in a different color to be easily
identified.
2-D scatter
See the description in the overview section
3-D scatter
This is a three-dimensional scatter plot of X-loadings for three specified components from
PCA. The plot is most useful for interpreting directions, in connection to a 3-D scores plot.
Otherwise it is recommended to use line- or 2-D loadings plots.
Note: Downweighted variables are displayed in a different color so as to be easily
identified.
2 x 2-D scatter
The visualization frame is divided in two. A 2-D scatter plot is displayed in each subframe.
The first is in the PC1-PC2 plane and the second in the PC3-PC4 plane.
4 x 2-D scatter
The visualization frame is divided in four. A 2-D scatter plot is displayed in each subframe:
the first in the PC1-PC2 plane, the second in the PC3-PC4 plane, the third in the PC5-PC6
plane, and the fourth in the PC7-PC8 plane.
Residuals
Influence plot
See the description in the overview section
Samples with small residual variance (or large explained variance) for a particular
component are well explained by the corresponding model, and vice versa. In the above
plot, four samples, such as B3, seem poorly explained by the model and may be outliers.
Variable residuals
This is a plot of residuals for a specified X-variable and component number for all the
samples. The plot is useful for detecting outlying sample/variable combinations, as shown
below. Outliers can sometimes be modeled by incorporating more components. This
should, however, be avoided, since it will reduce the prediction ability of the model.
Line plot of the variable residuals
Whereas the sample residual plot gives information about residuals for all variables for a
particular sample, this plot gives information about all possible samples for a particular
variable. It is therefore more useful when investigating how one specific variable behaves in
all the samples.
Sample residuals
This is a plot of the residuals for a specified sample and component number for all the X-
variables. It is useful for detecting outlying sample or variable combinations. Although
outliers can sometimes be modeled by incorporating more components, this should be
avoided since it will reduce the prediction ability of the model.
Line plot of the sample residuals: one variable is an outlier
In the above plot, variable 1: Adhesiveness at 1 day is, for a particular sample, not well
described by a model with four components. If this is the case for most of the samples,
this variable may be noisy and can be considered an outlier.
In contrast to the variable residual plot, which gives information about residuals for all
samples for a particular variable, this plot gives information about all possible variables for a
particular sample. It is therefore useful when studying how a specific sample fits to the
model.
In the above map, one sample is suspect and should be further investigated.
Leverage / Hotelling’s T²
Leverages
Line
See the description in the Plot accessible from the Navigator section
Matrix
This is a matrix plot of leverages for all samples and all model components. The X-axis
represents the components and the Y-axis the samples. The color represents the Z-value
which is the leverage, the color scale can be customized. It is a useful plot for studying how
the influence of each sample evolves with the number of components in the model.
Hotelling’s T²
Line
See the description in the predefined plot section
Matrix
This is a matrix plot of Hotelling’s T² statistics for all samples and all model components. It is
equivalent to the matrix plot of leverages, to which it has a linear relationship. The Y-axis
represents the components and the X-axis the samples. The color represents the Z-value
which is the Hotelling’s T² statistic for a specific PC and sample, the color scale can be
customized.
13.6. Bibliography
C.B. Crawford and G.A. Ferguson, A general rotation criterion and its use in orthogonal
rotation, Psychometrika, 35(3), 321-332, (1970).
R.A. Darton, Rotation in Factor Analysis, The Statistician, 29, 167-194, (1980).
K. Esbensen, Multivariate Data Analysis - In Practice, 5th Edition, CAMO Process AS, Oslo,
2002.
H.H. Harman, Modern Factor Analysis, 3rd Edition, revised, University of Chicago Press,
1976.
H. Hotelling, Analysis of a complex of statistical variables into principal components, J. Educ.
Psych., 24, 417-441, 498-520, (1933).
J.E. Jackson, A Users Guide to Principal Components, Wiley & Sons Inc., New York, 1991.
H.F. Kaiser, The varimax criterion for analytic rotation in factor analysis, Psychometrika, 23,
187-200, (1958).
K.V. Mardia, J.T. Kent, and J.M. Bibby, Multivariate Analysis, Academic Press Inc, London,
1979.
J.O. Neuhaus and C. Wrigley, The Quartimax Method: An analytic approach to orthogonal
simple structure, British J. Statistical Psychology, 7(2), 81-91, (1954).
D.R. Saunders, An analytic method for rotation to orthogonal simple structure, Princeton,
Educational Testing Service Research Bulletin, 53-10, (1953).
14. Multiple Linear Regression
14.1. Multiple Linear Regression
Multiple Linear Regression (MLR) is the classical method that combines a set of several
X-variables in a linear combination which correlates as closely as possible with the
corresponding single Y-vector.
Theory
Usage
Plot Interpretation
Method reference
Basics
Principles behind Multiple Linear Regression (MLR)
Sum of squares due to error SSE
Sum of squares due to regression SSreg
The ANOVA for regression
Interpreting the results of MLR
Regression coefficients (b-coefficients)
Predicted vs. reference plot
Residuals
Random and normally distributed residuals
Non-constant variance
Curvature in residuals
Systematic variance
Form of the model
More details about regression methods
14.2.1 Basics
MLR: Regressing one Y-variable on a set of X-variables
The theory behind MLR has been well described in the literature and texts such as the book
by Montgomery, Peck, Vining, 2001 and Weisberg, 1985 are excellent sources for subject
matter on this topic.
In MLR a direct “least squares” regression is performed between the Y- and the X-matrix. In
this section, the case of regression on one column vector Y will be addressed for simplicity,
but the method can readily be extended to a whole Y-matrix (as is common when MLR is
applied to designed experiment (DOE) data with multiple responses). In this case one can
make independent MLR models, one for each Y-variable, based on the same X-matrix.
The following MLR model equation is just an extension of the normal univariate straight-line
equation:
y = b0 + b1x1 + b2x2 + … + bpxp + f
The objective is to find the vector of regression coefficients b that minimizes f, the error
term. This is where the least squares criterion on the squared error terms is used, i.e. find b
so that fᵀf is minimized. MLR estimates the model coefficients using the equation:
b = (XᵀX)⁻¹Xᵀy
This operation involves the inversion of the so-called dispersion matrix, (XᵀX)⁻¹. If any of
the X-variables show collinearity with each other, i.e. if the variables are not linearly
independent, then the MLR solution will not be stable (if there is a solution at all).
Incidentally, this is the reason why the predictors are called independent variables in MLR;
the ability to vary the X-variables independently of each other is a crucial requirement to
variables used as predictors with this method. This is why in DOE, the initial design matrix is
generated in such a way as to establish this independence (also called orthogonality) in the
first place. MLR also requires more samples than predictors or the matrix cannot be
inverted.
MLR has the following properties and behavior:
The number of X-variables must be smaller than the number of samples;
In case of collinearity among X-variables, the b-coefficients are not reliable and the
model may be unstable;
MLR tends to overfit when noisy data are used.
The Unscrambler® uses the QR decomposition to find the MLR solution. No missing values
are accepted in this decomposition.
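A minimal numpy sketch of QR-based least squares on simulated data (illustrative only, not The Unscrambler's implementation) shows that it reproduces the normal-equation solution b = (XᵀX)⁻¹Xᵀy without explicitly inverting the dispersion matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.5 + 2.0 * x1 - 0.5 * x2 + 0.01 * rng.normal(size=n)   # small noise

X = np.column_stack([np.ones(n), x1, x2])    # design matrix with intercept

# QR-based least squares: X = QR, then solve the triangular system R b = Q^T y
Q, R = np.linalg.qr(X)
b = np.linalg.solve(R, Q.T @ y)

# Matches the normal-equation solution b = (X^T X)^{-1} X^T y
b_ne = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(b, b_ne)
assert np.allclose(b, [1.5, 2.0, -0.5], atol=0.05)   # recovers the true coefficients
```

The QR route is numerically more stable than forming XᵀX, which is one common motivation for using it in least-squares solvers.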
For example, for a model with an intercept and two terms X1 and X2, there are 3 DOF
contributed by the model: one for the intercept and two from the model terms X1 and X2.
The total DOF for a data set is equal to the number of observations (n) minus 1. Using the
ANOVA model definition, the residual DOF can then be found as the total DOF minus the
model DOF.
When MSreg is larger than MSE, this implies that a greater part of the total variance is being
described by the fitted model. Significance can then be established from the p-value
calculated at a particular significance level. If MSE is larger than MSreg or the regression is
found to be insignificant, then questions must be raised regarding the validity of the fitted
model.
The general form of the ANOVA table for regression is provided below.
Generic ANOVA table for regression
Source of variation   Sum of Squares   Degrees of Freedom   Mean Square             F0
Regression            SSreg            p                    MSreg = SSreg/p         MSreg/MSE
Residual (error)      SSE              n - p - 1            MSE = SSE/(n - p - 1)
Total                 SST              n - 1
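The table entries can be computed directly. The following numpy/scipy sketch (simulated data with an intercept and two model terms; illustrative, not The Unscrambler's code) fills in the generic table and derives the p-value from the upper tail of the F-distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p = 25, 2                                  # observations, model terms (excl. intercept)
x1, x2 = rng.normal(size=(2, n))
y = 4.0 + 1.2 * x1 - 0.8 * x2 + rng.normal(scale=0.3, size=n)

X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

ss_total = np.sum((y - y.mean())**2)          # SST, n-1 DOF
ss_reg = np.sum((y_hat - y.mean())**2)        # SSreg, p DOF
ss_err = np.sum((y - y_hat)**2)               # SSE, n-p-1 DOF

ms_reg = ss_reg / p
ms_err = ss_err / (n - p - 1)
f0 = ms_reg / ms_err
p_value = stats.f.sf(f0, p, n - p - 1)        # upper-tail F probability
r2 = ss_reg / ss_total                        # R-squared, as defined below

assert np.isclose(ss_total, ss_reg + ss_err)  # sums of squares partition exactly
assert p_value < 0.05                         # clearly significant fit here
```

With a strong simulated signal and small noise, MSreg dwarfs MSE, F0 is large, and the p-value is far below 0.05, i.e. the model is significant.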
The fitted MLR model is linear, meaning that the observed response values are
approximated by a linear combination of the values of the predictors. The coefficients of
that combination are called regression coefficients or b-coefficients.
Several diagnostic tools are associated with the regression coefficients (available only for
MLR):
Standard error is a measure of the precision of the estimation of a coefficient;
A Student’s t-value can be computed; comparing it to a reference t-distribution
then yields a significance level or p-value. This is the probability of a t-value equal
to or larger than the observed one, if the true value of the regression coefficient
were 0.
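A small numpy/scipy sketch of these diagnostics on simulated data (illustrative, not The Unscrambler's code): the standard errors come from the diagonal of the inverted dispersion matrix (XᵀX)⁻¹ scaled by the error variance, and two-sided p-values from a Student's t-distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 + 0.0 * x2 + rng.normal(scale=0.5, size=n)   # x2 has no true effect

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b
dof = n - X.shape[1]
mse = resid @ resid / dof                        # error variance estimate

# Standard error of each coefficient from the dispersion matrix (X^T X)^{-1}
se = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))
t_vals = b / se
p_vals = 2 * stats.t.sf(np.abs(t_vals), dof)     # two-sided p-values

assert p_vals[1] < 1e-6        # x1 is highly significant
assert p_vals[2] > p_vals[1]   # x2 (no true effect) is far less significant
```

A coefficient whose p-value exceeds the chosen significance level (e.g. 0.05) would typically be considered for removal from the model.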
Regression coefficients show how each variable is weighted when predicting a particular Y
response. Regression coefficients are a characteristic of all regression methods and provide
great interpretive insight into the quality of a model. Examples include
Spectroscopy
chosen wavelengths should exhibit changes related to chemical signals in the
samples and not show noise or unexplainable characteristics.
Designed data
When different variable types exist, regression coefficients show the relative
importance of the variables, and their interactions can also be displayed as cross
terms of the type b12x1x2.
Predicted vs. reference plot
The predicted vs. reference plot is also another common feature of all regression methods.
The predicted vs. reference plot should show a straight line relationship between predicted
and measured values, ideally with a slope of 1 and a correlation coefficient (R²) close to 1.
More details on plots can be found in interpreting MLR plots.
For MLR, the correlation coefficient R² is calculated as the ratio of SSreg and SST, i.e.
R² = SSreg / SST
It is the ratio of the variance explained by the model to the total variance that can be
explained. Other variants of the R² statistic are available when terms are added to or
removed from the model.
Residuals
Residuals relate to the SSE term in the ANOVA and, for a good model, should have a mean
value of zero and a variance s² indicative of the experimental error associated with the
analysis. Residuals can be plotted as Y-predicted vs. Y-actual or as Studentized residuals.
This plot should show that the residuals are randomly distributed around zero with no
visible trending. Some examples of residual patterns are provided in the figure below.
Non-constant variance
This is also known as heteroscedasticity. It occurs when the precision of the analyzing
instrument decreases or the variability of a data set increases in a particular direction. In this
case, the range of the Y-variables should be decreased or other analysis methods, such as
weighted least squares should be used.
Curvature in residuals
This occurs when the form of the model is incorrect. MLR attempts to fit a linear model to
the data, however, if the underlying relationship is quadratic in nature, then the linear
model is not the best fit. This can be detected using Lack of Fit (LoF) tests.
Systematic variance
This can occur when important model terms are left out of the final equation, or an
important source of variance has not been included in the initial design. This is the most
difficult situation to deal with in the MLR problem and the source of the variation may be
either controllable or uncontrollable.
Form of the model
The general MLR equation is called a linear model because it is linear in terms of the
coefficients. The following model is also linear and takes into account interaction terms
between the regressors:
y = b0 + b1X1 + b2X2 + b12X1X2 + f
The term b12X1X2 takes into account the possibility that the interaction term contributes
significantly to the final model. In this situation, this term adds extra DOF to the regression
terms in the model and may account for any observed curvature in the residuals (should
they exist). The significance of the interaction term can be established using a t-test. If
interaction terms are found to be insignificant, they should be removed from the
model, since their inclusion inflates the SSE term in the ANOVA model.
Another important term that can be added to the MLR model to account for curvature is a
square term. The form of the model can be described as follows:
y = b0 + b1X1 + b2X2 + b11X1² + b22X2² + f
where the coefficients b11 and b22 refer to the square terms in the model. The significance
of these terms should be established using a t-test.
The reason why the MLR model can still be described as linear (even with interaction and
square terms) is that terms of the form b12X1X2 and b11X1² can be rewritten as ordinary
linear terms by treating the products X1X2 and X1² as new variables.
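This substitution can be demonstrated with a short numpy sketch (simulated, noise-free data; illustrative only): the interaction and square terms are simply appended as extra columns of the design matrix, and ordinary least squares recovers the coefficients.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
# True model with an interaction term and a square term (no noise)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + 0.8 * x1**2

# Treat X1*X2 and X1^2 as ordinary extra columns: the model stays
# linear in the coefficients, so plain least squares still applies.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2, x1**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(b, [1.0, 2.0, -1.0, 0.5, 0.8], atol=1e-8)
```

Because the data are noise-free and the augmented columns are linearly independent, the least-squares fit reproduces the true coefficients exactly (to numerical precision).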
In the Model Inputs tab, first select an X-matrix to be analyzed in the Predictors frame.
Select pre-defined row and column ranges in the Rows and Cols boxes, or click the Define
button to perform the selection manually in the Define Range dialog. For MLR analysis the
number of samples must exceed the number of variables.
Next select a Y-matrix to be analyzed in the Responses frame. The responses may be taken
from the same data table as the predictors or from any other data table in the project
navigator. Models may be developed for single or multiple responses.
Note: If a separate Y-response matrix is being used, ensure that the row names of Y
correspond to the row names in X. Otherwise, non-meaningful regression results
will be obtained.
The Include Intercept Term check box can be used to add an intercept term in the model. If
the data have been previously mean centered, the intercept term will be zero. If an intercept
term is found to be nonsignificant, then it can be removed from the analysis.
The Significance Level (alpha) box allows a user to set the significance level applied to the
regression results. The value 0.05 (i.e. 95% confidence) is used by default.
The Identify Outliers check box allows a user to set up certain criteria in the Warning Limits
tab and use these to identify potential outliers during the analysis.
The details of the analysis setup are provided in the Information box on the model inputs
tab. It is important to check the details in this box each time an analysis is performed, to
ensure that the correct parameters have been set. The information contained in this box is:
Some important tips and warnings associated with the Model Inputs tab
MLR is the simplest multivariate regression analysis technique. It does not work if there are
more variables than samples. If there are more variables than samples present in a defined
data set, the following warning will be provided.
More variables than samples present
Solution: Define a data set where there are at least 2 more samples than variables present.
If the number of rows in X does not match that of Y, the following warning will be provided:
Number of X rows does not match number of Y rows
Solution: Ensure that the row set dimensions of X match the row set dimensions of Y.
If too many samples or variables are excluded, the following warning will be provided:
Too many excluded samples/variables
Solution: Check that all samples/variables have not been excluded in a data set.
To keep track of row and column exclusions, the model inputs tab provides a warning to
users that exclusions have been defined. See automatic keep outs for more details.
Use the Matrix drop-down list to select the test set from the rows and columns drop-down
lists, or define a set using the Define button.
If the variable dimension of the test set does not match that of the set used for calibration,
the following warning is provided:
Solution: Define a meaningful set of variables to match those of the calibration set.
In the case where too many samples or variables have been excluded from the test set, the
following warning will be provided.
Solution: Ensure that there are some variables defined for the calculation.
Set this tab up based on a priori knowledge of the data set in order to return outlier
warnings for the model. Settings for estimating the optimal number of components can
also be tuned here. The values shown in the dialog box above are default values and might
be used as a starting point for the analysis.
The warning limits in The Unscrambler® serve two major purposes: flagging possible
outliers and guiding the estimation of the optimal number of components.
The leverage and residual (outlier) limits are given as standard scores. This means that a
limit of e.g. 3.0 corresponds to a 99.7% probability that a value will lie within 3.0 standard
deviations of the mean of a normal distribution. The following limits can be specified:
Leverage Limit
(default 3.0) The ratio between the leverage for an individual sample and the
average leverage for the model.
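The 99.7% figure quoted above follows directly from the normal distribution and can be verified with a short, purely illustrative snippet:

```python
import math

def normal_coverage(k):
    """Probability that a normally distributed value lies within k standard
    deviations of the mean: P(|Z| < k) = erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2.0))

normal_coverage(3.0)  # about 0.9973, i.e. the 99.7% quoted above
```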
As variable weighting will change the relative sizes of MLR regression coefficients, we do not
recommend using weighting indiscriminately, and there is no Weights tab in MLR. To assess
standardized regression coefficients whose magnitudes do not depend on the variance of the
variables, auto-scale the variables as a pre-processing step: go to
Tasks–Transform–Weights prior to analysis and divide each variable by its standard
deviation.
See Theory of weighting for more details.
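As a sketch of the auto-scaling step described above (centering each variable and dividing by its standard deviation; the function name is illustrative, not a product API):

```python
import statistics

def autoscale(column):
    """Center one variable and divide by its standard deviation, as in the
    pre-processing step described above. Returns the scaled values."""
    mean = statistics.mean(column)
    sdev = statistics.stdev(column)
    return [(x - mean) / sdev for x in column]

scaled = autoscale([1.0, 2.0, 3.0, 4.0])
# After scaling, the variable has mean 0 and standard deviation 1,
# so coefficient magnitudes no longer depend on the original units.
```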
When all the settings are done, click OK to perform the analysis.
ANOVA Table
The ANOVA table contains degrees of freedom, sums of squares, mean squares, F-values and
p-values for all sources of variation included in the model. The Multiple Correlation
coefficient and the R-squared are also presented above the main table. A value close to 1
indicates a good fit, while a value close to 0 indicates a poor fit.
Summary
The first part of the ANOVA table is a summary of the significance of the global model. If the
p-value for the global model is smaller than 0.05, it means that the model explains more of
the variations of the response variable than could be expected from random phenomena. In
other words, the model is significant at the 5% level. The smaller the p-value, the more
significant (and useful) the model is.
Variables
The second part of the ANOVA table deals with each individual
effect (main effects, optionally also interactions and square terms). If the p-value for an
effect is smaller than 0.05, it means that the corresponding source of variation explains
more of the variations of the response variable than could be expected from random
phenomena. In other words, the effect is significant at the 5% level. The smaller the p-value,
the more significant the effect is.
Model check
The model check tests whether the nonlinear part of the model is significant. It includes up
to three groups of effects.
If the p-value for a group of effects is larger than 0.05, it means that these effects are not
useful, and that a simpler model would perform as well. Try to recompute the response
surface without those effects!
Lack of fit
The lack of fit part tests whether the error in response prediction is mostly due to
experimental variability or to an inadequate shape of the model. If the p-value for lack of fit
is smaller than 0.05, it means that the model does not describe the true shape of the
response surface. In such cases, try a transformation of the response variable.
Regression (t-values)
The t-value for each coefficient is computed as the ratio between deviation from the mean
accounted for by the variable represented by the coefficient, and standard error of the
mean.
By comparing the t-value with its theoretical distribution (Student’s t-distribution), the
significance level of the studied coefficient is assessed.
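To make the computation concrete, here is a minimal sketch for the simplest case of a single predictor (simple linear regression), where the t-value is the estimated coefficient divided by its standard error. This is an illustration only, not the product's internal code:

```python
import math

def slope_t_value(x, y):
    """t-value for the slope in simple linear regression:
    t = coefficient / standard error of the coefficient."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx                                   # slope estimate
    residuals = [yi - (my + b * (xi - mx)) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in residuals) / (n - 2)    # residual variance
    se = math.sqrt(s2 / sxx)                        # standard error of the slope
    return b / se

# A strong linear relationship gives a large t-value for the slope.
t = slope_t_value([1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 1.9, 3.2, 3.8, 5.1])
```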
The t-values plot presents all the t-values for all coefficients.
Regression (t-values)
In the above plot the predictive variables “Protein”, “Carbohydrates” and “Fat” show high t-
values; they are likely to have significant effects in the model.
“Saturated fat” shows a t-value close to 0 and therefore is likely to be non-significant.
For predefined limits look at the p-value plot.
The presence of an outlier is shown in the example below. The outlying sample has a much
larger residual than the others; however, it does not seem to disturb the model to a large
extent.
A simple outlier has a large residual
The figure below shows the case of an influential outlier: not only does it have a large
residual, it also distorts the whole model so that the remaining residuals show a very clear
trend. Such samples should usually be excluded from the analysis, unless there is an error in
the data or some data transformation can correct for the phenomenon.
An influential outlier changes the structure of the residuals
Small residuals (compared to the variance of Y) which are randomly distributed indicate
adequate models.
Some statistics giving an idea of the quality of the regression are available from the
Predicted vs. Reference plot.
When the calibration and validation samples are similar and lie close to a straight line of
slope 1, the fit can be considered as good.
Predicted vs. Reference plot for Calibration and Validation, with a good fit.
To determine the quality of the fit, the following statistics are available:
Slope
The closer the slope is to 1, the better the data are modelled.
Offset
This is the intercept of the line with the Y-axis, i.e. the value where the line
crosses X = 0. (Note: this value is not necessarily zero!)
RMSE
The first value (in blue) is the calibration error, RMSEC; the second one (in red) is the
expected prediction/estimation error, RMSEP or RMSECV, depending on the validation
method used. Both are expressed in the same unit as the response variable Y.
R-squared
The first one (in blue) is the raw R-squared of the model, the second one (in red) is
also called adjusted R-squared and tells how good a fit can be expected for future
predictions. R-squared varies between 0 and 1. A value of 0.9 is usually considered
as pretty good but this varies depending on the application and on the number of
samples.
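The statistics above can be sketched from paired reference and predicted values as follows (an illustrative computation, not the product's internal code; here R-squared is computed from the prediction residuals):

```python
import math

def fit_statistics(reference, predicted):
    """Slope, offset, RMSE and R-squared for a Predicted vs. Reference plot."""
    n = len(reference)
    mr = sum(reference) / n
    mp = sum(predicted) / n
    sxx = sum((r - mr) ** 2 for r in reference)
    sxy = sum((r - mr) * (p - mp) for r, p in zip(reference, predicted))
    slope = sxy / sxx                     # slope of predicted vs. reference
    offset = mp - slope * mr              # intercept with the Y-axis
    ss_res = sum((p - r) ** 2 for r, p in zip(reference, predicted))
    rmse = math.sqrt(ss_res / n)          # in the same unit as Y
    r2 = 1.0 - ss_res / sxx               # 1 means a perfect fit
    return slope, offset, rmse, r2
```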
When the statistics are toggled on, more detailed statistics are displayed. The Calibration
plot is shown below with statistics.
Predicted vs. Reference plot for MLR Calibration samples
In the above plot, sample 3 does not follow the regression line whereas all the other
samples do. Sample 3 may be an outlier.
How to detect nonlinearity
In other cases, there may be a nonlinear relationship between the X- and Y-
variables, so that the predictions do not have the same level of accuracy over the
whole range of variation of Y. In such cases, the plot may look like the one shown
below. Such nonlinearities should be corrected if possible (for instance by a suitable
transformation), because otherwise there will be a systematic bias in the predictions
depending on the range of the sample.
Predicted vs. Reference shows a nonlinear relationship
Regression coefficients
Regression coefficients summarize the relationship between all predictors and a given
response.
The regression coefficients line plot is available for the weighted beta coefficients (Bw).
Note: If no weights were applied, the weighted coefficients are identical to the
raw coefficients.
Weighted regression coefficients
The above plot shows the weighted regression coefficients for the response variable (Y).
Each predictor variable (X) defines one point of the line (or one bar of the plot). It is
recommended to configure the layout of this type of plot as bars. Variables 1, 7, 9 and 11
have the highest weighted B coefficients.
The B0 coefficient is displayed along with the X-axis name. In this case B0 = 0.03708.
The weighted coefficients reflect the importance of the X-variables in the model.
However the raw coefficients are also interesting as those are used to write the model
equation in original units:
The raw coefficients do not reflect the importance of the X-variables in the model, because
the sizes of these coefficients depend on the range of variation (and indirectly, on the
original units) of the X-variables. A small raw coefficient does not necessarily indicate an
unimportant variable; a large raw coefficient does not necessarily indicate an important
variable.
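The relation between weighted and raw coefficients can be sketched as follows: if a variable was multiplied by a weight w before modelling, its raw coefficient is the weighted coefficient times w (a simplified sketch that ignores centering; illustrative only):

```python
def raw_from_weighted(b_weighted, weights):
    """If the model was fitted on x_w = w * x, then b_w * x_w = (b_w * w) * x,
    so the raw coefficient equals the weighted coefficient times the weight."""
    return [bw * w for bw, w in zip(b_weighted, weights)]

# A variable weighted by 1/SDev = 0.5 with weighted coefficient 2.0
# has raw coefficient 1.0 in the original units.
b_raw = raw_from_weighted([2.0], [0.5])
```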
If the purpose is to identify important predictors, use plots with t-values and p-values when
available.
Regression (t-values)
For more information look into the overview section.
Regression (p-values)
The p-value measures the probability that a parameter estimate should be as large as it is
if the real (theoretical, non-observable) value of that parameter were actually zero. Thus,
the p-value is used to assess the significance of observed variations: a small p-value means
that there is little risk of mistakenly concluding that the observed effect is real.
The usual limit used in the interpretation of a p-value is 0.05 (or 5%). If p-value < 0.05, the
observed effect is not due to random variations. Thus, the variable under study has a
significant effect.
The plot of the p-values presents the p-values for each coefficient included in the MLR.
Regression (p-values)
In the above plot “Protein” is significant below 5%. “Fat” and “Carbohydrates” show
significant effects below 20% and 10%, respectively. “Saturated fat” does not have a
significant effect.
The p-value is also called the “significance level”.
Leverage
Leverages are useful for detecting samples which are far from the center within the space
described by the model. Samples with high leverage differ from the average samples; in
other words, they are likely outliers. A large leverage also indicates a high influence on the
model. The figure below shows a situation where sample 5 is obviously very different from
the rest and may disturb the model.
One sample has a high leverage
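For the simplest case of a single predictor with an intercept, leverage has a closed form, h_i = 1/n + (x_i − mean)² / Sxx, which can be sketched as follows (illustrative only):

```python
def leverages(x):
    """Leverage of each sample in simple regression with an intercept:
    h_i = 1/n + (x_i - mean)^2 / Sxx. Samples far from the center get
    high leverage; the leverages sum to the number of model parameters."""
    n = len(x)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    return [1.0 / n + (xi - mx) ** 2 / sxx for xi in x]

# The last sample is far from the others and gets by far the highest leverage.
h = leverages([1.0, 2.0, 3.0, 4.0, 20.0])
```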
Response surface
This plot is used to find the settings of the X-variables which give an optimal response value
for the variable Y, and to study the general shape of the response surface fitted by the
Regression model.
It is necessary to specify which X-variables should be plotted; use the dialog box that
appears for this purpose.
Response Surface dialogue
This plot can appear in various layouts. The most relevant are:
Contour plot;
Landscape plot.
Analysis of variance
See the description in the Interpreting MLR plots section
Regression coefficients
See the description in the Interpreting MLR plots section
Residuals
General
Y-residuals vs. predicted Y
See the description in the Interpreting MLR plots section
Normal probability Y-residuals
This plot displays the cumulative distribution of the Y-residuals with a special scale, so that
normally distributed values should appear along a straight line. The plot shows all residuals
for one particular Y-variable (look for its name in the plot ID). There is one point per sample.
If the model explains the complete structure present in the data, the residuals should be
randomly distributed - and usually, normally distributed as well. So if all the residuals are
along a straight line, it means that the model explains everything that can be explained in
the variations of the variables that are being predicted.
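The idea behind the normal probability plot can be sketched by pairing the sorted residuals with theoretical normal quantiles; the plotting positions (i + 0.5)/n used here are one common convention (illustrative only):

```python
from statistics import NormalDist

def normal_probability_points(residuals):
    """Pair each sorted residual with its theoretical standard-normal
    quantile. If the pairs fall on a straight line, the residuals are
    consistent with a normal distribution."""
    n = len(residuals)
    nd = NormalDist()
    quantiles = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(quantiles, sorted(residuals)))

# Symmetric residuals give symmetric quantile/residual pairs.
points = normal_probability_points([2.0, -1.0, 0.0, 1.0, -2.0])
```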
If most of the residuals are normally distributed, and one or two stick out, these particular
samples are outliers. This is shown in the figure below. If there are outliers, mark them and
check the data.
Two outliers are sticking out
If the plot shows a strong deviation from a straight line, the residuals are not normally
distributed, as in the figure below. In some cases - but not always - this can indicate lack of
fit of the model. However it can also be an indication that the error terms are simply not
normally distributed.
The residuals have a regular but non-normal distribution
Influence plot
This plot displays the sample residual X-variances against leverages. It is most useful for
detecting outliers, influential samples and dangerous outliers.
Samples with high residual variance, i.e. lying to the top of the plot, are likely outliers.
Samples with high leverage, i.e. lying to the right of the plot, are influential; this means that
they somehow distort the model so that it describes them better. Influential samples are not
necessarily dangerous, if they obey the same model as more “average” samples.
A sample with both high residual variance and high leverage is a dangerous outlier: it is not
well described by a model which correctly describes most samples, and it distorts the model
so as to be better described, which means that the model then focuses on the difference
between that particular sample and the others, instead of describing more general features
common to all samples.
Three cases can be detected from the influence plot:
Leverages in designed data
For designed samples, the leverages should be interpreted differently depending on whether
the analysis is a regression (with the design variables as X-variables) or a PCA on the responses.
By construction, the leverage of each sample in the design is known, and these leverages are
optimal, i.e. all design samples have the same contribution to the model. So do not worry
about the leverages when running a regression: the design has taken care of it.
However, when running a PCA on the response variables, the leverage of each sample is now
determined with respect to the response values. Thus some samples may have high
leverages, either in an absolute or a relative sense. Such samples are either outliers, or just
samples with extreme values for some of the responses.
What to do with an influential sample?
The first thing to do is to understand why the sample has a high leverage (and, possibly, a
high residual variance). Investigate by looking at the raw data and checking them against the
original recordings.
There are two possible cases.
Case 1
There is an error in the data. Correct it; if the true value cannot be found and the
experiment cannot be re-done to get a more valid value, replace the erroneous
value with “missing”.
Case 2
There is no error, but the sample is different from the others. For instance, it has
extreme values for several of the variables. Check whether this sample is “of
interest” (e.g. it has the properties to be achieved, to a higher degree than the other
samples), or “not relevant” (e.g. it belongs to another population than the one
under study). In the former case, try to generate more samples of the same kind:
they are the most interesting ones! In the latter case (and only then), remove the
high-leverage sample from the model.
Variance per sample
This plot shows the residual (or explained) X-variance for all samples for the regression. The
plot is useful for detecting outlying samples, as shown below.
An outlying sample has high residual variance
Samples with small residual variance (or large explained variance) are well explained by the
regression model, and vice versa. In the above plot, four samples, such as B3, seem not to
be well explained by the model and may be outliers.
Variable residuals
This is a plot of residuals for the Y-variable for all the samples. The plot is useful for detecting
outlying sample or variable combinations, as shown in the figure below.
Line plot of the variable residuals
This plot gives information about all samples for a particular variable (as opposed to the
sample residual plot, which gives information about residuals for all variables for a
particular sample); hence it is more useful for studying how a specific variable behaves
across all the samples.
Sample residuals
This plot shows the residuals for a specified sample for the Y-variable. It is useful for
detecting outlying samples.
Go through the different samples to see if any sample has a residual that is too high in
comparison with the others. To do so, use the arrows or the drop-down list for sample
selection.
Sample Residual
Outliers
Influence plot
See the description in the above section
Y-residuals vs. predicted Y
See the description in the Interpreting MLR plots section
Leverage
See the description in the Interpreting MLR plots section
Response Surface
See the description in the Interpreting MLR plots section
14.6. Bibliography
C. R. Goodall, “Computation Using the QR Decomposition”, in Handbook of Statistics,
Vol. 9, Elsevier, Amsterdam, 1993.
D.C. Montgomery, E.A. Peck, and G.G. Vining, Introduction to Linear Regression Analysis,
Third Edition, Wiley-Interscience, New York, 2001.
S. Weisberg, Applied Linear Regression, Second Edition, Wiley, New York, 1985.
15. Principal Components Regression
15.1. Principal Component Regression
PCR is a method for relating the variations in a response variable (Y-variable) to the
variations of several predictors (X-variables), for explanatory or predictive purposes.
PCR is a two-step procedure which first decomposes an X-matrix by PCA, then fits an MLR
model, using the PC scores instead of the original X-variables as predictors.
Theory
Usage
Plot Interpretation
Method reference
Basics
Interpreting the results of a Principal Component Regression (PCR)
Scores and loadings
Regression coefficients
Predicted vs. reference plot
Error measures for PCR
Some more theory of PCR
PCR algorithm options
15.2.1 Basics
PCR is a two-step procedure which first decomposes an X-matrix by PCA, then fits an MLR
model, using the PC scores instead of the original X-variables as predictors.
This method performs particularly well when the various X-variables express common
information, i.e. when there is a large amount of correlation, or even collinearity.
Since the scores are orthogonal, the MLR solution is stable and therefore the PCR model
does not suffer from collinearity effects. It is the belief of some data analysis purists that PCR
is superior to PLS since it forces analysts to better understand their data and its
preprocessing (transformations) before the application of a regression procedure. The
procedure for performing PCR is shown graphically below.
PCR Procedure
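The two-step procedure can be sketched for the simplest case of a single component, using power iteration for the PCA step. This is an illustrative toy implementation, not the product's algorithm:

```python
import math

def pcr_one_component(x, y):
    """PCR sketch with one component: (1) PCA on centered X via power
    iteration, (2) MLR of centered y on the score vector, (3) map the
    score coefficient back to coefficients for the original X-variables."""
    n, p = len(x), len(x[0])
    # Step 0: center X and y
    means = [sum(row[j] for row in x) / n for j in range(p)]
    xc = [[row[j] - means[j] for j in range(p)] for row in x]
    my = sum(y) / n
    yc = [yi - my for yi in y]
    # Step 1: first principal component loading via power iteration on X'X
    v = [1.0] * p
    for _ in range(200):
        t = [sum(row[j] * v[j] for j in range(p)) for row in xc]        # t = X v
        w = [sum(t[i] * xc[i][j] for i in range(n)) for j in range(p)]  # w = X' t
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    scores = [sum(row[j] * v[j] for j in range(p)) for row in xc]
    # Step 2: regress centered y on the (single) score vector
    q = sum(t * yi for t, yi in zip(scores, yc)) / sum(t * t for t in scores)
    # Step 3: regression coefficients for the centered X-variables
    return [q * vj for vj in v]

# Perfectly collinear X-columns would break plain MLR, but PCR handles them.
b = pcr_one_component([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]],
                      [5.0, 10.0, 15.0, 20.0])
```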
The above situation is a result of the PCA decomposition not being guided by the Y-data (as
is the case of PLS). However, in most cases, PCR and PLS provide similar results, though PLS
usually converges in fewer factors than PCR. Most vendor spectroscopic devices only support
PLS regression in their software packages; this is the main reason why PCR is not as popular
as a spectroscopic regression tool. Read more about how sample and variable residuals, as
well as explained and residual variances, are computed in the chapter with theory about
PCA.
The next step is to regress Y on the first few scores using MLR, and then to calculate the
regression coefficients.
The SVD algorithm handles both ‘tall and thin’ data (many samples and relatively few
variables) and ‘short and fat’ data (a large number of variables and relatively few
samples). The algorithm does not handle missing values.
More information about the algorithms can be found in the method reference.
Some important tips and warnings associated with the Model Inputs tab
PCR is a multivariate regression analysis technique; in The Unscrambler® it requires a
minimum of three samples (rows) and two variables (columns) to be present in a data set
in order to complete the calculation. The following warnings are given when certain
analysis criteria are not met.
Not enough samples or variables present
Solution: Check that the data table (or selected row set) contains a minimum of 3 samples
and 2 variables.
Minimum 2 variables needed to perform analysis
Solution: Ensure that a minimum of 2 variables have been defined in a data set.
Number of X rows does not match number of Y rows
Solution: Ensure that the row set dimensions of X match the row set dimensions of Y.
Too many excluded samples/variables
Solution: Check that not all samples/variables have been excluded from the data set.
To keep track of row and column exclusions, the model inputs tab provides a warning to
users that exclusions have been defined. See automatic keep outs for more details.
Individual X- and Y-variables can be selected from the variable list table provided in this
dialog by holding down the control (Ctrl) key and selecting variables. Alternatively, the
variable numbers can be manually entered into the text dialog box. The Select button can be
used (which opens the Define Range dialog box), or one can simply click All to select every
variable in the table.
Once the variables have been selected, to weight them, use the options in the Change
Selected Variable(s) dialog box, under the Select tab. The options include:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows selected variables to be weighted by predefined constant values.
Downweight
This allows for the multiplication of selected variables by a very small number, such
that the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
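The A/(SDev + B) option above, for example, can be sketched as follows (an illustrative snippet; the function name is hypothetical):

```python
import statistics

def sdev_weight(column, a=1.0, b=0.0):
    """Standard-deviation weighting: the variable's weight is A / (SDev + B).
    With the defaults A = 1 and B = 0 this is plain 1/SDev weighting."""
    return a / (statistics.stdev(column) + b)

w = sdev_weight([1.0, 2.0, 3.0, 4.0])
# Multiplying the column by w gives it unit standard deviation.
```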
Use the Advanced tab in the X- and Y-Weights dialog to apply predetermined weights to
each variable. To use this option, set up a row in the data set containing the weights (or
create a separate row matrix in the project navigator). Select the Advanced tab in the
Weights dialog and select the matrix containing the weights from the drop-down list. Use
the Rows option to define the row containing the weights and click on Update to apply the
new weights. The dialog box for the Advanced option is provided below.
Another feature of the Advanced tab is the ability to use the results matrix of another
analysis as weights, using the Select Results Matrix button. This option provides
an internal project navigator for selecting the appropriate results matrix to use as a weight.
PCR Advanced Weights Option
Once the weighting and variables have been selected, click Update to apply them.
The differences between the algorithms are described in the Introduction to PCR. The
NIPALS algorithm is iterative and the maximum number of iterations can be tuned in the
Max. iterations box. The default value of 100 should be sufficient for most data sets,
however some large and noisy data may require more iterations to converge properly. The
maximum allowed number of iterations is 30,000.
When there are missing values in the data, the options are to impute them automatically
using the NIPALS algorithm, or as a pre-processing step using Fill Missing.
Note: If there are missing values in the data and SVD is selected, a warning will be
given as shown below.
Q-residual limits are by default approximated based on calculated model components only,
which works well in many cases. Calculation of exact Q-residual limits will be performed
when the check box is marked. Note that estimation of exact limits may be slow for large
data.
Pretreatments can also be registered from the PCR node in the project navigator. To register
the pretreatment, right click on the PCR analysis node and select Register Pretreatment.
This is shown below.
Registering a Pretreatment from the Project Navigator
The Autopretreatment dialog box will appear, where the desired pretreatments can be
selected.
Note: Some caution is required when data table dimensions are changed after the
first pretreatment. The Autopretreatment is applied to the same column indices as
the original transformation, so inserting new variables (columns) before or in
between the original data will result in autopretreatment of the wrong variables.
To be safe, always insert any new variables in the table before applying any
transformations, or make a habit of always appending rather than inserting new
columns.
Set this tab up based on a priori knowledge of the data set in order to return outlier
warnings in the PCR model. Settings for estimating the optimal number of components can
also be tuned here. The values shown in the dialog box above are default values and may
be used as a starting point for the analysis.
The warning limits in The Unscrambler® serve two major purposes: flagging possible
outliers and guiding the estimation of the optimal number of components.
The leverage and residual (outlier) limits are given as standard scores. This means that a
limit of e.g. 3.0 corresponds to a 99.7% probability that a value will lie within 3.0 standard
deviations of the mean of a normal distribution. The following limits can be specified:
Leverage Limit
(default 3.0) The ratio between the leverage for an individual sample and the
average leverage for the model.
Sample Outlier Limit, Calibration
(default 3.0) The square root of the ratio between the residual calibration variance
per sample (Sample Residuals) and the average residual calibration variance for the
model (Total Residuals).
Sample Outlier Limit, Validation
(default 3.0) The square root of the ratio between the residual validation variance
per sample (Sample Validation Residuals) and the total residual validation variance
for the model (Total Residuals).
Individual Value Outlier, Calibration
(default 3.0) For individual, absolute values in the calibration residual matrix
(Residuals), the ratio to the model average is computed (square root of the Variable
Residuals). For spectroscopic data this limit may be set to 5.0 to avoid many false
positive warnings due to the high number of variables.
Individual Value Outlier, Validation
(default 2.6) For individual, absolute values in the validation residual matrix
(Residuals), the ratio to the validation model average is computed (square root of
the Variable Validation Residuals). For spectroscopic data this limit may be set to 5.0
to avoid many false positive warnings due to the high number of variables.
Variable Outlier Limit, Calibration
(default 3.0) The square root of the ratio between the residual calibration variance
per variable (Variable Residuals) and the average residual calibration variance for
the model (Total Residuals).
Variable Outlier Limit, Validation
(default 3.0) The square root of the ratio between the residual validation variance
per variable (Variable Validation Residuals) and the total residual validation variance
for the model (Total Residuals).
Total Explained Variance (%)
(default 20) If the model explains less than 20% of the variance, the optimal number
of components is set to 0 (see the Info Box).
Ratio of Calibrated to Validated Residual Variance
(default 0.5) If the residual variance from the validation is much higher than that
from the calibration, a warning is given.
Ratio of Validated to Calibrated Residual Variance
(default 0.75) If the residual variance from the calibration is much higher than that
from the validation, a warning is given. This may occur in the case of test set
validation where the test samples do not span the same space as the training data.
Residual Variance Increase Limit (%)
(default 6) This limit is applied when selecting the optimal number of components and
is calculated from the residual variances of two consecutive components. If the
variance for the next component is less than x% lower than that of the previous
component, the default number of components is set to the previous one.
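One possible reading of the residual-variance rule above can be sketched as follows (illustrative only; the product's exact selection logic may differ):

```python
def optimal_components(residual_variances, limit_pct=6.0):
    """Pick the optimal number of components: keep adding components while
    each new one lowers the residual variance by at least limit_pct percent;
    stop at the first component that fails this test."""
    opt = 0
    for k in range(1, len(residual_variances)):
        prev, cur = residual_variances[k - 1], residual_variances[k]
        if cur < prev * (1.0 - limit_pct / 100.0):
            opt = k          # component k gives a worthwhile improvement
        else:
            break            # improvement too small: keep the previous count
    return opt

# Variance drops sharply for 2 components, then flattens out.
n = optimal_components([100.0, 50.0, 40.0, 39.5, 39.4])
```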
When all the settings are made, click OK.
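The sample outlier limits above compare each sample's residual variance to the model average as a standard score; a minimal sketch of that ratio (illustrative only):

```python
import math

def sample_outlier_ratios(sample_residual_variances):
    """Standard-score ratio between each sample's residual variance and the
    average residual variance for the model: sqrt(v_i / mean(v)). Samples
    whose ratio exceeds the outlier limit (default 3.0) are flagged."""
    avg = sum(sample_residual_variances) / len(sample_residual_variances)
    return [math.sqrt(v / avg) for v in sample_residual_variances]

# Nine well-fitted samples and one with a very large residual variance.
ratios = sample_outlier_ratios([1.0] * 9 + [100.0])
```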
Explained Y-Variance
Explained X-Variance
Predicted vs. Reference
Variances and RMSEP
Sample Outliers
Scores
Influence
Residual Sample X-Variance
Residual Sample Y-Variance
Scores and Loadings
Scores
Loadings
Important Variables
Regression coefficients
X-loadings
Regression coefficients
Regression and Prediction
Predicted vs. Reference
Regression coefficients
Residuals and influence
Influence Plot
Influence plot with Hotelling’s T² statistic
Influence plot with Leverage
Influence plot with F-residuals
Influence plot with Q-residuals
Explained sample variance or sample residuals
Leverage / Hotelling’s T²
Hotelling’s T² statistics
Leverage
Residuals
Q-residuals
F-residuals
Leverage / Hotelling’s T²
Residuals
Response Surface
Plots accessible from the PCR plot menu
PCR Overview
Variances and RMSEP
X- or Y- Variance
X- and Y- Variance
RMSE
Sample Outliers
Scores and Loadings
2 plots
4 plots
Bi-plot
Scores
Line
2-D Scatter
3-D Scatter
2 x 2-D Scatter
4 x 2-D Scatter
Loadings
Line
Loadings for the X-variables
Loadings for the Y-variable
2-D Scatter
3-D Scatter
Loadings for the X-variables
Loadings for the Y-variable
2 x 2-D Scatter
4 x 2-D Scatter
Important Variables
Regression Coefficients
Weighted coefficients (Bw)
Raw coefficients (B)
Residuals
Residuals and influence
General
Y-residuals vs. Predicted Y
Normal Probability Y-residuals
Y-residuals vs. Score
Influence Plot
Variance per sample
Variable residuals
Sample residuals
Sample and variable residuals
Outliers
Influence Plot
Y-residuals vs. Predicted Y
Patterns
Normal Probability Y-residuals
Y-residuals vs. Score
Leverage/Hotelling’s T²
Leverage
Line
Matrix
Hotelling’s T²
Line
Matrix
Response Surface
Scores
This is a two-dimensional scatter plot (or map) of scores for two specified components (PCs)
from PCR. The plot gives information about patterns in the samples. The scores plot for
(PC1,PC2) is especially useful, since these two components summarize more variation in the
data than any other pair of components.
The closer the samples are in the scores plot, the more similar they are with respect to the
two components concerned. Conversely, samples far away from each other are different
from each other. The plot can be used to interpret differences and similarities among
samples. Look at the scores plot together with the corresponding loadings plot, for the same
two components. This can help determine which variables are responsible for differences
between samples. For example, samples to the right of the scores plot will usually have a
large value for variables to the right of the loadings plot, and a small value for variables to
the left of the loadings plot.
Here are some things to look for in the 2-D scores plot.
Finding groups in a scores plot
Is there any indication of clustering in the set of samples? The figure below shows a
situation with four distinct clusters. Samples within a cluster are similar.
Detecting grouping in a scores plot
have been errors in data collection or transcription, or those samples may have to
be removed if they do not belong to the population of interest.
An outlier sticks out of the major group of samples
Furthermore, the display of the Hotelling’s T² ellipse for a model in two dimensions is
also a good way to detect outliers. To display it, click on the Hotelling’s T² ellipse
button.
Scores plot with Hotelling’s T² limit
In addition, the display of the stability plot can help in detecting outliers. This plot
represents the projection of the samples onto the submodels used for the validation,
whether they are part of the model or left out. Hence this plot is only available when
some type of cross-validation has been selected. It is available from the toolbar icon.
An outlier disturbs the model
In the above image, sample 143_1 is projected very differently for one particular
projection, which can be seen to deviate from all the others. The study of the samples
left out for this particular projection indicates that sample 143_1 is the source of this
variation. This sample is an outlier.
Calibration and Validation Scores
When the methods of cross validation and test set validation are used, The
Unscrambler® will by default display Calibration and Validation (Test) scores in the
same plot. Use this plot to determine whether the test set covers the entire span of
the calibration set, or whether any cross validation segments/samples differ from
the rest of the set.
X- and Y-Loadings
A 2-D scatter plot of X- and Y-loadings for two specified components from PCR is a good way
to detect important variables. The plot is most useful for interpreting component 1 vs.
component 2, since they represent the largest variations in the X-data. By default both Y-
and X-variables are displayed but it is possible to modify this by clicking on the X and Y icons.
X- and Y-Loadings of sensory variables (X) and the mean preference (Y) along (PC1,PC2)
The plot shows the importance of the different variables for the two components specified.
It is possible to change the displayed components by using the arrows or the PC drop-down list.
The loadings plot should preferably be used together with the corresponding scores plot.
Variables with loadings to the right in the loadings plot will be X-variables which usually have
high values for samples to the right in the scores plot, etc. This plot can be used to study the
relationships among the X-variables and between the X-variables and the Y-variable.
If the Uncertainty test was activated, the important variables will be circled. It is also possible to mark them by using the toolbar icon.
Loadings plot with circled important variables
When working with discrete variables, line loadings plots can also be used to represent data.
The Ascending and Descending buttons can be used to order the loadings in
terms of the variables with highest (or lowest) contribution to the PC.
Line plot of loadings in ascending order of importance to PC1
More on line loadings plots can be found in a later section of this document.
Correlation Loadings Emphasize Variable Correlations
When a PCR analysis has been performed and a two-dimensional plot of loadings is
displayed on the screen, the correlation loadings option (available from the View menu and
the toolbar icon) can be used to aid in visualizing the structure in the data. Correlation loadings
are computed for each variable for the displayed Principal Components (factors). In addition,
the plot contains two ellipses to help check how much variance is taken into account. The
outer ellipse is the unit-circle and indicates 100% explained variance. The inner ellipse
indicates 50% of explained variance. The importance of individual variables is visualized
more clearly in the correlation loadings plot compared to the standard loadings plot.
Correlation Loadings of sensory variables (X) and the mean preference (Y) along (PC1,PC2)
Variables close to each other in the loadings plot will have a high positive correlation if the
two components explain a large portion of the variance of X. The same is true for variables in
the same quadrant lying close to a straight line through the origin. Variables in diagonally
opposed quadrants will have a tendency to be negatively correlated. For example, in the
figure above, variables Acidity and Bitterness have a high positive correlation on PC1, and
they are negatively correlated to variable Odor banana. Variables Color intensity and Odor
orange have independent variations. Variables Mean preference and Bitterness are
negatively correlated.
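A hedged sketch of how correlation loadings could be computed (synthetic data; the variable names are illustrative, not from the software): each entry is the correlation between one original variable and the score vector of one component, so every variable falls inside the unit circle, and its squared distance from the origin over the two plotted PCs equals the variance of that variable explained by them.

```python
# Correlation loadings sketch (illustrative, synthetic data)
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s

# Correlation of each variable with each component's score vector
corr_load = np.array(
    [[np.corrcoef(Xc[:, j], scores[:, a])[0, 1] for a in range(scores.shape[1])]
     for j in range(Xc.shape[1])]
)

# Squared radius over (PC1, PC2): the explained variance per variable
# (inner ellipse of the plot corresponds to 0.5, outer circle to 1.0)
r2_pc12 = (corr_load[:, :2] ** 2).sum(axis=1)
```

Because the score vectors are orthogonal and span the variable space, the squared correlations of a variable across all components sum to one, which is why the unit circle bounds the plot.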
Note: Variables lying close to the center are poorly explained by the plotted PCs.
They cannot be interpreted in that plot!
Correlation loadings are also available for 1D line loading plots. When a line plot is
generated, the 1D correlation loadings toolbar icon is displayed.
These are especially useful when interpreting important wavelengths in the analysis of
spectroscopic data, or contributing variables in time series data. An example is shown below.
Correlation Line Loadings of Spectroscopic variables in PC1
Values that lie between the inner and outer bounds of the plot are well modelled by that PC; those that lie within the inner bounds are not.
Explained Variance
There are two explained variance curves to look at in PCR: explained X- and Y-variance. It is
possible to change from one to the other by using the toolbar icon.
Explained Y-Variance
This plot illustrates how much of the variation in the response is described by each different
component. Total residual variance is computed as the sum of squared residuals of the Y-variable,
divided by the number of degrees of freedom.
Total explained variance is then computed as:
Total explained variance (%) = 100 × (Total variance − Residual variance) / Total variance
Compare the two variances: if they differ significantly, there is good reason to question
whether either the calibration data or the test data are truly representative. The figure
below shows a situation where the residual validation variance is much larger than the
residual calibration variance (or the explained validation variance is much smaller than the
explained calibration variance). This means that although the calibration data are well fitted
(small residual calibration variances), the model does not describe new data well (large
residual validation variance).
By contrast, if the two residual variance curves are close together, the model is
representative (figure below).
Total residual variance curves and Total explained variance curves
Outliers can sometimes cause large residual variance (or small explained variance). They can
also cause a drop in the explained validation variance, as can be seen in the plot
below.
Explained X-Variance
This plot gives an indication of how much of the variation in the predictor (X) variables is
described by the different components. The total X-variance is computed in the same way as
the Y-variance; see the above description for more information.
In PCR, since the PCs are computed taking only the X-variance into account, it may be
necessary to include more PCs to explain most of the variance in Y.
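This behaviour can be sketched with synthetic data (illustrative only, not the software's internal code): the PCs are extracted from X alone, and the explained Y-variance is then computed for each number of components by regressing the centered response on the scores.

```python
# Explained Y-variance vs. number of PCR components (synthetic sketch)
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 6))
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=40)

Xc, yc = X - X.mean(axis=0), y - y.mean()
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * s                                   # PCR scores, computed from X only

expl_y = []
for a in range(1, T.shape[1] + 1):
    q = T[:, :a].T @ yc / (s[:a] ** 2)      # LS fit on the orthogonal scores
    resid = yc - T[:, :a] @ q
    expl_y.append(100.0 * (1.0 - resid @ resid / (yc @ yc)))
```

Because the components are chosen to describe X-variance, the curve may climb slowly for the first PCs, but it can never decrease as further components are added.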
Some statistics are available that give an idea of the quality of the regression.
Note: If there are large differences between the calibration and validation results,
the model cannot be trusted.
To determine the quality of the fit, the following statistics are available,
Slope
The closer the slope is to 1, the better the data are modelled.
Offset
This is the intercept of the regression line with the Y-axis, i.e. its value where the
X-axis is zero. (Note: It is not a necessity that this value is zero!)
RMSE
The first one (in blue) is the Calibration error RMSEC, the second one (in red) is the
expected Prediction error, depending on the validation method used. Both are
expressed in the same unit as the response variable Y.
R-squared
The first one (in blue) is the calibration R-Squared value taken from the calibration
Explained Variance plot for the number of components in the model, the second
one (in red) is also calculated from the Explained Variance plot, this time for the
validation set. It tells how good a fit can be expected for future predictions for a
defined number of components.
Note: RMSE and R-Squared values are highly dependent on the validation method
used and the number of components in a model. It is important not to use too
many components and overfit the model.
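The statistics above can be sketched as follows (the function name and data values are illustrative, not taken from the software):

```python
# Fit statistics for a predicted vs. reference plot (illustrative sketch)
import numpy as np

def fit_statistics(y_ref, y_pred):
    resid = y_ref - y_pred
    rmse = np.sqrt(np.mean(resid ** 2))                 # RMSEC / RMSEP
    ss_tot = (y_ref - y_ref.mean()) @ (y_ref - y_ref.mean())
    r2 = 1.0 - resid @ resid / ss_tot                   # R-squared
    slope, offset = np.polyfit(y_ref, y_pred, 1)        # regression line
    return rmse, r2, slope, offset

y_ref = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])
rmse, r2, slope, offset = fit_statistics(y_ref, y_pred)
```

For a good model the slope approaches 1, the offset approaches 0, R-squared approaches 1, and the RMSE is small relative to the range of Y.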
When the statistics buttons are toggled, more detailed statistics are displayed. The Calibration plot is
shown below with statistics:
Predicted vs. Reference plot for PCR Calibration samples
SECV
Standard Error of Cross Validation. This is the RMSECV corrected for bias.
RMSEP
Root Mean Square Error of Prediction. This is a measure of the dispersion of the
validation samples around the regression line when Test Set validation is used.
SEP
Standard Error of Prediction. This is the RMSEP corrected for bias
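The relation between RMSEP, bias and SEP can be written out numerically. This is a sketch using the definitions as commonly stated in chemometrics (divisor conventions can vary between texts):

```python
# RMSEP, bias and SEP for a set of prediction errors (illustrative values)
import numpy as np

e = np.array([0.3, 0.5, 0.2, 0.4, 0.6])     # errors: predicted minus reference
n = e.size
rmsep = np.sqrt(np.mean(e ** 2))            # overall prediction error
bias = e.mean()                             # systematic part of the error
sep = np.sqrt(np.sum((e - bias) ** 2) / (n - 1))   # bias-corrected spread
# Identity: RMSEP^2 = bias^2 + SEP^2 * (n - 1) / n
```

The identity in the last comment is what "corrected for bias" means: SEP measures the spread of the errors around their mean, while RMSEP also includes the systematic offset.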
When Leverage Correction is used to first check the model, the errors become estimation
errors. For more details on the definitions, see the section on Multiple Linear Regression
(Interpreting MLR plots).
How to detect cases of good fit / poor fit
The figures below show two different situations: one indicating a good fit, the other
a poor fit of the model.
Predicted vs. Reference shows how well the model fits
In the above plot, sample 3 is not following the regression line whereas all the other
samples do. Sample 3 may be an outlier.
How to detect nonlinearity
In other cases, there may be a nonlinear relationship between the X- and Y-
variables, so that the predictions do not have the same level of accuracy over the
whole range of variation of Y. In such cases, the plot may look like the one shown
below. Such nonlinearities should be corrected if possible (for instance by a suitable
transformation), because otherwise there will be a systematic bias in the predictions
depending on the range of the sample.
Predicted vs. Reference shows a nonlinear relationship
If some variables have much larger residual variance than all the other variables for all
components in the model (or for the first 3-4 of them), try rebuilding the model with these
variables deleted. This may produce a model that is easier to interpret.
Sample Outliers
Scores
See the description in the Interpreting PCR plots section
Influence
This plot shows the Q- or F-residuals vs. Leverage or Hotelling’s T² statistics. These represent
two different kinds of outliers. The residual statistics on the ordinate axis describe the
sample distance to model, whereas the Leverage and Hotelling’s T² describe how well the
sample is described by the model.
Samples with high residual variance, i.e. lying in the upper regions of the plot, are poorly
described by the model. Including additional components may result in these samples being
described better, however caution is required that the additional components are predictive
and not modelling noise. As long as the samples with high residual variance are not
influential (see below), keeping them in the model may not be a problem as such (the high
residual variance may be due to non-important regions of a spectrum, for instance).
Samples with high leverage, i.e. lying to the right of the plot, are well described by the
model. They are well described in the sense that the sample scores may have very high or
low values for some components compared to the rest of the samples. Such samples are
dangerous in the calibration phase because they are influential to the model. A sufficiently
extreme sample may by itself span an entire component, in which case the model will
become unreliable. Removal of a highly influential sample from the model will make the
model look entirely different and the axes will span different phenomena altogether. If the
variance described by the sample is important but unique, one should try to obtain more
samples of the same type to stabilize the model. Otherwise the sample should be discarded
as an outlier.
Note that a sample with both high residual variance and high leverage is the most dangerous
outlier. Not only is it poorly described by the model but it is also influential. Samples such as
these may span up to several components single-handedly. Because they also disagree with
the majority of the other calibration samples, the ability of the model to describe new
samples is likely poor.
The Q- and F-residuals are two different methods for testing the same thing. The F-residuals
are available for both calibration and validation, in contrast to the Q-residuals, which are
available for calibration only. The validated residuals reflect the scheme chosen in the
validation and are a more conservative assessment of residual outliers. If the residual variance
from validation is much higher than for calibration, one should investigate the residuals in
more detail.
The difference between Leverage and Hotelling’s T² is only a scaling factor. The critical limit
for Leverage is based on an ad-hoc rule, whereas the Hotelling’s T² critical limit is based on
the assumption of a Student’s t-distribution.
Calibration and validation samples can be displayed in the influence plot by toggling
between them using the toolbar buttons. The toggle is available for F-residuals if the
validation method chosen was cross validation or test set validation.
High residuals indicate an outlier. Incorporating more components can sometimes model
outliers; avoid doing so since it will reduce the prediction ability of the model.
Small residual variance (or large explained variance) indicates that, for a particular number
of components, the samples are well explained by the model. Therefore a sample with a
high Y-residual may be an outlier.
Scores and Loadings
This overview shows two plots: the score and loadings plots.
Scores
See the description in the Interpreting PCR plots section
Loadings
See the description in the Interpreting PCR plots section
Important Variables
Regression coefficients
If the X-variables were weighted, this plot presents the weighted regression coefficients;
otherwise the B-coefficients and the weighted (Bw) coefficients coincide. The number of PCs
is shown and can be changed using the arrows.
In general, this plot shows the weighted regression coefficients for the response or Y-
variable.
Regression coefficients summarize the relationship between all predictors and the response.
For PCR, the regression coefficients can be computed for any number of components or
factors. The regression coefficients for 3 factors, for example, summarize the relationship
between the predictors and the response, as a model with 3 components approximates it.
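A hedged numerical sketch of those regression coefficients (synthetic data, not The Unscrambler's code): the response is regressed on the scores of the first A components, and the result is mapped back to coefficients on the original X-variables.

```python
# PCR regression coefficients for an A-component model (synthetic sketch)
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + 0.05 * rng.normal(size=25)

xm, ym = X.mean(axis=0), y.mean()
Xc, yc = X - xm, y - ym
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

A = 3
T, P = (U * s)[:, :A], Vt.T[:, :A]      # scores and loadings, A components
q = T.T @ yc / (s[:A] ** 2)             # regress y on the orthogonal scores
b = P @ q                               # coefficients on the original X-variables
b0 = ym - xm @ b                        # intercept B0
y_hat = X @ b + b0                      # predictions of the A-component model
```

Changing A changes b: each value of A gives the coefficient vector of the model truncated at that many components, which is why the plot lets the number of PCs be adjusted.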
The weighted regression coefficients (Bw) inform about the importance of the X-variables.
X-variables with a large regression coefficient play an important role in the regression
model; a positive coefficient shows a positive link with the response, and a negative
coefficient shows a negative link. Predictors with a small coefficient are negligible. Mark
them and recalculate the model without those variables. The constant value B0W is
indicated at the bottom of the plot, in the Plot ID field (use View - Plot ID).
Weighted regression coefficients for 3 factors (or PCs)
The plot shows that variables 0, 3 and 4 are contributing the most to the model.
Note: The weighted coefficients (Bw) and raw coefficients (B) are identical if no
weights were applied on the variables.
If the predictor variables have been weighted with 1/SDev (standardization), the weighted
regression coefficients (Bw) take these weights into account. Since all predictors are brought
back to the same scale, the coefficients show the relative importance of the X-variables in
the model.
X-loadings
This is a plot of X-loadings for all the components vs. variable number. It is useful for
detecting important variables. If a variable has a large positive or negative loading, this
means that the variable is important for the component concerned. For example, a sample
with a large score value for this component will have a large positive value for a variable
with large positive loading.
If a variable has the same sign for all the important components, it is most likely to be an
important variable.
Regression coefficients
For more information see the previous section.
Regression and Prediction
Regression coefficients
See the description in the above section
Residuals and influence
Influence Plot
This plot shows the Q- or F-residuals vs. Leverage or Hotelling’s T². See the general
description about the influence plot in the overview section for more details.
The toggle buttons in the toolbar can be used to switch between the various combinations.
Leverage / Hotelling’s T²
The lower left pane of the Residuals and Influence overview displays a line plot of the
Hotelling’s T² by default. A toolbar toggle can be used to switch between the
Hotelling’s T² and Leverage views.
Hotelling’s T² statistics
The plot displays the Hotelling’s T² statistic for each sample as a line plot. The associated
critical limit (with a default p-value of 5%) is displayed as a red line.
Hotelling’s T² plot
The Hotelling’s T² statistic has a linear relationship to the leverage for a given sample. Its
critical limit is based on an F-test. Use it to identify outliers or detect situations where a
process is operating outside normal conditions. There are 6 different significance levels to
choose from using the drop-down list.
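The statistic and its limit can be sketched as follows, using the standard formulas with scipy's F-quantile (the score values are synthetic, and names are illustrative):

```python
# Hotelling's T² per sample and its F-based critical limit (synthetic scores)
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(4)
T = rng.normal(size=(30, 2))              # scores for A = 2 components
N, A = T.shape                            # (real PCA scores are centered;
s2 = (T ** 2).sum(axis=0) / (N - 1)       #  these synthetic ones roughly are)
t2 = ((T ** 2) / s2).sum(axis=1)          # T² statistic for each sample

alpha = 0.05                              # significance level (default 5%)
t2_crit = A * (N - 1) / (N - A) * f.ppf(1 - alpha, A, N - A)
outliers = np.where(t2 > t2_crit)[0]      # samples above the red limit line
```

Lowering alpha raises the critical line, flagging fewer samples; this corresponds to choosing among the significance levels in the drop-down list.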
The number of factors (or PCs) may be tuned up or down with the arrow tools.
Leverage
Leverages are useful for detecting samples which are far from the center within the space
described by the model. Samples with high leverage differ from the average samples; in
other words, they are likely outliers. A large leverage also indicates a high influence on the
model. The figure below shows a situation where sample 5 is obviously very different from
the rest and may disturb the model.
One sample has a high leverage
There is an ad-hoc critical limit for Leverage (one not depending on any assumptions
about distribution), computed from the number of components and the number of
calibration samples.
The leverage values are always larger than zero, and can go up to 1 for samples in the
calibration set. As a rule of thumb, samples with a leverage above 0.4 - 0.5 start to be a concern.
Influence on the model is best measured in terms of relative leverage. For instance, if all
samples have leverages between 0.02 and 0.1, except for one, which has a leverage of 0.3,
although this value is not extremely large, the sample is likely to be influential.
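Leverage computation can be sketched from the score matrix (synthetic data; the formula below, with the 1/N term for the centered model, is the usual one, under which the leverages sum to A + 1):

```python
# Sample leverages from PCR scores (synthetic sketch)
# h_i = 1/N + sum over components a of t_ia^2 / (t_a . t_a)
import numpy as np

rng = np.random.default_rng(5)
T = rng.normal(size=(20, 3))            # scores for A = 3 components
T = T - T.mean(axis=0)                  # scores are centered in PCA/PCR
N, A = T.shape
h = 1.0 / N + ((T ** 2) / (T ** 2).sum(axis=0)).sum(axis=1)
```

As the surrounding text notes, the relative comparison is what matters: one sample whose h stands well above the rest is influential even if its absolute value is moderate.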
What Should Be Done with a High-Leverage Sample? The first thing to do is to understand
why the sample has a high leverage. Investigate by looking at the raw data and checking
them against the original recordings. Once an explanation has been found, there are two
following cases:
Case 1
There is an error in the data. Correct it; or, if the true value cannot be found and the
experiment cannot be redone to give a more valid value, the erroneous value may
be replaced with “missing”.
Case 2
There is no error, but the sample is different from the others. For instance, it has
extreme values for several of the variables. Check whether this sample is “of
interest” (e.g. it has the properties of interest, to a higher degree than the other
samples), or “not relevant” (e.g. it belongs to another population than that being
studied). In the former case, one should try to generate more samples of the same
kind: they are the most interesting ones! In the latter case (and only then), the high-
leverage sample may be removed from the model.
Residuals
The lower right pane of the Residuals and Influence overview displays a line plot of the
sample residual statistics. A toolbar toggle can be used to switch between the Q- and
F-residuals views.
Q-residuals
This plot shows the sample Q-residuals as a line plot with associated limits.
Q-residual sample variance
F-residuals
This plot shows the sample F-residuals as a line plot with associated limits.
Note that the F-residuals are available for both calibration and validation. If the residual X-
variance from validation is much higher than for calibration, one should investigate the
residuals in more detail. The validated residuals reflect the scheme chosen in the validation
and are a more conservative assessment of residual outliers.
Leverage / Hotelling’s T²
See the description in the overview section.
Residuals
See the description in the overview section.
Response Surface
This plot is used to find the settings of the X-variables which give an optimal response value
for the variable Y, and to study the general shape of the response surface fitted by the
Regression model.
It is necessary to specify which X-variables should be plotted, as well as the number of
components; use the dialogue box that appears for this purpose.
Response Surface dialogue
This plot can appear in various layouts. The most relevant are:
Contour plot.
Landscape plot.
Interpretation: Contour Plot
This plot gives a map for locating the region where the experimental goal is met. The plot has two
axes: two predictor variables are studied over their range of variation; the remaining
ones are kept constant. The constant levels are indicated in the Plot ID at the
bottom. The response values are displayed as contour lines, i.e. lines that show
where the response variable has the same predicted value. Clicking on a line, or on
any spot within the map, will display the predicted response value for that point,
and the coordinates of the point (i.e. the settings of the two predictor variables
giving that particular response value).
Interpretation: Landscape Plot
Look at this plot to study the 3-D shape of the response surface. Here it is obvious
whether there is a maximum, a minimum or a saddle point. This plot, however, does
not show precisely how the optimum can be achieved.
Response surface plot, with Landscape layout
X- or Y- Variance
One-frame plot where it is possible to display either the Explained X- or Y-Variance, with
Calibration and/or Validation curves. See the description in the Interpreting PCR plots section.
X- and Y- Variance
A two-frame plot with the Explained X-Variance plot on top and the Explained Y-Variance
below, both with Calibration and Validation variances. See the description in the
Interpreting PCR plots section.
RMSE
Root Mean Square Error for the Y-variables. This plot gives the square root of the residual
variance for individual responses, back-transformed into the same units as the original
response values. This is called: RMSEC (Root Mean Square Error of Calibration) when plotting
Calibration results; RMSEP (Root Mean Square Error of Prediction) when plotting Validation
results.
RMSE Line Plot
The RMSE is plotted as a function of the number of components in the model. There is one
curve per response (or two if Cal and Val together are selected). The optimal number of
components can be determined by looking at where the Val curve (i.e. RMSEP) reaches a
minimum.
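Picking the optimum from the curve can be as simple as locating the minimum of the validation RMSE (the values below are illustrative, not produced by the software):

```python
# Optimal number of components from an RMSEP curve (illustrative values)
import numpy as np

rmsep = np.array([1.20, 0.80, 0.55, 0.50, 0.52, 0.60])  # for 1..6 components
optimal = int(np.argmin(rmsep)) + 1                     # components are 1-based
# In practice a more parsimonious model (fewer components) is often preferred
# when the curve is nearly flat around the minimum.
```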
Sample Outliers
See the description in the Interpreting PCR plots section
Scores and Loadings
2 plots
See the description in the Interpreting PCR plots section
4 plots
When displaying 4 plots, the screen shows two paired plots of scores and loadings, one
displaying PC1-PC2 and the other PC3-PC4.
Bi-plot
The plot can be used to interpret sample properties. Look for variables projected far away
from the center. Samples lying in an extreme position in the same direction as a given
variable have large values for that variable; samples lying in the opposite direction have low
values. For instance, in the figure below, samples 6, 7 and 8 are the most colour intense,
while samples 2,3,4 and 12 are most likely to have the highest banana odor (and probably
lowest acidity). C3_H3 has high Raspberry taste, and is rather colorful. C1_H1, C2_H1 and
C3_H1 are thick, and have little color. The samples cannot be compared with respect to the
variables close to the center of the bi-plot.
Bi-plot for 12 jam samples and 12 sensory properties (X-variables)
Scores
Line
This is a plot of score values vs. sample number for a specified component. Although it is
usually better to look at 2-D or 3-D scores plots because they contain more information, this
plot can be useful whenever the samples are sorted according to the values of an underlying
variable, e.g. time, to detect trends or patterns (see figure below). Also look for systematic
patterns, like a regular increase or decrease, periodicity, etc. (only relevant if the sample
number has a meaning, like time for instance).
Trend in a Scores plot
The smaller the vertical variation (i.e. the closer the score values are to each other), the
more similar the samples are for this particular component. Look for samples that have a
very large positive or negative score value compared to the others: these may be outliers.
2-D Scatter
See the description in the Interpreting PCR plots section
3-D Scatter
This is a 3-D scatter plot or map of the scores for three specified components from PCR. The
plot gives information about patterns in the samples and is most useful when interpreting
components 1, 2 and 3, since these components summarize most of the variation in the
data. It is usually easier to look at 2-D scores plots but if three components are needed to
describe enough variation in the data, the 3-D plot is a practical alternative. The same
analysis as with a 2-D scatter plot should be done. See the description in the Interpreting
PCR plots section
2 x 2-D Scatter
The visualization window is divided into two frames. The top one shows the scatter plot of
the scores of the samples along PC1 and PC2. The bottom plot shows the scatter plot of the
scores along PC3 and PC4.
4 x 2-D Scatter
The visualization window is divided into four frames. The top left one shows the scatter plot
of the scores of the samples along PC1 and PC2. To its right is displayed the scores plot in the
PC3-PC4 plane. The bottom left plot shows the scatter plot of the scores along PC5 and PC6.
To its right is displayed the scatter plot of the scores of the samples for PC7 and PC8.
Loadings
Line
Loadings for the X-variables
This is a plot of X-loadings for a specified component vs. variable number. It is useful for
detecting important variables. In many cases it is better to look at two- or three-vector
loadings plots instead because they contain more information. Line plots are most useful for
multichannel measurements, for instance spectra from a spectrophotometer, or in any case
where the variables are implicit functions of an underlying parameter, like wavelength, time,
etc. The plot shows the relationship between the specified component and the different X-
variables. If a variable has a large positive or negative loading, this means that the variable is
important for the component concerned; see the figure below. For example, a sample with a
large score value for this component will have a large positive value for a variable with large
positive loading.
Spectral data can default to using line plots for the loadings plot. To set this, right-click on the
given range in the project navigator, and tick the Spectra option.
Line plot of the X-loadings, important variables in a spectra
Variables with large loadings in early components are the ones that vary most. This means
that these variables are responsible for the greatest differences between the samples.
Loadings for the Y-variable
This is a plot of Y-loading for a specified component vs. variable number. It is usually better
to look at 2-D or 3-D loadings plots instead because they contain more information.
However, if there is reason to study the X-loadings as line plots, then one should also display
the Y-loadings as line plots in order to make interpretation easier. The plot shows the
relationship between the specified component and the Y-variable. If a variable has a high
positive or negative loading, this means that the variable is well explained by the
component. A sample with a large score for the specified component will have a high value
for all variables with large positive loadings.
A Y-variable with large loadings in early components is easily modeled as a function of the X-
variables.
2-D Scatter
See the description in the Interpreting PCR plots section
3-D Scatter
This plot can present either the X-loadings, the Y-loadings or both. To select or deselect one
of them, click on the corresponding toolbar icon.
Loadings for the X-variables
This is a three-dimensional scatter plot of X-loadings for three specified components from
PCR. The plot is most useful for interpreting directions, in connection to a 3-D scores plot.
Otherwise it is recommended that one use line or 2-D loadings plots.
Loadings for the Y-variable
This is a three-dimensional scatter plot of Y-loadings for three specified components from
PCR. As there is only one Y-variable in PCR, this plot is most useful for interpreting directions,
in connection to a 3-D scores plot and together with the X-loadings. Otherwise it is
recommended that one use line or 2-D loadings plots.
Read more about loadings and the different displays and information in the Interpreting PCR
plots section.
2 x 2-D Scatter
The visualization window is divided into two frames. The top one shows the scatter plot of
the loadings of the variables along PC1 and PC2. The bottom plot shows the scatter plot of
loadings of the variables along PC3 and PC4.
4 x 2-D Scatter
The visualization window is divided into four frames. The top left one shows the scatter plot
of the loadings of the variables along PC1 and PC2. To its right is displayed the loadings plot
in the PC3-PC4 plane. The bottom left plot shows the scatter plot of the loadings of the
variables along PC5 and PC6. To its right is displayed the scatter plot of the loadings of the
variables for PC7 and PC8.
Important Variables
See the description in the Interpreting PCR plots section
Regression Coefficients
The above plot shows the regression coefficients for the response variable (Y), and for a
model with a particular number of components (3). Each predictor variable (X) defines one
point of the line (or one bar of the plot). It is recommended to configure the layout of this
plot as bars. Variables 1 and 4 have the highest B coefficients.
Note: The weighted coefficients (Bw) and raw coefficients (B) are identical if no
weights were applied on the variables.
The raw coefficients are those that may be used to write the model equation in original
units:
y = b0 + b1x1 + b2x2 + … + bKxK
Since the predictors are kept in their original scales, the coefficients do not reflect the
relative importance of the X-variables in the model. If no weights have been applied to the
X-variables, displaying the Uncertainty Limits may be informative. This option is available if Cross-
Validation and the Uncertainty Test option were selected in the Regression dialog.
Use View – Uncertainty Limit from the menu to toggle this indication on or off.
Residuals
General
Y-residuals vs. Predicted Y
This is a plot of Y-residuals against predicted Y values. If the model adequately predicts
variations in Y, any residual variations should be due to noise only, which means that the
residuals should be randomly distributed. If this is not the case, the model is not completely
satisfactory, and appropriate action should be taken. If strong systematic structures (e.g.
curved patterns) are observed, this can be an indication of lack of fit of the regression
model. The figure below shows a situation that strongly indicates lack of fit of the model.
This may be corrected by transforming the Y variable.
Structure in the residuals: a transformation of the y variable is recommended
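A simple numeric complement to the visual check is to fit a low-order polynomial to the residuals against the predicted values: a sizeable quadratic term suggests curvature. This is an ad-hoc sketch with synthetic data and an arbitrary threshold, not a method from the software:

```python
# Detecting curvature in Y-residuals vs. predicted Y (ad-hoc sketch)
import numpy as np

y_pred = np.linspace(0.0, 10.0, 50)
resid = 0.05 * (y_pred - 5.0) ** 2 - 0.4     # synthetic curved residuals
c2, c1, c0 = np.polyfit(y_pred, resid, 2)    # quadratic fit to the residuals
span = y_pred.max() - y_pred.min()
curved = abs(c2) * span ** 2 > 4.0 * resid.std()   # arbitrary threshold
```

Randomly scattered residuals would give a quadratic coefficient near zero; a clearly curved pattern, as in the figure, gives a coefficient large relative to the residual spread.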
The presence of an outlier is shown in the example below. The outlying sample (18) has a
much larger residual than the others; however, it does not seem to disturb the model to a
large extent.
A simple outlier has a large residual
The figure below shows the case of an influential outlier: not only does it have a large
residual, it also attracts the whole model so that the remaining residuals show a very clear
trend. Such samples should usually be excluded from the analysis, unless there is an error in
the data or some data transformation can correct for the phenomenon.
An influential outlier changes the structure of the residuals
The Unscrambler X Main
Small residuals (compared to the variance of Y) which are randomly distributed indicate
adequate models.
Normal Probability Y-residuals
This plot displays the cumulative distribution of the Y-residuals with a special scale, so that
normally distributed values should appear along a straight line. The plot shows all residuals
for one particular Y-variable (look for its name in the plot ID). There is one point per sample.
If the model explains the complete structure present in the data, the residuals should be
randomly distributed - and usually, normally distributed as well. So if all the residuals are
along a straight line, it means that the model explains everything that can be explained in
the variations of the variables to be predicted. If most of the residuals are normally
distributed, and one or two stick out, these particular samples are outliers. This is shown in
the figure below. If there are outliers, mark them and check the data.
Two outliers are sticking out
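The coordinates behind such a normal-probability plot can be sketched in Python with the standard library. The residual values below are purely illustrative, and the (i + 0.5)/n plotting-position rule is one common convention, not necessarily the one used by the software:

```python
from statistics import NormalDist

# Hypothetical Y-residuals for 8 calibration samples; the last one sticks out.
residuals = [-0.42, -0.18, -0.05, 0.02, 0.11, 0.19, 0.33, 1.90]

n = len(residuals)
ordered = sorted(residuals)

# Pair the i-th sorted residual with the standard-normal quantile of
# (i + 0.5) / n (a common plotting-position convention).
theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# If the residuals are normally distributed, the points (theoretical[i],
# ordered[i]) fall close to a straight line; isolated points far off the
# line at either end are candidate outliers.
pairs = list(zip(theoretical, ordered))
```

Plotting `pairs` reproduces the special scale described above: the theoretical quantiles form the straightened axis.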
If the plot shows a strong deviation from a straight line, the residuals are not normally
distributed, as in the figure below. In some cases - but not always - this can indicate lack of
fit of the model. However it can also be an indication that the error terms are simply not
normally distributed.
The residuals have a regular but non-normal distribution
Influence Plot
See the description in the Interpreting PCR plots section
Samples with small residual variance (or large explained variance) for a particular
component are well explained by the corresponding model, and vice versa. In the above plot,
four samples seem to be poorly explained by the model and may be outliers, such as sample B3.
Variable residuals
This is a plot of residuals for a specified X-variable and component number for all the
samples. The plot is useful for detecting outlying sample/variable combinations, as shown
below. Although outliers can sometimes be modeled by incorporating more components, this
should be avoided since it will reduce the prediction ability of the model.
Line plot of the variable residuals
Whereas the sample residual plot gives information about residuals for all variables for a
particular sample, this plot gives information about all possible samples for a particular
variable. It is therefore more useful when investigating how one specific variable behaves in
all the samples.
Sample residuals
This is a plot of the residuals for a specified sample and component number for all the X-
variables. It is useful for detecting outlying sample or variable combinations. Although
outliers can sometimes be modeled by incorporating more components, this should be
avoided since it will reduce the prediction ability of the model.
Line plot of the sample residuals: one variable is an outlier
In the above plot, variable 1 (Adhesiveness at 1 day) for a particular sample is not very
well described by a model with a certain number of components (here 4). If this is the case
for most of the samples, this variable may be noisy and can be considered an outlier.
In contrast to the variable residual plot, which gives information about residuals for all
samples for a particular variable, this plot gives information about all possible variables for a
particular sample. It is therefore useful when studying how a specific sample fits to the
model.
In the above map, two variables are repeatedly not well described by the model. They are to
be checked.
Outliers
Influence Plot
See the description in the Interpreting PCR plots section
Y-residuals vs. Predicted Y
See the description in the Interpreting PCR plots section
Patterns
Normal Probability Y-residuals
See the description in the above section
Y-residuals vs. Score
See the description in the above section
Leverage/Hotelling’s T²
Leverage
Line
See the description in the Interpreting PCR plots section
Matrix
This is a matrix plot of leverages for all samples and all model components. The X-axis
represents the components and the Y-axis the samples. The color represents the Z-value,
which is the leverage; the color scale can be customized. It is a useful plot for studying how
the influence of each sample evolves with the number of components in the model. The
leverages can also be displayed as Hotelling’s T² statistics.
Leverage as a matrix plot
Hotelling’s T²
Line
See the description in the Interpreting PCR plots section.
Matrix
This is a matrix plot of Hotelling’s T² statistics for all samples and all model components. It is
equivalent to the matrix plot of leverages, to which it has a linear relationship. The Y-axis
represents the components and the X-axis the samples. The color represents the Z-value,
which is the Hotelling’s T² statistic for a specific PC and sample; the color scale can be
customized.
Hotelling’s T² as a matrix plot
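As an illustration, Hotelling’s T² values, and their linear relationship to leverage, can be computed from a score matrix. The scores below are hypothetical, and the formulas (score variance with an n − 1 divisor; leverage of mean-centered scores) are the standard textbook conventions, assumed rather than taken from the software:

```python
# Hypothetical score matrix T: rows = samples, columns = model components.
T = [
    [ 2.1, -0.3],
    [-1.4,  0.8],
    [ 0.2, -1.1],
    [-0.9,  0.6],
]
n = len(T)
n_comp = len(T[0])

# Sum of squared scores per component.
ss = [sum(row[k] ** 2 for row in T) for k in range(n_comp)]

# Hotelling's T^2 per sample: squared scores divided by the score variance
# s_k^2 = ss[k] / (n - 1), summed over the components.
t2 = [sum(row[k] ** 2 / (ss[k] / (n - 1)) for k in range(n_comp)) for row in T]

# Leverage is linearly related to T^2 (for mean-centered scores).
leverage = [1 / n + t2_i / (n - 1) for t2_i in t2]
```

The linear mapping between `t2` and `leverage` is what makes the two matrix plots equivalent.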
Response Surface
See the description in the Interpreting PCR plots section
15.6. Bibliography
K. Esbensen, Multivariate Data Analysis - In Practice, 5th Edition, CAMO Process AS, Oslo,
2002.
H. Hotelling, “Analysis of a complex of statistical variables into principal components”, J.
Educ. Psych., 24, 417-441, 498-520 (1933).
J.E. Jackson, A User’s Guide to Principal Components, Wiley & Sons Inc., New York, 1991.
K.V. Mardia, J.T. Kent, and J.M. Bibby, Multivariate Analysis, Academic Press Inc., London,
1979.
16. Partial Least Squares
16.1. Partial Least Squares regression
Partial Least Squares — or Projection to Latent Structures — (PLS) models both the X- and Y-
matrices simultaneously to find the latent variables in X that will best predict the latent
variables in Y. These PLS components are similar to principal components; however, they are
referred to as Factors. PLS maximizes the covariance between X and Y.
Theory
Usage
Plot Interpretation
Method reference
Basics
Interpreting the results of a PLS regression
Scores and loadings (in general)
PLS scores
PLS loadings
PLS loading weights
X-Y relationship outliers
Regression coefficients
Predicted vs. reference plot
Error measures for PLSR
More details about regression methods
PLSR algorithm options
16.2.1 Basics
PLSR maximizes the covariance between X and Y. As a result, convergence to a minimum
residual error is often achieved with fewer factors than with PCR. This is in contrast to PCR,
which first performs Principal Component Analysis (PCA) on X and then regresses the scores
(T) against the Y data. A conceptual illustration of PLSR is shown graphically below.
PLSR Procedure
PLSR may be carried out with one or more Y variables, meaning that multiple Y responses
can be used during regression modeling.
There are three algorithms available in The Unscrambler® for PLS regression.
NIPALS
Kernel PLS
Wide Kernel PLS
Read about how PLSR compares to other regression methods in More details about
regression methods.
PLSR results are described in Main Results Of Regression.
Details regarding the PLSR algorithms are given in the Method reference.
As with PCA and PCR, the results of a PLS regression provide similar graphical outputs and
diagnostics. However, in the case of PLSR, some more interesting and powerful diagnostic
tools are available. The following provides a summary of these tools.
PLS loadings can also be plotted as X, Y and X-Y Correlation Loadings. For more details on
correlation loadings, see interpreting plots.
PLS loading weights
Loading weights are specific to PLSR (they have no equivalent in PCR) and express how the
information in each X-variable relates to the variation in Y summarized by the u-scores. They
are called loading weights because they also express, in the PLSR algorithm, how the t-scores
are to be computed from the X-matrix to obtain an orthogonal decomposition. The loading
weights are normalized, so that their lengths can be interpreted as well as their directions.
Variables with large loading weight values are important for the prediction of Y.
X-Y relationship outliers
X-Y Relationship Outliers plots the t-scores from X vs. the u-scores from Y and is used for two
main purposes: detecting outliers, and assessing the optimal number of factors.
This plot is unique to the PLSR algorithm. Since PLSR attempts to maximize the covariance
between the X- and Y-variables in the first calculated factors, the t vs. u plot should ideally
show a straight-line relationship. Samples that deviate noticeably are potential outliers. This
is shown graphically below.
The X-Y Relationship Outlier Plot for Ideal and Outlier Situations
To determine the optimal number of factors, visually assess which pair of t vs. u scores
starts to deviate from a straight line. The Quadrupole Plot is useful in this regard. This is
shown diagrammatically below.
The X-Y Relationship Outlier Quadrupole Plot
The X-Y Relationship Outliers plot is also useful for detecting nonlinear relationships that
may exist in the data. This may suggest that a different preprocessing should be considered.
Regression coefficients
Regression coefficients show how each variable is weighted when predicting a particular Y
response. They are a characteristic of all regression methods and may provide interpretive
insight into the quality of a model. Examples include:
Spectroscopy: Regression coefficients should have “spectral characteristics” about
them and not show noise characteristics.
Process data: When different variable types exist, regression coefficients show the
relative importance of the variables; their interactions can also be displayed.
Predicted vs. reference plot
The predicted vs. reference plot is another common feature of all regression methods. It
should show a straight-line relationship between predicted and measured values, ideally
with a slope of 1 and a correlation close to 1.
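As a sketch, the slope and correlation of such a predicted vs. reference relationship can be computed directly; the measured and predicted values below are illustrative:

```python
# Hypothetical reference (measured) and predicted values for 6 samples.
reference = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]

n = len(reference)
mean_r = sum(reference) / n
mean_p = sum(predicted) / n

# Cross- and auto-sums of squares about the means.
sxy = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
sxx = sum((r - mean_r) ** 2 for r in reference)
syy = sum((p - mean_p) ** 2 for p in predicted)

# Least-squares slope of predicted on reference, and the correlation r.
slope = sxy / sxx
corr = sxy / (sxx * syy) ** 0.5

# A good model gives both values close to 1.
```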
Error measures for PLSR
In PLSR (and PCR) models, not only the Y-variables are projected (fitted) onto the model; X-
variables are too. Sample residuals are computed for each PC of the model. The residuals
may then be combined:
Across samples
for each variable, to obtain a variance curve describing how the residual (or
explained) variance of an individual variable evolves with the number of PCs in the
model;
Across variables
(all X-variables or all Y-variables), to obtain a Total variance curve describing the
global fit of the model. The Total Y-variance curve shows how the prediction of Y
improves when more PCs are added to the model; the Total X-variance curve
expresses how much of the variation in the X-variables is taken into account to
predict variation in Y.
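As an illustration, the two ways of combining residuals can be sketched in Python. The residual matrix E below is hypothetical, and the simple 1/n variance convention is used for brevity; the software divides by the appropriate number of degrees of freedom:

```python
# Hypothetical X-residual matrix E after k components:
# rows = samples, columns = X-variables.
E = [
    [ 0.10, -0.20,  0.05],
    [-0.15,  0.10,  0.02],
    [ 0.05,  0.25, -0.08],
]
n_samples = len(E)
n_vars = len(E[0])

# "Across samples": one residual variance per variable (column).
var_per_variable = [
    sum(E[i][j] ** 2 for i in range(n_samples)) / n_samples
    for j in range(n_vars)
]

# "Across variables": one total residual variance for the whole model.
total_var = sum(sum(e ** 2 for e in row) for row in E) / (n_samples * n_vars)
```

Repeating this for each number of components produces the variance curves described above.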
Read more about how sample and variable residuals, as well as explained and residual
variances, are computed in the chapter with theory about PCA.
In addition, the Y-calibration error can be expressed in the same units as the original
response variable using the Root Mean Square Error of Calibration (RMSEC), and the Y-
prediction error as the Root Mean Square Error of Prediction (RMSEP).
RMSEC and RMSEP also vary as a function of the number of factors in the model.
E and F are initially X and Y and are “deflated” during the calculation of PLS factors.
The so-called Y-scores, u, are calculated from
The inner relation in PLS regression is the relation between T and U for the individual
factors:
The process continues by deflating: the information of the PLS factors (i.e. the outer
products tpᵀ and tqᵀ) is subtracted from E and F to obtain
The process is now repeated to find the next PLS factor by finding the corresponding
eigenvector. The estimation of the PLS loadings, loading weights and scores may also be
achieved by extracting eigenvectors of the smallest-sized products of X, Xᵀ, Y and Yᵀ, which
are the basis for other PLSR algorithms such as the kernel and wide kernel methods (see below).
The matrices, W, T, P and Q are then stored in The Unscrambler® Project Navigator with the
PLSR results, for further diagnostic purposes. To ensure that the columns of the matrix W
relate to the original matrix X, the weights may be expressed as,
The scores T are now used to calculate the regression coefficients, using the following
expression,
Since the normalization step can be introduced at various points in the calculation in other
variants of the PLSR algorithm, it can be difficult to compare scores and loadings calculated
by these variants.
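The deflation scheme described above can be sketched for the single-response case (PLS1) in pure Python. This is an illustrative outline of a NIPALS-style PLS1, not The Unscrambler’s implementation; X and y are assumed to be mean-centered, and the loading weights w are normalized to unit length as in the text:

```python
def pls1(X, y, n_factors):
    """NIPALS-style PLS1 sketch: X is a list of rows, y a list of values,
    both assumed mean-centered. Returns loading weights W, scores T,
    X-loadings P and Y-loadings Q, one entry per factor."""
    n, m = len(X), len(X[0])
    E = [row[:] for row in X]          # X-residuals, deflated each factor
    F = y[:]                           # y-residuals
    W, T, P, Q = [], [], [], []
    for _ in range(n_factors):
        # Loading weights: w = E'F, normalized to unit length.
        w = [sum(E[i][j] * F[i] for i in range(n)) for j in range(m)]
        norm = sum(wj ** 2 for wj in w) ** 0.5
        w = [wj / norm for wj in w]
        # Scores: t = Ew.
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]
        tt = sum(ti ** 2 for ti in t)
        # Loadings: p = E't / t't and q = F't / t't.
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        q = sum(F[i] * t[i] for i in range(n)) / tt
        # Deflate: subtract the outer products t p' and t q from E and F.
        for i in range(n):
            for j in range(m):
                E[i][j] -= t[i] * p[j]
            F[i] -= t[i] * q
        W.append(w); T.append(t); P.append(p); Q.append(q)
    return W, T, P, Q
```

Note that with a single response there is no inner iteration loop, which is why the Max. iterations setting (described later) does not apply to PLS1.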
This is a variant of the Kernel PLS that is expected to perform better for data
containing a large number of variables and relatively few samples (‘short and fat’
data). The implementation is based on Rännar et al., 1994 and does not handle
missing values.
More details on the algorithms are given in the method reference.
Some important tips and warnings associated with the Model Inputs tab
PLSR is a multivariate regression analysis technique; therefore, in The Unscrambler® it
requires a minimum of three samples (rows) and two variables (columns) to be present in a
data set in order to complete the calculation. The following describes the warnings given
when certain analysis criteria are not met.
Not enough samples present
Solution: Check that the data table (or selected row set) contains a minimum of 3 samples.
Not enough variables present
Solution: Check that the data table (or selected column set) contains a minimum of 2
variables.
Number of X rows does not match number of Y rows
Solution: Ensure that the row set dimensions of X match the row set dimensions of Y.
Too many excluded samples/variables
Solution: Check that not all samples/variables have been excluded from the data set.
To keep track of row and column exclusions, the model inputs tab provides a warning to
users that exclusions have been defined. See automatic keep outs for more details.
Individual X- and Y-variables can be selected from the variable list table provided in this
dialog by holding down the control (Ctrl) key and selecting variables. Alternatively, the
variable numbers can be manually entered into the text dialog box, the Select button can be
used (which opens the Define Range dialog box), or All can be clicked to select every
variable in the table.
Once the variables have been selected, to weight them, use the options in the Change
Selected Variable(s) dialog box, under the Select tab. The options include:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows selected variables to be weighted by predefined constant values.
Downweight
This allows for the multiplication of selected variables by a very small number, such
that the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box to also weight
the variables by their standard deviation in addition to the block weighting.
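The two most common of these weighting schemes can be sketched as follows. The values are hypothetical, and the 1/sqrt(k) block-weight convention (k variables per block, so each block gets the same total weight) is an assumption for illustration, not necessarily The Unscrambler’s exact formula:

```python
from math import sqrt
from statistics import stdev

# A/(SDev + B) weighting: with the defaults A = 1, B = 0 this reduces to
# dividing each variable by its standard deviation (standardization).
values = [2.0, 4.0, 6.0, 8.0]      # hypothetical values of one variable
A, B = 1.0, 0.0
sd_weight = A / (stdev(values) + B)
weighted = [v * sd_weight for v in values]

# Block weighting: one common convention (assumed here) gives every
# variable in a block of k variables the weight 1/sqrt(k), so that each
# block contributes the same total weight to the model.
blocks = {"spectra": 700, "process": 7}   # hypothetical block sizes
block_weight = {name: 1 / sqrt(k) for name, k in blocks.items()}
```

After the A/(SDev + B) step the weighted variable has unit standard deviation, which is what puts variables of different scales on an equal footing.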
Advanced tab
Use the Advanced tab in the X- and Y-Weights dialog to apply predetermined weights to
each variable. To use this option, set up a row in the data set containing the weights (or
create a separate row matrix in the project navigator). Select the Advanced tab in the
Weights dialog and select the matrix containing the weights from the drop-down list. Use
the Rows option to define the row containing the weights and click on Update to apply the
new weights.
Another feature of the Advanced tab is the ability to use the results matrix of another
analysis as weights, using the Select Results Matrix button. This option provides
an internal project navigator for selecting the appropriate results matrix to use as a weight.
The dialog box for the Advanced option is provided below.
PLSR Advanced Weights Option
Once the weighting and variables have been selected for X and Y, click Update to apply
them.
The differences between the algorithms are described in the Introduction to PLSR. Contrary
to the Kernel-based methods, the NIPALS algorithm is iterative and the maximum number of
iterations can be tuned in the Max. iterations box. The default value of 100 should be
sufficient for most data sets; however, some large and noisy data sets may require more
iterations to converge properly. The maximum allowed number of iterations is 30,000.
In the special case of a single response variable (i.e. PLS1), there are no iterations and the
Max. iterations box is grayed out.
When there are missing values in the data, the options are to impute them automatically
using the NIPALS algorithm or as a pre-processing step using Fill Missing.
Note: If there are missing values in the data and one of the Kernel methods is
selected, a warning will be given as shown below.
Q-residual limits are by default approximated based on calculated model components only,
which works well in many cases. Calculation of exact Q-residual limits will be performed
when the check box is marked. Note that estimation of exact limits may be slow for large
data.
Pretreatments can also be registered from the PLSR node in the project navigator. To
register the pretreatment, right click on the PLSR analysis node and select Register
Pretreatment. This is shown below.
Registering a Pretreatment From The Project Navigator
The Autopretreatment dialog box will appear, where the desired pretreatments can be
selected.
Note: Some caution is required when data table dimensions are changed after the
first pretreatment. The Autopretreatment is applied to the same column indices as
the original transformation, and inserting new variables (columns) before or in
between the original data will result in autopretreatment of the wrong variables.
To be safe, always insert any new variables in the table before applying any
transformations, or make a habit of always appending rather than inserting new
columns.
Set this tab up based on a priori knowledge of the data set in order to return outlier
warnings in the model. Settings for estimating the optimal number of components can
also be tuned here. The values shown in the dialog box above are default values and may
be used as a starting point for the analysis.
The warning limits in the Unscrambler® serve two major purposes:
The leverage and residual (outlier) limits are given as standard scores. This means that a
limit of e.g. 3.0 corresponds to a 99.7% probability that a value will lie within 3.0 standard
deviations of the mean of a normal distribution. The following limits can be specified:
Leverage Limit
(default 3.0) The ratio between the leverage for an individual sample and the
average leverage for the model.
Sample Outlier Limit, Calibration
(default 3.0) The square root of the ratio between the residual calibration variance
per sample (Sample Residuals) and the average residual calibration variance for the
model (Total Residuals).
Sample Outlier Limit, Validation
(default 3.0) The square root of the ratio between the residual validation variance
per sample (Sample Validation Residuals) and the total residual validation variance
for the model (Total Residuals).
Individual Value Outlier, Calibration
(default 3.0) For individual, absolute values in the calibration residual matrix
(Residuals), the ratio to the model average is computed (square root of the Variable
Residuals). For spectroscopic data this limit may be set to 5.0 to avoid many false
positive warnings due to the high number of variables.
Individual Value Outlier, Validation
(default 2.6) For individual, absolute values in the validation residual matrix
(Residuals), the ratio to the validation model average is computed (square root of
the Variable Validation Residuals). For spectroscopic data this limit may be set to 5.0
to avoid many false positive warnings due to the high number of variables.
Variable Outlier Limit, Calibration
(default 3.0) The square root of the ratio between the residual calibration variance
per variable (Variable Residuals) and the average residual calibration variance for
the model (Total Residuals).
Variable Outlier Limit, Validation
(default 3.0) The square root of the ratio between the residual validation variance
per variable (Variable Validation Residuals) and the total residual validation variance
for the model (Total Residuals).
Total Explained Variance (%)
(default 20) If the model explains less than 20% of the variance, the optimal number
of components is set to 0 (see the Info Box).
Ratio of Calibrated to Validated Residual Variance
(default 0.5) If the residual variance from the validation is much higher than that
from the calibration, a warning is given.
Ratio of Validated to Calibrated Residual Variance
(default 0.75) If the residual variance from the calibration is much higher than that
from the validation, a warning is given. This may occur in case of test set validation
where the test samples do not span the same space as the training data.
Residual Variance Increase Limit (%)
(default 6) This limit is applied for selecting the optimal number of components and
is calculated from the residual variance for two consecutive components. If the
variance for the next component is less than x% lower than that of the previous
component, the default number of components is set to the previous one.
When all the settings are made, click OK.
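As an illustration, the 99.7% coverage behind a standard-score limit of 3.0, and the Sample Outlier ratio test, can be sketched as follows. The residual variances are hypothetical: 19 ordinary samples and one with a much larger residual:

```python
from math import sqrt
from statistics import NormalDist

# A limit given as a standard score of 3.0 corresponds to roughly 99.7% of
# a normal distribution lying within 3 standard deviations of the mean.
coverage = NormalDist().cdf(3.0) - NormalDist().cdf(-3.0)

# Sample Outlier Limit (calibration): the tested ratio is the square root
# of a sample's residual variance over the model's average residual
# variance. Hypothetical residual variances per sample:
sample_residual_variance = [0.1] * 19 + [5.0]
total_residual_variance = (
    sum(sample_residual_variance) / len(sample_residual_variance)
)
ratios = [sqrt(v / total_residual_variance) for v in sample_residual_variance]

# Samples whose ratio exceeds the limit (default 3.0) trigger a warning.
limit = 3.0
flagged = [i for i, r in enumerate(ratios) if r > limit]
```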
Scores
Line
2-D scatter
3-D scatter
2 x 2-D Scatter
4 x 2-D Scatter
Loadings
Line
Loadings for the X-variables
Loadings for the Y-variables
2-D scatter
3-D scatter
Loadings for the X-variables
Loadings for the Y-variables
2 x 2-D scatter
4 x 2-D scatter
Loadings weights
Line
2-D scatter
3-D scatter
2 x 2-D scatter
4 x 2-D scatter
Important variables
Regression coefficients
Weighted coefficients (Bw)
Line plot
Matrix
Raw coefficients (B)
Line plot
Matrix
Residuals
Residuals and influence
General
Y-residuals vs. Predicted Y
Normal probability Y-residuals
Y-residuals vs. Score
Influence plot
Variance per sample
Variable residuals
Sample residuals
Sample and variable residuals
Outliers
Influence Plot
Y-residuals vs. Predicted Y
Patterns
Normal Probability Y-residuals
Y-residuals vs. Score
Leverage/Hotelling’s T²
Leverage
Line
Matrix
Hotelling’s T²
Line
Matrix
Response Surface
Scores
This is a two-dimensional scatter plot (or map) of scores for two specified factors (latent
variables or PCs) from PLS regression. The plot gives information about patterns in the
samples. The scores plot for (factor 1,factor 2) is especially useful, since these two
components summarize more variation in the data than any other pair of components.
The closer the samples are in the scores plot, the more similar they are with respect to the
two components concerned. Conversely, samples far away from each other are different
from each other. The plot can be used to interpret differences and similarities among
samples. Look at the scores plot together with the corresponding loadings plot for the same
two components. This can help in determining which variables are responsible for
differences between samples. For example, samples to the right of the scores plot will
usually have a large value for variables to the right of the loadings plot, and a small value for
variables to the left of the loadings plot.
Here are some things to look for in the 2-D scores plot.
Finding groups in a scores plot
Is there any indication of clustering in the set of samples? The figure below shows a situation
with four distinct clusters. Samples within a cluster are similar.
Detecting grouping in a scores plot
Furthermore, the display of the Hotelling’s T² ellipse for a model in two dimensions is
also a good way to detect outliers. To display it, click on the Hotelling’s T² ellipse
button.
Scores plot with Hotelling’s T² limit
In addition, the display of the stability plot can help in detecting outliers. This plot represents
the projection of the samples onto the submodels used for the validation; they can be part of
the model or left out. Hence this plot is only available when some type of cross-validation has
been selected. It is available from the toolbar icon.
How representative is the picture?
Check how much of the total variation each of the components explains. This is displayed in
parentheses next to each axis name: Factor-1 (86%). If the sum of the explained variances
for the two components is large (for instance 70-80%), the plot shows a large portion of the
information in the data, so the relationships can be interpreted with a high degree of
confidence.
X- and Y-loadings
A 2-D scatter plot of X- and Y-loadings for two specified components (factors) from PLS is a
good way to detect important variables and relationships between variables. The plot is
most useful for interpreting component 1 vs. component 2, since these represent the largest
variations in the X-data that explain the largest variation in the Y-data. By default both Y-
and X-variables are displayed but it is possible to modify that by clicking on the X and Y
icons.
Interpret the X-Y relationships
To interpret the relationships between X and Y-variables, start by looking at the response (Y)
variables.
Predictors (X) projected in roughly the same direction from the center as a response,
are positively linked to that response.
Predictors projected in the opposite direction have a negative link.
Predictors projected close to the center, are not well represented in that model and
cannot be interpreted.
698
Partial Least Squares
The maturity has a negative effect on the adhesiveness of the cheese; they are
anticorrelated. The amount of dry matter affects the stickiness positively and the
glossiness and meltiness negatively. Glossiness and meltiness, two responses, are correlated.
Caution! If the X-variables have been standardized, one should also standardize the
Y-variable so that the X- and Y-loadings have the same scale; otherwise the plot
may be difficult to interpret.
The plot shows the importance of the different variables for the two components specified.
It is possible to change the display by using the factor drop-down list. It should
preferably be used together with the corresponding scores plot. Variables with loadings to
the right in the loadings plot will be X-variables which usually have high values for samples
to the right in the scores plot, etc. This plot can be used to study the relationships among
the X-variables, and between the X- and Y-variables.
If the Uncertainty Test was activated, the important variables will be circled. It is also
possible to mark them by using the toolbar icon.
Loadings plot with circled important variables
When working with discrete variables, line loadings plots can also be used to represent data.
The Ascending and Descending buttons can be used to order the loadings in
terms of the variables with highest (or lowest) contribution to the Factor.
Line plot of loadings in ascending order of importance to Factor 1
Variables close to each other in the loadings plot will have a high positive correlation if the
two components explain a large portion of the variance of X. The same is true for variables in
the same quadrant lying close to a straight line through the origin. Variables in diagonally
opposed quadrants will have a tendency to be negatively correlated. For example, in the
figure above, variables dry matter and stickiness have a high positive correlation on factor 1
and factor 2, and they are negatively correlated to variables meltiness and glossiness.
Variables adhesiveness and stickiness have independent variations. Variables addition of
recycled dry matter and pH are very close to the center; they are not well described by
factor 1 and factor 2.
Note: Variables lying close to the center are poorly explained by the plotted factors
(or PCs). They cannot be interpreted in that plot!
Correlation loadings are also available for 1-D line loadings plots. When a line plot is
generated, the 1-D correlation loadings toolbar icon is displayed.
These are especially useful when interpreting important wavelengths in the analysis of
spectroscopic data, or contributing variables in time series data. An example is shown below.
Correlation Line Loadings of Spectroscopic variables in Factor 1
Values that lie within the upper and lower bounds of the plot are modelled by that Factor.
Those that lie between the two lower bounds are not.
Explained variance
This plot illustrates how much of the total variation in X or Y is described by models including
different numbers of components. The total residual variance is computed as the sum of
squares of the X- or Y-residuals divided by the number of degrees of freedom.
The total explained variance is then computed as: Total explained variance (%) =
100 * (1 - total residual variance / total initial variance).
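A minimal sketch of this computation for one response variable follows; the measured and fitted values are illustrative, and plain sums of squares are used so that any common degrees-of-freedom divisor cancels in the ratio:

```python
# Hypothetical measured Y-values and model-fitted values.
y     = [2.0, 3.0, 5.0, 6.0, 9.0]
y_fit = [2.2, 2.9, 4.6, 6.3, 8.8]

mean_y = sum(y) / len(y)

# Total variation about the mean, and residual variation left by the model.
total_ss = sum((yi - mean_y) ** 2 for yi in y)
residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_fit))

# Total explained variance (%): the share of the initial variation that the
# model accounts for.
explained_pct = 100 * (1 - residual_ss / total_ss)
```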
Outliers can sometimes cause large residual variance (or small explained variance). They can
also cause a decrease in the explained validation variance as can be seen in the plot below.
Outlier causes a drop of explained variance in validation
To display the results for other Y-variables, use the variable icon. In addition,
by default the results are shown for a specific number of factors, which should reflect the
dimensionality of the model. If the number of factors is not satisfactory, it is possible to
change it.
Note: Before interpreting the plot, check whether the plots are displaying
Calibration or Validation results (or both).
Menu option Window - Identification tells whether the plots are displaying Calibration (if
Ordinate is yPredCal) or Validation (yPredVal) results.
Use the buttons to switch Calibration and Validation results off or on.
It is also useful to show the regression line and compare it with the target line; these can be
displayed from the toolbar. Some statistics are available that give an idea of the quality of
the regression.
Note: RMSE and R-squared values are highly dependent on the validation method
used and the number of factors in a model. It is important not to use too many
factors and overfit the model.
When the statistics are toggled on, more detailed statistics are displayed. The Calibration
plot is shown below with statistics.
Predicted vs. Reference plot for PLS Calibration samples
In the above plot, sample 3 does not follow the regression line whereas all the other
samples do. Sample 3 may be an outlier.
How to detect nonlinearity
In other cases, there may be a nonlinear relationship between the X- and Y-variables, so that
the predictions do not have the same level of accuracy over the whole range of variation of
Y. In such cases, the plot may look like the one shown below. Such nonlinearities should be
corrected if possible (for instance by a suitable transformation), because otherwise there
will be a systematic bias in the predictions depending on the range of the sample.
Predicted vs. Reference shows a nonlinear relationship
Explained Variance
This plot shows the explained variance for each X- or Y-variable individually for different
model complexities. It can be used to identify which variables are described by the different
components in a model. Use the toolbar buttons to switch between X- and Y-variables, and
to add the total X- or Y-variance to the plot for comparison.
By default, ALL X- or Y-variables are plotted together. Use the toolbar drop-down box or
arrows to scroll between individual variables to plot. You may also type in comma separated
variable indexes manually in the box:
Toolbar variable selection box
Use this plot to see which components explain the individual variables, and whether this is
due to irrelevant or predictive variation (calibration vs. validation variance). The below plot
shows the explained validation variance for some X-variables. The first component is seen to
explain Opacity, Scatter and Weight, whereas the second component spans Roughness.
Many components would have to be included in order to model Brightness, and Ink is hardly
modeled at all.
Explained variances for several individual X-variables
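The per-variable bookkeeping behind this plot can be sketched as follows. The example uses a PCA (SVD) decomposition on random data as a stand-in for the model's X-decomposition; it is an illustration, not The Unscrambler's own computation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
Xc = X - X.mean(axis=0)                  # mean-center the data

# Rank-k reconstruction of X from its first k components (via SVD).
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Xhat = U[:, :k] * s[:k] @ Vt[:k]

# Explained variance per variable: 100 * (1 - residual SS / total SS),
# computed column by column.
resid = Xc - Xhat
expl_var = 100.0 * (1.0 - (resid ** 2).sum(axis=0) / (Xc ** 2).sum(axis=0))
print(np.round(expl_var, 1))
```

A variable whose explained variance stays low across all components (like Ink in the example above) is hardly modeled at all.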
Sample Outliers
Scores
See the description in the Interpreting PLS plots section
Influence
This is a plot of the residual X- and Y-variances vs. leverages. Look for samples with a high
leverage and high residual X- or Y-variance.
To study such samples in more detail, it is recommended to mark them and then plot X-Y
relation outliers for several model components. This way their influence on the shape of
the X-Y relationship can be determined, and it may be found that they are dangerous outliers.
High residuals indicate an outlier. Incorporating more components can sometimes model
outliers; avoid doing so since it will reduce the prediction ability of the model.
Small residual variance (or large explained variance) indicates that, for a particular number
of factors or components, the samples are well explained by the model. Therefore a sample
with a high Y-residual may be an outlier.
X-Y Relation outliers
This plot visualizes the regression relation along a particular component of the PLS model. It
shows the t-scores as abscissa and the u-scores as ordinate. In other words, it shows the
relationship between the projection of the samples in the X-space (horizontal axis) and the
projection of the samples in the Y-space (vertical axis).
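A minimal sketch of where the t- and u-scores come from, using a single NIPALS-style PLS component on simulated data (an illustration of the idea, not The Unscrambler's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=15)
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# One PLS component (NIPALS, single y): loading weight, then t- and u-scores.
w = Xc.T @ yc
w /= np.linalg.norm(w)       # normalized loading weight
t = Xc @ w                   # projection of samples in X-space (abscissa)
q = yc @ t / (t @ t)         # y-loading
u = yc / q                   # projection of samples in Y-space (ordinate)

# In the X-Y relation plot, samples far from the t-u trend are candidate outliers.
corr = np.corrcoef(t, u)[0, 1]
print(round(corr, 3))
```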
Note: The X-Y relation outlier plot for factor 1 is exactly the same as the Predicted
vs. Reference plot for factor 1.
This summary can be used for two purposes.
Detecting outliers
A sample may be outlying according to the X-variables only, or to the Y-variables only, or to
both. It may also not have extreme or outlying values for either separate set of variables, but
become an outlier when one considers the (X,Y) relationship. In the X-Y Relation Outlier plot,
such a sample sticks out as being far away from the relation defined by the other samples, as
shown in the figure below. If a sample appears to be outlying, it is advisable to check the
data: there may be a data transcription error for that sample.
A simple X-Y outlier
If a sample sticks out in such a way that it is projected far away from the center along the
model component, it is an influential outlier (see the figure below). Such samples are
dangerous to the model: they change the orientation of the component. Check the data. If
there is no data transcription error for that sample, investigate more and decide whether it
belongs to another population. If so, the sample can be removed as an outlier (mark it and
recalculate the model without the marked sample). If not, more samples of the same kind
may be needed, in order to make the data more balanced.
An influential outlier
A sigmoid-shaped curvature may indicate that there are interactions between the
predictors. Adding a cross-term to the model may improve it.
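Adding a cross-term simply means appending the product of two predictors as an extra column of X, for example:

```python
import numpy as np

# Two hypothetical predictor columns for three samples.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Append the cross-term x1*x2 as an extra predictor column,
# letting the model capture the interaction between the two predictors.
X_aug = np.column_stack([X, X[:, 0] * X[:, 1]])
print(X_aug.shape)
```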
Sample groups may indicate the need for separate modeling of each subgroup.
Scores and loadings
This overview shows two plots: the score and loadings plots.
Scores
See the description in the section
Loadings
X-Loadings
This plot displays by default the X-loadings along one factor (or PC) at a time, and the
maximal PC should be the same as the dimensionality of the model used to study the Bw
coefficients. It is possible to change the factor to be displayed by using the blue arrows.
This view is most useful if the X-data are spectral data. It is then possible to detect the area
of the signal that is responsible for a discrimination of the samples along the specified factor.
X-loading for spectra
In the above plot, the peak at 960 is responsible for the discrimination of the samples along
factor 2.
If the data are not spectral, it is generally more informative to look at the loadings in a
scatter plot. For more information on the scatter plot, see the description in the
Interpreting PLS plots section.
Y-Loadings
Regression coefficients
If the X-variables were weighted, this plot presents the weighted regression coefficients;
otherwise the B-coefficients and Bw-coefficients are identical. The number of factors (or
PCs) is fixed and can be changed using the arrows.
In general, this plot shows the weighted regression coefficients for a specific response or
Y-variable. By default it shows the coefficients for the first Y-variable; the other
responses can be accessed from the toolbar. The plot summarizes the relationship between
the predictors and the response, as approximated by a model with 3 components. The weighted
regression coefficients (Bw) provide information about the importance of the X-variables.
X-variables with a large regression coefficient play an important role in the regression
model; a positive coefficient shows a positive link with the response, and a negative
coefficient a negative link. Predictors with a small coefficient are negligible; mark them
and recalculate the model without those variables. The constant value B0W is indicated
within the X-axis label.
Weighted regression coefficients for 2 factors (or PCs)
In this plot it can be seen that variables Ti, Ba, Sr and Zr contribute the most to the model.
Important variables can also be plotted as a two-pane window of regression coefficients and
loading weights. This plot is useful for determining which factors most influence the
profile of the regression coefficients, particularly for spectroscopic applications.
Important variables showing regression coefficients and loadings weights
Note: The weighted coefficients (Bw) and raw coefficients (B) are identical if no
weights were applied on the variables.
If the predictor variables have been weighted with 1/SDev (standardization), the weighted
regression coefficients (Bw) take these weights into account. Since all predictors are brought
back to the same scale, the coefficients show the relative importance of the X-variables in
the model.
X-loading weights
This is a plot of X-loading weights for all the components vs. variable number. It is useful for
detecting important variables. If a variable has a large positive or negative loading weight,
this means that the variable is important for the component concerned. For example, a
sample with a large score value for this component will have a large positive value for a
variable with large positive loading weight.
If a variable has the same sign for all the important components, it is most likely to be an
important variable.
Regression coefficients
See the description in the previous section
Regression and prediction
See the description in the Overview section
Residuals and influence
Influence Plot
This plot shows the Q- or F-residuals vs. Leverage or Hotelling’s T² statistics. These represent
two different kinds of outliers. The residual statistics on the ordinate axis describe the
sample distance to the model, whereas the Leverage and Hotelling's T² describe how extreme
the sample is within the model space.
Samples with high residual variance, i.e. lying in the upper regions of the plot, are poorly
described by the model. Including additional components may result in these samples being
described better, however caution is required that the additional components are predictive
and not modelling noise. As long as the samples with high residual variance are not
influential (see below), keeping them in the model may not be a problem as such (the high
residual variance may be due to non-important regions of a spectrum, for instance).
Samples with high leverage, i.e. lying to the right of the plot, are well described by the
model. They are well described in the sense that the sample scores may have very high or
low values for some components compared to the rest of the samples. Such samples are
dangerous in the calibration phase because they are influential to the model. A sufficiently
extreme sample may by itself span an entire component, in which case the model will
become unreliable. Removal of a highly influential sample from the model will make the
model look entirely different and the axes will span different phenomena altogether. If the
variance described by the sample is important but unique, one should try to obtain more
samples of the same type to stabilize the model. Otherwise the sample should be discarded
as an outlier.
Note that a sample with both high residual variance and high leverage is the most dangerous
outlier. Not only is it poorly described by the model but it is also influential. Samples such as
these may span up to several components single handedly. Because they also disagree with
the majority of the other calibration samples, the ability of the model to describe new
samples is likely poor.
The Q- and F-residuals are two different methods for testing the same thing. The F-residuals
are available for both calibration and validation, in contrast to the Q-residuals, which are
available for calibration only. The validated residuals reflect the scheme chosen in the
validation and provide a more conservative assessment of residual outliers. If the residual
variance from validation is much higher than for calibration, one should investigate the
residuals in more detail.
The difference between Leverage and Hotelling's T² is only a scaling factor. The critical
limit for Leverage is based on an ad-hoc rule, whereas the Hotelling's T² critical limit is
based on an assumed F-distribution.
The toggle buttons in the toolbar can be used to switch between the various combinations.
Click the Y icon in the source taskbar to display the explained Y sample variance
plot. This plot displays the Y-sample variance explained for each sample in the model for the
number of factors selected.
X Sample Residuals
Switch between explained and residual variances using the buttons in the source
taskbar to view the X sample residuals plot. This plot displays the X Sample Residuals for
each sample in the model for the number of factors selected.
Y Sample Residuals
This plot displays the Y Sample Residuals for each sample in the model for the number of
factors selected.
High residuals indicate an outlier. Incorporating more components can sometimes model
outliers; avoid doing so since it will reduce the prediction ability of the model.
Leverage / Hotelling’s T²
The lower left pane of the Residuals and Influence overview displays a line plot of the
Hotelling’s T² by default. A toolbar toggle can be used to switch between the
Hotelling’s T² and Leverage views.
Hotelling’s T² statistics
The plot displays the Hotelling’s T² statistic for each sample as a line plot. The associated
critical limit (with a default p-value of 5%) is displayed as a red line.
Hotelling’s T² plot
The Hotelling’s T² statistic has a linear relationship to the leverage for a given sample.
Its critical limit is based on an F-test. Use it to identify outliers.
The number of factors (or PCs) may be tuned up or down with the toolbar arrows.
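For reference, the Hotelling's T² statistic itself is computed from the score matrix by scaling each squared score with the variance of its component. A minimal sketch on simulated scores follows; the critical limit, which requires an F-distribution quantile, is omitted here:

```python
import numpy as np

rng = np.random.default_rng(2)
T = rng.normal(size=(30, 3))   # score matrix: 30 samples, 3 components
                               # (scores from a centered model are assumed)

# Hotelling's T² per sample: squared scores scaled by each component's variance.
t_var = (T ** 2).sum(axis=0) / (T.shape[0] - 1)
T2 = ((T ** 2) / t_var).sum(axis=1)
print(np.round(T2[:3], 2))
```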
Leverage
Leverages are useful for detecting samples which are far from the center within the space
described by the model. Samples with high leverage differ from the average samples; in
other words, they are likely outliers. A large leverage also indicates a high influence on the
model. The figure below shows a situation where sample 5 is obviously very different from
the rest and may disturb the model.
One sample has a high leverage
There is an ad-hoc critical limit for Leverage (not depending on any distributional
assumptions), expressed in terms of the number of components and the number of calibration
samples.
The leverage values are always larger than zero, and can go up to 1 for samples in the
calibration set. As a rule of thumb, samples with a leverage above 0.4 - 0.5 start to be a
concern.
Influence on the model is best measured in terms of relative leverage. For instance, if all
samples have leverages between 0.02 and 0.1, except for one, which has a leverage of 0.3,
although this value is not extremely large, the sample is likely to be influential.
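A common way to compute leverages from the score matrix is sketched below. Conventions vary on whether the 1/N term is included; this is an illustration on simulated centered scores, not The Unscrambler's exact formula:

```python
import numpy as np

rng = np.random.default_rng(3)
T = rng.normal(size=(10, 2))   # centered scores: 10 samples, 2 components

# Leverage of sample i: h_i = 1/N + sum over components a of t_ia^2 / (t_a' t_a).
N = T.shape[0]
h = 1.0 / N + ((T ** 2) / (T ** 2).sum(axis=0)).sum(axis=1)
print(np.round(h, 3))
```

Comparing each sample's leverage to the bulk of the others (relative leverage) is often more telling than any absolute cut-off.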
What Should Be Done with a High-Leverage Sample? The first thing to do is to understand
why the sample has a high leverage. Investigate by looking at the raw data and checking
them against the original recordings. Once an explanation has been found, there are two
following cases:
Case 1
There is an error in the data. Correct it; or, if the true value cannot be found and
the experiment cannot be redone to give a more valid value, the erroneous value may
be replaced with “missing”.
Case 2
There is no error, but the sample is different from the others. For instance, it has
extreme values for several of the variables. Check whether this sample is “of
interest” (e.g. it has the properties of interest, to a higher degree than the other
samples), or “not relevant” (e.g. it belongs to another population than that being
studied). In the former case, one should try to generate more samples of the same
kind: they are the most interesting ones! In the latter case (and only then), the high-
leverage sample may be removed from the model.
Residuals
The lower right pane of the Residuals and Influence overview displays a line plot of the
sample residual statistics. A toolbar toggle can be used to switch between the Q- and
F-residuals views.
Q-residuals
This plot shows the sample Q-residuals as a line plot with associated limits.
Q-residual sample variance
F-residuals
This plot shows the sample F-residuals as a line plot with associated limits.
Note that the F-residuals are available for both calibration and validation. If the residual
X-variance from validation is much higher than for calibration, one should investigate the
residuals in more detail. The validated residuals reflect the scheme chosen in the
validation and provide a more conservative assessment of residual outliers.
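The Q-residual of a sample is the sum of its squared X-residuals after projection onto the model. The sketch below uses a PCA subspace as a stand-in for the PLS X-decomposition, with simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(12, 6))
Xc = X - X.mean(axis=0)

# Model subspace from the first 2 components (via SVD).
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T                   # loadings
E = Xc - Xc @ P @ P.T          # X-residual matrix after projection

# Q-residual (squared prediction error) per sample.
Q = (E ** 2).sum(axis=1)
print(np.round(Q[:3], 3))
```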
Leverage / Hotelling’s T²
See the description in the overview section.
Residuals
See the description in the overview section.
Response surface
This plot shows the response surface for a specific response or Y-variable. By default it
shows the response surface for the first Y-variable; the other responses can be accessed
from the toolbar.
This plot can appear in various layouts. The most relevant are:
Contour plot;
Landscape plot.
X- or Y- Variance
One-frame plot where it is possible to display either the Explained X- or Y-Variance with
Calibration and or Validation curves. See the description in the Interpreting PLS plots section
X- and Y- variance
A two-frame plot with the Explained X-Variance plot on the top, and below the Explained Y-
Variance with both Calibration and Validation variances. See the description in the
Interpreting PLS plots section
RMSE
This plot shows the RMSE results for a specific response or Y-variable. By default it shows
the results for the first Y-variable; the other responses can be accessed from the toolbar.
The RMSE is plotted as a function of the number of factors or components in the model.
There is one curve per response (or two if Cal and Val together are selected). The optimal
number of factors (or PCs) can be determined by looking at where the Val curve (i.e. RMSEP)
reaches a minimum.
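Programmatically, with a vector of validation RMSE values per model size, this amounts to locating the minimum. The values below are made up for illustration:

```python
import numpy as np

# Hypothetical validation RMSE (RMSEP) for models with 1..8 factors.
rmsep = np.array([0.95, 0.60, 0.42, 0.40, 0.41, 0.43, 0.46, 0.50])

# The optimal model size is where the validation curve reaches its minimum.
# (A common refinement is to prefer the smallest size within a tolerance
# of that minimum, to guard against overfitting.)
optimal = int(np.argmin(rmsep)) + 1
print(optimal)
```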
Sample outliers
See the description in the Interpreting PLS plots section
X-Y relation outliers
See the description in the Interpreting PLS plots section
Scores and loadings
See the description in the Interpreting PLS plots section
Scores
Line
This is a plot of score values vs. sample number for a specified component. It is usually
better to look at 2-D or 3-D scores plots because they contain more information, but this
plot can be useful whenever the samples are sorted according to the values of an underlying
variable, e.g. time, to detect trends or patterns (see figure below). Also look for systematic
patterns, like a regular increase or decrease, periodicity, etc. (only relevant if the sample
number has a meaning, like time for instance).
Trend in a Scores plot
The smaller the vertical variation (i.e. the closer the score values are to each other), the
more similar the samples are for this particular component. Look for samples that have a
very large positive or negative score value compared to the others: these may be outliers.
2-D scatter
See the description in the Interpreting PLS plots section
3-D scatter
This is a 3-D scatter plot or map of the scores for three specified components from PLS. The
plot gives information about patterns in the samples and is most useful when interpreting
components 1, 2 and 3, since these components summarize most of the variation in the
data. It is usually easier to look at 2-D scores plots but if three components are needed to
describe enough variation in the data, the 3-D plot is a practical alternative. The same
analysis as with a 2-D scatter plot should be done. See the description in the Interpreting PLS
plots section
2 x 2-D Scatter
The visualization window is divided into two frames. The top one shows the scatter plot of
the scores of the samples along factor 1 and factor 2. The bottom plot shows the scatter plot
of the scores along factor 3 and factor 4.
4 x 2-D Scatter
The visualization window is divided into four frames. The top left one shows the scatter plot
of the scores of the samples along factor 1 and factor 2. To its right is displayed the scores
plot in the factor 3-factor 4 plane. The bottom left plot shows the scatter plot of the scores
along factor 5 and factor 6. To its right is displayed the scatter plot of the scores of the
sample for factor 7 and factor 8.
Loadings
Line
Loadings for the X-variables
This is a plot of X-loadings for a specified component vs. variable number. It is useful for
detecting important variables. In many cases it is better to look at two- or three-vector
loadings plots instead because they contain more information. Line plots are most useful for
multichannel measurements, for instance spectra from a spectrophotometer, or in any case
where the variables are implicit functions of an underlying parameter, like wavelength, time,
etc. The plot shows the relationship between the specified component and the different X-
variables. If a variable has a large positive or negative loading, this means that the variable is
important for the component concerned; see the figure below. For example, a sample with a
large score value for this component will have a large positive value for a variable with large
positive loading.
Spectral data can be set to use line plots for the loadings plot by default. To do this,
right-click on the given range in the project navigator and tick the Spectra option.
Line plot of the X-loadings, important variables in a spectra
Variables with large loadings in early components are the ones that vary most. This means
that these variables are responsible for the greatest differences between the samples.
Note: Downweighted variables are displayed in a different color so as to be easily
identified.
Y-variables with large loadings in early components are the ones that are most easily
modeled as a function of the X-variables.
Note: Downweighted variables are displayed in a different color so as to be easily
identified.
2-D scatter
See the description in the Interpreting PLS plots section
3-D scatter
Loadings for the X-variables
This is a three-dimensional scatter plot of X-loadings for three specified components from
PLS. The plot is most useful for interpreting directions, in connection to a 3-D scores plot.
Otherwise it is recommended that one use line or 2-D loadings plots.
Note: Downweighted variables are displayed in a different color so as to be easily
identified.
2 x 2-D scatter
The visualization window is divided into two frames. The top one shows the scatter plot of
the loadings of the variables along factor 1 and factor 2. The bottom plot shows the scatter
plot of loadings of the variables along factor 3 and factor 4.
4 x 2-D scatter
The visualization window is divided into four frames. The top left one shows the scatter plot
of the loadings of the variables along factor 1 and factor 2. To its right is displayed the
loadings plot in the factor 3-factor 4 plane. The bottom left plot shows the scatter plot of loadings of
the variables along factor 5 and factor 6. To its right is displayed the scatter plot of loadings
of the variables for factor 7 and factor 8.
Loadings weights
Line
Loading weights are specific to PLS (they have no equivalent in PCR) and express how the
information in each X-variable relates to the variation in Y summarized by the u-scores. They
are called loading weights because they also express, in the PLS algorithm, how the t-scores
are to be computed from the X-matrix to obtain an orthogonal decomposition. The loading
weights are normalized, so that their lengths can be interpreted as well as their directions.
Variables with large loading weight values are important for the prediction of Y.
Looking at a line plot of the loading weights shows how much the variables participate in
the plotted factor.
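A sketch of the first loading-weight vector from a NIPALS-style step, showing the normalization and the link between large absolute values and important variables. The data are simulated; this illustrates the idea rather than The Unscrambler's implementation:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 8))
y = X[:, 2] + 0.5 * X[:, 5] + 0.05 * rng.normal(size=20)
Xc, yc = X - X.mean(axis=0), y - y.mean()

# First PLS loading-weight vector: the direction in X most covariant with y.
w = Xc.T @ yc
w /= np.linalg.norm(w)   # normalized, so both direction and length are interpretable

# Variables with large |w| are important for the prediction of y;
# here the dominant one should be variable 2 (0-based index).
print(int(np.argmax(np.abs(w))))
```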
2-D scatter
See the description in the Interpreting PLS plots section
3-D scatter
This is a three-dimensional scatter plot of X-loading weights for three specified components
from PLS; this plot may be difficult to interpret, both because it is three-dimensional and
because it does not include the Y-loadings. Thus it is usually recommended that one use the
2-D scatter plot of X-loading weights and Y-loadings instead.
2 x 2-D scatter
The visualization window is divided into two frames. The top one shows the scatter plot of
the loading weights of the variables along factor 1 and factor 2. The bottom plot shows the
scatter plot of loading weights of the variables along factor 3 and factor 4.
4 x 2-D scatter
The visualization window is divided into four frames. The top left one shows the scatter plot
of the loading weights of the variables along factor 1 and factor 2. To its right is displayed
the loading weights plot in the factor 3-factor 4 plane. The bottom left plot shows the scatter plot of
loading weights of the variables along factor 5 and factor 6. To its right is displayed the
scatter plot of loading weights of the variables for factor 7 and factor 8.
Important variables
See the description in the Interpreting PLS plots section
Regression coefficients
The above plot shows the regression coefficients for one particular response variable (Y),
and for a model with a particular number of components (3). Each predictor variable (X)
defines one point of the line (or one bar of the plot). It is recommended to configure the
layout of this plot as bars. Variables 1 and 4 have the highest B coefficients.
Note: The weighted coefficients (Bw) and raw coefficients (B) are identical if no
weights were applied on the variables.
The raw coefficients are those that may be used to write the model equation in original
units: y = B0 + B1·x1 + B2·x2 + … + Bk·xk
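In matrix form the equation is y = B0 + X·B. A minimal sketch applying hypothetical raw coefficients to new samples in original units:

```python
import numpy as np

# Hypothetical raw regression coefficients for a 3-variable model.
B0 = 1.5                               # constant term
B = np.array([0.2, -0.4, 0.1])         # one coefficient per X-variable

# Predicting y for new samples in original units: y = B0 + X @ B.
X_new = np.array([[10.0, 5.0, 2.0],
                  [8.0, 3.0, 1.0]])
y_hat = B0 + X_new @ B
print(y_hat)
```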
Since the predictors are kept in their original scales, the coefficients do not reflect the
relative importance of the X-variables in the model. If no weights have been applied to the
X-variables, the display of the Uncertainty Limits may be informative. It is available if Cross-
Validation and the Uncertainty Test option were selected in the Regression dialog.
Use View Uncertainty Limit from the menu to toggle this indication on or off.
Matrix
The matrix plot is useful when there are several Y-variables. It helps to interpret the B-
coefficients for all responses. The plot below shows the B-coefficients for two responses.
There are seven X-variables corresponding to B1, B2,… B7. B0 is the constant term of the
model; it is not presented in the plot. Variable 2 has a negative impact on the second
response but a positive one on the first response.
Regression coefficients for 2 responses
Residuals
General
Y-residuals vs. Predicted Y
This is a plot of Y-residuals against predicted Y values. If the model adequately predicts
variations in Y, any residual variations should be due to noise only, which means that the
residuals should be randomly distributed. If this is not the case, the model is not completely
satisfactory, and appropriate action should be taken. If strong systematic structure (e.g.
curved patterns) is observed, this can be an indication of lack of fit of the regression model.
The figure below shows a situation that strongly indicates lack of fit of the model. This may
be corrected by transforming the Y variable. This plot can be shown with the studentized
residuals by toggling the icon . The studentized residuals are also an option in many of
the other general Y residuals plots.
Structure in the residuals: a transformation of the y variable is recommended
The presence of an outlier is shown in the example below. The outlying sample has a much
larger residual than the others; however, it does not seem to disturb the model to a large
extent.
A single sample has a large residual
The figure below shows the case of an influential outlier: not only does it have a large
residual, it also attracts the whole model so that the remaining residuals show a very clear
trend. Such samples should usually be excluded from the analysis, unless there is an error in
the data or some data transformation can correct for the phenomenon.
An influential outlier changes the structure of the residuals
Small residuals (compared to the variance of Y) which are randomly distributed indicate
adequate models.
Normal probability Y-residuals
This plot displays the cumulative distribution of the Y-residuals with a special scale, so that
normally distributed values should appear along a straight line. The plot shows all residuals
for one particular Y-variable (look for its name in the axis label). There is one point per
sample. If the model explains the complete structure present in the data, the residuals
should be randomly distributed - and usually, normally distributed as well. So if all the
residuals are along a straight line, it means that the model explains everything that can be
explained in the variations of the variables to be predicted. If most of the residuals are
normally distributed, and one or two stick out, these particular samples are outliers. This is
shown in the figure below. If there are outliers, mark them and check the data.
Outliers are sticking out on Normal Probability Plot of Residuals
If the plot shows a strong deviation from a straight line, the residuals are not normally
distributed, as in the figure below. In some cases - but not always - this can indicate lack of
fit of the model. However it can also be an indication that the error terms are simply not
normally distributed.
The residuals have a regular but non-normal distribution
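The coordinates of a normal probability plot can be computed with standard plotting positions; near-normal residuals then line up along a straight line. A sketch on simulated residuals using only numpy and the standard library:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(6)
residuals = rng.normal(size=50)        # simulated, roughly normal residuals

# Normal probability plot coordinates: sorted residuals vs theoretical
# normal quantiles, using the common (i + 0.5)/n plotting positions.
sorted_res = np.sort(residuals)
n = len(sorted_res)
theo_q = np.array([NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)])

# Near-normal residuals line up: the correlation of the two axes is close to 1.
straightness = np.corrcoef(theo_q, sorted_res)[0, 1]
print(round(straightness, 3))
```

A markedly lower correlation, or isolated points far off the line, corresponds to the non-normal distributions and outliers discussed above.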
Influence plot
See the description in the Interpreting PLS plots section
Samples with small residual variance (or large explained variance) for a particular
component are well explained by the corresponding model, and vice versa. In the above plot,
four samples, such as B3, are not well explained by the model and may be outliers.
Variable residuals
This is a plot of residuals for a specified X-variable and component number for all the
samples. The plot is useful for detecting outlying sample/variable combinations, as shown
below. An outlier can sometimes be modeled by incorporating more components. This
should, however, be avoided since it will reduce the predictive ability of the model.
Line plot of the variable residuals
Whereas the sample residual plot gives information about residuals for all variables for a
particular sample, this plot gives information about all possible samples for a particular
variable. It is therefore more useful when investigating how one specific variable behaves in
all the samples.
Sample residuals
This is a plot of the residuals for a specified sample and component number for all the X-
variables. It is useful for detecting outlying sample or variable combinations. Although
outliers can sometimes be modeled by incorporating more components, this should be
avoided since it will reduce the predictive ability of the model.
Line plot of the sample residuals: one variable is an outlier
In the above plot, variable 1 (Adhesiveness at 1 day) is, for a particular sample, not very
well described by a model with a certain number of components, here 4. If this is the case
for most of the samples, this variable may be noisy and can be considered an outlier.
In contrast to the variable residual plot, which gives information about residuals for all
samples for a particular variable, this plot gives information about all possible variables for a
particular sample. It is therefore useful when studying how a specific sample fits to the
model.
In the above map, two variables are not well described by the model. They should be further
investigated.
Outliers
Influence Plot
See the description in the Interpreting PLS plots section
Y-residuals vs. Predicted Y
See the description in the above section.
Patterns
Normal Probability Y-residuals
See the description in the above section
Y-residuals vs. Score
See the description in the above section
Leverage/Hotelling’s T²
Leverage
Line
See the description in the Interpreting PLS plots section
Matrix
This is a matrix plot of leverages for all samples and all model components. The X-axis
represents the components and the Y-axis the samples. The color represents the Z-value
which is the leverage; the color scale can be customized. It is a useful plot for studying how
the influence of each sample evolves with the number of components in the model. The
leverages can also be displayed as Hotelling’s T² statistics.
Leverage as a matrix plot
Hotelling’s T²
Line
See the description in the Interpreting PLS plots section
Matrix
This is a matrix plot of Hotelling’s T² statistics for all samples and all model components.
It is equivalent to the matrix plot of leverages, to which it has a linear relationship. The
X-axis represents the components and the Y-axis the samples. The color represents the
Z-value, which is the Hotelling’s T² statistic for a specific PC and sample; the color scale
can be customized.
Hotelling’s T² as a matrix plot
Response Surface
See the description in the Interpreting PLS plots section
16.6. Bibliography
B. S. Dayal and J. F. MacGregor, Improved PLS Algorithms, J. Chemom., 11, 73-85 (1997).
F. Lindgren, P. Geladi, S. Wold, The kernel algorithm for PLS, J. Chemom., 7, 45-59 (1993).
S. Rannar, F. Lindgren, P. Geladi and S. Wold, A PLS kernel algorithm for data sets with many
variables and fewer objects, Part 1: Theory and Algorithm, J. Chemom., 8, 111-125 (1994).
17. LPLS
17.1. L-PLS regression
Traditionally, science demanded that a one-to-one relationship between a cause and effect
existed; however, this tradition can hinder the study of more complex systems. Such systems
may be characterized by many-to-many relationships, which are often hidden in large tables
of data.
In the sections on bilinear modeling such as PLS the data are arranged in such a way that the
information obtained on a dependent variable Y is related to some independent measures X.
In some cases, the Y data may have descriptors of its columns, organized in a third table Z
(containing the same number of columns as in Y).
The three matrices X, Y and Z can together be visualized in the form of an L-shaped
arrangement. Such data analysis has potential widespread use in areas such as consumer
preference studies, medical diagnosis and spectroscopic applications.
Theory
Usage
Plot Interpretation
Method reference
Basics
The L-PLS model
L-PLS by example
17.2.1 Basics
Traditionally, science demanded that a one-to-one relationship between a cause and effect
existed; however, this tradition can hinder the study of more complex systems. Such systems
may be characterized by many-to-many relationships, which are often hidden in large tables
of data.
In the sections on bilinear modeling such as PLS regression the data are arranged in such a
way that the information obtained on a dependent variable Y is related to some
independent measures X. In some cases, the Y data may have descriptors of its columns,
organized in a third table Z (containing the same number of columns as in Y).
The three matrices X, Y and Z can together be visualized in the form of an L-shaped
arrangement. Such data analysis has potential widespread use in areas such as consumer
preference studies, medical diagnosis and spectroscopic applications.
The Y-matrix is thus modeled as a function of both X and Z. The correlation loadings may also
be computed similarly to what is done for PCA, PCR and PLS.
The strength in interpretation from the L-PLS regression is that the correlation loadings plot
shows the relationship between the variables for all the three matrices. In addition, the
scores for the objects are also included. This enables direct interpretation of the rows in Z
and columns in X, two matrices that share no common dimension.
It is common to weight the variables of X and Z to unit variance. Y is then by default double
centered, or double centered and scaled; again, this depends on the properties of the
variables.
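Double centering, the default Y pretreatment mentioned above, can be sketched in a few lines (a minimal illustration; the function name is ours, not the product’s):

```python
import numpy as np

# Minimal sketch of double centering: subtract row means and column means,
# then add back the grand mean so both margins become (numerically) zero.
def double_center(Y):
    return (Y - Y.mean(axis=1, keepdims=True)
              - Y.mean(axis=0, keepdims=True) + Y.mean())

Y = np.array([[5.0, 3.0],
              [1.0, 7.0]])
Ydc = double_center(Y)   # every row sum and column sum is now ~0
```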
As can be seen, the combined plot of correlation loadings and samples gives a view of the
relation between the variables and samples.
For example, the sensory Attribute 4 is anti-correlated with the background Z-variables
Married and Age. Age and Married are correlated, which shows that the older a person is,
the more likely that person is to be married, and also to dislike Attribute 4 and thus product
E.
Gender is close to the center, which indicates that it does not play a role in the preference of
products.
People practicing sport regularly may prefer product D, which is characterized by Attribute
3.
For algorithm details please refer to the method reference.
In the Model Inputs tab, first select an X-matrix to be analyzed from the X-matrix drop-down
list. If new data ranges need to be defined, choose New or Edit from the drop-down list next
to Rows and/or Cols. This will open the Define Range editor where new ranges can be
defined.
Next select a Y-matrix to be analyzed from the Y-matrix drop-down list. The Y-variables may
be defined as a row and column set within the X-matrix selected for analysis, or may be a
separate matrix of Y-variables available from the project navigator.
Finally select a Z-matrix to be analyzed from the Z-matrix drop-down list. The Z-variables
may be defined as a row set within the Y-matrix or as a separate matrix of Z-variables
available from the project navigator.
Once the data to be used in modeling are defined, choose a starting number of Components
(latent variables, factors) to calculate, in the Maximum components spin box.
The Mean Center check box allows a user to subtract the column means from every variable
before analysis.
Some important tips and warnings associated with the Model Inputs tab
L-PLS puts some constraints on the three input matrices in order to complete the
calculation. The following explains the warnings given when certain analysis criteria are not
met.
First, matrix shapes must match:
Constraint between X and Y not fulfilled
Solution: Make sure that the number of rows in the selected X and Y matrices matches.
Constraint between Y and Z not fulfilled
Solution: Make sure that the number of columns in the selected Y and Z matrices matches.
To understand this better, see the theory section for a diagram that illustrates how to
organize data for L-PLS analysis.
Too many excluded samples/variables
Solution: Check that all samples/variables have not been excluded in a data set.
To keep track of row and column exclusions, the model inputs tab provides a warning to
users that exclusions have been defined. See automatic keep outs for more details.
17.3.2 X weights
If it is necessary to weight the variables to make realistic comparisons among them
(particularly useful for process and sensory data), click on the X-, Y- and Z-Weights tabs
and the following dialog box will appear.
X Weights Dialog
Individual X-variables can be selected from the variable list table provided in this dialog by
holding down the control (Ctrl) key and selecting variables. Alternatively, the variable
numbers can be entered manually into the text box, the Select button can be used (which
opens the Define Range dialog box), or clicking All will select every variable in the table.
Once the variables have been selected, to weight them, use the options in the Change
Selected Variable(s) dialog box, under the Select tab. The options include:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows selected variables to be weighted by predefined constant values.
Downweight
This allows for the multiplication of selected variables by a very small number, such
that the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
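The weighting options above can be sketched numerically. The formulas are our reading of the text (e.g. 1/sqrt(K) per variable for block weighting is one common convention), not the exact internal implementation:

```python
import numpy as np

# Sketch of the weighting options described above (assumed formulas).
def sdev_weights(X, A=1.0, B=0.0):
    """A/(SDev + B); the defaults A=1, B=0 give classical 1/SDev scaling."""
    return A / (X.std(axis=0, ddof=1) + B)

def block_weights(n_vars_in_block, divide_by_sdev=False, X=None):
    """Equal weight 1/sqrt(K) for a block of K variables (one common
    convention), optionally combined with 1/SDev weighting."""
    w = np.full(n_vars_in_block, 1.0 / np.sqrt(n_vars_in_block))
    if divide_by_sdev:
        w = w / X.std(axis=0, ddof=1)
    return w

DOWNWEIGHT = 1e-6   # multiply by a tiny constant: the variable no longer
                    # influences the model but still shows up in the plots

X = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]])
w = sdev_weights(X)   # weighted variables (X * w) then have unit variance
```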
Advanced tab
Use the Advanced tab in the X-, Y-, and Z-Weights dialog to apply predetermined
weights to each variable. To use this option, set up a row in the data set containing
the weights (or create a separate row matrix in the project navigator). Select the
Advanced tab in the Weights dialog and select the matrix containing the weights
from the drop-down list. Use the Rows option to define the row containing the
weights and click on Update to apply them. The dialog box for the Advanced option
is provided below.
L-PLS Advanced Weights Option
Once the weighting and variables have been selected, click Update to apply them.
17.3.3 Y weights
The weights in the Y weights tab should be handled in the same way as the weights in the X
weights tab.
17.3.4 Z weights
The weights in the Z weights tab should be handled in the same way as the weights in the X
weights tab.
When all the settings are specified click OK.
Cross validation is not currently implemented for L-PLS regression in The Unscrambler®. It
is suggested to first model (X,Y) and (Y,Z) separately with PLS regression, to evaluate the
quality of the data and the validated variance.
See the method reference for details.
Explained X-variance
This plot gives an indication of how much of the variation in the X data is described by the
different components.
Total residual X-variance is computed as the sum of squares of the residuals for all the X-
variables, divided by the number of degrees of freedom.
Total explained X-variance is then computed as the percentage of the total variance
accounted for by the model, i.e. 100% × (1 − total residual variance / total variance).
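Under the assumption that explained variance is quoted as a percentage of the total variance, a minimal sketch is:

```python
import numpy as np

# Sketch: explained X-variance after k components, assuming the ratio of
# residual to total sum of squares (any common degrees-of-freedom divisor
# cancels when both terms use it).
def explained_x_variance(X, X_hat):
    ss_res = ((X - X_hat) ** 2).sum()
    ss_tot = ((X - X.mean(axis=0)) ** 2).sum()
    return 100.0 * (1.0 - ss_res / ss_tot)

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])
ev_full = explained_x_variance(X, X)                               # 100.0
ev_none = explained_x_variance(X, np.tile(X.mean(axis=0), (3, 1))) # 0.0
```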
Explained Y-variance
This plot gives an indication of how much of the variation in the Y data is described by the
different components.
Explained Y-variance
Explained Z-variance
This plot gives an indication of how much of the variation in the Z data is described by the
different components.
Explained Z-variance
Explained variance
This plot gives an indication of how much of the variation in the three data tables: X, Y, Z is
described by the different components.
Explained all variances
Scores
This is a two-dimensional scatter plot (or map) of scores for two specified components
(factors or PCs). The plot gives information about patterns in the samples. The scores plot
for (PC1,PC2) is especially useful, since these two components summarize more variation in
the data than any other pair of components.
Scores plot
The closer the samples are in the scores plot, the more similar they are with respect to the
two components concerned. Conversely, samples far away from each other are different
from each other.
The plot can be used to interpret differences and similarities among samples. Look at the
scores plot together with the corresponding loadings plot, for the same two components.
This can help in determining which variables are responsible for differences between
samples. For example, samples to the right of the scores plot will usually have a large value
for variables to the right of the loadings plot, and a small value for variables to the left of the
loadings plot.
Here are some things to look for in the 2-D scores plot.
Finding groups in a scores plot
Is there any indication of clustering in the set of samples? The figure below shows a
situation with three distinct clusters. Samples within a cluster are similar.
Detecting grouping in a scores plot
X Correlation Loadings
A two-dimensional scatter plot of X correlation loadings for two specified components is a
good way to detect important variables. The importance of individual variables is
visualized more clearly in the correlation loadings plot than in the standard loadings
plot. The plot is most useful for interpreting component 1 vs. component 2, since they
represent the largest variations in the data.
The plot shows the importance of the different variables for the two components specified.
It should preferably be used together with the corresponding scores plot. Variables with X
correlation loadings to the right in the correlation loadings plot will be X-variables which
usually have high values for samples to the right in the scores plot, etc.
Variables close to each other in the loadings plot will have a high positive correlation if the
two components explain a large portion of the variance of X. The same is true for variables in
the same quadrant lying close to a straight line through the origin. Variables in diagonally
opposed quadrants will have a tendency to be negatively correlated. Variables Red and
Firm,Instr have independent variations. Variables Red and Acids/Sugars are negatively
correlated.
X Correlation Loadings of 10 sensory variables along (PC1,PC2)
Note: Variables lying close to the center are poorly explained by the plotted factors
(or PCs). Do not interpret them in that plot!
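Correlation loadings can be sketched as the plain correlation coefficient between each original variable and each score vector. This follows the standard definition used with PCA/PLS, not the product’s source code:

```python
import numpy as np

# Sketch: correlation loadings = correlation between each X-variable (column
# of X) and each score vector (column of T); entries always lie in [-1, 1].
def correlation_loadings(X, T):
    Xc = X - X.mean(axis=0)
    Tc = T - T.mean(axis=0)
    return (Xc.T @ Tc) / np.outer(np.linalg.norm(Xc, axis=0),
                                  np.linalg.norm(Tc, axis=0))

X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
T = X[:, [0]]                   # pretend the score vector is the first variable
R = correlation_loadings(X, T)  # R[0, 0] is exactly 1.0
```

Variables near the unit circle are well explained by the plotted components; variables near the center are not, which is why the note above warns against interpreting them.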
Y Correlation Loadings
Variables close to each other in the correlation loadings plot will have a high positive
correlation if the two components explain a large portion of the variance of Y. The same is
true for variables in the same quadrant lying close to a straight line through the origin.
Variables in diagonally opposed quadrants will have a tendency to be negatively correlated.
In this example the Y-variables are the individuals, so two individuals close together will have
similar behavior and will like the samples in their quadrant.
Y Correlation Loadings
Z Correlation Loadings
Variables close to each other in the correlation loadings plot will have a high positive
correlation if the two components explain a large portion of the variance of Z. The same is
true for variables in the same quadrant lying close to a straight line through the origin.
Variables in diagonally opposed quadrants will have a tendency to be negatively correlated.
In this example the Z-variables are the background information on the individuals, more
specifically what they say they like in general.
Z Correlation Loadings
Correlation
This plot shows the correlation loadings of the three data tables: X, Y and Z. Interpretation
between tables can be done in this plot.
For example, individuals who say they like Gala apples also like apples with high sweetness
and sugar content; individuals 17 and 25 are examples.
All correlation loadings
17.6. Bibliography
H. Martens, E. Anderssen, A. Flatberg, L. H. Gidskehaug, M. Hoy, F. Westad, A. Thybo and M.
Martens, Regression of a data matrix on descriptors of both its rows and of its columns via
latent variables: L-PLSR, Computational Statistics & Data Analysis, 48, 103-123 (2005).
18. Support Vector Machine Regression
Theory
Usage: Create model
Results
Usage: Prediction
Result interpretation
Method reference
As in all methods that can be described as statistical learning methods there is a balance
between achieving a small training error and the complexity of the model. The parsimony
principle strives to find the simplest model with an acceptable error; not only in the training
stage but more importantly for prediction. This is one reason that the SVMR implementation
in The Unscrambler® includes an option for cross validation. See the section on dialog usage
for details.
One of the most important ideas in Support Vector Machine classification and regression is
that representing the solution by means of a small subset of training points gives a sparse
model, which is attractive both computationally and for interpretation.
Parameter epsilon controls the width of the epsilon-insensitive zone used to fit the training
data. The value of epsilon can affect the number of support vectors used to construct the
regression function: the bigger epsilon is, the fewer support vectors are selected (cf. the
illustration above). Hence, both the C and epsilon values affect model complexity, but in
different ways.
When using nu-SVR, the nu value must be defined (default value = 0.5). Nu serves as an
upper bound on the fraction of training errors and a lower bound on the fraction of
support vectors.
SVMR also has a parameter C that determines the trade-off between model complexity
(flatness) and the degree to which deviations larger than epsilon are tolerated in the
optimization formulation. For example, if C is too large (approaching infinity), the objective
is to minimize the empirical risk only, without regard to the model complexity part of the
optimization formulation.
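The roles of epsilon and C can be made concrete with the epsilon-insensitive loss that underlies epsilon-SVR. This is a conceptual sketch of the standard loss, not The Unscrambler’s implementation:

```python
import numpy as np

# Conceptual sketch: residuals inside the epsilon tube cost nothing; larger
# deviations are penalized linearly, and C scales the error term in the
# (schematic) objective  C * sum(loss) + flatness penalty.
def eps_insensitive_loss(y, y_hat, epsilon):
    return np.maximum(0.0, np.abs(np.asarray(y) - np.asarray(y_hat)) - epsilon)

loss = eps_insensitive_loss([1.0, 2.0, 3.0], [1.05, 2.5, 3.0], epsilon=0.1)
# residual 0.05 falls inside the tube (zero loss); 0.5 is penalized by 0.4
```

Widening the tube (larger epsilon) zeroes out more residuals, which is why fewer points end up as support vectors.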
The kernel type to be used can be chosen from the following four options:
Linear
Polynomial
Radial basis function
Sigmoid
The linear function is set as the default kernel because it is the simplest one and is not very
susceptible to overfitting. If the number of variables is very large, the data do not need to be
mapped to a higher-dimensional space, and the linear kernel function is preferred. The radial
basis function is also a simple function and can model systems of varying complexity; it is an
extension of the linear kernel. If a polynomial kernel is chosen, the order of the polynomial
must also be given. The best value for C is often not known a priori.
Through a grid search, applying cross validation to reduce the chance of overfitting, one can
identify an optimal value of C so that unknowns can be properly predicted using the SVM
model.
Through cross validation, one can identify an optimal value of C from the RMSECV, as
displayed in the grid search dialog.
Support vectors
Parameters
Probabilities
Prediction
Diagnostics
The main result for SVM Regression is the matrix of predicted values which may be
compared to the reference values in a predicted versus reference plot as for any type of
regression.
The RMSEC (root mean square error of calibration) and RMSECV (from cross validation)
are given in the statistics box in the Predicted versus Reference plot. Note that in the current
version the cross-validated predictions are not shown; the difference between calibration
and validation is expressed by RMSEC and RMSECV. The support vectors can be visualized in
this plot by clicking the “SV” icon on the Mark toolbar above the plot. As for modeling in
general, the RMSECV should preferably be close to the RMSEC, which indicates that the
model has not been overfitted.
The Diagnostics subnode holds RMSEC, RMSECV and the corresponding correlations
between predicted and reference.
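Both figures of merit are plain root-mean-square errors; the definition below is our assumption of the standard formula, applied to calibration fits for RMSEC and cross-validated predictions for RMSECV:

```python
import numpy as np

# Sketch of the figures of merit quoted in the statistics box.
def rmse(y_ref, y_pred):
    y_ref, y_pred = np.asarray(y_ref, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_ref - y_pred) ** 2)))

# rmse(reference, fitted_values)   -> RMSEC
# rmse(reference, cv_predictions)  -> RMSECV; a large gap flags overfitting
```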
Some important tips and warnings associated with the Model Inputs tab
SVM Regression can be used as both a univariate and multivariate regression analysis
technique. In The Unscrambler® it requires a minimum of three samples (rows) and one
variable (column) to be present in a data set in order to complete the calculation. The
following warnings are given when certain analysis criteria are not met.
Not enough samples present
Solution: Check that the data table (or selected row set) contains a minimum of 3 samples or
2 variables.
Missing values in X or Y
Solution: Ensure that X and Y have no missing values. If required, use the Fill Missing
function to impute values for X (use caution!). For missing Y-values, it is suggested to keep
these rows out of the calculation.
Number of X rows does not match number of Y rows
Solution: Ensure that the row set dimensions of X match the row set dimensions of Y.
Non-numerical values in Response (Y)
Solution: Ensure that all values in the response (Y) column(s) are numerical.
To keep track of row and column exclusions, the model inputs tab provides a warning to
users that exclusions have been defined. See automatic keep outs for more details.
18.3.2 Options
This tab provides options for choosing the SVM type of regression to use, either epsilon-SVR
or nu-SVR, from the drop-down list next to SVM type. The kernel type used to
determine the hyperplane that best models the data can be selected from the drop-down
list. The default setting of Radial basis function is a simple kernel that can model complex data.
Support Vector Machine Options
Linear
Polynomial
Radial basis function
Sigmoid
For a polynomial kernel type, the degree of the polynomial should be defined.
The epsilon-SVR has an input parameter named epsilon, which defines the width of the
epsilon-insensitive zone used to fit the training data. Epsilon must be greater
than 0.
The nu-SVR has the parameter nu, which lies in the range 0-1 and serves as an upper bound
on the fraction of training errors and a lower bound on the fraction of support vectors.
Support Vector Machine Options for epsilon-SVR
In the Options tab the Grid Search button is available. Clicking on the Grid
Search button will open a dialog for grid search.
The dialog asks for input for the parameters Gamma and C in the case of epsilon-SVMR, and
Gamma and Nu in the case of nu-SVMR. It has been reported in the literature that an
exponentially growing sequence of the parameters is good as a first coarse grid search. This
is why the inputs are given on the log scale; in the grid table, however, the actual
values are given. It is recommended to use cross validation in grid search to avoid overfitting
when many combinations of the parameters are tried. After an initial grid search, the search
may be refined with smaller ranges for the parameters once the best range has been found.
Click on the Start button for the calculations to commence. Note that it is possible to click on
Stop during the computations, so that if the results become worse for higher parameter
values one may stop to save time. The default is to start with five levels of each
parameter. Click on one (the “best”) value in the grid after completion to see detailed
results. The SVs column lists how many support vectors were selected; this depends on the
epsilon or nu value and should be considered relative to the number of samples in the data.
Click on Use setting to return to the previous dialog and run the SVMR again with these
parameter settings. Notice that since the cross validation segments are selected randomly,
the RMSE and the R-square from validation may differ in the second run; this is a function of
the distribution of the samples.
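The exponentially growing coarse grid described above can be sketched as follows; the particular ranges and the five-level default are illustrative, not the dialog’s fixed values:

```python
import numpy as np

# Sketch of a coarse log-scale grid for (gamma, C) in epsilon-SVMR.
log2_gamma = np.linspace(-15.0, 3.0, 5)   # grid inputs given on the log2 scale
log2_C     = np.linspace(-5.0, 15.0, 5)
grid = [(2.0 ** g, 2.0 ** c) for g in log2_gamma for c in log2_C]
# 25 (gamma, C) cells; score each with cross-validated RMSE, pick the best
# cell, then repeat with a narrower grid around it.
```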
18.3.4 Weights
If the analysis calls for variables to be weighted to make realistic comparisons among them
(particularly useful for process and sensory data), click on the Weights tab and the
following dialog box will appear.
Support Vector Machine Weights
Another feature of the Advanced tab is the ability to use the results matrix of another
analysis as weights, using the Select Results Matrix button. This option provides
an internal project navigator for selecting the appropriate results matrix to use as a weight.
The dialog box for the Advanced option is provided below.
SVM Advanced Weights Option
Once the weighting and variables have been selected, click Update to apply them.
18.3.5 Validation
Validation is an important part of any method applied in modeling data. Settings for the
Validation of the SVR are set under the Validation tab as shown below. First select to cross
validate the model by checking the check box. The number of segments to use can be
chosen in the segments entry. Cross validation is helpful in model development but should
not be a replacement for full model validation using a test set. In the case of SVR, test set
validation is performed using the options under Tasks - Predict - SVR Prediction.
Support Vector Machine Validation
Autopretreatment may be used with SVR. This allows a user to automatically apply the
transforms used during the calibration phase of the SVR model to new samples during the
prediction phase.
Support Vector Machine Autopretreatment
When all of the parameters have been defined, the SVR is run by clicking OK. A new node,
SVR, is added to the project navigator with a folder for Data, and another for Results.
More details regarding Support Vector Machine classification are given in the section SVM
Classify or in the link given under License.
The SVM prediction results are given in a new matrix in the project navigator named
Predicted_Range. The matrix holds the predicted value for each sample.
Support vectors
Parameters
Probabilities
Prediction
Diagnostics
When an SVM Regression model is created a new node is added in the project navigator
with a folder for the data used in the model, and the results folder.
The results folder has the following matrices:
SVMR node
18.5.2 Parameters
The parameters matrix carries information on the following parameters for all the identified
classes:
SVM type
Kernel type - as defined in the options for the SVM learning step
Degree - as defined in the options for the SVM learning step
Gamma - the kernel gamma value set in the options
Offset
Classes - Relevant for SVM Classification only
SV Count - the number of support vectors needed for the regression model of the
data
Labels - Relevant for SVM Classification only
Numbers - Relevant for SVM Classification only
Parameters matrix
18.5.3 Probabilities
The probabilities matrix has three rows: the Rho value, and the probabilities A and B.
Probabilities matrix
18.5.4 Diagnostics
Diagnostics matrix
The Diagnostics subnode holds RMSEC, RMSECV and the corresponding correlations
between predicted and reference. Ideally the validation figures of merit should be close to
those from calibration; if not, it indicates that the model was overfitted in the calibration stage.
18.5.5 Prediction
The prediction matrix contains the predicted value for each sample in the training set.
Prediction
18.5.7 Predicted values after applying the SVM model on new samples
After an SVM model has been applied to predict new samples from a data matrix in the
project (Tasks - Predict - SVR Prediction), a new matrix with the predicted values is added to
the project navigator. The name given by default is Predicted.
Predicted
18.7. Bibliography
Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, A Practical Guide to Support Vector
Classification, last updated: May 19, 2009, accessed August 27, 2009.
http://www.csie.ntu.edu.tw/~cjlin
T. Czekaj, W. Wu and B. Walczak, About kernel latent variable approaches and SVM, J.
Chemom., 19, 341-354 (2005).
J. A. Fernandez Pierna, V. Baeten, A. Michotte Renier, R. P. Cogdill and P. Dardenne,
Combination of support vector machines (SVM) and near-infrared (NIR) imaging
spectroscopy for the detection of meat and bone meal (MBM) in compound feeds, J.
Chemom., 18, 341-349 (2004).
A. I. Belousov, S. A. Verzakov and J. von Frese, Applicational aspects of support vector
machines, J. Chemom., 16, 482-489 (2002).
19. Multivariate Curve Resolution
19.1. Multivariate Curve Resolution (MCR)
MCR methods may be defined as a group of techniques which aim to recover the
concentration profiles (pH profiles, time/kinetic profiles, elution profiles, chemical
composition changes, …) and response profiles (spectra, voltammograms, …) of the
components in an unresolved mixture, using a minimal number of assumptions about the
nature and composition of these mixtures. MCR methods can be easily extended to the
analysis of many types of experimental data, including multiway data.
Theory
Usage
Plot Interpretation
Method reference
MCR basics
What is MCR?
Data suitable for MCR
Purposes of MCR
Limitations of PCA
The Alternative: Curve resolution
Ambiguities and constraints in MCR
Rotational and intensity ambiguities in MCR
Constraints in MCR
What is a constraint?
When to apply a constraint?
Constraint types in MCR
Non-negativity
Unimodality
Closure
Other constraints
MCR and 3-D data
Algorithm implemented in The Unscrambler®: Alternating Least Squares (MCR-ALS)
Initial estimates for MCR-ALS
Computational parameters of MCR
Constraint settings are known beforehand
How to tune sensitivity to pure components?
When to tune sensitivity up or down?
Main results of MCR
Residuals
Estimated concentrations
Estimated spectra
Practical use of estimated concentrations and spectra
The matrix X of raw data (spectra) is decomposed into two matrices: the concentrations, C,
and the sources, S, so that X ≈ C Sᵀ plus residuals.
The sizes must be compatible: I represents the number of samples, N the number of
sources and J the number of spectral/signal variables, so that X is I×J, C is I×N and S is J×N.
This can also be explained by an example. The spectra of some samples are decomposed
into concentrations and sources or single component spectra.
MCR principles: Example
Spectra
Decomposition
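The decomposition illustrated above can be sketched with made-up dimensions (the numbers and random profiles are purely illustrative):

```python
import numpy as np

# Sketch of the bilinear MCR model: X (I x J) mixture spectra are the product
# of C (I x N) concentration profiles and the transpose of S (J x N) spectra.
I, N, J = 6, 2, 50
rng = np.random.default_rng(1)
C = rng.uniform(0.0, 1.0, size=(I, N))   # concentration profiles
S = rng.uniform(0.0, 1.0, size=(J, N))   # pure-component ("source") spectra
X = C @ S.T                              # noise-free mixture data, shape (I, J)
```

Because X is built from N = 2 sources, its rank is 2; MCR aims to recover C and S from X alone.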
Purposes of MCR
Multivariate Curve Resolution has been shown to be a powerful tool to describe
multicomponent mixture systems through a bilinear model of pure component
contributions. MCR, like PCA, assumes the fulfillment of a bilinear model, i.e. X = C Sᵀ + E.
Bilinear model
Comparison of constraints
Constraint     PCA          MCR
T              Orthogonal   T = C
Resolution     No           Yes
Limitations of PCA
PCA produces an orthogonal bilinear matrix decomposition, where components or factors
are obtained in a sequential way explaining maximum variance. Using these constraints plus
normalization during the bilinear matrix decomposition, PCA produces unique solutions.
These ‘abstract’ unique and orthogonal (independent) solutions are very helpful in deducing
the number of different sources of variation present in the data and, eventually, they allow
for their identification and interpretation. However, these solutions are ‘abstract’ solutions
in the sense that they are not the ‘true’ underlying factors causing the data variation, but
orthogonal linear combinations of them.
The intensity ambiguity can be written as X = C Sᵀ = (C K⁻¹)(S K)ᵀ, where K is a diagonal
matrix whose entries ki are scalars and n refers to the number of components. Each
concentration profile of the new C’ = C K⁻¹ matrix would have the same shape as the real
one, but be ki times smaller, whereas the related spectra of the new S’ = S K matrix would be
equal in shape to the real spectra, though ki times more intense.
Constraints in MCR
Although resolution does not require previous information about the chemical system under
study, additional knowledge, when it exists, can be used to tailor the sought pure profiles
according to certain known features and, as a consequence, to minimize the ambiguity in
the data decomposition and in the results obtained.
The introduction of this information is carried out through the implementation of
constraints.
What is a constraint?
A constraint can be defined as any mathematical or chemical property systematically fulfilled
by the whole system or by some of its pure contributions. Constraints are translated into
mathematical language and force the iterative optimization to model the profiles respecting
the conditions desired.
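As an illustration of how a constraint enters the optimization, here is a bare-bones alternating least squares loop with non-negativity imposed by clipping. This is a didactic sketch only; the MCR-ALS implementation in the product is more elaborate:

```python
import numpy as np

# Didactic MCR-ALS sketch: alternate least-squares updates of C and S,
# forcing the non-negativity constraint after each update by clipping at zero.
def mcr_als(X, S0, n_iter=50):
    S = S0.copy()
    for _ in range(n_iter):
        C = np.clip(X @ np.linalg.pinv(S.T), 0.0, None)    # X ~ C S.T, solve for C
        S = np.clip((np.linalg.pinv(C) @ X).T, 0.0, None)  # then solve for S
    return C, S
```

Each clip is where the constraint "forces the iterative optimization to model the profiles respecting the conditions desired"; other constraints would replace or accompany the clipping step.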
Unimodality
The unimodality constraint allows the presence of only one maximum per profile.
This condition is fulfilled by many peak-shaped concentration profiles, like chromatograms,
by some types of reaction profiles and by some instrumental signals, like certain
voltammetric responses.
It is important to note that this constraint applies not only to peaks, but also to profiles that
have a constant maximum (plateau) or a monotonically decreasing tendency. This is the case
for many monotonic reaction profiles that show only the decay or the emergence of a
compound, such as the most protonated and deprotonated species in an acid-base titration,
respectively.
Closure
The closure constraint is applied to closed reaction systems, where the principle of mass
balance is fulfilled. With this constraint, the sum of the concentrations of all the species
involved in the reaction (the suitable elements in each row of the C matrix) is forced to be
equal to a constant value (the total concentration) at each stage in the reaction. The closure
constraint is an example of equality constraint.
In practice, the closure constraint in MCR forces the sum of the concentrations of all the
mixture components to be equal to a constant value (the total concentration) across all
samples included in the model.
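A simple proportional implementation of this equality constraint can be sketched as follows (one common way to impose closure; the function name is ours):

```python
import numpy as np

# Sketch of the closure (mass-balance) constraint: rescale each row of C
# (one sample / reaction stage) to sum to the known total concentration.
def apply_closure(C, total=1.0):
    return total * C / C.sum(axis=1, keepdims=True)

C = np.array([[0.2, 0.6],
              [1.0, 3.0]])
C_closed = apply_closure(C, total=2.0)   # each row now sums to 2.0
```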
Other constraints
Apart from the three constraints previously defined, other types of constraints can be
applied. See literature on curve resolution for more information about them.
Local rank constraints
Particularly important for the correct resolution of two-way data systems are the so-called
local rank constraints, selectivity and zero-concentration windows. These
types of constraints are associated with the concept of local rank, which describes
how the number and distribution of components vary locally along the data set.
The key constraint within this family is selectivity. Selectivity constraints can be used
in concentration and spectral windows where only one component is present to
completely suppress the ambiguity linked to the complementary profile.
Multivariate Curve Resolution
Residuals are error measures; they tell how much variation remains in the data after
k components have been estimated;
Estimated concentrations describe the estimated pure components’ profiles across
all the samples included in the model;
Estimated spectra describe the instrumental properties (e.g. spectra) of the
estimated pure components.
Residuals
The residuals are a measure of the fit (or rather, lack of fit) of the model. The smaller the
residuals, the better the fit. MCR residuals can be studied from three different points of
view.
Variable Residuals
is a measure of the variation remaining in each variable after k components have
been estimated. In The Unscrambler®, the variable residuals are plotted as a line
plot where each variable is represented by one value: its residual in the k-
component model.
Sample Residuals
is a measure of the distance between each sample and its model approximation. In
The Unscrambler®, the sample residuals are plotted as a line plot where each
sample is represented by one value: its residual after k components have been
estimated.
Total Residuals
these results express how much variation in the data remains to be explained after k
components have been estimated. Their role in the interpretation of MCR results is
similar to that of variances in PCA. They are plotted as a line plot showing the total
residual after a varying number of components (from 2 to n+1).
The three types of MCR residuals are available for MCR Fitting: these are the actual values of
the residuals after the data have been resolved to k pure components.
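Assuming the usual bilinear MCR convention D ≈ C Sᵀ (samples × variables; the notation is an assumption, not taken from this manual), the three residual views can be sketched as sums of squared entries of the residual matrix:

```python
import numpy as np

# Hypothetical resolved model: D (samples x variables) approximated by C @ S.T,
# with k = 2 estimated pure components.
rng = np.random.default_rng(0)
D = rng.random((6, 10))
C = rng.random((6, 2))   # estimated concentrations
S = rng.random((10, 2))  # estimated spectra

E = D - C @ S.T                            # residual matrix after k components
variable_residuals = (E ** 2).sum(axis=0)  # one value per variable
sample_residuals = (E ** 2).sum(axis=1)    # one value per sample
total_residual = (E ** 2).sum()            # single global measure of misfit

# The total residual is the sum over either breakdown:
assert np.isclose(total_residual, variable_residuals.sum())
assert np.isclose(total_residual, sample_residuals.sum())
```

The squared-sum convention here is one common choice; the software may scale or normalize the reported values differently.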
Estimated concentrations
The estimated concentrations show the profile of each estimated pure component across
the samples included in the MCR model. In The Unscrambler®, the estimated concentrations
are plotted as a line plot where the abscissa shows the samples, and each of the k pure
components is represented by one curve.
The k estimated concentration profiles can be interpreted as k new variables showing how
much each of the original samples contains of each estimated pure component.
Estimated spectra
The estimated spectra show the estimated instrumental profile (e.g. spectrum) of each pure
component across the X-variables included in the analysis. In The Unscrambler®, the
estimated spectra are plotted as a line plot where the abscissa shows the X-variables, and
each of the k pure components is represented by one curve. The k estimated spectra can be
interpreted as the spectra of k new samples consisting each of the pure components
estimated by the model. Comparison of the spectra of the original samples to the estimated
spectra may be useful for finding out which of the actual samples are closest to the pure
components.
Note: Estimated spectra are unit-vector normalized.
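Such a comparison can be sketched with cosine similarity, which is a natural choice here since the estimated spectra are unit-vector normalized. The function name and similarity measure below are illustrative, not Unscrambler features:

```python
import numpy as np

def closest_pure_component(X, S):
    """For each sample spectrum (row of X), return the index of the most
    similar estimated pure spectrum (row of S) by cosine similarity."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)  # unit-vector normalized
    return np.argmax(Xn @ Sn.T, axis=1)

X = np.array([[1.0, 0.1, 0.0],    # sample resembling pure component 0
              [0.0, 0.2, 1.0]])   # sample resembling pure component 1
S = np.array([[0.9, 0.2, 0.1],
              [0.1, 0.1, 0.95]])
print(closest_pure_component(X, S))  # -> [0 1]
```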
The sections that follow explain what can be done to improve the quality of a model. It may
take several iterations before obtaining a satisfying model.
Once the model is found satisfactory, interpretation of the MCR results in regards to
information on the system under study (e.g. chemical reaction mechanism or process) is the
next step. The last section hereafter will show how to do it.
The main tool for diagnosing noisy variables in MCR is the plot of variable
residuals, accessed with the menu option Plot - Variable Residuals, or by selecting this
plot from MCR - Plots in the project navigator.
Any variable that sticks out on the plots of variable residuals (either with MCR fitting or PCA
fitting) may be disturbing the model, thus reducing the quality of the resolution; try
recalculating the MCR model without that variable.
One can utilize estimated concentration profiles and other experimental information
to analyze a chemical/ biochemical reaction mechanism.
One can utilize estimated spectral profiles to study the mixture composition or even
intermediates during a chemical/biochemical process.
Note: What follows is not a tutorial. See the Tutorials chapter for more examples
and hands-on training.
Model Inputs
Options
Some important tips and warnings associated with the Model Inputs tab
The only requirements for MCR are actual numeric input data and at least
four samples and four variables. A warning is given when this is not the case:
Too many excluded samples/variables
Solution: Check that all samples/variables have not been excluded in a data set.
To keep track of row and column exclusions, the model inputs tab provides a warning to
users that exclusions have been defined. See automatic keep outs for more details.
19.3.2 Options
Select constraint options:
Non-negative concentrations
Non-negative spectra
Closure
Unimodality.
Information on those constraints can be found in the theory section: Constraints in MCR.
It is possible to tune the sensitivity using the field Sensitivity to pure components; read
more about how and when to do so in the theory chapter: How to tune sensitivity to pure
components?
The number of iterations can also be changed when detecting convergence is difficult. The
default setting is 50 iterations. Warnings will be added to the MCR results node if the
alternating least-squares calculation does not converge for the optimal and/or optimal plus
one number of pure components.
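The alternating least-squares loop with a capped iteration count can be sketched as follows. This is a minimal illustration under stated assumptions (function name, tolerance, and the simple clipping used for non-negativity are all assumptions, not The Unscrambler's implementation):

```python
import numpy as np

def mcr_als(D, C0, max_iter=50, tol=1e-7):
    """Minimal MCR-ALS sketch: alternately solve D ~ C @ S for S and C by
    least squares, clip to enforce non-negativity, and stop when the fit
    no longer improves by more than `tol`."""
    C = C0.copy()
    prev_fit = np.inf
    for it in range(max_iter):
        S = np.linalg.lstsq(C, D, rcond=None)[0]        # spectra, D ~ C @ S
        S = np.clip(S, 0.0, None)                       # non-negative spectra
        C = np.linalg.lstsq(S.T, D.T, rcond=None)[0].T  # concentrations
        C = np.clip(C, 0.0, None)                       # non-negative concentrations
        fit = np.linalg.norm(D - C @ S)
        if abs(prev_fit - fit) < tol:
            return C, S, it + 1, True                   # converged
        prev_fit = fit
    return C, S, max_iter, False  # not converged: a warning should be raised

rng = np.random.default_rng(1)
C_true = rng.random((8, 2))
S_true = rng.random((2, 12))
D = C_true @ S_true                   # exact two-component data
C, S, n_iter, converged = mcr_als(D, C_true + 0.01 * rng.random((8, 2)))
print(converged, n_iter)
```

On exact low-rank data such as this, the loop converges in a handful of iterations; noisy real data may need the iteration limit raised, as described above.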
MCR Options
Component concentrations
This plot displays the estimated concentrations of two or more constituents across all the
samples included in the analysis. Each plotted curve is the estimated concentration profile of
one given constituent.
The curves are plotted for a fixed number of components in the model; note that in MCR,
the number of model dimensions (components) also determines the number of resolved
constituents. Therefore, if the number of components is tuned up or down with the toolbar
buttons, this will also affect the number of curves displayed. For
instance, if the plot currently displays two curves, clicking the right arrow toolbar button will update the
plot to three curves representing the profiles of three constituents in a 3-dimensional MCR
model.
Component concentrations
Component spectra
This plot displays the estimated spectra of two or more constituents across all the variables
included in the analysis. Each plotted curve is the estimated spectrum of one pure
constituent.
The curves are plotted for a fixed number of components in the model; note that in MCR,
the number of model dimensions (components) also determines the number of resolved
constituents. Therefore, if the number of components is tuned up or down with the toolbar
buttons, this will also affect the number of curves displayed. For
instance, if the plot currently displays two curves, clicking on the right arrow will update the
plot to three curves representing the spectra of three constituents in a 3-dimensional MCR
model.
Note: the star button enables one to go back to the suggested number of
components for the model.
Component spectra
Sample residuals
This plot displays the residuals for each sample for a given number of components in an MCR
model.
The size of the residuals is displayed on the scale of the vertical axis. The plot contains one
point for each sample included in the analysis; the samples are listed along the horizontal
axis.
The sample residuals are a measure of the distance between each sample and the MCR
model. Each sample residual varies depending on the number of components in the model
(displayed in parentheses after the name of the model, at the bottom of the plot). The
number of components for which the residuals are displayed can be tuned up or down using
the toolbar buttons.
The size of the residuals gives an indication about the misfit of the model. It may be a good
idea to compare the sample residuals from an MCR fitting to a PCA fit on the same data.
Since PCA provides the best possible fit along a set of orthogonal components, the
comparison tells how well the MCR model is performing in terms of fit.
Sample residuals
Total residuals
This plot displays the total residuals (all samples and all variables) against increasing number
of components in an MCR model.
The size of the residuals is displayed on the scale of the vertical axis. The plot contains one
point for each number of components in the model, starting at two. The total residuals are a
measure of the global fit of the MCR model, equivalent to the total residual variance
computed in projection models like PCA.
It is a good idea to compare the total residuals from an MCR fitting to a PCA fit on the same
data. Since PCA provides the best possible fit along a set of orthogonal components, the
comparison tells how well the MCR model is performing in terms of fit.
Total residuals
Variable residuals
This plot displays the residuals for each variable for a given number of components in an
MCR model.
The size of the residuals is displayed on the scale of the vertical axis. The plot contains one
point for each variable included in the analysis; the variables are listed along the horizontal
axis.
The variable residuals are a measure of how well the MCR model takes into account each
variable; the better a variable is modeled, the smaller the residual. Variable residuals vary
depending on the number of components in the model (displayed in parentheses after the
name of the model, at the bottom of the plot). The number of components for which the
residuals are displayed can be tuned up or down, using the toolbar buttons.
The size of the residuals tells about the misfit of the model. It is a good idea to compare the
variable residuals from an MCR fitting to a PCA fit on the same data. Since PCA provides the
best possible fit along a set of orthogonal components, the comparison tells how well the
MCR model is performing in terms of fit.
Variable residuals
19.6. Bibliography
R. Tauler, S. Lacorte and D. Barceló, Application of multivariate self-modeling curve
resolution for the quantitation of trace levels of organophosphorus pesticides in natural
waters from interlaboratory studies, J. Chromatogr. A, 730, 177–183 (1996).
20. Hierarchical Modeling
20.1. Hierarchical Modeling
Hierarchical Modeling (HM) is not a method defined by a specific algorithm, but a
predefined set of Unscrambler models run in a predefined order (i.e. a hierarchy). At each
stage of the hierarchy, a decision must be made based on Boolean logic, which directs the
modeling to the next step. The process stops when a final decision has been reached in the
logic.
Theory
Usage
Prediction
Overall workflow
Setup
Expected Scenarios
The Classification - Classification Hierarchy
The Classification – Prediction Hierarchy
The Prediction – Prediction Hierarchy
20.2.2 Setup
HM can be thought of as a cascading tree of decision making. It is expected that all
projection, prediction and classification models generated in The Unscrambler X are
candidates for hierarchical model development. The HM module supports up to 10 levels of
hierarchy and multiple models can be included within each level.
Within each level, one or more models can be defined based on the output from the
previous level. Alternatively, the output may be satisfactory and simply reported, or it may be ambiguous
or out of limits, in which case a warning can be displayed or the HM can be told to exit. This
behaviour is completely in the hands of the user, who has to make sure that the provided
sequence of steps and the limits used are sensible.
Also, for each model within each level, an ordered list of logical conditions is specified by
the user and executed in an IF-ELSE manner. This means that if the first condition is satisfied,
any remaining conditions will not be executed. It follows that the order of the conditions is
important. If for instance condition 1 finds that the predicted response is out of limits, a
condition 2 testing for e.g. leverage of the predicted sample will never be executed.
Note that the program will not attempt to detect or fix ambiguous logic in subsequent
conditions. If condition 1 states that a PLS model should be calculated if a parameter is
within limits, and condition 2 states that a PCA model should be calculated for the same
parameter values, only the PLS model will ever be calculated due to the order of the
conditions.
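The ordered IF-ELSE evaluation can be sketched as follows; the condition functions, parameter names and action labels are hypothetical, chosen only to illustrate that the first satisfied condition wins:

```python
def evaluate_conditions(sample, conditions):
    """Evaluate an ordered list of (test, action) pairs in IF-ELSE fashion:
    the first test that passes determines the action, and any remaining
    tests are never executed."""
    for test, action in conditions:
        if test(sample):
            return action
    return "No Evaluation"  # default ELSE-style reporting action

conditions = [
    (lambda s: not (0 <= s["y_pred"] <= 100), "Warn: out of limits"),
    (lambda s: s["leverage"] > 0.5, "Warn: high leverage"),
    (lambda s: True, "Apply 2nd-level model"),
]

# The out-of-limits test fires first, so the leverage test never runs:
print(evaluate_conditions({"y_pred": 150, "leverage": 0.9}, conditions))
```

Swapping the order of the first two entries would change the outcome for this sample, which is exactly the ordering effect described above.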
It should be noted here that classes 1-5 can be separated uniquely from every other class.
Classes 6-8 cannot be separated from each other, but can from all other classes and classes
9-10 cannot be separated from each other, but can be separated from all other classes,
using the 1st level (global) model.
The next step is to define a 2nd level hierarchy in which two models are defined:
a. A separation model for classes 6-8
b. A separation model for classes 9-10
Continuing on with the example, say for instance a new model (SIMCA/LDA/SVM/PLS-DA)
can be defined to separate class 8 from classes 6 and 7, then a 3rd level is required, in which
a new model is defined to separate classes 6 and 7. The 2nd level also contains a model for
separating classes 9 and 10. The 2nd level is shown in the figure below.
Resolving ambiguities with a second level of model hierarchy
The final step in this particular process is to define a 3rd level of hierarchy with a single
model for separating class 6 and 7 from each other. Therefore in summary, this process
requires 3 levels of hierarchy; the first contains a single “global” model that uniquely
separates classes 1-5 but cannot uniquely separate classes 6-8 from each other or classes 9-
10 from each other. The second level has two models one for separating class 8 from classes
6-7 and one for separating class 9 from 10. The third level separates classes 6 and 7.
There is also the situation where a sample does not classify into any of the models. The entire
process described above is shown in the following flow diagram.
Expected workflow of a hierarchical model for separating 10 classes uniquely
If a sample is uniquely classified, then (at least) one specified prediction model
is applied to the sample.
Suppose there are five (5) groups to be classified and in this case, assume that no
ambiguities are present in the classification step, i.e. the 1st level model uniquely separates
classes 1-5. For each class there are separate sets of prediction models (PLS/PCR/MLR)
assigned to each class (it may also be feasible to have a PCA projection model here).
The figure below shows an example of a Classification - Prediction Hierarchy
The Classification – Prediction hierarchy
If predicted y lies between 0 and some upper limit a in the 1st level, then use a local
regression model developed for that region in the 2nd level.
If predicted y lies between a and some upper limit b in the 1st level, then use a local
regression model developed for that region in the 2nd level.
If predicted y lies between b and some upper limit c in the 1st level, then use a local
regression model developed for that region in the 2nd level.
If predicted y lies above some upper limit c or below some predefined lower limit in
the 1st level, then terminate operation and provide a warning that the value is
outside the normal calibration range.
It is of course possible to define specific steps to be taken if the predicted value is close to a
junction between two models. Then the prediction intervals above should be shrunk
accordingly, so that no intervals overlap.
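The interval-based dispatch above can be sketched like this, with placeholder limits a, b, c and model labels (all of which are hypothetical, not values from the manual):

```python
def select_local_model(y_pred, a=10.0, b=20.0, c=30.0, lower=0.0):
    """Dispatch a 1st-level predicted value to a 2nd-level local model based
    on which non-overlapping interval it falls in; values outside the
    calibration range terminate with a warning."""
    if y_pred < lower or y_pred > c:
        return "warning: outside calibration range"
    if y_pred <= a:
        return "local model [0, a]"
    if y_pred <= b:
        return "local model (a, b]"
    return "local model (b, c]"

print(select_local_model(15.0))   # falls in (a, b]
print(select_local_model(35.0))   # above c, so a warning is reported
```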
Defining actions
Classification setup
Prediction setup
Projection setup
Report setup
Setting up a hierarchical model
Add level
Define Conditions and Actions
Expression Builder
Define, Remove and Report buttons
Classification
Prediction
Projection
Report
The action setup dialog is slightly different depending on which type of action to be
performed. These setup dialogs are described in the next sub-sections.
Classification setup
A classification model is defined using the following dialog window:
Add classification model dialog
Add a name to the Method name frame. This will be displayed in the HM model structure
and also in the output matrix of HM Predict. You should choose an informative name to
make interpretation of the hierarchical model and the results easier.
Select the type of classification model in the Classification type frame. For SIMCA
classification, any number of bilinear models (PCA, PCR, PLSR) can be included, while LDA
and SVM classification expects a single model.
The individual models are defined in the ‘Add models for classification’ frame. A drop-down
box will list all available models from the project navigator. Once a model is selected, verify
that the correct auto-pretreatments will be performed and that the correct settings are
selected for centering and the number of components.
Highlighting an already added model will activate the Remove and Details buttons for that
model. The first button will remove the model from the list of added models and clear the
list of selected output matrices (see below). The Details button will bring up a separate
dialog listing details about the selected model.
The complete list of available output matrices is shown in the bottom left portion of the
window. Use the arrow buttons to select output data. These will be saved in a Results matrix
when HM Prediction is applied and may be subjected to conditional statements in the next
level of the hierarchy. Make sure to include all necessary output data that may be of interest
later, as these will otherwise be lost.
The available model outputs from SIMCA classification are class memberships at different
significance levels between 0.1-25%. In addition, the ‘X Residuals’, ‘Si/S0’ and ‘Leverage’
values can be selected for each of the individual models. For LDA and SVM classification, the
only available output is the predicted class.
Prediction setup
The following dialog is used for prediction type actions:
Refer to the Classification setup section for an explanation of the different frames and
buttons. The model drop-down box will be populated with supported prediction models
from the project navigator. These are PLSR, PCR and MLR models. The available outputs
from PLSR and PCR models are
Projection setup
Only a single PLSR, PCR or PCA model can be used for projection, and the dialog is therefore
simpler:
Add projection model dialog
Refer to the Classification setup section for an explanation of the different frames and
buttons.
Available outputs are
Projected Scores
Projected Hotelling’s T²
Projected Sample Leverage
Projected X Sample Residuals
Projected Explained X Sample Variance
Report setup
Once all the desired levels of the hierarchy have been modeled, or if a conditional statement
causes the modeling to stop prematurely due to an undesired outcome, a reporting action
will define how the results are displayed. Reporting involves coloring the output in the
results table and optionally adding an informative tool tip comment.
Contrary to the classification, prediction and projection action types, there is no additional
output being produced by a reporting action. This means that there can be no additional
hierarchical levels based on a reported result. An example Report setup dialog is given
below.
Example of report setup dialog
This example condition is the default ‘No Evaluation’ condition, which is evaluated at the
end of a conditional statement if none of the other conditions hold TRUE (If you are familiar
with programming syntax, this is the ELSE statement). The dialog has a Method name box,
where the name of the reporting action can be specified. The Expressions column lists the
conditions that will lead to the current reporting action. Available Reporting Options are
AlarmHigh: Red
AlarmLow: Red
Normal: Green
WarningHigh: Yellow
WarningLow: Yellow
Alarm: Red
Warning: Yellow
The standard colors may be edited by clicking on the “Edit standard states” button. This will
bring up the “Define Reporting States” dialog with the 7 standard sub-options and their
associated colors indicated:
The Define Reporting States dialog for standard states
Click on any of the colored boxes to bring up a color editing dialog. Press OK to
save any changes or Cancel to discard.
A Custom list of sub-options with associated colors can similarly be set up by pressing the
“Define custom states” button. This dialog allows you to define the number of reporting
states, their names and their associated colors.
The Define Reporting States dialog for custom states
This dialog can be inactivated by checking “Do not show this next time”. On clicking OK, the
Hierarchical Model dialog will open, initially with no levels specified. This will be the first
dialog shown if the information dialog has been inactivated.
Initial setup dialog screen for Hierarchical Modeling
The Hierarchical Levels frame will be populated with different conditions and actions at
multiple levels once these have been specified.
Add level
Add levels using the ‘Add Level’ button. If no levels have been specified, a dialog will open
with the options to specify a Classification, Prediction, or Projection model as the global
(Level 1) model. Depending on your selection, the relevant setup dialog will open, as
described in the previous section.
Define Action dialog box
Once the first level model(s) has been specified, click OK to add the first level to the
Hierarchical Levels frame of the main HM setup dialog. Because a hierarchical model
requires at least two levels, click the ‘Add level’ button again to set up the second level.
Clicking this button for any level between 2-10 will bring up the “Define Conditions and
Actions” dialog.
Expression Builder
To add a new condition, specify a condition name and press the Expression Builder button.
Alternatively, click on an existing condition (row) to populate the condition name and
expression with existing values. A unique classification has a value of 1 (TRUE) for the class in
question and 0 (FALSE) for all other classes.
Expression Builder Dialog for a SIMCA model
For SIMCA models it is possible to define different combinations of classes to evaluate. For
instance, a separate action can be specified for the case where a sample is ambiguously
classified into two classes. Also, for SIMCA, Prediction and Projection models, multiple
statements can be defined and connected with AND, OR or XOR:
Multiple conditions evaluate in a greedy manner, starting with the first two statements and
comparing with the remaining statements one at a time. E.g. an expression “cond1 AND cond2
comparing with remaining statements one at the time. E.g. an expression “cond1 AND cond2
OR cond3” will evaluate to TRUE if cond2 and cond3 are TRUE while cond1 is FALSE. This is
because “cond1 AND cond2” will be evaluated first, as in the expression “(cond1 AND cond2)
OR cond3”. To add multiple conditions, use the check box to activate a new statement. The
prediction model expression below evaluates to TRUE if predicted octane is between 88 and
90, while the deviation is less than 3.
Expression Builder Dialog with multiple statements
The user must take care not to build meaningless statements, such as “X > 1 AND X < 0”,
which will always evaluate to FALSE.
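The left-to-right (no operator precedence) evaluation described above can be sketched as:

```python
def evaluate_left_to_right(values, operators):
    """Combine boolean statements strictly left to right, as described:
    'cond1 AND cond2 OR cond3' is treated as '(cond1 AND cond2) OR cond3'.
    No operator precedence is applied."""
    ops = {"AND": lambda a, b: a and b,
           "OR": lambda a, b: a or b,
           "XOR": lambda a, b: a != b}
    result = values[0]
    for op, v in zip(operators, values[1:]):
        result = ops[op](result, v)
    return result

# cond1 = FALSE, cond2 = TRUE, cond3 = TRUE:
# (FALSE AND TRUE) OR TRUE evaluates to TRUE, matching the example above.
print(evaluate_left_to_right([False, True, True], ["AND", "OR"]))
```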
Once the expression is set up, press OK to close the expression builder dialog. Then, to save
the expression as a new condition press New, or press Update to modify an existing
condition.
Edit Level
Once at least one level has been added to the hierarchical model, it is possible to return to
the Define Conditions And Actions dialog and change any of the settings. Click on the level of
interest and verify that the correct level is displayed in the Selected Level box on the right
hand side. Click the Edit Level button to bring up a dialog to modify the settings.
Note that changing the output or the conditions in one of the lower levels may break
dependencies in some of the higher levels. Edit such lower levels with extreme caution.
Remove Level
Clicking on Remove Level will bring up a warning that the currently selected level will be
removed permanently. Note that if a lower level is deleted, all higher levels are necessarily
deleted as well.
Remove Level warning dialog
Details
The Details button will bring up a dialog with additional information about the currently
selected level. Click on OK to close the window and return to the main HM builder dialog.
Preview
The Preview button brings up a dialog with an expandable tree-structure with information
about the complete hierarchical model. Nodes in the tree containing additional sub-
branches can be expanded (or collapsed) by clicking on the ‘+’ (or ‘-’) symbol at the junction
of the node. Click the Expand All button to expand all sub-branches in the tree. Click OK to
close the dialog and return to the main HM builder dialog.
Example of a HM Preview tree dialog
Right click on the HM model node in the project navigator to Edit, Rename, Delete or Save
the model. On saving a HM model, all required classification, prediction and projection
models are saved in the same project file.
This will open the HM Predict dialog. A drop-down box will be populated with all available
HM models in the current project. Select the model of interest and specify the data to apply
in the Data frame. Make sure to specify the correct Row and Column sets, or bring up the
Define Range dialog to specify the ranges if they are not pre-defined.
HM Predict dialog
When clicking OK, the number of columns specified is compared with the dimension of the
training data for the models in the first hierarchical level. An attempt to specify data with
incorrect dimensions will bring up a warning that the number of columns does not match
what is specified by the model:
Data size warning
Once the correct data are specified, click OK to start the hierarchical sequence of modeling
steps.
The Results table contains the specified output at each level for the conditions that
evaluated to TRUE for the samples in question. One column set for each level is defined by
default.
Toggle the toolbar icons to hide/show ranges. When ranges are not shown, the
colors of the Reported values are displayed. Click on a colored cell to display the tool-tip Alarm
state and comments for the individual output values.
HM Predict Results table with alarm states shown
21. Segmented Correlation Outlier Analysis
21.1. Segmented Correlation Outlier Analysis (SCA)
The SCA method provides a means to detect gross and subtle outliers in large spectral
data sets so that they can be removed objectively. Furthermore, SCA can also be used at run time
for outlier detection, as it is a modified PCA approach and fits well with the concept of
projection. This requires a Target Spectral Profile (TSP) to be defined and saved to a run time
model for application to new data.
Theory
Usage: Create model
Results
SCA Save Model
Usage: Prediction
Prediction Results
Method reference
Correlation calculations
The correlation calculations form the basis of SCA. A reference spectrum is used for this,
defined either as a single spectrum or by calculating the mean or median for a selected
number of spectra. Two types of correlation values will be used:
• Overall correlation: a single Pearson’s r² value for each spectrum, showing how
close the individual spectra are to a reference spectrum for outlier detection.
• Segmented correlation: performs localized correlation calculations for each segment
defined; for each sample, a value of correlation for each segment will be generated, and for
multiple samples, a matrix of local correlation values will be available.
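A sketch of the two correlation calculations. Splitting the wavelength axis into equal-width segments is an assumption for illustration; in practice the segmentation is user-defined:

```python
import numpy as np

def correlation_values(spectra, reference, n_segments):
    """Overall r-squared of each spectrum against the reference, plus a
    matrix of per-segment r-squared values (one row per sample)."""
    overall = np.array([np.corrcoef(s, reference)[0, 1] ** 2 for s in spectra])
    segments = np.array_split(np.arange(len(reference)), n_segments)
    segmented = np.array([
        [np.corrcoef(s[idx], reference[idx])[0, 1] ** 2 for idx in segments]
        for s in spectra
    ])
    return overall, segmented

rng = np.random.default_rng(2)
reference = rng.random(20)                # e.g. the mean library spectrum
good = reference + 0.01 * rng.random(20)  # close to the reference
odd = rng.random(20)                      # unrelated spectrum
overall, segmented = correlation_values(np.vstack([good, odd]), reference, 4)
print(overall.round(3), segmented.shape)
```

The first spectrum yields an overall r² near 1, while the unrelated one scores far lower, which is the basis for the outlier marking described below.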
PCA calculations
The segmented correlation values matrix calculated above will be used for PCA calculations.
The results will be similar to PCA, except they are for segments and not for individual
variables.
Outlier detection
All samples below the correlation limit in the overall correlation plot, as well as those with larger
than threshold T² or Q values, will be marked automatically, and any marked samples are
to be interpreted as outliers. Any or all samples can be unmarked using the regular tools, in
which case the lines in question in the SCA Overview plots are coloured grey, and the circles
are removed from all other plots. Clicking the Mark Outliers button will revert to the default
selection based on current correlation, T² and Q thresholds.
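The automatic marking rule can be sketched as a simple combination of the three thresholds; the limit values below are arbitrary placeholders, not the software defaults (apart from the 0.95 correlation limit mentioned elsewhere in this chapter):

```python
import numpy as np

def mark_outliers(overall_r2, t2, q, r2_limit=0.95, t2_limit=10.0, q_limit=1.0):
    """Flag any sample that falls below the correlation limit or exceeds
    either the T-squared or Q threshold."""
    overall_r2, t2, q = map(np.asarray, (overall_r2, t2, q))
    return (overall_r2 < r2_limit) | (t2 > t2_limit) | (q > q_limit)

# Second sample fails the correlation limit, third fails the T2 threshold:
flags = mark_outliers([0.99, 0.90, 0.98], [2.0, 3.0, 15.0], [0.1, 0.2, 0.3])
print(flags)
```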
Overall workflow
The following describes the overall workflow of SCA.
where the Conformity Index at each wavelength i is calculated as

CI_i = |A_i − Ā_i| / s_i

The vertical bars (| |) indicate the absolute value
CI_i = Conformity Index at wavelength i
A_i = Absorbance of the test material at wavelength i
Ā_i = Absorbance of the target spectrum profile (mean/median library spectrum) at wavelength i
s_i = Standard deviation of library absorbance at wavelength i
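The conformity index definitions above can be sketched numerically. This assumes the library mean as the target spectrum profile and the sample standard deviation of the library spectra at each wavelength (the `ddof=1` choice is an assumption):

```python
import numpy as np

def conformity_index(test_spectrum, library):
    """CI_i = |A_i - mean_i| / s_i at each wavelength i, where mean_i and
    s_i are computed column-wise from the library spectra."""
    library = np.asarray(library, dtype=float)
    mean = library.mean(axis=0)
    sd = library.std(axis=0, ddof=1)
    return np.abs(np.asarray(test_spectrum, dtype=float) - mean) / sd

# Three library spectra at two wavelengths; the test deviates at wavelength 2:
library = np.array([[1.0, 2.0],
                    [1.2, 2.2],
                    [0.8, 1.8]])
print(conformity_index([1.1, 2.6], library))  # -> [0.5 3. ]
```

A large CI value at a wavelength indicates that the test spectrum departs from the library there by many standard deviations.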
where the correlation at each wavelength i is calculated as

r_i = cov(X_i, Y) / (s_Xi · s_Y)

r_i = Correlation between wavelength i and the specified Y-variable
X_i = Wavelength array over all samples at wavelength i
Y = Selected Y-variable to calculate correlation against
s_Xi = Standard deviation of the X array at wavelength i
s_Y = Standard deviation of the Y-variable
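The wavelength-by-wavelength correlation with a Y-variable described above can be sketched as follows (a hypothetical implementation of the Pearson formula, not Unscrambler code):

```python
import numpy as np

def correlogram(X, y):
    """Pearson correlation of each wavelength column of X with y:
    r_i = cov(X_i, y) / (s_Xi * s_y), computed column-wise."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)            # center each wavelength column
    yc = y - y.mean()                  # center the Y-variable
    return (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))

# First wavelength tracks y perfectly; second is perfectly anti-correlated:
X = np.array([[1.0, 5.0],
              [2.0, 3.0],
              [3.0, 1.0]])
y = np.array([10.0, 20.0, 30.0])
print(correlogram(X, y))  # -> [ 1. -1.]
```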
individual spectra. The correlation values span the range from 0 to 1, and the default
correlation limit is 0.95.
Segmented Correlation Outlier Analysis - Scope
PCA Options
The Maximum components option allows the user to set the maximum number of Principal
Components to use for the analysis. The default value is 7; however, the upper bound
is the minimum of the number of samples minus one and the number of window segments.
The Cross Validation method is used when either there are not enough samples available to
make a separate test set, or for simulating the effects of different validation test cases, e.g.
systematically leaving samples out vs. randomly leaving samples out, etc.
The cross validation procedures associated with multivariate models are described in detail
in the chapter on the cross validation setup dialog.
Segmented Correlation Outlier Analysis - PCA Options
Correlogram
The Compute correlogram option gives users the ability to calculate a correlogram.
If the above option is enabled, the Response frame is enabled to select a matrix to be used.
Select pre-defined row and column ranges in the Rows and Cols boxes, or click the Define
button to perform the selection manually in the Define Range dialog.
Segmented Correlation Outlier Analysis - Correlogram
The SCA module allows one to save the entire model or separate models as a project. There are
several options for how the results file is saved. Depending on which option is used, the file
size can be reduced so that the saved model is best suited for use in conformity prediction. Select an
SCA model in the project navigator and right click to select Save Model.
Entire model
This saves all the results and supports all visualizations that are available when a
model is developed in The Unscrambler® X. This option does not allow recalculation
of the model, as is available in MLR, PLS, PCR and PCA models, but it allows saving
separate models. Use the option Number of Components to set the number of
components for a model to a value other than the optimal recommended number.
This number of components will then be used when the model is used for prediction
and/or classification. The Standard Deviations option helps to set the limits for the
trend plots around the average of mean Conformity Index values. The default value
is 3. The Spectral match limit option helps to set the limits for the overall correlation
plot. The default value is 0.95.
SCA model
This option saves the model containing only the data required for detecting
Influence outliers for the selected number of components or less. This model results
file does not include plots and some of the results matrices that are not used in the
prediction visualization. Use the option Number of Components to set the number
of components for a model to a value other than the optimal recommended
number. This number of components will then be used when the model is used for
prediction and/or classification.
Conformity model
This option saves the model containing only data required for detecting Conformity
Index outliers for the selected number of standard deviations. In the short model,
only the target spectrum profile and its confidence limits, conformity statistics and
conformity values are saved. No validation matrices are saved. The Standard
Deviations option helps to set the limits for the trend plots around the average of
mean Conformity Index values. The default value is 3.
Correlation (Spectral match)
This option saves the model containing only the data required for detecting
Conformity outliers at the specified limit. This model saves only the reference
spectrum and the overall correlation values. The Spectral match limit option helps
to set the limits for the overall correlation plot. The default value is 0.95.
Spectra
A plot of the samples used in the analysis is available with confidence limits. The
reference spectrum is also shown in this plot, highlighted in green, to show how all
other spectral samples behave with respect to it. The confidence limits can be set to K Std
Deviations from the reference spectrum, where K is an integer between 1 and 6. A
confidence limit is calculated for each wavelength and plotted along with the reference
spectrum.
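The per-wavelength limit calculation described above can be sketched as follows. This is an illustrative sketch only: it assumes the reference spectrum is the mean spectrum, which may not match the software's exact definition.

```python
import numpy as np

def conformity_limits(spectra, k=3):
    """Per-wavelength confidence limits at K standard deviations.

    Illustrative sketch: the reference spectrum is taken here as the mean
    spectrum, which may differ from The Unscrambler's definition.
    `spectra` is a samples x wavelengths array.
    """
    reference = spectra.mean(axis=0)      # reference spectrum
    sd = spectra.std(axis=0, ddof=1)      # standard deviation per wavelength
    return reference - k * sd, reference + k * sd

spectra = np.array([[1.0, 2.0, 3.0],
                    [1.2, 2.1, 2.9],
                    [0.9, 1.9, 3.1]])
lower, upper = conformity_limits(spectra, k=3)
```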
The samples are marked in gray. The reference spectrum is marked in green, and its
associated confidence limits are drawn as dashed green lines. Any outliers identified in
the analysis are marked in red.
Toggle the corresponding toolbar button to view the Conformity Index plot.
Influence Plot
This plot shows the Q-residual X-variance or F-residuals vs. Leverage or Hotelling’s T². The
toggle buttons in the toolbar can be used to switch between the various combinations.
Scores
This is a two-dimensional scatter plot (or map) of scores for two specified components (PCs)
from PCA, performed on the segmented correlation values matrix. The plot gives
information about patterns in the samples. The scores plot for (PC1,PC2) is especially useful,
since these two components summarize more variation in the data than any other pair of
components. Outliers will be marked automatically (in red).
Use the Hotelling’s T² ellipse in the scores plot to detect outliers. To display it, click on the
Hotelling’s T² ellipse button.
SCA loadings
A line plot of segmented correlation loadings for each (or selected) component(s) is a good
way to detect important segments, and thus their associated variables/wavelengths, and to
understand which components capture the important sources of information.
Use the correlation loadings option to discover the important segments lying within the
upper and lower bounds of the plot, being modelled by that particular PC.
Influence
See the description in the overview section.
Explained Variance
This plot gives an indication of how much of the variation in the data is described by the
different components.
Total residual variance is computed as the sum of squares of the residuals for all the
variables, divided by the number of degrees of freedom.
Total explained variance is then computed as:
Calibration variance is based on fitting the calibration data to the model. Validation variance
is computed by testing the model on data that were not used to build the model.
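The calculation above can be illustrated with a minimal sketch. It uses SVD-based PCA on mean-centered data; the exact degrees-of-freedom corrections used by the software may differ.

```python
import numpy as np

def explained_variance(X, max_components):
    """Total explained calibration variance (%) for 1..max_components PCs.

    Illustrative sketch: SVD-based PCA on mean-centered data, with total
    variance measured as the raw sum of squares.
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    total_ss = (Xc ** 2).sum()
    result = []
    for a in range(1, max_components + 1):
        E = Xc - (U[:, :a] * s[:a]) @ Vt[:a]   # residuals after a components
        result.append(100.0 * (1.0 - (E ** 2).sum() / total_ss))
    return result

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
ev = explained_variance(X, 5)
```

With all five components retained on this 20 x 5 data set, the residual is zero and the explained variance reaches 100%.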
Influence
See the description in the overview section.
The Hotelling’s T² statistic has a linear relationship to the leverage for a given sample. Its
critical limit is based on an F-test. Use it to identify outliers or detect situations where a
process is operating outside normal conditions. There are 6 different significance levels to
choose from using the drop-down list:
The number of factors (or PCs) may be tuned up or down with the toolbar tools.
To access the Leverage plot, use the toggle button. Leverages are useful for detecting
samples which are far from the center within the
space described by the model. Samples with high leverage differ from the average samples;
in other words, they are likely outliers. A large leverage also indicates a high influence on the
model.
Leverage plot
Q-Residuals/F-Residuals
The Q-residual is the sum of squares of the residuals over the variables for each object.
This test serves the purpose of finding outliers in terms of the distance to the model space,
i.e. the residual distance. Given the model X = TPᵀ + E, the Q-residuals for the objects in X
are computed from the diagonal of EEᵀ.
A critical value of the Q-residual can be estimated from the eigenvalues of E, which can be
approximated to a normal distribution (Jackson and Mudholkar, 1979). This is the horizontal
red line.
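A minimal sketch of the Q-residual computation, assuming mean-centered data and SVD-based PCA (function and variable names are illustrative):

```python
import numpy as np

def q_residuals(X, n_components):
    """Q-residual per object: the row-wise sum of squared residuals,
    i.e. the diagonal of E @ E.T. Illustrative sketch only."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T          # loadings
    T = Xc @ P                       # scores
    E = Xc - T @ P.T                 # residual matrix E
    return (E ** 2).sum(axis=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 6))
q = q_residuals(X, 2)
```

With all six components retained the residual matrix is zero up to rounding, so the Q-residuals vanish.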
Q-residual sample variance
To access the F-Residuals plot, use the toggle button to switch to the F-Residuals plot.
The F-residuals are calculated from the calibration as well as the validated residual X-
variance, and thus reflect the validation method chosen for the model. They may give a
more realistic view of the residuals than the Q-residuals, which are based on the residuals
from the calibration only.
F-residual sample variance
Conformity Analysis
A Conformity Limit is plotted as a dashed green line at K Standard Deviations. Any samples
falling outside the Conformity limits will be tagged as a Conformity outlier and marked with
red lines in this plot. There are six levels of standard deviation to choose from using the drop
down list:
A Conformity Limit is plotted as a dashed green line at K Standard Deviations. Any samples
falling outside the Conformity limits will be tagged as a Conformity outlier and marked with
red circles in this plot.
Use the drop down list from the menu to access the trend plots:
Correlelogram
This plot is available only when the Calculate Correlelogram option is checked in the
Correlelogram dialog during analysis. The Correlelogram plot is the correlation of each
wavelength present or defined in the data set with the Y-variables, plotted one Y-variable at
a time. It is plotted as a line plot with a maximum of +1 and a minimum of -1, and allows the
overlay of selected spectra, all spectra or the mean spectrum to show areas of maximized
spectral correlation to the Y-variables.
Correlelogram Plot
Outlier Marking
The following section discusses the four different types of outliers available from the SCA
analysis.
Influence Outliers
The outliers are identified based on the Influence plot. Any sample not in the lower left
quadrant of the plot will be tagged as an influence outlier (circled in red). This can be
accessed by toggling the IO button in the Mark menu.
Correlation Outliers
Any sample with an overall correlation value below the set correlation limit will be tagged as
a Correlation outlier. The correlation limit is set as the lower threshold on the squared
Pearson’s correlation value calculated between the entire reference spectrum and the
individual spectra. The correlation values span the range from 0 to 1. The default value of
correlation limit is set to 0.95. This can be accessed by toggling the CO button in the Mark
menu.
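The rule above can be sketched as follows; the function and variable names are illustrative and the reference spectrum is assumed to be given:

```python
import numpy as np

def correlation_outliers(spectra, reference, limit=0.95):
    """Flag spectra whose squared Pearson correlation with the reference
    spectrum falls below the correlation limit. Illustrative sketch."""
    ref_c = reference - reference.mean()
    flags = []
    for s in spectra:
        s_c = s - s.mean()
        r = (ref_c @ s_c) / np.sqrt((ref_c @ ref_c) * (s_c @ s_c))
        flags.append(r ** 2 < limit)
    return np.array(flags)

reference = np.array([0.1, 0.5, 1.0, 0.5, 0.1])
spectra = np.array([
    [0.11, 0.52, 0.98, 0.49, 0.10],   # close to the reference shape
    [1.00, 0.10, 0.20, 0.90, 0.30],   # very different shape
])
flags = correlation_outliers(spectra, reference)
```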
Conformity Index Outlier
A conformity outlier is a sample that exceeds the currently selected conformity limit in the CI
trend chart. The limit is defined by Y = K Std Deviations, as selected in the toolbar. A
conformity outlier is marked with a red solid line or a red circle, depending on plot type.
Whenever the limit changes, any new outliers will be tagged and marked accordingly.
Previous outliers falling inside the new limit will be un-tagged and un-marked. This can be
accessed by toggling the CIO button in the Mark menu.
Manual Outliers
For more information see the How to mark samples/variables documentation.
22. Instrument Diagnostics
22.1. Instrument Diagnostics
The Instrument Diagnostics plug-in was designed to provide users of spectroscopic
instrumentation with a way of assessing the quality of background scans prior to the
collection of reflectance, transmittance or absorbance spectra. It can also be useful for
many other types of sensors, not only spectral instruments. The plug-in contains specific
algorithms for calculating the following quality parameters:
RMS Noise: Provides an assessment of the baseline signal to noise ratio that
indicates that the instrument response is not being influenced by extraneous
electronic noise.
Peak Model: This functionality provides a means of calculating peak heights, areas
and ratios such that assessment can be made to critical limits. These are particularly
important for monitoring contaminant levels, such as build up in specific
instrumentation. Baseline correction is built in as a preprocessing specific to the
Peak Model functionality.
Peak Position: Wavelength accuracy of instrumentation is a critical aspect of good
instrument calibration. If the peak position shifts significantly during analysis, this
has the potential to be detrimental to the predicted values generated by a
chemometric model. Peak Position provides a measure of selected peak positions
and assesses them against a tight window of acceptance.
Loss of Intensity: This diagnostic assesses the quality of the spectral luminescence
source for deterioration in intensity. Comparison of a new background is made to
either a historical background or the last known good reference and is expressed in
terms of deviation from an established 100% intensity.
PCA Projection: Utilizes the power of Principal Component Analysis (PCA) to assess if
the new background scan is in the same population as a library of scans known to
have acceptable variability.
The Instrument Diagnostics module also comes with a prediction plug-in to assess new
background scans within The Unscrambler® X environment. Instrument diagnostic models
developed are used in a similar way to other predictive models (such as PLSR, PCR etc.) and
can be further utilized in real time applications in conjunction with e.g. The Unscrambler® X
ADI Insight Server or Process Pulse.
Theory
Usage
Prediction
The returned value indicates if the RMS is higher than the alarm or warning limit.
Absolute Area: Computes the integral of the absolute amplitudes within the
specified region.
Average Height: Computes the average amplitude within the specified region.
Both low and high alarm and warning limits can be set for this diagnostic, and the result is
returned as one of the possible states.
1. Find all amplitudes for the specified range above the minimum amplitude.
2. Find the position, among the amplitudes remaining from step 1, that is closest to the
reference peak position.
3. Check if the difference between the two positions exceeds the alarm or warning
limits.
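The three steps above can be sketched as follows (all names and the returned states are illustrative):

```python
import numpy as np

def check_peak_position(x_axis, amplitudes, min_amp, ref_pos, warning, alarm):
    """Sketch of the three-step peak-position check described above."""
    mask = amplitudes > min_amp                                  # step 1
    if not mask.any():
        return "Alarm"                                           # no qualifying peak
    candidates = x_axis[mask]
    found = candidates[np.argmin(np.abs(candidates - ref_pos))]  # step 2
    shift = abs(found - ref_pos)                                 # step 3
    if shift > alarm:
        return "Alarm"
    if shift > warning:
        return "Warning"
    return "OK"

x = np.array([100.0, 101.0, 102.0, 103.0])
amps = np.array([0.1, 0.9, 0.4, 0.05])
state = check_peak_position(x, amps, min_amp=0.3, ref_pos=101.0,
                            warning=0.5, alarm=1.5)
```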
In case of percentage:
The returned value indicates if the intensity is lower than the alarm or warning limit.
Hotelling’s T²
Leverage.
In this equation, TNew is the projected score, P is the loading from the PCA model used for
projection and XNew is the new spectrum to be projected onto the PCA model.
Hotellings T² is calculated as the sum of squared projected scores, each divided by the
variance of the corresponding calibration scores.
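Under these definitions, the projection and the statistic can be sketched as follows; the mean-centering and the variance scaling are assumptions about the preprocessing, not a transcript of the product's code:

```python
import numpy as np

def hotellings_t2(x_new, x_mean, P, score_var):
    """Project a new spectrum onto the PCA loadings P (T_new = X_new P)
    and compute Hotelling's T2 as the sum of squared scores, each scaled
    by the variance of the corresponding calibration scores. Sketch only."""
    t_new = (x_new - x_mean) @ P
    return float(np.sum(t_new ** 2 / score_var))

# Build a small PCA model from calibration data (illustrative)
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
x_mean = X.mean(axis=0)
U, s, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
P = Vt[:2].T                                   # 2-component loadings
scores = (X - x_mean) @ P
score_var = scores.var(axis=0, ddof=1)
t2 = hotellings_t2(X[0], x_mean, P, score_var)
```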
The following sections describe how to set up each diagnostic method type.
The following table gives the functionality of the RMS Noise dialog.
Input: Defines the column range of the spectra to apply the RMS Noise model to. Start
defines the starting point and End is the final point of the spectrum.
Threshold
  Alarm: Allows a user to set the upper limit for RMS Noise; beyond this limit an Alarm
  state will be tagged to the RMS Noise value calculated for the new spectrum.
  Warning: Allows a user to set an upper limit for RMS Noise; beyond this limit and below
  the Alarm limit, a Warning state will be tagged to the RMS Noise value calculated for
  the new spectrum.
Multiple RMS Noise Models
RMS Noise models can be calculated over multiple regions of the same spectrum
and individual alarms and warnings can be set up accordingly. To add additional RMS
noise models, right click in the RMS Noise node in the navigator and select RMS
Model. This will add a new RMS Noise model to the navigator called RMS 2.
Setup dialog for an RMS Noise model
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the RMS 1, RMS 2 nodes, an option is available to rename the nodes.
The following table gives the functionality of the Peak Model dialog.
Input: Defines the column range of the spectra to apply the Peak Model to. Start defines
the starting point and End is the final point of the spectrum.
Alarm (High/Low): Allows a user to set the upper/lower limits for the Peak Model; beyond
these limits an Alarm state will be tagged to the Peak Model value calculated for the new
spectrum. One-directional models are possible.
To add a second (or consecutive) Peak Model, right click in the Peak Model node in the
navigator and select Peak Model. This will add a new Peak Model to the navigator called
Peak Model 2.
Dialog with multiple Peak Models added
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the Model nodes, an option is available to rename the nodes.
The following table gives the functionality of the Peak Position dialog.
Input: Defines the column range of the spectra to apply the Peak Position model to. Start
defines the starting point and End is the final point of the spectrum.
Expected Peak Position: A user must enter the peak position where the peak maximum is
expected to occur.
Minimum Peak Amplitude: A user must enter the minimum amplitude expected for finding
a peak in the defined region and at the expected position.
Threshold: Two options are available for Threshold, Absolute: Uses absolute
Alarm (High/Low): Allows a user to set the upper/lower limits for where the peak is
expected to lie; beyond these limits an Alarm state will be tagged to the Peak Position
value calculated. One-directional models are possible.
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the Model nodes, an option is available to rename the nodes.
The following table gives the functionality of the Loss of Intensity dialog.
Input: Defines the column range of the spectra to apply the Loss of Intensity model to.
Start defines the starting point and End is the final point of the spectrum.
Alarm: Allows a user to set the minimum limit for loss of intensity that should be alarmed
when compared to the original or last known good spectrum. This is a lower bound alarm.
Warning: Allows a user to set a warning limit for loss of intensity that should be flagged
when compared to the original or last known good spectrum. This is a lower bound alarm
for warning a user that the spectrometer's lamp should be considered for changing.
Multiple Loss of Intensity Models
Loss of Intensity models can be calculated at multiple regions of the same spectrum
and individual alarms and warnings can be set up accordingly. To add a second (or
consecutive) Loss of Intensity model, right click in the Loss of Intensity node in the
navigator and select Loss of Intensity. This will add a new Loss of Intensity Model to
the navigator called Loss of Intensity 2.
Dialog with multiple Loss of Intensity Models added
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the Model nodes, an option is available to rename the nodes.
The PCA functionality is accessed by right clicking in the Instrument Diagnostics node
in the navigator and selecting Add – PCA. A new node called PCA will be added to
the dialog navigator and a sub-node called PCA 1 will be added to show that one
PCA model is being evaluated.
Setup dialog for a PCA model
Input: Defines the column range of the spectra to apply the PCA model to. Start defines
the starting point and End is the final point of the spectrum.
Use Hotellings T²: This function provides two options for the user. Model: Uses the critical
Hotellings T² limits for the components selected for the model, at the significance level
selected from the dropdown box; User Defined: Allows a user to manually enter a limit for
the Hotellings T² value.
Use Leverage: This function provides two options for the user. Model: Uses the critical
Leverage value for the components selected for the model; User Defined: Allows a user to
manually enter a limit for the Leverage value.
Significance levels for Hotellings T² values in the PCA Instrument Diagnostics dialog
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the Model nodes, an option is available to rename the nodes.
Background spectra collected from a particular instrument can be loaded into The
Unscrambler® X project and the appropriate Instrument Diagnostics model is also loaded
into the project.
Functionality of the Instrument Diagnostics Predict dialog box
When implemented at run time, the results are sent to a third party application and a quality
decision can be made based on the outputs.
23. Spectral Diagnostics
23.1. Spectral Diagnostics
The Spectral Diagnostics plug-in is designed to provide users of spectroscopic
instrumentation with a way of assessing the quality of background scans prior to the
collection of reflectance, transmittance or absorbance spectra. The plug-in contains
specific algorithms for calculating the following quality parameters:
RMS Noise: Provides an assessment of the baseline signal to noise ratio that
indicates that the instrument response is not being influenced by extraneous
electronic noise.
Peak Model: This functionality provides a means of calculating peak heights, areas
and ratios such that assessment can be made to critical limits. These are particularly
important for monitoring contaminant levels, such as build up in specific
instrumentation. Baseline correction is built in as a preprocessing specific to the
Peak Model functionality.
Peak Position: Wavelength accuracy of instrumentation is a critical aspect of good
instrument calibration. If the peak position shifts significantly during analysis, this
has the potential to be detrimental to the predicted values generated by a
chemometric model. Peak Position provides a measure of selected peak positions
and assesses them against a tight window of acceptance.
Loss of Intensity: This diagnostic assesses the quality of the spectral luminescence
source for deterioration in intensity. Comparison of a new background is made to
either a historical background or the last known good reference and is expressed in
terms of deviation from an established 100% intensity.
PCA Projection: Utilizes the power of Principal Component Analysis (PCA) to assess if
the new background scan is in the same population as a library of scans known to
have acceptable variability.
The Spectral Diagnostics module also comes with a prediction plug-in to assess new
background scans within The Unscrambler® X environment. Spectral diagnostic models
developed are used in a similar way to other predictive models (such as PLSR, PCR etc.) and
can be further utilized in real time applications in conjunction with e.g. The Unscrambler® X
ADI Insight server.
Theory
Usage
Prediction
The returned value indicates if the RMS is higher than the alarm or warning limit.
Absolute Area: Computes the integral of the absolute amplitudes within the
specified region.
Average Height: Computes the average amplitude within the specified region.
Both low and high alarm and warning limits can be set for this diagnostic, and the result is
returned as one of the possible states.
1. Find all amplitudes for the specified range above the minimum amplitude.
2. Find the position, among the amplitudes remaining from step 1, that is closest to the
reference peak position.
3. Check if the difference between the two positions exceeds the alarm or warning
limits.
In case of percentage:
The returned value indicates if the intensity is lower than the alarm or warning limit.
Hotelling’s T²
Leverage.
In this equation, TNew is the projected score, P is the loading from the PCA model used for
projection and XNew is the new spectrum to be projected onto the PCA model.
Hotellings T² is calculated as the sum of squared projected scores, each divided by the
variance of the corresponding calibration scores.
The following sections describe how to set up each diagnostic method type.
The following table gives the functionality of the RMS Noise dialog.
Input: Defines the column range of the spectra to apply the RMS Noise model to. Start
defines the starting point and End is the final point of the spectrum.
Threshold
  Alarm: Allows a user to set the upper limit for RMS Noise; beyond this limit an Alarm
  state will be tagged to the RMS Noise value calculated for the new spectrum.
  Warning: Allows a user to set an upper limit for RMS Noise; beyond this limit and below
  the Alarm limit, a Warning state will be tagged to the RMS Noise value calculated for
  the new spectrum.
Multiple RMS Noise Models
RMS Noise models can be calculated over multiple regions of the same spectrum
and individual alarms and warnings can be set up accordingly. To add additional RMS
noise models, right click in the RMS Noise node in the navigator and select RMS
Model. This will add a new RMS Noise model to the navigator called RMS 2.
Setup dialog for an RMS Noise model
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the RMS 1, RMS 2 nodes, an option is available to rename the nodes.
The following table gives the functionality of the Peak Model dialog.
Input: Defines the column range of the spectra to apply the Peak Model to. Start defines
the starting point and End is the final point of the spectrum.
Alarm (High/Low): Allows a user to set the upper/lower limits for the Peak Model; beyond
these limits an Alarm state will be tagged to the Peak Model value calculated for the new
spectrum. One-directional models are possible.
To add a second (or consecutive) Peak Model, right click in the Peak Model node in the
navigator and select Peak Model. This will add a new Peak Model to the navigator called
Peak Model 2.
Dialog with multiple Peak Models added
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the Model nodes, an option is available to rename the nodes.
The following table gives the functionality of the Peak Position dialog.
Input: Defines the column range of the spectra to apply the Peak Position model to. Start
defines the starting point and End is the final point of the spectrum.
Expected Peak Position: A user must enter the peak position where the peak maximum is
expected to occur.
Minimum Peak Amplitude: A user must enter the minimum amplitude expected for finding
a peak in the defined region and at the expected position.
Alarm (High/Low): Allows a user to set the upper/lower limits for where the peak is
expected to lie; beyond these limits an Alarm state will be tagged to the Peak Position
value calculated. One-directional models are possible.
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the Model nodes, an option is available to rename the nodes.
The following table gives the functionality of the Loss of Intensity dialog.
Input: Defines the column range of the spectra to apply the Loss of Intensity model to.
Start defines the starting point and End is the final point of the spectrum.
Alarm: Allows a user to set the minimum limit for loss of intensity that should be alarmed
when compared to the original or last known good spectrum. This is a lower bound alarm.
Warning: Allows a user to set a warning limit for loss of intensity that should be flagged
when compared to the original or last known good spectrum. This is a lower bound alarm
for warning a user that the spectrometer's lamp should be considered for changing.
Multiple Loss of Intensity Models
Loss of Intensity models can be calculated at multiple regions of the same spectrum
and individual alarms and warnings can be set up accordingly. To add a second (or
consecutive) Loss of Intensity model, right click in the Loss of Intensity node in the
navigator and select Loss of Intensity. This will add a new Loss of Intensity Model to
the navigator called Loss of Intensity 2.
Dialog with multiple Loss of Intensity Models added
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the Model nodes, an option is available to rename the nodes.
Input: Defines the column range of the spectra to apply the PCA model to. Start defines
the starting point and End is the final point of the spectrum.
Use Hotellings T²: This function provides two options for the user. Model: Uses the critical
Hotellings T² limits for the components selected for the model, at the significance level
selected from the dropdown box; User Defined: Allows a user to manually enter a limit for
the Hotellings T² value.
Use Leverage: This function provides two options for the user. Model: Uses the critical
Leverage value for the components selected for the model; User Defined: Allows a user to
manually enter a limit for the Leverage value.
Significance levels for Hotellings T² values in the PCA Spectral Diagnostics dialog
Models can be deleted using a right click option in the model nodes. By right clicking in one
of the Model nodes, an option is available to rename the nodes.
The Spectral Diagnostics Predict plug-in is found under the Tasks – Predict menu.
Background spectra collected from a particular instrument can be loaded into The
Unscrambler® X project and the appropriate Spectral Diagnostics model is also loaded into
the project.
Functionality of the Spectral Diagnostics Predict dialog box
When implemented at run time, the results are sent to a third party application and a quality
decision is made based on the outputs.
24. Cluster Analysis
24.1. Cluster analysis
Cluster analysis includes a range of quasi-statistical techniques used in unsupervised
classification. They are suitable for exploratory analysis of data and can be used to classify
samples into groups. Cluster analysis in The Unscrambler® works on the objects (or rows).
The data may be transposed prior to analysis to analyze the data in terms of variables. K-
means and K-medians clustering iteratively add or remove members from a set of clusters so
as to minimize the sum of distances of cluster members to their cluster centers. These
methods use less memory than hierarchical clustering methods and are therefore suitable
for large data sets. Hierarchical clustering methods in The Unscrambler® provide a
dendrogram plot as a visualization of clustering results.
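The assign-and-update loop that K-means performs can be sketched as follows (a minimal illustration, not the product's implementation):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch: assign each sample to its nearest center,
    then move each center to the mean of its members, until stable."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)                 # nearest center per sample
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):     # converged
            break
        centers = new_centers
    return labels, centers

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centers = kmeans(X, k=2)
```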
Theory
Usage
Plot Interpretation
Method reference
Basics
Principles of cluster analysis
Nonhierarchical clustering
Hierarchical clustering
HCA linkage methods
Distance measures
Quality of the clustering
Main results of cluster analysis
24.2.1 Basics
A valuable tool for exploratory data analysis is the use of cluster analysis to understand the
natural grouping of objects. Cluster analysis is an unsupervised methodology for grouping
objects based on their similarity with respect to specified characteristics (variables). It grew
out of work by biologists working on numerical taxonomy, and is a valuable visualization
tool in data mining. One can perform clustering using either partitional methods (K-means
or K-medians clustering) or agglomerative hierarchical clustering with different linkage
measures (single-linkage, complete-linkage, average-linkage, median-linkage, etc.).
Agglomerative methods begin by treating each sample as a single cluster and merge
clusters based on their similarity until one large cluster is formed.
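The agglomerative procedure can be sketched for the single-linkage case as follows (an O(n³) illustration; real implementations are far more efficient):

```python
import numpy as np

def single_linkage(X, n_clusters):
    """Naive agglomerative clustering with single (nearest-neighbor)
    linkage: start with one cluster per sample and repeatedly merge the
    two clusters with the smallest minimum pairwise distance."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # Euclidean distances
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = min(d[a, b] for a in clusters[i] for b in clusters[j])
                if dist < best:
                    best, pair = dist, (i, j)
        i, j = pair
        clusters[i] += clusters.pop(j)               # merge the closest pair
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
groups = single_linkage(X, n_clusters=2)
```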
Although cluster analysis is usually performed to find patterns among objects (termed Q
mode), it may also be applied to find similarities among the variables (R mode). This can be
achieved by transposing the data matrix so that the rows correspond to variables.
K-means
K-medians
HCA single-linkage
HCA complete-linkage
HCA average-linkage
HCA median-linkage
Ward’s method
classification. The distance measure should ideally be chosen based on the application
domain and on whether the distance or similarity measure has a real-world interpretation.
Note that not all distances fulfill the triangle inequality. The triangle inequality for a metric
holds if the sum of any two sides of a triangle is at least the length of the third. If it does not
hold, the resulting dendrograms in hierarchical clustering can be deformed.
With hierarchical clustering, a dendrogram is generated as a result, based on the distances
between samples. There are several methods by which the distances between clusters are
defined when using one of the HCA options.
HCA linkage methods
HCA single-linkage: The single-linkage (also called nearest neighbor) measure, uses
the distance between the closest samples to define a cluster. The method tends to
make large clusters and does not provide a very good classification of groups that
differ, but are not well separated. This method tends to produce elongated clusters.
HCA complete-linkage: This is also known as the farthest-neighbor method, and uses
the greatest distance between any two samples as the basis of the clustering.
Clusters from the complete-linkage method are more compact and rounded.
HCA average-linkage: The average linkage is a compromise between the single- and
complete-linkage, based on the average distance between samples for the
clustering.
HCA median-linkage: The median (or centroid) linkage is very similar to the average-
linkage method, and uses the geometrical distance between a cluster and the
weighted center of gravity of the other groups.
Ward’s method: Ward’s method aims to cluster samples so as to maximize the
homogeneity of the groups. At each step, the two clusters whose merger gives the
smallest increase in heterogeneity are joined.
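As an illustration of how these linkage choices behave, the following sketch clusters the same small data set with each method. This uses SciPy, not The Unscrambler® itself, and the data are invented for the example.

```python
# Hypothetical example data: two well-separated groups in two variables.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(6, 1, (10, 2))])

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                    # build the cluster tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut it into 2 clusters
    print(method, labels)
```

With well-separated groups all linkage methods agree; on elongated or overlapping groups they begin to differ in the ways described above.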
Distance measures
For all hierarchical clustering methods, a distance measure must be chosen to quantify the
similarity between samples. A sample is then assigned to the group to which it is closest.
HCA results are displayed as a dendrogram plot, which is a depiction of the clustering of
samples into sets and subsets, along with the threshold distances between samples and
clusters.
In The Unscrambler® there are many distance measures available for clustering.
Squared Euclidean distance
The squared Euclidean distance as a means of measuring similarity between clusters is
useful in cases where some feature (variable) may dominate the distance between groups,
and serves as a type of normalization to the data.
Euclidean distance
This is the most common, “natural” and intuitive way of computing a distance between two
samples. It takes into account the difference between two samples directly, based on the
magnitude of changes in the sample levels. This distance type is usually used for data sets
that are suitably normalized or without any special distribution problem.
City-block distance
Also known as Manhattan distance, this distance measurement is especially relevant for
discrete data sets. While the Euclidean distance corresponds to the length of the shortest
path between two samples (i.e. “as the crow flies”), the Manhattan distance refers to the
sum of distances along each dimension (i.e. “walking round the block”).
Pearson correlation distance
This distance is based on the Pearson correlation coefficient that is calculated from the
sample values and their standard deviations. The correlation coefficient r takes values from
–1 (large, negative correlation) to +1 (large, positive correlation). Effectively, the Pearson
distance dp is computed as

dp = 1 − r

and lies between 0 (when the correlation coefficient is +1, i.e. the two samples are most
similar) and 2 (when the correlation coefficient is −1). Note that the data are centered by
subtracting the mean, and scaled by dividing by the standard deviation.
Absolute Pearson correlation distance
In this distance, the absolute value of the Pearson correlation coefficient is used; hence the
corresponding distance lies between 0 and 1, just like the correlation coefficient. The
equation for the absolute Pearson distance da is

da = 1 − |r|

Taking the absolute value gives equal meaning to positive and negative correlations, so
anti-correlated samples will be clustered together.
Uncentered correlation distance
This is the same as the Pearson correlation, except that the sample means are set to zero in
the expression for uncentered correlation. The uncentered correlation coefficient lies
between –1 and +1; hence the distance lies between 0 and 2.
Kendall’s tau distance
This distance is based on Kendall’s rank correlation coefficient, computed from the
rankings of the values in the two samples as

τ = (nc − nd) / (n(n − 1)/2)

where
nc = number of concordant rank pairs
nd = number of discordant rank pairs
and n is the number of ranked values. The corresponding distance, 1 − τ, lies between 0
(identical rankings) and 2 (reversed rankings).
Chebyshev distance
The Chebyshev, or maximum value, distance is the greatest absolute difference between
the coordinates of a pair of objects. This distance measure may be best in cases
where the difference between points is best reflected by individual dimension differences,
and not by all the dimensions considered together. Note that the Chebyshev distance is very
sensitive to outlying measurements.
Bray-Curtis distance
This value, also referred to as the Bray-Curtis dissimilarity, or the Sorenson distance, is
commonly used in ecology, biology and oceanography studies for quantifying dissimilarity
between populations.
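The distance measures above are standard and can be reproduced with SciPy for two sample vectors. This is a sketch for illustration only; the vectors are invented, and SciPy’s `correlation` distance corresponds to the Pearson distance 1 − r.

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 2.5, 5.0])

print("Euclidean:        ", distance.euclidean(x, y))
print("Squared Euclidean:", distance.sqeuclidean(x, y))
print("City-block:       ", distance.cityblock(x, y))    # sum of per-variable differences
print("Pearson (1 - r):  ", distance.correlation(x, y))  # centered and scaled
print("Chebyshev:        ", distance.chebyshev(x, y))    # largest single difference
print("Bray-Curtis:      ", distance.braycurtis(x, y))
```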
Ward’s Method
This method is a minimum-variance hierarchical clustering method and uses an analysis of
variance approach to evaluate the distances between clusters. At each step of the analysis,
Ward’s method merges the two clusters whose fusion gives the smallest increase in the
within-cluster Sum of Squares (SS).
The Sum of Distances (SOD) is the sum of the distances between each sample and its
respective cluster centroid, summed over all k clusters. This parameter is uniquely
calculated for a particular set of cluster IDs resulting from a cluster calculation. The results
from different cluster analyses are compared based on their Sum of Distances values; the
solution with the lowest Sum of Distances is a good indicator of an acceptable cluster
assignment. It is therefore recommended to initiate the analysis with a small Iteration
Number, say 10 for a sample set of 500, and proceed towards higher Iteration Numbers to
obtain an optimal cluster solution. Once an optimal (lowest) Sum of Distances has been
obtained, it is unlikely to decline further when the Iteration Number is set to higher values.
The cluster ID assignment with the optimal Sum of Distances is considered the most
appropriate result. The results for non-hierarchical methods present just the class ID as a
numerical value, without giving the SOD values.
Note: Since the first step of the K-means algorithm is based on a random
distribution of the samples into k different clusters, there is a good possibility that
the final clustering solution will not be exactly the same in every run for a fairly
large sample data set.
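The effect of the iteration number can be sketched with scikit-learn’s K-means (an illustration on invented data, not Unscrambler code): `n_init` plays the role of the Iteration Number, and `inertia_` (the sum of squared distances to the cluster centroids) plays the role of the Sum of Distances criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])  # invented data

for n_init in (1, 10, 50):
    # The best of n_init random starts is kept; more starts can only
    # lower (never raise) the final objective.
    km = KMeans(n_clusters=2, n_init=n_init, random_state=0).fit(X)
    print(f"n_init={n_init:3d}  sum of squared distances={km.inertia_:.2f}")
```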
For hierarchical cluster analysis, the results of the clustering are a column matrix with a
category variable (0, 1, 2, …) for the class, as well as a dendrogram, which is a plot of the
clusters vs. the relative distance between the clusters.
Tip: Before performing a cluster analysis, it is helpful to determine if the data being
considered exhibits any tendency to cluster. This can be done by doing a PCA over
the data to see if there are any groupings which could then form the basis of
clusters.
24.3.1 Inputs
To run a cluster analysis:
Choose the data to be clustered by defining the matrix and range to be clustered.
The data selected must not have any missing values. There must be at least two
samples and two variables to perform a cluster analysis.
Decide the number of clusters or categories to be identified (Default: 2 clusters).
Choose clustering method (Default: K-means).
Choose distance criterion (Default: Squared Euclidean).
Defining the centers of the clusters based on prior knowledge can force a better
solution.
For each cluster one can either enter a range of sample indexes by typing or through
the selection dialog.
Individual sample indexes can be comma separated, while ranges can be indicated
with hyphens. For example 1-5,7.
24.3.3 Results
When a cluster analysis has been performed, a new node, Cluster analysis, is added to the
project navigator with a folder for results and one for plots (if hierarchical clustering has
been used). The node may be renamed by right clicking on it and selecting Rename. A typical
entry is shown below.
Cluster analysis results node
The results folder contains the matrix Range_Classified, which has the raw data used for
clustering, and an additional column for the class. Row sets are also created, one for each
cluster that has been identified. The column set Class has the numerical identifiers for each
sample, as can be seen below.
Cluster analysis class ID
Dendrogram
24.4.1 Dendrogram
A dendrogram (from Greek dendron “tree”, -gramma “drawing”) is a tree diagram
frequently used to illustrate the arrangement of the clusters produced by hierarchical
clustering.
Depending on the selected number of clusters, the sample names will be displayed by
cluster color. In the following example three clusters were selected, hence the plot has three
groups of samples shown in different colors. The clusters are separated based on the
distance between clusters.
Dendrogram plot
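A comparable dendrogram can be generated outside The Unscrambler®, for instance with SciPy and matplotlib. This sketch uses invented data with three groups.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(2)
# Three invented groups of five samples each
X = np.vstack([rng.normal(i * 4, 1, (5, 2)) for i in range(3)])

Z = linkage(X, method="average", metric="euclidean")
dendrogram(Z)                 # samples on the x-axis, merge distance on the y-axis
plt.ylabel("Distance")
plt.savefig("dendrogram.png")
```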
25. Projection
25.1. Projection
Latent space models project the data into new spaces. This is done by multiplying the new
data with the loading vectors. This approach is applicable to PCA, PCR and PLS regression
models.
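In matrix terms, projection is simply t_new = (x_new − mean) P, where P holds the loading vectors of the existing model. A minimal NumPy sketch (invented data, PCA via SVD; not Unscrambler code):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 5))            # calibration data (invented)
mean = X.mean(axis=0)
Xc = X - mean
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T                            # loadings of a 2-component PCA model

X_new = rng.normal(size=(4, 5))         # new samples, same 5 variables
T_new = (X_new - mean) @ P              # projected scores
print(T_new.shape)                      # 4 samples, 2 components
```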
Theory
Usage
Plot Interpretation
Method reference
Basics of projection
Sample comparison after a change
Detection of time shifts
Using projection to validate a process with a new test set
How to interpret projected samples
Sample comparison or detection of time shifts
Validation with new test data
By projecting the data for new samples onto the PCA model based on product
produced with the existing supplier, one can see if the product properties are
impacted by the change in raw material supplier.
Has the product quality changed after a piece of equipment was repaired?
How do samples produced in factory B compare to samples from factory A?
To make this comparison one can project the new samples (e.g. from factory B) onto a PCA
of the reference samples (e.g. factory A), and see if they overlap in the scores plot.
Detection of time shifts
A model was developed one year ago. Are today’s samples still well described by the model?
Projecting new samples onto the one-year old PCA model provides information about
whether there has been a drift in sample distribution, change in the average scores,
increased spread, larger residuals, etc.
Using projection to validate a process with a new test set
In the initial stages of a process development few samples may exist and methods such as
cross-validation may be the only viable way of developing a first interpretive model. As more
experience and data are gathered from the process, these data can be used as a test set,
without recomputing the original model.
The initial PCA model may also have been developed by another scientist or engineer and
the original data may not be available to run a more complete PCA. This is not a problem for
projection, as long as the new data were collected for the same variables as the original PCA
model.
Projecting the new samples onto the existing model and checking residual variances and
leverages will allow one to determine whether the model is valid for the new samples.
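These checks can be sketched as follows, on invented data. The leverage here uses the common formula of 1/n plus the squared scores scaled by the calibration score sums of squares; this form is an assumption for illustration rather than a statement of the software's exact computation.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 6))                 # calibration data (invented)
mean = X.mean(axis=0)
Xc = X - mean
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
A = 2                                        # number of components retained
P = Vt[:A].T
T = Xc @ P                                   # calibration scores

x_new = rng.normal(size=6)                   # one new sample
t_new = (x_new - mean) @ P                   # projected scores
residual = (x_new - mean) - t_new @ P.T      # part not explained by the model
res_var = residual @ residual / (len(residual) - A)
leverage = 1 / len(X) + np.sum(t_new**2 / np.sum(T**2, axis=0))
print(f"residual variance = {res_var:.3f}, leverage = {leverage:.3f}")
```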
The main difference compared to standard PCA results is that the variance plot now depicts
Calibration, Validation and Projection. Also, the projected samples are shown in the scores
plot. The following plots are relevant for the new samples:
Scores.
Variances.
Residuals.
Leverages and Hotelling’s T².
The influence plot helps one detect whether some of the projected samples are badly
described by the model or far away from the center.
The Hotelling’s T² ellipse can be plotted in the scores plot, with a critical limit that can be
tuned up or down by varying the p-value between 0.1% and 25%. These limits show which of
the projected samples can be “rejected” by the model (outside the limit). If the proportion
of “rejected” samples is larger than the chosen p-value, one may conclude that there is a
difference between the original samples and the projected samples as a whole.
Validation with new test data
Compare the Projection variance curve to the Calibration and Validation curves. If they are
similar, one can consider the model validated by the projected samples. The diagram below
provides an example of a well chosen calibration and validation set of data using the method
of PCA projection.
Refer to the chapter on How to Interpret PCA Scores and Loadings for more details.
To run a projection, a project must be opened containing either a PCA or regression model
(MLR is not included in this case). If this is not the case, the following warning will be
provided.
Solution: Ensure that a PCA, PCR or PLSR model is available for projection.
Data Input
The following dialog boxes are available to input data.
Select Model
Choose the model (PCA, PCR, PLSR) to be used for projection from those available in
the project navigator.
Components
Allows the user to choose the number of components to use for projection.
Data
Matrix: Allows the user to select the matrix containing the data to be projected onto
the model. The data can be a new matrix, or a subset of the data used to generate
the model.
Use the Rows and Columns drop-down lists to define the samples and variables to
be projected.
If the variable dimensions of the new data set do not match those of the model, The
Unscrambler® will provide a warning to adjust this. This warning is shown below. A
data set of equivalent dimension must be chosen. It must not contain any non-
numeric or missing values.
New data set does not have same dimensions of original model
Solution: Ensure that the data set to be projected has the same range as that used in the
original model.
Other warnings associated with the Data input dialog box include the following:
Too many samples or variables excluded
Solution: Ensure that enough samples or variables are present for analysis.
Non-numeric data
Solution: Ensure that the data set only contains numerical values.
Note: When a model has been developed and is to be used for projection, it
is important to define the variable ranges in the new data table so that they
match the dimensions of the original model.
Click on OK to perform the projection.
X-Loadings
Influence
Residual/explained variance
Variances
Scores
Loadings
Residuals
Leverage/Hotelling’s T²
Plots accessible from the Projection menu
Projection overview
Variances
Scores
Line
2-D
3-D
Loadings
Line
2-D
3-D
Residuals
Influence Plot
Variance per Sample
Sample Residuals
Leverage/Hotelling’s T²
Leverage
Line
Matrix
Hotelling’s T²
Line
Matrix
Scores
This is a two-dimensional scatter plot (or sample map) of scores for two specified
components (PCs) from Projection results. The original samples used to develop the PCA
model are displayed in blue, the new projected samples in green. Use this plot to check how
close the projections of the new samples are to the original samples.
Projection of samples in a scores plot
In the above plot, most of the projected samples (green) fall within the two groups defined
by the model samples (blue). There are a group of four samples that lie outside the main
population in the region defined by samples M62 and H59. It may be important to check
whether these are outliers, or just unique samples.
X-Loadings
The default X-loadings plot is a two-dimensional scatter plot of the loadings for two
specified components. Use this plot to detect important variables. The plot is most useful
for interpreting components 1 vs. 2, since they represent the largest variations in the X-data.
It must be interpreted together with the corresponding scores plot. Variables with high X-
loadings to the right of the plot relate to samples to the right in the scores plot, etc.
Loadings may also be displayed as line plots. These are useful when interpreting the results
generated from spectral data.
Influence
This plot displays the sample residual X-variances against leverages for the projected
samples at a given number of PCs. The original samples used to develop the PCA model are
displayed in blue, the new projected samples in green. Samples with a high residual variance
are poorly described by the original model. Samples with a high leverage are projected far
from the center of the original model. A sample with both high residual variance and high
leverage usually represents a highly influential outlier, i.e. it is not well described by the
model it is projected onto and it distorts the model to itself. In this case, the model only
describes why the influential sample is so different from the rest of the population.
Influence in projection
Residual/explained variance
This plot gives an indication of how much of the variation in the data is described by the
different components.
Total residual variance is computed as the sum of squares of the residuals for all the
variables, divided by the number of degrees of freedom.
Total explained variance is then computed as:

Total explained variance (%) = 100 × (total variance − total residual variance) / total variance
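These two quantities can be sketched numerically on invented data; here both are expressed as sums of squares, so the explained variance comes out as a percentage of the total variation.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(25, 4))                 # invented data
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

total_ss = np.sum(Xc**2)                     # total variation in the data
for a in range(1, 4):
    recon = (U[:, :a] * s[:a]) @ Vt[:a]      # model with a components
    residual_ss = np.sum((Xc - recon) ** 2)  # unexplained variation
    explained = 100 * (total_ss - residual_ss) / total_ss
    print(f"{a} components: explained variance = {explained:.1f}%")
```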
Variances
For information on this plot check the Projection Overview section
Scores
For information on this plot check the Projection overview section
Loadings
For information on this plot check the Projection overview section
Residuals
Residuals can be plotted either as Residual Sample Variance or as Sample Residuals.
Examples of these plots are shown below.
The residual sample variance displays the per sample variation compared to the projected
model and the sample residuals show the variance associated with each variable, for a
particular sample.
Leverage/Hotelling’s T²
Leverage
Line
This is a plot of score values vs. sample number for a specified component. Although it is
usually better to look at 2-D or 3-D scores plots because they contain more information, this
plot can be useful whenever the samples are sorted according to the values of an underlying
variable, e.g. time, to detect trends or patterns.
Also look for systematic patterns, like a regular increase or decrease, periodicity, etc. (only
relevant if the sample number has a meaning, like time for instance).
2-D
For information on this plot check the Interpreting Projection plots section
3-D
This is a 3-D scatter plot or map of the scores for three specified components from PCA. The
plot gives information about patterns in the samples and is most useful when interpreting
components 1, 2 and 3, since these components summarize most of the variation in the
data. It is usually easier to look at 2-D scores plots but if three components are needed to
describe enough variation in the data, the 3-D plot is a practical alternative.
Scores plot in 3-D
Like with the 2-D plot, the closer the samples are in the 3-D scores plot, the more similar
they are with respect to the three components.
The 3-D plot can be used to interpret differences and similarities among samples. Look at
the scores plot and the corresponding loadings plot, for the same three components.
Together they can be used to determine which variables are responsible for differences
between samples. Samples with high scores along the first component usually have large
values for variables with high loadings along the first component, etc.
For information about what to look for in a scores plot check the information in the 2-D
scores plot section
Loadings
Line
This is a plot of X-loadings for a specified component vs. variable number. It is useful for
detecting important variables. In many cases it is usually better to look at two- or three-
vector loadings plots instead because they contain more information.
Line plots are most useful for multichannel measurements, for instance spectra from a
spectrophotometer, or in any case where the variables are implicit functions of an
underlying parameter, like wavelength, time, etc.
Loading line plot
The plot shows the relationship between the specified component and the different X-
variables. If a variable has a large positive or negative loading, this means that the variable is
important for the component concerned. For example, a sample with a large score value for
this component will have a large positive value for a variable with large positive loading.
2-D
For information on this plot check the Interpreting Projection plots section
3-D
This is a three-dimensional scatter plot of X-loadings for three specified components from
the original PCA model. The plot is most useful for interpreting directions, in connection to a
3-D scores plot. Otherwise it is recommended to use line- or 2-D loadings plots.
Loadings plot in 3-D in projection
Residuals
Influence Plot
For information on this plot check the Interpreting Projection plots section
Samples with small residual variance (or large explained variance) for a particular
component are well explained by the corresponding model, and vice versa.
Sample Residuals
This is a plot of the residuals for a specified sample and component number for all the X-
variables. It is useful for detecting outlying sample or variable combinations. Although
outliers can sometimes be modeled by incorporating more components, this should be
avoided since it will reduce the prediction ability of the model.
Bar plot of the sample residuals
In contrast to the variable residual plot, which gives information about residuals for all
samples for a particular variable, this plot gives information about all possible variables for a
particular sample. It is therefore useful when studying how a specific sample fits to the
model.
To change the displayed sample, use the Sample drop-down list.
To change the PC plotted, use the arrow tools.
Leverage/Hotelling’s T²
Leverage
Line
Leverages are useful for detecting samples which are far from the center within the space
described by the model. Samples with high leverage differ from the average samples; in
other words, they are likely outliers. A large leverage also indicates a high influence on the
model.
Leverage plot in projection
The absolute leverage values are always larger than zero, and can (in theory) go up to 1. As
a rule of thumb, samples with a leverage above 0.4–0.5 begin to be a concern.
Influence on the model is best measured in terms of relative leverage. For instance, if all
samples have leverages between 0.02 and 0.1, except for one, which has a leverage of 0.3,
although this value is not extremely large, the sample is likely to be influential.
For a critical limit on the leverages, look up the Hotelling’s T² line plot.
Matrix
This is a matrix plot of leverages for all samples and all model components. The X-axis
represents the components and the Y-axis the samples. The color represents the Z-value
which is the leverage, the color scale can be customized. It is a useful plot for studying how
the influence of each sample evolves with the number of components in the model. The
leverages can also be displayed as Hotelling’s T² statistics.
Leverage matrix plot in projection
Hotelling’s T²
Line
The Hotelling’s T² plot is an alternative to plotting sample leverages. The plot displays the
Hotelling’s T² statistic for each sample as a line plot. The associated critical limit (with a
default p-value of 5%) is displayed as a red line.
The Hotelling’s T² limit at 5% defines a distance from the model within which 95% of the
samples belonging to the model should fall. The samples outside this limit are likely to be
outliers. However, remember that 5% of the samples belonging to the model can be
outside.
Hotelling’s T² plot in projection
In the above plot, some samples have a Hotelling’s T² statistic higher than the 5% limit for a
model including the number of PCs necessary for an explanatory model. Hence those
samples are likely to be outliers.
The Hotelling’s T² statistic has a linear relationship to the leverage for a given sample. Its
critical limit is based on an F-test. Use it to identify outliers or detect situations where a
process is operating outside normal conditions. There are six different significance levels to
choose from using the drop-down list.
Tune the number of PCs up or down as desired with the arrow tools.
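The F-test based limit can be sketched as follows, with invented scores. The limit formula A(n−1)/(n−A)·F(1−α; A, n−A) is the commonly used form for PCA scores and is assumed here rather than taken from the manual.

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(6)
n, A = 40, 2                                  # calibration samples, components
T = rng.normal(size=(n, A))                   # invented score matrix
score_var = np.sum(T**2, axis=0) / (n - 1)    # variance of each component's scores
T2 = np.sum(T**2 / score_var, axis=1)         # Hotelling's T^2 per sample

alpha = 0.05                                  # the default 5% significance level
limit = A * (n - 1) / (n - A) * f.ppf(1 - alpha, A, n - A)
print(f"T^2 limit at {alpha:.0%}: {limit:.2f}")
print("samples outside the limit:", int(np.sum(T2 > limit)))
```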
Matrix
This is a matrix plot of Hotelling’s T² statistics for all projected samples and all model
components. It is equivalent to the matrix plot of leverages, to which it has a linear
relationship. The Y-axis represents the components and the X-axis the samples. The color
represents the Z-value which is the Hotelling’s T² statistic for a specific PC and sample, the
color scale can be customized.
Hotelling’s T² matrix plot in projection
26. SIMCA
26.1. SIMCA classification
Soft Independent Modeling of Class Analogy (SIMCA) is based on making a PCA model for
each class in the training set. Unknown samples are then compared to the class models and
assigned to classes according to their proximity to the training samples.
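The core idea can be sketched in a few lines of NumPy. This is an illustration only, with invented classes and helper names; The Unscrambler® additionally applies statistical limits for Si and Hi rather than a bare minimum-distance rule.

```python
import numpy as np

def fit_pca(X, a):
    """Mean-center X and keep the first a loading vectors."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:a].T

def residual_distance(x, model):
    """Distance from sample x to a class model (norm of the residual)."""
    mean, P = model
    xc = x - mean
    r = xc - (xc @ P) @ P.T          # part not explained by the class model
    return np.sqrt(r @ r)

rng = np.random.default_rng(7)
classes = {"A": rng.normal(0, 1, (20, 4)), "B": rng.normal(5, 1, (20, 4))}
models = {name: fit_pca(Xk, a=2) for name, Xk in classes.items()}

x_new = rng.normal(5, 1, 4)          # an unknown sample drawn near class B
best = min(models, key=lambda name: residual_distance(x_new, models[name]))
print("assigned to class:", best)
```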
Theory
Usage
Plot Interpretation
Method reference
Model results
For each pair of models, the model distance between the two models is computed. This
gives a measure of how separable the class models are. A distance larger than three
indicates good class separation.
Variable results
Modeling power (of one variable in one model) is a measure of the relevance of a variable to
a model. It has a value between 0 and 1, with a value of 1 signifying importance. Variables
with modeling power less than about 0.3 are of little importance to a model.
Discrimination power (of one variable between two models) is a measure of how useful a
variable is in discriminating between two classes. Discrimination power of ~ 1 indicates no
discriminating power, while a value greater than ~ 3 indicates good discrimination for a
given variable.
Sample results
Si = object-to-model distance (of one sample to one model) is a measure of how far a sample
is from a modeled class.
Hi = leverage (of one sample to one model). Hi describes how different a sample is from
other class members.
Model distance
This measure (which could more accurately be called “model-to-model distance”) shows how
different two (or more) models are from each other. It is computed from the results of
fitting all samples from each class to their own model and to the other ones being used to
classify new samples.
The value of this measure should be compared to 1, i.e. the distance of a model to itself. A
model distance much larger than 1 (for instance, 3 or more) shows that the two models are
quite different, which in turn implies that the two classes are likely to be well distinguished
from each other.
Modeling power
Modeling power is a measure of the influence of a variable on a given model. It is
computed as

Modeling power = 1 − (residual standard deviation of the variable after fitting the model) / (initial standard deviation of the variable)

so that a value close to 1 indicates a variable that is well described by the model.
Discrimination power
The discrimination power of a variable indicates the ability of that variable to discriminate
between two classes. Thus, a variable with a high discrimination power (with regard to two
particular models) is very important for the differentiation between the two corresponding
classes.
Like model distance, this measure should be compared to 1 (no discrimination power at all);
variables with a discrimination power higher than 3 can be considered quite important.
Si vs. Hi plot
This plot is a graphical tool used to view the sample-to-model distance (Si) and sample
leverage (Hi) for a given model at the same time. It includes the class membership limits for
both measures, so that samples can easily be classified according to that model by checking
whether they fall inside both limits.
An equivalent plot in PCA is the influence plot (refer to section on the influence plot in the
chapter on PCA).
Coomans’ plot
This is an “Si vs. Si” plot, where the sample-to-model distances are plotted against each
other for two models. It includes class membership limits for both models, so that one can
see whether a sample is likely to belong to one class, or both, or none. This is an orthogonal
distance measure, therefore, samples can be plotted along orthogonal axes. If any two class
models share a space around the origin of the Coomans’ plot then there is a high likelihood
that the PCA models will not discriminate between the two classes.
Use this option to mean center the data to be classified, prior to the classification
process. The default is that this option is checked.
Use components
Use this option to vary the number of components to be included in each model. As
a general rule, this should always be set to the number of principal components/
factors found to be optimal during model development process.
Some important tips and warnings associated with the Model Inputs tab
If the data are pretreated before building the PCA model and the pretreatment ranges differ
from the model building range, SIMCA will ask the user to select all variables in the data.
In the event that there is no valid model present in the project navigator, the following
warning will be provided.
No valid model present for classification
Solution: Either create a model using a training set of data, or import an existing model from
another project.
When non-numeric values are present in a new data set for classification, the following
warning will be provided.
Non-numeric values in data set warning
Solution: Ensure the data set being classified only contains numerical values.
The diagram below provides an example of a completed dialog box.
Completed SIMCA dialog
Click on OK to run the classification on the data selected. A new node named SIMCA will
appear in the project navigator providing all of the model details and associated plots in the
three folders: raw data, results, and plots. The node can be renamed by selecting it, right
clicking and selecting Rename. By right clicking, one also has the option to hide the plots.
Look for samples that are not recognized by any of the classes, or those that are allocated to
more than one class.
Classification table
Coomans’
This plot shows the orthogonal distances from the new objects to two different classes
(models) at the same time. The membership limits (S0) are indicated. Membership limits
reflect the significance level used in the classification.
The two models can be changed to study other pairs of models using the model selection
tool in the toolbar.
The significance level for Hi can be adjusted using the significance drop-down list; there are
six different levels, the default value being 5%.
Coomans’ Plot
Samples that fall within the membership limit of a class are recognized as members of that
class. Different colors denote different types of sample: new samples being classified,
calibration samples for the model along the abscissa (A) axis, calibration samples for the
model along the ordinate (B) axis, as shown in the figure above.
Si vs. Hi
This plot is a graphical tool used to get a view of the sample-to-model distance (Si) and
sample leverage (Hi) for a given model at the same time. It includes the class membership
limits for both measures, so that samples can easily be classified according to that model by
checking whether they fall inside both limits.
The displayed results can be changed using the model selection tool in the toolbar.
Si vs. Hi
In the above plot the samples that will be classified as Setosa are the ones in the bottom left
corner defined by the two limits Si and Hi. The other samples will not be classified in this
group.
The significance level for Hi can be adjusted using the significance drop-down list; there are
six different levels, the default value being 5%.
Si/S0 vs. Hi
The Si/S0 vs. Hi plot shows the two limits used for classification: the relative distance from
the new sample to the model (residual standard deviation) and the leverage (distance from
the new sample to the model center).
Si/S0 vs. Hi
In the above plot the samples that will be classified as Setosa are the ones in the bottom left
corner defined by the two limits Si and Hi. The other samples will not be classified in this
group.
The displayed results can be changed using the model selection tool in the toolbar.
The significance level for Hi can be adjusted using the significance drop-down list; there are
six different levels, the default value being 5%.
Model Distance
This plot shows the distances between different models. It is possible to compare different
models using the buttons in the toolbar. A distance larger than three indicates good class
separation and that the models are different.
Model Distance
It is clear from the plot that the other models are very different from the Setosa model. The
closest one is Versicolor, with a distance of around 20.
Discrimination Power
This plot shows how much each variable contributes to separating two models.
It is possible to see a different pair of models using the buttons in
the tool bar.
Discrimination Power
In the above plot, the two models under study are Setosa and Virginica. The variable with
the highest discrimination power between these two classes is petal width.
Modeling Power
This plot shows how much the variables contribute to the model.
Variables with a modeling power near one are important for the model. A rule of thumb is
that variables with modeling power less than 0.3 are of little importance for the model.
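This rule of thumb can be expressed as a small illustrative helper (hypothetical, not part of the software), which keeps the indices of variables whose modeling power is at or above the threshold:

```python
import numpy as np

def important_variables(modeling_power, threshold=0.3):
    """Apply the rule of thumb: variables with modeling power below
    `threshold` (default 0.3) are of little importance for the model."""
    mp = np.asarray(modeling_power, dtype=float)
    return np.flatnonzero(mp >= threshold)
```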
Modeling Power
The above plot shows that three of the variables have a modeling power larger than 0.3,
which means that these variables are important for describing the model. Since petal width
does not have a very high power, it could be deleted from the modeling.
It is possible to look at the modeling power for all the tested models using the drop-down
list in the toolbar.
27. Linear Discriminant Analysis
27.1. Linear Discriminant Analysis
Linear Discriminant Analysis (LDA) is the simplest of all possible classification methods that
are based on Bayes’ formula. The objective of LDA is to determine the best fit parameters
for classification of samples by a developed model. The model can then be used to classify
unknown samples. It is based on the normal distribution assumption and the assumption
that the covariance matrices of the two (or more) groups are identical.
Theory
Usage: Create model
Usage: Classification
Results
Method reference
Basics
Data suitable for LDA
Purposes of LDA
Main results of LDA
LDA application examples
How to interpret LDA results
Using an LDA model for classification of unknowns
27.2.1 Basics
LDA is the simplest of all possible classification methods that are based on Bayes’ formula.
From Bayes’ rule one develops a classification model assuming the probability distribution
within all groups is known, and that the prior probabilities for groups are given, and sum to
100% over all groups. It is based on the normal distribution assumption and the assumption
that the covariance matrices of the two (or more) groups are identical. This means that the
variability within each group has the same structure. The only difference between groups is
that they have different centers. LDA considers both within-group variance and between-
group variance. The estimated covariance matrix for LDA is obtained by pooling covariance
matrices across groups.
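As an illustrative sketch of the theory above (not The Unscrambler® implementation), a linear discriminant rule using a pooled covariance matrix under the identical-covariance assumption can be written as:

```python
import numpy as np

def lda_fit(X, y):
    """Fit a linear discriminant model with a pooled covariance matrix."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    n = len(y)
    # Pool the within-group covariance across groups (equal-covariance assumption).
    pooled = sum(np.cov(X[y == c], rowvar=False) * (np.sum(y == c) - 1)
                 for c in classes)
    pooled /= (n - len(classes))
    priors = {c: np.mean(y == c) for c in classes}  # priors from the training set
    return classes, means, np.linalg.inv(pooled), priors

def lda_predict(model, X):
    """Assign each sample to the class with the highest discriminant score."""
    classes, means, inv_cov, priors = model
    scores = []
    for c in classes:
        m = means[c]
        w = inv_cov @ m
        b = -0.5 * m @ inv_cov @ m + np.log(priors[c])
        scores.append(X @ w + b)
    return classes[np.argmax(scores, axis=0)]
```

Note that this sketch requires more samples per group than variables, as stated below, so that the pooled covariance matrix is invertible.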
When the variability of each group does not have the same structure (unequal covariance
matrix), the shape of the curve separating groups is not linear, and therefore quadratic
discriminant analysis will provide a better classification model. The distance of observations
from the center of the groups can also be measured using the Mahalanobis distance.
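The Mahalanobis distance mentioned above can be computed as follows (illustrative helper; the group center and covariance matrix are assumed given):

```python
import numpy as np

def mahalanobis(x, center, cov):
    """Mahalanobis distance of observation x from a group center,
    accounting for the covariance structure of the group."""
    d = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))
```

With an identity covariance matrix this reduces to the ordinary Euclidean distance.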
Note: For an LDA to be performed, the number of samples within each category
must be more than the number of variables.
When PCA-LDA is used, the results also include a matrix of Loadings and the Grand Mean
Matrix.
Classified_Range
Here the probabilities for each sample to belong to a group are given, and
classification is made based on the highest probability of membership.
Linear
Quadratic
Mahalanobis
The default setting assumes equal prior probabilities for class membership or 1/G where G is
the number of groups in the data set. The user has the option of having the software
calculate prior probabilities of class membership based on the training samples.
27.3.1 Inputs
One begins by defining the data matrix to be used for the predictors, and then that to be
used for the classifications. This can be part of the same data matrix, but the classifications
must have category variables in a single column.
Linear Discriminant Analysis Inputs
Begin by defining the data matrix for the predictors and the classifiers from the drop-down
list. For the matrix, the rows and columns to be included in the computation are then
selected. The X values (descriptors) should be numerical data and should not contain missing
values. There must be more samples in each class than there are variables in order to develop an LDA
classification model. The Y data (classification) must be a single column of category values,
and contain the same number of rows as the descriptors, with no missing values.
If new data ranges need to be defined, choose new or Edit from the drop-down list next to
Rows and/or Cols. This will open the Define Range editor where new ranges can be defined.
The classification matrix to define is that containing the category data, and must have a
single column only. This may be the same matrix as given in Predictors or another, but must
have the same number of rows as the first, and have only a single column of data, with no
missing values. If the appropriate selection is not made for the classifier, the following
warning will be displayed.
Linear Discriminant Analysis Input Warnings
27.3.2 Weights
Weights can be set for individual variables in an analysis. The variables can be selected from
the variable list table provided in the dialog by holding down the control (Ctrl) key and
selecting variables. Alternatively, the variable numbers can be manually entered into the
text dialog box. The Select button can be used (which will open the Define Range dialog
box), or every variable in the table can be selected by simply clicking on All.
Once the variables have been selected, to weight them, use the options in the Change
Selected Variable(s) dialog box, under the Select tab. The options include:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows the weighting of selected variables by predefined constant values.
Downweight
This allows the multiplication of selected variables by a very small number, such that
the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
Once the weighting and variables have been selected, click Update to apply them.
Linear Discriminant Analysis Weights
27.3.3 Options
Once the data to be used in modeling are defined, the method for the LDA is defined in the
Options tab.
Linear Discriminant Analysis Options
Three different methods for LDA are available under the Options tab:
Linear
Quadratic
Mahalanobis
The method chosen from the drop-down list will depend on the similarity of the different
classes to be discriminated. If the variability within the groups has the same structure, the
linear method may be used. Otherwise, the Quadratic or Mahalanobis method may model
the classes better, and can be chosen from the drop-down list.
The prior probabilities can also be set, either assuming equal prior probabilities, or by
calculating prior probabilities from the training set. With equal priors, the software uses 1/G,
where G is the number of groups in the data set; priors calculated from the training set are
proportional to the number of training samples in each group.
If a data set contains more variables than samples (e.g. spectral data), one can choose the
option of running a PCA-LDA. In this case a PCA with the number of components defined by
the user is run first on the data, and the LDA is performed using the PCA scores.
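The PCA-LDA sequence described above (PCA first, then LDA on the scores) can be sketched as follows. `pca_scores` is a hypothetical helper based on the singular value decomposition, not the software's own routine:

```python
import numpy as np

def pca_scores(X, n_components):
    """Project mean-centered data onto its first principal components.

    Returns the scores (used as LDA input) and the loadings
    (used to project new samples before classification)."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data: scores are U*S, loadings are the rows of Vt.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components], Vt[:n_components].T

# Typical use: scores, loadings = pca_scores(X, 3); the LDA is then fitted
# on `scores`, and new samples are projected with (Xnew - mean) @ loadings.
```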
27.3.4 Autopretreatment
The Autopretreatments tab allows a user to register the pretreatments used during the LDA
analysis, so that when future predictions are made, these pretreatments are automatically
applied to the new data, before the LDA equation is applied. The pretreatments become
part of the saved model.
Once the data matrix and parameters have been set, the LDA modeling is run by selecting
OK.
A new node, LDA, is added to the project navigator with a folder for Data, and another for
Results.
Click “OK” after all parameters have been set, and a new matrix with the LDA classification
results, Classified_Range will be created in the project navigator. This then shows the class
identifier, added as the column class, for the unknowns based on the LDA classification
model.
Two additional matrices are generated for PCA-LDA, including the Loadings and the Grand
Mean which are used in projection.
There is also a Discrimination plot that is created as a visual display of the LDA results.
Note: The Discrimination plot is only available in calibration.
LDA node
27.5.1 Prediction
The prediction matrix exhibits the discriminant value for each class, as well as the predicted
class for each sample. The predicted class is the class with the highest discriminant value.
Note that this value can be negative.
The confusion matrix carries information about the predicted and actual classifications of
samples, with each row showing the instances in a predicted class, and each column
representing the instances in an actual class.
In the confusion matrix below, all the “Setosa” samples are correctly attributed to the “Setosa”
group.
Two samples with actual value “Virginica” are predicted as “Versicolor”.
In the same way two samples with actual value “Versicolor” are predicted as “Virginica”.
Confusion matrix
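The row/column convention described above (rows = predicted classes, columns = actual classes) can be sketched with a hypothetical helper:

```python
import numpy as np

def confusion_matrix(predicted, actual, labels):
    """Build a confusion matrix with rows = predicted classes and
    columns = actual classes, following the convention in the text."""
    index = {lab: i for i, lab in enumerate(labels)}
    m = np.zeros((len(labels), len(labels)), dtype=int)
    for p, a in zip(predicted, actual):
        m[index[p], index[a]] += 1
    return m
```

Off-diagonal entries are the misclassifications, e.g. a count in the (Versicolor, Virginica) cell means a sample whose actual class is Virginica was predicted as Versicolor.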
Method reference: http://www.camo.com/helpdocs/The_Unscrambler_Method_References.pdf
27.7. Bibliography
D. Cozzolino, A. Vadell, F. Ballesteros, G. Galietta, N. Barlocco, Combining visible and near-
infrared spectroscopy with chemometrics to trace muscles from an autochthonous breed of
pig produced in Uruguay: a feasibility study, Anal. Bioanal. Chem., 385(5), 931-936 (2006).
C. Medina-Gutiérrez, J. Luis Quintanar, C. Frausto-Reyes, R. Sato-Berrú, The application of
NIR Raman spectroscopy in the assessment of serum thyroid-stimulating hormone in rats,
Spectrochimica Acta Part A, 61 (1-2), 87-91 (2005).
T. Næs, T. Isaksson, T. Fearn and T. Davies, A User-friendly Guide to Multivariate Calibration
and Classification, NIR Publications, Chichester, UK, 2002.
28. Support Vector Machine Classification
28.1. Support Vector Machine Classification (SVMC)
SVM is a classification method based on statistical learning. Sometimes, a linear function is
not able to model complex separations, so SVM employs kernel functions to map from the
original space to the feature space. The function can be of many forms, thus providing the
ability to handle nonlinear classification cases. The kernels can be viewed as a mapping of
nonlinear data to a higher dimensional feature space, while providing a computational
shortcut by allowing linear algorithms to work in the higher dimensional feature space.
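The kernel shortcut described above can be illustrated with the radial basis function kernel, which evaluates inner products in an implicit higher-dimensional feature space without ever forming that space explicitly (illustrative sketch):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Radial basis function kernel: k(x, y) = exp(-gamma * ||x - y||^2).

    Each entry is an inner product between the images of x and y in an
    implicit (infinite-dimensional) feature space, computed directly
    from the original data."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)
```

A linear algorithm that only needs inner products can thus operate in the feature space by replacing every dot product with a kernel evaluation.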
Theory
Usage: Create model
Results
Usage: Classification
Result interpretation
Method reference
the kernel. The figure below illustrates the principle of applying a kernel function to achieve
separability.
In this new space SVM will search for the samples that lie on the borderline between the
classes, i.e. to find the samples that are ideal for separating the classes; these samples are
named support vectors. The figure below illustrates this in that only the samples marked
with + for the two classes are used to generate the rule for classifying new samples.
A situation where SVM will perform well is when some classes are inhomogeneous and
partly overlapping, and thus, building local PCA models with all samples will not be
successful because one class may encompass other classes if all samples are used.
SVM will in this case find a set of the most relevant samples in terms of discriminating
between the classes and is invariant to samples far from the discrimination line.
SVM has advantages over classification methods such as neural networks, as it has a unique
solution, and has less tendency of overfitting when compared to other nonlinear
classification methodologies. Of course, the model validation is the critical aspect in avoiding
overfitting for any method. SVMs are effective for modeling of nonlinear data, and are
relatively insensitive to variation in parameters. SVM uses an iterative training algorithm to
achieve separation of different classes.
Two SVM classification types are available in The Unscrambler®, which are based on different
means of minimizing the error function of the classification.
In the c-SVM classification, a capacity factor, C, can be defined. The value of C should be
chosen based on knowledge of the noise in the data being modeled. Its value can be
optimized through cross-validation procedures. When using nu-SVM classification, the nu
value must be defined (default value = 0.5). Nu serves as the upper bound of the fraction of
errors and is the lower bound for the fraction of support vectors.
Increasing nu will allow more errors, while increasing the margin of class separation.
The kernel type to be used as a separation of classes can be chosen from the following four
options:
Linear
Polynomial
Radial basis function
Sigmoid
The linear kernel is set as the default option. If the number of variables is very large, the data
do not need to be mapped to a higher dimensional space, and the linear kernel function is
preferred. The radial basis function is also a simple function and can model systems of varying
complexity; the linear kernel can be regarded as a special case of it.
If a polynomial kernel is chosen, the order of the polynomial must also be given. In SVM
classification, the best value for C is often not known a priori. Through a grid search and
applying cross validation to reduce the chance of overfit, one can identify an optimal value
of C so that unknowns can be properly classified using the SVM model.
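The coarse grid search with cross-validation can be illustrated with a toy kernel classifier. The classifier below is a hypothetical stand-in for the SVM optimizer (which is not reproduced here); the exponentially growing sequence of parameter values follows the recommendation in the text:

```python
import numpy as np

def rbf(X, Y, gamma):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def kernel_centroid_predict(Xtr, ytr, Xte, gamma):
    # Toy kernel classifier: assign each test sample to the class with the
    # largest mean kernel similarity (NOT the SVM decision rule).
    classes = np.unique(ytr)
    K = rbf(Xte, Xtr, gamma)
    sims = np.stack([K[:, ytr == c].mean(axis=1) for c in classes])
    return classes[np.argmax(sims, axis=0)]

def cv_accuracy(X, y, gamma, segments=4, seed=0):
    # Random cross-validation segments, as in the validation setup.
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), segments)
    correct = 0
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = kernel_centroid_predict(X[train], y[train], X[test], gamma)
        correct += np.sum(pred == y[test])
    return correct / len(y)

def coarse_grid_search(X, y, gammas):
    # First coarse search over an exponentially growing parameter sequence;
    # a real SVM search would sweep C (or nu) on a second grid axis as well.
    scores = {g: cv_accuracy(X, y, g) for g in gammas}
    return max(scores, key=scores.get), scores
```

After the coarse search, the grid would be refined with smaller ranges around the best value, mirroring the dialog's workflow.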
Support vectors
Confusion matrix
Parameters
Probabilities
Prediction
The main results of the SVM are the confusion matrix, which indicates how many samples
were classified in each class, and the prediction matrix, which indicates the classification
determined for each sample in the training set.
may be from the same matrix or another, but must have the same number of rows as the
first, and have only a single column of category data.
Support Vector Machine Model Inputs
To build the SVM model, go to the column drop-down list and select a single column
containing category variables. If the appropriate selection is not made for the classifier, the
following warning will be displayed.
Support Vector Machine Model Inputs Warnings
28.3.2 Options
Here one can choose the SVM type of classification to use, either C-SVC or nu-SVC, from the
drop-down list next to SVM type. The kernel type used to determine the hyperplane that
best separates the classes can be selected from the following types in the drop-down list.
The default setting, Radial basis function, is a simple kernel that can model complex data.
Support Vector Machine Options
Linear
Polynomial
Radial basis function
Sigmoid
For a polynomial kernel type, the degree of the polynomial should be defined. The C-SVM
has an input parameter named C, which is a capacity factor (also called penalty factor), a
measure of the robustness of the model. C must be greater than 0.
When using nu-SVM classification, the nu value must be defined (default value = 0.5). Nu serves
as the upper bound of the fraction of errors and is the lower bound for the fraction of
support vectors.
Support Vector Machine Options for nu-SVM
In the Options tab the Grid Search button is available. Clicking on the Grid
Search button will open a dialog for grid search. The figure below shows the grid search
dialog after a grid search has been performed.
The dialog asks for input for the parameters Gamma and C in the case of C-SVC, and
Gamma and Nu in the case of nu-SVC. It has been reported in the literature that an
exponentially growing sequence of the parameters is good as a first coarse grid search. This
is why the inputs Gamma and C are given on the log scale, but not nu, since it lies between
0 and 1. In the grid table above, however, the actual values are given. It is recommended to
use cross-validation in the grid search to avoid overfitting when many combinations of the
parameters are tried. After an initial grid search, the search may be refined with smaller ranges
for the parameters once the best range has been found. Click on the Start button for the
calculations to commence. Note that it is possible to click on Stop during the computations,
so that if the results become worse for higher parameter values one may stop to
save time. The default is to start with five levels of each parameter. After completion, click
on one (the “best”) value for the Validation accuracy in the grid to see detailed results. The SVs
entry lists how many samples were selected as support vectors; this number should be seen
in relation to the number of samples in the data.
Click on Use setting to return to the previous dialog and run the SVMC again with
these parameter settings. Notice that since the cross validation is random, the RMSE and the
R-square from validation may be different in the second run. This again is a function of the
distribution of the samples.
To understand more in detail how SVMC selects the support vectors (samples that are lying
on the boundary between the classes) one may run a PCA on the same data and make use of
the Sample Grouping option in the score plot to visualize the support vectors.
28.3.4 Weights
If the analysis calls for variables to be weighted for making realistic comparisons to each
other (particularly useful for process and sensory data), click on the Weights tab and the
following dialog box will appear.
Support Vector Machine Weights
Individual variables can be selected from the variable list table provided in this dialog by
holding down the control (Ctrl) key and selecting variables. Alternatively, the variable
numbers can be manually entered into the text dialog box. The Select button can be used
(which will bring up the Define Range dialog), or every variable in the table can be selected
by simply clicking on All.
Once the variables have been selected, to weight them, use the options in the Change
Selected Variable(s) dialog box, under the Select tab. The options include:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows the weighting of selected variables by predefined constant values.
Downweight
This allows the multiplication of selected variables by a very small number, such that
the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
Use the Advanced tab in the Weights dialog to apply predetermined weights to each
variable. To use this option, set up a row in the data set containing the weights (or create a
separate row matrix in the project navigator). Select the Advanced tab in the Weights dialog
and select the matrix containing the weights from the drop-down list. Use the Rows option
to define the row containing the weights and click on Update to apply the new weights.
Another feature of the Advanced tab is the ability to use the results matrix of another
analysis as weights, using the Select Results Matrix button. This option provides
an internal project navigator for selecting the appropriate results matrix to use as a weight.
The dialog box for the Advanced option is provided below.
SVM Advanced Weights Option
Once the weighting and variables have been selected, click Update to apply them.
28.3.5 Validation
Validation is an important part of any method applied in modeling data. Settings for the
Validation of the SVM are set under the Validation tab as shown below. First select to cross
validate the model by checking the check box. The number of segments to use can be
chosen in the segments entry. Cross validation is helpful in model development but should
not be a replacement for full model validation using a test set.
Support Vector Machine Validation
Autopretreatment may be used with SVM. This allows a user to automatically apply the
transforms used with the data in developing the SVM model to data used in the classification
of new samples with this model.
Support Vector Machine Autopretreatment
When all of the parameters have been defined, the SVM is run by clicking OK. A new node,
SVM, is added to the project navigator with a folder for Data, and another for Results.
More details regarding Support Vector Machine classification are given in the section SVM
Classify or in the link given under License.
The SVM classification results are given in a new matrix in the project navigator named
Classified_Range. The matrix has the predicted class for each sample.
Support vectors
Confusion matrix
Parameters
Probabilities
Prediction
Accuracy
There is only one matrix generated when predicting with an SVM model: Classified_Range.
SVM node
The confusion matrix carries information about the predicted and actual classifications of
samples, with each row showing the instances in a predicted class, and each column
representing the instances in an actual class.
In the confusion matrix below, all the “Setosa” samples are correctly attributed to the “Setosa”
group.
Two samples with actual value “Virginica” are predicted as “Versicolor”.
In the same way two samples with actual value “Versicolor” are predicted as “Virginica”.
Confusion matrix
28.5.3 Parameters
The parameters matrix carries information on the following parameters for all the identified
classes:
SVM type
Kernel type - as defined in the options for the SVM learning step
Degree - as defined in the options for the SVM learning step
Gamma - the kernel parameter gamma, as set in the options
Coef0 - the kernel coefficient, as set in the options
Classes - the number of classes identified by the SVM model
SV Count - the number of support vectors needed for the classification of the data
Labels - the labels of the corresponding classes, given as numerical values starting
with 0
Numbers - the number of samples classified in a given class
Parameters matrix
28.5.4 Probabilities
The probabilities matrix has three rows, for the Rho, and probabilities A and B for each of
the identified classes.
Probabilities matrix
28.5.5 Prediction
The prediction matrix exhibits the predicted class for each sample in the training set.
Prediction
28.5.6 Accuracy
Accuracy holds the percentage of correctly classified samples from calibration and validation. If cross
validation was not chosen, the validation field is left blank. However, cross validation is highly
recommended to avoid overfitting. See the Confusion Matrix for details on false
positives and false negatives.
28.7. Bibliography
Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, A Practical Guide to Support Vector
Classification, last updated: May 19, 2009, accessed August 27, 2009.
http://www.csie.ntu.edu.tw/~cjlin
T. Czekaj, W.Wu and B.Walczak, About kernel latent variable approaches and SVM, J.
Chemom., 19, 341–354 (2005).
J.A.Fernandez Pierna, V.Baeten, A.Michotte Renier, R.P.Cogdill and P.Dardenne,
Combination of support vector machines (SVM) and near-infrared (NIR) imaging
spectroscopy for the detection of meat and bone meal (MBM) in compound feeds, J.
Chemom., 18, 341–349 (2004).
A. I. Belousov, S. A. Verzakov and J. von Frese, Applicational aspects of support vector
machines, J. Chemom., 16, 482-489 (2002).
29. Batch Modeling
29.1. Batch Modeling (BM)
The main objective of the Batch Modeling plug-in is to model and monitor data from batch
processes to give information on whether the batch is progressing as expected.
Theory
Usage
Plot Interpretation
Method reference
Once the data to be used in modeling are defined, choose the number of Principal
Components (PCs) to calculate, from the Maximum Components box.
The Mean center data check box allows a user to subtract the column means from every
variable before analysis.
The Identify outliers check box allows a user to identify potential outliers based on
parameters set up in the Warning Limits tab.
The details of the analysis setup are provided in the Information box on the model inputs
tab. It is important to check the details in this box each time an analysis is performed, to
ensure that the correct parameters have been set. The information contained in this box is:
The Global Batch Modeling check box allows a user to build a global Batch model.
BM Model Inputs
Individual variables can be selected from the variable list table provided in this dialog by
holding down the control (Ctrl) key and selecting variables. Alternatively, the variable
numbers can be manually entered into the text dialog box. The Select button can be used
(which will bring up the Define Range dialog), or every variable in the table can be selected
by simply clicking on All.
Once the variables have been selected, to weight them, use the options in the Change
Selected Variable(s) dialog box, under the Select tab. The options include:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows the weighting of selected variables by predefined constant values.
Downweight
This allows the multiplication of selected variables by a very small number, such that
the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
Use the Advanced tab in the Weights dialog to apply predetermined weights to each
variable. To use this option, set up a row in the data set containing the weights (or create a
separate row matrix in the project navigator). Select the Advanced tab in the Weights dialog
and select the matrix containing the weights from the drop-down list. Use the Rows option
to define the row containing the weights and click on Update to apply the new weights.
Another feature of the Advanced tab is the ability to use the results matrix of another
analysis as weights, using the Select Results Matrix button. This option provides an internal
project navigator for selecting the appropriate results matrix to use as a weight.
The dialog box for the Advanced option is provided below.
BM Advanced Weights Option
Once the weighting and variables have been selected, click Update to apply them.
Set this tab up based on a priori knowledge of the data set in order to return outlier
warnings in the batch model. Settings for estimating the optimal number of components can
also be tuned here. The values shown in the dialog box above are default values and might
be used as a starting point for the analysis.
The warning limits in the Unscrambler® serve two major purposes:
The leverage and residual (outlier) limits are given as standard scores. This means that limit
of e.g. 3.0 corresponds to a 99.7% probability that a value will lie within 3.0 standard
deviations from the mean of a normal distribution. The following limits can be specified:
Leverage Limit
(default 3.0) The ratio between the leverage for an individual sample and the
average leverage for the model.
Sample Outlier Limit, Calibration
(default 3.0) The square root of the ratio between the residual calibration variance
per sample (Sample Residuals) and the average residual calibration variance for the
model (Total Residuals).
Sample Outlier Limit, Validation
(default 3.0) The square root of the ratio between the residual validation variance
per sample (Sample Validation Residuals) and the total residual validation variance
for the model (Total Residuals).
Individual Value Outlier, Calibration
(default 3.0) For individual values in the calibration residual matrix (Residuals), the
ratio to the model average is computed (square root of the Variable Residuals). For
spectroscopic data this limit may be set to 5.0 to avoid many false positive warnings
due to the high number of variables.
Individual Value Outlier, Validation
(default 2.6) For individual values in the calibration residual matrix (Residuals), the
ratio to the validation model average is computed (square root of the Variable
Validation Residuals). For spectroscopic data this limit may be set to 5.0 to avoid
many false positive warnings due to the high number of variables.
Variable Outlier Limit, Calibration
(default 3.0) The square root of the ratio between the residual calibration variance
per variable (Variable Residuals) and the average residual calibration variance for
the model (Total Residuals).
Variable Outlier Limit, Validation
(default 3.0) The square root of the ratio between the residual validation variance
per variable (Variable Validation Residuals) and the total residual validation variance
for the model (Total Residuals).
Total Explained Variance (%)
(default 20) If the model explains less than 20% of the variance, the optimal number
of components is set to 0 (see the Info Box).
Ratio of Calibrated to Validated Residual Variance
(default 0.5) If the residual variance from the validation is much higher than the
calibration a warning is given.
Ratio of Validated to Calibrated Residual Variance
(default 0.75) If the residual variance from the calibration is much higher than the
validation a warning is given. This may occur in case of test set validation where the
test samples do not span the same space as the training data.
Residual Variance Increase Limit (%)
(default 6) This limit is applied for selecting the optimal number of components and
is calculated from the residual variances of two consecutive components. If the
residual variance for the next component is less than 6% lower than that of the
previous component, the optimal number of components is set to the previous one.
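As an illustration, one plausible reading of this component-selection rule is sketched below in Python. The helper and its details are assumptions for illustration only, not the software's exact implementation:

```python
def optimal_components(residual_variances, increase_limit=0.06, min_explained=0.20):
    """Illustrative component selection from total residual variances.

    residual_variances[k] is assumed to be the total residual variance with
    k components (index 0 = no components). Stop adding components when the
    next one lowers the residual variance by less than `increase_limit`
    (6%) relative to the previous component; return 0 if the total
    explained variance stays below `min_explained` (20%)."""
    opt = 0
    for k in range(1, len(residual_variances)):
        prev, cur = residual_variances[k - 1], residual_variances[k]
        if prev > 0 and (prev - cur) / prev < increase_limit:
            break
        opt = k
    explained = 1.0 - residual_variances[opt] / residual_variances[0]
    return opt if explained >= min_explained else 0
```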
When all the options are specified click OK.
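The quoted correspondence between a standard-score limit and normal-distribution coverage (a limit of 3.0 corresponding to 99.7%) can be checked numerically; this small helper is illustrative only:

```python
import math

def coverage(limit):
    """Two-sided probability that a standard normal value lies within
    ±limit standard deviations of the mean."""
    return math.erf(limit / math.sqrt(2.0))
```

For example, coverage(3.0) evaluates to about 0.9973, matching the 99.7% quoted for the default limit of 3.0.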
Predefined BM plots
PCA overview
Scores
Scores
30. Moving Block
30.1. Moving Block
Block methods are a particular form of evolutionary process modeling. Statistics such as
mean and standard deviation are reported for single or multivariate sensor data collected at
regular time intervals during a process. These can be used to trend the progress of an
evolving system, such as blending, mixing and drying operations.
The moving block statistics can be based on either raw data or scores (i.e. projections of the
data onto a multivariate model).
Theory
Usage
Plot Interpretation
Prediction
Block Definitions
Individual Block Mean (IBM)
Individual Block Standard Deviation (IBSD)
Moving Block Mean (MBM)
Moving Block Standard Deviation (MBSD)
Percent Relative Standard Deviation (%RSD)
The Unscrambler X Main
The statistics for a particular block, calculated over a region of a given length, are
described in the following.
The IBM for a block is similar to the associated sensor reading for a single sample. If these
are spectra the IBM will resemble a spectrum from the same spectral region. The collection
of IBMs can be plotted as line or bar plots to assess the differences between multiple
blocks.
Individual Block Means for a Collection of Spectra
As for the IBM, the IBSD can also be plotted as line or bar plots. These will indicate the
degree of sample spread within different blocks.
Individual Block Standard Deviations for a Collection of Spectra
Upper and lower limits can be defined and plotted with the MBM in a trend chart to monitor
e.g. when a process reaches stable conditions.
Moving Block Mean Trend Chart
Upper and lower limits can be defined and plotted with the MBSD in a trend chart to
monitor e.g. when a process reaches stable conditions.
Moving Block Standard Deviation
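As a rough illustration of the block statistics defined above (IBM, IBSD, MBM, MBSD and %RSD), the following sketch computes them for a single sensor signal; the function names are ours, not the software's:

```python
import numpy as np

def individual_blocks(x, block_size):
    # IBM / IBSD: mean and standard deviation of non-overlapping blocks
    x = np.asarray(x, dtype=float)
    starts = range(0, len(x) - block_size + 1, block_size)
    ibm = np.array([x[i:i + block_size].mean() for i in starts])
    ibsd = np.array([x[i:i + block_size].std(ddof=1) for i in starts])
    return ibm, ibsd

def moving_blocks(x, block_size):
    # MBM / MBSD / %RSD: the same statistics over a moving (overlapping) window
    x = np.asarray(x, dtype=float)
    windows = np.array([x[i:i + block_size]
                        for i in range(len(x) - block_size + 1)])
    mbm = windows.mean(axis=1)
    mbsd = windows.std(axis=1, ddof=1)
    rsd = 100.0 * mbsd / mbm   # Percent Relative Standard Deviation
    return mbm, mbsd, rsd
```

The MBM and MBSD vectors are what would be plotted in the trend charts, with the user-defined upper and lower limits overlaid.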
Once the data to be used in modeling are defined, click Add to specify the combination of
methods and wavelength region.
30.3.2 Region
Once the input data are valid, regions can be added.
Region Pane
Range
Allows the user to define the first and last column to be included for this region, relative to the full range of the Input data.
Apply to scores
When checked, allows the user to select all PCA models in the project with a matching number of required variables. Additionally, the Components option allows the number of components used in the selected model to be changed. The default value is the number of components set for prediction.
If there are multiple regions in the model, use the toolbar drop-down box to select which
region to plot. The toggle buttons can be used to switch between Individual Block Mean
and Standard Deviation plots.
Individual Block Mean for a Collection of Spectra
If there are multiple regions in the model, use the toolbar drop-down box to select which
region to plot. The toggle buttons can be used to switch between Moving Block Mean and
Standard Deviation plots, or a combination of both.
Moving Block Combined Trend Chart
Upper and lower limits for the trend charts can be set using the right-click Set Limits
function. Also use the toggle button to see the Moving Block Mean trend plots.
Moving Block Mean Trend Chart
By clicking on any Method Node in the tree, the Region and Trend plot names become
visible in the dialog box, allowing the user to set limits for the methods. By default, the
Method Node selected will be the one associated with the plot that was right-clicked.
By selecting the Upper Limit radio button, the user can set an upper limit in the Trend
Plot and save it to the model for use in Tasks – Prediction, for comparing new data to an
established model.
By selecting the Lower Limit radio button, the user can set a lower limit in the Trend
Plot and save it to the model for use in Tasks – Prediction, for comparing new data to an
established model.
31. Orthogonal Projections to Latent Structures
31.1. Orthogonal Projection to Latent Structures
Theory
Usage
Plot Interpretation
Method reference
Orthogonal Projection to Latent Structures (OPLS) models both the X- and Y-matrices
simultaneously in terms of components (or factors, latent variables). The difference between
PLSR and OPLS lies in the way these components are calculated. The loading weights vector
of the first component is identical to that of PLSR, whereas the subsequent components in
OPLS are calculated so as to be orthogonal to the first one. In the case of a single response
variable (y), the first loading weights vector for PLSR and OPLS represents the individual
covariances (or correlations, if the variables are scaled to unit variance), except that the
vector is normalized to 1.0. Note that the final regression coefficient vector is identical to
that of PLS in the case of one y-variable; thus the predictions are also identical in the case of
a single y-variable.
It is known that a regression model with one y-variable can always be described with one
component, where the y-orthogonal part of X can be separated from the predictive part. The
direct way of orthogonalizing X on Y is Direct Orthogonalization, where all orthogonal
variance in X is represented by one matrix, E. OPLS separates the y-orthogonal part of X into
a structured part and the residual (error).
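The separation can be sketched for a single y in the spirit of the Trygg & Wold algorithm; this is a minimal illustration with our own variable names, not the software's implementation, and it assumes mean-centered X and y:

```python
import numpy as np

def remove_one_orthogonal_component(X, y):
    """Remove one y-orthogonal component from mean-centered X (single y)."""
    w = X.T @ y / (y @ y)              # covariance direction (loading weights)
    w /= np.linalg.norm(w)
    t = X @ w                          # predictive scores
    p = X.T @ t / (t @ t)              # X-loadings
    w_o = p - (w @ p) * w              # part of p orthogonal to w
    w_o /= np.linalg.norm(w_o)
    t_o = X @ w_o                      # orthogonal scores
    p_o = X.T @ t_o / (t_o @ t_o)      # orthogonal loadings
    X_filt = X - np.outer(t_o, p_o)    # X with the y-orthogonal structure removed
    return X_filt, t_o, p_o
```

By construction the orthogonal scores t_o carry no information about y (their covariance with y is zero), which is the defining property of the structured y-orthogonal part.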
The total X-variance from the predictive and orthogonal components is the same as the X-
variance for PLSR with the corresponding total number of components. E.g. if there is one
predictive and two orthogonal components in OPLS, this corresponds to a 3-component
model for PLSR. That is, the orthogonal loading weights from OPLS may differ from the
loading weights from PLS for components one through the optimal number found by proper
validation. It is recommended to first run a PLSR model to find the optimal validated number
of components and then run OPLS. If there are more y-variables there may be more than
one predictive component, but not more than the number of y-variables.
Orthogonal Signal Correction (OSC) is another method that separates the y-orthogonal part
of X. The difference to OPLS is mainly that the orthogonal part is not a part of the model
itself, but is separated out as a pre-processing step. See also OSC theory. More details can be
found in the OPLS literature.
Which method will reveal the true underlying structures of a given dataset cannot be known
beforehand. Multivariate Curve Resolution
(../27_Multivariate_Curve_Resolution/theory.htm) is an alternative method that does not
assume the true signals to be orthogonal.
If the classical PLSR indicates that one component is optimal, then the so-called predictive
component in OPLS will carry relevant qualitative, and sometimes also quantitative,
information as to which variables are important and how important they are. If the classical
PLSR indicates the optimal number of components to be e.g. four, one cannot in general
assume that the first component reveals the correct qualitative (or quantitative) information.
OPLS may be carried out with one or more Y-variables, meaning that multiple Y responses
can be used during regression modeling. In the case of multiple y-variables, OPLS gives
similar, but not exactly the same, results as PLS.
Thus, multiplying the individual score for each sample by the loading weights and squaring
the values can be used to estimate the sample variance due to the predictive part.
The predictive loading weight vector for each component is normalized to sum 1.0. Variables
with large loading weight values are important for the prediction of Y. Because of this
normalization, no absolute rule of thumb for important/not important can be set; one may
instead use the uncertainty test to estimate the significance of each variable.
31.2.2 Y-loadings
The Y-loadings for individual y-variables in OPLS are represented by the direct relationship
between the Y-variables and the predictive scores.
Regression coefficients
Regression coefficients show how each variable is weighted when predicting a particular Y
response. In the case of OPLS they are calculated from the predictive loading weights, the
orthogonal loadings and the y-loadings. Regression coefficients are a characteristic of all
regression methods and may provide interpretive insight into the quality of a model.
Examples include:
Spectroscopy: Regression coefficients should have “spectral characteristics” about
them and not show noise characteristics.
Process data: When different variable types exist the variables should be scaled to
unit variance. Regression coefficients show the relative importance of the variables
and their interactions can also be displayed if added to the original data table with
Tasks - Transform - Interaction_and_Square_Effects.
As the regression coefficients in OPLS are identical to those of PLS, they are presented in the same way.
Predicted vs. reference plot
The predicted vs. reference plot is another common feature of all regression methods. The
predicted vs. reference plot should ideally show a straight-line relationship between
predicted and reference values, with a slope of 1 and a correlation close to 1.
When a data table is available in the Project Navigator use the Tasks-Analyze menu to run a
suitable analysis – here, Orthogonal Projection to Latent Structures.
Orthogonal Projection to Latent Structures Inputs
The Mean Center check box allows a user to subtract the column means from every variable
before analysis. This option should be enabled unless one can assume that the origin is a
valid sample in the data, i.e. when zero concentration means no signal.
The details of the analysis setup are provided in the Information box on the model inputs
tab. It is important to check the details in this box each time an analysis is performed, to
ensure that the correct parameters have been set. The information contained in this box is:
Individual X- and Y-variables can be selected from the variable list table provided in this
dialog by holding down the control (Ctrl) key and selecting variables. Alternatively, the
variable numbers can be manually entered into the text dialog box, the Select button can be
used (which takes one to the Define Range dialog box), or one can simply click All to select
every variable in the table.
Once the variables have been selected, to weight them, use the options in the Change
Selected Variable(s) dialog box, under the Select tab. The options include:
A/(SDev +B)
This is a standard deviation weighting process where the parameters A and B can be
defined. The default is A = 1 and B = 0.
Constant
This allows selected variables to be weighted by predefined constant values.
Downweight
This allows for the multiplication of selected variables by a very small number, such
that the variables do not participate in the model calculation, but their correlation
structure can still be observed in the scores and loadings plots and in particular, the
correlation loadings plot.
Block weighting
This option is useful for weighting various blocks of variables prior to analysis so that
they have the same weight in the model. Check the Divide by SDev box
to weight the variables with standard deviation in addition to the block weighting.
Advanced tab
Use the Advanced tab in the X- and Y-Weights dialog to apply predetermined weights to
each variable. To use this option, set up a row in the data set containing the weights (or
create a separate row matrix in the project navigator). Select the Advanced tab in the
Weights dialog and select the matrix containing the weights from the drop-down list. Use
the Rows option to define the row containing the weights and click on Update to apply the
new weights.
The methods provided in The Unscrambler® for the validation of OPLS models are:
Leverage Correction
A first pass validation technique used for checking for the presence of gross outliers
and for “big data”.
Cross Validation
Used to simulate a test set, when there are not enough samples to define an
independent test set.
Uncertainty Test
Can be used to determine the significance of variables when using cross validation.
Check the Uncertainty Test box; the available options are to use the optimal number
of factors found in a model, or to define the number of factors to use for the test.
For OPLS the number of factors is related to the number of orthogonal factors
specified in the main dialog.
When there are missing values in the data, the options are to impute them automatically
using the NIPALS algorithm, or as a pre-processing step using Fill Missing.
Test Set
The most reliable way of assessing the performance of a PLSR model. It uses samples
that are independent of the calibration set.
When applying Test Set validation, the user must ensure that the test matrices have
the same column dimensions as the calibration set.
31.3.4 Autopretreatments
The Autopretreatments tab allows a user to register the pretreatments used during the OPLS
analysis, so that when future predictions are made, these pretreatments are automatically
applied to the new data. The pretreatments become part of the saved model. An example
dialog box for Autopretreatment is provided below.
The OPLS Autopretreatment Tab Options
Pretreatments can also be registered from the OPLS node in the project navigator. To
register the pretreatment, right click on the OPLS analysis node and select Register
Pretreatment.
Many of the OPLS plots are the same as, or similar to, those for PLSR. The OPLS plots are
described below. For more details, refer to the section on PLS.
Predictive Scores
This is a one-dimensional bar plot of scores for one specified component. Samples with a
high absolute score value are influential in estimating the predictive loading weights.
Explained Y-variance
This plot illustrates how much of the variation in the responses is described by each
component.
Total explained variance is computed as:
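A standard way of computing it (our sketch; the software's exact formula may differ in details such as calibration vs. validation residuals) compares the residual sum of squares with the total mean-centered sum of squares:

```python
def total_explained_variance_pct(y, y_hat):
    """Sketch: 100 * (1 - SS_res / SS_tot) for reference y and fitted y_hat."""
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 100.0 * (1.0 - ss_res / ss_tot)
```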
To display the results for other Y-variables, use the variable icon. By default the results are
shown for a specific number of factors, which should reflect the dimensionality of the
model. If the number of factors is not satisfactory, it is possible to tune it up or down with
the arrow tools.
Some statistics are available giving an idea of the quality of the regression. They are:
Slope
The closer the slope is to 1, the better the data are modeled.
Offset
This is the intercept of the regression line with the Y-axis, i.e. its value when X is
zero. (Note: this value is not necessarily zero!)
RMSE
The first one (in blue) is the Calibration error RMSEC, the second one (in red) is the
expected Prediction error, depending on the validation method used. Both are
expressed in the same unit as the response variable Y.
R-squared
The first one (in blue) is the calibration R-Squared value taken from the calibration
Explained Variance plot for the number of components in the model, the second
one (in red) is also calculated from the Explained Variance plot, this time for the
validation set. It is an estimate of how good a fit can be expected for future
predictions.
Note: RMSE and R-Squared values are highly dependent on the validation method
used and the number of components in a model.
When the plot statistics are toggled on, more detailed statistics are displayed. The
Calibration plot is shown below with statistics.
Predicted vs. Reference plot for Calibration samples
Bias
This is the mean deviation over all points; it indicates whether the points lie
systematically above (or below) the regression line. A value close to zero indicates a
random distribution of points about the regression line.
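The statistics above (slope, offset, RMSE, R-squared and bias) can be sketched for a set of reference values y and predictions y_hat; the function name is ours and this is illustrative only:

```python
import numpy as np

def pred_vs_ref_stats(y, y_hat):
    """Sketch of predicted-vs-reference plot statistics."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    slope, offset = np.polyfit(y, y_hat, 1)       # regression of predicted on reference
    rmse = np.sqrt(np.mean((y_hat - y) ** 2))     # root mean squared error
    r2 = 1.0 - np.sum((y_hat - y) ** 2) / np.sum((y - y.mean()) ** 2)
    bias = np.mean(y_hat - y)                     # systematic over-/under-prediction
    return slope, offset, rmse, r2, bias
```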
Orthogonal Scores
This is either a bar plot or a two-dimensional scatter plot (or map) of scores for two specified
orthogonal components.
The closer the samples are in the scores plot, the more similar they are with respect to the
two components concerned. Conversely, samples far away from each other are different
from each other. The plot can be used to interpret differences and similarities among
samples. Look at the scores plot together with the corresponding loadings plot for the same
two components. This can help in determining which variables are responsible for
differences between samples. For example, samples to the right of the scores plot will
usually have a large value for variables to the right of the loadings plot, and a small value for
variables to the left of the loadings plot.
Orthogonal X-Loadings
This is either a bar plot or a two-dimensional scatter plot (or map) of the variables for two
specified orthogonal components.
Variables close to each other in the loadings plot will have a high positive correlation if the
two components explain a large portion of the variance of X. Variables in diagonally opposed
quadrants will have a tendency to be negatively correlated.
Regression coefficients
General
Y-residuals vs. Predicted Y
This is a plot of Y-residuals against predicted Y values. If the model adequately predicts
variations in Y, any residual variations should be due to noise only, which means that the
residuals should be randomly distributed. If this is not the case, the model is not completely
satisfactory, and appropriate action should be taken. If strong systematic structure (e.g.
curved patterns) is observed, this can be an indication of lack of fit of the regression model.
The figure below shows a situation where one sample has a much higher Y-residual than the
other samples.
Leverage/Hotelling’s T²
Leverage
Leverages are useful to find influential samples in the model space. If all samples have
leverages between 0.02 and 0.1, except for one, which has a leverage of 0.3, although this
value is not extremely large, the sample is likely to be influential.
Leverage plot
There is an ad-hoc critical limit for leverage which is shown as a red line. The limit is 3 times
the average leverage for the calibration samples.
Hotelling’s T²
The Hotelling’s T² plot is an alternative to plotting sample leverages. The plot displays the
Hotelling’s T2 statistic for each sample as a line plot. The associated critical limit (with a
default p-value of 5%) is displayed as a red line.
Hotelling’s T² plot
The Hotelling’s T² statistic has a linear relationship to the leverage for a given sample. Its
critical limit is based on an F-test. There are 6 different significance levels to choose from
using the drop-down list:
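The linear relationship mentioned above can be sketched as follows for a mean-centered model with scores matrix T (n samples by A components); the names and the leverage convention (including the 1/n term) are our assumptions, not the software's internals:

```python
import numpy as np

def leverage_and_t2(T):
    """Leverage h and Hotelling's T-squared from calibration scores T (n x A)."""
    n = T.shape[0]
    # leverage: 1/n plus the scaled squared scores summed over components
    h = 1.0 / n + np.sum(T ** 2 / np.sum(T ** 2, axis=0), axis=1)
    # T-squared is a linear function of leverage
    t2 = (n - 1) * (h - 1.0 / n)
    return h, t2
```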
31.6. Bibliography
J. Trygg and S. Wold, Orthogonal projections to latent structures (O-PLS), Journal of
Chemometrics, 16, 119-128 (2002).
O. Svensson, D. Kourti and J. MacGregor, An investigation on orthogonal signal correction
algorithms and their characteristics, Journal of Chemometrics, 16, 176-188 (2002).
R. Ergon, Finding Y-relevant part of X by use of PCR and PLSR model reduction methods,
Journal of Chemometrics, 21, 537-546 (2007).
E.K. Kemsley and H.S. Tapp, OPLS filtered data can be obtained directly from non-
orthogonalized PLS1, Journal of Chemometrics, 23, 263-264 (2009).
32. Prediction
32.1. Prediction
Prediction (estimation of unknown response values using a regression model) may be the
purpose of a regression application. This section describes how to use an existing regression
model to predict response values for new samples.
Theory
Usage
Plot Interpretation
Method reference
Note: The model validation can only be considered successful when one has:
This prediction method is simple and easy to understand. However, it has the disadvantage
that few sample or variable outlier diagnostics are available, compared to projection
methods such as full PCR and PLSR predictions. In The Unscrambler® this method, which
uses just the regression coefficients, is called short prediction.
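A short prediction is then just a linear combination of the new X-values with the stored coefficients; a minimal sketch (names are illustrative):

```python
def short_predict(X_new, b, b0=0.0):
    """Sketch of short prediction: y = b0 + x . b for each new sample row."""
    return [b0 + sum(x * w for x, w in zip(row, b)) for row in X_new]
```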
X = TPᵀ + E
and
Y = TQᵀ + F
For these models Y is expressed as an indirect function of the X-variables using the scores T,
the X-loadings P and the Y-loadings Q (for PLSR).
The advantage of using the projection equation for prediction is that when projecting a new
sample onto the X-part of the model (this operation gives the t-scores for the new sample),
one simultaneously gets a leverage value and an X-residual for the new sample, hence
allowing outlier detection.
A prediction sample with a high leverage and/or a large X-residual may be a prediction
outlier. Such samples may not be considered as belonging to the same “population” as the
samples the regression model was based on, and therefore one should treat the predicted Y-
values with caution.
Note: Using leverages and X-residuals, prediction outliers can be detected without
any knowledge of the true value of Y.
Inlier statistic
Hotelling’s T² statistic
Q residual statistic
Inlier statistic
The inlier statistic is based on the principle that if samples, when predicted, lie far from the
nearest calibration sample in the scores plot, they should be flagged as an “inlier”. An
“inlier” should be interpreted as a potential outlier. Whereas samples with high leverages
will be found far from the origin of the scores plot (outside the Hotelling’s ellipse), an inlier
may be found anywhere in the scores plot.
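A sketch of this statistic, assuming it is computed in score space with each component scaled by its variance (the exact scaling is our assumption):

```python
import numpy as np

def inlier_distance(t_new, T_cal):
    """Minimum Mahalanobis-type distance from a projected sample to the
    calibration samples, computed in score space (illustrative sketch)."""
    s2 = T_cal.var(axis=0, ddof=1)                 # per-component score variance
    d2 = np.sum((T_cal - t_new) ** 2 / s2, axis=1) # squared distance to each sample
    return np.sqrt(d2.min())                       # distance to the nearest one
```

A sample whose inlier distance exceeds the inlier limit would be flagged as a potential "inlier" even though it may sit well inside the Hotelling's ellipse.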
In the plots below, the sample marked “E” in the scores plot is inside the range of possible
samples but far from any calibration sample, and is therefore considered an inlier. It is
above the inlier limit, as can be seen in the Inlier plot.
Scores plot showing the inlier in the calibration range/Inlier plot with one inlier
In The Unscrambler®, the inlier statistics for predicted samples can also be displayed as a 2-D
scatter plot together with the Hotelling’s T² statistic critical limits (with a default p-value of
5%).
Inlier vs. Hotelling’s T² plot with one inlier
Hotelling’s T² statistic
Predicted samples which have model distances far away from the samples in the calibration
set may also be outside the Hotelling’s T² limit (and consequently the Hotelling’s T² ellipse in
the scores plot). The Hotelling’s T² statistic is computed as a linear function of sample
leverage and can be compared to a critical limit according to an F-test. In The Unscrambler®,
the Hotelling’s T² statistics for prediction samples are displayed as a 2-D scatter plot
together with the inlier statistics.
Q residual statistic
When a full prediction is run, the Q residual limits are calculated, and the X sample Q
residual matrix is also included with the Outputs. This additional statistic is the sum of the
squares of the residuals and can be used to determine whether predicted samples are outliers.
The Q residual contributions for each predicted sample are also provided along with the
average model Q residual contribution. These results are found in the Outputs folder and
can be plotted to view how variables in the prediction samples differ from the average
variable values in the calibration model.
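For a model with orthonormal loadings P, the sample Q residual can be sketched as the squared distance between a sample and its reconstruction from the model; names are illustrative:

```python
import numpy as np

def q_residual(x, P):
    """Q residual of one sample x for a model with orthonormal loadings P
    (columns = components). Illustrative sketch."""
    t = x @ P          # scores for the sample
    x_hat = t @ P.T    # reconstruction from the model
    e = x - x_hat      # per-variable residuals (the Q contributions, squared)
    return float(e @ e)
```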
To run a prediction, a project should be opened containing a regression model and a data set
to be predicted. In the case where a prediction model is not available, the following warning
will be displayed.
Solution: First calculate a regression model on a training data set before applying the predict
function to new data.
For Bias and Slope correction, refer to Bias and Slope
Data Input
The following dialog boxes are available to enter data into.
Select model
From the Select model drop-down list, select the regression model to apply to new
data.
Components
Use the Components box to select the correct number of principal components for a
PCR model or factors for a PLSR model. The optimal number of components for the
model will be displayed and used by default.
Full Prediction/ Short Prediction
Full Prediction uses a projection on the latent space in the calculation. It will
provide comprehensive results such as plots and additional matrices for
increased data interpretation and outlier diagnostics.
Short Prediction uses only the extracted Regression (Beta) coefficients.
There are no plots associated with this type of prediction.
Inlier limit
The inlier limit is a measure of the maximum Mahalanobis distance between two
neighboring calibration samples. This feature is used for detecting outliers in the
prediction step.
Sample inlier distance
The sample inlier distance is a measure of the minimum Mahalanobis distance to the
calibration samples for each sample. This feature provides the individual values for
detecting outliers in the prediction step.
Identify Outliers
This option enables an automatic identification of outliers based on predefined
criteria. Several options are available for setting limits for outlier detection,
including,
Leverage limit.
Sample outlier limit, validation.
Individual value outlier, validation.
Total explained variance (%).
Data
Matrix: From the Data drop-down list, select the matrix to apply the
prediction model to.
Rows and Cols: Use the Rows and Columns boxes to define the range of the
data to be predicted.
Several criteria of the input data must be met for a successful prediction step. Warnings
associated with this option are presented as follows:
All samples or variable kept out
Solution: Ensure there are rows and columns available in the data set for prediction.
The dimensions of the test set do not match those of the calibration set
Solution: Ensure that the dimensions of the new data set match those of the calibration set.
Non-numeric values in a new data set
Solution: Ensure that the new data set does not contain any non-numeric columns.
Note: When a model has been developed and is to be used for prediction, it is
important to define the variable ranges in the new data table so that they match
the dimensions of the original model.
Include Y reference
Use the Include Y Reference option to add reference data if they are available so
that the predicted vs. reference plot and actual residuals can be calculated.
Matrix: From the Data drop-down list, select the matrix where the reference
data are.
Rows and Cols: Use Rows and Columns to define the Y-reference data to
include.
It is important to ensure that the same number of Y-variable data is available as was used to
develop the calibration model. The following warning will be provided if this is not the case,
Number of Y-variables should match those in the developed model
Solution: Ensure that the same number of Y-variables have data available as in the
original calibration model.
Click on OK to start the prediction.
If measured Y-values were added as input to the prediction, the Root Mean Squared Error of
Prediction (RMSEP) will be indicated by vertical red lines in each box.
Samples with large deviation are potential outliers. You should check the X-variable values
for the sample and see how they deviate from the calibration samples. If there has been an
error, correct it. If the values are correct, the conclusion is that the prediction sample does
not belong to the same population as the samples the model is based upon, and the
predicted Y values are not reliable.
Prediction table
This table plot shows the predicted values, their deviation, and the reference value (if
predicted with a reference value included).
The objective is to have predictions with as small a deviation as possible. Predictions with
high deviations may be outliers.
Prediction table
Note: This plot is built in the same way as the Predicted vs. Reference plot used
during calibration. It is possible to turn on Plot Statistics as well as the target and
the regression lines. The prediction R-square is useful to assess the quality of the
prediction.
Residuals/leverage
Sample residuals
This is a plot of the residuals for a specified sample and component number for all the X-
variables. It is useful for detecting outlying sample or variable combinations. Although
outliers can sometimes be modeled by incorporating more components, this should be
avoided since it will reduce the prediction ability of the model.
Detect variables that are not very well described by a model with a certain number of
components (factors). If this is the case for most of the samples, the variable(s) isolated
may be noisy and can be considered outliers.
In contrast to the variable residual plot, which gives information about residuals for all
samples for a particular variable, this plot gives information about all possible variables for a
particular sample. It is therefore useful when studying how a specific sample fits to the
model.
Leverage
This plot shows the leverage of the predicted samples. It is the distance from the projected
sample to the center of the model. The absolute leverage values are always larger than zero
and can (in theory) go up to 1 for a model sample. In prediction, an outlier sample can have
a leverage greater than 1. As a rule of thumb, samples with a leverage above 0.4 - 0.5
start being of concern.
In the plot below sample “S.057” has a leverage greater than 0.4. The last four samples show
high leverages, i.e. they are not as well described by the model compared to the other
samples.
Leverage in Prediction
Influence on the model is best measured in terms of relative leverage. For instance, if all
samples have leverages between 0.02 and 0.1, except for one, which has a leverage of 0.3,
although this value is not extremely large, the sample is likely to be influential.
For a critical limit on the leverages, look at the Hotelling’s T² line plot.
Inlier/Hotelling’s T²
Inliers
This plot displays the inlier statistic (minimum Mahalanobis distance to the calibration
samples) for each sample as a line plot. The associated critical limit (with a default p-value of
5%) is displayed as a red line.
This feature is a test for detecting outliers in the classification or prediction step. It is based
on the concept that a model may have an object space where there are “holes”, i.e. the
density of objects in some part of the calibration space is low.
All results on samples below the Inlier limit can be trusted.
Inliers
Note: It is possible to tune the number of PCs/Factors up or down with the arrow
tools.
Hotelling’s T²
The Hotelling’s T² plot is an alternative to plotting sample leverages. The plot displays the
Hotelling’s T² statistic for each sample as a line plot. The associated critical limit (with a
default p-value of 5%) is displayed as a red line.
The Hotelling’s T² statistic has a linear relationship to the leverage for a given sample. Its
critical limit is based on an F-test. Use it to identify outliers or detect situations where a
process is operating outside normal conditions.
Hotelling’s T²
Note: It is possible to tune the number of PCs/Factors up or down with the arrow
tools.
33. Batch Prediction
33.1. Batch Prediction
Batch Prediction may be used to generate scores and predicted values for a large set of files
in a directory or to predict files that will be added to a directory by an external application.
Note that the model selected needs to be compatible with the files; incompatible data files
are silently skipped.
Usage
All files in a chosen directory (and matching the extension filter) are processed.
Optionally, these files may be sorted by name prior to being queued for prediction.
Factors
Use the Factors box to select a suitable number of principal components for a PCR
model or factors for a PLS model. The optimal number of components for the model
will be displayed and used by default.
The location for the output data to be stored must also be defined in the output path.
33.2.2 Display
Go to the display tab to choose from predefined plots to display as the results are
generated. The number of data points to display is set at a default of 15, and can be changed
by the user. The standard options of plots that can be displayed are the predicted values, the
scores, and the Hotelling’s T² values with a limit set at a user-specified significance level
(default is 5%).
Batch predict display options
33.2.3 Options
On the Options tab, prediction limits can be set, and an alarm can be set to sound if those
limits or the Hotelling’s T² limit are crossed.
Batch predict options
33.2.4 Outputs
After the settings have been made, the batch prediction will run, and the designated plots
displayed on the screen as the data are analyzed.
Batch monitor
When the analysis is completed, click Close on the monitoring screen. The results are stored
as a csv file in the folder designated in the setup. The user is then prompted to load the
results into the open project.
Load batch results
When the results are loaded, the matrix is added to the project navigator.
34. Multiple Model Comparison
34.1. Multiple Model Comparison
Multiple Model Comparison is used for comparison of models in terms of their y-residuals
(from the chosen validation procedure) to assess whether the models are significantly
different with respect to prediction performance.
Theory
Usage
Plot Interpretation
Method reference
The models are compared through a two-way ANOVA on their validated residuals. For the M
models (m = 1, 2,…, M), the difference matrix D (I x M) contains the absolute validated
residual for the j = 1, 2,…, J response variables. Each element is modeled as

d_im = mu + alpha_i + beta_m + e_im

where mu is the overall mean, alpha_i is the effect of sample number i, beta_m are the
effects of the models m that are being compared, and e_im is the residual in the ANOVA
model. In the case of only comparing two models the 2-way ANOVA is identical to a
pair-wise t-test.
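For the two-model case, the reduction of the 2-way ANOVA to a pair-wise t-test can be sketched like this (illustrative only; `res_a` and `res_b` are hypothetical arrays holding each model's validated residuals for the same samples):

```python
import numpy as np

def compare_two_models(res_a, res_b):
    """Paired t-statistic on the absolute validated residuals of two models.

    With only two models, the two-way ANOVA on the I x M matrix of
    absolute residuals is equivalent to this pair-wise t-test.
    """
    d = np.abs(res_a) - np.abs(res_b)          # per-sample differences
    n = d.size
    return d.mean() / (d.std(ddof=1) / np.sqrt(n))
```

A clearly positive t-statistic would indicate that the first model has the larger absolute residuals, i.e. the poorer prediction performance.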
The comparison can be made for the existing data or new data.
For the option Re-use calibration data, the X and Y data are taken from the raw data
node of the first selected model.
For the option Apply to new data, the Predictors (X) and Responses (Y) will be user
provided.
The Select models tab provides the option to select the models for comparison.
Multiple Model Comparison dialog - select models
Before adding the first model, all the available models in the project navigator will be
displayed in the drop-down list. After the first model has been added, only models with
matching number of (validation) samples to the first model will be listed. The number of Y
variables has to match in all models. The first selected model will be used as the reference
for the number of responses.
Click Finish to start the prediction.
35. Tutorials
35.1. Tutorials
The tutorials section of The Unscrambler® was developed to help users implement methods in
practice and to guide them through the practical aspects of experimental design, data
analysis and interpretation of real results using The Unscrambler®.
The tutorials help to establish a basic understanding of the capabilities of The Unscrambler®,
an introduction to interpretation of results, and a feeling for the procedures of multivariate
data analysis. However, analysis of real-world data is seldom this straightforward! Normally,
data must be processed in some way before analysis, and numerous calibration iterations may
be required before the desired performance of a model is reached.
Quick Start
Complete cases
Tip: Arrange The Unscrambler® application window and the Help browser side by
side for greater workflow efficiency.
Tip: Copy this directory to the home directory of the working computer, e.g. into the
“Documents” directory, and use File – Open… to load the files, in order to avoid
overwriting the original data. This way a copy of the unaltered data is always available
in the event a working copy has been altered.
From within each tutorial there is a convenient hypertext link to directly import the data set
used in the given tutorial. An example link is provided below:
Open the tutorial A data set
35.2. Complete
35.2.1 Complete cases
Read the details below to understand which tutorials are useful in specific application cases,
and also to gain some practical advice for running the tutorials. The tutorials present
application examples and contain detailed step-by-step instructions on how to use The
Unscrambler®.
Depending on an analyst’s degree of experience in using The Unscrambler® and the
particular fields of interest for application of the program, the following lists the
recommended tutorials for a specific user experience level:
Summary of The Unscrambler® tutorials
Tutorial A: a simple example of calibration (prerequisites: PLS, univariate analysis)
Tutorial C: spectroscopy and interference (prerequisites: PLS, transformations, spectroscopy)
Description
Expected outcomes of this tutorial
Data table
Opening the project file
Define ranges
Univariate regression
Calibration
Interpretation of the results
Prediction
Evaluation of the predicted results
Description
This tutorial aims to provide an example of the measurement of the concentration (Y) of a
chemical constituent “a” by use of conventional transmission spectroscopy. The situation is
complicated by the presence of an interferent “b”, which is present in varying unknown
quantities. Under these conditions, the instrument response of “b” strongly overlaps that of
“a”.
References:
Data table
The data for this tutorial can be found in the project file “Tutorial A” in the “Data” directory
installed with The Unscrambler®.
Seven solutions (samples) of known concentration (Y) of the constituent “a” will be used as
the calibration set. Three other (test) samples of unknown concentration are available.
These will be predicted using the developed regression model.
Light absorbance was measured at two different wavelengths, namely Red and Blue. Red is
variable 1, Blue is variable 2. Variable 3 has been designated as the concentration of a.
Opening the project file
Task
Open the project “Tutorial A” into The Unscrambler® project navigator and study the data in
the Editor. Use the Descriptive Statistics functionality to view some basic characteristics of
the data table.
How to do it
Use File - Open to select the project file “Tutorial_A.unsb” in The Unscrambler® data
samples directory. This directory is typically located in C:\Program Files\The
Unscrambler X\Data.
For the purposes of this tutorial, click the following link to import the data. Tutorial A data
set
The project should now be visible in the project navigator and the data should be displayed
in the editor.
Note that the values for variable Comp “a” are missing (blank) for the 3 Unknown samples.
Use the Tasks-Analyze-Descriptive Statistics… option to view some basic statistics of the
data, including the Mean, Standard Deviation, Skewness etc.
Tasks-Analyze-Descriptive Statistics…
The following dialog will open. Select the data matrix to be analyzed and ensure that no
rows or columns have been excluded from the analysis.
Descriptive Statistics Dialog
After clicking OK, the statistics will be computed. A new analysis node will appear in the
project navigator providing some simple plots and analysis of the data.
Descriptive Statistics Results Matrix
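The statistics listed here can also be reproduced by hand; a minimal sketch follows (the skewness convention may differ slightly from the software's):

```python
import numpy as np

def descriptive_stats(x):
    """Mean, sample standard deviation and skewness of one variable."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    sdev = x.std(ddof=1)            # sample standard deviation
    m2 = ((x - mean) ** 2).mean()   # population central moments
    m3 = ((x - mean) ** 3).mean()
    skew = m3 / m2 ** 1.5           # Fisher's moment coefficient of skewness
    return mean, sdev, skew
```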
Define ranges
In most practical applications of multivariate data analysis, it is necessary to work on subsets
of the data table. To do this, one must define ranges for variables and samples. One Sample
Set (Row range) and one Variable Set (column range) make up a virtual matrix which is used
in the analysis.
Task
Define two Column ranges (variable sets), one for “Light Absorb” and the other
for “Constituent a”. Also define two Row ranges (sample sets), “Calibration Samples” and
“Prediction Samples”.
How to do it
There are two options for defining data ranges in The Unscrambler®:
Create Row/Column ranges using the right mouse click option
Highlight a range of variables to be defined and right click in the column header. This
will display the Create Column Range option. Sample sets can also be defined as row
ranges using a similar method and selecting Create Row Range.
Create a column range
Rename the column range that is automatically highlighted in the project navigator.
If it is not highlighted, select it and right click. Choose the Rename option, and change the
name to “Constituent a”.
Repeat this process for the “Light Absorbance” set containing the first two columns
and the row sets: “Calibration” containing samples 1 to 7 and “Prediction”
containing samples 8 to 10.
Use Edit - Define Range… to create row and column sets.
Open the Define Range dialog from the Edit menu. Define the data as follows,
Name: Light Absorbance
Interval: columns 1-2
Define Range Dialog
Enter the Column numbers directly into the Set Interval field under rows and
columns.
Deselect variables marked by mistake by pressing Ctrl while clicking on the variable
to be removed from the set.
Click OK.
Similarly define the second variable Set using the Edit -Define Range option and
specifying:
Name: Constituent A
Click OK.
Choose Edit - Create Row Range to create sample sets.
Four sample and variable sets should now be displayed in the project navigator.
Data set with ranges
By organizing the data into sets from the beginning, one can add value to the analysis and
also use this information to communicate results. All analyses and plotting will be much
easier to set up, and can be used in the visualization of results.
Remember to save the project before proceeding, select File - Save or press the button.
Univariate regression
The simplest regression method (univariate regression) can be simply visualized in a 2-
dimensional scatter plot.
Task
Make a regression model of component “a” and the absorbance of red light.
How to do it
Perform the regression by plotting the red light variable against Constituent a. Select Plot -
Scatter from the Plot menu. The following plot should appear.
Scatter plot
The univariate regression should be performed on the calibration samples only, as the Y-
values are missing in the prediction set.
The plot is displayed without the trend lines visible. Toggle the regression and/or target
line on and off using the toolbar shortcut, and view the statistics for the plot.
The displayed correlation value of 0.91 indicates that the two variables are highly correlated.
The univariate model for these data can be generated from the Offset and Slope values shown
in the plot statistics.
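The Offset and Slope behind the plot's regression line come from an ordinary least-squares fit, which can be sketched as follows (with hypothetical data, not the tutorial's values):

```python
import numpy as np

def univariate_fit(x, y):
    """Least-squares offset and slope for y = offset + slope * x."""
    slope, offset = np.polyfit(x, y, 1)   # degree-1 polynomial fit
    return offset, slope
```

The correlation value shown in the plot statistics corresponds to `np.corrcoef(x, y)[0, 1]`.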
Weights
Click the tabs for both X and Y weights to see which options apply for each sheet.
Since the data are of spectral origin, ensure the weights are All 1.0
Validation
Under the validation tab select the cross validation option. Click on Setup to choose
Full from the drop-down list.
It is important to properly validate models. Leverage correction is not recommended,
as it gives an overly optimistic estimate of the error of a model. The estimate of
the prediction error (validation variance) is more conservative with cross validation
than with leverage correction!
Cross Validation Dialog
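The reasoning behind this recommendation can be illustrated with a toy leave-one-out (full cross validation) error estimate for a univariate model (a sketch only, with hypothetical data):

```python
import numpy as np

def loo_rmse(x, y):
    """Full (leave-one-out) cross-validation RMSE for a univariate fit."""
    errors = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i          # leave sample i out
        slope, offset = np.polyfit(x[keep], y[keep], 1)
        errors.append(y[i] - (offset + slope * x[i]))
    return float(np.sqrt(np.mean(np.square(errors))))
```

Because every sample is predicted by a model that never saw it, this estimate is more conservative than the calibration (fitting) error.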
Scores,
Loadings,
Variance,
and Predicted vs. reference.
When OK has been selected in the PLS dialog box and Yes has been selected to view the
plots, a PLS node will be added to the project navigator. This node contains the following,
Raw data,
Results,
Validation,
Plots.
The raw data used for building the model is stored in the results folder. Validation results
matrices generated from the model can be viewed along with predefined plots for the
analysis.
Toggle between different plots from those available in the project navigator. Alternatively
use the Plot… menu option, or right click in a plot to select a desired plot.
Information about the model is available in the Information field, located at the bottom of
the project navigator view. Information such as how many samples were used to develop
the model and the optimal number of factors is contained here.
Model info box
A number of important calculated results matrices may be obtained from the PLS node.
Returning to the PLS overview, activate the Scores plot, which is in the upper left quadrant
of the overview, by clicking in it.
Right click on this plot and select the Properties option.
Properties option
Select Point label from the available options, and in the dialog change the label to sample
number instead of sample name.
Properties: Point label
Activate the Predicted vs. Reference plot (lower right quadrant of the PLS overview). In this
plot, colors are used to differentiate between Calibration results (in blue) and Validation
results (in red).
Use the Next Horizontal PC and Previous Horizontal PC buttons to display the
Predicted vs. Reference for one and two PLS Factors.
Use the Cal/Val buttons to toggle between the calibration and validation samples. It
is also possible to toggle the regression and trend lines on and off.
Interpret the Y-Residual Validation Variance Curve
Activate the Y residuals plot in the lower left quadrant of the PLS overview and choose
Cal/Val for Y from the toolbar shortcuts.
Notice that the residual variance is down to 0 after factor 2. This usually indicates that the
model size is 2. Also, there is more Y-variance explained in the second factor than in the
first (39 vs. 61); this indicates that there may be an outlier.
Residual Y variance plot
Study the Predicted vs. Reference Plot
Under the PLS node in the project navigator, expand the Plots folder and select Predicted vs.
Reference to display this plot in the viewer.
The Predicted vs. Reference plot appears. The estimated prediction quality of the model
may be determined.
Use the toolbar icons to toggle between the regression and/or target lines.
High quality predictions were obtained from this PLS model. Comparison of the multivariate
regression model with the univariate regression model shows the marked improvement gained
by using the multivariate model. This gives confidence in the future prediction of unknown
values.
Study the Regression Coefficients Plot
From the main menu, choose the Plot - Regression Coefficients - Raw Coefficients (B) - Line
option. Change the plot layout to a bar chart using the toolbar shortcut .
Regression coefficients
This illustrates how to view raw regression coefficients (B), which define the model
equation. View the regression coefficients for the preceding factors using the arrows on
the toolbar.
In the present case, the values of the regression coefficients remain unchanged
when shifting from Weighted coefficients (Bw) to Raw coefficients (B). The reason is
that the weights were chosen as All 1.0 (no weighting) for the purposes of
calibration.
Regression coefficients can be viewed in different ways, such as lines, bars and
accumulated bars from the respective shortcut buttons found in the toolbar.
Hovering the mouse cursor over one of the bars displays numerical information associated
with the particular variable. Click once more to get the object information window. For the
two factor model developed in this tutorial, the b-coefficient for the Red absorbance is
1.0417, the b-coefficient for the Blue absorbance is -0.2083 and the offset (B0) is 1E-15, i.e.
approximately zero.
The b-coefficients can also be shown as a table by selecting the matrix Beta coefficients
(raw) in the Result folder of the PLS node in the project navigator.
Regression coefficients matrix
The b-coefficients represent the model equation relating the
concentration of “a” to the Red and Blue light absorbances:
Concentration of “a”: a = 0 + 1.0417 * Red – 0.2083 * Blue
Recall the value of the coefficient for Red in the univariate model (0.59524). This result
differs from what was found in the multivariate model.
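Using the coefficients reported above, applying the model equation to new samples can be sketched as:

```python
def predict_concentration(red, blue, b0=0.0, b_red=1.0417, b_blue=-0.2083):
    """Apply the model equation a = B0 + b_red * Red + b_blue * Blue.

    Default coefficients are the two-factor PLS values from this tutorial;
    B0 is taken as zero since the reported offset is ~1E-15.
    """
    return b0 + b_red * red + b_blue * blue
```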
The results should be saved in the project with the data.
Select File - Save or use the save tool and give the project file the name “Tutorial A”.
Prediction
The main purpose of developing a regression model is for future prediction of the properties
of new samples measured in a similar way.
Task
Use the PLS calibration model to predict the concentration of “a” for the three unknown
samples in the data table.
How to do it
Use the Tasks - Predict- Regression… option to predict the values of the new samples. Enter
the parameters below in the Prediction dialog:
Prediction dialog
It is possible to find all models in the current project using the drop-down list next to Select
model. Select the PLS model developed and click OK to start the prediction.
Evaluation of the predicted results
During the development stage of a regression model, the quality of the predictions must be
checked by evaluating the quality of the Predicted vs. Reference plot.
The predictions can be checked when some reference measurements are available. This is
not possible for the unknown samples in this tutorial as there are no reference
measurements available for these samples. However, a method exists for determining the
quality of the predictions, based on the properties of projection modeling.
Task
Perform a prediction and evaluate the quality of the predicted results.
How to do it
First, evaluate the predicted results of the unknown samples and determine if these values
are in the same range as the calibration range of samples. Select the Prediction plot under
the new Predict – Plots node in the project navigator to visually assess the results.
Prediction with deviation
The predicted values are displayed as horizontal bars. The size of the bars represents the
deviation (uncertainty) in the estimates. The numerical values for the Y Predicted values and
Y deviations can be found in the output matrices, and are displayed under the plot. A
comparison of these predictions to actual values cannot be made; however, if the new
samples have predicted values similar to those in the calibration set and the size of the
deviation bars is small, the quality of the prediction may be trusted.
Predicted values
Another method for determining the reliability of the predicted values is to study the Inlier
vs. Hotelling’s T² plot available as a right click option in any plot.
Select the Prediction - Inlier/Hotelling’s T² - Inliers vs. Hotelling’s T² option to display this
plot.
For a prediction to be trusted, the sample must not be too far from the calibration samples.
This may be checked using the Inlier distance. The predicted sample’s projection onto the
model should not be too far from the center. This may be checked using the Hotelling’s T²
distance.
Inliers vs. Hotelling’s T²
In this case all the samples were found to be in the bottom left corner of the plot,
indicating that the predicted results can be trusted.
Description
Main learning outcomes
Data table
Preparing the data
Insert category variables
Check column (variable) sets
Define sample sets from category variable column
Objective 1: Find the main sensory qualities
Make a PCA model
Interpret the variance plot in the PCA overview
Interpretation of the scores plot for the PCA
Interpretation of the correlation loadings plot
Interpretation of scores and loadings
Interpretation of the influence plot
Objective 2: Explore the relationships between instrumental/chemical data (X) and
sensory data (Y)
Make a PLS regression model
Interpretation of the variance plot
Interpretation of the scores plot
Interpretation of the loadings and loading weights plot
Interpretation of the predicted vs. reference plot
Objective 3: Predict user preference from sensory measurements
Make a PLS regression model for preference
Interpretation of the regression overview
Description
This tutorial aims to use multivariate techniques to analyze the quality of raspberry jam in
order to determine which sensory attributes are relevant to “perceived quality”. The analysis
will cover three aspects as follows.
A trained tasting panel has provided scores for a number of different variables using
descriptive sensory analysis. In this tutorial the first objective is to find the main
sensory quality properties relevant for raspberry jam.
The second objective is to find a way of rationalizing quality control, since the use of
taste panels is very costly. In this application a number of laboratory instrumental
measurements were investigated to potentially replace the sensory testing panel.
The third and final objective of this application is to be able to predict consumer
preference for raspberry jam from descriptive sensory analysis. The use of PLS
regression modeling techniques was investigated in order to potentially find a
relationship between sensory data and preference.
References:
Data table
Click the following link to import the Tutorial B data set used in this tutorial.
The analysis is based on 12 samples of jam (objects), selected to span the expected, normal
quality variations inherent in such products. Several observations and measurements were
made on the samples.
Agronomic production variables
The samples were taken from four different cultivars, at three different harvesting times.
The table below describes the sampling plan for this analysis.
Sample description
No Name Cultivar Harvest time
1 C1-H1 1 1
2 C1-H2 1 2
3 C1-H3 1 3
4 C2-H1 2 1
5 C2-H2 2 2
6 C2-H3 2 3
7 C3-H1 3 1
8 C3-H2 3 2
9 C3-H3 3 3
10 C4-H1 4 1
11 C4-H2 4 2
12 C4-H3 4 3
Note that the agronomic production variables are not used as input variables in any of the
matrices. These represent known information which may be extremely valuable for the
interpretation of the results of the data analysis. They will be utilized as category variables in
the analyses performed in this tutorial.
Column (variable) set Instrumental
Three chemical and three instrumental (APHA colorimetry) variables were also
measured on the samples tested by the sensory panel. These are described in the table
below.
Instrumental variables
No Name Method
1 L Lightness
2 a Green-red axis
3 b Blue-yellow axis
4 Absorbance Absorbance
Sensory variables
No Name Type
1 Redness Redness
3 Shininess Shininess
6 Sweetness Sweetness
7 Sourness Sourness
8 Bitterness Bitterness
9 Off-flav Off-flavor
10 Juiciness Juiciness
11 Thickness Viscosity/thickness
Some additional information about the cultivar and harvest time now needs to be added to
this data as two new columns.
To select a column, click on the header cell containing the column number. Activate the first
column of the table, right mouse click and select Insert - Category Variable or use the menu
options and select Edit - Insert - Category variable.
Highlight column to activate insert options
In the dialog box, enter the category variable name “Harvest Time”. Keep the default option
Select the level manually selected.
Enter the level names: “H1”, “H2” and “H3” followed by a click on Add.
Click OK.
In the new column, double click in each cell and select the appropriate value for each sample
as given in the sample names.
Note: Category variable cells are orange in the editor to distinguish them from
ordinary variables.
Add a second column in the same way, after highlighting the first column: Edit - Insert -
Category Variable. In the dialog box, enter the category variable name “Cultivar”.
Keep the default option Select the level manually selected.
Enter the level names: “C1”, “C2”, “C3”, and “C4” followed by a click on Add.
Click OK.
In the new column, double click in each cell and select the appropriate value for each sample
as given in the sample names. Alternatively, select all cells of each cultivar in sequence and
fill in the category level using the right-click Fill function.
The Tutorial_b data table displayed in the Editor (after insertion of Cultivar and Harvest
Time)
Task
Check that the three column (Variable) Sets: “Instrumental”, “Sensory” and “Preference”
have been defined.
Verify the existence of two sample sets “Calibration Samples” and “Prediction Samples”.
These sets can be visualized in the project navigator.
How to do it
To create column and row ranges, select Edit - Define Range to open the Define Range
dialog.
Three sets have been predefined in the project Tutorial_B data set.
Column name: Instrumental
Interval: 3-8
Column name: Preference
Interval: 14
Column name: Sensory
Interval: 9-13, 15-21
To verify these definitions use the Edit - Define range and inspect the information in this
dialog.
The Define range dialog with three column sets
Additional row sets will be added for the various levels of the category variables harvest
time and cultivar.
How to do it
Begin by selecting the column “Cultivar” in the data editor, and select Edit- Group Rows…,
which will open the Create row ranges from column dialog.
Edit- Group rows…
The column that was selected, “Cultivar”, is already in the Cols field.
There is no need to specify the Number of Groups as it is based on a category variable.
Create row ranges from column
Click OK.
Automatically 4 row ranges have been added. Look in the Row folder to see them:
New row ranges
Maximum components: 6
Check the Identify outliers and Mean center data boxes, if these check boxes are not
already selected.
Principal Component Analysis dialog: Model inputs
Weights
From the Weights tab verify that the weights are all 1.0 (constant).
No weighting is used in this model as the sensory panel is known to be well trained.
However, sensory variables are often weighted when there is evidence that the
panel is not well trained, or when investigating relationships with other variables.
The most common weighting to use is 1/SDev.
Weights tab dialog
Validation
From the Validation tab select the option Cross Validation and press Setup which
opens the Cross Validation Setup dialog. Here select Full from the drop-down list for
cross validation method.
Validation Dialog
This validation method is more time consuming than other options, but the estimate of the
residual variance is more reliable.
Click OK to start the PCA. After the PCA analysis is completed, the program will ask,
“Do you want to view plots of model PCA now?”. Click Yes to see the PCA Overview plots. A
new node has been added to the project navigator containing all the PCA result matrices and
plots.
The scores plot is a map of the samples, and shows how they are distributed. It can be used
to isolate samples that are similar, or dissimilar to one another. In this analysis, the plot
labels show that PC-1 explains 58% and PC-2 28% of the total variance in the data. The
explained variance curve (in the lower right corner) is an excellent tool for selecting the
optimal number of components in the model.
The explained variance increases until PC 5 is reached. The software does suggest the
optimal number of PCs for a model, but it is up to the user to analyze the data and confirm
the optimal number of PCs in this model, usually based on this plot.
The highest explained variance is found with 5 PCs, but a model using 3 PCs explains a
similar amount of variation. A simple (parsimonious) model is usually
more robust than a complex one, and easier to interpret. It is always suggested to work with
a model consisting of as few PCs as possible. The info box in the lower left corner of the
main workspace indicates that 3 PCs are considered optimal for this model.
Info Box
Task
Change the explained variance plot to a residual variance plot.
How to do it
Activate the lower right plot by clicking in it. Toggle between the Explained / Residual
views using the toolbar shortcuts.
The explained variance is now converted to residual variance. The information is the same,
but presented in another way. The residual variance is well suited to finding the optimal
number of PCs to use in a model, while the explained variance is a better measure for
explaining how much of the variation is described by the model. The plot layout can be
changed to a bar chart by using the plot layout shortcut .
The PCA Explained Variance Bar plot
The model with 3 PCs describes 92% of the total validation variance in the data; for
calibration it is 96%. These values may be obtained by clicking on the specific data point in
the plot.
Use the toolbar buttons to change between having only the calibration or validation
variance curve plotted, or both.
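The explained calibration variance for a given number of PCs can be sketched with a small numpy computation (mean centering as in the model setup; an illustration, not the software's exact algorithm):

```python
import numpy as np

def explained_variance(X, n_components):
    """Cumulative % of total (calibration) variance captured by the first PCs."""
    Xc = X - X.mean(axis=0)                       # mean center the data
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values
    var = s ** 2                                  # variance per component
    return 100.0 * var[:n_components].sum() / var.sum()
```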
The scores plot for this analysis indicates that the 12 samples are not arranged in a random
way. By moving from left to right along this plot, a pattern can be observed where samples
harvested at time H1 are mainly found on the left. These then change to H2 and finally H3.
Moreover, moving from the top to the bottom, C4 samples occupy the top region, followed
by C3, then C2, and finally C1.
The row sets based on the category variables that were inserted into the data table can be
used to better visualize these trends.
In the scores plot, right mouse click and select Sample Grouping to open the dialog where
different row sets can be used for grouping and color-coding the plot.
Select all the cultivar row sets (C1, C2, C3, C4) individually and use the arrow to add them to
Marker settings for grouping purposes.
Tick or untick the box Use group name as label to either have the real name or the level of
each sample as a point label.
The marker color, shape and size can be customized here for optimized viewing of the data.
Sample Grouping Dialog
When the desired settings have been defined, click OK to complete the operation.
In the scores plot, right mouse click to select Properties, where customization of the plot
appearance is possible. Select header and change the plot heading to “Scores plot with
Cultivar Grouping”. Choose a different font size or color if so desired. Click Apply to preview
and OK to apply and exit the dialog.
Properties Dialog
Repeat the above sample grouping process, this time using the category variable Harvest
Time.
The plot shows that two variables (redness and colour) have an extreme position to the right
of the plot along PC1. They are close to each other (i.e. they are highly positively correlated),
and far from the center and are very close to the edge of the 100% explained variance
ellipse. This also means that samples lying to the right of the scores plot have higher values
for those two variables.
Along the vertical axis (PC2), two variables can be observed, with high positive values for this
PC. These are R.SMELL and R.FLAV. These two variables are opposite to the variable OFF
FLAV which has lower values for this PC. This indicates that raspberry smell and flavor
correlate positively with each other, and negatively with off-flavor.
In this new plot, the horizontal axis is unchanged (PC1) and the vertical axis now shows PC3.
All of the results for the PCA are now part of the project Tutorial_B. Save the project to
capture the PCA results. The next steps in this tutorial will make use of the sensory,
instrumental and preference data.
Close the PCA overview by selecting its name in the navigation bar at the bottom of the
viewer and right clicking to select Close.
Objective 2: Explore the relationships between instrumental/chemical data (X) and
sensory data (Y)
Is it possible to predict the quality variations observed in the jam data by using
instrumental measurements only? Training and employing a sensory panel is costly and time
consuming. Producers of jam would find it most convenient if they could predict quality
variations by measuring some properties by instrumental means. The next task in this
tutorial is to make a regression model between the sensory and instrumental data and
analyze the results for a possible solution.
Responses
Maximum components: 6
X and Y weights tabs
Select the X and Y Weights tabs to access their dialogs. Weighting will be applied to
all the X and Y variables for regression purposes.
X Weights Dialog
Press All to change the weighting of all variables at the same time. Variables can also
be selected by clicking on them in the list. Remember to hold the Ctrl key down
while selecting several variables. Choose the A / (SDev + B) radio button. Use
constants A = 1 and B = 0. Press Update and ensure that the weights change in the
list.
All variables are weighted by dividing them with their own standard deviations. This
allows all variables to contribute to the model, regardless of whether they have a
small or large standard deviation from the outset; only the systematic variation is of
interest here.
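The effect of this weighting can be sketched in a few lines of Python (an illustration only; the variable values below are made up, and the software performs this step internally):

```python
import statistics

def weight_variable(values, a=1.0, b=0.0):
    """Weight a variable by A / (SDev + B), as in the X Weights dialog.

    With A = 1 and B = 0 this is plain 1/SDev standardization.
    Illustrative sketch only, not The Unscrambler's internal code.
    """
    sdev = statistics.stdev(values)   # sample standard deviation
    w = a / (sdev + b)                # the weight applied to every value
    return [v * w for v in values]

# After weighting, a variable with a large spread contributes no more
# than one with a small spread: both end up with unit standard deviation.
raw = [2.0, 4.0, 6.0, 8.0]
weighted = weight_variable(raw)
print(round(statistics.stdev(weighted), 6))
```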
Now go to the Y Weights tab and do the same. Do not click OK, but after the
Update, go to the Validation tab.
Validation tab
Select Cross validation from the Validation tab.
Press the Setup button to access the Cross Validation Setup dialog and choose Full
from the drop-down list. It is always recommended to use test set or cross validation
to develop final models.
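Full cross validation leaves out one sample at a time and refits the model on the remaining samples. The resampling mechanics can be sketched as follows, with a trivial "predict the training mean" model standing in for PLS (the values are made up):

```python
# Minimal sketch of full (leave-one-out) cross validation. The Unscrambler
# refits a PLS model for each left-out sample; here a mean predictor is
# used so the example stays self-contained.
def loo_cross_validation(y):
    residuals = []
    for i in range(len(y)):
        train = y[:i] + y[i + 1:]             # leave sample i out
        prediction = sum(train) / len(train)  # refit the (trivial) model
        residuals.append(y[i] - prediction)   # validation residual for sample i
    return residuals

resid = loo_cross_validation([1.0, 2.0, 3.0, 6.0])
print([round(r, 2) for r in resid])
```

Each sample is thus predicted by a model that never saw it, which is what makes the validation variance an honest estimate.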
Click OK in the regression dialog when all parameters have been set up. The computation of
the model will begin. After PLS analysis is completed, the system will ask “Do you want to
view the plots of model PLS now?”.
Click Yes to see the PLS Overview plots. A new node, PLS, has been added to the project
navigator.
PLS Regression Overview
This overview provides the most useful and common predefined result plots for PLS,
including loading weights and residuals, etc. The model can always be reviewed during the
analysis stage by selecting any of the result plots under the PLS - Plots node in the project
navigator. For this exercise, various Y response values were used for model development.
Therefore the overview results for each of these responses are available by choosing the Y
value of interest in the tool bar. When performing this type of analysis with multiple
responses the non-significant variables may be determined for each of the responses. It can
also provide information on which sensory responses can best be predicted from the
instrumental measurements without making a separate PLS model for each response. When
the Predicted vs. reference plot (lower right quadrant) is active, the name of the Y
value being analyzed appears in the toolbar. Another Y-response
can be chosen from the drop-down menu, or one can scroll through the values using the
arrow tool on the right.
Interpret the explained variance curve, which can be shown as residual variance, or as
explained variance. The two different views are useful for different tasks.
How to do it
The Y-explained variance plot is in the lower left quadrant. This plot can be changed to the
residual variance plot by using the toolbar shortcuts, and to the X-explained variance by
clicking on the X button.
A local maximum is achieved for five PLS factors. The next task is to determine why the
validation curve does not follow the general trend. This can be done by looking at the
explained variance for the variables individually.
Y-explained variance plot
From the plot menu select Variances and RMSEP - X- and Y-Variance… Make sure the
bottom plot shows the Explained Variance for the 12 individual Y variables. If not, change it
by using the toolbar shortcut. Also do not select Total, but select Cal from the toolbar
shortcuts.
Add a legend to the plot by right clicking and selecting Properties. Select legend, and check
the box visible to add the legend to the plot.
PLS, Explained Validation Variance Plot displayed for the 12 individual Y-variables
The conclusion reached from the residual variance curve was that two PLS factors were
optimal. The variables that are well described are reflected in the information conveyed by
these factors.
About 85% of the color variation (variables 1 and 2), and 80% of the variation in sweetness
(variable 6) can be explained by a combination of the chemical and instrumental variables.
Note that only 23% of the total Y-variance is explained by the model using two factors.
Use the drop-down list in the toolbar to observe the prediction quality for other variables
measured in this analysis. Make sure these plots are displayed for two PLS factors, as this is
the correct number for this model. Note that for several of the properties, including
raspberry flavor, raspberry smell, and off-flavor, the instrumental values do not provide any
real information. This analysis shows that the chosen instrumental measurements are not a
good substitute for the sensory analysis of these jams.
Responses
Maximum components: 6
PLS Regression Dialog
Weights in X and Y
It is necessary to standardize all variables with the option 1/SDev.
Select the X Weights tab and weight all the X variables with 1/SDev so that each
variable will contribute equally in the modeling step. Also weight the Preference
values (Y) by 1/SDev in the Y Weights tab.
Validation
Full Cross Validation
Press Setup to access the Cross Validation Setup dialog and choose Full cross
validation as the cross validation method.
Press OK.
Activate the explained variance plot in the lower left quadrant, and change it to the residual
Y variance plot by using the toolbar shortcuts. The prediction error tapers off
significantly after two PLS factors. This represents the optimal model conditions.
Residual Y Validation Variance Plot
Turn on the regression line and the target line with the toolbar shortcuts.
Predicted vs. reference Plot with Trend Lines
It can be observed that the predictions are of good quality. Some samples are not so well
predicted, but the overall correlation is satisfactory.
Redness, Color and Sweetness (B1, B2 and B6) are significant in predicting Preference.
Raspberry Smell (B4) is also significant, but contributing negatively to the Preference.
Thickness (B11) seems to be of importance also as it has a large (negative) coefficient.
Save the project file with the name “Tutorial_B”. It may also be saved as the model file
itself, providing a smaller file with just the model information that can be used for predicting
new samples in real time using The Unscrambler® Prediction Engine and The Unscrambler® X
Process Pulse products. To save the model only, right click on the model node in the project
navigator and select the option Save Model. In the dialog choose what size model to save.
Models other than the full model do not include all the results matrices, and therefore
provide fewer results in addition to the predicted values when used.
Save Model
Interpret the prediction results to see whether the predictions can be trusted.
How to do it
Activate the “JAMdemo” data matrix. Select Tasks - Predict - Regression… and specify the
following parameters in the Prediction dialog:
Check the boxes for Inlier statistics and Sample Inlier dist (Mahalanobis distance) to provide
valuable statistical measures of the similarity of the prediction samples to the calibration
samples.
Click OK to perform the prediction.
The Prediction dialog
The predicted preferences for the “unknown” new jams have some uncertainty limits, i.e. the
accuracy of new predictions is limited. However, this model can be used to predict the
preference of new jam samples, providing an indication of which ones will or will not be
accepted by consumers.
View the Inlier vs. Hotelling’s T² plot by selecting Plot – Inlier/Hotelling’s T² - Inlier vs
Hotelling’s T². This plot shows how similar the new samples are to those used in developing
the calibration model. For a prediction to be trusted the predicted sample must not be too
far from a calibration sample. This is checked by the Inlier distance. The projection of the
new sample onto the model also should not be too far from the center. This may be checked
using the Hotelling’s T² distance.
Save the project file under the name “Tutorial B_complete”. This now includes all the data,
three models, and the predicted results for preference.
To gain a better approximation of what to expect in future predictions, the RMSECV should
be analyzed.
The RMSECV may be studied for Preference for all PLS factors. RMSECV (using two factors) is
0.83. This means that any predicted new sample on the scale from 1 to 9 will have a
prediction error around 0.8. This is an acceptable error level in sensory analysis, which has
much uncertainty in all measurements.
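The RMSECV (and RMSEP) arithmetic is simply the square root of the mean squared residual between cross-validated predictions and reference values. A sketch with made-up values on the 1-9 preference scale:

```python
import math

def rmse(predicted, reference):
    """Root mean square error between predicted and reference values."""
    residuals = [p - r for p, r in zip(predicted, reference)]
    return math.sqrt(sum(e * e for e in residuals) / len(residuals))

# Hypothetical cross-validated predictions (not the tutorial data):
# an RMSECV near 0.8 means a typical prediction misses by about 0.8 units.
y_ref = [3.0, 5.0, 7.0, 6.0]
y_pred = [3.5, 4.2, 7.9, 5.6]
print(round(rmse(y_pred, y_ref), 3))
```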
Verify that the correct number of factors has been chosen for the selected model. The
optimal number of components should be used for the export. Therefore, change the
number of factors to 2 before clicking OK.
Two types of model export are available:
Full
Short prediction: corresponding to export of only the regression coefficients
Observe the ASCII file that is generated; it has the file name extension .AMO. The format of
the file is described in the ASCII-MOD Technical Reference.
Similarly, any of the result or validation matrices can be selected for export into other
formats. Supported export formats are:
ASCII
JCAMP-DX
Matlab
NetCDF
ASCII-MOD
Full ASCII-MOD export includes all results that are necessary to perform outlier detection,
etc. This format can be used for applying models outside The Unscrambler® environment,
for example in a custom written program script. The ASCII-MOD file is readable by any text
editor, such as Notepad.
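As a sketch of such a custom script: applying a “short prediction” export outside The Unscrambler amounts to an intercept plus a dot product of the exported regression coefficients with the new spectrum. The coefficient and spectrum values below are made up, and parsing of the actual .AMO file is omitted:

```python
# Applying exported regression coefficients to a new sample: the predicted
# value is the intercept b0 plus the dot product of the coefficients with
# the new measurement vector. All numbers are hypothetical; a real .AMO
# file would need to be parsed first.
def predict(b0, coeffs, x):
    return b0 + sum(b * xi for b, xi in zip(coeffs, x))

b0 = 0.25
coeffs = [0.0, 1.2, -0.4]   # hypothetical regression coefficients
spectrum = [0.8, 0.5, 0.3]  # hypothetical new absorbance values
print(round(predict(b0, coeffs, spectrum), 3))
```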
Description
What you will learn
Data table
Get to know the data
Read data file and define sets
Plot raw data
Univariate regression
Calibration
Interpretation of the calibration model
Study the predicted vs. reference plot
Study the explained variance plot
Multiplicative Scatter Correction (MSC)
Check the error in original units: RMSE
Predict new MSCorrected samples
Guidelines for calibration of spectroscopic data
Description
There is a need for an easy way to determine the concentration of dye (a bright, red-colored
heme protein, Cytochrome-C) in water solutions. Dye absorbs light in the visible range, and
the concentration determination will be based on this light absorbance.
In the solutions to be analyzed there are varying, unknown amounts of milk, which absorbs
some light in the same wavelength range as dye and therefore causes chemical interference
in the measurements. In addition, milk contains particles that give serious light scattering.
Another effect that will influence the absorbance spectra is the varying sample path length.
The light absorbance spectrum figure shows the light absorbance spectrum of one sample of
the dye/milk/water solution.
Absorbance Spectrum
The vertical lines represent the 16 different wavelength channels selected as predicting
variables for this sample set.
This example is constructed so that it can be duplicated in a lab. It illustrates the interference
effects and other effects that make spectroscopy challenging. However, similar problems
occur in many industrial applications, e.g. measuring the concentration of different
chemical species in sewer water, which contains many other chemical agents as well as
physical interferences like slurries and particles, or measuring moisture and solvents in a
granulation process.
The two major peaks (variables Xvar4 and Xvar6) represent the absorbance of dye, while the
first peak (Xvar2) represents absorbance due to an absorbing component in the milk. The
broad peak to the right (Xvar12, Xvar13, Xvar14) is due to light absorption by water itself.
PLS regression
Handling of interference problems, Multiplicative Scatter Correction (MSC)
Check list for calibration of spectroscopic data
Data table
Click the following link to import the Tutorial C data set used in this tutorial. This is best
done in a new project (File-New).
The data matrix, Tutorial_C, is imported into the project. It consists of 28 samples (samples of
solutions) that span the two most important types of variation: the dye and milk
concentrations. The composition of dye/milk/water in each calibration sample is shown. The
values are given in ml making a total of 20 ml in each solution (sample).
Sample Dye Milk Water Sample Dye Milk Water
In the project navigator, expand the tree under the data matrix Tutorial_C to see the file
content. An Editor with the data table is launched in the viewer.
Project navigator view of data
One can see that some sets have already been defined, but one additional column set
named Statistical will be defined.
The data table already has the following: Column (Variable) Ranges:
Put the cursor in the data viewer. Now one can define a new column set (variable range) by
going to Edit - Define Range… which will open the Define Range dialog. Define the column
set by putting the name “Statistical” in the Range - Column space, and for interval, enter
3-19 for columns as shown below.
Define Range Dialog
Click OK when finished defining the column and row sets. Use File-Save As… to save the
project with the updated name “Tutorial_C_updated” in a convenient location before
continuing. The organized data will now have numerous nodes for column and sample sets
in the project navigator, and a color-coded data matrix.
Change the data type of the column range “Absorbance” into spectral data. To do so, select
the range “Absorbance” and right click. Select the option Spectra.
This will change the display of some plots, which are used differently with spectral data
than with other types of variables.
Spectra
Plot some calibration samples in order to see how the spectra vary with varying amounts of
dye and milk.
How to do it
Make a line plot of samples that have the same amount of milk, 10 ml. The line plot is just of
the X-variables for these samples, so in the data table editor, select the four samples having
10 ml of milk by marking the samples in the Editor (samples 6, 14, 19, and 23) by clicking the
sample numbers while holding down the Ctrl key. Then right click and select Plot - Line.
Line plot dialog
In the Line Plot dialog that appears, select the column set Absorbance from the drop-down
list. Click OK and note that the four samples are highlighted in the Editor.
The same could be done by selecting the menu option Plot - Line… after having selected the
samples in the viewer, and specifying use the Column set Absorbance in the Line Plot dialog.
Line Plot of sample with 10 ml milk
Use the shortcut keys to change the layout of the plot to a bar chart.
These four samples have the same milk level and the line plot shows that the dye level has
influence on the absorbance of variables number 2 - 8 only.
Plot samples 20, 21, 22, and 23 the same way, using the Ctrl key to select just these
specific rows. These samples have the same dye level: 6 ml.
The plot shows that increasing milk level will increase the absorbance of light of all
wavelengths from number 1 to number 16. There seems to be a great deal of interference or
scattering to deal with, over the whole spectrum. This indicates that some transformations
of the data may be useful to get an optimal model.
Univariate regression
Is it possible to predict the dye level from the absorbance of one single wavelength? Before
we enter the multivariate world we want to see what can be done by univariate regression.
Task
Find the best wavelength on which to make a univariate regression model.
How to do it
You find the best wavelength by looking at the correlation between each absorbance
variable and the dye level variable. Select the data set Statistical from the project navigator.
Select Tasks - Analyze - Descriptive Statistics… and specify the following parameters in the
Descriptive Statistics dialog.
When the computation is done, there will be a prompt asking if you want to view the plots.
Click Yes, and the two plots summarizing the statistics will be displayed. You will find a new
node, Descriptive statistics, in the project navigator, which consists of three folders: raw
data, results, and plots.
In the project navigator, expand the folder results. Select the Variable Correlation matrix
from this folder to view this in the viewer. We will use these data to find the highest
correlation between Dye Level and some X-variable. You may select the first row, dye level,
and plot it (Plot - Bar) to see the highest correlation (after the correlation between Dye level
and Dye level, which of course is 1).
Bar chart of variable correlation
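The correlation coefficient behind this screening step can also be computed directly. A minimal sketch with made-up dye levels and absorbances (not the tutorial data): the X-variable with the larger absolute correlation is the better univariate predictor.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical dye levels (ml) and absorbances at two wavelengths.
dye = [2.0, 4.0, 6.0, 8.0]
xvar_a = [0.11, 0.19, 0.32, 0.38]   # tracks the dye level closely
xvar_b = [0.40, 0.10, 0.35, 0.20]   # mostly unrelated variation
print(round(pearson(dye, xvar_a), 3), round(pearson(dye, xvar_b), 3))
```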
The variable with the highest correlation coefficient to Dye Level is Xvar6 with a correlation
coefficient of 0.49. You can close the bar plot of the correlation matrix by selecting the tab in
the navigation bar at the bottom of the viewer and right clicking to select close.
Now we should illustrate the regression in a plot. To get the right plot go back to the original
data set, Tutorial_C, and select the columns Xvar6 and Dye level using the Ctrl key and Plot -
Scatter. In the Scatter Plot dialog, remember to select only the calibration samples from the
row drop-down list.
Another way to do this is to go to Plot - Scatter and in the Scatter plot dialog click on the define
button next to Cols., which will open the Define Range dialog. Here you can select the
columns Dye level and Xvar6, or type in columns 3, 9 in the Interval box. Select the
calibration samples for the rows.
Scatter plot dialog showing define option
Turn on the Regression Line and Target Line with the shortcut buttons. We can
also add the plot statistics from the toolbar shortcut. From the plot we see our results
are not very good using just one variable to model the dye level. Hopefully we can do better
with multivariate regression models.
Scatter plot of Xvar6 vs. Dye level with target and regression lines
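The regression line itself comes from ordinary least squares on a single variable. A sketch with made-up values (not the tutorial data), fitting y = b0 + b1*x:

```python
# Univariate least-squares regression of dye level on one wavelength.
def fit_line(x, y):
    """Return intercept b0 and slope b1 of the least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    return b0, b1

absorbance = [0.1, 0.2, 0.3, 0.4]   # hypothetical Xvar6 values
dye = [1.0, 3.2, 4.8, 7.0]          # hypothetical dye levels (ml)
b0, b1 = fit_line(absorbance, dye)
print(round(b0, 3), round(b1, 3))
```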
Calibration
We choose to make a PLS regression model because PLS takes the variation in Y into
consideration when the model is calibrated.
Task
Make a PLS regression model between the variable set Absorbance (X) and the response Dye
Level(Y).
How to do it
Activate the Tutorial_C data Editor from project navigator and select Tasks - Analyze -
Partial Least Squares Regression…. In the PLS dialog, specify the following parameters:
Go to the Validation tab to select the option cross validation. You can further define the
settings for this by clicking Setup…, it opens the Cross validation setup dialog. Select
Random as the cross validation method and set the number of segments to “7”.
Cross validation setup dialog
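Random segmented cross validation simply partitions the samples into the requested number of segments, each of which is left out once while the model is refit on the rest. A sketch for 28 samples and 7 segments:

```python
import random

def random_segments(n_samples, n_segments, seed=0):
    """Split sample indices into random cross-validation segments.

    Illustrative sketch of the resampling scheme; The Unscrambler does
    this internally when Random cross validation is selected.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # deterministic shuffle
    return [indices[i::n_segments] for i in range(n_segments)]

segments = random_segments(28, 7)
print([len(s) for s in segments])
```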
Start the calibration by clicking OK. When the computation is complete you will be asked if
you want to view the PLS plots now. Click Yes, and the regression overview plots will be
displayed.
A new node, PLS, has been added to the project navigator. This has four folders with the raw
data, results, validation, and plots for the PLS model. Rename the PLS node in the project
navigator for this analysis to “PLS Tutorial C” before you continue. You can do this by right
clicking the latest PLS model in the project navigator and selecting Rename.
Scores plot
The plot in the upper left quadrant is the Scores plot. From the scores plot we can interpret
that the combination of two main factors, factor 1 and factor 2, reflects the variations in the
milk and water levels. The first two factors show that 99% of the X-variance (factor 1: 84%,
factor 2: 15%) explains 75% of the variance in the response dye level (factor 1: 19%, factor
2: 56%). By studying the samples in the plot we
can see that the milk level increases from upper left to lower right in the plot, while the
water level increases from right to left.
Regression coefficients
The regression coefficients plot summarizes the relationship between all predictors and a
given response. It is easiest to access this plot by selecting it from the plots folder in the
project navigator.
Plots folder in project navigator
It is also possible to see this plot when any PLS plot is active in the viewer by going to Plot
- Regression Coefficients - Raw coefficients (B) - …, or by right clicking and selecting
PLS - Regression Coefficients - Raw coefficients (B) -…. Select the line plot of the raw
regression coefficients. Since we did not apply any weighting to the data, the plots of
weighted and raw regression coefficients will be identical.
The regression coefficients plot indicates that the wavelength numbers (X-variables) 4 and 6
are the most important for the prediction of Y (concentration) in the first factor. The pattern
is clearer here than in the loadings plot.
Regression coefficients plot
Compare the regression coefficients plot to the raw absorbance data. Note that high
values, indicating important variables, are present in the region where we know that milk
and dye absorb light.
Click OK to calculate the statistics, and select Yes to view the plots now. Since descriptive
statistics were already run before, using 17 of the variables rather than just the absorbance
values, the current results appear as a new node, Descriptive Statistics(1), in the project
navigator. We are not interested in the default plots that are shown, but want a plot that
helps us to understand the scatter in the data. Make the plot window active by clicking in it,
and select the menu option Plot - Scatter effects. In this plot of the mean value of each
X-variable we see that the scatter is not the same for all variables. The first 8 variables lie
approximately on a straight line. For the other variables, one can observe a spread in the
scatter effects.
Scatter effects plot
Select the data matrix Tutorial_C. Select Tasks - Transform - MSC/E… Specify the following
parameters in the Multiplicative Scatter Correction dialog:
Prediction samples are not used to find the correction factors that will now be found and
used in the MSC.
Variables 1-8 are omitted because the light absorption at these wavelengths varies with the
dye level, while wavelengths 9 to 16 (the water absorption peak) are independent of the
concentration of dye. The difference at these wavelengths is instead caused by the general
light scatter due to milk addition. It is important that only wavelengths with no chemical
information are used to find the correction factors.
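The MSC arithmetic can be sketched as follows: each spectrum is regressed against the mean spectrum over the chemically inert wavelengths only, and the fitted offset a and slope b are then used to correct the whole spectrum as (x - a) / b. The six-point spectra below are made up, with the second a scatter-distorted copy of the first:

```python
# Sketch of Multiplicative Scatter Correction. Only the "inert" points
# (here the last three, mimicking wavelengths 9-16) are used to fit the
# offset a and slope b of each spectrum against the mean spectrum.
def msc(spectra, inert):
    n = len(spectra)
    mean = [sum(s[j] for s in spectra) / n for j in range(len(spectra[0]))]
    corrected = []
    for s in spectra:
        x = [mean[j] for j in inert]
        y = [s[j] for j in inert]
        mx, my = sum(x) / len(x), sum(y) / len(y)
        b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
        a = my - b * mx
        corrected.append([(v - a) / b for v in s])   # remove offset and scaling
    return corrected

s1 = [0.1, 0.4, 0.2, 0.3, 0.5, 0.6]
s2 = [2 * v + 0.1 for v in s1]        # same chemistry, distorted by scatter
out = msc([s1, s2], inert=[3, 4, 5])
# The two distorted copies collapse onto each other after correction.
print(all(abs(p - q) < 1e-9 for p, q in zip(out[0], out[1])))
```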
The transformed data are now displayed in the project navigator with the name
“Tutorial_C_MSC”. There is also a node with the MSC model for transformation, which can
be applied to future samples. This is called “MSC_Tutorial_C”, and has a folder with the
model under it.
Look at the corrected data by selecting the data from the new project navigator node, and
going to Plot - Line. Select the new sample matrix with the corrected data in the Line Plot
dialog, row set calibration, and column set Absorbance.
Line plot of MSC transformed data.
We want to compare the corrected data with the original data. Select the raw data matrix in
the project navigator (Tutorial_C) and make a line plot of the calibration samples for the
absorbance values. You see that the MSCorrected data are different from the original. The
interference and light scatter effects have successfully been corrected for. You can display
the plots on the same screen by going to the navigation bar at the bottom of the screen and
right clicking to select Pop out to give an undocked plot of the MSC Corrected data that can
be moved around as you wish.
Pop out menu
You can then choose the line plot of the uncorrected data from the navigation bar, making it
active in the Viewer, and move the other window to the same view for easier comparison.
Line plots of the MSC corrected and the original data
Another way to get a view of both plots together is to go to Insert-Custom Layout - Two
Horizontal… and select the two samples matrices, selecting the calibration samples for rows,
absorbance for columns, and setting the plots to be line plots in the custom layout dialog.
You can also give a title for each plot as shown below.
Custom layout dialog
View the same plot for the model PLS MSCorrected by going to the PLS Overview plot of the
MSC corrected data (which should still be an open tab in the navigator bar at the bottom of
the viewer). Highlight the lower left quadrant, the explained variance plot, and change the
view to the residual Y variance plot by using the toolbar shortcuts,
selecting Y and Res, for just the validation samples.
Y Residual validation variance: MSC Corrected data
The plot shows the validated residual Y-variance for the two models. From these plots
we find that the minimum squared error is lower for the MSC corrected model with two
factors (1.87). So although the recommended optimal number of factors is four, even with
two factors we can model the system well (more of the Y-variance is explained by two
factors than when using the raw data; see the scores plot). The system can be modeled well
with the MSC corrected data, whereas with the raw data a much higher error is obtained,
and less of the Y-variance is explained with two factors. This shows that MSC has removed
the interfering amplification effect in these data.
The model Tutorial C MSC corrected with four factors gives the lowest estimate for the
residual Y-variance, so predictions made by this model using four factors will give the
predicted values with the lowest prediction error. We could also model this system well
enough with two factors, but as we do not have information here on the error of the
reference method for measuring the dye level, we will follow the model’s suggestion of
four factors.
Check the error in original units: RMSE
The numerical residual variance values we used in order to find the best model and decide
the optimal number of factors in the model are not related directly to the predictions. We
cannot use the residual variance to tell how large we can expect the deviations in future
predictions. We have to use the RMSEP for that purpose.
Task
Let us see how large an error in ml dye we can expect in future predictions: RMSEP.
How to do it
Activate the regression overview plot for the model PLS-MSCorrected. Select Plot - Variance
and RMSEP - RMSE
Deselect the calibration samples box and select the validation samples (RMSEP) instead from
the shortcut keys.
You see that the shape of the curve is exactly that of the residual variance, but the values
have changed. The plot says that predictions done with this model and using four factors will
have an average prediction error of 0.9.
RMSE: MSC Corrected data
Predict new MSCorrected samples
The model with MSC is the one we will use for the prediction of new samples.
Run a prediction with automatic pretreatment
The prediction samples will be transformed automatically with the same MSC model as the
calibration samples. This requires that the variables selected for the data matrix include
the same number of variables as are associated with the MSC model. This must be set
correctly in the Prediction dialog.
Task
Predict the dye level of the unknown samples.
How to do it
Select Tasks - Predict- Regression…. Specify the following parameters in the Prediction
dialog:
Model name: “PLS MSCorrected”
Number of Components: “4”
Click View after the prediction is done. The prediction overview plot appears where the
predicted values are shown together with the deviations. A new node, Predict, has been
added to the project navigator. This has folders for raw data, validation, and plots. The
prediction overview shows a plot of the predicted values with their estimated uncertainties,
and also a table of the values with these deviations.
Predicted values with deviation
Large deviations indicate that the predictions cannot be trusted. For a prediction to be
trusted, the predicted sample must not be too far from a calibration sample; this is checked
by the Inlier distance. Its projection onto the model should also not be too far from the
center; this is checked with the Hotelling’s T² distance.
Study the Inlier vs. Hotelling’s T² plot available from a right click on the plot and then
Prediction - Inlier/Hotelling’s T² - Inliers vs. Hotelling’s T²
Inliers vs. Hotelling’s T²
In this case all the samples are found below the Inlier distance limit, showing that these
samples are similar to those used in making the model. One sample is outside the Hotelling’s
T² limit line (with 95% confidence) and is therefore an outlier. The prediction for this outlier
cannot be trusted.
Guidelines for calibration of spectroscopic data
Now that you have learned the basics of calibration, let us suggest steps and useful functions
for the development of calibration models.
See the guidelines for spectroscopic calibrations
Description
The global objective of this study is to develop a new processed cheese. The study is in a
screening stage to study the main effects and detect whether there are any interactions. The
experiments have been performed, and the responses have been measured. The response
values have been gathered into an Excel worksheet; they should now be imported into The
Unscrambler® as a response data table. The first step is to create the design and then import
the response variables. The next step, after importing the response values, will be to get
acquainted with the data and perform first checks such as descriptive statistics. Then a
proper analysis of effects will be run.
Data table
From a brainstorming session with the different experts in cheese production, six
continuous process and recipe parameters have been selected for a screening design.
Variable Low High
B: pH 5.7 6.1
Glossiness,
Ability to retain shape,
Adhesiveness,
Firmness,
Graininess,
Stickiness,
Meltability,
Condensed milk taste.
In the Design Experiment Wizard, on the first tab, Start, type a name for the table, for
example “Cheese”. Select the Goal, which for now is Screening. It is possible to type
information in the Information section.
Start tab filled
B pH Design None Continuous 5.7 - 6.1
After all design variables have been defined, go to the next tab Choose the Design, to select
the appropriate design.
By default, in the Beginner mode, the selected design is “Screening of many design
variables” which refers to a Fractional factorial design as can be seen in the box below the
Design section.
This design corresponds to the goal of the experimentation so no change is needed.
The Design Wizard - Choose the design tab
The 2-variable interactions are confounded two by two. This is going to limit the study and
the conclusions, but in a screening stage this is acceptable.
The Design Wizard - Design Details tab
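The confounding pattern can be worked out by hand or with a short script. The sketch below assumes the generators E = ABC and F = BCD for a 2^(6-2) design; the actual generators of the chosen design may differ, and with them the alias pairs:

```python
# Sketch of two-factor-interaction aliasing in a 2^(6-2) fractional
# factorial with assumed generators E = ABC and F = BCD. The defining
# contrast subgroup (besides I) is then ABCE, BCDF, and their product ADEF.
from itertools import combinations

words = [frozenset("ABCE"), frozenset("BCDF"), frozenset("ADEF")]

def aliases(effect):
    """Two-factor interactions confounded with the given effect."""
    eff = frozenset(effect)
    # Multiplying two effects corresponds to the symmetric difference
    # of their factor sets; keep only the two-factor results.
    return sorted("".join(sorted(eff ^ w)) for w in words if len(eff ^ w) == 2)

for pair in combinations("ABCDEF", 2):
    name = "".join(pair)
    print(name, "=", " = ".join(aliases(name)))
```

Under these assumed generators most interactions alias in pairs (e.g. AB with CE), which is what limits the conclusions at the screening stage.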
Proceed to the next tab, Randomization. There is no need to make any further specification
in this tab. Try different options just to get familiar with the possibilities.
The Design Wizard - Randomization tab
Delta: the difference to detect. In this example we want to know which type of
difference is likely to be detected.
Std. dev.: estimated standard deviation. In this example the sensory parameters
have a standard deviation of about 0.4.
Go to the final tab: Design Table. Here the data table is presented with several view options.
Check them out to become familiar with the options.
The Design Wizard - Design table tab
How to do it
First, import the response values.
Click on the following link Tutorial D1 responses.
A data table containing all the response variables is now added to the project as an
additional matrix. Note that the data are in standard order.
Copy and paste the response data into the appropriate columns of the matrix CheeseDesign.
Make sure the rows are sorted in experimental order.
Sample Standard order
(1) 1
ae 2
bef 3
abf 4
cef 5
acf 6
bc 7
abce 8
df 9
adef 10
bde 11
abd 12
cde 13
acd 14
bcd 15
abdcef 16
cp01 17
cp02 18
cp03 19
Before the full analysis, we will familiarize ourselves with the data. Go to Tasks - Analyze -
Descriptive Statistics. Choose all the rows and the Responses column set for columns, and then
click OK to compute the statistics. Review the results, and note that some of the responses
(Retain Shape and Stickiness) have some extreme values as noted in the quantiles plot. On
careful investigation, it appears there is an error on the response “stickiness” for one
sample. It should read 2.93, and not 12.93. Correct this value before proceeding with the
analysis.
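The way the quantiles expose such a typo can be sketched as follows. Only the 12.93 comes from the tutorial; the surrounding sample values are hypothetical:

```python
import numpy as np

# Illustrative "Stickiness" values containing the typo from the tutorial
# (12.93 entered instead of 2.93); the other values are made up.
stickiness = np.array([2.10, 2.50, 3.00, 2.80, 12.93, 2.60, 2.90])
quantiles = np.quantile(stickiness, [0.0, 0.25, 0.5, 0.75, 1.0])
print(quantiles[-1])   # the maximum exposes the extreme value
```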
To start the analysis, choose Tasks - Analyze - Analyze Design Matrix….
Model inputs
Predictors
In the Predictors part set the X matrix to be “Cheese_Design”, Rows “All” and the
Cols “Design(6)”.
Responses
For the Responses set the Matrix to be “Cheese_Design”, Rows “All” and the Cols
“Response(8)”.
Model
The Model should include the “Main effects + Interactions (2-var)”.
The list of estimated effects should be “A, B, C, D, E, F, AB, AC, BC, AD, BD, CD, DE”.
Note: Not all the interactions are presented. Remember that there is a confounding
pattern.
In the Method dialog select the Classical DoE analysis and click OK.
Method dialog
When the computations are done, click Yes to study the results. A new node called DOE
Analysis is added into the navigator. Before doing anything else, use File - Save As to save
the project with a name such as “Cheese Project”.
How to do it
The ANOVA Overview plot shows four informative plots:
ANOVA table
Look at the Summary section of the ANOVA table to check the significance of the
models for all the response variables. We say that a model is significant at the 5%
level if the p-value is smaller than 0.05. This is true for response variables
“RetainShape” (0.0136) and “Firmness” (0.0213), while “Meltability” is just over
(0.0524). Always check the validity of the model by assessing the R-square
prediction value. This is an estimate of how well the model will work for new
(currently unknown) data. As the value is negative for “Meltability”, this particular
model cannot be trusted. For “RetainShape” and “Firmness” the values are higher
(around 0.5), which is not necessarily bad, but caution is required.
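The 5% significance check described above amounts to a simple filter on the quoted p-values:

```python
# Flag significant models at the 5% level, using the ANOVA summary
# p-values quoted in the tutorial.
p_values = {"RetainShape": 0.0136, "Firmness": 0.0213, "Meltability": 0.0524}
significant = [name for name, p in p_values.items() if p < 0.05]
print(significant)
```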
For these three responses find out which effects are important by looking at the
Variables section. Again, the significant effects are the ones with a p-value less than
0.05. They are in shades of green. For “RetainShape” the main effects B(pH),
C(DM%), and D(Maturity) and the interaction effect BC=AE are found significant at
the 5% level. For “Firmness” the same effects are found significant except B(pH).
ANOVA table
Note: The interaction effect BC=AE is a possible significant effect. Checking the effect value
or the B-coefficient should help to determine whether it is significant or not.
The effect viewer
Look at the effects for the response “RetainShape” and check for curvature. See whether
the center sample average lies such that the averages at the low and high levels are
linked by a linear relation. If this is the case there is no curvature effect. Use the
arrows on the toolbar to scroll through the effects for the different variables.
Here a curvature effect can be found on all effects.
Effect of Maturity (D) on “RetainShape”
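The curvature check can be sketched numerically: compare the center-point average with the overall factorial average. All values below are hypothetical:

```python
# Curvature check sketch: compare the center-point average with the average
# of the factorial runs; a clear difference suggests curvature.
# All values are hypothetical.
factorial_runs = [5.2, 6.8, 4.9, 7.1]    # low/high level responses
center_points = [7.4, 7.6, 7.5]
curvature = sum(center_points) / len(center_points) - sum(factorial_runs) / len(factorial_runs)
print(round(curvature, 2))  # far from 0 suggests curvature
```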
In addition the study of the interaction effects shows that the interaction effect of
B*C is the most probable, as the effects A and E are not significant.
The diagnostics
Look at the residuals to see if the model fits the samples well. The table is presented
in the experimental (randomized) order, which makes it possible to check for any
deviation with time.
Diagnostics for “RetainShape”
Note that the first center sample has a very high residual. However, center samples
are not taken into account when calculating the effects.
The summary table
See which effect is the most important (size) and the most significant (smallest p-
value) for all variables.
Go through the other plots and check the plot interpretation in the DOE section.
Draw a conclusion from the screening design
The final conclusions of the screening experiments are the following:
Not all sensory variables are affected by the changes in the design. Only three are in
fact affected and “RetainShape” is the variable showing the most interesting
behavior.
Four effects were found likely to be significant for “RetainShape”, one of them
a confounded interaction. Since the main effects of B and C are significant, we can
make an educated guess and assume that the significant interaction is BC (and not
AE, with which it is confounded).
Description
What you will learn
Build an optimization design
Compute the response surface
Run a response surface analysis
Interpret analysis of variance results
Check the residuals
Interpret the response surface plots
Draw a conclusion from the optimization design
Description
This tutorial is built from the enamine synthesis example published by R. Carlson in his book
“Design and Optimization in Organic Synthesis”, Elsevier, 1992.
A standard method for the synthesis of enamine from a ketone gave some problems, and a
modified procedure was investigated. A first series of experiments gave two important
results:
Reaction time can be shortened considerably.
The optimal operational conditions were highly dependent on the structure of the
original ketone.
Thus, a new investigation had to be conducted to study the specific case of the formation of
morpholine enamine from methyl isobutyl ketone. Two factors may have an impact on this
reaction: the relative amounts of the two reagents.
In the Design Experiment Wizard, on the first tab Start, type a name for the table, for
example “Enamine_Opt”. Select the Goal, which for now is Optimization. It is possible to type
information in the Information section.
Do this by clicking the Add button and filling in the Variable editor. Validate by clicking OK
and enter the next variable by clicking Add again.
Define variables tab
In the next section Design Details, four options are proposed. Look at the bottom table to
see the differences between the different designs and their performance.
As it is possible to do experiments outside the selected range, the Circumscribed
Central Composite (CCC) design is chosen. Check the value of the star point distance to the
center: it should be 1.414 for two design variables.
Design Details tab
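The star point distance quoted above comes from the rotatability condition for central composite designs, alpha = (2^k)^(1/4) where k is the number of design variables; a one-line check:

```python
# Star-point (axial) distance for a rotatable central composite design:
# alpha = (number of factorial runs) ** (1/4) = (2 ** k) ** 0.25.
def star_distance(k):
    return (2 ** k) ** 0.25

print(round(star_distance(2), 3))   # sqrt(2) for two design variables
```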
In the Summary tab check that the design includes a total of 13 experiments. Otherwise, go
back to the appropriate tab and make the necessary corrections.
Summary tab
Go to the Design Table tab, and display the experiment in different views.
Design Table tab
Sample Yield
Cube1 73.4
Cube2 69.7
Cube3 88.7
Cube4 98.7
Axial_A(low) 76.8
Axial_A(high) 84.9
Axial_B(low) 56.6
Axial_B(high) 81.3
cp01 96.4
cp02 96.8
cp03 87.5
cp04 96.1
cp05 90.5
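As a sketch of what the response surface analysis computes (an illustration, not CAMO's implementation), the coded CCC design and the yields above can be fit by ordinary least squares and the stationary point located:

```python
import numpy as np

# Coded CCC design (alpha = sqrt(2)) in run order, with the observed yields
# from the design table above. A sketch, not CAMO's implementation.
a = np.sqrt(2.0)
x1 = np.array([-1, 1, -1, 1, -a, a, 0, 0, 0, 0, 0, 0, 0])  # TiCl4, coded
x2 = np.array([-1, -1, 1, 1, 0, 0, -a, a, 0, 0, 0, 0, 0])  # Morpholine, coded
y = np.array([73.4, 69.7, 88.7, 98.7, 76.8, 84.9, 56.6, 81.3,
              96.4, 96.8, 87.5, 96.1, 90.5])

# Full quadratic model: b0 + b1*x1 + b2*x2 + b11*x1^2 + b22*x2^2 + b12*x1*x2
X = np.column_stack([np.ones_like(y), x1, x2, x1**2, x2**2, x1 * x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2, b11, b22, b12 = b

# Stationary point (the fitted optimum): solve the zero-gradient system
H = np.array([[2 * b11, b12], [b12, 2 * b22]])
s1, s2 = np.linalg.solve(H, [-b1, -b2])
y_max = b0 + b1*s1 + b2*s2 + b11*s1**2 + b22*s2**2 + b12*s1*s2
print(round(y_max, 1))  # predicted maximum yield, in coded units
```

Both quadratic coefficients come out negative, so the surface has a maximum, consistent with the optimum discussed in the analysis below.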
Choose Tasks – Analyze – Analyze Design Matrix….
Go to the Model Inputs tab.
In the dialog box, make the following selections:
Model inputs
The Summary shows that the model is globally significant, so it is possible to go on with the
interpretation.
The ANOVA table for variables displays the p-values for each effect. The most
significant coefficients are the linear and quadratic effects of Morpholine. The TiCl4 effects
look less important, although the square term may be significant (p-value = 0.07);
the interaction is more doubtful.
The Quality section tells about the quality of the fit of the response surface model: R-square
for the calibration and prediction are very good.
In the Results node in the project navigator, check the tables Model check and Lack of fit.
The Model Check indicates that the quadratic part of the model is significant, which shows
that the interaction and square effects included in the model are useful.
The Lack of Fit section shows that, with a p-value greater than 0.05, there is no significant lack
of fit in the model. Thus the model can be trusted to describe the response surface
adequately.
Go to the predefined plot Residuals overview, found in the Plots folder in the project
navigator.
Start with the Normal Probability plot of the residuals. This plot can be used to detect
outliers. Here the residuals form two groups (positive residuals and negative ones). Apart
from that, they lie roughly along a straight line, with one extreme residual,
“cp03”. This may be an outlier.
Normal Probability plot of the residuals
In the residuals plot, all values are within the (-6;+6) range. There is no clear pattern in the
residuals, so nothing seems to be wrong with the model.
Look at the bottom right plot, Y-residuals in experimental order, and check whether there is
a bias with time. Look at the residuals of the 5 center samples.
The center samples show quite some variation. This is why so few effects in the model are
very significant. There is quite a large amount of experimental variability.
Move the mouse over the surface to see the coordinates and the corresponding yield.
It is also possible to see it as a 3-D plot. To do so click on the surface and hold while moving
the mouse to rotate the view of the surface.
Response surface as a 3-D plot
Move the mouse over the surface to see the coordinates and the corresponding yield.
Inspect various points in the neighborhood of the optimum to see how fast the predicted
values decrease. Notice that the top of the surface is rather flat, but that further away
the yield decreases more steeply.
In this example there are only two variables so it is not necessary to use the generator table
below the response surface to change the view.
Finally, notice that the predicted maximum value, found in the table below the plot, is
smaller than several of the actually observed Yield values (sample Cube4, for instance,
has a Yield of 98.7). This is not paradoxical, since the model smooths the observed values.
Those high observed values might not be reproduced when the same experiments are
performed again.
Draw a conclusion from the optimization design
The analysis gave a significant model, in which the quadratic part in particular was
significant, thus justifying the optimization experiments.
Since there was no apparent lack of fit, no outliers, and the residuals showed no clear
pattern, the model could be considered valid and its results interpreted more thoroughly.
The values of the b-coefficients and their significance indicate that the most significant
coefficients are the linear and quadratic effects of Morpholine; the quadratic effect of TiCl4
is close to the 0.05 significance level.
The response surface showed an optimum predicted Yield of 96.815 for TiCl4=0.8250 and
Morpholine=6.555. The predicted Yield is larger than 95 in the neighboring area, so that
even small deviations from the optimal settings of the two variables will give quite
acceptable results.
Description
What you will learn
Data table
Reformat the data table
Graphical clustering
Graphical clustering based on hierarchical clustering
Graphical clustering based on scores plots
Make class models
Classify unknown samples
Interpretation of classification results
Diagnosing the classification model
Description
The data to be classified in this tutorial are taken from the classical paper by Fisher
(R.A. Fisher, “The use of multiple measurements in taxonomic problems”, Ann. Eugenics, 7,
179–188 (1936)). The task is to see whether three different types of iris flowers can be
classified by four measurements made on them: the length and width of the sepal and petal.
Data table
Click the following link to import the Tutorial E data set used in this tutorial.
The data contains 75 training (calibration) samples and 75 testing (validation) samples.
The training samples are divided into three Row (Sample) ranges, each containing 25
samples. The three sets are: Setosa, Versicolor, and Virginica. The row set Testing will later
be used to test the classification.
Four variables are measured: Sepal length, Sepal width, Petal length, and Petal width. The
measurements are given in centimeters. These four variables are collectively defined as the
column set Iris properties.
Now a new column, “Iris type”, has been created, containing the appropriate class value for
each sample.
Data table with category variable “Iris type”
Graphical clustering
It is always a good idea to start a classification with some exploratory data analysis. You can
run a PCA model and/or hierarchical clustering of all samples. If you do not know the classes
in advance, this is a way of visualizing if there is clustering. The calibration samples must be
assigned to the different classes to give a sense of whether a classification model can be
developed.
Matrix: Tutorial_E
Rows: Calibration
Columns: Iris properties
Number of clusters: 3
Clustering method: Hierarchical Complete-linkage
Distance measure: Squared Euclidean.
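To make the complete-linkage idea concrete, here is a minimal, naive sketch with squared Euclidean distances on made-up 2-D data. Real analyses should use the software or a library implementation; this is only to show the mechanics:

```python
import numpy as np

def complete_linkage(X, k):
    """Naive agglomerative clustering with complete linkage and
    squared Euclidean distance; merges until k clusters remain."""
    clusters = [[i] for i in range(len(X))]
    # pairwise squared Euclidean distances between samples
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    while len(clusters) > k:
        best = (None, None, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete linkage: cluster-to-cluster distance is the
                # maximum sample-to-sample distance
                d = D[np.ix_(clusters[i], clusters[j])].max()
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        clusters[i] += clusters.pop(j)
    return clusters

# Tiny illustrative data: two tight groups and one distant point
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [20.0, 0.0]])
print(complete_linkage(X, 3))
```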
In the options tab, you can assign samples to the initial clusters, but for this exercise, we will
make a completely unsupervised cluster analysis.
Click OK for the Cluster analysis to run.
When the clustering is complete, a dialog asking if you want to view the plots will appear.
Click Yes.
The Dendrogram showing the clustering of samples will be displayed. Notice that three
clusters are identified, but they are not all of equal size. All the results are in a new Cluster
analysis node in the project.
Dendrogram: Complete-linkage squared Euclidean distance
Open the Results folder for the cluster analysis, and expand the levels so that you see the
different row sets; one has been defined for each cluster.
Cluster analysis results in project navigator view
By looking at the row sets, one can see that the Setosa samples are all assigned to one
cluster, and that there is a small cluster that contains only Virginica samples, but a larger
group has a mix of both Virginica and Versicolor samples. These results suggest that based
on the four variables provided for these irises, an unambiguous classification may be
difficult.
Matrix: Tutorial_E
Rows: Calibration
Columns: Iris properties
Maximum components: 4
Keep the default ticks in the boxes Mean center data and Identify outliers.
Weights
On the weights tab, select all the variables by highlighting them, and set the weight
by selecting the correct radio button.
Weights: 1/SDev
Click Update.
Validation
Proceed to the Validation tab to set the validation.
You can see the three groups in different colors; one very distinct (Setosa) and two that are
not so well separated (Versicolor and Virginica). This indicates that it may be difficult to
differentiate Versicolor from Virginica in an overall classification model.
Make class models
Before we classify new samples, each class must be described by a PCA model. These models
should be made independently of each other. This means that the number of components
must be determined for each model, outliers found and removed separately, etc.
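The core of this per-class modeling can be sketched in a few lines: fit one PCA model per class, then assign a new sample to the class with the smallest sample-to-model distance (Si). The data here are synthetic and the scaling of Si is simplified relative to the software's statistic:

```python
import numpy as np

def fit_pca(X, n_comp):
    """Class model: mean plus leading principal directions (via SVD)."""
    mu = X.mean(0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:n_comp]

def si(x, model):
    """Sample-to-model distance: RMS residual after projection onto the
    class subspace (a simplified version of the Si statistic)."""
    mu, P = model
    r = (x - mu) - P.T @ (P @ (x - mu))
    return np.sqrt((r ** 2).mean())

# Hypothetical 2-class example: classify by smallest Si
rng = np.random.default_rng(0)
A = rng.normal([0, 0, 0], 0.1, (20, 3))
B = rng.normal([5, 5, 5], 0.1, (20, 3))
models = {"A": fit_pca(A, 1), "B": fit_pca(B, 1)}
new = np.array([5.0, 5.1, 4.9])
print(min(models, key=lambda c: si(new, models[c])))  # nearest class
```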
Task
Make PCA models for the three classes Setosa, Versicolor, and Virginica.
How to do it
Select Tasks - Analyze - Principal Component Analysis… and make the first PCA model for
Setosa with the following parameters:
Model Inputs
Matrix: Tutorial_E
Rows: Setosa
Cols: Iris properties
Maximum components: 4
Weights
1/SDev
Validation
Proceed to the Validation tab to set the validation.
Validation Method: Cross validation. Click Setup and choose Full from the
Cross validation method dropdown menu.
When the model is computed, view the plots. In the project navigator rename the PCA class
model to PCA Setosa by highlighting the new PCA node, right clicking and selecting
Rename.
Rename menu
Repeat the procedure successively on Row Sets Versicolor and Virginica, also renaming each
new PCA model.
Classify unknown samples
When the different class models have been made and new samples are collected, it is time
to assign them to the known classes. In our case the test samples are already in the data
table, ready to use.
Task
Assign the Sample Set Testing to the classes Setosa, Versicolor, and Virginica.
How to do it
Select Tasks - Predict - Classification - SIMCA….
Menu Tasks - Predict - Classification - SIMCA…
Matrix: Tutorial_E
Rows: Testing
Columns: Iris properties
Make sure that Centered Models is checked. Add the three PCA class models Setosa,
Versicolor, and Virginica.
SIMCA classification dialog
The suggested number of PCs to use is 3 for all models; keep that default (it is based on the
variance curve for each model).
Click OK to start the classification.
Interpretation of classification results
The classification results are displayed directly in a table, but you may also investigate the
classification model more closely in some plots.
Interpret the classification table
Task
Interpret the classification results displayed in the SIMCA results.
How to do it
Click View when the classification is finished.
A table plot is displayed, called Classification membership. There are three columns: one for
each class model.
Samples “recognized” as members of a class (they are within the limits on sample-to-model
distance and leverage) have a star in the corresponding column.
SIMCA classification table
The significance level can be changed with the Significance option on the menu bar.
At the 5% significance level, we can see that all but three samples (false negatives: virg1,
virg36, virg42) are recognized by their rightful class model.
However, some samples are classified as belonging to two classes (false positives): 12
Versicolor samples are also classified as Virginica, while 6 Virginica samples are also
classified as Versicolor. Only the Setosa samples are 100% correctly classified (no false
positives, no false negatives). This is an outcome we may have expected since a clear
separation of these two classes was not seen in the overall PCA model of the calibration
samples.
If you raise the significance level to 25%, the number of false positives is reduced, but
the number of false negatives increases (vers41 and virg35 come in addition).
Interpret the Coomans’ plot
If a sample is doubly classified, you should study both Si (sample-to-model distance) and Hi
(leverage) to find the best fit; at similar Si levels, the sample is probably closest to the model
to which it has the smallest Hi. The classification results are well displayed in the Coomans’
plot.
Task
Look at the Coomans’ plot.
How to do it
Under the SIMCA/Plots node choose the Coomans’ plot. You can change which classes it
displays using the toolbar; now set it for models Virginica and
Versicolor.
This plot displays the sample-to-model distance for each sample to two models. The newly
classified samples (from sample set Testing) are displayed in green color, while the
calibration samples for the two models are displayed in blue and red.
Coomans’ plot for Versicolor vs. Virginica
The Coomans’ plot for the classes Virginica and Versicolor shows that all Setosa samples are
far away from the Virginica model (they appear far to the right). However, we can see that
many Virginica and Versicolor samples are within the distance limits for both models. This
suggests some classification problems.
Interpret the Si vs. Hi plot
We also have to look at the distance from the model center to the projected location of the
sample, i.e. the leverage. This is done in the Si vs. Hi plot.
Task
Look at the Si vs. Hi plots.
How to do it
Under the SIMCA/Plots node choose the Si vs. Hi plot, and set it for the model Versicolor
using the arrows on the toolbar. Before you start interpreting the plot, turn on Sample
Grouping by right clicking in the plot window and selecting the Sample Grouping option. In
the sample grouping & marking dialog, select the row sets Setosa, Versicolor and Virginica.
The point labels can be changed to show just the first two characters of their name by right
clicking and selecting Properties. In the left list, select Point Label to get to the Point Label
dialog. Here one has the option to change the label name to just the first 2 characters of the
name. Select the radio button Name, and under the Label layout use the drop-down list for
show to select first, and in number of characters box enter 2, as shown in the dialog.
Point layout dialog
This then provides a plot which is much easier to interpret: the iris type appears clearly with
the initials Se, Ve, Vi in three different colors.
Si vs. Hi plot for the model Versicolor
Some Virginica samples are classified as belonging to the class Versicolor, but most samples
that are not Versicolor are outside the lower left quadrant. The reason for the difficult
classification between Versicolor and Virginica is that the samples are overlapping in the
scores plot. They are very similar with respect to the sepal and petal width.
This plot allows you to compare the different models. A model distance larger than three
indicates good class separation, and here the models are different.
It is clear from this plot that the Setosa model is different from the Versicolor, with a
distance close to 10, while the distance to Virginica is smaller.
Interpret discrimination power
Task
Look at the Discrimination Power plots.
How to do it
Under the SIMCA/Plots node choose the Discrimination Power plot. Using the arrows on the
toolbar, choose the discrimination power for Versicolor projected onto the Setosa model.
This plot tells which of the variables are most useful in describing the difference between
the two types of iris.
Discrimination power:Versicolor onto Setosa
We can see that variables sepal length and sepal width have high discrimination powers
between these classes, while it is lower for the petal length and width.
Do the same for Versicolor onto Virginica: all variables have discrimination powers around 3.
This is obviously not enough to completely discriminate these classes.
Interpret modeling power
Task
Look at the Modeling Power plots.
How to do it
From the plots choose the Modeling Power for Versicolor.
Variables with a modeling power near one are important for the model. A rule of thumb says
that variables with modeling power less than 0.3 are of little importance for the model.
Modeling power for Versicolor
The plot tells us that all variables have a modeling power larger than 0.3, which means that
all variables are important for describing the model. None of the variables should be deleted
from the modeling. The only chance to improve on the classification between Versicolor and
Virginica is to measure some additional variables.
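The 0.3 rule of thumb reduces to a simple threshold filter; the numeric values below are hypothetical, for illustration only:

```python
# Rule-of-thumb filter sketch: keep variables with modeling power >= 0.3.
# The numeric values are hypothetical, for illustration only.
modeling_power = {"Sepal length": 0.55, "Sepal width": 0.48,
                  "Petal length": 0.62, "Petal width": 0.51}
important = [v for v, mp in modeling_power.items() if mp >= 0.3]
print(important)
```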
In this exercise, it was found even from the initial exploratory analysis that the three types of
irises cannot be clearly distinguished based on the four measured variables. In the
dendrogram from clustering, as well as the global PCA, there was not a clear separation of
the Virginica and Versicolor class of irises. Nonetheless, a SIMCA classification was
attempted. With PCA-based classification by SIMCA, all Setosa samples could be properly
classified, while there were some ambiguities between the other two classes. It is
recommended that some other distinguishing feature be measured to enable a clean
classification of all three classes of these irises. The classification results provide many useful
model diagnostics to determine how similar the models are, and which variables are most
important in the modeling.
Description
What you will learn
Data table
Import spectra from an ASCII file
Import responses from Excel
Create a category variable
Append a variable to the data set
Organizing the data
Study the data before modeling
Plot spectral data
Basic statistics on data
Make a PLS Model
Interpretation of the Regression Overview
Customizing plots and copying them into other programs
Save PLS model file
Export ASCII-MOD file
Export data to ASCII file
Description
It is not uncommon to use The Unscrambler® together with other programs in one’s daily
work. This could be a word processor used to document the latest work, or instrument software.
This tutorial shows some of the capabilities The Unscrambler® has to interact with other
programs under the Windows operating system. The main focus here is how The
Unscrambler® is used in conjunction with other software.
Data table
The data are NIR spectra of wheat samples collected at a mill. Fifty-five samples were
collected and their NIR spectra measured on an instrument using 20 channels.
The water content of wheat samples was measured by a reference method and is the
response variable in the data. These values are stored in a separate file.
Click the following links to save the data files to be used in this tutorial:
Click OK to import the file and the data are read into The Unscrambler®, creating a data
table called “Tutorial_F” in the project.
Import responses from Excel
Spreadsheet applications are commonly used for storing data. It is easy to transfer data
between such a program and The Unscrambler®. The water content of the wheat samples is
stored in an Excel file together with the sample names.
Task
Import the water values from the Excel data file “Tutorial_F_responses.xls” into the existing
data table.
How to do it
There are two procedures. Use procedure 1 if you have Microsoft Excel or another
spreadsheet application installed on your computer or procedure 2 if you do not have a
spreadsheet program that can read the file “Tutorial_F_responses.xls”. You only need to
follow one of the procedures.
We will begin by appending a column to the existing data table. Put the cursor in the data
viewer and select Edit – Append, and in the dialog, enter 1 to add a single column.
Copy and paste from Excel
Launch Microsoft Excel and open the file “Tutorial_F_responses.xls” located in the
‘Data’ folder in your Unscrambler directory. Copy the values from the column water,
and paste them into the empty column that you appended in data matrix “Tutorial
F”.
Alternatively, follow this link Tutorial_F_responses.xls to open the spreadsheet
containing the responses
Import data from the Excel file
From File – Import data – Excel…, select “Tutorial_F_responses.xls” from the ‘Data’
folder in your Unscrambler directory and click Import.
Alternatively, click the following link to import the responses from
Tutorial_F_responses.xls directly.
In the project navigator you will find the two data matrices which you imported from the
ASCII and Excel files, respectively. Rename the matrices by selecting them, right clicking and
choosing Rename; name them Wheat NIR Spectra and Water content.
Data matrices in the Navigator
We could leave the response Y values (water content) in a separate matrix and do the
analysis from these two matrices. But for consistency of data organization in this exercise,
we will copy the values from the Water content matrix into the empty column (21) that we
appended to the data matrix “Wheat NIR Spectra”.
Create a category variable
Category variables are useful to calculate statistics and to use in plot interpretation.
Task
Insert a variable to group the samples into three categories, depending on the water content
level.
How to do it
Place the cursor in the first column and select Edit – Insert… and insert one empty column.
Then use copy (Ctrl+C) - paste (Ctrl+V) to copy the water content data into the new column.
Rename the column as “Water levels”.
Then select the “Water levels” column and go to the menu Edit – Change Data Type and
select Category.
Edit – Change Data Type - Category menu
The category converter dialog appears. Select the option New levels based upon ranges of
values.
Add three levels by entering 3 for the Desired number of levels, and specify the following
ranges manually:
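What the category converter does can be sketched as a binning operation. The water values and cut points below are hypothetical stand-ins for the actual ranges entered in the dialog:

```python
import numpy as np

# Sketch of turning a continuous variable into a 3-level category.
# The water values and the cut points below are hypothetical; the real
# ranges are the ones entered in the category converter dialog.
water = np.array([11.8, 13.2, 14.9, 12.5, 16.1])
cut_points = [13.0, 15.0]
labels = np.array(["low", "medium", "high"])
levels = labels[np.digitize(water, cut_points)]
print([str(v) for v in levels])
```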
The column of the category values is orange to distinguish this kind of variable from the
ordinary ones.
Data after insertion of a category variable
Do the same to define the column range for “Water levels” in column 1, and “NIR Spectra” in
columns 2-21.
The list of defined data ranges is found in the project navigator as nodes under the data
matrix.
Project navigator with data sets defined
NIR spectrum. There is now a new entry in the project navigator for the Line plot. You can
rename this by right clicking and choosing Rename.
Line Plot of Spectral Data
We can also compute the statistics without the plot by going to Tasks - Analyze - Descriptive
Statistics…. In the dialog, select all the rows, and the column “Water” and click OK. When
the computation is complete, click Yes to see the plots. A quantile plot and a mean with
standard deviation plot are displayed. If you had more than one variable, the plots would
show results for all the variables. A new node has been added to the project navigator,
“Descriptive Statistics”. This has subfolders containing the raw data, results, and plots of the
statistical analysis. Expand the folder “Results” and select the matrix “Statistics” to see the
numerical results.
Statistics on water content
If not already done, check the boxes Mean center data and Identify outliers.
Go to the X weights and Y weights tabs to verify that these are all set to 1.0 (the default
setting). On the Validation tab, select Cross validation.
PLS Dialog
The Scores plot shows that the samples are scattered in the model space with no evidence
of groupings, and that the first two factors explain 92% and 8% of the variance in the data,
respectively. The Explained X-variance increases nicely and is close to 100% after two factors
(PCs). The Predicted vs. Reference plot looks OK and the fit is quite good. The info box in the
lower left panel of the display indicates that two factors are optimal for this model.
Another very useful plot is that of the regression coefficients. Activate the upper-right quadrant
and right click to go to PLS-Regression coefficients - Raw coefficients (B) - Line. From the
regression coefficients one can see that there is a distinct peak around 1940 nm, as expected,
since this is where the water absorbance peak is located in the NIR spectrum.
Raw Regression Coefficients
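For readers curious about where such coefficients come from, here is a minimal PLS1 (NIPALS) sketch that produces regression coefficients from centered data. It is an illustration on synthetic data, not The Unscrambler's implementation:

```python
import numpy as np

def pls1(X, y, n_comp):
    """Minimal PLS1 (NIPALS) sketch: regression coefficients for
    centered data. An illustration, not The Unscrambler's algorithm."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    W, P, Q = [], [], []
    for _ in range(n_comp):
        w = X.T @ y
        w = w / np.linalg.norm(w)           # weight vector
        t = X @ w                           # scores
        p = X.T @ t / (t @ t)               # X loadings
        q = (y @ t) / (t @ t)               # y loading
        X = X - np.outer(t, p)              # deflate X
        y = y - t * q                       # deflate y
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    return W @ np.linalg.solve(P.T @ W, Q)  # b-coefficients

# Synthetic check: with all components, PLS recovers an exact linear model
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
beta = np.array([1.0, 0.0, 2.0, 0.0, 0.0])
y = X @ beta
b = pls1(X, y, 5)
```

With fewer components than variables, as in the wheat model, the coefficients become a regularized compromise driven by the dominant covariance directions.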
Save the project. All the results and plots that have been generated will be part of the saved
project.
Change the plot heading name, as well as the font used for it.
Annotations can be added to a plot by right clicking and selecting Insert Draw Item…, or
from the shortcut keys on the toolbar.
When the plot has been customized it can readily be saved or copied into another
application. Right click and select Copy to copy just the highlighted plot, or Copy All to
copy all four overview plots. Go to another program and place the cursor where the
plot is to appear in the document. Select Edit - Paste. The plot is now inserted as a graphical
object in the other document.
The plot can be saved as a picture file. The picture file option will usually give better quality
plots, but also larger files. Highlight a plot, and right click Save as… to save the plot in a
choice of graphics image file formats, such as EMF or PNG.
Save as options
Verify that the correct model is selected, and the correct number of factors. It is possible to
select two types of model:
Full: corresponding to the complete model
Regr.Coef. only: corresponding to only the regression coefficients
Take a look at the ASCII file that is generated, which has the file name extension .AMO. The
format of the file is described in the ASCII-MOD Technical Reference.
Export data to ASCII file
A common file format that most programs read is the simple ASCII file. There are different
ways of writing the ASCII file. Determine the format needed based on the requirements of
other programs that will be used to read the ASCII files.
Task
Write the Wheat NIR Spectra data table to an ASCII file.
How to do it
Select the Wheat NIR Spectra table and select File - Export - ASCII. Use only the columns of
the NIR Spectra, by choosing this column set from the drop-down list. Make sure that the
item delimiter is comma, as suggested in the Export ASCII dialog.
Export ASCII Dialog
Provide a file name and location when prompted. Open the file in an ASCII editor and look
at the file. All names are enclosed in double quotes.
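The same kind of comma-delimited output with quoted names can be reproduced in a few lines with Python's standard csv module. The table fragment below is made up for illustration; it merely mimics the shape of the exported file:

```python
import csv
import io

# Hypothetical fragment of the exported table: sample names plus two wavelengths
rows = [("Sample", "1100", "1102"),
        ("wheat_1", 0.412, 0.418),
        ("wheat_2", 0.399, 0.404)]

buf = io.StringIO()
# QUOTE_NONNUMERIC puts double quotes around every text field,
# as in the exported ASCII file, while numbers stay unquoted
writer = csv.writer(buf, delimiter=",", quoting=csv.QUOTE_NONNUMERIC)
writer.writerows(rows)
text = buf.getvalue()
```

Writing to a real file instead of `io.StringIO` works the same way with `open(path, "w", newline="")`.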
Description
What you will learn
Data table
Design variables and responses
Building a Simplex Centroid design
Import response values from Excel
Check response variations with statistics
Model the mixture response surface
Conclusions
Description
This tutorial is taken from an example presented in John A. Cornell’s reference book
“Experiments With Mixtures”, to illustrate the basic principles and applications of mixture
designs to a constrained system.
A beverage known as Fruit Punch is to be prepared by blending three types of fruit juice:
watermelon,
pineapple and
orange.
The financial driver for the manufacturer is to use up its large supplies of watermelons by
introducing the juice into its current blend of fruit juices. As watermelon juice is
relatively cheap compared to the other juices used, the final fruit punch blend should ideally
contain a substantial amount of watermelon - in this case specified as a minimum of 30% of
the total. Pineapple and orange juice have been selected as the other components of the
mixture, based on their availability and preference by most consumers.
To develop suitable blends for preference testing and cost analysis, the manufacturer used
experimental design, in this case, a special class of designs known as mixture designs.
References:
Mixture designs
Data import from a spreadsheet
Descriptive statistics
Analysis of mixture design results
Data table
The data in this exercise consist of two parts:
The design table, which will be created in the tutorial.
Measured responses: Sensory data: acceptance, sweetness, bitterness, fruitiness of
the juice as well the cost of production. We begin by setting up the design in The
Unscrambler®. Then you will import the response variables into the design table.
Design variables and responses
The ranges of variation selected for the experiment are as follows:
Ranges of variation for the fruit punch design
Ingredient   Low   High
Watermelon   30%   100%
Pineapple    0%    70%
Orange       0%    70%
The above constraints define what is known as a Simplex.
The responses of interest for the manufacturer are detailed in the table below.
Response              Description                                       Goal
Consumer acceptance   Average of 63 individual ratings on a 0-5 scale   Maximum
Sweetness             Average ratings by sensory panel on a 0-9 scale   Descriptive only
Bitterness            Average ratings by sensory panel on a 0-9 scale   Descriptive only
Fruitiness            Average ratings by sensory panel on a 0-9 scale   Descriptive only
Consumer acceptance is the response of primary interest. Should the analysis reveal two
responses of high consumer acceptance, the mixture with lower production cost will be
preferred. The sensory descriptors provide an explanation of the consumer acceptance
based on pre-specified properties. These provide possible directions for meeting consumer
expectations and their optimization usually leads to widely acceptable products.
Building a Simplex Centroid design
Since there are only three design variables (called components in the mixture case), setting
up an optimization design is a straightforward process. In this case, the chosen design is the
Simplex Centroid design as the points of this design allow you to investigate the importance
of the pure components, binary (two juice) blends and finally ternary (three component)
blends within the mixture space.
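The seven runs of a three-component Simplex Centroid design are easy to enumerate programmatically. The sketch below is independent of The Unscrambler® itself; it also maps the pseudocomponent points onto the constrained region, assuming the 30% lower bound on watermelon stated above:

```python
from itertools import combinations

def simplex_centroid(q):
    """All centroids of the (q-1)-simplex: pure blends, 50/50 binary
    blends, ..., up to the overall centroid (2**q - 1 points)."""
    points = []
    for k in range(1, q + 1):
        for idx in combinations(range(q), k):
            p = [0.0] * q
            for i in idx:
                p[i] = 1.0 / k
            points.append(tuple(p))
    return points

design = simplex_centroid(3)   # 7 runs: 3 pure, 3 binary, 1 centroid

# Map pseudocomponents to actual juice fractions, assuming the lower
# bounds from the tutorial text: watermelon >= 30%, others >= 0%
lower = (0.30, 0.0, 0.0)       # watermelon, pineapple, orange
span = 1.0 - sum(lower)        # 0.70 left to distribute
actual = [tuple(l + span * x for l, x in zip(lower, p)) for p in design]
```

Every actual blend still sums to 100%, and the centroid run contains 30% + 70%/3 of watermelon.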
Task
Build a Simplex Centroid design with the help of the design experiment wizard, by selecting
Insert – Create design….
How to do it
Use Insert – Create design… to start the Design Experiment Wizard. The first tab is the Start
tab, where you enter the name of the design and the goal of the experiment. It is also
possible to add additional information in the description field.
Enter “Punch” as a name for the design and select Optimization as the goal.
Start tab for the Punch experiment
Go to the next tab: Define variables. Specify the variables as shown in the following table:
Variables to define
ID Name Type Constraints Type of levels Level range
1 Acceptance Response - - -
2 Cost Response - - -
3 Sweet Response - - -
4 Bitter Response - - -
5 Fruity Response - - -
Do this by clicking the Add button and entering details into the Variable editor including the
level range for the design variables. Validate each component and response by clicking
OK.
Variables involved in the design
Go to the next tab: Additional Experiments. There is no need to replicate the design samples
so the Number of replications should be kept at its default value: “1”.
In this study, the centroid is to be replicated 3 times to provide an estimate of the model
error. The Simplex Centroid design contains the "centroid" by default; in this case,
select 3 to add a further 3 replicates of the centroid.
Additional experiments tab
Now proceed to the next tab, Randomization. There is no need to make any further
adjustments in this tab; however, try some re-randomizations just to get familiar with this
option.
Randomization tab
Next look at the Summary tab. The displayed table presents a summary of the information
in the design.
Summary tab
Go to the final tab Design Table. Here the data table is presented with several view options.
In this case, select the Display Order as Standard and leave the Design display mode as
Actual Values.
Design table tab for the fruit punch experiment
Once all necessary checks have been made, click the Finish button to generate the design
table in The Unscrambler® editor.
Now the designed data table appears in the Navigator. The design variables are given first,
followed by their interactions. The responses are given to the right of the interactions in the
same table. The response variables are empty and you need to fill in the responses obtained
for the experimental runs. The design matrix is organized into row and column sets
according to the types of samples (design, center, etc.) and effects.
The first part of the design table, including the mixture components
To change the order from the standard sample sequence to the experiment sample
sequence click on column randomized, and select Edit - Sort - Descending.
To change from the actual values to the level values click on the table and then View
- Level indices.
Save the new project with File - Save and specify a name such as “Punch Optimization”.
Import response values from Excel
The responses for this design are stored in a separate Excel spreadsheet, which can be
directly imported into the navigator and then copied and pasted into the response columns
of the Punch_Design matrix.
Task
Open the Excel table containing the response values and copy them into the response
columns of the design table.
How to do it
Go to File - Import Data - Excel…, select the Excel file “Tutorial_G.xls” (found in the “Data”
sub-directory under your Unscrambler installation folder) and click Open. Alternatively, click
the following link to open the Excel sheet to import the responses from Tutorial_G.xls
directly as a new matrix in the project.
If you are importing the Excel table, in the Excel Preview window, select the “Sheet1”, and
select the 5 responses:
Accept
Cost
Sweet
Bitter
Fruity
Excel Preview
Click on OK, and note that a new node “Tutorial_G.xls” is formed in the project navigator.
Look at the sample order of the imported data table. It is very important that the tables
"Punch_Design" and "Tutorial_G.xls" match in their order. If the "Punch_Design" table is not
given in standard order, you can highlight the Standard row header in the design table and
click Edit - Sort - Descending.
Select all the data in “Tutorial_G.xls” and copy them using right click and the option Copy or
with the shortcut Ctrl+C and paste them into the corresponding columns of “Punch_Design”.
To do so place the cursor in the first cell and use right click and the option Paste or the
shortcut Ctrl+V.
Imported response data
Click Yes to view the results. The results are displayed as two main plots. The upper plot is
the Quantiles plot, the lower the Mean and SDev plot.
Let us have a look at the upper plot: Quantiles.
If you have never interpreted a box-plot (or Quantiles plot) before, follow this link.
Right click on the plot and select View - Numerical View to display the min, max, median, Q1
and Q3 for the responses. Ensure all variations are within their expected ranges for the
responses (0-5 for Acceptance, 0-3 for Cost and 1-9 for the sensory responses on flavor).
Now display the same two plots for design samples and center samples, in order to compare
variation over the whole design to variation over the replicated Center samples. If the
experiments have been performed correctly, there should be much more variation among
design points than among the three replicates of the Centroid.
Return to the graphical view (View - Graphical view).
Right click on the plot and select Sample Grouping. A dialog box opens.
Select the sets Center samples and All design samples from the matrix Punch_Design.
Sample grouping and marking for the statistics
Note: It is possible to edit the color of the bars in the plot and set marker names.
Click OK.
To display the legend, click on the plot and then on the -icon in the toolbar.
Quantiles plot with sample grouping
The quantiles plot is now displayed separated into three groups. The boxes for all samples
appear in blue, for design samples in red and the center samples in green. From the
quantiles plot, you can see that there is much more variation between design points than
within the center samples.
Summary of Descriptive Statistics Analysis
The ranges of variation of the 5 responses are within their expected ranges.
There were no abnormal values observed for any response.
There is much more variation over the whole design than among the center samples, which
indicates that the experiments were performed correctly.
Model the mixture response surface
The next step after checking the quality of the data is to model the responses. By this we
mean that we want to study the quantitative relationships between fruit punch composition
and consumer acceptance, production cost and measured sensory properties.
Task
Analyze the design with a Response Surface analysis using a Scheffé model. View the results
and interpret them.
How to do it
Highlight the data table Punch_Design and run Tasks - Analyze - Analyze Design Matrix….
Make the following choices in the Design Analysis dialog:
Method
Classical
Model inputs
Predictors
Matrix: “Punch_Design (13x15)”
Rows: All
Cols: Design (10)
Model: Special cubic
Responses
Matrix: “Punch_Design (13x15)”
Rows: All
Cols: Response (5)
Design Analysis
Note: The Special Cubic model is used here as there are enough points in the Simplex
Centroid design to support the calculation of the three binary mixture interactions and the
ternary blend interaction present within the design. There are also degrees of freedom left
in the design to test the significance of the effects estimated.
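The Scheffé special cubic model is linear in its seven blending terms, so it can be fitted by ordinary least squares. A sketch with synthetic, noise-free data; the coefficient values and the 13-run layout below are made up for illustration:

```python
import numpy as np

def special_cubic_matrix(X):
    """Scheffé special-cubic terms for a 3-component mixture:
    x1, x2, x3, x1*x2, x1*x3, x2*x3, x1*x2*x3 (no intercept)."""
    x1, x2, x3 = X.T
    return np.column_stack([x1, x2, x3, x1 * x2, x1 * x3, x2 * x3, x1 * x2 * x3])

# Hypothetical illustration: recover known coefficients from noise-free data
rng = np.random.default_rng(0)
raw = rng.random((13, 3))
X = raw / raw.sum(axis=1, keepdims=True)   # 13 mixtures, each summing to 1
b_true = np.array([6.0, 4.0, 3.0, 2.0, 1.0, -1.0, 8.0])
y = special_cubic_matrix(X) @ b_true
b_hat, *_ = np.linalg.lstsq(special_cubic_matrix(X), y, rcond=None)
```

With real data the fit is of course not exact, and the significance of each term is judged from the ANOVA table as described above.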
Click OK, then Yes to view the model diagnostics and plots when the computation is
complete.
Diagnosing the model
ANOVA results
The ANOVA table provides the overall fit summary for a particular response. It is found in
the upper left quadrant of the DoE overview.
The first ANOVA table is for the response variable “Accept”.
ANOVA Punch
The first thing to look for is the p-value for the model: in this case it is 0.0085 and, since it is
smaller than 0.05, this suggests that the model is describing something other than noise.
The p-values for the binary and ternary blending terms (e.g. Watermelon x Pineapple)
are all significant. This indicates that the special cubic model fit may be justified.
Before analysing the ANOVA tables of the other responses, look at the Quality section of the
ANOVA table for the response Acceptance. This is shown below.
The R-Square value for the model is OK, however the Adjusted R-Square is much lower. This
may be indicating that the model is not a good predictor of future results. This is confirmed
by the negative R-Square Prediction value. Negative R-Square Prediction values indicate that
the mean is a better predictor of future data than the model is. Remember, validation is
always the key to good results.
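The quality statistics discussed here can be reproduced for any linear model. The sketch below assumes the common PRESS shortcut for leave-one-out residuals, and its adjusted R-square counts every model column as a parameter:

```python
import numpy as np

def fit_quality(X, y):
    """R-square, adjusted R-square and a PRESS-based R-square of prediction
    for a linear model y ~ X b (a sketch; every column of X is counted as a
    parameter). PRESS uses the leave-one-out shortcut e_i / (1 - h_ii)
    computed from the hat-matrix leverages."""
    n, p = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    h = np.diag(X @ np.linalg.pinv(X.T @ X) @ X.T)   # leverages h_ii
    press = np.sum((e / (1.0 - h)) ** 2)
    sstot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - np.sum(e ** 2) / sstot
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p)
    r2_pred = 1.0 - press / sstot   # negative: the mean predicts better than the model
    return r2, r2_adj, r2_pred
```

A negative `r2_pred` reproduces exactly the situation seen for Acceptance above.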
View the results for the other responses by using the drop-down menu or the arrows in the
menu bar.
A summary of the results is provided below:
Cost: The model p-value is highly significant for this response (p = 0.0007). Closer
inspection of the sums of squares indicates that a linear model is more applicable.
Sweetness: The model p-value is highly significant (p = 0.0000). The individual sum
of squares terms indicate that the special cubic is a good fit to the data.
Bitter: The model p-value is not significant (p = 0.1857). This suggests that Bitter
is not modelled well at all.
Fruity: The model p-value is highly significant (p = 0.0001). The Watermelon x
Pineapple binary blending is the most significant term in the model and the ternary
blend term is also significant. This indicates that the response is dependent on all of
the components in the blend.
Select the Error Table from the project navigator. This provides an overall summary of the
quality statistics for each response in one table.
Diagnostics
Examine the diagnostic table for Accept. Look for extreme residuals and note high values of
Cook's Distance. These statistics help to isolate outliers based on high leverage.
Diagnostics for response "Accept"
Response surface
Response surfaces are usually the key output desired in the mixture setting as they provide
the location of the "optimal" blend. The following image is the response surface obtained for
Acceptance.
Response surface for acceptance
The response surface shows that an acceptable blend can be achieved containing 55%
Watermelon juice. This exceeded the manufacturer's expectations and allows the excess
watermelon supplies to be used.
The diagram below presents the response surfaces that best model each response. The
desired optimized response is also shown in each figure. Bitterness has been omitted as it
was not modelled well.
The optima were chosen on the basis of acceptance and cost as primary responses and that
the sweetness should not be too high and the fruitiness is maximized.
Conclusions
The mixture design and analysis showed that suitable models could be developed for
Acceptance, Cost, Sweetness and Fruitiness. Bitterness could not be modelled well. The
response surface analysis showed that the four modelled responses could be optimized to
develop a blend that uses more than the minimum stated 30% watermelon juice. The best
formulation was achieved with 55% Watermelon juice, 24% Pineapple juice and 21% Orange
juice. This blend also minimised the usage of the highest cost orange juice.
PLS-DA is the use of PLS regression for discrimination or classification purposes. In The
Unscrambler® PLS-DA is not listed as a separate method. This tutorial explains how to do it.
Description
Running a PLS Discriminant Analysis
What you will learn
Data table
Build PLS regression model
Classify unknown samples
Some general comments on classification
Description
PLS Discriminant Analysis (PLS-DA) is a classification method based on modeling the
differences between several classes with PLS. If there are only two classes to separate, the
PLS model uses one response variable, which codes for class membership as follows: -1 for
members of one class, +1 for members of the other one.
If there are three classes or more, the model uses one response variable (-1/+1 or 0/1, which
is equivalent) coding for each class. There are then several Y-variables in the model.
In this tutorial we will analyze the chemical composition of spear heads excavated in the
African desert. 19 samples known to belong to two tribes (classes A and B) are used for
building a discriminant model, while seven new samples of unknown origin make up a test
set to be classified.
The X variables are 10 chemical elements characterizing the composition of the spear heads.
The 19 training samples are divided into 10 from class A and 9 from class B.
The normal way to make dummy variables for classes is to assign 1 if the sample belongs to
the class and 0 if not. A small trick to have a decision line of 0 and not 0.5 in the predicted vs.
reference plot is to use values -1 and 1, which gives an easier visualization.
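The -1/+1 coding trick can be sketched in a few lines. The labels below are a hypothetical stand-in for the 19 training samples, and the `margin` parameter is an optional extra for leaving ambiguous predictions unassigned, not something the tutorial itself uses:

```python
# Hypothetical stand-in for the 19 training spear heads: 10 from class A, 9 from class B
labels = ["A"] * 10 + ["B"] * 9

# Code class membership as +1 / -1, so the decision line in the
# predicted vs. reference plot sits at 0 rather than 0.5
y = [1.0 if c == "A" else -1.0 for c in labels]

def assign(pred, margin=0.0):
    """Assign a predicted value to a class; predictions within
    +/- margin of zero are left unassigned (returns None)."""
    if pred > margin:
        return "A"
    if pred < -margin:
        return "B"
    return None
```

A prediction near 0, like the ambiguous test samples later in this tutorial, then falls in neither class when a margin is used.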
Data table
Click the following link to import the Tutorial H data set used in this tutorial. The data have
already been organized for you into row sets, and with the class variable, as well as the
indicators for the classes.
Tutorial H data
Model inputs
X Weights
1/SDev
Y Weights
1/SDev
Validation
Full cross-validation
Set the weights on the X-weights and Y-weights tabs. Select all the variables, select the radio
button A/(SDev+B), and click Update. Do this for both the X and Y weights.
X weights dialog
To set the validation method, go to the Validation tab in the PLS Regression dialog. Select
Cross validation, and then click Setup… to get to the dialog to select full cross validation.
Select Full from the cross validation method drop-down list.
Cross Validation Dialog
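Full cross validation means that every sample is left out once and predicted by a model built on all the remaining samples. A sketch of the validation scheme, with ordinary least squares standing in for the PLS fit:

```python
import numpy as np

def loo_cv_rmse(X, y):
    """Full (leave-one-out) cross validation: each sample is predicted once
    by a model built on all the other samples. Ordinary least squares stands
    in here for the PLS fit, so this illustrates the validation scheme only."""
    n = len(y)
    errs = []
    for i in range(n):
        keep = np.arange(n) != i           # boolean mask excluding sample i
        b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errs.append(y[i] - X[i] @ b)       # prediction error for left-out sample
    return float(np.sqrt(np.mean(np.square(errs))))
```

The returned value corresponds to the validation RMSE that the software reports for the cross-validated model.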
After the computations are finished the default PLS regression plots will be shown. The
scores plot shows the separation of the two classes.
Scores plot
For better visualization of the classes you may use the sample grouping option. Right click in
the scores plot and select Sample Grouping from the menu.
In the Sample grouping dialog, select the row sets “A” and “B” for visualization. You can
double-click in the small boxes showing the colors to change to your preference.
The same goes for the symbols, and their size.
Sample Grouping Dialog
The scores plot shows that the two classes are well separated in the first two factors.
Scores plot with grouping
Thus, a discrimination line may be inserted in the plot with the line drawing tool in The
Unscrambler®.
Study the explained variance plot for Y shown in the lower-left quadrant. If need be, switch
it to the view for Y by using the X-Y button. The explained variance plot for Y shows
around 98% explained calibration variance and 94% explained validation variance for 2
factors. The red validation curve indicates that two factors is the optimal number, as there
is only a small increase in explained variance after factor two.
Note: Explained variance or RMSE is not the main figure of merit for PLS-DA,
however.
Variance plot
To interpret which variables are important for the classification, the loading weights plot is
the one to look into. This is given in the upper-right quadrant.
In this case the loadings express the same information as the loading weights, and since
correlation loadings show the explained variance directly, this is the preferred view. Make
the loadings plot active, and change it to the Correlation loadings view by selecting the
correlation loadings shortcut .
In the correlation loadings plot for factors one and two we see that Ba, Zr and Sr are the
variables that separate the two classes, as well as Ti, although with a slightly lower
discrimination ability. These are the variables closest to the response variable class, and lie
between the 50% and 100% explained-variance circles.
The remaining elements are mostly modeling the variance within the classes.
Correlation Loadings Plot
The regression vector is a summary of the important variables, in this case representing the
loading weights plot after 2 factors. In the project navigator, select the plot Regression
Coefficients, and change it to a bar chart by using the toolbar shortcut .
Weighted Regression Coefficients
Note that the blue points are from calibration where the samples are merely put back in the
same model they were a part of. The red points are from cross validation which is more
conservative as the sample was not a part of the model when it was predicted. You can
toggle on/off the regression line, trend line, and statistics for the plot using the toolbar
shortcut.
Recall that “prediction” in this context does not mean that the model has been tested by
predicting a real test set. In this case all samples are correctly classified for the cross
validation.
To investigate how the model will behave on unknown samples, the next section will show
how to predict unknown sample class.
It is a good idea to save your work so far. The project will include all the data, as well as all
the results generated thus far. Use File – Save… to save the project.
Classify unknown samples
Assign the unknown samples to the known classes by predicting (classifying) with the PLS
regression model.
Task
Assign the Sample Set Test to the classes A or B.
How to do it
Select Tasks - Predict - Regression….
Tasks - Predict - Regression…
Matrix: Tutorial H
Rows: Test
Cols: X
Prediction
Full Prediction
Inlier limit
Sample Inlier dist
Identify Outliers
Prediction Dialog
Click OK.
The predicted values are shown in the main plot of predicted values with estimated
uncertainties.
All F samples have predicted values close to -1, classifying these as belonging to class "B".
The E sample 2 has a predicted value around 1, which assigns it to class "A". As for E samples
1, 3 and 4, their predictions are close to 0 and have high uncertainties. It could be that these
cannot be said to belong to either class, because the estimated deviation (uncertainty)
around the prediction value includes 0 in the plot.
Predicted values and deviation
A small trick to present the results more clearly is to select Tasks - Predict - Projection and
select the PLS model from above. In the scores plot you see that all F samples lie in the
"B" class and E samples 2 and 3 probably belong to class "A", as discussed above. The
position of test samples 1 and 4 shows that they are in fact closer to class "A", as the
predicted values also indicate.
Note: Try to analyze the same data by doing PCA on the two groups and then select
Tasks - Predict - Classification - SIMCA and compare results with the PLS-DA.
To check if the prediction can be trusted, study the Inlier vs. Hotelling’s T² plot available from
a right click on the plot and then Prediction - Inlier/Hotelling's T² - Inliers vs. Hotelling's T².
Prediction - Inlier/Hotelling’s T² - Inliers vs. Hotelling’s T² menu
For a prediction to be trusted, the predicted sample must not be too far from a calibration
sample; this is checked by the Inlier distance. The projection of the sample onto the model
should also not be too far from the center; this is checked with the Hotelling's T² distance.
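Hotelling's T² can be recomputed from the model scores if desired. A sketch, assuming mean-centered scores and the usual normalization by the per-factor score variance:

```python
import numpy as np

def hotellings_t2(scores):
    """Hotelling's T-square per sample from mean-centered model scores:
    each factor's squared score divided by that factor's score variance,
    summed over the factors kept in the model."""
    var = scores.var(axis=0, ddof=1)       # score variance per factor
    return np.sum(scores ** 2 / var, axis=1)
```

Samples with a large T² sit far from the model center and, as noted above, their predictions should be treated with caution.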
Inliers vs. Hotelling’s T²
In this case the samples are found to be widely spread in the plot. If samples fall
outside the limit lines, the prediction cannot be trusted.
Some general comments on classification
LDA is the basic method that is typically taught in introductory classification courses and is
available as a reference method for comparison with other classification methods such as
SIMCA. Remember that LDA has the same issue with collinearity as MLR, and that more
samples than variables are required in each class. Using PLS regression for classification, as
in PLS-DA, can give very good results in discriminating between classes. In this
context it may also be useful to apply the uncertainty test after deciding on the model
dimensionality and remove the nonrelevant variables. This can in some cases improve
results both in simpler visualization and model performance. However, PLS-DA does not take
into account the within-class variability, and predicted values around 0 (assuming -1 and 1
are used as levels for the classes) are difficult to assign. One alternative procedure is to use
the scores from the PLS-DA in an LDA to have a more “statistical” result. As the score vectors
are orthogonal there is no problem with collinearity in this case.
Using local PCA models, which for historical reasons have been given the name "SIMCA", is a
good approach because it also makes it possible to assign new samples to none of the
existing classes. However, as there is no objective in the individual PCA models to
discriminate between the classes, one does not know whether the variance modeled is optimal
for this purpose. The Modeling and Discrimination Power diagnostics are helpful in this
context. One useful procedure is to first do PLS-DA and select the “best” set of variables for
discrimination. Then use these together with the most important variables in the individual
PCA models to obtain a variable set that models both the within-class and between-class variability.
SVM is a powerful method which can handle nonlinearities, and very good results have been
reported in the literature. However, it is not as transparent as PCA and PLS, and the choice of
values for input parameters must be decided from cross validation to ensure a robust model.
As for all methods, the proof of the method lies in the classification of a large independent
test set with known reference.
Description
What you will learn
Data table
Data plotting
Run MCR with default options
Plot MCR results
Interpret MCR results
Run MCR with initial guess
Validate the estimated results with reference information
View an MCR result matrix
Description
Multivariate Curve Resolution (MCR) attempts to recover the response profiles (spectra, pH
profiles, time profiles, elution profiles, etc.) of the components in an unresolved mixture of
two or more components. This is especially useful for mixtures obtained in evolutionary
processes and when no prior information is available about the nature and composition of
these mixtures.
The Unscrambler® MCR algorithm is based on pure variable selection from PCA loadings to
find the initial estimation of spectral profiles, and then Alternating Least Squares (ALS) to
optimize resolved spectral and concentration profiles.
The algorithm can apply a constraint of Non-negativity in either spectral or concentration
profiles or both.
It can also apply a constraint of Unimodality in concentration profiles that have only one
maximum, and/or a constraint of Closure in concentration profiles where the sum of the
mixture constituents is constant.
The Unscrambler® MCR functionality does not require any initial guess input. A mixture data
set suitable for MCR analysis should have at least four samples and four variables. If no
initial guess is used, the maximum number of variables is 5000.
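The ALS step can be sketched compactly. The following is a bare-bones illustration of alternating least squares with non-negativity imposed by clipping; it is not The Unscrambler®'s actual algorithm, which also derives its own initial spectral estimates from pure-variable selection:

```python
import numpy as np

def mcr_als(D, S0, n_iter=100):
    """Alternating least squares for D ~ C @ S.T, with non-negativity
    imposed on both the concentration profiles C and the spectral
    profiles S by simple clipping. S0 is an initial guess of the pure
    spectra (rows of D are mixture spectra, columns are wavelengths)."""
    S = S0.copy()
    for _ in range(n_iter):
        C = np.clip(D @ np.linalg.pinv(S.T), 0.0, None)    # concentrations, S fixed
        S = np.clip((np.linalg.pinv(C) @ D).T, 0.0, None)  # spectra, C fixed
    return C, S
```

With noise-free data and a good initial guess the factorization reproduces the mixture matrix essentially exactly; real data converge to a least-squares compromise.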
In this tutorial we will utilize UV-Vis spectra of dye mixtures to extract pure dye spectra and
their relative concentrations. The data are from the Institute of Applied Research (Prof. W.
Kessler), Reutlingen University, Germany.
Data table
Click the following link to import the Tutorial I data set used in this tutorial.
Organizing the data table
The samples consist of 39 spectra of dye mixture samples. Samples 1 to 3 are pure dyes of
blue, green and orange, respectively. Samples 4 to 39 are 36 mixture samples of those 3
dyes at known concentrations. The X variables are the UV-Vis spectra measured over the
range 250-800 nm with data at 10 nm increments. We will begin by organizing the data for
the analysis into row (sample) and column (variable) sets. The column sets have already
been defined for you, and are found in the folder Column in the project navigator. There are
5 column sets for the different variables of interest in the analysis, including the
concentrations of the three dyes, and two overlapping spectral ranges.
We begin by defining the row sets for these data. Select the entire first row in the data table,
Blue_50, and go to Edit – Define Range… to open the Define Range dialog box. In the dialog,
enter the name “Blue” in the Range row box and click OK.
Define Range Dialog
From the data table, select the sample Green_50, and go to Edit-Define Range to now make
this row set Green. Do the same for the sample Orange_50, and then for samples 4 to 39,
giving that row set the name Mixture. Additionally, create the row set Original by selecting
samples 1 to 3 and following the same procedure, Edit - Define Range…
The first three columns are concentration measurements of blue, green and orange dyes.
Columns 4 to 59 are UV-Vis spectra measured over the range 250-800 nm in steps of 10 nm. In the
project navigator expand the node Column to see the list of existing column sets. The
organized data will look like this in the navigator and viewer, with color-coding for the
defined set.
Navigator view of organized data
Data plotting
Before starting any analysis, it is a good idea to have a look at the data. We want to make a
line plot of the spectra of all mixture samples together. Go to the original data table and
highlight it in the navigator.
Use Plot - Line, which will open the Line plot dialog where the row set Mixture can be
selected from the drop-down list, and for Cols, the set 250-800nm. This will give an overlay
plot of the spectra.
Line plot of mixture spectra
We will now plot the reference spectra of the three pure components: select the row set
Original and the column set 250-800nm. Go to Plot - Line… and select these rows and
columns in the dialog.
Line plot dialog
This will result in the following plot, where we can see that the maximum absorbance for
each of the dyes is at a different wavelength. It is these component spectra that we expect
to be able to extract through the MCR analysis of the data in this tutorial.
Line plot of pure dyes
To plot the reference concentrations of the three dyes, select columns 1-3 and make a Line
plot of Sample set “Mixture” by right clicking and selecting Plot – Line.
Line plot of sample concentrations
When the MCR calculation is completed, a new node, named MCR, is added to the project
navigator and the MCR overview plots are displayed in the viewer. The MCR results overview
includes four plots, from upper-left to lower-right: Component Concentrations, Component
Spectra, Sample Residuals and Total Residuals. The results overview plots are displayed at
the optimum number of pure components, which the system estimates to be 3 in this case. The
optimal number of components (3) is displayed on the toolbar. A summary of the analysis
results is given in the Info tab in the lower left corner of the display, which also states the
optimal number of pure components.
MCR Info Box
The MCR model results are all together in the new node in the project navigator named
MCR. Rename the MCR model in the project navigator by highlighting the MCR node, right
clicking and choosing Rename. Rename your first MCR model as MCR Original.
Plot MCR results
Task
Plot MCR results for various numbers of pure components.
How to do it
The Unscrambler® MCR procedure actually generates several sets of results, covering
numbers of estimated pure components from 2 to the optimum +1. By default, the results are
plotted for the optimal number of components.
You may view the results for varying numbers of pure components. Let us plot the spectral
profiles for a 2-component solution. Click the shortcut to select Component Number 2.
The plot of (estimated) component spectra for a resolution with two pure components is
displayed.
In a similar manner, click on the right arrow shortcut to plot the 4-component solution.
MCR fitting and PCA fitting results are also available for varying numbers of pure
components from 2 to optimum +1. Each fitting includes Variable Residuals, Sample
Residuals and Total Residuals plots, which are stored in result matrices in the MCR node of the
project navigator. The user can plot these results upon selection of the respective matrices, or
by selecting the plot from the plots node of the project navigator. The plot of Total Residuals
for MCR fitting is shown by default in the lower-right subframe. Like any other plot, it can
also be accessed from the Plot menu. Change this plot to variable residuals by clicking to
activate the lower-left subframe, then selecting MCR - Variable Residuals to have this plot
displayed in place of the sample residuals plot.
Variable residuals plot
This suggests that the model with 3 components is the optimum solution.
Click and activate the Component Spectra plot with 3 components in the upper-right
quadrant. The toolbar contains a set of arrows, which are used to navigate between
results at different numbers of components. Use the arrows to increase and decrease the
number of components, and watch the impact on the spectral profiles.
Run MCR with initial guess
Task
Run the MCR calculation again, this time using an Initial Guess.
How to do it
If prior knowledge such as spectra of pure components or concentrations of mixture samples
exists, this information may be included in the MCR calculation to help the algorithm
converge towards the right solution of curve resolution.
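The alternating least squares idea behind this can be sketched outside the software. The following NumPy sketch is illustrative only: the simulated data, sizes and variable names are stand-ins chosen to mirror the tutorial (36 samples, 56 wavelengths, 3 dyes), not The Unscrambler's internal algorithm or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated mixtures: D (samples x wavelengths) = C_true @ S_true.
n_samples, n_wavelengths, n_components = 36, 56, 3
C_true = rng.random((n_samples, n_components))
S_true = rng.random((n_components, n_wavelengths))
D = C_true @ S_true

# Initial guess: slightly perturbed pure spectra ("Use initial guess / Pure spectra").
S = S_true + 0.05 * rng.random((n_components, n_wavelengths))

for _ in range(200):  # alternating least squares with non-negativity
    C = np.clip(D @ np.linalg.pinv(S), 0.0, None)   # non-negative concentrations
    S = np.clip(np.linalg.pinv(C) @ D, 0.0, None)   # non-negative spectra
    S /= np.linalg.norm(S, axis=1, keepdims=True)   # unit-vector normalized spectra

C = np.clip(D @ np.linalg.pinv(S), 0.0, None)       # final concentration update
relative_residual = np.linalg.norm(D - C @ S) / np.linalg.norm(D)
```

As the notes below point out, the recovered concentrations are only relative: their absolute scale is absorbed by the normalization of the spectra.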
Go back to the data table Tutorial_I by selecting the tab at the bottom of the viewer. Go to
Tasks - Analyze - Multivariate Curve Resolution…. The MCR dialog box with default settings
will open up. Select the same data as before, and then check the box Use initial guess and
select option Pure spectra.
MCR dialog with initial guess
Select Row Set Original as the initial guess for spectra, making sure to use the same column set
for both the analysis data and the initial guess. Then click OK to launch the calculations.
When asked if you want to view the plots now, select yes.
Rename the new MCR results node in the project navigator as MCR Initial Guess.
Notes:
When using the initial guess option, The Unscrambler® requires all pure
components to be included as initial guess inputs. Partial reference will
generate erroneous results. It is recommended to run MCR without initial
guess if only partial reference is available.
The Unscrambler® can be run with either spectra or concentrations of pure
components as an initial guess input.
In the navigator tab, right click to choose Pop out, giving an undocked plot that can now be
docked wherever you wish for ease of viewing.
You can observe that the first estimated concentration profile is similar to the reference
profile of the blue dye (blue curves on the plots), the second estimated concentration profile
is similar to the reference profile of the green dye, and the third estimated concentration
profile is very close to the reference concentration of the orange dye (green curves on the
plots).
Caution: Estimated concentrations are relative values within an individual
component itself. Estimated concentrations of a sample are not its real
composition.
The estimated spectral profiles can be compared to the reference spectral profiles in the
same way as for the concentrations. Because we used the spectra as initial guess inputs in
this example, the comparison shows a perfect match. However, estimated spectra are unit-
vector normalized; they are not the “real” spectral profile of the samples.
Plots of the Pure and Estimated Spectra
Rename the matrix named Component concentrations, which has been added to the bottom
of the project navigator, as Concentrations comparison.
With the cursor in the data matrix, go to Edit - Append and choose to add 3 columns to this
matrix. Go to table Tutorial_I, select the first three columns (blue, green and orange), from
rows 4-39. Copy them and paste them in the empty columns of the Concentrations
comparison matrix, and enter names for columns 4-6 as blue, green, and orange
respectively. We now have a table of six columns, containing the three estimated
concentrations of the pure dyes followed by the three measured concentrations.
New Data Matrix with Estimated and Real Concentrations
Select columns “Blue” and “1” (press the Ctrl key on your keyboard to select several columns
at a time). Click Plot - Scatter to display a 2-D Scatter plot of these columns. The correlation
between estimated and reference concentrations for the blue dye is 0.994. If the box
containing plot statistics (including the correlation) is not displayed in the upper-left corner
of your plot, use the toolbar buttons to display it. These can also be used to add a
regression line and target line to the plot.
Continue to make the scatter plots for the green dye (columns “Green” and “2” in the table),
which has a correlation between estimated and reference concentrations of 0.997.
For the orange dye (columns “Orange” and “3”), the correlation is 0.998. These very high
correlations indicate that the MCR calculations have determined concentration profiles
accurately in this case.
Scatter plot of orange dye concentration
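The correlation shown in the plot statistics is the ordinary Pearson correlation coefficient. A quick sketch with invented numbers (not the tutorial's actual values) shows the calculation:

```python
import numpy as np

# Hypothetical reference vs. MCR-estimated dye concentrations; the numbers are
# made up for illustration (the tutorial reports r = 0.998 for the orange dye).
reference = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
estimated = np.array([0.11, 0.19, 0.32, 0.41, 0.48, 0.61])  # relative scale

r = np.corrcoef(reference, estimated)[0, 1]
```

A high r indicates that the estimated profile tracks the reference well, even though its absolute scale is arbitrary.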
These plots can be customized by right clicking and choosing Properties to make changes to
the plot appearance.
Now let us convert the estimated Orange concentrations to real scale. In order to do this, at
least one reference measurement is needed. The estimated concentrations (in relative scale)
of all samples can be converted into the real concentration scale by multiplying by a factor
(real concentration / estimated concentration).
In the present case, we can use for example sample PROBE_11, which has a reference
concentration of Orange dye of 7 and an estimated concentration of 0.4443.
Use menu Edit - Append - … to append a new column at the end of the table, and name it
“MCR Orange real scale”. Go to Tasks - Transform - Compute_General…, and type the
expression:
V7=V3*(7/0.4443)
in the Expression space.
Compute_General Dialog
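The transform is just multiplication by a fixed scale factor. In the sketch below, only the first value (sample PROBE_11: reference 7, estimated 0.4443) comes from the tutorial; the other entries are invented for illustration.

```python
# Relative-to-real scale conversion, mirroring V7 = V3*(7/0.4443).
estimated_orange = [0.4443, 0.2101, 0.0888]
factor = 7 / 0.4443                      # real concentration / estimated concentration
real_scale = [v * factor for v in estimated_orange]
```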
Click OK to perform the calculation. A new matrix is created where the new column has
been filled with the values of estimated Orange dye concentrations converted to real scale.
Data matrix with new values
Description
What you will learn
Data table
Data plotting
Estimate the number of pure components and detect outliers with PCA
Run MCR with default settings
Tune the model’s sensitivity to pure components
Run MCR with a constraint of closure
Remove outliers and noisy wavelengths with recalculate
Description
In this tutorial we will utilize FTIR spectra of an esterification reaction to extract pure spectra
and their relative concentrations. The original data are from the University of Rhode Island
(Prof. Chris Brown), USA.
In situ FTIR spectroscopy was used to monitor the esterification reaction of isopropyl alcohol
and acetic anhydride using pyridine as a catalyst in carbon tetrachloride solution. The initial
concentrations of these three chemicals were 15%, 10% and 5% in volume, respectively.
Isopropyl acetate was one of the products in this typical esterification reaction. The reaction
was carried out in a ZnSe cell, and mixture spectra were measured at 4 cm-1 resolution. The
data set consisted of 25 spectra, covering approximately 75 minutes of the reaction. To shift
the equilibrium of the esterification, one-tenth of the volume was removed from the cell at
24, 45 and 60 minutes. An equal amount of a single reactant was added to the cell in the
sequence of acetic anhydride, pyridine and isopropyl alcohol.
Estimate the number of pure components and detect outliers with PCA
Run MCR with default settings
Tune the sensitivity to pure components setting
Run MCR with a constraint of closure
Use the Recalculate functionality in MCR
References:
Data table
Click the following link to import the Tutorial J data set used in this tutorial.
The data consist of 25 FTIR spectra of 262 variables covering the spectral region from 1860
to 852 cm-1. There are two row sets already defined: mixture and closure. Mixture contains
all the data, while the row set closure has the samples that will be used when using the
constraint of closure during the MCR.
Data plotting
Before starting the analysis, it is always important to have a look at the data. Make a line
plot of all of the spectra together.
Select all the samples by selecting the data set Tutorial_J in the project navigator. The data
table for the FTIR spectra of the samples will then be displayed in the data editor. Highlight
the samples, and use Plot - Line to display an overlay of the spectra in the viewer.
Line plot dialog
From this plot, one can see that there is a region around 1240 cm-1 that is changing over the
course of the reaction being monitored.
Line plot of FTIR spectra
Estimate the number of pure components and detect outliers with PCA
Principal Component Analysis (PCA) is recommended before running an MCR calculation. It
provides some information on the number of pure components and on sample outliers.
Task
Run a PCA on the raw data.
How to do it
Click Tasks - Analyze - Principal Component Analysis to run a PCA and choose the following
settings:
Matrix: Tutorial_J
Rows: All
Columns: All
Maximum components: 8
Mean center data: Not selected
Identify outliers: Selected
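Since mean centering is not selected, this PCA amounts to a singular value decomposition of the raw data matrix. The NumPy sketch below uses randomly simulated rank-3 "spectra" as a stand-in for Tutorial_J (only the sizes, 25 x 262, follow the text) to show how the explained variance flattens once the true number of components is reached:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 25 spectra x 262 variables built from 3 underlying components
# plus a little noise; a random stand-in, not the actual FTIR data.
scores_true = rng.random((25, 3))
loadings_true = rng.random((3, 262))
X = scores_true @ loadings_true + 1e-3 * rng.standard_normal((25, 262))

# PCA without mean centering = SVD of X itself.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)          # fraction of total variance per component
cumulative = np.cumsum(explained)
n_components = int(np.searchsorted(cumulative, 0.999)) + 1
```

On this simulated data the cumulative explained variance plateaus after the third component, mirroring what the tutorial observes in the Explained Variance plot.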
PCA Dialog
On the Validations tab, select Cross validation, then Setup… and choose full cross
validation from the drop-down list of cross validation methods. Click OK, then OK again on
the model inputs page.
Cross Validation Setup
Once the PCA calculations are done, click Yes to view the plots of the PCA model
immediately. The four-plot PCA Overview will be displayed in the viewer.
The upper right quadrant is a 2-D plot of the PCA loadings. For spectral data, it is more
informative to have a line plot of the loadings, as it then resembles a spectrum. Select the
existing loading plot, and go to Plot - Loadings - Line, which will give the plot of the first PC
loading, to replace the default plot in this quadrant. This plot, one can see, closely
resembles the FTIR spectra of the raw data. Scroll through the loadings plots for the other
PCs using the arrows on the toolbar.
You can see that the loadings begin to get noisy at about the sixth principal component. The
program recommends three components as the optimal number of PCs in this model. This is
seen in the Info box in the lower left corner of the display, and by clicking on the star on
the menu toolbar. Select the Explained Variance plot in the lower-right quadrant by clicking
on it with the mouse, then right mouse click to select View - Numerical View.
As you can see, the explained variance globally reaches a plateau from the third principal
component. The fourth and fifth PCs still show some slight increase; at that stage, it is
difficult to know whether they represent noise or real information. Now, click on the
Influence plot at the bottom-left corner of the Viewer, and use the PC navigation tool
to display the influence plot at PC4. You may observe that sample 1 sticks out to the right
with a high leverage, and that sample 8 sticks out upwards with a high residual variance.
PCA Influence Plot for PC4
Go to menu Plot - Sample Outliers to display a combination of four useful plots for outlier
detection. Highlight the Residual Sample Variance at the bottom-left quadrant, and use the
PC navigation arrows to change that to show results for PC4. This plot indicates a high
validation residual for sample 8.
Residual Sample Variance Plot for PC4
As there is no validation check in MCR, we may use the outlier information obtained from
PCA in our MCR modeling later on.
Rename the PCA model file in the project navigator by highlighting the PCA node, right
clicking and choosing Rename. Rename the model to “PCA Tutorial J”.
Run MCR with default settings
Task
Build a first MCR model with default settings.
How to do it
Go back to the data table Tutorial_J in the project navigator. Run an MCR by going to the
menu and selecting Tasks - Analyze - Multivariate Curve Resolution… and keep the default
settings:
Matrix: Tutorial_J
Rows: All
Columns: All
Go to the Options tab and verify that the default settings are selected. Make changes as
needed.
Information Box
One can compare those profiles with FTIR spectra of known constituents, and identify the 5
estimated spectra as pyridine, isopropyl alcohol, a possible intermediate, isopropyl acetate
and acetic anhydride, from curves 1-5 respectively.
Rename the new MCR model file created in the project navigator as MCR_Sensitivity150.
Run MCR with a constraint of closure
Task
Run MCR with a closure constraint. Compare two MCR models on the same data, with and
without closure.
How to do it
Among the MCR settings we have used so far, two types of constraints were not selected.
A constraint of Unimodality can be applied to restrict the resolution to concentration
profiles that have only one maximum.
With a constraint of Closure, the resolution will yield concentration profiles whose sum is
constant.
In the present case, acetic anhydride was added at 24 minutes (between the eighth and the
ninth samples), which means that the first 8 samples can be treated in closure conditions.
Go back to the data table and run a new MCR model with the following settings:
Rows: Closure [8] (contains the first 8 samples of the data table)
Cols: All
Non-negative concentrations: selected
Non-negative spectra: selected
Closure: selected
Unimodality: not selected
Sensitivity to pure components: 100
Once the computations are finished, choose to view the plots when prompted. Rename the
new MCR model file as “MCR_Closure”.
You may compare the resolved concentration and spectral profiles of pure components with
and without the closure setting. To do that, compute a new MCR model on sample set
“Closure” without checking the Closure constraint option. Save the new MCR model file as
“MCR_No_Closure” and compare the results to “MCR_Closure”.
The spectral profiles with and without the constraint of closure are very similar.
MCR Component Spectra
You can also observe that under constraint of closure, the concentrations of the pure
components always add up to 1.
MCR Component Concentrations
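Closure simply rescales each sample's concentration profile so that its components sum to a constant. A minimal sketch with invented numbers:

```python
import numpy as np

# Under closure, each row (sample) of the concentration matrix is rescaled so
# its components sum to 1. The values below are illustrative only.
C = np.array([[0.2, 0.5, 0.3],
              [0.9, 0.3, 0.6]])
C_closed = C / C.sum(axis=1, keepdims=True)
```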
Click on the bottom-left subframe where the Sample residuals are plotted to highlight it. If
needed, use the PC navigation arrow tool to change the view to show the sample residual
for the 4-component model.
Here you may notice a high residual showing for Sample 8, compared to the other samples.
Let us build a model without this sample. You will notice in the sample residuals plot that
the shape is similar to what is observed in the residual sample variance plot from the PCA
model on this same data set.
MCR Sample Residuals
Select the MCR_Defaults model in the project navigator, and right click to select Recalculate
- Without Marked… to specify a new MCR calculation without sample 8.
Menu to recalculate without marked
This brings you back to the MCR dialog, where sample 8 is now included in the Keep Out Of
Calculation field. You may launch the calculations to get the new MCR results.
MCR menu with sample 8 kept out
Similarly, you may want to keep non-targeted or highly overlapped wavelength regions
out of the model.
From the MCR_Defaults overview plots, click Plot - Variable Residuals.
MCR Variable Residuals
Mark any unwanted variables on the plot using the marking tools, for example variables
around 1100-1140 cm-1, which present very high residuals, then select the model
“MCR_Defaults” and right click to choose Recalculate - Without Marked… to specify a new
MCR calculation.
Description
What you will learn
Data table
Transform the raw spectra
Application of K-Means clustering
Application of Hierarchical Cluster Analysis (HCA)
Repeat the HCA using a correlation-based measure
Using the results of HCA to confirm the results of PCA
Description
This tutorial investigates the use of two well-known clustering methods, K-Means and
Hierarchical Cluster Analysis (HCA), for classification of raw materials used in the
pharmaceutical industry, by means of reflectance Near Infrared (NIR) spectroscopy.
References
Data table
Click the following link to import the Tutorial K data set used in this tutorial.
The data table contains 35 NIR spectra of seven classes of raw materials often used in
pharmaceutical manufacturing. Typically when developing classification models it is
recommended that more samples be used, being sure to cover the natural variability of each
class, but for this exercise, we use just five spectra for each class.
The diffuse reflectance spectra have been truncated to the wavelength region 1200 - 2200
nm for this particular example.
The type of raw material is defined in the name of each sample, and includes:
Citric acid
Dextrose anhydrous
Dextrose monohydrate
Ibuprofen
Lactose
Magnesium stearate
Starch
Click on OK and view the plot. Notice that there are distinct groups of spectra with similar
profiles. The main source of variation within each group comes from differences in the
absorbance (Y) axis. This baseline shifting is due to differences in sampling when preparing
and scanning, resulting in differences in light scattering by the samples measured in
reflectance by NIR spectroscopy.
Line plot of NIR spectral data
A convenient way to remove this variation is by the use of the SNV transform. This transform
reduces the scattering effects in such data by removing the mean value from each point in
the spectrum and dividing each point by the standard deviation of all points in the spectrum,
i.e. the SNV transform normalizes the spectrum to itself. The effect of the SNV transform is
to remove the variation in the absorbance scale (baseline shifting), while retaining the
original profile of the spectral data.
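The SNV transform described above can be written directly: for each spectrum (row), subtract its mean and divide by its standard deviation. The NumPy sketch below uses random toy data with deliberate per-row baseline offsets, not the tutorial's NIR spectra:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "spectra" with a different additive baseline per row, standing in for
# the 35 NIR spectra of the tutorial (random data, for illustration only).
spectra = rng.random((35, 101)) + 10 * rng.random((35, 1))

# SNV: center and scale each spectrum by its own mean and standard deviation.
snv = (spectra - spectra.mean(axis=1, keepdims=True)) / spectra.std(axis=1, keepdims=True)
```

After the transform every spectrum has mean 0 and standard deviation 1, so the row-wise baseline offsets are gone while each spectrum's shape is preserved.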
This is a commonly used practice in many NIR applications, especially for reflectance spectra
of solids. To perform the SNV transformation, right click in the matrix Tutor K Data and
select Transform - SNV. In the Rows dialog box, select All and in the Columns dialog box,
select All. You can preview the effect of the transformation by clicking in the Preview result
box, or just click OK to perform the transformation.
SNV dialog
The transformed data are displayed as a new node in the project navigator and the matrix is
called Tutor K Data_SNV. Plot the data to see how they now look by selecting all samples in
the new matrix and going to Plot - Line.
The resulting SNV-transformed spectra can be seen below.
Line plot of SNV-transformed NIR Spectra
The spectra are now ready for application of the clustering algorithms described below.
It is a good idea to save your work as you go. Save your project by going to File - Save As….
With K-means one can also make initial class assignments on the options tab, and set the
number of iterations to use to find the optimal number of clusters. Here we will allow the
algorithm to make assignments with no further input, and use the default number of 50
iterations.
Cluster analysis dialog options tab
Click OK to start the analysis and a new node will appear in the project navigator called
Cluster analysis. Right click on the node and select Rename and call this analysis K-Means.
You will notice that there is no graphical output for K-Means clustering. The output of the
cluster analysis is found in the Results folder. Expand this folder to display a node called
Tutor K Data_SNV_Classified, where the results reside. The classified data matrix is color-
coded according to the clusters (row sets) that have been identified. Expand this matrix.
Expand the rows and the columns folders and you will see that the rows contain seven
assigned clusters from Cluster-0 to Cluster-6. The columns folder contains the class, a single
column of classification results.
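The assignment just described can be reproduced outside the software with a minimal K-Means (Lloyd's algorithm) sketch. The toy data mimic this tutorial's 35-sample, 7-class layout but are randomly generated, and the farthest-point initialization is used here only to keep the small example deterministic:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "spectra": 7 well-separated materials x 5 replicates each (35 x 20).
centers_true = 10 * rng.random((7, 20))
X = np.repeat(centers_true, 5, axis=0) + 0.05 * rng.standard_normal((35, 20))

k = 7
# Farthest-point initialization: start somewhere, then repeatedly add the
# sample farthest from all centroids chosen so far.
centroids = [X[0]]
for _ in range(k - 1):
    dist = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
    centroids.append(X[np.argmax(dist)])
centroids = np.array(centroids)

for _ in range(50):  # 50 iterations, matching the dialog's default
    labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2), axis=1)
    centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else centroids[j] for j in range(k)])
```

On these well-separated toy groups every replicate of a material lands in the same cluster, matching what the tutorial observes: each cluster holds the 5 samples of one material.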
The K-Means data table is now classified by different colors, corresponding to the various
assigned classes. Study this table. You will notice that the K-Means algorithm has
successfully classified the data into seven distinct classes, each containing a single raw
material type. Click on the various cluster nodes in the project navigator and confirm that
each cluster contains 5 samples of the same material type. Using the Rename function,
assign cluster names according to the table above. The results of this operation are shown
below.
View of Assigned Classes in Navigator
Now that the separate classes have been defined, you can use this information to group
samples in plots. Go back to the matrix Tutor K Data_SNV and right click to
select Plot - Line. In the plot, you can now right click to select Sample Grouping. In the sample
grouping & marking dialog , first select the matrix containing the clustered data by clicking
on the Select result matrix button, which will allow you to choose the newly formed matrix
Tutor K Data_SNV_Classified. For cols, choose Class1; the row sets you have just
renamed will be listed as available row sets. Select all of these using », and click OK. The line plot will
now have all samples of each set displayed in a single color.
Sample grouping option
The structure of the dendrogram is dependent on the distance measure used, and great
care must be taken when interpreting the structures.
Task
Make a HCA model using the method of single linkage and Euclidean distance.
How to do it
Select Tasks - Analyze - Cluster Analysis… and make a model with the following parameters:
Use the drop-down lists to change the clustering method and distance measure. Click OK to
start the analysis. When the analysis is completed, the dendrogram is displayed in the
editor window, and a new Cluster analysis node is added to the project navigator.
HCA Euclidean Dendrogram
Before reviewing the analysis results, rename the new cluster analysis node in the project
navigator as HCA Euclidean.
Analyze the dendrogram and look at the order of the clusters from top to bottom. It can be
seen that each raw material type is uniquely defined and the carbohydrate materials Starch,
Lactose, Dextrose Monohydrate and Dextrose Anhydrous all group together in the
dendrogram. Towards the bottom, the clustering is not as distinct. This indicates that the
sample classification is based on some similarity in the chemistry of the samples, but it is not
as well defined as it could be. This is one aspect of HCA that must be kept in mind when
using this method.
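The same analysis can be sketched with SciPy's hierarchical clustering tools. The data here are randomly generated stand-ins laid out like the tutorial's 7 materials with 5 replicates each, not the real NIR spectra:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)

# Toy stand-in for the 35 SNV spectra: 7 separated groups of 5 samples.
centers = 10 * rng.random((7, 50))
X = np.repeat(centers, 5, axis=0) + 0.05 * rng.standard_normal((35, 50))

# Single linkage on Euclidean distances, then cut the tree into 7 clusters.
Z = linkage(pdist(X, metric="euclidean"), method="single")
clusters = fcluster(Z, t=7, criterion="maxclust")
```

Calling `scipy.cluster.hierarchy.dendrogram(Z)` would draw the tree itself.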
In the project navigator, expand the results folder for the HCA and under the rows folder,
you will see that seven clusters have been assigned to this analysis. These can be renamed
as was done above, so that the names coincide with the class name.
Repeat the HCA using a correlation-based measure
When dealing with spectroscopic data, the spectrum of a material is analogous to its
fingerprint. Using a straight distance measure such as the Euclidean measure may not be the
most sensitive way of assessing the similarities present within the data. The Absolute
correlation measure provides a better way of capturing the similarities among the spectral
variables of the materials. We will also change to complete linkage, which looks for the
farthest neighbor, as opposed to the nearest neighbor used in single-linkage HCA.
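An absolute-correlation distance is commonly defined as d = 1 - |r|, with r the Pearson correlation between two spectra, so two spectra with the same shape (even if scaled or inverted) are "close". A small sketch on random stand-in spectra:

```python
import numpy as np

rng = np.random.default_rng(5)

# Random stand-in "spectra" (6 samples x 40 variables), illustration only.
X = rng.random((6, 40))

r = np.corrcoef(X)        # 6 x 6 matrix of correlations between the rows
D = 1.0 - np.abs(r)       # absolute-correlation distance: identical shapes -> 0
```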
Task
Make a HCA model using the method of complete linkage and absolute correlation.
How to do it
Select Tasks - Analyze - Cluster Analysis. Use the following parameters:
Click OK to start the analysis and then click Yes to view the plots. The dendrogram for this
analysis is displayed in the editor window, and from the results node it is seen that 7 clusters
are identified.
Before reviewing the analysis results, rename the new cluster analysis node in the project
navigator as “HCA Correlation”.
Notice that all samples are uniquely classified into classes based on the raw material type.
This time there are three distinct clusters in the dendrogram. At the top of the dendrogram
is Starch. The next cluster of samples contains mostly carbohydrates: Lactose, Dextrose
Monohydrate, Dextrose Anhydrous and Citric acid. The last cluster includes the materials
Ibuprofen and Magnesium stearate, whose NIR spectra have features in the 1400 and 1700
nm regions.
HCA Absolute correlation distance dendrogram
The method of absolute correlation not only uniquely classified the individual raw materials,
but it was also able to use the information in the spectral variables far better, by grouping
the materials by their chemical properties.
In the results folder, select the data table Tutor K Data_SNV_Classified. Go to Insert -
Duplicate Matrix…. The following dialog box opens.
Duplicate Matrix
Rename the clusters of the duplicated matrix based on the materials’ name.
Renamed row ranges
We will use these results, in conjunction with PCA, to show how the two methods of
unsupervised pattern recognition can be used together.
Using the results of HCA to confirm the results of PCA
Task
Perform a PCA on the SNV transformed data and group the samples based on the results of
HCA.
How to do it
Select Tasks - Analyze - Principal Component Analysis…. Use the following parameters:
PCA dialog
Click OK to start the analysis and then click Yes to view the plots. The PCA Overview for this
analysis is displayed in the workspace.
In the Scores Plot, right click and select Sample Grouping. From the Select drop-down list,
choose the results from your clustering to list the available row sets of the different
clusters. Click on the » button to select all clusters in the analysis and then click OK.
Sample grouping dialog
Drag the updated scores plot so that it fills most of the screen and analyze the clustering.
The scores plot shows that PC1 explains 66% of the data variance, and PC2 describes 19%.
The main difference along PC1 is between carbohydrate materials and fatty acid based
materials (i.e. Magnesium Stearate and Citric Acid) and PC2 is differentiating between the
starch and ibuprofen samples.
It can be seen that the clustering of the materials as established by HCA is consistent with
that of PCA. PCA provides more information on the groupings as the spectral loadings can be
related to the spectral features which describe the materials. To have a more informative
view of the PCA loadings it is better to look at them as a line plot, which then resembles a
spectrum. Activate the loadings plot in the upper-right quadrant, and right click to select
PCA - Loadings - Line. The loadings plot now shows which spectral features are related to
the first PC, which explains most of the variance in this data set. Use the next arrow to
scroll to the next PC loadings plot.
PCA Overview Plot
Now that the work has been done it is a good idea to save the results so you can refer to
them in the future.
This exercise has shown that, when more data (more samples for each class) are available,
one can proceed to make a classification model to identify these seven raw materials
from their NIR spectra. Classification methods such as PLS-DA and SIMCA can be
used to develop models for the classification of future samples.
Description
What you will learn
Data table
Open and study the data
Build an L-PLSR model
Interpret the results
Variances
Products: X Scores
Product descriptors: X Correlation Loadings
Consumer descriptors: Z Correlation Loadings
Consumer liking of the products: Y Correlation Loadings
Overview of the L-PLS Regression solution
Verify the results
Products liking
Liking Y vs. consumer background Z
Product descriptor rows in X
Product descriptor columns in X
Bibliography
Description
Consumer studies represent an application field where “L-shaped” data matrix structures
(X, Y, Z) such as described in the following are common: A set of I products has been assessed
by a set of J consumers, e.g. with respect to liking, with results collected in “liking” data table
Y (I×J). In addition, each of the I products has been “measured” by K product descriptors
(“X-variables”), reflecting chemical or physical measurements, sensory descriptions,
production facts etc., in data table X (I×K). Moreover, each of the J consumers has been
characterized by L consumer descriptors (“Z-variables”), comprising sociological background
variables like gender, age, income, etc., as well as the individual’s general attitude and
consumption patterns; these are collected in data table Z (J×L). Relevant questions could
then be: Is it possible to find reliable patterns of variation in the liking data Y, which can be
explained from both product descriptors X and from consumer descriptors Z? Is it possible to
predict how a new product will be liked by these consumers, by measuring its X-variables? Is
it possible to predict how a new consumer group will like these products, from their
background Z-variables?
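The three tables can be kept straight with a quick shape check: X and Y share their product rows, while Y's columns and Z's rows index the same consumers. The sizes below follow the text (I = 6, J = 125, K = 10); L = 4 is an arbitrary placeholder, and the arrays are random stand-ins for the actual tables.

```python
import numpy as np

rng = np.random.default_rng(6)

# I products, J consumers, K product descriptors, L consumer descriptors.
I, J, K, L = 6, 125, 10, 4
X = rng.random((I, K))   # X - ApplesSensoryChem: products x product descriptors
Y = rng.random((I, J))   # Y - ApplesLiking: products x consumers
Z = rng.random((J, L))   # Z - AppleChildBackground: consumers x descriptors

# The "L" shape: X and Y share product rows; Y's columns and Z's rows
# refer to the same consumers.
shared_products = X.shape[0] == Y.shape[0]
shared_consumers = Y.shape[1] == Z.shape[0]
```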
The data consist of information gathered on Danish children’s liking of apples. Their
response to various apple types is termed Y. Chemical, physical and sensory descriptors of
these apple types are called X, and sociological and attitude descriptors on these children
are in matrix Z. The purpose of the analysis is to find patterns in these X-Y-Z data that are
causally interpretable and have predictive reliability.
We are now going to build an L-PLS regression (L-PLSR) model linking the panelists’ sensory,
chemical and physical evaluations to the consumers and their sociological and attitude
descriptors. The model will summarize all the information about consumers, consumers’
preference, the products and their characteristics.
References:
Data table
We are going to study three data tables of different sizes. The structure of the data set is as
follows:
X - ApplesSensoryChem
Y - ApplesLiking
Z - AppleChildBackground
L-PLSR Structure
Red
Sweet
Sour
Glossy
Hard
Round
The contents of acid (ACIDS) and sugar (SUGARS) were determined as malic acid and
soluble solids, respectively.
Based on prior theory on human sensation of sourness, the ratio ACIDS/SUGARS was
included as a separate variable (Kuhn and Thybo, 2001).
Together, the sensory, chemical and instrumental variables constituted K = 10 product
descriptors, which will here be referred to as X(I×K) for the I = 6 products.
Y data
The Y data (Y - ApplesLiking) consist of information gathered on Danish children’s liking of
apples. Their response to various apple types is termed Y. Each child was asked to express
the liking of the appearance of the six apple cultivars, using a five-point facial hedonic scale.
One apple at a time was shown to the child, so that the child would not concentrate on
comparing the appearances. All samples were presented in randomized order. The resulting
liking data for the I = 6 products × J = 125 consumers will here be termed Y(I×J).
Z data
The Z data table (Z - AppleChildBackground) contains the information collected about the
consumers: sociological and attitude descriptors on these children.
The consumers were children aged 6 to 10 years (51% boys, 49% girls), recruited from a local
elementary school. A total of 146 children were tested and included in the original
publication of Thybo et al. (2004). For simplicity, only the J = 125 children that had no
missing values in their liking and background data are included in the present study.
First, each child was asked to look at a table with five different fruits and answer the
questions: “If you were asked to eat a fruit, which fruit would you then choose, and which
fruit would be your last choice?” The resulting responses are named “fruitFirst” and
“fruitLast”, where fruit is one of RedA (Red apple), GreenA (Green apple), Pear, Bana
(Banana), or Orange. Additional descriptors “AFirst” and “ALast” are also available which
correspond to either red or green apples.
The child was also questioned about how often he/she ate apples, by having the following
opportunities: “every day” (here coded as value 4), a couple of times weekly (3), “a couple of
times monthly” (2), “very seldom” (1); this descriptor is here named “EatAOften”. (A few of
the children responded “do not know” to how often he/she ate apples. To reduce the
number of missing values, this was taken as indicating very low apple consumption, and
coded as 0.) In addition, the child’s gender and age were noted. These two sociological
descriptors were used, together with the attitude variables fruitFirst and fruitLast and eating
habit-variable EatAOften, as L = 15 consumer background descriptors Z(J×L) for the J = 125
children.
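The re-coding of the questionnaire answers into numeric Z-variables can be sketched as follows (a hypothetical helper, not part of the software; the AFirst/ALast descriptors are omitted for brevity):

```python
# Coding of the apple-eating frequency question, as described in the text.
eat_a_often_code = {
    "every day": 4,
    "a couple of times weekly": 3,
    "a couple of times monthly": 2,
    "very seldom": 1,
    "do not know": 0,   # treated as very low apple consumption
}

fruits = ["RedA", "GreenA", "Pear", "Bana", "Orange"]

def code_child(gender, age, fruit_first, fruit_last, eat_answer):
    """Return one row of Z: gender, age, five fruitFirst dummies,
    five fruitLast dummies, and EatAOften."""
    row = [1.0 if gender == "boy" else 0.0, float(age)]
    row += [1.0 if fruit_first == f else 0.0 for f in fruits]
    row += [1.0 if fruit_last == f else 0.0 for f in fruits]
    row.append(float(eat_a_often_code[eat_answer]))
    return row

z_row = code_child("girl", 8, "GreenA", "Bana", "every day")
# 2 sociological + 5 + 5 dummies + 1 habit variable = 13 descriptors here;
# the tutorial's Z table has L = 15 because it also includes AFirst and ALast.
```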
Open and study the data
Click the following link to import the Tutorial L data set used in this tutorial.
There are three matrices:
X - ApplesSensoryChem
Y - ApplesLiking
Z - AppleChildBackground
Click on the X Weights option. Select all the variables by clicking on the All button.
Select the option “A / (SDev + B)” with the radio button. Finally, click on the Update
button.
Click on the Y Weights option and use weighting option “A / (SDev + B)” for all the
variables.
Click on the Z Weights option and use weighting option “A / (SDev + B)” for all the
variables.
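With A = 1 and B = 0, the “A / (SDev + B)” weighting is plain standardization to unit variance. A minimal NumPy sketch of what this weighting does to a data table (an illustration, not the software's implementation):

```python
import numpy as np

def weight_matrix(M, A=1.0, B=0.0):
    """Weight each column of M by A / (SDev + B). With A=1, B=0 this is
    standardization to unit variance; a small B > 0 guards against
    division by zero for near-constant columns."""
    sdev = M.std(axis=0, ddof=1)
    return M * (A / (sdev + B))

rng = np.random.default_rng(1)
# Columns with very different raw scales, as in mixed sensory/chemical data:
X = rng.normal(loc=5.0, scale=[1.0, 10.0, 0.1], size=(20, 3))
Xw = weight_matrix(X)
# After weighting, every column has sample standard deviation A = 1,
# so no variable dominates the model purely because of its scale.
```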
Once all necessary options have been selected, click OK to start the computations.
Interpret the results
View the results and study the different plots:
LPLS Overview
Correlation Loadings
Correlation
Variances
Study the bottom right plot in the LPLS overview. It presents the explained variances of the
three data tables: X (blue), Y (red) and Z (green).
Most variation in the product descriptor table X is explained in 3-4 factors, whereas all 5
factors seem to be relevant for explaining variation in the Y- and Z-tables. A total of 72% of
the consumer background variation in Z is explained by the full model.
In total 21% of the variation in the product liking table Y is explained using all 5 factors. The
majority (13%) is explained by Factor-1, whereas 4% is explained by Factor-2.
Products: X Scores
A scatter plot of the X scores describing apple types is given in the top left corner under
Correlation Loadings.
Scores plot
The first two factors explain 54% and 14% of the variation in X. Factor-1 describes variation
separating GrannySmith (and to some degree Mutzu) from the group of products defined by
Gloster, Jona, and Gala. Factor-2 spans a direction where Granny Smith and Mutzu represent
the extremes.
This plot shows the main patterns of the sensory, instrumental and chemical product
descriptors. Interpreting Factor-1 first, it seems the main variation spans two groups of
predictors, where a group describing redness and sweetness is negatively correlated to a
group related to sourness, hardness and roundness. Factor-2, on the other hand, separates
glossiness and roundness from sugar content, indicating that round, glossy cultivars tend to
contain less sugar than the other apples in the study.
Comparison with the previous scores plot confirms that e.g. Granny Smith is somewhat sour,
hard and round, and it is not red (but green). As expected, the red cultivars Gala, Jona and
Gloster are found to the right. Elstar has a red and green, marbled appearance, which
explains why its score value for Factor-1 is close to zero (neither red nor green).
Here, the main patterns of the consumer background descriptors picked up by the model are
seen. Factor-1 spans a tendency to choose the green apple first (GreenAFirst) against the
tendency to choose the red apple first (RedAFirst). This component explains 16% of Z (as can
be seen from the scores plot or explained variance plot above). It also seems that older
children tend to prefer red to green apples, while gender is a poor descriptor for children’s
preferences.
The second factor (explaining 22% of Z) reflects the children’s preferences for different fruits.
Those who eat apples often tend to prefer apples over bananas, for instance. Similarly, the
children who particularly dislike green apples seem to have a somewhat higher preference
for other fruits.
This plot shows the main, product-related patterns of the consumers with respect to liking.
The children grouping towards either end of the horizontal axis likely have a very clear
preference for green or red apples over the alternative.
Products liking
Plot a scatter plot of the most extreme products (liking GrannySmith vs. liking Jonagold) and
look at the correlation. As the responses are restricted to 5 levels, many of the values are
superimposed in the plot. Add a regression line to get a better impression of the relation
between the factors. Optionally, add a statistics table to the plot, and change the point sizes,
point labels and x-axis limits via the menu View - Properties.
With only five response levels possible, many data points are superimposed and the pattern
is difficult to see. But the raw liking data are clearly negatively correlated (r = -0.4 over the
125 subjects), as expected.
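The value reported here is an ordinary Pearson correlation over the 125 children. A sketch with simulated liking scores (random placeholders, not the tutorial data):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical liking scores for two products over 125 children, on a
# 5-point scale: one latent preference axis drives them in opposite ways.
base = rng.normal(size=125)
liking_granny = np.clip(np.round(3 - base + rng.normal(scale=1.0, size=125)), 1, 5)
liking_jona = np.clip(np.round(3 + base + rng.normal(scale=1.0, size=125)), 1, 5)

# Pearson correlation between the two liking columns:
r = np.corrcoef(liking_granny, liking_jona)[0, 1]
print(round(r, 2))  # clearly negative, as the cultivars appeal to opposite preferences
```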
There is a tendency (r = 0.52 over 125 subjects) that children who chose the green apple first
also reported liking GrannySmith.
Select All for Rows and Cols. For the Transformation field, select Mean for Center and
Standard deviation for Scale. Optionally, check Preview result.
Center and Scale window
Again, these two products are seen to be described by quite opposite terms: Jonagold is
sweet, red and high in sugars compared to GrannySmith, while GrannySmith has a high
acids/sugars ratio and is sour, hard and round compared to Jonagold. The correlation is -0.72
between these two rows of 10 standardized X-variables.
As expected from the L-PLS regression model, these two variables are almost orthogonal,
with r = 0.07 over the six products.
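The Center and Scale transformation, and a row correlation of the kind quoted above, can be reproduced in a few lines (a sketch on random placeholder data, not the apple data set):

```python
import numpy as np

def center_and_scale(M):
    """Mean-center each column and divide it by its standard deviation,
    matching the Center (Mean) and Scale (Standard deviation) options."""
    return (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)

rng = np.random.default_rng(3)
X = rng.normal(size=(6, 10))      # 6 products x 10 descriptors
Xs = center_and_scale(X)

# Correlation between two product rows across the 10 standardized variables:
r = np.corrcoef(Xs[0], Xs[1])[0, 1]
assert -1.0 <= r <= 1.0
```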
Bibliography
B.F. Kuhn, A.K. Thybo, The influence of sensory and physiochemical quality on Danish
children’s preferences for apples, Food Qual. Pref. 12, 543-550 (2001).
H. Martens, E. Anderssen, A. Flatberg, L.H. Gidskehaug, M. Hoy, F. Westad, A. Thybo, M.
Martens, Regression of a data matrix on descriptors of both its rows and of its columns via
latent variables: L-PLSR, Computational Statistics & Data Analysis 48, 103-123 (2005).
A.K. Thybo, B.F. Kuhn, H. Martens, Explaining Danish children’s preferences for apples using
instrumental, sensory and demographic/behavioral data, Food Qual. Pref. 15, 53-63 (2004).
Description
What you will learn
Data table
Create a PLS model
Interpret a PLS model
Variance plot
Scores plot
Loadings plot
Weighted regression coefficients
Stability plots
Stability in loading weights plots
Stability in scores plots
Conclusions
Description
In this work environment study, PLS regression was used to model 34 samples corresponding
to 34 departments in a company. The data were collected from a questionnaire about
overall job satisfaction (Y), modeled from 26 questions (X1, X2, …, X26) about repetitive
tasks, inspiration from the boss, helpful colleagues, positive feedback from the boss, etc.
The unit for these questions was the percentage of people in each department who ticked
“yes”, e.g. “I can decide the pace of my work”. The response variable was the overall job
satisfaction, on a scale from 1 to 9.
PLS regression
Validation methods
Uncertainty estimates
Interpretation of plots
This tutorial is also presented differently from the other tutorials, with less detailed
instructions for each task, making it slightly more demanding.
Data table
Click the following link to import the Tutorial M data set used in this tutorial. The data
already have several row and column sets defined, but you must define the column set for
the response variable, job satisfaction.
Create a PLS model
Click Tasks - Analyze - Partial Least Squares Regression to run a PLS regression and choose
the following settings:
Model inputs
X Weights
1/SDev
Select all the variables, select the radio button A/(SDev+B), and click Update.
Y Weights
1/SDev
Select “Job satisfaction”, select the radio button A/(SDev+B), and click Update.
Validation
Full cross-validation. Click on the button Setup… to select this option.
Select the Uncertainty test for the optimal number of factors.
Select Uncertainty test
Variance plot
The initial model indicated 2 factors as the optimal model dimension by full cross-validation,
which created 34 submodels, each with 1 sample left out. As a second step, the uncertainties
for all X-variables were estimated by jack-knifing of various model parameters, based on the
two-factor model.
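The resampling logic of full cross-validation, one submodel per left-out sample, can be sketched as follows (ordinary least squares stands in for the PLS submodels; this illustrates the scheme, not the software's algorithm):

```python
import numpy as np

def loo_cv_press(X, y, fit, predict):
    """Leave-one-out cross-validation: one submodel per sample, returning
    the squared prediction error for each left-out sample."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        model = fit(X[keep], y[keep])
        errors[i] = (predict(model, X[i:i + 1])[0] - y[i]) ** 2
    return errors

# Ordinary least squares as a stand-in for the PLS submodels:
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda b, X: X @ b

rng = np.random.default_rng(4)
X = rng.normal(size=(34, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=34)
press = loo_cv_press(X, y, fit, predict)
print(len(press))  # 34 submodels, one per department
```

The collected submodel parameters are what the jack-knife uncertainty estimates are built from.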
In the variance plot the validation curve (red) shows 62% explained variance for 2 factors,
which is rather good for data of this kind.
Plot of explained y-variance
Scores plot
The scores plot shows that the samples are well distributed with no apparent outliers.
Plot of scores
Loadings plot
The relations between all variables are more easily interpreted in the correlation loadings
plot rather than the loadings as the explained variance can be seen directly in the plot; the
inner circle depicts 50% explained variance and the outer 100%.
Activate the X-Loadings plot by clicking in it, then use the corresponding shortcut button; it
will display the two circles.
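Correlation loadings are simply the correlations between each variable and the score vectors; the squared correlations sum to the fraction of a variable's variance the factors explain, which is what the 50% and 100% circles mark. A sketch (principal component scores stand in for the model's scores; random placeholder data):

```python
import numpy as np

def correlation_loadings(M, scores):
    """Correlation of each variable in M with each score vector."""
    n_vars, n_comp = M.shape[1], scores.shape[1]
    R = np.empty((n_vars, n_comp))
    for j in range(n_vars):
        for a in range(n_comp):
            R[j, a] = np.corrcoef(M[:, j], scores[:, a])[0, 1]
    return R

rng = np.random.default_rng(5)
X = rng.normal(size=(34, 6))
Xc = X - X.mean(axis=0)
# Stand-in scores: the first two principal components of the centered data.
U, s, _ = np.linalg.svd(Xc, full_matrices=False)
T = U[:, :2] * s[:2]

R = correlation_loadings(Xc, T)
# Per-variable variance explained by the two factors; a variable on the
# inner circle has 0.5, one on the outer circle has 1.0.
explained = (R ** 2).sum(axis=1)
assert (explained <= 1.0 + 1e-9).all()
```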
The most important variables for job satisfaction (Y) seem to be related to how the
employees evaluate their leader. Questions related to the work span the direction from
upper left to lower right in the plot.
Plot of correlation loadings
The variables found significant are marked with circles in the loadings plot. If they are not
shown by default, activate the marking of significant variables with the corresponding toolbar button.
Although the variable pattern can be interpreted in the correlation loadings, the importance
of the variables is better summarized in terms of the regression coefficients in this case.
Recall that the loadings describe the structure in X and Y, whereas the loading weights are
more relevant for interpreting the importance of the variables in modeling Y. Alternatively,
the predefined plots under the weighted regression coefficients may be investigated.
The automatic function Mark significant variables shows clearly which variables have a
significant effect on Y.
When plotting the regression coefficients one can also plot the estimated uncertainty limits
as an approximate 95% confidence interval as shown below.
Plot of the weighted regression coefficients
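The idea behind the plotted uncertainty limits can be sketched as follows: collect the leave-one-out estimates of a coefficient, derive a jack-knife standard error, and form an approximate 95% interval (a simplified illustration; the software's uncertainty test uses a related but not necessarily identical formula):

```python
import numpy as np

def jackknife_ci(estimates, full_estimate, z=1.96):
    """Approximate 95% confidence limits for a coefficient from its
    leave-one-out estimates and the full-model estimate (jack-knife sketch)."""
    n = len(estimates)
    se = np.sqrt(((estimates - full_estimate) ** 2).sum() * (n - 1) / n)
    return full_estimate - z * se, full_estimate + z * se

# A coefficient whose interval crosses zero is not significant at the 5% level:
lo, hi = jackknife_ci(np.array([0.02, -0.01, 0.03, 0.00, -0.02]), 0.01)
print(lo < 0 < hi)  # interval crosses zero here
```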
For example, the variable “disrespect” has uncertainty limits crossing the zero line: it is not
significant at the 5% level. Zoom in with Ctrl+right click to see details.
13 out of 26 X-variables are found to be significant at the 5% level. However, nothing
prevents one from setting the cutoff at another level, depending on the application.
Variables with large regression coefficients may not be significant because the uncertainty
estimate indicates that the relation between this variable and Y is due to only some samples
spanning the range. One effective way to visualize this is to show the stability plot.
The corresponding p-values are given in the output node, in the validation folder.
p-values for the regression coefficients
Stability plots
Stability in loading weights plots
Go back to the loadings plot. By clicking the toolbar button Stability plot the model
stability is clearly visualized.
Stability in loading weights plots
Variable 11, “Help”, is not very stable: the two departments 15 and 26 have a much lower
value than the others and are thus influential for this variable. This indicates that the variable
is probably not reliable for predicting “job satisfaction”.
This can be studied by looking at the scatter plot of the “Help” vs. “job satisfaction”.
To plot it, go back to the data table “Work environment case”. Select column 11, “Help”, as
well as column 27, “Job satisfaction” (hold Ctrl to select both).
Then go to Plot - Scatter or click on the corresponding toolbar icon.
“Help” vs. “job satisfaction”
This plot shows that the variable X11, “Help” (Do you find your colleagues helpful?), is only
weakly correlated with “job satisfaction”. The two suspicious departments are influential in
this relation.
Go back to the scores plot. By clicking the toolbar button Stability plot the model
stability is clearly visualized.
Stability plot of scores
For each sample one can see a swarm of its scores from each submodel. There are 34
sample swarms. In the middle of each swarm is the score for the sample in the total model.
By clicking on any point, information about the segment is given. Thus, in the case of full
cross-validation one can directly see how the models change when a particular sample is
kept out. In other words, a sample that makes the model change when it is left out of the
segment has influenced all the other submodels due to its uniqueness.
The score and loading stability plots are also very useful for higher factors in models as they
indicate when noise is becoming the main source for a specific component.
Conclusions
In the work environment example, looking at the global picture from the stability scores
plot, one can conclude that all samples seem good and the model seems robust. Also, the
uncertainty test indicates 13 significant variables at the 5% level, as visualized with the
95% confidence intervals.
35.3. Quick start
35.3.1 Quick start tutorials
PCA
Projection
SIMCA
MLR
PCR
PLS
Prediction
Cluster
MCR
LDA
LDA classification
SVM
SVM classification
LPLS
Data structure
The PCA model has been developed on the variable set “Descriptors”. It needs 4 PCs. Have a
look at the PCA quick start tutorial for more information on this model.
Go to Tasks - Predict - Projection.
In the dialog box Project to Latent Space, make the following selections:
Projection inputs
Look at the residual variance plot to see how well the new samples are described. Look at
the green line: it goes rapidly to zero, indicating a good description.
Projection residual variance
For more information on the plots go to the Interpreting Projection plots section
PCA model 1
PCA model 2
PCA model 3
PCA model 4
Data structure
Matrix: “FiveRawMaterials-small”
Rows: “Test”
Cols: “Spectra”
Class model: “PCA_AcDiSol”, “PCA_DiCaP”, “PCA_Kollidon”, “PCA_MCC”
Leave the default value for the Suggested number of PCs. For more information on the
optimal number of PCs to use, look up the PCA theory.
SIMCA inputs
For more information on the plots go to the Interpreting SIMCA plots section
Data structure
Look at the quality of the regression in the predicted vs. reference plot. The R-square is
about 1, which is very good. In addition, the error is small.
MLR predicted vs. reference
For more information on the plots go to the Interpreting MLR plots section
Data structure
In the validation tab, select a full cross-validation. To do so, select the radio button Cross
validation, then click on the Setup button and select the Full option in the drop-down
menu.
For PCR it is useful to enable the Uncertainty test by ticking the associated box. This test
will mark the important variables in the model in the loadings plot and the regression
coefficients plot. For the Number of factors to use, leave the default option, use optimal
number of factors.
PCR validation - cross-validation setup
There are several options in the algorithm tab. Look at the information in the Additional
information field. Select the SVD option as the data set is rather small.
PCR algorithm
Look at the scores. Notice that the 2 sundae samples are clustered together. “Apple pie”
and “Pommes frites” are also very close, which means that these samples have similar
composition.
PCR scores
Along PC1, “Protein (%)” and “Carbohydrates (%)” are diametrically opposite. This means
that they are anti-correlated, varying in opposite directions. Samples with positive scores
on PC1 are rich in protein, like all the burgers. Samples with negative scores on PC1 are rich
in carbohydrates and low in protein, such as “Pommes Frites”.
PC2 is describing the variation of “Fat (%)” and “Energy”. The more fat in the composition
the more energetic the product. The products that will have negative scores along PC2 have
a high fat content such as “Filet-O-Fish”.
The variable “Saturated fat (%)”, which is inside the 50% correlation circle, is not a
descriptive variable. Its variations are not structured, and it may be considered irrelevant
for this data set.
Also note that some variables are circled; they are the important variables as determined by
the uncertainty test.
PCR correlation loadings
Check the quality of the regression with 2 and 3 factors. As can be seen, the results for 3
factors are much better, both for the R-square and the RMSE.
The R-square in validation (red value) is 0.998, which is very good. The error in
cross-validation is about 0.08 on a scale of 6 to 13 kJ/g, which is rather small.
PCR predicted vs. reference
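The two quality numbers in the predicted vs. reference plot can be recomputed directly from the paired values (a sketch with made-up numbers on the same 6 to 13 kJ/g scale):

```python
import numpy as np

def r_square_and_rmse(y_ref, y_pred):
    """R-square and root mean squared error between reference and
    predicted values, the quantities shown in the predicted vs. reference plot."""
    resid = y_ref - y_pred
    ss_res = (resid ** 2).sum()
    ss_tot = ((y_ref - y_ref.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot, np.sqrt((resid ** 2).mean())

y_ref = np.array([6.2, 7.5, 8.8, 10.1, 11.4, 12.7])            # energy, kJ/g
y_pred = y_ref + np.array([0.05, -0.1, 0.08, -0.05, 0.1, -0.08])  # small errors
r2, rmse = r_square_and_rmse(y_ref, y_pred)
print(round(r2, 3), round(rmse, 3))
```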
For more information on the plots go to the Interpreting PCR plots section
Data structure
X - Cols: “Composition”
Y - Matrix: “mcdo”
Y - Rows: “Training”
Y - Cols: “Energy”
Keep the Identify outliers and Mean center data boxes ticked. Leave the Maximum
components at “4”.
Go to the next tab, X - Weights. Keep the default settings, which do not apply any weighting
to the variables, as the variables have the same range of variation (even the energy, which
is not in the same unit). Keep the default setting also for Y - Weights.
PLS model inputs - weights
In the validation tab, select a full cross-validation. To do so, select the radio button Cross
validation, then click on the Setup button and select the Full option in the drop-down
menu.
For PLS it is useful to enable the Uncertainty test by ticking the associated box. This test
will mark the important variables in the model in the loadings plot and the regression
coefficients plot. For the Number of factors to use, leave the default option, use optimal
number of factors.
PLS validation - cross-validation setup
There are several options in the algorithm tab. Look at the information in the Additional
information field. Select the NIPALS option, as it is the classical algorithm.
PLS algorithm
Look at the scores. Notice that the 2 sundae samples are clustered together. “Apple pie”
and “Pommes frites” are also very close, which means that these samples have similar
composition.
PLS scores
Factor 1 is describing the variation of the X-variable “Fat (%)” and Y-variable “Energy”. The
more fat in the composition the more energetic the product. The products that have positive
scores along factor 1 have a high fat content such as “Filet-O-Fish”, “Pommes Frites” and
“Apple Pie”.
Along factor 2, “Protein (%)” and “Carbohydrates (%)” are diametrically opposite. This
means that they are anti-correlated, varying in opposite directions. Samples with positive
scores on factor 2 are rich in protein, like all the burgers. Samples with negative scores on
factor 2 are rich in carbohydrates and low in protein, such as “Pommes Frites”.
The variable “Saturated fat (%)”, which is inside the 50% correlation circle, is not a
descriptive variable. Its variations are not structured, and it may be considered irrelevant
for this data set.
Also note that some variables are circled; they are the important variables.
PLS correlation loadings
Check the quality of the regression with the appropriate number of factors: 2.
The R-square in validation (red value) is 0.99, which is very good. The error in
cross-validation is about 0.16 on a scale of 6 to 13 kJ/g, which is rather small.
PLS predicted vs. reference
For more information on the plots go to the Interpreting PLS plots section
Data structure
The PLS model has been developed on the X-variables “Composition” and Y-variable
“Energy”. It needs 2 factors. Have a look at the PLS quick start tutorial for more information
on this model.
Go to Tasks - Predict - Regression.
In the dialog box Predict Using Regression Model, make the following selections:
Prediction inputs
The results can also be seen as a table, where it is even easier to compare the quality of the
prediction by looking at how close the predicted values are to the reference values. Don’t
forget to look at the results for 2 factors.
In the table, look at the values for “Grilled chicken”: the predicted value of 8.23 is quite
close to the reference value of 8.14.
Prediction results as table
For more information on the plots go to the Interpreting Prediction plots section
Data structure
For more information on the plots go to the Interpreting cluster analysis plots section
Data structure
Matrix: “Dye”
Rows: “samples”
Cols: “360-800nm”
Column ranges “Blue”, “Green” and “Orange” describe the composition, and “360-800nm”
describes the spectra; the row range “samples” is a set of 36 samples.
We will not use any initial guess, but if you wish to learn more, read the MCR dialogs section.
MCR model inputs
Look at the total residuals. The minimum is reached at 3 components, which means 3
components are needed.
Total residuals
Look at the spectra with 3 components displayed. The shape of the spectra looks good, as it
is very close to a signal shape. The 3 spectra have the same intensity, which is a good sign
for the results.
Spectra
Look at the concentrations. The concentrations almost sum up to 1, which is expected for a
mixture. The green component seems to always be in the highest concentration.
Concentrations
For more information on the plots go to the Interpreting MCR plots section
Data structure
Go to the next tab, Weights. Leave the weights equal to 1, as the data are spectral data.
LDA inputs - weights
Method: “Mahalanobis”
Prior probability: “Calculate prior probabilities from training set”
LDA options
Go to the Results folder in the project navigator to look at the confusion matrix. All the
samples are well classified.
Confusion matrix
For more information on the results go to the Interpreting LDA results section
Data structure
The LDA model has been developed on the “Training” set. For more information on the
model check the instruction of the LDA quick start.
Go to Tasks - Predict - Classification - LDA….
In the dialog that opens make the following selections:
Data structure
Go to the next tab, Weights. Leave the weights equal to 1, as the data are spectral data. In
the validation tab, enable cross-validation and select 3 for the number of segments.
SVM weights - validation
For more information on the results go to the Interpreting SVM results section
column range: “All variables” containing all the continuous variables and a category
variable “Type”;
row ranges: “AcDiSol”, “DiCaP”, “Kollidon”, “MCC” that group the samples by
category and a “Training” and a “Test” set of 4 samples.
Data structure
Matrix: “FiveRawMaterials”
Rows: “Test”
Cols: “All variables”
Data structure
Matrix: “mcdo”
Rows: “Training”
Cols: “Descriptors”
Keep the Identify outliers and Mean center data boxes ticked. Leave the Maximum
components set to “5”.
Go to the next tab, Weights. Keep the default settings, which do not apply any weighting to
the variables, as the variables have the same range of variation (even the energy, which is
not in the same unit).
PCA model inputs - weights
In the validation tab, select a full cross-validation. To do so, select the radio button Cross
validation, then click on the Setup button and select the Full option in the drop-down
menu.
PCA validation - cross-validation setup
There are two options in the algorithm tab. Select the SVD option as there are no missing
values.
PCA algorithm
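The SVD route to PCA, suitable when the data are complete, can be sketched in a few lines (an illustration only, not the software's implementation):

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA of a complete (no missing values) matrix: mean-center, then SVD."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    explained = s ** 2 / (s ** 2).sum()   # variance fraction per component
    return scores, loadings, explained[:n_components]

rng = np.random.default_rng(6)
X = rng.normal(size=(14, 6))              # e.g. 14 menu items x 6 nutrients
scores, loadings, explained = pca_svd(X, 2)
# Projecting the centered data through the loadings reproduces the scores:
assert np.allclose((X - X.mean(axis=0)) @ loadings, scores)
```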
Look at the scores plot. Notice that the 2 sundae samples are clustered together. “Apple
pie” and “Pommes frites” are also very close, which means that the samples have the same
type of composition.
PCA scores
Along PC1, “Protein (%)” and “Carbohydrates (%)” are diametrically opposite. This means
that they are anti-correlated, varying in opposite directions. Samples with positive scores
on PC1 are rich in carbohydrates and low in protein, such as “Pommes Frites”. Samples with
negative scores on PC1 are rich in protein, like all the burgers.
PC2 is describing the variation of “Fat (%)” and “Energy”. The more fat in the composition
the more energetic the product. The products that will have negative scores along PC2 have
a high fat content such as “Filet-O-Fish”.
The variable “Saturated fat (%)”, which is inside the 50% correlation circle, is not a
descriptive variable. Its variations are not structured, and it may be considered irrelevant
for this data set.
PCA correlation loadings
For more information on the plots go to the Interpreting PCA plots section
36. Data Integrity and Compliance
36.1. Data Integrity
This section covers how The Unscrambler® X can help an organization working in a regulated
environment, particularly those that must show compliance to the rules and regulations of
electronic records and signatures as outlined in 21 CFR Part 11.
The following sections cover aspects of data integrity and security, particularly related to
electronic and digital signatures, the compliance mode of the software and audit trails.
Compliance Statement
General Application
Digital Signatures
Reference
36.2.1 Introduction
This section provides CAMO Software’s position on helping an organization meet the
requirements of 21 CFR Part 11 (Electronic Signatures and Records). All necessary steps have
been followed to align with the requirements; however, it must be stated that certain
procedures, such as verification of a user’s identity with respect to their electronic
signatures with the FDA, and the development of internal SOPs are the sole responsibility of
the Organization implementing The Unscrambler® X. Also, regulations and enforcement
activities change over time and the implementations for meeting 21 CFR Part 11 are based
on current best practices and subject knowledge at the time of the present build of the
program.
36.2.2 Overview
The Unscrambler® X provides the necessary functions for an organization to meet the
requirements of 21 CFR Part 11, as defined in Subparts A, B and C. It is CAMO Software’s
belief that, with proper due diligence on the part of the organization using The Unscrambler®
X, compliance with the regulations can be achieved.
Most, if not all, of the industries concerned require strict traceability of actions applied to
documents and data, and in particular to data generated electronically. The 21 CFR Part 11
regulations were developed to provide a way for organizations to attach the same meaning
of hand written signatures to electronic documents. The term electronic signature is defined
in the regulation as,
A computer data compilation of any symbol or series of symbols executed, adopted or
authorised by an individual to be the legally binding equivalent of the individual’s
handwritten signature.
For clarity, a handwritten signature is defined as,
The scripted or legal mark of an individual used to authenticate a document in a permanent
form.
Therefore, an electronic signature must meet the following basic criteria,
The goal is to have a system that can replace traditional handwritten signatures by an
electronic means for authoring, reviewing and releasing data and information based on the
four criteria listed above. This is where the compliance mode in Unscrambler® X can help an
organization achieve these goals.
Logins
There are two ways to use compliance mode,
A login to the software can be enforced, which means that a user has to reenter their
electronic signature to access the program. This is useful if the program is installed on a
shared computer; in order to use it, the domain has to be set such that the authorised
user is the only one who can access the program.
The login can be hidden. In this case, it is the responsibility of the organization (and
the user) to ensure that the program is installed on a computer that can only be
accessed by the user assigned to that computer. In this case, when the program is
launched, it starts immediately and the Windows authentication details are used to
record actions in the Audit Trail.
Note: In compliance mode, the Help - User Setup function is deactivated. The only way to
access the program is via Windows authentication.
Audit Trails and Info boxes
In compliance mode, the Audit Trail is always enforced and cannot be deactivated in the
Tools - Options menu. In the Audit Trail itself, the Empty button is disabled and its contents
can either be printed, or saved as a non-editable PDF file.
The Info box will also display the mode of operation that the program is operating in. This
can be found by clicking on The Unscrambler® icon in the project navigator and viewing the
details.
CAMO Software cannot warrant the legal enforceability of the digital signature generated,
and evidentiary laws may vary by jurisdiction.
The Unscrambler® X implements a digital signature by first passing the document through a
hashing algorithm. This creates a digest file that is a unique document number for the
project. This digital signature is saved to the project and recorded in the Info box and the
Audit Trail against the user’s login credentials. When the project is sent (via electronic
media, email, etc.) to a colleague and they open it, The Unscrambler® X recomputes the
digital signature and compares it with the one saved to the project. If both signatures
match, the integrity of the data is assured. If not, the user is warned that the project has
been tampered with.
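The verify-on-open logic described above can be sketched as follows (a minimal illustration using SHA-256 via Python's hashlib; the actual hashing algorithm used by the software is not disclosed in this manual):

```python
import hashlib

def compute_digest(project_bytes: bytes) -> str:
    # Pass the document through a hashing algorithm to obtain a
    # digest that acts as a unique document number for the project.
    # SHA-256 is an assumed choice for illustration; the manual does
    # not name the algorithm the software uses.
    return hashlib.sha256(project_bytes).hexdigest()

# When the project is signed, the digest is saved with the project:
original = b"project contents ..."
saved_digest = compute_digest(original)

# When a colleague opens the project, the digest is recomputed and
# compared with the saved one:
received = b"project contents ..."
if compute_digest(received) == saved_digest:
    print("Integrity assured: signatures match")
else:
    print("Warning: the project has been tampered with")
```

Any change to the project bytes produces a different digest, which is why saved changes invalidate the signature.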
If the program has been installed in compliance mode, the digital signature uses the
user’s electronic signature details as the security certificate of the digital signature.
Once signed, the user is warned that any changes saved to the project will result in a
loss of the signature.
If a project has not been saved before signing, a warning is displayed and the user is
taken to the Save As dialog, where they can provide a name for the project before it is
saved.
Info box
The Info box records information on the current sign status of the project.
Audit Trail
The Audit Trail shows the current status of the digital signature.
Status bar
Digitally signed projects display the sign icon at the bottom of the viewer in the status bar.
36.5. References
Guidance for industry, Part 11, Electronic Records: Electronic Signatures - Scope and
Application (available on www.fda.gov).
McDowall, R. D. Electronic Signatures and Logical Security, LC-GC Europe, 13(5), 331-
339 (2000).
McDowall, R. D. Digital Signatures, LC-GC Europe, 14(1), (2001).
37. References
37.1. Reference documentation
Glossary of terms
Method references
Keyboard shortcuts
Upgrading documentation
ANOVA
See Analysis of Variance.
Axial design
One of the three types of mixture designs with a simplex-shaped experimental region. An
axial design consists of extreme vertices, the overall center, axial points, and end points. It
can only be used for linear modeling and is therefore not available for optimization purposes.
Axial point
In an axial design, an axial point is positioned on the axis of one of the mixture variables, and
must be above the overall center, opposite the end point.
B-Coefficient
See Regression Coefficient.
Bias
Systematic difference between predicted and measured values. The bias is computed as the
average value of the residuals.
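As a minimal illustration of this definition (the values below are hypothetical):

```python
# Bias: the average of the residuals (predicted minus measured);
# the values below are hypothetical.
predicted = [2.1, 3.9, 6.2, 8.0]
measured  = [2.0, 4.0, 6.0, 8.1]
residuals = [p - m for p, m in zip(predicted, measured)]
bias = sum(residuals) / len(residuals)
print(bias)   # ≈ 0.025: predictions are on average slightly too high
```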
BIF-PLS
See Bifocal PLS.
Bifocal PLS
A method similar to L-PLS.
Bilinear modeling
Bilinear modeling (BLM) is one of several possible approaches for data compression.
The bilinear modeling methods are designed for situations where collinearity exists among
the original variables. Common information in the original variables is used to build new
variables, that reflect the underlying (“latent”) structure. These variables are therefore
called latent variables. The latent variables are estimated as linear functions of both the
original variables and the observations, thereby the name bilinear.
PCA, PCR and PLS are bilinear methods.
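A minimal sketch of how one bilinear component can be estimated (a pure-Python NIPALS-style iteration, shown for illustration only; it is not the software's implementation):

```python
# One bilinear component extracted with a NIPALS-style iteration,
# in pure Python.  The scores t are linear functions of the original
# variables, and the loadings p are linear functions of the
# observations -- hence the name "bilinear".
def nipals_first_component(X, n_iter=100):
    # X: list of rows (samples x variables), assumed mean-centered
    n, m = len(X), len(X[0])
    t = [row[0] for row in X]          # initial scores: first column
    for _ in range(n_iter):
        tt = sum(x * x for x in t)
        # loadings: p = X't / (t't), then normalized to unit length
        p = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        norm = sum(x * x for x in p) ** 0.5
        p = [x / norm for x in p]
        # scores: t = Xp
        t = [sum(X[i][j] * p[j] for j in range(m)) for i in range(n)]
    return t, p

# Two perfectly collinear variables: one latent variable captures
# both, and X is reconstructed as the outer product of t and p.
X = [[-1.0, -2.0], [0.0, 0.0], [1.0, 2.0]]
t, p = nipals_first_component(X)
```

For collinear data like this, the single latent variable reproduces the whole table, which is exactly the data compression the entry describes.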
Box-Behnken design
A class of experimental designs for response surface modeling and optimization, based on
only 3 levels of each design variable. The mid-levels of some variables are combined with
extreme levels of others. The combinations of only extreme levels (i.e. cube samples of a
factorial design) are not included in the design.
Box-Behnken designs are always rotatable. On the other hand, they cannot be built as an
extension of an existing factorial design, so they are more often recommended when
changing the ranges of variation for some of the design variables after a screening stage, or
when it is necessary to avoid too extreme situations.
Calibration
Stage of data analysis where a model is fitted to the available data, so that it describes the
data as well as possible.
After calibration, the variation in the data can be expressed as the sum of a modeled part
(structure) and a residual part (noise).
Calibration samples
Samples on which the calibration is based. The variation observed in the variables measured
on the calibration samples provides the information that is used to build the model.
If the purpose of the calibration is to build a model that will later be applied on new samples
for prediction, it is important to collect calibration samples that span the variations expected
in the future prediction samples.
Category variable
A category variable is a class variable, i.e. each of its levels is a category (or class, or type),
without any possible quantitative equivalent.
Examples: type of catalyst, choice among several instruments, wheat variety, material
identification, etc.
Candidate point
In D-optimal design generation, a number of candidate points are first calculated. These
candidate points consist of extreme vertices and centroid points. A subset of the candidate
points is then selected D-optimally to create the set of design points.
Center sample
Sample for which the value of every design variable is set at its mid-level (halfway between
low and high).
Center samples have a double purpose: introducing one center sample in a screening design
enables curvature checking, and replicating the center sample provides a direct estimation
of the experimental error.
Real center samples can be included when all design variables are continuous.
For designs containing category variables, real center points do not exist; however, it is
possible to generate faced center points by taking the mid-range values for the continuous
variables and selecting a level for the category variables.
Center samples
See Center sample.
Centering
See Mean centering.
Central composite design
A class of experimental designs for response surface modeling and optimization, based on a
two-level factorial design on continuous design variables. Star samples and center samples
are added to the full factorial design to provide the intermediate levels necessary for fitting
a quadratic model.
Central composite designs have the advantage that they can be built as an extension of a
previous factorial design, if there is no reason to change the ranges of variation of the design
variables.
If the default star point distance to center is selected, these designs are rotatable.
Centroid design
See Simplex-centroid design.
Centroid point
A centroid point is calculated as the mean of the extreme vertices on the design region
surface associated with this centroid point. It is used in Simplex-centroid designs, axial
designs and D-optimal designs.
Classification
Data analysis method used for predicting class membership. Classification can be seen as a
predictive method where the response is a category variable. The purpose of the analysis is
to be able to predict which category a new sample belongs to. Classification methods
implemented in The Unscrambler® include SIMCA, SVM classification, LDA, and PLS-
discriminant analysis.
Classification can for instance be used to determine the geographical origin of a raw material
from the levels of various impurities, or to accept or reject a product depending on its
quality.
To run a SIMCA classification, one needs:
One or several PCA models (one for each class) based on the same variables;
Values of those variables collected on known or unknown samples.
Each new sample is projected onto each PCA model. According to the outcome of this
projection, the sample is either recognized as a member of the corresponding class, or
rejected.
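The accept/reject decision can be sketched for a hypothetical one-component class model (the loading vector and threshold below are invented for illustration):

```python
# Hypothetical one-component class model: a unit loading vector p and
# a residual threshold estimated from the calibration samples.  A new
# sample is projected onto the model; if its residual distance is
# below the threshold, it is recognized as a member of the class.
def residual_distance(x, p):
    t = sum(xi * pi for xi, pi in zip(x, p))         # projection score
    residual = [xi - t * pi for xi, pi in zip(x, p)]
    return sum(r * r for r in residual) ** 0.5

p = [0.6, 0.8]         # invented class loading (unit length)
threshold = 0.5        # invented critical residual distance

member  = [1.2, 1.6]   # lies along p: zero residual, accepted
outlier = [1.6, -1.2]  # orthogonal to p: large residual, rejected
print(residual_distance(member, p) <= threshold)    # → True
print(residual_distance(outlier, p) <= threshold)   # → False
```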
Closure
In MCR, the closure constraint forces the sum of the concentrations of all the mixture
components to be equal to a constant value (the total concentration) across all samples.
Clustering
Clustering is a classification method that does not require any prior knowledge about the
available samples. The basic principle consists in grouping together in a “cluster” several
samples which are sufficiently close to each other.
The clustering methods available in The Unscrambler® include the K-means algorithm; the
behavior of the algorithm may be tuned by choosing among various ways of computing the
distance between samples. Hierarchical clustering can also be run, as can clustering using
Ward’s method.
Coefficient of determination
See R-square.
Collinear
See Collinearity.
Collinearity
Linear relationship between variables. Two variables are collinear if the value of one variable
can be computed from the other, using a linear relation. Three or more variables are
collinear if one of them can be expressed as a linear function of the others.
Variables which are not collinear are said to be linearly independent. Collinearity - or near-
collinearity, i.e. very strong correlation - is the major cause of trouble for MLR models,
whereas projection methods like PCA, PCR and PLS handle collinearity well.
Component
Condition number
It is the square root of the ratio of the highest eigenvalue to the smallest eigenvalue of the
experimental matrix. The higher the condition number, the more spread out the region;
conversely, the lower the condition number, the more spherical the region. The ideal
condition number is 1; the closer to 1, the better.
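A small illustration of this definition for a two-variable design, using the closed-form eigenvalues of the 2x2 information matrix:

```python
import math

# Condition number of a 2-variable experimental matrix: the square
# root of the ratio of the largest to the smallest eigenvalue of X'X
# (the 2x2 eigenvalues come from the characteristic equation).
def condition_number_2var(X):
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    mean, delta = (a + d) / 2, math.sqrt(((a - d) / 2) ** 2 + b * b)
    return math.sqrt((mean + delta) / (mean - delta))

# An orthogonal two-level design gives the ideal value 1 (spherical
# region); a correlated design gives a much larger value (stretched).
orthogonal = [[-1, -1], [1, -1], [-1, 1], [1, 1]]
correlated = [[1, 1], [-1, -1], [1, 0.9], [-1, -0.9]]
print(condition_number_2var(orthogonal))   # → 1.0
print(condition_number_2var(correlated))   # much larger than 1
```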
Confounded
See Confounded effects.
Confounded effects
Two (or more) effects are said to be confounded when variation in the responses cannot be
traced back to the variation in the design variables to which those effects are associated.
Confounded effects can be separated by performing a few new experiments. This is useful
when some of the confounded effects have been found significant.
Confounding pattern
The confounding pattern of an experimental design is the list of the effects that can be
studied with this design, with confounded effects listed on the same line.
Confusion matrix
The confusion matrix is a matrix used for visualization for classification results from
supervised methods such as support vector machine classification or linear discriminant
analysis classification. It carries information about the predicted and actual classifications of
samples, with each row showing the instances in a predicted class, and each column
representing the instances in an actual class.
Constrained design
Experimental design involving multilinear constraints between some of the designed
variables. There are two types of constrained designed: classical mixture designs and D-
optimal designs.
Constrained experimental region
Experimental region which is not only delimited by the ranges of the designed variables, but
also by multilinear constraints existing between these variables. For classical mixture
designs, the constrained experimental region has the shape of a simplex.
Constraint
Curve Resolution:
A constraint is a restriction imposed on the solutions to the multivariate curve
resolution problem.
Many constraints take the form of a linear relationship between two or more variables:
a1X1 + a2X2 + … + anXn = c
or
a1X1 + a2X2 + … + anXn ≤ c
where the Xi are relevant variables (e.g. estimated concentrations), and each constraint
is specified by the set of constants a1, …, an and c.
Mixture Designs: See Multilinear constraint.
Continuous variable
Quantitative variable measured on a continuous scale.
Examples of continuous variables are:
Corner sample
See vertex sample.
Correlations
See Correlation.
Correlation
A unitless measure of the amount of linear relationship between two variables.
The correlation is computed as the covariance between the two variables divided by the
square root of the product of their variances. It varies from –1 to +1.
Positive correlation indicates a positive link between the two variables, i.e. when one
increases, the other has a tendency to increase too. The closer to +1, the stronger this link.
Negative correlation indicates a negative link between the two variables, i.e. when one
increases, the other has a tendency to decrease. The closer to –1, the stronger this link.
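The definition translates directly into code (a minimal sketch using sample variances and covariance):

```python
# Correlation as defined above: covariance divided by the square root
# of the product of the variances; always between -1 and +1.
def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov   = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - mx) ** 2 for a in x) / (n - 1)
    var_y = sum((b - my) ** 2 for b in y) / (n - 1)
    return cov / (var_x * var_y) ** 0.5

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # ≈ +1: positive link
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))   # ≈ -1: negative link
```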
Correlation loadings
Loadings plot marking the 50% and 100% explained variance limits. Correlation loadings are
helpful in revealing variable correlations.
Correlation Optimized Warping (COW)
COW is a method for aligning data where the signals exhibit shifts in their position along the
x-axis. This transform is often used for time-shifting chromatographic spectra.
C
A method used to check the significance of effects using a scale-independent distribution as
comparison. This method is useful when there are no residual degrees of freedom.
Covariance
A measure of the linear relationship between two variables.
The covariance is given on a scale which is a function of the scales of the two variables, and
may not be easy to interpret. Therefore, it is usually simpler to study the correlation instead.
Cross terms
See Interaction effects.
Cross validation
Validation method where some samples are kept out of the calibration and used for
prediction. This is repeated until all samples have been kept out once. Validation residual
variance can then be computed from the prediction residuals.
In segmented cross validation, the samples are divided into subgroups or “segments”. One
segment at a time is kept out of the calibration. There are as many calibration rounds as
segments, so that predictions can be made on all samples. A final calibration is then
performed with all samples.
In full cross validation, only one sample at a time is kept out of the calibration per iteration.
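Full cross validation can be sketched for a simple univariate least-squares fit (the data below are hypothetical):

```python
# Full cross validation of a univariate least-squares fit, in pure
# Python: each sample is kept out once, predicted from the remaining
# samples, and the prediction residuals give a validation residual
# variance.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx            # slope, intercept

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
press = 0.0                                  # prediction error sum of squares
for i in range(len(x)):                      # keep sample i out
    xs, ys = x[:i] + x[i+1:], y[:i] + y[i+1:]
    slope, intercept = fit_line(xs, ys)
    press += (y[i] - (intercept + slope * x[i])) ** 2
val_residual_variance = press / len(x)
```

Segmented cross validation works the same way, except that whole groups of samples are kept out per round instead of one sample at a time.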
Cube sample
Any sample which is a combination of high and low levels of the design variables, in
experimental plans based on two levels of each variable.
In Box-Behnken designs, all samples which are a combination of high or low levels of some
design variables, and center level of others, are also referred to as cube samples.
Cubic effects
See Cubic effect.
Cubic effect
When analyzing the results from designed experiments, cubic effects can be included in the
model to handle complex cases of nonlinear effects or multiple interactions between the X-
variables.
Also called third order effects, they comprise:
Curvature
Curvature means that the true relationship between response variations and predictor
variations is nonlinear.
In screening designs, curvature can be detected by introducing a center sample.
Data compression
Concentration of the information carried by several variables onto a few underlying
variables.
The basic idea behind data compression is that observed variables often contain common
information, and that this information can be expressed by a smaller number of variables
than originally observed.
Data mining
This is the practice of studying large amounts of data to find patterns or trends. MVA is a
form of data mining.
Detrending (DT)
A transformation which seeks to remove nonlinear trends in spectroscopic data. Like
Standard Normal Variate (SNV), it is applied to individual spectra. DT and SNV are often
used in combination to reduce multicollinearity, baseline shift and curvature in spectra.
Degree of fractionality
The degree of fractionality of a factorial design expresses how much the design has been
reduced compared to a full factorial design with the same number of variables. It can be
interpreted as the number of design variables that should be dropped to compute a full
factorial design with the same number of experiments.
Example: with 5 design variables, one can either build
Degrees of freedom
The number of degrees of freedom of a phenomenon is the number of independent ways
this phenomenon can be varied.
Degrees of freedom are used to compute variances and theoretical variable distributions.
For instance, an estimated variance is said to be “corrected for degrees of freedom” if it is
computed as the sum of square of deviations from the mean, divided by the number of
degrees of freedom of this sum.
Dendrogram
A dendrogram (from Greek dendron “tree”, -gramma “drawing”) is a tree diagram
frequently used to illustrate the arrangement of the clusters produced by hierarchical
clustering.
Design analysis
Calculation of the effects of design variables on the responses. It consists mainly of Analysis
of Variance (ANOVA), various significance tests, multiple comparisons, and response surface
generation whenever they apply.
Design variable
Experimental factor for which the variations are controlled in an experimental design.
Design variables
See Design Variable.
Distribution
Shape of the frequency diagram of a measured variable or calculated parameter. Observed
distributions can be represented by a histogram.
Some statistical parameters have a well-known theoretical distribution which can be used
for significance testing.
D-optimal design
Experimental design generated by a D-optimal algorithm. A D-optimal design takes into
account the multilinear relationships existing between design variables, and thus works with
constrained experimental regions. There are two types of D-optimal designs depending on
their initial points: D-optimal mixture designs which are based on subsimplexes and general
D-optimal designs which are based on subfactorial designs.
F-ratio
The F-ratio is the ratio between explained variance (associated to a given predictor) and
residual variance. It shows how large the effect of the predictor is, as compared with
random noise.
By comparing the F-ratio with its theoretical distribution (F-distribution), one obtains the
significance level (given by a p-value) of the effect.
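For example, with hypothetical ANOVA sums of squares:

```python
# F-ratio: mean square explained by a predictor divided by the
# residual mean square (the ANOVA entries below are hypothetical).
ss_effect, df_effect = 24.0, 1
ss_residual, df_residual = 12.0, 6
f_ratio = (ss_effect / df_effect) / (ss_residual / df_residual)
print(f_ratio)  # → 12.0: the effect is large compared with the noise
```

The p-value then follows from comparing this value with the F-distribution with (1, 6) degrees of freedom.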
Full factorial design
Experimental design where all levels of all design variables are combined.
Such designs are often used for extensive study of the effects of few variables, especially if
some variables have more than two levels. They are also appropriate in screening with
interaction designs, to study both main effects and interactions, especially if no Resolution V
design is available.
Gap
One of the parameters of the Gap-Segment and Norris Gap derivatives, the gap is the length
of the interval that separates the two segments that are being averaged.
See Segment for more information.
General D-optimal design
D-optimal design in which some of the process variables are multilinearly linked, or which
contains a mix of mixture and non-mixture variables.
Histogram
A plot showing the observed distribution of data points. The data range is divided into a
number of bins (i.e. intervals) and the number of data points that fall into each bin is
summed up.
The height of the bar in the histograms shows how many data points fall within the data
range of the bin.
Hotelling’s T² statistic
See Hotelling’s T² statistic.
Hotelling’s T² ellipse
This 95% confidence ellipse can be included in scores plots and reveals potential outliers,
lying outside the ellipse.
See Hotelling’s T² statistic for more information.
Hotelling’s T² statistics
A linear function of the leverage that can be compared to a critical limit according to an F-
test. This statistic is useful for the detection of outliers at the modeling or prediction stage.
See Hotelling’s T² Ellipse for more information.
Influence
A measure of how much impact a single data point (or a single variable) has on the model.
The influence depends on the leverage and the residuals.
Inlier
A prediction sample far away from the calibration samples in the regression model. Local
“holes” or areas with low density in terms of calibration samples can result in a situation
where some prediction samples are detected as inliers.
Inner relation
In PLS regression models, scores in X are used to predict the scores in Y, and from these
predictions the estimated Y-values are found. This connection between X and Y through their
scores is called the inner relation.
Interaction
See Interaction effects.
Interactions
See Interaction effects.
Interaction effects
There is an interaction between two design variables when the effect of the first variable
depends on the level of the other. This means that the combined effect of the two variables
is not equal to the sum of their main effects.
An interaction that increases the main effects is a synergy. If it goes in the opposite
direction, it can be called an antagonism.
Intercept
(Also called Offset). The point where a regression line crosses the ordinate (Y-axis).
Interior point
Point which is not located on the surface, but inside of the experimental region. For
example, an axial point is a particular kind of interior point. Interior points are used in
classical mixture designs.
K-means
An algorithm for data clustering. The samples will be grouped into K (user-determined
number) clusters based on a specific distance measurement, so that the sum of distances
between each sample and its cluster centroid is minimized.
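A minimal sketch of the algorithm on one-dimensional data (K = 2, hypothetical starting centroids; ties and empty clusters are ignored for simplicity):

```python
# K-means on one-dimensional data with K = 2: each sample is assigned
# to the nearest centroid, then centroids are recomputed as cluster
# means; empty clusters and ties are ignored for simplicity.
def kmeans_1d(samples, centroids, n_iter=20):
    for _ in range(n_iter):
        clusters = [[] for _ in centroids]
        for s in samples:
            distances = [abs(s - c) for c in centroids]
            clusters[distances.index(min(distances))].append(s)
        centroids = [sum(c) / len(c) for c in clusters if c]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
print(kmeans_1d(data, centroids=[0.0, 5.0]))   # ≈ [1.0, 9.0]
```

In the multivariate case the absolute difference is replaced by the chosen distance measure (e.g. Euclidean distance between sample vectors).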
Lack of fit
In Response Surface Analysis, the ANOVA table includes a special chapter which checks
whether the regression model describes the true shape of the response surface. Lack of fit
means that the true shape is likely to be different from the shape indicated by the model.
If there is a significant lack of fit, one can investigate the residuals and try a transformation.
Latent variable
A variable that is not directly observed but is rather inferred (through a mathematical
model) from other variables that are observed and directly measured. Principal components
(PCs) and PLS factors are examples of latent variables.
Lattice degree
The degree of a Simplex-lattice design corresponds to the maximal number of experimental
points -1 for a level 0 of one of the Mixture variables.
Lattice design
See Simplex-lattice design.
LDA
See Linear Discriminant Analysis.
Least squares criterion
Basis of classical regression methods, that consists in minimizing the sum of squares of the
residuals. It is equivalent to minimizing the average squared distance between the original
response values and the fitted values.
Leveled variable
A leveled variable is a variable which consists of discrete values instead of a range of
continuous values.
Examples are design variables and category variables.
Leveled variables can be used to separate a data table into different groups. This feature is
used by the Statistics task, and in sample plots from PCA, PCR, PLS, MLR, Prediction and
Classification results.
Leveled variables
See Leveled Variable.
Level
See Levels.
Levels
Possible values of a variable. A category variable has several levels, which are all possible
categories. A design variable has at least a low and a high level, which are the lower and
higher bounds of its range of variation. Sometimes, intermediate levels are also included in
the design.
Leverage
A measure of how extreme a data point or a variable is compared to the majority.
In PCA, PCR and PLS, leverage can be interpreted as the distance between a projected point
(or projected variable) and the model center. In MLR, it is the object distance to the model
center.
Average data points have a low leverage. Points or variables with a high leverage are likely to
have a high influence on the model.
Leverage correction
A quick method to simulate model validation without performing any actual predictions.
It is based on the assumption that samples with a higher leverage will be more difficult to
predict accurately than more central samples. Thus a validation residual variance is
computed from the calibration sample residuals, using a correction factor which increases
with the sample leverage.
Note! For MLR, leverage correction is strictly equivalent to full cross-validation. For
other methods, leverage correction should only be used as a quick-and-dirty
method for a first calibration, and a proper validation method should be employed
later on to estimate the optimal number of components correctly.
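One common formulation of the correction divides the calibration residual by (1 - leverage); the sketch below assumes this form, with hypothetical numbers:

```python
# Leverage correction (one common formulation, assumed here): the
# calibration residual is divided by (1 - h), where h is the sample's
# leverage, so high-leverage samples get inflated validation residuals.
def corrected_residual(residual, h):
    return residual / (1.0 - h)

print(corrected_residual(0.10, h=0.05))   # ≈ 0.105 (central sample)
print(corrected_residual(0.10, h=0.60))   # ≈ 0.25  (extreme sample)
```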
L-PLS
The three matrices X, Y and Z can together be visualized in the form of an L-shaped
arrangement. Such data analysis has potential widespread use in areas such as consumer
preference studies, medical diagnosis and spectroscopic applications.
Main effect
Average variation observed in a response when a design variable goes from its low to its high
level.
The main effect of a design variable can be interpreted as linear variation generated in the
response, when this design variable varies and the other design variables have their average
values.
Main effects
See Main Effect.
Martens’ Uncertainty Test
See Uncertainty test.
MCR
See Multivariate Curve Resolution.
Mean
Average value of a variable over a specific sample set. The mean is computed as the sum of
the variable values, divided by the number of samples.
The mean gives a value around which all values in the sample set are distributed. In Statistics
results, the mean can be displayed together with the standard deviation.
Mean centering
Subtracting the mean (average value) from a variable, for each data point.
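For example:

```python
# Mean centering: subtract the variable's mean from each data point,
# so the centered values vary around zero.
column = [2.0, 4.0, 9.0]
mean = sum(column) / len(column)
centered = [x - mean for x in column]
print(centered)   # → [-3.0, -1.0, 4.0]
```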
Median
The median of an observed distribution is the variable value that splits the distribution in its
middle: half the observations have a lower value than the median, and the other half have a
higher value. It can also be called 50% percentile.
Missing values
Whenever the value of a given variable for a given sample is unknown or not available, this
results in a hole in the data. Such holes are called missing values; in The Unscrambler®,
the corresponding cells of the data table are left empty.
In some cases, it is only natural to have missing values — for instance when the
concentration of a compound (Y) in a new sample is supposed to be predicted from its
spectrum (X).
Sometimes it is useful to reconstruct the missing values, for instance when applying a
data analysis method that does not handle missing values well, such as MLR, kernel-PLS or
wide-kernel. Missing values can be filled using the command Tasks - Transform - Missing
Values….
MixSum
Term used in The Unscrambler® for “mixture sum”. See Mixture sum.
Mixture components
Ingredients of a mixture.
There must be at least three components to define a mixture design. A single component
cannot be called a mixture.
Two components mixed together do not require a mixture design to be studied: study the
variation in quantity of one of them as a classical process variable.
Mixture constraint
Multilinear constraint between mixture variables. The general equation for the mixture
constraint is
X1 + X2 + … + Xn = S
where the Xi represent the ingredients of the mixture, and S is the total amount of mixture.
In most cases, S is equal to 100%.
Mixture design
Special type of experimental design, applying to the case of a mixture constraint. There are
three types of classical Mixture designs: Simplex-Lattice design, Simplex-Centroid design,
and Axial design. Mixture designs that do not have a simplex experimental region are
generated D-optimally; they are called D-optimal mixture designs.
Mixture region
Experimental region for a mixture design. The mixture region for a classical mixture design is
a simplex.
Mixture sum
Total proportion of a mixture which varies in a mixture design. Generally, the mixture sum is
equal to 100%. However, it can be lower than 100% if the quantity in one of the components
has a fixed value.
The mixture sum can also be expressed as fractions, with values varying from 0 to 1.
Mixture variables
See Mixture Variable.
Mixture variable
Experimental factor for which the variations are controlled in a mixture design or D-optimal
mixture design. Mixture variables are multilinearly linked by a special constraint called
mixture constraint.
There must be at least three mixture variables to define a mixture design. See Mixture
components.
MLR
See Multiple Linear Regression.
Model
Mathematical equation summarizing variations in a data set.
Models are built so that the structure of a data table can be understood better than by just
looking at all raw values.
Statistical models consist of a structure part and an error part. The structure part
(information) is intended to be used for interpretation or prediction, and the error part
(noise) should be as small as possible for the model to be reliable.
Model center
The model center is the origin around which variations in the data are modeled. It is the
(0,0) point on a scores plot.
If the variables have been centered, samples close to the average will lie close to the model
center.
Model check
In Response Surface Analysis, a section of the ANOVA table checks how useful the
interactions and squares are, compared with a purely linear model. This section is called
model check.
If one part of the model is not significant, it can be removed so that the remaining effects
are estimated with a better precision.
MVA
See Multivariate Analysis
Multiple comparison tests
Tests associating the levels of a category design variable with a response variable, to detect
differences in effects between different levels.
For continuous or binary design variables, if an effect is found to be significant by ANOVA,
the magnitude and direction of the effect can be interpreted directly from the effect value of
that variable. For multi-level category variables the ANOVA will test whether at least one
level is significantly different from the others, however there is no single effect value for
each category variable or level to interpret. A multiple comparison test is used to assess
which category levels are associated with the optimal response.
Interpretation of multiple comparisons in The Unscrambler® X is described in more detail in
the Design of Experiments section.
Multilinear constraints
See Multilinear constraint.
Multilinear constraint
This is a linear relationship between two or more variables. A constraint has the general
form
a1X1 + a2X2 + … + anXn = c
or
a1X1 + a2X2 + … + anXn ≤ c
where the Xi are designed variables (mixture or process), and each constraint is specified
by the set of constants a1, …, an and c.
A multilinear constraint cannot involve both Mixture and Process variables.
Multiple Linear Regression (MLR)
A method for relating the variations in a response variable (Y-variable) to the variations of
several predictors (X-variables), with explanatory or predictive purposes.
An important assumption for the method is that the X-variables are linearly independent, i.e.
that no linear relationship exists between the X-variables. When the X-variables carry
common information, problems can arise due to exact or approximate collinearity.
Normal probability plot
The observed values are used as abscissa, and the ordinate displays the corresponding
percentiles on a special scale. Thus if the values are approximately normally distributed
around zero, the points will appear close to a straight line going through (0,50%).
A normal probability plot can be used to check the normality of the residuals (they should be
normal; outliers will stick out), and to visually detect significant effects in screening designs
with few residual degrees of freedom.
Offset
See Intercept.
Optimization
Finding the settings of design variables that generate optimal response values.
Orthogonal
Two variables are said to be orthogonal if they are completely uncorrelated, i.e. their
correlation is 0.
In PCA and PCR, the principal components are orthogonal to each other.
Factorial designs, Plackett-Burman designs, Central Composite designs and Box-Behnken
designs are built in such a way that the studied effects are orthogonal to each other.
Orthogonal design
Designs built in such a way that the studied effects are orthogonal to each other, are called
orthogonal designs.
Examples: Factorial designs, Plackett-Burman designs, Central Composite designs and Box-
Behnken designs.
D-optimal designs and classical mixture designs are not orthogonal.
Outlier
An observation (outlying sample) or variable (outlying variable) which is abnormal compared
to the major part of the data.
Extreme points are not necessarily outliers; outliers are points that apparently do not belong
to the same population as the others, or that are badly described by a model.
Outliers should be investigated before they are removed from a model, as an apparent
outlier may be due to an error in the data.
Overfitting
For a model, overfitting is a tendency to describe too much of the variation in the data, so
that not only consistent structure is taken into account, but also some noise or
noninformative variation.
Overfitting should be avoided, since it usually results in a lower quality of prediction.
Validation is an efficient way to avoid model overfitting.
Partial Least Squares regression
See PLS regression.
Passified
See Downweight.
In previous versions of The Unscrambler®, the term passify was used when a variable was
weighted by multiplying by a very small number. The variable was said to be Passified,
meaning that it loses all influence on the model, but it is not removed from the analysis.
The term for this type of weighting has been changed to Downweight.
PCA
See Principal Component Analysis.
PCR
See Principal Component Regression.
PCs
See Principal Component.
Percentile
The X% percentile of an observed distribution is the variable value that splits the
observations into X% lower values, and 100-X% higher values.
Quartiles and median are percentiles. The percentiles are displayed using a box-plot.
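For illustration, percentiles can be computed with NumPy (hypothetical data; the interpolation convention may differ from The Unscrambler®'s):

```python
import numpy as np

values = np.array([2.0, 7.0, 4.0, 9.0, 1.0, 6.0, 3.0])

q1     = np.percentile(values, 25)   # lower quartile: 25% of values below
median = np.percentile(values, 50)   # the 50% percentile
q3     = np.percentile(values, 75)   # upper quartile: 75% of values below
```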
Plackett-Burman design
A very reduced experimental plan used for a first screening of many variables. It gives
information about the main effects of the design variables with the smallest possible
number of experiments.
No interactions can be studied with a Plackett-Burman design, and moreover, each main
effect is confounded with a combination of several interactions, so that these designs should
be used only as a first stage, to check whether there is any meaningful variation at all in the
investigated phenomena.
PLS
See PLS regression.
PLS Discriminant Analysis (PLS-DA)
Classification method based on modeling the differences between several classes with PLS.
If there are only two classes to separate, the PLS model uses one response variable, which
codes for class membership as follows: -1 for members of one class, +1 for members of the
other one.
If there are three classes or more, the PLS model uses one response variable per class (coded -1/+1 or 0/1, which are equivalent) for class membership.
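The class-membership coding can be sketched as follows (a hypothetical NumPy example using 0/1 coding, one response column per class):

```python
import numpy as np

# Hypothetical class labels for six samples, three classes.
labels = np.array(["A", "B", "C", "A", "C", "B"])
classes = sorted(set(labels))                      # ['A', 'B', 'C']

# 0/1 indicator matrix: one response variable (column) per class.
Y = (labels[:, None] == np.array(classes)[None, :]).astype(float)
# A PLS2 model of X against Y then serves as the classifier.
```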
PLS regression
A method for relating the variations in one or several response variables (Y-variables) to the
variations of several predictors (X-variables), with explanatory or predictive purposes.
This method performs particularly well when the various X-variables express common
information, i.e. when there is a large amount of correlation, or even collinearity.
Partial Least Squares Regression is a bilinear modeling method where information in the
original X-data is projected onto a small number of underlying (“latent”) variables called PLS
components. The Y-data are actively used in estimating the “latent” variables to ensure that
the first components are those that are most relevant for predicting the Y-variables.
Interpretation of the relationship between X-data and Y-data is then simplified, as this relationship is concentrated on the smallest possible number of components.
By plotting the first PLS components one can view main associations between X-variables
and Y-variables, and also interrelationships within X-data and within Y-data.
PLS1
Version of the PLS method with only one Y-variable.
PLS2
Version of the PLS method in which several Y-variables are modeled simultaneously, thus
taking advantage of possible correlations or collinearity between Y-variables.
PLS-DA
See PLS Discriminant Analysis.
Precision
The precision of an instrument or a measurement method is its ability to give consistent
results over repeated measurements performed on the same object. A precise method will
give several values that are very close to each other.
Precision can be measured by standard deviation over repeated measurements.
If precision is poor, it can be improved by systematically repeating the measurements over
each sample, and replacing the original values by their average for that sample.
Precision differs from accuracy, which has to do with how close the average measured value
is to the target value.
Prediction
Computing response values from predictor values, using a regression model.
To make predictions, two things are needed: a regression model (which provides the regression coefficients) and new X-values for the samples to be predicted. The new X-values are fed into the model equation, and predicted Y-values are computed.
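In equation form, prediction is simply the model equation applied to new data; a minimal sketch with hypothetical coefficients:

```python
import numpy as np

# Hypothetical regression model: y = b0 + b1*x1 + b2*x2.
b0 = 0.5
b = np.array([1.2, -0.3])

# New X-values for two samples to be predicted.
X_new = np.array([[2.0, 1.0],
                  [4.0, 3.0]])

y_pred = b0 + X_new @ b   # predicted Y-values
```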
Predictor
Variable used as input in a regression model. Predictors are usually denoted X-variables.
Predictors
See Predictor.
Principal component (PC)
Principal Components (PCs) are composite variables, i.e. linear functions of the original
variables, estimated to contain, in decreasing order, the main structured information in the
data. A PC is the same as a score vector, and is also called a latent variable or a factor.
Principal components are estimated in PCA and PCR. PLS components are also denoted PCs.
Principal Component Analysis (PCA)
PCA is a bilinear modeling method which gives an interpretable overview of the main
information in a multidimensional data table.
The information carried by the original variables is projected onto a smaller number of
underlying (“latent”) variables called principal components. The first principal component
covers as much of the variation in the data as possible. The second principal component is
orthogonal to the first and covers as much of the remaining variation as possible, and so on.
By plotting the principal components, one can view interrelationships between different
variables, and detect and interpret sample patterns, groupings, similarities or differences.
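A minimal sketch of the projection step, via SVD on mean-centered data (illustrative NumPy code on hypothetical data, not The Unscrambler®'s algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))      # hypothetical: 20 samples, 4 variables

Xc = X - X.mean(axis=0)           # mean-center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = U * s                    # sample projections (scores)
loadings = Vt.T                   # variable projections (loadings)

# Each successive component explains a decreasing share of the variance.
explained = s**2 / np.sum(s**2)
```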
Principal Component Regression (PCR)
PCR is a method for relating the variations in a response variable (Y-variable) to the
variations of several predictors (X-variables), with explanatory or predictive purposes.
This method performs particularly well when the various X-variables express common
information, i.e. when there is a large amount of correlation, or even collinearity.
Principal Component Regression is a two-step method. First, a Principal Component Analysis
is carried out on the X-variables. The principal components are then used as predictors in a
Multiple Linear Regression.
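The two steps can be sketched in NumPy as follows (hypothetical data; an illustration of the principle, not The Unscrambler®'s implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 6))                         # hypothetical predictors
y = X @ np.array([1.0, 0.5, 0, 0, 0, 0]) + 0.01 * rng.normal(size=30)

# Step 1: PCA on mean-centered X; keep the first A principal components.
A = 3
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = (U * s)[:, :A]                                   # score matrix

# Step 2: MLR of y on the A scores (with intercept).
Tb = np.column_stack([np.ones(len(T)), T])
b, *_ = np.linalg.lstsq(Tb, y, rcond=None)
y_fit = Tb @ b
```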
Process variable
Experimental factor for which the variations are controlled in an experimental design, and to
which the mixture variable definition does not apply.
Process variables
See Process variable.
Project samples
New samples can be projected onto an existing PCA model, thus creating the PCA equivalent
of prediction for a regression model. The projection of a new sample onto the PCA model is
a kind of “prediction” of that sample according to the PCA model.
Projection
Principle underlying bilinear modeling methods such as PCA, PCR and PLS.
In those methods, each sample can be considered as a point in a multidimensional space.
The model will be built as a series of components onto which the samples - and the variables
- can be projected. Sample projections are called scores, variable projections are called
loadings.
The model approximation of the data is equivalent to the orthogonal projection of the
samples onto the model. The residual variance of each sample is the squared distance to its
projection.
Proportional noise
Noise on a variable is said to be proportional when its size depends on the level of the data
value. The range of proportional noise is a percentage of the original data values.
Pure components
In MCR, an unknown mixture is resolved into n pure components. The number of
components and their concentrations and instrumental profiles are estimated in a way that
explains the structure of the observed data under the chosen model constraints.
p-value
The p-value measures the probability that a parameter estimated from experimental data
should be as large as it is, if the real (theoretical, non-observable) value of that parameter
were actually zero. Thus, p-value is used to assess the significance of observed effects or
variations: a small p-value means a small risk of mistakenly concluding that the observed
effect is real.
The usual limit used in the interpretation of a p-value is 0.05 (or 5%). If p-value < 0.05, the
observed effect can be presumed to be significant and is not due to random variations.
p-value is also called “significance level”.
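As a numerical illustration (a generic two-sided z-test with a hypothetical effect size, not any specific test in The Unscrambler®), a p-value can be obtained from the standard normal distribution:

```python
import math

# Hypothetical observed effect, 2.3 standard errors away from zero.
z = 2.3

# Two-sided p-value under the standard normal distribution.
p = math.erfc(abs(z) / math.sqrt(2.0))

significant = p < 0.05   # compare against the usual 5% limit
```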
Q-residual limits
The Q-residual limits for components 0-A are computed as a function of the remaining
eigenvalues A+1:Amax, where Amax is the maximum number of components that can be
calculated, limited by the number of samples or variables.
When PCA is computed by the SVD algorithm all eigenvalues are returned, and Q-residuals can be estimated. When the NIPALS algorithm is chosen, only a few components are normally estimated, thus Q-residual limits are not available.
Similarly for PLS regression, the Q-residual limits are correct only if the maximum number of
factors is computed, i.e. all the variance in X is modeled.
As the Q-residual limit is a function of the eigenvalue to the power of 3, one may get a
reasonable estimate if more than 95% of the X-variance is explained in the model although
the number of factors is less than the maximum.
Q-residuals
See Q-residual limits.
Quadratic model
Regression model including as X-variables the linear effects of each predictor, all two-
variable interactions, and the square effects.
With a quadratic model, the curvature of the response surface can be approximated in a
satisfactory way.
Quantile plot
The Quantile plot represents the distribution of a variable in terms of percentiles for a given
population. It shows the minimum, the 25% percentile (lower quartile), the median, the 75%
percentile (upper quartile) and the maximum.
Random effect
Effect of a variable for which the levels studied in an experimental design can be considered
to be a small selection of a larger (or infinite) number of possibilities.
Example: the effect of a raw material batch, when the batches used in the experiments are a random selection among all available batches.
Reference sample
The design file will contain only response values for the reference samples, whereas the
input part (the design part) is missing (m).
Reference samples
See Reference sample.
Regression coefficient
In a regression model equation, regression coefficients are the numerical coefficients that
express the link between variation in the predictors and variation in the response.
Regression coefficients
See Regression coefficient.
Regression
Generic name for all methods relating the variations in one or several response variables (Y-
variables) to the variations of several predictors (X-variables), with explanatory or predictive
purposes.
Regression can be used to describe and interpret the relationship between the X-variables
and the Y-variables, and to predict the Y-values of new samples from the values of the X-
variables.
Repeated measurement
Measurement performed several times on one single experiment or sample.
The purpose of repeated measurements is to estimate the measurement error, and to
improve the precision of an instrument or measurement method by averaging over several
measurements.
Repeated measurements
See Repeated measurement.
Replicate
Replicates are experiments that are carried out several times. The purpose of including
replicates in a data table is to estimate the experimental error.
Replicates should not be confused with repeated measurements, which give information
about measurement error. In cross validation, replicates should be excluded as a group.
Replicates
See Replicate.
Residual
A measure of the variation that is not taken into account by the model.
The residual for a given sample and a given variable is computed as the difference between
observed value and fitted (or projected, or predicted) value of the variable on the sample.
Residuals
See Residual.
Residual variance
The mean square of all residuals, sample- or variable-wise.
This is a measure of the error made when observed values are approximated by fitted
values, i.e. when a sample or a variable is replaced by its projection onto the model.
The complement to residual variance is explained variance.
Residual X-variance
See Residual variance.
Residual Y-variance
See Residual variance.
Resolution
Context: Experimental design
Information on the degree of confounding in fractional factorial designs.
Resolution is expressed as a Roman numeral. In a resolution III design, main effects are confounded with two-factor interactions. In a resolution IV design, main effects are clear of two-factor interactions, but two-factor interactions are confounded with each other. In a resolution V design, main effects and two-factor interactions are clear of each other.
Response variable
Observed or measured parameter which a regression model tries to predict.
Responses are usually denoted Y-variables.
Response variables
See Response variable.
Responses
See Response variable.
RMSEC
Root Mean Square Error of Calibration. A measurement of the average difference between
predicted and measured response values, at the calibration stage.
RMSEC can be interpreted as the average modeling error, expressed in the same units as the
original response values.
RMSED
Root Mean Square Error of Deviations. A measurement of the average difference between
the abscissa and ordinate values of data points in any 2-D scatter plot.
RMSEP
Root Mean Square Error of Prediction. A measurement of the average difference between
predicted and measured response values, at the prediction or validation stage.
RMSEP can be interpreted as the average prediction error, expressed in the same units as
the original response values.
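The common root-mean-square computation behind RMSEC, RMSED and RMSEP can be sketched as follows (hypothetical values):

```python
import numpy as np

# Hypothetical measured and predicted response values.
measured  = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.1, 1.9, 3.2, 3.8])

# Root mean square of the residuals, in the units of the response.
rmse = np.sqrt(np.mean((predicted - measured) ** 2))
```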
R-square
The R-square of a regression model is a measure of the quality of the model. Also known as the coefficient of determination, it is computed as 1 - (Residual Y-variance / Total Y-variance), or equivalently (Explained Y-variance in %)/100. For Calibration results, this is also the square of the correlation coefficient between predicted and measured values, and the R-square value is always between 0 and 1: the closer to 1, the better.
The R-square is displayed among the plot statistics of a Predicted vs. Reference plot. When
based on the calibration samples, it tells about the quality of the fit. When computed from
the validation samples (similar to the “adjusted R-square” found in the literature) it tells
about the predictive ability of the model.
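A minimal numeric sketch of the computation, on hypothetical values:

```python
import numpy as np

measured  = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
predicted = np.array([1.2, 1.8, 3.1, 4.2, 4.7])

ss_res = np.sum((measured - predicted) ** 2)        # residual sum of squares
ss_tot = np.sum((measured - measured.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot                          # coefficient of determination
```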
Sample
Object or individual on which data values are collected, and which builds up a row in a data
table.
In experimental design, each separate experiment is a sample.
Sample projection
See Project samples.
Scaling
See Weighting.
Scatter effects
In spectroscopy, scatter effects are effects that are caused by physical phenomena, like
particle size, rather than chemical properties. They interfere with the relationship between
chemical properties and shape of the spectrum. There can be additive and multiplicative
scatter effects.
Additive and multiplicative effects can be removed from the data by different methods.
Multiplicative Scatter Correction removes the effects by adjusting the spectra from ranges of
wavelengths supposed to carry no specific chemical information.
Scores
Scores are estimated in bilinear modeling methods where information carried by several
variables is concentrated onto a few underlying variables. Each sample has a score along
each model component.
The scores show the locations of the samples along each model component, and can be
used to detect sample patterns, groupings, similarities or differences.
Screening
First stage of an investigation, where information is sought about the effects of many
variables. Since many variables have to be investigated, only main effects, and optionally
interactions, can be studied at this stage.
There are specific experimental designs for screening, such as factorial or Plackett-Burman
designs.
Segment
One of the parameters of Gap-Segment derivatives and Moving Average smoothing, a
segment is an interval over which data values are averaged.
In smoothing, X-values are averaged over one segment symmetrically surrounding a data
point. The raw value on this point is replaced by the average over the segment, thus creating
a smoothing effect.
In Gap-Segment derivatives (designed by Karl Norris), X-values are averaged separately over
one segment on each side of the data point. The two segments are separated by a gap. The
raw value on this point is replaced by the difference of the two averages, thus creating an
estimate of the derivative on this point.
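The smoothing case can be sketched as follows (a toy example with hypothetical values; endpoint handling in The Unscrambler® may differ):

```python
import numpy as np

x = np.array([1.0, 4.0, 2.0, 8.0, 6.0, 3.0, 5.0])

# 3-point segment: each interior value is replaced by the average over
# the symmetric window around it (endpoints left unchanged here).
smoothed = x.copy()
for i in range(1, len(x) - 1):
    smoothed[i] = x[i - 1 : i + 2].mean()
```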
Sensitivity to pure components
In MCR computations, sensitivity to pure components is one of the parameters influencing
the convergence properties of the algorithm. It can be roughly interpreted as how
dominating the last estimated primary principal component is (the one that generates the
weakest structure in the data), compared to the first one.
The higher the sensitivity, the more pure components will be extracted.
SEP
See Standard Error of Performance.
Significance level
See p-value.
Significant
An observed effect (or variation) is declared significant if there is a small probability that it is
due to chance.
SIMCA
See SIMCA classification.
SIMCA classification
Classification method based on disjoint PCA modeling.
SIMCA focuses on modeling the similarities between members of the same class. A new
sample will be recognized as a member of a class if it is similar enough to the other
members; else it will be rejected.
Simplex
Specific shape of the experimental region for a classical mixture design. A Simplex has N
corners but N-1 independent variables in an N-dimensional space. This results from the fact
that whatever the proportions of the ingredients in the mixture, the total amount of mixture
has to remain the same: the Nth variable depends on the N-1 other ones. When mixing three
components, the resulting simplex is a triangle.
Simplex-Centroid design
One of the three types of mixture designs with a simplex-shaped experimental region. A
Simplex-centroid design consists of extreme vertices, center points of all “subsimplexes”,
and the overall center. A “subsimplex” is a simplex defined by a subset of the design
variables. Simplex-centroid designs are available for optimization purposes, but not for a
screening of variables.
Simplex-Lattice design
One of the three types of mixture designs with a simplex-shaped experimental region. A
Simplex-lattice design is a mixture variant of the full-factorial design. It is available for both
screening and optimization purposes, according to the degree of the design (see lattice
degree).
SVD
See Singular Value Decomposition
Singular Value Decomposition (SVD)
In linear algebra, the singular value decomposition (SVD) is an important
factorization of a rectangular real or complex matrix, with many applications
in signal processing and statistics. Applications which employ the SVD
include computing the pseudoinverse, least squares fitting of data, matrix
approximation, and determining the rank, range and null space of a matrix.
Source: Wikipedia
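A small NumPy illustration of the factorization and one of its uses (rank determination), on a hypothetical matrix:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The factorization reconstructs A, and the rank equals the
# number of nonzero singular values.
A_rec = U @ np.diag(s) @ Vt
rank = int(np.sum(s > 1e-10))
```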
SNV
See Standard Normal Variate.
Square effect
Average variation observed in a response when a design variable goes from its center level
to an extreme level (low or high).
The square effect of a design variable can be interpreted as the curvature observed in the
response surface, with respect to this particular design variable.
Square effects
See Square effect.
Standard deviation
SDev is a measure of a variable’s spread around its mean value, expressed in the same unit
as the original values.
Standard deviation is computed as the square root of the mean square of deviations from
the mean.
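As a numeric illustration of the definition above (hypothetical values, using the mean-square form stated here):

```python
import numpy as np

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Square root of the mean squared deviation from the mean.
sdev = np.sqrt(np.mean((values - values.mean()) ** 2))
```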
Student’s t-distribution
When the number of observations increases towards an infinite number, the Student t-
distribution becomes identical to the normal distribution.
A Student’s t-distribution can be described by two parameters: the mean value, which is the
center of the distribution, and the standard deviation, which is the spread of the individual
observations around the mean. Given those two parameters, the shape of the distribution
further depends on the number of degrees of freedom, usually n-1, if n is the number of
observations.
t-distribution
See Student’s t-distribution.
Test samples
Additional samples which are not used during the calibration stage, but only to validate an
already calibrated model.
The data for those samples consist of X-values (for PCA) or of both X- and Y-values (for
regression). The model is used to predict new values for those samples, and the predicted
values are then compared to the observed ones.
Test set validation
Validation method based on the use of different data sets for calibration and validation.
During the calibration stage, calibration samples are used. Then the calibrated model is used
on the test samples, and the validation residual variance is computed from their prediction
residuals.
Third order effects
See Cubic Effect.
Training samples
See Calibration samples.
T-scores
The scores found by PCA, PCR and PLS in the X-matrix.
See Scores for more details.
Tukey’s test
A multiple comparison test (see Multiple comparison tests for more details).
t-value
The t-value is computed as the ratio between the deviation from the mean accounted for by
a studied effect, and the standard error of the mean.
By comparing the t-value with its theoretical distribution (Student’s t-distribution), one
obtains the significance level of the studied effect.
Uncertainty limits
Limits produced by Uncertainty Testing, helping one assess the significance of the X-
variables in a regression model. Variables with uncertainty limits that do not cross the “0”
axis are significant.
Uncertainty test
Martens’ Uncertainty Test is a significance testing method implemented in The
Unscrambler® which assesses the stability of PCA or Regression results. Many plots and results are associated with the test, allowing the estimation of the model stability, the identification of perturbing samples or variables, and the selection of significant X-variables.
The test is performed with cross validation, and is based on the jack-knifing principle.
Underfit
A model that leaves aside some of the structured variation in the data is said to underfit.
Unimodality
In MCR, the Unimodality constraint allows the presence of only one maximum per profile.
Upper quartile
The upper quartile of an observed distribution is the variable value that splits the
observations into 75% lower values, and 25% higher values. It can also be called 75%
percentile.
U-scores
The scores found by PLS in the Y-matrix.
See Scores for more details.
Validation samples
See Test samples.
Validation
Validation means checking how well a model will perform for future samples taken from the
same population as the calibration samples. In regression, validation also allows for
estimation of the prediction error in future predictions.
The outcome of the validation stage is generally expressed by a validation variance. The
closer the validation variance is to the calibration variance, the more reliable the model
conclusions.
When explained validation variance stops increasing with additional model components, it
means that the noise level has been reached. Thus the validation variance is a good
diagnostic tool for determining the proper number of components in a model.
Validation variance can also be used as a way to determine how well a single variable is
taken into account in an analysis. A variable with a high explained validation variance is
reliably modeled and is probably quite precise; a variable with a low explained validation
variance is badly taken into account and is probably quite noisy.
Three validation methods are available in The Unscrambler®: test set validation, cross validation, and leverage correction.
Variable
Any measured or controlled parameter that has varying values over a given set of samples.
A variable determines a column in a data table.
Variances
See Variance.
Variance
A measure of a variable’s spread around its mean value, expressed in square units as
compared to the original values.
Variance is computed as the mean square of deviations from the mean. It is equal to the
square of the standard deviation.
Vertex sample
A vertex is a point where two lines meet to form an angle. Vertex samples are used in
Simplex-centroid, axial and D-optimal mixture/non-mixture designs.
Weighting
A technique to modify the relative influences of the variables on a model. This is achieved by
giving each variable a new weight, i.e. multiplying the original values by a constant which
differs between variables. This is also called scaling.
The most common weighting technique is standardization, where the weight is the inverse of the standard deviation (1/SDev) of the variable. Other weighting options in The Unscrambler® are constant, and downweighted.
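Standardization can be sketched in NumPy as follows (hypothetical data; an illustration of the weighting principle, not The Unscrambler®'s implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: two variables on very different scales.
X = rng.normal(loc=[0.0, 100.0], scale=[1.0, 50.0], size=(40, 2))

# Standardization: weight each variable by 1/SDev so that all
# variables end up with unit spread and comparable influence.
weights = 1.0 / X.std(axis=0)
Xw = X * weights
```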
Keyboard shortcuts
Save Ctrl+S
Print Ctrl+P
Close Ctrl+W
Exit Alt+F4
Cut Ctrl+X
Copy Ctrl+C
Paste Ctrl+V
Undo Ctrl+Z
Redo Ctrl+Y
Find/replace Ctrl+H
Go to Ctrl+G
Zoom in Ctrl+Up-arrow, +
Report Ctrl+R
Edit cell F2
Tasks menu
Tools menu (new)
Help
By keeping the original data intact, the user will never lose important information, and thus
complies with the guidances recommended by regulatory agencies for data integrity (e.g. US FDA 21 CFR Part 11 compliance in the pharmaceutical industry). The other major advantage
of having successive nodes in the project navigator is that each new transform node forms
the basis for new directions in data pretreatment. In short, The Unscrambler® X, through the
use of the project navigator, has greatly simplified data visualization and management.
Models developed using The Unscrambler® X are presented as nodes in the project
navigator with the original data, results, validation and plots all included as subnodes. These
subnodes are used to navigate around the model. The results in the subnodes can be used
for further investigation. This replaces the File - Import - Unscrambler Results option
available in previous releases of The Unscrambler® and has been developed to make the
task of result importation much simpler.
Projects are also saved as XML-based files. This means that in the future, projects will not be
legacy system dependent as they are based on a universally accepted standard format.
Improved security
The Unscrambler® X allows a user to sign into the program using Windows Domain
Authentication (as well as the usual password access).
The program can be set up to accept Windows user credentials as the login, or a predefined user name and password set up within The Unscrambler® X. This system of login is compliant with the requirements for electronic signatures and records required by the US FDA.
To further improve security, the Lock function in previous versions of The Unscrambler® is
now replaced by the Protect function. This is an internal, password based system for
protecting individual projects, data tables and models. Protected data can be unprotected
by reentering the password, when the unprotect option is chosen.
The Audit Trail system has been greatly improved and now follows the US FDA’s guidance on
time stamping of audit trails.
The improved security functions of The Unscrambler® X provide greater assurance to users
in all application areas.
The menu bar
The Menu bar in The Unscrambler® X has been optimized for better work flow. Notable
omissions when compared to previous versions include:
Modify
Results
Window
The Modify options are now shared over the Edit and Tasks menus. In particular, the options
now found in the Edit menu include
Also note that for the first release of The Unscrambler® X, 3-way data options are not
supported. These will be included in future developments.
The following functionality from the Modify menu can now be found in the Tasks menu:
The Tasks menu is optimized for work flow with the following options:
Transform
Analyze
Predict
The Results menu is now superseded due to the project navigator. All results are available
for a particular project in the project navigator, under the node particular to the analysis
performed.
The General View option in Results has been greatly simplified and is now part of the new
Insert menu as the Custom Layout option.
The Window menu is now obsolete. Results are displayed from the project navigator, and
stored within a project. The window functionality is now dispersed throughout the program
through various graphic and data table tab options.
The notable inclusions in the menu bar are:
Insert
Tools
the Matrix Calculator for performing basic matrix operations on data in the project
navigator
Report, allowing a user to develop custom reports, based on the output of a
developed model.
the Audit Trail
The Tools - Options menu options have been migrated from the File menu in previous versions. The Tools - Audit Trail menu supersedes the previous File - Properties - Log options.
Plotting
General plots
Plotting data in The Unscrambler® X is much easier and more powerful than in previous
versions. The Plot menu has been expanded to include additional features:
The ability to use the mouse scroll wheel to zoom in and out of plots;
The ability to left-click and drag a plot's position within the current viewer;
The ability to modify the plot region, headers, include legends and change the font
and size of axes. These are all available by choosing Properties from the Edit Menu
or by right-clicking on a plot and selecting Properties.
Three-dimensional (3-D) rotation of scatter and matrix plots can be performed using the
mouse in a continuous way.
A new plotting option in The Unscrambler® X is the Multiple Scatter Plot. This is a collection of 2-D scatter plots of the chosen variables, plotting each pair of variables against each other.
All plots have a much sharper appearance and are better suited for journal publications,
reports and presentations.
Results plots
The project navigator now contains a Plots subnode for each analysis procedure containing
plotted results. Simply highlight a plot pane in the viewer, and click on the desired plot from
the project navigator to display it. The plot is updated automatically, thus simplifying the
previous Plot menu routine.
All results plots have the ability to be modified using the Properties menu option when
right-clicking on a plot.
Importing data from previous versions of The Unscrambler
Data and models generated in previous versions of The Unscrambler® (back to version 9.2)
may be directly imported into The Unscrambler® X using the File - Import - Unscrambler
menu option. The Unscrambler® X imports data tables with formatting intact, i.e. column
and row sets defined by the previous Modify - Edit Set function are preserved and displayed
as subnodes in the project navigator.
The Unscrambler® X models still preserve their existing file format. Backwards compatibility of models is available by using the File - Export - Unscrambler option, making it possible to use models developed in The Unscrambler® X in previous versions of The Unscrambler® Online Predictor and The Unscrambler® Online Classifier.
Analysis methods
The Unscrambler® X comes with a number of new analyses for advanced MVA applications.
These are listed as follows:
Statistical Tests: Basic statistical hypothesis tests are included, providing a valuable
tool for thorough data analysis within The Unscrambler®.
Normality test, both univariate and multivariate
Tests for comparing means (t-tests)
Tests for comparing variances (F-, Levene’s and Bartlett’s tests)
A completely new and easy-to-use Design Wizard with the possibility to go back and forth when defining the design;
Suggestions for the best suited design and guidance for the user;
Inclusion of Scheffe polynomials for the analysis of mixture data;
More interactive results output and graphical options;
A DoE PLS option with some featured plots.
Import ASCII and Excel: Easy import using a dedicated import dialog box. The dialog
box allows the import of all, or only part, of the data, and allows easy assignment of
row and column headers.
netCDF import of chromatographic data
Support of the OPC protocol.
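The hypothesis tests listed above (normality, t-tests, and the F-, Levene's and Bartlett's variance tests) are standard statistical methods. The following is a minimal SciPy sketch of what each test computes, on hypothetical data; it is an illustration, not The Unscrambler's own implementation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(10.0, 1.0, size=50)   # hypothetical sample A
b = rng.normal(10.5, 1.5, size=50)   # hypothetical sample B

# Univariate normality test (Shapiro-Wilk)
_, p_norm = stats.shapiro(a)

# Comparing means: two-sample t-test (Welch's, unequal variances)
_, p_t = stats.ttest_ind(a, b, equal_var=False)

# Comparing variances: two-sided F-test computed from the variance ratio...
f = np.var(a, ddof=1) / np.var(b, ddof=1)
dfa, dfb = len(a) - 1, len(b) - 1
p_f = 2 * min(stats.f.sf(f, dfa, dfb), stats.f.cdf(f, dfa, dfb))

# ...and the Levene's and Bartlett's tests
_, p_levene = stats.levene(a, b)
_, p_bartlett = stats.bartlett(a, b)
```

Levene's test is less sensitive to departures from normality than Bartlett's test or the F-test, which is why the three are typically offered together.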
As additional formats are continually being added, refer to the chapter on File Import.
Improved dialog boxes
Edit - Define Range
Defining data ranges has been simplified and is more interactive.
Insert - Create Design
Adding a designed experiment is much more interactive and flexible than the
generation of designed experiments in previous versions.
Tasks - Transform menu
The dialog boxes allow a preview of the transformation on the data before it is
applied, providing invaluable visualization.
Tasks - Analyze
More tabs have been added to the dialog boxes, making the analyses more self-
contained.
Tasks menu
Improved workflow, with the Transform menu added to Tasks
Inclusion of OSC, Deresolve, Interactions and Squares, Weights, Compute_General,
Fill Missing and COW as registered pretreatments.
Help
The Help System has been completely updated to be more comprehensive and to reflect
current software operation. It has also been simplified: there is no longer context-sensitive
help for every user interface element, as in the 9.x series. Pressing F1 still brings up the
appropriate help page.
New methods
A completely new response surface plotting module with high-resolution, fast
graphics rendering and improved plotting controls for graphical optimization.
A new D-optimal design module with an option to augment the design with
space-filling points (more robust).
Re-introduction of PLS-DoE, and more design information displayed in ‘Tasks –
Analyze – Analyze Design Matrix’ to help you find the best method for your data.
Plotting
Plot settings in ‘Tools – Options – Viewer’ can be used to change the default
appearance of plots.
New plots and plot layouts for Residuals and Influence plots in PCA, PCR, PLSR and
Projection, including F-residuals with limits.
Point labeling using the value of any matching variable (Sample Grouping).
General
ASCII file import with default list separator based on system settings.
New Alarms tab in analysis dialogs of PCA, MLR, PCR and PLSR and right-click option
for setting alarm limits in the project navigator (these limits are applied for online
prediction using some of our prediction engines).
New dialog for assigning Scalar/Vector tags as well as units (‘Edit – Scalar and
Vector’ in editor mode or right-click option in project navigator). This information is
used for collecting data from various sources during online monitoring of processes.
General enhancements and bug fixes.
This document provides information about The Unscrambler® X version 10.2, which contains
several enhancements and new features for data import and export, analysis, graphics, and
Design of Experiments (DoE). These updates have been implemented following the release of
version 10.1. The Unscrambler® 10.2 is available in 32-bit and 64-bit versions.
37.8. Applicability
Corrections have been made to address several issues:
Overall performance of the program has been optimized, mainly based on the way
data is stored in memory during calculations.
More details of analysis methods and data have been added to info boxes.
The Find and Replace functionality has been optimized.
More time is allowed for renaming project navigator nodes.
The definition of Identity matrices in Insert - Data Matrix has been corrected to
produce only square matrices.
Median Absolute Deviation (MAD) scaling has now been moved to Tasks - Transform
- Centre and Scale as a scaling option in the dropdown list.
Compute_General has been optimized to handle case-sensitive entries.
Audit Trails now have a save option for printing and recording project details.
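The MAD scaling option mentioned above centres each variable by its median and scales it by the Median Absolute Deviation. A small illustrative sketch of the calculation (not The Unscrambler's internal code):

```python
import numpy as np

def mad_scale(x):
    """Centre by the median and scale by the Median Absolute Deviation (MAD).

    Illustrative sketch of MAD-based centring and scaling; this is not
    The Unscrambler's internal implementation.
    """
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))  # robust analogue of the standard deviation
    return (x - med) / mad

# The outlier (100) barely affects the median and the MAD, so the
# remaining points keep a sensible scale after the transform.
scaled = mad_scale([1.0, 2.0, 3.0, 4.0, 100.0])
```

This robustness to outliers is the usual reason for preferring MAD scaling over standard-deviation scaling.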
In Multiple Linear Regression, a rank dependency test has been added to better
handle singularities.
All analysis plots now have titles.
Known issues
The 9th design variable is by default called ‘J’, not the reserved letter ‘I’.
Some larger fractional factorial designs have been removed due to large memory usage.
An upper limit is imposed on the number of experimental runs for full factorial designs.
Display of B coefficients and effects plots/tables removed for designs with
categorical variables with 3 levels or more.
The DoE PLS option accessible from Tasks – Analyze – Analyze design matrix has
been disabled. For D-optimal designs, use Tasks – Analyze – Partial Least Squares
Regression instead. To analyze other designs using PLSR, change data type to
numeric first.
For models with category variables and center points included, the total degrees of
freedom differs from version 10.1.
User-Friendly Enhancements
The Define Range dialog has been completely overhauled for better ease of use and
functionality.
Edit - Convert allows the conversion of data collected in nanometers to be displayed
in reciprocal centimeters (and vice versa).
The Fill function is available as a right-click option in the data editor.
Legend and Display Points icons are now available in the toolbar.
Duplicate Matrix is now available as a right-click option in the project navigator.
Keep Outs handling in all dialog boxes has been optimized.
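The nanometre/wavenumber conversion offered by Edit - Convert is the simple reciprocal relation between wavelength and wavenumber (1 cm = 10^7 nm), sketched here for illustration:

```python
def nm_to_wavenumber(wavelength_nm):
    """Wavelength in nm -> wavenumber in reciprocal centimeters (cm^-1)."""
    return 1e7 / wavelength_nm

def wavenumber_to_nm(wavenumber_cm1):
    """Wavenumber in cm^-1 -> wavelength in nm (same reciprocal relation)."""
    return 1e7 / wavenumber_cm1

# e.g. 2500 nm corresponds to 4000 cm^-1, a conventional NIR range boundary
```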
When the Uncertainty test is applied to PLSR or PCR, the uncertainty limits are
provided for weighted coefficients only.
Models with block weights are not compatible with v10.1. When such models are used
for recalculation in 10.1, they will produce different weights. Workaround: reselect the
weights when recalculating.
If OSC was used as a transform in version 10.1, these values will not match those of
version 10.2.
p-values of jackknife matrices will not match those of v9.8; the p-value is set to 1 if
the variables are down-weighted.
Jackknife matrices will not match those of v9.8 and v10.1.
Correlation loadings will not match those of v9.8 and v10.1 if weights are set to zero.
This document provides information about The Unscrambler® X version 10.1, which contains
several enhancements and new features for data import and export, graphics, and Design of
Experiments (DoE). These updates have been implemented following the release of
version 10.0.1. The Unscrambler® 10.1 is available in 32-bit and 64-bit versions.
The import of Excel data files did not always import all the columns from the Excel
spreadsheet. This has been corrected.
ASCII files can be batch imported.
U5 data can be imported into The Unscrambler®.
37.15. Applicability
Corrections have been made to address several issues:
The axis labels in the influence plots in projection have been corrected to properly
reflect the information plotted.
Predictions made with models that include the MSC transform on part of the
columns did not give consistent results. This issue has been addressed.
Issues around the display of the correct sample names in the Coomans’ plot for
classified samples using SIMCA have been resolved.
The info box for a PLS model did not always correctly reflect the validation method
used. When full cross validation was used, the validation was displayed as having
been random with 20 segments.
The Compute General function now allows mathematical formulae with non-integer
values.
Category variables can be copied and pasted.
The x-axis values can now be scaled based on the variable values.
Compact, mini and micro models from previous versions of The Unscrambler® can
be imported.
For experiments that include category variables, center points are defined for each
level of the category variables.
Response surface plots have been improved.
The grid editor has been modified to give improved performance. Data that are
generated in version 10.1 cannot be opened in previous versions of The
Unscrambler® X.
Copy and paste and drag and drop have been implemented in the editor.
In defining ranges, one can now select the reverse of the selected rows (columns)
with a single click.
Prediction diagnostics per segment are available when a cross-validation other than
full is used in developing PLS or PCR models.
The quantile normalization function of median absolute deviation (MAD) has been
added as a new transform.
Q residual limits and Q residuals for samples are available within the results for
predictions.
The ability to save model files as smaller files for easier model file transportability
has been added.
A user can set the number of components to save in a model file.
Defined ranges (row and column) can be copied and pasted into a matrix of the
same dimensions.
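The Q residuals referred to above measure, per sample, the variation left outside the PCA model. A NumPy sketch of the computation on hypothetical data (control limits, such as the Jackson-Mudholkar approach cited in the bibliography, are omitted here):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 6))   # hypothetical data: 30 samples, 6 variables
Xc = X - X.mean(axis=0)        # column-centred data

# PCA via the thin SVD, keeping k components
k = 2
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:k].T                   # loadings (variables x components)
T = Xc @ P                     # scores
E = Xc - T @ P.T               # part of X not explained by the model
Q = np.sum(E**2, axis=1)       # Q residual (residual sum of squares) per sample
```

Samples with a Q residual above the limit fit poorly into the model subspace even if their scores look unremarkable, which is why Q residuals complement leverage/Hotelling's T² diagnostics.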
The properties of category variables, including their names and order, can be
changed.
In LDA, the ability to do an automatic PCA-LDA on sample sets with many variables
has been added under the options for LDA.
Plot legends are now presented according to the sample grouping used in a given
plot.
Users now have the ability to use sample grouping on 3-D scatter plots, and to more
readily change the properties of 3-D plots.
Sample grouping is now possible in all the relevant PLS results plots, including
the X-Y relation outliers, Y-residuals vs. predicted Y, and Y-residuals vs. scores.
With sample grouping, the groups in a plot can be separated by symbol, color,
or both.
Greater flexibility in changing plot and axis scales and labels.
Additional options for the plot types have been added.
Some issues still remain when calculating mixture designs with constraints, mainly
due to summing of mixture amounts.
37.21. Tutorials
Additional tutorials have been added to give users a quick start in using The Unscrambler® X.
The ability to mark evenly distributed samples in scores plots and other PLS results plots has
been added.
37.22. Applicability
Several improvements in the graphics have been made, including:
The OSC transformation has been modified, hence the old OSC model (from version
10.0) cannot be used as a registered pretreatment in prediction.
In The Unscrambler® 9.8 and previous versions, the definition of the design, response, and
uncontrollable variables was made in three different windows. This has been reduced to the
Define Variables table in the Design Experiment Wizard.
Method reference documentation is yet to be updated.
37.27. Installation
1) Run The Unscrambler® X setup application and follow the setup wizard. Double-click the
“TheUnscramblerX_Setup.msi” file to start the installation wizard. The InstallShield Wizard
for The Unscrambler® X is launched. Follow the on-screen instructions.
2) Finish the setup. When the setup is complete, click Close.
3) Start The Unscrambler® X from the Start menu.
4) The Activation Wizard dialog opens. Click the Obtain button.
5) After receiving The Unscrambler® X activation key, paste it in the Activate window and
click the Activate button.
OR
Send the “machine ID” from The Unscrambler® X Activation window along with your user
name and e-mail address to support@camo.com. The CAMO Support Team will send you The
Unscrambler® X activation key.
38. Bibliography
38.1. Bibliography
G.H. Golub, C.F. van Loan, Matrix Computations, 2nd ed., The Johns Hopkins University Press,
Baltimore, 1989.
C. R. Goodall, Computation Using the QR Decomposition in Handbook in Statistics Vol. 9,
Elsevier, Amsterdam, 1993.
H.H. Harman, Modern Factor Analysis, 3rd Edition, revised, University of Chicago Press,
1976.
A. Höskuldsson, PLS regression methods, J. Chemom., 2, 211–228 (1988).
H. Hotelling, Analysis of a complex of statistical variables into principal components, J. Educ.
Psych., 24, 417–441, 498–520 (1933).
J.E. Jackson, A User's Guide to Principal Components, Wiley & Sons Inc., New York, 1991.
J.E. Jackson and G.S. Mudholkar, Control procedures for residuals associated with principal
component analysis, Technometrics, 21, 341-349 (1979).
J.E. Jackson and G.S. Mudholkar, Control procedures for residuals associated with principal
component analysis, Addendum, Technometrics, 22, 136 (1980).
R.A. Johnson, D.W. Wichern, Applied Multivariate Statistical Analysis, Prentice-Hall, Upper
Saddle River, NJ, 1988.
H.F. Kaiser, The varimax criterion for analytic rotation in factor analysis, Psychometrika, 23,
187–200 (1958).
R. Kramer, Chemometric Techniques for Quantitative Analysis, Marcel Dekker, Inc., New
York, 1998.
F. Lindgren, P. Geladi, S. Wold, The kernel algorithm for PLS, J. Chemom., 7, 45–59 (1993).
R. Manne, Analysis of two partial least squares algorithms for multivariate calibration,
Chemom. Intell. Lab. Syst., 2, 187–197 (1987).
K.V. Mardia, J.T. Kent, J.M. Bibby, Multivariate Analysis, Academic Press Inc, London, 1979.
H. Martens, T. Næs, Multivariate Calibration, John Wiley & Sons Inc, Chichester, 1989.
W.L. Martinez and A.R. Martinez, Exploratory Data Analysis with MATLAB, Chapman and
Hall, London, 2005.
D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michotte, L. Kaufman, Chemometrics: A
Textbook, Elsevier Publ., Amsterdam, 1988.
D.C. Montgomery, E.A. Peck, and G.G. Vining, Introduction to Linear Regression Analysis,
Third Edition, Wiley-Interscience, New York, 2001.
T. Næs, T. Isaksson, T. Fearn and T. Davies, A User-friendly Guide to Multivariate Calibration
and Classification, NIR Publications, Chichester, 2002.
J.O. Neuhaus and C. Wrigley, The Quartimax Method: An analytic approach to orthogonal
simple structure, British J. Statistical Psychology, 7(2), 81–91 (1954).
S. Rannar, F. Lindgren, P. Geladi and S. Wold, A PLS kernel algorithm for data sets with many
variables and fewer objects, Part 1: Theory and Algorithm, J. Chemom., 8, 111–125 (1994).
D.R. Saunders, An analytic method for rotation to orthogonal simple structure, Princeton,
Educational Testing Service Research Bulletin, 53–10 (1953).
S. Weisberg, Applied Linear Regression Second Edition, Wiley, New York, 1985.
S. Wold, Cross-validatory estimation of the number of components in factor and principal
components models, Technometrics, 20(4), 397–405 (1978).
S. Wold, K. Esbensen, P. Geladi, Principal component analysis — A tutorial, Chemom. Intell.
Lab. Syst., 2, 37–52 (1987).
S. Wold, Pattern recognition by means of disjoint principal components models, Pattern
Recognition, 8, 127–139 (1976).
JDSU Application
spectroscopy for the detection of meat and bone meal (MBM) in compound feeds, J.
Chemom., 18, 341–349 (2004).
Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, A Practical Guide to Support Vector
Classification, last updated: May 19, 2009, accessed August 27, 2009.
http://www.csie.ntu.edu.tw/~cjlin
C. Medina-Gutiérrez, J. Luis Quintanar, C. Frausto-Reyes, R. Sato-Berrú, The application of
NIR Raman spectroscopy in the assessment of serum thyroid-stimulating hormone in rats,
Spectrochimica Acta Part A, 61 (1–2), 87–91 (2005).
T. Næs, T. Isaksson, T. Fearn and T. Davies, A User-friendly Guide to Multivariate Calibration
and Classification, NIR Publications, Chichester, UK, 2002.
Burling-Claridge, S.E. Holroyd and R.M.W. Sumner (Eds), New Zealand NIRS Society Inc.,
Hamilton, 2007.
G. Tomasi, F.v.d. Berg, C. Andersson, Correlation optimized warping and dynamic time
warping as preprocessing methods for chromatographic data, J. Chemom., 18,
231–241 (2004).
F. Westad, H. Martens, Shift and intensity modelling in spectroscopy - general concept and
applications, Chemom. Intel. Lab. Syst., 45, 361–370 (1999).