
Chapter 32

Introduction to SIMCA-P and Its Application


Zaibin Wu, Dapeng Li, Jie Meng, and Huiwen Wang

Abstract SIMCA-P is user-friendly software developed by Umetrics, used mainly for principal component analysis (PCA) and partial least squares (PLS) regression. This paper introduces the main terminology, the analysis cycle and the basic operations of SIMCA-P through a practical example. In the application section, SIMCA-P is used to estimate a PLS model whose set of independent variables includes a qualitative variable, and the model is applied to sand storm prevention in Beijing. The results demonstrate that the Conservation Tillage method lowers wind erosion and is worth promoting for sand storm prevention in Beijing.

32.1 Introduction to SIMCA-P

32.1.1 About SIMCA-P Software


SIMCA-P, developed by Umetrics, is used mainly for principal component analysis (PCA) and partial least squares (PLS) regression. It is user-friendly, Windows-based software: models are easy to set up and handle, and the results are readily presented as plots and lists that explain the models in a variety of forms. At present,

Z. Wu and H. Wang
School of Economics and Management, Beihang University, 37 Xueyuan Road, Haidian District, Beijing 100083, China
e-mail: binship@126.com, wanghw@vip.sina.com

D. Li
Agricultural Bank of China, Beijing 100036, China
e-mail: zh.ldp@intl.abocn.com

J. Meng
School of Statistics, Central University of Finance and Economics, Beijing 100081, China
e-mail: mengjie517@126.com

V. Esposito Vinzi et al. (eds.), Handbook of Partial Least Squares, Springer Handbooks of Computational Statistics, DOI 10.1007/978-3-540-32827-8_33, © Springer-Verlag Berlin Heidelberg 2010


SIMCA-P is a standard tool in PLS regression analysis for researchers in many fields of science and technology.

32.1.2 Some Terminology in SIMCA-P


Several terms used in SIMCA-P are worth knowing in order to master the software.

(1) Project. SIMCA-P is organized into projects. A project is a folder containing the models together with their associated statistics and results.
(2) List. All data are listed in tables in the system. The first row, which holds the variable names, is marked as the Primary variable ID. The first column is marked as the identification numbers.
(3) Dataset. The set of data to be processed is known as a dataset. A project may contain several datasets.
(4) Model. Models are mathematical representations of your process and are developed using the data specified in the workset and with a specified model type.
(5) Workset. A workset is the set of data processed by the current active model. A workset can contain all the data, or be a subset of the primary data, with a particular treatment of the variables, such as role (predictor variables X, or responses Y), scaling, transformation, lagging, etc.
(6) Block. A block is a group of variables with the same role. For example, the Y block in a PLS model refers to all the dependent variables (responses).
(7) Class. The observations of a dataset can be split into different sets for different purposes; each such set is known as a class.

32.1.3 The Analysis Cycle


PCA and PLS estimation are straightforward in SIMCA-P. Users obtain the analysis results after a few steps that follow the principles of the PCA or PLS method. The analysis cycle can be summarized as follows.

(1) Start a project. Import the primary data from a file or database to create a new project.
(2) Preprocess the data. View or modify the SIMCA-P dataset. For example, new variables are easily generated as functions of existing ones or from model results; other preprocessing operations are carried out in a similar way.
(3) Prepare the workset. At the start of a project, the default workset is the whole dataset with all variables as X, and the default (unfitted) model is a principal component model of X. To fit another type of model, change the roles of the variables.
(4) Fit the model. Once the workset is prepared, the model can be estimated.
(5) Detect the outliers. Display the score scatter plot to reveal possible outliers, groups and other patterns in the data. Exclude any outliers from the workset and return to step (4) to fit a new model.

Fig. 32.1 Road map to SIMCA-P. The figure shows seven steps in sequence: 1. Start a project (import data to create a project); 2. Preprocess the data (view or modify a SIMCA-P dataset); 3. Prepare the workset (change the role of variables and specify the model type); 4. Fit the model (automatically or manually); 5. Detect outliers (and exclude them if any exist); 6. Review the fit (judge the effect and interpret the results); 7. Do prediction (specify a dataset for prediction).

(6) Review the fit. After a fit, the whole spectrum of plots and lists is available for model interpretation. Judge the quality of the fitted model and decide whether to use it for prediction.
(7) Do prediction. Build the prediction set from the primary dataset or any secondary dataset and carry out the prediction.

The steps above are summarized in the road map (Fig. 32.1).

32.2 The Basic Operations of SIMCA-P


The example below illustrates the main operations of SIMCA-P. The data describe the relationship between people's body condition and their sports performance. The predictors reflect body condition: avoirdupois (body weight), cummerbund (waist girth) and pulse. The responses are three measures of physical exercise: chin-up, curl and high jump. Twenty persons were selected. Table 32.1 shows the original data set (Jone Neter, used by Tenenhaus 1998).

32.2.1 The Main Window of SIMCA-P


Double-click the SIMCA-P icon on the desktop; the main window opens as shown in Fig. 32.2. (Note: the project containing the data in Table 32.1 was created beforehand; otherwise the active model status window would not be displayed.) The main window includes the following parts.

(1) The command menu bar. The name and folder of the project are shown in the title bar. The menus in the bar give access to all functions of the software.

Table 32.1 Observed data of body condition and sports grade

No   avoirdupois   cummerbund   pulse   chin-up   curl   high jump
1    191           36           50      5         162    60
2    189           37           52      2         110    60
3    193           38           58      12        101    101
4    162           35           62      12        105    37
5    189           35           46      13        155    58
6    182           32           56      4         101    42
7    211           38           56      8         101    38
8    167           34           60      6         125    40
9    176           31           74      15        200    40
10   154           33           56      17        251    250
11   169           34           50      17        120    38
12   166           33           52      13        210    115
13   154           34           64      14        215    105
14   247           46           50      1         50     50
15   193           36           46      6         70     31
16   202           37           62      12        210    120
17   176           37           54      4         60     25
18   157           32           52      11        230    80
19   156           33           54      15        225    73
20   138           33           68      2         110    43
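For readers who wish to reproduce the example outside SIMCA-P, the data of Table 32.1 can be entered as an array and scaled to unit variance, which is the default treatment SIMCA-P applies to a new workset (Sect. 32.2.2). The following is only a minimal sketch, assuming NumPy is available; the names `DATA`, `autoscale`, `X_scaled` and `Y_scaled` are illustrative, and the use of the sample standard deviation (`ddof=1`) is an assumption about the variance estimator.

```python
import numpy as np

# Table 32.1: columns are avoirdupois, cummerbund, pulse (predictors)
# followed by chin-up, curl, high jump (responses); one row per person.
DATA = np.array([
    [191, 36, 50,  5, 162,  60],
    [189, 37, 52,  2, 110,  60],
    [193, 38, 58, 12, 101, 101],
    [162, 35, 62, 12, 105,  37],
    [189, 35, 46, 13, 155,  58],
    [182, 32, 56,  4, 101,  42],
    [211, 38, 56,  8, 101,  38],
    [167, 34, 60,  6, 125,  40],
    [176, 31, 74, 15, 200,  40],
    [154, 33, 56, 17, 251, 250],
    [169, 34, 50, 17, 120,  38],
    [166, 33, 52, 13, 210, 115],
    [154, 34, 64, 14, 215, 105],
    [247, 46, 50,  1,  50,  50],
    [193, 36, 46,  6,  70,  31],
    [202, 37, 62, 12, 210, 120],
    [176, 37, 54,  4,  60,  25],
    [157, 32, 52, 11, 230,  80],
    [156, 33, 54, 15, 225,  73],
    [138, 33, 68,  2, 110,  43],
], dtype=float)

X, Y = DATA[:, :3], DATA[:, 3:]  # X block (predictors), Y block (responses)


def autoscale(a):
    """Center each column and scale it to unit sample variance (assumed ddof=1)."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)


X_scaled = autoscale(X)
Y_scaled = autoscale(Y)
```

After scaling, every column has mean zero and unit variance, so variables measured in different units contribute on an equal footing to the PCA or PLS model.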

Fig. 32.2 Main window of SIMCA-P


(2) The Standard and Shortcut bars. These buttons activate menu commands and plots; pressing a button performs the corresponding task.
(3) The Plot and Marker bars. Use the buttons in the Plot toolbar to insert labels or text in a plot, zoom in and read positions in graphs, get information about observations or variables, show a regression line in scatter plots, or rotate 3D graphs. The main function of the Marker toolbar is to exclude or include observations or variables in the active model and to create a new model.
(4) The Favorites window. The Favorites window contains commands and plots, which are marked with different symbols. Double-clicking a symbol executes a command or opens the specified plot.
(5) The Workset bar. The bar displays the variables and observations in the workset and their status.
(6) The active model status window. The window shows information about all the models, such as model name, type, number of components, etc.
(7) The Audit Trail window. Logged events are shown in this window.

32.2.2 The Operations of SIMCA-P


The important operations are as follows.

(1) Import the data and create a project

Data can be imported only from files or databases, not entered from the keyboard. The system supports more than 10 file types, such as txt, xls and mat. Select File | New | Get data from file; a standard dialog box opens in which to enter the type, name and location of the data file to be imported (Fig. 32.3). After the data file is imported, the first page of the import wizard opens (Fig. 32.4). The row with the variable names is by default marked as the Primary variable ID and colored dark green. You can select any other row as the Primary variable ID. If no Primary variable ID has been specified, SIMCA-P creates the IDs Var 1, Var 2, etc. The column with observation numbers or names is colored dark yellow. Select any desired column as the Primary Observation ID. Data are colored white and text is colored blue.

Fig. 32.3 Import the data file


Fig. 32.4 Import data wizard

Fig. 32.5 Specification of the project

Fig. 32.6 The active model status window

In the Import data wizard, other operations are available through the buttons in the left pane. Click Next to display the project specification page of the import wizard (Fig. 32.5). Specify the project name and the folder in which to save the work file; the file type is usp. The window also displays other information about the data set. Click Finish and the data set is imported. A project has now been created (Fig. 32.6). The default workset is the whole data set with all variables as X, scaled to unit variance. The associated model is PCA.

(2) Explore the data

Before fitting the model, you should understand the data thoroughly. Select Dataset | Quick info | Variables/Observations; a window opens with the name of the


Fig. 32.7 Usual statistics and dataset

Fig. 32.8 Generate variables

variable/observation and default options (Fig. 32.7). The usual statistics are shown in the window, such as the number of values, missing values, mean, etc. In some cases, a new variable has to be generated from the raw data. Select Dataset | Generate Variable; SIMCA-P opens the wizard window displaying the active data set in a spreadsheet (Fig. 32.8). Enter the expression defining the new variable and click Next. SIMCA-P displays the new variable with its formula, statistics and Quick info plots (Fig. 32.9). Click Finish, and the new variable is added at the end of the active dataset.

(3) Create the workset and set model options

After the primary data set is loaded, all the variables are selected as X variables (predictors). The active model type is PCX (principal component analysis of the X variables). To change the role of variables or observations, select Workset | Edit to open the Overview page of the workset dialog, which lists the current observations and variables and their attributes (Fig. 32.10). The workset dialog is organized into pages. Select the desired page to change the attributes of the observations or variables. To do PLS estimation, select Workset | Edit and change the roles of variables by marking the variable y and


Fig. 32.9 Info of new variable

Fig. 32.10 Overview page of the workset dialog

Fig. 32.11 Specify variables as responses

clicking on the desired button Y (Fig. 32.11). You can also set the class of observations in the Observations page. After these steps, select Workset | Options to set the options of the current active model (Fig. 32.12).

(4) Fit the model

Use the Analysis menu to fit the model. By default the model is a non-hierarchical base model. The model type is determined by the roles of the variables, for example PCA on the X-block, PCA on the Y-block, etc. The fitting methods include autofit,


Fig. 32.12 Model options

Fig. 32.13 Fit the model

Fig. 32.14 Model overview

two first components, next component, zero components, remove component and autofit class models (Fig. 32.13). Select Analysis | Autofit, and SIMCA-P extracts as many components as it considers significant. When you fit a model, a plot window opens and displays the cumulative R2 and Q2 for the X (PCA) or Y (PLS) matrix (Fig. 32.14). After fitting a model, you can mark the model, click Active Model Type | Hierarchical Base Model, and select scores, residuals, or both as variables in


Fig. 32.15 Hierarchical base model

Fig. 32.16 t[1]/t[2] scatter plot

another model (Fig. 32.15). The scores, residuals, or both are then added to the workset for use as variables in another model.

(5) Detect the outliers

Double-click t[1]/t[2] Scatter Plot in the Favorites window to display the score scatter plot after fitting (Fig. 32.16). Such plots reveal possible outliers, groups and other patterns in the data. To produce this plot, we extracted two components. In Fig. 32.16, observation 14 lies outside the 95% confidence region of the model, which marks it as an outlier. To eliminate its effect, exclude this observation from the workset: press the Mark item button in the Marker toolbar, mark observation 14, and then press the red arrow button. As a result, observation 14 is excluded from the workset and a new unfitted model is created without it (Fig. 32.17).

(6) Review the results

Use the Analysis menu to plot or list statistics such as scores, loadings and coefficients. For example, select Analysis | Summary | List to show the individual cumulative R2 and Q2 for each Y variable (Fig. 32.18). Alternatively, double-click a symbol in the Favorites window to execute a command. The Favorites bar is similar to a customized navigation bar. It contains commands and plots, which are marked with different symbols for specified


Fig. 32.17 Detect the outliers

Fig. 32.18 Summary of the results

Fig. 32.19 Coefficients Plot

plots/lists and for commands (which work on the active model). For example, double-click Coefficients Plot in the Favorites window to show the coefficients plot (Fig. 32.19). Besides the results available from the Analysis menu and the Favorites window, you can use the Plot/List menu to plot or list all the results. The Plot/List menus allow you to plot and list input data such as observations and variable values, computed


elements such as scaling weights, variable variances, etc., as well as results such as loadings, scores, predictions, etc., of all the fitted models.

(7) Do prediction

After fitting, the workset is by default specified as the prediction set. To build a prediction set by combining observations from different data sets, or to remove observations from the prediction set, select Predictions | Specify Predictions Set | Specify. The observations are displayed in the left pane (Fig. 32.20). Select the ones you want in the prediction set and move them to the right pane. After specifying the prediction set, use the Predictions menu to obtain prediction information for the current model. For example, select Distance to Model | Y Block | Line Plot to display the plot in Fig. 32.21. The residual standard deviation of an observation in the Y space is proportional to the distance of the observation to the hyperplane of the PLS model in the corresponding space. SIMCA-P computes the observation distances to the PLS model in the Y

Fig. 32.20 Specify predictions set

Fig. 32.21 Distance to model


space (DModY) and displays them as line plots. A large DModY value indicates that the observation is an outlier in the Y space. By default, these distances are computed after all extracted components.
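The idea behind DModY can be illustrated with a short sketch. It computes, for each observation, the root mean square of its Y residuals, which is the residual standard deviation described above up to a constant factor. This is an illustration only: the function name `dmod_y` is hypothetical, and the degrees-of-freedom correction and any normalization that SIMCA-P itself applies are not reproduced here.

```python
import numpy as np


def dmod_y(Y, Y_hat):
    """Row-wise root mean square of the Y residuals (one value per observation)."""
    F = np.asarray(Y, dtype=float) - np.asarray(Y_hat, dtype=float)
    if F.ndim == 1:          # a single response: treat it as one column
        F = F[:, None]
    return np.sqrt((F ** 2).mean(axis=1))


# Two observations, two responses: the first is fitted exactly,
# the second has residuals 3 and 4.
distances = dmod_y([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [0.0, 0.0]])
```

An observation whose value stands well above the others in such a line plot is a candidate outlier in the Y space.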

32.3 Application
In this section, we provide an application of SIMCA-P. Sandstorms are a serious obstacle to development worldwide, causing an annual global loss of 48 billion USD, including 6.5 billion USD in China. In recent years, sand storms in Beijing have caused many serious problems. Investigations showed that about 70 percent of the sand in these storms is generated by wind erosion of dry, fallow farmland around the city. Consequently, the study of wind erosion of soil has become very important for sand storm prevention (Shen et al. 2000; Li and Gao 2001; Gao 2002; Zang 2003). In this research, the Water Content in Soil ($x_1$), Soil Particle Size ($x_2$), the Rate of Straw Mulching ($x_3$) and the Type of Farmland are defined as the four independent variables (IVs). The Type of Farmland is a qualitative variable (QV) with four categories: sand farmland, traditional tillage farmland, grass farmland and Conservation Tillage farmland. To establish a regression model of the Wind Erosion Rate ($y$) on these four IVs, the Type of Farmland is transformed into four dummy variables (DVs), $D_1, D_2, D_3, D_4$. Table 32.2 shows the original data set. The sample size is 16. Based on the data in Table 32.2, the regression model can be written as follows:

$$y = u + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \beta_1 D_1 + \beta_2 D_2 + \beta_3 D_3 + \beta_4 D_4 \qquad (32.1)$$

Table 32.2 Wind erosion rate and IVs

No   y        x1       x2      x3     D1   D2   D3   D4
1    11.674   3.623    0.651   12.4   1    0    0    0
2    13.812   3.623    0.651   12.4   1    0    0    0
3    15.260   3.623    0.651   12.4   1    0    0    0
4    12.160   3.623    0.651   12.4   1    0    0    0
5    6.021    6.291    0.266   13.8   0    1    0    0
6    8.598    6.291    0.266   13.8   0    1    0    0
7    10.395   6.291    0.266   13.8   0    1    0    0
8    7.331    6.291    0.266   13.8   0    1    0    0
9    3.689    10.210   0.337   45.4   0    0    1    0
10   5.339    10.210   0.337   45.4   0    0    1    0
11   5.971    10.210   0.337   45.4   0    0    1    0
12   4.893    10.210   0.337   45.4   0    0    1    0
13   2.768    8.883    0.339   58.5   0    0    0    1
14   4.167    8.883    0.339   58.5   0    0    0    1
15   4.357    8.883    0.339   58.5   0    0    0    1
16   4.111    8.883    0.339   58.5   0    0    0    1


Fig. 32.22 Get the data from file

Fig. 32.23 Import data

Clearly, the following identity always holds in model (32.1):

$$D_1 + D_2 + D_3 + D_4 = 1 \qquad (32.2)$$
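The dummy coding and the identity (32.2) can be illustrated in a few lines of plain Python. This is a sketch only; the short category labels and the function name `dummy_code` are illustrative and are not part of SIMCA-P.

```python
# Farmland type of each of the 16 observations, in the order of Table 32.2
# (four observations per type). The labels are illustrative shorthand.
CATEGORIES = ["sand", "traditional", "grass", "conservation"]
farmland = [c for c in CATEGORIES for _ in range(4)]


def dummy_code(labels, categories):
    """Return one row of 0/1 indicators (D1..Dk) for each label."""
    return [[1 if label == c else 0 for c in categories] for label in labels]


D = dummy_code(farmland, CATEGORIES)

# Each observation belongs to exactly one category, so every row sums to 1.
row_sums = [sum(row) for row in D]
```

Because every row of indicators sums to 1, the four dummy columns together reproduce the constant column that carries the intercept $u$.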

Equation (32.2) shows that there is perfect multicollinearity among the IVs. We used SAS 8.0 to obtain the OLS estimates, and the system issued the following notes: the model is not of full rank; the least-squares solutions for the parameters are not unique; some statistics will be misleading; a reported DF of 0 or B means that the estimate is biased. Therefore, the OLS method is invalid in this case study. Consequently, we adopted PLS to establish the regression model, using SIMCA-P 9.0.

To begin, select File | New | Get data from file to import the primary dataset and create a new project (Fig. 32.22). The raw data were stored in c:\ and the source file is named sand.dif. After the file was selected, the Import data wizard was displayed (Fig. 32.23). The first row, with the variable names, is by default marked as the Primary variable ID. The first column is by default marked as the identification numbers. Click Next in the Import data wizard when finished. After importing the data, specify the project name and select the destination folder in which to save the project (Fig. 32.24). After these steps, a project has been created. By default all variables are selected as X, and the active model type is PCX (Fig. 32.25). To switch to PLS estimation, select Workset | Edit, where all the options of the default model can be changed in the Workset window. Variables are displayed with their roles (X, Y or excluded ( )). To change roles, mark the variable y and click on the desired button Y (Fig. 32.26).
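The rank deficiency that SAS reports for the OLS fit can be checked numerically before turning to PLS. The sketch below, which assumes NumPy is available, builds the design matrix of model (32.1) from Table 32.2 (intercept column, x1 to x3, D1 to D4) and computes its rank. The variable names are illustrative.

```python
import numpy as np

# One (x1, x2, x3) triple per farmland type, taken from Table 32.2.
x_by_type = np.array([
    [3.623, 0.651, 12.4],    # sand farmland
    [6.291, 0.266, 13.8],    # traditional tillage farmland
    [10.210, 0.337, 45.4],   # grass farmland
    [8.883, 0.339, 58.5],    # Conservation Tillage farmland
])

x = np.repeat(x_by_type, 4, axis=0)   # 16 x 3: four observations per type
D = np.repeat(np.eye(4), 4, axis=0)   # 16 x 4: dummy variables D1..D4
design = np.hstack([np.ones((16, 1)), x, D])  # 16 x 8 with the intercept

rank = np.linalg.matrix_rank(design)
```

The design matrix has 8 columns but rank 4, so the normal equations of OLS have no unique solution. The rank is 4 rather than 7 because, in this data set, x1, x2 and x3 take a single value per farmland type and are therefore themselves linear combinations of the dummy variables.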


Fig. 32.24 Specification of the project

Fig. 32.25 The default model

Fig. 32.26 Change the role of variable y

Fig. 32.27 Change the role of variable y

When we exit the Workset window, the model type has changed from PCX to PLS. This model is unfitted and is the active model (Fig. 32.27). After Analysis | Autofit is selected, SIMCA-P extracts one component according to the cross-validation rules. The plot on the right displays the cumulative R2 and Q2 for the Y (PLS) matrix after the extracted component (Fig. 32.28). Double-click t[1]/u[1] Scatter Plot in the Favorites window to display the t1/u1 plot (Fig. 32.29). The small scatter around the straight line indicates a good fit and a strong linear relationship between the Wind Erosion Rate and its IVs, so a linear regression model is appropriate.
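What the t1/u1 plot displays can be sketched for a single response. With one y variable, the first PLS weight vector is proportional to X'y computed on the scaled data, the X-score is t1 = Xw1, and the Y-score u1 is proportional to the scaled y. The code below is a minimal sketch of this first step on the data of Table 32.2, assuming NumPy and unit-variance scaling with the sample standard deviation; it does not claim to reproduce the exact numerical output of SIMCA-P, and all names are illustrative.

```python
import numpy as np

# Wind Erosion Rate y and the seven predictors (x1..x3, D1..D4) of Table 32.2.
y = np.array([11.674, 13.812, 15.260, 12.160, 6.021, 8.598, 10.395, 7.331,
              3.689, 5.339, 5.971, 4.893, 2.768, 4.167, 4.357, 4.111])
x_by_type = np.array([[3.623, 0.651, 12.4], [6.291, 0.266, 13.8],
                      [10.210, 0.337, 45.4], [8.883, 0.339, 58.5]])
X = np.hstack([np.repeat(x_by_type, 4, axis=0),
               np.repeat(np.eye(4), 4, axis=0)])


def autoscale(a):
    """Center and scale to unit sample variance (works column-wise for 2-D input)."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)


def first_pls_component(X, y):
    """Weights w1, X-scores t1 and (scaled) Y-scores u1 of the first PLS1 component."""
    Xs, ys = autoscale(X), autoscale(y)
    w = Xs.T @ ys                 # direction of maximal covariance with y
    w /= np.linalg.norm(w)        # normalize the weight vector to unit length
    t = Xs @ w                    # X-scores
    return w, t, ys               # for one response, u1 is proportional to ys


w1, t1, u1 = first_pls_component(X, y)
r = np.corrcoef(t1, u1)[0, 1]     # strength of the linear t1/u1 relation
```

A correlation `r` close to 1 corresponds to the small scatter around the straight line seen in the t[1]/u[1] plot.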


Fig. 32.28 The PLS model

Fig. 32.29 t[1]/u[1] scatter plot

Fig. 32.30 t[1]/t[2] scatter plot

For a better illustration of the regression results, we extracted two components. The cumulative Q2 for the two extracted components is 0.848; together they explain 72.6% of the variation in the IVs and 90.2% of the variation in y. Double-click t[1]/t[2] Scatter Plot in the Favorites window to display a two-dimensional score plot (Fig. 32.30). SIMCA-P draws the confidence ellipse based on Hotelling's T2. Observations situated outside the ellipse are outliers. Figure 32.30 shows no outliers. Double-click Observed vs. Predicted in the Favorites window to show the observed values against the fitted or predicted values (Fig. 32.31).
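The Hotelling T2 test behind the confidence ellipse can be sketched as follows. For an observation with scores t_1, ..., t_A, T2 is the sum of the squared scores, each divided by the variance of its component. The critical value used here, A(N^2 - 1)/(N(N - A)) times the F quantile with (A, N - A) degrees of freedom, is one commonly quoted form; that SIMCA-P 9.0 uses exactly this expression is an assumption. The F quantile is passed in as an argument rather than computed, and all function names are illustrative.

```python
import numpy as np


def hotelling_t2(scores):
    """T2 of each observation from an (N x A) matrix of mean-zero scores."""
    T = np.asarray(scores, dtype=float)
    return ((T ** 2) / T.var(axis=0, ddof=1)).sum(axis=1)


def t2_limit(n_obs, n_comp, f_crit):
    """Critical T2 given the F quantile with (n_comp, n_obs - n_comp) d.f. (assumed form)."""
    return n_comp * (n_obs ** 2 - 1) / (n_obs * (n_obs - n_comp)) * f_crit


def outside_ellipse(scores, f_crit):
    """Boolean flag per observation: True if it lies outside the T2 ellipse."""
    T = np.asarray(scores, dtype=float)
    n_obs, n_comp = T.shape
    return hotelling_t2(T) > t2_limit(n_obs, n_comp, f_crit)
```

For the 16 observations and two components of this section, the tabulated 95% quantile of F with (2, 14) degrees of freedom, approximately 3.74, would be passed as `f_crit`.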


Fig. 32.31 Observed vs. Predicted values

Fig. 32.32 Standardized regression coefficients

Figure 32.31 demonstrates that the estimation by PLS is effective. The model in terms of the original variables is estimated as follows:

$$y = 9.36 - 0.36x_1 + 5.54x_2 - 0.03x_3 + 2.11D_1 + 0.10D_2 - 0.59D_3 - 1.62D_4 \qquad (32.3)$$

Double-click Coefficients Plot in the Favorites window to show the standardized regression coefficients of the model (Fig. 32.32). According to Fig. 32.32, the larger the soil particle size, the more serious the wind erosion. Furthermore, because Soil Water Content and Straw Mulching Rate are negatively correlated with the Wind Erosion Rate, raising the soil water content and increasing the straw mulching rate help to ease soil wind erosion. Comparing the different kinds of farmland, we conclude that Conservation Tillage farmland has the lowest wind erosion rate, while sand farmland has the highest. Double-click w*c[1]/w*c[2] Scatter Plot in the Favorites window to show both the X-weights (w or w*) and the Y-weights (c), and thereby the correlation structure between X and Y (Fig. 32.33). Since the Conservation Tillage method cultivates the farmland shallowly and leaves as much crop residue as possible on the land surface, it is the most effective way to prevent wind erosion. Additionally, it can increase the land coverage


Fig. 32.33 Loading plots of IVs

rate, prevent water and soil loss, and raise the level of production. Therefore, the Conservation Tillage method is worth promoting, both for the prevention of sand storms in the Beijing area and for agricultural production.
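As a simple check on the fitted model (32.3), the equation can be evaluated for each type of farmland using the predictor values of Table 32.2. The sketch below is in plain Python; it takes the coefficients of x1, x3, D3 and D4 as negative, a reading that is consistent with the discussion of Fig. 32.32 but should be treated as an assumption. The names are illustrative.

```python
# Coefficients of Eq. (32.3); the negative signs are an assumed reading.
INTERCEPT = 9.36
B_X = (-0.36, 5.54, -0.03)          # x1, x2, x3
B_D = (2.11, 0.10, -0.59, -1.62)    # D1, D2, D3, D4

# (x1, x2, x3) for each farmland type, from Table 32.2, in the order D1..D4.
X_BY_TYPE = {
    "sand": (3.623, 0.651, 12.4),
    "traditional tillage": (6.291, 0.266, 13.8),
    "grass": (10.210, 0.337, 45.4),
    "conservation tillage": (8.883, 0.339, 58.5),
}


def predict(x, type_index):
    """Predicted Wind Erosion Rate for predictors x and farmland type D_(type_index + 1)."""
    return INTERCEPT + sum(b * v for b, v in zip(B_X, x)) + B_D[type_index]


predicted = {name: predict(x, i)
             for i, (name, x) in enumerate(X_BY_TYPE.items())}
```

The predicted values decrease from sand farmland to Conservation Tillage farmland, in line with the ranking of the farmland types discussed above.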

32.4 Conclusion
This paper has introduced model fitting with SIMCA-P, which is an effective tool for multivariate data analysis. In the empirical study, the results show that PLS is preferable to OLS when the IVs include QVs coded as dummy variables. The PLS model not only identified the factors affecting soil wind erosion, in fair agreement with reality, but also indicated that the Conservation Tillage method is the most effective way to ease soil wind erosion. The results provide valuable information for sand storm prevention in Beijing.

References
UMETRI AB: SIMCA-P for Windows: Multivariate Modeling, Analysis and SPC of Process Data. UMETRI AB (1996)
Tenenhaus, M.: La Régression PLS: Théorie et Pratique. TECHNIP, Paris (1998)
Shen, Y.C., Yang, Q.Y., Jing, K., Xu, J.X.: Sand-storm and Dust-storm in China and Prevention and Control. Journal of Arid Land Resources and Environment, 14(3), 11-14 (2000)
Li, L.J., Gao, Q.X.: Source Analysis of Beijing Sand-Dust in 2000. Research of Environmental Sciences, 14(2), 1-4 (2001)
Gao, Q.X.: The Dust Weather of Beijing and Its Impact. China Environmental Science, 22(5), 468-471 (2002)
Zang, Y.: Experimental Study on Soil Erosion by Wind under Conservation Tillage, 19(2), 56-60 (2003)
