Abstract SIMCA-P is user-friendly software developed by Umetrics, used mainly for principal component analysis (PCA) and partial least squares (PLS) regression. This paper introduces the main terminology, the analysis cycle and the basic operations of SIMCA-P through a practical example. In the application section, SIMCA-P is used to estimate a PLS model whose set of independent variables includes a qualitative variable, and the model is applied to sand storm prevention in Beijing. The paper further demonstrates that the Conservation Tillage method lowers wind erosion and is therefore worth promoting for sand storm prevention in Beijing.
Z. Wu and H. Wang
School of Economics and Management, Beihang University, 37 Xueyuan Road, Haidian District, Beijing 100083, China
e-mail: binship@126.com, wanghw@vip.sina.com
D. Li
Agricultural Bank of China, Beijing 100036, China
e-mail: zh.ldp@intl.abocn.com
J. Meng
School of Statistics, Central University of Finance and Economics, Beijing 100081, China
e-mail: mengjie517@126.com
V. Esposito Vinzi et al. (eds.), Handbook of Partial Least Squares, Springer Handbooks of Computational Statistics, DOI 10.1007/978-3-540-32827-8_33, © Springer-Verlag Berlin Heidelberg 2010
Z. Wu et al.
SIMCA-P has been a standard tool in PLS regression analysis for researchers in many fields of science and technology.
[Fig. 32.1: road map of the analysis cycle; box 6 reads "Review the fit: judge the effect and interpret the results"]
(6) Review the fit. After a fit, the whole spectrum of plots and lists is available for model interpretation. Users should judge the quality of the fitted model and decide whether to proceed to prediction.
(7) Make predictions. Build the prediction set from the primary data set or from any secondary data set.
These steps are summarized in the road map (Fig. 32.1).
Table 32.1 Observed data of body condition and sports grade
No   avoirdupois  cummerbund  pulse  chin-up  curl
1    191          36          50     5        162
2    189          37          52     2        110
3    193          38          58     12       101
4    162          35          62     12       105
5    189          35          46     13       155
6    182          32          56     4        101
7    211          38          56     8        101
8    167          34          60     6        125
9    176          31          74     15       200
10   154          33          56     17       251
11   169          34          50     17       120
12   166          33          52     13       210
13   154          34          64     14       215
14   247          46          50     1        50
15   193          36          46     6        70
16   202          37          62     12       210
17   176          37          54     4        60
18   157          32          52     11       230
19   156          33          54     15       225
20   138          33          68     2        110
Note: avoirdupois denotes body weight and cummerbund denotes waist girth.
(2) Standard and shortcut bar. These shortcut buttons activate command menus and plots; pressing a button performs the corresponding task.
(3) Plot and Marker bars. Use the buttons in the Plot toolbar to insert labels or text in a plot, to enlarge graphs and read positions in them, to get information about observations or variables, to show a regression line in scatter plots, or to rotate 3D graphs. The main function of the Marker toolbar is to exclude or include observations or variables in the active model and to create a new model.
(4) The Favorites window. The Favorites window contains commands and plots, which are marked with different symbols. Double-clicking a symbol executes a command or opens the specified plot.
(5) The Workset bar. This bar displays the variables and observations in the workset and their status.
(6) The active model status window. This window shows information about all the models, such as model name, type, number of components, etc.
(7) The Audit Trail window. Logged events are shown in this window.
In the Import data wizard, other operations are available through the buttons in the left window. Clicking Next displays the project specification page of the import wizard (Fig. 32.5). Specify the project name and the folder in which to save the work file. The file type is usp. The page also displays further information about the data set. Click Finish to import the data set; a project has now been created (Fig. 32.6). The default workset is the whole data set, with all variables set as X and scaled to unit variance. The associated model is PCA.

(2) Explore the data. Before fitting the model, you should understand the data thoroughly. Select Dataset | Quick info | Variables/Observations; a window opens with the name of the variable/observation and default options (Fig. 32.7). The usual statistics are shown in the window, such as number, missing values, mean, etc. In some cases a new variable has to be generated from the raw data. Select Dataset | Generate Variable; SIMCA-P opens a wizard window displaying the active data set in a spreadsheet (Fig. 32.8). Enter the expression defining the new variable and click Next. SIMCA-P displays the new variable with its formula, statistics and Quick info plots (Fig. 32.9). Click Finish, and the new variable is added at the end of the active data set.

(3) Create the workset and set model options. After the primary data set is loaded, all variables are selected as X variables (predictors), and the active model type is PCX (principal component analysis of the X variables). To change the role of variables or observations, select Workset | Edit to open the Overview page of the workset dialog, which lists the current observations and variables with their attributes (Fig. 32.10). The workset dialog is organized into pages; select the desired page to change the attributes of the observations or variables. For PLS estimation, select Workset | Edit and change the roles of the variables by marking the variable y and clicking the button Y (Fig. 32.11). You can also set the class of observations on the Observations page. Afterwards, select Workset | Options to set the options of the current active model (Fig. 32.12).

(4) Fit the model. Use the Analysis menu to fit the model. By default the model is a non-hierarchical base model. The model type is determined by the roles of the variables and includes PCA on the X-block, PCA on the Y-block, etc. The fitting methods include autofit, two first components, next component, zero component, remove component and autofit class models (Fig. 32.13). With Analysis | Autofit, SIMCA-P extracts as many components as are considered significant. When you fit a model, a plot window opens and displays the cumulative R2 and Q2 for the X (PCA) or Y (PLS) matrix (Fig. 32.14). After fitting a model, you can mark the model, click Active Model Type | Hierarchical Base Model, and select scores, residuals, or both (Fig. 32.15). The selected scores, residuals, or both are then added to the workset so that they can be used as variables in another model.

(5) Detect the outliers. After fitting, double-click t[1]/t[2] Scatter Plot in the Favorites window to display the score scatter plot (Fig. 32.16). Such plots reveal the possible presence of outliers, groups and other patterns in the data. To illustrate this plot, we extract two components. In Fig. 32.16, observation 14 lies outside the 95% confidence region of the model, so it is treated as an outlier. To eliminate its effect, exclude this observation from the workset: press the Mark item button in the Marker toolbar, mark observation 14, and then press the red arrow button. As a result, observation 14 is excluded from the workset and a new unfitted model is created without it (Fig. 32.17).

(6) Review the results. Use the Analysis menu to plot or list statistics, including scores, loadings, coefficients, etc. For example, select Analysis | Summary | List to show the individual cumulative R2 and Q2 for each Y variable (Fig. 32.18). Alternatively, you can double-click a symbol in the Favorites window to execute a command. The Favorites bar is similar to a customized navigation bar. It contains commands and plots, marked with different symbols for specified plots/lists and for commands (which work on the active model). For example, double-click Coefficients Plot in the Favorites window to show the coefficients plot (Fig. 32.19). Besides the results available from the Analysis menu and the Favorites window, you can use the Plot/List menu to plot or list all results. The Plot/List menus allow you to plot and list input data such as observations and variable values, computed elements such as scaling weights and variable variances, and results such as loadings, scores and predictions for all the fitted models.

(7) Make predictions. After fitting, the workset is by default specified as the prediction set. To build a prediction set by combining observations from different data sets, or to remove observations from the prediction set, select Predictions | Specify Predictions Set | Specify. The observations are displayed in the left window (Fig. 32.20). Select the ones you want in the prediction set and move them to the right window. After specifying the prediction set, use the Predictions menu to obtain prediction information for the current model. For example, select Distance to Model | Y Block | Line Plot to display this plot (Fig. 32.21). The residual standard deviation of an observation in the Y space is proportional to the distance of the observation to the hyperplane of the PLS model in the corresponding space. SIMCA-P computes the observation distances to the PLS model in the Y space (DModY) and displays them as line plots. A large DModY value indicates that the observation is an outlier in the Y space. By default, these distances are computed after all extracted components.
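The default preprocessing mentioned in step (1), scaling every variable to unit variance, can be illustrated on one column of Table 32.1. The sketch below assumes the usual definition (subtract the mean, divide by the sample standard deviation); the function name `uv_scale` is hypothetical and the code is not SIMCA-P's own implementation. It also shows that observation 14 is already the most extreme value of this variable before any model is fitted.

```python
from statistics import mean, stdev

# Body weight ("avoirdupois") column of Table 32.1, observations 1..20.
weight = [191, 189, 193, 162, 189, 182, 211, 167, 176, 154,
          169, 166, 154, 247, 193, 202, 176, 157, 156, 138]


def uv_scale(values):
    """Center a variable and divide it by its sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


scaled = uv_scale(weight)

# Table numbering is 1-based, list indices are 0-based.
most_extreme = max(range(len(scaled)), key=lambda i: abs(scaled[i])) + 1

print(f"mean = {mean(weight):.1f}, sd = {stdev(weight):.2f}")
print(f"largest |scaled value|: observation {most_extreme}, "
      f"value {scaled[most_extreme - 1]:.2f}")
```

A univariate check of this kind is only a rough aid; the score plot in step (5) judges outliers in the multivariate space of the fitted components.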
32.3 Application
In this section, we provide an application of SIMCA-P. Sandstorms are a major obstacle to development worldwide, causing an annual global loss of 48 billion USD, of which 6.5 billion USD is in China. In recent years, sand storms in Beijing have caused many serious problems. Investigations showed that about 70 percent of the sand in these storms is generated by wind erosion of dry, fallow farmland around the city. Consequently, the study of soil wind erosion is very important for sand storm prevention (Shen et al. 2000; Li and Gao 2001; Gao 2002; Zang 2003).

In this research, the Water Content in Soil (x1), the Soil Particle Size (x2), the Rate of Straw Mulching (x3) and the Type of Farmland are defined as four independent variables (IVs). The Type of Farmland is a qualitative variable (QV) with four categories: sand farmland, traditional tillage farmland, grass farmland and Conservation Tillage farmland. To establish a regression model of the Wind Erosion Rate (y) on these four IVs, the Type of Farmland has to be transformed into four dummy variables (DVs), D1, D2, D3, D4. Table 32.2 shows the original data set. The sample size is 16. Based on the data in Table 32.2, the regression model can be written as follows:

$$ y = u + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \alpha_1 D_1 + \alpha_2 D_2 + \alpha_3 D_3 + \alpha_4 D_4 \tag{32.1} $$

Table 32.2 Wind erosion rate and IVs
No   y       x1      x2     x3    D1  D2  D3  D4
1    11.674  3.623   0.651  12.4  1   0   0   0
2    13.812  3.623   0.651  12.4  1   0   0   0
3    15.260  3.623   0.651  12.4  1   0   0   0
4    12.160  3.623   0.651  12.4  1   0   0   0
5    6.021   6.291   0.266  13.8  0   1   0   0
6    8.598   6.291   0.266  13.8  0   1   0   0
7    10.395  6.291   0.266  13.8  0   1   0   0
8    7.331   6.291   0.266  13.8  0   1   0   0
9    3.689   10.210  0.337  45.4  0   0   1   0
10   5.339   10.210  0.337  45.4  0   0   1   0
11   5.971   10.210  0.337  45.4  0   0   1   0
12   4.893   10.210  0.337  45.4  0   0   1   0
13   2.768   8.883   0.339  58.5  0   0   0   1
14   4.167   8.883   0.339  58.5  0   0   0   1
15   4.357   8.883   0.339  58.5  0   0   0   1
16   4.111   8.883   0.339  58.5  0   0   0   1
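The transformation of the qualitative variable into the dummy columns of Table 32.2 can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `dummy_code` is hypothetical, and the assignment of category names to D1..D4 is an assumption based on the order in which the text lists the categories.

```python
# Farmland type of each of the 16 observations, in the row order of Table 32.2.
# Assumption: D1..D4 correspond to the categories in the order listed in the text.
CATEGORIES = ["sand", "traditional tillage", "grass", "conservation tillage"]
farmland = [category for category in CATEGORIES for _ in range(4)]


def dummy_code(labels, categories):
    """Return one 0/1 indicator per category for every label."""
    return [[1 if label == category else 0 for category in categories]
            for label in labels]


D = dummy_code(farmland, CATEGORIES)

print("No  D1 D2 D3 D4")
for no, row in enumerate(D, start=1):
    print(f"{no:<3}", *row)
```

Each row contains exactly one 1, because every observation belongs to exactly one farmland type.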
Clearly, the following identity always holds in model (32.1):

$$ D_1 + D_2 + D_3 + D_4 = 1 \tag{32.2} $$
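The consequence of identity (32.2) for the design matrix can be checked numerically. The sketch below builds the columns [1, x1, x2, x3, D1, D2, D3, D4] from the values in Table 32.2 and computes the rank by exact Gaussian elimination; the helper names `design_rows` and `rank` are hypothetical. A rank smaller than the number of columns means that the least-squares normal equations have no unique solution.

```python
from fractions import Fraction

# (x1, x2, x3) for the four farmland types of Table 32.2; each type has
# four observations. Decimal strings keep the arithmetic exact.
GROUPS = [("3.623", "0.651", "12.4"), ("6.291", "0.266", "13.8"),
          ("10.210", "0.337", "45.4"), ("8.883", "0.339", "58.5")]


def design_rows(with_x=True):
    """Rows [1, x1, x2, x3, D1..D4], or [1, D1..D4] if with_x is False."""
    rows = []
    for g, xs in enumerate(GROUPS):
        dummies = [Fraction(int(g == j)) for j in range(4)]
        x_part = [Fraction(x) for x in xs] if with_x else []
        rows += [[Fraction(1)] + x_part + dummies for _ in range(4)]
    return rows


def rank(matrix):
    """Matrix rank by exact Gaussian elimination over the rationals."""
    m = [row[:] for row in matrix]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


for with_x in (False, True):
    rows = design_rows(with_x)
    print(f"columns: {len(rows[0])}, rank: {rank(rows)}")
```

With the intercept and the four dummies alone, one column is redundant. In Table 32.2 the values of x1, x2 and x3 also repeat within each farmland type, so adding them does not raise the rank.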
Equation (32.2) implies perfect multicollinearity among the IVs, because the four dummy variables always sum to the constant term of the model. We attempted the estimation with SAS 8.0, which issued the following notes: the model is not of full rank; the least-squares solutions for the parameters are not unique; some statistics will be misleading; a reported DF of 0 or B means that the estimate is biased. The OLS method is therefore invalid in this case study. Consequently, we adopted PLS to establish the regression model, using SIMCA-P 9.0.

To begin, select File | New | Get data from file to import the primary data set and create a new project (Fig. 32.22). The raw data was stored in c:\ and the source file is named sand.dif. After the file was selected, the Import data wizard window was displayed (Fig. 32.23). The first row, containing the variable names, is by default marked as the Primary variable ID. The first column is by default marked as identification numbers. Click Next in the Import data wizard when finished. After importing the data, specify the project name and select the destination folder in which to save the project (Fig. 32.24). A project has now been created. By default all variables are selected as X, and the active model type is PCX (Fig. 32.25). To use PLS estimation, select Workset | Edit; all options of the default model can be changed in the Workset window. Variables are displayed with their roles (X, Y or excluded (-)). To change roles, mark the variable y and click the button Y (Fig. 32.26).
When we exit the Workset window, the model type has changed from PCX to PLS. This model is unfitted and is the active model (Fig. 32.27). After Analysis | Autofit is selected, SIMCA-P extracts one component according to the cross-validation rules. The plot on the right displays the cumulative R2 and Q2 for the Y (PLS) matrix after the extracted component (Fig. 32.28). Double-click t[1]/u[1] Scatter Plot in the Favorites window to display the t1/u1 plot (Fig. 32.29). The small scatter around the straight line indicates a good fit and a strong linear relationship between the Wind Erosion Rate and its IVs, so a linear regression model is appropriate.
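The Q2 statistic referred to above measures predictive ability under cross validation. A common definition, assumed here, is Q2 = 1 - PRESS/SS, where PRESS is the sum of squared differences between the observed values and the values predicted when the observation was left out, and SS is the sum of squares of y around its mean. The function name `q2` and the toy numbers are illustrative only; SIMCA-P's exact cross-validation rules are not reproduced.

```python
def q2(y_observed, y_cv_predicted):
    """Q2 = 1 - PRESS / SS, with SS taken around the mean of y."""
    y_bar = sum(y_observed) / len(y_observed)
    press = sum((y - p) ** 2 for y, p in zip(y_observed, y_cv_predicted))
    ss = sum((y - y_bar) ** 2 for y in y_observed)
    return 1.0 - press / ss


# Toy numbers, not taken from the chapter.
y_obs = [1.0, 2.0, 3.0, 4.0]
good_cv_predictions = [1.1, 1.9, 3.2, 3.8]
mean_only_predictions = [2.5, 2.5, 2.5, 2.5]

print(q2(y_obs, good_cv_predictions))    # close to 1: good predictive ability
print(q2(y_obs, mean_only_predictions))  # 0: no better than predicting the mean
```

A component that does not raise Q2 adds fit without adding predictive ability, which is the idea behind extracting components by cross validation.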
For a better illustration of the regression results, we extracted two components. The cumulative Q2 for the extracted components is 0.848, and the two components explain 72.6% of the variation of the IVs and 90.2% of the variation of y. Double-click t[1]/t[2] Scatter Plot in the Favorites window to display a two-dimensional score plot (Fig. 32.30). SIMCA-P draws the confidence ellipse based on Hotelling's T2. Observations situated outside the ellipse are outliers. Figure 32.30 shows no outliers. Double-click Observed vs. Predicted in the Favorites window to show the observed values against the fitted or predicted values (Fig. 32.31).
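To make the reported quantities more concrete, the following is a minimal sketch of PLS regression with a single response (PLS1) computed by the NIPALS algorithm on the data of Table 32.2. It assumes that all variables are centered and scaled to unit variance and it extracts a fixed number of components without cross validation. It is not SIMCA-P's implementation, the helper names (`uv`, `dot`, `pls1`) are hypothetical, and the printed values need not reproduce the figures reported above exactly.

```python
from math import sqrt
from statistics import mean, stdev

Y_RAW = [11.674, 13.812, 15.260, 12.160, 6.021, 8.598, 10.395, 7.331,
         3.689, 5.339, 5.971, 4.893, 2.768, 4.167, 4.357, 4.111]
GROUP_X = [(3.623, 0.651, 12.4), (6.291, 0.266, 13.8),
           (10.210, 0.337, 45.4), (8.883, 0.339, 58.5)]
# Columns: x1, x2, x3, D1, D2, D3, D4 (four observations per farmland type).
X_RAW = [list(GROUP_X[i // 4]) + [float(i // 4 == j) for j in range(4)]
         for i in range(16)]


def uv(col):
    """Center a column and scale it to unit (sample) variance."""
    m, s = mean(col), stdev(col)
    return [(v - m) / s for v in col]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def pls1(x_rows, y, n_components):
    """NIPALS PLS1; returns the scores and the cumulative R2X and R2Y."""
    cols = [uv(list(c)) for c in zip(*x_rows)]  # X stored column-wise
    f = uv(y)                                   # current y residual
    n = len(f)
    ss_x, ss_y = sum(dot(c, c) for c in cols), dot(f, f)
    scores, r2x, r2y = [], [], []
    for _ in range(n_components):
        w = [dot(c, f) for c in cols]           # weights proportional to X'y
        norm = sqrt(dot(w, w))
        w = [v / norm for v in w]
        t = [sum(wk * c[i] for wk, c in zip(w, cols)) for i in range(n)]
        tt = dot(t, t)
        p = [dot(c, t) / tt for c in cols]      # X loadings
        q = dot(f, t) / tt                      # y loading
        # Deflate X and y by the part explained by this component.
        cols = [[c[i] - t[i] * pk for i in range(n)] for c, pk in zip(cols, p)]
        f = [f[i] - q * t[i] for i in range(n)]
        scores.append(t)
        r2x.append(1 - sum(dot(c, c) for c in cols) / ss_x)
        r2y.append(1 - dot(f, f) / ss_y)
    return scores, r2x, r2y


scores, r2x, r2y = pls1(X_RAW, Y_RAW, 2)
print("cumulative R2X:", [round(v, 3) for v in r2x])
print("cumulative R2Y:", [round(v, 3) for v in r2y])
```

The lists `scores[0]` and `scores[1]` play the role of t[1] and t[2] in the score plot; by construction they are orthogonal to each other.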
Figure 32.31 demonstrates that the PLS estimation is effective. The model in terms of the original variables is estimated as follows:

$$ y = 9.36 - 0.36 x_1 + 5.54 x_2 - 0.03 x_3 + 2.11 D_1 + 0.10 D_2 - 0.59 D_3 - 1.62 D_4 \tag{32.3} $$

Double-click Coefficients Plot in the Favorites window to show the standardized regression coefficients of the model (Fig. 32.32). According to Fig. 32.32, the larger the soil particle size, the more serious the wind erosion. Furthermore, because Soil Water Content and Straw Mulching Rate are negatively correlated with Wind Erosion Rate, raising the soil water content and increasing the straw mulching rate help to reduce soil wind erosion. Comparing the different kinds of farmland, we conclude that Conservation Tillage farmland has the lowest wind erosion rate, while sand farmland has the highest. Double-click w*c[1]/w*c[2] Scatter Plot in the Favorites window to show both the X-weights (w or w*) and the Y-weights (c), and thereby the correlation structure between X and Y (Fig. 32.33).

Since the Conservation Tillage method cultivates the farmland shallowly and leaves as much crop residue as possible on the land surface, it is the most effective way to prevent wind erosion. Additionally, it can increase the land coverage rate, prevent water and soil loss and raise the level of production. Therefore, it is worthwhile promoting the Conservation Tillage method both for the prevention of sand storms in the Beijing area and for agricultural production.
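As a worked check of model (32.3), the sketch below evaluates the fitted equation at the x1, x2, x3 values that Table 32.2 lists for each farmland type. The coefficients are those printed in (32.3); the function name `predict` is hypothetical, and the assignment of category names to D1..D4 is an assumption based on the order in which the text lists the categories.

```python
# Coefficients of the fitted model (32.3).
INTERCEPT = 9.36
B_X = (-0.36, 5.54, -0.03)        # x1, x2, x3
B_D = (2.11, 0.10, -0.59, -1.62)  # D1, D2, D3, D4

# (x1, x2, x3) per farmland type from Table 32.2.
# Assumption: the names follow the order of D1..D4.
FARMLAND = {
    "sand": (3.623, 0.651, 12.4),
    "traditional tillage": (6.291, 0.266, 13.8),
    "grass": (10.210, 0.337, 45.4),
    "conservation tillage": (8.883, 0.339, 58.5),
}


def predict(x, type_index):
    """Evaluate model (32.3) for one observation of the given farmland type."""
    return INTERCEPT + sum(b * v for b, v in zip(B_X, x)) + B_D[type_index]


predicted = {name: predict(x, k) for k, (name, x) in enumerate(FARMLAND.items())}
for name, y_hat in predicted.items():
    print(f"{name:22s} {y_hat:6.2f}")
```

The predicted wind erosion rate is highest for sand farmland and lowest for Conservation Tillage farmland, in line with the conclusion drawn from the coefficients plot.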
32.4 Conclusion
This paper has introduced model fitting with SIMCA-P and shown that SIMCA-P is an effective tool for multivariate data analysis. In the empirical study, the results show that PLS is preferable to OLS when the independent variables include QVs. The PLS model not only identified the factors of soil wind erosion, in fair agreement with reality, but also demonstrated that the Conservation Tillage method is the most effective way to reduce soil wind erosion. The results provide valuable information for sand storm prevention in Beijing.
References
UMETRI AB: SIMCA-P for Windows: Multivariate Modeling, Analysis and SPC of Process Data. UMETRI AB (1996)
Tenenhaus, M.: La Régression PLS: Théorie et Pratique. TECHNIP, Paris (1998)
Shen, Y.C., Yang, Q.Y., Jing, K., Xu, J.X.: Sand-storm and Dust-storm in China and Prevention and Control. Journal of Arid Land Resources and Environment 14(3), 11-14 (2000)
Li, L.J., Gao, Q.X.: Source Analysis of Beijing Sand-Dust in 2000. Research of Environmental Sciences 14(2), 1-4 (2001)
Gao, Q.X.: The Dust Weather of Beijing and Its Impact. China Environmental Science 22(5), 468-471 (2002)
Zang, Y.: Experimental Study on Soil Erosion by Wind under Conservation Tillage. 19(2), 56-60 (2003)