
HONPR1B/103/2011

COLLEGE OF ECONOMIC AND MANAGEMENT SCIENCES SCHOOL OF ECONOMIC SCIENCES DEPARTMENT OF DECISION SCIENCES

University of South Africa P O Box 392, UNISA, 0003

HONPR1B Project 1
Tutorial letter 103/2011
Data mining

HONPR1B/103

Contents
1 Introduction
2 Software
2.1 Download and install Java
2.2 Download and install Cropper
2.3 Download and install GIMP
3 SAS Enterprise Miner
4 Ending a SAS EMiner session
5 Capturing graphs and diagrams created on the SAS server
5.1 Create EPS files
6 Assignment 03 Due date: 31 August 2011


1 Introduction

For Assignment 3 you need to access SAS Enterprise Miner (EMiner for short). This software is installed on a remote server, called INGWE, at the Unisa campus. Please note that the aim of this assignment is to introduce you to SAS EMiner, to afford you the opportunity to try it out, and to illustrate the use of a software tool to analyse large amounts of data. The intention is not to make you a SAS expert.

2 Software

In the following sections instructions are provided to download and install the software that you will need for Assignment 3. Note that these are all freeware.

1. The Java Runtime Environment will enable you to connect to SAS EMiner on the INGWE server over the internet.
2. Cropper, a screen capture utility, will enable you to capture graphs and diagrams as displayed on the server and to save them on your computer.
3. GIMP (GNU Image Manipulation Program) is necessary for creating PostScript files.

2.1 Download and install Java

On the internet, go to http://java.sun.com/products/archive/j2se/1.4.2_09/index.html. Click Download J2RE (Java Runtime Environment), which is the second download on the page, and select your platform (Windows or Linux). Click j2re-1_4_2_09-windows-i586-p.exe and run the file.

2.2 Download and install Cropper

Go to http://cropper.codeplex.com/releases/view/56509, or search Google for Cropper Download and download from the first item that comes up. Click Cropper 1.9.4 to download the file. Click Open (or double click CropperSetup.msi where it has been saved) and follow the setup wizard to install Cropper.

2.3 Download and install GIMP

Go to http://gimp-win.sourceforge.net/stable.html or google GIMP for Windows. Click Installer and then Download in the GIMP for Windows (version 2.6.11) box. Double click gimp-2.6.11-i686-setup-1.exe where it has been saved to install it.


3 SAS Enterprise Miner

To access SAS EMiner over the internet, go to http://ingwe.unisa.ac.za:6098. A Server Status window will open. Click on the Configuration tab and under Java Webstart click Launch. The screen as shown in Figure 1 will open.

Figure 1: Sign-on screen

Enter your username (see footnote 1) and your password (see footnote 2). Click Log On and the screen shown in Figure 2 will open.

Figure 2: Welcome screen


1 Your username is your surname in lowercase letters (e.g. morris); if you have a double surname use only the first part; for Afrikaans surnames with spaces in them, omit the spaces (e.g. vandermerwe).
2 Your password is your student number without leading zeros.


To start a new project click New Project. The screen in Figure 3 will open.

Figure 3: Create new project screen

Choose a name for your project starting with your student number, for instance 12345_Name, and for the path type in /ingwe/eip/config/Lev1/Data/2011. Please do your whole assignment in the same project. Different diagrams can be opened for each of the questions. When you click OK the screen in Figure 4 opens.

Figure 4: Project screen

The panel on the left-hand side of the screen is divided into three panels. These are the following:

1. The project panel, where information regarding the current project is shown. This includes data sources, diagrams, etc. When you open your project at a later stage, previously created data sources and diagrams can be seen by clicking on the circled dot next to Data Sources or Diagrams.
2. The properties panel, which shows the properties of each node in the diagram whenever it is highlighted.
3. The information panel. When a property is selected in the properties panel, information regarding it is given here.


The panel on the right-hand side is the diagram panel. Note that the tabs at the top of this panel represent SEMMA. To create a new diagram, right click Diagram and click Create Diagram, or go to File > New > Diagram. Choose a name for your diagram, for instance Question 3, and click OK.

4 Ending a SAS EMiner session

Simply click the top right X to close SAS EMiner and click Yes to exit EMiner.

5 Capturing graphs and diagrams created on the SAS server

To access the Cropper tool, go to All Programs and click on Cropper. (It should appear at the bottom right of your screen as a framed plus sign.) The cropper consists of a transparent rectangle. To position the rectangle over the image to be captured, click on its upper left corner and drag it to coincide with the upper left corner of your image. Then drag the bottom right corner so that the rectangle covers the area that you want to capture. When you now double click on the cross in the middle of the rectangle, the image is captured. You will find the captured image in the Cropper Captures directory under My Documents.

5.1 Create EPS files

To be able to include the image in a LaTeX file, it is necessary to save it as an EPS file. The process is as follows:

- Click on the image in Cropper Captures to view it in the upper window. Now go to File > Open to open the image in Microsoft Office Picture Manager.
- Go to File > Save as... and save the image with a unique name in the same directory as the LaTeX file in which you want to include it. This is a JPEG file.
- Open this JPEG image in GIMP. Go to File > Save as... and save it as a PostScript file with the same name (change the extension from .jpg to .ps in the Save Image window and save). A Save as PostScript window opens. Check both the PostScript level 2 and Encapsulated PostScript boxes and save.
- Open the PS file in GhostView (double click on it in Explorer). At this stage the file includes the whole page with white space around the figure. To set the bounding box to include only the graph or diagram, go to File > PS to EPS. You may select the option Automatically calculate bounding box. This will include the full graph in the bounding box, which usually is what is required.
- If you want to choose your own bounding box, remove the tick next to Automatically calculate bounding box and press Enter. You now have the option to select your own bounding box. You will be prompted to click left, bottom, right and top, specifying the part of the graph that you want to include.
- Once the bounding box has been set, save the EPS file. Make sure it is in the same directory as the LaTeX file in which it has to be included.


6 Assignment 03

Due date: 31 August 2011

Instructions
- The assignment must be typed in a single LaTeX document. A different file for each question will not be accepted! Number your questions the same as below.
- To enable us to screen assignments for plagiarism, we require you to submit this assignment electronically through myUnisa. Assignments submitted by post will not be accepted.
- As for Assignment 01, include your student number and surname in the name of your .tex document (and subsequently also your output file).
- Remember that myUnisa does not accept more than one file for an assignment. Submit your files zipped together as one file. Include only the following files in the zip file that you submit (and not your entire directory):
  - The output file (.pdf);
  - The LaTeX source code file (.tex);
  - All files that are included as figures in your document (.eps or .pdf);
  - The bibliography file (.bib).

Question 1 [20]

Write two to three pages on data mining. Use LaTeX and correct referencing techniques as you have used for Assignment 02. Remember that plagiarism is never acceptable. You may use the internet for your research (at least three sources). Address at least the following:

- What data mining entails;
- Techniques used in data mining (provide details on at least two);
- Practical applications of data mining.

The list of references should appear directly after your literature study. The rubric for marking this question is shown in Table 1.

Aspect under consideration     Poor   Average   Good   Mark
                               0      1         2
In-text referencing
Bibliography
Language (grammar/spelling)
Flow of text
Explanation of topic
Logical layout
                                                       12
General impression                                     8

Table 1: Rubric for Question 1


Question 2 [20]

Log into SAS EMiner and use the Help utility to answer the following questions:

(a) Provide a short description of each of the processes represented by SEMMA.

(b) What are the following nodes used for?
- MultiPlot node;
- Impute node;
- Regression node;
- Decision Tree node;
- Model Comparison node.

(c) The Partition node partitions the data for different purposes. Give a short description of what training, validation and testing entail. Clearly indicate what the difference between validation and testing is. If you say that validation and testing have the same purpose, why do you think it is necessary to have different data sets for them?

(d) When comparing different techniques, certain measures are used. Name the measures that fall under each of the following:
- Classification Measures;
- Data Mining Measures;
- Statistical Measures.

Question 3 [15]

In SAS EMiner, work in the same project that you have created earlier. Create a new diagram for this question.

Work through the example below and type what is asked in the same LaTeX document as questions one and two. The questions are framed and numbered (a), (b), . . . .

The scenario

The consumer credit department of a bank wants to automate the decision-making process for approval of home equity lines of credit. To do this they will follow the recommendations of the Equal Credit Opportunity Act to create an empirically derived and statistically sound credit-scoring model. The model will be based on data collected from recent applicants granted credit through the current process of loan underwriting. The model will be built from predictive modelling tools, but the created model must be sufficiently interpretable to provide reasons for any rejections.

The model required

The credit-scoring model should find the probability that a given applicant will default on his/her loan repayment. A threshold is selected in such a way that all applicants whose probability of default is above this threshold will be recommended for rejection.

The data

The data set HMEQ is used for this problem. It consists of baseline and loan performance information for 5 960 recent home equity loans. Table 2 shows the variables included in the data set.


Name            Model role   Measurement level   Description
DEFAULT         Target       Binary              1=defaulted on loan, 0=paid back loan
REASON          Input        Binary              HomeImp=home improvement, DebtCon=debt consolidation
JOB             Input        Nominal             Six occupational categories
LOAN            Input        Interval            Amount of loan request
MORTGAGE        Input        Interval            Amount due on existing mortgage
VALUE           Input        Interval            Value of current property
DEBTINC         Input        Interval            Debt-to-income ratio
YOJ             Input        Interval            Years at present job
DEROGATORIES    Input        Interval            Number of major derogatory reports
CLNO            Input        Interval            Number of trade lines
DELINQUENCIES   Input        Interval            Number of delinquent trade lines
CLAGE           Input        Interval            Age of oldest trade line in months
INQUIRIES       Input        Interval            Number of recent credit inquiries

Table 2: Variables in HMEQ data set
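The rejection rule described under "The model required" can be illustrated with a minimal Python sketch. This is not part of SAS EMiner; the function name recommend, the threshold of 0.5 and the three example probabilities are assumptions chosen purely for illustration.

```python
def recommend(prob_default, threshold=0.5):
    """Recommend rejection when the estimated probability of default
    is above the threshold (the threshold value is hypothetical)."""
    return "reject" if prob_default > threshold else "approve"


# Hypothetical applicants with made-up probabilities of default.
applicants = {"A": 0.12, "B": 0.50, "C": 0.83}
decisions = {name: recommend(p) for name, p in applicants.items()}
print(decisions)
```

Note that an applicant whose probability equals the threshold is not rejected here, because the rule in the text applies only to probabilities above the threshold.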

The procedure

Start SAS EMiner by following the steps in paragraph 3. Create a new diagram for this question and follow the steps below.

Creating a data source

Right-click on Data Sources in the Project Panel and select Create Data Source. The Data Source Wizard opens.

Step 1: Select Metadata Repository as the source and click Next.

Step 2: Browse to find the desired data set. For this question select the HMEQ data set and click OK and then click Next.

Step 3: Note that the path given here is the same as the one we have entered earlier. Click Next.

Step 4: Observe the number of variables and observations of the data set. Click Next.

Step 5: In this step we can observe and, if necessary, change the properties of the data. We will not change anything at this stage, but to see how it works, select Advanced and click Customize.


Click on each of the properties to see what it represents (in the lower part of the window). Click Cancel to exit and then click Next.

Step 6: Information regarding the data is given here. Since we are interested in whether a client will default on his/her loan or not, the target for this problem is the variable DEFAULT. Therefore, change the role of DEFAULT to Target: select the variable, click on the cell under Role and click on Target in the drop-down menu. It is also important to change the measurement level of both DEFAULT and REASON to Binary as shown in Table 2.

Step 7: Here the role of the data can be changed to training, testing, scoring, etc. as required. We want to work with the raw data, so leave it as Raw and click Finish.

Building the initial flow

In the Project Panel the data source HMEQ should now have been created.

- Click on the HMEQ Data Source node and drag it to the Diagram Workspace.
- Select the HMEQ Data Source node and click on the ... next to Variables in the properties panel. Change the level of each variable to be the same as the measurement level in Table 2.
- Select the Explore tab on the toolbar at the top of the workspace. Drag a MultiPlot node to the workspace and place it to the right of the HMEQ Data Source node.
- Connect the nodes. (A pencil appears if the cursor is on the HMEQ Data Source node. Press the left mouse button and drag it to the MultiPlot node.)
- Right-click on the MultiPlot node and select Run. When prompted, click Yes to run the path. (Depending on your computer, this may take quite long to complete.)
- When the process has been completed, click Results to open the results. You can also click OK to exit. The results can be viewed at any time by right-clicking on the MultiPlot node and selecting Results.
- View the different graphs. Use Next and Previous to navigate between the graphs.

(a) Include the graph of Debt to Income Ratio (DEBTINC by Default) as a figure. Follow the notes in paragraph 5. (The graph should be coloured red and black. If not, right click on the HMEQ data source on your diagram and select Edit Variables. Change the role of Default to Target.) Make sure the figure is a float. Provide a caption. Label it and refer to it at least once. (3)

(b) What conclusions can you draw from this graph? (2)

Data partition

Under the Sample tab on the toolbar, drag a Data Partition node to the workspace and connect it to the HMEQ Data Source node. Select the Data Partition node and examine the Properties Panel at the left. Under Data Set Allocation note that the data is partitioned with 40% of the data for training, 30% for validation and 30% for testing. Change this to 60% for training, 20% for validation and 20% for testing. (Remember to press Enter when you have changed a value.)


Run the Data Partition node. Open the results and check the number of observations allocated to training, validation and testing.
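To get a feeling for the numbers to expect, the 60%/20%/20% allocation can be mimicked with a short Python sketch. It assumes a simple random split of the 5 960 observations; the function name partition and the seed are hypothetical, and the Data Partition node may sample differently (for example by stratifying on the target), so the counts it reports need not match exactly.

```python
import random


def partition(n_obs, fractions=(0.6, 0.2, 0.2), seed=1):
    """Randomly split observation indices into training, validation and test sets."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    n_train = round(n_obs * fractions[0])
    n_valid = round(n_obs * fractions[1])
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever remains goes to testing
    return train, valid, test


train, valid, test = partition(5960)
print(len(train), len(valid), len(test))
```

Every observation lands in exactly one of the three sets, which is the property to check in the Data Partition results as well.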

Decision tree

Under the Model tab on the toolbar, drag a Decision Tree node to the workspace and connect it to the Data Partition node. Run the Decision Tree node and view the results. Maximise the Tree window. Right-click in a blank area of the Tree window and select Graph Properties to make desired changes. Click Apply to save changes and Cancel to leave it as it is. To get a better view of the tree, right-click and select View > Fit to page.

(c) What kind of decision tree is this: a classification or a regression tree? (1)
(d) How many leaves does the tree have? (1)
(e) Include the decision tree as a figure. (3)
(f) On which variable is the first split made? What does this mean? (2)
(g) Which series of splits leads to a prospective client not defaulting on his/her loan? (1)
(h) Provide the series of splits that leads to a prospective client defaulting on his/her loan with a probability of more than 80%. (2)

Question 4 [30]

In SAS EMiner, work in the same project that you have created earlier. Create a new diagram for this question.

Work through the example below and type what is asked in the same LaTeX document as the previous questions. The questions are once again framed and numbered (a), (b), . . . .

The scenario

A nonprofit organisation relies on fund-raising campaigns to support their efforts. They want to send greeting cards to lapsed donors to encourage them to make new donations. The organisation wants to build a model to predict who will be most likely to donate. The types of information available include the following:

- personal information such as age, gender and income
- past donation information such as average gift and time since first donation
- census information based on the donor's address, such as the percentage of households with at least one member working for the federal, state or local government


The MYRAW data set contains data of 6 974 previous donors.

Defining a data source

Start a new diagram and give it another name, for instance Diagram Q4. Use the same steps as given in question three to define a data source. Use the MYRAW data set. Table 3 shows the variables included in the data set.

Name           Model role   Measurement level   Description
Age            Input        Interval            Donor's age
AverageGift    Input        Interval            Donor's average gift
CardGift       Input        Interval            Donor's gifts to card promotions
CardProm       Input        Interval            Number of card promotions
FederalGov     Input        Interval            % of households with members working in federal government
FirstTime      Input        Interval            Elapsed time since first donation
Gender         Input        Binary              F=female, M=Male
Homeowner      Input        Binary              H=homeowner, U=unknown
IDCode         ID           Nominal             ID code, unique for each donor
Income         Input        Ordinal             Income level (integer values 0-9)
LastTime       Input        Interval            Elapsed time since last donation
LocalGov       Input        Interval            % of households with members working in local government
MaleMilitary   Input        Interval            % of households with males active in military
MaleVeteran    Input        Interval            % of households with male veterans
NumProm        Input        Interval            Total number of promotions
PCOwner        Rejected     Unary               Y=donor owns computer (missing otherwise)
Pets           Rejected     Unary               Y=donor owns pets (missing otherwise)
StateGov       Input        Interval            % of households with members working in state government
TargetB        Target       Binary              1=donor to campaign, 0=did not contribute
TargetD        Rejected     Interval            Dollar amount of contribution to campaign
TimeLag        Input        Interval            Time between first and second donation

Table 3: Variables in MYRAW data set


Note the following:

- Several variables have a measurement level interval because they are numeric in the data set and have more than 20 distinct levels. The model role for all interval variables is set to input by default.
- Gender and HomeOwner have a measurement level binary because they have only two different nonmissing levels. The model role for binary variables is set to input by default.
- IDCode is listed as a nominal variable since it is a character variable with more than two nonmissing levels. Because it is nominal and the number of distinct values is greater than 20, it has the model role ID. If it had been stored as a number, it would have been assigned an interval measurement level and an input model role.
- Income is listed as an ordinal variable since it is numeric with more than two, but no more than 10 distinct levels. All ordinal variables have input as model role by default.
- PCOwner and Pets have unary measurement levels, because they have only one nonmissing level. The model role is set to rejected.
- The target variables TargetB and TargetD are the response variables for this analysis. TargetB is binary since there are only two nonmissing levels. TargetD has a measurement level interval. In this analysis we focus on TargetB, namely whether a person made a donation or not. Leave the model role for TargetB as Target and set the model role for TargetD to rejected.

Missing values

In this question the focus is on regression. Since the Regression node does not use observations that contain missing values to build the regression equation, it is necessary to know how to deal with missing values.

- Connect a StatExplore node to the MYRAW Data Source node.
- Select the StatExplore node and examine the properties in the Properties Panel. Because we are interested in examining all variables, change the Hide Rejected Variables field to No in the Properties panel.
- Run the StatExplore node and examine the results.
Consider the following part from the Output window of the StatExplore results.
Class Variable Summary Statistics
(maximum 500 observations printed)

                      Number                       Mode                 Mode2
Variable    Role      of Levels   Missing   Mode   Percentage   Mode2   Percentage
Gender      INPUT     3           395       F      53.74        M       40.59
Homeowner   INPUT     3           1656      H      55.29                23.75
Income      INPUT     8           1576      .      22.60        5       16.89
TargetB     TARGET    2           0         0      52.06        1       47.94


This part of the output gives information about the class variables in the data set. For example, Gender has 395 missing values and HomeOwner has 1 656 (almost 24%) missing values. Note that the classes for HomeOwner in Table 3 are H for being a home owner and U, which we assume includes both not a homeowner and unknown. The results window also shows summary statistics such as the mean, minimum value, maximum value, and median of the interval variables in the data table, and correlation statistics.

Data replacement

In general, imputation of missing values should be considered before constructing a regression model. Any observation with missing values for any of the variables will not be used in the regression model. Note that imputation is not necessary before generating a decision tree model since a decision tree uses missing values just like any other value in the data set. For this analysis, the missing values for Gender (a class variable) and for Age (an interval variable) need to be replaced.

- Under the Modify tab on the toolbar, drag an Impute node to the workspace and connect it to the Data Source node.
- Select the Impute node and examine the properties in the Properties Panel.
- In the Score section of the Properties panel, use the drop-down menu to change the value of the Indicator Variable row to Unique and the Indicator Variable Role to Input.

We need to replace the missing values for the variable Gender with U (unknown) and the missing values for Age with the median. In the Train section of the Properties panel, do the following:

- Change the Default Character Value to U (type it in).
- Select the three dots button at the end of the Variables row.
- Click on Method in the row for Gender and select Constant from the drop-down menu.
- Click on Method in the row for Age and select Median from the drop-down menu.
- Click OK to confirm the changes.
- Run the Impute node and view the results.

To see what happened to the missing values, connect a MultiPlot node to the Impute node and run it. View the graphs for Age and Gender after imputation.

(a) Include the graphs of IMP_Age and IMP_Gender as figures. (To put the graphs side-by-side in your document, use subfigure.sty. Documentation on this package is available on myUnisa under Additional Resources.) (6)

(b) Comment on the replacement of missing values, referring to the graphs included. (2)
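The two replacement methods chosen in the Impute node (a constant for Gender, the median for Age) can be sketched in plain Python. The data values below are made up, missing values are represented by None, and the function names are hypothetical; the sketch only illustrates the idea and does not reproduce how SAS EMiner implements imputation.

```python
from statistics import median


def impute_constant(values, constant="U"):
    """Replace each missing value (None) with a fixed constant."""
    return [constant if v is None else v for v in values]


def impute_median(values):
    """Replace each missing value (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Made-up example data; None marks a missing value.
gender = ["F", None, "M", "F", None]
age = [34, None, 58, 41, None, 67]

imp_gender = impute_constant(gender)
imp_age = impute_median(age)
print(imp_gender)
print(imp_age)
```

After imputation all missing Gender values fall into a single new class U, and all missing Age values pile up at one value, which is the kind of effect to look for in the graphs of IMP_Gender and IMP_Age.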

Variable transformation

Some input variables have highly skewed distributions. In such distributions, a small percentage of the points may have a great deal of influence. On occasion, performing a transformation on an input variable may yield a better-fitting model. This section demonstrates how to perform some common transformations.

Connect a Transform Variables node to the Impute node.


Select this node and examine the Properties panel. Note in the Default Methods section that by default no transformations are done. Click the three dots next to Variables. The Transformation window enables you to transform interval-valued variables using standard transformations such as log, square root, inverse and square. Examine the distributions of variables within the Transform Variables node: highlight the row for the variable(s) of interest and select Explore... The distributions of CardGift, LocalGov, StateGov, FederalGov and IMP_TimeLag are highly skewed to the right. A log transformation of these variables may provide more stable results.

(c) Print the graph of the distribution of IMP_TimeLag before transformation. (3)
(d) Comment on the skewness of the distribution. (2)

When considering the results window of the Transform Variables node, we see that the log transformation is performed after adding 1 to the value of each of these variables. (See Table 4.)
Computed Transformations
(Maximum 500 observations printed)

Input Name    Role    Level      Name              Level      Formula
CardGift      INPUT   INTERVAL   LOG_CardGift      INTERVAL   log(CardGift + 1)
FederalGov    INPUT   INTERVAL   LOG_FederalGov    INTERVAL   log(FederalGov + 1)
IMP_TimeLag   INPUT   INTERVAL   LOG_IMP_TimeLag   INTERVAL   log(IMP_TimeLag + 1)
LocalGov      INPUT   INTERVAL   LOG_LocalGov      INTERVAL   log(LocalGov + 1)
StateGov      INPUT   INTERVAL   LOG_StateGov      INTERVAL   log(StateGov + 1)

Table 4: Transformation of variables

Why does this occur? Take for example the variable CardGift. It has a minimum value of zero. The logarithm of zero is undefined, and the logarithm of something close to zero is extremely negative. SAS Enterprise Miner takes this information into account and actually uses the transformation log(CardGift + 1) to create a new variable with values greater than or equal to zero (because log(1) = 0).

Connect a second StatExplore node to the Transform Variables node and run it. Now, as before, examine the distribution of LOG_CardGift.

(e) Include the graph of the distribution of IMP_TimeLag after transformation as a figure. Comment on the distribution after transformation. (3)

(f) Create a table showing the means and standard deviations of CardGift and TimeLag before and after transformation. (4)
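The reason for adding 1 before taking the logarithm can be checked with a few lines of Python. The values in card_gift are made up for illustration, and the function name log_plus_one is hypothetical; only the formula log(x + 1) comes from Table 4.

```python
import math


def log_plus_one(values):
    """Apply the transformation log(x + 1) to each value."""
    return [math.log(v + 1) for v in values]


# Made-up, right-skewed example values with a minimum of zero.
card_gift = [0, 1, 5, 20, 100]
log_card_gift = log_plus_one(card_gift)

for raw, transformed in zip(card_gift, log_card_gift):
    print(raw, round(transformed, 3))
```

A raw value of zero is mapped to zero instead of causing an error, all transformed values are nonnegative, and the large values are pulled in far more strongly than the small ones, which is what reduces the skewness.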

Data partition

Connect a Data Partition node to the Impute node. Change the percentage of data points used for training, validation and testing to 60%, 20% and 20%, respectively.


Regression

Run a stepwise logistic regression on the data by following the steps below:

- Connect a Regression node to the Data Partition node.
- Select the Regression node and examine its properties in the Properties panel.
- Change Selection Model in the Model Selection section from None to Stepwise by using the drop-down menu.
- Change Regression Type in the Class Targets section to Logistic Regression using the drop-down menu. (Since the target variable is binary, logistic regression is used.)
- Run the Regression node and view the results.

Decision Tree and Neural Network

Connect a Decision Tree as well as a Neural Network node to the Data Partition node. Run both and view the results.

(g) Print the Decision Tree. (3)

(h) Which variables does the Decision Tree use to decide whether a lapsed donor will make new donations? (1)

Compare the models

Connect a Model Comparison node to the Regression, Decision Tree and Neural Network nodes. Run it and view the results.

(i) Which is the best model to use for this problem? Justify your answer by referring to the Misclassification Rate, Maximum Absolute Error, ROC Index and Gini Coefficient. (Hint: Use the statistics on the Test data to assess the performance of the models.) (3)

(j) Include the final diagram of this Enterprise Miner project as a figure. (3)
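As background to the measures named in (i), the sketch below computes three of them by hand in Python for a tiny made-up example. The function names, the data and the cutoff of 0.5 are assumptions for illustration only. The Gini coefficient is computed from the ROC index with the commonly used relationship Gini = 2 x ROC index - 1; consult the EMiner Help to confirm the exact definitions that the Model Comparison node uses.

```python
def misclassification_rate(actual, predicted):
    """Fraction of observations whose predicted class differs from the actual class."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


def max_absolute_error(actual, prob):
    """Largest absolute difference between the actual outcome (0 or 1)
    and the predicted probability of the event."""
    return max(abs(a - p) for a, p in zip(actual, prob))


def gini_from_roc(roc_index):
    """Gini coefficient from the ROC index, assuming Gini = 2 * ROC index - 1."""
    return 2 * roc_index - 1


# Made-up outcomes and predicted probabilities; the 0.5 cutoff is hypothetical.
actual = [1, 0, 1, 0, 0]
prob = [0.8, 0.3, 0.4, 0.6, 0.1]
predicted = [1 if p > 0.5 else 0 for p in prob]

rate = misclassification_rate(actual, predicted)
worst = max_absolute_error(actual, prob)
gini = gini_from_roc(0.75)
print(rate, worst, gini)
```

Lower values are better for the misclassification rate and the maximum absolute error, while higher values are better for the ROC index and the Gini coefficient.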
