
DATA PREPARATION TECHNIQUES FOR PREDICTIVE ANALYTICS IN SAS ENTERPRISE GUIDE

Copyright 2013, SAS Institute Inc. All rights reserved.

SAS ENTERPRISE GUIDE: DATA PREPARATION TECHNIQUES FOR ANALYTICS

Data & Data Format Needed for Predictive Modeling
Create a Y or Dependent Variable
Create Model Input Variables
Replace Missing Values
Assess Normality
Transform Variables in Order to Meet Assumptions
Run Linear and Logistic Regression
Q&A

SCENARIO

Company sells Outdoor and Sports items
Want to test a new marketing campaign
Need to compile a data table with information so we can build a predictive model of who is likely to respond.

CUSTOMER DATA


PRODUCT ORDER TRANSACTIONAL DETAIL DATA

PREDICTIVE MODEL DEVELOPMENT DATA

Modeling Data Set

Data Warehouse

MAIN TYPES OF DATA MARTS: DATA FOR MODELING

One-Row-per-Subject Data Mart
Multiple-Row-per-Subject Data Mart
Longitudinal Data Mart

THE ONE-ROW-PER-SUBJECT DATA MART

Required by many statistical methods
  Regression Analysis, Neural Networks, Decision Trees, Survival Analysis, Cluster Analysis, ...
Most prominent data mart structure in data mining
  Event prediction (Churn, Fraud, Delinquency, Response, ...)
  Value prediction (Purchase Size, Claim Amount, ...)
  Segmentation (Clustering, ...)

OUR DESTINATION: ONE ROW PER CUSTOMER

THE ONE-ROW-PER-SUBJECT PARADIGM

TARGET SAMPLE

Target Sample


Data Warehouse

CREATE TARGET (Y) VARIABLE

Create a variable indicating who purchased in the last 2 years (i.e., January 1, 2012 to December 31, 2013)
New Variable named PURCHASED
  1=Yes
  0=No
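
Outside the Query Builder, the same flag could be built in code. The sketch below is one possible approach, not the presenter's demo: the tables work.customers and work.orders and the columns Customer_ID and Order_Date are assumptions made up for illustration. Only the PURCHASED name and the date window come from the slide.

```sas
/* Hypothetical stand-ins for the customer and order tables */
data work.customers;
  input Customer_ID;
  datalines;
1
2
3
;
run;

data work.orders;
  input Customer_ID Order_Date :date9.;
  format Order_Date date9.;
  datalines;
1 15MAR2012
1 02JUL2013
2 20NOV2011
;
run;

/* PURCHASED = 1 when the customer has at least one order between
   01JAN2012 and 31DEC2013, otherwise 0. The left join keeps customers
   with no orders at all; their missing Order_Date fails the BETWEEN
   test and falls through to 0. */
proc sql;
  create table work.target as
  select c.Customer_ID,
         max(case when o.Order_Date between '01JAN2012'd and '31DEC2013'd
                  then 1 else 0 end) as PURCHASED
    from work.customers as c
         left join work.orders as o
           on c.Customer_ID = o.Customer_ID
   group by c.Customer_ID;
quit;
```

With this sample data, customer 1 is flagged 1, while customer 2 (order outside the window) and customer 3 (no orders) are flagged 0.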


DEPENDENT VARIABLE

CREATE DEPENDENT (Y) VARIABLE DEMO


MODEL INPUTS

Model Inputs


Data Warehouse

ENTERPRISE GUIDE QUERY BUILDER: COMPUTED COLUMNS

COMPUTED COLUMNS: SUMMARIZED IS ONE EXAMPLE

COMPUTED COLUMNS: RECODED IS ANOTHER EXAMPLE

COMPUTED COLUMNS: ADVANCED EXPRESSION IS YET ANOTHER EXAMPLE

CREATE AGE VARIABLE: COMPUTED COLUMN

Calculate a customer's age from their Birthdate using Advanced Expression and SAS Functions
New Variable named AGE

YRDIF(t1.Customer_BirthDate,TODAY(),"ACT/ACT")
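
As a rough illustration, the same expression can be run in a DATA step. The data set name, the Customer_ID values, and the birth dates below are made up. The second variable uses the 'AGE' basis of YRDIF, which assumes SAS 9.3 or later, and FLOOR to get completed years. It is shown as an alternative, not as what the slide computes.

```sas
/* Hypothetical customer records; only Customer_BirthDate is a name
   taken from the slide's expression */
data work.customer_age;
  input Customer_ID Customer_BirthDate :date9.;
  format Customer_BirthDate date9.;

  /* Fractional age in years, using the slide's expression */
  Age = yrdif(Customer_BirthDate, today(), 'ACT/ACT');

  /* Completed years of age (the 'AGE' basis needs SAS 9.3 or later) */
  AgeYears = floor(yrdif(Customer_BirthDate, today(), 'AGE'));

  datalines;
101 14FEB1975
102 30SEP1990
;
run;
```

Because TODAY() is evaluated at run time, the values change from day to day. A fixed reference date may be preferable when the modeling table has to be reproducible.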


CREATE CALCULATED INPUT (AGE)


MODEL DEVELOPMENT AND TRANSACTION DATA

INPUT POSSIBILITIES: TABULATIONS

CREATE SUMMARY VARIABLES: COMPUTED COLUMNS

Calculate Summary Variables about purchases
  Total Amount Spent = TotalSpent
  Total Number of Items Bought = TotalItems
  Average Amount Spent = AvgSpent
Calculate New Variable for each transaction and take the Max
  Longest Number of Days for a Delivery to arrive = DaystoDeliver
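
The summaries listed above could be written in PROC SQL roughly as follows. This is a sketch under assumptions: the table work.order_detail and the columns Order_Date, Delivery_Date, and Quantity are hypothetical. Total_Retail_Price and the four output names are the ones used in the slides.

```sas
/* Hypothetical transaction detail: one row per order line */
data work.order_detail;
  input Customer_ID Order_Date :date9. Delivery_Date :date9.
        Quantity Total_Retail_Price;
  format Order_Date Delivery_Date date9.;
  datalines;
1 05JAN2012 08JAN2012 2 40.00
1 17JUN2013 27JUN2013 1 15.50
2 03MAR2013 04MAR2013 3 90.00
;
run;

/* Collapse the transactions to one row per customer */
proc sql;
  create table work.summary_inputs as
  select Customer_ID,
         sum(Total_Retail_Price)         as TotalSpent,
         sum(Quantity)                   as TotalItems,
         avg(Total_Retail_Price)         as AvgSpent,
         /* days per transaction first, then the longest one */
         max(Delivery_Date - Order_Date) as DaystoDeliver
    from work.order_detail
   group by Customer_ID;
quit;
```

For customer 1 in this sample, the result is TotalSpent 55.50, TotalItems 3, AvgSpent 27.75, and DaystoDeliver 10.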

CREATE SUMMARY VARIABLES: COMPUTED COLUMNS

CREATE SUMMARY INPUTS


CREATE INDICATOR VARIABLES

Calculate Indicator Variables
  Did the customer buy a product in a certain product line?
    Children's, Clothes & Shoes, Sports
Calculate New Variable for each transaction for Product Line and summarize Total_Retail_Price
  Gives us the total spent for each product line category
  Product Line Description = ProductLineDescription
  Total of Product Line for order = OrderTotal
  Distinct Count of Customer_ID = Indicator

CREATE INDICATORS

CREATE CATEGORY TOTALS

3 TASKS TO TRANSPOSE DATA

1. Transpose: Switch rows with columns and columns with rows
2. Split: Split one column into multiple columns
3. Stack: Stack multiple columns into one column

SPLIT COLUMNS TASK: THREE QUESTIONS

1. Which column is being split?
2. Which column identifies the values being split?
3. Which column groups the data?
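
In code, roughly the same reshaping can be done with PROC TRANSPOSE, where the three questions map onto the VAR, ID, and BY statements. The sketch below assumes a hypothetical long table work.line_totals with made-up values. The product line names are shortened to single words so they convert cleanly to column names.

```sas
/* Hypothetical long table: one row per customer and product line */
data work.line_totals;
  length ProductLine $ 16;
  input Customer_ID ProductLine $ OrderTotal;
  datalines;
1 Clothes 55.50
1 Sports 20.00
2 Sports 90.00
;
run;

/* BY-group processing needs the data sorted by the grouping column */
proc sort data=work.line_totals;
  by Customer_ID;
run;

proc transpose data=work.line_totals
               out=work.line_wide (drop=_name_)
               prefix=Total_;
  var OrderTotal;    /* 1. the column being split                */
  id ProductLine;    /* 2. the column that identifies the values */
  by Customer_ID;    /* 3. the column that groups the data       */
run;
```

The output has one row per customer with columns Total_Clothes and Total_Sports. Customer 2 has a missing Total_Clothes because that customer has no row for that product line, which is one reason missing-value replacement usually follows this step.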


CREATE INDICATOR INPUTS USING RECODE, SUMMARY & TRANSPOSE

Creating indicator variables in PROC SQL
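
One common PROC SQL pattern for 0/1 indicators is to take the MAX of a CASE expression within each customer. The sketch below is an illustration of that pattern, not the content of the linked article. The table work.order_lines, the single-word product line values, and the Bought_ column names are all assumptions.

```sas
/* Hypothetical order lines: one row per purchase */
data work.order_lines;
  length ProductLine $ 16;
  input Customer_ID ProductLine $ Total_Retail_Price;
  datalines;
1 Clothes 55.50
1 Sports 20.00
2 Sports 90.00
3 Children 12.00
;
run;

/* Each CASE yields 1 on matching rows and 0 otherwise; MAX within the
   customer is 1 if any row matched, so each customer gets one row of
   0/1 flags. */
proc sql;
  create table work.indicators as
  select Customer_ID,
         max(case when ProductLine = 'Children' then 1 else 0 end)
           as Bought_Children,
         max(case when ProductLine = 'Clothes'  then 1 else 0 end)
           as Bought_Clothes,
         max(case when ProductLine = 'Sports'   then 1 else 0 end)
           as Bought_Sports
    from work.order_lines
   group by Customer_ID;
quit;
```

Unlike the transpose route, this produces 0 rather than missing for product lines a customer never bought, at the cost of listing each product line by hand.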



REPLACING MISSING VALUES

Enterprise Guide Query Builder Computed Column
  Replace Values
SAS Code
  PROC STDIZE - documentation
  SAS/STAT PROC MI - documentation
Enterprise Miner Impute Node
  Class variables: count, default constant value, distribution, tree, tree surrogate
  Target variables: count, default constant value, distribution
  Interval variables: mean, median, midrange, distribution, tree, tree surrogate, mid-minimum spacing, Tukey's Biweight, Huber, Andrew's Wave, default constant
What is Missing in SAS?

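REPLACEMENT: TWO OPTIONS

/* Option 1: replace missing numeric values with 0 (REPONLY = no standardizing). */
/* No DATA= is given, so the most recently created data set is the input. */
PROC STDIZE
   out=dataprep.WhoPurchased
   reponly missing=0;
run;

or

/* Option 2: remove all formats and informats from the variables in work.zeros. */
PROC DATASETS lib=work;
   MODIFY zeros;
   FORMAT _all_;
   INFORMAT _all_;
RUN;
QUIT;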
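
For reference, here is a self-contained sketch of the PROC STDIZE option with an explicit DATA= and a VAR statement. The data set names work.inputs and work.inputs_filled and the sample values are hypothetical. PROC STDIZE is part of SAS/STAT, so this assumes that product is licensed.

```sas
/* Hypothetical modeling table with missing summary values */
data work.inputs;
  input Customer_ID TotalSpent TotalItems;
  datalines;
1 55.5 3
2 . .
3 90 .
;
run;

/* REPONLY replaces missing values without standardizing anything;
   MISSING=0 sets the replacement value. The VAR statement limits the
   replacement to the listed numeric variables. */
proc stdize data=work.inputs
            out=work.inputs_filled
            reponly missing=0;
  var TotalSpent TotalItems;
run;
```

Replacing with 0 fits counts and totals where "missing" means "no purchases". For a variable such as Age, 0 would be a misleading value, so a different replacement (or PROC MI) may be more appropriate.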

PROC STDIZE OR PROC DATASETS

MISSING VALUE REPLACEMENT DEMO

ASSESS NORMALITY


ASSESS NORMALITY

Tasks > Describe > Distribution Analysis
Graphs
  Histograms
  Q-Q Plot
  Kernel Density Plot

ASSESS NORMALITY

Tasks > Describe > Distribution Analysis
4 Tests
  Shapiro-Wilk
  Kolmogorov-Smirnov (K-S)
  Cramer-von Mises
  Anderson-Darling
Testing Normality of Data using SAS
Guido's Guide to PROC UNIVARIATE: A Tutorial for SAS Users
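
The graphs and the four tests can also be requested directly from PROC UNIVARIATE. The sketch below simulates a right-skewed TotalSpent so it can run on its own. The data set name, the seed, and the distribution parameters are arbitrary assumptions, not values from the presentation.

```sas
/* Simulated, right-skewed spending data (hypothetical) */
data work.spend;
  call streaminit(2013);
  do Customer_ID = 1 to 200;
    TotalSpent = exp(rand('normal', 4, 0.8));
    output;
  end;
run;

ods graphics on;

/* NORMAL requests the tests for normality (Shapiro-Wilk,
   Kolmogorov-Smirnov, Cramer-von Mises, Anderson-Darling) */
proc univariate data=work.spend normal;
  var TotalSpent;
  /* Histogram with a fitted normal curve and a kernel density estimate */
  histogram TotalSpent / normal kernel;
  /* Q-Q plot against a normal with estimated mean and std. deviation */
  qqplot TotalSpent / normal(mu=est sigma=est);
run;
```

Small p-values in the "Tests for Normality" table, together with a curved Q-Q plot, suggest the variable departs from normality and may be a candidate for transformation.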

ASSESS NORMALITY DEMO


TRANSFORM VARIABLES


TRANSFORMATIONS FOR NORMALITY

Log
Square Root
Cube Root
Reciprocal
Square Transformation
Many more

TRANSFORMING VARIABLES

TotalSpent: Log Transformation
Age: Recode to categorical

Transforming Variables for Normality and Linearity
Before Logistic Modeling: A Toolkit for Identifying and Transforming Relevant Predictors
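
A DATA step version of the two transformations might look like the sketch below. The data set names, the sample values, the +1 shift inside the log, and the age band cut points are all assumptions chosen for illustration. The slides do not specify them.

```sas
/* Hypothetical modeling table */
data work.model_inputs;
  input Customer_ID TotalSpent Age;
  datalines;
1 55.5 23
2 0 41
3 1200 67
;
run;

data work.model_inputs_t;
  set work.model_inputs;
  length AgeGroup $ 8;

  /* Natural log; LOG(0) is undefined, so the value is shifted by 1
     here (an assumption) to keep customers who spent nothing */
  LogTotalSpent = log(TotalSpent + 1);

  /* Recode Age into bands; the cut points are illustrative only */
  if missing(Age)  then AgeGroup = 'Unknown';
  else if Age < 30 then AgeGroup = 'Under 30';
  else if Age < 50 then AgeGroup = '30-49';
  else                  AgeGroup = '50+';
run;
```

Checking MISSING(Age) first matters because SAS treats a numeric missing value as smaller than any number, so it would otherwise land in the lowest band.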


COMPUTED COLUMNS: ADVANCED EXPRESSION

COMPUTED COLUMNS: RECODED

TRANSFORM VARIABLES DEMO


RESOURCES


DATA PREPARATION FOR ANALYTICS USING SAS

ISBN: 978-1-59994-047-2
SAS Bookstore
Amazon
Also available for Kindle
Author Page
Example Code and Data


Download copy

ADDITIONAL SUPPORT: ENTERPRISE GUIDE TUTORIALS

View Free Tutorials
http://support.sas.com/training/resources/
Getting Started with SAS Enterprise Guide

ADDITIONAL RESOURCES

Chris Hemedinger
Follow @cjdinger

The SAS Dummy: A SAS blog for the rest of us
http://blogs.sas.com/content/sasdummy/

Books:
Custom Tasks for SAS Enterprise Guide Using Microsoft .NET
SAS For Dummies

AVAILABLE PAPERS

Ad Hoc Data Preparation for Analysis Using SAS Enterprise Guide
Introduction to Using SAS Enterprise Guide for Statistical Analysis
Introduction to Building a Linear Regression Model
Take a Fresh Look at SAS Enterprise Guide: From point-and-click ad hocs to robust enterprise solutions
Advanced Analytics with Enterprise Guide
SAS Enterprise Miner Tip: Imputing Missing Values

FURTHER TRAINING FROM SAS EDUCATION

Enterprise Guide 1: Query and Reporting
Enterprise Guide 2: Advanced Tasks and Querying
Enterprise Guide for Experienced SAS Programmers
Data Preparation for Data Mining

support.sas.com/training

THANK YOU FOR USING SAS!


www.SAS.com
