Dmba 2

Data Mining for Business Analytics AY 2016-17
Dr. Sridhar Vaithianathan
IT & Analytics
Office 2
sridhar.v@imthyderabad.edu.in
Mobile: 99899 04245
Topic Outline
Statistics Vs Analytics
Data Mining Methods
Data Mining Techniques
Data Mining Process
It's all about DATA

DATA EXPLOSION
Each day, our society creates 2.5 quintillion bytes of data (that's 2.5 followed by 18 zeros).
With this glut of data, the need to make sense
of it becomes more acute.
DATA QUOTES
"Data is the new Oil" - Dunn Humby
"Data is the new raw material of business" - Microsoft
"The world is now awash in data and we can see consumers in a lot clearer ways" - Paypal
"The goal is to turn data into information, and information into insight" - HP
Analytics: What it is NOT?
Statistics
Macro-decisioning
Explain/describe population relationships
Small sample, few variables
Find a good-fitting statistical model
Confidence intervals, hypothesis tests, p-values
Vs
Analytics
Micro-decisioning
Predict values of new records
Large sample, many variables
Models/algorithms with high predictive power
Predictive power metrics and cost
Analytics Vs Statistics
The prime objective of --------- is to minimize the error and improve the accuracy.
Answer: ANALYTICS
Data Mining Methods
PREDICTION (NUMERICAL Y)
DIMENSION REDUCTION
CLASSIFICATION (CATEGORICAL Y)
SEGMENTATION
WHAT GOES WITH WHAT
Data mining Methods

Answer: This is supervised learning, because the database includes whether the loan was approved or not.
Data mining Methods

Answer: This is supervised learning, because for the other packets the status is known.
Data mining Methods

Answer: This is unsupervised learning, because there is no apparent outcome (e.g., whether the recommendation was adopted or not).
Data mining Methods

Answer: This is unsupervised learning, because there is no known outcome (though once you use unsupervised learning to identify segments, you could use supervised learning to classify new customers into those segments).
Data mining Methods

Answer: This is supervised learning, because the status of the similar firms is known.
Data mining Methods

Answer: This is supervised learning, because there is likely to be knowledge of actual (historic) repair times of similar repairs.
Data mining Methods

Answer: This is supervised learning, as there is likely to be knowledge about whether the sorting was correct.
Data mining Methods

Answer: This is unsupervised learning, if we assume that we do not know what will be purchased in the future.
Data Mining Techniques
Classification
Prediction
Association
Cluster/Segmentation
Dr. Sridhar Vaithianathan, IMT Hyderabad
Data Mining Process
DM Process : Define The Purpose

Business Objective DM/A Objective
DM Process : Obtain the data

Sources: External & Internal
Preprocessing DATA
Oversampling Rare Events
Types of Variables
Handling Categorical Variables
Variable Selection : Parsimony
Problem of Overfitting
How many Variables (X-Y plots)
How much Data:
10 cases for each variable (or)
6 x m x p (m = no. of outcome classes, p = no. of variables)
Detecting Outliers
Handling Missing Data
Normalizing Data
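The two "How much Data" rules of thumb in the list above are simple arithmetic, and a small sketch may make them easier to apply. The function names and the example numbers below are illustrative assumptions, not part of the slides.

```python
def min_records_simple(num_variables):
    """Rule of thumb 1: 10 cases for each variable."""
    return 10 * num_variables


def min_records_classification(num_classes, num_variables):
    """Rule of thumb 2: 6 x m x p, where m is the number of outcome
    classes and p is the number of variables."""
    return 6 * num_classes * num_variables


# Hypothetical problem: 10 variables and a binary (2-class) outcome.
simple_minimum = min_records_simple(10)
classification_minimum = min_records_classification(2, 10)
print(simple_minimum, classification_minimum)
```

Both numbers are lower bounds for a rough sanity check, not guarantees that a model will perform well.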

DATA PARTITION
Rare event oversampling

Often the event of interest is rare
Examples: response to mailing, fraud in taxes,
Sampling may yield too few interesting cases to effectively train a model
A popular solution: oversample the rare cases to obtain a more balanced training set
Later, need to adjust results for the oversampling
UNDERSAMPLING ???
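As a minimal sketch of the oversampling idea, the example below repeats rare-class records (sampling with replacement) until the training set is balanced. This is only one way to oversample; the helper name, the record layout, and the mailing data are hypothetical, and the later adjustment of results for the oversampling is not shown.

```python
import random


def oversample_rare_class(records, label_key, rare_label, seed=0):
    """Return a balanced copy of records in which rare-class records are
    repeated (sampled with replacement) until they match the other class."""
    rng = random.Random(seed)
    rare = [r for r in records if r[label_key] == rare_label]
    common = [r for r in records if r[label_key] != rare_label]
    if not rare or len(rare) >= len(common):
        return list(records)  # nothing to balance
    extra = rng.choices(rare, k=len(common) - len(rare))
    balanced = common + rare + extra
    rng.shuffle(balanced)
    return balanced


# Hypothetical mailing data: only 2 of 20 customers responded.
mailing = [{"customer_id": i, "responded": i < 2} for i in range(20)]
balanced = oversample_rare_class(mailing, "responded", True)
responders = sum(1 for r in balanced if r["responded"])
non_responders = len(balanced) - responders
print(responders, non_responders)
```

Undersampling would do the opposite: keep all rare cases and draw only a subset of the common ones.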
Types of Variables
Determine the types of pre-processing needed, and algorithms used
Main distinction: Categorical vs. numeric
Numeric
Continuous
Integer
Categorical
Ordered (low, medium, high)
Unordered (male, female)
Handling Categorical Variables
Dummy Variable in regression
Eg: Occupation
Student
Unemployed
Employed
Retired
Refer : Effects of life style on Ageing dataset.

For a single categorical variable such as Occupation, you may need to create four Yes/No (dummy) variables, one per category.
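A minimal sketch of turning the Occupation variable into Yes/No dummy variables follows. The function name and the `drop_first` option are assumptions added for illustration; dropping one category is the usual practice in regression, because the fourth dummy is fully determined by the other three.

```python
def make_dummies(values, categories, prefix, drop_first=False):
    """Create one Yes/No variable per category for each value.
    With drop_first=True the first category is left out (m - 1 dummies)."""
    used = categories[1:] if drop_first else categories
    return [
        {f"{prefix}_{c}": ("Yes" if v == c else "No") for c in used}
        for v in values
    ]


categories = ["Student", "Unemployed", "Employed", "Retired"]
occupations = ["Student", "Retired", "Employed"]  # hypothetical records

all_four = make_dummies(occupations, categories, "Occupation")
for_regression = make_dummies(occupations, categories, "Occupation", drop_first=True)
print(all_four[0])
print(for_regression[0])
```

In the `for_regression` version a Student is represented by "No" in all three remaining columns.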
The Problem of Overfitting

Statistical models can produce highly complex explanations of relationships between variables
The fit may be excellent
When used with new data, models of great complexity do not do so well.
Causes:
Too many predictors
A model with too many parameters
Trying many different models
Consequence: Deployed model will not work as well as expected with completely new data.
100% fit not useful for new data
[Figure: plot of Revenue (y-axis) against Expenditure (x-axis)]
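To illustrate why a 100% fit is not useful for new data, the sketch below compares a polynomial that passes through every training point with a simple straight line. The data are hypothetical (a linear trend with alternating noise), and the helper names are assumptions made for this example.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate the polynomial that passes exactly through all (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data: the true relationship is y = x; training y is noisy.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 0, 3, 2, 5, 4]
new_x = [0.5, 1.5, 2.5, 3.5, 4.5]
new_y = [0.5, 1.5, 2.5, 3.5, 4.5]

slope, intercept = fit_line(train_x, train_y)
poly_train_mse = mse(train_y, [lagrange_predict(train_x, train_y, x) for x in train_x])
poly_new_mse = mse(new_y, [lagrange_predict(train_x, train_y, x) for x in new_x])
line_train_mse = mse(train_y, [slope * x + intercept for x in train_x])
line_new_mse = mse(new_y, [slope * x + intercept for x in new_x])

print("polynomial: train", poly_train_mse, "new", poly_new_mse)
print("line:       train", line_train_mse, "new", line_new_mse)
```

The polynomial reproduces the training data perfectly yet does worse than the line on the new records, which is the overfitting pattern described above.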
Detecting Outliers
An outlier is an observation that is extreme, being distant from the rest of the data (definition of distant is deliberately vague)
Outliers can have disproportionate influence on models (a problem if it is spurious)
An important step in data pre-processing is detecting outliers
Once detected, domain knowledge is required to determine if it is an error, or truly extreme.
In some contexts, finding outliers is the purpose of the DM exercise (airport security screening). This is called anomaly detection.
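Since the definition of "distant" is left open, any detection rule is a choice. The sketch below uses one common convention, flagging values more than 1.5 interquartile ranges beyond the quartiles; the rule, the function name, and the sample values are assumptions for illustration only.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one extreme value.
measurements = [10, 12, 11, 13, 12, 11, 95]
flagged = iqr_outliers(measurements)
print(flagged)
```

A flagged value is only a candidate: deciding whether it is an error or a truly extreme case still needs domain knowledge.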
Handling Missing Data

Most algorithms will not process records with missing values. Default is to drop those records.
Solution 1: Omission
If a small number of records have missing values, can omit them
If many records are missing values on a small set of variables, can drop those variables (or use proxies)
If many records have missing values, omission is not practical
Solution 2: Imputation
Replace missing values with reasonable substitutes
Lets you keep the record and use the rest of its (non-missing) information
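The two solutions can be sketched side by side. In this example missing values are represented by `None` and the median is used as the "reasonable substitute"; both choices, along with the function names and the customer records, are assumptions for illustration.

```python
from statistics import median


def drop_incomplete(records):
    """Solution 1 (omission): keep only records with no missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def impute_median(records, column):
    """Solution 2 (imputation): replace missing values in one column
    with the median of the observed values in that column."""
    observed = [r[column] for r in records if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]}
            for r in records]


# Hypothetical customer records; None marks a missing value.
customers = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": 58000},
    {"age": 29, "income": None},
]

complete_only = drop_incomplete(customers)
age_imputed = impute_median(customers, "age")
print(len(complete_only), age_imputed[1])
```

Omission loses half of these records, while imputation keeps the second record together with its observed income.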
Normalizing (Standardizing) Data

Used in some techniques when variables with the largest scales would dominate and skew results
Puts all variables on same scale
Normalizing function: Subtract mean and divide by standard deviation (used in XLMiner)
Alternative function: scale to 0-1 by subtracting minimum and dividing by the range
The 0-1 scaling is useful when the data contain both dummy and numeric variables
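Both normalizing functions are short formulas, sketched here in plain Python. The use of the sample standard deviation and the example values are assumptions; a tool such as XLMiner may use a different variant.

```python
from statistics import mean, stdev


def z_score(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def min_max(values):
    """Scale to 0-1: subtract the minimum and divide by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical variable on a large scale.
incomes = [10, 20, 30, 40, 50]
standardized = z_score(incomes)
scaled = min_max(incomes)
print(standardized)
print(scaled)
```

After 0-1 scaling, a numeric variable shares the same range as a 0/1 dummy variable, which is why that version suits mixed data.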

DATA Partition
Data Partition - How
Practice and Recap
Loan Prediction Data set

Retail Sales Dataset
Coggle Mindmap Learnings so far
ST Hackathon
XLMINER INSTALLATION :
\\10.1.1.11\commonfolder\DMBA - Elective
DMBA Text Book Datasets:
\\10.1.1.11\commonfolder\DMBA - Elective\DMBA DATASETS\DM for BI - DATA SETS June 2016
THANK YOU VERY MUCH
