
Data Mining for Business Analytics
AY 2016-17
Dr. Sridhar Vaithianathan
IT & Analytics
Office 2
sridhar.v@imthyderabad.edu.in
Mobile: 99899 04245

Topic Outline
Statistics Vs Analytics
Data Mining Methods
Data Mining Techniques
Data Mining Process

It's all about DATA


DATA EXPLOSION

Each day, our society creates 2.5 quintillion bytes of data (that's 2.5 x 10^18 bytes, or 25 followed by 17 zeros).
With this glut of data, the need to make sense of it becomes more acute.
DATA QUOTES

"Data is the new oil" - Dunnhumby
"Data is the new raw material of business" - Microsoft
"The world is now awash in data and we can see consumers in a lot clearer ways" - PayPal
"The goal is to turn data into information, and information into insight" - HP

Analytics: What it is NOT?

Statistics                   Vs                     Analytics
Macro-decisioning                                   Micro-decisioning
Explain/describe population relationships           Predict values of new records
Small sample, few variables                         Large sample, many variables
Find a good-fitting statistical model               Models/algorithms with high predictive power
Confidence intervals, hypothesis test, p-value      Predictive power metrics and cost

Analytics Vs Statistics
The prime objective of --------- is to minimize the error and improve the accuracy.
Answer: ANALYTICS

Data Mining Methods

PREDICTION (NUMERICAL Y)

DIMENSION REDUCTION

CLASSIFICATION (CATEGORICAL Y)

SEGMENTATION

WHAT GOES WITH WHAT

Data Mining Methods

Answer: This is supervised learning, because the database includes whether the loan was approved or not.

Data Mining Methods

Answer: This is supervised learning, because for the other packets the status is known.

Data Mining Methods

Answer: This is unsupervised learning, because there is no apparent outcome (e.g., whether the recommendation was adopted or not).

Data Mining Methods

Answer: This is unsupervised learning, because there is no known outcome (though once you use unsupervised learning to identify segments, you could use supervised learning to classify new customers into those segments).

Data Mining Methods

Answer: This is supervised learning, because the status of the similar firms is known.

Data Mining Methods

Answer: This is supervised learning, because there is likely to be knowledge of actual (historic) repair times of similar repairs.

Data Mining Methods

Answer: This is supervised learning, as there is likely to be knowledge about whether the sorting was correct.

Data Mining Methods

Answer: This is unsupervised learning, if we assume that we do not know what will be purchased in the future.

Data Mining Techniques
Classification
Prediction
Association
Cluster/Segmentation

Dr. Sridhar Vaithianathan, IMT Hyderabad

Data Mining Process


DM Process: Define the Purpose

Business Objective -> DM/A Objective


DM Process: Obtain the Data

Sources: External & Internal


Preprocessing DATA
Oversampling Rare Events
Types of Variables
Handling Categorical Variables
Variable Selection: Parsimony
Problem of Overfitting
How many Variables (X-Y plots)
How much Data:
    10 cases for each variable, or
    6 x m x p (m = number of outcome classes, p = number of variables)
Detecting Outliers
Handling Missing Data
Normalizing Data
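
The two "How much Data" rules of thumb above are simple arithmetic. A minimal sketch follows; the function names and the example figures (8 variables, 2 outcome classes) are hypothetical and chosen only for illustration.

```python
def min_records_per_variable(num_variables, cases_per_variable=10):
    """Rule of thumb 1: 10 cases for each variable."""
    return cases_per_variable * num_variables


def min_records_classification(num_classes, num_variables, multiplier=6):
    """Rule of thumb 2: 6 x m x p, with m outcome classes and p variables."""
    return multiplier * num_classes * num_variables


# Hypothetical example: 8 predictors and a two-class outcome.
by_variables = min_records_per_variable(8)       # 10 * 8
by_classes = min_records_classification(2, 8)    # 6 * 2 * 8
```

Both figures are lower bounds on the number of records, not targets; when the two rules disagree, the larger value is the more cautious choice.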


DATA PARTITION

Rare event oversampling


Often the event of interest is rare
Examples: response to a mailing, fraud in taxes
Sampling may yield too few "interesting" cases to effectively train a model
A popular solution: oversample the rare cases to obtain a more balanced training set
Later, the results need to be adjusted for the oversampling

UNDERSAMPLING ???
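
The oversampling idea above can be sketched in a few lines. This is one possible approach under stated assumptions, not the method used in class: rare cases are duplicated by sampling with replacement until the two groups are equal in size. The function name, the field names, and the mailing data are all hypothetical.

```python
import random


def oversample_rare(records, label_key, rare_label, seed=0):
    """Duplicate rare-class records (sampling with replacement) until both
    groups are the same size. Intended for the training partition only."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rare = [r for r in records if r[label_key] == rare_label]
    common = [r for r in records if r[label_key] != rare_label]
    if not rare or len(rare) >= len(common):
        return list(records)  # nothing to balance
    extra = [rng.choice(rare) for _ in range(len(common) - len(rare))]
    balanced = common + rare + extra
    rng.shuffle(balanced)
    return balanced


# Hypothetical mailing data: 5 responders among 100 customers.
customers = [{"id": i, "responded": i < 5} for i in range(100)]
training = oversample_rare(customers, "responded", True)

# Keep both rates: the original rate is needed later to adjust results.
original_rate = sum(c["responded"] for c in customers) / len(customers)
oversampled_rate = sum(c["responded"] for c in training) / len(training)
```

The gap between the original response rate and the oversampled rate is what the later adjustment has to account for, which is why the sketch records both.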

Types of Variables
Determine the types of pre-processing needed and the algorithms used
Main distinction: categorical vs. numeric
Numeric
    Continuous
    Integer
Categorical
    Ordered (low, medium, high)
    Unordered (male, female)

Handling Categorical Variables
Dummy variables in regression
E.g.: Occupation
    Student
    Unemployed
    Employed
    Retired
Refer: Effects of life style on Ageing dataset.


For a single categorical variable, you may need to create four categories (dummy variables), each coded Yes/No.
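
As an illustration of the Occupation example, here is a minimal sketch of dummy coding in plain Python. The helper name and the sample values are hypothetical. It creates one Yes/No (1/0) column per category; in a regression one of the four is usually dropped, because the remaining three already determine it.

```python
OCCUPATIONS = ["Student", "Unemployed", "Employed", "Retired"]


def make_dummies(values, categories, prefix, drop_first=False):
    """Turn one categorical variable into 1/0 columns (1 = Yes, 0 = No)."""
    kept = categories[1:] if drop_first else categories
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in kept}
        for value in values
    ]


# Hypothetical sample of three respondents.
sample = ["Employed", "Student", "Retired"]
all_four = make_dummies(sample, OCCUPATIONS, "Occupation")
for_regression = make_dummies(sample, OCCUPATIONS, "Occupation", drop_first=True)
```

With `drop_first=True`, a row of all zeros stands for the dropped category ("Student" here), which then acts as the reference level.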


The Problem of Overfitting


Statistical models can produce highly complex explanations of relationships between variables
The "fit" may be excellent
When used with new data, models of great complexity do not do so well.
Causes:
    Too many predictors
    A model with too many parameters
    Trying many different models
Consequence: Deployed model will not work as well as expected with completely new data.
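
The point above can be reproduced on a toy example. The sketch below uses made-up numbers (not the data behind any chart in these slides): a polynomial that passes through every training point (a "100% fit") is compared with a simple straight line, first on the training data and then on two new records.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate the polynomial that passes exactly through all (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Made-up training data: roughly y = 2x with some noise.
train_x, train_y = [1, 2, 3, 4, 5], [2.5, 3.5, 6.5, 7.5, 10.5]
# Made-up new records that follow the same underlying trend.
new_x, new_y = [6, 7], [12.0, 14.0]

slope, intercept = fit_line(train_x, train_y)

complex_train = mean_abs_error(
    [lagrange_predict(train_x, train_y, x) for x in train_x], train_y)
simple_train = mean_abs_error([slope * x + intercept for x in train_x], train_y)
complex_new = mean_abs_error(
    [lagrange_predict(train_x, train_y, x) for x in new_x], new_y)
simple_new = mean_abs_error([slope * x + intercept for x in new_x], new_y)
```

On this data the complex model has zero training error while the line does not, yet on the new records the line's error stays small and the complex model's error is far larger. That reversal is the overfitting problem.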

100% fit not useful for new data
[Figure: Revenue (0-1600) vs. Expenditure (200-1000)]

Detecting Outliers
An outlier is an observation that is "extreme", being distant from the rest of the data (the definition of "distant" is deliberately vague)
Outliers can have disproportionate influence on models (a problem if the outlier is spurious)
An important step in data pre-processing is detecting outliers
Once an outlier is detected, domain knowledge is required to determine whether it is an error or truly extreme.
In some contexts, finding outliers is the purpose of the DM exercise (e.g., airport security screening). This is called "anomaly detection".
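
Because "distant" is deliberately vague, any code has to pick a convention. The sketch below assumes one common choice, the 1.5 x IQR rule, which the slides do not prescribe. The function name and the data are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3.

    The 1.5 * IQR fence is an assumed convention, not a rule from the slides.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical ages with one suspicious entry.
ages = [10, 12, 11, 13, 12, 11, 95]
flagged = iqr_outliers(ages)
```

The function only flags candidates. Whether 95 is a data-entry error or a genuinely extreme value is still a question for domain knowledge.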

Handling Missing Data


Most algorithms will not process records with missing values. Default is to drop those records.
Solution 1: Omission
    If a small number of records have missing values, can omit them
    If many records are missing values on a small set of variables, can drop those variables (or use proxies)
    If many records have missing values, omission is not practical
Solution 2: Imputation
    Replace missing values with reasonable substitutes
    Lets you keep the record and use the rest of its (non-missing) information
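
Both solutions can be sketched in plain Python. The sketch assumes missing values are stored as None and uses the median as the "reasonable substitute"; the slides do not say which substitute to use, so that choice, the function names, and the records are all assumptions.

```python
from statistics import median


def drop_incomplete(records):
    """Solution 1 (omission): keep only records with no missing values."""
    return [r for r in records if None not in r.values()]


def impute_median(records, column):
    """Solution 2 (imputation): fill missing values with the column median."""
    observed = [r[column] for r in records if r[column] is not None]
    fill = median(observed)
    return [
        {**r, column: fill if r[column] is None else r[column]}
        for r in records
    ]


# Hypothetical records; income is missing for one person.
people = [
    {"age": 25, "income": 50},
    {"age": 40, "income": None},
    {"age": 31, "income": 70},
    {"age": 52, "income": 90},
]
complete_only = drop_incomplete(people)
imputed = impute_median(people, "income")
```

Omission loses the second person's age as well as the missing income, while imputation keeps the record and its non-missing information.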

Normalizing (Standardizing) Data


Used in some techniques when variables with the largest scales would dominate and skew results
Puts all variables on the same scale
Normalizing function: subtract the mean and divide by the standard deviation (used in XLMiner)
Alternative function: scale to 0-1 by subtracting the minimum and dividing by the range
    Useful when the data contain both dummies and numeric variables
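
The two normalizing functions above can be written directly. The sketch assumes the sample standard deviation (n - 1 in the denominator); whether XLMiner uses the sample or population version is not stated here. The function names and values are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def min_max(values):
    """Scale to 0-1: subtract the minimum and divide by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical incomes (in thousands).
incomes = [10, 20, 30]
standardized = z_scores(incomes)   # centred on 0
rescaled = min_max(incomes)        # between 0 and 1
```

Min-max scaling puts a numeric variable on the same 0-1 range as a dummy variable, which is why it suits data that mix the two. Both functions divide by zero if every value is identical, so constant columns need separate handling.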


DATA Partition


Data Partition - How


Practice and Recap

Loan Prediction Dataset
Retail Sales Dataset
Coggle Mindmap: Learnings so far
ST Hackathon

XLMINER INSTALLATION :
\\10.1.1.11\commonfolder\DMBA - Elective
DMBA Text Book Datasets:
\\10.1.1.11\commonfolder\DMBA - Elective\DMBA DATASETS\DM for BI - DATA SETS June 2016

THANK YOU VERY MUCH
