
SCCS 453

Data Warehousing and Data Mining


Lecture 8
Overview of Data Mining Techniques

Songsri Tangsripairoj, Ph.D.


ccsts@mahidol.ac.th

Department of Computer Science


Faculty of Science, Mahidol University

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 1


Semester 2, Year 2006

Topics
 Data Mining Tasks
 Data Mining Techniques
 Data Mining Models
 Data Mining Functions
 Demonstration Data Sets
 Data Mining Tools
     Shopping Cart Analyzer (SCA)
     Weka by the University of Waikato

Data Mining Tasks
 Descriptive
     Characterize general properties of the data in databases
     Clustering and Summarization

 Predictive
     Perform inference on the current data in order to make predictions
     Classification and Estimation

Data Mining Techniques
 Statistical techniques
     Have strong diagnostic tools
     Can be used to develop confidence intervals on parameter estimates and for hypothesis testing

 Artificial Intelligence techniques
     Require fewer assumptions about the data
     Are generally more automatic

Data Mining Techniques
 Statistical
     Market-Basket Analysis (MBA) - find groups of items
     Memory-Based Reasoning - case based
     Cluster Detection - undirected (quantitative MBA)
 Artificial Intelligence
     Link Analysis - MCI’s Friends & Family
     Decision Trees, Rule Induction - production rules
     Neural Networks - automatic pattern detection
     Genetic Algorithms - keep best parameters
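As a rough illustration of what "find groups of items" means in market-basket analysis, the sketch below counts how often item pairs occur together (support) and how often one item appears given another (confidence). It is not from the lecture: the transactions and the helper names `pair_support` and `confidence` are invented for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each unordered item pair."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair has one canonical (alphabetical) key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Estimate the share of baskets with `antecedent` that also hold `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


support = pair_support(transactions)
bread_to_milk = confidence(transactions, "bread", "milk")
```

In this toy data every pair appears in half of the baskets, and two of the three baskets containing bread also contain milk. Real tools prune candidate item sets rather than enumerating every pair.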

Comparison of Features
Feature          Rules      Neural Net  Case Base  Genetic
Noisy data       Good       Very good   Good       Very good
Missing data     Good       Good        Very good  Good
Large sets       Very good  Poor        Good       Good
Different types  Good       Numerical   Very good  Transform
Accuracy         High       Very high   High       High
Explanation      Very good  Poor        Very good  Good
Integration      Good       Good        Good       Very good
Ease             Easy       Difficult   Easy       Difficult

Data Mining Models
 Regression: Y = a + bX
 Classification: assign new record to class
 Predictive: assign value to new record
 Clustering: groups for data
 Time-series: assign future value
 Links: patterns in data
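The regression model Y = a + bX above can be estimated by ordinary least squares. The sketch below is a minimal illustration and not part of the lecture: the function name `fit_line` and the sample points are assumptions chosen so that the fit is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates (a, b) for the model Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of X, and of cross-deviations of X and Y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b


# Hypothetical points that lie exactly on Y = 1 + 2X.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)

# Predicted value for a new record with X = 4.
prediction = a + b * 4.0
```

Once a and b are estimated, assigning a value to a new record is a single evaluation of the fitted line.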

Data Mining Modeling Tools
Algorithms (Radding)  Functions (Peacock)     Basis       Task
Cluster detection     Cluster analysis        Statistics  Classification
-                     Regression models       Statistics  Estimation
-                     Logistic regression     Statistics  Classification
-                     Discriminant analysis   Statistics  Classification
Neural networks       Neural networks         AI          Classification
-                     Kohonen nets            AI          Cluster
Decision trees        Association rules       AI          Classification
Rule induction        Association rules       AI          Description
Link analysis         -                       -           Description
-                     Query tools             -           Description
-                     Descriptive statistics  Statistics  Description
-                     Visualization tools     Statistics  Description

Data Mining Functions
 Classification
     Identify categories in data
 Prediction
     Formula to predict future observations
 Association
     Rules using relationships among entities
 Detection
     Anomalies & irregularities (fraud detection)

Financial Applications
Technique   Application             Problem Type
Neural net  Forecast stock price    Prediction
NN, Rule    Forecast bankruptcy     Prediction
            Fraud detection         Detection
NN, Case    Forecast interest rate  Prediction
NN, visual  Late loan detection     Detection
Rule        Credit assessment       Prediction
            Risk classification     Classification
Rule, Case  Corporate bond rate     Prediction

Telecom Applications
Technique                   Application                Problem Type
Neural net, rule induction  Forecast network behavior  Prediction
Rule induction              Churn                      Classification
                            Fraud detection            Detection
Case based                  Call tracking              Classification

Marketing Applications
Technique                        Application            Problem Type
Rule induction                   Market segment         Classification
                                 Cross-selling          Association
Rule induction, visual           Lifestyle analysis     Classification
                                 Performance analysis   Association
Rule induction, genetic, visual  Reaction to promotion  Prediction
Case based                       Online sales support   Classification

Web Applications
Technique                      Application                        Problem Type
Rule induction, visualization  User browsing similarity analysis  Classification, Association
Rule-based heuristics          Web page content similarity        Association

Other Applications
Technique                   Application            Problem Type
Neural net                  Software cost          Detection
Neural net, rule induction  Litigation assessment  Prediction
Rule induction              Insurance fraud        Detection
                            Healthcare exceptions  Detection
Case based                  Insurance claim        Prediction
                            Software quality       Classification
Genetic algorithm           Budget spending        Classification

Demonstration Data Sets
 Loan Application Data
     Classification
 Job Application Data
     Classification
 Insurance Fraud Data
     Detection
 Expenditure Data
     Prediction

Loan Data
 650 observations
 OUTCOMES (binary):
     On-time - cost of error: $300
     Late (default) - cost of error: $2,000
 Variables:
     Age, Income, Assets, Debts, Want, Credit
     Credit is ordinal
     Transform: Assets, Debts, & Want → Risk
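Because the two outcomes carry different error costs, a loan classifier is more naturally judged by total error cost than by raw accuracy. The sketch below is an assumption-laden illustration: it reads the $300 figure as the cost of wrongly rejecting a loan that would have been paid on time, and the $2,000 figure as the cost of wrongly accepting a loan that turns out late. The labels, the sample predictions, and the function name `total_error_cost` are invented.

```python
# Error costs from the slide; which error each cost attaches to is an assumed reading.
COST_ONTIME_MISCLASSIFIED = 300   # actual on-time loan predicted as late
COST_LATE_MISCLASSIFIED = 2000    # actual late loan predicted as on-time


def total_error_cost(actual, predicted):
    """Sum the cost of every misclassified loan; correct predictions cost nothing."""
    cost = 0
    for a, p in zip(actual, predicted):
        if a == "on-time" and p == "late":
            cost += COST_ONTIME_MISCLASSIFIED
        elif a == "late" and p == "on-time":
            cost += COST_LATE_MISCLASSIFIED
    return cost


# Hypothetical outcomes and model predictions for four loans.
actual = ["on-time", "on-time", "late", "late"]
predicted = ["on-time", "late", "on-time", "late"]
cost = total_error_cost(actual, predicted)
```

Under this reading, two models with the same accuracy can differ widely in cost, depending on which kind of error they make.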

Example
Loan Data

Job Application Data
 500 observations
 OUTCOMES (ordinal):
     Unacceptable
     Minimal
     Acceptable
     Excellent
 Variables:
     Age, State, Degree, Major, Experience
     State is nominal; Degree & Major are ordinal
     State is superfluous
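One possible way to prepare such records is sketched below: the superfluous State field is dropped and the ordinal outcome is mapped to an integer rank. This is only an illustration. It assumes the four outcomes are listed from worst to best, and the field names, the sample record, and the helper `prepare` are hypothetical. The ordering of the Degree and Major levels is not given in the slides, so those fields are left untouched.

```python
# Assumed ordering of the ordinal outcome, worst to best, as listed on the slide.
OUTCOME_ORDER = ["Unacceptable", "Minimal", "Acceptable", "Excellent"]
OUTCOME_RANK = {label: rank for rank, label in enumerate(OUTCOME_ORDER)}


def prepare(record):
    """Drop the superfluous State field and add an integer rank for the outcome."""
    cleaned = {key: value for key, value in record.items() if key != "State"}
    cleaned["OutcomeRank"] = OUTCOME_RANK[record["Outcome"]]
    return cleaned


# Hypothetical applicant record; every value here is invented.
record = {
    "Age": 28,
    "State": "CA",
    "Degree": "BS",
    "Major": "CS",
    "Experience": 3,
    "Outcome": "Acceptable",
}
row = prepare(record)
```

Encoding the outcome as a rank preserves its order, which a plain nominal encoding would discard.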

Example
Job App. Data

Insurance Claim Data
 5,000 observations
 OUTCOMES (binary):
     OK - cost of error: $500
     Fraudulent - cost of error: $2,500
 Variables:
     Age, Gender, Claim, Tickets, Prior claims, Attorney
     Gender & Attorney are nominal; Tickets & Prior claims are categorical
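The unequal error costs imply that a claim should be flagged well before fraud becomes the more likely outcome. The sketch below works this out under an assumed reading of the slide: $500 is the cost of flagging a legitimate claim, and $2,500 is the cost of passing a fraudulent one. The probability `p_fraud` would come from some fitted model, which is not shown, and the function names are hypothetical.

```python
# Error costs from the slide; the mapping to error types is an assumed reading.
COST_OK_MISCLASSIFIED = 500       # legitimate claim flagged as fraudulent
COST_FRAUD_MISCLASSIFIED = 2500   # fraudulent claim passed as OK


def fraud_threshold(cost_false_alarm, cost_missed_fraud):
    """Probability of fraud above which flagging has the lower expected cost."""
    return cost_false_alarm / (cost_false_alarm + cost_missed_fraud)


def decide(p_fraud):
    """Choose the label with the lower expected error cost for one claim."""
    expected_cost_if_passed = p_fraud * COST_FRAUD_MISCLASSIFIED
    expected_cost_if_flagged = (1 - p_fraud) * COST_OK_MISCLASSIFIED
    if expected_cost_if_passed > expected_cost_if_flagged:
        return "Fraudulent"
    return "OK"


threshold = fraud_threshold(COST_OK_MISCLASSIFIED, COST_FRAUD_MISCLASSIFIED)
```

With these costs the threshold is 1/6, so under this reading a claim with an estimated 20% chance of fraud would already be flagged.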

Example
Insurance Claim Data

Expenditure Data
 10,000 observations
 OUTCOMES:
     Could predict response in a number of categories
     Others
 Variables:
     Age, Gender, Marital, Dependents, Income, Job years, Town years, Education years, Driver's license, Own home, Number of credit cards
     Churn, proportion of income spent on seven categories
Example
Expenditure Data
