Semester 2, Year 2006
Topics
- Data Mining Tasks
- Data Mining Techniques
- Data Mining Models
- Data Mining Functions
- Demonstration Data Sets
- Data Mining Tools
  - Shopping Cart Analyzer (SCA)
  - Weka by the University of Waikato
SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 2
Data Mining Tasks
- Descriptive
  - Characterize general properties of the data in databases
  - Clustering and summarization
- Predictive
  - Perform inference on the current data in order to make predictions
  - Classification and estimation
Data Mining Techniques
- Statistical techniques
  - Have strong diagnostic tools
  - Can be used to develop confidence intervals on parameter estimates and for hypothesis testing
- Artificial Intelligence techniques
  - Require fewer assumptions about the data
  - Are generally more automatic
Data Mining Techniques
- Statistical
  - Market-Basket Analysis: find groups of items
  - Memory-Based Reasoning: case based
  - Cluster Detection: undirected (quantitative MBA)
- Artificial Intelligence
  - Link Analysis: MCI's Friends & Family
  - Decision Trees, Rule Induction: production rules
  - Neural Networks: automatic pattern detection
  - Genetic Algorithms: keep best parameters
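As a rough illustration of what "find groups of items" means in market-basket analysis, the sketch below counts how often each pair of items appears in the same basket. The transactions and the `pair_counts` helper are hypothetical, invented for this example; a fuller implementation would normally add minimum support and confidence thresholds.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical transactions, invented for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "eggs"],
]

counts = pair_counts(baskets)
for pair, n in counts.most_common():
    # Support = share of all baskets that contain the pair
    print(pair, n, f"support={n / len(baskets):.2f}")
```

Pairs with high support are the candidate "groups of items" that an analyst would inspect further.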
Comparison of Features
Feature         | Rules     | Neural Net | CaseBase  | Genetic
Noisy data      | Good      | Very good  | Good      | Very good
Missing data    | Good      | Good       | Very good | Good
Large sets      | Very good | Poor       | Good      | Good
Different types | Good      | Numerical  | Very good | Transform
Accuracy        | High      | Very high  | High      | High
Explanation     | Very good | Poor       | Very good | Good
Integration     | Good      | Good       | Good      | Very good
Ease            | Easy      | Difficult  | Easy      | Difficult
Data Mining Models
- Regression: Y = a + bX
- Classification: assign new record to class
- Predictive: assign value to new record
- Clustering: groups for data
- Time-series: assign future value
- Links: patterns in data
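The regression model Y = a + bX can be fitted by ordinary least squares. The sketch below shows one minimal way to do that in plain Python; the `fit_line` helper and the sample points are hypothetical and were chosen only to keep the arithmetic easy to follow.

```python
def fit_line(xs, ys):
    """Return (a, b) that minimize the squared error of Y = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    a = mean_y - b * mean_x
    return a, b


# Hypothetical observations that lie exactly on Y = 1 + 2X.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

a, b = fit_line(xs, ys)
predicted = a + b * 5.0  # predicted Y for a new record with X = 5
print(f"Y = {a:.2f} + {b:.2f} X; prediction at X=5: {predicted:.2f}")
```

The same fitted coefficients serve the "Predictive" model in the list: a value is assigned to a new record by plugging its X into the equation.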
Data Mining Modeling Tools
Radding Algorithms / Peacock Functions   | Basis           | Task
Cluster detection / Cluster analysis     | Statistics      | Classification
Regression models                        | Statistics      | Estimation
Logistic regression                      | Statistics      | Classification
Discriminant analysis                    | Statistics      | Classification
Neural networks / Neural networks        | AI              | Classification
Kohonen nets                             | AI              | Cluster
Decision trees / Association rules       | AI              | Classification
Rule induction / Association rules       | AI              | Description
Link analysis                            |                 | Description
Query tools                              |                 | Description
Descriptive statistics                   | Statistics      | Description
Visualization tools                      | Statistics      | Description

Data Mining Functions
- Classification: identify categories in data
- Prediction: formula to predict future observations
- Association: rules using relationships among entities
- Detection: anomalies and irregularities (fraud detection)
Financial Applications
Technique | Application | Problem Type
Telecom Applications
Technique                  | Application               | Problem Type
Neural net, rule induction | Forecast network behavior | Prediction
Rule induction             | Churn                     | Classification
                           | Fraud detection           | Detection
Case based                 | Call tracking             | Classification
Marketing Applications
Technique                              | Application           | Problem Type
Rule induction                         | Market segment        | Classification
                                       | Cross-selling         | Association
Rule induction, visualization          | Lifestyle analysis    | Classification
                                       | Performance analysis  | Association
Rule induction, genetic, visualization | Reaction to promotion | Prediction
Case based                             | Online sales support  | Classification
Web Applications
Technique                     | Application                       | Problem Type
Rule induction, visualization | User browsing similarity analysis | Classification, association
Rule-based heuristics         | Web page content similarity       | Association
Other Applications
Technique                  | Application           | Problem Type
Neural net                 | Software cost         | Detection
Neural net, rule induction | Litigation assessment | Prediction
Rule induction             | Insurance fraud       | Detection
                           | Healthcare except.    | Detection
Case based                 | Insurance claim       | Prediction
                           | Software quality      | Classification
Genetic algorithm          | Budget spending       | Classification
Demonstration Data Sets
- Loan Application Data: classification
- Job Application Data: classification
- Insurance Fraud Data: detection
- Expenditure Data: prediction
Loan Data
- 650 observations
- Outcomes (binary):
  - On-time: cost of error $300
  - Late (default): cost of error $2,000
- Variables: Age, Income, Assets, Debts, Want, Credit
  - Credit is ordinal
  - Transform: Assets, Debts, and Want → Risk
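Because the two kinds of error carry different costs, models can be compared by total error cost rather than by raw error count. The sketch below assumes the $300 cost applies when an applicant who actually pays on time is classified as late, and the $2,000 cost applies when an applicant who actually pays late (defaults) is classified as on-time. That reading of the slide, the `total_error_cost` helper, and the example predictions are all assumptions made for illustration; none of the numbers are results from the course data set.

```python
COST_ON_TIME_ERROR = 300   # assumed: actual on-time applicant classified as late
COST_LATE_ERROR = 2000     # assumed: actual late (default) applicant classified as on-time


def total_error_cost(actual, predicted):
    """Sum the cost of every misclassified loan under the assumed cost reading."""
    cost = 0
    for a, p in zip(actual, predicted):
        if a == "on-time" and p == "late":
            cost += COST_ON_TIME_ERROR
        elif a == "late" and p == "on-time":
            cost += COST_LATE_ERROR
    return cost


# Hypothetical outcomes and predictions from two made-up models.
actual = ["on-time", "on-time", "late", "late", "on-time"]
model_a = ["on-time", "late", "late", "on-time", "on-time"]  # one error of each kind
model_b = ["late", "late", "late", "late", "on-time"]        # two on-time errors only

cost_a = total_error_cost(actual, model_a)
cost_b = total_error_cost(actual, model_b)
print("model A cost:", cost_a)
print("model B cost:", cost_b)
```

Both hypothetical models make two errors, yet their total costs differ because a missed default is far more expensive than a wrongly rejected on-time applicant.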
Example Loan Data
Job Application Data
- 500 observations
- Outcomes (ordinal): Unacceptable, Minimal, Acceptable, Excellent
- Variables: Age, State, Degree, Major, Experience
  - State is nominal; Degree and Major are ordinal
  - State is superfluous
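One way to prepare such records for a classifier is to encode the ordinal outcome as an integer rank and drop the superfluous State variable. The sketch below assumes the outcomes are listed from worst to best, as on the slide. The `prepare` helper and the sample applicant are hypothetical, and Degree and Major are left untouched because the slide does not give their ordering.

```python
# Ranks follow the order in which the outcomes are listed (assumed worst to best).
OUTCOME_RANK = {"Unacceptable": 0, "Minimal": 1, "Acceptable": 2, "Excellent": 3}


def prepare(record):
    """Drop the superfluous State field and encode the ordinal outcome as a rank."""
    cleaned = {key: value for key, value in record.items() if key != "State"}
    cleaned["Outcome"] = OUTCOME_RANK[record["Outcome"]]
    return cleaned


# Hypothetical applicant, invented for illustration only.
applicant = {
    "Age": 28,
    "State": "CA",
    "Degree": "BS",
    "Major": "Engineering",
    "Experience": 3,
    "Outcome": "Acceptable",
}

prepared = prepare(applicant)
print(prepared)
```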
Example Job Application Data
Insurance Claim Data
- 5,000 observations
- Outcomes (binary):
  - OK: cost of error $500
  - Fraudulent: cost of error $2,500
- Variables: Age, Gender, Claim, Tickets, Prior claims, Attorney
  - Gender and Attorney are nominal; Tickets and Prior claims are categorical
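With unequal error costs, a claim does not have to be more likely fraudulent than not before it is worth flagging. Assuming the $500 cost is incurred when an OK claim is flagged as fraudulent and the $2,500 cost when a fraudulent claim is passed as OK (an interpretation of the slide, not something it states), an expected-cost rule flags a claim when p * 2500 > (1 - p) * 500, where p is the estimated probability of fraud. That gives a break-even point of p = 500 / (500 + 2500), roughly 0.167. The helper names and the probabilities below are hypothetical.

```python
COST_OK_ERROR = 500      # assumed: OK claim wrongly flagged as fraudulent
COST_FRAUD_ERROR = 2500  # assumed: fraudulent claim wrongly passed as OK


def fraud_threshold():
    """Fraud probability at which flagging and passing cost the same on average."""
    return COST_OK_ERROR / (COST_OK_ERROR + COST_FRAUD_ERROR)


def flag_claim(p_fraud):
    """Flag the claim when flagging has the lower expected error cost."""
    expected_cost_if_passed = p_fraud * COST_FRAUD_ERROR
    expected_cost_if_flagged = (1 - p_fraud) * COST_OK_ERROR
    return expected_cost_if_flagged < expected_cost_if_passed


# Hypothetical fraud probabilities from some classifier.
for p in (0.05, 0.10, 0.20, 0.60):
    print(f"p={p:.2f} flag={flag_claim(p)}")
print(f"break-even threshold: {fraud_threshold():.3f}")
```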
Example Insurance Claim Data
Expenditure Data
- 10,000 observations
- Outcomes:
  - Could predict response in a number of categories
  - Others
- Variables: Age, Gender, Marital, Dependents, Income, Job years, Town years, Education years, Drivers license, Own home, Number of credit cards
  - Churn, proportion of income spent on seven categories

Example Expenditure Data