
Data Mining

Dr. Mohsen Kahani
http://www.um.ac.ir/~kahani/

Motivation:
“Necessity is the Mother of Invention”

- Data explosion problem:
  - Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses, and other information repositories
- We are drowning in data, but starving for knowledge!

Expert Systems and Knowledge Engineering - Dr. Kahani


Related Fields

[Diagram: Data Mining and Knowledge Discovery at the center, surrounded by Machine Learning, Visualization, Statistics, and Databases]


Knowledge Discovery Process

[Figure: knowledge discovery process. Labels: Raw Data, Integration, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Interpretation & Evaluation, Understanding, Knowledge]


Data Mining and Business Intelligence

Layers, listed from the top (highest potential to support business decisions) down to the bottom, with the responsible role in parentheses:

- Making Decisions (End User)
- Data Presentation: Visualization Techniques (Business Analyst)
- Data Mining: Information Discovery (Data Analyst)
- Data Exploration: Statistical Analysis, Querying and Reporting
- Data Warehouses / Data Marts: OLAP, MDA (DBA)
- Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Definition of Data Mining

“…The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data…”
Fayyad, Piatetsky-Shapiro, Smyth [1996]


Need for Data Mining
- Data accumulate and double every 9 months
- There is a big gap between stored data and knowledge, and the transition won’t occur automatically
- Manual data analysis is not new, but it is a bottleneck
- Fast-developing computer science and engineering generate new demands
- Seeking knowledge from massive data


When is DM useful?
- Data-rich world
- Large data (dimensionality and size)
  - Image data (size)
  - Gene chip data (dimensionality)
- Little knowledge about data (exploratory data analysis)
- What if we have some knowledge?


Challenges
- Increasing data dimensionality and data size
- Various data forms
- New data types
  - Streaming data, multimedia data
- Efficient search and access to data/knowledge
- Intelligent update and integration
- Privacy concerns


Results of Data Mining Include:
- Forecasting what may happen in the future
- Classifying people or things into groups by recognizing patterns
- Clustering people or things into groups based on their attributes
- Associating what events are likely to occur together
- Sequencing what events are likely to lead to later events


Data Mining versus OLAP
OLAP - On-line Analytical Processing
- Provides you with a very good view of what is happening, but cannot predict what will happen in the future or explain why it is happening


Data Mining Versus Statistical Analysis

Data Mining
- Originally developed to act as expert systems to solve problems
- Less interested in the mechanics of the technique
- If it makes sense then let’s use it
- Does not require assumptions to be made about data
- Can find patterns in very large amounts of data
- Requires understanding of data and business problem

Data Analysis
- Tests for statistical correctness of models
  - Are statistical assumptions of models correct?
  - E.g., is the R-Square good?
- Hypothesis testing
  - Is the relationship significant?
  - Use a t-test to validate significance
- Tends to rely on sampling
- Techniques are not optimised for large amounts of data
- Requires strong statistical skills


Data Mining Taxonomy
Predictive Method
- …predict the value of a particular attribute…

Descriptive Method
- …foundation of human-interpretable patterns that
describe the data…



Data Mining Tasks...
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Deviation Detection [Predictive]


Data Mining Tasks: Classification

Learn a method for predicting the instance class from pre-labeled (classified) instances

Many approaches: Statistics, Decision Trees, Neural Networks, ...


Classification: Linear Regression

- Linear Regression: w0 + w1 x + w2 y >= 0
- Regression computes the weights wi from data to minimize squared error to ‘fit’ the data
- Not flexible enough
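
The least-squares fit described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the training points, the +1 / -1 encoding of the two classes, and the helper names (`solve_linear_system`, `fit_linear_classifier`, `predict`) are all made up for the example and do not come from the slides. The weights w0, w1, w2 are obtained from the normal equations, and a point is assigned to the positive class when w0 + w1 x + w2 y >= 0.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_classifier(points, labels):
    """Return least-squares weights [w0, w1, w2] for targets +1 / -1."""
    rows = [[1.0, x, y] for x, y in points]
    # Normal equations: (A^T A) w = A^T t
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(rows, labels)) for i in range(3)]
    return solve_linear_system(ata, atb)


def predict(w, x, y):
    """Classify with the sign of w0 + w1*x + w2*y."""
    return 1 if w[0] + w[1] * x + w[2] * y >= 0 else -1


# Hypothetical, linearly separable training data.
points = [(4, 4), (5, 3), (5, 5), (6, 4), (1, 1), (2, 1), (1, 2), (2, 2)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

w = fit_linear_classifier(points, labels)
predictions = [predict(w, x, y) for x, y in points]
print("weights:", w)
print("predictions:", predictions)
```

A single straight boundary is all this model can express, which is the sense in which the slide calls it "not flexible enough".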



Classification: Decision Trees

if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue

[Figure: scatter plot with axes X and Y; split points marked at X = 2 and X = 5]
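
The rule list above translates directly into code. The sketch below is a literal transcription; the function name `classify` and the sample points are assumptions chosen for illustration.

```python
def classify(x, y):
    """Return the colour predicted by the slide's rule list."""
    if x > 5:
        return "blue"
    elif y > 3:
        return "blue"
    elif x > 2:
        return "green"
    else:
        return "blue"


# Hypothetical sample points, one per branch of the rule list.
samples = [(6, 1), (3, 4), (3, 1), (1, 1)]
for x, y in samples:
    print((x, y), classify(x, y))
```

Each test on X or Y corresponds to one axis-parallel split, so only the region with 2 < X <= 5 and Y <= 3 is labelled green.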



Example Decision Tree

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Splitting Attributes:

Refund?
  Yes --> NO
  No  --> MarSt?
            Married          --> NO
            Single, Divorced --> TaxInc?
                                   < 80K --> NO
                                   > 80K --> YES

The splitting attribute at a node is determined based on the Gini index.
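
To make the Gini index concrete, the sketch below computes it for the ten records in the table. It assumes the usual definition, Gini = 1 - sum of squared class proportions, and scores a split by the size-weighted average over its branches. The function names `gini` and `weighted_gini` are hypothetical, and the code only scores the two categorical attributes; it does not rebuild the tree.

```python
from collections import Counter

# (refund, marital_status, taxable_income_in_K, cheat), copied from the table.
records = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(rows, attribute_index):
    """Size-weighted Gini impurity after splitting on one categorical attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute_index], []).append(row[-1])
    total = len(rows)
    return sum(len(g) / total * gini(g) for g in groups.values())


root = gini([row[-1] for row in records])
refund = weighted_gini(records, 0)
marital = weighted_gini(records, 1)
print(round(root, 3), round(refund, 3), round(marital, 3))
```

On these records the impurity before any split is 0.42; splitting on Refund gives about 0.343 and a three-way split on Marital Status gives 0.300, where lower values mean purer branches.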



Classification: Neural Networks

- efficiently model large and complex problems;
- may be used in classification problems or for regressions;
- starts with input layer => hidden layer => output layer

[Figure: feed-forward network with input nodes 1 and 2, hidden-layer nodes 3, 4, and 5, and output node 6]
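
A forward pass through a network of the shape in the figure (two inputs, three hidden nodes, one output) might look like the sketch below. The sigmoid activation, all weight and bias values, and the function names are assumptions for illustration; the slides do not specify them, and training is not shown.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Propagate inputs through one hidden layer to a single output node."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Hypothetical parameters: one weight row per hidden node (nodes 3, 4, 5),
# each row holding the weights from input nodes 1 and 2.
hidden_weights = [[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]]
hidden_biases = [0.0, -0.1, 0.2]
output_weights = [1.2, -0.7, 0.5]  # from hidden nodes 3, 4, 5 to output node 6
output_bias = 0.05

output = forward([1.0, 2.0], hidden_weights, hidden_biases, output_weights, output_bias)
print(output)
```

For classification, the output can be thresholded (for example at 0.5); for regression, the final sigmoid would typically be replaced by a linear output.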
Neural Networks (cont.)

- can be easily implemented to run on massively parallel computers;
- cannot be easily interpreted;
- require an extensive amount of training time;
- require a lot of data preparation (very careful data cleansing, selection, preparation, and pre-processing);
- require a sufficiently large data set and a high signal-to-noise ratio.


Classification Example

Training Set (used to learn the classifier):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set (classified by the learned model):

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training Set --> Learn Classifier --> Model; the Model is then applied to the Test Set.
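
As one possible "Model" for this example, the sketch below hand-codes the tree from the Example Decision Tree slide and applies it to both tables. Treating that tree as the learned classifier is an assumption, as are the function name `predict_cheat` and the choice to send an income of exactly 80K down the "not above 80K" branch (the slide only labels the branches "< 80K" and "> 80K").

```python
def predict_cheat(refund, marital_status, income_k):
    """Apply the Refund / MarSt / TaxInc tree to one record."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income (in thousands).
    # Assumption: exactly 80K is treated as not above the threshold.
    return "Yes" if income_k > 80 else "No"


# (refund, marital_status, taxable_income_in_K, cheat), copied from the training table.
training_set = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# (refund, marital_status, taxable_income_in_K), copied from the test table.
test_set = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]

correct = sum(predict_cheat(r, m, i) == cheat for r, m, i, cheat in training_set)
training_accuracy = correct / len(training_set)
test_predictions = [predict_cheat(*row) for row in test_set]

print("training accuracy:", training_accuracy)
print("test predictions:", test_predictions)
```

With this tree, all ten training records are reproduced correctly, and every test record is predicted as "No".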
Classification Application
- Direct Marketing
- Fraud Detection
- Customer Attrition/Churn
- Sky Survey Cataloging


Data Mining Tasks: Clustering

- Goal is to identify categories
  - Natural grouping of customers by processing all the available data about them
- Other applications
  - Market segmentation, discovering affinity groups, and defect analysis
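
The slide does not name a specific clustering algorithm. Purely as an illustration of grouping customers by their attributes, the sketch below uses k-means, one widely used method. The customer features, the starting centroids, and the function name `kmeans` are hypothetical.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update (iterations >= 1)."""
    for _ in range(iterations):
        # Assign each point to the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Move each centroid to the mean of its assigned points.
        updated = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[k])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged
        centroids = updated
    return labels, centroids


# Hypothetical customers: (store visits per month, average basket size).
customers = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
initial_centroids = [customers[0], customers[3]]

labels, centroids = kmeans(customers, initial_centroids)
print(labels)
print(centroids)
```

Unlike classification, no class labels are supplied; the two groups emerge only from the distances between records.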



Kohonen Network

Description
- unsupervised
- seeks to describe the dataset in terms of natural clusters of cases


Data Mining Tasks: Association Rule Discovery

- Given a set of records, each of which contains some number of items from a given collection;
- Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
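
The strength of such rules is commonly measured by support (the fraction of transactions containing all items in the rule) and confidence (of the transactions containing the left-hand side, the fraction that also contain the right-hand side). The slide does not give these numbers; the sketch below computes them for the five transactions in the table, using hypothetical helper names.

```python
# The five transactions from the table, one set of items per TID.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(itemset):
    """Fraction of all transactions that contain itemset."""
    return count(itemset) / len(transactions)


def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return count(lhs | rhs) / count(lhs)


rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          "support:", support(lhs | rhs),
          "confidence:", round(confidence(lhs, rhs), 2))
```

For this table, {Milk} --> {Coke} has support 0.6 and confidence 0.75, while {Diaper, Milk} --> {Beer} has support 0.4 and confidence of about 0.67.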



Association Rule Discovery Application
- Marketing and Sales Promotion
- Supermarket Shelf Management
- Inventory Management


Deviation Detection & Pattern Discovery

Deviation Detection:
…discovering most significant changes in data from previously measured or normative values…
V. Kumar, M. Joshi, Tutorial on High Performance Data Mining.
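
One simple way to flag "significant changes from normative values" is to measure how many standard deviations a new value lies from a baseline. This z-score approach is an assumption chosen for illustration, not a method named by the quoted tutorial; the data, the threshold of 3, and the function name `find_deviations` are hypothetical.

```python
import statistics


def find_deviations(baseline, measurements, threshold=3.0):
    """Return (index, value) pairs lying more than `threshold` baseline
    standard deviations away from the baseline mean."""
    mean = statistics.mean(baseline)
    spread = statistics.stdev(baseline)  # sample standard deviation
    flagged = []
    for index, value in enumerate(measurements):
        z_score = (value - mean) / spread
        if abs(z_score) > threshold:
            flagged.append((index, value))
    return flagged


# Hypothetical previously measured (normative) values and new measurements.
baseline = [100, 102, 98, 101, 99]
measurements = [101, 97, 120, 100]

deviations = find_deviations(baseline, measurements)
print(deviations)
```

Here only the third measurement (120) is flagged; the value 97 is less than two standard deviations below the baseline mean of 100.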

Sequential Pattern Discovery:
…process of looking for patterns and rules that predict strong sequential dependencies among different events…
V. Kumar, M. Joshi, Tutorial on High Performance Data Mining.


Sequential Patterns

- Identify frequently occurring sequences from given records
  - Example: 40 percent of female customers buy a gray skirt six months after buying a red jacket
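
A statistic like the 40 percent figure could be obtained by counting, among customers who bought the first item, how many bought the second item within a given number of months afterwards. The sketch below does this on made-up purchase histories; the data, the month numbering, and the function name `sequence_confidence` are assumptions for illustration.

```python
def sequence_confidence(histories, first, then, max_gap_months):
    """Among customers who bought `first`, the share who bought `then`
    within `max_gap_months` months after their earliest `first` purchase."""
    bought_first = 0
    followed = 0
    for purchases in histories.values():
        first_months = [month for month, item in purchases if item == first]
        if not first_months:
            continue
        bought_first += 1
        start = min(first_months)
        if any(item == then and 0 < month - start <= max_gap_months
               for month, item in purchases):
            followed += 1
    return followed / bought_first if bought_first else 0.0


# Hypothetical histories: customer id -> list of (month number, item).
histories = {
    "c1": [(1, "red jacket"), (7, "gray skirt")],
    "c2": [(2, "red jacket"), (5, "gray skirt")],
    "c3": [(1, "red jacket"), (3, "blue scarf")],
    "c4": [(4, "red jacket")],
    "c5": [(6, "gray skirt"), (8, "red jacket")],  # skirt came first: not counted
    "c6": [(2, "blue scarf")],                     # never bought the jacket
}

share = sequence_confidence(histories, "red jacket", "gray skirt", max_gap_months=6)
print(share)
```

Order matters here: customer c5 bought both items but in the opposite order, so that history does not support the pattern.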



Data Mining Methodology: SAS

- Sample
  - Extract a portion of the dataset for data mining
- Explore
- Modify
  - Create, select, and transform variables with the intention of building a model
- Model
  - Specify a relationship of variables that reliably predicts a desired goal
- Assess
  - Evaluate the practical value of the findings and the model resulting from the data mining effort

Data Mining Methodology: CRISP-DM
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment

CRISP-DM Phases


Phases and Tasks

Each task is followed by its outputs.

Business Understanding
- Determine Business Objectives: Background; Business Objectives; Business Success Criteria
- Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
- Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
- Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Data Understanding
- Collect Initial Data: Initial Data Collection Report
- Describe Data: Data Description Report
- Explore Data: Data Exploration Report
- Verify Data Quality: Data Quality Report

Data Preparation
- Data Set: Data Set Description
- Select Data: Rationale for Inclusion / Exclusion
- Clean Data: Data Cleaning Report
- Construct Data: Derived Attributes; Generated Records
- Integrate Data: Merged Data
- Format Data: Reformatted Data

Modeling
- Select Modeling Technique: Modeling Technique; Modeling Assumptions
- Generate Test Design: Test Design
- Build Model: Parameter Settings; Models; Model Description
- Assess Model: Model Assessment; Revised Parameter Settings

Evaluation
- Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
- Review Process: Review of Process
- Determine Next Steps: List of Possible Actions; Decision

Deployment
- Plan Deployment: Deployment Plan
- Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
- Produce Final Report: Final Report; Final Presentation
- Review Project: Experience Documentation


Major Application Areas for Data Mining Solutions

- Fraud/Non-Compliance, Anomaly detection
  - Isolate the factors that lead to fraud, waste and abuse
  - Target auditing and investigative efforts more effectively
- Credit/Risk Scoring
- Intrusion detection
- Parts failure prediction
- Recruiting/Attracting customers
- Maximizing profitability (cross selling, identifying profitable customers)
- Service Delivery and Customer Retention
  - Build profiles of customers likely to use which services
- Web Mining
- Health Care


Controversial Issues
- Data mining (or simple analysis) on people may come with a profile that would raise controversial issues of
  - Discrimination
  - Privacy
  - Security
- Examples:
  - Should males between 18 and 35 from countries that produced terrorists be singled out for search before flight?
  - Can people be denied a mortgage based on age, sex, or race?
  - Women live longer. Should they pay less for life insurance?

Data Mining and Discrimination
- Can discrimination be based on features like sex, age, national origin?
  - In some areas (e.g. mortgages, employment), some features cannot be used for decision making
  - In other areas, these features are needed to assess the risk factors
    - E.g. people of African descent are more susceptible to sickle cell anemia

Data Mining and Privacy
- Can information collected for one purpose be used for mining data for another purpose?
  - In Europe, generally no, without explicit consent
  - In US, generally yes
- Companies routinely collect information about customers and use it for marketing, etc.
- People may be willing to give up some of their privacy in exchange for some benefits
- See Data Mining And Privacy Symposium, www.kdnuggets.com/gpspubs/ieee-expert-9504-priv.html

Data Mining and Privacy (cont.)
- Data Mining looks for patterns, not people!
- Technical solutions can limit privacy invasion
  - Replacing sensitive personal data with an anonymous ID
  - Giving randomized outputs
  - Multi-party computation: distributed data
  - …



The Hype Curve for Data Mining and Knowledge Discovery

[Figure: hype curve comparing Expectations and Performance over time (axis marks: 1990, 1998, 2000, 2002). Stages labelled: rising expectations, over-inflated expectations, disappointment, growing acceptance and mainstreaming]

Final Remarks

- Data mining can be used in any field that needs to find patterns or relationships in its data.