1 Data Mining Processes and Knowledge Discovery

CS359 Introduction to Data Mining
Course objectives
This course introduces the fundamental concepts of data mining and knowledge discovery from databases.
It focuses on the discussion and demonstration of common data mining methods and how data mining results become useful to businesses and organizations.
Grading scheme
Class Standing
10% Assignment
25% Cases
40% Quizzes
25% Long Quiz
Midterm = 70% Class Standing + 30% Midterm Exam
2nd Quarter = 70% 2nd Quarter Class Standing + 30% Final Exam
Final Grade = 40% Midterm + 40% 2nd Quarter + 20% Project
Policies
Attendance will be checked.
No make-up quizzes.
Make-up long exam only for excused absences.
Schedule the make-up within a week after the exam date.
Late submissions (assignments, cases and project) will not be accepted.
References
Han, J. & Kamber, M. (2006). Data Mining: Concepts and Techniques, 2nd Edition. Morgan Kaufmann Publishers, Elsevier Inc., California.
Tan, P., Steinbach, M. & Kumar, V. (2006). Introduction to Data Mining. Addison Wesley.
Software Links
Data Mining Software Links by Dr. Pang-Ning Tan :
www.cse.msu.edu/~cse980/software.html
RapidMiner : http://rapidi.com/content/view/26/84/lang,en/
Weka : http://www.cs.waikato.ac.nz/ml/weka/
Data Mining Processes and Knowledge Discovery
Objectives
Define data mining and knowledge discovery in databases.
Discuss some business applications of data mining.
Identify the elements of the data mining process.
Discuss the steps in CRISP-DM.
What is Data Mining?

Also known as Knowledge Discovery in Databases: the nontrivial extraction of implicit, previously unknown and potentially useful information from databases (Han et al, 1999).
Involves the use of analysis to detect patterns and allow predictions (Olson & Shi, 2007).
Data Mining
A form of exploratory data analysis
Has its roots in classical statistics, artificial intelligence and machine learning
Looks for actionable information: information that can be used in a concrete way to improve profitability
General Types of Data Mining

Hypothesis Testing
A theory about the relationship between actions and outcomes is expressed and tested
Knowledge Discovery
A preconceived notion may not be present
Relationships can be identified by looking into the data
Data mining requires the identification of a problem
Data Mining Applications

Retailing
Affinity Positioning: based upon the identification of products that the same customer is likely to want
Cross-selling: knowledge of products that go together can be used to market the complementary product

Banking
Customer Relationship Management: identify customer value, develop programs to maximize revenue
Credit Card Management
Identify Balance Surfers: credit card holders who pay old balances with a new card
Lift: identify effective market segments
Churn: identify likely customer turnover
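
As a rough illustration of the Lift item above: lift is commonly computed as the response rate inside a segment divided by the overall response rate, so a value above 1 marks a segment that responds better than average. The sketch below is only an assumption-laden example; the segment names, the counts and the helper `segment_lift` are invented and do not come from the course material.

```python
# Hypothetical campaign results: segment -> (customers contacted, responders).
# All names and numbers are invented for illustration.
campaign = {
    "young_professionals": (1000, 80),
    "retirees": (1000, 20),
    "students": (2000, 60),
}

def segment_lift(results):
    """Return {segment: lift}, where lift = segment response rate / overall rate."""
    total_contacted = sum(contacted for contacted, _ in results.values())
    total_responders = sum(responders for _, responders in results.values())
    overall_rate = total_responders / total_contacted
    return {
        segment: (responders / contacted) / overall_rate
        for segment, (contacted, responders) in results.items()
    }

lifts = segment_lift(campaign)
# Print segments from most to least effective.
for segment, lift in sorted(lifts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{segment}: lift = {lift:.2f}")
```

With these made-up numbers the overall response rate is 4%, so a segment responding at 8% has a lift of about 2 and would be a natural first target.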

Insurance
Fraud detection: identify fraudulent claims meriting investigation
Telecommunications
Churn: customer turnover or switching carriers
Medicine
Cancer Cell Detection
Machine Vision
Pattern Recognition
CRISP-DM Process
Cross-Industry Standard Process for Data Mining
Phases
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Business Understanding
Knowing what the study is for
Identify business task
Data Understanding
Select the related data from the many available databases to correctly describe a given business task
Identify relevant data for the problem description
Selected variables for the relevant data should be independent of each other, that is, they should not contain overlapping information
Types of data: geographic, socio-graphic, transactional, or quantitative and qualitative
Data Preparation
Also known as data preprocessing
Clean selected data for better quality
Filter, aggregate and fill in missing values (imputation)
Filter: remove outliers and redundancies
Aggregate: data is reduced to obtain aggregated information
Filling-in or Smoothing: missing values are found and replaced with reasonable values
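
The three preparation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed procedure: the records, the plausibility threshold `MAX_PLAUSIBLE_SPEND` and the choice of mean imputation are all hypothetical.

```python
from statistics import mean

# Hypothetical raw records: (customer_id, region, monthly_spend).
# None marks a missing value; the data is invented for illustration.
raw = [
    (1, "north", 120.0),
    (2, "north", None),
    (3, "south", 95.0),
    (3, "south", 95.0),      # redundant duplicate record
    (4, "south", 105.0),
    (5, "north", 9000.0),    # implausibly large value, treated as an outlier
]

MAX_PLAUSIBLE_SPEND = 1000.0  # assumed domain threshold, not from the slides

def filter_records(records):
    """Filter: drop exact duplicates and values above the plausibility threshold."""
    seen, kept = set(), []
    for rec in records:
        if rec in seen:
            continue
        seen.add(rec)
        spend = rec[2]
        if spend is not None and spend > MAX_PLAUSIBLE_SPEND:
            continue
        kept.append(rec)
    return kept

def impute_mean(records):
    """Fill in: replace missing spend with the mean of the observed values."""
    fill = mean(r[2] for r in records if r[2] is not None)
    return [(cid, region, fill if spend is None else spend)
            for cid, region, spend in records]

def aggregate_by_region(records):
    """Aggregate: reduce customer rows to one average spend per region."""
    by_region = {}
    for _, region, spend in records:
        by_region.setdefault(region, []).append(spend)
    return {region: mean(values) for region, values in by_region.items()}

cleaned = impute_mean(filter_records(raw))
summary = aggregate_by_region(cleaned)
print(cleaned)
print(summary)
```

Filtering runs before imputation here so that the outlier does not distort the mean used to fill the missing value; other orderings and other fill rules (median, nearest neighbour) are equally possible.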
Data Preparation
Data transformation
Uses mathematical formulations to convert different measurements into a unified numerical scale
Numerical to numerical scales
Shrink or enlarge the data
Categorical to numerical scales
Categorical values can be ordinal (less, moderate, strong) or nominal (red, yellow, blue)
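
One possible way to carry out these transformations is sketched below. The specific techniques are assumptions chosen for illustration: min-max scaling for numerical values, rank encoding for ordinal categories, and one-hot indicators for nominal categories. The slides do not name a particular method, and the function names are hypothetical.

```python
def min_max_scale(values):
    """Numerical to numerical: rescale values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]

def encode_ordinal(values, order):
    """Ordinal categories: map each level to its rank in the given order."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]

def encode_one_hot(values, categories):
    """Nominal categories: one 0/1 indicator per category, no implied order."""
    return [[1 if v == c else 0 for c in categories] for v in values]

# Example inputs, invented for illustration.
incomes = [20000, 35000, 50000]
scaled = min_max_scale(incomes)
strength = encode_ordinal(["moderate", "less", "strong"],
                          order=["less", "moderate", "strong"])
colors = encode_one_hot(["red", "blue"], categories=["red", "yellow", "blue"])
print(scaled, strength, colors)
```

Ordinal values keep their order when mapped to ranks, whereas nominal values such as colors have no order, which is why they get separate indicator columns rather than a single number.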
Modeling
Data mining software is used to generate results for various situations
Data is divided into:
Training set: used for the development of the model
Test set: used to test the model that is built
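
A minimal sketch of dividing data into a training set and a test set follows. The 70/30 proportion, the fixed seed and the function name `train_test_split` are assumptions made for this example; the slides do not specify a split ratio.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and split it into (training, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(10))  # stand-in for ten data rows
train, test = train_test_split(records)
print(len(train), len(test))
```

Shuffling before splitting guards against any ordering in the source data (for example, records sorted by date) leaking into one of the two sets.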
Modeling
Data Modeling Techniques
Association: the relationship of a particular item in a data transaction to other items in the same transaction is used to predict patterns
Classification: learning functions that map each item of the selected data into one of a predefined set of classes
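
To make the association idea concrete, the sketch below computes two measures commonly used for association rules: support (how often items occur together) and confidence (how often the consequent appears when the antecedent does). The transactions are invented, and the slides do not name these measures, so treat this as one assumed way of quantifying the relationship.

```python
# Hypothetical market-basket transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset, baskets):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(set(antecedent) | set(consequent), baskets)
    return both / support(antecedent, baskets)

bread_milk_support = support({"bread", "milk"}, transactions)
bread_to_milk = confidence({"bread"}, {"milk"}, transactions)
print(bread_milk_support, bread_to_milk)
```

Here bread and milk appear together in three of five baskets, and milk appears in three of the four baskets that contain bread, so the rule "bread implies milk" has a confidence of about 0.75.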
Modeling
Clustering: takes ungrouped data and uses automatic techniques to put this data into groups
Prediction Analysis: discovers the relationship between the dependent and independent variables
Sequential Pattern Analysis: seeks to find similar patterns in data transactions over a business period
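
For the prediction analysis item above, one simple and widely used technique is a least-squares line relating one independent variable to one dependent variable. The sketch below assumes that technique; the data and the function name `fit_line` are hypothetical, and real prediction tasks often involve more variables and other models.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: advertising spend (independent) vs. sales (dependent).
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]   # constructed so that sales = 1 + 2 * ad_spend

intercept, slope = fit_line(ad_spend, sales)
predicted = intercept + slope * 5.0   # predict sales for an unseen spend of 5
print(intercept, slope, predicted)
```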
Evaluation
Data interpretation stage
Two things to consider:
How to recognize business value from the knowledge patterns discovered
How to visualize the results to properly interpret patterns
Deployment
The results are reported to project sponsors
The results are applied to the business task or data mining objective
Knowledge Discovery Process
Data Cleaning
Data Integration
Data Selection
Data Transformation
Data Mining
Pattern Evaluation
Knowledge Presentation
Data Mining System Architecture

User Interface
Pattern Evaluation
Knowledge Base
Data Mining Engine
Database or Data Warehouse Server
Data Cleaning, Integration and Selection
Data sources: Database, Data Warehouse, WWW, Other Repositories
Data Mining on what data?
Relational Databases
Data Warehouses
Transactional Databases
Object-Relational Databases
Temporal, Sequence or Time-Series Database
Spatial Databases and Spatiotemporal Databases
Data Mining - what patterns?

Descriptive: characterize the general properties of data
Data characterization, Data discrimination, Association, Clustering
Predictive: perform inference on the current data in order to make predictions
Classification and Prediction, Evolution analysis
Are all patterns interesting?

No.
A pattern is interesting if it is
(1) easily understood by humans,
(2) valid on new or test data with some degree of certainty,
(3) potentially useful, and
(4) novel.
A pattern is also interesting if it validates a hypothesis that the user sought to confirm.
Can a data mining system generate all interesting patterns?
This refers to the COMPLETENESS of a data mining algorithm.
It is unrealistic and inefficient for data mining systems to generate all of the possible patterns.
A focused search that makes use of interestingness measures should be used to control pattern generation.
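
As one hedged illustration of a focused search, the sketch below uses minimum support as the interestingness measure and only counts item pairs whose individual items are already frequent. That pruning idea is borrowed from Apriori-style algorithms and is an assumption here, as are the transactions and the 0.4 threshold; the slides do not prescribe a specific measure or algorithm.

```python
from itertools import combinations

# Hypothetical market-basket transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def frequent_pairs(baskets, min_support):
    """Return {pair: support} for item pairs meeting min_support.

    Only pairs built from individually frequent items are counted, because a
    pair cannot be more frequent than either of its items.
    """
    n = len(baskets)
    item_counts = {}
    for basket in baskets:
        for item in basket:
            item_counts[item] = item_counts.get(item, 0) + 1
    frequent_items = sorted(
        item for item, count in item_counts.items() if count / n >= min_support
    )

    result = {}
    for pair in combinations(frequent_items, 2):
        count = sum(set(pair) <= basket for basket in baskets)
        if count / n >= min_support:
            result[pair] = count / n
    return result

pairs = frequent_pairs(transactions, min_support=0.4)
print(pairs)
```

Because "eggs" falls below the threshold on its own, no pair containing it is ever generated, which is the sense in which the measure controls pattern generation rather than merely filtering the output afterwards.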
CASE study: Telephone company
1. What is the business task or data mining objective?
2. What are the relevant data and their sources?
3. How was the data prepared? What were the processes?
4. What data mining technique was used?
5. How was the model used to address the business task?
