
CS359 Introduction to Data Mining

Course objectives

This course introduces the fundamental concepts of data mining and knowledge discovery from databases.
It focuses on the discussion and demonstration of common data mining methods and how data mining results become useful to businesses and organizations.

Grading scheme
Class Standing

10% Assignment
25% Cases
40% Quizzes
25% Long Quiz

Midterm = 70% Class Standing + 30% Midterm Exam
2nd Quarter = 70% 2nd Quarter Class Standing + 30% Final Exam
Final Grade = 40% Midterm + 40% Final Grade + 20% Project
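As a worked example of the weights above, the sketch below computes Class Standing and a Midterm grade. The function names and the sample scores are illustrative assumptions and are not part of the course materials.

```python
# Weights taken from the Class Standing breakdown in the grading scheme.
CLASS_STANDING_WEIGHTS = {
    "assignment": 0.10,
    "cases": 0.25,
    "quizzes": 0.40,
    "long_quiz": 0.25,
}


def class_standing(scores):
    """Weighted average of component scores, each on a 0-100 scale."""
    return sum(CLASS_STANDING_WEIGHTS[name] * scores[name]
               for name in CLASS_STANDING_WEIGHTS)


def quarter_grade(standing, exam):
    """70% class standing + 30% exam, as stated for each quarter."""
    return 0.70 * standing + 0.30 * exam


# Hypothetical scores for one student.
sample_scores = {"assignment": 90, "cases": 80, "quizzes": 85, "long_quiz": 70}
standing = class_standing(sample_scores)
midterm = quarter_grade(standing, exam=75)
print(round(standing, 2), round(midterm, 2))
```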

Policies
Attendance will be checked.
No make-up quizzes
Make-up long exam only for excused absence.
Set schedule within a week after the exam date

Late submissions will not be accepted (assignments, cases and project)

References
Han, J. & Kamber, M. (2006). Data Mining: Concepts and Techniques, 2nd Edition. Morgan Kaufmann Publishers, Elsevier Inc., California.
Tan, P., Steinbach, M. & Kumar, V. (2006). Introduction to Data Mining. Addison Wesley.

Software Links
Data Mining Software Links by Dr. Pang-Ning Tan :
www.cse.msu.edu/~cse980/software.html
RapidMiner : http://rapidi.com/content/view/26/84/lang,en/
Weka : http://www.cs.waikato.ac.nz/ml/weka/

Data Mining Processes and Knowledge Discovery

Objectives
Define Data Mining and knowledge discovery in
databases.
Discuss some business applications of data mining
Identify the elements of the data mining process
Discuss the steps in CRISP-DM

What is Data Mining?


Also known as Knowledge Discovery in Databases: the nontrivial extraction of implicit, previously unknown and potentially useful information from databases (Han et al., 1999)
Involves the use of analysis to detect patterns and allow predictions (Olson & Shi, 2007)

Data Mining
Exploratory data analysis
Finds its roots along with the development in classical
statistics, artificial intelligence and machine learning
Looks for actionable information, or information that
can be utilized in a concrete way to improve
profitability

General Types of Data Mining


Hypothesis Testing
A theory about the relationship between actions and
outcomes is expressed and tested

Knowledge Discovery
A preconceived notion may not be present
Relationships can be identified by looking into the data

Data Mining requires the identification of a problem

Data Mining Applications


Retailing
Affinity Positioning: based upon the identification of products that the same customer is likely to want
Cross-selling: knowledge of products that go together can be used to market the complementary product
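One simple way to find "products that go together" is to count how often each pair of products appears in the same transaction. The sketch below does this with the standard library only; the baskets and the pair_counts helper are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets; each set is one customer transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]


def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of products."""
    counts = Counter()
    for items in transactions:
        # Sorting gives every pair one canonical (alphabetical) form.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

Pairs with high counts are candidates for cross-selling or for placing the products near each other.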

Data Mining Applications


Banking
Customer Relationship Management: identify customer value, develop programs to maximize revenue

Credit Card Management
Identify Balance Surfers, or credit card holders who pay old balances with a new card
Lift: identify effective market segments
Churn: identify likely customer turnover
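Lift is commonly computed as the response rate inside a segment divided by the overall response rate, so a value above 1 marks a segment that responds better than average. The sketch below assumes that definition; the segment names and counts are hypothetical.

```python
# Hypothetical campaign results: (segment, customers contacted, responders).
segments = [
    ("young professionals", 1000, 80),
    ("retirees", 1000, 20),
    ("students", 2000, 60),
]

total_contacted = sum(contacted for _, contacted, _ in segments)
total_responders = sum(responders for _, _, responders in segments)
baseline_rate = total_responders / total_contacted  # overall response rate


def lift(contacted, responders):
    """Segment response rate relative to the overall response rate."""
    return (responders / contacted) / baseline_rate


lifts = {name: lift(contacted, responders)
         for name, contacted, responders in segments}

# Segments with lift above 1 respond better than the average customer.
effective = [name for name, value in lifts.items() if value > 1]
print(effective)
```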

Data Mining Applications


Insurance
Fraud detection: identify fraudulent claims meriting investigation

Telecommunications
Churn: customer turnover or switching carriers

Medicine
Cancer Cell Detection

Machine Vision
Pattern Recognition

CRISP-DM Process
Cross-Industry Standard Process for Data Mining
Phases
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment

Business Understanding
Knowing what the study is for
Identify business task

Data Understanding
Select the related data from many available databases to correctly describe a given business task
Identify relevant data for the problem description
Selected variables for the relevant data should be independent of each other, that is, they should not contain overlapping information
Types of data: geographic, socio-graphic and transactional, or quantitative and qualitative
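One rough check for overlapping information between two numeric variables is their Pearson correlation: a value near +1 or -1 suggests one variable largely repeats the other. This only detects linear overlap, and the variables and values below are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical customer variables.
monthly_minutes = [100, 200, 300, 400, 500]
monthly_bill = [12, 22, 32, 42, 52]  # bill = 0.1 * minutes + 2, so redundant
age = [45, 23, 36, 52, 30]

r_bill = pearson(monthly_minutes, monthly_bill)  # close to 1: overlapping
r_age = pearson(monthly_minutes, age)            # close to 0: little overlap
print(round(r_bill, 3), round(r_age, 3))
```

Here it would be reasonable to keep only one of monthly_minutes and monthly_bill.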

Data Preparation
Also known as data preprocessing
Clean selected data for better quality
Filter, aggregate and fill in missing values (imputation)
Filter: remove outliers and redundancies
Aggregate: data is reduced to obtain aggregated information
Filling-in or Smoothing: missing values are found and replaced with reasonable values
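The three preparation steps above can be sketched on a toy data set. The records, the plausibility limit and the choice of mean imputation are assumptions for illustration; real projects pick these per problem.

```python
from statistics import mean

# Hypothetical daily call records: (customer, minutes); None marks a missing value.
records = [
    ("A", 30), ("A", None), ("A", 34),
    ("B", 12), ("B", 10), ("B", 5000),  # 5000 is treated as a recording error
]

MAX_PLAUSIBLE_MINUTES = 1440  # assumption: a day has only 1440 minutes

# Filter: drop values outside the plausible range (missing values pass through).
filtered = [(customer, minutes) for customer, minutes in records
            if minutes is None or 0 <= minutes <= MAX_PLAUSIBLE_MINUTES]


def customer_mean(customer):
    """Mean of the known (non-missing) values for one customer."""
    return mean(m for c, m in filtered if c == customer and m is not None)


# Fill in: replace each missing value with that customer's mean.
imputed = [(c, m if m is not None else customer_mean(c)) for c, m in filtered]

# Aggregate: reduce the daily rows to one total per customer.
totals = {}
for customer, minutes in imputed:
    totals[customer] = totals.get(customer, 0) + minutes

print(totals)
```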

Data Preparation
Data transformation
Uses mathematical formulations to convert different measurements into a unified numerical scale
Numerical to numerical scales
Shrink or enlarge the data
Categorical to numerical scales
Categorical values can be ordinal (less, moderate, strong) or nominal (red, yellow, blue)
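A minimal sketch of these transformations, assuming min-max scaling for numerical values, integer ranks for ordinal values and one indicator column per category for nominal values. These are common choices, not the only ones, and the helper names are hypothetical.

```python
def min_max_scale(values):
    """Shrink numeric values onto the range 0..1 (assumes they are not all equal)."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


# Ordinal categories have a natural order, so they can map to ranks.
ORDINAL_LEVELS = {"less": 0, "moderate": 1, "strong": 2}


def encode_ordinal(values):
    return [ORDINAL_LEVELS[v] for v in values]


def encode_nominal(values):
    """Nominal categories have no order: use one 0/1 column per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


scaled = min_max_scale([20000, 50000, 80000])
ranks = encode_ordinal(["strong", "less", "moderate"])
indicators = encode_nominal(["red", "yellow", "blue", "red"])
print(scaled, ranks, indicators)
```

Mapping nominal values such as red, yellow and blue to 0, 1 and 2 would impose an order that does not exist, which is why they get separate indicator columns here.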

Modeling
Data mining software is used to generate results for
various situations
Data is divided into:
Training set: used for the development of the model
Test set: used to test the model that is built
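The division into training and test sets can be sketched as a random split. The split_rows helper, the 30% test fraction and the fixed seed are assumptions for illustration.

```python
import random


def split_rows(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and cut it into (training set, test set)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for ten data records
training_set, test_set = split_rows(rows)
print(len(training_set), len(test_set))
```

Keeping the test rows out of model development is what makes the test result an honest estimate of performance on new data.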

Modeling
Data Modeling Techniques
Association: the relationship of a particular item in a data transaction to other items in the same transaction is used to predict patterns
Classification: learning different functions that map each item of the selected data into one of a predefined set of classes
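To make classification concrete, the sketch below maps a new record to one of two predefined classes with a nearest-neighbor rule, one of the simplest classification methods. The features (monthly minutes, number of complaints), labels and data are hypothetical.

```python
def nearest_neighbor_classify(training, point):
    """Return the class label of the training record closest to point."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(training, key=lambda row: squared_distance(row[0], point))
    return label


# Hypothetical labeled records: ((monthly minutes, complaints), class).
training = [
    ((100, 1), "stay"),
    ((120, 0), "stay"),
    ((600, 5), "churn"),
    ((550, 4), "churn"),
]

prediction_a = nearest_neighbor_classify(training, (580, 3))
prediction_b = nearest_neighbor_classify(training, (110, 1))
print(prediction_a, prediction_b)
```

In practice the features would be put on a common scale first, since here the minutes dominate the distance.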

Modeling
Clustering: takes ungrouped data and uses automatic techniques to put this data into groups
Prediction Analysis: discovers the relationship between the dependent and independent variables
Sequential Pattern Analysis: seeks to find similar patterns in data transactions over a business period
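As an illustration of clustering, the sketch below runs a bare-bones k-means on one-dimensional data. The starting centroids and the spending values are assumptions; k-means is only one of many clustering techniques.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group numbers around centroids by repeated assign-and-update steps."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical monthly spending of six customers, with no group labels.
spending = [10, 12, 11, 95, 100, 105]
centroids, clusters = k_means_1d(spending, centroids=[0, 50])
print(centroids, clusters)
```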

Evaluation
Data interpretation stage
Two things to consider:
How to recognize business value from knowledge
patterns discovered
How to visualize the results to properly interpret
patterns

Deployment
The results are reported to project sponsors
The result is applied to the business task or data mining objective

Knowledge Discovery Process

Data Cleaning
Data Integration
Data Selection
Data Transformation
Data Mining
Pattern Evaluation
Knowledge Presentation
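The order of the steps above can be pictured as a chain of small functions, each taking the output of the previous one. Everything in this sketch (the two sources, the field names, the per-city total used as the "pattern", the threshold) is a toy assumption chosen to keep the example short.

```python
# Two hypothetical sources containing dirty and duplicated records.
source_a = [{"id": 1, "city": "Manila ", "spend": "120"},
            {"id": 2, "city": "cebu", "spend": None}]
source_b = [{"id": 3, "city": "Manila", "spend": "80"},
            {"id": 1, "city": "Manila ", "spend": "120"}]


def clean(rows):
    """Data cleaning: drop rows with missing spend and tidy the city text."""
    return [{**r, "city": r["city"].strip().title()}
            for r in rows if r["spend"] is not None]


def integrate(*sources):
    """Data integration: merge the sources, keeping one row per id."""
    merged = {}
    for rows in sources:
        for r in rows:
            merged[r["id"]] = r
    return list(merged.values())


def select(rows):
    """Data selection: keep only the attributes relevant to the task."""
    return [{"city": r["city"], "spend": r["spend"]} for r in rows]


def transform(rows):
    """Data transformation: convert spend from text to a number."""
    return [{**r, "spend": float(r["spend"])} for r in rows]


def mine(rows):
    """Data mining (toy pattern): total spend per city."""
    totals = {}
    for r in rows:
        totals[r["city"]] = totals.get(r["city"], 0) + r["spend"]
    return totals


def evaluate(patterns, min_total=100):
    """Pattern evaluation: keep only totals above an assumed threshold."""
    return {city: total for city, total in patterns.items() if total >= min_total}


def present(patterns):
    """Knowledge presentation: format the result for a reader."""
    return [f"{city}: {total:.0f}" for city, total in sorted(patterns.items())]


integrated = integrate(clean(source_a), clean(source_b))
report = present(evaluate(mine(transform(select(integrated)))))
print(report)
```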

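Data Mining System Architecture

[Figure: layered architecture, from top to bottom]
User Interface
Pattern Evaluation
Data Mining Engine
Database or Data Warehouse Server
Data Cleaning, Integration and Selection
Data sources: Database, Data Warehouse, WWW, Other Repositories
Knowledge Base (shown alongside Pattern Evaluation and the Data Mining Engine)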

Data Mining on what data?

Relational Databases
Data Warehouses
Transactional Databases
Object-Relational Databases
Temporal, Sequence or Time-Series Database
Spatial Databases and Spatiotemporal Databases

Data Mining - what patterns?


Descriptive: characterize the general properties of data
Data characterization, Data discrimination, Association, Clustering

Predictive: perform inference on the current data in order to make predictions
Classification and Prediction, Evolution analysis

Are all patterns interesting?


NO
A pattern is interesting if it is
(1) easily understood by humans,
(2) valid on new or test data with some degree of certainty,
(3) potentially useful, and
(4) novel.

A pattern is also interesting if it validates a hypothesis that the user sought to confirm.

Can a data mining system generate all interesting patterns?

This refers to the COMPLETENESS of a data mining algorithm
It is unrealistic and inefficient for data mining systems to generate all of the possible patterns.
A focused search which makes use of interestingness measures should be used to control pattern generation.
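Support and confidence are two widely used interestingness measures for association rules. The sketch below keeps only candidate rules that pass minimum thresholds on both; the transactions, candidate rules and threshold values are assumptions for illustration.

```python
# Hypothetical transactions.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions with the antecedent, the fraction that also
    contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)


# Candidate rules written as (antecedent, consequent).
candidate_rules = [
    ({"bread"}, {"butter"}),
    ({"butter"}, {"bread"}),
    ({"milk"}, {"butter"}),
]

MIN_SUPPORT = 0.5     # assumed threshold
MIN_CONFIDENCE = 0.8  # assumed threshold

interesting = [
    (a, c) for a, c in candidate_rules
    if support(a | c) >= MIN_SUPPORT and confidence(a, c) >= MIN_CONFIDENCE
]
print(interesting)
```

Raising or lowering the thresholds is how the search is focused: stricter values produce fewer, stronger rules.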

CASE study: Telephone company

1. What is the business task or data mining objective?
2. What are the relevant data and their sources?
3. How was the data prepared? What were the processes?
4. What was the data mining technique used?
5. How was the model used to address the business task?