
Introduction to Data Mining

Natasha Balac, Ph.D.


1 © Copyright 2006, Natasha Balac
Outline
 Motivation: Why Data Mining?
 What is Data Mining?
 History of Data Mining
 Data Mining Functionality and Terminology
 Data Mining Applications
 Are all the Patterns Interesting?
 Issues in Data Mining
Necessity is the Mother of Invention

 Data explosion
 Automated data collection tools and mature database
technology lead to tremendous amounts of data stored in
databases, data warehouses and other information
repositories

 We are drowning in data, but starving for knowledge!
 Solution
 Data Mining
 Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases
Why DATA MINING?

 Huge amounts of data


 Electronic records of our decisions
 Choices in the supermarket
 Financial records
 Our comings and goings
 We swipe our way through the world – every swipe is
a record in a database
 Data rich – but information poor
 Lying hidden in all this data is information!



Data vs. Information

 Society produces massive amounts of data


 business, science, medicine, economics, sports, …
 Potentially valuable resource
 Raw data is useless
 need techniques to automatically extract information
 Data: recorded facts
 Information: patterns underlying the data



What is DATA MINING?

 Extracting or “mining” knowledge from large amounts of data

 Data-driven discovery and modeling of hidden patterns (that we never knew existed) in large volumes of data

 Extraction of implicit, previously unknown and unexpected, potentially extremely useful information from data
What Is Data Mining?

 Data mining:
 Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information or patterns from data in large databases



Data Mining is NOT

 Data Warehousing
 (Deductive) query processing
 SQL/ Reporting
 Software Agents
 Expert Systems
 Online Analytical Processing (OLAP)
 Statistical Analysis Tool
 Data visualization
Data Mining

 Programs that detect patterns and rules in the data

 Strong patterns can be used to make non-trivial predictions on new data



Data Mining Challenges

 Problem 1: most patterns are not interesting

 Problem 2: patterns may be inexact or completely spurious when the data are noisy



Machine Learning Techniques

 Technical basis for data mining: algorithms for acquiring structural descriptions from examples

 Methods originate from artificial intelligence, statistics, and research on databases



Machine Learning Techniques

 Structural descriptions represent patterns explicitly and can be used to
 predict the outcome in a new situation
 understand and explain how a prediction is derived (maybe even more important)



Multidisciplinary Field
[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Artificial Intelligence, and Other Disciplines]



Multidisciplinary Field

 Database technology
 Artificial Intelligence
 Machine Learning including Neural Networks
 Statistics
 Pattern recognition
 Knowledge-based systems/acquisition
 High-performance computing
 Data visualization



History of Data Mining



History

 Emerged in the late 1980s
 Flourished in the 1990s
 Roots traced back along three family lines
 Classical Statistics
 Artificial Intelligence
 Machine Learning



Statistics

 Foundation of most DM technologies


 Regression analysis, standard
distribution/deviation/variance, cluster
analysis, confidence intervals
 Building blocks
 Significant role in today’s data mining –
but alone is not powerful enough



Artificial Intelligence

 Heuristics vs. Statistics


 Human-thought-like processing
 Requires vast computer processing power
 Supercomputers



Machine Learning

 Union of statistics and AI
 Blends AI heuristics with advanced statistical analysis
 Machine learning lets computer programs
 learn about the data they study and make different decisions based on the quality of that data
 use statistics for fundamental concepts and add more advanced AI heuristics and algorithms



Data Mining

 Adaptation of machine learning techniques to real-world problems
 Union: Statistics, AI, Machine learning
 Used to find previously hidden trends or patterns
 Finding increasing acceptance in science and business areas that need to analyze large amounts of data to discover trends that could not be found otherwise



Terminology

 Gold Mining
 Knowledge mining from databases
 Knowledge extraction
 Data/pattern analysis
 Knowledge Discovery in Databases (KDD)
 Information harvesting
 Business intelligence



KDD Process

[Figure: KDD process flow: Database → Selection/Transformation → Data Preparation → Training Data → Data Mining → Model, Patterns → Evaluation, Verification]



LEARNING ALGORITHMS

 Fundamental idea: learn rules/patterns/relationships automatically from the data



Data Mining Tasks

 Exploratory Data Analysis


 Predictive Modeling: Classification and Regression
 Descriptive Modeling
 Cluster analysis/segmentation

 Discovering Patterns and Rules


 Association/Dependency rules
 Sequential patterns
 Temporal sequences
 Deviation detection
Data Mining Tasks

 Concept/Class description: Characterization and discrimination
 Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
 Association (correlation and causality)
 Multi-dimensional or single-dimensional association
age(X, “20-29”) ^ income(X, “60-90K”) → buys(X, “TV”)
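
An association rule such as the one above is usually judged by its support (how often the whole rule holds across all records) and its confidence (how often the consequent holds when the antecedent does). The following is a minimal sketch of that calculation. The customer records and the helper name rule_support_confidence are invented for illustration and do not come from the slides.

```python
# Hypothetical customer records; all values are invented for illustration.
customers = [
    {"age": "20-29", "income": "60-90K", "buys_tv": True},
    {"age": "20-29", "income": "60-90K", "buys_tv": True},
    {"age": "20-29", "income": "60-90K", "buys_tv": False},
    {"age": "30-39", "income": "60-90K", "buys_tv": True},
    {"age": "20-29", "income": "30-60K", "buys_tv": False},
]


def rule_support_confidence(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    matches_if = [r for r in records if antecedent(r)]
    matches_both = [r for r in matches_if if consequent(r)]
    # Support: share of all records where both sides of the rule hold.
    support = len(matches_both) / len(records)
    # Confidence: share of antecedent matches where the consequent also holds.
    confidence = len(matches_both) / len(matches_if) if matches_if else 0.0
    return support, confidence


support, confidence = rule_support_confidence(
    customers,
    antecedent=lambda r: r["age"] == "20-29" and r["income"] == "60-90K",
    consequent=lambda r: r["buys_tv"],
)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data three customers match the antecedent and two of them buy a TV, so the rule covers 2 of 5 records and holds in 2 of 3 matching cases.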



Data Mining Tasks

 Classification and Prediction
 Finding models (functions) that describe and distinguish classes or concepts for future prediction
 Example: classify countries based on climate, or classify cars based on gas mileage
 Presentation:
 IF-THEN rules, decision tree, classification rule, neural network
 Prediction: predict some unknown or missing numerical values


Data Mining Tasks

 Cluster analysis
 Class label is unknown: Group data to form
new classes,
 Example: cluster houses to find distribution
patterns
 Clustering based on the principle: maximizing
the intra-class similarity and minimizing the
interclass similarity
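
The principle above can be checked numerically for any proposed grouping. The sketch below compares the mean distance between points in the same cluster with the mean distance between points in different clusters. The house coordinates and function names are assumptions made for illustration, and distance stands in for (inverse) similarity.

```python
from itertools import combinations
from math import dist

# Hypothetical house locations (x, y), already grouped into two clusters.
clusters = {
    "A": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)],
    "B": [(8.0, 8.0), (8.5, 9.0), (9.0, 8.5)],
}


def mean_intra_distance(groups):
    """Average distance between pairs of points that share a cluster."""
    pairs = [dist(p, q) for pts in groups.values() for p, q in combinations(pts, 2)]
    return sum(pairs) / len(pairs)


def mean_inter_distance(groups):
    """Average distance between pairs of points from different clusters."""
    pairs = [
        dist(p, q)
        for a, b in combinations(groups.values(), 2)
        for p in a
        for q in b
    ]
    return sum(pairs) / len(pairs)


print("intra:", round(mean_intra_distance(clusters), 2))
print("inter:", round(mean_inter_distance(clusters), 2))
```

A good clustering shows a small intra-cluster distance relative to the inter-cluster distance. A clustering algorithm searches for the grouping that makes this gap large.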



Data Mining Tasks

 Outlier analysis
 Outlier: a data object that does not comply with the
general behavior of the data
 Mostly considered as noise or exception, but is
quite useful in fraud detection, rare events analysis
 Trend and evolution analysis
 Trend and deviation: regression analysis
 Sequential pattern mining, periodicity analysis
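
As one simple illustration of outlier analysis, the sketch below flags values that lie far from the mean in standard-deviation units (a z-score test). This is only one of many possible techniques and is not taken from the slides. The transaction amounts, the function name zscore_outliers, and the threshold of 2.0 are assumptions chosen for the example.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical card transaction amounts; one value is clearly unusual.
amounts = [52, 48, 50, 47, 53, 49, 51, 400]
print(zscore_outliers(amounts))
```

With so few records a single extreme value inflates the standard deviation, which is why the threshold here is lower than the value of 3 often used on larger samples.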
Data Mining: Classification Schemes

 General functionality
 Descriptive data mining vs. predictive data mining
 Different views - different classifications
 Kinds of databases to be mined
 Kinds of knowledge to be discovered
 Kinds of techniques employed
 Kinds of applications
A Multi-Dimensional View of Data
Mining Classification
 Databases to be mined
 Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, WWW, etc.
 Knowledge to be mined
 Characterization, discrimination, association,
classification, clustering, trend, deviation and outlier
analysis, etc.
 Multiple/integrated functions
 Mining at multiple levels of abstractions
A Multi-Dimensional View of Data
Mining Classification

 Techniques utilized
 Decision/Regression trees, clustering, neural
networks, etc.
 Applications adapted
 Retail, telecom, banking, DNA mining, stock
market analysis, Web mining



Data Mining Applications

 Science: Chemistry, Physics, Medicine


 Biochemical analysis
 Remote sensors on a satellite
 Telescopes – star galaxy classification
 Medical Image analysis



Data Mining Applications

 Bioscience
 Sequence-based analysis
 Protein structure and function prediction
 Protein family classification
 Microarray gene expression



Data Mining Applications
 Pharmaceutical companies, Insurance
and Health care, Medicine
 Drug development
 Identify successful medical therapies
 Claims analysis, fraudulent behavior
 Medical diagnostic tools
 Predict office visits



Data Mining Applications

 Financial Industry, Banks, Businesses, E-commerce
 Stock and investment analysis
 Identify loyal customers vs. risky customers
 Predict customer spending
 Risk management
 Sales forecasting



Data Mining Applications

 Retail and Marketing


 Customer buying patterns/demographic
characteristics
 Mailing campaigns
 Market basket analysis
 Trend analysis



Data Mining Applications

 Database analysis and decision support


 Market analysis and management
 target marketing, customer relation management, market
basket analysis, cross selling, market segmentation
 Risk analysis and management
 Forecasting, customer retention, improved underwriting,
quality control, competitive analysis
 Fraud detection and management
Data Mining Applications

 Sports and Entertainment


 IBM Advanced Scout analyzed NBA game
statistics (shots blocked, assists, and fouls) to
gain competitive advantage for New York
Knicks and Miami Heat
 Astronomy
 JPL and the Palomar Observatory discovered
22 quasars with the help of data mining



DATA MINING EXAMPLES
 Grocery store
 NBA
 Banking and Credit Card scoring
 Fraud detection
 Personalization & Customer Profiling
 Campaign Management and Database
Marketing



Data mining at work:
Case study 1



Processing Loan Applications
 Given: questionnaire with financial and personal
information
 Problem: should money be lent?
 Borderline cases referred to loan officers
 But: 50% of accepted borderline cases defaulted!
 Solution:
 reject all borderline cases?
 Borderline cases are most active customers!



Enter Machine Learning

 Given:
 1000 training examples of borderline cases
 20 attributes:
 age, years with current employer, years at current address, years with the bank, years at current job, other credit cards
 Learned rules predicted 2/3 of borderline cases
correctly!
 Rules could be used to explain decisions to customers



Case study 2:
Screening images
 Given:
 radar satellite images of coastal waters
 Problem:
 detecting oil slicks in those images
 Oil slicks = dark regions with changing size and
shape
 Look-alike dark regions can be caused by weather
conditions (e.g. high wind)
 Expensive process requiring highly trained personnel



Enter Machine Learning

 Dark regions extracted from normalized image


 Attributes:
 size of region, shape, area, intensity, sharpness
and jaggedness of boundaries, proximity of other
regions, info about background
 Constraints:
 Scarcity of training examples (oil slicks are rare!)
 Unbalanced data: most dark regions aren’t oil
slicks
 Regions from same image form a batch
 Requirement is adjustable false-alarm rate
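
One common way to make a false-alarm rate adjustable is to have the classifier output a score per region and then choose the decision threshold from scores observed on known non-slick regions. The sketch below shows that idea only. It is not the method used in the case study, and the scores, function names, and 20% target are invented for illustration.

```python
def threshold_for_false_alarm_rate(negative_scores, max_rate):
    """Pick a threshold so that at most max_rate of the negatives score above it."""
    ranked = sorted(negative_scores, reverse=True)
    allowed = int(max_rate * len(ranked))  # number of negatives we may flag
    if allowed >= len(ranked):
        return float("-inf")  # every region may be flagged
    # Regions are flagged only when strictly above this score.
    return ranked[allowed]


def is_flagged(score, threshold):
    return score > threshold


# Hypothetical classifier scores (higher means "more like an oil slick").
look_alike_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.90]
slick_scores = [0.50, 0.65, 0.80, 0.95]

threshold = threshold_for_false_alarm_rate(look_alike_scores, max_rate=0.2)
false_alarms = sum(is_flagged(s, threshold) for s in look_alike_scores)
detected = sum(is_flagged(s, threshold) for s in slick_scores)
print(threshold, false_alarms, detected)
```

Lowering max_rate raises the threshold, which reduces false alarms at the cost of missing more true slicks. The user chooses where to sit on that trade-off.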
Data Mining Challenges

 Computationally expensive to investigate all possibilities
 Dealing with noise/missing information and
errors in data
 Choosing appropriate attributes/input
representation
 Finding the minimal attribute space
 Finding adequate evaluation function(s)
 Extracting meaningful information
 Not overfitting
Are All the “Discovered” Patterns
Interesting?

 Interestingness measures: A pattern is interesting if it is easily understood by humans,
valid on new or test data with some degree of certainty, potentially useful, novel, or validates
some hypothesis that a user seeks to confirm



Are All the “Discovered” Patterns
Interesting?

 Objective vs. subjective measures:


 Objective: based on statistics and structures of
patterns
 support and confidence
 Subjective: based on user’s belief in the data
 unexpectedness, novelty, actionability, etc.



Can We Find All and Only Interesting
Patterns?

 Completeness - Find all the interesting patterns
 Can a data mining system find all the interesting
patterns?
 Association vs. classification vs. clustering



Can We Find All and Only Interesting
Patterns?

 Optimization - Search for only interesting patterns


 Can a data mining system find only the interesting
patterns?
 Approaches
 First generate all the patterns and then filter out the uninteresting ones
 Mining query optimization



Major Issues in Data Mining

 Mining methodology and user interaction


 Mining different kinds of knowledge in databases
 Incorporation of background knowledge
 Handling noise and incomplete data
 Pattern evaluation: the interestingness problem
 Expression and visualization of data mining results



Major Issues in Data Mining

 Performance and scalability


 Efficiency of data mining algorithms
 Parallel, distributed and incremental mining
methods
 Issues relating to the diversity of data types
 Handling relational and complex types of data
 Mining information from diverse databases



Major Issues in Data Mining

 Issues related to applications and social impacts
 Application of discovered knowledge
 Domain-specific data mining tools
 Intelligent query answering

 Expert systems

 Process control and decision making

 A knowledge fusion problem


 Protection of data security, integrity, and privacy
Summary

 Data mining: discovering interesting patterns from large amounts of data
 A KDD process includes data cleaning, data
integration, data selection, transformation, data
mining, pattern evaluation, and knowledge
presentation



Summary
 Mining can be performed in a variety of
information repositories
 Data mining functionalities: characterization,
association, classification, clustering, outlier
and trend analysis, etc.
 Classification of data mining systems
 Major issues in data mining
Exercise
 Practical Data mining example



Kinds of Data Mining
 Decision Tree Learning
 Clustering
 Neural Networks
 Association Rules
 Support Vector Machines
 Genetic Algorithms
 Nearest Neighbor Method



Decision Tree Example
[Figure: decision tree with root node “Grandparents” and branches “A lot” and “A little”]



DECISION TREE FOR THE CONCEPT

“Play Tennis”
Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No
Source: Mitchell, 1997
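
The slides do not say how the tree for this table is built. A standard choice, and the one used by the ID3 algorithm, is to split on the attribute with the highest information gain. The sketch below computes that gain for each attribute of the 14 examples above. The function and variable names are chosen for illustration.

```python
from collections import Counter
from math import log2

COLUMNS = ("Outlook", "Temp", "Humidity", "Wind")
# Rows D1..D14 from the table; the last field is the PlayTennis label.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column_index):
    """Entropy reduction obtained by splitting rows on one attribute."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[column_index] for row in rows}:
        subset = [row[-1] for row in rows if row[column_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS)}
best = max(gains, key=gains.get)
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name:<9}{gain:.3f}")
print("split first on:", best)
```

Outlook gives the largest gain on this data, so an ID3-style learner would place it at the root. Every Overcast day is a Yes, so that branch needs no further splitting.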



DECISION TREE FOR THE CONCEPT

“Play Tennis”

[Figure: decision tree for “Play Tennis”; source: Mitchell, 1997]

