Data explosion
Automated data collection tools and mature database
technology lead to tremendous amounts of data stored in
databases, data warehouses and other information
repositories
Data mining:
Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
information or patterns from data in large databases
Data Warehousing
(Deductive) query processing
SQL/ Reporting
Software Agents
Expert Systems
Online Analytical Processing (OLAP)
Statistical Analysis Tool
Data visualization
© Copyright 2006, Natasha Balac
[Figure: data mining at the intersection of machine learning, visualization, artificial intelligence, and other disciplines]
Database technology
Artificial Intelligence
Machine Learning including Neural Networks
Statistics
Pattern recognition
Knowledge-based systems/acquisition
High-performance computing
Data visualization
Gold Mining
Knowledge mining from databases
Knowledge extraction
Data/pattern analysis
Knowledge Discovery in Databases (KDD)
Information harvesting
Business intelligence
[Figure labels: Database; Evaluation, Verification]
Fundamental idea:
learn rules/patterns/relationships
automatically from the data
Cluster analysis
Class label is unknown: group data to form
new classes
Example: cluster houses to find distribution
patterns
Clustering based on the principle: maximizing
the intra-class similarity and minimizing the
interclass similarity
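The clustering principle above can be sketched with k-means, one common clustering technique (the slides do not name a specific algorithm, so this choice is an assumption). The function name, the house coordinates, and the hand-picked starting centroids are all hypothetical and serve only as illustration.

```python
def kmeans(points, initial_centroids, iterations=10):
    """Group 2-D points around centroids (a basic k-means loop)."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, clusters


# Hypothetical house locations (x, y); no class labels are given.
houses = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (8.0, 8.2), (8.3, 7.9), (7.9, 8.1)]

# Assumption: start from one house in each visible group to keep the run deterministic.
centroids, clusters = kmeans(houses, initial_centroids=[houses[0], houses[3]])
print(clusters)
```

Houses in the same cluster end up close to their own centroid (high intra-class similarity) and far from the other one (low interclass similarity). Real tools usually pick starting centroids randomly or with a seeding heuristic rather than by hand.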
Outlier analysis
Outlier: a data object that does not comply with the
general behavior of the data
Often treated as noise or an exception, but
quite useful in fraud detection and rare-event analysis
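One simple way to flag objects that do not comply with the general behavior of the data is a z-score test. This is only a sketch of one possible technique, not a method the slides prescribe; the function name, the threshold of 2.0, and the transaction amounts are hypothetical.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.mean(values)
    spread = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / spread > threshold]


# Hypothetical card transaction amounts; one is far from the rest.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 500]
suspicious = find_outliers(amounts)
print(suspicious)
```

In a fraud-detection setting the flagged value is the interesting one, not noise to discard. Note that a single extreme value also inflates the mean and standard deviation, so median-based measures are often preferred on small samples.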
Trend and evolution analysis
Trend and deviation: regression analysis
Sequential pattern mining, periodicity analysis
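Trend analysis by regression can be illustrated with an ordinary least-squares line fit. The sketch below assumes a single numeric series over time; the function name and the monthly sales figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly sales: month index and units sold.
months = [0, 1, 2, 3, 4]
sales = [10, 12, 14, 16, 18]

slope, intercept = fit_line(months, sales)

# Deviation of each observation from the fitted trend line.
deviations = [y - (slope * x + intercept) for x, y in zip(months, sales)]
print(slope, intercept, deviations)
```

The slope describes the trend, and large entries in `deviations` would mark points that depart from it.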
Data Mining: Classification Schemes
General functionality
Descriptive data mining vs. predictive data mining
Different views - different classifications
Kinds of databases to be mined
Kinds of knowledge to be discovered
Kinds of techniques employed
Kinds of applications
A Multi-Dimensional View of Data
Mining Classification
Databases to be mined
Relational, transactional, object-oriented, object-relational,
active, spatial, time-series, text, multimedia, WWW, etc.
Knowledge to be mined
Characterization, discrimination, association,
classification, clustering, trend, deviation and outlier
analysis, etc.
Multiple/integrated functions
Mining at multiple levels of abstractions
Techniques utilized
Decision/Regression trees, clustering, neural
networks, etc.
Applications adapted
Retail, telecom, banking, DNA mining, stock
market analysis, Web mining
Bioscience
Sequence-based analysis
Protein structure and function prediction
Protein family classification
Microarray gene expression
Given:
1000 training examples of borderline cases
20 attributes, including:
age, years with current employer, years at current address,
years with the bank, years at current job, other credit cards
Learned rules predicted 2/3 of borderline cases
correctly!
Rules could be used to explain decisions to customers
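To make "learned rules" concrete, here is a minimal sketch of a one-attribute rule learner (often called 1R). The slides do not say which learner was used on the loan data, so this is a stand-in for illustration only; the attribute names, their values, and the six records are hypothetical.

```python
from collections import Counter, defaultdict


def one_r(rows, target):
    """Pick the single attribute whose value -> class rule makes the fewest training errors."""
    best = None
    attributes = [name for name in rows[0] if name != target]
    for attr in attributes:
        # Count class labels for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # Rule: predict the majority class for each attribute value.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for row in rows if rule[row[attr]] != row[target])
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


# Hypothetical borderline loan cases (not data from the slides).
cases = [
    {"employment": "long", "other_cards": "yes", "repaid": "yes"},
    {"employment": "long", "other_cards": "no", "repaid": "yes"},
    {"employment": "long", "other_cards": "yes", "repaid": "yes"},
    {"employment": "short", "other_cards": "yes", "repaid": "no"},
    {"employment": "short", "other_cards": "no", "repaid": "no"},
    {"employment": "short", "other_cards": "no", "repaid": "yes"},
]

attribute, rule, errors = one_r(cases, target="repaid")
print(attribute, rule, errors)
```

A rule of this form is readable as plain text, which is what makes it usable for explaining a decision to a customer.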
[Figure labels: Expert systems; A lot; A little]
“Play Tennis”
Day Outlook Temp Humidity Wind PlayTennis
[Mitchell, 1997]