Why Data Mining?: March 3, 2015

Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks
Science: Remote sensing, bioinformatics, scientific simulation
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
Necessity is the mother of invention: data mining, the automated analysis of massive data sets
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: Is everything data mining?
Simple search and query processing
(Deductive) expert systems
Data Mining is:
(1) The efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets
(2) The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner
Examples of Data Mining Applications
1. Fraud detection: credit cards, phone cards
2. Marketing: customer targeting
3. Data Warehousing: Walmart
4. Astronomy
5. Molecular biology
How Data Mining is used
1. Identify the problem
2. Use data mining techniques to transform the data into information
3. Act on the information
4. Measure the results
Knowledge Discovery (KDD) Process
Data mining: core of the knowledge discovery process
[Figure: the KDD process as a sequence of stages: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]
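The stages of the KDD process can be sketched as a chain of small functions. This is only an illustrative toy under stated assumptions: the records, the function names (`clean`, `integrate`, `select`, `mine`, `evaluate`), and the thresholds are all hypothetical, and the "mining" step is a simple per-customer total standing in for a real mining algorithm.

```python
# Hypothetical raw records from two source databases; None marks a missing value.
source_a = [{"customer": "c1", "amount": 120.0}, {"customer": "c2", "amount": None}]
source_b = [{"customer": "c3", "amount": 80.0}, {"customer": "c1", "amount": 40.0}]


def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: combine the cleaned sources into one warehouse-like list."""
    return [record for source in sources for record in source]


def select(records, min_amount):
    """Selection: keep only the task-relevant records."""
    return [r for r in records if r["amount"] >= min_amount]


def mine(records):
    """Data mining (toy stand-in): total amount per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals


def evaluate(patterns, threshold):
    """Pattern evaluation: keep only the patterns judged interesting."""
    return {key: value for key, value in patterns.items() if value >= threshold}


warehouse = integrate(clean(source_a), clean(source_b))
relevant = select(warehouse, min_amount=30.0)
patterns = evaluate(mine(relevant), threshold=100.0)
print(patterns)
```

Each function maps to one stage of the process, so swapping in a real cleaning rule or mining algorithm only changes one step of the chain.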
Data Mining and Business Intelligence
[Figure: pyramid of layers, with increasing potential to support business decisions from bottom to top]
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles shown beside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, and Other Disciplines]
Why Not Traditional Data Analysis?
Tremendous amount of data
Algorithms must be highly scalable to handle data on the order of terabytes
High-dimensionality of data
Micro-arrays may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
Knowledge to be mined
Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Data Mining: Classification Schemes
General functionality
Descriptive data mining
Predictive data mining
Different views lead to different classifications
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. biosequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
Data Mining Functionalities
Multidimensional concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
Frequent patterns, association, correlation vs. causality
Diaper → Beer [support 0.5%, confidence 75%] (Correlation or causality?)
Classification and prediction
Construct models (functions) that describe and distinguish classes or concepts for future prediction
E.g., classify countries based on (climate), or classify cars based on (gas mileage)
Predict some unknown or missing numerical values
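The two numbers attached to an association rule such as the diaper and beer example are conventionally its support and confidence. A minimal sketch of how they could be computed follows; the baskets are invented for illustration and do not reproduce the percentages on the slide, and the helper names are hypothetical.

```python
# Hypothetical market baskets, invented for illustration.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets)


def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Fraction of baskets with the antecedent that also contain the consequent."""
    return count(antecedent | consequent, baskets) / count(antecedent, baskets)


rule_support = support({"diaper", "beer"}, transactions)
rule_confidence = confidence({"diaper"}, {"beer"}, transactions)
print(rule_support, rule_confidence)
```

With these baskets the rule diaper → beer has support 0.6 and confidence 0.75. Neither number says anything about causality, which is the question the slide raises.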
Data Mining Functionalities (2)
Cluster analysis
Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
Outlier: Data object that does not comply with the general behavior of the data
Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
Trend and deviation: e.g., regression analysis
Sequential pattern mining: e.g., digital camera → large SD memory
Periodicity analysis
Similarity-based analysis
Other pattern-directed or statistical analyses
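As one illustration of outlier analysis, the sketch below flags values that lie far from the mean in units of standard deviation. The z-score rule and the threshold of 2 are assumptions chosen for this example, not a method prescribed by the slides, and the amounts are invented.

```python
from statistics import mean, pstdev

# Hypothetical card transaction amounts, invented for illustration.
amounts = [20, 22, 19, 21, 23, 20, 250]


def find_outliers(values, z_threshold=2.0):
    """Return the values whose z-score magnitude exceeds z_threshold."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - center) / spread > z_threshold]


outliers = find_outliers(amounts)
print(outliers)
```

Whether a flagged value is noise or a genuine exception, such as a fraudulent charge, still has to be decided from domain knowledge.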
Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing knowledge: knowledge fusion
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
Domain-specific data mining & invisible data mining
Protection of data security, integrity, and privacy
Data Mining Tasks
1. Classification: learning a function that maps an item into one of a set of predefined classes
2. Regression: learning a function that maps an item to a real value
3. Clustering: identify a set of groups of similar items
Data Mining Tasks (2)
4. Dependencies and associations: identify significant dependencies between data attributes
5. Summarization: find a compact description of the dataset or a subset of the dataset
Data Mining Methods
1. Decision Tree Classifiers: used for modeling, classification
2. Association Rules: used to find associations between sets of attributes
3. Sequential patterns: used to find temporal associations in time series
4. Hierarchical clustering: used to group customers, web users, etc.
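To make the first method concrete, here is a sketch of the simplest possible decision tree: a single split on one numeric attribute (often called a decision stump). It follows the earlier idea of classifying cars by gas mileage, but the data, the labels, and the function names are hypothetical, and real decision tree learners choose splits with measures such as information gain rather than raw error counts.

```python
def fit_stump(samples):
    """Find the one-attribute split with the fewest training errors.

    samples is a list of (value, label) pairs. Returns a tuple
    (threshold, label_at_or_below, label_above).
    """
    values = sorted({value for value, _ in samples})
    labels = sorted({label for _, label in samples})
    if len(values) < 2:
        raise ValueError("need at least two distinct attribute values")
    best = None
    # Candidate thresholds are the midpoints between neighbouring values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        for below in labels:
            for above in labels:
                errors = sum(
                    (below if value <= threshold else above) != label
                    for value, label in samples
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, below, above)
    return best[1:]


def predict(stump, value):
    """Classify a value with a fitted stump."""
    threshold, below, above = stump
    return below if value <= threshold else above


# Hypothetical cars: (miles per gallon, mileage class).
cars = [(18, "low"), (22, "low"), (25, "low"), (31, "high"), (35, "high"), (40, "high")]
stump = fit_stump(cars)
print(stump, predict(stump, 30))
```

A full decision tree repeats this kind of split recursively on each resulting subset, over several attributes.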
Why Data Warehousing?
Data warehousing can be considered as an important preprocessing step for data mining
[Figure: Heterogeneous Databases feed the Data Warehouse through data selection, data cleaning, data integration, and data summarization]
A data warehouse also provides on-line analytical processing (OLAP) tools for interactive multidimensional data analysis.
Example of a Data Warehouse (1)

US-Database

Employee
eid   name   birthdate
...   ...    ...

Department
did   dname
...   ...

Transaction
tid   type   date
1     sale   4/11/1999
2     sale   5/2/1999
3     buy    5/17/1999
...   ...    ...

Details
tid   pid   qty
1     21    2
2     13    1
3     41    3
...   ...   ...

HK-Database

Supplier
sid   name   birthdate
...   ...    ...

Country
cid   cname
...   ...

Sales
sid   date        time   qty   pid
1     15:4:1999   8:30   2     11
2     15:4:1999   9:30   2     11
3     ???                3     56
4     19:5:1999          4     22
...

Data Warehouse

FACT table
timeid   pid   sales
1        1     2
2        1     4
2        2     1
3        3     2
...      ...   ...

dimension 1: time
timeid   day   month   year
1        11    4       1999
2        15    4       1999
3        2     5       1999
...      ...   ...     ...

dimension 2: product
pid   name    type
1     chair   office
2     table   office
3     desk    office
...   ...     ...

Example of a Data Warehouse (2)
Data Selection
Only data which are important for analysis are
selected (e.g., information about employees,
departments, etc. are not stored in the warehouse)
Therefore the data warehouse is subject-oriented
Data Integration
Consistency of attribute names
Consistency of attribute data types. (e.g., dates are
converted to a consistent format)
Consistency of values (e.g., product-ids are
converted to correspond to the same products from
both sources)
Integration of data (e.g., data from both sources are integrated into the warehouse)

Example of a Data Warehouse (3)
Data Cleaning
Tuples which are incomplete or logically inconsistent are cleaned
Data Summarization
Values are summarized according to the desired level of analysis
For example, the HK database records the time of day at which a sales transaction takes place, but the most detailed time unit we are interested in for analysis is the day.
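The integration, cleaning, and summarization steps can be sketched on a few rows in the style of the example. The code assumes, based on how the example's time dimension lines up with its source rows, that US dates are written month/day/year and HK dates day:month:year. The function names and the way quantities are paired with dates are illustrative only.

```python
from collections import defaultdict
from datetime import date


def parse_us(text):
    """Parse a US-style 'month/day/year' string such as '4/11/1999'."""
    month, day, year = (int(part) for part in text.split("/"))
    return date(year, month, day)


def parse_hk(text):
    """Parse an HK-style 'day:month:year' string; return None if it is malformed."""
    try:
        day, month, year = (int(part) for part in text.split(":"))
        return date(year, month, day)
    except ValueError:
        return None


# Illustrative rows: (date, qty) for the US source, (date, time, qty) for HK.
us_sales = [("4/11/1999", 2), ("5/2/1999", 1)]
hk_sales = [("15:4:1999", "8:30", 2), ("15:4:1999", "9:30", 2), ("???", "", 3)]

daily = defaultdict(int)
for raw_date, qty in us_sales:
    # Integration: convert to one consistent date type.
    daily[parse_us(raw_date)] += qty
for raw_date, _time_of_day, qty in hk_sales:
    parsed = parse_hk(raw_date)
    if parsed is None:
        continue  # Cleaning: skip the incomplete tuple.
    # Summarization: the time of day is dropped, so rows collapse to the day.
    daily[parsed] += qty

print(dict(daily))
```

The two HK rows for 15 April collapse into a single daily total of 4, and the row with the unreadable date is left out.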
Example of a Data Warehouse (4)
Example of an OLAP query (collects counts)
Summarize all company sales according to product and year, and further aggregate on each of these dimensions.
Data cube (rows: product; columns: year)
product   1999   2000   2001   2002   ALL
chairs    25     37     89     21     172
ALL       115    187    109    187    598
The remaining rows each retain only three yearly counts, so the year each count belongs to cannot be recovered:
tables: 10, 30, 45; ALL 85
desks: 56, 84, 35; ALL 184
shelves: 19, 20, 71; ALL 110
boards: 16, 11, 15; ALL 47
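A query of this kind, which groups by product and year and also aggregates over each dimension, can be sketched with a dictionary keyed by (product, year), where the string "ALL" stands for a rolled-up dimension. The fact rows below are hypothetical and are not the counts from the table.

```python
from collections import defaultdict

# Hypothetical (product, year, quantity) fact rows, not taken from the slides.
facts = [
    ("chairs", 1999, 3),
    ("chairs", 2000, 5),
    ("tables", 1999, 2),
    ("tables", 2000, 4),
    ("chairs", 1999, 1),
]

cube = defaultdict(int)
for product, year, qty in facts:
    # Each fact feeds four cells: its base cell and the three roll-ups.
    for product_key in (product, "ALL"):
        for year_key in (year, "ALL"):
            cube[(product_key, year_key)] += qty

print(cube[("chairs", 1999)], cube[("chairs", "ALL")], cube[("ALL", 1999)], cube[("ALL", "ALL")])
```

The ("ALL", "ALL") cell plays the role of the bottom-right corner of the table, and the cells with a single "ALL" correspond to its last row and last column.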
What is a Data Warehouse?
Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from the organization's operational database
Supports information processing by providing a solid platform of consolidated, historical data for analysis.
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
Data warehousing:
The process of constructing and using data warehouses
Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product, sales.
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources
relational databases, flat files, on-line transaction records
Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
E.g., Hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted.
Data Warehouse: Time-Variant
The time horizon for the data warehouse is significantly longer than that of operational systems.
Operational database: current value data.
Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse contains an element of time, explicitly or implicitly
But the key of operational data may or may not contain a time element (the time elements could be extracted from log files of transactions)
Data Warehouse: Nonvolatile
A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
Does not require transaction processing, recovery, and concurrency control mechanisms
Requires only two operations in data accessing: initial loading of data and access of data.
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
User and system orientation: customer vs. market
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP
Why Separate Data Warehouse?
High performance for both systems
DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.
Different functions and different data:
missing data: Decision support (DS) requires historical data which operational DBs do not typically maintain
data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
From Tables and Spreadsheets to Data Cubes (2)
A dimension is a perspective with respect to which we analyze the data
A multidimensional data model is usually organized around a central theme (e.g., sales). Numerical measures on this theme are called facts, and they are used to analyze the relationships between the dimensions
Example:
Central theme: sales
Dimensions: item, customer, time, location, supplier, etc.
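The relationship between a fact table and its dimension tables can be sketched with the small fact, time, and product tables from the earlier data warehouse example. The dictionary layout and the `roll_up` helper are assumptions made for illustration; a real warehouse would express this as a join and group-by in its query language.

```python
from collections import defaultdict

# Fact rows from the earlier example: (timeid, pid, sales).
fact_sales = [(1, 1, 2), (2, 1, 4), (2, 2, 1), (3, 3, 2)]

# Dimension tables from the same example, keyed by their primary key.
time_dim = {
    1: {"day": 11, "month": 4, "year": 1999},
    2: {"day": 15, "month": 4, "year": 1999},
    3: {"day": 2, "month": 5, "year": 1999},
}
product_dim = {
    1: {"name": "chair", "type": "office"},
    2: {"name": "table", "type": "office"},
    3: {"name": "desk", "type": "office"},
}


def roll_up(facts, time_attr, product_attr):
    """Total the sales measure by one product attribute and one time attribute."""
    totals = defaultdict(int)
    for timeid, pid, sales in facts:
        # The keys stored in the fact row look up the descriptive attributes.
        key = (product_dim[pid][product_attr], time_dim[timeid][time_attr])
        totals[key] += sales
    return dict(totals)


by_name_month = roll_up(fact_sales, "month", "name")
by_type_year = roll_up(fact_sales, "year", "type")
print(by_name_month)
print(by_type_year)
```

Changing the attribute names passed to `roll_up` changes the level of detail of the view (for example month versus year), which is the sense in which each dimension is a perspective on the sales facts.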
