
Why Data Mining?

The Explosive Growth of Data: from terabytes to petabytes

Data collection and data availability

Automated data collection tools, database systems, Web, computerized society

Major sources of abundant data

Business: Web, e-commerce, transactions, stocks,

Science: Remote sensing, bioinformatics, scientific simulation

Society and everyone: news, digital cameras, YouTube

We are drowning in data, but starving for knowledge!

Necessity is the mother of invention: data mining, the automated analysis of massive data sets

March 3, 2015


What Is Data Mining?

Data mining (knowledge discovery from data)

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data

Data mining: a misnomer?

Alternative names

Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Watch out: Is everything data mining?

Simple search and query processing

(Deductive) expert systems

Data Mining is:


(1) The efficient discovery of previously
unknown, valid, potentially useful,
understandable patterns in large
datasets
(2) The analysis of (often large)
observational data sets to find
unsuspected relationships and to
summarize the data in novel ways that
are both understandable and useful to
the data owner

Examples of Data Mining Applications

1. Fraud detection: credit cards,
phone cards
2. Marketing: customer targeting
3. Data Warehousing: Walmart
4. Astronomy
5. Molecular biology

How Data Mining is used

1. Identify the problem


2. Use data mining techniques to
transform the data into information
3. Act on the information
4. Measure the results

Knowledge Discovery (KDD) Process

Data mining: core of the knowledge discovery process

[Figure: KDD process flow. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
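
As a rough illustration of how the KDD stages feed into one another, the sketch below chains them as plain Python functions. The record layout, the function names, and the toy mining step (counting how often each item occurs) are assumptions made for this example and are not taken from the slides.

```python
from collections import Counter

# Hypothetical raw records from two source databases: (customer, item).
# None marks a missing value that the cleaning stage should drop.
db_a = [("ann", "milk"), ("bob", None), ("ann", "bread")]
db_b = [("cy", "milk"), (None, "eggs"), ("bob", "milk")]


def clean(records):
    """Data cleaning: drop records that have a missing field."""
    return [r for r in records if None not in r]


def integrate(*sources):
    """Data integration: combine the cleaned sources into one store."""
    return [r for src in sources for r in src]


def select(warehouse, field):
    """Selection: keep only the task-relevant attribute of each record."""
    return [r[field] for r in warehouse]


def mine(items):
    """Data mining (toy version): count how often each item occurs."""
    return Counter(items)


def evaluate(patterns, min_count):
    """Pattern evaluation: keep only patterns that meet a threshold."""
    return {k: v for k, v in patterns.items() if v >= min_count}


warehouse = integrate(clean(db_a), clean(db_b))
patterns = evaluate(mine(select(warehouse, 1)), min_count=2)
print(patterns)  # {'milk': 3}
```

Each stage only consumes the output of the previous one, which is the property the process diagram is meant to convey.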

Data Mining and Business Intelligence

[Figure: pyramid of layers; the potential to support business decisions increases from bottom to top. Layers and typical roles, listed top to bottom:]

Decision Making (End User)

Data Presentation: Visualization Techniques (Business Analyst)

Data Mining: Information Discovery (Data Analyst)

Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)

Data Preprocessing/Integration, Data Warehouses (DBA)

Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)


Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, and Other Disciplines]


Why Not Traditional Data Analysis?

Tremendous amount of data

Algorithms must be highly scalable to handle data volumes such as terabytes of data

High-dimensionality of data

Micro-array may have tens of thousands of dimensions

High complexity of data

Data streams and sensor data

Time-series data, temporal data, sequence data

Structure data, graphs, social networks and multi-linked data

Heterogeneous databases and legacy databases

Spatial, spatiotemporal, multimedia, text and Web data

Software programs, scientific simulations

New and sophisticated applications


Multi-Dimensional View of Data Mining

Data to be mined

Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW

Knowledge to be mined

Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.

Multiple/integrated functions and mining at multiple levels

Techniques utilized

Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.

Applications adapted

Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Data Mining: Classification Schemes

General functionality

Descriptive data mining

Predictive data mining

Different views lead to different classifications

Data view: Kinds of data to be mined

Knowledge view: Kinds of knowledge to be discovered

Method view: Kinds of techniques utilized

Application view: Kinds of applications adapted


Data Mining: On What Kinds of Data?

Database-oriented data sets and applications

Relational database, data warehouse, transactional database

Advanced data sets and advanced applications

Data streams and sensor data

Time-series data, temporal data, sequence data (incl. biosequences)

Structure data, graphs, social networks and multi-linked data

Object-relational databases

Heterogeneous databases and legacy databases

Spatial data and spatiotemporal data

Multimedia database

Text databases

The World-Wide Web


Data Mining Functionalities

Multidimensional concept description: Characterization and discrimination

Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions

Frequent patterns, association, correlation vs. causality

Diaper → Beer [support 0.5%, confidence 75%] (Correlation or causality?)

Classification and prediction

Construct models (functions) that describe and distinguish classes or concepts for future prediction

E.g., classify countries based on (climate), or classify cars based on (gas mileage)

Predict some unknown or missing numerical values
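
The two percentages attached to an association rule such as the Diaper/Beer example are its support and confidence. The sketch below shows one way to compute them for a single rule; the function name and the five baskets are made up for illustration and chosen only to keep the arithmetic easy to follow.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = fraction of all transactions containing both item sets
    confidence = fraction of transactions containing the antecedent
                 that also contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(
        1 for t in transactions if (antecedent | consequent) <= set(t)
    )
    support = n_both / n_total
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical market baskets (not from the slides).
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]
support, confidence = support_and_confidence(baskets, {"diaper"}, {"beer"})
print(f"support={support:.2f}, confidence={confidence:.2f}")
# support=0.60, confidence=0.75
```

A high confidence only says that the two items tend to appear together; it does not by itself settle the correlation-versus-causality question raised on the slide.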

Data Mining Functionalities (2)

Cluster analysis
Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
Outlier: Data object that does not comply with the general behavior
of the data
Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
Trend and deviation: e.g., regression analysis
Sequential pattern mining: e.g., digital camera → large SD memory
Periodicity analysis
Similarity-based analysis
Other pattern-directed or statistical analyses
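
To make the outlier definition more concrete, here is a minimal sketch of one simple statistical approach, the z-score test. It is only one of many possible methods and is not prescribed by the slides; the threshold of 2.0 and the sample amounts are assumptions for the example.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` population
    standard deviations away from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts with one unusual value.
amounts = [10, 11, 9, 10, 12, 10, 50]
outliers = zscore_outliers(amounts)
print(outliers)  # [50]
```

Whether the flagged value is noise or a meaningful exception (for example, a fraudulent transaction) still has to be decided by the analyst.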


Major Issues in Data Mining

Mining methodology

Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web

Performance: efficiency, effectiveness, and scalability

Pattern evaluation: the interestingness problem

Incorporation of background knowledge

Handling noise and incomplete data

Parallel, distributed and incremental mining methods


Integration of the discovered knowledge with existing knowledge: knowledge fusion

User interaction

Data mining query languages and ad-hoc mining

Expression and visualization of data mining results

Interactive mining of knowledge at multiple levels of abstraction

Applications and social impacts

Domain-specific data mining & invisible data mining

Protection of data security, integrity, and privacy

Data Mining Tasks


1. Classification: learning a function that
maps an item into one of a set of
predefined classes
2. Regression: learning a function that maps
an item to a real value
3. Clustering: identify a set of groups of
similar items

Data Mining Tasks


4. Dependencies and associations:
identify significant dependencies between
data attributes
5. Summarization: find a compact description
of the dataset or a subset of the dataset
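
The difference between tasks 1 and 2 is the kind of value the learned function returns: a predefined class versus a real number. The sketch below contrasts the two with deliberately simple techniques (a 1-nearest-neighbor classifier and a least-squares line). Both techniques, the function names, and the data are assumptions chosen for illustration; the slides do not name a specific algorithm here.

```python
def nearest_neighbor_label(train, x):
    """Classification: return the class label of the training
    point whose value is closest to x."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical labeled items: (measurement, class).
train = [(1.0, "small"), (2.0, "small"), (8.0, "large"), (9.0, "large")]
label = nearest_neighbor_label(train, 7.0)

# Hypothetical numeric pairs that lie exactly on y = 2x + 1.
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
prediction = slope * 5.0 + intercept

print(label)       # large
print(prediction)  # 11.0
```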

Data Mining Methods


1. Decision Tree Classifiers:
Used for modeling, classification

2. Association Rules:
Used to find associations between sets of
attributes

3. Sequential patterns:
Used to find temporal associations in time series

4. Hierarchical clustering:
Used to group customers, web users, etc.
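
As an illustration of method 4, the following is a minimal sketch of agglomerative (bottom-up) hierarchical clustering with single linkage on one-dimensional values. The choice of single linkage, the stopping rule (a target number of clusters), and the sample values are assumptions for the example.

```python
def single_linkage(points, k):
    """Start with one cluster per point and repeatedly merge the two
    closest clusters until only k clusters remain.

    The distance between two clusters is the smallest distance
    between any pair of their members (single linkage).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical monthly spend of six customers.
spend = [1, 2, 3, 10, 11, 25]
groups = single_linkage(spend, k=3)
print(groups)  # [[1, 2, 3], [10, 11], [25]]
```

Recording the sequence of merges instead of stopping at k would give the full hierarchy (dendrogram); this sketch keeps only the final grouping.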

Why Data Warehousing?

Data warehousing can be considered as an important preprocessing step for data mining

[Figure: Heterogeneous Databases → data selection, data cleaning, data integration, data summarization → Data Warehouse]

A data warehouse also provides on-line analytical processing (OLAP) tools for interactive multidimensional data analysis.

Example of a Data Warehouse (1)

US-Database

Employee
eid  name  birthdate
...  ...   ...

Department
did  dname
...  ...

Transaction
tid  type  date
1    sale  4/11/1999
2    sale  5/2/1999
3    buy   5/17/1999
...  ...   ...

Details
tid  pid  qty
1    21   2
2    13   1
3    41   3
...  ...  ...

HK-Database

Supplier
sid  name  birthdate
...  ...   ...

Country
cid  cname
...  ...

Sales
sid  date       time  qty  pid
1    15:4:1999  8:30  2    11
2    15:4:1999  9:30  2    11
3    ???              3    56
4    19:5:1999        4    22
...

Data Warehouse

FACT table
timeid  pid  sales
1       1    2
2       1    4
2       2    1
3       3    2
...     ...  ...

dimension 1: time
timeid  day  month  year
1       11   4      1999
2       15   4      1999
3       2    5      1999
...     ...         ...

dimension 2: product
pid  name   type
1    chair  office
2    table  office
3    desk   office
...  ...

Example of a Data Warehouse (2)

Data Selection
Only data which are important for analysis are
selected (e.g., information about employees,
departments, etc. are not stored in the warehouse)
Therefore the data warehouse is subject-oriented
Data Integration
Consistency of attribute names
Consistency of attribute data types. (e.g., dates are
converted to a consistent format)
Consistency of values (e.g., product-ids are
converted to correspond to the same products from
both sources)
Integration of data (e.g., data from both sources are
integrated into the warehouse)
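
Converting dates to a consistent format can be sketched as follows. The format strings assume that the US source writes month/day/year (4/11/1999) and the HK source writes day:month:year (15:4:1999), which is the reading suggested by the time dimension of the example warehouse. The names `SOURCE_FORMATS` and `to_date` are hypothetical.

```python
from datetime import date, datetime

# Assumed date layout per source database.
SOURCE_FORMATS = {
    "US": "%m/%d/%Y",  # 4/11/1999 read as month/day/year
    "HK": "%d:%m:%Y",  # 15:4:1999 read as day:month:year
}


def to_date(raw, source):
    """Parse a source-specific date string into a datetime.date."""
    return datetime.strptime(raw, SOURCE_FORMATS[source]).date()


print(to_date("4/11/1999", "US").isoformat())  # 1999-04-11
print(to_date("15:4:1999", "HK").isoformat())  # 1999-04-15
```

A value such as `???` would raise `ValueError` here, which is one way such tuples can be routed to the data cleaning step.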

Example of a Data Warehouse (3)

Data Cleaning

Tuples which are incomplete or logically inconsistent are cleaned

Data Summarization

Values are summarized according to the desired level of analysis

For example, the HK database records the time of day at which a sales transaction takes place, but the most detailed time unit we are interested in for analysis is the day.
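
Summarizing to the day level amounts to dropping the time of day and adding up quantities per day and product. The sketch below does this for the two complete HK sales rows from the example; the tuple layout and the function name are assumptions.

```python
from collections import defaultdict

# The two complete HK sales rows, as (date, time, pid, qty).
hk_sales = [
    ("15:4:1999", "8:30", 11, 2),
    ("15:4:1999", "9:30", 11, 2),
]


def summarize_by_day(rows):
    """Sum quantities per (date, pid), ignoring the time of day."""
    totals = defaultdict(int)
    for day, _time, pid, qty in rows:
        totals[(day, pid)] += qty
    return dict(totals)


daily = summarize_by_day(hk_sales)
print(daily)  # {('15:4:1999', 11): 4}
```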

Example of a Data Warehouse (4)

Example of an OLAP query (collects counts)

Summarize all company sales according to product and year, and further aggregate on each of these dimensions.

Data cube (columns: year 1999, 2000, 2001, 2002, ALL; rows: product)

product  counts by year      ALL
chairs   25, 37, 89, 21      172
tables   10, 30, 45          85
desks    56, 84, 35          184
shelves  19, 20, 71          110
boards   16, 11, 15          47
ALL      115, 187, 109, 187  598

Rows with three counts have one empty year cell.
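
The "further aggregate on each of these dimensions" step can be sketched in a few lines: every fact contributes to its own cell, to both ALL margins, and to the grand total. The chairs counts below are the ones shown in the table; the second product ("stools") and its counts are made up so that the margins are not trivial. The function name `data_cube` is hypothetical.

```python
from collections import defaultdict
from itertools import product


def data_cube(rows):
    """Build a product x year cube with 'ALL' margins.

    rows: iterable of (product_name, year, count).
    Returns a dict keyed by (product_or_ALL, year_or_ALL).
    """
    cube = defaultdict(int)
    for name, year, count in rows:
        # The four keys are: the cell itself, the product total,
        # the year total, and the grand total.
        for key in product((name, "ALL"), (year, "ALL")):
            cube[key] += count
    return dict(cube)


rows = [
    ("chairs", 1999, 25), ("chairs", 2000, 37),
    ("chairs", 2001, 89), ("chairs", 2002, 21),
    ("stools", 1999, 4), ("stools", 2001, 6),  # hypothetical rows
]
cube = data_cube(rows)
print(cube[("chairs", "ALL")])  # 172
print(cube[("ALL", 1999)])      # 29
print(cube[("ALL", "ALL")])     # 182
```

With n dimensions the same idea yields 2^n group-bys; here n = 2, hence four keys per fact.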

What Is a Data Warehouse?

Defined in many different ways, but not rigorously.

A decision support database that is maintained separately from the organization's operational database

Supports information processing by providing a solid platform of consolidated, historical data for analysis.

"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)

Data warehousing:

The process of constructing and using data warehouses

Data Warehouse: Subject-Oriented

Organized around major subjects, such as customer, product, sales.

Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.

Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Data Warehouse: Integrated

Constructed by integrating multiple, heterogeneous data sources

relational databases, flat files, on-line transaction records

Data cleaning and data integration techniques are applied.

Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources

E.g., Hotel price: currency, tax, breakfast covered, etc.

When data is moved to the warehouse, it is converted.

Data Warehouse: Time-Variant

The time horizon for the data warehouse is significantly longer than that of operational systems.

Operational database: current value data.

Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

Every key structure in the data warehouse

Contains an element of time, explicitly or implicitly

But the key of operational data may or may not contain a time element (the time elements could be extracted from log files of transactions)

Data Warehouse: Nonvolatile

A physically separate store of data transformed from the operational environment.

Operational update of data does not occur in the data warehouse environment.

Does not require transaction processing, recovery, and concurrency control mechanisms

Requires only two operations in data accessing:

initial loading of data and access of data.

Data Warehouse vs. Operational DBMS

OLTP (on-line transaction processing)

Major task of traditional relational DBMS

Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

OLAP (on-line analytical processing)

Major task of data warehouse system

Data analysis and decision making

Distinct features (OLTP vs. OLAP):

User and system orientation: customer vs. market

Data contents: current, detailed vs. historical, consolidated

Database design: ER + application vs. star + subject

View: current, local vs. evolutionary, integrated

Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP

Why Separate Data Warehouse?

High performance for both systems

DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery

Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.

Different functions and different data:

missing data: Decision support requires historical data which operational DBs do not typically maintain

data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources

data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

From Tables and Spreadsheets to Data Cubes

A data warehouse is based on a multidimensional data model which views data in the form of a data cube

A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions

Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)

Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

From Tables and Spreadsheets to Data Cubes

A dimension is a perspective with respect to which we analyze the data

A multidimensional data model is usually organized around a central theme (e.g., sales). Numerical measures on this theme are called facts, and they are used to analyze the relationships between the dimensions

Example:

Central theme: sales

Dimensions: item, customer, time, location, supplier, etc.
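
The relationship between a fact table and its dimension tables can be tried out with Python's built-in `sqlite3` module. The sketch below loads the small fact table and the time and product dimensions from the warehouse example earlier in the deck, then aggregates the `sales` measure by product name and month. The table names `time_dim`, `product_dim`, and `sales_fact` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    timeid INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER
);
CREATE TABLE product_dim (
    pid INTEGER PRIMARY KEY, name TEXT, type TEXT
);
-- The fact table holds the measure plus a key into each dimension.
CREATE TABLE sales_fact (
    timeid INTEGER REFERENCES time_dim(timeid),
    pid INTEGER REFERENCES product_dim(pid),
    sales INTEGER
);
""")
conn.executemany(
    "INSERT INTO time_dim VALUES (?, ?, ?, ?)",
    [(1, 11, 4, 1999), (2, 15, 4, 1999), (3, 2, 5, 1999)],
)
conn.executemany(
    "INSERT INTO product_dim VALUES (?, ?, ?)",
    [(1, "chair", "office"), (2, "table", "office"), (3, "desk", "office")],
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [(1, 1, 2), (2, 1, 4), (2, 2, 1), (3, 3, 2)],
)

# Analyze the measure along two dimensions: product name and month.
rows = conn.execute("""
    SELECT p.name, t.month, SUM(f.sales)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.pid = f.pid
    JOIN time_dim AS t ON t.timeid = f.timeid
    GROUP BY p.name, t.month
    ORDER BY p.name, t.month
""").fetchall()
print(rows)  # [('chair', 4, 6), ('desk', 5, 2), ('table', 4, 1)]
```

Changing the `GROUP BY` columns (for example to `t.year` alone) views the same facts from a different dimension without touching the stored data.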
