
Data Mining

The Need for Data Mining

Data mining has attracted a great deal of attention in the information industry
and in society as a whole in recent years, due to the wide availability of huge
amounts of data and the imminent need for turning such data into useful
information and knowledge.
Simply stated, data mining refers to extracting or “mining” knowledge from large
amounts of data.

"The non trivial extraction of implicit, previously unknown, and potentially useful
information from data"

What is not Data Mining?
• Looking up a phone number in a phone directory.
• Querying a Web search engine for information about "Amazon".

What is Data Mining?
• Discovering that certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly… in the Boston area).
• Grouping together similar documents returned by a search engine according to their context (e.g., the Amazon rainforest vs. Amazon.com).
KDD Process (Knowledge Discovery in Databases)

Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data (KDD).

Steps in KDD
• Data cleaning (to remove noise and inconsistent data).

• Data integration (where multiple data sources may be combined).

• Data selection (where data relevant to the analysis task are retrieved from the database).

• Data transformation (where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations).

• Data mining (an essential process where intelligent methods are applied in order to extract data patterns).

• Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures).

• Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user).
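As a rough illustration, the following Python sketch walks a small, hypothetical sales table through the cleaning, selection, and transformation steps before a trivial "mining" step. It assumes pandas is available and uses invented column names; it is a sketch of the workflow, not a full KDD implementation.

import pandas as pd

# Toy operational data (hypothetical schema).
sales = pd.DataFrame({
    "product": ["A", "A", "B", None, "C"],
    "region":  ["east", "west", "east", "west", "east"],
    "units":   [10, 12, 7, 5, None],
})

clean = sales.dropna()                                 # data cleaning: drop incomplete rows
selected = clean[clean["region"] == "east"]            # data selection: rows relevant to the task
summary = selected.groupby("product")["units"].sum()   # data transformation: aggregation

# "Data mining" placeholder: report the best-selling product in the selection.
print(summary.idxmax(), summary.max())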
Architecture of a Data Mining System
Database, data warehouse, World Wide Web, or other information repository: This is
one or a set of databases, data warehouses, spreadsheets, or other kinds of information
repositories. Data cleaning and data integration techniques may be performed on the
data.

Database or data warehouse server: The database or data warehouse server is
responsible for fetching the relevant data, based on the user’s data mining request.

Knowledge base: This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns.

Such knowledge can include concept hierarchies, used to organize attributes or attribute
values into different levels of abstraction.

Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness
based on its unexpectedness, may also be included.
Data mining engine: This is essential to the data mining system and ideally consists of a
set of functional modules for tasks such as characterization, association and correlation
analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

Pattern evaluation module: This component typically employs interestingness measures
and interacts with the data mining modules so as to focus the search toward interesting
patterns. It may use interestingness thresholds to filter out discovered patterns.

User interface: This module communicates between users and the data mining system,
allowing the user to interact with the system by specifying a data mining query or task.
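The division of labor between the knowledge base, the data mining engine, and the pattern evaluation module can be pictured roughly as in the following Python sketch. It is only a schematic of the component roles described above; the class names and the minimum-support threshold are invented for illustration.

# Schematic only: names and the support threshold are illustrative, not a real system.
class KnowledgeBase:
    """Domain knowledge used to judge how interesting a pattern is."""
    def __init__(self, min_support):
        self.min_support = min_support

class PatternEvaluationModule:
    """Filters discovered patterns using an interestingness threshold."""
    def __init__(self, kb):
        self.kb = kb

    def interesting(self, pattern):
        return pattern["support"] >= self.kb.min_support

class DataMiningEngine:
    """Applies a mining method and keeps only the interesting patterns."""
    def __init__(self, evaluator):
        self.evaluator = evaluator

    def mine(self, candidate_patterns):
        return [p for p in candidate_patterns if self.evaluator.interesting(p)]

engine = DataMiningEngine(PatternEvaluationModule(KnowledgeBase(min_support=0.3)))
print(engine.mine([{"items": ("milk", "bread"), "support": 0.4},
                   {"items": ("milk", "eggs"), "support": 0.1}]))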
A Three-Tier Data Warehouse Architecture
Data warehouses often adopt a three-tier architecture:

1. The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from operational
databases or other external sources. These tools and utilities perform data extraction,
cleaning, and transformation as well as load and refresh functions to update the data
warehouse.

2. The middle tier is an OLAP server that is typically implemented using either (1) a relational
OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations on
multidimensional data to standard relational operations; or (2) a multidimensional OLAP
(MOLAP) model, that is, a special-purpose server that directly implements multidimensional
data and operations.

3. The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
From the architecture point of view, there are three data warehouse models: the
enterprise warehouse, the data mart, and the virtual warehouse.

Enterprise warehouse: An enterprise warehouse collects all of the information about
subjects spanning the entire organization. It provides corporate-wide data integration,
usually from one or more operational systems or external information providers, and is
cross-functional in scope. It typically contains detailed data as well as summarized data,
and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or
beyond.

Data mart: A data mart contains a subset of corporate-wide data that is of value to a
specific group of users. The scope is confined to specific selected subjects.

Virtual warehouse: A virtual warehouse is a set of views over operational databases. A
virtual warehouse is easy to build but requires excess capacity on operational database
servers.
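To make "a set of views over operational databases" concrete, here is a small sketch using Python's built-in sqlite3 module; the table, columns, and view name are invented for illustration.

import sqlite3

con = sqlite3.connect(":memory:")
# A toy "operational" table (hypothetical schema).
con.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "east", 120.0), (2, "west", 80.0), (3, "east", 45.5)])

# The "virtual warehouse": a view defined directly over the operational table, so
# queries against it run on live operational data rather than on a separate copy.
con.execute("CREATE VIEW east_sales AS "
            "SELECT order_id, amount FROM orders WHERE region = 'east'")
print(con.execute("SELECT SUM(amount) FROM east_sales").fetchone())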
Data Mining—On What Kind of Data?

In principle, data mining is not specific to one type of media or data. Data mining should
be applicable to any kind of information repository.

Flat files: Flat files are actually the most common data source for data mining
algorithms, especially at the research level.

Flat files are simple data files in text or binary format with a structure known by the data
mining algorithm to be applied.

The data in these files can be transactions, time-series data, scientific measurements,
etc.
Relational Databases: Briefly, a relational database consists of a set of tables
containing either values of entity attributes, or values of attributes from entity
relationships.

Tables have columns and rows, where columns represent attributes and rows
represent tuples.

A tuple in a relational table corresponds to either an object or a relationship between
objects and is identified by a set of attribute values representing a unique key.
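For instance, a relational table with columns as attributes, rows as tuples, and a unique key can be set up as follows with Python's built-in sqlite3 module (the customer schema is invented for illustration).

import sqlite3

con = sqlite3.connect(":memory:")
# Columns are attributes; each row is a tuple identified by a unique key.
con.execute("""CREATE TABLE customer (
                   cust_id INTEGER PRIMARY KEY,  -- unique key identifying each tuple
                   name    TEXT,
                   city    TEXT)""")
con.execute("INSERT INTO customer VALUES (1, 'Alice', 'Boston')")
con.execute("INSERT INTO customer VALUES (2, 'Bob', 'Chicago')")
print(con.execute("SELECT * FROM customer WHERE cust_id = 2").fetchone())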
• Data Warehouses: A data warehouse is a repository of data collected from multiple,
often heterogeneous, data sources and is intended to be used as a whole under the
same unified schema.

• A data warehouse gives the option to analyze data from different sources under the
same roof.

• Transaction Databases: A transaction database is a set of records representing
transactions, each with a time stamp, an identifier, and a set of items (a small sketch
of such records is given at the end of this section).

• Descriptive data for the items may also be associated with the transaction files.

• Multimedia Databases: Multimedia databases include video, image, audio, and text
media. Multimedia data are characterized by high dimensionality, which makes
data mining even more challenging.
Spatial Databases: Spatial databases are databases that, in addition to the usual data,
store geographical information such as maps and global or regional positioning.
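As noted under transaction databases above, each record carries a time stamp, an identifier, and a set of items. A minimal Python sketch of such records (with invented field names and toy values) might look like this:

# Each record: a time stamp, a transaction identifier, and a set of items.
transactions = [
    {"timestamp": "2024-01-05 09:14", "trans_id": "T100", "items": {"milk", "bread"}},
    {"timestamp": "2024-01-05 09:21", "trans_id": "T101", "items": {"milk", "eggs", "bread"}},
    {"timestamp": "2024-01-05 09:37", "trans_id": "T102", "items": {"beer", "diapers"}},
]

# Optional descriptive data for the items, kept alongside the transaction file.
item_descriptions = {"milk": "dairy", "bread": "bakery", "eggs": "dairy",
                     "beer": "beverages", "diapers": "baby care"}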
Data mining functionalities, and the kinds of patterns they can discover

Characterization and Discrimination

Data characterization: Data characterization is a summarization of the general
characteristics or features of a target class of data.

The data corresponding to the user-specified class are typically collected by a
database query.

For example, a user may want to study the characteristics of software products whose
sales increased by 10% in the last year.
Data discrimination: Comparison of the target class with one or a set of comparative
classes.

Data discrimination is a comparison of the general features of target class data objects
with the general features of objects from one or a set of contrasting classes.

The target and contrasting classes can be specified by the user, and the corresponding
data objects retrieved through database queries.

For example, the user may wish to compare the general features of software products
whose sales increased by 10% in the last year with those whose sales decreased by at
least 30% during the same period.
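A rough sketch of how the target and contrasting classes from this example could be retrieved and compared with pandas; the table, column names, and values are hypothetical.

import pandas as pd

# Hypothetical yearly sales-change data for software products.
products = pd.DataFrame({
    "name":         ["P1", "P2", "P3", "P4", "P5"],
    "sales_change": [0.15, 0.10, -0.35, 0.02, -0.40],   # fractional change over the last year
    "price":        [49.0, 199.0, 25.0, 99.0, 15.0],
})

target = products[products["sales_change"] >= 0.10]      # target class: sales up by at least 10%
contrast = products[products["sales_change"] <= -0.30]   # contrasting class: sales down by at least 30%

# Characterization: summarize general features of the target class.
print(target["price"].describe())

# Discrimination: compare the same feature across the two classes.
print(target["price"].mean(), contrast["price"].mean())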
Mining Frequent Patterns, Associations, and Correlations

Frequent patterns, as the name suggests, are patterns that occur frequently in data. There
are many kinds of frequent patterns, including itemsets, subsequences, and
substructures.

Mining frequent patterns leads to the discovery of interesting associations and
correlations within data.
Market Basket Analysis

Frequent itemset mining leads to the discovery of associations and correlations among
items in large transactional or relational data sets.

With massive amounts of data continuously being collected and stored, many
industries are becoming interested in mining such patterns from their databases.

The discovery of interesting correlation relationships among huge amounts of business
transaction records can help in many business decision-making processes.
A typical example of frequent itemset mining is market basket analysis.

This process analyzes customer buying habits by finding associations between the
different items that customers place in their “shopping baskets”.

The discovery of such associations can help retailers develop marketing strategies
by gaining insight into which items are frequently purchased together by customers.
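The core idea can be sketched very simply: count how often pairs of items appear together in the same basket and keep the pairs that meet a minimum support count. The baskets and threshold below are invented for illustration; real frequent-itemset miners such as Apriori or FP-growth are far more efficient and handle itemsets of any size.

from collections import Counter
from itertools import combinations

# Toy shopping baskets (hypothetical).
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"beer", "diapers"},
    {"milk", "bread", "eggs"},
]

min_support_count = 2  # keep pairs appearing in at least 2 baskets

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent_pairs = {pair: c for pair, c in pair_counts.items() if c >= min_support_count}
print(frequent_pairs)  # e.g. {('bread', 'milk'): 3}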
Classification

In classification, a model or classifier is constructed to predict categorical labels, such
as “safe” or “risky” for loan application data; “yes” or “no” for marketing data; or
“treatment A,” “treatment B,” or “treatment C” for medical data.

These categories can be represented by discrete values, where the ordering among values
has no meaning. For example, the values 1, 2, and 3 may be used to represent treatments
A, B, and C.

When the constructed model instead predicts a continuous-valued function, or ordered
value, rather than a categorical label, the data analysis task is an example of numeric
prediction, and the model is called a predictor.
Data classification is a two-step process.

In the first step, a classifier is built describing a predetermined set of data classes or
concepts.

This is the learning step (or training phase), where a classification algorithm builds the
classifier by analyzing or “learning from” a training set made up of database tuples and their
associated class labels.

Each tuple, X, is assumed to belong to a predefined class as determined by another
database attribute called the class label attribute.

The class label attribute is discrete-valued and unordered. It is categorical in that each value
serves as a category or class.

The individual tuples making up the training set are referred to as training tuples and are
selected from the database under analysis.
Because the class label of each training tuple is provided, this step is also known as
supervised learning.

It contrasts with unsupervised learning (or clustering), in which the class label of each
training tuple is not known, and the number or set of classes to be learned may not be
known in advance.

In the second step, the model is used for classification.

First, the predictive accuracy of the classifier is estimated. If we were to use the training
set to measure the accuracy of the classifier, this estimate would likely be optimistic,
because the classifier tends to overfit the data; accuracy is therefore estimated on a test
set of tuples that were not used to train the model.
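As a rough illustration of the two steps, the sketch below trains a decision tree on class-labeled tuples and then estimates accuracy on a held-out test set rather than on the training data. It assumes scikit-learn is available; the features and the "safe"/"risky" labels are invented for illustration.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical loan-application tuples: [income_in_thousands, years_employed] -> class label.
X = [[25, 1], [60, 5], [32, 2], [80, 10], [28, 1], [75, 8], [40, 3], [22, 0]]
y = ["risky", "safe", "risky", "safe", "risky", "safe", "safe", "risky"]

# Step 1 (learning/training): build the classifier from class-labeled training tuples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2 (classification): estimate accuracy on tuples NOT used for training,
# since accuracy measured on the training set itself would be optimistic.
print(accuracy_score(y_test, clf.predict(X_test)))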
Clustering

Clustering is the process of grouping data into classes, or clusters, so that objects within
a cluster have high similarity in comparison to one another, but are very dissimilar to
objects in other clusters.

Although classification is an effective means for distinguishing groups or classes of
objects, it requires the often costly collection and labeling of a large set of training
tuples or patterns, which the classifier uses to model each group.

It is often more desirable to proceed in the reverse direction: First partition the set of
data into groups based on data similarity (e.g., using clustering), and then assign labels
to the relatively small number of groups.
Clustering is also called data segmentation in some applications because clustering
partitions large datasets into groups according to their similarity.

Clustering can also be used for outlier detection, where outliers (values that are “far
away” from any cluster) may be more interesting than common cases.

In machine learning, clustering is an example of unsupervised learning. Unlike
classification, clustering and unsupervised learning do not rely on predefined classes
and class-labeled training examples.

For this reason, clustering is a form of learning by observation, rather than learning by
examples.
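A minimal sketch of clustering as unsupervised learning: k-means groups unlabeled points purely by similarity, with no class labels provided in advance. It assumes scikit-learn is available; the points and the choice of two clusters are illustrative.

from sklearn.cluster import KMeans

# Unlabeled 2-D points: no class labels are given.
points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
          [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]

# Partition the data into 2 groups based purely on similarity (distance).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # the discovered group prototypes (cluster centers)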
