Data mining has attracted a great deal of attention in the information industry
and in society as a whole in recent years, due to the wide availability of huge
amounts of data and the imminent need for turning such data into useful
information and knowledge.
Simply stated, data mining refers to extracting or “mining” knowledge from large
amounts of data.
"The nontrivial extraction of implicit, previously unknown, and potentially useful
information from data" is a classic definition of Knowledge Discovery in
Databases (KDD).
Many people treat data mining as a synonym for the popularly used term
Knowledge Discovery from Data (KDD).
Steps in KDD
• Data cleaning (to remove noise and inconsistent data).
• Data selection (where data relevant to the analysis task are retrieved from the
database).
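The two steps above can be sketched on in-memory records; the record layout and field names are illustrative assumptions, not part of any particular KDD system:

```python
# Hypothetical raw records, including a noisy and an inconsistent one.
records = [
    {"id": 1, "product": "editor", "sales": 1200},
    {"id": 2, "product": "compiler", "sales": None},  # noisy: missing value
    {"id": 3, "product": "debugger", "sales": -50},   # inconsistent: negative sales
    {"id": 4, "product": "profiler", "sales": 800},
]

# Data cleaning: drop records with missing or impossible values.
cleaned = [r for r in records if r["sales"] is not None and r["sales"] >= 0]

# Data selection: retrieve only the attributes relevant to the analysis task.
selected = [{"product": r["product"], "sales": r["sales"]} for r in cleaned]

print(selected)  # two clean records remain
```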
Knowledge base: This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns.
Such knowledge can include concept hierarchies, used to organize attributes or attribute
values into different levels of abstraction.
Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness
based on its unexpectedness, may also be included.
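A concept hierarchy can be sketched as a simple mapping that abstracts attribute values to a higher level of abstraction; the city and country values here are assumptions for illustration:

```python
# Toy concept hierarchy for a 'location' attribute: city -> country.
hierarchy = {
    "Vancouver": "Canada", "Toronto": "Canada",
    "Chicago": "USA", "New York": "USA",
}

# Abstract raw attribute values up one level of the hierarchy.
raw = ["Vancouver", "Chicago", "Toronto", "Toronto"]
abstracted = [hierarchy[city] for city in raw]
print(abstracted)  # ['Canada', 'USA', 'Canada', 'Canada']
```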
Data mining engine: This is essential to the data mining system and ideally consists of a
set of functional modules for tasks such as characterization, association and correlation
analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
User interface: This module communicates between users and the data mining system,
allowing the user to interact with the system by specifying a data mining query or task.
A Three-Tier Data Warehouse Architecture
Data warehouses often adopt a three-tier architecture:
1. The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from operational
databases or other external sources. These tools and utilities perform data extraction,
cleaning, and transformation as well as load and refresh functions to update the data
warehouse.
2. The middle tier is an OLAP server that is typically implemented using either (1) a relational
OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations on
multidimensional data to standard relational operations; or (2) a multidimensional OLAP
(MOLAP) model, that is, a special-purpose server that directly implements multidimensional
data and operations.
3. The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
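As a rough sketch of the ROLAP idea in tier 2, a roll-up over one dimension of a small bottom-tier relational table maps onto a standard SQL GROUP BY; the `sales` schema and its values are assumptions for illustration:

```python
import sqlite3

# Bottom tier: a relational warehouse table (schema is an illustrative assumption).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "editor", 100.0), ("east", "compiler", 150.0),
     ("west", "editor", 200.0), ("west", "compiler", 50.0)],
)

# Middle tier (ROLAP): rolling the cube up over the 'product' dimension
# becomes an ordinary relational GROUP BY on the remaining dimension.
rollup = con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rollup)  # [('east', 250.0), ('west', 250.0)]
```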
From the architecture point of view, there are three data warehouse models: the
enterprise warehouse, the data mart, and the virtual warehouse.
Data mart: A data mart contains a subset of corporate-wide data that is of value to a
specific group of users. The scope is confined to specific selected subjects.
In principle, data mining is not specific to one type of media or data. Data mining should
be applicable to any kind of information repository.
Flat files: Flat files are actually the most common data source for data mining
algorithms, especially at the research level.
Flat files are simple data files in text or binary format with a structure known by the data
mining algorithm to be applied.
The data in these files can be transactions, time-series data, scientific measurements,
etc.
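A minimal sketch of reading such a text flat file, assuming a CSV layout (a transaction id plus a `;`-separated item list) whose structure is known to the mining algorithm that will consume it:

```python
import csv
import io

# A flat file in text format; the layout is an assumption for illustration.
flat = io.StringIO("tid,items\n1,bread;milk\n2,bread;beer\n")

# Parse each row into a transaction: a sorted list of items.
rows = list(csv.DictReader(flat))
transactions = [sorted(r["items"].split(";")) for r in rows]
print(transactions)  # [['bread', 'milk'], ['beer', 'bread']]
```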
Relational Databases: Briefly, a relational database consists of a set of tables
containing either values of entity attributes, or values of attributes from entity
relationships.
Tables have columns and rows, where columns represent attributes and rows
represent tuples.
A data warehouse gives the option to analyze data from different sources under the
same roof.
For example, a user may wish to study the characteristics of software products
whose sales increased by 10% in the last year.
Data discrimination: Comparison of the target class with one or a set of comparative
classes.
Data discrimination is a comparison of the general features of target class data objects
with the general features of objects from one or a set of contrasting classes.
The target and contrasting classes can be specified by the user, and the corresponding
data objects retrieved through database queries.
For example, the user may like to compare the general features of software products
whose sales increased by 10% in the last year with those whose sales decreased by
at least 30% during the same period.
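This target/contrasting-class comparison can be sketched with hypothetical per-product sales-change figures (all names and numbers are assumptions for illustration):

```python
# Hypothetical year-over-year sales changes, as fractions.
change = {
    "editor": 0.12, "compiler": -0.35, "debugger": 0.10,
    "profiler": -0.40, "linker": 0.02,
}

# Target class: products whose sales increased by at least 10%.
target = {p for p, c in change.items() if c >= 0.10}
# Contrasting class: products whose sales decreased by at least 30%.
contrast = {p for p, c in change.items() if c <= -0.30}

print(sorted(target))    # ['debugger', 'editor']
print(sorted(contrast))  # ['compiler', 'profiler']
```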
Mining Frequent Patterns, Associations, and Correlations
Frequent patterns, as the name suggests, are patterns that occur frequently in data. There
are many kinds of frequent patterns, including itemsets, subsequences, and
substructures.
Frequent itemset mining leads to the discovery of associations and correlations among
items in large transactional or relational data sets.
With massive amounts of data continuously being collected and stored, many
industries are becoming interested in mining such patterns from their databases.
This process analyzes customer buying habits by finding associations between the
different items that customers place in their “shopping baskets”.
The discovery of such associations can help retailers develop marketing strategies
by gaining insight into which items are frequently purchased together by customers.
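Finding frequent 2-itemsets in a handful of hypothetical shopping baskets can be sketched by brute-force counting (this is not the full Apriori algorithm, just the support-counting idea):

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (one set of items per transaction).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
min_support = 3  # an itemset is "frequent" if it appears in >= 3 baskets

# Count the support of every 2-itemset across all baskets.
counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        counts[pair] += 1

frequent = {pair for pair, n in counts.items() if n >= min_support}
print(sorted(frequent))
```

Pairs such as (bread, milk) that meet the support threshold are the candidates from which association rules are later derived.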
Classification
These categories can be represented by discrete values, where the ordering among values
has no meaning. For example, the values 1, 2, and 3 may be used to represent treatments
A, B, and C.
The data analysis task is an example of numeric prediction, where the model constructed
predicts a continuous-valued function, or ordered value, as opposed to a categorical
label. This model is a predictor.
Data classification is a two-step process.
In the first step, a classifier is built describing a predetermined set of data classes or
concepts.
This is the learning step (or training phase), where a classification algorithm builds the
classifier by analyzing or “learning from” a training set made up of database tuples and their
associated class labels.
The class label attribute is discrete-valued and unordered. It is categorical in that each value
serves as a category or class.
The individual tuples making up the training set are referred to as training tuples and are
selected from the database under analysis.
Because the class label of each training tuple is provided, this step is also known as
supervised learning.
It contrasts with unsupervised learning (or clustering), in which the class label of each
training tuple is not known, and the number or set of classes to be learned may not be
known in advance.
In the second step, the model is used for classification. First, the predictive
accuracy of the classifier is estimated. If we were to use the training set to
measure the accuracy of the classifier, this estimate would likely be optimistic,
because the classifier tends to overfit the data.
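The two-step process can be sketched with a toy 1-nearest-neighbour classifier; the training and test tuples are illustrative assumptions:

```python
# Step 1 (learning): build a 1-nearest-neighbour classifier from labelled
# training tuples of the form (attribute value, class label).
train = [(1.0, "A"), (1.2, "A"), (3.0, "B"), (3.3, "B")]

def classify(x):
    # Predict the label of the closest training tuple.
    return min(train, key=lambda t: abs(t[0] - x))[1]

# Step 2 (evaluation): estimate accuracy on a held-out test set, not on the
# training set itself, to avoid an optimistic (overfitted) estimate.
test = [(1.1, "A"), (2.9, "B"), (3.5, "B")]
correct = sum(classify(x) == y for x, y in test)
print(correct / len(test))  # 1.0
```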
Clustering
Clustering is the process of grouping data into classes, or clusters, so that objects within
a cluster have high similarity in comparison to one another, but are very dissimilar to
objects in other clusters.
Unlike classification, which requires class-labeled training data, it is often more
desirable to proceed in the reverse direction: first partition the set of data
into groups based on data similarity (e.g., using clustering), and then assign
labels to the relatively small number of groups.
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
Clustering can also be used for outlier detection, where outliers (values that are “far
away” from any cluster) may be more interesting than common cases.
For this reason, clustering is a form of learning by observation, rather than learning by
examples.
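A minimal one-dimensional k-means sketch illustrates how clusters form from similarity alone, with no class labels given in advance (the points and the initialisation are assumptions, and k-means itself is just one of many clustering methods):

```python
# Six 1-D points that visibly form two groups.
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centers = [points[0], points[-1]]  # naive initialisation: first and last point

for _ in range(10):
    # Assignment step: each point joins the cluster of its nearest centre.
    clusters = [[], []]
    for p in points:
        nearest = min((0, 1), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # Update step: each centre moves to the mean of its cluster.
    centers = [sum(c) / len(c) for c in clusters]

print(clusters)  # [[1.0, 1.5, 2.0], [8.0, 8.5, 9.0]]
print(centers)   # [1.5, 8.5]
```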