
Data Mining Functionalities

Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. Data
mining tasks can be classified into two categories: descriptive and predictive.

Descriptive mining tasks characterize the general properties of the data in the database.

Predictive mining tasks perform inference on the current data in order to make predictions.

Concept/Class Description: Characterization and Discrimination

Data can be associated with classes or concepts. For example, in an electronics store, classes of items
for sale include computers and printers, and concepts of customers include bigSpenders and
budgetSpenders.

Data characterization

Data characterization is a summarization of the general characteristics or features of a target class of
data.

Data discrimination

Data discrimination is a comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes.
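As a rough illustration of the two ideas, the sketch below summarizes a target class (characterization) and then compares that summary with a contrasting class (discrimination). The customer records, the `age` and `annual_spend` fields, and the `characterize` helper are all hypothetical; only the class names bigSpenders and budgetSpenders come from the text.

```python
from statistics import mean

# Hypothetical customer records; the values are made up for illustration.
customers = [
    {"segment": "bigSpenders", "age": 45, "annual_spend": 5200},
    {"segment": "bigSpenders", "age": 50, "annual_spend": 4800},
    {"segment": "bigSpenders", "age": 40, "annual_spend": 5000},
    {"segment": "budgetSpenders", "age": 25, "annual_spend": 300},
    {"segment": "budgetSpenders", "age": 30, "annual_spend": 500},
    {"segment": "budgetSpenders", "age": 35, "annual_spend": 400},
]

def characterize(records, segment):
    """Summarize the general features of one class of customers."""
    group = [r for r in records if r["segment"] == segment]
    return {
        "count": len(group),
        "mean_age": mean(r["age"] for r in group),
        "mean_spend": mean(r["annual_spend"] for r in group),
    }

# Characterization: a summary of the target class.
target = characterize(customers, "bigSpenders")

# Discrimination: the same features compared against a contrasting class.
contrast = characterize(customers, "budgetSpenders")
spend_gap = target["mean_spend"] - contrast["mean_spend"]
age_gap = target["mean_age"] - contrast["mean_age"]
```

Means are only one possible summary; counts, ranges, or generalized rules would serve the same purpose.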

Mining Frequent Patterns, Associations, and Correlations

Frequent patterns are patterns that occur frequently in data. There are many kinds of frequent
patterns, including itemsets, subsequences, and substructures.

Association analysis

Suppose, as a marketing manager, you would like to determine which items are frequently purchased
together within the same transactions. Such a relationship can be written as an association rule, for
example:

buys(X,“computer”) => buys(X,“software”) [support=1%,confidence=50%]

where X is a variable representing a customer. Confidence=50% means that if a customer buys a
computer, there is a 50% chance that she will buy software as well.

Support=1% means that 1% of all of the transactions under analysis showed that computer and software
were purchased together.
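The two measures can be computed directly from a list of transactions. The following is a minimal sketch: the transactions are made up, and `support` and `confidence` are illustrative helper names rather than functions from any particular library.

```python
# Toy transactions; the items and counts are invented for illustration.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"software"},
    {"computer", "printer"},
    {"printer", "paper"},
    {"computer", "software"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing `antecedent`, the fraction that also
    contain `consequent`."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule: buys(X, "computer") => buys(X, "software")
rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
```

In this toy data, 3 of the 8 transactions contain both items (support 37.5%), and 3 of the 5 transactions that contain a computer also contain software (confidence 60%).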

Classification and Prediction

Classification is the process of finding a model that describes and distinguishes data classes for the
purpose of being able to use the model to predict the class of objects whose class label is unknown.
“How is the derived model presented?” The derived model may be represented in various forms, such as
classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks.

A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value,
each branch represents an outcome of the test, and tree leaves represent classes or class distributions.

[Figure: Decision tree]
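A decision tree maps naturally onto nested attribute tests, and each path from the root to a leaf reads as one IF-THEN rule. The sketch below is hand-written rather than learned from data; the attributes `age`, `student`, and `credit_rating` and the class labels are hypothetical.

```python
def classify_customer(customer):
    """Walk a small hand-written decision tree and return a class label."""
    # Root node: test on the attribute "age".
    if customer["age"] == "youth":
        # Internal node: test on the attribute "student".
        # Rule: IF age = youth AND student = yes THEN buys_computer
        return "buys_computer" if customer["student"] else "no_purchase"
    if customer["age"] == "middle_aged":
        # Leaf: every object reaching this branch gets the same class.
        return "buys_computer"
    # Remaining branch (senior): test on the attribute "credit_rating".
    if customer["credit_rating"] == "excellent":
        return "buys_computer"
    return "no_purchase"

label = classify_customer({"age": "youth", "student": True, "credit_rating": "fair"})
```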

A neural network, when used for classification, is typically a collection of neuron-like processing units
with weighted connections between the units.

[Figure: Neural network]
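To make "neuron-like units with weighted connections" concrete, here is a minimal sketch of a forward pass through a tiny network with two inputs, two hidden units, and one output unit. The weights, biases, sigmoid activation, and class names are assumptions chosen for illustration; a real network would learn its weights from training data.

```python
import math

def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def unit_output(inputs, weights, bias):
    """One neuron-like unit: weighted sum of its inputs, then an activation."""
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)

def forward(features):
    """Pass two input features through two hidden units and one output unit."""
    # Made-up weights and biases for the weighted connections.
    hidden = [
        unit_output(features, [0.8, -0.4], 0.1),
        unit_output(features, [-0.3, 0.9], 0.0),
    ]
    score = unit_output(hidden, [1.2, -1.1], 0.05)
    label = "class_A" if score >= 0.5 else "class_B"
    return label, score

label, score = forward([1.0, 0.5])
```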

Cluster Analysis

Classification and prediction analyze class-labeled data objects, whereas clustering analyzes data
objects without consulting a known class label.

[Figure: Cluster analysis]

The objects are grouped based on the principle of maximizing the intraclass similarity and minimizing
the interclass similarity. That is, clusters of objects are formed so that objects within a cluster have high
similarity in comparison to one another, but are very dissimilar to objects in other clusters.
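The text does not name a clustering algorithm. As one common choice, the sketch below uses plain k-means on a handful of made-up 2-D points, with Euclidean distance standing in for (dis)similarity and the starting centroids picked by hand so the run is deterministic.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

# Two visibly separate groups of unlabeled points (invented data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = kmeans(points, centroids=[points[0], points[3]])
```

No class labels are supplied; the grouping emerges only from the distances between objects.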

Outlier Analysis

A database may contain data objects that do not comply with the general behavior or model of the data.
These data objects are outliers. Most data mining methods discard outliers as noise or exceptions. The
analysis of outlier data is referred to as outlier mining.
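One simple way to flag candidate outliers is a z-score rule: mark values that lie more than a chosen number of standard deviations from the mean. This is only a sketch; the threshold of 2.0 and the purchase amounts are assumptions, and other definitions of "outlier" are equally valid.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations
    from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Made-up purchase amounts with one value far from the rest.
purchase_amounts = [10, 12, 11, 13, 12, 11, 95]
outliers = z_score_outliers(purchase_amounts)
```

Because an extreme value inflates both the mean and the standard deviation, this rule can miss outliers in small samples; median-based rules are a common alternative.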

- - END - -

Data Mining, or Knowledge Discovery in Databases (KDD) as it is also known, is the nontrivial
extraction of implicit, previously unknown, and potentially useful information from data. This
encompasses a number of different technical approaches, such as clustering, data summarization,
learning classification rules, finding dependency networks, analysing changes, and detecting anomalies.

Data mining is the search for relationships and global patterns that exist in large
databases but are "hidden" among the vast amount of data, such as a relationship
between patient data and their medical diagnosis. These relationships represent
valuable knowledge about the database and the objects in the database and, if the
database is a faithful mirror, of the real world registered by the database.

Data mining refers to "using a variety of techniques to identify nuggets of information
or decision-making knowledge in bodies of data, and extracting these in such a way
that they can be put to use in the areas such as decision support, prediction,
forecasting and estimation. The data is often voluminous, but as it stands of low value
as no direct use can be made of it; it is the hidden information in the data that is
useful."

Types of Data Sets

- Record
  - Relational records
  - Data matrix, e.g., numerical matrix, crosstabs
  - Document data: text documents, term-frequency vector
  - Transaction data
- Graph and network
  - World Wide Web
  - Social or information networks
  - Molecular structures
- Ordered
  - Video data: sequence of images
  - Temporal data: time-series
  - Sequential data: transaction sequences
  - Genetic sequence data
- Spatial, image and multimedia
  - Spatial data: maps
  - Image data
  - Video data
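As an example of the document data type listed above, a text document can be turned into a term-frequency vector by counting how often each vocabulary word occurs in it. The two documents below are made up, and the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter

# Two invented documents; real text would need proper tokenization.
documents = [
    "data mining finds patterns in data",
    "mining frequent patterns",
]

# Shared vocabulary: every distinct word, in a fixed (sorted) order.
vocabulary = sorted({word for doc in documents for word in doc.split()})

def term_frequency_vector(doc, vocabulary):
    """Count how many times each vocabulary word occurs in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [term_frequency_vector(doc, vocabulary) for doc in documents]
```

Stacking these vectors row by row gives a data matrix with one row per document and one column per term.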

IMPORTANT CHARACTERISTICS OF STRUCTURED DATA

- Dimensionality: curse of dimensionality
- Sparsity: only presence counts
- Resolution: patterns depend on the scale
- Distribution: centrality and dispersion

Data Objects

- Data sets are made up of data objects.
- A data object represents an entity.
- Examples:
  - sales database: customers, store items, sales
  - medical database: patients, treatments
  - university database: students, professors, courses
- Also called samples, examples, instances, data points, objects, tuples.
- Data objects are described by attributes.
- Database rows -> data objects; columns -> attributes.
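The row and column correspondence can be sketched with a small in-memory table, where each row is one data object and each key is one attribute. The customer values are invented for illustration.

```python
# Each dictionary is one data object (a row); each key is an attribute (a column).
customers = [
    {"customer_ID": 1, "name": "Ana", "address": "12 Oak St"},
    {"customer_ID": 2, "name": "Ben", "address": "7 Elm Ave"},
]

attributes = list(customers[0].keys())      # the columns
number_of_objects = len(customers)          # the rows
names = [c["name"] for c in customers]      # one attribute across all objects
```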

ATTRIBUTES

- Attribute (or dimension, feature, variable): a data field representing a
  characteristic or feature of a data object.
  - E.g., customer_ID, name, address
- Types:
  - Nominal
  - Binary
  - Ordinal
  - Numeric: quantitative
    - Interval-scaled
    - Ratio-scaled

ATTRIBUTE TYPES

- Nominal: categories, states, or “names of things”
  - Hair_color = {auburn, black, blond, brown, grey, red, white}
  - marital status, occupation, ID numbers, zip codes
- Binary
  - Nominal attribute with only 2 states (0 and 1)
  - Symmetric binary: both outcomes equally important
    - e.g., gender
  - Asymmetric binary: outcomes not equally important
    - e.g., medical test (positive vs. negative)
    - Convention: assign 1 to the most important outcome (e.g., HIV positive)
- Ordinal
  - Values have a meaningful order (ranking), but the magnitude between
    successive values is not known.
  - Size = {small, medium, large}, grades, army rankings

NUMERIC ATTRIBUTE TYPES

- Quantity (integer or real-valued)
- Interval
  - Measured on a scale of equal-sized units
  - Values have order
    - E.g., temperature in °C or °F, calendar dates
  - No true zero-point
- Ratio
  - Inherent zero-point
  - We can speak of one value as being a multiple of another
    (10 K is twice as high as 5 K).
  - E.g., temperature in Kelvin, length, counts, monetary quantities
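The practical difference between the two scales shows up when taking ratios. In the sketch below, the temperatures are arbitrary example values: differences agree on both scales, but a ratio is only meaningful on the Kelvin scale, which has a true zero-point.

```python
def celsius_to_kelvin(celsius):
    """Convert an interval-scaled Celsius value to ratio-scaled Kelvin."""
    return celsius + 273.15

warm_c, cool_c = 10.0, 5.0

# Differences are meaningful on both scales and agree with each other.
difference_c = warm_c - cool_c
difference_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# A ratio of Celsius readings is an artifact of where zero was placed.
naive_ratio = warm_c / cool_c

# On the Kelvin scale the ratio reflects the actual physical relationship.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)
```

Here `naive_ratio` is 2.0 while `kelvin_ratio` is roughly 1.018, so 10 °C is not "twice as warm" as 5 °C.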
DISCRETE VS. CONTINUOUS ATTRIBUTES

- Discrete attribute
  - Has only a finite or countably infinite set of values
    - E.g., zip codes, profession, or the set of words in a collection of
      documents
  - Sometimes represented as integer variables
  - Note: binary attributes are a special case of discrete attributes
- Continuous attribute
  - Has real numbers as attribute values
    - E.g., temperature, height, or weight
  - Practically, real values can only be measured and represented using a
    finite number of digits
  - Continuous attributes are typically represented as floating-point
    variables
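The finite-digits point has a visible consequence in code: floating-point values are approximations, so continuous attributes are usually compared with a tolerance, while discrete attributes can be compared exactly. The values below are arbitrary examples.

```python
import math

# Discrete attributes: exact comparison is safe.
zip_code = 90210
same_zip = (zip_code == 90210)

# Continuous attribute: stored with a finite number of binary digits,
# so arithmetic can introduce tiny representation errors.
total = 0.1 + 0.2
exactly_equal = (total == 0.3)
close_enough = math.isclose(total, 0.3)
```

In this example `exactly_equal` is `False` and `close_enough` is `True`.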
