
INTRODUCTION TO DATA MINING

Dr. Dipti Chauhan
Assistant Professor
SCSIT, SUAS Indore
WHY DATA MINING?

The Explosive Growth of Data: from terabytes to petabytes


 Data collection and data availability
 Automated data collection tools, database systems, Web, computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube

We are drowning in data, but starving for knowledge!

“Necessity is the mother of invention”: data mining is the automated analysis of massive data sets.

Data mining turns a large collection of data into knowledge.

DATA MINING AS THE EVOLUTION OF INFORMATION TECHNOLOGY

The evolution of database system technology.


EVOLUTION OF DATABASE TECHNOLOGY
1960s:
 Data collection, database creation, IMS and network DBMS

1970s:
 Relational data model, relational DBMS implementation

1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s:
 Data mining, data warehousing, multimedia databases, and Web databases

2000s:
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information systems

WHAT IS DATA MINING?

Data mining (knowledge discovery from data)


 Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data
 Data mining: a misnomer?

Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information
harvesting, business intelligence, etc.

Watch out: Is everything “data mining”? No. The following, for example, are not:
 Simple search and query processing
 (Deductive) expert systems

KNOWLEDGE DISCOVERY (KDD) PROCESS
This is a view from the typical database systems and data warehousing communities.
Data mining plays an essential role in the knowledge discovery process.
KNOWLEDGE DISCOVERY FROM DATA (KDD)
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data
patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present mined knowledge to users)
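The seven steps above can be sketched as a small pipeline. This is only an illustrative sketch: the toy records, the field names (`customer`, `item`, `amount`), the function names and the 0.5 support threshold are all assumptions made for the example, not part of the original slides.

```python
from collections import Counter

# Hypothetical toy records from two sources; None marks a noisy (missing) value.
STORE_A = [
    {"customer": "c1", "item": "table", "amount": 120.0},
    {"customer": "c2", "item": "chair", "amount": None},
    {"customer": "c1", "item": "chair", "amount": 45.0},
]
STORE_B = [
    {"customer": "c3", "item": "table", "amount": 110.0},
    {"customer": "c3", "item": "chair", "amount": 40.0},
    {"customer": "c2", "item": "printer", "amount": 90.0},
    {"customer": "c4", "item": "table", "amount": 95.0},
    {"customer": "c4", "item": "lamp", "amount": 20.0},
]

def clean(records):
    """Step 1: remove records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, relevant_items):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r["item"] in relevant_items]

def transform(records):
    """Step 4: aggregate records into one set of items per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Step 5: count how many customers bought each item."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, n_baskets, min_support):
    """Step 6: keep only patterns whose support reaches the threshold."""
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_support}

def present(patterns):
    """Step 7: format the surviving patterns for the user."""
    return [f"{item}: support {s:.2f}" for item, s in sorted(patterns.items())]

def run_kdd(min_support=0.5):
    records = integrate(clean(STORE_A), clean(STORE_B))
    records = select(records, {"table", "chair", "printer"})
    baskets = transform(records)
    patterns = evaluate(mine(baskets), len(baskets), min_support)
    return present(patterns)
```

Calling `run_kdd()` returns the items bought by at least half of the customers. In practice each step is far more involved; the point here is only the order of the stages.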
EXAMPLE: A DATA MINING FRAMEWORK

Data mining usually involves


 Data cleaning
 Data integration from multiple sources
 Warehousing the data
 Data cube construction
 Data selection for data mining
 Data mining
 Presentation of the mining results
 Patterns and knowledge to be used or stored into knowledge-base

DATA MINING: ON WHAT KINDS OF DATA?
Data mining can be applied to any kind of data as long as the data are meaningful
for a target application.

The basic repositories include:
 Relational databases
 Data warehouses
 Transactional databases
 Advanced database systems
 Flat files
 Data streams
 WWW

Advanced database systems include:
 Object-oriented relational databases
 Time-series data
 Spatial databases
 Multimedia databases
 Heterogeneous databases and legacy databases
 Structured data, graphs, social networks and multi-linked data
 Data streams and sensor data
DATA MINING FUNCTIONALITIES- WHAT KIND OF PATTERNS CAN BE MINED

Data mining functionalities are used to specify the kind of patterns to be found
in data mining tasks.
Data mining tasks can be classified into two categories:
1. Descriptive: Descriptive mining tasks characterize properties of the data in
a target data set.
 These tasks present the general properties of data stored in a database. Descriptive tasks
are used to find patterns in data, e.g., clusters, correlations, trends and anomalies.

2. Predictive: Predictive mining tasks perform induction on the current data in
order to make predictions.
 Predictive data mining tasks predict the value of one attribute on the basis of the values of other
attributes. The attribute being predicted is known as the target or dependent variable, and the
attributes used for making the prediction are known as independent variables.
DATA MINING FUNCTIONALITIES- WHAT KIND OF PATTERNS CAN BE MINED

Data mining functionalities, and the kinds of patterns they can discover, are:

1. Associations and Correlations
2. Classification and Regression
3. Clustering Analysis
4. Outlier Analysis
DATA MINING FUNCTIONALITIES- WHAT KIND OF PATTERNS CAN BE MINED

Data mining functionalities are:

1. Prediction:
A predictive model determines a future outcome rather than present behavior. The predicted
attribute can be numeric or categorical. Prediction involves identifying the set of
attributes relevant to the attribute of interest and predicting the value distribution based on
the set of data similar to the selected object(s). For example, one may predict the kind of disease
based on the symptoms of a patient.
2. Classification:
Classification builds a model from data with predefined classes; the model is then used to
classify new instances whose class is not known. The instances used to create the model are
known as training data. The model may take the form of a decision tree or a set of classification
rules, which can then be applied to identify the class of future data. For example,
one may classify an employee's potential salary on the basis of the salary classification of similar
employees in the company.
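As a rough illustration of learning classification rules from training data, the sketch below derives one rule per attribute value (the majority class seen for that value) and applies the rules to a new instance. This is a deliberately simple single-attribute learner, not a full decision tree algorithm; the `role` and `salary_band` fields and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training data: instances with a known class (salary_band).
TRAINING = [
    {"role": "engineer", "salary_band": "high"},
    {"role": "engineer", "salary_band": "high"},
    {"role": "engineer", "salary_band": "medium"},
    {"role": "clerk", "salary_band": "low"},
    {"role": "clerk", "salary_band": "low"},
    {"role": "manager", "salary_band": "high"},
]

def learn_rules(training_data, attribute, label):
    """Build rules of the form 'attribute value -> majority class'."""
    class_counts = defaultdict(Counter)
    for row in training_data:
        class_counts[row[attribute]][row[label]] += 1
    return {value: counts.most_common(1)[0][0]
            for value, counts in class_counts.items()}

def classify(rules, instance, attribute, default=None):
    """Classify a new instance; return default if no rule matches."""
    return rules.get(instance[attribute], default)

RULES = learn_rules(TRAINING, "role", "salary_band")
```

For example, `classify(RULES, {"role": "clerk"}, "role")` yields the class assigned to similar employees in the training data, while an unseen role falls back to the default.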
DATA MINING FUNCTIONALITIES CONTD..- WHAT KIND OF PATTERNS CAN BE MINED

3. Clustering:
Clustering is the process of partitioning a set of objects or data into groups called
clusters, such that the objects in a cluster are more similar (in some sense or another) to each
other than to those in other clusters. Clustering is used in many fields, including machine
learning, pattern recognition, bioinformatics, image analysis and information retrieval.
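One common way to partition objects into clusters is k-means, sketched below for one-dimensional points. The slides do not name a specific algorithm, so the choice of k-means, the absolute-difference similarity measure, the sample points and the fixed initial centroids are all assumptions made for illustration.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

POINTS = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
CENTROIDS, CLUSTERS = k_means_1d(POINTS, centroids=[1.0, 12.0])
```

With these well-separated points the low values end up in one cluster and the high values in the other. Real data usually needs several random initializations, because k-means can settle on a poor partition.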
4. Mining Frequent Patterns, Associations and Correlations:
A frequent pattern is a pattern (a set of items, a subsequence, a substructure, etc.)
that appears frequently in data. A frequent itemset is a set of items that occur
frequently together in a transaction data set, for example, table and chair.
A subsequence is an ordered pattern, such as buying a computer system first, then a UPS, and
thereafter a printer; if it appears frequently in a shopping history database, it is called a
frequent sequential pattern. A substructure refers to structural forms such as subgraphs or
subtrees. If a substructure appears frequently, it is called a frequent structural pattern.
Discovering such frequent patterns plays an important role in association and correlation
mining, clustering and other data mining tasks.
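A frequent itemset can be found by counting how often each combination of items occurs across transactions and keeping those above a minimum support. The sketch below uses brute-force enumeration, which is only practical for tiny data sets; it is not Apriori or FP-growth. The transactions and the 0.5 threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction data set.
TRANSACTIONS = [
    ["table", "chair"],
    ["table", "chair", "lamp"],
    ["table"],
    ["chair", "lamp"],
]

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items
    whose support (fraction of transactions) is at least min_support."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # sorted so tuples are canonical
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    n = len(transactions)
    return {itemset: c / n for itemset, c in counts.items()
            if c / n >= min_support}

FREQUENT = frequent_itemsets(TRANSACTIONS, min_support=0.5)
```

Here the pair `("chair", "table")` is frequent because it appears in half of the transactions, while `("lamp", "table")` appears only once and is dropped.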
DATA MINING FUNCTIONALITIES CONTD..- WHAT KIND OF PATTERNS CAN BE MINED

5. Outlier Analysis:
A data set may contain objects that do not comply with the general behavior or
model of the data. These data objects are outliers. Many data mining methods
discard outliers as noise or exceptions. However, in some applications (e.g., fraud
detection) the rare events can be more interesting than the more regularly occurring
ones. The analysis of outlier data is referred to as outlier analysis or anomaly
mining.
Outliers may be detected using statistical tests that assume a distribution or
probability model for the data, or using distance measures where objects that are
remote from any other cluster are considered outliers. Rather than using statistical or
distance measures, density-based methods may identify outliers in a local region,
although they look normal from a global statistical distribution view.
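The statistical approach can be illustrated with a simple z-score test, which flags values lying far from the mean in units of standard deviation. This sketch assumes the data are roughly normally distributed; the sample values and the 2.5 threshold are arbitrary choices for illustration, not values from the slides.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.5):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts with one unusually large value.
AMOUNTS = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
OUTLIERS = z_score_outliers(AMOUNTS)
```

In a fraud detection setting the flagged value would be the interesting one to inspect rather than noise to discard. Note that a single extreme value inflates the mean and standard deviation, so this test can miss outliers in small samples.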
DATA MINING FUNCTIONALITIES CONTD..- WHAT KIND OF PATTERNS CAN BE MINED

6. Concept/Class Description: Characterization and Discrimination


Data can be associated with classes or concepts. For example, in the AllElectronics store, classes of
items for sale include computers and printers, and concepts of customers include bigSpenders and
budgetSpenders. Such descriptions of a class or a concept are called class/concept descriptions.
These descriptions can be derived using:
Data characterization
Data characterization is a summarization of the general characteristics or features of a target class
of data. For example, summarize the characteristics of customers who spend more than $5000 a
year at AllElectronics.
Data discrimination
Data discrimination is a comparison of the general features of target class data objects with the
general features of objects from one or a set of contrasting classes. For example, a user may want
to compare the general features of software products with sales that increased by 10% last year
against those with sales that decreased by at least 30% during the same period.
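Characterization and discrimination can both be sketched with simple summaries. Below, characterization summarizes one attribute of the target class (customers spending more than $5000 a year), and discrimination compares the same attribute against a contrasting class. The customer records, the `age` attribute and the function names are hypothetical.

```python
from statistics import mean

# Hypothetical customer records.
CUSTOMERS = [
    {"name": "a", "age": 45, "annual_spend": 7200},
    {"name": "b", "age": 38, "annual_spend": 6100},
    {"name": "c", "age": 23, "annual_spend": 900},
    {"name": "d", "age": 27, "annual_spend": 1500},
]

def characterize(rows, attribute):
    """Summarize the general features of one class for one attribute."""
    values = [r[attribute] for r in rows]
    return {"count": len(values), "mean": mean(values),
            "min": min(values), "max": max(values)}

def discriminate(target, contrast, attribute):
    """Compare the target class with a contrasting class:
    difference between the mean attribute values."""
    target_mean = mean([r[attribute] for r in target])
    contrast_mean = mean([r[attribute] for r in contrast])
    return target_mean - contrast_mean

BIG_SPENDERS = [r for r in CUSTOMERS if r["annual_spend"] > 5000]
BUDGET_SPENDERS = [r for r in CUSTOMERS if r["annual_spend"] <= 5000]
```

`characterize(BIG_SPENDERS, "age")` describes the target class on its own, while `discriminate(BIG_SPENDERS, BUDGET_SPENDERS, "age")` shows how the two classes differ.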
