
Data Warehousing and Data Mining Unit 8

Unit 8 Introduction to Data Mining


Structure:
8.1 Introduction
Objectives
8.2 Meaning and Working of Data Mining
8.3 Data, Information and Knowledge
8.4 Relation between Data Warehousing and Data Mining
8.5 Data Mining and Knowledge Discovery Process
8.6 Data Mining and Online Analytical Processing (OLAP)
8.7 Data Mining and Statistics
8.8 Data Mining Technologies
8.9 Data Mining Software
8.10 Summary
8.11 Terminal Questions
8.12 Answers

8.1 Introduction
So far, in the previous units, we have discussed data warehousing. From this
unit onward, we introduce data mining and knowledge discovery from data. In
the previous semesters you studied database management systems. These units
present data mining from a database perspective, where emphasis is placed on
basic data mining concepts and techniques for uncovering interesting data
patterns hidden in large data sets. Data mining is the process of analyzing
data from different perspectives and summarizing it into useful information:
information that can be used to increase revenue, cut costs, or both. The
implementation methods discussed are particularly oriented toward the
development of scalable and efficient data mining tools. In this unit, you
will learn how data mining is part of the natural evolution of database
technology, how it is defined, and why it is important. In addition to
studying data mining technologies, you will also read about data mining
software tools.

Sikkim Manipal University Page No.: 108



Objectives:
After studying this unit, you should be able to:
- explain the basics of data mining
- describe the relationship between data mining and other Business
  Intelligence technologies such as Data Warehousing, OLAP and Statistics
- discuss data mining technologies
- list the data mining software available in the market

8.2 Meaning and Working of Data Mining


Data mining is concerned with finding hidden relationships present in
business data to allow businesses to make predictions for future use. It is
the process of data-driven extraction of not so obvious, but useful
information from large databases. Data mining has emerged as a key
business intelligence technology. But where can it be useful, and how does
it work?
We will illustrate the purpose of data mining with a point-of-sale (POS)
system. Supermarkets usually employ a POS system that collects data on each
item that is purchased. The POS system records the item's brand name,
category and size, the time and date of the purchase, and the price at which
the item was purchased. In addition, the supermarket usually has a customer
rewards program, which is also an input to the POS system. This information
can directly link the products purchased with an individual. The supermarket
stores all this data, for every purchase made over many years, in a
database.
Now you have a database with millions of records. What will you do with
this huge volume of data? How do you use it to forecast or control your
business activities? The answer is data mining: using data mining
techniques or algorithms, you can uncover trends, statistical correlations,
relationships and patterns that can help your business become more
efficient, effective and streamlined.
The supermarket can now figure out which brands sell the most, which time
of the day, week, month or year is the busiest, and which products
consumers buy along with certain items. For instance, if a person buys white
bread, what other items would he or she be inclined to buy? Typically we
find that it is peanut butter and jelly. A supermarket can obtain a great
deal of useful information just by mining the data it has already collected.
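The "what else do bread buyers purchase" question can be sketched in a few lines of Python. This is only an illustration: the baskets, the item names and the helper `bought_with` are hypothetical, and real market basket analysis works on far larger data with dedicated association-rule algorithms.

```python
from collections import Counter

# Hypothetical shopping baskets; item names are illustrative only.
baskets = [
    {"white bread", "peanut butter", "jelly"},
    {"white bread", "peanut butter"},
    {"white bread", "milk"},
    {"milk", "eggs"},
    {"white bread", "peanut butter", "jelly", "milk"},
]

def bought_with(item, baskets):
    """Return {other_item: share of baskets containing `item`
    that also contain other_item}."""
    with_item = [b for b in baskets if item in b]
    counts = Counter(other for b in with_item for other in b if other != item)
    return {other: n / len(with_item) for other, n in counts.items()}

companions = bought_with("white bread", baskets)
# List the most frequent companions first.
for other, share in sorted(companions.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{other}: {share:.2f}")
```

In this made-up data, peanut butter appears in three of the four baskets that contain white bread, so it is reported as the strongest companion item.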
There are various definitions of data mining. Some of them are listed below.
- Data mining is the efficient discovery of valuable, non-obvious
  information from a large collection of data.
- Knowledge discovery in databases is the non-trivial process of
  identifying valid, novel, potentially useful and ultimately understandable
  patterns in the data.
- It is the automatic discovery of new facts and relationships in data that
  are like valuable nuggets of business data.
- It is the process of extracting previously unknown, valid, and actionable
  information from large databases and then using the information to
  make crucial business decisions.
- It is an interdisciplinary field bringing together techniques from machine
  learning, pattern recognition, statistics, databases, visualization, and
  neural networks.
Data mining streamlines the transformation of masses of information into
meaningful knowledge, which is the bottom line of business intelligence.
Typical techniques for data mining include decision trees, neural networks,
nearest neighbor clustering, fuzzy logic, and genetic algorithms.
How does data mining work?
Although data mining is still in its infancy, companies in a wide range of
industries, including finance, health care, manufacturing and
transportation, are already using data mining tools and techniques to take
advantage of historical data.
The whole logic of data mining is based on modeling. Modeling is simply the
act of building a model (a set of examples or a mathematical relationship)
based on data from situations where the answer is known and then applying
the model to other situations where the answers are not known.
As a simple example of building a model, consider the director of marketing
for a telecommunications company. He would like to focus his marketing
and sales efforts on the segments of the population most likely to become
big users of long-distance services. He knows a lot about his customers, but
it is impossible to discern the common characteristics of his best customers
because there are so many variables. From his existing database of
customers, which contains information such as age, sex, credit history,
income, zip code and occupation, he can use data mining tools, such as
neural networks, to identify the characteristics of those customers who make
lots of long-distance calls. For instance, he might learn that his best
customers are unmarried females between the ages of 21 and 35 who earn
in excess of $60,000 per year. This, then, is his model for high-value
customers, and he would budget his marketing efforts accordingly.
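The idea of building a model where the answer is known and applying it where it is not can be sketched as follows. This is a deliberately simple segment-rate model, not a neural network; the customer records, the two attributes used, and the functions `segment`, `build_model` and `predict` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical labelled customers: (age, married, heavy_long_distance_user)
known = [
    (24, False, True), (29, False, True), (33, False, True),
    (27, True, False), (45, True, False), (52, False, False),
    (31, False, False), (38, True, True),
]

def segment(age, married):
    """Map a customer to a coarse segment (age band, marital status)."""
    band = "21-35" if 21 <= age <= 35 else "other"
    return (band, married)

def build_model(rows):
    """Learn the share of heavy users in each segment from labelled rows."""
    totals, heavy = defaultdict(int), defaultdict(int)
    for age, married, is_heavy in rows:
        key = segment(age, married)
        totals[key] += 1
        heavy[key] += is_heavy
    return {key: heavy[key] / totals[key] for key in totals}

def predict(model, age, married, default=0.0):
    """Apply the model to a customer whose usage is not yet known."""
    return model.get(segment(age, married), default)

model = build_model(known)
print(predict(model, 26, False))  # unmarried prospect aged 26
print(predict(model, 50, True))   # married prospect aged 50
```

The model is nothing more than a table of rates per segment, which is enough to show the two phases: learning from customers whose behavior is known, then scoring prospects whose behavior is unknown.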
Remember, data mining is the task of discovering interesting patterns from
large amounts of data where the data can be stored in databases, data
warehouses or other information repositories.

8.3 Data, Information and Knowledge
Data are any facts, numbers, or text that can be processed by a computer.
Today organizations are accumulating vast and growing amounts of data in
different formats and databases. This includes:
- Operational or transactional data, such as sales, cost, inventory, payroll,
  and accounting.
- Nonoperational data, such as industry sales, forecast data, and
  macroeconomic data.
- Metadata: data about the data itself, such as logical database design
  or data dictionary definitions.
Information: The patterns, associations, or relationships among all this
data can provide information. For example, analysis of retail point-of-sale
transaction data can yield information on which products are selling
and when.
Knowledge: Information can be converted into knowledge about historical
patterns and future trends. For example, summary information on retail
supermarket sales can be analyzed in light of promotional efforts to
provide knowledge of consumer buying behavior. Thus, a manufacturer
or a retailer could determine those items that are most susceptible to
promotional efforts.
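The step from data to information to knowledge can be illustrated with a small sketch. The sales records, product names and the function `promotion_lift` are hypothetical; the point is only that raw records (data) are summarized into averages (information), which are then compared to see which items respond to promotion (knowledge).

```python
from collections import defaultdict

# Hypothetical POS records: (product, week, units_sold, on_promotion)
sales = [
    ("cereal", 1, 100, False), ("cereal", 2, 150, True),
    ("cereal", 3, 110, False), ("cereal", 4, 160, True),
    ("salt", 1, 40, False), ("salt", 2, 42, True),
    ("salt", 3, 38, False), ("salt", 4, 40, True),
]

def promotion_lift(records):
    """Ratio of average promoted units to average regular units, per product.
    Assumes every product has at least one promoted and one regular week."""
    units = defaultdict(lambda: {True: [], False: []})
    for product, _week, sold, promo in records:
        units[product][promo].append(sold)
    lift = {}
    for product, by_promo in units.items():
        promoted = sum(by_promo[True]) / len(by_promo[True])
        regular = sum(by_promo[False]) / len(by_promo[False])
        lift[product] = promoted / regular
    return lift

lift = promotion_lift(sales)
for product, ratio in sorted(lift.items(), key=lambda kv: -kv[1]):
    print(f"{product}: promoted weeks sell {ratio:.2f}x regular weeks")
```

In this invented data, cereal sells noticeably more during promoted weeks while salt barely changes, which is the kind of conclusion a retailer could act on.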


Self Assessment Questions


1. Information can be converted into knowledge about _______ patterns
and future trends.
2. Data about data is called _____________________.
3. Facts, numbers, or text are called _________________.
4. ____________ and _________________ are the key emerging
Business Intelligence technologies.
5. Data mining is also called ___________________.

8.4 Relation between Data Warehousing and Data Mining


The connection between data warehousing and data mining is indisputable.
Many business organizations use these technologies together. This section
describes the relation between the two. Data mining is concerned with
finding hidden relationships present in business data to allow businesses to
make predictions for future use. It is the process of data-driven extraction
of not so obvious but useful information from large databases. Data mining
has emerged as a key business intelligence technology.
Data mining is a multidisciplinary field drawing on work from statistics,
database technology, artificial intelligence, pattern recognition, machine
learning, information theory, knowledge acquisition, information retrieval,
high-performance computing, and data visualization.
The aim of data mining is to extract implicit, previously unknown and
potentially useful (or actionable) patterns from data. Data mining consists
of many techniques, such as classification (decision trees, naive Bayes
classifier, k-nearest neighbor, and neural networks), clustering (k-means,
hierarchical clustering, and density-based clustering), and association
(one-dimensional, multidimensional, multilevel, and constraint-based
association). Many years of practice show that data mining is a process, and
its successful application requires data preprocessing (dimensionality
reduction, cleaning, noise/outlier removal), postprocessing
(understandability, summary, presentation), a good understanding of the
problem domain, and domain expertise.
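Of the techniques named above, k-means clustering is easy to sketch. The version below works on plain numbers only, and the spending figures, the starting centers and the function `k_means_1d` are assumptions chosen for illustration rather than a production implementation.

```python
def k_means_1d(values, centers, iterations=10):
    """Plain k-means on numbers: assign each value to its nearest center,
    then move each center to the mean of its assigned values."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spend per customer.
spend = [12, 15, 14, 80, 85, 90, 13, 88]
centers, clusters = k_means_1d(spend, centers=[0, 100])
print(centers)
print(clusters)
```

No labels are supplied: the algorithm groups the customers into low and high spenders purely from the values, which is what distinguishes clustering from classification.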
Data warehousing is defined as a process of centralized data management
and retrieval. Data warehousing, like data mining, is a relatively new term,
although the concept itself has been around for years. Data warehousing
represents an ideal vision of maintaining a central repository of all
organizational data. A data warehouse is an enabled relational database
system designed to support very large databases (VLDB) at a significantly
higher level of performance and manageability. A data warehouse is an
environment, not a product. It is an architectural construct that provides
information that is hard to access or present in traditional operational
data stores.
Any organization, or any system in general, is faced with a wealth of data
that is maintained and stored. However, the inability to discover valuable,
often previously unknown information hidden in the data prevents it from
transforming these data into knowledge or wisdom.
To overcome this, the following steps need to be considered:
1. Capture and integrate both the internal and external data into a
comprehensive view.
2. Mine the integrated data for information.
3. Organize and present the information and knowledge in ways that
expedite complex decision making.

8.5 Data Mining and Knowledge Discovery Process


Data mining is not specific to any industry. It requires intelligent
technologies and the willingness to explore the possibility of hidden
knowledge that resides in the data. Data mining is also referred to as
knowledge discovery in databases (KDD). See Fig. 8.1.

Fig. 8.1: Steps in Knowledge Discovery process


KDD is the overall process of discovering useful knowledge from data.


Data mining: An application of specific algorithms for extracting patterns
from data. Data Mining is a step in the KDD process.

The following points describe the process of Knowledge Discovery:


1) Develop an understanding of the application domain and identify the
   goal.
2) Create a target dataset
   - Selecting a dataset, or focusing on a subset of samples or variables,
     on which to make discoveries
3) Data cleaning and preprocessing (preprocessing)
   - Removal of noise and outliers
   - Collecting necessary information to model or account for noise
   - Handling of missing data
   - Accounting for time sequence information
4) Data reduction and projection (preprocessing)
   - Finding useful features to represent the data relative to the goal
   - Dimensionality reduction/transformation to reduce the number of
     variables
   - Identification of invariant representations
5) Selection of appropriate data-mining task (data mining task)
   - Summarization, classification, regression, clustering, etc.
6) Selection of data-mining algorithm(s) (data mining task)
   - Methods to search for patterns
   - Decision of which models and parameters may be appropriate
   - Matching the method to the goal of the KDD process
7) Data mining
   - Searching for patterns of interest in one or more representational
     forms
8) Interpretation and visualization
   - Interpretation of mined patterns
   - Visualization of extracted patterns and models
   - Visualization of the data given the extracted models
9) Consolidating discovered knowledge
   - Incorporating the discovered knowledge into another system
   - Documenting and reporting knowledge to interested parties
   - Checking for inconsistencies with other previously extracted or
     believed knowledge
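A few of the steps above can be strung together in a minimal sketch. The records, field names, the cap of 1000.0 and the functions `select`, `clean` and `mine` are hypothetical, and the "pattern" mined here is only a group average, standing in for a real data mining algorithm.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "age": 25, "spend": 120.0, "store": "A"},
    {"id": 2, "age": 31, "spend": None, "store": "A"},
    {"id": 3, "age": 47, "spend": 95.0, "store": "B"},
    {"id": 4, "age": 52, "spend": 9000.0, "store": "B"},  # likely noise
    {"id": 5, "age": 29, "spend": 130.0, "store": "A"},
    {"id": 6, "age": 58, "spend": 90.0, "store": "B"},
]

def select(records, fields):
    """Step 2: keep only the variables relevant to the goal."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records, field, max_value):
    """Step 3: drop missing values and values above a chosen cap."""
    return [r for r in records
            if r[field] is not None and r[field] <= max_value]

def mine(records):
    """Step 7: a very simple 'pattern', the average spend per age group."""
    groups = {"under 40": [], "40 and over": []}
    for r in records:
        key = "under 40" if r["age"] < 40 else "40 and over"
        groups[key].append(r["spend"])
    return {g: mean(v) for g, v in groups.items() if v}

target = select(raw, ["age", "spend"])
cleaned = clean(target, "spend", max_value=1000.0)
patterns = mine(cleaned)
for group, avg in patterns.items():  # Step 8: present the result
    print(f"{group}: average spend {avg:.2f}")
```

Note how much of the work happens before the mining step: two of the six records are discarded during cleaning, and the result would be badly distorted if the outlier were kept.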


8.6 Data Mining and Online Analytical Processing (OLAP)


Online Analytical Processing (OLAP) is a technology that is used to create
decision support software. OLAP and data mining are used to solve different
kinds of analytic problems:
- OLAP summarizes data and makes forecasts. For example, OLAP
  answers questions like "What are the average sales of insurance
  policies, by region and by year?"
- Data mining discovers hidden patterns in data. Data mining operates at
  a detailed level instead of a summary level. Data mining answers
  questions like "Who is likely to buy insurance policies in the next six
  months, and what are the characteristics of these likely buyers?"
OLAP and data mining can complement each other. For example, OLAP
might pinpoint problems with sales of mutual funds in a certain region. Data
mining could then be used to gain insight about the behavior of individual
customers in the region. Finally, after data mining predicts something like a
5% increase in sales, OLAP can be used to track the net income.
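An OLAP-style question such as "average sales by region and by year" is a grouped summary. The following sketch shows the idea on a tiny in-memory table; the fact rows and the function `average_by` are hypothetical, and a real OLAP server would precompute such summaries over a multidimensional cube.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, year, policy_sales_amount)
facts = [
    ("North", 2022, 100.0), ("North", 2022, 140.0),
    ("North", 2023, 200.0), ("South", 2022, 80.0),
    ("South", 2023, 120.0), ("South", 2023, 160.0),
]

def average_by(rows, *dims):
    """Average the sales amount grouped by the chosen dimensions.
    dims are column positions: 0 = region, 1 = year."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[d] for d in dims)].append(row[2])
    return {key: sum(v) / len(v) for key, v in groups.items()}

by_region_year = average_by(facts, 0, 1)
by_region = average_by(facts, 0)  # "roll up" the year dimension

for key, avg in sorted(by_region_year.items()):
    print(key, round(avg, 2))
```

The summary says what happened in each region and year, but nothing about which individual customers are likely to buy next; that question is where data mining takes over.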
OLAP systems also provide the following benefits:
- Fast access, calculations, and summaries of an organization's data
- Support for multiple user access and multiple queries
- The ability to handle multiple hierarchies and levels of data
- The ability to pre-summarize and consolidate data for faster query and
  reporting functions
- The ability to expand the number of dimensions and levels of data as a
  business grows
Self Assessment Questions
6. Online Analytical Processing (OLAP) is a technology that is used to
create _______________ software.
7. OLAP supports ________ user access and multiple queries.

8.7 Data Mining and Statistics


Statistics is a branch of mathematics. Statistical techniques are
incorporated into data mining methods. Data mining methods find the
relations between variables in a given database and express these relations
using statistical nomenclature. Without statistics, there would be no data
mining, as statistics is the foundation of most of the technologies on
which data mining is built. Classical statistics embraces concepts such as
regression analysis, standard distribution, standard deviation, variance,
discriminant analysis, cluster analysis, and confidence intervals, all of
which are used to study data and data relationships. These are the building
blocks on which more advanced statistical analyses are built. Certainly, at
the heart of today's data mining tools and techniques, classical statistical
analysis plays a significant role.
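Two of the classical concepts mentioned above, standard deviation and regression analysis, can be computed directly. The sketch below fits a straight line by ordinary least squares; the advertising and sales figures and the function `linear_fit` are made up for illustration.

```python
from statistics import mean, stdev

def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data: advertising spend vs. units sold.
ad_spend = [1, 2, 3, 4, 5]
units = [12, 14, 16, 18, 20]

slope, intercept = linear_fit(ad_spend, units)
spread = stdev(units)  # sample standard deviation of units sold

print(f"units = {intercept:.1f} + {slope:.1f} * ad_spend")
print(f"standard deviation of units: {spread:.2f}")
```

The fitted line expresses the relation between two variables in statistical terms, which is exactly the kind of output many data mining methods produce on much larger sets of variables.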
Note: Data mining has its roots in Statistics, Artificial Intelligence (AI)
and Machine Learning.
Statistics, AI and Machine Learning are outside the scope of this unit,
so we do not explore them further here. The details of data mining
techniques will be explored in the forthcoming units.

8.8 Data Mining Technologies


The analytical techniques used in data mining are often well-known
mathematical algorithms and techniques. What is new is the application of
those techniques to general business problems made possible by the
increased availability of data, and inexpensive storage and processing
power. Also, the use of graphical interfaces has led to tools becoming
available that business experts can easily use.
Some of the techniques are given below:
- Artificial neural networks: Nonlinear predictive models that learn through
  training and resemble biological neural networks in structure.
- Decision trees: Tree-shaped structures that represent sets of decisions.
  These decisions generate rules for the classification of a dataset.
- Rule induction: The extraction of useful if-then rules from data, based on
  statistical significance.
- Genetic algorithms: Optimization techniques based on the concepts of
  genetic combination, mutation, and natural selection.
- Nearest neighbor: A classification technique that classifies each record
  based on the records most similar to it in a historical database.
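To make the link between decision trees and if-then rules concrete, the sketch below learns a one-level decision tree (often called a decision stump) and reads it as a rule. The applicant data and the function `best_stump` are hypothetical, and real decision tree algorithms use measures such as information gain and grow several levels deep.

```python
def best_stump(rows, feature_count):
    """Find the single 'feature <= threshold' test that misclassifies the
    fewest rows. Each row is (features_tuple, label) with a True/False label.
    Returns (errors, feature_index, threshold, label_if_low)."""
    best = None
    for f in range(feature_count):
        for threshold in sorted({feats[f] for feats, _ in rows}):
            for label_if_low in (True, False):
                errors = sum(
                    (label_if_low if feats[f] <= threshold
                     else not label_if_low) != label
                    for feats, label in rows
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, label_if_low)
    return best

# Hypothetical loan applicants:
# (income in thousands, years employed) -> loan repaid?
applicants = [
    ((20, 1), False), ((25, 3), False), ((30, 2), False),
    ((45, 4), True), ((60, 1), True), ((80, 6), True),
]
errors, feature, threshold, label_if_low = best_stump(applicants, feature_count=2)

names = ["income", "years employed"]
print(f"IF {names[feature]} <= {threshold} THEN repaid = {label_if_low} "
      f"ELSE repaid = {not label_if_low}  ({errors} errors)")
```

On this invented data the search settles on income as the deciding attribute, and the resulting test can be read directly as an if-then rule for classifying new applicants.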


Data mining has many applications in industry. Some of the industries that
use it are:
Banking
Insurance
Credit marketing
Telecommunications
Pharmaceuticals
Bioinformatics
Some of the applications in the above mentioned industries include:
Identifying new customers
Predicting customer buying habits
Confirming suitable loan applicants
Revealing fraud
Relationship marketing
Managing equity portfolios
Diagnosing medical problems
Inventory management
Conducting certain aspects of Marketing
Customer segmentation
Web site design and promotion.

8.9 Data Mining Software


A number of data mining software packages are available in the market,
developed by popular software vendors such as IBM, Microsoft and Oracle.
Some of them are listed below:
MineSet (Silicon Graphics Inc. - SGI)
MineSet provides tools for searching, sorting, filtering and drilling down,
enabling previously complex data models to be viewed intuitively through
real-time 3-D graphical representation.
Intelligent Miner (IBM Corp)
IBM's data mining capabilities help you detect fraud, segment customers,
and simplify market basket analysis. IBM's in-database mining capabilities
integrate with the customer's existing systems to provide scalable,
high-performance predictive analysis without moving data into proprietary
data mining platforms.
Enterprise Miner (SAS Institute Inc.)
SAS positions Enterprise Miner as a complete data mining solution with a
wide range of model development and deployment alternatives and extensive
integration opportunities. Delivered as a distributed client-server
system, it is especially well suited for data mining in large organizations.
Clementine (SPSS Inc - Integral Solutions)
Clementine is an enterprise data mining workbench that enables you to
develop predictive models quickly using business expertise and deploy them
into business operations to improve decision making.
DBMiner (DBMiner Technology Inc.)
DBMiner Insight solutions are presented as the world's first server
applications providing powerful and highly scalable association, sequence
and differential mining capabilities for the Microsoft SQL Server Analysis
Services platform. They also provide market basket analysis, sequence
discovery and profit optimization for Microsoft Accelerator for Business
Intelligence.
Weka 3
It is a collection of machine learning algorithms for solving data mining
problems. It is written in Java, so it is portable across platforms. For
details, visit http://www.cs.waikato.ac.nz/ml/weka/
Oracle 10g
Oracle 10g provides software called Darwin, which is a data mining tool. It
incorporates cluster analysis, classification, prediction and association
rules.
In addition to the above list, the following are also popular: GhostMiner,
Mantas, CART and MARS.
Self Assessment Questions
8. Statistics techniques are incorporated into Data mining methods.
(True/False).
9. ______________ Optimization techniques are based on the concepts
of genetic combination, mutation, and natural selection.
10. What is Mineset?


8.10 Summary
Data mining is concerned with finding hidden relationships present in
business data to allow businesses to make predictions for future use.
Data mining is a multidisciplinary field drawing on work from statistics,
database technology, artificial intelligence, pattern recognition, machine
learning, information theory, knowledge acquisition, information retrieval,
high-performance computing, and data visualization.
Data mining consists of many techniques, such as classification,
clustering and association. Data mining is a process, and its successful
application requires data preprocessing (dimensionality reduction, cleaning,
noise/outlier removal), postprocessing (understandability, summary,
presentation), a good understanding of the problem domain, and domain
expertise. Data mining is also referred to as knowledge discovery in
databases (KDD). OLAP stands for Online Analytical Processing; OLAP and
data mining can complement each other. Data mining is a step in the KDD
(knowledge discovery in databases) process.

8.11 Terminal Questions


1. What is data mining? Write Data Mining applications.
2. What is OLAP? Write the benefits of OLAP.
3. Differentiate between Data Mining and Data Warehousing.
4. What are the data mining techniques?
5. What is Knowledge Discovery? Explain the whole process involved.
6. Write any three data mining techniques.
7. What is preprocessing?

8.12 Answers
Self Assessment Questions
1. Historical
2. Metadata
3. Data
4. Data warehouse and data mining
5. Knowledge discovery
6. Decision support
7. Multiple


8. True
9. Genetic algorithms
10. MineSet is a software package that provides tools for searching,
sorting, filtering and drilling down, enabling previously complex data
models to be viewed intuitively through real-time 3-D graphical
representation.
Terminal Questions
1. Data mining is the process of analyzing data from different perspectives
and summarizing it into useful information: information that can be used
to increase revenue, cut costs, or both. Refer to sections 8.2 and 8.8.
2. Online Analytical Processing (OLAP) is a technology that is used to
create decision support software. Refer to section 8.6.
3. Data mining is a multidisciplinary field drawing on work from statistics,
database technology, artificial intelligence, pattern recognition, machine
learning, information theory, knowledge acquisition, information retrieval,
high-performance computing, and data visualization, whereas data
warehousing is defined as a process of centralized data management
and retrieval. Refer to section 8.4.
4. Artificial neural networks, decision trees, rule induction, etc. Refer to
section 8.8.
5. Data mining is also referred to as knowledge discovery in databases
(KDD). Refer to section 8.5.
6. i) Classification
ii) Clustering
iii) Association
7. Data preprocessing involves dimensionality reduction, cleaning, and
noise/outlier removal. Refer to section 8.4.
