“Connecting the dots to make sense of data”

CONTENTS:
• Introduction
• Motivation for data mining
• Explaining data mining
• Data Mining: On what kind of data?
• Data mining functionality and Classification
• Major issues in data mining
• Article: Knowledge discovery in science as opposed to business
• Conclusion
• References
ABSTRACT:

This era of human development has rightly been called the information age.
Today, the world’s most valuable resource is information. Since the advent of
computers, their rapid spread has led to an overload of information. The world
today faces an information crisis: there is a great deal of data, and its sheer
volume causes chaos. The vast data out there in cyberspace has to be organized
so that some “sense” can be drawn from it. Thus the world today needs a tool
which “discovers knowledge” by interpreting given data. The fields of data
warehousing and data mining arose from this need to properly organize and
effectively use the data.
Data warehousing deals with a “subject-oriented, integrated, nonvolatile,
time-variant collection of data in support of management decisions”. Simply put,
a data warehouse is a collection of snapshots of data taken from transaction
processing systems at given intervals. Data warehouses are databases used solely
for reporting. The data warehouse stores data in a format that facilitates its
access, thus enhancing the ability of business decision-makers to gain timely
access to corporate information.
Data mining is the non-trivial extraction of implicit, previously unknown and
potentially useful information from data. Data mining includes methodologies to
interpret knowledge from the data, such as the exploration and analysis, by
automatic and semi-automatic means, of large quantities of data in order to
discover meaningful patterns.
This paper provides a brief introduction to the various aspects of data mining,
putting data warehousing aside. It introduces the basic concepts involved in
data mining, such as its definition, functionality, characteristics and major
issues. Finally, the paper presents a real-world scenario, an article titled
“Knowledge discovery in science as opposed to business”, which provides insight
into the scope of data mining.
INTRODUCTION
“Knowledge is the ultimate competitive advantage”
- Donald Mitchell (from “The Ultimate Competitive Advantage: Secrets of
Continuously Developing a More Profitable Business Model”)

Data is probably the most valuable resource of an enterprise. Whether a decision
is strategic, tactical or operational, the data should be turned into
ready-to-use and ready-to-operate information. Data that comes from large
inventories should be collected and warehoused in a system so that past
references are readily available. This data should be structured and optimized
for querying and data analysis. This forms an essential part of data
warehousing. The data warehouse is a central repository of data that provides
the user with integrated, up-to-date data from various administrative systems.

Data mining is a powerful new technology with great potential to help companies
focus on the most important information in the data they have collected about
the behavior of their customers and potential customers. It discovers
information within the data that queries and reports can’t effectively reveal.
This search may be done by the user alone, i.e. just by performing queries, in
which case it is quite hard and in most cases not comprehensive enough to reveal
intricate patterns. Data mining uses sophisticated statistical analysis and
modeling techniques to uncover such patterns and relationships hidden in
organizational databases - patterns that ordinary methods might miss. Once
found, the information needs to be presented in a suitable form, with graphs,
reports, etc.

MOTIVATION FOR DATA MINING
“Necessity is the Mother of Invention” - Proverb

In the recent past, with the internet boom and what came to be known as the
“Information Age”, data has become very hard to manage. Automated data
collection tools and mature database technology have led to tremendous amounts
of data stored in databases, data warehouses and other information repositories.
We are currently overloaded with an immense amount of information which is
largely unorganized and left freely in cyberspace to roam about without any
purpose.

Quoting the words of Micheline Kamber: “We are drowning in data, but starving
for knowledge!”

The solution to this ‘massive’ problem at hand: data warehousing and data
mining.

That is, we warehouse the data and perform on-line analytical processing on it
in a sequential and organized fashion to optimally extract knowledge from the
‘heap’ of data. We also extract interesting knowledge such as rules,
regularities, patterns and constraints from data in large databases in order to
make some “sense” of the data and its implications.

EXPLAINING DATA MINING

Discovery is the process of looking in a database to find hidden patterns
without a predetermined idea or hypothesis about what the patterns may be. In
other words, the program takes the initiative in finding what the interesting
patterns are, without the user thinking of the relevant questions first.
Discovery was done manually until data mining came into existence. Generally,
data mining (sometimes called data or knowledge discovery) is the process of
analyzing data from different perspectives and summarizing it into useful
information - information that can be used to increase revenue, cut costs, or
both.

Data mining is a computer-assisted process of digging through and analyzing
enormous sets of data and then extracting the meaning of the data to predict
behaviors and future trends, allowing businesses to make proactive,
knowledge-driven decisions. It can answer business questions that traditionally
were too time consuming to resolve. Data mining algorithms scour databases for
hidden patterns, finding predictive information that experts may miss because it
lies outside their expectations. Technically, data mining is the process of
finding correlations or patterns among dozens of fields in large relational
databases, data warehouses, transactional databases, and advanced DB and
information repositories. It is based on filtration and assaying of a mountain
of data “ore” in order to get “nuggets” of knowledge.

DATA MINING: ON WHAT KIND OF DATA?

Data mining can be applied to any kind of data repository as well as to
transient data such as data streams. Thus the scope of data repositories
includes:

• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories like:
    • Object-oriented and object-relational databases
    • Spatial databases
    • Time-series data and temporal data
    • Text databases and multimedia databases
    • Heterogeneous and legacy databases
    • WWW

DATA MINING FUNCTIONALITY AND CLASSIFICATION
Data mining functionalities are used to specify the kind of patterns found in
data mining tasks. These tasks fall into two categories:
1. Descriptive
2. Predictive

The data mining functionalities and the kinds of patterns they discover are
listed below:

• Concept description: characterization and discrimination
• Association (correlation and causality)
• Classification and prediction
• Cluster analysis

Data mining methodologies can be categorized according to various criteria:

• Kinds of databases to be mined
• Kinds of knowledge to be discovered
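
As a rough illustration of the association functionality named above, the
sketch below computes support and confidence for simple one-item rules over a
handful of made-up shopping baskets. The transactions, the function names
(`support`, `confidence`, `pair_rules`) and the thresholds are assumptions
chosen for illustration only; they are not taken from any particular tool or
from the sources cited in this paper.

```python
from itertools import combinations

# Hypothetical market-basket transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """List one-item rules a -> b that pass both (assumed) thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for x, y in combinations(items, 2):
        # Each unordered pair yields two candidate rules, one per direction.
        for a, b in ((x, y), (y, x)):
            s = support({a, b}, transactions)
            c = confidence({a}, {b}, transactions)
            if s >= min_support and c >= min_confidence:
                rules.append((a, b, s, c))
    return rules


if __name__ == "__main__":
    for a, b, s, c in pair_rules(transactions):
        print(f"{a} -> {b}: support={s:.2f}, confidence={c:.2f}")
```

Each returned tuple holds the antecedent, the consequent, the support of the
pair and the confidence of the rule. Checking every pair in this way would not
scale to a large database, so this is only a sketch of the idea, not of how a
production tool would search for rules.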

MAJOR ISSUES IN DATA MINING

Data mining is a young and promising field. Due to its immaturity, it has a few
shortcomings. They are:

• Performance and scalability
• Mining methodology and user interaction
• Issues relating to the diversity of data types

ARTICLE: KNOWLEDGE DISCOVERY IN SCIENCE AS OPPOSED TO BUSINESS

1. Introduction

Data mining is the essential ingredient in the more general process of Knowledge
Discovery in Databases (KDD). The idea is that by automatically sifting through
large quantities of data it should be possible to extract nuggets of knowledge.

Data mining has become fashionable, not just in computer science (journals &
conferences), but particularly in business IT. (An example is its promotion by
television advertising.) The emergence is due to the growth in data warehouses
and the realisation that this mass of operational data has the potential to be
exploited as an extension of Business Intelligence.

Data mining offers a solution: automatic rule extraction. By searching through
large amounts of data, one hopes to find sufficient instances of an association
between data value occurrences to suggest a statistically significant rule.
However, a domain expert is still needed to guide and evaluate the process and
to apply the results.

2. Business Data Analysis

Popular commercial applications of data mining technology are, for example, in
direct mail targeting, credit scoring, churn prediction, stock trading, fraud
detection, and customer segmentation. It is closely allied to data warehousing,
in which large (gigabytes) corporate databases are constructed for decision
support applications. Rather than relational databases with SQL, these are often
multi-dimensional structures used for so-called on-line analytical processing
(OLAP). Data mining is a step further from the directed questioning and
reporting of OLAP in that the relevant results cannot be specified in advance.

3. Scientific Data Analysis

Rules generated by data mining are empirical - they are not physical laws. In
most research in the sciences, one compares recorded data with a theory that is
founded on an analytic expression of physical laws. The success or otherwise of
the comparison is a test of the hypothesis
of how nature works expressed as a mathematical formula. This might be something
fundamental like an inverse square law. Alternatively, fitting a mathematical
model to the data might determine physical parameters (such as a refractive
index).

On the other hand, where there are no general theories, data mining techniques
are valuable, especially where one has large quantities of data containing noisy
patterns. This approach hopes to obtain a theoretical generalisation
automatically from the data by means of induction, deriving empirical models and
learning from examples. The resultant theory, while maybe not fundamental, can
yield a good understanding of the physical process and can have great practical
utility.

4. Scientific Applications

In a growing number of domains, the empirical or black box approach of data
mining is good science. Three typical examples are:

1. Sequence analysis in bioinformatics

Genetic data, such as the nucleotide sequences in genomic DNA, are digital.
However, experimental data are inherently noisy, making the search for patterns
and the matching of sub-sequences difficult. Machine learning algorithms such as
artificial neural nets and hidden Markov chains are a very attractive way to
tackle this computationally demanding problem.

2. Classification of astronomical objects

The thousands of photographic plates that comprise a large survey of the night
sky contain around a billion faint objects. Having measured the attributes of
each object, the problem is to classify each object as a particular type of star
or galaxy. Given the number of features to consider, as well as the huge number
of objects, decision-tree learning algorithms have been found accurate and
reliable for this task.

3. Medical decision support

Patient records collected for diagnosis and prognosis include symptoms, bodily
measurements and laboratory test results. Machine learning methods have been
applied to a
variety of medical domains to improve decision-making. Examples are the
induction of rules for early diagnosis of rheumatic diseases and neural nets to
recognise the clustered micro-calcifications in digitised mammograms that can
lead to cancer.

CONCLUSION

Data mining is still an area of current research, and its problems are not yet
fully solved. Nonetheless, despite these difficulties, data mining offers an
important approach to achieving many homeland security initiatives. It is often
used as a means for detecting fraud, assessing risk, and product warehousing for
use in decision support. Data mining is emerging as one of the key features of
retailing. It involves the use of data analysis tools to discover previously
unknown, valid patterns and relationships in large data sets. Data mining offers
great promise in helping organizations uncover hidden patterns in their data and
thus is likely to play a major role in the organization and evaluation of data
and its patterns in the future.

REFERENCES

1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques,
   Second Edition
2. Tan, Steinbach, Kumar, Introduction to Data Mining
3. Marcello Rossi, Data Mining: Searching Knowledge in Data Warehouses,
   Università degli studi di Roma “La Sapienza”, Facoltà di Ingegneria
   Informatica
4. Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, From Data
   Mining to Knowledge Discovery in Databases
5. David J. Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining,
   The MIT Press (mitpress.mit.edu/026208290X)
6. Kurt Thearling, Dynamic and Analytic Technologies
   (URL: http://www.thearling.com)

Weblinks:

1. www.ats.ucla.edu/stat/sas/topics/logistic_regression.htm
   www.indiana.edu/~statmath
2. www.sas.com/technologies/analytics/datamining/miner/dec_trees.html
3. www.sas.com/offices/asiapacific/sp/training/courses/dmdt.html
4. citeseer.ist.psu.edu/36580.htm
5. www.sas.com/technologies/analytics/datamining/miner/neuralnet.html
6. www.sas.com/offices/asiapacific/sp/training/courses/dmnn.html
   dimacs.rutgers.edu/Workshops/AdverseEvent/slides/stultz.ppt
7. /stat/all/cat/1b1.htm
8. luna.cas.usf.edu/~mbrannic/files/regression/Logistic.html