RESEARCH REPORT
DATA MINING
26.11.2016
Seyit Mert AYVAZ
2012555008
Introduction
I hope that this report will be useful to its readers and to anyone
who is curious about data mining.
Page 2 of 29
TABLE OF CONTENTS
1. Introduction to Data Mining
1.1. What is Data Mining
1.1.1. Automatic Discovery
1.1.2. Prediction
1.1.3. Grouping
1.1.4. Actionable Information
1.2. Architecture of Data Mining
1.2.1. Data Sources
1.2.2. Database or Data Warehouse Server
1.2.3. Data Mining Engine
1.2.4. Pattern Evaluation Modules
1.2.5. Graphical User Interface
1.2.6. Knowledge Base
1.3. Data Mining Processes
1.3.1. Problem definition
1.3.2. Data exploration
1.3.3. Data preparation
1.3.4. Modeling
1.3.5. Evaluation
1.3.6. Deployment
2. History of Data Mining
2.1. Foundations of Data Mining
2.2. Evolution in data mining for business
2.3. Milestones of Data Mining
3. Scope of Data Mining
3.1. Usage of Data Mining Techniques
3.1.1. Association
3.1.2. Classification
3.1.3. Clustering
3.1.4. Prediction
3.1.5. Sequential Patterns
3.1.6. Decision trees
3.2. Data Mining in Academia
3.2.1. Science and Engineering
3.2.2. Medical Data Mining
3.2.3. Spatial Data Mining
3.2.4. Pattern mining
3.2.5. Human Rights
3.2.6. Sensor Data Mining
3.3. Data Mining in Business
4. Future of Data Mining
4.1. Distributed/Collective Data Mining (DDM)
4.2. Ubiquitous Data Mining (UDM)
4.3. Hypertext and Hypermedia Data Mining
4.4. Multimedia Data Mining
4.5. Time Series/Sequence Data Mining
The overall goal of the data mining process is to extract information from a
data set and transform it into an understandable structure for further use.
Data mining uses sophisticated mathematical algorithms to segment the
data and evaluate the probability of future events. Data mining is also
known as Knowledge Discovery in Data (KDD).
1.1.1. Automatic Discovery
Data mining is accomplished by building models. A model uses an
algorithm to act on a set of data. The notion of automatic discovery refers
to the execution of data mining models. Data mining models can be used
to mine the data on which they are built, but most types of models are
generalizable to new data. The process of applying a model to new data is
known as scoring.
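As a minimal sketch of the build-then-score idea, the Python snippet below
builds a very simple model (the mean income per education level) from a small
made-up data set and then scores records the model has not seen. The data, the
field names and the helper functions build_model and score are illustrative
assumptions, not part of any particular data mining product.

```python
from collections import defaultdict

# Hypothetical training data: (education level, yearly income).
training = [
    ("high_school", 30000),
    ("high_school", 34000),
    ("bachelor", 50000),
    ("bachelor", 54000),
    ("master", 70000),
]


def build_model(rows):
    """Build a model: the mean income observed for each education level."""
    totals = defaultdict(lambda: [0, 0])  # level -> [sum of incomes, count]
    for level, income in rows:
        totals[level][0] += income
        totals[level][1] += 1
    return {level: total / count for level, (total, count) in totals.items()}


def score(model, levels, default=None):
    """Score new records: apply the already-built model to unseen data."""
    return [model.get(level, default) for level in levels]


model = build_model(training)
# "doctorate" never appeared in the training data, so it gets the default.
predictions = score(model, ["bachelor", "master", "doctorate"])
```

Scoring does not rebuild the model; it only looks up what the model learned,
which is why an unseen level falls back to the default value.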
1.1.2. Prediction
Many forms of data mining are predictive. For example, a model might
predict income based on education and other demographic factors.
Predictions have an associated probability (How likely is this prediction to
be true?). Prediction probabilities are also known as confidence. Some
forms of predictive data mining generate rules, which are conditions that
imply a given outcome. For example, a rule might specify that a person
who has a bachelor's degree and lives in a certain neighborhood is likely to
have an income greater than the regional average.
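The confidence of such a rule can be estimated directly from data. The sketch
below is one simple way to do it in Python, assuming a small invented table of
people; the records and the rule_confidence helper are hypothetical and serve
only to illustrate the calculation.

```python
# Hypothetical records: (has_bachelor, neighborhood, income_above_average).
people = [
    (True, "north", True),
    (True, "north", True),
    (True, "north", False),
    (True, "south", False),
    (False, "north", False),
    (False, "south", True),
]


def rule_confidence(rows, condition, outcome):
    """Fraction of rows satisfying the condition that also show the outcome."""
    matching = [row for row in rows if condition(row)]
    if not matching:
        return None  # the rule never applies, so confidence is undefined
    hits = sum(1 for row in matching if outcome(row))
    return hits / len(matching)


# Rule: bachelor's degree AND lives in "north" => income above average.
confidence = rule_confidence(
    people,
    condition=lambda row: row[0] and row[1] == "north",
    outcome=lambda row: row[2],
)
```

Here three people satisfy the condition and two of them have an above-average
income, so the estimated confidence of the rule is two thirds.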
1.1.3. Grouping
Other forms of data mining identify natural groupings in the data. For
example, a model might identify the segment of the population that has
an income within a specified range, that has a good driving record, and
that leases a new car on a yearly basis.
1.1.4. Actionable Information
Data mining can derive actionable information from large volumes of
data. For example, a town planner might use a model that predicts income
based on demographics to develop a plan for low-income housing. A car
leasing agency might use a model that identifies customer segments to
design a promotion targeting high-value customers.
unusual records (anomaly detection), and dependencies (association rule
mining). This usually involves using database techniques such as spatial
indices. These patterns can then be seen as a kind of summary of the
input data, and may be used in further analysis or, for example, in
machine learning and predictive analytics. For example, the data mining
step might identify multiple groups in the data, which can then be used to
obtain more accurate prediction results by a decision support system.
Data collection, data preparation, and result interpretation and
reporting are not part of the data mining step, but they do belong to the
overall KDD process as additional steps.
The major components of any data mining system are the data source,
data warehouse server, data mining engine, pattern evaluation module,
graphical user interface, and knowledge base. To understand them better,
we will examine what each of these components is and what it is meant to
do.
1.2.6. Knowledge Base: The knowledge base is helpful in the whole data
mining process. It might be useful for guiding the search or evaluating the
interestingness of the result patterns. The knowledge base might even
contain user beliefs and data from user experiences that can be useful in
the process of data mining. The data mining engine might get inputs from
the knowledge base to make the result more accurate and reliable.
Figure 1.3.1: Phases of the Cross Industry Standard Process for Data
Mining (CRISP-DM) process model.
1.3.2. Data exploration
Domain experts understand the meaning of the metadata. They
collect, describe, and explore the data. They also identify quality problems
of the data. A frequent exchange with the data mining experts and the
business experts from the problem definition phase is vital.
In the data exploration phase, traditional data analysis tools, for example,
statistics, are used to explore the data.
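A minimal sketch of this kind of exploration, using only Python's standard
statistics module, is shown below. The column of ages, the plausible range and
the helper names describe and flag_outliers are assumptions made for the
example; in practice the plausible range would come from a domain expert.

```python
import statistics

# Hypothetical column with missing values (None) and one suspicious entry.
ages = [34, 29, None, 41, 38, 290, 33, None, 36]


def describe(values):
    """Summarize a numeric column and count its missing values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }


def flag_outliers(values, low, high):
    """Return the values that fall outside a plausible [low, high] range."""
    return [v for v in values if v is not None and not (low <= v <= high)]


summary = describe(ages)
suspicious = flag_outliers(ages, low=0, high=120)
```

The summary exposes two quality problems at once: missing entries and a
maximum that is far from the median, which the range check then isolates.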
1.3.4. Modeling
Data mining experts select and apply various mining functions
because you can use different mining functions for the same type of data
mining problem. Some of the mining functions require specific data types.
The data mining experts must assess each model.
In the modeling phase, a frequent exchange with the domain experts from
the data preparation phase is required.
The modeling phase and the evaluation phase are coupled. They can
be repeated several times to change parameters until optimal values are
achieved. When the final modeling phase is completed, a model of high
quality has been built.
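The coupling between modeling and evaluation can be sketched as a small loop
that tries several parameter values and keeps the best one. The example below
assumes a deliberately simple model (a single threshold on monthly spend) and
invented data; the names predict, accuracy and tune are hypothetical.

```python
# Hypothetical labelled data: (monthly_spend, is_high_value_customer).
train = [(10, 0), (20, 0), (35, 0), (40, 1), (55, 1), (70, 1)]
validation = [(15, 0), (30, 0), (45, 1), (60, 1)]


def predict(threshold, spend):
    """The model: a customer is high value if spend reaches the threshold."""
    return 1 if spend >= threshold else 0


def accuracy(threshold, rows):
    """Evaluation: fraction of rows that the model labels correctly."""
    correct = sum(1 for spend, label in rows if predict(threshold, spend) == label)
    return correct / len(rows)


def tune(candidates, rows):
    """Repeat modeling and evaluation for each candidate parameter value."""
    best_threshold, best_accuracy = None, -1.0
    for threshold in candidates:
        current = accuracy(threshold, rows)
        if current > best_accuracy:  # keep the first best-scoring value
            best_threshold, best_accuracy = threshold, current
    return best_threshold, best_accuracy


best_threshold, train_accuracy = tune(range(0, 101, 10), train)
# Check the chosen parameter on data that was not used for tuning.
validation_accuracy = accuracy(best_threshold, validation)
```

Evaluating on a separate validation set is a design choice of this sketch; it
guards against choosing a parameter that only fits the training data.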
1.3.5. Evaluation
Data mining experts evaluate the model. If the model does not
satisfy their expectations, they go back to the modeling phase and rebuild
the model by changing its parameters until optimal values are achieved.
When they are finally satisfied with the model, they can extract business
explanations and evaluate the following questions:
Does the model achieve the business objective?
Have all business issues been considered?
At the end of the evaluation phase, the data mining experts decide how to
use the data mining results.
1.3.6. Deployment
Data mining experts use the mining results by exporting the results
into database tables or into other applications, for example, spreadsheets.
The Intelligent Miner products help you follow this process.
You can apply the functions of the Intelligent Miner products
independently, iteratively, or in combination.
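The deployment step described above, exporting results into a database table
or a spreadsheet-friendly file, might look like the following sketch. It uses
Python's standard sqlite3 and csv modules with an in-memory database; the table
name mining_results, its columns and the sample rows are assumptions made for
illustration and do not reflect the Intelligent Miner products.

```python
import csv
import io
import sqlite3

# Hypothetical mining results: (customer_id, segment, churn_probability).
results = [(1, "A", 0.12), (2, "B", 0.87), (3, "A", 0.45)]


def export_to_table(conn, rows):
    """Write the mining results into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mining_results ("
        "customer_id INTEGER PRIMARY KEY, segment TEXT, churn_probability REAL)"
    )
    conn.executemany("INSERT INTO mining_results VALUES (?, ?, ?)", rows)
    conn.commit()


def export_to_csv(rows):
    """Write the mining results as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["customer_id", "segment", "churn_probability"])
    writer.writerows(rows)
    return buffer.getvalue()


conn = sqlite3.connect(":memory:")
export_to_table(conn, results)
# Other applications can now query the deployed results.
high_risk = conn.execute(
    "SELECT customer_id FROM mining_results WHERE churn_probability > 0.5"
).fetchall()
csv_text = export_to_csv(results)
```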
Data mining roots are traced back along three family lines: classical
statistics, artificial intelligence, and machine learning.
Statistics is the foundation of most technologies on which data
mining is built, e.g. regression analysis, standard distribution, standard
deviation, variance, discriminant analysis, cluster analysis, and
confidence intervals. All of these are used to study data and data
relationships.
Artificial intelligence, or AI, which is built upon heuristics as opposed
to statistics, attempts to apply human-thought-like processing to statistical
problems. Certain AI concepts were adopted by some high-end
commercial products, such as query optimization modules for Relational
Database Management Systems (RDBMS).
Machine learning is the union of statistics and AI. It could be
considered an evolution of AI, because it blends AI heuristics with
advanced statistical analysis. Machine learning attempts to let computer
programs learn about the data they study, such that programs make
different decisions based on the qualities of the studied data, using
statistics for fundamental concepts, and adding more advanced AI
heuristics and algorithms to achieve its goals.
Data mining, in many ways, is fundamentally the adaptation of
machine learning techniques to business applications. Data mining is best
described as the union of historical and recent developments in statistics,
AI, and machine learning. These techniques are then used together to
study data and find previously-hidden trends or patterns within.
Data mining techniques are the result of a long process of research and
product development. This evolution began when business data was first
stored on computers, continued with improvements in data access, and
more recently, generated technologies that allow users to navigate
through their data in real time. Data mining takes this evolutionary
process beyond retrospective data access and navigation to prospective
and proactive information delivery. Data mining is ready for application in
the business community because it is supported by three technologies
that are now sufficiently mature:
Massive data collection
Powerful multiprocessor computers
Data mining algorithms
The core components of data mining technology have been under
development for decades, in research areas such as statistics, artificial
intelligence, and machine learning. Today, the maturity of these
techniques, coupled with high-performance relational database engines
and broad data integration efforts, makes these technologies practical for
current data warehouse environments.
The following are major milestones and firsts in the history of data
mining, along with how it has evolved and blended with data science and
big data.
1936 This is the dawn of the computer age, which made the collection
and processing of large amounts of data possible. In a 1936 paper, "On
Computable Numbers", Alan Turing introduced the idea of a Universal
Machine capable of performing computations like our modern-day
computers. The modern-day computer is built on the concepts pioneered
by Turing.
1943 Warren McCulloch and Walter Pitts were the first to create a
conceptual model of a neural network. In a paper entitled "A logical
calculus of the ideas immanent in nervous activity", they describe the idea
of a neuron in a network. Each of these neurons can do 3 things: receive
inputs, process inputs and generate output.
1980s HNC trademarks the phrase "database mining". The trademark was
meant to protect a product called DataBase Mining Workstation. It was a
general-purpose tool for building neural network models and is no
longer available. It is also during this period that sophisticated
algorithms become able to learn relationships from data, which allows
subject matter experts to reason about what the relationships mean.
2001 Although the term "data science" has existed since the 1960s, it
wasn't until 2001 that William S. Cleveland introduced it as an independent
discipline. According to Building Data Science Teams, DJ Patil and Jeff
Hammerbacher then used the term to describe their roles at LinkedIn and
Facebook.
2015 In February 2015, DJ Patil became the first Chief Data Scientist at
the White House. Today, data mining is widespread in business, science,
engineering and medicine just to name a few. Mining of credit card
transactions, stock market movements, national security, genome
sequencing and clinical trials are just the tip of the iceberg for data mining
applications.
Present (2016) Finally, one of the most active techniques being explored
today is Deep Learning. Capable of capturing dependencies and
complex patterns far beyond other techniques, it is reigniting some of the
biggest challenges in the world of data mining, data science and artificial
intelligence. [2]
In this section, the scope is examined according to the types of
relations between transactional and analytical systems, the levels of
analysis, and the tasks of data mining. The use of data mining in academia
and in business is then explained.
Clusters: Data items are grouped according to logical relationships or
consumer preferences. For example, data can be mined to identify
market segments or consumer affinities.
Associations: Data can be mined to identify associations. The beer-
diaper example is an example of associative mining.
Sequential patterns: Data is mined to anticipate behavior patterns
and trends. For example, an outdoor equipment retailer could predict
the likelihood of a backpack being purchased based on a consumer's
purchase of sleeping bags and hiking shoes.
Data visualization: The visual interpretation of complex relationships
in multidimensional data. Graphics tools are used to illustrate data
relationships.
3.1.1. Association
Association is one of the best-known data mining techniques. In
association, a pattern is discovered based on a relationship between
items in the same transaction. That is why the association
technique is also known as the relation technique. The association
technique is used in market basket analysis to identify a set of products
that customers frequently purchase together.
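Two common measures for such patterns are support (how often items appear
together) and confidence (how often one item is bought given another). The
sketch below computes both for pairs of items in plain Python; the baskets are
invented, and pair_support and confidence are hypothetical helper names. It
counts pairs only and is not a full frequent-itemset algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]


def pair_support(baskets):
    """Support of each item pair: share of baskets containing both items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair one canonical (alphabetical) order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Share of baskets with the antecedent that also hold the consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return None
    hits = sum(1 for b in with_antecedent if consequent in b)
    return hits / len(with_antecedent)


support = pair_support(transactions)
beer_given_diapers = confidence(transactions, "diapers", "beer")
```

In this toy data, beer and diapers appear together in three of five baskets,
and three of the four baskets containing diapers also contain beer.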
3.1.2. Classification
Classification is a classic data mining technique based on machine
learning. Basically, classification is used to assign each item in a set of
data to one of a predefined set of classes or groups. Classification
makes use of mathematical techniques such as decision trees,
linear programming, neural networks and statistics. In classification, we
develop software that can learn how to classify the data items into
groups. For example, given the records of all employees who left the
company, we can apply classification to predict who will probably leave
the company in a future period. In this case, we divide the employee
records into two groups named "leave" and "stay". We can then ask our
data mining software to classify the employees into these groups.
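The leave/stay example can be sketched with a k-nearest-neighbors classifier,
which is only one of many possible classification techniques and is swapped in
here for brevity. The employee features (years at the company and a
satisfaction score), the records and the classify helper are assumptions made
for the example.

```python
import math
from collections import Counter

# Hypothetical history: (years_at_company, satisfaction 1-10) -> known label.
history = [
    ((1, 3), "leave"),
    ((2, 4), "leave"),
    ((1, 2), "leave"),
    ((6, 8), "stay"),
    ((8, 7), "stay"),
    ((5, 9), "stay"),
]


def classify(record, labelled, k=3):
    """Assign the majority label among the k most similar known records."""
    nearest = sorted(labelled, key=lambda item: math.dist(record, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


new_hire = classify((2, 3), history)
veteran = classify((7, 8), history)
```

The classes are fixed in advance ("leave" and "stay"); the classifier only
decides which of them a new record most resembles.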
3.1.3. Clustering
Clustering is a data mining technique that automatically forms
meaningful or useful clusters of objects that have similar
characteristics. The clustering technique defines the classes and
puts objects in each class, whereas in classification objects
are assigned to predefined classes. To make the concept clearer, we
can take book management in the library as an example. In a library,
there is a wide range of books on various topics available. The
challenge is how to keep those books in a way that readers can take
several books on a particular topic without hassle. By using the
clustering technique, we can keep books that have some kinds of
similarities in one cluster or one shelf and label it with a meaningful
name. If readers want to grab books on that topic, they only have
to go to that shelf instead of searching the entire library.
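One widely used clustering method is k-means, sketched below in plain Python
as an illustration rather than as the method this report prescribes. Each
"book" is represented by two invented numeric features, and the starting
centroids are supplied by hand so the result is reproducible; the kmeans
function and its parameters are hypothetical.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Group points around centroids; no class labels are given in advance."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster
            else centroids[i]  # keep an empty cluster's centroid unchanged
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical books described by two numeric features each.
books = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, shelves = kmeans(books, centroids=[(1, 1), (8, 8)])
```

Unlike the classification case, the two "shelves" emerge from the data itself;
a meaningful name for each shelf still has to be chosen by a person.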
3.1.4. Prediction
Prediction, as its name implies, is a data mining technique that
discovers the relationships among independent variables and the
relationships between dependent and independent variables. For
instance, prediction analysis can be used in sales to predict future
profit: if we consider sales the independent variable, profit can be
the dependent variable. Then, based on historical sales and profit
data, we can draw a fitted regression curve that is used for profit
prediction.
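For a straight-line fit, the regression curve can be obtained with ordinary
least squares. The sketch below assumes a linear relationship and uses invented
sales and profit figures; fit_line is a hypothetical helper name. With mean
values x̄ and ȳ, the slope is sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2) and the
intercept is ȳ - slope * x̄.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: sales (independent) and profit (dependent).
sales = [100, 200, 300, 400]
profit = [15, 35, 55, 75]

slope, intercept = fit_line(sales, profit)
# Use the fitted line to predict profit for a future sales level.
predicted_profit = slope * 500 + intercept
```

The invented data lie exactly on a line, so the fit recovers it; real data
would scatter around the fitted curve.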
In recent years, data mining has been used widely in the areas of
science and engineering, such as bioinformatics, genetics, medicine,
education and electrical power engineering.
In the study of human genetics, sequence mining helps address the
important goal of understanding the mapping relationship between the
inter-individual variations in human DNA sequence and the variability in
disease susceptibility. In simple terms, it aims to find out how the
changes in an individual's DNA sequence affect the risk of developing
common diseases such as cancer, which is of great importance to
improving methods of diagnosing, preventing, and treating these
diseases. One data mining method that is used to perform this task is
known as multifactor dimensionality reduction.
In the area of electrical power engineering, data mining methods
have been widely used for condition monitoring of high voltage
electrical equipment. The purpose of condition monitoring is to obtain
valuable information on, for example, the status of the insulation (or
other important safety-related parameters). Data clustering techniques
such as the self-organizing map (SOM) have been applied to vibration
monitoring and analysis of transformer on-load tap-changers (OLTCs).
Using vibration monitoring, it can be observed that each tap change
operation generates a signal that contains information about the
condition of the tap changer contacts and the drive mechanisms.
Obviously, different tap positions will generate different signals.
However, there was considerable variability amongst normal condition
signals for exactly the same tap position. SOM has been applied to
detect abnormal conditions and to hypothesize about the nature of the
abnormalities.
Data mining methods have been applied to dissolved gas analysis
(DGA) in power transformers. DGA, as a diagnostic for power
transformers, has been available for many years. Methods such as SOM
have been applied to analyze generated data and to determine trends
which are not obvious to the standard DGA ratio methods (such as
Duval Triangle).
In educational research, data mining has been used to study
the factors leading students to choose to engage in behaviors which
reduce their learning, and to understand factors influencing university
student retention. A similar example of a social application of data mining
is its use in expertise finding systems, whereby descriptors of human
expertise are extracted, normalized, and classified so as to facilitate the
finding of experts, particularly in scientific and technical fields. In this
way, data mining can facilitate institutional memory.
Other applications of data mining methods include the mining of
biomedical data facilitated by domain ontologies, the mining of clinical
trial data, and traffic analysis using SOM.
In adverse drug reaction surveillance, the Uppsala Monitoring Centre
has, since 1998, used data mining methods to routinely screen for
reporting patterns indicative of emerging drug safety issues in the WHO
global database of 4.6 million suspected adverse drug reaction
incidents. Recently, similar methodology has been developed to mine
large collections of electronic health records for temporal patterns
associating drug prescriptions to medical diagnoses.
Data mining has been applied to software artifacts within the realm
of software engineering: Mining Software Repositories.
3.2.2. Medical Data Mining
In the context of pattern mining as a tool to identify terrorist activity,
the National Research Council provides the following definition:
"Pattern-based data mining looks for patterns (including anomalous
data patterns) that might be associated with terrorist activity; these
patterns might be regarded as small signals in a large ocean of noise."
Pattern mining includes new areas such as Music Information Retrieval
(MIR), where patterns seen both in the temporal and non-temporal
domains are imported to classical knowledge discovery search
methods.
Categorization of the items available on an e-commerce site is a
fundamental problem. A correct item categorization system is essential
for the user experience, as it helps determine the items relevant to the
user for search and browsing. Item categorization can be formulated as a
supervised classification problem in data mining, where the categories
are the target classes and the features are the words composing some
textual description of the items. One approach is to first find groups
that are similar and place them together in a latent group.
Given a new item, first classify it into a latent group, which is called
coarse-level classification. Then do a second round of classification to
find the category to which the item belongs.
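The two-round idea can be sketched as follows. To keep the example short, a
simple keyword-overlap score stands in for a trained supervised classifier, so
this shows only the coarse-then-fine structure, not a realistic model. The
catalogue, its latent groups and keywords, and the functions best_match and
categorize are all hypothetical.

```python
# Hypothetical catalogue: latent group -> category -> descriptive keywords.
catalogue = {
    "electronics": {
        "phones": {"smartphone", "android", "sim"},
        "laptops": {"laptop", "notebook", "keyboard"},
    },
    "clothing": {
        "shirts": {"shirt", "cotton", "sleeve"},
        "shoes": {"shoe", "sneaker", "leather"},
    },
}


def best_match(words, candidates):
    """Return the candidate whose keyword set overlaps most with the words."""
    return max(candidates, key=lambda name: len(words & candidates[name]))


def categorize(description):
    """Coarse-level classification first, then a fine-level one."""
    words = set(description.lower().split())
    # Coarse level: pool the keywords of every category in each latent group.
    groups = {
        group: set().union(*categories.values())
        for group, categories in catalogue.items()
    }
    group = best_match(words, groups)
    # Fine level: only categories inside the chosen group are considered.
    category = best_match(words, catalogue[group])
    return group, category


phone = categorize("Android smartphone with dual sim")
shirt = categorize("Blue cotton shirt long sleeve")
```

Restricting the second round to one latent group means the fine-level
classifier has to separate only a few closely related categories.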
Every time a credit card or a store loyalty card is used, or a
warranty card is filled in, data is collected about the user's
behavior. Many people find the amount of information stored about us
by companies such as Google, Facebook, and Amazon disturbing
and are concerned about privacy. Although there is the potential for our
personal data to be used in harmful or unwanted ways, it is also being
used to make our lives better. For example, Ford and Audi hope to one
day collect information about customer driving patterns so they can
recommend safer routes and warn drivers about dangerous road
conditions.
Data mining in customer relationship management (CRM)
applications can contribute significantly to the bottom line. Rather than
randomly contacting a prospect or customer through a call center or
sending mail, a company can concentrate its efforts on prospects that
are predicted to have a high likelihood of responding to an offer. More
sophisticated methods may be used to optimize resources across
campaigns so that one may predict to which channel and to which offer
an individual is most likely to respond (across all potential offers).
Additionally, sophisticated applications could be used to automate
mailing. Once the results from data mining (potential
prospect/customer and channel/offer) are determined, this
"sophisticated application" can either automatically send an e-mail or a
regular mail. Finally, in cases where many people will take an action
without an offer, "uplift modeling" can be used to determine which
people have the greatest increase in response if given an offer. Uplift
modeling thereby enables marketers to focus mailings and offers on
persuadable people, and not to send offers to people who will buy the
product without an offer. Data clustering can also be used to
automatically discover the segments or groups within a customer data
set.
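The uplift idea can be illustrated with a very small calculation: for each
customer segment, compare the response rate of people who received an offer
with the rate of those who did not. The sketch below assumes an invented
campaign log that already contains a control group; the segment names and the
helpers response_rate and uplift_by_segment are hypothetical, and real uplift
models are considerably more elaborate.

```python
# Hypothetical campaign log: (segment, received_offer, responded).
log = [
    ("loyal", True, True), ("loyal", True, True),
    ("loyal", False, True), ("loyal", False, True),
    ("undecided", True, True), ("undecided", True, True),
    ("undecided", False, False), ("undecided", False, True),
    ("uninterested", True, False), ("uninterested", True, False),
    ("uninterested", False, False), ("uninterested", False, False),
]


def response_rate(rows):
    """Share of rows in which the customer responded."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row[2]) / len(rows)


def uplift_by_segment(rows):
    """Uplift = response rate with an offer minus response rate without."""
    uplift = {}
    for segment in sorted({row[0] for row in rows}):
        treated = [r for r in rows if r[0] == segment and r[1]]
        control = [r for r in rows if r[0] == segment and not r[1]]
        uplift[segment] = response_rate(treated) - response_rate(control)
    return uplift


uplift = uplift_by_segment(log)
most_persuadable = max(uplift, key=uplift.get)
```

In this toy log the loyal segment responds with or without an offer, so its
uplift is zero, while the undecided segment shows the largest gain.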
Businesses employing data mining may see a return on investment,
but they also recognize that the number of predictive models can
quickly become very large. For example, rather than using one model
to predict how many customers will churn, a business may choose to
build a separate model for each region and customer type. In situations
where a large number of models need to be maintained, some
businesses turn to more automated data mining methodologies.
Data mining can be helpful to human resources (HR) departments in
identifying the characteristics of their most successful employees.
Information such as the universities attended by highly
successful employees can help HR focus recruiting efforts accordingly.
Additionally, Strategic Enterprise Management applications help a
company translate corporate-level goals, such as profit and margin
share targets, into operational decisions, such as production plans and
workforce levels.
Market basket analysis relates to data mining use in retail sales. If a
clothing store records the purchases of customers, a data mining
system could identify those customers who favor silk shirts over cotton
ones. Although some relationships may be difficult to explain,
taking advantage of them is easier. The example deals with association
rules within transaction-based data. Not all data are transaction based,
and logical or inexact rules may also be present within a database.
Market basket analysis has been used to identify the purchase
patterns of the Alpha Consumer. Analyzing the data collected on this
type of user has allowed companies to predict future buying trends and
forecast supply demands.
Data mining is a highly effective tool in the catalog marketing industry.
Catalogers have a rich database of their customers' transaction history,
covering millions of customers and dating back a number
of years. Data mining tools can identify patterns among customers and
help identify the customers most likely to respond to upcoming mailing
campaigns.
Data mining for business applications can be integrated into a
complex modeling and decision making process. LIONsolver uses
Reactive business intelligence (RBI) to advocate a "holistic" approach
that integrates data mining, modeling, and interactive visualization into
an end-to-end discovery and continuous innovation process powered by
human and automated learning.
In the area of decision making, the RBI approach has been used to
mine knowledge that is progressively acquired from the decision maker,
and then self-tune the decision method accordingly. The relation
between the quality of a data mining system and the amount of
investment that the decision maker is willing to make was formalized by
providing an economic perspective on the value of extracted
knowledge in terms of its payoff to the organization. This decision-
theoretic classification framework was applied to a real-world
semiconductor wafer manufacturing line, where decision rules for
effectively monitoring and controlling the semiconductor wafer
fabrication line were developed.[3]
4. Future of Data Mining
Over recent years, data mining has been establishing itself as one of
the major disciplines in computer science with growing industrial
impact. Undoubtedly, research in data mining will continue and even
increase over the coming decades. In this section we will examine the
future trends and applications of data mining.
Hypertext and hypermedia data mining can be characterized as
mining data which includes text, hyperlinks, text mark-ups, and various
other forms of hypermedia information. As such, it is closely related to
both web mining and multimedia mining, and in reality the three are
quite close in terms of content and applications. While the World Wide
Web is substantially composed of hypertext and hypermedia elements, there
are other kinds of hypertext/hypermedia data sources which are not
found on the web. Examples of these include the information found in
online catalogues, digital libraries, online information databases, and
the like. Some of the important data mining techniques used for
hypertext and hypermedia data mining include classification
(supervised learning), clustering (unsupervised learning),
semi-structured learning, and social network analysis.
Conclusion
Glossary
References
[1] www.thearling.com
[2] www.rayli.net
[3] www.wikipedia.org
[4] http://www.ibm.com/support/knowledgecenter/
[5] https://www.linkedin.com/pulse/what-does-future-hold-data-mining-thiensi-le
[6] http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
[7] https://webdocs.cs.ualberta.ca/~zaiane/courses/cmput690/materials.shtml#dataware
[8] http://www.cs.bu.edu/~gkollios/dm07/lectnotes.html
[9] http://searchsqlserver.techtarget.com/definition/data-mining
[10] Introduction to Data Mining, Pang-Ning Tan (Michigan State University),
Michael Steinbach (University of Minnesota), Vipin Kumar (University of
Minnesota), March 25, 2006
[11] Introduction to Data Mining, Dr. Sanjay Ranka, Professor, Computer and
Information Science and Engineering, University of Florida, Gainesville