10/15/08 Sudarshan 1
Definition
Knowledge Discovery in Databases (KDD) is
about finding new and useful information that
is not obvious.
KDD is sometimes called data mining.
However, data mining is only one part of the
process.
No one has analyzed all the steps to KDD
Why analyze the KDD
process?
Understand this complex process better
Build better KDD tools
Data Mining
An attempt at knowledge discovery
Searching for patterns and structure in a
sea of data
Uses techniques from many disciplines,
such as statistical analysis and machine
learning
These techniques are not our main interest
Definition (Cont.)
Data mining is the exploration and analysis of large
quantities of data in order to discover valid, novel,
potentially useful, and ultimately understandable patterns
in data.
Terrorbytes
Why Data Mining? -- Potential
Applications
Database analysis and decision support
Market analysis
Corporate analysis
Fraud detection
Other Applications:
Intelligent query answering
Prediction and scheduling
Applications
Banking: loan/credit card approval
Targeted marketing:
identify likely responders to promotions
Preprocessing and Mining
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Steps of a KDD Process
Learning the application domain:
relevant prior knowledge and goals of application
Why is data mining
necessary?
Make use of your data assets
There is a big gap between stored data and
knowledge, and the transition won't occur
automatically.
Many interesting things you want to find
cannot be found using database queries:
“Find me people likely to buy my products”
“Who is likely to respond to my promotion?”
Data Mining Tasks
Prediction Methods
Use some variables to predict unknown or
future values of other variables.
Description Methods
Find human-interpretable patterns that
describe the data.
Main data mining tasks
Classification:
mining patterns that can classify future data into
known classes.
Association rule mining
mining any rule of the form X → Y, where X and Y
are sets of data items.
Clustering
identifying a set of similarity groups in the data
Main data mining tasks (cont …)
Sequential pattern mining:
A sequential rule A → B says that event A
will be immediately followed by event B
with a certain confidence.
Deviation detection:
discovering the most significant changes in
data
Data visualization: using graphical
methods to show patterns in data.
Counting co-occurrences
A market basket is a collection of items purchased by a
customer in a single customer transaction.
A customer transaction consists of the items purchased from
the store in a single visit.
Consider the following “purchase relation”:
the tuples are sorted into groups by transaction. All tuples in a
group have the same transaction id (Tid) and together describe a
customer transaction, which involves the purchase of one or more
items. There is redundancy in the table. It can be removed by
decomposing the purchase relation.
Tid  Cid  Date    Item    Qty (packets)
204  c1   4/1/05  sugar   2
204  c1   4/1/05  milk    1
204  c1   4/1/05  cheese  1
204  c1   4/1/05  juice   2
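The redundancy mentioned above (Cid and Date repeated on every row of a transaction) could be removed roughly as follows. This is a minimal sketch, not the author's method: the two-table split into (Tid, Cid, Date) and (Tid, Item, Qty), and the helper names decompose and baskets, are assumptions made for illustration. Only the four rows shown in the table are used.

```python
# Rows of the purchase relation shown above: (tid, cid, date, item, qty).
purchases = [
    (204, "c1", "4/1/05", "sugar", 2),
    (204, "c1", "4/1/05", "milk", 1),
    (204, "c1", "4/1/05", "cheese", 1),
    (204, "c1", "4/1/05", "juice", 2),
]


def decompose(rows):
    """Split the relation into two tables (an assumed decomposition):
    transactions: tid -> (cid, date), stored once per transaction
    line_items:   (tid, item, qty), one row per purchased item
    """
    transactions = {}
    line_items = []
    for tid, cid, date, item, qty in rows:
        transactions[tid] = (cid, date)
        line_items.append((tid, item, qty))
    return transactions, line_items


def baskets(rows):
    """Group the items of each transaction into a market basket (a set)."""
    result = {}
    for tid, _cid, _date, item, _qty in rows:
        result.setdefault(tid, set()).add(item)
    return result


transactions, line_items = decompose(purchases)
print(transactions)
print(line_items)
print(baskets(purchases))
```

After the split, the customer id and date are stored once per transaction, while the basket view keeps only what the co-occurrence counting needs: which items appear together.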
To find the frequently purchased items in the store, the original
purchase table is considered. This table is created at the cleaning
step of the KDD process. This table is easy to handle when applying
data mining tools.
We can make the following observations from the purchase table:
75% of transactions contain milk and sugar together
25% of transactions contain sugar and juice together
The following terminology is used to develop an algorithm
for finding frequently purchased items in the shop:
A set of items is called an itemset.
The support of an itemset is the fraction of transactions in
the database that contain all the items in the itemset.
For example, {sugar, milk} has 75% support in purchases. We
thus conclude that sugar and milk are frequently
purchased together. On the other hand, sugar and rice
are not purchased together.
The user can specify a minimum support (minsup) and find
all itemsets whose support is above minsup. These are the
frequent itemsets; a frequent itemset may be a singleton set.
Suppose the user-specified minimum support is 70%.
Then {milk, sugar} is a frequent itemset, and so are its
subsets {milk} and {sugar}.
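The support definition above can be sketched in a few lines. The four baskets below are hypothetical, not the original purchase table; they were chosen only so that they reproduce the 75% and 25% figures quoted in the observations. The function name support is likewise an assumption.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    if not baskets:
        return 0.0
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


# Hypothetical baskets (illustrative only, not the slide's data).
example_baskets = [
    {"sugar", "milk", "cheese", "juice"},
    {"sugar", "milk"},
    {"milk", "sugar", "bread"},
    {"bread", "cheese"},
]

print(support({"sugar", "milk"}, example_baskets))   # 0.75
print(support({"sugar", "juice"}, example_baskets))  # 0.25
print(support({"sugar", "rice"}, example_baskets))   # 0.0
```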
Algorithm to identify frequent itemsets
For each item,
    check if it is a frequent itemset
    (appears in > minsup of the transactions)
k = 1
Repeat
    for each new frequent itemset I_k
    with k items, generate all itemsets
    I_k+1 with k+1 items such that I_k ⊂ I_k+1
    Scan all transactions once and check
    if the generated (k+1)-itemsets are frequent
    k := k + 1
Until no new frequent itemsets are identified.
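The level-wise procedure above might look like this in Python. It is a minimal sketch under stated assumptions: the function name frequent_itemsets and the sample baskets are hypothetical, and minsup is passed as a fraction (0.7 for 70%). It keeps the strict "> minsup" test from the pseudocode and, like the pseudocode, extends each frequent itemset with every item not already in it.

```python
def frequent_itemsets(baskets, minsup):
    """Return all itemsets whose support is strictly greater than minsup."""
    n = len(baskets)

    def is_frequent(candidate):
        # One pass over all transactions to measure the candidate's support.
        return sum(1 for basket in baskets if candidate <= basket) / n > minsup

    items = sorted({item for basket in baskets for item in basket})

    # Level 1: frequent single items.
    current = {frozenset([i]) for i in items if is_frequent(frozenset([i]))}
    all_frequent = set(current)

    # Level k+1: extend each new frequent k-itemset by one more item.
    while current:
        candidates = {s | {i} for s in current for i in items if i not in s}
        current = {c for c in candidates if is_frequent(c)}
        all_frequent |= current

    return all_frequent


# Hypothetical baskets (illustrative only, not the slide's data).
sample_baskets = [
    {"sugar", "milk", "cheese", "juice"},
    {"sugar", "milk"},
    {"milk", "sugar", "bread"},
    {"bread", "cheese"},
]

result = frequent_itemsets(sample_baskets, minsup=0.7)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))
```

The loop stops as soon as a level yields no frequent itemsets, because no larger itemset can be frequent if none of its subsets are.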
All the best for your
test.