
Data Mining, Data Warehousing and Knowledge Discovery

Basic Algorithms and Concepts

Srinath Srinivasa, IIIT Bangalore, sri@iiitb.ac.in

Overview
- Why Data Mining?
- Data Mining concepts
- Data Mining algorithms
  - Tabular data mining: Association, Classification and Clustering
  - Sequence data mining
  - Streaming data mining
- Data Warehousing concepts

Why Data Mining?

From a managerial perspective:
- Analyzing trends
- Wealth generation
- Security
- Strategic decision making

Data Mining
Look for hidden patterns and trends in data that are not immediately apparent from summarizing the data. There is no query, but an interestingness criterion.

Data Mining

Data + Interestingness criteria = Hidden patterns

The type of data and the type of interestingness criteria together determine the type of patterns that can be mined.

Type of Data

- Tabular (Ex: Transaction data)
  - Relational
  - Multi-dimensional
- Spatial (Ex: Remote sensing data)
- Temporal (Ex: Log information)
- Streaming (Ex: multimedia, network traffic)
- Spatio-temporal (Ex: GIS)
- Tree (Ex: XML data)
- Graphs (Ex: WWW, BioMolecular data)
- Sequence (Ex: DNA, activity logs)
- Text, Multimedia

Type of Interestingness

- Frequency
- Rarity
- Correlation
- Length of occurrence (for sequence and temporal data)
- Consistency
- Repetition / periodicity
- Abnormal behavior
- Other patterns of interestingness

Data Mining vs Statistical Inference

Statistics:
Statistical reasoning: conceptual model (hypothesis) → proof (validation of hypothesis)

Data mining:
Data → mining algorithm based on interestingness → pattern (model, rule, hypothesis) discovery

Data Mining Concepts

Associations and Item-sets: An association is a rule of the form: if X then Y, denoted X → Y.
Example: If India wins in cricket, sales of sweets go up.
For any rule, if X → Y and Y → X hold together, then X and Y are called an interesting item-set.
Example: People buying school uniforms in June also buy school bags, and people buying school bags in June also buy school uniforms.

Data Mining Concepts

Support and Confidence: The support for a rule R is the ratio of the number of transactions in which R occurs to the total number of transactions. The confidence of a rule X → Y is the ratio of the number of transactions containing both X and Y to the number of transactions containing X.
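In standard notation (my phrasing, not the slide's), with σ(Z) the number of transactions containing item-set Z and N the total number of transactions:

\[
\mathrm{support}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N},
\qquad
\mathrm{confidence}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}
\]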

Data Mining Concepts

Support and Confidence (example). The following ten transactions are used in this and the subsequent examples:

Bag, Uniform, Crayons
Books, Bag, Uniform
Bag, Uniform, Pencil
Bag, Pencil, Books
Uniform, Crayons, Bag
Bag, Pencil, Books
Crayons, Uniform, Bag
Books, Crayons, Bag
Uniform, Crayons, Pencil
Pencil, Uniform, Books

Support for {Bag, Uniform} = 5/10 = 0.5
Confidence for Bag → Uniform = 5/8 = 0.625

Mining for Frequent Item-sets

The Apriori Algorithm (a code sketch follows): Given minimum required support s as the interestingness criterion:
1. Search for all individual elements (1-element item-sets) that have a minimum support of s
2. Repeat:
   a. From the results of the previous search for i-element item-sets, search for all (i+1)-element item-sets that have a minimum support of s
   b. This becomes the set of all frequent (i+1)-element item-sets that are interesting
3. Until item-set size reaches maximum
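A minimal Python sketch of this level-wise search, run over the ten school-supplies transactions from the earlier example (all names here are my own):

```python
transactions = [
    {"Bag", "Uniform", "Crayons"}, {"Books", "Bag", "Uniform"},
    {"Bag", "Uniform", "Pencil"},  {"Bag", "Pencil", "Books"},
    {"Uniform", "Crayons", "Bag"}, {"Bag", "Pencil", "Books"},
    {"Crayons", "Uniform", "Bag"}, {"Books", "Crayons", "Bag"},
    {"Uniform", "Crayons", "Pencil"}, {"Pencil", "Uniform", "Books"},
]

def apriori(transactions, minsup):
    n = len(transactions)

    def support(itemset):
        # fraction of transactions that contain the whole itemset
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    frequent = {frozenset([i]) for i in items if support({i}) >= minsup}
    all_supports = {}
    k = 1
    while frequent:
        all_supports.update({fs: support(fs) for fs in frequent})
        # candidate (k+1)-itemsets from unions of frequent k-itemsets
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        frequent = {c for c in candidates if support(c) >= minsup}
        k += 1
    return all_supports

for itemset, sup in apriori(transactions, 0.3).items():
    print(sorted(itemset), sup)
```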

Mining for Frequent Item-sets

The Apriori Algorithm (example), over the ten transactions above, with minimum support = 0.3:

Interesting 1-element item-sets: {Bag}, {Uniform}, {Crayons}, {Pencil}, {Books}
Interesting 2-element item-sets: {Bag, Uniform}, {Bag, Crayons}, {Bag, Pencil}, {Bag, Books}, {Uniform, Crayons}, {Uniform, Pencil}, {Pencil, Books}

Mining for Frequent Item-sets

The Apriori Algorithm (example, continued), with minimum support = 0.3:

Interesting 3-element item-sets: {Bag, Uniform, Crayons}

Mining for Association Rules

Association rules are of the form A → B, and are directional. Association rule mining requires two thresholds: minsup and minconf.

Mining for Association Rules

Mining association rules using apriori. General procedure (a code sketch follows):

1. Use apriori to generate frequent itemsets of different sizes
2. At each iteration, divide each frequent itemset X into two parts, LHS and RHS. This represents a rule of the form LHS → RHS
3. The confidence of such a rule is support(X)/support(LHS)
4. Discard all rules whose confidence is less than minconf
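A Python sketch of steps 2 to 4, reusing the `apriori` function and `transactions` from the earlier sketch:

```python
from itertools import combinations

def association_rules(transactions, minsup, minconf):
    supports = apriori(transactions, minsup)
    rules = []
    for itemset in supports:
        if len(itemset) < 2:
            continue
        # every non-empty proper subset of the itemset can be the LHS
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = supports[itemset] / supports[lhs]
                if conf >= minconf:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

for lhs, rhs, conf in association_rules(transactions, 0.3, 0.7):
    print(lhs, "->", rhs, round(conf, 3))
```

Note that `supports[lhs]` is always defined: every subset of a frequent itemset is itself frequent, so apriori has already recorded its support.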

Mining for Association Rules

Mining association rules using apriori (example):

The frequent itemset {Bag, Uniform, Crayons} has a support of 0.3. This can be divided into the following rules:

{Bag} → {Uniform, Crayons}
{Bag, Uniform} → {Crayons}
{Bag, Crayons} → {Uniform}
{Uniform} → {Bag, Crayons}
{Uniform, Crayons} → {Bag}
{Crayons} → {Bag, Uniform}

Mining for Association Rules

Mining association rules using apriori. The confidence for these rules is as follows:

{Bag} → {Uniform, Crayons}: 0.3/0.8 = 0.375
{Bag, Uniform} → {Crayons}: 0.3/0.5 = 0.6
{Bag, Crayons} → {Uniform}: 0.3/0.4 = 0.75
{Uniform} → {Bag, Crayons}: 0.3/0.7 = 0.428
{Uniform, Crayons} → {Bag}: 0.3/0.4 = 0.75
{Crayons} → {Bag, Uniform}: 0.3/0.5 = 0.6

If minconf is 0.7, then we have discovered the following rules:

Mining for Association Rules

Mining association rules using apriori. The two rules that clear minconf = 0.7 can be read as:

- People who buy a school bag and a set of crayons are likely to buy a school uniform.
- People who buy a school uniform and a set of crayons are likely to buy a school bag.

Generalized Association Rules

Since customers can buy any number of items in one transaction, the transaction relation would be in the form of a list of individual purchases:

Bill No. | Date       | Item
15563    | 23.10.2003 | Books
15563    | 23.10.2003 | Crayons
15564    | 23.10.2003 | Uniform
15564    | 23.10.2003 | Crayons

Generalized Association Rules

A transaction for the purposes of data mining is obtained by performing a GROUP BY of the table over various fields.

Generalized Association Rules

A GROUP BY over Bill No. would show frequent buying patterns across different customers. A GROUP BY over Date would show frequent buying patterns across different days.
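A minimal Python sketch of forming transactions from such a purchase list with a GROUP BY on either column (the column layout follows the table above; helper names are my own):

```python
from collections import defaultdict

purchases = [
    (15563, "23.10.2003", "Books"),
    (15563, "23.10.2003", "Crayons"),
    (15564, "23.10.2003", "Uniform"),
    (15564, "23.10.2003", "Crayons"),
]

def transactions_by(purchases, key_index):
    # GROUP BY the chosen column; each group becomes one transaction
    groups = defaultdict(set)
    for row in purchases:
        groups[row[key_index]].add(row[2])
    return list(groups.values())

print(transactions_by(purchases, 0))  # group by Bill No.
print(transactions_by(purchases, 1))  # group by Date
```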

Classification and Clustering

Given a set of data elements:
- Classification maps each data element to one of a set of pre-determined classes, based on the differences among data elements belonging to different classes
- Clustering groups data elements into different groups, based on the similarity between elements within a single group

Classification Techniques

Decision Tree Identification

Classification problem: Weather → Play(Yes, No)

Outlook  | Temp | Play?
Sunny    | 30   | Yes
Overcast | 15   | No
Sunny    | 16   | Yes
Cloudy   | 27   | Yes
Overcast | 25   | Yes
Overcast | 17   | No
Cloudy   | 17   | No
Cloudy   | 35   | Yes

Classification Techniques

Hunt's method for decision tree identification (a code sketch follows). Given N element types and m decision classes:
1. for i ← 1 to N do
   a. Add element i to the (i-1)-element item-sets from the previous iteration
   b. Identify the set of decision classes for each item-set
   c. If an item-set has only one decision class, then that item-set is done; remove it from subsequent iterations
2. done
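A rough Python sketch of this level-wise search as the slide describes it, using the categorical weather table from the example that follows (representing item-sets as attribute-value assignments is my own choice):

```python
rows = [  # ((Outlook, Temp), Play?)
    (("Sunny", "Warm"), "Yes"), (("Overcast", "Chilly"), "No"),
    (("Sunny", "Chilly"), "Yes"), (("Cloudy", "Pleasant"), "Yes"),
    (("Overcast", "Pleasant"), "Yes"), (("Overcast", "Chilly"), "No"),
    (("Cloudy", "Chilly"), "No"), (("Cloudy", "Warm"), "Yes"),
]

def hunts(rows, n_attrs):
    # an item-set here is a partial assignment: a tuple of (attr, value)
    pending = [()]
    for i in range(n_attrs):
        values = sorted({r[0][i] for r in rows})
        extended = [iset + ((i, v),) for iset in pending for v in values]
        pending = []
        for iset in extended:
            classes = {cls for attrs, cls in rows
                       if all(attrs[a] == v for a, v in iset)}
            if len(classes) == 1:        # pure: this branch is decided
                print(iset, "->", classes.pop())
            elif len(classes) > 1:       # mixed: extend in next iteration
                pending.append(iset)
            # no matching rows: drop the item-set
    for iset in pending:                 # still mixed: needs rough sets
        print(iset, "-> undecided")

hunts(rows, 2)
```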

Classification Techniques

Decision Tree Identification (example), with Temp discretized to Warm/Pleasant/Chilly:

Outlook  | Temp     | Play?
Sunny    | Warm     | Yes
Overcast | Chilly   | No
Sunny    | Chilly   | Yes
Cloudy   | Pleasant | Yes
Overcast | Pleasant | Yes
Overcast | Chilly   | No
Cloudy   | Chilly   | No
Cloudy   | Warm     | Yes

Splitting on Outlook alone:
Sunny → Yes
Cloudy → Yes/No
Overcast → Yes/No


Classification Techniques

Decision Tree Identification (example, continued). Refining the mixed Cloudy branch on Temp:
Cloudy, Warm → Yes
Cloudy, Chilly → No
Cloudy, Pleasant → Yes

Classification Techniques

Decision Tree Identification (example, continued). Refining the mixed Overcast branch on Temp:
Overcast, Chilly → No
Overcast, Pleasant → Yes
(Overcast, Warm does not occur in the data.)

Classification Techniques

Decision Tree Identification (example). The resulting decision tree:

Outlook?
- Sunny → Yes
- Cloudy → Temp?
  - Warm → Yes
  - Pleasant → Yes
  - Chilly → No
- Overcast → Temp?
  - Pleasant → Yes
  - Chilly → No

Classification Techniques

Decision Tree Identification is a top-down technique. The decision tree created is sensitive to the order in which items are considered. If an N-item-set does not result in a clear decision, classification classes have to be modeled by rough sets.

Other Classification Algorithms

Quinlan's depth-first strategy builds the decision tree in a depth-first fashion, by considering all possible tests that give a decision and selecting the test that gives the best information gain. It hence eliminates tests that are inconclusive.

SLIQ (Supervised Learning In Quest), developed in the QUEST project of IBM, uses a top-down breadth-first strategy to build a decision tree. At each level in the tree, an entropy value is calculated for each node, and the nodes having the lowest entropy values are selected and expanded.
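Both strategies rank candidate splits by entropy or information gain. A short Python sketch of the standard definitions, reusing the `rows` list from the Hunt's-method sketch above (function names are my own):

```python
from collections import Counter
from math import log2

def entropy(labels):
    # H = -sum over classes of p * log2(p)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr):
    # entropy reduction obtained by splitting the rows on one attribute
    groups = {}
    for attrs, cls in rows:
        groups.setdefault(attrs[attr], []).append(cls)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([cls for _, cls in rows]) - remainder

print(information_gain(rows, 0))  # gain of splitting on Outlook
print(information_gain(rows, 1))  # gain of splitting on Temp
```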

Clustering Techniques

Clustering partitions the data set into clusters or equivalence classes. Similarity among members of a class is greater than similarity among members across classes. Similarity measures: Euclidean distance or other application-specific measures.

Euclidean Distance for Tables

[Figure: rows of the weather table, such as (Overcast, Chilly, Don't Play) and (Cloudy, Pleasant, Play), plotted as points in a space whose axes carry the attribute values Sunny/Cloudy/Overcast, Warm/Pleasant/Chilly and Play/Don't Play.]

Clustering Techniques

General strategy (a code sketch follows):
1. Draw a graph connecting items which are close to one another with edges
2. Partition the graph into maximally connected subcomponents:
   a. Construct an MST for the graph
   b. Merge items that are connected by the minimum-weight edges of the MST into a cluster
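A minimal sketch of this strategy. Kruskal's algorithm is used for the MST step (an assumption; the slide leaves the construction unspecified), and edges heavier than a threshold are cut so that the surviving components become the clusters:

```python
def mst_clusters(points, dist, threshold):
    n = len(points)
    parent = list(range(n))               # union-find over the items

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Kruskal's MST: take edges in increasing weight order, but only
    # merge components joined by edges no heavier than the threshold
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    for w, i, j in edges:
        if w <= threshold and find(i) != find(j):
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(points[i])
    return list(clusters.values())

pts = [(0, 0), (0, 1), (5, 5), (6, 5)]
dist = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
print(mst_clusters(pts, dist, 2.0))   # [[(0,0),(0,1)], [(5,5),(6,5)]]
```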

Clustering Techniques

Clustering types:
- Hierarchical clustering: clusters are formed at different levels, by merging clusters at a lower level
- Partitional clustering: clusters are formed at only one level

Clustering Techniques

Nearest Neighbour Clustering Algorithm (sketched in code below): Given n elements x1, x2, ..., xn, and threshold t:
1. j ← 1, k ← 1, Clusters = {}
2. Repeat:
   a. Find the nearest neighbour of xj among the already clustered elements
   b. Let the nearest neighbour be in cluster m
   c. If the distance to the nearest neighbour > t, then create a new cluster and k ← k+1; else assign xj to cluster m
   d. j ← j+1
3. until j > n
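A Python sketch of this threshold rule (placing the first element, which has no prior neighbour, in a cluster of its own is my own choice):

```python
def nearest_neighbour_clustering(xs, t, dist):
    clusters = []                        # each cluster is a list of points
    for x in xs:
        best = None                      # (distance, cluster) of nearest
        for cluster in clusters:
            for y in cluster:
                d = dist(x, y)
                if best is None or d < best[0]:
                    best = (d, cluster)
        if best is None or best[0] > t:  # too far: start a new cluster
            clusters.append([x])
        else:                            # close enough: join cluster m
            best[1].append(x)
    return clusters

xs = [1.0, 1.2, 5.0, 5.3, 9.9]
print(nearest_neighbour_clustering(xs, 1.0, lambda a, b: abs(a - b)))
```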

Clustering Techniques

Iterative partitional clustering (a code sketch follows): Given n elements x1, x2, ..., xn, and k clusters, each with a center:
1. Assign each element to its closest cluster center
2. After all assignments have been made, compute the cluster centroid of each cluster
3. Repeat the above two steps with the new centroids until the algorithm converges
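This is essentially the k-means procedure. A compact sketch for points on the real line (the initial centers are arbitrary):

```python
def iterative_partitional(xs, centers, max_iters=100):
    groups = []
    for _ in range(max_iters):
        # step 1: assign each element to its closest center
        groups = [[] for _ in centers]
        for x in xs:
            i = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            groups[i].append(x)
        # step 2: recompute centroids (empty clusters keep their center)
        new_centers = [sum(g) / len(g) if g else c
                       for g, c in zip(groups, centers)]
        if new_centers == centers:       # converged
            break
        centers = new_centers
    return groups, centers

print(iterative_partitional([1.0, 1.2, 5.0, 5.3, 9.9], [0.0, 5.0, 10.0]))
```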

Mining Sequence Data

Characteristics of sequence data:
- A collection of data elements which are ordered sequences
- In a sequence, each item has an index associated with it
- A k-sequence is a sequence of length k
- The support for a k-sequence s is the number of m-sequences (m ≥ k) which contain s as a subsequence
- Examples of sequence data: transaction logs, DNA sequences, patient ailment history, ...

Mining Sequence Data

Some definitions:
- A sequence is a list of itemsets of finite length. Example: {pen, pencil, ink} {pencil, ink} {ink, eraser} {ruler, pencil} (the purchases of a single customer over time)
- The order of items within an itemset does not matter, but the order of itemsets does
- A subsequence is a sequence with some itemsets deleted

Mining Sequence Data

Some definitions:
- A sequence S = {a1, a2, ..., am} is said to be contained within another sequence S', if S' contains a subsequence {b1, b2, ..., bm} such that a1 ⊆ b1, a2 ⊆ b2, ..., am ⊆ bm
- Hence, {pen} {pencil} {ruler, pencil} is contained in {pen, pencil, ink} {pencil, ink} {ink, eraser} {ruler, pencil}

Mining Sequence Data

Apriori Algorithm for Sequences:
1. L1 ← set of all interesting 1-sequences
2. k ← 1
3. while Lk is not empty do
   a. Generate all candidate (k+1)-sequences
   b. Lk+1 ← set of all interesting (k+1)-sequences
   c. k ← k+1
4. done

Mining Sequence Data

Generating candidate sequences: Given L1, L2, ..., Lk, the candidate sequences for Lk+1 are generated as follows: for each sequence s in Lk, concatenate s with each of the interesting 1-sequences. A code sketch follows.
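A Python sketch of this generate-and-filter loop for flat, single-item sequences such as those in the example that follows. Candidates are formed by concatenating interesting 1-sequences on either side, which reproduces the candidate lists shown below (helper names are my own):

```python
def is_subsequence(s, seq):
    # s is contained in seq if its symbols appear in order, gaps allowed
    it = iter(seq)
    return all(sym in it for sym in s)

def interesting(candidates, data, minsup):
    n = len(data)
    return {c for c in candidates
            if sum(is_subsequence(c, seq) for seq in data) / n >= minsup}

def sequence_apriori(data, minsup):
    alphabet = {sym for seq in data for sym in seq}
    L1 = interesting({(sym,) for sym in alphabet}, data, minsup)
    levels, Lk = [L1], L1
    while Lk:
        # extend each interesting k-sequence with 1-sequences on both sides
        candidates = ({s + x for s in Lk for x in L1} |
                      {x + s for s in Lk for x in L1})
        Lk = interesting(candidates, data, minsup)
        if Lk:
            levels.append(Lk)
    return levels

data = ["abcde", "bdae", "aebd", "be", "eabda",
        "aaaa", "baaa", "cbdb", "abbab", "abde"]
print(sequence_apriori([tuple(s) for s in data], 0.5))
```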

Mining Sequence Data

Example: the sequences abcde, bdae, aebd, be, eabda, aaaa, baaa, cbdb, abbab, abde, with minsup = 0.5

Interesting 1-sequences: a, b, d, e
Candidate 2-sequences: aa, ab, ad, ae, ba, bb, bd, be, da, db, dd, de, ea, eb, ed, ee

Mining Sequence Data

Example (continued), with minsup = 0.5:

Interesting 2-sequences: ab, bd
Candidate 3-sequences: aba, abb, abd, abe, aab, bab, dab, eab, bda, bdb, bdd, bde, bbd, dbd, ebd
Interesting 3-sequences: {}
Mining Sequence Data

Language inference: Given a set of sequences, consider each sequence as the behavioural trace of a machine, and infer a machine that can display the given sequences as its behavior.

Input: a set of sequences (e.g. aabb, ababcac, abbac) → Output: a state machine

Mining Sequence Data

Language inference infers the syntax of a language given its sentences.
- Applications: discerning behavioural patterns, discovery of emergent properties, collaboration modeling, ...
- State machine discovery is the reverse of state machine construction
- Discovery is maximalist in nature

Mining Sequence Data

Maximal nature of language inference, for the input sequences abc, aabc, aabbc, abbc:

[Figure: two automata accepting the input. The most general state machine is a single state with self-loops on a, b and c, which accepts everything. The most specific state machine encodes exactly the four input sequences as separate paths.]

Mining Sequence Data

Shortest-run Generalization (Srinivasa and Spiliopoulou 2000). Given a set of n sequences (a simplified code sketch follows):
1. Create a state machine for the first sequence
2. for j ← 2 to n do
   a. Create a state machine for the jth sequence
   b. Merge this machine into the earlier machine as follows:
      i. Merge all halt states of the new state machine into the halt state of the existing state machine
      ii. If two or more paths to the halt state share the same suffix, merge the suffixes together into a single path
3. done
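A rough sketch of one way to realize step 2b: if paths sharing a suffix are merged back from the common halt state, the merged machine can be represented as a trie of reversed sequences rooted at the halt state. This is my simplified interpretation, not the published algorithm:

```python
def shortest_run_generalization(sequences):
    # the halt state is the root of a trie built over reversed sequences,
    # so paths that share a suffix automatically share trie nodes
    halt = {}
    starts = []
    for seq in sequences:
        node = halt
        for sym in reversed(seq):
            node = node.setdefault(sym, {})  # one state per distinct suffix
        starts.append(node)                  # start state of this sequence
    return halt, starts

# the sequences from the slide's figure: aac and aabc share the suffix c,
# so their final transitions into the halt state collapse into one path
halt, starts = shortest_run_generalization(["aabcb", "aac", "aabc"])
```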

Mining Sequence Data

Shortest-run Generalization (Srinivasa and Spiliopoulou 2000)

[Figure: the state machines for the sequences aabcb, aac and aabc are merged step by step; all halt states collapse into one, and shared suffixes, such as the trailing c of aac and aabc, collapse into shared paths.]

Mining Streaming Data

Characteristics of streaming data:
- A large, often infinite, data sequence
- No storage of the full stream
- Examples: stock market quotes, streaming audio/video, network traffic

Mining Streaming Data

Running mean: Let n = number of items read so far, and avg = running average calculated so far. On reading the next number num:

avg ← (n*avg + num) / (n+1)
n ← n+1

Mining Streaming Data

Running variance: var is the mean of (num − avg)² over all numbers read so far. Expanding the square, Σ(num − avg)² = Σnum² − 2*avg*Σnum + n*avg² = Σnum² − n*avg², so it suffices to maintain:

A = Σ num² over all numbers read so far
avg = average of numbers read so far
n = count of numbers read so far

On reading the next number num (a code sketch follows):

avg ← (avg*n + num) / (n+1)
n ← n+1
A ← A + num²
var = A/n − avg²
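A minimal sketch of both running quantities in one pass; only n, avg and A are stored, never the stream itself (the sample numbers are arbitrary):

```python
class RunningStats:
    # streaming mean and variance in O(1) memory
    def __init__(self):
        self.n, self.avg, self.A = 0, 0.0, 0.0

    def update(self, num):
        self.avg = (self.n * self.avg + num) / (self.n + 1)
        self.n += 1
        self.A += num * num              # A accumulates the sum of squares

    @property
    def var(self):
        return self.A / self.n - self.avg ** 2

rs = RunningStats()
for num in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    rs.update(num)
print(rs.avg, rs.var)                    # 5.0 and 4.0 for this input
```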

Mining Streaming Data

α-Consistency (Srinivasa and Spiliopoulou, CoopIS 1999): Let streaming data arrive in frames, where each frame comprises one or more data elements. The support for a data element k within a frame is (#occurrences of k) / (#elements in frame). The α-consistency of k is its sustained support over all frames read so far, with a leakage of (1−α):

level_t(k) = (1−α)*level_{t−1}(k) + α*sup(k)
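A small Python sketch of this exponentially weighted update (the frame representation and the value of α are my own choices):

```python
def alpha_consistency(frames, k, alpha):
    # level_t(k) = (1 - alpha) * level_{t-1}(k) + alpha * sup_t(k)
    level = 0.0
    for frame in frames:
        sup = frame.count(k) / len(frame)   # support of k in this frame
        level = (1 - alpha) * level + alpha * sup
    return level

frames = [["a", "b", "a"], ["a", "c"], ["b", "c", "a", "a"]]
print(alpha_consistency(frames, "a", 0.5))
```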

Data Warehousing

- A platform for online analytical processing (OLAP)
- Warehouses collect transactional data from several transactional databases and organize it in a fashion amenable to analysis
- Also called data marts
- A critical component of the decision support system (DSS) of enterprises

Some typical DW queries:
- Which item sells best in each region that has retail outlets?
- Which advertising strategy is best for South India?
- Which (age group / occupation) in South India likes fast food, and which (age group / occupation) likes to cook?

Data Warehousing

OLTP sources (e.g. inventory) → Data Cleaning → Data Warehouse (OLAP)

OLTP vs OLAP

Transactional Data (OLTP)                 | Analysis Data (OLAP)
Small or medium size databases            | Very large databases
Transient data                            | Archival data
Frequent insertions and updates           | Infrequent updates
Small query shadow                        | Very large query shadow
Normalization important to handle updates | De-normalization important to handle queries

Data Cleaning

- Performs a logical transformation of transactional data to suit the data warehouse
- Model of operations → model of the enterprise
- Usually a semi-automatic process

Data Cleaning

[Figure: transactional schemas such as Orders (Order_id, Price, Cust_id) and Inventory (Prod_id, Price, Price_chng) map into data warehouse dimensions (Customers, Products, Orders, Inventory, Price, Time) and summary tables such as Sales (Cust_id, Cust_prof, Tot_sales).]

Multi-dimensional Data Model

[Figure: a data cube with a Customers axis and a Time axis (Jan01, Jun01, Jan02, Jun02, ...).]

Some MDBMS Operations

- Roll-up: collapse dimensions to summarize
- Drill-down: add dimensions for more detail
- Vector-distance operations (ex: clustering)
- Vector space browsing

Star Schema

[Figure: a central fact table connected to surrounding dimension tables.]

WWW Based References

- http://www.kdnuggets.com/
- http://www.megaputer.com/
- http://www.almaden.ibm.com/cs/quest/index.html
- http://fas.sfu.ca/cs/research/groups/DB/sections/publication/kdd/kdd.html
- http://www.cs.su.oz.au/~thierry/ckdd.html
- http://www.dwinfocenter.org/
- http://datawarehouse.itoolbox.com/
- http://www.knowledgestorm.com/
- http://www.bitpipe.com/
- http://www.dw-institute.com/
- http://www.datawarehousing.com/

References

- R. Agrawal, R. Srikant. "Fast Algorithms for Mining Association Rules". Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994.
- R. Agrawal, R. Srikant. "Mining Sequential Patterns". Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995.
- R. Agrawal, A. Arning, T. Bollinger, M. Mehta, J. Shafer, R. Srikant. "The Quest Data Mining System". Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996.
- Surajit Chaudhuri, Umesh Dayal. "An Overview of Data Warehousing and OLAP Technology". ACM SIGMOD Record, 26(1), March 1997.
- Jennifer Widom. "Research Problems in Data Warehousing". Proc. of the Int'l Conf. on Information and Knowledge Management, 1995.
- A. Shoshani. "OLAP and Statistical Databases: Similarities and Differences". Proc. of ACM PODS 1997.
- Panos Vassiliadis, Timos Sellis. "A Survey on Logical Models for OLAP Databases". ACM SIGMOD Record.
- M. Gyssens, Laks V. S. Lakshmanan. "A Foundation for Multi-Dimensional Databases". Proc. of VLDB 1997, Athens, Greece.
- Srinath Srinivasa, Myra Spiliopoulou. "Modeling Interactions Based on Consistent Patterns". Proc. of CoopIS 1999, Edinburgh, UK.
- Srinath Srinivasa, Myra Spiliopoulou. "Discerning Behavioral Patterns By Mining Transaction Logs". Proc. of ACM SAC 2000, Como, Italy.
