
DATA MINING

Association Rules Mining: The task of association rule mining is to find association relationships among a set of objects (called items) in a database. These relationships are described as association rules. Each rule has two measures, support and confidence. Confidence is a measure of the rule's strength, while support corresponds to its statistical significance.
The task of discovering association rules was first introduced in 1993 [AIS93]. Originally, association rule mining focused on market basket data, which stores items purchased on a per-transaction basis. A typical example of an association rule on market basket data is that 70% of customers who purchase bread also purchase butter.
Finding association rules is valuable for cross-marketing and attached-mailing applications. Other applications include catalog design, add-on sales, store layout, and customer segmentation based on buying patterns. Besides applications in the business area, association rule mining can also be applied to other areas, such as medical diagnosis and remotely sensed imagery.
Let I = {i1, i2, ..., im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Associated with each transaction is a unique identifier, called its TID. An association rule is an implication of the form X => Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. X is called the antecedent and Y the consequent of the rule.
There are two measures for each rule, support and confidence.
Association rule mining was initially used for market basket analysis, to find how items purchased by customers are related.
Algorithms:
AIS Algorithm
In the AIS algorithm [AIS 93], candidate itemsets are generated and counted on-the-fly as the database is scanned. After reading a transaction, it is determined which of the itemsets that were found to be large in the previous pass are contained in this transaction. New candidate itemsets are generated by extending these large itemsets with other items in the transaction.
SETM Algorithm
This algorithm was motivated by the desire to use SQL to compute large itemsets. Like AIS, the SETM algorithm also generates candidates on-the-fly based on transactions read from the database. To use the standard SQL join operation for candidate generation, SETM separates candidate generation from counting.
Apriori Algorithm
The disadvantage of the AIS and SETM algorithms is that they unnecessarily generate and count too many candidate itemsets that turn out to be small. To improve performance, the Apriori algorithm was proposed [AS 94]. The Apriori algorithm generates the candidate itemsets to be counted in a pass by using only the itemsets found large in the previous pass, without considering the transactions in the database. Apriori beats AIS and SETM by more than an order of magnitude for large datasets. The key idea of the Apriori algorithm lies in the "downward-closed" property of support, which means that if an itemset has minimum support, then all its subsets also have minimum support.
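
As a rough illustration of this candidate-generation idea (a base-R sketch, not the actual implementation from [AS 94]), the example below computes support over a small assumed transaction list, keeps the frequent 1-itemsets, and joins them into candidate 2-itemsets before counting; the toy transactions and the 40% minimum support are invented for the example.

# Toy transaction database (assumed example data)
transactions <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("butter", "jam"),
  c("bread", "butter", "jam")
)
min_sup <- 0.4   # minimum support threshold (assumed)

# Support of an itemset = fraction of transactions containing all of its items
support <- function(itemset, db) {
  mean(sapply(db, function(t) all(itemset %in% t)))
}

# L1: frequent 1-itemsets
items <- unique(unlist(transactions))
L1 <- items[sapply(items, function(i) support(i, transactions) >= min_sup)]

# Apriori join step: candidate 2-itemsets are built only from frequent 1-itemsets,
# relying on the downward-closed property of support
C2 <- combn(sort(L1), 2, simplify = FALSE)
L2 <- Filter(function(cand) support(cand, transactions) >= min_sup, C2)

# Confidence of a rule X => Y is support(X union Y) / support(X)
confidence <- function(x, y, db) support(c(x, y), db) / support(x, db)
print(L2)
print(confidence("bread", "butter", transactions))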

DHP (Direct Hashing and Pruning) Algorithm: In frequent itemset generation, the heuristic used to construct the candidate set of large itemsets is crucial to performance. The larger the candidate set, the more processing cost is required to discover the frequent itemsets. The processing in the initial iterations in fact dominates the total execution cost, which shows that initial candidate set generation, especially for large 2-itemsets, is the key issue in improving performance.
Based on this concern, DHP was proposed [PCY 95]. DHP is a hash-based algorithm and is especially effective for generating the candidate set of large 2-itemsets. DHP has two major features: one is efficient generation of large itemsets, the other is effective reduction of the transaction database size. Instead of including all k-itemsets from L(k-1) * L(k-1) into Ck as in Apriori, DHP adds a k-itemset into Ck only if that k-itemset passes the hash filtering, i.e., the k-itemset is hashed into a hash entry whose count is larger than or equal to the minimum support. Such hash filtering can drastically reduce the size of Ck.
DHP progressively trims the transaction database in two ways: one is to reduce the size of some transactions, the other is to remove some transactions entirely. The execution time of the first pass of DHP is slightly larger than that of Apriori, due to the extra overhead required to generate the hash table. However, DHP incurs significantly smaller execution times than Apriori in later passes. The reason is that Apriori scans the full database in every pass, whereas DHP only scans the full database for the first two passes and then scans the reduced database thereafter.
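
The following base-R sketch illustrates only the hash-filtering idea for candidate 2-itemsets; the toy transactions, the bucket count, the simple character-code hash function, and the minimum count are all assumptions, not the choices made in [PCY 95].

transactions <- list(
  c("A", "B", "C"), c("A", "B"), c("A", "C"), c("B", "D"), c("A", "B", "D")
)
min_count <- 2    # absolute minimum support count (assumed)
n_buckets <- 7    # size of the hash table (assumed)

# Simple illustrative hash: sum of character codes of the pair, modulo the table size
hash_pair <- function(pair) sum(utf8ToInt(paste(pair, collapse = ""))) %% n_buckets + 1

# Pass 1: while counting 1-itemsets, hash every 2-itemset of each transaction
bucket_counts <- integer(n_buckets)
for (t in transactions) {
  if (length(t) >= 2) {
    for (pair in combn(sort(t), 2, simplify = FALSE)) {
      h <- hash_pair(pair)
      bucket_counts[h] <- bucket_counts[h] + 1
    }
  }
}

# Candidate generation: keep a 2-itemset only if its hash bucket reaches min_count
L1 <- c("A", "B", "C", "D")    # frequent 1-items (assumed already known)
C2 <- combn(L1, 2, simplify = FALSE)
C2_filtered <- Filter(function(p) bucket_counts[hash_pair(p)] >= min_count, C2)
print(C2_filtered)

Because several pairs may share a bucket, the filter can let some infrequent pairs through, but it never discards a frequent one, which is exactly the property DHP relies on.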

DIC (Dynamic Itemset Counting) Algorithm


The DIC algorithm, proposed in [MUT 97], counts itemsets of different cardinality simultaneously. The transaction sequence is partitioned into blocks. The itemsets are stored in a lattice which is initialized with all singleton sets. While a block is scanned, the count of each itemset in the lattice is adjusted. After a block is processed, an itemset is added to the lattice if and only if all its subsets are potentially large. At the end of the sequence, the algorithm rewinds to the beginning. It terminates when the count of each itemset in the lattice is determined. Thus, after a finite number of scans, the lattice contains a superset of all large itemsets together with their counts.
DIC (Dynamic Itemset Counting) adds new candidate itemsets at partition points:
● Once both A and D are determined frequent, the counting of AD begins.
● Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins.

#Performance Evaluation of Different Data Mining Classification Algorithms


Different classification algorithms have been used for the performance evaluation; they are listed below.

1: J48 (C4.5): J48 is an implementation of C4.5 [8] that builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S = s1, s2, ... of already classified samples. Each sample si = x1, x2, ... is a vector where x1, x2, ... represent attributes or features of the sample. Decision trees are efficient to use and display good accuracy for large amounts of data. At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other.
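
As a small hands-on illustration (an assumption-laden sketch, not J48 itself), the rpart package, which implements the related CART decision-tree method, can be used in R on the built-in iris data to see attribute-based splitting in action:

library(rpart)                                            # CART-style decision trees

fit <- rpart(Species ~ ., data = iris, method = "class")  # learn a tree from iris
print(fit)                                                # shows the chosen splits
pred <- predict(fit, iris, type = "class")
mean(pred == iris$Species)                                # training accuracy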
2: Naive Bayes: A naive Bayes classifier assumes that the presence or absence of a particular feature is unrelated to the presence or absence of any other feature, given the class variable. Bayesian belief networks are graphical models which, unlike the naive Bayesian classifier, allow the representation of dependencies among subsets of attributes [10]. Bayesian belief networks can also be used for classification. The simplifying assumption is that attributes are conditionally independent given the class: P(X|Ci) = Πk P(xk|Ci).
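
The conditional-independence assumption can be applied directly in a few lines of base R; the tiny training table and the new observation below are invented purely for illustration. The classifier scores each class as P(C) multiplied by the product of P(xk | C) over the attributes.

train <- data.frame(
  outlook = c("sunny", "sunny", "rain", "rain", "overcast", "rain"),
  windy   = c("no", "yes", "no", "yes", "no", "no"),
  play    = c("no", "no", "yes", "no", "yes", "yes")
)

prior <- prop.table(table(train$play))                      # class priors P(C)
cond  <- lapply(c("outlook", "windy"), function(a)
  prop.table(table(train[[a]], train$play), margin = 2))    # conditional tables P(xk | C)
names(cond) <- c("outlook", "windy")

new_obs <- list(outlook = "rain", windy = "no")
scores <- sapply(names(prior), function(cl) {
  p <- prior[cl]
  for (a in names(new_obs)) p <- p * cond[[a]][new_obs[[a]], cl]
  unname(p)
})
scores / sum(scores)    # normalized posterior probabilities for each class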

3: k-Nearest Neighbors: For continuous-valued target functions, the k-NN algorithm returns the mean value of the k nearest neighbors. The distance-weighted nearest neighbor variant weights the contribution of each of the k neighbors according to its distance to the query point xq, giving greater weight to closer neighbors; the same scheme applies to real-valued target functions. The method is robust to noisy data because it averages over the k nearest neighbors.
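
A distance-weighted k-NN prediction for a real-valued target can be written directly in base R; the toy data, k = 3, and the 1/d^2 weighting scheme are assumptions chosen for illustration.

x  <- c(1, 2, 3, 6, 7, 8)                # training inputs
y  <- c(1.1, 1.9, 3.2, 5.8, 7.1, 8.0)    # training targets
xq <- 4                                  # query point
k  <- 3

d  <- abs(x - xq)                        # distances to the query point
nn <- order(d)[1:k]                      # indices of the k nearest neighbors
w  <- 1 / (d[nn]^2 + 1e-9)               # closer neighbors get greater weight
sum(w * y[nn]) / sum(w)                  # distance-weighted mean prediction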

4: Neural Network: Neural networks have emerged as an important tool for classification. The extensive research activity in neural classification has established that neural networks are a promising alternative to various conventional classification methods. The advantage of neural networks lies in the following theoretical aspects. First, neural networks are data-driven, self-adaptive methods: they can adjust themselves to the data without any explicit specification of a functional or distributional form for the underlying model.

5: Support Vector Machine: A classification method for both linear and nonlinear data. It uses a nonlinear mapping to transform the original training data into a higher dimension. In this new dimension, it searches for the linear optimal separating hyperplane (i.e., the "decision boundary"). With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. The SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors).
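
The following sketch uses the e1071 package (a common R interface to LIBSVM) on the built-in iris data; the package choice and the radial kernel are assumptions made for the example, not requirements of the method.

library(e1071)

model <- svm(Species ~ ., data = iris, kernel = "radial")  # nonlinear mapping via an RBF kernel
summary(model)                   # reports, among other things, the number of support vectors
pred <- predict(model, iris)
mean(pred == iris$Species)       # training accuracy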
---------------------------------------------------------------------------------
Review of Basic Data Analytic Methods Using R
Introduction to R: R is a programming language and software framework for statistical analysis and graphics. Available for use under the GNU General Public License, R software and installation instructions can be obtained via the Comprehensive R Archive Network (CRAN). Functions such as summary() can help analysts easily get an idea of the magnitude and range of the data, but other aspects, such as linear relationships and distributions, are more difficult to see from descriptive statistics.
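
For example, summary() applied to the built-in mtcars data frame (used here only as a convenient example data set) reports the minimum, quartiles, median, mean, and maximum of each variable:

data(mtcars)
summary(mtcars$mpg)    # descriptive statistics for one variable
summary(mtcars)        # the same statistics for every column of the data frame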
R Graphical User Interfaces: R software uses a command-line interface (CLI) that is similar to the BASH shell in Linux or the interactive versions of scripting languages such as Python. UNIX and Linux users can enter the command R at the terminal prompt to use the CLI. For Windows installations, R comes with RGui.exe, which provides a basic graphical user interface (GUI). However, to improve the ease of writing, executing, and debugging R code, several additional GUIs have been written for R. Popular GUIs include R Commander.
Exploratory Data Analysis: Exploratory data analysis [9] is a data analysis approach that reveals the important characteristics of a dataset, mainly through visualization. A useful way to detect patterns and anomalies in the data is through exploratory data analysis with visualization. Visualization gives a succinct, holistic view of the data that may be difficult to grasp from the numbers and summaries alone. Variables x and y of the data frame data can instead be visualized in a scatter plot, which easily depicts the relationship between two variables. As an important facet of the initial data exploration, visualization assesses data cleanliness and suggests potentially important relationships in the data prior to the model planning and building phases.

Related topics: Visualization Before Analysis, Dirty Data, Visualizing a Single Variable (Dotchart and Barplot, Histogram and Density Plot)
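
A minimal sketch of these plots in base R graphics, again using the built-in mtcars data frame as an assumed example data set:

data(mtcars)

dotchart(mtcars$mpg, labels = rownames(mtcars))   # dot chart of a single variable
barplot(table(mtcars$cyl))                        # bar plot of a categorical variable
hist(mtcars$mpg)                                  # histogram
plot(density(mtcars$mpg))                         # density plot
plot(mtcars$wt, mtcars$mpg)                       # scatter plot of two variables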

Statistical Methods for Evaluation: Visualization is useful for data exploration and presentation, but statistics is crucial because statistical techniques are applied throughout the entire Data Analytics Lifecycle. Statistical techniques are used during the initial data exploration and data preparation, model building, evaluation of the final models, and assessment of how the new models improve the situation when deployed in the field. In particular, statistics can help answer the following questions for data analytics:
● Model Building and Planning
● What are the best input variables for the model?
● Can the model predict the outcome given the input?

Some useful statistical tools:


Hypothesis Testing: A common technique to assess the difference or the significance of the difference is hypothesis
testing. The basic concept of hypothesis testing is to form an assertion and test it with data. When performing hypothesis
tests, the common assumption is that there is no difference between two samples. This assumption is used as the default
position for building the test or conducting a scientific experiment. Statisticians refer to this as the null hypothesis.
It is important to state the null hypothesis and alternative hypothesis, because misstating them is likely to undermine the
subsequent steps of the hypothesis testing process. A hypothesis test leads to either rejecting the null hypothesis in favor
of the alternative or not rejecting the null hypothesis.
Difference of Means: Hypothesis testing is a common approach to draw inferences on whether or not the two populations,
denoted pop1 and pop2, are different from each other. This section provides two hypothesis tests to compare the means of
the respective populations based on samples randomly drawn from each population.
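
A minimal sketch of such a comparison in R, using simulated samples (the normal distributions and sample sizes below are assumptions for illustration):

set.seed(1)
sample1 <- rnorm(40, mean = 100, sd = 5)    # sample drawn from pop1
sample2 <- rnorm(40, mean = 105, sd = 5)    # sample drawn from pop2

t.test(sample1, sample2)                    # Welch two-sample t-test (unequal variances)
t.test(sample1, sample2, var.equal = TRUE)  # Student's t-test assuming equal variances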
Wilcoxon Rank-Sum Test: A t-test represents a parametric test in that it makes assumptions about the population
distributions from which the samples are drawn. If the populations cannot be assumed or transformed to follow a normal
distribution, a nonparametric test can be used. The Wilcoxon rank-sum test is a nonparametric hypothesis test that checks
whether two populations are identically distributed. Assuming the two populations are identically distributed, one would
expect that the ordering of any sampled observations would be evenly intermixed among themselves.
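
In R, the same comparison can be made without the normality assumption using wilcox.test(); the skewed simulated samples below are assumptions for illustration.

set.seed(2)
sample1 <- rexp(40, rate = 1 / 10)    # skewed sample from pop1
sample2 <- rexp(40, rate = 1 / 14)    # skewed sample from pop2

wilcox.test(sample1, sample2)         # nonparametric test that the populations are identically distributed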
---------------------------------------------------------------------------------------------------------------------------------------------------

Data Cube Computation: Data cube computation is an essential task in data warehouse implementation. The
precomputation of all or part of a data cube can greatly reduce the response time and enhance the performance of online
analytical processing. However, such computation is challenging because it may require substantial computational time
and storage space.

METHODS: 1. Full Cube: I. Full materialization II. Materializing all the cells of all of the cuboids for a given data cube
III. Issues in time and space
2. Iceberg cube: I. Partial materialization II. Materializing the cells of only interesting cuboids III. Materializing only the
cells in a cuboid whose measure value is above the minimum threshold
3. Closed cube: Materializing only closed cells

Computation Techniques: 1. Aggregating: Aggregating from the smallest child cuboid


2. Caching: Caching the result of a cuboid for the computation of other cuboids to reduce disk I/O.
3. Sorting, Hashing and Grouping: Sorting, hashing, and grouping operations are applied to a dimension in order to reorder and cluster related tuples.
4. Pruning: Apriori-based pruning of the cells whose support is lower than the minimum threshold.
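
As a rough base-R sketch of these ideas (not a real cube engine), the fragment below aggregates a tiny assumed fact table into a few cuboids and applies iceberg-style pruning; the table, dimensions, and minimum threshold are invented for illustration.

sales <- data.frame(
  city    = c("Pune", "Pune", "Delhi", "Delhi", "Pune"),
  item    = c("TV", "Phone", "TV", "TV", "Phone"),
  quarter = c("Q1", "Q1", "Q1", "Q2", "Q2"),
  amount  = c(100, 200, 150, 120, 80)
)

# Full materialization: aggregate the measure for each cuboid of interest
cuboid_city_item <- aggregate(amount ~ city + item, data = sales, FUN = sum)
cuboid_city      <- aggregate(amount ~ city, data = sales, FUN = sum)
apex             <- sum(sales$amount)              # the all-(*) cuboid

# Iceberg-style pruning: keep only the cells whose measure meets a minimum threshold
min_amount <- 200
iceberg_city_item <- subset(cuboid_city_item, amount >= min_amount)
print(iceberg_city_item)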

#Multi-Way Array Aggregation: i. Array-based “bottom-up” approach, (ii) Uses multi-dimensional chunks (iii) No
direct tuple comparisons (iv) Simultaneous aggregation on multiple dimensions (v) Intermediate aggregate values
are re-used for computing ancestor cuboids (vi) Full materialization
Aggregation Strategy: (i) Partitions array into chunks (ii) Data addressing (III) Multi-way Aggregation

#Bottom-Up Computation: (I) “Top-down” approach (II) Partial materialization (iceberg cube computation) (III)
Divides dimensions into partitions and facilitates iceberg pruning (IV) No simultaneous aggregation

Iceberg Pruning Process: (I) Partitioning: (i) Sorts data values (ii) Partitions into blocks that fit in memory
(II) Apriori Pruning: For each block
• If it does not satisfy min_sup, its descendants are pruned
• If it satisfies min_sup, it is materialized and a recursive call is made, including the next dimension

#Shell Fragment Cube Computation (I) Reduces a high dimensional cube into a set of lower dimensional cubes
(II) Lossless reduction (III) Online re-construction of high-dimensional data cube
Fragmentation Strategy: (i) Observation (ii) Fragmentation (iii) Semi-Online Computation
---------------------------------------------------------------------------------------------------------------------------------------------------

Mining Frequent Patterns without Candidate Generation: Frequent pattern mining plays an essential role in mining associations, correlations, sequential patterns, episodes, multi-dimensional patterns, max-patterns, partial periodicity, emerging patterns, and many other important data mining tasks.

First, we design a novel data structure, called frequent pattern tree, or FP-tree for short, which is an extended prefix-tree
structure storing crucial, quantitative information about frequent patterns. To ensure that the tree structure is compact and
informative, only frequent length-1 items will have nodes in the tree. The tree nodes are arranged in such a way that more
frequently occurring nodes will have better chances of sharing nodes than less frequently occurring ones. Our experiments
show that such a tree is highly compact, usually orders of magnitude smaller than the original database. This offers an FP-
tree-based mining method a much smaller data set to work on.

Second, we develop an FP-tree-based pattern fragment growth mining method, which starts from a frequent length-1
pattern (as an initial suffix pattern), examines only its conditional pattern base (a "sub-database" which consists of the set
of frequent items co-occurring with the suffix pattern), constructs its (conditional) FP-tree, and performs mining
recursively with such a tree. The pattern growth is achieved via concatenation of the suffix pattern with the new ones
generated from a conditional FP-tree. Since the frequent itemset in any transaction is always encoded in the corresponding
path of the frequent pattern trees, pattern growth ensures the completeness of the result. In this context, our method is not
Apriori-like restricted generation-and-test but restricted test only. The major operations of mining are count accumulation
and prefix path count adjustment, which are usually much less costly than candidate generation and pattern matching
operations performed in most Apriori-like algorithms.
Third, the search technique employed in mining is a partitioning-based, divide-and-conquer method rather than Apriori-like bottom-up generation of frequent itemset combinations. This dramatically reduces the size of the conditional pattern base generated at the subsequent level of search, as well as the size of the corresponding conditional FP-tree. Moreover, it transforms the problem of finding long frequent patterns into looking for shorter ones and then concatenating the suffix. It employs the least frequent items as suffixes, which offers good selectivity. All these techniques contribute to a substantial reduction of search costs.

Algorithm (FP-growth: Mining frequent patterns with FP-tree and by pattern fragment growth)
Input: FP-tree constructed based on Algorithm 1, using DB and a minimum support threshold ξ.
Output: The complete set of frequent patterns.
Method: Call FP-growth(FP-tree, null), which is implemented as follows.

Procedure FP-growth(Tree, α)
{
(1) IF Tree contains a single path P
(2) THEN FOR EACH combination (denoted as β) of the nodes in the path P DO
(3) generate pattern β ∪ α with support = minimum support of the nodes in β;
(4) ELSE FOR EACH ai in the header of Tree DO {
(5) generate pattern β = ai ∪ α with support = ai.support;
(6) construct β's conditional pattern base and then β's conditional FP-tree Treeβ;
(7) IF Treeβ ≠ ∅
(8) THEN call FP-growth(Treeβ, β);
}
}
(α and β denote itemset patterns; ∪ is set union and ∅ the empty tree.)
-------------------------------------------------------------------------------------------------------------------

Classification: There are two forms of data analysis that can be used for extracting models describing
important classes or to predict future data trends. These two forms are as follows −

 Classification
 Prediction

Classification models predict categorical class labels. The following are examples of cases where the data analysis task is classification:

 A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
 A marketing manager at a company needs to predict whether a customer with a given profile will buy a new computer.

The Data Classification process includes two steps −

 Building the Classifier or Model


 Using Classifier for Classification

Building the Classifier or Model

 This step is the learning step or the learning phase.


 In this step the classification algorithms build the classifier.
 The classifier is built from the training set made up of database tuples and their associated class labels.
 Each tuple that constitutes the training set belongs to a predefined category or class. These tuples can also be referred to as samples, objects, or data points.

Using Classifier for Classification

In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy of
classification rules. The classification rules can be applied to the new data tuples if the accuracy is considered
acceptable.

Decision Tree: A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal
node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class
label. The topmost node in the tree is the root node.

The benefits of having a decision tree are as follows −

 It does not require any domain knowledge.


 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple and fast.

The following decision tree is for the concept buy computer that indicates whether a customer at a company is
likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node represents a
class.
Decision Tree Induction Algorithm: A machine learning researcher named J. Ross Quinlan developed, in 1980, a decision tree algorithm known as ID3 (Iterative Dichotomiser). Later, he presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the trees are constructed in a top-down, recursive, divide-and-conquer manner.

Tree induction selects splitting attributes using metrics based on information theory.

Information gain for a node with p positive and n negative examples:

I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy of attribute A, whose values partition the examples into subsets with pi positive and ni negative examples:

E(A) = Σi ((pi + ni)/(p + n)) I(pi, ni)

Total gain: Gain(A) = I(p, n) - E(A)
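
A worked computation of these formulas in base R; the 9 positive / 5 negative class distribution and the three-way attribute partition are the classic "buys computer" example values, used here only as an assumed illustration.

# Information of a (p, n) class distribution
info <- function(p, n) {
  f <- function(x, total) if (x == 0) 0 else -(x / total) * log2(x / total)
  f(p, p + n) + f(n, p + n)
}

p <- 9; n <- 5                       # class distribution of the whole training set
I_pn <- info(p, n)                   # I(p, n), about 0.940 bits

# Partition induced by an attribute: (pi, ni) for each attribute value
partition <- list(c(2, 3), c(4, 0), c(3, 2))
E_A <- sum(sapply(partition, function(v) (sum(v) / (p + n)) * info(v[1], v[2])))

gain <- I_pn - E_A                   # Gain(A), about 0.246 bits
print(c(I_pn = I_pn, E_A = E_A, gain = gain))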
