A huge amount of raw data is available. The motivation for data mining is to analyse, classify, cluster, and characterize this data.
An ER data model represents the database as a set of entities and their relationships.
DATA WAREHOUSE
A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing.
Text databases are databases that contain word descriptions for objects. These word descriptions are usually not simple keywords but rather long sentences or paragraphs, such as product specifications, error or bug reports, warning messages, summary reports, notes, or other documents.
Data mining functionalities include cluster analysis, outlier analysis, and evolution analysis, described below.
CLUSTER ANALYSIS
What is cluster analysis? Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
OUTLIER ANALYSIS
A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. However, in some applications such as fraud detection, the rare events can be more interesting than the more regularly occurring ones.
EVOLUTION ANALYSIS
Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association and correlation analysis, classification, prediction, or clustering of time related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
A data mining task can be specified in terms of: the set of task-relevant data to be mined, the kind of knowledge to be mined, and the background knowledge to be used in the discovery process.
DATA PREPROCESSING
Why Preprocess the Data ?
Imagine that you are a manager at AllElectronics and have been charged with analyzing the company's data with respect to the sales at your branch. You carefully inspect the company's database and data warehouse, identifying and selecting the attributes or dimensions to be included in your analysis, such as item, price, and units sold. Alas! You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions. In other words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), noisy (containing errors, or outlier values that deviate from the expected), and inconsistent (e.g., containing discrepancies in the department codes used to categorize items).
DATA CLEANING
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
Missing Values
Missing values can be handled in the following ways:
1) Ignore the tuple.
2) Fill in the missing value manually.
3) Use a global constant to fill in the missing value.
4) Use the attribute mean to fill in the missing value.
5) Use the attribute mean for all samples belonging to the same class as the given tuple.
6) Use the most probable value to fill in the missing value.
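As an illustration of these options, here is a minimal sketch using pandas on a small hypothetical sales table; the column names and values are invented for illustration and are not from the AllElectronics data.

import pandas as pd

# Hypothetical tuples; None marks missing attribute values.
df = pd.DataFrame({
    "item":       ["TV", "TV", "laptop", "laptop", "camera"],
    "price":      [400.0, None, 900.0, 1100.0, None],
    "units_sold": [3, 5, None, 2, 4],
})

# Option 1: ignore (drop) tuples that contain missing values.
dropped = df.dropna()

# Option 3: fill with a global constant.
constant_filled = df.fillna({"price": -1, "units_sold": -1})

# Option 4: fill with the attribute mean.
mean_filled = df.fillna({"price": df["price"].mean(),
                         "units_sold": df["units_sold"].mean()})

# Option 5: fill with the attribute mean per class (here, per item).
class_mean_filled = df.copy()
class_mean_filled["price"] = df.groupby("item")["price"] \
                               .transform(lambda s: s.fillna(s.mean()))

print(mean_filled)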
Noisy Data
Noise is a random error or variance in a measured variable. Noise can be removed in the following three ways:
1) Binning (see the example below).
2) Regression: data can be smoothed by fitting the data to a function.
3) Clustering: outliers may be detected by clustering, where similar values are organized into groups, or clusters.
BINNING
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
Detecting that data cleaning is required for particular data is called discrepancy detection; it can be done using knowledge about the data (metadata).
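The following short Python sketch reproduces the price example above: equal-frequency bins, smoothing by bin means, and smoothing by bin boundaries.

# Equal-frequency binning with two smoothing strategies.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value is replaced by its bin's mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer of
# the bin's minimum or maximum value.
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

print(by_means)        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]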
DATA INTEGRATION
It is likely that your data analysis task will involve data integration, which combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. There are a number of issues to consider during data integration:
1) Schema integration and object matching, e.g., customer_id in one database and cust_number in another.
2) Redundancy.
Hence we perform data integration with the help of metadata, normalisation, and correlation analysis using the χ² (chi-square) test (see the sketch below).
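As a sketch of correlation analysis for redundancy detection, the following example applies the χ² test with SciPy to a hypothetical contingency table for two nominal attributes; the attribute names and counts are illustrative, not from any table in these notes.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table for two nominal attributes,
# e.g. gender (rows) vs. preferred_reading (columns).
observed = np.array([[250, 200],
                     [ 50, 1000]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.1f}, p-value = {p_value:.3g}")

# A very small p-value means the two attributes are strongly correlated,
# so one of them may be treated as redundant during integration.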
DATA TRANSFORMATION
The data are transformed or consolidated into forms appropriate for mining (e.g., by smoothing, aggregation, generalization, and normalization).
DATA REDUCTION
The data to be mined is generally very large; hence the data is reduced in the following ways:
1) Data cube aggregation
2) Attribute subset selection
3) Dimensionality reduction
4) Numerosity reduction
5) Discretization and concept hierarchy generation
DIMENSIONALITY REDUCTION
Wavelet Transforms
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X′, of wavelet coefficients. The two vectors are of the same length. When applying this technique to data reduction, we consider each tuple as an n-dimensional data vector, that is, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes. How can this technique be useful for data reduction if the wavelet-transformed data are of the same length as the original data? The usefulness lies in the fact that the wavelet-transformed data can be truncated. A compressed approximation of the data can be retained by storing only a small fraction of the strongest of the wavelet coefficients.
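A minimal sketch of this idea, using a hand-rolled Haar wavelet transform in NumPy (rather than a full DWT library) on a small made-up vector: the transformed vector has the same length as the input, and the reduction comes from keeping only the strongest coefficients.

import numpy as np

def haar_dwt(x):
    # One full Haar wavelet decomposition of a vector whose length is a power of 2.
    x = np.asarray(x, dtype=float)
    details = []
    while len(x) > 1:
        avg  = (x[0::2] + x[1::2]) / np.sqrt(2)   # smooth (approximation) part
        diff = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail coefficients
        details.append(diff)
        x = avg
    return np.concatenate([x] + details[::-1])    # same length as the input

data = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])
coeffs = haar_dwt(data)

# Data reduction: keep only the k strongest coefficients and zero out the rest.
k = 3
strongest = np.argsort(np.abs(coeffs))[-k:]
compressed = np.zeros_like(coeffs)
compressed[strongest] = coeffs[strongest]
print(coeffs.round(3))
print(compressed.round(3))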
NUMEROSITY REDUCTION
Can we reduce the data volume by choosing alternative, smaller forms of data representation? This can be done in the following ways:
1) Regression and log-linear models: the data are modeled to fit a straight line.
2) Histograms: a histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, or buckets.
3) Clustering: clustering techniques partition the objects into groups, or clusters, so that objects within a cluster are similar to one another and dissimilar to objects in other clusters.
4) Sampling: it allows a large data set to be represented by a much smaller random sample (or subset) of the data.
5) Data discretization and concept hierarchy generation (see the next section).
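A small NumPy sketch of two of these numerosity-reduction techniques, sampling and histograms, applied to a hypothetical attribute (the data is randomly generated for illustration).

import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=20, size=10_000)   # hypothetical attribute values

# Sampling: represent the full data set by a much smaller random subset
# (simple random sample without replacement).
sample = rng.choice(data, size=200, replace=False)

# Histogram: represent the distribution by 10 equal-width buckets,
# i.e. bucket counts and edges instead of 10,000 raw values.
counts, edges = np.histogram(data, bins=10)
print(counts)
print(np.round(edges, 1))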
DISCRETIZATION AND CONCEPT HIERARCHY GENERATION FOR NUMERICAL DATA
Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. This is done in the following ways:
1) Binning: here we take bins with some intervals.
2) Histogram analysis: here, the histogram partitions the values into buckets.
3) Entropy-based discretization: the method selects the value of A that has the minimum entropy as a split-point, and recursively partitions the resulting intervals to arrive at a hierarchical discretization.
4) Interval merging by χ² analysis: ChiMerge employs a bottom-up approach, finding the best neighboring intervals and then merging them recursively to form larger intervals.
5) Cluster analysis: a clustering algorithm can be applied to discretize a numerical attribute, A, by partitioning the values of A into clusters or groups.
6) Discretization by intuitive partitioning: for example, annual salaries broken into ranges like ($50,000, $60,000] are often more desirable than ranges like ($51,263.98, $60,872.34], obtained by, say, some sophisticated clustering analysis.
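The following is a rough sketch of entropy-based split-point selection for a single level of discretization; the recursive partitioning of the resulting intervals is omitted, and the age values and class labels below are invented for illustration.

import numpy as np

def entropy(labels):
    # Shannon entropy of a list of class labels.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(values, labels):
    # Return the split point of attribute A that minimizes the
    # expected (weighted) entropy of the two resulting intervals.
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_point, best_e = None, np.inf
    # Candidate split points are midpoints between successive distinct values.
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        point = (values[i] + values[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if e < best_e:
            best_point, best_e = point, e
    return best_point, best_e

# Hypothetical (age, class) pairs; the class could be buys_computer yes/no.
ages   = [23, 25, 30, 35, 40, 45, 50, 55]
labels = ["no", "no", "no", "yes", "yes", "yes", "no", "no"]
print(best_split(ages, labels))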
DATA WAREHOUSE
Traditional databases use OLTP (on-line transaction processing), whereas a data warehouse uses OLAP (on-line analytical processing).
A data warehouse, however, requires a concise, subject-oriented schema that facilitates on-line data analysis. The most popular data model for a data warehouse is a multidimensional model. Such a model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema.
Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope.
Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales.
Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized.
Multidimensional OLAP (MOLAP) servers: These servers support multidimensional views of data through array-based multidimensional storage engines.
Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP.
Specialized SQL servers: To meet the growing demand of OLAP processing in relational databases, some database system vendors implement specialized SQL servers that provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.
CONCEPT DESCRIPTION/CHARACTERIZATION
Data generalization summarizes data by replacing relatively low-level values (such as numeric values for an attribute age) with higher-level concepts (such as young, middle-aged, and senior). Given the large amount of data stored in databases, it is useful to be able to describe concepts in concise and succinct terms at generalized (rather than low) levels of abstraction.
Attribute-Oriented Induction for Data Characterization
Before attribute-oriented induction:
1) First, data focusing should be performed before attribute-oriented induction. This step corresponds to the specification of the task-relevant data (i.e., data for analysis). The data are collected based on the information provided in the data mining query.
2) Specifying the set of relevant attributes. For example, suppose that the dimension birth place is defined by the attributes city, province or state, and country. Of these attributes, let's say that the user has only thought to specify city. In order to allow generalization on the birth place dimension, the other attributes defining this dimension should also be included.
3) A correlation-based or entropy-based analysis method can be used to perform attribute relevance analysis and filter out statistically irrelevant or weakly relevant attributes from the descriptive mining process.
Attribute generalization can be controlled in two ways:
1) Attribute generalization threshold control: sets one threshold for each attribute. If the number of distinct values in an attribute is greater than the attribute threshold, further attribute removal or attribute generalization should be performed.
2) Generalized relation threshold control: sets a threshold for the generalized relation. If the number of (distinct) tuples in the generalized relation is greater than the threshold, further generalization should be performed.
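A minimal sketch of attribute generalization threshold control, assuming a hypothetical city-to-country concept hierarchy: while the attribute has more distinct values than its threshold, each value is climbed one level up the hierarchy.

# Hypothetical concept hierarchy: city -> country.
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Seattle": "USA", "New York": "USA"}

def generalize(values, hierarchy, threshold):
    # Generalize while the attribute has too many distinct values.
    while len(set(values)) > threshold:
        higher = [hierarchy.get(v, v) for v in values]
        if higher == values:          # no higher-level concept left
            break
        values = higher
    return values

cities = ["Vancouver", "Toronto", "Seattle", "New York", "Toronto"]
print(generalize(cities, city_to_country, threshold=2))
# ['Canada', 'Canada', 'USA', 'USA', 'Canada']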
Presentation of the Derived Generalization
The derived generalized relation can be presented in the form of a table, a bar chart, a pie chart, or a 3-D data cube.
Mining Class Comparisons
In many applications, users may not be interested in having a single class (or concept) described or characterized, but rather would prefer to mine a description that compares or distinguishes one class (or concept) from other comparable classes (or concepts). Class comparison is done by the following procedure:
1) Data collection: the set of relevant data in the database is collected by query processing and is partitioned respectively into a target class and one or a set of contrasting class(es).
2) Dimension relevance analysis: if there are many dimensions, then dimension relevance analysis should be performed on these classes to select only the highly relevant dimensions for further analysis. Correlation or entropy-based measures can be used for this step.
3) Synchronous generalization: generalization is performed on the target class to the level controlled by a user- or expert-specified dimension threshold, which results in a prime target class relation.
4) Presentation of the derived comparison: the resulting class comparison description can be visualized in the form of tables, graphs, and rules.
Moreover, such descriptions help in data classification, clustering, and other data mining tasks as well.
APRIORI ALGORITHM
Consider a particular departmental store with the following transactions:

Tid | Items
----+---------------------------
 1  | Bread, Milk
 2  | Bread, Diapers, Beer, Eggs
 3  | Milk, Diapers, Beer, Cola
 4  | Bread, Milk, Diapers, Beer
 5  | Bread, Milk, Diapers, Cola
The above table can be represented in binary format as below:

Item    | Tid: 1  2  3  4  5 | Total
--------+--------------------+------
Bread   |      1  1  0  1  1 |   4
Milk    |      1  0  1  1  1 |   4
Diapers |      0  1  1  1  1 |   4
Beer    |      0  1  1  1  0 |   3
Eggs    |      0  1  0  0  0 |   1
Cola    |      0  0  1  0  1 |   2
Then the 1-itemsets are generated from the binary table, i.e.:

Item    | Count
--------+------
Beer    |   3
Bread   |   4
Cola    |   2
Diapers |   4
Milk    |   4
Eggs    |   1
Then, taking the support threshold as 60% of the 5 transactions, i.e., a minimum support count of 3, Cola and Eggs are discarded from the itemsets, as their counts are below the threshold of 3. From the remaining 1-itemsets, the 2-itemsets are generated as 4C2 = 6; i.e., of the 6 items in the 1-itemsets, 2 are discarded, 4 remain, and pairs are chosen from these 4 to form the 2-itemsets. The 2-itemsets are:
Itemset        | Count
---------------+------
Beer, Bread    |   2
Beer, Diapers  |   3
Beer, Milk     |   2
Bread, Diapers |   3
Bread, Milk    |   3
Diapers, Milk  |   3
In the 2-itemsets, (Beer, Bread) and (Beer, Milk) are discarded as their counts are below the threshold of 3. The candidate 3-itemsets are then generated from the remaining items as 4C3.
In conclusion, Apriori generated the 1-, 2-, and 3-itemsets as 6C1 + 4C2 + 4C3 = 6 + 6 + 4 = 16 candidates, whereas the brute-force strategy would generate 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41. Hence, Apriori generates far fewer candidate itemsets while still finding all frequent itemsets.
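The following Python sketch runs a simplified Apriori over the five example transactions above. Candidate generation here simply joins any two frequent itemsets whose union has size k (a simplification of the textbook's ordered join-and-prune step), but for this data it yields the same frequent itemsets.

# Simplified Apriori over the example transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
min_support = 3   # 60% of 5 transactions

def support(itemset):
    # Number of transactions containing the itemset.
    return sum(itemset <= t for t in transactions)

# Frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
frequent = [{i} for i in items if support({i}) >= min_support]

# Iteratively generate k-itemsets from the frequent (k-1)-itemsets.
k = 2
while frequent:
    print(f"frequent {k - 1}-itemsets:",
          [(sorted(s), support(s)) for s in frequent])
    candidates = {frozenset(a | b) for a in frequent for b in frequent
                  if len(a | b) == k}
    frequent = [set(c) for c in candidates if support(c) >= min_support]
    k += 1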
Of the two rules above, Rule 1 is chosen as it is the more general one and its support value is higher than that of Rule 2.
MINING MULTIDIMENSIONAL ASSOCIATION RULES FROM RELATIONAL DATABASES AND DATA WAREHOUSES
Equal-frequency binning:
age(X, 30...31) ∧ income(X, 40K...41K) ⇒ buys(X, HDTV)
age(X, 31...32) ∧ income(X, 41K...42K) ⇒ buys(X, HDTV)
age(X, 32...33) ∧ income(X, 42K...43K) ⇒ buys(X, HDTV)
age(X, 33...34) ∧ income(X, 43K...44K) ⇒ buys(X, HDTV)
Here, for each bin, an equal number of tuples is taken.
Clustering-based binning:
Consider the following tuples; then we can group/cluster them as below.
age(X, 34) ∧ income(X, 31K...40K) ⇒ buys(X, HDTV)
age(X, 35) ∧ income(X, 31K...40K) ⇒ buys(X, HDTV)
age(X, 34) ∧ income(X, 41K...50K) ⇒ buys(X, HDTV)
age(X, 35) ∧ income(X, 41K...50K) ⇒ buys(X, HDTV)
Then we can form a 2-D grid and cluster the tuples to obtain the HDTV purchase zone.
lift(game, video) = P(game ∧ video) / (P(game) × P(video)); a lift value greater than 1 indicates that the occurrences of game and video are positively correlated.
We have not only the age and income predicates; there can be many other predicates, and each predicate takes some range of values. The question is: for which set of predicates, and for which set of values, is the predicate buys maximized? We consider the best set of predicates and the best set of values that yield the maximum for the buys predicate. That is constraint-based association mining.
The information gain of age has the maximum value; hence age acts as the best classifier (splitting) attribute in the following figure.
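A small sketch of how information gain is computed for a candidate splitting attribute; the (age, buys_computer) tuples below are invented for illustration and are not the table from the figure.

import numpy as np

def entropy(labels):
    # Shannon entropy of the class distribution.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(attribute_values, labels):
    # Information gain of splitting the labels on the given attribute.
    labels = np.asarray(labels)
    attribute_values = np.asarray(attribute_values)
    expected = 0.0
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        expected += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - expected

# Hypothetical training tuples: (age group, buys_computer).
age  = ["youth", "youth", "middle", "senior", "senior", "middle", "youth"]
buys = ["no",    "no",    "yes",    "yes",    "no",     "yes",    "yes"]
print(f"Gain(age) = {info_gain(age, buys):.3f}")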
BAYESIAN CLASSIFICATION
Bayesian classifiers can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Bayes' Theorem
Let X be a data tuple. In Bayesian terms, X is considered evidence. As usual, it is described by measurements made on a set of n attributes. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the evidence or observed data tuple X. In other words, we are looking for the probability that tuple X belongs to class C, given that we know the attribute description of X. By Bayes' theorem, P(H|X) = P(X|H) P(H) / P(X).
The buys_computer table above serves as the labeled training data; suppose we need to predict the class for a new tuple X = (age = youth, income = medium, student = yes, credit_rating = fair).
Using the above probabilities, we obtain
P(X | buys_computer = yes) = P(age = youth | buys_computer = yes) × P(income = medium | buys_computer = yes) × P(student = yes | buys_computer = yes) × P(credit_rating = fair | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044.
Similarly,
P(X | buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019.
To find the class Ci that maximizes P(X|Ci) P(Ci), we multiply each of these conditional probabilities by the corresponding class prior P(Ci); the class with the larger product is predicted for X.
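The following sketch recomputes this prediction with a tiny naive Bayes classifier. The 14 training tuples are an assumed reconstruction of the buys_computer table, chosen so that they reproduce the conditional probabilities used above.

# Assumed training tuples: (age, income, student, credit_rating, class).
train = [
    ("youth",  "high",   "no",  "fair",      "no"),
    ("youth",  "high",   "no",  "excellent", "no"),
    ("middle", "high",   "no",  "fair",      "yes"),
    ("senior", "medium", "no",  "fair",      "yes"),
    ("senior", "low",    "yes", "fair",      "yes"),
    ("senior", "low",    "yes", "excellent", "no"),
    ("middle", "low",    "yes", "excellent", "yes"),
    ("youth",  "medium", "no",  "fair",      "no"),
    ("youth",  "low",    "yes", "fair",      "yes"),
    ("senior", "medium", "yes", "fair",      "yes"),
    ("youth",  "medium", "yes", "excellent", "yes"),
    ("middle", "medium", "no",  "excellent", "yes"),
    ("middle", "high",   "yes", "fair",      "yes"),
    ("senior", "medium", "no",  "excellent", "no"),
]

def naive_bayes_predict(train, x):
    # Pick the class c maximizing P(X|c) * P(c) under the naive independence assumption.
    classes = {t[-1] for t in train}
    best_class, best_score = None, -1.0
    for c in sorted(classes):
        rows = [t for t in train if t[-1] == c]
        prior = len(rows) / len(train)
        likelihood = 1.0
        for i, value in enumerate(x):
            likelihood *= sum(t[i] == value for t in rows) / len(rows)
        score = prior * likelihood
        print(f"P(X|{c}) * P({c}) = {score:.4f}")
        if score > best_score:
            best_class, best_score = c, score
    return best_class

x = ("youth", "medium", "yes", "fair")
print("predicted class:", naive_bayes_predict(train, x))

With these assumed tuples the two scores come out to roughly 0.028 for yes and 0.007 for no, so buys_computer = yes is predicted for X.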
CLUSTER ANALYSIS
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.
Types of Data in Cluster Analysis
1) Data matrix
2) Dissimilarity matrix
3) Interval-scaled variables: examples include weight and height, latitude and longitude coordinates (e.g., when clustering houses), and weather temperature.
Partitioning Methods
Given D, a data set of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster.
THE K-MEANS METHOD
Input:
  k: the number of clusters,
  D: a data set containing n objects.
Output: a set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for each cluster;
(5) until no change.
Clustering by k-means partitioning. Suppose that there is a set of objects located in space as depicted in panel (a) of the accompanying figure. Let k = 3; that is, the user would like the objects to be partitioned into three clusters. According to the algorithm above, we arbitrarily choose three objects as the three initial cluster centers, where cluster centers are marked by a +. Each object is distributed to a cluster based on the cluster center to which it is the nearest. Such a distribution forms silhouettes encircled by dotted curves, as shown in panel (a). Next, the cluster centers are updated. That is, the mean value of each cluster is recalculated based on the current objects in the cluster. Using the new cluster centers, the objects are redistributed to the clusters based on which cluster center is the nearest. Such a redistribution forms new silhouettes encircled by dashed curves, as shown in panel (b).
This process iterates, leading to panel (c). The process of iteratively reassigning objects to clusters to improve the partitioning is referred to as iterative relocation. Eventually, no redistribution of the objects in any cluster occurs, and so the process terminates.
The resulting clusters are returned by the clustering process.
Clustering of a set of objects based on the k-means method. (The mean of each cluster is marked by a +.)
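A compact NumPy sketch of the k-means method above, applied to hypothetical 2-D points; the numbered steps of the pseudocode are marked in the comments.

import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    # Plain k-means: returns (cluster centers, cluster label per point).
    rng = np.random.default_rng(seed)
    # (1) arbitrarily choose k objects as the initial cluster centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # (3) (re)assign each object to the nearest cluster center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (4) update the cluster means
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # (5) until no change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Hypothetical 2-D objects roughly forming three groups.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal((0, 0), 0.5, (20, 2)),
                 rng.normal((5, 5), 0.5, (20, 2)),
                 rng.normal((0, 5), 0.5, (20, 2))])
centers, labels = k_means(pts, k=3)
print(np.round(centers, 2))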