Chapter 1. Introduction
Motivation: Why data mining?
Fraud detection
Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?
Motivation: Necessity Is the Mother of Invention
- The explosive growth of data: from terabytes to petabytes (the "data explosion" problem)
- Computers have become cheaper and more powerful
- Competitive pressure is strong: provide better, customized services for an edge (e.g., in Customer Relationship Management)
- Traditional techniques are infeasible for raw data
- Data mining may help scientists in classifying and segmenting data and in hypothesis formation
Data mining
Process of semi-automatically analyzing large databases to find patterns that are:
- valid: hold on new data with some certainty
- novel: non-obvious to the system
- useful: it should be possible to act on the pattern
- understandable: humans should be able to interpret the pattern
Data Mining
Technologies for the analysis of data and the discovery of (very) hidden patterns. Uses a combination of statistics, probability analysis, and database technologies. A fairly young field (under 20 years old), but with clever algorithms developed through database research.
Data Mining
Data mining is the process of discovering interesting patterns and knowledge from large amounts of data. The data sources can include databases, data warehouses, or the Web.
- Finds patterns
- Performs predictions
- Builds on database systems
Evolution of Sciences
- Before 1600: empirical science.
- 1600-1950s: theoretical science. Each discipline has grown a theoretical component; theoretical models often motivate experiments and generalize our understanding.
- 1950s-1990s: computational science. Over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, physics, or linguistics). Computational science traditionally meant simulation; it grew out of our inability to find closed-form solutions for complex mathematical models.
- 1990-now: data science. The flood of data from new scientific instruments and simulations; the ability to economically store and manage petabytes of data online; the Internet and computing grids that make all these archives universally accessible. Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
March 25, 2013 Data Mining: Concepts and Techniques 15
1970s:
- Relational data model, relational DBMS implementation
1980s:
- RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
- Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
- Data mining, data warehousing, multimedia databases, and Web databases
2000s:
- Stream data management and mining
- Data mining and its applications
- Web technology (XML, data integration) and global information systems
What Is (Not) Data Mining?
Simple lookups are not data mining:
- Looking up a phone number in a phone directory
- Querying a Web search engine for information about Amazon
Figure: databases feed a data warehouse through data cleaning and data integration.
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
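The steps above can be sketched end to end in plain Python. The records, attribute names, and the trivial "pattern search" below are illustrative assumptions, not a real data set or library:

```python
# A minimal sketch of the KDD pipeline: selection, cleaning,
# transformation, mining, and presentation. All data are made up.
from collections import Counter

raw = [
    {"age": 25, "income": 50_000, "buys": "yes"},
    {"age": None, "income": 62_000, "buys": "no"},   # missing value
    {"age": 40, "income": 58_000, "buys": "yes"},
]

# 1. Selection: keep only records relevant to the task.
target = [r for r in raw if r["income"] is not None]

# 2. Cleaning: drop records with missing attribute values
#    (real projects often impute instead; this step may take most of the effort).
clean = [r for r in target if all(v is not None for v in r.values())]

# 3. Reduction/transformation: keep useful features, rescale income.
transformed = [{"age": r["age"], "income_k": r["income"] / 1000,
                "buys": r["buys"]} for r in clean]

# 4. Mining: here, a trivial "pattern search" counting class frequencies.
pattern = Counter(r["buys"] for r in transformed)

# 5. Evaluation/presentation: report the pattern found.
print(pattern)  # Counter({'yes': 2})
```

Real pipelines replace step 4 with a classification, clustering, or association-rule algorithm, but the surrounding stages stay the same.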
Contd.
- Data mining: an essential process where intelligent methods are applied to extract data patterns.
- Pattern evaluation: identify the truly interesting patterns representing knowledge, based on interestingness measures.
- Knowledge presentation: visualization and knowledge-representation techniques are used to present the mined knowledge to users.
Figure: the data mining pyramid (top to bottom):
- Making decisions
- Data presentation: visualization techniques
- Data mining: information discovery
- Data exploration: statistical analysis, querying and reporting
- Data warehouses / data marts: OLAP, MDA
- Data sources: paper, files, information providers, database systems, OLTP
Figure: data mining sits at the confluence of multiple disciplines: database systems, machine learning, statistics, visualization, information science, and other disciplines.
Statistics
Statistics studies the collection, analysis, interpretation and presentation of data. A statistical model is a set of mathematical functions that describe the behavior of the objects in a target class in terms of random variables and their associated probability distribution.
Machine Learning
It investigates how computers can learn based on data.
- Supervised learning: classification
- Unsupervised learning: clustering
- Active learning: user-interactive
Relational databases
Relational databases: a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes and usually stores a large set of tuples. A semantic data model, such as the entity-relationship (ER) data model, is often constructed for relational databases.
Data Warehouses
Data Warehouses: A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing. A data warehouse is usually modeled by a multidimensional data structure, known as a data cube, in which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure such as count or sum. A data cube provides a multidimensional view of data and allows the precomputation and fast access of summarized data.
Transactional Data
Transactional Data: A transactional database captures a transaction, such as a customer's purchase or a flight booking. A transaction typically includes a transaction ID and a list of items making up the transaction. A transactional database may have additional tables that contain other information related to the transaction.
Description Tasks
Find human-interpretable patterns that describe the data.
Classification: Definition
Given a collection of records (the training set):
- Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
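This methodology can be illustrated with a tiny classifier in plain Python. The records below and the choice of a 1-nearest-neighbour model are assumptions for illustration only:

```python
# Train/test split with a 1-nearest-neighbour classifier (toy data).
import math

# Each record: (attributes, class label)
training_set = [
    ((1.0, 1.0), "yes"),
    ((1.2, 0.8), "yes"),
    ((5.0, 5.0), "no"),
    ((5.5, 4.5), "no"),
]
test_set = [
    ((0.9, 1.1), "yes"),
    ((5.2, 5.1), "no"),
]

def classify(x):
    """Assign the class of the nearest training record (Euclidean distance)."""
    _, label = min(training_set, key=lambda rec: math.dist(x, rec[0]))
    return label

# Accuracy on the held-out test set estimates how well the model
# generalises to previously unseen records.
correct = sum(classify(x) == y for x, y in test_set)
accuracy = correct / len(test_set)
print(accuracy)  # 1.0 on this toy data
```

The key point is that the test records are never used to build the model; they only validate it.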
Classification Example
Figure: a model is learned from the training set (records labeled with classes such as Yes/No) and its accuracy is then measured on the held-out test set.
Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product. Approach:
Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise; this {buy, don't buy} decision forms the class attribute. Collect various demographic, lifestyle, and company-interaction related information about all such customers.
Type of business, where they stay, how much they earn, etc.
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions. Approach:
Use credit card transactions and information on the account holder as attributes:
- when a customer buys, what he buys, how often he pays on time, etc.
Label past transactions as fraud or fair; this forms the class attribute. Learn a model for the class of the transactions, and use this model to detect fraud by observing credit card transactions on an account.
Classification: Application 3
Customer Attrition/Churn:
Goal: To predict whether a customer is likely to be lost to a competitor. Approach:
Use detailed records of transactions with each of the past and present customers to find attributes:
- how often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.
Classification: Application 4
Sky Survey Cataloging
Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
3000 images with 23,040 x 23,040 pixels per image.
Approach:
- Segment the image.
- Measure image attributes (features): 40 of them per object.
- Model the class based on these features.
Success story: found 16 new high red-shift quasars, some of the farthest objects, which are difficult to find!
Classifying Galaxies
Figure: galaxies classified by stage of formation (early, intermediate, or late class) based on image attributes, over a large data set.
Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
- data points in one cluster are more similar to one another;
- data points in separate clusters are less similar to one another.
Similarity Measures:
- Euclidean distance, if attributes are continuous
- Other problem-specific measures
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
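A minimal k-means sketch shows Euclidean-distance clustering in 3-D; the six points, k = 2, and the naive initialisation are all illustrative assumptions, not data from the text:

```python
# k-means clustering (k = 2) in 3-D space with Euclidean distance.
import math

points = [(0, 0, 0), (0, 1, 0), (1, 0, 1),      # one tight group
          (8, 8, 8), (9, 8, 7), (8, 9, 9)]      # another tight group

def mean(cluster):
    """Component-wise mean of a list of 3-D points."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(3))

centroids = [points[0], points[3]]              # naive initialisation
for _ in range(10):                             # fixed iteration budget
    clusters = [[], []]
    for p in points:                            # assign to nearest centroid
        j = min(range(2), key=lambda c: math.dist(p, centroids[c]))
        clusters[j].append(p)
    centroids = [mean(c) for c in clusters]     # recompute centroids

print(clusters[0])  # the three points near the origin
print(clusters[1])  # the three points near (8, 8, 8)
```

Points within a cluster end up close to their centroid (similar to one another), while the two centroids stay far apart.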
Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers, where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach:
- Collect different attributes of customers based on their geographical and lifestyle-related information.
- Find clusters of similar customers.
- Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
Clustering: Application 2
Document Clustering:
Goal: find groups of documents that are similar to each other based on the important terms appearing in them.
Approach: identify frequently occurring terms in each document; form a similarity measure based on the frequencies of different terms; use it to cluster.
Gain: information retrieval can use the clusters to relate a new document or search term to the clustered documents.
Association Rule Discovery example:

TID | Items
----+--------------------------
 1  | Bread, Coke, Milk
 2  | Beer, Bread
 3  | Beer, Coke, Diaper, Milk
 4  | Beer, Bread, Diaper, Milk
 5  | Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
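The standard measures for such rules, support and confidence, can be computed directly from the five transactions in the table:

```python
# Support and confidence of the discovered rules, computed from the
# five transactions above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

def support(itemset):
    """Fraction of all transactions containing the itemset."""
    return count(itemset) / len(transactions)

def confidence(lhs, rhs):
    """Of the transactions containing lhs, the fraction also containing rhs."""
    return count(lhs | rhs) / count(lhs)

print(support({"Milk", "Coke"}))                 # 0.6: 3 of 5 transactions
print(confidence({"Milk"}, {"Coke"}))            # 0.75: Coke in 3 of 4 Milk baskets
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # 2/3: Beer in 2 of 3 such baskets
```

Algorithms such as Apriori search for all rules whose support and confidence exceed user-given thresholds; the measures themselves are just these counts.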
Regression
Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency. Greatly studied in statistics and the neural network field. Examples:
- Predicting sales amounts of a new product based on advertising expenditure.
- Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
- Time-series prediction of stock market indices.
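For the linear case, the least-squares fit has a closed form. The advertising/sales numbers below are made up for illustration (chosen to lie exactly on a line so the result is easy to check):

```python
# Ordinary least-squares fit of y = intercept + slope * x.
xs = [1.0, 2.0, 3.0, 4.0]        # advertising spend (illustrative)
ys = [3.0, 5.0, 7.0, 9.0]        # observed sales: exactly 2x + 1 here

n = len(xs)
mx = sum(xs) / n                 # mean of x
my = sum(ys) / n                 # mean of y

# slope = covariance(x, y) / variance(x)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

def predict(x):
    return intercept + slope * x

print(slope)        # 2.0
print(intercept)    # 1.0
print(predict(5.0)) # 11.0
```

Real data will not fit exactly; least squares then gives the line minimizing the sum of squared prediction errors.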
Deviation/Anomaly Detection
Detect significant deviations from normal behavior. Applications:
Credit Card Fraud Detection
DM Applications
Banking: loan/credit card approval
predict good customers based on old customers
Targeted marketing:
identify likely responders to promotions
DM Applications (continued)
Medicine: disease outcome, effectiveness of treatments
analyze patient disease history: find relationship between diseases
Business: target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation
Risk analysis and management
Other Applications
Text mining (newsgroups, email, documents) and Web analysis
Intelligent query answering
Target marketing
Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
Cross-market analysis
Associations/correlations between product sales
Prediction based on the association information
Exploratory-based method:
Try to make sense of data without an a priori hypothesis! The only protection against false results is significance:
- ensure statistical significance (using train-and-test sets, etc.)
- ensure domain significance (i.e., make sure that the results make sense to a domain expert)
Data Miner:
Notices that the yield is somewhat higher under trees where birds roost. Conclusion: droppings increase yield. Alternative conclusion: a moderate amount of shade increases yield. (Identification problem)
Data Mining:
- Large data sets: efficiency and scalability of algorithms are important
- Real-world data: lots of missing values
- Pre-existing data, not user-generated
- Data are not static, but prone to updates
- Efficient methods for data retrieval are available for use
OLAP: On-Line Analytical Processing
Multi-dimensional data model (data cube). Operations:
- Roll-up
- Drill-down
- Slice and dice
- Rotate
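These operations can be sketched on a tiny cube of base cells held in a Python dict; the (region, year) dimensions and sales measure are illustrative assumptions:

```python
# A tiny data cube: base cells keyed by (region, year), measure = sales.
cube = {
    ("East", 2012): 10, ("East", 2013): 12,
    ("West", 2012): 7,  ("West", 2013): 9,
}

# Roll-up: aggregate away the year dimension, keeping totals per region.
rollup_region = {}
for (region, year), sales in cube.items():
    rollup_region[region] = rollup_region.get(region, 0) + sales

# Drill-down is the inverse: moving from rollup_region back to the
# finer-grained cells in cube.

# Slice: fix one dimension to a single value (here, year = 2013).
slice_2013 = {region: sales
              for (region, year), sales in cube.items() if year == 2013}

print(rollup_region)  # {'East': 22, 'West': 16}
print(slice_2013)     # {'East': 12, 'West': 9}
```

Dicing (selecting a sub-cube on several dimensions) and rotating (reordering dimensions for display) are analogous filters and re-keyings over the same cells.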
An OLAM Architecture
Figure: the user issues a mining query through a GUI/API and receives the mining result. Layer 3 holds the OLAM engine and the OLAP engine side by side, on top of a data cube API. Layer 2 is the multidimensional database (MDDB) with its metadata, built from the underlying databases through a database API by filtering, integration, and data cleaning.
OLAP vs. Data Mining:

OLAP
- Type of result: summaries, trends, and forecasts
- Method: analysis (multidimensional data modeling, aggregation, statistics)
- Example question: What is the average income of mutual fund buyers, by region, by year?

Data Mining
- Type of result: knowledge discovery of hidden patterns and insights; insight and prediction
- Method: induction (build the model, apply it to new data, get the result)
- Example question: Who will buy a mutual fund in the next 6 months, and why?
How is a DM system coupled with a DB or DW system? If a DM system works as a stand-alone system or is embedded in an application program, there is no DB or DW system with which it must communicate; this scheme is known as no coupling. The integration/coupling of DM can be done in various ways:
- No coupling: the DM system does not utilize any function of a DB or DW system. It may fetch data from a particular source, such as a file, process the data using DM algorithms, and then store the results in a file.
- Loose coupling: the DM system uses some facilities of a DB or DW system, fetching data from a data repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a designated place in a database or data warehouse.
- Semi-tight coupling: besides linking a DM system to a DB/DW system, efficient implementations of a few essential DM primitives can be provided in the DB/DW. The primitives include sorting, indexing, aggregation, etc.
- Tight coupling: the DM system is smoothly integrated into the DB/DW system. The DM subsystem is treated as one functional component of an information system. This provides a uniform information-processing environment.
- Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
- Protection of data security, integrity, and privacy