
Data Mining

Introduction

What is Data Mining?


The process of semi-automatically analyzing large databases to find useful patterns (Silberschatz)
KDD: Knowledge Discovery in Databases (3)
Attempts to discover rules and patterns from data
    Discover rules
    Make predictions
Areas of use
    Internet: discover needs of customers
    Economics: predict stock prices
    Science: predict environmental change
    Medicine: match patients with similar problems to find a cure

Business intelligence (BI)


BI unites data, technology, analytics, and human knowledge to optimize business decisions and ultimately drive an enterprise's success. BI programs usually combine an enterprise data warehouse and a BI platform or tool set to transform data into usable, actionable business information.

Data Mining & Data Warehousing


Data Warehouse: a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site. (Silberschatz)
    Collect data
    Store in a single repository
    Allows for easier query development, as a single repository can be queried

Data Mining:
Analyzing databases or Data Warehouses to discover patterns about the data to gain knowledge. Knowledge is power.

Example of Data Mining


A credit card company wants to discover information about its clients from databases. It wants to find:
    Clients who respond to promotions in junk mail
    Clients who are likely to switch to a competitor
    Clients who are likely not to pay
    Services that clients use, in order to promote services affiliated with the credit card company
    Anything else that may help the company provide or promote services to help its clients and ultimately make more money

Results of Data Mining Include:

Forecasting what may happen in the future
Classifying people or things into groups by recognizing patterns
Clustering people or things into groups based on their attributes
Associating what events are likely to occur together
Sequencing what events are likely to lead to later events

Data Mining Techniques


Classification
Clustering
Regression
Association Rules

Classification
Classification: Given a set of items that belong to several classes, and given past instances (training instances) with their associated classes, classification is the process of predicting the class of a new item, that is, identifying which class the new item belongs to.
Example: A bank wants to classify its home loan customers into groups according to their response to bank advertisements. The bank might use the classes Responds Rarely, Responds Sometimes, and Responds Frequently. The bank will then attempt to find rules about the customers who respond frequently and sometimes. The rules could be used to predict the needs of potential customers.

Technique for Classification


Decision-Tree Classifiers

Job
    Carpenter: Income
        <30K: Bad
        >50K: Good
    Engineer: Income
        <40K: Bad
        >90K: Good
    Doctor: Income
        <50K: Bad
        >100K: Good

Predicting the credit risk of a person from their job and income
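
A minimal sketch of how a decision tree like the one in this slide could be evaluated in code. The pairing of income thresholds with jobs (Carpenter <30K / >50K, Engineer <40K / >90K, Doctor <50K / >100K) is read off the figure and is an assumption, as are the names TREE and credit_risk. The figure gives no label for incomes between the two thresholds, so the sketch returns "Unknown" there.

```python
# Hypothetical encoding of the job/income tree from the slide.
# Each job maps to (bad_below, good_above) income thresholds.
TREE = {
    "Carpenter": (30_000, 50_000),
    "Engineer": (40_000, 90_000),
    "Doctor": (50_000, 100_000),
}


def credit_risk(job: str, income: float) -> str:
    """Walk the tree: split on job first, then on income."""
    if job not in TREE:
        return "Unknown"  # job not covered by the tree
    bad_below, good_above = TREE[job]
    if income < bad_below:
        return "Bad"
    if income > good_above:
        return "Good"
    return "Unknown"  # assumption: the figure does not label this range


print(credit_risk("Engineer", 95_000))
print(credit_risk("Carpenter", 25_000))
```

In practice the tree would be learned from training instances rather than written by hand; the sketch only shows how a finished tree classifies a new item.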

Clustering

Clustering algorithms find groups of items that are similar. They divide a data set so that records with similar content are in the same group, and groups are as different as possible from each other. (2)
Example: An insurance company could use clustering to group clients by their age, location, and types of insurance purchased.
The categories are not specified in advance; this is referred to as unsupervised learning.

Clustering
Group Data into Clusters
    Similar data is grouped in the same cluster
    Dissimilar data is grouped in different clusters

How is this achieved?
K-Nearest Neighbor
    A classification method that classifies a point by calculating the distances between the point and the points in the training data set. It then assigns the point to the class that is most common among its k nearest neighbors (where k is an integer). (2)
Hierarchical
    Group data into t-trees
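
The K-Nearest Neighbor description above can be sketched in a few lines. This is an illustrative sketch only: the function name knn_classify, the two-dimensional points, and the labels "A" and "B" are all made up for the example, and Euclidean distance is an assumed choice of distance measure.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def knn_classify(point, training, k=3):
    """Classify `point` by majority vote among its k nearest training points.

    `training` is a list of ((x, y), label) pairs.
    """
    nearest = sorted(training, key=lambda item: dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: two well-separated groups.
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

print(knn_classify((2.0, 2.0), training))
print(knn_classify((6.0, 7.0), training))
```

Choosing an odd k avoids tied votes when there are two classes.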

Regression
Regression deals with the prediction of a value, rather than a class. (1, P747)
Example: Find out if there is a relationship between smoking and cancer-related illness in patients.
Given values: X1, X2, ..., Xn
Objective: predict variable Y
One way is to estimate coefficients a0, a1, a2, ..., an such that
Y = a0 + a1 X1 + a2 X2 + ... + an Xn (Linear Regression)
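
As a rough illustration, the simplest case with a single predictor, Y = a0 + a1 X1, can be fitted with the standard least-squares formulas. The function name fit_line and the sample data are hypothetical; the general case with n predictors would normally be handled by a statistics library.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a0, a1) for Y = a0 + a1 * X1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a0 = mean_y - a1 * mean_x
    return a0, a1


# Hypothetical observations that lie roughly on a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a0, a1 = fit_line(xs, ys)
print(f"Y = {a0:.2f} + {a1:.2f} * X1")
```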

Regression
Example graph: line of best fit and curve fitting

Association Rules
An association algorithm creates rules that describe how often events have occurred together. (2)
Example: When a customer buys a hammer, then 90% of the time they will buy nails.

Association Rules
Support: a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. (1, p748)
Example:
    People who buy hotdog buns also buy hotdog sausages in 99% of cases. = High support
    People who buy hotdog buns buy hangers in 0.005% of cases. = Low support

Situations where there is high support for the antecedent are worth careful attention.
    E.g. hotdog sausages should be placed near hotdog buns in supermarkets if there is also high confidence.

Association Rules
Confidence: a measure of how often the consequent is true when the antecedent is true. (1, p748)
Example:
    90% of hotdog bun purchases are accompanied by hotdog sausages.
    High confidence is meaningful, as we can derive rules:
        Hotdog bun => Hotdog sausage
    Two rules may have different confidence levels and the same support.
    E.g. Hotdog sausage => Hotdog bun may have a much lower confidence than Hotdog bun => Hotdog sausage, yet they can both have the same support.
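
The two measures can be computed directly from their definitions. The sketch below uses five made-up shopping baskets (not data from the slides) and hypothetical helper names support and confidence. It shows that a rule and its reverse share the same support but can differ in confidence.

```python
# Hypothetical shopping baskets, one set of items per purchase.
transactions = [
    {"bun", "sausage"},
    {"bun", "sausage"},
    {"sausage"},
    {"sausage", "mustard"},
    {"milk"},
]


def support(antecedent, consequent, baskets):
    """Fraction of all baskets containing both antecedent and consequent."""
    both = antecedent | consequent
    return sum(both <= b for b in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Fraction of baskets with the antecedent that also hold the consequent."""
    with_antecedent = [b for b in baskets if antecedent <= b]
    if not with_antecedent:
        return 0.0
    return sum(consequent <= b for b in with_antecedent) / len(with_antecedent)


bun, sausage = {"bun"}, {"sausage"}
print("support    bun => sausage:", support(bun, sausage, transactions))
print("support    sausage => bun:", support(sausage, bun, transactions))
print("confidence bun => sausage:", confidence(bun, sausage, transactions))
print("confidence sausage => bun:", confidence(sausage, bun, transactions))
```

Here both rules are satisfied by the same two baskets out of five, so their support is equal, while every bun purchase includes sausages but only half of the sausage purchases include buns.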

Advantages of Data Mining


Provides new knowledge from existing data
    Public databases
    Government sources
    Company databases

Old data can be used to develop new knowledge
New knowledge can be used to improve services or products
Improvements lead to:
    Bigger profits
    More efficient service

Uses of Data Mining


Sales/ Marketing
    Diversify target market
    Identify clients' needs to increase response rates

Risk Assessment
Identify Customers that pose high credit risk

Fraud Detection
Identify people misusing the system. E.g. People who have two Social Security Numbers

Customer Care
    Identify customers likely to change providers
    Identify customer needs

Applications of Data Mining


Chart of data mining applications (4). Source: IDC 1998

Phases and Tasks


Business Understanding
    Determine Business Objectives: Background; Business Objectives; Business Success Criteria
    Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
    Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
    Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Data Understanding
    Collect Initial Data: Initial Data Collection Report
    Describe Data: Data Description Report
    Explore Data: Data Exploration Report
    Verify Data Quality: Data Quality Report

Data Preparation
    Data Set: Data Set Description
    Select Data: Rationale for Inclusion / Exclusion
    Clean Data: Data Cleaning Report
    Construct Data: Derived Attributes; Generated Records
    Integrate Data: Merged Data
    Format Data: Reformatted Data

Modeling
    Select Modeling Technique: Modeling Technique; Modeling Assumptions
    Generate Test Design: Test Design
    Build Model: Parameter Settings; Models; Model Description
    Assess Model: Model Assessment; Revised Parameter Settings

Evaluation
    Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
    Review Process: Review of Process
    Determine Next Steps: List of Possible Actions; Decision

Deployment
    Plan Deployment: Deployment Plan
    Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
    Produce Final Report: Final Report; Final Presentation
    Review Project: Experience Documentation

Phases in the DM Process:

Phases in the DM Process (1 & 2)


Business Understanding:
    Statement of Business Objective
    Statement of Data Mining Objective
    Statement of Success Criteria

Data Understanding:
    Explore the data and verify the quality
    Find outliers

Phases in the DM Process (3)


Data preparation:
    Usually takes over 90% of our time
    Collection
    Assessment
    Consolidation and Cleaning
        table links, aggregation level, missing values, etc.

Data selection
    active role in ignoring noncontributory data?
    outliers?
    Use of samples
    visualization tools

Transformations - create new variables
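
A small sketch of three of the preparation steps named above: filling missing values, flagging outliers, and creating a new variable. Everything here is illustrative. The income figures are made up, median imputation and the "more than three times the median" outlier rule are assumed choices rather than methods from the slides, and the 40,000 band boundary is arbitrary.

```python
from statistics import median


def fill_missing(values):
    """Replace None with the median of the known values (one simple choice)."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


def flag_outliers(values, factor=3.0):
    """Flag values above `factor` times the median (a crude, assumed rule)."""
    cutoff = factor * median(values)
    return [v > cutoff for v in values]


# Hypothetical raw incomes with one missing value and one extreme value.
incomes = [32_000, 41_000, None, 38_000, 400_000, 36_000]

cleaned = fill_missing(incomes)
outlier = flag_outliers(cleaned)
# Transformation: derive a new categorical variable from income.
income_band = ["high" if v >= 40_000 else "low" for v in cleaned]

print(cleaned)
print(outlier)
print(income_band)
```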

Phases in the DM Process (4)


Model building
    Selection of the modeling techniques is based upon the data mining objective
    Modeling is an iterative process - different for supervised and unsupervised learning
May model for either description or prediction

Data Mining Versus Statistical Analysis

Data Analysis
    Tests for statistical correctness of models
        Are statistical assumptions of models correct?
        E.g. is the R-Square good?
    Hypothesis testing
        Is the relationship significant?
        Use a t-test to validate significance
    Tends to rely on sampling
    Techniques are not optimised for large amounts of data
    Requires strong statistical skills

Data Mining
    Originally developed to act as expert systems to solve problems
    Less interested in the mechanics of the technique
        If it makes sense then let's use it
    Does not require assumptions to be made about data
    Can find patterns in very large amounts of data
    Requires understanding of data and business problem

Data Mining versus OLAP


OLAP - On-line Analytical Processing
Provides you with a very good view of what is happening, but cannot predict what will happen in the future or explain why it is happening

The Evolution of Data Analysis


Data Collection (1960s)
    Business question: "What was my total revenue in the last five years?"
    Enabling technologies: Computers, tapes, disks
    Product providers: IBM, CDC
    Characteristics: Retrospective, static data delivery

Data Access (1980s)
    Business question: "What were unit sales in New England last March?"
    Enabling technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
    Product providers: Oracle, Sybase, Informix, IBM, Microsoft
    Characteristics: Retrospective, dynamic data delivery at record level

Data Warehousing & Decision Support (1990s)
    Business question: "What were unit sales in New England last March? Drill down to Boston."
    Enabling technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
    Product providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
    Characteristics: Retrospective, dynamic data delivery at multiple levels

Data Mining (Emerging Today)
    Business question: "What's likely to happen to Boston unit sales next month? Why?"
    Enabling technologies: Advanced algorithms, multiprocessor computers, massive databases
    Product providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
    Characteristics: Prospective, proactive information delivery
