Presentation 1

Data Mining
Introduction
What is Data Mining?

The process of semi-automatically analyzing large databases to find useful patterns (Silberschatz)
KDD: Knowledge Discovery in Databases (3)
Attempts to discover rules and patterns from data
Discover rules
Make predictions
Areas of Use
Internet: discover needs of customers
Economics: predict stock prices
Science: predict environmental change
Medicine: match patients with similar problems to find a cure
Business intelligence (BI)

Business intelligence unites data, technology, analytics, and human knowledge to optimize business decisions and ultimately drive an enterprise's success. BI programs usually combine an enterprise data warehouse and a BI platform or tool set to transform data into usable, actionable business information.
Data Mining & Data Warehousing

Data Warehouse: a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site. (Silberschatz)
Collect data
Store in a single repository
Allows for easier query development, as a single repository can be queried
Data Mining:
Analyzing databases or data warehouses to discover patterns in the data and gain knowledge. Knowledge is power.
Example of Data Mining

A credit card company wants to discover information about its clients from its databases. It wants to find:
Clients who respond to promotions in junk mail
Clients that are likely to switch to a competitor
Clients that are likely not to pay
Services that clients use, so that services affiliated with the credit card company can be promoted
Anything else that may help the company provide or promote services to help its clients and ultimately make more money
Results of Data Mining Include:
Forecasting what may happen in the future
Classifying people or things into groups by recognizing patterns
Clustering people or things into groups based on their attributes
Associating what events are likely to occur together
Sequencing what events are likely to lead to later events
Data Mining Techniques

Classification
Clustering
Regression
Association Rules
Classification
Classification: Given a set of items that fall into several classes, and past instances (training instances) with their associated classes, classification is the process of predicting the class of a new item, that is, identifying which class the new item belongs to.
Example: A bank wants to classify its home loan customers into groups according to their response to bank advertisements. The bank might use the classes Responds Rarely, Responds Sometimes, and Responds Frequently. The bank will then attempt to find rules about the customers that respond frequently and sometimes. The rules could be used to predict the needs of potential customers.
Technique for Classification

Decision-Tree Classifiers
[Figure: decision tree predicting the credit risk of a person from their job. The root splits on Job (Carpenter, Engineer, Doctor); each branch then splits on Income (thresholds shown: <30K, >50K, <40K, >90K, <50K, >100K) and ends in a Bad or Good leaf.]
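A decision tree like the one in this figure can be read as nested rules. The following is a minimal Python sketch, assuming the income thresholds pair up with the jobs in left-to-right order (Carpenter: <30K / >50K, Engineer: <40K / >90K, Doctor: <50K / >100K). That pairing, the name `credit_risk`, and the "Undetermined" result for incomes between the two thresholds are assumptions for illustration, not part of the slides.

```python
# Hypothetical reading of the slide's tree: job -> (bad_below, good_above),
# incomes in thousands. The pairing of thresholds to jobs is an assumption.
THRESHOLDS = {
    "carpenter": (30, 50),
    "engineer": (40, 90),
    "doctor": (50, 100),
}


def credit_risk(job: str, income_k: float) -> str:
    """Walk the tree: split on job first, then on income."""
    bad_below, good_above = THRESHOLDS[job.lower()]
    if income_k < bad_below:
        return "Bad"
    if income_k > good_above:
        return "Good"
    # The figure shows no leaf for incomes between the two thresholds.
    return "Undetermined"


if __name__ == "__main__":
    for job, income in [("Carpenter", 25), ("Engineer", 95), ("Doctor", 70)]:
        print(job, income, credit_risk(job, income))
```

In practice the tree is learned from training instances rather than written by hand; the sketch only shows how a finished tree classifies a new item.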
Clustering
Clustering algorithms find groups of items that are similar. They divide a data set so that records with similar content are in the same group, and groups are as different as possible from each other. (2)
Example: An insurance company could use clustering to group clients by their age, location and types of insurance purchased.
The categories are unspecified; this is referred to as unsupervised learning.
Clustering
Group Data into Clusters
Similar data is grouped in the same cluster
Dissimilar data is grouped in different clusters
How is this achieved?
K-Nearest Neighbor: a classification method that classifies a point by calculating the distances between the point and points in the training data set. It then assigns the point to the class that is most common among its k nearest neighbors (where k is an integer). (2)
Hierarchical: group data into t-trees
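The k-nearest-neighbor step described above can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance and a small made-up training set; the function name `knn_classify` and the labels are hypothetical.

```python
import math
from collections import Counter


def knn_classify(point, training, k=3):
    """Return the most common class among the k training points nearest to `point`.

    `training` is a list of (features, label) pairs; distance is Euclidean.
    """
    by_distance = sorted(training, key=lambda item: math.dist(point, item[0]))
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up (age, income in thousands) points with made-up labels.
    training = [
        ((25, 30), "low"),
        ((27, 35), "low"),
        ((30, 32), "low"),
        ((50, 90), "high"),
        ((55, 95), "high"),
        ((60, 85), "high"),
    ]
    print(knn_classify((28, 33), training, k=3))
    print(knn_classify((52, 88), training, k=3))
```

Note that this method needs labeled training data, unlike clustering proper, where the categories are not given in advance.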
Regression
Regression deals with the prediction of a value, rather than a class. (1, P747)
Example: Find out if there is a relationship between smoking and cancer-related illness in patients.
Given values: X1, X2, ..., Xn
Objective: predict variable Y
One way is to estimate coefficients a0, a1, a2, ..., an such that
Y = a0 + a1 X1 + a2 X2 + ... + an Xn (Linear Regression)
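For the simplest case of a single predictor X, the coefficients can be estimated by ordinary least squares. The sketch below is one way to do it in plain Python; the function name `fit_line` and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a0, a1) for the model Y = a0 + a1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a0 = mean_y - a1 * mean_x
    return a0, a1


if __name__ == "__main__":
    xs = [1, 2, 3, 4]
    ys = [3, 5, 7, 9]  # made-up data lying exactly on Y = 1 + 2X
    a0, a1 = fit_line(xs, ys)
    print(f"Y = {a0:.2f} + {a1:.2f} * X")
```

With several predictors X1 ... Xn the same idea applies, but the coefficients are usually found with a linear algebra routine rather than by hand.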
Regression
[Example graph: line of best fit (curve fitting)]
Association Rules
An association algorithm creates rules that describe how often events have occurred together. (2)
Example: When a customer buys a hammer, then 90% of the time they will buy nails.
Association Rules
Support: a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. (1, p748)
Example:
Hotdog buns and hotdog sausages are bought together in a large fraction of all purchases = high support
Hotdog buns and hangers are bought together in 0.005% of all purchases = low support
Rules with high support are worth careful attention.
E.g. hotdog sausages should be placed near hotdog buns in supermarkets if there is also high confidence.
Association Rules
Confidence: a measure of how often the consequent is true when the antecedent is true. (1, p748)
Example: 90% of hotdog bun purchases are accompanied by hotdog sausages.
High confidence is meaningful, as we can derive rules: Hotdog bun => Hotdog sausage
Two rules may have different confidence levels and still have the same support. E.g. Hotdog sausage => Hotdog bun may have a much lower confidence than Hotdog bun => Hotdog sausage, yet they both can have the same support.
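Support and confidence can both be computed by counting transactions. The sketch below uses a small made-up list of shopping baskets; the function names and the data are hypothetical and only illustrate the two definitions.

```python
def support(transactions, antecedent, consequent):
    """Fraction of all transactions that contain both item sets."""
    both = antecedent | consequent
    return sum(both <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    return sum(consequent <= t for t in with_antecedent) / len(with_antecedent)


# Made-up baskets, one set of items per purchase.
transactions = [
    {"buns", "sausages"},
    {"buns", "sausages", "ketchup"},
    {"sausages"},
    {"sausages", "mustard"},
    {"hangers"},
]
buns, sausages = {"buns"}, {"sausages"}

if __name__ == "__main__":
    print("support    buns => sausages:", support(transactions, buns, sausages))
    print("support    sausages => buns:", support(transactions, sausages, buns))
    print("confidence buns => sausages:", confidence(transactions, buns, sausages))
    print("confidence sausages => buns:", confidence(transactions, sausages, buns))
```

In this data both rules have the same support, because support counts baskets containing both items, while the confidence differs depending on which item is the antecedent.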
Advantages of Data Mining

Provides new knowledge from existing data: public databases, government sources, company databases
Old data can be used to develop new knowledge
New knowledge can be used to improve services or products
Improvements lead to: bigger profits, more efficient service
Uses of Data Mining

Sales/ Marketing
Diversify target market
Identify clients' needs to increase response rates
Risk Assessment
Identify Customers that pose high credit risk
Fraud Detection
Identify people misusing the system. E.g. People who have two Social Security Numbers
Customer Care
Identify customers likely to change providers
Identify customer needs
Applications of Data Mining

[Figure: applications of data mining. Source: IDC 1998 (4)]
Phases and Tasks

Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
Tasks and outputs for each phase:
Business Understanding
  Determine Business Objectives: Background; Business Objectives; Business Success Criteria
  Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
  Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
  Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
Data Understanding
  Collect Initial Data: Initial Data Collection Report
  Describe Data: Data Description Report
  Explore Data: Data Exploration Report
  Verify Data Quality: Data Quality Report
Data Preparation
  Data Set: Data Set Description
  Select Data: Rationale for Inclusion / Exclusion
  Clean Data: Data Cleaning Report
  Construct Data: Derived Attributes; Generated Records
  Integrate Data: Merged Data
  Format Data: Reformatted Data
Modeling
  Select Modeling Technique: Modeling Technique; Modeling Assumptions
  Generate Test Design: Test Design
  Build Model: Parameter Settings; Models; Model Description
  Assess Model: Model Assessment; Revised Parameter Settings
Evaluation
  Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
  Review Process: Review of Process
  Determine Next Steps: List of Possible Actions; Decision
Deployment
  Plan Deployment: Deployment Plan
  Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
  Produce Final Report: Final Report; Final Presentation
  Review Project: Experience Documentation
Phases in the DM Process:
Phases in the DM Process (1 & 2)

Business Understanding:
Statement of Business Objective
Statement of Data Mining Objective
Statement of Success Criteria
Data Understanding:
Explore the data and verify the quality
Find outliers
Phases in the DM Process (3)

Data preparation:
Usually takes over 90% of our time
Collection
Assessment
Consolidation and cleaning: table links, aggregation level, missing values, etc.
Data selection
Active role in ignoring non-contributory data?
Outliers?
Use of samples
Visualization tools
Transformations - create new variables
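A few of these preparation steps (handling missing values, flagging outliers, creating a new variable) can be sketched in plain Python. The record layout, the median fill, the "three times the median" outlier rule, and the derived `debt_to_income` variable below are all assumptions chosen for illustration, not steps prescribed by the slides.

```python
from statistics import median


def prepare(records, outlier_factor=3.0):
    """Fill missing incomes, flag outliers, and derive a new variable.

    Assumes every known income is positive. The outlier rule (income above
    `outlier_factor` times the median) is an arbitrary illustrative choice.
    """
    known = [r["income"] for r in records if r["income"] is not None]
    typical = median(known)
    prepared = []
    for r in records:
        # Missing value: substitute the median of the known incomes.
        income = r["income"] if r["income"] is not None else typical
        prepared.append({
            "id": r["id"],
            "income": income,
            "income_was_missing": r["income"] is None,
            "is_outlier": income > outlier_factor * typical,
            # Transformation: a new variable built from two existing ones.
            "debt_to_income": r["debt"] / income,
        })
    return prepared


if __name__ == "__main__":
    raw = [
        {"id": 1, "income": 40.0, "debt": 10.0},
        {"id": 2, "income": None, "debt": 5.0},
        {"id": 3, "income": 50.0, "debt": 25.0},
        {"id": 4, "income": 400.0, "debt": 20.0},
    ]
    for row in prepare(raw):
        print(row)
```

Whether a flagged outlier is dropped or kept is a judgment call that depends on the data mining objective.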
Phases in the DM Process (4)

Model building
Selection of the modeling techniques is based upon the data mining objective
Modeling is an iterative process - different for supervised and unsupervised learning
May model for either description or prediction
Data Mining Versus Statistical Analysis

Data Analysis:
Tests for statistical correctness of models
Are statistical assumptions of models correct? E.g. is the R-Square good?
Hypothesis testing
Is the relationship significant? Use a t-test to validate significance
Tends to rely on sampling
Techniques are not optimised for large amounts of data
Requires strong statistical skills
Data Mining:
Originally developed to act as expert systems to solve problems
Less interested in the mechanics of the technique
If it makes sense then let's use it
Does not require assumptions to be made about data
Can find patterns in very large amounts of data
Requires understanding of data and business problem
Data Mining versus OLAP

OLAP - On-line Analytical Processing
Provides you with a very good view of what is happening, but cannot predict what will happen in the future or explain why it is happening
The Evolution of Data Analysis

For each evolutionary step: business question, enabling technologies, product providers, characteristics.
Data Collection (1960s)
Business question: "What was my total revenue in the last five years?"
Enabling technologies: Computers, tapes, disks
Product providers: IBM, CDC
Characteristics: Retrospective, static data delivery
Data Access (1980s)
Business question: "What were unit sales in New England last March?"
Enabling technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
Product providers: Oracle, Sybase, Informix, IBM, Microsoft
Characteristics: Retrospective, dynamic data delivery at record level
Data Warehousing & Decision Support (1990s)
Business question: "What were unit sales in New England last March? Drill down to Boston."
Enabling technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
Product providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
Characteristics: Retrospective, dynamic data delivery at multiple levels
Data Mining (Emerging Today)
Business question: "What's likely to happen to Boston unit sales next month? Why?"
Enabling technologies: Advanced algorithms, multiprocessor computers, massive databases
Product providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
Characteristics: Prospective, proactive information delivery
