Introduction
Data Mining:
Analyzing databases or data warehouses to discover patterns in the data and gain knowledge. Knowledge is power.
Forecasting what may happen in the future
Classifying people or things into groups by recognizing patterns
Clustering people or things into groups based on their attributes
Associating what events are likely to occur together
Sequencing what events are likely to lead to later events
Classification
Classification: Given a set of items that belong to several classes, and given past instances (training instances) with their associated classes, classification is the process of predicting the class of a new item, that is, identifying which class the new item belongs to.
Example: A bank wants to classify its home loan customers into groups according to their response to bank advertisements. The bank might use the classes Responds Rarely, Responds Sometimes, and Responds Frequently. The bank will then attempt to find rules about the customers that respond Frequently and Sometimes. The rules could be used to predict the needs of potential customers.
[Figure: decision tree with Income decision nodes and leaves labeled Good or Bad]
Clustering
Clustering algorithms find groups of items that are similar. They divide a data set so that records with similar content are in the same group, and groups are as different as possible from each other. (2)
Example: An insurance company could use clustering to group clients by their age, location and types of insurance purchased.
The categories are not specified in advance; this is referred to as unsupervised learning.
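As a rough illustration of grouping clients by a single attribute such as age, the sketch below repeatedly merges the two closest groups until a chosen number of groups remains. This bottom-up merging is only one possible approach, and the function name `agglomerative_1d`, the ages, and the choice of three groups are all assumptions made for the example.

```python
def agglomerative_1d(values, k):
    """Merge the two closest clusters repeatedly until only k clusters remain."""
    # Start with every value in its own cluster, in sorted order.
    clusters = [[v] for v in sorted(values)]
    while len(clusters) > k:
        # With sorted 1-D data the closest pair of clusters is always adjacent,
        # so compare the gap between one cluster's last value and the next one's first.
        gaps = [clusters[i + 1][0] - clusters[i][-1] for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters

# Hypothetical client ages; no class labels are supplied (unsupervised).
ages = [22, 25, 27, 41, 43, 45, 68, 70]
groups = agglomerative_1d(ages, k=3)
print(groups)
```

No labels are given to the function; the groups emerge only from how close the values are to each other.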
Clustering
Group Data into Clusters
Similar data is grouped in the same cluster
Dissimilar data is grouped in different clusters
How is this achieved?
K-Nearest Neighbor
A classification method that classifies a point by calculating the distances between the point and the points in the training data set. It then assigns the point to the class that is most common among its k nearest neighbors (where k is an integer). (2)
Hierarchical
Group data into t-trees
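The K-Nearest Neighbor description above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a simple majority vote; the function name `knn_classify` and the training data (income in $K and age, labeled Good or Bad) are hypothetical.

```python
import math
from collections import Counter

def knn_classify(point, training, k=3):
    """Return the most common class among the k training points nearest to `point`.

    `training` is a list of (features, label) pairs; features are numeric tuples.
    """
    # Sort training instances by Euclidean distance to the query point.
    by_distance = sorted(training, key=lambda item: math.dist(point, item[0]))
    # Count the class labels of the k closest instances.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: (income in $K, age) -> credit rating.
training = [
    ((25, 30), "Bad"), ((30, 45), "Bad"), ((35, 28), "Bad"),
    ((90, 40), "Good"), ((95, 35), "Good"), ((110, 50), "Good"),
]
prediction = knn_classify((100, 42), training, k=3)
print(prediction)
```

In practice the attributes would usually be rescaled first, since an attribute with a large numeric range otherwise dominates the distance.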
Regression
Regression deals with the prediction of a value, rather than a class. (1, P747)
Example: Find out if there is a relationship between smoking and cancer-related illness among patients.
Given values: X1, X2, ..., Xn
Objective: predict variable Y
One way is to estimate coefficients a0, a1, a2, ..., an such that
Y = a0 + a1X1 + a2X2 + ... + anXn (Linear Regression)
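For the special case of a single predictor, Y = a0 + a1X1, the coefficients can be estimated by ordinary least squares. The sketch below is one way to do that in plain Python; the function name `fit_line` and the data points are assumptions chosen so that the answer is easy to check by hand.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a0, a1) for the model Y = a0 + a1*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a0 = mean_y - a1 * mean_x
    return a0, a1

# Hypothetical data lying exactly on Y = 1 + 2X.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a0, a1 = fit_line(xs, ys)
print(a0, a1)
```

With several predictors X1 to Xn the same idea applies, but the coefficients are found by solving a system of equations rather than with these two closed-form expressions.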
Regression
[Figure: example graphs showing a line of best fit and curve fitting]
Association Rules
An association algorithm creates rules that describe how often events have occurred together. (2) Example: When a customer buys a hammer, then
Association Rules
Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. (1, p748)
Example:
Hotdog buns and hotdog sausages are bought together in 99% of cases. = High support
Hotdog buns and hangers are bought together in 0.005% of cases. = Low support
Situations where there is high support for the antecedent are worth careful attention.
E.g. Hotdog sausages should be placed near hotdog buns in supermarkets if there is also high confidence.
Association Rules
Confidence is a measure of how often the consequent is true when the antecedent is true. (1, p748)
Example:
90% of hotdog bun purchases are accompanied by hotdog sausages.
High confidence is meaningful as we can derive rules:
Hotdog bun -> Hotdog sausage
Two rules may have different confidence levels and have the same support. E.g. Hotdog sausage -> Hotdog bun may have a much lower confidence than Hotdog bun -> Hotdog sausage, yet both can have the same support.
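Support and confidence can be computed directly from a list of transactions. The sketch below follows the two definitions given earlier; the function names and the five shopping baskets are hypothetical, chosen to show two rules with equal support but different confidence.

```python
def support(transactions, antecedent, consequent):
    """Fraction of all transactions that contain both item sets."""
    both = antecedent | consequent
    return sum(both <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    return sum(consequent <= t for t in with_antecedent) / len(with_antecedent)

# Hypothetical shopping baskets.
baskets = [
    {"bun", "sausage"},
    {"bun", "sausage", "mustard"},
    {"sausage"},
    {"sausage", "mustard"},
    {"milk"},
]
bun, sausage = {"bun"}, {"sausage"}

print(support(baskets, bun, sausage), support(baskets, sausage, bun))
print(confidence(baskets, bun, sausage), confidence(baskets, sausage, bun))
```

Support is symmetric because it counts baskets containing both items, while confidence depends on which item is the antecedent. Note that `confidence` assumes at least one basket contains the antecedent.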
Old data can be used to develop new knowledge
New knowledge can be used to improve services or products
Improvements lead to:
Bigger profits
More efficient service
Risk Assessment
Identify Customers that pose high credit risk
Fraud Detection
Identify people misusing the system. E.g. People who have two Social Security Numbers
Customer Care
Identify customers likely to change providers
Identify customer needs
Tasks and their outputs:
Determine Business Objectives: Background; Business Objectives; Business Success Criteria
Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
Collect Initial Data: Initial Data Collection Report
Describe Data: Data Description Report
Explore Data: Data Exploration Report
Verify Data Quality: Data Quality Report
Data Set: Data Set Description
Select Data: Rationale for Inclusion / Exclusion
Clean Data: Data Cleaning Report
Construct Data: Derived Attributes; Generated Records
Integrate Data: Merged Data
Format Data: Reformatted Data
Select Modeling Technique: Modeling Technique; Modeling Assumptions
Generate Test Design: Test Design
Build Model: Parameter Settings; Models; Model Description
Assess Model: Model Assessment; Revised Parameter Settings
Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
Review Process: Review of Process
Determine Next Steps: List of Possible Actions; Decision
Plan Deployment: Deployment Plan
Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
Produce Final Report: Final Report; Final Presentation
Review Project: Experience Documentation
Data selection
active role in ignoring noncontributory data?
outliers?
Use of samples
Visualization tools
Data Analysis
Tests for statistical correctness of models
Are statistical assumptions of models correct?
E.g. Is the R-Square good?
Hypothesis testing
Is the relationship significant?
Use a t-test to validate significance
Tends to rely on sampling
Techniques are not optimised for large amounts of data
Requires strong statistical skills

Data Mining
Originally developed to act as expert systems to solve problems
Less interested in the mechanics of the technique
If it makes sense then let's use it
Does not require assumptions to be made about data
Can find patterns in very large amounts of data
Requires understanding of data and business problem
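To make the two statistical checks named above concrete, the sketch below computes R-Square and the t-statistic of the slope for a one-predictor least-squares line. It is an illustration only: the function name `slope_significance` and the five data points are assumptions, and the code stops short of looking up a critical value.

```python
import math

def slope_significance(xs, ys):
    """Return (r_squared, t_statistic) for the least-squares line Y = a0 + a1*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    # Residual and total sums of squares.
    sse = sum((y - (a0 + a1 * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - sse / sst
    # Standard error of the slope uses n - 2 degrees of freedom.
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return r_squared, a1 / se_slope

# Hypothetical sample of five observations.
r2, t = slope_significance([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r2, 3), round(t, 3))
```

The t-statistic would then be compared with a critical value from a t-distribution with n - 2 degrees of freedom to decide whether the relationship is significant.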
"What were unit sales in New England last March? Drill down to Boston." "Whats likely to happen to Boston unit sales next month? Why?"
SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
Retrospective, dynamic data delivery at multiple levels