Data Mining
Microsoft Confidential. 2006 Microsoft Corporation. All rights reserved. These materials are confidential to and maintained as a trade secret by Microsoft Corporation. Information in these materials is restricted to Microsoft authorized recipients only. Any use, distribution or public discussion of, and any feedback to, these materials is subject to the terms of the attached license. By providing any feedback on these materials to Microsoft, you agree to the terms of that license.
Copyright 2006 by Microsoft Corporation. All rights reserved. By using or providing feedback on these materials, you agree to the attached license agreement. Please provide feedback at BI Feedback Alias.
Microsoft Corporation Technical Documentation License Agreement (Standard) READ THIS! THIS IS A LEGAL AGREEMENT BETWEEN MICROSOFT CORPORATION ("MICROSOFT") AND THE RECIPIENT OF THESE MATERIALS, WHETHER AN INDIVIDUAL OR AN ENTITY ("YOU"). IF YOU HAVE ACCESSED THIS AGREEMENT IN THE PROCESS OF DOWNLOADING MATERIALS ("MATERIALS") FROM A MICROSOFT WEB SITE, BY CLICKING "I ACCEPT", DOWNLOADING, USING OR PROVIDING FEEDBACK ON THE MATERIALS, YOU AGREE TO THESE TERMS. IF THIS AGREEMENT IS ATTACHED TO MATERIALS, BY ACCESSING, USING OR PROVIDING FEEDBACK ON THE ATTACHED MATERIALS, YOU AGREE TO THESE TERMS. 1. For good and valuable consideration, the receipt and sufficiency of which are acknowledged, You and Microsoft agree as follows: (a) If You are an authorized representative of the corporation or other entity designated below ("Company"), and such Company has executed a Microsoft Corporation Non-Disclosure Agreement that is not limited to a specific subject matter or event ("Microsoft NDA"), You represent that You have authority to act on behalf of Company and agree that the Confidential Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that Company will treat the Confidential Information accordingly; (b) If You are an individual, and have executed a Microsoft NDA, You agree that the Confidential Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that You will treat the Confidential Information accordingly; or (c)If a Microsoft NDA has not been executed, You (if You are an individual), or Company (if You are an authorized representative of Company), as applicable, agrees: (a) to refrain from disclosing or distributing the Confidential Information to any third party for five (5) years from the date of disclosure of the Confidential Information by Microsoft to Company/You; (b) to refrain from reproducing or summarizing the Confidential Information; and (c) 
to take reasonable security precautions, at least as great as the precautions it takes to protect its own confidential information, but no less than reasonable care, to keep confidential the Confidential Information. You/Company, however, may disclose Confidential Information in accordance with a judicial or other governmental order, provided You/Company either (i) gives Microsoft reasonable notice prior to such disclosure and to allow Microsoft a reasonable opportunity to seek a protective order or equivalent, or (ii) obtains written assurance from the applicable judicial or governmental entity that it will afford the Confidential Information the highest level of protection afforded under applicable law or regulation. Confidential Information shall not include any information, however designated, that: (i) is or subsequently becomes publicly available without Your/Companys breach of any obligation owed to Microsoft; (ii) became known to You/Company prior to Microsofts disclosure of such information to You/Company pursuant to the terms of this Agreement; (iii) became known to You/Company from a source other than Microsoft other than by the breach of an obligation of confidentiality owed to Microsoft; or (iv) is independently developed by You/Company. For purposes of this paragraph, "Confidential Information" means nonpublic information that Microsoft designates as being confidential or which, under the circumstances surrounding disclosure ought to be treated as confidential by Recipient. "Confidential Information" includes, without limitation, information in tangible or intangible form relating to and/or including released or unreleased Microsoft software or hardware products, the marketing or promotion of any Microsoft product, Microsoft's business policies or practices, and information received from others that Microsoft is obligated to treat as confidential. 2. 
You may review these Materials only (a) as a reference to assist You in planning and designing Your product, service or technology ("Product") to interface with a Microsoft Product as described in these Materials; and (b) to provide feedback on these Materials to Microsoft. All other rights are retained by Microsoft; this agreement does not give You rights under any Microsoft patents. You may not (i) duplicate any part of these Materials, (ii) remove this agreement or any notices from these Materials, or (iii) give any part of these Materials, or assign or otherwise provide Your rights under this agreement, to anyone else. 3. These Materials may contain preliminary information or inaccuracies, and may not correctly represent any associated Microsoft Product as commercially released. All Materials are provided entirely "AS IS." To the extent permitted by law, MICROSOFT MAKES NO WARRANTY OF ANY KIND, DISCLAIMS ALL EXPRESS, IMPLIED AND STATUTORY WARRANTIES, AND ASSUMES NO LIABILITY TO YOU FOR ANY DAMAGES OF ANY TYPE IN CONNECTION WITH THESE MATERIALS OR ANY INTELLECTUAL PROPERTY IN THEM. 4. If You are an entity and (a) merge into another entity or (b) a controlling ownership interest in You changes, Your right to use these Materials automatically terminates and You must destroy them.
5. You have no obligation to give Microsoft any suggestions, comments or other feedback ("Feedback") relating to these Materials. However, any Feedback you voluntarily provide may be used in Microsoft Products and related specifications or other documentation (collectively, "Microsoft Offerings") which in turn may be relied upon by other third parties to develop their own Products. Accordingly, if You do give Microsoft Feedback on any version of these Materials or the Microsoft Offerings to which they apply, You agree: (a) Microsoft may freely use, reproduce, license, distribute, and otherwise commercialize Your Feedback in any Microsoft Offering; (b) You also grant third parties, without charge, only those patent rights necessary to enable other Products to use or interface with any specific parts of a Microsoft Product that incorporate Your Feedback; and (c) You will not give Microsoft any Feedback (i) that You have reason to believe is subject to any patent, copyright or other intellectual property claim or right of any third party; or (ii) subject to license terms which seek to require any Microsoft Offering incorporating or derived from such Feedback, or other Microsoft intellectual property, to be licensed to or otherwise shared with any third party. 6. Microsoft has no obligation to maintain confidentiality of any Microsoft Offering, but otherwise the confidentiality of Your Feedback, including Your identity as the source of such Feedback, is governed by Your NDA. 7. This agreement is governed by the laws of the State of Washington. Any dispute involving it must be brought in the federal or state superior courts located in King County, Washington, and You waive any defenses allowing the dispute to be litigated elsewhere. If there is litigation, the losing party must pay the other partys reasonable attorneys fees, costs and other expenses. 
If any part of this agreement is unenforceable, it will be considered modified to the extent necessary to make it enforceable, and the remainder shall continue in effect. This agreement is the entire agreement between You and Microsoft concerning these Materials; it may be changed only by a written document signed by both You and Microsoft.
Overview
Data mining is the proactive process of searching for and discovering patterns in data and presenting them in a more predictable and useful format. It supports business decisions that increase revenue, reduce costs, and boost customer confidence. Data mining discovers hidden knowledge within the information that traditional relational or OLAP technologies cannot uncover.
With relational or OLAP queries, the selection and grouping criteria are known in advance. Data mining provides a solution when the specific selection and grouping criteria are not known in advance but are derived from the data values. Following are scenarios for which data mining can be highly effective:
- Predicting the seasonally adjusted sales for software products in order to prepare for customer support requirements
- Identifying demographic groups that purchase different makes of cars in order to plan import allotments and focus marketing efforts
- Identifying, for grocery shop owners, products that customers typically buy together so that the physical arrangement of products can be optimized
- Targeting a mailing campaign to a customer base that reflects specific behavioral patterns, for example, mailing baby product discount coupons to families that have recently made similar purchases
- Focusing retention efforts on the most effective employees based on revenue contribution, alignment with company policies, and consistency in performance
- Preventing potential fraud by comparing a transaction to a customer's previous purchasing patterns
- Identifying matching customers even when data entry errors include misspellings of the customer's name or address
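The last scenario, matching customers despite misspellings, can be illustrated with a small sketch. It uses Python's standard `difflib` module rather than any SQL Server feature, and the function names and the 0.85 threshold are assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def likely_same_customer(name_a: str, name_b: str, threshold: float = 0.85) -> bool:
    # The 0.85 threshold is an illustrative assumption, not a recommended value.
    return similarity(name_a, name_b) >= threshold
```

In practice the threshold would be tuned against records that are already known to match, and address fields would be compared in the same way as names.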
Factors that would encourage considering data mining include the following:
- Data availability in source systems: Detailed data is available from source systems, preferably on a near real-time basis. Detailed data is a good basis for accurate and predictable results.
- Huge data volume: Large data sets that are difficult to analyze effectively with other tools lend themselves to data mining solutions. Also, the statistical functions in data mining require a large sample set in order to produce meaningful results.
- Complexity of identifying trends: A forecasting or discovery analysis that involves multiple factors lends itself to data mining, particularly when the appropriate grouping structures are not known in advance.
- Automation with minimal user interaction: Because data mining is driven by data values, the same solution can be implemented at different customer locations, achieving customized behavior with no changes to the application.
Data mining is not:
- A data warehouse: Warehouse data, whether relational or OLAP, can be used in the mining process, but data mining itself is not a store for warehouse objects such as facts and dimensions.
- A reporting store: Data mining provides a method for analyzing data and making decisions. It does not provide any reports other than the analyzed data.
- OLAP: Online analytical processing stores data warehouse data in a multidimensional store and aggregates it accordingly. Data mining does not require the data to be multidimensional or aggregated, and it cannot be treated as a replacement for an OLAP store.
- Data visualization: DMX queries must be issued against the data mining models to get the results of the algorithms in those models. Business Intelligence Development Studio offers only a limited interface for viewing data mining predictions, so additional visualization tools must be used to interface with external clients.

Although OLAP is a data store with a multidimensional structure that includes aggregations, it cannot replace data mining models, because OLAP requires predefined grouping buckets. OLAP data can be used as an alternative to detailed relational data as a source for data mining models. The data mining models then provide further analysis and additional insight into the patterns and predictions. The following table compares OLAP and data mining models.
OLAP | Data Mining
---- | -----------
Typically focuses on historical facts | Typically focuses on future outcomes or trends
Aggregates data using pre-defined groupings | Requires detail data
Verification driven/Factual results | Discovery driven
Ad hoc queries and reports | Statistical and machine learning techniques
Limited ability to include reliability estimates with predictions | Data models available for predicting, discovering patterns, estimating and producing accurate results for trend analysis and forecasting
OLAP can be used as a data source for Data Mining models | Data mining results can also be used in OLAP applications by incorporating new predictive variables or scores as dimensions or attributes in your OLAP tool
Building a data mining model is part of a larger process that includes everything from defining the basic problem the model will solve to deploying it into a working environment. This process can be divided into the following seven steps:
1. Defining the problem
2. Data preparation
3. Exploring data
4. Building the model
5. Exploring and validating the model
6. Deploying and updating the models
7. Accessing the models
Data Preparation:
The Data Extraction chapter details the types of data sources and the extraction methods for collecting data effectively. Typically the source transaction system consolidates all the discrete channels of data in one place and applies transactions, but it is not necessarily the only source of information. Data mining involves analyzing source data at a broader level, including data both internal and external to the system. Data collection is driven by the business requirements and does not necessarily mean extracting all the data. Following are some examples:
- Internal data sources: company activities, customer records, web sites, mail campaigns, purchasing transactions, inventory
- External data sources: partners and suppliers, including external credit agencies, market surveyors, and customer feedback

Data mining models can extract data either from relational structures or from an OLAP store, as explained below in detail. The data mining algorithms work the same way for both relational and OLAP sources. The only difference is the source structure and format.
customers filling in questionnaires. If it does not occur too often, data mining tools are able to ignore the noise and still find the overall patterns that exist in your data.

Data preparation and cleaning is an often neglected but extremely important step in the data mining process. The data does not need to be cleansed before the data mining process extracts it if the intention is to find data quality problems using data mining models. Data can be polluted in a number of ways, such as through the user interface, application problems, data collection mechanisms, heterogeneous systems, and data transformation procedures. Often the method by which the data were gathered was not tightly controlled, so the data may contain out-of-range values (e.g., Income: -100) or incorrect data combinations (e.g., New York located in Japan). If the intention is not to find data quality problems using data mining models, consider cleansing the data before using it in the models. Data quality problems and techniques are explained in the Data Transformation chapter.

Data mining can handle either numeric or text-based data.

Numeric Mining: Here the input columns may have descriptive (text) content, but the prediction columns contain numeric data. Simple operations on numeric data, such as greater than, less than, and percentage, can be applied to deduce more meaningful patterns and forecasts. This chapter covers data mining models that deal with numeric data for predictions.

Text Mining: While data mining is typically concerned with the detection of patterns in numeric data, information that is critical to the business is very often stored in the form of text. Unlike numeric data, text is often difficult to deal with. Text mining generally consists of analyzing (multiple) text documents by parsing them and extracting key phrases, concepts, and so on, and then preparing the processed text for further analysis with numeric data mining techniques (e.g., to determine co-occurrences of concepts, key phrases, names, addresses, product names, etc.).
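The two data quality problems named in this step, out-of-range values and incorrect data combinations, can be screened for with simple rules before the data reaches a model. The following is a minimal sketch; the field names and the `CITY_COUNTRY` lookup are hypothetical and stand in for whatever reference data a real system would use.

```python
from typing import Dict, List

# Hypothetical reference data: which country each city belongs to.
CITY_COUNTRY = {"New York": "USA", "Tokyo": "Japan"}


def find_quality_problems(record: Dict[str, object]) -> List[str]:
    """Return a list of human-readable problems found in one record."""
    problems = []

    # Out-of-range check: an income cannot be negative.
    income = record.get("income")
    if isinstance(income, (int, float)) and income < 0:
        problems.append(f"out-of-range income: {income}")

    # Combination check: the city must belong to the stated country.
    city, country = record.get("city"), record.get("country")
    expected = CITY_COUNTRY.get(city)
    if expected is not None and country != expected:
        problems.append(f"inconsistent location: {city} in {country}")

    return problems
```

A record such as `{"income": -100, "city": "New York", "country": "Japan"}` is reported with both problems, while a clean record yields an empty list.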
Exploring data:
This step involves checking whether the required data contains the expected information. For example, if cross-sell information is to be analyzed as part of the data mining models, then individual customer transactions and the products purchased must be captured. The term data reduction, in the context of data mining, usually refers to aggregating the information from large datasets into manageable chunks. Data reduction methods include simple tabulation or aggregation, as well as more sophisticated techniques such as clustering and principal components analysis.
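Data reduction by simple aggregation can be sketched as follows. The example assumes transactions arrive as hypothetical `(customer_id, product, amount)` tuples and reduces them to one summary row per customer, which also preserves the products purchased for cross-sell analysis.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def summarize_by_customer(transactions: Iterable[Tuple[str, str, float]]) -> Dict[str, dict]:
    """Reduce (customer_id, product, amount) rows to one summary row per customer."""
    summary: Dict[str, dict] = defaultdict(
        lambda: {"orders": 0, "total": 0.0, "products": set()}
    )
    for customer_id, product, amount in transactions:
        row = summary[customer_id]
        row["orders"] += 1             # number of transactions seen
        row["total"] += amount         # total amount spent
        row["products"].add(product)   # distinct products purchased
    return dict(summary)
```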
Building the model:
A model typically contains input columns, an identifying column, and a predictable column. The data type of each column can be defined in the mining structure, and the algorithms process the data based on that definition. The following basic column types are useful to understand before studying the rest of the sections:
- Continuous Column: Contains numeric measurements with no upper bound, typically product cost, salary, account balance, shipping date, or invoice date.
- Discrete Column: Contains finite, unrelated values such as gender, location, age, or telephone area codes. The values do not need to be numeric and typically do not have a fractional component.
- Discretized Column: A continuous column converted to be discrete, for example, by grouping salaries into predefined bands.
- Key: The column that uniquely identifies the row, similar to a primary key. This is sometimes called the Case attribute.

In brief, the algorithms available in Business Intelligence Development Studio are:
- Microsoft Decision Trees Algorithm: Uses the values, or states, of the designated input columns to predict the states of the column that was designated as predictable. It identifies the attribute tree that best predicts the result. This algorithm allows for interplay between attributes and provides a hierarchy of attribute definitions that can be used to make a decision. More information can be found at http://msdn2.microsoft.com/en-us/library/ms175312.aspx
- Microsoft Clustering Algorithm: Groups the cases in a dataset into clusters that contain similar characteristics. It identifies how the data forms subgroups and how these subgroups differ from each other. This algorithm finds patterns without a specific target result. More information can be found at http://msdn2.microsoft.com/en-us/library/ms174879.aspx
- Microsoft Naive Bayes Algorithm: Identifies the attribute that is most likely to predict the result. This algorithm is less computationally intense than other Microsoft algorithms and is therefore useful for quickly generating a mining model to discover relationships between input columns and predictable columns. You can use this algorithm for initial explorations of data, and later apply the results to create additional mining models with other algorithms that are more computationally intense and more sophisticated. More information can be found at http://msdn2.microsoft.com/en-us/library/ms174806.aspx
- Microsoft Association Algorithm: Association models are built on datasets that contain identifiers both for individual cases and for the item sets that the cases contain. An association model is made up of a series of item sets and the rules that describe how those items are grouped together within the cases. The rules that the algorithm identifies can be used to predict a customer's likely future purchases, based on the items that already exist in the customer's shopping cart. In essence, it identifies the subgroup of data that participates in a specific transaction. More information can be found at http://msdn2.microsoft.com/en-us/library/ms174916.aspx
- Microsoft Sequence Clustering Algorithm: Identifies the event that is likely to happen next. The algorithm takes a sequence of events as input and is well suited to clickstream analysis. It is similar to the Microsoft Clustering Algorithm, but instead of finding clusters of cases that contain similar attributes, it finds clusters of cases that contain similar paths in a sequence. More information can be found at http://msdn2.microsoft.com/en-us/library/ms175462.aspx
- Microsoft Time Series Algorithm: Used for predicting continuous columns such as product sales. Unlike the models created by other Microsoft algorithms, a time series model bases its forecast only on the trends that the algorithm derives from the original dataset. In essence, it identifies the current trends and predicts the future from the current data. More information can be found at http://msdn2.microsoft.com/en-us/library/ms174923.aspx
- Microsoft Neural Network Algorithm: Similar to the Microsoft Decision Trees algorithm, this algorithm identifies the attributes that best predict the result, but it analyzes more than two attributes at a time. It calculates probabilities for each possible state of the input attribute when given each state of the predictable attribute. More information can be found at http://msdn2.microsoft.com/en-us/library/ms174941.aspx
- Microsoft Logistic Regression Algorithm: A variation of the Microsoft Neural Network algorithm in which the HIDDEN_NODE_RATIO parameter is set to 0. This setting creates a neural network model that does not contain a hidden layer and is therefore equivalent to logistic regression. More information can be found at http://msdn2.microsoft.com/en-us/library/ms174828.aspx
- Microsoft Linear Regression Algorithm: A variation of the Microsoft Decision Trees algorithm in which the MINIMUM_LEAF_CASES parameter is set to be greater than or equal to the total number of cases in the dataset that the algorithm uses to train. More information can be found at http://msdn2.microsoft.com/en-us/library/ms174824.aspx

The output of a data mining model provides analyzed and forecast data that business analysts can readily use. For example, if you have a budget to mail information about a new product to 1000 people, relational or OLAP queries will not produce the optimal set of 1000 people. By creating a data mining attribute that you can use in your query or OLAP analysis, data mining enables you to find the 1000 people most likely to respond. This example also shows that data mining does not replace OLAP, but enhances it.

SQL Server 2005 Business Intelligence Development Studio is an integrated environment that includes several data mining algorithms and tools for building a comprehensive data mining solution. For example, you can use the Data Mining Designer tool in Business Intelligence Development Studio to create, modify, and compare data mining models. The different data mining models are explained in detail in the later sections of this chapter.

A data mining model applies a mining model algorithm to the data that is represented by a mining structure. Model parameters and boundary values can be defined on the data mining algorithms, and usage parameters can be defined on the data mining model columns. You can define columns to be input columns, key columns, or predictable columns.
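The discretized column type described earlier in this section, a continuous column grouped into predefined bands, can be sketched in a few lines. The band boundaries and labels below are hypothetical, and the sketch is not how Analysis Services discretizes columns internally; it only illustrates the idea of mapping a continuous value such as a salary to a band.

```python
from bisect import bisect_right
from typing import Sequence

# Hypothetical band boundaries and labels; a real model would choose its own.
SALARY_BOUNDS = [30000, 60000, 100000]
SALARY_LABELS = ["low", "medium", "high", "very high"]


def discretize(value: float, bounds: Sequence[float], labels: Sequence[str]) -> str:
    """Map a continuous value to the label of the band it falls into.

    A value equal to a boundary is placed in the band above that boundary.
    """
    if len(labels) != len(bounds) + 1:
        raise ValueError("need exactly one more label than boundaries")
    return labels[bisect_right(bounds, value)]
```

With these assumed bands, a salary of 75000 maps to "high" and a salary of 29999 maps to "low".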
- Operational analysis covers business transaction reports (closing bank balances, who was admitted to the hospital today, how many support calls were closed today, etc.).
- Trend analysis examines the growth of historical data over a period of time.
- Ad hoc analysis is business context analysis (product sales by region). It can also be used to find a root cause, such as a sudden decrease in sales of a product due to floods or another natural calamity.
- Predictive analysis predicts patterns for the future (also called forecasting).
The following picture shows a hierarchical view of data mining models and how the algorithms are divided between these two models.
- Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset. An example of a classification algorithm is the Microsoft Decision Trees Algorithm.
- Regression algorithms predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset. An example of a regression algorithm is the Microsoft Linear Regression Algorithm.
- Time series algorithms forecast patterns based on the current set of continuous predictable attributes. The data to be taken as the base for predicting future patterns can be configured. The Microsoft Time Series Algorithm is the best fit for time series related business requirements such as forecasting.
- Prediction is the estimation of future outcomes, such as predicting which customers will be loyal or which customers will respond to a promotion. It works on a continuous attribute set. The Microsoft Time Series and Decision Trees Algorithms are good choices for implementing these scenarios.
- Segmentation algorithms divide data into groups, or clusters, of items that have similar properties. An example of a segmentation algorithm is the Microsoft Clustering Algorithm.
- Summarization algorithms are similar to clustering algorithms, but instead of only grouping the data, they quantify the members of each group, for example, reporting that group 1 has the largest number of line items and the highest probability of occurring. The Microsoft Clustering Algorithm provides this information in addition to clustering the selected data set.
- Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is creating association rules, which can be used in a market basket analysis. An example of an association algorithm is the Microsoft Association Algorithm.
- Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a Web path flow. An example of a sequence analysis algorithm is the Microsoft Sequence Clustering Algorithm.
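To make the association rules used in market basket analysis more concrete, the sketch below computes the two standard measures behind a rule such as "bread implies milk": support (how often the items appear together) and confidence (how often the consequent appears when the antecedent does). It is a generic illustration with hypothetical function names, not the Microsoft Association Algorithm itself.

```python
from typing import FrozenSet, Iterable, List


def support(baskets: List[FrozenSet[str]], items: Iterable[str]) -> float:
    """Fraction of baskets that contain every item in `items`."""
    if not baskets:
        return 0.0
    wanted = frozenset(items)
    return sum(1 for basket in baskets if wanted <= basket) / len(baskets)


def confidence(baskets: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Estimate P(consequent | antecedent) from the baskets."""
    antecedent, consequent = set(antecedent), set(consequent)
    base = support(baskets, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(baskets, antecedent | consequent) / base
```

A rule is usually kept only when both measures exceed thresholds chosen for the business problem; the thresholds themselves are a modeling decision.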
Choosing the right algorithm to use for a specific business task can be a challenge. While you can use different algorithms to perform the same business task, each algorithm produces a different result, and some algorithms can produce more than one type of result. For example, you can use the Microsoft Decision Trees algorithm not only for prediction, but also as a way to reduce the number of columns in a dataset, because the decision tree can identify columns that do not affect the final mining model.
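Because each algorithm produces a different result, candidate models are usually compared on data with known outcomes. One common measure is lift: how much better the top-ranked cases perform than the population as a whole. The sketch below is a minimal, assumption-laden illustration; `lift_at` and its input format are hypothetical and unrelated to the lift chart tooling in Business Intelligence Development Studio.

```python
from typing import List, Tuple


def lift_at(scored: List[Tuple[float, bool]], fraction: float) -> float:
    """Lift of the top `fraction` of cases, ranked by predicted score.

    `scored` holds (predicted_probability, actually_responded) pairs.
    Lift = response rate in the top slice / overall response rate.
    """
    if not scored:
        raise ValueError("scored must not be empty")
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(len(ranked) * fraction))
    top_rate = sum(1 for _, hit in ranked[:top_n] if hit) / top_n
    overall_rate = sum(1 for _, hit in ranked if hit) / len(ranked)
    return top_rate / overall_rate if overall_rate else 0.0
```

A lift of 1.0 means the model ranks cases no better than random selection; when two models are scored on the same cases, the one with the higher lift at the fraction of interest is the better candidate for that task.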
You can use different algorithms to perform the same business task and each algorithm produces a different result. Lift charts would be useful to check the accuracy of the data mining models once built on the input data. Use more than one algorithm to produce results and analyze the results for choosing the right one. Different algorithms produce different results. The choosing of the algorithms is based on the accuracy and on the business need Use algorithms together use some algorithms to explore data, and then use other algorithms to predict a specific outcome based on that data. For example, you can use a clustering algorithm, which recognizes patterns, to break data into groups that are more or less homogeneous, and then use the results to create a better decision tree model. Use multiple algorithms within one solution to perform separate tasks. For example, regression tree algorithm can be used to obtain financial forecasting information, and a rule-based algorithm to perform a market basket analysis. Drill down analysis on processed mining models would be useful to check the granular content behavior over the time to denote the interactive exploration of data, in particular of large databases. The process of drill-down analyses begins by considering some simple break-downs of the data by a few variables of interest (e.g., Gender, geographic region, etc.). If the individual attributes you have are transaction amounts, you should model them as continuous rather than discrete. Below lists out some of the popular business scenarios for which data mining models are sought and right model chosen Real Customer Scenarios: Problem The marketing department of a car company needs to identify the characteristics of existing customers to determine whether they are likely to buy a product in the future Solution By using the Microsoft Decision Trees Algorithm, the marketing department can predict whether a particular customer will purchase a product. 
The Microsoft Decision Trees algorithm can make a prediction based on customer information, such as demographics or past buying patterns.

Problem: The marketing department of a car company needs to predict monthly car sales for the coming year. It also needs to identify whether the sales report of one model can be used to predict the sales of another model.
Solution: By using the Microsoft Time Series Algorithm on the past three years of historical data, a data mining model that forecasts future car sales can be produced. Using this model, you can also make cross predictions to determine the relationship between the sales trends of individual car models.

Problem: The car company is redesigning its Web site to favor the sale of products.
Solution: By using the Microsoft Association Rules Algorithm on the company records of each sale in the transactional database, the company can identify the car models and accessories that tend to be purchased together. The company can then predict additional items that a customer might be interested in.

Problem: Predicting when a customer's balance will reach zero. This is a typical banking requirement, so that the customer can be alerted or funds can be moved automatically from another account.
Solution: Make use of Decision Trees and Neural Networks. Check the customers' recent transactions at the point where the balance hit zero, derive input attributes such as premium repay, summer bike renting, gender, and age, and then prepare the model. The outcome can also be predicted from the average monthly balance and average weekly balance of all customers, by flagging cases where the current week's balance is below the average weekly balance and could reach zero if the trend continues.

Problem: A price comparison system matches the best prices for a purchase. An order can contain a number of major items, and each major item can have multiple related sub-items. Other variables that affect the price include trade-ins (if any), sales going on at the time of the order, number of units, and so on.
Solution: Decision trees, neural nets, or logistic regression would be a better option. Consider parameters such as product type, weight, cost, location, and date of the year as inputs, and predict the cost of the product across the different variables.

Problem: A retail store selling video game consoles wants to introduce the new Xbox 360 during the holiday season. Before the store introduces the product, the manager wants to predict which of the existing customers are more likely to buy it.
Solution: Microsoft Decision Trees would be a good model for this scenario, to study the parameters of the customers most likely to buy the Xbox 360. For example, kids under the age of 20 would most probably buy the product. The input columns to consider would be Customer ID, Age, Gender, Marital Status, Total Children, Occupation, Location, and Yearly Income, reflecting who can afford the purchase and has time at home to play.
Usage-based matrix of each data mining algorithm:

Task: Predicting a discrete attribute. For example, to predict whether the recipient of a targeted mailing campaign will buy a product.
Microsoft algorithms to use: Microsoft Decision Trees Algorithm, Microsoft Naive Bayes Algorithm, Microsoft Clustering Algorithm, Microsoft Neural Network Algorithm (SSAS)

Task: Predicting a continuous attribute. For example, to forecast next year's sales.
Microsoft algorithms to use: Microsoft Decision Trees Algorithm, Microsoft Time Series Algorithm

Task: Predicting a sequence. For example, to perform a clickstream analysis of a company's Web site.
Microsoft algorithms to use: Microsoft Sequence Clustering Algorithm

Task: Finding groups of common items in transactions. For example, to use market basket analysis to suggest additional products to a customer for purchase.
Microsoft algorithms to use: Microsoft Association Algorithm, Microsoft Decision Trees Algorithm

Task: Finding groups of similar items. For example, to segment demographic data into groups to better understand the relationships between attributes.
Microsoft algorithms to use: Microsoft Clustering Algorithm, Microsoft Sequence Clustering Algorithm
Forecasting for next year does not need data that is five years old. When the incoming data volume changes, update the mining model to keep the results accurate.
Consider slicing the source cube if OLAP is the source, or filtering if relational storage is the mining source
Consider filtering the data before it is processed by data mining, either by slicing the cube or by using a WHERE condition. This keeps unnecessary data away from the mining models.
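As a rough illustration of filtering relational source rows before they reach a mining model, the sketch below keeps only rows dated on or after a cutoff. The row layout, the `sale_date` field, and the `filter_rows` helper are hypothetical; in practice the same effect would come from a WHERE condition on the source query or from a slice on the cube.

```python
from datetime import date


def filter_rows(rows, cutoff):
    """Return only the rows dated on or after the cutoff date."""
    return [row for row in rows if row["sale_date"] >= cutoff]


# Hypothetical source rows; only the recent ones should feed the model.
sales = [
    {"sale_date": date(2001, 3, 14), "amount": 120.0},
    {"sale_date": date(2005, 7, 2), "amount": 80.0},
    {"sale_date": date(2006, 1, 20), "amount": 95.5},
]

recent = filter_rows(sales, cutoff=date(2004, 1, 1))
print(len(recent), "of", len(sales), "rows kept")
```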
Cleanse the incoming data if the modeling is not for data quality identification
Data mining models can be used to detect data quality problems, such as a percentage of male employees recorded as having taken pregnancy leave. This is purely a data entry problem, and data mining models can identify it. If the purpose of the data mining model is not to find data quality problems, however, consider cleansing the incoming data. Data cleansing is explained in the Data Transformation chapter.
Use a lift chart to measure accuracy and decide on the model
The lift chart is important because it helps distinguish between models in a structure that are almost the same, so that you can determine which model provides the best predictions. Similarly, the lift chart shows which type of algorithm makes the best predictions for a particular situation.
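The comparison behind a lift chart can be sketched as follows. This is a minimal illustration, not the way Analysis Services computes its chart: cases are ranked by each model's predicted probability, and for each slice of the population we record the fraction of real positives captured. The function name `cumulative_gains` and the sample scores are assumptions made for the example.

```python
def cumulative_gains(scores, actuals, steps=10):
    """Return (population_fraction, captured_fraction) points.

    scores  : predicted probability of the target state, one per case
    actuals : 1 if the case really has the target state, else 0
    """
    # Rank cases from most to least likely according to the model.
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(actual for _, actual in ranked)
    if total_positives == 0:
        raise ValueError("no positive cases in the test set")
    n = len(ranked)
    points = []
    for step in range(1, steps + 1):
        cutoff = n * step // steps  # number of top-ranked cases in this slice
        captured = sum(actual for _, actual in ranked[:cutoff])
        points.append((cutoff / n, captured / total_positives))
    return points


# Hypothetical test set scored by two candidate models.
actuals = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
model_a = [0.9, 0.2, 0.8, 0.3, 0.1, 0.7, 0.4, 0.2, 0.1, 0.3]
model_b = [0.5, 0.6, 0.4, 0.7, 0.2, 0.3, 0.8, 0.1, 0.9, 0.3]

points_a = cumulative_gains(model_a, actuals, steps=5)
points_b = cumulative_gains(model_b, actuals, steps=5)
for (population, gain_a), (_, gain_b) in zip(points_a, points_b):
    print(f"top {population:.0%}: model A {gain_a:.2f}, model B {gain_b:.2f}")
```

The model whose curve rises fastest captures more of the target cases from a smaller share of the population, which is the property a lift chart makes visible.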
Consider splitting the source data into a training set and a testing set
In order to test the predictive capability of a mining model, you can randomly divide the available source data in half, and use one half to train the model and the other half to test the predictive capability of the model when using a lift chart.
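The random half-and-half split described above can be sketched in a few lines. The `split_cases` helper, its parameters, and the fixed seed are illustrative assumptions; any reproducible random partition of the source cases would serve the same purpose.

```python
import random


def split_cases(cases, test_fraction=0.5, seed=42):
    """Randomly partition cases into (training, testing) lists."""
    shuffled = list(cases)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


cases = list(range(100))  # stand-in for 100 source cases
training, testing = split_cases(cases)
print(len(training), "training cases,", len(testing), "testing cases")
```

The model is trained only on `training`, and the lift chart is then built from predictions on `testing`, so the accuracy estimate is not inflated by cases the model has already seen.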
Consider having a statistical analyst study the models and align them with business requirements
A statistical analyst is a good fit for analyzing the results from data mining models and applying them to the decision-making process. Statistical analysts typically have experience with predictions, forecasting, and pattern identification, which is exactly what data mining models do, so the models help the analyst reach more accurate business decisions.
Consider using Integration Services for modeling, maintaining, and reprocessing mining models
Make use of SSIS (Integration Services) to identify changes in the incoming data and in the requirements, and to modify the data mining models accordingly. This automates the reprocessing schedules of the models and keeps them accurate whenever the underlying data changes. A requirement change still means changing the model, but only in one place in the Integration Services package: either set the corresponding parameters or add another model.
Consider aggregating to the required level before building a time series model
If a daily sales forecast is required, first prepare the data by aggregating the sales figures per item to that level. Having multiple records with the same key at the level at which the forecast is sought would give incorrect results and may cause errors.
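A minimal sketch of that preparation step, assuming transactions arrive as hypothetical (item, timestamp, amount) tuples: amounts are summed so that each item and day pair appears exactly once before the series is handed to a time series model. In a real solution this aggregation would more likely be done in the source query or in the Integration Services package.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_daily(transactions):
    """Sum sales per (item, calendar day) so each key appears exactly once."""
    totals = defaultdict(float)
    for item, timestamp, amount in transactions:
        totals[(item, timestamp.date())] += amount
    # Sorted output gives one row per item and day, ordered by item then day.
    return sorted((item, day, total) for (item, day), total in totals.items())


# Hypothetical transaction-level rows with several sales per item per day.
transactions = [
    ("Bike", datetime(2006, 3, 1, 9, 15), 10.0),
    ("Bike", datetime(2006, 3, 1, 16, 40), 5.5),
    ("Helmet", datetime(2006, 3, 1, 11, 0), 3.0),
    ("Bike", datetime(2006, 3, 2, 10, 5), 7.0),
]

daily = aggregate_daily(transactions)
for item, day, total in daily:
    print(item, day.isoformat(), total)
```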
Consider the detailed transaction table as the nested table and its master table as the case table
Every model built for data mining requires identifying the case table and the nested table; this example gives an idea of what to select for each. The more common approach for market basket analysis is to model each transaction (containing multiple items) as a case. Your transaction table would then have a composite key comprising the transaction ID and item ID, with multiple rows for each transaction ID covering all the items that were bought together as part of that single transaction. From the point of view of SQL Data Mining, you would use the same table as the "case table" as well as the "nested table" containing the associated items for each transaction.

Nested tables are useful in two ways:
As a natural modeling concept for one-to-many relationships. Examples: products purchased by a customer, movies watched by a person.
As a powerful and compact way of representing variable-length, sparse cases, where individual cases contain only a small subset of the potential attributes for a case. For instance, a store might sell a thousand products. If you wanted to model each potential product purchase as an attribute without nested tables, you would need more than a thousand columns in your model, with the attribute states being "missing" or "existing" to indicate whether the customer (the "case") bought the product or not. Instead, the nested table lets you pivot the attributes and specify only the ones that are actually present for each case.
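To make the compactness argument concrete, the sketch below groups hypothetical (transaction ID, item) rows into one case per transaction with a nested list of items, and compares the number of stored values against a wide layout with one column per product. The `build_cases` helper, the sample rows, and the catalog size of 1,000 are assumptions for illustration only.

```python
from collections import defaultdict


def build_cases(rows):
    """Group (transaction_id, item) rows into one case per transaction."""
    cases = defaultdict(list)
    for transaction_id, item in rows:
        cases[transaction_id].append(item)
    return dict(cases)


# Hypothetical transaction table keyed by (transaction_id, item).
rows = [
    (1001, "Bike"),
    (1001, "Helmet"),
    (1002, "Helmet"),
    (1002, "Gloves"),
    (1002, "Water Bottle"),
    (1003, "Bike"),
]

cases = build_cases(rows)
catalog_size = 1000  # hypothetical number of products the store sells

nested_values = sum(len(items) for items in cases.values())
wide_values = catalog_size * len(cases)  # one column per product, per case
print(f"nested: {nested_values} values, wide: {wide_values} values")
```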
Challenges/Limitations
Data mining systems face many challenges, particularly related to the incoming data and its quality. A data mining system may work perfectly for consistent data and perform significantly worse when even a little noise is present in the incoming data. This section looks at what we consider to be the most prominent problems and challenges of data mining systems today.
Noisy Data:
In a large database, many of the attribute values will be inconsistent or incorrect. This may be due to erroneous instruments, human error during data entry, migration problems, missing values, or incorrect transactions at the source system.
Summary:
This chapter details the need for data mining in the end-to-end implementation of business intelligence applications. The requirement for data mining models comes from business analysts who want to dig deeper into day-to-day transactions, discover hidden patterns, and use them to forecast the business and to increase customer satisfaction and profits by making the right decisions. Different data mining models are available in Business Intelligence studio, and the same business scenario can be implemented by more than one model. Techniques for selecting the right model and comparing it with similar models increase the accuracy of the expected result. Guidelines and challenges to consider while implementing data mining solutions are also provided.