
Data Mining & its Process

03/17/2012

Advanced Database Management Systems

Contents
- Data Mining Definition
- Data Mining Process
- Data Mining Process Steps
- Data Mining Tools

Data Mining

- Data mining is a process of discovering actionable information from large sets of data.
- It uses mathematical analysis to derive patterns and trends that exist in data.
- Typically, these patterns cannot be discovered by traditional data exploration because the relationships are too complex or because there is too much data.
- These patterns and trends can be collected and defined as a data mining model.

Applications to business scenarios


Mining models can be applied to specific business scenarios, such as:
- Forecasting sales
- Targeting mailings toward specific customers
- Determining which products are likely to be sold together
- Finding sequences in the order that customers add products to a shopping cart
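
As a rough illustration of the "products likely to be sold together" scenario, the sketch below counts how often pairs of products share a basket. It is only a co-occurrence count, not a full association-rule algorithm, and the function name `pair_counts` and the sample baskets are invented for this example.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how often each unordered pair of products shares a basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical transactions, invented for illustration only
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "milk", "eggs"],
    ["butter", "eggs"],
]

counts = pair_counts(baskets)
best_pair, best_count = counts.most_common(1)[0]
print(best_pair, best_count)
```

Pairs with a high count are candidates for cross-selling; a real mining model would also weigh how often each product sells on its own.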

Data mining process

Six steps:
1. Defining the Problem
2. Preparing Data
3. Exploring Data
4. Building Models
5. Exploring and Validating Models
6. Deploying and Updating Models

Relationship between each step

Explanation
Each step does not necessarily lead directly to the next step. Creating a data mining model is a dynamic and iterative process.

- After exploring the data, it may turn out that the data is insufficient to create the appropriate mining models, so more data has to be found.
- After building several models, it may become clear that the models do not adequately answer the problem as defined, so the problem must be redefined.
- The models may have to be updated after they have been deployed because more data has become available.

Each step in the process might need to be repeated many times in order to create a good model.

Phase-I: Defining the Problem

Defining the Problem

This step includes:
- analyzing the business requirements,
- considering ways to provide an answer to the problem,
- defining the scope of the problem,
- defining the metrics by which the model will be evaluated, and
- defining specific objectives for the data mining project.

The Tasks

- What are you looking for? What types of relationships are you trying to find?
- Does the problem you are trying to solve reflect the policies or processes of the business?
- Do you want to make predictions from the data mining model, or just look for interesting patterns and associations?
- Which attribute of the dataset do you want to try to predict?
- How are the columns related? If there are multiple tables, how are the tables related?
- How is the data distributed? Is the data seasonal? Does the data accurately represent the processes of the business?

Answering these questions: a data availability study has to be conducted to investigate the needs of the business users with regard to the available data. If the data does not support the needs of the users, the project might have to be redefined.

Phase-II: Preparing Data

Preparing Data

Data preparation:
- Removes inconsistencies such as incorrect or missing entries. For example, the data might show that a customer bought a product before the product was offered on the market, or that the customer shops regularly at a store located 2,000 miles from her home.
- Finds hidden correlations in the data.
- Identifies the sources of data that are the most accurate.
- Determines which columns are the most appropriate for use in analysis. For example, should you use the shipping date or the order date? Is the best sales influencer the quantity, the total price, or a discounted price?

Therefore, before starting to build mining models, these problems should be identified and a way to fix them determined.
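
The kind of inconsistency check described above can be sketched in a few lines. This is a minimal example, assuming orders arrive as dictionaries; the field names (`id`, `product`, `order_date`), the function `find_inconsistencies`, and the sample data are all hypothetical.

```python
from datetime import date


def find_inconsistencies(orders, launch_dates):
    """Return (order_id, reason) pairs for rows that look wrong."""
    problems = []
    for order in orders:
        order_id = order.get("id")
        product = order.get("product")
        purchased = order.get("order_date")
        # Missing entries are reported first; no further check is possible
        if product is None or purchased is None:
            problems.append((order_id, "missing entry"))
            continue
        launched = launch_dates.get(product)
        # A purchase dated before the product launch cannot be correct
        if launched is not None and purchased < launched:
            problems.append((order_id, "purchased before launch"))
    return problems


# Hypothetical data, invented for illustration only
launch_dates = {"widget": date(2011, 6, 1)}
orders = [
    {"id": 1, "product": "widget", "order_date": date(2011, 7, 15)},
    {"id": 2, "product": "widget", "order_date": date(2011, 5, 20)},
    {"id": 3, "product": "widget", "order_date": None},
]

problems = find_inconsistencies(orders, launch_dates)
print(problems)
```

The function only reports problems. Whether to drop, correct, or keep a flagged row is a decision for the project.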

Phase-III: Exploring Data

Exploring Data
Exploration techniques include calculating the minimum and maximum values, calculating the mean and standard deviation, and looking at the distribution of the data.
For example:
- Reviewing the maximum, minimum, and mean values might show that the data is not representative of customers or business processes, and that more balanced data must therefore be obtained.
- Standard deviations and other distribution values can provide useful information about the stability and accuracy of the results. A large standard deviation can indicate that adding more data might help improve the model.

Exploring the data gives a better understanding of the business problem. It helps in deciding whether the dataset contains flawed data, so that a strategy for fixing the problems can be devised, and it gives a deeper understanding of the behaviors that are typical of your business.
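
The exploration techniques named above (minimum, maximum, mean, standard deviation, distribution) can be tried with Python's standard library. This is a minimal sketch; the helper names `summarize` and `histogram` and the order totals are invented for illustration.

```python
import statistics
from collections import Counter


def summarize(values):
    """Basic descriptive statistics for one numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def histogram(values, bucket_width=10):
    """Count values per bucket, keyed by the lower bound of the bucket."""
    return Counter((v // bucket_width) * bucket_width for v in values)


# Hypothetical order totals; one value sits far from the rest
order_totals = [20, 22, 19, 21, 23, 95]

summary = summarize(order_totals)
buckets = histogram(order_totals)
print(summary)
print(sorted(buckets.items()))
```

Here the single large value pulls the mean well above the typical order and inflates the standard deviation, which is the sort of signal that prompts a closer look at the data.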

Phase IV: Building Models

Building Models

- A mining structure is created to define the data explored in the previous phase. It defines the source of the data but does not contain any data until it is processed.
- Processing a model is called training. During training, specific mathematical algorithms are applied to the data in the structure to extract patterns.
- The patterns found in the training process depend on the selection of training data, the algorithm chosen, and how the algorithm has been configured.
- Whenever the data changes, both the mining structure and the mining model must be updated. When a mining structure is updated by reprocessing it, data is retrieved from the source, including any new data, and the mining structure is repopulated. The mining models are then retrained on the new data.
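
To make "training" and "retraining" concrete, the sketch below fits a straight line by ordinary least squares and then refits it after new data arrives. It is a generic illustration, not the Analysis Services API; the function `train_linear_model` and the (month, sales) figures are assumptions made for this example.

```python
def train_linear_model(rows):
    """Fit y = slope * x + intercept by ordinary least squares.

    `rows` is a list of (x, y) pairs with at least two distinct x values.
    """
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical source data: (month, sales)
source = [(1, 12.0), (2, 14.0), (3, 16.0)]
model = train_linear_model(source)

# New data arrives, so the model is retrained on the full, updated source
source.append((4, 21.0))
retrained = train_linear_model(source)

print(model, retrained)
```

The pattern the model extracts (the slope and intercept) changes as soon as the training data changes, which is why an updated source calls for reprocessing.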

Phase V: Validating Models

Validating Models
- Before a model is deployed into a production environment, it is tested to see how well it performs.
- All the models created with different configurations are tested to see which yields the best results for the specified problem and data.
- Analysis Services provides tools that help to separate data into training and testing datasets, so that the performance of all models can be assessed accurately on the same data. The training dataset is used to build the model, and the testing dataset is used to test the accuracy of the model by creating prediction queries.
- What if none of the models created in the Building Models step performs well? Return to a previous step in the process and redefine the problem or reinvestigate the data in the original dataset.
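
The idea of holding out a testing dataset can be sketched without any particular product. The example below is a simplified stand-in for the Analysis Services tooling mentioned above: it holds out every fourth row, learns a single cut-off from the training rows, and measures accuracy on the held-out rows. The functions `split`, `train_threshold`, and `accuracy` and the data are hypothetical.

```python
def split(rows, test_every=4):
    """Hold out every `test_every`-th row for testing; train on the rest."""
    train = [r for i, r in enumerate(rows) if i % test_every != test_every - 1]
    test = [r for i, r in enumerate(rows) if i % test_every == test_every - 1]
    return train, test


def train_threshold(train_rows):
    """Learn a cut-off halfway between the mean value of buyers and non-buyers.

    Assumes the training rows contain at least one buyer and one non-buyer.
    """
    buyers = [x for x, bought in train_rows if bought]
    others = [x for x, bought in train_rows if not bought]
    return (sum(buyers) / len(buyers) + sum(others) / len(others)) / 2


def accuracy(threshold, rows):
    """Share of rows where 'x >= threshold' matches the actual outcome."""
    correct = sum((x >= threshold) == bought for x, bought in rows)
    return correct / len(rows)


# Hypothetical rows: (visits per month, bought the product)
rows = [(1, False), (8, True), (2, False), (9, True),
        (3, False), (7, True), (6, False), (4, True)]

train, test = split(rows)
threshold = train_threshold(train)
print(threshold, accuracy(threshold, train), accuracy(threshold, test))
```

Accuracy on the testing rows is the figure to compare across candidate models, since the training rows were already seen while the model was built.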

Phase VI: Deploying and Updating Models

Deploying and Updating Models


Deploy the models that performed the best to a production environment. After the mining models exist in a production environment, various tasks can be performed, depending on one's needs. The following are some of the tasks you can perform:
- Use the models to create predictions, which you can then use to make business decisions.
- Create queries to retrieve statistics, rules, or formulas from the model.
- Embed data mining functionality directly into an application. You can include Analysis Management Objects (AMO), which contains a set of objects that your application can use to create, alter, process, and delete mining structures and mining models.
- Use Integration Services to create a package in which a mining model is used to intelligently separate incoming data into multiple tables. For example, if a database is continually updated with potential customers, you could use a mining model together with Integration Services to split the incoming data into customers who are likely to purchase a product and customers who are not.
- Create a report that lets users directly query against an existing mining model.
- Update the models after review and analysis. Any update requires that you reprocess the models.

Updating the models dynamically, as more data comes into the organization, and making constant changes to improve the effectiveness of the solution should be part of the deployment strategy.
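
The example of splitting incoming potential customers into likely and unlikely purchasers can be sketched in plain Python. This does not use Integration Services or AMO; `split_by_prediction`, `toy_model`, the `visits` field, and the 0.5 cut-off are all assumptions standing in for a real deployed model.

```python
def split_by_prediction(records, predict, cutoff=0.5):
    """Route each record to 'likely' or 'unlikely' based on a model's score."""
    likely, unlikely = [], []
    for record in records:
        target = likely if predict(record) >= cutoff else unlikely
        target.append(record)
    return likely, unlikely


def toy_model(record):
    """Hypothetical stand-in for a deployed model: a purchase probability."""
    return min(1.0, record["visits"] / 10)


# Hypothetical incoming potential customers
incoming = [
    {"name": "A", "visits": 8},
    {"name": "B", "visits": 2},
    {"name": "C", "visits": 5},
]

likely, unlikely = split_by_prediction(incoming, toy_model)
print([r["name"] for r in likely], [r["name"] for r in unlikely])
```

Passing the model in as a function keeps the routing code unchanged when the model is retrained or replaced.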


Data-Mining Tools
Some of the commercially and publicly available tools are:
- DataEngine
- AgentBase/Marketeer
- BusinessMiner
- CART
- Data Surveyor
- Data Mining Suite
- DataMind
- IBM Datajoiner
- Kensington 2000, etc.

For the latest tools and their performance, visit http://www.kdnuggets.com and http://www.knowledgestorm.com.

