03/17/2012
Contents
Data Mining Definition
Data Mining Process
Data Mining Process Steps
Data Mining Tools
Data Mining
Data mining is a process of discovering actionable information from large sets of data. It uses mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns cannot be discovered by traditional data exploration because the relationships are too complex or because there is too much data. These patterns and trends can be collected and defined as a data mining model.
Six steps:
1. Defining the Problem
2. Preparing Data
3. Exploring Data
4. Building Models
5. Exploring and Validating Models
6. Deploying and Updating Models
Advanced Database Management Systems
Explanation
Each step does not necessarily lead directly to the next step. Creating a data mining model is a dynamic and iterative process.
After exploring the data, it may be found that the data is insufficient to create the appropriate mining models, so more data has to be sought. After building several models, it may become clear that the models do not adequately answer the problem defined, so the problem must be redefined. The models may have to be updated after they have been deployed because more data has become available.
Each step in the process might need to be repeated many times in order to create a good model.
Defining the Problem
This step includes:
analyzing the business requirements and considering ways to provide an answer to the problem,
defining the scope of the problem,
defining the metrics by which the model will be evaluated, and
defining specific objectives for the data mining project.
The Tasks
What are you looking for? What types of relationships are you trying to find?
Does the problem you are trying to solve reflect the policies or processes of the business?
Do you want to make predictions from the data mining model, or just look for interesting patterns and associations?
Which attribute of the dataset do you want to try to predict?
How are the columns related? If there are multiple tables, how are the tables related?
How is the data distributed? Is the data seasonal?
Does the data accurately represent the processes of the business?
To answer these questions, a data availability study has to be conducted to investigate the needs of the business users with regard to the available data. If the data does not support the needs of the users, the project might have to be redefined.
Preparing Data
Data preparation finds hidden correlations in the data, identifies the sources of data that are the most accurate, and determines which columns are the most appropriate for use in analysis.
For example, should you use the shipping date or the order date? Is the best sales influencer the quantity, the total price, or a discounted price?
Therefore, before starting to build mining models, these problems should be identified and a way to fix them determined.
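One simple way to compare candidate columns such as quantity and total price is to rank them by how strongly each correlates with the outcome of interest. The slides do not name a method, so the sketch below is only an illustration that assumes plain Pearson correlation; the column names, the sample values, and the helpers `pearson` and `rank_columns` are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant column has no defined correlation")
    return cov / math.sqrt(var_x * var_y)

def rank_columns(candidates, target):
    """Return (column name, correlation) pairs, strongest relationship first."""
    scores = {name: pearson(col, target) for name, col in candidates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical sample data, invented for illustration only.
candidates = {
    "quantity": [1, 2, 3, 4, 5],
    "total_price": [10, 9, 12, 8, 11],
}
target = [2, 4, 6, 8, 10]  # hypothetical outcome column

ranking = rank_columns(candidates, target)
print(ranking)
```

A high correlation only suggests that a column is worth a closer look; it does not show that the column causes the outcome.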
Exploring Data
Exploration techniques include calculating the minimum and maximum values, calculating the mean and standard deviation, and looking at the distribution of the data.
For example, reviewing the maximum, minimum, and mean values may show that the data is not representative of customers or business processes, so more balanced data must be obtained. Standard deviations and other distribution values can provide useful information about the stability and accuracy of the results. A large standard deviation can indicate that adding more data might help improve the model.
Exploring the data gives a better understanding of the business problem and helps in deciding whether the dataset contains flawed data. A strategy for fixing the problems can then be devised, and a deeper understanding gained of the behaviors that are typical of the business.
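The exploration techniques named above can be sketched with Python's standard `statistics` module. This is a minimal illustration, not part of any particular tool; the `summarize` helper and the sample order totals are hypothetical.

```python
import statistics

def summarize(values):
    """Basic exploration figures for one numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        # Sample standard deviation (divides by n - 1).
        "stdev": statistics.stdev(values),
    }

# Hypothetical order totals, invented for illustration only.
order_totals = [20, 22, 19, 21, 18]
summary = summarize(order_totals)
print(summary)
```

Comparing these figures with what is known about the business is what reveals unrepresentative data; the numbers alone do not.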
Building Models
A mining structure is created to define the data explored in the previous phase. It defines the source of data but does not contain any data until it is processed. Processing a model is called training: specific mathematical algorithms are applied to the data in the structure to extract patterns. The patterns found in the training process depend on the selection of training data, the algorithm chosen, and how the algorithm has been configured.
Whenever the data changes, both the mining structure and the mining model must be updated. When a mining structure is updated by reprocessing it, data is retrieved from the source, including any new data, and the mining structure is repopulated. The mining models are then retrained on the new data.
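The relationship between a structure, processing, and retraining can be illustrated with a toy sketch. It is not the Analysis Services API: the class names `MiningStructure` and `MajorityByFeatureModel`, the in-memory data source, and the trivial "most common label" algorithm are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

class MiningStructure:
    """Describes where the data comes from; holds no rows until processed."""

    def __init__(self, load_rows):
        self.load_rows = load_rows  # callable returning (feature, label) pairs
        self.rows = []

    def process(self):
        # Re-read the source, picking up any rows added since the last run.
        self.rows = list(self.load_rows())

class MajorityByFeatureModel:
    """Toy model: for each feature value, predict the most common label seen."""

    def __init__(self):
        self.table = {}

    def train(self, rows):
        counts = defaultdict(Counter)
        for feature, label in rows:
            counts[feature][label] += 1
        self.table = {f: c.most_common(1)[0][0] for f, c in counts.items()}

    def predict(self, feature):
        # Returns None for a feature value never seen during training.
        return self.table.get(feature)

# Hypothetical source data, invented for illustration only.
source = [("student", "no"), ("student", "no"), ("employed", "yes")]
structure = MiningStructure(lambda: source)
model = MajorityByFeatureModel()

structure.process()          # the structure is populated from the source
model.train(structure.rows)  # training extracts the patterns

source.append(("retired", "yes"))  # new data arrives at the source
structure.process()                # reprocess the structure
model.train(structure.rows)        # retrain the model on the new data
```

Until the structure is reprocessed and the model retrained, the model knows nothing about the new "retired" rows, which is why both must be updated when the data changes.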
Validating Models
Before a model is deployed into a production environment, it is tested to see how well it performs. All the models created with different configurations are tested to see which yields the best results for the specified problem and data. Analysis Services provides tools that help to separate data into training and testing datasets so that one can accurately assess the performance of all models on the same data. The training dataset is used to build the model, and the testing dataset is used to test the accuracy of the model by creating prediction queries.
What if none of the models created in the Building Models step performs well? Return to a previous step in the process and redefine the problem or reinvestigate the data in the original dataset.
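The idea of building several models on one training set and scoring them all on the same testing set can be sketched in plain Python. This is only an illustration under stated assumptions, not how Analysis Services performs the split: the helper names, the 30% test fraction, the two toy trainers, and the sample rows are all hypothetical.

```python
import random

def split_rows(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and split it into (training, testing) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_majority(train_rows):
    """Toy model: always predict the most common training label."""
    labels = [label for _, label in train_rows]
    majority = max(set(labels), key=labels.count)
    return lambda feature: majority

def train_threshold(train_rows):
    """Toy model: cut halfway between the mean feature of each class."""
    pos = [f for f, label in train_rows if label]
    neg = [f for f, label in train_rows if not label]
    if not pos or not neg:
        raise ValueError("training data must contain both classes")
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda feature: feature >= cut

def accuracy(predict, test_rows):
    """Fraction of test rows whose label the model predicts correctly."""
    correct = sum(1 for feature, label in test_rows if predict(feature) == label)
    return correct / len(test_rows)

def compare_models(trainers, rows):
    """Train every candidate on the same training set, score on the same test set."""
    train_rows, test_rows = split_rows(rows)
    scores = {name: accuracy(train(train_rows), test_rows)
              for name, train in trainers.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical rows: (feature, label), invented for illustration only.
rows = [(x, x >= 5) for x in range(10)]
best, scores = compare_models(
    {"majority": train_majority, "threshold": train_threshold}, rows
)
print(best, scores)
```

If every score is poor, that is the signal described above to go back and redefine the problem or reinvestigate the data.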
Data-Mining Tools
Some of the commercially and publicly available tools are: DataEngine, AgentBase/Marketeer, BusinessMiner, CART, Data Surveyor, Data Mining Suite, DataMind, IBM Datajoiner, Kensington 2000, etc.
For the latest tools and their performance, visit http://www.kdnuggets.com and http://www.knowledgestorm.com.