
DCS 008 Data Mining and Data Warehousing Unit I

Structure of the Unit
1.1 Introduction
1.2 Learning Objectives
1.3 Data mining concepts
    1.3.1 An overview
    1.3.2 Data mining Tasks
    1.3.3 Data mining Process
1.4 Information and production factor
1.5 Data mining vs Query tools
1.6 Data Mining in Marketing
1.7 Self learning Computer System
1.8 Concept Learning
1.9 Data Learning
1.10 Data mining and Data Warehousing
1.11 Summary
1.12 Exercises

1.1 Introduction

As a student who knows the basics of computers and data, you will be aware that the modern world is surrounded by various types of data (numbers, images, video, sound); simply put, the whole world is data driven. As the years pass, the size of this data has grown very large. The volume of old and past data has become enormous and is considered a waste by most of its owners. This has happened in all areas: supermarket transaction data, credit card processing details, telephone calls dialed/received, ration card details, election/voter details and so on. In the spirit of "waste to wealth", these data can be analyzed and arranged to yield vital information, to answer important decision-making questions, and to suggest beneficial courses of action. To extract such information, answers and directions from data available in large volumes, statistical and other concepts are used. One of the major disciplines used for this today is known as DATA MINING. Just as you mine land for treasure, you have to mine large data to find the precious information that lies within it (such as relationships and patterns).

1.2 Learning Objectives

- To understand the necessity of analyzing and processing complex, large, information-rich data sets
- To introduce the initial concepts related to data mining

1.3 Data mining concepts

1.3.1 An overview

Data is growing at a phenomenal rate, and users expect more and more sophisticated information from it. How do you get that? You have to uncover the hidden information in the large data, and to do that, data mining is used. You may be familiar with common queries used to explore the information in a database. But how do data mining queries differ from these? See the following examples and you will understand the difference.

Examples of database queries:
- Find all credit applicants with a last name of Smith.
- Identify customers who have purchased more than $10,000 in the last month.
- Find all customers who have purchased milk.

Examples of data mining queries:
- Find all credit applicants who are poor credit risks. (classification)
- Identify customers with similar buying habits. (clustering)
- Find all items which are frequently purchased with milk. (association rules)

So, in short, DATA MINING can be defined as follows. Data mining is an analytic process designed to explore data (usually large amounts of data, typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then to summarize the data in novel ways that are understandable and useful (the hidden information), and to validate the findings by applying the detected patterns to new subsets of data. The concept of data mining is becoming increasingly popular as a business information management tool, where it is expected to reveal knowledge structures that can guide decisions under conditions of limited certainty. Recently, there has been increased interest in developing new analytic techniques specifically designed to address the issues relevant to business data mining (e.g., classification trees).
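To make the contrast concrete, the following Python sketch (using pandas and scikit-learn, with invented column names such as 'amount' and 'visits') answers a database-style question with a simple filter and a data-mining-style question with a clustering algorithm. It is only an illustration of the difference in kind, not a complete application.

import pandas as pd
from sklearn.cluster import KMeans

# A toy purchases table; the column names and values are only illustrative.
purchases = pd.DataFrame({
    "customer": ["A", "A", "B", "B", "C", "C"],
    "amount":   [120.0, 80.0, 15.0, 22.0, 900.0, 1100.0],
    "visits":   [2, 3, 1, 1, 8, 9],
})

# Database-style query: a precise condition, answered directly from the data.
big_purchases = purchases[purchases["amount"] > 100]

# Data-mining-style task: "identify customers with similar buying habits".
# There is no exact condition to test; instead a clustering algorithm groups
# customers by their aggregate behaviour.
profile = purchases.groupby("customer").agg(total=("amount", "sum"),
                                            visits=("visits", "sum"))
profile["cluster"] = KMeans(n_clusters=2, n_init=10,
                            random_state=0).fit_predict(profile)

print(big_purchases)
print(profile)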

Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

1.3.2 Data mining Tasks

The basic data mining tasks can be defined as follows:
- Classification maps data into predefined groups or classes (supervised learning, pattern recognition, prediction).
- Regression is used to map a data item to a real-valued prediction variable.
- Clustering groups similar data together into clusters (unsupervised learning, segmentation, partitioning).
- Summarization maps data into subsets with associated simple descriptions (characterization, generalization).
- Link Analysis uncovers relationships among data (affinity analysis, association rules).
- Sequential Analysis determines sequential patterns, e.g. time series analysis. Example: stock market data can be analyzed to predict future values, determine similar patterns over time, and classify behavior.

1.3.3 Data mining Process

The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and one that has the most direct business applications. The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions). Stage 1: Exploration. This stage usually starts with data preparation which may involve cleaning data, data transformations, selecting subsets of records and - in case of data sets with large numbers of variables ("fields") - performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods which are being considered). Then, depending on the nature of

the analytic problem, this first stage of the process of data mining may involve anything from a simple choice of straightforward predictors for a regression model to elaborate exploratory analyses using a wide variety of graphical and statistical methods (see Exploratory Data Analysis (EDA)) in order to identify the most relevant variables and determine the complexity and/or the general nature of the models that can be taken into account in the next stage.

Stage 2: Model building and validation. This stage involves considering various models and choosing the best one based on their predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but in fact it sometimes involves a very elaborate process. A variety of techniques have been developed to achieve that goal, many of which are based on so-called "competitive evaluation of models", that is, applying different models to the same data set and then comparing their performance to choose the best. These techniques, which are often considered the core of predictive data mining, include Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.

Stage 3: Deployment. This final stage involves taking the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.

1.4 Information and production factor

Information / knowledge can behave as a factor of production. According to elementary economics texts, the raw material for any productive activity can be put in one of three categories: land (raw materials in general), labor, and capital. Some economists mention entrepreneurship as a fourth factor, but none talk about knowledge. This is strange, since know-how is the key determinant of the most important kind of output: increased production. Still, it's not that strange, since knowledge has unusual properties: there is no metric for it, and one can't calculate a monetary rate for it (cf. $/acre for land).

1.4.1 An example from agriculture

Imagine that you are a crop farmer. Your inputs are land and other raw materials like fertilizer and seed; your labor in planting, cultivating and harvesting the crop; and money you've borrowed from the bank to pay for your tractor. You can increase output by increasing any of these factors: cultivating more land, working more hours, or borrowing money to buy a better tractor or better seed. However, you can also increase output through know-how. For example, you might discover that your land is better suited to one kind of corn rather than another. You could make a more substantial improvement in output if you changed your practices, for example by implementing crop rotation. Farmers in Europe had practiced a three-year

rotation since the Middle Ages: rye or winter wheat, followed by spring oats or barley, then letting the soil rest (fallow) during the third stage. Four-field rotation (wheat, barley, turnips, and clover; no fallow) was a key development in the British Agricultural Revolution in the 18th century. This system removed the need for a fallow period and allowed livestock to be bred year-round. (I suspect that if four-crop rotation had been invented now, it would be eligible for a business process patent.) Most of the increases in our material well-being have come about through innovation, that is, the application of knowledge. How is it, then, that knowledge as a factor of production gets such cursory treatment in traditional economics?

1.4.2 Measuring Knowledge

A key difficulty is that knowledge is easy to describe but very hard to measure. One can talk about uses of knowledge, but I have so far found no simple metric. It is even hard to measure information content. There are many different perspectives, such as library science (e.g., a user-centered measure of information), information theory (measuring data channel capacity), and algorithmic complexity (e.g., Kolmogorov complexity). All give different results. One can always, of course, argue that money is the ultimate metric: the knowledge value of something is what someone will pay for it. However, this is true for anything, including all the other factors of production. The difference is that land, labor and capital all have an underlying objective measure. One cannot calculate a $/something rate for knowledge in the way one can for the other three. Let's say land is measured in acres, labor in hours, and money in dollars. You'll pay me so much per acre of land, so much per hour of labor, and so many cents of interest per dollar I loan you. Land in different locations, labor of different kinds, and loans of different risks will earn different payment rates. Knowledge does have some value when it is sold, e.g. when a patent is licensed or when a list of customer names is valued on a balance sheet. However, there is no rate, no $/something, for the knowledge purchased. That suggests that the underlying concept is indefinite. It is perhaps so indefinite that we are fooling ourselves by even imagining that it exists.

1.5 Data mining vs Query tools

There are various tools available commercially for data mining. Users can use them to carry out data mining and obtain the required results and models. Some of them are given below for your reference.

1.5.1 Clementine

SPSS Clementine, the premier data mining workbench, allows experts in business processes, data, and modeling to collaborate in exploring data and building models. It also supports the proven, industry-standard CRISP-DM methodology, which enables predictive insights to be developed consistently and repeatably. No wonder that organizations from FORTUNE 500 companies to government agencies and academic institutions point to Clementine as a critical factor in their success.

1.5.2 CART

CART is a robust data mining tool that automatically searches for important patterns and relationships in large data sets and quickly uncovers hidden structures even in highly complex data sets. It works on the Windows, Mac and Unix platforms.

1.5.3 Web Information Extractor

Web Information Extractor is a powerful tool for web data mining, content extraction and content update monitoring. It can extract structured or unstructured data (including text, pictures and other files) from web pages, reformat it into local files or save it to a database, or post it to a web server. There is no need to define complex template rules; just browse to the web page you are interested in, click what you want to define the extraction task, and run it as you wish.

1.5.4 The Query Tool

The Query Tool is a powerful data mining application. It allows you to perform data analysis on any SQL database and was developed predominantly for the non-technical user; no knowledge of SQL is required. New features include: Query Builder, to quickly and simply build powerful queries; Summary, to summarise any two columns against an aggregate function (MIN, AVG, etc.) of any numerical column; and Query Editor, so that you can now create your own scripts.

1.6 Data Mining in Marketing

1.6.1 Marketing Optimization

If you are the owner of a business, you should already be aware that there are multiple techniques you can use to market to your customers: the internet, direct mail, and telemarketing. While using these techniques can help your business succeed, there is even more you can do to tip the odds in your favor. You will want to become familiar with a technique called marketing optimization. This is a technique that is intricately connected to data mining. With marketing optimization, you will take a group of offers and customers, and after reviewing the limits of the campaign, you will use data mining to decide which

marketing offers should be made to specific customers. Market optimization is a powerful tool that will take your marketing to the next level. Instead of mass marketing a product to a broad group of people that may not respond to it, you can take a group of marketing strategies and market them to different people based on patterns and relationships. The first step in marketing optimization is to create a group of marketing offers. Each offer will be created separately from the others, and each one of them will have their own financial attributes. An example of this would be the cost required to run each campaign. Each offer will have a model connected to it that will make a prediction based on the customer information that is presented to it. The prediction could come in the form of a score. The score could be defined by the probability of a customer purchasing a product. The models will be created by data mining tools. These models can be added to your marketing strategy. After you have set up your offers, you will next want to look at the purchasing habits of the customers you already have. Your goal is to analyze each offer you're making and optimize it in a way that will allow you to bring in the largest profits.
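In code, the optimization step amounts to choosing, for each customer, the offer with the highest expected return given the model's score. The sketch below assumes hypothetical response probabilities produced by some scoring model, together with invented margins and contact costs; it only shows the shape of the calculation, not any particular tool.

# Hypothetical scores from a response model: probability that each customer
# accepts each offer. Margins and costs are illustrative numbers only.
scores = {
    "alice": {"savings": 0.30, "credit_card": 0.05},
    "bob":   {"savings": 0.08, "credit_card": 0.20},
}
margin = {"savings": 50.0, "credit_card": 120.0}   # profit if the offer is accepted
cost = {"savings": 2.0, "credit_card": 5.0}        # cost of making the offer

def best_offer(customer_scores):
    """Pick the offer with the highest expected profit for one customer."""
    expected = {offer: p * margin[offer] - cost[offer]
                for offer, p in customer_scores.items()}
    return max(expected, key=expected.get), expected

for customer, s in scores.items():
    offer, expected = best_offer(s)
    print(customer, "->", offer, expected)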

1.6.2 Illustration with an example

To illustrate marketing optimization with data mining, let me use an example. Suppose you were the marketing director for a financial institution such as a bank. You have a number of products which you offer to your customers: CDs, credit cards, gold credit cards, and a savings account. Despite the fact that your company offers these four products, it is your job to market checking accounts and savings accounts. Your goal is to figure out which customers will be interested in savings accounts compared to checking accounts. After thinking about how you can successfully market your products to your customers, you have come up with two possible strategies that you will present to your manager. The first possible strategy is to market to customers who would like to save money for their children so they can attend college when they turn 18 years old. The second strategy is to market to students who are already attending college. Now that you have two offers you're interested in marketing, you will next want to study the data you have obtained. In this example, you work for a large company that has a data warehouse. You look at the customer data over the last few years to make a marketing decision. Your company uses a data mining tool that will predict the chances of people signing up for your products. You will want to create certain mathematical models that will allow you to predict the possible responses. In this example, you are targeting young parents who may be looking to save money for their children, and you are targeting young people who are already in college.

Computer algorithms will be able to look at the history of customer transactions to determine the chances of success for your marketing campaign. In this example, the best way to find out whether young parents and college students will be interested in your offer is by looking at the historical response rate. If the historical response rate is only 10%, it is likely that it will remain about the same for your new marketing strategy. However, historical response rates are only a simple guide; to be more precise, you will want to use more sophisticated data mining strategies. By this time, you will have realized how data mining concepts are used in marketing and in optimizing it.

1.7 Self learning Computer System

A Self-learning Computer System, also known as a knowledge-based system or an expert system, is a computer program that contains the knowledge and analytical skills of one or more human experts, related to a specific subject. This class of program was first developed by researchers in artificial intelligence during the 1960s and 1970s and applied commercially throughout the 1980s. A self-learning computer system is a software system that incorporates concepts derived from experts in a field and uses their knowledge to provide problem analysis to users of the software. The most common form of self-learning computer system is a computer program, with a set of rules, that analyzes information (usually supplied by the user of the system) about a specific class of problems, and recommends one or more courses of user action. The expert system may also provide mathematical analysis of the problem(s). The expert system utilizes what appear to be reasoning capabilities to reach conclusions. A related term is wizard. A wizard is an interactive computer program that helps a user solve a problem. Originally the term wizard was used for programs that construct a database search query based on criteria supplied by the user. However, some rule-based expert systems are also called wizards. Other "wizards" are a sequence of online forms that guide users through a series of choices, such as the ones which manage the installation of new software on computers, and these are not expert systems. In other words, a self-learning computer system, or expert system, is a computer program that simulates the judgement and behavior of a human or an organization that has expert knowledge and experience in a particular field. Typically, such a system contains a knowledge base containing accumulated experience and a set of rules for applying the knowledge base to each particular situation that is described to the program.

Sophisticated expert systems can be enhanced with additions to the knowledge base or to the set of rules. Among the best-known expert systems have been those that play chess and that assist in medical diagnosis.
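As a minimal sketch of the rule-based idea (not of any particular commercial expert system), the following Python fragment encodes two invented rules and applies them to facts supplied by a user; a real system would have a much larger knowledge base and an inference engine.

# Each rule is a (condition, conclusion) pair; conditions test the known facts.
rules = [
    (lambda f: f.get("fever") and f.get("cough"),
     "possible flu - suggest rest and fluids"),
    (lambda f: f.get("credit_history") == "poor",
     "classify applicant as a poor credit risk"),
]

def consult(facts):
    """Return the conclusions of every rule whose condition holds for the facts."""
    return [conclusion for condition, conclusion in rules if condition(facts)]

print(consult({"fever": True, "cough": True}))
print(consult({"credit_history": "poor"}))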

1.8 Concept Learning

1.8.1 Analyzing Concepts

Concepts are categories of stimuli that have certain features in common.

The shapes shown above are all members of a conceptual category: rectangle. Their common features are (1) four lines; (2) opposite lines parallel; (3) lines connected at their ends; (4) lines forming four right angles. The fact that they are different colors and sizes and have different orientations is irrelevant; color, size, and orientation are not defining features of the concept. If a stimulus is a member of a specified conceptual category, it is referred to as a positive instance. If it is not a member, it is referred to as a negative instance. These are all negative instances of the rectangle concept: as rectangles are defined, a stimulus is a negative instance if it lacks any one of the specified features.

Every concept has two components:
- Attributes: the features of a stimulus that one must look for to decide whether that stimulus is a positive instance of the concept.
- A rule: a statement that specifies which attributes must be present or absent for a stimulus to qualify as a positive instance of the concept.


For rectangles, the attributes would be the four features discussed earlier, and the rule would be that all the attributes must be present. The simplest rules refer to the presence or absence of a single attribute.

For example, a vertebrate animal is defined as an animal with a backbone. Which of these stimuli are positive instances?

This rule is called affirmation. It says that a stimulus must possess a single specified attribute to qualify as a positive instance of a concept. The opposite or complement of affirmation is negation. To qualify as a positive instance, a stimulus must lack a single specified attribute.


An invertebrate animal is one that lacks a backbone. These are the positive and negative instances when the negation rule is applied. More complex conceptual rules involve two or more specified attributes. For example, the conjunction rule states that a stimulus must possess two or more specified attributes to qualify as a positive instance of the concept. This was the rule used earlier to define the concept of a rectangle.
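These rules can be stated very compactly in code. The sketch below, with invented attribute names, tests a stimulus (represented as the set of attributes it possesses) against an affirmation rule, a negation rule, and a conjunction rule such as the one used for the rectangle concept.

# A stimulus is modelled simply as the set of attributes it possesses.
def affirmation(stimulus, attribute):
    # positive instance if the single specified attribute is present
    return attribute in stimulus

def negation(stimulus, attribute):
    # positive instance if the single specified attribute is absent
    return attribute not in stimulus

def conjunction(stimulus, attributes):
    # positive instance only if every listed attribute is present
    return all(a in stimulus for a in attributes)

rectangle_attributes = {"four lines", "opposite lines parallel",
                        "lines connected at ends", "four right angles"}

shape = {"four lines", "opposite lines parallel", "lines connected at ends",
         "four right angles", "red", "large"}
print(conjunction(shape, rectangle_attributes))   # True: colour and size are irrelevant
print(affirmation({"backbone"}, "backbone"))      # vertebrate -> True
print(negation({"backbone"}, "backbone"))         # invertebrate -> False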

1.8.2 Behavioral Processes


In behavioral terms, when a concept is learned, two processes control how we respond to a stimulus:
- Generalization: we generalize a certain response (like the name of an object) to all members of the conceptual class based on their common attributes.
- Discrimination: we discriminate between stimuli which belong to the conceptual class and those that don't, because the latter lack one or more of the defining attributes.

For example, we generalize the word rectangle to those stimuli that possess the defining attributes...

(Figure: three different rectangles, each labelled "Rectangle".)

...and discriminate between these stimuli and others that are outside the conceptual class, in which case we respond with a different word:

1.9 Data learning

Learning from the given data can be done in many ways. The data can be arranged in a particular format to learn from it. The following are some examples:

(i) A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data. The software programs provide mechanisms for the definition of database structures; for data storage; for concurrent, shared, or distributed data access; and for ensuring the consistency and security of the information stored, despite system crashes or attempts at unauthorized access.


(ii) A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.

(iii) A semantic data model, such as an entity-relationship (ER) data model, is often constructed for relational databases. An ER data model represents the database as a set of entities and their relationships.

(iv) A data warehouse, which is a repository of information collected from multiple sources, stored under a unified schema.

(v) A data mart, a subset of a data warehouse that focuses on selected subjects.

1.10 Data mining and Data Warehousing

A data warehouse is an integrated and consolidated collection of data. It can be defined as a repository of purposely selected and adopted operational data which can successfully answer ad hoc, complex, analytical and statistical queries. Time-dependent data will be present in a data warehouse. Data warehousing can be defined as the process of organizing the storage of large, multivariate data sets in a way that facilitates the retrieval of information for analytic purposes.

1.10.1 Functional requirements of a Data Warehouse

A data warehouse provides the needed support for all the informational applications of a company. It must support various types of applications, each of which has its own requirements in terms of data and the way the data are modeled and used. A data warehouse must support:
1. Decision support processing
2. Informational applications
3. Model building
4. Consolidation

The data in the warehouse is processed to support the decisions that have to be taken at crucial times in the business. Certain information present in the data warehouse is derived as the need arises. Modeling of the data can also be done by exploring the data in the data warehouse, and consolidation of the data / information can be carried out through various tools in a data warehouse. Data in a data warehouse must therefore be organized such that it can be analyzed or explored along different contextual dimensions.


(Figure 1.1: Data sources, users, and informational applications for a data warehouse. Corporate data, offline data and external data, both structured and unstructured, flow into the data warehouse environment; the data warehouse serves users such as CEOs, executives and external users.)

Fig 1.1 shows the many sources and different types of users of a data warehouse. There can be many sources from which a data warehouse gets its data (corporate, external, offline, etc.). In a warehouse the data can be structured and unstructured (like large text objects, pictures, audio, video, etc.). The people who use the data warehouse data can be executives, administrative officials, operational end users, external users, and data and business analysts. Applications such as decision support processing and extended data warehouse applications can be run on the data in a warehouse.

1.10.2 Data warehousing

Data warehousing is essentially what you need to do in order to create a data warehouse, and what you do with it. It is the process of creating, populating, and then querying a data warehouse, and it can involve a number of discrete technologies. In a dimensional model, the context of the measurements is represented in dimension tables. You can also think of the context of a measurement as its characteristics, such as the who, what, where, when and how of the measurement (the subject). For the business process Sales, the characteristics of the 'monthly sales number' measurement can be a Location (where), Time (when), and Product Sold (what).
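As a purely illustrative sketch of this idea, the small pandas tables below model the Sales example with one fact table holding the monthly sales measurements and three dimension tables for Location, Time and Product (the fact table is defined formally at the end of this section); the column names and values are assumptions, not a prescribed design.

import pandas as pd

# Dimension tables: the context (where / when / what) of the measurement.
location = pd.DataFrame({"location_id": [1, 2],
                         "state": ["TN", "KA"],
                         "country": ["India", "India"]})
time = pd.DataFrame({"time_id": [1, 2], "month": ["2024-01", "2024-02"]})
product = pd.DataFrame({"product_id": [1, 2], "name": ["Milk", "Bread"]})

# Fact table: one row per (location, time, product) with the measured value.
sales_fact = pd.DataFrame({
    "location_id": [1, 1, 2],
    "time_id":     [1, 2, 1],
    "product_id":  [1, 2, 1],
    "monthly_sales": [1200.0, 800.0, 950.0]})

# A typical warehouse query: join facts to dimensions and aggregate.
report = (sales_fact.merge(location, on="location_id")
                    .merge(product, on="product_id")
                    .groupby(["state", "name"])["monthly_sales"].sum())
print(report)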


The dimension attributes are the various columns in a dimension table. In the Location dimension, the attributes can be Location Code, State, Country, and Zip Code. Generally the dimension attributes are used in report labels and in query constraints such as where Country='USA'. The dimension attributes also contain one or more hierarchical relationships.

Before designing your data warehouse, you need to decide what the data warehouse will contain. Say you want to build a data warehouse containing monthly sales numbers across multiple store locations, across time and across products; then your dimensions are:
- Location
- Time
- Product

Each dimension table contains data for one dimension. In the above example you take all your store location information and put it into one single table called Location. Your store location data may be spread across multiple tables in your OLTP system (unlike OLAP), but you need to de-normalize all that data into one single table.

Dimensional modeling is the design concept used by many data warehouse designers to build their data warehouses. The dimensional model is the underlying data model used by many of the commercial OLAP products available in the market today. In this model, all data is contained in two types of tables, called the Fact Table and the Dimension Table.

1.11 Summary

In this unit you have learnt about the basic concepts involved in data mining. You should now have the idea that data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. The role of information as a factor of production has also been explained to you, and you have had an overview of the various tools used in mining, like CART, Clementine, etc. Marketing can be done in a powerful way by using data mining results; marketing people are excited to use these facilities, and you will have understood this from the example given in the unit. The learning concepts and details about self-learning or expert systems have also been explained, and learning from data has been described briefly. Lastly, the necessity of the data warehouse and its usage in various respects has been explained.

1.12 Exercises

1. What is data mining? In your answer, address the following: (a) Is it another hype? (b) Is it a simple transformation of technology developed from databases, statistics, and machine learning?
2. Explain how information behaves as a factor of production. Illustrate with an example of your own (not given in the book).
3. Give brief notes on the various mining tools known to you.
4. What do you mean by data mining in marketing? Explain with a suitable example.
5. What is a concept? How can one learn a concept? Explain the components of a concept with examples.
6. In what ways can one learn from data?
7. Explain the concepts of a data warehouse.


Unit II

Structure of the Unit
2.1 Introduction
2.2 Learning Objectives
2.3 Knowledge discovery process
    2.3.1 Data Selection
    2.3.2 Data Cleaning
    2.3.3 Data Enrichment
2.4 Preliminary Analysis of Data using traditional query tools
2.5 Visualization techniques
2.6 OLAP Tools
2.7 Decision trees
2.8 Association Rules
2.9 Neural Networks
2.10 Genetic Algorithms
2.11 KDD in Databases
2.12 Summary
2.13 Exercises


2.1 Introduction

There are various processes and tools involved in data mining. To get knowledge from large databases, one of the processes used is KDD (Knowledge Discovery in Databases). There are also processes like data cleaning, data selection and data enrichment to prepare the data for mining and to obtain results from it. There are methods in data mining which can be used in various businesses and fields and which can give useful and suitable solutions to various problems; they include decision trees, association rules, neural networks, genetic algorithms, etc. To visualize the results and the data there are techniques called visualization techniques, through which one can view the various effects on a situation and easily understand the results. The data in a large database can be analyzed through various traditional queries to get suitable information and knowledge.

2.2 Learning Objectives

- To know the concepts of the knowledge discovery process used in mining large databases, and to understand the processes of data selection, data cleaning and data enrichment within KDD
- To know about the visualization techniques used in data mining and the various methods involved in the mining process, such as decision trees and association rules

2.3 Knowledge discovery process

An overview

Why Do We Need KDD?


The traditional method of turning data into knowledge relies on manual analysis and interpretation. For example, in the health-care industry, it is common for specialists to periodically analyze current trends and changes in health-care data, say, on a quarterly basis. The specialists then provide a report detailing the analysis to the sponsoring health-care organization; this report becomes the basis for future decision making and planning for health-care management. In a totally different type of application, planetary geologists sift through remotely sensed images of planets and asteroids, carefully locating and cataloging such geologic objects of interest as impact craters. Be it science, marketing, finance, health care, retail, or any other field, the classical approach to data analysis relies fundamentally on one or more analysts becoming intimately familiar with the data and serving as an interface between the data and the users and products.

For these (and many other) applications, this form of manual probing of a data set is slow, expensive, and highly subjective. In fact, as data volumes grow dramatically, this type of manual data analysis is becoming completely impractical in many domains. Databases are increasing in size in two ways: (1) the number N of records or objects in the database and (2) the number d of fields or attributes of an object. Databases containing on the order of N = 10^9 objects are becoming increasingly common, for example, in the astronomical sciences. Similarly, the number of fields d can easily be on the order of 10^2 or even 10^3, for example, in medical diagnostic applications. Who could be expected to digest millions of records, each having tens or hundreds of fields? We believe that this job is certainly not one for humans; hence, analysis work needs to be automated, at least partially.

The need to scale up human analysis capabilities to handle the large number of bytes that we can collect is both economic and scientific. Businesses use data to gain competitive advantage, increase efficiency, and provide more valuable services to customers. Data we capture about our environment are the basic evidence we use to build theories and models of the universe we live in. Because computers have enabled humans to gather more data than we can digest, it is only natural to turn to computational techniques to help us unearth meaningful patterns and structures from the massive volumes of data. Hence, KDD is an attempt to address a problem that the digital information era has made a fact of life for all of us: data overload.

Data Mining and Knowledge Discovery in the Real World

A large degree of the current interest in KDD is the result of the media interest surrounding successful KDD applications. In science, one of the primary application areas is astronomy. Here, a notable success was achieved by SKICAT, a system used by astronomers to perform image analysis, classification, and cataloging of sky objects from sky-survey images (Fayyad, Djorgovski, and Weir 1996). In its first application, the system was used to process the 3 terabytes (10^12 bytes) of image data resulting from the Second Palomar Observatory Sky Survey, where it is estimated that on the order of 10^9 sky objects are detectable. SKICAT can outperform humans and traditional computational techniques in classifying faint sky objects. See Fayyad, Haussler, and Stolorz (1996) for a survey of scientific applications. In business, the main KDD application areas include marketing, finance (especially investment), fraud detection, manufacturing, telecommunications, and Internet agents.


The Interdisciplinary Nature of KDD KDD has evolved, and continues to evolve, from the intersection of research fields such as machine learning, pattern recognition, databases, statistics, AI, knowledge acquisition for expert systems, data visualization, and high-performance computing. The unifying goal is extracting high-level knowledge from low-level data in the context of large data sets.

Knowledge discovery is the non-trivial extraction of implicit, previously unknown, and potentially useful information from databases. Both the number and the size of databases are rapidly growing because of the large amount of data obtained from satellite images, X-ray crystallography and other scientific equipment. This growth far exceeds human capacities to analyze the databases in order to find the implicit regularities, rules or clusters hidden in the data. Therefore, knowledge discovery is becoming more and more important in databases. Typical tasks for knowledge discovery are the identification of classes (clustering), the prediction of new, unknown objects (classification), and the discovery of associations or deviations in spatial databases. The term 'visual data mining' refers to the emphasis on integrating the user into the knowledge discovery process. Since these are challenging tasks, knowledge discovery algorithms should be incremental, i.e. when the database is updated the algorithm does not have to be applied again to the whole database.

KDD (Knowledge Discovery in Databases, or Knowledge Discovery and Data Mining) is a recent term related to data mining and involves sorting through huge quantities of data to pick out useful and relevant information. The basic steps in the knowledge discovery process are:

- Data selection
- Data cleaning/cleansing
- Data enrichment
- Data mining
- Pattern evaluation
- Knowledge presentation
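These steps can be pictured as a simple pipeline in which the output of one stage feeds the next. The Python skeleton below uses placeholder functions only, with invented record fields; the real content of each step is discussed in the following sections.

# Skeleton of the knowledge discovery process; every function is a placeholder.
def select_data(source):            # keep only task-relevant records
    return [r for r in source if r.get("relevant", True)]

def clean_data(records):            # drop records with missing values
    return [r for r in records if None not in r.values()]

def enrich_data(records, extra):    # add attributes from another source
    return [{**r, **extra.get(r["id"], {})} for r in records]

def mine(records):                  # stand-in for the actual mining algorithm
    return {"pattern": "customers with high spend also buy product X"}

def evaluate_and_present(patterns):
    print("Discovered:", patterns)

source = [{"id": 1, "spend": 120, "relevant": True},
          {"id": 2, "spend": None, "relevant": True}]
extra = {1: {"age": 34}}
evaluate_and_present(mine(enrich_data(clean_data(select_data(source)), extra)))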


(Figure: The knowledge discovery process, an overview. Data mining as a KDD process: databases are cleaned and integrated into a data warehouse; task-relevant data are selected; data mining, the core of knowledge discovery, is applied; and the resulting patterns are evaluated. Source: Han, Introduction to KDD.)

2.3.1 Data Selection

The selection of data for a KDD process has to be done as a first step. This is the selection of data relevant to the field of approach, in order to arrive at meaningful knowledge.

Identification of relevant data: In a large and vast data bank one has to select the relevant and necessary data / information that is important for the project / process to be carried out to obtain the targeted knowledge. For example, in a supermarket, if one wants to learn about the sales of milk products, then the transaction data relevant to sales of milk products has to be gathered and processed, and the other sales details are not necessary. But if the shopkeeper wants to know the overall performance, then every transaction becomes necessary for the process.

Representation of data: After choosing the relevant data, the data has to be represented in a suitable structure. The structure or format (such as a database, text, etc.) can be decided and the data represented in that format.
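In practice this kind of selection is often just a filter over the transaction data. The small pandas sketch below, with invented column names and categories, pulls out only the milk-product transactions from the supermarket example.

import pandas as pd

transactions = pd.DataFrame({
    "txn_id":   [1, 2, 3, 4],
    "category": ["milk products", "vegetables", "milk products", "bakery"],
    "amount":   [45.0, 30.0, 60.0, 25.0],
})

# Relevant data for a study of milk-product sales only.
milk_sales = transactions[transactions["category"] == "milk products"]

# For an overall performance study, every transaction stays relevant.
all_sales = transactions
print(milk_sales)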


2.3.2 Data Cleaning

Data cleaning is the act of detecting and correcting (or removing) corrupt or inaccurate attributes or records. Data cleaning is necessary because data in the real world is dirty:
- Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
- Noisy: containing errors or outliers
- Inconsistent: containing discrepancies in codes or names

No quality data, no quality mining results: quality decisions must be based on quality data, and a data warehouse needs consistent integration of quality data. Before proceeding to the further steps in the knowledge discovery process, data cleaning has to be done; it involves filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies, in order to get successful results. Organizations are therefore forced to think about a unified logical view of the wide variety of data and databases they possess; they have to address the issues of mapping data to a single naming convention, uniformly representing and handling missing data, and handling noise and errors where possible.

Typical data cleaning tasks include:
- Data acquisition and metadata
- Filling in missing values
- Unifying date formats
- Converting nominal values to numeric
- Identifying outliers and smoothing out noisy data
- Correcting inconsistent data

How to handle missing data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
- Fill in the missing value manually: tedious and often infeasible.
- Use a global constant to fill in the missing value, e.g. "unknown" or a new class.
- Imputation: use the attribute mean to fill in the missing value, or, smarter, use the attribute mean of all samples belonging to the same class.
- Use the most probable value to fill in the missing value: inference-based methods such as a Bayesian formula or a decision tree.
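A small pandas sketch of two of these strategies, mean imputation for a numeric attribute and most-probable-value imputation for a categorical one (the column names and values are invented):

import pandas as pd

df = pd.DataFrame({"income": [30.0, None, 50.0, 40.0],
                   "segment": ["a", "b", None, "b"]})

# Mean imputation for a numeric attribute.
df["income"] = df["income"].fillna(df["income"].mean())

# Most-probable-value imputation for a categorical attribute
# (here simply the most frequent value; a Bayesian or decision-tree
# model could be used instead, as noted above).
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])

# A class-conditional mean (the mean within the same segment) could be used
# via df.groupby("segment")["income"].transform(...) for the smarter variant.
print(df)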

Noisy Data

Noise is random error or variance in a measured variable. Incorrect attribute values may be due to:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions

Other data problems which require data cleaning include duplicate records, incomplete data, and inconsistent data.

How to handle noisy data?
- Binning: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, or bin boundaries.
- Clustering: detect and remove outliers.
- Combined computer and human inspection: detect suspicious values and have them checked by a human.
- Regression: smooth by fitting the data to regression functions.
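A small sketch of equi-depth binning with smoothing by bin means, assuming a plain list of numeric values:

import numpy as np

values = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float))

# Equi-depth binning: an (approximately) equal number of values per bin.
n_bins = 3
bins = np.array_split(values, n_bins)

# Smoothing by bin means: replace every value by the mean of its bin.
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)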

2.3.3 Data Enrichment

The represented data has to be enriched with various additional details, apart from the base details that have been gathered. The requirements for this enrichment can be:
- Behavioral: purchases from related businesses (e.g. Air Miles), number of vehicles, travel frequency
- Demographic: age, gender, marital status, children, income level
- Psychographic: risk taker, conservative, cultured, hi-tech averse, credit worthy, trustworthy
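Enrichment often amounts to joining the base records with additional behavioral, demographic or psychographic attributes obtained from other sources. A small pandas sketch with invented fields:

import pandas as pd

base = pd.DataFrame({"customer_id": [1, 2, 3],
                     "total_spend": [250.0, 90.0, 410.0]})

# Additional attributes from another source (e.g. a loyalty programme).
demographics = pd.DataFrame({"customer_id": [1, 2, 3],
                             "age": [34, 51, 28],
                             "income_level": ["high", "medium", "high"]})

enriched = base.merge(demographics, on="customer_id", how="left")
print(enriched)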


2.4 Preliminary Analysis of the Data set

The gathered data set can be analysed for various purposes before proceeding to the KDD process. One of the most common needs is statistical analysis.

Statistical Analysis

Mean and Confidence Interval. Probably the most often used descriptive statistic is the mean. The mean is a particularly informative measure of the "central tendency" of the variable if it is reported along with its confidence intervals. Usually we are interested in statistics (such as the mean) from our sample data set only to the extent to which they can infer information about the population. The confidence intervals for the mean give us a range of values around the mean where we expect the "true" (population) mean is located. For example, if the mean in your sample is 23, and the lower and upper limits of the p=.05 confidence interval are 19 and 27 respectively, then you can conclude that there is a 95% probability that the population mean is greater than 19 and lower than 27. If you set the p-level to a smaller value, then the interval would become wider, thereby increasing the "certainty" of the estimate, and vice versa; as we all know from the weather forecast, the more "vague" the prediction (i.e., the wider the confidence interval), the more likely it will materialize. Note that the width of the confidence interval depends on the sample size and on the variation of data values. The larger the sample size, the more reliable its mean. The larger the variation, the less reliable the mean. The calculation of confidence intervals is based on the assumption that the variable is normally distributed in the population. The estimate may not be valid if this assumption is not met, unless the sample size is large, say n=100 or more.

Shape of the Distribution, Normality. An important aspect of the "description" of a variable is the shape of its distribution, which tells you the frequency of values from different ranges of the variable. Typically, a researcher is interested in how well the distribution can be approximated by the normal distribution. Simple descriptive statistics can provide some information relevant to this issue. For example, if the skewness (which measures the deviation of the distribution from symmetry) is clearly different from 0, then that distribution is asymmetrical, while normal distributions are perfectly symmetrical. If the kurtosis (which measures the "peakedness" of the distribution) is clearly different from 0, then the distribution is either flatter or more peaked than normal; the kurtosis of the normal distribution is 0. More precise information can be obtained by performing one of the tests of normality to determine the probability that the sample came from a normally distributed population of observations (e.g., the so-called Kolmogorov-Smirnov test, or the Shapiro-Wilks' W test).
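These descriptive checks can be sketched with SciPy as follows; the sample values are invented, and the 95% confidence interval, skewness, kurtosis and Shapiro-Wilk test correspond to the quantities discussed above.

import numpy as np
from scipy import stats

sample = np.array([19.0, 21.5, 22.0, 23.0, 23.5, 24.0, 25.0, 26.5, 27.0, 28.0])

mean = sample.mean()
sem = stats.sem(sample)                      # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)

print("mean:", mean, "95% CI:", (ci_low, ci_high))
print("skewness:", stats.skew(sample))       # 0 for a perfectly symmetric distribution
print("kurtosis:", stats.kurtosis(sample))   # 0 for a normal distribution
print("Shapiro-Wilk:", stats.shapiro(sample))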


However, none of these tests can entirely substitute for a visual examination of the data using a histogram (i.e., a graph that shows the frequency distribution of a variable).

The graph allows you to evaluate the normality of the empirical distribution because it also shows the normal curve superimposed over the histogram. It also allows you to examine various aspects of the distribution qualitatively. For example, the distribution could be bimodal (have 2 peaks). This might suggest that the sample is not homogeneous but possibly its elements came from two different populations, each more or less normally distributed. In such cases, in order to understand the nature of the variable in question, you should look for a way to quantitatively identify the two sub-samples.

Correlations

Purpose (What is Correlation?). Correlation is a measure of the relation between two or more variables. The measurement scales used should be at least interval scales, but other correlation coefficients are available to handle other types of data. Correlation coefficients can range from -1.00 to +1.00. The value of -1.00 represents a perfect negative correlation while a value of +1.00 represents a perfect positive correlation. A value of 0.00 represents a lack of correlation.


The most widely used type of correlation coefficient is Pearson r, also called linear or product-moment correlation.

Simple Linear Correlation (Pearson r). Pearson correlation (hereafter called correlation) assumes that the two variables are measured on at least interval scales (see Elementary Concepts), and it determines the extent to which values of the two variables are "proportional" to each other. The value of the correlation (i.e., the correlation coefficient) does not depend on the specific measurement units used; for example, the correlation between height and weight will be identical regardless of whether inches and pounds, or centimeters and kilograms, are used as measurement units. Proportional means linearly related; that is, the correlation is high if it can be "summarized" by a straight line (sloped upwards or downwards).
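Computing Pearson r is a one-line call in SciPy. The height/weight values below are invented; note that the coefficient comes out the same whatever measurement units are used, as stated above.

import numpy as np
from scipy import stats

height_cm = np.array([150, 160, 165, 170, 175, 180, 185], dtype=float)
weight_kg = np.array([52, 58, 63, 68, 72, 80, 85], dtype=float)

r, p_value = stats.pearsonr(height_cm, weight_kg)
print("r:", r, "r^2:", r**2, "p:", p_value)

# The same r is obtained with other units (inches and pounds here),
# because the correlation does not depend on measurement units.
r_other, _ = stats.pearsonr(height_cm / 2.54, weight_kg * 2.2046)
print("r (inches vs pounds):", r_other)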


This line is called the regression line or least squares line, because it is determined such that the sum of the squared distances of all the data points from the line is the lowest possible. Note that the concept of squared distances will have important functional consequences on how the value of the correlation coefficient reacts to various specific arrangements of data (as we will see later).

How to Interpret the Values of Correlations. As mentioned before, the correlation coefficient (r) represents the linear relationship between two variables. If the correlation coefficient is squared, then the resulting value (r^2, the coefficient of determination) will represent the proportion of common variation in the two variables (i.e., the "strength" or "magnitude" of the relationship). In order to evaluate the correlation between variables, it is important to know this "magnitude" or "strength" as well as the significance of the correlation.

Significance of Correlations. The significance level calculated for each correlation is a primary source of information about the reliability of the correlation. As explained before (see Elementary Concepts), the significance of a correlation coefficient of a particular magnitude will change depending on the size of the sample from which it was computed. The test of significance is based on the assumption that the distribution of the residual values (i.e., the deviations from the regression line) for the dependent variable y follows the normal distribution, and that the variability of the residual values is the same for all values of the independent variable x. However, Monte Carlo studies suggest that meeting those assumptions closely is not absolutely crucial if your sample size is not very small and the departure from normality is not very large. It is impossible to formulate precise recommendations based on those Monte Carlo results, but many researchers follow a rule of thumb that if your sample size is 50 or more then serious biases are unlikely, and if your sample size is over 100 then you should not be concerned at all with the normality assumptions. There are, however, much more common and serious threats to the validity of the information that a correlation coefficient can provide; they are briefly discussed in the following paragraphs.

Outliers. Outliers are atypical (by definition), infrequent observations. Because of the way in which the regression line is determined (especially the fact that it is based on minimizing not the sum of simple distances but the sum of squares of distances of data points from the line), outliers have a profound influence on the slope of the regression line and consequently on the value of the correlation coefficient. A single outlier is capable of considerably changing the slope of the regression line and, consequently, the value of the correlation, as demonstrated in the following example. Note that, as shown in that illustration, just one outlier can be entirely responsible for a high value of the correlation that otherwise (without the outlier) would be close to zero. Needless to say, one should never base important conclusions on the value of the correlation coefficient alone (i.e., examining the respective scatterplot is always recommended).


Note that if the sample size is relatively small, then including or excluding specific data points that are not as clearly "outliers" as the one shown in the previous example may have a profound influence on the regression line (and the correlation coefficient). This is illustrated in the following example where we call the points being excluded "outliers;" one may argue, however, that they are not outliers but rather extreme values.

Typically, we believe that outliers represent a random error that we would like to be able to control. Unfortunately, there is no widely accepted method to remove outliers automatically (however, see the next paragraph), thus what we are left with is to identify any outliers by examining a scatter plot of each important correlation. Needless to say, outliers may not only artificially increase the value of a correlation coefficient, but they can also decrease the value of a "legitimate" correlation. t-test for independent samples Purpose, Assumptions. The t-test is the most commonly used method to evaluate the differences in means between two groups. For example, the t-test can be used to test for a


difference in test scores between a group of patients who were given a drug and a control group who received a placebo. Theoretically, the t-test can be used even if the sample sizes are very small (e.g., as small as 10; some researchers claim that even smaller n's are possible), as long as the variables are normally distributed within each group and the variation of scores in the two groups is not reliably different (see also Elementary Concepts). As mentioned before, the normality assumption can be evaluated by looking at the distribution of the data (via histograms) or by performing a normality test. The equality of variances assumption can be verified with the F test, or you can use the more robust Levene's test. If these conditions are not met, then you can evaluate the differences in means between two groups using one of the nonparametric alternatives to the t-test (see Nonparametric and Distribution Fitting). The p-level reported with a t-test represents the probability of error involved in accepting our research hypothesis about the existence of a difference. Technically speaking, this is the probability of error associated with rejecting the hypothesis of no difference between the two categories of observations (corresponding to the groups) in the population when, in fact, the hypothesis is true. Some researchers suggest that if the difference is in the predicted direction, you can consider only one half (one "tail") of the probability distribution and thus divide the standard p-level reported with a t-test (a "two-tailed" probability) by two. Others, however, suggest that you should always report the standard, two-tailed t-test probability.

Arrangement of Data. In order to perform the t-test for independent samples, one independent (grouping) variable (e.g., Gender: male/female) and at least one dependent variable (e.g., a test score) are required. The means of the dependent variable will be compared between selected groups based on the specified values (e.g., male and female) of the independent variable. The following data set can be analyzed with a t-test comparing the average WCC score in males and females.

          GENDER    WCC
case 1    male      111
case 2    male      110
case 3    male      109
case 4    female    102
case 5    female    104

mean WCC in males = 110
mean WCC in females = 103
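The same comparison can be reproduced with SciPy's independent-samples t-test, using the five WCC values from the table above purely as an illustration.

from scipy import stats

wcc_males = [111, 110, 109]
wcc_females = [102, 104]

t_stat, p_value = stats.ttest_ind(wcc_males, wcc_females)
print("t:", t_stat, "two-tailed p:", p_value)

# If the difference is in the predicted direction, some researchers report
# the one-tailed probability, i.e. p_value / 2 (see the discussion above).
print("one-tailed p:", p_value / 2)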

t-test graphs. In the t-test analysis, comparisons of means and measures of variation in the two groups can be visualized in box and whisker plots (for an example, see the graph below).


These graphs help you to quickly evaluate and "intuitively visualize" the strength of the relation between the grouping and the dependent variable.

Breakdown: Descriptive Statistics by Groups

Purpose. The breakdowns analysis calculates descriptive statistics and correlations for dependent variables in each of a number of groups defined by one or more grouping (independent) variables.

Arrangement of Data. In the following example data set (spreadsheet), the dependent variable WCC (White Cell Count) can be broken down by two independent variables: Gender (values: males and females) and Height (values: tall and short).

          GENDER    HEIGHT    WCC
case 1    male      short     101
case 2    male      tall      110
case 3    male      tall       92
case 4    female    tall      112
case 5    female    short      95
...       ...       ...       ...

The resulting breakdowns might look as follows (we are assuming that Gender was specified as the first independent variable, and Height as the second).

Entire sample:        Mean=100  SD=13  N=120
  Males:              Mean=99   SD=13  N=60
    Tall/males:       Mean=98   SD=13  N=30
    Short/males:      Mean=100  SD=13  N=30
  Females:            Mean=101  SD=13  N=60
    Tall/females:     Mean=101  SD=13  N=30
    Short/females:    Mean=101  SD=13  N=30
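A breakdown of this kind is essentially a grouped summary. With pandas it might be sketched as follows, using only the five example cases above (the full breakdown in the text assumes 120 cases, so the numbers will differ, and single-case groups show NaN for the standard deviation):

import pandas as pd

data = pd.DataFrame({
    "gender": ["male", "male", "male", "female", "female"],
    "height": ["short", "tall", "tall", "tall", "short"],
    "wcc":    [101, 110, 92, 112, 95],
})

# First-level breakdown by gender, second-level by gender and height.
print(data.groupby("gender")["wcc"].agg(["mean", "std", "count"]))
print(data.groupby(["gender", "height"])["wcc"].agg(["mean", "std", "count"]))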

The composition of the "intermediate" level cells of the "breakdown tree" depends on the order in which independent variables are arranged. For example, in the above example, you see the means for "all males" and "all females" but you do not see the means for "all tall subjects" and "all short subjects", which would have been produced had you specified independent variable Height as the first grouping variable rather than the second.

Statistical Tests in Breakdowns. Breakdowns are typically used as an exploratory data analysis technique; the typical question that this technique can help answer is very simple: Are the groups created by the independent variables different regarding the dependent variable? If you are interested in differences concerning the means, then the appropriate test is the breakdowns one-way ANOVA (F test). If you are interested in variation differences, then you should test for homogeneity of variances.

Other Related Data Analysis Techniques. Although for exploratory data analysis, breakdowns can use more than one independent variable, the statistical procedures in breakdowns assume the existence of a single grouping factor (even if, in fact, the breakdown results from a combination of a number of grouping variables). Thus, those statistics do not reveal or even take into account any possible interactions between grouping variables in the design. For example, there could be differences between the influence of one independent variable on the dependent variable at different levels of another independent variable (e.g., tall people could have lower WCC than short ones, but only if they are males; see the "tree" data above). You can explore such effects by examining breakdowns "visually," using different orders of independent variables, but the magnitude or significance of such effects cannot be estimated by the breakdown statistics.

Frequency tables

Purpose. Frequency or one-way tables represent the simplest method for analyzing categorical (nominal) data (refer to Elementary Concepts). They are often used as one of the exploratory procedures to review how different categories of values are distributed in the sample. For example, in a survey of spectator interest in different sports, we could summarize the respondents' interest in watching football in a frequency table as follows:

STATISTICA BASIC STATS
Category FOOTBALL: "Watching football"

Category                         Count   Cumulative Count   Percent   Cumulative Percent
ALWAYS  : Always interested         39                 39     39.00                39.00
USUALLY : Usually interested        16                 55     16.00                55.00
SOMETIMS: Sometimes interested      26                 81     26.00                81.00
NEVER   : Never interested          19                100     19.00               100.00
Missing                              0                100      0.00               100.00

The table above shows the number, proportion, and cumulative proportion of respondents who characterized their interest in watching football as either (1) Always interested, (2) Usually interested, (3) Sometimes interested, or (4) Never interested.

Applications. In practically every research project, a first "look" at the data usually includes frequency tables. For example, in survey research, frequency tables can show the number of males and females who participated in the survey, the number of respondents from particular ethnic and racial backgrounds, and so on. Responses on some labeled attitude measurement scales (e.g., interest in watching football) can also be nicely summarized via the frequency table. In medical research, one may tabulate the number of patients displaying specific symptoms; in industrial research one may tabulate the frequency of different causes leading to catastrophic failure of products during stress tests (e.g., which parts are actually responsible for the complete malfunction of television sets under extreme temperatures?). Customarily, if a data set includes any categorical data, then one of the first steps in the data analysis is to compute a frequency table for those categorical variables.

Tools for this analysis: There are various tools for this kind of statistical analysis, such as SPSS and Microsoft Excel. One can use these tools to perform a preliminary analysis of the data selected for KDD. Some tools:

Microsoft Excel: The Analysis ToolPak is a tool in Microsoft Excel to perform basic statistical procedures. Microsoft Excel is spreadsheet software that is used to store information in columns and rows, which can then be organized and/or processed. In addition to the basic spreadsheet functions, the Analysis ToolPak in Excel contains procedures such as ANOVA, correlations, descriptive statistics, histograms, percentiles, regression, and t-tests. This document describes how to get basic descriptive statistics, perform an ANOVA, a t-test, and a linear regression. The primary reason to use Excel for statistical data analysis is that it is so widely available. The Analysis ToolPak is an add-in that can be installed for free if you have the installation disk for Microsoft Office. It is also publicly available.
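For readers who prefer a scripting environment, the frequency table shown above can also be reproduced in a few lines of pandas. A minimal sketch (the 39/16/26/19 counts are simply replayed as raw responses for illustration):

# One-way frequency table with counts, percentages and cumulative values.
import pandas as pd

responses = pd.Series(
    ["ALWAYS"] * 39 + ["USUALLY"] * 16 + ["SOMETIMS"] * 26 + ["NEVER"] * 19
)

counts = responses.value_counts().reindex(["ALWAYS", "USUALLY", "SOMETIMS", "NEVER"])
table = pd.DataFrame({
    "Count": counts,
    "Cumulative Count": counts.cumsum(),
    "Percent": 100 * counts / counts.sum(),
    "Cumulative Percent": (100 * counts / counts.sum()).cumsum(),
})
print(table)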

SPSS

SPSS is among the most widely used programs for statistical analysis in social science. It is used by market researchers, health researchers, survey companies, government, education researchers, marketing organizations and others. In addition to statistical analysis, data management (case selection, file reshaping, creating derived data) and data documentation (a metadata dictionary is stored with the data) are features of the base software. Statistics included in the base software:

Descriptive statistics: Cross tabulation, Frequencies, Descriptives, Explore, Descriptive Ratio Statistics
Bivariate statistics: Means, t-test, ANOVA, Correlation (bivariate, partial, distances), Nonparametric tests
Prediction for numerical outcomes: Linear regression
Prediction for identifying groups: Factor analysis, cluster analysis (two-step, K-means, hierarchical), Discriminant

2.5 Visualization techniques

The human mind has boundless potential, and humans have been exploring many ways to use the mind for thousands of years. The technique of visualization can help you acquire new knowledge and skills more quickly than with conventional techniques.

The amount of data stored on electronic media is growing exponentially fast. Making sense of such data is becoming harder and more challenging. Online retailing in the Internet age, for example, is very different from retailing a decade ago, because the three most important factors of the past (location, location, and location) are irrelevant for online stores. One of the greatest challenges we face today is making sense of all this data. Data mining, or knowledge discovery, is the process of identifying new patterns and insights in data, whether it is for understanding the Human Genome to develop new drugs, for discovering new patterns in recent Census data to warn about hidden trends, or for understanding your customers better at an electronic webstore in order to provide a personalized one-to-one experience. Data mining, sometimes referred to as knowledge discovery, is at the intersection of multiple research areas, including Machine Learning, Statistics, Pattern Recognition, Databases, and Visualization. Good marketing and business-oriented data mining books are also available. With the maturity of databases and constant improvements in computational speed, data mining algorithms that were too expensive to execute are now within reach. Data mining serves two goals:


1. Insight: identify patterns and trends that are comprehensible, so that action can be taken based on the insight. For example, characterize the heavy spenders on a web site, or people that buy product X. By understanding the underlying patterns, the web site can be personalized and improved. The insight may also lead to decisions that affect other channels, such as brick-and-mortar stores: placement of products, marketing efforts, and cross-sells.

2. Prediction: a model is built that predicts (or scores) based on input data. For example, a model can be built to predict the propensity of customers to buy product X based on their demographic data and browsing patterns on a web site. Customers with high scores can be used in a direct marketing campaign. If the prediction is for a discrete variable with a few values (e.g., buy product X or not), the task is called classification; if the prediction is for a continuous variable (e.g., customer spending in the next year), the task is called regression.

The majority of research in data mining has concentrated on building the best models for prediction. Part of the reason, no doubt, is that a prediction task is well defined and can be objectively measured on an independent test set. Given a dataset that is labeled with the correct predictions, it is split into a training set and a test set. A learning algorithm is given the training set and produces a model that can map new unseen data into the prediction. The model can then be evaluated for its accuracy in making predictions on the unseen test set. Descriptive data mining, which yields human insight, is harder to evaluate, yet necessary in many domains because the users may not trust predictions coming out of a black box or because legally one must explain the predictions. For example, even if a Perceptron algorithm [20] outperforms a loan officer in predicting who will default on a loan, the person requesting a loan cannot be rejected simply because he is on the wrong side of a 37-dimensional hyperplane; legally, the loan officer must explain the reason for the rejection.

The choice of a predictive model can have a profound influence on the resulting accuracy and on the ability of humans to gain insight from it. Some models are naturally easier to understand than others. For example, a model consisting of if-then rules is easy to understand, unless the number of rules is too large. Decision trees are also relatively easy to understand. Linear models get a little harder, especially if discrete inputs are used. Nearest-neighbor algorithms in high dimensions are almost impossible for users to understand, and non-linear models in high dimensions, such as Neural Networks, are the most opaque.

One way to aid users in understanding the models is to visualize them. MineSet, for example, is a data mining tool that integrates data mining and visualization very tightly. Models built can be viewed and interacted with. Figure 1 shows a visualization of the Naive-Bayes classifier. Given a target value, which in this case was who earns over $50,000 in the US working population, the visualization shows a small set of "important" attributes (measured using mutual information or cross-entropy). For each attribute, a bar chart shows how much "evidence" each value (or range of values) of that attribute provides for the target label. For example, higher education levels (right bars in the education row) imply higher salaries because the bars are higher. Similarly, salary increases with age up to a point and then decreases, and salary increases with the number of hours worked per week. The combination of a back-end algorithm that bins the data, computes the importance of hundreds of attributes, and then a visualization that shows the important attributes visually, makes this a very useful tool that helps identify patterns. Users can interact with the model by clicking on attribute values and seeing the predictions that the model makes.

Figure 1: A visualization of the Naive-Bayes classifier
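A rough sketch of the kind of attribute "importance" ranking mentioned above, computed as the mutual information between each attribute and the target label. The census-like column names and values here are invented for illustration:

# Rank categorical attributes by mutual information with a target label.
import pandas as pd
from sklearn.metrics import mutual_info_score

df = pd.DataFrame({
    "education":     ["HS", "BSc", "MSc", "HS", "PhD", "BSc", "HS", "MSc"],
    "age_group":     ["<30", "30-50", "30-50", "<30", ">50", "30-50", ">50", "<30"],
    "occupation":    ["sales", "tech", "tech", "sales", "exec", "tech", "sales", "exec"],
    "income_gt_50k": [0, 1, 1, 0, 1, 1, 0, 1],
})

scores = {
    col: mutual_info_score(df[col], df["income_gt_50k"])
    for col in ["education", "age_group", "occupation"]
}
for col, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{col:12s} mutual information = {score:.3f}")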

Examples of Visualization tools


Miner3D

Create engaging data visualizations and live data-driven graphics! Miner3D delivers new data insights by allowing you to visually spot trends, clusters, patterns or outliers.

A. Unsupervised Visual Data Clustering: Kohonen's Self-Organizing Maps

Miner3D now includes a visual implementation of Self-Organizing Maps. Users looking for an unattended data clustering tool will find this module surprisingly powerful.


Users looking for an unattended and unsupervised data clustering tool, capable of generating convincing results, will recognize the strong data analysis potential of Kohonen's Self-Organizing Maps (SOMs). Kohonen maps are a tool for arranging data points into a manageable 2D or 3D space in a way that preserves closeness. Also known as self-organizing maps (SOM), Kohonen maps are biologically inspired. The SOM computational mechanism reflects how many scientists think the human brain organizes many-faceted concepts into its 3D structure.

The SOM algorithm lays out a 2D grid of "neuronal units" and assigns each data point to the unit that will "recognize" it. The assignment is made in such a way that neighboring units recognize similar data. The result of applying a Kohonen map to a data set is a 2D plot, but Miner3D can also support 3D Kohonen maps. In this plot, data points (rows) that are similar in the chosen set of attributes will be grouped close together, while dissimilar rows will be separated by a greater distance in the plot space. This allows you, the user, to tease out salient data patterns.

Self-Organizing Maps have been the data clustering method sought by many people from different areas of business and science. This new enhancement of the already powerful set of Miner3D data analysis tools further broadens its application portfolio.
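To make the mechanism concrete, here is a minimal, illustrative SOM written in plain NumPy. This is not Miner3D's implementation; the grid size, learning-rate schedule and synthetic data are arbitrary choices for the sketch:

# A tiny Kohonen SOM: a 2D grid of units is trained so that neighbouring
# units respond to similar rows of the data set.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 4))            # 200 rows, 4 attributes (synthetic)
grid_h, grid_w = 8, 8                       # 8 x 8 map
weights = rng.normal(size=(grid_h, grid_w, data.shape[1]))

# Grid coordinates, used by the neighbourhood function
gy, gx = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")

n_iter = 2000
for t in range(n_iter):
    lr = 0.5 * (1 - t / n_iter)             # decaying learning rate
    sigma = 3.0 * (1 - t / n_iter) + 0.5    # decaying neighbourhood radius
    x = data[rng.integers(len(data))]

    # Best-matching unit: the grid cell whose weight vector is closest to x
    dists = np.linalg.norm(weights - x, axis=2)
    by, bx = np.unravel_index(np.argmin(dists), dists.shape)

    # Move the best-matching unit and its grid neighbours towards x
    grid_dist2 = (gy - by) ** 2 + (gx - bx) ** 2
    influence = np.exp(-grid_dist2 / (2 * sigma ** 2))
    weights += lr * influence[..., None] * (x - weights)

# Map every row to its best-matching unit; similar rows land in nearby cells
bmus = [np.unravel_index(np.argmin(np.linalg.norm(weights - row, axis=2)),
                         (grid_h, grid_w)) for row in data]
print(bmus[:5])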

B. K-Means clustering

A powerful K-means clustering method can be used to visually cluster data sets and for data set reduction. Cluster analysis is a set of mathematical techniques for partitioning a series of data objects into a smaller number of groups, or clusters, so that the data objects within one cluster are more similar to each other than to those in other clusters. Miner3D provides the popular K-means method of clustering. K-Means Clustering and K-Means Data Reduction give you more power and more options to process large data sets. K-means can be used either for clustering data sets visually in 3D or for row reduction and compression of large data sets. Miner3D's implementation of K-means uses a high-performance proprietary scheme based on filtering algorithms and multidimensional binary search trees.
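A minimal K-means sketch using scikit-learn (not Miner3D's proprietary scheme), showing both the per-row cluster labels and the cluster centres that can serve as a reduced data set:

# Cluster synthetic rows into two groups with K-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 3)),
    rng.normal(loc=3.0, scale=0.5, size=(100, 3)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.labels_[:10])        # cluster assignment per row (e.g. for colouring a plot)
print(kmeans.cluster_centers_)    # the "reduced" data set: one centre per cluster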


K-means clustering is only available in the Miner3D Enterprise and Miner3D Developer packages.

2.6 OLAP (Online Analytical Processing)

OLAP (Online Analytical Processing) has been growing in popularity due to the increase in data volumes and the recognition of the business value of analytics. Until the mid-nineties, performing OLAP analysis was an extremely costly process mainly restricted to larger organizations. The major OLAP vendors are Hyperion, Cognos, Business Objects, and MicroStrategy. The cost per seat was in the range of $1,500 to $5,000 per annum. Setting up the environment to perform OLAP analysis would also require substantial investments in time and monetary resources. This has changed as the major database vendors have started to incorporate OLAP modules within their database offerings: Microsoft SQL Server 2000 with Analysis Services, Oracle with Express and Darwin, and IBM with DB2.

What is OLAP?

OLAP allows business users to slice and dice data at will. Normally, data in an organization is distributed across multiple data sources which are incompatible with each other. A retail example: point-of-sale data and sales made via call-center or the Web are stored in different locations and formats. It would be a time-consuming process for an executive to obtain OLAP reports such as: What are the most popular products purchased by customers between the ages 15 to 30?

Part of the OLAP implementation process involves extracting data from the various data repositories and making them compatible. Making data compatible involves ensuring that the meaning of the data in one repository matches all other repositories. An example of incompatible data: customer ages can be stored as birth date for purchases made over the web and stored as age categories (i.e. between 15 and 30) for in-store sales.

It is not always necessary to create a data warehouse for OLAP analysis. Data stored by operational systems, such as point-of-sale systems, are in types of databases called OLTPs. OLTP (Online Transaction Processing) databases do not have any difference from a structural perspective from any other databases. The main, and only, difference is the way in which data is stored. Examples of OLTPs include ERP, CRM, SCM, Point-of-Sale applications, and Call Center systems.

OLTPs are designed for optimal transaction speed. When a consumer makes a purchase online, they expect the transaction to occur instantaneously. With a database design, called data modeling, optimized for transactions, the record 'Consumer name, Address, Telephone, Order Number, Order Name, Price, Payment Method' is created quickly on the database and the results can be recalled by managers equally quickly if needed.

Data Model for OLTP

Data are not typically stored for an extended period on OLTPs for storage cost and transaction speed reasons.


OLAPs have a different mandate from OLTPs. OLAPs are designed to give an overview analysis of what happened. Hence the data storage (i.e. data modeling) has to be set up differently. The most common method is called the star design.

Star Data Model for OLAP

The central table in an OLAP star data model is called the fact table. The surrounding tables are called the dimensions. Using the above data model, it is possible to build reports that answer questions such as:

The supervisor that gave the most discounts.
The quantity shipped on a particular date, month, year or quarter.
In which zip code did product A sell the most.

To obtain answers, such as the ones above, from a data model OLAP cubes are created. OLAP cubes are not strictly cuboids - it is the name given to the process of linking data from the different dimensions. The cubes can be developed along business units such as sales or marketing. Or a giant cube can be formed with all the dimensions.
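A rough sketch of answering such questions by slicing a small fact table with pandas; the column names (supervisor, discount, quantity_shipped, ship_date, zip_code, product) are hypothetical stand-ins for the dimensions in the star model above:

# Slice-and-dice a toy fact table along its dimensions.
import pandas as pd

fact = pd.DataFrame({
    "supervisor":       ["Ann", "Ann", "Bob", "Bob", "Ann"],
    "discount":         [5.0, 0.0, 2.5, 7.5, 1.0],
    "quantity_shipped": [10, 4, 7, 3, 12],
    "ship_date":        pd.to_datetime(["2024-01-05", "2024-01-20",
                                        "2024-02-03", "2024-02-15", "2024-03-01"]),
    "zip_code":         ["10001", "10001", "94105", "94105", "60601"],
    "product":          ["A", "B", "A", "A", "B"],
})

# The supervisor that gave the most discounts
print(fact.groupby("supervisor")["discount"].sum().idxmax())

# The quantity shipped per month (slicing along the time dimension)
print(fact.groupby(fact["ship_date"].dt.to_period("M"))["quantity_shipped"].sum())

# A small "cube": quantity shipped by zip code and product; the zip code where
# product A sold the most is the largest entry in column "A".
cube = fact.pivot_table(values="quantity_shipped", index="zip_code",
                        columns="product", aggfunc="sum", fill_value=0)
print(cube)
print(cube["A"].idxmax())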


OLAP Cube with Time, Customer and Product Dimensions

OLAP can be a valuable and rewarding business tool. Aside from producing reports, OLAP analysis can help an organization evaluate balanced scorecard targets.

Steps in the OLAP Creation Process

OLAP (Online Analytical Processing) Tools

OLAP (online analytical processing) is a function of business intelligence software that enables a user to easily and selectively extract and view data from different points of view. Designed for managers looking to make sense of their information, OLAP tools structure data hierarchically, the way managers think of their enterprises, but also allow business analysts to rotate that data, changing the relationships to get more detailed insight into corporate information.

Examples of OLAP tools

1. WebFOCUS

WebFOCUS OLAP combines all the functionality of query tools, reporting tools, and OLAP into a single powerful solution with one common interface so business analysts can slice and dice the data and see business processes in a new way. WebFOCUS makes data part of an organization's natural culture by giving developers the premier design environments for automated ad hoc and parameter-driven reporting, and giving everyone else the ability to receive and retrieve data in any format, performing analysis using whatever device or application is part of their daily working life.

WebFOCUS ad hoc reporting and OLAP features allow users to slice and dice data in an almost unlimited number of ways. Satisfying the broadest range of analytical needs, business intelligence application developers can easily enhance reports with extensive data-analysis functionality so that end users can dynamically interact with the information. WebFOCUS also supports the real-time creation of Excel spreadsheets and Excel PivotTables with full styling, drill-downs, and formula capabilities so that Excel power users can analyze their corporate data in a tool with which they are already familiar.


2. PivotCubeX

PivotCubeX is a visual ActiveX control for OLAP analysis and reporting. You can use it to load data from huge relational databases, look for information or details, and create summaries and reports that help the end user in making accurate decisions. It provides a highly dynamic interface for interactive data analysis.

3. OlapCube

OlapCube is a simple, yet powerful tool to analyze data. OlapCube will let you create local cubes (files with .cub extension) from data stored in any relational database (including MySQL, PostgreSQL, Microsoft Access, SQL Server, SQL Server Express, Oracle, Oracle Express). You can explore the resulting cube with our OlapCube Reader. Or you can use Microsoft Excel to create rich and customized reports.

2.7. Decision Trees

What is a Decision Tree?


A decision tree is a predictive model that, as its name implies, can be viewed as a tree. Specifically, each branch of the tree is a classification question and the leaves of the tree are partitions of the dataset with their classification. For instance, if we were going to classify customers who churn (don't renew their phone contracts) in the cellular telephone industry, a decision tree might look something like that found in Figure 2.1.

Figure 2.1 A decision tree is a predictive model that makes a prediction on the basis of a series of decisions, much like the game of 20 questions.

You may notice some interesting things about the tree:

It divides up the data on each branch point without losing any of the data (the number of total records in a given parent node is equal to the sum of the records contained in its two children). The number of churners and non-churners is conserved as you move up or down the tree.
It is pretty easy to understand how the model is being built (in contrast to the models from neural networks or from standard statistics).
It would also be pretty easy to use this model if you actually had to target those customers that are likely to churn with a targeted marketing offer.

You may also build some intuitions about your customer base, e.g. customers who have been with you for a couple of years and have up-to-date cellular phones are pretty loyal.

Viewing decision trees as segmentation with a purpose

From a business perspective, decision trees can be viewed as creating a segmentation of the original dataset (each segment would be one of the leaves of the tree). Segmentation of customers, products, and sales regions is something that marketing managers have been doing for many years. In the past this segmentation has been performed in order to get a high level view of a large amount of data, with no particular reason for creating the segmentation except that the records within each segment were somewhat similar to each other. In this case the segmentation is done for a particular reason, namely for the prediction of some important piece of information. The records that fall within each segment fall there because they have similarity with respect to the information being predicted, not just that they are similar, without similarity being well defined. These predictive segments that are derived from the decision tree also come with a description of the characteristics that define the predictive segment. Thus, although the decision trees and the algorithms that create them may be complex, the results can be presented in an easy to understand way that can be quite useful to the business user.

Applying decision trees to Business

Because of their tree structure and ability to easily generate rules, decision trees are the favored technique for building understandable models. Because of this clarity they also allow for more complex profit and ROI models to be added easily on top of the predictive model. For instance, once a customer population is found with high predicted likelihood to attrite, a variety of cost models can be used to see if an expensive marketing intervention should be used because the customers are highly valuable, or a less expensive intervention should be used because the revenue from this sub-population of customers is marginal. Because of their high level of automation and the ease of translating decision tree models into SQL for deployment in relational databases, the technology has also proven to be easy to integrate with existing IT processes, requiring little preprocessing and cleansing of the data, or extraction of a special purpose file specifically for data mining.

Where can decision trees be used?

Decision trees are a data mining technology that has been around in a form very similar to the technology of today for almost twenty years now, and early versions of the algorithms date back to the 1960s. Often these techniques were originally developed for statisticians to automate the process of determining which fields in their database were actually useful or correlated with the particular problem that they were trying to understand. Partially because of this history, decision tree algorithms tend to automate the entire process of hypothesis generation and then validation much more completely and in a much more integrated way than any other data mining techniques. They are also particularly adept at handling raw data with little or no pre-processing. Perhaps also because they were originally developed to mimic the way an analyst interactively performs data mining, they provide a simple to understand predictive model based on rules (such as "90% of the time credit card customers of less than 3 months who max out their credit limit are going to default on their credit card loan").

Because decision trees score so highly on so many of the critical features of data mining, they can be used in a wide variety of business problems for both exploration and for prediction. They have been used for problems ranging from credit card attrition prediction to time series prediction of the exchange rate of different international currencies.


There are also some problems where decision trees will not do as well. Some very simple problems, where the prediction is just a simple multiple of the predictor, can be solved much more quickly and easily by linear regression. Usually the models to be built and the interactions to be detected are much more complex in real world problems, and this is where decision trees excel.

Using decision trees for Exploration

The decision tree technology can be used for exploration of the dataset and business problem. This is often done by looking at the predictors and values that are chosen for each split of the tree. Often these predictors provide usable insights or propose questions that need to be answered. For instance, if you ran across the following in your database for cellular phone churn, you might seriously wonder about the way your telesales operators were making their calls and maybe change the way that they are compensated: IF customer lifetime < 1.1 years AND sales channel = telesales THEN chance of churn is 65%.

Using decision trees for Data Preprocessing

Another way that the decision tree technology has been used is for preprocessing data for other prediction algorithms. Because the algorithm is fairly robust with respect to a variety of predictor types (e.g. number, categorical etc.) and because it can be run relatively quickly, decision trees can be used on the first pass of a data mining run to create a subset of possibly useful predictors that can then be fed into neural networks, nearest neighbor and normal statistical routines, which can take a considerable amount of time to run if there are large numbers of possible predictors to be used in the model.

Decision trees for Prediction

Some forms of decision trees were initially developed as exploratory tools to refine and preprocess data for more standard statistical techniques like logistic regression, but they have also been used, and are increasingly being used, for prediction. This is interesting because many statisticians will still use decision trees for exploratory analysis, effectively building a predictive model as a by-product, but then ignore the predictive model in favor of techniques that they are most comfortable with. Sometimes veteran analysts will do this even when the predictive model is superior to that produced by other techniques. With a host of new products and skilled users now appearing, this tendency to use decision trees only for exploration now seems to be changing.

The first step is Growing the Tree

The first step in the process is that of growing the tree. Specifically, the algorithm seeks to create a tree that works as perfectly as possible on all the data that is available. Most of the time it is not possible to have the algorithm work perfectly. There is always noise in the database to some degree (there are variables that are not being collected that have an impact on the target you are trying to predict). The name of the game in growing the tree is in finding the best possible question to ask at each branch point of the tree. At the bottom of the tree you will come up with nodes that you would like to be all of one type or the other. Thus the question "Are you over 40?" probably does not sufficiently distinguish between those who are churners and those who are not - let's say it is 40%/60%. On the other hand there may be a series of questions that do quite a nice job of distinguishing those cellular phone customers who will churn and those who won't. Maybe the series of questions would be something like: "Have you been a customer for less than a year, do you have a telephone that is more than two years old, and were you originally landed as a customer via telesales rather than direct sales?" This series of questions defines a segment of the customer population in which 90% churn. These are then relevant questions to be asking in relation to predicting churn.
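A minimal sketch of growing such a tree with scikit-learn on a tiny, invented churn data set (the feature names and values are made up); export_text prints the questions asked at each branch point:

# Fit a small decision tree for churn and print its branch-point questions.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "customer_years":  [0.5, 0.8, 2.5, 3.0, 0.4, 4.0, 1.2, 0.7, 2.8, 0.9],
    "phone_age_years": [2.5, 3.0, 1.0, 0.4, 2.8, 0.3, 2.0, 2.6, 0.6, 1.5],
    "via_telesales":   [1,   1,   1,   0,   1,   0,   1,   0,   0,   1],
    "churned":         [1,   1,   0,   0,   1,   0,   0,   0,   0,   1],
})

X = df[["customer_years", "phone_age_years", "via_telesales"]]
y = df["churned"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))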

The difference between a good question and a bad question


The difference between a good question and a bad question has to do with how much the question can organize the data - or, in this case, change the likelihood of a churner appearing in the customer segment. If we started off with our population being half churners and half non-churners, then a question that didn't organize the data to some degree into one segment that was more likely to churn than the other wouldn't be a very useful question to ask. On the other hand, if we asked a question that was very good at distinguishing between churners and non-churners - say one that split 100 customers into one segment of 50 churners and another segment of 50 non-churners - then this would be considered to be a good question. In fact it has decreased the disorder of the original segment as much as was possible.

The process in decision tree algorithms is very similar when they build trees. These algorithms look at all possible distinguishing questions that could possibly break up the original training dataset into segments that are nearly homogeneous with respect to the different classes being predicted. Some decision tree algorithms may use heuristics in order to pick the questions or even pick them at random. CART picks the questions in a very unsophisticated way: it tries them all. After it has tried them all, CART picks the best one, uses it to split the data into two more organized segments, and then again asks all possible questions on each of those new segments individually.

When does the tree stop growing?


If the decision tree algorithm just continued growing the tree like this, it could conceivably create more and more questions and branches in the tree so that eventually there was only one record in each segment. To let the tree grow to this size is both computationally expensive and unnecessary. Most decision tree algorithms stop growing the tree when one of three criteria is met:

The segment contains only one record. (There is no further question that you could ask which could further refine a segment of just one.)
All the records in the segment have identical characteristics. (There is no reason to continue asking further questions since all the remaining records are the same.)
The improvement is not substantial enough to warrant making the split.


Why would a decision tree algorithm stop growing the tree if there wasn't enough data?

Consider the following example, shown in Table 2.1, of a segment that we might want to split further which has just two examples. It has been created out of a much larger customer database by selecting only those customers aged 27 with blue eyes and salaries between $80,000 and $81,000.

Name    Age   Eyes   Salary    Churned?
Steve   27    Blue   $80,000   Yes
Alex    27    Blue   $80,000   No

Table 2.1 Decision tree algorithm segment. This segment cannot be split further except by using the predictor "name".

In this case all of the possible questions that could be asked about the two customers turn out to have the same value (age, eyes, salary) except for name. It would then be possible to ask a question like "Is the customer's name Steve?" and create segments which would be very good at breaking apart those who churned from those who did not. The problem is that we all have an intuition that the name of the customer is not going to be a very good indicator of whether that customer churns or not. It might work well for this particular 2-record segment, but it is unlikely that it will work for other customer databases or even the same customer database at a different time. This particular example has to do with overfitting the model - in this case fitting the model too closely to the idiosyncrasies of the training data. This can be fixed later on, but clearly stopping the building of the tree short of either one-record segments or very small segments in general is a good idea.

Decision trees aren't necessarily finished after the tree is grown.


After the tree has been grown to a certain size (depending on the particular stopping criteria used in the algorithm), the CART algorithm has still more work to do. The algorithm then checks to see if the model has been overfit to the data. It does this in several ways, using a cross validation approach or a test set validation approach - basically using the same mind-numbingly simple approach it used to find the best questions in the first place, namely trying many different simpler versions of the tree on a held-aside test set. The tree that does the best on the held-aside data is selected by the algorithm as the best model. The nice thing about CART is that this testing and selection is all an integral part of the algorithm, as opposed to the after-the-fact approach that other techniques use.

ID3 and an enhancement - C4.5


In the late 1970s J. Ross Quinlan introduced a decision tree algorithm named ID3. It was one of the first decision tree algorithms, yet at the same time it built solidly on work that had been done on inference systems and concept learning systems from that decade as well as the preceding decade. Initially ID3 was used for tasks such as learning good game playing strategies for chess end games. Since then ID3 has been applied to a wide variety of problems in both academia and industry and has been modified, improved and borrowed from many times over.

ID3 picks predictors and their splitting values based on the gain in information that the split or splits provide. Gain represents the difference between the amount of information that is needed to correctly make a prediction before a split is made and after the split has been made. If the amount of information required is much lower after the split is made, then that split has decreased the disorder of the original single segment. Gain is defined as the difference between the entropy of the original segment and the accumulated entropies of the resulting split segments.

ID3 was later enhanced in the version called C4.5. C4.5 improves on ID3 in several important areas:

predictors with missing values can still be used
predictors with continuous values can be used
pruning is introduced
rule derivation

Many of these techniques appear in the CART algorithm, plus some others, so we will go through them in the introduction to the CART algorithm below.
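A small sketch of the entropy and gain computation described above, for a parent segment of 5 churners and 5 non-churners split into two child segments:

# Entropy of a segment and the information gain of a candidate split.
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent, children):
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["churn"] * 5 + ["stay"] * 5
left   = ["churn"] * 4 + ["stay"] * 1      # one possible split
right  = ["churn"] * 1 + ["stay"] * 4
print(information_gain(parent, [left, right]))   # roughly 0.28 bits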

CART - Growing a forest and picking the best tree


CART stands for Classification and Regression Trees and is a data exploration and prediction algorithm developed by Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone, and is nicely detailed in their 1984 book Classification and Regression Trees (Breiman, Friedman, Olshen and Stone, 1984). These researchers from Stanford University and the University of California at Berkeley showed how this new algorithm could be used on a variety of different problems, such as the detection of chlorine from the data contained in a mass spectrum.

Predictors are picked as they decrease the disorder of the data. In building the CART tree, each predictor is picked based on how well it teases apart the records with different predictions. For instance, one measure that is used to determine whether a given split point for a given predictor is better than another is the entropy metric. The measure originated from the work done by Claude Shannon and Warren Weaver on information theory in 1949. They were concerned with how information could be efficiently communicated over telephone lines. Interestingly, their results also prove useful in creating decision trees.

CART Automatically Validates the Tree


One of the great advantages of CART is that the algorithm has the validation of the model and the discovery of the optimally general model built deeply into the algorithm. CART accomplishes this by building a very complex tree and then pruning it back to the optimally general tree based on the results of cross validation or test set validation. The tree is pruned back based on the performance of the various pruned versions of the tree on the test set data. The most complex tree rarely fares the best on the held-aside data, as it has been overfitted to the training data. By using cross validation, the tree that is most likely to do well on new, unseen data can be chosen.
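A rough sketch of this prune-and-validate idea using scikit-learn's cost-complexity pruning and cross-validation; the data set here is synthetic and this illustrates the approach rather than the original CART implementation:

# Grow a full tree, enumerate pruned versions, and pick the one that
# cross-validates best.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Candidate pruning strengths derived from the fully grown tree
path = DecisionTreeClassifier(random_state=0).fit(X, y).cost_complexity_pruning_path(X, y)

best_alpha, best_score = None, -np.inf
for alpha in np.unique(np.clip(path.ccp_alphas, 0.0, None)):
    score = cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=alpha), X, y, cv=5
    ).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"chosen ccp_alpha={best_alpha:.5f}, cross-validated accuracy={best_score:.3f}")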


CART Surrogates handle missing data


The CART algorithm is relatively robust with respect to missing data. If the value is missing for a particular predictor in a particular record, that record will not be used in making the determination of the optimal split when the tree is being built. In effect CART will utilize as much information as it has on hand in order to make the decision for picking the best possible split. When CART is being used to predict on new data, missing values can be handled via surrogates. Surrogates are split values and predictors that mimic the actual split in the tree and can be used when the data for the preferred predictor is missing. For instance, though shoe size is not a perfect predictor of height, it could be used as a surrogate to try to mimic a split based on height when that information was missing from the particular record being predicted with the CART model.

CHAID
Another decision tree technology, equally popular to CART, is CHAID, or Chi-Square Automatic Interaction Detector. CHAID is similar to CART in that it builds a decision tree, but it differs in the way that it chooses its splits. Instead of the entropy or Gini metrics for choosing optimal splits, the technique relies on the chi-square test used in contingency tables to determine which categorical predictor is furthest from independence with the prediction values. Because CHAID relies on the contingency tables to form its test of significance for each predictor, all predictors must either be categorical or be coerced into a categorical form via binning (e.g. break up possible people ages into 10 bins: 0-9, 10-19, 20-29 etc.). Though this binning can have deleterious consequences, the actual accuracy performances of CART and CHAID have been shown to be comparable in real world direct marketing response models.
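A small sketch of the chi-square test that CHAID-style splits rely on, applied to a made-up contingency table of a binned predictor (age band) against the prediction value (churned or stayed):

# Test whether the binned predictor is independent of the outcome.
from scipy.stats import chi2_contingency

table = [
    [30, 70],   # age 0-19:  churned, stayed
    [45, 55],   # age 20-29: churned, stayed
    [25, 75],   # age 30-39: churned, stayed
]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")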

2.8. Association rules (Rule Induction)


Association rules, or rule induction, is one of the major forms of data mining and is perhaps the most common form of knowledge discovery in unsupervised learning systems. It is also perhaps the form of data mining that most closely resembles the process that most people think about when they think about data mining, namely mining for gold through a vast database. The gold in this case would be a rule that is interesting - that tells you something about your database that you didn't already know and probably weren't able to explicitly articulate (aside from saying "show me things that are interesting"). Rule induction on a database can be a massive undertaking where all possible patterns are systematically pulled out of the data, and then an accuracy and significance are added to them that tell the user how strong the pattern is and how likely it is to occur again. In general these rules are relatively simple. For a market basket database of items scanned in a consumer market basket, for example, you might find interesting correlations in your database such as:


If bagels are purchased then cream cheese is purchased 90% of the time and this pattern occurs in 3% of all shopping baskets.
If live plants are purchased from a hardware store then plant fertilizer is purchased 60% of the time and these two items are bought together in 6% of the shopping baskets.

The rules that are pulled from the database are extracted and ordered to be presented to the user based on the percentage of times that they are correct and how often they apply. The bane of rule induction systems is also their strength - that they retrieve all possible interesting patterns in the database. This is a strength in the sense that it leaves no stone unturned, but it can also be viewed as a weakness because the user can easily become overwhelmed with such a large number of rules that it is difficult to look through all of them. You almost need a second pass of data mining to go through the list of interesting rules that have been generated by the rule induction system in the first place in order to find the most valuable gold nugget amongst them all. This overabundance of patterns can also be problematic for the simple task of prediction: because all possible patterns are culled from the database, there may be conflicting predictions made by equally interesting rules. Automating the process of culling the most interesting rules and of combining the recommendations of a variety of rules is well handled by many of the commercially available rule induction systems on the market today and is also an area of active research.

Applying Rule induction to Business


Rule induction systems are highly automated and are probably the best of the data mining techniques for exposing all possible predictive patterns in a database. They can be modified for use in prediction problems, but the algorithms for combining evidence from a variety of rules come more from rules of thumb and practical experience. In comparing data mining techniques along an axis of explanation, neural networks would be at one extreme of the data mining algorithms and rule induction systems at the other end. Neural networks are extremely proficient at saying exactly what must be done in a prediction task (e.g. who do I give credit to / who do I deny credit to) with little explanation. Rule induction systems used for prediction, on the other hand, are like having a committee of trusted advisors, each with a slightly different opinion as to what to do but with relatively well grounded reasoning and a good explanation for why it should be done. The business value of rule induction techniques reflects the highly automated way in which the rules are created, which makes the system easy to use, but also the fact that this approach can suffer from an overabundance of interesting patterns, which can make it complicated to make a prediction that is directly tied to return on investment (ROI).

What is a rule?
In rule induction systems the rule itself has a simple form: "if this and this and this, then this". For example, a rule that a supermarket might find in their data collected from scanners would be "if pickles are purchased then ketchup is purchased", or:


If paper plates then plastic forks
If dip then potato chips
If salsa then tortilla chips

In order for the rules to be useful there are two pieces of information that must be supplied as well as the actual rule:

Accuracy - How often is the rule correct?
Coverage - How often does the rule apply?

Just because the pattern in the database is expressed as a rule does not mean that it is true all the time. Thus, just like in other data mining algorithms, it is important to recognize and make explicit the uncertainty in the rule. This is what the accuracy of the rule means. The coverage of the rule has to do with how much of the database the rule covers or applies to. Examples of these two measures for a variety of rules are shown in Table 2.2. In some cases accuracy is called the confidence of the rule and coverage is called the support. Accuracy and coverage appear to be the preferred ways of naming these two measurements.

Rule                                                                                                    Accuracy   Coverage
If breakfast cereal purchased then milk purchased.                                                         85%        20%
If bread purchased then swiss cheese purchased.                                                            15%         6%
If 42 years old and purchased pretzels and purchased dry roasted peanuts then beer will be purchased.      95%      0.01%

Table 2.2 Examples of Rule Accuracy and Coverage

The rules themselves consist of two halves. The left hand side is called the antecedent and the right hand side is called the consequent. The antecedent can consist of just one condition or multiple conditions which must all be true in order for the consequent to be true at the given accuracy. Generally the consequent is just a single condition (prediction of purchasing just one grocery store item) rather than multiple conditions. Thus rules with compound consequents, such as "if x and y then a and b and c", are uncommon.

What to do with a rule


When the rules are mined out of the database, the rules can be used either for better understanding the business problems that the data reflects or for performing actual predictions against some predefined prediction target. Since there is both a left side and a right side to a rule (antecedent and consequent), they can be used in several ways for your business.

Target the antecedent. In this case all rules that have a certain value for the antecedent are gathered and displayed to the user. For instance, a grocery store may request all rules that have nails, bolts or screws as the antecedent in order to try to understand whether discontinuing the sale of these low margin items will have any effect on other higher margin items. For instance, maybe people who buy nails also buy expensive hammers but wouldn't do so at the store if the nails were not available.

Target the consequent. In this case all rules that have a certain value for the consequent can be used to understand what is associated with the consequent and perhaps what affects the consequent. For instance, it might be useful to know all of the interesting rules that have coffee in their consequent. These may well be the rules that affect the purchases of coffee and that a store owner may want to put close to the coffee in order to increase the sale of both items. Or it might be the rule that the coffee manufacturer uses to determine in which magazine to place their next coupons.

Target based on accuracy. Sometimes the most important thing for a user is the accuracy of the rules that are being generated. Highly accurate rules of 80% or 90% imply strong relationships that can be exploited even if they have low coverage of the database and only occur a limited number of times. For instance, a rule that has only 0.1% coverage but 95% accuracy can only be applied one time out of one thousand, but it will very likely be correct. If this one time is highly profitable, then it can be worthwhile. This, for instance, is how some of the most successful data mining applications work in the financial markets - looking for that limited amount of time where a very confident prediction can be made.

Target based on coverage. Sometimes users want to know what the most ubiquitous rules are, or those rules that are most readily applicable. By looking at rules ranked by coverage they can quickly get a high level view of what is happening within their database most of the time.

Target based on interestingness. Rules are interesting when they have high coverage and high accuracy and deviate from the norm. There have been many ways that rules have been ranked by some measure of interestingness so that the trade off between coverage and accuracy can be made. Since rule induction systems are so often used for pattern discovery and unsupervised learning, it is less easy to compare them. For example, it is very easy for just about any rule induction system to generate all possible rules; it is, however, much more difficult to devise a way to present those rules (which could easily be in the hundreds of thousands) in a way that is most useful to the end user.

When interesting rules are found, they usually have been created to find relationships between many different predictor values in the database, not just one well defined target of the prediction. For this reason it is often much more difficult to assign a measure of value to the rule aside from its interestingness. For instance, it would be difficult to determine the monetary value of knowing that if people buy breakfast sausage they also buy eggs 60% of the time. For data mining systems that are more focused on prediction for things like customer attrition, targeted marketing response or risk, it is much easier to measure the value of the system and compare it to other systems and other methods for solving the problem.

Caveat: Rules do not imply causality


It is important to recognize that even though the patterns produced from rule induction systems are delivered as if then rules they do not necessarily mean that the left hand side of the rule (the if part) causes the right hand side of the rule (the then part) to happen. Purchasing cheese does not cause the purchase of wine even though the rule if cheese then wine may be very strong.


This is particularly important to remember for rule induction systems, because the results are presented as "if this then that", the same way many causal relationships are presented.

Types of databases used for rule induction


Typically rule induction is used on databases with either fields of high cardinality (many different values) or many columns of binary fields. The classical case of this is the supermarket basket data from store scanners that contains individual product names and quantities and may contain tens of thousands of different items with different packaging that create hundreds of thousands of SKU identifiers (Stock Keeping Units). Sometimes in these databases the concept of a record is not easily defined within the database - consider the typical Star Schema for many data warehouses that store the supermarket transactions as separate entries in the fact table, where the columns in the fact table are some unique identifier of the shopping basket (so all items can be noted as being in the same shopping basket), the quantity, the time of purchase, and whether the item was purchased with a special promotion (sale or coupon). Thus each item in the shopping basket has a different row in the fact table. This layout of the data is not typically the best for most data mining algorithms, which would prefer to have the data structured as one row per shopping basket, with each column representing the presence or absence of a given item. This can be an expensive way to store the data, however, since the typical grocery store contains 60,000 SKUs or different items that could come across the checkout counter. This structure of the records can also create a very high dimensional space (60,000 binary dimensions) which would be unwieldy for many classical data mining algorithms like neural networks and decision trees. As we'll see, several tricks are played to make this computationally feasible for the data mining algorithm while not requiring a massive reorganization of the database.
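A minimal sketch of one such reorganization: turning fact-table style rows (one row per item per basket) into the one-row-per-basket, one-column-per-item layout described above. The column names and items are invented:

# Pivot transaction rows into a basket-by-item presence/absence matrix.
import pandas as pd

fact = pd.DataFrame({
    "basket_id": [1, 1, 2, 2, 2, 3],
    "item":      ["bagels", "cream cheese", "bread", "milk", "eggs", "milk"],
    "quantity":  [1, 1, 2, 1, 12, 1],
})

baskets = pd.crosstab(fact["basket_id"], fact["item"]).astype(bool)
print(baskets)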

Discovery
The claim to fame of these rule induction systems rests much more on knowledge discovery in unsupervised learning systems than it does on prediction. These systems provide both a very detailed view of the data, where significant patterns that only occur a small portion of the time can only be found when looking at the detail data, as well as a broad overview of the data, where some systems seek to deliver to the user an overall view of the patterns contained in the database. These systems thus display a nice combination of both micro and macro views:

Macro Level - Patterns that cover many situations are provided to the user that can be used very often and with great confidence and can also be used to summarize the database.
Micro Level - Strong rules that cover only a very few situations can still be retrieved by the system and proposed to the end user. These may be valuable if the situations that are covered are highly valuable (maybe they only apply to the most profitable customers) or represent a small but growing subpopulation which may indicate a market shift or the emergence of a new competitor (e.g. customers are only being lost in one particular area of the country where a new competitor is emerging).


Prediction
After the rules are created and their interestingness is measured there is also a call for performing prediction with the rules. Each rule by itself can perform prediction - the consequent is the target and the accuracy of the rule is the accuracy of the prediction. But because rule induction systems produce many rules for a given antecedent or consequent there can be conflicting predictions with different accuracies. This is an opportunity for improving the overall performance of the systems by combining the rules. This can be done in a variety of ways by summing the accuracies as if they were weights or just by taking the prediction of the rule with the maximum accuracy.

Table 2.3 shows how a given consequent or antecedent can be part of many rules with different accuracies and coverages. From this example consider the prediction problem of trying to predict whether milk was purchased based solely on the other items that were in the shopping basket. If the shopping basket contained only bread then from the table we would guess that there was a 35% chance that milk was also purchased. If, however, bread and butter and eggs and cheese were purchased what would be the prediction for milk then? 65% chance of milk because the relationship between butter and milk is the greatest at 65%? Or would all of the other items in the basket increase even further the chance of milk being purchased to well beyond 65%? Determining how to combine evidence from multiple rules is a key part of the algorithms for using rules for prediction.

Antecedent   Consequent     Accuracy   Coverage
bagels       cream cheese      80%         5%
bagels       orange juice      40%         3%
bagels       coffee            40%         2%
bagels       eggs              25%         2%
bread        milk              35%        30%
butter       milk              65%        20%
eggs         milk              35%        15%
cheese       milk              40%         8%

Table 2.3 Accuracy and Coverage in Rule Antecedents and Consequents

The General Idea


The general idea of a rule classification system is that rules are created that show the relationship between events captured in your database. These rules can be simple, with just one element in the antecedent, or they might be more complicated, with many column-value pairs in the antecedent, all joined together by a conjunction (item1 and item2 and item3 must all occur for the antecedent to be true). The rules are used to find interesting patterns in the database but they are also used at times for prediction. There are two main things that are important to understanding a rule:

Accuracy - Accuracy refers to the probability that if the antecedent is true then the consequent will be true. High accuracy means that this is a rule that is highly dependable.
Coverage - Coverage refers to the number of records in the database that the rule applies to. High coverage means that the rule can be used very often and also that it is less likely to be a spurious artifact of the sampling technique or idiosyncrasies of the database.


The business importance of accuracy and coverage


From a business perspective accurate rules are important because they imply that there is useful predictive information in the database that can be exploited - namely that there is something far from independent between the antecedent and the consequent. The lower the accuracy, the closer the rule comes to just random guessing. If the accuracy is significantly below that of what would be expected from random guessing, then the negation of the antecedent may well in fact be useful (for instance, people who buy denture adhesive are much less likely to buy fresh corn on the cob than normal).

From a business perspective coverage implies how often you can use a useful rule. For instance, you may have a rule that is 100% accurate but is only applicable in 1 out of every 100,000 shopping baskets. You can rearrange your shelf space to take advantage of this fact, but it will not make you much money since the event is not very likely to happen. Table 2.4 displays the trade-off between coverage and accuracy.

                  Accuracy Low                                Accuracy High
Coverage High     Rule is rarely correct but can be           Rule is often correct and can be
                  used often.                                 used often.
Coverage Low      Rule is rarely correct and can be           Rule is often correct but can be
                  only rarely used.                           only rarely used.

Table 2.4 Rule coverage versus accuracy.

Trading off accuracy and coverage is like betting at the track


An analogy for how coverage and accuracy relate to making money comes from betting on horses. Having a high accuracy rule with low coverage would be like owning a race horse that always won when he raced but could only race once a year. In betting, you could probably still make a lot of money on such a horse. In rule induction for retail stores, it is unlikely that finding that one rule between mayonnaise, ice cream and sardines that seems to always be true will have much of an impact on your bottom line.

How to evaluate the rule


One way to look at accuracy and coverage is to see how they relate to some simple statistics and how they can be represented graphically. From statistics, coverage is simply the a priori probability of the antecedent, and accuracy is the probability of the consequent conditional on the antecedent. So, for instance, if we were looking at the following database of supermarket basket scanner data, we would need the following information in order to calculate the accuracy and coverage for a simple rule (let's say milk purchased implies eggs purchased):

T = 100 = Total number of shopping baskets in the database.
E = 30 = Number of baskets with eggs in them.
M = 40 = Number of baskets with milk in them.
B = 20 = Number of baskets with both eggs and milk in them.

Accuracy is then just the number of baskets with eggs and milk in them divided by the number of baskets with milk in them. In this case that would be 20/40 = 50%. The coverage would be the number of baskets with milk in them divided by the total number of baskets. This would be 40/100 = 40%. This can be seen graphically in Figure 2.5.


Figure 2.5 Graphically, the total number of shopping baskets can be represented in a space, and the number of baskets containing eggs or milk can be represented by the area of a circle. The coverage of the rule If Milk then Eggs is just the relative size of the circle corresponding to milk. The accuracy is the relative size of the overlap between the two circles to the circle representing milk purchased.

Notice that we haven't used E, the number of baskets with eggs, in these calculations. One way that eggs could be used would be to calculate the expected number of baskets with both eggs and milk in them based on the independence of the events. This would give us some sense of how unlikely and how special it is that 20% of the baskets have both eggs and milk in them. Remember from the statistics section that if two events are independent (have no effect on one another), then the product of their individual probabilities of occurrence should equal the probability of them both occurring together. If the purchase of eggs and milk were independent of each other, one would expect that 0.3 x 0.4 = 0.12, or 12%, of the shopping baskets would have both eggs and milk in them. The fact that this combination of products occurs 20% of the time would be out of the ordinary if these events were independent. That is to say, there is a good chance that the purchase of one affects the other, and the degree to which this is the case could be calculated through statistical tests and hypothesis testing.
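To make the arithmetic concrete, here is a minimal sketch in plain Python that recomputes the numbers above; the helper names rule_accuracy and rule_coverage are invented for illustration and are not part of any particular tool.

# Counts from the worked example above.
T = 100  # total shopping baskets in the database
E = 30   # baskets containing eggs
M = 40   # baskets containing milk
B = 20   # baskets containing both eggs and milk

def rule_accuracy(both, with_antecedent):
    # P(consequent | antecedent): baskets with both items / baskets with the antecedent item.
    return both / with_antecedent

def rule_coverage(with_antecedent, total):
    # P(antecedent): baskets with the antecedent item / all baskets.
    return with_antecedent / total

print("accuracy of 'milk => eggs':", rule_accuracy(B, M))            # 20/40 = 0.5
print("coverage of 'milk => eggs':", rule_coverage(M, T))            # 40/100 = 0.4
print("expected co-occurrence if independent:", (E / T) * (M / T))   # 0.3 * 0.4 = 0.12
print("observed co-occurrence:", B / T)                              # 0.20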

Defining interestingness
One of the biggest problems with rule induction systems is the sometimes overwhelming number of rules that are produced, most of which have no practical value or interest. Some of the rules are so inaccurate that they cannot be used, some have so little coverage that though they are interesting they have little applicability, and finally many of the rules capture patterns and information that the user is already familiar with. To combat this problem researchers have sought to measure the usefulness or interestingness of rules. Certainly any measure of interestingness would have something to do with accuracy and coverage. We might also expect it to have at least the following four basic behaviors:

Interestingness = 0 if the accuracy of the rule is equal to the background accuracy (the a priori probability of the consequent). The example in Table 2.5 shows this: a rule for attrition that is no better than just guessing the overall rate of attrition is not interesting.

Interestingness increases as accuracy increases (or decreases with decreasing accuracy) if the coverage is fixed.

Interestingness increases or decreases with coverage if accuracy stays fixed.

Interestingness decreases with coverage for a fixed number of correct responses (remember accuracy equals the number of correct responses divided by the coverage).

Antecedent                                          Consequent                   Accuracy   Coverage
<no constraints>                                    then customer will attrite   10%        100%
If customer balance > $3,000                        then customer will attrite   10%        60%
If customer eyes = blue                             then customer will attrite   10%        30%
If customer social security number = 144 30 8217   then customer will attrite   100%       0.000001%

Table 2.5 Uninteresting rules

There are a variety of measures of interestingness in use that have these general characteristics. They are used for pruning back the total possible number of rules that might be generated and then presented to the user.
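One simple score with the four behaviors listed above is coverage multiplied by the difference between the rule's accuracy and the background accuracy of the consequent. The sketch below applies it to the rules of Table 2.5; it is offered only as an illustration, not as the measure any particular rule induction product uses.

def interestingness(accuracy, coverage, background):
    # Zero when the rule is no better than guessing the background rate.
    return coverage * (accuracy - background)

background = 0.10  # overall attrition rate from Table 2.5
rules = [
    ("<no constraints>",              0.10, 1.00),
    ("balance > $3,000",              0.10, 0.60),
    ("eyes = blue",                   0.10, 0.30),
    ("social security number = ...",  1.00, 0.000001),
]
for antecedent, accuracy, coverage in rules:
    print(antecedent, "->", interestingness(accuracy, coverage, background))
# The first three score exactly 0 and the last about 0.0000009 - all uninteresting.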

Other measures of usefulness


Another important measure is the simplicity of the rule. This matters solely for the end user: complex rules, as powerful and as interesting as they might be, may be difficult to understand or to confirm via intuition. Thus users prefer simpler rules, and this preference can be built directly into the way rules are chosen and supplied automatically to the user. Finally, a measure of novelty is also required during the creation of the rules, so that rules that are redundant but strong are favored less in the search than rules that may not be as strong but cover important examples not covered by other strong rules. For instance, there may be few historical records for a little-sold grocery item (e.g. mint jelly); the rules derived from them may have low accuracy, but since so few rules are possible for that item, they will be novel and should be retained and presented to the user for that reason alone.


2.9. Neural Networks

What is a Neural Network?


When data mining algorithms are talked about these days, most of the time people are talking about either decision trees or neural networks. Of the two, neural networks have probably been of greater interest through the formative stages of data mining technology. As we will see, neural networks do have disadvantages that can be limiting in their ease of use and ease of deployment, but they also have some significant advantages. Foremost among these advantages is that they build highly accurate predictive models that can be applied across a large number of different types of problems.

To be more precise with the term neural network one might better speak of an artificial neural network. True neural networks are biological systems (a.k.a. brains) that detect patterns, make predictions and learn. The artificial ones are computer programs implementing sophisticated pattern detection and machine learning algorithms on a computer to build predictive models from large historical databases. Artificial neural networks derive their name from their historical development, which started off with the premise that machines could be made to think if scientists found ways to mimic the structure and functioning of the human brain on the computer. Thus historically neural networks grew out of the community of Artificial Intelligence rather than from the discipline of statistics. Despite the fact that scientists are still far from understanding the human brain, let alone mimicking it, neural networks that run on computers can do some of the things that people can do.

It is difficult to say exactly when the first neural network on a computer was built. During World War II a seminal paper was published by McCulloch and Pitts which first outlined the idea that simple processing units (like the individual neurons in the human brain) could be connected together in large networks to create a system that could solve difficult problems and display behavior that was much more complex than the simple pieces that made it up. Since that time much progress has been made in finding ways to apply artificial neural networks to real world prediction problems and in improving the performance of the algorithms in general. In many respects the greatest breakthroughs in neural networks in recent years have been in their application to more mundane real world problems like customer response prediction or fraud detection, rather than the loftier goals that were originally set out for the techniques, such as overall human learning and computer speech and image understanding.

Don't Neural Networks Learn to make better predictions?


Because of the origins of the techniques and because of some of their early successes, the techniques have enjoyed a great deal of interest. To understand how neural networks can detect patterns in a database, an analogy is often made that they learn to detect these patterns and make better predictions in a similar way to the way that human beings do. This view is encouraged by the way the historical training data is often supplied to the network - one record (example) at a time. Neural networks do learn in a very real sense, but under the hood the algorithms and techniques that are being deployed are not fundamentally different from the techniques found in statistics or other data mining algorithms. It is, for instance, unfair to assume that neural networks could outperform other techniques simply because they learn and improve over time while the other techniques are static. The other techniques in fact learn from historical examples in exactly the same way, but often the examples (historical records) to learn from are processed all at once in a more efficient manner than in neural networks, which often modify their model one record at a time.

Are Neural Networks easy to use?


A common claim for neural networks is that they are automated to a degree where the user does not need to know much about how they work, or about predictive modeling or even the database, in order to use them. The implicit claim is also that most neural networks can be unleashed on your data straight out of the box without having to rearrange or modify the data very much to begin with. Just the opposite is often true. There are many important design decisions that need to be made in order to use a neural network effectively, such as:

How should the nodes in the network be connected?
How many neuron-like processing units should be used?
When should training be stopped in order to avoid overfitting?

There are also many important steps required for preprocessing the data that goes into a neural network - most often there is a requirement to normalize numeric data between 0.0 and 1.0, and categorical predictors may need to be broken up into virtual predictors that are 0 or 1 for each value of the original categorical predictor. And, as always, understanding what the data in your database means and a clear definition of the business problem to be solved are essential to ensuring eventual success. The bottom line is that neural networks provide no shortcuts.
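As a concrete illustration of that preprocessing, the sketch below (plain Python; the records and the column names age, income and colour are invented for the example) normalizes numeric data to the 0.0-1.0 range and breaks a categorical predictor into 0/1 virtual predictors.

records = [
    {"age": 47, "income": 65000, "colour": "blue"},
    {"age": 23, "income": 24000, "colour": "white"},
    {"age": 65, "income": 80000, "colour": "black"},
]

def min_max(values):
    # Rescale so the smallest value becomes 0.0 and the largest 1.0.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = min_max([r["age"] for r in records])
incomes = min_max([r["income"] for r in records])

# One virtual 0/1 predictor per value of the categorical predictor "colour".
colour_values = sorted({r["colour"] for r in records})
colour_flags = [[1 if r["colour"] == c else 0 for c in colour_values] for r in records]

for row in zip(ages, incomes, colour_flags):
    print(row)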

Applying Neural Networks to Business


Neural networks are very powerful predictive modeling techniques, but some of the power comes at the expense of ease of use and ease of deployment. As we will see in this section, neural networks create very complex models that are almost always impossible to fully understand, even by experts. The model itself is represented by numeric values in a complex calculation that requires all of the predictor values to be in the form of a number. The output of the neural network is also numeric and needs to be translated if the actual prediction value is categorical (e.g. predicting the demand for blue, white or black jeans for a clothing manufacturer requires that the values blue, black and white for the predictor color be converted to numbers). Because of the complexity of these techniques much effort has been expended in trying to increase the clarity with which the model can be understood by the end user. These efforts are still in their infancy but are of tremendous importance since most data mining techniques, including neural networks, are being deployed against real business problems where significant investments are made based on the predictions from the models (e.g. consider trusting the predictive model from a neural network that dictates which one million customers will receive a $1 mailing).


There are two ways that these shortcomings in understanding the meaning of the neural network model have been successfully addressed:

The neural network is packaged up into a complete solution such as fraud prediction. This allows the neural network to be carefully crafted for one particular application, and once it has been proven successful it can be used over and over again without requiring a deep understanding of how it works.

The neural network is packaged up with expert consulting services. Here the neural network is deployed by trusted experts who have a track record of success. Either the experts are able to explain the models or they are trusted that the models do work.

The first tactic has seemed to work quite well, because when the technique is used for a well defined problem many of the difficulties in preprocessing the data can be automated (because the data structures have been seen before) and interpretation of the model is less of an issue since entire industries begin to use the technology successfully and a level of trust is created. There are several vendors who have deployed this strategy (e.g. HNC's Falcon system for credit card fraud prediction and Advanced Software Applications' ModelMAX package for direct marketing). Packaging up neural networks with expert consultants is also a viable strategy that avoids many of the pitfalls of using neural networks, but it can be quite expensive because it is human intensive. One of the great promises of data mining is, after all, the automation of the predictive modeling process. These neural network consulting teams are little different from the analytical departments many companies already have in house. Since there is not a great difference in the overall predictive accuracy of neural networks over standard statistical techniques, the main difference becomes the replacement of the statistical expert with the neural network expert. Either way - with statistical or with neural network experts - the value of putting easy-to-use tools into the hands of the business end user is still not achieved.

Where to Use Neural Networks


Neural networks are used in a wide variety of applications. They have been used in all facets of business, from detecting the fraudulent use of credit cards and credit risk prediction to increasing the hit rate of targeted mailings. They also have a long history of application in other areas, from military uses such as the automated driving of an unmanned vehicle at 30 miles per hour on paved roads, to biological simulations such as learning the correct pronunciation of English words from written text.

Neural Networks for clustering


Neural networks of various kinds can be used for clustering and prototype creation. The Kohonen network described in this section is probably the most common network used for clustering and segmentation of the database. Typically the networks are used in an unsupervised learning mode to create the clusters. The clusters are created either by forcing the system to compress the data by creating prototypes or by algorithms that steer the system toward creating clusters that compete against each other for the records that they contain, thus ensuring that the clusters overlap as little as possible.


Neural Networks for Outlier Analysis


Sometimes clustering is performed not so much to keep records together as to make it easier to see when one record sticks out from the rest. For instance:

Most wine distributors selling inexpensive wine in Missouri and shipping a certain volume of product produce a certain level of profit. A cluster of stores can be formed with these characteristics. One store stands out, however, as producing significantly lower profit. On closer examination it turns out that the distributor was delivering product to, but not collecting payment from, one of its customers.

A sale on men's suits is being held in all branches of a department store chain in southern California. All stores with these characteristics have seen at least a 100% jump in revenue since the start of the sale except one. It turns out that this store had, unlike the others, advertised via radio rather than television.

Neural Networks for feature extraction


One of the important problems in all of data mining is that of determining which predictors are the most relevant and the most important in building models that are most accurate at prediction. These predictors may be used by themselves or they may be used in conjunction with other predictors to form features. A simple example of a feature in problems that neural networks are working on is the feature of a vertical line in a computer image. The predictors, or raw input data, are just the colored pixels that make up the picture. Recognizing that the predictors (pixels) can be organized in such a way as to create lines, and then using the line as the input predictor, can dramatically improve the accuracy of the model and decrease the time to create it.

Some features, like lines in computer images, are things that humans are already pretty good at detecting; in other problem domains it is more difficult to recognize the features. One novel way that neural networks have been used to detect features is the idea that features are a sort of compression of the training database. For instance, you could describe an image to a friend by rattling off the color and intensity of each pixel on every point in the picture, or you could describe it at a higher level in terms of lines and circles - or maybe even at a higher level of features such as trees, mountains etc. In either case your friend eventually gets all the information needed to know what the picture looks like, but certainly describing it in terms of high level features requires much less communication of information than the paint-by-numbers approach of describing the color on each square millimeter of the image.

If we think of features in this way, as an efficient way to communicate our data, then neural networks can be used to automatically extract them. The neural network shown in Figure 2.2 is used to extract features by requiring the network to learn to recreate the input data at the output nodes by using just 5 hidden nodes. Consider that if you were allowed 100 hidden nodes, recreating the data for the network would be rather trivial - simply pass the input node value directly through the corresponding hidden node and on to the output node. But as there are fewer and fewer hidden nodes, that information has to be passed through the hidden layer in a more and more efficient manner, since there are fewer hidden nodes to help pass along the information.


Figure 2.2 Neural networks can be used for data compression and feature extraction.

In order to accomplish this the neural network tries to have the hidden nodes extract features from the input nodes that efficiently describe the record represented at the input layer. This forced squeezing of the data through the narrow hidden layer forces the neural network to extract only those predictors and combinations of predictors that are best at recreating the input record. The link weights used to create the inputs to the hidden nodes are effectively creating features that are combinations of the input nodes' values.
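A minimal sketch of this squeeze-through-a-narrow-hidden-layer idea is shown below, using scikit-learn's MLPRegressor trained to reproduce its own inputs through a 5-node hidden layer. The data is random and the sizes are arbitrary; this illustrates the technique rather than reproducing the network of Figure 2.2.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 20))   # 200 records with 20 input predictors (invented data)

# Train the network to recreate its inputs through a narrow hidden layer of 5 nodes.
net = MLPRegressor(hidden_layer_sizes=(5,), activation="logistic",
                   max_iter=2000, random_state=0)
net.fit(X, X)

# The hidden-node activations are the extracted features: 5 numbers per record.
features = 1.0 / (1.0 + np.exp(-(X @ net.coefs_[0] + net.intercepts_[0])))
print(features.shape)       # (200, 5)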

What does a neural net look like?


A neural network is loosely based on how some people believe the human brain is organized and how it learns. Given that origin, there are two main structures of consequence in the neural network:

The node - which loosely corresponds to the neuron in the human brain.
The link - which loosely corresponds to the connections between neurons (axons, dendrites and synapses) in the human brain.

In Figure 2.3 there is a drawing of a simple neural network. The round circles represent the nodes and the connecting lines represent the links. The neural network functions by accepting predictor values at the left and performing calculations on those values to produce new values in the node at the far right. The value at this node represents the prediction from the neural network model. In this case the network takes in values for the predictors age and income and predicts whether the person will default on a bank loan.


Figure 2.3 A simplified view of a neural network for prediction of loan default.

How does a neural net make a prediction?


In order to make a prediction the neural network accepts the values for the predictors on what are called the input nodes. These become the values for those nodes; those values are then multiplied by values that are stored in the links (sometimes called weights, and in some ways similar to the weights that were applied to predictors in the nearest neighbor method). These values are then added together at the node at the far right (the output node), a special thresholding function is applied, and the resulting number is the prediction. In this case, if the resulting number is 0 the record is considered to be a good credit risk (no default); if the number is 1 the record is considered to be a bad credit risk (likely default).

A simplified version of the calculations made in Figure 2.3 might look like what is shown in Figure 2.4. Here the value age of 47 is normalized to fall between 0.0 and 1.0 and has the value 0.47, and the income is normalized to the value 0.65. This simplified neural network makes the prediction of no default for a 47 year old making $65,000. The links are weighted at 0.7 and 0.1, and the resulting value after multiplying the node values by the link weights is 0.39. The network has been trained to learn that an output value of 1.0 indicates default and that 0.0 indicates non-default. The output value calculated here (0.39) is closer to 0.0 than to 1.0, so the record is assigned a non-default prediction.


Figure 2.4 The normalized input values are multiplied by the link weights and added together at the output.
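The arithmetic of Figure 2.4 can be reproduced in a few lines. This is only a sketch of the simplified example; the 0.5 cut-off used to decide which of 0.0 and 1.0 the output is closer to is an assumption, since the text only says that 0.39 is nearer to 0.0.

# Normalized predictor values from the example: age 47 -> 0.47, income $65,000 -> 0.65.
inputs = {"age": 0.47, "income": 0.65}
link_weights = {"age": 0.7, "income": 0.1}

# Multiply each input value by its link weight and sum at the output node.
output = sum(inputs[name] * link_weights[name] for name in inputs)
print(output)                                  # 0.7*0.47 + 0.1*0.65 = 0.394

# 1.0 means default, 0.0 means non-default; 0.394 is closer to 0.0.
print("default" if output >= 0.5 else "no default")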

How is the neural net model created?


The neural network model is created by presenting it with many examples of the predictor values from records in the training set (in this example age and income are used) and the prediction value from those same records. By comparing the correct answer obtained from the training record with the predicted answer from the neural network, it is possible to slowly change the behavior of the neural network by changing the values of the link weights. In some ways this is like having a grade school teacher ask questions of her student (a.k.a. the neural network) and, if the answer is wrong, verbally correct the student. The greater the error, the harsher the verbal correction, so that large errors are given greater attention at correction than are small errors. For the actual neural network it is the weights of the links that actually control the prediction value for a given record. Thus the particular model that is being found by the neural network is in fact fully determined by the weights and the architectural structure of the network. For this reason it is the link weights that are modified each time an error is made.

How complex can the neural network model become?


The models shown in the figures above have been designed to be as simple as possible in order to make them understandable. In practice no networks are as simple as these. Networks with many more links and many more nodes are possible. This was the case in the architecture of a neural network system called NETtalk that learned how to pronounce written English words. Each node in this network was connected to every node in the level above it and below it resulting in 18,629 link weights that needed to be learned in the network.


In this network there was a row of nodes in between the input nodes and the output nodes. These are called hidden nodes, or the hidden layer, because the values of these nodes are not visible to the end user the way that the output nodes (which contain the prediction) and the input nodes (which just contain the predictor values) are. There are even more complex neural network architectures that have more than one hidden layer. In practice one hidden layer seems to suffice, however.

Hidden nodes are like trusted advisors to the output nodes


The meaning of the input nodes and the output nodes is usually pretty well understood and is usually defined by the end user based on the particular problem to be solved and the nature and structure of the database. The hidden nodes, however, do not have a predefined meaning and are determined by the neural network as it trains, which poses two problems:

It is difficult to trust the prediction of the neural network if the meaning of these nodes is not well understood.
Since the prediction is made at the output layer and the difference between the prediction and the actual value is calculated there, how is this error correction fed back through the hidden layers to modify the link weights that connect them?

The meaning of these hidden nodes is not necessarily well understood, but sometimes after the fact they can be looked at to see when they are active and when they are not, and some meaning can be derived from them.

The learning that goes on in the hidden nodes.


The learning procedure for the neural network has been defined to work for the weights in the links connecting the hidden layer. A good metaphor for how this works is to think of a military operation in some war where there are many layers of command, with a general ultimately responsible for making the decisions on where to advance and where to retreat. The general probably has several lieutenant generals advising him, and each lieutenant general probably has several major generals advising him. This hierarchy continues downward through colonels to the privates at the bottom. This is not too far from the structure of a neural network with several hidden layers and one output node. You can think of the inputs coming from the hidden nodes as advice. The link weight corresponds to the trust that the general has in his advisors. Some trusted advisors have very high weights, and some advisors may not be trusted and in fact have negative weights. The other part of the advice from the advisors has to do with how competent the particular advisor is for a given situation. The general may have a trusted advisor, but if that advisor has no expertise in aerial invasion and the question at hand involves the air force, this advisor may be very well trusted but may not have any strong opinion one way or another.

In this analogy the link weight of a neural network to an output unit is like the trust or confidence that a commander has in his advisors, and the actual node value represents how strong an opinion this particular advisor has about this particular situation. To make a decision the general considers how trustworthy and valuable the advice is and how knowledgeable and confident each advisor is in making their suggestion, and then, taking all of this into account, the general makes the decision to advance or retreat.


In the same way the output node makes a decision (a prediction) by taking into account all of the input from its advisors (the nodes connected to it). In the case of the neural network this decision is reached by multiplying the link weight by the output value of the node and summing these values across all nodes. If the prediction is incorrect, the nodes that had the most influence on making the decision have their weights modified so that the wrong prediction is less likely to be made the next time. This learning in the neural network is very similar to what happens when the wrong decision is made by the general. The confidence that the general has in all of those advisors that gave the wrong recommendation is decreased - and all the more so for those advisors who were very confident and vocal in their recommendation. On the other hand, any advisors who were making the correct recommendation but whose input was not taken as seriously would be taken more seriously the next time. Likewise, any advisor that was reprimanded for giving the wrong advice to the general would then go back to his own advisors and determine which of them he had trusted more than he should have in making his recommendation, and whom he should have listened to more closely.

Sharing the blame and the glory throughout the organization


This feedback can continue in this way down throughout the organization - at each level giving increased emphasis to those advisors who had advised correctly and decreased emphasis to those who had advised incorrectly. In this way the entire organization becomes better and better at supporting the general in making the correct decision more of the time.

A very similar method of training takes place in the neural network. It is called back propagation and refers to the propagation of the error backwards from the output nodes (where the error is easy to determine, as the difference between the actual prediction value from the training database and the prediction from the neural network) through the hidden layers and to the input layers. At each level the link weights between the layers are updated so as to decrease the chance of making the same mistake again.
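A minimal back propagation sketch follows, written with numpy on a tiny invented set of normalized age and income values. It illustrates the forward pass, the backward propagation of the error, and the weight updates; the network size, learning rate and data are arbitrary choices, not the procedure of any particular product.

import numpy as np

rng = np.random.default_rng(0)

# Invented training data: normalized age and income; 1 = default, 0 = no default.
X = np.array([[0.47, 0.65], [0.10, 0.05], [0.90, 0.20], [0.30, 0.90]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 3 nodes and one output node.
W1 = rng.normal(scale=0.5, size=(2, 3)); b1 = np.zeros(3)
W2 = rng.normal(scale=0.5, size=(3, 1)); b2 = np.zeros(1)
rate = 0.5

for epoch in range(5000):
    # Forward pass: inputs -> hidden layer -> output node.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Backward pass: push the output error back through the hidden layer.
    err_out = (output - y) * output * (1 - output)
    err_hid = (err_out @ W2.T) * hidden * (1 - hidden)

    # Update the link weights so the same mistake is less likely next time.
    W2 -= rate * hidden.T @ err_out; b2 -= rate * err_out.sum(axis=0)
    W1 -= rate * X.T @ err_hid;      b1 -= rate * err_hid.sum(axis=0)

print(np.round(output, 2))   # the predictions move toward the 0/1 targets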

Different types of neural networks


There are literally hundreds of variations on the back propagation feedforward neural networks that have been briefly described here. Most have to do with changing the architecture of the neural network to include recurrent connections, where the output from the output layer is connected back as input into the hidden layer. These recurrent nets are sometimes used for sequence prediction, where the previous outputs from the network need to be stored someplace and then fed back into the network to provide context for the current prediction. Recurrent networks have also been used for decreasing the amount of time that it takes to train the neural network. Another twist on the neural net theme is to change the way that the network learns. Back propagation is effectively utilizing a search technique called gradient descent to search for the best possible improvement in the link weights to reduce the error. There are, however, many other ways of searching a high dimensional space, including Newton's methods and conjugate gradient, as well as simulating the physics of cooling metals in a process called simulated annealing, or simulating the search process that goes on in biological evolution and using genetic algorithms to optimize the weights of the neural networks. It has even been suggested that creating a large number of neural networks with randomly weighted links and picking the one with the lowest error rate would be the best learning procedure. Despite all of these choices, the back propagation learning procedure is the most commonly used. It is well understood, relatively simple, and seems to work in a large number of problem domains. There are, however, two other neural network architectures that are used relatively often. Kohonen feature maps are often used for unsupervised learning and clustering, and Radial Basis Function networks are used for supervised learning and in some ways represent a hybrid between nearest neighbor and neural network classification.

Kohonen Feature Maps


Kohonen feature maps were developed in the 1970s and were originally created to simulate certain brain functions. Today they are used mostly to perform unsupervised learning and clustering. Kohonen networks are feedforward neural networks, generally with no hidden layer. The networks generally contain only an input layer and an output layer, but the nodes in the output layer compete amongst themselves to display the strongest activation for a given record, in what is sometimes called "winner take all". The networks originally came about when some of the puzzling yet simple behaviors of real neurons were taken into account - namely, that the physical locality of the neurons seems to play an important role in their behavior and learning. When these networks were run in order to simulate the real-world visual system, it became clear that the organization that was automatically being constructed on the data was also very useful for segmenting and clustering the training database. Each output node represents a cluster, and nearby clusters are nearby in the two dimensional output layer. Each record in the database falls into one and only one cluster (the most active output node), but the other clusters into which it might also fit are likely to be next to the best matching cluster.
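The winner-take-all training idea can be sketched in a few lines of numpy. The grid size, learning rate and neighborhood schedule below are arbitrary choices for illustration, and the two-predictor data set is invented.

import numpy as np

rng = np.random.default_rng(1)
data = rng.random((500, 2))        # invented records with two normalized predictors
grid = rng.random((5, 5, 2))       # 5x5 output layer; each node holds a weight vector

for step in range(2000):
    x = data[rng.integers(len(data))]
    # Winner take all: the output node whose weights are closest to the record.
    dist = np.linalg.norm(grid - x, axis=2)
    wi, wj = np.unravel_index(np.argmin(dist), dist.shape)
    # Move the winner and its neighbors on the grid toward the record.
    rate = 0.5 * (1 - step / 2000)
    radius = 2.0 * (1 - step / 2000) + 0.5
    for i in range(5):
        for j in range(5):
            d = np.hypot(i - wi, j - wj)
            if d <= radius:
                grid[i, j] += rate * np.exp(-d * d / (2 * radius * radius)) * (x - grid[i, j])

# Each record falls into the cluster of its most active (closest) output node.
x = data[0]
print(np.unravel_index(np.argmin(np.linalg.norm(grid - x, axis=2)), (5, 5)))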

How much like a human brain is the neural network?


Since the inception of the idea of neural networks the ultimate goal for these techniques has been to have them recreate human thought and learning. This has once again proved to be a difficult task - despite the power of these new techniques and the similarities of their architecture to that of the human brain. Many of the things that people take for granted are difficult for neural networks - like avoiding overfitting and working with real world data without a lot of preprocessing required. There have also been some exciting successes.

Combatting overfitting - getting a model you can use somewhere else


As with all predictive modeling techniques, some care must be taken to avoid overfitting with a neural network. Neural networks can be quite good at overfitting training data with a predictive model that does not work well on new data. This is particularly problematic for neural networks because it is difficult to understand how the model is working. In the early days of neural networks the predictive accuracy that was often mentioned first was the accuracy on the training set, and the accuracy on the held-aside or validation set database was reported as a footnote. This is in part due to the fact that, unlike decision trees or nearest neighbor techniques, which can quickly achieve 100% predictive accuracy on the training database, neural networks can be trained forever and still not be 100% accurate on the training set. While this is an interesting fact it is not terribly relevant, since the accuracy on the training set is of little interest and can have little bearing on the validation database accuracy. Perhaps because overfitting was more obvious for decision trees and nearest neighbor approaches, more effort was put in earlier on adding pruning and editing to those techniques. For neural networks, generalization of the predictive model is accomplished via rules of thumb and sometimes in a more methodical way by using cross validation as is done with decision trees.

One way to control overfitting in neural networks is to limit the number of links. Since the number of links represents the complexity of the model that can be produced, and since more complex models have the ability to overfit while less complex ones cannot, overfitting can be controlled by simply limiting the number of links in the neural network. Unfortunately there are no good theoretical grounds for picking a certain number of links.

Test set validation can be used to avoid overfitting by building the neural network on one portion of the training database and using the other portion of the training database to detect what the predictive accuracy is on held-aside data. This accuracy will peak at some point in the training, and then as training proceeds it will decrease while the accuracy on the training database continues to increase. The link weights for the network can be saved when the accuracy on the held-aside data peaks. The NeuralWare product, and others, provide an automated function that saves out the network when it is best performing on the test set and even continues to search after the minimum is reached.
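A sketch of that test set validation idea is shown below using scikit-learn's MLPClassifier, whose early_stopping option holds aside part of the training data and stops when accuracy on it stops improving. The data is synthetic, and this is one convenient way to do it, not the specific mechanism of the NeuralWare product mentioned above.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Hold aside 20% of the training data; stop when its accuracy stops improving.
net = MLPClassifier(hidden_layer_sizes=(10,), early_stopping=True,
                    validation_fraction=0.2, n_iter_no_change=10,
                    max_iter=1000, random_state=0)
net.fit(X_train, y_train)

print("training accuracy: ", net.score(X_train, y_train))
print("held-aside accuracy:", net.score(X_test, y_test))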

Explaining the network


One of the indictments against neural networks is that it is difficult to understand the model that they have built and also how the raw data affects the output prediction. With nearest neighbor techniques, prototypical records are provided to explain why the prediction is made, and decision trees provide rules that can be translated into English to explain why a particular prediction was made for a particular record. The complex models of the neural network are captured solely by the link weights in the network, which represent a very complex mathematical equation.

There have been several attempts to alleviate these basic problems of the neural network. The simplest approach is to actually look at the neural network and try to create plausible explanations for the meanings of the hidden nodes. Sometimes this can be done quite successfully. In the example given at the beginning of this section, the hidden nodes of the neural network seemed to have extracted important distinguishing features in predicting the relationship between people, such as country of origin - features that it would seem a person would also extract and use for the prediction. But there were also many other hidden nodes, even in this particular example, that were hard to explain and didn't seem to have any particular purpose, except that they aided the neural network in making the correct prediction.


2.10 Genetic Algorithm

A genetic algorithm (GA) is a search technique used in computing to find exact or approximate solutions to optimization and search problems. Genetic algorithms are categorized as global search heuristics. Genetic algorithms are a particular class of evolutionary algorithms (also known as evolutionary computation) that use techniques inspired by evolutionary biology such as inheritance, mutation, selection, and crossover (also called recombination).

First, a Biology Lesson

Every organism has a set of rules, a blueprint so to speak, describing how that organism is built up from the tiny building blocks of life. These rules are encoded in the genes of an organism, which in turn are connected together into long strings called chromosomes. Each gene represents a specific trait of the organism, like eye colour or hair colour, and has several different settings. For example, the settings for a hair colour gene may be blonde, black or auburn. These genes and their settings are usually referred to as an organism's genotype. The physical expression of the genotype - the organism itself - is called the phenotype.

When two organisms mate they share their genes. The resultant offspring may end up having half the genes from one parent and half from the other. This process is called recombination. Very occasionally a gene may be mutated. Normally this mutated gene will not affect the development of the phenotype, but very occasionally it will be expressed in the organism as a completely new trait. Life on earth has evolved to be as it is through the processes of natural selection, recombination and mutation.

As you can see, the processes of natural selection - survival of the fittest - and gene mutation have very powerful roles to play in the evolution of an organism. But how does recombination fit into the scheme of things? Genetic Algorithms are a way of solving problems by mimicking the same processes mother nature uses. They use the same combination of selection, recombination and mutation to evolve a solution to a problem.

The Genetic Algorithm - a brief overview


Before you can use a genetic algorithm to solve a problem, a way must be found of encoding any potential solution to the problem. This could be as a string of real numbers or, as is more typically the case, a binary bit string. I will refer to this bit string from now on as the chromosome. A typical chromosome may look like this:

10010101110101001010011101101110111111101

(Don't worry if none of this is making sense to you at the moment; it will all start to become clear shortly. For now, just relax and go with the flow.) At the beginning of a run of a genetic algorithm a large population of random chromosomes is created. Each one, when decoded, will represent a different solution to the problem at hand. Let's say there are N chromosomes in the initial population. Then, the following steps are repeated until a solution is found (a minimal code sketch of this loop follows the steps):

1. Test each chromosome to see how good it is at solving the problem at hand and assign a fitness score accordingly. The fitness score is a measure of how good that chromosome is at solving the problem at hand.

2. Select two members from the current population. The chance of being selected is proportional to the chromosome's fitness. Roulette wheel selection is a commonly used method.

3. Dependent on the crossover rate, cross over the bits from each chosen chromosome at a randomly chosen point.

4. Step through the chosen chromosomes' bits and flip them dependent on the mutation rate.

5. Repeat steps 2, 3 and 4 until a new population of N members has been created.
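A minimal sketch of this loop is shown below. The problem it solves (evolve a chromosome that is all 1s) and the parameter values are invented stand-ins so the sketch stays short; the encoding and fitness for the arithmetic problem used in the rest of this section are discussed in the stages below.

import random

random.seed(0)
N, LENGTH = 50, 20                      # population size and chromosome length
CROSSOVER_RATE, MUTATION_RATE = 0.7, 0.01

def fitness(chrom):
    # Stand-in fitness: the more 1 bits, the better the chromosome.
    return sum(chrom)

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(N)]

for generation in range(200):
    if max(fitness(c) for c in population) == LENGTH:
        break                           # a solution has been found
    scores = [fitness(c) for c in population]
    new_population = []
    while len(new_population) < N:
        # Step 2: select two members with probability proportional to fitness.
        mum, dad = random.choices(population, weights=scores, k=2)
        child1, child2 = mum[:], dad[:]
        # Step 3: dependent on the crossover rate, cross over at a random point.
        if random.random() < CROSSOVER_RATE:
            point = random.randrange(1, LENGTH)
            child1, child2 = mum[:point] + dad[point:], dad[:point] + mum[point:]
        # Step 4: step through the bits and flip them dependent on the mutation rate.
        for child in (child1, child2):
            for i in range(LENGTH):
                if random.random() < MUTATION_RATE:
                    child[i] = 1 - child[i]
        new_population += [child1, child2]
    population = new_population         # Step 5: the new population replaces the old

print("best chromosome:", max(population, key=fitness))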

Tell me about Roulette Wheel selection

This is a way of choosing members from the population of chromosomes in a way that is proportional to their fitness. It does not guarantee that the fittest member goes through to the next generation, merely that it has a very good chance of doing so. It works like this: imagine that the population's total fitness score is represented by a pie chart, or roulette wheel. Now you assign a slice of the wheel to each member of the population. The size of the slice is proportional to that chromosome's fitness score, i.e. the fitter a member is, the bigger the slice of pie it gets. Now, to choose a chromosome, all you have to do is spin the ball and grab the chromosome at the point it stops.
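The spin itself can be written explicitly; the fitness scores below are invented just to show that fitter members are chosen more often.

import random

def roulette_select(population, fitness_scores):
    # Pick one member with probability proportional to its slice of the total fitness.
    total = sum(fitness_scores)
    spin = random.uniform(0, total)
    running = 0.0
    for member, score in zip(population, fitness_scores):
        running += score
        if running >= spin:
            return member
    return population[-1]               # guard against floating point round-off

population = ["chrom_a", "chrom_b", "chrom_c"]
fitness_scores = [1.0, 3.0, 6.0]        # chrom_c should win roughly 60% of the spins
picks = [roulette_select(population, fitness_scores) for _ in range(1000)]
print({m: picks.count(m) for m in population})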

What's the Crossover Rate?

This is simply the chance that two chromosomes will swap their bits. A good value for this is around 0.7. Crossover is performed by selecting a random gene along the length of the chromosomes and swapping all the genes after that point.


e.g. Given two chromosomes

10001001110010010
01010001001000011

choose a random bit along the length, say at position 9, and swap all the bits after that point, so the above become:

10001001101000011
01010001010010010

What's the Mutation Rate?

This is the chance that a bit within a chromosome will be flipped (0 becomes 1, 1 becomes 0). This is usually a very low value for binary encoded genes, say 0.001. So whenever chromosomes are chosen from the population, the algorithm first checks to see if crossover should be applied, and then the algorithm iterates down the length of each chromosome, mutating the bits if applicable.

From Theory to Practice

To hammer home the theory you've just learnt, let's look at a simple problem: given the digits 0 through 9 and the operators +, -, * and /, find a sequence that will represent a given target number. The operators will be applied sequentially from left to right as you read. So, given the target number 23, the sequence 6+5*4/2+1 would be one possible solution. If 75.5 is the chosen number then 5/2+9*7-5 would be a possible solution. Please make sure you understand the problem before moving on. I know it's a little contrived but I've used it because it's very simple.

Stage 1: Encoding

First we need to encode a possible solution as a string of bits: a chromosome. So how do we do this? Well, first we need to represent all the different characters available to the solution... that is 0 through 9 and +, -, * and /. This will represent a gene. Each chromosome will be made up of several genes. Four bits are required to represent the range of characters used:


0:  0000
1:  0001
2:  0010
3:  0011
4:  0100
5:  0101
6:  0110
7:  0111
8:  1000
9:  1001
+:  1010
-:  1011
*:  1100
/:  1101

The above shows all the different genes required to encode the problem as described. The possible genes 1110 and 1111 will remain unused and will be ignored by the algorithm if encountered. So now you can see that the solution mentioned above for 23, '6+5*4/2+1', would be represented by nine genes like so:

0110 1010 0101 1100 0100 1101 0010 1010 0001
6    +    5    *    4    /    2    +    1

These genes are all strung together to form the chromosome:

011010100101110001001101001010100001

A Quick Word about Decoding

Because the algorithm deals with random arrangements of bits it is often going to come across a string of bits like this:

0010001010101110101101110010

Decoded, these bits represent:

0010 0010 1010 1110 1011 0111 0010
2    2    +    n/a  -    7    2

This is meaningless in the context of this problem! Therefore, when decoding, the algorithm will just ignore any genes which don't conform to the expected pattern of:


number -> operator -> number -> operator and so on. With this in mind the above nonsense chromosome is read (and tested) as: 2 + 7
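That decoding rule can be sketched as follows; the gene table is the one given above, and the function name decode is simply my own choice for this illustration.

GENES = {
    "0000": "0", "0001": "1", "0010": "2", "0011": "3", "0100": "4",
    "0101": "5", "0110": "6", "0111": "7", "1000": "8", "1001": "9",
    "1010": "+", "1011": "-", "1100": "*", "1101": "/",
}

def decode(chromosome):
    # Split into 4-bit genes and keep only the pattern number, operator, number, ...
    symbols = [GENES.get(chromosome[i:i + 4]) for i in range(0, len(chromosome), 4)]
    expression, want_digit = [], True
    for s in symbols:
        if s is None:                   # 1110 and 1111 are the unused genes
            continue
        if want_digit and s.isdigit():
            expression.append(s); want_digit = False
        elif not want_digit and s in "+-*/":
            expression.append(s); want_digit = True
    if expression and expression[-1] in "+-*/":
        expression.pop()                # a trailing operator has nothing to apply to
    return expression

print(decode("0010001010101110101101110010"))            # ['2', '+', '7']
print(decode("011010100101110001001101001010100001"))    # ['6', '+', '5', '*', '4', '/', '2', '+', '1']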

Stage 2: Deciding on a Fitness Function

This can be the most difficult part of the algorithm to figure out. It really depends on what problem you are trying to solve, but the general idea is to give a higher fitness score the closer a chromosome comes to solving the problem. With regards to the simple project I'm describing here, a fitness score can be assigned that's inversely proportional to the difference between the solution and the value a decoded chromosome represents. If we assume the target number for the remainder of the tutorial is 42, the chromosome mentioned above,

011010100101110001001101001010100001

has a fitness score of 1/(42-23) or 1/19. As it stands, if a solution is found, a divide by zero error would occur as the fitness would be 1/(42-42). This is not a problem however, as we have found what we were looking for... a solution. Therefore a test can be made for this occurrence and the algorithm halted accordingly.
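Continuing the decoding sketch above (it reuses the decode function and gene table defined there), a fitness function along the lines just described might look like this; the left-to-right evaluation follows the problem statement, and the infinite score for an exact match is just one way of signalling that the algorithm can halt.

def evaluate(expression):
    # Apply the operators strictly left to right, as the problem statement requires.
    result = float(expression[0])
    for op, digit in zip(expression[1::2], expression[2::2]):
        if op == "+":   result += float(digit)
        elif op == "-": result -= float(digit)
        elif op == "*": result *= float(digit)
        else:           result /= float(digit)   # "/" (a real run would guard against /0)
    return result

def fitness(chromosome, target=42):
    expression = decode(chromosome)
    if not expression:
        return 0.0                               # nothing usable was decoded
    value = evaluate(expression)
    if value == target:
        return float("inf")                      # solution found - halt the algorithm
    return 1.0 / abs(target - value)

print(fitness("011010100101110001001101001010100001"))   # 6+5*4/2+1 = 23, so 1/19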

Stage 3: Getting down to business

First, please read this tutorial again. If you now feel you understand enough to solve this problem, I would recommend trying to code the genetic algorithm yourself. There is no better way of learning. If, however, you are still confused, I have already prepared some simple code which you can find here. Please tinker around with the mutation rate, crossover rate, size of chromosome etc. to get a feel for how each parameter affects the algorithm. Hopefully the code should be documented well enough for you to follow what is going on! If not, please email me and I'll try to improve the commenting.

Note: The code given will parse a chromosome bit string into the values we have discussed, and it will attempt to find a solution which uses all the valid symbols it has found. Therefore if the target is 42, + 6 * 7 / 2 would not give a positive result even though the first four symbols ("+ 6 * 7") do give a valid solution.

Stuff to Try


If you have succeeded in coding a genetic algorithm to solve the problem given in the tutorial, try having a go at the following more difficult problem: Given an area that has a number of non overlapping disks scattered about its surface as shown in Screenshot 1,

Screenshot 1

Use a genetic algorithm to find the disk of largest radius which may be placed amongst these disks without overlapping any of them. See Screenshot 2.


Screenshot 2

2.11 KDD (Knowledge Discovery in Databases)

Looming atop a wide variety of human activities are the menacing profiles of ever-growing mountains of data. These mountains grew as a result of great engineering successes that enabled us to build devices to generate, collect, and store digital data. With major advances in database technology came the creation of huge, efficient data stores. Advances in computer networking have enabled the data glut to reach anyone who cares to tap in. Unfortunately, we have not witnessed corresponding advances in computational techniques to help us analyze the accumulated data. Without such developments, we risk missing most of what the data have to offer. Be it a satellite orbiting our planet, a medical imaging device, a credit-card transaction verification system, or a supermarket's checkout system, the human at the other end of the data gathering and storage machinery is faced with the same problem: what to do with all this data? Ignoring whatever we cannot analyze would be wasteful and unwise. Should one choose to ignore valuable information buried within the data, then one's competition may put it to good use, perhaps to one's detriment. In scientific endeavours, data represents observations carefully collected about some phenomena under study, and the race is on for who can explain the observations best. In business endeavours, data captures information about markets, competitors, and customers. In manufacturing, data captures performance and optimization opportunities, and keys to improving processes and troubleshooting problems.

The value of raw data is typically predicated on the ability to extract higher level information: information useful for decision support, for exploration, and for better understanding of the phenomena generating the data. Traditionally, humans have done the task of analysis. One or more analysts get intimately familiar with the data and with the help of statistical techniques provide summaries and generate reports. In effect, analysts determine the right queries to ask and sometimes even act as sophisticated query processors. Such an approach rapidly breaks down as the volume and dimensionality of the data increase. Who could be expected to "understand" millions of cases, each having hundreds of fields? To further complicate the situation, the data grow and change at rates that would quickly overwhelm manual analysis (even if it were possible). Hence tools to aid in at least the partial automation of analysis tasks are becoming a necessity.

Why Data Mining and Knowledge Discovery?

Knowledge Discovery in Databases (KDD) is concerned with extracting useful information from databases. The term data mining has historically been used in the database community and in statistics (often in the latter with negative connotations to indicate improper data analysis). We take the view that any algorithm that enumerates patterns from, or fits models to, data is a data mining algorithm. We further view data mining to be a single step in a larger process that we call the KDD process. The various steps of the process include data warehousing, target data selection, cleaning, preprocessing, transformation and reduction, data mining, model selection (or combination), evaluation and interpretation, and finally consolidation and use of the extracted "knowledge". Hence data mining is but a step in this iterative and interactive process. We chose to include it in the name of the journal because it represents a majority of the published research work, and because we wanted to build bridges between the various communities that work on topics related to data mining.

KDD's goal, as stated above, is very broad, and can describe a multitude of fields of study. Statistics has been preoccupied with this goal for over a century. So have many other fields including database systems, pattern recognition, artificial intelligence, data visualization, and a host of activities related to data analysis. So why has a separate community emerged under the name "KDD"? The answer: new approaches, techniques, and solutions have to be developed to enable analysis of large databases. Faced with massive data sets, traditional approaches in statistics and pattern recognition collapse. For example, a statistical analysis package (e.g. K-means clustering in your favorite Fortran library) assumes data can be "loaded" into memory and then manipulated. What happens when the data set will not fit in main memory? What happens if the database is on a remote server and will never permit a naive scan of the data? How do I sample effectively if I am not permitted to query for a stratified sample because the relevant fields are not indexed? What if the data set is in a multitude of tables (relations) and can only be accessed via some hierarchically structured set of fields?

What if the relations are sparse (not all fields are defined or even applicable to any fixed subset of the data)? How do I fit a statistical model with a large number of variables?

The open problems are not restricted to scalability issues of storage, access, and scale. For example, a problem that is not addressed by the database field is one I like to call the "query formulation problem": what to do if one does not know how to specify the desired query to begin with? For example, it would be desirable for a bank to issue a query at a high level: "give me all transactions whose likelihood of being fraudulent exceeds 0.75". It is not clear that one can write a SQL query (or even a program) to retrieve the target. Most interesting queries that arise with end-users of the data are of this class. KDD provides an alternative solution to this problem. Assuming that certain cases in the database can be identified as "fraudulent" and others as "known to be legitimate", one can construct a training sample for a data mining algorithm, let the algorithm build a predictive model, and then retrieve the records that the model triggers on. This is an example of a much needed and much more natural interface between humans and databases. Issues of inference under uncertainty, search for patterns and parameters in large spaces, and so on are also fundamental to KDD. While these issues are studied in many related fields, approaches to solving them in the context of large databases are unique to KDD. I outline several other issues and challenges for KDD later in this editorial, and I am sure future pages of this journal will unveil many problems we have not thought of yet.

Related Fields

Many research communities are strongly related to KDD. For example, by our definition, all work in classification and clustering in statistics, pattern recognition, neural networks, machine learning, and databases would fit under the data mining step. In addition to exploratory data analysis (EDA), statistics overlaps with KDD in many other steps including data selection and sampling, preprocessing, transformation, and evaluation of extracted knowledge. The Database field is of fundamental importance to KDD. The efficient and reliable storage and retrieval of the data, as well as issues of flexible querying and query optimization, are important enabling techniques. In addition, contributions from the database research literature in the area of data mining are beginning to appear. On-line Analytical Processing (OLAP) is an evolving field with very strong ties to databases, data warehousing, and KDD. While the emphasis in OLAP is still primarily on data visualization and query-driven exploration, automated techniques for data mining can play a major role in making OLAP more useful and easier to apply.

Other related fields include optimization (in search), high-performance and parallel computing, knowledge modeling, the management of uncertainty, and data visualization. Data visualization can contribute to effective EDA and visualization of extracted knowledge. Data mining can enable the visualization of patterns hidden deep within the data and embedded in much higher dimensional spaces. For example, a clustering method can segment the data into homogeneous subsets that are easier to describe and visualize.

These in turn can be displayed to the user instead of attempting to display the entire data set (or a global random sample of it), which usually results in missing the embedded patterns.

In an ideal world, KDD should have evolved as a proper subset of statistics. However, statisticians have not focused on considering issues related to large databases. In addition, historically, the majority of the work has been primarily focused on hypothesis verification as the primary mode of data analysis (which is certainly no longer true now). The de-coupling of database issues (storage and retrieval) from analysis issues is also a culprit. Furthermore, compared with techniques that data mining draws on from pattern recognition, machine learning, and neural networks, the traditional approaches in statistics perform little search over models and parameters (again with notable recent exceptions). KDD is concerned with formalizing and encoding aspects of the "art" of statistical analysis and making analysis methods easier to use by those who own the data, regardless of whether they have the pre-requisite knowledge of the techniques being used. A marketing person interested in segmenting a database may not have the necessary advanced degree in statistics to understand and use the literature or the library of available routines. We do not dismiss the dangers of blind mining, which can easily deteriorate into data dredging. However, the strong need for analysis aids in a data-overloaded society needs to be addressed.

Future Prospects and Challenges

Successful KDD applications continue to appear, driven mainly by a glut in databases that have clearly grown to surpass raw human processing abilities. Driving the healthy growth of this field are strong forces (both economic and social) that are a product of the data overload phenomenon. I view the need to deliver workable solutions to pressing problems as a very healthy pressure on the KDD field. Not only will it ensure our healthy growth as a new engineering discipline, but it will provide our efforts with a healthy dose of reality checks, ensuring that any theory or model that emerges will find its immediate real-world test environment. The fundamental problems are still as difficult as they always were, and we need to guard against building unrealistic expectations in the public's mind. The challenges ahead of us are formidable. Some of these challenges include:

1. Develop mining algorithms for classification, clustering, dependency analysis, and change and deviation detection that scale to large databases. There is a tradeoff between performance and accuracy as one surrenders to the fact that data resides primarily on disk or on a server and cannot fit in main memory.

2. Develop schemes for encoding "metadata" (information about the content and meaning of data) over data tables so that mining algorithms can operate meaningfully on a database and so that the KDD system can effectively ask for more information from the user.


3. While operating in a very large sample size environment is a blessing against overfitting problems, data mining systems need to guard against fitting models to data by chance. This problem becomes significant as a program explores a huge search space over many models for a given data set.

4. Develop effective means for data sampling, data reduction, and dimensionality reduction that operate on a mixture of categorical and numeric data fields. While large sample sizes allow us to handle higher dimensions, our understanding of high-dimensional spaces and estimation within them is still fairly primitive. The curse of dimensionality is still with us.

5. Develop schemes capable of mining over nonhomogeneous data sets (including mixtures of multimedia, video, and text modalities) and of dealing with sparse relations that are only defined over parts of the data.

6. Develop new mining and search algorithms capable of extracting more complex relationships between fields and able to account for structure over the fields (e.g., hierarchies, sparse relations); that is, go beyond the flat-file or single-table assumption.

7. Develop data mining methods that account for prior knowledge of data and exploit such knowledge in reducing search, that can account for costs and benefits, and that are robust against uncertainty and missing-data problems. Bayesian methods and decision analysis provide the basic foundational framework.

8. Enhance database management systems to support new primitives for the efficient extraction of necessary sufficient statistics as well as more efficient sampling schemes. This includes providing SQL support for new primitives that may be needed (cf. the paper by Gray et al. in this issue).

9. Scale methods to parallel databases with hundreds of tables, thousands of fields, and terabytes of data. Issues of query optimization in these settings are fundamental.

10. Account for and model the comprehensibility of extracted models; allow proper tradeoffs between complexity and understandability of models for purposes of visualization and reporting; enable interactive exploration where the analyst can easily provide hints to help the mining algorithm with its search.

11. Develop theory and techniques to model growth and change in data. Large databases, because they grow over a long time, do not typically grow as if sampled from a static joint probability density. The question of how the data grows needs to be better understood (see the articles by P. Huber, by Fayyad & Smyth, and by others in [6]), and tools for coping with it need to be developed.

KDD holds the promise of an enabling technology that could unlock the knowledge lying dormant in huge databases, thereby improving humanity's collective intellect: a sort of amplifier of basic human analysis capabilities. Perhaps the most exciting aspect of the launch of this new journal is the possibility of the birth of a new research area properly mixing statistics, databases, automated data analysis and reduction, and other related areas. While KDD will draw on the substantial body of knowledge built up in its constituent fields, it is my hope that a new science will inevitably emerge: a science of how to exploit massive data sets, how to store and access them for analysis purposes, and how to cope with growth and change in data. I sincerely hope that future issues of this journal will address some of these challenges and chronicle the development of theory and applications of the new science for supporting analysis and decision making with massive data sets.

2.12 Summary

In this unit you have learnt about the Knowledge Discovery (KDD) process for extracting knowledge from data or information. Data cleaning, data selection, and data enrichment are some of the procedures and concepts that have been introduced in this chapter. You have also learnt about visualization techniques and the various data mining methods such as classification, decision trees, and association rules.

2.13 Exercises

1. Explain the KDD process.
2. Write about data selection, data cleaning, and data enrichment, with examples.
3. What preliminary analyses can be performed on a data set? Give some examples of the tools used.
4. Explain visualization techniques with examples.
5. How are OLAP tools used?
6. Explain decision trees.
7. Write about association rules, with examples.
8. What are neural networks? Explain the concept with examples.
9. Explain genetic algorithms.
10. How can KDD be used in databases?


Unit III

Structure of the Unit
3.1 Introduction
3.2 Learning Objectives
3.3 Data Warehouse Architecture
3.4 System Process
3.5 Process Architecture
3.6 Design
3.7 Data Base Schema
3.8 Partitioning Strategy
3.9 Aggregations
3.10 Data Marting
3.11 Meta Data
3.12 System and Warehouse Process Managers
3.13 Summary
3.14 Exercises


3.1 Introduction
Data warehouses generalize and consolidate data in multidimensional space. The construction of data warehouses involves data cleaning, data integration, and data transformation, and can be viewed as an important preprocessing step for data mining. Moreover, data warehouses provide on-line analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitates effective data generalization and data mining. Many other data mining functions, such as association, classification, prediction, and clustering, can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. Hence, the data warehouse has become an increasingly important platform for data analysis and on-line analytical processing and will provide an effective platform for data mining. Therefore, data warehousing and OLAP form an essential step in the knowledge discovery process.

This chapter presents an overview of data warehouse and OLAP technology. Such an overview is essential for understanding the overall data mining and knowledge discovery process. In this chapter, we study a well-accepted definition of the data warehouse and see why more and more organizations are building data warehouses for the analysis of their data. In particular, we study the data cube, a multidimensional data model for data warehouses and OLAP, as well as OLAP operations such as roll-up, drill-down, slicing, and dicing. We also look at data warehouse architecture, including steps on data warehouse design and construction. An overview of data warehouse implementation examines general strategies for efficient data cube computation, OLAP data indexing, and OLAP query processing. Finally, we look at on-line analytical mining, a powerful paradigm that integrates data warehouse and OLAP technology with that of data mining.

3.2 Learning Objectives

To learn about the need for and the architecture of a data warehouse
To know about the processes and operations performed on a warehouse

3.3 Data Warehouse Architecture

What is a Data Warehouse (DW)? An organized system of enterprise data, derived from multiple data sources and designed primarily for decision making in the organization, can be called a Data Warehouse. StatSoft defines data warehousing as a process of organizing the storage of large, multivariate data sets in a way that facilitates the retrieval of information for analytic purposes. The most efficient data warehousing architecture will be capable of incorporating, or at least referencing, all data available in the relevant enterprise-wide information management systems, using designated technology suitable for corporate database management (e.g., Oracle, Sybase, MS SQL Server).

Another definition says that a data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis. This classic definition of the data warehouse focuses on data storage. However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system.

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources. In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users. A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:

Subject Oriented
Integrated
Nonvolatile
Time Variant

Subject Oriented

Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.

Integrated

Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.


Nonvolatile

Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant

In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.

Data Warehousing Objects

Fact tables and dimension tables are the two types of objects commonly used in dimensional data warehouse schemas. Fact tables are the large tables in your warehouse schema that store business measurements. Fact tables typically contain facts and foreign keys to the dimension tables. Fact tables represent data, usually numeric and additive, that can be analyzed and examined. Examples include sales, cost, and profit. Dimension tables, also known as lookup or reference tables, contain the relatively static data in the warehouse. Dimension tables store the information you normally use to contain queries. Dimension tables are usually textual and descriptive and you can use them as the row headers of the result set. Examples are customers or products.

Fact Tables

A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts with the same level of aggregation. Though most facts are additive, they can also be semi-additive or non-additive. Additive facts can be aggregated by simple arithmetical addition. A common example of this is sales. Non-additive facts cannot be added at all. An example of this is averages. Semi-additive facts can be aggregated along some of the dimensions and not along others. An example of this is inventory levels, where you cannot tell what a level means simply by looking at it.
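The distinction between additive, semi-additive, and non-additive facts can be made concrete with a small sketch. The following Python fragment uses invented numbers only; it shows why a sales amount can be summed across any dimension, why an inventory level can be summed across products for one day but not across days, and why an average must be recomputed from additive components rather than added.

    # Hypothetical daily fact rows: (day, product, sales_amount, inventory_level)
    facts = [
        ("2024-01-01", "milk",   120.0, 500),
        ("2024-01-01", "bread",   80.0, 300),
        ("2024-01-02", "milk",   150.0, 450),
        ("2024-01-02", "bread",   90.0, 280),
        ("2024-01-02", "cheese",  75.0, 150),
    ]

    # Additive fact: sales_amount can be summed over every dimension (day and product).
    total_sales = sum(amount for _, _, amount, _ in facts)                              # 515.0

    # Semi-additive fact: inventory_level can be summed across products for one day...
    stock_on_jan_1 = sum(level for day, _, _, level in facts if day == "2024-01-01")    # 800
    # ...but summing a snapshot level across days (500 + 450 for milk) has no business meaning.

    # Non-additive fact: averages cannot simply be added or re-averaged.
    per_day_avg = {
        day: sum(a for d, _, a, _ in facts if d == day) / sum(1 for d, *_ in facts if d == day)
        for day in {d for d, *_ in facts}
    }                                        # {'2024-01-01': 100.0, '2024-01-02': 105.0}
    overall_avg = total_sales / len(facts)   # 103.0, not the 102.5 obtained by averaging the averages

    print(total_sales, stock_on_jan_1, overall_avg)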

Creating a New Fact Table


You must define a fact table for each star schema. From a modeling standpoint, the primary key of the fact table is usually a composite key that is made up of all of its foreign keys.


Dimension Tables

A dimension is a structure, often composed of one or more hierarchies, that categorizes data. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Several distinct dimensions, combined with facts, enable you to answer business questions. Commonly used dimensions are customers, products, and time. Dimension data is typically collected at the lowest level of detail and then aggregated into higher level totals that are more useful for analysis. These natural rollups or aggregations within a dimension table are called hierarchies.

Hierarchies
Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure. Within a hierarchy, each level is logically connected to the levels above and below it. Data values at lower levels aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For example, in the product dimension, there might be two hierarchies--one for product categories and one for product suppliers. Dimension hierarchies also group levels from general to granular. Query tools use hierarchies to enable you to drill down into your data to view different levels of granularity. This is one of the key benefits of a data warehouse. When designing hierarchies, you must consider the relationships in business structures, for example, a divisional, multilevel sales organization. Hierarchies impose a family structure on dimension values. For a particular level value, a value at the next higher level is its parent, and values at the next lower level are its children. These familial relationships enable analysts to access data quickly.

Levels

A level represents a position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the month, quarter, and year levels. Levels range from general to specific, with the root level as the highest or most general level. The levels in a dimension are organized into one or more hierarchies.

Level Relationships


Level relationships specify top-to-bottom ordering of levels from most general (the root) to most specific information. They define the parent-child relationship between the levels in a hierarchy. Hierarchies are also essential components in enabling more complex rewrites. For example, the database can aggregate existing sales revenue on a quarterly basis to a yearly aggregation when the dimensional dependencies between quarter and year are known.
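A hedged sketch of this rollup idea in Python (the dimension rows and sales figures are invented): once the month-to-quarter and month-to-year dependencies are recorded in the time dimension, detail-level facts can be aggregated to either higher level.

    # Hypothetical time dimension: month -> (quarter, year)
    time_dim = {
        "2024-01": ("2024-Q1", 2024),
        "2024-02": ("2024-Q1", 2024),
        "2024-03": ("2024-Q1", 2024),
        "2024-04": ("2024-Q2", 2024),
    }

    # Monthly sales facts (invented numbers)
    monthly_sales = {"2024-01": 100, "2024-02": 120, "2024-03": 90, "2024-04": 110}

    def roll_up(level):
        """Aggregate monthly facts to the quarter or year level of the hierarchy."""
        totals = {}
        for month, amount in monthly_sales.items():
            quarter, year = time_dim[month]
            key = quarter if level == "quarter" else year
            totals[key] = totals.get(key, 0) + amount
        return totals

    print(roll_up("quarter"))   # {'2024-Q1': 310, '2024-Q2': 110}
    print(roll_up("year"))      # {2024: 420}

Drilling down is simply the reverse navigation, from a yearly total back to its child quarters and months.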

Typical Dimension Hierarchy


Figure 2-2 illustrates a dimension hierarchy based on customers.

Figure 2-2 Typical Levels in a Dimension Hierarchy

This illustrates a typical dimension hierarchy. In it:


region: is at the top of the dimension hierarchy
subregion: is below region
country_name: is below subregion
customer: is at the bottom of the dimension hierarchy

Unique Identifiers

Unique identifiers are specified for one distinct record in a dimension table. Artificial unique identifiers are often used to avoid the potential problem of unique identifiers changing. Unique identifiers are represented with the # character. For example, #customer_id.

Relationships

Relationships guarantee business integrity. An example is that if a business sells something, there is obviously a customer and a product. Designing a relationship between


the sales information in the fact table and the dimension tables products and customers enforces the business rules in databases.

Example of Data Warehousing Objects and Their Relationships

Figure 2-3 illustrates a common example of a sales fact table and dimension tables customers, products, promotions, times, and channels.

Figure 2-3 Typical Data Warehousing Objects

This illustrates a typical star schema with some columns and relationships detailed. In it, the dimension tables are:

times
channels
products, which contains prod_id
customers, which contains cust_id, cust_last_name, cust_city, and cust_state_province

The fact table is sales, which contains cust_id and prod_id.
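The star schema of Figure 2-3 can be sketched directly as tables. The Python/sqlite3 fragment below is only an illustration: the column list is trimmed to the columns named above, the prod_name column and the sample rows are invented, and a production sales fact table would carry more dimensions (times, channels, promotions). It shows the foreign keys from the fact table to its dimensions and a typical star query that joins the fact table to each dimension it needs.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")

    # Dimension tables: relatively static, descriptive attributes.
    conn.execute("""CREATE TABLE customers (
        cust_id INTEGER PRIMARY KEY,
        cust_last_name TEXT, cust_city TEXT, cust_state_province TEXT)""")
    conn.execute("""CREATE TABLE products (
        prod_id INTEGER PRIMARY KEY, prod_name TEXT)""")

    # Fact table: foreign keys to the dimensions plus numeric, additive measures.
    # From a modeling standpoint its primary key would be the composite of its foreign keys.
    conn.execute("""CREATE TABLE sales (
        cust_id INTEGER REFERENCES customers(cust_id),
        prod_id INTEGER REFERENCES products(prod_id),
        amount_sold REAL, quantity_sold INTEGER)""")

    conn.executemany("INSERT INTO customers VALUES (?,?,?,?)",
                     [(1, "Smith", "Boston", "MA"), (2, "Jones", "Albany", "NY")])
    conn.executemany("INSERT INTO products VALUES (?,?)", [(10, "milk"), (11, "bread")])
    conn.executemany("INSERT INTO sales VALUES (?,?,?,?)",
                     [(1, 10, 3.0, 2), (1, 11, 2.5, 1), (2, 10, 6.0, 4)])

    # A typical star query: one join per dimension, grouped by descriptive attributes.
    rows = conn.execute("""
        SELECT c.cust_state_province, p.prod_name, SUM(s.amount_sold)
        FROM sales s
        JOIN customers c ON s.cust_id = c.cust_id
        JOIN products  p ON s.prod_id = p.prod_id
        GROUP BY c.cust_state_province, p.prod_name""").fetchall()
    print(rows)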

Data Warehouse Architecture (DWA) is a way of representing the overall structure of data, communication, processing and presentation that exists for end-user computing within the enterprise. Conceptualization of a data warehouse architecture consists of the following interconnected layers:

Operational database layer - the source data for the data warehouse
Informational access layer - the data accessed for reporting and analyzing, and the tools for reporting and analyzing data
Data access layer - the interface between the operational and informational access layers
Metadata layer - the data directory (which is often much more detailed than an operational system data directory)

Data Warehouse Architectures

Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are:

Data Warehouse Architecture (Basic)
Data Warehouse Architecture (with a Staging Area)
Data Warehouse Architecture (with a Staging Area and Data Marts)

Data Warehouse Architecture (Basic)

Figure 1-2 shows a simple architecture for a data warehouse. End users directly access data derived from several source systems through the data warehouse.

Figure 1-2 Architecture of a Data Warehouse

This illustrates three things:


Data Sources (operational systems and flat files)
Warehouse (metadata, summary data, and raw data)
Users (analysis, reporting, and mining)

In Figure 1-2, the metadata and raw data of a traditional OLTP system are present, as is an additional type of data, summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales. A summary in Oracle is called a materialized view.

Data Warehouse Architecture (with a Staging Area)

In Figure 1-2, you need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management. Figure 1-3 illustrates this typical architecture.

Figure 1-3 Architecture of a Data Warehouse with a Staging Area

This illustrates four things:


Data Sources (operational systems and flat files)
Staging Area (where data sources go before the warehouse)
Warehouse (metadata, summary data, and raw data)
Users (analysis, reporting, and mining)


Data Warehouse Architecture (with a Staging Area and Data Marts)

Although the architecture in Figure 1-3 is quite common, you may want to customize your warehouse's architecture for different groups within your organization. You can do this by adding data marts, which are systems designed for a particular line of business. Figure 1-4 illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales.

Figure 1-4 Architecture of a Data Warehouse with a Staging Area and Data Marts

This illustrates five things:


Data Sources (operational systems and flat files)
Staging Area (where data sources go before the warehouse)
Warehouse (metadata, summary data, and raw data)
Data Marts (purchasing, sales, and inventory)
Users (analysis, reporting, and mining)

Architecture Review and Design

The Architecture is the logical and physical foundation on which the Data Warehouse will be built. The Architecture Review and Design stage, as the name implies, is both a requirements analysis and a gap analysis activity. It is important to assess what pieces of the architecture already exist in the organization (and in what form) and to assess what pieces are missing which are needed to build the complete Data Warehouse architecture.

During the Architecture Review and Design stage, the logical Data Warehouse architecture is developed. The logical architecture is a configuration map of the necessary data stores that make up the Warehouse; it includes a central Enterprise Data Store, an optional Operational Data Store, one or more (optional) individual business area Data Marts, and one or more Metadata stores. In the metadata store(s) are two different kinds of metadata that catalog reference information about the primary data. Once the logical configuration is defined, the Data, Application, Technical and Support Architectures are designed to physically implement it. Requirements of these four architectures are carefully analyzed so that the Data Warehouse can be optimized to serve the users. Gap analysis is conducted to determine which components of each architecture already exist in the organization and can be reused, and which components must be developed (or purchased) and configured for the Data Warehouse.

The Data Architecture organizes the sources and stores of business information and defines the quality and management standards for data and metadata. The Application Architecture is the software framework that guides the overall implementation of business functionality within the Warehouse environment; it controls the movement of data from source to user, including the functions of data extraction, data cleansing, data transformation, data loading, data refresh, and data access (reporting, querying). The Technical Architecture provides the underlying computing infrastructure that enables the data and application architectures. It includes platform/server, network, communications and connectivity hardware/software/middleware, DBMS, the client/server 2-tier vs. 3-tier approach, and end-user workstation hardware/software. Technical architecture design must address the requirements of scalability, capacity and volume handling (including sizing and partitioning of tables), performance, availability, stability, chargeback, and security. The Support Architecture includes the software components (e.g., tools and structures for backup/recovery, disaster recovery, performance monitoring, reliability/stability compliance reporting, data archiving, and version control/configuration management) and organizational functions necessary to effectively manage the technology investment.

Architecture Review and Design applies to the long-term strategy for development and refinement of the overall Data Warehouse, and is not conducted merely for a single iteration. This stage develops the blueprint of an encompassing data and technical structure, software application configuration, and organizational support structure for the Warehouse. It forms a foundation that drives the iterative Detail Design activities. Where Design tells you what to do, Architecture Review and Design tells you what pieces you need in order to do it.

The Architecture Review and Design stage can be conducted as a separate project that runs mostly in parallel with the Business Question Assessment (BQA) stage, for the technical, data, application and support infrastructure that enables and supports the storage and access of information is generally independent of the business requirements that determine which data is needed to drive the Warehouse. However, the data architecture is dependent on receiving input from certain BQA activities (data source system identification and data modeling), so the BQA stage must conclude before the Architecture stage can conclude. The Architecture will be developed based on the organization's long-term Data Warehouse strategy, so that future iterations of the Warehouse will have been provided for and will fit within the overall architecture.

3.4 System Process

A Data Warehouse is not an individual repository product. Rather, it is an overall strategy, or process, for building decision support systems and a knowledge-based applications architecture and environment that supports both everyday tactical decision making and long-term business strategizing. The Data Warehouse environment positions a business to utilize an enterprise-wide data store to link information from diverse sources and make the information accessible for a variety of user purposes, most notably strategic analysis. Business analysts must be able to use the Warehouse for such strategic purposes as trend identification, forecasting, competitive analysis, and targeted market research.

Data Warehouses and Data Warehouse applications are designed primarily to support executives, senior managers, and business analysts in making complex business decisions. Data Warehouse applications provide the business community with access to accurate, consolidated information from various internal and external sources. The primary objective of Data Warehousing is to bring together information from disparate sources and put the information into a format that is conducive to making business decisions. This objective necessitates a set of activities that are far more complex than just collecting data and reporting against it. Data Warehousing requires both business and technical expertise and involves the following activities:

- Accurately identifying the business information that must be contained in the Warehouse
- Identifying and prioritizing subject areas to be included in the Data Warehouse
- Managing the scope of each subject area which will be implemented into the Warehouse on an iterative basis


- Developing a scalable architecture to serve as the Warehouse's technical and application foundation, and identifying and selecting the hardware/software/middleware components to implement it
- Extracting, cleansing, aggregating, transforming and validating the data to ensure accuracy and consistency (a minimal sketch of such a pipeline follows this list)
- Defining the correct level of summarization to support business decision making
- Establishing a refresh program that is consistent with business needs, timing and cycles
- Providing user-friendly, powerful tools at the desktop to access the data in the Warehouse
- Educating the business community about the realm of possibilities that are available to them through Data Warehousing
- Establishing a Data Warehouse Help Desk and training users to effectively utilize the desktop tools
- Establishing processes for maintaining, enhancing, and ensuring the ongoing success and applicability of the Warehouse
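As noted in the list above, the extract-cleanse-transform-load activity can be pictured with a minimal sketch. Everything below is invented (the source records, the cleansing rules, and the target grain); it only illustrates the order of the steps, not any particular ETL tool.

    # Hypothetical rows extracted from an operational source system.
    extracted = [
        {"cust": " SMITH ", "amount": "120.50", "currency": "USD"},
        {"cust": "jones",   "amount": "80",     "currency": "usd"},
        {"cust": "",        "amount": "15.00",  "currency": "USD"},   # missing key -> reject
    ]

    def cleanse(row):
        """Standardize names and codes; reject rows that fail basic validation."""
        name = row["cust"].strip().title()
        if not name:
            return None
        return {"cust": name, "amount": float(row["amount"]),
                "currency": row["currency"].upper()}

    def transform(rows):
        """Aggregate to the grain chosen for the Warehouse (here, one row per customer)."""
        totals = {}
        for r in rows:
            totals[r["cust"]] = totals.get(r["cust"], 0.0) + r["amount"]
        return totals

    cleansed = [r for r in (cleanse(r) for r in extracted) if r is not None]
    load_ready = transform(cleansed)     # {'Smith': 120.5, 'Jones': 80.0}
    print(load_ready)                    # the load step would insert these rows into the Warehouse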

Until the advent of Data Warehouses, enterprise databases were expected to serve multiple purposes, including online transaction processing, batch processing, reporting, and analytical processing. In most cases, the primary focus of computing resources was on satisfying operational needs and requirements. Information reporting and analysis needs were secondary considerations. As the use of PCs, relational databases, 4GL technology and end-user computing grew and changed the complexion of information processing, more and more business users demanded that their needs for information be addressed. Data Warehousing has evolved to meet those needs without disrupting operational processing.

In the Data Warehouse model, operational databases are not accessed directly to perform information processing. Rather, they act as the source of data for the Data Warehouse, which is the information repository and point of access for information processing. There are sound reasons for separating operational and informational databases, as described below.

- The users of informational and operational data are different. Users of informational data are generally managers and analysts; users of operational data tend to be clerical, operational and administrative staff.
- Operational data differs from informational data in context and currency. Informational data contains an historical perspective that is not generally used by operational systems.
- The technology used for operational processing frequently differs from the technology required to support informational needs.
- The processing characteristics for the operational environment and the informational environment are fundamentally different.


The Data Warehouse functions as a Decision Support System (DSS) and an Executive Information System (EIS), meaning that it supports informational and analytical needs by providing integrated and transformed enterprise-wide historical data from which to do management analysis. A variety of sophisticated tools are readily available in the marketplace to provide user-friendly access to the information stored in the Data Warehouse. Data Warehouses can be defined as subject-oriented, integrated, time-variant, nonvolatile collections of data used to support analytical decision making. The data in the Warehouse comes from the operational environment and external sources. Data Warehouses are physically separated from operational systems, even though the operational systems feed the Warehouse with source data.

Subject Orientation

Data Warehouses are designed around the major subject areas of the enterprise; the operational environment is designed around applications and functions. This difference in orientation (data vs. process) is evident in the content of the database. Data Warehouses do not contain information that will not be used for informational or analytical processing; operational databases contain detailed data that is needed to satisfy processing requirements but which has no relevance to management or analysis.

Integration and Transformation

The data within the Data Warehouse is integrated. This means that there is consistency among naming conventions, measurements of variables, encoding structures, physical attributes, and other salient data characteristics. An example of this integration is the treatment of codes such as gender codes. Within a single corporation, various applications may represent gender codes in different ways: male vs. female, m vs. f, and 1 vs. 0, etc. In the Data Warehouse, gender is always represented in a consistent way, regardless of the many ways by which it may be encoded and stored in the source data. As the data is moved to the Warehouse, it is transformed into a consistent representation as required.
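A minimal sketch of this kind of integration, with invented source encodings: each source system's gender code is translated into the single representation used in the Warehouse as the data is moved.

    # Hypothetical encodings used by three source applications.
    SOURCE_GENDER_MAPS = {
        "billing":   {"male": "M", "female": "F"},
        "crm":       {"m": "M", "f": "F"},
        "legacy_hr": {"1": "M", "0": "F"},
    }

    def standardize_gender(source_system, raw_value):
        """Translate a source-specific gender code into the warehouse encoding."""
        mapping = SOURCE_GENDER_MAPS[source_system]
        return mapping.get(str(raw_value).strip().lower(), "U")   # 'U' marks an unknown code

    print(standardize_gender("billing", "Female"))   # F
    print(standardize_gender("legacy_hr", "1"))      # M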

Time Variance

All data in the Data Warehouse is accurate as of some moment in time, providing an historical perspective. This differs from the operational environment, in which data is intended to be accurate as of the moment of access. The data in the Data Warehouse is, in effect, a series of snapshots. Once the data is loaded into the enterprise data store and data marts, it cannot be updated. It is refreshed on a periodic basis, as determined by the business need. The operational data store, if included in the Warehouse architecture, may be updated.

Non-Volatility

Data in the Warehouse is static, not dynamic. The only operations that occur in Data Warehouse applications are the initial loading of data, access of data, and refresh of data. For these reasons, the physical design of a Data Warehouse optimizes the access of data, rather than focusing on the requirements of data update and delete processing.

Data Warehouse Configurations

A Data Warehouse configuration, also known as the logical architecture, includes the following components:

- one Enterprise Data Store (EDS) - a central repository which supplies atomic (detail-level) integrated information to the whole organization
- (optional) one Operational Data Store - a "snapshot" of a moment in time's enterprise-wide data
- (optional) one or more individual Data Mart(s) - summarized subsets of the enterprise's data specific to a functional area or department, geographical region, or time period
- one or more Metadata Store(s) or Repository(ies) - catalog(s) of reference information about the primary data. Metadata is divided into two categories: information for technical use, and information for business end-users.

The EDS is the cornerstone of the Data Warehouse. It can be accessed for both immediate informational needs and for analytical processing in support of strategic decision making, and can be used for drill-down support for the Data Marts, which contain only summarized data. It is fed by the existing subject area operational systems and may also contain data from external sources. The EDS in turn feeds individual Data Marts that are accessed by end-user query tools at the user's desktop. It is used to consolidate related data from multiple sources into a single source, while the Data Marts are used to physically distribute the consolidated data into logical categories of data, such as business functional departments or geographical regions.

The EDS is a collection of daily "snapshots" of enterprise-wide data taken over an extended time period, and thus retains and makes available for tracking purposes the history of changes to a given data element over time. This creates an optimum environment for strategic analysis. However, access to the EDS can be slow, due to the volume of data it contains, which is a good reason for using Data Marts to filter, condense and summarize information for specific business areas. In the absence of the Data Mart layer, users can access the EDS directly.

Metadata is "data about data," a catalog of information about the primary data that defines access to the Warehouse. It is the key to providing users and developers with a road map to the information in the Warehouse. Metadata comes in two different forms: end-user and transformational. End-user metadata serves a business purpose; it translates a cryptic name code that represents a data element into a meaningful description of the data element so that end-users can recognize and use the data. For example, metadata would clarify that the data element "ACCT_CD" represents "Account Code for Small Business." Transformational metadata serves a technical purpose for development and maintenance of the Warehouse. It maps the data element from its source system to the Data Warehouse, identifying it by source field name, destination field code, transformation routine, business rules for usage and derivation, format, key, size, index and other relevant transformational and structural information. Each type of metadata is kept in one or more repositories that service the Enterprise Data Store.

While an Enterprise Data Store and Metadata Store(s) are always included in a sound Data Warehouse design, the specific number of Data Marts (if any) and the need for an Operational Data Store are judgment calls. Potential Data Warehouse configurations should be evaluated and a logical architecture determined according to business requirements.

The Data Warehouse Process

The james martin + co Data Warehouse Process does not encompass the analysis and identification of organizational value streams, strategic initiatives, and related business goals, but it is a prescription for achieving such goals through a specific architecture. The Process is conducted in an iterative fashion after the initial business requirements and architectural foundations have been developed, with the emphasis on populating the Data Warehouse with "chunks" of functional subject-area information in each iteration. The Process guides the development team through identifying the business requirements, developing the business plan and Warehouse solution to business requirements, and implementing the configuration, technical, and application architecture for the overall Data Warehouse. It then specifies the iterative activities for the cyclical planning, design, construction, and deployment of each population project. The following is a description of each stage in the Data Warehouse Process. (Note: The Data Warehouse Process also includes conventional project management, startup, and wrap-up activities, which are detailed in the Plan, Activate, Control and End stages, not described here.)

3.5 Process Architecture
The Process of Data Warehouse Design

A data warehouse can be built using a top-down approach, a bottom-up approach, or a combination of both. The top-down approach starts with the overall design and planning. It is useful in cases where the technology is mature and well known, and where the business problems that must be solved are clear and well understood. The bottom-up approach starts with experiments and prototypes. This is useful in the early stage of business modeling and technology development. It allows an organization to move forward at considerably less expense and to evaluate the benefits of the technology before making significant commitments. In the combined approach, an organization can exploit the planned and strategic nature of the top-down approach while retaining the rapid implementation and opportunistic application of the bottom-up approach.

From the software engineering point of view, the design and construction of a data warehouse may consist of the following steps: planning, requirements study, problem analysis, warehouse design, data integration and testing, and finally deployment of the data warehouse. Large software systems can be developed using two methodologies: the waterfall method or the spiral method. The waterfall method performs a structured and systematic analysis at each step before proceeding to the next, like a waterfall falling from one step to the next. The spiral method involves the rapid generation of increasingly functional systems, with short intervals between successive releases. This is considered a good choice for data warehouse development, especially for data marts, because the turnaround time is short, modifications can be done quickly, and new designs and technologies can be adapted in a timely manner.

In general, the warehouse design process consists of the following steps:

1. Choose a business process to model, for example, orders, invoices, shipments, inventory, account administration, sales, or the general ledger. If the business process is organizational and involves multiple complex object collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen.

2. Choose the grain of the business process. The grain is the fundamental, atomic level of data to be represented in the fact table for this process, for example, individual transactions, individual daily snapshots, and so on.

3. Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status.

4. Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars sold and units sold. (A small illustrative specification of these four choices appears at the end of this subsection.)

Because data warehouse construction is a difficult and long-term task, its implementation scope should be clearly defined. The goals of an initial data warehouse implementation should be specific, achievable, and measurable. This involves determining the time and budget allocations, the subset of the organization that is to be modeled, the number of data sources selected, and the number and types of departments to be served.

Once a data warehouse is designed and constructed, the initial deployment of the warehouse includes initial installation, roll-out planning, training, and orientation. Platform upgrades and maintenance must also be considered. Data warehouse administration includes data refreshment, data source synchronization, planning for disaster recovery, managing access control and security, managing data growth, managing database performance, and data warehouse enhancement and extension. Scope management includes controlling the number and range of queries, dimensions, and reports; limiting the size of the data warehouse; or limiting the schedule, budget, or resources.

Various kinds of data warehouse design tools are available. Data warehouse development tools provide functions to define and edit metadata repository contents (such as schemas, scripts, or rules), answer queries, output reports, and ship metadata to and from relational database system catalogues. Planning and analysis tools study the impact of schema changes and of refresh performance when changing refresh rates or time windows.
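As referenced in step 4 above, the outcome of the four design steps can be captured as a simple specification before any tables are built. The sketch below is purely illustrative (a hypothetical retail sales process); it records the chosen process, grain, dimensions, and measures, from which the fact and dimension tables can later be derived.

    # Hypothetical outcome of the four warehouse design steps for a retail sales process.
    warehouse_design = {
        "business_process": "sales",                              # step 1
        "grain": "one row per product per store per day",         # step 2: atomic level of the fact table
        "dimensions": ["time", "item", "store", "customer"],      # step 3: one dimension table each
        "measures": ["dollars_sold", "units_sold"],               # step 4: numeric, additive facts
    }

    def fact_table_columns(design):
        """Derive the fact table's columns: one foreign key per dimension plus the measures."""
        return [d + "_key" for d in design["dimensions"]] + design["measures"]

    print(fact_table_columns(warehouse_design))
    # ['time_key', 'item_key', 'store_key', 'customer_key', 'dollars_sold', 'units_sold']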


3.6 Design

Logical Design in Data Warehouses

This chapter tells you how to design a data warehousing environment and includes the following topics:

Logical Versus Physical Design in Data Warehouses
Creating a Logical Design
Data Warehousing Schemas
Data Warehousing Objects

Logical Versus Physical Design in Data Warehouses

Your organization has decided to build a data warehouse. You have defined the business requirements, agreed upon the scope of your application, and created a conceptual design. Now you need to translate your requirements into a system deliverable. To do so, you create the logical and physical design for the data warehouse. You then define:

The specific data content
Relationships within and between groups of data
The system environment supporting your data warehouse
The data transformations required
The frequency with which data is refreshed

The logical design is more conceptual and abstract than the physical design. In the logical design, you look at the logical relationships among the objects. In the physical design, you look at the most effective way of storing and retrieving the objects as well as handling them from a transportation and backup/recovery perspective. Orient your design toward the needs of the end users. End users typically want to perform analysis and look at aggregated data, rather than at individual transactions. However, end users might not know what they need until they see it. In addition, a well-planned design allows for growth and changes as the needs of users change and evolve. By beginning with the logical design, you focus on the information requirements and save the implementation details for later.


Creating a Logical Design

A logical design is conceptual and abstract. You do not deal with the physical implementation details yet. You deal only with defining the types of information that you need.

One technique you can use to model your organization's logical information requirements is entity-relationship modeling. Entity-relationship modeling involves identifying the things of importance (entities), the properties of these things (attributes), and how they are related to one another (relationships). The process of logical design involves arranging data into a series of logical relationships called entities and attributes. An entity represents a chunk of information. In relational databases, an entity often maps to a table. An attribute is a component of an entity that helps define the uniqueness of the entity. In relational databases, an attribute maps to a column.

To be sure that your data is consistent, you need to use unique identifiers. A unique identifier is something you add to tables so that you can differentiate between the same item when it appears in different places. In a physical design, this is usually a primary key.

While entity-relationship diagramming has traditionally been associated with highly normalized models such as OLTP applications, the technique is still useful for data warehouse design in the form of dimensional modeling. In dimensional modeling, instead of seeking to discover atomic units of information (such as entities and attributes) and all of the relationships between them, you identify which information belongs to a central fact table and which information belongs to its associated dimension tables. You identify business subjects or fields of data, define relationships between business subjects, and name the attributes for each subject.

Your logical design should result in (1) a set of entities and attributes corresponding to fact tables and dimension tables and (2) a model of operational data from your source into subject-oriented information in your target data warehouse schema.

You can create the logical design using a pen and paper, or you can use a design tool such as Oracle Warehouse Builder (specifically designed to support modeling the ETL process) or Oracle Designer (a general purpose modeling tool).

Physical Design in Data Warehouses

This chapter describes the physical design of a data warehousing environment, and includes the following topics:


Moving from Logical to Physical Design
Physical Design

Moving from Logical to Physical Design

Logical design is what you draw with a pen and paper or design with Oracle Warehouse Builder or Designer before building your warehouse. Physical design is the creation of the database with SQL statements.

During the physical design process, you convert the data gathered during the logical design phase into a description of the physical database structure. Physical design decisions are mainly driven by query performance and database maintenance aspects. For example, choosing a partitioning strategy that meets common query requirements enables Oracle to take advantage of partition pruning, a way of narrowing a search before performing it.

Physical Design

During the logical design phase, you defined a model for your data warehouse consisting of entities, attributes, and relationships. The entities are linked together using relationships. Attributes are used to describe the entities. The unique identifier (UID) distinguishes between one instance of an entity and another. Figure 3-1 offers you a graphical way of looking at the different ways of thinking about logical and physical designs.


Figure 3-1 Logical Design Compared with Physical Design

The logical entities are:


entities
relationships
attributes
unique identifiers

The logical model is mapped to the following database structures:


tables
indexes
columns
dimensions
materialized views
integrity constraints

During the physical design process, you translate the expected schemas into actual database structures. At this time, you have to map:

Entities to tables
Relationships to foreign key constraints
Attributes to columns
Primary unique identifiers to primary key constraints

Unique identifiers to unique key constraints
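A hedged sketch of that mapping, using an invented two-table model in Python/sqlite3: each part of the DDL is annotated with the logical construct it implements.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")

    # Entity "customer" -> table; attributes -> columns.
    conn.execute("""CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- primary unique identifier -> primary key constraint
        email       TEXT UNIQUE,           -- unique identifier -> unique key constraint
        city        TEXT                   -- attribute -> column
    )""")

    # Entity "orders" -> table; the relationship "order placed by customer" -> foreign key constraint.
    conn.execute("""CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        amount      REAL
    )""")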

Physical Design Structures

Once you have converted your logical design to a physical one, you will need to create some or all of the following structures:

Tablespaces
Tables and Partitioned Tables
Views
Integrity Constraints
Dimensions

Some of these structures require disk space. Others exist only in the data dictionary. Additionally, the following structures may be created for performance improvement:

Indexes and Partitioned Indexes
Materialized Views

Tablespaces

A tablespace consists of one or more datafiles, which are physical structures within the operating system you are using. A datafile is associated with only one tablespace. From a design perspective, tablespaces are containers for physical design structures. Tablespaces need to be separated by differences. For example, tables should be separated from their indexes and small tables should be separated from large tables. Tablespaces should also represent logical business units if possible. Because a tablespace is the coarsest granularity for backup and recovery or the transportable tablespaces mechanism, the logical business design affects availability and maintenance operations.

Tables and Partitioned Tables

Tables are the basic unit of data storage. They are the container for the expected amount of raw data in your data warehouse. Using partitioned tables instead of nonpartitioned ones addresses the key problem of supporting very large data volumes by allowing you to decompose them into smaller and more manageable pieces. The main design criterion for partitioning is manageability, though you will also see performance benefits in most cases because of partition pruning or intelligent parallel processing. For example, you might choose a partitioning strategy based on a sales transaction date and a monthly granularity. If you have four years' worth of data, you can delete a month's data as it becomes older than four years with a single, quick DDL statement and load new data while only affecting 1/48th of the complete table. Business questions regarding the last quarter will only affect three months, which is equivalent to three partitions, or 3/48ths of the total volume.

Partitioning large tables improves performance because each partitioned piece is more manageable. Typically, you partition based on transaction dates in a data warehouse. For example, each month, one month's worth of data can be assigned its own partition.

Data Segment Compression

You can save disk space by compressing heap-organized tables. A typical type of heap-organized table you should consider for data segment compression is partitioned tables. To reduce disk use and memory use (specifically, the buffer cache), you can store tables and partitioned tables in a compressed format inside the database. This often leads to a better scaleup for read-only operations. Data segment compression can also speed up query execution. There is, however, a cost in CPU overhead. Data segment compression should be used with highly redundant data, such as tables with many foreign keys. You should avoid compressing tables with much update or other DML activity. Although compressed tables or partitions are updatable, there is some overhead in updating these tables, and high update activity may work against compression by causing some space to be wasted.

Views

A view is a tailored presentation of the data contained in one or more tables or other views. A view takes the output of a query and treats it as a table. Views do not require any space in the database.

Integrity Constraints

Integrity constraints are used to enforce business rules associated with your database and to prevent having invalid information in the tables. Integrity constraints in data warehousing differ from constraints in OLTP environments. In OLTP environments, they primarily prevent the insertion of invalid data into a record, which is not a big problem in data warehousing environments because accuracy has already been guaranteed. In data warehousing environments, constraints are only used for query rewrite. NOT NULL constraints are particularly common in data warehouses. Under some specific circumstances, constraints need space in the database. These constraints are in the form of the underlying unique index.

Indexes and Partitioned Indexes

Indexes are optional structures associated with tables or clusters. In addition to the classical B-tree indexes, bitmap indexes are very common in data warehousing environments. Bitmap indexes are optimized index structures for set-oriented operations. Additionally, they are necessary for some optimized data access methods such as star transformations. Indexes are just like tables in that you can partition them, although the partitioning strategy is not dependent upon the table structure. Partitioning indexes makes it easier to manage the warehouse during refresh and improves query performance.

Materialized Views

Materialized views are query results that have been stored in advance so long-running calculations are not necessary when you actually execute your SQL statements. From a physical design point of view, materialized views resemble tables or partitioned tables and behave like indexes.

Dimensions

A dimension is a schema object that defines hierarchical relationships between columns or column sets. A hierarchical relationship is a functional dependency from one level of a hierarchy to the next one. A dimension is a container of logical relationships and does not require any space in the database. A typical dimension is city, state (or province), region, and country.
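The effect of a materialized view can be imitated with a small sketch (the detail rows are invented): the monthly totals are computed once, stored, and then reused, so a question such as "August sales" is answered from the stored summary instead of rescanning the detail rows.

    # Detail-level fact rows (invented): (sale_date, amount)
    detail = [("2024-08-03", 120.0), ("2024-08-19", 75.0), ("2024-09-02", 60.0)]

    # "Materialize" the summary once, for example during the nightly load or refresh.
    monthly_summary = {}
    for sale_date, amount in detail:
        month = sale_date[:7]                         # '2024-08'
        monthly_summary[month] = monthly_summary.get(month, 0.0) + amount

    # Later queries read the stored summary; no long-running scan of the detail rows.
    print(monthly_summary["2024-08"])                 # 195.0

    # When new detail arrives, the summary is refreshed incrementally,
    # much as a materialized view is refreshed after a load.
    detail.append(("2024-08-30", 40.0))
    monthly_summary["2024-08"] += 40.0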

3.7 Data Base Schema

Data Warehousing Schemas

A schema is a collection of database objects, including tables, views, indexes, and synonyms. You can arrange schema objects in the schema models designed for data warehousing in a variety of ways. Most data warehouses use a dimensional model.

The model of your source data and the requirements of your users help you design the data warehouse schema. You can sometimes get the source model from your company's enterprise data model and reverse-engineer the logical data model for the data warehouse from this. The physical implementation of the logical data warehouse model may require some changes to adapt it to your system parameters--size of machine, number of users, storage capacity, type of network, and software.

Star Schemas

The star schema is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of one or more fact tables and the points of the star are the dimension tables, as shown in Figure 2-1.


Figure 2-1 Star Schema

This illustrates a typical star schema. In it, the dimension tables are:

times
channels
products
customers

The fact table is sales. sales shows columns amount_sold and quantity_sold. The most natural way to model a data warehouse is as a star schema, in which only one join establishes the relationship between the fact table and any one of the dimension tables. A star schema optimizes performance by keeping queries simple and providing fast response time. All the information about each level is stored in one row.

Other Schemas

Some schemas in data warehousing environments use third normal form rather than star schemas. Another schema that is sometimes useful is the snowflake schema, which is a star schema with normalized dimensions in a tree structure.

3.8 Partitioning Strategy


(Depending on the database platform, partitioned data may be stored as heaps or with clustered and nonclustered indexes.)


Data Partitioning

Data Partitioning is the formal process of determining which data subjects, data occurrence groups, and data characteristics are needed at each data site. It is an orderly process for allocating data to data sites that is done within the same common data architecture. Data Partitioning is also the process of logically and/or physically partitioning data into segments that are more easily maintained or accessed. Current RDBMS systems provide this kind of distribution functionality. Partitioning of data helps in performance and utility processing.

Data Partitioning in Data Warehouses

Data warehouses often contain very large tables and require techniques both for managing these large tables and for providing good query performance across them. An important tool for achieving this, as well as for enhancing data access and improving overall application performance, is partitioning.

Partitioning offers support for very large tables and indexes by letting you decompose them into smaller and more manageable pieces called partitions. This support is especially important for applications that access tables and indexes with millions of rows and many gigabytes of data. Partitioned tables and indexes facilitate administrative operations by enabling these operations to work on subsets of data. For example, you can add a new partition, organize an existing partition, or drop a partition with minimal to zero interruption to a read-only application.

Partitioning can help you tune SQL statements to avoid unnecessary index and table scans (using partition pruning). It also enables you to improve the performance of massive join operations when large amounts of data (for example, several million rows) are joined together by using partition-wise joins. Finally, partitioning data greatly improves manageability of very large databases and dramatically reduces the time required for administrative tasks such as backup and restore.

Granularity in a partitioning scheme can be easily changed by splitting or merging partitions. Thus, if a table's data is skewed to fill some partitions more than others, the ones that contain more data can be split to achieve a more even distribution. Partitioning also enables you to swap partitions with a table. By being able to easily add, remove, or swap a large amount of data quickly, swapping can be used to keep a large amount of data that is being loaded inaccessible until loading is completed, or can be used as a way to stage data between different phases of use. Some examples are the current day's transactions or online archives.

A good starting point for considering partitioning strategies is to use the partitioning advice within the SQL Access Advisor, part of the Tuning Pack. The SQL Access Advisor offers both graphical and command-line interfaces.
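A minimal sketch of range partitioning and partition pruning, independent of any particular DBMS (the rows are invented): rows are placed into monthly partitions, and a query restricted to one month touches only that partition rather than the whole table.

    from collections import defaultdict

    # Hypothetical sales rows (sale_date, amount), range-partitioned by month.
    partitions = defaultdict(list)
    for sale_date, amount in [("2024-01-15", 10.0), ("2024-02-03", 20.0),
                              ("2024-02-21", 5.0), ("2024-03-07", 12.5)]:
        partitions[sale_date[:7]].append((sale_date, amount))

    def monthly_total(month):
        """Partition pruning: only the single relevant partition is scanned."""
        return sum(amount for _, amount in partitions.get(month, []))

    print(monthly_total("2024-02"))        # 25.0, without touching the January or March rows

    # Aging a month out of the warehouse affects only one partition,
    # analogous to dropping a partition instead of deleting rows one by one.
    partitions.pop("2024-01", None)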


Data partitioning can be of great help in facilitating the efficient and effective management of a highly available relational data warehouse. However, data partitioning can be a complex process, and several factors affect partitioning strategy, design, implementation, and management considerations in a data warehousing environment.

A data warehouse powered by a relational database management system can provide a comprehensive source of data and an infrastructure for building Business Intelligence (BI) solutions. Typically, an implementation of a relational data warehouse involves the creation and management of dimension tables and fact tables. A dimension table is usually smaller than a fact table, but both provide details about the attributes used to describe or explain business facts. Some examples of dimensions include item, store, and time. A fact table, on the other hand, records business measurements, such as item sales information for all the stores. All fact tables need to be updated periodically with the most recently collected data from the various data sources.

Since data warehouses need to manage high volumes of regularly updated data, careful long-term planning is beneficial. Factors to be considered in the long-term planning of a data warehouse include data volume, the data loading window, the index maintenance window, workload characteristics, the data aging strategy, the archive and backup strategy, and hardware characteristics.

There are two approaches to implementing a relational data warehouse: the monolithic approach and the partitioned approach. The monolithic approach may contain huge fact tables which can be difficult to manage; the partitioned approach decomposes them into smaller, more manageable segments.

There are many benefits to implementing a relational data warehouse using the data partitioning approach. The single biggest benefit is easy yet efficient maintenance. As an organization grows, so will the data in its databases. The need for high availability of critical data, while accommodating a small database maintenance window, becomes indispensable. Data partitioning can meet the need for a small maintenance window in a very large business organization. With data partitioning, the big issues of supporting large tables are addressed by having the database decompose large chunks of data into smaller partitions, resulting in better management. Data partitioning also brings faster data loading, easier monitoring of aging data, and more efficient data retrieval.

Data partitioning in a relational data warehouse can be implemented by partitioning objects such as base tables, clustered and non-clustered indexes, and indexed views. Range partitions are table partitions defined by a customizable range of data. The end user or database administrator defines a partition function with boundary values, a partition scheme with filegroup mappings, and the tables that are mapped to the partition scheme.
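The terms used above (partition function, boundary values, partition scheme, filegroup mappings) follow the SQL Server style of range partitioning. The sketch below shows how those pieces might fit together; all object names are assumptions for illustration.

-- Illustrative only: monthly range partitioning, SQL Server style.
-- 1. The partition function defines the boundary values.
CREATE PARTITION FUNCTION pf_sales_month (date)
AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

-- 2. The partition scheme maps partitions to filegroups
--    (all partitions share the PRIMARY filegroup here for simplicity).
CREATE PARTITION SCHEME ps_sales_month
AS PARTITION pf_sales_month ALL TO ([PRIMARY]);

-- 3. The fact table is created on the partition scheme.
CREATE TABLE fact_sales (
    sale_date  date          NOT NULL,
    store_id   int           NOT NULL,
    product_id int           NOT NULL,
    units_sold int,
    amount     decimal(12,2)
) ON ps_sales_month (sale_date);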

There are many ways in which data partitioning can be implemented, and implementation methods vary from one database vendor or developer to another. Management of the partitioned data can vary as well. The important thing to note is that, regardless of the software implementing it, separating data into partitions will continue to benefit data warehouses, which have become standard requirements for large companies that want to operate efficiently.

3.9 Aggregations
Introduction
In a competitive business environment, the areas that receive the most focus include timely financial reporting, real-time disclosure so that the company can meet compliance regulations, and accurate sales and marketing data so that the company can grow a larger customer base and increase profitability. Data aggregation helps a company's data warehouse piece together different kinds of data so that they take on a meaning that is useful as a statistical basis for company reporting and analysis.

Here is some practical guidance on how to implement a sensible aggregation strategy for a data warehouse. The goal is to help answer the questions "How do I choose which aggregates to create?", "How do I create and store aggregates?", and "How do I monitor and maintain aggregates in a database?" The information in this article has been gathered from several years of consulting in the relational decision support market. This article assumes some familiarity with dimensional or "star" schema design, as this forms the base from which data is aggregated.

Approaches to Aggregation


Before trying to answer the questions mentioned above, there are some basic tradeoffs to keep in mind. Creating an aggregate is really summarizing and storing data which is available in the fact table in order to improve the performance of end-user queries. There are direct costs associated with this approach: the cost of storage on the system, the cost of processing to create the aggregates, and the cost of monitoring aggregate usage. We are trading these costs against the need for query performance.

There are three approaches to aggregation: no aggregation, selective aggregation, or exhaustive aggregation. In some cases, the volume of data in the fact table will be small enough that performance is acceptable without aggregates. In a typical database the data volumes will be large enough that this will not be the case. The opposite extreme is exhaustive aggregation. This approach will produce optimal query results because a query can read the minimum number of rows required to return an answer. However, this approach is not normally practical due to the processing required to produce all possible aggregates and the storage required to store them. In a simple sales example where the dimensions are product, sales geography, customer, and time, the number of possible aggregates is the number of levels in each hierarchy of each dimension multiplied together. Figure 1 depicts sample hierarchies in each dimension and the total number of aggregates possible.

Figure 1: Number of possible aggregates. Each dimension has several levels of summarization possible. To determine the number of aggregates, simply multiply the number of levels in each of the dimension hierarchies; the product is the total number of aggregates possible.

Creating a large number of aggregates will take a lot of processing time, even on a large system. Aggregates are created after new fact data has been verified and loaded. Given the loading time and the time to perform database backups, there is only a small window left in the batch cycle to create aggregates. This time window is a restriction on how many aggregates may be created. Given the above constraints and the huge number of rows to store for every possible aggregate, it is apparent that an exhaustive approach is not generally feasible. This leaves selective aggregation as the middle ground. The difficult question becomes "Which aggregates should I create?"
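As a worked illustration of this multiplication (the level counts below are hypothetical and are not taken from Figure 1): if the product hierarchy has 4 levels (item, subcategory, category, all products), geography has 3 (store, district, region), customer has 2 (customer, all customers), and time has 4 (day, week, month, year), then the number of possible aggregate combinations is 4 x 3 x 2 x 4 = 96, one of which is the base-level detail itself.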


Choosing Aggregates to Create

Usage and Analysis Patterns
There are two basic pieces of information required to select the appropriate aggregates. Probably the most important is the expected usage pattern of the data. This information is usually known after the requirements phase of a decision support project. One of the areas which this requirements analysis normally focuses on is the decision-making processes of individual users. Based on this information it is possible to determine that users often look for anomalies in their data by focusing at a certain level, and then looking for information at lower or higher levels based on what they find. The most frequently examined levels will be good candidates for aggregation. As an example, someone looking at the profitability of car insurance policies may start by examining the profitability of all policies broken out by geographic region. From there they may note that a certain region has a higher profitability and start looking for the contributing factors by drilling down to the district level, or by looking at the policies by policy or coverage type. If this pattern of analysis is common, then aggregates by region and policy type will be most useful.

Base Table Row Reduction
The second piece of information to consider is the data volumes and distributions in the fact table. This information is often not available until the initial loading of data into the database is complete, and it will likely change over time. After loading the data, it is a good idea to run some queries to get an idea of the number of rows at various levels in the dimension hierarchies. This will tell you where there are significant decreases in the volume of data along a given hierarchy. Some of the best candidates for aggregation will be those where the row counts decrease the most from one level in a hierarchy to the next.

The decrease of rows in a dimension hierarchy is not a hard rule, due to the distribution of data along multiple dimensions. When you combine the fact rows to create an aggregate at a higher level, the reduction in the number of rows may not be as much as expected. This is due to the sparsity of the fact data: certain values do not exist at the detail level for a given dimension, but the combination of all the dimensions will have a row at a higher level. A simple example of this is high-volume retail sales. A single store may carry 70,000 unique products, but on a given day a typical store will sell only ten percent of those products. In a single week the store may sell 15,000 unique products. If we calculate the number of rows in the fact table for a chain with 100 stores where every store sells 7,000 products a day, 365 days a year, we will have 255,500,000 rows. If we create an aggregate of product sales by store by week, we would intuitively expect the number of rows in the aggregate table to be reduced by a factor of seven, since we have summarized seven daily records into a single weekly record for a given product. This will not be the case, due to the sparsity of data. Since each store sells 15,000 distinct products in a week, the number of rows in the aggregate will not be 36,500,000; the number of rows will be 78,000,000, or more than double what we were expecting!
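Before committing to an aggregate, its row count can be estimated directly against the fact table by counting the distinct combinations that would survive the roll-up. A minimal sketch, assuming a fact table sales_fact keyed by product_id, store_id, and sale_date, and an Oracle-style week truncation (all names are assumptions):

-- Detail-level row count.
SELECT COUNT(*) AS detail_rows
FROM   sales_fact;

-- Estimated size of a product-by-store-by-week aggregate.
SELECT COUNT(*) AS weekly_agg_rows
FROM  (SELECT DISTINCT product_id,
                       store_id,
                       TRUNC(sale_date, 'IW') AS sale_week
       FROM   sales_fact) w;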


Since we are trying to reduce the number of rows a query must process, one of the key steps is finding aggregates where the intersection of dimensions shows a significant decrease in the number of rows. Figure 2 shows the row counts for all possible aggregates of product by store by day, using one year of data for a 200-store retail grocer.

Figure 2: Row counts for possible aggregates. The base level of detail is product by store by day, shown in the upper left corner of the chart. The highest level summary is total sales corporate-wide in the lower right, containing a single row for each day.

Looking at this chart, it is apparent that creating aggregates at some of the highest levels will provide minimal performance improvement. Depending on the frequency of usage, there are several likely candidates. Any of the subcategory-level aggregates provide a significant reduction in volume and would be good starting points for exploration. The brand-by-district aggregate provides a very significant drop over the detail data, and will probably be small enough that all higher-level product and geography queries may be satisfied by this aggregate. One thing to keep in mind is that it is appealing to decide based on what you can see in the chart, but there are still tens of millions of rows in some of the lower-level aggregates. Knowing how fast your database and hardware combination can move data for a query is still important, and will help you determine where it is practical to stop aggregating.

Aggregate Storage
Once you have made an initial decision about which aggregates to create, you have to answer the next question: how to create and store those aggregates. There are two parts to this question. The first is how to store the aggregated data. The second is how to create and update the aggregated data.


Storing aggregates can be complicated by the columns available in the base fact table. Some data may be invalid at a higher aggregate level, or it may not be possible to summarize the data in a column. For example, it is not possible to aggregate automobile insurance claims information by vehicle type and still preserve information from the claim such as the gender of the policyholder. This will be true of most semi-additive fact data and all non-additive fact data. Since semi-additive and non-additive data is only valid at the detail level, we will very likely have fewer columns in an aggregate table than we have in the fact table. This is a common issue for businesses such as insurance, catalog sales, subscription services, and health care. There are some ways to preserve a portion of the information. In the example above, two "count" columns could be added, storing the number of male and female claimants respectively.

Another item which appears regularly in aggregate table design is the required precision of columns storing counts or monetary values. For data at a low level of detail, the values stored in a column may never exceed five digits. If the data is summarized for an entire week, the column may require seven or more digits. This must be taken into account when creating the physical table to store an aggregate.

Storing Aggregate Rows
There are three basic options for storing the aggregated data, which are diagrammed in Figure 3. You can create a combined fact and aggregate table which contains both the base-level fact rows and the aggregate rows. You can create a single aggregate table which holds all aggregate data for a single fact table. Lastly, you can create a separate table for each aggregate created.

Figure 3: Three possibilities for storing aggregates. Aggregates may be stored in the same table with the base-level fact rows, they may be stored in a separate aggregate-only table, or they may be stored in individual aggregate tables.

I normally recommend creating a separate table for each aggregate. The combined fact and aggregate table approach is appealing, but it usually results in a very large and unmanageable table. The single aggregate table is almost as unmanageable. Both approaches suffer from contention problems during query and update, issues with data storage for columns which are not valid at higher levels of aggregation, and the possibility of incorrectly summarizing data in a query.


The contention problem with a single table for detail and aggregates is straightforward: the same table is read from and written to in order to create or update the aggregate rows. Given the large batch nature of aggregate creation and update, contention during the batch cycle may be considerable. Query contention due to all end-user queries hitting the same table will be an issue, as will indexing in such a way that aggregate rows can be efficiently retrieved. The same drawbacks apply to a single separate table for all aggregate rows. Using a separate table for each aggregate avoids these problems and has the advantages of allowing independent creation and removal of aggregates, simplified keying of the aggregates, and easier management of performance issues (for example, spreading I/O load by rearranging tables on disks, or allowing multiple aggregates to be updated concurrently).

The most difficult issue to resolve is the complication of end-user query access. The complication results from the introduction of a number of possible tables from which the data may be queried. This design approach introduces a new factor into the selection of end-user query tools, particularly ad hoc query tools: they should be "aggregate aware". Products which are not "aggregate aware" will present users with all the fact and aggregate tables, and it will be up to the user to select the appropriate table for their query. This is not practical with more than a few aggregates. The problem is worse for the custom application designer because queries are embedded in the program; if an aggregate is added or removed, the program must be manually changed to query the appropriate table. With programmatic interfaces, these issues can be managed by designing the applications to dynamically generate queries against the appropriate table. For packaged query tools the issue is somewhat more complex. There are products available which act as intelligent middleware: they provide a single logical view of the schema and hide the aggregates from the user or developer. They operate by examining the query and rewriting it so that it uses the appropriate aggregate table rather than the base fact table. Examples of companies providing this type of software are MicroStrategies, Information Advantage, and Stanford Technologies (now owned by Informix). The logical place for this type of query optimization is in the database itself, but no commercial RDBMS vendor has provided extensions to their products to handle this issue. In spite of this limitation, I prefer this design for aggregate storage due to its advantages over the other methods. For the remainder of this article I will assume that each aggregate is stored in an individual table.

Storing Aggregate Dimension Rows
A big issue encountered when storing aggregates is how the dimensions will be managed. Normally the dimensions contain one row for each discrete item. For example, a product dimension has a single row for each product manufactured by the company. The question arises: how do you store information about hierarchies so that the fact and aggregate tables are appropriately keyed and queried? No matter how the dimensions and aggregates are handled, the aggregate rows will require generated keys.


This is because the levels in a dimension hierarchy are not actually elements of the dimension; they are constructs above the detail level within the dimension. This is easily seen if we look at the company geography dimension described in the example for Figure 2. The granularity of the fact table is product by store by day. This means the base level in the geography dimension is the store level, and all fact rows will have, as part of their key, the store key from a row in this dimension. The hierarchy in the dimension is store -> district -> region -> all stores. There is no row in the dimension table describing a district or region; we must create these rows and derive keys for them, and the keys cannot duplicate any of the base-level keys already present in the dimension table.

This can be done in several ways. The preferred method is to store all of the aggregate dimension records together in a single table. This makes it simple to view dimension records when looking for particular information prior to querying the fact or aggregate tables. There is one issue with the column values if all the rows are stored in a single table like this: when adding an aggregate dimension record there will be some columns for which no values apply. For example, a district-level row in the geography dimension will not have a store number. This is shown in Figure 4.
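A minimal sketch of what such a combined dimension table might look like; the key ranges, column names, and the level column are assumptions for illustration, not a prescribed design.

-- Base-level store rows use one key range; district and region rows use
-- generated keys in higher ranges, with inapplicable columns left NULL.
-- geo_key | store_no | store_name | district    | region | geo_level
-- --------+----------+------------+-------------+--------+----------
--     101 |      101 | Elm Street | North Metro | East   | store
--   10001 |     NULL | NULL       | North Metro | East   | district
--   20001 |     NULL | NULL       | NULL        | East   | region

-- A pick list of districts can use the level column
-- (or a SELECT DISTINCT on the district column).
SELECT district
FROM   geography_dim
WHERE  geo_level = 'district';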

Figure 4: Storing aggregate dimension rows. Each level above the base store level has keys in a distinct range to avoid conflicts, and all column values are empty for those columns which do not apply at the given level.

When you wish to create a pick list of values for a level in the hierarchy, you can issue a SELECT DISTINCT on the column for that level. An alternative is to include a level column which contains a single value for each level in the hierarchy; queries for a set of values at a particular level then need only select the rows where the level column matches the level required.

Other methods for storing the aggregate dimension rows include using a separate table for each level in the dimension, normalizing the dimension, or using one table for the base dimension rows and a separate table for the hierarchy information. The disadvantage of all of these methods is that the dimension is stored in multiple tables, which further complicates the query process. The first method is conceptually clean because each fact table has a set of dimension tables associated only with that table, so all data is available at the same grain. The problem comes when the user is viewing dimension data at one level and then wants to drill up or down along a hierarchy: browsing through values in the dimension becomes extremely complicated. In addition, there are now many more tables and table relationships to maintain. This runs counter to the goal of the dimensional model, which is to simplify access to the data.


Normalizing the dimension is another way to store the hierarchy information. Rather than storing values in dimension columns for the different levels of a hierarchy and issuing a SELECT DISTINCT on the appropriate column, a key to another table is stored. That table contains just the values for the column in question. This is not much different from storing the values in the dimension table, and it complicates queries by adding more tables. Again, this runs counter to the goal of simplifying access for both the user and the query optimizer in the database. Using a single table for the base-level dimension rows and a separate table for all aggregate dimension rows has the disadvantage of adding another table, but it has an advantage which may make this approach better than using a single dimension table for all rows: if creating non-duplicate key values for the base-level dimension rows and the aggregate rows is difficult, storing the aggregate rows in a separate table makes the problem simpler to resolve. The aggregate dimension rows can use a simpler key structure, since they are no longer under the column constraints imposed by the base-level dimension.

Another topic worth mentioning in the storing of aggregate dimensions is multiple hierarchies in a single dimension. This shows up frequently when initially designing the dimensions, and it has an impact on the aggregates. When a dimension has multiple hierarchies, the number of possible aggregates is multiplied by the number of levels in the extra hierarchies. When counting the possible aggregates you must remember to take into account each hierarchy which exists in a dimension. Multiple hierarchies may create further problems at higher summary levels, because values at a low level may be double counted at a higher level. Places where you will frequently find multiple hierarchies are customer dimensions, product dimensions, and the time dimension. Products may have several hierarchies depending on whether you view them from a manufacturing, warehousing, or sales perspective. Customer dimensions will sometimes have hierarchies for physical geography, demographic geography, and organizational geography. You might see two hierarchies in the time dimension: one for the calendar year and one for the fiscal year.

Once the method for storing aggregates and their dimension values is chosen, the next step is to create the aggregates. The optimal approach depends on the volume of data, the number of aggregates, and the parallel capabilities of your database and hardware. Since there is no single best method, I will offer some guidelines on approaches.

Aggregate Creation
There are a number of factors which help to define the approach. The first is the size of the fact table. It is not uncommon for a fact table to contain hundreds of millions to more than a billion rows of detail data, and to exceed 75 gigabytes in size. This volume of data will limit approaches which require frequent recalculation of the aggregates, or which require multiple scans through the fact table. The number of aggregates which must be created is another constraint. A typical fact table may have more than fifty aggregate tables in a production system; the number will depend on the fact table size, the number of dimensions, and query performance requirements. As the number of aggregates grows, the processing window required for the batch update cycle will increase, possibly spilling over into the online usage period.


Another constraint is the parallel capabilities of the database and computing platform. With typical volumes of data, it is unlikely that a simplistic approach using a single-threaded program and no database parallelism will complete in a reasonable amount of time. For very large databases and high numbers of concurrent users, high-end symmetric multiprocessing (SMP) or massively parallel (MPP) platforms will be required. Parallel database performance improvements are impressive, but the usefulness of the technology may be limited. Depending on the database, only certain SQL operations may be parallelized; most commercial databases have limitations of this type. This can be a serious issue when building an aggregate table. If you are creating a very large table and the database cannot parallelize the INSERT statement, then you might be faced with a bottleneck that prevents you from using a simple SQL statement to create the table. In addition, there may be constraints on the query portion of the statement such that only certain types of queries will execute in parallel. This can effectively cripple the statement by turning it into single-threaded access to millions of rows of data. If the critical path of nightly batch processing cannot fully utilize the hardware, parallelism may help alleviate single-stream bottlenecks by allowing certain processes to use more resources and complete sooner. If the aggregate processing has already been parallelized by partitioning the work into multiple application processes, then there may be less benefit. Due to the brute-force nature of many parallel implementations, databases have the ability to use all available resources on a server for a single SQL statement. This resource utilization often constrains the use of parallel operations to a limited scope: if you try to run more than a handful of such operations without constraining them in some way, they will introduce serious contention issues.

Recreating Versus Updating Aggregates
One of the major design choices for the aggregation programs is whether to drop and recreate the aggregate tables during the batch cycle, or to update the tables in place. The time to completely regenerate an entire aggregate table is a prime consideration. Some aggregates may be too large, or require summarizing too much data, for the regeneration approach to work effectively. Conversely, regeneration may be more appropriate if a lot of program logic is needed to determine what data must be updated in the aggregate table. Period-to-date aggregates create their own special set of problems when making the recreate-versus-update decision. When updating the aggregate, new data which is not yet present in the aggregate table will likely require insertion. This implies a two-pass approach in the aggregate program design, where the first pass scans the aggregate data to see what should be inserted and what is already present, and the second pass updates the existing data but not the newly inserted data from the first pass. Updates to the rows can cause database update and query performance issues if the tables are not tuned properly. Updating column values may create problems with internal space allocation if numeric values are stored with a variable-length encoding scheme.


If a dollar value grows from one digit to eight digits over several updates to rows in a month-to-date sales table, the database must reallocate storage for those rows (Oracle refers to this as row chaining). This will lead to slower update and query performance over time, eventually requiring a table reorganization. Given the data volumes in a typical decision support database, it will probably be most efficient for aggregation programs to update the aggregate tables with the newly loaded data rather than dropping and recreating all aggregates. The tradeoff with this approach is programming complexity: creating an aggregate table may be as simple as a single SELECT statement, while updating the same table may require several passes through the new data and the existing aggregate table.

Single-Threaded Versus Multi-Threaded Creation
The program to create aggregates can be written to build or update the tables in single-threaded fashion (one at a time). If the number of aggregates to generate is limited, data volumes are not very large, there is little concurrent activity on the system, or the processing window is large, then generating aggregates in single-stream fashion may be practical. This approach requires less development effort because there is no coordination among processes, and there will be few dependencies due to the serial nature of processing. Also, if the queries that create the aggregates can take advantage of database parallelism, then all resources on the server may be dedicated to a single process, allowing the operation to complete in a fraction of the time. One limitation of this approach is queries which do not take advantage of parallelism: if the processing window is not sufficiently large, these processes will become bottlenecks due to the serial nature of the design. Another limitation is the inflexibility of a single batch stream; if dependencies are created or new long-running processes are added, the design may require changes.

If there are many aggregates, multiple period-to-date aggregates, or large volumes of data, then a design which allows multiple processes to execute simultaneously will probably be required. A multi-threaded design must take into account the impact of simultaneous access to data stored in the same table, and the impact of writing aggregated data into the aggregate tables. This requires more detailed knowledge of the data in order to schedule the programs. Running too many concurrent programs could create a bottleneck in CPU, memory, or I/O resources on the platform, or cause contention in the logging mechanism of the database. This approach requires more effort to design and develop, due to the added dependencies and constraints on processes and the added monitoring and scheduling required. There will also be fewer opportunities to use the brute-force approach that database parallelism allows with single large operations.

Using a "Cascading" Model to Create Aggregates
A final note on aggregate creation concerns aggregates which are one or more levels removed from the base-level data. It may be possible to use aggregates stored at lower levels to generate the higher-level aggregates, as shown in Figure 5. By using the lower-level tables, the program will perform less work to generate successively higher-level aggregates. This approach may be taken with either the single-stream or multi-threaded designs.


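As a sketch of what this creation logic can look like in plain SQL, the statements below build a first-level aggregate from the fact table and then a higher-level aggregate from that aggregate, in the cascading style discussed here. The table and column names, and the Oracle-style CREATE TABLE AS SELECT and week truncation, are assumptions for illustration.

-- First-level aggregate, built directly from the detail fact table.
CREATE TABLE agg_sales_store_week AS
SELECT product_id,
       store_id,
       TRUNC(sale_date, 'IW') AS sale_week,
       SUM(units_sold)        AS units_sold,
       SUM(amount)            AS amount
FROM   sales_fact
GROUP BY product_id, store_id, TRUNC(sale_date, 'IW');

-- Higher-level aggregate, built from the smaller weekly aggregate
-- instead of rescanning the detail fact table.
CREATE TABLE agg_sales_district_week AS
SELECT d.district_id,
       a.product_id,
       a.sale_week,
       SUM(a.units_sold) AS units_sold,
       SUM(a.amount)     AS amount
FROM   agg_sales_store_week a
JOIN   store_dim d ON d.store_id = a.store_id
GROUP BY d.district_id, a.product_id, a.sale_week;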


Figure 5: Tables in a cascading design. The base table is at the grain of product by salesperson by customer by day. The aggregation program creates the first aggregate, in this case summarizing the data to the level of product by salesperson by day. The next aggregate is product by district by month, and is built from the previous aggregate table.

When designing a cascading model like this, there are several issues which should be taken into consideration. The addition of dependencies on intermediate processing may result in missing high-level aggregates due to a failure during creation of a lower-level table. If there are numerous high-level aggregates, then a lower-level failure will result in many missing aggregates. If the problem cannot be fixed before the users log on to the system, they will suffer from seriously degraded performance for high-level queries. Error propagation can also be an issue with cascading creation: if an error is encountered during aggregation at a lower level, the error will be propagated throughout all higher-level aggregates, and correcting it will require the recalculation of all data at all levels above the level where the error first occurred. When choosing whether to use a cascading approach, a key consideration is the system management and software maintenance impact versus the available processing window. If there is sufficient time available to process all aggregates from the base data, the cascading approach should not be taken, because it is more complex. Strong change management practices and coordinated development are required to avoid spending excessive time solving operational problems when something changes or goes wrong.

Aggregate Maintenance
Once the system is in production, a new set of problems will introduce itself.


After the initial rollout of the system, some users will experience very long-running queries or reports. This may result in the need for new aggregates to resolve the performance issues. It will also turn out that certain aggregates are rarely used by any query; these are good candidates for removal, since they do nothing other than take up space in the database. An interesting pattern I have observed with decision support databases is that over a period of time the usage of the data changes. Users become more educated about the available information and the questions they ask evolve. Previously useful aggregates are used less, and new aggregates are required to meet current performance expectations. This implies continued maintenance of the aggregation programs. If an aggregate is no longer useful and is removed from the system, all associated programs must be updated. If you keep specific metadata about aggregates, it must be updated. If you are using a cascading model for creating aggregates, the addition of an aggregate may provide a more efficient base for existing higher-level aggregates and suggest changes in how they are generated.

Aggregate maintenance is a mostly manual process and requires monitoring the usage of aggregates by the users. There are some end-user access products on the market which include a query monitor component that collects statistics on the usage of aggregates and on the queries which are executed. In many cases, you must build a monitoring component as part of your decision support system. This type of data collection is very useful in determining when to add or remove aggregates. Some users will notify the DBA when there is a performance issue with a query; many will assume there is a problem with their PC or the database, or that this is normal. Without statistics or their feedback there is no way to know whether the system is performing adequately. The most basic information which should be collected in order to monitor aggregate usage is the number of queries against the fact table and the number of queries against each of the aggregate tables. These two data points will indicate which tables are or are not being used. If there are frequent queries against low-level aggregates or the fact table, this is an indication that another aggregate may be in order. Beyond this basic table-level information, the following items are very useful if they can be captured:

- Column value histograms on the constraint columns in the fact and aggregate tables: useful in determining the selectivity of various possible indexes, which will influence your indexing strategy.
- Histograms for the combination of values at each level in the fact table: this information helps estimate the row counts in any aggregate you might wish to create.
- Query parse counts: useful only if a fixed-query front-end application is used to access the data. This tells exactly how many times a given set of information was queried, and the total count gives an idea of how much the system is being used.
- Query duration: a simple measure for finding excessively long queries which should be examined in further detail to see whether they are the result of excessive I/O or poorly written SQL.
- Query resource utilization: queries may be highly complex, and therefore slow; there is nothing to do with this type of query except try to rewrite it. If a query has a very large amount of logical I/O but returns few rows, it is reading a large amount of data and might benefit from an aggregate.
- Number of rows retrieved: very important, because some queries simply return lots of data, and that is the reason for the slow response. If a query returns 10,000 rows, an aggregate is not going to reduce the number of rows returned.
- Level of data requested: knowing the levels of data requested along the dimension hierarchies tells whether queries are accessing the correct aggregate level. It may be that they are querying a lower-level aggregate due to the absence of an aggregate at the required level.


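If such statistics are captured in a simple query log, the basic table-level usage check can be a query like the one below. The query_log table and its columns are hypothetical; the text does not define a specific monitoring schema.

-- Count how often each fact or aggregate table is hit, to spot unused
-- aggregates and over-used detail tables.
SELECT target_table,
       COUNT(*)           AS query_count,
       AVG(duration_secs) AS avg_duration_secs
FROM   query_log
WHERE  log_date >= DATE '2024-01-01'
GROUP BY target_table
ORDER BY query_count DESC;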

Conclusion
Some of the techniques mentioned above, mainly in the monitoring and analysis space, have been adopted by database and query tool vendors. One rapidly growing product area is the "aggregate aware" query tool arena, which includes vendors like MicroStrategy, Business Objects, and others. Oracle has also added aggregate awareness to its database engine. Aggregate-aware tools can process queries issued against a base-level dimensional schema and select the appropriate aggregate to satisfy each query. Some of the products include a component which can monitor queries and indicate potential candidates for aggregation. One drawback of many of these products is that they are not "open": many require that if you use their aggregate middleware, you also use their aggregation tool. For the most part, when deciding how to create and maintain aggregates in the database, it is up to the implementor to determine the optimal approach for their set of constraints. This article only touches the surface of the issues around performance in a large decision support database. I advise people starting complex projects like these to seek professional consulting help from companies with experience in the end-to-end implementation of similar projects, and with a proven record of successful references.

Data aggregation can grow into a complex process over time. It is always good to plan the business architecture so that data stays in sync between real activities and the data model simulating the real scenario. IT decision makers need to make careful choices in software applications, as there are hundreds of options available from software vendors and developers around the world.

3.10 Data Marting
A data mart is a subset of an organizational data store, usually oriented to a specific purpose or major data subject, that may be distributed to support business needs. Data marts are analytical data stores designed to focus on specific business functions for a specific community within an organization.


Data marts are often derived from subsets of data in a data warehouse, though in the bottom-up data warehouse design methodology the data warehouse is created from the union of the organizational data marts.

Terminology
In practice, the terms data mart and data warehouse each tend to imply the presence of the other in some form. However, most writers agree that the design of a data mart tends to start from an analysis of user needs, while a data warehouse tends to start from an analysis of what data already exists and how it can be collected in such a way that it can later be used. A data warehouse is a central aggregation of data (which can be physically distributed); a data mart is a data repository that may or may not derive from a data warehouse and that emphasizes ease of access and usability for a particular designed purpose. In general, a data warehouse tends to be strategic but somewhat unfinished in concept; a data mart tends to be tactical and aimed at meeting an immediate need. One writer, Marc Demerest, suggests combining the ideas into a Universal Data Architecture (UDA). In practice, many products and companies offering data warehouse services also offer data mart capabilities or services.

There can be multiple data marts inside a single corporation, each relevant to one or more business units for which it was designed. Data marts may or may not be dependent on or related to other data marts in the corporation. If the data marts are designed using conformed facts and dimensions, then they will be related. In some deployments, each department or business unit is considered the owner of its data mart, including all the hardware, software, and data. This enables each department to use, manipulate, and develop its data any way it sees fit, without altering information inside other data marts or the data warehouse. In other deployments, where conformed dimensions are used, this business unit ownership will not hold true for shared dimensions like customer, product, and so on.

Design schemas

Star schema or dimensional model: a fairly popular design choice, as it enables a relational database to emulate the analytical functionality of a multidimensional database.
Snowflake schema: a star schema variant in which the dimension tables are normalized.


The star schema (sometimes referred to as a star join schema) is the simplest style of data warehouse schema, consisting of a few "fact tables" (possibly only one, justifying the name) referencing any number of "dimension tables". The "facts" that the data warehouse helps analyze are classified along different "dimensions": the fact tables hold the main data, while the usually smaller dimension tables describe each value of a dimension and can be joined to fact tables as needed. Dimension tables have a simple primary key, while fact tables have a compound primary key consisting of the relevant dimension keys. It is common for dimension tables to consolidate redundant data and be in second normal form, while fact tables are usually in third normal form, because all data depend on either one dimension or all of them, not on combinations of a few dimensions.

The star schema is a way to implement multi-dimensional database (MDDB) functionality using a mainstream relational database: given the typical commitment to relational databases of most organizations, a specialized multidimensional DBMS is likely to be both expensive and inconvenient. Another reason for using a star schema is its simplicity from the users' point of view: queries are never complex, because the only joins and conditions involve a fact table and a single level of dimension tables, without the indirect dependencies on other tables that are possible in a better normalized snowflake schema.

Example
Consider a database of sales, perhaps from a store chain, classified by date, store, and product.
f_sales is the fact table and there are three dimension tables d_date, d_store and d_product.

Each dimension table has a primary key called id, corresponding to the three-column compound primary key (date_id, store_id, product_id) of f_sales. Data columns include f_sales.units_sold (along with sale price, discounts, etc.); d_date.year (and other date components); d_store.country (and other store address components); and d_product.category and d_product.brand (and product name, etc.). The following query returns how many TV sets have been sold, for each brand and country, in 1997.
SELECT P.brand, S.country, SUM(FS.units_sold)
FROM f_sales FS
  INNER JOIN d_date D ON D.id = FS.date_id
  INNER JOIN d_store S ON S.id = FS.store_id
  INNER JOIN d_product P ON P.id = FS.product_id
WHERE D.year = 1997
  AND P.category = 'tv'
GROUP BY P.brand, S.country

A snowflake schema is a way of arranging tables in a relational database such that the entity-relationship diagram resembles a snowflake in shape. At the center of the schema are fact tables which are connected to multiple dimensions. When the dimensions consist of only single tables, you have the simpler star schema. When the dimensions are more elaborate, having multiple levels of tables, and where child tables have multiple parent tables ("forks in the road"), a complex snowflake starts to take shape. Generally, whether a snowflake or a star schema is used affects only the dimension tables; the fact table is unchanged.

The star and snowflake schemas are most commonly found in data warehouses, where speed of data retrieval is more important than speed of insertion. As such, these schemas are not highly normalized, and dimensions are frequently left at second or third normal form. The decision on whether to employ a star schema or a snowflake schema should consider the relative strengths of the database platform in question and the query tool to be employed. A star schema should be favored with query tools that largely expose users to the underlying table structures, and in environments where most queries are simpler in nature. Snowflake schemas are often better with more sophisticated query tools that isolate users from the raw table structures, and for environments having numerous queries with complex criteria.
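To make the contrast concrete, here is a minimal sketch of how the d_product dimension from the earlier star example might be snowflaked; the extra tables and columns are assumptions for illustration.

-- Star version: one denormalized dimension table.
--   d_product(id, name, brand, category)

-- Snowflake version: the product dimension is normalized into a chain
-- of tables, so brand- or category-level queries need extra joins.
CREATE TABLE d_category (
    id   INTEGER PRIMARY KEY,
    name VARCHAR(50)
);

CREATE TABLE d_brand (
    id          INTEGER PRIMARY KEY,
    name        VARCHAR(50),
    category_id INTEGER REFERENCES d_category(id)
);

CREATE TABLE d_product (
    id       INTEGER PRIMARY KEY,
    name     VARCHAR(100),
    brand_id INTEGER REFERENCES d_brand(id)
);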

Reasons for creating a data mart


- Easy access to frequently needed data
- Creates a collective view for a group of users
- Improves end-user response time
- Ease of creation
- Lower cost than implementing a full data warehouse
- Potential users are more clearly defined than in a full data warehouse

Dependent data mart
According to the Inmon school of data warehousing, a dependent data mart is a logical subset (view) or a physical subset (extract) of a larger data warehouse, isolated for one of the following reasons:


- A need for a special data model or schema: e.g., to restructure for OLAP
- Performance: to offload the data mart to a separate computer for greater efficiency, or to obviate the need to manage that workload on the centralized data warehouse
- Security: to separate an authorized data subset selectively
- Expediency: to bypass the data governance and authorizations required to incorporate a new application on the enterprise data warehouse
- Proving ground: to demonstrate the viability and ROI (return on investment) potential of an application prior to migrating it to the enterprise data warehouse
- Politics: a coping strategy for IT (Information Technology) in situations where a user group has more influence than funding or is not a good citizen on the centralized data warehouse
- Politics: a coping strategy for consumers of data in situations where a data warehouse team is unable to create a usable data warehouse

According to the Inmon school of data warehousing, tradeoffs inherent with data marts include limited scalability, duplication of data, data inconsistency with other silos of information, and inability to leverage enterprise sources of data.

3.11 Meta Data
The primary rationale for data warehousing is to provide businesses with analytical results from data mining, OLAP, scorecarding, and reporting. The cost of obtaining front-end analytics is lowered if there is consistent data quality all along the pipeline from the data source to analytical reporting.


Figure 1. Overview of Data Warehousing Infrastructure

Metadata is about controlling the quality of data entering the data stream. Batch processes can be run to address data degradation or changes to data policy. Metadata policies are enhanced by using metadata repositories. One of the projects we recently worked on was with a major insurance company in North America. The company had amalgamated over the years through acquisitions and had also developed external back-end data integrations with banks and reinsurance partners.


Figure 2. Disparate Data Definition Policies in an Insurance Company

The client approached DWreview because they felt they were not obtaining a sufficient return on investment from their data warehouse. Prediction analysis, profit-loss ratios, and OLAP reports were labor- and time-intensive to produce. The publicly listed insurance company was also in the process of implementing a financial scorecarding application to monitor compliance with the Sarbanes-Oxley Act. In consultation with the company's IT managers, we analyzed the trade-offs of different design changes. The first step in realigning the data warehousing policies was examining the metadata policies and deriving a unified view that could work for all stakeholders. As the company was embarking on a new scorecarding initiative, it became feasible to bring the departments together and propose a new enterprise-wide metadata policy. Departments had created their own data marts to gain quick access to reports because they had felt the central data warehouse was not responsive to their needs. This also created a bottleneck, as data was not always replicated between the repositories. With the IT manager's approval and the buy-in of departmental managers, a gradual phase-in of a company-wide metadata initiative was introduced. Big-bang approaches rarely work, and the consequences are extremely high in competitive industries such as insurance. The metaphor we used for the project was the quote from Shakespeare's Julius Caesar given at the start of the article. We felt that this was a potentially disruptive move, but that if the challenges were met positively, the rewards would be just.

Figure 3. Company-wide Metadata Policy


Industry metadata standards exist in verticals such as insurance, banking, and manufacturing. The OMG's Common Warehouse Metadata Initiative (CWMI) is a vendor-backed proposal to enable easy interchange of metadata between data warehousing tools and metadata repositories in distributed heterogeneous environments.

Figure 4. Partial Schematic Overview of Data Flow after Company-wide Metadata Implementation

In the months since the implementation, the project has been moving along smoothly. Training seminars were given to keep staff abreast of the development, and the responses were overwhelmingly positive. The implementation of the Sarbanes-Oxley scorecarding initiative was on time and relatively painless; many of the challenges that would have been faced without a metadata policy were avoided. With a unified data source and definition, the company is pressing further along the analysis journey. OLAP reporting is being extended across the organization, with greater access for all employees. Data mining models are now more accurate, as the model sets can be scored and trained on larger data sets. Text mining is being used to evaluate claims examiners' comments regarding insurance claims made by customers. The text mining tool was custom-developed by DWreview for the client's unique requirements. Without metadata policies in place it would be next to impossible to perform coherent text mining.


The metadata terminologies used in claims examination were developed in conjunction with insurance partners and brokers. Using the text mining application, the client can now monitor consistency in claims examination, identify trends for potential fraud analysis, and provide feedback for insurance policy development.

Developing metadata policies for organizations falls into three project management spheres: generating project support, developing suitable guidelines, and setting technical goals. For a successful metadata implementation, strong executive backing and support must be obtained. A tested method for gathering executive sponsorship is to first set departmental metadata standards and evaluate the difference in efficiency. As metadata is abstract in concept, a concrete demonstration of this kind can be helpful. It will also help in gaining trust from departments that may be reluctant to hand over metadata policies.

3.12 System and data warehouse process managers
Data Warehouse Usage
Data warehouses and data marts are used in a wide range of applications. Business executives use the data in data warehouses and data marts to perform data analysis and make strategic decisions. In many firms, data warehouses are used as an integral part of a plan-execute-assess closed-loop feedback system for enterprise management. Data warehouses are used extensively in banking and financial services, consumer goods and retail distribution, and controlled manufacturing, such as demand-based production.

Typically, the longer a data warehouse has been in use, the more it will have evolved. This evolution takes place throughout a number of phases. Initially, the data warehouse is mainly used for generating reports and answering predefined queries. Progressively, it is used to analyze summarized and detailed data, where the results are presented in the form of reports and charts. Later, the data warehouse is used for strategic purposes, performing multidimensional analysis and sophisticated slice-and-dice operations. Finally, the data warehouse may be employed for knowledge discovery and strategic decision making using data mining tools. In this context, the tools for data warehousing can be categorized into access and retrieval tools, database reporting tools, data analysis tools, and data mining tools. Business users need the means to know what exists in the data warehouse (through metadata), how to access its contents, how to examine the contents using analysis tools, and how to present the results of such analysis.

There are three kinds of data warehouse applications: information processing, analytical processing, and data mining.

Information processing supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs. A current trend in data warehouse information processing is to construct low-cost Web-based accessing tools that are then integrated with Web browsers.


Analytical processing supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and pivoting. It generally operates on historical data in both summarized and detailed forms. The major strength of on-line analytical processing over information processing is the multidimensional analysis of data warehouse data.

Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.

How does data mining relate to information processing and on-line analytical processing? Information processing, based on queries, can find useful information. However, answers to such queries reflect the information directly stored in databases or computable by aggregate functions. They do not reflect sophisticated patterns or regularities buried in the database. Therefore, information processing is not data mining. On-line analytical processing comes a step closer to data mining because it can derive information summarized at multiple granularities from user-specified subsets of a data warehouse. Such descriptions are equivalent to the class/concept descriptions discussed in Chapter 1. Because data mining systems can also mine generalized class/concept descriptions, this raises some interesting questions: Do OLAP systems perform data mining? Are OLAP systems actually data mining systems?

The functionalities of OLAP and data mining can be viewed as disjoint: OLAP is a data summarization/aggregation tool that helps simplify data analysis, while data mining allows the automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data. OLAP tools are targeted toward simplifying and supporting interactive data analysis, whereas the goal of data mining tools is to automate as much of the process as possible, while still allowing users to guide the process. In this sense, data mining goes one step beyond traditional on-line analytical processing.

An alternative and broader view of data mining may be adopted in which data mining covers both data description and data modeling. Because OLAP systems can present general descriptions of data from data warehouses, OLAP functions are essentially for user-directed data summary and comparison (by drilling, pivoting, slicing, dicing, and other operations). These are, though limited, data mining functionalities. Yet according to this view, data mining covers a much broader spectrum than simple OLAP operations, because it performs not only data summary and comparison but also association, classification, prediction, clustering, time-series analysis, and other data analysis tasks.

Data mining is not confined to the analysis of data stored in data warehouses. It may analyze data existing at more detailed granularities than the summarized data provided in a data warehouse. It may also analyze transactional, spatial, textual, and multimedia data that are difficult to model with current multidimensional database technology. In this context, data mining covers a broader spectrum than OLAP with respect to data mining functionality and the complexity of the data handled. Because data mining involves more automated and deeper analysis than OLAP, data mining is expected to have broader applications. Data mining can help business managers find and reach more suitable customers, as well as gain critical business insights that may help drive market share and raise profits.
In addition, data mining can help managers understand customer group characteristics and develop optimal pricing strategies accordingly, correct item bundling based

130

not on intuition but on actual item groups derived from customer purchase patterns, reduce promotional spending, and at the same time increase the overall net effectiveness of promotions.
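The contrast between user-directed OLAP summarization and automated pattern search can be pictured with a small, self-contained sketch. The table, column names, and figures below are invented for illustration only, and the example assumes the pandas library is available.

# Illustrative sketch (not from the text): contrasting an OLAP-style roll-up
# with a simple pattern search, using a hypothetical sales table in pandas.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["milk", "bread", "milk", "milk"],
    "amount":  [1200, 800, 950, 1100],
})

# OLAP-style roll-up: summarize amount by region (user-directed aggregation).
rollup = sales.groupby("region")["amount"].sum()

# Slice: restrict the cube to a single quarter before summarizing.
q1_slice = sales[sales["quarter"] == "Q1"].groupby("region")["amount"].sum()

# Data-mining flavour: automatically look for items that co-occur with milk
# in the same (region, quarter) cell, a crude association-style search.
cells_with_milk = sales[sales["item"] == "milk"][["region", "quarter"]]
candidates = sales.merge(cells_with_milk, on=["region", "quarter"])
frequent_with_milk = candidates[candidates["item"] != "milk"]["item"].value_counts()

print(rollup, q1_slice, frequent_with_milk, sep="\n\n")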

3.13 Summary
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data organized in support of management decision making. Several factors distinguish data warehouses from operational databases. Because the two systems provide quite different functionalities and require different kinds of data, it is necessary to maintain data warehouses separately from operational databases.

A multidimensional data model is typically used for the design of corporate data warehouses and departmental data marts. Such a model can adopt a star schema, snowflake schema, or fact constellation schema. The core of the multidimensional model is the data cube, which consists of a large set of facts (or measures) and a number of dimensions. Dimensions are the entities or perspectives with respect to which an organization wants to keep records and are hierarchical in nature. A data cube consists of a lattice of cuboids, each corresponding to a different degree of summarization of the given multidimensional data. Concept hierarchies organize the values of attributes or dimensions into gradual levels of abstraction. They are useful in mining at multiple levels of abstraction.

On-line analytical processing (OLAP) can be performed in data warehouses/marts using the multidimensional data model. Typical OLAP operations include roll-up, drill-(down, across, through), slice-and-dice, and pivot (rotate), as well as statistical operations such as ranking and computing moving averages and growth rates. OLAP operations can be implemented efficiently using the data cube structure.

Data warehouses often adopt a three-tier architecture. The bottom tier is a warehouse database server, which is typically a relational database system. The middle tier is an OLAP server, and the top tier is a client containing query and reporting tools. A data warehouse contains back-end tools and utilities for populating and refreshing the warehouse. These cover data extraction, data cleaning, data transformation, loading, refreshing, and warehouse management. Data warehouse metadata are data defining the warehouse objects. A metadata repository provides details regarding the warehouse structure, data history, the algorithms used for summarization, mappings from the source data to warehouse form, system performance, and business terms and issues.

OLAP servers may use relational OLAP (ROLAP), multidimensional OLAP (MOLAP), or hybrid OLAP (HOLAP). A ROLAP server uses an extended relational DBMS that maps OLAP operations on multidimensional data to standard relational operations. A MOLAP server maps multidimensional data views directly to array structures. A HOLAP server combines ROLAP and MOLAP. For example, it may use ROLAP for historical data while maintaining frequently accessed data in a separate MOLAP store.

Full materialization refers to the computation of all of the cuboids in the lattice defining a data cube. It typically requires an excessive amount of storage space, particularly as the number of dimensions and the size of associated concept hierarchies grow. This problem is known as the curse of dimensionality. Alternatively, partial materialization is the selective computation of a subset of the cuboids or subcubes in the lattice. For example, an iceberg cube is a data cube that stores only those cube cells whose aggregate value (e.g., count) is above some minimum support threshold.

OLAP query processing can be made more efficient with the use of indexing techniques. In bitmap indexing, each attribute has its own bitmap index table. Bitmap indexing reduces join, aggregation, and comparison operations to bit arithmetic. Join indexing registers the joinable rows of two or more relations from a relational database, reducing the overall cost of OLAP join operations. Bitmapped join indexing, which combines the bitmap and join index methods, can be used to further speed up OLAP query processing.

Data warehouses are used for information processing (querying and reporting), analytical processing, and data mining (which supports knowledge discovery). OLAP-based data mining is referred to as OLAP mining, or on-line analytical mining (OLAM), which emphasizes the interactive and exploratory nature of OLAP mining.

3.14 Exercises

1. Suppose that a data warehouse for Big University consists of the following four dimensions: student, course, semester, and instructor, and two measures count and avg grade. At the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), the avg grade measure stores the actual course grade of the student. At higher conceptual levels, avg grade stores the average grade for the given combination.
(a) Draw a snowflake schema diagram for the data warehouse.
(b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each Big University student?
2. Explain the architecture of a data warehouse.
3. Describe the database schemas used in a data warehouse.
4. What is metadata? Explain.
5. Aggregations - how are they done in a data warehouse?
6. Why are partitioning strategies needed in data warehouse maintenance?
7. How is a data warehouse different from a database? How are they similar?
8. Briefly compare the following concepts. You may use an example to explain your point(s).
(a) Snowflake schema, fact constellation, starnet query model
(b) Data cleaning, data transformation, refresh
(c) Enterprise warehouse, data mart, virtual warehouse
9. Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where charge is the fee that a doctor charges a patient for a visit.
(a) Enumerate three classes of schemas that are popularly used for modeling data warehouses.
(b) Draw a schema diagram for the above data warehouse using one of the schema classes listed in (a).
(c) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004?
(d) To obtain the same list, write an SQL query assuming the data is stored in a relational database with the schema fee (day, month, year, doctor, hospital, patient, count, charge).


10. Suppose that a data warehouse consists of the four dimensions date, spectator, location, and game, and the two measures count and charge, where charge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate.
(a) Draw a star schema diagram for the data warehouse.
(b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP operations should one perform in order to list the total charge paid by student spectators at GM Place in 2004?
(c) Bitmap indexing is useful in data warehousing. Taking this cube as an example, briefly discuss the advantages and problems of using a bitmap index structure.


Unit IV

Structure of the Unit
4.1 Introduction
4.2 Learning Objectives
4.3 Data Warehouse Hardware Architecture
4.4 Physical Layout
4.5 Security
4.6 Backup and Recovery
4.7 Service Level Agreement
4.8 Operating the Data Warehouse
4.9 Summary
4.10 Exercises


4.1 Introduction

Once the Planning and Design stages are complete, the project to implement the current Data Warehouse iteration can proceed quickly. Necessary hardware, software, and middleware components are purchased and installed, the development and test environment is established, and the configuration management processes are implemented. Programs are developed to extract, cleanse, transform, and load the source data and to periodically refresh the existing data in the Warehouse, and the programs are individually unit tested against a test database with sample source data. Metrics are captured for the load process. The metadata repository is loaded with transformational and business user metadata. Canned production reports are developed, sample ad-hoc queries are run against the test database, and the validity of the output is measured. User access to the data in the Warehouse is established.

Once the programs have been developed and unit tested and the components are in place, system functionality and user acceptance testing is conducted for the complete integrated Data Warehouse system. System support processes for database security, system backup and recovery, system disaster recovery, and data archiving are implemented and tested as the system is prepared for deployment. The final step is to conduct the Production Readiness Review prior to transitioning the Data Warehouse system into production. During this review, the system is evaluated for acceptance by the customer organization.

4.2 Learning Objectives

To give the student knowledge of the hardware requirements for a data warehouse, an understanding of the security needed for the information stored and accessed, and an awareness of backup, recovery, and day-to-day operation of the data warehouse.

4.3 Data Warehouse Hardware Architecture

Needed Hardware

Multiprocessors within the same machine, sharing the same disk and memory: This is a good approach for a small to medium-sized data warehouse. The problem with a DW (which is not the case in OLTP) is that the kind of load and queries are not certain. Therefore, the allocation of processes across the processors can itself become a bottleneck. Even if you are using a cluster (with load-balancing and automatic failover), it will become complex once you go beyond a certain size.

Parallel Processing Servers

Here the processing is done across multiple servers, each having its own memory and disk space. This way each server gets its own playing field instead of fighting for common resources (as in the multiprocessor architecture), and hundreds of servers can be added to share the load through messaging or other modes of EAI (enterprise application integration). As you design this processing architecture, you need to ensure that there are not too many cross-connections. For example, if you want to join two star schemas (refer to the multi-cube in the OLAP server discussion), you have to ensure that the relevant data for the two cubes is on the same server or on a few servers.

Combining the above two

Depending upon the kind of business you want to run through your data warehouse, the best approach may be a combination of the above two.
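As an illustration of the co-location point above, the following minimal sketch hashes a shared dimension key so that fact rows from two star schemas land on the same server and a cross-schema join stays local. The server count, table contents, and key names are assumptions made purely for the example.

# Minimal sketch (assumed table layout): co-locating fact rows from two
# star schemas on the same server by hashing their shared dimension key,
# so that a cross-schema join does not have to travel between servers.
from zlib import crc32

NUM_SERVERS = 4

def server_for(customer_key: str) -> int:
    # The same key always hashes to the same server, regardless of schema.
    return crc32(customer_key.encode()) % NUM_SERVERS

sales_fact   = [("C001", 250.0), ("C002", 99.0), ("C003", 410.0)]
returns_fact = [("C001", 40.0),  ("C003", 15.0)]

placement = {i: {"sales": [], "returns": []} for i in range(NUM_SERVERS)}
for key, amount in sales_fact:
    placement[server_for(key)]["sales"].append((key, amount))
for key, amount in returns_fact:
    placement[server_for(key)]["returns"].append((key, amount))

# Rows sharing a customer key land on the same server, so the join is local.
print(placement)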
TIP-

Ask your vendor, if the data warehouse can support all the above three processing architecture styles. Also ask the following questions:

How many multiprocessor machines can it support? Can it support a cluster architecture? Does it have load-balancing and fail-over capability? What is the largest number of parallel processing servers that this platform has achieved in the field?

For top-quadrant vendors such as Oracle, Teradata, and the like, the answer to all of these questions is going to be positive.
TIP-

If you already have an enterprise-strength database from Oracle, IBM DB2, or SQL Server (2008 or later is preferable), build the data warehouse on it unless there are strong reasons for you to go for some other database.

4.4 Physical Layout

(Figure: hardware architecture of a data warehouse)


Data Warehouse Components

In most cases the data warehouse will have been created by merging related data from many different sources into a single database (a copy-managed data warehouse), as shown in the figure above. More sophisticated systems also copy related files that may be better kept outside the database, such as graphs, drawings, word processing documents, images, sound, and so on. Further, other files or data sources may be accessed by links back to the original source, across to special Intranet sites, or out to Internet or partner sites. There is often a mixture of current, instantaneous, and unfiltered data alongside more structured data. The latter is often summarized and coherent information, as it might relate to a quarterly period or a snapshot of the business as of close of day. In each case the goal is better information made available quickly to the decision makers to enable them to get to market faster, drive revenue and service levels, and manage business change.

A data warehouse typically has three parts: a load management component, a warehouse, and a query management component.
LOAD MANAGEMENT relates to the collection of information from disparate internal or external sources. In most cases the loading process includes summarizing, manipulating, and changing the data structures into a format that lends itself to analytical processing. Actual raw data should be kept alongside, or within, the data warehouse itself, thereby enabling the construction of new and different representations. A worst-case scenario, if the raw data is not stored, would be to reassemble the data from the various disparate sources around the organization simply to facilitate a different analysis.

WAREHOUSE MANAGEMENT relates to the day-to-day management of the data warehouse. The management tasks associated with the warehouse include ensuring its availability, the effective backup of its contents, and its security.

QUERY MANAGEMENT relates to the provision of access to the contents of the warehouse and may include the partitioning of information into different areas with different privileges for different users. Access may be provided through custom-built applications or ad hoc query tools.
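To make the summarizing step of load management concrete, here is a minimal sketch. The transaction fields, store codes, and totals are invented for illustration and do not come from the text; the point is that detail rows are restructured for analysis while the raw data is retained alongside.

# Illustrative load-management step (field names are assumptions): raw
# transactions are kept as-is, while a summarized, analysis-friendly form
# is produced for the warehouse.
from collections import defaultdict

raw_transactions = [
    {"store": "S1", "date": "2024-03-01", "amount": 19.99},
    {"store": "S1", "date": "2024-03-01", "amount": 5.50},
    {"store": "S2", "date": "2024-03-01", "amount": 42.00},
]

def summarize(transactions):
    # Transform detail rows into daily totals per store for analytical use.
    totals = defaultdict(float)
    for t in transactions:
        totals[(t["store"], t["date"])] += t["amount"]
    return [{"store": s, "date": d, "daily_total": round(v, 2)}
            for (s, d), v in totals.items()]

warehouse_rows = summarize(raw_transactions)   # loaded into the warehouse
archive_rows   = raw_transactions              # raw data retained alongside
print(warehouse_rows)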

4.5 Security

Imagine your organization has just built its data warehouse. The new data warehouse environment enables you to access corporate data in the form you want, when you want it, and where you want it, to solve dynamic organizational problems or make important decisions. You no longer feel frustrated with the inability of the Information Systems (IS) function to respond quickly to your diverse needs for information. The new environment empowers you to have the information processing world by the tail, and you are exceedingly thrilled by it all!

Suddenly, a paranoid thought crept into your head, and you asked the classic question: what is your organization doing to identify, classify, quantify, and protect its valuable information assets? You posed this question to the data warehouse architects and administrators. They told you that there was nothing to worry about because the in-built security measures of your data warehouse environment could put the DoD systems to shame. Somewhere along the line, you sensed that they were neither objective nor convincing. So, you put on your hacking hat and went about the process of finding the answer to your question.

As a general user, you easily managed to access some powerful user tools that were presumably restricted to unlimited-access users. The tools enabled you to issue complex queries which accessed large amounts of data, consumed enormous resources, and slowed system response time considerably. Your trusted friend, a reformed hacker, was also able to access sensitive corporate data through the Internet without much ado. He was able to disclose your exact salary, birth date, social security number, and the date of your last performance evaluation, among other things. Your findings led you to the classic answer: your organization, like most, is doing little or nothing to protect its strategic information assets!

Your data warehouse administrators could not pinpoint the causes of recent system problems and security breaches until you showed them the shocking results of what you and your friend had done. It was then that they admitted that security was not a priority during the development of the data warehouse. Driven by the need to complete the data warehouse project on time and within budget, and to get impatient users off their backs, they did not give security requirements any thought.


Your euphoric excitement about the new data warehouse vanished into the thick air of security concerns over your valuable corporate data. As a diligent corporate steward, you realized that it is high time for your organization to take a reality check!

As you know, a data warehouse (DW) is a collection of integrated databases designed to support managerial decision-making and problem-solving functions. It contains both highly detailed and summarized historical data relating to various categories, subjects, or areas. All units of data are relevant to appropriate time horizons. The DW is an integral part of an enterprise-wide decision support system and does not ordinarily involve data updating. It empowers end-users to perform data access and analysis. This eliminates the need for the IS function to perform informational processing from the legacy systems for the end-users. It also gives an organization certain competitive advantages, such as: fostering a culture of information sharing; enabling employees to effectively and efficiently solve dynamic organizational problems; minimizing operating costs and maximizing revenue; attracting and maintaining market share; and minimizing the impact of employee turnover.

For instance, the internal audit functions of a multi-campus institution like the University of California build a DW to facilitate the sharing of strategic data, best audit practices, and expert insights on a variety of control topics. Auditors can access and analyze the DW data to efficiently make well-reasoned decisions (e.g., recommend cost-effective solutions to various internal control problems). Marrying the DW architecture to artificial intelligence or neural applications also facilitates highly unstructured decision-making by the auditors. This results in timely completion of audit projects, improved quality of audit services, lower operating costs, and minimal impact from staff turnover. Implicit in the DW design is the concept of progress through sharing.

The security requirements of the DW environment are not unlike those of other distributed computing systems. Thus, having an internal control mechanism to assure the confidentiality, integrity, and availability of data in a distributed environment is of paramount importance. Unfortunately, most data warehouses are built with little or no consideration given to security during the development phase. Achieving proactive security requirements for the DW is a seven-phase process: 1) identifying data, 2) classifying data, 3) quantifying the value of data, 4) identifying data security vulnerabilities, 5) identifying data protection measures and their costs, 6) selecting cost-effective security measures, and 7) evaluating the effectiveness of security measures. These phases are part of an enterprise-wide vulnerability assessment and management program.

Phase One - Identifying the Data


The first security task is to identify all digitally stored corporate data placed in the DW. This is an often ignored but critical phase of meeting the security requirements of the DW environment, since it forms the foundation for subsequent phases. It entails taking a complete inventory of all the data that is available to the DW end-users. The installed data monitoring software, an important component of the DW, can provide accurate information about all databases, tables, columns, rows of data, and profiles of data residing in the DW environment, as well as who is using the data and how often they use it. A manual procedure would require preparing a checklist of the same information described above. Whether the required information is gathered through an automated or a manual method, the collected information needs to be organized, documented, and retained for the next phase.

Phase Two - Classifying the Data for Security


Classifying all the data in the DW environment is needed to satisfy security requirements for data confidentiality, integrity and availability in a prudent manner. In some cases, data classification is a legally mandated requirement. Performing this task requires the involvement of the data owners, custodians, and the end-users. Data is generally classified on the basis of criticality or sensitivity to disclosure, modification, and destruction. The sensitivity of corporate data can be classified as:

PUBLIC (Least Sensitive Data): For data that is less sensitive than confidential corporate data. Data in this category is usually unclassified and subject to public disclosure by laws, common business practices, or company policies. All levels of DW end-users can access this data (e.g., audited financial statements, admission information, phone directories, etc.).

CONFIDENTIAL (Moderately Sensitive Data): For data that is more sensitive than public data, but less sensitive than top secret data. Data in this category is not subject to public disclosure. The principle of least privilege applies to this data classification category, and access to the data is limited to a need-to-know basis. Users can only access this data if it is needed to perform their work successfully (e.g., personnel/payroll information, medical history, investments, etc.).

TOP SECRET (Most Sensitive Data): For data that is more sensitive than confidential data. Data in this category is highly sensitive and mission-critical. The principle of least privilege also applies to this category, with access requirements much more stringent than those for confidential data. Only high-level DW users (e.g., unlimited access) with proper security clearance can access this data (e.g., R&D, new product lines, trade secrets, recruitment strategy, etc.). Users can access only the data needed to accomplish their critical job duties.

Regardless of which categories are used to classify data on the basis of sensitivity, the universal goal of data classification is to rank data categories by increasing degrees of sensitivity so that different protective measures can be used for different categories. Classifying data into different categories is not as easy as it seems. Certain data represents a mixture of two or more categories depending on the context used (e.g., time, location, and laws in effect). Determining how to classify this kind of data is both challenging and interesting.
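A minimal sketch of ranking inventoried data by increasing sensitivity, using the three categories described above, might look like the following; the item names are made up for illustration.

# Rank inventoried data items by sensitivity so that protective measures can
# be chosen per category, starting with the most critical data.
SENSITIVITY = {"PUBLIC": 1, "CONFIDENTIAL": 2, "TOP SECRET": 3}

data_inventory = {
    "audited_financial_statements": "PUBLIC",
    "payroll_records":              "CONFIDENTIAL",
    "new_product_designs":          "TOP SECRET",
}

for item, label in sorted(data_inventory.items(),
                          key=lambda kv: SENSITIVITY[kv[1]], reverse=True):
    print(f"{label:12s} {item}")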


Phase Three - Quantifying the Value of Data


In most organizations, senior management demands to see the smoking gun (e.g., cost-vs-benefit figures, or hard evidence of committed frauds) before committing corporate funds to support security initiatives. Cynical managers will be quick to point out that they deal with hard reality, not soft variables concocted hypothetically. Quantifying the value of sensitive data warranting protective measures is as close to the smoking gun as one can get to trigger senior management's support of and commitment to security initiatives in the DW environment.

The quantification process is primarily concerned with assigning a "street value" to data grouped under different sensitivity categories. By itself, data has no intrinsic value. However, the definite value of data is often measurable by the cost to (a) reconstruct lost data, (b) restore the integrity of corrupted, fabricated, or intercepted data, (c) not make timely decisions due to denial of service, or (d) pay financial liability for public disclosure of confidential data. The data value may also include lost revenue from leakage of trade secrets to competitors, and advance use of secret financial data by rogue employees in the stock market prior to public release.

Measuring the value of sensitive data is often a Herculean task. Some organizations use simple procedures for measuring the value of data. They build a spreadsheet application utilizing both qualitative and quantitative factors to reliably estimate the annualized loss expectancy (ALE) of data at risk. For instance, if it costs $10,000 annually (based on labor hours) to reconstruct data classified as top secret with an assigned risk factor of 4, then the company should expect to lose at least $40,000 a year if this top secret data is not adequately protected. Similarly, if an employee is expected to successfully sue the company and recover $250,000 in punitive damages for public disclosure of privacy-protected personal information, then the liability cost plus legal fees paid to the lawyers can be used to calculate the value of the data. The risk factor (e.g., probability of occurrence) can be determined arbitrarily or quantitatively. The higher the likelihood of an attack on a particular unit of data, the greater the risk factor assigned to that data set. Measuring the value of strategic information assets based on accepted classification categories can be used to show what an organization can save (e.g., return on investment) if the assets are properly protected, or lose (annual dollar loss) if it does not act to protect the valuable assets.
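The spreadsheet-style arithmetic above can be written out directly. The $10,000 cost, risk factor of 4, and $250,000 liability come from the text; the $50,000 legal-fee figure is an assumption added for illustration.

# Worked version of the annualized loss expectancy (ALE) arithmetic from the
# text: annual reconstruction cost multiplied by an assigned risk factor.
def annualized_loss_expectancy(annual_cost: float, risk_factor: float) -> float:
    return annual_cost * risk_factor

# Top secret data: $10,000 a year to reconstruct, risk factor 4 -> $40,000.
print(annualized_loss_expectancy(10_000, 4))   # 40000

# Liability example: expected lawsuit recovery plus (assumed) legal fees can
# also be used as the data's value.
print(250_000 + 50_000)                        # 300000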

Phase Four - Identifying Data Vulnerabilities


This phase requires the identification and documentation of vulnerabilities associated with the DW environment. Some common vulnerabilities of DW include the following:


In-built DBMS Security: Most data warehouses rely heavily on in-built security that is primarily VIEW-based. VIEW-based security is inadequate for the DW because it can be easily bypassed by a direct dump of data. It also does not protect data during transmission from servers to clients, exposing the data to unauthorized access. The security feature is equally ineffective for the DW environment, where the activities of the end-users are largely unpredictable.

DBMS Limitations: Not all database systems housing the DW data have the capability to concurrently handle data of different sensitivity levels. Most organizations, for instance, use one DW server to process top secret and confidential data at the same time. However, the programs handling the top secret data may not prevent it from leaking to the programs handling the confidential data, and limited DW users authorized to access only the confidential data may not be prevented from accessing the top secret data.

Dual Security Engines: Some data warehouses combine the in-built DBMS security features with the operating system access control package to satisfy their security requirements. Using dual security engines tends to present opportunities for security lapses and exacerbates the complexity of security administration in the DW environment.

Inference Attacks: Different access privileges are granted to different DW users. All users can access public data, but only a select few would presumably access confidential or top secret data. Unfortunately, general users can access protected data by inference without having direct access to the protected data. Sensitive data is typically inferred from seemingly non-sensitive data. Carrying out direct and indirect inference attacks is a common vulnerability in the DW environment.

Availability Factor: Availability is a critical requirement upon which the shared-access philosophy of the DW architecture is built. However, the availability requirement can conflict with or compromise the confidentiality and integrity of the DW data if not carefully considered.

Human Factors: Accidental and intentional acts such as errors, omissions, modifications, destruction, misuse, disclosure, sabotage, fraud, and negligence account for most of the costly losses incurred by organizations. These acts adversely affect the integrity, confidentiality, and availability of the DW data.

Insider Threats: The DW users (employees) represent the greatest threat to valuable data. Disgruntled employees with legitimate access could leak secret data to competitors and publicly disclose certain confidential human resources data. Rogue employees can also profit from using strategic corporate data in the stock market before such information is released to the public. These activities cause (a) strained relationships with business partners or government entities, (b) loss of money from financial liabilities, (c) loss of public confidence in the organization, and (d) loss of competitive edge.


Outsider Threats: Competitors and other outside parties pose a similar threat to the DW environment as unethical insiders. These outsiders engage in electronic espionage and other hacking techniques to steal, buy, or gather strategic corporate data in the DW environment. Risks from these activities include (a) negative publicity, which decimates the ability of a company to attract and retain customers or market share, and (b) loss of continuity of DW resources, which negates user productivity. The resultant losses tend to be higher than those of insider threats.

Natural Factors: Fire, water, and air damage can render both the DW servers and clients unusable. Risks and losses vary from organization to organization, depending mostly on location and contingency factors.

Utility Factors: Interruption of electricity and communications service causes costly disruption to the DW environment. These factors have a lower probability of occurrence, but tend to result in excessive losses.

A comprehensive inventory of the vulnerabilities inherent in the DW environment needs to be documented and organized (e.g., as major or minor) for the next phase.
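To make the inference-attack vulnerability above concrete, here is a small, hypothetical illustration; the employees, departments, and salaries are invented. Even when direct access to salaries is blocked, an aggregate over a one-person group reveals an individual value.

# Illustrative inference attack (hypothetical data): salaries are protected,
# but an aggregate query over a group of size one gives one away.
employees = [
    {"name": "A", "dept": "Audit", "salary": 61000},
    {"name": "B", "dept": "Audit", "salary": 64000},
    {"name": "C", "dept": "Legal", "salary": 95000},   # only person in Legal
]

def average_salary(dept):
    group = [e["salary"] for e in employees if e["dept"] == dept]
    return sum(group) / len(group), len(group)

avg, n = average_salary("Legal")
if n == 1:
    # The "non-sensitive" average is exactly one person's protected salary.
    print("Inferred individual salary:", avg)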

Phase Five - Identifying Protective Measures and Their Costs


Vulnerabilities identified in the previous phase should be considered in order to determine cost-effective protection for the DW data at different sensitivity levels. Some protective measures for the DW data include:

The Human Wall: Employees represent the front line of defense against security vulnerabilities in any decentralized computing environment, including the DW. Addressing employee hiring, training (security awareness), periodic background checks, transfers, and termination as part of the security requirements is helpful in creating a security-conscious DW environment. This approach effectively treats the root causes, rather than the symptoms, of security problems. Human resources management costs are easily measurable.

Access Users Classification: Classify data warehouse users as 1) General Access Users, 2) Limited Access Users, and 3) Unlimited Access Users for access control decisions.

Access Controls: Use an access control policy based on the principles of least privilege and adequate data protection. Enforce effective and efficient access control restrictions so that the end-users can access only the data or programs for which they have legitimate privileges (a minimal sketch of such a check follows this list). Corporate data must be protected to a degree consistent with its value. Users need to obtain an appropriately granular security clearance before they are granted access to sensitive data. Also, access to the sensitive data should rely on more than one authentication mechanism. These access controls minimize damage from accidental and malicious attacks.


Integrity Controls: Use a control mechanism to a) prevent all users from updating and deleting historical data in the DW, b) restrict data merge access to authorized activities only, c) immunize the DW data from power failures, system crashes, and corruption, d) enable rapid recovery of data and operations in the event of disasters, and e) ensure the availability of consistent, reliable, and timely data to the users. These are achieved through the OS integrity controls and well-tested disaster recovery procedures.

Data Encryption: Encrypting sensitive data in the DW ensures that the data is accessed on an authorized basis only. This nullifies the potential value of data interception, fabrication, and modification. It also inhibits unauthorized dumping and interpretation of data, and enables secure authentication of users. In short, encryption ensures the confidentiality, integrity, and availability of data in the DW environment.

Partitioning: Use a mechanism to partition sensitive data into separate tables so that only authorized users can access these tables based on legitimate needs. The partitioning scheme relies on a simple in-built DBMS security feature to prevent unauthorized access to sensitive data in the DW environment. However, use of this method presents some data redundancy problems.

Development Controls: Use quality control standards to guide the development, testing, and maintenance of the DW architecture. This approach ensures that security requirements are sufficiently addressed during and after the development phase. It also ensures that the system is highly elastic (e.g., adaptable or responsive to changing security needs).
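Following up on the Access Controls item above, here is a minimal sketch of a least-privilege check. The clearance levels, table names, and sensitivity labels are assumptions for illustration and are not tied to any particular DBMS.

# Minimal least-privilege check: a user may read a table only if their
# clearance meets or exceeds the sensitivity classification of that table.
CLEARANCE = {"general": 1, "limited": 2, "unlimited": 3}
TABLE_SENSITIVITY = {"phone_directory": 1, "payroll": 2, "trade_secrets": 3}

def may_access(user_class: str, table: str) -> bool:
    return CLEARANCE[user_class] >= TABLE_SENSITIVITY[table]

print(may_access("general", "phone_directory"))   # True
print(may_access("limited", "trade_secrets"))     # False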

The estimated costs of each security measure should be determined and documented for the next phase. Commercial packages (e.g., CORA, RANK-IT, BUDDY SYSTEM, BDSS, BIA Professional, etc.) and in-house developed applications can help in identifying appropriate protective measures for known vulnerabilities, and quantifying their associated costs or fiscal impact. Measuring the costs usually involves determining the development, implementation, and maintenance costs of each security measure.

Phase Six - Selecting Cost-Effective Security Measures


All security measures involve expenses, and security expenses require justification. This phase relies on the results of the previous phases to assess the fiscal impact of corporate data at risk, and to select cost-effective security measures to safeguard the data against known vulnerabilities. Selecting cost-effective security measures is congruent with prudent business practice, which ensures that the cost of protecting the data at risk does not exceed the maximum dollar loss of the data. Senior management would, for instance, deem it imprudent to commit $500,000 annually to safeguarding data with an annualized loss expectancy of only $250,000.
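The selection rule just described can be sketched as a simple filter. The $500,000 versus $250,000 pair is the text's example; the other measures and figures below are invented for illustration.

# Sketch of the cost-effectiveness rule: reject any measure whose annual
# cost exceeds the annualized loss expectancy it protects against.
candidate_measures = [
    {"name": "full-table encryption", "annual_cost": 500_000, "ale_covered": 250_000},
    {"name": "column encryption",     "annual_cost": 60_000,  "ale_covered": 250_000},
    {"name": "access monitoring",     "annual_cost": 20_000,  "ale_covered": 40_000},
]

selected = [m for m in candidate_measures if m["annual_cost"] <= m["ale_covered"]]
for m in selected:
    print("select:", m["name"])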


However, the cost factor should not be the only criterion for selecting appropriate security measures in the DW environment. Compatibility, adaptability and potential impact on the DW performance should also be taken into consideration. Additionally, there are two important factors. First, the economy of mechanism principle dictates that a simple, well tested protective measure can be relied upon to control multiple vulnerabilities in the DW environment. Second, data, unlike hardware and software, is an element in the IS security arena that has the shortest life span. Thus, the principle of adequate data protection dictates that the DW data can be protected with security measures that are effective and efficient enough for the short life span of the data.

Phase Seven - Evaluating the Effectiveness of Security Measures


A winning basketball formula from the John Wooden school of thought teaches that a good team should be prepared to rebound every shot that goes up, even if it is taken by the greatest player on the court. Similarly, a winning security strategy is to assume that all security measures are breakable, or not permanently effective. Every time we identify and select cost-effective security measures to secure our strategic information assets against certain attacks, the attackers tend to double their efforts in identifying methods to defeat our implemented security measures. The best we can do is to prevent this from happening, make the attacks difficult to carry out, or be prepared to rebound quickly if our assets are attacked. We will not be well positioned to do any of these if we do not evaluate the effectiveness of our security measures on an ongoing basis.

Evaluating the effectiveness of security measures should be conducted continuously to determine whether the measures are: 1) small, simple, and straightforward, 2) carefully analyzed, tested, and verified, 3) used properly and selectively so that they do not exclude legitimate accesses, 4) elastic, so that they can respond effectively to changing security requirements, and 5) reasonably efficient in terms of time, memory space, and user-centric activities, so that they do not adversely affect the protected computing resources. It is equally important to ensure that the DW end-users understand and embrace the propriety of the security measures through an effective security awareness program. The data warehouse administrator (DWA), with delegated authority from senior management, is responsible for ensuring the effectiveness of the security measures.

Encryption Requirements
Encrypting sensitive data in the DW environment can be done at the table, column, or row level. Encrypting columns of a table containing sensitive data is the most common and straightforward approach. A few examples of columns that are usually encrypted include social security numbers, salaries, birth dates, performance evaluation ratings, confidential bank information, and credit card numbers. Locating individual records in a table through a standard search command will be exceedingly difficult if any of the encrypted columns serve as keys to the table. Organizations that use social security numbers as keys to database tables should seriously consider using alternative pseudonym codes (e.g., randomly generated numbers) as keys before encrypting the SSN column.


Encrypting only selected rows of data is not commonly done, but can be useful in some unique cases. For instance, a single encryption algorithm can be used to encrypt the ages of some employees who insist on non-disclosure of their ages for privacy reasons. Multiple encryption algorithms can also be used to encrypt rows of data reflecting sensitive transactions for different campuses, so that geographically distributed users of the same DW can only view or search transactions (rows) related to their respective campuses. If not carefully planned, mixing separate rows of encrypted and unencrypted data and managing multiple encryption algorithms in the same DW environment can introduce chaos, including flawed data search results.

Encrypting an entire table (all columns and rows) is very rarely done because it essentially renders the data useless in the DW environment. The procedures required to decrypt the data before accessing the records in a useful format are very cumbersome and cost-prohibitive.

The encryption algorithm selected for the DW environment should be able to preserve field type and field length characteristics. It should also work cooperatively with the access and analysis software package in the DW environment. Specifically, the data decryption sequence must be executed before the data reaches the software package handling the standard query. Otherwise, the package could prevent decryption of the encrypted data, rendering the data useless.
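A minimal sketch of column-level encryption with a pseudonym (surrogate) key is shown below. It assumes the third-party "cryptography" package is installed; the table layout, key handling, and surrogate-key scheme are illustrative only and not prescribed by the text.

# Column-level encryption sketch: replace the SSN key with a random surrogate
# and encrypt the sensitive columns before loading the row.
import secrets
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # in practice, held in a key-management system
cipher = Fernet(key)

def load_row(ssn: str, salary: int):
    return {
        "row_key": secrets.token_hex(8),                  # pseudonym key
        "ssn_enc": cipher.encrypt(ssn.encode()),          # encrypted column
        "salary_enc": cipher.encrypt(str(salary).encode()),
    }

row = load_row("123-45-6789", 72000)
print(row["row_key"], cipher.decrypt(row["ssn_enc"]).decode())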

Encryption Constraints
Performing data encryption and decryption on the DW server consumes significant CPU processing cycles. This results in excessive overhead costs and degraded system performance. Also, performing decryption on the DW server before transmitting the decrypted data to the client (the end-user's workstation) exposes the data to unauthorized access during transmission. These problems can be minimized if the encryption and decryption functions are deployed effectively at the workstation level, where more CPU cycles are available for processing. In addition, improperly used encryption (e.g., a weak encryption algorithm) can give users a false sense of security. Encrypted data in the DW must be decrypted before standard query operations can be performed. This increases the time to process a query, which can irritate the end-users and turn them against the encryption mechanism. Finally, it is still illegal to use certain encryption algorithms outside the U.S. borders.

Data Warehouse Administration


The size of the historical data in the DW environment grows significantly every year, while the use of that data tends to decrease dramatically. This increases the storage, processing, and operating costs of the DW annually. It necessitates the periodic phasing out of least-used or unused data, usually after a detailed analysis of the least and most accessed data over a long time horizon. A prudent decision has to be made as to how long historical data should be kept in the DW environment before it is phased out en masse. The DWA may not meet these challenges effectively without the necessary tools (activity and data monitors), resources (funds and staffing support), and management philosophy (strategic planning and management). For these reasons, the DWA should be a good strategist, an effective communicator, an astute politician, and a competent technician.
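An illustrative sketch of the phase-out analysis described above is shown next: rank warehouse tables by access counts collected by the data monitor over a long horizon and flag the least-used ones as archive candidates. The table names, counts, and threshold are invented assumptions.

# Identify least-accessed tables as candidates for phasing out to offline storage.
access_counts_last_24_months = {
    "sales_2009_detail": 3,
    "sales_2023_detail": 18_400,
    "old_campaign_logs": 0,
}

ARCHIVE_THRESHOLD = 10   # accesses over the period; an assumed policy value

archive_candidates = sorted(
    (t for t, n in access_counts_last_24_months.items() if n <= ARCHIVE_THRESHOLD),
    key=lambda t: access_counts_last_24_months[t],
)
print("phase out:", archive_candidates)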

Control Reviews
The internal control review approach for the DW environment should be primarily forward-looking (emphasizing up-front prevention) as opposed to backward-looking (emphasizing after-the-fact verification). This approach calls for the use of pre-control and concurrent control assessment techniques to look at such issues as (a) data quality control, (b) effectiveness of security management, (c) economy and efficiency of DW operations, (d) accomplishment of operational goals or quality standards, and (e) overall DW administration. Effective collaboration with the internal customers (the DWA and end-users) and the use of automated control tools are essential for conducting these control reviews competently.

Conclusions
The seven phases of systematic vulnerability assessment and management program described in this article are helpful in averting underprotection and overprotection (two undesirable security extremes) of the DW data. This is achieved through the eventual selection of cost-effective security measures which ensure that different categories of corporate data are protected to the degree necessary. The program also shifts the management focus from taking corrective security actions in a crisis mode to prevention of security crises in the DW environment. It is generally recognized that the goal of DW is to provide decision-makers access to consistent, reliable, and timely data for analytical, planning, and assessment purposes in a format that allows for easy retrieval, exploration and analysis. The need for accurate information in the most efficient and effective manner is congruent with the security requirements for data integrity and availability. Thus, it is a winning corporate strategy to ensure a happy marriage between the idealism of DW based on empowered informational processing, and the pragmatism of a proactive security philosophy based on prudent security practices in the empowered computing environment. The myth that security defeats the goal of DW, or cannot coexist in the DW environment should be debunked. Anything less would be imprudent.

4.6 Backup and Recovery


It can take six months or more to create a data warehouse, but only a few minutes to lose it! Accidents happen. Planning is essential. A well-planned operation has fewer accidents, and when they occur, recovery is far more controlled and timely.

Backup and Restore

The fundamental level of safety that must be put in place is a backup system that automates the process and guarantees that the database can be restored with full data integrity in a timely manner. The first step is to ensure that all of the data sources from which the data warehouse is created are themselves backed up. Even a small file that is used to help integrate larger data sources may play a critical part. Where a data source is external, it may be expedient to cache the data to disk in order to be able to back it up as well.

Then there is the requirement to produce, say, a weekly backup of the entire warehouse itself which can be restored as a coherent whole with full data integrity. Amazingly, many companies do not attempt this. They rely on a mirrored system not failing, or on recreating the warehouse from scratch. Guess what? They do not even practise the recreation process, so when (not if) the system breaks, the business impact will be enormous. Backing up the data warehouse itself is fundamental.

What must we back up? First the database itself, but also any other files or links that are a key part of its operation. How do we back it up? The simplest answer is to quiesce the entire data warehouse and do a cold backup of the database and related files. This is often not an option, as they may need to be operational on a nonstop basis, and even if they can be stopped there may not be a large enough window to do the backup. The preferred solution is to do a hot database backup, that is, to back up the database and the related files while they are being updated. This requires a high-end backup product that is synchronized with the database system's own recovery system and has a hot-file backup capability to be able to back up the conventional file system.

Veritas supports a range of alternative ways of backing up and recovering a data warehouse; here we will consider this to be a very large Oracle 7 or 8 database with a huge number of related files. The first mechanism is simply to take cold backups of the whole environment, exploiting multiplexing and other techniques to minimize the backup window (or restore time) by exploiting to the full the speed and capacity of the many types and instances of tape and robotics devices that may need to be configured. The second method is to use the standard interfaces provided by Oracle (Sybase, Informix, SQL BackTrack, SQL Server, etc.) to synchronize a backup of the database with the RDBMS recovery mechanism, providing a simple level of hot backup of the database concurrently with any related files. Note that a hot file system or checkpointing facility is also used to assure that the conventional files backed up correspond to the database. The third mechanism is to exploit the special RDBMS hot backup mechanisms provided by Oracle and others. The responsibility for the database part of the data warehouse is taken by, say, Oracle, which provides a set of data streams to the backup system and later requests parts back for restore purposes. With Oracle 7 and 8 this can be used for very fast full backups, and the Veritas NetBackup facility can again ensure that other non-database files are backed up to cover the whole data warehouse.

Oracle 8 can also be used with the Veritas NetBackup product to take incremental backups of the database. Each backup, however, requires Oracle to take a scan of the entire data warehouse to determine which blocks have changed, prior to providing the data stream to the backup process. This could be a severe overhead on a 5 terabyte data warehouse. These mechanisms can be fine-tuned by partition, etc. Optimizations can also be done, for example, to back up read-only partitions once only (or occasionally) and to optionally not back up indexes that can easily be recreated.

Veritas now uniquely also supports block-level incremental backup of any database or file system without requiring pre-scanning. This exploits file-system-level storage checkpoints. The facility will also be available with the notion of synthetic full backups, where the last full backup can be merged with a set of incremental backups (or the last cumulative incremental backup) to create a new full backup off line. These two facilities reduce by several orders of magnitude the time and resources taken to back up and restore a large data warehouse. Veritas also supports the notion of storage checkpoint recovery, by which means the data warehouse can be instantly reset to a date/time when a checkpoint was taken; for example, 7.00 a.m. on each working day.

The technology can be integrated with the Veritas Volume Management technology to automate the taking of full backups of a data warehouse by means of third-mirror breakoff and backup of that mirror. This is particularly relevant for smaller data warehouses, where having a second and third copy can be exploited for resilience and backup. Replication technology, at either the volume or full-system level, can also be used to keep an up-to-date copy of the data warehouse on a local or remote site, another form of instantaneous, always-available, full backup of the data warehouse. And finally, the Veritas software can be used to exploit network-attached intelligent disk and tape arrays to take backups of the data warehouse directly from disk to tape without going through the server technology. Alternatives here are disk to tape on a single, intelligent, network-attached device, or disk to tape across a fiber channel from one network-attached device to another. Each of these backup and restore methods addresses different service-level needs for the data warehouse and also the particular type and number of computers, offline devices, and network configurations that may be available. In many corporations a hybrid or combination may be employed to balance cost and service levels.
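The synthetic full backup idea mentioned above can be pictured with a tiny, purely conceptual sketch; the block map below is invented, and this is not how NetBackup itself is implemented.

# Conceptual sketch of a synthetic full backup: the last full backup is merged,
# block by block, with later incremental backups to produce a new full copy
# offline, without re-reading the warehouse.
last_full = {0: "A0", 1: "B0", 2: "C0", 3: "D0"}   # block id -> contents

incrementals = [                                    # oldest first
    {1: "B1"},                 # Monday: block 1 changed
    {1: "B2", 3: "D1"},        # Tuesday: blocks 1 and 3 changed
]

def synthetic_full(full, incs):
    merged = dict(full)
    for inc in incs:           # later increments overwrite earlier blocks
        merged.update(inc)
    return merged

print(synthetic_full(last_full, incrementals))
# {0: 'A0', 1: 'B2', 2: 'C0', 3: 'D1'}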


Online Versus Offline Storage

With data usage often more than doubling each year, it is important to ensure that the data warehouse uses the correct balance of offline as well as online storage. A well-balanced system can help control the growth and avoid disk-full problems, which cause more than 20% of stoppages on big, complex systems. Candidates for offline storage include old raw data, old reports, and rarely used multimedia and documents.

Hierarchical Storage Management (HSM) is the ability to move files automatically off line to secondary storage, yet leave them accessible to the user. The user sees the file and can access it, but is actually looking at a small stub, since the bulk of the file has been moved elsewhere. When accessed, the file is returned to online storage and manipulated by the user with only a small delay. The significance of this to a data warehousing environment in the first instance relates to user activities around the data warehouse. Generally speaking, users will access the data warehouse and run reports of varying sophistication. The output from these will either be viewed dynamically on screen or held on a file server. Old reports are useful for comparative purposes, but are infrequently accessed and can consume huge quantities of disk space. The Veritas HSM system provides an effective way to manage these disk-space problems by migrating files of any particular type to secondary or tertiary storage. Perhaps the largest benefit is to migrate off line the truly enormous amounts of old raw data sources, leaving them apparently on line in case they are needed again for some new critical analysis. Veritas HSM and NetBackup are tightly integrated, which provides another immediate benefit: reduced backup times, since the backup is now simply of stubs instead of complete files.

Disaster Recovery

From a business perspective the next most important thing may well be to have a disaster recovery site set up, to which copies of all the key systems are sent regularly. Several techniques can be used, ranging from manual copies to full automation. The simplest mechanism is to use a backup product that can automatically produce copies for a remote site. For more complex environments, particularly where there is a hybrid of database management systems and conventional files, an HSM system can be used to copy data to a disaster recovery site automatically. Policies can be set so that files or backup files are migrated automatically from media type to media type, and from site to site; for example, disk to optical, to tape, to an off-site vault. Where companies can afford the redundant hardware and a very-high-bandwidth (fiber channel) wide-area network, volume replication can be used to keep a secondary remote site identical to the primary data warehouse site.

Reliability and High Availability

A reliable data warehouse needs to depend upon restore and recovery a lot less. After choosing reliable hardware and software, the most obvious step is to use redundant disk technology, dramatically improving both reliability and performance, which are often key in measuring end-user availability. Most data warehouses have kept the database on raw partitions rather than on top of a file system, purely to gain performance. The Veritas file system has been given a special mechanism to run the database at exactly the same speed as the raw partitions. The Veritas file system is a journaling file system and recovers in seconds should there be a crash.
This then enables the database administrator to get all the usability benefits of a file system with no loss of performance. When used with the Veritas Volume Manager we also get the benefit of software RAID, which provides redundant disk-based reliability and performance.

After disks, the next most sensible way of increasing reliability is to use a redundant computer, along with event management and high-availability software such as FirstWatch. The event management software should be used to monitor all aspects of the data warehouse: the operating system, files, database, and applications. Should a failure occur, the software should deal with the event automatically or escalate it instantly to an administrator. In the event of a major problem that cannot be fixed, the high-availability software should fail the system over to the secondary computer within a few seconds. Advanced HA solutions enable the secondary machine to be used for other purposes in the meantime, utilizing this otherwise expensive asset.

4.7 Service Level Agreement

A Service Level Agreement (SLA) is a binding contract which formally specifies end-user expectations about the solution and its tolerances. It is a collection of service level requirements that have been negotiated and mutually agreed upon by the information providers and the information consumers. The SLA has three attributes: structure, precision, and feasibility. This agreement establishes expectations and influences the design of the components of the data warehouse solution.

Data warehouse projects are popular within the business world today. Competitive advantages are maintained or gained by the strategic use of business information that has been analyzed to produce ways to attract new customers and sell more products to existing customers. The benefits of this analysis have caused business executives to push for data warehouse technology, and expectations for these projects are high. This section examines the performance characteristics of a data warehouse and looks at how expectations for these projects can be set and managed. It focuses on the performance aspects of a warehouse running on an IBM mainframe and using UDB for OS/390 as the database.

Performance Characteristics of a Data Warehouse Environment

The art of performance tuning has always been about matching workloads for execution with resources for that execution. Therefore, the beginning of a performance tuning strategy for a data warehouse must include the characterization of the data warehouse workloads. To perform that characterization, there must be some common metrics to differentiate one workload from another.

Once the workloads have been characterized, some analysis should be performed to determine the impact of executing multiple workloads at the same time. It is possible that some workloads will not work well together and thus, when executed at the same time, will degrade each other's performance. In such cases, it is best to keep these workloads from running at the same time. The workload that consists of the maintenance programs for the warehouse should be tracked closely because of its impact on availability. While not specifically related to this issue, data marts have their place in a warehouse strategy. One factor that should be evaluated in deciding to establish marts is the ability to segregate workloads across multiple data marts and thus mitigate the instances of competing workloads that degrade performance.

One issue to address in a warehouse environment is whether there will be uniform workloads or whether all work will be unique. Some companies may find queries that are executed on a regular basis and thus can be characterized as a workload. Others may find the dynamic nature of the environment very difficult to characterize. This will be addressed in more detail later. The performance analyst can get help with pattern matching from end users while trying to determine the workload characteristics of the company's warehouse.

Workload arrival rates, and how those arrival rates can be influenced, must be combined with the workload characterization. Queries against a warehouse are not driven by customers transacting business with the company, but rather by users who are searching for information to make the business run smoother and be more responsive to its customers. Therefore, the timing of these queries can be under some control. Typically, this control will work best when it is integrated into the warehouse service-level agreements. This will make control a part of the agreement between IS and the users of the system. For example, this control might be structured by setting different response time goals for different workloads or groups of users based on the day of the week or the hour of the day. This would not guarantee that the work would arrive at certain times, but it would encourage submission of workloads at different times.

Sometimes availability is overlooked in the evaluation of performance. If a system is unavailable, then it is not performing. Therefore, the ability of the platform to deliver the required availability is critical. Some might question the need for high availability of a warehouse compared to the availability requirements of an operational system. A warehouse may, in fact, need 24x7 availability. Consider that queries against a warehouse will have to process large volumes of data, which may take hours or perhaps days. Longer outages might be tolerated by an operational system if they are planned around user queries, but unplanned outages in the middle or at the end of a long running query may be unacceptable for users. In addition, the more a company uses a warehouse to make strategic business decisions, the more the warehouse becomes just as critical as the operational systems that process the current orders. Many have argued the value of a warehouse in projecting future buying patterns and needs to ensure that the business remains competitive in a changing marketplace. These decisions affect whether there will be future orders to record in an operational system.
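A hypothetical service-level table can illustrate how different response-time goals per workload and time of day can steer when work is submitted; all of the workload names, periods, and numbers below are invented for the example.

# Response-time goals by workload and period, used to encourage submission
# of heavy queries at off-peak times.
RESPONSE_GOALS_SECONDS = {
    # (workload, period) -> target response time
    ("canned_reports", "business_hours"): 30,
    ("canned_reports", "overnight"):      10,
    ("ad_hoc_queries", "business_hours"): 300,
    ("ad_hoc_queries", "overnight"):      60,
}

def goal_for(workload: str, hour: int) -> int:
    period = "business_hours" if 8 <= hour < 18 else "overnight"
    return RESPONSE_GOALS_SECONDS[(workload, period)]

print(goal_for("ad_hoc_queries", 14))   # 300 - discourages big daytime queries
print(goal_for("ad_hoc_queries", 2))    # 60  - encourages overnight submission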


Realistic service-level agreements


Realistic service-level agreements can be achieved within the management of a data warehouse. Testing the warehouse before production implementation is the basis of realistic service-level agreements. This will probably occur near the end of the testing phase and should include tests against the same data that will actually exist in the production warehouse at its inception. This will have to be coordinated with the users to ensure that the tests represent realistic queries. If a tool generates the queries, this will be an opportunity to determine how much control you have within the tool to generate efficient SQL. It may take some time and effort to work with the vendor of the tool, and with some level of SQL EXPLAIN information, to achieve the best SQL for your system. If the SQL queries do not exploit the benefits of the warehouse design, then the long hours spent designing the schema of the warehouse to meet the users' requirements will have been wasted.

There are many choices of platforms on which to run the data warehouse. Performance is sometimes a consideration when making this choice, but not always. If performance of the warehouse is not a major consideration, then service-level agreements for the warehouse will have to be adjusted based on the ability of the platform to deliver service. Service-level agreements for a data warehouse may need to be more fluid than those established for an operational system. An operational system may have a requirement that 90% of all transactions complete in less than one second. This is easy to track, report and understand. However, in a data warehouse system, there may be a requirement to produce an answer within one second for every 10,000 pages processed by the query; under such a metric, a query that scans 500,000 pages would, for example, have a response-time goal of 50 seconds. Metrics that can be obtained without traces are important because of the overhead of DB2 traces. While the user may require some education on the metric, this allows the system's administrator to have more control over the delivery of service and the resources needed to deliver on service-level agreements.

Service-level agreements need to be adjustable after implementation rather than rigid. The goal of the data warehouse is to deliver value to the business. Value is delivered in terms of more business, growth in new business areas or reduced costs for the business. When this is successful, the company executives will want more. This will generate an increase in warehouse activities that will strain the initial resources. Service-level agreements, along with capacity planning information, can pave the way for warehouse growth as it proves its value to the company and expands its mission. In addition, time-sensitive information may occasionally be required for business opportunities that exist for only a short time. The ability to adjust resources to meet specific workloads can have an impact on the attainment of other service-level agreements. Because of these adjustments and the fluid nature of the SLA system, reporting of attainment should have some level of prioritization contained within the reporting structure. This allows for the recognition that some workloads, for brief periods, were considered less important, and it reduces the impact of missing the SLA for these workloads.


Furthermore, within the SLA reporting structure, the emphasis should always be on percentages of attainment, of resource utilization, and so forth, because the actual resources and response times could, and probably will, change over time.

4.8 Operating the Data Warehouse

After the data warehouse becomes operational, the data management processes, including extraction, transformation, staging and load scheduling, are automated wherever possible. The loading is based on one of two common procedures: bulk download (which refreshes the entire database periodically) and change-based replication (which copies only the changed data residing on the different source servers). A minimal SQL sketch of these two procedures is given after the list below.

Once the data warehouse is up and running, it will continue to require attention in different ways. A successful data warehouse sees its users increase considerably, which in turn affects its performance if it has not been properly sized. Support and enhancement requests will also roll in. Some other common issues that need to be dealt with in operating the data warehouse are:

Loading the new data on a regular basis, which can range from real-time to weekly.
Ensuring uptime and reliability.
Managing the front-end tools.
Managing the back-end components.
Updating the data to reflect organizational changes, mergers and acquisitions, etc.
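The following is a minimal SQL sketch of the two loading procedures described above. The table and column names (sales_fact, stg_sales_full, stg_sales_changes, sale_id and so on) are hypothetical, and the syntax follows common Oracle usage; a real warehouse would wrap these steps in the scheduling and error handling of its ETL tooling.

-- Bulk download: periodically replace the entire table from a full extract.
TRUNCATE TABLE sales_fact;

INSERT /*+ APPEND */ INTO sales_fact (sale_id, cust_key, prod_key, sale_date, amount)
SELECT sale_id, cust_key, prod_key, sale_date, amount
FROM   stg_sales_full;          -- staging table loaded from the full extract
COMMIT;

-- Change-based replication: apply only the rows changed since the last load.
MERGE INTO sales_fact f
USING stg_sales_changes c       -- staging table holding new and changed rows
ON    (f.sale_id = c.sale_id)
WHEN MATCHED THEN
  UPDATE SET f.amount = c.amount, f.sale_date = c.sale_date
WHEN NOT MATCHED THEN
  INSERT (sale_id, cust_key, prod_key, sale_date, amount)
  VALUES (c.sale_id, c.cust_key, c.prod_key, c.sale_date, c.amount);
COMMIT;

The bulk procedure is simpler but rewrites the whole table on every refresh; the MERGE-based procedure touches far less data but requires the source systems to identify changed rows.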
Management Tools

Managing such potential complexity and making decisions about which options to use can become a nightmare. In fact, many organizations have had to deploy many more administrators than they had originally planned just to keep the fires down, and these people cost a lot of money. Being reactive, rather than proactive, means that the resources supporting the data warehouse are not properly deployed, typically resulting in poor performance, excessive numbers of problems, slow reaction to them, and over-buying of hardware and software: the "when in doubt, throw more kit at the problem" syndrome. Management tools must therefore enable administrators to switch from reactive to proactive management by automating normal and expected exception conditions against policy. The tools must encompass all of the storage being managed, which means the database, files, file systems, volumes, disk and tape arrays, intelligent controllers, and any embedded or other tools that each manage part of the scene.


To assist proactive management, the tools must collect data about the data warehouse and how it is (and will be) used. Such data could be high level, such as the number of users, size, growth, online/offline mix, and access trends from different parts of the world. Or it could be very detailed, such as space capacity on a specific disk or array of disks, access patterns on each of several thousand disks, time to retrieve a multimedia file from offline storage, peak utilization of a server in a cluster, and so on; in other words, raw or aggregate data over time that could be used to help optimize the existing data warehouse configuration. (A hedged sketch of this kind of threshold check appears after the product list below.)

The tools should, ideally, then suggest or recommend better ways of doing things. A simple example would be to analyze disk utilization automatically and recommend moving the backup job an hour later (or using a different technique), and to stripe the data on a particular partition to improve the performance of the system (without adversely impacting other things, a nicety often forgotten). This style of tool is invaluable when change is required. It can automatically manage and advise on the essential growth of the data warehouse, and pre-emptively advise on problems that will otherwise soon occur, using threshold management and trend analysis. The tools can then be used to execute the change envisaged and any subsequent fine tuning that may be required. Once again it is worth noting that the management tools may need to exploit lower-level tools with which they are loosely or tightly integrated. Finally, the data about the data warehouse could be used to produce more accurate capacity models for proactive planning.

Veritas is developing a set of management tools that address these issues. They are:

STORAGE MANAGER: manages all storage objects such as the database, file systems, tape and disk arrays, network-attached intelligent devices, etc. The product also automates many administrative processes, manages exceptions, collects data about data, enables online performance monitoring and lets you see the health of the data warehouse at a glance. Storage Manager enables other Veritas and third-party products to be exploited in context, to cover the full spectrum of management required (snap-in tools).
STORAGE ANALYST: collects and aggregates further data, and enables analysis of the data over time.
STORAGE OPTIMIZER: recommends sensible actions to remove hot spots and otherwise improve the performance or reliability of the online (and later offline) storage, based on historical usage patterns.
STORAGE PLANNER: will enable capacity planning of online/offline storage, focusing on very large global databases and data warehouses.

[Note: versions of Storage Manager and Optimizer are available now, with the others being phased for later in 1998 and 1999.]
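As a purely illustrative sketch of the kind of threshold and trend check such tools automate, the query below flags volumes whose recent utilization breaches a policy limit. The table storage_util_hist and its columns are hypothetical stand-ins for whatever metrics repository a management tool maintains; this is not the interface of the Veritas products described above.

-- Flag volumes whose average utilization over the last 7 days breaches an 85% policy threshold.
SELECT   volume_name,
         AVG(pct_used) AS avg_pct_used_7d,
         MAX(pct_used) AS peak_pct_used_7d
FROM     storage_util_hist                 -- hypothetical daily metrics table
WHERE    sample_date >= CURRENT_DATE - 7
GROUP BY volume_name
HAVING   AVG(pct_used) > 85
ORDER BY avg_pct_used_7d DESC;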


The use of these tools, and of tools from other vendors, should ideally begin during the Design and Predict phase of development of a data warehouse. It is, however, an unfortunate truth that in most cases they will have to be used retrospectively, to manage situations that have become difficult to control and to regain the initiative with these key corporate assets.

Data warehouses, data marts and other large database systems are now critical to most global organizations. Management of them starts with a good life-cycle process that concentrates on the operational aspects of the system. Their success is dependent on the availability, accessibility and performance of the system. The operational management of a data warehouse should ideally focus on these success factors. Putting the structured part of the database on a leading database, such as Oracle, provides the assurance of the RDBMS vendor and the vendor's own management tools. Running the database on top of the Veritas File System, along with other data warehouse files, provides maximum ease of management with optimal performance and availability.
It also enables the most efficient incremental backup method available when used with the Veritas NetBackup facilities. By exploiting the Veritas Volume Manager, the disk arrays can be laid out for the best balance of performance and data resilience. The Optimizer product can identify hot spots and eliminate them on the fly. Replication services at the volume or file-system level can be used to provide enhanced disaster recovery, remote backup and multiple remote sites, by which the decision support needs of the data warehouse can be localized to the user community, thereby adding further resilience and providing optimal read access. High availability and advanced clustering facilities complete the picture of constantly available, growing, high-performance data warehouses. High-level storage management tools can provide a simple view of these sophisticated options, and enable management by policy and exception.

They can also add value through analysis of trends and optimization of existing configurations, through to predictive capacity planning of the data warehouse's future needs. In summary, the key to the operational success of a global data warehouse is online everything, where all changes to the software, the online and offline storage and the hardware can be made online on a 24x365 basis. Veritas is the storage company to provide end-to-end storage management, performance and availability for the most ambitious data warehouses of today.

4.9 Summary

(i) The hardware needed for a data warehouse has been explained along with the architecture. The problem with a DW (which does not arise in OLTP) is that the kind of load and queries are not certain. Therefore, the allocation of processes across the processors sometimes runs out of breath. Even with a cluster (with load balancing and automatic fail-over), it will become complex once you go beyond a certain size. Here the processing is done across multiple servers, each having its own memory and disk space. This way they get their own playing field, instead of fighting for common resources (as in the multiprocessor architecture).

(ii) The various aspects of security for a data warehouse have been discussed in detail. The security requirements of the DW environment are not unlike those of other distributed computing systems. Thus, having an internal control mechanism to assure the confidentiality, integrity and availability of data in a distributed environment is of paramount importance. Unfortunately, most data warehouses are built with little or no consideration given to security during the development phase. Achieving proactive security for a DW is a seven-phase process: 1) identifying data, 2) classifying data, 3) quantifying the value of data, 4) identifying data security vulnerabilities, 5) identifying data protection measures and their costs, 6) selecting cost-effective security measures, and 7) evaluating the effectiveness of security measures. These phases are part of an enterprise-wide vulnerability assessment and management program.

(iii) Regarding backup and recovery, the first step is to ensure that all of the data sources from which the data warehouse is created are themselves backed up. Even a small file that is used to help integrate larger data sources may play a critical part. Where a data source is external, it may be expedient to cache the data to disk in order to be able to back it up as well. Then there is the requirement to produce, say, a weekly backup of the entire warehouse itself, which can be restored as a coherent whole with full data integrity. The first mechanism is simply to take cold backups of the whole environment, exploiting multiplexing and other techniques to minimize the backup window (or restore time) by exploiting to the full the speed and capacity of the many types and instances of tape and robotics devices that may need to be configured. The second method is to use the standard interfaces provided by Oracle (Sybase, Informix, SQL BackTrack, SQL Server, etc.) to synchronize a backup of the database with the RDBMS recovery mechanism, providing a simple level of hot backup of the database concurrently with any related files. Note that a hot file system or checkpointing facility is also used to ensure that the conventional files backed up correspond to the database. The third mechanism is to exploit the special hot backup mechanisms provided by Oracle and other RDBMS vendors, where responsibility for the database part of the data warehouse is taken by, say, Oracle, which provides a set of data streams to the backup system and later requests parts back for restore purposes.

(iv) Service Level Agreement (SLA): a binding contract which formally specifies end-user expectations about the solution and its tolerances. It is a collection of service-level requirements that have been negotiated and mutually agreed upon by the information providers and the information consumers.


The SLA has three attributes: structure, precision and feasibility. This agreement establishes expectations and impacts the design of the components of the data warehouse solution.

(v) After the data warehouse becomes operational, the data management processes, including extraction, transformation, staging and load scheduling, are automated wherever possible. The loading is based on one of two common procedures: bulk download (which refreshes the entire database periodically) and change-based replication (which copies the changed data residing on different servers). Management tools are also used for the operations on a data warehouse.

4.10 Exercises

1. Explain the hardware requirements for a data warehouse.
2. Discuss the hardware architecture for a DW.
3. Why is security needed in a data warehouse environment?
4. Describe in detail the various security measures that can be applied to a data warehouse.
5. What types of backup and recovery are needed for a data warehouse?
6. What is an SLA? Explain in detail.
7. Give short notes on operating the data warehouse.


Unit V

Structure of the Unit
5.1 Introduction
5.2 Learning Objectives
5.3 Tuning the Data Warehouse
5.4 Testing the Data Warehouse
5.5 Data Warehouse Features
5.6 Summary
5.7 Exercises


5.1 Introduction

Data warehouses are often at the heart of the strategic reporting systems used to help manage and control the business. The function of the data warehouse is to consolidate and reconcile information from across disparate business units and IT systems to provide a subject-orientated, time-variant, non-volatile, integrated store for reporting on and analysing data.

Hence, after a data warehouse has been developed, testing and fine-tuning have to be carried out in order to get correct and accurate results. If these are done properly, the data warehouse will have the needed features. Since the nature of a data warehouse is dynamic, fine-tuning and testing are recurrent processes in a data warehouse environment.

5.2 Learning Objectives

To know about the testing done in a data warehouse
To have knowledge about the fine-tuning done in a data warehouse
Finally, to know the features that a data warehouse has to possess

5.3 Tuning the Data Warehouse


Overview

The speed and efficiency with which an RDBMS can respond to a query strongly affect the response time experienced by end users. All data warehouses can benefit from the creation of the best possible indexes and materialized views; the wrong indexes, or materialized views generated by the wrong SQL, may degrade performance rather than improve it. Extremely large data warehouses should be striped and partitioned.


Before taking this step, however, you should confirm that the data at the lowest levels of aggregation are truly needed for the types of analysis being performed. Eliminating unnecessarily low-level data from your data warehouse is much easier than striping. If your data warehouse is configured in a snowflake schema, you should look at how frequently queries must perform joins on the dimension tables. Denormalizing can improve performance significantly.
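As an illustration of this kind of denormalization, the following hedged sketch collapses a snowflaked product hierarchy into a single star-schema dimension so that queries no longer join three tables at run time. The table and column names (product, product_subcategory, product_category, product_dim) are hypothetical.

-- Snowflake: queries join product -> subcategory -> category at run time.
-- Star: the hierarchy is flattened once, at load time, into one dimension table.
CREATE TABLE product_dim AS
SELECT p.product_key,
       p.product_name,
       s.subcategory_name,
       c.category_name
FROM   product              p
JOIN   product_subcategory  s ON s.subcategory_key = p.subcategory_key
JOIN   product_category     c ON c.category_key    = s.category_key;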
Data types of key columns

In a dimension table, the key column should have a NUMBER data type for the best performance. A primary index is always created on the key column to ensure that each row has a unique value. A NUMBER data type reduces the amount of disk space needed to store the index values for the key, since the index values are also stored as numbers instead of text strings. The smaller the index, the faster the database can search it. The larger the number of values in the dimension, the greater the improvement in performance of NUMBER keys over CHAR (or other text) keys. Since time dimensions are typically rather small, a NUMBER key will improve performance only slightly. Thus, time dimensions can have either NUMBER or CHAR keys with little difference in performance between them. However, dimensions for products and geographical areas often have thousands of members, and the performance benefits of NUMBER keys can be significant.
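A hedged Oracle-style sketch of this guideline is shown below; the table and column names are hypothetical, and the point is simply that the surrogate key carried into the fact table is a compact NUMBER rather than a text code.

-- Dimension keyed on a compact NUMBER surrogate key rather than a CHAR code.
CREATE TABLE customer_dim (
    cust_key    NUMBER(10)    NOT NULL,   -- surrogate key referenced by the fact table
    cust_code   VARCHAR2(20),             -- original source-system identifier
    cust_name   VARCHAR2(100),
    gender      CHAR(1),
    region_name VARCHAR2(50),
    CONSTRAINT customer_dim_pk PRIMARY KEY (cust_key)
);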

Indexing

Indexing is a vital component of any data warehouse. It allows Oracle to quickly select the rows that satisfy the conditions of a query, without having to scan every row in the table. B-tree indexes are the most common; however, bitmap indexes are often the most effective in a data warehouse. A column identifying gender will have in each cell one of two possible values, to indicate male or female. Because the number of distinct values is small, the column has low cardinality. In dimension tables, the parent-level columns also have low cardinality, because the parent dimension values are repeated for each of their children. A column containing actual sales figures might have unique values in most cells. Because the number of distinct values is large, columns of this type have high cardinality. Most of the columns in fact tables have high cardinality. Dimension key columns have extremely high cardinality because each value must be unique. In fine-tuning your data warehouse, you may discover factors other than cardinality that influence your choice of an indexing method. With that caveat understood, here are the basic guidelines:


Create bitmap indexes for columns with low to high cardinality.
Create B-tree indexes for columns with very high cardinality (that is, all or nearly all unique values).
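For example, a hedged Oracle-style sketch of both index types on a hypothetical sales star schema might look like the following; the table names are illustrative, and receipt_number is assumed to be a nearly unique column on the fact table.

-- Bitmap index on a low-cardinality dimension attribute (e.g., gender).
CREATE BITMAP INDEX customer_gender_bix
    ON customer_dim (gender);

-- Bitmap index on a fact-table foreign key, a common data warehouse pattern.
CREATE BITMAP INDEX sales_cust_bix
    ON sales_fact (cust_key);

-- B-tree index on a very-high-cardinality (nearly unique) column.
CREATE INDEX sales_receipt_ix
    ON sales_fact (receipt_number);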

Striping and partitioning

Striping and partitioning are techniques used on large databases that contain millions of rows and multiple gigabytes of data. Striping is a method of distributing the data over your computer resources (such as multiple processors or computers) to avoid congestion when fetching large amounts of data. Partitioning is a method of dividing a large database into manageable subsets. Using partitions, you can reduce administration downtime, eliminate unnecessary scans of tables and indexes, and optimize joins. If other methods of optimizing your database have not been successful in bringing performance up to acceptable standards, then you should investigate these techniques.
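As a hedged illustration of partitioning, the sketch below range-partitions a hypothetical fact table by month, so that maintenance and scans can be confined to individual partitions rather than the whole table; the names and partition boundaries are illustrative.

-- Range-partition the fact table by month; queries and maintenance touch
-- only the relevant partitions instead of the entire table.
CREATE TABLE sales_fact (
    sale_id    NUMBER(12),
    cust_key   NUMBER(10),
    prod_key   NUMBER(10),
    sale_date  DATE,
    amount     NUMBER(12,2)
)
PARTITION BY RANGE (sale_date) (
    PARTITION p_2024_01 VALUES LESS THAN (TO_DATE('2024-02-01', 'YYYY-MM-DD')),
    PARTITION p_2024_02 VALUES LESS THAN (TO_DATE('2024-03-01', 'YYYY-MM-DD')),
    PARTITION p_future  VALUES LESS THAN (MAXVALUE)
);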

Materialized views

Oracle will rewrite queries written against tables and views to use materialized views whenever possible. For the optimizer to rewrite a query, the query must pass several tests to verify that it is a suitable candidate. If the query fails any of the tests, then the query is not rewritten, the materialized views are not used, and performance degrades because the aggregate data must be recalculated at runtime. All materialized views for use by the OLAP API must be created from within the OLAP management tool of OEM. Materialized views created elsewhere in Oracle Enterprise Manager, or directly in SQL, are unlikely to match the SQL statements generated by the OLAP API, and thus will not be used by the optimizer.
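To illustrate the general mechanism (outside the OLAP API case described above), the following is a hedged sketch of a summary materialized view that the optimizer can use for query rewrite; the schema is hypothetical.

-- Pre-aggregated monthly sales by product; ENABLE QUERY REWRITE lets the
-- optimizer transparently answer matching aggregate queries from this summary.
CREATE MATERIALIZED VIEW sales_month_mv
BUILD IMMEDIATE
REFRESH COMPLETE ON DEMAND
ENABLE QUERY REWRITE
AS
SELECT prod_key,
       TRUNC(sale_date, 'MM') AS sale_month,
       SUM(amount)            AS total_amount,
       COUNT(*)               AS row_count
FROM   sales_fact
GROUP BY prod_key, TRUNC(sale_date, 'MM');

A query such as SELECT prod_key, SUM(amount) FROM sales_fact GROUP BY prod_key can then be answered from the much smaller summary, provided the rewrite tests described above are passed.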
Application Tuning in a Data Warehouse

Application tuning helps companies save money through query optimization.

CIOs and data warehouse directors are under pressure and overwhelmed with user requests for more resources, applications and power, often without an accompanying increase in budget. Fortunately, it is possible to relieve some of the pressure and "find money" by effectively tuning applications on the data warehouse.
Sometimes queries that perform unnecessary full-table scans or other operations that consume too many system resources are submitted to the data warehouse. Application tuning is a process to identify and tune target applications for performance improvements and proactively prevent application performance problems.


Application tuning focuses on returning capacity to a system by concentrating on query optimization. Through application tuning, database administrators (DBAs) look for queries wreaking havoc on the system and then target and optimize those queries to improve system performance and prevent application performance problems. The results can be dramatic, often providing a gain of several nodes' worth of processing power.

Savings in the works

A holistic view of Teradata performance, gained through the timely collection of data, is a good precursor to application tuning. Many customers have engaged Teradata Professional Services to install the performance data collection and reporting (PDCR) database. This historical performance database and report toolkit provides diagnostic reports and graphs to help tune applications, monitor performance, manage capacity and operate the Teradata system at peak efficiency. If the PDCR database is not installed for performance tuning, it is imperative to enable Database Query Log (DBQL) detail, SQL and objects data logging for a timeframe that best represents the system workload, in order to identify the optimum queries for tuning. To optimize performance and extract more value from your Teradata system, follow these application tuning steps:

STEP 1: Identify performance-tuning opportunities
The DBQL logs historical data about queries, including query duration, CPU consumption and other performance metrics. It also offers the information needed to calculate suspect-query indicators such as large-table scans, skewing (when the Teradata system is not using all the AMPs in parallel) and large-table-to-large-table product joins (a highly consumptive join). A hedged sketch of such a DBQL query appears after Step 3 below.

STEP 2: Find and record "like queries" with similar problems
While the DBQL is used to find specific incidents of problem queries, it can also be used to examine the frequency of a problem query. In this scenario, a DBA might notice that a marketing manager runs a problem query every Monday morning, and the same problem query is run several times a day by various users. Identifying and documenting the frequency of problem queries offers a more comprehensive view of the queries affecting data warehouse performance and helps prioritize tuning efforts.

STEP 3: Determine a tuning solution
To improve query performance, particularly for queries with large-scan indicators, additional indexes or index changes should be considered. Teradata's various indexing options enable efficient resource use, saving I/O and CPU time and thereby making more resources available for other work. Options such as partitioned primary index (PPI), secondary indexes and join indexes can help reduce resource consumption and make queries more efficient.
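The query below is a hedged sketch, not the PDCR toolkit itself, of how a DBA might pull high-CPU, potentially skewed queries from the DBQL. It assumes the commonly documented DBC.DBQLogTbl logging table; column names and the CPU threshold can differ between Teradata releases and sites.

-- Candidate problem queries from the last 7 days: high CPU plus a simple skew
-- indicator (ratio of the busiest AMP's CPU to the average CPU per AMP).
SELECT  UserName,
        QueryID,
        StartTime,
        AMPCPUTime,
        TotalIOCount,
        MaxAMPCPUTime * NumOfActiveAMPs / NULLIFZERO(AMPCPUTime) AS SkewFactor
FROM    DBC.DBQLogTbl
WHERE   StartTime >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
  AND   AMPCPUTime > 1000                     -- illustrative threshold
ORDER BY AMPCPUTime DESC;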

STEP 4: Determine the best solution
To determine the best tuning options, it is important to baseline existing performance conditions (using DBQL data), pilot potential solutions through experimentation and analyze the results.


If multiple optimization strategies are found, DBAs should test one strategy at a time by temporarily creating the new scenario, changing the queries to use the new objects, running the queries, and measuring, documenting and analyzing the results. DBAs must run the tests on the same production system and take the following steps to determine the solution with the best cost/benefit and the viability of the final performance fixes:

Test the system using a user ID with a low workload priority.
Use each of the optimization strategies and gather the new DBQL data.
Compare the new DBQL measurements with the original baseline.

STEP 5: Regression testing
Regression testing is an important quality control process to ensure that any optimization changes or fixes will not adversely affect data warehouse performance. First, the DBA must determine a representative list of queries that apply to the selected performance fix. From there, a regression test suite is created to gauge the effectiveness of the solution before production. In regression testing, the new environment is re-created on the same production system, and the effects of the change are measured and documented. The goal is to ensure that queries that are not part of the tuning process are not unduly affected by the optimization changes.

STEP 6: Quantify and translate performance gains into business value
CIOs are routinely pressed to show how their IT dollars affect operations and enable cost reduction and business growth. Quantifying the business value of query optimization, or any IT improvement, is an important step in showcasing the value of the data warehouse. Determining business value can be broken into calculations and sub-calculations. To answer the question "How many CPU seconds equal a node?", use the following calculations:

Determine per-node CPU seconds in a day as (number of CPUs per node x 86,400) - 20%, where 86,400 is the number of seconds in a day and 15% to 20% is subtracted to account for system-level work not recorded in DBQL.
Multiply per-node CPU seconds in a day by 30 to get CPU seconds per node per month. On a four-CPU node, the equation would look something like this: ((4 x 86,400) - (4 x 86,400)/5) x 30 = 8,294,400 CPU seconds.
Check the impact of making a tuning change: monthly CPU saved = total old CPU for a month x the average improvement percent. For example, if tuning reduces a workload that previously consumed 20,000,000 CPU seconds per month by 20%, the 4,000,000 CPU seconds saved amount to roughly half of that four-CPU node's monthly capacity.

STEP 7: Document and implement
Presenting application tuning recommendations to IT management and business users typically requires more than a spreadsheet of data, although a spreadsheet can be used for backup material or a deeper dive into performance data and options. The presentation should be tailored to a specific audience and should capture the value of application tuning. The presentation might include:


Query optimization process
Options found and tested
Best option
Options discarded, and why
Lists of what still needs testing
Observations and recommendations
Anticipated savings

Customers looking to add new applications, improve application performance or quantify the need for hardware expansion can benefit from application tuning. Following the application tuning methodology will help you optimize performance.

5.4 Testing the Data Warehouse

Testing a data warehouse may seem a wondrous and mysterious process, but it is really not that different from any other testing project. The basic system analysis and testing process still applies. Let's review a few of these steps and how they fit within a data warehouse context:

Analyze source documentation

As with many other projects, when testing a data warehouse implementation, there is typically a requirements document of some sort. These documents can be useful for basic test strategy development, but often lack the details to support test development and execution. Many times there are other documents, known as source-to-target mappings, which provide much of the detailed technical specification. These source-to-target documents specify where the data is coming from, what should be done to the data, and where it should get loaded. If you have it available, additional system-design documentation can also serve to guide the test strategy.

Develop strategy and test plans

As you analyze the various pieces of source documentation, you'll want to start to develop your test strategy. I've found that, from a lifecycle and quality perspective, it's often best to seek an incremental testing approach when testing a data warehouse. This essentially means that the development teams will deliver small pieces of functionality to the test team earlier in the process. The primary benefit of this approach is that it avoids an overwhelming "big bang" type of delivery and enables early defect detection and simplified debugging. In addition, this approach serves to set up the detailed processes involved in development and testing cycles. Specific to data warehouse testing, this means testing of acquisition staging tables, then incremental tables, then base historical tables, BI views and so forth.


Another key data warehouse test strategy decision is the analysis-based test approach versus the query-based test approach. The pure analysis-based approach puts test analysts in the position of mentally calculating the expected result by analyzing the target data and related specifications. The query-based approach involves the same basic analysis but goes further to codify the expected result in the form of a SQL query. This offers the benefit of setting up a future regression process with minimal effort. If the testing effort is a one-time effort, then it may be sufficient to take the analysis-based path, since that is typically faster. Conversely, if the organization will have an ongoing need for regression testing, then a query-based approach may be appropriate.

Test development and execution

Depending on the stability of the upstream requirements and analysis process, it may or may not make sense to do test development in advance of the test execution process. If the situation is highly dynamic, then any early tests developed may largely become obsolete. In this situation, an integrated test development and test execution process that occurs in real time can usually yield better results. In any case, it is helpful to frame the test development and execution process with guiding test categories. For example, a few data warehouse test categories might be:

record counts (expected vs. actual)
duplicate checks
reference data validity
referential integrity
error and exception logic
incremental and historical process control
column values and default values
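As a hedged illustration of the query-based approach to a few of these categories, the checks below compare a hypothetical staging table against its target and look for duplicate keys and orphan rows; the table and column names are illustrative.

-- Record counts: the target should hold exactly the rows delivered by staging.
SELECT (SELECT COUNT(*) FROM stg_sales_full) AS expected_rows,
       (SELECT COUNT(*) FROM sales_fact)     AS actual_rows
FROM   dual;

-- Duplicate check: the natural key should never repeat in the target.
SELECT   sale_id, COUNT(*) AS occurrences
FROM     sales_fact
GROUP BY sale_id
HAVING   COUNT(*) > 1;

-- Referential integrity: every fact row should resolve to a customer dimension row.
SELECT COUNT(*) AS orphan_rows
FROM   sales_fact f
WHERE  NOT EXISTS (SELECT 1 FROM customer_dim d WHERE d.cust_key = f.cust_key);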

In addition to these categories, defect taxonomies like Larry Greenfield's may also be helpful. An often forgotten aspect of execution is an accurate status-reporting process. Making sure the rest of the team understands your approach, the test categories and the testing progress will keep everyone clear on the testing status. With some careful planning, follow-through and communication, a data warehouse testing process can be set up to guide the project team to a successful release.
The goals of successful data warehouse testing are the following:

Data completeness. Ensures that all expected data is loaded.
Data transformation. Ensures that all data is transformed correctly according to business rules and/or design specifications.
Data quality. Ensures that the ETL application correctly rejects, substitutes default values, corrects or ignores, and reports invalid data.


Performance and scalability. Ensures that data loads and queries perform within expected time frames and that the technical architecture is scalable.
Integration testing. Ensures that the ETL process functions well with other upstream and downstream processes.
User-acceptance testing. Ensures the solution meets users' current expectations and anticipates their future expectations.
Regression testing. Ensures existing functionality remains intact each time a new release of code is completed.

5.5 Data Warehouse Features

Data warehouse technologies need to have the following features:

capacity: what are the outer bounds of the DBMS technology's physical storage capability, and what in practice (meaning in real-world, referenceable commercial applications) are the boundaries of the technology's capacity?
loading and indexing performance: how fast does the DBMS load and index the raw data sets from the production systems or from reengineering tools?
operational integrity, reliability and manageability: how well does the technology perform from an operational perspective? how stable is it in near-24x7 environments? how difficult is it to archive the data warehouse data set, in whole or in part, while the data warehouse is online? what kinds of tools and procedures exist for examining warehouse use and tuning the DBMS and its platform accordingly?
client/server connectivity: how much support, and what kinds of support, does the open market provide for the DBMS vendor's proprietary interfaces, middleware and SQL dialects?
query processing performance: how well does the DBMS' query planner handle ad hoc SQL queries? How well does the DBMS perform on table scans? on more or less constrained queries? on complex multi-way join operations?

None of these areas is exclusively the province of the DBMS technology; all depend on the elusive combination of the right design, the right DBMS technology, the right hardware platform, the right operational procedures, the right network architecture and implementation and the hundred other variables that make up a complex client/server environment. Nevertheless, it should be possible to get qualitative if not quantitative information from a prospective data warehouse DBMS vendor in each of these areas.


Capacity

Capacity is a funny kind of issue in large-scale decision support environments. Until just a few years ago, very large database (VLDB) boundaries hovered around the 10 gigabyte (GB) line, yet data warehouses are often spoken of in terms of multiple terabytes (TB). DSS (Decision Support System) is generally an area where, prior to the first DSS project, data machismo reigns: the firm with the biggest warehouse wins, and sometimes the design principle at work is "let's put everything we have into the warehouse, since we can't tell quite what people want." The reality is that:

well-designed warehouses are typically greater than 250 GB for a mid-sized to large firm. The primary determinants of size are granularity, or detail, and the number of years of history kept online.
the initial sizing estimates of the warehouse are always grossly inaccurate.
the difference between the raw data set size (the total amount of data extracted from production sources for loading into the warehouse) and the loaded data set size varies widely, with loaded sets typically taking 2.5 to 3 times the space of raw data sets.
the gating factor is never, as old-guard technologists seem to think, the available physical storage media, but instead the ability of the DBMS technology to manage the loaded set comfortably.
any DBMS technology's capacity is only as good as its leading-edge customers say (and demonstrate) that it is; a theoretical limit of multiple terabytes means little in the face of an installed base with no sites larger than 50 GB.

Loading and Indexing Speed

Data engineering -- the extraction, transformation, loading and indexing of DSS data -- is a black art, and there are as many data engineering strategies as there are data warehouses. A firm may have customers using state-of-the-art reengineering and warehouse management tools like those from Prism and ETI, customers using sophisticated homegrown message-based near-real-time replication mechanisms, and customers who cut 3480 tapes using mainframe report writers. The bottom line is that the specifics of a DBMS technology's load and indexing performance are conditioned by the data engineering procedures in use, and it is therefore necessary to have a clear idea of likely data engineering scenarios before it is possible to fully evaluate a DBMS' suitability for data warehousing applications.


This area of evaluation is made more complex by the fact that some proprietary MDDBMS environments lack the ACID characteristics required to recover from a failure during loading, or do not support incremental updates at all, making full drop-and-reload operations a necessity. Nevertheless, significant numbers of first-time DSS projects definitely fail not because the DBMS is incapable of processing queries in a timely fashion, but because the database cannot be loaded in the allotted time window. Loads requiring days are not unheard of when this area of evaluation is neglected, and, when the warehouse is refreshed daily, this kind of impedance mismatch spells death for the DSS project.

Operational Integrity, Reliability and Manageability

A naive view of DSS would suggest that, since the data warehouse is a copy of operational data, traditional operational concerns about overall system reliability, availability and maintainability do not apply to data warehousing environments. Nothing could be farther from the truth. First of all, the data warehouse is a unique, and quite possibly the most clean and complete, data source in the enterprise. The consolidation and integration that occur during the data engineering process create unique data elements, and scrub and rationalize data elements found elsewhere in the enterprise's data stores. Second of all, the better the enterprise DSS design, the more demand is placed on the warehouse and its marts. Effective DSS environments quickly create high levels of dependency, within end-user communities, on the warehouse and its marts; organizational processes are built around the DSS infrastructure; other applications depend on the warehouse or one of its marts for source data. The loss of warehouse or mart service can quite literally bring parts of the firm to a grinding, angry halt. Bottom line: all the operational evaluation criteria we would apply without thinking to an online transaction processing (OLTP) system apply equally to the data warehouse.

Client/Server Connectivity

The warehouse serves, as a rule, data marts and not end-user communities. There are exceptions to this rule: some kinds of user communities, particularly those who bathe regularly in seas of quantitative data, will source their analytic data directly from the warehouse. For that reason, and because the open middleware marketplace is now producing open data movement technology that promises to link heterogeneous DBMSs with high-speed data transfer facilities, it is important to understand what kind and what quantity of support exist in the open marketplace for the DBMS vendor's proprietary client/server interfaces and SQL dialects.


A DBMS engine that processes queries well, but which has such a small market share that it is ignored by the independent software vendor community, or is supported only through an open specification like Microsoft's Open Database Connectivity (ODBC) specification, is a dangerous architectural choice.

Query Processing Performance

Query processing performance, like capacity, is an area of the DSS marketplace in which marketing claims abound and little in the way of common models or metrics is to be found. Part of the practical difficulty in establishing conventions in this area has to do with the usage model for the warehouse. If, for example, the warehouse is primarily concerned with populating marts, its query performance is a secondary issue, since it is unlikely that a large mart would request its load set using dynamic SQL, and far more likely that some kind of bulk data transfer mechanism, fed by a batch extract from the warehouse, would be used. If the warehouse serves intensive analytic applications like statistical analysis tools or neural network-based analytic engines, or if the warehouse is the target for (typically batch-oriented) operational reporting processes, the warehouse is likely to have to contend with significant volumes of inbound queries imposing table-scanning disciplines on the database: few joins, significant post-processing, and very large result sets. If, on the other hand, the warehouse is connected directly to significant numbers of intelligent desktops equipped with ad hoc query tools, the warehouse will have to contend with a wide range of unpredictable constrained and unconstrained queries, many of which are likely to impose a multi-way join discipline on the DBMS. All of these usage models suggest different performance requirements, and different (and perhaps mutually exclusive) database indexing strategies. Thus -- as is the case with load and indexing performance -- it is critical to have a clear idea of the warehouse usage model before structuring performance requirements in this area.

5.6 Summary

Tuning of a data warehouse has to be done to improve its performance. This can be done by choosing the proper data types for the important columns, and through indexing, partitioning and materialized views. Partitioning is a method of dividing a large database into manageable subsets. Through partitioning one can achieve easy and consistent query performance and simpler maintenance of the data, and the scalability of the data warehouse can be increased. Data compression can also be applied as part of tuning the warehouse.


Testing of a data warehouse is done to verify data completeness, data transformation, data quality, performance, scalability, user acceptance, etc. First the test data is prepared; then test plans and a test strategy are prepared, and the tests on the warehouse are carried out.

General features of a data warehouse can be listed as follows.


capacity: what are the outer bounds of the DBMS technology's physical storage capability?
loading and indexing performance: how fast does the DBMS load and index the raw data sets from the production systems or from reengineering tools?
operational integrity, reliability and manageability: how well does the technology perform from an operational perspective? how stable is it in near-24x7 environments? how difficult is it to archive the data warehouse data set, in whole or in part, while the data warehouse is online?
client/server connectivity: how much support, and what kinds of support, does the open market provide for the DBMS vendor's proprietary interfaces, middleware and SQL dialects?
query processing performance: how well does the DBMS' query planner handle ad hoc SQL queries? How well does the DBMS perform on table scans? on more or less constrained queries?

5.7 Exercises

1. What do you mean by data warehouse tuning? Discuss in detail with examples.
2. How does tuning help to improve the performance of a data warehouse?
3. How can testing be done on a data warehouse? List out the steps and explain.
4. Write in detail about data warehouse features.

