Data Warehouse Concepts

----------

David Sharon T

The perfect data warehouse, for any subset of facts and any subset of dimensions, will provide one and only one answer. It can be represented as a fully connected directed acyclic graph (FCDAG). Think of facts as basic pieces of information that users want to see in the answer to a question and dimensions as one way for the users to constrain the scope of the question. The business model is a function of the information, while the dimensional model depends upon the analytical processing that will be performed. Further, the physical model is the result of modifications necessary to reach performance objectives.

Keep the following rules pasted on your forehead: #1: Develop a thorough and complete understanding of the business processes that the warehouse will support. #2: Obsess over rule #1.

Dimension: A simple dimension joins a single intersection entity to a single strong entity. A complex dimension joins a single intersection entity to more than one strong entity. One of the strong entities should be identified as the primary dimensional structure while the others will be identified as alternate dimensional structures. An improper dimension is a cyclic path joining a single intersection entity to one or more strong entities.

Hierarchy: A hierarchy is an instance of a particular dimensional structure. A dimension may have more than one hierarchy. For example, a store dimension may have a geographical hierarchy, a demographic hierarchy and a store attribute hierarchy. Time may have a calendar hierarchy, a fiscal hierarchy and a rolling hierarchy.

Informational dimension: Informational dimensions define what data is stored. That is, the informational dimensions define what the data points (or facts, or Key Performance Indicators) are. These are called attributes in the relational data model. The advantage to thinking of them as a dimension is twofold: first, they often have a complex structure (e.g. a corporate chart of accounts) and second, a dimensional structure is much more flexible. Given the limitations of the relational data model, however, these advantages are usually only achieved in a multi-dimensional database.

Structural dimension: Structural dimensions define how the data is stored. These are the entities in an ER diagram. The ability to modify the structure (either by adding new dimension members or by adding new dimensions) is where relational databases really have an edge over some (but not all) multi-dimensional databases.

Categorical dimension: Categorical dimensions classify or categorize other entities. For example, box size might categorize breakfast cereals, and color may be used to categorize automobiles. There are some advantages (discussed later) to thinking of these from a dimensional viewpoint.

Partitioning dimension: Partitioning dimensions replicate the remaining structure of the database. Scenario is a common partitioning dimension; some example scenarios would be Actual and Budget. In this case, we want the complete structure to occur for both Actual and Budget. There are many advantages to thinking of these from a dimensional viewpoint.

Members: Members are either the actual occurrence of a dimension or the named level of a dimension. For example, in the Sales Organization dimension, we might have Sales Districts (as a named level) and New York as one of the Sales Districts. Both of these may be called members. The difference should be clear in the context.

Star Architecture: A valid star architecture consists of a single intersection entity (fact table) joined to one or more proper dimensions. An invalid star architecture consists of a single intersection entity joined to one or more dimensions, where at least one of the dimensions is an improper dimension. A valid star architecture will always return consistent results, when each dimension is included in the query. An invalid star architecture may return different results depending upon the path that is taken through the improper dimensions. Note that the correctness of the query is not a structural issue, but is a semantic issue.

The casual reader will note that our definition of a valid star architecture is an FCDAG. A composite star architecture consists of two valid star architectures that are connected, either through a dimensional structure (of some type) or by connecting the fact tables.

Fact table: This is the central table in a star architecture that contains all of the Key Performance Indicators (also called data points, attributes, or the informational dimension). Technically, the fact table is an intersection entity whose primary key is a composite key. The domain of each component of the key consists of the union of the domains of the different dimension levels.

Aggregate key: A composite key is a primary key that consists of a number of foreign keys. An aggregate key is a logical concept that includes other attributes as well. It is structurally meaningful when you are pre-aggregating categorical dimensions. There is more on this later, in the section on Types & Subtypes.
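As a concrete illustration, here is a minimal sketch of a fact table whose primary key is a composite of dimension foreign keys. The table and column names (sales_fact, time_dim, product_dim, customer_dim and their keys) are illustrative assumptions, not taken from the text.

    -- A minimal sketch of a star-architecture fact table, assuming
    -- hypothetical dimension tables time_dim, product_dim and customer_dim.
    CREATE TABLE sales_fact (
        time_id      INTEGER NOT NULL,   -- foreign key into the time dimension
        product_id   INTEGER NOT NULL,   -- foreign key into the product dimension
        customer_id  INTEGER NOT NULL,   -- foreign key into the customer dimension
        sales_amount DECIMAL(12,2),      -- Key Performance Indicators
        sales_units  INTEGER,            -- (the informational dimension)
        PRIMARY KEY (time_id, product_id, customer_id)  -- composite key of foreign keys
    );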

Data modeling steps
In brief, the steps that we will follow are:
1. Develop a normalized entity-relationship model of the business model of the data warehouse.
2. Translate this into the dimensional model. This step reflects the information and analytical characteristics of the data warehouse.
3. Translate this into the physical model. This reflects the changes necessary to reach the stated performance objectives.

The three model architecture (business model, logical model, internal or physical model) has been discussed in ER modeling for some time. However, it was developed with transaction processing systems in mind. Since data warehouses are developed with analytical processing in mind, we have added the concept of the dimensional model. This model represents the information along dimensional lines.

Care most about the questions that we should ask. That is, ask the users for their needs and desires as a base for discussion. Don't focus on what they do today, because it is very likely going to be different in the future. Focus on the information first, then the access paths next. Data warehouses (like most analytical applications) are organic: they grow, and they grow in directions that we are not able to predict. If we focus on current functional requirements, we will be able to develop a highly optimized solution for today's problems. However, as the problems change the solution becomes less and less optimal. Eventually, it is sub-optimal and hence, unacceptable. A better approach is to start out by trading off some optimization for flexibility. The resulting solution may always be a little less optimal but it will be more adaptable and hence have a much longer life.

Model normalization
Once we have gathered all of the entity and relationship definitions, we can construct the normalized model. Loosely speaking, 3rd Normal Form (3NF) is reached when the attributes depend upon 'the key, the whole key and nothing but the key'. There are many advantages to 3NF. The structure is remarkably insensitive to change. Specifically, the 'ripple' effect of changes on other areas is very well contained. If we change (add to or delete from) the attributes that are related to a particular key, there should be no reason to change other entities or relations. And if we add new entities, only the directly affected elements in the database need changing. From a data warehouse perspective, this means that the warehouse structure can be easily modified to reflect new organizational changes or new business problems. (One example of this is the ease with which multiple hierarchies for a dimension can be built and maintained.)

The structural paths for accessing information are very clear. Since this is a directed graph, the only difficulty arises when we are forced to include cyclic paths. Since these are the result of business conditions, we know that the paths are required. However, it does mean that we need to educate our users about the different paths and what interpretation they have. The process that we describe below will help to eliminate the cyclic paths, since they are death to most query tools. Having clear structural paths means nothing, by the way, without clear meaning. Remember our emphasis on semantics? Semantics is everything!!!

CRUD (Create, Report, Update, Delete) anomalies: if you read almost any book on data modeling and database design (several are in the bibliography), you will notice that a great deal of the discussion on the normal forms revolves around the elimination of update anomalies, that is, the type of update problems that are solved by moving to a higher normal form. Since a data warehouse is rarely updated, the possibility of these anomalies occurring is much lower. This is why it is acceptable to talk about denormalizing the data warehouse. As always in life, there are some disadvantages to 3NF:

1. Performance can be truly awful. Most of the work that is performed on 'denormalizing' a data model is an attempt to reach performance objectives.
2. The structure can be overwhelmingly complex. We may wind up creating many small relations which the user might think of as a single relation or group of data. The dimensional model reduces that to some extent. This is the semantic part.

Since the business model will not be implemented, these disadvantages are rarely critical. Carrying them over to the dimensional model will recreate them, however. The process of normalizing an ER model has been dealt with adequately in many other books. Since the process is no different for a data warehouse, we will not dwell on it here. There are some recommended books in the bibliography.


Integrity constraints
As a general rule, you should understand and enforce the integrity constraints within a data warehouse. Integrity rules will help guarantee the consistency of the query results. If we drill down through a dimensional structure, integrity constraints will guarantee that we will get the same answer as if we had done a full table scan. Note that failing to enforce integrity may result in different answers.

Every rule has an exception! Enforce the integrity constraints unless you can't. Enforcing integrity constraints may result in some records being rejected during the load process. The decision must be made whether a consistent but possibly incomplete answer is better or worse than a complete but possibly inconsistent answer. It is possible that the upstream operational systems have created a situation where this cannot be avoided. Since correcting these systems is beyond the scope of the data warehouse, you may very well have to live with relaxed integrity constraints. If so, understand the implications and make sure that they are well-defined and that the definition is readily available. Explaining the impacts of these relaxed constraints will be a recurring problem as new users join the community.

For example, I worked on a DSS application that tracked sales by customer and contract (and some other dimensions). For various reasons (which were valid for that business), there were ex post facto adjustments to sales. If we sliced the data by either customer or contract, we could reconcile each load. However, because of these adjustments, we could not reconcile a two-dimensional (customer/contract) slice. Since this was the result of some problems in the operational systems, it was a lamentable fact of life in the DSS application. Note that it did not affect the value of this decision support system.
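A minimal sketch of how such constraints might be declared, reusing the hypothetical sales_fact and dimension tables from the earlier example:

    -- Declaring referential integrity so that every fact row has a matching
    -- dimension member (hypothetical table and column names).
    ALTER TABLE sales_fact
        ADD CONSTRAINT fk_sales_time
            FOREIGN KEY (time_id) REFERENCES time_dim (time_id);

    ALTER TABLE sales_fact
        ADD CONSTRAINT fk_sales_product
            FOREIGN KEY (product_id) REFERENCES product_dim (product_id);

    -- With these constraints enforced, a drill-down such as the one below
    -- returns the same total as a full scan of sales_fact
    -- (assuming time_dim carries a year attribute for its members).
    SELECT t.year, SUM(f.sales_amount)
    FROM   sales_fact f
    JOIN   time_dim   t ON t.time_id = f.time_id
    GROUP  BY t.year;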

Dimensional model
The dimensional model overcomes one of the objections noted above (model complexity) by defining how the user will access information. The dimensional model should be much closer to how the users think of the information. It should resolve any of the semantic ambiguities that are in the business model. That is, the dimensional model should be developed as a fully connected directed acyclic graph. (How beautiful!) In the following sections we discuss the different types of dimensions. These are conceptual differences. Creating this taxonomy of dimensions will help us create a better dimensional model and then a better physical model.

Structural dimensions
The first step is the development of the structural dimensions. This step corresponds very closely to what we normally do in a relational database. This step is commonly referred to as 'denormalization'. Indeed, with respect to Codd's relational data model, the database that results from this process may be pretty much 0NF! While this may result in very fast response times for queries, we do lose the level of flexibility that is the hallmark of a 3NF design. No free lunch! The star architecture that we will develop here depends upon taking the central intersection entities as the fact tables and building the foreign key => primary key relations as dimensions. In our sales and marketing example, the sales history table is the fact table of greatest interest (there may be others). The relations that define the sales record will be determined as the result of the analysis above (who bought the product and so on). Technically, we can see that this is a directed graph.

Pattern I: The simple case

Figure 3 Pattern I: The simple case

In this example, everything is very clean and orderly: both of the dimensional structures are proper and simple. Note that the intersection entity has a number of relationships that are quite simple. By this we mean that each foreign key exists in only one primary key relation. Each of these paths then becomes a dimension. The structure of the fact table is obvious, but what is the best way to structure the dimension tables? Using the definition of levels from before, create each dimension table with the following columns:

Dimension key: This column will contain all possible values of the complete dimensional structure. For example, if time is the dimension, it will contain months, quarters and years. Note that this requires that all members in a dimension have unique identifiers. If you like to think of domains, then the domain of the dimension key is the union of the domains of the member table keys.

L1 Parent: This is only relevant if the dimension key is L0. For other dimension members, it will be null. This, and the other Ln parents, are included only for performance reasons.

L2 Parent: This is only relevant if the dimension key is L1 or L0. For other dimension members, it will be null.

Ln Parent: This is only relevant if the dimension key is Ln-1, or any lower level. For other dimension members, it will be null.

Height: This is an integer that identifies the height of the dimension member, that is, the height from the leaf node. This is very useful for certain types of queries.

Depth: This is an integer that identifies the depth of the dimension member, that is, how far down from the root node. This is also useful for certain types of queries.

Figure 4 Dimensional table definition

As you can see, the definition of this dimensional table flies in the face of all normalization rules. However, it will support analytical queries and will produce much faster response times than a normalized design. The process of loading these tables is very straightforward and typically very fast. As part of the yin-yang principle, this approach does have significant problems with implementing dimensional dependencies. These occur when different attributes are related to different levels of the dimensional structure. This is discussed in more detail later on.
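The sketch below translates this layout into DDL for a hypothetical time dimension with three levels (month, quarter, year); the names are illustrative only.

    -- A minimal sketch of the denormalized dimension table described above,
    -- using a hypothetical time dimension with levels month -> quarter -> year.
    CREATE TABLE time_dim (
        time_id     INTEGER PRIMARY KEY,  -- dimension key: months, quarters AND years all live here
        l1_parent   INTEGER,              -- the quarter a month belongs to; null for quarter and year rows
        l2_parent   INTEGER,              -- the year a month or quarter belongs to; null for year rows
        height      INTEGER,              -- distance from the leaf level (0 for months)
        depth       INTEGER,              -- distance from the root level (0 for years)
        member_name VARCHAR(40)           -- descriptive label, e.g. 'Jan 1997', 'Q1 1997', '1997'
    );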

Alternate dimensional structures


There are alternate design approaches that are different from the dimensional table structure that we propose here: first, the simple recursive model (referred to as the vertical model); second, the flattened model (referred to as the horizontal model); third, the combined approach. The advantages and disadvantages of each are summarized below.

Simple vertical model: The primary key for this approach is the TimeDimId shown above. The domain for this column is the union of the domains for Day, Week, Month et cetera; that is, the union of all of the time structural levels. Drilling down through the time hierarchy is simply a self-join of the TimeDimId to the ParentId. Drilling up is easily done through the ParentId. This structure is efficient if the aggregates are pre-constructed. However, drilling operations will require one SQL statement per level. This is also the only realistic method of handling unbalanced hierarchies in an RDB.

Simple horizontal model: This structure uses the leaf members as the primary key and codes the ancestral members as columns. This approach is best if all aggregates are calculated dynamically. Note that it is possible to use this even if the aggregates are pre-calculated, but the vertical model is a more general solution. The major reason for this approach is that drilling operations of any depth can be performed with a single join.

Combined model: This structure has the union as the primary key but also codes the ancestral members as columns. This is an efficient structure for pre-aggregated databases and also supports multi-level drilling.

Figure 5 Dimensional Structures
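To make the vertical model concrete, here is a minimal sketch of a one-level drill-down, assuming a hypothetical time_vertical(TimeDimId, ParentId, LevelName) table that follows the recursive layout described above.

    -- Drilling down one level in the vertical (recursive) model: find the
    -- children of a given member via a self-join of TimeDimId to ParentId.
    -- Table and column names are hypothetical.
    SELECT child.TimeDimId, child.LevelName
    FROM   time_vertical parent
    JOIN   time_vertical child ON child.ParentId = parent.TimeDimId
    WHERE  parent.TimeDimId = 1997;   -- e.g. the member representing the year 1997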

Time example
Time is probably the most common dimension in a data warehouse. A warehouse is, after all, the retention of historical data for analytical purposes. There are three major ways to structure a calendar: calendar, fiscal and rolling. We will explore all of these eventually, but for now we will define the calendar structure. Note that time starts out appearing as a straightforward issue. However, it can get complicated. The problem occurs because two months (and quarters and years) can occur in the same week. There are several ways to deal with this. Two possible solutions are:

1. Name the weeks so that you can split them in the middle. For example, we would have SepWk5 and OctWk1, each with less than five days. This will give you correct rollups to month and above, but will make week-to-week comparisons difficult. A variation on this solution that cures the last problem is to define alternate rollups for week that would also include the full five days.

2. Assume that the week belongs in the month where either Friday or Monday (or some other day) occurs. This always gives you five-day comparisons but will result in incorrect week-to-month aggregations. A solution to this problem is to build separate hierarchies so that day is related to both week and month.

Since the resolution for the approach is more a business problem than a technical problem, we will avoid the entire topic by excluding week from the model. The following diagram is the normalized data model for a calendar.

Figure 6 Simple time dimension model

As you can see, these solutions are not very complicated. Nor are the tables very complex.

Unbalanced recursive structures


The dimensional table that we have just built is within the normal relational model and is a recursive structure. If we remove the Ln parents (and any descriptive attributes), then this table is in 3rd Normal Form. This table is equivalent to a tree structure. If the structure is balanced then using SQL to retrieve rows and join this to the fact table is very straightforward. However, there are many instances where this structure will not be balanced. The example below shows balanced versus unbalanced trees. The SQL difficulty when traversing an unbalanced tree is caused by the mixing of leaf nodes with non-leaf nodes. In essence, there is no easy way to tell when we are completely at the bottom. Since, in our case, we want to collect the leaf nodes and then join them to the fact table, this would have to be a many-step process (a query sketch appears below, after Figure 7):

1. For the members retrieved so far, select the child nodes.
2. Separate the leaf nodes from the non-leaf nodes. Collect the leaf nodes together.
3. Using the non-leaf nodes, repeat steps 1 and 2 until there are only leaf nodes.
4. Take the collection of leaf nodes and join these to the fact table.

Unbalanced structures are an example of a problem that is best solved by artfully combining the data model and the process model. In this specific example, a good solution is to pre-aggregate the dimension that is unbalanced. We will discuss this in the chapter on aggregation strategies.

Figure 7 Balanced and unbalanced trees
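Where the DBMS supports recursive common table expressions, the multi-step leaf collection above can be expressed in a single statement. This is only a sketch under the assumption of a hypothetical product_dim(member_id, parent_id) table; dialects without recursive queries would need the iterative approach described in the steps above.

    -- Collect the nodes under a chosen member of an unbalanced hierarchy,
    -- then join them to the fact table (hypothetical table and column names).
    WITH RECURSIVE subtree (member_id) AS (
        SELECT member_id FROM product_dim WHERE member_id = 42   -- starting member
        UNION ALL
        SELECT c.member_id
        FROM   product_dim c
        JOIN   subtree s ON c.parent_id = s.member_id
    )
    SELECT SUM(f.sales_amount)
    FROM   sales_fact f
    JOIN   subtree    s ON f.product_id = s.member_id;  -- only leaf members appear in the fact table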

Pattern II: Alternate hierarchies

Figure 8 Pattern II: Alternate hierarchies

The essential difference between Pattern I and Pattern II is that at least two G1 tables have the same Ln descendant (L0 in our pattern): the dimensional structure is proper and complex. That is, at least two paths join at some point. There is no real difference between two paths joining or more than two paths. Since the Ln table is the same, these can be viewed as variations on the same dimensional structure. Since the split is clean there is only one way to go from any generation level to Ln (and then to the fact table), so this remains an acyclic directed graph.

The problem with alternate hierarchies is mostly a problem with the front end; that is, how this is presented to the enduser. Since the access paths are mutually exclusive, you probably need to present them to the enduser as two different paths. Alternate hierarchies are so common that they may be the rule rather than the exception. Almost any company will have different ways of looking at products. Any company that has a fiscal year that is not a calendar year will probably want alternate time hierarchies. Fortunately, these are easy to deal with. The dimension table becomes wider, in order to accommodate the longer path (the sum of the two paths), and hence looks a good deal uglier, but it is still easy to build and supports very fast response times.

Dimension key: Same as before, except that it must contain the members from all paths. This is the foreign key in the Ln-1 table.

Ln Parent: These are the same as before, except that we must include the members from the different paths in different columns. It does muddy up the levels and generations somewhat. You may be tempted to reuse an Ln column, that is, use it for two different hierarchies (after the split in the path). Resist this temptation! It will cause you trouble eventually, and once you start it will be very hard to correct.

Level id: Levels may be counted as before, except that we will have more than one level at some point.

Generation id: Generations are also counted as before. Again, we will have more than one occurrence of the same generation.

Figure 9 Alternate hierarchy dimension table

Alternate dimensional structure


There are clearly two different ways to implement this pattern. The one defined above provides a single, albeit messy, table for all of the alternate hierarchies. This approach becomes untenable for many hierarchies. It is also difficult to add hierarchies dynamically. However, the alternate approach is to create separate tables for each hierarchy. These will all be joined to the fact table at the same dimension key. The advantages here are almost exactly opposite the disadvantages of the single table solution. Each table has a simpler structure and it is a lot easier to add and remove hierarchies.

Time example
Back to the time example: in the previous section, we mentioned that there are three basic hierarchies for time: calendar, fiscal and rolling. These hierarchies are shown below.

Figure 10 Multiple time hierarchies

This solution, now that we have introduced multiple hierarchies, is much more complicated and messy. This is an issue with multiple hierarchies more than with the approaches. Another approach is to develop individual dimension tables. If we took that approach with the combined model, we would have the following dimensional structure.

Figure 11 Alternate time hierarchies

Pattern III: Cyclic paths

Figure 12 Pattern III: Cyclic paths

This is a complicated pattern; it represents an improper dimensional structure. In this pattern, we can see that, although there are alternate hierarchies, they have the same starting and ending point. Readers who are from the healthcare world will note that this is the relation between hospitals, buying groups and the manufacturers. Other examples are common. Any business where there is an intermediate dealer (automobiles, consumer packaged goods) has the potential for this type of structure.

This is more than a simple interface issue. An incorrect query may incorrectly compute the aggregated results. It also confuses the question-answer relation. The resolution is simple to explain, reasonably simple to implement but often hard to live with (from a user's perspective). Since the dimensional data model does not tolerate this type of structure, it must be eliminated. That is, this pattern must be resolved into either Pattern I or Pattern II. Depending upon how you want to implement your database there are different ways of resolving this:

1. Treat them as independent dimensions. Your descriptions to the enduser (documentation as well as front-end training) must be clear. Make it a major point of discussion in your enduser training. Since this complexity is a result of the data and not a technical complexity, the users will be familiar with the issue. Indeed, they may be very helpful with the documentation and training problems.

2. Create alternate views. This is a variation on the first one, except that it has fewer physical tables. The tradeoffs are performance versus physical database complexity. The issues of training and so on remain. A sketch of this approach follows.
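A minimal sketch of the alternate-view resolution, assuming a hypothetical customer_dim table that carries both a buying-group path and a direct hospital path; the names and columns are illustrative, not from the text.

    -- Resolve a cyclic structure by exposing each path as its own view,
    -- so each view behaves like a proper dimension.
    -- Hypothetical columns: hospital_id, buying_group_id, manufacturer_id.
    CREATE VIEW hospital_by_buying_group AS
        SELECT hospital_id, buying_group_id, manufacturer_id
        FROM   customer_dim
        WHERE  buying_group_id IS NOT NULL;   -- path through the buying group

    CREATE VIEW hospital_direct AS
        SELECT hospital_id, manufacturer_id
        FROM   customer_dim
        WHERE  buying_group_id IS NULL;       -- direct hospital-to-manufacturer path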

Pattern IV: Splits and joins

Figure 13 Pattern IV: Splits and joins

This pattern is a combination of Pattern II and Pattern III. It should have the same resolution as Pattern III. That is, create two different structures.

Pattern V: Types & subtypes

Figure 14 Pattern V: Types & Subtypes

Types and subtypes are used in an ER model when the entity is not homogeneous. That is, the attributes of the entity depend upon some type of class membership. We might, for example, classify our sales representatives into two different types: full-line and partial-line. In the sample data warehouse below, we might classify the contracts into classes according to whether it is a contract with a buying group, a hospital organization or an individual hospital. There are two common methods of handling this (in an RDB implementation). First, define a table that contains all of the attributes and only use those that are relevant to the type of the row being processed. Second, for each type, create a separate table that contains only those attributes that are relevant. Neither one of these is very satisfactory. Types and subtypes can be implemented in a data warehouse as a categorical dimension. Since these are discussed below, there is no need for further discussion right now.

Dimensional dependencies
Earlier, we mentioned dimensional dependencies. At that time, we talked about the dependencies of some calculations on the dimension level. The example that we used was Average Selling Price (ASP). To summarize, although ASP makes a great deal of sense at the lowest level of product, it makes less and less sense as we move up the hierarchy. The dependency that ASP has on product is not true for other dimensions: the ASP for an entire year or a geographic region may be very important. This class of dependency is dependent on the processing and needs to be handled either in the load programs or in the dynamic query SQL.

There is another class of dependencies that are purely structural. Product size and color are relevant at the SKU level (possibly even one level higher) but are not at all relevant for the highest levels of the product dimension. Similarly, there may be time data points that are only valid at the quarterly or annual levels. These dependencies are of a more structural nature and hence may be solved in the dimensional model.

There are really two problems with this class of dimensional dependencies: first, making sure that the user does not mix apples and oranges by requesting the product size for the highest dimension levels; second, the physical problem of how to store these in a dimension table.

Denormalized dimension table: Each row in this structure stores the leaf node and all of its parents. In order to deal with the dimensional dependencies, we would have to repeat all of the dependent columns for each row. This would result in a huge amount of repetition. Although redundancy in a data warehouse is not necessarily sinful, this one particular sin becomes gluttony! The requirement to handle dimensional dependencies may force you away from a denormalized implementation.

Vertical dimension table: Remember that this approach implements a recursive structure. Each row consists of a <member, parent> pair as the primary key. However, the other columns must be the same for all levels of the dimension. The only two approaches here are to define the unique columns for each dependency but leave them null where they are not relevant, or to overload the columns so that their meaning depends upon the dimension level. The first approach is hugely wasteful while the second is very fragile and fraught with all kinds of problems.

There is another approach that trades off the problems of wasted space for having additional tables. This approach may be used with either the denormalized or vertical dimension table. We will isolate the columns for each level into their own table. This will reduce the space requirements to the minimum. The downside, of course, is that an extra join will be required in order to retrieve any characteristics. (The impact of this can be reduced by isolating only the dimensionally dependent columns.) There are a number of other advantages, however. This approach makes it much easier to implement multiple hierarchies which retain some (but not necessarily all) of the dimensional dependencies. For example, we could have the three different time hierarchies that we mentioned earlier. If we isolated the daily (lowest level) and yearly (highest level) dependencies then these could be used for all three hierarchies. This last solution continues the list of trade-offs that must be dealt with when designing RDB-based decision support applications. However, it may be just what you need!
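A minimal sketch of that last approach, isolating level-dependent attributes into their own table; the product_sku_attr name and columns are hypothetical.

    -- Isolate attributes that only exist at the SKU level into their own table,
    -- so the main dimension table carries no level-dependent columns.
    CREATE TABLE product_sku_attr (
        product_id INTEGER PRIMARY KEY REFERENCES product_dim (member_id),
        size       VARCHAR(20),   -- only meaningful for SKU-level members
        color      VARCHAR(20)
    );

    -- The extra join is paid only when these characteristics are requested
    -- (member_name is assumed to exist on the hypothetical product_dim).
    SELECT d.member_name, a.size, a.color
    FROM   product_dim      d
    JOIN   product_sku_attr a ON a.product_id = d.member_id;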

Summary
Virtually any data model that you come across will be based upon these patterns. Although the possible combinations are potentially very complex, dealing with them piecemeal and resolving them as described above should result in a valid star architecture.

Informational dimensions
One of the defining rules of the relational data model is that there is no relation between the attributes or columns in a relation. Any such relation is contained in the process model. Regretfully, this eliminates the possibility of creating informational dimensions. Note that this is one of the great strengths of many of the multi-dimensional database management systems (and conversely, one of the great weaknesses of the relational data model). Nevertheless, we will spend a few minutes discussing informational dimensions.

A well-designed fact table will have a single informational dimension. Although many different fact tables may have the same informational dimension, it is not necessary. If you are using an MDB, you should determine whether it will support a single or multiple informational dimensional structures. Some MDBs support only a single informational dimension while others support many. Make sure that a single informational dimension will not unduly complicate your database design. On the other side, you should be aware that supporting many informational dimensions may complicate the understanding of the information, so it is not a simple tradeoff. Perhaps the best example of an informational dimension is a corporate balance sheet or chart of accounts. A sales and marketing (or manufacturing) data warehouse will have an informational dimension, but it is likely going to be much simpler than the informational dimension of a financial data warehouse.

The RDB concept of the informational dimension is embodied in the fact table. We have previously asserted that the fact table should contain a single informational dimension. Is this a criterion or a convenient design? The meaning of multiple informational dimensions is that the structural dimensions that define access to the information are not the same. For example, if we built a warehouse containing both manufacturing and sales information, some of the information would be dimensioned by plant and some by customer. It is possible to place these into the same fact table and deal with it by types & subtypes. If your reaction to this is "Wow, that's weird," that's because it is. A better design would be to have separate fact tables, with some common dimensions.

Modeling informational dependencies


As we mentioned earlier, some facts will not make sense across all levels of some dimensions. The example that we gave was Average Selling Price. There are many others. One set of such facts are those that need to be 'balanced' across the time dimension, for example, period beginning and ending inventory levels. For these facts, we want to take the value of the first (and last, respectively) time child value. Hence, the quarterly beginning inventory level is the inventory level of the first month, and the ending level is that of the last month. How should these be modeled? The first issue (dimensional dependencies) is pure data modeling but the second is a mix of data and process modeling.

Dimensional dependencies can be easily modeled if the dependencies are orthogonal. Back to our ASP example... the ASP (at the leaf level of product) makes sense across all other dimensions: geography, time, sales organization, whatever. So this dependency is orthogonal. We can model this dependency in the normalized business model as we would any other attribute. ASP is simply not defined as an attribute of the higher levels of product.

What if the dimensional dependencies are not orthogonal? If we follow a strict normalized model (as we are advocating here) this would require that the model contain all of the dimensional combinations that define these dependencies. This may be a rare circumstance. In fact, I cannot think of any examples! We may be saved by the fact that this doesn't happen very often. That is, we can follow the normalized model, which will be simple enough due to the small size of the problem.
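As a sketch of the process side mentioned above, the query below computes quarterly beginning inventory as the first month's value. The inventory_fact table and its columns are hypothetical; the time_dim layout is the one sketched earlier.

    -- Quarterly beginning inventory taken as the first month's value.
    -- Hypothetical tables: inventory_fact(month_id, product_id, begin_inventory)
    -- and time_dim(time_id, l1_parent), where l1_parent is the month's quarter
    -- and month_id values increase chronologically.
    SELECT t.l1_parent       AS quarter_id,
           f.product_id,
           f.begin_inventory AS quarter_begin_inventory
    FROM   inventory_fact f
    JOIN   time_dim t ON t.time_id = f.month_id
    WHERE  f.month_id = (SELECT MIN(t2.time_id)           -- first month of that quarter
                         FROM   time_dim t2
                         WHERE  t2.l1_parent = t.l1_parent);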

Categorical dimensions
Categorical dimensions are typically used in support of some analytical process. One of the main distinguishing characteristics is that they are usually related to an attribute, rather than to a primary key. For example, income bracket is related to customer income. In a very real sense, categorical dimensions are often dimensions of dimensions. Categorical dimensions help us deal with the issue of attributes of dimensions.

Some of the oddest things have to be categorized. For example, automobile paint color may be a very complex category. Someone who is in charge of planning plant capacity may need to analyze trends in order to evaluate utilization of the paint drying facilities (a large and expensive piece of equipment). They may also need to plan for new drying facilities. Since the auto industry probably has many different shades and tints of red, and some may require different drying facilities, the ability to categorize sales by color can become a problem with expensive parameters. Another category is customer size. This may be determined by annual sales, potential sales, how tall they are, incomes, debt levels and so on. One of the primary uses of data mining is the categorization of customers (for example) into categories that no one knew existed before.

It is common for a categorical dimension to be used in many queries, especially if you have different aggregate fact tables. If you have customer size, you will most likely want to see sales by customer size and product. (What are your heavy hitters buying?) This has many of the problems of partitioning dimensions, which are dealt with next. Whether you create categorical dimensions or leave the categorization process to a query depends upon the complexity of the categorization and how frequently it needs to be analyzed. The paint question may be very important but may not be asked very often. Hence, the cost of answering the question in anticipation (pre-aggregation) may not be justified. However, customer categories may be very complex (even moving into data mining) and the results may provide fruit for a great many analyses.

The major question is what is the lowest level where the type/subtype starts and what is the highest level where it no longer makes sense? In the example of full-line and partial-line sales representatives, the lowest level is sales representative, but the highest? Even if all sales representatives ultimately report to the same sales area manager (i.e. the type is no longer organizationally defined) it may still make sense to separate full-line and partial-line at that level. It often makes sense across other dimensions as well. If two sales representatives can sell the same product, then it certainly makes sense to review product performance by sales rep type. This is usually an analytical, not structural, consideration. The type/subtype attribute is clearly in the base table; however, in order to avoid turning this into a snowflake (see below) we will need to include it in the dimensional table as well. Hence, we will add another column that contains the subtype classification values. Since we may do both selection and reporting by this attribute, it may (or may not) make sense to include it as part of the composite key. If you pre-aggregate this dimension, then it must be included as part of the aggregate key. If there are several categorizations that make sense at different levels (which must be mutually exclusive) then you may place them into the same column. This is possibly a dangerous solution, so it should be avoided. The exclusivity requirement may be relaxed in the future, and changing the implementation to accommodate the data will be very expensive.

Modeling categorical dimensions


The decision to create a categorical dimension is the result of the dimensional analysis. Hence, they typically do not appear in the business model but rather in the dimensional model. Aside from that, you should follow the standard data modeling rules to determine foreign key dependencies, attributes and so on. This will place them within the correct context in the dimensional model. For example, if we are analyzing customers by income brackets, then this is a categorical dimension that is related to the customer's income.
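A minimal sketch of deriving such a dimension from the customer's income attribute; the bracket boundaries follow the sample figure below, while the table and column names are hypothetical.

    -- Build an income-bracket categorical dimension from the customer's
    -- income attribute (hypothetical tables; bracket labels are illustrative).
    INSERT INTO income_bracket_dim (customer_id, income_bracket)
    SELECT customer_id,
           CASE
               WHEN income <  25000 THEN 'Bracket #1: 10,000 -> 25,000'
               WHEN income <  50000 THEN 'Bracket #2: 25,000 -> 50,000'
               WHEN income < 100000 THEN 'Bracket #3: 50,000 -> 100,000'
               WHEN income < 250000 THEN 'Bracket #4: 100,000 -> 250,000'
               ELSE                      'Bracket #5: 250,000+'
           END AS income_bracket
    FROM   customer_dim;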

Mixed categorical dimensions


One of the advantages of creating categorical dimensions is that it is possible to support very complex (i.e. expensive) analytical models. As an example, consider customer segmentation. Suppose we want to study product sales by various customer demographics; say, income, marital status, sex and age. Creating a query to analyze these would be quite complicated and expensive to execute. However, if we created a categorical dimension that was based upon these demographic measures, the resulting queries would be easier to create and cheaper to execute. The sample dimensional structure for this example (not all levels are filled in) nests income bracket, then marital status, then sex, then age band, with individual customers as the leaf members:

Income Bracket #1: 10,000 -> 25,000
  Divorced
    Male
      Under 21; 21 -> 35; 35 -> 55; 55 -> 65; 65+
    Female
      Under 21
        Customer Id 1, Customer Id 2, Customer Id 3
      21 -> 35; 35 -> 55; 55 -> 65; 65+
  Married
    Male, Female (same age bands)
  Single
    Male, Female (same age bands)
Income Bracket #2: 25,000 -> 50,000 (Divorced, Married, Single as above)
Income Bracket #3: 50,000 -> 100,000 (Divorced, Married, Single as above)
Income Bracket #4: 100,000 -> 250,000 (Divorced, Married, Single as above)
Income Bracket #5: 250,000+ (Divorced, Married, Single as above)

Figure 15 Mixed categorical dimensions

You can see how this dimensional structure would support very complex queries without the associated expense. It is also relatively easy to build the dimensional structure from a single query. Since the leaf members for the structure are customers, which are already in the fact table, this structure can be built without modifications to the fact table. In fact, many different versions of them can be built as required and dropped when no longer needed. This flexibility is one of the great advantages of data warehouses built using an RDB.
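As a sketch of "building the structure from a single query", the statement below derives all four demographic levels at once from a hypothetical customer_dim table; the names and boundaries are illustrative, not from the text.

    -- Build the mixed demographic dimension in one pass over customer_dim
    -- (hypothetical table and column names).
    INSERT INTO customer_demo_dim (customer_id, income_bracket, marital_status, sex, age_band)
    SELECT customer_id,
           CASE WHEN income <  25000 THEN 'Bracket #1'
                WHEN income <  50000 THEN 'Bracket #2'
                WHEN income < 100000 THEN 'Bracket #3'
                WHEN income < 250000 THEN 'Bracket #4'
                ELSE                      'Bracket #5' END,
           marital_status,
           sex,
           CASE WHEN age < 21 THEN 'Under 21'
                WHEN age < 35 THEN '21 -> 35'
                WHEN age < 55 THEN '35 -> 55'
                WHEN age < 65 THEN '55 -> 65'
                ELSE               '65+'      END
    FROM   customer_dim;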


Partitioning dimensions
This type of dimension demonstrates another weakness with current data modeling tools. What we want to happen is very easy to explain; how we implement it is rather tedious but not difficult. However, including these in the model makes it look awful. For example, if we wanted to track plans vs. actuals in the database, we would create a partitioning dimension. This foreign key would appear in every table where we wanted to track plans vs. actuals. If we have a single fact table that contains all of the aggregate information as well as detail, there is no problem. However, if we create individual aggregate tables, then the scenario foreign key must appear in each one. This makes the model look awful. Regretfully, these are very common in data warehouses, especially those that are developed to monitor company performance.

One of the advantages of multi-dimensional databases is the ease with which they can create and use partitioning dimensions. For example, it is common to see the structure of time separated from the repetition of time. That is, we may see one dimension that is simply the structure of a year: month -> quarter -> half -> year. And a separate calendar dimension that contains the different years: 1996, 1997 and so on. Adding a new year to the database is simply a matter of adding a new member to the calendar dimension. Adding a new year to a relational database usually requires that each month, quarter and half also be added. There are disadvantages to this approach that we will leave for another discussion. Calculating performance in relation to the prior year is now simply a matter of performing a calculation on the two members. Indeed, an MDB with a powerful calculation engine will allow you to perform calculations on any members, so that prior year variance may be calculated as follows: 1997 Prior Year = 1997 - 1996.

We have called these partitioning dimensions because they are an obvious candidate for creating partitions in the data warehouse. That is, if your fact tables are large enough to warrant splitting them, then an obvious choice is to use the partitioning dimension (if one is present).

Informational vs. partitioning dimension


One of the common errors made when modeling a star architecture database is the incorrect placement of data points into partitions: confusing informational and partitioning dimensions. If you have a series of attributes that have column names like: Actual Sales, Budget Sales, Forecast Sales, Actual Units, Budget Units, Forecast Units then you need to have three partitions: Actual, Budget and Forecast (members of the same dimension) and the informational dimension contains only Sales and Units. Adding a fourth partition (say, Reforecast) is now a trivial matter of insertion. It is less clear what to do with derived data points. For example, if we want to track a three month moving average, we might create the columns: average sales and average units. Already you can see the advantage of having Actual, Budget and Reforecast in different partitions. That is, you can calculate the three month moving average for all partitions as easily as you can for any partition. Including partitions that no one has thought of yet! However, even the average should be a partition. So now we have four: Actual, Budget, Reforecast and Three Month Average. The reason for this approach is flexibility. If someone wants a Three Month Average, they may very well want a Six or Twelve Month Average. Adding them is simply adding a partition: a row insertion (into the scenario dimension table) and a new SQL statement.

There are other examples. If you have Actual and Budget, you will definitely want Actual vs. Budget (differences as well as percentages). And if you have Reforecast, you will want Actual vs. Reforecast and Budget vs. Reforecast. Plus Three Month Averages of all of them. As you can see, creating columns for all of these will become a D.B.A. nightmare. However, if you make them a partitioning dimension, your life becomes much easier. In summary, the guideline is that if you have data points that are described as an adjective of your base data (e.g. Actual Sales), they should probably be a partitioning dimension.
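A minimal sketch contrasting the column-per-scenario layout with the partitioning-dimension layout; the table and column names are hypothetical.

    -- Column-per-scenario layout (a DBA nightmare as scenarios multiply):
    --   sales_fact(time_id, product_id, actual_sales, budget_sales, forecast_sales, ...)
    --
    -- Partitioning-dimension layout: one scenario key, scenarios are rows.
    CREATE TABLE scenario_dim (
        scenario_id   INTEGER PRIMARY KEY,
        scenario_name VARCHAR(30)          -- 'Actual', 'Budget', 'Forecast', ...
    );

    CREATE TABLE sales_fact_p (
        time_id     INTEGER NOT NULL,
        product_id  INTEGER NOT NULL,
        scenario_id INTEGER NOT NULL REFERENCES scenario_dim (scenario_id),
        sales       DECIMAL(12,2),         -- informational dimension is just Sales and Units
        units       INTEGER,
        PRIMARY KEY (time_id, product_id, scenario_id)
    );

    -- Adding a new scenario is a row insertion, not a schema change:
    INSERT INTO scenario_dim (scenario_id, scenario_name) VALUES (4, 'Reforecast');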

Alternate dimensional designs


The structure for the dimension tables given here is not the only choice. There are several others. A common approach is to normalize the dimensional structure so that the different levels are in different tables. Perhaps the main reason for doing this is space conservation. If the dimension table is millions of rows then the space implications are significant. We believe that this is not a consideration of the dimensional model, but rather of the physical model (see the next section). There are other approaches that are variations of this theme. They are referred to as snowflake design, outrigger tables and other terms.

Slowly changing dimensions


This is a subject raised by Ralph Kimball in his book The Data Warehouse Toolkit. There are two types of slowly changing dimensions:

1. Structural changes. This type of change occurs when the foreign key is changed. Product and territory realignments are good examples.

2. Attribute changes. This type of change occurs when one of the attributes of the dimension member changes. Marital status is the example discussed in Dr. Kimball's book.

Structural changes
Structural changes are controlled. The current conventional wisdom is that the value of the historical structure rapidly diminishes with time. Hence, there is usually little need to retain them. Conversely, there is little complexity in retaining them. Creating them as an alternate hierarchy will almost always be satisfactory. My belief is that the current disinterest in history only masks the real issue. Realignment analysis requires that we support multiple structures. If we can (easily) do it for realignment, we can do it for history. This is primarily a cost-benefit analysis.

Attribute changes
These are more difficult to deal with. They happen sporadically and without predictable timings or limits. The value of tracking them is almost always very difficult to predict. However, if it is possible to track these (and retain the history) then this should be explored with the users in great detail. Remember the earlier discussion on the questions that we can ask versus the questions that we should ask. The ability to perform event-based analysis can dramatically increase the ROI of the data warehouse. However, in order to be able to do that, you need to track the different events! This is a fancy term for tracking the historical changes.

Implementing slowly changing dimensions


In his book, Dr. Kimball proposes three types of solutions:

Type I

In this solution, we overwrite the changed attributes and do not keep track of the history. This reduces the set of answers that we have available. Again, this may be good enough. Technically, this is the simplest solution.
Type II

In this solution, we create a new record that includes all of the attributes, including the changed values. Essentially, this solution will let us create 'snapshots' of the values at a point in time. This may allow us to track all of the historical changes but introduces a level of complexity into the data warehouse that is wholly unjustified. Note that this approach is actually quite common in the insurance industry, where an insurance policy might have to be re-created.
Type III

Add additional attributes to the entity. These would track some amount of historical change, say current and previous. If you implement this approach, don't forget to include the date of change. This has some obvious limitations.

Suppose that we look at changing dimensions from a strict data modeling view. If marital status is dependent upon time, then it should be placed into a separate entity: Historical Marital Status. The composite key would be Customer Id and Time Id. The attribute would be marital status. Each attribute that has to be historically tracked would be dealt with this way. Remember when we said that one of the disadvantages of normalized form was that we might create many small entities, which the users might logically think of as one larger entity? Here is an example of that! However, there may be a compromise solution. If we kept the current marital status in the Customer entity and placed the previous values into the Historical Marital Status entity, then we would have the current status (the most popular one) available without performance compromise and the historical values when they were needed. The storage requirements should be relatively modest. The only queries that would be impacted are the ones that should be impacted. Hence, we propose a 4th type:

Type IV

Place the historical values in a special entity. This would be used for the special analytical requirements. This historical table should contain both the starting and ending dates for the period. Although only one of these is necessary (the other can be inferred), having both simplifies some of the queries. A sketch of such a table follows.
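A minimal sketch of the Type IV historical entity described above; the table and column names are hypothetical.

    -- Type IV: keep the current value on the Customer entity and the history
    -- in a separate table with both start and end dates (hypothetical names).
    CREATE TABLE historical_marital_status (
        customer_id    INTEGER NOT NULL,
        start_date     DATE    NOT NULL,   -- first day this status was in effect
        end_date       DATE    NOT NULL,   -- last day this status was in effect
        marital_status VARCHAR(10),
        PRIMARY KEY (customer_id, start_date)
    );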

Enduser considerations
Perhaps more difficult than the technical problem is the problem of presenting these to the user in a meaningful way. The example question that we had before: "What did married people buy last year?" is a perfect example. This is a really innocuous question! But imagine translating it into an executable query. We would probably use a subquery that selected the people who were married last year and then select their purchases. Seems straight-forward enough. However, now imagine explaining that to the users or having some query tool translate it. Much more difficult. This type of complexity is not something that we should force on our users. However, we should not use this as an excuse to avoid the problem.

Trend analysis
The normal use of the term trend analysis refers to a time-series analysis of the facts. However, the implication of slowly changing dimensions is that we must also consider the change in the dimensions. That is, trend analysis must also refer to trends in the dimensions and how this impacts the selection of the facts.

Let us return to the marital status example. Suppose we are studying the long-term buying habits of married versus single people and we do not have the historical changes. As the length of the time period increases, so will the inaccuracy of the answer. Although this is the simplest model to implement, it fails to provide the level of accuracy that we may require. Suppose however, that we do have the marital status changes. What does this mean? The normal SQL join is based upon selecting dimension values that are independent of each other. Suppose we are selecting market, customer, time and product. The join process will select the rows based on market, then keep the rows based on customer, then time, then product. Our example radically changes this, since we now want to select a time period that changes from customer to customer. Indeed, a given customer may have two non-contiguous time periods where they were married. Rather than performing a series of single-column joins between a dimension table and the appropriate key column in the fact table, at least one join will be a multi-column join based upon a join of multiple dimensions.

In the marital status example, we would create a query that would select the appropriate customers and their time periods. These time periods would be based upon the historical marital status. This is then joined to the fact table (along with market and product, as required) to select the fact rows. In the figure below, the 1st query produces the temporary table "Selected Customers". The 2nd query produces the final results.

Figure 16 Slowly changing dimensions - query example

The 1st query can be a subquery if your SQL supports a multi-column subquery - most don't. It is more likely that this will have to be multiple queries, with a temporary table. Note how this complicates the data warehouse. Yet, it may be unavoidable. Note how this process drives in from the dimension tables. It is also possible to answer this question by driving out from the fact table. However, this approach does not appear to eliminate the temporary table and may severely impact performance. For these reasons, we do not discuss it here but leave it for the casual reader to figure out!
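A hedged sketch of the two-step process in Figure 16, reusing the hypothetical historical_marital_status table from earlier and assuming the fact table carries a sale_date column (or, equivalently, a join through the time dimension).

    -- 1st query: select the customers and the time periods during which
    -- they were married, into a temporary table (hypothetical names).
    CREATE TEMPORARY TABLE selected_customers AS
    SELECT customer_id, start_date, end_date
    FROM   historical_marital_status
    WHERE  marital_status = 'Married';

    -- 2nd query: join those (customer, period) pairs to the fact table.
    SELECT SUM(f.sales_amount)
    FROM   sales_fact f
    JOIN   selected_customers s
      ON   f.customer_id = s.customer_id
     AND   f.sale_date BETWEEN s.start_date AND s.end_date;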

Physical model
Each of the models in the (now) four model architecture serves a particular purpose. The physical model is often necessary to achieve performance objectives. If we had computers of infinite speed and resources, there is no doubt that we would implement the business or dimensional model. Regretfully, this is not the case! Today's machines are seriously over-taxed when confronted with large scale data warehouses and the corresponding queries. Hence, a good physical model may often be the difference between success and failure of the data warehouse. With equal regret, we state that the physical model is often dependent upon a particular DBMS or hardware configuration. This is, after all, what we mean by physical model! The guidelines that we will give here are as general as we can make them without focusing on specific DBMS or hardware products.

General objectives
With rare exception, data warehouse performance will be limited by input-output processing. Remember, however, that most performance problems are design related. Sometimes the design issues are contained in the DBMS and sometimes in the particular data warehouse. If you have significant performance problems (those that need to be cured by reductions of orders of magnitude) review your physical model.

Some general myths


One generally accepted approach is to put the indexes and tables on different controllers so that the possible contention is reduced. This is the correct conclusion if there is only a single user. However, as soon as the number of users reaches a certain level, the argument no longer holds. If the access activity on the index and table is essentially random, the contention considerations of a particular query are overwhelmed and rendered irrelevant. Hence, the allocation process should be to interleave the indexes and tables and spread them across the devices so as to achieve even access under load. Since this may not be possible to predict, the DBMS that you select should make it easy to stripe the data warehouse across devices at any time. Although this may require moving around some rather large operating system files, it should not require modifications to anything other than file allocation information in the database. It certainly should not require unloading and reloading the complete database. Once you have a 500 gigabyte data warehouse in place, it isn't going anywhere; at least, not in one piece.

Parallel processing
The value of parallel processing is reached when we can decompose a single operation or query into parallel streams and execute them simultaneously. That is, we distinguish between parallel processing and multi-threading. It is easy enough to decompose an SQL query along the leading edge of an index. If that index corresponds to the physical sequence of the table, then we can execute the parts of the query at near disk speeds. However, parallelization along the trailing edge is vastly more difficult. And since the relation between the logical access (index) and physical access (disk layout) may be zero, the performance may actually be worse than single-thread processing. In general, if the type of processing that you will expect to perform will benefit from parallel processing, make sure that the DBMS you use will allow you to physically structure the disks so that parallel processing is not limited to the physical sequence of the rows.

Different types of parallel processing


There are essentially three different types of parallel processing: symmetric multiprocessing (SMP), massively parallel processing (MPP) and clustering. The difference between SMP and MPP is what is shared. In SMP, the processors share a common memory, while in MPP the processors don't share anything. SMP machines are easier to control while MPP machines scale much farther. Clusters are groups of machines joined through some type of high-speed, network-like connection. The advantage of parallel processing (in general) is that a lot of cheap, slow processors are faster than a small number of expensive, fast processors. The relative advantages of SMP versus MPP are best left to another discussion. However, the advantages of clustering are worth discussing.

The range of query loads in a data warehouse is truly astounding. Some queries take micro-seconds and some take hours. Some complex (but required) reports may perform hundreds of queries that cover the spectrum. The data load process itself is usually an infrequent process that is composed of a small number of very large, complex processes. Clustering will allow you to create a large cluster to process unusually high demand periods. This might be the data load process, or it might be the peak that one sees at quarter end. Clustering should probably be a feature that you require in both your server and your DBMS.

Basic considerations
This section summarizes some approaches that we assume will always be in place.
6. Synthetic keys. Textual keys should be replaced by numeric (usually integer) synthetic keys. This can have a significant impact on the storage required for both the data and the indexes. We don't care about the disk storage as much as the time to process it: search times for indexes composed of integer synthetic keys will be much faster than for indexes composed of textual keys.
7. Transitive dependencies. The attributes of a strong entity are transitively dependent attributes in the relations of all weaker entities. For example, an attribute of product group is transitively dependent in the product subgroup relation. The process of normalization removes these attributes from the weaker entities and places them in the stronger entities. However, it may be beneficial to denormalize them back into the weaker entity to improve performance. We might, for example, place the name of the product group in the product subgroup entity in order to avoid a join to the product group table simply to retrieve the group name. The casual (but astute) reader will note that the dimension tables described above follow this process to its complete conclusion. A sketch of such a denormalized table follows.
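To make item 7 concrete, here is a minimal sketch of a denormalized subgroup table; the table and column names are mine, not part of any particular model discussed here. The integer subgroup_id is the synthetic key, the original textual code is kept as an ordinary attribute, and the group name is carried down so that queries do not have to join to the group table just to print it.

create table product_subgroup (
    subgroup_id    integer      not null,   -- synthetic (integer) key
    subgroup_code  varchar(20)  not null,   -- original textual key, demoted to an attribute
    subgroup_name  varchar(60),
    group_id       integer      not null,   -- synthetic foreign key to product_group
    group_name     varchar(60),             -- transitively dependent column, denormalized here
    primary key (subgroup_id)
);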

Structural changes
The basic tenet of i/o performance improvement is to reduce the number of read operations. Although this sounds obvious, it can be quite a balancing act. Suppose that we have a categorical dimension (income) with enough access that we are justified in treating it structurally (rather than dynamically). One possibility is to leave the income data in the fact table alone and always access the categories through a categorical dimension. We can improve the performance by adding the categorical dimension key to the fact table. Do not remove the base income fact, however. This increases the size of the rows but means that we will have a much simpler join (and a faster query), as sketched below.
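As a minimal sketch of that structural change (the names sales_fact, income and income_band are hypothetical), the categorical key is added alongside the base income fact and populated from the bracket definitions; after this, most category queries become a simple equi-join or group by rather than a range evaluation over every row.

alter table sales_fact add income_band_id integer;

update sales_fact
   set income_band_id = (select b.income_band_id
                           from income_band b
                          where sales_fact.income between b.low_income
                                                      and b.high_income);
                          -- assumes the income brackets do not overlap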

Indexing and join changes


A number of indexing schemes have been developed to satisfy the different requirements of DSS and data warehousing. Note that the terms used here may differ from those used by a particular vendor.

Bit-mapped indexes. This index creates an array where the columns are the domain of the key and the rows correspond to the rows of the table. Each value in the array is an on/off bit that indicates whether that row is pointed to by that index value. For example, if we indexed Marital Status, we might have three values in the domain; each row would carry a 0 or 1 for each value to indicate the Marital Status of that row. Processing a bitmap index is very fast. If we have multiple values that we are searching for (an in-list), the DBMS can simply perform OR/AND operations on the bitmaps as required. The issue is the cardinality of the domain. Some DBMSs support only low-cardinality indexes, limiting their use to domains of a few hundred values. This is next to worthless. The DBMS that you select should support high-cardinality domains, up to as many as 10,000 or more values. Interesting sidenote: a bitmap index (constructed this way) will support multiple hits; that is, a single row may have multiple index entries. Although this is also possible with other indexing schemes, it is much easier with bitmap indexes (there is no loss of integrity constraints). I find this very interesting, although I have yet to think of a meaningful use! Technically, this is usually not something that you can apply after the fact; the DBMS must include it up front.

Star joins. The common (pair-wise) join works its way in and out of the tables specified in the query. If your query selected products, by market, by time, the DBMS may join the fact table and time, then refine that by products and then by market. At each join, the DBMS will add the columns it needs from the dimension tables (including descriptions, perhaps). These are then included in all of the group by/order by processing. A DBMS designed with DSS in mind will follow a much different join process. The dimension tables are searched first. This results in a list of index entries. Then the fact table is processed. Once the rows have been retrieved, the group by/order by processing is performed. Finally, any remaining information (descriptions) is retrieved from the dimension tables. This approach minimizes the number and size of the rows that are sent through each intermediate step in the query processing. It can produce enormous reductions in response time.
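For what it's worth, creating the bit-mapped index described above is typically a one-line statement. The sketch below uses Oracle-style syntax and hypothetical names; other DBMSs expose bitmap indexes differently (or not at all), which is exactly the evaluation point made above.

create bitmap index customer_marital_status_ix
    on customer (marital_status);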

General DBMS performance


An engine optimized for one purpose will always perform better at that purpose than some other engine. This is true for OLTP vs. DSS applications, just like it is true for NASCAR vs. NHRA engines. An important example of the difference between engines is how the joins are performed and when the engine retrieves information from the dimension tables. A typical OLTP approach is to make a guess as to which join will have the largest reduction in the number of fact table rows and use that first. Then use this list to drive out to the other dimension tables. In brief, this approach bounces between the dimension tables and the fact table. As each dimension table is visited, any columns referenced in the query are extracted. This is done whether or not they are part of the selection criteria. This approach means that a dimension table will have to be visited only one time, but that the fact table will be visited many times. An additional inefficiency is that all columns will be dragged through any group by/order by operation, even if they are not needed. This is a good approach for an OLTP application: the queries are likely to be very small and few queries will perform group by/order by operations. A data warehouse is almost the exact opposite of this. Most queries will be quite large and will perform group by/order by operations. Hence, an engine that is designed for a data warehouse should visit all of the dimension tables first and develop the complete fact table selection criteria. Then the rows should be selected from the fact table. Finally, the dimension tables should be revisited to extract the reporting columns. This approach may hit the dimension tables twice (at most) but it hits the fact table only once. Since the dimension tables are usually many orders of magnitude smaller than the fact table, this is a much more efficient approach. A hand-written approximation of this dimension-first strategy is sketched below.
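The sketch below (hypothetical table and column names) expresses the dimension-first strategy directly in the query: each dimension is reduced to a key list, the fact table is hit once for the selection and the group by, and descriptive columns are fetched afterwards for only the surviving rows.

select f.product_id, f.market_id, sum(f.sales_amount) as sales_amount
  from sales_fact f
 where f.product_id in (select product_id from product  where brand   = 'BrandX')
   and f.market_id  in (select market_id  from market   where region  = 'East')
   and f.time_id    in (select time_id    from time_dim where quarter = '1996Q1')
 group by f.product_id, f.market_id;

-- product names, market names and so on are then retrieved from the
-- dimension tables for the grouped rows only, rather than being dragged
-- through the group by/order by processing.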

Partition & Aggregate tables


Partition dimensions are usually dictated by the business requirements. There are two different approaches, with subtle variations on the theme. First, include the partitions in the base fact table. Second, place the partitions into different fact tables. The creation of aggregate tables follows similar logic, so we will discuss them together. Aggregate tables, however, are usually created for pure performance reasons. Under the objective of "simpler is better," placing the partition and aggregate data in the base fact table makes a lot of sense. There are significant performance improvements that can be gained from this. However, most of the performance improvements are found during the creation process. If the resultant table is large enough, it will reduce query performance.

The tradeoff here is load processing versus query processing. If your data warehouse is physically small but structurally and/or computationally complex, there may be a very strong argument for placing the partitions and aggregates in the base fact table. This approach is discussed in more detail in a different publication.

Categorical dimensions
Category analysis can be a very large performance drain on the system. Since categories are usually brackets, evaluating them on a large scale requires a substantial amount of CPU resources. The dependency on an attribute (rather than an indexed foreign key) may require full-table scans for some queries. The most effective way to improve this is to place the category either in the dimension table or in the fact table itself. The tradeoffs here are an improvement in performance versus a loss of flexibility. If the categories are fairly static and the analysis fairly frequent, then this should be a good choice. However, if the categories change or the analysis is infrequent, consider performing the categorization dynamically. Another advantage to dynamic categorization is that it is possible to maintain many categories. We could, for example, have several different methods of categorizing income. Placing the categories into the dimension table or the fact table restricts that flexibility.
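Here is a minimal sketch of a dynamic categorization, assuming (as in the structural example earlier) that income is carried on the fact table; the bracket boundaries and names are illustrative only. Changing the categorization, or keeping several competing ones, is just a matter of editing the case expression.

select income_band,
       sum(sales_amount) as sales_amount
  from (select case
                 when income < 25000 then 'low'
                 when income < 75000 then 'middle'
                 else 'high'
               end as income_band,
               sales_amount
          from sales_fact) banded
 group by income_band;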

Connectedness and complexity


It would be nice to have some quantitative measure that would allow us to discuss the complexity of one data warehouse versus another. Certainly, we should be able to discuss the complexity of one design versus another. In this section, we will discuss some measures of complexity. I should warn you that this section, more so than the others, is a work in progress; which basically means that what you see here are my thoughts at the moment.

In a connected graph, every node can be reached from every other node. If we have designed a valid star architecture, then it will be connected. That is, every dimension (which implies every level in each dimension) is connected to every other dimension. Note also that, by extension, a composite star architecture is also connected. The advantage of a connected database is that any question (about the subject matter!) can be answered. If the database is not connected, then there are questions that cannot be answered. In fact, a disconnected database is a likely candidate for being split into two data warehouses, or data marts.

We can use the concept of connectedness to measure the complexity of a data warehouse. In order to develop this measure, we will start with an adjacency matrix, which measures how many nodes are directly connected. The adjacency matrix is an array with both the rows and columns consisting of the nodes in the graph; or, in our case, the entities in the ER model. In the Retail Model (discussed in detail below), the entities are Sales History, Day, Week, Month, Quarter, Year, SKU, Package Size, Brand, Sub-Category and Category.

Figure 17 Retail Model Adjacency Matrix

The partial Health Care matrix covers Sales History, Group Members, Customer, Hospital, Hospital Org, Contract, Buying Group, Buying Group Org, Territory, District and Region.

Figure 18 Partial Health Care Adjacency Matrix

The adjacency matrix is constructed by placing a one in the { row, column } cell where row is an entity that contains a foreign key relationship to column. Hence, we will find a 1 in the cell { Day, Week }. We can use the row totals to determine the number of foreign keys in an entity. In the retail model, the only row with a total greater than 1 is the Sales History fact table. Conversely, the column totals indicate the number of tables in which that entity appears as a foreign key. In a valid star architecture, the only row with a total greater than 1 will be the fact table; its total should be the number of dimensions connected to the fact table. There should be no columns with a total greater than 1. In the Health Care Adjacency Matrix, we notice that there are both rows and columns with totals greater than 1. This indicates the presence of some cyclic paths. In particular, there are two paths between Hospital Org and Sales History. As we have seen in the discussion above, the cyclic paths need to be resolved into acyclic paths before this will be a valid star architecture.
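If the foreign-key relationships are captured as a simple edge list, the row and column totals described above fall out of two group bys. This is a hedged sketch with made-up table and column names, not a feature of any particular tool.

create table fk_edge (
    from_entity  varchar(40) not null,   -- the entity holding the foreign key (the matrix row)
    to_entity    varchar(40) not null    -- the entity being referenced (the matrix column)
);

-- row totals: number of foreign keys in each entity
-- (in a valid star architecture, only the fact table should exceed 1)
select from_entity, count(*) as fk_count
  from fk_edge
 group by from_entity;

-- column totals: number of entities in which this entity appears as a foreign key
-- (in a valid star architecture, no entity should exceed 1)
select to_entity, count(*) as referenced_in
  from fk_edge
 group by to_entity;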

Complexity
One additional measure that can be derived from the adjacency matrix is database complexity. There are several different ways of looking at a data warehouse and talking about its complexity. Sheer size is, of course, the obvious one; and perhaps the most popular. Another is the degree of connectedness. Intuitively, a database that has only simple paths connecting the entities should be much easier to understand than one that has many paths, especially if they are cyclic. However, since our design objective is to turn the database into a valid star architecture, and that is, by definition and design, very simple, we need to measure this type of complexity on the original normalized ER model. Since we are measuring complexity, we would like the metric to reflect that; hence, the more complex the database is, the larger the metric should be. It still seems that we should not define a metric in terms of the raw number of paths, however. A database with two dimensions that have several cyclic paths should be considered more complex than a database with six dimensions and no cyclic paths. In particular, the measure should not respond linearly to increases in complexity. Intuitively, it seems that a database with 100 tables is more than twice as complex as a database with 50 tables. A database that is totally connected (every entity can be accessed through some path to every other entity) will have (n² - n)/2 connections; this is a completely filled adjacency matrix with the diagonal removed and the connections counted only once. Since the paths in a relational database are actually bi-directional, we can use the same expression, (p² - p)/2, to measure how connected the database actually is, where p is the number of paths. Our complexity measure is the ratio of these two, multiplied by 100:

complexity = ((p² - p) / (n² - n)) * 100

where n is the number of entities and p is the number of edges connecting them. Note that the divisions by 2 simply cancel each other. The normalized health care model contains 19 tables with 18 paths; the star model contains 9 tables with 8 paths. The complexity measures are 89 and 78, respectively. These are relatively close, indicating that we simplified the model only somewhat. Just for comparison, the retail normalized model has 20 tables and 19 connections, while the star model has 5 tables and 4 connections. These have complexity measures of 90 and 60, respectively. This supports our belief that the retail star model is much simpler than the normalized model. And that, although the normalized models are roughly the same, the retail star model is significantly simplified. Although it seems that this measure makes intuitive sense, we will need to validate it against many real-world examples.

Other complexity measures


Since a data warehouse consists of more than just the data structure, we would need several other metrics in order to describe the complete complexity of the data warehouse. These would include some metric that measured the load/aggregation process and some metric that measured the query process. Note that these metrics might assist us in the analysis of when to pre-aggregate and when to dynamically aggregate. I have given some thought to these metrics and may include them at some future time.

Supporting tools
Data modeling tools
At the time of the original publication (early 1996) there were many tools on the market that did a wonderful job of modeling the structural dimensions. This is, after all, the sine qua non of relational databases. If you have a complex informational dimension, they should also be capable of modeling that. Keep in mind, however, that this is a foreign concept to the relational model. Since that time, a number of the leading tools have started to ship versions that purport to handle data warehouses. However, they continue to remain grounded in the relational data model and, hence, are still unable to distinguish between structure and semantics. Until such time as the vendors start to understand the differences, we will be forced to "play games" with their products to get them to work right. Of course, since we are playing very much the same game with the RDBMSs, at least there is consistency in what we have to do! Ideally, we would like the data modeling tool to allow us to draw all three models: the business model, the dimensional model and the physical model. The tool should provide some reference checking to validate integrity between the models. Since the relational model itself does such a poor job of handling partitioning dimensions (and some categorical dimensions), we cannot expect the tools to do a good job. And they don't! In order to maintain some semblance of clarity in the model, I would suggest leaving them out, except where it may simplify the actual creation of the database. Since dimensional modeling has become such an important part of today's DBA activities, we may see some changes to deal with these issues. Indeed, there are rumors that some vendors are working on this problem. Since some of the difficulty lies with the relational model itself, it will be very interesting to see how they deal with the problem.

Query tools
Implementing query tools for a dimensional structure is simultaneously much simpler and much more difficult than for a 'traditional' normalized model. If we have a single fact table (that includes the aggregate data as well) then there is really only one query to deal with. Optimization, indexing and so on become much simpler problems. The advantages of a single fact table (with detail and aggregate data) are discussed in one of our other white papers; see the last section below. Query tools are more difficult since they must be able to provide a clear differentiation between questions that have subtle differences in statement. Our marital status example provides an excellent illustration; we have already discussed this enough! Another example is given in our sample data warehouse for health care manufacturers. Hospital buying groups sign contracts with manufacturers in order to receive better pricing. However, individual hospitals may not participate in some contracts. That is, a member of a buying group may purchase the products covered under a contract with one buying group via a different contract with a different buying group. Hence, the question as to what a hospital buys when it is a member is different from the question about what it buys under the group contract. This is a subtle yet enormous difference. I have always thought that a query tool that presented the ER diagram would be wonderful from the enduser's perspective. However, most ER models are way too complex for an enduser to wade through. The star architecture has the opportunity to provide an extremely simple query interface. The diagram below shows a star architecture in the form of a radar graph. The enduser would simply click on the portions of the diagram in order to guide the query tool. It is complicated somewhat by the fact that retrieval must be separated from reporting, but that is relatively straightforward. Unfortunately, I have never seen a query product that uses this interface. Pity.

Figure 19 Starburst query tool

Later thoughts
The starburst model shown above will work well for very simplistic models; for example, a single star fact table with a limited number of dimensions. (Too many dimensions simply fill the screen.) However, if you remember, one of the wonderful things about FCDAGs is that they can be developed as a recursive structure. Hence, we would be able to take a very complex data warehouse and represent it as a number of connected sub-FCDAGs. We can then take the starburst model and use it to provide a recursive view of the data warehouse. The primary advantage to this approach is that most users really only care about their little slice of the pie. So if we presented an enterprise view of the corporate information (in the broadest sense of the term), they wouldn't care about it anyway; except possibly on very rare occasions. So, the theory goes, we could represent the enterprise as a series of nested FCDAGs, each of which might (or might not) be a separate data warehouse/mart/whatever. The user could navigate through this by selectively looking at different starbursts, adding what they want to the query and then letting the query processor do whatever it needs to do to make this happen. Since database navigation has been one of my primary interests over the last fifteen years, and since I believe that the concept of an enterprise data warehouse is fundamentally flawed, this approach is one that seems to have a great deal of possibility. At Telos Solutions, we are, in fact, building such a query tool today. (The following is a paid commercial announcement.) Trinity is based upon the premise that the best logical representation of a data warehouse is an FCDAG; which, in turn, provides a mechanism for building up a logical view of the enterprise data. Individual users will be able to focus on their own little starburst as they need to, or be able to navigate (and extract from) the entire corporate information repository on those rare occasions when they must. Trinity is currently shipping its first version of this approach. The product combines data warehousing, data mining and visualization for complete Knowledge Discovery in Databases. The semantic approach to the data warehouse is more powerful and more sustainable than the structural approach taken by nearly all other query tools.

Sample data warehouses


The retail model
The purveyors of large-scale retail data are credited by Kimball as being the creators of the star schema. Since the retail model is fairly simple (and yet interesting) we will deal with it first. The structure of this model is from Kimball's book The Data Warehouse Toolkit. I have taken some liberties with this model, in order to demonstrate some of the topics. Any discrepancies between this and Kimball's work, or between this and reality, are my fault.

Figure 20 Normalized Retail model

As you can see, the retail model is quite simple. There are four dimensions and three of them are Type I Patterns. Hence, these may be denormalized directly into dimension tables. Even the fourth dimension, a Type II Pattern, may be denormalized directly into a dimension table. The star schema is shown in the following diagram. We have identified the member names in the dimension tables.

Figure 21 Star retail model

Despite the simplicity of the retail model, there are some extensions to it that can make the data warehouse as a whole more interesting. One of these is promotions. As anyone from this industry will tell you, deals and promotions are the lifeblood of retail. The ability to evaluate the impact of promotions should provide one of the best benefits of a data warehouse. What challenges will this requirement pose to the data warehouse architect? Earlier, in the section on trend analysis, we noted that we need to couple the changes in the dimensions with trend analysis of the facts. In order to determine the effectiveness of a promotion, we must be able to analyze product sales during several different periods of promotion, as well as during periods of 'non-promotion'.

Extensions to the retail model


One of the reasons why the retail model is so simple is that it pretty much ignores the customer! Adding in who buys the product is not as easy as it seems. The retail industry (especially CPG) is complicated by the fact that the people who buy the product from the manufacturers are not the final consumers. Manufacturers sell to dealers who, in turn, sell to the stores, which then finally sell to the consumer. Some promotions are for dealers and some promotions are for the consumer. Adding dealers to the model complicates it enough, but adding the consumer really jumps up the complexity! Nevertheless, that is what many CPG manufacturers are doing. Food chains are developing programs that will track point-of-sale activity by customer. This will let them create highly targeted marketing and promotion programs, with price breaks and coupons developed for individual customers. This will not only make data warehouses more complex, it will make them much, much larger than is possible today. The return on investment, however, is such that these will happen as soon as the technology is capable.

Health care manufacturing


This example is based upon the health-care industry. In particular, it could be a sales and marketing data warehouse for a manufacturer of hospital health care products.

Figure 22 Health care data model

There are several dimensional structures on this diagram.
8. Sales Organization (Area, Region, District and Territory). This is a Pattern I structure.
9. Product, containing two alternate structures, one for manufacturing (Facility, Assembly Line, Subgroup) and one for marketing (Family, Group and Subgroup). This is a Pattern II structure.
10. A customer dimension (Hospital Org, Hospital and Customer).
11. A buying group dimension (Buying Group, Contract).
The most interesting structure is the Buying Group Organization/Hospital Organization. It is complicated by the many-to-many relation between buying groups and their members. How should we model this? It is, perhaps, made clearer by understanding the differences between these two structures. Even though a hospital belongs to a buying group and that group may have a contract with the manufacturer, the hospital may be buying a product through another buying group or through its own contract. Hence, the dimensional structure must be able to distinguish between what the group member sales are (through any contract) and what the group contract sales are. Hence, we should decompose these into two patterns, as defined below. As anyone who has modeled this structure knows, the difference between these two questions is huge and not a little bit confusing. It is a good example of how important it is to provide clear differences to the enduser.
7. Customer Sales: Hospital Org, Hospital and Customer, with the alternate structure Member Sales: Buying Group Org, Buying Group, Members, Customer.
8. Group Contract Sales: Buying Group Org, Buying Group and Contract.
The final result of this analysis is shown in the diagram below. Note the diamond shape connecting some of the dimensional tables to the fact table. This indicates that these are alternate hierarchies joined to the same key in the fact table.

Figure 23 Dimensional tables

In all of the tables, the domain for the primary key (e.g. Customer ID) is the union of the domains for the other dimensional members. For Customer ID, that would be Customer Id, Hospital Id, Hospital Organization, Buying Group Id and Buying Group Organization. The Customer ID values should be changed, if necessary, to guarantee uniqueness across that union. Now that we have some meaningful examples, it might be clearer how queries will work. For example, suppose that we want to retrieve all of the products in a particular group. The SQL is select product id where product group = groupx and level identifier = 0. We specify level identifier = 0 since we want product id, which is the lowest level. If we had wanted to retrieve all of the subgroups in a group, we would have used select product id where product group = groupx and level identifier = 1. (Fully spelled-out versions of these queries are sketched below.) In actual practice, every query will contain all of the dimensions. This allows performance tuning to proceed along a well-defined path. Although it is not required, this structure assumes that the fact table contains all of the aggregate data as well. There are some strong advantages to this approach that are discussed in another white paper (see the following section).
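Spelled out in full, the two queries quoted above might look like the following; the table name product_dimension and the exact column names are assumptions on my part, not part of the model as drawn.

-- all products (leaf level) in a particular group
select product_id
  from product_dimension
 where product_group    = 'GROUPX'
   and level_identifier = 0;   -- 0 = lowest (product) level

-- all subgroups in the same group
select product_id
  from product_dimension
 where product_group    = 'GROUPX'
   and level_identifier = 1;   -- 1 = subgroup level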

Summary
The development of the entity relation model of a dimensional data warehouse is really a logical extension of practices that have been in use for some time. There are a few patterns that occur repeatedly, and once these are understood, dimensional analysis can proceed quickly. Although these do not cover every structure, the majority of cases are covered, leaving the analyst to focus on the structural issues that are unique to their model. To some extent, this paper is unsatisfactory. There are some areas where my comments are limited to statements like: "This is ugly." However, I plead innocent by reason of insanity. The analytical requirements of a data warehouse (and analytical applications in general) require things like categorical and partitioning dimensions. While these may or may not fit well into the relational data model, they are surely a bad fit for data modeling tools. Data warehouses are organic. They grow continuously and in directions that no one will be able to predict. The strength of the dimensional structure is the ease with which it can support that growth. Thinking of data points as dimensions rather than attributes can make adding new data points much easier. Basically, follow the rule that base data points should be attributes and derivations of the base data points should be another dimension.

Stars, snowflakes and galaxies


One of the other approaches to the star physical model has come to be called the snowflake. The essential difference is that the dimension tables are not 'collapsed' but are maintained in normalized form. Perhaps the primary motivation for this approach is performance. A large dimension table (25,000,000 leaf rows) may be extremely large when the dimensional structure is included as columns, as presented above. Since much of this space is redundant data, creating a normalized structure will reduce the overall requirement. The downside is that more joins will be needed to execute a query, so performance may be adversely impacted; or it may not be, because the smaller dimension tables could perform better. Only benchmarks will determine what is best for your design. Yet another alternative is to leave the dimensional structure as a single table but remove the columns that contain the ancestor structure. This solves the space problem but does not cure the resultant query complexity issue. Depending upon the particular DBMS, performance may in fact be dreadful. The queries necessary to traverse this table will be self-joins; if your DBMS doesn't do these well, this could be the worst performer of all of these. Once again, only benchmarks will determine what is best for your design. I don't know if anyone actually uses the term galaxy and I apologize if I am introducing it here. A galaxy is a collection of stars! That is, if your informational structure is such that it cannot fit cleanly into a single star (or fact table) then you will need several fact tables. Since a star fact table is nothing but an intersection entity, if you follow the usual rules of data modeling, you should be able to determine how many fact tables you will need. The difficulty arises when there is overlap between one intersection entity and another. This, again, is the difference between the business model and the physical model. Resolving this depends upon the analytical requirements of your application.
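A minimal sketch of the two physical choices for a product dimension, with hypothetical names: the star version carries the ancestors as columns on every row, while the snowflake version keeps each level in its own normalized table and pays for it with extra joins at query time.

-- star: one collapsed dimension table
create table product_dim (
    product_id     integer primary key,
    product_name   varchar(60),
    subgroup_id    integer,
    subgroup_name  varchar(60),
    group_id       integer,
    group_name     varchar(60)
);

-- snowflake: one table per level, joined at query time
create table product_group (
    group_id    integer primary key,
    group_name  varchar(60)
);
create table product_subgroup (
    subgroup_id    integer primary key,
    subgroup_name  varchar(60),
    group_id       integer references product_group
);
create table product (
    product_id    integer primary key,
    product_name  varchar(60),
    subgroup_id   integer references product_subgroup
);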

Figure 24 Stars, snowflakes and galaxies

Aggregation strategies
Introduction
This chapter discusses different methods of aggregating star architecture data warehouses. Since many of the questions that a user will ask can require the aggregation of hundreds or thousands of rows, pre-aggregating some or all of the data warehouse can dramatically reduce query response time. The tradeoff, of course, is that the physical database is significantly larger. This chapter will only briefly look at the pros and cons of pre-aggregating. Our main focus will be on different ways to perform the aggregations. There are two major structural approaches to pre-aggregation. The first approach will store each set of aggregate information in a different fact table. The second will store the detail data and the aggregate data in the same fact table. Although the strategies that we will discuss focus on a single fact table, the following section discusses some of the advantages and disadvantages of each.

Pros and cons


There are several reasons for pre-aggregating one or more dimensional structures.
12. The compression ratio (rows displayed / rows retrieved) is very low. That is, the number of rows retrieved is significantly larger than the number of rows displayed. The definition of 'significantly larger' will depend upon your particular application. There are two impacts from very small compression ratios. First, the retrieval time is very high. Second, the time to perform the dynamic aggregation is very high.
13. The dimensional structure is very complex. An example of this is the unbalanced trees discussed earlier. Another example is a very large structure that will require several joins to traverse from the starting to the ending point in the dimension. See the health care example we discussed earlier.
14. The computations are very complex. This is most often true in a financial data warehouse, where the computations are very often 'chained together'. Hence, the SQL to compute a particular fact may be very simple, but it requires computing some prior fact, which requires computing some other prior fact, and so on.
In brief, the decision weighs the cost of creating and storing the aggregates against the cost of dynamically calculating them. If the fact table is very large but rather simple (in terms of structure and calculations), then the case for dynamic aggregation is probably stronger. If the fact table is relatively small and computationally or structurally complex, then the case for pre-aggregation may be stronger. Once the decision has been made to pre-calculate, then you need to decide how much to aggregate and where to put the results. Determining how much to pre-aggregate focuses on which dimensions. Note that it is also possible to debate how many levels of a particular dimension to pre-aggregate. I always recommend simplicity first: if you pre-aggregate part of a dimension, do it all. Doing part means that you still have to deal with some of it dynamically. This will complicate the development process and the query execution with little benefit. It is easy to measure the benefit of pre-aggregation for a particular dimension. Basically, you have to determine the compression factor for each level of the dimension. This is the number of rows produced by an aggregation divided by the number of rows that were retrieved. For example, if we summarized product sales by subgroup, retrieved 1,132 rows and reported on 47 rows, then the compression factor is 47/1132 = 0.0415. This should be calculated using real data; preferably not a sample, and preferably covering a long time period. Calculate the compression factor for the entire dimension. Once you have this for all dimensions, you can evaluate which dimensions should be pre-aggregated. Note that some dimensions may have compression factors very close to 1.0. These should not be pre-aggregated. A compression factor that is close to 1.0 will result in a dimension explosion; that is, the size of the aggregates will be very close to the size of the original fact table. Once you have calculated the compression ratios for each level in each dimension, you can arrive at a very accurate estimate of the size of the final fact table(s). Calculating the size of individual aggregate tables is very easy and won't be discussed here. Calculating the size of a single aggregate fact table is simply a matter of 'chaining' the calculations together, as shown in the table below.
You can easily construct a spreadsheet to perform this calculation. Note that the sequence of aggregation doesn't make any difference; the number of resultant rows is the same. Later on we will discuss the performance impact of one sequence versus another.
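A compression factor can also be measured directly against the real data. The sketch below computes it for the subgroup level of the product dimension (table and column names are illustrative): the numerator is the number of rows the aggregation would produce, the denominator is the number of detail rows feeding it.

select agg.agg_rows * 1.0 / base.base_rows as compression_factor
  from (select count(*) as base_rows
          from sales_fact) base,
       (select count(*) as agg_rows
          from (select distinct p.subgroup_id, f.market_id, f.time_id
                  from sales_fact f
                  join product_dim p
                    on p.product_id = f.product_id) x) agg;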

Description of sample data warehouse


Products: 2500 -> 50 -> 5 -> 1
Markets: 450 -> 30 -> 5 -> 1
Time: 60 -> 20 -> 5 -> 1
MAT aggregations: 8 + 4 + 4 + 4 = 20
SAT aggregations: 6

This discussion will focus on a strategy that will aggregate a single fact table. The aggregation of multiple fact tables is a BFMI (Brute Force and Massive Ignorance) process.
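For what it's worth, the roughly 58% growth quoted just below can be reproduced from these cardinalities, assuming (as the example apparently does) that the base fact table is dense, i.e. every product/market/month combination occurs:

\[
\frac{(2500+50+5+1)\,(450+30+5+1)\,(60+20+5+1)}{2500 \times 450 \times 60}
  = \frac{2556 \times 486 \times 86}{67{,}500{,}000}
  = \frac{106{,}830{,}576}{67{,}500{,}000} \approx 1.58
\]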

Figure 25 Dimension compression ratios

As you can see from our example, there will be about a 58% increase in the number of rows from the base fact table to a fully aggregated fact table. The cost of storing this can be easily compared to the cost of dynamic calculations. Another element that needs to be considered is the impact that this much larger table will have on query performance. Given a small enough table, the impact is negligible. However, for large tables, the impact can be substantial; possibly even unacceptable. The following comparison may help.

Multiple aggregate tables:
- This requires a separate table for each set of aggregate data. The extreme case is a fully aggregated database. For the sample data warehouse, this will require 9 one-dimensional tables, 18 two-dimensional tables and ...
- Each of the tables counted above will require a separate aggregation operation. This, of course, implies a separate SQL statement. Even though the SQL statements can be generated, the sheer number of them increases the chance for failure and greatly complicates the restartability of the load processing. It is not uncommon for the calculations to require several passes over the data (see below); hence, the large number of statements required here is exacerbated by each additional pass. Some of these second-pass calculations can be run simultaneously.
- On the plus side, the tables will be smaller.
- As a result of the large number of tables, an aggregate navigator will be required.
- Adding new aggregate tables may be complicated if an aggregate navigator is not being used. Adding new facts may be horrendous whether or not an aggregate navigator is being used. Perhaps all of the aggregate tables need to be changed, and all of their SQL statements, plus the impact on the query tool.

Single aggregate table:
- By definition, the single aggregate table approach requires only a single table!
- We will present an aggregation strategy that limits the number of operations so that they increase linearly with the size of the dimensions. This reduces everything about the system: the aggregation time, its complexity, the difficulty of restart and so on.
- On the minus side, the single aggregate table can be astoundingly large. It is not uncommon to see a growth factor of 2^N where N is the number of dimensions. However, it is unlikely that the total database size will be any larger (or smaller, for that matter).
- Since there is only a single table, no navigator will be required.
- Obviously only a single table needs to be changed. If the aggregation process has been designed properly, then the SQL updates should be relatively small.

Figure 26 Multiple vs. Single Aggregate Tables

SQL-based aggregation
This approach uses SQL for all aggregation and calculation processes. Although it is not the best performing approach, it is the simplest. This might make a good first cut at the overall aggregation process. Subsequent improvements might focus on individual long-running steps.

Simple dimensional structure


In our data warehouse, the composite key of the fact table consists of {product id, market id, time id}. If we aggregate the product data up to product subgroup, our SQL statement will look like:

select subgroup_id, market_id, time_id, {data points}
  from fact_table
 where product_id = subgroup_id
 group by subgroup_id, market_id, time_id

At the conclusion of this SQL statement, the fact table contains the detail data and one set of aggregate data. We repeat similar SQL statements to complete the product structure aggregation. At the end of these three SQL statements, the fact table will contain aggregations for the complete product structure and the lowest level for market and time. Now we will aggregate the market structure. The SQL statement will look similar, but not exactly the same:

select product_id, sales_territory_id, time_id, {data points}
  from fact_table
 where market_id = sales_territory_id
 group by product_id, sales_territory_id, time_id

At the conclusion of this SQL statement, we will have the sales territory aggregations for every level in the product structure as well as for the lowest level of time. Note how easily we receive the benefit of the work that was already done. In order to complete the market structure, we need a total of three SQL statements. At the end, we will have a completely aggregated product structure and a completely aggregated market structure, for the lowest level of time. Now we will aggregate the time structure. Again, the similar (but not identical) SQL statement is:

select product_id, market_id, quarter_id, {data points}
  from fact_table
 where time_id = quarter_id
 group by product_id, market_id, quarter_id

At the conclusion of this SQL statement, we will have a completely aggregated product structure and a completely aggregated market structure for both monthly and quarterly data. With one more SQL statement, we can do annual totals, and then we have a completely aggregated database. This strategy used a total of eight SQL statements to produce a completely aggregated data warehouse.
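Since the statements above are schematic (the join to the product dimension is implied), here is one fully spelled-out version of the first step, under the assumption that the denormalized product dimension carries subgroup_id on every leaf row and uses the level identifier described earlier; the data-point columns are illustrative. Some DBMSs may prefer this to be staged through a work table rather than inserting into the table being read.

insert into fact_table (product_id, market_id, time_id, sales_amount, sales_units)
select p.subgroup_id, f.market_id, f.time_id,
       sum(f.sales_amount), sum(f.sales_units)
  from fact_table f
  join product_dimension p
    on p.product_id = f.product_id
 where p.level_identifier = 0            -- roll up only the detail (leaf) rows
 group by p.subgroup_id, f.market_id, f.time_id;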

Multiple hierarchies
It is very common for a particular dimension to have alternate structures. Product, for example, may have several different structures. In the earlier sections, we talked about how to design the dimensional structure. That analysis should result in a fairly clean structure. However, each structure will have to be aggregated independently. It is not difficult, simply tedious.

Sequence determination
One of the performance issues that can be addressed is the sequence of aggregation for the different dimensions. The results, both in terms of the values and the number of rows, should not depend upon the order of the dimensions (subject to the non-additive facts discussed below). However, the performance can be greatly affected. Since the length of each aggregation operation depends directly upon the number of rows going into the operation, we would like to keep the number of rows as small as possible for as long as possible. Hence, the sequence of aggregation should do the smaller dimensional structures first. You can use the compression ratios that you calculated earlier to determine this sequence. Note that this will reduce the aggregation time below any other sequence; it will not reduce the number of rows.

Non-additive facts and variances


Inevitably, there are some facts that do not simply aggregate. Average Selling Price is an example. Since these depend upon numbers that are themselves aggregated, non-additive facts must be calculated in a second pass over the fact table. However, these are usually very simple SQL statements and so they should be extremely fast. The calculation of variance information is a similar problem. Variances are usually an inter-row calculation (or should be), so your SQL will need to be able to do that; SQL-92, for example, provides a CASE expression. The vagaries of SQL and relational databases will mean that you may have to run a lot of SQL statements to calculate all of the variances, but it will still be significantly fewer than with multiple fact tables.
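Two hedged sketches of that second pass, with illustrative column names: a non-additive fact recomputed from its additive components, and a simple actual-versus-budget variance expressed as an inter-row calculation with CASE (assuming a scenario column distinguishes the two partitions).

-- non-additive fact: recompute from the aggregated components rather than summing
update fact_table
   set avg_selling_price = sales_amount / nullif(sales_units, 0);

-- variance: actual minus budget, computed across rows with CASE
select product_id, market_id, time_id,
       sum(case when scenario = 'ACTUAL' then sales_amount else 0 end)
     - sum(case when scenario = 'BUDGET' then sales_amount else 0 end) as sales_variance
  from fact_table
 group by product_id, market_id, time_id;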

Summary of steps
9. Calculate the additive derived facts.
10. Aggregate the dimensions.
11. Calculate the non-additive derived facts.
12. Calculate the variance facts.

Non-SQL based
This approach requires the development of specialized programs. This does complicate the development and maintenance processes. There are two basic improvements with this process. First, a dimension can be aggregated in a single pass over the data (rather than many passes). Second, the nature of the process is such that it can easily be decomposed into parallel processes. In brief, the process is to scan the rows that will be aggregated, create subtotal records, sort the file, and then aggregate and load it in a single pass. Since these are all very simple (and sequential) processes, they should occur at disk speeds. Later we will see how they can be decomposed into parallel processes.

Single dimension
In order to aggregate a dimension, we create a copy of each row for each level in the dimension. Each newly generated record will contain the appropriate member for the aggregate levels. For our time dimension, which contains month, quarter, year and total, we will generate quarter, year and total records for each month row. Once this is completed, we will sort the file on the time column; placing all the records that are to be aggregated into the correct sequence. The aggregation/load process can then read through the file, inserting a total row whenever a value break occurs in the time dimension.

Figure 27 Time Aggregation - Single Stream

This final step can contain whatever level of processing we want it to. It can perform aggregations, variance calculations, transformations and so on. Hence, we can possibly eliminate several SQL passes with this single program. Note, however, that there is a development/maintenance tradeoff here that needs to be considered. Once the intermediate file has been created, the aggregation of the different levels can proceed independently of each other. This is different from the SQL-based approach, where level n+1 must be processed after level n. If we write the extract program to produce a file for each level (rather than a single file), then these individual files can be loaded in parallel. Note that this also eliminates the need for the sort step. Finally, if we are using Unix, we can pipe the results of the extract program directly into the (parallel) load processes. This eliminates the requirement for a large amount of working space on disk. The primary performance impediments to this process are the very large contention on the disks occupied by the table and the contention caused by creating the indexes. Depending upon the DBMS and the operating system being used, it may be possible to reduce or eliminate these.

Figure 28 Time Aggregation - Multiple Streams

Performance comparisons
Valid performance comparisons of one DBMS vs. another must take into account a wide range of factors, including but not limited to the operating system, the speed of the i/o devices and the CPU, as well as how much and how effectively parallel processing can be used. At the same time, most performance problems are the result of design factors. Since we are talking about design decisions here, we can use some made-up computer. As long as it remains in the range of today's processors, we are okay. So, for the record, the CPU will have an instruction rate of 100 MIPS. The disk devices will support 100 physical read/write operations per second, with a sustained transfer rate of 16 Mbytes per second. We will also assume that the database uses an 8K block and that the data can be striped in whatever manner achieves optimum performance. As pie-in-the-sky as this sounds, it is not all that far off from today's reality. Disk storage will cost $100/Gbyte and the processing costs will be based on recovering the cost of a $125,000 server during each year. That is, we will charge our users at the rate of $0.0174/CPU sec. The space comparison for the two approaches is pretty straightforward. The worst case should be a comparison of two fully aggregated databases. Since both databases will store the same type of information, the total number of rows will be the same. The difference is that the single aggregate table will require all three keys for data that really occupies only one or two dimensions. Since the number of rows where this occurs is relatively small (in relation to the three-dimensional data), the excess space required by the single aggregate table is relatively small. As long as we require that both databases maintain the same level of detail, this comparison will generally hold true. Note, however, that if we have a billion-row fact table, even these small percentages may be significant. The cost comparison for dynamic aggregation versus pre-aggregation depends upon how often which levels of detail are requested. Clearly, if we ask the same question over again, pre-aggregation will start to pay for itself the second time we ask it. However, if we totally pre-aggregate, then we have to weigh the cost of calculating every aggregate against the cost of calculating some aggregates more than once. The other factor that must be taken into consideration is the human cost of waiting for large aggregations. With the techniques that we have described above, it may be possible to calculate the aggregates in a fairly small offline window. If this window is small enough and the cost of wasting people-time is high enough, then pre-aggregating again makes sense. Regretfully, we can't provide general answers to this question; we can only describe how the comparison should be performed.

Diagramming Conventions
Ask any data modeller: the most critical component in a data warehouse is the data model. Without a good data model, you might as well pack it in and go home! And yet we treat the data model (and the modeller) as so much excess baggage. We don't get enough respect around here! Certainly some of that is our fault. Most data models look like they were conceived and drawn in some type of bad dream. So what we need is a little discipline and some conventions. The primary purpose of these conventions is to make the model easier to read. In his book Data Model Patterns: Conventions of Thought (1996, Dorset House), David C. Hay describes a style of modelling that I have modified somewhat for data warehouses. (P.S. You should get this book.) In general, you will see that the models in this paper (and the models that I draw for clients) follow these conventions:
15. Parent-child relations should flow from top to bottom and left to right. (David's models actually go the other way.)
16. Reference tables should be placed as close to their referent as possible, while continuing to adhere to rule #1.
17. Don't cross lines if you can avoid it.
18. If a dimension has multiple hierarchies, they should be modelled separately but use a diamond symbol on the final path to the fact table.
19. The data model should have resolved all semantic ambiguities.
20. Use a bold outline (or something) to represent logical copies of tables that were modelled to eliminate semantic ambiguities.
These conventions tend to place the dimensions in the upper left corner (and across the top) and the facts in the lower right corner (and across the bottom). Since most queries start with dimensions, I find that this approach tends to place the things people want to see first at the top of the model. The drawback to this convention is that the models tend to be very large and have lots and lots of white space. I usually try to convince my clients that this leaves lots of room for notes! (I have had middling success with this.) Still, the models look very nice and are easy to read. Another (minor) drawback is that star models don't look anything like stars. So it goes...

Some ranting...
As long as we are talking about data modelling... (skip this part if you've heard this before). I have on many occasions been brought in to a client to develop a data warehouse based upon some business model that somebody developed. More often than not, these logical models have been a waste of time. (Although I am usually far too polite to say so.) Based on my experience, the biggest mistake that data modellers make is carrying the level of abstraction too far. Remember, a business model is, first and foremost, about the business. Not some nitwit concept of abstraction that the modeller thinks is relevant. An example might help. In the insurance industry, it is quite common for people to have several 'roles'. For example, a doctor might be, at various times, an insuree, a plaintiff, a defendant or a claimant. One approach is to develop some entity called "Person" and then have something that defines what role they play at some moment. Resist this approach! It is too abstract and fails to convey the different roles (which are embedded in some type table). The casual observer will note that by carrying the level of abstraction up to the concept of "Person" the modeller has created a semantic ambiguity. Which we all now understand is a very bad thing! To paraphrase John von Neumann, "People that create semantic ambiguities are living in a state of sin." Now, having read through this paper, you know better. Well, that was cathartic!
