Data Warehouse Concepts
----------
David Sharon T
For any subset of facts and any subset of dimensions, the perfect data warehouse will provide one and only one answer. It can be represented as a fully connected directed acyclic graph (FCDAG). Think of facts as basic pieces of information that users want to see in the answer to a question, and dimensions as one way for the users to constrain the scope of the question. The business model is a function of the information, while the dimensional model depends upon the analytical processing that will be performed. Further, the physical model is the result of the modifications necessary to reach performance objectives.

Keep the following rules pasted on your forehead: #1: Develop a thorough and complete understanding of the business processes that the warehouse will support. #2: Obsess over rule #1.

Dimension: A simple dimension joins a single intersection entity to a single strong entity. A complex dimension joins a single intersection entity to more than one strong entity. One of the strong entities should be identified as the primary dimensional structure while the others will be identified as alternate dimensional structures. An improper dimension is a cyclic path joining a single intersection entity to one or more strong entities.

Hierarchy: A hierarchy is an instance of a particular dimensional structure. A dimension may have more than one hierarchy. For example, a store dimension may have a geographical hierarchy, a demographic hierarchy, and a store attribute hierarchy. Time may have a calendar hierarchy, a fiscal hierarchy, and a rolling hierarchy.

Informational Dimension: Informational dimensions define what data is stored. That is, the informational dimensions define what the data points (or facts, or Key Performance Indicators) are. These are called attributes in the relational data model. The advantage of thinking of them as a dimension is twofold: first, they often have a complex structure (e.g. a corporate chart of accounts) and second, a dimensional structure is much more flexible. Given the limitations of the relational data model, however, these advantages are usually only achieved in a multi-dimensional database.

Structural Dimension: Structural dimensions define how the data is stored. These are the entities in an ER diagram. The ability to modify the structure (either by adding new dimension members or by adding new dimensions) is where relational databases really have an edge over some (but not all) multi-dimensional databases.

Categorical Dimension: Categorical dimensions classify or categorize other entities. For example, box size might categorize breakfast cereals, and color may be used to categorize automobiles. There are some advantages (discussed later) to thinking of these from a dimensional viewpoint.

Partitioning Dimension: Partitioning dimensions replicate the remaining structure of the database. Scenario is a common partitioning dimension; example scenarios would be Actual and Budget. In this case, we want the complete structure to occur for both Actual and Budget. There are many advantages to thinking of these from a dimensional viewpoint.

Members: Members are either the actual occurrences of a dimension or the named levels of a dimension. For example, in the Sales Organization dimension, we might have Sales Districts (as a named level) and New York as one of the Sales Districts. Both of these may be called members. The difference should be clear in context.
Star Architecture: A valid star architecture consists of a single intersection entity (fact table) joined to one or more proper dimensions. An invalid star architecture consists of a single intersection entity joined to one or more dimensions, where at least one of the dimensions is an improper dimension. A valid star architecture will always return consistent results, when each dimension is included in the query. An invalid star architecture may return different results depending upon the path that is taken through the improper dimensions. Note that the correctness of the query is not a structural issue, but is a semantic issue.
The casual reader will note that our definition of a valid star architecture is an FCDAG. A composite star architecture consists of two valid star architectures that are connected, either through a dimensional structure (of some type) or by connecting the fact tables.

Fact Table: This is the central table in a star architecture that contains all of the Key Performance Indicators (also called data points, attributes, or the informational dimension). Technically, the fact table is an intersection entity whose primary key is a composite key. The domain of each component of the key consists of the union of the domains of the different dimension levels.

Aggregate Key: A composite key is a primary key that consists of a number of foreign keys. An aggregate key is a logical concept that includes other attributes as well. It is structurally meaningful when you are pre-aggregating categorical dimensions. There is more on this later, in the section on Types & Subtypes.
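As an illustration of the fact table's composite key, here is a minimal star architecture sketched in Python with SQLite. The table and column names (sales_fact, product_dim, and so on) are hypothetical choices for the example, not prescribed by the text; the point is that the fact table's primary key is exactly the set of its dimension foreign keys.

```python
import sqlite3

# A minimal star: one fact table whose primary key is the composite of
# its dimension foreign keys, plus two simple dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product_dim (product_key TEXT PRIMARY KEY, description TEXT);
CREATE TABLE time_dim    (time_key    TEXT PRIMARY KEY, description TEXT);
CREATE TABLE sales_fact (
    product_key TEXT REFERENCES product_dim,
    time_key    TEXT REFERENCES time_dim,
    units_sold  INTEGER,
    revenue     REAL,
    PRIMARY KEY (product_key, time_key)   -- composite key of dimension keys
);
""")
con.executemany("INSERT INTO product_dim VALUES (?, ?)",
                [("SKU1", "Widget"), ("SKU2", "Gadget")])
con.executemany("INSERT INTO time_dim VALUES (?, ?)",
                [("1999-01", "Jan 1999"), ("1999-02", "Feb 1999")])
con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                [("SKU1", "1999-01", 10, 100.0),
                 ("SKU2", "1999-01", 5, 75.0),
                 ("SKU1", "1999-02", 8, 80.0)])

# Constraining any subset of the dimensions yields one and only one answer.
total = con.execute("SELECT SUM(revenue) FROM sales_fact").fetchone()[0]
jan   = con.execute("SELECT SUM(revenue) FROM sales_fact "
                    "WHERE time_key = '1999-01'").fetchone()[0]
```

Any query that constrains a subset of the dimensions returns one and only one answer, which is the consistency property a valid star architecture provides.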
Data Modeling Steps: In brief, the steps that we will follow are: 1. Develop a normalized entity-relationship model of the business model of the data warehouse. 2. Translate this into the dimensional model. This step reflects the information and analytical characteristics of the data warehouse. 3. Translate this into the physical model. This reflects the changes necessary to reach the stated performance objectives.

The three-model architecture (business model, logical model, internal or physical model) has been discussed in ER modeling for some time. However, it was developed with transaction processing systems in mind. Since data warehouses are developed with analytical processing in mind, we have added the concept of the dimensional model, which represents the information along dimensional lines.

What matters most are the questions that we ask the users. Ask them for their needs and desires as a base for discussion. Don't focus on what they do today, because it is very likely going to be different in the future. Focus on the information first, then on the access paths. Data warehouses (like most analytical applications) are organic: they grow, and they grow in directions that we are not able to predict. If we focus on current functional requirements, we will be able to develop a highly optimized solution for today's problems. However, as the problems change, the solution becomes less and less optimal. Eventually, it is sub-optimal and hence unacceptable. A better approach is to start out by trading off some optimization for flexibility. The resulting solution may always be a little less optimal, but it will be more adaptable and hence have a much longer life.
Model normalization
Once we have gathered all of the entity and relationship definitions, we can construct the normalized model. Loosely speaking, Third Normal Form (3NF) is reached when the attributes depend upon 'the key, the whole key and nothing but the key'. There are many advantages to 3NF. The structure is remarkably insensitive to change. Specifically, the 'ripple' effect of changes on other areas is very well contained. If we change (add to or delete from) the attributes that are related to a particular key, there should be no reason to change other entities or relations. And if we add new entities, only the directly affected elements in the database need changing. From a data warehouse perspective, this means that the warehouse structure can be easily modified to reflect new organizational changes or new business problems. (One example of this is the ease with which multiple hierarchies for a dimension can be built and maintained.)

The structural paths for accessing information are very clear. Since this is a directed graph, the only difficulty arises when we are forced to include cyclic paths. Since these are the result of business conditions, we know that they are required. However, it does mean that we need to educate our users about the different paths and the interpretation each carries. The process that we describe below will help to eliminate the cyclic paths, since they are death to most query tools. Having clear structural paths means nothing, by the way. Remember our emphasis on semantics? Semantics is everything!!!

If you read almost any book on data modeling and database design (several are in the bibliography), you will notice that a great deal of the discussion on the normal forms revolves around the elimination of CRUD (Create, Read, Update, Delete) anomalies, that is, the type of update problems that are solved by moving to a higher normal form.
Since a data warehouse is rarely updated, the possibility of these anomalies occurring is much lower. This is why it is acceptable to talk about denormalizing the data warehouse. As always in life, there are some disadvantages to 3NF:
1. Performance can be truly awful. Most of the work that is performed on 'denormalizing' a data model is an attempt to reach performance objectives.
2. The structure can be overwhelmingly complex. We may wind up creating many small relations which the user might think of as a single relation or group of data. The dimensional model reduces that to some extent. This is the semantic part.

Since the business model will not be implemented, these disadvantages are rarely critical. Carrying them over to the dimensional model will recreate them, however. The process of normalizing an ER model has been dealt with adequately in many other books. Since the process is no different for a data warehouse, we will not dwell on it here. There are some recommended books in the bibliography.
Integrity constraints
As a general rule, you should understand and enforce the integrity constraints within a data warehouse. Integrity rules help guarantee the consistency of query results. If we drill down through a dimensional structure, integrity constraints guarantee that we will get the same answer as if we had done a full table scan; failing to enforce integrity may result in different answers.

Every rule has an exception! Enforce the integrity constraints unless you can't. Enforcing integrity constraints may result in some records being rejected during the load process. The decision must be made whether a consistent but possibly incomplete answer is better or worse than a complete but possibly inconsistent answer. It is possible that the upstream operational systems have created a situation where this cannot be avoided. Since correcting these systems is beyond the scope of the data warehouse, you may very well have to live with relaxed integrity constraints. If so, understand the implications and make sure that they are well-defined and that the definition is readily available. Explaining the impacts of these relaxed constraints will be a recurring problem as new users join the community.

For example, I worked on a DSS application that tracked sales by customer and contract (and some other dimensions). For various reasons (which were valid for that business), there were ex post facto adjustments to sales. If we sliced the data by either customer or contract, we could reconcile each load. However, because of these adjustments, we could not reconcile a two-dimensional (customer/contract) slice. Since this was the result of some problems in the operational systems, it was a lamentable fact of life in the DSS application. Note that it did not affect the value of this decision support system.
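The consistent-but-incomplete trade-off can be seen in a small load sketch with enforced foreign keys. SQLite and the table names here are illustrative assumptions; the point is that rows without a matching dimension record are rejected, yielding a consistent (if incomplete) warehouse.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
con.executescript("""
CREATE TABLE customer_dim (customer_key TEXT PRIMARY KEY);
CREATE TABLE sales_fact (
    customer_key TEXT NOT NULL REFERENCES customer_dim,
    amount REAL
);
""")
con.execute("INSERT INTO customer_dim VALUES ('C1')")

rejected = []
for row in [("C1", 100.0), ("C9", 40.0)]:    # 'C9' has no dimension record
    try:
        con.execute("INSERT INTO sales_fact VALUES (?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row)                  # consistent but incomplete

loaded = con.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```

If you instead relax the constraint and load everything, the totals will reconcile but a drill-down through the customer dimension will silently miss the orphaned rows.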
Dimensional model
The dimensional model overcomes one of the objections noted earlier (model complexity) by defining how the user will access information. The dimensional model should be much closer to how the users think of the information. It should resolve any of the semantic ambiguities that are in the business model. That is, the dimensional model should be developed as a fully connected directed acyclic graph. (How beautiful!) In the following sections we discuss the different types of dimensions. These are conceptual differences. Creating this taxonomy of dimensions will help us create a better dimensional model and then a better physical model.
Structural dimensions
The first step is the development of the structural dimensions. This step corresponds very closely to what we normally do in a relational database. This step is commonly referred to as 'denormalization'. Indeed, with respect to Codd's relational data model, the database that results from this process may be pretty much 0NF! While this may result in very fast response times for queries, we do lose the level of flexibility that is the hallmark of a 3NF design. No free lunch! The star architecture that we will develop here depends upon taking the central intersection entities as the fact tables and building the foreign key => primary key relations as dimensions. In our sales and marketing example, the sales history table is the fact table of greatest interest (there may be others). The relations that define the sales record will be determined as the result of the analysis above (who bought the product and so on). Technically, we can see that this is a directed graph.
Figure 3 Pattern I: The simple case

In this example, everything is very clean and orderly: both of the dimensional structures are proper and simple. Note that the intersection entity has a number of relationships that are quite simple. By this we mean that each foreign key exists in only one primary key relation. Each of these paths then becomes a dimension. The structure of the fact table is obvious, but what is the best way to structure the dimension tables? Using the definition of levels from before, create each dimension table as follows:

Dimension key: This column will contain all possible values of the complete dimensional structure. For example, if time is the dimension, it will contain months, quarters and years. Note that this requires that all members in a dimension have unique identifiers. If you like to think of domains, then the domain of the dimension key is the union of the domains of the member table keys.
L1 Parent: This is only relevant if the dimension key is L0. For other dimension members, it will be null. This, and the other Ln parents, are included only for performance reasons.
L2 Parent: This is only relevant if the dimension key is L1 or L0. For other dimension members, it will be null.
Ln Parent: This is only relevant if the dimension key is Ln-1 or any lower level. For other dimension members, it will be null.
Height: This is an integer that identifies the height of the member, that is, the height from the leaf node. This is very useful for certain types of queries.
Depth: This is an integer that identifies the depth of the dimension member, that is, how far down from the root node. This is also useful for certain types of queries.

Figure 4 Dimensional table definition

As you can see, the definition of this dimensional table flies in the face of all normalization rules. However, it will support analytical queries and will produce much faster response times than a normalized design.
The process of loading these tables is very straightforward and typically very fast. As part of the yin-yang principle, this approach does have significant problems with implementing dimensional dependencies. These occur when different attributes are related to different levels of the dimensional structure. This is discussed in more detail later on.
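The flattened dimension table of Figure 4 can be sketched as follows, assuming a three-level time dimension (day, month, quarter); the table and column names are hypothetical. Note how a rollup question needs no recursion or multi-table join.

```python
import sqlite3

# Every member of the time dimension -- days, months, quarters -- lives in
# one row, with its ancestors denormalized into "parent" columns.
con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE time_dim (
    dim_key   TEXT PRIMARY KEY,  -- unique across ALL levels
    l1_parent TEXT,              -- month   (null unless the member is a day)
    l2_parent TEXT,              -- quarter (null unless a day or a month)
    height    INTEGER,           -- distance up from the leaf level
    depth     INTEGER            -- distance down from the root
)""")
rows = [
    ("1999-01-15", "1999-01", "1999Q1", 0, 2),   # a day
    ("1999-02-03", "1999-02", "1999Q1", 0, 2),
    ("1999-01",    None,      "1999Q1", 1, 1),   # a month
    ("1999-02",    None,      "1999Q1", 1, 1),
    ("1999Q1",     None,      None,     2, 0),   # a quarter
]
con.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)", rows)

# No recursion needed: all leaf days under quarter 1999Q1 in one scan.
days = [r[0] for r in con.execute(
    "SELECT dim_key FROM time_dim "
    "WHERE l2_parent = '1999Q1' AND height = 0 ORDER BY dim_key")]
```

The Height and Depth columns make "all leaf members" or "all members two levels below the root" queries trivial, which is exactly the kind of question analytical tools generate.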
Time example
Time is probably the most common dimension in a data warehouse. A warehouse is, after all, the retention of historical data for analytical purposes. There are three major ways to structure a calendar: calendar, fiscal and rolling. We will explore all of these eventually, but for now we will define the calendar structure. Note that time starts out appearing to be a straightforward issue. However, it can get complicated. The problem occurs because two months (and quarters and years) can occur in the same week. There are several ways to deal with this. Two possible solutions are:

1. Name the weeks so that you can split them in the middle, e.g. we would have SepWk5 and OctWk1, each with less than five days. This will give you correct rollups to month and above, but will make week-to-week comparisons difficult. A variation on this solution that cures the last problem is to define alternate rollups for week that would also include the full five days.
2. Assume that the week belongs in the month where either Friday or Monday (or some other day) occurs. This always gives you five-day comparisons but will result in incorrect week-to-month aggregations. A solution to this problem is to build separate hierarchies so that day is related to both week and month.

Since the resolution of this issue is more a business problem than a technical problem, we will avoid the entire topic by excluding week from the model. The following diagram is the normalized data model for a calendar.
Figure 6 Simple time dimension model As you can see, these solutions are not very complicated. Nor are the tables very complex.
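The calendar structure (with week excluded, as decided above) is easy to generate programmatically. This sketch assumes an ISO-style day key and derives month, quarter and year from it; the key formats are illustrative choices.

```python
from datetime import date, timedelta

def calendar_rows(start: date, end: date):
    """Yield (day, month, quarter, year) member keys for a simple
    calendar hierarchy, one row per day, week deliberately excluded."""
    d = start
    while d <= end:
        quarter = (d.month - 1) // 3 + 1
        yield (d.isoformat(),
               f"{d.year}-{d.month:02d}",
               f"{d.year}Q{quarter}",
               str(d.year))
        d += timedelta(days=1)

# A short run across a quarter boundary.
rows = list(calendar_rows(date(1999, 3, 30), date(1999, 4, 2)))
```

Each tuple is one load record for the dimension table: the leaf day plus all of its parents, exactly the denormalized shape described in Figure 4.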
Figure 8 Pattern II: Alternate hierarchies

The essential difference between Pattern I and Pattern II is that at least two G1 tables have the same Ln descendant (L0 in our pattern): the dimensional structure is proper and complex. That is, at least two paths join at some point. There is no real difference between two paths joining or more than two paths. Since the Ln table is the same, these can be viewed as variations on the same dimensional structure. Since the split is clean, there is only one way to go from any generation level to Ln (and then to the fact table), so this remains a directed acyclic graph.

The problem with alternate hierarchies is mostly a problem with the front end, that is, how this is presented to the end user. Since the access paths are mutually exclusive, you probably need to present them to the end user as two different paths. Alternate hierarchies are so common that they may be the rule rather than the exception. Almost any company will have different ways of looking at products. Any company that has a fiscal year that is not a calendar year will probably want alternate time hierarchies. Fortunately, these are easy to deal with. The dimension table becomes wider, in order to accommodate the longer path (the sum of the two paths), and hence looks a good deal uglier, but it is still easy to build and supports very fast response times.

Dimension key: Same as before, except that it must contain the members from all paths.
Ln Parent: This is the foreign key in the Ln-1 table. These are the same as before, except that we must include the members from the different paths in different columns. It does muddy up the levels and generations somewhat. You may be tempted to reuse an Ln column, that is, to use it for two different hierarchies (after the split in the path). Resist this temptation! It will cause you trouble eventually, and once you start it will be very hard to correct.
Level id: Levels may be counted as before, except that we will have more than one level at some point.
Generation id: Generations are also counted as before. Again, we will have more than one occurrence of the same generation.

Figure 9 Alternate hierarchy dimension table
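The wider dimension row for alternate calendar and fiscal hierarchies might be sketched like this. The October fiscal-year start is an illustrative assumption, and note that each hierarchy gets its own columns rather than reusing one, per the warning above.

```python
def month_row(year: int, month: int) -> dict:
    """Build one dimension row for a month member, with separate parent
    columns for the calendar and fiscal hierarchies (FY starts in October)."""
    cal_quarter = f"{year}CQ{(month - 1) // 3 + 1}"
    fy = year + 1 if month >= 10 else year        # Oct..Sep belongs to FY of the Sep
    fiscal_month_no = (month - 10) % 12 + 1       # Oct=1, Nov=2, ..., Sep=12
    fis_quarter = f"{fy}FQ{(fiscal_month_no - 1) // 3 + 1}"
    return {
        "dim_key":     f"{year}-{month:02d}",
        "cal_quarter": cal_quarter, "cal_year": str(year),
        "fis_quarter": fis_quarter, "fis_year": f"FY{fy}",
    }

jan = month_row(1999, 1)    # Jan 1999: calendar Q1, but fiscal Q2 of FY1999
nov = month_row(1998, 11)   # Nov 1998: calendar Q4, but fiscal Q1 of FY1999
```

The row is wider and uglier, as promised, but a front end can expose either the cal_* path or the fis_* path as a clean, separate hierarchy.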
Time example
In the previous section, we mentioned that there are three basic hierarchies for time: calendar, fiscal and rolling. These hierarchies are shown below.
Figure 10 Multiple time hierarchies

Now that we have introduced multiple hierarchies, this solution is much more complicated and messy. This is an issue with multiple hierarchies more than with the approach itself. Another approach is to develop individual dimension tables. If we took that approach with the combined model, we would have the following dimensional structure.
Figure 12 Pattern III: Cyclic paths This is a complicated pattern; it represents an improper dimensional structure. In this pattern, we can see that, although there are alternate hierarchies, they have the same starting and ending point. Readers who are from the healthcare world will note that this is the relation
between hospitals, buying groups and manufacturers. Other examples are common: any business where there is an intermediate dealer (automobiles, consumer packaged goods) has the potential for this type of structure. This is more than a simple interface issue. An incorrect query may incorrectly compute the aggregated results. It also confuses the question-answer relation. The resolution is simple to explain, reasonably simple to implement, but often hard to live with (from a user's perspective). Since the dimensional data model does not tolerate this type of structure, it must be eliminated. That is, this pattern must be resolved into either Pattern I or Pattern II. Depending upon how you want to implement your database, there are different ways of resolving this:

1. Treat them as independent dimensions. Your descriptions to the end user (documentation as well as front-end training) must be clear. Make it a major point of discussion in your end-user training. Since this complexity is a result of the data and not a technical complexity, the users will be familiar with the issue. Indeed, they may be very helpful with the documentation and training problems.
2. Create alternate views. This is a variation on the first option, except that it has fewer physical tables. The tradeoffs are performance versus physical database complexity. The issues of training etc. remain.
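The second resolution, alternate views, might look like the following sketch. The healthcare tables and names are hypothetical: sales can roll up either via the buying group or via the hospital's own organization, and each path is exposed as its own view so neither query can wander through a cyclic path.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hospital (hosp_key TEXT PRIMARY KEY, group_key TEXT, org_key TEXT);
CREATE TABLE sales_fact (hosp_key TEXT, amount REAL);

-- Path 1: hospital -> buying group
CREATE VIEW sales_by_group AS
  SELECT h.group_key, SUM(f.amount) AS amount
  FROM sales_fact f JOIN hospital h USING (hosp_key)
  GROUP BY h.group_key;

-- Path 2: hospital -> hospital organization
CREATE VIEW sales_by_org AS
  SELECT h.org_key, SUM(f.amount) AS amount
  FROM sales_fact f JOIN hospital h USING (hosp_key)
  GROUP BY h.org_key;
""")
con.executemany("INSERT INTO hospital VALUES (?, ?, ?)",
                [("H1", "G1", "O1"), ("H2", "G1", "O2")])
con.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                [("H1", 100.0), ("H2", 50.0)])

by_group = dict(con.execute("SELECT * FROM sales_by_group"))
by_org   = dict(con.execute("SELECT * FROM sales_by_org"))
```

Both paths aggregate to the same grand total, so each view on its own behaves like a proper (Pattern I) dimension; the user simply has to choose the path before asking the question.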
Figure 13 Pattern IV: Splits and joins This pattern is a combination of Pattern II and Pattern III. It should have the same resolution as Pattern III. That is, create two different structures.
Figure 14 Pattern V: Types & Subtypes

Types and subtypes are used in an ER model when the entity is not homogeneous, that is, when the attributes of the entity depend upon some type of class membership. We might, for example, classify our sales representatives into two different types: full-line and partial-line. In the sample data warehouse below, we might classify the contracts into classes according to whether a contract is with a buying group, a hospital organization or an individual hospital. There are two common methods of handling this (in an RDB implementation). First, define a table that contains all of the attributes and only use those that are relevant to the type of the row being processed. Second, for each type, create a separate table that contains only those attributes that are relevant. Neither one of these is very satisfactory. Types and subtypes can be implemented in a data warehouse as a categorical dimension. Since these are discussed below, there is no need for further discussion right now.
Dimensional dependencies
Earlier, we mentioned dimensional dependencies. At that time, we talked about the dependencies of some calculations on the dimension level. The example that we used was Average Selling Price (ASP). To summarize, although ASP makes a great deal of sense at the lowest level of product, it makes less and less sense as we move up the hierarchy. The dependency that ASP has on product level is not true for other dimensions: the ASP for an entire year or a geographic region may be very important. This class of dependency depends on the processing and needs to be handled either in the load programs or in the dynamic query SQL. There is another class of dependencies that are purely structural. Product size and color are relevant at the SKU level (possibly even one level higher) but are not at all relevant for the highest levels of the product dimension. Similarly, there may be time data points that are only valid at the quarterly or annual levels. These dependencies are of a more structural nature and hence may be solved in the dimensional model.
There are really two problems with this class of dimensional dependencies. First, making sure that the user does not mix apples and oranges by requesting the product size for the highest dimension levels. Second, the physical problem of how to store these in a dimension table.

Denormalized dimension table: Each row in this structure stores the leaf node and all of its parents. In order to deal with the dimensional dependencies, we would have to repeat all of the dependent columns for each row. This would result in a huge amount of repetition. Although redundancy in a data warehouse is not necessarily sinful, this one particular sin becomes gluttony! The requirement to handle dimensional dependencies may force you away from a denormalized implementation.

Vertical dimension table: Remember that this approach implements a recursive structure. Each row consists of a <member, parent> pair as the primary key. However, the other columns must be the same for all levels of the dimension. The only two approaches here are to define the unique columns for each dependency but leave them null where they are not relevant, or to overload the columns so that their meaning depends upon the dimension level. The first approach is hugely wasteful while the second is very fragile and fraught with all kinds of problems.

There is another approach that trades off the problems of wasted space for having additional tables. This approach may be used with either the denormalized or vertical dimension table. We will isolate the columns for each level into their own table. This will reduce the space requirements to the minimum. The downside, of course, is that an extra join will be required in order to retrieve any characteristics. (The impact of this can be reduced by isolating only the dimensionally dependent columns.) There are a number of other advantages, however.
This approach makes it much easier to implement multiple hierarchies which retain some (but not necessarily all) of the dimensional dependencies. For example, we could have the three different time hierarchies that we mentioned earlier. If we isolated the daily (lowest level) and yearly (highest level) dependencies then these could be used for all three hierarchies. This last solution is one that continues the list of trade offs that must be dealt with when designing RDB-based decision support applications. However, it may be just what you need!
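The isolated-columns approach can be sketched as follows, assuming size and color are relevant only at the SKU level; the table names are illustrative. The dimension table stays lean, and the level-specific attributes live in their own table, joined only when needed.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product_dim (
    dim_key   TEXT PRIMARY KEY,
    l1_parent TEXT,
    height    INTEGER
);
-- Only leaf (SKU) members appear here; no null padding for higher levels.
CREATE TABLE sku_attrs (
    dim_key TEXT PRIMARY KEY REFERENCES product_dim,
    size    TEXT,
    color   TEXT
);
""")
con.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                [("SKU1", "Cereal", 0),
                 ("SKU2", "Cereal", 0),
                 ("Cereal", None, 1)])       # higher level: no size/color row
con.executemany("INSERT INTO sku_attrs VALUES (?, ?, ?)",
                [("SKU1", "Large", "Red"), ("SKU2", "Small", "Blue")])

# The extra join retrieves the dependent attributes; asking for the color of
# the 'Cereal' level simply returns nothing, so apples and oranges can't mix.
reds = [r[0] for r in con.execute(
    "SELECT d.dim_key FROM product_dim d JOIN sku_attrs a USING (dim_key) "
    "WHERE a.color = 'Red'")]
```

The structural fact that size and color exist only at the SKU level is now enforced by the schema itself rather than by user discipline.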
Summary
Virtually any data model that you come across will be based upon these patterns. Although the possible combinations are potentially very complex, dealing with them piecemeal and resolving them as described above should result in a valid star architecture.
Informational dimensions
One of the defining rules of the relational data model is that there is no relation between the attributes or columns in a relation. This is contained in the process model. Regretfully, this eliminates the possibility of creating informational dimensions. Note that this is one of the great strengths of many of the multi-dimensional database management systems (and conversely, one of the great weaknesses of the relational data model). Nevertheless, we will spend a few minutes discussing informational dimensions.

A well-designed fact table will have a single informational dimension. Although many different fact tables may have the same informational dimension, it is not necessary. If you are using an MDB, you should determine whether it will support a single or multiple informational dimensional structures. Some MDBs support only a single informational dimension while others support many. Make sure that a single informational dimension will not unduly complicate your database design. On the other side, you should be aware that supporting many informational dimensions may complicate the understanding of the information, so it is not a simple tradeoff.

Perhaps the best example of an informational dimension is a corporate balance sheet or chart of accounts. A sales and marketing (or manufacturing) data warehouse will have an informational dimension, but it is likely to be much simpler than the informational dimension of a financial data warehouse. The RDB concept of the informational dimension is embodied in the fact table. We have previously asserted that the fact table should contain a single informational dimension. Is this a criterion or a convenient design? The meaning of multiple informational dimensions is that the structural dimensions that define access to the information are not the same. For example, if we built a warehouse containing both manufacturing and sales information, some of the information would be dimensioned by plant and some by customer.
It is possible to place these into the same fact table and deal with it via types & subtypes. If your reaction to this is "Wow, that's weird," that's because it is. A better design would be to have separate fact tables with some common dimensions.
time child value. Hence, the quarterly beginning inventory level is the inventory level of the first month, and the ending level is that of the last month. How should these be modeled? The first issue (dimensional dependencies) is pure data modeling, but the second is a mix of data and process modeling. Dimensional dependencies can be easily modeled if the dependencies are orthogonal. Back to our ASP example... The ASP (at the leaf level of product) makes sense across all other dimensions: geography, time, sales organization, whatever. So this dependency is orthogonal. We can model this dependency in the normalized business model as we would any other attribute. ASP is simply not defined as an attribute of the higher levels of product. What if the dimensional dependencies are not orthogonal? If we follow a strict normalized model (as we are advocating here), this would require that the model contain all of the dimensional combinations that define these dependencies. This may be a rare circumstance; in fact, I cannot think of any examples! We may be saved by the fact that this doesn't happen very often. That is, we can follow the normalized model, which will be simple enough due to the small size of the problem.
Categorical dimensions
Categorical dimensions are typically used in support of some analytical process. One of the main distinguishing characteristics is that they are usually related to an attribute, rather than to a primary key. For example, income bracket is related to customer income. In a very real sense, categorical dimensions are often dimensions of dimensions. Categorical dimensions help us deal with the issue of attributes of dimensions. Some of the oddest things have to be categorized. For example, automobile paint color may be a very complex category. Someone who is in charge of planning plant capacity may need to analyze trends in order to evaluate utilization of the paint drying facilities (a large and expensive piece of equipment). They may also need to plan for new drying facilities. Since the auto industry probably has many different shades and tints of red and some may require different drying facilities, the ability to categorize sales by color can become a problem with expensive parameters. Another category is customer size. This may be determined by annual sales, potential sales, how tall they are, incomes, debt levels and so on. One of the primary uses of data mining is the categorization of customers (for example) into categories that no one knew existed before. It is common for a categorical dimension to be used in many queries, especially if you have different aggregate fact tables. If you have customer size, you will most likely want to see sales by customer size and product. (What are your heavy hitters buying?) This has many of the problems of partitioning dimensions, which are dealt with next. Whether you create categorical dimensions or leave the categorization process to a query depends upon the complexity of the categorization and how frequently it needs to be analyzed. The paint question may be very important but may not be asked very often. Hence, the cost of answering the question in anticipation (pre-aggregation) may not be justified. 
However, customer categories may be very complex (even moving into data mining) and the results may provide fruit for a great many analyses. The major question is what is the lowest level where the type/subtype starts and what is the highest level where it no longer makes sense? In the example of full-line and partial line sales representatives, the lowest level is sales representative, but the highest? Even if all sales representatives ultimately report to the same sales area manager (i.e. type is no longer organizationally defined) it may still make sense to separate full-line and partial-line at that level. It often makes sense across other dimensions as well. If two sales representatives can sell the same product, then it certainly makes sense to review product performance by sales rep type. This is usually an analytical not structural consideration. The type/subtype attribute is clearly in the base table, however, in order to avoid turning this into a snowflake (see below) we will need to include it in the dimensional table as well. Hence, we will add another column that contains the subtype classification values. Since we may do both selection and reporting by this attribute, it may (or may not) make sense to include it as part of the composite key. If you preaggregate this dimension, then it must be included as part of the aggregate key. If there are several categorizations that make sense at different levels (which must be mutually exclusive) then you may place them into the same column. This is possibly a dangerous solution so it should be avoided. The exclusivity requirement may be relaxed in the future and changing the implementation to accommodate the data will be very expensive.
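As a minimal sketch of the dynamic-categorization alternative described above, the code below derives an income-bracket categorical dimension from a customer attribute at query time. The bracket labels echo Figure 15, but the boundaries, function names and sample data are all hypothetical.

```python
# A categorical dimension derived from the customer income attribute.
# Bracket labels follow Figure 15; boundaries and data are illustrative.
BRACKETS = [
    (50_000, "Income Bracket #2: under 50,000"),
    (100_000, "Income Bracket #3: 50,000 -> 100,000"),
    (250_000, "Income Bracket #4: 100,000 -> 250,000"),
    (float("inf"), "Income Bracket #5: 250,000+"),
]

def income_bracket(income):
    """Map a raw income value to its categorical dimension member."""
    for upper, label in BRACKETS:
        if income < upper:
            return label

def sales_by_bracket(customer_income, sales_facts):
    """customer_income: {customer_id: income}; sales_facts: [(customer_id, amount)].
    The categorization is performed dynamically, at query time."""
    totals = {}
    for customer_id, amount in sales_facts:
        label = income_bracket(customer_income[customer_id])
        totals[label] = totals.get(label, 0) + amount
    return totals
```

Whether this mapping stays in the query (as here) or is materialized into the dimension table is exactly the tradeoff discussed above: static categories that are analyzed frequently justify materialization; volatile or rarely used ones do not.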
Figure 15 Mixed categorical dimensions
You can see how this dimensional structure would support very complex queries without the associated expense. It is also relatively easy to build the dimensional structure from a single query. Since the leaf members for the structure are customers, which are already in the fact table, this structure can be built without modifications to the fact table. In fact, many different versions of them can be built, as required and dropped when no longer needed. This flexibility is one of the great advantages of data warehouses built using an RDB.
[Figure 15 detail: Income Bracket #3 (50,000 -> 100,000), Income Bracket #4 (100,000 -> 250,000) and Income Bracket #5 (250,000+), each subdivided into Divorced, Married and Single.]
Partitioning dimensions
This type of dimension demonstrates another weakness with current data modeling tools. What we want to happen is very easy to explain. How we implement it is rather tedious but not difficult. However, including these in the model makes it look awful. For example, if we wanted to track plans vs. actuals in the database, we would create a partitioning dimension. This foreign key would appear in every table where we wanted to track plans vs. actuals. If we have a single fact table that contains all of the aggregate information as well as detail, there is no problem. However, if we create individual aggregate tables, then the scenario foreign key must appear in each one. This makes the model look awful. Regretfully, these are very common in data warehouses. Especially those that are developed to monitor company performance. One of the advantages of multi-dimensional databases is the ease with which they can create and use partitioning dimensions. For example, it is common to see the structure of time separated from the repetition of time. That is, we may see one dimension that is simply the structure of a year: month -> quarter -> half -> year. And a separate calendar dimension that contains the different years: 1996, 1997 and so on. Adding a new year to the database is simply a matter of adding a new member to the calendar dimension. Adding a new year to a relational database usually requires that each month, quarter, half also be added. There are disadvantages to this approach that we will leave for another discussion. Calculating performance in relation to the prior year is now simply a matter of performing a calculation on the two members. Indeed, an MDB with a powerful calculation engine will allow you to perform calculations on any members. So that prior year variance may be calculated as follows: 1997 Prior Year = 1997 - 1996. We have called these partitioning dimensions because they are an obvious candidate for creating partitions in the data warehouse. 
That is, if your fact tables are large enough to warrant splitting them, then an obvious choice is to use the partitioning dimension (if one is present).
There are other examples. If you have Actual and Budget, you will definitely want Actual vs. Budget (differences as well as percentages). And if you have Reforecast, you will want Actual vs. Reforecast and Budget vs. Reforecast. Plus Three Month Averages of all of them. As you can see, creating columns for all of these will become a D.B.A. nightmare. However, if you make them a partitioning dimension, your life becomes much easier. In summary, the guideline is that if you have data points that are described as an adjective of your base data (e.g. Actual Sales), they should probably be a partitioning dimension.
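The guideline above can be sketched in code: with a scenario partitioning dimension as part of the fact key, a new comparison is a calculation on members rather than a new column. The member names and figures below are invented for illustration.

```python
# Facts keyed by (scenario, month): the scenario partitioning dimension lets
# Actual, Budget and Reforecast share one fact structure, so Actual vs. Budget
# is a calculation on two members, not an extra column. Data is hypothetical.
facts = {
    ("Actual", "Jan"): 120, ("Budget", "Jan"): 100, ("Reforecast", "Jan"): 110,
    ("Actual", "Feb"): 90,  ("Budget", "Feb"): 110, ("Reforecast", "Feb"): 95,
}

def variance(facts, month, a, b):
    """Derived member, e.g. Actual vs. Budget = Actual - Budget."""
    return facts[(a, month)] - facts[(b, month)]

def variance_pct(facts, month, a, b):
    """The same comparison expressed as a percentage of the base member."""
    return 100.0 * variance(facts, month, a, b) / facts[(b, month)]
```

Adding Reforecast vs. Budget, or any other pairing, requires no schema change at all, which is the point of treating the scenario as a dimension.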
Structural changes
Structural changes are controlled. The current conventional wisdom is that the value of the historical structure rapidly diminishes with time. Hence, there is usually little need to retain them. Conversely, there is little complexity in retaining them. Creating them as an alternate hierarchy will almost always be satisfactory. My belief is that the current disinterest in history only masks the real issue. Realignment analysis requires that we support multiple structures. If we can (easily) do it for realignment, we can do it for history. This is primarily a cost-benefit analysis.
Attribute changes
These are more difficult to deal with. They happen sporadically and without predictable timings or limits. The value of tracking is almost always very difficult to predict. However, if it is possible to track these (and retain the history) then this should be explored with the users in great detail. Remember the earlier discussion on the questions that we can ask versus the questions that we should ask. The ability to perform event-based analysis can dramatically increase the ROI of the data warehouse. However, in order to be able to do that, you need to track the different events! This is a fancy term for tracking the historical changes.
Type I
In this solution, we overwrite the changed attributes and do not keep track of the history. This reduces the set of answers that we have available. Again, this may be good enough. Technically, this is the simplest solution.
Type II
In this solution, we create a new record that contains all of the attributes, including the changed ones. Essentially, this solution will let us create 'snapshots' of the values at a point in time. This may allow us to track all of the historical changes but introduces a level of complexity into the data warehouse that may not be justified. Note that this approach is actually quite common in the insurance industry, where an insurance policy might have to be re-created.
Type III
Add additional attributes to the entity. These would track some amount of historical change; say current and previous. If you implement this approach, don't forget to include the date of change. This has some obvious limitations.
Suppose that we look at changing dimensions from a strict data modeling view. If marital status was dependent upon time, then it should be placed into a separate entity: Historical Marital Status. The composite key would be Customer Id and Time Id. The attribute would be marital status. Each attribute that has to be historically tracked would be dealt with in this way. Remember when we said that one of the disadvantages of normalized form was that we might create many small entities, which the users might logically think of as one larger entity? Here is an example of that! However, there may be a compromise solution. If we kept the current marital status in the Customer entity and placed the previous values into the Historical Marital Status, then we would have the current status (the most popular one) available without performance compromise and the historical values when they were needed. The storage requirements should be relatively modest. The only queries that would be impacted are the ones that should be impacted. Hence, we propose a 4th type:
Type IV
Place the historical values in a special entity. This would be used for the special analytical requirements. This historical table should contain both the starting and ending dates for the period. Although only one of these is necessary (the other can be inferred) having both simplifies some of the queries.
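The Type IV compromise can be sketched as follows: the current status lives on the Customer entity, closed historical periods (with both start and end dates) live in a separate history entity. All names, dates and data below are hypothetical.

```python
# Type IV sketch: current marital status stays on the Customer entity;
# prior values live in a history entity keyed by customer and period,
# carrying both start and end dates. Data is invented for illustration.
customers = {1: "Married"}  # current status, the most frequently queried value

# (customer_id, start_date, end_date, status) for closed historical periods only
history = [
    (1, "1990-01-01", "1994-06-30", "Single"),
    (1, "1994-07-01", "1996-12-31", "Divorced"),
]

def status_as_of(customer_id, date):
    """Point-in-time lookup; ISO date strings compare correctly as text."""
    for cid, start, end, status in history:
        if cid == customer_id and start <= date <= end:
            return status
    return customers[customer_id]  # no closed period covers the date: current value
```

Queries about the current status never touch the history table, which is the performance property the compromise is after; only the special analytical queries pay for the extra join.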
Enduser considerations
Perhaps more difficult than the technical problem is the problem of presenting these to the user in a meaningful way. The example question that we had before: "What did married people buy last year?" is a perfect example. This is a really innocuous question! But imagine translating it into an executable query. We would probably use a subquery that selected the people who were married last year and then select their purchases. Seems straight-forward enough. However, now imagine explaining that to the users or having some query tool translate it. Much more difficult. This type of complexity is not something that we should force on our users. However, we should not use this as an excuse to avoid the problem.
Trend analysis
The normal use of the term trend analysis refers to a time-series analysis of the facts. However, the implication of slowly changing dimensions requires that we also consider the change in the dimensions as well. That is, trend analysis must also refer to trends in the dimensions and how this impacts the selection of the facts. Let us return to the marital status example. Suppose we are studying the long-term buying habits of married versus single people and we do not have the historical changes. As the length of the time period increases, so will the inaccuracy of the answer. Although this is the simplest model to implement, it fails to provide the level of accuracy that we may require. Suppose however, that we do have the marital status changes. What does this mean? The normal SQL join is based upon selecting dimension values that are independent of each other. Suppose we are selecting market, customer, time and product. The join process will select the rows based on market, then keep the rows based on customer, then time, then product. Our example radically changes this, since we now want to select a time period that changes from customer to customer. Indeed, a given customer may have two non-contiguous time periods where they were married. Rather than performing a series of single-column joins between a dimension table and the appropriate key column in the fact table, at least one join will be a multi-column join based upon a join of multiple dimensions. In the marital status example, we would create a query that would select the appropriate customers and their time periods. These time periods would be based upon the historical marital status. This is then joined to the fact table (along with market and product, as required) to select the fact rows. In the figure below, the 1st query produces the temporary table "Selected Customers". The 2nd query produces the final results.
Figure 16 Slowly changing dimensions - query example
The 1st query can be a subquery if your SQL supports a multi-column subquery - most don't. It is more likely that this will have to be multiple queries, with a temporary table. Note how this complicates the data warehouse. Yet, it may be unavoidable. Note how this process drives in from the dimension tables. It is also possible to answer this question by driving out from the fact table. However, this approach does not appear to eliminate the temporary table and may severely impact performance. For these reasons, we do not discuss it here but leave it for the casual reader to figure out!
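The two-step process can be sketched in SQLite: the 1st query builds the temporary "selected customers" table of customer/period pairs from the historical marital status, and the 2nd performs the multi-column join into the fact table. The schema, month numbers and amounts are hypothetical.

```python
import sqlite3

# Sketch of the two-step slowly-changing-dimension query.
# Schema and data are invented; months stand in for the time dimension.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE marital_history (customer_id INT, month INT, status TEXT);
CREATE TABLE sales_fact (customer_id INT, month INT, amount INT);
INSERT INTO marital_history VALUES (1, 1, 'Married'), (1, 2, 'Single'),
                                   (2, 1, 'Single'),  (2, 2, 'Married');
INSERT INTO sales_fact VALUES (1, 1, 10), (1, 2, 20), (2, 1, 30), (2, 2, 40);
""")
# Query 1: the temporary table of customers and their married time periods.
con.execute("""CREATE TEMP TABLE selected AS
               SELECT customer_id, month FROM marital_history
               WHERE status = 'Married'""")
# Query 2: a multi-column join (customer AND time) into the fact table,
# rather than the usual independent single-column dimension joins.
total = con.execute("""SELECT SUM(f.amount) FROM sales_fact f
                       JOIN selected s ON f.customer_id = s.customer_id
                                      AND f.month = s.month""").fetchone()[0]
```

Here the selected periods are (customer 1, month 1) and (customer 2, month 2), so only those fact rows survive the join; each customer contributes a different slice of time, which is exactly what the ordinary independent dimension joins cannot express.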
Physical model
Each of the models in the (now) four model architecture serves a particular purpose. The physical model is often necessary to achieve performance objectives. If we had computers of infinite speed and resources, there is no doubt that we would implement the business or dimensional model. Regretfully, this is not the case! Today's machines are seriously over-taxed when confronted with large scale data warehouses and the corresponding queries. Hence, a good physical model may often be the difference between success and failure of the data warehouse. With equal regret, we state that the physical model is often dependent upon a particular DBMS or hardware configuration. This is, after all, what we mean by physical model! The guidelines that we will give here are as general as we can make them without focusing on specific DBMS or hardware products.
General objectives
With rare exception, data warehouse performance will be limited by input-output processing. Remember, however, that most performance problems are design related. Sometimes the design issues are contained in the DBMS and sometimes in the particular data warehouse. If you have significant performance problems (those that need to be cured by reductions of orders of magnitude) review your physical model.
Parallel processing
The value of parallel processing is reached when we can decompose a single operation or query into parallel streams and execute them simultaneously. That is, we distinguish between parallel processing and multi-threading. It is easy enough to decompose an SQL query along the leading edge of an index. If that index corresponds to the physical sequence of the table, then we can execute the parts of the query at near disk speeds. However, parallelization along the trailing edge is vastly more difficult. And since the relation between the logical access (index) and physical access (disk layout) may be zero, the performance may actually be worse than single-thread processing. In general, if the type of processing that you will expect to perform will benefit from parallel processing, make sure that the DBMS you use will allow you to physically structure the disks so that parallel processing is not limited to the physical sequence of the rows.
Basic considerations
This section summarizes some approaches that we assume will always be in place. 6. Synthetic keys. Textual keys should be replaced by numeric (usually integer) keys. This can have significant impact on the storage required for both the data and the indexes. We don't care about the disk storage as much as the time to process it. Search times for indexes composed of integer synthetic keys will be much faster than indexes composed of the textual keys. 7. Transitive dependencies. The attributes in a relation are transitively dependent on the relation of all weaker entities. For example, an attribute of product group is transitively dependent on the relation product subgroup. The process of normalization removes them from the weaker entities and places them in the stronger entities. However, it may be beneficial to denormalize these and place them in the weaker entity to improve performance. We might, for example, place the name of the product group in the product subgroup entities in order to avoid a join to the product group table simply to retrieve the group name. The casual (but astute) reader will note that the dimension tables described above follow this process to its complete conclusion.
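The synthetic-key approach can be sketched as a load-time step; the textual key values below are invented for illustration.

```python
# Replace textual keys with compact integer surrogates at load time.
def assign_synthetic_keys(textual_keys):
    """Return a mapping textual key -> small integer surrogate
    (deterministic for a given input set)."""
    return {key: i for i, key in enumerate(sorted(set(textual_keys)), start=1)}

key_map = assign_synthetic_keys(["SKU-00988-RED", "SKU-00417-RED", "SKU-00417-BLUE"])
# Fact rows then carry the small integer, and indexes are built on it;
# the textual key survives only as an attribute of the dimension table.
```

A production load would of course persist this mapping so that surrogates stay stable across loads; the sketch only shows where the substitution happens.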
Structural changes
The basic tenet of i/o performance improvements is to reduce the number of read operations. Although this sounds obvious, it can be quite a balancing act. Suppose that we have a categorical dimension (income) which has enough access that we are justified in treating it structurally (rather than dynamically). One possibility is to leave the income data in the fact table alone and always access the categories through a categorical dimension. We can improve the performance by adding the categorical dimension to the fact table. Do not remove the base income fact, however. This increases the size of the rows but means that we will have a much simpler join (faster query).
The tradeoff here is load processing versus query processing. If your data warehouse is physically small but structurally and/or computationally complex, there may be a very strong argument for placing the partitions and aggregate in the base fact table. This approach is discussed in more detail in a different publication.
Categorical dimensions
Category analysis can be a very large performance drain on the system. Since categories are usually brackets, evaluating them on a large scale requires a substantial amount of CPU resources. The dependency on an attribute (rather than an indexed foreign key) may require full-table scans for some queries. The most effective way to improve this is to place the category either in the dimension table or in the fact table itself. The tradeoffs here are an improvement in performance versus a loss of flexibility. If the categories are fairly static and the analysis fairly frequent, then this should be a good choice. However, if the categories change or the analysis is infrequent, consider performing the categorization dynamically. Another advantage to dynamic categorization is that it is possible to maintain many categories. We could, for example, have several different methods of categorizing income. Placing the categories into the dimension table or the fact table restricts that flexibility.
[Figure 18 detail: an adjacency matrix over the entities Hospital, Hospital Org, Contract, Buying Group, Buying Group Org, Territory, District and Region, with 1s marking the foreign-key relationships between them.]
Figure 18 Partial Health Care Adjacency Matrix
The adjacency matrix is constructed by placing a one in the { row, column } cell where row is an entity that contains a foreign key relationship to column. Hence, we will find a 1 in the cell { Day, Week }. We can use the row totals to determine the number of foreign keys in an entity. In the retail model, the only row with a total greater than 1 is the Sales History fact table. Conversely, the column totals indicate the number of tables in which column appears as a foreign key. In a valid star architecture, the only row with a total greater than 1 will be the fact table. Its total should be the number of dimensions connected to the fact table. There should be no columns with a total greater than 1. In the Health Care Adjacency Matrix, we notice that there are both rows and columns with totals greater than 1. This indicates the presence of some cyclic paths. In particular, there are two paths between Hospital Org and Sales History. As we have seen in the discussion above, the cyclic paths need to be resolved into acyclic paths before this will be a valid star architecture.
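The row-total and column-total checks can be sketched as code; the edge list and entity names below are illustrative, not taken from either model.

```python
# Validate the star-architecture rules from the adjacency matrix:
# row totals count foreign keys an entity holds, column totals count
# tables in which an entity appears as a foreign key.
def star_check(edges):
    """edges: (row, column) pairs meaning row holds a foreign key to column.
    Returns (rows with total > 1, columns with total > 1)."""
    row_totals, col_totals = {}, {}
    for row, col in edges:
        row_totals[row] = row_totals.get(row, 0) + 1
        col_totals[col] = col_totals.get(col, 0) + 1
    return ([e for e, t in row_totals.items() if t > 1],
            [e for e, t in col_totals.items() if t > 1])

# A valid star: only the fact table shows a row total above 1,
# and no column total exceeds 1.
multi_fk_rows, shared_columns = star_check([
    ("Sales History", "Product"), ("Sales History", "Store"),
    ("Sales History", "Day"),
])
```

Running the same check on a model with two paths between an entity and the fact table would surface that entity in the second list, flagging the cyclic path that must be resolved.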
Complexity
One additional measure that can be derived from the adjacency matrix is that of database complexity. There are several different ways of looking at a data warehouse and talking about its complexity. Sheer size is, of course, the obvious one; and perhaps the most popular. Another is the degree of connectedness. Intuitively, a database that has only simple paths connecting the entities should be much easier to understand than one that has many paths, especially if they are cyclic. However, since our design objective is to turn the database into a valid star architecture and that is, by definition and design, very simple, we would need to measure this type of complexity on the original normalized ER model. Since we are measuring complexity, we would like the metric to reflect that; hence, the more complex the database is, the larger the metric should be. It still seems that we should not define a metric in terms of the raw number of paths, however. A database with two dimensions that have several cyclic paths should be considered more complex than a database with six dimensions and no cyclic paths. In particular, the measure should not respond linearly to increases in complexity. Intuitively, it seems that a database with 100 tables is more than twice as complex as a database with 50 tables. A database that is totally connected (every entity can be accessed through some path to every other entity) will have (n^2-n)/2 connections. This is a completely filled adjacency matrix with the diagonal removed and the connections counted only once. Since the paths in a relational database are actually bi-directional, we can use the same equation to measure how connected the database actually is: (p^2-p)/2 where p is the number of paths. Our complexity measure is the ratio of these two (multiplied by 100). Which is: ((p^2-p) / (n^2-n)) * 100 where n is the number of entities and p is the number of edges connecting them. Note that the divisions by 2 simply cancel each other.
The normalized health care model contains 19 tables with 18 paths; the star model contains 9 tables with 8 paths. The complexity measures are 89 and 78, respectively. These are relatively close; indicating that we simplified the model somewhat. Just for comparison, the retail normalized model has 20 tables and 19 connections; while the star model has 5 tables and 4 connections. These have complexity measures of 90 and 60, respectively. This supports our belief that the retail star model is much simpler than the normalized model. And that, although the normalized models are roughly the same, the retail star model is significantly simplified. Although it seems that this measure makes intuitive sense, we will need to validate it in many real world examples.
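The measure is easy to check directly; the function below implements the formula from the text and reproduces the quoted figures for both models.

```python
# Complexity measure from the text: ((p^2 - p) / (n^2 - n)) * 100,
# where n is the number of entities and p the number of paths.
def complexity(n, p):
    return round((p * p - p) / (n * n - n) * 100)

# Health care: complexity(19, 18) -> 89 (normalized), complexity(9, 8) -> 78 (star)
# Retail:      complexity(20, 19) -> 90 (normalized), complexity(5, 4) -> 60 (star)
```

Note how the squared terms make the measure respond non-linearly, as the text requires: halving both tables and paths more than halves the score.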
Supporting tools
Data modeling tools
At the time of the original publication (early 1996) there were many tools on the market that did a wonderful job of modeling the structural dimensions. This is, after all, the sine qua non of relational databases. If you have a complex informational dimension, they should also be capable of modeling that. Keep in mind, however, that this is a foreign concept to the relational model. Since that time, a number of the leading tools have started to ship versions that purport to handle data warehouses. However, they continue to remain grounded in the relational data model and, hence, are still unable to distinguish between structure and semantics. Until such time as the vendors start to understand the differences, we will be forced to "play games" with their products to get them to work right. Of course, since we are playing very much the same game with the RDBs, at least there is consistency in what we have to do! Ideally, we would like the data modeling tool to allow us to draw all three models: the business model, the dimensional model and the physical model. The tool should provide some reference checking to validate integrity between the models. Since the relational model itself does such a poor job of partitioning dimensions (and some categorical dimensions) we can not expect the tools to do a good job. And they don't! In order to maintain some semblance of clarity in the model, I would suggest leaving them out, except where it may simplify the actual creation of the database. Since dimensional modeling has become such an important part of today's D.B.A. activities, we may see some changes to deal with these issues. Indeed, there are rumors that some vendors are working on this problem. Since some of the difficulty lies with the relational model itself, it will be very interesting to see how they deal with the problem.
Query tools
Implementing query tools for a dimensional structure is simultaneously much simpler and much more difficult than for a 'traditional' normalized model. If we have a single fact table (that includes the aggregate data as well) then there is really only one query to deal with. Optimization, indexing and so on, become much simpler problems. The advantages of a single fact table (with detail and aggregate data) are discussed in one of our other white papers. See the last section below. Query tools are more difficult since they must be able to provide a clear differentiation between questions that have subtle differences in statement. Our marital status example is an excellent one; we have already discussed this enough! Another example is given in our sample data warehouse: health care manufacturers. Hospital buying groups sign contracts with manufacturers in order to receive better pricing. However, individual hospitals may not participate in some contracts. That is, a member of a buying group may purchase the products covered under a contract with one buying group via a different contract with a different buying group. Hence, the question as to what a hospital buys when they are a member is different from the question about what they buy under the group contract. This is a subtle yet enormous difference. I have always thought that a query tool that presented the ER diagram would be wonderful from the enduser's perspective. However, most ER models are way too complex for an enduser to wade through. The star architecture has the opportunity to provide an extremely simple query interface. The diagram below shows a star architecture in the form of a radar graph. The enduser would simply click on the portions of the diagram in order to guide the query tool. It is complicated somewhat by the fact that retrieval must be separated from reporting, but that is relatively straight-forward. Unfortunately, I have never seen a query product that uses this interface. Pity.
Later thoughts
The starburst model shown above will work well for very simplistic models. For example, a single star fact table with a limited number of dimensions. (Too many dimensions simply fills the screen.) However, if you remember, one of the wonderful things about FCDAGs is that they can be developed as a recursive structure. Hence, we would be able to take a very complex data warehouse and represent it as a number of connected sub-FCDAGs. We can then take the starburst model and use it to provide a recursive view of the data warehouse. The primary advantage to this approach is that most users really only care about their little slice of the pie. So if we presented an enterprise view of the corporate information (in the broadest sense of the term), they wouldn't care about it anyway. Except possibly on very rare occasions. So, the theory goes, we could represent the enterprise as a series of nested FCDAGs; each of which might (or might not) be a separate data warehouse/mart whatever. The user could navigate through this by selectively looking at different starbursts, adding what they want to the query and then letting the query processor do whatever it needs to do to make this happen. Since database navigation has been one of my primary interests over the last fifteen years and since I believe that the concept of an enterprise data warehouse is fundamentally flawed, this approach is one that seems to have a great deal of possibility. At Telos Solutions, we are, in fact, building such a query tool today. (The following is a paid commercial announcement.) Trinity is based upon the premise that the best logical representation of a data warehouse is an FCDAG, which, in turn, provides a mechanism for building up a logical view of the enterprise data. Individual users will be able to focus on their own little starburst as they need to or be able to navigate (and extract from) the entire corporate information repository on those rare occasions when they must.
Trinity is currently shipping its first version of this approach. The product combines data warehousing, data mining and visualization for complete Knowledge Discovery in Databases. The semantic approach to the data warehouse is more powerful and more sustainable than the structural approach taken by nearly all other query tools.
I have taken some liberties with this model, in order to demonstrate some of the topics. Any discrepancies between this and Kimball's work or between this and reality are my fault.
Figure 20 Normalized Retail model
As you can see, the retail model is quite simple. There are four dimensions and three of them are Type I Patterns. Hence, these may be denormalized directly into dimension tables. Even the fourth dimension is a Type II Pattern and so it may be denormalized directly into a dimension table. The star schema is shown in the following diagram. We have identified the member names in the dimension tables.
Despite the simplicity of the retail model, there are some extensions to it that can make the data warehouse as a whole more interesting. The most important of these is promotions. As anyone from this industry will tell you, deals and promotions are the lifeblood of retail. The ability to evaluate the impact of promotions should provide one of the best benefits of a data warehouse. What challenges will this requirement pose to the data warehouse architect? Earlier, in the section on Trend analysis, we noted that we need to couple the changes in the dimensions with the fact trend analysis. In order to determine the effectiveness of a promotion, we must be able to analyze product sales during several different periods of promotion, as well as during periods of 'non-promotion'.
Figure 22 Health care data model
There are several dimensional structures on this diagram. 8. Sales Organization (Area, Region, District and Territory). This is a Pattern I structure. 9. Product, containing two alternate structures, one for manufacturing (Facility, Assembly Line, Subgroup) and one for marketing (Family, Group and Subgroup). This is a Pattern II structure. 10. A customer dimension (Hospital Org, Hospital and Customer). 11. A buying group dimension (Buying Group, Contract). The most interesting structure is the Buying Group Organization/Hospital Organization. It is complicated by the many-to-many relation between buying groups and their members. How should we model this? It is, perhaps, made more clear by understanding the differences between these two structures. Even though a hospital belongs to a buying group and that group may have a contract with the manufacturer, the hospital may be buying a product through another buying group or its own personal contract. Hence, the dimensional structure must be able to distinguish between what the group member sales are (through any contract) and what the group contract sales are. Hence, we should decompose these into two patterns, as defined below. As anyone who has modeled this structure knows, the difference between these two questions is huge and not a little bit confusing. It is a good example of how important it is to provide clear differences to the enduser. 7. Customer sales: Hospital Org, Hospital and Customer and the alternate structure Member Sales: Buying Group Org, Buying Group, Members, Customer. 8. Group Contract Sales: Buying Group Org, Buying Group and Contract. The final result of this analysis is shown in the diagram below. Note the diamond shape connecting some of the dimensional tables to the fact table. This indicates that these are alternate hierarchies joined to the same key in the fact table.
Figure 23 Dimensional tables
In all of the tables, the domain for the primary key (e.g. Customer ID) is the union of the domains for the other dimensional members. For Customer ID, that would be Customer Id, Hospital Id, Hospital Organization, Buying Group Id and Buying Group Organization. The Customer ID should be changed to guarantee uniqueness. Now that we have some meaningful examples, it might be more clear how queries will work. For example, suppose that we want to retrieve all of the products in a particular group. The SQL is select product id where product group = groupx and level identifier = 0. We specify level identifier = 0 since we want product id, which is the lowest level. If we had wanted to retrieve all of the subgroups in a group we would have used select product id where product group = groupx and level identifier = 1. In actual practice, every query will contain all of the dimensions. This allows performance tuning to proceed along a well-defined path. Although it is not required, this structure assumes that the fact table contains all of the aggregate data as well. There are some strong advantages to this approach that are discussed in another white paper (see the following section).
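The level-identifier queries described above can be sketched in SQLite; the table layout, column names and member names are all hypothetical.

```python
import sqlite3

# A flattened product dimension: every member appears as a row carrying the
# full path plus a level identifier (0 = product, 1 = subgroup, 2 = group).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product_dim (
    product_id TEXT, product_subgroup TEXT, product_group TEXT, level_id INT);
INSERT INTO product_dim VALUES
  ('p1',     'sub1', 'groupx', 0),
  ('p2',     'sub1', 'groupx', 0),
  ('p3',     'sub2', 'groupx', 0),
  ('sub1',   'sub1', 'groupx', 1),
  ('sub2',   'sub2', 'groupx', 1),
  ('groupx', NULL,   'groupx', 2);
""")
# "All products in groupx": level 0 restricts the answer to leaf members.
products = [r[0] for r in con.execute(
    "SELECT product_id FROM product_dim "
    "WHERE product_group = 'groupx' AND level_id = 0")]
# "All subgroups in groupx": the same query shape at level 1.
subgroups = [r[0] for r in con.execute(
    "SELECT product_id FROM product_dim "
    "WHERE product_group = 'groupx' AND level_id = 1")]
```

Because every query differs only in the level identifier, the access path is uniform, which is what allows performance tuning to proceed along a well-defined path.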
Summary
The development of the entity relation model of a dimensional data warehouse is really a logical extension of practices that have been in use for some time. There are a few patterns that occur repeatedly, and once these are understood, dimensional analysis can proceed quickly. Although these patterns do not cover every structure, they cover the majority of cases, leaving the analyst free to focus on the structural issues that are unique to their model. To some extent, this paper is unsatisfactory. There are some areas where my comments are limited to statements like: "This is ugly." However, I plead innocent by reasons of insanity. The analytical requirements of a data warehouse (and analytical applications, in general) require things like categorical and partitioning dimensions. While these may or may not fit well into the relational data model, they are surely a bad fit for data modeling tools. Data warehouses are organic. They grow continuously and in directions that no one will be able to predict. The strength of the dimensional structure is the ease with which it can support that growth. Thinking of data points as dimensions rather than attributes can make adding new data points much easier. Basically, follow the rule that base data points should be attributes and derivations of the base data points should be another dimension.
Aggregation strategies
Introduction
This chapter discusses different methods of aggregating star architecture data warehouses. Since many of the questions that a user will ask can require the aggregation of hundreds or thousands of rows, pre-aggregating some or all of the data warehouse can dramatically reduce query response time. The tradeoff, of course, is that the physical database is significantly larger. This chapter will only briefly look at the pros and cons of pre-aggregating. Our main focus will be on different ways to perform the aggregations. There are two major structural approaches to pre-aggregation. The first approach will store each set of aggregate information in a different fact table. The second will store the detail data and the aggregate data in the same fact table. Although the strategies that we will discuss focus on a single fact table, the following section discusses some of the advantages and disadvantages of each.
Figure 25 Dimension compression ratios

As you can see from our example, there will be about a 58% increase in the number of rows from the base fact table to a fully aggregated fact table. The cost of storing this can be easily compared to the cost of dynamic calculations. Another element that needs to be considered is the impact that this much larger table will have on query performance. Given a small enough table, the impact is negligible. However, for large tables, the impact can be substantial; possibly even unacceptable. The following comparison may help.

Number of tables. The multiple aggregate table approach requires a separate table for each set of aggregate data. The extreme case is a fully aggregated database; for the sample data warehouse, this will require 9 one dimensional tables, 18 two dimensional tables, and so on. By definition, the single aggregate table approach requires only a single table!

Aggregation operations. Each of the multiple tables will require a separate aggregation operation. This, of course, implies a separate SQL statement. Even though the SQL statements can be generated, the sheer number of them increases the chance for failure and greatly complicates the restartability of the load processing. It is not uncommon for the calculations to require several passes over the data (see below), so the large number of statements is exacerbated by each additional pass, although some of these second pass calculations can be run simultaneously. For the single table, we will present an aggregation strategy that limits the number of operations so that they increase linearly with the size of the dimensions. This reduces everything about the system: the aggregation time, its complexity, the difficulty of restart and so on.

Table size. On the plus side, the multiple tables will individually be smaller. On the minus side, the single aggregate table can be astoundingly large; it is not uncommon to see a growth factor of 2^N, where N is the number of dimensions. However, it is unlikely that the total database size will be any larger (or smaller, for that matter).

Navigation. As a result of the large number of tables, the multiple table approach requires an aggregate navigator. Since there is only a single table in the other approach, no navigator will be required.

Maintenance. Adding new aggregate tables may be complicated if an aggregate navigator is not being used, and adding new facts may be horrendous whether or not one is being used: perhaps all of the aggregate tables need to be changed, along with all of their SQL statements, plus the impact on the query tool. With a single table, obviously only that table needs to be changed, and if the aggregation process has been designed properly, the SQL updates should be relatively small.

Figure 26 Multiple vs. Single Aggregate Tables
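The table-count and row-growth arithmetic can be sketched for a hypothetical three-dimension warehouse. The dimension names and level cardinalities below are made up for illustration (so the growth ratio comes out differently than the 58% figure for the paper's sample warehouse), but the counting logic is the same:

```python
from itertools import product

# Hypothetical dimensions: member counts at each level, detail level first.
# E.g. time has 12 months, 4 quarters, and 1 grand total.
dims = {
    "time":    [12, 4, 1],    # months, quarters, total
    "product": [100, 10, 1],  # products, groups, total
    "store":   [50, 5, 1],    # stores, regions, total
}

# One table per combination of levels across the dimensions. The all-detail
# combination is the base fact table; every other combination is an aggregate.
sizes = list(dims.values())
level_choices = [range(len(s)) for s in sizes]

base_rows = 0
aggregate_rows = 0
n_aggregate_tables = 0
for combo in product(*level_choices):
    rows = 1
    for d, lvl in enumerate(combo):
        rows *= sizes[d][lvl]           # rows = product of level cardinalities
    if all(lvl == 0 for lvl in combo):
        base_rows = rows                # the detail fact table
    else:
        n_aggregate_tables += 1
        aggregate_rows += rows

print(n_aggregate_tables)                     # 26 separate aggregate tables
print(round(aggregate_rows / base_rows, 2))   # row growth vs. the base table
```

With three levels in each of three dimensions, the multiple-table approach needs 3^3 - 1 = 26 aggregate tables (each with its own load statement), while the single-table approach stores the same rows in one table about 76% larger than the base data under these assumed cardinalities.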
SQL-based aggregation
This approach uses SQL for all aggregation and calculation processes. Although it is not the best performing approach, it is the simplest. This might make a good first cut at the overall aggregation process. Subsequent improvements might focus on individual long-running steps.
Multiple hierarchies
It is very common for a particular dimension to have alternate structures. Product, for example, may have several different structures. In the earlier sections, we talked about how to design the dimensional structure. That analysis should result in a fairly clean structure. However, each structure will have to be aggregated independently. It is not difficult, simply tedious.
Sequence determination
One of the performance issues that can be addressed is the sequence in which the different dimensions are aggregated. The results, both in terms of the values and the number of rows, should not depend upon the order of the dimensions (subject to the non-additive facts discussed below). However, the performance can be greatly affected. Since the length of each aggregation operation depends directly upon the number of rows going into it, we would like to keep the number of rows as small as possible for as long as possible. Hence, the sequence of aggregation should handle the smaller dimensional structures first. You can use the compression ratios that you calculated earlier to determine this sequence. Note that this sequence minimizes the aggregation time relative to any other; it does not reduce the number of rows.
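The effect of sequence on cost can be illustrated with a small model. The expansion factors below are hypothetical (each one is the ratio of members at all levels to members at the detail level), and each pass is costed simply as the number of rows it has to read:

```python
from itertools import permutations

# Hypothetical expansion factors: aggregating a dimension multiplies the
# current row count by (members at all levels) / (members at detail level).
factors = {"time": 17 / 12, "product": 111 / 100, "store": 56 / 50}
base_rows = 60_000

def sequence_cost(order):
    """Total rows scanned: each pass reads every row produced so far."""
    rows, cost = base_rows, 0
    for dim in order:
        cost += rows            # this pass scans the current table
        rows *= factors[dim]    # ...and grows it by the dimension's factor
    return cost

costs = {order: sequence_cost(order) for order in permutations(factors)}
best = min(costs, key=costs.get)
print(best)  # ('product', 'store', 'time') -- smallest expansion first
```

Every order produces the same final row count; only the work along the way differs, and aggregating the dimension with the smallest expansion first keeps the intermediate tables smallest for longest.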
Summary of steps
9. Calculate the additive derived facts.
10. Aggregate the dimensions.
11. Calculate the non-additive derived facts.
12. Calculate the variance facts.
Non-SQL based
This approach requires the development of specialized programs. This does complicate the development and maintenance processes. There are two basic improvements with this process. First, a dimension can be aggregated in a single pass over the data (rather than many passes). Second, the nature of the process is such that it can easily be decomposed into parallel processes. In brief, the process is to scan the rows that will be aggregated, create subtotal records, sort the file, and then aggregate and load it in a single pass. Since these are all very simple (and sequential) processes, they should occur at disk speeds. Later we will see how they can be decomposed into parallel processes.
Single dimension
In order to aggregate a dimension, we create a copy of each row for each level in the dimension. Each newly generated record will contain the appropriate member for the aggregate levels. For our time dimension, which contains month, quarter, year and total, we will generate quarter, year and total records for each month row. Once this is completed, we will sort the file on the time column, placing all the records that are to be aggregated into the correct sequence. The aggregation/load process can then read through the file, inserting a total row whenever a value break occurs in the time dimension.
Figure 27 Time Aggregation Single Stream This final step can contain whatever level of processing we want it to. It can perform aggregations, variance calculations, transformations and so on. Hence, we can possibly eliminate several SQL passes with this single program. Note, however, that there is a development/maintenance tradeoff here that needs to be considered. Once the intermediate file has been created, the aggregation of the different levels can proceed independently of each other. This is different from the SQL-based approach, where Ln+1 must be processed after Ln. If we write the extract program to produce a file for each level (rather than a single file), then these individual files can be loaded in parallel. Note that this also eliminates the need for the sort step. Finally, if we are using Unix, we can pipe the results of the extract program directly into the (parallel) load processes. This eliminates the requirement for a large amount of working space on the disk. The primary performance impediment to this process is the very large contention on the disks occupied by the table, together with the contention caused by creating the indexes. Depending upon the DBMS and the operating system being used, it may be possible to reduce or eliminate these.
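The generate/sort/break pipeline described above can be sketched in a few lines. The month keys and amounts are made-up sample data, and the roll-up rule is a simple illustrative one:

```python
# Detail rows: (month key, amount). Sample data for illustration only.
base = [
    ("1997-01", 10), ("1997-02", 20), ("1997-04", 5),
    ("1997-07", 7), ("1997-11", 3),
]

def roll_ups(month):
    """Generate the aggregate-level keys (quarter, year, total) for a month."""
    year, mm = month.split("-")
    quarter = f"{year}-Q{(int(mm) - 1) // 3 + 1}"
    return [quarter, year, "TOTAL"]

# Step 1: copy each detail row once per aggregate level.
expanded = list(base)
for month, amount in base:
    for key in roll_ups(month):
        expanded.append((key, amount))

# Step 2: sort on the time column, so rows to be combined are adjacent.
expanded.sort()

# Step 3: single sequential pass, emitting a row at each value break.
aggregated = []
current_key, total = None, 0
for key, amount in expanded:
    if key != current_key:
        if current_key is not None:
            aggregated.append((current_key, total))
        current_key, total = key, 0
    total += amount
aggregated.append((current_key, total))

agg = dict(aggregated)
print(agg["1997-Q1"])  # 30
print(agg["TOTAL"])    # 45
```

Because each step (generate, sort, sequential scan) touches the data in order, the whole pipeline runs at roughly disk speed, and writing one intermediate file per level instead of one combined file would let the level loads run in parallel and remove the sort entirely.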
Performance comparisons
Valid performance comparisons of one DBMS vs. another must take into account a wide range of factors, including but not limited to the operating system, the speed of the i/o devices and the CPU, as well as how much and how effectively parallel processing can be used. At the same time, most performance problems are the result of design factors. Since we are talking about design decisions here, we can use some made-up computer. As long as it remains in the range of today's processors, we are okay. So, for the record, the CPU will have an instruction rate of 100 MIPs. The disk devices will support 100 physical read/write operations per second, with a sustained transfer rate of 16 Mbytes/second. We will also assume that the database uses an 8K block and that the data can be striped in whatever manner achieves optimum performance. As pie-in-the-sky as this sounds, it is not all that far off from today's reality. Disk storage will cost $100/Gbyte and the processing costs will be based on recovering the costs of a $125,000 server during each year. That is, we will charge our users at the rate of $0.0174/CPU sec. The space comparison for the two approaches is pretty straight-forward. The worst case should be a comparison of two fully aggregated databases. Since both databases will store the same type of information, the total number of rows will be the same. The difference is that the single aggregate table will require all three keys for data that is really only one- or two-dimensional. Since the number of rows where this occurs is relatively small (in relation to the three dimensional tables), the excess space required by the single aggregate table is relatively small. As long as we require that both databases maintain the same level of detail, this comparison will generally hold true. Note, however, that if we have a billion row fact table, even these small percentages may be significant.
The cost comparison for dynamic aggregation versus pre-aggregation depends upon how often which levels of detail are requested. Clearly, if we ask the same question more than once, pre-aggregation starts to pay for itself the second time we ask it. However, if we totally pre-aggregate, then we have to weigh the cost of calculating every aggregate against the cost of calculating some aggregates more than once. The other factor that must be taken into consideration is the human cost of waiting for large aggregations. With the techniques that we have described above, it may be possible to calculate the aggregates in a fairly small offline window. If this window is small enough and the cost of wasting people-time is high enough, then pre-aggregating again makes sense. Regrettably, we can't provide general answers to this question; we can only describe how the comparison should be performed.
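A sketch of how such a comparison might be performed, using the made-up rates above ($0.0174/CPU-second, $100/Gbyte). All workload figures (query counts, CPU seconds per query, extra storage) are assumptions for illustration, not measurements:

```python
CPU_RATE = 0.0174   # $/CPU-second, from the made-up server above
DISK_RATE = 100.0   # $/Gbyte of disk storage

def dynamic_cost(queries_per_year, cpu_sec_per_query):
    """Aggregate on every query; no extra storage is consumed."""
    return queries_per_year * cpu_sec_per_query * CPU_RATE

def preagg_cost(build_cpu_sec, extra_gb, queries_per_year, cpu_sec_per_query):
    """Build all aggregates once, store them, then answer queries cheaply."""
    return (build_cpu_sec * CPU_RATE          # one-time aggregation run
            + extra_gb * DISK_RATE            # extra rows on disk
            + queries_per_year * cpu_sec_per_query * CPU_RATE)

# Hypothetical workload: 50,000 aggregate queries a year. Dynamic queries
# scan thousands of detail rows (30 CPU-sec); pre-aggregated ones do not.
dyn = dynamic_cost(50_000, cpu_sec_per_query=30)
pre = preagg_cost(build_cpu_sec=200_000, extra_gb=20,
                  queries_per_year=50_000, cpu_sec_per_query=0.5)

print(round(dyn))  # 26100
print(round(pre))  # 5915
```

Under these assumed numbers pre-aggregation wins easily; halve the query volume or inflate the build cost and the answer can flip, which is exactly why no general answer exists.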
Diagramming Conventions
Ask any data modeller: the most critical component in a data warehouse is the data model. Without a good data model, you might as well pack it in and go home! And yet we treat the data model (and the modeller) as so much excess baggage. We don't get enough respect around here! Certainly some of that is our fault. Most data models look as if they were conceived and drawn in some type of bad dream. So what we need is a little discipline and some conventions. The primary purpose of these conventions is to make the model easier to read. In his book Data Model Patterns: Conventions of Thought (1996, Dorset House), David C. Hay describes a style of modelling that I have modified somewhat for data warehouses. (P.S. You should get this book.) In general, you will see that the models in this paper (and the models that I draw for clients) follow these conventions:

15. Parent-child relations should flow from top to bottom and left to right. (David's models actually go the other way.)
16. Reference tables should be placed as close to their referent as possible, while continuing to adhere to rule #1.
17. Don't cross lines if you can avoid it.
18. If a dimension has multiple hierarchies, they should be modelled separately but use a diamond symbol on the final path to the fact table.
19. The data model should have resolved all semantic ambiguities.
20. Use a bold outline (or something) to represent logical copies of tables that were modelled to eliminate semantic ambiguities.

These conventions tend to place the dimensions in the upper left corner (and across the top) and the facts in the lower right corner (and across the bottom). Since most queries start with dimensions, I find that this approach tends to place the things people want to see first at the top of the model. The drawback to this convention is that the models tend to be very large and have lots and lots of white space. I usually try to convince my clients that this leaves lots of room for notes! (I have middling success with this.) Still, the models look very nice and are easy to read. Another (minor) drawback is that star models don't look anything like stars. So it goes...
Some ranting...
As long as we are talking about data modelling... (skip this part if you've heard this before). I have on many occasions been brought in to a client to develop a data warehouse based upon some business model that somebody developed. More often than not, these logical models have been a waste of time. (Although I am usually far too polite to say so.) Based on my experience, the biggest mistake that data modellers make is carrying the level of abstraction too far. Remember, a business model is, first and foremost, about the business. Not some nitwit concept of abstraction that the modeller thinks is relevant. An example might help. In the insurance industry, it is quite common for people to have several 'roles'. For example, a doctor might be, at various times, an insuree, a plaintiff, a defendant or a claimant. One approach is to develop some entity called "Person" and then have something that defines what role it plays at any given moment. Resist this approach! It is too abstract and fails to convey the different roles (which are embedded in some type table). The casual observer will note that by carrying the level of abstraction up to the concept of "Person", the modeller has created a semantic ambiguity. Which we all now understand is a very bad thing! To paraphrase John von Neumann, "People who create semantic ambiguities are living in a state of sin." Now, having read through this paper, you know better. Well, that was cathartic!