
What is a Data Warehouse?

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources. In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.

Integrated
Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Nonvolatile
Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant
In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.

The metadata and raw data of a traditional OLTP system are present in the warehouse, as is an additional type of data: summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales.
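To make the idea concrete, here is a minimal Python sketch (using the standard-library sqlite3 module) of pre-computing such a summary table. The table and column names are hypothetical, invented purely for illustration:

# Minimal sketch of pre-computing a summary table (Python + sqlite3).
# Table and column names are hypothetical, for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (
        sale_date TEXT,    -- e.g. '2023-08-15'
        product   TEXT,
        amount    REAL
    );
    INSERT INTO sales VALUES
        ('2023-08-15', 'widget', 10.0),
        ('2023-08-20', 'widget', 15.0),
        ('2023-09-02', 'gadget', 20.0);

    -- Pre-compute the long-running aggregation once...
    CREATE TABLE sales_monthly_summary AS
    SELECT substr(sale_date, 1, 7) AS month,
           product,
           SUM(amount) AS total_amount
    FROM sales
    GROUP BY month, product;
""")

# ...so "August sales" becomes a cheap lookup instead of a full scan.
for row in conn.execute(
        "SELECT * FROM sales_monthly_summary WHERE month = '2023-08'"):
    print(row)

In a real warehouse the summary would be refreshed by the ETL process each load cycle rather than built once, but the principle is the same: pay the aggregation cost in advance, not at query time.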

You need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management.

Although the architecture shown in the "Basic" figure is quite common, you may want to customize your warehouse's architecture for different groups within your organization. You can do this by adding data marts, which are systems designed for a particular line of business. The "DW with Staging" figure illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales.

Most people are familiar with terms like data warehousing, data mining, and data cleansing, but fewer people are familiar with the term data mart. A data mart is a subsection of a data warehouse that deals with specific information. Much like the chapters of a book, a data mart contains related information that deals with the same subject, but may also be divided into its own categories. In this article, we will be exploring data marts and data mart software that allow you to extract information from a data warehouse and analyze that information as its own stand-alone subject.

What Is a Data Mart?
Data marts can be seen often in daily life, especially in business. For example, a company may have many stores across a large region, or even in multiple regions, but the company might choose to treat each branch as its own business and have each branch contribute to the overall success of the company. In essence, this is what franchising is all about. While the headquarters of this company may be seen as a data warehouse, each branch would be considered a data mart.

One major difference between the types of system is that data warehouses are not usually in third normal form (3NF), a type of data normalization common in OLTP environments. Data warehouses and OLTP systems have very different requirements. Here are some examples of differences between typical data warehouses and OLTP systems:

Workload
Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations. OLTP systems support only predefined operations. Your applications might be specifically tuned or designed to support only these operations.

Data modifications
A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The end users of a data warehouse do not directly update the data warehouse. In OLTP systems, end users routinely issue individual data modification statements to the database. The OLTP database is always up to date, and reflects the current state of each business transaction.

Schema design
Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance. OLTP systems often use fully normalized schemas to optimize update/insert/delete performance, and to guarantee data consistency.

Typical operations
A typical data warehouse query scans thousands or millions of rows. For example, "Find the total sales for all customers last month." A typical OLTP operation accesses only a handful of records. For example, "Retrieve the current order for this customer."

Historical data
Data warehouses usually store many months or years of data, to support historical analysis. OLTP systems usually store data from only a few weeks or months, keeping historical data only as needed to meet the requirements of the current transaction.
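As a rough illustration of the workload difference, the following Python/sqlite3 sketch (with an invented schema) runs a warehouse-style scan-and-aggregate query next to an OLTP-style point lookup by primary key:

# Contrast of query styles over a hypothetical orders table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER,
        order_month TEXT,
        total       REAL
    );
    INSERT INTO orders VALUES
        (1, 100, '2023-08', 50.0),
        (2, 101, '2023-08', 75.0),
        (3, 100, '2023-09', 20.0);
""")

# Warehouse style: scan many rows and aggregate.
print(conn.execute(
    "SELECT SUM(total) FROM orders WHERE order_month = '2023-08'"
).fetchone())

# OLTP style: touch a handful of rows via the primary key.
print(conn.execute(
    "SELECT * FROM orders WHERE order_id = 2"
).fetchone())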

Ralph Kimball: Bottom-up design
Ralph Kimball is a proponent of an approach to data warehouse design frequently considered bottom-up. In the so-called bottom-up approach, data marts are first created to provide reporting and analytical capabilities for specific business processes. Data marts contain atomic data and, if necessary, summarized data. These data marts can eventually be unioned together to create a comprehensive data warehouse. The combination of data marts is managed through the implementation of what Kimball calls a "data warehouse bus architecture". Business value can be returned as quickly as the first data marts can be created. Maintaining tight management over the data warehouse bus architecture is fundamental to maintaining the integrity of the data warehouse. The most important management task is making sure dimensions among data marts are consistent. In Kimball's words, this means that the dimensions "conform".

Bill Inmon: Top-down design
Bill Inmon, one of the first authors on the subject of data warehousing, has defined a data warehouse as a centralized repository for the entire enterprise.[6] Inmon is one of the leading proponents of the top-down approach to data warehouse design, in which the data warehouse is designed using a normalized enterprise data model. "Atomic" data, that is, data at the lowest level of detail, are stored in the data warehouse. Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse. In Inmon's vision, the data warehouse is at the center of the "Corporate Information Factory" (CIF), which provides a logical framework for delivering business intelligence (BI) and business management capabilities. Inmon states that the data warehouse is:

Subject-oriented: The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together.

Non-volatile: Data in the data warehouse is never over-written or deleted; once committed, the data is static, read-only, and retained for future reporting.

Integrated: The data warehouse contains data from most or all of an organization's operational systems, and this data is made consistent.

The top-down design methodology generates highly consistent dimensional views of data across data marts, since all data marts are loaded from the centralized repository. Top-down design has also proven to be robust against business changes. Generating new dimensional data marts against the data stored in the data warehouse is a relatively simple task. The main disadvantage of the top-down methodology is that it represents a very large project with a very broad scope. The up-front cost for implementing a data warehouse using the top-down methodology is significant, and the duration of time from the start of the project to the point that end users experience initial benefits can be substantial. In addition, the top-down methodology can be inflexible and unresponsive to changing departmental needs during the implementation phases.

What is an operational data store (ODS)?

An operational data store is a collection of operational data which is used to support operational monitoring and management. It involves:
- Extracting data from operational systems;
- Moving it into ODS structures; and
- Reorganizing and structuring the data for analysis purposes.

How is an ODS different from a data warehouse?
A data warehouse is intended to support strategic planning and business intelligence decision support. It should contain:
- Integrated, subject-oriented data, e.g., sales data;
- Static data, i.e., data that is moved into the data warehouse should not change after it is stored in the data warehouse environment;
- Historical data, e.g., data warehouses will usually contain several years' worth of historical data; and
- Aggregated, or summarized, data, e.g., as data becomes "older", it is summarized to reduce data storage requirements and to improve analysis performance.

An operational data store is intended to support operational management and monitoring and should contain:
- Integrated, subject-oriented data (similar to data warehouses), e.g., sales data;
- Volatile data, i.e., data that is moved into an ODS will probably change frequently;
- Current data, e.g., an ODS will usually contain several weeks' or even months' worth of data instead of large volumes of historical data; and
- Detailed data, i.e., data kept at or near the transaction level of detail rather than summarized.

ODS data is refreshed frequently to provide a "snapshot" view of the online transaction processing (OLTP) systems and legacy systems.

Who uses an ODS?
An ODS is frequently used by customer support staff to access integrated data to respond to customer inquiries.

What are ODS design considerations?
An ODS should be based on a normalized relational database design, similar to a data warehouse design, usually with no data aggregation structures.

What are ODS data load considerations?
Typical methods of loading an ODS include:
- Extract, transform, and load (ETL) software, e.g., Informatica, Ab Initio, or Ascential DataStage, used to collect source data and populate ODS structures;
- Materialized views, which create a snapshot of source tables and can load custom-designed views for reporting purposes; and
- Data replication software, e.g., Oracle Data Guard, which can be used to populate ODS structures that are identical to the source system, or even to create and populate logical, more normalized, structures.

Requirements analysis considerations?
Do not build an ODS in the hope that someone will want to use it! ODS requirements should be based on approved data requirements and a logical data model.

Summary
A data warehouse contains a large volume of historical, static data to support business intelligence decision making. An operational data store contains a smaller volume of current, volatile data to support operational management and monitoring.

Conceptual data models
These models, sometimes called domain models, are typically used to explore domain concepts with project stakeholders. On Agile teams, high-level conceptual models are often created as part of your initial requirements envisioning efforts, as they are used to explore the high-level static business structures and concepts. On traditional teams, conceptual data models are often created as the precursor to LDMs or as alternatives to LDMs.

Logical data models (LDMs)
LDMs are used to explore the domain concepts, and their relationships, of your problem domain. This could be done for the scope of a single project or for your entire enterprise. LDMs depict the logical entity types, typically referred to simply as entity types, the data attributes describing those entities, and the relationships between the entities. LDMs are rarely used on Agile projects, although they often are on traditional projects (where they rarely seem to add much value in practice).

Physical data models (PDMs)
PDMs are used to design the internal schema of a database, depicting the data tables, the data columns of those tables, and the relationships between the tables. PDMs often prove to be useful on both Agile and traditional projects, and as a result the focus of this article is on physical modeling.

A simple logical data model

A simple physical data model

Common data modeling tasks include:
- Identify Entity Types
- Identify Attributes
- Apply Data Naming Conventions
- Identify Relationships
- Assign Keys

Normalize to Reduce Data Redundancy
Data normalization is a process in which data attributes within a data model are organized to increase the cohesion of entity types. In other words, the goal of data normalization is to reduce and even eliminate data redundancy, an important consideration for application developers because it is incredibly difficult to store objects in a relational database that maintains the same information in several places. The three most common normalization rules describe how to put entity types into a series of increasing levels of normalization. With respect to terminology, a data schema is considered to be at the level of normalization of its least normalized entity type. For example, if all of your entity types are at second normal form (2NF) or higher, then we say that your data schema is at 2NF.

Why data normalization?
The advantage of having a highly normalized data schema is that information is stored in one place and one place only, reducing the possibility of inconsistent data. Furthermore, highly normalized data schemas in general are closer conceptually to object-oriented schemas because the object-oriented goals of promoting high cohesion and loose coupling between classes result in similar solutions (at least from a data point of view). This generally makes it easier to map your objects to your data schema. Unfortunately, normalization usually comes at a performance cost.

First normal form (1NF)
An entity type is in 1NF when it contains no repeating groups of data.

Second normal form (2NF)
An entity type is in 2NF when it is in 1NF and all of its non-key attributes are fully dependent on its primary key.

Third normal form (3NF)
An entity type is in 3NF when it is in 2NF and all of its attributes are directly dependent on the primary key.
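As a hedged illustration of what these rules mean in practice, here is a small Python/sqlite3 sketch (all table and column names are invented) that contrasts an unnormalized order layout with a 3NF decomposition:

# Sketch of normalizing a hypothetical order schema toward 3NF.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Unnormalized: customer and item details repeat on every order
    -- row, so the same information is maintained in many places.
    CREATE TABLE order_flat (
        order_id         INTEGER,
        customer_name    TEXT,
        customer_address TEXT,
        item_name        TEXT,
        item_price       REAL
    );

    -- 3NF: each fact is stored once; orders reference customers
    -- and items by key instead of repeating their attributes.
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        address     TEXT
    );
    CREATE TABLE item (
        item_id INTEGER PRIMARY KEY,
        name    TEXT,
        price   REAL
    );
    CREATE TABLE order_line (
        order_id    INTEGER,
        customer_id INTEGER REFERENCES customer(customer_id),
        item_id     INTEGER REFERENCES item(item_id),
        PRIMARY KEY (order_id, item_id)
    );
""")
print("schema created")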

Denormalize to Improve Performance
Normalized data schemas, when put into production, often suffer from performance problems. This makes sense: the rules of data normalization focus on reducing data redundancy, not on improving the performance of data access. An important part of data modeling is to denormalize portions of your data schema to improve database access times, as the sketch below illustrates.
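The following minimal sketch (invented names, Python/sqlite3) denormalizes the order table from the previous example by redundantly storing attributes on the order row, so hot-path reads avoid a join; the trade-off is extra storage and the burden of keeping the copies consistent:

# Sketch of denormalizing for read performance (hypothetical schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_line (
        order_id      INTEGER,
        customer_id   INTEGER,
        customer_name TEXT,   -- denormalized copy from the customer table
        item_id       INTEGER,
        item_price    REAL    -- denormalized copy of the price at sale time
    );
""")

# Reads now need no join to get the customer name or price:
conn.execute("SELECT customer_name, item_price FROM order_line")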

To understand why the differences between the schemas exist, you must consider the performance needs of the application. The primary goal of this system is to process new orders from online customers as quickly as possible. To do this, customers need to be able to search for items and add them to their order quickly, remove items from their order if need be, then have their final order totaled and recorded quickly. The secondary goal of the system is to process, ship, and bill the orders afterwards.

Logical Versus Physical Design in Data Warehouses
Your organization has decided to build a data warehouse. You have defined the business requirements, agreed upon the scope of your application, and created a conceptual design. Now you need to translate your requirements into a system deliverable. To do so, you create the logical and physical design for the data warehouse. You then define:
- The specific data content
- Relationships within and between groups of data
- The system environment supporting your data warehouse
- The data transformations required
- The frequency with which data is refreshed

The logical design is more conceptual and abstract than the physical design. In the logical design, you look at the logical relationships among the objects. In the physical design, you look at the most effective way of storing and retrieving the objects, as well as handling them from a transportation and backup/recovery perspective. Orient your design toward the needs of the end users. End users typically want to perform analysis and look at aggregated data, rather than at individual transactions. However, end users might not know what they need until they see it. In addition, a well-planned design allows for growth and changes as the needs of users change and evolve. By beginning with the logical design, you focus on the information requirements and save the implementation details for later.

Creating a Logical Design
A logical design is conceptual and abstract. You do not deal with the physical implementation details yet; you deal only with defining the types of information that you need. One technique you can use to model your organization's logical information requirements is entity-relationship modeling. Entity-relationship modeling involves identifying the things of importance (entities), the properties of these things (attributes), and how they are related to one another (relationships). The process of logical design involves arranging data into a series of logical relationships called entities and attributes. An entity represents a chunk of information. In relational databases, an entity often maps to a table. An attribute is a component of an entity that helps define the uniqueness of the entity. In relational databases, an attribute maps to a column.
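A minimal sketch of that mapping, with a hypothetical "customer" entity (names invented for illustration):

# Entity -> table, attributes -> columns (Python + sqlite3).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (               -- entity -> table
        customer_id INTEGER PRIMARY KEY,  -- identifying attribute -> key column
        name        TEXT,                 -- attribute -> column
        city        TEXT                  -- attribute -> column
    )
""")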

Data Warehousing Schemas
A schema is a collection of database objects, including tables, views, indexes, and synonyms. You can arrange schema objects in the schema models designed for data warehousing in a variety of ways. Most data warehouses use a dimensional model. The model of your source data and the requirements of your users help you design the data warehouse schema. You can sometimes get the source model from your company's enterprise data model and reverse-engineer the logical data model for the data warehouse from this. The physical implementation of the logical data warehouse model may require some changes to adapt it to your system parameters: size of machine, number of users, storage capacity, type of network, and software.

The star schema is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of one or more fact tables, and the points of the star are the dimension tables, as shown in the figure.

The most natural way to model a data warehouse is as a star schema: only one join establishes the relationship between the fact table and any one of the dimension tables. A star schema optimizes performance by keeping queries simple and providing fast response time. All the information about each level is stored in one row.

Data Warehousing Objects
Fact tables and dimension tables are the two types of objects commonly used in dimensional data warehouse schemas. Fact tables are the large tables in your warehouse schema that store business measurements. Fact tables typically contain facts and foreign keys to the dimension tables. Fact tables represent data, usually numeric and additive, that can be analyzed and examined. Examples include sales, cost, and profit. Dimension tables, also known as lookup or reference tables, contain the relatively static data in the warehouse. Dimension tables store the information you normally use to constrain queries. Dimension tables are usually textual and descriptive, and you can use them as the row headers of the result set. Examples are customers or products.

Fact Tables
A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts with the same level of aggregation. Though most facts are additive, they can also be semi-additive or non-additive. Additive facts can be aggregated by simple arithmetical addition. A common example of this is sales. Non-additive facts cannot be added at all. An example of this is averages. Semi-additive facts can be aggregated along some of the dimensions and not along others. An example of this is inventory levels, where you cannot tell what a level means simply by looking at it.

Creating a New Fact Table
You must define a fact table for each star schema. From a modeling standpoint, the primary key of the fact table is usually a composite key that is made up of all of its foreign keys.

Dimension Tables
A dimension is a structure, often composed of one or more hierarchies, that categorizes data. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Several distinct dimensions, combined with facts, enable you to answer business questions. Commonly used dimensions are customers, products, and time. Dimension data is typically collected at the lowest level of detail and then aggregated into higher-level totals that are more useful for analysis. These natural rollups or aggregations within a dimension table are called hierarchies.
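Here is a minimal star-schema sketch in Python/sqlite3, with invented table and column names: one fact table whose primary key is the composite of its dimension foreign keys, plus two dimension tables, and a query showing that a single join reaches any dimension.

# Hypothetical sales mart: fact table + two dimension tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        city        TEXT
    );
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT,
        category   TEXT
    );
    CREATE TABLE fact_sales (
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        product_id  INTEGER REFERENCES dim_product(product_id),
        sale_date   TEXT,
        amount      REAL,   -- additive fact
        PRIMARY KEY (customer_id, product_id, sale_date)  -- composite key
    );
""")

# One join relates the fact table to a dimension:
conn.execute("""
    SELECT d.city, SUM(f.amount)
    FROM fact_sales f JOIN dim_customer d USING (customer_id)
    GROUP BY d.city
""")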

Hierarchies
Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure. Within a hierarchy, each level is logically connected to the levels above and below it. Data values at lower levels aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For example, in the product dimension, there might be two hierarchies: one for product categories and one for product suppliers. Dimension hierarchies also group levels from general to granular. Query tools use hierarchies to enable you to drill down into your data to view different levels of granularity. This is one of the key benefits of a data warehouse. When designing hierarchies, you must consider the relationships in business structures. For example, consider a divisional multilevel sales organization. Hierarchies impose a family structure on dimension values. For a particular level value, a value at the next higher level is its parent, and values at the next lower level are its children. These familial relationships enable analysts to access data quickly.

Levels
A level represents a position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the month, quarter, and year levels. Levels range from general to specific, with the root level as the highest or most general level. The levels in a dimension are organized into one or more hierarchies.

Level Relationships
Level relationships specify top-to-bottom ordering of levels from most general (the root) to most specific information. They define the parent-child relationship between the levels in a hierarchy. Hierarchies are also essential components in enabling more complex query rewrites. For example, the database can roll existing sales revenue at the quarter level up to a yearly aggregation when the dimensional dependencies between quarter and year are known, as in the sketch below.

Typical Dimension Hierarchy
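A minimal sketch of such a rollup in Python/sqlite3, assuming a hypothetical time dimension that records which month belongs to which quarter and year (all names and values invented):

# Rolling data up a time hierarchy: month -> quarter -> year.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_time (
        month   TEXT PRIMARY KEY,   -- e.g. '2004-01'
        quarter TEXT,               -- e.g. '2004-Q1'
        year    TEXT                -- e.g. '2004'
    );
    CREATE TABLE fact_sales (
        month  TEXT REFERENCES dim_time(month),
        amount REAL
    );
    INSERT INTO dim_time VALUES ('2004-01', '2004-Q1', '2004'),
                                ('2004-02', '2004-Q1', '2004'),
                                ('2004-04', '2004-Q2', '2004');
    INSERT INTO fact_sales VALUES ('2004-01', 10), ('2004-02', 5), ('2004-04', 7);
""")

# Because the month -> quarter -> year dependencies are recorded,
# monthly (or quarterly) figures can be rolled up to yearly totals.
for row in conn.execute("""
    SELECT t.year, SUM(f.amount)
    FROM fact_sales f JOIN dim_time t USING (month)
    GROUP BY t.year
"""):
    print(row)   # ('2004', 22.0)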

Relationships guarantee business integrity. An example is that if a business sells something, there is obviously a customer and a product. Designing a relationship between the sales information in the fact table and the dimension tables products and customers enforces the business rules in databases.

The "Slowly Changing Dimension" problem is a common one particular to data warehousing There can be a situation where your main customers are mainly college going students and they are your main customers and you have seen substantial revenue drops when these students pass out and move to new place for jobs or higher studies. Well even a blind can say , as the company do not have office in new place. However can you say which are popular cities/places where your customer students are moving or whats the revenue drop when they move and whats the operation cost to set up a new office in new place where customers are moving. Yes, you can track this data easily by having slowly changing dimension in Datawarehouse and can build effective report showing all place where customers are moving and can easily calculate the revenue drop and one time setup cost of new office in new place. Slowly Changing Dimensions (SCD) are dimensions that have data that slowly changes.

Slowly Changing Dimension Type 1 (SCD Type 1)
Slowly Changing Dimension Type 1 does not maintain history: it overwrites the old data with new data, and therefore does not track historical data at all. This is most appropriate when you do not need to track the history of a dimension and just want to correct certain types of data errors, such as the spelling of a name or a date. For example, consider a company dimension:

CompanyID  CompanyName         CompanyLocation
1          ABC Supply Company  Maharashtra

If the company moves to a new location, since this is SCD Type 1, we simply update the table to overwrite the data:

CompanyID  CompanyName         CompanyLocation
1          ABC Supply Company  Karnataka

The big disadvantage of this method of managing SCDs is that no historical record is kept in the data warehouse. An advantage, however, is that these dimensions are easy to maintain.
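A minimal Python/sqlite3 sketch of the Type 1 overwrite, using the table and values from the example above:

# SCD Type 1: the new value overwrites the old one; no history survives.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE company_dim (
    company_id INTEGER PRIMARY KEY,
    company_name TEXT,
    company_location TEXT)""")
conn.execute(
    "INSERT INTO company_dim VALUES (1, 'ABC Supply Company', 'Maharashtra')")

# The company moves: overwrite in place.
conn.execute(
    "UPDATE company_dim SET company_location = 'Karnataka' WHERE company_id = 1")
print(conn.execute("SELECT * FROM company_dim").fetchall())
# [(1, 'ABC Supply Company', 'Karnataka')] -- the old location is gone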

Slowly Changing Dimension Type 2 (SCD Type 2)
Slowly Changing Dimension Type 2 is used for tracking historical data by maintaining multiple versions of each record, using either dates or a flag to identify the active record. This method allows tracking any number of changes, as each change inserts a new record into the table with its version identifier. Taking the example we discussed under Type 1, with a flag the table looks like this after the move:

CompanyID  CompanyName  CompanyLocation  Is_Active
1          ABC          Maharashtra      0
2          ABC          Karnataka        1

In the same example, if dates are used instead and the company moves to Karnataka, the table would look like this:

CompanyID  CompanyName  CompanyLocation  Start_Date  End_Date
1          ABC          Maharashtra      1-Jan-00    21-Dec-04
2          ABC          Karnataka        22-Dec-04

In the first table, Is_Active is the version identifier that identifies the current location. Dates can be used instead of a flag, which also makes it possible to work out how long each record was active. A NULL end date can be used to identify the current active record; alternatively, if a standard date such as 1111-11-11 is used instead of NULL, the column can be used in an index.

SCD Type 2 advantages: all historical information is kept.
Disadvantages: maintaining a long history with SCD Type 2 can make your data warehouse huge, so it should be used only after evaluating the business use cases and the data requirements for reporting. SCD Type 2 ETL transformations are a bit complicated to develop, although nowadays most commercial tools come with wizards for developing SCD Type 2.
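A minimal Python/sqlite3 sketch of the date-based Type 2 pattern from the example above (the surrogate key column is an assumption added for illustration):

# SCD Type 2: end-date the old row and insert a new version;
# a NULL end_date marks the currently active record.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE company_dim (
    company_key INTEGER PRIMARY KEY,   -- surrogate key, one per version
    company_id INTEGER,                -- natural/business key
    company_location TEXT,
    start_date TEXT,
    end_date TEXT)""")
conn.execute("""INSERT INTO company_dim VALUES
    (1, 1, 'Maharashtra', '2000-01-01', NULL)""")

# The company moves on 2004-12-22: close the old version, add a new one.
conn.execute("""UPDATE company_dim SET end_date = '2004-12-21'
                WHERE company_id = 1 AND end_date IS NULL""")
conn.execute("""INSERT INTO company_dim VALUES
    (2, 1, 'Karnataka', '2004-12-22', NULL)""")

# The full movement history is preserved:
print(conn.execute("SELECT * FROM company_dim ORDER BY start_date").fetchall())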

Slowly Changing Dimension Type 3 (SCD Type 3)
Slowly Changing Dimension Type 3 tracks history by adding separate columns for each version of the record. As opposed to SCD Type 2, where any number of historical versions can be maintained, SCD Type 3 allows only limited history, since adding a separate column for each version is not good practice. Generally it maintains up to two levels of history. For example:

CompanyID  CompanyName  CompanyLastLocation  CompanyCurrentLocation
1          ABC          Maharashtra          Karnataka

After looking at the above example, it is clear that with SCD Type 3 we cannot track the full history of the company's movements if it moves again to a new location.
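A minimal Python/sqlite3 sketch of the Type 3 column shift, following the example above:

# SCD Type 3: history lives in extra columns, so only the previous
# value survives; a second move overwrites the first.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE company_dim (
    company_id INTEGER PRIMARY KEY,
    company_name TEXT,
    last_location TEXT,
    current_location TEXT)""")
conn.execute("INSERT INTO company_dim VALUES (1, 'ABC', NULL, 'Maharashtra')")

# The company moves: shift current into last, write the new current.
conn.execute("""UPDATE company_dim
                SET last_location = current_location,
                    current_location = 'Karnataka'
                WHERE company_id = 1""")
print(conn.execute("SELECT * FROM company_dim").fetchall())
# [(1, 'ABC', 'Maharashtra', 'Karnataka')]
# A further move would discard 'Maharashtra' entirely.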
