
S. M. U.

BCA
Data Warehousing
BOOK ID B1011

Assignment 1

NAME - CHIRAG I. SHAH ROLL NO 511123683

Data Warehousing

Page 1

Q1. With the necessary diagram, explain the Data Warehouse Development Life Cycle.

The data warehouse development life cycle covers two vital areas: warehouse management and data management. The former deals with defining the project activities and requirements gathering, whereas the latter deals with modeling and designing the warehouse.

Life cycle of Data Warehouse Development:
Define the Project -> Gather Requirements -> Model the Warehouse -> Validate the Model -> Design the Warehouse -> Validate the Design -> Implementation
Managing the Project: Managing a Data Warehouse project is an ongoing activity; it is not like a traditional systems project. The Data Warehouse project is concerned with the execution of the warehousing process and the data.

Define the Project: The process of defining the project typically involves the following questions: What do I want to analyze? Why do I want it? What if I do not do this? How do I get it? Once software personnel get answers to these questions, they can understand the requirements that must be addressed.


Requirements Gathering: Transaction processing systems focus on automating the process, making it faster and more efficient. This, in turn, means that the requirements for transactional systems are specific and directed more towards business process automation. In contrast, Data Warehouse development focuses on facilitating the analysis that will change the process to make it more effective.


Q2. What is metadata? What is its use in Data Warehouse architecture?

Metadata in a Data Warehouse is similar to the data dictionary or the data catalog in a Database Management System. In the data dictionary you keep information about the logical data structure, the files and addresses, the indexes, and so on; the data dictionary contains data about the data in the database. Likewise, the metadata component is the data about the data in the Data Warehouse. This is the commonly used definition, but it needs elaboration: metadata in a Data Warehouse is similar to a data dictionary, yet much more than a data dictionary.

Acquisition metadata maps the translation of information from the operational system to the analytical system. It includes an extract history describing data origins, updates, algorithms used to summarize data, and the frequency of extractions from operational systems.

Transformation metadata includes a history of data transformations, changes in names and other physical characteristics.

Access metadata provides navigation and graphical user interfaces that allow non-technical business users to interact intuitively with the contents of the warehouse.

On top of these three types of metadata, a warehouse needs basic operational metadata: procedures on how the data warehouse is used and accessed, procedures for monitoring the growth of the data warehouse relative to the available storage space, and authorizations recording who is responsible for, and who has access to, the data in the data warehouse and the data in the operational systems.
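The metadata kinds described above can be sketched as simple records; the field names and values here are invented for illustration and do not come from any real metadata tool:

```python
# Hypothetical metadata records for the three kinds described above.
acquisition_meta = {
    "source_system": "orders_oltp",       # where the data originated
    "extract_frequency": "daily",         # how often it is pulled
    "summarization": "SUM(amount) by day",
}

transformation_meta = {
    "source_column": "cust_nm",
    "target_column": "customer_name",     # renamed during transformation
    "rule": "trim and uppercase",
}

access_meta = {
    "business_name": "Customer Name",     # label shown to business users
    "table": "dim_customer",
    "column": "customer_name",
}

# A tiny "data dictionary" lookup: find the warehouse column behind
# a business-friendly name.
def resolve(business_name, catalog):
    for entry in catalog:
        if entry["business_name"] == business_name:
            return entry["table"], entry["column"]
    return None

print(resolve("Customer Name", [access_meta]))
# -> ('dim_customer', 'customer_name')
```

This mirrors how access metadata lets a non-technical user ask for "Customer Name" without knowing the physical table and column.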


Q3. Write briefly about any four ETL tools. What is transformation? Briefly explain the basic transformation types.

An ETL process can be created using almost any programming language, but creating one from scratch is quite complex. Increasingly, companies are buying ETL tools to help in the creation of ETL processes. A good ETL tool must be able to communicate with the many different relational databases and read the various file formats used throughout an organization. ETL tools have started to migrate into Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation and loading of data. Many ETL vendors now have data profiling, data quality and metadata capabilities.

ETL Tools:
1. PL/SQL
2. SAS Data Integrator / SAS Integration Studio
3. Ascential DataStage
4. Cognos DecisionStream
5. Microsoft DTS
6. Business Objects Data Integrator
7. Pervasive Data Junction
8. Hummingbird Genio
9. Informatica PowerCenter
10. CloverETL

Transformation: Data transformations are often the most complex and, in terms of processing time, the most costly part of the ETL process. They can range from simple data conversions to extremely complex data scrubbing techniques. Before moving the extracted data from the source systems into the Data Warehouse, you inevitably have to perform various kinds of data transformation. You have to transform the data according to standards because it comes from many dissimilar source systems, and you have to ensure that after all the data is put together, the combined data does not violate any business rules.

Basic Transformation Types: Data transformation involves the following basic tasks.

Selection: This takes place at the beginning of the whole process of data transformation. You select either whole records or parts of several records from the source systems. The task of selection usually forms part of the extraction function itself.

Splitting/Joining: This task includes the types of data manipulation you need to perform on the selected parts of source records. Sometimes you will split the selected parts even further during data transformation. Joining of parts selected from many source systems is more widespread in the Data Warehouse environment.


Conversion: This is an all-inclusive task. It includes a large variety of rudimentary conversions of single fields, for two primary reasons: one, to standardize among the data extracted from disparate source systems, and the other, to make the fields usable and understandable to the users.

Summarization: Sometimes you may find that it is not feasible to keep data at the lowest level of detail in your Data Warehouse; it may be that none of your users ever needs data at the lowest granularity for analysis or querying. For example, for a grocery chain, sales data at the lowest level of detail for every transaction at the checkout may not be needed; storing sales by product, by store, by day in the Data Warehouse may be quite adequate. In this case, the data transformation function includes summarization of daily sales by product and by store.

Enrichment: This task is the rearrangement and simplification of individual fields to make them more useful for the Data Warehouse environment. You may use one or more fields from the same input record to create a better view of the data for the Data Warehouse. This principle extends to cases where one or more fields originate from multiple records, resulting in a single field for the Data Warehouse.
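Several of these transformation types can be sketched on toy records; the records, field names and the "sales band" rule below are invented assumptions for illustration, not a standard ETL implementation:

```python
from collections import defaultdict

# Hypothetical source records; amount arrives as a string.
records = [
    {"store": "S1", "product": "P1", "day": "2024-01-01", "amount": "10.5"},
    {"store": "S1", "product": "P1", "day": "2024-01-01", "amount": "4.5"},
    {"store": "S2", "product": "P2", "day": "2024-01-01", "amount": "7.0"},
]

# Selection: keep only the fields we need from each record.
selected = [{k: r[k] for k in ("store", "product", "day", "amount")}
            for r in records]

# Conversion: standardize types (string amount -> float).
for r in selected:
    r["amount"] = float(r["amount"])

# Summarization: roll daily sales up by store, product and day.
totals = defaultdict(float)
for r in selected:
    totals[(r["store"], r["product"], r["day"])] += r["amount"]

# Enrichment: derive a new field (an assumed "sales band" business rule).
enriched = [
    {"store": s, "product": p, "day": d, "daily_sales": v,
     "sales_band": "high" if v > 10 else "low"}
    for (s, p, d), v in totals.items()
]

print(enriched)
```

Here two checkout-level transactions for S1/P1 are summarized into one daily row of 15.0, matching the grocery-chain example above.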


Q4. What are ROLAP, MOLAP and HOLAP? What is multidimensional analysis? How do we achieve it?

ROLAP (Relational OLAP): These are intermediate servers that stand between a relational back-end server and client front-end tools. They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and additional tools and services. ROLAP technology tends to have greater scalability than MOLAP technology. The DSS server of MicroStrategy, for example, adopts the ROLAP approach.

The ROLAP model (layered):
- Presentation Layer: desktop client with a multidimensional view
- Application Layer: analytical server
- Data Layer: RDBMS server holding the data warehouse


MOLAP (Multidimensional OLAP) servers: These servers support multidimensional views of data through array-based multidimensional storage engines. They map multidimensional views directly to data cube array structures. The advantage of using a data cube is that it allows fast indexing to precomputed summarized data. Notice that with multidimensional data stores, storage utilization may be low if the data set is sparse; in such cases, sparse matrix compression techniques should be explored. Many MOLAP servers adopt a two-level storage representation to handle dense and sparse data sets: denser sub-cubes are identified and stored as array structures, whereas sparse sub-cubes employ compression technology for efficient storage utilization.

The MOLAP model (layered):
- Presentation Layer: desktop client
- Application Layer: MOLAP engine and MDBMS server over a multidimensional database (MDDB)
- Data Layer: RDBMS server holding the data warehouse


HOLAP (Hybrid OLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. A HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store. Microsoft SQL Server 2000 supports a hybrid OLAP server.

A typical deployment for OLAP-based analysis: an internet browser talks to a WWW server, which uses a ROLAP navigator with metadata over an RDBMS containing OLAP-type aggregation tables.
Multidimensional Analysis: A multidimensional data model is typically used for the design of corporate Data Warehouses and departmental data marts. Such a model can adopt a star schema, snowflake schema or fact constellation schema. The core of the multidimensional model is the data cube, which consists of a large set of facts (or measures) and a number of dimensions. Dimensions are the entities or perspectives with respect to which an organization wants to keep records, and they are hierarchical in nature.

Data Cube: A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. For example, a 3-D view of sales data for AllElectronics can be organized according to the dimensions time, item and location, with the measure dollars_sold (in thousands).
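A minimal cube along the dimensions time, item and location can be sketched as facts keyed by dimension tuples, with a roll-up helper that aggregates away the unwanted dimensions; the sales figures are made-up illustrations:

```python
# Hypothetical dollars_sold facts keyed by (time, item, location).
facts = {
    ("Q1", "TV",    "Chicago"):  854,
    ("Q1", "Phone", "Chicago"):  882,
    ("Q1", "TV",    "Toronto"): 1087,
    ("Q2", "TV",    "Chicago"):  943,
}

def roll_up(facts, keep):
    """Aggregate dollars_sold over the dimensions NOT listed in `keep`.
    `keep` is a tuple of dimension indices: 0=time, 1=item, 2=location."""
    out = {}
    for dims, value in facts.items():
        key = tuple(dims[i] for i in keep)
        out[key] = out.get(key, 0) + value
    return out

# Roll up to sales by time only (summing over item and location):
print(roll_up(facts, (0,)))   # {('Q1',): 2823, ('Q2',): 943}
```

Choosing a different `keep` tuple slices the same cube along a different perspective, which is the essence of multidimensional analysis.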


Q5. Explain the testing process of a Data Warehouse with the necessary diagram.

The main aim of requirements testing is to check stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements ambiguous?
- Are the requirements developable?
- Are the requirements testable?

In a Data Warehouse, the requirements are mostly around reporting, so it becomes more important to verify whether these reporting requirements can be catered for using the data available. Successful requirements are those structured closely to business rules that address functionality and performance. These business rules and requirements provide a solid foundation to the data architects. Using the defined requirements and business rules, a high-level design of the data model is created. Once requirements and business rules are available, rough scripts can be drafted to validate the data model constraints against the defined business rules.
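One such rough validation script can be sketched with sqlite3; the business rule (every fact row must reference an existing dimension row), tables and column names are invented for illustration:

```python
import sqlite3

# Build a tiny in-memory warehouse with one deliberately broken fact row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (product_key INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'TV');
INSERT INTO fact_sales VALUES (1, 100.0), (2, 50.0);  -- key 2 has no dimension row
""")

# Business rule check: find fact rows with no matching dimension row.
orphans = conn.execute("""
    SELECT f.product_key FROM fact_sales f
    LEFT JOIN dim_product d ON d.product_key = f.product_key
    WHERE d.product_key IS NULL
""").fetchall()

print(orphans)  # [(2,)]
```

A non-empty result means the model violates the rule and the load or design needs attention.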

The testing process (high level):
1. Requirements Testing: the QA team reviews the Business Requirements Document (BRD) for completeness and builds the test plan.
2. Review of the High Level Design (HLD).
3. Test Case Preparation: develop test cases and test SQL queries.
4. Test Execution: unit testing, functional testing, regression testing and performance testing.
5. User Acceptance Testing (UAT).


Q6. What is testing? Differentiate between testing the Data Warehouse and traditional software testing.

Testing a Data Warehouse is quite different from testing the development of an OLTP system. The main areas of testing for OLTP include testing user input for valid data types, edge values, and so on. Testing for a Data Warehouse, on the other hand, cannot and should not duplicate all of the error checks done in the source system. Even though some data quality improvements are practical to do, such as making sure postal codes are associated with the correct city and state, Data Warehouse implementations must largely take in what the OLTP system has produced.

Testing for a Data Warehouse falls into three general categories: testing the ETL process; testing that reports and other artifacts in the Data Warehouse provide correct answers; and testing that the performance of all the Data Warehouse components is acceptable. ETL testing means making sure that all the records in the source system that should be brought into the Data Warehouse actually are extracted into it, no more and no less; that all of the components of the ETL process complete successfully; that all of the extracted source data is correctly transformed into dimension tables and fact tables; and that all of the extracted and transformed data is successfully loaded into the Data Warehouse.

A Data Warehouse is system triggered, whereas OLTP is user triggered. The volumes of test data also differ. The test data in a transaction system is a very small sample of the overall production data: typically, to keep matters simple, we include just as many test cases as are needed to cover all possible test scenarios in a limited set of test data. A Data Warehouse typically has large test data, as one tries to fill up the maximum possible combinations and permutations of dimensions and facts. If you are testing the location dimension, you would like the location-wise sales revenue report to have some revenue figures for most of the 90 cities and the 34 states; this would mean you have to have thousands of sales transaction records at the sales-office level.
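The "no more, no less" ETL check described above can be sketched as a set comparison of record keys; the key values here are invented sample data:

```python
# Hypothetical customer keys extracted from the source system and
# customer keys found in the warehouse after the load.
source_keys = {"C001", "C002", "C003", "C004"}
warehouse_keys = {"C001", "C002", "C004"}

missing = source_keys - warehouse_keys   # extracted but never loaded
extra = warehouse_keys - source_keys     # loaded but never extracted

# "No more": nothing should appear in the warehouse that was not extracted.
assert not extra, f"unexpected rows in warehouse: {extra}"

# "No less": report anything extracted that never arrived.
print("missing from warehouse:", sorted(missing))  # ['C003']
```

In practice both sides would come from counts or key queries against the source tables and the warehouse tables, but the reconciliation logic is the same.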


S. M. U.
BCA
Data Warehousing
BOOK ID B1011

Assignment 2

NAME - CHIRAG I. SHAH ROLL NO 511123683


Q1. Explain the differences between OLTP and a Data Warehouse.

The OLTP database records transactions in real time and aims to automate the clerical data entry processes of a business entity. Addition, modification and deletion of data in the OLTP database are essential, and the semantics of the application used in the front end make an impact on the organization of the data in the database. The Data Warehouse, on the other hand, does not cater to the real-time operational requirements of the enterprise. It is more a storehouse of current and historical data, and may also contain data extracted from external data sources. However, the Data Warehouse supports OLTP systems by providing a place for the latter to offload data as it accumulates, and by providing services that would otherwise degrade the performance of the database tables.

Application databases are OLTP (On-Line Transaction Processing) systems where every transaction has to be recorded as and when it occurs. Consider the scenario where a bank ATM has disbursed cash to a customer but was unable to record this event in the bank records. If this happened frequently, the bank wouldn't stay in business for long. So the banking system is designed to make sure that every transaction gets recorded within the time you stand before the ATM machine. A Data Warehouse (DW), on the other hand, is a database designed to facilitate querying and analysis. Often designed as OLAP (On-Line Analytical Processing) systems, these databases contain read-only data that can be queried and analyzed far more efficiently than regular OLTP application databases; in this sense, an OLAP system is designed to be read-optimized. Separation from your application database also ensures that your business intelligence solution is scalable, better documented and better managed. Creation of a DW leads to a direct increase in the quality of analysis, as the table structures are simpler, standardized and often denormalized. Having a well-designed DW is the foundation for the successful BI/Analytics initiatives built upon it.

Data Warehouses usually store many months or years of data, to support historical analysis. OLTP systems usually store data from only a few weeks or months; the OLTP system stores only as much historical data as is needed to successfully meet the requirements of the current transaction.

Property               OLTP                       Data Warehouse
Nature of data         3NF (normalized)           Multidimensional
Indexes                Few                        Many
Joins                  Many                       Some
Duplicate data         Normalized                 Denormalized
Aggregate data         Rare                       Common
Queries                Mostly predefined          Mostly ad hoc
Nature of queries      Mostly simple              Mostly complex
Updates                All the time               Not allowed, only refreshed
Historical data        Often not available        Essential


Comparing the two databases point by point:
- Data Warehouse: designed for analysis of business measures by categories and attributes. OLTP: designed for real-time business operations.
- Data Warehouse: optimized for bulk loads and large, complex, unpredictable queries that access many rows per table. OLTP: optimized for a common set of transactions, usually adding or retrieving a small set of rows at a time per table.
- Data Warehouse: loaded with consistent, valid data; requires no real-time validation. OLTP: optimized for validation of incoming data during transactions, using validation data tables.
- Data Warehouse: supports a limited number of users, particularly data analysts (decision makers). OLTP: supports thousands of concurrent users.


Q2. With the necessary diagram, explain the Data Warehouse Development Life Cycle.

The data warehouse development life cycle covers two vital areas: warehouse management and data management. The former deals with defining the project activities and requirements gathering, whereas the latter deals with modeling and designing the warehouse.

Life cycle of Data Warehouse Development:
Define the Project -> Gather Requirements -> Model the Warehouse -> Validate the Model -> Design the Warehouse -> Validate the Design -> Implementation
Managing the Project: Managing a Data Warehouse project is an ongoing activity; it is not like a traditional systems project. The Data Warehouse project is concerned with the execution of the warehousing process and the data.

Define the Project: The process of defining the project typically involves the following questions: What do I want to analyze? Why do I want it? What if I do not do this? How do I get it? Once software personnel get answers to these questions, they can understand the requirements that must be addressed.


Requirements Gathering: Transaction processing systems focus on automating the process, making it faster and more efficient. This, in turn, means that the requirements for transactional systems are specific and directed more towards business process automation. In contrast, Data Warehouse development focuses on facilitating the analysis that will change the process to make it more effective.


Q3. What is metadata? What is its use in Data Warehouse architecture?

Metadata in a Data Warehouse is similar to the data dictionary or the data catalog in a Database Management System. In the data dictionary you keep information about the logical data structure, the files and addresses, the indexes, and so on; the data dictionary contains data about the data in the database. Likewise, the metadata component is the data about the data in the Data Warehouse. This is the commonly used definition, but it needs elaboration: metadata in a Data Warehouse is similar to a data dictionary, yet much more than a data dictionary.

Acquisition metadata maps the translation of information from the operational system to the analytical system. It includes an extract history describing data origins, updates, algorithms used to summarize data, and the frequency of extractions from operational systems.

Transformation metadata includes a history of data transformations, changes in names and other physical characteristics.

Access metadata provides navigation and graphical user interfaces that allow non-technical business users to interact intuitively with the contents of the warehouse.

On top of these three types of metadata, a warehouse needs basic operational metadata: procedures on how the data warehouse is used and accessed, procedures for monitoring the growth of the data warehouse relative to the available storage space, and authorizations recording who is responsible for, and who has access to, the data in the data warehouse and the data in the operational systems.


Q4. What is a surrogate key? When do we need it in Data Warehouse implementation?

Surrogate keys are simply system-generated sequence numbers; they do not have any built-in meaning. The surrogate keys are mapped to the production system keys, but they are different, and the general practice is to keep the operational system keys as additional attributes in the dimension tables. For example, STORE_KEY is the surrogate primary key for the store dimension table, while the operational system primary key for the store reference table may be kept as just another non-key attribute in the store dimension table.

There are two general principles to apply when choosing primary keys for dimension tables.

The first principle is derived from the problem caused when a product begins to be stored in a different warehouse. The product key in the operational system has built-in meanings: some positions in the key indicate the warehouse, and other positions indicate the product category. So the first principle is: avoid built-in meanings in the primary key of the dimension tables.

In some companies, a few customers are no longer listed with the company; they could have left many years ago, and it is possible that the customer numbers of such discontinued customers are reassigned to new customers. If we had used the operational system customer key as the primary key for the customer dimension table, we would have a problem, because the same customer number could relate to the data of the newer customer and also to the data of the retired customer, whose data may still be used for aggregations and comparisons by city and state. Therefore, the second principle is: do not use production system keys as primary keys for dimension tables.
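Both principles can be sketched in a few lines: a meaningless sequence number becomes the dimension key, and the production key is kept as an ordinary attribute. The key formats and city values below are invented for illustration:

```python
import itertools

_next_key = itertools.count(1)   # system-generated sequence, no meaning
store_dim = {}                   # surrogate_key -> dimension row

def load_store(production_key, city):
    """Assign a surrogate key and keep the operational key as an attribute."""
    surrogate = next(_next_key)
    store_dim[surrogate] = {
        "store_key": surrogate,             # primary key, no built-in meaning
        "source_store_id": production_key,  # operational key kept as attribute
        "city": city,
    }
    return surrogate

# Hypothetical operational keys that encode the warehouse in their positions:
k1 = load_store("WH1-GROC-017", "Pune")
k2 = load_store("WH2-GROC-017", "Pune")

# Distinct surrogate keys, even though the production keys carry
# overlapping built-in meanings (same product, different warehouse):
print(k1, k2)  # 1 2
```

If the production key were reused for a new customer or product, the old and new rows would still get distinct surrogate keys, which is exactly what the second principle protects against.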


Q5. What is data loading? Explain full refresh loading.

Two distinct groups of tasks form the data loading function. When you complete the design and construction of the Data Warehouse and go live for the first time, you do the initial loading of the data into the Data Warehouse storage; the initial load moves large volumes of data and uses up substantial amounts of time. As the Data Warehouse starts functioning, you continue to extract the changes to the source data, transform the data revisions, and feed the incremental data revisions on an ongoing basis. The common types of data movement from the staging area to the Data Warehouse storage are the base data load, followed by daily, monthly, quarterly and yearly refreshes.

Full Refresh Loading: This type of data application involves periodically rewriting the entire Data Warehouse. Sometimes you may also do partial refreshes to rewrite only specific tables, but partial refreshes are rare because every dimension table is intricately tied to the fact table. As far as data application modes are concerned, a full refresh is similar to the initial load. However, in the case of full refreshes, data already exists in the target tables before the incoming data is applied, and the existing data must be erased before the incoming data is loaded. Just as in the case of the initial load, the load and append modes are applicable to full refresh.
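The erase-then-load behavior of a full refresh can be sketched with sqlite3; the table name, columns and row values are assumptions for illustration, not a real warehouse schema:

```python
import sqlite3

# A tiny in-memory "warehouse" table with data from a previous load.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (day TEXT, amount REAL)")
conn.execute("INSERT INTO fact_sales VALUES ('2024-01-01', 100.0)")  # old load

incoming = [("2024-01-02", 120.0), ("2024-01-03", 95.0)]

def full_refresh(conn, rows):
    """Erase the existing data, then apply the incoming extract."""
    conn.execute("DELETE FROM fact_sales")
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
    conn.commit()

full_refresh(conn, incoming)
print(conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0])  # 2
```

Note that the old row is gone entirely; this is what distinguishes a full refresh from an incremental (append-style) load.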


Q6. What data quality factors affect a Data Warehouse? Explain them.

The DWQ project provides a neutral architectural reference model covering the design, setting-up, operation, maintenance and evolution of data warehouses, reflecting the basic components and their relationships as seen in current practice. Its terms can be briefly explained as follows:
- Sources: any data store whose content is subject to be materialized in a data warehouse (for example text files or external data).
- Wrappers: load the source data into the warehouse.
- Destination databases: data warehouses and data marts.
- Meta database: a repository for information about the other components.
- Agents: for administration (data warehouse design, a scheduler for initiating updates, etc.).
- Clients: to display the data, for example statistical packages, GIS, OLAP and DSS tools.


The Linkage to Data Quality: DWQ provides assistance to DW designers by linking the main components of the DW reference architecture to a formal model of data quality. The main difference from the initial model lies in the greater emphasis on historical as well as aggregated data. Several related notions are distinguished:
- A data quality policy is the overall intention and direction of an organization with respect to issues concerning the quality of data products.
- Data quality management is the management function that determines and implements the data quality policy.
- A data quality system encompasses the organizational structure, responsibilities, procedures, processes and resources for implementing data quality management.
- Data quality control is a set of operational techniques and activities used to attain the quality required for a data product.
- Data quality assurance includes all the planned and systematic actions necessary to provide adequate confidence that a data product will satisfy a given set of quality requirements.

Quality factors commonly measured for warehouse data include accuracy, completeness, consistency and timeliness.
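A small data quality control check can be sketched as follows; the rules (a completeness check on postal codes and an accuracy check against reference data), the rows and the reference table are invented examples:

```python
# Hypothetical incoming rows and reference data for quality checks.
rows = [
    {"customer_id": "C1", "postal_code": "60601", "city": "Chicago"},
    {"customer_id": "C2", "postal_code": "",      "city": "Toronto"},
]

postal_to_city = {"60601": "Chicago"}   # assumed reference data

def quality_report(rows):
    """Flag rows that violate the completeness or accuracy rules."""
    issues = []
    for r in rows:
        if not r["postal_code"]:
            # Completeness: the postal code must be present.
            issues.append((r["customer_id"], "missing postal_code"))
        elif postal_to_city.get(r["postal_code"]) not in (None, r["city"]):
            # Accuracy: if the postal code is in the reference data,
            # it must agree with the stated city.
            issues.append((r["customer_id"], "postal_code/city mismatch"))
    return issues

print(quality_report(rows))  # [('C2', 'missing postal_code')]
```

Such checks are the operational side of data quality control; the policy and assurance notions above decide which rules exist and how their results are acted upon.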


