
What is a data warehouse?

A data warehouse is an electronic store of an organization's historical data for the purpose of reporting, analysis and data mining or knowledge discovery. Beyond that, a data warehouse can also be used for purposes such as data integration and master data management. According to Bill Inmon, a data warehouse should be subject-oriented, non-volatile, integrated and time-variant. Explanatory note: non-volatile means that data, once loaded in the warehouse, does not get deleted later. Time-variant means that the data is stored with reference to time, so changes over time can be tracked. The above definition of data warehousing is typically considered the "classical" definition. However, if you are interested, you may want to read the article - What is a data warehouse - A 101 guide to modern data warehousing - which opens up a broader definition of data warehousing.

What are the benefits of a data warehouse?


A data warehouse helps to integrate data (see Data integration) and store it historically, so that we can analyze different aspects of the business, including performance analysis, trends and predictions, over a given time frame, and use the results of the analysis to improve the efficiency of business processes.

Why is a data warehouse used?


For a long time in the past, and even today, data warehouses have been built to facilitate reporting on the key business processes of an organization, known as KPIs (key performance indicators). Data warehouses also help to integrate data from different sources and present a single point of truth for business measures. A data warehouse can further be used for data mining, which helps with trend prediction, forecasting, pattern recognition, etc. Check this article to know more about data mining.

What is the difference between OLTP and OLAP?


OLTP is the transaction system that collects business data, whereas OLAP is the reporting and analysis system built on that data.

OLTP systems are optimized for INSERT and UPDATE operations and are therefore highly normalized. On the other hand, OLAP systems are deliberately denormalized for fast data retrieval through SELECT operations. Explanatory note: in a department store, when we pay at the check-out counter, the salesperson keys all the data into a "point of sale" machine. That data is transaction data, and the related system is an OLTP system. On the other hand, the manager of the store might want to view a report on out-of-stock materials so that he can place purchase orders for them. Such a report comes out of an OLAP system.
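To make the contrast concrete, here is a minimal SQL sketch, assuming hypothetical tables: a normalized sales_txn table on the OLTP side and a denormalized sales_fact / dim_product star on the OLAP side.

-- OLTP: record one checkout transaction (write-heavy, normalized schema)
INSERT INTO sales_txn (txn_id, product_id, customer_id, txn_time, quantity, amount)
VALUES (1001, 55, 7, CURRENT_TIMESTAMP, 2, 49.90);

-- OLAP: answer an analytical question over history (read-heavy, star-joined SELECT)
SELECT d.product_name,
       SUM(f.sales_quantity) AS total_qty
FROM   sales_fact f
JOIN   dim_product d ON f.product_key = d.product_key
GROUP BY d.product_name
ORDER BY total_qty DESC;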

What is a data mart?


Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments like Finance, HR, Marketing etc. stored in the data warehouse, and each department may have a separate data mart. These data marts can be built on top of the data warehouse.

What is the ER model?
The ER model, or entity-relationship model, is a data modeling methodology whose goal is to normalize the data by reducing redundancy. This is different from dimensional modeling, where the main goal is to improve the data retrieval mechanism.

What is dimensional modeling?


A dimensional model consists of dimension and fact tables. Fact tables store transactional measurements together with the foreign keys to the dimension tables that qualify the data. The goal of a dimensional model is not to achieve a high degree of normalization but to facilitate easy and fast data retrieval. Ralph Kimball is one of the strongest proponents of this very popular data modeling technique, which is often used in enterprise-level data warehouses. If you want to read a quick and simple guide on dimensional modeling, please check our Guide to dimensional modeling.

What is a dimension?
A dimension is something that qualifies a quantity (measure). For example, consider this: if I just say "20 kg", it does not mean anything. But if I say "20 kg of rice (product) was sold to Ramesh (customer) on 5th April (date)", then that makes meaningful sense. Product, customer and date are dimensions that qualify the measure 20 kg. Dimensions are mutually independent. Technically speaking, a dimension is a data element that categorizes each item in a data set into non-overlapping regions.

What is a fact?
A fact is something that is quantifiable (or measurable). Facts are typically (but not always) numerical values that can be aggregated.

What are additive, semi-additive and non-additive measures?


Non-additive measures: non-additive measures are those which cannot be used inside any numeric aggregation function (e.g. SUM(), AVG() etc.). One example of a non-additive fact is any kind of ratio or percentage, e.g. a 5% profit margin or a revenue-to-asset ratio. Non-numerical data can also be a non-additive measure when it is stored in a fact table, e.g. varchar flags in the fact table.

Semi-additive measures: semi-additive measures are those where only a subset of aggregation functions can be applied. Take account balance: a SUM() over balances does not give a useful result, but the MAX() or MIN() balance might be useful. Similarly, for a price rate or currency rate, a sum is meaningless but an average may be useful.

Additive measures: additive measures can be used with any aggregation function, like SUM(), AVG() etc. Sales quantity is an example.
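As an illustration, here is a small SQL sketch of how each class of measure is typically aggregated; the sales_fact and account_balance_fact tables and their columns are assumed for illustration only.

-- Additive: sales quantity can be summed across any dimension
SELECT product_key, SUM(sales_quantity) AS total_qty
FROM   sales_fact
GROUP BY product_key;

-- Semi-additive: account balance should not be summed across time;
-- MAX(), MIN() or AVG() per account is meaningful instead
SELECT account_key, MAX(balance) AS peak_balance, AVG(balance) AS avg_balance
FROM   account_balance_fact
GROUP BY account_key;

-- Non-additive: a ratio such as profit margin is recomputed from its
-- additive components rather than aggregated directly
SELECT store_key, SUM(profit) / SUM(revenue) AS profit_margin
FROM   sales_fact
GROUP BY store_key;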

Classifying data for successful modeling

What is data?
Let us begin our discussion by defining what data is. Data are values of qualitative or quantitative variables belonging to a set of items. Simply put, data is an attribute, property or characteristic of an object. The point to note here is that data can be both qualitative (brown eye color) and quantitative (20 cm long). A common way of representing or displaying a set of correlated data is through a table-type structure composed of rows and columns. In such structures, the columns generally signify attributes, characteristics or features, and each row (tuple) signifies a set of co-related features belonging to one single item.

While speaking about data, it is important to understand how data differs from similar terms like information and knowledge. While a set of data can be used together to directly derive information, knowledge or wisdom is often derived in an indirect manner. In our previous article on learning data mining, we gave examples to illustrate the differences between data, information and knowledge. Using the same example, consider the manager of a local market store that sells hundreds of candles to its customers every Sunday. Which customer buys candles on which date - those are the data stored in the store's database. These data give information, such as how many candles are sold from the store per week - information that may be valuable for inventory management. This information can be further used to indirectly infer that people who buy candles every Sunday go to church to offer a prayer. Now that's knowledge - a new learning based on available information. Another way to look at it is by considering the level of abstraction: data is objective and thus has the lowest level of abstraction, whereas information and knowledge are increasingly subjective and involve higher levels of abstraction. In more scientific terms, one may conclude that data has a higher level of entropy than information or knowledge.

Types of Data
One of the fundamental things you must learn before attempting any kind of data modeling is that how we model the data depends completely on the nature or type of the data. Data can be both qualitative and quantitative, and it is important to understand the distinction between them. Qualitative data: qualitative data are also called categorical data, as they represent distinct categories rather than numbers. In dimensional modeling, they are often termed "dimensions". Mathematical operations such as addition or subtraction do not make any sense on such data. Examples of qualitative data are eye color, zip code, phone number etc. Qualitative data can be further classified into the classes below:
NOMINAL :

Nominal data represent data where the order of the values does not carry any meaningful information. Consider your passport number: there is no information as such in whether your passport number is greater or smaller than someone else's. Consider the eye color of people: it does not matter in which order we represent the eye colors. ID, ZIP code, phone number, eye color etc. are examples of the nominal class of qualitative data.

ORDINAL :

The order of the data is important for ordinal data. Consider the height of people - tall, medium, short. Although these values are qualitative, their order does matter, in the sense that they represent comparative information. Similarly, letter grades, a scale of 1-10 etc. are examples of ordinal data. In the field of dimensional modeling, this kind of data is sometimes referred to as non-additive facts. Quantitative data: quantitative data are also called numeric data, as they represent numbers. In the dimensional data modeling approach, these data are termed "measures". Examples of quantitative data are the height of a person, the amount of goods sold, revenue etc. Quantitative attributes can be further classified as below.
INTERVAL :

The interval classification is used where there is no true zero point in the data and division does not make sense. Bank balance, temperature on the Celsius scale, GRE score etc. are examples of interval-class data. Dividing one GRE score by another GRE score does not make any sense. In dimensional modeling, this is analogous to semi-additive facts.
RATIO :

The ratio class applies to data that has a true zero and where division does make sense. Consider revenue, length of time etc. These measures are generally additive. The table below illustrates the operations that are possible on each data type.

ACTIONS -->   Distinct   Order   Addition   Multiplication
Nominal       Yes        No      No         No
Ordinal       Yes        Yes     No         No
Interval      Yes        Yes     Yes        No
Ratio         Yes        Yes     Yes        Yes

It is essential to understand the above differences in the nature of data in order to choose an appropriate model to store them. Many analytical tools (e.g. MS Excel) and data mining tools (e.g. R) do not automatically understand the nature of the data, so we need to explicitly model the data for those tools. For example, R provides two test functions, is.numeric() and is.factor(), to determine whether data is numeric or categorical (dimensional) respectively, and if the default attribution is wrong we can use functions like as.factor() or as.numeric() to re-attribute the nature of the data.

What is a star schema?
This schema is used in data warehouse models where one centralized fact table references a number of dimension tables, so that the primary keys from all the dimension tables flow into the fact table (as foreign keys), where the measures are stored. The entity-relationship diagram looks like a star, hence the name.

Consider a fact table that stores sales quantity for each product and customer at a certain time. Sales quantity is the measure here, and keys from the customer, product and time dimension tables flow into the fact table. If you are not very familiar with star schema design or its use, we strongly recommend you read our article on this subject - different schema in dimensional modeling.
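A minimal DDL sketch of such a star schema is shown below; the table and column names are illustrative only.

-- Dimension tables: a surrogate key plus descriptive attributes each
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,   -- surrogate key
    product_code VARCHAR(20),           -- natural key from the source system
    product_name VARCHAR(100)
);

CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name VARCHAR(100)
);

CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,  -- e.g. 20240405
    calendar_date DATE
);

-- Central fact table: dimension keys plus the measure
CREATE TABLE sales_fact (
    product_key    INTEGER REFERENCES dim_product (product_key),
    customer_key   INTEGER REFERENCES dim_customer (customer_key),
    date_key       INTEGER REFERENCES dim_date (date_key),
    sales_quantity NUMERIC(12,2)
);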

Data warehouse:

Bill Inmon, known as the father of data warehousing, defined it as follows: "A data warehouse is a subject oriented, integrated, time variant, non volatile collection of data in support of management's decision making process."

Subject oriented: means that the data addresses a specific subject, such as sales or inventory. Integrated: means that the data is obtained from a variety of sources. Time variant: implies that the data is stored with a time element, so that changes to the data over time are captured. Non volatile: implies that data is never removed, i.e. historical data is also kept.

2. What is the difference between a database and a data warehouse?
A database is a collection of related data. A data warehouse is also a collection of information, as well as a supporting system.

3. What are the benefits of data warehousing?
- Historical information for comparative and competitive analysis.
- Enhanced data quality and completeness.
- Supplementing disaster recovery plans with another data backup source.

4. What are the types of data warehouse?
There are mainly three types of data warehouse:
- Enterprise data warehouse
- Operational data store
- Data mart

5. What is the difference between data mining and data warehousing?
In data mining, operational data is analyzed using statistical and clustering techniques to find hidden patterns and trends. Data mining thus produces a kind of summarization of the data, which can be used by data warehouses for faster analytical processing for business intelligence. A data warehouse may make use of data mining for analytical processing of the data in a faster way.

6. What are the applications of a data warehouse?
- Data warehouses are used extensively in banking and financial services and in consumer goods.
- A data warehouse is mainly used for generating reports and answering predefined queries.
- A data warehouse is used for strategic purposes, performing multidimensional analysis.
- A data warehouse is used for knowledge discovery and strategic decision making using data mining tools.

7. What are the types of data warehouse applications?
- Information processing
- Analytical processing
- Data mining

8. What is metadata?
Metadata is defined as data about data. Metadata describes the entities and the descriptions of their attributes.

9. What are the benefits of data warehousing?
The implementation of a data warehouse can provide many benefits to an organization. A data warehouse can:
- Facilitate integration in an environment characterized by unintegrated applications.
- Integrate enterprise data across a variety of functions.
- Integrate external as well as internal data.
- Support strategic and long-term business planning.
- Support day-to-day tactical decisions.
- Enable insight into business trends and business opportunities.
- Organize and store the historical data needed for analysis.
- Make available historical data, extending over many years, which enables trend analysis.
- Provide more accurate and complete information.
- Improve knowledge about the business.
- Enable cost-effective decision making.
- Enable organizations to understand their customers and their needs, as well as competitors.
- Enhance customer service and satisfaction.
- Provide competitive advantage.
- Provide easy access for end users.
- Provide timely access to corporate information.

10. What is the difference between a dimension table and a fact table?
A dimension table consists of tuples of attributes of the dimension. A fact table can be thought of as having tuples, one per recorded fact. Each fact contains some measured or observed variables and identifies them with pointers to dimension tables.


ETL testing (Extract, Transform, and Load): It has been observed that independent verification and validation is gaining huge market potential, and many companies now see it as a prospective business gain. Customers have been offered a range of products in terms of service offerings, distributed across many areas based on technology, process and solutions. ETL / data warehouse testing is one of the offerings that is developing rapidly and successfully.

Why do organizations need a data warehouse?
Organizations with organized IT practices are looking to create the next level of technology transformation. They are now trying to make themselves much more operational with easy-to-interoperate data. Data is the most important part of any organization, whether it is everyday data or historical data: data is the backbone of any report, and reports are the baseline on which all the vital management decisions are taken. Most companies are taking a step forward in constructing a data warehouse to store and monitor real-time as well as historical data. Crafting an efficient data warehouse is not an easy job; many organizations have distributed departments with different applications running on distributed technology. An ETL tool is employed in order to make a seamless integration between the different data sources from different departments. The ETL tool works as an integrator, extracting data from different sources, transforming it into the preferred format based on the business transformation rules, and loading it into a cohesive database known as the data warehouse. A well-planned, well-defined and effective testing scope guarantees a smooth conversion of the project to production. A business gains real confidence once the ETL processes are verified and validated by an independent group of experts, making sure that the data warehouse is concrete and robust. ETL or data warehouse testing is categorized into four different engagements, irrespective of the technology or ETL tools used:

New data warehouse testing: a new DW is built and verified from scratch. Data input is taken from customer requirements and different data sources, and the new data warehouse is built and verified with the help of ETL tools.
Migration testing: in this type of project the customer already has an existing DW and ETL performing the job, but is looking to adopt a new tool in order to improve efficiency.
Change request: in this type of project, new data is added from different sources to an existing DW. There might also be a condition where the customer needs to change the existing business rules or integrate new rules.
Report testing: reports are the end result of any data warehouse and the basic purpose for which the DW is built. Reports must be tested by validating the layout, the data in the report, and the calculations.

ETL Testing Techniques:

1) Verify that data is transformed correctly according to the various business requirements and rules.
2) Make sure that all projected data is loaded into the data warehouse without any data loss or truncation.
3) Make sure that the ETL application appropriately rejects invalid data, replaces it with default values, and reports it.
4) Make sure that data is loaded into the data warehouse within the prescribed and expected time frames, to confirm performance and scalability.
Apart from these four main ETL testing methods, other testing methods like integration testing and user acceptance testing are also carried out to make sure everything is smooth and reliable.

ETL Testing Process:


Similar to any other testing that falls under independent verification and validation, ETL testing also goes through the same phases.

- Business and requirement understanding
- Validating test estimation
- Test planning based on the inputs from the test estimation and the business requirements
- Designing test cases and test scenarios from all the available inputs
- Once all the test cases are ready and approved, the testing team proceeds to perform pre-execution checks and test data preparation
- Lastly, execution is performed until the exit criteria are met
- Upon successful completion, a summary report is prepared and the closure process is done

It is necessary to define a test strategy, which should be mutually accepted by the stakeholders before starting the actual testing. A well-defined test strategy ensures that the correct approach has been followed and that the testing goals are met. ETL testing might require the testing team to write SQL statements extensively, or to tailor the SQL provided by the development team. In either case, the testing team must be aware of the results they are trying to get using those SQL statements.

Difference between Database and Data Warehouse Testing
There is a popular misunderstanding that database testing and data warehouse testing are similar, while in fact they take different directions.

Database testing is done using a smaller scale of data, normally with OLTP (online transaction processing) type databases, while data warehouse testing is done with large volumes of data involving OLAP (online analytical processing) databases. In database testing, data is normally injected consistently from uniform sources, while in data warehouse testing most of the data comes from different kinds of data sources that are not consistent with one another. In database testing we generally perform CRUD (create, read, update and delete) operations, while in data warehouse testing we mostly use read-only (SELECT) operations.

Normalized databases are used in DB testing, while denormalized databases are used in data warehouse testing.

There are a number of universal verifications that have to be carried out for any kind of data warehouse testing. Below is the list of checks that are treated as essential in ETL testing:
- Verify that data transformation from source to destination works as expected
- Verify that the expected data is added to the target system
- Verify that all DB fields and field data are loaded without any truncation
- Verify data checksums and record count matches
- Verify that proper error logs with full details are generated for rejected data
- Verify NULL value fields
- Verify that duplicate data is not loaded
- Verify data integrity
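A few of these checks can be expressed as simple SQL probes. The sketch below assumes hypothetical source_customer and target_dim_customer tables reachable from one connection (for example via a staging copy of the source).

-- Record count match between source and target
SELECT COUNT(*) AS source_rows FROM source_customer;
SELECT COUNT(*) AS target_rows FROM target_dim_customer;

-- Duplicate check on the natural key in the target
SELECT customer_code, COUNT(*) AS copies
FROM   target_dim_customer
GROUP BY customer_code
HAVING COUNT(*) > 1;

-- NULL check on a mandatory field
SELECT COUNT(*) AS null_names
FROM   target_dim_customer
WHERE  customer_name IS NULL;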

ETL Testing Challenges:


ETL testing is quite different from conventional testing, and there are many challenges to face while performing data warehouse testing. Here is a list of a few ETL testing challenges I experienced on my project:
- Incompatible and duplicate data.
- Loss of data during the ETL process.
- Unavailability of an inclusive test bed.
- Testers have no privileges to execute ETL jobs on their own.
- The volume and complexity of the data is very high.
- Faults in business processes and procedures.
- Trouble acquiring and building test data.
- Missing business flow information.
Data is important for businesses to make critical business decisions. ETL testing plays a significant role in validating and ensuring that the business information is exact, consistent and reliable. It also minimizes the risk of data loss in production.

In computing, Extract, Transform and Load (ETL) refers to a process in database usage, and especially in data warehousing, that involves:

- Extracting data from outside sources
- Transforming it to fit operational needs, which can include quality levels
- Loading it into the end target (a database - more specifically an operational data store, data mart or data warehouse)

Extract
The first part of an ETL process involves extracting the data from the source systems. In many cases this is the most challenging aspect of ETL, since extracting data correctly sets the stage for how the subsequent processes proceed.

ETL Architecture Pattern

Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization and/or format. Common data source formats are relational databases and flat files, but sources may also include non-relational database structures such as Information Management System (IMS), other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even data fetched from outside sources through web spidering or screen-scraping. Streaming the extracted data and loading it on-the-fly into the destination database is another way of performing ETL when no intermediate data storage is required. In general, the goal of the extraction phase is to convert the data into a single format appropriate for transformation processing. An intrinsic part of the extraction involves parsing the extracted data to check whether it meets an expected pattern or structure. If not, the data may be rejected entirely or in part.

Transform
The transform stage applies a series of rules or functions to the extracted data from the source to derive the data for loading into the end target. Some data sources will require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database:

- Selecting only certain columns to load (or selecting null columns not to load). For example, if the source data has three columns (also called attributes), say roll_no, age and salary, then the extraction may take only roll_no and salary. Similarly, the extraction mechanism may ignore all those records where salary is not present (salary = null).
- Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female)
- Encoding free-form values (e.g., mapping "Male" to "M")
- Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
- Sorting
- Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data
- Aggregation (for example, rolling up - summarizing multiple rows of data - total sales for each store, for each region, etc.)
- Generating surrogate-key values
- Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
- Splitting a column into multiple columns (e.g., converting a comma-separated list, specified as a string in one column, into individual values in different columns)
- Disaggregation of repeating columns into a separate detail table (e.g., moving a series of addresses in one record into single addresses in a set of records in a linked address table)
- Looking up and validating the relevant data from tables or referential files for slowly changing dimensions
- Applying any form of simple or complex data validation. If validation fails, it may result in a full, partial or no rejection of the data, and thus none, some or all of the data is handed over to the next step, depending on the rule design and exception handling.
Many of the above transformations may result in exceptions, for example when a code translation parses an unknown code in the extracted data.
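A few of the transformations above can be sketched directly in SQL; the staging table and column names below (stg_employee, stg_orders) are assumed for illustration.

-- Column selection and filtering: keep only roll_no and salary, skip rows with no salary
SELECT roll_no, salary
FROM   stg_employee
WHERE  salary IS NOT NULL;

-- Translating coded values and deriving a calculated value
SELECT order_id,
       CASE gender_code WHEN 1 THEN 'M' WHEN 2 THEN 'F' END AS gender,
       qty * unit_price AS sale_amount
FROM   stg_orders;

-- Aggregation: total sales per store
SELECT store_id, SUM(qty * unit_price) AS total_sales
FROM   stg_orders
GROUP BY store_id;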

Load
The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information; frequently, updating extracted data is done on a daily, weekly, or monthly basis. Other data warehouses (or even other parts of the same data warehouse) may add new data in a historical form at regular intervals -- for example, hourly. To understand this, consider a data warehouse that is required to maintain sales records of the last year. This data warehouse will overwrite any data that is older than a year with newer data. However, the entry of data for any one year window will be made in a historical manner. The timing and scope to replace or append are strategic design choices dependent on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded in the data warehouse.
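As a rough illustration of the two loading styles, here is a sketch using hypothetical stg_sales, sales_fact_history and sales_current tables; the MERGE statement follows Oracle/SQL Server-style syntax, which varies between databases.

-- Historical append: every load adds new, dated rows
INSERT INTO sales_fact_history (product_key, date_key, sales_quantity, load_date)
SELECT product_key, date_key, sales_quantity, CURRENT_DATE
FROM   stg_sales;

-- Overwrite/update (upsert) where only current values are kept
MERGE INTO sales_current t
USING stg_sales s
   ON (t.product_key = s.product_key AND t.date_key = s.date_key)
WHEN MATCHED THEN
    UPDATE SET t.sales_quantity = s.sales_quantity
WHEN NOT MATCHED THEN
    INSERT (product_key, date_key, sales_quantity)
    VALUES (s.product_key, s.date_key, s.sales_quantity);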

As the load phase interacts with a database, the constraints defined in the database schema as well as in triggers activated upon data load apply (for example, uniqueness, referential integrity, mandatory fields), which also contribute to the overall data quality performance of the ETL process.

For example, a financial institution might have information on a customer in several departments and each department might have that customer's information listed in a different way. The membership department might list the customer by name, whereas the accounting department might list the customer by number. ETL can bundle all this data and consolidate it into a uniform presentation, such as for storing in a database or data warehouse. Another way that companies use ETL is to move information to another application permanently. For instance, the new application might use another database vendor and most likely a very different database schema. ETL can be used to transform the data into a format suitable for the new application to use. An example of this would be an Expense and Cost Recovery System (ECRS) such as used by accountancies, consultancies and lawyers. The data usually ends up in the time and billing system, although some businesses may also utilize the raw data for employee productivity reports to Human Resources (personnel dept.) or equipment usage reports to Facilities Management.

Real-life ETL cycle


The typical real-life ETL cycle consists of the following execution steps:
1. Cycle initiation
2. Build reference data
3. Extract (from sources)
4. Validate
5. Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
6. Stage (load into staging tables, if used)
7. Audit reports (for example, on compliance with business rules; also, in case of failure, helps to diagnose/repair)
8. Publish (to target tables)
9. Archive
10. Clean up

Challenges
ETL processes can involve considerable complexity, and significant operational problems can occur with improperly designed ETL systems. The range of data values or data quality in an operational system may exceed the expectations of designers at the time validation and transformation rules are specified. Data profiling of a source during data analysis can identify the data conditions that will need to be managed by transform rule specifications. This will lead to an amendment of the validation rules explicitly and implicitly implemented in the ETL process. Data warehouses are typically assembled from a variety of data sources with different formats and purposes. As such, ETL is a key process for bringing all the data together in a standard, homogeneous environment. Design analysts should establish the scalability of an ETL system across the lifetime of its usage. This includes understanding the volumes of data that will have to be processed within service level agreements. The time available to extract from source systems may change, which may mean the same amount of data has to be processed in less time. Some ETL systems have to scale to process terabytes of data to update data warehouses with tens of terabytes of data. Increasing volumes of data may require designs that can scale from daily batch, to multiple-day micro batch, to integration with message queues or real-time change-data capture for continuous transformation and update.

Performance
ETL vendors benchmark their record systems at multiple TB (terabytes) per hour (or ~1 GB per second) using powerful servers with multiple CPUs, multiple hard drives, multiple gigabit network connections, and lots of memory. The fastest ETL record is currently held by Syncsort,[1] Vertica and HP, at 5.4 TB loaded in under an hour, which is more than twice as fast as the earlier record held by Microsoft and Unisys. In real life, the slowest part of an ETL process usually occurs in the database load phase. Databases may perform slowly because they have to take care of concurrency, integrity maintenance, and indices. Thus, for better performance, it may make sense to employ:

- a direct path extract method or bulk unload whenever possible (instead of querying the database), to reduce the load on the source system while getting a high-speed extract
- most of the transformation processing outside of the database
- bulk load operations whenever possible

Still, even using bulk operations, database access is usually the bottleneck in the ETL process. Some common methods used to increase performance are:

- Partition tables (and indices). Try to keep partitions similar in size (watch for null values which can skew the partitioning).
- Do all validation in the ETL layer before the load.
- Disable integrity checking (disable constraint ...) in the target database tables during the load.
- Disable triggers (disable trigger ...) in the target database tables during the load. Simulate their effect as a separate step.
- Generate IDs in the ETL layer (not in the database).
- Drop the indices (on a table or partition) before the load, and recreate them after the load (SQL: drop index ...; create index ...).
- Use parallel bulk load when possible - this works well when the table is partitioned or there are no indices. Note: attempting to do parallel loads into the same table (partition) usually causes locks - if not on the data rows, then on the indices.
- If a requirement exists to do insertions, updates, or deletions, find out which rows should be processed in which way in the ETL layer, and then process these three operations in the database separately. You can often do a bulk load for inserts, but updates and deletes commonly go through an API (using SQL).
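The constraint, trigger and index tips above might look like the following in Oracle-flavored SQL; the table, constraint and index names are hypothetical, and the exact syntax differs in other databases.

-- Disable integrity checking and triggers on the target before the bulk load
ALTER TABLE sales_fact DISABLE CONSTRAINT fk_sales_fact_product;
ALTER TABLE sales_fact DISABLE ALL TRIGGERS;

-- Drop indices before the load and recreate them afterwards
DROP INDEX idx_sales_fact_date;
-- ... bulk load runs here ...
CREATE INDEX idx_sales_fact_date ON sales_fact (date_key);

-- Re-enable constraints and triggers once the load has been validated
ALTER TABLE sales_fact ENABLE CONSTRAINT fk_sales_fact_product;
ALTER TABLE sales_fact ENABLE ALL TRIGGERS;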

Whether to do certain operations in the database or outside may involve a trade-off. For example, removing duplicates using DISTINCT may be slow in the database; thus, it makes sense to do it outside. On the other hand, if using DISTINCT significantly (100x) decreases the number of rows to be extracted, then it makes sense to remove duplications as early as possible, in the database, before unloading the data. A common source of problems in ETL is a large number of dependencies among ETL jobs; for example, job "B" cannot start while job "A" is not finished. You can usually achieve better performance by visualizing all processes on a graph, trying to reduce the graph by making maximum use of parallelism, and making "chains" of consecutive processing as short as possible. Again, partitioning big tables and their indices can really help. Another common issue occurs when the data is spread between several databases and processing is done in those databases sequentially. Sometimes database replication may be involved as a method of copying data between databases, and this can significantly slow down the whole process. The common solution is to reduce the processing graph to only three layers:

- Sources
- Central ETL layer
- Targets

This allows processing to take maximum advantage of parallel processing. For example, if you need to load data into two databases, you can run the loads in parallel (instead of loading into 1st - and then replicating into the 2nd). Of course, sometimes processing must take place sequentially. For example, you usually need to get dimensional (reference) data before you can get and validate the rows for main "fact" tables.

Parallel processing
A recent development in ETL software is the implementation of parallel processing. This has enabled a number of methods to improve overall performance of ETL processes when dealing with large volumes of data. ETL applications implement three main types of parallelism:

- Data: splitting a single sequential file into smaller data files to provide parallel access.
- Pipeline: allowing the simultaneous running of several components on the same data stream, for example looking up a value on record 1 at the same time as adding two fields on record 2.
- Component: the simultaneous running of multiple processes on different data streams in the same job, for example sorting one input file while removing duplicates on another file.

All three types of parallelism usually operate combined in a single job. An additional difficulty comes with making sure that the data being uploaded is relatively consistent. Because multiple source databases may have different update cycles (some may be updated every few minutes, while others may take days or weeks), an ETL system may be required to hold back certain data until all sources are synchronized. Likewise, where a warehouse may have to be reconciled to the contents in a source system or with the general ledger, establishing synchronization and reconciliation points becomes necessary.

Rerunnability, recoverability
Data warehousing procedures usually subdivide a big ETL process into smaller pieces running sequentially or in parallel. To keep track of data flows, it makes sense to tag each data row with "row_id", and tag each piece of the process with "run_id". In case of a failure, having these IDs will help to roll back and rerun the failed piece. Best practice also calls for "checkpoints", which are states when certain phases of the process are completed. Once at a checkpoint, it is a good idea to write everything to disk, clean out some temporary files, log the state, and so on.

Virtual ETL
As of 2010 data virtualization had begun to advance ETL processing. The application of data virtualization to ETL allowed solving the most common ETL tasks of data migration and application integration for multiple dispersed data sources. So-called Virtual ETL operates with the abstracted representation of the objects or entities gathered from the variety of relational, semi-structured and unstructured data sources. ETL tools can leverage object-oriented modeling and work with entities' representations persistently stored in a centrally located hub-and-spoke architecture. Such a collection that contains representations of the entities or objects gathered from the data sources for ETL processing is called a metadata repository and it can reside in memory[2] or be made persistent. By using a persistent metadata repository, ETL tools can transition from one-time projects to persistent middleware, performing data harmonization and data profiling consistently and in near-real time.[citation needed]

Dealing with keys


Keys are some of the most important objects in any relational database, as they tie everything together. A primary key is a column that identifies a given entity, whereas a foreign key is a column in another table that refers to a primary key. Keys can also be made up of several columns, in which case they are composite keys. In many cases the primary key is an auto-generated integer that has no meaning for the business entity being represented, but exists solely for the purposes of the relational database - commonly referred to as a surrogate key. As there will usually be more than one data source being loaded into the warehouse, keys are an important concern to be addressed. Your customers might be represented in several data sources; in one their SSN (Social Security Number) might be the primary key, their phone number in another, and a surrogate key in the third. All of the customer information needs to be consolidated into one dimension table. A recommended way to deal with this concern is to add a warehouse surrogate key, which is used as the foreign key from the fact table.[3] Usually, updates will occur to a dimension's source data, which obviously must be reflected in the data warehouse. If the primary key of the source data is required for reporting, the dimension already contains that piece of information for each row. If the source data uses a surrogate key, the warehouse must keep track of it even though it is never used in queries or reports. That is done by creating a lookup table that contains the warehouse surrogate key and the originating key.[4] This way the dimension is not polluted with surrogates from various source systems, while the ability to update is preserved. The lookup table is used in different ways depending on the nature of the source data. There are five types to consider,[5] of which three are included here:
Type 1: the dimension row is simply updated to match the current state of the source system; the warehouse does not capture history; the lookup table is used to identify which dimension row to update/overwrite.
Type 2: a new dimension row is added with the new state of the source system; a new surrogate key is assigned; the source key is no longer unique in the lookup table.
Fully logged: a new dimension row is added with the new state of the source system, while the previous dimension row is updated to reflect that it is no longer active and to record the time of deactivation.
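A sketch of such a lookup (key map) table, with hypothetical names, might look like this:

-- Maps each source system's original key to the single warehouse surrogate key
CREATE TABLE customer_key_lookup (
    warehouse_customer_key INTEGER     NOT NULL,  -- surrogate key used by the fact tables
    source_system          VARCHAR(30) NOT NULL,  -- e.g. 'CRM', 'BILLING'
    source_customer_key    VARCHAR(50) NOT NULL,  -- SSN, phone number or source surrogate
    PRIMARY KEY (source_system, source_customer_key)
);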

Tools
Programmers can set up ETL processes using almost any programming language, but building such processes from scratch can become complex. Increasingly, companies are buying ETL tools to help in the creation of ETL processes.[6] By using an established ETL framework, one may increase one's chances of ending up with better connectivity and scalability[citation needed]. A good ETL tool must be able to communicate with the many different relational databases and read the various file formats used throughout an organization. ETL tools have started to migrate into Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation, and loading of data. Many ETL vendors now have data profiling, data quality, and metadata capabilities. A common use case for ETL tools is converting CSV files to formats readable by relational databases. A typical translation of millions of records is facilitated by ETL tools that enable users to input CSV-like data feeds/files and import them into a database with as little code as possible. ETL tools are used by a broad range of professionals - from students in computer science looking to quickly import large data sets, to database architects in charge of company account management - and have become a convenient tool that can be relied on to get maximum performance. ETL tools in most cases contain a GUI that helps users conveniently transform data, as opposed to writing large programs to parse files and modify data types.

Business intelligence
Business intelligence (BI) is a set of theories, methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information for business purposes. BI can handle large amounts of information to help identify and develop new opportunities. Making use of new opportunities and implementing an effective strategy can provide a competitive market advantage and long-term stability.[1] BI technologies provide historical, current and predictive views of business operations. Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining, predictive analytics and prescriptive analytics. Though the term business intelligence is sometimes a synonym for competitive intelligence (because they both support decision making), BI uses technologies, processes, and applications to analyze mostly internal, structured data and business processes, while competitive intelligence gathers, analyzes and disseminates information with a topical focus on company competitors. If understood broadly, business intelligence can include the subset of competitive intelligence.

Slowly changing dimension
Dimension is a term in data management and data warehousing. It refers to logical groupings of data such as geographical location, customer or product information. With slowly changing dimensions (SCDs), data changes slowly, rather than changing on a time-based, regular schedule.[1] For example, you may have a dimension in your database that tracks the sales records of your company's salespeople. Creating sales reports seems simple enough, until a salesperson is transferred from one regional office to another. How do you record such a change in your sales dimension? You could calculate the sum or average of each salesperson's sales, but if you use that to compare the performance of salespeople, it might give misleading information. If a salesperson was transferred from a hot market where sales were easy and now works in a market where sales are infrequent, his or her totals will look much stronger than those of the other salespeople in the new region. Or you could create a second salesperson record and treat the transferred person as a new salesperson, but that creates problems. Dealing with these issues involves SCD management methodologies referred to as Type 0 through 6. Type 6 SCDs are also sometimes called hybrid SCDs.

Type 0
The Type 0 method is passive: no action is performed when the dimension's source data changes, and values remain as they were at the time the dimension record was first inserted. In certain circumstances history is preserved with a Type 0. Higher-order types are employed to guarantee the preservation of history, whereas Type 0 provides the least control, or none. The most common types are I, II, and III.

Type I
This methodology overwrites old data with new data, and therefore does not track historical data. Example of a supplier table:
Supplier_Key | Supplier_Code | Supplier_Name  | Supplier_State
123          | ABC           | Acme Supply Co | CA

In the above example, Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the surrogate key is not necessary, since the row is already unique by the natural key (Supplier_Code). However, joins perform better on integer keys than on character keys. If the supplier relocates its headquarters to Illinois, the record is simply overwritten:
Supplier_Key | Supplier_Code | Supplier_Name  | Supplier_State
123          | ABC           | Acme Supply Co | IL

The disadvantage of the Type I method is that there is no history in the data warehouse. Its advantage, however, is that it is easy to maintain. Note that if you have calculated an aggregate table summarizing facts by state, it will need to be recalculated when the Supplier_State is changed.[1]
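In SQL terms, a Type 1 change is just an in-place update; the sketch below assumes the example table is called supplier.

-- Type 1: overwrite the attribute; no history is kept
UPDATE supplier
SET    supplier_state = 'IL'
WHERE  supplier_code  = 'ABC';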

Type II
This method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. Unlimited history is preserved for each insert. For example, if the supplier relocates to Illinois the version numbers will be incremented sequentially:
Supplier_Key | Supplier_Code | Supplier_Name  | Supplier_State | Version
123          | ABC           | Acme Supply Co | CA             | 0
124          | ABC           | Acme Supply Co | IL             | 1

Another method is to add 'effective date' columns.


Supplier_Key | Supplier_Code | Supplier_Name  | Supplier_State | Start_Date  | End_Date
123          | ABC           | Acme Supply Co | CA             | 01-Jan-2000 | 21-Dec-2004
124          | ABC           | Acme Supply Co | IL             | 22-Dec-2004 |

The null End_Date in row two indicates the current tuple version. In some cases, a standardized surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an index, and so that null-value substitution is not required when querying. Transactions that reference a particular surrogate key (Supplier_Key) are then permanently bound to the time slices defined by that row of the slowly changing dimension table. An aggregate table summarizing facts by state continues to reflect the historical state, i.e. the state the supplier was in at the time of the transaction; no update is needed. If there are retrospective changes made to the contents of the dimension, or if new attributes are added to the dimension (for example a Sales_Rep column) which have different effective dates from those already defined, then this can result in the existing transactions needing to be updated to reflect the new situation. This can be an expensive database operation, so Type 2 SCDs are not a good choice if the dimensional model is subject to change.[1]
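A Type 2 change with effective dates is typically applied in two steps, sketched below against the same assumed supplier table.

-- Close off the currently active row for this natural key
UPDATE supplier
SET    end_date = DATE '2004-12-21'
WHERE  supplier_code = 'ABC'
  AND  end_date IS NULL;

-- Insert a new row with a fresh surrogate key for the new state
INSERT INTO supplier (supplier_key, supplier_code, supplier_name, supplier_state, start_date, end_date)
VALUES (124, 'ABC', 'Acme Supply Co', 'IL', DATE '2004-12-22', NULL);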

Type III
This method tracks changes using separate columns and preserves limited history. Type III preserves only limited history because it is limited to the number of columns designated for storing historical data. The original table structure of Type I and Type II is retained, but Type III adds additional columns. In the following example, an additional column has been added to the table to record the supplier's original state - only the previous history is stored.
Supplier_Key | Supplier_Code | Supplier_Name  | Original_Supplier_State | Effective_Date | Current_Supplier_State
123          | ABC           | Acme Supply Co | CA                      | 22-Dec-2004    | IL

This record contains a column for the original state and a column for the current state, so it cannot track the changes if the supplier relocates a second time. One variation of this is to create the field Previous_Supplier_State instead of Original_Supplier_State, which would track only the most recent historical change.[1]
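For the Previous_Supplier_State variation, a Type 3 change can be sketched as a single update against the assumed supplier table.

-- Type 3: shift the current value into the history column, then overwrite it
UPDATE supplier
SET    previous_supplier_state = current_supplier_state,
       current_supplier_state  = 'IL',
       effective_date          = DATE '2004-12-22'
WHERE  supplier_code = 'ABC';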

Type IV
The Type 4 method is usually referred to as using "history tables", where one table keeps the current data, and an additional table is used to keep a record of some or all changes. Both the surrogate keys are referenced in the Fact table to enhance query performance. For the above example the original table name is Supplier and the history table is Supplier_History.
Supplier
Supplier_Key | Supplier_Code | Supplier_Name  | Supplier_State
123          | ABC           | Acme Supply Co | IL

Supplier_History
Supplier_Key | Supplier_Code | Supplier_Name  | Supplier_State | Create_Date
123          | ABC           | Acme Supply Co | CA             | 22-Dec-2004

This method resembles how database audit tables and change data capture techniques function.

Type 6 / hybrid
The Type 6 method combines the approaches of types 1, 2 and 3 (1 + 2 + 3 = 6). One possible explanation of the origin of the term is that it was coined by Ralph Kimball during a conversation with Stephen Pace from Kalido[citation needed]. Ralph Kimball calls this method "Unpredictable Changes with Single-Version Overlay" in The Data Warehouse Toolkit.[1] The Supplier table starts out with one record for our example supplier:
Supplier_Key | Supplier_Code | Supplier_Name  | Current_State | Historical_State | Start_Date  | End_Date    | Current_Flag
123          | ABC           | Acme Supply Co | CA            | CA               | 01-Jan-2000 | 31-Dec-9999 | Y

The Current_State and the Historical_State are the same. The Current_Flag attribute indicates that this is the current or most recent record for this supplier. When Acme Supply Company moves to Illinois, we add a new record, as in Type 2 processing:
Supplier_Key | Supplier_Code | Supplier_Name  | Current_State | Historical_State | Start_Date  | End_Date    | Current_Flag
123          | ABC           | Acme Supply Co | IL            | CA               | 01-Jan-2000 | 21-Dec-2004 | N
124          | ABC           | Acme Supply Co | IL            | IL               | 22-Dec-2004 | 31-Dec-9999 | Y

We overwrite the Current_State information in the first record (Supplier_Key = 123) with the new information, as in Type 1 processing. We create a new record to track the changes, as in Type 2 processing. And we store the history in a second State column (Historical_State), which incorporates Type 3 processing. For example if the supplier were to relocate again, we would add another record to the Supplier dimension, and we would overwrite the contents of the Current_State column:
Supplier_Key | Supplier_Code | Supplier_Name  | Current_State | Historical_State | Start_Date  | End_Date    | Current_Flag
123          | ABC           | Acme Supply Co | NY            | CA               | 01-Jan-2000 | 21-Dec-2004 | N
124          | ABC           | Acme Supply Co | NY            | IL               | 22-Dec-2004 | 03-Feb-2008 | N
125          | ABC           | Acme Supply Co | NY            | NY               | 04-Feb-2008 | 31-Dec-9999 | Y

Note that, for the current record (Current_Flag = 'Y'), the Current_State and the Historical_State are always the same.[1]

Type 2 / Type 6 fact implementation

Type 2 surrogate key with Type 3 attribute


In many Type 2 and Type 6 SCD implementations, the surrogate key from the dimension is put into the fact table in place of the natural key when the fact data is loaded into the data repository.[1] The surrogate key is selected for a given fact record based on its effective date and the Start_Date and End_Date from the dimension table. This allows the fact data to be easily joined to the correct dimension data for the corresponding effective date. Here is the Supplier table as we created it above using Type 6 Hybrid methodology:
Supplier_Key | Supplier_Code | Supplier_Name  | Current_State | Historical_State | Start_Date  | End_Date    | Current_Flag
123          | ABC           | Acme Supply Co | NY            | CA               | 01-Jan-2000 | 21-Dec-2004 | N
124          | ABC           | Acme Supply Co | NY            | IL               | 22-Dec-2004 | 03-Feb-2008 | N
125          | ABC           | Acme Supply Co | NY            | NY               | 04-Feb-2008 | 31-Dec-9999 | Y

Once the Delivery table contains the correct Supplier_Key, it can easily be joined to the Supplier table using that key. The following SQL retrieves, for each fact record, the current supplier state and the state the supplier was located in at the time of the delivery:
SELECT delivery.delivery_cost,
       supplier.supplier_name,
       supplier.historical_state,
       supplier.current_state
FROM   delivery
INNER JOIN supplier
        ON delivery.supplier_key = supplier.supplier_key

Pure Type 6 implementation

Having a Type 2 surrogate key for each time slice can cause problems if the dimension is subject to change.[1] A pure Type 6 implementation does not use this, but uses a Surrogate Key for each master data item (e.g. each unique supplier has a single surrogate key). This avoids any changes in the master data having an impact on the existing transaction data. It also allows more options when querying the transactions. Here is the Supplier table using the pure Type 6 methodology:
Supplier_Key | Supplier_Code | Supplier_Name  | Supplier_State | Start_Date  | End_Date
456          | ABC           | Acme Supply Co | CA             | 01-Jan-2000 | 21-Dec-2004
456          | ABC           | Acme Supply Co | IL             | 22-Dec-2004 | 03-Feb-2008
456          | ABC           | Acme Supply Co | NY             | 04-Feb-2008 | 31-Dec-9999

The following example shows how the query must be extended to ensure a single supplier record is retrieved for each transaction.
SELECT supplier.supplier_code,
       supplier.supplier_state
FROM   supplier
INNER JOIN delivery
        ON supplier.supplier_key = delivery.supplier_key
       AND delivery.delivery_date >= supplier.start_date
       AND delivery.delivery_date <= supplier.end_date

A fact record with an effective date (Delivery_Date) of August 9, 2001 will be linked to Supplier_Code of ABC, with a Supplier_State of 'CA'. A fact record with an effective date of October 11, 2007 will also be linked to the same Supplier_Code ABC, but with a Supplier_State of 'IL'. Whilst more complex, there are a number of advantages of this approach, including:
1. If there is more than one date on the fact (e.g. order date, delivery date, invoice payment date), you can choose which date to use for a query.
2. You can do "as at now", "as at transaction time" or "as at a point in time" queries by changing the date filter logic.
3. You don't need to reprocess the fact table if there is a change in the dimension table (e.g. adding additional fields retrospectively which change the time slices, or if you make a mistake in the dates on the dimension table you can correct them easily).
4. You can introduce bi-temporal dates in the dimension table.
5. You can join the fact to multiple versions of the dimension table, to allow reporting of the same information with different effective dates in the same query.

The following example shows how a specific date such as '2012-01-01 00:00:00' (which could be the current datetime) can be used.
SELECT supplier.supplier_code,
       supplier.supplier_state
FROM   supplier
INNER JOIN delivery
        ON supplier.supplier_key = delivery.supplier_key
       AND '2012-01-01 00:00:00' >= supplier.start_date
       AND '2012-01-01 00:00:00' <= supplier.end_date

Both surrogate and natural key


An alternative implementation is to place both the surrogate key and the natural key into the fact table.[2] This allows the user to select the appropriate dimension records based on:

- the primary effective date on the fact record (as above),
- the most recent or current information,
- any other date associated with the fact record.

This method allows more flexible links to the dimension, even if you have used the Type 2 approach instead of Type 6. Here is the Supplier table as we might have created it using Type 2 methodology:
Supplier_Key | Supplier_Code | Supplier_Name  | Supplier_State | Start_Date  | End_Date    | Current_Flag
123          | ABC           | Acme Supply Co | CA             | 01-Jan-2000 | 21-Dec-2004 | N
124          | ABC           | Acme Supply Co | IL             | 22-Dec-2004 | 03-Feb-2008 | N
125          | ABC           | Acme Supply Co | NY             | 04-Feb-2008 | 31-Dec-9999 | Y

The following SQL retrieves the most current Supplier_Name and Supplier_State for each fact record:
SELECT delivery.delivery_cost,
       supplier.supplier_name,
       supplier.supplier_state
FROM   delivery
INNER JOIN supplier
        ON delivery.supplier_code = supplier.supplier_code
WHERE  supplier.current_flag = 'Y'

If there are multiple dates on the fact record, the fact can be joined to the dimension using another date instead of the primary effective date. For instance, the Delivery table might have a primary effective date of Delivery_Date, but might also have an Order_Date associated with each record. The following SQL retrieves the correct Supplier_Name and Supplier_State for each fact record based on the Order_Date:
SELECT delivery.delivery_cost,
       supplier.supplier_name,
       supplier.supplier_state
FROM   delivery
INNER JOIN supplier
        ON delivery.supplier_code = supplier.supplier_code
       AND delivery.order_date >= supplier.start_date
       AND delivery.order_date <= supplier.end_date

Some cautions:

- If the join query is not written correctly, it may return duplicate rows and/or give incorrect answers.
- The date comparison might not perform well.
- Some business intelligence tools do not handle generating complex joins well.
- The ETL processes needed to create the dimension table need to be carefully designed to ensure that there are no overlaps in the time periods for each distinct item of reference data.

Combining types
Different SCD Types can be applied to different columns of a table. For example, we can apply Type 1 to the Supplier_Name column and Type 2 to the Supplier_State column of the same table, the Supplier table. Data warehousing is the repository of integrated information data will be extracted from the heterogeneous sources. Data warehousing architecture contains the different; sources like oracle, flat files and ERP then after it have the staging area and Data warehousing, after that it has the different Data marts then it have the reports and it also have the ODS - Operation Data Store. This complete architecture is called the Data warehousing Architecture. Benefits of data warehousing: => Data warehouses are designed to perform well with aggregate queries running on large amounts of data. => The structure of data warehouses is easier for end users to navigate, understand and query against unlike the relational databases primarily designed to handle lots of transactions. => Data warehouses enable queries that cut across different segments of a company's operation.

E.g. production data could be compared against inventory data even if they were originally stored in different databases with different structures.
=> Queries that would be complex in highly normalized databases can be easier to build and maintain in a data warehouse, decreasing the workload on transaction systems.
=> Data warehousing is an efficient way to manage and report on data that comes from a variety of sources and is non-uniform and scattered throughout a company.
=> Data warehousing is an efficient way to manage demand for lots of information from lots of users.
=> Data warehousing provides the capability to analyze large amounts of historical data for nuggets of wisdom that can give an organization a competitive advantage.

Data modeling is the process of designing a database model. In this model, data is stored in two types of tables: fact tables and dimension tables. A fact table contains the transaction data and a dimension table contains the master data.

Data mining is the process of finding hidden trends in data. A multi-dimensional structure called the data cube supports this: it is a data abstraction that allows one to view aggregated data from a number of perspectives. Conceptually, the cube consists of a core or base cuboid, surrounded by a collection of sub-cubes/cuboids that represent the aggregation of the base cuboid along one or more dimensions. We refer to the dimension to be aggregated as the measure attribute, while the remaining dimensions are known as the feature attributes.

OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension tables) to enable multidimensional viewing, analysis and querying of large amounts of data. E.g. OLAP technology could provide management with fast answers to complex queries on their operational data or enable them to analyze their company's historical data for trends and patterns.

OLTP stands for Online Transaction Processing. OLTP uses normalized tables to quickly record large amounts of transactions while making sure that these updates occur in as few places as possible. Consequently, OLTP databases are designed for recording the daily operations and transactions of a business. E.g. a timecard system that supports a large production environment must successfully record a large number of updates during critical periods like lunch hour, breaks, startup and close of work.

Dimensions are categories by which summarized data can be viewed. E.g. a profit summary in a fact table can be viewed by a Time dimension (profit by month, quarter, year), a Region dimension (profit by country, state, city) and a Product dimension (profit for product1, product2).

MOLAP Cubes: MOLAP stands for Multidimensional OLAP. In MOLAP cubes the data aggregations and a copy of the fact data are stored in a multidimensional structure on the Analysis Server computer. It is best when extra storage space is available on the Analysis Server computer and the best query performance is desired. MOLAP local cubes contain all the necessary data for calculating aggregates and can be used offline. MOLAP cubes provide the fastest query response time and performance but require additional storage space for the extra copy of data from the fact table.
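To illustrate viewing a measure along the Time, Region and Product dimensions described above (and the cuboids a data cube is built from), here is a minimal sketch; the fact_sales, dim_time, dim_region and dim_product tables and their columns are hypothetical assumptions:

-- GROUP BY CUBE computes the profit total for every combination of the
-- listed dimensions: the base cuboid plus all of its sub-cuboids.
SELECT t.calendar_year,
       r.country,
       p.product_name,
       SUM(f.profit) AS total_profit
FROM   fact_sales  f
JOIN   dim_time    t ON f.time_key    = t.time_key
JOIN   dim_region  r ON f.region_key  = r.region_key
JOIN   dim_product p ON f.product_key = p.product_key
GROUP BY CUBE (t.calendar_year, r.country, p.product_name);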

ROLAP Cubes: ROLAP stands for Relational OLAP. In ROLAP cubes a copy of the data from the fact table is not made, and the data aggregates are stored in tables in the source relational database. A ROLAP cube is best when there is limited space on the Analysis Server and query performance is not very important. ROLAP local cubes contain the dimensions and cube definitions, but aggregates are calculated when they are needed. ROLAP cubes require less storage space than MOLAP and HOLAP cubes.

HOLAP Cubes: HOLAP stands for Hybrid OLAP. A HOLAP cube combines ROLAP and MOLAP characteristics: it does not create a copy of the source data; however, data aggregations are stored in a multidimensional structure on the Analysis Server computer. HOLAP cubes are best when storage space is limited but faster query responses are needed.

You can disconnect a report from the catalog to which it is attached by saving the report with a snapshot of the data.

An active data warehouse provides information that enables decision-makers within an organization to manage customer relationships nimbly, efficiently and proactively.

Star schema: a single fact table with N dimensions, where all dimensions are linked directly to the fact table. This schema is de-normalized and results in simple joins, less complex queries and faster results.

Snowflake schema: any dimension with extended dimensions is known as a snowflake schema; dimensions may be interlinked or may have one-to-many relationships with other tables. This schema is normalized and results in more complex joins, more complex queries and slower results. A sketch contrasting the two join patterns follows below.

A concept hierarchy that is a total (or partial) order among attributes in a database schema is called a schema hierarchy.

The roll-up operation, also called the drill-up operation, performs aggregation on a data cube either by climbing up a concept hierarchy for a dimension or by dimension reduction.

Indexing is a technique used for efficient data retrieval, i.e. accessing data in a faster manner. When a table grows in volume, the indexes also increase in size, requiring more storage.
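As a hedged sketch of the star versus snowflake contrast (the fact_sales, dim_product and dim_product_category tables are hypothetical), the first query shows the single-hop join typical of a star schema, while the second shows the extra hop a snowflaked product dimension introduces:

-- Star schema: the fact table joins directly to a de-normalized dimension.
SELECT p.category_name, SUM(f.sales_amount) AS total_sales
FROM   fact_sales  f
JOIN   dim_product p ON f.product_key = p.product_key
GROUP BY p.category_name;

-- Snowflake schema: the category attribute is normalized into its own
-- table, so an extra join is needed to reach it.
SELECT c.category_name, SUM(f.sales_amount) AS total_sales
FROM   fact_sales           f
JOIN   dim_product          p ON f.product_key  = p.product_key
JOIN   dim_product_category c ON p.category_key = c.category_key
GROUP BY c.category_name;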

Dimensional Modeling is a design concept used by many data warehouse designers to build their data warehouse. In this design model all the data is stored in two types of tables: fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of those measurements, i.e. the dimensions on which the facts are calculated. Dimensional modeling is thus a method for designing a data warehouse, and it is usually carried out at three levels:
1. Conceptual modeling
2. Logical modeling
3. Physical modeling
A minimal DDL sketch of a fact table and one of its dimension tables is shown below.
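The sketch assumes a hypothetical retail example; the dimension table carries the descriptive context while the fact table carries the measurements plus foreign keys to the dimensions:

-- Hypothetical dimension table: descriptive context of the business.
CREATE TABLE dim_product (
    product_key   INT          PRIMARY KEY,  -- surrogate key
    product_code  VARCHAR(20),               -- natural/business key
    product_name  VARCHAR(100),
    category_name VARCHAR(50)
);

-- Hypothetical fact table: measurements plus foreign keys to dimensions.
CREATE TABLE fact_sales (
    time_key      INT NOT NULL,
    product_key   INT NOT NULL REFERENCES dim_product (product_key),
    store_key     INT NOT NULL,
    sales_amount  DECIMAL(12,2),
    quantity_sold INT
);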

Data Transformation Services (DTS) is a set of tools available in SQL Server that helps to extract, transform and consolidate data. This data can come from different sources and go into a single destination or multiple destinations, depending on DTS connectivity. To perform such operations DTS offers a set of tools. Depending on the business needs, a DTS package is created; this package contains a list of tasks that define the work to be performed and the transformations to be done on the data objects.

Import or export data: DTS can import data from a text file or an OLE DB data source into SQL Server, or vice versa.

Transform data: the DTS designer interface also allows you to select data from a data source connection, map the columns of data to a set of transformations, and send the transformed data to a destination connection. For parameterized queries and mapping purposes, the Data Driven Query task can be used from the DTS designer.

Consolidate data: the DTS designer can also be used to transfer indexes, views, logins, triggers and user-defined data. Scripts can also be generated for the same. To perform these tasks, valid connections to the source and destination data, and to any additional data sources such as lookup tables, must be established.

Data Mining Extensions (DMX) is based on the syntax of SQL. It is based on relational concepts and is mainly used to create and manage data mining models. DMX comprises two types of statements: data definition and data manipulation. Data definition statements are used to define or create new models and structures, for example CREATE MINING STRUCTURE and CREATE MINING MODEL. Data manipulation statements are used to manage existing models and structures, for example INSERT INTO and SELECT FROM <model>.CONTENT. A hedged sketch of these statements appears below.

SQL Server data mining offers Data Mining Add-ins for Office 2007 that allow discovering patterns and relationships in the data and help with enhanced analysis. The add-in called Data Mining Client for Excel is used to first prepare data, then build, evaluate, manage and predict results.

Data mining is used to examine or explore the data using queries. These queries can be fired on the data warehouse. Exploring the data through data mining helps in reporting, planning strategies, finding meaningful patterns etc.; it is most commonly used to transform large amounts of data into a meaningful form. Data here can be facts, numbers or any real-time information like sales figures, costs, metadata etc. Information would be the patterns and relationships amongst the data.

The Sequence Clustering algorithm collects similar or related paths, i.e. sequences of data containing events. The data represents a series of events or transitions between states in a dataset, like a series of web clicks. The algorithm examines all probabilities of transitions and measures the differences, or distances, between all the possible sequences in the data set. This helps it to determine which sequences are the best input for clustering.
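The following is a minimal, hedged sketch of the DMX statement types named above; the model name, column names, the Microsoft_Naive_Bayes algorithm and the Adventure Works source are illustrative assumptions, not part of the original text:

-- Data definition: CREATE MINING STRUCTURE could define the structure
-- explicitly; CREATE MINING MODEL below creates one implicitly.
CREATE MINING MODEL [BikeBuyerModel]
(
    CustomerKey  LONG KEY,
    Gender       TEXT DISCRETE,
    [Bike Buyer] LONG DISCRETE PREDICT
)
USING Microsoft_Naive_Bayes;

-- Data manipulation: train the model from a relational source,
-- then inspect the learned patterns via the .CONTENT rowset.
INSERT INTO [BikeBuyerModel]
(CustomerKey, Gender, [Bike Buyer])
OPENQUERY([Adventure Works DW],
          'SELECT CustomerKey, Gender, BikeBuyer FROM dbo.vTargetMail');

SELECT * FROM [BikeBuyerModel].CONTENT;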

E.g. the Sequence Clustering algorithm may help in finding the best path for storing products of a similar nature in a retail warehouse.

The Association algorithm is used for recommendation engines based on market-basket analysis. Such an engine suggests products to customers based on what they bought earlier. The model is built on a dataset containing identifiers, both for individual cases and for the items that the cases contain. A group of items in a data set is called an itemset. The algorithm traverses the data set to find items that appear together in a case; the MINIMUM_SUPPORT parameter controls how frequently an itemset must occur before it is included in the model.

The Time Series algorithm can be used to predict continuous values of data. Once the algorithm is trained to predict a series of data, it can predict the outcome of other series. The algorithm generates a model that can predict trends based only on the original dataset; new data can also be added and automatically becomes part of the trend analysis. E.g. the performance of one employee can be used to forecast the profit.

The Naive Bayes algorithm is used to generate mining models that help identify relationships between input columns and predictable columns. This algorithm can be used in the initial stage of exploration. It calculates the probability of every state of each input column given each possible state of the predictable column. After the model is built, the results can be used for exploration and for making predictions.

A decision tree is a tree in which every node is either a leaf node or a decision node. The tree takes an object as input and outputs some decision. All paths from the root node to a leaf node are reached by using AND, OR, or both. The tree is constructed using the regularities of the data. The decision tree is not affected by Automatic Data Preparation.

Models in data mining help the different algorithms in decision making or pattern matching. The second stage of data mining involves considering various models and choosing the best one based on their predictive performance.
Data mining helps analysts make faster business decisions, which increases revenue at lower cost. It helps to understand, explore and identify patterns in data, automates the process of finding predictive information in large databases, and helps to identify previously hidden patterns.

The process of cleaning junk data is termed data purging. Purging data typically means getting rid of rows with unnecessary NULL values in key columns or of obsolete records, and is usually done when the size of the database gets too large.
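A minimal, hedged sketch of such a purge, assuming a hypothetical staging table and retention rule:

-- Hypothetical staging table; the cut-off date is an assumption.
DELETE FROM stg_customer_load
WHERE  customer_code IS NULL          -- junk rows with no usable key
   OR  load_date < '2010-01-01';      -- records outside the retention window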

Data warehousing is concerned with extracting data from different sources, cleaning the data and storing it in the warehouse, whereas data mining aims to examine or explore that data using queries. These queries can be fired on the data warehouse. Exploring the data through data mining helps in reporting, planning strategies, finding meaningful patterns etc. E.g. a data warehouse of a company stores all the relevant information about projects and employees; using data mining, one can use this data to generate different reports, such as the profits generated.

History
In a 1958 article, IBM researcher Hans Peter Luhn used the term business intelligence. He defined intelligence as: "the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal."[3] Business intelligence as it is understood today is said to have evolved from the decision support systems that began in the 1960s and developed throughout the mid-1980s. DSS originated in the computer-aided models created to assist with decision making and planning. From DSS, data warehouses, Executive Information Systems, OLAP and business intelligence came into focus beginning in the late 80s. In 1989, Howard Dresner (later a Gartner Group analyst) proposed "business intelligence" as an umbrella term to describe "concepts and methods to improve business decision making by using fact-based support systems."[4] It was not until the late 1990s that this usage was widespread.[5]

Business intelligence and data warehousing


Often BI applications use data gathered from a data warehouse or a data mart. A data warehouse is a copy of transactional data that facilitates decision support. However, not all data warehouses are used for business intelligence, nor do all business intelligence applications require a data warehouse.

To distinguish between the concepts of business intelligence and data warehouses, Forrester Research often defines business intelligence in one of two ways. Using a broad definition: "Business Intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making."[6] When using this definition, business intelligence also includes technologies such as data integration, data quality, data warehousing, master data management, text and content analytics, and many others that the market sometimes lumps into the Information Management segment. Therefore, Forrester refers to data preparation and data usage as two separate, but closely linked segments of the business intelligence architectural stack.

Forrester defines the latter, narrower business intelligence market as "...referring to just the top layers of the BI architectural stack such as reporting, analytics and dashboards."[7]

Business intelligence and business analytics


Thomas Davenport argues that business intelligence should be divided into querying, reporting, OLAP, an "alerts" tool, and business analytics. In this definition, business analytics is the subset of BI based on statistics, prediction, and optimization.[8]

Applications in an enterprise
Business intelligence can be applied to the following business purposes, in order to drive business value:[citation needed]
1. Measurement: a program that creates a hierarchy of performance metrics (see also Metrics Reference Model) and benchmarking that informs business leaders about progress towards business goals (business process management).
2. Analytics: a program that builds quantitative processes for a business to arrive at optimal decisions and to perform business knowledge discovery. Frequently involves data mining, process mining, statistical analysis, predictive analytics, predictive modeling, business process modeling, complex event processing and prescriptive analytics.
3. Reporting/enterprise reporting: a program that builds infrastructure for strategic reporting to serve the strategic management of a business, as opposed to operational reporting. Frequently involves data visualization, executive information systems and OLAP.
4. Collaboration/collaboration platform: a program that gets different areas (both inside and outside the business) to work together through data sharing and electronic data interchange.
5. Knowledge management: a program to make the company data driven through strategies and practices to identify, create, represent, distribute, and enable adoption of insights and experiences that are true business knowledge. Knowledge management leads to learning management and regulatory compliance.
In addition to the above, business intelligence can also provide a pro-active approach, such as an ALARM function to alert the end-user immediately. There are many types of alerts: for example, if some business value exceeds a threshold value, the color of that amount in the report will turn RED and the business analyst is alerted; sometimes an alert mail is sent to the user as well. This end-to-end process requires data governance, which should be handled by an expert.[citation needed] A small SQL sketch of such a threshold alert appears below.
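A minimal, hedged sketch of a threshold alert in reporting SQL; the monthly_sales source, the threshold value and the flag column are illustrative assumptions:

-- Flag any value above the agreed threshold so the front-end can render
-- it in red or trigger an alert mail.
SELECT region,
       sales_amount,
       CASE WHEN sales_amount > 1000000
            THEN 'RED'       -- breaches the threshold: raise an alert
            ELSE 'NORMAL'
       END AS alert_status
FROM   monthly_sales;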

Prioritization of business intelligence projects


It is often difficult to provide a positive business case for business intelligence initiatives and often the projects must be prioritized through strategic initiatives. Here are some hints to increase the benefits for a BI project.

- As described by Kimball,[9] you must determine the tangible benefits, such as the eliminated cost of producing legacy reports.
- Enforce access to data for the entire organization.[10] In this way even a small benefit, such as a few minutes saved, makes a difference when multiplied by the number of employees in the entire organization.
- As described by Ross, Weil & Roberson for Enterprise Architecture,[11] consider letting the BI project be driven by other business initiatives with excellent business cases. To support this approach, the organization must have enterprise architects who can identify suitable business projects.

- Use a structured and quantitative methodology to create a defensible prioritization in line with the actual needs of the organization, such as a weighted decision matrix.[12]

Success factors of implementation


Before implementing a BI solution, it is worth taking several factors into consideration. According to Kimball et al., these are the three critical areas that you need to assess within your organization before getting ready to do a BI project:[13]
1. The level of commitment and sponsorship of the project from senior management
2. The level of business need for creating a BI implementation
3. The amount and quality of business data available.

Business sponsorship
The commitment and sponsorship of senior management is, according to Kimball et al., the most important criterion for assessment.[14] This is because having strong management backing helps overcome shortcomings elsewhere in the project. However, as Kimball et al. state: even the most elegantly designed DW/BI system cannot overcome a lack of business [management] sponsorship.[15]

It is important that personnel who participate in the project have a vision and an idea of the benefits and drawbacks of implementing a BI system. The best business sponsor should have organizational clout and should be well connected within the organization. It is ideal that the business sponsor is demanding but also able to be realistic and supportive if the implementation runs into delays or drawbacks. The management sponsor also needs to be able to assume accountability and to take responsibility for failures and setbacks on the project.

Support from multiple members of management ensures the project does not fail if one person leaves the steering group. However, having many managers work together on the project can also mean that there are several different interests that attempt to pull the project in different directions, such as when different departments want to put more emphasis on their own usage. This issue can be countered by an early and specific analysis of the business areas that benefit the most from the implementation. All stakeholders in the project should participate in this analysis in order for them to feel ownership of the project and to find common ground.

Another management problem that may be encountered before the start of implementation is an overly aggressive business sponsor: a management individual who gets carried away by the possibilities of using BI and starts wanting the DW or BI implementation to include several different sets of data that were not included in the original planning phase. Since implementations of extra data may add many months to the original plan, it is wise to make sure the person from management is aware of the consequences of such requests.

Business needs
Because of the close relationship with senior management, another critical thing that must be assessed before the project begins is whether or not there is a business need and whether there is a clear business benefit to doing the implementation.[16] The needs and benefits of the implementation are sometimes driven by competition and the need to gain an advantage in the market. Another reason for a business-driven approach to implementation of BI is the acquisition of other organizations that enlarge the original organization; it can sometimes be beneficial to implement DW or BI in order to create more oversight. Companies that implement BI are often large, multinational organizations with diverse subsidiaries.[17] A well-designed BI solution provides a consolidated view of key business data not available anywhere else in the organization, giving management visibility and control over measures that otherwise would not exist.

Amount and quality of available data


Without good data, it does not matter how good the management sponsorship or business-driven motivation is. Without proper data, or with too little quality data, any BI implementation fails. Before implementation it is a good idea to do data profiling. This analysis identifies the "content, consistency and structure [..]"[16] of the data. It should be done as early as possible in the process and, if the analysis shows that data is lacking, the project should be put on the shelf temporarily while the IT department figures out how to properly collect the data.

When planning for business data and business intelligence requirements, it is always advisable to consider specific scenarios that apply to a particular organization, and then select the business intelligence features best suited for the scenario. Often, scenarios revolve around distinct business processes, each built on one or more data sources. These sources are used by features that present that data as information to knowledge workers, who subsequently act on that information. The business needs of the organization for each business process adopted correspond to the essential steps of business intelligence. These essential steps include, but are not limited to:
1. Going through business data sources in order to collect the needed data
2. Converting business data to information and presenting it appropriately
3. Querying and analyzing the data
4. Acting on the collected data

The quality aspect in business intelligence should cover the whole process from the source data to the final reporting. At each step, the quality gates are different:
1. Source data:
   o Data standardization: make data comparable (same unit, same pattern, ...)
   o Master data management: unique referential
2. Operational Data Store (ODS):
   o Data cleansing: detect and correct inaccurate data
   o Data profiling: check for inappropriate values, null/empty fields
3. Data warehouse:
   o Completeness: check that all expected data are loaded
   o Referential integrity: unique and existing referentials across all sources
   o Consistency between sources: check consolidated data against the sources
4. Reporting:
   o Uniqueness of indicators: only one shared dictionary of indicators
   o Formula accuracy: local reporting formulas should be avoided or checked
A sketch of a few such checks in SQL appears after this list.
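A minimal, hedged sketch of a few such quality gates, assuming hypothetical fact_sales and dim_product tables; each query should return zero rows (or an expected count) when the gate passes:

-- Completeness: row count for a given load date (the date is an assumption).
SELECT COUNT(*) AS loaded_rows
FROM   fact_sales
WHERE  load_date = '2012-01-01';

-- Referential integrity: fact rows pointing at a non-existent product.
SELECT f.product_key
FROM   fact_sales f
LEFT JOIN dim_product p ON f.product_key = p.product_key
WHERE  p.product_key IS NULL;

-- Data profiling: inappropriate or empty values in a mandatory column.
SELECT COUNT(*) AS bad_rows
FROM   dim_product
WHERE  product_name IS NULL OR product_name = '';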

User aspect
Some considerations must be made in order to successfully integrate the usage of business intelligence systems in a company. Ultimately the BI system must be accepted and utilized by the users in order for it to add value to the organization.[18][19] If the usability of the system is poor, the users may become frustrated and spend a considerable amount of time figuring out how to use the system, or may not be able to really use the system. If the system does not add value to the users' mission, they simply will not use it.[19]

To increase user acceptance of a BI system, it can be advisable to consult business users at an early stage of the DW/BI lifecycle, for example at the requirements gathering phase.[18] This can provide insight into the business process and what the users need from the BI system. There are several methods for gathering this information, such as questionnaires and interview sessions. When gathering requirements from the business users, the local IT department should also be consulted in order to determine to which degree it is possible to fulfill the business's needs based on the available data.[18]

Taking a user-centered approach throughout the design and development stage may further increase the chance of rapid user adoption of the BI system.[19] Besides focusing on the user experience offered by the BI applications, it may also be possible to motivate the users to utilize the system by adding an element of competition. Kimball[18] suggests implementing a function on the Business Intelligence portal website where reports on system usage can be found. By doing so, managers can see how well their departments are doing, compare themselves to others, and this may spur them to encourage their staff to utilize the BI system even more. In a 2007 article, H. J. Watson gives an example of how the competitive element can act as an incentive.[20] Watson describes how a large call centre implemented performance dashboards for all call agents, with monthly incentive bonuses tied to performance metrics. Agents could also compare their performance to other team members. The implementation of this type of performance measurement and competition significantly improved agent performance.

BI's chances of success can be improved by involving senior management to help make BI a part of the organizational culture, and by providing the users with the necessary tools, training, and support.[20] Training encourages more people to use the BI application.[18] Providing user support is necessary to maintain the BI system and resolve user problems.[19] User support can be incorporated in many ways, for example by creating a website. The website should contain great content and tools for finding the necessary information. Furthermore, helpdesk support can be used. The help desk can be manned by power users or the DW/BI project team.[18]

BI Portals
A Business Intelligence portal (BI portal) is the primary access interface for Data Warehouse (DW) and Business Intelligence (BI) applications. The BI portal is the user's first impression of the DW/BI system. It is typically a browser application from which the user has access to all the individual services of the DW/BI system, reports and other analytical functionality. The BI portal must be implemented in such a way that it is easy for the users of the DW/BI application to call on the functionality of the application.[21]

The BI portal's main functionality is to provide a navigation system for the DW/BI application. This means that the portal has to be implemented in a way that gives the user access to all the functions of the DW/BI application. The most common way to design the portal is to custom fit it to the business processes of the organization for which the DW/BI application is designed; in that way the portal can best fit the needs and requirements of its users.[22] The BI portal needs to be easy to use and understand, and if possible have a look and feel similar to other applications or web content of the organization the DW/BI application is designed for (consistency).

The following is a list of desirable features for web portals in general and BI portals in particular:
- Usable: users should easily find what they need in the BI tool.
- Content rich: the portal is not just a report printing tool; it should contain more functionality such as advice, help, support information and documentation.
- Clean: the portal should be designed so it is easily understandable and not overly complex, so as not to confuse the users.
- Current: the portal should be updated regularly.
- Interactive: the portal should be implemented in a way that makes it easy for the user to use its functionality and encourages them to use the portal. Scalability and customization give the user the means to fit the portal to their needs.
- Value oriented: it is important that the user has the feeling that the DW/BI application is a valuable resource that is worth working with.

Marketplace
There are a number of business intelligence vendors, often categorized into the remaining independent "pure-play" vendors and consolidated "megavendors" that have entered the market through a recent trend[23] of acquisitions in the BI industry.[24] Some companies adopting BI software decide to pick and choose from different product offerings (best-of-breed) rather than purchase one comprehensive integrated solution (full-service).[25]

Industry-specific
Specific considerations have to be taken into account for business intelligence systems in some sectors, such as banking, which is subject to governmental regulation. The information collected by banking institutions and analyzed with BI software must be protected from some groups or individuals, while being fully available to other groups or individuals. Therefore BI solutions must be sensitive to those needs and be flexible enough to adapt to new regulations and changes to existing law.

Semi-structured or unstructured data


Businesses create a huge amount of valuable information in the form of e-mails, memos, notes from call-centers, news, user groups, chats, reports, web pages, presentations, image files, video files and marketing material. According to Merrill Lynch, more than 85% of all business information exists in these forms. These information types are called either semi-structured or unstructured data. However, organizations often only use these documents once.[26]

The management of semi-structured data is recognized as a major unsolved problem in the information technology industry.[27] According to projections from Gartner (2003), white collar workers spend anywhere from 30 to 40 percent of their time searching, finding and assessing unstructured data. BI uses both structured and unstructured data, but the former is easy to search, while the latter contains a large quantity of the information needed for analysis and decision making.[27][28] Because of the difficulty of properly searching, finding and assessing unstructured or semi-structured data, organizations may not draw upon these vast reservoirs of information, which could influence a particular decision, task or project. This can ultimately lead to poorly informed decision making.[26] Therefore, when designing a business intelligence/DW solution, the specific problems associated with semi-structured and unstructured data must be accommodated as well as those for structured data.[28]

Unstructured data vs. semi-structured data


Unstructured and semi-structured data have different meanings depending on their context. In the context of relational database systems, unstructured data cannot be stored in predictably ordered columns and rows. One type of unstructured data is typically stored in a BLOB (binary large object), a catch-all data type available in most relational database management systems. Unstructured data may also refer to irregularly or randomly repeated column patterns that vary from row to row within each file or document.

Many of these data types, however, like e-mails, word processing text files, PPTs, image files, and video files conform to a standard that offers the possibility of metadata. Metadata can include information such as author and time of creation, and this can be stored in a relational database. Therefore it may be more accurate to talk about this as semi-structured documents or data,[27] but no specific consensus seems to have been reached.

Unstructured data can also simply be the knowledge that business users have about future business trends. Business forecasting naturally aligns with the BI system because business users think of their business in aggregate terms. Capturing the business knowledge that may only exist in the minds of business users provides some of the most important data points for a complete BI solution.
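A minimal, hedged sketch of the pattern described above: the raw document goes into a BLOB-style column while the extractable metadata sits in ordinary relational columns. The table and column names are illustrative assumptions, and the exact binary and timestamp types vary by RDBMS:

-- Hypothetical table: the document body is opaque, the metadata is queryable.
CREATE TABLE document_store (
    doc_id     INT PRIMARY KEY,
    file_name  VARCHAR(255),
    author     VARCHAR(100),    -- metadata captured from the file
    created_at TIMESTAMP,
    topic      VARCHAR(100),    -- e.g. produced by automatic categorization
    content    BLOB             -- raw e-mail, presentation, image, video, ...
);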

Problems with semi-structured or unstructured data


There are several challenges to developing BI with semi-structured data. According to Inmon & Nesavich,[29] some of these are:
1. Physically accessing unstructured textual data: unstructured data is stored in a huge variety of formats.
2. Terminology: among researchers and analysts, there is a need to develop a standardized terminology.
3. Volume of data: as stated earlier, up to 85% of all data exists as semi-structured data. Couple that with the need for word-to-word and semantic analysis.
4. Searchability of unstructured textual data: a simple search on some data, e.g. "apple", returns links only where there is a reference to that precise search term. Inmon & Nesavich (2008)[29] give an example: a search is made on the term "felony". In a simple search, the term felony is used, and everywhere there is a reference to felony, a hit on an unstructured document is made. But a simple search is crude: it does not find references to crime, arson, murder, embezzlement, vehicular homicide and such, even though these crimes are types of felonies.

The use of metadata


To solve problems with searchability and assessment of data, it is necessary to know something about the content. This can be done by adding context through the use of metadata.[26] Many systems already capture some metadata (e.g. filename, author, size, etc.), but more useful would be metadata about the actual content, e.g. summaries, topics, or people and companies mentioned. Two technologies designed for generating metadata about content are automatic categorization and information extraction.
