Broadly speaking, Inmon favors the Snowflake Schema while Kimball relies on the Star Schema.
Kimball views the data warehouse as a collection of data marts. Data marts are focused on delivering
business objectives for individual departments in the organization, and the data warehouse is formed from the
conformed dimensions of those data marts. Hence a unified view of the enterprise can be obtained from
dimensional modeling done at the local, departmental level.
He follows a bottom-up approach, i.e. first create individual data marts from the existing sources and then
combine them into a data warehouse.
Inmon believes in creating a data warehouse on a subject-by-subject-area basis. Hence the development of
the data warehouse can start with the subject areas where the need first arises; point-of-sale (POS) data, for
example, can be added later if management decides it is necessary.
He follows a top-down approach, i.e. first create the data warehouse from the existing sources and then
create individual data marts from it.
Kimball: create data marts first, then combine them to form a data warehouse.
Views:
• Stores the SQL statement in the database and lets you use it as a table. Every time you access the
view, the SQL statement executes.
• A view is a pseudo-table: the result set is not stored in the database, it is just a saved query.
Materialized Views:
• Stores the result of the SQL in table form in the database. The SQL statement executes only once;
after that, every time you run the query the stored result set is used. Pros include fast query results.
• These are similar to views, but the results are stored permanently in the database; they are useful for
aggregation and summarization of data.
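The difference in freshness between the two can be sketched with SQLite (table and view names here are made up for illustration). SQLite has no materialized views, so the snapshot is imitated with CREATE TABLE AS, which is essentially what a materialized view stores:

```python
import sqlite3

# Illustrative sketch: a plain view re-runs its query, a stored snapshot does not.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EAST", 100), ("EAST", 50), ("WEST", 70)])

# A plain view stores only the SQL text; the query executes on every access.
conn.execute("CREATE VIEW v_sales AS "
             "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

# Simulated materialized view: the result set itself is stored as a table.
conn.execute("CREATE TABLE mv_sales AS "
             "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

conn.execute("INSERT INTO sales VALUES ('EAST', 25)")

view_total = conn.execute(
    "SELECT total FROM v_sales WHERE region = 'EAST'").fetchone()[0]
mv_total = conn.execute(
    "SELECT total FROM mv_sales WHERE region = 'EAST'").fetchone()[0]
print(view_total, mv_total)  # the view sees the new row, the snapshot does not
```

After the extra insert, the view reflects the new total while the stored snapshot still holds the result as of its last refresh.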
What is Junk Dimension? What is the difference between Junk Dimension and Degenerate Dimension?
Junk Dimension:
A junk dimension is a dimension formed from columns that are rarely used or have no analytical value on
their own, typically low-cardinality flags and indicators grouped together into a single table.
Degenerate Dimension:
A Degenerate Dimension is data that is dimensional in nature but stored in a fact table.
Example:
If we take only the empno and ename columns from the EMP table and store them directly in the fact table,
rather than in a separate dimension table, they form a degenerate dimension.
By using a Sorter Transformation with SAL as the sorted port (descending) and a Filter Transformation to pass only the first 10 records.
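The Sorter-plus-Filter logic can be sketched in plain Python (the sample employee data is invented for illustration):

```python
# Plain-Python equivalent of Sorter (descending on SAL) + Filter (first 10 rows).
employees = [{"empno": i, "sal": 1000 + 100 * (i % 13)} for i in range(1, 31)]

# Sorter Transformation: sort on the SAL port, descending.
sorted_rows = sorted(employees, key=lambda r: r["sal"], reverse=True)

# Filter Transformation: pass only the first 10 rows through.
top10 = sorted_rows[:10]
print([r["sal"] for r in top10])
```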
The process of making operational data available to business managers and decision support systems is called Data
Warehousing.
What is the purpose of using UNIX commands in Informatica? Which UNIX commands are generally used
with Informatica?
Informatica is often run on UNIX-based servers, so data loads frequently have to be performed there.
Commands such as grep, egrep, and rm are commonly used, so knowledge of UNIX is an advantage.
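A typical use is scanning a session or reject log on the server. The file name sess_load.log below is an invented example, not a standard Informatica file name:

```shell
# Hypothetical log check; the file name sess_load.log is made up for the demo.
printf 'OK row 1\nERROR row 2\nOK row 3\nFATAL row 4\n' > sess_load.log

# grep: count lines that errored out during the load
err_count=$(grep -c 'ERROR' sess_load.log)

# egrep (grep -E): match either ERROR or FATAL in one pass
bad_count=$(grep -E -c 'ERROR|FATAL' sess_load.log)
echo "errors=$err_count bad=$bad_count"

# rm: clean up the log once checked
rm -f sess_load.log
```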
Create two data flows: one for new rows and the other for changed rows. Generate a primary key for each new
row. Insert new rows into the target and update changed rows in the target, overwriting the existing rows.
Transformations used:
2 Unconnected Lookups
–> Expression –> Router –> Update Strategy –> target (instance).
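The two data flows above can be sketched in plain Python. The lookup step decides new vs. changed, the router splits the rows, new rows get a generated key, and changed rows overwrite the target value (the natural ids and names are invented):

```python
# Sketch of the insert/update flows; "target" stands in for the target table.
target = {"A100": {"key": 1, "name": "Alice"}}   # keyed by natural id
next_key = 2                                     # key generator for new rows

source = [("A100", "Alicia"),   # changed row
          ("B200", "Bob")]      # new row

for nat_id, name in source:
    existing = target.get(nat_id)        # unconnected-lookup step
    if existing is None:                 # router group: new row
        target[nat_id] = {"key": next_key, "name": name}   # insert with new key
        next_key += 1
    else:                                # router group: changed row
        existing["name"] = name          # update, overwriting the old value
```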
What is the difference between SQL Overriding in Source Qualifier and Lookup Transformation?
The major difference is that we can use any type of join in the SQL override of a Source Qualifier, but in a
Lookup we can use only an equi-join in the SQL override.
How will you update the row without using Update Strategy Transformation?
You can set the session-level property “Treat Source Rows As” to UPDATE (or INSERT); the records are then
updated without using an Update Strategy in the mapping.
In the target, there is an Update Override option for updating records using the non-key columns. Using this,
we can update records without an Update Strategy Transformation.
Performance tuning is done in several stages. We check in the following order:
Target, Source, Mapping, Session, System, and depending on which level has the bottleneck, we rectify it.
Normalization:
Normalization is the process of removing redundancy. OLTP systems use normalization.
Denormalization:
Denormalization is the process of allowing redundancy. OLAP/DWH systems use denormalized designs while
storing a greater level of detailed data (each and every transaction).
A fact table consists of measurements of the business requirements and foreign keys to the dimension tables,
as per the business rules.
Basically, the fact table consists of the index keys of the dimension/lookup tables and the measures.
Since the table holds only keys and measures, the fact table is in normal form.
E-R modeling is used for normalizing the OLTP database design. It revolves around entities and their
relationships to capture the overall process of the system.
In E-R modeling the data is in normalized form, so queries require more joins, which may adversely affect
system performance.
In dimensional modeling the data is denormalized, so queries require fewer joins and system performance
improves.
A Dimension table which is used by more than one fact table is known as a Conformed Dimension.
Conformed facts are facts that have the same name and definition in separate tables, so they can be combined
and compared mathematically. When the facts and dimensions are related in this consistent way, the schema is
called a conformed schema.
Every company has a methodology of its own, but to name a few, the SDLC and AIM methodologies are
commonly used standards; others include AMM, World Class Methodology, and many more.
Regarding methodologies specific to data warehousing, there are mainly two:
1. Ralph Kimball – first the data marts, then the data warehouse combined from them. Most designs follow
his two kinds of schemas: the Star Schema or the Snowflake Schema.
2. Bill Inmon – first the Enterprise Data Warehouse, then data marts derived from the EDWH.
Depending on the requirements of the company, its DWH team will choose one of the above models.
1. Top-down Method
The top-down approach means first building the Enterprise DWH and then preparing the individual
departments' data (data marts) from it.
First loads into the Data Warehouse, and then loads into the Data Marts.
2. Bottom-up Method
The bottom-up approach means first cleansing and transforming each department's data and loading it into
individual data marts, which are then combined into the enterprise data warehouse.
First loads into the Data Marts, and then loads into the Data Warehouse.
Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to
define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to
the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a
family structure.
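The month-to-quarter-to-year example can be sketched directly (the monthly figures are invented sample data):

```python
# Rolling monthly figures up a time hierarchy: month -> quarter -> year.
monthly = {"2023-01": 10, "2023-02": 12, "2023-03": 8,
           "2023-04": 20, "2023-05": 5}

def quarter_of(month_key):
    """Map a 'YYYY-MM' key to its quarter, e.g. '2023-02' -> '2023-Q1'."""
    year, month = month_key.split("-")
    return f"{year}-Q{(int(month) - 1) // 3 + 1}"

# Aggregate month level -> quarter level.
quarterly = {}
for month_key, value in monthly.items():
    q = quarter_of(month_key)
    quarterly[q] = quarterly.get(q, 0) + value

# Aggregate quarter level -> year level.
yearly = {}
for q, value in quarterly.items():
    year = q.split("-")[0]
    yearly[year] = yearly.get(year, 0) + value

print(quarterly, yearly)
```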
Data Validation is to make sure that the loaded data is accurate and meets the business requirements.
There are three object types: Dimension, Measure, and Detail; in Business Objects (BO) these are called
object types. A View is nothing but an alias that can be used to resolve loops in the universe. An “Alias” is
still different from a View in the universe: a View exists at the database level, but an Alias is a different name
given to the same table to resolve loops in the universe.
1. Character
2. Date
3. Long text
4. Number
Dimension, Measure, and Detail are object types; the data types are character, date, long text, and number.
What is a Surrogate Key? Where do we use it, and why?
A surrogate key is a system-generated artificial primary key value. It is mainly used for “critical columns” in
the DWH – columns whose values can be updated in the source OLTP systems while the warehouse must
preserve their history. Surrogate keys are what join the dimension tables to the fact tables, and they are the
solution to the critical-column problem.
Example: a customer purchases different items in different locations; for this situation we have to maintain
historical data.
By using surrogate keys we can introduce a new row into the data warehouse each time, to maintain the
historical data.
A surrogate key is a unique identification key; it is an artificial alternative to the production key. The
production key may be alphanumeric or a composite key, but the surrogate key is always a single numeric
key. Assume the production key is an alphanumeric field: an index on it occupies more space, so it is not
advisable to join or index on it, because data warehouse fact tables generally hold historical data and are
linked to many dimension tables. With a numeric field, performance is higher.
Surrogate Key is a substitution for the natural primary key. It is just a unique identifier or number for each row that
can be used for the primary key to the table. The only requirement for a surrogate primary key is that it is unique for
each row in the table.
Data warehouses typically use a surrogate key, also known as an artificial or identity key, for the dimension
tables' primary keys. A Sequence Generator, an Oracle sequence, or SQL Server identity values can be used
to generate the surrogate key.
It is useful because the natural primary key (e.g. “Customer Number” in the “Customer Table”) can change,
and this makes updates more difficult.
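The history-keeping role of the surrogate key can be sketched as follows. The customer numbers and cities are invented, and itertools.count stands in for a sequence generator:

```python
from itertools import count

# Sketch: the warehouse assigns its own numeric key and keeps history when an
# attribute of a natural (production) key changes. All names are illustrative.
key_seq = count(1)        # stands in for a Sequence Generator / Oracle sequence
customer_dim = []         # one row per version of a customer

def load_customer(cust_no, city):
    """Add a dimension row; a repeated cust_no gets a fresh surrogate key."""
    customer_dim.append({"cust_sk": next(key_seq),
                         "cust_no": cust_no, "city": city})

load_customer("C-001", "Pune")
load_customer("C-002", "Delhi")
load_customer("C-001", "Mumbai")  # customer moved: history kept as a new row

# Fact rows would store the small integer cust_sk, not the alphanumeric cust_no.
```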
What is Workflow?
A workflow is a set of instructions that describes how and when to run tasks related to extracting, transforming, and
loading data.
What are Worklets?
Create a worklet when you want to reuse a set of workflow logic in several workflows. Use the Worklet Designer to
create and edit worklets.
Where to use Worklets?
You can run worklets inside a workflow. The workflow that contains the worklet is called the “parent workflow”.
You can also nest a worklet in another worklet.
You can monitor workflows and tasks in the workflow monitor. View details about workflow or task in Gantt View
or Task View.
Actions:
You can run, stop, abort, and resume workflows from the Workflow Monitor.
Data Mart vs. Data Warehouse:
• Data Mart: a scaled-down version of the data warehouse that addresses only one subject, like the Sales
department, HR department, etc.
• Data Warehouse: a database management system that facilitates on-line analytical processing by allowing
the data to be viewed in different dimensions or perspectives to provide business intelligence.
• Data Mart: one fact table with multiple dimension tables. Data Warehouse: more than one fact table and
multiple dimension tables.
• Small organizations prefer a Data Mart; bigger organizations prefer a Data Warehouse.
Dimension Table vs. Fact Table:
• Structure of a Dimension Table: surrogate key, one or more fields that compose the natural key (nk), and a
set of attributes. Structure of a Fact Table: foreign keys (fk), degenerate dimensions, and measurements.
• In a schema, more dimension tables are present than fact tables; the fact table is larger in size than the
dimension tables.
• In the dimension table, the surrogate key is used to prevent primary key (pk) violations (to store historical
data).
• The dimension table provides entry points to the data; in the fact table, the degenerate dimension fields act
as the primary key.
• Dimension field values are in numeric and text representation; fact field values are always in numeric or
integer form.
OLTP vs. OLAP:
• OLTP: Normalized. OLAP: Denormalized.
• OLTP cannot solve extract and complex problems; with OLAP, extract and complex problems can be
easily solved.
Cubes are multidimensional views of the Data Warehouse or Data Marts. A cube is designed in a logical way
to support drill-up, drill-down, slice-and-dice, etc., which enables business users to understand the trends of
the business. It is good to design the cube on a star schema so as to facilitate its effective use. Every part of
the cube is a logical representation of a combination of facts and dimension attributes.
1. Replicate
2. Transparent
3. Linked
In a linked cube, the data cells can be linked to another analytical database: if an end user clicks on a data
cell, they are actually linking through to the other analytic database.
Example:
You may have 5 GB of data for a report; if you specify a cube size of 2 GB, then when the cube exceeds
2 GB a second cube is automatically created to store the remaining data.
Aggregate table contains the measure values, aggregated/grouped/summed up to some level of hierarchy.
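Building an aggregate table amounts to grouping detail rows up to some hierarchy level and summing the measures. A sketch with invented sales data, rolling daily rows up to the month level per store:

```python
# Detail (transaction-level) rows; dates and amounts are sample data.
detail = [
    {"date": "2023-01-05", "store": "S1", "amount": 100},
    {"date": "2023-01-20", "store": "S1", "amount": 40},
    {"date": "2023-01-11", "store": "S2", "amount": 75},
    {"date": "2023-02-02", "store": "S1", "amount": 60},
]

# Aggregate table: (month, store) -> summed measure.
agg = {}
for row in detail:
    key = (row["date"][:7], row["store"])   # roll the date up to its month
    agg[key] = agg.get(key, 0) + row["amount"]

print(agg)
```

Queries at the month level can then read the small aggregate table instead of scanning every transaction.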
RDBMS vs. DWH:
• RDBMS: Normalized. DWH: Denormalized.
• RDBMS: less time for query execution. DWH: more time for query execution.
• RDBMS: has insert, delete, and update transactions. DWH: will not have many inserts, deletes, or updates.
1. Informatica PowerCenter
2. Ab Initio
3. DataStage
4. BO Data Integrator
5. SAS ETL
6. MS DTS
7. Sunopsis
Dimensional modeling is a design concept used by many data warehouse designers to build their data
warehouses. In this design model all the data is stored in two types of tables – fact tables and dimension
tables. The fact table contains the facts/measurements of the business, e.g. sales, revenue, profit, etc., and the
dimension table contains the descriptive attributes that give those measures their context.
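A minimal star-schema sketch in SQLite (the table names dim_product and fact_sales and all values are invented): the fact table holds a foreign key and a measure, and the dimension table supplies the descriptive context at query time.

```python
import sqlite3

# Minimal star schema: one fact table keyed to one dimension table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE fact_sales (product_key INTEGER, revenue INTEGER)")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "Widget"), (2, "Gadget")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [(1, 100), (1, 150), (2, 80)])

# Measures come from the fact table; descriptive context from the dimension.
rows = conn.execute(
    "SELECT d.name, SUM(f.revenue) "
    "FROM fact_sales f JOIN dim_product d ON f.product_key = d.product_key "
    "GROUP BY d.name ORDER BY d.name").fetchall()
print(rows)
```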
How to find the number of success, rejected and bad records in the same mapping?
- First we separate the data using an Expression Transformation, which flags each row with 1 or 0. The
condition is as follows:
FLAG=1 is considered invalid data and FLAG=0 is considered valid data. The rows are then routed into
the next transformation using a Router Transformation, where we add two user groups, one for FLAG=1
(invalid data) and the other for FLAG=0 (valid data).
The FLAG=1 data is forwarded to an Expression Transformation. Here we take one variable port and two
output ports: one to increment the count and the other to flag the row.
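The flag-and-route logic above can be sketched in plain Python (the validity rule and sample rows are invented for illustration):

```python
# Sketch: an expression flags each row, a "router" splits the stream, and
# counters give the success and rejected totals.
rows = [{"empno": 1, "sal": 500}, {"empno": 2, "sal": None},
        {"empno": 3, "sal": 800}, {"empno": None, "sal": 300}]

def flag(row):
    # Expression Transformation: FLAG=1 for invalid data, FLAG=0 for valid data.
    return 1 if row["empno"] is None or row["sal"] is None else 0

valid, invalid = [], []
for row in rows:
    (invalid if flag(row) else valid).append(row)   # Router Transformation

success_count = len(valid)      # rows loaded into the target
reject_count = len(invalid)     # incremented once per flagged row downstream
print(success_count, reject_count)
```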