
FAQ on Informatica (Related basic Data Warehousing and Database concepts)
Version 1.0
14/02/2010

Data Warehousing
P Ranjini Balakrishnan
pranjini.balakrishnan@tcs.com

TCS Public

FAQ on Informatica (Related Data Warehousing and Database concepts)

Introduction
This document contains frequently asked questions on Informatica (an ETL tool) and related basic data warehousing and database concepts. For each frequently asked question, the most likely answers are also included. Note: every individual may have their own approach to answering the questions covered in this document.

Informatica
Q. What is the use of transformations in the ETL process?
Ans: Transformations manipulate data from the form in which it appears in the source system(s) into another form in the data warehouse or mart, in a way that enhances or simplifies its meaning. In short, they transform data into information. This includes data merging, cleansing, and aggregation:
Data merging: the process of standardizing data types and fields. Suppose one source system stores integer data as smallint whereas another stores similar data as decimal. The data from the two source systems needs to be rationalized when moved into the Oracle number format.
Cleansing: identifying and correcting inconsistencies or inaccuracies. This eliminates inconsistencies in the data from multiple sources, converts data from different systems into a single consistent data set suitable for analysis, meets a standard for establishing data elements, codes, domains, formats and naming conventions, and corrects data errors and fills in missing data values.
Aggregation: the process whereby multiple detailed values are combined into a single summary value, typically summed numbers representing dollars spent or units sold. This generates summarized data for use in aggregate fact and dimension tables.
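The three steps above can be sketched outside of Informatica. This is a minimal illustration in plain Python, not Informatica code; the source rows, field names, and default values are all made up for the example.

```python
# Illustrative sketch (not Informatica code): merging, cleansing, and
# aggregating rows from two hypothetical source systems.
from decimal import Decimal

source_a = [{"dept": " Sales ", "amount": 100}]             # amount as int (smallint-like)
source_b = [{"dept": "Sales", "amount": Decimal("50.5")}]   # amount as decimal

def merge(rows):
    # Data merging: standardize both amount types into one numeric type.
    return [{**r, "amount": Decimal(str(r["amount"]))} for r in rows]

def cleanse(rows):
    # Cleansing: trim stray whitespace and fill a default for missing values.
    return [{**r, "dept": (r["dept"] or "UNKNOWN").strip()} for r in rows]

def aggregate(rows):
    # Aggregation: combine detail rows into one summary value per department.
    totals = {}
    for r in rows:
        totals[r["dept"]] = totals.get(r["dept"], Decimal("0")) + r["amount"]
    return totals

rows = cleanse(merge(source_a + source_b))
print(aggregate(rows))  # {'Sales': Decimal('150.5')}
```

Note that the two differently typed amounts only sum cleanly because merging standardized them first, which is the point of the merging step.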

Q. What are the various active & passive transformations? Ans: Transformations can be active or passive. An active transformation can change the number of records passed through it. A passive transformation never changes the record count. For example, the filter transformation is an active transformation, as it removes rows that do not meet the filter condition defined in the transformation. The expression transformation is passive, as the record count remains the same before and after it is applied. Active transformations that might change the record count include the following:

Internal Use


Advanced external procedure
Aggregator
Filter
Joiner
Normalizer
Rank
Source qualifier

Passive transformations that never change the record count include the following:
Lookup
Expression
External procedure
Sequence generator
Stored procedure
Update strategy

Q. When can we use a connected lookup and an unconnected lookup?
Ans: Connected lookups: a connected lookup transformation is part of the mapping data flow. With connected lookups, we can have multiple return values; that is, we can pass multiple values from the same row in the lookup table out of the lookup transformation. Common uses of connected lookups include:
=> Finding a name based on a number, e.g. finding a department name based on a department number
=> Finding a value based on a range of dates
=> Finding a value based on multiple conditions
Unconnected lookups: an unconnected lookup transformation exists separately from the data flow in the mapping. You write an expression using the :LKP reference qualifier to call the lookup within another transformation. Some common uses for unconnected lookups include:
=> Testing the results of a lookup in an expression
=> Filtering records based on the lookup results
=> Marking records for update based on the result of a lookup (for example, updating slowly changing dimension tables)
=> Calling the same lookup multiple times in one mapping
=> Calling a lookup when only one return value is expected
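The distinction can be sketched in plain Python, with the lookup table modeled as a dictionary (the department numbers and names below are invented): an unconnected lookup behaves like a function called on demand that returns a single value, while a connected lookup sits in the data flow and enriches every row.

```python
# Illustrative sketch (not Informatica syntax): a lookup modeled as a
# plain function over a hypothetical department table.
dept_lookup = {10: "Accounting", 20: "Research"}

def lookup_dept(dept_no):
    # Like an unconnected lookup: invoked from an expression on demand,
    # returning one value (with a default when no row matches).
    return dept_lookup.get(dept_no, "UNKNOWN")

rows = [{"emp": "Smith", "dept_no": 20}, {"emp": "Jones", "dept_no": 99}]

# A connected lookup would sit in the data flow and enrich every row;
# here we do the equivalent by mapping the lookup over all rows.
enriched = [{**r, "dept_name": lookup_dept(r["dept_no"])} for r in rows]
print(enriched[0]["dept_name"])  # Research
```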

Q. What are standard and reusable transformations? Ans: Mappings contain two types of transformations: standard and reusable. Standard transformations exist within a single mapping. You cannot reuse a standard transformation you created in another mapping, nor can you create a shortcut to that


transformation. However, you often want to create transformations that perform common tasks, such as calculating the average salary in a department. Since a standard transformation cannot be used by more than one mapping, you would have to set up the same transformation each time you want to calculate the average salary in a department. By selecting the reusable option, the transformation can be copied to any mapping and reused.
Q. What are the default sources supported by Informatica PowerMart?
Ans: The default sources are:
Relational tables, views, and synonyms
Fixed-width and delimited flat files that do not contain binary data
COBOL files
Q. When must we create the source definition? How can we connect it to a transformation?
Ans: When working with a file that contains fixed-width binary data, you must create the source definition. The designer displays the source definition as a table, consisting of names, datatypes, and constraints. To use a source definition in a mapping, connect it to a source qualifier or normalizer transformation.
Q. What is target load order in the designer?
Ans: In the designer, you can set the order in which the Informatica server sends records to the various target definitions in a mapping. This feature is crucial if you want to maintain referential integrity when inserting, deleting, or updating records in tables that have primary key and foreign key constraints applied to them. To maximize performance, the Informatica server writes data to all the targets connected to the same source qualifier or normalizer simultaneously.
Q. What are the different types of tracing levels available in transformations?
Ans: The tracing levels in transformations are:
Terse: indicates when the Informatica server initializes the session and its components; summarizes session results, but not at the level of individual records.
Normal: includes initialization information as well as error messages and notification of rejected data.
Verbose initialization: includes all information provided with the normal setting, plus more extensive information about initializing transformations in the session.
Verbose data: includes all information provided with the verbose initialization setting.


Note: by default, the tracing level for every transformation is normal. For a slight performance boost, you can set the tracing level to terse, writing the minimum of detail to the session log when running a session containing the transformation.
Q. How can we load data using Informatica?
Ans: We load data by running a session.
Q. What is a mapplet and how do you create one?
Ans: A mapplet is a reusable object that represents a set of transformations. It allows you to reuse transformation logic and can contain as many transformations as you need. Create a mapplet when you want to use a standardized set of transformation logic in several mappings. For example, if you have several fact tables that require a series of dimension keys, you can create a mapplet containing a series of lookup transformations to find each dimension key. You can then use the mapplet in each fact table mapping, rather than recreating the same lookup logic in each mapping. To create a new mapplet: in the mapplet designer, choose Mapplets > Create Mapplet, enter a descriptive mapplet name, and click OK. The designer creates the new mapplet in the mapplet designer. Choose Repository > Save.

Q. How can we tune Informatica performance for string functions?
Ans: String functions definitely have an impact on Informatica performance, particularly those that change the length of a string (SUBSTR, LTRIM, RTRIM, etc.). These functions slow down a mapping considerably, because the operations behind each string function are expensive (de-allocating and re-allocating memory within a reader block in the session). String functions are a necessary and important part of ETL, and it is not recommended to remove their use completely, but we should try to limit them to necessary operations. One way to tune these is to use varchar/varchar2 data types in your database sources, or to use delimited strings in source flat files as much as possible; this helps reduce the need for trimming input. If your sources are in a database, perform the LTRIM/RTRIM functions on the data in the source SQL statement; this will be much faster than performing them mid-stream.
Q. How can we tune Informatica performance for IIF functions?
Ans: When possible, arrange the logic to minimize the use of IIF conditionals. This is not particular to Informatica; it is costly in any programming language. It introduces "decisions" within the tool, and it also introduces multiple code paths across the logic (thus increasing complexity). Therefore, when possible, avoid using an IIF conditional; the only alternative here might be to use the Oracle DECODE function (when required).
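The advice to push trimming into the source SQL can be sketched with Python's sqlite3 module standing in for the source database; the customers table and its contents below are made up for illustration.

```python
# Illustrative sketch: trim in the source SQL so rows arrive already
# clean and the mapping needs no string functions. SQLite stands in
# for the source database here; the table is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT)")
conn.execute("INSERT INTO customers VALUES ('  Alice  '), ('Bob ')")

# LTRIM/RTRIM applied in the source qualifier SQL, not mid-stream.
trimmed = [r[0] for r in
           conn.execute("SELECT LTRIM(RTRIM(name)) FROM customers")]
print(trimmed)  # ['Alice', 'Bob']
```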


Q. How can we improve aggregator transformation performance?


Ans: We can improve aggregator transformation performance by using the sorted input option. When the sorted input option is selected, the Informatica server assumes all data is sorted by group. As the Informatica server reads rows for a group, it performs aggregate calculations as it reads, storing group information in memory only when necessary. To use the sorted input option, you must pass sorted data to the aggregator transformation. You can gain added performance with sorted ports when you partition the session. When sorted input is not selected, the Informatica server still performs aggregate calculations as it reads; however, since the data is not sorted, it must store data for every group until it reads the entire source, to ensure all aggregate calculations are accurate.
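The benefit of sorted input can be sketched in plain Python with itertools.groupby: when rows arrive already grouped by key, each group can be aggregated and emitted in a single pass, holding only one group in memory at a time. The rows and keys below are hypothetical.

```python
# Illustrative sketch: streaming aggregation over input that is
# already sorted by group key -- one group in memory at a time,
# instead of buffering every group until the whole source is read.
from itertools import groupby

sorted_rows = [("A", 1), ("A", 2), ("B", 5)]  # already sorted by key

def aggregate_sorted(rows):
    # Emit each group's sum as soon as the key changes.
    for key, group in groupby(rows, key=lambda r: r[0]):
        yield key, sum(v for _, v in group)

print(list(aggregate_sorted(sorted_rows)))  # [('A', 3), ('B', 5)]
```

Note that this only works because the input is pre-sorted; with unsorted input, groupby would emit partial groups, which mirrors why the aggregator must buffer the entire source in that case.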

Q. (i) What is FTP? (ii) How can we connect to a remote machine?

Ans: (i) The FTP (File Transfer Protocol) utility program is commonly used for copying files to and from other computers. These computers may be at the same site or at different sites thousands of miles apart. FTP is a general protocol that works on Unix systems as well as other non-Unix systems.
(ii) A remote connection is made through the command:
ftp machine_name
e.g. ftp 129.82.45.181 or ftp xyz (a machine name)
If the remote machine has been reached successfully, FTP responds by asking for a login name and password. When you enter your own login name and password for the remote machine, it returns a prompt like:
ftp>
and permits you access to your own home directory on the remote machine. You should be able to move around in your own directory and to copy files to and from your local machine using the FTP interface commands.
Q. What are the tasks that are done by the Informatica server?
Ans: The Informatica server performs the following tasks:
Manages the scheduling and execution of sessions and batches
Executes sessions and batches
Verifies permissions and privileges
Interacts with the server manager and pmcmd
The Informatica server moves data from sources to targets based on metadata stored in a repository. For instructions on how to move and transform data, the Informatica server reads a mapping (a type of metadata that includes transformations and source and target definitions). Each mapping uses a session to define additional information


and to optionally override mapping-level options. You can group multiple sessions to run as a single unit, known as a batch.
Q. What are the two programs that communicate with the Informatica server?
Ans: Informatica provides the server manager and pmcmd programs to communicate with the Informatica server:
Server manager: a client application used to create and manage sessions and batches, and to monitor and stop the Informatica server. You can use information provided through the server manager to troubleshoot sessions and improve session performance.
pmcmd: a command-line program that allows you to start and stop sessions and batches, stop the Informatica server, and verify whether the Informatica server is running.

Basic Data warehousing concepts


Q. What is the difference between a database, a data warehouse and a data mart? Ans: A database is an organized collection of information. A data warehouse is a very large database with special sets of tools to extract and cleanse data from operational systems and to analyze data. A data mart is a focused subset of a data warehouse that deals with a single area of data and that is organized for quick analysis.

Q. What are a data mart, a data warehouse and a decision support system? Explain briefly.
Ans: Data mart: a data mart is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers. In scope, the data may derive from an enterprise-wide database or data warehouse, or be more specialized. The emphasis of a data mart is on meeting the specific demands of a particular group of knowledge users in terms of analysis, content, presentation, and ease of use. Users of a data mart can expect to have data presented in terms that are familiar.
In practice, the terms data mart and data warehouse each tend to imply the presence of the other in some form. However, most writers using the terms seem to agree that the design of a data mart tends to start from an analysis of user needs, while a data warehouse tends to start from an analysis of what data already exists and how it can be collected in such a way that it can later be used. A data warehouse is a central aggregation of data (which can be distributed physically); a data mart is a data repository that may or may not derive from a data warehouse, and that emphasizes ease of access and usability for a particular designed purpose. In general, a data warehouse tends to be strategic but somewhat unfinished in concept; a data mart tends to be tactical and aimed at meeting an immediate need.


Data warehouse: a data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect. Typically, a data warehouse is housed on an enterprise mainframe server. Data from various online transaction processing (OLTP) applications and other sources is selectively extracted and organized in the data warehouse database for use by analytical applications and user queries. Data warehousing emphasizes the capture of data from diverse sources for useful analysis and access, but does not generally start from the point of view of the end user or knowledge worker who may need access to specialized, sometimes local databases. The latter idea is known as the data mart. Data mining, web mining, and decision support systems (DSS) are three kinds of applications that can make use of a data warehouse.
Decision support system: a decision support system (DSS) is a computer program application that analyzes business data and presents it so that users can make business decisions more easily. It is an "informational application" (in distinction to an "operational application" that collects the data in the course of normal business operation). Typical information that a decision support application might gather and present would be:
Comparative sales figures between one week and the next
Projected revenue figures based on new product sales assumptions
The consequences of different decision alternatives, given past experience in a context that is described
A decision support system may present information graphically and may include an expert system or artificial intelligence (AI). It may be aimed at business executives or some other group of knowledge workers.
Q. What is the difference between data scrubbing and data cleansing?
Ans: Data scrubbing is the process of cleaning up the junk in legacy data and making it accurate and useful for the next generation of automated systems. This is perhaps the most difficult of all conversion activities. Very often, it is made more difficult when the customer wants to make good data out of bad data. It is also the most important, and cannot be done without the active participation of the user.
Data cleansing is a two-step process: detection and then correction of errors in a data set.
Q. What are metadata and a repository?
Ans: Metadata: metadata is data about data. It contains descriptive data for end users, data that controls the ETL processing, and data about the current state of the data warehouse. ETL updates the metadata to reflect the most current state.
Repository: the place where you store the metadata is called a repository. The more sophisticated your repository, the more complex and detailed the metadata you can store in it. PowerCenter uses a relational database as the repository.


Q. What are the different types of fact tables?
Ans: There are three types of facts:

Additive: additive facts are facts that can be summed up through all of the dimensions in the fact table.
Semi-additive: semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Non-additive: non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Q. Illustrate additive, semi-additive and non-additive fact tables with examples.
Ans: Assume that we are a retailer, and we have a fact table with the following columns:
Date
Store
Product
Sales_amount
The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_amount is the fact. In this case, sales_amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table: date, store, and product. For example, the sum of sales_amount for all 7 days in a week represents the total sales amount for that week.
Now say we are a bank with the following fact table:
Date
Account
Current_balance
Profit_margin
The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_balance and profit_margin are the facts. Current_balance is a semi-additive fact, as it makes sense to add it up across all accounts (what is the total current balance for all accounts in the bank?), but it does not make sense to add it up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_margin is a non-additive fact, for it does not make sense to add it up at either the account level or the day level.
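The bank example can be checked with a small Python sketch; the dates, accounts and balances below are invented purely to show which sums are meaningful.

```python
# Illustrative sketch of the semi-additive fact above:
# rows of (date, account, current_balance).
rows = [
    ("2010-02-01", "A1", 100),
    ("2010-02-01", "A2", 200),
    ("2010-02-02", "A1", 110),
    ("2010-02-02", "A2", 190),
]

# Summing across the account dimension for one day is meaningful:
# the total balance held at the bank on that day.
total_on_feb1 = sum(b for d, _, b in rows if d == "2010-02-01")
print(total_on_feb1)  # 300

# Summing the same fact across the date dimension is meaningless:
# 100 + 110 is not "A1's balance", it is two daily snapshots added up.
meaningless = sum(b for _, a, b in rows if a == "A1")
print(meaningless)  # 210
```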


Q. What is a factless fact table?
Ans: A factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual facts. Factless fact tables are often used to record events or coverage information, and can also complement slowly changing dimension strategies. Common examples of factless fact tables include:
Identifying product promotion events (to determine promoted products that didn't sell)
Tracking student attendance or registration events
Tracking insurance-related accident events
Identifying building, facility, and equipment schedules for a hospital or university
Q. What are the three types of dimensions?
Ans: Conformed dimensions: a conformed dimension is something that can be shared by multiple fact tables or multiple data marts.
Junk dimensions: a junk dimension is a collection of random transactional codes, flags and/or text attributes that are unrelated to any particular dimension.
Degenerate dimensions: a degenerate dimension is dimensional in nature but exists in the fact table.
Q. What is a degenerate dimension and how is it used?
Ans: A degenerate dimension is data that is dimensional in nature but stored in a fact table. For example, if you have a dimension that only has order number and order line number, you would have a 1:1 relationship with the fact table; we would either have two tables with a billion rows each or one table with a billion rows. Therefore, this would be a degenerate dimension, and order number and order line number would be stored in the fact table.
Q. What is a junk dimension? Provide an example.
Ans: A junk dimension is a collection of random transactional codes, flags and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store the junk attributes. A good example would be a trade fact in a company that brokers equity trades.
The fact would contain several metrics (principal amount, net amount, price per share, commission, margin amount, etc.), and this would be related to several dimensions such as account, date, rep, office, and exchange. This fact would also contain several codes and flags that are related to the transaction rather than to any of the dimensions, such as an origin code (indicating whether the trade was initiated with a phone call or via the web), a reinvest flag (indicating whether or not this trade was the result of the reinvestment of a dividend payout) and a comment field for storing special instructions from the customer. These three attributes would normally be removed from the fact table and stored in a junk dimension, perhaps called the trade dimension. In this way, the number of indexes on the fact table is reduced, and performance is enhanced.


Basic Database concepts


Q. What is a schema?
Ans: A schema is the collection of database objects of a user.
Q. What are schema objects?
Ans: Schema objects are the logical structures that directly refer to the database data. Schema objects include tables, views, sequences, synonyms, indexes, clusters, database triggers, procedures, functions, packages and database links.
Q. Does a view contain data?
Ans: Views do not contain or store data.
Q. Can a view be based on another view?
Ans: Yes.
Q. What are the advantages of views?
Ans: Views provide an additional level of table security, by restricting access to a predetermined set of rows and columns of a table. They hide data complexity, simplify commands for the user, present the data in a different perspective from that of the base table, and store complex queries.
Q. What is an index?
Ans: An index is an optional structure associated with a table that gives direct access to rows, and which can be created to increase the performance of data retrieval. An index can be created on one or more columns of a table.
Q. How are indexes updated?
Ans: Indexes are automatically maintained and used by Oracle. Changes to table data are automatically incorporated into all relevant indexes.
Q. What is a data dictionary in terms of a database?
Ans: The data dictionary of an Oracle database is a set of tables and views that are used as a read-only reference about the database. It stores information about the logical and physical structure of the database, the valid users of the database, the integrity constraints defined for tables, and the space allocated for each schema object and how much of it is being used.
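A minimal sketch of a view holding no data of its own, using Python's sqlite3 module (SQLite stands in for Oracle here; the emp table, its rows, and the view name are all made up):

```python
# Illustrative sketch: a view stores a query, not data, so it reflects
# base-table changes automatically and can restrict rows and columns.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary INTEGER, dept TEXT)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [("Smith", 3000, "Sales"), ("Jones", 5000, "HR")])

# The view exposes only one column and one department (table security).
conn.execute("CREATE VIEW sales_emp AS "
             "SELECT name FROM emp WHERE dept = 'Sales'")

# Insert into the base table; the view sees it without being refreshed.
conn.execute("INSERT INTO emp VALUES ('Brown', 2500, 'Sales')")
names = [r[0] for r in conn.execute("SELECT name FROM sales_emp ORDER BY name")]
print(names)  # ['Brown', 'Smith']
```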


Q. What are integrity constraints?
Ans: An integrity constraint is a declarative way to define a business rule for a column of a table.
Q. Can an integrity constraint be enforced on a table if some existing table data does not satisfy the constraint?
Ans: No.
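A minimal sketch of a declarative constraint in action, again using Python's sqlite3 module in place of Oracle (the table and the salary rule are invented for illustration):

```python
# Illustrative sketch: a CHECK constraint declares a business rule for
# a column; the database rejects rows that violate it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp ("
             "name TEXT NOT NULL, "
             "salary INTEGER CHECK (salary > 0))")

conn.execute("INSERT INTO emp VALUES ('Smith', 3000)")   # satisfies the rule
try:
    conn.execute("INSERT INTO emp VALUES ('Jones', -50)")  # violates the rule
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# Only the valid row was stored.
count = conn.execute("SELECT COUNT(*) FROM emp").fetchone()[0]
print(count)  # 1
```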
