
Anil with INFORMATICA

INFORMATICA DEVELOPERS GUIDE FOR NAMING STANDARDS

12/27/2012

Contents
1. NAMING CONVENTIONS FOR TRANSFORMATIONS
   1.1  Mappings
   1.2  Mapplets
   1.3  Workflows
   1.4  Sessions
   1.5  Variables
   1.6  Parameter Files
   1.7  Log and Bad Files
   1.8  Server Variables
   1.9  Performance Improvement Related Standards
   1.10 Miscellaneous Standards
   1.11 Performance Tuning - Reference

1. NAMING CONVENTIONS FOR TRANSFORMATIONS

Aggregator Transformation
Naming convention: agg_ + target_table_name or descriptive_process_name
Description/Example: Aggregates the column(s). e.g. if the revenue column is being aggregated, the transformation should be named agg_REVENUE.

Expression Transformation
Naming convention: exp_ + target_table_name or descriptive_process_name
Description/Example: If the object of the expression is to generate a sales summary, the transformation should be named exp_SALES_SUMMARY.

Advanced External Procedure Transformation
Naming convention: aep_ + external procedure name
Description/Example: A concatenation of aep_ + the name of the external procedure being called.

External Procedure Transformation
Naming convention: ext_ + external procedure name
Description/Example: A concatenation of ext_ + the name of the external procedure being called.

Filter Transformation
Naming convention: fil_ + target_table_name or descriptive_process_name
Description/Example: If the filter condition is based on the STATE attribute, the transformation should be named fil_STATE.

Joiner Transformation
Naming convention: jnr_ + source_table1_source_table2 (or filename1_filename2 if joining files)
Description/Example: Joins two disparate sources. e.g. if the file custfile.dat is joined with the customer table, the transformation should be named jnr_CUSTFILE_CUSTOMER.

Connected Lookup Transformation
Naming convention: lkp_ + table name on which the lookup is performed
Description/Example: If the lookup is performed on PRODUCT_ID using the table DIM_PRODUCT, the transformation should be named lkp_DIM_PRODUCT.

Unconnected Lookup Transformation
Naming convention: ulkp_ + table name on which the lookup is performed + field returned
Description/Example: If the lookup is performed on PRODUCT_ID using the table DIM_PRODUCT to return PRODUCTCODE, the transformation should be named ulkp_DIM_PRODUCT_PRODUCTCODE.

Normalizer Transformation
Naming convention: nrm_ + name of the file being normalized
Description/Example: If the MONTHLY_SALES.dat file is being normalized, the transformation should be named nrm_MONTHLY_SALES.

Rank Transformation
Naming convention: rnk_ + target_table_name or descriptive_process_name
Description/Example: rnk_MONTHLY_SALES

Sequence Transformation
Naming convention: seq_ + table name
Description/Example: If the sequence is generated for the column PRODUCT_ID, the transformation should be named seq_DIM_PRODUCT.

Source Qualifier Transformation
Naming convention: sq_ + table name
Description/Example: If the source qualifier is defined on the table DIM_PRODUCT, the transformation should be named sq_DIM_PRODUCT.

Stored Procedure Transformation
Naming convention: sp_ + stored procedure name
Description/Example: A concatenation of sp_ + the name of the stored procedure being called. NOTE: if a stored procedure name already begins with sp, there is no need to prefix it with sp_.

Update Strategy Transformation
Naming convention: upd_ + table name for which the update strategy is defined
Description/Example: If the update strategy is defined for the column CUST_NAME, the transformation should be named upd_DIM_CUSTOMER.

Union Transformation
Naming convention: un_ + description_of_merge
Description/Example: un_merge_transaction_records

Router Transformation
Naming convention: rtr_ + description_of_routing
Description/Example: rtr_route_wrap_error_valid_records

Ports

Input Ports: i_<fieldname> (if the port is created explicitly)
Output Ports: o_<fieldname> (if the port is created explicitly)
Variable Ports: v_<fieldname>
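For illustration only, the ports of a hypothetical Expression transformation named per these conventions might look like the sketch below (the field names are invented for the example):

    i_QUANTITY    input port, passed through from the source
    i_PRICE       input port
    i_DISCOUNT    input port
    v_Line_total  variable port, expression: i_QUANTITY * i_PRICE - i_DISCOUNT
    o_LINE_TOTAL  output port, expression: v_Line_total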

1.1 Mappings

The naming convention for mappings which relate to specific systems is:
m_<Project name>_<System name>_<Description>
Example: m_trex_rps_create_transaction_stf_format

1.2 Mapplets

The naming convention for mapplets is: mplt_<Description>
Mapplet input: input_<SourceName>
Mapplet output: output_<TargetName>
Example: mplt_acr_report_generation

1.3 Workflows

Workflow names should begin with wf_ and describe the functionality of the workflow. The naming convention for workflows is:
wf_<Project name>_<System name>_<Description>
Example: wf_trex_rps_create_transaction_stf_format

1.4 Sessions

The name of a session is derived from the mapping it is based upon. By default, the session name generated by Informatica is s_ + mapping name, and this default should be used as the session name. For example, a session based on the mapping m_trex_gsr_dim_geography will be named s_m_trex_gsr_dim_geography.
The naming convention for sessions is: s_<mapping name>
If one mapping is used by two or more sessions, the session names should be suffixed suitably to indicate the need for using the same mapping in several sessions.
Example: s_m_trex_rps_create_transaction_stf_format

1.5 Variables

The naming convention for variables is: v_<fieldname or purpose of variable>

A variable name should be easy to read and should convey the purpose of the variable. This helps when a number of variables are defined in a single transformation. Variable names should follow sentence case, i.e. the first letter of the name is upper case. The use of an underscore within a variable name is recommended for clarity and readability, e.g. the variable for first-quarter sales should be named Q1_sales rather than Q1sales.

Example:

v_Month
  Expression: GET_DATE_PART(DATE_ENTERED, 'MM')
  Description: Extract the month from DATE_ENTERED.

v_Q1_sales
  Expression: SUM(QUANTITY * PRICE - DISCOUNT, Month = 1 OR Month = 2 OR Month = 3)
  Description: Calculate first-quarter sales.

v_State_counter
  Expression: IIF(PREVIOUS_STATE = STATE, State_counter + 1, 1)
  Description: Increment the counter when the STATE in the previous data record is the same as in the current data record.

1.6 Parameter Files

The naming convention for parameter files is: Prm_<description>
Example: Prm_RPS_010_job.prm
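For reference, a minimal sketch of what such a parameter file might contain is shown below. A PowerCenter parameter file holds one bracketed section per workflow or session, followed by name=value assignments; the folder, workflow, session, connection and parameter names here are invented for the example:

    [TREX_RPS.WF:wf_trex_rps_create_transaction_stf_format.ST:s_m_trex_rps_create_transaction_stf_format]
    $DBConnection_Source=RPS_SRC
    $DBConnection_Target=RPS_TGT
    $$Load_Date=2012-12-27
    $$Region_Code=NA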

1.7 Log and Bad Files

The naming conventions are:
Log files: <session_name>.log
Bad files: <session_name>.bad

1.8 Server Variables

The naming conventions for server variables are:
$PMRootDir = the root directory (example: \\Server\C:\Informatica\9.1.0\Server\)
$PMRootDir/SessLogs = session logs directory
$PMRootDir/BadFiles = bad (reject) files directory
$PMRootDir/Cache = cache files directory
$PMRootDir/TgtFiles = target flat files directory (output)
$PMRootDir/SrcFiles = source flat files directory (input)
$PMRootDir/ExtProc = external procedures directory
$PMRootDir/LkpFiles = lookup files directory
$PMRootDir/WorkflowLogs = workflow logs directory

1.9 Performance Improvement Related Standards

1. Turn off verbose logging.
2. Turn off "collect performance statistics" after performance testing.
3. Try not to read a file over the network.
4. Where there is no opportunity for code reuse, consider output expressions in place of variables.
5. Using a Sequence Generator gives better performance than calling a stored procedure to get the sequence from the database.
6. An Update Strategy slows down the performance of the session.
7. Lookups and Aggregators slow down performance since they involve caching. It is advisable to calculate and check the index and data cache sizes when using them.
8. Consider partitioning of files wherever necessary.

Note: Refer to Annexure 1.

1.10 Miscellaneous Standards

1. All ports should be in CAPITAL LETTERS for flat files.
2. The description of every transformation (including sources and targets) should be filled in appropriately.
3. There should be no unused ports in any transformation.
4. The connection information of a lookup should be $SOURCE or $TARGET.
5. For output-only ports, all 'Error (Transformation error)' default messages must be removed.
6. When strings read from a fixed-width flat file source are compared, ensure the strings are trimmed properly (LTRIM, RTRIM) to remove blank spaces.
7. Wherever fields returned from a Lookup transformation are used, null values for these fields should be handled accordingly.
8. When ports are dragged to successive transformations, their names should not be changed.
9. Common sources and target tables should be used from the COMMON folder.

10. The "edit null characters to space" option in the session definition for file targets should be selected by default.
11. Check the return status of pmcmd; if it is non-zero, exit (a sketch of such a wrapper script follows after this list).
12. ALL CONTROL REPORTS ARE TO BE IN CAPS - please modify the necessary scripts accordingly. For example:
    RECORDS PROCESSED: 45
    RECORDS REJECTED: 0
    Note the ":" separating the description and the value in the example above; this should be standard for all control reports.
13. The default values of all output ports should be removed.
14. All components in the mapping should have comments.
15. All ports should be used and connected.
16. The tracing level should be set to 'Terse' in all transformations across the mapping.
17. All 'Error (Transformation error)' messages should be removed from the default value in the Expression transformation.
18. The mapping should not contain any unused source/target definitions.
19. The Override Tracing should be set to 'Terse' in the Error handling section of the session properties.
20. 'Enable High Precision' should be disabled in the Performance section of the session properties.
21. The "Stop on errors" property should be greater than 0. There may be exceptions to this rule; all such cases should be documented in the design document.
22. The session log file directory, cache directory, file name, database connection strings, reject file directory and target connection string should start with a $.
23. Port names should follow the standards.
24. The data types of ports in the source qualifier and the source should match.
25. The sequence of fields in the SQL query of a source qualifier should be in the same order as the ports in it.
26. The filter expression should be coded in the following format: IIF(condition, TRUE, FALSE).
27. The Lookup transformation should not contain fields that are neither checked as output ports nor used as lookup-only ports.
28. Usage of variables: Variable expressions can reference any input port, so as a matter of good design, variables in a transformation should be defined after all the input ports have been defined. Variables that will be used by other variables should be defined first, since variables can reference other variables in an expression. NOTE: the ordering of variables in a transformation is very important; all variables that use other variables in their expressions should be defined in the order of dependency. In the table above, Month is used to calculate Q1_sales, hence Q1_sales must always be defined after Month. Variables are initialized to 0 for numeric variables or to the empty string for character/date variables. Variables can be used for temporary storage (for example, PREVIOUS_STATE is a temporary storage variable which is overwritten when a new data record is read and processed). Local variables also reduce the overhead of recalculation.

If a complex calculation is to be used throughout a mapping, it is recommended to write the expression once and designate it as a variable. The variable can then be passed to other transformations as an input port. This increases performance, as the Informatica Server performs the calculation only once. For example, the variable Q1_sales can be passed from one transformation to another rather than redefining the same formula in different transformations. Variables remember values across rows and retain their value until the next evaluation of the variable expression, so the order of the variables can be used for procedural computations. For example, let V1 and V2 be two variables defined with the following expressions, with V1 occurring before V2: V1 has the expression V2 + 1, and V2 has the expression V1 + 1. In this case, V1 gets the previous row's value of V2, since V2 occurs after V1 and is evaluated after V1; but V2 gets the current row's value of V1, since V1 has already been evaluated by the time V2 is evaluated.
29. Informatica can perform data sorting based on a key defined for a source, but it is beneficial to have an index matching the sort criteria on the table for better performance. This is especially helpful when performing key-based aggregations as well as lookups.
30. An external procedure/function call should always return a value. The return value is used to determine the next logical step in the mapping execution.
31. The performance of an aggregator can be improved by presorting the data before passing it to the aggregator.
32. Filter data as early as possible in the ETL.
33. Use custom SQL in the Source Qualifier for the WHERE clause to filter data. This lets the database do the work, where it can be handled most efficiently (see the example query after this list).
34. A stored procedure should always return a value. The return value is used by Informatica to evaluate the success or failure of the event; it can also be used to control the outcome of the mapping execution.
35. Stored procedures used inside a mapping usually cause poor performance, so it is recommended to avoid them.
36. Mapping a string to an integer, or an integer to a string, performs the conversion implicitly, but it is slower than creating an output port with an expression like to_integer(xxxx) and mapping an integer to an integer. This is because PMServer is left to decide whether the conversion can be done mid-stream, which slows things down.
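A minimal wrapper sketch for item 11 above. The integration service, domain, folder and workflow names are placeholders, and the credentials are assumed to be supplied through environment variables:

    #!/bin/sh
    # Start the workflow, wait for completion, and fail the wrapper if pmcmd
    # reports a non-zero return status.
    pmcmd startworkflow -sv INT_SVC_DEV -d DOM_DEV -u "$INFA_USER" -p "$INFA_PASSWD" \
        -f TREX_RPS -wait wf_trex_rps_create_transaction_stf_format
    rc=$?
    if [ $rc -ne 0 ]; then
        echo "pmcmd failed with return status $rc" >&2
        exit $rc
    fi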
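An illustrative Source Qualifier SQL override for items 25 and 33 above; the table, columns and filter are invented for the sketch, and the column order is kept in the same order as the Source Qualifier ports:

    SELECT PRODUCT_ID,
           PRODUCT_NAME,
           EFFECTIVE_DATE
      FROM DIM_PRODUCT
     WHERE EFFECTIVE_DATE >= TO_DATE('01-JAN-2012', 'DD-MON-YYYY')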

1.11 Performance Tuning - Reference: <http://www.coreintegration.com>

Informatica Map/Session Tuning
Covers basic, intermediate, and advanced tuning practices.

Table of Contents
Basic Guidelines
Intermediate Guidelines
Advanced Guidelines

INFORMATICA BASIC TUNING GUIDELINES


The following points are high-level issues on where to go to perform "tuning" in Informatica's products. These are NOT permanent instructions, nor are they the end-all solution - just some items which, if tuned first, might make a difference. The level of skill available for certain items will cause the results to vary. To 'test' performance throughput it is generally recommended that the source set of data produce about 200,000 rows to process. Beyond this, the performance problems / issues may lie in the database - partitioning tables, dropping / re-creating indexes, striping raid arrays, etc... Without such a large result set to deal with, your average timings will be skewed by other users on the database, processes on the server, or network traffic. This seems to be an ideal test size for producing mostly accurate averages. Try tuning your maps with these steps first, then move to tuning the session; iterate this sequence until you are happy, or cannot achieve better performance by continued efforts. If the performance is still not acceptable, then the architecture must be tuned (which can mean changes to what maps are created). In this case, you can contact us <http://www.coreintegration.com/contact.htm> - we tune the architecture and the whole system from top to bottom. KEEP THIS IN MIND: In order to achieve optimal performance, it's always a good idea to strike a balance between the tools, the database, and the hardware resources. Allow each to do what it does best. Varying the architecture can make a huge difference in speed and optimization possibilities. Utilize a database (like Oracle / Sybase / Informix / DB2 etc...) for significant data handling operations (such as sorts, groups, aggregates). In other words, staging tables can be a huge benefit to parallelism of operations, and parallel design, almost by mathematical definition, nearly always cuts your execution time. Staging tables have many benefits; please see the staging table discussion in the methodologies section for full details. Localize. Localize all target tables onto the SAME instance of Oracle (same SID), or the same instance of Sybase. Try not to use synonyms (remote database links) for anything (including: lookups, stored procedures, target tables, sources, functions, privileges, etc...). Utilizing remote links will most certainly slow things down. For Sybase users, remote mounting of databases can definitely be a hindrance to performance. If you can, localize all target tables, stored procedures, functions, views, and sequences in the SOURCE database. Again, try not to connect across synonyms; synonyms (remote database tables) could potentially affect performance by as much as a factor of 3 times or more. Remove external registered modules. Perform pre-processing / post-processing

utilizing PERL, SED, AWK, and GREP instead. The Application Programmer's Interface (API) which calls externals is inherently slow (as of 1/1/2000). Hopefully Informatica will speed this up in the future. The external module which exhibits speed problems is the regular expression module (Unix: Sun Solaris E450, 4 CPUs, 2 GIGS RAM, Oracle 8i and Informatica). It broke speed from 1500+ rows per second without the module to 486 rows per second with the module. No other sessions were running. (This was a SPECIFIC case with a SPECIFIC map - it's not like this for all maps.) Remember that Informatica suggests that each session takes roughly 1 to 1 1/2 CPUs. In keeping with this, Informatica plays well with RDBMS engines on the same machine, but does NOT get along (performance wise) with ANY other engine (reporting engine, java engine, OLAP engine, java virtual machine, etc...). Remove any database based sequence generators. This requires a wrapper function / stored procedure call. Utilizing these stored procedures has caused performance to drop by a factor of 3 times. This slowness is not easily debugged - it can only be spotted in the Write Throughput column. Copy the map, replace the stored proc call with an internal sequence generator for a test run - this is how fast you COULD run your map. If you must use a database generated sequence number, then follow the instructions for the staging table usage. If you're dealing with GIGs or terabytes of information, this should save you lots of hours of tuning. IF YOU MUST have a shared sequence generator, then build a staging table from the flat file, add a SEQUENCE ID column, and call a POST TARGET LOAD stored procedure to populate that column (see the sketch below). Place the post target load procedure into the flat file to staging table load map. A single call into the database, followed by a batch operation to assign sequences, is the fastest method for utilizing shared sequence generators. TURN OFF VERBOSE LOGGING. The session log has a tremendous impact on the overall performance of the map. Force the over-ride in the session, setting it to NORMAL logging mode. Unfortunately the logging mechanism is not "parallel" in the internal core; it is embedded directly into the operations. Turn off 'collect performance statistics'. This also has an impact - although minimal at times - as it writes a series of performance data to the performance log. Removing this operation reduces reliance on the flat file operations. However, it may be necessary to have this turned on DURING your tuning exercise; it can reveal a lot about the speed of the reader and writer threads. If your source is a flat file, utilize a staging table (see the staging table slides in the presentations section of this web site). This way you can also use SQL*Loader, BCP, or some other database bulk-load utility. Place basic logic in the source load map, and remove all potential lookups from the code. At this point, if your reader is slow, then check two things: 1) if you have an item in your registry or configuration file which sets the "Throttle Reader" to a specific maximum number of blocks, it will limit your read throughput (this only needs to be set if the sessions have demonstrated problems with constraint based loads); 2) move the flat file to local internal disk (if at all possible). Try not to read a file across the network, or from a RAID device. Most RAID arrays are fast, but Informatica seems to top out, where internal disk continues to be much faster. Here a link will NOT work to increase speed - it must be the full file itself, stored locally. Try to eliminate the use of non-cached lookups. By issuing a non-cached lookup, your performance will be impacted significantly, particularly if the lookup table is also a "growing" or "updated" target table - this generally means the indexes are changing during operation, and the optimizer loses track of the index statistics. Again, utilize staging tables if possible. In utilizing staging tables, views in the database can be built which join the data together, or Informatica's joiner object can be used to join data together - either one will help dramatically increase speed.
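As a rough illustration of the shared sequence generator approach described above, a post-target-load stored procedure can assign the surrogate keys to the staging table in a single batch statement; the table, column and sequence names below are invented for the sketch:

    -- Hypothetical post-load procedure: batch-assign sequence values to the
    -- staging table instead of calling the database once per row.
    CREATE OR REPLACE PROCEDURE assign_stg_sequence_id AS
    BEGIN
      UPDATE stg_transaction
         SET sequence_id = seq_transaction_id.NEXTVAL
       WHERE sequence_id IS NULL;
      COMMIT;
    END;
    /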

Separate complex maps - try to break the maps out into logical threaded sections of processing. Re-arrange the architecture if necessary to allow for parallel processing. There may be smaller components doing individual tasks, however the throughput will be proportionate to the degree of parallelism that is applied. A discussion on HOW to perform this task is posted on the methodologies page; please see this discussion for further details. BALANCE: Balance between Informatica and the power of SQL and the database. Try to utilize the DBMS for what it was built for: reading/writing/sorting/grouping/filtering data en masse. Use Informatica for the more complex logic, outer joins, data integration, multiple source feeds, etc... The balancing act is difficult without DBA knowledge. In order to achieve a balance, you must be able to recognize which operations are best in the database, and which ones are best in Informatica. This does not detract from the use of the ETL tool, rather it enhances it - it's a MUST if you are performance tuning for high-volume throughput. TUNE the DATABASE. Don't be afraid to estimate: small, medium, large, and extra large source data set sizes (in terms of: number of rows, average number of bytes per row), expected throughput for each, turnaround time for load, is it a trickle feed? Give this information to your DBAs and ask them to tune the database for the "worst case". Help them assess which tables are expected to be high read / high write, which operations will sort (order by), etc... Moving disks, or assigning the right table to the right disk space, could make all the difference. Utilize a PERL script to generate "fake" data for small, medium, large, and extra large data sets. Run each of these through your mappings - in this manner, the DBA can watch or monitor throughput as a real load size occurs. Be sure there is enough SWAP and TEMP space on your PMSERVER machine. Not having enough disk space could potentially slow down your entire server during processing (in an exponential fashion). Sometimes this means watching the disk space while your session runs; otherwise you may not get a good picture of the space available during operation - particularly if your maps contain aggregates, or lookups that flow to the disk Cache directory, or if you have a JOINER object with heterogeneous sources. Place some good server load monitoring tools on your PMServer in development and watch it closely to understand how the resources are being utilized, and where the hot spots are. Try to follow the recommendations - it may mean upgrading the hardware to achieve throughput. Look into EMC's disk storage array - while expensive, it appears to be extremely fast; I've heard (but not verified) that it has improved performance in some cases by up to 50%. SESSION SETTINGS. In the session, there is only so much tuning you can do. Balancing the throughput is important - by turning on "Collect Performance Statistics" you can get a good feel for what needs to be set in the session, or what needs to be changed in the database. Read the performance section carefully in the Informatica manuals. Basically what you should try to achieve is: OPTIMAL READ, OPTIMAL THROUGHPUT, and OPTIMAL WRITE. Over-tuning one of these three pieces can result in ultimately slowing down your session. For example: your write throughput is governed by your read and transformation speed; likewise, your read throughput is governed by your transformation and write speed.
The best method to tune a problematic map is to break it into components for testing: 1) Read Throughput - tune for the reader, see what the settings are, and send the write output to a flat file for less contention. Check the "Throttle Reader" setting (which is not configured by default), and increase the Default Buffer Size by a factor of 64k each shot - ignore the warning above 128k. If the reader's throughput still appears to increase during the session and then stabilize (after a few thousand rows), try increasing the Shared Session Memory from 12MB to 24MB. If the reader still stabilizes, then you have a slow source, slow lookups, or your CACHE directory is not on internal disk. If the reader's throughput continues to climb above where it stabilized, make note of the session settings. Check the Performance Statistics to make sure the writer throughput is NOT the bottleneck - you are attempting to tune the reader here, and don't want the writer threads to slow you down. Change the map target back to the database targets and run the session again. This time, make note of how much the reader slows down; its optimal performance was reached with a flat file(s), so this time slow targets are the cause. NOTE: if your reader session to a flat file just doesn't ever "get fast", then you've got some basic map tuning to do. Try to merge expression objects, set your lookups to unconnected (for re-use if possible), and check your Index and Data cache settings if you have aggregation or lookups being performed, etc... If you have a slow writer, change the map to a single target table at a time - see which target is causing the "slowness" and tune it. Make copies of the original map, and break down the copies. Once the "slower" of the N targets is discovered, talk to your DBA about partitioning the table, updating statistics, removing indexes during load, etc... There are many database things you can do here. Remove all other "applications" on the PMServer, except for the database / staging database or Data Warehouse itself. PMServer plays well with an RDBMS (relational database management system), but doesn't play well with application servers, particularly JAVA Virtual Machines, Web Servers, Security Servers, application servers, and report servers. All of these items should be broken out to other machines. This is critical to improving performance on the PMServer machine.

INFORMATICA INTERMEDIATE TUNING GUIDELINES

The following numbered items are for intermediate level tuning. After going through all the pieces above and still having trouble, these are some things to look for. These are items within a map which make a difference in performance (we've done extensive performance testing of Informatica to be able to show these effects). Keep in mind that at this level the performance isn't affected unless there are more than 1 million rows (average size: 2.5 GIG of data). ALL items are Informatica MAP items and Informatica objects - none are outside the map. Also remember, this applies to PowerMart/PowerCenter (4.5x, 4.6x / 1.5x, 1.6x) - other versions have NOT been tested. The order of these items is not relevant to speed. Each one has its own impact on the overall performance. Again, throughput is also gauged by the number of objects constructed within a map/mapplet. Sometimes it's better to sacrifice a little readability for a little speed. It's the old paradigm: weighing readability and maintainability (true modularity) against raw speed. Make sure the client agrees with the approach, or that the data sets are large enough to warrant this type of tuning. BE AWARE: The following tuning tips range from "minor" cleanup to "last resort" types of things - only when data sets get very large should these items be addressed; otherwise, start with the BASIC tuning list above, then work your way into these suggestions.

To understand the intermediate section, you'll need to review the memory usage diagrams (also available on this web site).
Filter Expressions - try to evaluate them in a port expression. Try to create the filter (true/false) answer inside a port expression upstream. Complex filter expressions slow down the mapping; expressions/conditions operate fastest in an Expression object with an output port for the result. It turns out that the longer or more complex the expression, the more severe the speed degradation. Place the actual expression (complex or not) in an EXPRESSION OBJECT upstream from the filter. Compute a single numerical flag, 1 for true and 0 for false, as an output port. Pump this into the filter - you should see the maximum performance with this configuration. Remove all "DEFAULT" value expressions where possible. Having a default value - even the "ERROR(xxx)" command - slows down the session. It causes an unnecessary evaluation of values for every data element in the map. The only time you want to use a DEFAULT value is when you have to provide one for a specific port. There is another method: placing

a variable with an IIF(xxxx, DEFAULT VALUE, xxxx) condition within an expression. This will always be faster (if assigned to an output port) than a default value. Variable Ports are "slower" than output expressions. Whenever possible, use output expressions instead of variable ports. Variables are good for static and state-driven values, but they do slow down the processing time, as they are allocated/reallocated on each pass of a row through the expression object. Datatype conversion - perform it in a port expression. Simply mapping a string to an integer, or an integer to a string, will perform the conversion; however, it will be slower than creating an output port with an expression like to_integer(xxxx) and mapping an integer to an integer. This is because PMServer is left to decide whether the conversion can be done mid-stream, which seems to slow things down. Unused Ports. Surprisingly, unused output ports have no effect on performance. This is a good thing. However, in general it is good practice to remove any unused ports in the mapping, including variables. Unfortunately there is no "quick" method for identifying unused ports. String Functions. String functions definitely have an impact on performance, particularly those that change the length of a string (substring, ltrim, rtrim, etc...). These functions slow the map down considerably; the operations behind each string function are expensive (de-allocating and re-allocating memory within a READER block in the session). String functions are a necessary and important part of ETL; we do not recommend removing their use completely, only limiting them to necessary operations. One of the ways we advocate tuning these is to use "varchar/varchar2" data types in your database sources, or to use delimited strings in source flat files (as much as possible). This will help reduce the need for "trimming" input. If your sources are in a database, perform the LTRIM/RTRIM functions on the data coming in from the database SQL statement; this will be much faster than performing it operationally mid-stream. IIF Conditionals are costly. When possible, arrange the logic to minimize the use of IIF conditionals. This is not particular to Informatica; it is costly in ANY programming language. It introduces "decisions" within the tool, and it also introduces multiple code paths across the logic (thus increasing complexity). Therefore, when possible, avoid utilizing an IIF conditional; the only alternative here might be (for example) an ORACLE DECODE function applied to a SQL source. Sequence Generators slow down mappings. Unfortunately there is no "fast" and easy way to create sequence generators. The cost is not that high for using a sequence generator inside of Informatica, particularly if you are caching values (a cache of around 2000 seems to be the sweet spot). However, if at all avoidable, this is one "card" up a sleeve that can be played. If you don't absolutely need the sequence number in the map for calculation reasons, and you are utilizing Oracle, then let SQL*Loader create the sequence for all insert rows. If you're using Sybase, don't specify the Identity column as a target - let the Sybase server generate the column. Also, try to avoid "reusable" sequence generators - they tend to slow the session down further, even with cached values. Test Expressions slow down sessions.
Expressions such as IS_SPACES tend to slow down the mappings; this is a data validation expression which has to run through the entire string to determine whether it is spaces, much the same as IS_NUMBER has to validate an entire string. These expressions (if at all avoidable) should be removed in cases where it is not necessary to "test" prior to conversion. Be aware, however, that direct conversion without testing (conversion of an invalid value) will kill the transformation. If you absolutely need a test expression for numerics, try this: IIF(<field> * 1 >= 0, <field>, NULL) - preferably when you don't care if the value is zero. An alpha in this expression should return a NULL to the computation. Yes, the IIF condition is slightly faster than IS_NUMBER, because IS_NUMBER parses the entire string, and the multiplication operator is the actual speed gain. Reduce the Number of OBJECTS in a map. Frequently, the idea of these tools is to make the "data

translation map" as easy as possible. All too often, that means creating "an" (1) expression for each throughput/translation (taking it to an extreme, of course). Each object adds computational overhead to the session, and timings may suffer. Sometimes, if performance is an issue / goal, you can integrate several expressions into one expression object, thus reducing the "object" overhead. In doing so you could speed up the map. Update Expressions - Session set to Update Else Insert. If you have this switch turned on, it will definitely slow the session down - Informatica performs 2 operations for each row: an update (with the primary key), then, if it returns zero rows updated, an insert. The way to speed this up is to "know" ahead of time whether you need to issue a DD_UPDATE or DD_INSERT inside the mapping, and tell the update strategy what to do. After which, you can change the session setting to INSERT and UPDATE AS UPDATE, or UPDATE AS INSERT. Multiple Targets are too slow. Frequently maps are generated with multiple targets, and sometimes multiple sources. This (despite first appearances) can really burn up time. If the architecture permits change, and the users support re-work, then try to change the architecture: 1 map per target is the general rule of thumb. Once you reach one map per target, the tuning gets easier. Sometimes it helps to reduce it to 1 source and 1 target per map. But if the architecture allows more modularization, 1 map per target usually does the trick. Going further, you could break it up: 1 map per target per operation (such as insert vs update). In doing this, it will provide a few more cards to the deck with which you can "tune" the session, as well as the target table itself. Going this route also introduces parallel operations. For further info on this topic, see my architecture presentations on Staging Tables, and 3rd normal form architecture (Corporate Data Warehouse Slides). Slow Sources - Flat Files. If you've got slow sources, and these sources are flat files, you can look at some of the following possibilities. If the sources reside on a different machine, and you've opened a named pipe to get them across the network, then you've (potentially) opened a can of worms: you've introduced the network speed as a variable in the speed of the flat file source. Try to compress the source file, FTP PUT it on the local machine (local to PMServer), decompress it, then utilize it as a source. If you're reaching across the network to a relational table, and the session is pulling many, many rows (over 10,000), then the source system itself may be slow. You may be better off using a source system extract program to dump it to a file first, then follow the above instructions. However, there is something your SAs and Network Ops folks could do (if necessary) - this is covered in detail in the advanced section. They could backbone the two servers together with a dedicated network line (no hubs, routers, or other items in between the two machines). At the very least, they could put the two machines on the same sub-net. Now, if your file is local to PMServer but is still slow, examine the location of the file (which device it is on). If it's not on an INTERNAL DISK then it will be slower than if it were on an internal disk (the C drive for you folks on NT). It is not enough that a Unix file LINK exists locally while the file is remote - the actual file must be local. Too Many Aggregators.
If your map has more than 1 aggregator, chances are the session will run very very slowly - unless the CACHE directory is extremely fast, and your drive seek/access times are very high. Even still, placing aggregators end-to-end in mappings will slow the session down by factors of at least 2. This is because of all the I/O activity being a bottleneck in Informatica. What needs to be known here is that Informatica's products: PM / PC up through 4.7x are NOT built for parallel processing. In other words, the internal core doesn't put the aggregators on threads, nor does it put the I/O on threads - therefore being a single strung process it becomes easy for a part of the session/map to become a "blocked" process by I/O factors. For I/O contention and resource monitoring, please see the database/datawarehousetuningguide <http://www.coreintegration.com/innercore/library/technical/tuningguide.htm>. Maplets containing Aggregators. Maplets are a good source for replicating data logic. But just

because an aggregator is in a maplet doesn't mean it won't affect the mapping. The reason maplets don't affect the speed of the mappings is that they are treated as part of the mapping once the session starts - in other words, if you have an aggregator in a maplet followed by another aggregator in a mapping, you will still have the "too many aggregators" problem mentioned above. Reduce the number of aggregators in the entire mapping (including maplets) to 1 if possible. If necessary, split the map up into several different maps, and use intermediate tables in the database if required to achieve processing goals. Eliminate "too many lookups". What happens and why? Well, with too many lookups, your cache is eaten in memory - particularly on the 1.6 / 4.6 products. The end result is that there is no memory left for the sessions to run in; the DTM reader/writer/transformer threads are not left with enough memory to be able to run efficiently. PC 1.7 / PM 4.7 solve some of these problems by caching some of these lookups out to disk when the cache is full. But you still end up with contention - in this case, with too many lookups, you're trading Memory Contention for Disk Contention. The memory contention might be worse than the disk contention, because the OS ends up thrashing (swapping in and out of TEMP/SWAP disk space) with small block sizes to try to find your lookup row, and as the row goes from lookup to lookup, the swapping / thrashing gets worse. Lookups & Aggregators Fight. The lookups and the aggregators fight for memory space as discussed above. Each requires an Index Cache and a Data Cache, and they "share" the same HEAP segments inside the core. See the Memory Layout document <http://www.coreintegration.com/innercore/library/technical/memorylayout.htm> for more information. Particularly in the 4.6 / 1.6 products and prior, these memory areas become critical, and when dealing with many, many rows the session is almost certain to cause the server to "thrash" memory in and out of the OS swap space. If possible, separate the maps: perform the lookups in the first section of the maps, position the data in an intermediate target table, then have a second map read the target table and perform the aggregation (this also provides the option for a group by to be done within the database)... another speed improvement.

INFORMATICA ADVANCED TUNING GUIDELINES

The following numbered items are for advanced level tuning. Please proceed cautiously, one step at a time. Do not attempt to follow these guidelines if you haven't already made it through all the basic and intermediate guidelines first. These guidelines may require a level of expertise which involves System Administrators, Database Administrators, and Network Operations folks. Please be patient. The most important aspect of advanced tuning is to be able to pinpoint specific bottlenecks, then have the funding to address them. As usual, these advanced tuning guidelines come last, and are pointed at suggestions for the system. There are other advanced tuning guidelines available for Data Warehousing Tuning <http://www.coreintegration.com/innercore/library/technical/tuningguide.htm>. You can refer to those for questions surrounding your hardware / software resources. Break the mappings out: 1 per target. If necessary, 1 per source per target. Why does this work? Well, eliminating multiple targets in a single mapping can greatly increase speed... Basically it's like this: one session per map/target. Each session establishes its own database connection.
Because of the unique database connection, the DBMS server can now handle the insert/update/delete requests in parallel against multiple targets. It also helps to allow each session to be specified for its intended purpose (no longer mixing a data driven session with INSERTS only to a single target). Each session can then be placed into a batch marked "CONCURRENT" if preferences allow. Once this is done, parallelism of mappings and sessions becomes obvious. A study of parallel processing has shown again and again that the operations can sometimes be completed in half the time of their original counterparts merely by streaming them at the same time. With multiple targets in the same mapping, you're telling a single database connection to handle multiple diverse database statements -

sometimes hitting this target, other times hitting that target. Think about it - in this situation it's extremely difficult for Informatica (or any other tool for that matter) to build BULK operations... even though "bulk" is specified in the session. Remember that "BULK" means this is your preference, and that the tool will revert to NORMAL load if it can't provide a BULK operation on a series of consecutive rows. Obviously, data driven mode then forces the tool down several other layers of internal code before the data can actually reach the database. Develop maplets for complex business logic. It appears as if maplets do NOT cause any performance hindrance by themselves. Extensive use of maplets means better, more manageable business logic. The maplets allow you to better break the mappings out. Keep the mappings as simple as possible. Bury complex logic (if you must) into a maplet. If you can avoid complex logic altogether, then that would be the key. The old rule of thumb (common sense) applies here: the straighter the path between two points, the shorter the distance... Translated as: the shorter the distance between the source qualifier and the target, the faster the data loads. Remember the TIMING is affected by READER/TRANSFORMER/WRITER threads. With complex mappings, don't forget that each ELEMENT (field) must be weighed - in this light a firm understanding of how to read the performance statistics generated by Informatica becomes important. In other words, if the reader is slow, then the rest of the threads suffer; if the writer is slow, same effect. A pipe is only as big as its smallest diameter... a chain is only as strong as its weakest link. Sorry for the metaphors, but it should make sense. Change Network Packet Size (for Sybase, MS-SQL Server & Oracle users). Maximum network packet size is a database-wide setting, which usually defaults to 512 bytes or 1024 bytes. Setting the maximum database packet size doesn't necessarily hurt any of the other users; it does, however, allow the Informatica database connection to make use of the larger packet sizes and thus transfer more data in a single packet, faster. The typical 'best' settings are between 10k and 20k. In Oracle, you'll need to adjust the Listener.ORA and TNSNames.ORA files. Include the parameters SDU and TDU. SDU = Service Layer Data Buffer Size (in bytes), TDU = Transport Layer Data Buffer Size (in bytes). The SDU and TDU should be set equally. See the Informatica FAQ page for more information on setting these up. Change to an IPC Database Connection for a Local Oracle Database. If PMServer and Oracle are running on the same server, use an IPC connection instead of a TCP/IP connection. Change the protocol in the TNSNames.ORA and Listener.ORA files, and restart the listener on the server. Be careful - this protocol can only be used locally; however, the speed increase from using Inter Process Communication can be between 2x and 6x. IPC is utilized by Oracle, but is defined as a Unix System 5 standard specification. You can find more information on IPC by reading about it in Unix System 5 manuals. Change Database Priorities for the PMServer Database User. Prioritizing the database login that any of the connections use (set up in Server Manager) can assist in changing the priority given to the Informatica executing tasks. These tasks, when logged in to the database, can then over-ride others. Sizing memory for these tasks (in shared global areas, and server settings) must be done if priorities are to be changed.
If BCP or SQL*Loader or some other bulk-load facility is utilized, these priorities must also be set. This can greatly improve performance. Again, it's only suggested as a last resort method, and doesn't substitute for tuning the database, or the mapping processes. It should only be utilized when all other methods have been exhausted (tuned). Keep in mind that this should only be relegated to the production machines, and only in certain instances where the Load cycle that Informatica is utilizing is NOT impeding other users. Change the Unix User Priority. In order to gain speed, the Informatica Unix User must be given a higher priority. The Unix SA should understand what it takes to rank the Unix logins,

and grant priorities to particular tasks. Or simply have the pmserver executed under a super user (SU) command; this will take care of reprioritizing Informatica's core process. This should only be used as a last resort - once all other tuning avenues have been exhausted, or if you have a dedicated Unix machine on which Informatica is running. Try not to load across the network. If at all possible, try to co-locate the PMServer executable with a local database. Not having the database local means: 1) the repository is across the network (slow); 2) the sources / targets are across the network, also potentially slow. If you have to load across the network, at least try to localize the repository on a database instance on the same machine as the server. The other thing is: try to co-locate the two machines (PMServer and the target database server) on the same sub-net, even the same hub if possible. This eliminates unnecessary routing of packets all over the network. Having a localized database also allows you to set up a target table locally, which you can then "dump" following a load, FTP to the target server, and bulk-load into the target table. This works extremely well for situations where an append or complete refresh is taking place. Set Session Shared Memory Settings between 12MB and 24MB. Typically I've seen folks attempt to assign a session large heaps of memory (in hopes it will increase speed). All it tends to do is slow down the processing. See the memory layout document <http://www.coreintegration.com/innercore/library/technical/memorylayout.htm> for further information on how this affects Informatica and its memory handling, and why simply giving it more memory doesn't necessarily provide speed. Set Shared Buffer Block Size around 128k. Again, something that's covered in the memory layout document. This seems to be a "sweet spot" for handling blocks of rows inside the Informatica process. MEMORY SETTINGS: The settings above are for an average configured machine; any machine with less than 10 GIGs of RAM should abide by the above settings. If you've got 12+ GIGs and you're running only 1 to 3 sessions concurrently, go ahead and specify the Session Shared Memory size at 1 or 2 GIGs. Keep in mind that the Shared Buffer Block Size should be set in relative size to the Shared Memory setting: if you set Shared Memory to 124 MB, set the Buffer Block Size to 12MB, and keep them in relative sizes. If you don't, the result will be more memory "handling" going on in the background, so less actual work will be done by Informatica. Also, this holds true for the simpler mappings. The more complex the mapping, the less likely you are to see a gain by increasing either the buffer block size or the shared memory settings - because Informatica potentially has to process cells (ports/fields/values) inside of a huge memory block, thus resulting in a potential re-allocation of the whole block. Use SNAPSHOTS with your Database. If you have dedicated lines, DS3/T1, etc... between servers, use a snapshot or Advanced Replication to get data out of the source systems and into a staging table (a duplicate of the source). Then schedule the snapshot before running processes. The RDBMS servers are built for this kind of data transfer and have optimizations built into the core to transfer data incrementally, or as a whole refresh. It may be to your advantage, particularly if your sources contain 13 million+ rows.
Have the Informatica processes read from the snapshot; at that point you can index any way you like and increase the throughput speed without affecting the source systems. Yes, snapshots only work if your sources are homogeneous to your targets (on the same type of system). INCREASE THE DISK SPEED. One of the most common fallacies is that a Data Warehouse RDBMS needs only 2 controllers and 13 disks to survive. This is fine if you're running less than 5 million rows total through your system, or your load window exceeds 5 hours. I recommend at least 4 to 6 controllers, and at least 50 disks, set on a Raid 0+1 array, spinning at 7200 RPM or better. If it's necessary, plunk the money down and go get an EMC device. You should see a significant increase in performance after installing or upgrading to such a configuration.

Switch to Raid 0+1. Raid Level 5 is great for redundancy, horrible for Data Warehouse performance, particularly on bulk loads. Raid 0+1 is the preferred method for data warehouses out there, and most folks find that the replication is just as safe as Raid 5, particularly since the hardware is now nearly all hot-swappable, and the software to manage this has improved greatly. Upgrade your Hardware. On your production box, if you want gigabytes per second of throughput, or you want to create 10 indexes in 4 hours on 34 million rows, then add CPU power, RAM, and the disk modifications discussed above. A 4 CPU machine just won't cut the mustard today for this size of operation. I recommend a minimum of 8 CPUs as a starter box, and increase to 12 as necessary. Again, this is for huge Data Warehousing systems - GIGs per hour / MB per hour. A box with 4 CPUs is great for development, or for smaller systems (totalling less than 5 million rows in the warehouse). However, keep in mind that bus speed is also a huge factor here. I've heard of a 4 CPU Dec-Alpha system outperforming a 6 CPU system... So what's the bottom line? Disk RPMs, bus speed, RAM, and number of CPUs - I'd say potentially in that order. Both Oracle and Sybase perform extremely well when given 6+ CPUs and 8 or 12 GIGs of RAM, set up on an EMC device at 7200 RPM with a minimum of 4 controllers.
