To handle files that don't arrive, I check the SYS_FILE table to see if there is an entry for that file for the day. If there is no entry, I add a row to the SYS_FILE table to record the fact that no file arrived, and then send an email to the person responsible for that file to let them know that the file did not arrive.
Here is the basic script:
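In outline it looks something like the sketch below (the LOAD_DATE and STATUS columns, the $G_LoadDate variable and the recipient address are assumptions for illustration; FILE_NAME, FILE_LOCATION and AUDIT_ID match the columns used elsewhere in this post):

# Has any matching file been registered for today's load?
$L_FileCount = sql('Datawarehouse', 'SELECT COUNT(*) FROM SYS_FILE WHERE FILE_NAME like {$L_FileName} AND FILE_LOCATION = {$P_Directory} AND LOAD_DATE = {$G_LoadDate}');

if ($L_FileCount = 0)
begin
    # Record the fact that no file arrived
    sql('Datawarehouse', 'INSERT INTO SYS_FILE (AUDIT_ID, FILE_LOCATION, FILE_NAME, STATUS) VALUES ([$G_AuditID], {$P_Directory}, {$L_FileName}, \'NOT RECEIVED\')');

    # Let the person responsible for the file know it never showed up
    # smtp_to(recipients, subject, message, error log lines, trace log lines)
    smtp_to('file.owner@example.com', 'File not received: ' || $L_FileName, 'No file matching ' || $L_FileName || ' arrived in ' || $P_Directory || ' today.', 0, 0);
end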
The two error types you want to look out for are:
1. Flat file processing errors (1004)
2. File access errors (1005)
If an error does occur I run the following in the script:
if (db_type('Datawarehouse') = 'Oracle')
begin
    $L_FileName = replace_substr($P_FileName, '*', '%');
end

sql('Datawarehouse', 'UPDATE SYS_FILE SET STATUS = \'FAILED\' WHERE FILE_LOCATION = {$P_Directory} AND FILE_NAME like {$L_FileName} AND AUDIT_ID = [$G_AuditID]');
We've already reviewed how the FileMove function works.
Lastly, I clear out the staging table so that no partially loaded data goes through the rest of the load. It would also be a good idea to send out an email alerting the owner of the file that there was something wrong with it, so that they can take remedial action before resending the file for processing.
You could always expand the number of columns in the SYS_FILE table to store all sorts of additional
metadata about the file such as its source, how it was delivered or who is responsible for it.
Now that I have created a record in the SYS_FILE table for a file, how do I link that to the records that are
coming through from the file?
First of all you need to go into your file object and set the source information on the file as follows:
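The key setting is the option in the file format's source information that adds the source file name as a column to every row read from the file (the generated column is usually named DI_FILENAME, though you can rename it). Once each record carries the name of the file it came from, it can be matched back to SYS_FILE.FILE_NAME. As a rough illustration, with a hypothetical staging table STG_SALES loaded from the file, the relationship can be queried as:

SELECT f.FILE_NAME, COUNT(*)
  FROM SYS_FILE f
  JOIN STG_SALES s
    ON s.DI_FILENAME = f.FILE_NAME
 GROUP BY f.FILE_NAME;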
So now that we have our file and have processed it, we need to decide what to do with the file. I'll be
covering archiving and updating the SYS_FILE table to show the completed files in the next post.
if ($L_FileFound = 1)
begin
    # Register each matched file name in the SYS_FILE table
    $L_Counter = 1;
    while ($L_Counter <= $L_FileListSize)
    begin
        FileNew($P_FileDirectory, word_ext($L_FileNames, $L_Counter, ','));
        $L_Counter = $L_Counter + 1;
    end
end

Return $L_FileFound;
The above function first waits for the file(s) to arrive, and then writes a record into a file handling table using the FileNew function.
I am using the wait_for_file function to determine when the file arrives.
The return values from this function are:
0 - No file matched
1 - At least one file was matched
-1 - The function timed out with no files found
-2 - One of the input values is illegal
I'm generally only interested in whether a file has been found, i.e. if the value 1 is returned.
The first few parameters are fairly straightforward.
Firstly it needs to know the name and location of the file you are waiting for. This can contain a wildcard, so if you are waiting for any file that starts with the letters 'file', you can set that value to file*.txt. If you are not certain of the extension you can also use file*.*, and if you don't care what the file name is at all, as long as any file arrives, you can set the value to *.*.
The next parameter is how long you would like Data Services to wait for the file to arrive, the timeout parameter. This is set in milliseconds, so if you want Data Services to wait 30 minutes for the file to arrive, the value should be 30 * 60 (seconds) * 1,000 (milliseconds), which gives 1,800,000. If the timeout duration expires, 30 minutes in this example, then the wait_for_file function will return the value -1. This means that it looked for the file for 30 minutes, but no file arrived.
The 3rd parameter is how often you want Data Services to check whether the file has arrived. It's the same formula for setting the value, so if you want it to have a look every 5 minutes, it's 5 * 60 * 1,000 to get 300,000.
The next 4 parameters are all about returning the names of the files that Data Services finds.
In this example I have -1 set for the max match parameter. This means that I want DS to return the
names of all the matched files that it finds. You could set this to 0 if you don't want any of them, or any
other positive number if you only want a specific number of file names returned.
The next parameter is an output parameter that will store the list of file names returned. So let's say you set the 1st parameter in the function to file*.txt, and there are 3 files in the directory: file1.txt, file2.txt and file3.txt. This variable will hold all three of those file names.
The next parameter will return the number of files found that match the search pattern. So again if you're
looking for file*.txt, and 3 files are found that match file*.txt, then this output parameter will return the
value 3.
The final parameter in the function allows you to set the list separator for the list of files. In this example I set it to be a comma, so the variable I have above called $L_FileNames will end up with the value file1.txt,file2.txt,file3.txt.
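Putting those parameters together, the call inside IsFileAvailable might look something like the sketch below (the directory variable and file pattern are assumptions; the output variables match the loop shown earlier):

# Wait up to 30 minutes, polling every 5 minutes, for any file matching the pattern
$L_FileFound = wait_for_file($P_FileDirectory || 'file*.txt',   # file name and location, wildcards allowed
                             1800000,                            # timeout: 30 * 60 * 1,000 milliseconds
                             300000,                             # poll interval: 5 * 60 * 1,000 milliseconds
                             -1,                                 # max match: -1 returns all matched file names
                             $L_FileNames,                       # output: list of matched file names
                             $L_FileListSize,                    # output: number of files matched
                             ',');                               # list separator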
The next part of the IsFileAvailable function loops through the list of file names and calls another function
I have written called FileNew for each of the file name values in the list. The purpose of the FileNew
function is to write a record into my SYS_FILE table for each file found.
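As a rough sketch, FileNew might do little more than insert a row (the STATUS value and exact column list are assumptions; FILE_LOCATION, FILE_NAME and AUDIT_ID match the UPDATE statement used in the error handling script above):

# FileNew($P_FileDirectory, $P_FileName): record a newly arrived file in SYS_FILE
sql('Datawarehouse', 'INSERT INTO SYS_FILE (AUDIT_ID, FILE_LOCATION, FILE_NAME, STATUS) VALUES ([$G_AuditID], {$P_FileDirectory}, {$P_FileName}, \'RECEIVED\')');
Return 1;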
I'll be going through the purpose of the SYS_FILE table and how you can use it to tie up data in the target
table to the source files in my next post.
Labels: Data Services, files, SAP
Monday, October 20, 2014
How do you relate data in your target tables with the original file?
So firstly I used a conditional to check whether a file has arrived or not. I've created my own function, IsFileAvailable, to look out for the file arrival. Yes, I could have used the built-in function wait_for_file, but there is additional functionality built into IsFileAvailable that you might find useful.
If the file arrives I print that out to the trace file. You don't have to do this, but I just find it easier to see what's going on that way.
Then I place the data flow that will process the file within a try-catch block. This is so that I can handle any file errors without bringing down the entire job. Within that error handler I can report the file errors to the owner(s) of the file and move the file to an error file location.
In the else section of the conditional I place some code to handle the file not arriving.
Over the next few posts I'll break out the detail of how each of the above pieces works.
This helps you with two things, one obvious, one not so obvious.
The first thing it helps you with is a little bit of a performance boost. If you think about it, you only need to compare against the current record of a given dimension value. Setting the generated key column makes sure the comparison only takes place against the most recent row, but the table comparison transform still has to go through all the records to get there. If you filter only for the current records, then you can compare against a smaller record set and get through it just a little bit quicker.
Imagine you have roughly ten rows of history per dimension value; filtering can result in a record set one tenth the size of what you had previously. It may not get you that much of a performance gain, but when you are trying to get your jobs down to near real time, every second counts.
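For a type 2 dimension, that filter can be as simple as something like the following (this assumes the dimension carries a current-record indicator column, which is an assumption here):

CURRENT_FLAG = 'Y'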
The second thing this helped with is dealing with expired historical data that becomes current again. Let me give you an example.
This company had an asset team that existed between 1-JAN-2013 and 1-JUL-2013. So I ended up with
a record in the asset team table that looked something like:
I hadn't actually encountered this scenario while working in previous versions of Data Services, but if I had, I think the best workaround would have been to create a view that filters for current records, and then use that as the comparison data set.
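A sketch of such a view, assuming the dimension uses a current-record flag (the table and column names are illustrative):

CREATE VIEW DIM_ASSET_TEAM_CURRENT AS
SELECT *
  FROM DIM_ASSET_TEAM
 WHERE CURRENT_FLAG = 'Y';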
Labels: Data Services, history preservation, Performance, SAP, table comparison
Monday, May 5, 2014
And this query is hardly complex. It wouldn't have taken very long to develop as a proper data flow with the Query transform, and now some future developer could make changes to the job thinking that they have everything covered, without realizing that 3 extra tables were hidden inside an SQL transform.
I can't tell you how many times I've needed to make a change, right-clicked on the table in the datastore and chosen View Where Used, made all my changes, only to later discover I've missed one that a developer hid in a SQL transform.
Squinting at code
One of the great things about using a GUI-based ETL tool is that you can open a data flow and immediately get an idea of what it is doing. You can see the tables on the screen, see which transforms they flow through, and understand what is happening to the data and where it is going. With a SQL transform you have to open it up and squint at the code to try to figure out what it is up to.
For simple SQL that's not a big deal, but a complicated query with plenty of tables and joins.... well now
you're wasting my time, and my client's money too!
Should you really never use the SQL transform?
I worked at a company where they had a ban on using the SQL transform. Turn in your code with one in it, and it would get returned to you to be rewritten without the SQL transform. No exceptions.
I will admit, there are times when you will just have to use the SQL transform, but these should be rare
exceptions. Sometimes you need to do something very complex, that's maybe only available on the
database, so you have to use the SQL transform to take advantage of it.
Before you do it though, think really hard about whether you couldn't achieve the same thing using Data Services' built-in transforms. Even if you have to split it out over multiple data flows, it will still be better than using the SQL transform.
Labels: Data Services, SAP, SQL transform
Monday, April 28, 2014
When I first learnt how to use Data Services, I knew that I was supposed to set the Generated Key Column in the table comparison transform, but I never really thought about why I was supposed to do that.
So let's take a look at the example above. In this example I have a dimension for Customer with a data
warehouse key of CUSTOMER_KEY. This is my generated key column. I also have a field for the natural
key that comes from the source system and this is CUSTOMER_NK.
I have been maintaining history on this table, so individual CUSTOMER_NK values can appear in there
multiple times. Once for each row of history maintained. So if a customer has lived in more than one city,
there will be a row in the customer table for each city the customer has lived in.
So data in the table might look something like this:
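(A reconstructed illustration; the name and city values are made up for the example.)

CUSTOMER_KEY  CUSTOMER_NK  CUSTOMER_NAME  CITY
1             A            John Smith     Houston
2             A            John Smith     Dallas
3             A            John Smith     Austin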
If a new row for John Smith arrives, let's say he moves to San Antonio, how will Data Services know which of the 3 rows with a CUSTOMER_NK of 'A' to compare against? This is where the Generated Key
Column comes in. Data Services looks for the row with the maximum generated key column value and
just compares against that row. So for John Smith, DS will only do the compare against the row with
CUSTOMER_KEY = 3.
If you didn't set the generated key column value, then DS would compare against all 3 rows for John
Smith instead of just the one with the maximum CUSTOMER_KEY.
The history preservation transform takes the single update row from the table comparison transform, generates an update to close off the old row, and issues an insert for the new row for John's new address in San Antonio.
If you are not using history preservation then the Generated Key Column is not that important, but if you are doing type 2 history, then it's essential that you set it for the history preservation to work correctly.
This typically only has an effect if the join is being done on the Data Services job server, in other words, when pushdown SQL is not happening. For a full and detailed explanation of how this works, consult the Performance Optimization Guide for DS.
Array fetch size
Array fetch size is also set on the source table. If you have a powerful set of hardware in place, try raising this value and see what effect it has. Interestingly, even the DS Performance Optimization Guide just recommends increasing and decreasing the value to see what effect it has, and going with the value that seems to give you the best performance.
I ran the job once with PRE_LOAD_CACHE to see how long it would take:
As you can see, the performance improvement in this instance is dramatic: 197 seconds vs just 5 seconds.
PRE_LOAD_CACHE - You should use this option when you anticipate accessing a large number of rows
in the lookup table.
I needed to use the same 3 million row table as a lookup table again, but this time the source table had
161,280 rows.
Row-by-row select
Sorted input
Use row-by-row select when you have very few rows to compare in relation to the size of your target
table. Doing the individual comparisons for a few dozen rows will be significantly quicker than a full table
comparison against a large table.
But, if you have a lot of rows to compare in relation to the size of the table, then don't use this method as
it will take significantly longer.
If you have a large number of rows to compare with the target table, then use either cached comparison table or sorted input. In general I find that sorted input is preferable because it seems to be quicker, and it doesn't require as much memory to work. When it is not possible to sort the comparison data set, that leaves you with cached comparison table as your only option.
You may have noticed that the table comparison transform now has a place where you can add a filter to the rows from the target table. Let's say you know that you are only getting sales figures for 2014 coming through, and therefore you only need to compare against data in the table for 2014; add that condition to the filter box on the transform.
Now you'll only be comparing against the 2014 subset of the data. If you have many years of data in your
target table this can also result in a massive increase in performance.
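For example, assuming the target is a sales table with a SALE_DATE column (both names are illustrative), the filter could be as simple as:

SALE_DATE >= '01-JAN-2014'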
A word of caution though: be very sure that you are applying the filter correctly, because if you filter too much out of the target table, you may get false inserts.
Sometimes, even when you have chosen the appropriate comparison method, the time to process all the rows is still too slow. In this situation I put an extra step in the process to reduce the number of rows I need to compare against, and this can often give me as much as a 90% improvement in performance.
To do this method you need to have 3 data flows, as opposed to just the one you'd normally need, but
even though you'll now have 3 data flows, the total process will still be much quicker.
Let's take the example of where you are trying to load a sales fact table. The sales fact table has over one hundred million rows in it, and you are trying to add another ten thousand. The table comparison is taking ages as DS is comparing ten thousand rows to one hundred million rows in the fact table.
Step 1 - Create a data flow where you do all the processing you need to do to load your target table, but
instead of loading your ultimate target table, just load the data to a new table. I often prefix these tables
with DT (for Data Transfer). This is the temporary holding place for the data you will ultimately load into
the FACT_SALES table.
Step 2 - Create a comparison data set. To do this, join the DT table to the actual target table and load the
common rows into another table we will call the comparison table. So here I will join FACT_SALES to
DT_FACT_SALES and load all the matching rows from FACT_SALES into the
COMPARE_FACT_SALES table.
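Logically, the comparison data set built in this step is equivalent to something like the following (the join keys are assumptions; in practice this would be built as its own data flow so the join can be pushed down to the database):

INSERT INTO COMPARE_FACT_SALES
SELECT f.*
  FROM FACT_SALES f
  JOIN DT_FACT_SALES d
    ON d.CUSTOMER_KEY = f.CUSTOMER_KEY
   AND d.PRODUCT_KEY  = f.PRODUCT_KEY
   AND d.DATE_KEY     = f.DATE_KEY;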
Step 3 - Now use the COMPARE_FACT_SALES table in your table comparison as the compare table
instead of the FACT_SALES table. Instead of comparing against all one hundred million rows, the table
comparison will now only need to compare against no more than ten thousand rows. When it finds a row
it will do an update, and where it doesn't, well that will be an insert. Then load what it finds into your fact
table.
This method is best used when the number of incoming rows is significantly smaller than the size of the
target table. The smaller the number of input rows in comparison to the size of the target table, the
greater the performance improvement.
On the other hand, if you actually need to compare your input with most of the rows in the target table, then it could actually be slower to use this method, as you have to build the time it takes to create the DT and COMPARE tables into the process.
When used appropriately, this method can make a massive difference to the overall performance of your job.
Note that when bulk loading is enabled, Data Services will not generate a full pushdown; the data is read onto the job server and pushed over to the target, regardless of whether the source and target are from the same datastore or not.
To use bulk loading double click on the target table in your data flow, and then click on the Bulk Loader
Options tab.
Under the bulk load option, change it from None, and then continue to set the rest of the settings to achieve the best performance for the database platform you are using. In this case I was using Oracle; the options on this tab will differ depending on which database you are using.
You may need to work with your DBA to make sure that your database user has sufficient privileges to run bulk loading.
I have found that using bulk loading can have up to a 50% improvement in speed, which is especially
noticeable when you are loading large tables.
Labels: bulk loading, Data Services, Optimization, Performance, SAP
Monday, March 31, 2014
A window will then open showing you the actual SQL that will be run against the source database.
So what are some of the techniques you can use to make sure that as much as possible is being pushed
down to the database?
Well first of all, don't create unnecessary datastores. I've often seen people create datastores for different
groups of tables that all come from the same database. When you do this DS thinks the tables are in
different databases and doesn't realize that it can join them together on the source.
If you are using Oracle, just create one datastore per database; that way you can take advantage of cross-schema joins.
Also consider using schemas in SQL Server to logically separate out pieces of your database, such as having an stg and a dm schema instead of placing them in separate databases.
Anything which enables DS to contain as many of your tables in one datastore as possible will improve its ability to push SQL down to the database.
Within your datastores you can also tell DS that you have linked databases set up; this way DS can take advantage of any linked server you have set up as well. Just go to your datastore, click Edit, and the last field under Advanced is where you tell it that you have a linked server set up to another datastore.
Also look at how you are building your dataflows. If you have a long complex dataflow that joins lots of
stuff together, then does a bunch of processing, and then joins in another large table; consider splitting
this out into two dataflows. Send the data back to the database mid-stream and then pick it up again.
Often it turns out that having a few smaller dataflows that can be pushed down to the database runs a lot
faster than having one large complex dataflow where DS is pulling all the data onto the server.
Now that you've got as much as possible running on the database, you can take full advantage of all
those indexes and partitions my interviewees were telling me about in part one!
Labels: Data Services, Optimization, Performance, pushdown, SAP, sql
Thursday, March 27, 2014
This is one of the most pervasive myths out there and it just isn't true. In the 15 years I have been working with SAP BI, I have only once used SAP's ERP system as a data source. In fact SAP BI can connect with over 160 different data sources, from big data solutions like Hadoop, through traditional databases like Oracle and SQL Server, right down to a basic Excel spreadsheet. In fact one of its greatest strengths is that it can connect to so many different data sources, and offers the ability to integrate those sources without having to do any ETL.
2.
For years the primary tool of Business Objects, which SAP bought in 2007, was a report-writing application with excellent ad-hoc and highly formatted reporting capabilities, and that tool still exists today in the form of Web Intelligence. But SAP BI is so much more than Web Intelligence. It offers data discovery in the form of Explorer; highly formatted reports built with Crystal; its dashboard builder is world class; and it is incredibly easy to deploy any and all content to mobile. All of the reporting tools sit on top of a unified and consistent data layer, allowing the same interface to be used for data access across the entire tool-set. Throw in SAP BI's ETL and data quality tool, Data Services, and its impact and lineage capabilities in the form of Information Steward, and you have a complete end-to-end BI and data management solution.
3.
SAP has offered the Edge version of its BI tools for a number of years now. This provides a very cost-effective solution for small to medium companies, while at the same time providing something that is very scalable. Some tools, like Lumira, are even free and come with cloud storage and access.
4.
SAP BI comes with Live Office, which allows you to share any report table or graph within Outlook, Excel, Word or PowerPoint. You can also create a query directly in Excel against a Universe. A SharePoint integration kit is also available for full integration with SharePoint.
5.
Pre-built cloud solutions are available, making it very easy to get an installation up and running. A standard installation can be against a single server running a Windows OS. Having said that, a big advantage of SAP BI is that you can also install it in a clustered environment running on your choice of OS, from Windows through to your favorite Unix/Linux flavor. And don't forget that Lumira Cloud is ready to go and free; you just need to create an account.