To handle files that don't arrive, I check the SYS_FILE table to see if there is an entry for that file for the day. If there is no entry, I add a row to the SYS_FILE table to record the fact that no file arrived, and then send an email to the person responsible for that file to let them know that the file did not arrive.
Here is the basic script:
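In outline it looks something like the sketch below (the LOAD_DATE and STATUS columns, the $G_LoadDate variable and the recipient address are assumptions for illustration; FILE_NAME, FILE_LOCATION and AUDIT_ID match the columns used elsewhere in this post):

# Has any matching file been registered for today's load?
$L_FileCount = sql('Datawarehouse', 'SELECT COUNT(*) FROM SYS_FILE WHERE FILE_NAME like {$L_FileName} AND FILE_LOCATION = {$P_Directory} AND LOAD_DATE = {$G_LoadDate}');

if ($L_FileCount = 0)
begin
    # Record the fact that no file arrived
    sql('Datawarehouse', 'INSERT INTO SYS_FILE (AUDIT_ID, FILE_LOCATION, FILE_NAME, STATUS) VALUES ([$G_AuditID], {$P_Directory}, {$L_FileName}, \'NOT RECEIVED\')');

    # Let the person responsible for the file know it never showed up
    # smtp_to(recipients, subject, message, error log lines, trace log lines)
    smtp_to('file.owner@example.com', 'File not received: ' || $L_FileName, 'No file matching ' || $L_FileName || ' arrived in ' || $P_Directory || ' today.', 0, 0);
end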
The two error types you want to look out for are:
1. Flat file processing errors (1004)
2. File access errors (1005)
If an error does occur I run the following in the script:
if (db_type('Datawarehouse') = 'Oracle')
begin
    $L_FileName = replace_substr($P_FileName, '*', '%');
end

sql('Datawarehouse', 'UPDATE SYS_FILE SET STATUS = \'FAILED\' WHERE FILE_LOCATION = {$P_Directory} AND FILE_NAME like {$L_FileName} AND AUDIT_ID = [$G_AuditID]');
We've already reviewed how the FileMove function works.
Lastly, I clear out the staging table so that no partially loaded data goes through the rest of the load. It would also be a good idea to send out an email alerting the owner of the file that there was something wrong with it, so that they can take remedial action before resending the file for processing.
You could always expand the number of columns in the SYS_FILE table to store all sorts of additional
metadata about the file such as its source, how it was delivered or who is responsible for it.
Now that I have created a record in the SYS_FILE table for a file, how do I link that to the records that are
coming through from the file?
First of all you need to go into your file object and set the source information on the file as follows:
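The key setting is the option in the file format's source information that adds the source file name as a column to every row read from the file (the generated column is usually named DI_FILENAME, though you can rename it). Once each record carries the name of the file it came from, it can be matched back to SYS_FILE.FILE_NAME. As a rough illustration, with a hypothetical staging table STG_SALES loaded from the file, the relationship can be queried as:

SELECT f.FILE_NAME, COUNT(*)
  FROM SYS_FILE f
  JOIN STG_SALES s
    ON s.DI_FILENAME = f.FILE_NAME
 GROUP BY f.FILE_NAME;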
So now that we have our file and have processed it, we need to decide what to do with the file. I'll be
covering archiving and updating the SYS_FILE table to show the completed files in the next post.
if ($L_FileFound = 1)
begin
    # Register each matched file name in the SYS_FILE table
    $L_Counter = 1;
    while ($L_Counter <= $L_FileListSize)
    begin
        FileNew($P_FileDirectory, word_ext($L_FileNames, $L_Counter, ','));
        $L_Counter = $L_Counter + 1;
    end
end

Return $L_FileFound;
The above function first waits for the file(s) to arrive, and then writes a record into a file handling table using the FileNew function.
I am using the wait_for_file function to determine when the file arrives.
The return values from this function are:
0 - No file matched
1 - At least one file was matched
-1 - The function timed out with no files found
-2 - One of the input values is illegal
I'm generally only interested in whether a file has been found, i.e. if the value 1 is returned.
The first few parameters are fairly straightforward.
Firstly it needs to know the name and location of the file you are waiting for. This can contain a wildcard, so if you are waiting for any file that starts with the letters 'file', you can set that value to file*.txt. If you are not certain of the extension you can also use file*.*, and if you don't care what the file name is at all, as long as any file arrives, you can set the value to *.*.
The next parameter is how long you would like Data Services to wait for the file to arrive, the timeout parameter. This is set in milliseconds, so if you want Data Services to wait 30 minutes for the file to arrive, the value should be 30 * 60 (seconds) * 1,000 (milliseconds), which gives 1,800,000. If the timeout duration expires, 30 minutes in this example, then the wait_for_file function will return the value -1. This means that it looked for the file for 30 minutes, but no file arrived.
The 3rd parameter is how often you want Data Services to check whether the file has arrived. It's the same formula for setting the value, so if you want it to have a look every 5 minutes, it's 5 * 60 * 1,000 to get 300,000.
The next 4 parameters are all about returning the names of the files that Data Services finds.
In this example I have -1 set for the max match parameter. This means that I want DS to return the
names of all the matched files that it finds. You could set this to 0 if you don't want any of them, or any
other positive number if you only want a specific number of file names returned.
The next parameter is an output parameter that will store the list of file names returned. So let's say you set the 1st parameter in the function to file*.txt, and there are 3 files in the directory: file1.txt, file2.txt and file3.txt. This variable will hold all three of those file names.
The next parameter will return the number of files found that match the search pattern. So again if you're
looking for file*.txt, and 3 files are found that match file*.txt, then this output parameter will return the
value 3.
The final parameter in the function allows you to set the list separator for the list of files. In this example I set it to be a comma, so the variable I have above called $L_FileNames will end up with the value file1.txt,file2.txt,file3.txt.
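Putting those parameters together, the call inside IsFileAvailable might look something like the sketch below (the directory variable and file pattern are assumptions; the output variables match the loop shown earlier):

# Wait up to 30 minutes, polling every 5 minutes, for any file matching the pattern
$L_FileFound = wait_for_file($P_FileDirectory || 'file*.txt',   # file name and location, wildcards allowed
                             1800000,                            # timeout: 30 * 60 * 1,000 milliseconds
                             300000,                             # poll interval: 5 * 60 * 1,000 milliseconds
                             -1,                                 # max match: -1 returns all matched file names
                             $L_FileNames,                       # output: list of matched file names
                             $L_FileListSize,                    # output: number of files matched
                             ',');                               # list separator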
The next part of the IsFileAvailable function loops through the list of file names and calls another function
I have written called FileNew for each of the file name values in the list. The purpose of the FileNew
function is to write a record into my SYS_FILE table for each file found.
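As a rough sketch, FileNew might do little more than insert a row (the STATUS value and exact column list are assumptions; FILE_LOCATION, FILE_NAME and AUDIT_ID match the UPDATE statement used in the error handling script above):

# FileNew($P_FileDirectory, $P_FileName): record a newly arrived file in SYS_FILE
sql('Datawarehouse', 'INSERT INTO SYS_FILE (AUDIT_ID, FILE_LOCATION, FILE_NAME, STATUS) VALUES ([$G_AuditID], {$P_FileDirectory}, {$P_FileName}, \'RECEIVED\')');
Return 1;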
I'll be going through the purpose of the SYS_FILE table and how you can use it to tie up data in the target
table to the source files in my next post.
Labels: Data Services, files, SAP
Monday, October 20, 2014
How do you relate data in your target tables with the original file?
So firstly I used a conditional to check whether a file has arrived or not. I've created my own function, IsFileAvailable, to look out for the file arrival. Yes, I could have used the built-in function wait_for_file, but there is additional functionality built into IsFileAvailable that you might find useful.
If the file arrives I print that out to the trace file. You don't have to do this, but I just find it easier to see what's going on that way.
Then I place the data flow that will process the file within a try-catch block. This is so that I can handle any file errors without bringing down the entire job. Within that error handler I can report the file errors to the owner(s) of the file and move the file to an error file location.
In the else section of the conditional I place some code to handle the file not arriving.
Over the next few posts I'll break out the detail of how each of the above pieces works.
This helps you with two things, one obvious, one not so obvious.
The first thing it helps you with is a little bit of a performance boost. If you think about it, you only need to compare against the current record of a given dimension value. Setting the generated key column makes sure the comparison only takes place against the most recent row, but the table comparison transform still has to go through all the records to get there. If you filter only for the current records, then you can compare against a smaller record set and get through it just a little bit quicker.
Imagine you have roughly ten rows of history per dimension value; filtering can result in a record set one tenth the size of what you had previously. It may not get you that much of a performance gain, but when you are trying to get your jobs down to near real time, every second counts.
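For a type 2 dimension, that filter can be as simple as something like the following (this assumes the dimension carries a current-record indicator column, which is an assumption here):

CURRENT_FLAG = 'Y'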
The second thing this helped with is dealing with expired historical data that becomes current again. Let me give you an example.
This company had an asset team that existed between 1-JAN-2013 and 1-JUL-2013. So I ended up with
a record in the asset team table that looked something like:
I hadn't actually encountered this scenario while working in previous versions of Data Services, but if I had, I think the best workaround would have been to create a view that filters for current records, and then use that as the comparison data set.
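A sketch of such a view, assuming the dimension uses a current-record flag (the table and column names are illustrative):

CREATE VIEW DIM_ASSET_TEAM_CURRENT AS
SELECT *
  FROM DIM_ASSET_TEAM
 WHERE CURRENT_FLAG = 'Y';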
Labels: Data Services, history preservation, Performance, SAP, table comparison
Monday, May 5, 2014
And this query is hardly complex. It wouldn't have taken very long to develop as a proper data flow with the Query transform, and now some future developer could make changes to the job thinking that they have everything covered, without realizing that 3 extra tables were hidden inside an SQL transform.
I can't tell you how many times I've needed to make a change, right-clicked on the table in the datastore and chosen View Where Used, made all my changes, only to later discover I've missed one that a developer hid in a SQL transform.
Squinting at code
One of the great things about using a GUI-based ETL tool is that you can open a data flow and immediately get an idea of what it is doing. You can see the tables on the screen, see which transforms they flow through, and understand what is happening to the data and where it is going. With a SQL transform you have to open it up and squint at the code to try to figure out what it is up to.
For simple SQL that's not a big deal, but a complicated query with plenty of tables and joins.... well now
you're wasting my time, and my client's money too!
Should you really never use the SQL transform?
I worked at a company where they had a ban on using the SQL transform. Turn in your code with one in it, and it would get returned to you to be rewritten without the SQL transform. No exceptions.
I will admit, there are times when you will just have to use the SQL transform, but these should be rare
exceptions. Sometimes you need to do something very complex, that's maybe only available on the
database, so you have to use the SQL transform to take advantage of it.
Before you do it though, think really hard about whether you couldn't achieve the same thing using Data Services' built-in transforms. Even if you have to split it out over multiple data flows, it will still be better than using the SQL transform.
Labels: Data Services, SAP, SQL transform
Monday, April 28, 2014
When I first learnt how to use Data Services, I knew that I was supposed to set the Generated Key Column in the table comparison transform, but I never really thought about why I was supposed to do that.
So let's take a look at the example above. In this example I have a dimension for Customer with a data
warehouse key of CUSTOMER_KEY. This is my generated key column. I also have a field for the natural
key that comes from the source system and this is CUSTOMER_NK.
I have been maintaining history on this table, so individual CUSTOMER_NK values can appear in there
multiple times. Once for each row of history maintained. So if a customer has lived in more than one city,
there will be a row in the customer table for each city the customer has lived in.
So data in the table might look something like this:
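(A reconstructed illustration; the name and city values are made up for the example.)

CUSTOMER_KEY  CUSTOMER_NK  CUSTOMER_NAME  CITY
1             A            John Smith     Houston
2             A            John Smith     Dallas
3             A            John Smith     Austin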
If a new row for John Smith arrives, let's say he moves to San Antonio, how will Data Services know which of the 3 rows with a CUSTOMER_NK of 'A' to compare against? This is where the Generated Key
Column comes in. Data Services looks for the row with the maximum generated key column value and
just compares against that row. So for John Smith, DS will only do the compare against the row with
CUSTOMER_KEY = 3.
If you didn't set the generated key column value, then DS would compare against all 3 rows for John
Smith instead of just the one with the maximum CUSTOMER_KEY.
The history preservation transform takes the single update row from the table comparison transform, generates an update to close off the old row, and issues an insert for the new row for John's new address in San Antonio.
If you are not using history preservation then the Generated Key Column is not that important, but if you are doing type 2 history, then it's essential that you set it for the history preservation to work correctly.
This typically only has an effect if the join is being done on the Data Services job server, in other words, when pushdown SQL is not happening. For a full and detailed explanation of how this works, consult the Performance Optimization Guide for DS.
Array fetch size
Array fetch size is also set on the source table. If you have a powerful set of hardware in place, try raising this value and see what effect it has. Interestingly, even the DS Performance Optimization Guide just recommends increasing and decreasing the value to see what effect it has, and going with the value that seems to give you the best performance.
I ran the job once with PRE_LOAD_CACHE to see how long it would take:
As you can see, the performance improvement in this instance is dramatic: 197 seconds vs just 5 seconds.
PRE_LOAD_CACHE - You should use this option when you anticipate accessing a large number of rows
in the lookup table.
I needed to use the same 3 million row table as a lookup table again, but this time the source table had
161,280 rows.
Row-by-row select
Sorted input
Use row-by-row select when you have very few rows to compare in relation to the size of your target
table. Doing the individual comparisons for a few dozen rows will be significantly quicker than a full table
comparison against a large table.
But, if you have a lot of rows to compare in relation to the size of the table, then don't use this method as
it will take significantly longer.
If you have a large number of rows to compare with the target table, then use either cached comparison table or sorted input. In general I find that sorted input is preferable because it seems to be quicker, and it doesn't require as much memory to work. When it is not possible to sort the comparison data set, that leaves you with cached comparison table as your only option.
You may have noticed that the table comparison transform now has a place where you can add a filter to the rows from the target table. Let's say you know that you are only getting sales figures for 2014 coming through, and therefore you only need to compare against data in the table for 2014; add that condition to the filter box on the transform.
Now you'll only be comparing against the 2014 subset of the data. If you have many years of data in your
target table this can also result in a massive increase in performance.
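For example, assuming the target is a sales table with a SALE_DATE column (both names are illustrative), the filter could be as simple as:

SALE_DATE >= '01-JAN-2014'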
A word of caution though: be very sure that you are applying the filter correctly, because if you filter too much out of the target table, you may get false inserts.
Sometimes, even when you have chosen the appropriate comparison method, the time to process all the rows is still too slow. In this situation I put an extra step in the process to reduce the number of rows I need to compare against, and this can often give me as much as a 90% improvement in performance.
To do this method you need to have 3 data flows, as opposed to just the one you'd normally need, but
even though you'll now have 3 data flows, the total process will still be much quicker.
Let's take the example of where you are trying to load a sales fact table. The sales fact table has over one hundred million rows in it, and you are trying to add another ten thousand. The table comparison is taking ages as DS is comparing ten thousand rows to one hundred million rows in the fact table.
Step 1 - Create a data flow where you do all the processing you need to do to load your target table, but
instead of loading your ultimate target table, just load the data to a new table. I often prefix these tables
with DT (for Data Transfer). This is the temporary holding place for the data you will ultimately load into
the FACT_SALES table.
Step 2 - Create a comparison data set. To do this, join the DT table to the actual target table and load the
common rows into another table we will call the comparison table. So here I will join FACT_SALES to
DT_FACT_SALES and load all the matching rows from FACT_SALES into the
COMPARE_FACT_SALES table.
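Logically, the comparison data set built in this step is equivalent to something like the following (the join keys are assumptions; in practice this would be built as its own data flow so the join can be pushed down to the database):

INSERT INTO COMPARE_FACT_SALES
SELECT f.*
  FROM FACT_SALES f
  JOIN DT_FACT_SALES d
    ON d.CUSTOMER_KEY = f.CUSTOMER_KEY
   AND d.PRODUCT_KEY  = f.PRODUCT_KEY
   AND d.DATE_KEY     = f.DATE_KEY;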
Step 3 - Now use the COMPARE_FACT_SALES table in your table comparison as the compare table
instead of the FACT_SALES table. Instead of comparing against all one hundred million rows, the table
comparison will now only need to compare against no more than ten thousand rows. When it finds a row
it will do an update, and where it doesn't, well that will be an insert. Then load what it finds into your fact
table.
This method is best used when the number of incoming rows is significantly smaller than the size of the
target table. The smaller the number of input rows in comparison to the size of the target table, the
greater the performance improvement.
On the other hand, if you actually need to compare your input with most of the rows in the target table, then it could actually be slower to use this method, as you have to build the time it takes to create the DT and COMPARE tables into the process.
When used appropriately, this method can make a massive difference to the overall performance of your job.
Note that when bulk loading is enabled, Data Services will not generate a full pushdown; the data is read onto the job server and pushed over to the target, regardless of whether the source and target are from the same datastore or not.
To use bulk loading double click on the target table in your data flow, and then click on the Bulk Loader
Options tab.
Under the bulk load option, change it from None, and then continue to set the rest of the settings to achieve the best performance for the database platform you are using. In this case I was using Oracle; the options on this tab will differ depending on which database you are using.
You may need to work with your DBA to make sure that your database user has sufficient privileges to run bulk loading.
I have found that using bulk loading can have up to a 50% improvement in speed, which is especially
noticeable when you are loading large tables.
Labels: bulk loading, Data Services, Optimization, Performance, SAP
Monday, March 31, 2014
A window will then open showing you the actual SQL that will be run against the source database.
So what are some of the techniques you can use to make sure that as much as possible is being pushed
down to the database?
Well first of all, don't create unnecessary datastores. I've often seen people create datastores for different
groups of tables that all come from the same database. When you do this DS thinks the tables are in
different databases and doesn't realize that it can join them together on the source.
If you are using Oracle, just create one datastore per database; that way you can take advantage of cross-schema joins.
Also consider using schemas in SQL Server to logically separate out pieces of your database, such as having an stg and a dm schema instead of placing them in separate databases.
Anything which enables DS to contain as many of your tables in one datastore as possible will improve its ability to push SQL down to the database.
Within your datastores you can also tell DS that you have linked databases set up; this way DS can take advantage of any linked server you have set up as well. Just go to your datastore, click Edit, and the last field under Advanced is where you tell it that you have a linked server set up to another datastore.
Also look at how you are building your dataflows. If you have a long complex dataflow that joins lots of
stuff together, then does a bunch of processing, and then joins in another large table; consider splitting
this out into two dataflows. Send the data back to the database mid-stream and then pick it up again.
Often it turns out that having a few smaller dataflows that can be pushed down to the database runs a lot
faster than having one large complex dataflow where DS is pulling all the data onto the server.
Now that you've got as much as possible running on the database, you can take full advantage of all
those indexes and partitions my interviewees were telling me about in part one!
Labels: Data Services, Optimization, Performance, pushdown, SAP, sql
Thursday, March 27, 2014
This is one of the most pervasive myths out there and it just isn't true. In the 15 years I have been working with SAP BI, I have only once used SAP's ERP system as a data source. In fact SAP BI can connect with over 160 different data sources, from big data solutions like Hadoop, through traditional databases like Oracle and SQL Server, right down to a basic Excel spreadsheet. In fact one of its greatest strengths is that it can connect to so many different data sources, and offers the ability to integrate those sources without having to do any ETL.
2.
For years the primary tool of Business Objects, which SAP bought in 2007, was a report-writing application with excellent ad-hoc and highly formatted reporting capabilities, and that tool still exists today in the form of Web Intelligence. But SAP BI is so much more than Web Intelligence. It offers data discovery in the form of Explorer; highly formatted reports built with Crystal; its dashboard builder is world class; and it is incredibly easy to deploy any and all content to mobile. All of the reporting tools sit on top of a unified and consistent data layer, allowing the same interface to be used for data access across the entire tool-set. Throw in SAP BI's ETL and data quality tool, Data Services, and its impact and lineage capabilities in the form of Information Steward, and you have a complete end-to-end BI and data management solution.
3.
SAP has offered the Edge version of its BI tools for a number of years now. This provides a very cost-effective solution for small to medium companies, while at the same time providing something that is very scalable. Some tools, like Lumira, are even free and come with cloud storage and access.
4.
SAP BI comes with Live Office, which allows you to share any report table or graph within Outlook, Excel, Word or PowerPoint. You can also create a query directly in Excel against a Universe. A SharePoint integration kit is also available for full integration with SharePoint.
5.
Pre-built cloud solutions are available, making it very easy to get an installation up and running. A standard installation can be against a single server running a Windows OS. Having said that, a big advantage of SAP BI is that you can also install it in a clustered environment running on your choice of OS, from Windows through to your favorite Unix/Linux flavor. And don't forget that Lumira Cloud is ready to go and free; you just need to create an account.