Sizing SAP BusinessObjects Data Services, Version 4.1
TABLE OF CONTENTS

1 Introduction
4 Initial Sizing for SAP BusinessObjects Data Services Data Quality
    4.1 Assumptions
    4.2 Batch sizing guidelines
    4.3 Transactional sizing guidelines
5 Initial Sizing for SAP BusinessObjects Data Services Text Data Processing
    5.1 Assumptions
    5.2 Batch sizing guidelines
6 Initial Sizing for SAP BusinessObjects Data Services Data Integration Processing
    6.1 Assumptions
    6.2 Batch sizing guidelines
7 Miscellaneous
1 INTRODUCTION
SAP BusinessObjects Data Services delivers a single enterprise-class solution for data integration, data
quality, data profiling, and text data processing that allows you to integrate, transform, improve, and
deliver trusted data to critical business processes. It provides one development UI, metadata
repository, data connectivity layer, run-time environment, and management console, enabling IT
organizations to lower total cost of ownership and accelerate time to value. With SAP BusinessObjects
Data Services, IT organizations can maximize operational efficiency with a single solution to improve
data quality and gain access to heterogeneous sources and applications.
Key data quality capabilities include:
- Data quality dashboards that show the impact of data quality problems on all downstream systems or applications
- The ability to apply data quality transformations to all types of data, regardless of industry or data domain, from structured to unstructured data, and covering customer, product, supplier, and material information
- Intuitive business user interfaces and data quality blueprints to guide you through the process of standardizing, correcting, and matching data to reduce duplicates and identify relationships
- Comprehensive global data quality coverage with support for over 230 countries
- Comprehensive reference data
- Broad, heterogeneous application and system support for both SAP and non-SAP sources and targets
- Prepackaged native integration of data quality best practices for SAP environments
- Optimized developer productivity and application maintenance through intuitive transformations, a centralized business rule repository, and object reuse
- High performance and scalability with software that can meet high-volume needs through parallel processing, grid computing, and bulk data loading support
- Flexible technology deployment options, from an enterprise platform to intuitive APIs that allow developers to deploy data quality functionality quickly
The text data processing capability:
- Analyzes text and automatically identifies and extracts entities, including people, dates, places, organizations, and so on, in multiple languages
- Looks for patterns, activities, events, and relationships among entities and enables their extraction
- Goes beyond conventional character-matching tools for information retrieval, which can only seek exact matches for specific strings, by understanding the semantics of words
- Supports extraction in 31 different languages
- Supports not only text, HTML, and XML but also binary document formats such as PDF and Microsoft Word
- Allows you to specify your own list of entities in a custom dictionary; these dictionaries enable you to store entities and manage name variations, and known entity names can be standardized using a dictionary
- Allows you to write custom rules to customize the extraction output, although pre-defined rules are provided to support sentiment analysis, enterprises, and the public sector
For data integration, the software provides:
- Broad, heterogeneous application and system support for both SAP and non-SAP sources and targets
- High performance and scalability with software that can meet high-volume needs through parallel processing, grid computing, and bulk data loading support
- Easy-to-configure transforms for typically complex tasks such as slowly changing dimensions and hierarchy flattening
- Everything you need to build large jobs, including error handling, dependency handling, and restartability
- Extensive operational statistics
- Rich connectivity to many sources and targets, most using the vendor's native format for maximum performance
- Easy-to-use parallelization and performance optimization options
- Functionality to simplify daily operations and project hand-over, such as a web-based management console, auto-documentation features, and impact and lineage information
More details about the architecture of Data Quality Management and SAP BusinessObjects Data
Services can be found in the SAP BusinessObjects Data Services Administrator's Guide.
The following factors can influence performance and the resulting sizing:
- Access to sources and targets: The bandwidth to the source and target can affect how fast data can be passed through the dataflow.
- Availability of additional RAM: If caching is needed, allocating enough free RAM within the system will speed up the dataflow, not only for caching lookup data but also for the reference data used by Data Quality transforms.
- Configuration and system landscape: This sizing guide was created with SAP BusinessObjects Data Services installed on the target database system. The source RDBMS was located on a separate machine.
- Competing applications: Running multiple resource-intensive applications may cause competition for resources and reduce the throughput of an individual job.
- Operating system: This sizing guide was created with Windows (2003/2008 Server) and Linux (Red Hat 5/6, SUSE 10/11) in mind. Please contact SAP for specifics on sizing for other operating systems.
- Degree of Parallelization (DOP): The DOP setting can greatly influence performance when the appropriate hardware is utilized. Increasing this setting will generally increase throughput.
- Loader method: Depending on the database (and version), the different loader options (regular load, bulk loading, auto-correct load) can show dramatic differences in performance. There is no single best method; each has pros and cons depending on how it was implemented by the database vendor.
- Transactional loaders: Loading data in one transaction means the dataflow cannot use parallel sessions to speed up the loading.
- Lookup and join settings: SAP BusinessObjects Data Services lets the user choose the lookup strategy; if the wrong strategy is used for the amount of data to be processed relative to the size of the lookup table, it can have a severe performance impact (see the sketch after this list).
- Heterogeneous sources or all in one database: If all data is in one database, or a database link exists between the databases, the SAP BusinessObjects Data Services optimizer has more options and can decide to delegate part or all of the processing to the database.
- Document characteristics: The format, length, and density of the input documents impact performance:
  - Format: XML and HTML require de-tagging before the text is processed, which adds overhead compared with processing plain text directly. Additionally, converting a binary document into a textual representation during processing adds overhead.
  - Length: Longer input documents require more processing time.
  - Density: Denser input documents (rich in entities and facts) require more processing time.
- Rule-based extraction: Using one or more rules to customize extraction may require more processing time.
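Why the lookup strategy matters can be illustrated with a minimal sketch. The following plain Python example (not Data Services syntax; the connection object, table, and column names are hypothetical) contrasts caching a small lookup table once in memory with issuing one database query per source row.

```python
# Minimal sketch of the lookup trade-off, assuming a DB-API style
# connection (for example sqlite3). Table and column names are hypothetical.

def cached_lookup(rows, conn):
    # Suitable when the lookup table fits comfortably in free RAM:
    # one query up front, then an in-memory hash probe per source row.
    cache = {code: name for code, name in
             conn.execute("SELECT country_code, country_name FROM country_lu")}
    return [(r["id"], cache.get(r["country_code"], "UNKNOWN")) for r in rows]

def row_by_row_lookup(rows, conn):
    # Avoids holding the lookup table in memory, but pays one database
    # round trip per source row, which dominates at high record volumes.
    out = []
    for r in rows:
        hit = conn.execute(
            "SELECT country_name FROM country_lu WHERE country_code = ?",
            (r["country_code"],),
        ).fetchone()
        out.append((r["id"], hit[0] if hit else "UNKNOWN"))
    return out
```

Choosing between the two depends on the size of the lookup table relative to the available RAM and on the number of source rows, which is exactly the trade-off described above.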
Sizing
Sizing means determining the hardware requirements of an SAP application, such as the physical
memory, CPU processing power, and I/O capacity. The size of the hardware and database is influenced
by both business aspects and technological aspects. This means that the number of users using the
various application components and the data load they put on the server must be taken into account.
Benchmarking
Sizing information can be determined using SAP Standard Application Benchmarks and scalability
tests (www.sap.com/benchmark). Released for technology partners, benchmarks provide basic sizing
recommendations to customers by placing a substantial load upon a system during the testing of new
hardware, system software components, and relational database management systems (RDBMS). All
performance data relevant to the system, user, and business applications are monitored during a
benchmark run and can be used to compare platforms.
Initial Sizing
Initial sizing refers to the sizing approach that provides statements about platform-independent
requirements of the hardware resources necessary for representative, standard delivery SAP
applications. The initial sizing guidelines assume optimal system parameter settings, standard business
scenarios, and so on.
Expert Sizing
This term refers to a sizing exercise where customer-specific data is being analyzed and used to put
more detail on the sizing result. The main objective is to determine the resource consumption of
customized content and applications (not SAP standard delivery) by comprehensive measurements. For
more information, see http://service.sap.com/sizing Sizing Guidelines General Sizing Procedures
Expert Sizing.
4 INITIAL SIZING FOR SAP BUSINESSOBJECTS DATA SERVICES DATA QUALITY

4.1 Assumptions
- There is a mix of data from various regions of the world. For this sizing guide we assume 50% NA, 40% EMEA, and 10% APJ. Major variations from this mix will affect sizing needs.
- Only data quality transforms are considered. Utilizing transforms other than data quality transforms in a job may affect the sizing requirements and performance of the overall job.
- The reference data used for the transforms that require it was located on a local disk, and enough free RAM was allocated to allow the majority of this reference data to be cached.
- Address validation transforms are able to perform certified and non-certified processing of addresses for those countries that provide a certification program. Running with certification mode enabled requires the collection of processing statistics and the use of stricter rules. This sizing guide assumes that address data is not being processed with certification mode enabled.
- The data quality transforms have options to enable and disable the generation of processing statistics for reporting purposes. The sizing in this document assumes that generation of these statistics is disabled.
4.2 Batch sizing guidelines

          Throughput (records per hour)    Memory requirements in GB (per CPU Core)
Small     2 million
Medium    5 million
Large     15 million                       16

          Throughput (records per hour)    Memory requirements in GB (per CPU Core)
Small     2 million
Medium    5 million
Large     15 million                       20

          Throughput (records per hour)    Memory requirements in GB (per CPU Core)
Small     2 million
Medium    5 million                        10
Large     15 million                       24
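As a hedged illustration of how the batch throughput categories above translate into a sizing decision, the following Python sketch derives the required records-per-hour rate from a nightly volume and load window and picks the smallest category that can sustain it. The volume and window used in the example are hypothetical inputs, not figures from this guide.

```python
import math

# Throughput categories taken from the batch sizing tables above (records per hour).
CATEGORIES = {
    "Small": 2_000_000,
    "Medium": 5_000_000,
    "Large": 15_000_000,
}

def pick_category(records_per_night: int, window_hours: float) -> str:
    """Return the smallest category whose throughput covers the required rate."""
    required_per_hour = math.ceil(records_per_night / window_hours)
    for name, throughput in CATEGORIES.items():
        if throughput >= required_per_hour:
            return name
    return "beyond Large: consider scaling out or an expert sizing"

# Hypothetical example: 30 million records in a 4-hour window
# -> 7.5 million records per hour -> "Large"
print(pick_category(30_000_000, 4))
```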
4.3 Transactional sizing guidelines

Average Transaction    Peak Transactional        Number of    Memory requirements
Response Time          Throughput (per hour)     CPU Cores    in GB (per CPU Core)
<50ms                  ~375 thousand             20
<50ms                  ~1.5 million              50
<50ms                  ~3.75 million             20
Based on the numbers above, a general guideline for transactional processing would be that for every 5
concurrent clients, 2 CPU cores are required to maintain the <50ms response time.
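The rule of thumb above is simple arithmetic; the following minimal Python sketch expresses it directly so it can be applied to other client counts. The example client count is hypothetical.

```python
import math

def cores_for_clients(concurrent_clients: int) -> int:
    # Rule of thumb from above: 2 CPU cores per 5 concurrent clients
    # to maintain the <50 ms response time.
    return math.ceil(concurrent_clients / 5) * 2

# Hypothetical example: 42 concurrent clients -> ceil(42 / 5) * 2 = 18 cores
print(cores_for_clients(42))
```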
5 INITIAL SIZING FOR SAP BUSINESSOBJECTS DATA SERVICES TEXT DATA PROCESSING

5.1 Assumptions
- Only text data processing transforms are considered. Utilizing transforms other than text data processing transforms in a job may affect the sizing requirements and performance of the overall job.
- The input data is stored on disk and there is no interaction between processes.
- Multiple input languages are supported, but only English is used.
- Any extraction dictionaries or rules used are stored locally.
5.2 Batch sizing guidelines

85
85
Throughput (MB per hour)    Memory requirements in GB (per CPU Core)
450
1650
2750

Throughput (MB per hour)    Memory requirements in GB (per CPU Core)
160
180
160
720
160
1360

85
85
Throughput (MB per hour)    Memory requirements in GB (per CPU Core)
360
1200
1825
6 INITIAL SIZING FOR SAP BUSINESSOBJECTS DATA SERVICES DATA INTEGRATION PROCESSING

Material Dimension
The material dimension is built from two source tables: item and stock. The item table contains all
100,000 products, and each product has different stock levels per warehouse (200 warehouses). This
results in 20 million stock rows. The idea is to store in the material dimension all item attributes plus the
stock-level information.
The delta load use case is identical to the initial load use case except that the data has to be compared
with the target before loading the changes.
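The denormalization described above can be sketched in plain Python (this is not a Data Services dataflow; the column names and the toy row counts are hypothetical) to show how every item row is combined with its per-warehouse stock level to form the material dimension.

```python
# Sketch of the material dimension build: join item attributes to the
# per-warehouse stock rows. Column names and row counts are hypothetical.
items = [{"item_id": i, "name": f"Product {i}"} for i in range(1, 4)]
stock = [{"item_id": i, "warehouse": w, "qty": 10 * w}
         for i in range(1, 4) for w in range(1, 3)]

by_item = {it["item_id"]: it for it in items}

# One dimension row per (item, warehouse): all item attributes plus the stock level.
material_dim = [
    {**by_item[s["item_id"]], "warehouse": s["warehouse"], "stock_level": s["qty"]}
    for s in stock
]

# With 100,000 items and 200 warehouses this join yields the 20 million
# rows mentioned above; the toy example produces 3 x 2 = 6 rows.
print(len(material_dim))
```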
Table 1
Test Case       Throughput (rows/sec)    CPU Cores/Disks    Memory requirements in GB (per CPU Core)
Initial Load    212,000                  16 core/8 disk
Initial Load    227,000                  8 core/2 disk
Initial Load    204,000                  4 core/1 disk
Delta Load      219,000                  16 core/8 disk
Delta Load      224,000                  8 core/2 disk
Delta Load      202,000                  4 core/1 disk
Table 2
Test Case       Throughput (rows/sec)    CPU Cores/Disks    Memory requirements in GB (per CPU Core)
Initial Load    101,000                  16 core/8 disk
Initial Load    78,000                   8 core/2 disk
Initial Load    56,000                   4 core/1 disk
Delta Load      21,000                   16 core/8 disk
Delta Load      15,000                   8 core/2 disk
Delta Load      15,000                   4 core/1 disk
Fact Load
The fact table load is a typical case where two source tables, the order master and the order line item
tables, have to be joined and then transformed; the most important transformation is the lookup of the
surrogate keys in the dimension tables. Whereas the previous use cases had many attributes per row and
little transformation, this scenario has a narrow table with a large number of rows and many
transformations.
For the delta load, the main difference from the previous cases is that a list of potential changes can be
identified in the source, e.g. by reading rows based on a timestamp.
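The surrogate key lookup that dominates this scenario can be sketched in plain Python (again, not a Data Services dataflow; the table and column names are hypothetical): the order master and line item rows are joined, and each natural key is replaced by a surrogate key from a cached dimension lookup.

```python
# Sketch of the fact load: join order header to line items, then swap
# natural keys for surrogate keys via cached dimension lookups.
# All names and values are hypothetical.
orders = {1001: {"customer": "C-7", "order_date": "2013-05-02"}}
order_items = [{"order_no": 1001, "item_id": 42, "qty": 3}]

# Cached dimension lookups: natural key -> surrogate key
customer_dim = {"C-7": 501}
material_dim = {42: 9001}
date_dim = {"2013-05-02": 20130502}

fact_rows = []
for line in order_items:
    header = orders[line["order_no"]]          # join line item to its order header
    fact_rows.append({
        "customer_sk": customer_dim[header["customer"]],
        "material_sk": material_dim[line["item_id"]],
        "date_sk": date_dim[header["order_date"]],
        "quantity": line["qty"],
    })

print(fact_rows)
```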
Table 3
Test Case       Throughput (rows/sec)    CPU Cores/Disks    Memory requirements in GB (per CPU Core)
Initial Load    200,000                  16 core/8 disk
Initial Load    130,000                  8 core/2 disk
Initial Load    80,000                   4 core/1 disk
Delta Load      1,400,000                16 core/8 disk
Delta Load      1,600,000                8 core/2 disk
Delta Load      1,200,000                4 core/1 disk
7 MISCELLANEOUS
Additional performance-related information can be found at
http://wiki.sdn.sap.com/wiki/display/BOBJ/Performance