
SAP APO MONITORING

HANDBOOK
Module: SAP BASIS

PREPARED BY:
ASHU BAREJA

1-SAP APO OVERVIEW


2-LIVECACHE MONITORING

System Messages
Data Cache Monitoring
SAP liveCache Volume Fill Level
LiveCache Heap Monitoring
Monitoring of Devspaces
Monitoring of COM Routines
Monitoring of Collision Rate
Core Interface Monitoring

3- Monitoring of Optimizers

APO OVERVIEW:
The SAP Advanced Planner and Optimizer (APO) is the planning component of mySAP
SCM, the Supply Chain Management solution provided by SAP. SAP APO is used to
make strategic, tactical, and operational decisions for your organization.
Execution functions, such as confirmations, goods receipt, purchasing, and so on, are
performed in the SAP R/3 OLTP system, which contains all functionality for Materials
Management (MM), Sales and Distribution (SD), Production Order Processing (PP-SFC),
Logistics Execution (LES), and Controlling (CO).
The online transaction processing (OLTP) system, i.e. the SAP R/3 system, also supplies
the relevant planning data (master data and transaction data) for the APO system.
Products are planned in the APO system, and the planning results are transferred
back to the OLTP system.

SCM SYSTEM LANDSCAPE

This document's prime objective is the monitoring of mySAP APO systems (referred to as APO
systems from now on). APO systems are distinguished from classic SAP R/3
systems (referred to as R/3 systems from now on) by the presence of SAP liveCache,
the SAP optimizer, and the qRFC interface.

LiveCache Monitoring
LiveCache is the most performance-critical component in an APO system and should be
monitored continuously. The most important steps in liveCache monitoring are the
monitoring of the Data Cache, the liveCache heap, the devspaces, and the COM routines.

A- System Messages
LiveCache system messages are available in transaction LC10 -> liveCache Monitoring
-> Problem Analysis -> Messages -> Kernel -> Currently

This path provides access to the file KNLDIAG, which contains all liveCache
system messages since the last liveCache start, as well as the initialization log. Error
messages can be found under the tab strip Error messages. This information can be
interpreted by experts and should be provided to SAP Support if required.

To prevent the knldiag file from growing without limit during the time the database
spends in the ONLINE operating mode, the knldiag file has a fixed length, which can be set
as a configuration parameter of the database. The system messages are written
cyclically (wrap-around). Therefore, after a long operating time the knldiag file may no
longer contain all system messages. This is another reason why all error messages are
also written to the file knldiag.err.
In contrast to the knldiag file, knldiag.err is not overwritten cyclically or reinitialized
during a restart. It consecutively logs the starting times of the database and any serious
errors. This file is required for error analysis if the knldiag files, which originally
contained the error messages, have already been overwritten.

B- Data Cache Monitoring


The screen Activity Overview (LC10 -> liveCache Monitoring -> Current Status -> Activity
Overview) provides a compact overview of the current values of the most
important parameters, such as Data Cache and heap usage and the hit rates of the caches.
Start the analysis with transaction LC10 -> liveCache Monitoring -> Current Status
-> Memory areas. Press the Refresh button to obtain dynamic information.
Also use the button Restart monitor to reset the monitoring statistics when
performance statistics for a certain time period are needed (for example, for a
transaction trace).

In an optimally configured liveCache, the Data Cache hit rate should be at least 99.8%,
calculated as the number of successful accesses divided by the total number of
accesses, multiplied by 100%.
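As an illustration with hypothetical figures: 4,990,000 successful accesses out of 5,000,000
total accesses correspond to a hit rate of 4,990,000 / 5,000,000 * 100% = 99.8%, which just
meets the guideline.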
Sometimes the reason for a poor Data Cache hit rate is a high number of long-running
versions or transactions. To keep the state of these versions or transactions consistent,
liveCache is forced to store a large number of history pages, which fill the cache and
lead to a roll-out of data and history pages to the data devspaces. To check this, go to
liveCache Monitoring -> Problem Analysis -> Performance -> OMS versions in
transaction LC10 and check for versions older than 4 hours. If there are any, find
out which users started those versions and contact the functional department in
order to optimize the business processes and avoid such long-running versions.

C- SAP liveCache Volume Fill Level


Generally, the Data Cache filling level should be below 100%. A value below 80% is not
critical for system performance. If the Data Cache usage reaches 100% while the hit
rate is below 100%, an extension of the Data Cache may be required. If the value is
between 80% and 100%, the most important characteristic is the ratio between
the number of OMS data pages and history pages, which should normally be around 4:1. With
lower ratios, check for long-running versions as described above and optimize the business
processes to avoid them.
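For illustration, again with hypothetical figures: 800,000 OMS data pages against 200,000
history pages matches the expected 4:1 ratio, whereas 400,000 data pages against 400,000
history pages (1:1) would indicate an unusually high history share and call for the checks
described above.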

If the history proportion is too large, check whether there are any long-running OMS
versions (transactional simulations). When the database filling level exceeds 90%,
transactional simulations are deleted and their history is released by the liveCache garbage
collectors. To avoid this situation, schedule the report /SAPAPO/OM_REORG_DAILY
to run at least once a day. Among other things, this report deletes transactional
simulations older than 8 hours.
Also schedule the report /SAPAPO/OM_DELETE_OLD_SIMSESS to run every 30
minutes.
Another critical point is the garbage collectors. As of liveCache 7.2.5 Build 14, the configuration
parameter _GC_DC_THRESHOLD defines the Data Cache filling level threshold at which the
garbage collectors are started. By default, this parameter is set to 80 (%). As long as the
Data Cache filling level is above the threshold value, the garbage collectors run every 30
seconds and can be monitored under liveCache Console -> Active Tasks.
Performance problems can occur if the garbage collectors are not running.

D- LiveCache Heap Monitoring


The OMS heap is a critical resource in liveCache. It contains copies of OMS objects from
the Data Cache as well as data needed for internal purposes in COM routines. Accesses to OMS
objects in the heap are very fast, several times faster than accesses to data in the global data cache.
The maximum size of the OMS heap is configured by the parameter OMS_HEAP_LIMIT;
the part of the heap memory that can be occupied by one transactional simulation is defined
by the parameters OMS_HEAP_THRESHOLD (in %) and OMS_VERS_THRESHOLD (in
kB).
For ideal system performance, the heap usage of a particular COM routine should
be less than 80% of OMS_HEAP_LIMIT.
If the memory consumption of the COM routines reaches this limit, data from transactional
simulations may be rolled out to the Data Cache, causing much higher access times when
the transactional simulation accesses them again.
LiveCache heap usage can be monitored under LC10 -> liveCache Monitoring ->
Current Status -> Memory areas -> Heap Usage.

LiveCache heap monitoring is particularly important in systems with little main memory, especially
on 32-bit platforms. The configuration parameter OMS_HEAP_COUNT defines the number
of heap segments (allocators). The value Currently Used is the currently used heap memory
in bytes. The value Size indicates the maximum heap memory usage in bytes per segment
since the liveCache server was last restarted. Heap memory can be allocated by the
application until the value Size (summed over all segments) reaches the value defined by
the parameter OMS_HEAP_LIMIT. If OMS_HEAP_LIMIT is set to 0, the value Size can
grow until the physical memory limit is reached. This leads to a server standstill because of
a lack of memory for the operating system. The problem can be solved by a server restart,
and it becomes more critical in the case of a liveCache server with little main memory
(32-bit).
For a 64-bit liveCache, the sum of CACHE_SIZE plus OMS_HEAP_LIMIT should be
less than the virtual memory, which is physical memory plus swap space.
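As an illustrative check with hypothetical figures: on a server with 64 GB of physical memory
and 32 GB of swap space, the virtual memory is 96 GB, so the sum of CACHE_SIZE and
OMS_HEAP_LIMIT should stay below 96 GB.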
NOTE: Detailed information about heap usage by individual versions can be obtained using the
following SQL Studio command:
SELECT oms_version_id, heap_usage, unloaded FROM oms_versions
The number of COM Routine errors due to memory shortage in Heap can be obtained
using another SQL Studio command:
SELECT SUM (OutOfMemoryExceptions) FROM monitor_oms
LiveCache Heap usage can be optimized by tuning the parameters OMS_HEAP_LIMIT,
OMS_HEAP_THRESHOLD and OMS_VERS_THRESHOLD.

E- Monitoring of Devspaces
Data devspaces can be monitored in transaction LC10: choose liveCache Monitoring ->
Current Status -> Memory areas -> Data area.
The most important parameter to be monitored is Used area. If it exceeds 90%,
a new devspace should be added.

Used area is the sum of Permanent used area and Temporary used area. Persistent
OMS objects and history pages are stored as permanent pages. Permanent pages are
available after a restart of liveCache. Only permanent pages are included in checkpoints
and backups. Named consistent views, which were swapped from private OMS cache to
global data cache, are stored in temporary pages. All temporary pages are released after a
restart of liveCache. Remember that Total data space is half the value of the
space configured for data pages that is displayed in the configuration. The other 50%
is reserved for old data pages after a checkpoint, which must remain unchanged until the
next checkpoint.
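For illustration with hypothetical figures: if 200 GB are configured for data pages, Total data
space is displayed as 100 GB; the remaining 100 GB stay reserved for the unchanged pages of
the last checkpoint.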

F- Monitoring of COM Routines


The screen LC10 -> liveCache Monitoring -> Problem Analysis -> Performance
-> OMS Monitor displays information about the COM routines running in the current
liveCache. Monitoring the COM routines may give indications of the reasons for
performance problems caused by the APO application (high runtime, high memory
consumption in the liveCache heap, etc.). Very detailed statistics about COM routine calls
are displayed, and the sort function can be used to display the most expensive procedures.
Usually it is sufficient to monitor the COM routines that are at the top with respect to their
total time consumption. In the case of COM performance problems it is very
useful for SAP Support to have an hourly history of COM routine runtimes.
Unfortunately, this history cannot be collected automatically by SAP tools. As soon as a performance
degradation is recognized, it is therefore recommended that system administrators monitor the COM
routine runtimes in parallel to other actions and save the statistics in local files.

After a COMMIT or ROLLBACK the internal routine Transaction End is called. This
routine performs the copy process from OMS cache (liveCache Heap) to global cache
(Data Cache). The routines FORCE CHECKPOINT and WAIT FOR CHECKPOINT
are called by the APO system when a checkpoint is performed by report
/SAPAPO/OM_CHECKPOINT_WRITE (only relevant for liveCache 7.2).
A high runtime of WAIT FOR CHECKPOINT is normally the result of a wait
situation: WAIT FOR CHECKPOINT has to wait until all changed data in the data cache
has been marked for change.
For each procedure, the number of accesses to liveCache data is displayed. Objects can be
read from the global data cache (LC de-referencing) or from the local OMS cache (OMS
de-referencing) when the objects are already present in the local cache of the consistent
view to which the procedure belongs. Accesses to the local OMS cache are much faster than
accesses to the global data cache because the copying of pages is avoided. Storage operations are
only performed in the local OMS cache of the procedure (OMS store, OMS delete, etc.).
The transfer of objects to the global liveCache data (data pages) is done by the internal
procedure Transaction End, which is called at transaction end (LC store, LC delete, etc.).

G- Monitoring of Collision Rate


LiveCache uses critical regions to protect accesses to internal data structures (Data Cache
administration, catalog access, etc.) against concurrently active user tasks. Generally,
critical regions are only held for a very short time (much less than 1 microsecond) to
reduce the risk of collisions. If the liveCache server faces limited CPU resources, the
operating system may deschedule a liveCache thread that currently holds a critical region. In
this case, the probability increases that other threads will collide on the held region.
Therefore, high collision rates are typical of heavy workloads (CPU, paging) on a liveCache
server.
Collision rate should be monitored under LC10 -> liveCache Monitoring -> Current
Status -> Critical Regions.

Collision rates of over 30% for any region are considered to be critical. During
concurrent CIF transfers, high collision rates on OMSVDIR and CNSTVIEW regions are
common. The reason for this is the creation and destruction of short lived transactional
simulations. Generally, high collision rates on OMSVDIR and CNSTVIEW regions can
be ignored.
Check the OS workload (OS06, OS07). If high region collisions occur repeatedly and liveCache
is not running on a dedicated server, try to move the other components to another server.
If the collisions only occur during concurrent CIF transfers, you can also try to reduce the
number of work processes used for the transfer.

H- Core Interface Monitoring


The Core Interface (CIF) uses queued RFC (qRFC) technology to provide reliable communication
between the APO and R/3 systems. The general performance of an APO system can be affected
significantly by problems occurring during data transmission via CIF. These problems can
affect performance directly (slow transmission because of insufficient resources) or
indirectly (by producing, for example, high database load due to bad ABAP or SQL
performance). Sometimes the termination of queue processing can cause the cancellation of
scheduled background jobs and therefore incomplete data processing.
The most important monitoring points are the installed qRFC versions, the status of the
queue transmission (error messages), and the status of the qRFC tables in the database.
Monitoring of the Queue Status
The queue transmission process can be monitored in transactions SMQ1 (outbound queues)
and SMQ2 (inbound queues). All error messages appearing in these monitors should
be analysed and resolved as soon as possible in order to complete the data transmission
within the necessary time frame and to avoid delays or cancellations of subsequent jobs.
Monitoring of the Outbound Queues
Transaction SMQ1 offers options for monitoring specific queues, specific destinations,
hanging queues, or any combination of these criteria.

The following operation and error statuses can be displayed in the column Status for
each entry:
READY - If you monitor this status for more than 30 minutes, clear with the application
team whether the queue can be activated explicitly. The reason is that the queue was
locked manually via transaction SMQ1 or via a program and then unlocked without being
activated.
RUNNING - If you monitor this status for more than 30 minutes, clear with the
application team whether the queue can be activated explicitly. The reason is that the
work process responsible for sending this LUW has terminated. Note that activating a
queue in status RUNNING may cause an LUW to be executed several times if this LUW is
still being processed in the target system at that time.
EXECUTED - If you monitor this status for more than 30 minutes, clear with the
application team whether the queue can be activated explicitly. The reason is that the
work process responsible for sending this LUW has terminated. In contrast to status
RUNNING, this current LUW has definitely been executed successfully. The qRFC
Manager will automatically delete the LUW already executed and send the next LUW.
SYSLOAD - At the time of the qRFC call, no DIA work processes were free in the
sending system for sending the LUW asynchronously. A batch job for subsequent
sending has already been scheduled.
SYSFAIL - A serious error occurred in the target system while the first LUW of this
queue was executed. The execution was interrupted. When you double-click on this
status, the system displays an error text. You can find additional information on this error
in the corresponding short dump in the target system (ST22). No batch job is scheduled
for repetition and the queue is no longer processed. To solve the problem, information
from the affected application is required. Refer to SAP Note 335162 for the special error
text "connection closed".
CPICERR - During transmission or processing of the first LUW in the target system, a
network or communication error occurred. When you double-click on this status, the
system displays an error text. You can find additional information on this error in the
syslog (SM21). Depending on the definition in transaction SM59 for the destination used,
a batch job is scheduled for subsequent sending. Status CPICERR may also exist
although no communication error has occurred. A qRFC application finds out that a
LUW cannot be processed any further due to a temporary error in the application.
Therefore it calls the RESTART_OF_BACKGROUNDTASK function module to prompt
the qRFC Manager to cancel the execution of this LUW and to repeat this LUW later in
accordance with the specification in transaction SM59. In this case, qRFC simulates a
communication error with the text "Command to tRFC/qRFC: Execute LUW once
again". If this error often occurs, contact the corresponding application team.
STOP - On this queue or a generic queue (for example BASIS_*), a lock has been set
explicitly (via SMQ1 or programs). Note that qRFC itself never locks a queue during processing.
Clear with the application team whether the queue can be activated using transaction SMQ1.
WAITSTOP - The first LUW of this queue has dependencies to other queues, and at least
one of these queues is currently still locked.
WAITING - The first LUW of this queue has dependencies to other queues, and at least
one of these queues contains other LUWs with higher priorities.
NOSENDS - During the qRFC call, the application has determined that the current LUW is
not to be sent immediately. This status is used to debug the execution of an LUW via
transaction SMQ1. Investigate the application calling this qRFC or contact the
appropriate application team to clarify this status, since this is either a programming or a
configuration problem.

WAITUPDA - This status is set if qRFC is called within a transaction that also contains
one or more update functions. As a result of this status, the LUW and thus the queue is
blocked until the update has been completed successfully. If this status takes longer than
a few minutes, check the status of the update or the update requests using transaction
SM13. After a successful retroactive update, the blocked LUW is sent automatically.
Clear with the application team whether the LUWs can be restarted manually in the
WAITUPDA status without a successful retroactive update (via transaction SMQ1 ->
Reset status -> Activate queue). This WAITUPDA problem can be avoided as follows: if both
qRFC calls and update calls occur within one transaction, the qRFC call must be executed
exclusively within the update. In this case, the qRFC LUW is only created after the
update has been completed successfully.
ARETRY - During LUW execution the application has diagnosed a temporary problem
and has prompted the qRFC Manager in the sending system via a specific qRFC call to
schedule a batch job for a repetition on the basis of the definition in transaction SM59.
ANORETRY - During LUW execution, the application has found a serious error and
prompted the qRFC Manager via a specific qRFC call to cancel processing of this LUW.
Information from the affected application is required to solve the problem.
MODIFY - Processing of this queue is temporarily locked as the LUW data is being
modified.
Monitoring of the Inbound Queues
Transaction SMQ2 offers the same monitoring options as SMQ1, with small differences
in the possible statuses. The statuses READY, RUNNING, SYSFAIL, CPICERR, STOP,
WAITSTOP, WAITING, ARETRY, ANORETRY and MODIFY have the same meanings as
for outbound queues, but refer to queue processing instead of queue sending.
NOEXEC - During the qRFC call, the application determines that the current LUW
is not to be processed automatically, even if the queue is registered in the QIN Scheduler
(SMQR). This information is used to debug the execution of an LUW via transaction
SMQ2. Contact the corresponding qRFC application team to clarify this status, since this
is either a programming or a configuration problem.
Monitoring of the qRFC Tables
Where the communication errors described above occur, corresponding entries are
written into the tRFC/qRFC tables in the database. These tables are ARFCSSTATE,
ARFCSDATA, ARFCRSTATE, TRFCQDATA, TRFCQIN, TRFCQOUT and
TRFCQSTATE. They are accessed during qRFC communication. When errors are
not resolved in a reasonable time, or when errors appear much faster than they can be
resolved, the size of these tables grows and performance therefore degrades dramatically.
The number of entries can be checked in transaction SE16 -> Number of entries.
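Where direct database access is available (for example via the SQL console of the underlying
database), the entry counts can also be determined with simple count statements. The following
lines are only a sketch based on the table names listed above; depending on the database, a
schema prefix may be required, and transaction SE16 remains the standard tool:
SELECT COUNT(*) FROM ARFCSSTATE
SELECT COUNT(*) FROM ARFCSDATA
SELECT COUNT(*) FROM TRFCQIN
SELECT COUNT(*) FROM TRFCQOUT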

Monitoring of Optimizers
The main criterion for the optimizer program performance is the runtime of the
appropriate background job. This can be monitored in transaction SM37 or with an
external monitoring tool, if any. Optimizer run statistics are collected in transaction
/SAPAPO/OPT11 History of Optimization Runs.
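As a supplement to SM37, the job runtimes can also be read from the job status table TBTCO,
for example in SE16 or via an SQL query. The following statement is only a sketch: the job name
pattern is a placeholder that must be replaced by the name of the actual optimizer job, and the
field names should be verified in the Data Dictionary (SE11):
SELECT JOBNAME, STRTDATE, STRTTIME, ENDDATE, ENDTIME, STATUS FROM TBTCO WHERE JOBNAME LIKE 'OPTIMIZER%'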

When performance problems are suspected, the main object to be monitored and
analysed is the hardware (RAM, CPU) of the optimizer server. Generally, there is no way
to tune the performance of an optimizer program itself. The most important task for
ensuring high optimizer performance is a proper sizing of the optimizer servers.
Tuning potential can be found in tuning the operating system or in resizing.
