
Service Level Agreement Management

Challenge

Service Level Agreements (SLAs) and Operation Level Agreements (OLAs) for an Integration Competency Center (ICC), or for any other information system, may be either formal (with penalties and escalation paths for failure to meet the agreement) or an informal set of expectations. An OLA focuses on the operational availability of the systems on which the ICC depends for data. An SLA focuses on the delivery of processed data for reporting and downstream processing. Some of the challenges around SLA management include:

o Defining the SLA or OLA
o Harmonizing the needs of diverse teams in a shared infrastructure
o Determining the formality/informality of the SLA/OLA and expectations
o Discerning the impact to the organization of a missed SLA/OLA
o Designing and building the systems that allow the SLA/OLA to be achieved
o Setting the escalation path for resolving a missed or unachievable SLA/OLA
o Maintaining Service Levels and Operation Levels for growth

Description
Defining the SLA
Service Level Agreements may be formal, documented, signed agreements or informal understandings, hopes, and expectations regarding the level of service to expect. In the context of Integration Systems, SLAs are centered on the availability of data and processing for acquisition from legacy applications, conversion and load to target systems, and end-user reporting and other dependent applications. Operation Level Agreements may similarly be formal or informal. They differ from SLAs in that they pertain to the performance and availability of the hardware architecture. An uptime percentage would be a typical metric found in an OLA. For simplicity, this document refers to both SLAs and OLAs as SLAs unless clarification is required.

Considerations
In order to define an SLA for a new or existing application system, there needs to be an understanding not only of the application, but also of how it fits into the environment with other systems and applications. For an application for which an SLA is defined, the organization will want to understand:

o Data Volumes
o Growth Projections
o Execution Window Required
o System Resources Required
o Dependencies (Predecessors and Successors)
o Concurrent Processing Conflicts
o Production Schedules
o User Requirements

Be sure to understand these items before formalizing the SLA; otherwise, expectations may be set that cannot be realized. For example, updated data cannot be delivered to a reporting system if the source data is not available for the planned processing. Similarly, if operations downtime is planned for certain hours to perform routine maintenance or backups, processing cannot occur during that window, and downstream users will need to understand these constraints. Alternatively, if the user requirements show a critical business need, the development and support teams might work to see if the scheduling of predecessor events can be adjusted to meet the requirement.

As a side note to adjusting production schedules and outage windows, when introducing new hardware or software with increased performance capabilities into the environment, review the dependency information to determine if there is an opportunity to make adjustments. If shorter execution times or earlier availability of predecessor items are expected, the organization might improve service by guaranteeing earlier delivery.

As applications mature, growth based on increased usage or higher data volumes may begin to erode performance and require longer execution times. Allow for that eventuality in the SLA expectations, or plan to continue meeting the delivery expectations with capacity increases that keep performance constant or open up a bigger execution window.

Similarly, other applications running in a shared architecture can affect an application if they are using the resources it needs for efficient processing. This concurrent processing may cause longer elapsed times for the application. Growth in usage of the shared infrastructure (from adding more users, applications or processing) must be considered when developing the SLA. As these items begin to migrate to the shared environment, review their impact on the SLA.
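The growth consideration above can be made concrete with a small projection. The sketch below is illustrative only: it assumes that run time grows at the same fixed compound monthly rate as data volume, and the function name and all figures are hypothetical rather than taken from a real system.

```python
from typing import Optional


def months_until_window_exceeded(
    current_minutes: float,
    window_minutes: float,
    monthly_growth_rate: float,
) -> Optional[int]:
    """Return the first month in which the projected run time exceeds the window.

    Assumption: run time compounds at `monthly_growth_rate` (0.03 means 3%).
    Returns 0 if the window is already exceeded and None if nothing grows.
    """
    if current_minutes <= 0:
        raise ValueError("current_minutes must be positive")
    if current_minutes > window_minutes:
        return 0
    if monthly_growth_rate <= 0:
        return None
    month = 0
    projected = current_minutes
    # Step forward one month at a time until the window is breached.
    while projected <= window_minutes:
        month += 1
        projected *= 1 + monthly_growth_rate
    return month


# Hypothetical job: 180 minutes today, 240-minute window, 3% growth per month.
print(months_until_window_exceeded(180, 240, 0.03))
```

A result of this kind gives a rough date by which the SLA, the production schedule or the capacity plan needs to be revisited. Real growth is rarely this regular, so treat the projection as a prompt for review rather than a forecast.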
Possible actions include adjusting the SLA, adjusting the production schedule or adding capacity in order to accommodate the increased demand.

The SLA defines the expectations surrounding performance, availability and delivery based on an understanding of the above items and their limitations. Additionally, the SLA often includes measures to take when an event occurs that prevents the SLA from being met. Response times for resolution of the incident may be specified, with several levels of response urgency indicated for more or less business-critical applications. Semi-routine service tasks such as code migrations, new user access requests, password resets or application enhancements may also have an SLA focusing on how soon the team addresses the request. SLAs for problem resolution generally indicate several levels of response, depending on whether there is a workaround for the problem, the number of users affected and the business criticality of the disrupted application.

SLA expectations need to apply not only to those providing the service, but also to those using it. A typical responsibility of an end-user is to report problems or deficiencies promptly and to be available to test and approve the resolution. The end-user also approves outages or reduced services for special maintenance windows for upgrades and promptly notifies the service provider of special requests (including requirements and deadlines), allowing enough lead time to deliver.

The SLA can include the following items:

o Application or system performance expectations
o Availability expectations (percentage uptime, time and days, data refresh schedule, etc.)
o Routine request turn-around
o Special request turn-around
o Problem management response and resolution time
o Change request management timelines, approval processes and completion expectations
o Prioritization based on urgency
o Escalation path
o Off-hours support expectations
o Roles and responsibilities for all parties

Often accompanying the SLA is a Support document that outlines the steps, typical resolutions, contact information and escalation paths to use when a problem incident is raised. This allows the team responsible for resolution to be well prepared to handle urgent issues with no wasted effort.

Relevance of Formal SLAs vs. Informal SLAs


The formality or informality of an SLA depends upon a number of factors:

o Are SLAs being managed and resolved in-house or by a third party organization?
o Are there contractual agreements in place regarding performance, delivery and service levels?
o Are there penalties for failure to deliver or to resolve issues promptly?
o Are there bonuses for better than promised service?
o Is the application business critical?
o Is the time of day or day of delivery business critical?

If the answers to the above considerations are yes, then create a formal written agreement and have it signed by all parties. If no, then the SLA does not need to be specifically documented, but rather might be an informal expectation based on experience or desires. Be aware that even informal agreements become expectations that can disappoint the user if not met consistently.

Measuring Success
In particular for formal SLAs, the organization must be able to measure and quantify what meets or does not meet the agreement as well as improvements or declines in the delivery of the expectations that were established in the SLA. Monitor items such as:

Hardware Metrics
o Uptime
o Failover
o CPU utilization
o Disk utilization
o Memory utilization

Software and Application Metrics
o Delivery of new features to production processing
o Availability of data
o Query performance
o Number of rows processed
o Number of jobs processed
o Execution time
o CPU time utilized

Common Metrics
o Number of problem incidents
o Speed of resolution and initial response
o Change request management response and completion

Score current service levels and compare them to historical trends. In this way the organization will be able to determine problem areas and make corrections to processes, schedules and capacity if there are red flags in any of these areas. Maintaining metrics, publishing them and resolving deficiencies or declining trends (as already stated) are most important when formal agreements have been made, but these activities can be equally important in ensuring an organization is informed and can take action to meet informal expectations before items like capacity and scheduling issues impact an SLA.
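One way to score current service levels against historical trends is to compute the share of on-time deliveries per period and compare the latest period with the average of the earlier ones. The following is a minimal sketch: the record layout, the function names and the 5-point tolerance are assumptions made for illustration, and the sample data is invented.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# Each record: (period label such as "2024-01", True if the SLA was met).
Record = Tuple[str, bool]


def compliance_by_period(records: Iterable[Record]) -> Dict[str, float]:
    """Return the percentage of executions that met the SLA in each period."""
    met = defaultdict(int)
    total = defaultdict(int)
    for period, sla_met in records:
        total[period] += 1
        met[period] += int(sla_met)
    return {p: 100.0 * met[p] / total[p] for p in sorted(total)}


def is_declining(scores: Dict[str, float], tolerance_points: float = 5.0) -> bool:
    """Flag the latest period if it sits more than `tolerance_points`
    percentage points below the mean of all earlier periods."""
    periods = sorted(scores)
    if len(periods) < 2:
        return False
    history = mean(scores[p] for p in periods[:-1])
    return scores[periods[-1]] < history - tolerance_points


# Invented sample: 9 of 10 runs on time in January, 7 of 10 in February.
sample = (
    [("2024-01", True)] * 9 + [("2024-01", False)]
    + [("2024-02", True)] * 7 + [("2024-02", False)] * 3
)
scores = compliance_by_period(sample)
print(scores, is_declining(scores))
```

The same pattern applies to any of the metrics listed above, for example incident counts or response times, as long as each observation can be labeled as meeting or missing its target.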

Samples
Example 1

The figure above illustrates the dependencies among SLAs and OLAs. The ability to meet the SLA depends upon the availability of hardware, data and processing capabilities. It is possible for dependencies to span multiple items. For example, the end-user requesting a report might have an SLA that agrees to deliver the report by a certain time and day. Achieving the SLA depends not only on Your Process, Your Data, and Your Server, but also on the other systems that feed them.

The Sample Metrics Tracking figure above shows the capture and plotting of elapsed execution time for an application over time. Spikes indicate when data errors or delays in receiving source data occurred, causing an extended processing window. The drop beginning in late June 2006 resulted from replacing the database systems for this data warehouse application with a faster database and database server to increase capacity.

Measured against an SLA like the one in Example 1 (where end-users expect a report by 8 AM), the application processing shown above was beginning to push the threshold, running past the SLA target time on occasion and coming a little too close for comfort on the days leading up to the capacity improvements. Thirteen out of the fifty-two executions shown for April and May (25%) exceeded the target SLA! Imagine the frustration and lack of confidence end-users began to have with the system.

In addition to needing to address this problem, there was a desire to move the SLA to 6 AM. The increased capacity moved up data availability significantly, leaving plenty of opportunity to meet the new SLA target (even if source data issues occurred). There are eight blips where source data delivery issues extended the execution window, but only one missed SLA.

By extrapolation, and with an understanding of the system, the metrics diagram also shows that the SLAs covering the upstream processing that delivers the source data are being missed too frequently. The system that provides the source data now needs attention.
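The distinction drawn above between blips (runs extended by source data issues) and actual SLA misses can be sketched in code. The example below is a rough illustration under stated assumptions: run times are elapsed minutes from a fixed start time, a blip is any run longer than a chosen multiple of the median run, and all sample values are invented rather than read from the figure.

```python
from statistics import median
from typing import List, Tuple


def classify_runs(
    elapsed_minutes: List[float],
    sla_limit_minutes: float,
    blip_factor: float = 1.5,
) -> Tuple[List[int], List[int]]:
    """Return (blip_indexes, missed_indexes) for a series of run times.

    A blip is a run longer than `blip_factor` times the median run.
    A miss is a run longer than the SLA limit. A run can be both.
    """
    if not elapsed_minutes:
        return [], []
    typical = median(elapsed_minutes)
    blips = [i for i, m in enumerate(elapsed_minutes) if m > blip_factor * typical]
    missed = [i for i, m in enumerate(elapsed_minutes) if m > sla_limit_minutes]
    return blips, missed


# Invented sample: mostly 60-minute runs, two delayed runs, 180-minute limit.
runs = [58, 61, 60, 140, 59, 62, 195, 60]
blips, missed = classify_runs(runs, sla_limit_minutes=180)
print(blips, missed)  # [3, 6] [6]
```

With enough headroom between typical run time and the SLA limit, most blips stay within the agreement, which is the situation described after the capacity increase. Frequent blips are still worth tracking because they point at upstream delivery problems.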

Example 2
The sample support SLA below shows the guidelines and expectations when raising a problem request. Other portions of the document explain how to raise a request and which means of communication are acceptable for the first notification and then for follow-up.

Priority: Urgent
Guidelines For Use
o A product or service is unavailable in Production and no workaround is available
o There is a severe impact to the Production environment
o Multiple people are affected
o Immediate action is needed
o This priority should not be used to report a problem affecting a test environment
o Should be reserved for those items that are truly Urgent
Service Level
o Support team is paged when the request is submitted
o Request will be addressed within 15 minutes of receipt
o Client will be notified by phone and/or by email when the request is placed in "In Progress" status

Priority: High
Guidelines For Use
o A product or service is unavailable in test or production and no workaround is available
o Multiple people are unable to complete critical tasks in test or in Production
o There is a severe impact on the completion of project milestones or to the production environment
o Short-term workarounds may be available while the problem is addressed
o Prompt action is needed
o This priority should not be used to report a problem affecting only one person in a test environment
Service Level
o Support team is paged when the request is submitted
o Request will be addressed within 1 hour of receipt
o Client will be notified by phone and/or by email when the request is placed in "In Progress" status

Priority: Medium
Guidelines For Use
o One or more people are experiencing problems or need support work
o Project milestones are not likely to be affected and other tasks can be completed while this problem or task is being resolved
Service Level
o Request will be addressed within 4 business hours of receipt
o Client will be notified by phone and/or by email when the request is placed in "In Progress" status

Priority: Low
Guidelines For Use
o Note: Most Requests should use this priority
o Problem resolution or support work is requested but is not needed immediately
o Project deadlines and/or an individual's performance will be enhanced
Service Level
o Request will be addressed as agreed upon by both parties
o Client will be notified by phone and/or by email when the request is placed in "In Progress" status
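The response targets in this sample can be captured as data so that a ticketing or alerting script can look them up. The sketch below is one possible encoding, not part of the sample SLA itself: the class, dictionary and function names are hypothetical. It deliberately returns no deadline for Medium (which needs a business-hours calendar) or Low (which is negotiated), rather than guessing.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class ResponseTarget:
    window: Optional[timedelta]  # None means "as agreed upon by both parties"
    business_hours: bool         # True if the window counts business hours only
    pages_support_team: bool     # True if the support team is paged on submission


# Values transcribed from the sample support SLA above.
TARGETS = {
    "Urgent": ResponseTarget(timedelta(minutes=15), False, True),
    "High": ResponseTarget(timedelta(hours=1), False, True),
    "Medium": ResponseTarget(timedelta(hours=4), True, False),
    "Low": ResponseTarget(None, False, False),
}


def respond_by(priority: str, received: datetime) -> Optional[datetime]:
    """Return the latest time the request should be addressed, or None
    when the deadline depends on a business calendar or on negotiation."""
    target = TARGETS[priority]
    if target.window is None or target.business_hours:
        return None
    return received + target.window


# Hypothetical request received at 09:00.
print(respond_by("Urgent", datetime(2024, 1, 8, 9, 0)))
```

Keeping the targets in one table makes it easier to report on "speed of initial response" per priority, and to change a target without touching the surrounding logic.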

Example 3

The following sample SLA outlines the responsibilities of the Development Team and the Support Team for different environments, as well as the notification plan when an incident or a change request occurs.

A. Development / UAT
  i. Support Team provides installation/upgrades/patches, etc.
  ii. Developers are granted access to perform migrations.
  iii. Developers are granted access to start/stop servers.
  iv. In the event that they are unable to successfully start/stop the instances, Developers must:
    1. Open a Problem Ticket
    2. Provide details of exactly what they tried to do in order to bring the instance up / down.
    3. Support Team gets involved as second level support and provides assistance.

B. Production
  i. This is a controlled environment to which the development teams do not have any access.
  ii. All issues must be raised directly to Support Team via the following process:
    1. Open a Problem Ticket
    2. Provide details of exactly what the issue is
    3. Contact the help desk
    4. If help desk is unable to resolve, Support Team gets involved as second level support and provides assistance.
  iii. Code migration requests are raised via the RFC change process using the Change Management System.
    1. Open an RFC in the Support Team queue
    2. Provide details of what steps to take for the change:
      a. Associated Repository and Informatica Version
      b. Information about which Informatica folders are affected by the change
      c. Location of XML file containing Informatica objects to import
      d. Details of any other change steps (modified relational connections, etc.)
    3. Validate the success of the change during the change window.
    4. For non-emergency changes, if the change requires off-hours support, please provide 5 business days advance notice.
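The RFC requirements in this sample lend themselves to a simple check before submission. The sketch below is hypothetical: the field names are invented labels for the details listed above and do not correspond to any real Change Management System or Informatica API, and the notice calculation counts weekdays only, ignoring holidays.

```python
from datetime import date, timedelta
from typing import Dict, List

# Hypothetical field names mirroring the required RFC details.
REQUIRED_RFC_FIELDS = (
    "repository",
    "informatica_version",
    "affected_folders",
    "xml_file_location",
)


def missing_rfc_details(rfc: Dict[str, object]) -> List[str]:
    """Return the required fields that are absent or empty in the RFC."""
    return [name for name in REQUIRED_RFC_FIELDS if not rfc.get(name)]


def business_days_notice(submitted: date, change_date: date) -> int:
    """Count weekdays after `submitted` up to and including `change_date`."""
    days = 0
    current = submitted
    while current < change_date:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday is 0, Friday is 4
            days += 1
    return days


def has_enough_notice(submitted: date, change_date: date, required_days: int = 5) -> bool:
    """Check the 5-business-day notice for non-emergency off-hours changes."""
    return business_days_notice(submitted, change_date) >= required_days


# Hypothetical RFC submitted on a Monday for a change the following Monday.
rfc = {"repository": "REPO_A", "informatica_version": "x.y", "affected_folders": []}
print(missing_rfc_details(rfc))
print(has_enough_notice(date(2024, 1, 8), date(2024, 1, 15)))
```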

Last updated: 06-Sep-08 16:21
