UC SF
TABLE OF CONTENTS
DOCUMENT VERSION CONTROL
STAKEHOLDER TEAM
1. INTRODUCTION
   1.1. PURPOSE
   1.2. SCOPE
   1.3. DEFINITIONS
2. RESPONSIBILITIES
3. PROCESS DEFINITION
   3.1. PROCESS MAP
4. RACI CHART
5. ENTRY CRITERIA
6. PROCEDURE
7. EXIT CRITERIA
8. PROBLEM MANAGEMENT SERVICE LEVELS
9. PROBLEM MANAGEMENT SERVICE METRICS
Rev 12/11/09
DOCUMENT VERSION CONTROL
Issue Date    Prepared By
5/11/07       Terrie Coleman      First draft
7/09/09       Francine Sneddon
12/11/09      Francine Sneddon
STAKEHOLDER TEAM
Department                                   Name
Customer Support Services (CSS)              Rebecca Nguyen
Enterprise Information Security (EIS)        Michael Kamerick (Interim)
Enterprise Network Services (ENS)            Jeff Fritz
Information Technology Services (ITS)        Heidi Schmidt
Academic Research Systems (ARS)              Michael Kamerick
Application Services (AS)                    Jane Wong
Business & Resource Management (BRM)         Shahla Raissi
This document contains confidential, proprietary information intended for internal use only and is not to be distributed outside the University of California, San Francisco (UCSF) without an appropriate non-disclosure agreement in force. Its contents may be changed at any time and create neither obligations on UCSF's part nor rights in any third person.
UCSF Internal Use Only
Problem Mgmt 12-11-09.doc
1. INTRODUCTION
1.1. PURPOSE
The objective of Problem Management is to resolve the underlying root cause of incidents and consequently prevent them from recurring. Reactive Problem Management aims to identify the root cause of past incidents and to present proposals for improvement or rectification. Proactive Problem Management aims to prevent incidents from recurring by identifying weaknesses in the infrastructure and making proposals to eliminate them.
1.2. SCOPE
The scope of the Problem Management process includes a standard set of processes, procedures, responsibilities and metrics that are utilized by all OAAIS services, applications, systems and network support teams.
1.3. DEFINITIONS
A problem describes an undesirable situation, indicating the unknown root cause of one or more existing or potential incidents. A known error is a problem for which the root cause is known and for which a temporary workaround has been identified. A Request for Change (RFC) proposes a change to eliminate a known error and is addressed by the Change Management process. The Problem Management process includes Problem Control, Error Control, Proactive Problem Management and Providing Management Reporting.
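As a minimal sketch of how these definitions relate, the record below models a problem and derives its known-error status from two fields. The class and field names (`Problem`, `incident_ids`, `root_cause`, `workaround`) are illustrative assumptions and do not correspond to actual Remedy fields.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Problem:
    """Hypothetical problem record; field names are illustrative only."""
    summary: str
    incident_ids: List[str] = field(default_factory=list)
    root_cause: Optional[str] = None   # unknown until diagnosed
    workaround: Optional[str] = None   # temporary workaround, if any

    @property
    def is_known_error(self) -> bool:
        # Known error: the root cause is known AND a workaround is identified.
        return self.root_cause is not None and self.workaround is not None


p = Problem(summary="Nightly report job fails", incident_ids=["INC-1", "INC-2"])
print(p.is_known_error)  # False: root cause still unknown
p.root_cause = "Disk quota exceeded on report server"
p.workaround = "Purge temp files before the job runs"
print(p.is_known_error)  # True
```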
2. RESPONSIBILITIES
Problem Management Team

Reactive Problem Management
- Identifies and records problems by analyzing incident details
- Approves problem resolution recommendations and establishes resolution priority
- Generates Requests for Change (RFC)

Proactive Problem Management
- Identifies trends and records problems
- Approves problem resolution recommendations and establishes resolution priority
Problem Owner
- Investigates and manages problems based on their priority
- Assigns (or obtains) resources and manages error control activities
- Schedules and facilitates major problem reviews
- Develops recommendations for problem resolution
- Monitors the progress of known errors
- Generates Requests for Change (RFC)

Problem Manager
- Coordinates and guides activities of the Problem Management Team and Problem Owner(s)
- Provides management information and uses it proactively to prevent the occurrence of incidents and problems in both production and development environments
- Escalates the analysis and resolution of cross-functional problems to Unit and OAAIS levels
- Conducts post mortem or Post-Implementation Reviews (PIR) for continuous improvement
- Develops and improves Problem Control and Error Control procedures
3. PROCESS DEFINITION
3.1. PROCESS MAP
[Figure: Problem Management process map. Flowchart beginning at START, with lanes that include Problem Manager and Problem Owner(s). Legible elements: 1 Categorize Incidents; 2 Identify High Occurrence of Similar Incidents; 3 Identify Problem(s); 4 Match Incidents & Create Problem(s); 5 Review Problem(s); 6 Assign Problem Owner(s); 8 Assess Problem; 15 Fix? (No: go to Step 25).]
4. RACI CHART
Roles: Problem Management Team, Problem Manager, Problem Owner

Step  Description                                        Assignments

      Conduct Problem Control
1     Categorize Incidents                               R, A
2     Identify High Occurrence of Similar Incidents      R, A
3     Identify Problems                                  R, A
4     Match Incidents & Create Problems                  R, A
5     Review Problems                                    R, A
6     Assign Problem Owners                              R, A
7     Accept Problem Assignment                          R, A

      Conduct Error Control
8     Assess Problem                                     A/R
9     Investigate & Diagnose Problem                     A/R
10    Determine Root Cause                               A/R
11    Document the Error                                 A/R
12    Document Possible Solutions                        A/R
13    Present Solution Recommendation                    I, I, A/R
14    Approve?                                           R, A/R
15    Fix?                                               R, A/R
16    Determine Priority                                 R, A/R
17    Schedule                                           R, A/R
18    Monitor the Error                                  I, A/R
19    RFC Required?                                      A/R
20    Create Change Request                              A/R
      Perform Change Management Process - Hand-off
21    Implement Solution                                 A/R
22    Monitor Resolution                                 A/R
23    Error Resolved?                                    A/R
24    Permanent Solution?                                A/R
25    Update & Close Problem                             A/R
26    Conduct Review                                     A/R

      Proactive Problem Management
27    Collect, Review & Analyze Data - incident,         A/R
      problem, known errors, industry, performance
      management, security, and user data
28    Identify Infrastructure Issues - weak,             A/R
      overloaded, vulnerable components
29    Create Problem & Review Problem                    A/R

      Provide Reporting
30    Generate Updated Problem Report                    A/R
31    Distribute Problem Reports, as required            A/R

Responsible: People who do the work, facilitate it and/or organize it
Accountable: The one who ensures that desired outcomes are reached and has yes/no decision making authority
Consulted: People who have critical expertise to contribute before a decision is made
Informed: People who are significantly affected by the activity/decision and must be informed to ensure successful implementation
5. ENTRY CRITERIA
- Incident details, including workarounds and RFCs
- Configuration details (from Configuration Management Database, future)
- Product information including technical details and known errors
- Details about infrastructure behavior, capacity, performance and service levels
6. PROCEDURE
ID  Step (Responsibility)

Conduct Problem Control

1   Categorize Incidents (Responsibility: Problem Management Team)
    The team receives an Incident Report from Remedy showing open and closed incidents. Incidents will be categorized by category, type and item. The team reviews the report and assigns a problem category to each incident, such as:
    - Application
    - Functional
    - Geography
    - Database
    - Type
    - Version
    - Operating System
    - Hardware
    - Network

2   Identify High Occurrence of Similar Incidents (Responsibility: Problem Management Team)
    Using best judgment and experience, review the categorized incidents, looking for similarities and/or high occurrence of the same incident.

3   Identify Problem(s)
    Identify problem(s) based on high occurrence, critical issues, trends, threatened service levels and/or incidents not linked to an existing problem or known error.

4   Match Incidents & Create Problems
    Link new incidents to a problem by entering the incident # into the problem worklog. If an incident cannot be linked to an existing problem, the team will create a new problem in Remedy with a link to the incident. The following fields are required:
    - Assignment Group
    - Summary
    - Description
    - Requester
    - Create Date
    - Assign To
    - Planned End Date
    - Status
    - Case Type = Problem

5   Review Problem(s)
    A Problem Report is generated by Remedy and reviewed, and the team:
    - Consolidates similar problems
    - Creates new problems, if needed
    - Prioritizes problems
    - Defines investigation scope
    - Establishes target resolution dates
6   Assign Problem Owner(s)
    Assign one or more owners to work a problem based on:
    - Availability
    - Expertise
    - Complexity of the problem

7   Accept Problem Assignment (Responsibility: Problem Owner(s))
    Team and Problem Owner(s) agree on the scope and plan for the assignment.

Conduct Error Control

8   Assess Problem (Responsibility: Problem Owner(s))
    Review and document what is known about the problem, including:
    - Incident details
    - User errors
    - Intermittent errors
    - Application logs
    - Server logs
    - Workarounds

9   Investigate and Diagnose Problem (Responsibility: Problem Owner(s))
    Investigate and diagnose the problem. Possible resources include:
    - Internet search - Google error code
    - Establish trouble tickets with vendors
    - Discuss with other functional areas
    - Duplicate the problem
    - Conduct isolation analysis
    - Forums/user groups

10  Determine Root Cause (Responsibility: Problem Owner(s))
    Conduct a root cause analysis of the problem. Possible tools and methods include:
    - Five Whys
    - Fishbone Diagram
    - Ishikawa Diagram
    - Pareto Analysis
    - Statistical analysis

11  Document the Known Error (Responsibility: Problem Owner(s))
    Update Remedy with information regarding the Known Error, including:
    - Error description
    - Error code
    - Impact/severity/priority
    - Root cause
12  Document Possible Solution(s)
    Analyze, compare and evaluate alternatives, including permanent solutions, temporary solutions or a decision not to fix. Possible resources include:
    - Vendor troubleshooting guides and documents
    - Internet searches
    - Vendor trouble tickets
    - Identified workarounds and fixes
    - User group recommendations
    Draft the following documents based on the severity and type of problem:
    - Cost benefit analysis
    - Risk analysis
    - Impact analysis
    - Criteria for success

13  Present Solution Recommendation (Responsibility: Problem Owner(s))
    Problem Owner(s) present recommendation(s) for solving the problem to the team. The following items may be included:
    - Cost benefit analysis
    - Risk analysis
    - Impact analysis
    - Criteria for success

14  Approve?
    Evaluate and approve the recommendation(s) using the following criteria:
    - Objective analysis of problem
    - Strategic direction
    - Severity/visibility of problem
    - Cost benefit analysis
    Yes go to Step 15 / No go to Step 12

15  Fix? (Responsibility: Problem Management Team)
    Yes go to Step 16 / No go to Step 25

16  Determine Priority (Responsibility: Problem Management Team)
    Set the priority based on:
    - Impact on the business
    - Urgency of the problem
    - Risk assessment
    - Resource constraints

17  Schedule? (Responsibility: Problem Management Team)
    Estimate the level of effort and schedule the work to be completed, or determine that the work cannot be scheduled at this time but should be monitored.
    Yes go to Step 19 / No go to Step 18
18  Monitor the Error
    Monitor the problem using the following tools:
    - Log files
    - Scripts
    - Performance tools
    Alert the Problem Manager if the problem recurs or there is a change in the impact of the problem.
    Go to Step 16

19  RFC Required?
    Yes go to Step 20 / No go to Step 21

20  Create Request for Change
    Go to Perform Change Management Process

    Perform Change Management Process
    Hand-off

21  Implement Solution

22  Monitor Resolution
    Monitor the resolution after implementation to make sure that the problem does not recur. Validate that there is a reduction in occurrences and severity and that other problems do not occur as a result of the fix.

23  Error Resolved?
    Yes go to Step 24 / No go to Step 8

24  Permanent Solution?
    Yes go to Step 25 / No go to Step 8

25  Update and Close Problem
    Update and close Problem, Known Error and associated Incident records in Remedy. Communicate problem resolutions and known errors to support team incident managers.

26  Conduct Review
    Conduct a Post-Implementation Review (PIR) to understand what was done well, what was done badly, how to do better next time, and how to prevent recurrence of the failure.

Responsibility (Steps 19-26): Problem Manager, Problem Owner(s)
Proactive Problem Management

27  Collect, Review and Analyze Data (Responsibility: Problem Management Team)
    Possible data sources include:
    - Incident data
    - Problem data
    - Known errors
    - Industry data
    - Performance data
    - System management data
    - Security data
    - User data
    Analyze data for similarities, repeat occurrences and trends. Review performance and industry data, looking for potential problems and best-practice solutions.

28  Identify IT Enterprise Issues: People, Process, Technology (Responsibility: Problem Management Team)
    Identify specific components in the enterprise that are causing problems, such as:
    - Failing or outdated equipment
    - Insufficient CPU, memory or storage
    - Poorly written code
    - Incorrect configuration
    - Inadequate user training

29  Create Problem & Review Problem (Responsibility: Problem Management Team)

Provide Management Reporting

30  Generate Updated Problem Report (Responsibility: Problem Manager)
    Provide reports on open problems, resolved problems, and known errors. Provide reports on problem management service levels and service metrics.

31  Distribute Problem Reports, as required (Responsibility: Problem Manager)
    Distribute reports on open problems, resolved problems, and known errors to the service desk and incident managers. Distribute reports on problem management service levels and service metrics to management.
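The Yes/No branches in the procedure above can be sketched as a small lookup table. This is only an illustration of the routing between decision steps; the names `DECISIONS` and `next_step` are hypothetical and are not part of Remedy or of this process.

```python
from typing import Dict, Tuple

# (decision step, answer) -> next step, copied from the Yes/No branches
# stated in the procedure. True means "Yes", False means "No".
DECISIONS: Dict[Tuple[int, bool], int] = {
    (14, True): 15, (14, False): 12,   # Approve?
    (15, True): 16, (15, False): 25,   # Fix?
    (17, True): 19, (17, False): 18,   # Schedule?
    (19, True): 20, (19, False): 21,   # RFC Required?
    (23, True): 24, (23, False): 8,    # Error Resolved?
    (24, True): 25, (24, False): 8,    # Permanent Solution?
}


def next_step(step: int, answer: bool) -> int:
    """Return the step that follows a decision step for a Yes/No answer."""
    try:
        return DECISIONS[(step, answer)]
    except KeyError:
        raise ValueError(f"Step {step} is not a decision step") from None


print(next_step(15, False))  # 25: a problem that will not be fixed is closed
print(next_step(23, False))  # 8: an unresolved error returns to assessment
```

Keeping the branches in one table makes loops easy to spot, such as the return to Step 8 when an error is not resolved or the solution is not permanent.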
7. EXIT CRITERIA
- Known Error database updated
- Request for Change (RFC), if required
- Problem records updated with known errors, solutions and/or workarounds
- Closed problem records once root cause is eliminated
- Management information
Number of days from problem creation to problem close for problems closed in the current period.
Number of incidents that are closed by solutions registered in the Known Errors database.
Establish incidents with Known Errors baseline and monitor trends over time.
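A minimal sketch of how the two measurements above might be computed from exported records. The record types and field names (`ProblemRecord`, `IncidentRecord`, `created`, `closed`, `known_error_id`) are assumptions made for illustration; they are not Remedy fields, and the sample dates are invented.

```python
from datetime import date
from typing import Iterable, List, NamedTuple, Optional


class ProblemRecord(NamedTuple):
    created: date
    closed: Optional[date]          # None while the problem is still open


class IncidentRecord(NamedTuple):
    closed: bool
    known_error_id: Optional[str]   # set when a Known Error solution was used


def days_to_close(problems: Iterable[ProblemRecord],
                  period_start: date, period_end: date) -> List[int]:
    """Days from creation to close, for problems closed within the period."""
    return [
        (p.closed - p.created).days
        for p in problems
        if p.closed is not None and period_start <= p.closed <= period_end
    ]


def incidents_closed_by_known_errors(incidents: Iterable[IncidentRecord]) -> int:
    """Count closed incidents whose solution came from the Known Errors database."""
    return sum(1 for i in incidents if i.closed and i.known_error_id is not None)


problems = [
    ProblemRecord(created=date(2009, 11, 2), closed=date(2009, 12, 1)),
    ProblemRecord(created=date(2009, 11, 20), closed=None),
]
incidents = [
    IncidentRecord(closed=True, known_error_id="KE-1"),
    IncidentRecord(closed=True, known_error_id=None),
    IncidentRecord(closed=False, known_error_id="KE-1"),
]
print(days_to_close(problems, date(2009, 12, 1), date(2009, 12, 31)))  # [29]
print(incidents_closed_by_known_errors(incidents))  # 1
```

Running the second count once per reporting period would give the baseline and trend described above.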