
Case Study: Automate Deadlock Notification

ABOUT THE ENGAGEMENT
The client is a media and entertainment giant engaged in the production and marketing of entertainment, news, and information products and services to a global customer base. The IT backbone supporting the organization is a complex network with a few thousand servers spread across multiple platforms, ranging from legacy to open source. Around 1,000 databases are currently in use for various applications.

BUSINESS NEED
Application processes are killed with information. (Open questions: killed with or without information? What processes are these and what are they used for?)

APPROACH
- Automate the deadlock information report from the error log.
- Generate the report as soon as a deadlock occurs on the SQL Server. (What automation was done?)

CHALLENGES
- The Application Team raises a ticket with the Database Team; there is a considerable delay in raising the ticket.
- The Database Team works to a 24-hour SLA to send the report.
- The Database Team collects the data manually from the error log, formats it, and sends it to the app team.
- Time taken by the Database Team per deadlock: 45 minutes.

SCOPE
- Reduce the non-value-add work for the Database Team.
- Reduce the turnaround time for getting the information to the app team.

BUSINESS VALUE DELIVERED
- Turnaround time for the Application Team reduced to 0-5 minutes {from ?}.
- Non-value work for the DB Team removed. (How much time saved?)
- Number of alerts for deadlocks reduced to zero {from ?}.
- Number of tickets for deadlock information reduced to zero {from ?}.
- Risk reduction: the DB Team does NOT need to log into the server to get this information.
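The automation described above boils down to scanning the SQL Server error log for deadlock entries and mailing a formatted report. A minimal sketch, assuming a hypothetical errorlog line format (real output with trace flag 1222 enabled is much richer) and omitting the actual e-mail send:

```python
import re

# Hypothetical errorlog line shape; adjust the pattern to the real log.
DEADLOCK_RE = re.compile(r"deadlock victim.*spid\s*=?\s*(\d+)", re.IGNORECASE)

def find_deadlock_victims(errorlog_lines):
    """Return the session ids reported as deadlock victims."""
    victims = []
    for line in errorlog_lines:
        m = DEADLOCK_RE.search(line)
        if m:
            victims.append(int(m.group(1)))
    return victims

def format_report(victims):
    """Build the message body the script would send to the app team."""
    if not victims:
        return "No deadlocks found in the scanned window."
    lines = ["Deadlock report (auto-generated):"]
    lines += [f"  victim spid {v}" for v in victims]
    return "\n".join(lines)

sample = [
    "2011-05-02 10:15:01 spid4s  deadlock-list",
    "2011-05-02 10:15:01 spid4s  Deadlock victim: spid = 87",
]
print(format_report(find_deadlock_victims(sample)))
```

In production the script would tail the live errorlog and hand the formatted text to the mail subsystem, so the app team gets the report without raising a ticket.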

2011 Copyright Genpact. All Rights Reserved.

Case Study: Cluster Resource Failback After Patching


ABOUT THE ENGAGEMENT
The client is a media and entertainment giant engaged in the production and marketing of entertainment, news, and information products and services to a global customer base. The IT backbone supporting the organization is a complex network with a few thousand servers spread across multiple platforms, ranging from legacy to open source. Around 1,000 databases are currently in use for various applications.

BUSINESS NEED
Windows cluster servers are patched every month. After patching, services have to fail back to their resident node.

APPROACH
- A conference call is set up between the database and compute teams during the patching window to ensure the resources are on their resident node.
- The Compute Team has been provided a checklist of resources and their resident nodes.
- The Database Team has alerts set up for resources NOT on their resident nodes.
- Auto-Reboot and Auto-Update features disabled, to ensure all clusters are on the same patch level.

CHALLENGES
- Compute Teams patch hundreds of servers at the same time.
- No alerting mechanism if services are on the wrong node.
- Outages due to resources NOT on their resident node.
- High number of memory/CPU alerts due to resources NOT on their resident node.
- Auto-Reboot and Auto-Update features enabled on several servers, leading to inconsistency between patch levels.

SCOPE
- Reduce the non-value-add work for the Database Team.
- Reduce the outages due to resources not on their resident node.
- Reduce the number of alerts due to resources not on their resident node.

BUSINESS VALUE DELIVERED
- Non-value work for the DB Team removed. (How much time saved?)
- Number of CPU and memory alerts reduced to zero {from ?}.
- Number of outages due to resources NOT on their resident node, and of auto-reboots due to auto-updates, reduced to zero {from ?}.
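The resident-node check is essentially a comparison between a static checklist and the cluster's current owner nodes. A minimal sketch with hypothetical resource and node names:

```python
# Hypothetical checklist: cluster resource -> designated resident node.
RESIDENT_NODES = {
    "SQL-GRP-A": "NODE1",
    "SQL-GRP-B": "NODE2",
    "QUORUM": "NODE1",
}

def misplaced_resources(current_owners):
    """Compare current owner nodes against the resident-node checklist
    and return the resources that still need to be failed back."""
    return sorted(
        res for res, node in current_owners.items()
        if RESIDENT_NODES.get(res) not in (None, node)
    )

# State observed right after a patching window: one group is on the
# wrong node and should trigger an alert / a failback.
after_patching = {"SQL-GRP-A": "NODE2", "SQL-GRP-B": "NODE2", "QUORUM": "NODE1"}
print(misplaced_resources(after_patching))
```

In practice the current owners would come from the cluster API (e.g. a `Get-ClusterGroup` query on Windows) instead of a literal dictionary.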


Case Study: ASM Scan Order Change

ABOUT THE ENGAGEMENT
The client is a media and entertainment giant engaged in the production and marketing of entertainment, news, and information products and services to a global customer base. The IT backbone supporting the organization is a complex network with a few thousand servers spread across multiple platforms, ranging from legacy to open source. Around 1,000 databases are currently in use for various applications.

BUSINESS NEED
ASM does NOT use multipath or PowerPath pseudo-devices; instead it uses the single devices, which amounts to a single point of failure in RAC environments and causes I/O failures during storage-device failover.

APPROACH
- Bring all the instances down on one node, including the ASM instance.
- Perform the scan-order changes in /etc/sysconfig/oracleasm.
- Restart ASMLib (stop and start).
- Start the instances.
- Repeat the process on all the other nodes of the cluster.

CHALLENGES
- Application Team sign-off, and the activity to be done with zero downtime.
- Databases on different operating systems.

SCOPE
- ASM scan-order change in the /etc/sysconfig/oracleasm file.
- Ensure ASM uses multipath/PowerPath after the change.
- Ensure storage works fine without any I/O errors after the change.

BUSINESS VALUE DELIVERED
- Number of I/O failures reduced to zero, from 4-5 failures per month.
- Number of outages due to I/O issues related to multipath/PowerPath reduced to zero.
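The scan-order change itself is a small edit to the ASMLib configuration. A sketch of the two relevant /etc/sysconfig/oracleasm parameters, with illustrative values: prefixes in ORACLEASM_SCANORDER are scanned first ("dm" matches device-mapper multipath pseudo-devices, "emcpower" EMC PowerPath), and ORACLEASM_SCANEXCLUDE skips the underlying single-path "sd" devices:

```
# /etc/sysconfig/oracleasm (excerpt) -- illustrative values
# Scan multipath pseudo-devices first, never the raw single paths.
ORACLEASM_SCANORDER="dm emcpower"
ORACLEASM_SCANEXCLUDE="sd"
```

After this change, ASMLib stamps the multipath device rather than one leg of the path, so a storage-device failover no longer surfaces as an I/O failure to ASM.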


Case Study: Automation of Monthly Database Recycle

ABOUT THE ENGAGEMENT
The client is a media and entertainment giant engaged in the production and marketing of entertainment, news, and information products and services to a global customer base. The IT backbone supporting the organization is a complex network with a few thousand servers spread across multiple platforms, ranging from legacy to open source. Around 1,000 databases are currently in use for various applications.

BUSINESS NEED
Every month the database has to be recycled, with notifications to the Application Teams before and after the activity, as per the scheduled and approved maintenance window.

APPROACH
- The script verifies the day, date, and time on the server and matches them against the approved time already coded.
- The script sends a notification 2 days prior to the activity.
- The script sends a notification 30 minutes prior to the activity.
- The script recycles the database node by node.
- The script sends a notification once the activity is complete.

CHALLENGES
- The script must run based on the day, date, and time.
- The database has to be recycled node by node with no downtime: at any point during the activity only one node may be down, while the other nodes have to be up and running.

SCOPE
- The databases to be recycled one node at a time.
- Email notifications to be sent to the App and Database Teams before and after the activity is performed.

BUSINESS VALUE DELIVERED
- No manual intervention needed.
- Saved man-hours: 4 hrs x 10 = 40 hrs per month (the 4 hrs include the recycle activity, notifications, and getting approval).
- Human errors reduced to zero.
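The "verify the day, date and time" step amounts to a guard function that only lets the recycle proceed inside the approved window. A minimal sketch, assuming a hypothetical window (first Sunday of the month, 02:00-06:00) since the real approved schedule is not stated:

```python
from datetime import datetime

def in_approved_window(now):
    """Return True only inside the coded maintenance window.

    Assumed window: first Sunday of the month, 02:00-06:00 server time."""
    first_sunday = now.day <= 7 and now.weekday() == 6  # Monday == 0
    return first_sunday and 2 <= now.hour < 6

# 1 May 2011 was a Sunday: inside vs. outside the window.
print(in_approved_window(datetime(2011, 5, 1, 3, 0)))
print(in_approved_window(datetime(2011, 5, 1, 7, 0)))
```

The full script would wrap this guard around the node-by-node restart loop and the pre/post e-mail notifications, exiting without action whenever the check fails.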


Case Study: DBMS_STATS Project

ABOUT THE ENGAGEMENT
The client is a media and entertainment giant engaged in the production and marketing of entertainment, news, and information products and services to a global customer base. The IT backbone supporting the organization is a complex network with a few thousand servers spread across multiple platforms, ranging from legacy to open source. Around 1,000 databases are currently in use for various applications.

BUSINESS NEED
Oracle 10g databases were still gathering statistics with the ANALYZE method carried over from the 9i databases. The need is to implement DBMS_STATS for all non-default schemas and remove the current ANALYZE jobs.

APPROACH
- Verify the ANALYZE job details and configuration.
- Delete the existing statistics collected by the ANALYZE job.
- Gather statistics using DBMS_STATS scripts, as recommended for 10g.

CHALLENGES
- The App Team was reluctant to change the stats-collection jobs to the new method because they did not want changes to the current infrastructure.
- Removal of the current jobs and scheduling of the new ones.

SCOPE
- Implement the new stats-collection job on all Oracle 10g instances.
- Remove the existing ANALYZE jobs.

BUSINESS VALUE DELIVERED
- 20% reduction in the number of outages due to performance issues.
- Overall better database performance across more than 40 apps.
- The App Team acknowledged the better database performance.
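With ~1,000 databases in play, the per-schema DBMS_STATS commands are best generated rather than hand-written. A sketch that emits the delete-then-gather pair for each non-default schema (the default-schema list here is partial and illustrative):

```python
# Partial, illustrative list of default schemas to leave alone.
DEFAULT_SCHEMAS = {"SYS", "SYSTEM", "OUTLN", "DBSNMP"}

def stats_commands(schemas):
    """Generate the PL/SQL calls that move a schema from ANALYZE-based
    statistics to DBMS_STATS: drop the old stats, then regather."""
    cmds = []
    for s in sorted(set(schemas) - DEFAULT_SCHEMAS):
        cmds.append(f"EXEC DBMS_STATS.DELETE_SCHEMA_STATS(ownname => '{s}');")
        cmds.append(
            f"EXEC DBMS_STATS.GATHER_SCHEMA_STATS(ownname => '{s}', "
            f"cascade => TRUE);"
        )
    return cmds

for cmd in stats_commands(["APP1", "SYSTEM"]):
    print(cmd)
```

`cascade => TRUE` gathers index statistics along with the table statistics, which replaces what the old ANALYZE jobs covered.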


Case Study: Live Re-Org

ABOUT THE ENGAGEMENT
The client is a media and entertainment giant engaged in the production and marketing of entertainment, news, and information products and services to a global customer base. The IT backbone supporting the organization is a complex network with a few thousand servers spread across multiple platforms, ranging from legacy to open source. Around 1,000 databases are currently in use for various applications.

BUSINESS NEED
Database tables with huge amounts of data and frequent updates and deletes are expected to get fragmented. There is a need to de-fragment these tables, which improves performance and also releases space.

APPROACH
- Identify the tables, and their dependents, which are highly fragmented, using the dba_segments view and dba_tables.avg_row_len.
- Use QSM (Quest Space Manager) to reorg the tables and release the space.

CHALLENGES
- Perform the activity online, while the application is actively using the tables.
- Reduce the amount of archives (change logs) generated during the re-org of a table.
- Improve the read performance of the tables.

SCOPE
- Defragment the tables.
- Improve read performance and reduce I/O.

BUSINESS VALUE DELIVERED
- Reclaimed 40% of the space each time a reorg was performed.
- Performed online reorgs with zero downtime to the application and database.
- Historical reorganization activity tracked for proactive analysis and planning.
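The candidate-selection step compares allocated segment size (dba_segments) with the space the rows should actually need (num_rows x avg_row_len from dba_tables). A rough sketch of that estimate; the 15% overhead factor for block headers and PCTFREE is an assumption:

```python
def wasted_pct(segment_bytes, num_rows, avg_row_len, overhead=1.15):
    """Rough fragmentation estimate: percentage of the allocated segment
    that the current rows do not need. High values flag reorg candidates."""
    if segment_bytes == 0:
        return 0.0
    needed = num_rows * avg_row_len * overhead
    return max(0.0, round(100.0 * (1 - needed / segment_bytes), 1))

# 10 GB allocated, but 20M rows of ~100 bytes only need ~2.3 GB:
# a strong reorg candidate.
print(wasted_pct(10 * 2**30, 20_000_000, 100))
```

In the live process this calculation would run over a query against dba_segments joined to dba_tables, and the flagged tables would be fed to the online reorg tool.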


Case Study: 9i to 11g Upgrade with Unicode Conversion

ABOUT THE ENGAGEMENT
The client is a media and entertainment giant engaged in the production and marketing of entertainment, news, and information products and services to a global customer base. The IT backbone supporting the organization is a complex network with a few thousand servers spread across multiple platforms, ranging from legacy to open source. Around 1,000 databases are currently in use for various applications.

BUSINESS NEED
An SAP application upgrade demanded a database upgrade from 9i to 11g with Unicode conversion (Unicode conversion supports multiple character sets). The 9i database was on AIX, but the business demanded that the 11g database be on Solaris.

APPROACH
- Upgraded the database from 9i on AIX to 10g on AIX.
- Used a recovery point for database restoration from 9i to 10g.
- Split the large tables.
- Performed the export of the large tables first and of the other tables later.
- Created the new 11g database on the Solaris server.
- Performed the import of the tables along with index creation; we had automated the export/import and the index creation.

CHALLENGES
- Perform an upgrade from 9i on AIX to 11g on Solaris.
- Database size: 8 TB.
- Maintenance window provided: 30 hours.
- Need to move the data and indexes within the maintenance window.
- Need to perform the Unicode conversion within the window.

SCOPE
- Database upgrade from Oracle 9i to 11g, to support the application upgrade, with Unicode conversion.
- Change in operating system.

BUSINESS VALUE DELIVERED
- The database upgrade to 11g resulted in a high performance gain and support for a different character set.
- Database performance improved by 20%: queries perform well and reports complete faster.
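The "large tables first" ordering is what lets an 8 TB export fit a 30-hour window: the big tables each get their own (parallelizable) export job, while the long tail is batched together. A sketch of that scheduling decision; the table names, sizes, and 100 GB threshold are all illustrative:

```python
def export_batches(table_sizes_gb, size_threshold_gb=100):
    """Order the export: each big table in its own batch, largest first,
    then all the small tables grouped into one final batch."""
    big = sorted(
        (t for t, gb in table_sizes_gb.items() if gb >= size_threshold_gb),
        key=table_sizes_gb.get, reverse=True,
    )
    small = sorted(t for t, gb in table_sizes_gb.items()
                   if gb < size_threshold_gb)
    return [[t] for t in big] + ([small] if small else [])

sizes_gb = {"FACT_SALES": 900, "FACT_CLICKS": 400, "DIM_DATE": 1, "DIM_GEO": 2}
print(export_batches(sizes_gb))
```

Each batch would then be handed to its own export process, with imports and index builds kicked off as soon as a batch lands on the Solaris side.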


Case Study: Splitting the Partition in 10g

ABOUT THE ENGAGEMENT
The client is a media and entertainment giant engaged in the production and marketing of entertainment, news, and information products and services to a global customer base. The IT backbone supporting the organization is a complex network with a few thousand servers spread across multiple platforms, ranging from legacy to open source. Around 1,000 databases are currently in use for various applications.

BUSINESS NEED
Table XYZ, of size 700 GB, is a partitioned table with two partitions of 50 GB and 650 GB. The bigger partition takes 4+ hours for stats collection, which was resulting in outages. High load was observed on the server due to stats collection on this table, which was slowing the application down and causing outages.

APPROACH
- Created year-wise test tables and deleted the same data from the actual table.
- Split the bigger partition for the year whose data was deleted; the split-partition execution was faster because there was no data left for it in the actual table.
- Exchanged the test-table data with the new partition.
- Rebuilt the indexes.
- Performed all the steps through automated scripts.

CHALLENGES
- Split partition is a very expensive and time-consuming task.
- Managing space in the data tablespace, temporary tablespace, and undo tablespace.
- Very slow performance due to the huge data volume and a low server configuration.

SCOPE
- Split the partition to reduce the partition size; we did it in parallel with NOLOGGING to make it faster.
- Gain in query performance.
- Reduce stats-collection time on the server.

BUSINESS VALUE DELIVERED
- Stats-collection time reduced from 4-5 hours to 1 hour, an 80% reduction.
- Number of outages reduced to zero, from 2-3 outages monthly.
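Since all the steps ran through automated scripts, the split DDL itself is a natural candidate for generation. A sketch of what the generator might emit; the partition names, year boundary, tablespace, and parallel degree are all illustrative, with NOLOGGING on the new partition and PARALLEL mirroring the approach above:

```python
def split_partition_ddl(table, partition, split_year, new_partition,
                        tablespace):
    """Generate an Oracle split-partition statement for the year-wise
    split described above. All names and values are illustrative."""
    return (
        f"ALTER TABLE {table} SPLIT PARTITION {partition} "
        f"AT (TO_DATE('01-JAN-{split_year}','DD-MON-YYYY')) "
        f"INTO (PARTITION {new_partition} TABLESPACE {tablespace} NOLOGGING, "
        f"PARTITION {partition}) "
        f"UPDATE INDEXES PARALLEL 8;"
    )

print(split_partition_ddl("XYZ", "P_BIG", 2010, "P_2009", "TS_DATA"))
```

Because the matching rows had already been deleted from the actual table, the new partition is created empty and the exchange with the pre-loaded test table does the data movement instead.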


Case Study: Table Compression

ABOUT THE ENGAGEMENT
The client is a media and entertainment giant engaged in the production and marketing of entertainment, news, and information products and services to a global customer base. The IT backbone supporting the organization is a complex network with a few thousand servers spread across multiple platforms, ranging from legacy to open source. Around 1,000 databases are currently in use for various applications.

BUSINESS NEED
Database tables grow in size over time with archival data. There is a need to reduce the size of the databases and tables, and in turn the backup size, which will improve the manageability of the database.

APPROACH
- Identify the tables which are read-mostly.
- Compress those tables using the ALTER TABLE ... COMPRESS option, which compresses the data at the database block level.

CHALLENGES
- Identify the tables which are read-mostly and can be compressed.
- Update and insert operations on a table after compression do not benefit from the compression (direct load is the exception).
- Sign-off from the Application Teams.

SCOPE
- Release space from the mostly-read-only tables.
- Improve I/O performance.

BUSINESS VALUE DELIVERED
- Up to 3x disk savings.
- Faster full-scan and range-scan operations.
- Reduced network traffic, because Oracle compresses and decompresses the data, so the network packets are significantly smaller.
- Up to 3x reduction in backup size.
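Picking the read-mostly candidates and emitting the compression DDL can be scripted. A sketch, assuming read/write counts are available from monitoring (e.g. segment statistics); the table names and the 95% read-ratio threshold are illustrative:

```python
def compression_ddl(table_stats, min_read_ratio=0.95):
    """From {table: (reads, writes)} monitoring data, pick the
    read-mostly tables and emit block-level compression DDL."""
    ddl = []
    for table, (reads, writes) in sorted(table_stats.items()):
        total = reads + writes
        if total and reads / total >= min_read_ratio:
            # MOVE rewrites the existing blocks so old rows get
            # compressed too, not just future direct loads.
            ddl.append(f"ALTER TABLE {table} MOVE COMPRESS;")
    return ddl

stats = {"SALES_HIST": (9900, 10), "ORDERS_LIVE": (5000, 4000)}
print(compression_ddl(stats))
```

Note that MOVE invalidates the table's indexes, so a real script would append the corresponding `ALTER INDEX ... REBUILD` statements as well.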


Case Study: Tempdb Threshold Change

ABOUT THE ENGAGEMENT
The client is a media and entertainment giant engaged in the production and marketing of entertainment, news, and information products and services to a global customer base. The IT backbone supporting the organization is a complex network with a few thousand servers spread across multiple platforms, ranging from legacy to open source. Around 1,000 databases are currently in use for various applications.

BUSINESS NEED
Huge number of alerts on tempdb log space usage. (Any data points on how many alerts?)

APPROACH
- Identified the number of servers and analyzed the alert pattern.
- Researched the vendor site and identified that the tempdb log file is truncated automatically by SQL Server itself.
- Increased the alert threshold on tempdb from 65% to 75%. (What did this do?)

CHALLENGES
- Analyze the huge number of alerts and identify the root cause.

SCOPE
- Reduce the number of alerts on tempdb log space on the servers.

BUSINESS VALUE DELIVERED
- Non-value work for the DB Team removed. (How much time saved?)
- Number of alerts for tempdb log file usage reduced by 95% {from ?}.
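The effect of the threshold change is easy to see against polled usage data: tempdb's log routinely hovers just above 65% before SQL Server truncates it itself, so moving the threshold to 75% silences the noise without hiding real growth. A sketch with hypothetical polling samples:

```python
def alerts_fired(usage_samples_pct, threshold_pct):
    """Count how many polled samples would page the DBA at a threshold."""
    return sum(1 for u in usage_samples_pct if u >= threshold_pct)

# Hypothetical polling data: the log hovers in the high 60s/low 70s
# because SQL Server truncates it before it becomes a real problem.
samples = [66, 68, 70, 72, 66, 67, 69, 74, 76, 71]
print(alerts_fired(samples, 65), alerts_fired(samples, 75))
```

Replaying the historical samples through the candidate threshold, as here, is a cheap way to predict the alert reduction before changing the monitoring config.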


Back-up Mechanism Revamp of Sybase Servers

Current Backup Mechanism
- Backup mechanism used: Sybase uncompressed backups, with the files gzipped afterwards.
- Stripes: 13 files for each database.
- Number of jobs per server: 1 job for each database (e.g. 10 jobs for 10 app databases), 1 job for clearing old backup files, and 1 job for all system databases.

Root Cause Analysis
- The job does not have a proper error-identification mechanism.
- No notification of job success or failure.
- Log files are overwritten each day.

Problem
- Time utilized for backing up is huge (e.g. a 65 GB database takes up to 4 hours to back up).
- Total number of Ctrl-M jobs for 5 servers: 136.
- Inconsistent back-up mechanism.
- Multiple jobs running for back-up.
- High utilization of system resources.
- No tracking mechanism.

[Chart: before vs. after comparison of number of jobs, servers, stripes, and time taken (mins).]

Solution: Single Globalized Back-up Script
- Solution leveraged from the Genpact DB Team.
- A single globalized back-up script reduces complexity and uses a robust back-up process to reduce system-resource utilization.
- A single script to alert, rotate, compress, and purge (as per the NBCU log-retention policy).
- Effective utilization of the time to run the back-ups.
- Automated dashboard (capacity vs. utilization) and trend analysis of the log growth rate.

Before
- Mechanism used: Sybase uncompressed backups; gzips the files.
- 13 files (stripes) for each database.
- Total Ctrl-M jobs for 5 servers: 136.
- Time taken is very high (e.g. a 65 GB database takes up to 4 hours for a complete backup).
- Time taken to back up one server is approx. 14+ hours.

After
- Mechanism used: Sybase native compressed backups.
- 6 files (stripes) for each database.
- Number of jobs per server: 1.
- Total Ctrl-M jobs for 5 servers: 5.
- Advantages:
  - Email alert with the job status.
  - The job has an error mechanism to report the error returned.
  - Time taken is less (e.g. a 65 GB database takes about 30 minutes to back up).
  - Log files are not overwritten; they are kept for 20 days, after which they are deleted by the script.
  - The time taken for each database backup is recorded.

A consistent and robust back-up script, saving on time and resources.
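The core of the single globalized script is building the native compressed, striped dump command per database instead of one hand-maintained job each. A sketch of that command generation, assuming Sybase ASE's `compress::level::path` dump syntax; the directory, file naming, and compression level are illustrative:

```python
def dump_command(database, backup_dir, stripes=6, compress_level=1):
    """Build a Sybase ASE native compressed dump with N stripes,
    as used by the revamped single script. Paths are illustrative."""
    files = [
        f"compress::{compress_level}::{backup_dir}/{database}.dmp.{i}"
        for i in range(1, stripes + 1)
    ]
    cmd = f'dump database {database} to "{files[0]}"'
    for f in files[1:]:
        cmd += f'\n    stripe on "{f}"'
    return cmd

print(dump_command("appdb", "/backups"))
```

Looping this generator over the server's database list is what collapses the per-database Ctrl-M jobs into the single job per server described above; the script's own log then records the alerting, rotation, and per-database timing around each dump.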
