
Title: Troubleshooting Backup Failures

Purpose: Document Netbackup Support function


Date: 10/18/2006
Your name: Anthony Nguyen
Description of Change: Document Creation
Effective date: 10/18/2006

Date: 10/30/2006
Your name: Anthony Nguyen
Description of Change: Changed escalation process per Ron Caplinger and Bryce Pier.
Effective date: 10/30/2006

Date: 11/29/2006
Your name: Jackie Schlitz
Description of Change: Added the following comments:
Whenever a new Netbackup alert appears in OVOU, please verify that it is not related to a
Sev 2 ticket or any other ticket.
If a backup job is re-started, please monitor it and keep the job id handy.
Effective date: 11/29/2006

INDEX

Responsibilities/Escalation
HPOV Alert
Netbackup Administration Console
Error Numbers and Resolution
Netbackup Troubleshooter
Restarting Failed Backups
Manually Stopping a Backup Job
Troubleshooting Windows NetBackup Clients
Troubleshooting Unix/Linux NetBackup Clients
Shutting off VSP (Veritas SnapShot Provider)
Netbackup Drives Alerts
Contacting IBM on hardware issues

Responsibilities/Escalation:
Infrastructure Support attempts to remediate all backup failures and creates a trouble ticket to
track the failures. Any failure that can't be resolved by Infrastructure Support is escalated to
L3-ENT-BACK. Normally, if the failure is a single failure, a Sev 3 Med ticket is sufficient. In the case
of error code 96 (Out of Media), send a Sev 3 Critical ticket to L3-ENT-BACK. If there are multiple
errors of the same kind, first follow normal troubleshooting procedures for those error types.
Examples: more than two tape drives are down and cannot be brought UP; both tape drives on the
same Media Server are down and cannot be brought UP; multiple 219 (Storage Unit Unavailable) or
multiple 84 (Media Write Error) errors. If still unable to resolve, send a Sev 3 Critical ticket to
L3-ENT-BACK.
Teradata Backups:
At this time, Teradata backups are the responsibility of the backup team. Teradata backups can
be distinguished by the policy they belong to: the policy name starts with "TD_".
Example: TD_DS32BKP, TD_DS35BKP, TD_1a_Inv_Item_Dict, etc.
For Teradata backups, create a Sev 3 ticket to L3-ENT-BACK.
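
The naming rule above can be sketched as a small check. This is only an illustration: the function name is hypothetical, and it assumes the "TD_" prefix is always upper case, as in the examples.

```python
def is_teradata_policy(policy_name):
    """Return True when the policy name carries the TD_ prefix (assumed case-sensitive)."""
    return policy_name.startswith("TD_")


# Policy names taken from the examples in this document.
for name in ("TD_DS32BKP", "TD_1a_Inv_Item_Dict", "BDC_Wintel_Prod_File_Z_Drive"):
    if is_teradata_policy(name):
        print(name, "-> Teradata backup: create a Sev 3 ticket to L3-ENT-BACK")
    else:
        print(name, "-> regular backup: follow normal troubleshooting")
```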

HPOV Alert
Infrastructure Support will get an alert when a backup fails: The alerts will appear in the OVOW
console with the message group NBup. The alerts will also appear in the OVOU console with the
message group DCTech. The alert will look like the following:
HPOV dxp11uxa.bestbuy.com NBU_JobFailure.log Entry: ds27bkup dvp03fc2z
BDC_Wintel_Prod_File_Z_Drive Monthly_Cumulative 3896202 3896202
10/09/2006 06:30:06 54 :timed out connecting to client on 10/09/2006 at
07:07:49

The alert provides the name of the media server (ds27bkup), the name of the client
(dvp03fc2z), the backup policy (BDC_Wintel_Prod_File_Z_Drive), the schedule being executed
(Monthly_Cumulative), the date & time the job started (10/09/2006 06:30:06), the error
number (54), the reason why the backup failed (timed out connecting to client), and the
date & time the job abended (10/09/2006 at 07:07:49).
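
As a rough illustration of how the fields line up, the sketch below splits the sample alert with a regular expression. It assumes every alert follows exactly the layout of the sample; the two numeric fields after the schedule are not explained in this document, so they are kept under neutral names (`id_1`, `id_2`). All function and field names are hypothetical.

```python
import re

# Pattern written against the single sample alert shown above (an assumption).
ALERT_PATTERN = re.compile(
    r"Entry:\s+(?P<media_server>\S+)\s+(?P<client>\S+)\s+(?P<policy>\S+)\s+"
    r"(?P<schedule>\S+)\s+(?P<id_1>\d+)\s+(?P<id_2>\d+)\s+"
    r"(?P<started>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+(?P<status>\d+)\s+"
    r":(?P<reason>.+?) on (?P<ended_date>\d{2}/\d{2}/\d{4}) "
    r"at (?P<ended_time>\d{2}:\d{2}:\d{2})"
)

SAMPLE_ALERT = """HPOV dxp11uxa.bestbuy.com NBU_JobFailure.log Entry: ds27bkup dvp03fc2z
BDC_Wintel_Prod_File_Z_Drive Monthly_Cumulative 3896202 3896202
10/09/2006 06:30:06 54 :timed out connecting to client on 10/09/2006 at
07:07:49"""


def parse_alert(text):
    """Return the alert fields as a dict, or None if the text does not match."""
    flat = " ".join(text.split())  # undo the line wrapping of the console message
    match = ALERT_PATTERN.search(flat)
    return match.groupdict() if match else None


fields = parse_alert(SAMPLE_ALERT)
print(fields["client"], fields["policy"], fields["schedule"], fields["status"])
```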
Since Netbackup retries failed backup jobs twice, you should only get one job failure alert after
the 2nd retry.
In addition to the backup job failure alerts, Infrastructure Support will also get alerts on Netbackup
Drives.
Whenever a new Netbackup alert appears in OVOU, please verify that it is not related to a Sev 2
ticket or any other ticket.

Netbackup Administration Console


Backup failures can be monitored via the Activity Monitor on the Netbackup Administration Console.
There are two ways to access the Netbackup Administration Console.
1. Java GUI - For the DC Tech Team, the Netbackup Administration Console via the
Java GUI has limited access. Please use the Java GUI to view the Activity Monitor
for troubleshooting backup failures.
2. Media Servers - The Netbackup Administration Console on the media servers will
give you admin access when you log on with your -a account. Use the Netbackup
Administration Console from the Media Servers to rerun failed backup jobs.

Error Numbers and Resolution


Some of the common issues that arise can be resolved. Once resolved, the job or jobs can
generally be restarted. The error numbers and their resolutions are listed below:
Hung Jobs - Jobs that are running but show no progress in several hours (or days).
1. Cancel the job.
2. Restart the job that has failed.
Error 1 Operation was partially successful. Most jobs that end with a status 1 will not alert but
there are some that are significant and will cause an alert. These are SQL Cluster backups and
Exchange Store backups.
1. If it's for a SQL Cluster backup, re-run the job. For Exchange backups, see Self
Correcting Jobs.
Error 6 Errors caused the user backup to fail.
1. This is an issue that happens when a database agent backup fails.
2. Database backups are run via RMAN and will reschedule on their own. If there are many
failures in a short time period, contact the Oracle DBA On-call (L2-ORACLE-DBA).
Error 12 File open failed.
1. A possible cause is a permission problem with the file. Check permissions on the file.
2. Restart the job.
Error 13 Errors caused the user backup to fail
1. Restart the job that has failed.
Error 25 Cannot connect on socket.
1. Verify that the Master Servers (dv10bkup, dxp11uxa, dxp10uxa) are in the server list on
the client.
a. Log into the client and launch the Backup, Archive, and Restore interface.
b. Click on File and select Specify Netbackup Machines and Policy Type.
c. Under Server List, verify dv10bkup, dxp11uxa, and dxp10uxa are listed. Also,
ensure dv10bkup is set to current.
2. Verify that the %SystemRoot%\system32\drivers\etc\services file has the following
entries:
a. bpcd 13782/tcp
b. bprd 13720/tcp
c. bpdbm 13721/tcp
3. Restart the NetBackup Client Service on the client.
4. Check to see if there is free disk space.
5. Restart the job.
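
The services-file check in step 2 can be sketched as follows. This is a minimal illustration, not a supported tool: the function name is hypothetical, and it assumes the usual services file layout of "name port/protocol" with optional "#" comments.

```python
# Entries this document expects to find in the services file.
EXPECTED_ENTRIES = {"bpcd": "13782/tcp", "bprd": "13720/tcp", "bpdbm": "13721/tcp"}


def missing_service_entries(services_text):
    """Return the expected NetBackup entries that are absent or have a different port."""
    found = {}
    for line in services_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2:
            found[fields[0]] = fields[1]
    return sorted(
        name for name, port in EXPECTED_ENTRIES.items() if found.get(name) != port
    )


# Sample content for illustration only; bpdbm is left out on purpose.
sample = """
bpcd    13782/tcp   # NetBackup client daemon
bprd    13720/tcp
"""
print(missing_service_entries(sample))
```

On the client itself, the text would come from reading the services file named in step 2; an empty result means all three entries are present.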
Error 52 Timed out waiting for the media manager to mount volume.
1. Verify the tape is available and the drives are up on the Media Server performing the
backup.
a. To determine which media server is performing the backup, go to the Netbackup
Administration Console and click on Activity Monitor. Look for the job ID of the
failed job. The media server performing the backup can be found under the
Media Server column.
b. To determine if the drives are up on the media server, go to the Netbackup
Administration Console and click on the toggle icon next to Media and Device
Management. Next, click on Devices. In the All Drives pane, find the media
server under the Device Host column. Click to highlight the media server and
verify that the drives attached to the media server are showing a status of UP under
the Drive Status column.
Note: If you do not see the Drive Status column, this means the column is
hidden. To unhide the columns:
i. From the menu bar, click on View and select Columns and select
Layout.
ii. Under Heading, click on Drive Status to highlight it.
iii. Click the Show Column icon or use CTRL-S to show the Drive
Status Column.
iv. Click OK.
c. If the drives are showing a status of Down, take note of the device host and
the drive name. Under Media and Device Management, click on Device
Monitor. Under the right pane, find the drive name. After finding the drive name,
click on it and select Up Drive.
2. Restart the job that has failed.
Error 54 Timed out connecting to client, General Network issue.
1. See Troubleshooting Windows NetBackup Clients or Troubleshooting Unix/Linux NetBackup
Clients.
2. Restart the job that has failed.
Error numbers 57 and 59 Client refused connection. This error occurs when a media server
is not listed in the client's approved server list.
1. In the Netbackup Administration Console:
a. Log into the client that had the error.
b. Launch the Backup, Archive, and Restore interface.
c. Click on File and select Specify Netbackup Machines and Policy Type
d. Verify the servers listed in the List of Netbackup Media Servers table are
included in the Server List and verify dxp11uxa, dxp10uxa, and dv10bkup are
also included.
e. If there are servers missing, manually add them one at a time.
f. Restart the job(s).
Error 71 None of the files in the file list exist. Files or directories are listed for backup in the
client's policy, but the files or directories listed do not exist on this client. Either the client's
configuration was changed and the directories are no longer there, or the policy was changed and
no longer references valid paths on this client.
1. Send an email to *IS NBUadmin for further analysis.
NOTE: There are a couple of servers that will get 71s all the time due to the nature of the
server and the backup policy. DXP12UXA is the primary one this will happen to and it can
be ignored.
Error numbers 83 through 86 These errors describe one of two common issues: Faulty Tape
or Hardware fault. Hardware faults are more common than bad tapes. Hardware faults do not
mean hardware failure.
1. The workaround is simply to suspend the tape, then restart the job.
a. To suspend a tape, go to the Netbackup Administration Console and click on the
Activity Monitor. Take note of the media server the client was using. The tape
will be listed in the media server's media catalog.
b. Open a command prompt and run the following command:
d:\veritas\netbackup\bin\admincmd\bpmedia -suspend -m <MEDIA ID> -h <MEDIA SERVER HOSTNAME>
For example: bpmedia -suspend -m B0100 -h ds25app
2. Send an email to *IS NBUadmin listing which tapes have been suspended.
3. If more than two tapes have an 84 error at one time, create a Sev 3 Critical ticket
to L3-ENT-BACK, as it usually indicates a larger failure that they need to know about.
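
To avoid typos when several tapes need suspending, the command line above can be assembled from its two inputs. The sketch below only builds and prints the command; the helper name is hypothetical, and the path and flags are exactly those given in this section.

```python
# Path as given in this document for the media servers.
BPMEDIA = r"d:\veritas\netbackup\bin\admincmd\bpmedia"


def suspend_command(media_id, media_server):
    """Build the bpmedia argument list for suspending one tape."""
    if not media_id or not media_server:
        raise ValueError("both the media ID and the media server hostname are required")
    return [BPMEDIA, "-suspend", "-m", media_id, "-h", media_server]


# Media ID and host from the example in this section.
print(" ".join(suspend_command("B0100", "ds25app")))
```

Keeping the arguments in a list means the same value could be handed to `subprocess.run` on a media server, but running it is left to the operator.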
Error 96 No more scratch tapes available in the library
1. Create a Sev 3 Critical trouble ticket for the Enterprise Backup & Recovery group (L3-ENT-BACK).
Error 134 Unable to process request because the server resources are busy
1. The available tape drives are all in use or the host is very busy. NetBackup will requeue the job
automatically and try the backup again.
Error 150 Process terminated by authorized user or process. For some reason, someone
with NetBackup Admin access has terminated the job or a process has failed.
1. Contact Enterprise Backup & Recovery group.
Error 155 Disk full - A disk is full on the server.
1. Connect to server and free up space on the drive.
2. Restart the job.
Error 196 Client backup was not attempted because backup window closed. This will occur if
the backup started AFTER the window closed. Cause: Possible queuing delay.
1. Restart the job that has failed. See Restarting Backup jobs via Backup Policies.
Error 219 Storage Unit is currently unavailable
1. Check for drives down. See Error 52.
2. Create a Sev 3 Critical trouble ticket for the Enterprise Backup & Recovery group (L3-ENT-BACK).
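
For quick reference, the error list above can be condensed into a lookup table. The sketch below is a paraphrase of this section, not official NetBackup text, and the names `FIRST_ACTION` and `first_action` are hypothetical.

```python
# First response per status code, paraphrased from the error list in this section.
FIRST_ACTION = {
    1: "Re-run SQL Cluster jobs; Exchange Store jobs are re-run automatically",
    6: "RMAN reschedules on its own; contact L2-ORACLE-DBA if there are many failures",
    12: "Check permissions on the file, then restart the job",
    13: "Restart the job",
    25: "Check server list, services file, client service and disk space, then restart",
    52: "Verify the tape is available and the drives are UP, then restart",
    54: "Troubleshoot the client, then restart the job",
    57: "Add missing servers to the client's server list, then restart",
    59: "Add missing servers to the client's server list, then restart",
    71: "Send an email to *IS NBUadmin",
    96: "Create a Sev 3 Critical ticket to L3-ENT-BACK",
    134: "No action; NetBackup requeues the job",
    150: "Contact the Enterprise Backup & Recovery group",
    155: "Free up space on the drive, then restart the job",
    196: "Restart the job via Manual Backup on the backup policy",
    219: "Check for down drives, then create a Sev 3 Critical ticket to L3-ENT-BACK",
}
# Codes 83 through 86 share one procedure.
for code in range(83, 87):
    FIRST_ACTION[code] = "Suspend the tape, restart the job, email *IS NBUadmin"


def first_action(status_code):
    """Return the first response for a status code, or point to the Troubleshooter."""
    return FIRST_ACTION.get(
        status_code, "Look up the code in the Netbackup Troubleshooter"
    )


for code in (54, 84, 96, 999):
    print(code, "->", first_action(code))
```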
Self Correcting Jobs
Some jobs will automatically re-run when they hit certain errors. The Exchange Store jobs (e.g.
Exchange_BDC_STG1) are monitored for a status of 1 and are automatically re-run, and an email
with a subject of Restart Notice is sent. Sometimes this process doesn't work well due to an
issue on the Exchange server and you'll see many restarts for the same servers in a short period
of time. If this occurs, contact the Enterprise Backup & Recovery group.

Netbackup Troubleshooter
Additional error codes and resolution steps can be found using the Troubleshooter within
Netbackup. To access the Troubleshooter, click on the hand/wrench icon. See Figure 1.

Figure 1
Enter the error code into the status code field and click Lookup (Figure 2). The Troubleshooter
will detail the problem and provide troubleshooting steps based on the error code you entered.
See Figure 3.

Figure 2

Figure 3

Restarting Failed Backups


There are two ways to restart a failed backup job: via the Activity Monitor or via backup
policies. An error code of 196 will require you to restart the backup job via the backup
policies.
Restart Backup Jobs via the Activity Monitor
Restarting failed backups can be done via the Netbackup Admin Console on the media servers.
To restart a backup,
1. Log into a media server and launch the Netbackup Admin Console.
2. Click on the Activity Monitor. All jobs will be displayed on the right pane.
3. Right-click on the right pane and select Filter. The Filters window will display.
4. Click on the empty cell under client and enter the client name of the failed backup
job. Click OK. All jobs for that client will display.
5. By default, Netbackup will automatically rerun a failed backup job three times.
Before rerunning the backup, verify the status of the backup. Netbackup may have
already kicked off the rerun and the rerun may have already finished successfully, or
the rerun may still be executing. If the rerun is still executing, wait for it to finish. If all
three reruns have failed, continue with step 6.
6. To rerun the backup, right-click on the backup and select Restart Job.
7. If the job starts and fails with error code 196, this means the backup you just started
was not attempted because the client's backup window is closed. If this is the case,
you will need to restart the backup via backup policies as detailed below.
Note:
1. If you restart a backup within the client's backup window, the rerun will pick up where the
last backup left off. If you restart a backup and the backup is outside the client's backup
window, the rerun will start from the beginning.
2. We don't want two identical backups running at the same time, as they will be in
contention with one another. If two identical backups are running, stop the backup that
was started most recently by following the instructions in the section Manually Stopping a
Backup Job.
3. If a backup job is re-started, please monitor it and keep the job id handy.
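
Note 2 can be illustrated with a small sketch that picks which job to stop when duplicates are running. It assumes that "identical" means the same client, policy, and schedule, and that the job details have been copied by hand from the Activity Monitor; the record layout, function name, job IDs, and times are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def jobs_to_cancel(active_jobs):
    """Return the job IDs of duplicates, keeping the earliest-started job in each group."""
    groups = defaultdict(list)
    for job in active_jobs:
        groups[(job["client"], job["policy"], job["schedule"])].append(job)
    cancel = []
    for jobs in groups.values():
        jobs.sort(key=lambda job: job["started"])
        cancel.extend(job["job_id"] for job in jobs[1:])  # everything after the first
    return cancel


# Made-up job records for illustration.
active = [
    {"job_id": 1001, "client": "dvp03fc2z", "policy": "BDC_Wintel_Prod_File_Z_Drive",
     "schedule": "Monthly_Cumulative", "started": datetime(2006, 10, 9, 6, 30)},
    {"job_id": 1002, "client": "dvp03fc2z", "policy": "BDC_Wintel_Prod_File_Z_Drive",
     "schedule": "Monthly_Cumulative", "started": datetime(2006, 10, 9, 7, 15)},
]
print(jobs_to_cancel(active))
```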
Restarting Backup jobs via Backup Policies
If you get an error code of 196 and you want to restart a backup, do the following:

1. The first thing you need to know is the backup policy and the schedule the failed
backup ran under. You can find this information from the HPOV alert or by looking at
the activity monitor and locating or filtering for the client.
2. Log into one of the media servers in the List of NetBackup Media Servers.
Choose the media server for the domain the client belongs to. Launch the Netbackup
Administration Console.
3. On the left pane, click on the + next to Policies.
4. Locate the backup policy from step 1.
5. Right click on the backup policy and select Manual Backup (Figure 4).

Figure 4
6. Select the schedule (from step 1) on the left pane and select the client from the right
pane. Click OK (Figure 5).

Figure 5
7. Verify the backup started via the activity monitor.
Note:

It is ok to restart incremental and full backups. If you restart a backup and users
complain about the performance of the server, please kill the backup and restart it at a
later time.

Manually Stopping a Backup Job


To manually stop a backup job:
1. Go to the Activity Monitor and find the job that needs to be cancelled.
2. Right-click on the job and select Cancel Job.
Note:

Be careful not to select Cancel All Jobs.

Troubleshooting Windows NetBackup Clients


Important notes:
If you notice someone logged into a media server, DO NOT log them out. If you need to
log into a media server, try a different one (see the list of media servers below).
1. Verify that the server pings.
2. Verify DNS entry (forward and backward) from the media server.
Forward: D:\veritas\netbackup\bin\bpclntcmd -hn <client>
Backward: D:\veritas\netbackup\bin\bpclntcmd -ip <client IP>
(It is an issue if the backward check resolves to another hostname. If you see
this issue or any other problems, escalate to L2-Network to resolve DNS entries.)
3. Verify connection from Media Server to client.
Enter command from any media server in the same domain as the client you are
attempting to verify:
D:\veritas\netbackup\bin\admincmd\bpgetconfig -M <client>
4. Verify connection from media server to client via bpcd.
Enter command from any media server in the same domain as the client you are
attempting to verify:
telnet <client> bpcd
This should result in a blank screen. Hit enter and the connection will close. If
this happens, there are no issues with the bpcd connection from the media
server to the client. However, if the connection closes right away without hitting
enter or if the connection does not close, there is an issue with the client. Use
CTRL+] to break the session.
5. If step 4 fails, try recycling the NetBackup service on the client and try step 4
again. If the NetBackup services fail to start, schedule a server reboot.
6. If the NetBackup services fail to start because the executable was not found, try
reinstalling Netbackup. Please send communication to *IS - NBUadmin to inform
them Netbackup was reinstalled on the server. The backup team will need to
modify the media server list, buffer size, time out settings, etc. on the client.
7. If needed, try kicking off a daily incremental backup on the server. Ensure you
have the correct policy of the client.
If there are any problems with steps 3-7 or if these steps seem to work fine but there is
still an issue, escalate to L3-ENT-BACK.
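
The forward and backward DNS check can also be approximated outside NetBackup. The sketch below uses the operating system resolver through Python's `socket` module, so it complements bpclntcmd rather than replacing it. Comparing only the short host names is an assumption, and the function names are hypothetical.

```python
import socket


def short_name(hostname):
    """Lower-case host label without the domain part."""
    return hostname.split(".")[0].lower()


def dns_round_trip(client, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve client -> IP -> name and report whether the two names agree."""
    ip = forward(client)
    reverse_name = reverse(ip)[0]  # gethostbyaddr returns (name, aliases, addresses)
    return short_name(reverse_name) == short_name(client), ip, reverse_name


# Stubbed demonstration with a made-up address, so no real DNS lookup happens here.
ok, ip, name = dns_round_trip(
    "dvp03fc2z",
    forward=lambda host: "192.0.2.10",
    reverse=lambda addr: ("dvp03fc2z.example.com", [], [addr]),
)
print(ok, ip, name)
```

Called with only the client name, the function uses the real resolver. A result of False corresponds to the case described above where the backward check resolves to another hostname.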

List of NetBackup Media Servers


NA Domain:
DS15BKUP, DS16BKUP, DS17BKUP, DS20BKUP, DS21BKUP, DS22BKUP, DS23BKUP,
DS24BKUP, DS25BKUP, DS26BKUP, DS27BKUP, DS28BKUP, RS20BKUP, RS21BKUP

DMZ Domain:
Prod DMZ: DS19BKUP, RS22BKUP
Legacy DMZ: DS29BKUP

Teradata:
DS30BKUP, DS31BKUP, DS32BKUP, DS33BKUP, DS34BKUP, DS35BKUP, DS37BKUP,
DS38BKUP

Troubleshooting Unix/Linux NetBackup Clients


1. Verify that the server pings.
2. Verify DNS entry (forward and backward) from the master server:
a. From the master server (dv10bkup):
Forward: /usr/openv/netbackup/bin/bpclntcmd -hn <client hostname>
Backwards: /usr/openv/netbackup/bin/bpclntcmd -ip <client IP address>
b. From the client server:
/usr/openv/netbackup/bin/bpclntcmd -pn
3. Verify connection from master server to client via bpcd.
a. From the master server:
telnet <client> bpcd
b. From the client server:
telnet <master server> bpcd
This should result in a blank screen. Hit enter and the connection will close. If
this happens, there are no issues with the bpcd connection between the master
server and the client. However, if the connection closes right away without hitting
enter or if the connection does not close, there is an issue with the client. Use
CTRL+] to break the session.
4. If step 3 fails, escalate to L3-ENT-BACK.
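
Where telnet is not available, a plain TCP connection test gives a first indication. The sketch below only checks that the bpcd port (13782, per the services entries in this document) accepts a connection; it does not reproduce the "blank screen, hit enter" behaviour of the telnet test, so treat a positive result as partial evidence. The function name is hypothetical.

```python
import socket

BPCD_PORT = 13782  # bpcd port as listed in the services file entries


def port_is_open(host, port=BPCD_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or name not resolvable
        return False
```

For example, `port_is_open("<client>")` run from the master server covers step 3a, and `port_is_open("<master server>")` run from the client covers step 3b.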

Shutting off VSP (Veritas SnapShot Provider)


If you notice that a VSP temp file is taking up space on a server, proceed to delete the file. This
was probably left over from previous installs as the new install does not delete this file.
If you can't delete the temp file because a system process has a lock on the file, this indicates
that VSP is running on the server. Use Process Explorer to identify this process and close the
handle the process has on the VSP temp file. Now you will be able to delete the VSP temp file
without rebooting the server. To shut off VSP on a server:
1. Log in to one of the media servers.
2. Launch the Netbackup Administration Console.

3. Click on the + next to Host Properties.
4. Click on Clients.
5. On the right pane, find the client that has VSP running.
6. Right-click on the client and select Properties. The client properties will appear
(Figure 6):

Figure 6
7. On the left pane, click the + next to Windows Client.
8. Click on VSP. The VSP snapshot Provider window will open (Figure 7):

Figure 7
9. In the field below VSP volume exclude list (drive letters separated by commas):
enter all drives on the server. (example: c,d,e,f,g)
10. Check Customize the cache sizes.
11. Select Cache size in MB.
12. Click OK.

Netbackup Drives Alerts


The Drive alerts are going through HPOV. The drive alerts and their remediation steps can be
found here.

Contacting IBM on hardware issues


If requested by the Backup Team to engage IBM on a hardware issue, IBM's contact information
is provided below:
IBM dispatch #: 800-426-7378
IBM Tape Library Model #: 3584
IBM Tape Drive Model #: 3592

Library             Serial #    Phone # Assoc. to Lib.
RDC prod (rtp01)    78A0347     (612) 670-2706
BDC prod (dtp01)    78A0308     (952) 324-1872
BDC test (dtt09)    78A1096     (952) 324-1872

FYI, you will need to give IBM the Phone # associated with the library, as well as the library model
# (3584, same for all libraries) and serial #. If the problem appears to be a tape drive or if IBM
asks, the tape drive model # we use is 3592.
