Group and Resource Failure Problems

What problem are you having?
• A resource fails, but is not brought back online.
• You cannot bring a resource online.
• You cannot bring the default physical disk resource online in Cluster Administrator.
• In Disk Management, you do not see the disk for the group that is online on that node.
• You are unable to manually move a group, or it does not fail over to another node when it is supposed to.
• A group failed over but did not fail back.
• The entire group failed and has not restarted.
• All nodes are functioning, but resources fail back repeatedly.
• The Cluster service does not successfully fail over resources.
• You fail over a resource group from one node to another, but it automatically fails back.
• The Network Name resource fails when you change to a system locale that is different than the input
language used by the Network Name resource.
• The Message Queuing resource fails to handle message activity correctly, which may result in resource
failures.
• A third-party resource fails to come online in a mixed-version cluster or while upgrading a cluster.
A resource fails, but is not brought back online.
Cause: A resource may depend on another resource that has failed.
Solution: In the resource Properties dialog box, make sure that the Do not restart check box is clear. If the
resource needs another resource to function, and if the second resource fails, confirm that the dependencies are
correctly configured.
You cannot bring a resource online.
Cause: The resource is not properly installed.
Solution: Make sure the application or service associated with the resource is properly installed.
Cause: The resource is not properly configured.
Solution: Make sure the properties are set correctly for the resource.
Cause: The resource is not compatible with server clusters.
Solution: Not all applications can be configured to fail over in a cluster. For more information, see Choosing
applications to run on a server cluster.
Cause: The resource is generating a specific error.
Solution: Review the system Event Log (look for ClusSvc entries under the Source column) to see if that resource
is generating a specific error message.
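If the event log has been exported to a comma-separated file, the ClusSvc entries can be picked out with a short script. The following Python sketch assumes a hypothetical export with Type, Source, and Description columns. The column names and the sample rows are illustrative only and may not match what your version of Event Viewer produces.

```python
import csv
import io


def clussvc_entries(csv_text, source="ClusSvc"):
    """Return the rows whose Source column matches the given source name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    wanted = source.lower()
    return [row for row in reader if row.get("Source", "").strip().lower() == wanted]


# Hypothetical sample export; a real export may use different columns.
SAMPLE = """Type,Source,Description
Error,ClusSvc,Cluster resource Disk Q: failed.
Information,Service Control Manager,The service entered the running state.
Warning,ClusSvc,Cluster network is unavailable.
"""

for entry in clussvc_entries(SAMPLE):
    print(entry["Type"], "-", entry["Description"])
```

Matching on the Source column only narrows the list. You still need to read each description to find the error that belongs to the failing resource.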
You cannot bring the default physical disk resource online in Cluster Administrator.
Most cluster configuration problems result from improper configuration of the shared storage bus or the restart of
servers.
Cause: You may not have restarted the servers after installing the Cluster service.
Solution: Make sure that you restarted all servers after installing the Cluster service.
When the servers are restarted, the signature of each disk in the cluster storage is read, and the registries are
updated with the signature information.
Cause: There may be hardware errors or transport problems.
Solution: Make sure that there are no hardware errors or transport problems.
Using Event Viewer (on the Start menu, under Programs and Administrative Tools (Common)), look in the
event log for disk I/O error messages or indications of problems with the communications transport.
Cause: You may not have waited long enough for the registries to be updated.
Solution: Make sure that you waited long enough for the registries to be updated.
Cluster Administrator takes a backup of the registry when it starts up. However, it can take up to a minute after
the second server restarts for the disk signatures to be written to the registries. Wait a minute, and then click
Refresh.
Cause: One or more adapters on the shared storage bus are configured incorrectly.
Solution: Make sure that the adapters are configured correctly.
Cause: The shared storage bus exceeds the maximum cable length.
Solution: Make sure that the shared storage bus does not exceed the maximum cable length.
Cause: The disk is not supported, or the disk hardware or firmware revision level is outdated.
Solution: Make sure that the disk is supported, and that the disk hardware or firmware revision level is current.
Cause: The bus adapter is not supported, or the adapter hardware or firmware revision level is outdated.
Solution: Make sure that the bus adapter is supported, and that the adapter hardware or firmware revision level
is current.
Cause: If you move your storage bus adapter to another I/O slot, add or remove bus adapters, or install a new
version of the bus adapter driver, the cluster software may not be able to access disks on your shared storage bus.
Solution: To accommodate these changes, make sure that your shared storage bus adapter has been properly
reconfigured.
Cause: The operating system is incorrectly configured to access the shared storage bus.
Solution: Verify that the operating system can detect the shared storage bus adapter.
In Disk Management, you do not see the disk for the group that is online on that node.
Where?
• Computer Management/Storage/Disk Management
Cause: You may not be looking at the right disks.
Solution: Make sure that you are looking at the right disks.
If you have not labeled your disks or assigned fixed drive letters to them, you may not recognize which disks are
part of the cluster and which ones are not. Label your disks in a meaningful manner and assign fixed drive letters
to all partitions.
Cause: There may have been hardware problems.
Solution: Make sure that there have not been any hardware problems.
Run Event Viewer and check for disk I/O error messages or indications of hardware problems.
You are unable to manually move a group, or it does not fail over to another node when it is supposed
to.
Cause: The failover node may not be designated as a possible owner for all resources in the group that you want
to fail over.
Solution: Make sure that the failover node is designated as a possible owner for all resources in the group you
want to fail over.
Check the ownership configuration in the group resource Properties dialog box. If the node is not set as a possible
owner for all resources in the group, the node cannot own the group, so failover will not occur. To fix this, make
the node a possible owner for all resources in the group.
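The possible-owner rule can be expressed as a simple check. The following Python sketch assumes a hypothetical in-memory description of a group (resource name mapped to its set of possible owner nodes). It does not query a real cluster. The node and resource names are made up for illustration.

```python
def missing_possible_owner(group_resources, node):
    """Return the names of resources that do not list node as a possible owner."""
    return sorted(
        name for name, owners in group_resources.items() if node not in owners
    )


# Hypothetical group: resource name -> set of possible owner nodes.
group = {
    "Disk Q:": {"NODE1", "NODE2"},
    "IP Address": {"NODE1", "NODE2"},
    "Network Name": {"NODE1"},
}

# Any resource listed here prevents the whole group from moving to NODE2.
blockers = missing_possible_owner(group, "NODE2")
print(blockers)
```

An empty result means that the node is a possible owner of every resource in the group, so ownership is not what is blocking the move.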
Cause: A resource in the group may be continually failing.
Solution: Determine if a resource in the group is continually failing.
If the node can, it will bring the resource back up without failing over the group. If the resource continually fails
but does not fail over, make sure that the resource property Restart and affect the group is selected. Also,
check the Restart Threshold and Restart Period settings, which are also in the resource Properties dialog box.
A group failed over but did not fail back.
Cause: A group fails back only if the node it was running on failed and then rejoined the cluster. If the group
failed but the node did not, the group fails over to another node but does not fail back to the original node.
Cause: The failback policies of both the group and the resources may not be properly configured.
Solution: Make sure that the Prevent failback check box is clear in the group Properties dialog box. If the
Allow failback check box is selected, be sure to wait long enough for the group to fail back. Check these settings
for all affected resources within a group. Because groups fail over as a whole, one resource that is prevented from
failing back affects the entire group.
Cause: The node to which you want the group to fail back is not configured as the preferred owner of the group.
Solution: Make sure that the node to which you want the group to fail back is configured as the preferred owner
of the group. If not, the Cluster service leaves the group on the node to which it failed over.
The entire group failed and has not restarted.
Cause: A node is offline.
Solution: Make sure that the node is not offline.
If the node on which the group had been running is offline, check that another node is a possible owner of the
group and of every resource in the group.
Cause: The group has failed repeatedly.
Solution: The group may have exceeded its failover threshold or its failover period. Try to bring the resources
online individually (following the correct sequence of dependencies) to determine which resource is causing the
problem. Or, create a temporary resource group (for testing purposes) and move the resources to it, one at a time.
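To work out a correct sequence for bringing resources online, order them so that each resource comes after everything it depends on. The following Python sketch does this with a topological sort from the standard library (Python 3.9 or later). The resource names and the dependency map are hypothetical; substitute the dependencies shown in your own resource Properties dialog boxes.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: resource -> set of resources it depends on.
dependencies = {
    "Physical Disk": set(),
    "IP Address": set(),
    "Network Name": {"IP Address"},
    "File Share": {"Physical Disk", "Network Name"},
}


def online_order(deps):
    """Return the resources ordered so that dependencies come first."""
    return list(TopologicalSorter(deps).static_order())


order = online_order(dependencies)
for step, resource in enumerate(order, start=1):
    print(step, resource)
```

Bringing the resources online one at a time in this order means that the first resource that fails to come online is a likely candidate for the cause of the group failure.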
All nodes are functioning, but resources fail back repeatedly.
Cause: Power may be intermittent or failing.
Solution: Ensure that your power is not intermittent or failing. You can correct this by using an uninterruptible
power supply (UPS) or, if possible, by changing power companies.
The Cluster service does not successfully fail over resources.
Cause: The cluster storage device is not properly configured.
Solution: Verify that the cluster storage device is properly configured and that all cables are properly connected.
You fail over a resource group from one node to another, but it automatically fails back.
Cause: One or more resources fail to come online on the new node.
Solution: Use a process of elimination to determine which resource is failing to come online. For more
information, see article Q303431, "Explanation of Why Server Clusters Do Not Verify that Resources will Work
Properly on All Nodes" in the Microsoft Knowledge Base.
The Network Name resource fails when you change to a system locale that is different than the input
language used by the Network Name resource.
Cause: The system locale must be the same on all nodes of a cluster and on the computer used to connect to the
cluster.
Solution: Change the system locale. For more information, see Connect to a cluster with Cluster Administrator.
The Message Queuing resource fails to handle message activity correctly, which may result in resource
failures.
Cause: Each instance of Message Queuing on a server maps 4 MB of the system view space when handling
message activity. This results in a default limit of three active, working instances of Message Queuing on a cluster
node. In a server cluster with three Message Queuing resources, a node could have four concurrent Message
Queuing services running (the service running on the local node plus the three services associated with the
Message Queuing resources.) In this scenario, message activity could be limited, resulting in resource failures.
Solution: Increase the system view space memory pool on each node of a server cluster with three or more
Message Queuing resources. (We also recommend that you increase the system view space memory pool even for
nodes running fewer than three Message Queuing resources.)
• Open Registry Editor.
• Open the registry key
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management.
• Create a new DWORD value called SystemViewSize.
• Calculate and enter the data for this DWORD value using the following formula: (16 + (the number of
Message Queuing resources x 4)).
For example, the calculation result for a cluster with three Message Queuing resources is 28.
• Reboot each node.
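The formula in the steps above can be checked with a few lines of code. The following Python sketch only does the arithmetic (16 plus 4 for each Message Queuing resource). The function name is an illustrative choice, and the script does not read or write the registry.

```python
def system_view_size(msmq_resources):
    """Return the SystemViewSize data: 16 + (number of Message Queuing resources x 4)."""
    if msmq_resources < 0:
        raise ValueError("the number of Message Queuing resources cannot be negative")
    return 16 + msmq_resources * 4


# Print the value to enter for a few cluster sizes.
for count in (1, 3, 5):
    print(count, "Message Queuing resources ->", system_view_size(count))
```

For three Message Queuing resources the function returns 28, which matches the example given in the steps.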
A third-party resource fails to come online in a mixed-version cluster or while upgrading a cluster.
Cause: If a resource uses a cryptographic provider not supplied by Microsoft to export (encrypt) and import
(decrypt) resource data (cluster and cluster application cryptographic checkpoints), the default encryption key
lengths may be different in the Windows 2000 and the Windows Server 2003 family operating systems. The result
is that the resource might fail to come online and the cluster and event logs might contain cryptographic
checkpoint synchronization errors for that resource.
Solution: Use the cluster.exe "CSP" private property to set the key length and effective key length for the third-
party cryptographic provider that encrypts and decrypts data for the failing resource type.
• Open Command Prompt.
• Type cluster ClusterName "CSP"=key_length,effective_key_length:MULTISTR
ClusterName is the name of the cluster, CSP is the name of the cryptographic provider, and key_length
and effective_key_length are the key and effective key lengths for the RC2 encryption algorithm, in bits.
For more information on using cluster.exe, see Cluster.
• Depending on the resource, either bring the resource online or recreate the resource to add the new
cryptographic checkpoint.
Note
• Review the documentation for your cryptographic provider to obtain valid values for the following RC2
encryption algorithm parameters: key_length and effective_key_length. Also review the cryptographic
provider documentation for the correct procedure for adding the cryptographic checkpoint.
For information about how to obtain product support, see Technical support options.
