General administrative problems

What problem are you having?

The Cluster service fails and the node cannot detect the network.

An IP address added to a group in the cluster fails.

An IP address resource is unresponsive when taken offline, for example you are unable to query its properties.

You receive the error: "RPC server is unavailable."

Cluster Administrator cannot open a connection to a node.

An application starts but cannot be closed.

A resource group has failed over but will not fail back.

All nodes appear to be functioning correctly, but you cannot access all of the drives from one node.

Cluster Administrator update delays.

Cluster Administrator stops responding when a node fails.

Cannot connect to cluster from recent file list.

Node performance is sluggish and node fails.

The cluster log contains numerous resource informational messages (for example, Entered LooksAlive, Entered
Open, Entered Offline).

The Cluster service fails to start and returns an error code of ERROR_SHARING_VIOLATION (32) with event ID
1144 (NM_EVENT_REGISTER_NETWORK_FAILED).

You cannot manually restore the cluster database on a local node by copying the systemroot\cluster\CLUSDB file
from another node.

The Cluster service fails and the node cannot detect the network.

In this case, you probably have a configuration problem. Check the following:

• Cause: Have you made any configuration changes recently?

Solution: If the node was recently configured, or if you have installed some resource that required you to
restart the computer, make sure that the node is still properly configured for the network.

• Cause: Is the node properly configured?

Solution: Check that the server is properly configured for TCP/IP. Also check that the appropriate
services are running. If the node recently failed, its resources fail over; however, if the other nodes
are also misconfigured, the failover will be inadequate and client access will fail.

An IP address added to a group in the cluster fails.

• Cause: The Internet Protocol (IP) address is not unique.

Solution: The IP address must be different from every other group IP address and every other IP address
on the network.

• Cause: The IP address is not a static IP address.

Solution: The IP addresses must be statically assigned outside of a DHCP scope, or they must be
reserved by the network administrator.
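
The two conditions above (the address is unique, and it is either outside the DHCP scope or reserved) can be checked before the resource is created. The following is a minimal sketch, not part of the product: the function name `check_cluster_ip`, its parameters, and the idea of passing in lists of in-use and reserved addresses are all assumptions made for illustration.

```python
import ipaddress

def check_cluster_ip(candidate, in_use, dhcp_start, dhcp_end, reserved=()):
    """Return a list of problems with a proposed IP Address resource value.

    All inputs are supplied by the caller (hypothetical); nothing is
    queried from the network or from the cluster.
    """
    ip = ipaddress.ip_address(candidate)
    problems = []
    # The address must differ from every other address already on the network.
    if ip in {ipaddress.ip_address(a) for a in in_use}:
        problems.append("address is already in use")
    # The address must sit outside the DHCP scope unless it has been reserved.
    in_scope = ipaddress.ip_address(dhcp_start) <= ip <= ipaddress.ip_address(dhcp_end)
    if in_scope and ip not in {ipaddress.ip_address(a) for a in reserved}:
        problems.append("address is inside the DHCP scope and not reserved")
    return problems
```

An empty list means that neither of the two causes described above applies to the proposed address.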

An IP address resource is unresponsive when taken offline, for example you are unable to query its
properties.

• Cause: You may not have waited long enough for the resource to go offline.

Solution: If an IP Address resource is unresponsive when taken offline, make sure that you wait long
enough for the resource to go offline.

Certain resources take time to go offline. For example, it can take up to three minutes for the IP Address
resource to go fully offline.
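
If you script this check, a polling loop with a timeout of about three minutes avoids reporting a failure too early. The sketch below is illustrative only: `get_state` is a hypothetical callable that you would supply to return the current resource state, and the state string "Offline" is an assumption about how that callable reports it.

```python
import time

def wait_until_offline(get_state, timeout_s=180.0, poll_s=5.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until it returns "Offline" or the timeout elapses.

    get_state is a caller-supplied (hypothetical) function; this sketch
    does not talk to the Cluster service itself.
    """
    deadline = clock() + timeout_s
    while True:
        if get_state() == "Offline":
            return True          # the resource finished going offline
        if clock() >= deadline:
            return False         # still not offline after the allowed wait
        sleep(poll_s)
```

The default of 180 seconds reflects the three-minute figure given above; only a `False` result after that period suggests a problem other than impatience.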

You receive the error: "RPC server is unavailable."

• Cause: The server may not be operational, or the Cluster service and the RPC services may not be running.

Solution: If you receive the error "RPC Server is unavailable," make sure the server is operational and
that both the Cluster service and the RPC services are running. Also, check the name resolution of the
cluster; it is possible that you are using the wrong name or that the name is not being properly resolved
by WINS or DNS.

Cluster Administrator cannot open a connection to a node.

• Cause: The node may not be running.

Solution: If Cluster Administrator cannot open a connection to a node, make sure that the node is
running. If it is, confirm that both the Cluster service and the RPC services are running.

An application starts but cannot be closed.

• Cause: You may not have taken a resource offline using Cluster Administrator.

Solution: When you bring resources online using Cluster Administrator, you must also take those
resources offline using Cluster Administrator; do not attempt to close or exit the application from the
application interface.

A resource group has failed over but will not fail back.

• Cause: The hardware and network configurations may not be valid.

Solution: Make sure that the hardware and network configurations are valid.

If any interconnect fails, failover can occur because the Cluster service does not detect a heartbeat, or it
may not even register that the node was ever online. In this case, the Cluster service fails over the
resources to the other nodes in the server cluster, but it cannot fail back because that node is still down.

• Cause: The resource group may not have been configured to fail back immediately, or you are not troubleshooting the problem within the allowable failback hours for the resource group.

Solution: Make sure that the resource group is configured to fail back immediately, or that you are
troubleshooting the problem within the allowable failback hours for the resource group.

A group can be configured to fail back only during specified hours. Often, administrators prevent failback
during peak business hours. To check this, use Cluster Administrator to view the resource failback policy.

• Cause: You restarted the node to test the failover policy for the group instead of pressing the reset button.

Solution: Make sure that you press the reset button on the node. The resource group will not fail back to
the preferred node if you shut down and then restart the node. For more information on testing failback
policies, see Test node failure.
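
When a group is restricted to specified failback hours, it helps to confirm that the current hour actually falls inside the configured window, especially when the window spans midnight. The following sketch shows one way to reason about it. The function name, the use of -1 for both values to mean "fail back immediately", and the inclusive treatment of the start and end hours are assumptions for illustration; confirm the actual policy in Cluster Administrator.

```python
def failback_allowed(hour, window_start, window_end):
    """Return True if failback is permitted at the given hour (0-23).

    Assumptions (not confirmed by this document): -1/-1 means no window
    is configured, and both boundary hours are treated as inside the window.
    """
    if window_start == -1 and window_end == -1:
        return True  # no window configured: treat failback as immediate
    if window_start <= window_end:
        return window_start <= hour <= window_end
    # The window wraps past midnight, for example 22 to 5.
    return hour >= window_start or hour <= window_end
```

For example, with a window of 22 to 5, `failback_allowed(14, 22, 5)` returns `False`, which matches the situation described above where a group has failed over but does not fail back during business hours.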

All nodes appear to be functioning correctly, but you cannot access all of the drives from one node.

• Cause: The shared drive may not be functioning.

Solution: Confirm that the shared drive is still functioning.

Try to access the drive from another node. If you can, check the cable between the device and the node
that cannot access it. If the cable is not the problem, restart the computer and then try
again to access the device. If you still cannot access the drive, check your configuration.

• Cause: The drive has completely failed.

Solution: Determine (from another node) whether the drive is functioning at all. You may have to restart
the drive (by restarting the computer) or replace the drive.

The hard disk with the resource or a dependency for the resource may have failed. You may have to
replace a hard disk. You may also have to reinstall the cluster.

Cluster Administrator update delays.

• Cause: If you run Cluster Administrator from a remote computer, it may not display the correct (updated) cluster state when the cluster network name fails over from one node to another node. This can result in
Cluster Administrator displaying a node as being online, when it is actually offline.

Solution: To work around this problem, restart Cluster Administrator.

You can avoid this problem by connecting to clusters through node names. However, if the node you are
connected to fails, Cluster Administrator stops responding until the RPC connection times out.

Cluster Administrator stops responding when a node fails.

• Cause: Cluster Administrator may be slow in performing dynamic updates.

Solution: If Cluster Administrator stops responding when a node fails, make sure that Cluster
Administrator is not just slow in doing dynamic updates. If the Cluster service is running on a remaining
node, Cluster Administrator is either not responding or is updating very slowly. There are two ways to see
if the Cluster service is running on a remaining node:

• Use the TCP/IP Ping utility to ping the cluster name on a remaining node.

• In Control Panel, double-click Services, and check whether the Cluster service is running.
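
The Services check can also be made from a command prompt with `sc query clussvc`, and the result can be interpreted in a script. The sketch below parses text in the layout that `sc query` normally prints; the function name is hypothetical, and the exact output layout should be treated as an assumption to verify on your own system.

```python
import re

def service_is_running(sc_query_output):
    """Return True if `sc query` output reports the service state as RUNNING.

    Expects a line such as "STATE : 4  RUNNING" somewhere in the text
    (assumed layout; verify against real output on the node).
    """
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_query_output, re.MULTILINE)
    return bool(match) and match.group(1) == "RUNNING"
```

On a node, the input could be captured with `subprocess.run(["sc", "query", "clussvc"], capture_output=True, text=True).stdout`. A `True` result on a remaining node suggests that Cluster Administrator is slow or not responding, rather than that the cluster itself is down.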

Cannot connect to cluster from recent file list.

• Cause: Files listed in the Cluster Administrator recent file list (both on the File menu and in the Open Connection to Cluster dialog box) have the cluster name appended to the path. For example, instead of
Webclust1, the recent file list may list C:\Windows\Cluster\Webclust1. This problem occurs when Microsoft
Visual C++ version 5.0 is installed.

Solution: To work around this problem, manually type the cluster name when you open the connection.

Node performance is sluggish and node fails.

• Cause: The CPU may be overloaded.

Solution: Check that your node is not processor-bound; that is, that the CPU is not running at 100 percent
utilization. If you try to run too many resources for the node capacity, you can overload the CPU.

Also, review the size of your paging file. If the paging file is too small, the Cluster service can detect this
as a node failure and fail over the groups.

The cluster log contains numerous resource informational messages (for example, Entered LooksAlive,
Entered Open, Entered Offline).

• Cause: One or more of your Generic Script resources fills the cluster log with multiple copies of Entered LooksAlive, Entered Open, Entered Offline messages.

Solution: When creating a script for a Generic Script resource, do not use the LogInformation method
when calling the LooksAlive function. For more information, see the Microsoft Platform Software
Development Kit (SDK).

The Cluster service fails to start and returns an error code of ERROR_SHARING_VIOLATION (32) with
event ID 1144 (NM_EVENT_REGISTER_NETWORK_FAILED).

• Cause: The Internet Assigned Numbers Authority (IANA)-assigned port (3343) used by the cluster network driver (ClusNet) is bound to another process, preventing the Cluster service from starting.

Solution: Use port scanning and process termination utilities to identify and end the process that is bound
to port 3343.

To do this:

1. Open Command Prompt.

2. Navigate to the %systemroot%\system32 directory.

3. Type netstat -a -o.

This will display all listening and connected ports and the process ID of each process bound to that port.
Port 3343 will appear for each cluster network on the node.

Notes

• The -a option displays all connections and listening ports. Server clusters use UDP, so the ports normally appear as listening rather than as connections.

• The -o option displays the owning process ID.

4. Type tasklist.

This will display the IDs for all the processes running on the node, including the process ID that matches
the Cluster service (ClusSvc.exe).

5. Type taskkill /pid ID to terminate the process(es) bound to port 3343 that do not match the ID for the Cluster service.
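
The comparison in steps 3 through 5 can be sketched as a small script that reads saved netstat output and lists the process IDs bound to port 3343 other than the Cluster service. This is an illustration only: the function names are hypothetical, and the parser assumes that ports appear in numeric form (for example, by also passing the -n option, which is not part of the steps above).

```python
def pids_on_port(netstat_output, port=3343):
    """Return the set of PIDs whose local address ends with the given port."""
    pids = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with the protocol; UDP rows have no State column,
        # so the PID is always taken from the last field.
        if len(fields) < 4 or fields[0] not in ("TCP", "UDP"):
            continue
        local_address, pid = fields[1], fields[-1]
        if local_address.rsplit(":", 1)[-1] == str(port) and pid.isdigit():
            pids.add(int(pid))
    return pids

def pids_to_terminate(netstat_output, cluster_service_pid, port=3343):
    """PIDs bound to the port that do not belong to the Cluster service."""
    return pids_on_port(netstat_output, port) - {cluster_service_pid}
```

Here `cluster_service_pid` is the process ID that tasklist reports for ClusSvc.exe. Review each remaining ID against the tasklist output before you pass it to taskkill, because ending the wrong process can disrupt other services on the node.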

You cannot manually restore the cluster database on a local node by copying the
systemroot\cluster\CLUSDB file from another node.

• Cause: If the cluster registry hive is already locked and loaded by the Cluster service, the operating system will prevent you from copying a local CLUSDB file or overwriting an existing CLUSDB file on
another node.

Solution: Stop the Cluster service. Then unload the HKEY_LOCAL_MACHINE\Cluster hive before
restoring the cluster database file.

To do this:

1. Open Command Prompt.

2. Type net stop clussvc to stop the Cluster service.

3. Use the Registry Editor to unload the hive under HKEY_LOCAL_MACHINE\Cluster. For more information, see Unload a hive from the registry.

The operating system will now allow you to copy the CLUSDB file from a node and manually restore it to
another node.

For information about how to obtain product support, see Technical support options.