Clustering is a complex technology with lots of messy details. To make it easier to understand, let's
take a look at the big picture of how clustering works.
Although a SQL Server 2005 cluster can support up to eight nodes, clustering actually only occurs
between two nodes at a time. This is because a single SQL Server 2005 instance can only run on a
single node at a time, and should a failover occur, the failed instance can only fail over to one other
node. This adds up to two nodes. Clusters of three or more nodes are only used when you need to
cluster multiple SQL Server instances.
In a two-node SQL Server 2005 cluster, one of the physical server nodes is referred to as the active
node, and the other one is referred to as the passive node. It doesn't matter which physical server in
a cluster is designated as the active or the passive, but it is easier, from an administrative point of
view, to go ahead and assign one node as the active and the other as the passive. This way, you won't
get confused about which physical server is performing which role at the current time.
When we refer to an active node, we mean that this particular node is currently running an active
instance of SQL Server 2005 and that it is accessing the instance's databases, which are located on a
shared disk array. When we refer to a passive node, we mean that this particular node is not
currently in production and it is not accessing the instance's databases. When the passive node is not
in production, it is in a state of readiness, so that if the active node fails and a failover occurs, it can
automatically go into production and begin accessing the instance's databases located on the shared
disk array. In this case, the passive node then becomes the active node, and the formerly active node
becomes the passive node (or the failed node, should a failure occur that prevents it from operating).
So what is a shared disk array? Unlike non-clustered SQL Server 2005 instances, which usually store
their databases on locally attached disk storage, clustered SQL Server 2005 instances store data on a
shared disk array. By shared, we mean that both nodes of the cluster are physically connected to the
disk array, but that only the active node can access the instance's databases. There is never a case
where both nodes of a cluster are accessing an instance's databases at the same time. This is to
protect the databases from becoming corrupted.
Generally speaking, a shared disk array is a SCSI- or fiber-connected RAID 5 or RAID 10 disk array
housed in a stand-alone unit, or it might be a SAN. This shared array must have at least two logical
partitions. One partition is used for storing the clustered instance's SQL Server databases, and the
other is used for the quorum.
When both nodes of a cluster are up and running, participating in their relevant roles (active and
passive), they communicate with each other over the network. For example, if you change a
configuration setting on the active node, this configuration change is automatically sent to the passive
node and the same change is made there. This generally occurs very quickly, and ensures that both
nodes are synchronized.
But, as you might imagine, it is possible that the active node fails after you make a change on it, but
before the change is sent over the network and made on the passive node (which will become the
active node after the failover), so the change never reaches the passive node. Depending on the
nature of the change, this could cause problems, even causing both nodes of the cluster to fail.
To prevent this from happening, a SQL Server 2005 cluster uses what is called a quorum, which is
stored on the quorum drive of the shared array. A quorum is essentially a log file, similar in concept to
database logs. Its purpose is to record any change made on the active node. Should any change
recorded here fail to reach the passive node because the active node has failed and cannot send the
change over the network, then the passive node, when it becomes the active node, can read the
quorum file, find out what the change was, and make the change before it goes into production.
In order for this to work, the quorum file must reside on what is called the quorum drive. A quorum
drive is a logical drive on the shared array devoted to the function of storing the quorum.
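The quorum's role can be sketched as a simple replay log. This is an illustration under invented names, not the real on-disk quorum format: the active node records each change in the quorum log before applying it, and the node taking over replays that log before going into production.

```python
# Minimal sketch (invented names, not the real quorum format).
quorum_log = []   # stands in for the log file on the quorum drive

def change_on_active(active_config, key, value):
    quorum_log.append((key, value))   # record the change on the quorum drive first
    active_config[key] = value        # then apply it locally
    # ...the network send to the passive node can be lost if the active
    # node fails right here.

def become_active(passive_config):
    # The passive node, becoming active, replays the quorum log so that no
    # recorded change is lost, then goes into production.
    for key, value in quorum_log:
        passive_config[key] = value
    return passive_config

active = {}
change_on_active(active, "max_memory_mb", 4096)  # active node fails before sending
new_active = become_active({})                   # the quorum preserves the change
assert new_active == active
```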
Each node of a cluster must have at least two network cards. One network card is connected to the
public network, and the other to the private network.
The public network is the network to which the SQL Server 2005 clients are attached, and this is how
they access the clustered instance.
The private network is used solely for communications between the nodes of the cluster. It is used
mainly for what is called the heartbeat signal. In a cluster, the active node puts out a heartbeat signal,
which tells the other nodes in the cluster that it is working. Should the heartbeat signal stop, a
passive node in the cluster becomes aware that the active node has failed, and that it should at this
time initiate a failover so that it can become the active node and take control of the SQL Server
2005 instance.
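A rough sketch of heartbeat monitoring follows, assuming an invented `PassiveNode` class and a hypothetical timeout value; in a real cluster the Windows Cluster Service performs this detection, not application code.

```python
import time

HEARTBEAT_TIMEOUT = 5.0   # hypothetical: seconds of silence before assuming failure

class PassiveNode:
    """Tracks the last heartbeat seen on the private network."""
    def __init__(self):
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        # Called each time a heartbeat arrives from the active node.
        self.last_heartbeat = time.monotonic()

    def active_has_failed(self, now=None):
        # True once the heartbeat has been silent longer than the timeout,
        # at which point this node would initiate a failover.
        now = time.monotonic() if now is None else now
        return (now - self.last_heartbeat) > HEARTBEAT_TIMEOUT

node = PassiveNode()
node.on_heartbeat()
assert not node.active_has_failed()
# Simulate the heartbeat stopping for longer than the timeout:
assert node.active_has_failed(now=node.last_heartbeat + HEARTBEAT_TIMEOUT + 1)
```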
One of the biggest mysteries of clustering is how clients know when and how to switch from
communicating with a failed cluster node to the new active node. And the answer may be a
surprise: they don't. That's right; SQL Server 2005 clients don't need to know anything about specific
nodes of a cluster (such as the NETBIOS name or IP address of individual cluster nodes). This is
because each clustered SQL Server 2005 instance is given a virtual name and IP address, which
clients use to connect to the cluster. In other words, clients don't connect to a node's specific name or
IP address, but instead connect to a virtual name and IP address that stays the same no matter which
node is active.
When you create a cluster, one of the steps is to create a virtual cluster name and IP address. This
name and IP address are used by the active node to communicate with clients. Should a failover occur,
the new active node uses this same virtual name and IP address to communicate with clients.
This way, clients only need to know the virtual name or IP address of the clustered instance of SQL
Server, and a failover between nodes doesn't change this. At worst, when a failover occurs, there may
be an interruption of service from the client to the clustered SQL Server 2005 instance, but once the
failover has occurred, the client can once again reconnect to the instance using the same virtual name
or IP address.
While there can be many different causes of a failover, let's look at the case where the power stops for
the active node of a cluster and the passive node has to take over. This will provide a general idea of
how a failover works.
Let's assume that a single SQL Server 2005 instance is running on the active node of a cluster, and
that a passive node is ready to take over when needed. At this time, the active node is communicating
with both the database and the quorum on the shared array. Because only a single node at a time can
communicate with the shared array, the passive node is not communicating with the database or
the quorum. In addition, the active node is sending out heartbeat signals over the private network,
and the passive node is monitoring them to see if they stop. Clients are also interacting with the active
node via the virtual name and IP address, running production transactions.
Now, for whatever reason, the active node stops working because it is no longer receiving any
electricity. The passive node, which is monitoring the heartbeats from the active node, now notices
that it is not receiving the heartbeat signal. After a predetermined delay, the passive node assumes
that the active node has failed and it initiates a failover. As part of the failover process, the passive
node (now the active node) takes over control of the shared array and reads the quorum, looking for
any unsynchronized configuration changes. It also takes over control of the virtual server name and IP
address. In addition, as the node takes over the databases, it has to perform a SQL Server startup,
just as if it were starting after a shutdown, going through a database recovery. The time
this takes depends on many factors, including the speed of the system and the number of transactions
that might have to be rolled forward or back during the database recovery process. Once the recovery
process is complete, the new active node announces itself on the network with the virtual name and
IP address, which allows the clients to reconnect and begin using the SQL Server 2005 instance with
minimal interruption.
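The failover sequence just described can be summarized as ordered steps. This is a sketch with invented function names; in reality the cluster service performs this work, not a script.

```python
def perform_failover(log):
    # Order matters: the node must own the shared array and be consistent
    # before it advertises the virtual name to clients.
    log("take control of the shared disk array")
    log("replay the quorum log for unsynchronized changes")
    log("take over the virtual server name and IP address")
    log("start SQL Server and run database recovery (roll forward/back)")
    log("announce the virtual name and IP address on the network")

events = []
perform_failover(events.append)
assert events[0].startswith("take control")
assert events[-1].startswith("announce")
```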
That's the big picture of how SQL Server 2005 clustering works. If you are new to SQL Server
clustering, it is important that you understand these basic concepts before you begin to drill down into
the details. In later articles, I will discuss, in great detail, how to plan, build, and administer a SQL
Server 2005 cluster.