
GPFS NSD Server Design and Tuning

Yuri Volobuev

GPFS Development

Version 1.0, November 2015


Background
One of the key GPFS functions is the ability to ship IO requests to remote nodes using the Network
Shared Disk (NSD) protocol. The latter is a client-server protocol, with NSD servers providing access to
storage that is visible on servers as local block devices. A typical GPFS cluster consists of a small
number of NSD servers and a large number of NSD clients. Naturally, the NSD server function is critical
to the operation of a GPFS cluster, and tuning NSD server operation is of considerable interest to GPFS
administrators.

The core NSD functionality dates back to the early days of GPFS. The so-called Traditional NSD server
model has a fairly simple design: it allows multiple simultaneously active NSD servers for the same
NSD, and does not involve any read or write caching on the server side. In the Traditional model, an
NSD server can be thought of as a smart wire, which basically channels IO requests from NSD clients to
the underlying storage device. A significant change to the NSD server operation was introduced by the
advent of the GPFS Native RAID (GNR) capability. A GNR server, which also acts as an NSD server, can be
thought of as a RAID controller implemented in software. A GNR server caches IO requests, including
those arriving through the NSD protocol RPCs, and stipulates a single active server for a given GNR-
backed NSD. The dynamics of the GPFS pagepool memory usage and NSD worker thread model are
quite different between the Traditional and GNR NSD server models. As a result, NSD tuning may
need to be done differently, depending on the NSD model in use.

NSD Server Operation Overview


NSD clients and servers communicate using a family of GPFS RPCs. An RPC originates on an NSD client,
and is sent to an NSD server using the standard GPFS RPC framework, with TCP/IP, VERBS, or a
combination of the two used for data transfer over the network. Since it is advantageous to have a large
number of IO requests issued in parallel, an NSD server has to be able to cope with very high NSD RPC
average volumes and even higher peak loads. To keep RPC traffic flowing, and not run out of threads
and temporary buffer memory during bursts of IO, incoming NSD RPC requests are queued after being
received, and then processed by a host of NSD Worker threads.

Handling an NSD IO request requires a memory buffer of at least the size of the IO request. Such
buffers come out of GPFS pagepool. In order to avoid distributed deadlocks, a fraction of the pagepool
memory on an NSD server is dedicated to processing incoming NSD IO requests. The size of that
fraction, in percentage points, is configurable through the nsdBufSpace (30% by default) configuration
parameter on Traditional NSD servers, and nsdRAIDBufferPoolSizePct (70% by default) on GNR NSD
servers. The default nsdBufSpace setting is appropriate for a node that is not dedicated to NSD serving
and processes application IO workloads in addition to serving NSD requests. On a dedicated NSD server,
it is recommended to increase nsdBufSpace to its maximum allowed setting (70%).
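
As a rough illustration of this split, the following sketch (in Python, with hypothetical helper names; it
is simple arithmetic, not GPFS code) shows how much pagepool memory ends up reserved for NSD request
processing under the default and the maximum nsdBufSpace settings:

def nsd_buffer_space_bytes(pagepool_bytes, nsd_buf_space_pct):
    # Approximate pagepool memory set aside for incoming NSD IO requests.
    return pagepool_bytes * nsd_buf_space_pct // 100

GiB = 1 << 30

# Non-dedicated Traditional NSD server: default nsdBufSpace of 30% of a 16 GiB pagepool.
print(nsd_buffer_space_bytes(16 * GiB, 30) / GiB)   # ~4.8

# Dedicated Traditional NSD server tuned to the maximum allowed 70%.
print(nsd_buffer_space_bytes(16 * GiB, 70) / GiB)   # ~11.2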

The buffering of incoming NSD IO requests is done differently on Traditional and GNR NSD servers. On a
Traditional NSD server, there is no caching, and a temporary pagepool buffer is used only for copying
data out of network buffers (or an RDMA transfer), and submitting an IO request to disk; the temporary
buffer is reused for serving a different IO request as soon as the NSD IO request is handled. On a GNR
NSD server, the GNR fraction of the pagepool is used as a general-purpose cache, and regular GNR
buffers are used for processing NSD IO requests; the content of an NSD IO buffer persists in pagepool
after the original NSD IO request is handled. This has significant implications for resource management
on NSD server nodes. The algorithms to calculate various parameters of NSD server operation are
different for the Traditional and GNR NSD server cases.

NSD Threading Model, GPFS 3.4 and Older


In GPFS 3.4 and older versions, a fairly simple request processing model was used. All incoming NSD IO
requests were put on a single global queue, and a host of NSD Worker threads was used to take
requests off the queue and process them. The number of NSD Worker threads was calculated as a
product of nsdThreadsPerDisk configuration parameter (3 by default) and the number of NSDs served
on the current node, with a minimum of nsdMinWorkerThreads (16 by default) and a maximum of
nsdMaxWorkerThreads (64 by default). This model was acceptable in some scenarios, but was
problematic for configurations with a large number of CPU cores and low disk IO latency. In the latter
case, the contention for locks protecting the global NSD request queue could have a significant impact
on the overall GPFS performance. A small pool of reserved buffers of maxBlockSize size was allocated at
GPFS startup time, and used for NSD IO request processing. However, for heavier workloads, the
reserved buffers would often not be sufficient, and NSD Worker threads would need to procure working
buffers through the regular pagepool buffer allocation code path. The overhead of temporary buffer
allocation could be problematic under certain workloads, too.
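
For reference, the pre-3.5 worker thread count can be summarized with a short sketch (Python; the
parameter names correspond to the configuration options above, and the clamping is an approximation of
the described behavior, not the actual GPFS code):

def old_desired_threads(nsd_threads_per_disk, n_server_disks,
                        nsd_min_worker_threads=16, nsd_max_worker_threads=64):
    # Pre-3.5 model: threads-per-disk times served NSDs, clamped to [min, max].
    desired = nsd_threads_per_disk * n_server_disks
    return max(nsd_min_worker_threads, min(desired, nsd_max_worker_threads))

# With the defaults (nsdThreadsPerDisk = 3) and 10 served NSDs: 30 worker threads.
print(old_desired_threads(3, 10))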

NSD Threading Model, GPFS 3.5 and Later


In GPFS 3.5, the NSD server code went through a major overhaul. To improve SMP scalability, the global
NSD request queue has been supplanted by an array of independently lockable queues. To avoid the
overhead of temporary buffer allocation in the Traditional NSD server scenario, each NSD Worker thread
is statically assigned a pagepool buffer. This has led to major improvements in SMP scalability and
better performance, but has also made the NSD configuration picture substantially more complicated.

In general, an NSD Worker thread has to be prepared to handle an IO request of any size up to
the maxBlockSize configuration parameter. When a large number of NSD Worker threads is desired,
allocating a buffer of maxBlockSize size for each one may require a very significant amount of pagepool
memory. However, experience suggests that for larger request sizes fewer worker threads are typically
needed to saturate the underlying disk IO subsystem than for smaller IO requests. On the other hand,
small IO requests may require very high worker thread numbers to max out the disk subsystem. So
when the number of NSD Worker threads needed is large, the majority of them may not need very large
pagepool buffers assigned, which creates an opportunity to optimize pagepool usage on Traditional
NSD servers by splitting NSD Worker threads into Large and Small pools, with a corresponding split of
NSD IO request queues. Large and Small queues and NSD Workers are used to service NSD IO requests
of different sizes. An NSD IO request is considered to be Large if its size exceeds nsdSmallBufferSize
(65536 bytes by default). On Traditional NSD servers, Large NSD Worker threads are permanently
assigned a buffer of maxBlockSize size, while Small NSD Worker threads are assigned a buffer of
nsdSmallBufferSize. Note that the need to have two pools of NSD Worker threads stems from the need
to have a pagepool buffer assigned to each thread, which is only the case for Traditional NSD servers.
On GNR NSD servers, NSD Worker threads do not have pre-assigned buffers, and utilize the GNR buffer
subsystem for dynamic buffer allocation.
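
The request classification and the per-thread buffer sizing on a Traditional server can be sketched as
follows (Python; a simplified illustration with hypothetical helper names, not GPFS code):

NSD_SMALL_BUFFER_SIZE = 65536   # nsdSmallBufferSize default, in bytes

def is_large_request(io_size_bytes, nsd_small_buffer_size=NSD_SMALL_BUFFER_SIZE):
    # A request goes to the Large queues and workers if it exceeds nsdSmallBufferSize.
    return io_size_bytes > nsd_small_buffer_size

def traditional_worker_buffer_size(large_worker, max_block_size,
                                   nsd_small_buffer_size=NSD_SMALL_BUFFER_SIZE):
    # On a Traditional NSD server, each Large worker permanently holds a maxBlockSize
    # buffer and each Small worker an nsdSmallBufferSize buffer; GNR workers hold none.
    return max_block_size if large_worker else nsd_small_buffer_size

print(is_large_request(262144))                        # True: 256 KiB > 64 KiB
print(traditional_worker_buffer_size(True, 8 << 20))   # 8388608 (8 MiB) for a Large worker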

The algorithm that calculates Large and Small worker thread counts and the corresponding queue
parameters is complicated, far beyond what can be explained by a simple formula. The complexity
stems from the desire to find an optimal combination of thread counts and the resulting pagepool buffer
counts, under the constraints of the available pagepool memory. When the amount of pagepool
memory available for NSD use is not plentiful, the job of deriving good parameters gets considerably
harder. The goal of the algorithm is to split the overall desired number of worker threads into the
Large (nsdLargeWorkerThreads) and Small (nsdSmallWorkerThreads)
pools, in such a way that the nsdSmallWorkerThreads / nsdLargeWorkerThreads ratio is very close to the
nsdSmallThreadRatio configuration parameter (7 by default), while also trying to prevent
nsdLargeWorkerThreads from being unexpectedly low for users transitioning from older versions of
GPFS, and possibly accustomed to the levels of performance provided by a larger number of NSD
Worker threads servicing large IO requests.

As with older versions of GPFS, the starting point in calculating the desired overall number of NSD
Worker threads is a product of nsdThreadsPerDisk and the number of NSDs served on the current node.
As before, the overall number of NSD Worker threads is bracketed by nsdMinWorkerThreads (16 by
default), and nsdMaxWorkerThreads (512 by default). So the 3 configuration parameters mentioned
above generally preserve their existing semantics. The similarity with older versions ends there.

The initial proposed values for the number of Large NSD Worker threads (nsdLargeWorkerThreads) and
Small NSD Worker threads (nsdSmallWorkerThreads) are calculated to be the best fit for these
conditions (to the accuracy allowed by integer math):

nsdSmallWorkerThreads + nsdLargeWorkerThreads = nsdThreadsPerDisk × nServerDisks

nsdSmallWorkerThreads / nsdLargeWorkerThreads = nsdSmallThreadRatio
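
The initial split can be illustrated with a short sketch (Python; a simplified rendering of the two
conditions above, not the actual GPFS code):

def initial_split(nsd_threads_per_disk, n_server_disks, nsd_small_thread_ratio=7):
    # total = nsdThreadsPerDisk x number of served NSDs; pick Large so that
    # Small / Large is as close to nsdSmallThreadRatio as integer math allows.
    total = nsd_threads_per_disk * n_server_disks
    large = max(1, total // (nsd_small_thread_ratio + 1))
    small = total - large
    return large, small

# Defaults (3 threads per disk, ratio 7) with 32 served NSDs: 96 threads in total,
# split into 12 Large and 84 Small workers.
print(initial_split(3, 32))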

For the Traditional NSD server scenario, the results are then checked against the available amount of
pagepool memory (nsdBufSpace / 100 × pagepool). The amount of pagepool memory needed by all NSD
worker threads is:

nsdSmallWorkerThreads × nsdSmallBufferSize + nsdLargeWorkerThreads × maxBlockSize

If the desired amount of pagepool memory is not available, nsdLargeWorkerThreads is scaled down so
that all Large buffers fit in the memory available, and nsdSmallWorkerThreads is recalculated as
nsdSmallThreadRatio × nsdLargeWorkerThreads. At this point, nsdLargeWorkerThreads is compared to
the desired number of threads calculated using the pre-3.5 algorithm (oldDesiredThreads). If the former
is greater than the latter, no further adjustments are necessary. Otherwise, we revise the thread counts
again, to avoid a drop in performance for those users who are migrating from pre-3.5 levels of GPFS, and
have a predominantly large IO workload. In the GNR case, not being hampered by memory constraints,
we simply make nsdLargeWorkerThreads equal to oldDesiredThreads, and decrease
nsdSmallWorkerThreads by the same amount that we increase nsdLargeWorkerThreads. In the
Traditional case, we have to employ a complex two-step procedure that attempts to find the best fit for
at least nsdLargeWorkerThreads equal to oldDesiredThreads, and as many nsdSmallWorkerThreads as
possible (but at least as many as nsdLargeWorkerThreads). As a result of this computation, the actual
ratio of nsdSmallWorkerThreads / nsdLargeWorkerThreads may become different from the original
target of nsdSmallThreadRatio. Next, the overall worker thread count is checked against
nsdMaxWorkerThreads, and if the latter is exceeded, both thread counts are scaled down
proportionally. The overall pagepool memory demand is calculated again, in the Traditional case, and if
the demand still exceeds the amount of space available, thread counts are simply scaled down
proportionally to make buffers fit in. Finally, we calculate the number of Large and Small IO request
queues using the nsdThreadsPerQueue configuration parameter. The resulting total number of queues
is checked against the nsdMultiQueue configuration parameter (256 by default), and queue counts are
scaled down proportionally if the total number of queues exceeds nsdMultiQueue. At this point, all NSD
server parameters are known, NSD Worker threads are spawned, and the corresponding pagepool
buffers are allocated.
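
The overall flow of these adjustments for the Traditional case is approximated by the sketch below. This
is a deliberately simplified Python rendering of the steps described above, not the actual GPFS code: in
particular, the two-step best-fit procedure and the queue bookkeeping are rough stand-ins, and the
function and parameter names (other than the configuration options themselves) are hypothetical. Note
also that this document does not state a default for nsdThreadsPerQueue, so it is passed in explicitly.

import math

def traditional_nsd_params(nsd_threads_per_disk, n_server_disks,
                           pagepool_bytes, nsd_buf_space_pct, max_block_size,
                           nsd_threads_per_queue,
                           nsd_small_buffer_size=65536, nsd_small_thread_ratio=7,
                           nsd_min_worker_threads=16, nsd_max_worker_threads=512,
                           nsd_multi_queue=256):
    # Overall desired thread count, bracketed by nsdMinWorkerThreads/nsdMaxWorkerThreads.
    total = min(max(nsd_threads_per_disk * n_server_disks,
                    nsd_min_worker_threads), nsd_max_worker_threads)

    # Initial split, aiming for Small / Large close to nsdSmallThreadRatio.
    large = max(1, total // (nsd_small_thread_ratio + 1))
    small = total - large

    # Check the split against the pagepool fraction available for NSD buffers.
    avail = pagepool_bytes * nsd_buf_space_pct // 100
    if small * nsd_small_buffer_size + large * max_block_size > avail:
        # Scale Large down so its buffers fit, then recompute Small from the ratio.
        large = max(1, min(large, avail // max_block_size))
        small = nsd_small_thread_ratio * large

    # Guard against a drop relative to the pre-3.5 count (oldDesiredThreads);
    # the real code uses a more elaborate two-step best-fit here.
    old_desired = min(max(nsd_threads_per_disk * n_server_disks,
                          nsd_min_worker_threads), 64)
    if large < old_desired:
        large = old_desired
        small = max(small, large)

    # Clamp the overall count at nsdMaxWorkerThreads, scaling both pools.
    if large + small > nsd_max_worker_threads:
        scale = nsd_max_worker_threads / (large + small)
        large, small = max(1, int(large * scale)), max(1, int(small * scale))

    # Re-check the memory demand; scale down proportionally if still short.
    demand = small * nsd_small_buffer_size + large * max_block_size
    if demand > avail:
        scale = avail / demand
        large, small = max(1, int(large * scale)), max(1, int(small * scale))

    # Derive queue counts and cap their total at nsdMultiQueue.
    large_q = math.ceil(large / nsd_threads_per_queue)
    small_q = math.ceil(small / nsd_threads_per_queue)
    if large_q + small_q > nsd_multi_queue:
        scale = nsd_multi_queue / (large_q + small_q)
        large_q, small_q = max(1, int(large_q * scale)), max(1, int(small_q * scale))

    return dict(large_threads=large, small_threads=small,
                large_queues=large_q, small_queues=small_q)

# Example: 64 served NSDs, 16 GiB pagepool at nsdBufSpace = 70%, 8 MiB maxBlockSize,
# 3 threads per queue (an assumed value for illustration).
print(traditional_nsd_params(3, 64, 16 << 30, 70, 8 << 20, 3))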

This process takes place on GPFS startup, and also when the pertinent configuration parameters are
adjusted dynamically.

When Does NSD Subsystem Need Tuning?


The GPFS NSD server code is intended to operate well without requiring manual tuning.
The complexity of the corresponding algorithms is largely driven by the desire to calculate optimal
values for various internal parameters automatically. Unfortunately, the flip side of this is the
considerable difficulty of making manual adjustments. It is not straightforward to predict how tweaking
certain input configuration parameters is going to translate into final operational parameters. The most
basic input parameters in this space are nsdBufSpace and pagepool. As with most caches, bigger is
better. Increasing the overall pagepool size, and increasing nsdBufSpace on dedicated NSD servers, is
straightforward and simply beneficial. Besides the obvious performance benefits, having
more pagepool memory available for NSD code consumption greatly simplifies other NSD configuration
aspects. Beyond those two parameters, things get more complicated, and manual tuning is not
recommended, unless advised by IBM Service. If more NSD Worker threads are needed, the simplest
way to accomplish this is to increase nsdThreadsPerDisk, and nsdMaxWorkerThreads if necessary. If
enough pagepool is available, increasing nsdThreadsPerDisk should increase both Large and Small NSD
Worker thread counts. If the workload composition on a given cluster is well known, it may be desirable
to change the default nsdSmallThreadRatio, to allow for more efficient pagepool utilization. It should
be kept in mind, however, that internal GPFS metadata processing always employs fairly small IOs. The
code allows making nsdSmallWorkerThreads as small as nsdLargeWorkerThreads but no lower. Since
the overall amount of pagepool memory assigned to Small NSD Worker threads is fairly small, this
should not be a practical concern.
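
As an illustration of why nsdMaxWorkerThreads may need to be raised along with nsdThreadsPerDisk, the
arithmetic below (Python; a simplified stand-in for the full derivation described earlier) shows the
overall desired thread count hitting the default ceiling:

def desired_total_threads(nsd_threads_per_disk, n_server_disks,
                          nsd_min_worker_threads=16, nsd_max_worker_threads=512):
    # Desired total: threads per disk times served NSDs, clamped to [min, max].
    return min(max(nsd_threads_per_disk * n_server_disks,
                   nsd_min_worker_threads), nsd_max_worker_threads)

# With 128 served NSDs, raising nsdThreadsPerDisk from 3 to 6 is capped by the
# default nsdMaxWorkerThreads of 512; raising the ceiling as well yields 768.
print(desired_total_threads(3, 128))                                # 384
print(desired_total_threads(6, 128))                                # 512
print(desired_total_threads(6, 128, nsd_max_worker_threads=1024))   # 768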

Future work
Many of the currently available NSD Server configuration parameters can be hard to change effectively.
For a long time, many of the parameters have not been documented. One may reasonably wonder:
Why? This is a complex question, and answering it would go beyond the scope of the current
document. In short, the current situation represents a transitional state, and there is clearly much work
to be done to make things better. The problems with the current design are well understood. There is
clearly a need to provide tunables that directly control the allocation of worker threads and other
resources (as opposed to giving inputs to a complex algorithm) for advanced users, and also to provide
much more detailed feedback about the inner workings of the NSD server code, by providing a way to look
at the internal state using supported tools. Various NSD server performance metrics must be available
for monitoring using native GPFS tools and external monitoring frameworks.
