Yuri Volobuev
GPFS Development
The core NSD functionality dates back to the early days of GPFS. The so-called Traditional NSD server
model has a fairly simple design: it allows multiple simultaneously active NSD servers for the same
NSD, and involves no read or write caching on the server side. In the Traditional model, an
NSD server can be thought of as a smart wire that channels IO requests from NSD clients to
the underlying storage device. A significant change to NSD server operation was introduced by the
advent of the GPFS Native RAID (GNR) capability. A GNR server, which also acts as an NSD server, can be
thought of as a RAID controller implemented in software. A GNR server caches IO requests, including
those arriving through NSD protocol RPCs, and stipulates a single active server for a given GNR-
backed NSD. The dynamics of GPFS pagepool memory usage and the NSD worker thread model are
quite different between the Traditional and GNR NSD server models. As a result, NSD tuning may
need to be done differently, depending on the NSD model in use.
Handling an NSD IO request requires a memory buffer at least the size of the request. Such
buffers come out of the GPFS pagepool. In order to avoid distributed deadlocks, a fraction of the pagepool
memory on an NSD server is dedicated to processing incoming NSD IO requests. The size of that
fraction, in percentage points, is configurable through the nsdBufSpace (30% by default) configuration
parameter on Traditional NSD servers, and nsdRAIDBufferPoolSizePct (70% by default) on GNR NSD
servers. The default nsdBufSpace setting is appropriate for a node that is not dedicated to NSD serving,
and processes application IO workloads in addition to serving NSD requests. On a dedicated NSD server,
it is recommended to increase nsdBufSpace to its maximum allowed setting (70%).
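As a quick illustration, the sketch below computes the pagepool slice reserved for NSD IO under the percentages just described. The helper and variable names are illustrative only, not actual GPFS identifiers:

```python
# Hypothetical helper (not a GPFS API): compute the pagepool slice
# reserved for incoming NSD IO requests, given the percentage tunable.
def nsd_buffer_space(pagepool_bytes, pct):
    return pagepool_bytes * pct // 100

GIB = 1024 ** 3
pagepool = 16 * GIB   # example pagepool size

traditional = nsd_buffer_space(pagepool, 30)  # default nsdBufSpace
dedicated = nsd_buffer_space(pagepool, 70)    # recommended maximum on a dedicated server
gnr = nsd_buffer_space(pagepool, 70)          # default nsdRAIDBufferPoolSizePct
```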
The buffering of incoming NSD IO requests is done differently on Traditional and GNR NSD servers. On a
Traditional NSD server, there is no caching, and a temporary pagepool buffer is used only for copying
data out of network buffers (or an RDMA transfer), and submitting an IO request to disk; the temporary
buffer is reused for serving a different IO request as soon as the NSD IO request is handled. On a GNR
NSD server, the GNR fraction of the pagepool is used as a general-purpose cache, and regular GNR
buffers are used for processing NSD IO requests; the content of an NSD IO buffer persists in pagepool
after the original NSD IO request is handled. This has significant implications for resource management
on NSD server nodes. The algorithms to calculate various parameters of NSD server operation are
different for the Traditional and GNR NSD server cases.
In general, an NSD Worker thread has to be prepared to handle an IO request of any size, up to the
maxBlockSize configuration parameter. When a large number of NSD Worker threads is desired,
allocating a buffer of maxBlockSize size for each one may require a very significant amount of pagepool
memory. However, experience suggests that for larger request sizes fewer worker threads are typically
needed to saturate the underlying disk IO subsystem than for smaller IO requests. On the other hand,
small IO requests may require very high worker thread numbers to max out the disk subsystem. So
when the number of NSD Worker threads needed is large, the majority of them may not need very large
pagepool buffers assigned, which allows for an opportunity to optimize pagepool usage on Traditional
NSD servers by splitting NSD Worker threads into Large and Small pools, with a corresponding split of
NSD IO request queues. Large and Small queues and NSD Workers are used to service NSD IO requests
of different sizes. An NSD IO request is considered to be Large if its size exceeds nsdSmallBufferSize
(65536 bytes by default). On Traditional NSD servers, Large NSD Worker threads are permanently
assigned a buffer of maxBlockSize size, while Small NSD Worker threads are assigned a buffer of
nsdSmallBufferSize. Note that the need to have two pools of NSD Worker threads stems from the need
to have a pagepool buffer assigned to each thread, which is only the case for Traditional NSD servers.
On GNR NSD servers, NSD Worker threads do not have pre-assigned buffers, and utilize the GNR buffer
subsystem for dynamic buffer allocation.
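The request classification and the Traditional-server buffer sizing described above can be sketched as follows; the function names are invented for illustration, and the maxBlockSize value is just an example:

```python
# Sketch with assumed names: classify an NSD IO request and pick the
# pagepool buffer size a Traditional NSD server worker thread holds.

NSD_SMALL_BUFFER_SIZE = 65536      # nsdSmallBufferSize default
MAX_BLOCK_SIZE = 4 * 1024 * 1024   # example maxBlockSize setting

def is_large_request(io_size):
    # A request is Large if its size exceeds nsdSmallBufferSize.
    return io_size > NSD_SMALL_BUFFER_SIZE

def traditional_buffer_size(io_size):
    # Large workers permanently hold a maxBlockSize buffer; Small
    # workers hold an nsdSmallBufferSize buffer. (GNR servers instead
    # allocate buffers dynamically from the GNR buffer subsystem.)
    return MAX_BLOCK_SIZE if is_large_request(io_size) else NSD_SMALL_BUFFER_SIZE
```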
The algorithm that calculates Large and Small worker thread counts and the corresponding queue
parameters is complicated, far beyond what can be explained by a simple formula. The complexity
stems from the desire to find an optimal combination of thread counts and the resulting pagepool buffer
counts, under the constraints of the available pagepool memory. When the amount of pagepool
memory available for NSD use is not plentiful, the job of deriving good parameters gets considerably
harder. The goal of the algorithm is to split the overall desired number of worker threads into the
Large (nsdLargeWorkerThreads) and Small (nsdSmallWorkerThreads)
pools, in such a way that the nsdSmallWorkerThreads / nsdLargeWorkerThreads ratio is very close to the
nsdSmallThreadRatio configuration parameter (7 by default), while also trying to prevent
nsdLargeWorkerThreads from being unexpectedly low for users transitioning from older versions of
GPFS, and possibly accustomed to the levels of performance provided by a larger number of NSD
Worker threads servicing large IO requests.
As with older versions of GPFS, the starting point in calculating the desired overall number of NSD
Worker threads is a product of nsdThreadsPerDisk and the number of NSDs served on the current node.
As before, the overall number of NSD Worker threads is bracketed by nsdMinWorkerThreads (16 by
default), and nsdMaxWorkerThreads (512 by default). So the 3 configuration parameters mentioned
above generally preserve their existing semantics. The similarity with older versions ends there.
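The starting-point computation can be sketched as below. The nsdThreadsPerDisk default used here is an assumed illustrative value, since the text does not state it; the min/max brackets are the defaults given above:

```python
# Sketch: starting point for the overall NSD worker thread count.
NSD_THREADS_PER_DISK = 3       # assumed illustrative value
NSD_MIN_WORKER_THREADS = 16    # nsdMinWorkerThreads default
NSD_MAX_WORKER_THREADS = 512   # nsdMaxWorkerThreads default

def desired_worker_threads(nsds_served, threads_per_disk=NSD_THREADS_PER_DISK):
    # Product of nsdThreadsPerDisk and the number of NSDs served,
    # bracketed by nsdMinWorkerThreads and nsdMaxWorkerThreads.
    desired = threads_per_disk * nsds_served
    return max(NSD_MIN_WORKER_THREADS, min(desired, NSD_MAX_WORKER_THREADS))
```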
The initial proposed values for the number of Large NSD Worker threads (nsdLargeWorkerThreads) and
Small NSD Worker threads (nsdSmallWorkerThreads) are calculated to be the best fit for these
conditions (to the accuracy allowed by integer math):

    nsdLargeWorkerThreads + nsdSmallWorkerThreads = desired overall thread count
    nsdSmallWorkerThreads / nsdLargeWorkerThreads = nsdSmallThreadRatio
For the Traditional NSD server scenario, the results are then checked against the available amount of
pagepool memory (nsdBufSpace/100 × pagepool). The amount of pagepool memory needed by all NSD
worker threads is:

    nsdLargeWorkerThreads × maxBlockSize + nsdSmallWorkerThreads × nsdSmallBufferSize
If the desired amount of pagepool memory is not available, nsdLargeWorkerThreads is scaled down so
that all Large buffers fit in the memory available, and nsdSmallWorkerThreads is recalculated as
nsdSmallThreadRatio × nsdLargeWorkerThreads. At this point, nsdLargeWorkerThreads is compared to
the desired number of threads calculated using the pre-3.5 algorithm (oldDesiredThreads). If the former
is greater than the latter, no further adjustments are necessary. Otherwise, we revise the thread counts
again, to avoid a drop in performance for those users who are migrating from pre-3.5 levels of GPFS, and
have a predominantly large IO workload. In the GNR case, not being hampered by memory constraints,
we simply make nsdLargeWorkerThreads equal to oldDesiredThreads, and decrease
nsdSmallWorkerThreads by the same amount that we increase nsdLargeWorkerThreads. In the
Traditional case, we have to employ a complex two-step procedure that attempts to find the best fit for
at least nsdLargeWorkerThreads equal to oldDesiredThreads, and as many nsdSmallWorkerThreads as
possible (but at least as many as nsdLargeWorkerThreads). As a result of this computation, the actual
nsdSmallWorkerThreads / nsdLargeWorkerThreads ratio may end up different from the original
target of nsdSmallThreadRatio. Next, the overall worker thread count is checked against
nsdMaxWorkerThreads, and if the latter is exceeded, both thread counts are scaled down
proportionally. The overall pagepool memory demand is calculated again, in the Traditional case, and if
the demand still exceeds the amount of space available, thread counts are simply scaled down
proportionally to make buffers fit in. Finally, we calculate the number of Large and Small IO request
queues using the nsdThreadsPerQueue configuration parameter. The resulting total number of queues
is checked against the nsdMultiQueue configuration parameter (256 by default), and queue counts are
scaled down proportionally if the total number of queues exceeds nsdMultiQueue. At this point, all NSD
server parameters are known, NSD Worker threads are spawned, and the corresponding pagepool
buffers are allocated.
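The Traditional-case derivation can be summarized in a heavily simplified sketch. The real algorithm is considerably more involved, as noted above (including the oldDesiredThreads compatibility step, which is omitted here); the parameter names mirror the configuration parameters in the text, and the nsdThreadsPerQueue default is an assumed illustrative value:

```python
# Heavily simplified sketch of the Traditional-server parameter
# derivation described above; not the actual GPFS algorithm.

def derive_nsd_params(total_threads, pagepool_avail, max_block_size,
                      small_buffer_size=65536,   # nsdSmallBufferSize default
                      small_thread_ratio=7,      # nsdSmallThreadRatio default
                      threads_per_queue=3,       # assumed nsdThreadsPerQueue value
                      multi_queue=256):          # nsdMultiQueue default
    # Split the total so that small / large ~= nsdSmallThreadRatio.
    large = max(1, total_threads // (small_thread_ratio + 1))
    small = small_thread_ratio * large

    # Memory demand: every Large worker pins a maxBlockSize buffer,
    # every Small worker an nsdSmallBufferSize buffer.
    def demand(lg, sm):
        return lg * max_block_size + sm * small_buffer_size

    # If demand exceeds the available pagepool slice, scale the Large
    # pool down and recompute the Small pool from the ratio.
    while large > 1 and demand(large, small) > pagepool_avail:
        large -= 1
        small = small_thread_ratio * large

    # Queue counts from nsdThreadsPerQueue, capped by nsdMultiQueue.
    large_q = max(1, large // threads_per_queue)
    small_q = max(1, small // threads_per_queue)
    if large_q + small_q > multi_queue:
        scale = multi_queue / (large_q + small_q)
        large_q = max(1, int(large_q * scale))
        small_q = max(1, int(small_q * scale))
    return large, small, large_q, small_q
```

For example, with 64 worker threads, an ample pagepool slice, and a 4 MiB maxBlockSize, the sketch yields 8 Large and 56 Small workers; shrinking the available memory forces the Large pool, and with it the Small pool, down.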
This process takes place on GPFS startup, and also when the pertinent configuration parameters are
adjusted dynamically.
Future work
Many of the currently available NSD Server configuration parameters can be hard to change effectively.
For a long time, many of the parameters have not been documented. One may reasonably wonder:
Why? This is a complex question, and answering it would go beyond the scope of the current
document. In short, the current situation represents a transitional state, and there is clearly much work
to be done to make things better. The problems with the current design are well understood. There is
clearly a need to provide tunables that directly control the allocation of worker threads and other
resources (as opposed to giving inputs to a complex algorithm) for advanced users, and also to provide
much more detailed feedback about inner workings of the NSD server code, by providing a way to look
at the internal state using supported tools. Various NSD server performance metrics must be available
for monitoring using native GPFS tools and external monitoring frameworks.