

Barriers, Caches, Filesystems


Posted on 2011-06-30 by tgr
With the recent proliferation of ext4 as the new default Linux filesystem, there's been much talk of
write barrier support. The flurry of post-2.6.18 barrier-related development in most storage
subsystems has left some novice users and administrators perplexed. I hope I can clear it up a bit
with this primer/refresher.
If you're familiar with the basics of I/O caching, just skip to the Barriers section.
Barriers have long been implemented in the kernel, e.g. in ext3, XFS and ReiserFS 3. However, they
are disabled by default in ext3. Up until recently, there was no barrier support for anything other
than simple devices.
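For illustration, this is roughly how you would turn barriers on for an ext3 filesystem (the device and mount point are made-up examples; whether the option shows up in /proc/mounts depends on the kernel version):
mount -o barrier=1 /dev/sda1 /mnt/data   # enable barriers explicitly on ext3 (off by default)
mount -o remount,barrier=1 /mnt/data     # or remount an already mounted filesystem
grep /mnt/data /proc/mounts              # check the effective mount options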

Two words: data safety
Let's take a look at the basic path data takes through the storage layers during a write-out in a
modern storage setup: from the application, through the OS page cache, the filesystem and the block
layer, down to the storage controller and finally the physical disk drives.
Some of these layers/components have their own caches: the page cache and buffer cache in host RAM,
the controller write cache and the write cache on the drives themselves.
There may be other caches in the path, but this is the usual setup. The page cache is omitted if data
is written in O_DIRECT mode.
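As a quick illustration (the target file name is arbitrary), dd can issue O_DIRECT writes that bypass the page cache; note that the data still lands in the controller and drive caches unless those are flushed or disabled:
dd if=/dev/zero of=/mnt/data/direct.bin bs=1M count=100 oflag=direct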
When a userland process writes data to a filesystem, it's paramount (unless explicitly requested
otherwise) that the data makes it safely to physical, non-volatile media. This is a part of the D in
ACID. Generally, data may be lost if it's in volatile storage during a hardware failure (e.g. power loss)
or software crash.

Caches
The OS page cache (a subsystem of the VFS cache) and the buffer cache are in the host's RAM,
obviously volatile. The risk here is that the page cache is relatively large compared to other caches,
and it can't survive an OS crash.
The storage controller write cache is present in most mid- and hi-end controllers and/or HBAs
working in modes other than initiator-target: RAID HBAs, DAS RAID boxen, SAN controllers, etc. Every
vendor seems to have their own name for it:
BBWC (Battery-Backed Write Cache)
BBU (Battery-Back-Up [Write Cache])
Array Accelerator (BBWC in HPese)
FBWC (Flash-Backed Write Cache)
As the names suggest, BBWC is simply some memory and a rechargeable battery, usually in one or
more proprietary FRU modules. In hi-end storage systems, the battery modules are hot-swappable;
in mid-end systems a controller has to be shut down for battery replacement. RAID HBAs require host
down time for battery maintenance unless you have hot-swap slots and multiple HBAs serving
multiple paths.
FBWC is the relatively new generation of volatile cache where the battery assembly is replaced with
NAND flash storage (not unlike today's SSDs) and a replaceable capacitor bank that holds enough
charge to allow data write-out from DRAM to flash in case of power failure.
Both types of cache have their drawbacks: BBWC needs constant battery monitoring and re-learning.
Re-learning is a recurring process: the controller fully cycles (discharges and recharges) the battery
to learn its absolute capacity which obviously deteriorates with time and usage (cycles). While re-
learning, write cache must be disabled (since at some point in the process the battery will be almost
completely discharged and unable to power the BBWC memory). This is a periodic severe
performance penalty for write-heavy workloads, unless there's a redundant battery and/or controller
to take over. Good controllers allow the administrator to customize re-learn schedules. The batteries
must be replaced every few months or years.
Flash-based write cache is also subject to deterioration: the dreaded maximum write count for flash
memory cells (however, flash is used only on power failure). The backup capacitors degrade over
time. The NAND modules and the capacitor bank must be monitored and replaced if necessary.
Write cache on physical media (disk drives) is almost always volatile. Most enterprise SSDs and some
consumer SSDs (e.g. the Intel 320 series, but not the extremely popular X25-M series) have backup
capacitors.
Modern disks have 16-64 megabytes of cache. The problem with this type of cache is that not all
drives will flush it reliably when requested. SCSI and SAS drives do the right thing: the SYNCHRONIZE
CACHE (opcode 35) command is a part of the SCSI standard. PATA drives have usually outright lied to
cheat on benchmarks. SATA does have the FLUSH CACHE EXT command, but whether the drive
actually acts on it depends on the vendor. Get SCSI/SAS drives for mission-critical data; nothing new
here.
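A quick way to check what a given SATA drive claims to support (the device name is just an example):
hdparm -I /dev/sda | grep -i flush   # does the drive advertise the flush commands?
hdparm -W /dev/sda                   # is its volatile write cache currently enabled?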
One more caveat with disk write cache: to ensure data durability, the controller software MUST
guarantee that all data flushed out of the controller write cache is committed to non-volatile
media. In other words, when the OS requests a flush and the controller returns success, the data
MUST have already been committed to non-volatile media. This is why disk write cache MUST be
disabled if BBWC or another form of controller cache is enabled: the controller cache must be flushed
directly to non-volatile media and not to another layer of volatile cache.
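On most RAID controllers this is done from the vendor's management utility; for drives the OS addresses directly, a rough sketch (device names are examples) looks like this:
hdparm -W0 /dev/sda          # SATA: turn off the drive's volatile write cache
sdparm --clear=WCE /dev/sda  # SCSI/SAS: clear the WCE (write cache enable) bit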
Software RAID with JBOD is a special case: there is no controller cache, only the drive cache, the OS
page cache and buffer cache.

Barriers
Think of write barriers on Linux as a unified approach to flushing and forced I/O ordering.
Consider a setup with many stacked virtual block device layers between the filesystem and the
physical disks. This is a bit on the extreme side, but ponder for a moment how many layers of I/O
(and caches) the data has to pass through to be stored on the physical disks.
If the filesystem is barrier-aware and all I/O layers support barriers/flushes, an fs transaction
followed by a barrier is committed (flushed) to persistent storage (disks). All requests issued prior
to the barrier must be satisfied before continuing. Also, an fsync() or a similar call will flush the write
caches of the underlying storage (fsync() without barriers does NOT guarantee this!). Barrier bios
(block I/Os) actually do two flushes: one before the bio and one afterwards. It's possible to issue an
empty barrier bio to flush only once.
Barriers ensure critical transactions are committed to persistent media and committed in the right
order, but they incur a sometimes severe performance penalty.
Let's get back to our two hardware setups: software RAID on JBOD and hardware RAID with BBWC.
Since barriers force write-outs to persistent storage, disk write cache can be safely enabled for MD
RAID if the following conditions are met (see the sketch after the list):
the filesystem supports barriers and they are enabled
the underlying I/O layers support barriers/flushes (see below)
the disks reliably support cache flushes.
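A minimal sketch of such a configuration for ext4 on an MD device (device names and mount point are assumptions):
mount -o barrier=1 /dev/md0 /srv/data   # barriers are on by default for ext4; stating it explicitly doesn't hurt
hdparm -W1 /dev/sda                     # with working barriers/flushes the member drives' write caches can stay on
hdparm -W1 /dev/sdb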
However, on hardware RAID with BBWC, the cache itself is (quasi-)persistent. Since RAID controllers
do implement the SYNCHRONIZE CACHE command, each barrier would flush the entire write cache,
negating the performance advantage of BBWC. It's recommended to disable barriers if and only if
you have healthy BBWC. If you disable barriers, you must monitor and properly maintain your BBWC.
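For example (mount points are made up; ext3/ext4 and XFS spell the option differently):
mount -o remount,barrier=0 /srv/db    # ext3/ext4 on top of healthy BBWC
mount -o remount,nobarrier /srv/xfs   # XFS equivalent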
Full support for barriers on various virtual devices has been added only recently. This is a rough
matrix of barrier support in vanilla kernel versions, milestones highlighted:
Barrier support                                               Kernel version   Commit
I/O barrier support                                           2.6.9            1
ext3                                                          2.6.9            1
reiserfs                                                      2.6.9            1
SATA                                                          2.6.12           -
XFS barriers enabled by default                               2.6.16           1
ext4 barriers enabled by default                              2.6.26           1
DM simple devices (i.e. a single underlying device)           2.6.28           1
loop                                                          2.6.30           1
DM rewrite of the barrier code                                2.6.30           1
DM crypt                                                      2.6.31           1
DM linear (i.e. standard LVM concatenated volumes)            2.6.31           1
DM mpath                                                      2.6.31           1
virtio-blk (only really safe with O_DIRECT backing devices)   2.6.32           1
DM dm-raid1                                                   2.6.33           1
DM request based devices                                      2.6.33           1
MD barrier support on all personalities *                     2.6.33           1
barriers removed and replaced with FUA / explicit flushes     2.6.37           1 2 3
* Note: previously barriers were only supported on MD raid1. This patch can be easily applied to 2.6.32.
As of 2.6.37, block layer barriers have been removed from the kernel for performance reasons. They
have been completely superseded by explicit flushes and FUA requests.
FUA is Force Unit Access: an I/O request flag which ensures the transferred data is written directly to
(or read from) persistent media, regardless of any cache settings.
Explicit flushes are just that: write cache flushes explicitly requested by a filesystem. In fact, the
responsibility for safe request ordering has been completely moved to filesystems. The block layer or
TCQ/NCQ can safely reorder requests if necessary, since the filesystem will issue flush/FUA requests
for critical transactions anyway and wait for their completion before proceeding.
These changes eliminate the barrier-induced request queue drains that significantly affected write
performance. Other I/O requests (e.g. without a transaction) can be issued to a device while a
transaction is still being processed.
However, as 2.6.32.x is the longterm kernel for several distros, barriers are here to stay (at least for
a few years).

Filesystems
Barriers/flushes are supported on most modern filesystems: ext3, ext4, XFS, JFS, ReiserFS 3, etc.
ext3/4 are unique in that they support three data journaling modes: data={ordered,journal,writeback}.
data=journal essentially writes data twice: first to the journal and then to the data blocks.
data=writeback is similar to journaling on XFS, JFS, or ReiserFS 3 before Linux 2.6.6. Only internal
filesystem integrity is preserved and only metadata is journaled; data may be written to the
filesystem out of order. Metadata changes are first recorded in the journal and a commit block is
written. After the journal has been updated, metadata and data write-outs may proceed.
data=writeback can be a severe security risk: if the system crashes while appending to a file, after the
metadata has been committed (and additional data blocks allocated), but before the data has been
written (data blocks overwritten with new data), then after journal recovery that file may contain
blocks filled with data from previously deleted files from any user.
Note: ReiserFS 3 has supported data=ordered since 2.6.6 and it's the default mode. XFS does support ordering
in specific cases, but it's neither always guaranteed nor enforced via the journaling mechanism. There is
some confusion about that, e.g. this Wikipedia article on ext3 and this paper [PDF] seem to contradict
what a developer from SGI stated (the paper seems flawed anyway, as an assumption is made that XFS
is running in ordered mode, based on the result of one test).
data=ordered only journals metadata, like writeback mode, but groups metadata and data changes
together into transactions. Transaction blocks are written together, data first, metadata last.
With barriers enabled, the order looks more or less like this:
1. the transaction is written
2. a barrier request is issued
3. the commit block is written
4. another barrier is issued
There is a special case on ext4 where the first barrier (between the transaction and the commit
block) is omitted: the journal_async_commit mount option. ext4 supports journal checksumming: if the
commit block has been written but the checksum is incorrect, the transaction will be discarded at
journal replay. With journal_async_commit enabled, the commit block may be written without waiting
for the transaction write-out. There's a caveat: before this commit the barrier was missing at step 4
in async commit mode. The patch adds it, so that now there's a single empty barrier (step 4) after the
commit block instead of a full barrier (two flushes) around it.
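For example (device and mount point are assumptions; journal_async_commit relies on journal checksumming):
mount -o journal_async_commit /dev/md0 /srv/db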
ext3 tends to flush more often than ext4. By default both ext3 and ext4 are mounted with
data=ordered and commit=5. On ext3 this means not only the journal, but effectively all data is
committed every 5 seconds. However, ext4 introduces a new feature: delayed allocation.
Note: delayed allocation is by no means a new concept. It's been used for years, e.g. in XFS; in fact ext4
behaves similarly to XFS in this regard.
New data blocks on disk are not immediately allocated, so they are not written out until the
respective dirty pages in the page cache expire. The expiration is controlled by two tunables:
/proc/sys/vm/dirty_expire_centisecs
/proc/sys/vm/dirty_writeback_centisecs
The first variable determines the expiration age: 30 seconds by default as of 2.6.32. On expiration,
dirty pages are queued for eviction. The second variable controls the wakeup frequency of the
flush kernel threads, which process the queues.
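For example (the values are purely illustrative, not recommendations):
sysctl vm.dirty_expire_centisecs vm.dirty_writeback_centisecs   # current values, in hundredths of a second
sysctl -w vm.dirty_expire_centisecs=1500    # expire dirty pages after 15 seconds
sysctl -w vm.dirty_writeback_centisecs=500  # wake the flusher threads every 5 seconds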
You can check the current cache sizes:
grep ^Cached: /proc/meminfo # page cache size
grep ^Dirty: /proc/meminfo # total size of all dirty pages
grep ^Writeback: /proc/meminfo # total size of actively processed dirty pages
Note: The VFS cache (e.g. dentry and inode caches) can be further examined by viewing the /proc/slabinfo
file (or with the slabtop util which gives a nice breakdown of the slab count, object count, size, etc).
Note: before 2.6.32 there was a well-known subsystem called pdflush: global kernel threads for all
devices, spawned and terminated on demand (the rule of thumb is: if all pdflush threads have been busy
for 1 second, spawn another thread; if one of the threads has been idle for 1 second, terminate it). It's
been replaced with per-BDI (per-backing-device-info) flushers: one flush thread per logical device
(one for each filesystem).
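On a 2.6.32+ kernel you can see them as kernel threads named after the backing device's major:minor numbers (e.g. flush-8:0 for /dev/sda):
ps -eo pid,comm | grep 'flush-'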
On top of all that, there was the dreaded pre-2.6.30 ext4 delayed allocation data loss bug/feature.
Workarounds were introduced in 2.6.30, namely the auto_da_alloc mount option, enabled by default.
You should also take into consideration the size of the OS page cache. These days machines have a
lot of RAM (32+ or 64+ GB is not uncommon). The more RAM you have, the more dirty pages can be
held in RAM before flushing to disk. By default, Linux 2.6.32 will start writing out dirty pages when
they reach 10% of RAM. On a 32 GB machine this is 3.2 GB of uncommitted data; in write-heavy
environments, where you don't hit the time-based constraints mentioned above, that's quite a lot to lose
in the event of a system crash or power failure.
This is why it's so important to ensure data integrity in your software by flushing critical data to disks,
e.g. by fsync()ing (though at the application level you may only hope the filesystem, the OS and the
devices will all do the right thing). This is why database systems have been doing it for decades.
Also, this is one of the reasons why some database vendors recommend placing transaction commit
logs on a separate filesystem. The synchronous load profile of the commit log would otherwise
interfere with the asynchronous flushing of the tablespaces: if the logs were kept on a single
filesystem along with the tablespaces, every fsync would flush all dirty pages for that filesystem,
killing I/O performance.
Note: fsync() is a double-edged sword in this case. fsyncing too often will reduce performance (and spin up
devices). That's why only critical data should be fsynced.
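A rough illustration with dd (the file name is arbitrary): conv=fsync issues a single fsync() once the copy is done, while oflag=dsync forces a synchronous write for every output block and is dramatically slower:
dd if=/dev/zero of=/srv/db/important.bin bs=1M count=100 conv=fsync
dd if=/dev/zero of=/srv/db/important.bin bs=1M count=100 oflag=dsync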
Dirty page flushing can be tuned traditionally with these two tunables:
/proc/sys/vm/dirty_background_ratio
/proc/sys/vm/dirty_ratio
Both values are expressed as a percentage of RAM. When the amount of dirty pages reaches the
first threshold (dirty_background_ratio), write-outs begin in the background via the flush kernel
threads. When the second threshold is reached, processes will block, flushing in the foreground.
The problem with these variables is their minimum value: even 1% can be too much. This is why
another two controls were introduced in 2.6.29:
/proc/sys/vm/dirty_background_bytes
/proc/sys/vm/dirty_bytes
They're equivalent to their percentage-based counterparts. Both pairs of tunables are exclusive: if
either is set, its respective counterpart is reset to 0 and ignored. These variables should be tuned in
relation to the BBWC memory size (or disk write cache size on MD RAID). Lower values generate
more I/O requests (and more interrupts), significantly decrease sequential I/O bandwidth but also
decrease random I/O latency. The idea is to find a sweet spot where BBWC would be used most
effectively: the ideal I/O rate should not allow BBWC to overfill or significantly under-fill. Obviously,
this is hit/miss and only theoretically achievable under perfect conditions. As usual, you should tune
and benchmark for your specific workload.
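A hedged example: with 512 MB of BBWC you might start from something like the following and benchmark from there (the numbers are purely illustrative):
sysctl -w vm.dirty_background_bytes=268435456   # start background write-out at 256 MB of dirty data
sysctl -w vm.dirty_bytes=536870912              # block writers and flush in the foreground at 512 MB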
When benchmarking, remember ext3 has barriers disabled by default. A direct comparison of ext3 to
ext4 with default mount options is usually quite pointless. ext4 offers an increased level of data
protection at the cost of speed. Likewise, directly comparing ext3 in ordered mode to a filesystem
offering only metadata journaling may not yield conclusive results. Some people got their
benchmarks wrong.
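If you do want a closer comparison, mount both filesystems with equivalent options first (devices and mount points are examples):
mount -t ext3 -o barrier=1,data=ordered /dev/sdb1 /mnt/ext3   # enable the barriers ext3 lacks by default
mount -t ext4 -o barrier=1,data=ordered /dev/sdc1 /mnt/ext4   # ext4 defaults already include these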
Note: I did that kind of benchmark a while ago: the goal was to measure system file operations
(deliberately on default settings), not sequential throughput or IOPS, and ext4 was faster anyway.
All in all, it's your data! Test everything yourself with your specific workloads, hardware and
configuration. Here's a simple barrier test workload to get you going.
This entry was posted in IT and tagged best-practices, kernel, Linux, systems. Bookmark the permalink.
