Posted on 2011-06-30 by tgr

With the recent proliferation of ext4 as the new default Linux filesystem, there's been much talk of write barrier support. The flurry of post-2.6.18 barrier-related development in most storage subsystems has left some novice users and administrators perplexed. I hope I can clear it up a bit with this primer/refresher. If you're familiar with the basics of I/O caching, just skip to the Barriers section.

Barriers have long been implemented in the kernel, e.g. in ext3, XFS and ReiserFS 3. However, they are disabled by default in ext3. Up until recently, there was no barrier support for anything other than simple devices.
Two words: data safety

Let's take a look at the basic path data takes through the storage layers during a write-out in a modern storage setup:

[Figure: write path through the storage layers]

Some of these layers/components have their own caches:

[Figure: caches along the write path]

There may be other caches in the path, but this is the usual setup. The page cache is bypassed if data is written in O_DIRECT mode.

When a userland process writes data to a filesystem, it's paramount (unless explicitly requested otherwise) that the data makes it safely to physical, non-volatile media. This is a part of the D (durability) in ACID. Generally, data may be lost if it's in volatile storage during a hardware failure (e.g. power loss) or a software crash.
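To make the userland side of this concrete, here is a minimal Python sketch of a write that explicitly asks for durability. The function name durable_write and the file name are hypothetical, chosen only for illustration. Note that fsync() only asks the kernel to push the data down the stack; whether the controller and drive caches are flushed as well depends on the layers below.

```python
import os
import tempfile


def durable_write(path: str, payload: bytes) -> None:
    """Write payload to path, then ask the kernel to push it to stable storage."""
    # Ordinary buffered write: the data lands in the page cache first.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(payload)
        while view:
            written = os.write(fd, view)  # os.write() may write fewer bytes than asked
            view = view[written:]
        # Block until the kernel reports this file's dirty data as written out.
        # Flushing of device write caches is up to the filesystem and the
        # storage layers underneath.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "record.bin")  # hypothetical file name
        durable_write(target, b"critical data\n")
        print(os.path.getsize(target), "bytes written and fsynced")
```

Without the os.fsync() call the data would sit in the page cache until the kernel decides to write it out on its own schedule.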
Caches

The OS page cache (a subsystem of the VFS cache) and the buffer cache live in the host's RAM, which is obviously volatile. The risk here is that the page cache is relatively large compared to other caches, and it can't survive OS crashes.

The storage controller write cache is present in most mid- and hi-end controllers and/or HBAs working in modes other than initiator-target: RAID HBAs, DAS RAID boxen, SAN controllers, etc. Every vendor seems to have their own name for it:

- BBWC (Battery-Backed Write Cache)
- BBU (Battery-Back-Up [Write Cache])
- Array Accelerator (BBWC in HPese)
- FBWC (Flash-Backed Write Cache)

As the names suggest, BBWC is simply some memory and a rechargeable battery, usually in one or more proprietary FRU modules. In hi-end storage systems, the battery modules are hot-swappable; in mid-end systems a controller has to be shut down for battery replacement. RAID HBAs require host downtime for battery maintenance unless you have hot-swap slots and multiple HBAs serving multiple paths.
FBWC is a relatively new generation of write cache: the cache memory is still volatile DRAM, but the battery assembly is replaced with NAND flash storage (not unlike today's SSDs) and a replaceable capacitor bank that holds enough charge to allow data write-out from DRAM to flash in case of power failure.

Both types of cache have their drawbacks:

- BBWC needs constant battery monitoring and re-learning. Re-learning is a recurring process: the controller fully cycles (discharges and recharges) the battery to learn its absolute capacity, which obviously deteriorates with time and usage (cycles). While re-learning, write cache must be disabled (since at some point in the process the battery will be almost completely discharged and unable to power the BBWC memory). This is a periodic, severe performance penalty for write-heavy workloads, unless there's a redundant battery and/or controller to take over. Good controllers allow the administrator to customize re-learn schedules. The batteries must be replaced every few months or years.
- Flash-based write cache is also subject to deterioration: the dreaded maximum write count for flash memory cells (however, flash is used only on power failure). The backup capacitors degrade over time. The NAND modules and the capacitor bank must be monitored and replaced if necessary.

Write cache on physical media (disk drives) is almost always volatile. Most enterprise SSDs and some consumer SSDs (e.g. the Intel 320 series, but not the extremely popular X25-M series) have backup capacitors. Modern disks have 16-64 megabytes of cache. The problem with this type of cache is that not all drives will flush it reliably when requested. SCSI and SAS drives do the right thing: the SYNCHRONIZE CACHE command (opcode 0x35) is a part of the SCSI standard. PATA drives have usually outright lied to cheat on benchmarks. SATA does have the FLUSH CACHE EXT command, but whether the drive actually acts on it depends on the vendor. Get SCSI/SAS drives for mission-critical data; nothing new here.
One more caveat with disk write cache: to ensure data durability, the controller software MUST guarantee that all data flushed out of the controller write cache is committed to non-volatile media. In other words, when the OS requests a flush and the controller returns success, the data MUST have already been committed to non-volatile media. This is why disk write cache MUST be disabled if BBWC or another form of controller cache is enabled: the controller cache must be flushed directly to non-volatile media and not to another layer of volatile cache.

Software RAID with JBOD is a special case: there is no controller cache, only the drive cache, the OS page cache and buffer cache.
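As a rough way to see which block devices the kernel believes have a volatile write-back cache, the sketch below reads the queue/write_cache attribute under /sys/block. This is an assumption about the environment: the attribute exists only on kernels newer than the ones covered in this article, and it reflects the kernel's view of the device it talks to, not what a RAID controller does with the disks behind it. The function names are hypothetical.

```python
from pathlib import Path


def parse_write_cache(text: str) -> bool:
    """Return True if the sysfs value reports a volatile write-back cache."""
    value = text.strip()
    if value == "write back":
        return True
    if value == "write through":
        return False
    raise ValueError(f"unexpected write_cache value: {value!r}")


def write_cache_report(sys_block: str = "/sys/block") -> dict[str, bool]:
    """Map block device names to their write-back cache state, where known."""
    report = {}
    for attr in sorted(Path(sys_block).glob("*/queue/write_cache")):
        try:
            # attr is <sys_block>/<device>/queue/write_cache
            report[attr.parent.parent.name] = parse_write_cache(attr.read_text())
        except (OSError, ValueError):
            continue  # unreadable attribute or unknown format: skip the device
    return report


if __name__ == "__main__":
    for dev, write_back in write_cache_report().items():
        print(f"{dev}: {'write back' if write_back else 'write through'}")
```

A device reported as "write back" needs working cache flushes (or a protected cache) for the durability guarantee described above to hold.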
Barriers

Think of write barriers on Linux as a unified approach to flushing and forced I/O ordering. Consider the following setup:

[Figure: example storage stack with multiple I/O layers]

This is a bit on the extreme side, but ponder for a moment how many layers of I/O (and caches) the data has to pass through to be stored on the physical disks.

If the filesystem is barrier-aware and all I/O layers support barriers/flushes, an fs transaction followed by a barrier is committed (flushed) to persistent storage (disks). All requests issued prior to the barrier must be satisfied before continuing. Also, an fsync() or a similar call will flush the write caches of the underlying storage (fsync() without barriers does NOT guarantee this!).

Barrier bios (block I/Os) actually do two flushes: one before the bio and one afterwards. It's possible to issue an empty barrier bio to flush only once.

Barriers ensure critical transactions are committed to persistent media and committed in the right order, but they incur a sometimes severe performance penalty.

Let's get back to our two hardware setups: software RAID on JBOD and hardware RAID with BBWC. Since barriers force write-outs to persistent storage, disk write cache can be safely enabled for MD RAID if the following conditions are met:

- the filesystem supports barriers and they are enabled
- the underlying I/O layers support barriers/flushes (see below)
- the disks reliably support cache flushes.

However, on hardware RAID with BBWC, the cache itself is (quasi-)persistent. Since RAID controllers do implement the SYNCHRONIZE CACHE command, each barrier would flush the entire write cache, negating the performance advantage of BBWC. It's recommended to disable barriers if and only if you have healthy BBWC. If you disable barriers, you must monitor and properly maintain your BBWC.

Full support for barriers on various virtual devices has been added only recently.
This is a rough matrix of barrier support in vanilla kernel versions:

Barrier support                                              Kernel version  Commit
I/O barrier support                                          2.6.9           1
ext3                                                         2.6.9           1
reiserfs                                                     2.6.9           1
SATA                                                         2.6.12          -
XFS barriers enabled by default                              2.6.16          1
ext4 barriers enabled by default                             2.6.26          1
DM simple devices (i.e. a single underlying device)          2.6.28          1
loop                                                         2.6.30          1
DM rewrite of the barrier code                               2.6.30          1
DM crypt                                                     2.6.31          1
DM linear (i.e. standard LVM concatenated volumes)           2.6.31          1
DM mpath                                                     2.6.31          1
virtio-blk (only really safe with O_DIRECT backing devices)  2.6.32          1
DM dm-raid1                                                  2.6.33          1
DM request based devices                                     2.6.33          1
MD barrier support on all personalities *                    2.6.33          1
barriers removed and replaced with FUA / explicit flushes    2.6.37          1 2 3

* Note: previously barriers were only supported on MD raid1. This patch can be easily applied to 2.6.32.

As of 2.6.37, block layer barriers have been removed from the kernel for performance reasons. They have been completely superseded by explicit flushes and FUA requests.

FUA is Force Unit Access: an I/O request flag which ensures the transferred data is written directly to (or read from) persistent media, regardless of any cache settings.

Explicit flushes are just that: write cache flushes explicitly requested by a filesystem. In fact, the responsibility for safe request ordering has been completely moved to filesystems. The block layer or TCQ/NCQ can safely reorder requests if necessary, since the filesystem will issue flush/FUA requests for critical transactions anyway and wait for their completion before proceeding.

These changes eliminate the barrier-induced request queue drains that significantly affected write performance. Other I/O requests (e.g. without a transaction) can be issued to a device while a transaction is still being processed.

However, as 2.6.32.x is the longterm kernel for several distros, barriers are here to stay (at least for a few years).
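Since enabling or disabling barriers is a per-mount decision, it helps to see what is actually in effect. The sketch below parses /proc/mounts-style text and classifies the barrier-related mount options it finds. It is only a rough check: the function names are hypothetical, the option spellings covered are the common ext3/ext4/XFS/ReiserFS ones, and which options a kernel prints in /proc/mounts varies by filesystem and version. A result of "default" means nothing explicit was listed, so the filesystem's own default applies.

```python
from pathlib import Path

ENABLE_OPTS = {"barrier", "barrier=1", "barrier=flush"}
DISABLE_OPTS = {"nobarrier", "barrier=0", "barrier=none"}
FILESYSTEMS = {"ext3", "ext4", "xfs", "reiserfs"}


def barrier_setting(options: str) -> str:
    """Classify a comma-separated mount option string."""
    setting = "default"  # no explicit option seen: filesystem default applies
    for opt in options.split(","):
        if opt in ENABLE_OPTS:
            setting = "enabled"
        elif opt in DISABLE_OPTS:
            setting = "disabled"
    return setting


def barrier_report(mounts_text: str) -> dict[str, str]:
    """Map mount points of barrier-capable filesystems to their setting."""
    report = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a well-formed mounts line
        _device, mountpoint, fstype, options = fields[:4]
        if fstype in FILESYSTEMS:
            report[mountpoint] = barrier_setting(options)
    return report


if __name__ == "__main__":
    mounts = Path("/proc/mounts")
    if mounts.exists():
        for mountpoint, setting in barrier_report(mounts.read_text()).items():
            print(f"{mountpoint}: barriers {setting}")
```

Keep in mind that an "enabled" mount option says nothing about whether every layer underneath (DM, MD, the controller, the disks) honors the flushes.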
Filesystems

Barriers/flushes are supported on most modern filesystems: ext3, ext4, XFS, JFS, ReiserFS 3, etc.

ext3/4 are unique in that they support three data journaling modes: data={ordered,journal,writeback}.

data=journal essentially writes data twice: first to the journal and then to the data blocks.

data=writeback is similar to journaling on XFS, JFS, or ReiserFS 3 before Linux 2.6.6. Only internal filesystem integrity is preserved and only metadata is journaled; data may be written to the filesystem out of order. Metadata changes are first recorded in the journal and a commit block is written. After the journal has been updated, metadata and data write-outs may proceed. data=writeback can be a severe security risk: if the system crashes while appending to a file, after the metadata has been committed (and additional data blocks allocated), but before the data has been written (data blocks overwritten with new data), then after journal recovery that file may contain blocks filled with data from previously deleted files from any user.

Note: ReiserFS 3 supports data=ordered since 2.6.6 and it's the default mode. XFS does support ordering in specific cases, but it's neither always guaranteed nor enforced via the journaling mechanism. There is some confusion about that, e.g. this Wikipedia article on ext3 and this paper [PDF] seem to contradict what a developer from SGI stated (the paper seems flawed anyway, as an assumption is made that XFS is running in ordered mode, based on the result of one test).

data=ordered only journals metadata, like writeback mode, but groups metadata and data changes together into transactions. Transaction blocks are written together, data first, metadata last.

With barriers enabled, the order looks more or less like this:

1. the transaction is written
2. a barrier request is issued
3. the commit block is written
4. another barrier is issued

There is a special case on ext4 where the first barrier (between the transaction and the commit block) is omitted: the journal_async_commit mount option. ext4 supports journal checksumming: if the commit block has been written but the checksum is incorrect, the transaction will be discarded at journal replay. With journal_async_commit enabled the commit block may be written without waiting for the transaction write-out. There's a caveat: before this commit the barrier was missing at step 4 in async commit mode. The patch adds it, so that now there's a single empty barrier (step 4) after the commit block instead of a full barrier (two flushes) around it.

ext3 tends to flush more often than ext4. By default both ext3 and ext4 are mounted with data=ordered and commit=5. On ext3 this means not only the journal, but effectively all data is committed every 5 seconds. However, ext4 introduces a new feature: delayed allocation.

Note: delayed allocation is by no means a new concept. It's been used for years e.g. in XFS; in fact ext4 behaves similarly to XFS in this regard.

New data blocks on disk are not immediately allocated, so they are not written out until the respective dirty pages in the page cache expire. The expiration is controlled by two tunables:

    /proc/sys/vm/dirty_expire_centisecs
    /proc/sys/vm/dirty_writeback_centisecs

The first variable determines the expiration age (30 seconds by default as of 2.6.32). On expiration, dirty pages are queued for eviction. The second variable controls the wakeup frequency of the flush kernel threads, which process the queues.

You can check the current cache sizes:

    grep ^Cached: /proc/meminfo     # page cache size
    grep ^Dirty: /proc/meminfo      # total size of all dirty pages
    grep ^Writeback: /proc/meminfo  # total size of actively processed dirty pages

Note: The VFS cache (e.g. dentry and inode caches) can be further examined by viewing the /proc/slabinfo file (or with the slabtop util, which gives a nice breakdown of the slab count, object count, size, etc).

Note: before 2.6.32 there was a well-known subsystem called pdflush: global kernel threads for all devices, spawned and terminated on demand (the rule of thumb is: if all pdflush threads have been busy for 1 second, spawn another thread; if one of the threads has been idle for 1 second, terminate it). It's been replaced with per-BDI (per-backing-device-info) flushers: one flush thread per logical device (one for each filesystem).

On top of all that, there was the dreaded pre-2.6.30 ext4 delayed allocation data loss bug/feature. Workarounds were introduced in 2.6.30, namely the auto_da_alloc mount option, enabled by default.

You should also take into consideration the size of the OS page cache. These days machines have a lot of RAM (32+ or 64+ GB is not uncommon). The more RAM you have, the more dirty pages can be held in RAM before flushing to disk. By default, Linux 2.6.32 will start writing out dirty pages when they reach 10% of RAM. On a 32 GB machine this is 3.2 GB of uncommitted data in write-heavy environments, where you don't hit the time-based constraints mentioned above: quite a lot to lose in the event of a system crash or power failure.

This is why it's so important to ensure data integrity in your software by flushing critical data to disks, e.g. by fsync()ing (though at the application level you may only hope the filesystem, the OS and the devices will all do the right thing). This is why database systems have been doing it for decades. Also, this is one of the reasons why some database vendors recommend placing transaction commit logs on a separate filesystem.
The synchronous load profile of the commit log would otherwise interfere with the asynchronous flushing of the tablespaces: if the logs were kept on a single filesystem along with the tablespaces, every fsync would flush all dirty pages for that filesystem, killing I/O performance.

Note: fsync() is a double-edged sword in this case. fsyncing too often will reduce performance (and spin up devices). That's why only critical data should be fsynced.

Dirty page flushing can be tuned traditionally with these two tunables:

    /proc/sys/vm/dirty_background_ratio
    /proc/sys/vm/dirty_ratio

Both values are expressed as a percentage of RAM. When the amount of dirty pages reaches the first threshold (dirty_background_ratio), write-outs begin in the background via the flush kernel threads. When the second threshold is reached, processes will block, flushing in the foreground.

The problem with these variables is their minimum value: even 1% can be too much. This is why another two controls were introduced in 2.6.29:

    /proc/sys/vm/dirty_background_bytes
    /proc/sys/vm/dirty_bytes

They're equivalent to their percentage-based counterparts. Both pairs of tunables are exclusive: if either is set, its respective counterpart is reset to 0 and ignored. These variables should be tuned in relation to the BBWC memory size (or disk write cache size on MD RAID). Lower values generate more I/O requests (and more interrupts), significantly decrease sequential I/O bandwidth but also decrease random I/O latency. The idea is to find a sweet spot where BBWC would be used most effectively: the ideal I/O rate should not allow BBWC to overfill or significantly under-fill. Obviously, this is hit/miss and only theoretically achievable under perfect conditions. As usual, you should tune and benchmark for your specific workload.

When benchmarking, remember ext3 has barriers disabled by default.
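To get a feel for what the four dirty-page tunables mean on a given machine, the following sketch turns them into byte thresholds. It follows the simplification used here that the ratios are a percentage of total RAM; the kernel actually applies them to a somewhat smaller amount of "dirtyable" memory, so treat the results as upper-bound estimates. The helper names are hypothetical.

```python
import os

VM_DIR = "/proc/sys/vm"


def dirty_thresholds(total_ram_bytes, background_ratio, ratio,
                     background_bytes=0, dirty_bytes=0):
    """Return (background, foreground) dirty thresholds in bytes.

    A non-zero *_bytes value takes precedence over its *_ratio counterpart,
    mirroring the "exclusive pairs" behaviour of the tunables.
    """
    background = background_bytes or total_ram_bytes * background_ratio // 100
    foreground = dirty_bytes or total_ram_bytes * ratio // 100
    return background, foreground


def mem_total_bytes(meminfo_text: str) -> int:
    """Extract MemTotal (reported in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def read_vm_tunable(name: str, root: str = VM_DIR) -> int:
    """Read one integer tunable such as dirty_ratio."""
    with open(os.path.join(root, name)) as f:
        return int(f.read())


if __name__ == "__main__":
    if os.path.exists(os.path.join(VM_DIR, "dirty_ratio")):
        with open("/proc/meminfo") as f:
            total = mem_total_bytes(f.read())
        background, foreground = dirty_thresholds(
            total,
            read_vm_tunable("dirty_background_ratio"),
            read_vm_tunable("dirty_ratio"),
            read_vm_tunable("dirty_background_bytes"),
            read_vm_tunable("dirty_bytes"),
        )
        print(f"background write-out starts at ~{background / 2**20:.0f} MiB dirty")
        print(f"writers start blocking at      ~{foreground / 2**20:.0f} MiB dirty")
```

For example, dirty_thresholds(32 * 10**9, 10, 20) gives 3.2 GB and 6.4 GB, matching the 32 GB machine with a 10% background ratio discussed earlier. Comparing these numbers with the size of your BBWC or drive cache is a reasonable starting point for tuning.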
A direct comparison of ext3 to ext4 with default mount options is usually quite pointless. ext4 offers an increased level of data protection at the cost of speed. Likewise, directly comparing ext3 in ordered mode to a filesystem offering only metadata journaling may not yield conclusive results. Some people got their benchmarks wrong.

Note: I did that kind of benchmark a while ago: the goal was to measure system file operations (deliberately on default settings), not sequential throughput or IOPS, and ext4 was faster anyway.

All in all, it's your data! Test everything yourself with your specific workloads, hardware and configuration. Here's a simple barrier test workload to get you going.
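If you want something quick to run yourself, the sketch below is one possible fsync-heavy workload. It is a hypothetical example written for illustration, not necessarily the workload referred to above. It appends small records to a temporary file and calls fsync() after each one, then reports the average latency. The function name and default values are assumptions.

```python
import os
import sys
import tempfile
import time


def fsync_workload(directory: str, count: int = 100, size: int = 4096) -> float:
    """Append `count` records of `size` bytes, calling fsync() after each.

    Returns the average number of seconds per write+fsync pair.
    """
    record = b"\0" * size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, record)
            # With barriers/flushes working end to end, each fsync() should
            # reach the device write cache and not just the page cache.
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)  # clean up the scratch file
    return elapsed / count


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    avg = fsync_workload(target)
    print(f"{avg * 1000:.3f} ms per write+fsync in {target}")
```

Run it against a directory on the filesystem under test, once with barriers enabled and once with them disabled. On a plain disk with a volatile write cache, a very small per-fsync latency with barriers enabled is a hint (not proof) that flushes are not reaching the media.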