Overview
Tuning is Evil

Tuning is often evil and should rarely be done. First, consider that the default values are set by the people who know the most about the effects of the tuning on the software that they supply. If a better value exists, it should be the default. While alternative values might help a given workload, they could quite possibly degrade some other aspects of performance, occasionally catastrophically so. Over time, tuning recommendations might become stale at best or might lead to performance degradations. Customers are leery of changing a tuning that is in place, and the net effect is a worse product than what it could be. Moreover, tuning enabled on a given system might spread to other systems, where it might not be warranted at all.

Nevertheless, it is understood that customers who carefully observe their own systems may understand aspects of their workloads that cannot be anticipated by the defaults. In such cases, the tuning information below may be applied, provided that one works to carefully understand its effects.

If you must implement a ZFS tuning parameter, please reference the URL of this document: http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
The Tunables
In no particular order:
CHECKSUMS
ARCSIZE
ZFETCH
VDEVPF
MAXPEND
FLUSH
ZIL
METACOMP
IMMSZ
Tuning ZFS Checksums

Having file system level checksums enabled can alleviate the need to have application level checksums enabled. In this case, using the ZFS checksum becomes a performance enabler. The checksums are computed asynchronously to most application processing and should normally not be an issue. However, each pool currently has a single thread computing the checksums (RFE below), and it is possible for that computation to limit pool throughput. So, if the disk count is very large (>> 10) or a single CPU is weak (< 1 GHz), then this tuning might help. If a system is close to CPU saturation, the checksum computations might become noticeable. In those cases, do a run with checksums off to verify whether checksum calculation is the problem (a profiling sketch follows the RFE below).

If you tune this parameter, please reference this URL in a shell script or in an /etc/system comment:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Tuning_ZFS_Checksums

Verify the type of checksum used:
zfs get checksum <filesystem>
Tuning is achieved dynamically by using:
zfs set checksum=off <filesystem>
And reverted:
zfs set checksum='on | fletcher2 | fletcher4 | sha256' <filesystem>

The Fletcher2 checksum has been observed to consume roughly 1 GHz of a CPU when checksumming 500 MB per second.

RFEs
6533726 single-threaded checksum & raidz2 parity calculations limit write bandwidth on thumper (Fixed in Nevada, build 79 and Solaris 10 10/08)
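Before turning checksums off, it may be worth confirming that checksum computation actually accounts for a noticeable share of kernel CPU time. A hedged sketch using lockstat's interrupt-based kernel profiling (the exact checksum function names vary by release; look for fletcher or sha256 entries near the top of the output):

# lockstat -kIW -D 20 sleep 30

If no checksum routines appear high in the profile, disabling checksums is unlikely to help and the protection they provide should be kept.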
Limiting the ARC Cache

For these cases, you might consider limiting the ARC. Limiting the ARC will, of course, also limit the amount of cached data, and this can have adverse effects on performance. No easy way exists to foretell whether limiting the ARC degrades performance.

If you tune this parameter, please reference this URL in a shell script or in an /etc/system comment:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache
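Before choosing a limit, it can help to see how much memory ZFS is currently consuming. A hedged sketch (the arcstats kstat and the ZFS breakdown in the ::memstat dcmd are only present on releases that provide them):

# kstat -m zfs -n arcstats
# echo "::memstat" | mdb -k

The first command reports the ARC statistics, including its current size and target; the second summarizes kernel page usage, including ZFS file data where the release breaks it out.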
Current Solaris 10 Releases and Solaris Nevada Releases

This syntax is provided starting in the Solaris 10 8/07 release and the Nevada (build 51) release. For example, if an application needs 5 GB of memory on a system with 36 GB of memory, you could set the ARC maximum to 30 GB (0x780000000, or 32212254720 bytes). Set the zfs:zfs_arc_max parameter in the /etc/system file:

set zfs:zfs_arc_max = 0x780000000
or
set zfs:zfs_arc_max = 32212254720

Earlier Solaris Releases

You can only change the ARC maximum size by using the mdb command. Because the system is already booted, the ARC init routine has already executed and other ARC size parameters have already been set based on the default c_max size. Therefore, you should tune the arc.c and arc.p values, along with arc.c_max, using the formula:

arc.c = arc.c_max
arc.p = arc.c / 2

For example, to set the ARC parameters to small values, such as arc.c_max to 512 MB, and, complying with the formula above, arc.c to 512 MB and arc.p to 256 MB, use the following syntax:

# mdb -kw
> arc::print -a
    ffffffffc00b3260 p = 0xb75e46ff
    ffffffffc00b3268 c = 0x11f51f570
    ffffffffc00b3278 c_max = 0x3bb708000
> ffffffffc00b3260/Z 0x10000000
ffffffffc00b3260:  0xb75e46ff   =  0x10000000
> ffffffffc00b3268/Z 0x20000000
ffffffffc00b3268:  0x11f51f570  =  0x20000000
> ffffffffc00b3278/Z 0x20000000
ffffffffc00b3278:  0x11f51f570  =  0x20000000
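On releases that export the ZFS arcstats kstat, the configured maximum can also be confirmed from the shell (a hedged alternative; the kstat path zfs:0:arcstats:c_max is assumed to exist on your release):

# kstat -p zfs:0:arcstats:c_max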
You should verify that the values have been set correctly by examining them again in mdb (using the same print command as in the example). You can also monitor the actual size of the ARC to ensure it has not exceeded the limit:

# echo "arc::print -d size" | mdb -k

The above command displays the current ARC size in decimal. You can also use the arcstat script available at http://blogs.sun.com/realneel/entry/zfs_arc_statistics to check the ARC size as well as other ARC statistics.

Here is a Perl script that you can call from an init script to configure your ARC on boot with the above guidelines:

#!/bin/perl

use strict;
use IPC::Open2;

my $arc_max = shift @ARGV;

if ( !defined($arc_max) ) {
        print STDERR "usage: arc_tune <arc max>\n";
        exit -1;
}

$| = 1;

my %syms;
my $mdb = "/usr/bin/mdb";

open2(*READ, *WRITE, "$mdb -kw") || die "cannot execute mdb";

# ask mdb for the addresses of the ARC parameters
print WRITE "arc::print -a\n";

while(<READ>) {
        my $line = $_;

        # remember the address printed for each "<addr> <name> = <value>" line
        if ( $line =~ /^ +([a-f0-9]+) (.*) =/ ) {
                $syms{$2} = $1;
        } elsif ( $line =~ /^\}/ ) {
                last;
        }
}

# set c & c_max to our max; set p to max/2
printf WRITE "%s/Z 0x%x\n", $syms{c}, $arc_max;
print scalar <READ>;
printf WRITE "%s/Z 0x%x\n", $syms{c_max}, $arc_max;
print scalar <READ>;
printf WRITE "%s/Z 0x%x\n", $syms{p}, ( $arc_max / 2 );
print scalar <READ>;
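A hedged usage example, assuming the script above is saved as arc_tune and made executable. The argument is the desired ARC maximum in bytes, given in decimal (the script formats it into hex itself), for example 512 MB:

# ./arc_tune 536870912

When called from an init script, pass the byte value you would otherwise have set as zfs_arc_max.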
RFEs
6488341 ZFS should avoiding growing the ARC into trouble (Fixed in Nevada, build 107)
6522017 The ARC allocates memory inside the kernel cage, preventing DR
6424665 ZFS/ARC should cleanup more after itself
6429205 Each zpool needs to monitor it's throughput and throttle heavy writers (Fixed in Nevada, build 87 and Solaris 10 10/08). For more information, see this link: New ZFS write throttle
6855793 ZFS minimum ARC size might be too large
File-Level Prefetching
ZFS implements a file-level prefetching mechanism labeled zfetch. This mechanism looks at the patterns of reads to files and anticipates some reads, reducing application wait times. The current code needs attention (RFE below) and suffers from two drawbacks:

Sequential read patterns made of small reads very often hit in the cache. In this case, the current code consumes a significant amount of CPU time trying to find the next I/O to issue, whereas performance is governed more by CPU availability.

The zfetch code has been observed to limit the scalability of some loads.
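A hedged way to look for such contention on a running system is lockstat's default mode, which records lock contention events while the given command runs (lock and caller names depend on the release; look for zfetch-related entries):

# lockstat -D 20 sleep 30

If zfetch locks or functions dominate the output, the profiling approach described next applies.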
So, if CPU profiling, by using lockstat(1M) with the -I argument or er_kernel as described here:
http://developers.sun.com/prodtech/cc/articles/perftools.html
shows significant time in zfetch_* functions, or if lock profiling (lockstat(1M)) shows contention around zfetch locks, then disabling file-level prefetching should be considered. Disabling prefetching can be achieved dynamically or through a setting in the /etc/system file.

If you tune this parameter, please reference this URL in a shell script or in an /etc/system comment:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#File-Level_Prefetching

Current Solaris 10 Releases and Solaris Nevada Releases

This syntax is provided starting in the Solaris 10 8/07 release and the Solaris Nevada build 51 release.

Set dynamically:
echo zfs_prefetch_disable/W0t1 | mdb -kw
Revert to default:
echo zfs_prefetch_disable/W0t0 | mdb -kw
Set the following parameter in the /etc/system file:
set zfs:zfs_prefetch_disable = 1

Earlier Solaris Releases

Set dynamically:
echo zfetch_array_rd_sz/Z0x0 | mdb -kw
Revert to default:
echo zfetch_array_rd_sz/Z0x100000 | mdb -kw
Set the following parameter in the /etc/system file:
set zfs:zfetch_array_rd_sz = 0

RFEs
6412053 zfetch needs some love
6579975 dnode_new_blkid should first check as RW_READER (Fixed in Nevada, build 97)
Device-Level Prefetching
ZFS does device-level read-ahead in addition to file-level prefetching. When ZFS reads a block from a disk, it inflates the I/O size, hoping to pull interesting data or metadata from the disk. This data is stored in a 10-MB LRU per-vdev cache, which can short-cut the ZIO pipeline when the data is present in the cache.

Prior to Solaris Nevada build snv_70, the code caused problems for systems with lots of disks because the extra prefetched data could cause congestion on the channel between the storage and the host. Tuning down the size by which I/O was inflated had been effective for OLTP-type loads in the past. The code now prefetches only metadata (fixed by bug 6437054) and thus is not expected to require any tuning.

This parameter can be important for workloads in which ZFS is instructed to cache only metadata by setting the primarycache property per file system. For workloads that have an extremely wide random reach into 100s of TB with little locality, even metadata is not expected to be cached efficiently. Setting primarycache to metadata or even none needs to be investigated (an example is sketched at the end of this section). In conjunction, device-level prefetch tuning can help reduce the number of 64K I/Os done on behalf of the vdev cache for metadata.

If you tune this parameter, please reference this URL in a shell script or in an /etc/system comment:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device-Level_Prefetching

No tuning is required for Solaris Nevada releases, build 70 and after.

Previous Solaris 10 and Solaris Nevada Releases

Setting this tunable might only be appropriate in the Solaris 10 8/07 and Solaris 10 5/08 releases and Nevada releases from build 53 to build 69. Set the following parameter in the /etc/system file:

set zfs:zfs_vdev_cache_bshift = 13
* Setting zfs_vdev_cache_bshift with mdb crashes a system.
* zfs_vdev_cache_bshift is the base 2 logarithm of the size used to read disks.
* The default value of 16 means reads are issued in sizes of 1 << 16 = 64K.
* A value of 13 means disk reads are padded to 8K.

For earlier releases, see: http://blogs.sun.com/roch/entry/tuning_the_knobs
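For the metadata-only caching scenario mentioned above, the per-file-system cache policy can be inspected and changed with the primarycache property, on releases that provide it (a hedged sketch; tank/db is a placeholder dataset name):

zfs get primarycache tank/db
zfs set primarycache=metadata tank/db

Revert to the default of caching both data and metadata with:

zfs set primarycache=all tank/db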
RFEs 6437054 vdev_cache wises up: increase DB performance by 16% (Fixed in Nevada, build 70 and Solaris 10 10/08)
Device Driver Considerations

Device drivers may also limit the number of outstanding I/Os per LUN. If you are using LUNs on storage arrays that can handle large numbers of concurrent IOPS, then the device driver constraints can limit concurrency. Consult the configuration for the drivers your system uses. For example, the limit for the QLogic ISP2200, ISP2300, and SP212 family FC HBA (qlc) driver is described as the execution-throttle parameter in /kernel/drv/qlc.conf (an illustrative entry is sketched after the RFE below).

RFEs
6471212 need reserved I/O scheduler slots to improve I/O latency of critical ops
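A hedged illustration of the qlc setting mentioned above; the value shown is purely illustrative and should be chosen per the HBA documentation and your array's queue limits. In /kernel/drv/qlc.conf:

execution-throttle=256;

Changes to driver .conf files take effect after a reboot (or reconfiguration boot).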
Cache Flushes
If you've noticed terrible NFS or database performance on a SAN storage array, the problem is not with ZFS, but with the way the disk drivers interact with the storage devices. ZFS is designed to work with storage devices that manage a disk-level cache. ZFS commonly asks the storage device to ensure that data is safely placed on stable storage by requesting a cache flush. For JBOD storage, this works as designed and without problems. For many NVRAM-based storage arrays, a performance problem might occur if the array takes the cache flush request and actually does something with it, rather than ignoring it. Some storage arrays flush their large caches despite the fact that the NVRAM protection makes those caches as good as stable storage.

ZFS issues infrequent flushes (every 5 seconds or so) after the uberblock updates. The problem here is fairly inconsequential, and no tuning is warranted. ZFS also issues a flush every time an application requests a synchronous write (O_DSYNC, fsync, NFS commit, and so on). The completion of this type of flush is waited upon by the application and impacts performance greatly. From a performance standpoint, this neutralizes the benefits of having NVRAM-based storage.
If you experimentally observe that setting zfs_nocacheflush with mdb has a dramatic effect on performance, such as a 5 times or more difference when extracting small tar files over NFS or dd'ing 8 KB to a raw zvol, then this indicates your storage is not friendly to ZFS. Contact your storage vendor for instructions on how to tell the storage devices to ignore the cache flushes sent by ZFS. For Santricity-based storage devices, instructions are documented in CR 6578220. If you are not able to configure the storage device in an appropriate way, the preferred mechanism is to tune sd.conf specifically for your storage. See the instructions below.

As a last resort, when all LUNs exposed to ZFS come from an NVRAM-protected storage array and procedures ensure that no unprotected LUNs will be added in the future, ZFS can be tuned to not issue the flush requests by setting zfs_nocacheflush. If some LUNs exposed to ZFS are not protected by NVRAM, then this tuning can lead to data loss, application-level corruption, or even pool corruption. In some NVRAM-protected storage arrays, the cache flush command is a no-op, so tuning in this situation makes no performance difference.

NOTE: Cache flushing is commonly done as part of the ZIL operations. While disabling cache flushing can, at times, make sense, disabling the ZIL does not.

If you tune this parameter, please reference this URL in a shell script or in an /etc/system comment:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes

NOTE: If you are carrying forward an /etc/system file, please verify that any changes made still apply to your current release. Help us rid the world of /etc/system viruses.
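If, after the cautions above, you do decide to disable the flush requests, the tuning follows the same pattern as the other tunables in this guide (a hedged sketch; zfs_nocacheflush is assumed to be present on the releases this section applies to):

Set dynamically:
echo zfs_nocacheflush/W0t1 | mdb -kw
Revert to default:
echo zfs_nocacheflush/W0t0 | mdb -kw
Set the following parameter in the /etc/system file:
set zfs:zfs_nocacheflush = 1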
sd-config-list = "ATA     Super Duper     ", "nvcache1";
nvcache1=1, 0x40000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1;

Note: In the above example, nvcache1 is just a token in sd.conf. You could use any similar token.

3. Add whitespace to make the vendor ID (VID) 8 characters long (here "ATA     ") and the product ID (PID) 16 characters long (here "Super Duper     ") in the sd-config-list entry, as illustrated above.

4. After the sd.conf or ssd.conf modifications and a reboot, you can tune zfs_nocacheflush back to its default value (of 0) with no adverse effect on performance.

For more cache tuning resource information, see:
http://blogs.digitar.com/jjww/?itemid=44
http://forums.hds.com/index.php?showtopic=497
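One hedged way to discover the VID and PID strings to use in the sd-config-list entry above is to print the device inquiry data and read the Vendor: and Product: fields for the LUNs in question (exact output format varies by release):

# iostat -En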
RFEs
6462690 sd driver should set SYNC_NV bit when issuing SYNCHRONIZE CACHE to SBC-2 devices (Fixed in Nevada, build 74 and Solaris 10 5/08)
Disabling the ZIL (Don't)

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29

Current Solaris Releases

If you must, then:
echo zil_disable/W0t1 | mdb -kw
Revert to default:
echo zil_disable/W0t0 | mdb -kw

Note!: The zil_disable tunable is only evaluated during dataset mount. While it can be tuned dynamically, to reap the benefits you must zfs umount and then zfs mount the file system (or reboot, or export and import the pool, and so on). A persistent /etc/system form is sketched after the RFE below.

RFEs
6280630 zil synchronicity
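For completeness, on releases where zil_disable is still honored, the change can also be made persistent in /etc/system (a hedged sketch; the value is only consulted at dataset mount time, and disabling the ZIL remains discouraged):

set zfs:zil_disable = 1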
Further Reading
Separate Log Device
ZIL Disable
ZFS and NFS, a Fine Combination
Earlier Solaris Releases

Not tunable.

RFEs
6391873 metadata compression should be turned back on (Fixed in Nevada, build 36)
What makes this tuning suitable for database environments is that many of the writes are full record overwrites. The inflation comes when doing a partial record re-write, in which a synchronous write system call of a size greater than zfs_immediate_write_sz to a file with 128K records causes a full 128K record output. This needs to be considered with regard to the redo log files. If the average size of writes to redo log files is greater than zfs_immediate_write_sz, but many times smaller than the recordsize used for redo logs, then some redo log inflation is expected from this tuning. To avoid this inflation, the redo logs can be placed on a storage pool in which there is a separate intent log (a sketch follows at the end of this section).

Set the following parameter in the /etc/system file:

* See http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#zfs_immediate_write_sz
* Reduce write throughput required by Oracle data files with potential impact on redo logs.
* To be used in pure database environments (full block overwrites) with average redo log transactions
* either smaller than the tunable or greater than the redo log recordsize.
set zfs:zfs_immediate_write_sz = 8191

For x86 systems, when the db_block_size and the recordsize are aligned to the system page size of 4K, it is better to set zfs_immediate_write_sz to a little less than 4096, such as 4000.

For more information about optimizing ZFS for database performance, see ZFS for Databases.
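As mentioned above, a separate intent log device keeps synchronous redo log writes off the main pool disks. A hedged sketch of adding one to an existing pool (the pool and device names are placeholders, and separate log devices require a release and pool version that support them):

# zpool add dbpool log c4t0d0
# zpool status dbpool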
http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
http://blogs.sun.com/roch/entry/zfs_and_directio
http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

Integrated RFEs that introduced or changed tunables
snv_51 : 6477900 want more /etc/system tunables for ZFS performance analysis
snv_52 : 6485204 more tuneable tweakin
snv_53 : 6472021 vdev knobs can not be tuned