
Systems Engineering at HPCRD

Gary Leong
HPCRD Systems Engineer
High Performance Computing Research
Lawrence Berkeley National Laboratory
High Performance Computing
Research Department

The High Performance Computing Research Department conducts research and
development in mathematical modeling, algorithmic design, software implementation,
and system architectures, and evaluates new and promising technologies.

C O M P U T A T I O N A L R E S E A R C H D I V I S I O N
ZFS – Why?

 HPCRD researches new technologies
 Seeks to optimize the performance, redundancy, and scalability of current hardware
 Benefits over, and an alternative to, current filesystems (e.g. ext2, ext3, UFS, ReiserFS)
 ZFS already tentatively embraced by the Unix community (Apple, Linux)
 Open source under the CDDL, which is based on the MPL
 Disksuite is not quite an enterprise-level product in performance, redundancy, or scalability
 The third-party alternative, Veritas Volume Manager, is
- Expensive
- Not simple to administer
 Finally, Sun offers an enterprise-level filesystem
 Features similar to Veritas without the high cost, fully integrated into the OS, and portable

ZFS – At a glance

 Zettabyte File System
 128-bit file system: 2^64 (about 1.8 x 10^19, the "16 billion billion") times the capacity of a 64-bit file system (huge capacity)
 Pooled storage: shared bandwidth (I/O) and capacity
 Increased performance over traditional volume managers (filesystem, volume manager, and RAID are integrated)
 Transactional operation: copy on write (no journaling)
 Snapshots (read-only) and clones (read-write)
 End-to-end data integrity: all data is checksummed
 Ease of administration (integration of services)
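
The capacity comparison can be checked with a few lines of arithmetic. This is only an illustrative sketch: it assumes that an "N-bit file system" means an address space of 2^N units, and the variable names are ours.

```python
# Ratio between a 128-bit and a 64-bit address space.
# Assumption: "N-bit file system" means 2**N addressable units.
ratio = 2**128 // 2**64

# "16 billion billion" uses binary billions (2**30 each): 16 * 2**30 * 2**30.
sixteen_billion_billion = 16 * 2**30 * 2**30

print(ratio == 2**64)                     # True
print(ratio == sixteen_billion_billion)   # True
print(f"{ratio:.3e}")                     # 1.845e+19
```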

ZFS is like “Virtual Memory”

ZFS – VM similarity

ZFS – Volumes and Pool Storage
 Traditional Volumes

- One-to-one ratio between filesystem and volume

 ZFS Pool Storage

- Storage used by each filesystem expands and shrinks automatically
- Shared bandwidth (I/O)
- Many filesystems per storage pool
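
The difference between fixed volumes and pooled storage can be sketched with a toy model. This is not how ZFS is implemented; the class `StoragePool`, its methods, and the sizes are hypothetical and only illustrate many filesystems drawing on one shared pool of space.

```python
class StoragePool:
    """Toy model: many filesystems draw from one shared pool of space."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.used_gb = {}          # filesystem name -> space consumed

    def create_fs(self, name):
        self.used_gb[name] = 0     # no size is fixed at creation time

    def free_gb(self):
        return self.capacity_gb - sum(self.used_gb.values())

    def write(self, name, size_gb):
        if size_gb > self.free_gb():
            raise OSError("pool is full")
        self.used_gb[name] += size_gb

    def add_device(self, size_gb):
        self.capacity_gb += size_gb   # every filesystem sees the new space


pool = StoragePool(capacity_gb=100)
pool.create_fs("home")
pool.create_fs("scratch")
pool.write("scratch", 90)      # would not fit in a fixed 50 GB volume
print(pool.free_gb())          # 10
pool.add_device(100)
print(pool.free_gb())          # 110
```

With traditional volumes, "home" and "scratch" would each have been given a fixed share up front, and unused space in one could not be used by the other.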

ZFS – is like a “merged FS w/
RAID/Volume manager”

ZFS – is like an attached “NAS”
 Think of having a NAS with its integrated filesystem, RAID, and other features attached
locally, directly to VFS instead of through the network.

ZFS – “NAS” like elaborated

 Most similar to a NAS without the network
 Not external storage and not quite a NAS box
 Similar to NetApp in features (software based instead of hardware based)
 Integrated RAID/VM (pooled storage)
 Design resembles WAFL (Write Anywhere File Layout)
• Copy on write
• No need for fsck or journaling: always consistent on disk
 Snapshots and clones
• Very fast backups
• Only changes are tracked, rather than copying the entire tree
 Central administration
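
Copy on write and snapshots can be illustrated with a small sketch. It is a toy model under our own assumptions, not the ZFS or WAFL on-disk design: the class `CowStore` and its fields are hypothetical, and a flat name map stands in for the block tree.

```python
class CowStore:
    """Toy copy-on-write store: blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}      # block id -> data (immutable once written)
        self.live = {}        # file name -> block id (the current "tree")
        self.snapshots = {}   # snapshot name -> frozen copy of the name map
        self._next_id = 0

    def write(self, name, data):
        block_id = self._next_id          # always allocate a fresh block
        self._next_id += 1
        self.blocks[block_id] = data
        self.live = {**self.live, name: block_id}   # switch pointers last

    def snapshot(self, snap):
        # Only the small name map is kept; no data blocks are copied.
        self.snapshots[snap] = self.live

    def read(self, name, snap=None):
        tree = self.live if snap is None else self.snapshots[snap]
        return self.blocks[tree[name]]


store = CowStore()
store.write("a.txt", b"version 1")
store.snapshot("monday")
store.write("a.txt", b"version 2")
print(store.read("a.txt"))                  # b'version 2'
print(store.read("a.txt", snap="monday"))   # b'version 1'
print(len(store.blocks))                    # 2
```

Because old blocks are left untouched until nothing refers to them, the on-disk state is consistent at every step, and a snapshot costs only the pointers it keeps.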

ZFS - Copy on Write (COW)

ZFS - Central Administration

 Pools and filesystems are created through zfs administration: no need for format/fdisk and newfs/mkfs
 Automatic mounts: no need to edit /etc/vfstab manually or use the “mount” command
 Checksums enabled/disabled through zfs administration
 Quotas centralized in zfs administration
 Compression enabled/disabled in zfs administration
 NFS sharing through zfs administration
 Snapshots and clones through zfs administration
 Backups (full and incremental snapshots) through zfs administration
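
As a rough illustration of the tasks listed above, the sketch below assembles the corresponding `zpool` and `zfs` command lines and prints them without running anything. The pool name, device names, filesystem names, snapshot names, and quota value are hypothetical placeholders.

```python
import shlex

POOL = "tank"                                      # hypothetical pool name
DISKS = ["c1t0d0", "c1t1d0", "c1t2d0", "c1t3d0"]   # hypothetical devices
FS = f"{POOL}/home"                                # hypothetical filesystem

commands = [
    ["zpool", "create", POOL, "raidz", *DISKS],    # no format/newfs step
    ["zfs", "create", FS],                         # mounted automatically
    ["zfs", "set", "checksum=on", FS],
    ["zfs", "set", "quota=10G", FS],
    ["zfs", "set", "compression=on", FS],
    ["zfs", "set", "sharenfs=on", FS],
    ["zfs", "snapshot", f"{FS}@monday"],
    ["zfs", "clone", f"{FS}@monday", f"{POOL}/home-test"],
    ["zfs", "snapshot", f"{FS}@tuesday"],
    ["zfs", "send", "-i", f"{FS}@monday", f"{FS}@tuesday"],  # incremental
]

for argv in commands:
    print(shlex.join(argv))   # print only; nothing is executed
```

In practice the output of `zfs send` is redirected to a file or piped to `zfs receive` on another system.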

ZFS - Other notable features

 All data checksummed
 Self-healing (mirror)
 Disk scrubbing
 Object-based transactions
- WAFL-like: data can be written to any location on disk
- Not block-by-block changes, but aggregate changes to objects (transaction groups)
 ZFS Intent Log (ZIL)
 RAIDZ
- Variable RAID stripe width
- Dynamic striping (stripes widen automatically as drives are added)
- All writes are full-stripe
 Portability: filesystems can be moved between SPARC and x86

ZFS - Data checksum

 Patterned on a Merkle tree: each block's checksum is stored in its parent, so each level validates everything below it
 Similar in purpose to ECC memory
 The checksum is stored separately from the data it protects
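
The idea of keeping each checksum in the parent block can be sketched as follows. This is a minimal two-level illustration, not the ZFS block-pointer format; SHA-256 from Python's `hashlib` is used only for convenience, and the helper names are ours.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_parent(children: list[bytes]) -> bytes:
    """A parent block is the concatenation of its children's checksums."""
    return b"".join(checksum(child) for child in children)


def verify(parent: bytes, children: list[bytes]) -> list[bool]:
    """Check each child against the checksum held in the parent."""
    size = hashlib.sha256().digest_size
    stored = [parent[i:i + size] for i in range(0, len(parent), size)]
    return [checksum(child) == want for child, want in zip(children, stored)]


data_blocks = [b"block 0", b"block 1", b"block 2"]
parent = build_parent(data_blocks)
root = checksum(parent)            # validates the parent, hence everything

data_blocks[1] = b"block 1 (silently corrupted)"
print(verify(parent, data_blocks))     # [True, False, True]
print(checksum(parent) == root)        # True
```

Because the checksum lives apart from the data, a block that is corrupted or written to the wrong place cannot vouch for itself. With a mirror, the good copy can then be used to repair the bad one.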

ZFS - ZIL

 System calls that modify the filesystem are logged as transaction records by the ZIL
 Records contain sufficient information to replay them after a crash
 Log records are variable in size, depending on their structure
 ZIL writes
- Small writes: data is written as part of the log
- Large writes: data is written to disk and a pointer to the data is written to the log
 At mount time, ZFS checks for a ZIL log; if one exists, the system probably crashed
 The ZIL allows performance gains, especially for databases
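
The small-write and large-write paths described above can be sketched with a toy intent log. The class `IntentLog`, the record layout, and the `INLINE_LIMIT` threshold are hypothetical and chosen only for the demonstration; they do not reflect the real ZIL format.

```python
INLINE_LIMIT = 8   # hypothetical threshold in bytes, chosen for the demo


class IntentLog:
    def __init__(self):
        self.disk = {}      # block address -> data written directly to disk
        self.records = []   # intent log: one record per logged write

    def log_write(self, name, data):
        if len(data) <= INLINE_LIMIT:
            # Small write: the data itself travels inside the log record.
            self.records.append(("inline", name, data))
        else:
            # Large write: data goes to disk, the log keeps only a pointer.
            addr = len(self.disk)
            self.disk[addr] = data
            self.records.append(("pointer", name, addr))

    def replay(self):
        """Rebuild file contents from the log, as after a crash."""
        files = {}
        for kind, name, payload in self.records:
            files[name] = payload if kind == "inline" else self.disk[payload]
        return files


log = IntentLog()
log.log_write("small.txt", b"tiny")
log.log_write("big.bin", b"x" * 100)
print([kind for kind, _, _ in log.records])   # ['inline', 'pointer']
recovered = log.replay()
print(len(recovered["big.bin"]))              # 100
```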

ZFS - RAIDZ

 Dynamic stripe width
- Data and parity can be distributed across a varying number of drives, depending on the size of the write
 All writes are full-stripe writes
- No need to read-modify-write
• RAID 5 penalty: read the old data and the corresponding parity, calculate the new parity, then write the new data and new parity
 Dynamic striping
- Writes are automatically spread across drives as they are added to the pool
 Allows the use of cheap disks while still providing data integrity, performance, and redundancy
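
The read-modify-write penalty can be made concrete with XOR parity. The sketch below uses single-parity XOR on tiny byte strings as a stand-in; it illustrates the arithmetic only and is not the RAIDZ layout itself.

```python
from functools import reduce


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def parity(blocks):
    return reduce(xor, blocks)


# Full-stripe write: parity comes from the new data alone, no reads needed.
stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
p = parity(stripe)

# RAID 5 small write: updating one block needs the old data and old parity.
new_block = b"\xaa\xbb"
p_rmw = xor(xor(p, stripe[1]), new_block)   # read old, modify, write new
stripe[1] = new_block
print(p_rmw == parity(stripe))              # True

# Losing any one block: rebuild it from the survivors plus parity.
rebuilt = parity([stripe[0], stripe[2], p_rmw])
print(rebuilt == stripe[1])                 # True
```

When every write is a full stripe, only the first path is ever taken, so the two extra reads of the small-write path disappear.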

ZFS - Truths (no marketing)

 Not entirely new, but a software version of something that already exists in hardware, with some unique features
 RAIDZ is not really a RAID: RAID and filesystem are merged (but this allows the use of cheap drives)
- Jeff Bonwick: “You have to traverse the filesystem metadata to determine the RAIDZ geometry”
- Jeff Darcy: “True RAID levels don’t require knowledge of higher-level applications”

ZFS - Experimental Results

 Hardware: Ultra 2 with external RAID pack
 Tested
- UFS on Disksuite
- ZFS
 What was tested?
- Performance: RAID 5 on Disksuite vs. RAIDZ
- Crash recovery
- Creating 400M files
• UFS on Disksuite, RAID 5 (4 drives)
  Start: Wed Jun 14 12:04:16 PDT 2006
  End: Wed Jun 14 19:37:14 PDT 2006
• ZFS, RAIDZ (4 drives)
  Start: Mon Jun 19 14:16:29 PDT 2006
  End: Mon Jun 19 15:56:59 PDT 2006
- Redundancy with removal of a drive, to simulate losing a drive
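
The elapsed times for the file-creation test follow from the two timestamp pairs above. The short calculation below assumes that, in each pair, the first timestamp marks the start of the run and the second marks the end.

```python
from datetime import datetime

# Assumption: first timestamp of each pair = start, second = end (PDT).
ufs_start = datetime(2006, 6, 14, 12, 4, 16)
ufs_end = datetime(2006, 6, 14, 19, 37, 14)
zfs_start = datetime(2006, 6, 19, 14, 16, 29)
zfs_end = datetime(2006, 6, 19, 15, 56, 59)

ufs_elapsed = ufs_end - ufs_start
zfs_elapsed = zfs_end - zfs_start

print(ufs_elapsed)                          # 7:32:58
print(zfs_elapsed)                          # 1:40:30
print(round(ufs_elapsed / zfs_elapsed, 1))  # 4.5
```

Under that assumption, the ZFS run finished in roughly a quarter of the time taken by UFS on Disksuite.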

Writer Performance: ZFS/UFS (Disksuite)

[Figure: two surface plots of write throughput (kB/sec) against file size (kB) and record size (kB). Panels: "ZFS: Write Performance - 5 disks" and "UFS: Writer Performance - 5 disks".]

Re-writer Performance: ZFS/UFS (Disksuite)

[Figure: two surface plots of re-write throughput (kB/sec) against file size (kB) and record size (kB). Panels: "UFS: Re-writer Performance - 5 disks" and "ZFS: Re-writer Performance - 5 disks".]

Reader Performance: ZFS/UFS (Disksuite)

[Figure: two surface plots of read throughput (kB/sec) against file size (kB) and record size (kB). Panels: "ZFS: Reader Performance - 5 disks" and "UFS: Reader Performance - 5 disks".]

Re-reader Performance: ZFS/UFS (Disksuite)

[Figure: two surface plots of re-read throughput (kB/sec) against file size (kB) and record size (kB). Panels: "ZFS: Re-reader Performance - 5 disks" and "UFS: Re-reader Performance - 5 disks".]

Random Read Performance: ZFS/UFS (Disksuite)

[Figure: two surface plots of random read throughput (kB/sec) against file size (kB) and record size (kB). Panels: "ZFS: Random Read Performance - 5 disks" and "UFS: Random Read Performance - 5 disks".]

Random Write Performance: ZFS/UFS (Disksuite)

[Figure: two surface plots of random write throughput (kB/sec) against file size (kB) and record size (kB). Panels: "UFS: Random Write Performance - 5 disks" and "ZFS: Random Write Performance - 5 disks".]

ZFS – Summary/Conclusions

 Large performance gain over UFS on Disksuite in these tests
 Enterprise-level filesystem/volume/RAID product
 Software-based product using inexpensive disks
 Performance from shared I/O and storage
 Ease of administration: creation, snapshots and clones, compression, sharing, etc.
 End-to-end data integrity
 RAIDZ
 Sun’s integration into Solaris and portability between platforms
 Free

ZFS - Upcoming features

 Will be released with a new version of Solaris 10
 Support for hot spares
 Encryption
 Secure deletion
 Perhaps NVRAM for the ZIL
 Speculation about Mac OS X support
 Speculation about, and possibilities for, Linux
- Ricardo Correia has begun a port to FUSE/Linux as part of the Google Summer of Code
- Runs as a module in user space
- Sun’s vested interest in Linux and Opterons may also push the port to Linux

ZFS - References
 Jeff Bonwick. ZFS: The Last Word in File Systems. Sun Microsystems.
 Jeff Bonwick. ZFS: The Last Word in Filesystems. Jeff Bonwick's Blog (http://blogs.sun.com/roller/page/bonwick?entry=raid_z)
 Neil Perrin. ZFS: The Lumberjack. Neil Perrin’s Weblog (http://blogs.sun.com/roller/page/perrin?entry=the_lumberjack)
 ZFS. Wikipedia, the free encyclopedia (http://en.wikipedia.org/wiki/ZFS)
 Matthew Ahrens. What is ZFS? Matthew Ahrens’s Weblog (http://blogs.sun.com/roller/page/ahrens?catname=%2FZFS)
 NewsForge. Sun’s ZFS builds on promise of RAID (http://os.newsforge.com/os/06/01/11/1921211.shtml?tid=16)
 Jeff Darcy. In ZFS’s Defense; RAID-Z Redux; No More Mr. Nice Guy; ZFS Again; ZFS. Canned Platypus (http://pl.atyp.us/wordpress/?p=1009)
 Dave Hitz, James Lau, and Michael Malcolm. File System Design for an NFS File Server Appliance. Network Appliance.
 Sun Microsystems. ZFS Administration Guide, March 2006.
 Sun Microsystems. ZFS On-Disk Specification (Draft 12/9/2005).
 Eric Schrock. Ztest on Linux. Eric Schrock's Weblog (http://blogs.sun.com/roller/page/eschrock?entry=ztest_on_linux)

Thank you
