DCACHE WORKSHOP
DEUTSCHES ELEKTRONEN-SYNCHROTRON DESY
ZEUTHEN, April 19, 2012
Christoph Anton Mitterer
christoph.anton.mitterer@lmu.de
BLOCK DEVICES, FILESYSTEMS AND BLOCK LAYER ALIGNMENT
Overview
This lecture covers the following chapters:
I. Blocks, Block Devices And Filesystems
Gives an introduction to blocks, block devices and filesystems and describes common types of them.
II. Block Layer Alignment
Covers the concepts of block layer alignment, reasons for misalignment and information on how to prevent it for some common systems, as well as an overview of the Linux kernel's device topology information.
BLOCKS, BLOCK DEVICES AND FILESYSTEMS
Introduction To Blocks
In computing, organising data in blocks is a general and basic technique.
Examples range from most forms of multimedia encodings (for example JPEG, MP3 or H.264) to cryptographic ciphers, and even some databases organise their very low level structures in a kind of blocks.
Most storage media and memory (here, the word page is typically used) are organised in terms of blocks, although modern concepts like extents or transparent huge pages make things a bit more complex on a higher level.
So apart from some exceptions where data is streamed (basically all forms of tape), all the other common types of storage, like hard disk drives, solid state drives and flash drives or cards as well as optical discs, are block-addressed.
Blocks have several basic properties:
The blocks of a given device usually have the same size.
A basic and, for many areas, the smallest block size is 512 B. This used to be the common block size for hard disks, but recently drives with 4 KiB blocks showed up, though some of them still behave externally as if they used 512 B blocks.
The blocks are directly addressable, that is randomly accessible.
The contents of a block may be directly accessible or not. For block-organised storage media, the former is usually the case.
Usually, there is also some latency in accessing a block (for example the seek time of hard disks).
Depending on the device, data may only be read and/or written as full blocks.
Depending on the device, blocks are writeable many times, or just once (for example WORM or non-erasable optical discs).
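To make the "full blocks only" property concrete, the following minimal sketch computes which blocks a byte-range access touches. The function name `blocks_touched` and the example offsets are illustrative assumptions, not part of the lecture.

```python
def blocks_touched(offset, length, block_size=512):
    """Return the block numbers covering the byte range [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)


# A 100-byte access starting at byte 500 crosses a 512 B block boundary ...
print(list(blocks_touched(500, 100, block_size=512)))
# ... but stays within one block on a device with 4 KiB blocks.
print(list(blocks_touched(500, 100, block_size=4096)))
```

If the device only transfers full blocks, the first access moves 1024 bytes and the second 4096 bytes, although only 100 bytes were requested.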
Filesystems are not block devices themselves but build upon the latter.
Therefore it is reasonable to view them as another layer.
Blocks (arranged in a device) can be visualised as follows:
Often, block devices can be stacked, which means that the upper level uses and stores its own data on the lower one.
This works for some physical block devices (for example disk drives that are assembled into one RAID by a hardware controller) and typically for most logical block devices created and handled by the operating system.
Each level in such a stack is called a block device layer, or block layer for short.
Every type of block device implements a special functionality, which is controlled via kernel interfaces and/or the respective hardware controller BIOS.
Most types of block devices add meta-data that shall not be (directly) seen by the upper layer, and some types of block devices even distribute the actual data non-sequentially.
So that an upper layer sees sequentially addressed blocks, a virtual addressing is introduced by means of mapping.
Obviously the mapping costs some performance, but this is typically very small and thus negligible.
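Such a mapping can be sketched as a small lookup. The layout below (a meta-data header of 8 blocks and two data segments stored out of order) is purely hypothetical and only illustrates how sequential upper-layer block numbers are translated to lower-layer ones.

```python
# Hypothetical layout: the layer hides 8 blocks of meta-data at the start of
# the lower device and stores its data in two non-sequential segments.
HEADER_BLOCKS = 8
SEGMENTS = [(100, 4), (20, 4)]  # (start in the lower data area, length in blocks)


def to_lower_block(upper_block):
    """Translate a sequential upper-layer block number to a lower-layer one."""
    if upper_block < 0:
        raise IndexError("negative block number")
    remaining = upper_block
    for start, length in SEGMENTS:
        if remaining < length:
            return HEADER_BLOCKS + start + remaining
        remaining -= length
    raise IndexError("block number beyond the end of the device")


# Upper blocks 0..7 look contiguous, the lower blocks are not.
print([to_lower_block(n) for n in range(8)])
```

The upper layer never sees the header or the segment order; it only sees blocks 0 to 7.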
Read Caching And Read Ahead
Many types of block devices cache data that has been read in either memory or faster storage, so that it can be retrieved faster if demanded again.
Closely related is the technique of reading ahead, which means that more data than actually requested is automatically read and put into the read cache. More advanced algorithms try to predict how much data will be read next and adaptively read ahead.
Whether read ahead improves performance depends largely on the typical usage patterns, so there is no general rule. Obviously, the number of bytes read ahead has a large impact here.
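The effect of a fixed read ahead can be illustrated with a toy cache. This is a simplified sketch under stated assumptions: the class `ReadAheadCache` is hypothetical, the cache is unbounded, and one miss is assumed to cost one request to the lower layer regardless of how many blocks are fetched.

```python
class ReadAheadCache:
    """Toy read cache that fetches `read_ahead` extra blocks on every miss."""

    def __init__(self, read_block, read_ahead=4):
        self.read_block = read_block   # callable: block number -> block contents
        self.read_ahead = read_ahead
        self.cache = {}                # unbounded, for illustration only
        self.lower_requests = 0

    def read(self, block_no):
        if block_no not in self.cache:
            # One request to the lower layer fetches the block plus the read ahead.
            self.lower_requests += 1
            for n in range(block_no, block_no + 1 + self.read_ahead):
                self.cache[n] = self.read_block(n)
        return self.cache[block_no]


cache = ReadAheadCache(read_block=lambda n: f"data-{n}", read_ahead=4)
for n in range(10):          # sequential access pattern
    cache.read(n)
print(cache.lower_requests)  # far fewer requests than blocks read
```

With a random access pattern the extra blocks are mostly wasted, which is why the usage pattern decides whether read ahead helps.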
Data is immediately flushed to the next lower layer.
Asynchronous Write (Write-Back or Write-Behind)
Data may be retained in a cache and flushed to disk later, when the algorithm decides this is suitable.
heads.
Moving parts leading to mechanical wear.
Solid State Drives (SSD):
Block Sizes: typically 512 B, 4 KiB (but many such drives behave logically as 512 B devices)
Medium Sizes: 12 TiB (depends on the technique; smaller for enterprise devices)
Interfaces: SATA, SAS, Fibre Channel, PCI Express, legacy: PATA, SCSI
Many techniques: typically NAND SLC or MLC, ECC, DRAM-buffered
Basically much faster than HDD in any respect, but also still more expensive.
No moving parts, but cells are subject to electrical wear and can only be written a given number of times. Sophisticated wear levelling algorithms are used.
Cells must be erased before being re-written. Therefore always full cells are written.
redundancy/resilience, performance or both.
RAID-Types: hardware, firmware/driver-based (fake), software
(RAID-like features are also found in some modern filesystems or other block device types)
RAID-Levels: linear, 0, 1, 5, 6, hybrids (for example 10, 50 or 60), obsolete: 2, 3, 4
also New RAID Classification by the RAID Advisory Board and non-standard levels
Typical Techniques: Read Ahead, Adaptive Read Ahead, Write-Through/Write-Back, Hot-Plugging, Hot-Spares, Battery Packs, Scrubbing and Verifying
Striping: Except in the linear mode, the storage media assembled into a RAID are not filled one after the other but concurrently. Data written is divided into chunks of a fixed size, where each chunk is written to the next data (not parity) medium.
Typical chunk sizes are 64 KiB, 128 KiB, 256 KiB, 512 KiB, 1 MiB
It depends on the respective RAID implementation and also on the RAID level, but usually one must expect that always full chunks are read and written.
Therefore, the chunk size may greatly influence the performance of a RAID, depending on the respective use case.
The stripe size is usually the size of one stripe with its data and parity chunks.
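Striping can be sketched for the simplest case, a RAID 0 without parity. The function `locate` and its parameters are illustrative assumptions; real implementations, especially those with rotating parity, use more involved layouts.

```python
KIB = 1024


def locate(offset, chunk_size, data_drives):
    """Map a byte offset of a RAID 0 device to (drive, stripe, offset in chunk)."""
    chunk_no = offset // chunk_size
    drive = chunk_no % data_drives     # chunks go to the drives in turn
    stripe = chunk_no // data_drives   # one stripe = one chunk on every drive
    return drive, stripe, offset % chunk_size


# Three drives, 64 KiB chunks: consecutive chunks land on consecutive drives.
for chunk in range(4):
    print(chunk, locate(chunk * 64 * KIB, 64 * KIB, 3))
```

The fourth chunk wraps around to the first drive and starts the second stripe.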
devices as volumes.
Physical Volumes (PV): These are the underlying block devices used by LVM for storing the data.
Volume Groups (VG): PV are organised in VG, which have a number of properties including a chunk size and an allocation policy (that is how chunks from the PV are distributed to LV).
Each VG can have multiple PV, but each PV must belong to exactly one VG.
Logical Volumes (LV): The block devices exported to be used by upper layers.
LVM allows combining or dividing block devices into other block devices, which gives it features known from the RAID levels linear and 0 and from partitioning.
PV and LV can be added to or removed from existing VG.
LVM also implements advanced features like clustering, snapshots, striping or mirroring.
Data is organised in extents (default size 4 MiB), which are however not fully read and written, as is usually the case with RAID chunks.
Disklabel, GUID Partition Table
Depending on the type of partition label, there are several limitations, for example the DOS type cannot handle partitions larger than 2 TiB, the number of partitions is limited and they cannot be moved.
In most cases not needed anymore, as LVM is much more flexible anyway.
dm-crypt:
A front-end to the device-mapper providing on-disk encryption.
Strong algorithms and cipher modes tailored towards on-disk encryption (for example XTS).
dm-multipath:
Several paths (connections) to the same lower level block device for redundancy.
Loop devices:
Maps a file to a block device.
Filesystems
Filesystems lie on top of block devices and export a file hierarchy to the userspace, in which data is organised as files and no longer just meaningless blocks.
Thereby, filesystems hide the block layout and organisational details from the userspace.
Some properties of filesystems:
A lot of different kinds of global and per-file meta-data, including the normal POSIX properties as well as XATTR and ACL.
Files are internally organised as blocks or on some newer filesystems
blocks/extents allocator algorithms, et cetera.
Some types of filesystems:
Normal Filesystems
btrfs, ext2/3/4, XFS, JFS, ReiserFS, Reiser4, ZFS, UFS
Media-Centric Filesystems
UDF, ISO 9660, JFFS2, LogFS
Pseudo Filesystems
procfs, sysfs, swap
Special Filesystems
tmpfs, aufs, romfs, SquashFS
Network- And Cluster Filesystems
NFS, CIFS, SMB, GFS2, GPFS, OCFS2, AFS, GlusterFS, Lustre, GFS, XtreemFS, Ceph
Filesystems may be implemented in userspace via FUSE, for example:
davfs2, SSHFS, GlusterFS, GmailFS, et cetera
BLOCK LAYER ALIGNMENT
differently (for example striped or randomly instead of contiguously).
They may add meta-data in form of headers, footers or within their block space.
on the upper level, more than actually necessary are accessed on the lower level.
Example: Block 0 is accessed on the upper level. Then blocks 0 and 1 need to be accessed on the lower level. The 2nd half of block 1 was not required.
Throughput-wise not that big a problem on streaming (if caching works) but on random access.
Moreover, the lower block 1 may be accessed even twice, when the upper block 1 is read, too. In any case, unnecessary IOPS may be produced.
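The example can be reproduced with a few lines of arithmetic. The sketch assumes equally sized 4 KiB blocks on both levels and a shift of half a block between them; the function `lower_blocks` and these numbers are illustrative and not taken from the slide's figure.

```python
def lower_blocks(upper_block, upper_size, lower_size, shift=0):
    """Return the lower-level blocks touched when one upper-level block is accessed.

    `shift` is the byte offset of the upper layer's block 0 on the lower device.
    """
    start = shift + upper_block * upper_size
    end = start + upper_size - 1
    return list(range(start // lower_size, end // lower_size + 1))


# Aligned: every upper block maps to exactly one lower block.
print(lower_blocks(0, 4096, 4096, shift=0))
# Misaligned by half a block: upper block 0 needs lower blocks 0 and 1 ...
print(lower_blocks(0, 4096, 4096, shift=2048))
# ... and upper block 1 needs lower block 1 again.
print(lower_blocks(1, 4096, 4096, shift=2048))
```

The same function also covers differing block sizes: with 512 B upper blocks on 4 KiB lower blocks, several upper blocks fall into the same lower block, which is then read in full each time.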
on the upper level, more than actually necessary are accessed on the lower level.
Example: Block 0 is accessed on the upper level. Then block 0, of which are not required, needs to be accessed on the lower level.
Throughput-wise not that big a problem on streaming (if caching works) but on random access.
Moreover, the lower block 1 may be accessed even twice, when the upper block 1, 6 or 7 are read, too. In any case, unnecessary IOPS may be produced.
least. But they usually do not warn if you use sizes on a higher level that are smaller than those of lower levels.
The filesystem's block size is by default (ext2/3/4 uses for example 4 KiB) often much smaller than the chunk size (typically starts at 64 KiB) of an underlying RAID.
It may generally be reasonable to increase the filesystem's block size when mainly big files are used.
Whether blocks are fully read/written depends on the type of block device and often on the specific model or implementation.
HDD and SSD and filesystems typically access full blocks.
For RAID this is highly dependent on the model/implementation.
In principle a RAID should not need to read full chunks under normal operation. But in general: check the respective documentation!
LVM does not access full extents under normal operation (with the exception of snapshots, when copy-on-writes happen).
global meta-data is stored in the controller itself and the parity data is of the same size as the actual data chunks and therefore automatically aligned if these are.
The mdadm software RAID from Linux may be used with four different super-block formats:
0.9 and 1.0
Stored at/near the end of the underlying block devices. Alignment is not necessary.
1.1 and 1.2
load on drive A.
Both read and write caching mitigate this only to some extent.
be balanced between the drives A, B and C.
global meta-data.
When LVM is used, layers above may be prone to unbalanced spreading of global meta-data.
This is especially the case if its extent size or the total sizes of PV or LV are not a multiple of the lower layer's structure sizes.
Care must also be taken to consider the different allocation policies (the order in which chunks from underlying PV are distributed to LV).
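Whether an extent size is a multiple of a lower layer's structure size is a simple divisibility check. The following sketch uses the default 4 MiB extent size and, as an assumed example, an underlying RAID with 256 KiB chunks; the helper `fits_evenly` is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def fits_evenly(upper_size, lower_size):
    """True if the upper structure size is a whole multiple of the lower one."""
    return lower_size > 0 and upper_size % lower_size == 0


extent = 4 * MIB
# 8 data chunks of 256 KiB form a 2 MiB data stripe: extents cover whole stripes.
print(fits_evenly(extent, 8 * 256 * KIB))
# 7 data chunks of 256 KiB form a 1792 KiB data stripe: extents do not fit evenly.
print(fits_evenly(extent, 7 * 256 * KIB))
```

In the second case a larger extent size that is a multiple of 1792 KiB would be needed for extents to start on stripe boundaries.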
Aligns the start of the actual data to this offset (or a multiple, if required).
--dataalignmentoffset
An additional shift of the data area.
The following vgcreate options are of special interest:
--physicalextentsize
Sets the value of the extent size used by the respective VG.
The following lvcreate options are of special interest:
--extents
The size of the LV in extents. Preferred over --size, which sets the size in bytes.
--contiguous
Whether contiguous extent allocation should be performed or not.
Other possibly interesting options include:
--readahead, --type, --stripes, --stripesize and --mirrors
Sets the extent allocation policy to one of contiguous, cling, normal, anywhere or inherit.
--size and --offset
Loop device:
Basically, for a loop device to be aligned, the underlying filesystem must be aligned.
If this is not the case, a compensation may be possible with the --offset option.
--offset
Shifts the start of the loop device into the file.
--sizelimit
Sets the size of the device.
The RAID's chunk size in number of filesystem blocks.
-E stripe_width=value
The size of the data parts of the RAID's stripes in filesystem blocks.
That is the number of data chunks per stripe multiplied by the value from the -E stride option.
drive, 0 hot spares); filesystem with a block size of 4 KiB
-E stride=(256 KiB / 4 KiB = 64),stripe_width=(64 * 8 = 512)
RAID6 with a chunk size of 256 KiB and 10 drives in total (7 data drives, 2 parity drives, 1 hot spare); filesystem with a block size of 4 KiB
-E stride=(256 KiB / 4 KiB = 64),stripe_width=(64 * 7 = 448)
RAID60 with a chunk size of 256 KiB and 10 drives in total (6 data drives, 4 parity drives, 0 hot spares); filesystem with a block size of 4 KiB
-E stride=(256 KiB / 4 KiB = 64),stripe_width=(64 * 6 = 384)
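The calculations in these examples can be sketched as a small helper. The function name `ext_raid_options` is hypothetical; only the `stride` and `stripe_width` extended options and the arithmetic come from the slides.

```python
KIB = 1024


def ext_raid_options(chunk_size, fs_block_size, data_drives):
    """Build the -E option string for an ext filesystem on top of a RAID."""
    if chunk_size % fs_block_size != 0:
        raise ValueError("chunk size must be a multiple of the filesystem block size")
    stride = chunk_size // fs_block_size   # chunk size in filesystem blocks
    stripe_width = stride * data_drives    # data part of one stripe in filesystem blocks
    return f"-E stride={stride},stripe_width={stripe_width}"


# RAID6 with 256 KiB chunks and 7 data drives, 4 KiB filesystem blocks.
print(ext_raid_options(256 * KIB, 4 * KIB, data_drives=7))
```

Parity drives and hot spares do not enter the calculation; only the number of data chunks per stripe matters.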
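The RAID's chunk size in bytes.
sunit=value is an alternative form, where the value has to be specified in 512 B blocks.
-d sw=value
The size of the data parts of the RAID's stripes, given as a multiple of the su value (that is, the number of data chunks per stripe).
swidth=value is an alternative form, where the value has to be specified in 512 B blocks.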
drive, 0 hot spares); filesystem with a block size of 4 KiB
-d su=256k -d sw=8
RAID6 with a chunk size of 256 KiB and 10 drives in total (7 data drives, 2 parity drives, 1 hot spare); filesystem with a block size of 4 KiB
-d su=256k -d sw=7
RAID60 with a chunk size of 256 KiB and 10 drives in total (6 data drives, 4 parity drives, 0 hot spares); filesystem with a block size of 4 KiB
-d su=256k -d sw=6
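The relation between the two forms of the XFS options can be sketched as follows. The sketch assumes, as the examples above do, that sw is given as the number of data chunks per stripe, while sunit and swidth are counted in 512 B blocks; the function `xfs_raid_values` is hypothetical.

```python
KIB = 1024
SECTOR = 512  # sunit and swidth are specified in 512 B blocks


def xfs_raid_values(chunk_size, data_drives):
    """Return both forms of the XFS stripe geometry for a RAID."""
    if chunk_size % SECTOR != 0:
        raise ValueError("chunk size must be a multiple of 512 B")
    sunit = chunk_size // SECTOR
    return {
        "su": chunk_size,               # chunk size in bytes
        "sw": data_drives,              # number of data chunks per stripe
        "sunit": sunit,                 # chunk size in 512 B blocks
        "swidth": sunit * data_drives,  # data part of one stripe in 512 B blocks
    }


# RAID6 with 256 KiB chunks and 7 data drives.
print(xfs_raid_values(256 * KIB, data_drives=7))
```

Either pair (su with sw, or sunit with swidth) describes the same geometry.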
The device topology information is also exported via sysfs:
/sys/block/block-device[/partition]/alignment_offset
/sys/block/block-device/queue/physical_block_size
/sys/block/block-device/queue/logical_block_size
/sys/block/block-device/queue/hw_sector_size
/sys/block/block-device/queue/minimum_io_size
/sys/block/block-device/queue/optimal_io_size
Documentation can be found in ./Documentation/ABI/testing/sysfs-block of the Linux kernel.
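These sysfs files are plain text and can be read with a few lines of code. The sketch below is a minimal example; the function `read_topology`, the `sysfs_root` parameter and the device name `sda` are illustrative assumptions, and attributes that are missing on a given system are simply reported as `None`.

```python
from pathlib import Path

# Attributes taken from the sysfs paths listed above.
ATTRIBUTES = (
    "alignment_offset",
    "queue/physical_block_size",
    "queue/logical_block_size",
    "queue/hw_sector_size",
    "queue/minimum_io_size",
    "queue/optimal_io_size",
)


def read_topology(device, sysfs_root="/sys/block"):
    """Read the topology attributes of a block device; missing ones become None."""
    base = Path(sysfs_root) / device
    result = {}
    for attribute in ATTRIBUTES:
        try:
            result[attribute] = int((base / attribute).read_text().strip())
        except (OSError, ValueError):
            result[attribute] = None
    return result


# "sda" is only an example device name; adjust it to the system at hand.
for name, value in read_topology("sda").items():
    print(f"{name}: {value}")
```

All values are in bytes; an optimal_io_size of 0 usually means the device does not report one.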
Automatic Alignment
Beginning with recent Linux kernels, some recent userland tool versions may be capable of using the kernel's device topology information to automatically detect the correct settings for alignment in some scenarios.
Examples:
LVM
Recent versions of lvm try to determine any underlying mdadm software RAID, the alignment to its chunk size and the alignment of LVM's actual data start.
The following lvm.conf options are of special interest:
md_component_detection, md_chunk_alignment, data_alignment_detection, and data_alignment_offset_detection
dm-crypt
Recent versions of cryptsetup try to determine the alignment of the actual data start.
Partitions
Recent versions of GNU Parted try to align partitions when the --align=optimal option is used.
util-linux fdisk and GNU fdisk have no support, so far.
General rule: Any automatically determined alignment values should be manually
verified!
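A manual verification often boils down to checking that a start offset is a multiple of the relevant granularity (for example the physical block size or the RAID's chunk size). The following sketch assumes the device itself reports an alignment offset of 0; the helper `is_aligned` and the example values are illustrative.

```python
SECTOR = 512
KIB = 1024


def is_aligned(start_bytes, granularity):
    """True if a start offset (in bytes) lies on a multiple of the granularity."""
    return granularity > 0 and start_bytes % granularity == 0


# A partition starting at sector 63 is not aligned to 4 KiB physical blocks ...
print(is_aligned(63 * SECTOR, 4 * KIB))
# ... while one starting at sector 2048 (1 MiB) is, and also fits 256 KiB chunks.
print(is_aligned(2048 * SECTOR, 4 * KIB), is_aligned(2048 * SECTOR, 256 * KIB))
```

The check has to be repeated for every layer in the stack, since each layer may add its own offset.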
Literature
http://people.redhat.com/msnitzer/docs/io-limits.txt
https://ata.wiki.kernel.org/articles/a/t/a/ATA_4_KiB_sector_issues_d4b8.html
https://raid.wiki.kernel.org/
Finis coronat opus.