
Locking in OS Kernels for SMP Systems

From the seminar


Hot Topics in Operating Systems
TU Berlin, March 2006
Arwed Starke
Abstract: When designing an operating system kernel for a shared memory symmetric multiprocessor (SMP) system, shared data has to be protected from concurrent access. Critical issues in this area are the increasing code complexity as well as the performance and scalability of an SMP kernel. This paper gives an introduction to SMP-safe locking primitives and to how locking can be applied to SMP kernels, and focuses on how to increase scalability by reducing lock contention, and on the growing negative impact of caches and memory barriers on locking performance. Two new, performance-aware approaches to mutual exclusion in SMP systems that made it into today's Linux 2.6 kernel will be presented: the SeqLock and the read-copy-update (RCU) mechanism.
1 Introduction
1.1 Introduction to SMP systems
As Moore's law is about to fail, since clock speeds can no longer be doubled every year as they used to be in the "good old times", most of the gains in computing power are now achieved by increasing the number of processors or processing units working in parallel. The triumph of SMP systems is inevitable.

The abbreviation SMP stands for a tightly coupled, shared memory symmetric multiprocessor system: a set of equal CPUs accesses a common physical memory (and I/O ports) via a shared front side bus (FSB). Thus, the FSB becomes a contended resource. A bus master manages all read/write accesses to the bus. A read or write operation is guaranteed to complete atomically, which means before any other read or write operation is carried out on the bus. If two CPUs access the bus within the same clock cycle, the bus master nondeterministically (from the programmer's point of view) selects one of them to be the first to access the bus. If a CPU accesses the bus while it is still occupied, the operation is delayed. This can be seen as a hardware measure of synchronisation.
1.2 Introduction to Locking
4" more than one process can access data at the same time, as is the case in preempti*e
m#ltitasing systems and SM( systems, m#t#al e'cl#sion m#st $e introd#ced to protect this
shared data%
!e can di*ide m#t#al e'cl#sion into three classes. Short)term m#t#al e'cl#sion, short)term
m#t#al e'cl#sion with interr#pts, and long)term m#t#al e'cl#sion 6Sch789% -et #s tae a
loo at the typical #niprocessor 0U(2 ernel sol#tions "or these pro$lem classes, and why
they do not wor "or SM( systems%
Short-term mutual exclusion refers to preventing race conditions in short critical sections. They occur when two processes access the same data structure in memory "at the same time", thus causing inconsistent states of the data. On UP systems, this can only occur if one process is preempted by the other. To protect critical sections, they are guarded with some sort of preempt_disable/preempt_enable call that disables preemption, so a process can finish the critical section without being interrupted by another process. In a non-preemptive kernel, no measures have to be taken at all. Unfortunately, this does not work for SMP systems, because processes do not have to be preempted to run "in parallel": two processes can execute the exact same line of code at the exact same time, and no disabling of preemption will prevent that.
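A minimal sketch of this UP-only technique, using the Linux preempt_disable()/preempt_enable() pair (the counter and the function are invented for illustration):

#include <linux/preempt.h>

static unsigned long packet_count;      /* hypothetical data shared between processes */

void count_packet(void)
{
        preempt_disable();              /* no other process can preempt us here ...    */
        packet_count++;                 /* ... so this read-modify-write is safe on UP */
        preempt_enable();               /* but on SMP a second CPU could still execute
                                           this code at the very same time!            */
}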
Short-term mutual exclusion with interrupts involves interrupt handlers that access shared data. To prevent interrupt handler code from interrupting a process in a critical section, it is sufficient to guard the critical section in process context with some sort of cli/sti (disable/enable all interrupts) call. Unfortunately, this approach does not work on SMP systems either, because interrupts on all other CPUs remain active and can execute the interrupt handler code at any time.

Long-term mutual exclusion refers to processes being held up while accessing a shared resource for a longer time. For example, once a write system call to a regular file begins, the operating system guarantees that any other read or write system call to the same file will be held until the current one completes. A write system call may require one or more disk I/O operations in order to complete, and disk I/O operations are relatively long compared to the amount of work that the CPU can accomplish during that time. It would therefore be highly undesirable to inhibit preemption for such long operations, because the CPU would sit idle waiting for the I/O to complete. To avoid this, the process executing the write system call needs to allow itself to be preempted so other processes can run. As you probably already know, semaphores are used to solve this problem. This also holds true for SMP systems.
2 The Basic SMP Locking Primitives

When we talk about mutual exclusion, we mean that we want changes to appear as if they were an atomic operation. If we cannot update data with an atomic operation, we need to make the update uninterruptible and sequentialize it with all other processes that could access the data. But sometimes we can update data atomically.
2.1 Atomic Operations

Most SMP architectures possess operations that read and change data within a single, uninterruptible step, called atomic operations. Common atomic operations are test-and-set (TSR), which returns the current value of a memory location and replaces it with a given new value; compare-and-swap (CAS), which compares the content of a memory location with a given value and, if they are equal, replaces it with a given new value; and the load-linked/store-conditional instruction pair (LL/SC). Many SMP systems also feature atomic arithmetic operations: addition and subtraction of a given value, atomic increment and decrement, among others.
The table below shows how the line counter++ might appear in assembler code ([Sch94]). If this line is executed at the same time by two CPUs, the result is wrong, because the operation is not atomic.
Time  CPU 1: Instruction Executed   R0   Value of Counter   CPU 2: Instruction Executed   R0
 1    load  R0, counter              0          0           load  R0, counter              0
 2    add   R0, 1                    1          0           add   R0, 1                    1
 3    store R0, counter              1          1           store R0, counter              1
To solve such problems without extra locking, one can use an atomic increment operation as shown in Listing 1. (In Linux, the atomic operations are defined in atomic.h; operations not supported by the hardware are emulated with critical sections.) The shared data is still there, but the critical section could be eliminated. Atomic updates can be performed on several common occasions, for example the replacement of a linked list element (Listing 2); not even a special atomic operation is necessary to do that.

atomic_t counter = ATOMIC_INIT(0);
atomic_inc(&counter);

Listing 1: Atomic increment in Linux

Non-blocking synchronisation algorithms rely solely on atomic operations.
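To illustrate the non-blocking style, a shared counter can also be incremented with a compare-and-swap retry loop. This is a minimal sketch using GCC's __sync_val_compare_and_swap builtin, not code from any kernel:

/* Lock-free increment: retry until no other CPU modified the value
   between our read and our compare-and-swap. */
void nonblocking_inc(volatile int *counter)
{
        int old, new;

        do {
                old = *counter;              /* snapshot the current value */
                new = old + 1;               /* compute the desired value  */
                /* CAS writes 'new' only if *counter still equals 'old';
                   it returns the value it observed, so success means 'old'. */
        } while (__sync_val_compare_and_swap(counter, old, new) != old);
}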
2.2 S"in Locks
Spin locs are $ased on some atomic operation, "or e'ample test and set% The principle is
simple. A "lag *aria$le indicates i" a process is c#rrently in the critical section 0loc>*ar ?
<2 or i" no process is in the critical section 0loc>*ar ? 02% A process spins 0$#sy waits2 #ntil
the loc is reset, then sets the loc% Testing and setting o" the loc stat#s "lag m#st $e done
in one step, with an atomic operation% To release the loc, a process resets the loc *aria$le%
-isting = shows a possi$le implementation "or loc and unloc%
+ote that a spin loc can not $e ac/#ired rec#rsi*ely @ it wo#ld deadloc on the second call
to loc% This has two conse/#ences. A process holding a spin loc may not $e preempted,
or else a deadloc sit#ation co#ld occ#r% And spin locs can not $e #sed within interr#pt
handlers, $eca#se i" an interr#pt handler tries to ac/#ire a loc that is already held $y the
process it interr#pted, it deadlocs%
2.2.1 IRQ-Safe Spin Locks

The Linux kernel features several spin lock variants that are safe to use with interrupts. A critical section in process context is guarded by spin_lock_irq and spin_unlock_irq, while critical sections in interrupt handlers are guarded by the normal spin_lock and spin_unlock. The only difference between these functions is that the irq-safe versions of spin_lock disable all interrupts on the local CPU for the critical section. The possibility of a deadlock is thereby eliminated.

Figure 1 shows how two CPUs interact when trying to acquire the same irq-safe spin lock.
"" set up ne# element
ne#$%data = some_data;
ne#$%ne&t = old$%ne&t;
"" replace old element #it' it
pre($%ne&t = ne#;
Listing 2: Atomical update of single linked list replace
!old! with !new!
(oid loc((olatile int )loc_(ar_p) *
#'ile (test_and_set_bit(0+ loc_(ar_p) == ,);
-
(oid unloc((olatile int )loc_(ar_p) *
)loc_(ar_p = 0;
-
Listing ": Spin lock #Sch$%&
While CPU 1 (in process context) is holding the lock, any incoming interrupt requests on CPU 1 are stalled until the lock is released. An interrupt on CPU 2 busy waits on the lock, but it does not deadlock. After CPU 1 releases the lock, CPU 2 (in interrupt context) obtains it, and CPU 1 (now executing the interrupt handler) waits for CPU 2 to release it.
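A usage sketch of these primitives (the lock and the statistics variable are invented for illustration; the interrupt-context side uses the plain spin_lock, as described above):

#include <linux/spinlock.h>

static spinlock_t stats_lock = SPIN_LOCK_UNLOCKED;   /* hypothetical lock               */
static unsigned long stats;                          /* data shared with an IRQ handler */

void update_from_process_context(void)
{
        spin_lock_irq(&stats_lock);      /* take the lock AND disable local interrupts   */
        stats++;
        spin_unlock_irq(&stats_lock);    /* release the lock, re-enable local interrupts */
}

void update_from_irq_handler(void)       /* called from interrupt context */
{
        spin_lock(&stats_lock);          /* the plain spin lock is sufficient here */
        stats++;
        spin_unlock(&stats_lock);
}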
2.2.2 Enhancements of the Simple Spin Lock

Sometimes it is desirable to allow spin locks to be nested. To do so, the spin lock is extended by a nesting counter and a variable indicating which CPU holds the lock. If the lock is already held, the lock function checks whether the current CPU is the one holding it; in that case, the nesting counter is incremented and the spin loop is skipped. The unlock function decrements the nesting counter, and the lock is released when the unlock function has been called as many times as the lock function was called before. This kind of spin lock is called a recursive lock.
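A minimal sketch of such a recursive spin lock (the structure and function names are invented; smp_processor_id() and the test_and_set primitive from Listing 3 are assumed to be available):

struct rec_lock {
        volatile int lock_var;      /* 0 = free, 1 = taken                    */
        int owner;                  /* CPU number currently holding the lock  */
        int nesting;                /* how many times the owner has locked it */
};

void rec_lock_acquire(struct rec_lock *l)
{
        int cpu = smp_processor_id();

        if (l->lock_var == 1 && l->owner == cpu) {
                l->nesting++;                    /* nested acquisition by the same CPU */
                return;
        }
        while (test_and_set(&l->lock_var) == 1)
                ;                                /* spin until the lock is free */
        l->owner = cpu;
        l->nesting = 1;
}

void rec_lock_release(struct rec_lock *l)
{
        if (--l->nesting == 0) {                 /* last matching unlock */
                l->owner = -1;
                l->lock_var = 0;                 /* really drop the lock */
        }
}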
Locks can also be modified to allow blocking. Linux's and FreeBSD's big kernel lock is dropped if the process holding it sleeps (blocks), and reacquired when it wakes up. Solaris 2.x provides a type of lock known as adaptive locks: when one thread attempts to acquire such a lock while it is held by another thread, it checks whether the second thread is active on a processor. If it is, the first thread spins. If the second thread is blocked, the first thread blocks as well.
2.( Sema"!ores )mute*+
Aside "rom the classical #se o" semaphores e'plained in section <%<, semaphores 0initiali;ed
with a co#nter *al#e o" <2 can also $e #sed "or protecting critical sections% For per"ormance
reasons, semaphores #sed "or this ind o" wor are o"ten reali;ed as a separate primiti*e,
called m#te', that replaces the co#nter with a simple loc stat#s "lag%
Using m#te'es instead o" spin locs is prod#cti*e, i" the critical section taes longer than a
conte't switch% Alse, the o*erhead o" $locing compared to $#sy waiting "or a loc to $e
released is worse% On the pro side, m#te'es imply a ind o" "airness, while processes co#ld
B
(igure 1: )*+-safe spin locks
star*e on hea*ily contended spin locs% M#te'es can not $e #sed in interr#pts, $eca#se it is
generally not allowed to $loc in interr#pt conte't%
A semaphore is a comple' shared data str#ct#re itsel", and m#st there"ore $e protected $y
an own spin loc%
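In the Linux 2.6 kernel of that time, a semaphore used this way looks roughly as follows (a sketch; the buffer, function and length handling are invented):

#include <asm/semaphore.h>     /* location of the semaphore API in 2.6-era kernels */
#include <linux/string.h>
#include <linux/errno.h>

static DECLARE_MUTEX(buf_sem);          /* semaphore initialized to a count of 1 */
static char shared_buf[128];            /* hypothetical shared data              */

int slow_update(const char *src, int len)
{
        if (down_interruptible(&buf_sem))    /* may sleep; the caller can be preempted */
                return -EINTR;
        /* long critical section; blocking (e.g. on disk I/O) is allowed here */
        memcpy(shared_buf, src, len < 128 ? len : 128);
        up(&buf_sem);                        /* release; wakes up the next waiter */
        return 0;
}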
2.4 Reader-Writer Locks

As reading a data structure does not affect the integrity of the data, it is not necessary to mutually exclude two processes that read the same data at the same time. If a data structure is read often, allowing readers to operate in parallel is a great advantage for SMP software.

An rwlock keeps count of the readers currently holding a read-only lock and has a queue each for waiting writers and waiting readers. If the writer queue is empty, new readers may grab the lock. If a writer enters the scene, it has to wait for all readers to complete, then it gets an exclusive lock. Writers or readers arriving in the meantime are queued until the write lock is dropped. Then all readers waiting in the queue are allowed to enter, and the game starts anew (waiting writers are put on hold after a writer completes, to prevent starvation of readers). Figure 2 shows a typical sequence.

The rwlock involves a marginal overhead, but should yield almost linear scalability for read-mostly data structures (we will see about this later).

Figure 2: Reader/writer lock
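With the Linux rwlock primitives, the reader and writer sides look roughly like this (a sketch; the table, element type and hash function are invented):

#include <linux/spinlock.h>

#define N_BUCKETS 64

struct entry;                                  /* hypothetical element type */

static rwlock_t tbl_lock = RW_LOCK_UNLOCKED;   /* one rwlock for the whole table */
static struct entry *table[N_BUCKETS];

struct entry *lookup(unsigned int key)
{
        struct entry *e;

        read_lock(&tbl_lock);          /* many readers may hold this in parallel */
        e = table[key % N_BUCKETS];
        read_unlock(&tbl_lock);
        return e;
}

void insert(unsigned int key, struct entry *e)
{
        write_lock(&tbl_lock);         /* waits until all readers have left */
        table[key % N_BUCKETS] = e;
        write_unlock(&tbl_lock);
}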
3 Locking Granularity in SMP Kernels

3.1 Giant Locking

The designers of the Linux operating system did not have to worry much about mutual exclusion in their uniprocessor kernels, because they made the whole kernel non-preemptive (see section 1.2). The first Linux SMP kernel (version 2.0) used the simplest possible approach to make the traditional UP kernel code work on multiple CPUs: it protected the whole kernel with a single lock, the big kernel lock (BKL). The BKL is a spin lock that can be nested and is blocking-safe (see section 2.2.2). No two CPUs could be in the kernel at the same time. The only advantage of this was that the rest of the kernel could be left unchanged.
3.2 Coarse-grained Locking

In Linux 2.2, the BKL was removed from the kernel entry points, and each subsystem was protected by a lock of its own. Now a file system call would no longer have to wait for a sound driver routine or a network subsystem call to finish. Still, it was not data that was protected by the locks, but rather concurrent function calls that were sequentialized. Also, the BKL could not be removed from all modules, because it was often unclear which data it protected, and data protected by the BKL could be accessed anywhere in the kernel.
3.3 Fine-grained Locking

Fine-grained locking means that individual data structures, not whole subsystems or modules, are protected by their own locks. The degree of granularity can be increased from locks protecting big data structures (for example, a whole file system or the whole process table) to locks protecting individual data structures (for example, a single file or a process control block) or even single elements of a data structure. Fine-grained locking was introduced in the Linux 2.4 kernel series and has been furthered in the 2.6 series. Fine-grained locking has also been introduced into the FreeBSD operating system by the SMPng team, into the Solaris kernel, and into the Windows NT kernels as well.
Un"ort#nately, the BC- is still not dead% &hanges to locing code had to $e implemented
*ery ca#tio#sly, as to not $ring in hard to trac down deadloc "ail#res% So, e*ery time a
BC- was considered #seless "or a piece o" code, it was mo*ed into the "#nctions this code
called, $eca#se it was not always o$*io#s i" these "#nctions relied on the BC-% Th#s, the
occ#rrences o" BC- increased e*en more, and mod#le maintainers did not always react to
calls to remo*e the BC- "rom their code%
(igure ": -isual description of locking granularity in .S kernels #/ag01&
4 Performance Considerations

The Linux 2.0.40 kernel contains a total of 17 BKL calls, while the Linux 2.4.30 kernel contains a total of 226 BKL, 329 spin lock and 121 rwlock calls. The Linux 2.6.11.7 kernel contains 101 BKL, 1717 spin lock and 349 rwlock calls, as well as 56 seq lock and 14 RCU calls (more on these synchronisation mechanisms later); numbers taken from [Kåg05].
The reason why the Linux programmers took so much work upon themselves is that coarse-grained kernels scale poorly on more than 3-4 CPUs. The optimal performance for an n-CPU SMP system is n times the performance of a 1-CPU system of the same type. But this optimal performance can only be achieved if all CPUs are doing productive work all the time. Busy waiting on a lock wastes time, and the more contended a lock is, the more processors will likely be busy waiting to get it.
Hence, kernel developers run special lock contention benchmarks to detect which locks have to be split up to distribute the invocations on them. The lock variables are extended by a lock-information structure that contains counters for hits (successful attempts to grab a lock), misses (unsuccessful attempts to grab a lock), and spins (total number of waiting loops) [Cam93]. The ratio spins/misses shows how contended a lock is; if this number is high, processes waste a lot of time waiting for the lock.
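A sketch of such an instrumented lock, modeled on the description above rather than on any particular kernel's code (test_and_set as in Listing 3; the counter updates themselves are left unsynchronized for brevity):

struct lock_info {
        volatile int lock_var;      /* the actual lock                      */
        unsigned long hits;         /* lock was free on the first attempt   */
        unsigned long misses;       /* lock was busy on the first attempt   */
        unsigned long spins;        /* total number of busy-wait iterations */
};

void instrumented_lock(struct lock_info *l)
{
        if (test_and_set(&l->lock_var) == 0) {
                l->hits++;                        /* grabbed it right away */
                return;
        }
        l->misses++;
        do {
                l->spins++;                       /* one more wasted loop */
        } while (test_and_set(&l->lock_var) == 1);
}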
Measuring lock contention is a common practice to look for bottlenecks. Bryant and Hawkes wrote a specialized tool to measure lock contention in the kernel, which they used to analyze filesystem performance [Bry02]. Others [Kra01] focused on contention in the 2.4.x scheduler, which has since been completely rewritten; today, the Linux scheduler mostly operates on per-CPU ready queues and scales fine up to 512 CPUs.

Lock contention is most pronounced in applications that access shared resources, such as the virtual filesystem (VFS) and the network stack, and in applications that spawn many processes. Etsion et al. used several benchmarks that stress these subsystems as an example of how the KLogger kernel logging and analysis tool can be used to measure lock contention under varying degrees of parallelization: they measured the percentage of time spent spinning on locks during a parallel make compiling the Linux kernel, Netperf (a network performance evaluation tool), and an Apache web server with Perl CGI being stress-tested [Eti05] (see Figure 5). Measurements like these help to spot and eliminate locking bottlenecks.
Figure 4: Extract from a lock-contention benchmark on Unix SVR4/MP [Cam93]

                                 hits   misses    spins   spins/miss
_PageTableLockInfo 1            9,656       80    9,571         120
_DispatcherQueueLockInfo 1     49,979      382   30,508          80
_SleepHashQueueLockInfo 1      25,549      708   56,192          79
4.1 Scalability

Simon Kågström ran similar benchmarks to compare the scalability of the Linux kernel on 1-8 CPUs from version 2.0 to 2.6. He measured the relative speedup of the Postmark benchmark with respect to the (giant-locked) 2.0.40 UP kernel (Figure 6). The result of this benchmark is not surprising: as we can see, the more locking granularity is increased (Linux 2.6), the better the system scales. But how far can we increase locking granularity?
Figure 5: Percentage of cycles spent spinning on locks for each of the test applications [Eti05]

Figure 6: Postmark benchmark of several Linux kernels [Kåg05]
O" co#rse, we cannot ignore the increasing comple'ity ind#ced $y "iner locing gran#larity%
As more locs ha*e to $e held to per"orm a speci"ic operation, the ris o" deadloc
increases% There is an ongoing disc#ssion in the -in#' comm#nity a$o#t how m#ch locing
hierarchy is too m#ch% !ith more locing comes more need "or doc#mentation o" locing
order, or need "or tools lie deadloc analy;ers% :eadloc "a#lts are among the most
di""ic#lt to come $y%
The overhead of the locking operations themselves matters as well. The CPU does not only spend time in the critical section, it also takes some time to acquire and release the lock. Compare the graph of the 2.6 kernel with the 2.4 kernel for fewer than four CPUs: the kernel with more locks is the slower one. The efficiency of executing a critical section can be expressed as: time within the critical section / (time within the critical section + time to acquire the lock + time to release the lock). If you split a critical section into two, the time spent acquiring and releasing locks is roughly doubled.
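Written as a formula, with an invented numerical example to illustrate the cost of splitting a critical section (the cycle counts are made up, not measured):

E = \frac{T_{\mathrm{cs}}}{T_{\mathrm{cs}} + T_{\mathrm{acquire}} + T_{\mathrm{release}}}
% e.g. T_cs = 400 cycles, T_acquire = 100, T_release = 50:
%   one lock:             E = 400 / (400 + 150)           \approx 0.73
%   split into two locks: E = 400 / (400 + 2 \times 150)  \approx 0.57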
Surprisingly, even a single lock acquisition is generally one too many, and the performance penalty of a simple lock acquisition, even if it succeeds at the first attempt, is getting worse and worse. To understand why, we have to forget the basic model of a simple scalar processor without caches and look at today's reality.
4.2 Performance Penalty of Lock Operations

Image 1 shows typical instruction costs of several operations on an 8-CPU 1.45 GHz PPC system.¹ The gap between normal instructions, cache-hitting memory accesses (not listed here; they are generally 3-4 times faster than an atomic increment operation) and a lock operation becomes obvious.

Let us look at the architecture of today's SMP systems and its impact on our spin lock.
4.2.1 Caches

While CPU power has increased roughly by a factor of 2 each year, memory speeds have not kept pace and have increased by only 10 to 15% each year. Thus, memory operations impose a big performance penalty on today's computers.

¹ If you wonder why an instruction takes less time than a CPU cycle, remember that we are looking at an 8-CPU SMP system, and view these numbers as "typical instruction cost".
Image 1: Instruction costs on an 8-CPU 1.45 GHz PPC system [McK05]
As a consequence of this development, small SRAM caches were introduced, which are much faster than main memory. Due to temporal and spatial locality of reference in programs (see [Sch94] for an explanation), even a comparatively small cache achieves hit ratios of 90% and higher. On SMP systems, each processor has its own cache. This has the big advantage that cache hits cause no load on the common memory bus, but it introduces the problem of cache consistency.
When a memory word is accessed by a CPU, it is first looked up in the CPU's local cache. If it is not found there, the whole cache line² containing the memory word is copied into the cache. This is called a cache miss (to increase the number of cache hits, it is thus very advisable to align data along cache lines in physical memory and to operate on data structures that fit within a single cache line). Subsequent read accesses to that memory address will cause a cache hit. But what happens on a write access to a memory word that lies in the cache? This depends on the "write policy".
"Write through" means that after every write access, the cache line is written back to main memory. This ensures consistency between the cache and memory (and between all caches of an SMP system, if the other caches snoop the bus for write accesses), but it is also the slowest method, because a memory access is needed on every write access to a cache line. The "write back" policy is much more common: on a write access, data is not written back to memory immediately, but the cache line gets a "modified" tag. If a cache line with a modified tag is finally replaced by another line, its content is written back to memory. Subsequent write operations hit in the cache as long as the line is not evicted.
On SMP systems, the same piece of physical memory could lie in more than one cache, so the SMP architecture needs a protocol to ensure consistency between all caches. If two CPUs want to read the same memory word from their caches, everything goes well; both read operations can even execute at the same time. But if two CPUs wanted to write to the same memory word in their caches at the same time, there would afterwards be a modified version of this cache line in both caches, and thus two versions of the same cache line would exist. To prevent this, a CPU trying to modify a cache line has to obtain the "exclusive" right to it. With that, the cache line is marked invalid in all other caches. Another CPU trying to modify the same cache line has to wait until the first CPU drops the exclusive right, and has to re-read the cache line from that CPU's cache.
-et #s loo at the e""ects o" the simple spin loc code "rom -isting =, i" a loc is held $y
&(U <, and &(U 2 and = wait "or it.
The loc *aria$le is set $y the test>and>set operation on e*ery spinning cycle% !hile &(U
< is in the critical section, &(U 2 and = constantly read and write to the cache line
containing the loc *aria$le% The line is constantly trans"erred "rom one cache to the other,
$eca#se $oth &(Us m#st ac/#ire an e'cl#si*e copy o" the line when they test)and)set the
loc *aria$le again% This is called 3cache line $o#ncing3, and it imposes a $ig load on the
memory $#s% The impact on per"ormance wo#ld $e e*en worse i" the data protected $y the
loc was also lying in the same cache line%
We can, however, modify the implementation of the spin lock to fit the functionality of a cache. The atomic read-modify-write operation cannot possibly acquire the lock while it is held by another processor, so it is unnecessary to use such an operation until the lock is freed. Instead, other processors trying to acquire a lock that is in use can simply read the current state of the lock and only use the atomic operation once the lock has been freed. Listing 4 gives an alternate implementation of the lock function using this technique.

² To find data in the cache, each line of the cache (think of the cache as a spreadsheet) has a tag containing its address. If a cache line consisted of only one memory word, a lot of lines, and thus a lot of address tags, would be needed. To reduce this overhead, cache lines usually contain about 32-128 bytes, accessed by the same tag, and the least significant bits of the address serve as the byte offset within the cache line.
Here, one attempt is made to acquire the lock before entering the inner loop, which then waits until the lock is freed. If the lock has already been taken again by the time of the test_and_set operation, the CPU spins again in the inner loop. CPUs spinning in the inner loop only work on a shared cache line and do not request the cache line exclusively; they work cache-local and do not waste bus bandwidth. When CPU 1 releases the lock, it marks the cache line exclusive and sets the lock variable to zero. The other CPUs re-read the cache line and try to acquire the lock again.

Nevertheless, spin lock operations are still very time-consuming, because they usually involve at least one cache line transfer between caches or from memory.
4.2.2 Memory Barriers

With the superscalar architecture, parallelism was introduced into the CPU cores. In a superscalar CPU, there are several functional units of the same type, along with additional circuitry to dispatch instructions to the units. For instance, most superscalar designs include more than one arithmetic-logical unit. The dispatcher reads instructions from memory and decides which ones can be run in parallel, distributing them among the units. The performance of the dispatcher is key to the overall performance of a superscalar design: the units' pipelines should be kept as full as possible. A superscalar CPU's dispatcher hardware therefore reorders instructions for optimal throughput. This holds true for load/store operations as well.
For example, imagine a program that adds two integers from main memory. The first argument is not in the cache and must be fetched from main memory. Meanwhile, the second argument is fetched from the cache, so the second load operation is likely to complete earlier. Meanwhile, a third load operation can be issued. The dispatcher uses interlocks to prevent the add operation from being issued before the load operations it depends on have finished.
Also, most modern CPUs sport a small register set called store buffers, in which several store operations are gathered to be executed together at a later time. They can be buffered in order (which is called total store ordering) or, as is common with superscalar CPUs, out of order (partial store ordering). In short: as long as a load or store operation does not access the same memory word as a prior store operation (or vice versa), they can be executed in any possible order by the CPU.

void lock(volatile lock_t *lock_status)
{
        while (test_and_set(lock_status) == 1)
                while (*lock_status == 1);   // spin
}

Listing 4: Spin lock implementation avoiding excessive cache line bouncing [Sch94]
This requires further measures to ensure the correctness of SMP code. The simple atomic list insert code from section 2.1 could be executed as shown in Figure 7. The method requires the new node's next pointer (and all its data) to be initialized before the new element is inserted at the list's head. If these instructions are executed out of order, the list will be in an inconsistent state until the second instruction completes; meanwhile, another CPU could traverse the list, the thread could be preempted, and so on.
It is not necessary that both operations be executed right after each other, but it is important that the first one is executed before the second. To force all read or write operations in the instruction pipeline to finish before the next operation is fetched, superscalar CPUs have so-called memory barrier instructions. We distinguish read memory barriers (wait until all pending read operations have completed), write memory barriers, and full memory barriers (wait until all pending memory operations, read and write, have completed). Correct code would read as shown in Listing 5.
Instruction reordering can also cause operations in a critical section to "bleed out" (Figure 8). The line that claims to be in a critical section obviously is not, because the operation releasing the lock variable was executed earlier. Another CPU could long ago have altered part of the data, with unpredictable results.
new->next = i->next;
smp_wmb();    // write memory barrier!
i->next = new;

Listing 5: Correct code for atomic list insertion on machines without a sequential memory model

Figure 7: Impact of a non-sequential memory model on the atomic list insertion algorithm (code in memory: new->next = i->next; i->next = new; possible execution order: i->next = new; new->next = i->next;)
To prevent this, we have to alter our locking operations again, as shown in Listing 6. (Note that it is not necessary to prevent load/store operations issued prior to the critical section from "bleeding" into it. And of course, dispatcher units do not override the logical instruction flow, so every operation in the critical section will be executed after the CPU exits the while loop.)

void lock(volatile lock_t *lock_var)
{
        while (test_and_set(lock_var) == 1)
                while (*lock_var == 1);   // spin
}

void unlock(volatile lock_t *lock_var)
{
        mb();   // read/write memory barrier
        *lock_var = 0;
}

Listing 6: Spin lock with memory barrier

A memory barrier flushes the store buffer and stalls the pipelines (to carry out all pending read/write operations before new ones are executed), so it hurts performance in proportion to the number of pipeline stages and functional units. This is why the memory barrier operations take so much time in the chart presented earlier. Atomic operations take so long because they, too, flush the store buffer in order to be carried out immediately.

Figure 8: Impact of weak store ordering on critical sections (code in memory: data.foo = y; data.next = &bar; *lock_var = 0; /* unlock */ ; executed order: the unlock store first, then the two data stores)

4.2.3 Hash Table Benchmark

Below are the results of a benchmark that performed search operations on a hash table with a dense array of buckets, doubly linked hash chains, and one element per hash chain. Among the locking designs tested for this hash table were: a global spin lock, a global reader/writer lock, per-bucket spin locks and rwlocks, and Linux's big reader lock. Especially the results for the allegedly parallel reader/writer locks may seem surprising, but they only confirm what was said in the last two sections: the locking instructions' overhead thwarts any parallelism.

The explanation for this is rather simple (Figure 9): the acquisition and release of an rwlock take so much time (remember the cache line bouncing etc.) that the actual critical section is no longer executed in parallel.

Figure 9: Effects of cache line bouncing and memory synchronisation delay on rwlock efficiency ([McK03])
Image 2: Performance of several locking strategies for a hash table ([McK05])
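For reference, the per-bucket variant from this benchmark might be structured roughly as follows (a sketch, not the benchmark's actual code; names and sizes are invented):

#include <linux/spinlock.h>

#define N_BUCKETS 1024

struct element {
        unsigned int key;
        struct element *next;       /* hash chain, simplified to singly linked here */
};

struct bucket {
        spinlock_t lock;            /* one lock per chain, set up with spin_lock_init() */
        struct element *chain;
};

static struct bucket table[N_BUCKETS];

struct element *lookup(unsigned int key)
{
        struct bucket *b = &table[key % N_BUCKETS];
        struct element *e;

        spin_lock(&b->lock);        /* contention is limited to this bucket ...            */
        for (e = b->chain; e != NULL; e = e->next)
                if (e->key == key)
                        break;
        spin_unlock(&b->lock);      /* ... but the lock/barrier cost is paid on every lookup */
        return e;
}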
5 Lock-Avoiding Synchronisation Primitives

Much effort has been put into developing synchronisation primitives that avoid locks. Lock-free and wait-free synchronisation also plays a major part in real-time operating systems, where timing guarantees must be given. Another way to reduce lock contention is to use per-CPU data.
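The per-CPU idea can be sketched with the Linux per-CPU API (the packet counter is invented for illustration):

#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, pkt_count);   /* one private counter per CPU */

void count_packet(void)
{
        /* Each CPU touches only its own copy: no lock, no cache line bouncing. */
        get_cpu_var(pkt_count)++;
        put_cpu_var(pkt_count);
}

unsigned long total_packets(void)
{
        unsigned long sum = 0;
        int cpu;

        /* A reader that needs a global view sums up all per-CPU copies. */
        for_each_online_cpu(cpu)
                sum += per_cpu(pkt_count, cpu);
        return sum;
}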
Two new synchronisation mechanisms that get by entirely without locking or atomic operations on the reader side were introduced into the Linux 2.6 kernel to address the above issues: the SeqLock and the Read-Copy-Update mechanism. We will discuss them in detail.
5.1 Seq Locks

Seq locks (short for sequence locks) are a variant of the reader-writer lock, based on spin locks. They are intended for short critical sections and tuned for fast data access and low latency.

In contrast to the rwlock, a sequence lock grants writer access immediately, regardless of any readers. Writers have to acquire a writer spin lock, which provides mutual exclusion among multiple writers; they can then alter the data without paying regard to possible readers. Readers therefore do not need to acquire any lock to synchronize with possible writers. A read access generally does not take a lock, but readers are in charge of checking whether they read valid data: if a write access took place while the data was being read, the data is invalid and has to be read again. Write accesses are identified with a counter (see Figure 10). Every writer increments this zero-initialized counter once before it changes any data, and again after all changes are done. The reader reads the counter value before it reads the data, then compares it to the current counter value after reading the data. If the counter value has increased, the read was disturbed by one or more concurrent writers and the data has to be read again. Also, if the counter value was odd at the beginning of the read-side critical section, a writer was already in progress while the data was read, and the result has to be discarded. So, strictly speaking, the retry condition of the while loop is ((count_pre != count_post) || (count_pre % 2 != 0)).

In the worst case, the readers would have to loop infinitely if there were a never-ending chain of writers. But under normal conditions, the readers read the data successfully within a few tries, often the first one. By minimizing the time spent in the read-side critical section, the probability of being interrupted by a writer can be reduced greatly. It is therefore part of the method to only copy the shared data in the critical section and work on the copy later.
Listing 7 shows how to read shared data protected by the seq lock functions of the Linux kernel: time_lock is the seq lock variable, but read_seqbegin and read_seqretry only read the seq lock's counter and do not access the lock variable.
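The corresponding writer side uses the writer spin lock hidden inside the seqlock (a sketch with the Linux seqlock API; time_lock and time match the read-side code in Listing 7, which follows below):

#include <linux/seqlock.h>

static seqlock_t time_lock = SEQLOCK_UNLOCKED;   /* the seq lock used in Listing 7 */
static unsigned long time;                        /* hypothetical shared value      */

void update_time(unsigned long now)
{
        write_seqlock(&time_lock);     /* take the writer spin lock, counter becomes odd */
        time = now;
        write_sequnlock(&time_lock);   /* counter becomes even again; readers may retry  */
}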
unsigned long seq;
do {
        seq = read_seqbegin(&time_lock);
        now = time;
} while (read_seqretry(&time_lock, seq));
// value in 'now' can now be used

Listing 7: Seq Lock: read-side critical section

Figure 10: Seq Lock schematics (figure based upon [Qua04])

5.2 The Read-Copy-Update Mechanism

As you can see, synchronisation is a mechanism plus a coding convention. The coding convention for mutual exclusion with a spin lock is that you have to hold the lock before you access the data protected by it, and that you have to release it after you are done. The coding convention for non-blocking synchronisation is that every data manipulation needs only a single atomic operation (e.g. CAS, CAS2, or our atomic list update example).
The RCU mechanism is based on so-called quiescent states. A quiescent state is a point in time at which a process that has been reading shared data no longer holds any references to this data. With the RCU mechanism, processes can enter a read-side critical section at any time and can assume that the data they read is consistent as long as they work on it. But after a process leaves its read-side critical section, it must not hold any references to the data any longer: the process enters the quiescent state.
This imposes some constraints on how shared data structures are updated. As readers do not check whether the data they read is consistent (as in the SeqLock mechanism), writers have to apply all their changes with one atomic operation. If a reader reads the data before the update, it sees the old state; if it reads the data after the update, it sees the new state. Of course, readers should read data once and then work with it, and not read the same data several times and then fail if they see differing versions. Consider a linked list protected by RCU. To update an element of the list (see Figure 11), the writer has to read the old element's contents, make a copy of them (1), update the copy (2), and then exchange the old element for the new one with an atomic operation, writing the new element's address to the previous element's next pointer (3).
As you can see, readers could still read stale data even after the writer has finished updating, if they entered the read-side critical section before the writer finished (4). Therefore, the writer cannot immediately delete the old element. It has to defer the destruction until all processes that were in a read-side critical section at the time the writer finished have dropped their references to the stale data (5), or in other words, have entered the quiescent state. This time span is called the grace period. After that time, there can still be readers holding references to the data, but none of them could possibly reference the old data, because they started at a time when the old data was no longer visible. The old element can then be deleted (6).
Figure 11: The six steps of the RCU mechanism: the element "cur" between "prev" and "next" is copied into "new" (1), the copy is updated (2) and linked in place of the old element (3); readers that entered earlier may still see the old element (4), so the writer waits until they have dropped their references (5) before the old element is deleted (6)
The RCU mechanism requires the data to be stored within some sort of container that is referenced by a pointer; the update step consists of changing that pointer. Thus, linked lists are the most common type of data protected by RCU. Insertion and deletion of elements is done as presented in section 2.1; of course, we need memory barrier operations on machines with weak ordering. More complex updates, like sorting a list, need some other kind of synchronisation mechanism. If we assume that readers traverse a list in search of an element once, and not several times back and forth (as we assumed anyway), we can also use doubly linked lists (see Listing 8).
The RCU mechanism is optimal for read-mostly data structures where readers can tolerate stale data (it is, for example, used in the Linux routing table implementation). While readers generally do not have to worry about much, things get more complex on the writer's side. First of all, writers have to acquire a lock, just as with the seqlock mechanism. If they did not, two writers could each obtain a copy of a data element, perform their changes, and then replace it; the data structure would still be intact, but one update would be lost. Second, writers have to defer the destruction of an old version of the data. When exactly is it safe to delete old data?
A"ter all readers that were in a read)side critical section at the time o" the #pdate ha*e le"t
their critical section, or, entered a /#iescent state 0Fig#re <22% A simple approach wo#ld $e
to tae a co#nter that indicates how many processes are within a read)side critical section,
and de"er destr#ction o" all stale *ersions o" data elements #ntil that time% B#t as yo# can
see, later readers are not taen into acco#nt, so this approach "ails% !e co#ld also incl#de a
re"erence co#nter in e*ery data element, i" the architect#re "eat#res an atomic increment
operation% A*ery reader wo#ld increment this re"erence co#nter as it gets a re"erence to the
<7
static inline (oid __list_add_rcu(struct list_'ead ) ne#+
struct list_'ead ) pre(+ struct list_'ead ) ne&t)
*
ne#$%ne&t = ne&t;
ne#$%pre( = pre(;
smp_#mb();
ne&t$%pre( = ne#;
pre($%ne&t = ne#;
-
Listing <: extract from Linux 2.7.1% kernel list.h
(igure 12: *63 9race 5eriod: After all processes that
entered a read-side critical section AgrayB before the writer
AredB finished ha'e entered a ?uiescent state it is sa'e
delete an old element
data element, and decrement it when the read)side critical section completes% 4" an old data
element has a re"co#nt o" ;ero, it can $e deleted 6McC0<9% !hile this sol*es the pro$lem, it
reintrod#ces the per"ormance iss#es o" atomic operations that we wanted to a*oid%
Let us assume that a process in an RCU read-side critical section does not yield the CPU. This means that preemption is disabled while in the critical section and that no functions that might block may be used [McK04]. Then no references can be held across a context switch, and we can safely assume that a CPU that has gone through a context switch is in a quiescent state. The earliest time at which we can be absolutely sure that no process on any other CPU is still holding a reference to stale data is after every other CPU has gone through a context switch at least once after the writer finished. The writer has to defer the destruction of stale data until then, either by waiting or by registering a callback function that frees the space occupied by the stale data; this callback function is called after the grace period is over. The Linux kernel features both variants.
A simple mechanism to detect when all CPUs have gone through a context switch is to start a high-priority thread on CPU 1 that repeatedly reschedules itself on the next CPU until it reaches the last CPU. This thread then executes the callback functions or wakes up any processes waiting for the grace period to end. There are more efficient algorithms for detecting the end of a grace period, but they are beyond the scope of this document.
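The detection thread can be sketched with the 2.6-era set_cpus_allowed() interface (the callback processing at the end is only hinted at):

#include <linux/sched.h>
#include <linux/cpumask.h>

/* Hop across all CPUs: being scheduled on a CPU implies that this CPU has
   gone through a context switch since the writer finished. */
static void wait_for_grace_period(void)
{
        int cpu;

        for_each_online_cpu(cpu) {
                /* migrate ourselves to 'cpu'; returns once we have been run there */
                set_cpus_allowed(current, cpumask_of_cpu(cpu));
        }
        /* all CPUs have context-switched: run deferred callbacks or
           wake up the writers waiting for the grace period (omitted) */
}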
Listing 9 presents the Linux RCU API (without the Linux RCU list API). Note that, while it is necessary to guard read-side critical sections with rcu_read_lock and rcu_read_unlock, the only thing these functions do (apart from visually highlighting a critical section) is disable preemption for the critical section; if the Linux kernel is compiled without preemption (CONFIG_PREEMPT=n), they do nothing. Write-side critical sections are protected by spin_lock() and spin_unlock(); afterwards, the writer either waits for the grace period with synchronize_kernel() or registers a callback function to destroy the old element with call_rcu().
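A usage sketch that combines these calls (modeled on the API in Listing 9; gbl_foo, its lock and the element type are invented, and error handling is omitted):

#include <linux/rcupdate.h>
#include <linux/spinlock.h>
#include <linux/slab.h>

struct foo {
        int a;
};

static struct foo *gbl_foo;                          /* hypothetical RCU-protected pointer */
static spinlock_t foo_lock = SPIN_LOCK_UNLOCKED;     /* serializes writers only            */

int read_a(void)                     /* reader: no lock, no atomic operation */
{
        int a;

        rcu_read_lock();             /* only disables preemption */
        a = gbl_foo->a;
        rcu_read_unlock();           /* a quiescent state may follow */
        return a;
}

void update_a(int new_a)             /* writer: copy, update, replace, defer destruction */
{
        struct foo *new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
        struct foo *old_fp;

        spin_lock(&foo_lock);        /* exclude concurrent writers */
        old_fp = gbl_foo;
        *new_fp = *old_fp;           /* copy ...                   */
        new_fp->a = new_a;           /* ... update the copy ...    */
        gbl_foo = new_fp;            /* ... and replace the old element (a write
                                        memory barrier belongs before this store) */
        spin_unlock(&foo_lock);

        synchronize_kernel();        /* wait for the grace period to end */
        kfree(old_fp);               /* no reader can still hold a reference to old_fp */
}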
(igure 1": Simple detection of a grace period: Dhread !u!
runs once on e'ery 653 #4c/01&
The RCU mechanism is widely believed to have been developed at Sequent Computer Systems, which was later bought by IBM; IBM holds several patents on this technique. The patent holders have given permission to use the mechanism under the GPL. Therefore, Linux is currently the only major OS using it. RCU is also part of the SCO claims in the SCO vs. IBM lawsuit.
6 Conclusion

The introduction of SMP systems has greatly increased the complexity of locking in OS kernels. In striving for optimal performance on all platforms, operating system designers have to meet conflicting goals: increase locking granularity to improve the scalability of their kernels, and reduce the use of locks to improve the efficiency of critical sections, and thus the performance of their code. While a spin lock can always be used, it is not always the right tool for the job. Non-blocking synchronisation, Seq Locks and the RCU mechanism offer better performance than spin locks or rwlocks, but these synchronisation methods require more effort than a simple drop-in replacement. RCU requires a complete rethinking and rewriting of the data structures it is used for, and of the code it is used in.
It took Linux half a decade to get from its first giant-locked SMP kernel implementation to a reasonably fine-grained one. This time could have been greatly reduced if the Linux kernel had been written as a preemptive kernel with fine-grained locking from the beginning. When it comes to mutual exclusion, it is always a good idea to think the whole design through from the start. Beginning with an approach that is ugly but works, and tuning it into a well-running solution later, often leaves you coding the same thing twice, and facing greater problems than if you had tried to do it properly from the start.

As multiprocessor systems and modern architectures like superscalar, super-pipelined and hyperthreaded CPUs as well as multi-level caches become the norm, simple code that looks fine at first glance can have a severe impact on performance. Thus, programmers need to have a thorough understanding of the hardware they write code for.
void synchronize_kernel(void);

void call_rcu(struct rcu_head *head,
              void (*func)(void *arg),
              void *arg);

struct rcu_head {
        struct list_head list;
        void (*func)(void *obj);
        void *arg;
};

void rcu_read_lock(void);
void rcu_read_unlock(void);

Listing 9: The Linux 2.6 RCU API functions

Further Reading

This paper detailed the performance aspects of locking in SMP kernels. If you are interested in the implementation complexity of fine-grained SMP kernels or in experiences from performing a multiprocessor port, please refer to [Kåg05]. For a more in-depth introduction to SMP architecture and caching, read Curt Schimmel's book ([Sch94]). If you want to gain deeper knowledge of the RCU mechanism, you can start at Paul E. McKenney's RCU website (http://www.rdrop.com/users/paulmck/RCU).
1e"erences
[Bry02] R. Bryant, R. Forester, J. Hawkes: Filesystem performance and scalability in Linux 2.4.17. Proceedings of the Usenix Annual Technical Conference, 2002
[Cam93] Mark D. Campbell, Russ L. Holt: Lock-Granularity Analysis Tools in SVR4/MP. IEEE Software, March 1993
[Eti05] Yoav Etsion, Dan Tsafrir et al.: Fine Grained Kernel Logging with KLogger: Experience and Insights. Hebrew University, 2005
[Kåg05] Simon Kågström: Performance and Implementation Complexity in Multiprocessor Operating System Kernels. Blekinge Institute of Technology, 2005
[Kra01] M. Kravetz, H. Franke: Enhancing the Linux scheduler. Proceedings of the Ottawa Linux Symposium, 2001
[McK01] Paul E. McKenney: Read-Copy Update. Proceedings of the Ottawa Linux Symposium, 2002
[McK03] Paul E. McKenney: Kernel Korner - Using RCU in the Linux 2.5 Kernel. Linux Magazine, 2003
[McK04] Paul E. McKenney: RCU vs. Locking Performance on Different Types of CPUs. http://www.rdrop.com/users/paulmck/RCU/LCA2004.02.13a.pdf, 2004
[McK05] Paul E. McKenney: Abstraction, Reality Checks, and RCU. http://www.rdrop.com/users/paulmck/RCU/RCUintro.2005.07.26bt.pdf, 2005
[Qua04] Jürgen Quade, Eva-Katharina Kunst: Linux-Treiber entwickeln. dpunkt.verlag, 2004
[Sch94] Curt Schimmel: UNIX Systems for Modern Architectures: Symmetric Multiprocessing and Caching for Kernel Programmers. Addison Wesley, 1994
