ch5. Kernel Synchronization

Kernel Synchronization
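National Chung Cheng University
Graduate Institute of Computer Science and Information Engineering
Instructor: 羅習五
A small part of the content is adapted from material by 薛智文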
Chapter 5: Kernel Synchronization
• Kernel Control Paths
• When Synchronization is Not Necessary
• Synchronization Primitives
• Synchronizing Accesses to Kernel Data
Structures
• Examples of Race Condition Prevention
Kernel
• You could think of the kernel as a server that
answers requests; these requests can come
either from a process running on a CPU or an
external device issuing an interrupt request.
[Figure: the kernel, divided into top halves and bottom halves]
Kernel Control Paths
• Kernel Control Path (KCP)
– a sequence of instructions executed by the kernel to handle an interrupt or exception of some kind
• Each kernel request is handled by a different KCP
– system call request (software interrupt): system_call → ret_from_sys_call
[Figure: the kernel, divided into top halves and bottom halves]
Kernel Requests
• A process executing in User Mode causes an
exception. (e.g., x/0)
• A process executing in Kernel Mode causes a Page
Fault exception.
• An external device sends a signal to a programmable
interrupt controller (PIC), and the corresponding
interrupt is enabled
• A process running on a multiprocessor system raises an interprocessor interrupt (IPI).
Kernel Control Paths
• The CPU interleaves KCPs when:
– A process switch occurs (the running path relinquishes control of the CPU, e.g., by sleeping or waiting).
– An interrupt occurs.
– A deferrable function is executed.
• Interleaving improves the throughput of PIC
and device controllers.
A fully preemptable kernel
• Nonpreemptive kernel vs. preemptive kernel
– Nonpreemptive kernel: Linux kernel up to 2.4
– Preemptive kernel: Linux kernel 2.6
• Kernel 2.4 + preempt_count* = kernel 2.6
• The value of preempt_count is greater than 0 when:
– The kernel is executing an ISR
– The deferrable functions are disabled
– Kernel preemption has been explicitly disabled
*: This field is in the thread_info descriptor.
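The three conditions above can be pictured as three counters packed into one word. The following is a minimal user-space sketch, not kernel code: every `model_` name is hypothetical, and the bit offsets follow the Linux 2.6 layout as commonly described, so treat them as an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed field layout (modeled on Linux 2.6): three counters in one word. */
#define MODEL_PREEMPT_OFFSET (1u << 0)   /* explicit preemption disabling */
#define MODEL_SOFTIRQ_OFFSET (1u << 8)   /* deferrable functions disabled */
#define MODEL_HARDIRQ_OFFSET (1u << 16)  /* executing an ISR */

unsigned int model_preempt_count;        /* per-thread in the real kernel */

void model_preempt_disable(void) { model_preempt_count += MODEL_PREEMPT_OFFSET; }

void model_preempt_enable(void)
{
    assert((model_preempt_count & 0xffu) != 0);   /* must pair with a disable */
    model_preempt_count -= MODEL_PREEMPT_OFFSET;
}

void model_bh_disable(void) { model_preempt_count += MODEL_SOFTIRQ_OFFSET; }

void model_bh_enable(void)
{
    assert(((model_preempt_count >> 8) & 0xffu) != 0);
    model_preempt_count -= MODEL_SOFTIRQ_OFFSET;
}

void model_irq_enter(void) { model_preempt_count += MODEL_HARDIRQ_OFFSET; }

void model_irq_exit(void)
{
    assert((model_preempt_count >> 16) != 0);
    model_preempt_count -= MODEL_HARDIRQ_OFFSET;
}

/* The kernel may preempt the current path only when all counters are zero. */
bool model_preemptible(void) { return model_preempt_count == 0; }
```

Because the counters nest, a path that disables preemption and is then interrupted stays non-preemptible until both the ISR has returned and the matching enable has run.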

When Sync. Is Not Necessary
Simplifying assumptions:
• Interrupt handlers and tasklets need not be coded as reentrant functions
– Interrupt handlers, softirqs, and tasklets are both nonpreemptable and non-blocking
• Per-CPU variables accessed only by softirqs and tasklets do not require synchronization
• A data structure accessed by only one kind of tasklet does not require synchronization
Synchronization Primitives
• Per-CPU variables
– One element per each CPU in the system
• Atomic operations
– memory bus lock, read-modify-write (rmw) ops
• Memory barriers
– avoids compiler, CPU instruction re-ordering
• Spin locks
– only on SMP systems; keep them short!
– general, read/write, big reader
• Semaphores
– general, read/write
• Local interrupt disabling
• Local softirq disabling
• Read-copy-update (RCU)
Synchronization Primitives
Technique                   Description                                          Scope
Atomic operation            Atomic read-modify-write instruction to a counter    All CPUs
Memory barrier              Avoid instruction re-ordering                        Local CPU
Spin lock                   Lock with busy wait                                  All CPUs
Semaphore                   Lock with blocking wait                              All CPUs
Local interrupt disabling   Forbid interrupt handling on a single CPU            Local CPU
Local softirq disabling     Forbid deferrable function handling on a single CPU  Local CPU
Global interrupt disabling  Forbid interrupt and softirq handling on all CPUs    All CPUs
Atomic Operations
• Many instructions are not atomic in hardware on multiprocessors (MP)
– rmw instructions: inc, test-and-set, swap
– unaligned memory access
– rep instructions
• The compiler may not generate atomic code
– even i++ is not necessarily atomic! (i = i + 1)
• Linux – atomic_ macros
– atomic_t – 24 bit atomic counters
– Intel implementation (atomic, for MP)
• lock prefix byte 0xf0 – locks memory bus
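To make the "i++ is not necessarily atomic" point concrete, here is a small sketch using C11 `<stdatomic.h>` as a user-space stand-in for the kernel's atomic_ macros. The names `event_count`, `count_event`, and `count_event_racy` are made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

atomic_int event_count;   /* shared counter, zero-initialized */

/* Atomic read-modify-write: one indivisible step, even on SMP.
 * Returns the new value. */
int count_event(void)
{
    return atomic_fetch_add(&event_count, 1) + 1;
}

/* What a plain i++ may compile to. Shown only for contrast: two CPUs
 * can both load the same old value and one increment is then lost. */
int count_event_racy(int *plain)
{
    assert(plain != NULL);
    int tmp = *plain;   /* load   */
    tmp = tmp + 1;      /* modify */
    *plain = tmp;       /* store  */
    return tmp;
}
```

On x86 the atomic version typically compiles to a single instruction carrying the lock prefix mentioned above, while the racy version is three separate steps that another CPU can interleave with.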
Atomic operations in Linux
Function                    Description
atomic_read(v)              Return *v
atomic_set(v, i)            Set *v to i
atomic_add(i, v)            Add i to *v
atomic_sub(i, v)            Subtract i from *v
atomic_sub_and_test(i, v)   Subtract i from *v and return 1 if the result is zero; 0 otherwise
atomic_inc(v)               Add 1 to *v
atomic_dec(v)               Subtract 1 from *v
atomic_dec_and_test(v)      Subtract 1 from *v and return 1 if the result is zero; 0 otherwise
atomic_inc_and_test(v)      Add 1 to *v and return 1 if the result is zero; 0 otherwise
atomic_add_negative(i, v)   Add i to *v and return 1 if the result is negative; 0 otherwise
Atomic bit handling functions in Linux
Function                        Description
test_bit(nr, addr)              Return the value of the nr-th bit of *addr
set_bit(nr, addr)               Set the nr-th bit of *addr
clear_bit(nr, addr)             Clear the nr-th bit of *addr
change_bit(nr, addr)            Invert the nr-th bit of *addr
test_and_set_bit(nr, addr)      Set the nr-th bit of *addr and return its old value
test_and_clear_bit(nr, addr)    Clear the nr-th bit of *addr and return its old value
test_and_change_bit(nr, addr)   Invert the nr-th bit of *addr and return its old value
atomic_clear_mask(mask, addr)   Clear all bits of addr specified by mask
atomic_set_mask(mask, addr)     Set all bits of addr specified by mask
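The test-and-modify functions in the table can be approximated in user space with C11 atomic fetch operations. This is only a sketch of the semantics: the `model_` names are hypothetical and the real kernel versions are architecture-specific.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Mask selecting bit nr inside its word. */
static unsigned long model_bit_mask(size_t nr)
{
    return 1UL << (nr % MODEL_BITS_PER_LONG);
}

/* Word of the bitmap that holds bit nr. */
static atomic_ulong *model_bit_word(size_t nr, atomic_ulong *addr)
{
    assert(addr != NULL);
    return &addr[nr / MODEL_BITS_PER_LONG];
}

bool model_test_bit(size_t nr, atomic_ulong *addr)
{
    return (atomic_load(model_bit_word(nr, addr)) & model_bit_mask(nr)) != 0;
}

/* Set the bit and report its old value in one atomic step. */
bool model_test_and_set_bit(size_t nr, atomic_ulong *addr)
{
    unsigned long old = atomic_fetch_or(model_bit_word(nr, addr), model_bit_mask(nr));
    return (old & model_bit_mask(nr)) != 0;
}

/* Clear the bit and report its old value in one atomic step. */
bool model_test_and_clear_bit(size_t nr, atomic_ulong *addr)
{
    unsigned long old = atomic_fetch_and(model_bit_word(nr, addr), ~model_bit_mask(nr));
    return (old & model_bit_mask(nr)) != 0;
}

/* Invert the bit and report its old value in one atomic step. */
bool model_test_and_change_bit(size_t nr, atomic_ulong *addr)
{
    unsigned long old = atomic_fetch_xor(model_bit_word(nr, addr), model_bit_mask(nr));
    return (old & model_bit_mask(nr)) != 0;
}
```

Returning the old value is what makes these usable as simple locks or "first caller wins" flags: only one CPU can observe the 0 to 1 transition.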
Memory Barriers
• Compilers and hw re-order memory accesses
– as an optimization
– true on SMP and even UP systems!
• Memory barrier – instruction to hw/compiler to complete all
pending accesses before issuing more
– read memory barrier – acts on read requests
– write memory barrier – acts on write requests
• Linux macros
– for UP and MP: mb(), rmb(), wmb()
– for MP only: smp_mb(), smp_rmb(), smp_wmb()
Memory barriers in Linux
Macro Description
mb( ) Memory barrier for MP and UP
rmb( ) Read memory barrier for MP and UP
wmb( ) Write memory barrier for MP and UP
smp_mb( ) Memory barrier for MP only
smp_rmb( ) Read memory barrier for MP only
smp_wmb( ) Write memory barrier for MP only
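A common use of a write barrier paired with a read barrier is publishing data through a flag. The sketch below uses C11 fences as user-space stand-ins: the release fence plays roughly the role of smp_wmb() and the acquire fence roughly that of smp_rmb() (the C11 fences are somewhat stronger). All names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

int model_payload;          /* ordinary data written before the flag */
atomic_bool model_ready;    /* flag telling the consumer the data is valid */

void model_publish(int value)
{
    model_payload = value;
    /* Write barrier: the payload store must be visible before the flag. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&model_ready, true, memory_order_relaxed);
}

/* Returns true and fills *out only if the payload has been published. */
bool model_try_consume(int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&model_ready, memory_order_relaxed))
        return false;
    /* Read barrier: do not read the payload ahead of the flag check. */
    atomic_thread_fence(memory_order_acquire);
    *out = model_payload;
    return true;
}
```

Without the two fences, either the compiler or the CPU could make the flag visible before the payload, and the consumer would read stale data.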
Peterson’s Solution
• Two process solution
• Assume that the LOAD and STORE instructions are atomic;
that is, cannot be interrupted.
• The two processes share two variables:
– int turn;
– Boolean flag[2]
• The variable turn indicates whose turn it is to enter the critical
section.
• The flag array is used to indicate if a process is ready to enter
the critical section. flag[i] = true implies that process Pi is
ready!
Algorithm for Process Pi
while (true) {
    flag[i] = TRUE;
    turn = j;
    while (flag[j] && turn == j);
    /* CRITICAL SECTION */
    flag[i] = FALSE;
    /* REMAINDER SECTION */
}
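If the two stores in Task_i are re-ordered (turn = j before flag[i] = TRUE), the following interleaving lets both tasks enter the critical section:
Task_i                                  Task_j
while (true) {                          while (true) {
    turn = j;
    /* flag[i] is still FALSE */            turn = i;
    /* turn is now i */                     flag[j] = TRUE;
                                            while (flag[i] && turn == i);
    flag[i] = TRUE;
    while (flag[j] && turn == j);           /* CRITICAL SECTION */
    /* CRITICAL SECTION */                  flag[j] = FALSE;
    flag[i] = FALSE;                        /* REMAINDER SECTION */
    /* REMAINDER SECTION */             }
}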
Peterson’s Solution
while (true) {
    flag[i] = TRUE;
    mb( );
    turn = j;
    while (flag[j] && turn == j);
    /* CRITICAL SECTION */
    flag[i] = FALSE;
    /* REMAINDER SECTION */
}
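For comparison, here is a user-space sketch of Peterson's algorithm written with C11 sequentially consistent atomics. This is a deliberately stronger choice than the single mb( ) shown above: it forbids every re-ordering among the accesses to flag and turn, at some cost in speed. The function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

atomic_bool flag[2];   /* flag[i]: process i wants to enter */
atomic_int turn;       /* whose turn it is to wait-free enter */

/* Enter the critical section as process i (i is 0 or 1). */
void peterson_lock(int i)
{
    assert(i == 0 || i == 1);
    int j = 1 - i;
    atomic_store(&flag[i], true);   /* seq_cst: cannot pass the next store */
    atomic_store(&turn, j);
    while (atomic_load(&flag[j]) && atomic_load(&turn) == j)
        ;                           /* busy-wait */
}

/* Leave the critical section as process i. */
void peterson_unlock(int i)
{
    assert(i == 0 || i == 1);
    atomic_store(&flag[i], false);
}
```

Each of the two threads calls peterson_lock() with its own index, runs the critical section, and then calls peterson_unlock() with the same index.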
Spin Lock
• A special kind of lock designed to work in a multiprocessor environment.
– Spin lock
– R/W spin lock
– Sequential lock
• Useless in a uniprocessor environment (?)
Spin lock functions
Function              Description
spin_lock_init( )     Set the spin lock to 1 (unlocked)
spin_lock( )          Cycle until spin lock becomes 1 (unlocked), then set it to 0 (locked)
spin_unlock( )        Set the spin lock to 1 (unlocked)
spin_unlock_wait( )   Wait until the spin lock becomes 1 (unlocked)
spin_is_locked( )     Return 0 if the spin lock is set to 1 (unlocked); 1 otherwise
spin_trylock( )       Set the spin lock to 0 (locked), and return 1 if the lock is obtained; 0 otherwise
Spin lock functions
spin_lock(slp):
1:  lock; decb slp
    jns 3f
2:  cmpb $0,slp
    pause
    jle 2b
    jmp 1b
3:

spin_unlock(slp):
    movb $1, slp        # plain store: x86 does not accept a lock prefix on mov
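The assembly above can be mirrored in portable C11 for experimentation. This is a hypothetical user-space model (all `model_` names are invented), keeping the slide's convention that 1 means unlocked and any value of 0 or below means locked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* 1 = unlocked, <= 0 = locked (same convention as the slides). */
typedef struct { atomic_int slock; } model_spinlock_t;

void model_spin_lock_init(model_spinlock_t *l)
{
    assert(l != NULL);
    atomic_init(&l->slock, 1);
}

bool model_spin_is_locked(model_spinlock_t *l)
{
    return atomic_load_explicit(&l->slock, memory_order_relaxed) <= 0;
}

void model_spin_lock(model_spinlock_t *l)
{
    for (;;) {
        /* "lock; decb slp / jns 3f": an old value of 1 means we own the lock. */
        if (atomic_fetch_sub_explicit(&l->slock, 1, memory_order_acquire) >= 1)
            return;
        /* "cmpb / pause / jle 2b": spin on plain reads until it looks free. */
        while (model_spin_is_locked(l))
            ;
    }
}

bool model_spin_trylock(model_spinlock_t *l)
{
    /* Swap in 0 (locked); the lock is ours only if it was 1 before. */
    return atomic_exchange_explicit(&l->slock, 0, memory_order_acquire) >= 1;
}

void model_spin_unlock(model_spinlock_t *l)
{
    atomic_store_explicit(&l->slock, 1, memory_order_release);
}
```

Spinning on a plain load rather than on the atomic decrement keeps waiting CPUs from hammering the bus with locked operations, which is the purpose of the inner loop at label 2 in the assembly.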
Read/Write Spin Locks
Lock word layout: bit 24 is the lock (unlock flag); bits 0-23 hold the number of readers.
State          Value
initial        0x01000000
write locked   0x00000000
one reader     0x00ffffff
two readers    0x00fffffe
Read Spin Lock
read_lock(rwlp):
    movl $rwlp,%eax
    lock; subl $1,(%eax)
    jns 1f
    call __read_lock_failed
1:

read_unlock(rwlp):
    lock; incl rwlp

__read_lock_failed:
    lock; incl (%eax)
1:  cmpl $1,(%eax)
    js 1b
    lock; decl (%eax)
    js __read_lock_failed
    ret
Write Spin Lock
write_lock(rwlp):
    movl $rwlp,%eax
    lock; subl $0x01000000,(%eax)
    jz 1f
    call __write_lock_failed
1:

write_unlock(rwlp):
    lock; addl $0x01000000,rwlp

__write_lock_failed:
    lock; addl $0x01000000,(%eax)
1:  cmpl $0x01000000,(%eax)
    jne 1b
    lock; subl $0x01000000,(%eax)
    jnz __write_lock_failed
    ret
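The biased-counter trick used by the read and write paths can be tried out with a small C11 model. This is a sketch under assumptions: the `model_` names are hypothetical, and only the try-lock fast paths are shown (a blocking lock would spin and retry as the assembly does).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_RW_LOCK_BIAS 0x01000000   /* value of an idle lock */

typedef struct { atomic_int lock; } model_rwlock_t;

void model_rwlock_init(model_rwlock_t *rw)
{
    assert(rw != NULL);
    atomic_init(&rw->lock, MODEL_RW_LOCK_BIAS);
}

/* Reader: subtract 1; a non-negative result means no writer holds the lock. */
bool model_read_trylock(model_rwlock_t *rw)
{
    if (atomic_fetch_sub(&rw->lock, 1) - 1 >= 0)
        return true;
    atomic_fetch_add(&rw->lock, 1);      /* undo, as __read_lock_failed does */
    return false;
}

void model_read_unlock(model_rwlock_t *rw)
{
    atomic_fetch_add(&rw->lock, 1);
}

/* Writer: subtract the bias; a zero result means no readers and no writer. */
bool model_write_trylock(model_rwlock_t *rw)
{
    if (atomic_fetch_sub(&rw->lock, MODEL_RW_LOCK_BIAS) == MODEL_RW_LOCK_BIAS)
        return true;
    atomic_fetch_add(&rw->lock, MODEL_RW_LOCK_BIAS);   /* undo */
    return false;
}

void model_write_unlock(model_rwlock_t *rw)
{
    atomic_fetch_add(&rw->lock, MODEL_RW_LOCK_BIAS);
}
```

The single word encodes both facts at once: each reader lowers the value by 1 and a writer lowers it by the whole bias, so one atomic subtraction plus a sign or zero test is enough to decide whether the lock was obtained.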
Seqlock (sequential lock)
• A seqlock is a locking mechanism in Linux for supporting fast writes of shared variables.
• seqlock := sequence number + lock
– The lock supports synchronization between two writers
– The counter indicates consistency to readers
Seqlock (sequential lock)
– The writer increments the sequence number both after acquiring the lock and before releasing the lock.
– Readers read the sequence number before and after reading the shared data, and retry if it was odd or has changed.
do {
    while (((old_seq_num = seq_num) % 2) != 0);
    // READER: critical section
} while (old_seq_num != seq_num);
• Seqlock was first applied to system time counter updating.
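The reader loop and the writer's two increments fit together as in the sketch below, written with C11 atomics for user space. All names are hypothetical, the protected data is just a pair of integers, and the writer side assumes writers are already serialized by a separate lock (the "lock" half of the seqlock), which is not shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    atomic_uint seq;     /* even: stable; odd: a write is in progress */
    atomic_int lo, hi;   /* protected pair (relaxed atomics avoid a C data race) */
} model_seq_pair_t;

void model_seq_init(model_seq_pair_t *p)
{
    assert(p != NULL);
    atomic_init(&p->seq, 0);
    atomic_init(&p->lo, 0);
    atomic_init(&p->hi, 0);
}

/* Writer: caller must hold the writers' lock (assumed, not shown). */
void model_seq_write(model_seq_pair_t *p, int lo, int hi)
{
    unsigned s = atomic_load_explicit(&p->seq, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);  /* odd: busy */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->lo, lo, memory_order_relaxed);
    atomic_store_explicit(&p->hi, hi, memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 2, memory_order_release);  /* even: done */
}

/* Reader: never blocks the writer; retries until it sees a stable snapshot. */
void model_seq_read(model_seq_pair_t *p, int *lo, int *hi)
{
    unsigned before, after;
    do {
        before = atomic_load_explicit(&p->seq, memory_order_acquire);
        *lo = atomic_load_explicit(&p->lo, memory_order_relaxed);
        *hi = atomic_load_explicit(&p->hi, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load_explicit(&p->seq, memory_order_relaxed);
    } while ((before & 1u) != 0 || before != after);
}
```

Readers pay only two counter reads when there is no concurrent write, which is why the scheme suits data that is read often and written rarely, such as a time counter.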
MONITOR & MWAIT
(x86, for thread synchronization)
• MONITOR defines an address range used to monitor write-back stores.
• MWAIT is used to indicate that the software thread is waiting for a write-back store to the address range defined by the MONITOR instruction.
Read-copy-update (RCU)
• It allows extremely low overhead, wait-free reads.
• RCU updates can be expensive
– they must leave the old versions of the data structure in
place to accommodate pre-existing readers.
– These old versions are reclaimed after all pre-existing
readers finish their accesses.
• RCU is a new addition in Linux 2.6; it is used in the
networking layer and in the virtual file system (VFS).
Reference: Paul E. McKenney: Read-copy-update (RCU), http://www.rdrop.com/users/paulmck/rclock/
IPDPS 2006 Best Paper
[Figure: a reader copies PTR into Local_PTR and accesses data through it]
RCU allows extremely low overhead, wait-free reads.
[Figure: the reader still uses data through Local_PTR while the writer builds data (new) with kmalloc + copy + update, referenced by New_PTR]
RCU updates can be very expensive…
[Figure: in one atomic operation the writer sets PTR = data (new)]
Remove pointers to a data structure, so that subsequent readers cannot gain a reference to it.
[Figure: PTR now refers to data (new); the old data is still in place for the earlier reader]
Wait for all previous readers to complete their RCU read-side critical sections.
[Figure: the writer or GC calls kfree(old_ptr); PTR refers to data (new)]
The “GC” can safely reclaim the data (the old version).
[Figure: only data (new) remains, referenced by PTR]
[Figure: timeline of reader, writer, and GC, showing "Lock scheduler" and "Unlock" around the reader and a context switch (CTX_SW)]
Lock_scheduler := preempt_count++
Unlock_scheduler := preempt_count--
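The copy, update, publish, wait, and reclaim steps in the figures can be walked through with a toy user-space model. This is not how the kernel implements RCU: the shared reader counter below is a hypothetical stand-in for grace-period detection (real RCU readers do not write to shared memory), a single writer is assumed, and all names are invented for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct model_cfg { int a, b; };

_Atomic(struct model_cfg *) model_ptr;   /* "PTR" in the figures; NULL at start */
atomic_int model_readers;                /* toy grace-period detector (assumption) */

/* Reader: take a local copy of the pointer and use the data through it. */
int model_read_sum(void)
{
    atomic_fetch_add(&model_readers, 1);                     /* enter read side */
    struct model_cfg *local_ptr = atomic_load(&model_ptr);   /* Local_PTR = PTR */
    int sum = local_ptr ? local_ptr->a + local_ptr->b : 0;
    atomic_fetch_sub(&model_readers, 1);                     /* leave read side */
    return sum;
}

/* Writer (single writer assumed): copy, update, publish, wait, reclaim. */
int model_update_a(int a)
{
    struct model_cfg *old_ptr = atomic_load(&model_ptr);
    struct model_cfg *new_ptr = malloc(sizeof *new_ptr);
    if (new_ptr == NULL)
        return -1;
    if (old_ptr != NULL) {
        *new_ptr = *old_ptr;                 /* copy */
    } else {
        new_ptr->a = 0;
        new_ptr->b = 0;
    }
    new_ptr->a = a;                          /* update the private copy */
    atomic_store(&model_ptr, new_ptr);       /* publish: PTR = data (new) */
    assert(atomic_load(&model_ptr) == new_ptr);   /* holds with a single writer */
    while (atomic_load(&model_readers) != 0)
        ;                                    /* wait for pre-existing readers */
    free(old_ptr);                           /* reclaim the old version */
    return 0;
}
```

A reader never sees a half-updated structure: it gets either the complete old version or the complete new one, and the old version is freed only once no reader can still hold a pointer to it.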
Semaphores
• Kernel semaphores
– used by kernel control paths.
– can be acquired only by functions that are allowed
to sleep; interrupt handlers and deferrable
functions cannot use them.
• System V IPC semaphores
– used by User Mode processes
Semaphores
• struct semaphore
– count (atomic_t):
• >0 free; 0 inuse, no waiters; <0 inuse, waiters
– wait: wait queue
– sleepers: 0 (none), 1 (some), occasionally 2
• implementation requires lower-level synch!
– atomic updates, spinlock, interrupt disabling
• optimized assembly code for normal case (down())
– C code for slower “contended” case (__down())
Semaphores
up:
    movl $sem,%ecx
    lock; incl (%ecx)
    jg 1f
    pushl %eax
    pushl %edx
    pushl %ecx
    call __up
    popl %ecx
    popl %edx
    popl %eax
1:

down:
    movl $sem,%ecx
    lock; decl (%ecx)
    jns 1f
    pushl %eax
    pushl %edx
    pushl %ecx
    call __down
    popl %ecx
    popl %edx
    popl %eax
1:
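The two fast paths reduce to one atomic update plus a sign test on the result. The sketch below models only that decision in user-space C11; the function names are hypothetical, and the slow paths (__down and __up, which sleep and wake) are not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* down fast path: "lock; decl / jns 1f".
 * Returns true if the semaphore was acquired without sleeping;
 * false means the caller would have to enter __down and wait. */
bool model_down_fast(atomic_int *count)
{
    assert(count != NULL);
    return atomic_fetch_sub(count, 1) - 1 >= 0;
}

/* up fast path: "lock; incl / jg 1f".
 * Returns true if nobody was waiting;
 * false means the caller would have to enter __up and wake a sleeper. */
bool model_up_fast(atomic_int *count)
{
    assert(count != NULL);
    return atomic_fetch_add(count, 1) + 1 > 0;
}
```

With count starting at 1, the first down leaves 0 (in use, no waiters), a second down leaves -1 (in use, one waiter), and the matching up that brings it back to 0 takes the slow path to wake that waiter.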
__down
[Figure: wait queue operations WaitingQ.ins and WaitingQ.del]
Read/Write Semaphores
• New feature of Linux 2.4
• Read/Write Semaphores
• FIFO
• complex implementation
– similar to regular semaphores
• operations:
– down_read(), down_write()
– up_read(), up_write()
Read/Write Semaphores
• The first process is always awoken.
– If it is a writer, the other processes in the wait
queue continue to sleep.
– If it is a reader, any other reader following the first
process is also woken up and gets the lock.
However, readers that have been queued after a
writer continue to sleep.
R R R W R W R R
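The wake-up rule above is easy to state as a function over the wait queue. In this sketch the queue is a string of 'R' and 'W' characters with the head first; the function name and the string encoding are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Number of sleepers at the head of the queue that get woken up. */
size_t model_rwsem_wake_count(const char *queue)
{
    assert(queue != NULL);
    if (queue[0] == '\0')
        return 0;                 /* nobody is waiting */
    if (queue[0] == 'W')
        return 1;                 /* a writer runs alone */
    size_t n = 0;
    while (queue[n] == 'R')       /* the first reader plus the readers right behind it */
        n++;
    return n;                     /* stops at the first writer */
}
```

For the queue R R R W R W R R shown above, only the first three readers are woken; the readers queued behind a writer keep sleeping, which preserves FIFO order and keeps writers from starving.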
Completions
• The current implementation of up( ) and down( ) allows them to execute concurrently on the same semaphore.
• As a result, up( ) might attempt to access a data structure that no longer exists.
• Completions avoid this problem:
– up( ) → complete( )
– down( ) → wait_for_completion( )
Completions
[Figure: timeline of two kernel control paths (1 and 2) with the operations create_sem, down, del_sem, and up]
Local Interrupt Disabling
• Local interrupt disabling does not protect
against concurrent accesses to data structures
by interrupt handlers running on other CPUs.
• In multiprocessor systems, local interrupt disabling is often coupled with spin locks.
– Spin locks: only on SMP systems; keep them short!
– Spin lock types: general, read/write, big reader
Global Interrupt Disabling
• A typical scenario consists of a driver that
needs to reset the hardware device.
• Global interrupt disabling significantly lowers
the system concurrency level.
• An interrupt service routine should never
execute the cli( ) macro.
__global_cli()
• wait for top and bottom halves to complete
• disable local interrupts
• grab spinlock
• disable all interrupts
Disabling Deferrable Functions
• disabling interrupts disables deferred
functions
• possible to disable deferred functions but not
all interrupts
• ops (macros):
– local_bh_disable()
– local_bh_enable()
Choosing Synch Primitives
• avoid synch if possible! (clever instruction
ordering)
– example: inserting in linked list (needs barrier still)
– Example: task migration
• use atomics or rw spinlocks if possible
• use semaphores if you need to sleep
• complicated structures accessed by deferred
functions
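The linked-list example of avoiding a lock through instruction ordering can be sketched as follows. It assumes a single writer that only ever inserts at the head, and uses a C11 release store where kernel code would use a write barrier followed by a plain store; all names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct model_node {
    int value;
    struct model_node *next;
};

_Atomic(struct model_node *) model_list_head;   /* NULL: empty list */

/* Single writer assumed. Step 1 fills in the node, step 2 publishes it. */
void model_list_push_front(struct model_node *n, int value)
{
    assert(n != NULL);
    n->value = value;
    n->next = atomic_load_explicit(&model_list_head, memory_order_relaxed);
    /* The release store is the barrier: the node's fields must be
     * visible before the node itself becomes reachable. */
    atomic_store_explicit(&model_list_head, n, memory_order_release);
}

/* Lock-free reader: sees the list either before or after an insertion. */
int model_list_sum(void)
{
    int sum = 0;
    struct model_node *p = atomic_load_explicit(&model_list_head, memory_order_acquire);
    for (; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}
```

If the two steps were re-ordered, a reader could follow the head pointer into a node whose next field is still garbage, which is why the slide notes that a barrier is still needed.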
Example Race Conditions
• reference counters for sharing structs
– get/put functions
– deallocate when 0
• memory map semaphore
• slab cache list semaphore
• inode semaphore
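The reference-counter pattern (get/put, deallocate when 0) maps directly onto an atomic decrement-and-test. Below is a user-space sketch with C11 atomics; the `model_obj` type and its functions are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct model_obj {
    atomic_int refcount;
    int payload;
};

/* Create an object owned by the caller (refcount starts at 1). */
struct model_obj *model_obj_create(int payload)
{
    struct model_obj *o = malloc(sizeof *o);
    if (o == NULL)
        return NULL;
    atomic_init(&o->refcount, 1);
    o->payload = payload;
    return o;
}

/* get: the caller must already hold a reference. */
void model_obj_get(struct model_obj *o)
{
    assert(o != NULL);
    atomic_fetch_add(&o->refcount, 1);
}

/* put: drop one reference; returns 1 if this call freed the object.
 * The decrement and the zero test are one atomic step, so exactly
 * one caller can observe the count reaching zero. */
int model_obj_put(struct model_obj *o)
{
    assert(o != NULL);
    if (atomic_fetch_sub(&o->refcount, 1) == 1) {
        free(o);
        return 1;
    }
    return 0;
}
```

With a plain counter, two CPUs dropping the last two references could both read 2, both write 1, and the object would never be freed; or both could see 0 and free it twice.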
