Asynchronous Database Writer
by
Nitin Vengurlekar
August 2001
I.
Although the buffer cache is depicted as a pool of buffers, it is really a linear array.
This document uses the term buffer blocks to refer to blocks within the db_block_buffer cache, and
to distinguish them from buffer headers.
3
A DBA, or data block address, uniquely identifies a particular block within a database file. The
DBA is a 32-bit address composed of a 22-bit block# and a 10-bit file#.
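The DBA layout can be illustrated with a minimal sketch. It assumes that the file# occupies the upper 10 bits and the block# the lower 22 bits of a 32-bit value; the helper names make_dba and split_dba are illustrative only and are not Oracle routines.

```python
BLOCK_BITS = 22                       # assumed: low-order bits hold the block#
FILE_BITS = 10                        # assumed: high-order bits hold the file#
BLOCK_MASK = (1 << BLOCK_BITS) - 1    # 0x3FFFFF


def make_dba(file_no: int, block_no: int) -> int:
    """Pack a file# and block# into one 32-bit data block address."""
    if not 0 <= file_no < (1 << FILE_BITS):
        raise ValueError("file# does not fit in 10 bits")
    if not 0 <= block_no <= BLOCK_MASK:
        raise ValueError("block# does not fit in 22 bits")
    return (file_no << BLOCK_BITS) | block_no


def split_dba(dba: int) -> tuple[int, int]:
    """Return (file#, block#) for a 32-bit data block address."""
    return dba >> BLOCK_BITS, dba & BLOCK_MASK
```

Under these assumptions, file 1 block 2 packs to 0x00400002, and split_dba reverses the packing.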
2
the LRU mechanism. These two separate lists, called the LRU and LRUW linked lists, basically hold the
buffer headers. Both the lists have different properties and thus are treated differently by DBWR.
1. LRU list.
The LRU list is a linear list that has a head and a tail end.
The head of the LRU list is considered to be the hottest part of list; i.e., it contains the MRU (most recently
used) buffers. All new block gets are placed on the MRU end (with the exception of sequentially scanned
blocks). The tail end of the LRU contains the buffers that have not been referenced recently and thus can
be reused. Therefore the tail of LRU is where the foreground processes begin to search for free buffers.
Buffers on the LRU can have one of three statuses: free, pinned, or dirty. Pinned buffers are buffers that
are currently being held by a user and/or have waiters against them. Moreover, the pinned status is
sub-categorized as pinned clean or pinned dirty. Free buffers are unused buffers; i.e., a new block
that is to be read into the cache (from disk) can use one. Dirty buffers are modified buffers that have
not been moved over to the LRUW list. Dirty buffers differ from pinned dirty buffers in
that pinned dirties have users/waiters against them and hence cannot be written out to disk, whereas
the (non-pinned) dirty buffers are no longer held and are eligible to be moved to the LRUW list and
subsequently to disk.
2. LRUW list.
The LRUW list contains the dirty buffers eligible for disk write-outs by DBWR. The LRUW list is also
called the dirty list. How buffers get moved over to the LRUW list and consequently to disk is the
foundation of DBWR's function. DBWR writes buffer blocks to disk when it is signaled to do so.
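The relationship between the two lists can be sketched as follows. This is a simplified model, not Oracle's implementation: the class and method names are hypothetical, and it assumes that the process scanning from the tail is the one that moves unpinned dirty buffers to the LRUW list.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(eq=False)            # identity-based equality: each header is unique
class BufferHeader:
    dba: int
    dirty: bool = False
    pinned: bool = False


class WorkingSet:
    def __init__(self):
        self.lru = deque()      # index 0 is the head (MRU), index -1 the tail
        self.lruw = deque()     # dirty list: buffers awaiting a DBWR write

    def add_new_block(self, buf):
        self.lru.appendleft(buf)            # new block gets go on the MRU end

    def find_free_buffer(self):
        """Scan from the tail toward the head for a reusable buffer."""
        for buf in list(reversed(self.lru)):
            if buf.pinned:
                continue                    # held by a user: cannot be reused
            self.lru.remove(buf)
            if buf.dirty:
                self.lruw.append(buf)       # assumed: moved to LRUW by the scan
                continue
            return buf                      # clean and unpinned: reusable
        return None
```

In this model a scan that meets a dirty buffer at the tail moves it to the LRUW list, skips any pinned buffer, and returns the first clean unpinned buffer it finds.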
B. Evolution of DBWR and Cache Management.
Through the evolution of Oracle, from Oracle7.3 to Oracle8 and now to Oracle8i, several key components and
features have been introduced into the DBWR/Cache Management code.
Oracle 7.3 introduced multiple LRU latches (working sets) to alleviate the contention against a single
LRU latch.
Oracle8.0 instituted multiple buffer pools and capitalized on the introduction of multiple LRU latches.
Multiple buffer pools provided a mechanism to segregate buffer blocks by database segments (tables,
indexes). Oracle8.0 also introduced checkpoint and file queues. This significant addition to the
Cache Management code provided a means for faster checkpoint processing and thus faster recovery
(instance and crash). Moreover, this feature also eliminated several buffer scans imposed by
checkpoint requests, thereby reducing CPU overhead.
The remainder of this document discusses the new Buffer Cache Management and DBWR features and
capabilities introduced in Oracle8i.
Oracle7
SQL> select description from x$messages where indx in ('9','10','11')
  2  and description like 'write %';
DESCRIPTION
-----------------------------------------------------------
write dirty buffers when idle - timeout action
write dirty buffers/find clean buffers
write checkpoint-needed buffers/recovery end
The main objective of DBWR is to perform efficient IO and to avoid IO peaks. This is done by
allowing DBWR to perform aggressive writing and queue management, which allows a consistent write
pace.
The foundations for the new DBWR were laid down in Oracle8.0, which introduced two flavors of
multiple DBWRs:
DBWR IO slaves (dbwr_io_slaves)
Multiple DBWRs (db_writer_processes)
a.
DBWR IO slaves
In Oracle7, the multiple DBWR processes were simple slave processes; i.e., they were unable to perform
async I/O calls. In Oracle8.0, the slave database writer code was kernelized. This feature is
implemented via the init.ora parameter dbwr_io_slaves. With dbwr_io_slaves, there is still a
master DBWR process and its slave processes. This feature is very similar to the db_writers in
Oracle7, except that the IO slaves are now capable of asynchronous I/O, thus allowing for much better
throughput, as slaves are not blocked after the I/O call. Asynchronous I/O is provided to the slave
processes by the adapter layers within the Oracle kernel. Thus slave async IO does
not use native OS async IO libraries, but Oracle internal mechanisms.
Slave processes are started after the database open stage (not instance creation), upon the initial request for
a slave IO. The names of the DBWR slave processes are different from those of the Oracle7 slaves.
For example, a typical DBWR slave background process may be ora_i103_testdb, where:
i indicates that this process is a slave IO process
1 indicates the IO adapter number
3 specifies the slave number
Therefore, if dbwr_io_slaves is set to 3, then the following slave processes will be created:
ora_i101_testdb, ora_i102_testdb and ora_i103_testdb.
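The naming pattern can be sketched as below. The two-digit, zero-padded slave number is an assumption inferred from the example names above, and slave_process_names is a hypothetical helper, not an Oracle utility.

```python
def slave_process_names(sid: str, n_slaves: int, adapter: int = 1) -> list[str]:
    """Build the expected OS process names for DBWR IO slaves.

    Assumed pattern: ora_i<adapter><slave number, two digits>_<SID>.
    """
    return [f"ora_i{adapter}{slave:02d}_{sid}" for slave in range(1, n_slaves + 1)]
```

For a SID of testdb and dbwr_io_slaves set to 3, this yields the three names listed above.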
p97050 15304     1  0 08:37:00 ?  0:00 ora_i102_pmig1
p97050 15298     1  0 08:36:56 ?  0:00 ora_smon_pmig1
p97050 15296     1  0 08:36:56 ?  0:00 ora_ckpt_pmig1
p97050 15302     1  0 08:37:00 ?  0:00 ora_i101_pmig1
p97050 15292     1  0 08:36:55 ?  0:00 ora_dbw0_pmig1
p97050 15290     1  0 08:36:55 ?  0:00 ora_pmon_pmig1
p97050 15294     1  0 08:36:56 ?  0:00 ora_lgwr_pmig1
p97050 15306     1  0 08:37:01 ?  0:00 ora_i103_pmig1
This example illustrates how DBWR processes hash to working sets. Note the buffer ranges that each manages.
db_block_lru_latches = 2
dbwr_io_slaves = 6
END_BUF#
--------
99
199
Note, from the above listing, that the master DBWR is the only one that manages the working sets.
b.
Multiple DBWRs.
Multiple database writers are implemented via the init.ora parameter db_writer_processes. This
feature was enabled in Oracle8.0.4, and allows true database writers; i.e., no master-slave
relationship. If db_writer_processes is enabled, then the writer processes will be started after
PMON has initialized. The writer processes can be identified (at the OS level) by viewing the ps
command output. In this example db_writer_processes was set to 3. The sample ps output shows
the following. Note, the DBWR processes are numbered from 0 up to 9; therefore there can
be up to 10 DBWRs. There is no master DBWR process; all are equally weighted. Note,
db_writer_processes are different from DBWR IO slaves, in that the db_writer_processes do not use
the IO slave layers; therefore the db_writer_processes can only use OS native async IO (via aio
libraries).
In Oracle8 disk asynchronous I/O can be enabled using the parameter disk_async_io.
Oracle 8i Async DBWR
p97050  1472     1  0 10:48:18 ?  0:00 ora_dbw2_pmig1
p97050  1474     1  0 10:48:18 ?  0:00 ora_dbw3_pmig1
p97050  1466     1  0 10:48:17 ?  0:00 ora_pmon_pmig1
p97050  1478     1  0 10:48:18 ?  0:00 ora_ckpt_pmig1
p97050  1470     1  0 10:48:18 ?  0:00 ora_dbw1_pmig1
p97050  1480     1  0 10:48:18 ?  0:00 ora_smon_pmig1
p97050  1468     1  0 10:48:18 ?  0:00 ora_dbw0_pmig1
p97050  1476     1  0 10:48:18 ?  0:00 ora_lgwr_pmig1
With Oracle8 db_writer_processes, each writer process is assigned to an LRU latch set. Thus, it is
recommended to set db_writer_processes equal to the number of LRU latches (db_block_lru_latches)
and not to exceed the number of CPUs on the system. For example, if db_writer_processes is set
to four and db_block_lru_latches=4, then each writer process will manage its corresponding set; i.e.,
each writer will write buffers from its own LRUW list. Allowing each writer to manage at
least one LRU latch provides a very autonomous and segregated approach to Cache management.
This example illustrates how DBWR processes hash to working sets.
db_block_lru_latches = 4
db_writer_processes = 2
END_BUF#
--------
49
99
149
199
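The END_BUF# values in these listings are consistent with a cache of 200 buffers divided evenly among the working sets. The sketch below reproduces that arithmetic. The buffer count of 200 is inferred from the listing, and the round-robin assignment of sets to writers is purely an assumption for illustration; the actual hashing used by Oracle is not described here.

```python
def working_set_ranges(n_buffers: int, n_sets: int) -> list[tuple[int, int]]:
    """Return (start_buf, end_buf) per working set, split as evenly as possible."""
    base, extra = divmod(n_buffers, n_sets)
    ranges, start = [], 0
    for i in range(n_sets):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def assign_sets_to_writers(n_sets: int, n_writers: int) -> dict[int, list[int]]:
    """Assumed round-robin mapping: set i is handled by writer i % n_writers."""
    n_writers = min(n_writers, n_sets)      # writers cannot exceed working sets
    owners = {w: [] for w in range(n_writers)}
    for s in range(n_sets):
        owners[s % n_writers].append(s)
    return owners
```

With 200 buffers, two sets end at buffers 99 and 199, and four sets end at 49, 99, 149 and 199, matching the listings.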
Although both implementations of DBWR processes may be beneficial, the general rule on which option
to use depends on the following: (1) the amount of write activity; (2) the number of CPUs (the number of
CPUs is also indirectly related to the number of LRU latch sets); (3) the size of the buffer cache; and (4) the
availability of asynchronous I/O (from the OS). The following is not necessarily a top-down checklist
for determining which option to use, but rather an outline of the considerations. Note, it is best to try
both options (not simultaneously) against your system to determine which best fits the environment.
If the buffer cache is very large and the application is write intensive, then
db_writer_processes may be beneficial. Note, the number of writer processes should not
exceed the number of CPUs.
If the application is not very write intensive (or is even a DSS system) and async I/O is
available, then use a single writer process; if async I/O is not available then use
dbwr_io_slaves.
If the system is a uniprocessor then implement dbwr_io_slaves. Note, a uniprocessor system
will have db_block_lru_latches set to 1.
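These considerations can be condensed into a small sketch. The function name, its arguments, and the order in which the rules are checked are all assumptions made for illustration; the text above explicitly does not prescribe a top-down order, and testing both options remains the better guide.

```python
def suggest_dbwr_option(n_cpus: int, write_intensive: bool,
                        large_cache: bool, async_io: bool) -> str:
    """Suggest a starting DBWR configuration from the considerations above."""
    if n_cpus == 1:
        return "dbwr_io_slaves"               # uniprocessor case
    if write_intensive and large_cache:
        return "db_writer_processes"          # keep writers <= number of CPUs
    if async_io:
        return "single DBWR with async I/O"
    return "dbwr_io_slaves"                   # no OS async I/O available
```

Cases the text does not cover, such as a write-intensive application with a small cache, fall through to the async I/O check in this sketch.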
Implementing dbwr_io_slaves or db_writer_processes comes with some overhead cost. Enabling these
features requires that extra shared memory be allocated for IO buffers and request queues, and consumes extra CPU
cycles [5]. Multiple writer processes (and IO slaves) are advanced features, meant for high IO throughput.
Implement them only if the database environment requires such IO throughput. As stated earlier, in
some cases (if async I/O is available), it may be prudent to disable I/O slaves and run with a single DBWR
in async I/O mode. Review the current throughput and examine possible bottlenecks to determine whether it is
worthwhile to implement these features.
Caveats and Concerns
1.
Multiple DBWRs and DBWR IO slaves cannot coexist. If both are enabled, then the following error
message is produced:
"Cannot start multiple dbwrs when using I/O slaves"
Moreover, if both parameters are enabled, dbwr_io_slaves will take precedence.
2.
The number of DBWRs cannot exceed the number of working sets. If it does, then the number of
DBWRs will be curtailed to equal the number of working sets and the following message is produced in
the alert.log at initialization time:
"Cannot start more dbwrs than db_block_lru_latches"
However, the number of sets can exceed the number of DBWRs.
3.
4.
dbwr_io_slaves are not restricted to the number of sets; i.e., dbwr_io_slaves >= db_block_lru_latches is allowed.
However, when DBWR IO slaves are enabled, the master DBWR will manage the working sets.
5.
Do not set dbwr_io_slaves and disk_async_io=true. dbwr_io_slaves inherently uses async IO through
Oracle IO layers.
6.
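The caveats above can be summarized as a small parameter check. This is an illustrative sketch only: check_dbwr_params is a hypothetical helper, the first two message strings are the ones quoted above, and the third is wording invented here to flag the disk_async_io caveat.

```python
def check_dbwr_params(db_writer_processes: int = 1, dbwr_io_slaves: int = 0,
                      db_block_lru_latches: int = 1, disk_async_io: bool = False):
    """Return (effective number of DBWRs, list of messages) for a parameter set."""
    messages = []
    writers = db_writer_processes
    if dbwr_io_slaves > 0 and writers > 1:
        messages.append("Cannot start multiple dbwrs when using I/O slaves")
        writers = 1                         # dbwr_io_slaves takes precedence
    if writers > db_block_lru_latches:
        messages.append("Cannot start more dbwrs than db_block_lru_latches")
        writers = db_block_lru_latches      # curtailed to the number of sets
    if dbwr_io_slaves > 0 and disk_async_io:
        # Not an Oracle message: illustrative wording for the caveat above.
        messages.append("dbwr_io_slaves should not be combined with disk_async_io=true")
    return writers, messages
```

For example, requesting four writers with only two LRU latches yields two effective writers and the second message.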
In order to enable efficient processing for the new DBWR, the following structures are initialized in each
writer process's PGA at startup (not an exhaustive list):
1. The number of dbwrs
2. max async writes
3. quota for high, medium, low priority writes
4. postme flag for LGWR
5. number of buffers in buffer cache
6. max IO size (max write size)
Structures/List
LRU-P          1 per set
LRU-W          2 per set
LRU-XR         2 per set
LRU-XO         2 per set
LRU            1 per set
Table 1.
Each list shown above will have sublists called the auxiliary write list (AUX) and a MAIN list. For
example, the LRU-P list will have a LRUP-AUX and a LRUP-MAIN list. The MAIN lists house buffers
that are in-use (pinned/dirty) or are candidates to be written out. When DBWR scans the lists in search of
write-able buffers, these eligible buffers (un-pinned dirtied) get moved from the MAIN list to the AUX.
Buffers on the AUX list are either waiting for DBWR to issue the write or already have a write issued
against them; i.e., the write pending flag is set. The primary purpose of the AUX list is to prevent DBWR
from scanning over buffers (from the MAIN list) that have writes already issued against them. This makes
the scanning process for DBWR much more efficient.
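The MAIN/AUX split can be sketched with two plain lists. The class and field names are hypothetical, and the model ignores latching and list ordering; it only shows why a second scan of MAIN does not revisit buffers that were already picked up.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Buf:
    name: str
    dirty: bool = False
    pinned: bool = False
    write_pending: bool = False


@dataclass
class WriteList:
    main: list = field(default_factory=list)   # in-use or write candidates
    aux: list = field(default_factory=list)    # waiting for, or under, a write

    def scan(self):
        """Move unpinned dirty buffers from MAIN to AUX; return those moved."""
        moved = [b for b in self.main if b.dirty and not b.pinned]
        self.main = [b for b in self.main if not (b.dirty and not b.pinned)]
        self.aux.extend(moved)
        return moved

    def issue_writes(self):
        """Set the write pending flag on AUX buffers with no write issued yet."""
        issued = [b for b in self.aux if not b.write_pending]
        for b in issued:
            b.write_pending = True
        return issued
```

After one scan, the eligible buffers sit on AUX, so a repeated scan of MAIN finds only the buffers that are still pinned.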
As shown in the table above, the LRU list (replacement list) and LRU-W list (dirty list) still exist, as in
pre-8i. The LRU/LRUW function similarly to pre-8i, but with a slight twist. As in pre-8i, all buffers
are initially hashed (at startup) to an LRU set (if using multiple sets), such that each set will house a nearly
equal number of buffers. However, in Oracle8i buffers are initially hashed (at startup) to the LRU-AUX,
a.k.a. the cold-reusable list. Foregrounds will always begin their scans from this LRU-AUX list, in search of
free buffers. As foregrounds allocate free buffers for CR, dirtying, etc., the buffers are moved (unlinked) from the
AUX list to the middle of the MAIN list. The buffers on the MAIN list will still move from head to tail [6].
DBWR will scan the LRU-MAIN list in search of dirty buffers to move to the LRU-W lists (specifically the
LRUW-MAIN list) [7]. When the write for a dirty buffer is issued, the buffer is moved from the LRUW-MAIN to
the LRUW-AUX. Thus, the next time DBWR issues writes from the LRUW-MAIN, it will not
encounter buffers that have already been written out or that have writes pending.
6
However the movement between head and tail is now based on the touch count. See sidebar discussion
on touch count.
7
All cold reusable buffers (clean) that were encountered during the scan will be moved to the LRU-AUX
list.
The LRU-XR, LRU-XO, LRU-P and the checkpoint queues, shown in table 1, function slightly differently
than the LRU and LRUW lists. These new lists are called write lists because buffers are linked onto them
due to a specific write action, such as a ping or a drop segment (table, index) request. Note, these buffers
were already on the LRU list and, due to a specific write request, were placed on their respective lists (such
as LRU-P or LRU-XR), and thus are candidates for immediate write-outs by DBWR. Moreover, these new
lists provide capabilities for write prioritization, based on the buffer write request.
Note, when a buffer is initially dirtied it is also placed on the thread/file checkpoint queue as well as the LRU-MAIN. This introduces the concept of incremental checkpointing and cold-dirty buffers.
Although the discussion of incremental checkpointing is outside the scope of this document, a brief illustration may be
necessary. With the advent of incremental checkpointing in Oracle8i, checkpointing becomes more frequent
and continuous, thereby enabling faster instance recovery and possibly precluding LGWR stuck situations.
Incremental checkpointing will determine the current position in the redo log and lag behind it by a
user-defined amount. Since instance recovery is usually limited to the gap between the last checkpoint and
the current RBA (synced RBA), incremental checkpointing essentially reduces this gap. In pre-8i, cold
dirty buffers were usually aged out (moved from the LRU queue to the LRUW queue and subsequently to
disk). With incremental checkpointing enabled, cold dirty buffers will most likely be written out as part of
the incremental checkpoint rather than being aged out. However, if the number of dirty buffers is below
db_block_buffer_max_dirty_target, then incremental checkpointing will not be enabled, and thus cold
dirty buffers will be aged out as usual.
Write Priority
In Oracle8i, write-able buffers can have the following write priorities: urgent/high, medium, and low.
Write-able buffers include ping, cold dirty, recovery and checkpoint buffers. Each write-able buffer is
linked on its appropriate list and marked with the corresponding write-type. For example, a ping buffer is
marked with a ping flag in its buffer header, indicating that the buffer needs to be written due to a cross-instance request (ping). Additionally, each write-type is associated with a write priority. Note, a buffer
will always correspond to only one write-type, and thus one write priority. This write-type marking and
priority level provides several benefits: it allows the different write-typed buffers to be easily identified
and managed, and, most importantly, it makes write prioritization possible by associating the write-type with
a priority level. See table 2 for a list of priority level cross-references.
Type of write             Priority level
Ping                      High/Urgent
Thread checkpoints        High/Urgent
                          High/Urgent
                          Medium
                          Medium
                          Medium
                          Low
Tablespace checkpoints    Low
Table 2
Write Quotas
To effectively allow all high priority writes to be written out, enough space must exist on the write queue
to accommodate these buffers. Therefore, a quota system has been
implemented in cache management. This quota system associates write priorities with an IO slot quota
limit. For example, due to the urgency of the write for pinged blocks, ping buffers are considered to be
high priority writes. Therefore, all ping buffers will be acquired and placed on the write queue before
lower priority writes, such as cold-dirty buffers, are placed.
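A minimal sketch of such a quota scheme follows. The priority names come from the discussion above, but the function, the quota values in the usage note, and the buffer labels are hypothetical; the actual quota sizes are not given in this document.

```python
def fill_write_queue(candidates: dict, quotas: dict) -> list:
    """Queue buffers in priority order, capped by each priority's IO slot quota.

    candidates maps a priority to its list of write-able buffers;
    quotas maps a priority to the maximum number of IO slots it may use.
    """
    queue = []
    for priority in ("high", "medium", "low"):
        take = candidates.get(priority, [])[: quotas.get(priority, 0)]
        queue.extend((priority, buf) for buf in take)
    return queue
```

With assumed quotas of 4 high, 2 medium and 2 low slots, two ping buffers are queued ahead of the cold-dirty buffers, and only two of three cold-dirty buffers are accepted in that pass.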
Interrupt action handler
As in pre-8i, DBWR write requests can still come from posts, messages, or timeouts. However, in Oracle8i,
the first interface to DBWR is the action handler. The interrupt action handler routine is invoked
each time DBWR is posted or messaged, or when the threshold number of outstanding I/Os has completed. The
function of the interrupt action handler is illustrated below.
1.
Write deferred buffers. As described earlier, these buffers are ones that could not be written
out because their corresponding redo had not been flushed to the redo logs. These buffers are
marked and moved to the deferred queue. For each buffer in this state, the count of deferred
buffers is incremented. While this count is non-zero, DBWR is in a post-me state, waiting for
LGWR to complete its write of the corresponding redo [8]. Note, DBWR is not sitting idle in
this state; DBWR will still service other queued requests. Once LGWR writes out some redo,
it posts DBWR with the high RBA written. This high RBA is compared to the sync-RBA of
the waiting buffer. If the high RBA is greater than the sync-RBA, then the redo is assumed
to have been written to disk and the buffer is now available to be flushed to disk. The
buffer is then unlinked from the deferred queue and linked onto the write queue. This
check is performed for each buffer in the deferred queue. Once all the deferred buffers are
queued, and the deferred count decrements to 0, the post-me flag is cleared.
2.
Post-processing written buffers. This involves clearing the being-written or checkpoint
flags from the buffer headers, unlinking the buffers from their respective queues and possibly
linking them onto the clean buffer queue (LRU-AUX queue).
3.
Assign quotas. DBWR will scan all lists to allocate the quota assignments by priority and
populate the kcbbwq array. This will allow the higher priority buffer writes to be
accommodated in the write queue.
4.
Inspect the task queue, which is broken down by write-type [9]. Process each write-type,
beginning with ping buffers and continuing through reuse buffer range buffers. Also, determine the
maximum number of writes that will be issued in each case.
5.
Scan and accumulate buffers for writing. This entails scanning each list for write-able
buffers, and queuing them onto the write queue. Buffers in this state are considered to be in
the QUEUED state. This scanning will loop until all the lists have been scanned. Note, the
number of buffers to be written, by write-type (ping, checkpoint, etc.), is regulated by the
write quota (kcbbwq).
6.
An internal routine moves buffers from their respective queues to a free IO slot; i.e., the write
queue. This routine determines whether any free slots are available and, if so, whether the
buffer (to be moved) is being modified or about to be modified. Buffers being modified cannot be
moved, and thus these buffers are skipped. If a buffer is available to be moved, DBWR will
pull (unlink) the buffer from its queue structure and place (link) it on the free
IO slot [10]. In order to prevent the moved buffer from being changed by foregrounds, a flag is
set in the buffer header [11]. If a moved buffer does not have its corresponding redo flushed by
LGWR, then this buffer is linked onto the IO slot deferred write queue, as stated in step 1.
7.
Write the buffers. After all the buffers have been queued up, a call is made to issue
an IO against the queued slots.
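The deferred-buffer check in step 1 can be sketched as follows. This is a simplified model under stated assumptions: an RBA is represented as a single integer so that it can be compared directly, and the names DeferredBuf and process_deferred are hypothetical.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class DeferredBuf:
    name: str
    sync_rba: int    # redo up to this address must be on disk before the write


def process_deferred(deferred: list, write_queue: list, high_rba_written: int) -> bool:
    """Move buffers whose redo is on disk from the deferred queue to the write queue.

    Returns the post-me flag: True while deferred buffers remain.
    """
    still_waiting = []
    for buf in deferred:
        if high_rba_written > buf.sync_rba:
            write_queue.append(buf)          # redo flushed: eligible for writing
        else:
            still_waiting.append(buf)        # keep waiting for LGWR
    deferred[:] = still_waiting
    return len(deferred) > 0
```

Each post from LGWR carries a new high RBA; the routine is called again with that value until the deferred queue is empty and the post-me flag clears.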
DBWR will continue to issue I/Os for the queued writes until it has reached the maximum number of
outstanding I/Os. Recall that DBWR polls the outstanding I/Os using the IO context. After all the
queued writes have been issued, DBWR will sleep. To minimize the number of context switches caused
when going from the sleep state to the runnable state, DBWR will only be awoken when a percentage of
outstanding I/Os finish [12]. Upon awakening, DBWR will cycle through the slots whose state flag is set
with the write attribute, to determine which I/Os have completed. The buffers that have been written out
to disk will be marked and subsequently require post-processing. As indicated earlier, post-processing
involves clearing bits.
Although the lifecycle of a particular buffer will vary depending on its write-type and on whether
incremental checkpointing is enabled, illustrated below is a possible scenario for an aged buffer
(assuming no incremental checkpointing).
Process flow for the write out of an aged buffer:
1. Buffer initially on LRU-AUX
2. Move over to LRU-MAIN
3. After dirtying (and unpinned), move over to LRUW-MAIN --DBWR--> LRUW-AUX
4. Move from queues to IO slots via kcbbxsv
5. Queue writes from IO slots via kcfqueuewr
6. Then finally write to disk via kcfdowr.
7. Poll for completion of the buffer IO using the context handler.
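The flow above can be written down as an ordered list of stages. The stage labels and the trace helper are illustrative; only the routine names kcbbxsv, kcfqueuewr and kcfdowr come from the steps above.

```python
# (stage, routine or actor that moves the buffer into this stage)
AGED_BUFFER_FLOW = [
    ("LRU-AUX", None),
    ("LRU-MAIN", None),
    ("LRUW-MAIN", None),
    ("LRUW-AUX", "DBWR"),
    ("IO slot", "kcbbxsv"),
    ("write queued", "kcfqueuewr"),
    ("written to disk", "kcfdowr"),
    ("completion polled", None),
]


def trace(start: str = "LRU-AUX") -> str:
    """Return the remaining stages of the flow, beginning at the given stage."""
    names = [stage for stage, _ in AGED_BUFFER_FLOW]
    return " -> ".join(names[names.index(start):])
```

Calling trace("IO slot") lists the stages from the IO slot onward, which corresponds to steps 4 through 7.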
[Figure: buffers from the LRU, LRU-W, LRU-P and LRU-X lists and from the thread, file and recovery checkpoint queues are moved into IO slots (free IO slots, medium priority write slots, low priority write slots), queued via kcfqueuewr, and written to disk via kcfdowr.]