There's an FAQ posted to the comp.databases.oracle newsgroup every month
or so. It is also available via anon ftp from rtfm.mit.edu, the home of
all FAQs.
---
That seems very low, we set ours to 128MB so that the DBA can make their
SGA's bigger to buffer more data. It all depends on the type of db you're
running, of course.
---
You can set SHMMAX to anything up to 2GB, it does not have any adverse
effect on performance. The general rule of thumb is that it should be
greater than your SGA and, sensibly, about 75% of your physical RAM.
We have Oracle 7.3.3 on 2.5.1 on an 8-way SPARC1000E, 756mb RAM. Here are
our /etc/system entries:
*** Set Shared Memory / Semaphores for Oracle
set semsys:seminfo_semmni=200
set semsys:seminfo_semmns=200
set semsys:seminfo_semmsl=20
set shmsys:shminfo_shmmax=49908864
set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=52
set shmsys:shminfo_shmseg=10
forceload: sys/msgsys
forceload: sys/shmsys
forceload: sys/semsys
HOWEVER, see attached file for what the experts say.
--------------------------------- Cut Here ---------------------------------
Optimizing and Measuring the Solaris Kernel For Large Oracle Servers.
by Mike Jaffee, Sun Microsystems
The first part of the paper will discuss the basics of Solaris Internals
that are relevant to the Oracle DBA along with tips to common technical
questions and relevant header files. The second part is quoted tuning
information taken from Sun Experts. The final part is a discussion of
kernel memory allocation, how to measure it, and some things that can be
done to prevent starvation.

Solaris Internals

Sparc has two rings of execution. The inner ring is for kernel functions
and the outer ring is for user process functions. The process address space
is virtual, and normally only part of a process is in physical memory. The
kernel stores the contents of the process address space in physical memory,
on-disk files, and specially reserved swap areas. Over time the kernel
shuffles pages of the processes between physical memory and disk. Each
process has registers that are stored in the kernel and are placed in the
hardware registers at run time. A process must block if it is waiting for a
resource and allow another process to run. The kernel allows each process a
brief period of time, usually 10 milliseconds, to run before performing a
context switch. (Vahalia p.20-25)

On startup, once the kernel is loaded, user processes can request system
services from the kernel through the system call interface. If the process
misbehaves by dividing by zero or overflowing its stack, a hardware
exception occurs, and the kernel intervenes, usually aborting the process.
Interrupts come from peripheral devices, usually indicating a status change
or I/O completion. Two important processes that manage memory are the
swapper and pagedaemon. (Vahalia p.22-25)

Each process has a virtual memory address space (VMA) that is translated to
physical memory addresses by page tables. This mapping is done by the
chip's MMU. (Tip - System panics can be either hardware or software
related. The MMU registers give helpful hints on what actually caused the
panic.) In addition to kernel and user mode, there is kernel and user
space. This refers to regions in the virtual memory address space of the
process. There is only one kernel and many processes, and hence every
process must map in a single kernel address space. The kernel portion of
the VMA maintains global data structures and some per process objects.
These can only be accessed by the kernel when the chip is running in kernel
mode (ring 0). Since the kernel is shared by all processes, kernel space
must be protected from user-mode access. This is done by requiring the
processes to use the system call interface. This requires the chip to go
into kernel mode, transfer program control to the kernel, have the kernel
execute system code instructions, then switch back to user mode and user
control of the process. (Vahalia p.22-23)

System Services

Oracle uses many Solaris system services such as file and record locking,
inter process communications, virtual memory, and process scheduling.
Common system calls are open, read, write, fcntl, kill, priocntl, plock,
memcntl, sync. Common signals are SIGSEGV - usually means user stack
overflow, SIGBUS - out of the process address space, SIGTERM - user has
"hung up" without exiting gracefully, SIGUSR1 - defined signal for
asynchronous events, SIGKILL - kill process immediately no exceptions.

Oracle uses file and record locking by setting read write locks on portions
of a file. Any process can read a file that is locked but only the owner of
the lock can update the file. A write lock is sometimes called an exclusive
lock and a read lock is sometimes called a shared lock.

Process scheduling is usually managed very well by the kernel, however a
slow job can be sped up by the priocntl system call. (System Services Guide
p.1-25) Jim Skeen of Sunsoft - "Oracle gets locked-down memory as a
consequence of using intimate shared memory (ISM), not through plock. It
controls sharing inside shared memory through latches, not memcntl or
plock." He also cautions against changing the priority of the Oracle
processes: "This is something we in DBE actually strongly discourage. Only
the most daring and knowledgeable DBA's should attempt this. The problem is
that system threads can get starved if Oracle processes are not "well
behaved" when running in real time class. Oracle processes may easily hog a
cpu for extended periods of time (time being measured in Unix quantums). We
in DBE have experimented with changing the dispatch table in useful/clever
ways, to minimize the number of involuntary context switches. But Oracle
processes still run in TS class." (private letter Skeen)
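
As an illustration of the fcntl record locking mentioned above (operating
system file locks on a byte range, not Oracle's own database locks), here is
a minimal Python sketch. The helper names update_record, read_record, and
demo are made up for this example, and the locks taken are the ordinary
advisory POSIX locks that fcntl provides by default.

```python
import fcntl
import os
import tempfile


def update_record(path, offset, data):
    """Write data at offset while holding an exclusive (write) lock on just those bytes."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX, len(data), offset, os.SEEK_SET)
        os.pwrite(fd, data, offset)
        fcntl.lockf(fd, fcntl.LOCK_UN, len(data), offset, os.SEEK_SET)
    finally:
        os.close(fd)


def read_record(path, offset, length):
    """Read length bytes at offset under a shared (read) lock."""
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.lockf(fd, fcntl.LOCK_SH, length, offset, os.SEEK_SET)
        data = os.pread(fd, length, offset)
        fcntl.lockf(fd, fcntl.LOCK_UN, length, offset, os.SEEK_SET)
        return data
    finally:
        os.close(fd)


def demo():
    """Create a 16-byte scratch file, update bytes 4..7, and read it back."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"." * 16)
        path = f.name
    try:
        update_record(path, 4, b"ABCD")
        return read_record(path, 0, 16)
    finally:
        os.unlink(path)
```

Only the locked byte range is serialized, so two writers working on
different records of the same file do not block each other.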

Oracle Internals and Solaris System Services

Mark Johnson of Oracle and Jim Skeen provide the following expert insight
and information. The system global area is defined as "One or more shared
segments visible to all Oracle processes that are used to store precompiled
SQL and PL/SQL (library cache), database buffers (buffer cache), and for
interprocess communication" (Johnson). As far as process control - "Oracle
does use semaphores, but latches are the usual synchronizing mechanism, as
mutexes implemented as spin locks" (Johnson). On the subject of locks
"Oracle maintains database transaction integrity through use of database
locks of various sorts--shared read, exclusive read, exclusive write, etc.
These are implemented through database locks, not using Unix file locks.
Thus, the scope of a database lock can be limited to a single row in the
database. Or, the database may choose to lock a database page (which may be
quite a bit smaller than a Unix page). Or, the database may choose to lock
an entire database table (which may be composed of multiple database files,
which in turn may or may not map into Unix files)." (private letter Skeen).

Oracle uses heavyweight processes that are in the shared memory portion of
the process address space. The DBWR (data buffer writer) process uses aio
threads known as light weight processes (LWP). An LWP is a kernel-supported
user thread that is based on kernel threads. They are independently
scheduled and share the address space of the process. Vahalia's book has a
nice discussion on LWPs. (Jaffee) Kernel Asynchronous I/O and Intimate
Shared Memory are two key technologies used by Oracle on the Solaris
platform.

Asynchronous I/O is needed because a single blocking thread in a
multi-threaded application causes all threads to wait until the thread
wakes up. What needs to happen is for the thread to issue an asynchronous
I/O request and then pass control to another thread in the process. Also
heavy I/O is not efficient when done synchronously because of the large
number of context switches that must occur every time a thread is blocked.
(Hyuck Yoo)

Asynchronous I/O under Solaris is implemented two ways - under Solaris 2.3
it is using the library and under Solaris 2.4 and beyond it is in the file
system layer of the kernel. The library approach uses kernel-level threads
where each I/O request is handled by a newly created kernel-level thread
that acts synchronously (i.e. issuing read and write calls). The library
lives outside of the kernel and the kernel threads that perform the I/O are
separate from the calling process. The kernel approach is much more
sophisticated and efficient. The basic concept is to not maintain the queue
in user space but to put the request directly into the device driver queue.
The biowait function is bypassed (which is the device driver equivalent to
a blocking function) and the thread transfers control rather than sleep in
the kernel. The kernel has buffers with slots called AIO that maintain a
listing of all I/O requests. (Hyuck Yoo)
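
The library-style approach described above can be sketched in Python: the
caller queues a request and carries on, while a worker thread performs an
ordinary blocking read. This is only an analogy, not the Solaris aio
library. The names aio_read and demo are invented here, and a thread pool
reuses its workers rather than creating a new thread per request as the
text describes.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def aio_read(pool, fd, length, offset):
    """Queue a read and return at once; a worker thread does the blocking pread."""
    return pool.submit(os.pread, fd, length, offset)


def demo(block=4, blocks=3):
    """Issue several reads without waiting, then collect the results in order."""
    with tempfile.TemporaryFile() as f:
        f.write(b"AAAABBBBCCCC")
        f.flush()
        with ThreadPoolExecutor(max_workers=blocks) as pool:
            pending = [aio_read(pool, f.fileno(), block, i * block)
                       for i in range(blocks)]
            # The caller could do other useful work here instead of sleeping.
            return [request.result() for request in pending]
```

The kernel approach in the text removes the helper threads entirely by
placing the request straight onto the device driver queue.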
Solaris has provided the ISM feature since 2.2. The main feature of ISM is
that, in addition to sharing the "memory" pages (like the normal shared
memory), it also shares the page table entries for those pages (therefore,
it's "intimate"). Another side feature, which is more important for this
discussion, is that ISM also locks down the shared memory segment in real
physical RAM. Since the main purpose of ISM is for the DBMS products'
buffer cache usage, this makes sense. (Jaffee)

Sharing page table entries solves the problem of page table stealing, which
is expensive because all the pages mapped in the stolen page table have to
be flushed before being given to another process. This avoids the condition
where the whole system may thrash as processes steal page tables from each
other. (H. Yoo)

The design team created a new segment in the process address space called
segshm so that they could create one set of page tables for a shared memory
segment and share the page tables among the processes that attach that same
shared memory. In addition to saving page table allocation, sharing page
tables has other advantages such as having a higher cache hit rate on
memory map lookups because the tables are in a buffer cache rather than in
memory. It also avoids the amount of overhead done by the hardware address
translation layer since it no longer needs to go through page tables for
every process to monitor whether a page has been modified. These are both
huge savings and speed up the virtual memory paging algorithm within
Solaris. (H. Yoo)

IPC

The Oracle RDBMS is a complex program that uses multiple cooperating
processes that must communicate with each other and share resources. The
kernel provides a mechanism in user space called inter process
communication or IPC. The processes operate in a shared memory segment such
that if one process modifies data it will be immediately visible to the
other processes. Data transfer and event notifications occur between the
various Oracle processes in the Oracle SGA. Semaphores are used for
Oracle's own locking and synchronization scheme. Asynchronous events such
as errors are reported to the processes using signals. The default action
for most signals from the kernel is to terminate the process, however the
process may specify an alternate response by providing a signal handler
function. (Tip - Before installing the kernel jumbo patch read the readme
file to see if there are any known signal problems with Oracle). (Vahalia -
p150) The relevant IPC system calls Oracle makes are shmget, semget, shmat,
shmdt, shmctl, and semctl. The ipc information is stored in the kernel with
the ipc_perm structure.
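
The signal handler mechanism described above can be sketched in Python. The
default action for SIGUSR1 is to terminate the process; installing a
handler replaces that with an alternate response. The names received,
on_usr1, and demo are illustrative only, and the sketch must run in the
main thread, where Python allows handlers to be installed.

```python
import os
import signal
import time

received = []  # signal numbers seen by the handler


def on_usr1(signum, frame):
    """Alternate response: record the event instead of terminating."""
    received.append(signum)


def demo(timeout=2.0):
    """Install the handler, send SIGUSR1 to this process, and report delivery."""
    previous = signal.signal(signal.SIGUSR1, on_usr1)
    try:
        os.kill(os.getpid(), signal.SIGUSR1)
        deadline = time.monotonic() + timeout
        while not received and time.monotonic() < deadline:
            time.sleep(0.01)  # give the interpreter a chance to run the handler
    finally:
        signal.signal(signal.SIGUSR1, previous)  # restore the earlier disposition
    return bool(received) and received[-1] == signal.SIGUSR1
```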
shmget(key, size, flag) creates a portion of shared memory (which will be
the size of the Oracle SGA) and shmat(shmid, shmaddr, shmflag) attaches the
region to a virtual memory address of the process. (shmsys is how Oracle
sets up the intimate shared memory segment). The structure of a shared
memory segment includes access permission, segment size, the PID of the
process performing the last operation, and the memory map segment
descriptor pointer as well as other fields. (Tip - sgabeg in the ksms.s
file is a virtual address not a physical address (0-0xffffffff = 2 GB).
Choose small beginning addresses for large SGAs. Also watch out for 28 bit
Sparc chips. They have a smaller virtual address space. Hal Stern notes
"They're really not 28 bit chips, but instead the system architecture only
passes 28 bits of virtual address space on to the memory bus." [private
letter]) Once attached the region may be accessed like any other memory
location without requiring system calls to read or write data to it. Hence
shared memory is the fastest mechanism for processes to share data.

(Tip - don't be confused by the SZ field in ps -elf. It is in 4 KB pages
and represents shared memory in the case of Oracle. For example Oracle may
have 60 server processes in a shared memory segment, all approximately
25000 4 KB pages. A common misconception is to think that Oracle needs
60 X 4KB X 25000 = 6 GB of virtual memory. Those 60 processes are mainly
using the shared memory region in the process address space).
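
The arithmetic behind that misconception can be written out as a short
Python sketch. The 60 processes, 25000 pages, and 4 KB page size come from
the tip above; the private_pages figure is a made-up illustration of how
much of each process is not shared, not a measured Oracle value.

```python
PAGE_KB = 4  # ps -elf reports SZ in 4 KB pages, per the tip above


def naive_total_kb(processes, sz_pages):
    """The misconception: multiply SZ by the number of processes."""
    return processes * sz_pages * PAGE_KB


def shared_total_kb(processes, sz_pages, private_pages):
    """Count the shared segment once, plus each process's private pages."""
    shared_pages = sz_pages - private_pages
    return (shared_pages + processes * private_pages) * PAGE_KB


naive = naive_total_kb(60, 25000)                         # 6,000,000 KB, about 6 GB
realistic = shared_total_kb(60, 25000, private_pages=500)  # hypothetical private size
```

With 500 private pages per process the second figure is 218,000 KB, a small
fraction of the naive 6 GB total.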
(Tip - shared memory pages are backed by swap space, not by a file. The
absolute minimum swap must be at least the size of the SGA.) A process
detaches the shared memory with shmdt(addr) and destroys the shared memory
region completely with the IPC_RMID command of the shmctl system call.
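
The create, attach, use, detach, destroy life cycle can be sketched in
Python. Note the substitution: the standard library has no System V
shmget/shmat binding, so this uses multiprocessing.shared_memory, which is
a POSIX-style shared memory interface. The comments map each step to the
System V call it stands in for, and both handles live in one process purely
to keep the example short.

```python
from multiprocessing import shared_memory


def demo():
    """Write through one mapping of a segment and read through another."""
    # Create a 64-byte segment (the role shmget plays in the text).
    creator = shared_memory.SharedMemory(create=True, size=64)
    try:
        # Attach a second handle to the same segment by name (the role of shmat).
        attached = shared_memory.SharedMemory(name=creator.name)
        try:
            creator.buf[0:5] = b"hello"       # plain memory write, no read()/write() call
            seen = bytes(attached.buf[0:5])   # immediately visible through the other mapping
        finally:
            attached.close()                  # detach (the role of shmdt)
    finally:
        creator.close()
        creator.unlink()                      # destroy the segment (the role of IPC_RMID)
    return seen
```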
(Tip - the important commands are ipcs -b; look at field SEGSZ for shared
memory size in use; sysdef -i and sysdef -i -n /dev/ksyms for IPC and
resource table definitions; kill -9 <process id> to terminate (no core
file) a hung process or kill -6 <process id> to abort (core file) a hung
Oracle process. modload -p sys/shmsys at the command line or forceload:
sys/shmsys in the system file may be needed if ipcs -b doesn't work
correctly.) This is because the kernel is dynamic, meaning that file
systems, drivers, and modules are loaded into memory when they are used,
and the memory is returned if the module is no longer needed. (Vahalia -
p155-158, p62-64)

Semaphores are counters that are used by Oracle to monitor and control the
availability of shared memory segments. Typically the process initializes
the semaphore with semget, assigns ownership of the semaphore with semctl,
and then updates the semaphore with semop. A process has to block until the
semaphore operation has reached zero. A semaphore structure contains the
following information - semaphore value, the PID of the process that last
performed successfully, the number of processes waiting for the semaphore
to increase, and the number of processes waiting for the semaphore to reach
zero. (Tip - ipc_perm and sem in ipc.h, sem.h) (System Services Guide -
p68-77).

Shared Memory and Semaphore Tunables in Solaris 2 relevant to Oracle.
(Tip - semmnu = semmns = semmsl X semmni). There is no harm in setting the
numbers too high since the Oracle instance will only allocate semaphores
and shared memory as needed. The values are definitions not declarations.
Name    Default  Min      Max            Reference                     Suggested
----    -------  ---      ---            ---------                     ---------
shmmax  1048576  1048576  Available RAM  Maximum shm segment size      50% of RAM
                                         in bytes
shmmin  1        1        -              Minimum shm segment size      1
                                         in bytes
shmmni  100      100      -              Number of shm id to           100
                                         pre-allocate
shmseg  6        6        -              Maximum number shm seg        32
                                         per process
semmni  10       10       65535          Number of semaphore           64
                                         identifiers
semmns  60       -        -              Number of semaphores          1600
                                         in system
semmnu  30       -        -              Number of undo structures     250
                                         in sys
semmsl  25       -        -              Maximum number of             25 (fixed)
                                         semaphores per ID
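
The tip semmnu = semmns = semmsl X semmni can be turned into a small Python
sketch that derives the dependent values and prints them in /etc/system
syntax. The function names are invented for this example, and the sketch
follows the tip literally, so semmnu comes out equal to semmns rather than
the separate suggested figure in the table.

```python
def semaphore_settings(semmsl, semmni):
    """Derive semmns and semmnu from semmsl and semmni, following the tip."""
    semmns = semmsl * semmni
    return {"semmni": semmni, "semmsl": semmsl, "semmns": semmns, "semmnu": semmns}


def etc_system_lines(settings):
    """Format the settings as 'set semsys:seminfo_<name>=<value>' lines."""
    return [f"set semsys:seminfo_{name}={value}"
            for name, value in sorted(settings.items())]


# Suggested values from the table: semmsl = 25, semmni = 64.
lines = etc_system_lines(semaphore_settings(25, 64))
```

With the table's suggested semmsl of 25 and semmni of 64 the product is
1600, which matches the suggested semmns.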

Solaris Tuning According to the Experts

Every month in SunWorld Online, the performance experts at Sun write
articles on tuning. In addition to the well known book, "Sun Performance
and Tuning", Adrian Cockcroft with the help of Rich Pettit has put together
a series of scripts called se2.5
(www.sun.com/960301/columns/adrian/se2.5.html). Hal Stern, another well
known Sun tuning guru, has written an O'Reilly press book on "Managing NFS
& NIS" and he too writes articles that can be downloaded off of the web.
Fellow SunService Engineers Chris Drake and Kimberley Woods wrote "Panic -
System Core dump Analysis" which contains detailed information on the
Solaris kernel and common techniques used to analyze core files. Brian
Wong, the hardware expert, has written a book called "Configuration and
Capacity Planning of Large Sun Servers". Most of the tuning information for
large Sun Servers running Oracle can be found in these sources. Since many
customers often call SunService for further explanations, it is appropriate
to highlight some common questions and answer them as the experts would.

Question 1 - Where is all my Memory?

Probably the most common performance question of all is "Why does vmstat
report only xxxx amount of free memory available?" To use an example, type
vmstat 5 and suppose the system shows freemem of 80708 and available swap
is 330000. Now start the application and observe that the freemem goes down
to 8824 and swap goes to 300000. Now stop the application and observe that
all of the available swap returns to 330000 but the freemem returns only to
2260. Where then is all of the ram? Do we have a memory leak? The answer is
probably no because as Cockcroft notes "(the app) starts up more quickly
than it did the first time, and with less disk activity. The application
code and its data files are still in memory, even though they are not
active. The memory they occupy is not "free." If you restart the same
application it finds the pages that are already in memory. The pages are
attached to the inode cache entries for the files. If you start a different
application, and there is insufficient free memory, the kernel will scan
for pages that have not been touched for a long time, and "free" them. Once
you quit the first application, the memory it occupies is not being
touched, so it will be freed quickly for use by other applications."
(Cockcroft 1) Leaving parts of the app in memory even after termination is
efficient because "Attaching to a page in memory is around 1,000 times
faster than reading it in from disk." (Cockcroft 1) So how can one know if
an application has a memory leak? The answer is there will be a shortage of
swap space after the program runs a while and the SZ field in ps -elf for
that app will grow over time.
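
The leak test described above (SZ keeps growing over time) can be sketched
as a check over a series of SZ readings taken at intervals. This is a
minimal Python sketch; the function name, the three-sample minimum, and the
min_growth_pages threshold are arbitrary choices for illustration, and the
sample numbers in the usage lines are invented.

```python
def looks_like_leak(sz_samples, min_growth_pages=1):
    """Return True when every SZ sample is larger than the one before it."""
    if len(sz_samples) < 3:
        return False  # too few readings to call it a trend
    return all(later - earlier >= min_growth_pages
               for earlier, later in zip(sz_samples, sz_samples[1:]))


growing = looks_like_leak([1000, 1200, 1500, 1900])  # steadily growing SZ
steady = looks_like_leak([1000, 1400, 1400, 1390])   # grows once, then levels off
```

A process that grows at startup and then levels off is the normal case; only
growth that never stops, together with shrinking free swap, points at a
leak.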

Question 2 - My Oracle Server is slow. Can you help me tune the kernel?

The answer depends on the version of the operating system and the level of
the patches. Early versions of the os had performance bugs and incompatible
hardware that were the cause of slow performance. The latest version of the
os is self-tuning for high performance and will work quite successfully on
systems ranging from a huge SparcCenter 2000 to small desktops. As
Cockcroft says "In normal use there is no need to tune the Solaris 2
kernel, since it dynamically adapts itself to the given hardware
configuration and application workload." (Cockcroft 2) However for really
large Oracle servers some tuning may be needed if using early versions of
Solaris 2.3, 2.4 and 2.5 without a kernel patch that automatically adjusts
the paging algorithm. Solaris 2.5.1 is self tuning for large memory
systems. Paul Faramelli of the kernel TSE group has put together the
following list of tunables for Solaris. Recommendations for large Oracle
servers (Ram > 1 GB) are listed. (Tip - Use crash to display kernel
tunables. As root type crash. At the greater than prompt, type "od -d
maxusers" or "od -d lotsfree". The od stands for octal dump, and the -d
stands for decimal. By the way every Solaris tunable [even undocumented
ones] can be displayed by typing nm /kernel/unix). Note these
recommendations are only necessary for early versions of Solaris. Some of
the recommendations are provided by Steve O'Neil of SunService. (Caution -
there is no right answer)
Parameter        Description                                               Recommended
---------        -----------                                               -----------
dump_cnt         Size of the dump
autoup           Used in struct var for dynamic configuration of the age   300
                 that a delayed-write buffer must be, in seconds, before
                 bdflush will write it out (default = 60)
bufhwm           Used in struct var for v_bufhwm; it's the high water      8000
                 mark for buffer cache memory usage, in Kbytes (2% of
                 memory).
maxusers         Maximum number of users (In 2.3 and 2.4 the default is
                 number of Megabytes in memory)
max_nprocs       Maximum number of processes (10 + 16 * maxusers)
maxuprc          The maximum number of user processes. (max_nprocs - 5)
rstchown         POSIX_CHOWN_RESTRICTED is enabled (default = 1)
ngroups_max      Maximum number of supplementary groups per user (def 32).
rlim_fd_cur      Maximum number of open file descriptors per process
                 system wide (default = 64, max = 1024)
ncallout         Number of callout buffers (default = 16 + max_nprocs).
                 (No longer exists in Solaris 2.2 and later releases)
nautopush        Number of entries in the autopush free list               1024
sadcnt           Number allowed of concurrent opens of both /dev/sad/user  2048
                 and /dev/sad/admin (default 16).
npty             Number of 4.X pseudo-ttys configured (default 48)         1024
pt_cnt           Number of 5.X pseudo-ttys configured (default 48)         1024
physmem          Sets the number of pages usable in physical memory. Only
                 use this for testing, it reduces the size of memory.
minfree          Memory threshold which determines when to start swapping  100
                 processes, when free memory falls to this level swapping
                 begins (default: 2.4 - 4d = 50 pages, all others 25
                 pages, 2.3 - physmem / 64).
desfree          This is the "desperation" level, this determines when     200
                 paging is abandoned for swapping. When free memory stays
                 below this level for 30 seconds, swapping kicks in (2.4
                 4d = 100 pages, all others 50 pages, 2.3 physmem / 32).
lotsfree         Memory threshold which determines when to start paging.   512
                 When free memory falls below this level paging begins
                 (2.4 4d = 256 pages all others 128 pages, 2.3
                 physmem / 16)
fastscan         The number of pages scanned per second when free memory
                 is zero, the scan rate increases as free memory falls
                 from lotsfree to zero, reaching fastscan (default: 2.4
                 physmem / 4 with 64Mb being max, 2.3 physmem / 2).
slowscan         The number of pages scanned per second when free memory
                 is equal to lotsfree, also see fastscan (defaults: 2.4
                 is fixed at 100, 2.3 fastscan / 10).
handspreadpages  Is the distance between the front hand and backhand in
                 the clock algorithm. The larger the number the longer an
                 idle page can stay in memory (default: 2.4 physmem / 4,
                 2.3 physmem / 2).
maxpgio          The maximum number of page-out I/O operations per second. 20
                 This acts as a throttle for the page daemon to prevent
                 page thrashing ((DISKRPM * 2) / 3 = 40). This parameter
                 must be set higher if using two swap partitions.
t_gpgslo         2.1 through 2.3, Used to set the threshold on when to
                 swap out processes (default 25 pages).
ufs_ninode       Maximum number of inodes.                                 34906
                 (max_nprocs + 16 + maxusers + 64)
ndquot           Number of disk quota structures. (default = (maxusers *
                 NMOUNT / 4) + max_nprocs)
ncsize           Number of dnlc entries. (default = max_nprocs + 16 +      34906
                 maxusers + 64); dnlc is the directory-name lookup cache

Cockcroft on maxusers

"I never set maxusers. It sizes itself based on the amount of RAM in the
system. In some cases on configurations with gigabytes of RAM it needs to
be reduced to avoid problems with lack of kernel address space. The kernel
uses up a lot of space keeping track of all the RAM in a system. Several
other kernel table sizes and limits are derived from maxusers."
(Cockcroft 2)

Cockcroft on ncsize

"The directory name lookup cache (DNLC) is sized to a default value based
on maxusers. A large cache size (ncsize) significantly helps NFS servers
that have a lot of clients. On other systems the default is adequate."
(Cockcroft 2)

Question 3: How much swap is needed for a large Oracle database?

Many people are under the impression that very little swap is needed for
Oracle because the architecture uses temporary tablespaces for sorting and
the SGA is fixed in memory. Well the truth is large databases require a lot
of swap. The shared memory segment is backed by swap so the allocated swap
MUST be at least as large as the shared memory segments. In addition when
the database uses intimate shared memory this is also backed by swap. All
of the Oracle processes must be partially backed by swap. Steve
Schuettinger, the Oracle applications specialist at Sun, recommends at
least 2 GB of swap for benchmark testing on large servers. Obviously since
RAM plus swap equals virtual memory, once swap is gone, the program will
halt and no new apps can be started until other programs have stopped. As
Adrian Cockcroft says "The important thing to realize about swap space is
that it is the combined total size of every program running and dormant on
the system that matters. When a system runs out of swap space it can be
very difficult to recover. Sometimes you find that there is insufficient
swap space left to login as root or run the commands needed to kill the
errant process that is consuming all the swap space." (Cockcroft 3) In
theory Solaris 2 changes the rules by adding the RAM and the disk space so
if the system has enough RAM for the workload, "it can run with no swap
disk. In practice common database applications that are sized to run in a
few gigabytes of RAM will actually need many gigabytes of disk allocated as
swap space." (Cockcroft 3) In the same article Cockcroft says "The
consequences of running out of swap space affect a larger number of users
on a big server, so it is wise to allocate a lot more than you normally
need to cope with any usage peaks. To start with, add twice as much disk as
you have RAM." (Cockcroft 3) (Tip - It is not worth making a striped
metadevice to swap on - that would just add overhead and slow it down.
There is also a limit of 2 gigabytes on the size of each swap partition, so
striping disks together tends to make them too big.)
/usr/ucb/ps alx, fields SZ or SIZE, /usr/proc/bin/pmap
% /usr/ucb/ps alx
 F  UID PID PPID CP PRI NI  SZ RSS WCHAN    S TT    TIME COMMAND
 8 2595  33   30  0  48 20 988 360 modlinka S pts/4 0:00 -bin/csh
There is confusion between what ps reports. "The /bin/ps prints a field
labelled SZ, but this is the resident set size in RAM -- printed as RSS by
the /usr/ucb/ps. You need to use the SZ or SIZE field reported by
/usr/ucb/ps alx in units of kilobytes to determine the amount of swap space
used by the process." (Cockcroft 3)
Oracle's Mark Johnson adds the following: "I had thought the standard
Oracle rule of thumb was 2 to 4 times physical memory (can be a bit less on
very large memory systems). Smaller memory systems may want to use higher
ratios of SGA size to physical memory size and higher swap space ratios.
(I ended up using ratios of 1:1 and 1:4 for a very small Solaris for Intel
system with surprisingly good results.)"
Hal Stern says "So why do you need swap space if your SGA << phys mem? The
short answer is that the "phys mem" in that calculation is the
non-locked-down physical memory, and when you allocate an oracle SGA, you
allocate intimate shared memory (ISM) that is taken out of the physical
memory pool (ie, it gets locked down). so on a 1Gbyte machine, you may
think you're ok with a 256M SGA, leaving 700M+ for processes. BUT: the 256M
SGA gets taken out of the available memory pool, so your maximum VM is only
700M+, and you could probably use the swap space....as the SGA/memory ratio
goes up, this is even more true." (private letter from Stern)
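
The sizing rules quoted in this section can be collected into a short
Python sketch: ISM locks the SGA out of the pageable pool (Stern), swap
must be at least the size of the SGA, and Cockcroft's starting point is
twice RAM. The function names are invented, and combining the two swap
rules with max() is an interpretation made for this example rather than
something either author states.

```python
def pageable_memory_mb(ram_mb, sga_mb):
    """Memory left for processes once ISM locks the SGA down."""
    return ram_mb - sga_mb


def minimum_swap_mb(sga_mb):
    """Shared memory is backed by swap, so swap must at least cover the SGA."""
    return sga_mb


def starting_swap_mb(ram_mb, sga_mb):
    """Cockcroft's starting point of twice RAM, never below the SGA minimum."""
    return max(2 * ram_mb, minimum_swap_mb(sga_mb))


# Stern's example: a 1 GB machine with a 256 MB SGA.
left_for_processes = pageable_memory_mb(1024, 256)   # the "700M+" in the quote
swap_to_start_with = starting_swap_mb(1024, 256)
```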

Question 4 - Will a faster cpu help performance?

The answer is not easy. As Hal Stern noted "Noticing that you're using 20
percent of the CPU doesn't mean anything until you know the kind of work
that's using the cycles. If you're CPU-bound, then you have headroom to
increase the workload by a factor of four or five. An I/O-bound job,
however, that uses 20 percent of the CPU might be improved by adding disk
spindles. As you increase the disk count and I/O load, to ease the
bottleneck, you'll use more CPU to deal with the I/O setup, system calls,
and interrupts from the additional work. You run the risk of morphing a
disk problem into a CPU shortage. How do you know when relaxing one
constraint pops another one into the foreground? Define the right
relationships -- CPU time used per disk I/O tells you how much system time
you eat up as you add disk load -- and measure with your tailored
yardstick." (Stern)
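
Stern's yardstick, CPU time used per disk I/O, can be sketched in Python.
The function names and the sample figures (one CPU, 400 I/Os per second)
are invented for illustration, and the sketch assumes that all of the
measured CPU time is attributable to I/O handling and that it scales
linearly with I/O rate, which a real workload may not do.

```python
def cpu_seconds_per_io(cpu_busy_fraction, ncpus, ios_per_second):
    """CPU seconds consumed for each disk I/O at the measured load."""
    return cpu_busy_fraction * ncpus / ios_per_second


def projected_cpu_fraction(per_io, ncpus, ios_per_second):
    """CPU utilization expected if the I/O rate rises to ios_per_second."""
    return per_io * ios_per_second / ncpus


# Measured: 20 percent of one CPU busy while doing 400 I/Os per second.
cost = cpu_seconds_per_io(0.20, ncpus=1, ios_per_second=400)
# Projected: add spindles until the system sustains 1600 I/Os per second.
after_more_disks = projected_cpu_fraction(cost, ncpus=1, ios_per_second=1600)
```

In this made-up case quadrupling the I/O rate pushes the CPU to about 80
percent busy, which is the disk problem turning into a CPU shortage that
the quote warns about.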

Preventing Kernel Memory Starvation

When Oracle is working very hard and the operating system is Solaris 2.3 or
early Solaris 2.4, it is possible to have kernel memory allocation faults
that can eventually lead to kernel memory starvation. A new memory
allocator algorithm has been developed and integrated into Solaris 2.5.1
(the old allocator had paging thresholds that were too low, causing kernel
memory allocation failures on very large systems). The allocator has been
back ported to rev 40 of the Solaris 2.4 jumbo patch and to a future rev of
the 2.5 jumbo patch. No fix has yet been developed for Solaris 2.3. (Tip -
large database users should upgrade to Solaris 2.4 or better). In the past
Oracle customers could manually adjust paging thresholds. The actual value
that needed to be set was proportional and depended upon the amount of
memory and the number of cpus on the system. Also in some cases decreasing
maxusers and bufhwm would mitigate the problem. The total allowable size
for the kernel on the ultrasparc servers running 2.5 is now so large that
kernel memory allocation problems on very large systems are virtually
impossible. See examples below. The crash output displaying kernel memory
starvation is taken from a SparcServer 1000 running Solaris 2.3 with 1 GB
of ram and 8 cpus.
Kernel memory limits
Solaris 2.4:       Solaris 2.5:
sun4c  33MB        sun4c  33MB
sun4m  6MB         sun4m  100MB
sun4d  39MB        sun4d  25MB
                   sun4u  2525MB
$> kas crash -
> map kernelmap
FREE: 2042    WANT: 1    SIZE: 2042
SIZE    ADDRESS
TOTAL NUMBER OF SEGMENTS 0    TOTAL SIZE 0
> kmastat
                     total bytes   total bytes
size      # pools    in pools      allocated     # failures
-----------------------------------------------------------------
small     6807       2638880       25677584      98995
big       2652       75276288      73046528      0
outsize   -          -             857264        4535
Crash is a very powerful tool that helps analyze kernel memory allocation
failures. In the output, "TOTAL SIZE 0" indicates that no more free kernel
memory exists. The FREE field (2042) indicates that there is still plenty
of memory in the user portion of the virtual address space. Carl of Sunsoft
provides an explanation of kernel map scarcity under Solaris 2.3 and
Solaris 2.4. "In the overwhelming majority of cases on large database
servers, we have found that 64MB is overly generous for bufhwm in that it
can be cut back by one-half (to 32MB) without too much of an impact on the
cache hit ratio. What is usually in short supply on these machines is not
the buffer cache but the amount of kernel heap (mapped by kernelmap) that
remains for non-buffer cache usage. Limiting buffer cache growth to 32MB
frees up an additional 32MB to the heap and has proven successful in
avoiding kernelmap scarcity at a number of sites running large database
applications. Kernelmap scarcity (or equivalently kernel heap scarcity as
the size of the kernel heap is limited by the size of the address space the
kernelmap can map) results in an extreme slowdown of processing in the
systems. All of a sudden kernelmap becomes a scarce resource that every
thread contends for and to exacerbate the situation the rate of release is
slowed by the very same contention to the point that kernelmap turnover
grinds down almost to the point of deadlock. Why 64MB's worth of kernelmap
is inadequate for the largest database servers is unknown. The sites on
which this has been a problem have been checked for kernelmap leakage and
none has been found. There has also been a problem in the past with some
kernel data structures being pre allocated from the heap and the size of
this pre allocation being inappropriately scaled to physical memory. As it
is fairly common now for machines to be equipped with 3GB of physical
memory, this was not the right thing to do and did account for some
kernelmap depletion headaches. But this particular bug has been fixed. With
these two things discounted, the only conclusion is that modern database
workloads are driving up peak transient demands for kernelmap to the 100MB
level." (Tip - For large databases running Solaris 2.4 or less set bufhwm
to 8000 on 4c, 4m, and 4d or upgrade to Solaris 2.5 which has a large
kernel map address space.)

Acknowledgements

I want to thank Sun performance gurus Adrian Cockcroft and Hal Stern for
their contributions to this paper. UNIX architect Mark Johnson of Oracle
and database expert Jim Skeen of Sunsoft provided comments on Oracle
internals. Kernel architect Jeff Bonwick has added explanations and
suggestions regarding kernel memory allocation and kernel memory
starvation. SunService kernel engineer Paul Faramelli documented the
Solaris tuning parameters and SunService Technical Expert Steve O'Neil
provided recommendations for tuning large Oracle databases on versions of
Solaris that are not self tuning. Finally I want to thank Uresh Vahalia who
gave me permission to quote at length from his wonderful book "UNIX
Internals - The New Frontiers".

Disclaimer

The author alone is responsible for the contents of this paper. No one at
Sun Microsystems, Sunsoft, SunService, or the Oracle Corporation has
reviewed or approved the paper for completeness or accuracy in its
published format and nothing in the paper can be construed as the official
policy of Sun Microsystems or the Oracle Corporation.

References

UNIX Internals - The New Frontiers by Uresh Vahalia, Prentice Hall 1996
"How the Solaris Kernel is Optimized for Oracle" by Mike Jaffee 1996
"Shared Page Table: Virtual Memory Enhancement for Data Sharing in UNIX"
by H. Yoo
"Comparative analysis of Asynchronous I/O in Multithreaded UNIX" by Hyuck
Yoo
"Help! I've lost my memory!" by Adrian Cockcroft, SunWorldOnline 1995 (1)
"What are the tunable kernel parameters for Solaris 2?" by Adrian
Cockcroft (2)
"How does swap space work?" by Adrian Cockcroft, SunWorldOnline 1995 (3)
"We suggest creative ways to better your system performance" by Hal Stern
System Service Guide - Solaris 2.4 Manual, SunSoft, 1994
"The Slab Allocator: An Object-Caching Kernel Memory Allocator" by Jeff
Bonwick

Ju
Julienlim@rocketmail.com